Judgemark V2 - Log Probability Evaluation

A fork of Judgemark that evaluates whether log probability-based scoring methods outperform traditional text-based judgment approaches fo## Interpreting th## Contributing

Overview

This project compares different methods of extracting scores from language models:

Raw: Traditional text-based scoring (baseline Judgemark method)
Weighted: Log probability weighting using normalized choice probabilities (similar to G-Eval)
Ranked: Ranking-based evaluation using Kendall's tau correlation between log probability rankings and score rankings

Results

Method	Score	Score (Normalized)
ranked_scaled	0.629332	0.80
ranked_norm	0.654562	0.74
weighted	0.634804	0.65
raw	0.634528	0.65
weighted_norm	0.623806	0.64
ranked	0.336333	0.28

Results for DeepSeek Chat V3 0324

The ranking approach performs best, particularly when scaled. This approach treats LLM log probabilities as rankings rather than true probabilities, which aligns better with how sampling methods like greedy and top-k actually work.

Methodology: Ranking Approach

Instead of treating log probabilities as probabilities, the ranking method:

Extracts the full distribution of log probabilities for all possible choices (0-10):

{
  "0": -1.2,
  "1": -0.5,
  "2": -0.3,
  "3": -0.1,
  "4": -0.05,
  "5": -0.02,
  "6": -0.01,
  "7": -0.005,
  "8": -0.002,
  "9": -0.001
}

Ranks choices by their log probabilities
Uses Kendall's tau to measure correlation between log probability rankings and expected score rankings
Achieves high efficiency by extracting complete score distributions from a single token

Setup & Installation

Clone the repository:

git clone https://github.com/wassname/Judgemark-v2lp.git
cd Judgemark-v2

Install Python dependencies (requires Python 3.9+):

uv sync
source ./venv/bin/activate  # Activate the virtual environment

Set up environment variables for API credentials:

export OPENAI_API_KEY="sk-..."
export OPENAI_API_URL="https://openrouter.ai/api/v1/chat/completions"

Usage

Run the benchmark with the main script:

# test
uv run python judgemark_v2.py \
  --judge-model "meta-llama/llama-3.2-3b-instruct" \
  --samples-file data/judgemark_v2.1_samples.json \
  --prompts-file data/judge_prompts.json \
  --runs-file outputs/my_judgemark_runs.json \
  --threads 1 \
  --num-runs 1 \
  --save-raw-judge-output

uv run python judgemark_v2.py \
  --judge-model "deepseek/deepseek-chat-v3-0324" \
  --samples-file data/judgemark_v2.1_samples.json \
  --prompts-file data/judge_prompts.json \
  --runs-file outputs/my_judgemark_runs2.json \
  --num-runs 1 \
  --save-raw-judge-output

See original repo https://github.com/EQ-bench/Judgemark-v2 for citation, license, and more details.

3.1 KiB Raw Blame History