A fork of Judgemark that evaluates whether log probability-based scoring methods outperform traditional text-based judgment when extracting scores from language model judges.

Overview

This project compares different methods of extracting scores from language models:

Raw: Traditional text-based scoring (baseline Judgemark method)
Weighted: Log probability weighting using normalized choice probabilities (similar to G-Eval)
Ranked: Ranking-based evaluation using Kendall's tau correlation between log probability rankings and score rankings

Results

Method	Score	Score (Normalized)
ranked	0.77	0.817
ranked_scaled	0.77	0.817
raw (baseline)	0.678	0.731
weighted	0.668	0.716
ranked_norm	0.647	0.658
weighted_norm	0.562	0.551

name	judgemark_score_calib	stability_calib	separability_calib	human_correlation_calib
ranked_scaled	0.817	1.0	0.804	0.69
ranked	0.817	1.0	0.804	0.69
raw (baseline)	0.731	0.895	0.652	0.882
weighted	0.716	0.879	0.635	0.876
ranked_norm	0.658	0.8	0.576	0.843
weighted_norm	0.551	0.538	0.498	0.775

Results for DeepSeek r1

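The ranking approach achieves the best overall Judgemark score (0.817 normalized, versus 0.731 for the raw baseline), and the scaled and unscaled variants score identically. Its gains come from stability and separability; its human correlation (0.69) is lower than the baseline's (0.882). This approach treats LLM log probabilities as rankings rather than true probabilities, which aligns better with how sampling methods like greedy and top-k actually work, since both depend only on the ordering of the tokens.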
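To illustrate the point about greedy and top-k sampling, here is a minimal sketch (not code from this repository) showing that the top-k set depends only on the ordering of the scores. The logits and temperatures are made-up values; rescaling by temperature changes the probabilities but not which tokens are selected.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(values, k):
    """Indices of the k largest values, best first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return order[:k]

logits = [0.1, 2.0, 1.5, -0.3, 0.9]  # hypothetical values for illustration
sharp = softmax(logits, temperature=0.5)
flat = softmax(logits, temperature=2.0)

# The probabilities differ, but the ordering (and so greedy/top-k) does not.
same_top3 = top_k(sharp, 3) == top_k(flat, 3)
same_probs = all(math.isclose(a, b) for a, b in zip(sharp, flat))
print(same_top3, same_probs)
```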

ranked_scaled: Kendall's tau, scaled to [0, 10], computed after normalising by the mean log probabilities over all samples: kendallstau(logprobs - logprobs_all_mean(), range(10)).correlation
ranked: Kendall's tau, scaled to [0, 10], without the mean normalisation.
weighted: similar to G-Eval; the log probabilities are used to weight the choices by their normalized probabilities.
- weighted_choice = choice * logprob / sum(logprobs) * 10
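The weighted idea can be sketched as a G-Eval-style expected score. This is an illustrative sketch, not the repository's implementation: it assumes the normalization is done on probabilities (exp of the log probabilities, renormalized over the available choices), which may differ from the one-line formula above, and it leaves out the final scaling factor. The function name `weighted_score` and the example values are hypothetical.

```python
import math

def weighted_score(logprobs):
    """Expected score under the renormalized choice distribution.

    logprobs maps each choice (as a string, e.g. "7") to its log probability.
    """
    m = max(logprobs.values())  # shift by the max for numerical stability
    weights = {c: math.exp(lp - m) for c, lp in logprobs.items()}
    total = sum(weights.values())
    # Probability-weighted average of the numeric choices.
    return sum(int(c) * w / total for c, w in weights.items())

# Hypothetical judge output: mass split across 6, 7 and 8.
example = {"6": math.log(0.25), "7": math.log(0.5), "8": math.log(0.25)}
score = weighted_score(example)
print(round(score, 3))
```

Compared with taking only the most likely token, this keeps the information in the tail of the distribution, at the cost of trusting the log probabilities as calibrated probabilities.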

Methodology: Ranking Approach

Instead of treating log probabilities as probabilities, the ranking method:

Extracts the full distribution of log probabilities for all possible choices (the single-token choices 0-9 in the example below):

{
  "0": -1.2,
  "1": -0.5,
  "2": -0.3,
  "3": -0.1,
  "4": -0.05,
  "5": -0.02,
  "6": -0.01,
  "7": -0.005,
  "8": -0.002,
  "9": -0.001
}

Ranks choices by their log probabilities
Uses Kendall's tau to measure correlation between log probability rankings and expected score rankings
Achieves high stability by extracting complete score distributions from a single token
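The steps above can be sketched in pure Python as follows. This is a hedged illustration rather than the repository's code: `kendall_tau_b` and `ranked_score` are hypothetical names, the linear map from tau in [-1, 1] to a score in [0, 10] is an assumption, and the optional `baseline` assumes that the mean log probabilities used by ranked_scaled are taken per choice. Subtracting a single scalar mean would leave the ranking, and therefore tau, unchanged.

```python
import math
from itertools import combinations

def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both sequences: counted in neither term
        elif dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0

def ranked_score(logprobs, baseline=None):
    """Map the tau between log probabilities and choice order onto [0, 10]."""
    choices = sorted(logprobs, key=int)
    values = [logprobs[c] for c in choices]
    if baseline is not None:
        # Assumed per-choice mean log probabilities over all samples.
        values = [v - baseline[c] for v, c in zip(values, choices)]
    tau = kendall_tau_b(values, [int(c) for c in choices])
    return (tau + 1) / 2 * 10  # assumed linear rescaling of [-1, 1]

# The example distribution from above: log probabilities rise with the choice.
dist = {"0": -1.2, "1": -0.5, "2": -0.3, "3": -0.1, "4": -0.05,
        "5": -0.02, "6": -0.01, "7": -0.005, "8": -0.002, "9": -0.001}
print(ranked_score(dist))
```

Because the example distribution increases strictly with the choice, every pair is concordant and the sketch returns the maximum score; a strictly decreasing distribution would return the minimum.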

Setup & Installation

Clone the repository:

git clone https://github.com/wassname/Judgemark-v2lp.git
cd Judgemark-v2lp

Install Python dependencies (requires Python 3.9+):

uv sync
source .venv/bin/activate  # Activate the virtual environment created by uv

Set up environment variables for API credentials:

export OPENAI_API_KEY="sk-..."
export OPENAI_API_URL="https://openrouter.ai/api/v1/chat/completions"

Usage

Run the benchmark with the main script:

# test
uv run python judgemark_v2.py \
  --judge-model "meta-llama/llama-3.2-3b-instruct"

# gather data
uv run python judgemark_v2.py \
  --judge-model "qwen/qwen3-235b-a22b" \
  --threads 8

uv run python judgemark_v2.py \
  --judge-model "deepseek/deepseek-r1"

# crunch results: run the notebook
# nbs/02_recomp.ipynb

See original repo https://github.com/EQ-bench/Judgemark-v2 for citation, license, and more details.