Judgemark V2 - Log Probability Evaluation
A fork of Judgemark that evaluates whether log probability-based scoring methods outperform traditional text-based judgment approaches for interpreting the results of language models.
Overview
This project compares different methods of extracting scores from language models:
- Raw: Traditional text-based scoring (baseline Judgemark method)
- Weighted: Log probability weighting using normalized choice probabilities (similar to G-Eval)
- Ranked: Ranking-based evaluation using Kendall's tau correlation between log probability rankings and score rankings
Results
| Method | Score | Score (Normalized) |
|---|---|---|
| ranked | 0.77 | 0.817 |
| ranked_scaled | 0.77 | 0.817 |
| raw (baseline) | 0.678 | 0.731 |
| weighted | 0.668 | 0.716 |
| ranked_norm | 0.647 | 0.658 |
| weighted_norm | 0.562 | 0.551 |
| name | judgemark_score_calib | stability_calib | separability_calib | human_correlation_calib |
|---|---|---|---|---|
| ranked_scaled | 0.817 | 1.0 | 0.804 | 0.69 |
| ranked | 0.817 | 1.0 | 0.804 | 0.69 |
| raw (baseline) | 0.731 | 0.895 | 0.652 | 0.882 |
| weighted | 0.716 | 0.879 | 0.635 | 0.876 |
| ranked_norm | 0.658 | 0.8 | 0.576 | 0.843 |
| weighted_norm | 0.551 | 0.538 | 0.498 | 0.775 |
Results for DeepSeek r1
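The ranking approach scores highest overall (0.817 normalized, against 0.731 for the raw baseline); the scaled and unscaled variants tie in this run. It also has the highest stability and separability, but a lower human correlation (0.69 against 0.882 for the baseline). This approach treats LLM log probabilities as rankings rather than true probabilities, which aligns better with how sampling methods like greedy decoding and top-k sampling actually work.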
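The point about greedy and top-k selection can be illustrated with a small sketch. The log probabilities below are hypothetical, and `top_k` is a helper written for this example, not part of the repository. Rescaling the values with a temperature changes the implied probabilities but not the order of the tokens, so both selection rules pick the same tokens.

```python
def top_k(logprobs, k):
    """Return the k tokens with the highest log probability, best first."""
    return sorted(logprobs, key=logprobs.get, reverse=True)[:k]


# Hypothetical log probabilities for four score tokens.
logprobs = {"3": -2.3, "5": -0.4, "7": -1.1, "9": -4.0}

# Temperature scaling is a monotone transform: the values change,
# the ordering of the tokens does not.
temperature = 2.0
rescaled = {tok: lp / temperature for tok, lp in logprobs.items()}

greedy_before = top_k(logprobs, 1)
greedy_after = top_k(rescaled, 1)
top2_before = top_k(logprobs, 2)
top2_after = top_k(rescaled, 2)

print(greedy_before, greedy_after)
print(top2_before, top2_after)
```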
- ranked_scaled: Kendall's tau, scaled to [0, 10], computed after normalising by the mean log probs over all samples:
    kendallstau(logprobs - logprobs_all_mean(), range(10)).correlation
- ranked: Kendall's tau, scaled to [0, 10].
- weighted: similar to G-Eval; the choices are weighted by their normalized probabilities:
    weighted_choice = choice * logprob / sum(logprobs) * 10
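As a rough illustration of the weighted method, the sketch below computes a G-Eval-style expected score. It assumes the weights are the choice probabilities (the exponentiated log probabilities) renormalised over the score tokens, as the description says; the function name `weighted_score` and the example values are hypothetical, and any rescaling to the benchmark's range is left out.

```python
import math


def weighted_score(logprobs):
    """Probability-weighted mean of the integer choices in `logprobs`.

    `logprobs` maps a score token such as "7" to its log probability.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    max_lp = max(logprobs.values())
    weights = {int(tok): math.exp(lp - max_lp) for tok, lp in logprobs.items()}
    total = sum(weights.values())
    # Renormalise over the score tokens and take the expected value.
    return sum(choice * w / total for choice, w in weights.items())


# Hypothetical judge output: most of the probability mass sits on 7 and 8.
example = {"6": math.log(0.1), "7": math.log(0.6), "8": math.log(0.3)}
score = weighted_score(example)
print(round(score, 3))
```

Unlike taking the single most likely token, this keeps the information in the runner-up choices, so two samples that both answer "7" can still receive different scores.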
Methodology: Ranking Approach
Instead of treating log probabilities as probabilities, the ranking method:
- Extracts the full distribution of log probabilities for all possible choices (the example shows 0-9):

      { "0": -1.2, "1": -0.5, "2": -0.3, "3": -0.1, "4": -0.05, "5": -0.02, "6": -0.01, "7": -0.005, "8": -0.002, "9": -0.001 }

- Ranks choices by their log probabilities
- Uses Kendall's tau to measure correlation between log probability rankings and expected score rankings
- Achieves high stability by extracting complete score distributions from a single token
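The steps above can be sketched in plain Python. This is a minimal illustration, not the repository's code: `kendall_tau` is a hand-written tau-a (it matches the more common tau-b when there are no ties), the linear map from [-1, 1] to [0, 10] is an assumption about how the scaling is done, and the optional `baseline` argument is one reading of the mean normalisation used by ranked_scaled.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    pairs = list(combinations(range(len(xs)), 2))
    total = 0
    for i, j in pairs:
        prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if prod > 0:
            total += 1  # concordant pair
        elif prod < 0:
            total -= 1  # discordant pair (ties contribute nothing)
    return total / len(pairs)


def ranked_score(logprobs, baseline=None):
    """Correlate log probabilities with the choice order, mapped to [0, 10].

    `baseline` (assumed: mean log prob per choice over all samples) is
    subtracted first when given.
    """
    choices = sorted(logprobs, key=int)
    values = [logprobs[c] - (baseline[c] if baseline else 0.0) for c in choices]
    tau = kendall_tau(values, [int(c) for c in choices])
    return (tau + 1) / 2 * 10  # assumed linear map from [-1, 1] to [0, 10]


# The example distribution from the list above.
example = {"0": -1.2, "1": -0.5, "2": -0.3, "3": -0.1, "4": -0.05,
           "5": -0.02, "6": -0.01, "7": -0.005, "8": -0.002, "9": -0.001}
print(ranked_score(example))
```

In the example every higher choice has a higher log probability, so all pairs are concordant and the score sits at the top of the range.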
Setup & Installation
- Clone the repository:

      git clone https://github.com/wassname/Judgemark-v2lp.git
      cd Judgemark-v2lp

- Install Python dependencies (requires Python 3.9+):

      uv sync
      source .venv/bin/activate  # Activate the virtual environment

- Set up environment variables for API credentials:

      export OPENAI_API_KEY="sk-..."
      export OPENAI_API_URL="https://openrouter.ai/api/v1/chat/completions"
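Before starting a long run it can help to confirm that both variables are visible to Python. The helper below is a hypothetical convenience check, not part of the repository; it only uses the two variable names given above.

```python
import os

REQUIRED_VARS = ("OPENAI_API_KEY", "OPENAI_API_URL")


def missing_env_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("API credentials are set.")
```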
Usage
Run the benchmark with the main script:
# test
uv run python judgemark_v2.py \
--judge-model "meta-llama/llama-3.2-3b-instruct"
# gather data
uv run python judgemark_v2.py \
--judge-model "qwen/qwen3-235b-a22b" \
--threads 8
uv run python judgemark_v2.py \
--judge-model "deepseek/deepseek-r1"
# crunch results: run the notebook nbs/02_recomp.ipynb
See the original repo, https://github.com/EQ-bench/Judgemark-v2, for citation, license, and more details.