Files
wassname 8cd3c61050 folklore: tuning playbook, Domingos, Bekman loss spikes, Ng error analysis; LLM-judge bias appendix
- SKILL.md: 3 new entries (exploration-over-exploitation + nuisance HPs,
  test-set contamination, loss-spikes-mean-bad-data-pocket) and an Ng
  100-misclassified-examples quote under inspect-the-data
- refs/llm_judges.md: position/verbosity/self-preference biases (Zheng,
  Wang 66/80 flip, Panickssery) + mitigation checklist from verdict docs
- Lones pitfalls linked as the exhaustive 36-item do/don't checklist
- 6 new frozen evidence files; Hamel evals link in further reading

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 15:30:41 +08:00

3.3 KiB

LLM-as-a-judge: known biases and mitigations

Appendix to the ML Debugging skill. When an LLM-judged eval looks surprisingly good, or a ranking flips between runs, suspect the judge before the model. Each bias below has been measured; verbatim sources in docs/evidence/llm_judge_biases.md.

The measured biases

Position bias is large enough to flip rankings outright. Wang et al. (ACL 2024):

the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. [...] e.g., Vicuna-13B could beat ChatGPT on 66 over 80 tested queries with ChatGPT as an evaluator.1

Zheng et al. (the MT-Bench paper) named the wider taxonomy: "position, verbosity, and self-enhancement biases, as well as limited reasoning ability"2 . Their headline agreement number (GPT-4 matches human preference "over 80%", the same as human-human agreement) is the case for LLM judges; the bias list is the fine print.

Self-preference tracks self-recognition. Panickssery et al. fine-tuned models to vary self-recognition ability and found "a linear correlation between self-recognition capability and the strength of self-preference bias"3 , with controlled experiments supporting a causal reading. A judge that can tell its own outputs apart will favor them, so judging a model with itself (or a sibling checkpoint) is structurally biased.

There are also output-distribution quirks. From Haize Labs' verdict docs (practitioner notes): the gpt-4o family skews numerical scores upward and mode-collapses even with logprobs; llama-family judges give higher-entropy, more discriminative score distributions; JSON-mode constrained decoding imposes its own inductive bias on scores.4

Mitigation checklist

From Wang's calibration framework and verdict's best-practices page:

  • Ask for an explanation or justification before the score, not after.
  • Score both orderings and aggregate (Wang's Balanced Position Calibration); at minimum, randomize position and check the flip rate.
  • Use a different model family for the judge (and for any verifier-of-the-judge) than the one being evaluated. Same-model verification produces a positive skew "that may not discriminate faithfully".4
  • Inspect the raw score distribution before trusting means: mode collapse or skew means the scale isn't being used.
  • Spot-check judge verdicts against your own reading of ~20 transcripts (the Ng error-analysis move, applied to the judge).
  • Judge quality is benchmarkable: JudgeBench ranks judges on objective-correctness pairs.

  1. Wang et al., "Large Language Models are not Fair Evaluators" (ACL 2024) — https://arxiv.org/abs/2305.17926 ↩︎

  2. Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (NeurIPS 2023) — https://arxiv.org/abs/2306.05685 ↩︎

  3. Panickssery, Bowman, Feng, "LLM Evaluators Recognize and Favor Their Own Generations" (2024) — https://arxiv.org/abs/2404.13076 ↩︎

  4. Haize Labs, verdict docs: best practices, distributional bias cookbook ↩︎