Files
ml_debug/docs/evidence/koaning_bad_labels.md
T
wassname ee4e9a5caa folklore: add koaning, gwern, kidger, nanochat, cleanrl; trim lucidrains
Gather debugging folklore from more practitioners, each a verbatim quote
checked against a cached source copy (footnoted with line numbers):
- koaning (Vincent Warmerdam), "Bad Labels": benchmark labels are often wrong;
  find them with confidence-sorted errors.
- gwern, the tank-detection legend: the canonical data-leakage parable, plus
  the scout-mindset twist that it's a likely-unsourced urban legend.
- Patrick Kidger, "Just Know Stuff": why research code is buggy ("kludge ...
  bugs that don't cripple things only because some other bug stops them") and
  "never accept the kludge". Plus a one-line jaxtyping pointer for shape bugs.
- nanochat (Karpathy): BOS-alignment fake metric improvement; all-ranks must
  clip on inf (a multi-GPU bug single-GPU testing hides).
- cleanrl "37 Implementation Details of PPO" -> RL sub-skill, as the canonical
  proof that reference-impl details (not ideas) decide whether PPO works.

Trim the lucidrains item to one quote (it had ballooned). Add wassname credit
+ companion-gist link. All 20 footnotes resolve.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 20:59:36 +08:00

34 lines
3.2 KiB
Markdown

# Bad Labels — Vincent D. Warmerdam (koaning)
Source: https://koaning.io/posts/labels/ (2021-09-02). Cached copy for the ML-debugging skill.
---
I write a lot of blogposts on why you need more than grid-search to properly judge a machine learning model. In this blogpost I want to demonstrate yet another reason; labels often seem to be wrong.
What I'll describe here is also available as a course on calmcode.io.
## Bit of Background
It turns out that bad labels are a *huge* problem in many popular benchmark datasets. To get an impression of the scale of the issue, just go to labelerrors.com. It's an impressive project that shows problems with many popular datasets; CIFAR, MNIST, Amazon Reviews, IMDB, Quickdraw and Newsgroups just to name a few. It's part of a research paper (https://arxiv.org/abs/2103.14749) that tries to quantify how big of a problem these bad labels are.
The issue here isn't just that we might have bad labels in our training set, the issue is that it appears in the validation set. If a machine learning model can become state of the art by squeezing another 0.5% out of a validation set one has to wonder. Are we really making a better model? Or are we creating a model that is better able to overfit on the bad labels?
## Quick Trick
Here's a quick trick seems worthwhile. Let's say that we train a model that is very general. That means high bias, low variance. You may have a lower capacity model this way, but it will be less prone to overfit on details.
After training such a model, it'd be interesting to see where the model disagrees with the training data. These would be valid candidates to check, but it might result in list that's a bit too long for comfort. So to save time you can can sort the data based on the `predict_proba()`-value. When the model gets it wrong, that's interesting, but when it *also* associates a very low confidence to the correct class, that's an example worth double checking.
## What does this mean?
The abstract of the [Northcutt et al.] paper certainly paints a clear picture of what this exercise means for state-of-the-art models:
> We find that lower capacity models may be practically more useful than higher capacity models in real-world datasets with high proportions of erroneously labeled data. For example, on ImageNet with corrected labels: ResNet-18 outperforms ResNet-50 if the prevalence of originally mislabeled test examples increases by just 6%. On CIFAR-10 with corrected labels: VGG-11 outperforms VGG-19 if the prevalence of originally mislabeled test examples increases by 5%. Traditionally, ML practitioners choose which model to deploy based on test accuracy -- our findings advise caution here, proposing that judging models over correctly labeled test sets may be more useful, especially for noisy real-world datasets.
## So what now?
More people should do check their labels more frequently. ... if you're looking for a simple place to start, check out the cleanlab project (https://github.com/cgnorthcutt/cleanlab). It's made by the same authors of the labelerrors-paper and is meant to help you find bad labels.
For everyone; maybe we should spend a less time tuning parameters and instead spend it trying to get a more meaningful dataset.