folklore: add koaning, gwern, kidger, nanochat, cleanrl; trim lucidrains

Gather debugging folklore from more practitioners, each a verbatim quote checked against a cached source copy (footnoted with line numbers):

- koaning (Vincent Warmerdam), "Bad Labels": benchmark labels are often wrong; find them with confidence-sorted errors.
- gwern, the tank-detection legend: the canonical data-leakage parable, plus the scout-mindset twist that it's a likely-unsourced urban legend.
- Patrick Kidger, "Just Know Stuff": why research code is buggy ("kludge ... bugs that don't cripple things only because some other bug stops them") and "never accept the kludge". Plus a one-line jaxtyping pointer for shape bugs.
- nanochat (Karpathy): BOS-alignment fake metric improvement; all-ranks must clip on inf (a multi-GPU bug single-GPU testing hides).
- cleanrl "37 Implementation Details of PPO" -> RL sub-skill, as the canonical proof that reference-impl details (not ideas) decide whether PPO works.

Trim the lucidrains item to one quote (it had ballooned). Add wassname credit + companion-gist link.

All 20 footnotes resolve.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:15:57 +08:00 · 2026-06-02 20:59:36 +08:00
parent 9911ac83c5
commit ee4e9a5caa
6 changed files with 134 additions and 6 deletions
@@ -66,7 +66,7 @@ For RL, add reward scale/sign as a top-3 issue, and episode-boundary handling (d
 A catalog of small, well-worn checks, in rough dependency order (each assumes the one before). Pull from it; don't run it end-to-end as a ritual.

 **Step 1: Verify components in isolation.**[^goodfellow][^cs229] Most bugs are "doing the wrong calculation." Test each piece independently.
- Forward pass: feed known inputs, check output shapes and ranges. `assert` shapes everywhere, since `(None,)` vs `(None, 1)` silently broadcasts into `(None, None)`.
+- Forward pass: feed known inputs, check output shapes and ranges. `assert` shapes everywhere, since `(None,)` vs `(None, 1)` silently broadcasts into `(None, None)`. (Or make the shapes runtime-checked contracts with jaxtyping[^jaxtyping] + beartype, so a silent mis-broadcast fails loudly at the call site.)
 - Loss: hand-compute a few targets and compare to code output.
 - Data pipeline: sample a batch, print it, eyeball it. Are labels aligned with inputs? Transforms applied correctly?
 - Preprocessing: look at processed inputs as a human. Can *you* solve the task from them?
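
A minimal sketch of the "hand-compute a few targets" check for the loss, in plain Python so it runs anywhere. The `cross_entropy` helper and the toy probabilities are illustrative assumptions, not code from any of the cited sources; in practice the function under test would be your framework's loss.

```python
import math

def cross_entropy(probs, targets):
    """Mean negative log-likelihood of the target class, one row per example."""
    assert len(probs) == len(targets), "need exactly one target per row"
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)

# Two examples, three classes, chosen so the answer is easy to work out by hand.
probs = [[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]]
targets = [0, 1]

# By hand: -(ln 0.5 + ln 0.8) / 2 = (0.693147 + 0.223144) / 2 ~= 0.458145
hand_computed = 0.458145
loss = cross_entropy(probs, targets)
assert abs(loss - hand_computed) < 1e-5, (loss, hand_computed)
```

The value of the check is that the expected number comes from a calculator, not from a second copy of the same code path.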
@@ -151,6 +151,16 @@ The hard-won lessons, in the words of the people who learned them. Sources and l

 A bug can also hide, because most ML models have multiple adaptive parts: "If one part is broken, the other parts can adapt and still achieve roughly acceptable performance"[^goodfellow], and it may not show in the output at all. So raise the bar for "correct."

+### Never accept the kludge (Patrick Kidger)
+
+Why is research code so reliably buggy? Kidger's blunt answer:
+
+> Academic software is almost always a poorly-maintained kludge of leaky abstractions, awful formatting, and bugs that don't cripple things only because some other bug stops them from doing so.[^kidger]
+
+> This is a systemic professional failing. [...] the overwhelming majority of your time will be spent in front of a screen, staring at code. And yet most of you (yes, you) would not pass muster as a junior developer.[^kidger]
+
+His fix is a posture, "never accept the kludge": messed up your git repo? Find the commands to fix it, "don't just delete it and clone from the remote."[^kidger] The instinct that refuses kludges is the same one that refuses `.detach()`-to-silence-autograd and `except: pass`.
+
 ### Loss curves are a red herring

 > When someone's RL implementation isn't working, they *luuuuuurv* to copy-paste a screenshot of their loss curve to you. They do this because they know they want a pretty, exponentially-decaying loss curve, and they know what they have *isn't that*. The problem with using the loss curve as an indicator of correctness is somewhat that it's not reliable, but mostly because it doesn't localise errors. The shape of your loss curve says very little about where in your code you've messed up, and so says very little about what you need to change to get things working.[^jones]
@@ -177,6 +187,22 @@ Corollary, MurphyJitsu pre-flight: before launching a run, ask "if this fails, w

 Slavv's "37 reasons" list opens with the same anecdote (gradients flowing, loss falling, predictions all background) and puts "Verify that the input data is correct" and "Start with a really small dataset (2-20 samples). Overfit on it" at the top of its emergency checklist[^slavv]. FSDL names preprocessing and dataset construction as leading silent-failure categories[^fsdl].

+### Labels are often wrong (koaning)
+
+Even benchmark data is dirtier than you think. Vincent Warmerdam:
+
+> It turns out that bad labels are a *huge* problem in many popular benchmark datasets.[^koaning]
+
+His cheap way to find them: train a deliberately high-bias model, collect the examples where its prediction disagrees with the label, and sort them by the confidence it assigns to the labelled class, lowest first (the confidence-sorted-errors trick). The takeaway: "maybe we should spend [...] less time tuning parameters and instead spend it trying to get a more meaningful dataset."[^koaning]
+
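The confidence-sorted-errors trick can be sketched in a few lines. This is an illustration under assumptions, not Warmerdam's code: `suspicious_labels` is a hypothetical helper, and `proba` stands in for the output of a scikit-learn-style `predict_proba()` call on the training data.

```python
def suspicious_labels(proba, labels, top_k=3):
    """Return indices of likely label errors, most suspicious first.

    proba[i][c] is the model's probability for class c on example i.
    """
    candidates = []
    for i, (p, y) in enumerate(zip(proba, labels)):
        predicted = max(range(len(p)), key=p.__getitem__)
        if predicted != y:                # model disagrees with the given label
            candidates.append((p[y], i))  # confidence assigned to that label
    candidates.sort()                     # lowest confidence in the label first
    return [i for _, i in candidates[:top_k]]

proba = [
    [0.90, 0.10],  # label 0, model agrees
    [0.02, 0.98],  # label 0, model strongly disagrees: check this one first
    [0.45, 0.55],  # label 0, mild disagreement
    [0.20, 0.80],  # label 1, model agrees
]
labels = [0, 0, 0, 1]
flagged = suspicious_labels(proba, labels)  # indices to review by hand
```

The output is a review queue, not a verdict: a human still decides whether each flagged label is actually wrong.
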
+### The tank story: your model learns the confound (gwern)
+
+The canonical data-leakage parable:
+
+> A cautionary tale in artificial intelligence tells about researchers training an neural network (NN) to detect tanks in photographs, succeeding, only to realize the photographs had been collected under specific conditions for tanks/non-tanks and the NN had learned something useless like time of day.[^gwern]
+
+gwern traced versions back to 1992 and concluded it is "a classic 'urban legend'" with no solid source[^gwern]. The lesson holds twice over: a model will gladly learn a confound in how the data was collected instead of the task (dataset bias / leakage), and even your cautionary tales deserve a citation.
+
 ### Overfit one batch first

 > Overfit a tiny subset of data. Lastly and most importantly, before training on the full dataset try to train on a tiny portion (e.g. 20 examples) of your data and make sure you can achieve zero cost. For this experiment it's also best to set regularization to zero [...]. Unless you pass this sanity check with a small dataset it is not worth proceeding to the full dataset.[^cs231n]
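
A toy version of the sanity check described in the quote, assuming a four-example dataset and a hand-rolled logistic regression (both hypothetical stand-ins for your real model and data). Regularization is off, and the only question asked is whether the loss can be driven close to zero.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_tiny(xs, ys, lr=1.0, steps=5000):
    """Full-batch gradient descent on logistic regression, no regularization."""
    w, b = [0.0] * len(xs[0]), 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(xs, ys):
            err = predict(w, b, x) - y  # d(BCE)/d(logit)
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def mean_bce(xs, ys, w, b, eps=1e-12):
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict(w, b, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)

# Four separable examples: the label is simply the first feature.
xs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
ys = [0, 0, 1, 1]
w, b = train_tiny(xs, ys)
final_loss = mean_bce(xs, ys, w, b)
assert final_loss < 0.05, "cannot overfit 4 examples: suspect the pipeline"
```

If a model with enough capacity cannot pass this on a handful of examples, the bug is in the data, loss, or optimizer wiring, and more data will not help.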
@@ -237,15 +263,23 @@ So: 3e-4 is a fine *starting* LR for Adam, not a law. The real folklore is "Adam
 | Loss spikes at start then recovers | Normal with large batch + warmup. No warmup? Add it |
 | Different results at different batch sizes (same total steps) | Missing LR scaling. Adjust LR proportionally |

-### Transformer instability: the fix lives in the code (lucidrains)
+### Tricks hide in reference code (lucidrains)

-The clearest living proof of "the trick is in the implementation, not the paper" is lucidrains' (Phil Wang's) x-transformers, a catalogue of training tricks each tied to the paper it came from. The one most worth knowing for debugging: when a transformer's loss spikes or diverges, a leading cause is attention logits growing unbounded, and the now-near-standard fix is to L2-normalize the queries and keys before their dot product (QK / cosine-sim normalization).
+lucidrains' x-transformers is a catalogue of training tricks, each tied to its paper. The debugging-relevant one: when a transformer diverges, attention logits blowing up is a prime suspect, and the now-near-standard fix is QK normalization (L2-normalize queries and keys before the dot product).

-> The normalization prevents the attention operation from overflowing, and removes any need for numerical stability measures prior to softmax. Both are perennial problems when training transformers.[^lucidrains]
+> We are nearing the point of wiping out a source of transformer training instability with one simple intervention, in my opinion.[^lucidrains]

-> We are nearing the point of wiping out a source of transformer training instability with one simple intervention, in my opinion.[^lucidrains]
+Scaled-up recipes accumulate these one-line stability fixes in code long before they're written up, which is the whole case for reading a working implementation.

-It has since been validated on 3B-22B parameter models. A related embedding-level stabilizer he notes: a LayerNorm right after the token+positional embeddings, which both BLOOM-175B and YaLM-100B used to stabilize training.[^lucidrains] The lesson is the read-a-working-implementation one again: scaled-up training recipes accumulate these one-line stability fixes in code long before they are written up, so a divergent run is often a cue to go read what the big runs actually did.
+### Modern LLM-pretraining gotchas (nanochat)
+
+Karpathy's nanochat is one of the few public records of what scaling a transformer from scratch actually takes. Two gotchas worth stealing:
+
+> The 'lower validation loss' from BOS-alignment is misleading—it's just fewer noisy tokens, not better learning.[^nanochat]
+
+> If any rank's gradient contains inf, all ranks must clip to avoid divergence.[^nanochat]
+
+The first is a fake-metric-improvement trap (a better number that isn't better learning); the second is a multi-GPU bug that single-GPU testing hides.
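
Why the second gotcha needs more than one device to show up can be sketched with a pure-Python simulation. Everything here is an assumption for illustration, not nanochat's code: ranks are modelled as a list of gradient shards, the sum over ranks stands in for an all-reduce, and "drop the update when the norm is non-finite" is one hypothetical policy among several.

```python
import math

def local_sq_norm(grad):
    """Squared L2 norm of one rank's gradient shard (inf and nan propagate)."""
    return sum(g * g for g in grad)

def clip_coefficients(rank_grads, max_norm=1.0, agree=True):
    """Per-rank scale factors applied to gradients before the optimizer step.

    agree=False: each rank decides from its own shard only (the buggy pattern).
    agree=True:  ranks combine squared norms first (stand-in for an all-reduce),
                 so every rank sees the same global norm and acts identically.
    """
    sq = [local_sq_norm(g) for g in rank_grads]
    if agree:
        norms = [math.sqrt(sum(sq))] * len(rank_grads)
    else:
        norms = [math.sqrt(s) for s in sq]
    coeffs = []
    for n in norms:
        if not math.isfinite(n):
            coeffs.append(0.0)  # hypothetical policy: drop the update entirely
        else:
            coeffs.append(min(1.0, max_norm / (n + 1e-6)))
    return coeffs

rank_grads = [[0.3, 0.4], [float("inf"), 0.1]]  # rank 1 overflowed
buggy = clip_coefficients(rank_grads, agree=False)  # ranks disagree
fixed = clip_coefficients(rank_grads, agree=True)   # ranks act in lockstep
```

With `agree=False`, rank 0 applies its update while rank 1 drops it, so the replicas' weights drift apart. With a single rank the two modes are identical, which is why single-GPU testing cannot catch this.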

 ---

@@ -351,5 +385,12 @@ Folklore sources (the quotes above trace to these):
 [^mccandlish]: McCandlish, Kaplan et al., "An Empirical Model of Large-Batch Training" (2018) — https://arxiv.org/abs/1812.06162 ([cache](docs/evidence/mccandlish_2018_large_batch.md))
 [^goyal]: Goyal et al., "Accurate, Large Minibatch SGD" (2017) — https://arxiv.org/abs/1706.02677
 [^lucidrains]: Phil Wang (lucidrains), x-transformers README — https://github.com/lucidrains/x-transformers ([cache](docs/evidence/lucidrains_x_transformers_readme.md): post-embedding LayerNorm / BLOOM+YaLM L366, attention-overflow / cosine-sim norm L1230, autoregressive validation L1234, "wiping out a source of instability" / QK RMSNorm L1292)
+[^koaning]: Vincent D. Warmerdam (koaning), "Bad Labels" (2021) — https://koaning.io/posts/labels/ ([cache](docs/evidence/koaning_bad_labels.md): bad-labels-huge-problem L13, confidence-sort trick L21, spend-less-time-tuning L33)
+[^jaxtyping]: Patrick Kidger, jaxtyping (runtime shape/dtype checking) — https://github.com/patrick-kidger/jaxtyping
+[^nanochat]: nanochat (Karpathy), documented via DeepWiki — https://deepwiki.com/karpathy/nanochat ([cache](docs/evidence/nanochat_deepwiki_llm_pretraining_2026.md): BOS fake-improvement L97, all-ranks-clip-on-inf L131)
+[^kidger]: Patrick Kidger, "Just Know Stuff" (2023) — https://kidger.site/thoughts/just-know-stuff/ ([cache](docs/evidence/kidger_just_know_stuff.md): kludge-definition L7, junior-developer L9, never-accept-the-kludge L11, don't-delete-and-clone L13)
+[^gwern]: Gwern Branwen, "The Neural Net Tank Legend" — https://gwern.net/tank ([cache](docs/evidence/gwern_tank.md): cautionary tale L7, urban-legend conclusion L9)

 For modern transformer pretraining specifically (the sources above predate it), see [Karpathy's recipe](https://karpathy.github.io/2019/04/25/recipe/) and the [nanochat deepwiki](https://deepwiki.com/karpathy/nanochat) (320+ empirical HP sweeps for a GPT-2-scale run). Most multi-source claims trace to quotes in [docs/ml_debug_folklore.argdown](docs/ml_debug_folklore.argdown) (vargdown); the full evidence set is in [docs/evidence/](docs/evidence/).
+
+Curated by [wassname](https://github.com/wassname). Companion gist: https://gist.github.com/wassname/e45e41f75c0b50e72ec1f4cff811a277
@@ -0,0 +1,25 @@
+# The 37 Implementation Details of Proximal Policy Optimization
+
+Authors: Huang, Shengyi; Dossa, Rousslan Fernand Julien; Raffin, Antonin; Kanervisto, Anssi; Wang, Weixun.
+Source: ICLR Blog Track, 2022-03-25 — https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
+Code: https://github.com/vwxyzjn/ppo-implementation-details ; CleanRL: https://github.com/vwxyzjn/cleanrl
+
+Excerpt cached for the ML-debugging skill (the full post is long; key framing passages below, verbatim).
+
+---
+
+> Instead of doing ablation studies and making recommendations on which details matter, this blog post takes a step back and focuses on reproductions of PPO's results in all accounts.
+
+> During our re-implementation, we have compiled an implementation checklist containing 37 details as follows. For each implementation detail, we display the permanent link to its code (which is not done in academic papers) and point out its literature connection.
+
+The 37 details break down as:
+- 13 core implementation details
+- 9 Atari-specific implementation details
+- 9 implementation details for robotics tasks (continuous action spaces)
+- 5 LSTM implementation details
+- 1 `MultiDiscrete` action-spaces implementation detail
+- (plus 4 situational details not used in the official implementation)
+
+> Our ultimate purpose is to help people understand the PPO implementation through and through, reproduce past results with high fidelity, and facilitate customization for new research.
+
+Context: the official PPO implementation (`openai/baselines`, `ppo2`) has undergone several refactorings, so "it is important to recognize *which version* of the official implementation is worth studying." Libraries that match `ppo2`'s details closely (Stable-Baselines3, CleanRL) reproduce similar results; others report more diverse (worse) results.
@@ -0,0 +1,9 @@
+# The Neural Net Tank Legend — Gwern Branwen
+
+Source: https://gwern.net/tank . Cached excerpt for the ML-debugging skill (verbatim abstract passages).
+
+---
+
+> A cautionary tale in artificial intelligence tells about researchers training an neural network (NN) to detect tanks in photographs, succeeding, only to realize the photographs had been collected under specific conditions for tanks/non-tanks and the NN had learned something useless like time of day. This story is often told to warn about the limits of algorithms and importance of data collection to avoid "dataset bias"/"data leakage" where the collected data can be solved using algorithms that do not generalize to the true data distribution, but the tank story is usually never sourced.
+
+> I collate many extent versions dating back a quarter of a century to 1992 along with two NN-related anecdotes from the 1960s; their contradictions & details indicate a classic "urban legend", with a probable origin in a speculative question in the 1960s by Edward Fredkin at an AI conference about some early NN research, which was then classified & never followed up on.
@@ -0,0 +1,17 @@
+# Just Know Stuff (how to achieve success in an ML PhD) — Patrick Kidger
+
+Source: https://kidger.site/thoughts/just-know-stuff/ (2023-01-26). Cached excerpt from the "Software development" section, verbatim.
+
+---
+
+> Academic software is almost always a poorly-maintained kludge of leaky abstractions, awful formatting, and bugs that don't cripple things only because some other bug stops them from doing so.
+
+> This is a systemic professional failing. As an (applied) ML researcher, the overwhelming majority of your time will be spent in front of a screen, staring at code. And yet most of you (yes, you) would not pass muster as a junior developer.
+
+> So, how to improve? First of all, never accept the kludge.
+
+> You've messed up your Git repo? Figure out the commands to fix it... don't just delete it and clone from the remote.
+
+> Focus on writing clean code, based around orthogonal abstractions. When the code starts getting messy - and it will - be willing to refactor your code into something more legible. Avoid both spaghetti code and ravioli code.
+
+> When the documentation is inadequate, look at their source code.
@@ -0,0 +1,33 @@
+# Bad Labels — Vincent D. Warmerdam (koaning)
+
+Source: https://koaning.io/posts/labels/ (2021-09-02). Cached copy for the ML-debugging skill.
+
+---
+
+I write a lot of blogposts on why you need more than grid-search to properly judge a machine learning model. In this blogpost I want to demonstrate yet another reason; labels often seem to be wrong.
+
+What I'll describe here is also available as a course on calmcode.io.
+
+## Bit of Background
+
+It turns out that bad labels are a *huge* problem in many popular benchmark datasets. To get an impression of the scale of the issue, just go to labelerrors.com. It's an impressive project that shows problems with many popular datasets; CIFAR, MNIST, Amazon Reviews, IMDB, Quickdraw and Newsgroups just to name a few. It's part of a research paper (https://arxiv.org/abs/2103.14749) that tries to quantify how big of a problem these bad labels are.
+
+The issue here isn't just that we might have bad labels in our training set, the issue is that it appears in the validation set. If a machine learning model can become state of the art by squeezing another 0.5% out of a validation set one has to wonder. Are we really making a better model? Or are we creating a model that is better able to overfit on the bad labels?
+
+## Quick Trick
+
+Here's a quick trick seems worthwhile. Let's say that we train a model that is very general. That means high bias, low variance. You may have a lower capacity model this way, but it will be less prone to overfit on details.
+
+After training such a model, it'd be interesting to see where the model disagrees with the training data. These would be valid candidates to check, but it might result in list that's a bit too long for comfort. So to save time you can can sort the data based on the `predict_proba()`-value. When the model gets it wrong, that's interesting, but when it *also* associates a very low confidence to the correct class, that's an example worth double checking.
+
+## What does this mean?
+
+The abstract of the [Northcutt et al.] paper certainly paints a clear picture of what this exercise means for state-of-the-art models:
+
+> We find that lower capacity models may be practically more useful than higher capacity models in real-world datasets with high proportions of erroneously labeled data. For example, on ImageNet with corrected labels: ResNet-18 outperforms ResNet-50 if the prevalence of originally mislabeled test examples increases by just 6%. On CIFAR-10 with corrected labels: VGG-11 outperforms VGG-19 if the prevalence of originally mislabeled test examples increases by 5%. Traditionally, ML practitioners choose which model to deploy based on test accuracy -- our findings advise caution here, proposing that judging models over correctly labeled test sets may be more useful, especially for noisy real-world datasets.
+
+## So what now?
+
+More people should do check their labels more frequently. ... if you're looking for a simple place to start, check out the cleanlab project (https://github.com/cgnorthcutt/cleanlab). It's made by the same authors of the labelerrors-paper and is meant to help you find bad labels.
+
+For everyone; maybe we should spend a less time tuning parameters and instead spend it trying to get a more meaningful dataset.
@@ -108,6 +108,8 @@ The allure of writing from scratch is real but the self-correction mechanisms in
 2. Use reference impl as source of reliable components, work to the same API
 3. Have one eye on reference while you write your own -- copy hyperparameters, discounting code, termination handling

+The canonical demonstration is "The 37 Implementation Details of PPO" [Huang et al. 2022]: reproducing PPO meant matching 37 separate details (13 core, 9 Atari, 9 continuous-control, 5 LSTM, 1 MultiDiscrete), each linked to its exact line of code, "which is not done in academic papers." If your PPO underperforms a reference, you are probably missing some of these, not lacking a better idea.
+
 References: spinning-up (OpenAI), stable-baselines3, cleanrl (single-file per algo), OpenSpiel (multi-agent).

 **10. Don't over-interpret noise.** [Schulman 2017, Henderson 2018, Irpan 2018]
@@ -149,6 +151,7 @@ Sometimes (rarely) you don't. Schulman:
 - Alex Irpan, "Deep Reinforcement Learning Doesn't Work Yet" (2018): https://www.alexirpan.com/2018/02/14/rl-hard.html
 - McCandlish & Kaplan, "An Empirical Model of Large-Batch Training" (2018): https://arxiv.org/abs/1812.06162
 - Slav Ivanov, "37 Reasons why your Neural Network is not working" (2017): https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607
+- Huang et al., "The 37 Implementation Details of PPO" (ICLR Blog Track 2022): https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/ ([cache](../docs/evidence/cleanrl_37_ppo_details.md))

 ### Reference implementations
 - OpenAI Spinning Up: https://github.com/openai/spinningup