folklore: add koaning, gwern, kidger, nanochat, cleanrl; trim lucidrains

Gather debugging folklore from more practitioners, each a verbatim quote checked against a cached source copy (footnoted with line numbers):

- koaning (Vincent Warmerdam), "Bad Labels": benchmark labels are often wrong; find them with confidence-sorted errors.
- gwern, the tank-detection legend: the canonical data-leakage parable, plus the scout-mindset twist that it's a likely-unsourced urban legend.
- Patrick Kidger, "Just Know Stuff": why research code is buggy ("kludge ... bugs that don't cripple things only because some other bug stops them") and "never accept the kludge". Plus a one-line jaxtyping pointer for shape bugs.
- nanochat (Karpathy): BOS-alignment fake metric improvement; all-ranks must clip on inf (a multi-GPU bug single-GPU testing hides).
- cleanrl "37 Implementation Details of PPO" -> RL sub-skill, as the canonical proof that reference-impl details (not ideas) decide whether PPO works.

Trim the lucidrains item to one quote (it had ballooned). Add wassname credit + companion-gist link. All 20 footnotes resolve.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 01:00:14 +08:00 · 2026-06-02 20:59:36 +08:00
parent 9911ac83c5
commit ee4e9a5caa
6 changed files with 134 additions and 6 deletions
@@ -0,0 +1,25 @@
+# The 37 Implementation Details of Proximal Policy Optimization
+
+Authors: Huang, Shengyi; Dossa, Rousslan Fernand Julien; Raffin, Antonin; Kanervisto, Anssi; Wang, Weixun.
+Source: ICLR Blog Track, 2022-03-25 — https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
+Code: https://github.com/vwxyzjn/ppo-implementation-details ; CleanRL: https://github.com/vwxyzjn/cleanrl
+
+Excerpt cached for the ML-debugging skill (the full post is long; key framing passages below, verbatim).
+
+---
+
+> Instead of doing ablation studies and making recommendations on which details matter, this blog post takes a step back and focuses on reproductions of PPO's results in all accounts.
+
+> During our re-implementation, we have compiled an implementation checklist containing 37 details as follows. For each implementation detail, we display the permanent link to its code (which is not done in academic papers) and point out its literature connection.
+
+The 37 details break down as:
+- 13 core implementation details
+- 9 Atari-specific implementation details
+- 9 implementation details for robotics tasks (continuous action spaces)
+- 5 LSTM implementation details
+- 1 `MultiDiscrete` action-spaces implementation detail
+- (plus 4 situational details not used in the official implementation)
+
+> Our ultimate purpose is to help people understand the PPO implementation through and through, reproduce past results with high fidelity, and facilitate customization for new research.
+
+Context: the official PPO implementation (`openai/baselines`, `ppo2`) has undergone several refactorings, so "it is important to recognize *which version* of the official implementation is worth studying." Libraries that match `ppo2`'s details closely (Stable-Baselines3, CleanRL) reproduce similar results; others report more diverse (worse) results.
@@ -0,0 +1,9 @@
+# The Neural Net Tank Legend — Gwern Branwen
+
+Source: https://gwern.net/tank . Cached excerpt for the ML-debugging skill (verbatim abstract passages).
+
+---
+
+> A cautionary tale in artificial intelligence tells about researchers training an neural network (NN) to detect tanks in photographs, succeeding, only to realize the photographs had been collected under specific conditions for tanks/non-tanks and the NN had learned something useless like time of day. This story is often told to warn about the limits of algorithms and importance of data collection to avoid "dataset bias"/"data leakage" where the collected data can be solved using algorithms that do not generalize to the true data distribution, but the tank story is usually never sourced.
+
+> I collate many extant versions dating back a quarter of a century to 1992 along with two NN-related anecdotes from the 1960s; their contradictions & details indicate a classic "urban legend", with a probable origin in a speculative question in the 1960s by Edward Fredkin at an AI conference about some early NN research, which was then classified & never followed up on.
@@ -0,0 +1,17 @@
+# Just Know Stuff (how to achieve success in an ML PhD) — Patrick Kidger
+
+Source: https://kidger.site/thoughts/just-know-stuff/ (2023-01-26). Cached excerpt from the "Software development" section, verbatim.
+
+---
+
+> Academic software is almost always a poorly-maintained kludge of leaky abstractions, awful formatting, and bugs that don't cripple things only because some other bug stops them from doing so.
+
+> This is a systemic professional failing. As an (applied) ML researcher, the overwhelming majority of your time will be spent in front of a screen, staring at code. And yet most of you (yes, you) would not pass muster as a junior developer.
+
+> So, how to improve? First of all, never accept the kludge.
+
+> You've messed up your Git repo? Figure out the commands to fix it... don't just delete it and clone from the remote.
+
+> Focus on writing clean code, based around orthogonal abstractions. When the code starts getting messy - and it will - be willing to refactor your code into something more legible. Avoid both spaghetti code and ravioli code.
+
+> When the documentation is inadequate, look at their source code.
@@ -0,0 +1,33 @@
+# Bad Labels — Vincent D. Warmerdam (koaning)
+
+Source: https://koaning.io/posts/labels/ (2021-09-02). Cached copy for the ML-debugging skill.
+
+---
+
+I write a lot of blogposts on why you need more than grid-search to properly judge a machine learning model. In this blogpost I want to demonstrate yet another reason; labels often seem to be wrong.
+
+What I'll describe here is also available as a course on calmcode.io.
+
+## Bit of Background
+
+It turns out that bad labels are a *huge* problem in many popular benchmark datasets. To get an impression of the scale of the issue, just go to labelerrors.com. It's an impressive project that shows problems with many popular datasets; CIFAR, MNIST, Amazon Reviews, IMDB, Quickdraw and Newsgroups just to name a few. It's part of a research paper (https://arxiv.org/abs/2103.14749) that tries to quantify how big of a problem these bad labels are.
+
+The issue here isn't just that we might have bad labels in our training set, the issue is that it appears in the validation set. If a machine learning model can become state of the art by squeezing another 0.5% out of a validation set one has to wonder. Are we really making a better model? Or are we creating a model that is better able to overfit on the bad labels?
+
+## Quick Trick
+
+Here's a quick trick that seems worthwhile. Let's say that we train a model that is very general. That means high bias, low variance. You may have a lower capacity model this way, but it will be less prone to overfit on details.
+
+After training such a model, it'd be interesting to see where the model disagrees with the training data. These would be valid candidates to check, but it might result in a list that's a bit too long for comfort. So to save time you can sort the data based on the `predict_proba()`-value. When the model gets it wrong, that's interesting, but when it *also* associates a very low confidence to the correct class, that's an example worth double checking.
+
+## What does this mean?
+
+The abstract of the [Northcutt et al.] paper certainly paints a clear picture of what this exercise means for state-of-the-art models:
+
+> We find that lower capacity models may be practically more useful than higher capacity models in real-world datasets with high proportions of erroneously labeled data. For example, on ImageNet with corrected labels: ResNet-18 outperforms ResNet-50 if the prevalence of originally mislabeled test examples increases by just 6%. On CIFAR-10 with corrected labels: VGG-11 outperforms VGG-19 if the prevalence of originally mislabeled test examples increases by 5%. Traditionally, ML practitioners choose which model to deploy based on test accuracy -- our findings advise caution here, proposing that judging models over correctly labeled test sets may be more useful, especially for noisy real-world datasets.
+
+## So what now?
+
+More people should check their labels more frequently. ... if you're looking for a simple place to start, check out the cleanlab project (https://github.com/cgnorthcutt/cleanlab). It's made by the same authors of the labelerrors-paper and is meant to help you find bad labels.
+
+For everyone; maybe we should spend less time tuning parameters and instead spend it trying to get a more meaningful dataset.
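
The "Quick Trick" in the cached Bad Labels excerpt (collect the rows where the model disagrees with the given label, then sort them by the confidence the model assigns to that label) might be sketched roughly as follows. This is an illustrative sketch only, not code from the post or from the cached file: the helper name `rank_label_suspects` is hypothetical, and the probability table is made-up data standing in for the output of a classifier's `predict_proba()`.

```python
from typing import List, Sequence, Tuple


def rank_label_suspects(
    proba: Sequence[Sequence[float]],
    labels: Sequence[int],
) -> List[Tuple[int, float]]:
    """Return (row_index, confidence_in_given_label) for every row whose
    top-scoring class differs from the given label, least confident first.

    Hypothetical helper: `proba` is assumed to look like predict_proba()
    output, one row per example and one column per class index.
    """
    suspects = []
    for i, (row, label) in enumerate(zip(proba, labels)):
        # Class the model considers most likely for this row.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted != label:
            # Keep the probability the model gave to the *given* label.
            suspects.append((i, row[label]))
    # Lowest confidence in the given label first: review these rows first.
    suspects.sort(key=lambda pair: pair[1])
    return suspects


# Made-up probabilities (assumption); columns are classes 0 and 1.
proba = [
    [0.90, 0.10],  # model agrees with label 0
    [0.45, 0.55],  # model disagrees with label 0, but only narrowly
    [0.02, 0.98],  # model disagrees and gives label 0 very low confidence
    [0.30, 0.70],  # model agrees with label 1
]
labels = [0, 0, 0, 1]

suspects = rank_label_suspects(proba, labels)
print(suspects)  # [(2, 0.02), (1, 0.45)]
```

Row 2 comes out ahead of row 1 because the model is both wrong about it and nearly certain the given label is wrong, which is the case the excerpt calls "worth double checking". In practice the table would come from a deliberately simple (high-bias) model, as the excerpt suggests, so that it is less likely to have memorized the bad labels.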