feat(ml_debug): add Karpathy recipe + nanochat evidence, update-ratio diagnostic

Add 3 new evidence files from modern open-source sources:
- karpathy_recipe_training_nn_2019.md: Karpathy's training recipe blog post
- nanochat_deepwiki_llm_pretraining_2026.md: 320+ HP sweeps for GPT-2-scale pretraining
- sanh_simple_considerations_hf_2021.md: HuggingFace NLP debugging notes

Add update-to-data ratio diagnostic to refs/diagnostics.md (target ~1e-3).
Add LLM pretraining gap note to SKILL.md intro linking the new sources.
Add tanh saturation % to logging checklist.
wassname
2026-03-10 05:32:37 +08:00
parent bbe3fe0985
commit ced4edc200
7 changed files with 309 additions and 21 deletions
+5 -15
@@ -1,18 +1,6 @@
# ML Debugging Folklore
Deep research to uplift LLMs for ML debugging. Opinionated by source selection.
Distilled from Schulman's "Nuts and Bolts" talk, Andy Jones' debugging guide, Goodfellow Ch11, CS231n, FSDL, and more. Every non-obvious claim is traced to a verbatim source quote in [`docs/ml_debug_folklore.argdown`](docs/ml_debug_folklore.argdown) (vargdown format).
**Author**: [wassname](https://github.com/wassname)
## What's here
- **[SKILL.md](SKILL.md)** -- the main artifact. Designed to be loaded into an LLM agent's context as a debugging skill. Parts 1-5 are reference knowledge; Part 6 is a runnable triage protocol (grep patterns, diagnostic code snippets, decision tree); Part 7 is debugging mental models and practitioner priors.
- **[docs/ml_debug_folklore.argdown](docs/ml_debug_folklore.argdown)** -- vargdown source map. Traces each claim to an exact quote + file in `docs/evidence/`.
- **[docs/evidence/](docs/evidence/)** -- frozen local copies of source material (blog posts, talks, papers, reddit threads).
Practitioner knowledge for debugging ML systems, curated and synthesized by [wassname](https://github.com/wassname). Opinionated by source selection -- I picked sources I trust (Schulman, Goodfellow, CS231n, ...) and had an LLM extract the most relevant information for debugging ML systems.
## Use as a Claude skill
@@ -22,6 +10,8 @@ Distilled from Schulman's "Nuts and Bolts" talk, Andy Jones' debugging guide, Go
Or paste `SKILL.md` into your system prompt / context when debugging.
## Sources
## What's here
Schulman (2017), Jones (2021), Rahtz (2018), Goodfellow et al. (Deep Learning book), Karpathy (CS231n), Ng (CS229), FSDL, Henderson et al. (2018), McCandlish et al. (2018), Irpan (2018), Slavv (2017), and Reddit.
- **[SKILL.md](SKILL.md)** -- the main artifact. Load into an LLM agent's context as a debugging skill. Parts 1-5 are reference knowledge; Part 6 is a runnable triage protocol (grep patterns, diagnostic snippets, decision tree); Part 7 is debugging mental models and practitioner priors.
- **[docs/evidence/](docs/evidence/)** -- frozen local copies of source material (blog posts, talks, papers, reddit threads). Claims in SKILL.md link back to exact quotes here.
+8 -3
@@ -1,5 +1,5 @@
---
name: ml-debugging
name: ml_debug
description: "Wassname's practical folklore for debugging ML systems: convergence issues, loss surface analysis, gradient analysis, sweep methodology, and same-seed comparisons. Use when stuck on training, designing sweeps, or analyzing experiment results."
---
@@ -7,7 +7,11 @@ description: "Wassname's practical folklore for debugging ML systems: convergenc
Practitioner knowledge that's hard to find in papers. Distilled from Schulman's "Nuts and Bolts" talk, Andy Jones' debugging guide, r/reinforcementlearning threads, competition write-ups, and personal experience. Most multi-source claims are traced to sourced quotes in [docs/ml_debug_folklore.argdown](docs/ml_debug_folklore.argdown) (vargdown format); uncovered claims are listed in the [process log](docs/ml_debug_folklore_log.md).
The core problem: in ML (especially RL), errors aren't local [Goodfellow Ch11]. Information flows in loops, so a numerical bug in one spot gets smeared through the whole system in seconds. From outside, everything goes weird at once -- loss explodes, KL collapses, rewards oscillate. You can tell something's wrong but not *what* or *where* [Jones 2021].
**Caveat:** Most sources are from 2017-2021, predating RLHF, large-scale pretraining, and JAX/PyTorch 2.0 workflows. Core debugging principles (isolation testing, logging, seed variance) are architecture-agnostic and likely durable. Specific RL HP defaults and reward-scaling advice may need updating for modern settings.
**LLM pretraining gap:** For modern transformer pretraining debugging, see [Karpathy's recipe](https://karpathy.github.io/2019/04/25/recipe/) (2019; general training workflow, activation/gradient health checks) and [nanochat deepwiki](https://deepwiki.com/karpathy/nanochat) (2026; documents 320+ empirical HP sweeps for training a GPT-2-scale model from scratch, covering MFU monitoring, precision management, BOS-aligned dataloaders, and cross-scale ablation discipline). Evidence files: [karpathy_recipe_training_nn_2019.md](docs/evidence/karpathy_recipe_training_nn_2019.md), [nanochat_deepwiki_llm_pretraining_2026.md](docs/evidence/nanochat_deepwiki_llm_pretraining_2026.md).
The core problem in RL (and to a lesser extent supervised ML): errors aren't local [Jones 2021]. In RL, information flows in a loop (actor -> learner -> actor), so a numerical bug in one spot gets smeared through the whole system in seconds. From outside, everything goes weird at once -- loss explodes, KL collapses, rewards oscillate. You can tell something's wrong but not *what* or *where*.
**When debugging, work in this order:**
1. Run static analysis (grep for silent bugs) -- Part 6.1
@@ -55,7 +59,8 @@ What to log:
- Gradient norms (per module if possible)
- Learning rates
- Parameter norms / update magnitudes
- Activation statistics (mean, std, fraction of dead ReLUs)
- Update-to-data ratio per layer: `((lr * p.grad).std() / p.data.std()).log10()` -- target ~-3 [Karpathy nn-zero-to-hero Lec 4]
- Activation statistics (mean, std, fraction of dead ReLUs, saturation % for tanh)
- Data statistics (input distributions, label distributions)
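A minimal sketch of the activation statistics in the list above, assuming you can pull one layer's pre-activations out as a flat list of floats. The helper name `activation_stats` and the 0.97 saturation cutoff are illustrative assumptions, not values from the sources; the ReLU branch counts zero outputs in the given batch, which is only a proxy for units that are dead across the whole dataset.

```python
import math

def activation_stats(pre_acts, kind="tanh", sat_threshold=0.97):
    """Summarize one layer's activations from a flat list of pre-activations.

    sat_threshold is an assumed cutoff for calling a tanh unit saturated.
    """
    if kind == "tanh":
        acts = [math.tanh(x) for x in pre_acts]
        # Saturated units sit on the flat part of tanh and pass little gradient.
        flagged = sum(abs(a) > sat_threshold for a in acts)
    else:
        acts = [max(0.0, x) for x in pre_acts]
        # A ReLU output of exactly 0 passes no gradient for this input.
        flagged = sum(a == 0.0 for a in acts)
    n = len(acts)
    mean = sum(acts) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in acts) / n)
    return {"mean": mean, "std": std, "flagged_frac": flagged / n}
```

Logging `flagged_frac` per layer over time makes it easier to see whether saturation is growing or was present from initialization.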
**Sanity check at init** [CS231n]: verify you get the expected loss at chance performance before training starts. E.g., for 10-class softmax the initial loss should be -ln(0.1) = 2.302 with small random weights. If not, something is wrong with initialization or the loss function. Then verify that increasing regularization increases the loss.
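The chance-level check can be written down directly. This is a small sketch; the function names and the 10% relative tolerance are assumptions chosen for illustration.

```python
import math

def expected_init_loss(n_classes):
    """Cross-entropy of a uniform prediction over n_classes."""
    return -math.log(1.0 / n_classes)

def init_loss_ok(measured, n_classes, rel_tol=0.1):
    """True if the measured initial loss is within rel_tol of chance level.

    rel_tol is an assumed tolerance, not a value from the sources.
    """
    expected = expected_init_loss(n_classes)
    return abs(measured - expected) <= rel_tol * expected
```

For example, `init_loss_ok(first_batch_loss, 10)` can be asserted before the first optimizer step, where `first_batch_loss` is whatever your training script measures.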
@@ -0,0 +1,119 @@
# A Recipe for Training Neural Networks
**Source:** Andrej Karpathy blog post, April 25, 2019
**URL:** https://karpathy.github.io/2019/04/25/recipe/
**Author:** Andrej Karpathy (at Tesla at the time; previously Stanford and OpenAI)
---
## Core thesis: silent failure problem
> "The 'possible error surface' is large, logical (as opposed to syntactic), and very tricky to unit test... a 'fast and furious' approach to training neural networks does not work and only leads to suffering."
Examples of silent failures listed:
- Forgetting to flip labels during left-right augmentation (network learns to detect flipped images internally)
- Off-by-one bugs in autoregressive models
- Clipping loss instead of gradients
- Using wrong mean from pretrained checkpoint
- Misconfigured regularization / LR / decay
> "The qualities that in my experience correlate most strongly to success in deep learning are patience and attention to detail."
---
## Stage 1: Become one with the data
> "The first step to training a neural net is to not touch any neural net code at all and instead begin by thoroughly inspecting your data."
**Manual inspection:**
> "Scan through thousands of examples manually... understand distribution patterns. Look for: duplicates, corrupted images/labels, imbalances, biases. Pay attention to own classification process -- hints at needed architecture."
**Programmatic search for outliers:**
> "The outliers especially almost always uncover some bugs in data quality or preprocessing."
---
## Stage 2: End-to-end pipeline & dumb baselines
**Fix random seed:**
> "Always use a fixed random seed to guarantee that when you run the code twice you will get the same outcome. This removes a factor of variation and will help keep you sane."
**Disable complexity early:**
- Turn off data augmentation initially -- "it is just another opportunity to introduce some dumb bug"
**Verify loss at initialization:**
> "Verify that your loss starts at the correct loss value. E.g. if you initialize your final layer correctly you should measure -log(1/n_classes) on a softmax at initialization."
**Initialize final layer bias correctly:**
> "Regression with mean 50? Initialize bias to 50. Imbalanced dataset (1:10)? Set logit bias to predict 0.1 probability at init. Setting these correctly will speed up convergence and eliminate 'hockey stick' loss curves."
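For the imbalanced-classification case in the quote, the bias that makes a sigmoid output a chosen prior is the log-odds of that prior. A minimal sketch, assuming a single-logit binary classifier whose other contributions to the logit are near zero at initialization; the function names are illustrative.

```python
import math

def logit_bias_for_prior(p_positive):
    """Bias b such that sigmoid(b) == p_positive when the rest of the logit is ~0."""
    return math.log(p_positive / (1.0 - p_positive))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Target a 0.1 positive probability at init, as in the quoted example.
b = logit_bias_for_prior(0.1)
p_at_init = sigmoid(b)
```

The regression case is simpler: set the output bias to the target mean directly.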
**Overfit a single batch:**
> "Overfit a single batch of only a few examples (e.g. as little as two). To do so we increase the capacity of our model and verify that we can reach the lowest achievable loss (e.g. zero)... If they do not, there is a bug somewhere and we cannot continue to the next stage."
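The shape of this test can be seen on a toy problem. The sketch below is a stand-in, not a recipe for a real network: it fits a 1-D logistic regression to two hypothetical examples with plain gradient descent and records the loss, which should fall toward zero if the loss and gradient code are correct.

```python
import math

def overfit_two_examples(steps=2000, lr=0.5):
    """Fit a 1-D logistic regression to two separable examples; return the loss history."""
    xs, ys = [-1.0, 1.0], [0.0, 1.0]  # hypothetical toy batch
    w, b = 0.0, 0.0
    losses = []
    for _ in range(steps):
        gw = gb = loss = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            # Gradient of binary cross-entropy w.r.t. the logit is (p - y).
            gw += (p - y) * x
            gb += (p - y)
        losses.append(loss / len(xs))
        w -= lr * gw / len(xs)
        b -= lr * gb / len(xs)
    return losses

losses = overfit_two_examples()
```

In a real project the same check applies to the actual model and a fixed batch: if the loss refuses to approach its floor, look for a bug before tuning anything.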
**Visualize immediately before model input:**
> "The unambiguously correct place to visualize your data is immediately before your y_hat = model(x)... This is the only 'source of truth'. I can't count the number of times this has saved me and revealed problems in data preprocessing and augmentation."
**Visualize prediction dynamics:**
> "The 'dynamics' of how these predictions move will give you incredibly good intuition for how the training progresses. Many times it is possible to feel the network 'struggle' to fit your data if it wiggles too much in some way, revealing instabilities."
**Backprop-to-input dependency check:**
> "A relatively common bug I've come across... people use view instead of transpose/permute somewhere and inadvertently mix information across the batch dimension... set the loss to be something trivial like the sum of all outputs of example i, run the backward pass all the way to the input, and ensure that you get a non-zero gradient only on the i-th input."
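The quote describes a gradient-based check. The sketch below swaps in a perturbation-based analogue so it runs without an autograd library: nudge every example except `i` and see whether output `i` moves. The two toy models and all names are hypothetical; `buggy_model` imitates the kind of reshape mistake that regroups values across rows.

```python
def leaks_across_batch(model, batch, i=0, eps=1e-3):
    """Perturb every example except i and report whether output i changes."""
    base = model(batch)[i]
    perturbed = [row if k == i else [v + eps for v in row]
                 for k, row in enumerate(batch)]
    return abs(model(perturbed)[i] - base) > 1e-12

def good_model(batch):
    # Each output depends only on its own row.
    return [sum(row) for row in batch]

def buggy_model(batch):
    # Flattens the batch and regroups values with a stride, so each output
    # mixes values from several rows (a stand-in for a wrong reshape).
    flat = [v for row in batch for v in row]
    n = len(batch)
    return [sum(flat[k::n]) for k in range(n)]

batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
good_leaks = leaks_across_batch(good_model, batch)
buggy_leaks = leaks_across_batch(buggy_model, batch)
```

With an autograd framework, the version in the quote is preferable because it checks every input element in a single backward pass.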
**Input-independent baseline:**
> Train model with all inputs zeroed. "Does your model learn to extract any information out of the input at all? If not, something is wrong."
---
## Stage 3: Overfit
**Don't be a hero:**
> "I've seen a lot of people who are eager to get crazy and creative in stacking up the lego blocks of the neural net toolbox in various exotic architectures... Resist this temptation strongly in the early stages of your project. I always advise people to simply find the most related paper and copy paste their simplest architecture that achieves good performance."
**Adam as safe starting point:**
> "In the early stages of setting baselines I like to use Adam with a learning rate of 3e-4. In my experience Adam is much more forgiving to hyperparameters, including a bad learning rate."
> "For ConvNets a well-tuned SGD will almost always slightly outperform Adam, but the optimal learning rate region is much more narrow and problem-specific."
**Build complexity incrementally:**
> "If you have multiple signals to plug into your classifier I would advise that you plug them in one by one and every time ensure that you get a performance boost you'd expect. Don't throw the kitchen sink at your model at the start."
**Learning rate decay warning:**
> "If you are re-purposing code from some other domain always be very careful with learning rate decay... your code could secretly be driving your learning rate to zero too early, not allowing your model to converge."
> "In my own work I always disable learning rate decays entirely (I use a constant LR) and tune this all the way at the very end."
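One way to catch an inherited schedule that decays too early is to evaluate it offline over the planned run length. This sketch uses a generic step-decay rule and hypothetical numbers (a schedule that drops every 30 steps reused in a 1000-step run); none of the values come from the source.

```python
def step_decay_lr(base_lr, step, drop_every, gamma=0.1):
    """Step decay: multiply the LR by gamma every drop_every steps."""
    return base_lr * gamma ** (step // drop_every)

def first_step_below(schedule, total_steps, floor):
    """First step at which the schedule falls below floor, or None if it never does."""
    for step in range(total_steps):
        if schedule(step) < floor:
            return step
    return None

base_lr = 3e-4
total_steps = 1000
# Flag the point where the LR drops under 0.5% of its starting value.
hit = first_step_below(
    lambda s: step_decay_lr(base_lr, s, drop_every=30),
    total_steps,
    floor=0.005 * base_lr,
)
```

If `hit` lands in the first fraction of the run, the schedule is effectively freezing training for most of the budget.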
**First layer sanity check:**
> "To gain additional confidence that your network is a reasonable classifier, I like to visualize the network's first-layer weights and ensure you get nice edges that make sense. If your first layer filters look like noise then something could be off."
---
## Stage 4: Regularize
**Primary advice: get more real data**
> "It is a very common mistake to spend a lot engineering cycles trying to squeeze juice out of a small dataset when you could instead be collecting more data. As far as I'm aware adding more data is pretty much the only guaranteed way to monotonically improve the performance of a well-configured neural network almost indefinitely."
**Smaller batch size = more regularization (via batch norm):**
> "Due to the normalization inside batch norm smaller batch sizes somewhat correspond to stronger regularization. This is because the batch empirical mean/std are more approximate versions of the full mean/std so the scale & offset 'wiggles' your batch around more."
**Dropout + batchnorm warning:**
> "Use this [dropout] sparingly/carefully because dropout does not seem to play nice with batch normalization."
**Larger model + early stopping:**
> "I've found a few times in the past that larger models will of course overfit much more eventually, but their 'early stopped' performance can often be much better than that of smaller models."
---
## Stage 5: Hyperparameter tuning
**Random search over grid:**
> "It is best to use random search instead [of grid search]. Intuitively, this is because neural nets are often much more sensitive to some parameters than others. In the limit, if a parameter a matters but changing b has no effect then you'd rather sample a more thoroughly than at a few fixed points multiple times."
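A minimal random-search sampler, assuming three hyperparameters with placeholder ranges. Scale-like parameters are drawn log-uniformly, which is a common convention rather than something the quote specifies.

```python
import random

def sample_config(rng):
    """Draw one config; the ranges are hypothetical placeholders."""
    return {
        # Log-uniform: learning rate and weight decay vary on a multiplicative scale.
        "lr": 10 ** rng.uniform(-5, -2),
        "weight_decay": 10 ** rng.uniform(-6, -2),
        "dropout": rng.uniform(0.0, 0.5),
    }

rng = random.Random(0)  # fixed seed so the sweep itself is reproducible
configs = [sample_config(rng) for _ in range(20)]
```

Twenty random draws probe twenty distinct learning rates, whereas a grid with the same budget would revisit only a handful of values per parameter.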
---
## Stage 6: Squeeze performance
**Don't stop early:**
> "I've often seen people tempted to stop the model training when the validation loss seems to be leveling off. In my experience networks keep training for unintuitively long time. One time I accidentally left a model training during the winter break and when I got back in January it was SOTA."
**Ensembles:**
> "Model ensembles are a pretty much guaranteed way to gain 2% of accuracy on anything."
@@ -0,0 +1,91 @@
# nanochat: LLM Pretraining Engineering Notes
**Source:** deepwiki.com/karpathy/nanochat (AI-generated wiki from karpathy/nanochat repo)
**URLs:** https://deepwiki.com/karpathy/nanochat, https://github.com/karpathy/nanochat
**Date accessed:** 2026-03
**Context:** nanochat is Karpathy's 2026 open-source minimal LLM speedrun (GPT-2 level in ~2.5h on 8xH100, ~3500 lines, ~$48).
**Caveat:** The deepwiki page is AI-generated from source code; treat as secondary documentation, not direct quotes.
---
## Design principle: explicit over implicit
> Explicit over implicit: No `torch.amp.autocast` magic; precision managed via `COMPUTE_DTYPE` global
Auto-detected at runtime: bfloat16 on SM 80+ (A100/H100), float32 on older GPUs.
**Debugging application:** Override globally for numerical stability debugging:
```bash
NANOCHAT_DTYPE=float32 python -m scripts.chat_cli -p "hello"
```
Avoids hunting through scattered `with autocast():` blocks when debugging NaN/Inf.
---
## Monitoring: MFU target
> When performance is unexpectedly low: Check `train/mfu` (Model FLOPs Utilization) should be >40%
MFU <40% suggests: GPU memory underutilized (batch size too small), I/O bottleneck (data loading slower than compute), or excessive distributed-training synchronization overhead.
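For a rough sense of where a run sits relative to the 40% mark, MFU can be estimated from throughput. This sketch assumes the common approximation of about 6 FLOPs per parameter per token for a forward and backward pass, and it ignores attention FLOPs; it is not how nanochat computes `train/mfu`, and all numbers are hypothetical.

```python
def estimate_mfu(n_params, tokens_per_sec, peak_flops_per_gpu, n_gpus=1):
    """Rough MFU: achieved training FLOPs per second over aggregate hardware peak.

    Uses the ~6 * N FLOPs-per-token approximation (an assumption); replace the
    peak figure with the datasheet value for your GPU and dtype.
    """
    achieved = 6.0 * n_params * tokens_per_sec
    return achieved / (peak_flops_per_gpu * n_gpus)

# Hypothetical numbers for illustration only.
mfu = estimate_mfu(n_params=1e9, tokens_per_sec=5e5,
                   peak_flops_per_gpu=1e15, n_gpus=8)
needs_investigation = mfu < 0.40
```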
---
## Data pipeline: BOS-aligned dataloader
> BOS-aligned best-fit dataloader ensuring every sequence starts with document boundary
Sequences must start at document boundaries (BOS token), not mid-document. Prevents loss spikes from predicting the start of an unrelated document as if it were a continuation.
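The idea can be sketched as a small packing routine. This is an illustrative approximation under stated assumptions (BOS id 0, documents already tokenized, documents that do not fit are cropped), not nanochat's actual dataloader.

```python
BOS = 0  # hypothetical BOS token id

def pack_bos_aligned(docs, seq_len):
    """Pack token lists into rows of exactly seq_len tokens.

    Every document is prefixed with BOS and rows only ever start with a fresh
    document, so no row begins mid-document. A document that does not fit in
    the remaining space is cropped (its tail is dropped); a final row that
    cannot be filled is discarded.
    """
    pool = [[BOS] + list(d) for d in docs]
    rows = []
    while pool:
        row = []
        while len(row) < seq_len and pool:
            space = seq_len - len(row)
            fitting = [d for d in pool if len(d) <= space]
            # Best fit: the longest document that still fits; otherwise crop one.
            doc = max(fitting, key=len) if fitting else pool[0]
            pool.remove(doc)
            row.extend(doc[:space])
        if len(row) == seq_len:
            rows.append(row)
    return rows

rows = pack_bos_aligned([[5, 6, 7], [8], [9, 10, 11, 12, 13, 14], [15, 16]], seq_len=6)
```

The trade-off in this sketch is that cropping discards tokens; a loader that instead splits documents across rows wastes nothing but reintroduces mid-document starts.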
---
## Systematic HP development: 320+ sweeps
> The dev/LOG.md experiment log documents 320+ hyperparameter sweeps and design decisions made since January 2026.
**Principled generalization criterion:** Changes must work across model depths (d8 to d50+), not just the target size. Improvements that only help at one scale are artifacts, not general algorithmic improvements.
---
## Multi-axis improvement validation
When implementing an optimization, validate across:
1. Loss per training step (convergence speed)
2. Loss per wall-clock time (helps despite potentially slower per-step?)
3. Loss per FLOP (better hardware utilization vs. better algorithm?)
Prevents optimizations that appear good on one metric but regress on others.
---
## Scaling laws (empirical, from 320+ sweeps)
- Batch size: `B ∝ D^0.383` where D = target training tokens (sublinear, not linear scaling)
- Learning rate: per-component scaling with `√(768/n_embd)` factors
- Weight decay: `WD ∝ 1/width²`
Credence ~60-65%: stated as empirical, derivation not provided.
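The three relations above can be applied by anchoring each at a reference configuration. The helper names and any reference values passed in are assumptions; only the exponents and the 768 reference width come from the list.

```python
import math

def scaled_batch_tokens(ref_batch, ref_tokens, target_tokens, exponent=0.383):
    """B ∝ D^0.383, anchored at a reference (ref_batch, ref_tokens) pair."""
    return ref_batch * (target_tokens / ref_tokens) ** exponent

def lr_width_factor(n_embd, ref_width=768):
    """sqrt(768 / n_embd) multiplier applied to a reference learning rate."""
    return math.sqrt(ref_width / n_embd)

def scaled_weight_decay(ref_wd, ref_width, width):
    """WD ∝ 1 / width^2, anchored at a reference width."""
    return ref_wd * (ref_width / width) ** 2
```

Under these relations, 10x more training tokens raises the batch size by only about 2.4x, and doubling the width quarters the weight decay.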
---
## OOM debugging: reduce device batch, keep effective batch
> Reducing device-batch-size from 32 to 16 triggers 2× gradient accumulation
Gradient accumulation maintains the effective batch size, so OOM errors are often solvable without changing the training recipe.
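The arithmetic behind the quoted 2x figure, as a small sketch. The function name and the example total batch of 256 sequences on 8 GPUs are hypothetical.

```python
def grad_accum_steps(total_batch, device_batch, n_gpus):
    """Micro-steps per optimizer step needed to keep total_batch fixed."""
    per_step = device_batch * n_gpus
    if total_batch % per_step != 0:
        raise ValueError("total_batch must be divisible by device_batch * n_gpus")
    return total_batch // per_step

steps_at_32 = grad_accum_steps(total_batch=256, device_batch=32, n_gpus=8)
steps_at_16 = grad_accum_steps(total_batch=256, device_batch=16, n_gpus=8)
```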
---
## FP8 caveat
FP8 only works on Hopper architecture (H100). Remove `--fp8` on A100 or older.
---
## Key gap this fills
The existing ml_debug skill sources (2017-2021) predate modern LLM pretraining at scale. nanochat is one of the few open-source codebases that publicly documents the empirical decisions behind training a transformer from scratch in 2026, including 320+ sweep results. It covers:
- Loss spike prevention (BOS alignment)
- Distributed training OOM (gradient accumulation)
- Precision management (explicit dtype, FP8 caveat)
- MFU monitoring
- Cross-scale generalization testing
@@ -0,0 +1,61 @@
# Simple Considerations for Simple People Building Fancy Neural Networks
**Source:** Victor Sanh, Hugging Face Blog, February 25, 2021
**URL:** https://huggingface.co/blog/simple-considerations
**Author:** Victor Sanh (Hugging Face research scientist, author of DistilBERT)
---
## Core practices (overlaps heavily with Karpathy 2019 recipe)
**Data first:**
> "the very first step of building a neural network is to put aside machine learning and simply focus on your data"
**Overfit test:**
> "it is a good habit when you think you have finished implementing to overfit a small batch of examples (16 for instance). If your implementation is (nearly) correct, your model will be able to overfit and remember these examples by displaying a 0-loss (make sure you remove any form of regularization such as weight decay)."
**Baselines:**
> "Start as simple as possible to get a sense of the difficulty of your task and how well standard baselines would perform."
> "it is sometimes hard to understand if your performance comes from a bug in your model/code or is simply limited by your model's expressiveness"
---
## NLP-specific: tokenization warning
> "when you work with language, have a serious look at the outputs of the tokenizers. I can't count the number of lost hours I spent trying to reproduce results (and sometimes my own old results) because something went wrong with the tokenization."
---
## Common implementation errors listed
- Wrong indexing ("really the worst")
- Forgetting `model.eval()` or `model.zero_grad()`
- Preprocessing errors
- Loss receiving wrong argument type (probabilities vs. logits)
- Constant initialization of a whole matrix (fails to break symmetry)
- Parameters not called in forward pass (no gradients)
- Learning rate stuck at 0
- Suboptimal input truncation
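The probabilities-vs-logits mistake in the list above is easy to reproduce numerically. This sketch implements cross-entropy from logits by hand and then feeds it softmax outputs, as happens when a model ends in a softmax layer and the loss applies softmax again; the function names are illustrative.

```python
import math

def cross_entropy_from_logits(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [8.0, 0.0, 0.0]  # confident and correct for class 0
correct = cross_entropy_from_logits(logits, 0)
# Bug: softmax applied first, then passed to a loss that applies softmax again.
buggy = cross_entropy_from_logits(softmax(logits), 0)
```

With three classes the double-softmax loss cannot drop below log(e + 2) - 1, roughly 0.55, so training appears to plateau even when predictions are correct.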
---
## HP tuning advice
> "there is no point of launching 1000 runs with different hyperparameters: compare a couple of runs with different hyperparameters to get an idea of which hyperparameters have the highest impact"
> "random over a reasonably manually defined grid search is still a tough-to-beat baseline" [re: Bayesian vs random search]
---
## Embeddings freezing (NLP, pre-trained LM fine-tuning)
> "in my experience working with pre-trained language models, freezing the embeddings modules to their pre-trained values doesn't affect much the fine-tuning task performance while considerably speeding up the training."
Credence ~65-70% -- specific domain claim, lacks ablation study reference.
---
## External links from this post
- "Checklist for debugging neural networks" -- Cecelia Shao (Towards Data Science)
- "A recipe for Training Neural Networks" -- Karpathy
+3 -3
@@ -69,9 +69,9 @@ From NeuralPDE.jl tests/docs + Wang et al. 2021:
- **Init**: Glorot uniform (Xavier), zero biases. Standard.
**Modified MLP** (Wang et al. 2021, credence ~70%):
> Wang et al. propose a modified MLP with multiplicative interactions: `σ(Wz + b) * U + (1 - σ(Wz + b)) * V` where U, V are linear projections of the input. Authors claim this reduces Hessian stiffness. "Greatly enhances predictive accuracy" for nonlinear PDEs.
> Wang et al. propose a modified MLP with multiplicative interactions: `σ(Wz + b) * U + (1 - σ(Wz + b)) * V` where U, V are linear projections of the input. Authors claim this reduces Hessian stiffness.
> Source: https://arxiv.org/abs/2001.04536, Section 2.6
> Evidence: 49x improvement on Helmholtz, 64x on Klein-Gordon. But only tested by the proposing authors; no independent replication found.
> Evidence: 49x improvement on Helmholtz, 64x on Klein-Gordon. Only tested by the proposing authors; no independent replication found.
**Random Weight Factorization (RWF)** (arXiv 2210.01274, credence ~60%):
> Factorize each neuron's weight vector as w = s * w_unit, where s is a trainable scalar and w_unit is the unit-normalized direction. This changes the optimization geometry so the loss surface has better-conditioned local minima. "Predictions obtained by RWF are in excellent agreement with ground truth, while other weight parameterizations result in poor or non-physical approximations."
@@ -79,7 +79,7 @@ From NeuralPDE.jl tests/docs + Wang et al. 2021:
> Used in the PirateNet architecture alongside causal training, sequence-to-sequence, and Fourier features. Simple to implement as a custom parameterization on Linear layers.
> Credence: plausible mechanism, but proposing-author result; check jaxpi repo for independent adoption.
**PirateNet** (jaxpi library, credence ~55%): Bundles RWF + causal time-marching + seq2seq + Fourier features into one architecture. Good reference implementation when you want all the tricks.
**PirateNet** (jaxpi library, credence ~55%): Bundles RWF + causal time-marching + seq2seq + Fourier features into one architecture. Reference implementation if you want all the tricks together.
> Source: https://github.com/PredictiveIntelligenceLab/jaxpi
**Symmetry-enforcing architectures** (Julia Ling et al., credence ~75%):
+22
@@ -159,6 +159,28 @@ for conf, pred, true, idx in errors[:10]:
# Inspect the actual inputs for these indices. Pattern = systematic bug.
```
**Update-to-data ratio check** [Karpathy nn-zero-to-hero Lec 4]
```python
import torch
import matplotlib.pyplot as plt

# Track during training: how large are updates relative to parameter magnitudes?
# Target: ~1e-3 (log10 ~ -3). Much higher = LR too large. Much lower = LR too small.
# Note: lr * grad equals the actual update only for plain SGD; with Adam or
# momentum, treat it as a rough proxy.
# `lr` (current learning rate) and `model` come from your training script.
ud = []

# Inside training loop (after optimizer.step(), before optimizer.zero_grad()):
with torch.no_grad():
    ud.append({
        name: ((lr * p.grad).std() / p.data.std()).log10().item()
        for name, p in model.named_parameters()
        if p.grad is not None and p.ndim >= 2  # weight matrices only; skip biases/gains
    })

# After training, plot per-layer ratios:
for name in ud[0]:
    plt.plot([d[name] for d in ud], label=name)
plt.axhline(-3, color='k', linestyle='--')  # target ratio
plt.legend(); plt.ylabel('log10(update/param ratio)'); plt.show()
# If a layer's ratio is much above -3: reduce LR or add gradient clipping.
# If much below -3: that layer is barely updating -- possible dead/frozen layer.
```
**Weight/bias distribution check** [Slavv, CS231n]
```python
for name, p in model.named_parameters():