feat(ml_debug): add Karpathy recipe + nanochat evidence, update-ratio diagnostic

Add 3 new evidence files from modern open-source sources:
- karpathy_recipe_training_nn_2019.md: Karpathy's training recipe blog post
- nanochat_deepwiki_llm_pretraining_2026.md: 320+ HP sweeps for GPT-2-scale pretraining
- sanh_simple_considerations_hf_2021.md: HuggingFace NLP debugging notes

Add update-to-data ratio diagnostic to refs/diagnostics.md (target ~1e-3).
Add LLM pretraining gap note to SKILL.md intro linking the new sources.
Add tanh saturation % to logging checklist.
2026-06-27 01:00:14 +08:00 · 2026-03-10 05:32:37 +08:00
parent bbe3fe0985
commit ced4edc200
7 changed files with 309 additions and 21 deletions
@@ -1,5 +1,5 @@
 ---
-name: ml-debugging
+name: ml_debug
 description: "Wassname's practical folklore for debugging ML systems: convergence issues, loss surface analysis, gradient analysis, sweep methodology, and same-seed comparisons. Use when stuck on training, designing sweeps, or analyzing experiment results."
 ---

@@ -7,7 +7,11 @@ description: "Wassname's practical folklore for debugging ML systems: convergenc

 Practitioner knowledge that's hard to find in papers. Distilled from Schulman's "Nuts and Bolts" talk, Andy Jones' debugging guide, r/reinforcementlearning threads, competition write-ups, and personal experience. Most multi-source claims are traced to sourced quotes in [docs/ml_debug_folklore.argdown](docs/ml_debug_folklore.argdown) (vargdown format); uncovered claims are listed in the [process log](docs/ml_debug_folklore_log.md).

-The core problem: in ML (especially RL), errors aren't local [Goodfellow Ch11]. Information flows in loops, so a numerical bug in one spot gets smeared through the whole system in seconds. From outside, everything goes weird at once -- loss explodes, KL collapses, rewards oscillate. You can tell something's wrong but not *what* or *where* [Jones 2021].
+**Caveat:** Most sources are from 2017-2021, predating RLHF, large-scale pretraining, and JAX/PyTorch 2.0 workflows. Core debugging principles (isolation testing, logging, seed variance) are architecture-agnostic and likely durable. Specific RL HP defaults and reward-scaling advice may need updating for modern settings.
+
+**LLM pretraining gap:** For modern transformer pretraining debugging, see [Karpathy's recipe](https://karpathy.github.io/2019/04/25/recipe/) (2019; general training workflow, activation/gradient health checks) and [nanochat deepwiki](https://deepwiki.com/karpathy/nanochat) (2026; documents 320+ empirical HP sweeps for training a GPT-2-scale model from scratch, covering MFU monitoring, precision management, BOS-aligned dataloaders, and cross-scale ablation discipline). Evidence files: [karpathy_recipe_training_nn_2019.md](docs/evidence/karpathy_recipe_training_nn_2019.md), [nanochat_deepwiki_llm_pretraining_2026.md](docs/evidence/nanochat_deepwiki_llm_pretraining_2026.md).
+
+The core problem in RL (and to a lesser extent supervised ML): errors aren't local [Jones 2021]. In RL, information flows in a loop (actor -> learner -> actor), so a numerical bug in one spot gets smeared through the whole system in seconds. From outside, everything goes weird at once -- loss explodes, KL collapses, rewards oscillate. You can tell something's wrong but not *what* or *where*.

 **When debugging, work in this order:**
 1. Run static analysis (grep for silent bugs) -- Part 6.1
@@ -55,7 +59,8 @@ What to log:
 - Gradient norms (per module if possible)
 - Learning rates
 - Parameter norms / update magnitudes
-- Activation statistics (mean, std, fraction of dead ReLUs)
+- Update-to-data ratio per layer: `((lr * p.grad).std() / p.data.std()).log10()` -- target about -3, i.e. a ratio of ~1e-3 [Karpathy nn-zero-to-hero Lec 4]
+- Activation statistics (mean, std, fraction of dead ReLUs, saturation % for tanh)
 - Data statistics (input distributions, label distributions)
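
Note on the two new checklist items: the sketch below computes the update-to-data ratio and a tanh saturation percentage on plain Python lists, so it runs without PyTorch. The function names, the toy values, and the 0.97 saturation cutoff are illustrative assumptions, not part of this commit. In PyTorch the ratio is the one-liner quoted in the checklist.

```python
import math
import statistics


def log10_update_ratio(lr, grads, weights):
    """log10(std(lr * grad) / std(weight)) for one flattened parameter.

    Mirrors ((lr * p.grad).std() / p.data.std()).log10() from the checklist.
    statistics.stdev is the sample standard deviation (n - 1 denominator).
    """
    update_std = statistics.stdev([lr * g for g in grads])
    data_std = statistics.stdev(weights)
    return math.log10(update_std / data_std)


def saturation_pct(activations, threshold=0.97):
    """Percentage of tanh outputs with |a| > threshold.

    The 0.97 cutoff is an assumed default; pick what suits your model.
    """
    saturated = sum(1 for a in activations if abs(a) > threshold)
    return 100.0 * saturated / len(activations)


# Toy parameter: gradients are exactly 0.1 * weights, so with lr = 0.01
# each update is 1e-3 of the weight and the log10 ratio lands near -3.
weights = [0.5, -0.5, 1.0, -1.0]
grads = [0.05, -0.05, 0.1, -0.1]
ratio = log10_update_ratio(0.01, grads, weights)

# Toy tanh layer: two of the five pre-activations are far into the tails.
acts = [math.tanh(x) for x in [-5.0, -0.1, 0.0, 0.1, 5.0]]
sat = saturation_pct(acts)
```

A ratio well above -3 suggests updates are large relative to the weights, and one well below suggests the layer is barely learning. Logged per layer over training steps, it is more informative than a single snapshot.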

 **Sanity check at init** [CS231n]: verify you get the expected loss at chance performance before training starts. E.g., for 10-class softmax the initial loss should be -ln(0.1) = 2.302 with small random weights. If not, something is wrong with initialization or the loss function. Then verify that increasing regularization increases the loss.
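
Note on the init sanity check: a minimal sketch of the chance-loss comparison in plain Python. The helper names and the 10% relative tolerance are hypothetical choices for illustration. All-zero logits stand in for the output of a freshly initialized network with small weights.

```python
import math


def expected_chance_loss(num_classes):
    """Cross-entropy of a uniform prediction over num_classes: -ln(1 / C)."""
    return -math.log(1.0 / num_classes)


def softmax_cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for a single example."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def init_loss_looks_sane(observed, num_classes, rel_tol=0.1):
    """True if the observed initial loss is within rel_tol of chance loss.

    rel_tol=0.1 is an assumed tolerance, not a value from the source.
    """
    return math.isclose(observed, expected_chance_loss(num_classes), rel_tol=rel_tol)


# All-zero logits give a uniform softmax, i.e. chance performance.
initial_loss = softmax_cross_entropy([0.0] * 10, label=3)
ok = init_loss_looks_sane(initial_loss, num_classes=10)

# A loss far from -ln(0.1) points at a bad init or a broken loss function.
bad = init_loss_looks_sane(6.0, num_classes=10)
```

In a real run, `initial_loss` would be the mean loss over the first batch before any optimizer step.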