mirror of
https://github.com/wassname/ml_debug.git
synced 2026-06-27 01:00:14 +08:00
feat(ml_debug): add Karpathy recipe + nanochat evidence, update-ratio diagnostic
Add 3 new evidence files from modern open-source sources: - karpathy_recipe_training_nn_2019.md: Karpathy's training recipe blog post - nanochat_deepwiki_llm_pretraining_2026.md: 320+ HP sweeps for GPT-2-scale pretraining - sanh_simple_considerations_hf_2021.md: HuggingFace NLP debugging notes Add update-to-data ratio diagnostic to refs/diagnostics.md (target ~1e-3). Add LLM pretraining gap note to SKILL.md intro linking the new sources. Add tanh saturation % to logging checklist.
This commit is contained in:
@@ -1,5 +1,5 @@
|
||||
---
|
||||
name: ml-debugging
|
||||
name: ml_debug
|
||||
description: "Wassname's practical folklore for debugging ML systems: convergence issues, loss surface analysis, gradient analysis, sweep methodology, and same-seed comparisons. Use when stuck on training, designing sweeps, or analyzing experiment results."
|
||||
---
|
||||
|
||||
@@ -7,7 +7,11 @@ description: "Wassname's practical folklore for debugging ML systems: convergenc
|
||||
|
||||
Practitioner knowledge that's hard to find in papers. Distilled from Schulman's "Nuts and Bolts" talk, Andy Jones' debugging guide, r/reinforcementlearning threads, competition write-ups, and personal experience. Most multi-source claims are traced to sourced quotes in [docs/ml_debug_folklore.argdown](docs/ml_debug_folklore.argdown) (vargdown format); uncovered claims are listed in the [process log](docs/ml_debug_folklore_log.md).
|
||||
|
||||
The core problem: in ML (especially RL), errors aren't local [Goodfellow Ch11]. Information flows in loops, so a numerical bug in one spot gets smeared through the whole system in seconds. From outside, everything goes weird at once -- loss explodes, KL collapses, rewards oscillate. You can tell something's wrong but not *what* or *where* [Jones 2021].
|
||||
**Caveat:** Most sources are from 2017-2021, predating RLHF, large-scale pretraining, and JAX/PyTorch 2.0 workflows. Core debugging principles (isolation testing, logging, seed variance) are architecture-agnostic and likely durable. Specific RL HP defaults and reward-scaling advice may need updating for modern settings.
|
||||
|
||||
**LLM pretraining gap:** For modern transformer pretraining debugging, see [Karpathy's recipe](https://karpathy.github.io/2019/04/25/recipe/) (2019; general training workflow, activation/gradient health checks) and [nanochat deepwiki](https://deepwiki.com/karpathy/nanochat) (2026; documents 320+ empirical HP sweeps for training a GPT-2-scale model from scratch, covering MFU monitoring, precision management, BOS-aligned dataloaders, and cross-scale ablation discipline). Evidence files: [karpathy_recipe_training_nn_2019.md](docs/evidence/karpathy_recipe_training_nn_2019.md), [nanochat_deepwiki_llm_pretraining_2026.md](docs/evidence/nanochat_deepwiki_llm_pretraining_2026.md).
|
||||
|
||||
The core problem in RL (and to a lesser extent supervised ML): errors aren't local [Jones 2021]. In RL, information flows in a loop (actor -> learner -> actor), so a numerical bug in one spot gets smeared through the whole system in seconds. From outside, everything goes weird at once -- loss explodes, KL collapses, rewards oscillate. You can tell something's wrong but not *what* or *where*.
|
||||
|
||||
**When debugging, work in this order:**
|
||||
1. Run static analysis (grep for silent bugs) -- Part 6.1
|
||||
@@ -55,7 +59,8 @@ What to log:
|
||||
- Gradient norms (per module if possible)
|
||||
- Learning rates
|
||||
- Parameter norms / update magnitudes
|
||||
- Activation statistics (mean, std, fraction of dead ReLUs)
|
||||
- Update-to-data ratio per layer: `((lr * p.grad).std() / p.data.std()).log10()` -- target ~-3 [Karpathy nn-zero-to-hero Lec 4]
|
||||
- Activation statistics (mean, std, fraction of dead ReLUs, saturation % for tanh)
|
||||
- Data statistics (input distributions, label distributions)
|
||||
|
||||
**Sanity check at init** [CS231n]: verify you get the expected loss at chance performance before training starts. E.g., for 10-class softmax the initial loss should be -ln(0.1) = 2.302 with small random weights. If not, something is wrong with initialization or the loss function. Then verify that increasing regularization increases the loss.
|
||||
|
||||
Reference in New Issue
Block a user