diff --git a/README.md b/README.md index 39878c0..99f1248 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,6 @@ -# ML Debugging Folklore +# wassname's ML Debugging Folklore + +In an attempt to upskill the ML debugging on AI coding assistants (and humans), I've collected high quality sources on ML debugging and the mindset and the "taste". When I started ML I went searching for discussions on best practices, and started a few discussions of my own and they helped me a lot, I hope they can help others. This intro is human written, and the below is AI written with human guidance. Practitioner knowledge for debugging ML systems, curated and synthesized by [wassname](https://github.com/wassname). Opinionated by source selection -- I picked sources I trust (Schulman, Goodfellow, CS231n, ...) and had an LLM extract the most relevant information for debugging ML systems. @@ -12,6 +14,6 @@ Or paste `SKILL.md` into your system prompt / context when debugging. ## What's here -- **[SKILL.md](SKILL.md)** -- the main artifact. Load into an LLM agent's context as a debugging skill. Parts 1-5 are reference knowledge; Part 6 is a runnable triage protocol (grep patterns, diagnostic snippets, decision tree); Part 7 is debugging mental models and practitioner priors. +- **[SKILL.md](SKILL.md)** -- the main artifact. Load into an LLM agent's context as a debugging skill. Leads with the mindset (calibrate, mental models, general debugging tricks, and reading a working implementation when stuck), then a folklore section of sourced quotes, then an LLM-agent playbook (debugging loop, triage menu, anti-patterns). Deeper one-off tricks (loss-surface analysis, stuck-metric diagnosis, sweep reliability) live in [refs/](refs/). - **[docs/evidence/](docs/evidence/)** -- frozen local copies of source material (blog posts, talks, papers, reddit threads). Claims in SKILL.md link back to exact quotes here. 
diff --git a/SKILL.md b/SKILL.md index 100d2e7..2cb3f64 100644 --- a/SKILL.md +++ b/SKILL.md @@ -1,49 +1,266 @@ --- name: ml-debug -description: "Wassname's practical folklore for debugging ML systems: convergence issues, loss surface analysis, gradient analysis, sweep methodology, and same-seed comparisons. Use when stuck on training, designing sweeps, or analyzing experiment results." +description: "Wassname's practical folklore for debugging ML systems: convergence issues, gradient pathologies, stuck metrics, sweep reliability, and same-seed comparisons. Use when stuck on training, designing sweeps, or analyzing experiment results." --- -# ML Debugging Folklore +# wassname's ML Debugging Folklore -Practitioner knowledge that's hard to find in papers, distilled from Schulman's "Nuts and Bolts" talk, Andy Jones' debugging guide, r/reinforcementlearning threads, competition write-ups, and personal experience. +In an attempt to upskill the ML debugging on AI coding assistants (and humans), I've collected high quality sources on ML debugging and the mindset and the "taste". When I started ML I went searching for discussions on best practices, and started a few discussions of my own and they helped me a lot, I hope they can help others. This intro is human written, and the below is AI written with human guidance. -The core problem: errors aren't local [Jones 2021]. Especially in RL, information flows in a loop (actor -> learner -> actor), so a numerical bug in one spot gets smeared through the whole system in seconds. From outside everything goes weird at once (loss explodes, KL collapses, rewards oscillate) and you can tell something's wrong but not *what* or *where*. That's why the discipline below leads with calibration and clue-collection before any fix. +Practitioner knowledge for debugging ML systems that's hard to find in papers: convergence stalls, NaNs, gradient pathologies, stuck metrics, sweeps you can trust. 
It's *folklore*, so the quotes are the content; the people who learned these lessons the hard way say it better than a paraphrase can. Full sources and links are collected at the end. + +The core problem: errors aren't local. Especially in RL, information flows in a loop (actor -> learner -> actor), so as Jones puts it, "a numerical error in one spot gets smeared throughout the system in seconds, poisoning everything"[^jones]. Most errors then show up as *all* your metrics going weird at once (loss exploding, KL collapsing, rewards oscillating), so you can tell something's wrong but not *what* or *where*. That's why the discipline here leads with calibration and clue-collection before any fix. ## Before you debug: calibrate -If you're an LLM agent reading this, the first thing to calibrate is your own behaviour. ML research code is often outside your training distribution: novel losses, custom architectures, methods that don't have a canonical "right answer" you've seen a thousand times. Your trained reflex there is to be confident and fast: pattern-match a symptom to a fix ("loss stuck -> drop the LR") and apply it. Here that reflex works against you. It commits to one hypothesis before you've looked, and a wrong fix on possibly-buggy code wastes a run *and* corrupts your evidence about what's actually happening. +The first thing to calibrate is your own behaviour, especially if you're an LLM agent. ML research code is often outside the training distribution: novel losses, custom architectures, methods with no canonical "right answer" you've seen a thousand times. The trained reflex there is to be confident and fast, to pattern-match a symptom to a fix ("loss stuck -> drop the LR") and apply it. Here that reflex works against you. It commits to one hypothesis before you've looked, and a wrong fix on possibly-buggy code wastes a run *and* corrupts your evidence about what's actually happening. -So slow down and widen out. 
The discipline below is a set of habits for staying calibrated and keeping your hypothesis space open until the evidence closes it. The habits transfer across timeseries, GANs, OCR, RL, PINNs, puzzles; the specific fixes in the tables below are local to their setting, so treat those tables as a menu of hypotheses to widen your search. +So slow down and widen out. Most of this skill is a set of habits for staying calibrated and keeping your hypothesis space open until the evidence closes it. The habits transfer across timeseries, GANs, OCR, RL, PINNs, puzzles; the specific fixes in the tables below are local to their setting, so treat those tables as a menu of hypotheses to widen your search, not a lookup-and-apply. -## The debugging loop (use judgment) +## Mental models + +How to *think* when generating hypotheses or deciding what to investigate next. Pick the lens that fits; each gives a different angle on the same problem. + +**1. Information flow: trace forward, trace backward.** Data flows forward through the model, gradients flow backward. A bug anywhere in either direction corrupts everything downstream. Manually trace shapes and values forward from input through each layer, then trace gradients backward from loss through each parameter. The break-point is where values first go wrong. +- Forward: input -> preprocess -> embed -> layers -> head -> loss +- Backward: loss -> d(loss)/d(head) -> d(head)/d(layers) -> ... -> d(layer1)/d(params) + +**2. Ablation: remove things until it works.**[^cs229] Systematically remove components (regularization, augmentation, auxiliary losses, fancy layers). If removing X fixes it, X was the problem. If nothing helps, the bug is in the core (data or main loss). Start by turning off ALL regularization/augmentation/dropout/scheduling; if it works, add back one at a time until it breaks. + +**3. 
Oracle substitution: replace each component with ground truth.**[^cs229] For pipeline systems (data -> features -> model -> postprocess -> metric), swap one component for a perfect version. The component whose oracle gives the biggest jump is the bottleneck. Replace model predictions with ground-truth labels and the metric barely moves? The model's fine; the problem is upstream (data) or downstream (metric). + +**4. Bias-variance via learning curves.**[^cs229][^fsdl] Plot train and val error vs dataset size (or steps). Both high and converging together = high bias (too simple, wrong features, or a capacity-reducing bug). Train low, val high = high variance (overfitting). Val flat even with 10x more data = not a data problem, fix the model. + +**5. Structural ceiling: can the parameterization even express what you want?** Sometimes a metric is stuck not because the optimizer fails but because the architecture literally cannot represent the target. Quick check: disable the loss term entirely; if the metric reaches the same value, the loss never moved it. Worked example in [refs/metric_stuck.md](refs/metric_stuck.md). + +### Practitioner priors: what's usually wrong + +With no other information, investigate in this order. Rough consensus from the folklore sources, not measured frequencies, and only a starting weight (a clue that points elsewhere overrides them outright): + +1. **Data pipeline** (~40%). Wrong preprocessing, labels misaligned with inputs, missing/wrong normalization, train/test leakage, a loader returning stale batches. It really is usually the data.[^slavv][^fsdl] +2. **Loss function** (~20%). Wrong loss for the task, wrong sign, double softmax, loss disconnected from the metric, competing losses canceling. +3. **Training procedure** (~15%). Wrong optimizer step order, missing `zero_grad`, frozen params, in-place ops breaking autograd. +4. **Architecture** (~10%). Too small to express it, too deep without skips, wrong activation. +5. 
**Hyperparameters** (~5%). LR, batch size, weight decay. Almost never the real problem if the code is buggy. +6. **Numerical** (~5%). NaN, overflow, underflow, usually a symptom of one of the above. +7. **Environment** (~5%). Library version, GPU memory, nondeterminism, stale cache. + +For RL, add reward scale/sign as a top-3 issue, and episode-boundary handling (done signals, discounting across resets). + +### When to suspect the data + +| Signal | Likely meaning | Check | +|--------|----------------|-------| +| Init loss << expected (e.g. 0.01 vs 2.3) | Leakage or a shortcut: the model "knows" the answer at init | Are labels in the input? Is test data in train? A trivial feature? | +| Random input gives the same loss as real input | Pipeline is destroying information (over-aggressive preprocessing, wrong transforms, all-zero input) | Print raw data at each stage; visualize | +| Predicts the same class for everything | Class imbalance (100:1 -> "always predict majority") | Label-count check; weighted loss or resample | +| Val much worse than train from the start | Distribution shift between splits | Same preprocessing? Same time period? Same source? | +| Learning curve flat even with 10x data | NOT data: high bias | Add capacity, fix features, check for capacity-reducing bugs | +| Adding data makes val worse | Data-quality issue: new data noisier or off-distribution | Inspect recent additions, check label quality | +| Works on MNIST/CIFAR but not your set | Your data is the problem | Simplify your data (fewer classes, clean labels), scale up gradually[^slavv] | + +--- + +## Part 1: General ML debugging + +A catalog of small, well-worn checks, in rough dependency order (each assumes the one before). Pull from it; don't run it end-to-end as a ritual. + +**Step 1: Verify components in isolation.**[^goodfellow][^cs229] Most bugs are "doing the wrong calculation." Test each piece independently. +- Forward pass: feed known inputs, check output shapes and ranges. 
`assert` shapes everywhere, since `(None,)` vs `(None, 1)` silently broadcasts into `(None, None)`. +- Loss: hand-compute a few targets and compare to code output. +- Data pipeline: sample a batch, print it, eyeball it. Are labels aligned with inputs? Transforms applied correctly? +- Preprocessing: look at processed inputs as a human. Can *you* solve the task from them? + +**Five most common deep-learning bugs**[^fsdl]: (1) tensor shapes that fail silently via broadcasting, (2) preprocessing inputs incorrectly (wrong normalization, over-augmentation), (3) wrong loss function or wrong sign, (4) forgetting train vs eval mode (dropout/batchnorm differ), (5) numerical instability (NaN from log(0), overflow, vanishing grads). + +**Step 2: Get signs of life on a toy problem, and overfit one batch.**[^cs231n][^fsdl] Before the real task, solve something trivial with the same codebase so you know what "healthy" looks like. Then overfit a tiny batch (see folklore: "overfit one batch first"). Start with a lightweight implementation (<200 lines of new code), no fancy data pipeline; build that later once the core works. + +**Baseline ladder** (for physics/simulation models, each step must beat the previous): +1. Persistence: y(t) = y(t-1). Does the model capture *any* dynamics? +2. Exponential decay to steady state (first-order fit). +3. Linear state-space / OLS on finite differences. +4. Pure-data MLP (same architecture, no physics). If a PINN can't beat this, the physics constraint is hurting. +5. Classical solver, fixed parameters (scipy `solve_bvp`, ODE). +6. Classical solver, fitted parameters. +7. Then and only then: PINN / learned physics. + +Make complexity pay rent: every added component (physics, dimensions, losses) should improve a metric you care about, or come out. 
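The first rungs of the baseline ladder above are cheap enough to script before any model exists. The following is a minimal sketch in plain Python, not part of the original skill: the toy series, the helper names (`mse`, `persistence_forecast`, `beats_baseline`), and the stand-in "model" are all hypothetical and only illustrate the comparison.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    assert len(pred) == len(target) and len(pred) > 0
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def persistence_forecast(series):
    """Rung 1 of the ladder: predict y(t) = y(t-1) for every t >= 1."""
    return series[:-1]


def beats_baseline(model_mse, baseline_mse, margin=0.0):
    """True if the model improves on the baseline by at least `margin` (a fraction)."""
    return model_mse < baseline_mse * (1.0 - margin)


# Made-up data: exponential decay toward a steady state of 1.0.
series = [1.0 + 2.0 * math.exp(-0.3 * t) for t in range(20)]
targets = series[1:]

baseline_mse = mse(persistence_forecast(series), targets)

# Stand-in for a trained model (assumption): a one-step predictor that
# already knows the decay rate, i.e. rung 2 of the ladder fitted perfectly.
decay = math.exp(-0.3)
model_preds = [1.0 + (y - 1.0) * decay for y in series[:-1]]
model_mse = mse(model_preds, targets)

print(f"persistence MSE={baseline_mse:.4f}  model MSE={model_mse:.2e}")
print("model pays rent:", beats_baseline(model_mse, baseline_mse, margin=0.05))
```

Swapping `model_preds` for real predictions at each rung gives a one-line answer to whether the added complexity earned its place; the 5% margin is an arbitrary illustrative threshold.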
+ +**Step 3: Log everything, then look for specific pathologies.**[^goodfellow][^rahtz][^cs231n] Log train+val loss (per-component if multi-objective), gradient norms per module, learning rate, parameter-update magnitudes, the update-to-data ratio per layer (`((lr * p.grad).std() / p.data.std()).log10()`, target ~-3), activation stats (mean, std, dead-ReLU fraction, tanh saturation), and input/label distributions. + +**Sanity-check the loss at init**[^cs231n]: verify chance-level loss before training. For 10-class softmax the initial loss should be `-ln(0.1) = 2.302` with small random weights. Wrong init loss means a bad initialization or a broken loss. Then check that increasing regularization increases the loss. + +| Symptom | Likely cause | +|---|---| +| Loss stuck from the start | LR too low, bad init, data pipeline broken, wrong loss function | +| Loss decreases then explodes | LR too high, numerical instability (log(0), div by 0), gradient-accumulation bug | +| Loss NaN | log(0), 0/0, overflow. Use `log(x.clamp(min=1e-8))`, `1/(std + 1e-5)` | +| Train loss good, val loss bad | Overfitting. More data, regularization, smaller model | +| Loss oscillates wildly | LR too high, batch too small, data shuffling broken | +| Gradients vanish | Too-deep net without skips, saturating activations, bad init | +| Gradients explode | No gradient clipping, LR too high, RNN without clipping | +| Different results per seed | Normal if small; suspicious if large. 
Check init sensitivity, batch order, nondeterminism | +| Model outputs constant | Dead neurons, vanishing gradients, mode collapse, all-zero init | +| Physics loss low but BCs violated | Gradient imbalance: PDE residual dominates the BC gradient; adaptive weighting or hard BCs | +| PINN worse than pure-data MLP | Wrong equations, bad scaling (forgot to nondimensionalize), or physics fighting the data | +| Scalar parameter stuck at 0 or a bound | Degenerate solution; bound and initialize it, or estimate it separately first | + +**Step 4: Numerical hygiene.**[^cs231n] + +```python +log_prob = prob.clamp(min=1e-8).log() # clamp log inputs +ratio = x / (std + 1e-5) # never divide by zero +grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=20.0) +logger.log("grad_norm", grad_norm) # clip, but LOG the pre-clip norm +assert torch.isfinite(loss), f"Loss is {loss}" # catch NaNs early +torch.autograd.gradcheck(my_fn, inputs.double().requires_grad_(True)) # float64! 1e-2 -> 1e-8 +``` + +Gradient clipping *masks* problems, so always log the pre-clip norm to see if it fires every step. For a custom gradient, use relative error (centered difference): `>1e-2` probably wrong, `1e-4` uncomfortable, `1e-7` happy; turn off regularization/dropout and use float64 first. + +**Step 5: Normalization and scale.**[^schulman][^cs231n][^fsdl][^slavv] Most training issues trace back to scale. Normalize inputs to mean 0, std 1 per feature (see the folklore quote from Schulman on running statistics). For physics/PDE models, nondimensionalize *before* training: raw SI units (Kelvin, Joules, meters) create loss terms with wildly different magnitudes; pick characteristic scales, substitute, and the resulting groups (NTU, Biot) come out O(1). For time-series, use a temporal train/test split, not random, or you leak correlation. 
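One way to keep "mean 0, std 1" honest on streaming data is to track statistics over everything seen so far. The sketch below assumes a single scalar feature and uses Welford's online algorithm; the class name `RunningNormalizer`, the `eps` value, and the sample numbers are illustrative choices, and the clip range of 10 mirrors the Schulman slide quoted in the folklore section.

```python
import math


class RunningNormalizer:
    """Tracks mean/std over all values seen so far (Welford's algorithm)."""

    def __init__(self, clip=10.0, eps=1e-5):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.clip = clip
        self.eps = eps

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        """Population standard deviation; 1.0 until two samples are seen."""
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / self.count)

    def normalize(self, x):
        """Standardize, then clip to [-clip, clip]."""
        z = (x - self.mean) / (self.std + self.eps)
        return max(-self.clip, min(self.clip, z))


norm = RunningNormalizer()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # made-up observations
    norm.update(value)

print(norm.mean, norm.std)       # running statistics over all data so far
print(norm.normalize(9.0))       # an ordinary value, roughly 2 std above the mean
print(norm.normalize(1e6))       # an outlier, clipped to the upper bound
```

For vector observations the same update applies per feature; a real training loop would typically do this with array operations rather than Python scalars.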
+ +**Step 6: Check your optimizer assumptions.** Adam's moment estimates can mask gradient problems; if step statistics look weird, inspect raw gradients separately. `abs_max(param_update)` should be small (~1e-3 at LR 1e-2). Supervised-learning tricks (batchnorm, dropout, big nets) often *don't* transfer to RL. + +--- + +## When stuck, read a working implementation + +After 1-2 diagnostic cycles that don't localize the bug, or whenever you're building something you haven't built before, stop guessing and go read code that already works. Agents tend to skip this for another round of from-scratch guessing, which is usually the worse bet. The folklore is blunt about this: writing RL from scratch is "the most catastrophically self-sabotaging thing you can do," because the self-correction signal is too weak to catch your bugs[^jones]. + +Use the `gh` skill to find an implementation. Rank candidates by trust signal: community adoption > papers citing it > open source that runs > author reputation > self-reports. A repo other researchers use as a baseline beats a flashy README. + +Read it for three things, explicitly: +1. **The algorithm done right.** Diff your math and your computation graph against theirs. The bug is usually something "trivial": a sign, a reset, an off-by-one, an advantage normalization you skipped. Implementation differences that papers never mention dominate results[^henderson]. +2. **The engineering tricks the paper omits.** Did they normalize the input? tanh instead of ReLU? mean-pool instead of last-token? only 6 layers? clip to stop gradient saturation? warm-start? an easier dataset than yours? These live in the code, not the abstract, and they're the difference between "works" and "doesn't." +3. **Proven hyperparameters, schedule, and optimizer.** Copy the values known to work before tuning your own. Their LR, warmup, batch size, weight decay, and optimizer are a working starting point you get for free. 
+ +For RL specifically, see [rl/SKILL.md](rl/SKILL.md) (spinning-up, stable-baselines3, cleanrl, OpenSpiel). + +--- + +## Folklore + +The hard-won lessons, in the words of the people who learned them. Sources and links are collected under [Links](#links-and-further-reading). + +### Assume you have a bug + +> When their RL implementation doesn't work, people are often keen to either (a) adjust their network architecture or (b) adjust their hyperparameters. On the other hand, they're reluctant to say they've got a bug. Most often, it turns out they've got a bug. Why bugs are so much more common in RL code is discussed above, but there's another advantage to assuming you've got a bug: bugs are a damn sight faster to find and fix than validating that your new architecture is an improvement over the old one.[^jones] + +> What I'm advocating for here is not a blind faith in the buginess of your code, but for dramatically raising the threshold at which you start thinking 'OK, I think this is correct.'[^jones] + +A bug can also hide, because most ML models have multiple adaptive parts: "If one part is broken, the other parts can adapt and still achieve roughly acceptable performance"[^goodfellow], and it may not show in the output at all. So raise the bar for "correct." + +### Loss curves are a red herring + +> When someone's RL implementation isn't working, they *luuuuuurv* to copy-paste a screenshot of their loss curve to you. They do this because they know they want a pretty, exponentially-decaying loss curve, and they know what they have *isn't that*. The problem with using the loss curve as an indicator of correctness is somewhat that it's not reliable, but mostly because it doesn't localise errors. The shape of your loss curve says very little about where in your code you've messed up, and so says very little about what you need to change to get things working.[^jones] + +Their real value is splitting "how fast it learns" from "where it plateaus." 
Use them after better methods, not as a first resort. + +### Pursue anomalies; investigate confusion + +> If you ever see a plot or a behaviour that just *seems weird*, chase right after it! Do not - do *not* - just 'hope it goes away'. Chasing anomalies is one of the most powerful ways to debug your system, because if you've noticed a problem without having had to go look for it, that means it's a *really big problem*. [...] It's really tempting to think that the cool extra functionality you were planning to write today [...] might just magically fix this anomalous behaviour. It won't. Give up on your plan for the day and chase the anomaly instead.[^jones] + +> It was only by following that confusion and realising that taking the difference between frames zeroed out the background that gave the hint of a problem with normalization.[^rahtz] +> +> It seems important to really commit yourself to *always* investigate whenever you notice confusion.[^rahtz] + +### Think more, experiment less + +> Switching from experimenting a lot and thinking a little to experimenting a little and thinking a lot was a key turnaround in productivity. When debugging with long iteration times, you really need to *pour* time into the hypothesis-forming step - thinking about what all the possibilities are, how likely they seem on their own, and how likely they seem in light of everything you've seen so far. Spend as much time as you need, even if it takes 30 minutes, or an hour. Reserve experiments for once you've fleshed out the hypothesis space as thoroughly as possible and know which pieces of evidence would allow you to best distinguish between the different possibilities.[^rahtz] + +Corollary, MurphyJitsu pre-flight: before launching a run, ask "if this fails, what's the most likely cause?" If you can name it, test for it first. 
+ +### Inspect the data first + +> The first step to training a neural net is to not touch any neural net code at all and instead begin by thoroughly inspecting your data. [...] The outliers especially almost always uncover some bugs in data quality or preprocessing.[^karpathy-recipe] + +Slavv's "37 reasons" list opens with the same anecdote (gradients flowing, loss falling, predictions all background) and puts "Verify that the input data is correct" and "Start with a really small dataset (2-20 samples). Overfit on it" at the top of its emergency checklist[^slavv]. FSDL names preprocessing and dataset construction as leading silent-failure categories[^fsdl]. + +### Overfit one batch first + +> Overfit a tiny subset of data. Lastly and most importantly, before training on the full dataset try to train on a tiny portion (e.g. 20 examples) of your data and make sure you can achieve zero cost. For this experiment it's also best to set regularization to zero [...]. Unless you pass this sanity check with a small dataset it is not worth proceeding to the full dataset.[^cs231n] + +> Overfit a single batch of only a few examples (e.g. as little as two). [...] If they do not, there is a bug somewhere and we cannot continue to the next stage.[^karpathy-recipe] + +And remove a variable while you're at it: "Always use a fixed random seed [...]. 
This removes a factor of variation and will help keep you sane."[^karpathy-recipe] + +### Normalize and scale everything + +From the slides[^schulman] (bullet points, de-artifacted from the PDF): +> - If observations have unknown range, standardize +> - Compute running estimate of mean and standard deviation +> - x' = clip((x - mu)/sigma, -10, 10) +> - Rescale the rewards, but don't shift mean, as that affects agent's will to live +> - Standardize prediction targets (e.g., value functions) the same way + +Use running statistics over *all* data seen so far, not just recent data; using only recent data silently shifts the input distribution out from under the model. + +### Tricks substitute for each other + +On the slides[^schulman]: +> Always Be Ablating +> - Different tricks may substitute +> - Especially whitening + +Many normalization/regularization tricks do roughly the same job (they improve conditioning), so stacking them adds complexity without proportional benefit. If you have three normalization schemes and it still doesn't work, the problem isn't normalization. So ablate: most of the things you added are probably unnecessary. + +### Don't write RL from scratch; diff against a reference + +> If you're doing anything that involves an RL algorithm as a component in a larger system, don't try and implement the RL algorithm yourself. [...] RL is unstable enough at the moment that you'll never be sure whether your system doesn't work because of a bug in your RL implementation or because of a bug in your larger system.[^rahtz] + +> We find that implementation differences which are often not reflected in publications can have dramatic impacts on performance.[^henderson] + +### Seed variance: you can't tell a bug from bad luck + +> Look, there's variance in supervised learning too, but it's rarely this bad. If my supervised learning code failed to beat random chance 30% of the time, I'd have super high confidence there was a bug in data loading or training. 
If my reinforcement learning code does no better than random, I have no idea if it's a bug, if my hyperparameters are bad, or if I simply got unlucky.[^irpan] + +> Instability to random seed is like a canary in a coal mine. If pure randomness is enough to lead to this much variance between runs, imagine how much an actual difference in the code could make.[^irpan] + +Henderson confirmed it quantitatively: splitting 10 same-config runs (differing only in seed) into two groups of five produces "statistically different distributions just from varying random seeds."[^henderson] This is why one good run proves nothing, and why sweeps need same-seed pairing and a cross-seed reliability test ([refs/sweeps.md](refs/sweeps.md)). + +### 3e-4, and learning-rate folklore + +The most-quoted line in the genre is Karpathy's tweet, "3e-4 is the best learning rate for Adam, hands down."[^karpathy-3e4] He confirmed in the same thread that it was a joke, but it stuck because it's a decent default. Read it next to what he actually does in the recipe: + +> In the early stages of setting baselines I like to use Adam with a learning rate of 3e-4. In my experience Adam is much more forgiving to hyperparameters, including a bad learning rate.[^karpathy-recipe] + +So: 3e-4 is a fine *starting* LR for Adam, not a law. The real folklore is "Adam is forgiving, so start there and stop fiddling." It has exceptions, and the biggest is batch size: +- There's a critical batch size below which doubling the batch ~halves wall-clock time, and above which it just burns compute; it rises during training as loss falls[^mccandlish]. +- LR must scale with batch size: linearly for SGD (double batch, double LR)[^goyal]; for Adam, an exponent between 0.5 and 1, task-dependent[^mccandlish]. Changing batch size without adjusting LR is a common silent mistake. +- Large-batch + high-LR diverges at the start without warmup[^goyal]. No warmup -> first-epoch loss spike/NaN -> you wrongly think the code is broken. 
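The batch-size rules above reduce to a few lines of arithmetic. This is a hedged sketch, not a recipe from the cited sources: `scale_lr` and `warmup_lr` are hypothetical helper names, the base values are illustrative, and the warmup here ramps linearly from near zero, which is a simplification of the schedules used in practice.

```python
def scale_lr(base_lr, base_batch, new_batch, exponent=1.0):
    """Rescale LR for a new batch size: base_lr * (new_batch / base_batch) ** exponent.

    exponent=1.0 is the linear rule for SGD; for Adam the text above suggests
    a task-dependent exponent somewhere between 0.5 and 1.
    """
    if base_batch <= 0 or new_batch <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * (new_batch / base_batch) ** exponent


def warmup_lr(target_lr, step, warmup_steps):
    """Linear warmup to target_lr over warmup_steps, constant afterwards."""
    if warmup_steps <= 0:
        return target_lr
    return target_lr * min(1.0, (step + 1) / warmup_steps)


# Illustrative numbers only: going from batch 256 to batch 1024.
sgd_lr = scale_lr(0.1, 256, 1024, exponent=1.0)    # linear rule: 4x batch, 4x LR
adam_lr = scale_lr(3e-4, 256, 1024, exponent=0.5)  # square-root end of the Adam range

for step in (0, 49, 99, 500):
    print(step, warmup_lr(sgd_lr, step, warmup_steps=100))
```

Logging the scaled LR next to the batch size in every run config makes the "changed batch size, forgot the LR" mistake visible at a glance.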
+ +| Symptom | Likely cause | +|---------|-------------| +| Very noisy, loss oscillates | Batch too small; gradient noise swamps signal. Try 4-8x larger | +| Smooth but slow, poor generalization | Batch too large without LR scaling. Higher LR or smaller batch | +| Loss spikes at start then recovers | Normal with large batch + warmup. No warmup? Add it | +| Different results at different batch sizes (same total steps) | Missing LR scaling. Adjust LR proportionally | + +--- + +## For LLM agents + +Unfortunately, agents need these procedural mindset-shifts spelled out. This is the babysitting layer, not the durable folklore, hence its place at the bottom. If you're an agent debugging ML code, run the loop and avoid the anti-patterns. + +### The debugging loop (use judgment, it's not a checklist) Roughly in this order, though the point is the underlying mindset: -**Collect clues before theorizing.** Read the traceback and logs. Run static analysis (Part 6.1) and the cheap diagnostics (Part 6.2: data sanity check, init-loss check, overfit-one-batch). If you catch yourself proposing a fix before you've looked at anything, stop. +**Collect clues before theorizing.** Read the traceback and logs. Run static analysis ([refs/static_analysis.md](refs/static_analysis.md)) and the cheap diagnostics ([refs/diagnostics.md](refs/diagnostics.md): data sanity check, init-loss check, overfit-one-batch). If you catch yourself proposing a fix before you've looked at anything, stop. -**Hold several hypotheses at once; resist converging early.** Unless the cause is already obvious (a traceback usually points right at it), generate a few genuinely different explanations before ranking any of them, so you don't marry the first one that comes to mind. Part 7.1 has five lenses for generating them: information flow, ablation, oracle substitution, learning curves, structural ceiling. 
Then sanity-check yourself with the failure-mode triplet: +**Hold several hypotheses at once; resist converging early.** Unless the cause is already obvious (a traceback usually points right at it), generate a few genuinely different explanations before ranking any, so you don't marry the first one. Use the five lenses in Mental models. Then sanity-check yourself with the failure-mode triplet (same idiom as the `research-journal` skill): - *Likely*: your strongest competitor explanation, with a rough credence. - *Subtle*: the sneaky one, like sample size, leakage, a confound, a metric artifact, or plain seed variance masquerading as signal. - *Null*: there's no real effect, or it comes from something else you also changed. -Give each a one-line prior (rough credence) and its cheapest falsifier, a `Check: ...` line naming the observation that would kill it. Those falsifiers are the menu the next step draws from. +Give each a one-line prior and its cheapest falsifier (`Check: ...`). Anchor priors on Practitioner priors above, but a clue that points elsewhere overrides them outright. Keep observations (reproducible, auditable) separate from inferences, so you can rethink without degrading the evidence. -Anchor priors on what's usually wrong (Part 7.2: data ~40%, loss ~20%, training ~15%, architecture ~10%, hyperparameters ~5%), but priors are only a starting weight. A clue that points elsewhere overrides them outright: a traceback naming a line, a metric stuck while the loss is healthy (loss-metric misalignment), or an init-loss that's exactly right all redirect you regardless of the ~40% data prior. +**Run the cheapest observation that splits your top hypotheses.** Not the most thorough experiment, the most *discriminating* one. Forward-predict each hypothesis ("what would I see if this were the cause?"); a test is strong evidence only where the predictions diverge. 
A grad-norm line reading ~0 under "dead layer" but healthy under "LR too low" beats a 4-hour sweep that only confirms what you believed. -Make sure to separate observations (to be faithfully reproduced in an auditable manner) and inferences. That way you can go back and rethink things without degrading the evidence. +**Bisect the path to localize where it breaks.** Data flows forward and gradients backward in a chain (input -> preprocess -> layers -> loss -> grads), so probe the midpoint: is the value or gradient already wrong halfway through? Each probe halves the search space. Finding the first module to produce a non-finite value is one case; the same bisection works for finite-but-wrong values, exploded norms, and dead activations. - -**Run the cheapest observation that splits your top hypotheses.** Not the most thorough experiment, the most *discriminating* one (Rahtz: think more, experiment less, Part 1). To find it, forward-predict each hypothesis ("what would I see if this were the cause?"): a test is strong evidence only where the predictions diverge, and worthless where every hypothesis predicts the same outcome. Prefer the check whose result you'd bet on differently under each explanation. A grad-norm line reading ~0 under "dead layer" but healthy under "LR too low" beats a 4-hour sweep that only confirms what you already believed. - -But before you run a 10 minute test, remember it's much faster to step back, and have good priors before you start running experiments. It's also good to rank multiple possible diagnostics and think about how much you learn, vs how much they cost in code complexity and gpu time. You want to pick ones where the learning is worth the cost. - -**Bisect the path to localize where it breaks.** Once you have a hypothesis about the cause, you still need to find where in the pipeline it goes wrong. 
Data flows forward and gradients flow backward in a chain (input -> preprocess -> layers -> loss -> grads), so probe the midpoint instead of reading every step: is the value or gradient already wrong halfway through? Each probe halves the search space. Finding the first module to produce a non-finite value (the NaN search in Part 6.2) is one case of this; the same bisection works for finite-but-wrong values, exploded grad norms, dead activations, and so on. - -**Then act, and only on what the observation pointed to.** If a cycle or two hasn't localized it, stop tuning and go read working code (next section), which is usually better than another guess. - -Consult these only once you're inside the loop, as reference: triage tree (Part 6.3), hypothesis-generating lenses (Part 7.1), the metric-stuck decision tree (Part 5), RL specifics (`rl/SKILL.md`). - -The same loop in pseudocode (for humans and agents to hold in one glance): +**Then act, only on what the observation pointed to.** If a cycle or two hasn't localized it, stop tuning and go read working code, which usually beats another guess. ```py # ── ML debugging loop ──────────────────────── @@ -53,7 +270,6 @@ def debug(symptom): prior ← anchor(H) # base rates: data .40 loss .20 train .15 arch .10 hp .05 while not localized: - # the cheapest test whose outcome the hypotheses disagree on t̂ ← argmax(divergence(predict(h, t) for h in H) / cost(t) for t in candidates) obs ← run(t̂) # one log line or toy run; keep obs apart from inference prior ← update(prior, obs) @@ -64,640 +280,65 @@ def debug(symptom): fix(root_cause); assert reproduces(obs) # no silent fallback; crash if it doesn't ``` -## When stuck, read a working implementation +### Triage (a menu, not a flowchart to obey) -After 1-2 diagnostic cycles that don't localize the bug, or whenever you're building something you haven't built before, stop guessing and go read code that already works. 
Agents tend to skip this in favour of another round of from-scratch guessing, which is usually the worse bet. +Rough order to consider, not authoritative; it may not fit your project. Stop when a question fits. -Use the `gh` skill to find an implementation. Rank candidates by trust signal (per CLAUDE.md): community adoption > papers citing it > open source code that runs > author reputation > self-reports. A repo other researchers use as a baseline is worth more than a flashy README. +1. Exception/traceback? Read it, fix it, done. +2. Loss NaN/Inf? Attach NaN hooks ([refs/diagnostics.md](refs/diagnostics.md)), find the first module producing NaN. Usual causes: log(0), 0/0, exp(large); add clamp/eps. +3. Init loss wrong? Check the data pipeline and loss; check for double softmax; check labels match output format. Same loss on random input -> data destroyed. Init loss << expected -> leakage. +4. Can't overfit one batch? Gradient-flow check: None grads -> disconnected layer; all-zero grads -> dead layer / detach. Check autograd breakers and optimizer step order. +5. Loss stuck from step 0 but you *can* overfit one batch? LR too low (try 10x), frozen params (check `requires_grad`), wrong loss. +6. Loss decreases then explodes? LR too high (try 0.1x), log the pre-clip grad norm, hunt numerical instability. +7. Train good, val bad? Overfitting, not a bug. More data, regularization, smaller model. +8. Train loss fine but the metric is bad? Loss-metric misalignment ([refs/metric_stuck.md](refs/metric_stuck.md)). +9. Outputs constant? Mode collapse: class imbalance, all-zero init, dead ReLUs, look at confidence-sorted errors. +10. Slow but not stuck? Not a bug. Consider batch size, depth/width, data quality. -Read it for three things, explicitly: -1. **The algorithm done right.** Diff your math and your computation graph against theirs. The bug is usually something "trivial", like a sign, a reset, an off-by-one in indexing, or an advantage normalization you skipped. -2. 
**The engineering tricks they don't mention in the paper.** Did they normalize the input? tanh instead of ReLU? mean-pool instead of last-token? only 6 layers? clip to stop gradient saturation? warm-start? an easier dataset than yours? These are the difference between "works" and "doesn't," and they live in the code, where papers rarely spell them out. -3. **Proven hyperparameters, schedule, and optimizer.** Copy the values that are known to work before you tune your own. Their LR, warmup, batch size, weight decay, and optimizer choice are a working starting point you get for free. +### Anti-patterns -Reference implementations are domain-specific. For RL, see `rl/SKILL.md` section 9 (spinning-up, stable-baselines3, cleanrl, OpenSpiel). For everything, diff against the reference rather than trusting your from-scratch version (Part 7.3). +These are the overconfident reflexes the "calibrate" section warns about, made concrete. Every one changes behaviour before localizing the bug. (As people put it: "this is sklearn slop," or "the LLM is tweaking hyperparameters like it's in a hackathon, not understanding the problem.") -## Scope and modern pointers - -Most sources here are 2017-2021, predating RLHF, large-scale pretraining, and JAX/PyTorch 2.0. The core principles (isolation testing, logging, seed variance, the loop above) are architecture-agnostic and durable; specific RL HP defaults and reward-scaling advice may need updating. For modern transformer pretraining specifically, go to [Karpathy's recipe](https://karpathy.github.io/2019/04/25/recipe/) (2019; activation/gradient health checks) and [nanochat deepwiki](https://deepwiki.com/karpathy/nanochat) (2026; 320+ empirical HP sweeps for a GPT-2-scale run, MFU monitoring, precision management, BOS-aligned dataloaders). 
Evidence files: [karpathy_recipe_training_nn_2019.md](docs/evidence/karpathy_recipe_training_nn_2019.md), [nanochat_deepwiki_llm_pretraining_2026.md](docs/evidence/nanochat_deepwiki_llm_pretraining_2026.md). Most multi-source claims below trace to quotes in [docs/ml_debug_folklore.argdown](docs/ml_debug_folklore.argdown) (vargdown); uncovered claims are in the [process log](docs/ml_debug_folklore_log.md). +- Hyperparameter changes before verifying correctness. "Try reducing the learning rate" is the #1 wrong response. Verify the code first; HP tuning on buggy code wastes time. +- `try/except` around training code. Training should crash loudly. A caught exception hides the bug and produces silently wrong results. The one exception is checkpoint-on-KeyboardInterrupt. +- "Try a different optimizer." If Adam doesn't converge, it's almost never the optimizer; it's the loss, the data, the architecture, or a bug. +- `.detach()` / `.item()` to "fix" gradient errors. If autograd complains, the graph is wrong. Detaching silences it by cutting gradient flow, so the model just stops learning from that path. +- `lr_scheduler` as a *cure for non-convergence*. Schedules matter (transformers need warmup, cyclic/cosine is often best-in-class, AdamW is the standard pairing), but they refine or enable convergence in an otherwise-healthy setup; they don't rescue a model that can't learn at constant LR because of a bug. Add the schedule once the basics work, not as a debugging band-aid. +- More layers / a bigger model. If it can't overfit one batch, more parameters won't help. The problem is gradient flow, loss, or data. +- "Normalize your data" without checking whether it already is. Run the data sanity check first. +- `float()` / `.to(dtype)` to suppress type warnings. Type mismatches are signals; a float32/float64 mismatch might mean you're mixing model weights with double-precision data. Fix the root cause. 
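The `try/except` bullet above allows exactly one catch. A minimal sketch of that shape, assuming hypothetical `train_step` and `save_checkpoint` callables (neither name comes from this repo): only `KeyboardInterrupt` is caught, a checkpoint is written, and the interrupt is re-raised. Re-raising is a design choice of this sketch, not something the source prescribes. Every other exception propagates and crashes the run.

```python
def run_training(train_step, save_checkpoint, num_steps):
    """Run train_step(step) for num_steps steps and return the number completed.

    train_step and save_checkpoint are hypothetical callables used for
    illustration. Nothing except Ctrl-C is caught, so real bugs crash loudly.
    """
    completed = 0
    try:
        for step in range(num_steps):
            train_step(step)
            completed = step + 1
    except KeyboardInterrupt:
        # The one allowed catch: save progress, then re-raise so the run still stops.
        save_checkpoint(completed)
        raise
    return completed
```

Note there is no bare `except:` and no `except Exception:`; a shape error or NaN assertion inside `train_step` still surfaces with its full traceback.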
--- -## Part 1: General ML Debugging - -### What "collect clues" looks like - -This is the catalog the loop's clue-collection step pulls from: the substantive checks, in rough dependency order (each assumes the one before). It's a menu to draw on from inside the loop above. - -**Step 1: Verify components in isolation.** [Goodfellow Ch11, CS229] -Most bugs are "doing the wrong calculation." Test each piece independently. - -- Network forward pass: feed known inputs, check output shapes and ranges. `assert` shapes everywhere, since `(None,)` vs `(None, 1)` silently broadcasts into `(None, None)`. -- Loss computation: hand-compute a few targets and compare to code output. -- Data pipeline: sample a batch, print it, eyeball it. Are labels aligned with inputs? Are transforms applied correctly? -- Preprocessing: look at your processed inputs as a human. Can *you* solve the task from them? If you downsampled images, can you still tell what's going on? - -**Five most common deep learning bugs** [FSDL]: (1) incorrect tensor shapes that fail silently via broadcasting, (2) preprocessing inputs incorrectly (wrong normalization, over-augmentation), (3) incorrect loss function or wrong sign in loss/gradient, (4) forgot to set up train vs eval mode (dropout/batchnorm behave differently), (5) numerical instability (NaN from log(0), overflow, vanishing grads). - -**Step 2: Get signs of life on a toy problem. Work the baseline ladder.** [CS231n, FSDL, Goodfellow Ch11] -Before your real task, solve something trivial with the same codebase. This establishes what "healthy" looks like. Run on CartPole (or equivalent) and log the same curves so you know what healthy learning looks like for your setup [reddit]. If it works on the toy but not your real task, the gap is usually scale/normalization, not fundamental correctness. - -Also try to overfit to train. If you can't do that, you likely won't be able to generalise. 
[CS231n: "Unless you pass this sanity check with a small dataset it is not worth proceeding to the full dataset."] Start with a lightweight implementation (<200 lines of new code), no complicated data pipelines [FSDL]. Build those later once the core works. - -**Baseline ladder** (for physics/simulation models, make each step beat the previous one): -1. Persistence: y(t) = y(t-1). Bar for "does the model capture any dynamics at all?" -2. Exponential decay to steady state (first-order response fit). -3. Linear state-space / OLS on finite differences. -4. Pure data MLP (same architecture, no physics). If PINN doesn't beat this, the physics constraint is hurting. -5. Classical solver with fixed parameters (scipy solve_bvp, ODE, etc.). -6. Classical solver with fitted parameters. -7. Then and only then: PINN / learned physics. - -Make complexity pay rent. Every added component (more physics, more dimensions, more losses) should improve a metric you care about. If it doesn't, remove it. - -**Step 3: Log everything, look for specific pathologies.** [Goodfellow Ch11, Rahtz 2018, CS231n] - -What to log: -- Losses (train and val, per-component if multi-objective) -- Gradient norms (per module if possible) -- Learning rates -- Parameter norms / update magnitudes -- Update-to-data ratio per layer: `((lr * p.grad).std() / p.data.std()).log10()`, target ~-3 [Karpathy nn-zero-to-hero Lec 4] -- Activation statistics (mean, std, fraction of dead ReLUs, saturation % for tanh) -- Data statistics (input distributions, label distributions) - -**Sanity check at init** [CS231n]: verify you get the expected loss at chance performance before training starts. E.g., for 10-class softmax the initial loss should be -ln(0.1) = 2.302 with small random weights. If not, something is wrong with initialization or the loss function. Then verify that increasing regularization increases the loss. 
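The init-loss check above can be sketched without any framework. This assumes a plain softmax cross-entropy and logits that start near zero, so the prediction is uniform; the function names are illustrative and not part of this repo.

```python
import math

def expected_init_loss(num_classes):
    """Cross-entropy of a uniform prediction over num_classes: -ln(1/C) = ln(C)."""
    return math.log(num_classes)

def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, via a numerically stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

# All-zero logits stand in for "small random weights" at initialization.
uniform_logits = [0.0] * 10
loss_at_init = cross_entropy(uniform_logits, target=3)
gap = abs(loss_at_init - expected_init_loss(10))
```

If the loss your real model reports at step 0 is far from `expected_init_loss(num_classes)`, suspect the loss function, the label format, or the initialization before touching hyperparameters.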
- -| Symptom | Likely cause | -|---|---| -| Loss stuck from the start | LR too low, bad init, data pipeline broken, wrong loss function | -| Loss decreases then explodes | LR too high, numerical instability (log(0), div by 0), gradient accumulation bug | -| Loss NaN | log(0), 0/0, overflow. Use `log(x.clamp(min=1e-8))`, `1/(std + 1e-5)` | -| Train loss good, val loss bad | Overfitting. More data, regularization, smaller model | -| Loss oscillates wildly | LR too high, batch size too small, data shuffling broken | -| Gradients vanish | Too-deep network without skip connections, saturating activations (tanh with large inputs), bad init | -| Gradients explode | No gradient clipping, learning rate too high, recurrent networks without gradient clipping | -| Different results per seed | Normal if small variance; suspicious if large. Check init sensitivity, batch ordering, floating point nondeterminism | -| Model outputs constant | Dead neurons, vanishing gradients, mode collapse, all-zero init | -| Physics loss low but BCs violated | Gradient imbalance: PDE residual dominates BC gradient; use adaptive loss weighting or hard BCs | -| PINN worse than pure-data MLP | Wrong equations, bad scaling (forgot to nondimensionalize), or physics constraint fighting the data | -| PINN fails on hard PDE regime, works on easy | Curriculum regularization: start with easy parameters, warm-start and increase to target | -| Scalar parameter (U, alpha) stuck at 0 or bound | Degenerate solution; bound and initialize it, or estimate separately before joint training | - -**Step 4: Numerical hygiene.** [CS231n] - -```python -# Clamp log values -log_prob = prob.clamp(min=1e-8).log() - -# Never divide by zero -ratio = x / (std + 1e-5) - -# Clip gradients and LOG the pre-clip norm -grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=20.0) -logger.log("grad_norm", grad_norm) - -# Catch NaNs early -assert torch.isfinite(loss), f"Loss is {loss}" - -# Verify custom gradients (use 
float64! relative error plummets from 1e-2 to 1e-8) -torch.autograd.gradcheck(my_custom_fn, inputs.double().requires_grad_(True)) -``` - -Gradient clipping *masks* problems, so always log the pre-clip norm to see if it's constantly being triggered. [CS231n: "the ratio of the update magnitudes to the value magnitudes... should be somewhere around 1e-3."] - -**Gradient check thresholds** [CS231n]: use relative error, not absolute. Compare analytic vs numerical gradient using centered difference formula. Relative error > 1e-2 = probably wrong. 1e-4 = uncomfortable. 1e-7 = happy. Before checking: (a) turn off regularization and check data loss alone first (regularization can mask data loss bugs), (b) disable dropout and augmentation, (c) use float64 not float32. - -**Step 5: Normalization and Nondimensionalization.** [Schulman 2017, CS231n, FSDL, Slavv 2017] -Most ML training issues trace back to scale problems. - -- Input normalization: mean 0, std 1 per feature. Use running statistics over ALL data seen so far, not just recent data [Schulman 2017]. Using only recent data silently changes the input distribution in a way the policy doesn't know about, which can collapse performance. [Schulman slides: "Compute running estimate of mean and standard deviation, x' = clip((x-mu)/sigma, -10, 10)"] -- Schulman: "plot histograms of all observations and rewards and make sure each component has the right mean and standard deviation and doesn't have crazy outliers." -- Layer normalization helps stability. -- For targets/labels: think about whether the scale is reasonable for your loss function. -- **For physics/PDE models (PINNs)**: nondimensionalize *before* training. Raw SI units (Kelvin, Joules, meters) create loss terms with wildly different magnitudes. This is the multi-scale problem that adaptive weighting tries to fix downstream. Nondimensionalizing fixes it at the source by making all PDE coefficients O(1). 
Recipe: pick characteristic scales (T_ref, L_ref, etc.), define dimensionless variables (T* = T/T_ref, z* = z/L_ref), substitute into the PDE. The resulting groups (NTU, Biot, etc.) are all O(1). -- **Train/test split**: use temporal split (not random) for time-series or plant data. Random splitting leaks temporal correlation and gives optimistic test RMSE. Conventional: first 75% train, last 25% test. - -**Step 6: Check your assumptions about the optimizer.** - -- Adam's moment estimates can mask gradient problems. If step statistics look weird, check raw gradients separately. -- `abs_max(param_update)` should be small (e.g., ~1e-3 at LR 1e-2); `mean_square(param_update)` should be very small, and substantially smaller than abs_max. -- Supervised learning tricks (batch norm, dropout, big networks) often *don't* transfer to RL. People tried them. They usually don't help. - -### Assume you have a bug [Jones 2021, Goodfellow Ch11] - -> When their RL implementation doesn't work, people are often keen to either (a) adjust their network architecture or (b) adjust their hyperparameters. On the other hand, they're reluctant to say they've got a bug. Most often, it turns out they've got a bug. -- Andy Jones - -Bugs are faster to find and fix than validating that a new architecture is an improvement. Dramatically raise your threshold for "OK, I think this is correct." Neural net components can adapt to compensate for bugs, masking them [Goodfellow Ch11: "If one part is broken, the other parts can adapt and still achieve roughly acceptable performance."] - -### Loss curves are a red herring [Jones 2021] - -They give global information about performance but don't localize errors. Don't debug by staring at loss curves. Use them *after* you've exhausted better methods. Their main value: splitting performance into "how fast it learns" vs "where it plateaus."
[Jones: "The shape of your loss curve says very little about where in your code you've messed up."] - -### Pursue anomalies [Jones 2021, Rahtz 2018] - -> If you ever see a plot or a behaviour that just *seems weird*, chase right after it. Do not just 'hope it goes away'. -- Andy Jones - -That cool new feature you were going to add today? It won't magically fix the anomaly. Give up on your plan and chase the anomaly instead. Rahtz independently calls this "noticing confusion": following confusion about a frame-differencing improvement led to finding a normalization bug that had hidden for months. - -### With long feedback loops, think more, experiment less [Rahtz 2018] - -> Switching from experimenting a lot and thinking a little to experimenting a little and thinking a lot was a key turnaround in productivity. -- Rahtz - -When runs take hours, pour time into hypothesis-forming *before* launching. Spend 30-60 minutes mapping out possibilities, ranking them by likelihood given all evidence so far. Reserve experiments for distinguishing between your top hypotheses. - -Keep a structured work log for long debugging sessions: -1. What specific output am I working on right now? -2. Thinking out loud: hypotheses about the current problem -3. Record of currently running experiments with what each one is supposed to answer -4. Results of runs (graphs, observations), separated by type - ---- - -## Part 2: RL-Specific Debugging - -> See [rl/SKILL.md](rl/SKILL.md) for the full RL debugging sub-skill: probe environments, reward engineering, diagnostics tables, hyperparameter defaults, and reference implementations. - ---- - -## Sources - -**Evidence map**: [docs/ml_debug_folklore.argdown](docs/ml_debug_folklore.argdown) traces each claim to verbatim quotes across 21 evidence files in [docs/evidence/](docs/evidence/). Process log at [docs/ml_debug_folklore_log.md](docs/ml_debug_folklore_log.md). 
- -### Debugging deep networks (general) -- Goodfellow et al., Deep Learning Book, "Practical Methodology" chapter: https://www.deeplearningbook.org/ -- Stanford CS231n, Neural Networks Part 3: https://cs231n.github.io/neural-networks-3/ -- Glorot & Bengio, "Understanding the difficulty of training deep feedforward neural networks" (2010) -- Josh Tobin, FSDL Spring 2021 Lecture 7 "Troubleshooting Deep Neural Networks": https://fullstackdeeplearning.com/ -- Andrew Ng, CS229 Machine Learning Advice: Stanford CS229 -- McCandlish & Kaplan, "An Empirical Model of Large-Batch Training" (2018): https://arxiv.org/abs/1812.06162 -- Slav Ivanov, "37 Reasons why your Neural Network is not working" (2017): https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607 - -### Tools -- PyTorch memory profiling: https://github.com/Stonesjtu/pytorch_memlab -- GPU profiling: nsight, snakeviz, tuna -- Gradient debugging: `torch.autograd.gradcheck`, `torch.autograd.detect_anomaly()` - ---- - -## Part 3: Loss Surface & Gradient Analysis (No Model Required) - -When a loss isn't behaving as expected, don't guess. Visualize the loss surface and check gradient flow directly. This technique feeds *synthetic tensors* into loss sub-components, with no model, forward pass, or GPU needed, just the math. - -### The method - -1. Identify each loss sub-component as a function of its immediate inputs. -2. Pick 1-2 axes that matter (the "natural axes" you think about when reasoning about the loss). -3. Grid over those axes, feed through the loss, call `.backward()`, collect gradients. -4. Plot: contour heatmap + quiver overlay (negative gradient = optimization direction). -5. Build a summary table: component x representative_input -> loss_value, grad_value. Flag zero or non-finite gradients. 
- -### Pseudocode - -```py -# ── 2D loss surface with gradient quiver ────── -def analyze_component(loss_fn, x_range, y_range, n=80): - xs = torch.linspace(*x_range, n) - ys = torch.linspace(*y_range, n) - X, Y = torch.meshgrid(xs, ys, indexing='ij') - x_flat = X.flatten().requires_grad_(True) - y_flat = Y.flatten().requires_grad_(True) - - losses = loss_fn(x_flat, y_flat) # vectorized, returns (n*n,) - losses.sum().backward() - - loss_grid = losses.detach().reshape(n, n) - gx = x_flat.grad.reshape(n, n) - gy = y_flat.grad.reshape(n, n) - - # contourf(X, Y, loss_grid) + quiver(X, Y, -gx, -gy) - # negative gradient = direction optimizer moves - -# ── Gradient flow verification table ────────── -# -# For each component, evaluate at representative inputs -# (zero, small, converged, degenerate). Report loss + grad. -# Flag: zero grad (dead zone), non-finite (numerical issue). -# -# | Component | Param | Input | Loss | Grad | -# |----------------|----------|--------------|----------|----------| -# | barrier_penalty | v | v=0.0 | +0.000 | +0.000 | <-- zero grad! -# | barrier_penalty | v | v=0.5 | +12.50 | +50.00 | -# | pair_loss | dot_pos | (0.3, -0.3) | -2.340 | -3.000 | -# | pair_loss | dot_neg | (0.3, -0.3) | -2.340 | +3.000 | <-- antisym, good -# | pair_loss | dot_pos | (0.0, 0.0) | +0.000 | +0.000 | <-- dead at init! 
-``` - -### What to look for - -| Pattern | Meaning | Action | -|---------|---------|--------| -| Gradient arrows point toward desired region | Loss is well-shaped | Ship it | -| Large flat region (zero gradient) | Dead zone: optimizer stuck if it lands here | Add curvature, change init, or use different parameterization | -| Gradient magnitude 1000x in one axis vs another | Imbalanced: one axis dominates | Rescale, use log-space, or normalize | -| Saddle point at origin | Common with product-form losses (A*B) | Switch to additive (log A + log B) for independent gradients | -| Arrows point away from desired region | Loss is wrong or has unexpected local min | Rethink the formula | -| Non-finite values in a region | Numerical issue (log(0), 0/0) | Add eps, clamp, or use log1p | - -### The log-space decomposition trick - -When your loss involves a product of factors A*B and one factor can be near zero: - -``` -# BAD: symlog(A * B), when B~0 the chain rule gives 0 grad to A too -# GOOD: sign * (log|A| + log|B|) gives independent gradients -# d/dA = 1/A regardless of B -# d/dB = 1/B regardless of A -``` - -General principle: **if you want gradient to flow independently through two factors, split the product into a sum of logs (log|A| + log|B|).** - -### Structural ceiling analysis - -Sometimes a metric is stuck not because the optimizer fails but because the parameterization can't express a higher value. To diagnose: - -```py -# 1. Check: is d(loss)/d(metric) large? If yes, optimizer IS trying. -metric = torch.tensor(0.5, requires_grad=True) -loss = loss_fn(metric) -loss.backward() -print(metric.grad) # if large (e.g. 350x the other gradients), it's trying - -# 2. Check: can the parameter CHANGE the metric? -# Trace the chain: loss -> metric -> intermediate -> parameter -# If d(metric)/d(parameter) ~ 0, the param is structurally unable to move it. -# Example: V-rotation can't change output basis (U is fixed) so r_sub is capped. - -# 3.
Confirm empirically: set the exponent to 0 (disable the term). -# If metric reaches the SAME value, it's purely structural (not learned). -``` - -### When to use this - -- New loss function: always visualize before training. 5 minutes of plotting saves hours of puzzling over curves. -- Metric stuck at a value: distinguish "optimizer can't" from "parameterization can't" from "competing losses cancel out." -- After changing loss formula: verify gradient flow didn't break, especially at the operating point (not just at init). -- Comparing loss variants: grid the same axes for both, compare arrow fields side by side. - ---- - -## Part 4: Experiment Sweeps & Statistical Analysis - -Principled hyperparameter sweeps with same-seed comparisons, within-group z-scores, and t-stat stability. This is the difference between "I tried it and it seemed better" and "I have evidence it's reliably better." - -### Sweep design (justfile pattern) - -Each sweep is a justfile recipe. Key conventions: - -```just -set shell := ["bash", "-c"] -SEEDS_4 := "2024 4096 8192 9000" -BASE := "uv run python train.py gemma1b" - -# Q: Does rotation type matter? block vs full vs givens. -# Hypothesis: block should balance expressiveness vs cost. -# 12 runs, ~3 hours -sweep-rotation-type: - #!/usr/bin/env bash - set -x - export WANDB_RUN_GROUP="sweep-rotation-type-$(date +%Y%m%d-%H%M)" - for seed in {{ SEEDS_4 }}; do - {{ BASE }} --seed=$seed --svd_rotation_type=block - {{ BASE }} --seed=$seed --svd_rotation_type=full - {{ BASE }} --seed=$seed --svd_rotation_type=givens - done -``` - -**Rules:** -- One WANDB_RUN_GROUP per sweep, timestamped. -- Same seeds across all values within a sweep (enables paired comparison). -- Vary ONE parameter per sweep when possible (all-else-equal). If you must vary two, the analysis script warns about confounders. -- Comment the recipe with: question, hypothesis, run count, time estimate. -- Queue sweeps in a `queue` recipe in priority order. 
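Why the same-seed rule above matters can be shown with a paired t-statistic on per-seed differences. This is a simpler stand-in chosen for illustration, not the analysis script's within-group z-score method. The seed values are the ones from the recipe; the scores are made-up numbers, not results from any run.

```python
import math
import statistics

def paired_t_stat(scores_a, scores_b):
    """t-statistic of per-seed differences (a - b); both dicts are keyed by seed."""
    seeds = sorted(set(scores_a) & set(scores_b))
    if len(seeds) < 2:
        raise ValueError("need at least 2 shared seeds")
    diffs = [scores_a[s] - scores_b[s] for s in seeds]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all per-seed differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))

def unpaired_t_stat(values_a, values_b):
    """Welch t-statistic, which ignores which seed produced which score."""
    var_a = statistics.variance(values_a)
    var_b = statistics.variance(values_b)
    se = math.sqrt(var_a / len(values_a) + var_b / len(values_b))
    return (statistics.mean(values_a) - statistics.mean(values_b)) / se

# Hypothetical scores for illustration only: seed variance is large,
# but one setting is consistently a little better on every seed.
block = {2024: 10.0, 4096: 14.0, 8192: 7.0, 9000: 12.0}
full = {2024: 9.0, 4096: 12.5, 8192: 6.0, 9000: 10.5}

t_paired = paired_t_stat(block, full)
t_unpaired = unpaired_t_stat(list(block.values()), list(full.values()))
```

With shared seeds, the seed-level baseline cancels in each difference, so a small but consistent effect stands out; the unpaired statistic on the same numbers is swamped by seed variance.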
- -### Logging to wandb - -Every run logs to wandb with: group name, seed, all config as hyperparams, final eval metric (SI = TPR - FPR). - -Cache locally as parquet to avoid slow API calls on every analysis: - -```py -# download_wandb.py pattern: -# 1. Load cached parquet (if exists) -# 2. Find latest cached run date, subtract safety margin (1 day) -# 3. Fetch only runs newer than that -# 4. Merge (diagonal concat, dedup on run_id) -# 5. Save back to parquet + TSV -# -# Also downloads output.log per run for post-hoc log diagnosis. -``` - -### Analysis: within-group z-scores -> t-stat - -The core insight: don't compare raw SI across groups (different base configs, different dates). Compare *within* each group, then aggregate stability across seeds. - -```py -# analyze_results.py pseudocode: - -for group in groups: - for seed in seeds_in_group: - # 1. Collect all SI values for this (group, seed) combo - si_values = {param_value: SI for runs matching (group, seed, param)} - - # 2. Compute within-(group,seed) z-score - mu = mean(si_values) - sigma = std(si_values) - z[value] = (si[value] - mu) / sigma - # This normalizes out seed-level baseline differences - - # 3. Aggregate z-scores across seeds for each param value - for value in param_values: - mean_z = mean(z[value] across seeds) - std_z = std(z[value] across seeds) - t_stat = mean_z / (std_z / sqrt(n_seeds)) - # t_stat >> 2: reliably better across seeds - # t_stat ~ 0: no consistent effect - # t_stat << -2: reliably worse - - # 4. Also compute linear trend (Pearson r) for numeric params - # r > 0: more is better. t_stat on r tests reliability. -``` - -### Interpreting results - -| Metric | What it tells you | -|--------|-------------------| -| SI_mean | Raw effect size (higher = better behavioral control) | -| si_q10, si_q90 | Spread. Wide = seed-sensitive. | -| t_stat | Cross-seed reliability. \|t\| > 2 with 4+ seeds is meaningful. | -| linear r | Monotonic trend. 
r near +1/-1 with significant t_stat = dose-response. | -| "Also varies" warning | Confounders. Can't attribute effect to this param alone. | - -**What you're looking for**: high SI_mean *and* strong t_stat (reliable). A value with SI_mean=20 but t_stat=0.5 is a lucky seed. A value with SI_mean=10 but t_stat=4.0 is a real (if modest) effect. - -### Common pitfalls - -- **Stale cache**: always `download_wandb.py` before analyzing. Stale cache hides new groups. -- **Cross-group comparisons**: different groups have different base configs. "Group A's best value vs Group B's best value" is apples-to-oranges. Compare within groups. -- **n_seeds=1**: t_stat is NaN. You have one data point. Replicate before concluding. -- **Too many params varied**: if a sweep varies 3 params simultaneously, effects are confounded. Split into separate sweeps. -- **Interpreting NaN SI**: usually means eval crashed or the model diverged. Investigate the run log, don't just skip it. -- **"Fill" sweeps**: if a sweep is 13/16 runs done (missing a seed), run the missing seed in a separate group with a clear name (e.g. `sweep-coh-tau-fill`). The analysis script treats it as a separate group, so you merge mentally. - -### The full workflow - -```bash -# 1. Design sweep: write justfile recipe with hypothesis -# 2. Run it -just sweep-rotation-type -# 3. Wait for completion, then: -uv run python scripts/download_wandb.py -uv run python scripts/analyze_results.py --after $(date +%Y-%m-%d) -# 4. Read the output: -# - Group Summary table: SI_mean, n_seeds per group -# - Param tables: per-value SI with t_stat -# - Linear trends: dose-response for numeric params -# 5. Record findings in research journal -# 6. Update default config if result is clear and reliable -``` - ---- - -## Part 5: Diagnosing "Why Won't This Metric Move?" - -A structured decision tree for when a metric is stuck. Applies to any training scenario where a quantity you're optimizing plateaus. 
- -### Step 1: Is the gradient nonzero at the metric level? - -```py -metric_val = torch.tensor(current_value, requires_grad=True) -loss = loss_fn(metric_val) -loss.backward() -print(f"d(loss)/d(metric) = {metric_val.grad}") -``` - -- If ~0: the loss doesn't care about this metric at the current operating point. Likely saturated (log1p of huge value), in a dead zone, or the metric is disconnected from the loss. -- If large: the loss IS trying to move it. Problem is downstream. - -### Step 2: Can the parameter change the metric? - -Trace the chain rule: `loss -> metric -> ... -> parameter`. The metric is a function of intermediate quantities, which are functions of learned parameters. Check `d(metric)/d(parameter)`: - -- Analytically: is there a structural reason this derivative is ~0? (e.g., a rotation of V can't change span(U)) -- Empirically: disable the loss term entirely (set coefficient to 0). Does the metric reach the same value? If yes, it's structural, and the optimization never moved it in the first place. - -### Step 3: Is something else fighting it? - -If gradient is nonzero and the parameter CAN change the metric: -- Check competing loss terms: compute gradient contribution from each loss component separately. If two terms have opposite-sign gradients on the same parameter, they cancel. -- Check optimizer state: AdamW momentum from earlier training may resist direction changes. Try resetting optimizer state or using a warmup schedule. -- Check conditioning: if the metric requires coordinated changes across many parameters (e.g., rotating multiple layers simultaneously), the gradient per-parameter may be too small even though the aggregate signal is large. - -### Decision table - -| d(loss)/d(metric) | d(metric)/d(param) | Same value without loss term? | Diagnosis | -|---|---|---|---| -| ~0 | any | any | Loss saturated or disconnected. Change loss formula. | -| large | ~0 | yes | Structural ceiling. Change parameterization. 
| -| large | large | no | Competing losses or optimizer inertia. Isolate. | -| large | large | yes | The term helps but converges to same basin. Coincidence or weak effect. | - -Note this is just a guide and in no way authoritative; it might not apply to your project. - ---- - -## Part 6: LLM Debugging Playbook - -Concrete procedures for an LLM agent debugging ML code. Work top-to-bottom: static analysis first, then diagnostics, then the decision tree. Don't skip to hyperparameter suggestions. - -### 6.1 Static analysis: grep for silent bugs - -> See [refs/static_analysis.md](refs/static_analysis.md) for the full list of grep patterns. Categories: shape mismatches, autograd breakers, train/eval mode, in-place ops, double softmax, optimizer step ordering, broadcasting traps, wrong loss sign, frozen params, data leakage, class imbalance. - -### 6.2 Diagnostic code snippets - -> See [refs/diagnostics.md](refs/diagnostics.md) for copy-paste snippets. Includes: data pipeline sanity check, init loss check (with expected values per loss type), overfit-one-batch test, gradient flow check, NaN/Inf hooks, random input test, prime dimension trick, class imbalance check, confidence-sorted errors, weight/bias distributions. - -### 6.3 Triage decision tree - -Walk the list top-to-bottom and stop at the first question you answer "yes". - -1. Exception or traceback? Read it, fix the error, done. -2. Loss is NaN/Inf? Attach NaN hooks (6.2), find the first module producing NaN. Common causes: log(0), 0/0, exp(large); add clamp/eps. -3. Init loss wrong? (expected values in 6.2) - - Check the data pipeline (6.2) and the loss function. - - Check for double softmax (6.1). - - Check labels match the model output format. - - Random input test (6.2): same loss? -> data destroyed. - - Init loss << expected? -> data leakage (Part 7.4). -4. Can't overfit 1 batch? Run the gradient flow check (6.2). - - Any None grads? -> disconnected layer. - - All-zero grads? -> dead layer / detach. 
- - Check for autograd breakers (6.1) and optimizer step ordering (6.1). -5. Loss stuck from step 0 (but CAN overfit 1 batch)? - - LR too low? Try 10x. - - Frozen params? Check requires_grad (6.1). - - Wrong loss function? -6. Loss decreases then explodes? - - LR too high? Try 0.1x. - - Gradient clipping? Log the pre-clip norm. - - Numerical instability? (log, exp, div) -7. Train loss good, val loss bad? Overfitting, not a bug. More data, regularization, smaller model. -8. Train loss okay but metric bad? Loss-metric misalignment. Is minimizing the loss equivalent to improving the metric? (Part 5) -9. Model outputs constant? Mode collapse. Check: - - Class imbalance? Run the label count (6.2). - - All-zero init? Run the weight check (6.2). - - Dead ReLUs? (try LeakyReLU) - - Confidence-sorted errors (6.2) reveal a pattern? -10. Training slow but not stuck? Not a bug. Consider batch size (Part 1 Step 6), architecture depth/width, data quality, and so on. -11. None of the above? Read Part 1 (general) or Part 2 (RL-specific) for deeper diagnostics. Log everything (Part 1 Step 3) and pursue anomalies. - -Again this is just a guide and in no way authoritative; it might not apply to your project. - -### 6.4 LLM anti-patterns - -These are the overconfident reflexes the "calibrate" section warns about, made concrete. Every one of them changes behaviour before localizing the bug. Some people say "this is sklearn slop", or "the LLM is acting like it's tweaking hyperparameters in a hackathon, not understanding the problem". - -- Hyperparameter changes before verifying correctness. "Try reducing the learning rate" is the #1 wrong response to any training problem. Verify the code is correct first (Parts 1-2); HP tuning on buggy code wastes time. -- try/except around training code. Training should crash loudly. A caught exception hides the bug and produces silently wrong results. The one exception is checkpoint saving on KeyboardInterrupt. -- "Try a different optimizer." 
If Adam doesn't converge, the cause is almost never the optimizer choice. It's usually the loss, the data, the architecture, or a bug. -- `.detach()` or `.item()` to "fix" gradient errors. If autograd complains, the computation graph is wrong. Detaching silences the error by cutting gradient flow, so the model just stops learning from that path. Understand why autograd is complaining first. -- lr_scheduler as a cure for non-convergence. Schedulers refine convergence, they don't cause it. If the model won't learn at constant LR, a schedule won't save it. -- More layers or a bigger model. If it can't overfit one batch, more parameters won't help. The problem is gradient flow, loss, or data. -- "Normalize your data" without checking whether it already is. Run the data sanity check (6.2) first; if it's already mean~0, std~1, normalization isn't your problem. -- `float()` or `.to(dtype)` to suppress type warnings. Type mismatches are signals. A float32/float64 mismatch might mean you're mixing model weights with double-precision data. Fix the root cause. - ---- - -## Part 7: Debugging Folklore & Mental Models - -Part 6 tells you what to DO. This part tells you how to THINK. Use these frameworks when generating hypotheses, brainstorming causes, or deciding what to investigate next. - -### 7.1 Five mental models for ML debugging - -Pick the model that fits your situation. Each gives a different angle on the same problem. - -**1. Information flow: trace forward, trace backward.** -Data flows forward through the model; gradients flow backward. A bug anywhere in either direction corrupts everything downstream. When stuck: manually trace shapes and values forward from input through each layer. Then trace gradients backward from loss through each parameter. The break-point is where values go wrong. -- Forward: input -> preprocess -> embed -> layers -> head -> loss -- Backward: loss -> d(loss)/d(head) -> d(head)/d(layers) -> ... 
-> d(layer1)/d(params) -- Tool: gradient flow check (6.2), NaN hooks (6.2) - -**2. Ablation: remove things until it works.** [CS229] -Systematically remove components (regularization, augmentation, auxiliary losses, fancy layers). If removing X fixes the problem, X is the problem. If nothing helps, the bug is in the core (data or main loss). -- Start: turn off ALL regularization, augmentation, dropout, scheduling -- If it works now: add back one-at-a-time until it breaks -- If still broken: problem is in data pipeline, loss, or base architecture -- Tool: just comment things out and rerun overfit-one-batch (6.2) - -**3. Oracle substitution: replace each component with ground truth.** [CS229] -For pipeline systems (data -> features -> model -> postprocess -> metric), replace one component at a time with a perfect/oracle version. The component whose oracle gives the biggest accuracy jump is the bottleneck. -- Example: replace learned features with hand-crafted features. Big jump? -> feature learning is the problem. -- Example: replace model predictions with ground truth labels. Small jump? -> model is fine, problem is upstream (data) or downstream (metric). -- This is especially useful for multi-stage systems (NLP pipelines, detection + classification, etc.) - -**4. Bias-variance via learning curves.** [CS229, FSDL] -Plot train error and val error as a function of dataset size (or training steps). The shape tells you what to do: -- Both high (converging together): high bias. Model too simple, wrong features, or bug reducing capacity. -- Train low, val high (diverging): high variance. Overfitting. More data, regularization, smaller model. -- Both low: working. Ship it. -- Train low, val high, but val improves with more data: getting there, need more data. -- Val error flat even with 10x more data: not a data problem. Fix the model. - -**5. 
Structural ceiling: can the parameterization express what you want?** (Part 5 expands this) -Sometimes the metric is stuck not because the optimizer fails but because the architecture/parameterization literally cannot represent the desired function. Check: disable the loss term entirely. Does the metric reach the same value? If yes, the loss never moved it, and the model can't express higher values. - -### 7.2 Practitioner priors: what's usually wrong - -When you have no other information, investigate in this order. Rough estimates synthesized from [Goodfellow, FSDL, Slavv, Jones, CS231n], not measured frequencies, just practitioner consensus on what's usually wrong: - -1. **Data pipeline** (~40% of bugs). Wrong preprocessing, labels misaligned with inputs, normalization missing or wrong, train/test leakage, data loader returning stale/wrong batches. "It's almost always the data." [FSDL, Slavv] -2. **Loss function** (~20%). Wrong loss for the task, wrong sign, double softmax, loss not connected to metric, competing losses canceling gradients. -3. **Training procedure** (~15%). Wrong optimizer step order, missing zero_grad, wrong LR, frozen params, in-place ops breaking autograd. -4. **Architecture** (~10%). Too small (can't express), too deep (vanishing grads), wrong activation, missing skip connections. -5. **Hyperparameters** (~5%). LR, batch size, weight decay. Almost never the real problem if the code is buggy. -6. **Numerical issues** (~5%). NaN, overflow, underflow. Usually a symptom of something else. -7. **Environment/infrastructure** (~5%). Wrong library version, GPU memory, nondeterminism, stale cache. - -For RL specifically, add: -- **Reward scale/sign** as a top-3 issue [Henderson, Schulman]. Rescaling from [-1,1] to [0,1] or vice versa can be the entire difference. -- **Episode boundary handling** (done signals, reward discounting across resets) [Jones]. 
- -### 7.3 The debugging mindset - -Core attitudes live in the top-of-skill debugging loop (calibrate, hold several hypotheses, read a working implementation when stuck) and in Part 1 ("Assume you have a bug," "Pursue anomalies," "Loss curves are a red herring"). Here are the additional mental habits not covered there: - -MurphyJitsu pre-flight [Rahtz 2018]: before starting a run, ask "if this run fails, what's the most likely cause?" If you can name it, test for it first. It's the rationalist habit of "pre-hindsight": imagining the failure and working backward. This is the same move as naming the *likely* and *subtle* entries of the protocol's failure-mode triplet, applied before launch instead of after a crash. - -"Tricks substitute for each other" [Schulman 2017]: many normalization and regularization tricks do roughly the same thing, so stacking them adds complexity without proportional benefit. If you have three normalization schemes and the model still doesn't work, the problem isn't normalization. - -(Other attitudes live elsewhere: "think more, experiment less" with the work-log structure is in Part 1; diffing against a working implementation is the top-of-skill "When stuck, read a working implementation" section.) - -### 7.4 When to suspect the data - -Specific signal patterns that point to data problems. - -| Signal | Diagnosis | Action | -|--------|-----------|--------| -| Init loss << expected (e.g., 0.01 instead of 2.3) | Data leakage or shortcut. Model "knows" the answer at init. | Check: are labels in the input? Is test data in train? Is there a trivial feature? | -| Random input gives same loss as real input (6.2) | Data pipeline destroying information. Preprocessing too aggressive, wrong transforms, input all zeros. | Print raw data at each pipeline stage. Visualize. | -| Model predicts same class for everything | Class imbalance. 100:1 ratio = model learns "always predict majority." | Run class balance check (6.2). Use weighted loss or resample. 
| -| Val loss much worse than expected but train is fine | Distribution shift. Val set from different distribution than train. | Check: same preprocessing? Same time period? Same source? Use dual val sets [FSDL]. | -| Learning curve flat even with 10x more data | NOT a data problem. High bias. Model too simple or wrong features. | Add capacity, fix features, check for bugs reducing effective capacity. | -| Adding data makes val WORSE | Data quality issue. New data is noisier or from wrong distribution. | Inspect recent additions. Check label quality. | -| Model works on reference dataset (MNIST/CIFAR) but not yours | Your data is the problem, not the model. | Simplify your data (fewer classes, clean labels, easy examples only). Scale up gradually. [Slavv] | - -### 7.5 Batch size & learning rate folklore - -These interact in non-obvious ways. Get them wrong and training looks broken even with correct code. - -**Critical batch size** [McCandlish 2018]: there's a batch size B_crit below which doubling batch size ~halves training time (compute-efficient), and above which it doesn't help (just wastes compute). B_crit depends on the task and increases during training as the loss decreases. - -**LR must scale with batch size.** [McCandlish 2018, Goyal et al. 2017] -- Linear scaling rule (SGD): if you double batch size, double LR. [Goyal et al. 2017] -- For Adam: the scaling exponent is between 0.5 and 1 (between sqrt and linear), task-dependent. [McCandlish 2018] -- Changing batch size without adjusting LR is a common silent mistake. - -**Adam default LR = 3e-4.** [FSDL, Karpathy] -This is the "just works" starting point. If you're using Adam and haven't tuned LR, start here. Karpathy: "3e-4 is the best learning rate for Adam." - -**Big batches need warmup.** [Goyal et al. 2017] -Large batch training with high LR diverges at the start. Warm up LR linearly over the first few hundred steps. 
Without warmup, you'll see loss spike/NaN in the first epoch and think the code is broken. - -**Batch size signals:** - -| Symptom | Likely cause | -|---------|-------------| -| Training very noisy, loss oscillates | Batch too small. Gradient noise overwhelms signal. Try 4-8x larger. | -| Training smooth but slow, poor generalization | Batch too large without LR scaling. Try higher LR or smaller batch. | -| Loss spikes at start then recovers | Normal with large batch + warmup. If no warmup: add it. | -| Different results at different batch sizes (same total steps) | Missing LR scaling. Adjust LR proportionally. | - - ---- - -## Further reading - -Local evidence files (verbatim quotes behind the claims above) live in [docs/evidence/](docs/evidence/). The most useful starting points: - -General neural-net debugging: -- [Karpathy, "A Recipe for Training Neural Networks" (2019)](docs/evidence/karpathy_recipe_training_nn_2019.md) -- [CS231n, Neural Networks Part 3](docs/evidence/cs231n_neural_networks_3.md) -- [Goodfellow Ch11, Practical Methodology](docs/evidence/goodfellow_ch11_practical_methodology.md) -- [FSDL Spring 2021, Troubleshooting DNNs](docs/evidence/fsdl_spring2021_lecture7.md) -- [Slav Ivanov, "37 Reasons your NN is not working"](docs/evidence/slavv_37_reasons_nn.md) -- [CS229 ML advice](docs/evidence/cs229_ml_advice.md) - -Modern LLM / large-batch training: -- [nanochat deepwiki, LLM pretraining (2026)](docs/evidence/nanochat_deepwiki_llm_pretraining_2026.md) -- [Karpathy nn-zero-to-hero Lec 4, diagnostics](docs/evidence/karpathy_nn_zero_to_hero_lec4_diagnostics.md) -- [McCandlish & Kaplan, large-batch training (2018)](docs/evidence/mccandlish_2018_large_batch.md) - -RL-specific: -- [Schulman, "Nuts and Bolts of Deep RL"](docs/evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md) -- [Andy Jones, RL debugging](docs/evidence/andyljones_rl_debugging.md) -- [Amid Fish, "Lessons from reproducing deep 
RL"](docs/evidence/amid_fish_reproducing_deep_rl.md) -- [Alex Irpan, "Deep RL Doesn't Work Yet"](docs/evidence/alexirpan_rl_hard.md) -- [Henderson et al., "Deep RL that Matters" (2018)](docs/evidence/henderson_2018_deep_rl_matters.md) - -Plus several practitioner reddit threads and a few more author notes in the same directory. +## Appendix: deeper tricks + +Look these up when the symptom calls for them; they're kept out of the main flow on purpose. + +- [refs/loss_surface.md](refs/loss_surface.md) — visualize a loss surface and its gradient field with synthetic tensors, no model or GPU. For when a custom loss misbehaves. +- [refs/metric_stuck.md](refs/metric_stuck.md) — "why won't this metric move?" plus the structural-ceiling check (is the optimizer failing, or can the parameterization not express it?). +- [refs/sweeps.md](refs/sweeps.md) — same-seed paired comparison and cross-seed t-stat reliability, so a result is "reliably better" not "a lucky seed." +- [refs/static_analysis.md](refs/static_analysis.md) — grep patterns for silent bugs (shape mismatches, autograd breakers, double softmax, step ordering, leakage). +- [refs/diagnostics.md](refs/diagnostics.md) — copy-paste diagnostic snippets (init-loss check, overfit-one-batch, gradient-flow check, NaN hooks, class-imbalance check). +- [rl/SKILL.md](rl/SKILL.md) — RL-specific debugging: probe environments, reward engineering, HP defaults, reference implementations. +- [pinn/SKILL.md](pinn/SKILL.md) — physics-informed-network debugging: nondimensionalization, gradient pathologies, curriculum. 
+ +## Links and further reading + +Folklore sources (the quotes above trace to these): + +[^jones]: Andy Jones, "Debugging RL, Without the Agonizing Pain" — https://andyljones.com/posts/rl-debugging.html ([cache](docs/evidence/andyljones_rl_debugging.md): anomalies L103-109, write-from-scratch L155, assume-bug L176-180, raise-threshold L182, loss-curve L186-188) +[^rahtz]: Matthew Rahtz (Amid Fish), "Lessons Learned Reproducing a Deep RL Paper" — http://amid.fish/reproducing-deep-rl ([cache](docs/evidence/amid_fish_reproducing_deep_rl.md): frame-diff confusion L85-87, investigate-confusion L100-102, think-more L145-153, don't-implement-RL-yourself L497-501) +[^karpathy-recipe]: Andrej Karpathy, "A Recipe for Training Neural Networks" (2019) — https://karpathy.github.io/2019/04/25/recipe/ ([cache](docs/evidence/karpathy_recipe_training_nn_2019.md): inspect-data L26+L32, fixed-seed L39, overfit-one-batch L51, Adam-3e-4 L73; note: this is an abridged note with its own "..." elisions) +[^karpathy-3e4]: Andrej Karpathy, tweet, 23 Nov 2016: "3e-4 is the best learning rate for Adam, hands down." 
— https://x.com/karpathy/status/801621764144971776 (he confirmed in-thread it was a joke; not in the local evidence files, verified against the tweet) +[^schulman]: John Schulman, "Nuts and Bolts of Deep RL Research" slides — http://joschu.net/docs/nuts-and-bolts.pdf ([cache](docs/evidence/joschu_nuts_and_bolts.md): Always-Be-Ablating L98-101, standardize-observations L118-125; rendered as bullets because the PDF source is slide fragments) +[^henderson]: Henderson et al., "Deep Reinforcement Learning that Matters" (AAAI 2018) — https://arxiv.org/abs/1709.06560 ([cache](docs/evidence/henderson_2018_deep_rl_matters.md): seeds-create-different-distributions L235, implementation-differences L251) +[^irpan]: Alex Irpan, "Deep Reinforcement Learning Doesn't Work Yet" (2018) — https://www.alexirpan.com/2018/02/14/rl-hard.html ([cache](docs/evidence/alexirpan_rl_hard.md): variance-bug-or-unlucky L674-678, seed-canary L705-707) +[^cs231n]: Stanford CS231n, "Neural Networks Part 3" — https://cs231n.github.io/neural-networks-3/ ([cache](docs/evidence/cs231n_neural_networks_3.md): overfit-tiny-subset L89) +[^slavv]: Slav Ivanov, "37 Reasons why your Neural Network is not working" (2017) — https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607 ([cache](docs/evidence/slavv_37_reasons_nn.md): opening anecdote L19, emergency checklist L45-51) +[^fsdl]: Josh Tobin, Full Stack Deep Learning Spring 2021, Lecture 7 "Troubleshooting DNNs" — https://fullstackdeeplearning.com/spring2021/lecture-7/ ([cache](docs/evidence/fsdl_spring2021_lecture7.md)) +[^goodfellow]: Goodfellow, Bengio, Courville, *Deep Learning*, ch. 
11 "Practical Methodology" — https://www.deeplearningbook.org/ ([cache](docs/evidence/goodfellow_ch11_practical_methodology.md): one-part-broken-others-adapt L198, weights-adapt-to-compensate L204) +[^cs229]: Andrew Ng, CS229 "Advice for Applying Machine Learning" — https://cs229.stanford.edu/ ([cache](docs/evidence/cs229_ml_advice.md)) +[^mccandlish]: McCandlish, Kaplan et al., "An Empirical Model of Large-Batch Training" (2018) — https://arxiv.org/abs/1812.06162 ([cache](docs/evidence/mccandlish_2018_large_batch.md)) +[^goyal]: Goyal et al., "Accurate, Large Minibatch SGD" (2017) — https://arxiv.org/abs/1706.02677 + +For modern transformer pretraining specifically (the sources above predate it), see [Karpathy's recipe](https://karpathy.github.io/2019/04/25/recipe/) and the [nanochat deepwiki](https://deepwiki.com/karpathy/nanochat) (320+ empirical HP sweeps for a GPT-2-scale run). Most multi-source claims trace to quotes in [docs/ml_debug_folklore.argdown](docs/ml_debug_folklore.argdown) (vargdown); the full evidence set is in [docs/evidence/](docs/evidence/). diff --git a/SKILL2.md b/SKILL2.md deleted file mode 100644 index 5618710..0000000 --- a/SKILL2.md +++ /dev/null @@ -1,74 +0,0 @@ ---- -name: ml-debug -description: "Wassname's practical folklore for debugging ML systems: convergence, gradients, loss surfaces, sweeps, same-seed comparisons. Use when stuck on training, designing sweeps, or analyzing results. Condensed anchor; deep tables and methodology are linked." ---- - -# ML debugging folklore (anchor) - -Condensed from Schulman's "Nuts and Bolts", Andy Jones' debugging guide, Karpathy's recipe, r/reinforcementlearning, competition write-ups, and personal experience. The tables, triage tree, and sweep methodology are one hop away (see Reference). This page is the part that changes how you debug. - -## Calibrate first - -If you're an LLM agent, start by calibrating yourself. 
ML research code is often outside your training distribution: novel losses, custom architectures, methods with no canonical right answer you've seen a thousand times. Your trained reflex is to be fast and confident, pattern-matching a symptom to a fix ("loss stuck -> drop the LR") and applying it. On possibly-buggy research code that reflex burns a run and corrupts the evidence you need to find the real cause. Slow down and widen your hypotheses before you touch anything. - -Two moves the model skips by default, and they are the highest-leverage ones: - -- Assume you have a bug [Jones]. A bug is faster to find than a new architecture is to validate, and healthy components adapt to mask a broken one. Raise your bar for "this is correct". -- When stuck, read a working implementation. After a cycle or two that doesn't localize the bug, stop guessing and diff your math, computation graph, and hyperparameters against code that already runs. Rank candidates by trust signal (adoption > papers citing it > code that runs > author reputation). More in [reading working code](SKILL.md#when-stuck-read-a-working-implementation). 
- -## The loop - -```py -# ── ML debugging loop ──────────────────────── -def debug(symptom): - clues ← collect(traceback, logs, static_analysis, cheap_diagnostics) # look before theorizing - H ← generate(clues, lenses=5) | {likely, subtle, null} # ≥3 genuinely different - prior ← anchor(H) # base rates: data .40 loss .20 train .15 arch .10 hp .05 - - while not localized: - # the cheapest test whose outcome the hypotheses disagree on - t̂ ← argmax(divergence(predict(h, t) for h in H) / cost(t) for t in candidates) - obs ← run(t̂) # one log line or toy run; keep obs apart from inference - prior ← update(prior, obs) - H ← bisect_path(H, obs) # halve the search space each probe - if cycles ≥ 2: - return read_working_code() - - fix(root_cause); assert reproduces(obs) # no silent fallback; crash if it doesn't -``` - -In words: - -- Collect clues before theorizing. Read the traceback and logs, run [static analysis](refs/static_analysis.md) and the [cheap diagnostics](refs/diagnostics.md): data sanity, init-loss, overfit-one-batch. -- Hold several hypotheses. Generate a few genuinely different explanations, then attach a failure-mode triplet: likely (your strongest competitor), subtle (sample size, leakage, a confound, seed variance, and so on), null (effect is noise, or came from something else you also changed). Give each a prior and its cheapest falsifier (a `Check:` line). -- Run the most discriminating cheap test. Forward-predict each hypothesis ("what would I see if this were the cause?"); a test is strong evidence only where the predictions diverge. Weigh the learning against code and GPU cost. -- Bisect to localize. Data flows forward and gradients flow backward, so probe the midpoint and ask whether the value or gradient is already wrong halfway through. Each probe halves the search space. -- Act only on what the observation pointed to. - -## A few non-obvious numbers - -The model already recalls most symptom-to-cause pairs, so they don't earn space here. 
These are the ones it tends to get wrong, worth holding in context: - -- Init loss at chance is -ln(1/k) for a k-class softmax (2.30 at k=10). Far above means a loss or init bug; far below suggests data leakage. -- Update-to-data ratio per layer near 1e-3: `((lr*p.grad).std()/p.data.std()).log10()` around -3 [Karpathy]. -- Normalize inputs with running stats over all data seen so far; a recent-only window silently shifts the input distribution [Schulman]. -- Adam starting LR 3e-4 [Karpathy]. LR scales with batch size, between sqrt and linear for Adam [McCandlish]. -- For physics/PINNs, nondimensionalize before training so PDE coefficients are O(1). - -## Reference (load when you need the menu) - -Each of these is a menu of hypotheses to widen your search, accurate to its own setting, so read the relevant one when the task calls for it rather than up front: - -- [Triage decision tree](SKILL.md#63-triage-decision-tree): symptom to first checks, top to bottom. -- [Symptom and gradient tables](SKILL.md#part-1-general-ml-debugging): loss-curve patterns, gradient health, numerical hygiene. -- [Loss-surface and gradient analysis](SKILL.md#part-3-loss-surface--gradient-analysis-no-model-required): visualize a loss before training. -- [Why won't this metric move?](SKILL.md#part-5-diagnosing-why-wont-this-metric-move): structural ceiling vs competing losses. -- [Sweeps and statistics](SKILL.md#part-4-experiment-sweeps--statistical-analysis): same-seed comparisons, within-group z-scores, t-stats. -- [Mental models and priors](SKILL.md#part-7-debugging-folklore--mental-models): five hypothesis-generating lenses, when to suspect data. -- Domain sub-skills: RL [rl/SKILL.md](rl/SKILL.md), PINNs [pinn/SKILL.md](pinn/SKILL.md). - -Scope: most sources are 2017-2021. The mindset and loop are durable; specific RL defaults and reward-scaling advice may have moved on. 
For modern transformer pretraining see [Karpathy's recipe](https://karpathy.github.io/2019/04/25/recipe/) and [nanochat deepwiki](https://deepwiki.com/karpathy/nanochat). - -## Sources - -Claims trace to verbatim quotes in [docs/evidence/](docs/evidence/) via the map in [docs/ml_debug_folklore.argdown](docs/ml_debug_folklore.argdown). Starting points: [Karpathy recipe](docs/evidence/karpathy_recipe_training_nn_2019.md), [CS231n](docs/evidence/cs231n_neural_networks_3.md), [Goodfellow Ch11](docs/evidence/goodfellow_ch11_practical_methodology.md), [Schulman Nuts and Bolts](docs/evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md), [Andy Jones](docs/evidence/andyljones_rl_debugging.md). diff --git a/refs/loss_surface.md b/refs/loss_surface.md new file mode 100644 index 0000000..5537b21 --- /dev/null +++ b/refs/loss_surface.md @@ -0,0 +1,70 @@ +# Loss surface & gradient analysis (no model required) + +Appendix to the [ML Debugging skill](../SKILL.md). A trick worth reaching for when a *loss* (not the whole model) is misbehaving: visualize its surface and gradient flow directly, feeding synthetic tensors into the loss sub-components. No model, forward pass, or GPU, just the math. Five minutes of plotting often saves hours of squinting at training curves. + +When you'd look this up: a new or custom loss behaves oddly; a metric is stuck and you suspect the loss shape; you just changed a loss formula and want to confirm gradients still flow at the operating point (not just at init); you're comparing two loss variants and want to see their gradient fields side by side. + +## The method + +1. Identify each loss sub-component as a function of its immediate inputs. +2. Pick 1-2 axes that matter (the "natural axes" you reason about when you think about the loss). +3. Grid over those axes, feed through the loss, call `.backward()`, collect gradients. +4. Plot: contour heatmap + quiver overlay (negative gradient = the direction the optimizer moves). +5. 
Build a summary table: component x representative_input -> loss_value, grad_value. Flag zero or non-finite gradients. + +```py +# ── 2D loss surface with gradient quiver ────── +def analyze_component(loss_fn, x_range, y_range, n=80): + xs = torch.linspace(*x_range, n) + ys = torch.linspace(*y_range, n) + X, Y = torch.meshgrid(xs, ys, indexing='ij') + x_flat = X.flatten().requires_grad_(True) + y_flat = Y.flatten().requires_grad_(True) + + losses = loss_fn(x_flat, y_flat) # vectorized, returns (n*n,) + losses.sum().backward() + + loss_grid = losses.detach().reshape(n, n) + gx = x_flat.grad.reshape(n, n) + gy = y_flat.grad.reshape(n, n) + + # contourf(X, Y, loss_grid) + quiver(X, Y, -gx, -gy) + # negative gradient = direction optimizer moves + +# ── Gradient flow verification table ────────── +# For each component, evaluate at representative inputs +# (zero, small, converged, degenerate). Report loss + grad. +# Flag: zero grad (dead zone), non-finite (numerical issue). +# +# | Component | Param | Input | Loss | Grad | +# |-----------------|---------|--------------|----------|----------| +# | barrier_penalty | v | v=0.0 | +0.000 | +0.000 | <-- zero grad! +# | barrier_penalty | v | v=0.5 | +12.50 | +50.00 | +# | pair_loss | dot_pos | (0.3, -0.3) | -2.340 | -3.000 | +# | pair_loss | dot_neg | (0.3, -0.3) | -2.340 | +3.000 | <-- antisym, good +# | pair_loss | dot_pos | (0.0, 0.0) | +0.000 | +0.000 | <-- dead at init! 
+``` + +## What to look for + +| Pattern | Meaning | Action | +|---------|---------|--------| +| Gradient arrows point toward desired region | Loss is well-shaped | Ship it | +| Large flat region (zero gradient) | Dead zone: optimizer stuck if it lands here | Add curvature, change init, or reparameterize | +| Gradient magnitude 1000x larger on one axis than another | Imbalanced: one axis dominates | Rescale, use log-space, or normalize | +| Saddle point at origin | Common with product-form losses (A*B) | Switch to additive (log A + log B) for independent gradients | +| Arrows point away from desired region | Loss is wrong or has an unexpected local min | Rethink the formula | +| Non-finite values in a region | Numerical issue (log(0), 0/0) | Add eps, clamp, or use log1p | + +## The log-space decomposition trick + +When your loss is a product of factors A*B and one factor can be near zero: + +``` +# BAD: symlog(A * B), when B~0 the chain rule gives 0 grad to A too +# GOOD: sign * (log|A| + log|B|) gives independent gradients +# d/dA = 1/A regardless of B +# d/dB = 1/B regardless of A +``` + +General principle: if you want gradient to flow independently through two factors, turn the product into a sum in log space (log|A| + log|B|), so each factor's gradient no longer depends on the other. diff --git a/refs/metric_stuck.md b/refs/metric_stuck.md new file mode 100644 index 0000000..4384a8c --- /dev/null +++ b/refs/metric_stuck.md @@ -0,0 +1,57 @@ +# Why won't this metric move? + +Appendix to the [ML Debugging skill](../SKILL.md). When a quantity you're optimizing plateaus, these are ideas for telling *why*, not a flowchart to obey. They apply to most training setups, but they're suggestions; your project may not fit them. + +The useful split is three questions, cheapest first. + +## 1. Is the gradient nonzero at the metric level? 
+ +```py +metric_val = torch.tensor(current_value, requires_grad=True) +loss = loss_fn(metric_val) +loss.backward() +print(f"d(loss)/d(metric) = {metric_val.grad}") +``` + +- ~0: the loss doesn't care about this metric at the current operating point. Maybe saturated (log1p of a huge value), in a dead zone, or the metric is disconnected from the loss. +- large: the loss is trying to move it. The problem is downstream. + +## 2. Can the parameter even change the metric? + +Trace the chain `loss -> metric -> ... -> parameter`. The metric is a function of intermediate quantities, which are functions of learned parameters. Look at `d(metric)/d(parameter)`: + +- Analytically: is there a structural reason this derivative is ~0? (e.g. a rotation of V can't change span(U).) +- Empirically: disable the loss term (set its coefficient to 0). Does the metric reach the same value anyway? If yes, the optimization never moved it; it's a structural ceiling, and you need a different parameterization, not a different loss weight. + +## 3. Is something else fighting it? + +If the gradient is nonzero and the parameter *can* change the metric: + +- Competing loss terms: compute each component's gradient on the shared parameter separately. Opposite-sign gradients cancel. +- Optimizer state: AdamW momentum from earlier training can resist a direction change. Try resetting optimizer state or a warmup. +- Conditioning: if the metric needs coordinated changes across many parameters (rotating several layers at once), the per-parameter gradient may be too small even when the aggregate signal is large. + +## A rough map (a guide, not a verdict) + +| d(loss)/d(metric) | d(metric)/d(param) | Same value with the term off? | Reading | +|---|---|---|---| +| ~0 | any | any | Loss saturated or disconnected; reconsider the loss formula. | +| large | ~0 | yes | Structural ceiling; reconsider the parameterization. | +| large | large | no | Competing losses or optimizer inertia; isolate them. 
| +| large | large | yes | The term helps but converges to the same basin; weak effect or coincidence. | + +## Structural-ceiling check, concretely + +```py +# 1. Is d(loss)/d(metric) large? If so, the optimizer IS trying. +metric = torch.tensor(0.5, requires_grad=True) +loss = loss_fn(metric); loss.backward() +print(metric.grad) # large (e.g. 350x the other grads) => it's trying + +# 2. Can the parameter change the metric? Trace loss -> metric -> intermediate -> parameter. +# If d(metric)/d(parameter) ~ 0, the parameter structurally cannot move it. +# (e.g. a V-rotation can't change the output basis when U is fixed.) + +# 3. Confirm empirically: set the term's coefficient to 0. +# If the metric reaches the SAME value, it was never learned; it's structural. +``` diff --git a/refs/sweeps.md b/refs/sweeps.md new file mode 100644 index 0000000..c2c5bf9 --- /dev/null +++ b/refs/sweeps.md @@ -0,0 +1,32 @@ +# Sweeps: same-seed comparison and cross-seed reliability + +Appendix to the [ML Debugging skill](../SKILL.md). The general idea behind a trustworthy hyperparameter sweep, tool-agnostic. The point is the difference between "I tried it and it seemed better" and "it's reliably better across seeds." Irpan's 30% seed-failure result and Henderson's "seeds alone create statistically different distributions" (see the main skill's folklore section) are why this matters: a single lucky run proves nothing. + +## The core move: pair on seed, normalize within group, test across seeds + +1. Run the same set of seeds for every value of the parameter you're varying. Same seeds across values turns this into a paired comparison and cancels seed-level baseline differences. +2. Vary one parameter per sweep when you can (all-else-equal). If you vary two, effects confound and you can't attribute the result. +3. Within each (group, seed), z-score the metric across the parameter values. This removes the per-seed baseline offset so you compare *shapes*, not absolute levels. +4. 
Aggregate the z-scores across seeds per value, then take a t-stat: `mean_z / (std_z / sqrt(n_seeds))`. `|t| > 2` is a rough threshold for a reliable effect, but with only 4 seeds (3 degrees of freedom) the two-sided 5% critical value is about 3.2, so treat borderline values with caution; `t ~ 0` is no consistent effect. +5. For numeric parameters, also fit a linear trend (Pearson r) and t-test it: a clean dose-response is `r` near +/-1 with a significant t-stat. + +```py +for group in groups: + for seed in seeds_in_group: + vals = {param_value: metric for runs matching (group, seed, param)} + z[seed] = (vals - mean(vals)) / std(vals) # within-(group,seed) normalization + for value in param_values: + mean_z, std_z = mean(z[:, value]), std(z[:, value]) + t_stat = mean_z / (std_z / sqrt(n_seeds)) # >>2 reliably better, <<-2 reliably worse +``` + +## What you're looking for + +High effect size *and* a strong t-stat. A value with a big mean but `t=0.5` is a lucky seed; a value with a modest mean but `t=4.0` is a real (if small) effect. + +## Common pitfalls + +- `n_seeds = 1`: t-stat is undefined. One data point. Replicate before concluding anything. +- Cross-group comparisons: different groups often have different base configs, so "group A's best value vs group B's best" is apples-to-oranges. Compare within groups. +- Too many parameters varied at once: split into separate sweeps. +- Crashed / diverged runs showing as missing or NaN metrics: investigate the run, don't silently drop it; a divergence is itself a finding.
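
## A runnable sketch of the pair-on-seed test

The pair-on-seed procedure in this appendix (z-score within each seed, then a t-stat across seeds) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the author's tooling: the function name `paired_sweep_tstats`, the `{seed: {param_value: metric}}` layout, and the toy metric values are all hypothetical, and it handles a single group with at least two parameter values per seed.

```python
import math
from statistics import mean, stdev


def paired_sweep_tstats(runs):
    """Within-seed z-score, then a one-sample t-stat across seeds.

    runs: {seed: {param_value: metric}} for ONE group (hypothetical layout).
    Every seed should cover the same parameter values (at least two).
    Returns {param_value: (mean_z, t_stat)}.
    """
    z_by_value = {}
    for seed, vals in runs.items():
        metrics = list(vals.values())
        mu, sd = mean(metrics), stdev(metrics)
        for value, metric in vals.items():
            # A seed where the parameter changed nothing contributes z = 0.
            z = (metric - mu) / sd if sd > 0 else 0.0
            z_by_value.setdefault(value, []).append(z)

    out = {}
    for value, zs in z_by_value.items():
        n = len(zs)
        if n < 2:
            # One seed: the t-stat is undefined, so report NaN rather than guess.
            out[value] = (zs[0], float("nan"))
            continue
        m, s = mean(zs), stdev(zs)
        if s > 0:
            t = m / (s / math.sqrt(n))
        else:
            # All seeds agree exactly: zero effect or an unbounded t-stat.
            t = 0.0 if m == 0 else math.copysign(math.inf, m)
        out[value] = (m, t)
    return out


# Toy numbers, invented for illustration: 4 seeds x 3 learning rates.
runs = {
    0: {"lr=1e-4": 0.70, "lr=1e-3": 0.80, "lr=1e-2": 0.60},
    1: {"lr=1e-4": 0.50, "lr=1e-3": 0.62, "lr=1e-2": 0.45},
    2: {"lr=1e-4": 0.90, "lr=1e-3": 0.97, "lr=1e-2": 0.78},
    3: {"lr=1e-4": 0.65, "lr=1e-3": 0.78, "lr=1e-2": 0.52},
}
result = paired_sweep_tstats(runs)
for value, (mean_z, t_stat) in result.items():
    print(f"{value}: mean_z={mean_z:+.2f}  t={t_stat:+.2f}")
```

The absolute metric levels differ a lot between seeds in the toy data, yet the ordering of the three values is the same in every seed; the within-seed z-score is what lets the t-stat pick that up. Crashed or diverged runs are deliberately not handled here: per the pitfalls above, those should be investigated rather than filtered out before calling a function like this.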