restructure: quotes-first SKILL.md, synthesized playbook split out

SKILL.md is now folklore only: verbatim practitioner quotes ordered most-general-first, transformer/LLM fine-tuning entries in their own section, minimal context, links and footnotes. New sources: unsloth, axolotl (+training stability), HF course ch8.4, Bekman debug_utils (evidence frozen in docs/evidence/). The synthesized material (mental models, priors, symptom tables, agent loop, triage, anti-patterns) moves to PLAYBOOK.md, framed as menus of hypotheses rather than authoritative diagnoses. Made-up symptom tables no longer sit next to sourced quotes. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:46:04 +08:00 · 2026-06-11 14:33:32 +08:00
parent 8ee980d62f
commit fb753d093e
8 changed files with 2391 additions and 284 deletions
@@ -0,0 +1,225 @@
 # ML debugging playbook (long-form reference)
 Part of the [ml-debug skill](SKILL.md); the folklore quotes and sources live there. This file holds the synthesized side: mental models, practitioner priors, step catalogs, symptom tables, the agent debugging loop, triage, and anti-patterns. Read these as menus of hypotheses to widen a search, not as authoritative diagnoses; they are distilled from the folklore sources, not quoted from them.
 ## Mental models
 How to *think* when generating hypotheses or deciding what to investigate next. Pick the lens that fits; each gives a different angle on the same problem.
 **1. Information flow: trace forward, trace backward.** Data flows forward through the model, gradients flow backward. A bug anywhere in either direction corrupts everything downstream. Manually trace shapes and values forward from input through each layer, then trace gradients backward from loss through each parameter. The break-point is where values first go wrong.
 - Forward: input -> preprocess -> embed -> layers -> head -> loss
 - Backward: loss -> d(loss)/d(head) -> d(head)/d(layers) -> ... -> d(layer1)/d(params)
 **2. Ablation: remove things until it works.**[^cs229] Systematically remove components (regularization, augmentation, auxiliary losses, fancy layers). If removing X fixes it, X was the problem. If nothing helps, the bug is in the core (data or main loss). Start by turning off ALL regularization/augmentation/dropout/scheduling; if it works, add back one at a time until it breaks.
 **3. Oracle substitution: replace each component with ground truth.**[^cs229] For pipeline systems (data -> features -> model -> postprocess -> metric), swap one component for a perfect version. The component whose oracle gives the biggest jump is the bottleneck. Replace model predictions with ground-truth labels and the metric barely moves? The model's fine; the problem is upstream (data) or downstream (metric).
 **4. Bias-variance via learning curves.**[^cs229][^fsdl] Plot train and val error vs dataset size (or steps). Both high and converging together = high bias (too simple, wrong features, or a capacity-reducing bug). Train low, val high = high variance (overfitting). Val flat even with 10x more data = not a data problem, fix the model.
 **5. Structural ceiling: can the parameterization even express what you want?** Sometimes a metric is stuck not because the optimizer fails but because the architecture literally cannot represent the target. Quick check: disable the loss term entirely; if the metric reaches the same value, the loss never moved it. Worked example in [refs/metric_stuck.md](refs/metric_stuck.md).
 ### Practitioner priors: what's usually wrong
 With no other information, investigate in this order. Rough consensus from the folklore sources, not measured frequencies, and only a starting weight (a clue that points elsewhere overrides them outright):
 1. **Data pipeline** (~40%). Wrong preprocessing, labels misaligned with inputs, missing/wrong normalization, train/test leakage, a loader returning stale batches. It really is usually the data.[^slavv][^fsdl]
 2. **Loss function** (~20%). Wrong loss for the task, wrong sign, double softmax, loss disconnected from the metric, competing losses canceling.
 3. **Training procedure** (~15%). Wrong optimizer step order, missing `zero_grad`, frozen params, in-place ops breaking autograd.
 4. **Architecture** (~10%). Too small to express it, too deep without skips, wrong activation.
 5. **Hyperparameters** (~5%). LR, batch size, weight decay. Almost never the real problem if the code is buggy.
 6. **Numerical** (~5%). NaN, overflow, underflow, usually a symptom of one of the above.
 7. **Environment** (~5%). Library version, GPU memory, nondeterminism, stale cache.
 For RL, add reward scale/sign as a top-3 issue, and episode-boundary handling (done signals, discounting across resets).
 ### When to suspect the data
 | Signal | Likely meaning | Check |
 |--------|----------------|-------|
 | Init loss << expected (e.g. 0.01 vs 2.3) | Leakage or a shortcut: the model "knows" the answer at init | Are labels in the input? Is test data in train? A trivial feature? Localize with the NaN-poisoning tracer or backprop-to-input check ([refs/diagnostics.md](refs/diagnostics.md)) |
 | Random input gives the same loss as real input | Pipeline is destroying information (over-aggressive preprocessing, wrong transforms, all-zero input) | Print raw data at each stage; visualize |
 | Predicts the same class for everything | Class imbalance (100:1 -> "always predict majority") | Label-count check; weighted loss or resample |
 | Val much worse than train from the start | Distribution shift between splits | Same preprocessing? Same time period? Same source? |
 | Learning curve flat even with 10x data | NOT data: high bias | Add capacity, fix features, check for capacity-reducing bugs |
 | Adding data makes val worse | Data-quality issue: new data noisier or off-distribution | Inspect recent additions, check label quality |
 | Works on MNIST/CIFAR but not your set | Your data is the problem | Simplify your data (fewer classes, clean labels), scale up gradually[^slavv] |
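The label-count check named in the table can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the snippet from refs/diagnostics.md: the function name `label_report`, the `warn_ratio` threshold, and the toy labels are all assumptions made for the example.

```python
from collections import Counter

def label_report(labels, warn_ratio=10.0):
    """Count labels and flag imbalance. warn_ratio is an illustrative threshold."""
    counts = Counter(labels)
    most = max(counts.values())
    least = min(counts.values())
    majority_class, _ = counts.most_common(1)[0]
    return {
        "counts": dict(counts),
        "majority_class": majority_class,
        "imbalance_ratio": most / least,
        # accuracy of a model that always predicts the majority class
        "majority_baseline_acc": most / len(labels),
        "imbalanced": most / least >= warn_ratio,
    }

# Toy labels (made up): 990 negatives, 10 positives.
labels = [0] * 990 + [1] * 10
report = label_report(labels)
print(report)
```

If a model's accuracy sits at `majority_baseline_acc`, "predicts the same class for everything" is a live hypothesis worth checking before anything else.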
 ---
 ## Part 1: General ML debugging
 A catalog of small, well-worn checks, in rough dependency order (each assumes the one before). Pull from it; don't run it end-to-end as a ritual.
 **Step 1: Verify components in isolation.**[^goodfellow][^cs229] Most bugs are "doing the wrong calculation." Test each piece independently.
 - Forward pass: feed known inputs, check output shapes and ranges. `assert` shapes everywhere, since `(None,)` vs `(None, 1)` silently broadcasts into `(None, None)`. (Or make the shapes runtime-checked contracts with jaxtyping[^jaxtyping] + beartype, which turns the #1 silent bug loud.)
 - Loss: hand-compute a few targets and compare to code output.
 - Data pipeline: sample a batch, print it, eyeball it. Are labels aligned with inputs? Transforms applied correctly?
 - Preprocessing: look at processed inputs as a human. Can *you* solve the task from them?
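The silent-broadcast bug from the forward-pass bullet is easy to reproduce. The sketch below uses NumPy rather than a deep-learning framework (the broadcasting rule is the same); the array sizes and the `mse` helper are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
pred = rng.normal(size=(4, 1))   # model output, shape (N, 1)
target = rng.normal(size=(4,))   # labels, shape (N,)

# (N, 1) - (N,) broadcasts to (N, N): every prediction against every target.
bad_diff = pred - target
good_diff = pred.squeeze(-1) - target   # shape (N,), the intended pairing

bad_loss = float((bad_diff ** 2).mean())    # a finite, plausible-looking number
good_loss = float((good_diff ** 2).mean())

def mse(pred, target):
    """MSE that refuses to broadcast: shapes must match exactly."""
    assert pred.shape == target.shape, f"shape mismatch: {pred.shape} vs {target.shape}"
    return float(((pred - target) ** 2).mean())

print(bad_diff.shape, good_diff.shape)
```

Nothing crashes in the bad path, which is the point: the loss is finite and trains toward the wrong objective. The shape assert turns it into an immediate failure.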
 **Five most common deep-learning bugs**[^fsdl]: (1) tensor shapes that fail silently via broadcasting, (2) preprocessing inputs incorrectly (wrong normalization, over-augmentation), (3) wrong loss function or wrong sign, (4) forgetting train vs eval mode (dropout/batchnorm differ), (5) numerical instability (NaN from log(0), overflow, vanishing grads).
 **Step 2: Get signs of life on a toy problem, and overfit one batch.**[^cs231n][^fsdl] Before the real task, solve something trivial with the same codebase so you know what "healthy" looks like. Then overfit a tiny batch (see the folklore in [SKILL.md](SKILL.md)). Start with a lightweight implementation (<200 lines of new code), no fancy data pipeline; build that later once the core works.
 **Baseline ladder** (for physics/simulation models, each step must beat the previous):
 1. Persistence: y(t) = y(t-1). Does the model capture *any* dynamics?
 2. Exponential decay to steady state (first-order fit).
 3. Linear state-space / OLS on finite differences.
 4. Pure-data MLP (same architecture, no physics). If a PINN can't beat this, the physics constraint is hurting.
 5. Classical solver, fixed parameters (scipy `solve_bvp`, ODE).
 6. Classical solver, fitted parameters.
 7. Then and only then: PINN / learned physics.
 Make complexity pay rent: every added component (physics, dimensions, losses) should improve a metric you care about, or come out.
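The first rung of the ladder, persistence, can be sketched as below. The series and the "model" predictions are made-up numbers for illustration, and `persistence_forecast` is a hypothetical helper name.

```python
def mse(pred, true):
    """Mean squared error over two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

def persistence_forecast(series):
    """Predict y(t) = y(t-1); returns predictions for t = 1..T-1."""
    return series[:-1]

series = [0.0, 1.0, 2.0, 3.0, 4.0]        # toy ramp (made up)
targets = series[1:]

baseline_mse = mse(persistence_forecast(series), targets)

model_pred = [0.9, 2.1, 2.9, 4.2]         # hypothetical model output for the same targets
model_mse = mse(model_pred, targets)

beats_baseline = model_mse < baseline_mse
print(baseline_mse, model_mse, beats_baseline)
```

Each later rung slots into the same comparison: compute its error on the same targets and require it to beat the rung below.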
 **Step 3: Log everything, then look for specific pathologies.**[^goodfellow][^rahtz][^cs231n] Log train+val loss (per-component if multi-objective), gradient norms per module, learning rate, parameter-update magnitudes, the update-to-data ratio per layer (`((lr * p.grad).std() / p.data.std()).log10()`, target ~-3), activation stats (mean, std, dead-ReLU fraction, tanh saturation), and input/label distributions.
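As a rough sketch of the update-to-data ratio, the same formula can be computed on synthetic arrays with NumPy. The arrays, scales, and the helper name `update_to_data_log10` are assumptions for the example, and the formula as written matches a plain SGD step; with an adaptive optimizer the actual update is not `lr * grad`, so you would measure the real parameter delta instead.

```python
import numpy as np

def update_to_data_log10(param, grad, lr):
    """log10 of std(lr * grad) / std(param) for one parameter tensor."""
    return float(np.log10((lr * grad).std() / param.std()))

rng = np.random.default_rng(0)
param = rng.normal(scale=1.0, size=10_000)   # synthetic weights
grad = rng.normal(scale=0.1, size=10_000)    # synthetic gradient
ratio = update_to_data_log10(param, grad, lr=1e-2)
print(f"log10 update/data ratio: {ratio:.2f}")
```

Here the ratio lands near -3 by construction; in a real run, a layer far above that is being updated too aggressively and one far below is barely moving.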
 **Sanity-check the loss at init**[^cs231n]: verify chance-level loss before training. For 10-class softmax the initial loss should be `-ln(0.1) = 2.302` with small random weights. Wrong init loss means a bad initialization or a broken loss. Then check that increasing regularization increases the loss.
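The chance-level arithmetic can be checked by hand in a few lines. This is a minimal sketch in pure Python; `expected_init_loss`, `cross_entropy`, and the 10% tolerance in `init_loss_ok` are illustrative choices, not part of any library.

```python
import math

def expected_init_loss(num_classes):
    """Cross-entropy of a uniform prediction over num_classes classes."""
    return -math.log(1.0 / num_classes)

def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one example."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def init_loss_ok(measured, num_classes, rel_tol=0.1):
    """True if the measured init loss is within rel_tol of chance level."""
    expected = expected_init_loss(num_classes)
    return abs(measured - expected) <= rel_tol * expected

uniform_logits = [0.0] * 10            # what small random weights roughly produce
loss = cross_entropy(uniform_logits, target=3)
print(round(loss, 3), init_loss_ok(loss, 10), init_loss_ok(0.01, 10))
```

A measured value far below chance (the `0.01` case) fails the check, which is the leakage signal described in the data table.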
 | Symptom | Likely cause |
 |---|---|
 | Loss stuck from the start | LR too low, bad init, data pipeline broken, wrong loss function |
 | Loss decreases then explodes | LR too high, numerical instability (log(0), div by 0), gradient-accumulation bug |
 | Loss NaN | log(0), 0/0, overflow. Use `x.clamp(min=1e-8).log()`, `1/(std + 1e-5)` |
 | Train loss good, val loss bad | Overfitting. More data, regularization, smaller model |
 | Loss oscillates wildly | LR too high, batch too small, data shuffling broken |
 | Gradients vanish | Too-deep net without skips, saturating activations, bad init |
 | Gradients explode | No gradient clipping, LR too high, RNN without clipping |
 | Different results per seed | Normal if small; suspicious if large. Check init sensitivity, batch order, nondeterminism |
 | Model outputs constant | Dead neurons, vanishing gradients, mode collapse, all-zero init |
 | Physics loss low but BCs violated | Gradient imbalance: PDE residual dominates the BC gradient; adaptive weighting or hard BCs |
 | PINN worse than pure-data MLP | Wrong equations, bad scaling (forgot to nondimensionalize), or physics fighting the data |
 | Scalar parameter stuck at 0 or a bound | Degenerate solution; bound and initialize it, or estimate it separately first |
 **Step 4: Numerical hygiene.**[^cs231n]
 ```python
 log_prob  = prob.clamp(min=1e-8).log()                 # clamp log inputs
 ratio     = x / (std + 1e-5)                            # never divide by zero
 grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=20.0)
 logger.log("grad_norm", grad_norm)                      # clip, but LOG the pre-clip norm
 assert torch.isfinite(loss), f"Loss is {loss}"          # catch NaNs early
 torch.autograd.gradcheck(my_fn, inputs.double().requires_grad_(True))  # float64! 1e-2 -> 1e-8
 ```
 Gradient clipping *masks* problems, so always log the pre-clip norm to see if it fires every step. For a custom gradient, compare the analytic and numerical (centered-difference) gradients by relative error: above `1e-2` the gradient is probably wrong, between `1e-2` and `1e-4` is uncomfortable, and below `1e-7` is happy; turn off regularization/dropout and use float64 first.
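A centered-difference check with relative error can be sketched without any framework. The function `f`, its correct and deliberately buggy derivatives, and the step size `h` are all made up for illustration; Python floats are already float64.

```python
def numeric_grad(f, x, h=1e-5):
    """Centered difference: (f(x+h) - f(x-h)) / 2h."""
    return (f(x + h) - f(x - h)) / (2 * h)

def relative_error(a, b):
    """|a - b| / max(|a|, |b|), defined as 0 when both are 0."""
    denom = max(abs(a), abs(b))
    return 0.0 if denom == 0 else abs(a - b) / denom

def f(x):
    return x ** 3

def analytic(x):
    return 3 * x ** 2      # correct derivative

def buggy(x):
    return 2 * x ** 2      # wrong coefficient, the kind of slip the check catches

x0 = 1.5
err_ok = relative_error(analytic(x0), numeric_grad(f, x0))
err_bad = relative_error(buggy(x0), numeric_grad(f, x0))
print(f"correct: {err_ok:.2e}  buggy: {err_bad:.2e}")
```

The correct gradient lands well under `1e-7`; the buggy one sits far above `1e-2`.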
 **Step 5: Normalization and scale.**[^schulman][^cs231n][^fsdl][^slavv] Most training issues trace back to scale. Normalize inputs to mean 0, std 1 per feature (see Schulman's quote in [SKILL.md](SKILL.md)). For physics/PDE models, nondimensionalize *before* training: raw SI units (Kelvin, Joules, meters) create loss terms with wildly different magnitudes; pick characteristic scales, substitute, and the resulting groups (NTU, Biot) come out O(1). For time-series, use a temporal train/test split, not random, or you leak correlation.
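For the time-series case, a temporal split with train-only normalization statistics might look like the sketch below. It assumes the rows are already sorted by time; `temporal_split`, `fit_standardizer`, and the toy series are illustrative names and data.

```python
import statistics

def temporal_split(rows, train_frac=0.8):
    """Split time-ordered rows without shuffling: the past trains, the future tests."""
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

def fit_standardizer(values, eps=1e-8):
    """Return a function that scales to mean 0, std 1 using the given values' stats."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return lambda v: (v - mean) / (std + eps)

series = [float(t) for t in range(10)]     # toy series, index doubles as time
train, test = temporal_split(series)

scale = fit_standardizer(train)            # statistics come from train only
train_z = [scale(v) for v in train]
test_z = [scale(v) for v in test]          # test reuses the train statistics
print(train, test)
```

Fitting the mean and std on the full series would leak information from the test period into training, which is the same leak a random split causes.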
 **Step 6: Check your optimizer assumptions.** Adam's moment estimates can mask gradient problems; if step statistics look weird, inspect raw gradients separately. `abs_max(param_update)` should be small (~1e-3 at LR 1e-2). Supervised-learning tricks (batchnorm, dropout, big nets) often *don't* transfer to RL.
 ---
 ## When stuck, read a working implementation
 After 1-2 diagnostic cycles that don't localize the bug, or whenever you're building something you haven't built before, stop guessing and go read code that already works. Agents tend to skip this for another round of from-scratch guessing, which is usually the worse bet. The folklore is blunt about this: writing RL from scratch is "the most catastrophically self-sabotaging thing you can do," because the self-correction signal is too weak to catch your bugs[^jones].
 Use the `gh` skill to find an implementation. Rank candidates by trust signal: community adoption > papers citing it > open source that runs > author reputation > self-reports. A repo other researchers use as a baseline beats a flashy README.
 Read it for three things, explicitly:
 1. **The algorithm done right.** Diff your math and your computation graph against theirs. The bug is usually something "trivial": a sign, a reset, an off-by-one, an advantage normalization you skipped. Implementation differences that papers never mention dominate results[^henderson].
 2. **The engineering tricks the paper omits.** Did they normalize the input? tanh instead of ReLU? mean-pool instead of last-token? only 6 layers? clip to stop gradient saturation? warm-start? an easier dataset than yours? These live in the code, not the abstract, and they're the difference between "works" and "doesn't."
 3. **Proven hyperparameters, schedule, and optimizer.** Copy the values known to work before tuning your own. Their LR, warmup, batch size, weight decay, and optimizer are a working starting point you get for free. For LoRA/fine-tuning, the params vary by model, but unsloth and axolotl defaults are good working knowledge: each is backed by a runnable demo notebook, which is a stronger trust signal than any blog post.
 For RL specifically, see [rl/SKILL.md](rl/SKILL.md) (spinning-up, stable-baselines3, cleanrl, OpenSpiel).
 ## For LLM agents
 Unfortunately, agents need these procedural mindset-shifts spelled out. This is the babysitting layer, not the durable folklore, hence its place at the bottom. If you're an agent debugging ML code, run the loop and avoid the anti-patterns.
 ### The debugging loop (use judgment, it's not a checklist)
 Roughly in this order, though the point is the underlying mindset:
 **Collect clues before theorizing.** Read the traceback and logs. Run static analysis ([refs/static_analysis.md](refs/static_analysis.md)) and the cheap diagnostics ([refs/diagnostics.md](refs/diagnostics.md): data sanity check, init-loss check, overfit-one-batch). If you catch yourself proposing a fix before you've looked at anything, stop.
 **Hold several hypotheses at once; resist converging early.** Unless the cause is already obvious (a traceback usually points right at it), generate a few genuinely different explanations before ranking any, so you don't marry the first one. Use the five lenses in Mental models. Then sanity-check yourself with the failure-mode triplet (same idiom as the `research-journal` skill):
 - *Likely*: your strongest competitor explanation, with a rough credence.
 - *Subtle*: the sneaky one, like sample size, leakage, a confound, a metric artifact, or plain seed variance masquerading as signal.
 - *Null*: there's no real effect, or it comes from something else you also changed.
 Give each a one-line prior and its cheapest falsifier (`Check: ...`). Anchor priors on Practitioner priors above, but a clue that points elsewhere overrides them outright. Keep observations (reproducible, auditable) separate from inferences, so you can rethink without degrading the evidence.
 **Run the cheapest observation that splits your top hypotheses.** Not the most thorough experiment, the most *discriminating* one. Forward-predict each hypothesis ("what would I see if this were the cause?"); a test is strong evidence only where the predictions diverge. A grad-norm line reading ~0 under "dead layer" but healthy under "LR too low" beats a 4-hour sweep that only confirms what you believed.
 **Bisect the path to localize where it breaks.** Data flows forward and gradients backward in a chain (input -> preprocess -> layers -> loss -> grads), so probe the midpoint: is the value or gradient already wrong halfway through? Each probe halves the search space. Finding the first module to produce a non-finite value is one case; the same bisection works for finite-but-wrong values, exploded norms, and dead activations.
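The bisection idea can be sketched on a toy pipeline of scalar stages. The sketch assumes the input itself is good and that once a value goes bad it stays bad downstream; the stage names, the `is_bad` predicate, and the helper names are all hypothetical.

```python
import math

def run_prefix(stages, x, k):
    """Run only the first k stages and return the intermediate value."""
    for _, fn in stages[:k]:
        x = fn(x)
    return x

def first_bad_stage(stages, x, is_bad):
    """Binary-search for the first stage whose output is bad, or None if all are fine."""
    lo, hi = 0, len(stages)   # invariant: output after lo stages is good, after hi is bad
    if not is_bad(run_prefix(stages, x, hi)):
        return None
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(run_prefix(stages, x, mid)):
            hi = mid
        else:
            lo = mid
    return stages[hi - 1][0]

stages = [
    ("scale", lambda v: v * 0.5),
    ("shift", lambda v: v - 1.0),
    ("log", lambda v: math.log(v) if v > 0 else float("nan")),  # NaN on non-positive input
    ("square", lambda v: v * v),
]

def not_finite(v):
    return not math.isfinite(v)

culprit = first_bad_stage(stages, 1.0, not_finite)
clean = first_bad_stage(stages, 4.0, not_finite)
print(culprit, clean)
```

Swapping `not_finite` for a different predicate (norm above a threshold, activation all zeros) gives the finite-but-wrong variants of the same search.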
 **Then act, only on what the observation pointed to.** If a cycle or two hasn't localized it, stop tuning and go read working code, which usually beats another guess.
 ```py
 # ── ML debugging loop ────────────────────────
 # Sketch only: every method on `tools` is a hypothetical callable you supply.
 def debug(symptom, tools, max_cycles=2):
     clues = tools.collect(symptom)        # traceback, logs, static analysis, cheap diagnostics
     H = tools.generate(clues)             # >=3 genuinely different, incl. likely / subtle / null
     prior = tools.anchor(H)               # base rates: data .40 loss .20 train .15 arch .10 hp .05
     for cycle in range(max_cycles):
         # the most discriminating test per unit cost, not the most thorough one
         t = max(tools.candidates(H), key=lambda t: tools.divergence(H, t) / tools.cost(t))
         obs = tools.run(t)                # one log line or toy run; keep obs apart from inference
         prior = tools.update(prior, obs)
         H = tools.bisect_path(H, obs)     # halve the search space each probe
         if tools.localized(H, prior):
             tools.fix(H)
             assert not tools.reproduces(symptom)  # no silent fallback; crash if the symptom persists
             return H
     return tools.read_working_code()      # diff your math + graph vs a trusted impl
 ```
 ### Triage (a menu, not a flowchart to obey)
 Rough order to consider, not authoritative; it may not fit your project. Stop when a question fits.
 1. Exception/traceback? Read it, fix it, done.
 2. Loss NaN/Inf? Attach NaN hooks ([refs/diagnostics.md](refs/diagnostics.md)), find the first module producing NaN. Usual causes: log(0), 0/0, exp(large); add clamp/eps.
 3. Init loss wrong? Check the data pipeline and loss; check for double softmax; check labels match output format. Same loss on random input -> data destroyed. Init loss << expected -> leakage.
 4. Can't overfit one batch? Gradient-flow check: None grads -> disconnected layer; all-zero grads -> dead layer / detach. Check autograd breakers and optimizer step order.
 5. Loss stuck from step 0 but you *can* overfit one batch? LR too low (try 10x), frozen params (check `requires_grad`), wrong loss.
 6. Loss decreases then explodes? LR too high (try 0.1x), log the pre-clip grad norm, hunt numerical instability.
 7. Train good, val bad? Overfitting, not a bug. More data, regularization, smaller model.
 8. Train loss fine but the metric is bad? Loss-metric misalignment ([refs/metric_stuck.md](refs/metric_stuck.md)).
 9. Outputs constant? Mode collapse: class imbalance, all-zero init, or dead ReLUs. Look at confidence-sorted errors.
 10. Slow but not stuck? Not a bug. Consider batch size, depth/width, data quality.
 ### Anti-patterns
 These are the overconfident reflexes that the calibration note in [SKILL.md](SKILL.md) warns about, made concrete. Every one changes behaviour before localizing the bug. (As people put it: "this is sklearn slop," or "the LLM is tweaking hyperparameters like it's in a hackathon, not understanding the problem.")
 - Hyperparameter changes before verifying correctness. "Try reducing the learning rate" is the #1 wrong response. Verify the code first; HP tuning on buggy code wastes time.
 - `try/except` around training code. Training should crash loudly. A caught exception hides the bug and produces silently wrong results. The one exception is checkpoint-on-KeyboardInterrupt.
 - "Try a different optimizer." If Adam doesn't converge, it's almost never the optimizer; it's the loss, the data, the architecture, or a bug.
 - `.detach()` / `.item()` to "fix" gradient errors. If autograd complains, the graph is wrong. Detaching silences it by cutting gradient flow, so the model just stops learning from that path.
 - `lr_scheduler` as a *cure for non-convergence*. Schedules matter (transformers need warmup, cyclic/cosine is often best-in-class, AdamW is the standard pairing), but they refine or enable convergence in an otherwise-healthy setup; they don't rescue a model that can't learn at constant LR because of a bug. Add the schedule once the basics work, not as a debugging band-aid.
 - More layers / a bigger model. If it can't overfit one batch, more parameters won't help. The problem is gradient flow, loss, or data.
 - "Normalize your data" without checking whether it already is. Run the data sanity check first.
 - `float()` / `.to(dtype)` to suppress type warnings. Type mismatches are signals; a float32/float64 mismatch might mean you're mixing model weights with double-precision data. Fix the root cause.
 ---
 ## Appendix: deeper tricks
 Look these up when the symptom calls for them; they're kept out of the main flow on purpose.
 - [refs/loss_surface.md](refs/loss_surface.md) — visualize a loss surface and its gradient field with synthetic tensors, no model or GPU. For when a custom loss misbehaves.
 - [refs/metric_stuck.md](refs/metric_stuck.md) — "why won't this metric move?" plus the structural-ceiling check (is the optimizer failing, or can the parameterization not express it?).
 - [refs/sweeps.md](refs/sweeps.md) — same-seed paired comparison and cross-seed t-stat reliability, so a result is "reliably better" not "a lucky seed."
 - [refs/static_analysis.md](refs/static_analysis.md) — grep patterns for silent bugs (shape mismatches, autograd breakers, double softmax, step ordering, leakage).
 - [refs/diagnostics.md](refs/diagnostics.md) — copy-paste diagnostic snippets (init-loss check, overfit-one-batch, gradient-flow check, NaN hooks, NaN-poisoning leakage tracer, backprop-to-input dependency check, class-imbalance check).
 - [rl/SKILL.md](rl/SKILL.md) — RL-specific debugging: probe environments, reward engineering, HP defaults, reference implementations.
 - [pinn/SKILL.md](pinn/SKILL.md) — physics-informed-network debugging: nondimensionalization, gradient pathologies, curriculum.
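As a rough illustration of the same-seed paired comparison mentioned in the sweeps entry, the sketch below computes a paired t-statistic over per-seed differences. It is not the contents of refs/sweeps.md; the scores are made up and `paired_t_stat` is a hypothetical helper.

```python
import math
import statistics

def paired_t_stat(baseline, variant):
    """t-statistic of per-seed differences (variant - baseline), seeds matched by index."""
    diffs = [v - b for b, v in zip(baseline, variant)]
    mean = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)              # sample std of the differences
    return mean / (sd / math.sqrt(len(diffs)))

# Made-up validation scores for seeds 0..4; the same seed is used for both arms.
baseline = [0.70, 0.72, 0.68, 0.71, 0.69]
variant = [0.73, 0.74, 0.71, 0.73, 0.72]

t_stat = paired_t_stat(baseline, variant)
print(f"paired t = {t_stat:.1f} over {len(baseline)} seeds")
```

Pairing by seed removes the seed-to-seed variance that both arms share, so a small but consistent gain shows up as a large t-statistic instead of being lost in the noise.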
 ## Links and further reading
 Folklore sources (the quotes above trace to these):
 [^jones]: Andy Jones, "Debugging RL, Without the Agonizing Pain" — https://andyljones.com/posts/rl-debugging.html ([cache](docs/evidence/andyljones_rl_debugging.md): anomalies L103-109, write-from-scratch L155, assume-bug L176-180, raise-threshold L182, loss-curve L186-188)
 [^rahtz]: Matthew Rahtz (Amid Fish), "Lessons Learned Reproducing a Deep RL Paper" — http://amid.fish/reproducing-deep-rl ([cache](docs/evidence/amid_fish_reproducing_deep_rl.md): frame-diff confusion L85-87, investigate-confusion L100-102, think-more L145-153, don't-implement-RL-yourself L497-501)
 [^schulman]: John Schulman, "Nuts and Bolts of Deep RL Research" slides — http://joschu.net/docs/nuts-and-bolts.pdf ([cache](docs/evidence/joschu_nuts_and_bolts.md): Always-Be-Ablating L98-101, standardize-observations L118-125; rendered as bullets because the PDF source is slide fragments)
 [^henderson]: Henderson et al., "Deep Reinforcement Learning that Matters" (AAAI 2018) — https://arxiv.org/abs/1709.06560 ([cache](docs/evidence/henderson_2018_deep_rl_matters.md): seeds-create-different-distributions L235, implementation-differences L251)
 [^cs231n]: Stanford CS231n, "Neural Networks Part 3" — https://cs231n.github.io/neural-networks-3/ ([cache](docs/evidence/cs231n_neural_networks_3.md): overfit-tiny-subset L89)
 [^slavv]: Slav Ivanov, "37 Reasons why your Neural Network is not working" (2017) — https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607 ([cache](docs/evidence/slavv_37_reasons_nn.md): opening anecdote L19, emergency checklist L45-51)
 [^fsdl]: Josh Tobin, Full Stack Deep Learning Spring 2021, Lecture 7 "Troubleshooting DNNs" — https://fullstackdeeplearning.com/spring2021/lecture-7/ ([cache](docs/evidence/fsdl_spring2021_lecture7.md))
 [^goodfellow]: Goodfellow, Bengio, Courville, *Deep Learning*, ch. 11 "Practical Methodology" — https://www.deeplearningbook.org/ ([cache](docs/evidence/goodfellow_ch11_practical_methodology.md): one-part-broken-others-adapt L198, weights-adapt-to-compensate L204)
 [^cs229]: Andrew Ng, CS229 "Advice for Applying Machine Learning" — https://cs229.stanford.edu/ ([cache](docs/evidence/cs229_ml_advice.md))
 [^jaxtyping]: Patrick Kidger, jaxtyping (runtime shape/dtype checking) — https://github.com/patrick-kidger/jaxtyping
 For modern transformer pretraining specifically (the sources above predate it), see [Karpathy's recipe](https://karpathy.github.io/2019/04/25/recipe/) and the [nanochat deepwiki](https://deepwiki.com/karpathy/nanochat) (320+ empirical HP sweeps for a GPT-2-scale run). Most multi-source claims trace to quotes in [docs/ml_debug_folklore.argdown](docs/ml_debug_folklore.argdown) (vargdown); the full evidence set is in [docs/evidence/](docs/evidence/).
 Curated by [wassname](https://github.com/wassname). Companion gist: https://gist.github.com/wassname/e45e41f75c0b50e72ec1f4cff811a277
@@ -12,7 +12,9 @@ Or paste `SKILL.md` into your system prompt / context when debugging.
 ## What's here
- **[SKILL.md](SKILL.md)** -- the main artifact. Load into an LLM agent's context as a debugging skill. Leads with the mindset (calibrate, mental models, general debugging tricks, and reading a working implementation when stuck), then a folklore section of sourced quotes, then an LLM-agent playbook (debugging loop, triage menu, anti-patterns). Deeper one-off tricks (loss-surface analysis, stuck-metric diagnosis, sweep reliability) live in [refs/](refs/).
+- **[SKILL.md](SKILL.md)** -- the main artifact. Load into an LLM agent's context as a debugging skill. A short calibration note, then the folklore itself: verbatim sourced quotes from practitioners, general lessons first, modern transformers and LLM fine-tuning in their own section.
 - **[PLAYBOOK.md](PLAYBOOK.md)** -- the synthesized long-form: mental models, practitioner priors, step catalogs, symptom tables, the agent debugging loop, triage, and anti-patterns. Menus of hypotheses distilled from the same sources, not quotes. Deeper one-off tricks (loss-surface analysis, stuck-metric diagnosis, sweep reliability) live in [refs/](refs/).
 - **[docs/evidence/](docs/evidence/)** -- frozen local copies of source material (blog posts, talks, papers, reddit threads). Claims in SKILL.md link back to exact quotes here.
@@ -1,151 +1,78 @@
 ---
 name: ml-debug
-description: "Wassname's practical folklore for debugging ML systems: convergence issues, gradient pathologies, stuck metrics, sweep reliability, and same-seed comparisons. Use when stuck on training, designing sweeps, or analyzing experiment results."
+description: "Wassname's practical folklore for debugging ML systems: convergence, gradients, stuck metrics, sweep reliability, same-seed comparisons. Use when stuck on training, designing sweeps, or analyzing results. Verbatim quotes from practitioners; catalogs and diagnostics are one hop away."
 ---
 # wassname's ML Debugging Folklore
 Foreword: To upskill the ML debugging of AI coding assistants (and humans), I've collected high-quality sources on ML debugging, the mindset, and the "taste". When I started in ML I went searching for discussions of best practices and started a few of my own; they helped me a lot, and I hope they can help others. This intro is human-written; what follows is AI-written with human guidance.
-## Before you debug: calibrate
+## How to read this
-The first thing to calibrate is your own behaviour, especially if you're an LLM agent. ML research code is often outside the training distribution: novel losses, custom architectures, methods with no canonical "right answer" you've seen a thousand times. The trained reflex there is to be confident and fast, to pattern-match a symptom to a fix ("loss stuck -> drop the LR") and apply it. Here that reflex works against you. It commits to one hypothesis before you've looked, and a wrong fix on possibly-buggy code wastes a run *and* corrupts your evidence about what's actually happening.
+If you're an LLM agent, calibrate yourself first. ML research code is often outside your training distribution: novel losses, custom architectures, methods with no canonical right answer you've seen a thousand times. The trained reflex there is to be confident and fast, to pattern-match a symptom to a fix ("loss stuck -> drop the LR") and apply it, and on possibly-buggy research code that reflex wastes a run and corrupts the evidence you need to find the real cause. The quotes below are the counter-evidence, in the words of people who paid for these lessons in months of wasted runs. If you notice yourself converging on the first plausible hypothesis, read [Rahtz](#think-more-experiment-less); if you're reaching for hyperparameters, read [Jones](#assume-you-have-a-bug); if the code looks like it's working, read [Achiam](#broken-code-fails-silently-measure-everything-spinning-up); if you're about to declare the fix done, read [Nanda](#default-to-disbelieving-your-own-results-neel-nanda).
-So slow down and widen out. Most of this skill is a set of habits for staying calibrated and keeping your hypothesis space open until the evidence closes it. The habits transfer across timeseries, GANs, OCR, RL, PINNs, puzzles; the specific fixes in the tables below are local to their setting, so treat those tables as a menu of hypotheses to widen your search, not a lookup-and-apply.
+These are common failure modes worth ruling out, not a complete diagnosis of your situation; you know your system and I don't. Checklists, diagnostics, and symptom catalogs are one hop away under [Reference](#reference-one-hop-away).
 ## Mental models
 How to *think* when generating hypotheses or deciding what to investigate next. Pick the lens that fits; each gives a different angle on the same problem.
 **1. Information flow: trace forward, trace backward.** Data flows forward through the model, gradients flow backward. A bug anywhere in either direction corrupts everything downstream. Manually trace shapes and values forward from input through each layer, then trace gradients backward from loss through each parameter. The break-point is where values first go wrong.
 - Forward: input -> preprocess -> embed -> layers -> head -> loss
 - Backward: loss -> d(loss)/d(head) -> d(head)/d(layers) -> ... -> d(layer1)/d(params)
 **2. Ablation: remove things until it works.**[^cs229] Systematically remove components (regularization, augmentation, auxiliary losses, fancy layers). If removing X fixes it, X was the problem. If nothing helps, the bug is in the core (data or main loss). Start by turning off ALL regularization/augmentation/dropout/scheduling; if it works, add back one at a time until it breaks.
 **3. Oracle substitution: replace each component with ground truth.**[^cs229] For pipeline systems (data -> features -> model -> postprocess -> metric), swap one component for a perfect version. The component whose oracle gives the biggest jump is the bottleneck. Replace model predictions with ground-truth labels and the metric barely moves? The model's fine; the problem is upstream (data) or downstream (metric).
 **4. Bias-variance via learning curves.**[^cs229][^fsdl] Plot train and val error vs dataset size (or steps). Both high and converging together = high bias (too simple, wrong features, or a capacity-reducing bug). Train low, val high = high variance (overfitting). Val flat even with 10x more data = not a data problem, fix the model.
 **5. Structural ceiling: can the parameterization even express what you want?** Sometimes a metric is stuck not because the optimizer fails but because the architecture literally cannot represent the target. Quick check: disable the loss term entirely; if the metric reaches the same value, the loss never moved it. Worked example in [refs/metric_stuck.md](refs/metric_stuck.md).
 ### Practitioner priors: what's usually wrong
 With no other information, investigate in this order. Rough consensus from the folklore sources, not measured frequencies, and only a starting weight (a clue that points elsewhere overrides them outright):
 1. **Data pipeline** (~40%). Wrong preprocessing, labels misaligned with inputs, missing/wrong normalization, train/test leakage, a loader returning stale batches. It really is usually the data.[^slavv][^fsdl]
 2. **Loss function** (~20%). Wrong loss for the task, wrong sign, double softmax, loss disconnected from the metric, competing losses canceling.
 3. **Training procedure** (~15%). Wrong optimizer step order, missing `zero_grad`, frozen params, in-place ops breaking autograd.
 4. **Architecture** (~10%). Too small to express it, too deep without skips, wrong activation.
 5. **Hyperparameters** (~5%). LR, batch size, weight decay. Almost never the real problem if the code is buggy.
 6. **Numerical** (~5%). NaN, overflow, underflow, usually a symptom of one of the above.
 7. **Environment** (~5%). Library version, GPU memory, nondeterminism, stale cache.
 For RL, add reward scale/sign as a top-3 issue, and episode-boundary handling (done signals, discounting across resets).
 ### When to suspect the data
 | Signal | Likely meaning | Check |
 |--------|----------------|-------|
 | Init loss << expected (e.g. 0.01 vs 2.3) | Leakage or a shortcut: the model "knows" the answer at init | Are labels in the input? Is test data in train? A trivial feature? Localize with the NaN-poisoning tracer or backprop-to-input check ([refs/diagnostics.md](refs/diagnostics.md)) |
 | Random input gives the same loss as real input | Pipeline is destroying information (over-aggressive preprocessing, wrong transforms, all-zero input) | Print raw data at each stage; visualize |
 | Predicts the same class for everything | Class imbalance (100:1 -> "always predict majority") | Label-count check; weighted loss or resample |
 | Val much worse than train from the start | Distribution shift between splits | Same preprocessing? Same time period? Same source? |
 | Learning curve flat even with 10x data | NOT data: high bias | Add capacity, fix features, check for capacity-reducing bugs |
 | Adding data makes val worse | Data-quality issue: new data noisier or off-distribution | Inspect recent additions, check label quality |
 | Works on MNIST/CIFAR but not your set | Your data is the problem | Simplify your data (fewer classes, clean labels), scale up gradually[^slavv] |
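The label-count check from the table is cheap enough to run on every new dataset. A minimal sketch on a toy label list; `label_report` is a hypothetical helper name.

```python
from collections import Counter


def label_report(labels):
    """Return (counts, majority_fraction) for a list of class labels."""
    counts = Counter(labels)
    majority_fraction = max(counts.values()) / len(labels)
    return counts, majority_fraction


labels = [0] * 990 + [1] * 10            # toy 99:1 imbalance
counts, majority = label_report(labels)
print(counts[0], counts[1], majority)    # 990 10 0.99
```

The majority fraction is the accuracy of "always predict the majority class", so a model that reports roughly that accuracy may have learned nothing beyond the class prior.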
 ---
 ## Part 1: General ML debugging
 A catalog of small, well-worn checks, in rough dependency order (each assumes the one before). Pull from it; don't run it end-to-end as a ritual.
 **Step 1: Verify components in isolation.**[^goodfellow][^cs229] Most bugs are "doing the wrong calculation." Test each piece independently.
 - Forward pass: feed known inputs, check output shapes and ranges. `assert` shapes everywhere, since `(None,)` vs `(None, 1)` silently broadcasts into `(None, None)`. (Or make the shapes runtime-checked contracts with jaxtyping[^jaxtyping] + beartype, which turns the #1 silent bug loud.)
 - Loss: hand-compute a few targets and compare to code output.
 - Data pipeline: sample a batch, print it, eyeball it. Are labels aligned with inputs? Transforms applied correctly?
 - Preprocessing: look at processed inputs as a human. Can *you* solve the task from them?
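For the "hand-compute a few targets" check, a framework-free reference is often enough. The sketch below implements single-example cross-entropy from raw logits in plain Python and checks it against values worked out by hand; in practice you would compare your framework's loss output to numbers like these (that comparison is left out here to keep the block self-contained).

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one example from raw logits, via a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


# Hand-computed targets:
#   uniform logits over 3 classes     -> ln(3)
#   logits [2, 0] with target class 0 -> ln(1 + e^-2)
assert math.isclose(cross_entropy([0.0, 0.0, 0.0], 0), math.log(3))
assert math.isclose(cross_entropy([2.0, 0.0], 0), math.log(1 + math.exp(-2)))
```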
 **Five most common deep-learning bugs**[^fsdl]: (1) tensor shapes that fail silently via broadcasting, (2) preprocessing inputs incorrectly (wrong normalization, over-augmentation), (3) wrong loss function or wrong sign, (4) forgetting train vs eval mode (dropout/batchnorm differ), (5) numerical instability (NaN from log(0), overflow, vanishing grads).
 **Step 2: Get signs of life on a toy problem, and overfit one batch.**[^cs231n][^fsdl] Before the real task, solve something trivial with the same codebase so you know what "healthy" looks like. Then overfit a tiny batch (see folklore: "overfit one batch first"). Start with a lightweight implementation (<200 lines of new code), no fancy data pipeline; build that later once the core works.
 **Baseline ladder** (for physics/simulation models, each step must beat the previous):
 1. Persistence: y(t) = y(t-1). Does the model capture *any* dynamics?
 2. Exponential decay to steady state (first-order fit).
 3. Linear state-space / OLS on finite differences.
 4. Pure-data MLP (same architecture, no physics). If a PINN can't beat this, the physics constraint is hurting.
 5. Classical solver, fixed parameters (scipy `solve_bvp`, ODE).
 6. Classical solver, fitted parameters.
 7. Then and only then: PINN / learned physics.
 Make complexity pay rent: every added component (physics, dimensions, losses) should improve a metric you care about, or come out.
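The first rung of the ladder takes a few lines. A minimal sketch of the persistence baseline on a toy trajectory; the series values and helper names are made up for illustration.

```python
def mse(pred, true):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def persistence_forecast(series):
    """Predict y(t) = y(t-1); returns predictions for t = 1..n-1."""
    return series[:-1]


series = [1.0, 2.0, 4.0, 7.0, 11.0]                       # toy trajectory
baseline = mse(persistence_forecast(series), series[1:])
print(baseline)                                           # 7.5
```

Any learned model evaluated on the same targets (`series[1:]`) has to beat this number before the next rung is worth building.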
 **Step 3: Log everything, then look for specific pathologies.**[^goodfellow][^rahtz][^cs231n] Log train+val loss (per-component if multi-objective), gradient norms per module, learning rate, parameter-update magnitudes, the update-to-data ratio per layer (`((lr * p.grad).std() / p.data.std()).log10()`, target ~-3), activation stats (mean, std, dead-ReLU fraction, tanh saturation), and input/label distributions.
 **Sanity-check the loss at init**[^cs231n]: verify chance-level loss before training. For 10-class softmax the initial loss should be `-ln(0.1) = 2.302` with small random weights. Wrong init loss means a bad initialization or a broken loss. Then check that increasing regularization increases the loss.
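The chance-level value generalizes to any class count as `-ln(1/C)`. A small sketch; the 10% relative tolerance in `init_loss_ok` is an assumed threshold, not a figure from the cited notes.

```python
import math


def expected_init_loss(num_classes):
    """Cross-entropy of a uniform prediction over num_classes."""
    return -math.log(1.0 / num_classes)


def init_loss_ok(observed, num_classes, rel_tol=0.1):
    """True if observed is within rel_tol of chance-level loss (rel_tol is assumed)."""
    return math.isclose(observed, expected_init_loss(num_classes), rel_tol=rel_tol)


print(round(expected_init_loss(10), 4))   # 2.3026
print(init_loss_ok(0.01, 10))             # False: far below chance, suspect leakage
```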
 | Symptom | Likely cause |
 |---|---|
 | Loss stuck from the start | LR too low, bad init, data pipeline broken, wrong loss function |
 | Loss decreases then explodes | LR too high, numerical instability (log(0), div by 0), gradient-accumulation bug |
 | Loss NaN | log(0), 0/0, overflow. Use `log(x.clamp(min=1e-8))`, `1/(std + 1e-5)` |
 | Train loss good, val loss bad | Overfitting. More data, regularization, smaller model |
 | Loss oscillates wildly | LR too high, batch too small, data shuffling broken |
 | Gradients vanish | Too-deep net without skips, saturating activations, bad init |
 | Gradients explode | No gradient clipping, LR too high, RNN without clipping |
 | Different results per seed | Normal if small; suspicious if large. Check init sensitivity, batch order, nondeterminism |
 | Model outputs constant | Dead neurons, vanishing gradients, mode collapse, all-zero init |
 | Physics loss low but BCs violated | Gradient imbalance: PDE residual dominates the BC gradient; adaptive weighting or hard BCs |
 | PINN worse than pure-data MLP | Wrong equations, bad scaling (forgot to nondimensionalize), or physics fighting the data |
 | Scalar parameter stuck at 0 or a bound | Degenerate solution; bound and initialize it, or estimate it separately first |
 **Step 4: Numerical hygiene.**[^cs231n]
 ```python
 import torch

 log_prob  = prob.clamp(min=1e-8).log()                 # clamp log inputs
 ratio     = x / (std + 1e-5)                            # never divide by zero
 grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=20.0)
 logger.log("grad_norm", grad_norm)                      # clip, but LOG the pre-clip norm (the return value)
 assert torch.isfinite(loss), f"Loss is {loss}"          # catch NaNs early
 # gradcheck in float64: relative error can drop from 1e-2 to 1e-8 vs float32
 torch.autograd.gradcheck(my_fn, inputs.double().requires_grad_(True))
 ```
 Gradient clipping *masks* problems, so always log the pre-clip norm to see if it fires every step. For a custom gradient, use relative error (centered difference): `>1e-2` probably wrong, `1e-4` uncomfortable, `1e-7` happy; turn off regularization/dropout and use float64 first.
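The relative-error check can be sketched without any framework. The function under test here (`x ** 3` and its hand-written derivative) is a stand-in for your custom gradient, and the step size `h=1e-5` is an assumed default.

```python
def relative_error(analytic, numeric):
    """|a - n| / max(|a|, |n|); defined as 0 when both values are 0."""
    denom = max(abs(analytic), abs(numeric))
    return 0.0 if denom == 0 else abs(analytic - numeric) / denom


def centered_difference(f, x, h=1e-5):
    """Numerical derivative (f(x+h) - f(x-h)) / (2h); Python floats are float64."""
    return (f(x + h) - f(x - h)) / (2 * h)


def f(x):
    return x ** 3


def analytic_grad(x):
    return 3 * x ** 2          # the hand-written gradient being checked


err = relative_error(analytic_grad(2.0), centered_difference(f, 2.0))
print(err < 1e-7)
```

A deliberately wrong gradient (say `2 * x ** 2`) should push the error far above `1e-2`, which is a quick way to confirm the check itself works.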
 **Step 5: Normalization and scale.**[^schulman][^cs231n][^fsdl][^slavv] Most training issues trace back to scale. Normalize inputs to mean 0, std 1 per feature (see the folklore quote from Schulman on running statistics). For physics/PDE models, nondimensionalize *before* training: raw SI units (Kelvin, Joules, meters) create loss terms with wildly different magnitudes; pick characteristic scales, substitute, and the resulting groups (NTU, Biot) come out O(1). For time-series, use a temporal train/test split, not random, or you leak correlation.
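Running statistics for input normalization can be kept with Welford's algorithm. This is a minimal single-feature sketch; the class name `RunningNorm` and the `eps` value are assumptions, and the referenced slides do not prescribe this exact implementation.

```python
class RunningNorm:
    """Running mean/std for one scalar feature, via Welford's algorithm."""

    def __init__(self, eps=1e-5):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0          # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        var = self.m2 / self.n if self.n > 0 else 0.0   # population variance
        return (x - self.mean) / (var ** 0.5 + self.eps)


norm = RunningNorm()
for x in [280.0, 300.0, 320.0]:     # e.g. raw temperatures in Kelvin
    norm.update(x)
print(norm.mean)                     # 300.0
print(norm.normalize(300.0))         # 0.0
```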
 **Step 6: Check your optimizer assumptions.** Adam's moment estimates can mask gradient problems; if step statistics look weird, inspect raw gradients separately. `abs_max(param_update)` should be small (~1e-3 at LR 1e-2). Supervised-learning tricks (batchnorm, dropout, big nets) often *don't* transfer to RL.
 ---
 ## When stuck, read a working implementation
 After 1-2 diagnostic cycles that don't localize the bug, or whenever you're building something you haven't built before, stop guessing and go read code that already works. Agents tend to skip this for another round of from-scratch guessing, which is usually the worse bet. The folklore is blunt about this: writing RL from scratch is "the most catastrophically self-sabotaging thing you can do," because the self-correction signal is too weak to catch your bugs[^jones].
 Use the `gh` skill to find an implementation. Rank candidates by trust signal: community adoption > papers citing it > open source that runs > author reputation > self-reports. A repo other researchers use as a baseline beats a flashy README.
 Read it for three things, explicitly:
 1. **The algorithm done right.** Diff your math and your computation graph against theirs. The bug is usually something "trivial": a sign, a reset, an off-by-one, an advantage normalization you skipped. Implementation differences that papers never mention dominate results[^henderson].
 2. **The engineering tricks the paper omits.** Did they normalize the input? tanh instead of ReLU? mean-pool instead of last-token? only 6 layers? clip to stop gradient saturation? warm-start? an easier dataset than yours? These live in the code, not the abstract, and they're the difference between "works" and "doesn't."
 3. **Proven hyperparameters, schedule, and optimizer.** Copy the values known to work before tuning your own. Their LR, warmup, batch size, weight decay, and optimizer are a working starting point you get for free.
 For RL specifically, see [rl/SKILL.md](rl/SKILL.md) (spinning-up, stable-baselines3, cleanrl, OpenSpiel).
 ---
 ## Folklore
 The hard-won lessons, in the words of the people who learned them. Sources and links are collected under [Links](#links-and-further-reading).
 ### Assume you have a bug
 > When their RL implementation doesn't work, people are often keen to either (a) adjust their network architecture or (b) adjust their hyperparameters. On the other hand, they're reluctant to say they've got a bug. Most often, it turns out they've got a bug. Why bugs are so much more common in RL code is discussed above, but there's another advantage to assuming you've got a bug: bugs are a damn sight faster to find and fix than validating that your new architecture is an improvement over the old one.[^jones]
 > What I'm advocating for here is not a blind faith in the buginess of your code, but for dramatically raising the threshold at which you start thinking 'OK, I think this is correct.'[^jones]
-A bug can also hide, because most ML models have multiple adaptive parts: "If one part is broken, the other parts can adapt and still achieve roughly acceptable performance"[^goodfellow], and it may not show in the output at all. So raise the bar for "correct."
+A bug can also hide, because most ML models have multiple adaptive parts: "If one part is broken, the other parts can adapt and still achieve roughly acceptable performance"[^goodfellow], and it may not show in the output at all.
 ### Think more, experiment less
 > Switching from experimenting a lot and thinking a little to experimenting a little and thinking a lot was a key turnaround in productivity. When debugging with long iteration times, you really need to *pour* time into the hypothesis-forming step - thinking about what all the possibilities are, how likely they seem on their own, and how likely they seem in light of everything you've seen so far. Spend as much time as you need, even if it takes 30 minutes, or an hour. Reserve experiments for once you've fleshed out the hypothesis space as thoroughly as possible and know which pieces of evidence would allow you to best distinguish between the different possibilities.[^rahtz]
 ### Don't write RL from scratch; diff against a reference
 > If you're doing anything that involves an RL algorithm as a component in a larger system, don't try and implement the RL algorithm yourself. [...] RL is unstable enough at the moment that you'll never be sure whether your system doesn't work because of a bug in your RL implementation or because of a bug in your larger system.[^rahtz]
 > We find that implementation differences which are often not reflected in publications can have dramatic impacts on performance.[^henderson]
 When you're stuck after a diagnostic cycle or two, the generalization of this advice is to find a working implementation (rank candidates by community adoption > papers citing it > code that runs > author reputation) and diff your math, computation graph, and hyperparameters against it. For RL see [rl/SKILL.md](rl/SKILL.md).
 ### Default to disbelieving your own results (Neel Nanda)
 > The default state of the world is that your research is false, because doing research is hard.[^nanda]
 > Excitement is evidence of bullshit: Generally, most true results are not exciting, but a fair amount of false results are. So from a Bayesian perspective, if a result is exciting and cool, it's even more likely to be false than normal![^nanda]
 The cheapest antidote he gives: "Read your data ... Often, the quality of the data is a crucial driver of the results of your experiments. Often, it is quite bad."[^nanda]
 ### Understand the system to shrink the search (Ulisse Mini)
 > When good programmers debug hard problems fast, it's usually because they understand the system well enough to *track the important internal state* in their head, letting them drastically *reduce the solution space they're searching over.*[^ulisse]
 ### Gears beat black boxes (John Wentworth)
 > figuring out a system's gears takes extra work up-front, but yields dividends forever. [...] The black-box approach is cheaper for one-off tasks, but usually doesn't yield any insights which will generalize to new tasks using the same system[^wentworth]
 ### Broken code fails silently; measure everything (Spinning Up)
 Josh Achiam's warning is RL-framed but general:
 > broken RL code almost always fails silently, where the code appears to run fine except that the agent never learns how to solve the task.[^spinningup]
 So instrument heavily, because "you can't tell it's broken if you can't see that it's breaking,"[^spinningup] and don't trust one passing setup: "sometimes things will work in one environment even when you have a breaking bug, so make sure to test in more than one environment."[^spinningup]
 ### Pursue anomalies; investigate confusion
 > If you ever see a plot or a behaviour that just *seems weird*, chase right after it! Do not - do *not* - just 'hope it goes away'. Chasing anomalies is one of the most powerful ways to debug your system, because if you've noticed a problem without having had to go look for it, that means it's a *really big problem*. [...] It's really tempting to think that the cool extra functionality you were planning to write today [...] might just magically fix this anomalous behaviour. It won't. Give up on your plan for the day and chase the anomaly instead.[^jones]
 > It was only by following that confusion and realising that taking the difference between frames zeroed out the background that gave the hint of a problem with normalization.[^rahtz]
 >
 > It seems important to really commit yourself to *always* investigate whenever you notice confusion.[^rahtz]
 ### Read what you actually wrote, not what you meant (gwern)
 You can't see your own work clearly, which is why fresh eyes (or a fresh-eyes subagent) catch what you can't:
 > you can't find typos in your own writing without a great deal of effort because you know what it's *supposed* to say; so copyediting advice runs like 'read it out loud' or 'print it out and read it' or 'wait a week' [...] or even 'read it upside down'. That's the sort of thing it takes to force you to read what you actually wrote, and not what you thought you wrote.[^gwern-unseeing]
 ### Never accept the kludge (Patrick Kidger)
@@ -157,39 +84,17 @@ Why is research code so reliably buggy? Kidger's blunt answer:
 His fix is a posture, "never accept the kludge": messed up your git repo? Find the commands to fix it, "don't just delete it and clone from the remote."[^kidger] The instinct that refuses kludges is the same one that refuses `.detach()`-to-silence-autograd and `except: pass`.
 ### Broken code fails silently; measure everything (Spinning Up)
 Josh Achiam's warning is RL-framed but general:
 > broken RL code almost always fails silently, where the code appears to run fine except that the agent never learns how to solve the task.[^spinningup]
 So instrument heavily, because "you can't tell it's broken if you can't see that it's breaking,"[^spinningup] and don't trust one passing setup: "sometimes things will work in one environment even when you have a breaking bug, so make sure to test in more than one environment."[^spinningup]
 ### Loss curves are a red herring
 > When someone's RL implementation isn't working, they *luuuuuurv* to copy-paste a screenshot of their loss curve to you. They do this because they know they want a pretty, exponentially-decaying loss curve, and they know what they have *isn't that*. The problem with using the loss curve as an indicator of correctness is somewhat that it's not reliable, but mostly because it doesn't localise errors. The shape of your loss curve says very little about where in your code you've messed up, and so says very little about what you need to change to get things working.[^jones]
-Their real value is splitting "how fast it learns" from "where it plateaus." Use them after better methods, not as a first resort.
+(But sometimes they are not a red herring: loss curves do separate underfitting from overfitting, exploding from vanishing gradients, saturated from unsaturated units, and so on.)
 ### Pursue anomalies; investigate confusion
 > If you ever see a plot or a behaviour that just *seems weird*, chase right after it! Do not - do *not* - just 'hope it goes away'. Chasing anomalies is one of the most powerful ways to debug your system, because if you've noticed a problem without having had to go look for it, that means it's a *really big problem*. [...] It's really tempting to think that the cool extra functionality you were planning to write today [...] might just magically fix this anomalous behaviour. It won't. Give up on your plan for the day and chase the anomaly instead.[^jones]
 > It was only by following that confusion and realising that taking the difference between frames zeroed out the background that gave the hint of a problem with normalization.[^rahtz]
 >
 > It seems important to really commit yourself to *always* investigate whenever you notice confusion.[^rahtz]
 ### Think more, experiment less
 > Switching from experimenting a lot and thinking a little to experimenting a little and thinking a lot was a key turnaround in productivity. When debugging with long iteration times, you really need to *pour* time into the hypothesis-forming step - thinking about what all the possibilities are, how likely they seem on their own, and how likely they seem in light of everything you've seen so far. Spend as much time as you need, even if it takes 30 minutes, or an hour. Reserve experiments for once you've fleshed out the hypothesis space as thoroughly as possible and know which pieces of evidence would allow you to best distinguish between the different possibilities.[^rahtz]
 Corollary, MurphyJitsu pre-flight: before launching a run, ask "if this fails, what's the most likely cause?" If you can name it, test for it first.
 ### Inspect the data first
 > The first step to training a neural net is to not touch any neural net code at all and instead begin by thoroughly inspecting your data. [...] The outliers especially almost always uncover some bugs in data quality or preprocessing.[^karpathy-recipe]
-Slavv's "37 reasons" list opens with the same anecdote (gradients flowing, loss falling, predictions all background) and puts "Verify that the input data is correct" and "Start with a really small dataset (2-20 samples). Overfit on it" at the top of its emergency checklist[^slavv]. FSDL names preprocessing and dataset construction as leading silent-failure categories[^fsdl].
+Slavv's "37 reasons" list opens with the same anecdote (gradients flowing, loss falling, predictions all background) and puts "Verify that the input data is correct" and "Start with a really small dataset (2-20 samples). Overfit on it" at the top of its emergency checklist[^slavv].
 ### Labels are often wrong (koaning)
@@ -197,7 +102,7 @@ Even benchmark data is dirtier than you think. Vincent Warmerdam:
 > It turns out that bad labels are a *huge* problem in many popular benchmark datasets.[^koaning]
-His cheap way to find them: train a deliberately high-bias model, then sort by where it disagrees with the label while assigning the correct class low confidence (the confidence-sorted-errors trick). The takeaway: "maybe we should spend [...] less time tuning parameters and instead spend it trying to get a more meaningful dataset."[^koaning]
+His cheap way to find them: train a deliberately high-bias model, then sort by where it disagrees with the label while assigning the correct class low confidence. The takeaway: "maybe we should spend [...] less time tuning parameters and instead spend it trying to get a more meaningful dataset."[^koaning]
 ### The tank story: your model learns the confound (gwern)
@@ -205,13 +110,7 @@ The canonical data-leakage parable:
 > A cautionary tale in artificial intelligence tells about researchers training an neural network (NN) to detect tanks in photographs, succeeding, only to realize the photographs had been collected under specific conditions for tanks/non-tanks and the NN had learned something useless like time of day.[^gwern]
-gwern traced versions back to 1992 and concluded it is "a classic 'urban legend'" with no solid source[^gwern]. The lesson holds twice over: a model will gladly learn a confound in how the data was collected instead of the task (dataset bias / leakage), and even your cautionary tales deserve a citation.
+gwern traced versions back to 1992 and concluded it is "a classic 'urban legend'" with no solid source[^gwern]. The lesson holds twice over: a model will gladly learn a confound in how the data was collected instead of the task, and even your cautionary tales deserve a citation.
 ### Read what you actually wrote, not what you meant (gwern)
 You can't see your own work clearly, which is why fresh eyes (or a fresh-eyes subagent) catch what you can't:
 > you can't find typos in your own writing without a great deal of effort because you know what it's *supposed* to say; so copyediting advice runs like 'read it out loud' or 'print it out and read it' or 'wait a week' [...] or even 'read it upside down'. That's the sort of thing it takes to force you to read what you actually wrote, and not what you thought you wrote.[^gwern-unseeing]
 ### Overfit one batch first
@@ -221,6 +120,14 @@ You can't see your own work clearly, which is why fresh eyes (or a fresh-eyes su
 And remove a variable while you're at it: "Always use a fixed random seed [...]. This removes a factor of variation and will help keep you sane."[^karpathy-recipe]
 ### Seed variance: you can't tell a bug from bad luck
 > Look, there's variance in supervised learning too, but it's rarely this bad. If my supervised learning code failed to beat random chance 30% of the time, I'd have super high confidence there was a bug in data loading or training. If my reinforcement learning code does no better than random, I have no idea if it's a bug, if my hyperparameters are bad, or if I simply got unlucky.[^irpan]
 > Instability to random seed is like a canary in a coal mine. If pure randomness is enough to lead to this much variance between runs, imagine how much an actual difference in the code could make.[^irpan]
 Henderson confirmed it quantitatively: splitting 10 same-config runs (differing only in seed) into two groups of five produces "statistically different distributions just from varying random seeds."[^henderson] This is why one good run proves nothing ([refs/sweeps.md](refs/sweeps.md)).
 ### Normalize and scale everything
 From the slides[^schulman] (bullet points, de-artifacted from the PDF):
@@ -239,43 +146,21 @@ On the slides[^schulman]:
 > - Different tricks may substitute
 > - Especially whitening
-Many normalization/regularization tricks do roughly the same job (they improve conditioning), so stacking them adds complexity without proportional benefit. If you have three normalization schemes and it still doesn't work, the problem isn't normalization. So ablate: most of the things you added are probably unnecessary.
+Many normalization/regularization tricks do roughly the same job (they improve conditioning), so stacking them adds complexity without proportional benefit.
-### Don't write RL from scratch; diff against a reference
+### Adam at 3e-4 for baselines (Karpathy)
 > If you're doing anything that involves an RL algorithm as a component in a larger system, don't try and implement the RL algorithm yourself. [...] RL is unstable enough at the moment that you'll never be sure whether your system doesn't work because of a bug in your RL implementation or because of a bug in your larger system.[^rahtz]
 > We find that implementation differences which are often not reflected in publications can have dramatic impacts on performance.[^henderson]
 ### Seed variance: you can't tell a bug from bad luck
 > Look, there's variance in supervised learning too, but it's rarely this bad. If my supervised learning code failed to beat random chance 30% of the time, I'd have super high confidence there was a bug in data loading or training. If my reinforcement learning code does no better than random, I have no idea if it's a bug, if my hyperparameters are bad, or if I simply got unlucky.[^irpan]
 > Instability to random seed is like a canary in a coal mine. If pure randomness is enough to lead to this much variance between runs, imagine how much an actual difference in the code could make.[^irpan]
 Henderson confirmed it quantitatively: splitting 10 same-config runs (differing only in seed) into two groups of five produces "statistically different distributions just from varying random seeds."[^henderson] This is why one good run proves nothing, and why sweeps need same-seed pairing and a cross-seed reliability test ([refs/sweeps.md](refs/sweeps.md)).
 ### 3e-4, and learning-rate folklore
 The most-quoted line in the genre is Karpathy's tweet, "3e-4 is the best learning rate for Adam, hands down."[^karpathy-3e4] He confirmed in the same thread that it was a joke, but it stuck because it's a decent default. Read it next to what he actually does in the recipe:
 > In the early stages of setting baselines I like to use Adam with a learning rate of 3e-4. In my experience Adam is much more forgiving to hyperparameters, including a bad learning rate.[^karpathy-recipe]
-So: 3e-4 is a fine *starting* LR for Adam, not a law. The real folklore is "Adam is forgiving, so start there and stop fiddling." It has exceptions, and the biggest is batch size:
+If you change the batch size, the learning rate has to move with it: linearly for SGD[^goyal], with an exponent between 0.5 and 1 for Adam[^mccandlish], and large-batch training without warmup can diverge in the first epoch and look like a code bug[^goyal].
 - There's a critical batch size below which doubling the batch ~halves wall-clock time, and above which it just burns compute; it rises during training as loss falls[^mccandlish].
 - LR must scale with batch size: linearly for SGD (double batch, double LR)[^goyal]; for Adam, an exponent between 0.5 and 1, task-dependent[^mccandlish]. Changing batch size without adjusting LR is a common silent mistake.
 - Large-batch + high-LR diverges at the start without warmup[^goyal]. No warmup -> first-epoch loss spike/NaN -> you wrongly think the code is broken.
-| Symptom | Likely cause |
+## Modern transformers and LLM fine-tuning
-|---------|-------------|
+
-| Very noisy, loss oscillates | Batch too small; gradient noise swamps signal. Try 4-8x larger |
+Most of the sources above predate large transformers; these come from the people training and fine-tuning them.
 | Smooth but slow, poor generalization | Batch too large without LR scaling. Higher LR or smaller batch |
 | Loss spikes at start then recovers | Normal with large batch + warmup. No warmup? Add it |
 | Different results at different batch sizes (same total steps) | Missing LR scaling. Adjust LR proportionally |
 ### Tricks hide in reference code (lucidrains)
-lucidrains' x-transformers is a catalogue of training tricks, each tied to its paper. The debugging-relevant one: when a transformer diverges, attention logits blowing up is a prime suspect, and the now-standard fix is QK normalization (L2-normalize queries and keys before the dot product).
+lucidrains' x-transformers is a catalogue of training tricks, each tied to its paper. The debugging-relevant one: when a transformer diverges, attention logits blowing up is a prime suspect, and the now-standard fix is QK normalization.
 > We are nearing the point of wiping out a source of transformer training instability with one simple intervention.[^lucidrains]
@@ -289,116 +174,56 @@ Karpathy's nanochat is one of the few public records of what scaling a transform
 > If any rank's gradient contains inf, all ranks must clip to avoid divergence.[^nanochat]
-The first is a fake-metric-improvement trap (a better number that isn't better learning); the second is a multi-GPU bug that single-GPU testing hides.
+The first is a better number that isn't better learning; the second is a multi-GPU bug that single-GPU testing hides.
---
+### When NaN hits, look at the frames before it (Stas Bekman)
-## Research taste (adjacent to debugging)
+Bekman wrote the `DebugUnderflowOverflow` tool during BLOOM-era large-model training. It keeps a rolling buffer of per-module abs-min/abs-max frames, so when inf/NaN is detected you see the run-up rather than only the crash site.
-Debugging taste and research taste are the same muscle: stay skeptical of your own results, and build a real model of your system instead of pattern-matching.
+> As you can see it's the previous frames that we need to look into when the numbers start going into very large for fp16 numbers.[^bekman]
-### Default to disbelieving your own results (Neel Nanda)
+Corollary from the same docstring: validate your debugging instrumentation on a few cheap batches before betting an hours-long run on it.
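The rolling-buffer idea can be illustrated in plain Python. This is a toy sketch of the concept only, not the `DebugUnderflowOverflow` implementation: the class `RollingFrames`, its frame layout, and the default buffer size are all assumptions made for illustration.

```python
import math
from collections import deque


class RollingFrames:
    """Keep the last max_frames records of (name, abs_min, abs_max, finite)."""

    def __init__(self, max_frames=20):
        self.frames = deque(maxlen=max_frames)

    def record(self, name, values):
        """Store one frame; return the buffered run-up if any value is inf/NaN."""
        finite = all(math.isfinite(v) for v in values)
        mags = [abs(v) for v in values if math.isfinite(v)]
        frame = (name, min(mags, default=None), max(mags, default=None), finite)
        self.frames.append(frame)
        return None if finite else list(self.frames)


buf = RollingFrames(max_frames=3)
buf.record("layer0", [0.5, -2.0])
buf.record("layer1", [3.0e2, -4.0e3])
buf.record("layer2", [6.0e4, 1.0])          # close to the fp16 max of 65504
run_up = buf.record("layer3", [float("inf"), 1.0])
print([frame[0] for frame in run_up])       # ['layer1', 'layer2', 'layer3']
```

The frame before the crash (`layer2`, abs-max 6e4) is the informative one, which is the point of the quote: the overflow site only shows where the numbers finally broke.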
-> The default state of the world is that your research is false, because doing research is hard.[^nanda]
+### Walk the pipeline in data order (HF course)
-> Excitement is evidence of bullshit: Generally, most true results are not exciting, but a fair amount of false results are. So from a Bayesian perspective, if a result is exciting and cool, it's even more likely to be false than normal![^nanda]
+The HF LLM course debugging chapter is a worked narrative in the Karpathy-recipe lineage: a deliberately broken fine-tune, fixed step by step, checking each stage at the exact point it enters the model.
-The cheapest antidote he gives: "Read your data ... Often, the quality of the data is a crucial driver of the results of your experiments. Often, it is quite bad."[^nanda]
+> The best way to debug an error that arises in `trainer.train()` is to manually go through this whole pipeline to see where things went awry. The error is then often very easy to solve.[^hfcourse]
-### Understand the system to shrink the search (Ulisse Mini)
+> Hyperparameter tuning is always emphasized as being the hardest part of machine learning, but it's just the last step to help you gain a little bit on the metric. [...] don't launch into a time-consuming and costly hyperparameter search until you have something that beats the baseline you have on your dataset.[^hfcourse]
-> When good programmers debug hard problems fast, it's usually because they understand the system well enough to *track the important internal state* in their head, letting them drastically *reduce the solution space they're searching over.*[^ulisse]
+### Chat template and BOS handling must match across train and deploy (unsloth)
-### Gears beat black boxes (John Wentworth)
+When a model trains fine but produces nonsense after export to llama.cpp or Ollama, the weights are usually innocent:
-> figuring out a system's gears takes extra work up-front, but yields dividends forever. [...] The black-box approach is cheaper for one-off tasks, but usually doesn't yield any insights which will generalize to new tasks using the same system[^wentworth]
+> The most common cause of this error is using an **incorrect chat template**. It's essential to use the SAME chat template that was used when training the model in Unsloth and later when you run it in another framework, such as llama.cpp or Ollama. [...] It might also be because your inference engine adds an unnecessary "start of sequence" token (or the lack of thereof on the contrary) so ensure you check both hypotheses![^unsloth]
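Both hypotheses from the quote can be checked mechanically once you have the train-time and deploy-time inputs side by side. A minimal sketch, assuming you can dump the rendered prompt string and the token ids from each side; the helper names, the BOS id of `1`, and the sample prompts are hypothetical.

```python
def leading_bos_count(token_ids, bos_id):
    """Count how many copies of bos_id open the sequence."""
    count = 0
    for token in token_ids:
        if token != bos_id:
            break
        count += 1
    return count


def first_divergence(a, b):
    """Index of the first position where two sequences differ, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


BOS_ID = 1  # hypothetical; read the real id from your tokenizer
train_ids = [1, 306, 763, 29901]
deploy_ids = [1, 1, 306, 763, 29901]  # the inference engine added a second BOS

train_prompt = "<|user|>\nhi\n<|assistant|>\n"
deploy_prompt = "<|user|>\nhi\n<|assistant|>"  # trailing newline missing

print("BOS at train:", leading_bos_count(train_ids, BOS_ID))
print("BOS at deploy:", leading_bos_count(deploy_ids, BOS_ID))
print("templates diverge at char:", first_divergence(train_prompt, deploy_prompt))
```

Comparing token ids rather than decoded text matters here, because a duplicated special token is often invisible once the sequence is decoded back to a string.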
-The pattern-matched fix is the black box; a mechanistic model of your system is the capital investment that pays off across many bugs.
+Their FAQ also explains the suspiciously perfect loss curve: when the loss sits at exactly zero, every label has probably been masked out and the model is learning nothing.
---
+> All labels in your dataset are -100. Training losses will be all 0.[^unsloth]
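A cheap guard for this failure is to measure how many label positions are actually trainable before launching a run. The sketch below works on plain nested lists so it runs without any framework; `trainable_fraction` is a hypothetical helper, and `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # positions with this label contribute nothing to the loss


def trainable_fraction(batch_labels, ignore_index=IGNORE_INDEX):
    """Fraction of label positions that are not masked out."""
    total = 0
    trainable = 0
    for row in batch_labels:
        for label in row:
            total += 1
            if label != ignore_index:
                trainable += 1
    return trainable / total if total else 0.0


# Hypothetical batches: one with only the response tokens unmasked,
# one where the masking step swallowed everything.
healthy = [[-100, -100, 42, 7], [-100, 13, 99, 2]]
broken = [[-100, -100, -100, -100], [-100, -100, -100, -100]]

for name, batch in [("healthy", healthy), ("broken", broken)]:
    frac = trainable_fraction(batch)
    status = "ok" if frac > 0 else "every label is masked, loss will be 0"
    print(f"{name}: trainable={frac:.2f} ({status})")
```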
-## For LLM agents
+### Shrink every axis at once, and clear the caches (axolotl)
-Unfortunately, agents need these procedural mindset-shifts spelled out. This is the babysitting layer, not the durable folklore, hence its place at the bottom. If you're an agent debugging ML code, run the loop and avoid the anti-patterns.
+Axolotl's debugging guide (the general tips trace to Hamel Husain) gives the minimal-repro recipe for training loops: one GPU, one process, a tiny model, tiny data, a single step, no eval. It also warns that caching can quietly undo your experiment, because the run you think you changed may be replaying artifacts produced before the change:
-### The debugging loop (use judgment, it's not a checklist)
+> **Eliminate concurrency**: Restrict the number of processes to 1 for both training and data preprocessing[^axolotl]
-Roughly in this order, though the point is the underlying mindset:
+> Axolotl caches certain steps and so does the underlying HuggingFace trainer. You may want to clear some of these caches when debugging.[^axolotl]
-**Collect clues before theorizing.** Read the traceback and logs. Run static analysis ([refs/static_analysis.md](refs/static_analysis.md)) and the cheap diagnostics ([refs/diagnostics.md](refs/diagnostics.md): data sanity check, init-loss check, overfit-one-batch). If you catch yourself proposing a fix before you've looked at anything, stop.
+Their training-stability page adds the masking check ("inspect tokenized samples to confirm only the target tokens are trainable") and, bluntly: "Debugging a failed run without metrics is guesswork."[^axolotl-stability]
-**Hold several hypotheses at once; resist converging early.** Unless the cause is already obvious (a traceback usually points right at it), generate a few genuinely different explanations before ranking any, so you don't marry the first one. Use the five lenses in Mental models. Then sanity-check yourself with the failure-mode triplet (same idiom as the `research-journal` skill):
+## Reference (one hop away)
 - *Likely*: your strongest competitor explanation, with a rough credence.
 - *Subtle*: the sneaky one, like sample size, leakage, a confound, a metric artifact, or plain seed variance masquerading as signal.
 - *Null*: there's no real effect, or it comes from something else you also changed.
-Give each a one-line prior and its cheapest falsifier (`Check: ...`). Anchor priors on Practitioner priors above, but a clue that points elsewhere overrides them outright. Keep observations (reproducible, auditable) separate from inferences, so you can rethink without degrading the evidence.
+Open the relevant one when the task calls for it. These are synthesized checklists and menus, useful for widening a hypothesis search but not authoritative for your particular system:
-**Run the cheapest observation that splits your top hypotheses.** Not the most thorough experiment, the most *discriminating* one. Forward-predict each hypothesis ("what would I see if this were the cause?"); a test is strong evidence only where the predictions diverge. A grad-norm line reading ~0 under "dead layer" but healthy under "LR too low" beats a 4-hour sweep that only confirms what you believed.
+- [PLAYBOOK.md](PLAYBOOK.md) — the long-form version: mental models and practitioner priors, the general step catalog (component isolation, baseline ladder, what to log, numerical hygiene), symptom tables, the agent debugging loop, triage, and anti-patterns.
-
+- [refs/diagnostics.md](refs/diagnostics.md) — copy-paste diagnostic snippets: init-loss check, overfit-one-batch, gradient-flow check, NaN hooks, NaN-poisoning leakage tracer, backprop-to-input dependency check, class-imbalance check.
 **Bisect the path to localize where it breaks.** Data flows forward and gradients backward in a chain (input -> preprocess -> layers -> loss -> grads), so probe the midpoint: is the value or gradient already wrong halfway through? Each probe halves the search space. Finding the first module to produce a non-finite value is one case; the same bisection works for finite-but-wrong values, exploded norms, and dead activations.
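The bisection can be sketched on a toy pipeline. This is an illustration only, assuming that corruption propagates (once a stage output is bad, every later output is bad too); `first_bad_stage` and the four stages are hypothetical. The forward pass is still run once in full; what bisection saves is the number of intermediate values you have to inspect.

```python
import math


def first_bad_stage(stages, x, is_ok):
    """Return the name of the first stage whose output fails is_ok, or None.

    stages: list of (name, fn) applied in order.
    Assumes bad outputs stay bad downstream, so the checks are monotone.
    """
    outputs = []
    for _, fn in stages:
        x = fn(x)
        outputs.append(x)

    lo, hi = 0, len(outputs)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_ok(outputs[mid]):
            lo = mid + 1  # break-point is after the midpoint
        else:
            hi = mid  # break-point is at or before the midpoint
    return stages[lo][0] if lo < len(stages) else None


def all_finite(values):
    return all(math.isfinite(v) for v in values)


# Hypothetical pipeline: "square" overflows a Python float to inf.
stages = [
    ("scale", lambda xs: [v * 1e200 for v in xs]),
    ("square", lambda xs: [v * v for v in xs]),
    ("shift", lambda xs: [v - 1.0 for v in xs]),
    ("sum", lambda xs: [sum(xs)]),
]
print(first_bad_stage(stages, [1.0, 2.0, 4.0], all_finite))
```

Swapping `all_finite` for a norm threshold or a dead-activation test gives the same search for finite-but-wrong values.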
 **Then act, only on what the observation pointed to.** If a cycle or two hasn't localized it, stop tuning and go read working code, which usually beats another guess.
 ```py
 # ── ML debugging loop ────────────────────────
 def debug(symptom):
    clues ← collect(traceback, logs, static_analysis, cheap_diagnostics)  # look before theorizing
    H     ← generate(clues, lenses=5) | {likely, subtle, null}            # ≥3 genuinely different
    prior ← anchor(H)              # base rates: data .40 loss .20 train .15 arch .10 hp .05
    while not localized:
        t̂     ← argmax(divergence(predict(h, t) for h in H) / cost(t) for t in candidates)
        obs   ← run(t̂)            # one log line or toy run; keep obs apart from inference
        prior ← update(prior, obs)
        H     ← bisect_path(H, obs)  # halve the search space each probe
        if cycles ≥ 2:
            return read_working_code()   # diff your math + graph vs a trusted impl
    fix(root_cause); assert reproduces(obs)   # no silent fallback; crash if it doesn't
 ```
 ### Triage (a menu, not a flowchart to obey)
 Rough order to consider, not authoritative; it may not fit your project. Stop when a question fits.
 1. Exception/traceback? Read it, fix it, done.
 2. Loss NaN/Inf? Attach NaN hooks ([refs/diagnostics.md](refs/diagnostics.md)), find the first module producing NaN. Usual causes: log(0), 0/0, exp(large); add clamp/eps.
 3. Init loss wrong? Check the data pipeline and loss; check for double softmax; check labels match output format. Same loss on random input -> data destroyed. Init loss << expected -> leakage.
 4. Can't overfit one batch? Gradient-flow check: None grads -> disconnected layer; all-zero grads -> dead layer / detach. Check autograd breakers and optimizer step order.
 5. Loss stuck from step 0 but you *can* overfit one batch? LR too low (try 10x), frozen params (check `requires_grad`), wrong loss.
 6. Loss decreases then explodes? LR too high (try 0.1x), log the pre-clip grad norm, hunt numerical instability.
 7. Train good, val bad? Usually overfitting rather than a bug: more data, regularization, or a smaller model.
 8. Train loss fine but the metric is bad? Loss-metric misalignment ([refs/metric_stuck.md](refs/metric_stuck.md)).
 9. Outputs constant? Mode collapse. Suspect class imbalance, all-zero init, or dead ReLUs; look at confidence-sorted errors.
 10. Slow but not stuck? Not a bug. Consider batch size, depth/width, data quality.
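For the init-loss question in item 3, the expected value is easy to compute when the model is a softmax classifier trained with cross-entropy and its initial predictions are roughly uniform: the loss should start near $-\ln(1/C) = \ln C$ for $C$ classes. A minimal sketch under those assumptions; `init_loss_verdict` and its 10% tolerance are hypothetical choices, not a standard.

```python
import math


def expected_init_loss(num_classes):
    """Cross-entropy of a uniform prediction over num_classes classes."""
    return math.log(num_classes)


def init_loss_verdict(observed, num_classes, rel_tol=0.1):
    """Classify an observed initial loss against ln(num_classes)."""
    expected = expected_init_loss(num_classes)
    if observed < expected * (1 - rel_tol):
        return "too low"   # the triage list above reads this as possible leakage
    if observed > expected * (1 + rel_tol):
        return "too high"  # e.g. bad init scale or a loss/label mismatch
    return "ok"


for observed in (2.3, 0.4, 6.9):
    print(observed, init_loss_verdict(observed, num_classes=10))
```

With imbalanced classes or a biased final layer the uniform assumption no longer holds, so treat the tolerance as a rough screen rather than a pass/fail gate.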
 ### Anti-patterns
 These are the overconfident reflexes the "calibrate" section warns about, made concrete. Every one changes behaviour before localizing the bug. (As people put it: "this is sklearn slop," or "the LLM is tweaking hyperparameters like it's in a hackathon, not understanding the problem.")
 - Hyperparameter changes before verifying correctness. "Try reducing the learning rate" is the #1 wrong response. Verify the code first; HP tuning on buggy code wastes time.
 - `try/except` around training code. Training should crash loudly. A caught exception hides the bug and produces silently wrong results. The one exception is checkpoint-on-KeyboardInterrupt.
 - "Try a different optimizer." If Adam doesn't converge, it's almost never the optimizer; it's the loss, the data, the architecture, or a bug.
 - `.detach()` / `.item()` to "fix" gradient errors. If autograd complains, the graph is wrong. Detaching silences it by cutting gradient flow, so the model just stops learning from that path.
 - `lr_scheduler` as a *cure for non-convergence*. Schedules matter (transformers need warmup, cyclic/cosine is often best-in-class, AdamW is the standard pairing), but they refine or enable convergence in an otherwise-healthy setup; they don't rescue a model that can't learn at constant LR because of a bug. Add the schedule once the basics work, not as a debugging band-aid.
 - More layers / a bigger model. If it can't overfit one batch, more parameters won't help. The problem is gradient flow, loss, or data.
 - "Normalize your data" without checking whether it already is. Run the data sanity check first.
 - `float()` / `.to(dtype)` to suppress type warnings. Type mismatches are signals; a float32/float64 mismatch might mean you're mixing model weights with double-precision data. Fix the root cause.
 ---
 ## Appendix: deeper tricks
 Look these up when the symptom calls for them; they're kept out of the main flow on purpose.
 - [refs/loss_surface.md](refs/loss_surface.md) — visualize a loss surface and its gradient field with synthetic tensors, no model or GPU. For when a custom loss misbehaves.
 - [refs/metric_stuck.md](refs/metric_stuck.md) — "why won't this metric move?" plus the structural-ceiling check (is the optimizer failing, or can the parameterization not express it?).
 - [refs/sweeps.md](refs/sweeps.md) — same-seed paired comparison and cross-seed t-stat reliability, so a result is "reliably better" not "a lucky seed."
 - [refs/static_analysis.md](refs/static_analysis.md) — grep patterns for silent bugs (shape mismatches, autograd breakers, double softmax, step ordering, leakage).
- [refs/diagnostics.md](refs/diagnostics.md) — copy-paste diagnostic snippets (init-loss check, overfit-one-batch, gradient-flow check, NaN hooks, NaN-poisoning leakage tracer, backprop-to-input dependency check, class-imbalance check).
+- [refs/loss_surface.md](refs/loss_surface.md) — visualize a loss surface and its gradient field with synthetic tensors, no model or GPU, for when a custom loss misbehaves.
- [rl/SKILL.md](rl/SKILL.md) — RL-specific debugging: probe environments, reward engineering, HP defaults, reference implementations.
+- [refs/metric_stuck.md](refs/metric_stuck.md) — "why won't this metric move?" plus the structural-ceiling check.
- [pinn/SKILL.md](pinn/SKILL.md) — physics-informed-network debugging: nondimensionalization, gradient pathologies, curriculum.
+- [refs/sweeps.md](refs/sweeps.md) — same-seed paired comparison and cross-seed t-stat reliability, for before you claim method A beats method B.
 - [rl/SKILL.md](rl/SKILL.md) — RL-specific: probe environments, reward engineering, HP defaults, reference implementations.
 - [pinn/SKILL.md](pinn/SKILL.md) — physics-informed networks: nondimensionalization, gradient pathologies, curriculum.
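As a rough illustration of the same-seed paired comparison mentioned for sweeps, the sketch below computes a paired t-statistic over per-seed score differences. It is not the contents of `refs/sweeps.md`; `paired_t_stat` and the scores are made up for the example, and it assumes both methods were run on the same list of seeds.

```python
import math
from statistics import mean, stdev


def paired_t_stat(scores_a, scores_b):
    """t-statistic of the per-seed differences a - b (same seeds, same order)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("need one score per seed for both methods")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two seeds")
    spread = stdev(diffs)
    if spread == 0:
        avg = mean(diffs)
        return 0.0 if avg == 0 else math.copysign(math.inf, avg)
    return mean(diffs) / (spread / math.sqrt(n))


# Hypothetical validation scores for seeds 0..3.
method_a = [0.81, 0.79, 0.84, 0.80]
method_b = [0.78, 0.77, 0.80, 0.79]
print(f"paired t = {paired_t_stat(method_a, method_b):.2f}")
```

Pairing by seed removes the seed-to-seed variance that both methods share, so a small but consistent gap shows up even when the raw score ranges overlap.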
 ## Links and further reading
@@ -407,20 +232,16 @@ Folklore sources (the quotes above trace to these):
 [^jones]: Andy Jones, "Debugging RL, Without the Agonizing Pain" — https://andyljones.com/posts/rl-debugging.html ([cache](docs/evidence/andyljones_rl_debugging.md): anomalies L103-109, write-from-scratch L155, assume-bug L176-180, raise-threshold L182, loss-curve L186-188)
 [^rahtz]: Matthew Rahtz (Amid Fish), "Lessons Learned Reproducing a Deep RL Paper" — http://amid.fish/reproducing-deep-rl ([cache](docs/evidence/amid_fish_reproducing_deep_rl.md): frame-diff confusion L85-87, investigate-confusion L100-102, think-more L145-153, don't-implement-RL-yourself L497-501)
 [^karpathy-recipe]: Andrej Karpathy, "A Recipe for Training Neural Networks" (2019) — https://karpathy.github.io/2019/04/25/recipe/ ([cache](docs/evidence/karpathy_recipe_training_nn_2019.md): inspect-data L26+L32, fixed-seed L39, overfit-one-batch L51, Adam-3e-4 L73; note: this is an abridged note with its own "..." elisions)
 [^karpathy-3e4]: Andrej Karpathy, tweet, 23 Nov 2016: "3e-4 is the best learning rate for Adam, hands down." — https://x.com/karpathy/status/801621764144971776 (he confirmed in-thread it was a joke; not in the local evidence files, verified against the tweet)
 [^schulman]: John Schulman, "Nuts and Bolts of Deep RL Research" slides — http://joschu.net/docs/nuts-and-bolts.pdf ([cache](docs/evidence/joschu_nuts_and_bolts.md): Always-Be-Ablating L98-101, standardize-observations L118-125; rendered as bullets because the PDF source is slide fragments)
 [^henderson]: Henderson et al., "Deep Reinforcement Learning that Matters" (AAAI 2018) — https://arxiv.org/abs/1709.06560 ([cache](docs/evidence/henderson_2018_deep_rl_matters.md): seeds-create-different-distributions L235, implementation-differences L251)
 [^irpan]: Alex Irpan, "Deep Reinforcement Learning Doesn't Work Yet" (2018) — https://www.alexirpan.com/2018/02/14/rl-hard.html ([cache](docs/evidence/alexirpan_rl_hard.md): variance-bug-or-unlucky L674-678, seed-canary L705-707)
 [^cs231n]: Stanford CS231n, "Neural Networks Part 3" — https://cs231n.github.io/neural-networks-3/ ([cache](docs/evidence/cs231n_neural_networks_3.md): overfit-tiny-subset L89)
 [^slavv]: Slav Ivanov, "37 Reasons why your Neural Network is not working" (2017) — https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607 ([cache](docs/evidence/slavv_37_reasons_nn.md): opening anecdote L19, emergency checklist L45-51)
 [^fsdl]: Josh Tobin, Full Stack Deep Learning Spring 2021, Lecture 7 "Troubleshooting DNNs" — https://fullstackdeeplearning.com/spring2021/lecture-7/ ([cache](docs/evidence/fsdl_spring2021_lecture7.md))
 [^goodfellow]: Goodfellow, Bengio, Courville, *Deep Learning*, ch. 11 "Practical Methodology" — https://www.deeplearningbook.org/ ([cache](docs/evidence/goodfellow_ch11_practical_methodology.md): one-part-broken-others-adapt L198, weights-adapt-to-compensate L204)
 [^cs229]: Andrew Ng, CS229 "Advice for Applying Machine Learning" — https://cs229.stanford.edu/ ([cache](docs/evidence/cs229_ml_advice.md))
 [^mccandlish]: McCandlish, Kaplan et al., "An Empirical Model of Large-Batch Training" (2018) — https://arxiv.org/abs/1812.06162 ([cache](docs/evidence/mccandlish_2018_large_batch.md))
 [^goyal]: Goyal et al., "Accurate, Large Minibatch SGD" (2017) — https://arxiv.org/abs/1706.02677
 [^lucidrains]: Phil Wang (lucidrains), x-transformers README — https://github.com/lucidrains/x-transformers ([cache](docs/evidence/lucidrains_x_transformers_readme.md): post-embedding LayerNorm / BLOOM+YaLM L366, attention-overflow / cosine-sim norm L1230, autoregressive validation L1234, "wiping out a source of instability" / QK RMSNorm L1292)
 [^koaning]: Vincent D. Warmerdam (koaning), "Bad Labels" (2021) — https://koaning.io/posts/labels/ ([cache](docs/evidence/koaning_bad_labels.md): bad-labels-huge-problem L13, confidence-sort trick L21, spend-less-time-tuning L33)
 [^jaxtyping]: Patrick Kidger, jaxtyping (runtime shape/dtype checking) — https://github.com/patrick-kidger/jaxtyping
 [^nanochat]: nanochat (Karpathy), documented via DeepWiki — https://deepwiki.com/karpathy/nanochat ([cache](docs/evidence/nanochat_deepwiki_llm_pretraining_2026.md): BOS fake-improvement L97, all-ranks-clip-on-inf L131)
 [^kidger]: Patrick Kidger, "Just Know Stuff" (2023) — https://kidger.site/thoughts/just-know-stuff/ ([cache](docs/evidence/kidger_just_know_stuff.md): kludge-definition L7, junior-developer L9, never-accept-the-kludge L11, don't-delete-and-clone L13)
 [^gwern]: Gwern Branwen, "The Neural Net Tank Legend" — https://gwern.net/tank ([cache](docs/evidence/gwern_tank.md): cautionary tale L7, urban-legend conclusion L9)
@@ -429,7 +250,12 @@ Folklore sources (the quotes above trace to these):
 [^gwern-unseeing]: Gwern Branwen, "Unseeing" — https://gwern.net/unseeing ([cache](docs/evidence/gwern_unseeing.md): read-what-you-wrote L9, single-anomaly L13)
 [^ulisse]: Ulisse Mini, "How to get good at programming" — https://www.lesswrong.com/posts/LTypqBMTSmRrrhb2v/how-to-get-good-at-programming ([cache](docs/evidence/ulisse_how_to_get_good_at_programming.md): track-internal-state L7, brute-force-search L9, leaky-abstractions L11)
 [^wentworth]: John Wentworth, "Gears-Level Models are Capital Investments" — https://www.lesswrong.com/posts/nEBbw2Bc2CnN2RMxy/gears-level-models-are-capital-investments ([cache](docs/evidence/wentworth_gears_level_models.md): gears-dividends L7, valley-of-bad-theory L11)
 [^hfcourse]: Sylvain Gugger et al., HF LLM Course ch. 8.4, "Debugging the training pipeline" — https://huggingface.co/learn/llm-course/chapter8/4 ([cache](docs/evidence/hf_llm_course_ch8_4_debugging_pipeline.md): walk-the-pipeline L14, overfit-one-batch L678-680, no-tuning-before-baseline L724-726)
 [^bekman]: Stas Bekman, `DebugUnderflowOverflow` docstring, transformers `debug_utils.py` (2021) — https://github.com/huggingface/transformers/blob/main/src/transformers/debug_utils.py ([cache](docs/evidence/bekman_debug_utils_transformers.md): purpose L35-36, detection-and-frame-buffer L51-53, previous-frames L86-92)
 [^unsloth]: Unsloth (Daniel & Michael Han-Chen), "Troubleshooting & FAQs" — https://docs.unsloth.ai/basics/troubleshooting-and-faqs ([cache](docs/evidence/unsloth_troubleshooting_faqs.md): template-mismatch + BOS L38-39, shuffle-eval L100, all-labels=-100-loss-0 L227-229)
 [^axolotl]: Axolotl, "Debugging" (general tips: Hamel Husain) — https://docs.axolotl.ai/docs/debugging.html ([cache](docs/evidence/axolotl_debugging.md): simplify L31, one-process L37, small-model + fast-iteration L48-49, caches L54-58)
 [^axolotl-stability]: Axolotl, "Training Stability" — https://docs.axolotl.ai/docs/training_stability.html ([cache](docs/evidence/axolotl_training_stability.md): metrics-from-the-start L27, inspect-tokenized-masking L67, reward-fn-standalone L99)
-For modern transformer pretraining specifically (the sources above predate it), see [Karpathy's recipe](https://karpathy.github.io/2019/04/25/recipe/) and the [nanochat deepwiki](https://deepwiki.com/karpathy/nanochat) (320+ empirical HP sweeps for a GPT-2-scale run). Most multi-source claims trace to quotes in [docs/ml_debug_folklore.argdown](docs/ml_debug_folklore.argdown) (vargdown); the full evidence set is in [docs/evidence/](docs/evidence/).
+For modern transformer pretraining specifically (most sources above predate it), see [Karpathy's recipe](https://karpathy.github.io/2019/04/25/recipe/) and the [nanochat deepwiki](https://deepwiki.com/karpathy/nanochat) (320+ empirical HP sweeps for a GPT-2-scale run). Most multi-source claims trace to quotes in [docs/ml_debug_folklore.argdown](docs/ml_debug_folklore.argdown) (vargdown); the full evidence set is in [docs/evidence/](docs/evidence/).
 Curated by [wassname](https://github.com/wassname). Companion gist: https://gist.github.com/wassname/e45e41f75c0b50e72ec1f4cff811a277
@@ -0,0 +1,241 @@
 Source: https://docs.axolotl.ai/docs/debugging.html
 Title: Debugging - Axolotl documentation, general tips trace to Hamel Husain (undated, fetched 2026)
 Fetched-via: uvx markitdown https://docs.axolotl.ai/docs/debugging.html
 Fetch-status: verbatim, nav/sidebar/TOC boilerplate trimmed
 # Debugging
 How to debug Axolotl
 This document provides some tips and tricks for debugging Axolotl. It also provides an example configuration for debugging with VSCode. A good debugging setup is essential to understanding how Axolotl code works behind the scenes.
 Tip
 For training-specific debugging (loss spikes, NaN gradients, OOM errors, RL training stability), see [Training Stability & Debugging](../docs/training_stability.html).
 ## Table of Contents
 * [General Tips](#general-tips)
 * [Debugging with VSCode](#debugging-with-vscode)
  + [Background](#background)
  + [Configuration](#configuration)
  + [Customizing your debugger](#customizing-your-debugger)
  + [Video Tutorial](#video-tutorial)
 * [Debugging With Docker](#debugging-with-docker)
  + [Setup](#setup)
  + [Attach To Container](#attach-to-container)
  + [Video - Attaching To Docker On Remote Host](#video---attaching-to-docker-on-remote-host)
 ## General Tips
 While debugging it’s helpful to simplify your test scenario as much as possible. Here are some tips for doing so:
 > [!Important]
 > All of these tips are incorporated into the [example configuration](#configuration) for debugging with VSCode below.
 1. **Make sure you are using the latest version of axolotl**: This project changes often and bugs get fixed fast. Check your git branch and make sure you have pulled the latest changes from `main`.
 2. **Eliminate concurrency**: Restrict the number of processes to 1 for both training and data preprocessing:
   * Set `CUDA_VISIBLE_DEVICES` to a single GPU, ex: `export CUDA_VISIBLE_DEVICES=0`.
   * Set `dataset_num_proc: 1` in your axolotl config or run the training command with `--dataset_num_proc=1`.
 3. **Use a small dataset**: Construct or use a small dataset from HF Hub. When using a small dataset, you will often have to set `sample_packing: False` and `eval_sample_packing: False` to avoid errors. If you are in a pinch and don’t have time to construct a small dataset but want to use one from the HF Hub, you can shard the data. This will still tokenize the entire dataset, but will only use a fraction of the data for training. For example, to shard the dataset into 20 pieces, add the following to your axolotl config:
   ```
   datasets:
       ...
       shards: 20
   ```
 4. **Use a small model**: A good example of a small model is [TinyLlama/TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0).
 5. **Minimize iteration time**: Make sure the training loop finishes as fast as possible, with these settings.
   * `micro_batch_size: 1`
   * `max_steps: 1`
   * `val_set_size: 0`
 6. **Clear Caches:** Axolotl caches certain steps and so does the underlying HuggingFace trainer. You may want to clear some of these caches when debugging.
   * Data preprocessing: When debugging data preprocessing, which includes prompt template formation, you may want to delete the directory set in `dataset_prepared_path:` in your axolotl config. If you didn’t set this value, the default is `last_run_prepared`.
   * HF Hub: If you are debugging data preprocessing, you should clear the relevant HF cache [HuggingFace cache](https://huggingface.co/docs/datasets/cache), by deleting the appropriate `~/.cache/huggingface/datasets/...` folder(s).
   * **The recommended approach is to redirect all outputs and caches to a temporary folder and delete selected subfolders before each run. This is demonstrated in the example configuration below.**
 ## Debugging with VSCode
 ### Background
 The below example shows how to configure VSCode to debug data preprocessing of the `chat_template` format. This is the format used when you have the following in your axolotl config:
 ```
 datasets:
  - path: <path to your chat_template formatted dataset> # example on HF Hub: fozziethebeat/alpaca_messages_2k_test
    type: chat_template
 ```
 > [!Important]
 > If you are already familiar with advanced VSCode debugging, you can skip the below explanation and look at the files [.vscode/launch.json](../.vscode/launch.json) and [.vscode/tasks.json](../.vscode/tasks.json) for an example configuration.
 > [!Tip]
 > If you prefer to watch a video, rather than read, you can skip to the [video tutorial](#video-tutorial) below (but doing both is recommended).
 ### Setup
 Make sure you have an [editable install](https://setuptools.pypa.io/en/latest/userguide/development_mode.html) of Axolotl, which ensures that changes you make to the code are reflected at runtime. Run the following commands from the root of this project:
 ```
 export UV_TORCH_BACKEND=cu128  # or cu130
 uv venv --no-project --relocatable
 source .venv/bin/activate
 uv pip install --no-build-isolation -e '.[deepspeed]' --group dev --group test
 ```
 #### Remote Hosts
 If you are developing on a remote host, you can easily use VSCode to debug remotely. To do so, you will need to follow this [remote - SSH guide](https://code.visualstudio.com/docs/remote/ssh). You can also see the video below on [Docker and Remote SSH debugging](#video---attaching-to-docker-on-remote-host).
 ### Configuration
 The easiest way to get started is to modify the [.vscode/launch.json](../.vscode/launch.json) file in this project. This is just an example configuration, so you may need to modify or copy it to suit your needs.
 For example, to mimic the command `cd devtools && CUDA_VISIBLE_DEVICES=0 axolotl train dev_chat_template.yml`, you would use the below configuration[1](#fn1). Note that we add additional flags that override the axolotl config and incorporate the tips above (see the comments). We also set the working directory to `devtools` and set the `env` variable `HF_HOME` to a temporary folder that is later partially deleted. This is because we want to delete the HF dataset cache before each run in order to ensure that the data preprocessing code is run from scratch.
 ```
 // .vscode/launch.json
 {
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Debug axolotl prompt - chat_template",
            "type": "python",
            "module": "accelerate.commands.launch",
            "request": "launch",
            "args": [
                "-m", "axolotl.cli.train", "dev_chat_template.yml",
                // The flags below simplify debugging by overriding the axolotl config
                // with the debugging tips above.  Modify as needed.
                "--dataset_num_proc=1",      // limits data preprocessing to one process
                "--max_steps=1",              // limits training to just one step
                "--batch_size=1",             // minimizes batch size
                "--micro_batch_size=1",       // minimizes batch size
                "--val_set_size=0",           // disables validation
                "--sample_packing=False",     // disables sample packing which is necessary for small datasets
                "--eval_sample_packing=False",// disables sample packing on eval set
                "--dataset_prepared_path=temp_debug/axolotl_outputs/data", // send data outputs to a temp folder
                "--output_dir=temp_debug/axolotl_outputs/model" // send model outputs to a temp folder
                ],
            "console": "integratedTerminal",      // show output in the integrated terminal
            "cwd": "${workspaceFolder}/devtools", // set working directory to devtools from the root of the project
            "justMyCode": true,                   // step through only axolotl code
            "env": {"CUDA_VISIBLE_DEVICES": "0",  // Since we aren't doing distributed training, we need to limit to one GPU
                    "HF_HOME": "${workspaceFolder}/devtools/temp_debug/.hf-cache"}, // send HF cache to a temp folder
            "preLaunchTask": "cleanup-for-dataprep", // delete temp folders (see below)
        }
    ]
 }
 ```
 **Additional notes about this configuration:**
 * The argument `justMyCode` is set to `true` such that you step through only the axolotl code. If you want to step into dependencies, set this to `false`.
 * The `preLaunchTask`: `cleanup-for-dataprep` is defined in [.vscode/tasks.json](../.vscode/tasks.json) and is used to delete the following folders before debugging, which is essential to ensure that the data pre-processing code is run from scratch:
  + `./devtools/temp_debug/axolotl_outputs`
  + `./devtools/temp_debug/.hf-cache/datasets`
 > [!Tip]
 > You may not want to delete these folders. For example, if you are debugging model training instead of data pre-processing, you may NOT want to delete the cache or output folders. You may also need to add additional tasks to the `tasks.json` file depending on your use case.
 Below is the [.vscode/tasks.json](../.vscode/tasks.json) file that defines the `cleanup-for-dataprep` task. This task is run before each debugging session when you use the above configuration. Note how there are two tasks that delete the two folders mentioned above. The third task `cleanup-for-dataprep` is a composite task that combines the two tasks. A composite task is necessary because VSCode does not allow you to specify multiple tasks in the `preLaunchTask` argument of the `launch.json` file.
 ```
 // .vscode/tasks.json
 // this file is used by launch.json
 {
    "version": "2.0.0",
    "tasks": [
      // this task changes into the devtools directory and deletes the temp_debug/axolotl_outputs folder
      {
        "label": "delete-outputs",
        "type": "shell",
        "command": "rm -rf temp_debug/axolotl_outputs",
        "options":{ "cwd": "${workspaceFolder}/devtools"},
        "problemMatcher": []
      },
      // this task changes into the devtools directory and deletes the `temp_debug/.hf-cache/datasets` folder
      {
        "label": "delete-temp-hf-dataset-cache",
        "type": "shell",
        "command": "rm -rf temp_debug/.hf-cache/datasets",
        "options":{ "cwd": "${workspaceFolder}/devtools"},
        "problemMatcher": []
      },
        // this task combines the two tasks above
      {
       "label": "cleanup-for-dataprep",
       "dependsOn": ["delete-outputs", "delete-temp-hf-dataset-cache"],
      }
    ]
 }
 ```
 ### Customizing your debugger
 Your debugging use case may differ from the example above. The easiest thing to do is to put your own axolotl config in the `devtools` folder and modify the `launch.json` file to use your config. You may also want to modify the `preLaunchTask` to delete different folders or not delete anything at all.
 ### Video Tutorial
 The following video tutorial walks through the above configuration and demonstrates how to debug with VSCode, (click the image below to watch):
 [![](https://i.ytimg.com/vi/xUUB11yeMmc/maxresdefault.jpg)](https://youtu.be/xUUB11yeMmc "How to debug Axolotl (for fine tuning LLMs)")
 [Hamel Husain’s](https://hamel.dev) tutorial: [Debugging Axolotl w/VSCode](https://www.youtube.com/watch?v=xUUB11yeMmc)
 ## Debugging With Docker
 Using [official Axolotl Docker images](https://hub.docker.com/r/axolotlai/axolotl/tags) is a great way to debug your code, and is a very popular way to use Axolotl. Attaching VSCode to Docker takes a few more steps.
 ### Setup
 On the host that is running axolotl (ex: if you are using a remote host), clone the axolotl repo and change your current directory to the root:
 ```
 git clone https://github.com/axolotl-ai-cloud/axolotl
 cd axolotl
 ```
 > [!Tip]
 > If you already have axolotl cloned on your host, make sure you have the latest changes and change into the root of the project.
 Next, run the desired docker image and mount the current directory. Below is a docker command you can run to do this:[2](#fn2)
 ```
 docker run --privileged --gpus '"all"' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,src="${PWD}",target=/workspace/axolotl -v ${HOME}/.cache/huggingface:/root/.cache/huggingface axolotlai/axolotl-uv:main-latest
 ```
 > [!Tip]
 > To understand which containers are available, see the [Docker section of the README](../README.md#docker) and the [DockerHub repo](https://hub.docker.com/r/axolotlai/axolotl/tags). For details of how the Docker containers are built, see axolotl’s [Docker CI builds](../.github/workflows/main.yml).
 You will now be in the container. Next, install Axolotl with dev dependencies:
 ```
 uv venv --no-project --relocatable
 source .venv/bin/activate
 uv pip install --no-build-isolation -e '.[deepspeed]' --group dev --group test
 ```
 ### Attach To Container
 Next, if you are using a remote host, [Remote into this host with VSCode](https://code.visualstudio.com/docs/remote/ssh). If you are using a local host, you can skip this step.
 Next, select `Dev Containers: Attach to Running Container...` using the command palette (`CMD + SHIFT + P`) in VSCode. You will be prompted to select a container to attach to. Select the container you just created. You will now be in the container with a working directory that is at the root of the project. Any changes you make to the code will be reflected both in the container and on the host.
 Now you are ready to debug as described above (see [Debugging with VSCode](#debugging-with-vscode)).
 ### Video - Attaching To Docker On Remote Host
 Here is a short video that demonstrates how to attach to a Docker container on a remote host:
 [![](https://i.ytimg.com/vi/0AuoR7QnHR0/hqdefault.jpg)](https://youtu.be/0AuoR7QnHR0 "Debugging Axolotl Part 2: Attaching to Docker on a Remote Host")
 [Hamel Husain’s](https://hamel.dev) tutorial: [Debugging Axolotl Part 2: Attaching to Docker on a Remote Host](https://youtu.be/0AuoR7QnHR0)
 ## Footnotes
 1. The VSCode config uses `accelerate.commands.launch` as the Python module entry point, which is what `axolotl train` invokes under the hood.[↩︎](#fnref1)
 2. Many of the flags below are best practices recommended by NVIDIA for use with nvidia-container-toolkit. You can read more about these flags [here](https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html).[↩︎](#fnref2)
@@ -0,0 +1,399 @@
 Source: https://docs.axolotl.ai/docs/training_stability.html
 Title: Training Stability & Debugging - Axolotl documentation (undated, fetched 2026)
 Fetched-via: uvx markitdown https://docs.axolotl.ai/docs/training_stability.html
 Fetch-status: verbatim, nav/sidebar/TOC boilerplate trimmed
 # Training Stability & Debugging
 Guide to monitoring, debugging, and stabilizing training runs in axolotl
 This guide covers practical techniques for monitoring training health, diagnosing instability, and resolving common failures in both supervised fine-tuning (SFT) and reinforcement learning (GRPO/EBFT) workflows.
 ## Monitoring Training
 ### Key Metrics for SFT
 Every SFT run should be monitored through at least these four metrics:
 | Metric | What It Tells You | Healthy Range |
 | --- | --- | --- |
 | `train/loss` | How well the model fits training data | Decreasing; typically 0.5–2.0 for chat fine-tuning |
 | `eval/loss` | Generalization performance | Tracks train loss with small gap; divergence signals overfitting |
 | `grad_norm` | Gradient magnitude | 0.1–10.0; spikes above 100 indicate instability |
 | `learning_rate` | Current LR from scheduler | Should follow expected schedule (warmup then decay) |
 **Tip: Set Up Logging Early**
 Enable W&B or TensorBoard from the start. Debugging a failed run without metrics is guesswork.
 ```
 wandb_project: my-project
 wandb_run_id:   # optional, for resuming
 logging_steps: 1
 ```
 ### Key Metrics for RL (GRPO)
 GRPO training logs a richer set of metrics. These are the critical ones:
 | Metric | Healthy Range | Red Flag |
 | --- | --- | --- |
 | `rewards/<name>/mean` | > 0.15 within 20 steps | Stays at 0 – reward function is broken or task is too hard |
 | `reward_std` | > 0 on most steps | Always 0 – no learning signal (all completions get the same reward) |
 | `frac_reward_zero_std` | < 0.8 | 1.0 on every step – zero-advantage skip fires constantly, no gradient updates |
 | `grad_norm` | 0.001–1.0 | 0.0 is acceptable occasionally (zero-adv skip); > 10.0 is unstable |
 | `entropy` | 0.05–0.5 | < 0.01 suggests mode collapse; > 1.0 suggests the model is not converging |
 | `kl` | 0.0–0.5 | > 2.0 suggests policy has diverged too far from reference |
 | `sampling/sampling_logp_difference/mean` | < 0.1 | > 1.0 means policy has diverged far from vLLM server weights |
 | `sampling/importance_sampling_ratio/min` | > 0.1 | Near 0 indicates stale off-policy data; increase `vllm_sync_interval` |
 | `clip_ratio/region_mean` | < 0.1 | > 0.3 means PPO clipping is too aggressive |
 | `completions/mean_length` | Task-dependent | Monotonically increasing to max length suggests reward hacking |
 | `completions/clipped_ratio` | < 0.3 | > 0.8 means most completions hit `max_completion_length` – increase it |
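The red-flag column above can be turned into a quick automated check. The following is a minimal sketch, not part of axolotl: `RED_FLAGS` and `grpo_red_flags` are hypothetical names, and the thresholds are simply transcribed from the table. It assumes you already have one step's metrics as a plain dict keyed by the logged metric names.

```python
# Red-flag thresholds transcribed from the table above: (metric, predicate, message).
# RED_FLAGS and grpo_red_flags are illustrative names, not axolotl APIs.
RED_FLAGS = [
    ("frac_reward_zero_std", lambda v: v >= 1.0, "every prompt group got identical rewards"),
    ("grad_norm", lambda v: v > 10.0, "gradient norm is unstable"),
    ("entropy", lambda v: v < 0.01, "possible mode collapse"),
    ("entropy", lambda v: v > 1.0, "model may not be converging"),
    ("kl", lambda v: v > 2.0, "policy has diverged far from the reference"),
    ("sampling/sampling_logp_difference/mean", lambda v: v > 1.0,
     "policy has diverged from the vLLM server weights"),
    ("clip_ratio/region_mean", lambda v: v > 0.3, "PPO clipping is too aggressive"),
    ("completions/clipped_ratio", lambda v: v > 0.8,
     "most completions hit max_completion_length"),
]


def grpo_red_flags(metrics):
    """Return a list of warnings for one logged step; missing metrics are skipped."""
    flags = []
    for name, is_bad, message in RED_FLAGS:
        value = metrics.get(name)
        if value is not None and is_bad(value):
            flags.append(f"{name}={value}: {message}")
    return flags


if __name__ == "__main__":
    step = {"entropy": 0.005, "kl": 0.1, "grad_norm": 0.5}
    for warning in grpo_red_flags(step):
        print(warning)
```

A single flagged step is rarely conclusive; several rows in the table only count as red flags when the value persists across many steps, so treat the output as a prompt to look at the chart.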
 **Note: EBFT-Specific Metrics**
 For EBFT training, also monitor `ebft/alignment` (should trend upward, healthy 0.3–0.9), `ebft/diversity` (healthy 0.01–0.1; > 1.0 indicates mode collapse), and `ebft/cfm_loss` (should trend downward, < 10).
 ## SFT Stability
 ### Loss Plateau
 **Symptom**: Loss stops decreasing early in training, well above expected values.
 **Causes and fixes**:
 * **Learning rate too low**: Increase by 2–5x. Typical ranges: full fine-tune 1e-5 to 5e-5, LoRA 1e-4 to 3e-4.
 * **Insufficient warmup**: Set `warmup_steps` to 5–10% of total steps. Too-aggressive learning at the start can push the model into a flat region.
 * **Data quality**: Check that labels are correctly masked. Use `axolotl preprocess` and inspect tokenized samples to confirm only the target tokens are trainable.
 * **Weight decay too high**: Default 0.01 is usually fine. Values above 0.1 can suppress learning in LoRA.
 ### Loss Spikes
 **Symptom**: Loss suddenly jumps by 2–10x then (possibly) recovers.
 **Causes and fixes**:
 * **Bad data samples**: A single malformed or extremely long example can cause a spike. Enable `sample_packing: false` temporarily and check if spikes correlate with specific batches.
 * **Learning rate too high**: Reduce by 2–5x, or increase warmup.
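 * **Gradient accumulation mismatch**: Effective batch size = `micro_batch_size * gradient_accumulation_steps * num_gpus`. A smaller effective batch size than intended gives noisier gradient estimates, so verify that the product matches the batch size the learning rate was tuned for.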
 * **Mixed precision issues**: With `bf16: true`, some operations can lose precision. If spikes are severe, try `fp32` for diagnosis.
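The effective batch size formula from the list above is easy to check by hand before a run. This is a small sketch with hypothetical helper names; the steps-per-epoch estimate assumes no sample packing and that the last partial batch is kept, which may not match your configuration.

```python
import math


def effective_batch_size(micro_batch_size, gradient_accumulation_steps, num_gpus):
    """Samples contributing to one optimizer step, per the formula above."""
    return micro_batch_size * gradient_accumulation_steps * num_gpus


def optimizer_steps_per_epoch(num_samples, batch_size):
    """Rough estimate; assumes no sample packing and a kept final partial batch."""
    return math.ceil(num_samples / batch_size)


if __name__ == "__main__":
    ebs = effective_batch_size(micro_batch_size=2, gradient_accumulation_steps=8, num_gpus=4)
    print("effective batch size:", ebs)
    print("optimizer steps per epoch:", optimizer_steps_per_epoch(10_000, ebs))
```

Recomputing this after changing the GPU count or `micro_batch_size` makes it obvious when the effective batch size has drifted from the one the learning rate was chosen for.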
 ### Overfitting
 **Symptom**: Train loss keeps decreasing but eval loss starts increasing.
 **Fixes**:
 * Increase `val_set_size` (e.g., 0.05) and monitor `eval/loss`.
 * Reduce `num_epochs` or `max_steps`.
 * Increase `weight_decay` (try 0.01–0.1).
 * Use a smaller LoRA rank (`lora_r`). Typical values: 8–32.
 * Increase dropout: `lora_dropout: 0.05`.
 ## RL/GRPO Stability
 ### Reward Never Increases
 If `rewards/*/mean` stays at 0 for more than 20 steps:
 1. **Test reward function standalone**: Run it outside training with known inputs to verify it returns nonzero values.
   ```
   cd experiments && python -c "import my_rewards; print(my_rewards.accuracy_reward(...))"
   ```
 2. **Check dataset columns**: The reward function receives `**kwargs` containing dataset columns. Verify the columns it needs (e.g., `answer`) are not removed by the dataset transform.
 3. **Check completion content**: Enable `log_completions: true` in the `trl:` config and inspect logged completions in W&B. If completions are empty or incoherent, the model may be too weak for the task.
 4. **Verify vLLM is serving the right model**: Hit the vLLM health endpoint and confirm the model name matches your config.
 ### Entropy Collapse (Mode Collapse)
 **Symptom**: `entropy` drops below 0.01; all completions become nearly identical.
 **Fixes**:
 * Increase `temperature` in generation kwargs (try 0.8–1.0).
 * Reduce learning rate.
 * Add a KL penalty term (`beta` parameter in GRPO config).
 * Check that `num_generations` is sufficient (16+ gives better advantage estimates).
 ### IS Ratio Divergence
 **Symptom**: `sampling/importance_sampling_ratio/min` drops near 0, or `sampling/sampling_logp_difference/mean` exceeds 1.0.
 This means the policy has diverged significantly from the weights used by vLLM for generation. The importance sampling correction becomes unreliable.
 **Fixes**:
 * Decrease `vllm_sync_interval` (sync weights more often).
 * Enable `off_policy_mask_threshold` (e.g., 0.5) to mask stale off-policy samples.
 * Use `importance_sampling_level: token` for finer-grained correction.
 ### Gradient Norm Instability
 **Symptom**: `grad_norm` oscillates wildly or exceeds 10.0 regularly.
 **Fixes**:
 * Enable gradient clipping: `max_grad_norm: 1.0` (default in most configs).
 * Reduce learning rate.
 * Increase `gradient_accumulation_steps` to smooth out noisy batches.
 * Check for NaN issues (see next section).
 ## NaN and Inf Handling
 ### Common Causes
 | Cause | Where It Manifests | Detection |
 | --- | --- | --- |
 | FP8 zero-scale division | Forward pass logits | `grad_norm: nan`, loss becomes NaN immediately |
 | Gradient explosion | Backward pass | `grad_norm` spikes to inf, then loss goes NaN |
 | Bad data (empty sequences) | Logprob computation | NaN in specific batches only |
 | Numerical overflow in log-softmax | Loss computation | Large negative logprobs cause exp() overflow |
 ### FP8-Specific NaN Issues
 FP8 quantization (`fp8: true`) can produce NaN when the activation quantization kernel divides by `max(abs(x)) / 448`. If the input tensor is all zeros (e.g., padding positions), the scale becomes 0, causing division by zero.
 **Fixes applied in axolotl**:
 * The `act_quant_kernel` has a zero-guard: `s = tl.where(s == 0, 1.0, s)`.
 * A safety net `nan_to_num(logits, nan=0.0)` is applied in `_get_per_token_logps_and_entropies`.
 * Embedding padding is zero-padded for FP8 compatibility.
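To see why the zero-guard matters, here is a plain-Python sketch of the scale computation described above. It is an illustration only, not the Triton kernel: `quantize_scale` and `scale_block` are hypothetical names, and the explicit `nan` stands in for what IEEE 0/0 would produce on the GPU (plain Python would raise `ZeroDivisionError` instead).

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format


def quantize_scale(block, guard=True):
    """Per-block scale s = max(|x|) / 448, optionally guarded against zero."""
    s = max(abs(v) for v in block) / FP8_E4M3_MAX
    if guard and s == 0.0:
        s = 1.0  # same idea as tl.where(s == 0, 1.0, s)
    return s


def scale_block(block, guard=True):
    """Divide each element by the block scale, emulating 0/0 -> NaN when unguarded."""
    s = quantize_scale(block, guard)
    if s == 0.0:
        return [float("nan") for _ in block]  # what an unguarded kernel would yield
    return [v / s for v in block]


if __name__ == "__main__":
    padding = [0.0, 0.0, 0.0]
    print("unguarded:", scale_block(padding, guard=False))
    print("guarded:  ", scale_block(padding, guard=True))
```

With the guard, an all-zero block (such as padding positions) stays all-zero instead of poisoning the logits with NaN.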
 **Important: After Modifying Triton Kernels**
 If you patch any Triton JIT kernel (e.g., the FP8 quantization kernels in transformers), you must clear the Triton cache for changes to take effect:
 ```
 rm -rf ~/.triton/cache
 ```
 ### General NaN Debugging Steps
 1. **Enable anomaly detection** (slow, but pinpoints the source):
   ```
   torch.autograd.set_detect_anomaly(True)
   ```
 2. **Check grad\_norm**: If it goes to NaN, the backward pass is the problem. If loss is NaN but grad\_norm was fine on the previous step, the forward pass is the problem.
 3. **Reduce to single GPU, single batch**: Eliminate distributed training variables.
 4. **Inspect data**: Print the batch that triggers NaN. Look for empty sequences, extreme token IDs, or unexpected padding patterns.
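The rule of thumb in step 2 can be applied mechanically to a run's logged history. The sketch below is a heuristic under stated assumptions: `locate_nan_source` is a hypothetical helper, and it expects a list of per-step dicts with `loss` and `grad_norm` keys that you have exported yourself (for example from W&B or a log file).

```python
import math


def locate_nan_source(history):
    """Return (step_index, "forward" | "backward") for the first non-finite value.

    Scans steps in order. A non-finite grad_norm with a finite loss points at
    the backward pass; a non-finite loss after a step with healthy gradients
    points at the forward pass. Returns None if everything is finite.
    """
    for step, record in enumerate(history):
        if not math.isfinite(record["loss"]):
            # All earlier steps had finite grad_norm, otherwise we would have
            # returned "backward" already.
            return step, "forward"
        if not math.isfinite(record["grad_norm"]):
            return step, "backward"
    return None


if __name__ == "__main__":
    history = [
        {"loss": 1.2, "grad_norm": 0.8},
        {"loss": 1.1, "grad_norm": float("inf")},
        {"loss": float("nan"), "grad_norm": float("nan")},
    ]
    print(locate_nan_source(history))
```

This only narrows the search to one half of the step; anomaly detection or a single-batch repro is still needed to find the offending operation.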
 ## OOM Debugging
 Out-of-memory errors are the most common training failure. Use this systematic approach, from least to most disruptive:
 ### Step 1: Reduce Batch Size
 The single highest-impact change. VRAM scales roughly linearly with batch size.
 ```
 micro_batch_size: 1              # Start here
 gradient_accumulation_steps: 16  # Increase to maintain effective batch size
 ```
 For GRPO specifically, the logits tensor for policy logprob computation can be very large: `batch_size * num_generations * seq_len * vocab_size` elements in bf16 (2 bytes each). For example, with `num_generations: 16` and `micro_batch_size: 8`, the logits tensor alone is:
 ```
 8 * 16 * 2048 * 151936 * 2 bytes = ~79.7 GB (~74.2 GiB)  (way too large)
 ```
 Reduce `micro_batch_size` to 2–4 for GRPO.
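The arithmetic above is worth scripting so you can try different settings before launching a run. This is a minimal sketch; `logits_bytes` is a hypothetical helper, and the sequence length of 2048 and vocabulary size of 151936 are just the values from the example.

```python
def logits_bytes(micro_batch_size, num_generations, seq_len, vocab_size, bytes_per_element=2):
    """Size of the full logits tensor in bytes (bf16 uses 2 bytes per element)."""
    return micro_batch_size * num_generations * seq_len * vocab_size * bytes_per_element


if __name__ == "__main__":
    for mbs in (8, 4, 2):
        size = logits_bytes(mbs, num_generations=16, seq_len=2048, vocab_size=151936)
        print(f"micro_batch_size={mbs}: {size / 1e9:.1f} GB ({size / 2**30:.1f} GiB)")
```

The size is linear in every factor, so halving `micro_batch_size` halves the logits tensor.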
 ### Step 2: Enable Gradient Checkpointing
 Trades compute for memory by recomputing activations during the backward pass instead of storing them.
 ```
 gradient_checkpointing: true
 gradient_checkpointing_kwargs:
  use_reentrant: false     # Recommended default
 ```
 **Warning: Reentrant Checkpointing Exceptions**
 Some configurations require `use_reentrant: true`:
 * DeepSpeed ZeRO-3 (non-reentrant causes `CheckpointError`)
 * EBFT strided mode with flex\_attention
 ### Step 3: Use Quantization
 Load the base model in reduced precision:
 ```
 # 4-bit QLoRA
 adapter: qlora
 load_in_4bit: true
 # 8-bit
 load_in_8bit: true
 # FP8 (saves ~50% model VRAM, same compute speed as bf16)
 fp8: true
 ```
 ### Step 4: Reduce Sequence Length
 ```
 sequence_len: 1024     # Down from 2048 or 4096
 ```
 For GRPO, also reduce `max_completion_length`. Memory scales quadratically with sequence length when using standard attention.
 ### Step 5: Use Flash Attention
 Reduces attention memory from O(n^2) to O(n):
 ```
 attn_implementation: flash_attention_2
 ```
 ### Step 6: Offload with DeepSpeed
 For extreme cases, offload optimizer states or parameters to CPU:
 ```
 deepspeed: deepspeed_configs/zero3_bf16.json
 ```
 ### Diagnosing the Specific Culprit
 Use the `profiler_steps` config option to capture GPU memory snapshots:
 ```
 profiler_steps: [1, 2]
 ```
 This generates PyTorch profiler traces you can inspect to see exactly which tensor allocation caused the OOM.
 ## Common Errors
 | Error Message | Likely Cause | Fix |
 | --- | --- | --- |
 | `exitcode: -9` | System RAM exhaustion | Reduce dataset size, `dataset_num_proc`, or number of data workers |
 | `exitcode: -7` (DeepSpeed) | DeepSpeed version issue | `pip install -U deepspeed` |
 | `CUDA out of memory` | GPU VRAM exhaustion | Follow OOM debugging steps above |
 | `RuntimeError: NCCL communicator was aborted` | GPU communication failure | See [NCCL docs](../docs/nccl.html); check `NCCL_DEBUG=INFO` output |
 | `ValueError: Asking to pad but the tokenizer does not have a padding token` | Missing pad token | Add `special_tokens: { pad_token: "<\|endoftext\|>" }` to config |
 | `'DummyOptim' object has no attribute 'step'` | DeepSpeed on single GPU | Remove `deepspeed:` section from config |
 | `unable to load strategy X` then `None is not callable` | Reward module not importable | Run `cd experiments && python -c "import my_rewards"` to check |
 | `generation_batch_size not divisible by num_generations` | micro\_batch\_size too small | Set `micro_batch_size >= num_generations` and make it divisible |
 | `'weight' must be 2-D` | FSDP1 flattened parameters | Use `fsdp_version: 2` or skip `unwrap_model` when FSDP is enabled |
 | `CheckpointError` (tensor count mismatch) | Non-reentrant checkpointing + ZeRO-3 or flex\_attention | Set `use_reentrant: true` in `gradient_checkpointing_kwargs` |
 | `BFloat16` TypeError during weight sync | NumPy does not support bf16 | Fixed in axolotl’s `weight_serde.py` (auto bf16 to fp16 conversion) |
 | `Content end boundary is before start boundary` | Chat template parsing issue | Check `eos_token` matches template; file a GitHub issue if persistent |
 | `CAS service error` during data processing | HuggingFace XET issue | Set `export HF_HUB_DISABLE_XET=1` |
 | Training hangs (multi-GPU) | FSDP + async prefetch deadlock | Set `async_prefetch: false` with FSDP |
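Negative exit codes such as `-9` in the table usually mean the process was killed by a signal. The sketch below decodes them, assuming the launcher follows the same convention as Python's `subprocess` module (exit code `-N` means "terminated by signal N"); `describe_exit_code` is a hypothetical helper, and signal numbers other than 9 can differ between platforms.

```python
import signal


def describe_exit_code(code):
    """Translate a process exit code into a short description."""
    if code < 0:
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = f"signal {-code}"  # unknown signal number on this platform
        return f"killed by {name}"
    return "success" if code == 0 else f"exited with status {code}"


if __name__ == "__main__":
    for code in (0, 1, -9):
        print(code, "->", describe_exit_code(code))
```

`SIGKILL` is what the Linux out-of-memory killer sends, which is consistent with the "System RAM exhaustion" diagnosis for `exitcode: -9`.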
 ## Profiling
 ### PyTorch Profiler
 Axolotl supports PyTorch profiler integration via the config:
 ```
 profiler_steps: [1, 2, 3]
 ```
 This captures profiler traces for the specified steps. View them in TensorBoard:
 ```
 tensorboard --logdir output_dir/runs
 ```
 Or open the `.json` trace file in `chrome://tracing`.
 ### CUDA Memory Snapshots
 For detailed memory analysis, use PyTorch’s memory snapshot API. Add this to your training script or use it interactively:
 ```
 import torch
 # Enable memory history tracking
 torch.cuda.memory._record_memory_history()
 # ... run your training step ...
 # Save snapshot
 torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
 ```
 Visualize with PyTorch’s memory visualizer:
 ```
 python -m torch.cuda.memory._viz memory_snapshot.pickle
 ```
 ### Quick GPU Memory Check
 During training, monitor GPU utilization in a separate terminal:
 ```
 watch -n 1 nvidia-smi
 ```
 For programmatic access within axolotl, the logged metrics `memory/max_alloc` and `memory/max_reserved` come from `torch.cuda.max_memory_allocated()` and `torch.cuda.max_memory_reserved()`. Note these report PyTorch’s view of memory, which may differ from `nvidia-smi` (see [FAQ](../docs/faq.html)).
 ## W&B and Logging
 ### Enabling Logging
 ```
 wandb_project: my-project
 wandb_entity: my-team          # optional
 wandb_run_id: run-123          # optional, for resuming
 wandb_name: experiment-name    # optional
 logging_steps: 1               # log every step (recommended for RL)
 ```
 ### Debug Logging
 For detailed axolotl-internal debug output:
 ```
 AXOLOTL_LOG_LEVEL=DEBUG axolotl train config.yaml 2>&1 | tee /tmp/training.log
 ```
 **Tip: Always Log to a File**
 Pipe training output to a log file so you can inspect it after the run:
 ```
 axolotl train config.yaml 2>&1 | tee /tmp/my_run.log
 ```
 ### What Axolotl Logs
 **SFT metrics** (logged every `logging_steps`):
 * `train/loss`, `eval/loss` – training and validation loss
 * `train/grad_norm` – gradient L2 norm (before clipping)
 * `train/learning_rate` – current learning rate
 * `memory/max_alloc`, `memory/max_reserved` – peak GPU memory
 **GRPO/RL metrics** (logged every step):
 * `rewards/<name>/mean`, `rewards/<name>/std` – per-reward-function statistics
 * `reward`, `reward_std` – aggregated reward across all reward functions
 * `frac_reward_zero_std` – fraction of prompt groups where all completions got the same reward
 * `completions/mean_length`, `completions/min_length`, `completions/max_length` – completion token lengths
 * `completions/clipped_ratio` – fraction of completions that hit the max length
 * `completions/mean_terminated_length`, `completions/min_terminated_length`, `completions/max_terminated_length` – lengths of naturally terminated completions
 * `kl` – KL divergence between policy and reference
 * `entropy` – policy entropy (measure of output diversity)
 * `clip_ratio/region_mean`, `clip_ratio/low_mean`, `clip_ratio/high_mean` – PPO clipping statistics
 * `sampling/sampling_logp_difference/mean`, `sampling/sampling_logp_difference/max` – log-probability difference between policy and sampling distribution
 * `sampling/importance_sampling_ratio/min`, `sampling/importance_sampling_ratio/mean`, `sampling/importance_sampling_ratio/max` – IS ratio statistics for off-policy correction
 * `num_tokens` – total tokens processed
 ### Reading W&B Charts
 For a healthy GRPO run, expect to see:
 1. **`reward/mean`**: Gradual upward trend. May start near 0 and reach 0.3–0.8 depending on task difficulty. Not monotonic – fluctuations are normal.
 2. **`entropy`**: Gradual decrease from initial values (often 0.3–0.6) as the model becomes more confident. Should not collapse to near-zero.
 3. **`grad_norm`**: Mostly in the 0.001–1.0 range. Occasional 0.0 values are fine (zero-advantage skip). Persistent values above 10.0 need investigation.
 4. **`kl`**: Starts near 0 and grows slowly. If it shoots up rapidly, the policy is diverging from the reference.
 5. **`completions/mean_length`**: Should reflect the task’s natural answer length. If it steadily increases to `max_completion_length`, the model may be reward-hacking by generating longer outputs.
@@ -0,0 +1,355 @@
 Source: https://raw.githubusercontent.com/huggingface/transformers/main/src/transformers/debug_utils.py
 Title: debug_utils.py (DebugUnderflowOverflow) - Stas Bekman, huggingface/transformers (2021)
 Fetched-via: curl -sL https://raw.githubusercontent.com/huggingface/transformers/main/src/transformers/debug_utils.py
 Fetch-status: verbatim
 ```python
 # Copyright 2020 The HuggingFace Team. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import collections
 from .utils import ExplicitEnum, is_torch_available, logging
 if is_torch_available():
    import torch
 logger = logging.get_logger(__name__)
 class DebugUnderflowOverflow:
    """
    This debug class helps detect and understand where the model starts getting very large or very small, and more
    importantly `nan` or `inf` weight and activation elements.
    There are 2 working modes:
    1. Underflow/overflow detection (default)
    2. Specific batch absolute min/max tracing without detection
    Mode 1: Underflow/overflow detection
    To activate the underflow/overflow detection, initialize the object with the model :
    ```python
    debug_overflow = DebugUnderflowOverflow(model)
    ```
    then run the training as normal and if `nan` or `inf` gets detected in at least one of the weight, input or output
    elements this module will throw an exception and will print `max_frames_to_save` frames that lead to this event,
    each frame reporting
    1. the fully qualified module name plus the class name whose `forward` was run
    2. the absolute min and max value of all elements for each module weights, and the inputs and output
    For example, here is the header and the last few frames in detection report for `google/mt5-small` run in fp16
    mixed precision :
    ```
    Detected inf/nan during batch_number=0
    Last 21 forward frames:
    abs min  abs max  metadata
    [...]
                      encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
    2.17e-07 4.50e+00 weight
    1.79e-06 4.65e+00 input[0]
    2.68e-06 3.70e+01 output
                      encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
    8.08e-07 2.66e+01 weight
    1.79e-06 4.65e+00 input[0]
    1.27e-04 2.37e+02 output
                      encoder.block.2.layer.1.DenseReluDense.wo Linear
    1.01e-06 6.44e+00 weight
    0.00e+00 9.74e+03 input[0]
    3.18e-04 6.27e+04 output
                      encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
    1.79e-06 4.65e+00 input[0]
    3.18e-04 6.27e+04 output
                      encoder.block.2.layer.1.dropout Dropout
    3.18e-04 6.27e+04 input[0]
    0.00e+00      inf output
    ```
    You can see here, that `T5DenseGatedGeluDense.forward` resulted in output activations, whose absolute max value was
    around 62.7K, which is very close to fp16's top limit of 64K. In the next frame we have `Dropout` which
    renormalizes the weights, after it zeroed some of the elements, which pushes the absolute max value to more than
    64K, and we get an overflow.
    As you can see it's the previous frames that we need to look into when the numbers start going into very large for
    fp16 numbers.
    The tracking is done in a forward hook, which gets invoked immediately after `forward` has completed.
    By default the last 21 frames are printed. You can change the default to adjust for your needs. For example :
    ```python
    debug_overflow = DebugUnderflowOverflow(model, max_frames_to_save=100)
    ```
     To validate that you have set up this debugging feature correctly, and you intend to use it in a training that
     may take hours to complete, first run it with normal tracing enabled for one or a few batches as explained in
     the next section.
     Mode 2: Specific batch absolute min/max tracing without detection
     The second work mode is per-batch tracing with the underflow/overflow detection feature turned off.
     Let's say you want to watch the absolute min and max values for all the ingredients of each `forward` call of a
    given batch, and only do that for batches 1 and 3. Then you instantiate this class as :
    ```python
    debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3])
    ```
    And now full batches 1 and 3 will be traced using the same format as explained above. Batches are 0-indexed.
    This is helpful if you know that the program starts misbehaving after a certain batch number, so you can
    fast-forward right to that area.
    Early stopping:
    You can also specify the batch number after which to stop the training, with :
    ```python
    debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3], abort_after_batch_num=3)
    ```
    This feature is mainly useful in the tracing mode, but you can use it for any mode.
    **Performance**:
     As this module measures absolute `min`/`max` of each weight of the model on every forward it'll slow the training
    down. Therefore remember to turn it off once the debugging needs have been met.
    Args:
        model (`nn.Module`):
            The model to debug.
        max_frames_to_save (`int`, *optional*, defaults to 21):
            How many frames back to record
        trace_batch_nums(`list[int]`, *optional*, defaults to `[]`):
            Which batch numbers to trace (turns detection off)
         abort_after_batch_num (`int`, *optional*):
            Whether to abort after a certain batch number has finished
    """
    def __init__(self, model, max_frames_to_save=21, trace_batch_nums=None, abort_after_batch_num=None):
        if trace_batch_nums is None:
            trace_batch_nums = []
        self.model = model
        self.trace_batch_nums = trace_batch_nums
        self.abort_after_batch_num = abort_after_batch_num
         # keep a bounded FIFO buffer of the most recent frames to dump as soon as inf/nan is encountered, to give context to the problem emergence
        self.frames = collections.deque([], max_frames_to_save)
        self.frame = []
        self.batch_number = 0
        self.total_calls = 0
        self.detected_overflow = False
        self.prefix = "                 "
        self.analyse_model()
        self.register_forward_hook()
    def save_frame(self, frame=None):
        if frame is not None:
            self.expand_frame(frame)
        self.frames.append("\n".join(self.frame))
        self.frame = []  # start a new frame
    def expand_frame(self, line):
        self.frame.append(line)
    def trace_frames(self):
        print("\n".join(self.frames))
        self.frames = []
    def reset_saved_frames(self):
        self.frames = []
    def dump_saved_frames(self):
        print(f"\nDetected inf/nan during batch_number={self.batch_number}")
        print(f"Last {len(self.frames)} forward frames:")
        print(f"{'abs min':8} {'abs max':8} metadata")
        print("\n".join(self.frames))
        print("\n\n")
        self.frames = []
    def analyse_model(self):
        # extract the fully qualified module names, to be able to report at run time. e.g.:
        # encoder.block.2.layer.0.SelfAttention.o
        #
        # for shared weights only the first shared module name will be registered
        self.module_names = {m: name for name, m in self.model.named_modules()}
        # self.longest_module_name = max(len(v) for v in self.module_names.values())
    def analyse_variable(self, var, ctx):
        if torch.is_tensor(var):
            self.expand_frame(get_abs_min_max(var, ctx))
            if detect_overflow(var, ctx):
                self.detected_overflow = True
        elif var is None:
            self.expand_frame(f"{'None':>17} {ctx}")
        else:
            self.expand_frame(f"{'not a tensor':>17} {ctx}")
    def batch_start_frame(self):
        self.expand_frame(f"\n\n{self.prefix} *** Starting batch number={self.batch_number} ***")
        self.expand_frame(f"{'abs min':8} {'abs max':8} metadata")
    def batch_end_frame(self):
        self.expand_frame(f"{self.prefix} *** Finished batch number={self.batch_number - 1} ***\n\n")
    def create_frame(self, module, input, output):
        self.expand_frame(f"{self.prefix} {self.module_names[module]} {module.__class__.__name__}")
        # params
        for name, p in module.named_parameters(recurse=False):
            self.analyse_variable(p, name)
        # inputs
        if isinstance(input, tuple):
            for i, x in enumerate(input):
                self.analyse_variable(x, f"input[{i}]")
        else:
            self.analyse_variable(input, "input")
        # outputs
        if isinstance(output, tuple):
            for i, x in enumerate(output):
                # possibly a tuple of tuples
                if isinstance(x, tuple):
                    for j, y in enumerate(x):
                        self.analyse_variable(y, f"output[{i}][{j}]")
                else:
                    self.analyse_variable(x, f"output[{i}]")
        else:
            self.analyse_variable(output, "output")
        self.save_frame()
    def register_forward_hook(self):
        self.model.apply(self._register_forward_hook)
    def _register_forward_hook(self, module):
        module.register_forward_hook(self.forward_hook)
    def forward_hook(self, module, input, output):
        # - input is a tuple of packed inputs (could be non-Tensors)
        # - output could be a Tensor or a tuple of Tensors and non-Tensors
        last_frame_of_batch = False
        trace_mode = self.batch_number in self.trace_batch_nums
        if trace_mode:
            self.reset_saved_frames()
        if self.total_calls == 0:
            self.batch_start_frame()
        self.total_calls += 1
        # count batch numbers - the very first forward hook of the batch will be called when the
        # batch completes - i.e. it gets called very last - we know this batch has finished
        if module == self.model:
            self.batch_number += 1
            last_frame_of_batch = True
        self.create_frame(module, input, output)
        # if last_frame_of_batch:
        #     self.batch_end_frame()
        if trace_mode:
            self.trace_frames()
        if last_frame_of_batch:
            self.batch_start_frame()
        if self.detected_overflow and not trace_mode:
            self.dump_saved_frames()
            # now we can abort, as it's pointless to continue running
            raise ValueError(
                "DebugUnderflowOverflow: inf/nan detected, aborting as there is no point running further. "
                "Please scroll up above this traceback to see the activation values prior to this event."
            )
        # abort after certain batch if requested to do so
        if self.abort_after_batch_num is not None and self.batch_number > self.abort_after_batch_num:
            raise ValueError(
                f"DebugUnderflowOverflow: aborting after {self.batch_number} batches due to"
                f" `abort_after_batch_num={self.abort_after_batch_num}` arg"
            )
 def get_abs_min_max(var, ctx):
    abs_var = var.abs()
    return f"{abs_var.min():8.2e} {abs_var.max():8.2e} {ctx}"
 def detect_overflow(var, ctx):
    """
    Report whether the tensor contains any `nan` or `inf` entries.
    This is useful for detecting overflows/underflows and best to call right after the function that did some math that
    modified the tensor in question.
    This function contains a few other helper features that you can enable and tweak directly if you want to track
    various other things.
    Args:
        var: the tensor variable to check
        ctx: the message to print as a context
    Return:
        `True` if `inf` or `nan` was detected, `False` otherwise
    """
    detected = False
    if torch.isnan(var).any().item():
        detected = True
        print(f"{ctx} has nans")
    if torch.isinf(var).any().item():
        detected = True
        print(f"{ctx} has infs")
    # if needed to monitor large elements can enable the following
    if 0:  # and detected:
        n100 = var[torch.ge(var.abs(), 100)]
        if n100.numel() > 0:
            print(f"{ctx}:  n100={n100.numel()}")
        n1000 = var[torch.ge(var.abs(), 1000)]
        if n1000.numel() > 0:
            print(f"{ctx}: n1000={n1000.numel()}")
        n10000 = var[torch.ge(var.abs(), 10000)]
        if n10000.numel() > 0:
            print(f"{ctx}: n10000={n10000.numel()}")
    if 0:
        print(f"min={var.min():9.2e} max={var.max():9.2e}")
    if 0:
        print(f"min={var.min():9.2e} max={var.max():9.2e} var={var.var():9.2e} mean={var.mean():9.2e} ({ctx})")
    return detected
 class DebugOption(ExplicitEnum):
    UNDERFLOW_OVERFLOW = "underflow_overflow"
    TPU_METRICS_DEBUG = "tpu_metrics_debug"
 ```
@@ -0,0 +1,745 @@
 Source: https://huggingface.co/learn/llm-course/chapter8/4
 Title: Debugging the training pipeline (PyTorch variant) - Sylvain Gugger et al., HF LLM Course ch 8.4 (2022)
 Fetched-via: uvx markitdown https://huggingface.co/learn/llm-course/chapter8/4
 Fetch-status: verbatim
 # Debugging the training pipeline[[debugging-the-training-pipeline]]
 You've written a beautiful script to train or fine-tune a model on a given task, dutifully following the advice from [Chapter 7](/course/chapter7). But when you launch the command `trainer.train()`, something horrible happens: you get an error 😱! Or worse, everything seems to be fine and the training runs without error, but the resulting model is crappy. In this section, we will show you what you can do to debug these kinds of issues.
 ## Debugging the training pipeline[[debugging-the-training-pipeline]]
 The problem when you encounter an error in `trainer.train()` is that it could come from multiple sources, as the `Trainer` usually puts together lots of things. It converts datasets to dataloaders, so the problem could be something wrong in your dataset, or some issue when trying to batch elements of the datasets together. Then it takes a batch of data and feeds it to the model, so the problem could be in the model code. After that, it computes the gradients and performs the optimization step, so the problem could also be in your optimizer. And even if everything goes well for training, something could still go wrong during the evaluation if there is a problem with your metric.
 The best way to debug an error that arises in `trainer.train()` is to manually go through this whole pipeline to see where things went awry. The error is then often very easy to solve.
 To demonstrate this, we will use the following script that (tries to) fine-tune a DistilBERT model on the [MNLI dataset](https://huggingface.co/datasets/glue):
 ```py
 from datasets import load_dataset
 import evaluate
 from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
 )
 raw_datasets = load_dataset("glue", "mnli")
 model_checkpoint = "distilbert-base-uncased"
 tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
 def preprocess_function(examples):
    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True)
 tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
 model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
 args = TrainingArguments(
    f"distilbert-finetuned-mnli",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
 )
 metric = evaluate.load("glue", "mnli")
 def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    return metric.compute(predictions=predictions, references=labels)
 trainer = Trainer(
    model,
    args,
    train_dataset=raw_datasets["train"],
    eval_dataset=raw_datasets["validation_matched"],
    compute_metrics=compute_metrics,
 )
 trainer.train()
 ```
 If you try to execute it, you will be met with a rather cryptic error:
 ```python out
 'ValueError: You have to specify either input_ids or inputs_embeds'
 ```
 ### Check your data[[check-your-data]]
 This goes without saying, but if your data is corrupted, the `Trainer` is not going to be able to form batches, let alone train your model. So first things first, you need to have a look at what is inside your training set.
 To avoid countless hours spent trying to fix something that is not the source of the bug, we recommend you use `trainer.train_dataset` for your checks and nothing else. So let's do that here:
 ```py
 trainer.train_dataset[0]
 ```
 ```python out
 {'hypothesis': 'Product and geography are what make cream skimming work. ',
 'idx': 0,
 'label': 1,
 'premise': 'Conceptually cream skimming has two basic dimensions - product and geography.'}
 ```
 Do you notice something wrong? This, in conjunction with the error message about `input_ids` missing, should make you realize those are texts, not numbers the model can make sense of. Here, the original error is very misleading because the `Trainer` automatically removes the columns that don't match the model signature (that is, the arguments expected by the model). That means here, everything apart from the labels was discarded. There was thus no issue with creating batches and then sending them to the model, which in turn complained it didn't receive the proper input.
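
The column dropping described above can be imitated in a few lines with `inspect.signature`. This is only an illustration of the idea, not the `Trainer`'s actual implementation: `toy_forward` and `keep_signature_columns` are hypothetical names, and the `extra=("label",)` alias is an assumption made to mirror the fact that the label column survived in the example.

```python
import inspect


def toy_forward(input_ids=None, attention_mask=None, labels=None):
    """Hypothetical stand-in for a model's forward method."""


def keep_signature_columns(example, forward_fn, extra=("label",)):
    """Keep only the keys that forward_fn accepts (plus a few label aliases)."""
    accepted = set(inspect.signature(forward_fn).parameters) | set(extra)
    return {k: v for k, v in example.items() if k in accepted}


# An untokenized example: text columns only, no input_ids.
raw_example = {
    "premise": "Conceptually cream skimming has two basic dimensions.",
    "hypothesis": "Product and geography are what make cream skimming work.",
    "idx": 0,
    "label": 1,
}

filtered = keep_signature_columns(raw_example, toy_forward)
print(filtered)  # only the label survives, so the model never sees input_ids
```

Running the same filter on a tokenized example would keep `input_ids` and `attention_mask` as well, which is a quick way to see why the raw dataset produced the `input_ids` error.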
 Why wasn't the data processed? We did use the `Dataset.map()` method on the datasets to apply the tokenizer on each sample. But if you look closely at the code, you will see that we made a mistake when passing the training and evaluation sets to the `Trainer`. Instead of using `tokenized_datasets` here, we used `raw_datasets` 🤦. So let's fix this!
 ```py
 from datasets import load_dataset
 import evaluate
 from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
 )
 raw_datasets = load_dataset("glue", "mnli")
 model_checkpoint = "distilbert-base-uncased"
 tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
 def preprocess_function(examples):
    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True)
 tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
 model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
 args = TrainingArguments(
    f"distilbert-finetuned-mnli",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
 )
 metric = evaluate.load("glue", "mnli")
 def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    return metric.compute(predictions=predictions, references=labels)
 trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation_matched"],
    compute_metrics=compute_metrics,
 )
 trainer.train()
 ```
 This new code will now give a different error (progress!):
 ```python out
 'ValueError: expected sequence of length 43 at dim 1 (got 37)'
 ```
 Looking at the traceback, we can see the error happens in the data collation step:
 ```python out
 ~/git/transformers/src/transformers/data/data_collator.py in torch_default_data_collator(features)
    105                 batch[k] = torch.stack([f[k] for f in features])
    106             else:
 --> 107                 batch[k] = torch.tensor([f[k] for f in features])
    108
    109     return batch
 ```
 So, we should move to that. Before we do, however, let's finish inspecting our data, just to be 100% sure it's correct.
 One thing you should always do when debugging a training session is have a look at the decoded inputs of your model. We can't make sense of the numbers that we feed it directly, so we should look at what those numbers represent. In computer vision, for example, that means looking at the decoded pictures of the pixels you pass, in speech it means listening to the decoded audio samples, and for our NLP example here it means using our tokenizer to decode the inputs:
 ```py
 tokenizer.decode(trainer.train_dataset[0]["input_ids"])
 ```
 ```python out
 '[CLS] conceptually cream skimming has two basic dimensions - product and geography. [SEP] product and geography are what make cream skimming work. [SEP]'
 ```
 So that seems correct. You should do this for all the keys in the inputs:
 ```py
 trainer.train_dataset[0].keys()
 ```
 ```python out
 dict_keys(['attention_mask', 'hypothesis', 'idx', 'input_ids', 'label', 'premise'])
 ```
 Note that the keys that don't correspond to inputs accepted by the model will be automatically discarded, so here we will only keep `input_ids`, `attention_mask`, and `label` (which will be renamed `labels`). To double-check the model signature, you can print the class of your model, then go check its documentation:
 ```py
 type(trainer.model)
 ```
 ```python out
 transformers.models.distilbert.modeling_distilbert.DistilBertForSequenceClassification
 ```
 So in our case, we can check the parameters accepted on [this page](https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforsequenceclassification). The `Trainer` will also log the columns it's discarding.
 We have checked that the input IDs are correct by decoding them. Next is the `attention_mask`:
 ```py
 trainer.train_dataset[0]["attention_mask"]
 ```
 ```python out
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
 ```
 Since we didn't apply padding in our preprocessing, this seems perfectly natural. To be sure there is no issue with that attention mask, let's check it is the same length as our input IDs:
 ```py
 len(trainer.train_dataset[0]["attention_mask"]) == len(
    trainer.train_dataset[0]["input_ids"]
 )
 ```
 ```python out
 True
 ```
 That's good! Lastly, let's check our label:
 ```py
 trainer.train_dataset[0]["label"]
 ```
 ```python out
 1
 ```
 Like the input IDs, this is a number that doesn't really make sense on its own. As we saw before, the map between integers and label names is stored inside the `names` attribute of the corresponding *feature* of the dataset:
 ```py
 trainer.train_dataset.features["label"].names
 ```
 ```python out
 ['entailment', 'neutral', 'contradiction']
 ```
 So `1` means `neutral`, which means the two sentences we saw above are not in contradiction, and the first one does not imply the second one. That seems correct!
 We don't have token type IDs here, since DistilBERT does not expect them; if you have some in your model, you should also make sure that they properly match where the first and second sentences are in the input.
 > [!TIP]
 > ✏️ **Your turn!** Check that everything seems correct with the second element of the training dataset.
 We are only doing the check on the training set here, but you should of course double-check the validation and test sets the same way.
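
The length check above looked at a single example. A small helper can run the same check over many examples of any split. The sketch below is a hypothetical helper (`find_length_mismatches` is not part of any library) and uses made-up toy data in place of a tokenized split.

```python
def find_length_mismatches(examples, keys=("input_ids", "attention_mask")):
    """Return indices of examples whose listed fields differ in length."""
    bad = []
    for i, ex in enumerate(examples):
        lengths = {len(ex[k]) for k in keys if k in ex}
        if len(lengths) > 1:
            bad.append(i)
    return bad


# Toy data standing in for a tokenized split (values are made up).
toy_split = [
    {"input_ids": [101, 7, 8, 102], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [101, 9, 102], "attention_mask": [1, 1, 1, 1]},  # mismatch
]
print(find_length_mismatches(toy_split))  # [1]
```

It accepts any iterable of dict-like examples, so the same function can be pointed at samples from the validation and test splits.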
 Now that we know our datasets look good, it's time to check the next step of the training pipeline.
 ### From datasets to dataloaders[[from-datasets-to-dataloaders]]
 The next thing that can go wrong in the training pipeline is when the `Trainer` tries to form batches from the training or validation set. Once you are sure the `Trainer`'s datasets are correct, you can try to manually form a batch by executing the following (replace `train` with `eval` for the validation dataloader):
 ```py
 for batch in trainer.get_train_dataloader():
    break
 ```
 This code creates the training dataloader, then iterates through it, stopping at the first iteration. If the code executes without error, you have the first training batch that you can inspect, and if the code errors out, you know for sure the problem is in the dataloader, as is the case here:
 ```python out
 ~/git/transformers/src/transformers/data/data_collator.py in torch_default_data_collator(features)
    105                 batch[k] = torch.stack([f[k] for f in features])
    106             else:
 --> 107                 batch[k] = torch.tensor([f[k] for f in features])
    108
    109     return batch
 ValueError: expected sequence of length 45 at dim 1 (got 76)
 ```
 Inspecting the last frame of the traceback should be enough to give you a clue, but let's do a bit more digging. Most of the problems during batch creation arise because of the collation of examples into a single batch, so the first thing to check when in doubt is what `collate_fn` your `DataLoader` is using:
 ```py
 data_collator = trainer.get_train_dataloader().collate_fn
 data_collator
 ```
 ```python out
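 <function transformers.data.data_collator.default_data_collator(features: List[InputDataClass], return_tensors='pt') -> Dict[str, Any]>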
 ```
 So this is the `default_data_collator`, but that's not what we want in this case. We want to pad our examples to the longest sentence in the batch, which is done by the `DataCollatorWithPadding` collator. And this data collator is supposed to be used by default by the `Trainer`, so why is it not used here?
 The answer is because we did not pass the `tokenizer` to the `Trainer`, so it couldn't create the `DataCollatorWithPadding` we want. In practice, you should never hesitate to explicitly pass along the data collator you want to use, to make sure you avoid these kinds of errors. Let's adapt our code to do exactly that:
 ```py
 from datasets import load_dataset
 import evaluate
 from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
 )
 raw_datasets = load_dataset("glue", "mnli")
 model_checkpoint = "distilbert-base-uncased"
 tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
 def preprocess_function(examples):
    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True)
 tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
 model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
 args = TrainingArguments(
    f"distilbert-finetuned-mnli",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
 )
 metric = evaluate.load("glue", "mnli")
 def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    return metric.compute(predictions=predictions, references=labels)
 data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
 trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation_matched"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    tokenizer=tokenizer,
 )
 trainer.train()
 ```
 The good news? We don't get the same error as before, which is definitely progress. The bad news? We get an infamous CUDA error instead:
 ```python out
 RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
 ```
 This is bad because CUDA errors are extremely hard to debug in general. We will see in a minute how to solve this, but first let's finish our analysis of batch creation.
 If you are sure your data collator is the right one, you should try to apply it on a couple of samples of your dataset:
 ```py
 data_collator = trainer.get_train_dataloader().collate_fn
 batch = data_collator([trainer.train_dataset[i] for i in range(4)])
 ```
 This code will fail because the `train_dataset` contains string columns, which the `Trainer` usually removes. You can remove them manually, or if you want to replicate exactly what the `Trainer` is doing behind the scenes, you can call the private `Trainer._remove_unused_columns()` method that does that:
 ```py
 data_collator = trainer.get_train_dataloader().collate_fn
 actual_train_set = trainer._remove_unused_columns(trainer.train_dataset)
 batch = data_collator([actual_train_set[i] for i in range(4)])
 ```
 You should then be able to manually debug what happens inside the data collator if the error persists.
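
When stepping through a collator by hand, it helps to know what a correctly padded batch should look like. The following is a minimal pure-Python sketch of pad-to-longest collation, not the `DataCollatorWithPadding` implementation: `pad_to_longest` is a hypothetical name, `pad_id=0` and right-side padding are assumptions, and a real tokenizer defines its own padding token and side.

```python
def pad_to_longest(features, pad_id=0):
    """Pad input_ids and attention_mask to the longest example in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch


# Two examples of different lengths, like the ones that broke torch.tensor().
features = [
    {"input_ids": [101, 5, 6, 102], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [101, 7, 102], "attention_mask": [1, 1, 1]},
]
batch = pad_to_longest(features)
print(batch["input_ids"])       # every row now has the same length
print(batch["attention_mask"])  # padded positions are masked with 0
```

Without this step, the rows have different lengths and cannot be stacked into one rectangular tensor, which is what the `expected sequence of length ...` error was reporting.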
 Now that we've debugged the batch creation process, it's time to pass one through the model!
 ### Going through the model[[going-through-the-model]]
 You should be able to get a batch by executing the following command:
 ```py
 for batch in trainer.get_train_dataloader():
    break
 ```
 If you're running this code in a notebook, you may get a CUDA error that's similar to the one we saw earlier, in which case you need to restart your notebook and reexecute the last snippet without the `trainer.train()` line. That's the second most annoying thing about CUDA errors: they irremediably break your kernel. The most annoying thing about them is the fact that they are hard to debug.
 Why is that? It has to do with the way GPUs work. They are extremely efficient at executing a lot of operations in parallel, but the drawback is that when one of those instructions results in an error, you don't know it instantly. It's only when the program calls a synchronization of the multiple processes on the GPU that it will realize something went wrong, so the error is actually raised at a place that has nothing to do with what created it. For instance, if we look at our previous traceback, the error was raised during the backward pass, but we will see in a minute that it actually stems from something in the forward pass.
 So how do we debug those errors? The answer is easy: we don't. Unless your CUDA error is an out-of-memory error (which means there is not enough memory in your GPU), you should always go back to the CPU to debug it.
 To do this in our case, we just have to put the model back on the CPU and call it on our batch -- the batch returned by the `DataLoader` has not been moved to the GPU yet:
 ```python
 outputs = trainer.model.cpu()(**batch)
 ```
 ```python out
 ~/.pyenv/versions/3.7.9/envs/base/lib/python3.7/site-packages/torch/nn/functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
   2386         )
   2387     if dim == 2:
 -> 2388         ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
   2389     elif dim == 4:
   2390         ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
 IndexError: Target 2 is out of bounds.
 ```
 So, the picture is getting clearer. Instead of having a CUDA error, we now have an `IndexError` in the loss computation (so nothing to do with the backward pass, as we said earlier). More precisely, we can see that it's target 2 that creates the error, so this is a very good moment to check the number of labels of our model:
 ```python
 trainer.model.config.num_labels
 ```
 ```python out
 2
 ```
 With two labels, only 0s and 1s are allowed as targets, but according to the error message we got a 2. Getting a 2 is actually normal: if we remember the label names we extracted earlier, there were three, so we have indices 0, 1, and 2 in our dataset. The problem is that we didn't tell that to our model, which should have been created with three labels. So let's fix that!
 ```py
 from datasets import load_dataset
 import evaluate
 from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
 )
 raw_datasets = load_dataset("glue", "mnli")
 model_checkpoint = "distilbert-base-uncased"
 tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
 def preprocess_function(examples):
    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True)
 tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
 model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=3)
 args = TrainingArguments(
    f"distilbert-finetuned-mnli",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
 )
 metric = evaluate.load("glue", "mnli")
 def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    return metric.compute(predictions=predictions, references=labels)
 data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
 trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation_matched"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    tokenizer=tokenizer,
 )
 ```
 We aren't including the `trainer.train()` line yet, to take the time to check that everything looks good. If we request a batch and pass it to our model, it now works without error!
 ```py
 for batch in trainer.get_train_dataloader():
    break
 outputs = trainer.model.cpu()(**batch)
 ```
 The next step is then to move back to the GPU and check that everything still works:
 ```py
 import torch
 device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
 batch = {k: v.to(device) for k, v in batch.items()}
 outputs = trainer.model.to(device)(**batch)
 ```
 If you still get an error, make sure you restart your notebook and only execute the last version of the script.
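
The label mismatch found in this section can also be caught before any forward pass by comparing the label values against the size of the classification head. A minimal sketch, where `labels_out_of_range` is a hypothetical helper and the label list is toy data:

```python
def labels_out_of_range(labels, num_labels):
    """Return the sorted set of label values outside [0, num_labels)."""
    return sorted({y for y in labels if not 0 <= y < num_labels})


toy_labels = [1, 0, 2, 1, 2]  # three classes, as in MNLI
print(labels_out_of_range(toy_labels, num_labels=2))  # [2]
print(labels_out_of_range(toy_labels, num_labels=3))  # []
```

In the scenario above, one would presumably pass the dataset's label column and `trainer.model.config.num_labels`; a non-empty result points at the same problem as the `IndexError: Target 2 is out of bounds.` message.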
 ### Performing one optimization step[[performing-one-optimization-step]]
 Now that we know that we can build batches that actually go through the model, we are ready for the next step of the training pipeline: computing the gradients and performing an optimization step.
 The first part is just a matter of calling the `backward()` method on the loss:
 ```py
 loss = outputs.loss
 loss.backward()
 ```
 It's pretty rare to get an error at this stage, but if you do get one, make sure to go back to the CPU to get a helpful error message.
 To perform the optimization step, we just need to create the `optimizer` and call its `step()` method:
 ```py
 trainer.create_optimizer()
 trainer.optimizer.step()
 ```
 Again, if you're using the default optimizer in the `Trainer`, you shouldn't get an error at this stage, but if you have a custom optimizer, there might be some problems to debug here. Don't forget to go back to the CPU if you get a weird CUDA error at this stage. Speaking of CUDA errors, earlier we mentioned a special case. Let's have a look at that now.
 ### Dealing with CUDA out-of-memory errors[[dealing-with-cuda-out-of-memory-errors]]
 Whenever you get an error message that starts with `RuntimeError: CUDA out of memory`, this indicates that you are out of GPU memory. This is not directly linked to your code, and it can happen with a script that runs perfectly fine. This error means that you tried to put too many things in the internal memory of your GPU, and that resulted in an error. Like with other CUDA errors, you will need to restart your kernel to be in a spot where you can run your training again.
 To solve this issue, you just need to use less GPU space -- something that is often easier said than done. First, make sure you don't have two models on the GPU at the same time (unless that's required for your problem, of course). Then, you should probably reduce your batch size, as it directly affects the sizes of all the intermediate outputs of the model and their gradients. If the problem persists, consider using a smaller version of your model.
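
To see why batch size is the first knob to turn, it can help to compute the size of a single intermediate tensor. The sketch below is rough arithmetic, not a memory profiler: `tensor_bytes` is a hypothetical helper, the sequence length of 128 and float32 storage are assumptions, and a real model holds many such tensors plus weights, gradients, and optimizer state.

```python
def tensor_bytes(batch_size, seq_len, hidden_size, bytes_per_element=4):
    """Bytes needed for one (batch, seq, hidden) tensor; 4 bytes assumes float32."""
    return batch_size * seq_len * hidden_size * bytes_per_element


# hidden_size=768 matches DistilBERT base; seq_len=128 is an assumed value.
for bs in (32, 16, 8):
    mib = tensor_bytes(bs, seq_len=128, hidden_size=768) / 2**20
    print(f"batch_size={bs:>2}: {mib:.1f} MiB per activation tensor")
```

The size is linear in the batch size, so halving the batch halves every tensor of this shape.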
 > [!TIP]
 > In the next part of the course, we'll look at more advanced techniques that can help you reduce your memory footprint and let you fine-tune the biggest models.
 ### Evaluating the model[[evaluating-the-model]]
 Now that we've solved all the issues with our code, everything is perfect and the training should run smoothly, right? Not so fast! If you run the `trainer.train()` command, everything will look good at first, but after a while you will get the following:
 ```py
 # This will take a long time and error out, so you shouldn't run this cell
 trainer.train()
 ```
 ```python out
 TypeError: only size-1 arrays can be converted to Python scalars
 ```
 You will realize this error appears during the evaluation phase, so this is the last thing we will need to debug.
 You can run the evaluation loop of the `Trainer` independently from the training like this:
 ```py
 trainer.evaluate()
 ```
 ```python out
 TypeError: only size-1 arrays can be converted to Python scalars
 ```
 > [!TIP]
 > 💡 You should always make sure you can run `trainer.evaluate()` before launching `trainer.train()`, to avoid wasting lots of compute resources before hitting an error.
 Before attempting to debug a problem in the evaluation loop, you should first make sure that you've had a look at the data, are able to form a batch properly, and can run your model on it. We've completed all of those steps, so the following code can be executed without error:
 ```py
 for batch in trainer.get_eval_dataloader():
    break
 batch = {k: v.to(device) for k, v in batch.items()}
 with torch.no_grad():
    outputs = trainer.model(**batch)
 ```
 The error comes later, at the end of the evaluation phase, and if we look at the traceback we see this:
 ```python trace
 ~/git/datasets/src/datasets/metric.py in add_batch(self, predictions, references)
    431         """
    432         batch = {"predictions": predictions, "references": references}
 --> 433         batch = self.info.features.encode_batch(batch)
    434         if self.writer is None:
    435             self._init_writer()
 ```
 This tells us that the error originates in the `datasets/metric.py` module -- so this is a problem with our `compute_metrics()` function. It takes a tuple with the logits and the labels as NumPy arrays, so let's try to feed it that:
 ```py
 predictions = outputs.logits.cpu().numpy()
 labels = batch["labels"].cpu().numpy()
 compute_metrics((predictions, labels))
 ```
 ```python out
 TypeError: only size-1 arrays can be converted to Python scalars
 ```
 We get the same error, so the problem definitely lies with that function. If we look back at its code, we see it's just forwarding the `predictions` and the `labels` to `metric.compute()`. So is there a problem with that method? Not really. Let's have a quick look at the shapes:
 ```py
 predictions.shape, labels.shape
 ```
 ```python out
 ((8, 3), (8,))
 ```
 Our predictions are still logits, not the actual predictions, which is why the metric is returning this (somewhat obscure) error. The fix is pretty easy; we just have to add an argmax in the `compute_metrics()` function:
 ```py
 import numpy as np
 def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
 compute_metrics((predictions, labels))
 ```
 ```python out
 {'accuracy': 0.625}
 ```
 Now our error is fixed! This was the last one, so our script will now train a model properly.
 For reference, here is the completely fixed script:
 ```py
 import numpy as np
 from datasets import load_dataset
 import evaluate
 from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
 )
 raw_datasets = load_dataset("glue", "mnli")
 model_checkpoint = "distilbert-base-uncased"
 tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
 def preprocess_function(examples):
    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True)
 tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
 model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=3)
 args = TrainingArguments(
    f"distilbert-finetuned-mnli",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
 )
 metric = evaluate.load("glue", "mnli")
 def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
 data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
 trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation_matched"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    tokenizer=tokenizer,
 )
 trainer.train()
 ```
 In this instance, there are no more problems, and our script will fine-tune a model that should give reasonable results. But what can we do when the training proceeds without any error, and the model trained does not perform well at all? That's the hardest part of machine learning, and we'll show you a few techniques that can help.
 > [!TIP]
 > 💡 If you're using a manual training loop, the same steps apply to debug your training pipeline, but it's easier to separate them. Make sure you have not forgotten the `model.eval()` or `model.train()` at the right places, or the `zero_grad()` at each step, however!
 ## Debugging silent errors during training[[debugging-silent-errors-during-training]]
 What can we do to debug a training that completes without error but doesn't get good results? We'll give you some pointers here, but be aware that this kind of debugging is the hardest part of machine learning, and there is no magical answer.
 ### Check your data (again!)[[check-your-data-again]]
 Your model will only learn something if it's actually possible to learn anything from your data. If there is a bug that corrupts the data or the labels are attributed randomly, it's very likely you won't get any model training on your dataset. So always start by double-checking your decoded inputs and labels, and ask yourself the following questions:
 - Is the decoded data understandable?
 - Do you agree with the labels?
 - Is there one label that's more common than the others?
 - What should the loss/metric be if the model predicted a random answer/always the same answer?
 > [!WARNING]
 > ⚠️ If you are doing distributed training, print samples of your dataset in each process and triple-check that you get the same thing. One common bug is to have some source of randomness in the data creation that makes each process have a different version of the dataset.
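
One way to compare what each process sees, besides printing samples, is to print a short digest of the first few examples and check that it is identical everywhere. This is a sketch under assumptions: `dataset_fingerprint` is a hypothetical helper, it expects a plain list of JSON-serializable dicts, and the two "views" below only simulate two processes.

```python
import hashlib
import json


def dataset_fingerprint(examples, limit=100):
    """Hash the first `limit` examples so processes can compare a short digest."""
    h = hashlib.sha256()
    for ex in examples[:limit]:
        # sort_keys makes the serialization independent of dict ordering
        h.update(json.dumps(ex, sort_keys=True).encode("utf-8"))
    return h.hexdigest()[:16]


rank0_view = [
    {"input_ids": [101, 5, 102], "label": 1},
    {"input_ids": [101, 6, 102], "label": 0},
]
rank1_view = list(reversed(rank0_view))  # simulates a per-process ordering bug

print(dataset_fingerprint(rank0_view) == dataset_fingerprint(rank0_view))  # True
print(dataset_fingerprint(rank0_view) == dataset_fingerprint(rank1_view))  # False
```

In a real distributed run, each process would print its own digest next to its rank, and any differing value would indicate that the datasets have diverged.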
 After looking at your data, go through a few of the model's predictions and decode them too. If the model is always predicting the same thing, it might be because your dataset is biased toward one category (for classification problems); techniques like oversampling rare classes might help.
 If the loss/metric you get on your initial model is very different from the loss/metric you would expect for random predictions, double-check the way your loss or metric is computed, as there is probably a bug there. If you are using several losses that you add at the end, make sure they are of the same scale.
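
The reference values for "random" and "always the same answer" are easy to compute by hand. The sketch below assumes a cross-entropy loss measured in nats and uses a toy label list; both function names are hypothetical.

```python
import math
from collections import Counter


def uniform_cross_entropy(num_classes):
    """Cross-entropy (in nats) of a model that assigns 1/num_classes to every class."""
    return math.log(num_classes)


def majority_class_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


print(f"{uniform_cross_entropy(3):.4f}")         # 1.0986 for a 3-class task
print(majority_class_accuracy([0, 0, 1, 2, 0]))  # 0.6
```

For a three-class task like MNLI, an initial loss far from roughly 1.10, or an accuracy that never beats the majority-class rate, is a hint to recheck the loss or metric computation.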
 When you are sure your data is perfect, you can see if the model is capable of training on it with one simple test.
 ### Overfit your model on one batch[[overfit-your-model-on-one-batch]]
 Overfitting is usually something we try to avoid when training, as it means the model is not learning to recognize the general features we want it to but is instead just memorizing the training samples. However, trying to train your model on one batch over and over again is a good test to check if the problem as you framed it can be solved by the model you are attempting to train. It will also help you see if your initial learning rate is too high.
 Doing this once you have defined your `Trainer` is really easy; just grab a batch of training data, then run a small manual training loop only using that batch for something like 20 steps:
 ```py
 for batch in trainer.get_train_dataloader():
    break
 batch = {k: v.to(device) for k, v in batch.items()}
 trainer.create_optimizer()
 for _ in range(20):
    outputs = trainer.model(**batch)
    loss = outputs.loss
    loss.backward()
    trainer.optimizer.step()
    trainer.optimizer.zero_grad()
 ```
 > [!TIP]
 > 💡 If your training data is unbalanced, make sure to build a batch of training data containing all the labels.
 The resulting model should have close-to-perfect results on the same `batch`. Let's compute the metric on the resulting predictions:
 ```py
 with torch.no_grad():
    outputs = trainer.model(**batch)
 preds = outputs.logits
 labels = batch["labels"]
 compute_metrics((preds.cpu().numpy(), labels.cpu().numpy()))
 ```
 ```python out
 {'accuracy': 1.0}
 ```
 100% accuracy, now this is a nice example of overfitting (meaning that if you try your model on any other sentence, it will very likely give you a wrong answer)!
 If you don't manage to have your model obtain perfect results like this, it means there is something wrong with the way you framed the problem or your data, so you should fix that. Only when you manage to pass the overfitting test can you be sure that your model can actually learn something.
 > [!WARNING]
 > ⚠️ You will have to recreate your model and your `Trainer` after this test, as the model obtained probably won't be able to recover and learn something useful on your full dataset.
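The same idea can be tried without any deep learning library. The sketch below is not the `Trainer` loop above: it swaps in a one-feature logistic regression trained by plain gradient descent on a single hypothetical batch, only to show what "passing the overfitting test" looks like (loss falls, accuracy on that batch reaches 100%).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def overfit_one_batch(xs, ys, steps=200, lr=0.5):
    """Repeatedly train a 1-feature logistic regression on the same batch."""
    w, b = 0.0, 0.0
    losses = []
    for _ in range(steps):
        probs = [sigmoid(w * x + b) for x in xs]
        # Mean binary cross-entropy on the batch.
        loss = -sum(
            y * math.log(p) + (1 - y) * math.log(1 - p)
            for p, y in zip(probs, ys)
        ) / len(xs)
        losses.append(loss)
        # Gradients of the mean loss with respect to w and b.
        grad_w = sum((p - y) * x for p, y, x in zip(probs, ys, xs)) / len(xs)
        grad_b = sum(p - y for p, y in zip(probs, ys)) / len(xs)
        w -= lr * grad_w
        b -= lr * grad_b
    preds = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]
    accuracy = sum(int(p == y) for p, y in zip(preds, ys)) / len(xs)
    return losses, accuracy


# Hypothetical batch that contains both labels.
batch_x = [-2.0, -1.0, 1.0, 2.0]
batch_y = [0, 0, 1, 1]
losses, accuracy = overfit_one_batch(batch_x, batch_y)
```

If a model this small could not drive the loss down on four separable points, the bug would have to be in the loop or the data, which is the point of the test.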
 ### Don't tune anything until you have a first baseline[[dont-tune-anything-until-you-have-a-first-baseline]]
 Hyperparameter tuning is always emphasized as being the hardest part of machine learning, but it's just the last step to help you gain a little bit on the metric. Most of the time, the default hyperparameters of the `Trainer` will work just fine to give you good results, so don't launch into a time-consuming and costly hyperparameter search until you have something that beats the baseline you have on your dataset.
 Once you have a good enough model, you can start tweaking a bit. Don't try launching a thousand runs with different hyperparameters, but compare a couple of runs with different values for one hyperparameter to get an idea of which has the greatest impact.
 If you are tweaking the model itself, keep it simple and don't try anything you can't reasonably justify. Always make sure you go back to the overfitting test to verify that your change hasn't had any unintended consequences.
 ### Ask for help[[ask-for-help]]
 Hopefully you will have found some advice in this section that helped you solve your issue, but if that's not the case, remember you can always ask the community on the [forums](https://discuss.huggingface.co/).
 Here are some additional resources that may prove helpful:
 - ["Reproducibility as a vehicle for engineering best practices"](https://docs.google.com/presentation/d/1yHLPvPhUs2KGI5ZWo0sU-PKU3GimAk3iTsI38Z-B5Gw/edit#slide=id.p) by Joel Grus
 - ["Checklist for debugging neural networks"](https://towardsdatascience.com/checklist-for-debugging-neural-networks-d8b2a9434f21) by Cecelia Shao
 - ["How to unit test machine learning code"](https://medium.com/@keeper6928/how-to-unit-test-machine-learning-code-57cf6fd81765) by Chase Roberts
 - ["A Recipe for Training Neural Networks"](http://karpathy.github.io/2019/04/25/recipe/) by Andrej Karpathy
 Of course, not every problem you encounter when training neural nets is your own fault! If you encounter something in the 🤗 Transformers or 🤗 Datasets library that does not seem right, you may have encountered a bug. You should definitely tell us all about it, and in the next section we'll explain exactly how to do that.
@@ -0,0 +1,314 @@
 Source: https://docs.unsloth.ai/basics/troubleshooting-and-faqs
 Title: Troubleshooting & FAQs - Unsloth Documentation, Daniel & Michael Han-Chen (2025)
 Fetched-via: curl -sL https://docs.unsloth.ai/basics/troubleshooting-and-faqs.md (GitBook raw markdown endpoint; uvx markitdown returned raw HTML junk)
 Fetch-status: verbatim, trailing "Agent Instructions" doc-query boilerplate trimmed
 # Troubleshooting & FAQs
 If you're still encountering any issues with versions or dependencies, please use our [Docker image](/docs/get-started/install/docker.md) which will have everything pre-installed.
 {% hint style="success" %}
 **Always try updating Unsloth first if you run into any issues.**
 `pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth unsloth_zoo`
 {% endhint %}
 ### Fine-tuning a new model not supported by Unsloth?
 Unsloth works with any model supported by `transformers`. If a model isn’t in our uploads or doesn’t run out of the box, it’s usually still supported; some newer models may just need a small manual tweak due to our optimizations.
 In most cases, you can enable compatibility by setting `trust_remote_code=True` in your fine-tuning script. Here’s an example using [DeepSeek-OCR](/docs/models/tutorials/deepseek-ocr-how-to-run-and-fine-tune.md):
 <pre class="language-python" data-expandable="true"><code class="lang-python">from unsloth import FastVisionModel
 from transformers import AutoModel
 from huggingface_hub import snapshot_download
 snapshot_download("unsloth/DeepSeek-OCR", local_dir = "deepseek_ocr")
 model, tokenizer = FastVisionModel.from_pretrained(
    "./deepseek_ocr",
    load_in_4bit = False, # Use 4bit to reduce memory use. False for 16bit LoRA.
    auto_model = AutoModel,
    <a data-footnote-ref href="#user-content-fn-1">trust_remote_code = True</a>, # Enable to support new models
    unsloth_force_compile = True,
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
 )
 </code></pre>
 ### Running in Unsloth works well, but after exporting & running on other platforms, the results are poor
 You might sometimes encounter an issue where your model runs and produces good results on Unsloth, but when you use it on another platform like Ollama or vLLM, the results are poor, or you get gibberish, endless generations, *or* repeated outputs.
 * The most common cause of this error is using an <mark style="background-color:blue;">**incorrect chat template**</mark>**.** It’s essential to use the SAME chat template that was used when training the model in Unsloth when you later run it in another framework, such as llama.cpp or Ollama. When running inference from a saved model, it's crucial to apply the correct template.
 * It might also be because your inference engine adds an unnecessary "start of sequence" token (or, conversely, omits a required one), so make sure you check both hypotheses!
 * <mark style="background-color:green;">**Use our conversational notebooks to force the chat template - this will fix most issues.**</mark>
  * Qwen-3 14B Conversational notebook [**Open in Colab**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_\(14B\)-Reasoning-Conversational.ipynb)
  * Gemma-3 4B Conversational notebook [**Open in Colab**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma3_\(4B\).ipynb)
  * Llama-3.2 3B Conversational notebook [**Open in Colab**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_\(1B_and_3B\)-Conversational.ipynb)
  * Phi-4 14B Conversational notebook [**Open in Colab**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi_4-Conversational.ipynb)
  * Mistral v0.3 7B Conversational notebook [**Open in Colab**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_v0.3_\(7B\)-Conversational.ipynb)
  * **More notebooks in our** [**notebooks docs**](/docs/get-started/unsloth-notebooks.md)
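One way to check the "start of sequence" hypothesis is to count how many BOS tokens begin the token IDs that actually reach the model. This is a minimal sketch with made-up token IDs; `count_leading_bos` is a hypothetical helper, and in practice the IDs and the BOS ID would come from your tokenizer and from whatever prompt logging your inference engine offers.

```python
def count_leading_bos(token_ids, bos_token_id):
    """Count how many copies of the BOS token start the sequence."""
    count = 0
    for token_id in token_ids:
        if token_id != bos_token_id:
            break
        count += 1
    return count


# Hypothetical token IDs; 1 stands in for the BOS token ID.
BOS = 1
template_already_adds_bos = [BOS, 306, 4966, 29889]
engine_added_second_bos = [BOS] + template_already_adds_bos

single = count_leading_bos(template_already_adds_bos, BOS)
double = count_leading_bos(engine_added_second_bos, BOS)
missing = count_leading_bos([306, 4966, 29889], BOS)
```

A count that differs between the training-time and inference-time prompts (for example 1 versus 2, or 1 versus 0) points at the template or the engine's BOS handling.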
 ### Saving to GGUF / vLLM 16bit crashes
 You can try reducing the maximum GPU usage during saving by changing `maximum_memory_usage`.
 The default is `model.save_pretrained(..., maximum_memory_usage = 0.75)`. Reduce it to, say, 0.5 to use at most 50% of peak GPU memory, or lower. This can reduce OOM crashes during saving.
 ### How do I manually save to GGUF?
 First save your model to 16bit via:
 ```python
 model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit",)
 ```
 Compile llama.cpp from source like below:
 ```bash
 apt-get update
 apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
 git clone https://github.com/ggml-org/llama.cpp
 cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
 cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
 cp llama.cpp/build/bin/llama-* llama.cpp
 ```
 Then, save the model to F16:
 ```bash
 python llama.cpp/convert_hf_to_gguf.py merged_model \
    --outfile model-F16.gguf --outtype f16 \
    --split-max-size 50G
 ```
 ```bash
 # For BF16:
 python llama.cpp/convert_hf_to_gguf.py merged_model \
    --outfile model-BF16.gguf --outtype bf16 \
    --split-max-size 50G
 # For Q8_0:
 python llama.cpp/convert_hf_to_gguf.py merged_model \
    --outfile model-Q8_0.gguf --outtype q8_0 \
    --split-max-size 50G
 ```
 ### Why is Q8\_K\_XL slower than Q8\_0 GGUF?
 On Mac devices, BF16 appears to be slower than F16. Q8\_K\_XL upcasts some layers to BF16, hence the slowdown. We are actively changing our conversion process to make F16 the default choice for Q8\_K\_XL to reduce the performance hit.
 ### How to do Evaluation
 To set up evaluation in your training run, you first have to split your dataset into a training and test split. You should <mark style="background-color:green;">**always shuffle the selection of the dataset**</mark>, otherwise your evaluation is wrong!
 ```python
 new_dataset = dataset.train_test_split(
    test_size = 0.01, # 1% of rows for the test split; can also be an integer number of rows
    shuffle = True, # Should always set to True!
    seed = 3407,
 )
 train_dataset = new_dataset["train"] # Dataset for training
 eval_dataset = new_dataset["test"] # Dataset for evaluation
 ```
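To see why the shuffle matters, here is a small standard-library sketch. It does not reproduce how `train_test_split` works internally; `tail_split` is a hypothetical helper that simply holds out the last rows of a dataset that happens to be stored in label order.

```python
import random


def tail_split(rows, test_fraction, shuffle, seed=3407):
    """Hold out the last test_fraction of rows, optionally shuffling first."""
    rows = list(rows)
    if shuffle:
        random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[:-n_test], rows[-n_test:]


# Hypothetical dataset stored in label order: 900 "a" rows, then 100 "b" rows.
dataset_rows = ["a"] * 900 + ["b"] * 100

_, unshuffled_test = tail_split(dataset_rows, 0.05, shuffle=False)
train_rows, shuffled_test = tail_split(dataset_rows, 0.05, shuffle=True)
unshuffled_labels = set(unshuffled_test)  # only one label survives
```

Without shuffling, the held-out rows contain a single label, so any metric computed on them says little about the dataset as a whole.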
 Then, we can set the training arguments to enable evaluation. Keep in mind that evaluation can be very slow, especially if you set `eval_steps = 1`, which means you are evaluating after every single training step. If you do, try reducing the size of `eval_dataset` to around 100 rows.
 ```python
 from trl import SFTTrainer, SFTConfig
 trainer = SFTTrainer(
    args = SFTConfig(
        fp16_full_eval = True,         # Set this to reduce memory usage
        per_device_eval_batch_size = 2,# Increasing this will use more memory
        eval_accumulation_steps = 4,   # You can increase this instead of batch_size
        eval_strategy = "steps",       # Runs eval every few steps or epochs.
        eval_steps = 1,                # Run evaluation every N training steps
    ),
    train_dataset = new_dataset["train"],
    eval_dataset = new_dataset["test"],
    ...
 )
 trainer.train()
 ```
 ### Evaluation Loop - Out of Memory or crashing.
 A common cause of OOM during evaluation is an evaluation batch size that is too high. Set `per_device_eval_batch_size` to 2 or lower to use less VRAM. Also use `fp16_full_eval=True` to run evaluation in float16, which roughly halves memory use.
 First split your training dataset into a train and test split. Set the trainer settings for evaluation to:
 ```python
 new_dataset = dataset.train_test_split(test_size = 0.01)
 from trl import SFTTrainer, SFTConfig
 trainer = SFTTrainer(
    args = SFTConfig(
        fp16_full_eval = True,
        per_device_eval_batch_size = 2,
        eval_accumulation_steps = 4,
        eval_strategy = "steps",
        eval_steps = 1,
    ),
    train_dataset = new_dataset["train"],
    eval_dataset = new_dataset["test"],
    ...
 )
 ```
 This should avoid OOMs and make evaluation somewhat faster. You can also use `bf16_full_eval=True` on machines that support bf16. As of June 2025, Unsloth should set these flags by default.
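The halving comes from storing each value in 2 bytes instead of 4. As a rough illustration, the sketch below estimates the size of one batch of logits; the shapes are hypothetical, `logits_bytes` is an illustrative helper, and real memory use also depends on weights, activations, and allocator behavior that this ignores.

```python
def logits_bytes(batch_size, seq_len, vocab_size, bytes_per_value):
    """Approximate memory needed to hold one batch of logits."""
    return batch_size * seq_len * vocab_size * bytes_per_value


# Hypothetical shapes: batch of 2, 2048 tokens, 128,000-entry vocabulary.
fp32 = logits_bytes(2, 2048, 128_000, bytes_per_value=4)
fp16 = logits_bytes(2, 2048, 128_000, bytes_per_value=2)
fp32_gib = fp32 / 2**30
fp16_gib = fp16 / 2**30
```

With these assumed shapes the logits alone take about 1.95 GiB in float32 and about 0.98 GiB in float16, which is why the evaluation batch size has such a direct effect on OOMs.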
 ### How do I do Early Stopping?
 If you want to stop the finetuning / training run since the evaluation loss is not decreasing, then you can use early stopping which stops the training process. Use `EarlyStoppingCallback`.
 As usual, set up your trainer and your evaluation dataset. The configuration below stops the training run if the `eval_loss` (the evaluation loss) has not decreased for 3 consecutive evaluations or so.
 ```python
 from trl import SFTConfig, SFTTrainer
 trainer = SFTTrainer(
    args = SFTConfig(
        fp16_full_eval = True,
        per_device_eval_batch_size = 2,
        eval_accumulation_steps = 4,
        output_dir = "training_checkpoints", # location of saved checkpoints for early stopping
        save_strategy = "steps",             # save model every N steps
        save_steps = 10,                     # how many steps until we save the model
        save_total_limit = 3,                # keep only 3 saved checkpoints to save disk space
        eval_strategy = "steps",             # evaluate every N steps
        eval_steps = 10,                     # how many steps until we do evaluation
        load_best_model_at_end = True,       # MUST USE for early stopping
        metric_for_best_model = "eval_loss", # metric we want to early stop on
        greater_is_better = False,           # the lower the eval loss, the better
    ),
    model = model,
    tokenizer = tokenizer,
    train_dataset = new_dataset["train"],
    eval_dataset = new_dataset["test"],
 )
 ```
 We then add the callback which can also be customized:
 ```python
 from transformers import EarlyStoppingCallback
 early_stopping_callback = EarlyStoppingCallback(
    early_stopping_patience = 3,     # How many evaluations we will wait if the eval loss doesn't decrease
                                     # For example the loss might increase, but decrease after 3 evaluations
    early_stopping_threshold = 0.0,  # Can set higher - sets how much the loss must decrease by to count
                                     # as an improvement. E.g. with 0.01, a drop from 0.02 to 0.01 is
                                     # not counted, so the run moves closer to an early stop.
 )
 trainer.add_callback(early_stopping_callback)
 ```
 Then train the model as usual via `trainer.train()`.
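The interaction between patience and threshold can be easier to reason about on a recorded loss curve. The following is a simplified model of the behavior described above, not the `transformers` implementation; `early_stop_index` and the loss values are hypothetical.

```python
def early_stop_index(eval_losses, patience=3, threshold=0.0):
    """Return the index of the evaluation that triggers a stop, or None.

    Simplified model: an evaluation counts as an improvement only when the
    loss beats the best value so far by more than `threshold`.
    """
    best = None
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if best is None or best - loss > threshold:
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None


# Hypothetical eval_loss values, one per evaluation.
history = [1.00, 0.80, 0.75, 0.76, 0.78, 0.77, 0.74]
stop_at = early_stop_index(history, patience=3)
keeps_going = early_stop_index(history, patience=4)
```

With a patience of 3 the run stops at the sixth evaluation (index 5), just before the loss recovers to 0.74; a patience of 4 lets it continue, which shows how sensitive the outcome can be to this setting.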
 ### Downloading gets stuck at 90 to 95%
 If your model download gets stuck at 90 to 95% for a long time, you can disable some fast downloading processes to force downloads to be synchronous and to print out more error messages.
 Simply set `UNSLOTH_STABLE_DOWNLOADS=1` before any Unsloth import.
 ```python
 import os
 os.environ["UNSLOTH_STABLE_DOWNLOADS"] = "1"
 from unsloth import FastLanguageModel
 ```
 ### RuntimeError: CUDA error: device-side assert triggered
 Restart and run all cells, but place this at the start, before any Unsloth import. Also, please file a bug report as soon as possible. Thank you!
 ```python
 import os
 os.environ["UNSLOTH_COMPILE_DISABLE"] = "1"
 os.environ["UNSLOTH_DISABLE_FAST_GENERATION"] = "1"
 ```
 ### All labels in your dataset are -100. Training losses will be all 0.
 This means that your usage of `train_on_responses_only` is incorrect for that particular model. `train_on_responses_only` allows you to mask the user question, and train your model to output the assistant response with higher weighting. This is known to increase accuracy by 1% or more. See our [**LoRA Hyperparameters Guide**](/docs/get-started/fine-tuning-llms-guide/lora-hyperparameters-guide.md) for more details.
 For Llama 3.1, 3.2, 3.3 type models, please use the below:
 ```python
 from unsloth.chat_templates import train_on_responses_only
 trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
 )
 ```
 For Gemma 2, 3, and 3n models, use the below:
 ```python
 from unsloth.chat_templates import train_on_responses_only
 trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
 )
 ```
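Before launching a run, it can help to check how many label positions survive the masking. The sketch below works on plain lists of label IDs; `trainable_fraction` and the example rows are hypothetical, and `-100` is used as the ignored label value named in the heading above.

```python
IGNORE_INDEX = -100  # label value ignored by the loss


def trainable_fraction(labels):
    """Share of label positions that are not masked with IGNORE_INDEX."""
    if not labels:
        return 0.0
    kept = sum(1 for label in labels if label != IGNORE_INDEX)
    return kept / len(labels)


# Hypothetical label rows after response-only masking.
good_row = [-100, -100, -100, 512, 77, 9, 2]
bad_row = [-100] * 7  # markers did not match the template: nothing to learn

good = trainable_fraction(good_row)
bad = trainable_fraction(bad_row)
```

A fraction of 0.0 on every row suggests the `instruction_part` and `response_part` strings do not match the chat template that was actually applied.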
 ### Unsloth is slower than expected?
 If your speed seems slower at first, it’s likely because `torch.compile` typically takes \~5 minutes (or longer) to warm up and finish compiling. Make sure you measure throughput **after** compilation has finished; over longer runs, Unsloth should be much faster.
 To disable compilation, use:
 ```python
 import os
 os.environ["UNSLOTH_COMPILE_DISABLE"] = "1"
 ```
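To compare speeds fairly, the warm-up steps can be excluded from the timing. This is a minimal sketch with a stand-in step function; `steady_state_steps_per_second` and `fake_step` are hypothetical, and the sleep durations only imitate a slow first (compiling) step followed by fast ones.

```python
import time


def steady_state_steps_per_second(step_fn, total_steps, warmup_steps):
    """Time `step_fn`, ignoring the first `warmup_steps` calls."""
    for _ in range(warmup_steps):
        step_fn()  # not timed: may include one-off compilation cost
    timed_steps = total_steps - warmup_steps
    start = time.perf_counter()
    for _ in range(timed_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return timed_steps / elapsed


# Hypothetical stand-in for a training step: slow once, then fast.
calls = {"n": 0}


def fake_step():
    calls["n"] += 1
    time.sleep(0.05 if calls["n"] == 1 else 0.001)


rate = steady_state_steps_per_second(fake_step, total_steps=11, warmup_steps=1)
```

In a real run the number of warm-up steps to discard is a judgment call; watching when the per-step time stops changing is one way to choose it.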
 ### Some weights of Gemma3nForConditionalGeneration were not initialized from the model checkpoint
 This is a critical error: some weights were not loaded correctly, which will cause incorrect outputs. This can normally be fixed by upgrading Unsloth:
 `pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth unsloth_zoo`
 Then upgrade transformers and timm:
 `pip install --upgrade --force-reinstall --no-cache-dir --no-deps transformers timm`
 However if the issue still persists, please file a bug report asap!
 ### NotImplementedError: A UTF-8 locale is required. Got ANSI
 See <https://github.com/googlecolab/colabtools/issues/3409>
 In a new cell, run the below:
 ```python
 import locale
 # Accept any arguments so callers such as getpreferredencoding(False) keep working
 locale.getpreferredencoding = lambda *args, **kwargs: "UTF-8"
 ```
 ### Citing Unsloth
 If you are citing the usage of our model uploads, use the BibTeX below. This is for Qwen3-30B-A3B-GGUF Q8\_K\_XL:
 ```
@misc{unsloth_2025_qwen3_30b_a3b,
  author       = {Unsloth AI and Han-Chen, Daniel and Han-Chen, Michael},
  title        = {Qwen3-30B-A3B-GGUF:Q8\_K\_XL},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF}}
 }
 ```
 To cite the usage of our GitHub package or our work in general:
 ```
@misc{unsloth,
  author       = {Unsloth AI and Han-Chen, Daniel and Han-Chen, Michael},
  title        = {Unsloth},
  year         = {2025},
  publisher    = {Github},
  howpublished = {\url{https://github.com/unslothai/unsloth}}
 }
 ```
 [^1]: Enable this line of code and see if it works.