docs: resolve ml-debug TODO references

2026-06-27 01:00:14 +08:00 · 2026-06-12 06:52:38 +08:00
parent 966f948d36
commit 8b9a1d62ed
5 changed files with 185 additions and 9 deletions
@@ -36,7 +36,7 @@ For RL, add reward scale/sign as a top-3 issue, and episode-boundary handling (d

 | Signal | Likely meaning | Check |
 |--------|----------------|-------|
-| Init loss << expected (e.g. 0.01 vs 2.3) | Leakage or a shortcut: the model "knows" the answer at init | Are labels in the input? Is test data in train? A trivial feature? Localize with the NaN-poisoning tracer or backprop-to-input check ([refs/diagnostics.md](refs/diagnostics.md)) |
+| Init loss << expected (e.g. 0.01 vs 2.3) | Leakage or a shortcut: the model "knows" the answer at init | Are labels in the input? Is test data in train? A trivial feature? Localize with Wassname's NaN-poisoning tracer or backprop-to-input check ([refs/diagnostics.md](refs/diagnostics.md)) |
 | Random input gives the same loss as real input | Pipeline is destroying information (over-aggressive preprocessing, wrong transforms, all-zero input) | Print raw data at each stage; visualize |
 | Predicts the same class for everything | Class imbalance (100:1 -> "always predict majority") | Label-count check; weighted loss or resample |
 | Val much worse than train from the start | Distribution shift between splits | Same preprocessing? Same time period? Same source? |
@@ -187,7 +187,7 @@ These are the overconfident reflexes the "calibrate" section warns about, made c
 - `try/except` around training code. Training should crash loudly. A caught exception hides the bug and produces silently wrong results. The one exception is checkpoint-on-KeyboardInterrupt.
 - "Try a different optimizer." If Adam doesn't converge, it's almost never the optimizer; it's the loss, the data, the architecture, or a bug.
 - `.detach()` / `.item()` to "fix" gradient errors. If autograd complains, the graph is wrong. Detaching silences it by cutting gradient flow, so the model just stops learning from that path.
- `lr_scheduler` as a *cure for non-convergence*. Schedules matter (transformers need warmup; OneCycle/cosine can work well; AdamW is a common pairing), but they refine or enable convergence in an otherwise-healthy setup; they don't rescue a model that can't learn at constant LR because of a bug. Add the schedule once the basics work, not as a debugging band-aid. An LR range test is a separate short run that increases LR until loss stops improving or diverges; use it to choose a candidate `max_lr` before a OneCycle run.
+- `lr_scheduler` as a *cure for non-convergence*. Schedules matter (transformers need warmup; WSD, OneCycle, or cosine can work well in different regimes; AdamW is a common pairing), but they refine or enable convergence in an otherwise-healthy setup; they don't rescue a model that can't learn at constant LR because of a bug. Use an LR range test or published recipe to choose a candidate maximum LR before a schedule run.
 - More layers / a bigger model. If it can't overfit one batch, more parameters won't help. The problem is gradient flow, loss, or data.
 - "Normalize your data" without checking whether it already is. Run the data sanity check first.
 - `float()` / `.to(dtype)` to suppress type warnings. Type mismatches are signals; a float32/float64 mismatch might mean you're mixing model weights with double-precision data. Fix the root cause.
@@ -9,10 +9,9 @@ Foreword: In an attempt to upskill the ML debugging on AI coding assistants (and

 ## How to read this

-If you're an LLM agent, calibrate yourself first. ML research code is often outside your training distribution: novel losses, custom architectures, methods with no canonical right answer you've seen a thousand times. The trained reflex there is to be confident and fast, to pattern-match a symptom to a fix ("loss stuck -> drop the LR") and apply it, and on possibly-buggy research code that reflex wastes a run and corrupts the evidence you need to find the real cause. The quotes below are the counter-evidence, in the words of people who paid for these lessons in months of wasted runs. If you notice yourself converging on the first plausible hypothesis, read [Rahtz](#think-more-experiment-less); if you're reaching for hyperparameters, read [Jones](#assume-you-have-a-bug); if the code looks like it's working, read [Achiam](#broken-code-fails-silently-measure-everything-spinning-up); if you're about to declare the fix done, read [Nanda](#default-to-disbelieving-your-own-results-neel-nanda).
-<!-- TODO say what they suggest not just to read thenm -->
+If you're an LLM agent, calibrate yourself first. ML research code is often outside your training distribution: novel losses, custom architectures, methods with no canonical right answer you've seen a thousand times. The trained reflex there is to be confident and fast, to pattern-match a symptom to a fix ("loss stuck -> drop the LR") and apply it, and on possibly-buggy research code that reflex wastes a run and corrupts the evidence you need to find the real cause. The quotes below are the counter-evidence, in the words of people who paid for these lessons in months of wasted runs. Before acting: form competing hypotheses and identify evidence that distinguishes them ([Rahtz](#think-more-experiment-less)); assume a correctness bug before tuning ([Jones](#assume-you-have-a-bug)); instrument silent failure paths and test more than one setup ([Achiam](#broken-code-fails-silently-measure-everything-spinning-up)); inspect the data and seek falsifiers before believing the result ([Nanda](#default-to-disbelieving-your-own-results-neel-nanda)).

-These are common failure modes worth ruling out, not a complete diagnosis of your situation; you know your system and I don't. Checklists, diagnostics, and symptom catalogs are one hop away under [Reference](#reference-one-hop-away).
+These are common failure modes worth ruling out, not a complete diagnosis of your situation; you know your system and I don't. Checklists, diagnostics, and symptom catalogs are one hop away under [Reference](#reference-one-hop-away). The short version of Rahtz plus the tuning playbook is: compare several possible worlds, predict what evidence differs between them, then run the narrowest experiment that can actually distinguish them.

 ## Folklore

@@ -276,20 +275,27 @@ Their training-stability page adds the masking check ("inspect tokenized samples
 Open the relevant one when the task calls for it. These are synthesized checklists and menus, useful for widening a hypothesis search but not authoritative for your particular system:

 - [PLAYBOOK.md](PLAYBOOK.md) — the long-form version: mental models and practitioner priors, the general step catalog (component isolation, baseline ladder, what to log, numerical hygiene), symptom tables, the agent debugging loop, triage, and anti-patterns.
+- [refs/checklist.md](refs/checklist.md) — Lones's full 36-item do/don't checklist across data, building, evaluation, comparison, and reporting.
 - [refs/diagnostics.md](refs/diagnostics.md) — copy-paste diagnostic snippets: init-loss check, overfit-one-batch, gradient-flow check, NaN hooks, NaN-poisoning leakage tracer, backprop-to-input dependency check, class-imbalance check.
 - [refs/static_analysis.md](refs/static_analysis.md) — grep patterns for silent bugs (shape mismatches, autograd breakers, double softmax, step ordering, leakage).
 - [refs/loss_surface.md](refs/loss_surface.md) — visualize a loss surface and its gradient field with synthetic tensors, no model or GPU, for when a custom loss misbehaves.
 - [refs/metric_stuck.md](refs/metric_stuck.md) — "why won't this metric move?" plus the structural-ceiling check.
 - [refs/sweeps.md](refs/sweeps.md) — same-seed paired comparison and cross-seed t-stat reliability, for before you claim method A beats method B.
 - [refs/llm_judges.md](refs/llm_judges.md) — LLM-as-a-judge biases (position, verbosity, self-preference) and the mitigation checklist, for when an LLM-judged eval looks too good.
- TODO add one for transformers and move things there, include things like warmup, best learner, importance of logging full traces (incl. special tokens, sys prompt, completion). Emperical evidence of what models are too small for what (e.g. eval awareness has only been document in models 25B+ afaik and mostly 100B+, conceptual steering hardly works in models <2B which are mostly just confused)
+- [refs/transformers.md](refs/transformers.md) — transformer-specific folklore: full traces, warmup/LR, optimizer evidence, train-deploy parity, scale priors, steering, and disclosed-training reports.
 - [rl/SKILL.md](rl/SKILL.md) — RL-specific: probe environments, reward engineering, HP defaults, reference implementations.
 - [pinn/SKILL.md](pinn/SKILL.md) — physics-informed networks: nondimensionalization, gradient pathologies, curriculum.

 ## Links and further reading

+Start here rather than treating the bibliography as flat:
+
+- **Beginner / broad checklist:** Lones, ["How to avoid machine learning pitfalls"](https://arxiv.org/abs/2108.02497), with its full do/don't list extracted in [refs/checklist.md](refs/checklist.md).
+- **Debugging a neural net:** Karpathy, ["A Recipe for Training Neural Networks"](https://karpathy.github.io/2019/04/25/recipe/).
+- **Designing tuning experiments:** Google, [Deep Learning Tuning Playbook](https://developers.google.com/machine-learning/guides/deep-learning-tuning-playbook).
+- **Transformer and LLM runs:** [refs/transformers.md](refs/transformers.md), then the HF, Axolotl, Unsloth, nanochat, and Bekman sources below.
+
 Folklore sources (the quotes above trace to these):
-<!-- might be worth highlighing once that can serve as reference material for particular things, or have a lot more depth on a topic. e.g. "How to avoid machine learning pitfalls", is reccomedned for begginner and checklists but it gets lost in the bibliography. also we should extract that into an actualy cehcklist appendix -->

 [^jones]: Andy Jones, "Debugging RL, Without the Agonizing Pain" — https://andyljones.com/posts/rl-debugging.html ([cache](docs/evidence/andyljones_rl_debugging.md): anomalies L103-109, write-from-scratch L155, assume-bug L176-180, raise-threshold L182, loss-curve L186-188)
 [^rahtz]: Matthew Rahtz (Amid Fish), "Lessons Learned Reproducing a Deep RL Paper" — http://amid.fish/reproducing-deep-rl ([cache](docs/evidence/amid_fish_reproducing_deep_rl.md): frame-diff confusion L85-87, investigate-confusion L100-102, think-more L145-153, don't-implement-RL-yourself L497-501)
@@ -320,7 +326,7 @@ Folklore sources (the quotes above trace to these):
 [^axolotl]: Axolotl, "Debugging" (general tips: Hamel Husain) — https://docs.axolotl.ai/docs/debugging.html ([cache](docs/evidence/axolotl_debugging.md): simplify L31, one-process L37, small-model + fast-iteration L48-49, caches L54-58)
 [^axolotl-stability]: Axolotl, "Training Stability" — https://docs.axolotl.ai/docs/training_stability.html ([cache](docs/evidence/axolotl_training_stability.md): metrics-from-the-start L27, inspect-tokenized-masking L67, reward-fn-standalone L99)
 [^ng-mly]: Andrew Ng, *Machine Learning Yearning* (2018 draft), ch. 13-19 on error analysis — https://github.com/ajaymache/machine-learning-yearning ([cache](docs/evidence/ng_ml_yearning_error_analysis.md): build-first-system L10, 100-examples procedure L14-20, Eyeball/Blackbox dev sets L32)
-[^tuning-playbook]: Godbole, Dahl, Gilmer, Shallue, Nado, "Deep Learning Tuning Playbook" (Google Research, 2023) — https://github.com/google-research/tuning_playbook ([cache](docs/evidence/google_tuning_playbook.md): exploration-over-exploitation L24, scientific/nuisance/fixed L34-38, incremental-tuning L14-18)
+[^tuning-playbook]: Godbole, Dahl, Gilmer, Shallue, Nado, "Deep Learning Tuning Playbook" (Google Research / Google Developers, 2023; Google Developers page last updated 2025-08-25) — https://developers.google.com/machine-learning/guides/deep-learning-tuning-playbook ([cache](docs/evidence/google_tuning_playbook.md): exploration-over-exploitation L24, scientific/nuisance/fixed L34-38, incremental-tuning L14-18)
 [^domingos]: Pedro Domingos, "A Few Useful Things to Know About Machine Learning" (CACM, Oct 2012) — https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf ([cache](docs/evidence/domingos_2012_few_useful_things.md): test-on-train illusion L20, insidious-contamination L22, overfitting-bugbear L26, features-are-key L32)
 [^bekman-book]: Stas Bekman, *Machine Learning Engineering Open Book*, "Understanding Training Loss Patterns" + "Instabilities" — https://github.com/stas00/ml-engineering ([cache](docs/evidence/bekman_ml_engineering_instabilities.md): heartbeat L10, 104B post-mortem L18, spike types + bad-data-pocket L22-24, init-std L28-32, PaLM batch-skipping L36, logbooks L40)
 [^lones]: Michael A. Lones, "How to avoid machine learning pitfalls" (2021, updated annually) — https://arxiv.org/abs/2108.02497 ([cache](docs/evidence/lones_2021_ml_pitfalls.md): full do/don't TOC L18-22, leakage L26, look-ahead bias L30). Aimed at beginners but the most exhaustive checklist here: 36 do/don'ts across data prep, training, evaluation, comparison, and reporting.
@@ -0,0 +1,64 @@
+# How to avoid machine learning pitfalls: checklist
+
+Appendix to the [ML Debugging skill](../SKILL.md).
+
+This is the full do/don't list from Michael A. Lones, ["How to avoid machine learning pitfalls: a guide for academic researchers"](https://arxiv.org/abs/2108.02497) (v5, updated annually). Read the paper for the reasoning and examples behind each item; the local evidence excerpt is [here](../docs/evidence/lones_2021_ml_pitfalls.md).
+
+> Mistakes in machine learning practice are commonplace, and can result in a loss of confidence in the findings and products of machine learning.
+
+## Before you start to build models
+
+> 2.1 Do think about how and where you will use data  
+> 2.2 Do take the time to understand your data  
+> 2.3 Don't look at all your data  
+> 2.4 Do clean your data  
+> 2.5 Do make sure you have enough data  
+> 2.6 Do talk to domain experts  
+> 2.7 Do survey the literature  
+> 2.8 Do think about how your model will be deployed
+
+## How to reliably build models
+
+> 3.1 Don't allow test data to leak into the training process  
+> 3.2 Do try out a range of different models  
+> 3.3 Don't use inappropriate models  
+> 3.4 Do keep up with progress in deep learning (and its pitfalls)  
+> 3.5 Don't assume deep learning will be the best approach  
+> 3.6 Do be careful where and how you do feature selection  
+> 3.7 Do optimise your model's hyperparameters  
+> 3.8 Do avoid learning spurious correlations
+
+## How to robustly evaluate models
+
+> 4.1 Do use an appropriate test set  
+> 4.2 Don't do data augmentation before splitting your data  
+> 4.3 Do avoid sequential overfitting  
+> 4.4 Do evaluate a model multiple times  
+> 4.5 Do save some data to evaluate your final model instance  
+> 4.6 Do choose metrics carefully  
+> 4.7 Do consider model fairness  
+> 4.8 Don't ignore temporal dependencies in time series data
+
+## How to compare models fairly
+
+> 5.1 Don't assume a bigger number means a better model  
+> 5.2 Do use meaningful baselines  
+> 5.3 Do use statistical tests when comparing models  
+> 5.4 Do correct for multiple comparisons  
+> 5.5 Don't always believe results from community benchmarks  
+> 5.6 Do combine models (carefully)
+
+## How to report your results
+
+> 6.1 Do be transparent  
+> 6.2 Do report performance in multiple ways  
+> 6.3 Don't generalise beyond the data  
+> 6.4 Do be careful when reporting statistical significance  
+> 6.5 Do look at your models  
+> 6.6 Do use a machine learning checklist
+
+Two especially common leak routes:
+
+> The best thing you can do to prevent these issues is to partition off a subset of your data right at the start of your project, and only use this independent test set once to measure the generality of a single model at the end.
+
+> Most notably, time series data are subject to a particular kind of data leakage known as look ahead bias.
@@ -112,7 +112,7 @@ with torch.no_grad():
 # If very different: model sees real signal. Problem is elsewhere.
 ```

-**NaN poisoning (leakage tracer)** [wassname; forward-pass dual of Karpathy's gradient check below]
+**NaN poisoning (leakage tracer)** [Wassname; forward-pass dual of Karpathy's gradient check below]
 ```python
 # Leakage can hide anywhere: normalization fit on the full dataset, target
 # leaking into features, window functions peeking ahead, bad splits. Instead
@@ -0,0 +1,106 @@
+# Transformer and LLM debugging folklore
+
+Appendix to the [ML Debugging skill](../SKILL.md). This collects transformer-specific quotes, primary sources, and technical reports; start with the general debugging folklore first.
+
+## Walk and log the full trace
+
+> The best way to debug an error that arises in `trainer.train()` is to manually go through this whole pipeline to see where things went awry.[^hfcourse]
+
+> Debugging a failed run without metrics is guesswork.[^axolotl-stability]
+
+For fine-tuning, inspect decoded tokenized examples and label masks, not just the raw dataset:
+
+> All labels in your dataset are -100. Training losses will be all 0.[^unsloth]
+
+Practical consequence: log the exact rendered prompt, special tokens, system prompt, completion, token IDs, label masks, truncation, generation settings, and model/tokenizer revisions. This follows from the HF pipeline-walkthrough advice, Axolotl's metrics-first guidance, and Unsloth's chat-template/BOS failure cases.
+
+## Match training and deployment
+
+> It's essential to use the SAME chat template that was used when training the model in Unsloth and later when you run it in another framework, such as llama.cpp or Ollama.[^unsloth]
+
+Unsloth also says to test both hypotheses: an unnecessary start-of-sequence token, or a missing one.[^unsloth]
+
+## Warmup and learning rate
+
+> Large-batch training without warmup can diverge in the first epoch and look like a code bug.[^goyal]
+
+Axolotl's SFT stability guide says the learning rate should follow the expected "warmup then decay" schedule, and lists insufficient warmup as a cause of early loss plateaus.[^axolotl-stability] Treat warmup as a strong transformer recipe prior: verify that the LR actually ramps up before the stable/high-LR phase, and that scheduler steps are counted in optimizer steps, not raw microbatches.
+
+> Fine-tuned d12 hyperparameters actively hurt d20 performance.[^nanochat]
+
+Smith and Topin's Super-Convergence paper gives the key empirical support: neural nets trained with "one learning rate cycle and a large maximum learning rate" can train an order of magnitude faster on the workloads they tested.[^super-convergence] Treat this as strong evidence for trying OneCycle, not a universal proof that it is best for every transformer run.
+
+For modern LLM pretraining, also consider WSD (warmup-stable-decay). Wen et al. contrast it with cosine: cosine requires choosing the total step budget up front, while WSD keeps a stable high-LR branch that can be decayed from different checkpoints when the compute budget is known.[^wsd] Warmup can enable an otherwise healthy transformer run; it does not rescue broken labels, masks, data, or gradients. Log the actual LR at every optimizer step and check scheduler units against gradient accumulation.
+
+## Which optimizer?
+
+> In the early stages of setting baselines I like to use Adam with a learning rate of 3e-4. In my experience Adam is much more forgiving to hyperparameters, including a bad learning rate.[^karpathy-recipe]
+
+AdamW remains the robust default. Current nanochat uses a split optimizer: Muon for matrix parameters and AdamW for embeddings and scalar parameters.[^nanochat-optimizer]
+
+There is no context-free winner. A controlled benchmark finds that matrix-based optimizers consistently outperform scalar-based ones, but their speedup over AdamW falls from about `1.4x` at `0.1B` to `1.1x` at `1.2B` parameters.[^optimizer-benchmark]
+
+> Optimal choice of optimizer shifts depends on data-to-model ratios.[^optimizer-benchmark]
+
+Muon wins in smaller Chinchilla-ratio regimes in that benchmark, while Kron and Soap overtake it at `8x` or larger.[^optimizer-benchmark] Other work finds Muon expands the compute-time Pareto frontier over AdamW at large batch sizes and up to 4B parameters.[^muon-efficiency] Tune each optimizer fairly, compare at the target scale and training budget, and prefer a proven recipe unless optimizer research is the experiment.
+
+## Better numbers can mean worse learning
+
+> The 'lower validation loss' from BOS-alignment is misleading—it's just fewer noisy tokens, not better learning.[^nanochat]
+
+> Improvements must show gains across multiple axes: per-step efficiency (loss vs. step), wall-clock efficiency (loss vs. time), and compute efficiency (loss vs. FLOPs).[^nanochat]
+
+Inspect the best run's traces. It may have won by learning a shortcut, formatting artifact, or easier token distribution rather than the intended task.
+
+## Distributed and numerical failures
+
+> If any rank's gradient contains inf, all ranks must clip to avoid divergence.[^nanochat]
+
+> As you can see it's the previous frames that we need to look into when the numbers start going into very large for fp16 numbers.[^bekman]
+
+Single-GPU tests can hide distributed failure modes. Keep the frames before a NaN/Inf, not only the crash site.
+
+## Is the model too small?
+
+Do not use a universal parameter-count threshold. Chaudhary et al. measure evaluation-awareness probes across 15 models from `0.27B` to `70B` and report predictable scaling rather than a clean threshold.[^eval-awareness] Test a same-family size ladder and separate "the representation is detectable" from "the model can reliably express the behavior."
+
+## Activation steering
+
+> Steering effects are highly variable across samples, and often go in the opposite direction.[^steering-reliability]
+
+The reliability paper finds that higher cosine similarity among training-set activation differences predicts more effective steering.[^steering-reliability] Sweep layers and coefficients, inspect per-example effects, compare against prompting and few-shot baselines, and check whether the vector changed the concept or merely style, verbosity, sentiment, or refusal rate.
+
+## Read disclosed-training reports
+
+When debugging or designing a modern transformer run, read reports that disclose the model-building process rather than only final benchmark scores:
+
+- [Olmo 3](https://arxiv.org/abs/2512.13961) releases the "entire model flow," including stages, checkpoints, data, and dependencies; code lives in [OLMo-core](https://github.com/allenai/OLMo-core).
+- Microsoft's [MAI-Thinking-1](https://microsoft.ai/pdf/mai-thinking-1.pdf) treats model development as a system-level optimization problem and gives a long-form account of scaling and RL decisions.
+- Nous Research's [Hermes 4](https://arxiv.org/abs/2508.18255) describes failures and solutions across data curation, synthesis, training, and evaluation; Nous also releases open training/evaluation tooling such as [Atropos](https://github.com/NousResearch/atropos).
+- [DeepSeek-V3](https://arxiv.org/abs/2412.19437) reports architecture, infrastructure, training, and a run with no irrecoverable loss spikes or rollbacks.
+- [Qwen3](https://arxiv.org/abs/2505.09388) documents a dense/MoE family from `0.6B` to `235B`, including pretraining and post-training details.
+- [Tulu 3](https://arxiv.org/abs/2411.15124) is a fully open post-training recipe with data, code, evaluation tooling, decontamination, SFT, DPO, and RLVR.
+- [The Llama 3 Herd](https://arxiv.org/abs/2407.21783) is a large-scale end-to-end recipe for pretraining, post-training, safety, long context, multilinguality, coding, and tool use.
+- [OPT-175B](https://arxiv.org/abs/2205.01068) is older, but unusually useful because it documents large-scale training interruptions, instability, and operational fixes.
+
+These are useful as working implementations and experiment logs: copy proven priors, compare the exact computation graph and recipe, and look for engineering details absent from method papers.
+
+For experiment design, keep the [Google Deep Learning Tuning Playbook](https://developers.google.com/machine-learning/guides/deep-learning-tuning-playbook) nearby: it is explicitly about the practical gap between superficially similar recipes and actually working deep-learning systems.[^tuning-playbook]
+
+## Sources
+
+[^hfcourse]: Hugging Face LLM Course, ["Debugging the training pipeline"](https://huggingface.co/learn/llm-course/chapter8/4) ([cache](../docs/evidence/hf_llm_course_ch8_4_debugging_pipeline.md))
+[^axolotl-stability]: Axolotl, ["Training Stability"](https://docs.axolotl.ai/docs/training_stability.html) ([cache](../docs/evidence/axolotl_training_stability.md))
+[^unsloth]: Unsloth, ["Troubleshooting & FAQs"](https://docs.unsloth.ai/basics/troubleshooting-and-faqs) ([cache](../docs/evidence/unsloth_troubleshooting_faqs.md))
+[^goyal]: Goyal et al., ["Accurate, Large Minibatch SGD"](https://arxiv.org/abs/1706.02677)
+[^super-convergence]: Smith and Topin, ["Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates"](https://arxiv.org/abs/1708.07120)
+[^wsd]: Wen et al., ["Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective"](https://arxiv.org/abs/2410.05192)
+[^nanochat]: Karpathy, [nanochat experiment log](https://github.com/karpathy/nanochat/blob/main/dev/LOG.md) ([cache](../docs/evidence/nanochat_deepwiki_llm_pretraining_2026.md))
+[^karpathy-recipe]: Karpathy, ["A Recipe for Training Neural Networks"](https://karpathy.github.io/2019/04/25/recipe/) ([cache](../docs/evidence/karpathy_recipe_training_nn_2019.md))
+[^nanochat-optimizer]: Karpathy, [`nanochat`](https://github.com/karpathy/nanochat) (`optim.py`: AdamW + Muon)
+[^optimizer-benchmark]: Wen et al., ["Fantastic Pretraining Optimizers and Where to Find Them"](https://arxiv.org/abs/2509.02046) (ICLR 2026)
+[^muon-efficiency]: Pethick et al., ["Practical Efficiency of Muon for Pretraining"](https://arxiv.org/abs/2505.02222)
+[^tuning-playbook]: Google Developers, ["Deep Learning Tuning Playbook"](https://developers.google.com/machine-learning/guides/deep-learning-tuning-playbook)
+[^bekman]: Stas Bekman, [`DebugUnderflowOverflow`](https://github.com/huggingface/transformers/blob/main/src/transformers/debug_utils.py) ([cache](../docs/evidence/bekman_debug_utils_transformers.md))
+[^eval-awareness]: Chaudhary et al., ["Evaluation Awareness Scales Predictably in Open-Weights Large Language Models"](https://arxiv.org/abs/2509.13333)
+[^steering-reliability]: Braun et al., ["Understanding (Un)Reliability of Steering Vectors in Language Models"](https://arxiv.org/abs/2505.22637)