ml_debug/refs/transformers.md
2026-06-12 06:52:38 +08:00

Transformer and LLM debugging folklore

Appendix to the ML Debugging skill. This collects transformer-specific quotes, primary sources, and technical reports; read the general debugging folklore first.

Walk and log the full trace

The best way to debug an error that arises in trainer.train() is to manually go through this whole pipeline to see where things went awry.1

Debugging a failed run without metrics is guesswork.2

For fine-tuning, inspect decoded tokenized examples and label masks, not just the raw dataset:

All labels in your dataset are -100. Training losses will be all 0. [3]

Practical consequence: log the exact rendered prompt, special tokens, system prompt, completion, token IDs, label masks, truncation, generation settings, and model/tokenizer revisions. This follows from the HF pipeline-walkthrough advice, Axolotl's metrics-first guidance, and Unsloth's chat-template/BOS failure cases.
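A minimal sketch of the label-mask part of that checklist, assuming each example is a pair of equal-length lists of token IDs and labels. The name label_mask_report and the optional decode callable are hypothetical; in practice decode would be your tokenizer's decode method.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def label_mask_report(input_ids, labels, decode=None, ignore_index=IGNORE_INDEX):
    """Summarize which tokens of one example are actually supervised."""
    if len(input_ids) != len(labels):
        raise ValueError("input_ids and labels must have the same length")
    # Keep only the tokens whose label is not masked out.
    supervised = [tok for tok, lab in zip(input_ids, labels) if lab != ignore_index]
    report = {
        "num_tokens": len(input_ids),
        "num_supervised": len(supervised),
        "all_masked": len(supervised) == 0,
        "supervised_ids": supervised,
    }
    if decode is not None:
        # Decoded text is what a human should read before trusting the run.
        report["full_text"] = decode(input_ids)
        report["supervised_text"] = decode(supervised)
    return report


if __name__ == "__main__":
    toy_vocab = {1: "<bos>", 5: "Q:", 6: "A:", 7: "yes"}
    decode = lambda ids: " ".join(toy_vocab[i] for i in ids)
    print(label_mask_report([1, 5, 6, 7], [-100, -100, 6, 7], decode=decode))
```

Run it over a handful of examples before training; an example with all_masked set to True contributes nothing to the loss.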

Match training and deployment

It's essential to use the SAME chat template that was used when training the model in Unsloth and later when you run it in another framework, such as llama.cpp or Ollama.3

Unsloth also says to test both hypotheses: an unnecessary start-of-sequence token, or a missing one.3
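Both hypotheses can be checked mechanically on the token IDs that actually reach the model. The sketch below is illustrative only: bos_status is a hypothetical helper, and whether "single" is the correct state depends on the format the model was trained with.

```python
def bos_status(token_ids, bos_id):
    """Classify the start of a sequence as 'missing', 'single', or 'duplicated'."""
    leading = 0
    for tok in token_ids:
        if tok != bos_id:
            break
        leading += 1
    if leading == 0:
        return "missing"
    return "single" if leading == 1 else "duplicated"


if __name__ == "__main__":
    BOS = 1  # assumed BOS id for the toy example
    print(bos_status([1, 42, 43], BOS))     # template added BOS once
    print(bos_status([1, 1, 42, 43], BOS))  # template and tokenizer both added BOS
    print(bos_status([42, 43], BOS))        # nobody added BOS
```

Compare the result for the training pipeline against the result for the serving stack; a mismatch between the two is the failure to look for.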

Warmup and learning rate

Large-batch training without warmup can diverge in the first epoch and look like a code bug.4

Axolotl's SFT stability guide says the learning rate should follow the expected "warmup then decay" schedule, and lists insufficient warmup as a cause of early loss plateaus.2 Treat warmup as a strong transformer recipe prior: verify that the LR actually ramps up before the stable/high-LR phase, and that scheduler steps are counted in optimizer steps, not raw microbatches.
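To make the unit check concrete, here is a small sketch of a linear-warmup, cosine-decay schedule indexed by optimizer step, plus the microbatch-to-optimizer-step conversion. The function names and the cosine decay shape are assumptions for illustration, not any framework's scheduler.

```python
import math


def optimizer_step(microbatch_index, grad_accum_steps):
    """Number of completed optimizer steps after `microbatch_index` microbatches."""
    return microbatch_index // grad_accum_steps


def warmup_cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    `step` must be an optimizer step, not a microbatch index.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    accum = 8
    for micro in (0, 80, 400, 800):
        step = optimizer_step(micro, accum)
        print(micro, step, warmup_cosine_lr(step, 3e-4, warmup_steps=100, total_steps=1000))
```

With 8 accumulation steps and a 100-step warmup, a scheduler that is mistakenly advanced once per microbatch finishes warmup after 100 microbatches, which is only 12 optimizer steps. Plotting the logged LR against optimizer steps makes this visible immediately.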

Fine-tuned d12 hyperparameters actively hurt d20 performance.5

Smith and Topin's Super-Convergence paper gives the key empirical support: neural nets trained with "one learning rate cycle and a large maximum learning rate" can train an order of magnitude faster on the workloads they tested.6 Treat this as strong evidence for trying OneCycle, not a universal proof that it is best for every transformer run.

For modern LLM pretraining, also consider WSD (warmup-stable-decay). Wen et al. contrast it with cosine: cosine requires choosing the total step budget up front, while WSD keeps a stable high-LR branch that can be decayed from different checkpoints once the compute budget is known.7

Warmup can enable an otherwise healthy transformer run; it does not rescue broken labels, masks, data, or gradients. Log the actual LR at every optimizer step and check scheduler units against gradient accumulation.
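The WSD shape described above can be sketched as a function of the optimizer step. This is a minimal illustration: the linear decay is an assumption (other decay shapes are used in practice), and wsd_lr and its parameter names are hypothetical.

```python
def wsd_lr(step, peak_lr, warmup_steps, decay_start, decay_steps, min_lr=0.0):
    """Warmup-stable-decay learning rate at a given optimizer step.

    Assumption: linear warmup and linear decay.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable branch: independent of when the decay will eventually start.
        return peak_lr
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return peak_lr + (min_lr - peak_lr) * progress


if __name__ == "__main__":
    # The same stable branch serves two different compute budgets.
    for decay_start in (1000, 2000):
        print([wsd_lr(s, 3e-4, 100, decay_start, 200) for s in (500, 1100, 2100, 2200)])
```

The point of the usage example is that the LR at step 500 is identical for both budgets, so one stable-phase checkpoint can seed decays of different lengths.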

Which optimizer?

In the early stages of setting baselines I like to use Adam with a learning rate of 3e-4. In my experience Adam is much more forgiving to hyperparameters, including a bad learning rate.8

AdamW remains the robust default. Current nanochat uses a split optimizer: Muon for matrix parameters and AdamW for embeddings and scalar parameters.9
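A split like this comes down to routing parameters by shape and role. The sketch below is not nanochat's code: split_for_optimizers, the name-based embedding keywords, and the treatment of the output head as an embedding are all assumptions made for illustration.

```python
def split_for_optimizers(named_shapes, embedding_keywords=("embed", "wte", "lm_head")):
    """Route 2-D non-embedding weights to 'muon' and everything else to 'adamw'.

    named_shapes maps parameter name -> shape tuple. The keyword list is a
    hypothetical convention; adapt it to the model's real parameter names.
    """
    groups = {"muon": [], "adamw": []}
    for name, shape in named_shapes.items():
        is_matrix = len(shape) == 2
        is_embedding = any(key in name for key in embedding_keywords)
        target = "muon" if is_matrix and not is_embedding else "adamw"
        groups[target].append(name)
    return groups


if __name__ == "__main__":
    shapes = {
        "wte.weight": (50257, 768),
        "blocks.0.attn.qkv.weight": (2304, 768),
        "blocks.0.mlp.fc.weight": (3072, 768),
        "blocks.0.ln.weight": (768,),
        "lm_head.weight": (50257, 768),
    }
    print(split_for_optimizers(shapes))
```

Printing the two groups before training is a cheap way to confirm that no embedding or scalar parameter was silently handed to the matrix optimizer.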

There is no context-free winner. A controlled benchmark finds that matrix-based optimizers consistently outperform scalar-based ones, but their speedup over AdamW falls from about 1.4x at 0.1B to 1.1x at 1.2B parameters.10

Optimal choice of optimizer shifts depending on data-to-model ratios.10

In that benchmark, Muon wins at smaller data-to-model ratios (measured in multiples of the Chinchilla-optimal ratio), while Kron and SOAP overtake it at 8x or larger.10 Other work finds Muon expands the compute-time Pareto frontier over AdamW at large batch sizes and up to 4B parameters.11 Tune each optimizer fairly, compare at the target scale and training budget, and prefer a proven recipe unless optimizer research is the experiment.

Better numbers can mean worse learning

The 'lower validation loss' from BOS-alignment is misleading—it's just fewer noisy tokens, not better learning.5

Improvements must show gains across multiple axes: per-step efficiency (loss vs. step), wall-clock efficiency (loss vs. time), and compute efficiency (loss vs. FLOPs).5
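One way to apply that rule is to compare two runs at a matched budget on each axis separately. The sketch below assumes each run logs (x, loss) pairs sorted by x for every axis; loss_at_budget, compare_runs, and the axis names are hypothetical.

```python
def loss_at_budget(curve, budget):
    """Loss at the last logged point whose x-value does not exceed `budget`."""
    reached = [loss for x, loss in curve if x <= budget]
    if not reached:
        raise ValueError("no logged point within budget")
    return reached[-1]


def compare_runs(baseline, candidate, budgets):
    """Return {axis: candidate_loss - baseline_loss}; negative favors the candidate."""
    return {
        axis: loss_at_budget(candidate[axis], budget) - loss_at_budget(baseline[axis], budget)
        for axis, budget in budgets.items()
    }


if __name__ == "__main__":
    baseline = {"step": [(100, 3.0), (200, 2.5)], "seconds": [(60, 3.0), (120, 2.5)]}
    # Candidate is better per step but each step takes twice as long.
    candidate = {"step": [(100, 2.8), (200, 2.3)], "seconds": [(120, 2.8), (240, 2.3)]}
    print(compare_runs(baseline, candidate, {"step": 200, "seconds": 120}))
```

In the toy data the candidate wins on loss versus step and loses on loss versus time, which is exactly the case a single-axis plot hides.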

Inspect the best run's traces. It may have won by learning a shortcut, formatting artifact, or easier token distribution rather than the intended task.

Distributed and numerical failures

If any rank's gradient contains inf, all ranks must clip to avoid divergence.5

As you can see it's the previous frames that we need to look into when the numbers start going into very large for fp16 numbers.12

Single-GPU tests can hide distributed failure modes. Keep the frames before a NaN/Inf, not only the crash site.
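Keeping the preceding frames can be as simple as a bounded buffer of per-module statistics. This is a stdlib-only sketch under the assumption that values arrive as plain Python floats; FrameRecorder is a hypothetical class, and a real training loop would feed it from forward hooks on tensors.

```python
import math
from collections import deque


class FrameRecorder:
    """Keep summary statistics for the last `max_frames` recorded outputs."""

    def __init__(self, max_frames=20):
        self.frames = deque(maxlen=max_frames)  # old frames fall off automatically

    def record(self, name, values):
        """Store one frame; return True if it contains NaN or Inf."""
        finite = [v for v in values if math.isfinite(v)]
        frame = {
            "name": name,
            "abs_max": max((abs(v) for v in finite), default=0.0),
            "has_nonfinite": len(finite) != len(values),
        }
        self.frames.append(frame)
        return frame["has_nonfinite"]

    def dump(self):
        """Frames in recording order, oldest first."""
        return list(self.frames)


if __name__ == "__main__":
    rec = FrameRecorder(max_frames=3)
    outputs = [("layer0", [0.5, -2.0]), ("layer1", [300.0]),
               ("layer2", [6.0e4]), ("layer3", [float("inf")])]
    for name, values in outputs:
        if rec.record(name, values):
            for frame in rec.dump():
                print(frame)
            break
```

In the toy trace the absolute maximum climbs toward the fp16 limit of 65504 in the frames before the Inf appears, which points at the layer where the growth starts rather than the layer where it finally overflows.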

Is the model too small?

Do not use a universal parameter-count threshold. Chaudhary et al. measure evaluation-awareness probes across 15 models from 0.27B to 70B and report predictable scaling rather than a clean threshold.13 Test a same-family size ladder and separate "the representation is detectable" from "the model can reliably express the behavior."

Activation steering

Steering effects are highly variable across samples, and often go in the opposite direction.14

The reliability paper finds that higher cosine similarity among training-set activation differences predicts more effective steering.14 Sweep layers and coefficients, inspect per-example effects, compare against prompting and few-shot baselines, and check whether the vector changed the concept or merely style, verbosity, sentiment, or refusal rate.
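As a rough pre-check before steering, one can measure how well the per-example activation differences agree in direction. The sketch below uses the mean pairwise cosine similarity, which is one reasonable reading of that finding and not necessarily the paper's exact statistic; the function names are hypothetical and vectors are plain Python lists.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def mean_pairwise_cosine(diffs):
    """Average cosine similarity over all pairs of activation-difference vectors."""
    pairs = list(combinations(diffs, 2))
    if not pairs:
        raise ValueError("need at least two difference vectors")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


if __name__ == "__main__":
    # Each row stands for (positive-prompt activation) - (negative-prompt activation).
    consistent = [[1.0, 0.1], [0.9, 0.0], [1.1, 0.2]]
    scattered = [[1.0, 0.0], [-1.0, 0.2], [0.0, 1.0]]
    print(mean_pairwise_cosine(consistent))
    print(mean_pairwise_cosine(scattered))
```

A low or negative value suggests the averaged vector will push different examples in different directions, so per-example inspection matters even more.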

Read disclosed-training reports

When debugging or designing a modern transformer run, read reports that disclose the model-building process rather than only final benchmark scores:

  • Olmo 3 releases the "entire model flow," including stages, checkpoints, data, and dependencies; code lives in OLMo-core.
  • Microsoft's MAI-Thinking-1 treats model development as a system-level optimization problem and gives a long-form account of scaling and RL decisions.
  • Nous Research's Hermes 4 describes failures and solutions across data curation, synthesis, training, and evaluation; Nous also releases open training/evaluation tooling such as Atropos.
  • DeepSeek-V3 reports architecture, infrastructure, training, and a run with no irrecoverable loss spikes or rollbacks.
  • Qwen3 documents a dense/MoE family from 0.6B to 235B, including pretraining and post-training details.
  • Tulu 3 is a fully open post-training recipe with data, code, evaluation tooling, decontamination, SFT, DPO, and RLVR.
  • The Llama 3 Herd is a large-scale end-to-end recipe for pretraining, post-training, safety, long context, multilinguality, coding, and tool use.
  • OPT-175B is older, but unusually useful because it documents large-scale training interruptions, instability, and operational fixes.

These are useful as working implementations and experiment logs: copy proven priors, compare the exact computation graph and recipe, and look for engineering details absent from method papers.

For experiment design, keep the Google Deep Learning Tuning Playbook nearby: it is explicitly about the practical gap between superficially similar recipes and actually working deep-learning systems.15

Sources


  1. Hugging Face LLM Course, "Debugging the training pipeline" (cache) ↩︎

  2. Axolotl, "Training Stability" (cache) ↩︎

  3. Unsloth, "Troubleshooting & FAQs" (cache) ↩︎

  4. Goyal et al., "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" ↩︎

  5. Karpathy, nanochat experiment log (cache) ↩︎

  6. Smith and Topin, "Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates" ↩︎

  7. Wen et al., "Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective" ↩︎

  8. Karpathy, "A Recipe for Training Neural Networks" (cache) ↩︎

  9. Karpathy, nanochat (optim.py: AdamW + Muon) ↩︎

  10. Wen et al., "Fantastic Pretraining Optimizers and Where to Find Them" (ICLR 2026) ↩︎

  11. Shah et al. (Essential AI), "Practical Efficiency of Muon for Pretraining" ↩︎

  12. Stas Bekman, DebugUnderflowOverflow (cache) ↩︎

  13. Chaudhary et al., "Evaluation Awareness Scales Predictably in Open-Weights Large Language Models" ↩︎

  14. Braun et al., "Understanding (Un)Reliability of Steering Vectors in Language Models" ↩︎

  15. Google Developers, "Deep Learning Tuning Playbook" ↩︎