Transformer and LLM debugging folklore

Appendix to the ML Debugging skill. This collects transformer-specific quotes, primary sources, and technical reports; read the general debugging folklore first.

Walk and log the full trace

The best way to debug an error that arises in trainer.train() is to manually go through this whole pipeline to see where things went awry.1

Debugging a failed run without metrics is guesswork.2

For fine-tuning, inspect decoded tokenized examples and label masks, not just the raw dataset:

All labels in your dataset are -100. Training losses will be all 0.3

Practical consequence: log the exact rendered prompt, special tokens, system prompt, completion, token IDs, label masks, truncation, generation settings, and model/tokenizer revisions. This follows from the HF pipeline-walkthrough advice, Axolotl's metrics-first guidance, and Unsloth's chat-template/BOS failure cases.
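A minimal sketch of the label-mask check, assuming labels are plain lists of integers and that -100 is the ignore value (the default `ignore_index` of PyTorch's cross-entropy loss). The function names are hypothetical helpers, not part of any of the cited libraries.

```python
IGNORE_INDEX = -100  # assumed ignore value; matches PyTorch's default ignore_index


def label_mask_report(labels):
    """Summarize how many label positions the loss will actually train on."""
    total = len(labels)
    supervised = sum(1 for y in labels if y != IGNORE_INDEX)
    return {
        "total": total,
        "supervised": supervised,
        "supervised_fraction": supervised / total if total else 0.0,
        # True reproduces the "all labels are -100" failure: the loss has nothing to fit.
        "all_masked": total > 0 and supervised == 0,
    }


def supervised_token_ids(input_ids, labels):
    """Return only the token IDs that contribute to the loss, ready to decode and read."""
    if len(input_ids) != len(labels):
        raise ValueError("input_ids and labels must have the same length")
    return [tok for tok, y in zip(input_ids, labels) if y != IGNORE_INDEX]
```

Run the report over a handful of examples before training, then decode the result of `supervised_token_ids` with the real tokenizer and confirm it is the completion, not the prompt.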

Match training and deployment

It's essential to use the SAME chat template that was used when training the model in Unsloth and later when you run it in another framework, such as llama.cpp or Ollama.3

Unsloth also says to test both hypotheses: an unnecessary start-of-sequence token, or a missing one.3
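One way to test both hypotheses is to count leading start-of-sequence tokens in the IDs that actually reach the model. This is a sketch under assumptions: `bos_id` must come from the real tokenizer, and whether a model expects exactly one BOS is model-specific, so treat the labels below as prompts for investigation rather than verdicts.

```python
def bos_diagnosis(token_ids, bos_id):
    """Classify the start of a tokenized prompt as 'missing', 'single', or 'double' BOS."""
    leading = 0
    for tok in token_ids:
        if tok != bos_id:
            break
        leading += 1
    if leading == 0:
        return "missing"
    return "single" if leading == 1 else "double"
```

Apply it to the training pipeline's token IDs and to the deployment framework's token IDs for the same conversation; a mismatch between the two is the thing to chase.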

Warmup and learning rate

Large-batch training without warmup can diverge in the first epoch and look like a code bug.4

Axolotl's SFT stability guide says the learning rate should follow the expected "warmup then decay" schedule, and lists insufficient warmup as a cause of early loss plateaus.2 Treat warmup as a strong transformer recipe prior: verify that the LR actually ramps up before the stable/high-LR phase, and that scheduler steps are counted in optimizer steps, not raw microbatches.
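The unit check can be sketched in plain Python. The linear ramp and the function names are assumptions for illustration; the point is only that the scheduler index advances once per optimizer step, not once per microbatch.

```python
def linear_warmup_lr(optimizer_step, peak_lr, warmup_steps):
    """Linear warmup to peak_lr over warmup_steps optimizer steps (0-indexed step)."""
    if warmup_steps <= 0:
        return peak_lr
    return peak_lr * min(1.0, (optimizer_step + 1) / warmup_steps)


def lr_trace(num_microbatches, grad_accum, peak_lr, warmup_steps):
    """LR applied at each optimizer step when the scheduler counts optimizer steps."""
    trace = []
    for micro in range(num_microbatches):
        # An optimizer step happens only after grad_accum microbatches.
        if (micro + 1) % grad_accum == 0:
            optimizer_step = (micro + 1) // grad_accum - 1
            trace.append(linear_warmup_lr(optimizer_step, peak_lr, warmup_steps))
    return trace
```

With `grad_accum=4` and `warmup_steps=4`, eight microbatches give two optimizer steps and the LR is still at half of peak. Feeding the microbatch index into the same schedule would report warmup as finished, which is the bug to look for in the logged LR curve.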

Fine-tuned d12 hyperparameters actively hurt d20 performance.5

Smith and Topin's Super-Convergence paper gives the key empirical support for a single aggressive LR cycle: neural nets trained with "one learning rate cycle and a large maximum learning rate" can train an order of magnitude faster on the workloads they tested.6 Treat this as strong evidence for trying OneCycle, not a universal proof that it is best for every transformer run.

For modern LLM pretraining, also consider WSD (warmup-stable-decay). Wen et al. contrast it with cosine: cosine requires choosing the total step budget up front, while WSD keeps a stable high-LR branch that can be decayed from different checkpoints when the compute budget is known.7 Warmup can enable an otherwise healthy transformer run; it does not rescue broken labels, masks, data, or gradients. Log the actual LR at every optimizer step and check scheduler units against gradient accumulation.
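The contrast can be made concrete with two small schedule functions. This is a sketch, not the schedule from any cited report: the linear warmup, the linear WSD decay shape, and all parameter names are assumptions.

```python
import math


def cosine_lr(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Warmup then cosine decay; the curve depends on total_steps from the start."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def wsd_lr(step, peak_lr, warmup_steps, decay_start, decay_steps, min_lr=0.0):
    """Warmup, stable phase at peak_lr, then decay starting at decay_start."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr  # stable branch: identical for every choice of decay_start
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return peak_lr + (min_lr - peak_lr) * progress  # linear decay (assumed shape)
```

Changing `total_steps` moves the cosine curve at every post-warmup step, while changing `decay_start` leaves the WSD stable branch untouched, which is why a stable-phase checkpoint can be decayed later under a different budget.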

Which optimizer?

In the early stages of setting baselines I like to use Adam with a learning rate of 3e-4. In my experience Adam is much more forgiving to hyperparameters, including a bad learning rate.8

AdamW remains the robust default. Some modern recipes mix AdamW with matrix-style optimizers for selected parameters, but treat that as a recipe to copy deliberately, not a generic substitution.9

There is no context-free winner. A controlled benchmark finds that matrix-based optimizers consistently outperform scalar-based ones, but their speedup over AdamW falls from about 1.4x at 0.1B to 1.1x at 1.2B parameters.10

Optimal choice of optimizer shifts depending on data-to-model ratios.10

That benchmark discusses matrix optimizers such as Muon, Soap, and Kron; the point for debugging is the caveat, not a specific winner. Tune each optimizer fairly, compare at the target scale, batch size, data-to-model ratio, and training budget, and prefer a proven recipe unless optimizer research is the experiment.

The disclosed training reports mostly reinforce this boring answer. DeepSeek-V3, OPT-175B, and Llama 3 all disclose AdamW recipes with warmup and decay; DeepSeek-V3 uses AdamW with a warmup, long stable high-LR phase, cosine decay, late lower-LR phase, gradient clipping, and batch-size scheduling.11 OPT-175B tried vanilla SGD during divergence recovery; "optimization plateaued quickly," and they reverted to AdamW.12

Better numbers can mean worse learning

The 'lower validation loss' from BOS-alignment is misleading—it's just fewer noisy tokens, not better learning.5

Improvements must show gains across multiple axes: per-step efficiency (loss vs. step), wall-clock efficiency (loss vs. time), and compute efficiency (loss vs. FLOPs).5
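A sketch of the three-axis comparison, assuming each run is logged as a list of rows with hypothetical keys `loss`, `step`, `seconds`, and `flops`. It reports how much cheaper the candidate is at reaching the same target loss on each axis.

```python
def first_reach(log, target_loss, axis):
    """Value of `axis` at the first logged row with loss <= target_loss, else None."""
    for row in log:
        if row["loss"] <= target_loss:
            return row[axis]
    return None


def compare_runs(baseline, candidate, target_loss, axes=("step", "seconds", "flops")):
    """Per-axis ratio baseline/candidate; a value above 1 means the candidate is cheaper."""
    report = {}
    for axis in axes:
        b = first_reach(baseline, target_loss, axis)
        c = first_reach(candidate, target_loss, axis)
        # None means at least one run never reached the target on this log.
        report[axis] = None if b is None or c is None else b / c
    return report
```

A change that wins on `step` but loses on `seconds` or `flops` has only moved cost around, so read all three ratios together.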

Inspect the best run's traces. It may have won by learning a shortcut, formatting artifact, or easier token distribution rather than the intended task.

Distributed and numerical failures

If any rank's gradient contains inf, all ranks must clip to avoid divergence.5
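The rule can be illustrated with a pure-Python simulation in which a list of per-rank gradient lists stands in for the ranks and a plain `sum` stands in for an all-reduce. Returning a scale of 0.0 for a non-finite global norm (that is, skipping the update everywhere) is an assumed policy, not the one from the cited log.

```python
import math


def global_clip_scale(per_rank_grads, max_norm, eps=1e-6):
    """Compute one clipping scale that every rank applies identically."""
    # Each rank computes its local squared norm.
    local_sq = [sum(g * g for g in grads) for grads in per_rank_grads]
    # Stand-in for an all-reduce SUM: every rank ends up with the same total.
    total_norm = math.sqrt(sum(local_sq))
    if not math.isfinite(total_norm):
        return 0.0  # assumed policy: all ranks drop this update together
    return min(1.0, max_norm / (total_norm + eps))
```

The failure to avoid is a per-rank decision: a rank whose local gradients look healthy would step normally while another rank clips or skips, and the replicas drift apart.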

As you can see it's the previous frames that we need to look into when the numbers start going into very large for fp16 numbers.13

Single-GPU tests can hide distributed failure modes. Keep the frames before a NaN/Inf, not only the crash site.
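Keeping the preceding frames amounts to a bounded ring buffer of per-module statistics. The sketch below is a hypothetical stand-alone recorder over plain lists of floats, not the API of the cited DebugUnderflowOverflow tool; in a real run the values would come from forward hooks.

```python
import math
from collections import deque


class FrameRecorder:
    """Keep the last max_frames activation summaries so the lead-up to a NaN/Inf survives."""

    def __init__(self, max_frames=21):
        self.frames = deque(maxlen=max_frames)  # oldest frames fall off automatically

    def record(self, name, values):
        """Store a summary frame; return True if this frame contains NaN or Inf."""
        finite = [abs(v) for v in values if math.isfinite(v)]
        frame = {
            "name": name,
            "abs_max": max(finite) if finite else None,
            "nonfinite": len(values) - len(finite),
        }
        self.frames.append(frame)
        return frame["nonfinite"] > 0

    def dump(self):
        """Frames in chronological order, ending with the most recent one."""
        return list(self.frames)
```

When `record` returns True, dump the buffer and read it backwards: the frame where `abs_max` first approaches the dtype's limit is usually more informative than the frame that finally overflowed.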

Modern reports treat infrastructure and numerics as first-class hypotheses, not background. DeepSeek-V3 reports no "irrecoverable loss spikes" or rollbacks after architecture, FP8, high-precision-retention, routing, and schedule co-design.11 MAI-Thinking-1 says "failures are expected" at thousands of GPUs, and gates nodes through certification before admitting them to production training.14 Llama 3 reports 466 interruptions in a 54-day 405B pretraining window, mostly hardware-related, with automation handling almost all of them.15

Is the model too small?

Do not use a universal parameter-count threshold. Chaudhary et al. measure evaluation-awareness probes across 15 models from 0.27B to 70B and report predictable scaling rather than a clean threshold.16 Test a same-family size ladder and separate "the representation is detectable" from "the model can reliably express the behavior."

Activation steering

Steering effects are highly variable across samples, and often go in the opposite direction.17

The reliability paper finds that higher cosine similarity among training-set activation differences predicts more effective steering.17 Sweep layers and coefficients, inspect per-example effects, compare against prompting and few-shot baselines, and check whether the vector changed the concept or merely style, verbosity, sentiment, or refusal rate.
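A rough pre-check along these lines is the mean pairwise cosine similarity of the per-example activation differences. This is a sketch on plain lists; the exact statistic used in the paper may differ, and the helper names are hypothetical.

```python
import math
from itertools import combinations


def activation_differences(positive, negative):
    """Per-example difference vectors between paired positive and negative activations."""
    return [[p - n for p, n in zip(a, b)] for a, b in zip(positive, negative)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def mean_pairwise_cosine(diffs):
    """Average cosine similarity over all pairs of difference vectors."""
    pairs = list(combinations(diffs, 2))
    if not pairs:
        raise ValueError("need at least two difference vectors")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

A value near 1 says the examples agree on a direction; a value near 0 says the averaged steering vector is mostly noise, so per-example effects are likely to scatter.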

What the recent reports add

Olmo 3 is the strongest "how to decide" reference in this set. It says "benchmarks are not perfect decision-making tools"; small models can sit at random chance, small score differences can be benchmark noise, and some tasks should be expanded, clustered, moved out of averages, or removed.18 Use proxy metrics and signal-to-noise checks before trusting small-scale ablations.
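One simple form of signal-to-noise check compares the spread between configurations with the spread between seeds of the same configuration. The definition below is an assumption chosen for illustration, not the metric from the cited report, and it needs at least two seeds per configuration.

```python
import math
import statistics


def signal_to_noise(scores_by_config):
    """Ratio of between-config score range to mean within-config seed std.

    scores_by_config maps a config name to a list of scores, one per seed.
    """
    means = [statistics.mean(s) for s in scores_by_config.values()]
    noise = statistics.mean(statistics.stdev(s) for s in scores_by_config.values())
    signal = max(means) - min(means)
    return signal / noise if noise > 0 else math.inf
```

If the ratio is near or below 1, the benchmark cannot separate the configurations at this scale and the ablation result should not drive a decision.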

MAI-Thinking-1 gives the eval-design maxim: "Evaluation results are only as informative as the prompts they are computed on."14 A narrow, saturated, or misweighted eval can give tight confidence intervals around the wrong quantity. Treat eval construction as part of the experiment, not bookkeeping.

Hermes 4 is useful for evaluation reproducibility and reasoning-length control. It says an eval score depends on "the inference engine and hardware" as well as the model, so they route benchmarks through one OpenAI-compatible endpoint and log all evaluation samples.19 For overlong reasoning, Hermes 4 does a targeted second SFT stage that teaches </think> termination without training on the whole generated chain.19
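Logging every evaluation sample together with its serving context can be as small as one JSON line per sample. The field names below are hypothetical, not the Hermes 4 schema; the idea is only that engine, hardware, and sampling settings travel with each score.

```python
import hashlib
import json


def eval_record(benchmark, sample_id, prompt, completion, score, engine, hardware, sampling):
    """Serialize one evaluation sample, with its serving context, as a JSON line."""
    record = {
        "benchmark": benchmark,
        "sample_id": sample_id,
        # Hash of the exact rendered prompt, to detect silent template changes.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt": prompt,
        "completion": completion,
        "score": score,
        "engine": engine,
        "hardware": hardware,
        "sampling": sampling,
    }
    return json.dumps(record, sort_keys=True, ensure_ascii=False)
```

Appending these lines to a file per run makes two scores comparable only when their `engine`, `hardware`, `sampling`, and `prompt_sha256` fields match.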

Qwen3 is the chat-template and mode-control reminder: thinking/non-thinking behavior is part of the data format, not just sampling policy. Qwen3 uses /think and /no_think flags and exposes enable_thinking=False through the tokenizer chat template.20
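Because the mode flag lives in the conversation text, it is worth logging which flag a conversation actually carries. The sketch below scans system and user messages for the two flags; treating the last flag seen as the effective one is an assumption to verify against the Qwen3 report, and the message format (dicts with `role` and `content`) is the common chat convention rather than anything specific to Qwen3.

```python
MODE_FLAGS = ("/think", "/no_think")


def last_mode_flag(messages):
    """Return the last /think or /no_think flag in system/user messages, or None."""
    flag = None
    for msg in messages:
        if msg.get("role") not in ("system", "user"):
            continue
        # Whole-word match so ordinary text containing "think" is ignored.
        for word in msg.get("content", "").split():
            if word in MODE_FLAGS:
                flag = word
    return flag
```

Log the result next to the rendered prompt so that training data and deployment requests can be compared for mode mismatches.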

Hermes 4 and Qwen3 both lean on filtered synthetic/verifiable data, but with guardrails: Hermes uses a different judge model from the answer model to reduce judge self-preference, and Qwen3 filters reasoning traces for wrong answers, repetition, guesswork, thinking/summary inconsistency, style shifts, and possible validation overlap.19 20

Read disclosed-training reports

When debugging or designing a modern transformer run, read reports that disclose the model-building process rather than only final benchmark scores:

  • Olmo 3 releases the "entire model flow," including stages, checkpoints, data, and dependencies; code lives in OLMo-core.
  • Microsoft's MAI-Thinking-1 treats model development as a system-level optimization problem and gives a long-form account of scaling and RL decisions.
  • Nous Research's Hermes 4 describes failures and solutions across data curation, synthesis, training, and evaluation; Nous also releases open training/evaluation tooling such as Atropos.
  • DeepSeek-V3 reports architecture, infrastructure, training, and a run with no irrecoverable loss spikes or rollbacks.
  • Qwen3 documents a dense/MoE family from 0.6B to 235B, including pretraining and post-training details.
  • Secondary postmortems: The Llama 3 Herd for large-scale pretraining operations, and OPT-175B for training interruptions, instability, and mid-flight recovery.

These are useful as working implementations and experiment logs: copy proven priors, compare the exact computation graph and recipe, and look for engineering details absent from method papers.

For experiment design, keep the Google Deep Learning Tuning Playbook nearby: it is explicitly about the practical gap between superficially similar recipes and actually working deep-learning systems.21

Sources


  1. Hugging Face LLM Course, "Debugging the training pipeline" (cache)

  2. Axolotl, "Training Stability" (cache)

  3. Unsloth, "Troubleshooting & FAQs" (cache)

  4. Goyal et al., "Accurate, Large Minibatch SGD"

  5. Karpathy, nanochat experiment log (cache)

  6. Smith and Topin, "Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates"

  7. Wen et al., "Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective"

  8. Karpathy, "A Recipe for Training Neural Networks" (cache)

  9. Karpathy, nanochat (optim.py: AdamW + Muon)

  10. Wen et al., "Fantastic Pretraining Optimizers and Where to Find Them" (ICLR 2026)

  11. DeepSeek-AI, "DeepSeek-V3 Technical Report" (cache)

  12. Zhang et al., "OPT: Open Pre-trained Transformer Language Models" (cache)

  13. Stas Bekman, DebugUnderflowOverflow (cache)

  14. Microsoft AI Team, "MAI-Thinking-1: Building a Hill-Climbing Machine" (cache)

  15. Meta AI, "The Llama 3 Herd of Models" (cache)

  16. Chaudhary et al., "Evaluation Awareness Scales Predictably in Open-Weights Large Language Models"

  17. Braun et al., "Understanding (Un)Reliability of Steering Vectors in Language Models"

  18. OLMo Team, "Olmo 3" (cache; OLMo-core, cache)

  19. Nous Research, "Hermes 4 Technical Report" (cache; Atropos, cache)

  20. Qwen Team, "Qwen3 Technical Report" (cache)

  21. Google Developers, "Deep Learning Tuning Playbook"