Files
ml-debug/refs/metric_stuck.md
wassname 38ec634ff3 restructure: folklore-first, quote-verified, with wassname intro
Reorder around what's durable, per wassname's curation:
- human-written intro up top; rename to "wassname's ML Debugging Folklore"
- mindset first: calibrate -> mental models -> Part 1 general tricks (kept,
  they're well-based) -> read a working implementation when stuck
- a Folklore section built from verbatim, source-checked quotes (Jones,
  Rahtz, Karpathy, Schulman, Henderson, Irpan, CS231n, Slavv, Goodfellow),
  each footnoted to the canonical URL + the cached copy with line numbers
- LLM-agent babysitting (debugging loop, triage menu, anti-patterns) moved to
  the bottom where it belongs; triage reframed as a menu, not a flowchart
- deeper one-off tricks split to refs/ (loss_surface, metric_stuck, sweeps),
  scrubbed of private tooling (wandb/just/SI/personal scripts)

Quote integrity: every quote independently verified by fresh-eyes subagents
against the cached sources; fixed a reformatted Schulman slide, a truncated
Jones sentence, a reversed-order Rahtz stitch, a falsely-quoted Slavv phrase,
and the 3e-4 line (now the real tweet, framed as the joke Karpathy confirmed
it was, not gospel). lr_scheduler anti-pattern nuanced (warmup/cyclic matter).

Remove superseded SKILL2.md draft.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 20:46:25 +08:00

3.0 KiB

Why won't this metric move?

Appendix to the ML Debugging skill. When a quantity you're optimizing plateaus, these are ideas for telling why, not a flowchart to obey. They apply to most training setups, but they're suggestions; your project may not fit them.

The useful split is three questions, cheapest first.

1. Is the gradient nonzero at the metric level?

metric_val = torch.tensor(current_value, requires_grad=True)
loss = loss_fn(metric_val)
loss.backward()
print(f"d(loss)/d(metric) = {metric_val.grad}")
  • ~0: the loss doesn't care about this metric at the current operating point. Maybe saturated (log1p of a huge value), in a dead zone, or the metric is disconnected from the loss.
  • large: the loss is trying to move it. The problem is downstream.

2. Can the parameter even change the metric?

Trace the chain loss -> metric -> ... -> parameter. The metric is a function of intermediate quantities, which are functions of learned parameters. Look at d(metric)/d(parameter):

  • Analytically: is there a structural reason this derivative is ~0? (e.g. a rotation of V can't change span(U).)
  • Empirically: disable the loss term (set its coefficient to 0). Does the metric reach the same value anyway? If yes, the optimization never moved it; it's a structural ceiling, and you need a different parameterization, not a different loss weight.

3. Is something else fighting it?

If the gradient is nonzero and the parameter can change the metric:

  • Competing loss terms: compute each component's gradient on the shared parameter separately. Opposite-sign gradients cancel.
  • Optimizer state: AdamW momentum from earlier training can resist a direction change. Try resetting optimizer state or a warmup.
  • Conditioning: if the metric needs coordinated changes across many parameters (rotating several layers at once), the per-parameter gradient may be too small even when the aggregate signal is large.

A rough map (a guide, not a verdict)

d(loss)/d(metric) d(metric)/d(param) Same value with the term off? Reading
~0 any any Loss saturated or disconnected; reconsider the loss formula.
large ~0 yes Structural ceiling; reconsider the parameterization.
large large no Competing losses or optimizer inertia; isolate them.
large large yes The term helps but converges to the same basin; weak effect or coincidence.

Structural-ceiling check, concretely

# 1. Is d(loss)/d(metric) large? If so, the optimizer IS trying.
metric = torch.tensor(0.5, requires_grad=True)
loss = loss_fn(metric); loss.backward()
print(metric.grad)   # large (e.g. 350x the other grads) => it's trying

# 2. Can the parameter change the metric? Trace loss -> metric -> intermediate -> parameter.
#    If d(metric)/d(parameter) ~ 0, the parameter structurally cannot move it.
#    (e.g. a V-rotation can't change the output basis when U is fixed.)

# 3. Confirm empirically: set the term's coefficient to 0.
#    If the metric reaches the SAME value, it was never learned; it's structural.