Files
ml-debug/refs/loss_surface.md
T
wassname 38ec634ff3 restructure: folklore-first, quote-verified, with wassname intro
Reorder around what's durable, per wassname's curation:
- human-written intro up top; rename to "wassname's ML Debugging Folklore"
- mindset first: calibrate -> mental models -> Part 1 general tricks (kept,
  they're well-based) -> read a working implementation when stuck
- a Folklore section built from verbatim, source-checked quotes (Jones,
  Rahtz, Karpathy, Schulman, Henderson, Irpan, CS231n, Slavv, Goodfellow),
  each footnoted to the canonical URL + the cached copy with line numbers
- LLM-agent babysitting (debugging loop, triage menu, anti-patterns) moved to
  the bottom where it belongs; triage reframed as a menu, not a flowchart
- deeper one-off tricks split to refs/ (loss_surface, metric_stuck, sweeps),
  scrubbed of private tooling (wandb/just/SI/personal scripts)

Quote integrity: every quote independently verified by fresh-eyes subagents
against the cached sources; fixed a reformatted Schulman slide, a truncated
Jones sentence, a reversed-order Rahtz stitch, a falsely-quoted Slavv phrase,
and the 3e-4 line (now the real tweet, framed as the joke Karpathy confirmed
it was, not gospel). lr_scheduler anti-pattern nuanced (warmup/cyclic matter).

Remove superseded SKILL2.md draft.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 20:46:25 +08:00

3.8 KiB

Loss surface & gradient analysis (no model required)

Appendix to the ML Debugging skill. A trick worth reaching for when a loss (not the whole model) is misbehaving: visualize its surface and gradient flow directly, feeding synthetic tensors into the loss sub-components. No model, forward pass, or GPU, just the math. Five minutes of plotting often saves hours of squinting at training curves.

When you'd look this up: a new or custom loss behaves oddly; a metric is stuck and you suspect the loss shape; you just changed a loss formula and want to confirm gradients still flow at the operating point (not just at init); you're comparing two loss variants and want to see their gradient fields side by side.

The method

  1. Identify each loss sub-component as a function of its immediate inputs.
  2. Pick 1-2 axes that matter (the "natural axes" you reason about when you think about the loss).
  3. Grid over those axes, feed through the loss, call .backward(), collect gradients.
  4. Plot: contour heatmap + quiver overlay (negative gradient = the direction the optimizer moves).
  5. Build a summary table: component x representative_input -> loss_value, grad_value. Flag zero or non-finite gradients.
# ── 2D loss surface with gradient quiver ──────
def analyze_component(loss_fn, x_range, y_range, n=80):
    xs = torch.linspace(*x_range, n)
    ys = torch.linspace(*y_range, n)
    X, Y = torch.meshgrid(xs, ys, indexing='ij')
    x_flat = X.flatten().requires_grad_(True)
    y_flat = Y.flatten().requires_grad_(True)

    losses = loss_fn(x_flat, y_flat)       # vectorized, returns (n*n,)
    losses.sum().backward()

    loss_grid = losses.detach().reshape(n, n)
    gx = x_flat.grad.reshape(n, n)
    gy = y_flat.grad.reshape(n, n)

    # contourf(X, Y, loss_grid) + quiver(X, Y, -gx, -gy)
    # negative gradient = direction optimizer moves

# ── Gradient flow verification table ──────────
# For each component, evaluate at representative inputs
# (zero, small, converged, degenerate). Report loss + grad.
# Flag: zero grad (dead zone), non-finite (numerical issue).
#
# | Component       | Param   | Input        | Loss     | Grad     |
# |-----------------|---------|--------------|----------|----------|
# | barrier_penalty | v       | v=0.0        | +0.000   | +0.000   |  <-- zero grad!
# | barrier_penalty | v       | v=0.5        | +12.50   | +50.00   |
# | pair_loss       | dot_pos | (0.3, -0.3)  | -2.340   | -3.000   |
# | pair_loss       | dot_neg | (0.3, -0.3)  | -2.340   | +3.000   |  <-- antisym, good
# | pair_loss       | dot_pos | (0.0, 0.0)   | +0.000   | +0.000   |  <-- dead at init!

What to look for

Pattern Meaning Action
Gradient arrows point toward desired region Loss is well-shaped Ship it
Large flat region (zero gradient) Dead zone: optimizer stuck if it lands here Add curvature, change init, or reparameterize
Gradient magnitude 1000x in one axis vs another Imbalanced: one axis dominates Rescale, use log-space, or normalize
Saddle point at origin Common with product-form losses (A*B) Switch to additive (log A + log B) for independent gradients
Arrows point away from desired region Loss is wrong or has an unexpected local min Rethink the formula
Non-finite values in a region Numerical issue (log(0), 0/0) Add eps, clamp, or use log1p

The log-space decomposition trick

When your loss is a product of factors A*B and one factor can be near zero:

# BAD: symlog(A * B), when B~0 the chain rule gives 0 grad to A too
# GOOD: sign * (log|A| + log|B|) gives independent gradients
#   d/dA = 1/A  regardless of B
#   d/dB = 1/B  regardless of A

General principle: if you want gradient to flow independently through two factors, decompose multiplicatively in log space.