restructure: folklore-first, quote-verified, with wassname intro

Reorder around what's durable, per wassname's curation:
- human-written intro up top; rename to "wassname's ML Debugging Folklore"
- mindset first: calibrate -> mental models -> Part 1 general tricks (kept,
  they're well-based) -> read a working implementation when stuck
- a Folklore section built from verbatim, source-checked quotes (Jones,
  Rahtz, Karpathy, Schulman, Henderson, Irpan, CS231n, Slavv, Goodfellow),
  each footnoted to the canonical URL + the cached copy with line numbers
- LLM-agent babysitting (debugging loop, triage menu, anti-patterns) moved to
  the bottom where it belongs; triage reframed as a menu, not a flowchart
- deeper one-off tricks split to refs/ (loss_surface, metric_stuck, sweeps),
  scrubbed of private tooling (wandb/just/SI/personal scripts)

Quote integrity: every quote independently verified by fresh-eyes subagents
against the cached sources; fixed a reformatted Schulman slide, a truncated
Jones sentence, a reversed-order Rahtz stitch, a falsely-quoted Slavv phrase,
and the 3e-4 line (now the real tweet, framed as the joke Karpathy confirmed
it was, not gospel). lr_scheduler anti-pattern nuanced (warmup/cyclic matter).

Remove superseded SKILL2.md draft.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-02 20:46:25 +08:00
parent cf9df71f6a
commit 38ec634ff3
6 changed files with 458 additions and 730 deletions
+70
@@ -0,0 +1,70 @@
# Loss surface & gradient analysis (no model required)
Appendix to the [ML Debugging skill](../SKILL.md). A trick worth reaching for when a *loss* (not the whole model) is misbehaving: visualize its surface and gradient flow directly, feeding synthetic tensors into the loss sub-components. No model, forward pass, or GPU, just the math. Five minutes of plotting often saves hours of squinting at training curves.
When you'd look this up: a new or custom loss behaves oddly; a metric is stuck and you suspect the loss shape; you just changed a loss formula and want to confirm gradients still flow at the operating point (not just at init); you're comparing two loss variants and want to see their gradient fields side by side.
## The method
1. Identify each loss sub-component as a function of its immediate inputs.
2. Pick 1-2 axes that matter (the "natural axes" you reason about when you think about the loss).
3. Grid over those axes, feed through the loss, call `.backward()`, collect gradients.
4. Plot: contour heatmap + quiver overlay (negative gradient = the direction the optimizer moves).
5. Build a summary table: component x representative_input -> loss_value, grad_value. Flag zero or non-finite gradients.
```py
import torch

# ── 2D loss surface with gradient quiver ──────
def analyze_component(loss_fn, x_range, y_range, n=80):
    xs = torch.linspace(*x_range, n)
    ys = torch.linspace(*y_range, n)
    X, Y = torch.meshgrid(xs, ys, indexing='ij')
    x_flat = X.flatten().requires_grad_(True)
    y_flat = Y.flatten().requires_grad_(True)
    losses = loss_fn(x_flat, y_flat)  # vectorized, returns (n*n,)
    losses.sum().backward()           # each grid cell gets its own gradient
    loss_grid = losses.detach().reshape(n, n)
    gx = x_flat.grad.reshape(n, n)
    gy = y_flat.grad.reshape(n, n)
    # contourf(X, Y, loss_grid) + quiver(X, Y, -gx, -gy)
    # negative gradient = direction optimizer moves
    return X, Y, loss_grid, gx, gy

# ── Gradient flow verification table ──────────
# For each component, evaluate at representative inputs
# (zero, small, converged, degenerate). Report loss + grad.
# Flag: zero grad (dead zone), non-finite (numerical issue).
#
# | Component       | Param   | Input        | Loss     | Grad     |
# |-----------------|---------|--------------|----------|----------|
# | barrier_penalty | v       | v=0.0        | +0.000   | +0.000   | <-- zero grad!
# | barrier_penalty | v       | v=0.5        | +12.50   | +50.00   |
# | pair_loss       | dot_pos | (0.3, -0.3)  | -2.340   | -3.000   |
# | pair_loss       | dot_neg | (0.3, -0.3)  | -2.340   | +3.000   | <-- antisym, good
# | pair_loss       | dot_pos | (0.0, 0.0)   | +0.000   | +0.000   | <-- dead at init!
```
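The verification table in step 5 can be generated rather than filled in by hand. A minimal sketch, assuming each component is a scalar function of a single input; `grad_table`, the `zero_tol` threshold, and both example components are hypothetical stand-ins, not the losses from any real project:

```python
import torch

def grad_table(components, zero_tol=1e-8):
    """components: list of (name, fn, inputs); fn maps a 0-d tensor to a scalar loss."""
    rows = []
    for name, fn, inputs in components:
        for value in inputs:
            x = torch.tensor(float(value), requires_grad=True)
            loss = fn(x)
            loss.backward()
            grad = x.grad.item()
            if not (torch.isfinite(loss).item() and torch.isfinite(x.grad).item()):
                flag = "non-finite"   # numerical issue, e.g. log(0)
            elif abs(grad) < zero_tol:
                flag = "zero grad"    # dead zone at this input
            else:
                flag = ""
            rows.append((name, value, loss.item(), grad, flag))
    return rows

# Hypothetical components, chosen only to exercise both flags.
components = [
    ("barrier_penalty", lambda v: 50.0 * v ** 2, [0.0, 0.5]),
    ("log_term", lambda v: torch.log(v), [0.0, 1.0]),
]
rows = grad_table(components)
print("| Component | Input | Loss | Grad | Flag |")
print("|---|---|---|---|---|")
for name, value, loss, grad, flag in rows:
    print(f"| {name} | {value} | {loss:+.3f} | {grad:+.3f} | {flag} |")
```

Pick the representative inputs by hand (zero, small, converged, degenerate); a dense grid is the surface plot's job, and this table is for the handful of operating points you care about.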
## What to look for
| Pattern | Meaning | Action |
|---------|---------|--------|
| Gradient arrows point toward desired region | Loss is well-shaped | Ship it |
| Large flat region (zero gradient) | Dead zone: optimizer stuck if it lands here | Add curvature, change init, or reparameterize |
| Gradient magnitude 1000x in one axis vs another | Imbalanced: one axis dominates | Rescale, use log-space, or normalize |
| Saddle point at origin | Common with product-form losses (A*B) | Switch to additive (log A + log B) for independent gradients |
| Arrows point away from desired region | Loss is wrong or has an unexpected local min | Rethink the formula |
| Non-finite values in a region | Numerical issue (log(0), 0/0) | Add eps, clamp, or use log1p |
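Three of the patterns above (dead zones, axis imbalance, non-finite regions) can be screened numerically before you look at a plot. A rough sketch under stated assumptions: `surface_report`, the `dead_tol` threshold, and the three toy losses are illustrative choices, and the loss must depend on both axes so that both gradients exist:

```python
import torch

def surface_report(loss_fn, x_range, y_range, n=21, dead_tol=1e-6):
    xs = torch.linspace(*x_range, n)
    ys = torch.linspace(*y_range, n)
    X, Y = torch.meshgrid(xs, ys, indexing="ij")
    x = X.flatten().requires_grad_(True)
    y = Y.flatten().requires_grad_(True)
    losses = loss_fn(x, y)
    losses.sum().backward()
    gx, gy = x.grad, y.grad
    finite = torch.isfinite(losses) & torch.isfinite(gx) & torch.isfinite(gy)
    grad_norm = torch.sqrt(gx ** 2 + gy ** 2)
    dead = finite & (grad_norm < dead_tol)
    # Mean absolute gradient per axis over finite cells; a large ratio means one axis dominates.
    mean_gx = gx[finite].abs().mean()
    mean_gy = gy[finite].abs().mean()
    ratio = torch.maximum(mean_gx, mean_gy) / torch.minimum(mean_gx, mean_gy).clamp_min(1e-12)
    return {
        "dead_frac": dead.float().mean().item(),
        "nonfinite_frac": (~finite).float().mean().item(),
        "axis_ratio": ratio.item(),
    }

# Toy losses, one per pattern (all hypothetical).
flat_quadrant = surface_report(lambda x, y: torch.relu(x) ** 2 + torch.relu(y) ** 2, (-1, 1), (-1, 1))
lopsided = surface_report(lambda x, y: x ** 2 + 1000 * y ** 2, (-1, 1), (-1, 1))
singular = surface_report(lambda x, y: torch.log(x) + y ** 2, (0, 1), (-1, 1))
for name, report in [("flat_quadrant", flat_quadrant), ("lopsided", lopsided), ("singular", singular)]:
    print(name, report)
```

The numbers only tell you where to look. Whether the arrows point toward the region you want is still a judgment you make from the quiver plot.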
## The log-space decomposition trick
When your loss is a product of factors A*B and one factor can be near zero:
```
# BAD: symlog(A * B), when B~0 the chain rule gives 0 grad to A too
# GOOD: sign * (log|A| + log|B|) gives independent gradients
# d/dA = 1/A regardless of B
# d/dB = 1/B regardless of A
```
General principle: if you want gradient to flow independently through two factors, take logs so the product becomes a sum; each factor then gets its own gradient path.
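To see the effect numerically, the sketch below compares the two forms at a point where B is near zero. It assumes `symlog(x) = sign(x) * log(1 + |x|)`, which is one common definition and may not be the one your code uses; `grads`, `product_form`, and `log_form` are names made up for the example:

```python
import torch

def symlog(x):
    # Assumed definition: sign(x) * log(1 + |x|)
    return torch.sign(x) * torch.log1p(torch.abs(x))

def grads(fn, a, b):
    """Return (d fn / dA, d fn / dB) at the point (a, b)."""
    A = torch.tensor(a, requires_grad=True)
    B = torch.tensor(b, requires_grad=True)
    fn(A, B).backward()
    return A.grad.item(), B.grad.item()

def product_form(A, B):
    return symlog(A * B)

def log_form(A, B):
    # The sign is treated as a constant; the gradient flows through the two logs only.
    sign = torch.sign(A * B).detach()
    return sign * (torch.log(A.abs()) + torch.log(B.abs()))

ga_prod, gb_prod = grads(product_form, 2.0, 1e-6)
ga_log, gb_log = grads(log_form, 2.0, 1e-6)
print(f"product form: dA={ga_prod:.3e}  dB={gb_prod:.3e}")
print(f"log form:     dA={ga_log:.3e}  dB={gb_log:.3e}")
```

In the product form the gradient on A scales with B, so it nearly vanishes here, while the log form gives A a gradient of 1/A regardless. The flip side is that 1/B becomes very large as B approaches zero, so an eps or a clamp inside the log is usually still needed.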
+57
@@ -0,0 +1,57 @@
# Why won't this metric move?
Appendix to the [ML Debugging skill](../SKILL.md). When a quantity you're optimizing plateaus, these are ideas for telling *why*, not a flowchart to obey. They apply to most training setups, but they're suggestions; your project may not fit them.
The useful split is three questions, cheapest first.
## 1. Is the gradient nonzero at the metric level?
```py
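import torch

# current_value: the metric's present value (a float); loss_fn: your loss as a function of the metric
metric_val = torch.tensor(current_value, requires_grad=True)
loss = loss_fn(metric_val)
loss.backward()
print(f"d(loss)/d(metric) = {metric_val.grad}")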
```
- ~0: the loss doesn't care about this metric at the current operating point. Maybe saturated (log1p of a huge value), in a dead zone, or the metric is disconnected from the loss.
- large: the loss is trying to move it. The problem is downstream.
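A concrete instance of the saturated case, using `torch.log1p` as a stand-in loss (an assumption for illustration; substitute your own loss as a function of the metric). `dloss_dmetric` is a hypothetical helper name:

```python
import torch

def dloss_dmetric(loss_fn, value):
    """Gradient of the loss with respect to the metric, evaluated at `value`."""
    metric = torch.tensor(float(value), requires_grad=True)
    loss_fn(metric).backward()
    return metric.grad.item()

saturating_loss = torch.log1p  # hypothetical loss: gradient is 1 / (1 + metric)
for value in (0.1, 10.0, 1e6):
    grad = dloss_dmetric(saturating_loss, value)
    print(f"metric={value:g}  d(loss)/d(metric)={grad:.2e}")
```

The same loss that pushes hard at small metric values barely registers at large ones, so check the gradient at the value the metric is stuck at, not at its initial value.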
## 2. Can the parameter even change the metric?
Trace the chain `loss -> metric -> ... -> parameter`. The metric is a function of intermediate quantities, which are functions of learned parameters. Look at `d(metric)/d(parameter)`:
- Analytically: is there a structural reason this derivative is ~0? (e.g. a rotation of V can't change span(U).)
- Empirically: disable the loss term (set its coefficient to 0). Does the metric reach the same value anyway? If yes, the optimization never moved it; it's a structural ceiling, and you need a different parameterization, not a different loss weight.
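The analytic check can often be done with autograd on a toy version of the parameterization. The sketch below is an invented 2x2 example, not taken from any real model: `W = U diag(s) V^T` with U and V as plane rotations, and the metric is the share of output energy along a fixed `target` direction. All names and constants are assumptions:

```python
import torch

def rot(theta):
    """2x2 rotation matrix built from a 0-d tensor so gradients flow through theta."""
    c, s = torch.cos(theta), torch.sin(theta)
    return torch.stack([torch.stack([c, -s]), torch.stack([s, c])])

def metric(theta_u, theta_v):
    U, V = rot(theta_u), rot(theta_v)
    S = torch.diag(torch.tensor([2.0, 0.5]))
    W = U @ S @ V.T
    target = torch.tensor([1.0, 0.0])
    # Fraction of W's output energy that lies along `target`.
    return (target @ W).pow(2).sum() / W.pow(2).sum()

theta_u = torch.tensor(0.3, requires_grad=True)
theta_v = torch.tensor(1.1, requires_grad=True)
d_u, d_v = torch.autograd.grad(metric(theta_u, theta_v), (theta_u, theta_v))
print(f"d(metric)/d(theta_u) = {d_u.item():+.4f}")
print(f"d(metric)/d(theta_v) = {d_v.item():+.4f}")
```

Rotating V leaves the output directions untouched, so its derivative is zero up to floating-point noise, while rotating U moves the metric. If the only trainable parameter were `theta_v`, no loss weight would shift this metric.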
## 3. Is something else fighting it?
If the gradient is nonzero and the parameter *can* change the metric:
- Competing loss terms: compute each component's gradient on the shared parameter separately. Opposite-sign gradients cancel.
- Optimizer state: AdamW momentum from earlier training can resist a direction change. Try resetting optimizer state or a warmup.
- Conditioning: if the metric needs coordinated changes across many parameters (rotating several layers at once), the per-parameter gradient may be too small even when the aggregate signal is large.
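For the competing-terms check, one way to compute each component's gradient on a shared parameter is `torch.autograd.grad` per term, followed by a cosine similarity. A minimal sketch; `component_grads`, the parameter `w`, and the two toy terms are hypothetical, chosen so that they cancel exactly:

```python
import torch
import torch.nn.functional as F

def component_grads(param, loss_terms):
    """loss_terms: {name: callable(param) -> scalar loss}. Returns {name: flattened grad}."""
    out = {}
    for name, term in loss_terms.items():
        (g,) = torch.autograd.grad(term(param), param)
        out[name] = g.flatten()
    return out

w = torch.tensor([1.0, -2.0, 0.5], requires_grad=True)
target = torch.tensor([2.0, -4.0, 1.0])
terms = {
    "fit": lambda p: ((p - target) ** 2).sum(),   # pulls w toward target
    "weight_decay": lambda p: (p ** 2).sum(),     # pulls w toward zero
}
g = component_grads(w, terms)
cos = F.cosine_similarity(g["fit"], g["weight_decay"], dim=0).item()
total_norm = (g["fit"] + g["weight_decay"]).norm().item()
for name, grad in g.items():
    print(f"{name}: |grad| = {grad.norm().item():.3f}")
print(f"cosine = {cos:+.3f}, |sum of grads| = {total_norm:.3e}")
```

Each term has a sizeable gradient, yet the sum is zero: the parameter sits at the equilibrium of the two. A cosine near -1 between per-term gradients is the signature to look for; the total gradient alone would suggest nothing is happening.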
## A rough map (a guide, not a verdict)
| d(loss)/d(metric) | d(metric)/d(param) | Same value with the term off? | Reading |
|---|---|---|---|
| ~0 | any | any | Loss saturated or disconnected; reconsider the loss formula. |
| large | ~0 | yes | Structural ceiling; reconsider the parameterization. |
| large | large | no | Competing losses or optimizer inertia; isolate them. |
| large | large | yes | The term is active, but training lands in the same basin without it; weak effect or coincidence. |
## Structural-ceiling check, concretely
```py
# 1. Is d(loss)/d(metric) large? If so, the optimizer IS trying.
metric = torch.tensor(0.5, requires_grad=True)
loss = loss_fn(metric); loss.backward()
print(metric.grad) # large (e.g. 350x the other grads) => it's trying
# 2. Can the parameter change the metric? Trace loss -> metric -> intermediate -> parameter.
# If d(metric)/d(parameter) ~ 0, the parameter structurally cannot move it.
# (e.g. a V-rotation can't change the output basis when U is fixed.)
# 3. Confirm empirically: set the term's coefficient to 0.
# If the metric reaches the SAME value, it was never learned; it's structural.
```
+32
@@ -0,0 +1,32 @@
# Sweeps: same-seed comparison and cross-seed reliability
Appendix to the [ML Debugging skill](../SKILL.md). The general idea behind a trustworthy hyperparameter sweep, tool-agnostic. The point is the difference between "I tried it and it seemed better" and "it's reliably better across seeds." Irpan's 30% seed-failure result and Henderson's "seeds alone create statistically different distributions" (see the main skill's folklore section) are why this matters: a single lucky run proves nothing.
## The core move: pair on seed, normalize within group, test across seeds
1. Run the same set of seeds for every value of the parameter you're varying. Same seeds across values turns this into a paired comparison and cancels seed-level baseline differences.
2. Vary one parameter per sweep when you can (all-else-equal). If you vary two, effects confound and you can't attribute the result.
3. Within each (group, seed), z-score the metric across the parameter values. This removes the per-seed baseline offset so you compare *shapes*, not absolute levels.
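4. Aggregate the z-scores across seeds per value, then take a t-stat: `mean_z / (std_z / sqrt(n_seeds))`. As a rule of thumb, `|t| > 2` with 4+ seeds suggests a consistent effect and `t ~ 0` means none; with only 4 seeds the two-sided 5% critical value is about 3.2, so treat 2 as a screening threshold rather than a significance test.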
5. For numeric parameters, also fit a linear trend (Pearson r) and t-test it: a clean dose-response is `r` near +/-1 with a significant t-stat.
```py
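import numpy as np

# Assumed inputs: metrics[group] is an (n_seeds, n_values) array, one run per
# (seed, param value), with columns ordered like param_values.
for group, m in metrics.items():
    m = np.asarray(m, dtype=float)
    # within-(group, seed) normalization: z-score each seed's row across param values
    z = (m - m.mean(axis=1, keepdims=True)) / m.std(axis=1, keepdims=True)
    n_seeds = z.shape[0]
    for j, value in enumerate(param_values):
        mean_z, std_z = z[:, j].mean(), z[:, j].std(ddof=1)
        t_stat = mean_z / (std_z / np.sqrt(n_seeds))  # >>2 reliably better, <<-2 reliably worse
        print(group, value, f"mean_z={mean_z:+.2f}", f"t={t_stat:+.2f}")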
```
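As a worked instance of the paired comparison, here is a stdlib-only sketch on made-up numbers. The metric values, the learning-rate grid, and the helper name `paired_sweep_stats` are all invented for illustration. It assumes a single group, higher-is-better metrics, population std for the within-seed z-score, and sample std across seeds:

```python
import math
import statistics

def paired_sweep_stats(runs):
    """runs: {seed: {param_value: metric}} for one group, same param values for every seed.
    Returns {param_value: (mean_z, t_stat)}."""
    values = sorted(next(iter(runs.values())))
    z = {}
    for seed, by_value in runs.items():
        metrics = [by_value[v] for v in values]
        mu, sd = statistics.mean(metrics), statistics.pstdev(metrics)
        z[seed] = {v: (by_value[v] - mu) / sd for v in values}  # removes the per-seed offset
    n_seeds = len(runs)
    stats = {}
    for v in values:
        zs = [z[seed][v] for seed in runs]
        mean_z, std_z = statistics.mean(zs), statistics.stdev(zs)
        stats[v] = (mean_z, mean_z / (std_z / math.sqrt(n_seeds)))
    return stats

# Fabricated example data: 4 seeds x 3 learning rates; seed baselines differ a lot.
runs = {
    0: {1e-4: 0.70, 3e-4: 0.75, 1e-3: 0.72},
    1: {1e-4: 0.60, 3e-4: 0.66, 1e-3: 0.61},
    2: {1e-4: 0.80, 3e-4: 0.84, 1e-3: 0.83},
    3: {1e-4: 0.65, 3e-4: 0.71, 1e-3: 0.66},
}
stats = paired_sweep_stats(runs)
for value, (mean_z, t_stat) in stats.items():
    print(f"lr={value:g}  mean_z={mean_z:+.2f}  t={t_stat:+.2f}")
```

The raw metrics span 0.60 to 0.84 mostly because of the seed, which would swamp an unpaired comparison. After within-seed normalization the middle value stands out consistently. The sketch does not guard against a seed whose metrics are all identical or a value whose z-scores agree exactly across seeds; either case makes a standard deviation zero and needs handling in real use.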
## What you're looking for
High effect size *and* a strong t-stat. A value with a big mean but `t=0.5` is a lucky seed; a value with a modest mean but `t=4.0` is a real (if small) effect.
## Common pitfalls
- `n_seeds = 1`: t-stat is undefined. One data point. Replicate before concluding anything.
- Cross-group comparisons: different groups often have different base configs, so "group A's best value vs group B's best" is apples-to-oranges. Compare within groups.
- Too many parameters varied at once: split into separate sweeps.
- Crashed / diverged runs showing as missing or NaN metrics: investigate the run, don't silently drop it; a divergence is itself a finding.
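Two of these pitfalls (a single seed, and missing or NaN runs) are cheap to catch mechanically before any statistics are computed. A small sketch, assuming the same hypothetical `{seed: {param_value: metric}}` layout for one group; `audit_runs` is an invented helper:

```python
import math

def audit_runs(runs, expected_values):
    """runs: {seed: {param_value: metric}}. Returns a list of human-readable problems."""
    problems = []
    if len(runs) < 2:
        problems.append(f"only {len(runs)} seed(s): t-stat undefined, replicate first")
    for seed, by_value in runs.items():
        for value in expected_values:
            metric = by_value.get(value)
            if metric is None:
                problems.append(f"seed {seed}, value {value}: run missing")
            elif not math.isfinite(metric):
                problems.append(f"seed {seed}, value {value}: metric is {metric}, investigate the run")
    return problems

# Fabricated example: one diverged run and one run that never reported.
runs = {
    0: {0.1: 0.71, 0.3: float("nan")},
    1: {0.1: 0.69},
}
problems = audit_runs(runs, expected_values=[0.1, 0.3])
for problem in problems:
    print(problem)
```

The function reports rather than filters, so a diverged run stays visible as a finding instead of silently shrinking the sample.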