mirror of
https://github.com/wassname/ml_debug.git
synced 2026-06-27 01:00:14 +08:00
restructure: folklore-first, quote-verified, with wassname intro
Reorder around what's durable, per wassname's curation:
- human-written intro up top; rename to "wassname's ML Debugging Folklore"
- mindset first: calibrate -> mental models -> Part 1 general tricks (kept, they're well-based) -> read a working implementation when stuck
- a Folklore section built from verbatim, source-checked quotes (Jones, Rahtz, Karpathy, Schulman, Henderson, Irpan, CS231n, Slavv, Goodfellow), each footnoted to the canonical URL + the cached copy with line numbers
- LLM-agent babysitting (debugging loop, triage menu, anti-patterns) moved to the bottom where it belongs; triage reframed as a menu, not a flowchart
- deeper one-off tricks split to refs/ (loss_surface, metric_stuck, sweeps), scrubbed of private tooling (wandb/just/SI/personal scripts)

Quote integrity: every quote independently verified by fresh-eyes subagents against the cached sources; fixed a reformatted Schulman slide, a truncated Jones sentence, a reversed-order Rahtz stitch, a falsely-quoted Slavv phrase, and the 3e-4 line (now the real tweet, framed as the joke Karpathy confirmed it was, not gospel).

lr_scheduler anti-pattern nuanced (warmup/cyclic matter). Remove superseded SKILL2.md draft.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -0,0 +1,70 @@
# Loss surface & gradient analysis (no model required)

Appendix to the [ML Debugging skill](../SKILL.md). A trick worth reaching for when a *loss* (not the whole model) is misbehaving: visualize its surface and gradient flow directly, feeding synthetic tensors into the loss sub-components. No model, forward pass, or GPU, just the math. Five minutes of plotting often saves hours of squinting at training curves.

When you'd look this up: a new or custom loss behaves oddly; a metric is stuck and you suspect the loss shape; you just changed a loss formula and want to confirm gradients still flow at the operating point (not just at init); you're comparing two loss variants and want to see their gradient fields side by side.

## The method

1. Identify each loss sub-component as a function of its immediate inputs.
2. Pick 1-2 axes that matter (the "natural axes" you reason about when you think about the loss).
3. Grid over those axes, feed through the loss, call `.backward()`, collect gradients.
4. Plot: contour heatmap + quiver overlay (negative gradient = the direction the optimizer moves).
5. Build a summary table: component x representative_input -> loss_value, grad_value. Flag zero or non-finite gradients.

```py
import matplotlib.pyplot as plt
import torch

# ── 2D loss surface with gradient quiver ──────
def analyze_component(loss_fn, x_range, y_range, n=80):
    """Evaluate loss_fn on an n x n grid; return grid, losses, and gradients."""
    xs = torch.linspace(*x_range, n)
    ys = torch.linspace(*y_range, n)
    X, Y = torch.meshgrid(xs, ys, indexing='ij')
    x_flat = X.flatten().clone().requires_grad_(True)
    y_flat = Y.flatten().clone().requires_grad_(True)

    losses = loss_fn(x_flat, y_flat)  # vectorized, returns (n*n,)
    # Each element depends only on its own (x, y), so summing before
    # backward() yields the per-point gradients.
    losses.sum().backward()

    loss_grid = losses.detach().reshape(n, n)
    gx = x_flat.grad.reshape(n, n)
    gy = y_flat.grad.reshape(n, n)
    return X, Y, loss_grid, gx, gy

def plot_component(X, Y, loss_grid, gx, gy, step=5):
    fig, ax = plt.subplots()
    cs = ax.contourf(X.numpy(), Y.numpy(), loss_grid.numpy(), levels=30)
    fig.colorbar(cs, ax=ax)
    s = slice(None, None, step)  # thin the arrows so the quiver stays readable
    # negative gradient = direction optimizer moves
    ax.quiver(X[s, s].numpy(), Y[s, s].numpy(), -gx[s, s].numpy(), -gy[s, s].numpy())
    return fig

# ── Gradient flow verification table ──────────
# For each component, evaluate at representative inputs
# (zero, small, converged, degenerate). Report loss + grad.
# Flag: zero grad (dead zone), non-finite (numerical issue).
#
# | Component       | Param   | Input        | Loss     | Grad     |
# |-----------------|---------|--------------|----------|----------|
# | barrier_penalty | v       | v=0.0        | +0.000   | +0.000   |  <-- zero grad!
# | barrier_penalty | v       | v=0.5        | +12.50   | +50.00   |
# | pair_loss       | dot_pos | (0.3, -0.3)  | -2.340   | -3.000   |
# | pair_loss       | dot_neg | (0.3, -0.3)  | -2.340   | +3.000   |  <-- antisym, good
# | pair_loss       | dot_pos | (0.0, 0.0)   | +0.000   | +0.000   |  <-- dead at init!
```
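
Step 5 of the method (the summary table) can be sketched as below. This is a minimal illustration under stated assumptions, not the author's tooling: `grad_table`, `barrier_penalty`, `log_penalty`, and the `atol` threshold are hypothetical names and values picked for the example, and each component is assumed to take a single scalar input.

```python
import math
import torch

def grad_table(components, atol=1e-8):
    """components: list of (name, loss_fn, {label: scalar_input})."""
    rows = []
    for name, loss_fn, inputs in components:
        for label, value in inputs.items():
            x = torch.tensor(float(value), requires_grad=True)
            loss = loss_fn(x)
            (grad,) = torch.autograd.grad(loss, x)
            loss_val, grad_val = loss.item(), grad.item()
            if not (math.isfinite(loss_val) and math.isfinite(grad_val)):
                flag = "non-finite"      # numerical issue
            elif abs(grad_val) < atol:
                flag = "zero grad"       # dead zone
            else:
                flag = ""
            rows.append((name, label, loss_val, grad_val, flag))
    return rows

# Hypothetical components, for illustration only.
def barrier_penalty(v):
    return 50.0 * v ** 2      # flat (zero gradient) at v = 0

def log_penalty(v):
    return torch.log(v)       # log(0) is -inf, gradient 1/0 is inf

rows = grad_table([
    ("barrier_penalty", barrier_penalty, {"zero": 0.0, "small": 0.5}),
    ("log_penalty", log_penalty, {"zero": 0.0, "converged": 1.0}),
])
for name, label, loss_val, grad_val, flag in rows:
    print(f"| {name:<15} | {label:<9} | {loss_val:+10.3f} | {grad_val:+10.3f} | {flag}")
```

Using `torch.autograd.grad` rather than `.backward()` keeps each row independent, since no `.grad` buffers accumulate between evaluations.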
## What to look for

| Pattern | Meaning | Action |
|---------|---------|--------|
| Gradient arrows point toward desired region | Loss is well-shaped | Ship it |
| Large flat region (zero gradient) | Dead zone: optimizer stuck if it lands here | Add curvature, change init, or reparameterize |
| Gradient magnitude 1000x in one axis vs another | Imbalanced: one axis dominates | Rescale, use log-space, or normalize |
| Saddle point at origin | Common with product-form losses (A*B) | Switch to additive (log A + log B) for independent gradients |
| Arrows point away from desired region | Loss is wrong or has an unexpected local min | Rethink the formula |
| Non-finite values in a region | Numerical issue (log(0), 0/0) | Add eps, clamp, or use log1p |

## The log-space decomposition trick

When your loss is a product of factors A*B and one factor can be near zero:

```
# BAD:  symlog(A * B), when B~0 the chain rule gives ~0 grad to A too
#         (with symlog(x) = sign(x) * log(1 + |x|): d/dA = B / (1 + |A*B|))
# GOOD: sign * (log|A| + log|B|), sign held constant, gives independent gradients
#         d/dA = sign/A  regardless of B
#         d/dB = sign/B  regardless of A
```

General principle: if you want gradient to flow independently through two factors, turn the product into a sum in log space. The 1/A and 1/B gradients grow without bound as a factor approaches zero, so pair this with an eps or a clamp.
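
To see the difference numerically, the sketch below compares the gradients of both forms at a point where B is near zero. It assumes `symlog(x) = sign(x) * log1p(|x|)` and treats the sign as a constant in the log-sum form; the helper names (`product_form`, `log_sum_form`, `grads`) and the sample point are illustrative only.

```python
import torch

def symlog(x):
    # assumed definition: sign(x) * log(1 + |x|)
    return torch.sign(x) * torch.log1p(torch.abs(x))

def product_form(a, b):
    return symlog(a * b)

def log_sum_form(a, b):
    sign = torch.sign(a * b).detach()   # held constant, carries no gradient
    return sign * (torch.log(torch.abs(a)) + torch.log(torch.abs(b)))

def grads(fn, a0, b0):
    a = torch.tensor(a0, requires_grad=True)
    b = torch.tensor(b0, requires_grad=True)
    ga, gb = torch.autograd.grad(fn(a, b), (a, b))
    return ga.item(), gb.item()

a0, b0 = 2.0, 1e-4   # B is near zero
print("product form (dA, dB):", grads(product_form, a0, b0))
print("log-sum form (dA, dB):", grads(log_sum_form, a0, b0))
```

In the product form the gradient on A scales with B and nearly vanishes here, while in the log-sum form it stays at 1/A. The gradient on B in the log-sum form is 1/B, which is large at this point.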
@@ -0,0 +1,57 @@
# Why won't this metric move?

Appendix to the [ML Debugging skill](../SKILL.md). When a quantity you're optimizing plateaus, these are ideas for telling *why*, not a flowchart to obey. They apply to most training setups, but they're suggestions; your project may not fit them.

The useful split is three questions, cheapest first.

## 1. Is the gradient nonzero at the metric level?

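```py
import torch

# current_value: the metric's current value as a float (placeholder)
# loss_fn: the loss term written as a function of the metric alone (placeholder)
metric_val = torch.tensor(float(current_value), requires_grad=True)
loss = loss_fn(metric_val)
loss.backward()
print(f"d(loss)/d(metric) = {metric_val.grad}")
```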
- ~0: the loss doesn't care about this metric at the current operating point. Maybe saturated (log1p of a huge value), in a dead zone, or the metric is disconnected from the loss.
- large: the loss is trying to move it. The problem is downstream.

## 2. Can the parameter even change the metric?

Trace the chain `loss -> metric -> ... -> parameter`. The metric is a function of intermediate quantities, which are functions of learned parameters. Look at `d(metric)/d(parameter)`:

- Analytically: is there a structural reason this derivative is ~0? (e.g. a rotation of V can't change span(U).)
- Empirically: disable the loss term (set its coefficient to 0). Does the metric reach the same value anyway? If yes, the optimization never moved it; it's a structural ceiling, and you need a different parameterization, not a different loss weight.
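
A toy version of the analytic check, as a sketch: here the learned parameter is a 2D rotation angle, one metric is the norm of the rotated vector (which no rotation can change) and the other is its first coordinate (which a rotation can change). The setup, the names `rotate`, `norm_metric`, `first_coord_metric`, and the values are all hypothetical, chosen only to show a structural zero next to a live derivative.

```python
import torch

def rotate(theta, v):
    c, s = torch.cos(theta), torch.sin(theta)
    R = torch.stack([torch.stack([c, -s]), torch.stack([s, c])])
    return R @ v

theta = torch.tensor(0.3, requires_grad=True)   # stand-in for a learned parameter
v = torch.tensor([1.0, 2.0])                    # fixed input

def norm_metric(theta):
    return rotate(theta, v).norm()    # invariant under rotation

def first_coord_metric(theta):
    return rotate(theta, v)[0]        # changes with the rotation angle

results = {}
for name, metric_fn in [("norm", norm_metric), ("first_coord", first_coord_metric)]:
    (g,) = torch.autograd.grad(metric_fn(theta), theta)
    results[name] = g.item()
    print(f"d({name})/d(theta) = {g.item():+.6f}")
```

A derivative that is zero up to floating-point noise at several different parameter values, not just one, is the signature of a structural ceiling rather than an unlucky operating point.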
## 3. Is something else fighting it?

If the gradient is nonzero and the parameter *can* change the metric:

- Competing loss terms: compute each component's gradient on the shared parameter separately. Opposite-sign gradients cancel.
- Optimizer state: AdamW momentum from earlier training can resist a direction change. Try resetting optimizer state or a warmup.
- Conditioning: if the metric needs coordinated changes across many parameters (rotating several layers at once), the per-parameter gradient may be too small even when the aggregate signal is large.
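
The competing-terms check can be sketched as follows: take each term's gradient on the shared parameter separately, then compare their norms, their cosine similarity, and the norm of the sum. The two loss terms and the parameter `w` are made up for the example; in a real project they would be the actual loss components evaluated on a real batch.

```python
import torch
import torch.nn.functional as F

w = torch.tensor([1.0, -2.0, 0.5], requires_grad=True)   # shared parameter (hypothetical)

# Two hypothetical terms that pull w in opposite directions.
terms = {
    "pull_to_zero": lambda w: (w ** 2).sum(),
    "push_from_zero": lambda w: -0.9 * (w ** 2).sum(),
}

# One gradient per term, computed separately so they can be compared.
grads = {name: torch.autograd.grad(fn(w), w)[0] for name, fn in terms.items()}
total = sum(grads.values())

for name, g in grads.items():
    print(f"{name:<15} |g| = {g.norm().item():.3f}")

first, second = list(grads.values())
cos = F.cosine_similarity(first, second, dim=0)
print(f"cosine = {cos.item():+.3f}, |sum of grads| = {total.norm().item():.3f}")
```

A cosine near -1 with a summed norm much smaller than either individual norm means the terms are cancelling on this parameter.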
## A rough map (a guide, not a verdict)

| d(loss)/d(metric) | d(metric)/d(param) | Same value with the term off? | Reading |
|---|---|---|---|
| ~0 | any | any | Loss saturated or disconnected; reconsider the loss formula. |
| large | ~0 | yes | Structural ceiling; reconsider the parameterization. |
| large | large | no | Competing losses or optimizer inertia; isolate them. |
| large | large | yes | The term pushes, but training reaches the same basin without it; weak effect or coincidence. |

## Structural-ceiling check, concretely

```py
import torch

# 1. Is d(loss)/d(metric) large? If so, the optimizer IS trying.
#    (loss_fn is a placeholder for your loss term as a function of the metric.)
metric = torch.tensor(0.5, requires_grad=True)
loss = loss_fn(metric)
loss.backward()
print(metric.grad)  # large (e.g. 350x the other grads) => it's trying

# 2. Can the parameter change the metric? Trace loss -> metric -> intermediate -> parameter.
#    If d(metric)/d(parameter) ~ 0, the parameter structurally cannot move it.
#    (e.g. a V-rotation can't change the output basis when U is fixed.)

# 3. Confirm empirically: set the term's coefficient to 0.
#    If the metric reaches the SAME value, it was never learned; it's structural.
```
@@ -0,0 +1,32 @@
# Sweeps: same-seed comparison and cross-seed reliability

Appendix to the [ML Debugging skill](../SKILL.md). The general idea behind a trustworthy hyperparameter sweep, tool-agnostic. The point is the difference between "I tried it and it seemed better" and "it's reliably better across seeds." Irpan's 30% seed-failure result and Henderson's "seeds alone create statistically different distributions" (see the main skill's folklore section) are why this matters: a single lucky run proves nothing.

## The core move: pair on seed, normalize within group, test across seeds

1. Run the same set of seeds for every value of the parameter you're varying. Same seeds across values turns this into a paired comparison and cancels seed-level baseline differences.
2. Vary one parameter per sweep when you can (all-else-equal). If you vary two, effects confound and you can't attribute the result.
3. Within each (group, seed), z-score the metric across the parameter values. This removes the per-seed baseline offset so you compare *shapes*, not absolute levels.
4. Aggregate the z-scores across seeds per value, then take a t-stat: `mean_z / (std_z / sqrt(n_seeds))`. `|t| > 2` is a rough bar for a consistent effect, but the exact threshold depends on the seed count: with 4 seeds (3 degrees of freedom) the two-sided 5% critical value is about 3.18, so treat `|t|` between 2 and 3 as suggestive until you add seeds. `t ~ 0` is no consistent effect.
5. For numeric parameters, also fit a linear trend (Pearson r) and t-test it: a clean dose-response is `r` near +/-1 with a significant t-stat.

```py
import numpy as np

def paired_t_stats(metric):
    """metric[seed, value] for ONE group, same seeds for every parameter value."""
    metric = np.asarray(metric, dtype=float)
    n_seeds = metric.shape[0]  # needs n_seeds >= 2
    # within-(group, seed) normalization: z-score each row across parameter values
    z = (metric - metric.mean(axis=1, keepdims=True)) / metric.std(axis=1, ddof=1, keepdims=True)
    mean_z, std_z = z.mean(axis=0), z.std(axis=0, ddof=1)
    return mean_z / (std_z / np.sqrt(n_seeds))  # per value: >>2 reliably better, <<-2 reliably worse

# Call once per group; never pool rows from different groups.
```
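
Step 5 (the linear trend) is open to more than one reading. The sketch below takes one of them as an assumption: compute Pearson r between the parameter values and the per-value mean z-score, then convert it to a t-stat with the standard formula `t = r * sqrt((n - 2) / (1 - r^2))`, where `n` is the number of parameter values. The function name `trend_test` and the numbers in `metric` are made up for illustration.

```python
import math
import numpy as np

def trend_test(param_values, mean_z):
    """Pearson r between parameter value and mean z-score, plus its t-stat."""
    x = np.asarray(param_values, dtype=float)
    y = np.asarray(mean_z, dtype=float)
    n = len(x)                                   # needs n > 2 and |r| < 1
    r = float(np.corrcoef(x, y)[0, 1])
    t = r * math.sqrt((n - 2) / (1.0 - r * r))   # n - 2 degrees of freedom
    return r, t

# Made-up numbers: metric[seed, value], 4 seeds x 4 parameter values.
param_values = [1.0, 2.0, 3.0, 4.0]
metric = np.array([
    [0.50, 0.55, 0.61, 0.64],
    [0.70, 0.74, 0.82, 0.85],
    [0.30, 0.36, 0.40, 0.46],
    [0.60, 0.63, 0.70, 0.76],
])

# Same within-seed normalization as in the sweep procedure above.
z = (metric - metric.mean(axis=1, keepdims=True)) / metric.std(axis=1, ddof=1, keepdims=True)
mean_z = z.mean(axis=0)

r, t = trend_test(param_values, mean_z)
print(f"r = {r:+.3f}, t = {t:+.2f} (df = {len(param_values) - 2})")
```

With only a handful of parameter values the degrees of freedom are small (2 here, where the two-sided 5% critical value is about 4.30), so a high `r` alone is weak evidence. For parameters swept on a log scale, such as learning rates, passing `np.log(param_values)` is a reasonable variant.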
## What you're looking for

High effect size *and* a strong t-stat. A value with a big mean but `t=0.5` is a lucky seed; a value with a modest mean but `t=4.0` is a real (if small) effect.

## Common pitfalls

- `n_seeds = 1`: t-stat is undefined. One data point. Replicate before concluding anything.
- Cross-group comparisons: different groups often have different base configs, so "group A's best value vs group B's best" is apples-to-oranges. Compare within groups.
- Too many parameters varied at once: split into separate sweeps.
- Crashed / diverged runs showing as missing or NaN metrics: investigate the run, don't silently drop it; a divergence is itself a finding.
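
One way to keep diverged runs visible is sketched below, as a design choice rather than the author's procedure: list every non-finite (seed, value) cell explicitly, and run the paired statistics only on seeds that completed for every parameter value so the pairing stays intact. The function name `split_diverged` and the sample numbers are hypothetical.

```python
import numpy as np

def split_diverged(metric):
    """Return (list of diverged (seed, value) indices, rows of fully complete seeds)."""
    metric = np.asarray(metric, dtype=float)
    bad = ~np.isfinite(metric)                        # NaN or +/-inf
    diverged = [(int(s), int(v)) for s, v in zip(*np.nonzero(bad))]
    complete_seeds = np.nonzero(~bad.any(axis=1))[0]  # seeds with no failures
    return diverged, metric[complete_seeds]

# Made-up sweep: 3 seeds x 2 parameter values, two runs diverged.
metric = [
    [0.50, 0.60],
    [float("nan"), 0.70],
    [0.40, float("inf")],
]
diverged, complete = split_diverged(metric)
print("diverged (seed, value):", diverged)
print("seeds usable for paired stats:", complete.shape[0])
```

Reporting the `diverged` list next to the t-stats keeps the failures in view; if one parameter value accounts for most of them, that is a finding about that value, not noise to discard.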