ml_debug/SKILL.md
wassname 90b11214f8 de-AI pass: drop em-dashes, flourishes; resolve in-file TODOs
- convert all prose ' -- ' to commas/periods/parens (left code/CLI/arrows)
- remove the antithesis flourish in the bisect step; inform not persuade
- de-telegraph "no model, no forward pass, no GPU. pure math."
- add non-exhaustive hedges (and so on / like) where lists implied closure
- fix typos: authoritative (x2), sklearn, it indented
- TODO: triage decision tree converted from ASCII art to nested bullets
- TODO: add Further reading section linking docs/evidence/* files

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 12:49:28 +08:00


name description
ml-debug Wassname's practical folklore for debugging ML systems: convergence issues, loss surface analysis, gradient analysis, sweep methodology, and same-seed comparisons. Use when stuck on training, designing sweeps, or analyzing experiment results.

ML Debugging Folklore

Practitioner knowledge that's hard to find in papers, distilled from Schulman's "Nuts and Bolts" talk, Andy Jones' debugging guide, r/reinforcementlearning threads, competition write-ups, and personal experience.

The core problem: errors aren't local [Jones 2021]. Especially in RL, information flows in a loop (actor -> learner -> actor), so a numerical bug in one spot gets smeared through the whole system in seconds. From outside everything goes weird at once (loss explodes, KL collapses, rewards oscillate) and you can tell something's wrong but not what or where. That's why the discipline below leads with calibration and clue-collection, not fixes.

Before you debug: calibrate

If you're an LLM agent reading this, the most useful thing this skill can tell you is about you, not the bug. ML research code is often outside your training distribution: novel losses, custom architectures, methods that don't have a canonical "right answer" you've seen a thousand times. Your trained reflex there is to be confident and fast: pattern-match a symptom to a fix ("loss stuck -> drop the LR") and apply it. That reflex is the enemy. It commits to one hypothesis before you've looked, and a wrong fix on possibly-buggy code wastes a run and corrupts your evidence about what's actually happening.

So slow down and widen out. The discipline below isn't a recipe that outputs a fix; it's a set of habits for staying calibrated and keeping your hypothesis space open until the evidence, not your prior, closes it. The habits transfer across timeseries, GANs, OCR, RL, PINNs, puzzles; the specific fixes in the tables below do not, so treat those tables as a menu of hypotheses to widen your search, never as a lookup-and-apply.

The debugging loop (judgment, not a checklist)

Roughly in this order, but the point is the mindset, not ticking boxes:

Collect clues before theorizing. Read the traceback and logs. Run static analysis (Part 6.1) and the cheap diagnostics (Part 6.2: data sanity check, init-loss check, overfit-one-batch). You are a detective at a scene, not a fortune teller. If you catch yourself proposing a fix before you've looked at anything, stop.

Hold several hypotheses at once; resist converging early. Unless the cause is already obvious (a traceback usually points right at it), generate a few genuinely different explanations before ranking any of them, so you don't marry the first one that comes to mind. Part 7.1 has five lenses for generating them: information flow, ablation, oracle substitution, learning curves, structural ceiling. Then sanity-check yourself with the failure-mode triplet:

  • Likely: your strongest competitor explanation, with a rough credence.
  • Subtle: the sneaky one, like sample size, leakage, a confound, a metric artifact, or plain seed variance masquerading as signal.
  • Null: there's no real effect, or it comes from something else you also changed.

Give each a one-line prior (rough credence) and its cheapest falsifier, a Check: ... line naming the observation that would kill it. Those falsifiers are the menu the next step draws from.
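One way to keep the triplet honest is to write it down as data. A minimal sketch, assuming a hypothetical `Hypothesis` record and made-up credences for a "val accuracy stuck" symptom; none of these names or numbers come from the sources.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    kind: str        # "likely", "subtle", or "null"
    claim: str
    credence: float  # rough prior, not a measurement
    check: str       # cheapest observation that would kill it


# Illustrative values only.
hypotheses = [
    Hypothesis("likely", "labels misaligned with inputs after shuffling", 0.5,
               "Check: print 8 (input, label) pairs from the loader"),
    Hypothesis("subtle", "the apparent improvement is seed variance", 0.3,
               "Check: rerun the baseline with 3 more seeds"),
    Hypothesis("null", "no real effect; the eval set changed between runs", 0.2,
               "Check: diff the eval file hashes of both runs"),
]

total = sum(h.credence for h in hypotheses)
assert abs(total - 1.0) < 1e-9, "credences should roughly sum to 1"

# Print strongest first, each with its falsifier on the next line.
for h in sorted(hypotheses, key=lambda h: -h.credence):
    print(f"{h.kind:<7} p={h.credence:.1f}  {h.claim}\n        {h.check}")
```

Writing the `Check:` line next to each claim makes it obvious when a hypothesis has no cheap falsifier yet.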

Anchor priors on what's usually wrong (Part 7.2: data ~40%, loss ~20%, training ~15%, architecture ~10%, hyperparameters ~5%), but priors are a starting weight, not a verdict. A clue that points elsewhere overrides them outright: a traceback naming a line, a metric stuck while the loss is healthy (loss-metric misalignment, not data), or an init-loss that's exactly right all redirect you regardless of the ~40% data prior.

Make sure to separate observations (to be faithfully reproduced in an auditable manner) from inferences. That way you can go back and rethink things without degrading the evidence.

Run the cheapest observation that splits your top hypotheses. Not the most thorough experiment, the most discriminating one (Rahtz: think more, experiment less, Part 1). To find it, forward-predict each hypothesis ("what would I see if this were the cause?"): a test is strong evidence only where the predictions diverge, and worthless where every hypothesis predicts the same outcome. Prefer the check whose result you'd bet on differently under each explanation. A grad-norm line reading ~0 under "dead layer" but healthy under "LR too low" beats a 4-hour sweep that only confirms what you already believed.

But before you run even a 10-minute test, remember that it's usually faster to step back and form good priors first. Rank several candidate diagnostics by how much you expect to learn against what they cost in code complexity and GPU time, and pick the ones where the learning is worth the cost.
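The ranking can be made mechanical. This is a rough sketch, assuming hand-written forward predictions and cost estimates in minutes; the test names, outcomes, and the crude `divergence` measure are illustrative assumptions, not a prescribed metric.

```python
# Predicted outcome of each candidate test under each hypothesis (illustrative).
predictions = {
    "log per-layer grad norm": {"dead layer": "~0", "LR too low": "healthy"},
    "4-hour LR sweep": {"dead layer": "no improvement", "LR too low": "improves"},
    "print final loss": {"dead layer": "flat", "LR too low": "flat"},
}
cost_minutes = {
    "log per-layer grad norm": 2,
    "4-hour LR sweep": 240,
    "print final loss": 1,
}


def divergence(outcomes):
    """Distinct predicted outcomes minus one; 0 means the test cannot discriminate."""
    return len(set(outcomes.values())) - 1


def score(test):
    # Evidence per unit cost.
    return divergence(predictions[test]) / cost_minutes[test]


ranked = sorted(predictions, key=score, reverse=True)
best = ranked[0]
print(best)
```

The test on which every hypothesis predicts the same outcome scores zero however cheap it is, and the sweep loses to the two-minute log line because it buys the same split at 120 times the cost.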

Bisect the path to localize where it breaks. Once you have a hypothesis about the cause, you still need to find where in the pipeline it goes wrong. Data flows forward and gradients flow backward in a chain (input -> preprocess -> layers -> loss -> grads), so probe the midpoint instead of reading every step: is the value or gradient already wrong halfway through? Each probe halves the search space. Finding the first module to produce a non-finite value (the NaN search in Part 6.2) is one case of this; the same bisection works for finite-but-wrong values, exploded grad norms, dead activations, and so on.
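The bisection itself is an ordinary binary search over pipeline stages. A minimal sketch on a toy pipeline of plain Python callables; `first_bad_stage` is a hypothetical helper, and it assumes that once a value goes bad every later stage stays bad.

```python
import math


def first_bad_stage(stages, x, is_ok):
    """Return the index of the first stage whose output fails is_ok, or None.

    Assumes failures propagate: once a value is bad, every later stage stays bad.
    """
    def output_after(k):
        # Value after running stages[0..k] inclusive. Recomputed for clarity, not speed.
        v = x
        for stage in stages[: k + 1]:
            v = stage(v)
        return v

    lo, hi = 0, len(stages) - 1
    if is_ok(output_after(hi)):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if is_ok(output_after(mid)):
            lo = mid + 1   # still fine at the midpoint, so the fault is later
        else:
            hi = mid       # already bad here, so the fault is at mid or earlier
    return lo


# Toy pipeline: the third stage takes the log of a non-positive number.
stages = [
    lambda v: v - 5.0,                                 # preprocess
    lambda v: v * 2.0,                                 # layer
    lambda v: math.log(v) if v > 0 else float("nan"),  # buggy op
    lambda v: v + 1.0,                                 # loss
]
bad = first_bad_stage(stages, 3.0, math.isfinite)
print(bad)
```

Swapping `math.isfinite` for a range check or a comparison against a trusted reference value covers the finite-but-wrong case.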

Then act, and only on what the observation pointed to. If a cycle or two hasn't localized it, stop tuning and go read working code (next section), which is usually better than another guess.

Consult as reference, from inside this loop, never as a first move: triage tree (Part 6.3), hypothesis-generating lenses (Part 7.1), the metric-stuck decision tree (Part 5), RL specifics (rl/SKILL.md).

The same loop in pseudocode (for humans and agents to hold in one glance):

def debug(symptom):
    clues = collect(traceback, logs, static_analysis, cheap_diagnostics)  # look before theorizing
    H     = generate(clues, lenses=5) | {likely, subtle, null}            # ≥3 genuinely different
    prior = anchor(H)                  # base rates: data .40 loss .20 train .15 arch .10 hp .05

    while not localized:
        # pick the test by evidence-per-cost, not by thoroughness
        test  = argmax(divergence(predict(h, t) for h in H) / cost(t) for t in candidates)
        obs   = run(test)              # one log line or toy run; record obs separately from inference
        prior = update(prior, obs)     # a clue that points elsewhere overrides the prior outright
        H     = bisect_path(H, obs)    # forward values + backward grads, halve the search each probe
        if cycles > 2:
            return read_working_code() # diff your math + graph against a trusted impl

    fix(root_cause); assert reproduces(obs)   # no silent fallback; crash if it doesn't

When stuck, read a working implementation

After 1-2 diagnostic cycles that don't localize the bug, or whenever you're building something you haven't built before, stop guessing and go read code that already works. Agents tend to skip this in favour of another round of from-scratch guessing, which is usually the worse bet.

Use the gh skill to find an implementation. Rank candidates by trust signal (per CLAUDE.md): community adoption > papers citing it > open source code that runs > author reputation > self-reports. A repo other researchers use as a baseline is worth more than a flashy README.

Read it for three things, explicitly:

  1. The algorithm done right. Diff your math and your computation graph against theirs. The bug is usually something "trivial", like a sign, a reset, an off-by-one in indexing, or an advantage normalization you skipped.
  2. The engineering tricks they don't mention in the paper. Did they normalize the input? tanh instead of ReLU? mean-pool instead of last-token? only 6 layers? clip to stop gradient saturation? warm-start? an easier dataset than yours? These are the difference between "works" and "doesn't," and they live in the code, not the abstract.
  3. Proven hyperparameters, schedule, and optimizer. Copy the values that are known to work before you tune your own. Their LR, warmup, batch size, weight decay, and optimizer choice are a working starting point you get for free.

Reference implementations are domain-specific. For RL, see rl/SKILL.md section 9 (spinning-up, stable-baselines3, cleanrl, OpenSpiel). For everything, diff against the reference rather than trusting your from-scratch version (Part 7.3).

Scope and modern pointers

Most sources here are 2017-2021, predating RLHF, large-scale pretraining, and JAX/PyTorch 2.0. The core principles (isolation testing, logging, seed variance, the loop above) are architecture-agnostic and durable; specific RL HP defaults and reward-scaling advice may need updating. For modern transformer pretraining specifically, go to Karpathy's recipe (2019; activation/gradient health checks) and nanochat deepwiki (2026; 320+ empirical HP sweeps for a GPT-2-scale run, MFU monitoring, precision management, BOS-aligned dataloaders). Evidence files: karpathy_recipe_training_nn_2019.md, nanochat_deepwiki_llm_pretraining_2026.md. Most multi-source claims below trace to quotes in docs/ml_debug_folklore.argdown (vargdown); uncovered claims are in the process log.


Part 1: General ML Debugging

What "collect clues" looks like

This is the catalog the loop's clue-collection step pulls from: the substantive checks, in rough dependency order (each assumes the one before). It's a menu to draw on, not a fresh end-to-end procedure that competes with the loop above.

Step 1: Verify components in isolation. [Goodfellow Ch11, CS229] Most bugs are "doing the wrong calculation." Test each piece independently.

  • Network forward pass: feed known inputs, check output shapes and ranges. assert shapes everywhere, since (None,) vs (None, 1) silently broadcasts into (None, None).
  • Loss computation: hand-compute a few targets and compare to code output.
  • Data pipeline: sample a batch, print it, eyeball it. Are labels aligned with inputs? Are transforms applied correctly?
  • Preprocessing: look at your processed inputs as a human. Can you solve the task from them? If you downsampled images, can you still tell what's going on?

Five most common deep learning bugs [FSDL]: (1) incorrect tensor shapes that fail silently via broadcasting, (2) preprocessing inputs incorrectly (wrong normalization, over-augmentation), (3) incorrect loss function or wrong sign in loss/gradient, (4) forgot to set up train vs eval mode (dropout/batchnorm behave differently), (5) numerical instability (NaN from log(0), overflow, vanishing grads).
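The silent-broadcast bug is easy to reproduce. A short sketch in PyTorch, with a hypothetical `mse` helper that asserts shapes before subtracting:

```python
import torch

pred = torch.zeros(4, 1)   # model output with a trailing singleton dim
target = torch.ones(4)     # labels without it

diff = pred - target       # broadcasts to (4, 4) instead of raising
print(diff.shape)


def mse(pred, target):
    # Fail loudly rather than let broadcasting pick a shape for us.
    assert pred.shape == target.shape, (
        f"shape mismatch: {tuple(pred.shape)} vs {tuple(target.shape)}"
    )
    return ((pred - target) ** 2).mean()


loss = mse(pred.squeeze(-1), target)
print(loss.item())
```

Calling `mse(pred, target)` without the `squeeze` trips the assertion, which is the point: the mismatch surfaces at the call site, not as a slightly wrong loss.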

Step 2: Get signs of life on a toy problem. Work the baseline ladder. [CS231n, FSDL, Goodfellow Ch11] Before your real task, solve something trivial with the same codebase. This establishes what "healthy" looks like. Run on CartPole (or equivalent) and log the same curves so you know what healthy learning looks like for your setup [reddit]. If it works on the toy but not your real task, the gap is usually scale/normalization, not fundamental correctness.

Also try to overfit the training set. If you can't, you are unlikely to generalise. [CS231n: "Unless you pass this sanity check with a small dataset it is not worth proceeding to the full dataset."] Start with a lightweight implementation (<200 lines of new code), no complicated data pipelines [FSDL]. Build those later once the core works.

Baseline ladder (for physics/simulation models, make each step beat the previous one):

  1. Persistence: y(t) = y(t-1). Bar for "does the model capture any dynamics at all?"
  2. Exponential decay to steady state (first-order response fit).
  3. Linear state-space / OLS on finite differences.
  4. Pure data MLP (same architecture, no physics). If PINN doesn't beat this, the physics constraint is hurting.
  5. Classical solver with fixed parameters (scipy solve_bvp, ODE, etc.).
  6. Classical solver with fitted parameters.
  7. Then and only then: PINN / learned physics.

Make complexity pay rent. Every added component (more physics, more dimensions, more losses) should improve a metric you care about. If it doesn't, remove it.
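The first rung costs a few lines. A sketch on a synthetic series, assuming one-step-ahead prediction and RMSE as the metric; the "candidate" here is a deliberately poor stand-in, not a real model.

```python
import math


def rmse(pred, true):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


# Hypothetical series: exponential approach to a steady state of 1.0.
y = [1.0 - math.exp(-0.3 * t) for t in range(30)]

# Rung 1: persistence, y_hat(t) = y(t-1).
persistence = rmse(y[:-1], y[1:])

# Stand-in for a candidate model's one-step predictions
# (a constant guess at the steady state).
candidate = rmse([1.0] * (len(y) - 1), y[1:])

print(f"persistence RMSE={persistence:.4f}  candidate RMSE={candidate:.4f}")
beats_baseline = candidate < persistence
print("beats persistence:", beats_baseline)
```

In this toy case the candidate loses to persistence, which is the signal to stop adding components and fix the model first.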

Step 3: Log everything, look for specific pathologies. [Goodfellow Ch11, Rahtz 2018, CS231n]

What to log:

  • Losses (train and val, per-component if multi-objective)
  • Gradient norms (per module if possible)
  • Learning rates
  • Parameter norms / update magnitudes
  • Update-to-data ratio per layer: ((lr * p.grad).std() / p.data.std()).log10(), target ~-3 [Karpathy nn-zero-to-hero Lec 4]
  • Activation statistics (mean, std, fraction of dead ReLUs, saturation % for tanh)
  • Data statistics (input distributions, label distributions)
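The update-to-data ratio from the list can be computed right after `backward()`. A sketch on a toy network, assuming a plain SGD step so that `lr * grad` is the actual update; under Adam the real step differs, so there you would compare parameters before and after `optimizer.step()`.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
)
lr = 1e-2  # assumed plain-SGD step size

x, y = torch.randn(64, 16), torch.randn(64, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()

ratios = {}
with torch.no_grad():
    for name, p in model.named_parameters():
        if p.ndim == 2:  # weight matrices only; biases are skipped here
            ratios[name] = ((lr * p.grad).std() / p.data.std()).log10().item()

for name, r in ratios.items():
    print(f"{name}: log10(update/data) = {r:+.2f}")
```

Values far above -3 suggest updates that are large relative to the weights; values far below suggest the layer is barely moving.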

Sanity check at init [CS231n]: verify you get the expected loss at chance performance before training starts. E.g., for 10-class softmax the initial loss should be -ln(0.1) = 2.302 with small random weights. If not, something is wrong with initialization or the loss function. Then verify that increasing regularization increases the loss.
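The chance-level value is quick to verify by hand. A dependency-free sketch, assuming all-zero logits as a stand-in for "small random weights":

```python
import math


def softmax_cross_entropy(logits, label):
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


num_classes = 10
expected = -math.log(1.0 / num_classes)

# With near-zero logits the model predicts roughly uniform probabilities.
logits = [0.0] * num_classes
observed = softmax_cross_entropy(logits, label=3)

print(f"expected {expected:.4f}, observed {observed:.4f}")
assert abs(observed - expected) < 1e-6
```

In a real run, compare the first logged training loss against `expected` with a loose tolerance; a large gap points at initialization or the loss function.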

| Symptom | Likely cause |
|---|---|
| Loss stuck from the start | LR too low, bad init, data pipeline broken, wrong loss function |
| Loss decreases then explodes | LR too high, numerical instability (log(0), div by 0), gradient accumulation bug |
| Loss NaN | log(0), 0/0, overflow. Use log(x.clamp(min=1e-8)), 1/(std + 1e-5) |
| Train loss good, val loss bad | Overfitting. More data, regularization, smaller model |
| Loss oscillates wildly | LR too high, batch size too small, data shuffling broken |
| Gradients vanish | Too-deep network without skip connections, saturating activations (tanh with large inputs), bad init |
| Gradients explode | No gradient clipping, learning rate too high, recurrent networks without gradient clipping |
| Different results per seed | Normal if small variance; suspicious if large. Check init sensitivity, batch ordering, floating point nondeterminism |
| Model outputs constant | Dead neurons, vanishing gradients, mode collapse, all-zero init |
| Physics loss low but BCs violated | Gradient imbalance: PDE residual dominates BC gradient; use adaptive loss weighting or hard BCs |
| PINN worse than pure-data MLP | Wrong equations, bad scaling (forgot to nondimensionalize), or physics constraint fighting the data |
| PINN fails on hard PDE regime, works on easy | Curriculum regularization: start with easy parameters, warm-start and increase to target |
| Scalar parameter (U, alpha) stuck at 0 or bound | Degenerate solution; bound and initialize it, or estimate separately before joint training |

Step 4: Numerical hygiene. [CS231n]

# Clamp log values
log_prob = prob.clamp(min=1e-8).log()

# Never divide by zero
ratio = x / (std + 1e-5)

# Clip gradients and LOG the pre-clip norm
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=20.0)
logger.log("grad_norm", grad_norm)

# Catch NaNs early
assert torch.isfinite(loss), f"Loss is {loss}"

# Verify custom gradients (use float64! relative error plummets from 1e-2 to 1e-8)
torch.autograd.gradcheck(my_custom_fn, inputs.double().requires_grad_(True))

Gradient clipping masks problems, so always log the pre-clip norm to see if it's constantly being triggered. [CS231n: "the ratio of the update magnitudes to the value magnitudes... should be somewhere around 1e-3."]

Gradient check thresholds [CS231n]: use relative error, not absolute. Compare the analytic and numerical gradients using the centered difference formula. Relative error > 1e-2: probably wrong. Between 1e-2 and 1e-4: uncomfortable (below 1e-4 is usually acceptable for objectives with kinks, such as ReLU). 1e-7 or below: happy. Before checking: (a) turn off regularization and check data loss alone first (regularization can mask data loss bugs), (b) disable dropout and augmentation, (c) use float64 not float32.
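For a scalar function the whole check fits in a few lines. A sketch using Python floats (which are float64) and a hand-derived gradient for a toy function; `f`, the step `h`, and the evaluation point are arbitrary choices for illustration.

```python
import math


def f(x):
    return math.tanh(x) ** 2


def analytic_grad(x):
    t = math.tanh(x)
    return 2.0 * t * (1.0 - t * t)


def numeric_grad(fn, x, h=1e-5):
    # Centered difference: truncation error shrinks as O(h^2), vs O(h) one-sided.
    return (fn(x + h) - fn(x - h)) / (2.0 * h)


def relative_error(a, b):
    return abs(a - b) / max(abs(a), abs(b), 1e-12)


x = 0.7
err = relative_error(analytic_grad(x), numeric_grad(f, x))
print(f"relative error = {err:.2e}")
```

A deliberately wrong gradient, such as dropping the `(1 - t*t)` factor, lands well above the 1e-2 threshold, so the check separates the two cases cleanly.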

Step 5: Normalization and Nondimensionalization. [Schulman 2017, CS231n, FSDL, Slavv 2017] Most ML training issues trace back to scale problems.

  • Input normalization: mean 0, std 1 per feature. Use running statistics over ALL data seen so far, not just recent data [Schulman 2017]. Using only recent data silently changes the input distribution in a way the policy doesn't know about, which can collapse performance. [Schulman slides: "Compute running estimate of mean and standard deviation, x' = clip((x-mu)/sigma, -10, 10)"]
  • Schulman: "plot histograms of all observations and rewards and make sure each component has the right mean and standard deviation and doesn't have crazy outliers."
  • Layer normalization helps stability.
  • For targets/labels: think about whether the scale is reasonable for your loss function.
  • For physics/PDE models (PINNs): nondimensionalize before training. Raw SI units (Kelvin, Joules, meters) create loss terms with wildly different magnitudes. This is the multi-scale problem that adaptive weighting tries to fix downstream. Nondimensionalizing fixes it at the source by making all PDE coefficients O(1). Recipe: pick characteristic scales (T_ref, L_ref, etc.), define dimensionless variables (T* = T/T_ref, z* = z/L), substitute into the PDE. The resulting groups (NTU, Biot, etc.) are all O(1).
  • Train/test split: use temporal split (not random) for time-series or plant data. Random splitting leaks temporal correlation and gives optimistic test RMSE. Conventional: first 75% train, last 25% test.
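The running normalizer from the first bullet can be sketched with Welford's online algorithm, which is my choice here and not something the sources prescribe. `RunningNorm` is a hypothetical scalar version; a real one would track per-feature vectors.

```python
import math


class RunningNorm:
    """Running mean/std over every value seen so far (Welford's algorithm)."""

    def __init__(self, clip=10.0, eps=1e-8):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.clip, self.eps = clip, eps

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Population std; falls back to 1.0 until there are two samples.
        return math.sqrt(self.m2 / self.n) if self.n > 1 else 1.0

    def __call__(self, x):
        z = (x - self.mean) / (self.std + self.eps)
        return max(-self.clip, min(self.clip, z))


norm = RunningNorm()
for v in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    norm.update(v)

print(norm.mean, norm.std)   # statistics over all eight values
print(norm(1000.0))          # an outlier is clipped to the +10 bound
```

Because the statistics accumulate over all data seen so far, the normalization drifts slowly and never jumps when a recent batch happens to be unusual.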

Step 6: Check your assumptions about the optimizer.

  • Adam's moment estimates can mask gradient problems. If step statistics look weird, check raw gradients separately.
  • abs_max(param_update) should be small (e.g., ~1e-3 at LR 1e-2); mean_square(param_update) should be very small, and substantially smaller than abs_max.
  • Supervised learning tricks (batch norm, dropout, big networks) often don't transfer to RL. People tried them. They usually don't help.

Assume you have a bug [Jones 2021, Goodfellow Ch11]

When their RL implementation doesn't work, people are often keen to either (a) adjust their network architecture or (b) adjust their hyperparameters. On the other hand, they're reluctant to say they've got a bug. Most often, it turns out they've got a bug. (Andy Jones)

Bugs are faster to find and fix than validating that a new architecture is an improvement. Dramatically raise your threshold for "OK, I think this is correct." Neural net components can adapt to compensate for bugs, masking them [Goodfellow Ch11: "If one part is broken, the other parts can adapt and still achieve roughly acceptable performance."]

Loss curves are a red herring [Jones 2021]

They give global information about performance but don't localize errors. Don't debug by staring at loss curves. Use them after you've exhausted better methods. Their main value: splitting performance into "how fast it learns" vs "where it plateaus." [Jones: "The shape of your loss curve says very little about where in your code you've messed up."]

Pursue anomalies [Jones 2021, Rahtz 2018]

If you ever see a plot or a behaviour that just seems weird, chase right after it. Do not just 'hope it goes away'. (Andy Jones)

That cool new feature you were going to add today? It won't magically fix the anomaly. Give up on your plan and chase the anomaly instead. Rahtz independently calls this "noticing confusion": following confusion about a frame-differencing improvement led to finding a normalization bug that had hidden for months.

With long feedback loops, think more, experiment less [Rahtz 2018]

Switching from experimenting a lot and thinking a little to experimenting a little and thinking a lot was a key turnaround in productivity. (Rahtz)

When runs take hours, pour time into hypothesis-forming before launching. Spend 30-60 minutes mapping out possibilities, ranking them by likelihood given all evidence so far. Reserve experiments for distinguishing between your top hypotheses.

Keep a structured work log for long debugging sessions:

  1. What specific output am I working on right now?
  2. Thinking out loud: hypotheses about the current problem
  3. Record of currently running experiments with what each one is supposed to answer
  4. Results of runs (graphs, observations), separated by type

Part 2: RL-Specific Debugging

See rl/SKILL.md for the full RL debugging sub-skill: probe environments, reward engineering, diagnostics tables, hyperparameter defaults, and reference implementations.


Sources

Evidence map: docs/ml_debug_folklore.argdown traces each claim to verbatim quotes across 21 evidence files in docs/evidence/. Process log at docs/ml_debug_folklore_log.md.

Debugging deep networks (general)

Tools


Part 3: Loss Surface & Gradient Analysis (No Model Required)

When a loss isn't behaving as expected, don't guess. Visualize the loss surface and check gradient flow directly. This technique feeds synthetic tensors into loss sub-components, with no model, forward pass, or GPU needed, just the math.

The method

  1. Identify each loss sub-component as a function of its immediate inputs.
  2. Pick 1-2 axes that matter (the "natural axes" you think about when reasoning about the loss).
  3. Grid over those axes, feed through the loss, call .backward(), collect gradients.
  4. Plot: contour heatmap + quiver overlay (negative gradient = optimization direction).
  5. Build a summary table: component x representative_input -> loss_value, grad_value. Flag zero or non-finite gradients.

Pseudocode

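# ── 2D loss surface with gradient quiver ──────
import matplotlib.pyplot as plt
import torch

def analyze_component(loss_fn, x_range, y_range, n=80, stride=5):
    xs = torch.linspace(*x_range, n)
    ys = torch.linspace(*y_range, n)
    X, Y = torch.meshgrid(xs, ys, indexing='ij')
    x_flat = X.flatten().requires_grad_(True)
    y_flat = Y.flatten().requires_grad_(True)

    losses = loss_fn(x_flat, y_flat)       # vectorized, returns (n*n,)
    losses.sum().backward()                # points are independent, so .sum() yields per-point grads

    loss_grid = losses.detach().reshape(n, n)
    gx = x_flat.grad.reshape(n, n)
    gy = y_flat.grad.reshape(n, n)

    # contour heatmap + quiver overlay; negative gradient = direction optimizer moves
    fig, ax = plt.subplots()
    ax.contourf(X.numpy(), Y.numpy(), loss_grid.numpy(), levels=30)
    s = slice(None, None, stride)          # subsample arrows so the plot stays readable
    ax.quiver(X[s, s].numpy(), Y[s, s].numpy(), -gx[s, s].numpy(), -gy[s, s].numpy())
    return loss_grid, gx, gy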

# ── Gradient flow verification table ──────────
#
# For each component, evaluate at representative inputs
# (zero, small, converged, degenerate). Report loss + grad.
# Flag: zero grad (dead zone), non-finite (numerical issue).
#
# | Component      | Param    | Input        | Loss     | Grad     |
# |----------------|----------|--------------|----------|----------|
# | barrier_penalty | v       | v=0.0        | +0.000   | +0.000   |  <-- zero grad!
# | barrier_penalty | v       | v=0.5        | +12.50   | +50.00   |
# | pair_loss       | dot_pos | (0.3, -0.3)  | -2.340   | -3.000   |
# | pair_loss       | dot_neg | (0.3, -0.3)  | -2.340   | +3.000   |  <-- antisym, good
# | pair_loss       | dot_pos | (0.0, 0.0)   | +0.000   | +0.000   |  <-- dead at init!

What to look for

| Pattern | Meaning | Action |
|---|---|---|
| Gradient arrows point toward desired region | Loss is well-shaped | Ship it |
| Large flat region (zero gradient) | Dead zone: optimizer stuck if it lands here | Add curvature, change init, or use different parameterization |
| Gradient magnitude 1000x in one axis vs another | Imbalanced: one axis dominates | Rescale, use log-space, or normalize |
| Saddle point at origin | Common with product-form losses (A*B) | Switch to additive (log A + log B) for independent gradients |
| Arrows point away from desired region | Loss is wrong or has unexpected local min | Rethink the formula |
| Non-finite values in a region | Numerical issue (log(0), 0/0) | Add eps, clamp, or use log1p |

The log-space decomposition trick

When your loss involves a product of factors A*B and one factor can be near zero:

# BAD: symlog(A * B), when B~0 the chain rule gives 0 grad to A too
# GOOD: sign * (log|A| + log|B|) gives independent gradients
#   d/dA = 1/A  regardless of B
#   d/dB = 1/B  regardless of A

General principle: if you want gradient to flow independently through two factors, turn the product into a sum in log space (log|A| + log|B|).
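The effect shows up with a finite-difference probe. A dependency-free sketch, assuming `symlog(x) = sign(x) * log(1 + |x|)` (the text does not define it) and treating the sign factor as a constant:

```python
import math


def symlog(x):
    # Assumed definition: sign(x) * log(1 + |x|).
    return math.copysign(math.log1p(abs(x)), x)


def product_loss(a, b):
    return symlog(a * b)


def logsum_loss(a, b):
    # Assumes a and b are nonzero; the sign is treated as a constant.
    sign = math.copysign(1.0, a * b)
    return sign * (math.log(abs(a)) + math.log(abs(b)))


def d_da(fn, a, b, h=1e-6):
    # Centered finite difference with respect to the first factor.
    return (fn(a + h, b) - fn(a - h, b)) / (2.0 * h)


a, b = 2.0, 1e-4   # B is near zero
g_product = d_da(product_loss, a, b)
g_logsum = d_da(logsum_loss, a, b)
print(f"d/dA symlog(A*B)       = {g_product:.2e}")
print(f"d/dA (log|A| + log|B|) = {g_logsum:.2e}")
```

With B near zero the product form passes almost no gradient to A, while the log-sum form gives about 1/A = 0.5 whatever B is.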

Structural ceiling analysis

Sometimes a metric is stuck not because the optimizer fails but because the parameterization can't express a higher value. To diagnose:

# 1. Check: is d(loss)/d(metric) large? If yes, optimizer IS trying.
metric = torch.tensor(0.5, requires_grad=True)
loss = loss_fn(metric)
loss.backward()
print(metric.grad)   # if large (e.g. 350x the other gradients), it's trying

# 2. Check: can the parameter CHANGE the metric?
# Trace the chain: loss -> metric -> intermediate -> parameter
# If d(metric)/d(parameter) ~ 0, the param is structurally unable to move it.
# Example: V-rotation can't change output basis (U is fixed) so r_sub is capped.

# 3. Confirm empirically: set the exponent to 0 (disable the term).
# If metric reaches the SAME value, it's purely structural (not learned).

When to use this

  • New loss function: always visualize before training. 5 minutes of plotting saves hours of puzzling over curves.
  • Metric stuck at a value: distinguish "optimizer can't" from "parameterization can't" from "competing losses cancel out."
  • After changing loss formula: verify gradient flow didn't break, especially at the operating point (not just at init).
  • Comparing loss variants: grid the same axes for both, compare arrow fields side by side.

Part 4: Experiment Sweeps & Statistical Analysis

Principled hyperparameter sweeps with same-seed comparisons, within-group z-scores, and t-stat stability. This is the difference between "I tried it and it seemed better" and "I have evidence it's reliably better."

Sweep design (justfile pattern)

Each sweep is a justfile recipe. Key conventions:

set shell := ["bash", "-c"]
SEEDS_4 := "2024 4096 8192 9000"
BASE := "uv run python train.py gemma1b"

# Q: Does rotation type matter? block vs full vs givens.
# Hypothesis: block should balance expressiveness vs cost.
# 12 runs, ~3 hours
sweep-rotation-type:
    #!/usr/bin/env bash
    set -x
    export WANDB_RUN_GROUP="sweep-rotation-type-$(date +%Y%m%d-%H%M)"
    for seed in {{ SEEDS_4 }}; do
        {{ BASE }} --seed=$seed --svd_rotation_type=block
        {{ BASE }} --seed=$seed --svd_rotation_type=full
        {{ BASE }} --seed=$seed --svd_rotation_type=givens
    done

Rules:

  • One WANDB_RUN_GROUP per sweep, timestamped.
  • Same seeds across all values within a sweep (enables paired comparison).
  • Vary ONE parameter per sweep when possible (all-else-equal). If you must vary two, the analysis script warns about confounders.
  • Comment the recipe with: question, hypothesis, run count, time estimate.
  • Queue sweeps in a queue recipe in priority order.

Logging to wandb

Every run logs to wandb with: group name, seed, all config as hyperparams, final eval metric (SI = TPR - FPR).

Cache locally as parquet to avoid slow API calls on every analysis:

# download_wandb.py pattern:
#   1. Load cached parquet (if exists)
#   2. Find latest cached run date, subtract safety margin (1 day)
#   3. Fetch only runs newer than that
#   4. Merge (diagonal concat, dedup on run_id)
#   5. Save back to parquet + TSV
#
# Also downloads output.log per run for post-hoc log diagnosis.

Analysis: within-group z-scores -> t-stat

The core insight: don't compare raw SI across groups (different base configs, different dates). Compare within each group, then aggregate stability across seeds.

# analyze_results.py pseudocode:

for group in groups:
    for seed in seeds_in_group:
        # 1. Collect all SI values for this (group, seed) combo
        si_values = {param_value: SI for runs matching (group, seed, param)}

        # 2. Compute within-(group,seed) z-score
        mu = mean(si_values)
        sigma = std(si_values)
        z[value] = (si[value] - mu) / sigma
        # This normalizes out seed-level baseline differences

    # 3. Aggregate z-scores across seeds for each param value
    for value in param_values:
        mean_z = mean(z[value] across seeds)
        std_z  = std(z[value] across seeds)
        t_stat = mean_z / (std_z / sqrt(n_seeds))
        # t_stat >> 2: reliably better across seeds
        # t_stat ~ 0: no consistent effect
        # t_stat << -2: reliably worse

    # 4. Also compute linear trend (Pearson r) for numeric params
    #    r > 0: more is better. t_stat on r tests reliability.
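Steps 1 to 3 of the pseudocode, made runnable for a single group. This is a sketch on made-up SI values, assuming the sample standard deviation (ddof = 1) for both `std` calls, which the pseudocode leaves unspecified; it is not the actual `analyze_results.py`.

```python
import math
from statistics import mean, stdev

# Hypothetical SI values for one sweep group: si[seed][param_value].
si = {
    2024: {"block": 12.0, "full": 9.0, "givens": 6.0},
    4096: {"block": 22.0, "full": 20.0, "givens": 15.0},
    8192: {"block": 5.0, "full": 3.0, "givens": 1.0},
    9000: {"block": 31.0, "full": 27.0, "givens": 26.0},
}

# 1-2. z-score each value within its own seed to remove the seed's baseline.
z = {}
for seed, by_value in si.items():
    mu, sigma = mean(by_value.values()), stdev(by_value.values())
    for value, s in by_value.items():
        z.setdefault(value, []).append((s - mu) / sigma)

# 3. One-sample t-statistic of the z-scores across seeds.
t_stat = {}
for value, zs in z.items():
    t_stat[value] = mean(zs) / (stdev(zs) / math.sqrt(len(zs)))

for value, t in t_stat.items():
    print(f"{value:<7} mean_z={mean(z[value]):+.2f}  t={t:+.2f}")
```

Raw SI differs by a factor of six between seeds here, yet the within-seed ordering is stable, so `block` gets a large positive t and `givens` a large negative one while `full` sits near zero.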

Interpreting results

| Metric | What it tells you |
|---|---|
| SI_mean | Raw effect size (higher = better behavioral control) |
| si_q10, si_q90 | Spread. Wide = seed-sensitive. |
| t_stat | Cross-seed reliability. \|t\| > 2 with 4+ seeds is worth attention; as a reference point, the two-sided 5% critical value at 4 seeds (df = 3) is about 3.18. |
| linear r | Monotonic trend. r near +1/-1 with significant t_stat = dose-response. |
| "Also varies" warning | Confounders. Can't attribute effect to this param alone. |

What you're looking for: high SI_mean and strong t_stat (reliable). A value with SI_mean=20 but t_stat=0.5 is a lucky seed. A value with SI_mean=10 but t_stat=4.0 is a real (if modest) effect.

Common pitfalls

  • Stale cache: always download_wandb.py before analyzing. Stale cache hides new groups.
  • Cross-group comparisons: different groups have different base configs. "Group A's best value vs Group B's best value" is apples-to-oranges. Compare within groups.
  • n_seeds=1: t_stat is NaN. You have one data point. Replicate before concluding.
  • Too many params varied: if a sweep varies 3 params simultaneously, effects are confounded. Split into separate sweeps.
  • Interpreting NaN SI: usually means eval crashed or the model diverged. Investigate the run log, don't just skip it.
  • "Fill" sweeps: if a sweep is 13/16 runs done (missing a seed), run the missing seed in a separate group with a clear name (e.g. sweep-coh-tau-fill). The analysis script treats it as a separate group, so you merge mentally.

The full workflow

# 1. Design sweep: write justfile recipe with hypothesis
# 2. Run it
just sweep-rotation-type
# 3. Wait for completion, then:
uv run python scripts/download_wandb.py
uv run python scripts/analyze_results.py --after $(date +%Y-%m-%d)
# 4. Read the output:
#    - Group Summary table: SI_mean, n_seeds per group
#    - Param tables: per-value SI with t_stat
#    - Linear trends: dose-response for numeric params
# 5. Record findings in research journal
# 6. Update default config if result is clear and reliable

Part 5: Diagnosing "Why Won't This Metric Move?"

A structured decision tree for when a metric is stuck. Applies to any training scenario where a quantity you're optimizing plateaus.

Step 1: Is the gradient nonzero at the metric level?

metric_val = torch.tensor(current_value, requires_grad=True)
loss = loss_fn(metric_val)
loss.backward()
print(f"d(loss)/d(metric) = {metric_val.grad}")
  • If ~0: the loss doesn't care about this metric at the current operating point. Likely saturated (log1p of huge value), in a dead zone, or the metric is disconnected from the loss.
  • If large: the loss IS trying to move it. Problem is downstream.

Step 2: Can the parameter change the metric?

Trace the chain rule: loss -> metric -> ... -> parameter. The metric is a function of intermediate quantities, which are functions of learned parameters. Check d(metric)/d(parameter):

  • Analytically: is there a structural reason this derivative is ~0? (e.g., a rotation of V can't change span(U))
  • Empirically: disable the loss term entirely (set coefficient to 0). Does the metric reach the same value? If yes, it's structural, and the optimization never moved it in the first place.

Step 3: Is something else fighting it?

If gradient is nonzero and the parameter CAN change the metric:

  • Check competing loss terms: compute gradient contribution from each loss component separately. If two terms have opposite-sign gradients on the same parameter, they cancel.
  • Check optimizer state: AdamW momentum from earlier training may resist direction changes. Try resetting optimizer state or using a warmup schedule.
  • Check conditioning: if the metric requires coordinated changes across many parameters (e.g., rotating multiple layers simultaneously), the gradient per-parameter may be too small even though the aggregate signal is large.
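
The competing-terms check can be sketched as follows, assuming PyTorch. The two loss terms and the shared parameter `w` are made up for illustration; they are built to oppose each other so the cancellation is visible. A cosine near -1 together with a small net-to-component norm ratio suggests the terms are cancelling.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
w = torch.randn(8, requires_grad=True)  # hypothetical shared parameter

# Two hypothetical loss terms that pull w in opposite directions.
loss_a = (w ** 2).sum()          # pushes w toward 0
loss_b = -0.9 * (w ** 2).sum()   # pushes |w| to grow

# Gradient contribution of each term, computed separately.
(grad_a,) = torch.autograd.grad(loss_a, w)
(grad_b,) = torch.autograd.grad(loss_b, w)

cos = F.cosine_similarity(grad_a, grad_b, dim=0).item()
net_ratio = ((grad_a + grad_b).norm() / grad_a.norm()).item()
print(f"cosine = {cos:.3f}, |net| / |grad_a| = {net_ratio:.3f}")
```

In a real model the terms usually share one forward pass, so the first `torch.autograd.grad` call would need `retain_graph=True` to keep the graph alive for the second.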

Decision table

| d(loss)/d(metric) | d(metric)/d(param) | Same value without loss term? | Diagnosis |
|---|---|---|---|
| ~0 | any | any | Loss saturated or disconnected. Change loss formula. |
| large | ~0 | yes | Structural ceiling. Change parameterization. |
| large | large | no | Competing losses or optimizer inertia. Isolate. |
| large | large | yes | The term helps but converges to same basin. Coincidence or weak effect. |

Note this is just a guide and in no way authoritative; it might not apply to your project.


Part 6: LLM Debugging Playbook

Concrete procedures for an LLM agent debugging ML code. Work top-to-bottom: static analysis first, then diagnostics, then the decision tree. Don't skip to hyperparameter suggestions.

6.1 Static analysis: grep for silent bugs

See refs/static_analysis.md for the full list of grep patterns. Categories: shape mismatches, autograd breakers, train/eval mode, in-place ops, double softmax, optimizer step ordering, broadcasting traps, wrong loss sign, frozen params, data leakage, class imbalance.

6.2 Diagnostic code snippets

See refs/diagnostics.md for copy-paste snippets. Includes: data pipeline sanity check, init loss check (with expected values per loss type), overfit-one-batch test, gradient flow check, NaN/Inf hooks, random input test, prime dimension trick, class imbalance check, confidence-sorted errors, weight/bias distributions.
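
As an illustration of the init loss check (not the snippet from refs/diagnostics.md, which is not reproduced here): for cross-entropy with balanced classes, an untrained model should predict roughly uniformly, so the expected initial loss is ln(num_classes) in nats. The helper names and the 20% tolerance below are assumptions chosen for the sketch.

```python
import math


def expected_init_cross_entropy(num_classes: int) -> float:
    """Cross-entropy (in nats) of a uniform prediction over num_classes classes."""
    return math.log(num_classes)


def init_loss_looks_sane(observed: float, num_classes: int, rel_tol: float = 0.2) -> bool:
    """True if the observed init loss is within rel_tol of the uniform-prediction value.

    rel_tol=0.2 is an arbitrary illustrative threshold, not a recommendation.
    """
    expected = expected_init_cross_entropy(num_classes)
    return abs(observed - expected) <= rel_tol * expected


print(f"10 classes: expect ~{expected_init_cross_entropy(10):.2f}")
print(init_loss_looks_sane(2.31, 10), init_loss_looks_sane(0.01, 10))
```

The 10-class value is the 2.3 used elsewhere in this guide; an observed 0.01 fails the check and points toward leakage rather than a lucky init.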

6.3 Triage decision tree

Walk the list top-to-bottom and stop at the first question you answer "yes".

  1. Exception or traceback? Read it, fix the error, done.
  2. Loss is NaN/Inf? Attach NaN hooks (6.2), find the first module producing NaN. Common causes: log(0), 0/0, exp(large); add clamp/eps.
  3. Init loss wrong? (expected values in 6.2)
    • Check the data pipeline (6.2) and the loss function.
    • Check for double softmax (6.1).
    • Check labels match the model output format.
    • Random input test (6.2): same loss? -> data destroyed.
    • Init loss << expected? -> data leakage (Part 7.4).
  4. Can't overfit 1 batch? Run the gradient flow check (6.2).
    • Any None grads? -> disconnected layer.
    • All-zero grads? -> dead layer / detach.
    • Check for autograd breakers (6.1) and optimizer step ordering (6.1).
  5. Loss stuck from step 0 (but CAN overfit 1 batch)?
    • LR too low? Try 10x.
    • Frozen params? Check requires_grad (6.1).
    • Wrong loss function?
  6. Loss decreases then explodes?
    • LR too high? Try 0.1x.
    • Gradient clipping? Log the pre-clip norm.
    • Numerical instability? (log, exp, div)
  7. Train loss good, val loss bad? Overfitting, not a bug. More data, regularization, smaller model.
  8. Train loss okay but metric bad? Loss-metric misalignment. Is minimizing the loss equivalent to improving the metric? (Part 5)
  9. Model outputs constant? Mode collapse. Check:
    • Class imbalance? Run the label count (6.2).
    • All-zero init? Run the weight check (6.2).
    • Dead ReLUs? (try LeakyReLU)
    • Confidence-sorted errors (6.2) reveal a pattern?
  10. Training slow but not stuck? Not a bug. Consider batch size (Part 1 Step 6), architecture depth/width, data quality, and so on.
  11. None of the above? Read Part 1 (general) or Part 2 (RL-specific) for deeper diagnostics. Log everything (Part 1 Step 3) and pursue anomalies.

Again this is just a guide and in no way authoritative; it might not apply to your project.
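
A rough sketch of the gradient flow check used in step 4, assuming PyTorch. The `Toy` model and `grad_report` helper are hypothetical; the model has a deliberately unused layer so the "None grads -> disconnected layer" case shows up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)  # never called in forward, so its grads stay None
        self.head = nn.Linear(4, 1)

    def forward(self, x):
        return self.head(torch.relu(self.used(x)))


def grad_report(model: nn.Module) -> dict:
    """Label each parameter 'none', 'zero', or 'ok' after a backward pass."""
    report = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            report[name] = "none"   # disconnected from the loss
        elif p.grad.abs().max().item() == 0.0:
            report[name] = "zero"   # connected, but no signal (dead layer)
        else:
            report[name] = "ok"
    return report


model = Toy()
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
report = grad_report(model)
print(report)
```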

6.4 LLM anti-patterns

These are the overconfident reflexes the "calibrate" section warns about, made concrete. Every one of them changes behaviour before localizing the bug, so each is a guess wearing a fix's clothes. Some people say "this is sklearn slop", or "the LLM is acting like it's tweaking hyperparameters in a hackathon, not understanding the problem".

  • Hyperparameter changes before verifying correctness. "Try reducing the learning rate" is the #1 wrong response to any training problem. Verify the code is correct first (Parts 1-2); HP tuning on buggy code wastes time.
  • try/except around training code. Training should crash loudly. A caught exception hides the bug and produces silently wrong results. The one exception is checkpoint saving on KeyboardInterrupt.
  • "Try a different optimizer." If Adam doesn't converge, the cause is almost never the optimizer choice. It's usually the loss, the data, the architecture, or a bug.
  • .detach() or .item() to "fix" gradient errors. If autograd complains, the computation graph is wrong. Detaching silences the error by cutting gradient flow, so the model just stops learning from that path. Understand why autograd is complaining first.
  • lr_scheduler as a cure for non-convergence. Schedulers refine convergence, they don't cause it. If the model won't learn at constant LR, a schedule won't save it.
  • More layers or a bigger model. If it can't overfit one batch, more parameters won't help. The problem is gradient flow, loss, or data.
  • "Normalize your data" without checking whether it already is. Run the data sanity check (6.2) first; if it's already mean 0, std 1, normalization isn't your problem.
  • float() or .to(dtype) to suppress type warnings. Type mismatches are signals. A float32/float64 mismatch might mean you're mixing model weights with double-precision data. Fix the root cause.
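
To make the `.detach()` anti-pattern concrete, here is a small sketch assuming PyTorch, with made-up scalars `w1`, `w2`, and `x`. The backward pass runs without complaint, but the detached branch contributes no gradient, so `w1` would never be updated.

```python
import torch

w1 = torch.tensor(2.0, requires_grad=True)
w2 = torch.tensor(5.0, requires_grad=True)
x = torch.tensor(3.0)

branch_1 = (w1 * x).detach()  # the "fix": value kept (6.0), graph cut
branch_2 = w2 * x             # still attached to the graph (15.0)

loss = (branch_1 + branch_2) ** 2  # (6 + 15)^2 = 441
loss.backward()                    # no error is raised

# d(loss)/d(w2) = 2 * 21 * 3 = 126; w1 received nothing at all.
print("w1.grad:", w1.grad)
print("w2.grad:", w2.grad)
```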

Part 7: Debugging Folklore & Mental Models

Part 6 tells you what to DO. This part tells you how to THINK. Use these frameworks when generating hypotheses, brainstorming causes, or deciding what to investigate next.

7.1 Five mental models for ML debugging

Pick the model that fits your situation. Each gives a different angle on the same problem.

1. Information flow: trace forward, trace backward. Data flows forward through the model; gradients flow backward. A bug anywhere in either direction corrupts everything downstream. When stuck: manually trace shapes and values forward from input through each layer. Then trace gradients backward from loss through each parameter. The break-point is where values go wrong.

  • Forward: input -> preprocess -> embed -> layers -> head -> loss
  • Backward: loss -> d(loss)/d(head) -> d(head)/d(layers) -> ... -> d(layer1)/d(params)
  • Tool: gradient flow check (6.2), NaN hooks (6.2)
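
One way to do the forward trace, sketched with PyTorch forward hooks on a hypothetical three-layer model. Each hook records the output shape and simple statistics, so the first layer where values look wrong (all zeros, huge std, unexpected shape) stands out.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))  # hypothetical model
trace = []


def make_hook(name):
    def hook(module, inputs, output):
        # Record (layer name, output shape, mean, std) for every forward call.
        trace.append((name, tuple(output.shape), output.mean().item(), output.std().item()))
    return hook


# named_modules() also yields the root module under the empty name; skip it.
handles = [
    m.register_forward_hook(make_hook(n)) for n, m in model.named_modules() if n
]

model(torch.randn(4, 8))
for h in handles:
    h.remove()  # always remove hooks after the diagnostic pass

for name, shape, mean, std in trace:
    print(f"{name}: shape={shape} mean={mean:+.3f} std={std:.3f}")
```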

2. Ablation: remove things until it works. [CS229] Systematically remove components (regularization, augmentation, auxiliary losses, fancy layers). If removing X fixes the problem, X is the problem. If nothing helps, the bug is in the core (data or main loss).

  • Start: turn off ALL regularization, augmentation, dropout, scheduling
  • If it works now: add back one-at-a-time until it breaks
  • If still broken: problem is in data pipeline, loss, or base architecture
  • Tool: just comment things out and rerun overfit-one-batch (6.2)

3. Oracle substitution: replace each component with ground truth. [CS229] For pipeline systems (data -> features -> model -> postprocess -> metric), replace one component at a time with a perfect/oracle version. The component whose oracle gives the biggest accuracy jump is the bottleneck.

  • Example: replace learned features with hand-crafted features. Big jump? -> feature learning is the problem.
  • Example: replace model predictions with ground truth labels. Small jump? -> model is fine, problem is upstream (data) or downstream (metric).
  • This is especially useful for multi-stage systems (NLP pipelines, detection + classification, etc.)

4. Bias-variance via learning curves. [CS229, FSDL] Plot train error and val error as a function of dataset size (or training steps). The shape tells you what to do:

  • Both high (converging together): high bias. Model too simple, wrong features, or bug reducing capacity.
  • Train low, val high (diverging): high variance. Overfitting. More data, regularization, smaller model.
  • Both low: working. Ship it.
  • Train low, val high, but val improves with more data: getting there, need more data.
  • Val error flat even with 10x more data: not a data problem. Fix the model.

5. Structural ceiling: can the parameterization express what you want? (Part 5 expands this) Sometimes the metric is stuck not because the optimizer fails but because the architecture/parameterization literally cannot represent the desired function. Check: disable the loss term entirely. Does the metric reach the same value? If yes, the loss never moved it, and the model can't express higher values.

7.2 Practitioner priors: what's usually wrong

When you have no other information, investigate in this order. Rough estimates synthesized from [Goodfellow, FSDL, Slavv, Jones, CS231n], not measured frequencies, just practitioner consensus on what's usually wrong:

  1. Data pipeline (~40% of bugs). Wrong preprocessing, labels misaligned with inputs, normalization missing or wrong, train/test leakage, data loader returning stale/wrong batches. "It's almost always the data." [FSDL, Slavv]
  2. Loss function (~20%). Wrong loss for the task, wrong sign, double softmax, loss not connected to metric, competing losses canceling gradients.
  3. Training procedure (~15%). Wrong optimizer step order, missing zero_grad, wrong LR, frozen params, in-place ops breaking autograd.
  4. Architecture (~10%). Too small (can't express), too deep (vanishing grads), wrong activation, missing skip connections.
  5. Hyperparameters (~5%). LR, batch size, weight decay. Almost never the real problem if the code is buggy.
  6. Numerical issues (~5%). NaN, overflow, underflow. Usually a symptom of something else.
  7. Environment/infrastructure (~5%). Wrong library version, GPU memory, nondeterminism, stale cache.

For RL specifically, add:

  • Reward scale/sign as a top-3 issue [Henderson, Schulman]. Rescaling from [-1,1] to [0,1] or vice versa can be the entire difference.
  • Episode boundary handling (done signals, reward discounting across resets) [Jones].
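
A minimal sketch of the episode-boundary point, in plain Python. It assumes the convention that `dones[t]` is True when step `t` is the last step of its episode; other codebases mark the first step of the next episode instead, so check which one yours uses. The function name is hypothetical.

```python
def discounted_returns(rewards, dones, gamma=0.99):
    """Discounted return per step, resetting the running sum at episode ends."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if dones[t]:
            running = 0.0  # nothing after step t belongs to this episode
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [1.0, 1.0, 1.0, 1.0]
dones = [False, True, False, True]  # two episodes of two steps each

with_reset = discounted_returns(rewards, dones, gamma=0.5)
# Ignoring boundaries (the bug): reward leaks backward across the reset.
without_reset = discounted_returns(rewards, [False] * 4, gamma=0.5)
print(with_reset)     # [1.5, 1.0, 1.5, 1.0]
print(without_reset)  # [1.875, 1.75, 1.5, 1.0]
```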

7.3 The debugging mindset

Core attitudes live in the top-of-skill debugging loop (calibrate, hold several hypotheses, read a working implementation when stuck) and in Part 1 ("Assume you have a bug," "Pursue anomalies," "Loss curves are a red herring"). Here are the additional mental habits not covered there:

Murphyjitsu pre-flight [Rahtz 2018]: before starting a run, ask "if this run fails, what's the most likely cause?" If you can name it, test for it first. It's the rationalist habit of "pre-hindsight": imagining the failure and working backward. This is the same move as naming the likely and subtle entries of the protocol's failure-mode triplet, applied before launch instead of after a crash.

"Tricks substitute for each other" [Schulman 2017]: many normalization and regularization tricks do roughly the same thing, so stacking them adds complexity without proportional benefit. If you have three normalization schemes and the model still doesn't work, the problem isn't normalization.

(Other attitudes live elsewhere: "think more, experiment less" with the work-log structure is in Part 1; diffing against a working implementation is the top-of-skill "When stuck, read a working implementation" section.)

7.4 When to suspect the data

Specific signal patterns that point to data problems.

| Signal | Diagnosis | Action |
|---|---|---|
| Init loss << expected (e.g., 0.01 instead of 2.3) | Data leakage or shortcut. Model "knows" the answer at init. | Check: are labels in the input? Is test data in train? Is there a trivial feature? |
| Random input gives same loss as real input (6.2) | Data pipeline destroying information. Preprocessing too aggressive, wrong transforms, input all zeros. | Print raw data at each pipeline stage. Visualize. |
| Model predicts same class for everything | Class imbalance. 100:1 ratio = model learns "always predict majority." | Run class balance check (6.2). Use weighted loss or resample. |
| Val loss much worse than expected but train is fine | Distribution shift. Val set from different distribution than train. | Check: same preprocessing? Same time period? Same source? Use dual val sets [FSDL]. |
| Learning curve flat even with 10x more data | NOT a data problem. High bias. Model too simple or wrong features. | Add capacity, fix features, check for bugs reducing effective capacity. |
| Adding data makes val WORSE | Data quality issue. New data is noisier or from wrong distribution. | Inspect recent additions. Check label quality. |
| Model works on reference dataset (MNIST/CIFAR) but not yours | Your data is the problem, not the model. | Simplify your data (fewer classes, clean labels, easy examples only). Scale up gradually. [Slavv] |

7.5 Batch size & learning rate folklore

These interact in non-obvious ways. Get them wrong and training looks broken even with correct code.

Critical batch size [McCandlish 2018]: there's a batch size B_crit below which doubling batch size ~halves the number of training steps at about the same total compute (compute-efficient), and above which a larger batch barely reduces the step count (it just wastes compute). B_crit depends on the task and increases during training as the loss decreases.

LR must scale with batch size. [McCandlish 2018, Goyal et al. 2017]

  • Linear scaling rule (SGD): if you double batch size, double LR. [Goyal et al. 2017]
  • For Adam: the scaling exponent is between 0.5 and 1 (between sqrt and linear), task-dependent. [McCandlish 2018]
  • Changing batch size without adjusting LR is a common silent mistake.
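
The scaling rules above reduce to one formula, sketched here in plain Python. The function name and the example numbers are illustrative; `exponent=1.0` is the linear rule for SGD, and values between 0.5 and 1.0 cover the Adam range mentioned above.

```python
def scale_lr(base_lr: float, base_batch: int, new_batch: int, exponent: float = 1.0) -> float:
    """Rescale a learning rate for a new batch size.

    exponent=1.0 is the linear scaling rule; 0.5 is square-root scaling.
    """
    return base_lr * (new_batch / base_batch) ** exponent


sgd_lr = scale_lr(0.1, base_batch=256, new_batch=512)                  # linear: 0.2
sqrt_lr = scale_lr(0.1, base_batch=256, new_batch=512, exponent=0.5)   # sqrt: ~0.141
print(f"linear: {sgd_lr:.3f}, sqrt: {sqrt_lr:.3f}")
```

Treat the result as a starting point for a short LR sweep rather than a guaranteed setting, since the right exponent is task-dependent.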

Adam starting LR = 3e-4. [FSDL, Karpathy] This is the folklore "just works" starting point, not the library default (torch.optim.Adam defaults to lr=1e-3). If you're using Adam and haven't tuned LR, start here. Karpathy: "3e-4 is the best learning rate for Adam."

Big batches need warmup. [Goyal et al. 2017] Large batch training with high LR diverges at the start. Warm up LR linearly over the first few hundred steps. Without warmup, you'll see loss spike/NaN in the first epoch and think the code is broken.
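
A minimal sketch of linear warmup using PyTorch's `LambdaLR`. The 5-step warmup, the dummy parameter, and the peak LR of 0.1 are placeholders to keep the printout short; the text above suggests a few hundred steps in practice.

```python
import torch

warmup_steps = 5  # tiny for illustration only
param = torch.nn.Parameter(torch.zeros(1))  # dummy parameter standing in for a model
optimizer = torch.optim.SGD([param], lr=0.1)  # lr here is the post-warmup peak

# LambdaLR multiplies the base LR by the returned factor at each step.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)

lrs = []
for _ in range(8):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()   # optimizer first, then scheduler
    scheduler.step()

print([round(lr, 3) for lr in lrs])
```

The LR ramps linearly to the peak and then stays there; a decay schedule can be chained after the warmup if needed.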

Batch size signals:

| Symptom | Likely cause |
|---|---|
| Training very noisy, loss oscillates | Batch too small. Gradient noise overwhelms signal. Try 4-8x larger. |
| Training smooth but slow, poor generalization | Batch too large without LR scaling. Try higher LR or smaller batch. |
| Loss spikes at start then recovers | Normal with large batch + warmup. If no warmup: add it. |
| Different results at different batch sizes (same total steps) | Missing LR scaling. Adjust LR proportionally. |

Further reading

Local evidence files (verbatim quotes behind the claims above) live in docs/evidence/. The most useful starting points:

General neural-net debugging:

Modern LLM / large-batch training:

RL-specific:

Plus several practitioner reddit threads and a few more author notes in the same directory.