mirror of https://github.com/wassname/ml-debug.git synced 2026-06-27 19:47:55 +08:00

wassname 38ec634ff3 restructure: folklore-first, quote-verified, with wassname intro

Reorder around what's durable, per wassname's curation:
- human-written intro up top; rename to "wassname's ML Debugging Folklore"
- mindset first: calibrate -> mental models -> Part 1 general tricks (kept,
  they're well-based) -> read a working implementation when stuck
- a Folklore section built from verbatim, source-checked quotes (Jones,
  Rahtz, Karpathy, Schulman, Henderson, Irpan, CS231n, Slavv, Goodfellow),
  each footnoted to the canonical URL + the cached copy with line numbers
- LLM-agent babysitting (debugging loop, triage menu, anti-patterns) moved to
  the bottom where it belongs; triage reframed as a menu, not a flowchart
- deeper one-off tricks split to refs/ (loss_surface, metric_stuck, sweeps),
  scrubbed of private tooling (wandb/just/SI/personal scripts)

Quote integrity: every quote independently verified by fresh-eyes subagents
against the cached sources; fixed a reformatted Schulman slide, a truncated
Jones sentence, a reversed-order Rahtz stitch, a falsely-quoted Slavv phrase,
and the 3e-4 line (now the real tweet, framed as the joke Karpathy confirmed
it was, not gospel). lr_scheduler anti-pattern nuanced (warmup/cyclic matter).

Remove superseded SKILL2.md draft.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-02 20:46:25 +08:00

Sweeps: same-seed comparison and cross-seed reliability

Appendix to the ML Debugging skill. The general idea behind a trustworthy hyperparameter sweep, tool-agnostic. The point is the difference between "I tried it and it seemed better" and "it's reliably better across seeds." Irpan's 30% seed-failure result and Henderson's "seeds alone create statistically different distributions" (see the main skill's folklore section) are why this matters: a single lucky run proves nothing.

The core move: pair on seed, normalize within group, test across seeds

Run the same set of seeds for every value of the parameter you're varying. Sharing seeds across values makes this a paired comparison and cancels seed-level baseline differences.
Vary one parameter per sweep when you can (all-else-equal). If you vary two, effects confound and you can't attribute the result.
Within each (group, seed), z-score the metric across the parameter values. This removes the per-seed baseline offset so you compare shapes, not absolute levels.
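Aggregate the z-scores across seeds per value, then take a t-stat: mean_z / (std_z / sqrt(n_seeds)), where std_z is the sample standard deviation across seeds. |t| > 2 is a useful rough screen, but the bar rises when seeds are few: with 4 seeds (3 degrees of freedom) the two-sided 5% critical value is about 3.18, so treat 2 < |t| < 3 as suggestive and add seeds before trusting it. t ~ 0 is no consistent effect.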
For numeric parameters, also fit a linear trend (Pearson r) and t-test it: a clean dose-response is r near +/-1 with a significant t-stat.
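The trend test in the last step can be read several ways. A minimal sketch of one reading follows, assuming you correlate the numeric parameter values with one score per value (for example the mean z-score across seeds) and use the standard t-statistic for a Pearson correlation with n - 2 degrees of freedom. The function name `trend_t_test` and its arguments are hypothetical, not part of the skill.

```python
from math import sqrt
from statistics import mean


def trend_t_test(param_values, scores):
    """Pearson r between param_values and scores, plus its t-statistic.

    Assumes one score per parameter value and |r| < 1; a perfect
    correlation makes the t-statistic unbounded (division by zero here).
    """
    n = len(param_values)
    if n != len(scores):
        raise ValueError("param_values and scores must have the same length")
    if n < 3:
        raise ValueError("need at least 3 parameter values for a trend test")
    mx, my = mean(param_values), mean(scores)
    sxy = sum((x - mx) * (y - my) for x, y in zip(param_values, scores))
    sxx = sum((x - mx) ** 2 for x in param_values)
    syy = sum((y - my) ** 2 for y in scores)
    r = sxy / sqrt(sxx * syy)
    # t-statistic for H0: no linear trend, with n - 2 degrees of freedom
    t = r * sqrt((n - 2) / (1.0 - r * r))
    return r, t
```

With only a handful of parameter values the degrees of freedom are small, so a large r alone is weak evidence; read it together with the per-value t-stats.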

for group in groups:
    for seed in seeds_in_group:
        vals = {param_value: metric for runs matching (group, seed, param_value)}
        z[seed] = (vals - mean(vals)) / std(vals)   # within-(group, seed) normalization
    for value in param_values:
        zs = [z[seed][value] for seed in seeds_in_group]
        mean_z, std_z = mean(zs), std(zs)            # sample std (n - 1) across seeds
        t_stat = mean_z / (std_z / sqrt(n_seeds))    # >>2 reliably better, <<-2 reliably worse
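The normalization and t-stat steps can be written out as a small, stdlib-only Python function. This is a sketch under stated assumptions, not the skill's own tooling: the function name `paired_sweep_tstats` and the nested-dict layout `{seed: {param_value: metric}}` for a single group are hypothetical, and every seed is assumed to have a run for every parameter value.

```python
from math import sqrt
from statistics import mean, stdev


def paired_sweep_tstats(runs):
    """runs: {seed: {param_value: metric}} for ONE group.

    Returns {param_value: (mean_z, t_stat)}. Raises ZeroDivisionError if a
    seed's metrics are all identical or a value's z-scores have zero spread.
    """
    seeds = sorted(runs)
    if len(seeds) < 2:
        raise ValueError("need at least 2 seeds; one seed gives no t-stat")
    values = sorted(runs[seeds[0]])

    # Within-(group, seed) normalization: z-score across parameter values.
    z = {}
    for seed in seeds:
        row = [runs[seed][v] for v in values]  # KeyError means a missing run
        mu, sd = mean(row), stdev(row)         # stdev is the sample std (n - 1)
        z[seed] = {v: (m - mu) / sd for v, m in zip(values, row)}

    # Across seeds: one-sample t-stat of the z-scores for each value.
    report = {}
    for v in values:
        col = [z[seed][v] for seed in seeds]
        mean_z, std_z = mean(col), stdev(col)
        report[v] = (mean_z, mean_z / (std_z / sqrt(len(seeds))))
    return report
```

Call it once per group, for example `paired_sweep_tstats({0: {1e-4: 0.50, 3e-4: 0.55}, 1: {1e-4: 0.60, 3e-4: 0.66}})`, and compare values only within the group's own report. Because z-scores sum to zero within each seed, the per-value means also sum to zero: a value only looks good relative to the other values in the same sweep.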

What you're looking for

High effect size and a strong t-stat. A value with a big mean but t=0.5 is a lucky seed; a value with a modest mean but t=4.0 is a real (if small) effect.
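A worked contrast makes the distinction concrete. The numbers below are invented purely for illustration: `lucky` has one outlier seed pulling its mean up, while `steady` has a smaller mean that every seed agrees on.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_t(scores):
    """Mean across seeds and the one-sample t-stat against zero."""
    m = mean(scores)
    return m, m / (stdev(scores) / sqrt(len(scores)))


# Hypothetical per-seed z-scores for two parameter values, 4 seeds each.
lucky = [3.0, -0.5, -0.6, -0.3]      # one great seed, three below average
steady = [0.25, 0.20, 0.30, 0.25]    # modest but consistent

lucky_mean, lucky_t = mean_and_t(lucky)
steady_mean, steady_t = mean_and_t(steady)
print(f"lucky:  mean={lucky_mean:.2f}  t={lucky_t:.2f}")
print(f"steady: mean={steady_mean:.2f}  t={steady_t:.2f}")
```

Here `lucky` has the larger mean (0.40 against 0.25) but a t-stat of roughly 0.46, while `steady` reaches roughly 12.2. Ranking by mean alone would pick the wrong value.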

Common pitfalls

n_seeds = 1: t-stat is undefined. One data point. Replicate before concluding anything.
Cross-group comparisons: different groups often have different base configs, so "group A's best value vs group B's best" is apples-to-oranges. Compare within groups.
Too many parameters varied at once: split into separate sweeps.
Crashed / diverged runs showing as missing or NaN metrics: investigate the run, don't silently drop it; a divergence is itself a finding.
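The last pitfall is easy to hit by accident, because many aggregation helpers skip missing or NaN entries without warning. One way to guard against it is to audit the sweep grid before computing any statistics. The sketch below assumes the same hypothetical `{seed: {param_value: metric}}` layout for one group; `audit_sweep` is an illustrative name, not an existing tool.

```python
from math import isfinite


def audit_sweep(runs, seeds, param_values):
    """List every (seed, param_value, reason) cell that cannot be analysed.

    A cell is flagged when the run is absent ("missing") or its metric is
    NaN or infinite ("non-finite"), which usually means a crash or divergence.
    """
    problems = []
    for seed in seeds:
        for value in param_values:
            metric = runs.get(seed, {}).get(value)
            if metric is None:
                problems.append((seed, value, "missing"))
            elif not isfinite(metric):
                problems.append((seed, value, "non-finite"))
    return problems
```

An empty list means the grid is complete. Anything else is a list of runs to open and inspect before the sweep statistics mean anything.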
