mirror of
https://github.com/wassname/ml-debug.git
synced 2026-06-27 19:47:55 +08:00
38ec634ff3
Reorder around what's durable, per wassname's curation: - human-written intro up top; rename to "wassname's ML Debugging Folklore" - mindset first: calibrate -> mental models -> Part 1 general tricks (kept, they're well-based) -> read a working implementation when stuck - a Folklore section built from verbatim, source-checked quotes (Jones, Rahtz, Karpathy, Schulman, Henderson, Irpan, CS231n, Slavv, Goodfellow), each footnoted to the canonical URL + the cached copy with line numbers - LLM-agent babysitting (debugging loop, triage menu, anti-patterns) moved to the bottom where it belongs; triage reframed as a menu, not a flowchart - deeper one-off tricks split to refs/ (loss_surface, metric_stuck, sweeps), scrubbed of private tooling (wandb/just/SI/personal scripts) Quote integrity: every quote independently verified by fresh-eyes subagents against the cached sources; fixed a reformatted Schulman slide, a truncated Jones sentence, a reversed-order Rahtz stitch, a falsely-quoted Slavv phrase, and the 3e-4 line (now the real tweet, framed as the joke Karpathy confirmed it was, not gospel). lr_scheduler anti-pattern nuanced (warmup/cyclic matter). Remove superseded SKILL2.md draft. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2.4 KiB
Sweeps: same-seed comparison and cross-seed reliability
Appendix to the ML Debugging skill. The general idea behind a trustworthy hyperparameter sweep, tool-agnostic. The point is the difference between "I tried it and it seemed better" and "it's reliably better across seeds." Irpan's 30% seed-failure result and Henderson's "seeds alone create statistically different distributions" (see the main skill's folklore section) are why this matters: a single lucky run proves nothing.
The core move: pair on seed, normalize within group, test across seeds
- Run the same set of seeds for every value of the parameter you're varying. Using the same seeds across values turns the sweep into a paired comparison and cancels seed-level baseline differences.
- Vary one parameter per sweep when you can (all-else-equal). If you vary two, effects confound and you can't attribute the result.
- Within each (group, seed), z-score the metric across the parameter values. This removes the per-seed baseline offset so you compare shapes, not absolute levels.
- Aggregate the z-scores across seeds per value, then take a t-stat: `mean_z / (std_z / sqrt(n_seeds))`. `|t| > 2` with 4+ seeds is a real, reliable effect; `t ~ 0` is no consistent effect.
- For numeric parameters, also fit a linear trend (Pearson r) and t-test it: a clean dose-response is `r` near +/-1 with a significant t-stat.
    for group in groups:
        for seed in seeds_in_group:
            vals = {param_value: metric for runs matching (group, seed, param)}
            z[seed] = (vals - mean(vals)) / std(vals)  # within-(group,seed) normalization
        for value in param_values:
            mean_z, std_z = mean(z[:, value]), std(z[:, value])
            t_stat = mean_z / (std_z / sqrt(n_seeds))  # >>2 reliably better, <<-2 reliably worse
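The pseudocode above can be made runnable with the standard library alone. This is a minimal sketch for a single group, not a reference implementation: the function name `paired_t_stats`, the `runs` layout (`(seed, param_value) -> metric`), and the metric values are all invented for illustration, and it assumes a complete grid (every seed ran every value) with the sample standard deviation (n - 1 in the denominator).

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def paired_t_stats(runs):
    """Return {param_value: (mean_z, t_stat)} for one group.

    `runs` maps (seed, param_value) -> metric. This layout is an assumption
    of the sketch, not something the appendix prescribes.
    """
    by_seed = defaultdict(dict)
    for (seed, value), metric in runs.items():
        by_seed[seed][value] = metric
    if len(by_seed) < 2:
        raise ValueError("need at least 2 seeds; the t-stat is undefined for 1")

    z_by_value = defaultdict(list)
    for seed, vals in by_seed.items():
        mu = mean(vals.values())
        sd = stdev(vals.values())  # sample std across the parameter values
        if sd == 0:
            raise ValueError(f"metric is constant across values for seed {seed}")
        for value, metric in vals.items():
            z_by_value[value].append((metric - mu) / sd)  # within-seed z-score

    stats = {}
    for value, zs in z_by_value.items():
        mean_z = mean(zs)
        std_err = stdev(zs) / sqrt(len(zs))
        t_stat = mean_z / std_err if std_err > 0 else float("nan")
        stats[value] = (mean_z, t_stat)
    return stats


# Made-up accuracies: 4 seeds x 3 learning rates, same seeds for every value.
runs = {
    (0, 0.0001): 0.70, (0, 0.001): 0.80, (0, 0.01): 0.60,
    (1, 0.0001): 0.62, (1, 0.001): 0.71, (1, 0.01): 0.55,
    (2, 0.0001): 0.75, (2, 0.001): 0.83, (2, 0.01): 0.70,
    (3, 0.0001): 0.66, (3, 0.001): 0.78, (3, 0.01): 0.58,
}
results = paired_t_stats(runs)
for value, (mean_z, t_stat) in sorted(results.items()):
    print(f"value={value:g}  mean_z={mean_z:+.2f}  t={t_stat:+.1f}")
```

With these invented numbers the middle value gets a t-stat well above 2 and the largest value one well below -2, even though the seeds differ a lot in absolute accuracy. Treat `|t| > 2` as a rule of thumb: with 4 seeds there are only 3 degrees of freedom, where the two-sided 5% critical value is about 3.18.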
What you're looking for
High effect size and a strong t-stat. A value with a big mean but t=0.5 is a lucky seed; a value with a modest mean but t=4.0 is a real (if small) effect.
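For a numeric parameter, the dose-response check gives the same two readings: Pearson r is the effect size and its t-stat is the reliability. The sketch below is one way to compute both, under stated assumptions: the helper `pearson_trend` and the `z_by_value` numbers are hypothetical, the (value, z-score) pairs are pooled over seeds, and the t-stat uses the textbook formula `t = r * sqrt((n - 2) / (1 - r^2))`, which treats the pooled points as independent.

```python
from math import copysign, inf, sqrt


def pearson_trend(xs, ys):
    """Return (r, t_stat) for a linear trend of ys against xs."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least 3 paired points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("xs or ys is constant; r is undefined")
    r = sxy / sqrt(sxx * syy)
    if 1 - r * r <= 0:  # perfect fit: the t-stat is unbounded
        return r, copysign(inf, r)
    return r, r * sqrt((n - 2) / (1 - r * r))


# Hypothetical within-seed z-scores for 4 dropout values x 3 seeds.
z_by_value = {
    0.0: [-1.2, -1.0, -1.3],
    0.1: [-0.4, -0.5, -0.2],
    0.2: [0.3, 0.5, 0.4],
    0.3: [1.3, 1.0, 1.1],
}
xs = [value for value, zs in z_by_value.items() for _ in zs]
ys = [z for zs in z_by_value.values() for z in zs]
r, t_stat = pearson_trend(xs, ys)
print(f"r={r:.3f}  t={t_stat:.1f}")
```

If the parameter spans orders of magnitude (a learning rate, say), passing `log10(value)` as `xs` is a common choice; that is a suggestion of this sketch rather than something the text above specifies.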
Common pitfalls
- `n_seeds = 1`: t-stat is undefined. One data point. Replicate before concluding anything.
- Cross-group comparisons: different groups often have different base configs, so "group A's best value vs group B's best" is apples-to-oranges. Compare within groups.
- Too many parameters varied at once: split into separate sweeps.
- Crashed / diverged runs showing as missing or NaN metrics: investigate the run, don't silently drop it; a divergence is itself a finding.
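Some of these pitfalls can be caught mechanically before any statistics are computed. The following is a rough pre-flight check, assuming the same hypothetical `(seed, param_value) -> metric` layout for one group; `audit_sweep` and its messages are made up for illustration. It reports problems instead of dropping runs, so a divergence stays visible.

```python
import math


def audit_sweep(runs, expected_seeds, expected_values):
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    if len(expected_seeds) < 2:
        problems.append("n_seeds < 2: t-stat is undefined, replicate first")
    for seed in expected_seeds:
        for value in expected_values:
            metric = runs.get((seed, value))
            if metric is None:
                # Crashed or never-launched run: investigate, don't ignore.
                problems.append(f"missing run: seed={seed} value={value}")
            elif not math.isfinite(metric):
                # NaN or inf usually means the run diverged.
                problems.append(f"non-finite metric: seed={seed} value={value}")
    return problems


# Invented example: one diverged run and one run that never reported.
runs = {(0, 0.1): 0.80, (0, 0.2): float("nan"), (1, 0.1): 0.79}
problems = audit_sweep(runs, expected_seeds=[0, 1], expected_values=[0.1, 0.2])
for problem in problems:
    print(problem)
```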