# Sweeps: same-seed comparison and cross-seed reliability

Appendix to the [ML Debugging skill](../SKILL.md). The general idea behind a trustworthy hyperparameter sweep, tool-agnostic. The point is the difference between "I tried it and it seemed better" and "it's reliably better across seeds." Irpan's 30% seed-failure result and Henderson's "seeds alone create statistically different distributions" (see the main skill's folklore section) are why this matters: a single lucky run proves nothing.
## The core move: pair on seed, normalize within group, test across seeds

1. Run the same set of seeds for every value of the parameter you're varying. Same seeds across values turns this into a paired comparison and cancels seed-level baseline differences.
2. Vary one parameter per sweep when you can (all-else-equal). If you vary two, effects confound and you can't attribute the result.
3. Within each (group, seed), z-score the metric across the parameter values. This removes the per-seed baseline offset so you compare *shapes*, not absolute levels.
4. Aggregate the z-scores across seeds per value, then take a t-stat: `mean_z / (std_z / sqrt(n_seeds))`. As a rule of thumb, `|t| > 2` with 4+ seeds suggests a real, reliable effect (with exactly 4 seeds the strict two-sided 5% threshold is closer to 3.2); `t ~ 0` is no consistent effect.
5. For numeric parameters, also fit a linear trend (Pearson r) and t-test it: a clean dose-response is `r` near +/-1 with a significant t-stat.
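```py
import numpy as np

# runs: dict {(group, seed, param_value): metric}; a missing run raises KeyError on purpose
def sweep_t_stats(runs, group, seeds, param_values):
    z = np.empty((len(seeds), len(param_values)))
    for i, seed in enumerate(seeds):
        vals = np.array([runs[group, seed, v] for v in param_values], dtype=float)
        z[i] = (vals - vals.mean()) / vals.std()  # within-(group,seed) normalization
    mean_z, std_z = z.mean(axis=0), z.std(axis=0, ddof=1)  # across seeds, per value (sample std)
    t_stat = mean_z / (std_z / np.sqrt(len(seeds)))  # >>2 reliably better, <<-2 reliably worse
    return dict(zip(param_values, t_stat))
```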
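
Step 5 (the trend test for numeric parameters) is not covered by the block above. A minimal sketch, assuming you regress the per-value mean z-score on the parameter (here on a log scale, which is a choice, not something the recipe prescribes). It uses the standard t-statistic for a Pearson correlation, `t = r * sqrt((n - 2) / (1 - r^2))`, with `n - 2` degrees of freedom. The function name `pearson_trend` and the example numbers are hypothetical.

```python
import math
from statistics import mean


def pearson_trend(xs, ys):
    """Return (Pearson r, t-stat for r != 0) for paired samples xs, ys."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least 3 points to t-test a correlation")
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    if abs(r) >= 1.0:  # perfectly linear: the t-stat is unbounded
        return r, math.copysign(math.inf, r)
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return r, t


# Hypothetical sweep: mean z-score per learning rate, against log10(lr).
log_lr = [-5.0, -4.0, -3.0, -2.0]
mean_z = [-1.1, -0.4, 0.3, 1.2]
r, t = pearson_trend(log_lr, mean_z)
print(f"r={r:.3f}, t={t:.2f}")
```

With only a handful of parameter values the degrees of freedom are tiny, so treat a high `r` as a hint to add more values rather than as proof of a dose-response.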
## What you're looking for

High effect size *and* a strong t-stat. A value with a big mean but `t=0.5` is a lucky seed; a value with a modest mean but `t=4.0` is a real (if small) effect.
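
The distinction can be made concrete with a small sketch. It assumes you already have per-seed paired improvements (candidate minus baseline on the same seed) in raw metric units rather than z-scores, which is a simplification. The helper `effect_and_t` and both sets of numbers are hypothetical.

```python
import math
from statistics import mean, stdev


def effect_and_t(per_seed_deltas):
    """Mean paired improvement and its one-sample t-stat across seeds."""
    n = len(per_seed_deltas)
    if n < 2:
        raise ValueError("need at least 2 seeds for a t-stat")
    m = mean(per_seed_deltas)
    s = stdev(per_seed_deltas)  # sample std (n - 1 in the denominator)
    return m, m / (s / math.sqrt(n))


# Hypothetical per-seed improvements over baseline, 4 seeds each.
lucky = [4.0, -1.5, -1.0, 0.5]   # one seed carries the whole mean
steady = [0.25, 0.5, 0.25, 0.5]  # small but consistent

for name, deltas in [("lucky", lucky), ("steady", steady)]:
    m, t = effect_and_t(deltas)
    print(f"{name}: mean={m:.3f}, t={t:.2f}")
```

Here `lucky` has the larger mean but a t-stat well under 1, while `steady` has a smaller mean and a t-stat above 5: the second is the one to believe.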
## Common pitfalls

- `n_seeds = 1`: t-stat is undefined. One data point. Replicate before concluding anything.
- Cross-group comparisons: different groups often have different base configs, so "group A's best value vs group B's best" is apples-to-oranges. Compare within groups.
- Too many parameters varied at once: split into separate sweeps.
- Crashed / diverged runs showing as missing or NaN metrics: investigate the run, don't silently drop it; a divergence is itself a finding.
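
Two of these pitfalls (too few seeds, silently dropped runs) can be checked mechanically before any statistics are computed. A minimal sketch, assuming the results for one group live in a dict keyed by `(seed, param_value)`; that layout and the name `audit_sweep` are hypothetical.

```python
import math


def audit_sweep(runs, seeds, param_values, min_seeds=2):
    """List problems that should block analysis of one group's sweep.

    runs: hypothetical dict {(seed, param_value): metric}.
    """
    problems = []
    if len(seeds) < min_seeds:
        problems.append(f"only {len(seeds)} seed(s): t-stat undefined, replicate first")
    for seed in seeds:
        for value in param_values:
            metric = runs.get((seed, value))
            if metric is None:
                problems.append(f"missing run: seed={seed}, value={value}")
            elif math.isnan(metric) or math.isinf(metric):
                problems.append(f"diverged run: seed={seed}, value={value}, metric={metric}")
    return problems


# Hypothetical sweep with one diverged run and one run that never reported.
runs = {(0, 1e-3): 0.81, (0, 1e-2): float("nan"), (1, 1e-3): 0.79}
for problem in audit_sweep(runs, seeds=[0, 1], param_values=[1e-3, 1e-2]):
    print(problem)
```

The function only reports; deciding whether a diverged run is a bug or a genuine property of that parameter value still needs a human look at the run.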