docs: clarify competing-worlds debugging loop

2026-06-27 16:46:04 +08:00 · 2026-06-12 06:53:36 +08:00
parent 8b9a1d62ed
commit 3e28a950e9
2 changed files with 7 additions and 4 deletions
@@ -133,12 +133,15 @@ Roughly in this order, though the point is the underlying mindset:
 **Collect clues before theorizing.** Read the traceback and logs. Run static analysis ([refs/static_analysis.md](refs/static_analysis.md)) and the cheap diagnostics ([refs/diagnostics.md](refs/diagnostics.md): data sanity check, init-loss check, overfit-one-batch). If you catch yourself proposing a fix before you've looked at anything, stop.
-**Hold several hypotheses at once; resist converging early.** Unless the cause is already obvious (a traceback usually points right at it), generate a few genuinely different explanations before ranking any, so you don't marry the first one. Use the five lenses in Mental models. Then sanity-check yourself with the failure-mode triplet (same idiom as the `research-journal` skill):
+**Hold several hypotheses at once; resist converging early.** Unless the cause is already obvious (a traceback usually points right at it), generate at least three genuinely different possible worlds before ranking any, so you don't marry the first one. Use the five lenses in Mental models. Put a rough credence/prior on each world, including an explicit unknown bucket when useful. Then sanity-check yourself with:
- *Likely*: your strongest competitor explanation, with a rough credence.
+- *Bug*: a boring implementation/data/loss bug, with high prior until checked.
 - *Likely*: your strongest ordinary explanation.
 - *Subtle*: the sneaky one, like sample size, leakage, a confound, a metric artifact, or plain seed variance masquerading as signal.
 - *Perverse success*: the metric improved, but for the wrong reason.
 - *Null*: there's no real effect, or it comes from something else you also changed.
 - *Unknown*: probability mass for a cause you have not imagined.
-Give each a one-line prior and its cheapest falsifier (`Check: ...`). Anchor priors on Practitioner priors above, but a clue that points elsewhere overrides them outright. Keep observations (reproducible, auditable) separate from inferences, so you can rethink without degrading the evidence.
+For each world, write `Prior: ...; Expect: ...; Check: ...`. Anchor priors on Practitioner priors above, but a clue that points elsewhere overrides them outright. Prefer the cheapest check that most changes the credences across worlds. Keep observations (reproducible, auditable) separate from inferences, so you can rethink without degrading the evidence.
 **Run the cheapest observation that splits your top hypotheses.** Not the most thorough experiment, the most *discriminating* one. Forward-predict each hypothesis ("what would I see if this were the cause?"); a test is strong evidence only where the predictions diverge. A grad-norm line reading ~0 under "dead layer" but healthy under "LR too low" beats a 4-hour sweep that only confirms what you believed.
@@ -11,7 +11,7 @@ Foreword: In an attempt to upskill the ML debugging on AI coding assistants (and
 If you're an LLM agent, calibrate yourself first. ML research code is often outside your training distribution: novel losses, custom architectures, methods with no canonical right answer you've seen a thousand times. The trained reflex there is to be confident and fast, to pattern-match a symptom to a fix ("loss stuck -> drop the LR") and apply it, and on possibly-buggy research code that reflex wastes a run and corrupts the evidence you need to find the real cause. The quotes below are the counter-evidence, in the words of people who paid for these lessons in months of wasted runs. Before acting: form competing hypotheses and identify evidence that distinguishes them ([Rahtz](#think-more-experiment-less)); assume a correctness bug before tuning ([Jones](#assume-you-have-a-bug)); instrument silent failure paths and test more than one setup ([Achiam](#broken-code-fails-silently-measure-everything-spinning-up)); inspect the data and seek falsifiers before believing the result ([Nanda](#default-to-disbelieving-your-own-results-neel-nanda)).
-These are common failure modes worth ruling out, not a complete diagnosis of your situation; you know your system and I don't. Checklists, diagnostics, and symptom catalogs are one hop away under [Reference](#reference-one-hop-away). The short version of Rahtz plus the tuning playbook is: compare several possible worlds, predict what evidence differs between them, then run the narrowest experiment that can actually distinguish them.
+These are common failure modes worth ruling out, not a complete diagnosis of your situation; you know your system and I don't. Checklists, diagnostics, and symptom catalogs are one hop away under [Reference](#reference-one-hop-away). The short version of Rahtz plus the tuning playbook is: compare at least three possible worlds, put rough credences on them, include a bug and an unknown if relevant, predict what evidence differs between them, then run the narrowest experiment that can actually distinguish them.
 ## Folklore