docs: add sourced transformer report folklore

2026-06-27 01:00:14 +08:00 · 2026-06-12 07:02:58 +08:00
parent 3e28a950e9
commit 160bd040cc
11 changed files with 13332 additions and 6 deletions
@@ -9,6 +9,8 @@ Foreword: In an attempt to upskill the ML debugging on AI coding assistants (and

 ## How to read this

+> Wassname's debugging loop (unpublished): write at least three possible worlds before acting: the most likely failure, a subtle failure, a perverse failure, a possible bug, and an unknown if relevant. Put a rough credence/prior on each. For each world, say what you expect to see differently and the cheapest evidence that would distinguish it.
+
 If you're an LLM agent, calibrate yourself first. ML research code is often outside your training distribution: novel losses, custom architectures, methods with no canonical right answer you've seen a thousand times. The trained reflex there is to be confident and fast: pattern-match a symptom to a fix ("loss stuck -> drop the LR") and apply it. On possibly-buggy research code, that reflex wastes a run and corrupts the evidence you need to find the real cause. The quotes below are the counter-evidence, in the words of people who paid for these lessons in months of wasted runs. Before acting: form competing hypotheses and identify evidence that distinguishes them ([Rahtz](#think-more-experiment-less)); assume a correctness bug before tuning ([Jones](#assume-you-have-a-bug)); instrument silent failure paths and test more than one setup ([Achiam](#broken-code-fails-silently-measure-everything-spinning-up)); inspect the data and seek falsifiers before believing the result ([Nanda](#default-to-disbelieving-your-own-results-neel-nanda)).

 These are common failure modes worth ruling out, not a complete diagnosis of your situation; you know your system and I don't. Checklists, diagnostics, and symptom catalogs are one hop away under [Reference](#reference-one-hop-away). The short version of Rahtz plus the tuning playbook is: compare at least three possible worlds, put rough credences on them, include a bug and an unknown if relevant, predict what evidence differs between them, then run the narrowest experiment that can actually distinguish them.
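
One way to make the "possible worlds" loop concrete is to write it down as data before touching the run. The sketch below is illustrative only: `World`, `normalise`, `cheapest_discriminating_check`, the example hypotheses, the credences, and the costs in minutes are all hypothetical and come from no library or from the sources quoted here. It assumes each candidate check can be annotated with the outcome each world predicts.

```python
from dataclasses import dataclass


@dataclass
class World:
    name: str
    credence: float  # rough prior weight; normalised below
    prediction: str  # what you expect to observe if this world is true


def normalise(worlds):
    """Rescale rough credences so they sum to 1."""
    total = sum(w.credence for w in worlds)
    if total <= 0:
        raise ValueError("credences must sum to a positive number")
    return [World(w.name, w.credence / total, w.prediction) for w in worlds]


def cheapest_discriminating_check(checks):
    """Return the name of the cheapest check whose predicted outcome differs
    between at least two worlds, or None if no check discriminates.

    checks: list of (name, cost_minutes, {world_name: predicted_outcome}).
    """
    useful = [c for c in checks if len(set(c[2].values())) > 1]
    if not useful:
        return None
    return min(useful, key=lambda c: c[1])[0]


# Hypothetical example: loss is stuck on a custom training loop.
worlds = normalise([
    World("lr too high", 3, "loss falls once the LR is lowered"),
    World("label misalignment bug", 2, "inputs and labels do not match"),
    World("unknown", 1, "none of the above explains the curve"),
])

checks = [
    # Cheap, but every world predicts the same outcome, so it tells you nothing.
    ("reread the config", 1, {
        "lr too high": "looks fine",
        "label misalignment bug": "looks fine",
        "unknown": "looks fine",
    }),
    ("inspect 20 (input, label) pairs", 5, {
        "lr too high": "pairs match",
        "label misalignment bug": "pairs mismatch",
        "unknown": "pairs match",
    }),
    ("rerun with lr / 10", 120, {
        "lr too high": "loss falls",
        "label misalignment bug": "loss stuck",
        "unknown": "loss stuck",
    }),
]

next_check = cheapest_discriminating_check(checks)
```

The point of the structure is the filter, not the arithmetic: a check that every world predicts identically is dropped regardless of how cheap it is, and only then does cost decide which experiment to run first.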