- SKILL.md: 3 new entries (exploration-over-exploitation + nuisance HPs,
test-set contamination, loss-spikes-mean-bad-data-pocket) and an Ng
100-misclassified-examples quote under inspect-the-data
- refs/llm_judges.md: position/verbosity/self-preference biases (Zheng,
Wang 66/80 flip, Panickssery) + mitigation checklist from verdict docs
- Lones pitfalls linked as the exhaustive 36-item do/don't checklist
- 6 new frozen evidence files; Hamel evals link in further reading
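One common mitigation for the position bias named above is to query the judge twice with the answers swapped and keep only verdicts that survive the swap. This is a minimal sketch, not the checklist in refs/llm_judges.md: `judge` is a hypothetical callable returning 1 or 2 for the winning slot, and both toy judges are made up for illustration.

```python
def debiased_preference(judge, prompt, answer_a, answer_b):
    """Ask twice with the answers swapped; keep the verdict only if it survives."""
    forward = judge(prompt, answer_a, answer_b)   # 1 or 2: which slot won
    backward = judge(prompt, answer_b, answer_a)  # same pair, slots swapped
    if forward == 1 and backward == 2:
        return "a"
    if forward == 2 and backward == 1:
        return "b"
    return "tie"  # verdict flipped with position, so treat it as no signal


def first_slot_judge(prompt, first, second):
    # Hypothetical judge with pure position bias: always picks slot 1.
    return 1


def longer_answer_judge(prompt, first, second):
    # Hypothetical judge with pure verbosity bias: picks the longer answer.
    return 1 if len(first) >= len(second) else 2


biased = debiased_preference(first_slot_judge, "q", "short", "a longer answer")
stable = debiased_preference(longer_answer_judge, "q", "short", "a longer answer")
```

The swap neutralizes the position-biased judge (`biased` is a tie) but not the verbosity-biased one (`stable` still prefers the longer answer), so each bias needs its own check.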
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Both are quoted verbatim with frozen evidence: the 2018 tweet thread (mirrored
via threadreaderapp because x.com blocks fetching) slots in after
overfit-one-batch; CACE (NIPS 2015, entanglement section transcribed from the
PDF) gives Always-Be-Ablating its why.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
SKILL.md is now folklore only: verbatim practitioner quotes ordered
most-general-first, transformer/LLM fine-tuning entries in their own
section, minimal context, links and footnotes. New sources: unsloth,
axolotl (+training stability), HF course ch8.4, Bekman debug_utils
(evidence frozen in docs/evidence/).
The synthesized material (mental models, priors, symptom tables, agent
loop, triage, anti-patterns) moves to PLAYBOOK.md, framed as menus of
hypotheses rather than authoritative diagnoses. Made-up symptom tables
no longer sit next to sourced quotes.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
NaN poisoning: inject NaN where info must not come from (future/test/labels), run the real pipeline, assert past outputs stay finite. Documents false negatives (pandas skipna, nanmean) and false positives (softmax rows, batch stats). Backprop-to-input is its gradient dual for inside the model; quote already frozen in docs/evidence/karpathy_recipe_training_nn_2019.md.
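A minimal sketch of the NaN-poisoning check described above, assuming a toy rolling-mean feature in place of the real pipeline. `causal_mean`, `leaky_mean` and `past_is_finite` are hypothetical names for illustration, not helpers from the skill.

```python
import math


def causal_mean(xs, window):
    """Trailing mean: output[t] reads only xs[t-window+1 .. t]."""
    out = []
    for t in range(len(xs)):
        chunk = xs[max(0, t - window + 1):t + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def leaky_mean(xs, window):
    """Centered mean: output[t] also reads xs[t+1], i.e. the future."""
    half = window // 2
    out = []
    for t in range(len(xs)):
        chunk = xs[max(0, t - half):min(len(xs), t + half + 1)]
        out.append(sum(chunk) / len(chunk))
    return out


def past_is_finite(feature_fn, xs, cutoff, window=3):
    """Poison everything at or after `cutoff`, then check outputs before it."""
    poisoned = xs[:cutoff] + [float("nan")] * (len(xs) - cutoff)
    out = feature_fn(poisoned, window)
    return all(math.isfinite(v) for v in out[:cutoff])


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
causal_ok = past_is_finite(causal_mean, series, cutoff=4)  # no leak
leaky_ok = past_is_finite(leaky_mean, series, cutoff=4)    # leak detected
```

The check only works because plain `sum` propagates NaN. A NaN-skipping reduction (the pandas skipna and nanmean cases noted above) would hide the leak and give a false negative.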
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- Promote the general (non-RL-specific) Spinning Up lessons up to the main
folklore: "broken code fails silently", "you can't tell it's broken if you
can't see that it's breaking", and test on more than one setup.
- Add gwern's "Unseeing" to the data theme: you can't read what you actually
wrote, hence fresh eyes / a fresh-eyes subagent.
- New "Research taste (adjacent to debugging)" section with verbatim quotes,
each cached: Neel Nanda (your research is false by default; excitement is
evidence of bullshit; read your data), Ulisse Mini (understand the system to
shrink the search space), John Wentworth (gears-level models are capital
investments vs cheap black boxes).
All quotes verbatim from cached sources; 25/25 footnotes resolve.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Spinning Up as a Deep RL Researcher was only a bare code link; it's the
canonical RL-researcher guide and its debugging advice is gold. Cache the
rigour/debugging sections verbatim and quote the sharpest lines in the RL
sub-skill: "broken RL code almost always fails silently", "if it doesn't work,
assume there's a bug", "measure everything ... you can't tell it's broken if
you can't see that it's breaking", and test on more than one env. Add to RL
sources.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Gather debugging folklore from more practitioners, each a verbatim quote
checked against a cached source copy (footnoted with line numbers):
- koaning (Vincent Warmerdam), "Bad Labels": benchmark labels are often wrong;
find them with confidence-sorted errors.
- gwern, the tank-detection legend: the canonical data-leakage parable, plus
the scout-mindset twist that it's a likely-unsourced urban legend.
- Patrick Kidger, "Just Know Stuff": why research code is buggy ("kludge ...
bugs that don't cripple things only because some other bug stops them") and
"never accept the kludge". Plus a one-line jaxtyping pointer for shape bugs.
- nanochat (Karpathy): BOS-alignment fake metric improvement; all-ranks must
clip on inf (a multi-GPU bug single-GPU testing hides).
- cleanrl "37 Implementation Details of PPO" -> RL sub-skill, as the canonical
proof that reference-impl details (not ideas) decide whether PPO works.
Trim the lucidrains item to one quote (it had ballooned). Add wassname credit
+ companion-gist link. All 20 footnotes resolve.
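The "confidence-sorted errors" idea from the Bad Labels entry can be sketched as follows: collect the examples where the model disagrees with the label and review the most confident disagreements first. The function name and toy data are assumptions for illustration, not taken from the cited source.

```python
def confident_errors(labels, probs, top_k=3):
    """Indices where the prediction disagrees with the label, most confident first."""
    rows = []
    for i, (label, p) in enumerate(zip(labels, probs)):
        pred = max(range(len(p)), key=p.__getitem__)  # argmax over classes
        if pred != label:
            rows.append((p[pred], i))
    rows.sort(reverse=True)  # highest model confidence first
    return [i for _, i in rows[:top_k]]


# Toy binary problem: per-example class probabilities from some model.
labels = [0, 1, 1, 0, 1]
probs = [[0.9, 0.1], [0.97, 0.03], [0.4, 0.6], [0.45, 0.55], [0.8, 0.2]]
suspects = confident_errors(labels, probs)
```

Examples near the top of `suspects` are candidates for a wrong label, not proof of one; each still needs a human look.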
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Phil Wang's x-transformers is the canonical "the fix is in the code, not the
paper" catalogue. Add a folklore item on the most debugging-relevant trick:
QK / cosine-sim normalization to stop attention logits overflowing (the usual
cause of transformer loss spikes/divergence), plus the BLOOM/YaLM
post-embedding LayerNorm. Two verbatim lucidrains quotes, footnoted to the repo
+ a cached README copy with line numbers. Doubles as the modern concrete
example for the read-a-working-implementation section.
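A minimal sketch of why cosine-sim normalization bounds attention logits, in pure Python rather than the x-transformers code. The fixed `scale=10.0` and the toy vectors are assumptions for illustration.

```python
import math


def l2_normalize(v, eps=1e-12):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def raw_logit(q, k):
    """Standard scaled dot-product logit: grows with the norms of q and k."""
    return dot(q, k) / math.sqrt(len(q))


def cosine_logit(q, k, scale=10.0):
    """Cosine-sim logit: |value| <= scale regardless of the norms of q and k."""
    return scale * dot(l2_normalize(q), l2_normalize(k))


# Large-magnitude query/key vectors, as can happen late in an unstable run.
q = [300.0, -400.0, 120.0, 50.0]
k = [250.0, -380.0, 90.0, 75.0]
raw = raw_logit(q, k)        # on the order of 1e5
bounded = cosine_logit(q, k) # at most 10 in magnitude
```

Because the normalized logit cannot exceed `scale`, the softmax input stays in a fixed range however large the activations grow.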
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Reorder around what's durable, per wassname's curation:
- human-written intro up top; rename to "wassname's ML Debugging Folklore"
- mindset first: calibrate -> mental models -> Part 1 general tricks (kept,
they're well-based) -> read a working implementation when stuck
- a Folklore section built from verbatim, source-checked quotes (Jones,
Rahtz, Karpathy, Schulman, Henderson, Irpan, CS231n, Slavv, Goodfellow),
each footnoted to the canonical URL + the cached copy with line numbers
- LLM-agent babysitting (debugging loop, triage menu, anti-patterns) moved to
the bottom where it belongs; triage reframed as a menu, not a flowchart
- deeper one-off tricks split to refs/ (loss_surface, metric_stuck, sweeps),
scrubbed of private tooling (wandb/just/SI/personal scripts)
Quote integrity: every quote independently verified by fresh-eyes subagents
against the cached sources; fixed a reformatted Schulman slide, a truncated
Jones sentence, a reversed-order Rahtz stitch, a falsely-quoted Slavv phrase,
and the 3e-4 line (now the real tweet, framed as the joke Karpathy confirmed
it was, not gospel). lr_scheduler anti-pattern nuanced (warmup/cyclic matter).
Remove superseded SKILL2.md draft.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Procedural/vibe anchor with gradual disclosure: calibrate + loop +
non-obvious numbers inline, tables/triage/sweeps demoted to on-demand
links into SKILL.md and refs/. Draft for side-by-side comparison; not
wired in (SKILL.md remains the entry point).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- drop "detective at a scene, not a fortune teller", "guess wearing a
fix's clothes", "that reflex is the enemy"
- rephrase negative parallelisms in intro/calibrate/loop to positive
(judgment not a checklist; mindset not ticking boxes; evidence not
prior; isn't a recipe; it's a; menu not a procedure; code not abstract)
- keep genuine instructional contrasts (relative error not absolute, etc.)
- trim pseudocode comments to intent-only
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- convert all prose ' -- ' to commas/periods/parens (left code/CLI/arrows)
- remove the antithesis flourish in the bisect step; inform not persuade
- de-telegraph "no model, no forward pass, no GPU. pure math."
- add non-exhaustive hedges (and so on / like) where lists implied closure
- fix typos: authoritative (x2), sklearn, it indented
- TODO: triage decision tree converted from ASCII art to nested bullets
- TODO: add Further reading section linking docs/evidence/* files
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- triplet now carries a prior + cheapest falsifier (Check:) per hypothesis
- discriminating-test step: forward-predict each hypothesis, prefer where
predictions diverge (strong vs weak evidence) instead of just "discriminating"
- new step: bisect the forward/backward path to localize where it breaks
- compact pseudocode summary of the whole loop
- resolve FIXME: drop references to the non-public research-journal skill
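A rough sketch of the bisect step for the forward path only, assuming that a non-finite value stays non-finite downstream once it appears. `first_bad_stage` and the toy `stages` list are hypothetical. In a real model the stages would be layers, and the check could be any invariant, not just finiteness.

```python
import math


def first_bad_stage(stages, x):
    """Binary-search for the first stage whose output is non-finite.

    Assumes the input is finite and that non-finite values propagate.
    Returns the stage index, or None if the final output is finite.
    """
    def output_after(n):
        v = x
        for stage in stages[:n]:
            v = stage(v)
        return v

    if math.isfinite(output_after(len(stages))):
        return None
    lo, hi = 0, len(stages)  # output_after(lo) is finite, output_after(hi) is not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if math.isfinite(output_after(mid)):
            lo = mid
        else:
            hi = mid
    return hi - 1


stages = [
    lambda v: v * 2.0,
    lambda v: v - 6.0,
    lambda v: math.log(v) if v > 0 else float("nan"),  # breaks when v <= 0
    lambda v: v + 1.0,
    lambda v: math.tanh(v),
]
bad = first_bad_stage(stages, 3.0)
```

With input 3.0 the second stage outputs 0.0, so the log stage at index 2 is the first to emit NaN.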
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Three targeted polishes to the rewritten skill:
- Reframe Part 1's "The hierarchy (work in order...)" -> "What 'collect clues'
looks like": it's the catalog the loop's clue-collection step draws on, not a
second master-procedure competing with "the debugging loop" 40 lines above.
- Reorder: lead straight into calibrate -> loop -> read-impl; relocate the
2017-2021 caveat + LLM-pretraining pointers into a "Scope and modern pointers"
block after the action sections, so the behaviour-changing content is the
first screen instead of provenance.
- Emphasis: give the "priors are a starting weight, not a verdict" line a
concrete clause (traceback / loss-metric misalignment / right init-loss
override the data prior) -- the weakest comprehension dim in the quiz.
Before-vs-after panel A/B (6 cold readers): tie on ordering/clarity/
conciseness/focus, each leaning slightly positive, no regression.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The skill was thorough but failed to instill debugging taste: an agent would
pattern-match a symptom-table row to a fix and ship a guess, because the
behaviour-changing material sat 550 lines down. Promote three gates to the top:
- "Before you debug: calibrate" -- you're likely OOD in research code; the
failure mode is overconfidence/impatience; the tables are a menu to widen the
search, never lookup-and-apply.
- "The debugging loop (judgment, not a checklist)" -- collect clues, hold a few
competing hypotheses scaled to the problem, sanity-check with the
likely/subtle/null triplet (shared vocab with research-journal), run the
cheapest discriminating observation, then act.
- "When stuck, read a working implementation" -- promoted from a buried Part 7.3
one-liner; extract the algorithm-done-right, the engineering tricks the paper
omits, and proven hyperparams; rank candidates by trust signal.
Collapse duplicated advice to pointers; de-bold Part 6.4 (8 bolded openers -> a
plain list). Net +10 lines, bold markers 112 -> 100.
Verified by a blind comprehension-by-inference quiz (5 cold-reader models, OLD
vs NEW): NEW 9.6/10 vs OLD 6.8/10, with the gap localized to the two added
sections (read-working-impl 5/5 vs 0/5; tables-as-menu 2.0 vs 1.6) while
untouched sections tied -- ruling out a "reads nicer" halo.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Add 3 new evidence files from modern open-source sources:
- karpathy_recipe_training_nn_2019.md: Karpathy's training recipe blog post
- nanochat_deepwiki_llm_pretraining_2026.md: 320+ HP sweeps for GPT-2-scale pretraining
- sanh_simple_considerations_hf_2021.md: HuggingFace NLP debugging notes
Add update-to-data ratio diagnostic to refs/diagnostics.md (target ~1e-3).
Add LLM pretraining gap note to SKILL.md intro linking the new sources.
Add tanh saturation % to logging checklist.
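A minimal sketch of the update-to-data ratio, assuming plain SGD and the definition std(lr * grad) / std(param). The exact form in refs/diagnostics.md may differ, and the toy numbers are made up to land on the ~1e-3 target.

```python
import math


def std(values):
    """Population standard deviation."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def update_to_data_ratio(params, grads, lr):
    """Std of the SGD update divided by std of the parameters it updates."""
    updates = [lr * g for g in grads]
    return std(updates) / std(params)


# Toy tensor flattened to a list; values chosen so the ratio is 1e-3.
params = [1.0, -1.0, 1.0, -1.0]
grads = [0.1, -0.1, 0.1, -0.1]
ratio = update_to_data_ratio(params, grads, lr=0.01)
log10_ratio = math.log10(ratio)  # roughly -3
```

Tracked per parameter tensor over training, a ratio far above the target suggests the learning rate is too high for that tensor, and far below suggests it is barely moving.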
- Fix stale Part 2 cross-references to link to rl/SKILL.md
- Add McCandlish + Slavv back to parent Sources (cited in Part 7)
- Add back-links from refs/ files to parent SKILL.md
Moved 6.1 (static analysis grep patterns) and 6.2 (diagnostic code
snippets) to refs/static_analysis.md and refs/diagnostics.md.
Triage tree (6.3) stays in main with references to the ref files.
ml_debug/SKILL.md reduced from 7229w to 5093w (~30% shorter than the original).