- Promote the general (non-RL-specific) Spinning Up lessons up to the main
folklore: "broken code fails silently", "you can't tell it's broken if you
can't see that it's breaking", and test on more than one setup.
- Add gwern's "Unseeing" to the data theme: you can't read what you actually
wrote, hence fresh eyes / a fresh-eyes subagent.
- New "Research taste (adjacent to debugging)" section with verbatim quotes,
each cached: Neel Nanda (your research is false by default; excitement is
evidence of bullshit; read your data), Ulisse Mini (understand the system to
shrink the search space), John Wentworth (gears-level models are capital
investments vs cheap black boxes).
All quotes verbatim from cached sources; 25/25 footnotes resolve.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Spinning Up as a Deep RL Researcher was only a bare code link; it's the
canonical RL-researcher guide and its debugging advice is gold. Cache the
rigour/debugging sections verbatim and quote the sharpest lines in the RL
sub-skill: "broken RL code almost always fails silently", "if it doesn't work,
assume there's a bug", "measure everything ... you can't tell it's broken if
you can't see that it's breaking", and test on more than one env. Add to RL
sources.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Gather debugging folklore from more practitioners, each a verbatim quote
checked against a cached source copy (footnoted with line numbers):
- koaning (Vincent Warmerdam), "Bad Labels": benchmark labels are often wrong;
find them with confidence-sorted errors.
- gwern, the tank-detection legend: the canonical data-leakage parable, plus
the scout-mindset twist that it's a likely-unsourced urban legend.
- Patrick Kidger, "Just Know Stuff": why research code is buggy ("kludge ...
bugs that don't cripple things only because some other bug stops them") and
"never accept the kludge". Plus a one-line jaxtyping pointer for shape bugs.
- nanochat (Karpathy): BOS-alignment fake metric improvement; all-ranks must
clip on inf (a multi-GPU bug single-GPU testing hides).
- cleanrl "37 Implementation Details of PPO" -> RL sub-skill, as the canonical
proof that reference-impl details (not ideas) decide whether PPO works.
Trim the lucidrains item to one quote (it had ballooned). Add wassname credit
+ companion-gist link. All 20 footnotes resolve.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Phil Wang's x-transformers is the canonical "the fix is in the code, not the
paper" catalogue. Add a folklore item on the most debugging-relevant trick:
QK / cosine-sim normalization to stop attention logits overflowing (the usual
cause of transformer loss spikes/divergence), plus the BLOOM/YaLM
post-embedding LayerNorm. Two verbatim lucidrains quotes, footnoted to the repo
+ a cached README copy with line numbers. Doubles as the modern concrete
example for the read-a-working-implementation section.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Reorder around what's durable, per wassname's curation:
- human-written intro up top; rename to "wassname's ML Debugging Folklore"
- mindset first: calibrate -> mental models -> Part 1 general tricks (kept,
they're well-based) -> read a working implementation when stuck
- a Folklore section built from verbatim, source-checked quotes (Jones,
Rahtz, Karpathy, Schulman, Henderson, Irpan, CS231n, Slavv, Goodfellow),
each footnoted to the canonical URL + the cached copy with line numbers
- LLM-agent babysitting (debugging loop, triage menu, anti-patterns) moved to
the bottom where it belongs; triage reframed as a menu, not a flowchart
- deeper one-off tricks split to refs/ (loss_surface, metric_stuck, sweeps),
scrubbed of private tooling (wandb/just/SI/personal scripts)
Quote integrity: every quote independently verified by fresh-eyes subagents
against the cached sources; fixed a reformatted Schulman slide, a truncated
Jones sentence, a reversed-order Rahtz stitch, a falsely-quoted Slavv phrase,
and the 3e-4 line (now the real tweet, framed as the joke Karpathy confirmed
it was, not gospel). lr_scheduler anti-pattern nuanced (warmup/cyclic matter).
Remove superseded SKILL2.md draft.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Procedural/vibe anchor with gradual disclosure: calibrate + loop +
non-obvious numbers inline, tables/triage/sweeps demoted to on-demand
links into SKILL.md and refs/. Draft for side-by-side comparison; not
wired in (SKILL.md remains the entry point).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- drop "detective at a scene, not a fortune teller", "guess wearing a
fix's clothes", "that reflex is the enemy"
- rephrase negative parallelisms in intro/calibrate/loop to positive
(judgment not a checklist; mindset not ticking boxes; evidence not
prior; isn't a recipe; it's a; menu not a procedure; code not abstract)
- keep genuine instructional contrasts (relative error not absolute, etc.)
- trim pseudocode comments to intent-only
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- convert all prose ' -- ' to commas/periods/parens (left code/CLI/arrows)
- remove the antithesis flourish in the bisect step; inform not persuade
- de-telegraph "no model, no forward pass, no GPU. pure math."
- add non-exhaustive hedges (and so on / like) where lists implied closure
- fix typos: authoritative (x2), sklearn, it indented
- TODO: triage decision tree converted from ASCII art to nested bullets
- TODO: add Further reading section linking docs/evidence/* files
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- triplet now carries a prior + cheapest falsifier (Check:) per hypothesis
- discriminating-test step: forward-predict each hypothesis, prefer where
predictions diverge (strong vs weak evidence) instead of just "discriminating"
- new step: bisect the forward/backward path to localize where it breaks
- compact pseudocode summary of the whole loop
- resolve FIXME: drop references to the non-public research-journal skill
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Three targeted polishes to the rewritten skill:
- Reframe Part 1's "The hierarchy (work in order...)" -> "What 'collect clues'
looks like": it's the catalog the loop's clue-collection step draws on, not a
second master-procedure competing with "the debugging loop" 40 lines above.
- Reorder: lead straight into calibrate -> loop -> read-impl; relocate the
2017-2021 caveat + LLM-pretraining pointers into a "Scope and modern pointers"
block after the action sections, so the behaviour-changing content is the
first screen instead of provenance.
- Emphasis: give the "priors are a starting weight, not a verdict" line a
concrete clause (traceback / loss-metric misalignment / right init-loss
override the data prior) -- the weakest comprehension dim in the quiz.
Before-vs-after panel A/B (6 cold readers): tie on ordering/clarity/
conciseness/focus, each leaning slightly positive, no regression.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The skill was thorough but failed to instill debugging taste: an agent would
pattern-match a symptom-table row to a fix and ship a guess, because the
behaviour-changing material sat 550 lines down. Promote three gates to the top:
- "Before you debug: calibrate" -- you're likely OOD in research code; the
failure mode is overconfidence/impatience; the tables are a menu to widen the
search, never lookup-and-apply.
- "The debugging loop (judgment, not a checklist)" -- collect clues, hold a few
competing hypotheses scaled to the problem, sanity-check with the
likely/subtle/null triplet (shared vocab with research-journal), run the
cheapest discriminating observation, then act.
- "When stuck, read a working implementation" -- promoted from a buried Part 7.3
one-liner; extract the algorithm-done-right, the engineering tricks the paper
omits, and proven hyperparams; rank candidates by trust signal.
Collapse duplicated advice to pointers; de-bold Part 6.4 (8 bolded openers -> a
plain list). Net +10 lines, bold markers 112 -> 100.
Verified by a blind comprehension-by-inference quiz (5 cold-reader models, OLD
vs NEW): NEW 9.6/10 vs OLD 6.8/10, with the gap localized to the two added
sections (read-working-impl 5/5 vs 0/5; tables-as-menu 2.0 vs 1.6) while
untouched sections tied -- ruling out a "reads nicer" halo.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Add 3 new evidence files from modern open-source sources:
- karpathy_recipe_training_nn_2019.md: Karpathy's training recipe blog post
- nanochat_deepwiki_llm_pretraining_2026.md: 320+ HP sweeps for GPT-2-scale pretraining
- sanh_simple_considerations_hf_2021.md: HuggingFace NLP debugging notes
Add update-to-data ratio diagnostic to refs/diagnostics.md (target ~1e-3).
Add LLM pretraining gap note to SKILL.md intro linking the new sources.
Add tanh saturation % to logging checklist.
- Fix stale Part 2 cross-references to link to rl/SKILL.md
- Add McCandlish + Slavv back to parent Sources (cited in Part 7)
- Add back-links from refs/ files to parent SKILL.md
Moved 6.1 (static analysis grep patterns) and 6.2 (diagnostic code
snippets) to refs/static_analysis.md and refs/diagnostics.md.
Triage tree (6.3) stays in main with references to the ref files.
ml_debug/SKILL.md reduced from 7229w to 5093w (~30% from original).
Moved heat-exchanger-specific content from pinn/SKILL.md to
pinn/refs/heat_exchanger.md: complexity ladder table, known failure
modes (U->0, counterflow signs), property mappings (REFPROP/PCHIP),
multi-episode training. PINN skill is now domain-agnostic.
pinn/SKILL.md reduced from 4961w to 4274w (~14%).
Part 2 (RL-Specific Debugging) + RL-specific sources moved to
ml_debug/rl/SKILL.md as a sub-skill, following the pinn/ precedent.
Parent SKILL.md reduced from 9158w to 7229w (~21%).
General sources (Goodfellow, CS231n, Tobin, Ng) kept in parent.
Deep research to uplift LLMs for ML debugging, opinionated by source
selection. Distilled from Schulman, Jones, Rahtz, Goodfellow, CS231n,
FSDL, and more. Includes runnable diagnostic scripts and LLM-specific
anti-patterns.
Author: wassname (https://github.com/wassname)