60 Commits

Author SHA1 Message Date
wassname f8f512f603 Cite Irpan in research taste (signs-of-life, seed canary)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-25 10:31:39 +08:00
wassname 3fe6cb9ad9 Replace OCR-garbled Schulman cache with clean slide transcript
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-25 10:31:39 +08:00
wassname 67d4dc90bb Document quote-first evidence style 2026-06-25 10:31:39 +08:00
wassname 20f03f20b8 Expand research taste appendix with expert quotes 2026-06-25 10:31:39 +08:00
wassname 8fc2c0bbd0 Add research taste evidence appendix 2026-06-25 10:31:39 +08:00
wassname (Michael J Clark) 3f3f95a3b4 Update SKILL.md with debugging strategies and folklore
Reorganize and refine debugging folklore and hypotheses for RL implementations.
2026-06-14 08:37:57 +08:00
wassname b8c3ffcf11 gpt5.5/fable 2026-06-12 09:30:25 +08:00
wassname 160bd040cc docs: add sourced transformer report folklore 2026-06-12 07:02:58 +08:00
wassname 3e28a950e9 docs: clarify competing-worlds debugging loop 2026-06-12 06:53:36 +08:00
wassname 8b9a1d62ed docs: resolve ml-debug TODO references 2026-06-12 06:52:38 +08:00
wassname 966f948d36 docs: refine numerical and scheduler debugging guidance 2026-06-12 06:35:29 +08:00
wassname (Michael J Clark) e58eda360b Update SKILL.md with TODOs for future content
Added TODO items for additional references and empirical evidence regarding transformers and model sizes.
2026-06-11 21:35:49 +08:00
wassname (Michael J Clark) 1ad74e14c6 Update loss_surface.md 2026-06-11 21:21:01 +08:00
wassname (Michael J Clark) 6e9a3ca633 Revise introduction in diagnostics.md
Updated the introduction to provide context for diagnostic code snippets.
2026-06-11 21:18:07 +08:00
wassname 30ac76053e chore: drop links to deleted/tombstone gists (repo is canonical now)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 16:55:04 +08:00
wassname 0837f27f08 fix: companion gist link pointed at the wrong gist
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 16:41:50 +08:00
wassname 8cd3c61050 folklore: tuning playbook, Domingos, Bekman loss spikes, Ng error analysis; LLM-judge bias appendix
- SKILL.md: 3 new entries (exploration-over-exploitation + nuisance HPs,
  test-set contamination, loss-spikes-mean-bad-data-pocket) and an Ng
  100-misclassified-examples quote under inspect-the-data
- refs/llm_judges.md: position/verbosity/self-preference biases (Zheng,
  Wang 66/80 flip, Panickssery) + mitigation checklist from verdict docs
- Lones pitfalls linked as the exhaustive 36-item do/don't checklist
- 6 new frozen evidence files; Hamel evals link in further reading

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 15:30:41 +08:00
wassname 2a2f5045bb folklore: add Karpathy common-mistakes tweet and Sculley CACE principle
Both quote-verbatim with frozen evidence: the 2018 tweet thread (mirrored
via threadreaderapp, x.com blocks fetching) slots after overfit-one-batch;
CACE (NIPS 2015, entanglement section transcribed from the PDF) gives
Always-Be-Ablating its why.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 14:43:47 +08:00
wassname fb753d093e restructure: quotes-first SKILL.md, synthesized playbook split out
SKILL.md is now folklore only: verbatim practitioner quotes ordered
most-general-first, transformer/LLM fine-tuning entries in their own
section, minimal context, links and footnotes. New sources: unsloth,
axolotl (+training stability), HF course ch8.4, Bekman debug_utils
(evidence frozen in docs/evidence/).

The synthesized material (mental models, priors, symptom tables, agent
loop, triage, anti-patterns) moves to PLAYBOOK.md, framed as menus of
hypotheses rather than authoritative diagnoses. Made-up symptom tables
no longer sit next to sourced quotes.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 14:33:32 +08:00
wassname 8ee980d62f diagnostics: add NaN-poisoning leakage tracer + Karpathy backprop-to-input check; README citation
NaN poisoning: inject NaN where info must not come from (future/test/labels), run the real pipeline, assert past outputs stay finite. Documents false negatives (pandas skipna, nanmean) and false positives (softmax rows, batch stats). Backprop-to-input is its gradient dual for inside the model; quote already frozen in docs/evidence/karpathy_recipe_training_nn_2019.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 10:18:51 +08:00
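The NaN-poisoning check described in commit 8ee980d62f can be sketched in plain Python. This is a minimal illustration, not the repository's tracer: the two feature functions and the helper `leaks_from_future` are hypothetical names invented here. The idea follows the commit message: put NaN where information must not come from (here, the future), run the real feature code, and assert that outputs before the cutoff stay finite. As the commit notes, NaN-skipping aggregations (pandas skipna, nanmean) would hide a leak, so this sketch uses a plain sum.

```python
import math


def causal_rolling_mean(xs, window):
    """Mean of the last `window` values up to and including index t."""
    out = []
    for t in range(len(xs)):
        lo = max(0, t - window + 1)
        chunk = xs[lo:t + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def leaky_centered_mean(xs, window):
    """Deliberately buggy: a centered window reads one step into the future."""
    out = []
    for t in range(len(xs)):
        lo = max(0, t - window // 2)
        hi = min(len(xs), t + window // 2 + 1)
        chunk = xs[lo:hi]
        out.append(sum(chunk) / len(chunk))
    return out


def leaks_from_future(feature_fn, xs, cutoff, **kwargs):
    """Poison xs[cutoff:] with NaN; return True if any output before cutoff is non-finite."""
    poisoned = list(xs)
    for i in range(cutoff, len(poisoned)):
        poisoned[i] = float("nan")
    out = feature_fn(poisoned, **kwargs)
    return any(not math.isfinite(v) for v in out[:cutoff])


series = [float(i) for i in range(10)]
causal_leak = leaks_from_future(causal_rolling_mean, series, cutoff=6, window=3)
centered_leak = leaks_from_future(leaky_centered_mean, series, cutoff=6, window=3)
print(causal_leak, centered_leak)  # False True
```

The centered window at index 5 reads index 6, which is poisoned, so the NaN propagates into a "past" output and the leak is flagged.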
wassname (Michael J Clark) 53fb1c4bda Update SKILL.md 2026-06-03 08:19:07 +08:00
wassname (Michael J Clark) 8e1f9dec6d Update README.md 2026-06-03 08:18:11 +08:00
wassname (Michael J Clark) d1b1608a49 Update README.md 2026-06-03 08:17:04 +08:00
wassname 8509ec3c30 folklore: promote Spinning Up to main; add a Research-taste section
- Promote the general (non-RL-specific) Spinning Up lessons up to the main
  folklore: "broken code fails silently", "you can't tell it's broken if you
  can't see that it's breaking", and test on more than one setup.
- Add gwern's "Unseeing" to the data theme: you can't read what you actually
  wrote, hence fresh eyes / a fresh-eyes subagent.
- New "Research taste (adjacent to debugging)" section with verbatim quotes,
  each cached: Neel Nanda (your research is false by default; excitement is
  evidence of bullshit; read your data), Ulisse Mini (understand the system to
  shrink the search space), John Wentworth (gears-level models are capital
  investments vs cheap black boxes).

All quotes verbatim from cached sources; 25/25 footnotes resolve.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 21:08:49 +08:00
wassname a602ea5a0e rl: quote Spinning Up (Achiam) on silent failure and bug-first debugging
Spinning Up as a Deep RL Researcher was only a bare code link; it's the
canonical RL-researcher guide and its debugging advice is gold. Cache the
rigour/debugging sections verbatim and quote the sharpest lines in the RL
sub-skill: "broken RL code almost always fails silently", "if it doesn't work,
assume there's a bug", "measure everything ... you can't tell it's broken if
you can't see that it's breaking", and test on more than one env. Add to RL
sources.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 21:04:55 +08:00
wassname ee4e9a5caa folklore: add koaning, gwern, kidger, nanochat, cleanrl; trim lucidrains
Gather debugging folklore from more practitioners, each a verbatim quote
checked against a cached source copy (footnoted with line numbers):
- koaning (Vincent Warmerdam), "Bad Labels": benchmark labels are often wrong;
  find them with confidence-sorted errors.
- gwern, the tank-detection legend: the canonical data-leakage parable, plus
  the scout-mindset twist that it's a likely-unsourced urban legend.
- Patrick Kidger, "Just Know Stuff": why research code is buggy ("kludge ...
  bugs that don't cripple things only because some other bug stops them") and
  "never accept the kludge". Plus a one-line jaxtyping pointer for shape bugs.
- nanochat (Karpathy): BOS-alignment fake metric improvement; all-ranks must
  clip on inf (a multi-GPU bug single-GPU testing hides).
- cleanrl "37 Implementation Details of PPO" -> RL sub-skill, as the canonical
  proof that reference-impl details (not ideas) decide whether PPO works.

Trim the lucidrains item to one quote (it had ballooned). Add wassname credit
+ companion-gist link. All 20 footnotes resolve.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 20:59:36 +08:00
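The "confidence-sorted errors" idea from commit ee4e9a5caa can be sketched as follows. This is an assumed reading of the phrase, not code from the repository or from the cited post: collect the examples where the model disagrees with the label, then review the ones where the model was most confident first, since those are the likeliest label errors. The record fields (`id`, `label`, `pred`, `confidence`) and the function name are hypothetical.

```python
def confident_errors(examples, top_k=3):
    """Return misclassified examples, most confident wrong predictions first."""
    wrong = [ex for ex in examples if ex["pred"] != ex["label"]]
    return sorted(wrong, key=lambda ex: ex["confidence"], reverse=True)[:top_k]


# Toy predictions; in practice these would come from a held-out evaluation run.
examples = [
    {"id": "a", "label": "cat", "pred": "cat", "confidence": 0.99},
    {"id": "b", "label": "cat", "pred": "dog", "confidence": 0.97},
    {"id": "c", "label": "dog", "pred": "cat", "confidence": 0.55},
    {"id": "d", "label": "dog", "pred": "cat", "confidence": 0.91},
]

review_queue = [ex["id"] for ex in confident_errors(examples, top_k=2)]
print(review_queue)  # ['b', 'd']
```

Low-confidence disagreements such as "c" are more likely genuine model mistakes, so they sit lower in the review queue.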
wassname 9911ac83c5 folklore: add lucidrains transformer-stability item (QK-norm, post-emb LN)
Phil Wang's x-transformers is the canonical "the fix is in the code, not the
paper" catalogue. Add a folklore item on the most debugging-relevant trick:
QK / cosine-sim normalization to stop attention logits overflowing (the usual
cause of transformer loss spikes/divergence), plus the BLOOM/YaLM
post-embedding LayerNorm. Two verbatim lucidrains quotes, footnoted to the repo
+ a cached README copy with line numbers. Doubles as the modern concrete
example for the read-a-working-implementation section.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 20:49:15 +08:00
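To illustrate why QK / cosine-sim normalization (commit 9911ac83c5) keeps attention logits from overflowing, here is a minimal pure-Python sketch for a single query/key pair. It is not the x-transformers implementation: the function names are hypothetical and the `scale` value of 10.0 is an assumed, illustrative temperature. The point is only that after L2-normalizing q and k, the logit magnitude cannot exceed `scale`, however large the raw vectors grow.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v, eps=1e-12):
    """Scale v to unit length; eps guards against division by zero."""
    norm = math.sqrt(dot(v, v))
    return [x / max(norm, eps) for x in v]


def raw_logit(q, k):
    """Standard scaled dot-product logit: q.k / sqrt(d)."""
    return dot(q, k) / math.sqrt(len(q))


def cosine_sim_logit(q, k, scale=10.0):
    """Cosine-similarity logit: bounded in [-scale, scale] by construction."""
    return scale * dot(l2_normalize(q), l2_normalize(k))


# Large-magnitude vectors stand in for activations that have drifted during training.
q = [300.0, -400.0]
k = [600.0, -800.0]

unbounded = raw_logit(q, k)
bounded = cosine_sim_logit(q, k)
print(unbounded, bounded)
```

The raw logit grows with the product of the vector norms, while the normalized one stays at or below `scale` in magnitude.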
wassname 38ec634ff3 restructure: folklore-first, quote-verified, with wassname intro
Reorder around what's durable, per wassname's curation:
- human-written intro up top; rename to "wassname's ML Debugging Folklore"
- mindset first: calibrate -> mental models -> Part 1 general tricks (kept,
  they're well-based) -> read a working implementation when stuck
- a Folklore section built from verbatim, source-checked quotes (Jones,
  Rahtz, Karpathy, Schulman, Henderson, Irpan, CS231n, Slavv, Goodfellow),
  each footnoted to the canonical URL + the cached copy with line numbers
- LLM-agent babysitting (debugging loop, triage menu, anti-patterns) moved to
  the bottom where it belongs; triage reframed as a menu, not a flowchart
- deeper one-off tricks split to refs/ (loss_surface, metric_stuck, sweeps),
  scrubbed of private tooling (wandb/just/SI/personal scripts)

Quote integrity: every quote independently verified by fresh-eyes subagents
against the cached sources; fixed a reformatted Schulman slide, a truncated
Jones sentence, a reversed-order Rahtz stitch, a falsely-quoted Slavv phrase,
and the 3e-4 line (now the real tweet, framed as the joke Karpathy confirmed
it was, not gospel). lr_scheduler anti-pattern nuanced (warmup/cyclic matter).

Remove superseded SKILL2.md draft.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 20:46:25 +08:00
wassname cf9df71f6a add SKILL2.md: condensed anchor proposal (74 vs 703 lines)
Procedural/vibe anchor with gradual disclosure: calibrate + loop +
non-obvious numbers inline, tables/triage/sweeps demoted to on-demand
links into SKILL.md and refs/. Draft for side-by-side comparison; not
wired in (SKILL.md remains the entry point).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 13:26:59 +08:00
wassname ab827116b5 remove AI flourishes and rhetorical "X, not Y" framing
- drop "detective at a scene, not a fortune teller", "guess wearing a
  fix's clothes", "that reflex is the enemy"
- rephrase negative parallelisms in intro/calibrate/loop to positive
  (judgment not a checklist; mindset not ticking boxes; evidence not
  prior; isn't a recipe; it's a; menu not a procedure; code not abstract)
- keep genuine instructional contrasts (relative error not absolute, etc.)
- trim pseudocode comments to intent-only

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 12:56:35 +08:00
wassname 7410a7ccf3 restore -- attribution form for blockquote citations
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 12:52:43 +08:00
wassname b6fad64930 loop pseudocode: pseudopy style (← assignment, ── divider, t̂)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 12:51:15 +08:00
wassname 90b11214f8 de-AI pass: drop em-dashes, flourishes; resolve in-file TODOs
- convert all prose ' -- ' to commas/periods/parens (left code/CLI/arrows)
- remove the antithesis flourish in the bisect step; inform not persuade
- de-telegraph "no model, no forward pass, no GPU. pure math."
- add non-exhaustive hedges (and so on / like) where lists implied closure
- fix typos: authoritative (x2), sklearn, it indented
- TODO: triage decision tree converted from ASCII art to nested bullets
- TODO: add Further reading section linking docs/evidence/* files

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 12:49:28 +08:00
wassname 220bd8dc7f fix typos: separate/reproduced/auditable, drop stray article
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 12:07:58 +08:00
wassname 715164416b loop: add likelihood-ratio test selection, path bisection, falsifiers, pseudocode
- triplet now carries a prior + cheapest falsifier (Check:) per hypothesis
- discriminating-test step: forward-predict each hypothesis, prefer where
  predictions diverge (strong vs weak evidence) instead of just "discriminating"
- new step: bisect the forward/backward path to localize where it breaks
- compact pseudocode summary of the whole loop
- resolve FIXME: drop references to the non-public research-journal skill

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 12:06:30 +08:00
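The discriminating-test step from commit 715164416b (forward-predict each hypothesis, prefer tests where the predictions diverge) could look roughly like the sketch below. This is an assumption-laden illustration, not the skill's pseudocode: the test names, costs, probabilities, and the scoring rule (largest absolute log likelihood ratio for one outcome, divided by cost) are all invented here. A fuller version would score every possible outcome of a test, not only the "positive" one.

```python
import math
from itertools import combinations


def divergence(p_positive):
    """Largest |log likelihood ratio| between any two hypotheses for a positive outcome."""
    return max(
        abs(math.log(p_positive[a] / p_positive[b]))
        for a, b in combinations(p_positive, 2)
    )


def pick_test(tests):
    """Choose the test with the most divergence per unit cost."""
    return max(tests, key=lambda name: divergence(tests[name]["p_positive"]) / tests[name]["cost"])


# Hypothetical forward predictions: P(test comes back positive | hypothesis).
tests = {
    "overfit_one_batch": {
        "cost": 1.0,
        "p_positive": {"likely": 0.9, "subtle": 0.2, "null": 0.5},
    },
    "rerun_with_new_seed": {
        "cost": 1.0,
        "p_positive": {"likely": 0.5, "subtle": 0.5, "null": 0.6},
    },
}

best = pick_test(tests)
print(best)  # overfit_one_batch
```

The second test is weak evidence because every hypothesis predicts nearly the same outcome, so whatever it returns barely moves the priors.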
wassname (Michael J Clark) d5c7dec5a6 Update SKILL.md 2026-06-01 13:36:35 +08:00
wassname 779beee03e refactor(ml_debug): tidy ordering/emphasis on the new top sections
Three targeted polishes to the rewritten skill:
- Reframe Part 1's "The hierarchy (work in order...)" -> "What 'collect clues'
  looks like": it's the catalog the loop's clue-collection step draws on, not a
  second master-procedure competing with "the debugging loop" 40 lines above.
- Reorder: lead straight into calibrate -> loop -> read-impl; relocate the
  2017-2021 caveat + LLM-pretraining pointers into a "Scope and modern pointers"
  block after the action sections, so the behaviour-changing content is the
  first screen instead of provenance.
- Emphasis: give the "priors are a starting weight, not a verdict" line a
  concrete clause (traceback / loss-metric misalignment / right init-loss
  override the data prior) -- the weakest comprehension dim in the quiz.

Before-vs-after panel A/B (6 cold readers): tie on ordering/clarity/
conciseness/focus, each leaning slightly positive, no regression.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 10:15:41 +08:00
wassname bb1a6bc61c feat(ml_debug): lead with judgment gates over the symptom-lookup encyclopedia
The skill was thorough but failed to instill debugging taste: an agent would
pattern-match a symptom-table row to a fix and ship a guess, because the
behaviour-changing material sat 550 lines down. Promote three gates to the top:

- "Before you debug: calibrate" -- you're likely OOD in research code; the
  failure mode is overconfidence/impatience; the tables are a menu to widen the
  search, never lookup-and-apply.
- "The debugging loop (judgment, not a checklist)" -- collect clues, hold a few
  competing hypotheses scaled to the problem, sanity-check with the
  likely/subtle/null triplet (shared vocab with research-journal), run the
  cheapest discriminating observation, then act.
- "When stuck, read a working implementation" -- promoted from a buried Part 7.3
  one-liner; extract the algorithm-done-right, the engineering tricks the paper
  omits, and proven hyperparams; rank candidates by trust signal.

Collapse duplicated advice to pointers; de-bold Part 6.4 (8 bolded openers -> a
plain list). Net +10 lines, bold markers 112 -> 100.

Verified by a blind comprehension-by-inference quiz (5 cold-reader models, OLD
vs NEW): NEW 9.6/10 vs OLD 6.8/10, with the gap localized to the two added
sections (read-working-impl 5/5 vs 0/5; tables-as-menu 2.0 vs 1.6) while
untouched sections tied -- ruling out a "reads nicer" halo.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 10:11:36 +08:00
wassname fde5ac62fd name 2026-04-09 05:09:25 +08:00
wassname b159b0fba8 docs(ml_debug): annotate EMNLP 2018 NLP code tutorial; note sparse Adam embedding bug 2026-03-10 05:48:36 +08:00
wassname 0fa4009fd5 docs(ml_debug): update Grus annotation after reading full slides; note EMNLP 2018 lead 2026-03-10 05:45:56 +08:00
wassname 52ff6c17cd docs(ml_debug): annotate Joel Grus slides -- SE/reproducibility talk, not debugging 2026-03-10 05:45:16 +08:00
wassname 3dffe890b1 docs(ml_debug): annotate sanh outbound links with content summaries 2026-03-10 05:40:31 +08:00
wassname c9c53f8e7f feat(ml_debug): expand nanochat evidence, add lec4 diagnostics file
nanochat_deepwiki_llm_pretraining_2026.md rewritten with content from
dev/LOG.md and deepwiki sections 3/12/13:
- 14 labelled findings with direct quotes and empirical numbers
- Dataset >> architecture (27% gain, 5 failed attempts before ClimbMix)
- Scale-dependent HP sensitivity (d12 HPs hurt d20)
- Multi-axis validation (steps/wall-clock/FLOPs)
- Negative results: MoE/SwiGLU/MTP all failed at this scale
- MFU monitoring, batch size Bopt∝D^0.383, WD∝1/width² tables
- FP8 reality: 1.38x micro → 1.17x full → 5% capability-matched
- Python GC 500ms overhead, torch.compile recompile gotcha

karpathy_nn_zero_to_hero_lec4_diagnostics.md: new evidence file
- Activation saturation check (tanh >0.97)
- Gradient distribution check per-layer
- Grad:data ratio (target ~1e-3)
- Update-to-data ratio tracker with full plotting code
- Incremental improvement log from notebook
2026-03-10 05:38:33 +08:00
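Two of the diagnostics listed in commit c9c53f8e7f, the tanh saturation check (>0.97) and the update-to-data ratio (target ~1e-3), can be sketched without any framework. This is a minimal illustration assuming plain SGD, so that the update is `lr * grad`; the function names and toy numbers are hypothetical, and the evidence file's own plotting code is not reproduced here.

```python
import math
import statistics


def update_to_data_ratio(params, grads, lr):
    """std(lr * grad) / std(param) for one parameter tensor, flattened to a list."""
    updates = [lr * g for g in grads]
    return statistics.stdev(updates) / statistics.stdev(params)


def tanh_saturation_fraction(pre_activations, threshold=0.97):
    """Fraction of tanh outputs whose magnitude exceeds the threshold."""
    acts = [math.tanh(z) for z in pre_activations]
    return sum(abs(a) > threshold for a in acts) / len(acts)


# Toy values: the gradients are 0.1x the parameters, so with lr=0.01 the ratio is 1e-3.
params = [0.5, -0.3, 0.8, -0.1, 0.2]
grads = [0.05, -0.03, 0.08, -0.01, 0.02]
ratio = update_to_data_ratio(params, grads, lr=0.01)

# tanh(3.0) and tanh(-4.0) exceed 0.97 in magnitude; tanh(0.0) and tanh(0.5) do not.
saturated = tanh_saturation_fraction([0.0, 0.5, 3.0, -4.0])

print(math.log10(ratio), saturated)
```

Tracking `log10(ratio)` per layer over training steps makes it easy to compare against the roughly -3 target mentioned in the commit.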
wassname ced4edc200 feat(ml_debug): add Karpathy recipe + nanochat evidence, update-ratio diagnostic
Add 3 new evidence files from modern open-source sources:
- karpathy_recipe_training_nn_2019.md: Karpathy's training recipe blog post
- nanochat_deepwiki_llm_pretraining_2026.md: 320+ HP sweeps for GPT-2-scale pretraining
- sanh_simple_considerations_hf_2021.md: HuggingFace NLP debugging notes

Add update-to-data ratio diagnostic to refs/diagnostics.md (target ~1e-3).
Add LLM pretraining gap note to SKILL.md intro linking the new sources.
Add tanh saturation % to logging checklist.
2026-03-10 05:32:37 +08:00
wassname bbe3fe0985 feat(ml_debug): add JAX grep patterns and diagnostic equivalents
refs/static_analysis.md: JAX-specific grep patterns (in-place mutation,
print side effects, key reuse, numpy escape, cast behavior).
refs/diagnostics.md: JAX equivalents table (NaN detection, gradcheck,
disable_jit, debug.print, debug.breakpoint, checkify).
2026-03-06 14:10:39 +08:00
wassname 7ac7aacac7 fix(ml_debug): address review feedback
- Fix stale Part 2 cross-references to link to rl/SKILL.md
- Add McCandlish + Slavv back to parent Sources (cited in Part 7)
- Add back-links from refs/ files to parent SKILL.md
2026-03-06 13:59:48 +08:00
wassname 70c28f06ac refactor(ml_debug): extract grep patterns and diagnostics to refs/
Moved 6.1 (static analysis grep patterns) and 6.2 (diagnostic code
snippets) to refs/static_analysis.md and refs/diagnostics.md.
Triage tree (6.3) stays in main with references to the ref files.
ml_debug/SKILL.md reduced from 7229w to 5093w (~30% from original).
2026-03-06 13:54:37 +08:00
wassname 48d4c1044a refactor(pinn): extract heat exchanger specifics to refs/
Moved heat-exchanger-specific content from pinn/SKILL.md to
pinn/refs/heat_exchanger.md: complexity ladder table, known failure
modes (U->0, counterflow signs), property mappings (REFPROP/PCHIP),
multi-episode training. PINN skill is now domain-agnostic.
pinn/SKILL.md reduced from 4961w to 4274w (~14%).
2026-03-06 13:39:53 +08:00
wassname 7f34f26a5c refactor(ml_debug): extract RL debugging into rl/ sub-skill
Part 2 (RL-Specific Debugging) + RL-specific sources moved to
ml_debug/rl/SKILL.md as a sub-skill, following the pinn/ precedent.
Parent SKILL.md reduced from 9158w to 7229w (~21%).
General sources (Goodfellow, CS231n, Tobin, Ng) kept in parent.
2026-03-06 13:36:29 +08:00