ml-debug

mirror of https://github.com/wassname/ml-debug.git synced 2026-06-27 17:16:20 +08:00

Author	SHA1	Message	Date
wassname	8509ec3c30	folklore: promote Spinning Up to main; add a Research-taste section - Promote the general (non-RL-specific) Spinning Up lessons up to the main folklore: "broken code fails silently", "you can't tell it's broken if you can't see that it's breaking", and test on more than one setup. - Add gwern's "Unseeing" to the data theme: you can't read what you actually wrote, hence fresh eyes / a fresh-eyes subagent. - New "Research taste (adjacent to debugging)" section with verbatim quotes, each cached: Neel Nanda (your research is false by default; excitement is evidence of bullshit; read your data), Ulisse Mini (understand the system to shrink the search space), John Wentworth (gears-level models are capital investments vs cheap black boxes). All quotes verbatim from cached sources; 25/25 footnotes resolve. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 21:08:49 +08:00
wassname	a602ea5a0e	rl: quote Spinning Up (Achiam) on silent failure and bug-first debugging Spinning Up as a Deep RL Researcher was only a bare code link; it's the canonical RL-researcher guide and its debugging advice is gold. Cache the rigour/debugging sections verbatim and quote the sharpest lines in the RL sub-skill: "broken RL code almost always fails silently", "if it doesn't work, assume there's a bug", "measure everything ... you can't tell it's broken if you can't see that it's breaking", and test on more than one env. Add to RL sources. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 21:04:55 +08:00
wassname	ee4e9a5caa	folklore: add koaning, gwern, kidger, nanochat, cleanrl; trim lucidrains Gather debugging folklore from more practitioners, each a verbatim quote checked against a cached source copy (footnoted with line numbers): - koaning (Vincent Warmerdam), "Bad Labels": benchmark labels are often wrong; find them with confidence-sorted errors. - gwern, the tank-detection legend: the canonical data-leakage parable, plus the scout-mindset twist that it's a likely-unsourced urban legend. - Patrick Kidger, "Just Know Stuff": why research code is buggy ("kludge ... bugs that don't cripple things only because some other bug stops them") and "never accept the kludge". Plus a one-line jaxtyping pointer for shape bugs. - nanochat (Karpathy): BOS-alignment fake metric improvement; all-ranks must clip on inf (a multi-GPU bug single-GPU testing hides). - cleanrl "37 Implementation Details of PPO" -> RL sub-skill, as the canonical proof that reference-impl details (not ideas) decide whether PPO works. Trim the lucidrains item to one quote (it had ballooned). Add wassname credit + companion-gist link. All 20 footnotes resolve. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 20:59:36 +08:00
wassname	9911ac83c5	folklore: add lucidrains transformer-stability item (QK-norm, post-emb LN) Phil Wang's x-transformers is the canonical "the fix is in the code, not the paper" catalogue. Add a folklore item on the most debugging-relevant trick: QK / cosine-sim normalization to stop attention logits overflowing (the usual cause of transformer loss spikes/divergence), plus the BLOOM/YaLM post-embedding LayerNorm. Two verbatim lucidrains quotes, footnoted to the repo + a cached README copy with line numbers. Doubles as the modern concrete example for the read-a-working-implementation section. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 20:49:15 +08:00
wassname	38ec634ff3	restructure: folklore-first, quote-verified, with wassname intro Reorder around what's durable, per wassname's curation: - human-written intro up top; rename to "wassname's ML Debugging Folklore" - mindset first: calibrate -> mental models -> Part 1 general tricks (kept, they're well-based) -> read a working implementation when stuck - a Folklore section built from verbatim, source-checked quotes (Jones, Rahtz, Karpathy, Schulman, Henderson, Irpan, CS231n, Slavv, Goodfellow), each footnoted to the canonical URL + the cached copy with line numbers - LLM-agent babysitting (debugging loop, triage menu, anti-patterns) moved to the bottom where it belongs; triage reframed as a menu, not a flowchart - deeper one-off tricks split to refs/ (loss_surface, metric_stuck, sweeps), scrubbed of private tooling (wandb/just/SI/personal scripts) Quote integrity: every quote independently verified by fresh-eyes subagents against the cached sources; fixed a reformatted Schulman slide, a truncated Jones sentence, a reversed-order Rahtz stitch, a falsely-quoted Slavv phrase, and the 3e-4 line (now the real tweet, framed as the joke Karpathy confirmed it was, not gospel). lr_scheduler anti-pattern nuanced (warmup/cyclic matter). Remove superseded SKILL2.md draft. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 20:46:25 +08:00
wassname	cf9df71f6a	add SKILL2.md: condensed anchor proposal (74 vs 703 lines) Procedural/vibe anchor with gradual disclosure: calibrate + loop + non-obvious numbers inline, tables/triage/sweeps demoted to on-demand links into SKILL.md and refs/. Draft for side-by-side comparison; not wired in (SKILL.md remains the entry point). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 13:26:59 +08:00
wassname	ab827116b5	remove AI flourishes and rhetorical "X, not Y" framing - drop "detective at a scene, not a fortune teller", "guess wearing a fix's clothes", "that reflex is the enemy" - rephrase negative parallelisms in intro/calibrate/loop to positive (judgment not a checklist; mindset not ticking boxes; evidence not prior; isn't a recipe; it's a; menu not a procedure; code not abstract) - keep genuine instructional contrasts (relative error not absolute, etc.) - trim pseudocode comments to intent-only Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 12:56:35 +08:00
wassname	7410a7ccf3	restore -- attribution form for blockquote citations Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 12:52:43 +08:00
wassname	b6fad64930	loop pseudocode: pseudopy style (← assignment, ── divider, t̂) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 12:51:15 +08:00
wassname	90b11214f8	de-AI pass: drop em-dashes, flourishes; resolve in-file TODOs - convert all prose ' -- ' to commas/periods/parens (left code/CLI/arrows) - remove the antithesis flourish in the bisect step; inform not persuade - de-telegraph "no model, no forward pass, no GPU. pure math." - add non-exhaustive hedges (and so on / like) where lists implied closure - fix typos: authoritative (x2), sklearn, it indented - TODO: triage decision tree converted from ASCII art to nested bullets - TODO: add Further reading section linking docs/evidence/* files Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 12:49:28 +08:00
wassname	220bd8dc7f	fix typos: separate/reproduced/auditable, drop stray article Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 12:07:58 +08:00
wassname	715164416b	loop: add likelihood-ratio test selection, path bisection, falsifiers, pseudocode - triplet now carries a prior + cheapest falsifier (Check:) per hypothesis - discriminating-test step: forward-predict each hypothesis, prefer where predictions diverge (strong vs weak evidence) instead of just "discriminating" - new step: bisect the forward/backward path to localize where it breaks - compact pseudocode summary of the whole loop - resolve FIXME: drop references to the non-public research-journal skill Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 12:06:30 +08:00
wassname (Michael J Clark)	d5c7dec5a6	Update SKILL.md	2026-06-01 13:36:35 +08:00
wassname	779beee03e	refactor(ml_debug): tidy ordering/emphasis on the new top sections Three targeted polishes to the rewritten skill: - Reframe Part 1's "The hierarchy (work in order...)" -> "What 'collect clues' looks like": it's the catalog the loop's clue-collection step draws on, not a second master-procedure competing with "the debugging loop" 40 lines above. - Reorder: lead straight into calibrate -> loop -> read-impl; relocate the 2017-2021 caveat + LLM-pretraining pointers into a "Scope and modern pointers" block after the action sections, so the behaviour-changing content is the first screen instead of provenance. - Emphasis: give the "priors are a starting weight, not a verdict" line a concrete clause (traceback / loss-metric misalignment / right init-loss override the data prior) -- the weakest comprehension dim in the quiz. Before-vs-after panel A/B (6 cold readers): tie on ordering/clarity/conciseness/focus, each leaning slightly positive, no regression. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 10:15:41 +08:00
wassname	bb1a6bc61c	feat(ml_debug): lead with judgment gates over the symptom-lookup encyclopedia The skill was thorough but failed to instill debugging taste: an agent would pattern-match a symptom-table row to a fix and ship a guess, because the behaviour-changing material sat 550 lines down. Promote three gates to the top: - "Before you debug: calibrate" -- you're likely OOD in research code; the failure mode is overconfidence/impatience; the tables are a menu to widen the search, never lookup-and-apply. - "The debugging loop (judgment, not a checklist)" -- collect clues, hold a few competing hypotheses scaled to the problem, sanity-check with the likely/subtle/null triplet (shared vocab with research-journal), run the cheapest discriminating observation, then act. - "When stuck, read a working implementation" -- promoted from a buried Part 7.3 one-liner; extract the algorithm-done-right, the engineering tricks the paper omits, and proven hyperparams; rank candidates by trust signal. Collapse duplicated advice to pointers; de-bold Part 6.4 (8 bolded openers -> a plain list). Net +10 lines, bold markers 112 -> 100. Verified by a blind comprehension-by-inference quiz (5 cold-reader models, OLD vs NEW): NEW 9.6/10 vs OLD 6.8/10, with the gap localized to the two added sections (read-working-impl 5/5 vs 0/5; tables-as-menu 2.0 vs 1.6) while untouched sections tied -- ruling out a "reads nicer" halo. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 10:11:36 +08:00
wassname	fde5ac62fd	name	2026-04-09 05:09:25 +08:00
wassname	b159b0fba8	docs(ml_debug): annotate EMNLP 2018 NLP code tutorial; note sparse Adam embedding bug	2026-03-10 05:48:36 +08:00
wassname	0fa4009fd5	docs(ml_debug): update Grus annotation after reading full slides; note EMNLP 2018 lead	2026-03-10 05:45:56 +08:00
wassname	52ff6c17cd	docs(ml_debug): annotate Joel Grus slides -- SE/reproducibility talk, not debugging	2026-03-10 05:45:16 +08:00
wassname	3dffe890b1	docs(ml_debug): annotate sanh outbound links with content summaries	2026-03-10 05:40:31 +08:00
wassname	c9c53f8e7f	feat(ml_debug): expand nanochat evidence, add lec4 diagnostics file nanochat_deepwiki_llm_pretraining_2026.md rewritten with content from dev/LOG.md and deepwiki sections 3/12/13: - 14 labelled findings with direct quotes and empirical numbers - Dataset >> architecture (27% gain, 5 failed attempts before ClimbMix) - Scale-dependent HP sensitivity (d12 HPs hurt d20) - Multi-axis validation (steps/wall-clock/FLOPs) - Negative results: MoE/SwiGLU/MTP all failed at this scale - MFU monitoring, batch size Bopt∝D^0.383, WD∝1/width² tables - FP8 reality: 1.38x micro → 1.17x full → 5% capability-matched - Python GC 500ms overhead, torch.compile recompile gotcha karpathy_nn_zero_to_hero_lec4_diagnostics.md: new evidence file - Activation saturation check (tanh >0.97) - Gradient distribution check per-layer - Grad:data ratio (target ~1e-3) - Update-to-data ratio tracker with full plotting code - Incremental improvement log from notebook	2026-03-10 05:38:33 +08:00
wassname	ced4edc200	feat(ml_debug): add Karpathy recipe + nanochat evidence, update-ratio diagnostic Add 3 new evidence files from modern open-source sources: - karpathy_recipe_training_nn_2019.md: Karpathy's training recipe blog post - nanochat_deepwiki_llm_pretraining_2026.md: 320+ HP sweeps for GPT-2-scale pretraining - sanh_simple_considerations_hf_2021.md: HuggingFace NLP debugging notes Add update-to-data ratio diagnostic to refs/diagnostics.md (target ~1e-3). Add LLM pretraining gap note to SKILL.md intro linking the new sources. Add tanh saturation % to logging checklist.	2026-03-10 05:32:37 +08:00
wassname	bbe3fe0985	feat(ml_debug): add JAX grep patterns and diagnostic equivalents refs/static_analysis.md: JAX-specific grep patterns (in-place mutation, print side effects, key reuse, numpy escape, cast behavior). refs/diagnostics.md: JAX equivalents table (NaN detection, gradcheck, disable_jit, debug.print, debug.breakpoint, checkify).	2026-03-06 14:10:39 +08:00
wassname	7ac7aacac7	fix(ml_debug): address review feedback - Fix stale Part 2 cross-references to link to rl/SKILL.md - Add McCandlish + Slavv back to parent Sources (cited in Part 7) - Add back-links from refs/ files to parent SKILL.md	2026-03-06 13:59:48 +08:00
wassname	70c28f06ac	refactor(ml_debug): extract grep patterns and diagnostics to refs/ Moved 6.1 (static analysis grep patterns) and 6.2 (diagnostic code snippets) to refs/static_analysis.md and refs/diagnostics.md. Triage tree (6.3) stays in main with references to the ref files. ml_debug/SKILL.md reduced from 7229w to 5093w (~30% from original).	2026-03-06 13:54:37 +08:00
wassname	48d4c1044a	refactor(pinn): extract heat exchanger specifics to refs/ Moved heat-exchanger-specific content from pinn/SKILL.md to pinn/refs/heat_exchanger.md: complexity ladder table, known failure modes (U->0, counterflow signs), property mappings (REFPROP/PCHIP), multi-episode training. PINN skill is now domain-agnostic. pinn/SKILL.md reduced from 4961w to 4274w (~14%).	2026-03-06 13:39:53 +08:00
wassname	7f34f26a5c	refactor(ml_debug): extract RL debugging into rl/ sub-skill Part 2 (RL-Specific Debugging) + RL-specific sources moved to ml_debug/rl/SKILL.md as a sub-skill, following the pinn/ precedent. Parent SKILL.md reduced from 9158w to 7229w (~21%). General sources (Goodfellow, CS231n, Tobin, Ng) kept in parent.	2026-03-06 13:36:29 +08:00
wassname	698b77f2d3	chore: fix .gitignore (dlbooks path, *_log.md pattern)	2026-03-06 12:22:22 +08:00
wassname	9e30cf7039	chore: remove duplicate subtitle file and log (now gitignored)	2026-03-06 12:21:54 +08:00
wassname (Michael J Clark)	fa41fecef2	Delete docs/dlbooks	2026-03-06 12:19:12 +08:00
wassname	7a9c667aa7	chore: add wassname attribution to description, gitignore dlbooks	2026-03-06 12:17:50 +08:00
wassname	463c8fdbbc	fix: apply Gemini review fixes (device kwarg, gradcheck requires_grad, torch prefix) Review: Gemini 3.1 Pro approved. 3 fixes applied: - pinn/SKILL.md: PchipFunction torch.tensor missing device=h.device (GPU crash) - SKILL.md: gradcheck needs .requires_grad_(True) on doubled inputs - SKILL.md: loss surface pseudocode now has torch. prefix + indexing='ij'	2026-03-06 12:15:37 +08:00
wassname	2db012dd2c	docs(pinn): add Wang 2021 and Rathore 2024 evidence files	2026-03-06 12:12:51 +08:00
wassname	a90624b36d	feat(pinn): add pinn/ sub-skill with SKILL.md and evidence SKILL.md: 478-line PINN training best practices (complexity ladder, nondim, architecture, optimization, loss design, sampling, property mappings, ConFIG, domain decomposition). docs/evidence/: 6 files -- krishnapriyan2021, sukumar2022, wang2022 causal, wang2022+2023 expert guides, Brunton youtube transcripts. Missing evidence (to fetch): Wang 2001.04536 (gradient pathologies), Rathore 2402.01868 (ICML loss landscape). Author: wassname (https://github.com/wassname)	2026-03-06 11:48:41 +08:00
wassname	51c9a2df44	docs: add README with author credit and usage	2026-03-06 10:16:24 +08:00
wassname	95fee7b5cb	chore: include Goodfellow chapters (author encourages sharing)	2026-03-06 10:16:00 +08:00
wassname	4393cceefd	initial: ML debugging folklore skill Deep research to uplift LLMs for ML debugging, opinionated by source selection. Distilled from Schulman, Jones, Rahtz, Goodfellow, CS231n, FSDL, and more. Includes runnable diagnostic scripts and LLM-specific anti-patterns. Author: wassname (https://github.com/wassname)	2026-03-06 10:11:30 +08:00

37 Commits