ml-debug

mirror of https://github.com/wassname/ml-debug.git synced 2026-06-27 13:45:14 +08:00

Author	SHA1	Message	Date
wassname (Michael J Clark)	e92ec01efe	Enhance ML debugging guidance for LLM agents Added guidance for LLM agents on reading and calibrating their confidence levels in ML debugging.	2026-06-26 09:52:43 +08:00
wassname	5fca5ad2b2	Refresh Schulman cache anchors after transcript rewrite Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-25 10:31:39 +08:00
wassname	f8f512f603	Cite Irpan in research taste (signs-of-life, seed canary) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-25 10:31:39 +08:00
wassname	3fe6cb9ad9	Replace OCR-garbled Schulman cache with clean slide transcript Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-25 10:31:39 +08:00
wassname	67d4dc90bb	Document quote-first evidence style	2026-06-25 10:31:39 +08:00
wassname	20f03f20b8	Expand research taste appendix with expert quotes	2026-06-25 10:31:39 +08:00
wassname	8fc2c0bbd0	Add research taste evidence appendix	2026-06-25 10:31:39 +08:00
wassname (Michael J Clark)	3f3f95a3b4	Update SKILL.md with debugging strategies and folklore Reorganize and refine debugging folklore and hypotheses for RL implementations.	2026-06-14 08:37:57 +08:00
wassname	b8c3ffcf11	gpt5.5/fable	2026-06-12 09:30:25 +08:00
wassname	160bd040cc	docs: add sourced transformer report folklore	2026-06-12 07:02:58 +08:00
wassname	3e28a950e9	docs: clarify competing-worlds debugging loop	2026-06-12 06:53:36 +08:00
wassname	8b9a1d62ed	docs: resolve ml-debug TODO references	2026-06-12 06:52:38 +08:00
wassname	966f948d36	docs: refine numerical and scheduler debugging guidance	2026-06-12 06:35:29 +08:00
wassname (Michael J Clark)	e58eda360b	Update SKILL.md with TODOs for future content Added TODO items for additional references and empirical evidence regarding transformers and model sizes.	2026-06-11 21:35:49 +08:00
wassname (Michael J Clark)	1ad74e14c6	Update loss_surface.md	2026-06-11 21:21:01 +08:00
wassname (Michael J Clark)	6e9a3ca633	Revise introduction in diagnostics.md Updated the introduction to provide context for diagnostic code snippets.	2026-06-11 21:18:07 +08:00
wassname	30ac76053e	chore: drop links to deleted/tombstone gists (repo is canonical now) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 16:55:04 +08:00
wassname	0837f27f08	fix: companion gist link pointed at the wrong gist Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 16:41:50 +08:00
wassname	8cd3c61050	folklore: tuning playbook, Domingos, Bekman loss spikes, Ng error analysis; LLM-judge bias appendix - SKILL.md: 3 new entries (exploration-over-exploitation + nuisance HPs, test-set contamination, loss-spikes-mean-bad-data-pocket) and an Ng 100-misclassified-examples quote under inspect-the-data - refs/llm_judges.md: position/verbosity/self-preference biases (Zheng, Wang 66/80 flip, Panickssery) + mitigation checklist from verdict docs - Lones pitfalls linked as the exhaustive 36-item do/don't checklist - 6 new frozen evidence files; Hamel evals link in further reading Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 15:30:41 +08:00
wassname	2a2f5045bb	folklore: add Karpathy common-mistakes tweet and Sculley CACE principle Both quote-verbatim with frozen evidence: the 2018 tweet thread (mirrored via threadreaderapp, x.com blocks fetching) slots after overfit-one-batch; CACE (NIPS 2015, entanglement section transcribed from the PDF) gives Always-Be-Ablating its why. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 14:43:47 +08:00
wassname	fb753d093e	restructure: quotes-first SKILL.md, synthesized playbook split out SKILL.md is now folklore only: verbatim practitioner quotes ordered most-general-first, transformer/LLM fine-tuning entries in their own section, minimal context, links and footnotes. New sources: unsloth, axolotl (+training stability), HF course ch8.4, Bekman debug_utils (evidence frozen in docs/evidence/). The synthesized material (mental models, priors, symptom tables, agent loop, triage, anti-patterns) moves to PLAYBOOK.md, framed as menus of hypotheses rather than authoritative diagnoses. Made-up symptom tables no longer sit next to sourced quotes. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 14:33:32 +08:00
wassname	8ee980d62f	diagnostics: add NaN-poisoning leakage tracer + Karpathy backprop-to-input check; README citation NaN poisoning: inject NaN where info must not come from (future/test/labels), run the real pipeline, assert past outputs stay finite. Documents false negatives (pandas skipna, nanmean) and false positives (softmax rows, batch stats). Backprop-to-input is its gradient dual for inside the model; quote already frozen in docs/evidence/karpathy_recipe_training_nn_2019.md. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 10:18:51 +08:00
wassname (Michael J Clark)	53fb1c4bda	Update SKILL.md	2026-06-03 08:19:07 +08:00
wassname (Michael J Clark)	8e1f9dec6d	Update README.md	2026-06-03 08:18:11 +08:00
wassname (Michael J Clark)	d1b1608a49	Update README.md	2026-06-03 08:17:04 +08:00
wassname	8509ec3c30	folklore: promote Spinning Up to main; add a Research-taste section - Promote the general (non-RL-specific) Spinning Up lessons up to the main folklore: "broken code fails silently", "you can't tell it's broken if you can't see that it's breaking", and test on more than one setup. - Add gwern's "Unseeing" to the data theme: you can't read what you actually wrote, hence fresh eyes / a fresh-eyes subagent. - New "Research taste (adjacent to debugging)" section with verbatim quotes, each cached: Neel Nanda (your research is false by default; excitement is evidence of bullshit; read your data), Ulisse Mini (understand the system to shrink the search space), John Wentworth (gears-level models are capital investments vs cheap black boxes). All quotes verbatim from cached sources; 25/25 footnotes resolve. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 21:08:49 +08:00
wassname	a602ea5a0e	rl: quote Spinning Up (Achiam) on silent failure and bug-first debugging Spinning Up as a Deep RL Researcher was only a bare code link; it's the canonical RL-researcher guide and its debugging advice is gold. Cache the rigour/debugging sections verbatim and quote the sharpest lines in the RL sub-skill: "broken RL code almost always fails silently", "if it doesn't work, assume there's a bug", "measure everything ... you can't tell it's broken if you can't see that it's breaking", and test on more than one env. Add to RL sources. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 21:04:55 +08:00
wassname	ee4e9a5caa	folklore: add koaning, gwern, kidger, nanochat, cleanrl; trim lucidrains Gather debugging folklore from more practitioners, each a verbatim quote checked against a cached source copy (footnoted with line numbers): - koaning (Vincent Warmerdam), "Bad Labels": benchmark labels are often wrong; find them with confidence-sorted errors. - gwern, the tank-detection legend: the canonical data-leakage parable, plus the scout-mindset twist that it's a likely-unsourced urban legend. - Patrick Kidger, "Just Know Stuff": why research code is buggy ("kludge ... bugs that don't cripple things only because some other bug stops them") and "never accept the kludge". Plus a one-line jaxtyping pointer for shape bugs. - nanochat (Karpathy): BOS-alignment fake metric improvement; all-ranks must clip on inf (a multi-GPU bug single-GPU testing hides). - cleanrl "37 Implementation Details of PPO" -> RL sub-skill, as the canonical proof that reference-impl details (not ideas) decide whether PPO works. Trim the lucidrains item to one quote (it had ballooned). Add wassname credit + companion-gist link. All 20 footnotes resolve. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 20:59:36 +08:00
wassname	9911ac83c5	folklore: add lucidrains transformer-stability item (QK-norm, post-emb LN) Phil Wang's x-transformers is the canonical "the fix is in the code, not the paper" catalogue. Add a folklore item on the most debugging-relevant trick: QK / cosine-sim normalization to stop attention logits overflowing (the usual cause of transformer loss spikes/divergence), plus the BLOOM/YaLM post-embedding LayerNorm. Two verbatim lucidrains quotes, footnoted to the repo + a cached README copy with line numbers. Doubles as the modern concrete example for the read-a-working-implementation section. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 20:49:15 +08:00
wassname	38ec634ff3	restructure: folklore-first, quote-verified, with wassname intro Reorder around what's durable, per wassname's curation: - human-written intro up top; rename to "wassname's ML Debugging Folklore" - mindset first: calibrate -> mental models -> Part 1 general tricks (kept, they're well-based) -> read a working implementation when stuck - a Folklore section built from verbatim, source-checked quotes (Jones, Rahtz, Karpathy, Schulman, Henderson, Irpan, CS231n, Slavv, Goodfellow), each footnoted to the canonical URL + the cached copy with line numbers - LLM-agent babysitting (debugging loop, triage menu, anti-patterns) moved to the bottom where it belongs; triage reframed as a menu, not a flowchart - deeper one-off tricks split to refs/ (loss_surface, metric_stuck, sweeps), scrubbed of private tooling (wandb/just/SI/personal scripts) Quote integrity: every quote independently verified by fresh-eyes subagents against the cached sources; fixed a reformatted Schulman slide, a truncated Jones sentence, a reversed-order Rahtz stitch, a falsely-quoted Slavv phrase, and the 3e-4 line (now the real tweet, framed as the joke Karpathy confirmed it was, not gospel). lr_scheduler anti-pattern nuanced (warmup/cyclic matter). Remove superseded SKILL2.md draft. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 20:46:25 +08:00
wassname	cf9df71f6a	add SKILL2.md: condensed anchor proposal (74 vs 703 lines) Procedural/vibe anchor with gradual disclosure: calibrate + loop + non-obvious numbers inline, tables/triage/sweeps demoted to on-demand links into SKILL.md and refs/. Draft for side-by-side comparison; not wired in (SKILL.md remains the entry point). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 13:26:59 +08:00
wassname	ab827116b5	remove AI flourishes and rhetorical "X, not Y" framing - drop "detective at a scene, not a fortune teller", "guess wearing a fix's clothes", "that reflex is the enemy" - rephrase negative parallelisms in intro/calibrate/loop to positive (judgment not a checklist; mindset not ticking boxes; evidence not prior; isn't a recipe; it's a; menu not a procedure; code not abstract) - keep genuine instructional contrasts (relative error not absolute, etc.) - trim pseudocode comments to intent-only Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 12:56:35 +08:00
wassname	7410a7ccf3	restore -- attribution form for blockquote citations Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 12:52:43 +08:00
wassname	b6fad64930	loop pseudocode: pseudopy style (← assignment, ── divider, t̂) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 12:51:15 +08:00
wassname	90b11214f8	de-AI pass: drop em-dashes, flourishes; resolve in-file TODOs - convert all prose ' -- ' to commas/periods/parens (left code/CLI/arrows) - remove the antithesis flourish in the bisect step; inform not persuade - de-telegraph "no model, no forward pass, no GPU. pure math." - add non-exhaustive hedges (and so on / like) where lists implied closure - fix typos: authoritative (x2), sklearn, it indented - TODO: triage decision tree converted from ASCII art to nested bullets - TODO: add Further reading section linking docs/evidence/* files Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 12:49:28 +08:00
wassname	220bd8dc7f	fix typos: separate/reproduced/auditable, drop stray article Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 12:07:58 +08:00
wassname	715164416b	loop: add likelihood-ratio test selection, path bisection, falsifiers, pseudocode - triplet now carries a prior + cheapest falsifier (Check:) per hypothesis - discriminating-test step: forward-predict each hypothesis, prefer where predictions diverge (strong vs weak evidence) instead of just "discriminating" - new step: bisect the forward/backward path to localize where it breaks - compact pseudocode summary of the whole loop - resolve FIXME: drop references to the non-public research-journal skill Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 12:06:30 +08:00
wassname (Michael J Clark)	d5c7dec5a6	Update SKILL.md	2026-06-01 13:36:35 +08:00
wassname	779beee03e	refactor(ml_debug): tidy ordering/emphasis on the new top sections Three targeted polishes to the rewritten skill: - Reframe Part 1's "The hierarchy (work in order...)" -> "What 'collect clues' looks like": it's the catalog the loop's clue-collection step draws on, not a second master-procedure competing with "the debugging loop" 40 lines above. - Reorder: lead straight into calibrate -> loop -> read-impl; relocate the 2017-2021 caveat + LLM-pretraining pointers into a "Scope and modern pointers" block after the action sections, so the behaviour-changing content is the first screen instead of provenance. - Emphasis: give the "priors are a starting weight, not a verdict" line a concrete clause (traceback / loss-metric misalignment / right init-loss override the data prior) -- the weakest comprehension dim in the quiz. Before-vs-after panel A/B (6 cold readers): tie on ordering/clarity/ conciseness/focus, each leaning slightly positive, no regression. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 10:15:41 +08:00
wassname	bb1a6bc61c	feat(ml_debug): lead with judgment gates over the symptom-lookup encyclopedia The skill was thorough but failed to instill debugging taste: an agent would pattern-match a symptom-table row to a fix and ship a guess, because the behaviour-changing material sat 550 lines down. Promote three gates to the top: - "Before you debug: calibrate" -- you're likely OOD in research code; the failure mode is overconfidence/impatience; the tables are a menu to widen the search, never lookup-and-apply. - "The debugging loop (judgment, not a checklist)" -- collect clues, hold a few competing hypotheses scaled to the problem, sanity-check with the likely/subtle/null triplet (shared vocab with research-journal), run the cheapest discriminating observation, then act. - "When stuck, read a working implementation" -- promoted from a buried Part 7.3 one-liner; extract the algorithm-done-right, the engineering tricks the paper omits, and proven hyperparams; rank candidates by trust signal. Collapse duplicated advice to pointers; de-bold Part 6.4 (8 bolded openers -> a plain list). Net +10 lines, bold markers 112 -> 100. Verified by a blind comprehension-by-inference quiz (5 cold-reader models, OLD vs NEW): NEW 9.6/10 vs OLD 6.8/10, with the gap localized to the two added sections (read-working-impl 5/5 vs 0/5; tables-as-menu 2.0 vs 1.6) while untouched sections tied -- ruling out a "reads nicer" halo. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 10:11:36 +08:00
wassname	fde5ac62fd	name	2026-04-09 05:09:25 +08:00
wassname	b159b0fba8	docs(ml_debug): annotate EMNLP 2018 NLP code tutorial; note sparse Adam embedding bug	2026-03-10 05:48:36 +08:00
wassname	0fa4009fd5	docs(ml_debug): update Grus annotation after reading full slides; note EMNLP 2018 lead	2026-03-10 05:45:56 +08:00
wassname	52ff6c17cd	docs(ml_debug): annotate Joel Grus slides -- SE/reproducibility talk, not debugging	2026-03-10 05:45:16 +08:00
wassname	3dffe890b1	docs(ml_debug): annotate sanh outbound links with content summaries	2026-03-10 05:40:31 +08:00
wassname	c9c53f8e7f	feat(ml_debug): expand nanochat evidence, add lec4 diagnostics file nanochat_deepwiki_llm_pretraining_2026.md rewritten with content from dev/LOG.md and deepwiki sections 3/12/13: - 14 labelled findings with direct quotes and empirical numbers - Dataset >> architecture (27% gain, 5 failed attempts before ClimbMix) - Scale-dependent HP sensitivity (d12 HPs hurt d20) - Multi-axis validation (steps/wall-clock/FLOPs) - Negative results: MoE/SwiGLU/MTP all failed at this scale - MFU monitoring, batch size Bopt∝D^0.383, WD∝1/width² tables - FP8 reality: 1.38x micro → 1.17x full → 5% capability-matched - Python GC 500ms overhead, torch.compile recompile gotcha karpathy_nn_zero_to_hero_lec4_diagnostics.md: new evidence file - Activation saturation check (tanh >0.97) - Gradient distribution check per-layer - Grad:data ratio (target ~1e-3) - Update-to-data ratio tracker with full plotting code - Incremental improvement log from notebook	2026-03-10 05:38:33 +08:00
wassname	ced4edc200	feat(ml_debug): add Karpathy recipe + nanochat evidence, update-ratio diagnostic Add 3 new evidence files from modern open-source sources: - karpathy_recipe_training_nn_2019.md: Karpathy's training recipe blog post - nanochat_deepwiki_llm_pretraining_2026.md: 320+ HP sweeps for GPT-2-scale pretraining - sanh_simple_considerations_hf_2021.md: HuggingFace NLP debugging notes Add update-to-data ratio diagnostic to refs/diagnostics.md (target ~1e-3). Add LLM pretraining gap note to SKILL.md intro linking the new sources. Add tanh saturation % to logging checklist.	2026-03-10 05:32:37 +08:00
wassname	bbe3fe0985	feat(ml_debug): add JAX grep patterns and diagnostic equivalents refs/static_analysis.md: JAX-specific grep patterns (in-place mutation, print side effects, key reuse, numpy escape, cast behavior). refs/diagnostics.md: JAX equivalents table (NaN detection, gradcheck, disable_jit, debug.print, debug.breakpoint, checkify).	2026-03-06 14:10:39 +08:00
wassname	7ac7aacac7	fix(ml_debug): address review feedback - Fix stale Part 2 cross-references to link to rl/SKILL.md - Add McCandlish + Slavv back to parent Sources (cited in Part 7) - Add back-links from refs/ files to parent SKILL.md	2026-03-06 13:59:48 +08:00
wassname	70c28f06ac	refactor(ml_debug): extract grep patterns and diagnostics to refs/ Moved 6.1 (static analysis grep patterns) and 6.2 (diagnostic code snippets) to refs/static_analysis.md and refs/diagnostics.md. Triage tree (6.3) stays in main with references to the ref files. ml_debug/SKILL.md reduced from 7229w to 5093w (~30% from original).	2026-03-06 13:54:37 +08:00

62 Commits