ml-debug

mirror of https://github.com/wassname/ml-debug.git synced 2026-06-27 16:00:43 +08:00

Author	SHA1	Message	Date
wassname	b8c3ffcf11	gpt5.5/fable	2026-06-12 09:30:25 +08:00
wassname	8b9a1d62ed	docs: resolve ml-debug TODO references	2026-06-12 06:52:38 +08:00
wassname (Michael J Clark)	6e9a3ca633	Revise introduction in diagnostics.md. Updated the introduction to provide context for diagnostic code snippets.	2026-06-11 21:18:07 +08:00
wassname	8ee980d62f	diagnostics: add NaN-poisoning leakage tracer + Karpathy backprop-to-input check; README citation. NaN poisoning: inject NaN where info must not come from (future/test/labels), run the real pipeline, assert past outputs stay finite. Documents false negatives (pandas skipna, nanmean) and false positives (softmax rows, batch stats). Backprop-to-input is its gradient dual for inside the model; quote already frozen in docs/evidence/karpathy_recipe_training_nn_2019.md. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-11 10:18:51 +08:00
wassname	c9c53f8e7f	feat(ml_debug): expand nanochat evidence, add lec4 diagnostics file. nanochat_deepwiki_llm_pretraining_2026.md rewritten with content from dev/LOG.md and deepwiki sections 3/12/13: 14 labelled findings with direct quotes and empirical numbers; Dataset >> architecture (27% gain, 5 failed attempts before ClimbMix); Scale-dependent HP sensitivity (d12 HPs hurt d20); Multi-axis validation (steps/wall-clock/FLOPs); Negative results: MoE/SwiGLU/MTP all failed at this scale; MFU monitoring, batch size Bopt∝D^0.383, WD∝1/width² tables; FP8 reality: 1.38x micro → 1.17x full → 5% capability-matched; Python GC 500ms overhead, torch.compile recompile gotcha. karpathy_nn_zero_to_hero_lec4_diagnostics.md, new evidence file: Activation saturation check (tanh >0.97); Gradient distribution check per-layer; Grad:data ratio (target ~1e-3); Update-to-data ratio tracker with full plotting code; Incremental improvement log from notebook.	2026-03-10 05:38:33 +08:00
wassname	ced4edc200	feat(ml_debug): add Karpathy recipe + nanochat evidence, update-ratio diagnostic. Add 3 new evidence files from modern open-source sources: karpathy_recipe_training_nn_2019.md (Karpathy's training recipe blog post); nanochat_deepwiki_llm_pretraining_2026.md (320+ HP sweeps for GPT-2-scale pretraining); sanh_simple_considerations_hf_2021.md (HuggingFace NLP debugging notes). Add update-to-data ratio diagnostic to refs/diagnostics.md (target ~1e-3). Add LLM pretraining gap note to SKILL.md intro linking the new sources. Add tanh saturation % to logging checklist.	2026-03-10 05:32:37 +08:00
wassname	bbe3fe0985	feat(ml_debug): add JAX grep patterns and diagnostic equivalents. refs/static_analysis.md: JAX-specific grep patterns (in-place mutation, print side effects, key reuse, numpy escape, cast behavior). refs/diagnostics.md: JAX equivalents table (NaN detection, gradcheck, disable_jit, debug.print, debug.breakpoint, checkify).	2026-03-06 14:10:39 +08:00
wassname	7ac7aacac7	fix(ml_debug): address review feedback. Fix stale Part 2 cross-references to link to rl/SKILL.md; add McCandlish + Slavv back to parent Sources (cited in Part 7); add back-links from refs/ files to parent SKILL.md.	2026-03-06 13:59:48 +08:00
wassname	70c28f06ac	refactor(ml_debug): extract grep patterns and diagnostics to refs/. Moved 6.1 (static analysis grep patterns) and 6.2 (diagnostic code snippets) to refs/static_analysis.md and refs/diagnostics.md. Triage tree (6.3) stays in main with references to the ref files. ml_debug/SKILL.md reduced from 7229w to 5093w (~30% from original).	2026-03-06 13:54:37 +08:00
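
The NaN-poisoning tracer described in commit 8ee980d62f can be sketched roughly as follows. This is a minimal illustration, not the code in refs/diagnostics.md: the helper `leaked_positions` and the two toy pipelines `causal_mean` and `leaky_mean` are hypothetical names, and NumPy is assumed in place of whatever framework the real pipeline uses. The idea is to overwrite the inputs that must not be visible (here, everything from `cutoff` onward) with NaN, run the pipeline unchanged, and flag any output before the cutoff that was finite on clean data but is no longer finite.

```python
import numpy as np

def causal_mean(x, window=3):
    """Trailing mean: out[t] depends only on x[t-window+1 .. t]."""
    out = np.full(len(x), np.nan)
    for t in range(window - 1, len(x)):
        out[t] = x[t - window + 1 : t + 1].mean()
    return out

def leaky_mean(x, window=3):
    """Centered mean: out[t] depends on x[t-1 .. t+1], so it peeks one step ahead."""
    out = np.full(len(x), np.nan)
    half = window // 2
    for t in range(half, len(x) - half):
        out[t] = x[t - half : t + half + 1].mean()
    return out

def leaked_positions(pipeline, x, cutoff):
    """Return indices before `cutoff` whose output becomes non-finite
    once x[cutoff:] is poisoned with NaN (hypothetical helper)."""
    clean = pipeline(x.astype(float))
    poisoned_input = x.astype(float).copy()
    poisoned_input[cutoff:] = np.nan          # poison the "future"
    poisoned = pipeline(poisoned_input)
    past = np.arange(len(x)) < cutoff
    # Comparing against the clean run ignores warm-up NaNs the pipeline emits anyway.
    return np.flatnonzero(past & np.isfinite(clean) & ~np.isfinite(poisoned))

x = np.arange(10.0)
print(leaked_positions(causal_mean, x, cutoff=6))  # no past output is affected
print(leaked_positions(leaky_mean, x, cutoff=6))   # index 5 read the poisoned x[6]
```

As the commit message notes, NaN-skipping reductions hide the signal: if `leaky_mean` used `np.nanmean` instead of `.mean()`, index 5 would stay finite and the leak would go undetected.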

9 Commits
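
The update-to-data ratio diagnostic mentioned in commits c9c53f8e7f and ced4edc200 (target ~1e-3) could look something like the sketch below. It assumes plain SGD and defines the ratio as the standard deviation of the update `lr * grad` divided by the standard deviation of the parameter values, reported in log10; that definition, the function name `update_to_data_log_ratio`, and the synthetic weights and gradients are assumptions for illustration and are not taken from the repository.

```python
import numpy as np

def update_to_data_log_ratio(param, grad, lr):
    """log10( std(lr * grad) / std(param) ) for one parameter tensor.

    Assumes a plain SGD step, so the update is simply lr * grad.
    """
    update = lr * grad
    return float(np.log10(update.std() / param.std()))

# Synthetic stand-ins for one layer's weights and gradients.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 1.0, size=(64, 64))
grads = rng.normal(0.0, 0.01, size=(64, 64))

log_ratio = update_to_data_log_ratio(weights, grads, lr=0.1)
print(f"log10(update/data) = {log_ratio:.2f}")  # close to -3 for these synthetic values
```

Logged per layer at each step, a value far above -3 suggests updates that are large relative to the weights, and a value far below suggests the layer is barely moving; with adaptive optimizers the actual update would need to be measured rather than computed as `lr * grad`.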