ml_debug

mirror of https://github.com/wassname/ml_debug.git synced 2026-06-27 01:00:14 +08:00

Author	SHA1	Message	Date
wassname	52ff6c17cd	docs(ml_debug): annotate Joel Grus slides -- SE/reproducibility talk, not debugging	2026-03-10 05:45:16 +08:00
wassname	3dffe890b1	docs(ml_debug): annotate sanh outbound links with content summaries	2026-03-10 05:40:31 +08:00
wassname	c9c53f8e7f	feat(ml_debug): expand nanochat evidence, add lec4 diagnostics file. nanochat_deepwiki_llm_pretraining_2026.md rewritten with content from dev/LOG.md and deepwiki sections 3/12/13: 14 labelled findings with direct quotes and empirical numbers; Dataset >> architecture (27% gain, 5 failed attempts before ClimbMix); Scale-dependent HP sensitivity (d12 HPs hurt d20); Multi-axis validation (steps/wall-clock/FLOPs); Negative results: MoE/SwiGLU/MTP all failed at this scale; MFU monitoring, batch size Bopt∝D^0.383, WD∝1/width² tables; FP8 reality: 1.38x micro → 1.17x full → 5% capability-matched; Python GC 500ms overhead, torch.compile recompile gotcha. karpathy_nn_zero_to_hero_lec4_diagnostics.md (new evidence file): Activation saturation check (tanh >0.97); Gradient distribution check per-layer; Grad:data ratio (target ~1e-3); Update-to-data ratio tracker with full plotting code; Incremental improvement log from notebook.	2026-03-10 05:38:33 +08:00
wassname	ced4edc200	feat(ml_debug): add Karpathy recipe + nanochat evidence, update-ratio diagnostic. Add 3 new evidence files from modern open-source sources: karpathy_recipe_training_nn_2019.md (Karpathy's training recipe blog post); nanochat_deepwiki_llm_pretraining_2026.md (320+ HP sweeps for GPT-2-scale pretraining); sanh_simple_considerations_hf_2021.md (HuggingFace NLP debugging notes). Add update-to-data ratio diagnostic to refs/diagnostics.md (target ~1e-3). Add LLM pretraining gap note to SKILL.md intro linking the new sources. Add tanh saturation % to logging checklist.	2026-03-10 05:32:37 +08:00
wassname	bbe3fe0985	feat(ml_debug): add JAX grep patterns and diagnostic equivalents refs/static_analysis.md: JAX-specific grep patterns (in-place mutation, print side effects, key reuse, numpy escape, cast behavior). refs/diagnostics.md: JAX equivalents table (NaN detection, gradcheck, disable_jit, debug.print, debug.breakpoint, checkify).	2026-03-06 14:10:39 +08:00
wassname	7ac7aacac7	fix(ml_debug): address review feedback. Fix stale Part 2 cross-references to link to rl/SKILL.md; Add McCandlish + Slavv back to parent Sources (cited in Part 7); Add back-links from refs/ files to parent SKILL.md.	2026-03-06 13:59:48 +08:00
wassname	70c28f06ac	refactor(ml_debug): extract grep patterns and diagnostics to refs/ Moved 6.1 (static analysis grep patterns) and 6.2 (diagnostic code snippets) to refs/static_analysis.md and refs/diagnostics.md. Triage tree (6.3) stays in main with references to the ref files. ml_debug/SKILL.md reduced from 7229w to 5093w (~30% from original).	2026-03-06 13:54:37 +08:00
wassname	48d4c1044a	refactor(pinn): extract heat exchanger specifics to refs/ Moved heat-exchanger-specific content from pinn/SKILL.md to pinn/refs/heat_exchanger.md: complexity ladder table, known failure modes (U->0, counterflow signs), property mappings (REFPROP/PCHIP), multi-episode training. PINN skill is now domain-agnostic. pinn/SKILL.md reduced from 4961w to 4274w (~14%).	2026-03-06 13:39:53 +08:00
wassname	7f34f26a5c	refactor(ml_debug): extract RL debugging into rl/ sub-skill Part 2 (RL-Specific Debugging) + RL-specific sources moved to ml_debug/rl/SKILL.md as a sub-skill, following the pinn/ precedent. Parent SKILL.md reduced from 9158w to 7229w (~21%). General sources (Goodfellow, CS231n, Tobin, Ng) kept in parent.	2026-03-06 13:36:29 +08:00
wassname	698b77f2d3	chore: fix .gitignore (dlbooks path, *_log.md pattern)	2026-03-06 12:22:22 +08:00
wassname	9e30cf7039	chore: remove duplicate subtitle file and log (now gitignored)	2026-03-06 12:21:54 +08:00
wassname (Michael J Clark)	fa41fecef2	Delete docs/dlbooks	2026-03-06 12:19:12 +08:00
wassname	7a9c667aa7	chore: add wassname attribution to description, gitignore dlbooks	2026-03-06 12:17:50 +08:00
wassname	463c8fdbbc	fix: apply Gemini review fixes (device kwarg, gradcheck requires_grad, torch prefix). Review: Gemini 3.1 Pro approved. 3 fixes applied: pinn/SKILL.md: PchipFunction torch.tensor missing device=h.device (GPU crash); SKILL.md: gradcheck needs .requires_grad_(True) on doubled inputs; SKILL.md: loss surface pseudocode now has torch. prefix + indexing='ij'	2026-03-06 12:15:37 +08:00
wassname	2db012dd2c	docs(pinn): add Wang 2021 and Rathore 2024 evidence files	2026-03-06 12:12:51 +08:00
wassname	a90624b36d	feat(pinn): add pinn/ sub-skill with SKILL.md and evidence SKILL.md: 478-line PINN training best practices (complexity ladder, nondim, architecture, optimization, loss design, sampling, property mappings, ConFIG, domain decomposition). docs/evidence/: 6 files -- krishnapriyan2021, sukumar2022, wang2022 causal, wang2022+2023 expert guides, Brunton youtube transcripts. Missing evidence (to fetch): Wang 2001.04536 (gradient pathologies), Rathore 2402.01868 (ICML loss landscape). Author: wassname (https://github.com/wassname)	2026-03-06 11:48:41 +08:00
wassname	51c9a2df44	docs: add README with author credit and usage	2026-03-06 10:16:24 +08:00
wassname	95fee7b5cb	chore: include Goodfellow chapters (author encourages sharing)	2026-03-06 10:16:00 +08:00
wassname	4393cceefd	initial: ML debugging folklore skill Deep research to uplift LLMs for ML debugging, opinionated by source selection. Distilled from Schulman, Jones, Rahtz, Goodfellow, CS231n, FSDL, and more. Includes runnable diagnostic scripts and LLM-specific anti-patterns. Author: wassname (https://github.com/wassname)	2026-03-06 10:11:30 +08:00

19 Commits