ml-debug

mirror of https://github.com/wassname/ml-debug.git synced 2026-06-27 18:24:28 +08:00

Author	SHA1	Message	Date
wassname	9911ac83c5	folklore: add lucidrains transformer-stability item (QK-norm, post-emb LN). Phil Wang's x-transformers is the canonical "the fix is in the code, not the paper" catalogue. Add a folklore item on the most debugging-relevant trick: QK / cosine-sim normalization to stop attention logits overflowing (the usual cause of transformer loss spikes/divergence), plus the BLOOM/YaLM post-embedding LayerNorm. Two verbatim lucidrains quotes, footnoted to the repo + a cached README copy with line numbers. Doubles as the modern concrete example for the read-a-working-implementation section. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 20:49:15 +08:00
wassname	b159b0fba8	docs(ml_debug): annotate EMNLP 2018 NLP code tutorial; note sparse Adam embedding bug	2026-03-10 05:48:36 +08:00
wassname	0fa4009fd5	docs(ml_debug): update Grus annotation after reading full slides; note EMNLP 2018 lead	2026-03-10 05:45:56 +08:00
wassname	52ff6c17cd	docs(ml_debug): annotate Joel Grus slides -- SE/reproducibility talk, not debugging	2026-03-10 05:45:16 +08:00
wassname	3dffe890b1	docs(ml_debug): annotate sanh outbound links with content summaries	2026-03-10 05:40:31 +08:00
wassname	c9c53f8e7f	feat(ml_debug): expand nanochat evidence, add lec4 diagnostics file. nanochat_deepwiki_llm_pretraining_2026.md rewritten with content from dev/LOG.md and deepwiki sections 3/12/13: 14 labelled findings with direct quotes and empirical numbers; dataset >> architecture (27% gain, 5 failed attempts before ClimbMix); scale-dependent HP sensitivity (d12 HPs hurt d20); multi-axis validation (steps/wall-clock/FLOPs); negative results: MoE/SwiGLU/MTP all failed at this scale; MFU monitoring, batch size Bopt∝D^0.383, WD∝1/width² tables; FP8 reality: 1.38x micro → 1.17x full → 5% capability-matched; Python GC 500ms overhead, torch.compile recompile gotcha. karpathy_nn_zero_to_hero_lec4_diagnostics.md (new evidence file): activation saturation check (tanh >0.97); gradient distribution check per-layer; grad:data ratio (target ~1e-3); update-to-data ratio tracker with full plotting code; incremental improvement log from notebook.	2026-03-10 05:38:33 +08:00
wassname	ced4edc200	feat(ml_debug): add Karpathy recipe + nanochat evidence, update-ratio diagnostic. Add 3 new evidence files from modern open-source sources: karpathy_recipe_training_nn_2019.md (Karpathy's training recipe blog post); nanochat_deepwiki_llm_pretraining_2026.md (320+ HP sweeps for GPT-2-scale pretraining); sanh_simple_considerations_hf_2021.md (HuggingFace NLP debugging notes). Add update-to-data ratio diagnostic to refs/diagnostics.md (target ~1e-3). Add LLM pretraining gap note to SKILL.md intro linking the new sources. Add tanh saturation % to logging checklist.	2026-03-10 05:32:37 +08:00
wassname	9e30cf7039	chore: remove duplicate subtitle file and log (now gitignored)	2026-03-06 12:21:54 +08:00
wassname (Michael J Clark)	fa41fecef2	Delete docs/dlbooks	2026-03-06 12:19:12 +08:00
wassname	95fee7b5cb	chore: include Goodfellow chapters (author encourages sharing)	2026-03-06 10:16:00 +08:00
wassname	4393cceefd	initial: ML debugging folklore skill. Deep research to uplift LLMs for ML debugging, opinionated by source selection. Distilled from Schulman, Jones, Rahtz, Goodfellow, CS231n, FSDL, and more. Includes runnable diagnostic scripts and LLM-specific anti-patterns. Author: wassname (https://github.com/wassname)	2026-03-06 10:11:30 +08:00

11 Commits
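
Commits c9c53f8e7f and ced4edc200 mention an update-to-data ratio diagnostic with a target of about 1e-3. The repository's own tracker is not shown on this page, so the following is only a minimal sketch of the commonly used form of that check, assuming plain SGD and NumPy arrays; the function name `update_to_data_ratio` and the synthetic weights and gradients are hypothetical illustrations, not code from the repo.

```python
import numpy as np


def update_to_data_ratio(param, grad, lr):
    """Return log10( std(lr * grad) / std(param) ) for one parameter array.

    Assumes a plain SGD step, so the update is lr * grad (an assumption;
    adaptive optimizers would need the actual applied update instead).
    """
    return float(np.log10(np.std(lr * grad) / np.std(param)))


# Synthetic example (hypothetical data): weights with std ~1 and
# gradients with std ~0.01, stepped with lr = 0.1.
rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 32))
grads = rng.normal(size=(64, 32)) * 0.01

ratio = update_to_data_ratio(weights, grads, lr=0.1)
# A value near -3 corresponds to the ~1e-3 target named in the commits.
print(f"log10 update:data ratio = {ratio:.2f}")
```

Under this reading, values well above -3 would suggest updates that are large relative to the weights, and values well below it would suggest a layer that is barely training; tracking the number per layer over training steps is what makes it useful.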