mirror of https://github.com/wassname/ml-debug.git synced 2026-06-27 15:00:40 +08:00

T

wassname c9c53f8e7f feat(ml_debug): expand nanochat evidence, add lec4 diagnostics file

nanochat_deepwiki_llm_pretraining_2026.md rewritten with content from
dev/LOG.md and deepwiki sections 3/12/13:
- 14 labelled findings with direct quotes and empirical numbers
- Dataset >> architecture (27% gain, 5 failed attempts before ClimbMix)
- Scale-dependent HP sensitivity (d12 HPs hurt d20)
- Multi-axis validation (steps/wall-clock/FLOPs)
- Negative results: MoE/SwiGLU/MTP all failed at this scale
- MFU monitoring, batch size Bopt∝D^0.383, WD∝1/width² tables
- FP8 reality: 1.38x micro → 1.17x full → 5% capability-matched
- Python GC 500ms overhead, torch.compile recompile gotcha

karpathy_nn_zero_to_hero_lec4_diagnostics.md: new evidence file
- Activation saturation check (tanh >0.97)
- Gradient distribution check per-layer
- Grad:data ratio (target ~1e-3)
- Update-to-data ratio tracker with full plotting code
- Incremental improvement log from notebook

2026-03-10 05:38:33 +08:00

docs

feat(ml_debug): expand nanochat evidence, add lec4 diagnostics file

2026-03-10 05:38:33 +08:00

pinn

feat(ml_debug): add Karpathy recipe + nanochat evidence, update-ratio diagnostic

2026-03-10 05:32:37 +08:00

refs

feat(ml_debug): expand nanochat evidence, add lec4 diagnostics file

2026-03-10 05:38:33 +08:00

refactor(ml_debug): extract RL debugging into rl/ sub-skill

2026-03-06 13:36:29 +08:00

.gitignore

chore: fix .gitignore (dlbooks path, *_log.md pattern)

2026-03-06 12:22:22 +08:00

README.md

feat(ml_debug): add Karpathy recipe + nanochat evidence, update-ratio diagnostic

2026-03-10 05:32:37 +08:00

SKILL.md

feat(ml_debug): add Karpathy recipe + nanochat evidence, update-ratio diagnostic

2026-03-10 05:32:37 +08:00

README.md

ML Debugging Folklore

Practitioner knowledge for debugging ML systems, curated and synthesized by wassname. Opinionated by source selection -- I picked sources I trust (Schulman, Goodfellow, CS231n, ...) and had an LLM extract the most relevant information for debugging ML systems.

Use as a Claude skill

/skills add https://github.com/wassname/ml_debug

Or paste SKILL.md into your system prompt / context when debugging.

What's here

SKILL.md -- the main artifact. Load into an LLM agent's context as a debugging skill. Parts 1-5 are reference knowledge; Part 6 is a runnable triage protocol (grep patterns, diagnostic snippets, decision tree); Part 7 is debugging mental models and practitioner priors.
docs/evidence/ -- frozen local copies of source material (blog posts, talks, papers, reddit threads). Claims in SKILL.md link back to exact quotes here.