Mirror of https://github.com/wassname/ml_debug.git (synced 2026-06-27 01:00:14 +08:00)
feat(ml_debug): expand nanochat evidence, add lec4 diagnostics file

nanochat_deepwiki_llm_pretraining_2026.md rewritten with content from dev/LOG.md and deepwiki sections 3/12/13:
- 14 labelled findings with direct quotes and empirical numbers
- Dataset >> architecture (27% gain, 5 failed attempts before ClimbMix)
- Scale-dependent HP sensitivity (d12 HPs hurt d20)
- Multi-axis validation (steps/wall-clock/FLOPs)
- Negative results: MoE/SwiGLU/MTP all failed at this scale
- MFU monitoring, batch size Bopt∝D^0.383, WD∝1/width² tables
- FP8 reality: 1.38x micro → 1.17x full → 5% capability-matched
- Python GC 500ms overhead, torch.compile recompile gotcha

karpathy_nn_zero_to_hero_lec4_diagnostics.md: new evidence file
- Activation saturation check (tanh >0.97)
- Gradient distribution check per-layer
- Grad:data ratio (target ~1e-3)
- Update-to-data ratio tracker with full plotting code
- Incremental improvement log from notebook
+1
-1
@@ -159,7 +159,7 @@ for conf, pred, true, idx in errors[:10]:
# Inspect the actual inputs for these indices. Pattern = systematic bug.
```
-**Update-to-data ratio check** [Karpathy nn-zero-to-hero Lec 4]
+**Update-to-data ratio check** [Karpathy nn-zero-to-hero Lec 4; evidence: karpathy_nn_zero_to_hero_lec4_diagnostics.md]
```python
# Track during training: how large are updates relative to parameter magnitudes?
# Target: ~1e-3 (log10 ~ -3). Much higher = LR too large. Much lower = LR too small.
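# A minimal, framework-agnostic sketch of the ratio described above, using only
# the standard library. The function name `log10_update_ratio` and the flat-list
# inputs are assumptions for illustration; they are not taken from the evidence file.
# With PyTorch tensors the same quantity would be
# ((lr * p.grad).std() / p.data.std()).log10(), assuming plain SGD updates.
import math
import statistics


def log10_update_ratio(weights, grads, lr):
    """Return log10(std(lr * grad) / std(weights)) for one parameter tensor.

    `weights` and `grads` are flat sequences of floats of equal length.
    """
    if len(weights) != len(grads):
        raise ValueError("weights and grads must have the same length")
    update_std = statistics.pstdev([lr * g for g in grads])
    data_std = statistics.pstdev(list(weights))
    if update_std == 0 or data_std == 0:
        # The ratio (or its log) is undefined when either spread is zero.
        raise ValueError("zero standard deviation; ratio is undefined")
    return math.log10(update_std / data_std)


# Example: updates 1000x smaller than the weights sit on the ~1e-3 target.
example_ratio = log10_update_ratio(
    weights=[1.0, -1.0, 1.0, -1.0],
    grads=[0.01, -0.01, 0.01, -0.01],
    lr=0.1,
)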