Commit Graph

16 Commits

Author SHA1 Message Date
wassname 8cd3c61050 folklore: tuning playbook, Domingos, Bekman loss spikes, Ng error analysis; LLM-judge bias appendix
- SKILL.md: 3 new entries (exploration-over-exploitation + nuisance HPs,
  test-set contamination, loss-spikes-mean-bad-data-pocket) and an Ng
  100-misclassified-examples quote under inspect-the-data
- refs/llm_judges.md: position/verbosity/self-preference biases (Zheng,
  Wang 66/80 flip, Panickssery) + mitigation checklist from verdict docs
- Lones pitfalls linked as the exhaustive 36-item do/don't checklist
- 6 new frozen evidence files; Hamel evals link in further reading

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 15:30:41 +08:00
wassname 2a2f5045bb folklore: add Karpathy common-mistakes tweet and Sculley CACE principle
Both quote-verbatim with frozen evidence: the 2018 tweet thread (mirrored
via threadreaderapp, x.com blocks fetching) slots after overfit-one-batch;
CACE (NIPS 2015, entanglement section transcribed from the PDF) gives
Always-Be-Ablating its why.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 14:43:47 +08:00
wassname fb753d093e restructure: quotes-first SKILL.md, synthesized playbook split out
SKILL.md is now folklore only: verbatim practitioner quotes ordered
most-general-first, transformer/LLM fine-tuning entries in their own
section, minimal context, links and footnotes. New sources: unsloth,
axolotl (+training stability), HF course ch8.4, Bekman debug_utils
(evidence frozen in docs/evidence/).

The synthesized material (mental models, priors, symptom tables, agent
loop, triage, anti-patterns) moves to PLAYBOOK.md, framed as menus of
hypotheses rather than authoritative diagnoses. Made-up symptom tables
no longer sit next to sourced quotes.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-11 14:33:32 +08:00
wassname 8509ec3c30 folklore: promote Spinning Up to main; add a Research-taste section
- Promote the general (non-RL-specific) Spinning Up lessons up to the main
  folklore: "broken code fails silently", "you can't tell it's broken if you
  can't see that it's breaking", and test on more than one setup.
- Add gwern's "Unseeing" to the data theme: you can't read what you actually
  wrote, hence fresh eyes / a fresh-eyes subagent.
- New "Research taste (adjacent to debugging)" section with verbatim quotes,
  each cached: Neel Nanda (your research is false by default; excitement is
  evidence of bullshit; read your data), Ulisse Mini (understand the system to
  shrink the search space), John Wentworth (gears-level models are capital
  investments vs cheap black boxes).

All quotes verbatim from cached sources; 25/25 footnotes resolve.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 21:08:49 +08:00
wassname a602ea5a0e rl: quote Spinning Up (Achiam) on silent failure and bug-first debugging
Spinning Up as a Deep RL Researcher was only a bare code link; it's the
canonical RL-researcher guide and its debugging advice is gold. Cache the
rigour/debugging sections verbatim and quote the sharpest lines in the RL
sub-skill: "broken RL code almost always fails silently", "if it doesn't work,
assume there's a bug", "measure everything ... you can't tell it's broken if
you can't see that it's breaking", and test on more than one env. Add to RL
sources.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 21:04:55 +08:00
wassname ee4e9a5caa folklore: add koaning, gwern, kidger, nanochat, cleanrl; trim lucidrains
Gather debugging folklore from more practitioners, each a verbatim quote
checked against a cached source copy (footnoted with line numbers):
- koaning (Vincent Warmerdam), "Bad Labels": benchmark labels are often wrong;
  find them with confidence-sorted errors.
- gwern, the tank-detection legend: the canonical data-leakage parable, plus
  the scout-mindset twist that it's a likely-unsourced urban legend.
- Patrick Kidger, "Just Know Stuff": why research code is buggy ("kludge ...
  bugs that don't cripple things only because some other bug stops them") and
  "never accept the kludge". Plus a one-line jaxtyping pointer for shape bugs.
- nanochat (Karpathy): BOS-alignment fake metric improvement; all-ranks must
  clip on inf (a multi-GPU bug single-GPU testing hides).
- cleanrl "37 Implementation Details of PPO" -> RL sub-skill, as the canonical
  proof that reference-impl details (not ideas) decide whether PPO works.

Trim the lucidrains item to one quote (it had ballooned). Add wassname credit
+ companion-gist link. All 20 footnotes resolve.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 20:59:36 +08:00
wassname 9911ac83c5 folklore: add lucidrains transformer-stability item (QK-norm, post-emb LN)
Phil Wang's x-transformers is the canonical "the fix is in the code, not the
paper" catalogue. Add a folklore item on the most debugging-relevant trick:
QK / cosine-sim normalization to stop attention logits overflowing (the usual
cause of transformer loss spikes/divergence), plus the BLOOM/YaLM
post-embedding LayerNorm. Two verbatim lucidrains quotes, footnoted to the repo
+ a cached README copy with line numbers. Doubles as the modern concrete
example for the read-a-working-implementation section.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 20:49:15 +08:00
wassname b159b0fba8 docs(ml_debug): annotate EMNLP 2018 NLP code tutorial; note sparse Adam embedding bug 2026-03-10 05:48:36 +08:00
wassname 0fa4009fd5 docs(ml_debug): update Grus annotation after reading full slides; note EMNLP 2018 lead 2026-03-10 05:45:56 +08:00
wassname 52ff6c17cd docs(ml_debug): annotate Joel Grus slides -- SE/reproducibility talk, not debugging 2026-03-10 05:45:16 +08:00
wassname 3dffe890b1 docs(ml_debug): annotate sanh outbound links with content summaries 2026-03-10 05:40:31 +08:00
wassname c9c53f8e7f feat(ml_debug): expand nanochat evidence, add lec4 diagnostics file
nanochat_deepwiki_llm_pretraining_2026.md rewritten with content from
dev/LOG.md and deepwiki sections 3/12/13:
- 14 labelled findings with direct quotes and empirical numbers
- Dataset >> architecture (27% gain, 5 failed attempts before ClimbMix)
- Scale-dependent HP sensitivity (d12 HPs hurt d20)
- Multi-axis validation (steps/wall-clock/FLOPs)
- Negative results: MoE/SwiGLU/MTP all failed at this scale
- MFU monitoring, batch size Bopt∝D^0.383, WD∝1/width² tables
- FP8 reality: 1.38x micro → 1.17x full → 5% capability-matched
- Python GC 500ms overhead, torch.compile recompile gotcha

karpathy_nn_zero_to_hero_lec4_diagnostics.md: new evidence file
- Activation saturation check (tanh >0.97)
- Gradient distribution check per-layer
- Grad:data ratio (target ~1e-3)
- Update-to-data ratio tracker with full plotting code
- Incremental improvement log from notebook
2026-03-10 05:38:33 +08:00
wassname ced4edc200 feat(ml_debug): add Karpathy recipe + nanochat evidence, update-ratio diagnostic
Add 3 new evidence files from modern open-source sources:
- karpathy_recipe_training_nn_2019.md: Karpathy's training recipe blog post
- nanochat_deepwiki_llm_pretraining_2026.md: 320+ HP sweeps for GPT-2-scale pretraining
- sanh_simple_considerations_hf_2021.md: HuggingFace NLP debugging notes

Add update-to-data ratio diagnostic to refs/diagnostics.md (target ~1e-3).
Add LLM pretraining gap note to SKILL.md intro linking the new sources.
Add tanh saturation % to logging checklist.
2026-03-10 05:32:37 +08:00
wassname 9e30cf7039 chore: remove duplicate subtitle file and log (now gitignored) 2026-03-06 12:21:54 +08:00
wassname 95fee7b5cb chore: include Goodfellow chapters (author encourages sharing) 2026-03-06 10:16:00 +08:00
wassname 4393cceefd initial: ML debugging folklore skill
Deep research to uplift LLMs for ML debugging, opinionated by source
selection. Distilled from Schulman, Jones, Rahtz, Goodfellow, CS231n,
FSDL, and more. Includes runnable diagnostic scripts and LLM-specific
anti-patterns.

Author: wassname (https://github.com/wassname)
2026-03-06 10:11:30 +08:00