From 779beee03e51a87b06e5b3ff3a7752d337491713 Mon Sep 17 00:00:00 2001
From: wassname <1103714+wassname@users.noreply.github.com>
Date: Mon, 1 Jun 2026 10:15:41 +0800
Subject: [PATCH] refactor(ml_debug): tidy ordering/emphasis on the new top
 sections

Three targeted polishes to the rewritten skill:
- Reframe Part 1's "The hierarchy (work in order...)" -> "What 'collect clues'
  looks like": it's the catalog the loop's clue-collection step draws on, not a
  second master-procedure competing with "the debugging loop" 40 lines above.
- Reorder: lead straight into calibrate -> loop -> read-impl; relocate the
  2017-2021 caveat + LLM-pretraining pointers into a "Scope and modern pointers"
  block after the action sections, so the behaviour-changing content is the
  first screen instead of provenance.
- Emphasis: give the "priors are a starting weight, not a verdict" line a
  concrete clause (traceback / loss-metric misalignment / right init-loss
  override the data prior) -- the weakest comprehension dim in the quiz.

Before-vs-after panel A/B (6 cold readers): tie on ordering/clarity/
conciseness/focus, each leaning slightly positive, no regression.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
---
 SKILL.md | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/SKILL.md b/SKILL.md
index 7fcef11..1323e21 100644
--- a/SKILL.md
+++ b/SKILL.md
@@ -5,13 +5,9 @@ description: "Wassname's practical folklore for debugging ML systems: convergenc
 
 # ML Debugging Folklore
 
-Practitioner knowledge that's hard to find in papers. Distilled from Schulman's "Nuts and Bolts" talk, Andy Jones' debugging guide, r/reinforcementlearning threads, competition write-ups, and personal experience. Most multi-source claims are traced to sourced quotes in [docs/ml_debug_folklore.argdown](docs/ml_debug_folklore.argdown) (vargdown format); uncovered claims are listed in the [process log](docs/ml_debug_folklore_log.md).
+Practitioner knowledge that's hard to find in papers, distilled from Schulman's "Nuts and Bolts" talk, Andy Jones' debugging guide, r/reinforcementlearning threads, competition write-ups, and personal experience.
 
-**Caveat:** Most sources are from 2017-2021, predating RLHF, large-scale pretraining, and JAX/PyTorch 2.0 workflows. Core debugging principles (isolation testing, logging, seed variance) are architecture-agnostic and likely durable. Specific RL HP defaults and reward-scaling advice may need updating for modern settings.
-
-**LLM pretraining gap:** For modern transformer pretraining debugging, see [Karpathy's recipe](https://karpathy.github.io/2019/04/25/recipe/) (2019; general training workflow, activation/gradient health checks) and [nanochat deepwiki](https://deepwiki.com/karpathy/nanochat) (2026; documents 320+ empirical HP sweeps for training a GPT-2-scale model from scratch, covering MFU monitoring, precision management, BOS-aligned dataloaders, and cross-scale ablation discipline). Evidence files: [karpathy_recipe_training_nn_2019.md](docs/evidence/karpathy_recipe_training_nn_2019.md), [nanochat_deepwiki_llm_pretraining_2026.md](docs/evidence/nanochat_deepwiki_llm_pretraining_2026.md).
-
-The core problem in RL (and to a lesser extent supervised ML): errors aren't local [Jones 2021]. In RL, information flows in a loop (actor -> learner -> actor), so a numerical bug in one spot gets smeared through the whole system in seconds. From outside, everything goes weird at once -- loss explodes, KL collapses, rewards oscillate. You can tell something's wrong but not *what* or *where*.
+The core problem: errors aren't local [Jones 2021]. Especially in RL, information flows in a loop (actor -> learner -> actor), so a numerical bug in one spot gets smeared through the whole system in seconds. From outside everything goes weird at once -- loss explodes, KL collapses, rewards oscillate -- and you can tell something's wrong but not *what* or *where*. That's why the discipline below leads with calibration and clue-collection, not fixes.
 
 ## Before you debug: calibrate
 
@@ -30,7 +26,7 @@ Roughly in this order, but the point is the mindset, not ticking boxes:
 - *Subtle*: the sneaky one -- sample size, leakage, a confound, a metric artifact, or plain seed variance masquerading as signal.
 - *Null*: there's no real effect, or it comes from something else you also changed.
 
-Anchor priors on what's usually wrong (Part 7.2: data ~40%, loss ~20%, training ~15%, architecture ~10%, hyperparameters ~5%) -- but priors are a starting weight, not a verdict.
+Anchor priors on what's usually wrong (Part 7.2: data ~40%, loss ~20%, training ~15%, architecture ~10%, hyperparameters ~5%) -- but priors are a starting weight, not a verdict. A clue that points elsewhere overrides them outright: a traceback naming a line, a metric stuck while the loss is healthy (loss-metric misalignment, not data), or an init-loss that's exactly right all redirect you regardless of the ~40% data prior.
 
 **Run the cheapest observation that splits your top hypotheses.** Not the most thorough experiment -- the most *discriminating* one (Rahtz: think more, experiment less, Part 1). One log line or one toy run that tells hypothesis A from B beats a 4-hour sweep that only confirms what you already believed.
 
@@ -51,11 +47,17 @@ Read it for three things, explicitly:
 
 Reference implementations are domain-specific. For RL, see `rl/SKILL.md` section 9 (spinning-up, stable-baselines3, cleanrl, OpenSpiel). For everything, diff against the reference rather than trusting your from-scratch version (Part 7.3).
 
+## Scope and modern pointers
+
+Most sources here are 2017-2021, predating RLHF, large-scale pretraining, and JAX/PyTorch 2.0. The core principles (isolation testing, logging, seed variance, the loop above) are architecture-agnostic and durable; specific RL HP defaults and reward-scaling advice may need updating. For modern transformer pretraining specifically, go to [Karpathy's recipe](https://karpathy.github.io/2019/04/25/recipe/) (2019; activation/gradient health checks) and [nanochat deepwiki](https://deepwiki.com/karpathy/nanochat) (2026; 320+ empirical HP sweeps for a GPT-2-scale run, MFU monitoring, precision management, BOS-aligned dataloaders) -- evidence files [karpathy_recipe_training_nn_2019.md](docs/evidence/karpathy_recipe_training_nn_2019.md), [nanochat_deepwiki_llm_pretraining_2026.md](docs/evidence/nanochat_deepwiki_llm_pretraining_2026.md). Most multi-source claims below trace to quotes in [docs/ml_debug_folklore.argdown](docs/ml_debug_folklore.argdown) (vargdown); uncovered claims are in the [process log](docs/ml_debug_folklore_log.md).
+
 ---
 
 ## Part 1: General ML Debugging
 
-### The hierarchy (work in order, don't skip to hyperparameters)
+### What "collect clues" looks like
+
+This is the catalog the loop's clue-collection step pulls from -- the substantive checks, in rough dependency order (each assumes the one before). It's a menu to draw on, not a fresh end-to-end procedure that competes with the loop above.
 
 **Step 1: Verify components in isolation.** [Goodfellow Ch11, CS229]
 Most bugs are "doing the wrong calculation." Test each piece independently.