initial: ML debugging folklore skill

Deep research to uplift LLMs for ML debugging, opinionated by source selection. Distilled from Schulman, Jones, Rahtz, Goodfellow, CS231n, FSDL, and more. Includes runnable diagnostic scripts and LLM-specific anti-patterns. Author: wassname (https://github.com/wassname)
2026-06-27 19:47:55 +08:00 · 2026-03-06 10:11:30 +08:00
commit 4393cceefd
25 changed files with 12512 additions and 0 deletions
@@ -0,0 +1,586 @@
+===
+title: ML Debugging Folklore - Evidence Map
+author: ml_debug SKILL synthesis
+date: 2026-03-05
+model:
+    mode: strict
+===
+
+// This argdown maps claims from the ML Debugging Folklore SKILL.md
+// back to sourced quotes across 21 evidence files. Most claims
+// are traced to 2-3 independent sources with verbatim quotes.
+//
+// Credence guide:
+//   0.90 = canonical textbook / primary algorithm author
+//   0.85 = peer-reviewed paper / authoritative course
+//   0.80 = established practitioner blog, widely cited
+//   0.70 = popular blog / course notes
+//   0.60 = reddit thread / community consensus
+
+[Folklore Reliable]: ML debugging folklore -- practitioner heuristics
+  transmitted via talks, blog posts, and course materials -- provides
+  reliable, independently corroborated guidance for diagnosing and
+  fixing ML training failures.
+  + <Normalization Consensus>
+  + <Isolation Testing Consensus>
+  + <Bug First Consensus>
+  + <Seed Variance Evidence>
+  + <Batch Size Evidence>
+  + <Reward Engineering Evidence>
+  + <Logging Consensus>
+  + <Reference Impl Consensus>
+  + <Anomaly Pursuit Evidence>
+  + <Random Search Evidence>
+  - <Source Age Concern>
+  - <RL Specificity Concern>
+
+
+# Section 1: General ML Debugging
+
+## Normalize Inputs
+
+<Normalization Consensus>
+
+(1) [Schulman Normalize]: Schulman recommends standardizing all
+    observations via running mean/std, clipping, and rescaling
+    rewards without shifting the mean. #observation
+    [Schulman 2016 Slides](http://joschu.net/docs/nuts-and-bolts.pdf)
+    [evidence](evidence/joschu_nuts_and_bolts.md#L120-L125)
+    > - If observations have unknown range, standardize
+    > - Compute running estimate of mean and standard deviation
+    > - x' = clip((x − µ)/σ, −10, 10)
+    > - **Rescale the rewards, but don't shift mean, as that affects agent's will to live**
+    > - Standardize prediction targets (e.g., value functions) the same way
+    {reason: "Schulman is PPO/TRPO author; these slides are from his canonical 'Nuts and Bolts' talk, widely adopted in OpenAI baselines and stable-baselines3. Note: the evidence file contains PDF conversion artifacts ((cid:73) bullet markers), normalized to '-' here.", credence: 0.90}
+(2) [FSDL Normalize]: FSDL Lecture 7 lists normalizing input data as
+    a default step in the 'Start Simple' phase. #observation
+    [FSDL Spring 2021 Lecture 7](https://fullstackdeeplearning.com/)
+    [evidence](evidence/fsdl_spring2021_lecture7.md#L334-L338)
+    > The next step is to **normalize the input data**, subtracting the mean
+    > and dividing by the variance. Note that for images, it's fine to scale
+    > values to 0-1 or -0.5 to 0.5 (for example, by dividing by 255).
+    {reason: "FSDL is Josh Tobin's industry-focused course; independent from Schulman's RL lineage", credence: 0.80}
+(3) [Slavv Normalize]: Ivanov's '37 Reasons' checklist lists
+    standardization as item #12 and preprocessing consistency
+    as items #14-#15. #observation
+    [Slavv 2017 - 37 Reasons](https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607)
+    [evidence](evidence/slavv_37_reasons_nn.md#L122-L132)
+    > **12. Standardize the features.**
+    > Did you standardize your input to have zero mean and unit variance?
+    > 13. Do you have too much data augmentation?
+    > Augmentation has a regularizing effect. Too much of this combined with
+    > other forms of regularization (weight L2, dropout, etc.) can cause
+    > the net to underfit.
+    > **14. Check the preprocessing of your pretrained model.**
+    > If you are using a pretrained model, make sure you are using the same
+    > normalization and preprocessing as the model was when training. For
+    > example, should an image pixel be in the range (0, 1), (-1, 1) or (0, 255)?
+    {reason: "popular debugging checklist (2017), independent practitioner; items 12 and 14 address normalization/preprocessing (item 13 concerns data augmentation)", credence: 0.70}
+----
+(4) [Normalize Robust]: Normalizing inputs to mean=0, std=1 is a
+    robustly supported heuristic across 3 independent lineages
+    (RL research, industry courses, practitioner checklists).
+    {reason: "three independent sources from different sub-communities all converge on the same recommendation; community adoption in major frameworks confirms", inference: 0.90}
+  +> [Folklore Reliable]
+
+
+## Overfit First / Test in Isolation
+
+<Isolation Testing Consensus>
+
+(1) [CS231n Overfit]: CS231n lists overfitting a tiny subset as the
+    most important sanity check before full training. #observation
+    [CS231n Neural Networks 3](https://cs231n.github.io/neural-networks-3/)
+    [evidence](evidence/cs231n_neural_networks_3.md#L87-L89)
+    > **Overfit a tiny subset of data**. Lastly and most importantly, before
+    > training on the full dataset try to train on a tiny portion (e.g. 20
+    > examples) of your data and make sure you can achieve zero cost. For this
+    > experiment it's also best to set regularization to zero, otherwise this
+    > can prevent you from getting zero cost. **Unless you pass this sanity
+    > check with a small dataset it is not worth proceeding to the full
+    > dataset.**
+    {reason: "CS231n (Karpathy/Li/Johnson) is the canonical deep learning course; 'most importantly' framing shows high confidence in this heuristic", credence: 0.85}
+(2) [FSDL Overfit]: FSDL positions single-batch overfitting as the
+    step immediately after getting the model to run. #observation
+    [FSDL Spring 2021 Lecture 7](https://fullstackdeeplearning.com/)
+    [evidence](evidence/fsdl_spring2021_lecture7.md#L433-L451)
+    > After getting your model to run, the next thing you need to do is to
+    > **overfit a single batch of data**. This is a heuristic that can catch
+    > an absurd number of bugs. This really means that you want to drive your
+    > training error arbitrarily close to 0. There are a few things that can
+    > happen when you try to overfit a single batch and it fails:
+    > **Error goes up**: Commonly, this is due to a flip sign somewhere in
+    > the loss function/gradient. **Error explodes**: This is usually a
+    > numerical issue but can also be caused by a high learning rate.
+    {reason: "FSDL (Josh Tobin) independent from CS231n; 'absurd number of bugs' is strong practitioner endorsement", credence: 0.80}
+(3) [Goodfellow Overfit]: Goodfellow et al. state that inability to
+    fit a single example usually indicates a software defect. #observation
+    [Goodfellow et al. Deep Learning Ch11](https://www.deeplearningbook.org/)
+    [evidence](evidence/goodfellow_ch11_practical_methodology.md#L218)
+    > Fit a tiny dataset: If you have high error on the training set,
+    > determine whether it is due to genuine underfitting or due to a
+    > software defect. **Usually even small models can be guaranteed to be
+    > able fit a sufficiently small dataset.** For example, a classification
+    > dataset with only one example can be fit just by setting the biases of
+    > the output layer correctly. Usually if you cannot train a classifier to
+    > correctly label a single example... **there is a software defect
+    > preventing successful optimization on the training set.**
+    {reason: "Goodfellow/Bengio/Courville textbook is the canonical deep learning reference; independent from CS231n and FSDL", credence: 0.90}
+----
+(4) [Overfit First Robust]: 'Overfit a tiny dataset as sanity check'
+    is robustly supported across 3 independent authoritative sources.
+    {reason: "canonical textbook + two major courses all prescribe the same test as a check for implementation defects before full training", inference: 0.90}
+  +> [Folklore Reliable]
+
+
+## Assume You Have a Bug
+
+<Bug First Consensus>
+
+(1) [Jones Bug]: Andy Jones argues that RL practitioners are reluctant
+    to admit bugs, but bugs are the most common cause of failure. #observation
+    [Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html)
+    [evidence](evidence/andyljones_rl_debugging.md#L174-L182)
+    > When their RL implementation doesn't work, people are often keen to
+    > either (a) adjust their network architecture or (b) adjust their
+    > hyperparameters. On the other hand, they're reluctant to say they've
+    > got a bug. **Most often, it turns out they've got a bug.** Why bugs
+    > are so much more common in RL code is discussed above, but there's
+    > another advantage to assuming you've got a bug: bugs are a damn sight
+    > faster to find and fix than validating that your new architecture is
+    > an improvement over the old one.
+    {reason: "Jones is an experienced RL practitioner; this blog post is widely cited in the RL community and is the primary source for structured RL debugging advice", credence: 0.80}
+(2) [Goodfellow Bug Mask]: Goodfellow et al. warn that neural net
+    components can adapt to compensate for bugs, masking them. #observation
+    [Goodfellow et al. Deep Learning Ch11](https://www.deeplearningbook.org/)
+    [evidence](evidence/goodfellow_ch11_practical_methodology.md#L194-L206)
+    > When a machine learning system performs poorly, it is usually difficult
+    > to tell whether the poor performance is intrinsic to the algorithm
+    > itself or whether there is a bug in the implementation of the
+    > algorithm. **If one part is broken, the other parts can adapt and
+    > still achieve roughly acceptable performance.** The bug may not be
+    > apparent just from examining the output of the model though. Depending
+    > on the distribution of the input, the weights may be able to adapt to
+    > compensate for the negative biases.
+    {reason: "canonical textbook; the 'adaptive compensation' mechanism explains why ML bugs are uniquely hard to detect", credence: 0.90}
+(3) [Jones Loss Herring]: Jones explicitly warns that loss curves
+    don't localize errors and are therefore a red herring for
+    debugging. #observation
+    [Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html)
+    [evidence](evidence/andyljones_rl_debugging.md#L184-L188)
+    > When someone's RL implementation isn't working, they *luuuuuurv* to
+    > copy-paste a screenshot of their loss curve to you. The problem with
+    > using the loss curve as an indicator of correctness is somewhat that
+    > it's not reliable, but mostly because **it doesn't localise errors.**
+    > The shape of your loss curve says very little about where in your code
+    > you've messed up, and so says very little about what you need to
+    > change to get things working.
+    {reason: "same source as (1) but independent observation about loss curves specifically; consistent with Goodfellow's point about adaptive masking", credence: 0.80}
+----
+(4) [Bug First Robust]: 'Assume you have a bug' is well-supported:
+    bugs are common, adaptive compensation masks them, and loss
+    curves don't help localize them.
+    {reason: "Jones (practitioner) and Goodfellow (textbook) independently describe the same mechanism: ML systems mask bugs through adaptation. Loss curves give false comfort.", inference: 0.85}
+  +> [Folklore Reliable]
+
+
+# Section 2: RL-Specific Debugging
+
+## Seed Variance
+
+<Seed Variance Evidence>
+
+(1) [Schulman Seeds]: Schulman demonstrates that 3 seemingly different
+    algorithms on 7 MuJoCo tasks were actually the same algorithm
+    with different random seeds. #observation
+    [Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ)
+    [evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L377-L412)
+    > you can see seven different tasks these are the Gym MuJoCo tasks
+    > like half cheetah and hopper and so on and you have three different
+    > algorithms here the red one the green one and the blue one... but as
+    > it turns out **these are all the exact same algorithms and just random
+    > seeds different random seeds** so it's easy to imagine that you're
+    > just looking at one of these problems then you see that blue curve and
+    > you think you get really excited than you think you found some huge
+    > improvement to your algorithm but it's really that you just got a
+    > lucky seed... **even if you had like 20 seeds here there's a still a
+    > pretty big error bar**
+    {reason: "Schulman (PPO author) showing his own data; this demonstration is one of the most cited examples of RL noise in the community", credence: 0.90}
+(2) [Henderson Seeds]: Henderson et al. show that same hyperparameters
+    with different seeds produce statistically different learning
+    curves on standard benchmarks. #observation
+    [Henderson et al. 2018 - Deep RL That Matters](https://arxiv.org/abs/1709.06560)
+    [evidence](evidence/henderson_2018_deep_rl_matters.md#L233-L235)
+    > We perform 10 experiment trials, for the same hyperparameter
+    > configuration, only varying the random seed across all 10 trials. We
+    > then split the trials into two sets of 5 and average these two
+    > groupings together. As shown in Figure 5, we find that **the
+    > performance of algorithms can be drastically different.** We
+    > demonstrate that the variance between runs is enough to create
+    > statistically different distributions just from varying random seeds.
+    > Our experiment with random seeds shows that this can be potentially
+    > misleading.
+    {reason: "peer-reviewed AAAI 2018 paper; systematic experimental study with statistical testing (t-test results reported); independent from Schulman's demo", credence: 0.85}
+(3) [Irpan Seeds]: Irpan reports a 30% failure rate on Pendulum
+    across 10 seeds with identical hyperparameters, and notes
+    that this would be considered a bug in supervised learning. #observation
+    [Alex Irpan - RL Hard](https://www.alexirpan.com/2018/02/14/rl-hard.html)
+    [evidence](evidence/alexirpan_rl_hard.md#L651-L678)
+    > Here is a plot of performance, after I fixed all the bugs. Each line
+    > is the reward curve from one of 10 independent runs. Same
+    > hyperparameters, the only difference is the random seed. **Seven of
+    > these runs worked. Three of these runs didn't. A 30% failure rate
+    > counts as working.** Look, there's variance in supervised learning
+    > too, but it's rarely this bad. If my supervised learning code failed
+    > to beat random chance 30% of the time, I'd have super high confidence
+    > there was a bug in data loading or training. **If my reinforcement
+    > learning code does no better than random, I have no idea if it's a
+    > bug, if my hyperparameters are bad, or if I simply got unlucky.**
+    {reason: "Google Brain engineer's direct experience; the SL vs RL comparison makes the point vivid; independent from Schulman and Henderson", credence: 0.80}
+----
+(4) [Seed Variance Robust]: RL seed variance is extreme -- same algo
+    with different seeds can look like different algorithms. This
+    is robustly demonstrated across 3 independent sources with
+    quantitative evidence.
+    {reason: "primary algorithm author (Schulman), peer-reviewed study (Henderson), and independent practitioner (Irpan) all demonstrate the same effect with data; the 30% failure rate on Pendulum is a striking data point", inference: 0.90}
+  +> [Folklore Reliable]
+
+
+## Batch Size
+
+<Batch Size Evidence>
+
+(1) [Schulman Batch]: Schulman warns that batch sizes too small
+    cause noise to overwhelm signal, citing his own TRPO debugging
+    experience needing 100K timesteps per batch. #observation
+    [Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ)
+    [evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L279-L307)
+    > sometimes you should use more samples than you think you're going
+    > to need because usually things just work better when you have more
+    > samples almost always... if you want to just get something working at
+    > all often you need to use bigger batch sizes and you thought because
+    > **if your batch size is too small than the noise will overwhelm the
+    > signal and you won't learn anything**
+    {reason: "Schulman's personal debugging story: TRPO needed 100K timesteps per batch (documented in slides). Direct experience from algorithm author.", credence: 0.90}
+(2) [Schulman Batch Slides]: Schulman's slides give specific batch
+    size numbers for TRPO and DQN on Atari. #observation
+    [Schulman 2016 Slides](http://joschu.net/docs/nuts-and-bolts.pdf)
+    [evidence](evidence/joschu_nuts_and_bolts.md#L61-L72)
+    > Run with More Samples Than Expected. **Early in tuning process, may
+    > need huge number of samples.** Don't be deterred by published work.
+    > Examples: TRPO on Atari: 100K timesteps per batch for KL=0.01.
+    > DQN on Atari: update freq=10K, replay buffer size=1M.
+    {reason: "same author as (1) but written slides with specific numbers; corroborates the talk", credence: 0.90}
+(3) [McCandlish Critical Batch]: McCandlish et al. derive a critical
+    batch size that predicts speed/efficiency tradeoffs, finding
+    it grows during training as gradients shrink. #observation
+    [McCandlish et al. 2018](https://arxiv.org/abs/1812.06162)
+    [evidence](evidence/mccandlish_2018_large_batch.md#L180-L196)
+    > Equation 2.7 nevertheless predicts the dependence of training speed
+    > on batch size remarkably well, even for full training runs that range
+    > over many points in the loss landscape. **By averaging Equation 2.7
+    > over multiple optimization steps, we find a simple relationship
+    > between training speed and data efficiency.** Here, S and Smin
+    > represent the actual and minimum possible number of steps taken to
+    > reach a specified level of performance, respectively.
+    {reason: "OpenAI paper (arXiv preprint) providing theoretical foundation for batch size effects; explains WHY Schulman's observation holds", credence: 0.85}
+----
+(4) [Batch Size Robust]: 'Use bigger batches than you think' is
+    supported by both practitioner experience and theoretical analysis.
+    {reason: "Schulman's empirical observation is explained by McCandlish's noise scale theory; the critical batch size concept provides a principled way to reason about it", inference: 0.85}
+  +> [Folklore Reliable]
+
+
+## Reward Engineering
+
+<Reward Engineering Evidence>
+
+(1) [Schulman Reward Mean]: Schulman warns that shifting reward mean
+    changes the agent's 'will to live' -- how long it wants to
+    survive -- thereby changing the problem. #observation
+    [Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ)
+    [evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L513-L519)
+    > for the rewards I'd recommend rescaling it but not shifting them
+    > because **that affects the agents will to live so if you shift the
+    > mean reward that'll affect whether how long it wants to survive
+    > you're actually changing the problem**
+    {reason: "Schulman explains the causal mechanism: reward mean shift changes the MDP's optimal policy, not just scaling. This is not obvious to beginners.", credence: 0.90}
+(2) [Jones Reward Scale]: Jones identifies reward scaling as the single
+    most common issue for RL newbies, and warns against adaptive
+    reward scaling as extra nonstationarity. #observation
+    [Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html)
+    [evidence](evidence/andyljones_rl_debugging.md#L115-L119)
+    > The single most common issue for newbies writing custom RL
+    > implementations is that the targets arriving at their neural net
+    > aren't [-1, +1]. Actually, anything [-.1, +.1]ish to [-10, +10]ish
+    > is good. Having read that, you might be tempted to write some
+    > adaptive scheme to scale your rewards for you. **Don't: it's an extra
+    > bit of nonstationarity that'll make life more difficult. Just
+    > hand-scale, hand-clip the rewards** from your env so that the targets
+    > passed to your network are sensible.
+    {reason: "Jones independently converges on same advice as Schulman; labels it 'single most common issue'; explicitly warns against adaptive schemes", credence: 0.80}
+(3) [Henderson Reward Scale]: Henderson et al. show that multiplying
+    rewards by a scalar causes significant performance differences
+    in DDPG, with inconsistent effects across environments. #observation
+    [Henderson et al. 2018 - Deep RL That Matters](https://arxiv.org/abs/1709.06560)
+    [evidence](evidence/henderson_2018_deep_rl_matters.md#L181)
+    > Reward rescaling has been used in several recent works (Duan et al.
+    > 2016; Gu et al. 2016) to improve results for DDPG. This involves
+    > simply multiplying the rewards generated from an environment by some
+    > scalar (rhat = r*sigma) for training. Often, these works report using a
+    > reward scale of sigma = 0.1. In Atari domains, this is akin to clipping
+    > the rewards to (0, 1). **By intuition, in gradient based methods (as
+    > used in most deep RL) a large and sparse output scale can result in
+    > problems regarding saturation and inefficiency in learning** (LeCun
+    > et al. 2012; Glorot and Bengio 2010; Vincent, de Brebisson, and
+    > Bouthillier 2015).
+    {reason: "peer-reviewed AAAI paper; provides gradient-based mechanism explaining WHY reward scale matters; cites 3 supporting references", credence: 0.85}
+----
+(4) [Reward Robust]: Reward scaling advice (hand-scale, don't shift
+    mean, keep targets within roughly [-10, +10]) is well-supported across
+    practitioner experience, RL research, and controlled experiments.
+    {reason: "Schulman (causal mechanism), Jones (practitioner experience), Henderson (experimental validation) all converge; the 'will to live' explanation is especially compelling", inference: 0.85}
+  +> [Folklore Reliable]
+
+
+## Reference Implementations
+
+<Reference Impl Consensus>
+
+(1) [Jones Ref Impl]: Jones calls writing RL from scratch 'the most
+    catastrophically self-sabotaging thing you can do' as a
+    newcomer. #observation
+    [Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html)
+    [evidence](evidence/andyljones_rl_debugging.md#L153-L165)
+    > **If you're new to reinforcement learning, writing things from scratch
+    > is the most catastrophically self-sabotaging thing you can do.** There
+    > is an alluring masochism in writing things from scratch. There's
+    > concrete value in it too: by writing things from scratch, you're both
+    > forced to fully understand what you're doing and you're more likely to
+    > come up with a fresh perspective. **In reinforcement learning, these
+    > benefits are not worth it.** At all. As discussed above, the nature
+    > of RL work makes it extremely hard for you to self-correct.
+    {reason: "Jones provides three graduated risk levels for using references (out-of-box, components, one-eye-on); concrete implementation lists (spinningup, stable-baselines3, cleanrl, OpenSpiel)", credence: 0.80}
+(2) [Rahtz Ref]: Rahtz spent 8 months reproducing a Deep RL paper and
+    found that even small normalization bugs can hide for months,
+    supporting the case for starting from reference code. #observation
+    [Matthew Rahtz - Reproducing Deep RL](http://amid.fish/reproducing-deep-rl)
+    [evidence](evidence/amid_fish_reproducing_deep_rl.md#L36-L52)
+    > reinforcement learning turned out to be a lot trickier than expected.
+    > A big part of it is that right now, reinforcement learning is really
+    > sensitive. There are a lot of details to get just right, and **if you
+    > don't get them right, it can be difficult to diagnose where you've
+    > gone wrong.** After finishing the basic implementation, training runs
+    > just weren't succeeding... it turned out to be because of **problems
+    > with normalization of rewards and pixel data at a key stage.**
+    {reason: "first-person account of a months-long debugging effort traced to a normalization bug; vivid illustration of why references save time", credence: 0.75}
+----
+(3) [Ref Impl Robust]: Starting from reference implementations is
+    strongly supported by practitioner experience: the self-correction
+    mechanisms in RL are too weak for solo implementation.
+    {reason: "Jones's forceful advice is validated by Rahtz's 8-month experience; the underlying theory (RL's weak error signals) provides a causal explanation", inference: 0.85}
+  +> [Folklore Reliable]
+
+
+## Pursue Anomalies
+
+<Anomaly Pursuit Evidence>
+
+(1) [Jones Anomaly]: Jones recommends chasing anomalies immediately,
+    calling it 'one of the most powerful ways to debug'. #observation
+    [Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html)
+    [evidence](evidence/andyljones_rl_debugging.md#L101-L109)
+    > If you ever see a plot or a behaviour that just *seems weird*, chase
+    > right after it! Do not - do *not* - just 'hope it goes away'.
+    > **Chasing anomalies is one of the most powerful ways to debug your
+    > system**, because if you've noticed a problem without having had to go
+    > look for it, that means it's a *really big problem*. It's really
+    > tempting to think that the cool extra functionality you were planning
+    > to write today might just magically fix this anomalous behaviour.
+    > It won't. Give up on your plan for the day and chase the anomaly
+    > instead.
+    {reason: "strong practitioner endorsement; the causal reasoning (visible anomaly = big problem) is sound", credence: 0.80}
+(2) [Rahtz Confusion]: Rahtz independently converges on the same
+    advice, calling it 'noticing confusion' -- following confusion
+    led to finding a normalization bug. #observation
+    [Matthew Rahtz - Reproducing Deep RL](http://amid.fish/reproducing-deep-rl)
+    [evidence](evidence/amid_fish_reproducing_deep_rl.md#L77-L102)
+    > A corollary is to **try and be as sensitive as possible in noticing
+    > confusion**. There were a lot of points in this project where the
+    > only clues came from noticing some small thing that didn't make sense.
+    > It was only by following that confusion and realising that taking the
+    > difference between frames zeroed out the background that gave the
+    > hint of a problem with normalization. Learn to **recognise what
+    > confusion *feels* like**... **commit yourself to always investigate
+    > whenever you notice confusion.**
+    {reason: "independent practitioner arriving at same principle through personal experience; the normalization bug was only found this way", credence: 0.75}
+----
+(3) [Anomaly Robust]: 'Pursue anomalies immediately' is supported by
+    two independent practitioners who both found it was the key
+    debugging strategy for hard-to-diagnose issues.
+    {reason: "Jones and Rahtz independently describe the same strategy with different language (anomalies vs confusion) and different stories but the same conclusion", inference: 0.85}
+  +> [Folklore Reliable]
+
+
+## Comprehensive Logging
+
+<Logging Consensus>
+
+(1) [Rahtz Log]: Rahtz recommends logging all metrics you can to
+    maximize diagnostic evidence per run. #observation
+    [Matthew Rahtz - Reproducing Deep RL](http://amid.fish/reproducing-deep-rl)
+    [evidence](evidence/amid_fish_reproducing_deep_rl.md#L197-L201)
+    > First, adopting an attitude of **log all the metrics you can** to
+    > maximise the amount of evidence you gather on each run. There are
+    > obvious metrics like training/validation accuracy, but it might also
+    > be worth spending a good chunk of time at the start of the project
+    > brainstorming and researching which other metrics might be important
+    > for diagnosing potential problems.
+    {reason: "learned from 8-month reproduction attempt; specifically regrets not logging policy entropy earlier", credence: 0.75}
+(2) [Goodfellow Monitor]: Goodfellow et al. recommend visualizing
+    activation and gradient statistics collected over many
+    training iterations. #observation
+    [Goodfellow et al. Deep Learning Ch11](https://www.deeplearningbook.org/)
+    [evidence](evidence/goodfellow_ch11_practical_methodology.md#L238)
+    > **It is often useful to visualize statistics of neural network
+    > activations and gradients, collected over a large amount of training
+    > iterations.** The preactivation value of hidden units can tell us if
+    > the units saturate, or how often they do... it is useful to compare
+    > the magnitude of parameter gradients to the magnitude of the
+    > parameters themselves.
+    {reason: "canonical textbook; independent from Rahtz; specifies what to log (activations, gradients, parameter magnitudes)", credence: 0.90}
+----
+(3) [Logging Robust]: Comprehensive logging is consistently recommended
+    across textbooks, courses, and practitioner accounts.
+    {reason: "Goodfellow (textbook), Rahtz (practitioner), and multiple other sources (FSDL, reddit threads) all emphasize logging as foundational; no dissenting voice found", inference: 0.90}
+  +> [Folklore Reliable]
+
+
+## Random HP Search
+
+<Random Search Evidence>
+
+(1) [CS231n Random]: CS231n cites Bergstra and Bengio (2012) showing
+    random search is more efficient than grid search for
+    hyperparameter optimization. #observation
+    [CS231n Neural Networks 3](https://cs231n.github.io/neural-networks-3/)
+    [evidence](evidence/cs231n_neural_networks_3.md#L306-L312)
+    > **Prefer random search to grid search.** As argued by Bergstra and
+    > Bengio in Random Search for Hyper-Parameter Optimization, "randomly
+    > chosen trials are more efficient for hyper-parameter optimization
+    > than trials on a grid". It is very often the case that **some of the
+    > hyperparameters matter much more than others**. Performing random
+    > search rather than grid search allows you to much more precisely
+    > discover good values for the important ones.
+    {reason: "CS231n citing peer-reviewed JMLR paper (Bergstra & Bengio 2012); the intuition (some HPs matter more) is well-established", credence: 0.85}
+(2) [Schulman Random]: Schulman endorses random sampling + human
+    regression as his preferred HP search method. #observation
+    [Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ)
+    [evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L995-L1013)
+    > favorite hyper parameter optimization framework... I just like to
+    > use the **uniform random sampling yeah that works really well I mean
+    > you just run a bunch of experiments with random hyper parameters and
+    > then you just look at the results the next day and do some regression
+    > to figure out which parameters actually mattered** and then you've
+    > run another experiment with better parameter ranges... I use the
+    > human version of it.
+    {reason: "Schulman's personal method; independent endorsement of random search from RL context (CS231n focuses on supervised)", credence: 0.85}
+----
+(3) [Random Search Robust]: Random HP search + manual analysis is
+    supported by both theory (Bergstra & Bengio) and practitioner
+    preference (Schulman).
+    {reason: "peer-reviewed paper provides theoretical justification; leading practitioner independently uses same method; the intuition (some HPs matter more than others) is universally recognized", inference: 0.85}
+  +> [Folklore Reliable]
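
The random-search-then-analyze loop described above can be sketched as follows. This is a minimal illustration, not Schulman's or CS231n's actual tooling: the hyperparameter names and ranges, the `toy_score` function (a hypothetical stand-in for a real training run), and the binned variance measure (a crude substitute for the regression Schulman mentions) are all assumptions made for the example.

```python
import random

def sample_config(rng):
    """Draw one random configuration (names and ranges are illustrative assumptions)."""
    return {
        "log10_lr": rng.uniform(-5.0, -1.0),  # uniform in log space = log-uniform learning rate
        "dropout": rng.uniform(0.0, 0.5),
    }

def toy_score(cfg, rng):
    """Hypothetical stand-in for a training run: lr matters a lot, dropout barely."""
    return -(cfg["log10_lr"] + 3.0) ** 2 - 0.01 * cfg["dropout"] + rng.gauss(0.0, 0.05)

def binned_importance(values, scores, n_bins=5):
    """Share of score variance explained by which bin of `values` a trial fell in."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    mean = sum(scores) / n
    total = sum((s - mean) ** 2 for s in scores)
    between = 0.0
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        bin_mean = sum(scores[i] for i in idx) / len(idx)
        between += len(idx) * (bin_mean - mean) ** 2
    return between / total

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(200)]
scores = [toy_score(cfg, rng) for cfg in configs]

# Rank hyperparameters by how much of the score variance each one explains.
importance = {
    name: binned_importance([cfg[name] for cfg in configs], scores)
    for name in ("log10_lr", "dropout")
}
for name, share in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.2f} of score variance explained")
```

In a real run, `toy_score` would be replaced by the final metric of each experiment, and the hyperparameters that explain little variance are the ones whose ranges can be frozen before the next round of sampling.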
+
+
+## Probe Environments
+
+<Probe Env Evidence>
+
+(1) [Jones Probes]: Jones describes a sequence of probe environments
+    that progressively isolate value network, backprop, reward
+    discounting, and policy errors. #observation
+    [Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html)
+    [evidence](evidence/andyljones_rl_debugging.md#L204-L221)
+    > Instead, construct environments that do localise errors. In a recent
+    > project, I used 1. **One action, zero observation, one timestep long,
+    > +1 reward every timestep**: This isolates the value network. 2. **One
+    > action, random +1/-1 observation, one timestep long, obs-dependent
+    > +1/-1 reward every time**: If my agent can learn the value in (1.) but
+    > not this one, it must be that backpropagation through my network is
+    > broken. 3. **One action, zero-then-one observation, two timesteps
+    > long, +1 reward at the end**: If my agent can learn the value in (2.)
+    > but not this one, it must be that my reward discounting is broken.
+    > You get the idea: (1.) is the simplest possible environment, and
+    > **each new env adds the smallest possible bit of functionality. If the
+    > old env works but the successor doesn't, that gives you a lot of
+    > information about where the problem is.**
+    {reason: "Jones is the only source with this detailed probe env methodology, but the technique is a direct application of the general 'test in isolation' principle (Goodfellow, CS231n) to RL specifically", credence: 0.80}
+----
+(2) [Probe Env Useful]: Probe environments are a practical application
+    of component isolation testing for RL, where standard envs
+    like CartPole don't localize errors.
+    {reason: "single source but the methodology is a rigorous instantiation of widely-supported isolation testing; each probe takes seconds, making it fast to verify", inference: 0.80}
+  +> [Folklore Reliable]
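
A minimal sketch of the first three probe environments from the Jones quote, with the value each one should produce. Everything here is an assumption made for illustration: the episode representation (a list of `(observation, reward)` steps), the discount `GAMMA`, and the tabular TD(0) learner, which stands in for the value network you would actually be testing.

```python
import random

GAMMA = 0.9  # assumed discount factor

def probe1(rng):
    """One action, zero observation, one timestep, +1 reward."""
    return [(0, 1.0)]

def probe2(rng):
    """One action, random +1/-1 observation, one timestep, reward equals the observation."""
    obs = rng.choice([-1, 1])
    return [(obs, float(obs))]

def probe3(rng):
    """One action, zero-then-one observation, two timesteps, +1 reward at the end."""
    return [(0, 0.0), (1, 1.0)]

# Value each observation should converge to if the learner is correct.
EXPECTED = {
    probe1: {0: 1.0},
    probe2: {-1: -1.0, 1: 1.0},
    probe3: {0: GAMMA, 1: 1.0},
}

def td0_values(env, episodes=2000, lr=0.1, seed=0):
    """Tabular TD(0); a stand-in for the agent's value network."""
    rng = random.Random(seed)
    values = {}
    for _ in range(episodes):
        episode = env(rng)
        for t, (obs, reward) in enumerate(episode):
            is_last = t + 1 == len(episode)
            next_value = 0.0 if is_last else values.get(episode[t + 1][0], 0.0)
            old = values.get(obs, 0.0)
            values[obs] = old + lr * (reward + GAMMA * next_value - old)
    return values

def passes(env, tol=0.05):
    learned = td0_values(env)
    return all(abs(learned.get(obs, 0.0) - v) < tol for obs, v in EXPECTED[env].items())

results = {env.__name__: passes(env) for env in (probe1, probe2, probe3)}
for name, ok in results.items():
    print(name, "ok" if ok else "FAILED")
```

The probes are run in order: a failure on `probe2` after `probe1` passes points at how observations reach the value estimate, and a failure on `probe3` after `probe2` passes points at discounting, as in the quoted sequence.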
+
+
+## Policy Entropy and KL Diagnostics
+
+<Entropy KL Evidence>
+
+(1) [Schulman Entropy]: Schulman recommends monitoring policy entropy
+    closely: entropy that drops too fast means the policy is turning
+    deterministic and will stop exploring, while entropy that never
+    drops means the policy stays too random to perform well. #observation
+    [Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ)
+    [evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L631-L652)
+    > look at the entropy really carefully if your entropy is going down too
+    > fast that means your policy is becoming deterministic and it's not
+    > going to explore anything... also if it's not going down your policy is
+    > never going to be that good because it's always really random... **you
+    > can sort of alleviate this issue by using an entropy bonus or a KL
+    > penalty** so by stopping yourself from changing the policy the
+    > probability distribution too fast as a side effect you also prevent
+    > the entropy from going down too fast... I also look at the KL as a
+    > diagnostic like look at how big of an update you're doing in terms of
+    > KL divergence
+    {reason: "Schulman designed PPO's clipped objective specifically to control KL; these diagnostics come from the algorithm author's direct practice", credence: 0.90}
+----
+(2) [Entropy KL Useful]: Policy entropy and KL divergence are
+    essential RL-specific diagnostics that detect exploration
+    failure and update instability.
+    {reason: "single source but from the algorithm designer; entropy/KL monitoring is now built into stable-baselines3, RLlib, and cleanrl as standard", inference: 0.85}
+  +> [Folklore Reliable]
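
As a rough illustration of the two diagnostics, the sketch below computes policy entropy and the KL divergence between the pre-update and post-update action distributions for a single state of a discrete-action policy. The function names and example logits are assumptions for this example; in practice both quantities would be averaged over a batch of states and logged every update.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability actions contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(old_probs, new_probs):
    """KL(old || new) in nats; assumes new_probs is strictly positive."""
    return sum(p * math.log(p / q) for p, q in zip(old_probs, new_probs) if p > 0.0)

# Hypothetical logits for one state, before and after a policy update.
old_probs = softmax([0.0, 0.0, 0.0, 0.0])
new_probs = softmax([2.0, 0.0, 0.0, 0.0])

print(f"entropy before update: {entropy(old_probs):.3f}")
print(f"entropy after update:  {entropy(new_probs):.3f}")
print(f"KL(old || new):        {kl_divergence(old_probs, new_probs):.3f}")
```

Plotted over training, the entropy curve shows whether the policy is collapsing or staying random, and the KL value shows how large each update is.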
+
+
+# Evidence Against
+
+## Sources Are Dated
+
+<Source Age Concern>
+
+(1) [Dated Sources]: Most sources date from 2017-2018 (Jones 2021 is
+    the main exception), before transformers, RLHF, large-scale
+    pretraining, and modern frameworks became dominant. #assumption
+    {reason: "Schulman 2017, Jones 2021, Henderson 2018, Irpan 2018, CS231n ~2017; the RL landscape has shifted substantially since then (PPO is now standard, RLHF is a major use case, JAX/PyTorch 2.0 changed workflows)", credence: 0.65}
+----
+(2) [Age Limits]: Some folklore may not transfer to modern settings
+    (e.g., batch size advice may differ for LLM fine-tuning vs
+    classic RL; reward scaling is less relevant for RLHF).
+    {reason: "core debugging principles (test isolation, logging, seed variance) are architecture-agnostic and likely durable; specific HP defaults and RL diagnostics may need updating", inference: 0.40}
+  -> [Folklore Reliable]
+
+
+## RL-Specific Focus
+
+<RL Specificity Concern>
+
+(1) [RL Heavy]: The SKILL is heavily weighted toward RL debugging,
+    with ~60% of content RL-specific (probe envs, reward scaling,
+    policy entropy, KL diagnostics). #assumption
+    {reason: "Parts 2 and 4 are RL-only; Part 1 is general but many examples are RL-flavored; limits applicability for users doing pure supervised learning or generative modeling", credence: 0.80}
+----
+(2) [Scope Limits]: The RL-heavy focus limits the SKILL's
+    applicability, but the general debugging principles (Parts 1, 3, 5)
+    transfer broadly.
+    {reason: "the RL focus is clearly labeled in the SKILL; general ML principles like normalization, isolation testing, and loss surface analysis are domain-agnostic", inference: 0.30}
+  -> [Folklore Reliable]