mirror of
https://github.com/wassname/ml-debug.git
synced 2026-06-27 17:31:04 +08:00
4393cceefd
Deep research to uplift LLMs for ML debugging, opinionated by source selection. Distilled from Schulman, Jones, Rahtz, Goodfellow, CS231n, FSDL, and more. Includes runnable diagnostic scripts and LLM-specific anti-patterns. Author: wassname (https://github.com/wassname)
587 lines
36 KiB
Plaintext
===
title: ML Debugging Folklore - Evidence Map
author: ml_debug SKILL synthesis
date: 2026-03-05
model:
  mode: strict
===

// This argdown maps claims from the ML Debugging Folklore SKILL.md
// back to sourced quotes across 21 evidence files. Each claim
// is traced to 2-3 independent sources with verbatim quotes.
//
// Credence guide:
//   0.90 = canonical textbook / primary algorithm author
//   0.85 = peer-reviewed paper / authoritative course
//   0.80 = established practitioner blog, widely cited
//   0.70 = popular blog / course notes
//   0.60 = reddit thread / community consensus

[Folklore Reliable]: ML debugging folklore -- practitioner heuristics
transmitted via talks, blog posts, and course materials -- provides
reliable, independently corroborated guidance for diagnosing and
fixing ML training failures.
  + <Normalization Consensus>
  + <Isolation Testing Consensus>
  + <Bug First Consensus>
  + <Seed Variance Evidence>
  + <Batch Size Evidence>
  + <Reward Engineering Evidence>
  + <Logging Consensus>
  + <Reference Impl Consensus>
  + <Anomaly Pursuit Evidence>
  + <Random Search Evidence>
  - <Source Age Concern>
  - <RL Specificity Concern>


# Section 1: General ML Debugging

## Normalize Inputs

<Normalization Consensus>

(1) [Schulman Normalize]: Schulman recommends standardizing all
observations via running mean/std, clipping, and rescaling
rewards without shifting the mean. #observation
[Schulman 2016 Slides](http://joschu.net/docs/nuts-and-bolts.pdf)
[evidence](evidence/joschu_nuts_and_bolts.md#L120-L125)
> - If observations have unknown range, standardize
> - Compute running estimate of mean and standard deviation
> - x' = clip((x − µ)/σ, −10, 10)
> - **Rescale the rewards, but don't shift mean, as that affects agent's will to live**
> - Standardize prediction targets (e.g., value functions) the same way
{reason: "Schulman is PPO/TRPO author; these slides are from his canonical 'Nuts and Bolts' talk, widely adopted in OpenAI baselines and stable-baselines3. Note: the evidence file contains PDF conversion artifacts (cid:73 bullet markers and a dropped left-hand side in the clip formula); they are cleaned up in the quote here.", credence: 0.90}
(2) [FSDL Normalize]: FSDL Lecture 7 lists normalizing input data as
a default step in the 'Start Simple' phase. #observation
[FSDL Spring 2021 Lecture 7](https://fullstackdeeplearning.com/)
[evidence](evidence/fsdl_spring2021_lecture7.md#L334-L338)
> The next step is to **normalize the input data**, subtracting the mean
> and dividing by the variance. Note that for images, it's fine to scale
> values to 0-1 or -0.5 to 0.5 (for example, by dividing by 255).
{reason: "FSDL is Josh Tobin's industry-focused course; independent from Schulman's RL lineage", credence: 0.80}
(3) [Slavv Normalize]: Ivanov's '37 Reasons' checklist lists
standardization as item #12 and preprocessing consistency
as items #14-#15. #observation
[Slavv 2017 - 37 Reasons](https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607)
[evidence](evidence/slavv_37_reasons_nn.md#L122-L132)
> **12. Standardize the features.**
> Did you standardize your input to have zero mean and unit variance?
> 13. Do you have too much data augmentation?
> Augmentation has a regularizing effect. Too much of this combined with
> other forms of regularization (weight L2, dropout, etc.) can cause
> the net to underfit.
> **14. Check the preprocessing of your pretrained model.**
> If you are using a pretrained model, make sure you are using the same
> normalization and preprocessing as the model was when training. For
> example, should an image pixel be in the range (0, 1), (-1, 1) or (0, 255)?
{reason: "popular debugging checklist (2017), independent practitioner; items 12 and 14 address normalization/preprocessing (item 13, quoted for context, concerns augmentation)", credence: 0.70}
----
(4) [Normalize Robust]: Normalizing inputs to mean=0, std=1 is a
robustly supported heuristic across 3 independent lineages
(RL research, industry courses, practitioner checklists).
{reason: "three independent sources from different sub-communities all converge on the same recommendation; community adoption in major frameworks confirms", inference: 0.90}
  +> [Folklore Reliable]
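// Sketch (illustration only, not taken from the cited sources): the
// recipe in premise (1) for scalar observations -- running mean/std,
// clip to [-10, 10], and rescale rewards without shifting them. The
// names RunningNormalizer and scale_reward are hypothetical, and
// Welford's update is just one common way to keep the running estimate.
/*
```python
import math


class RunningNormalizer:
    """Running mean/std estimate (Welford's algorithm) with clipping."""

    def __init__(self, clip=10.0, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.clip = clip
        self.eps = eps

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Population standard deviation; 1.0 until two samples are seen.
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / self.count)

    def normalize(self, x):
        """Standardize, then clip to [-clip, clip]."""
        z = (x - self.mean) / (self.std + self.eps)
        return max(-self.clip, min(self.clip, z))


def scale_reward(reward, scale):
    """Rescale a reward by division only, so its sign and zero point stay put."""
    return reward / scale


norm = RunningNormalizer()
for obs in [100.0, 102.0, 98.0, 101.0, 99.0]:
    norm.update(obs)
print(norm.normalize(100.0))  # near 0: 100 is the running mean
print(norm.normalize(1e6))    # an outlier, clipped to 10.0
```
*/
// For vector observations the same update would run per dimension.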


## Overfit First / Test in Isolation

<Isolation Testing Consensus>

(1) [CS231n Overfit]: CS231n lists overfitting a tiny subset as the
most important sanity check before full training. #observation
[CS231n Neural Networks 3](https://cs231n.github.io/neural-networks-3/)
[evidence](evidence/cs231n_neural_networks_3.md#L87-L89)
> **Overfit a tiny subset of data**. Lastly and most importantly, before
> training on the full dataset try to train on a tiny portion (e.g. 20
> examples) of your data and make sure you can achieve zero cost. For this
> experiment it's also best to set regularization to zero, otherwise this
> can prevent you from getting zero cost. **Unless you pass this sanity
> check with a small dataset it is not worth proceeding to the full
> dataset.**
{reason: "CS231n (Karpathy/Li/Johnson) is the canonical deep learning course; 'most importantly' framing shows high confidence in this heuristic", credence: 0.85}
(2) [FSDL Overfit]: FSDL positions single-batch overfitting as the
step immediately after getting the model to run. #observation
[FSDL Spring 2021 Lecture 7](https://fullstackdeeplearning.com/)
[evidence](evidence/fsdl_spring2021_lecture7.md#L433-L451)
> After getting your model to run, the next thing you need to do is to
> **overfit a single batch of data**. This is a heuristic that can catch
> an absurd number of bugs. This really means that you want to drive your
> training error arbitrarily close to 0. There are a few things that can
> happen when you try to overfit a single batch and it fails:
> **Error goes up**: Commonly, this is due to a flip sign somewhere in
> the loss function/gradient. **Error explodes**: This is usually a
> numerical issue but can also be caused by a high learning rate.
{reason: "FSDL (Josh Tobin) independent from CS231n; 'absurd number of bugs' is strong practitioner endorsement", credence: 0.80}
(3) [Goodfellow Overfit]: Goodfellow et al. state that inability to
fit a single example indicates a software defect. #observation
[Goodfellow et al. Deep Learning Ch11](https://www.deeplearningbook.org/)
[evidence](evidence/goodfellow_ch11_practical_methodology.md#L218)
> Fit a tiny dataset: If you have high error on the training set,
> determine whether it is due to genuine underfitting or due to a
> software defect. **Usually even small models can be guaranteed to be
> able fit a sufficiently small dataset.** For example, a classification
> dataset with only one example can be fit just by setting the biases of
> the output layer correctly. Usually if you cannot train a classifier to
> correctly label a single example... **there is a software defect
> preventing successful optimization on the training set.**
{reason: "Goodfellow/Bengio/Courville textbook is the canonical deep learning reference; independent from CS231n and FSDL", credence: 0.90}
----
(4) [Overfit First Robust]: 'Overfit a tiny dataset as sanity check'
is robustly supported across 3 independent authoritative sources.
{reason: "canonical textbook + two major courses all prescribe the same test; consistent across supervised and RL contexts", inference: 0.90}
  +> [Folklore Reliable]
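// Sketch (illustration only): the 'overfit a tiny subset' check from
// the premises above, using a hand-rolled logistic regression so the
// example needs no framework. The four-example dataset, step count and
// learning rate are assumptions; with a real model the same idea would
// be applied to one small batch with regularization switched off.
/*
```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def overfit_tiny_subset(examples, steps=2000, lr=1.0):
    """Train unregularized logistic regression; return (loss, accuracy)."""
    n_features = len(examples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in examples:
            logit = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(logit) - label  # gradient of the loss w.r.t. the logit
            for i, x in enumerate(features):
                grad_w[i] += error * x / len(examples)
            grad_b += error / len(examples)
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b

    loss, correct = 0.0, 0
    for features, label in examples:
        p = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard the logarithms
        loss -= (label * math.log(p) + (1 - label) * math.log(1.0 - p)) / len(examples)
        correct += int((p > 0.5) == bool(label))
    return loss, correct / len(examples)


# Four separable examples: the label equals the first feature.
tiny = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]
loss, accuracy = overfit_tiny_subset(tiny)
print(loss, accuracy)
```
*/
// If the loss stalls well above zero on data this small, the premises
// above say to suspect the code before the architecture.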


## Assume You Have a Bug

<Bug First Consensus>

(1) [Jones Bug]: Andy Jones argues that RL practitioners are reluctant
to admit bugs, but bugs are the most common cause of failure. #observation
[Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html)
[evidence](evidence/andyljones_rl_debugging.md#L174-L182)
> When their RL implementation doesn't work, people are often keen to
> either (a) adjust their network architecture or (b) adjust their
> hyperparameters. On the other hand, they're reluctant to say they've
> got a bug. **Most often, it turns out they've got a bug.** Why bugs
> are so much more common in RL code is discussed above, but there's
> another advantage to assuming you've got a bug: bugs are a damn sight
> faster to find and fix than validating that your new architecture is
> an improvement over the old one.
{reason: "Jones is an experienced RL practitioner; this blog post is widely cited in the RL community and is the primary source for structured RL debugging advice", credence: 0.80}
(2) [Goodfellow Bug Mask]: Goodfellow et al. warn that neural net
components can adapt to compensate for bugs, masking them. #observation
[Goodfellow et al. Deep Learning Ch11](https://www.deeplearningbook.org/)
[evidence](evidence/goodfellow_ch11_practical_methodology.md#L194-L206)
> When a machine learning system performs poorly, it is usually difficult
> to tell whether the poor performance is intrinsic to the algorithm
> itself or whether there is a bug in the implementation of the
> algorithm. **If one part is broken, the other parts can adapt and
> still achieve roughly acceptable performance.** The bug may not be
> apparent just from examining the output of the model though. Depending
> on the distribution of the input, the weights may be able to adapt to
> compensate for the negative biases.
{reason: "canonical textbook; the 'adaptive compensation' mechanism explains why ML bugs are uniquely hard to detect", credence: 0.90}
(3) [Jones Loss Herring]: Jones explicitly warns that loss curves
don't localize errors and are therefore a red herring for
debugging. #observation
[Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html)
[evidence](evidence/andyljones_rl_debugging.md#L184-L188)
> When someone's RL implementation isn't working, they *luuuuuurv* to
> copy-paste a screenshot of their loss curve to you. The problem with
> using the loss curve as an indicator of correctness is somewhat that
> it's not reliable, but mostly because **it doesn't localise errors.**
> The shape of your loss curve says very little about where in your code
> you've messed up, and so says very little about what you need to
> change to get things working.
{reason: "same source as (1) but independent observation about loss curves specifically; consistent with Goodfellow's point about adaptive masking", credence: 0.80}
----
(4) [Bug First Robust]: 'Assume you have a bug' is well-supported:
bugs are common, adaptive compensation masks them, and loss
curves don't help localize them.
{reason: "Jones (practitioner) reports that bugs are the most common cause of failure; Goodfellow (textbook) supplies the mechanism: other components adapt and mask the bug. Jones adds that loss curves do not localise errors, so a plausible-looking curve cannot confirm correctness.", inference: 0.85}
  +> [Folklore Reliable]


# Section 2: RL-Specific Debugging

## Seed Variance

<Seed Variance Evidence>

(1) [Schulman Seeds]: Schulman demonstrates that 3 seemingly different
algorithms on 7 MuJoCo tasks were actually the same algorithm
with different random seeds. #observation
[Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ)
[evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L377-L412)
> you can see seven different tasks these are the Jim Moo Joko tasks
> like half cheetah and hopper and so on and you have three different
> algorithms here the red one the green one and the blue one... but as
> it turns out **these are all the exact same algorithms and just random
> seeds different random seeds** so it's easy to imagine that you're
> just looking at one of these problems then you see that blue curve and
> you think you get really excited than you think you found some huge
> improvement to your algorithm but it's really that you just got a
> lucky seed... **even if you had like 20 seeds here there's a still a
> pretty big error bar**
{reason: "Schulman (PPO author) showing his own data; this demonstration is one of the most cited examples of RL noise in the community", credence: 0.90}
(2) [Henderson Seeds]: Henderson et al. show that same hyperparameters
with different seeds produce statistically different learning
curves on standard benchmarks. #observation
[Henderson et al. 2018 - Deep RL That Matters](https://arxiv.org/abs/1709.06560)
[evidence](evidence/henderson_2018_deep_rl_matters.md#L233-L235)
> We perform 10 experiment trials, for the same hyperparameter
> configuration, only varying the random seed across all 10 trials. We
> then split the trials into two sets of 5 and average these two
> groupings together. As shown in Figure 5, we find that **the
> performance of algorithms can be drastically different.** We
> demonstrate that the variance between runs is enough to create
> statistically different distributions just from varying random seeds.
> Our experiment with random seeds shows that this can be potentially
> misleading.
{reason: "peer-reviewed AAAI 2018 paper; systematic experimental study with statistical testing (t-test results reported); independent from Schulman's demo", credence: 0.85}
(3) [Irpan Seeds]: Irpan reports a 30% failure rate on Pendulum
across 10 seeds with identical hyperparameters, and notes
that this would be considered a bug in supervised learning. #observation
[Alex Irpan - RL Hard](https://www.alexirpan.com/2018/02/14/rl-hard.html)
[evidence](evidence/alexirpan_rl_hard.md#L651-L678)
> Here is a plot of performance, after I fixed all the bugs. Each line
> is the reward curve from one of 10 independent runs. Same
> hyperparameters, the only difference is the random seed. **Seven of
> these runs worked. Three of these runs didn't. A 30% failure rate
> counts as working.** Look, there's variance in supervised learning
> too, but it's rarely this bad. If my supervised learning code failed
> to beat random chance 30% of the time, I'd have super high confidence
> there was a bug in data loading or training. **If my reinforcement
> learning code does no better than random, I have no idea if it's a
> bug, if my hyperparameters are bad, or if I simply got unlucky.**
{reason: "Google Brain engineer's direct experience; the SL vs RL comparison makes the point vivid; independent from Schulman and Henderson", credence: 0.80}
----
(4) [Seed Variance Robust]: RL seed variance is extreme -- same algo
with different seeds can look like different algorithms. This
is robustly demonstrated across 3 independent sources with
quantitative evidence.
{reason: "primary algorithm author (Schulman), peer-reviewed study (Henderson), and independent practitioner (Irpan) all demonstrate the same effect with data; the 30% failure rate on Pendulum is a striking data point", inference: 0.90}
  +> [Folklore Reliable]
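// Sketch (illustration only): the split-half comparison described in
// premise (2) -- run one configuration under 10 seeds, split into two
// groups of 5, and compare the group means. simulated_final_return is a
// hypothetical stand-in for a real training run, and its mean/spread
// (1000, 300) are made-up numbers, not results from any source.
/*
```python
import random
import statistics


def simulated_final_return(seed):
    """Stand-in for one full training run; only the seed changes."""
    return random.Random(seed).gauss(1000.0, 300.0)


def split_half_gap(returns):
    """Gap between the mean of the first half and the mean of the second half."""
    half = len(returns) // 2
    return abs(statistics.mean(returns[:half]) - statistics.mean(returns[half:]))


def mean_and_stderr(returns):
    """Mean across seeds and the standard error of that mean."""
    return statistics.mean(returns), statistics.stdev(returns) / len(returns) ** 0.5


returns = [simulated_final_return(seed) for seed in range(10)]
print("split-half gap:", split_half_gap(returns))
print("mean, stderr:", mean_and_stderr(returns))
```
*/
// A claimed improvement that is smaller than the split-half gap of the
// baseline against itself is hard to distinguish from seed noise.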


## Batch Size

<Batch Size Evidence>

(1) [Schulman Batch]: Schulman warns that batch sizes too small
cause noise to overwhelm signal, citing his own TRPO debugging
experience needing 100K timesteps per batch. #observation
[Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ)
[evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L279-L307)
> sometimes you should use more samples than you think you're going
> to need because usually things just work better when you have more
> samples almost always... if you want to just get something working at
> all often you need to use bigger batch sizes and you thought because
> **if your batch size is too small than the noise will overwhelm the
> signal and you won't learn anything**
{reason: "Schulman's personal debugging story: TRPO needed 100K timesteps per batch (documented in slides). Direct experience from algorithm author.", credence: 0.90}
(2) [Schulman Batch Slides]: Schulman's slides give specific batch
size numbers for TRPO and DQN on Atari. #observation
[Schulman 2016 Slides](http://joschu.net/docs/nuts-and-bolts.pdf)
[evidence](evidence/joschu_nuts_and_bolts.md#L61-L72)
> Run with More Samples Than Expected. **Early in tuning process, may
> need huge number of samples.** Don't be deterred by published work.
> Examples: TRPO on Atari: 100K timesteps per batch for KL=0.01.
> DQN on Atari: update freq=10K, replay buffer size=1M.
{reason: "same author as (1) but written slides with specific numbers; corroborates the talk", credence: 0.90}
(3) [McCandlish Critical Batch]: McCandlish et al. derive a critical
batch size that predicts speed/efficiency tradeoffs, finding
it grows during training as gradients shrink. #observation
[McCandlish et al. 2018](https://arxiv.org/abs/1812.06162)
[evidence](evidence/mccandlish_2018_large_batch.md#L180-L196)
> Equation 2.7 nevertheless predicts the dependence of training speed
> on batch size remarkably well, even for full training runs that range
> over many points in the loss landscape. **By averaging Equation 2.7
> over multiple optimization steps, we find a simple relationship
> between training speed and data efficiency.** Here, S and Smin
> represent the actual and minimum possible number of steps taken to
> reach a specified level of performance, respectively.
{reason: "OpenAI paper (arXiv preprint) providing a theoretical foundation for batch size effects; explains why Schulman's observation holds", credence: 0.85}
----
(4) [Batch Size Robust]: 'Use bigger batches than you think' is
supported by both practitioner experience and theoretical analysis.
{reason: "Schulman's empirical observation is explained by McCandlish's noise scale theory; the critical batch size concept provides a principled way to reason about it", inference: 0.85}
  +> [Folklore Reliable]
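// Sketch (illustration only): why a small batch lets 'noise overwhelm
// the signal'. It assumes a one-dimensional gradient whose per-sample
// estimates are the true value plus independent Gaussian noise; the
// true gradient (0.1) and noise level (1.0) are made-up numbers. Under
// that assumption the signal-to-noise ratio of the batch mean grows as
// sqrt(batch size). This is a toy model, not McCandlish et al.'s
// noise-scale estimator.
/*
```python
import math
import random


def signal_to_noise(true_grad, noise_std, batch_size):
    """SNR of a minibatch-mean gradient: |g| / (sigma / sqrt(B))."""
    return abs(true_grad) * math.sqrt(batch_size) / noise_std


def wrong_sign_fraction(true_grad, noise_std, batch_size, trials, rng):
    """Fraction of minibatch estimates that point in the wrong direction."""
    wrong = 0
    for _ in range(trials):
        total = sum(rng.gauss(true_grad, noise_std) for _ in range(batch_size))
        if (total / batch_size) * true_grad <= 0:
            wrong += 1
    return wrong / trials


rng = random.Random(0)
fractions = {
    b: wrong_sign_fraction(0.1, 1.0, b, trials=2000, rng=rng)
    for b in (1, 16, 256)
}
for batch_size, fraction in fractions.items():
    print(batch_size, round(signal_to_noise(0.1, 1.0, batch_size), 2), fraction)
```
*/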


## Reward Engineering

<Reward Engineering Evidence>

(1) [Schulman Reward Mean]: Schulman warns that shifting reward mean
changes the agent's 'will to live' -- how long it wants to
survive -- thereby changing the problem. #observation
[Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ)
[evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L513-L519)
> for the rewards I'd recommend rescaling it but not shifting them
> because **that affects the agents will to live so if you shift the
> mean reward that'll affect whether how long it wants to survive
> you're actually changing the problem**
{reason: "Schulman explains the causal mechanism: reward mean shift changes the MDP's optimal policy, not just scaling. This is not obvious to beginners.", credence: 0.90}
(2) [Jones Reward Scale]: Jones identifies reward scaling as the single
most common issue for RL newbies, and warns against adaptive
reward scaling as extra nonstationarity. #observation
[Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html)
[evidence](evidence/andyljones_rl_debugging.md#L115-L119)
> The single most common issue for newbies writing custom RL
> implementations is that the targets arriving at their neural net
> aren't [-1, +1]. Actually, anything [-.1, +.1]ish to [-10, +10]ish
> is good. Having read that, you might be tempted to write some
> adaptive scheme to scale your rewards for you. **Don't: it's an extra
> bit of nonstationarity that'll make life more difficult. Just
> hand-scale, hand-clip the rewards** from your env so that the targets
> passed to your network are sensible.
{reason: "Jones independently converges on same advice as Schulman; labels it 'single most common issue'; explicitly warns against adaptive schemes", credence: 0.80}
(3) [Henderson Reward Scale]: Henderson et al. show that multiplying
rewards by a scalar causes significant performance differences
in DDPG, with inconsistent effects across environments. #observation
[Henderson et al. 2018 - Deep RL That Matters](https://arxiv.org/abs/1709.06560)
[evidence](evidence/henderson_2018_deep_rl_matters.md#L181)
> Reward rescaling has been used in several recent works (Duan et al.
> 2016; Gu et al. 2016) to improve results for DDPG. This involves
> simply multiplying the rewards generated from an environment by some
> scalar (rhat = r*sigma) for training. Often, these works report using a
> reward scale of sigma = 0.1. In Atari domains, this is akin to clipping
> the rewards to (0, 1). **By intuition, in gradient based methods (as
> used in most deep RL) a large and sparse output scale can result in
> problems regarding saturation and inefficiency in learning** (LeCun
> et al. 2012; Glorot and Bengio 2010; Vincent, de Brebisson, and
> Bouthillier 2015).
{reason: "peer-reviewed AAAI paper; provides gradient-based mechanism explaining WHY reward scale matters; cites 3 supporting references", credence: 0.85}
----
(4) [Reward Robust]: Reward scaling advice (hand-scale, don't shift
mean, target [-10, +10]) is well-supported across practitioner
experience, RL research, and controlled experiments.
{reason: "Schulman (causal mechanism), Jones (practitioner experience), Henderson (experimental validation) all converge; the 'will to live' explanation is especially compelling", inference: 0.85}
  +> [Folklore Reliable]
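// Sketch (illustration only): the 'will to live' effect from premise
// (1) in a toy setting. It assumes an undiscounted task that pays +1
// per step survived and compares a 100-step episode with a 10-step
// one; episode_return and prefers_survival are hypothetical helpers.
// Rescaling keeps the preference for the longer episode, while
// shifting the mean below zero reverses it.
/*
```python
def episode_return(rewards, scale=1.0, shift=0.0):
    """Undiscounted return after mapping each reward r to r * scale + shift."""
    return sum(r * scale + shift for r in rewards)


def prefers_survival(scale=1.0, shift=0.0):
    """True if surviving 100 steps yields more return than surviving 10."""
    long_episode = [1.0] * 100
    short_episode = [1.0] * 10
    return episode_return(long_episode, scale, shift) > episode_return(short_episode, scale, shift)


print(prefers_survival())             # True: original rewards
print(prefers_survival(scale=0.1))    # True: rescaled only
print(prefers_survival(shift=-2.0))   # False: every step now costs 1
```
*/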


## Reference Implementations

<Reference Impl Consensus>

(1) [Jones Ref Impl]: Jones calls writing RL from scratch 'the most
catastrophically self-sabotaging thing you can do' as a
newcomer. #observation
[Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html)
[evidence](evidence/andyljones_rl_debugging.md#L153-L165)
> **If you're new to reinforcement learning, writing things from scratch
> is the most catastrophically self-sabotaging thing you can do.** There
> is an alluring masochism in writing things from scratch. There's
> concrete value in it too: by writing things from scratch, you're both
> forced to fully understand what you're doing and you're more likely to
> come up with a fresh perspective. **In reinforcement learning, these
> benefits are not worth it.** At all. As discussed above, the nature
> of RL work makes it extremely hard for you to self-correct.
{reason: "Jones provides three graduated risk levels for using references (out-of-box, components, one-eye-on); concrete implementation lists (spinningup, stable-baselines3, cleanrl, OpenSpiel)", credence: 0.80}
(2) [Rahtz Ref]: Rahtz spent 8 months reproducing a Deep RL paper and
found that even small normalization bugs can hide for months,
supporting the case for starting from reference code. #observation
[Matthew Rahtz - Reproducing Deep RL](http://amid.fish/reproducing-deep-rl)
[evidence](evidence/amid_fish_reproducing_deep_rl.md#L36-L52)
> reinforcement learning turned out to be a lot trickier than expected.
> A big part of it is that right now, reinforcement learning is really
> sensitive. There are a lot of details to get just right, and **if you
> don't get them right, it can be difficult to diagnose where you've
> gone wrong.** After finishing the basic implementation, training runs
> just weren't succeeding... it turned out to be because of **problems
> with normalization of rewards and pixel data at a key stage.**
{reason: "first-person account of 2-month debugging session caused by normalization bug; vivid illustration of why references save time", credence: 0.75}
----
(3) [Ref Impl Robust]: Starting from reference implementations is
strongly supported by practitioner experience: the self-correction
mechanisms in RL are too weak for solo implementation.
{reason: "Jones's forceful advice is validated by Rahtz's 8-month experience; the underlying theory (RL's weak error signals) provides a causal explanation", inference: 0.85}
  +> [Folklore Reliable]


## Pursue Anomalies

<Anomaly Pursuit Evidence>

(1) [Jones Anomaly]: Jones recommends chasing anomalies immediately,
calling it 'one of the most powerful ways to debug'. #observation
[Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html)
[evidence](evidence/andyljones_rl_debugging.md#L101-L109)
> If you ever see a plot or a behaviour that just *seems weird*, chase
> right after it! Do not - do *not* - just 'hope it goes away'.
> **Chasing anomalies is one of the most powerful ways to debug your
> system**, because if you've noticed a problem without having had to go
> look for it, that means it's a *really big problem*. It's really
> tempting to think that the cool extra functionality you were planning
> to write today might just magically fix this anomalous behaviour.
> It won't. Give up on your plan for the day and chase the anomaly
> instead.
{reason: "strong practitioner endorsement; the causal reasoning (visible anomaly = big problem) is sound", credence: 0.80}
(2) [Rahtz Confusion]: Rahtz independently converges on the same
advice, calling it 'noticing confusion' -- following confusion
led to finding a normalization bug. #observation
[Matthew Rahtz - Reproducing Deep RL](http://amid.fish/reproducing-deep-rl)
[evidence](evidence/amid_fish_reproducing_deep_rl.md#L77-L102)
> A corollary is to **try and be as sensitive as possible in noticing
> confusion**. There were a lot of points in this project where the
> only clues came from noticing some small thing that didn't make sense.
> It was only by following that confusion and realising that taking the
> difference between frames zeroed out the background that gave the
> hint of a problem with normalization. Learn to **recognise what
> confusion *feels* like**... **commit yourself to always investigate
> whenever you notice confusion.**
{reason: "independent practitioner arriving at same principle through personal experience; the normalization bug was only found this way", credence: 0.75}
----
(3) [Anomaly Robust]: 'Pursue anomalies immediately' is supported by
two independent practitioners who both found it was the key
debugging strategy for hard-to-diagnose issues.
{reason: "Jones and Rahtz independently describe the same strategy with different language (anomalies vs confusion) and different stories but the same conclusion", inference: 0.85}
  +> [Folklore Reliable]


## Comprehensive Logging

<Logging Consensus>

(1) [Rahtz Log]: Rahtz recommends logging all metrics you can to
maximize diagnostic evidence per run. #observation
[Matthew Rahtz - Reproducing Deep RL](http://amid.fish/reproducing-deep-rl)
[evidence](evidence/amid_fish_reproducing_deep_rl.md#L197-L201)
> First, adopting an attitude of **log all the metrics you can** to
> maximise the amount of evidence you gather on each run. There are
> obvious metrics like training/validation accuracy, but it might also
> be worth spending a good chunk of time at the start of the project
> brainstorming and researching which other metrics might be important
> for diagnosing potential problems.
{reason: "learned from 8-month reproduction attempt; specifically regrets not logging policy entropy earlier", credence: 0.75}
(2) [Goodfellow Monitor]: Goodfellow et al. recommend visualizing
activation and gradient statistics collected over many
training iterations. #observation
[Goodfellow et al. Deep Learning Ch11](https://www.deeplearningbook.org/)
[evidence](evidence/goodfellow_ch11_practical_methodology.md#L238)
> **It is often useful to visualize statistics of neural network
> activations and gradients, collected over a large amount of training
> iterations.** The preactivation value of hidden units can tell us if
> the units saturate, or how often they do... it is useful to compare
> the magnitude of parameter gradients to the magnitude of the
> parameters themselves.
{reason: "canonical textbook; independent from Rahtz; specifies what to log (activations, gradients, parameter magnitudes)", credence: 0.90}
----
(3) [Logging Robust]: Comprehensive logging is unanimously recommended
across textbooks, courses, and practitioner accounts.
{reason: "Goodfellow (textbook), Rahtz (practitioner), and multiple other sources (FSDL, reddit threads) all emphasize logging as foundational; no dissenting voice found", inference: 0.90}
  +> [Folklore Reliable]
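// Sketch (illustration only): one metric premise (2) suggests logging,
// the size of the gradient step relative to the size of the parameters
// themselves. Parameters and gradients are plain dicts of float lists
// here, and update_to_param_ratios assumes a vanilla SGD step
// (lr * grad); both the helper name and the tensor names are
// hypothetical. What ratio counts as healthy is left to the reader.
/*
```python
import math


def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))


def update_to_param_ratios(params, grads, lr):
    """Per-tensor ratio ||lr * grad|| / ||param|| for a plain SGD step."""
    ratios = {}
    for name, values in params.items():
        param_norm = l2_norm(values)
        update_norm = lr * l2_norm(grads[name])
        # An all-zero tensor has no meaningful ratio; report infinity.
        ratios[name] = update_norm / param_norm if param_norm > 0 else float("inf")
    return ratios


params = {"layer1.weight": [3.0, 4.0], "layer1.bias": [0.5]}
grads = {"layer1.weight": [0.3, 0.4], "layer1.bias": [0.0]}
print(update_to_param_ratios(params, grads, lr=0.1))
```
*/
// Logged every few hundred steps, a ratio that collapses toward zero or
// jumps by orders of magnitude is the kind of anomaly worth chasing.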


## Random HP Search

<Random Search Evidence>

(1) [CS231n Random]: CS231n cites Bergstra and Bengio (2012) showing
random search is more efficient than grid search for
hyperparameter optimization. #observation
[CS231n Neural Networks 3](https://cs231n.github.io/neural-networks-3/)
[evidence](evidence/cs231n_neural_networks_3.md#L306-L312)
> **Prefer random search to grid search.** As argued by Bergstra and
> Bengio in Random Search for Hyper-Parameter Optimization, "randomly
> chosen trials are more efficient for hyper-parameter optimization
> than trials on a grid". It is very often the case that **some of the
> hyperparameters matter much more than others**. Performing random
> search rather than grid search allows you to much more precisely
> discover good values for the important ones.
{reason: "CS231n citing peer-reviewed JMLR paper (Bergstra & Bengio 2012); the intuition (some HPs matter more) is well-established", credence: 0.85}
(2) [Schulman Random]: Schulman endorses random sampling + human
regression as his preferred HP search method. #observation
[Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ)
[evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L995-L1013)
> favorite hyper parameter optimization framework... I just like to
> use the **uniform random sampling yeah that works really well I mean
> you just run a bunch of experiments with random hyper parameters and
> then you just look at the results the next day and do some regression
> to figure out which parameters actually mattered** and then you've
> run another experiment with better parameter ranges... I use the
> human version of it.
{reason: "Schulman's personal method; independent endorsement of random search from RL context (CS231n focuses on supervised)", credence: 0.85}
|
||
----
(3) [Random Search Robust]: Random HP search plus manual analysis is
supported by both empirical and theoretical results (Bergstra &
Bengio) and practitioner preference (Schulman).
{reason: "peer-reviewed paper provides empirical and theoretical justification; leading practitioner independently uses same method; the intuition (some HPs matter more than others) is universally recognized", inference: 0.85}
+> [Folklore Reliable]
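/*
Illustrative sketch, not part of the evidence: the loop Schulman describes (sample hyperparameters uniformly at random, inspect the results, then rerun with better ranges) could look roughly like the Python below. The search ranges, `toy_score` (a hypothetical stand-in for a real training run in which only the learning rate matters), and the "narrow to the span of the top trials" step (a crude substitute for the regression he mentions) are all assumptions made for this example.

```python
import random

# Assumed search ranges (hypothetical); rates are sampled in log10 space.
RANGES = {
    "log10_lr": (-5.0, -1.0),
    "log10_weight_decay": (-6.0, -2.0),
    "dropout": (0.0, 0.5),
}


def sample_config(rng, ranges):
    """Draw every hyperparameter independently and uniformly from its range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}


def toy_score(cfg, rng):
    """Hypothetical stand-in for a training run: only the learning rate matters."""
    return -(cfg["log10_lr"] + 3.0) ** 2 + rng.gauss(0.0, 0.1)


def random_search(n_trials, ranges=RANGES, seed=0):
    """Run n_trials random configurations and return (config, score) pairs."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        cfg = sample_config(rng, ranges)
        trials.append((cfg, toy_score(cfg, rng)))
    return trials


def narrowed_ranges(trials, top_k=20):
    """Span of each hyperparameter among the top_k highest-scoring trials."""
    best = sorted(trials, key=lambda t: t[1], reverse=True)[:top_k]
    return {
        name: (min(cfg[name] for cfg, _ in best), max(cfg[name] for cfg, _ in best))
        for name in best[0][0]
    }


def shrink_ratios(new_ranges, old_ranges=RANGES):
    """Width of the narrowed range relative to the original (small = mattered)."""
    return {
        name: (new_ranges[name][1] - new_ranges[name][0]) / (hi - lo)
        for name, (lo, hi) in old_ranges.items()
    }


trials = random_search(n_trials=200)
next_ranges = narrowed_ranges(trials)
ratios = shrink_ratios(next_ranges)
```

Hyperparameters whose shrink ratio stays near 1.0 probably did not matter over the sampled range; those with small ratios are the ones worth re-sampling more densely in the next round.
*/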


## Probe Environments

<Probe Env Evidence>

(1) [Jones Probes]: Jones describes a sequence of probe environments
that progressively isolate value network, backprop, reward
discounting, and policy errors. #observation
[Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html)
[evidence](evidence/andyljones_rl_debugging.md#L204-L221)
> Instead, construct environments that do localise errors. In a recent
> project, I used 1. **One action, zero observation, one timestep long,
> +1 reward every timestep**: This isolates the value network. 2. **One
> action, random +1/-1 observation, one timestep long, obs-dependent
> +1/-1 reward every time**: If my agent can learn the value in (1.) but
> not this one, it must be that backpropagation through my network is
> broken. 3. **One action, zero-then-one observation, two timesteps
> long, +1 reward at the end**: If my agent can learn the value in (2.)
> but not this one, it must be that my reward discounting is broken.
> You get the idea: (1.) is the simplest possible environment, and
> **each new env adds the smallest possible bit of functionality. If the
> old env works but the successor doesn't, that gives you a lot of
> information about where the problem is.**
{reason: "Jones is the only source with this detailed probe env methodology; but the technique is a direct application of the general 'test in isolation' principle (Goodfellow, CS231n) to RL specifically", credence: 0.80}
----
(2) [Probe Env Useful]: Probe environments are a practical application
of component isolation testing for RL, where standard envs
like CartPole don't localize errors.
{reason: "single source but the methodology is a rigorous instantiation of widely-supported isolation testing; each probe takes seconds, making it fast to verify", inference: 0.80}
+> [Folklore Reliable]
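/*
Illustrative sketch, not from Jones's post: the first three probes he lists can be written as tiny episode generators, each paired with the values a correct critic should converge to. The tabular TD(0) learner below is a hypothetical stand-in for the agent under test (in practice the real value network would be trained on each probe instead), and `GAMMA`, the learning rate, the episode count, and the tolerance are assumed values.

```python
import random

GAMMA = 0.9  # assumed discount factor


def probe_constant(rng):
    """Probe 1: zero observation, one step, +1 reward."""
    return [(0, 1.0)]


def probe_obs_dependent(rng):
    """Probe 2: random +1/-1 observation, one step, reward equals the observation."""
    obs = rng.choice([-1, 1])
    return [(obs, float(obs))]


def probe_delayed(rng):
    """Probe 3: observation 0 then 1, two steps, +1 reward only at the end."""
    return [(0, 0.0), (1, 1.0)]


# Values a correct critic should converge to, listed from simplest probe up.
EXPECTED = {
    probe_constant: {0: 1.0},
    probe_obs_dependent: {-1: -1.0, 1: 1.0},
    probe_delayed: {0: GAMMA, 1: 1.0},
}


def td0_values(probe, episodes=2000, lr=0.1, seed=0):
    """Tabular TD(0) value estimates; a stand-in for the critic being debugged."""
    rng = random.Random(seed)
    values = {}
    for _ in range(episodes):
        episode = probe(rng)
        for t, (obs, reward) in enumerate(episode):
            is_last = t == len(episode) - 1
            next_value = 0.0 if is_last else values.get(episode[t + 1][0], 0.0)
            old = values.get(obs, 0.0)
            values[obs] = old + lr * (reward + GAMMA * next_value - old)
    return values


def first_failing_probe(tol=1e-2):
    """Return the name of the first probe whose values are off, else None."""
    for probe, expected in EXPECTED.items():
        learned = td0_values(probe)
        for obs, target in expected.items():
            if abs(learned.get(obs, 0.0) - target) > tol:
                return probe.__name__
    return None
```

Because the probes run in order of increasing complexity, the first failure localizes the fault: a failure on `probe_delayed` alone would point at discounting, following the reasoning in the quote above.
*/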


## Policy Entropy and KL Diagnostics

<Entropy KL Evidence>

(1) [Schulman Entropy]: Schulman recommends monitoring policy entropy
carefully: dropping too fast means premature determinism,
not dropping means no learning. #observation
[Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ)
[evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L631-L652)
> look at the entropy really carefully if your entropy is going down too
> fast that means your policy is becoming deterministic and it's not
> going to explore anything... also if it's not going down your policy is
> never going to be that good because it's always really random... **you
> can sort of alleviate this issue by using an entropy bonus or a KL
> penalty** so by stopping yourself from changing the policy the
> probability distribution too fast as a side effect you also prevent
> the entropy from going down too fast... I also look at the KL as a
> diagnostic like look at how big of an update you're doing in terms of
> KL divergence
{reason: "Schulman designed PPO's clipped objective specifically to control KL; these diagnostics come from the algorithm author's direct practice", credence: 0.90}
----
(2) [Entropy KL Useful]: Policy entropy and KL divergence are
essential RL-specific diagnostics that detect exploration
failure and update instability.
{reason: "single source but from the algorithm designer; entropy/KL monitoring is now built into Stable-Baselines3, RLlib, and CleanRL as standard", inference: 0.85}
+> [Folklore Reliable]
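/*
Illustrative sketch, not from the talk: for a discrete action space, the two quantities Schulman says he watches can be computed directly from action probabilities. The thresholds in `diagnose` (`kl_limit`, `min_entropy_frac`) are hypothetical placeholders, not recommended values; suitable values depend on the algorithm and task.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kl_divergence(old_probs, new_probs):
    """KL(old || new) in nats; infinite if new drops an action old still takes."""
    total = 0.0
    for p, q in zip(old_probs, new_probs):
        if p == 0.0:
            continue
        if q == 0.0:
            return math.inf
        total += p * math.log(p / q)
    return total


def diagnose(old_probs, new_probs, kl_limit=0.05, min_entropy_frac=0.1):
    """Flag the two failure modes using assumed, illustrative thresholds."""
    h = entropy(new_probs)
    kl = kl_divergence(old_probs, new_probs)
    max_h = math.log(len(new_probs))  # entropy of the uniform policy
    warnings = []
    if kl > kl_limit:
        warnings.append("policy changed a lot in one update")
    if h < min_entropy_frac * max_h:
        warnings.append("policy is close to deterministic")
    return {"entropy": h, "kl": kl, "warnings": warnings}


# A uniform policy collapsing onto one action in a single update.
report = diagnose(old_probs=[0.25, 0.25, 0.25, 0.25],
                  new_probs=[0.985, 0.005, 0.005, 0.005])
```

Logging `entropy` and `kl` at every update, rather than checking a single pair of policies, would also expose the opposite failure Schulman mentions, where entropy never falls.
*/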


# Evidence Against

## Sources Are Dated

<Source Age Concern>

(1) [Dated Sources]: Most sources are from 2017-2018, before
transformers, RLHF, large-scale pretraining, and modern
frameworks became dominant. #assumption
{reason: "Schulman 2017, Jones 2021, Henderson 2018, Irpan 2018, CS231n ~2017; the RL landscape has shifted substantially since then (PPO is now standard, RLHF is a major use case, JAX/PyTorch 2.0 changed workflows)", credence: 0.65}
----
(2) [Age Limits]: Some folklore may not transfer to modern settings
(e.g., batch size advice may differ for LLM fine-tuning vs
classic RL; reward scaling is less relevant for RLHF).
{reason: "core debugging principles (test isolation, logging, seed variance) are architecture-agnostic and likely durable; specific HP defaults and RL diagnostics may need updating", inference: 0.40}
-> [Folklore Reliable]


## RL-Specific Focus

<RL Specificity Concern>

(1) [RL Heavy]: The SKILL is heavily weighted toward RL debugging,
with ~60% of content RL-specific (probe envs, reward scaling,
policy entropy, KL diagnostics). #assumption
{reason: "Parts 2 and 4 are RL-only; Part 1 is general but many examples are RL-flavored; limits applicability for users doing pure supervised learning or generative modeling", credence: 0.80}
----
(2) [Scope Limits]: The RL-heavy focus limits the SKILL's applicability,
but the general debugging principles (Parts 1, 3, 5) transfer
broadly.
{reason: "the RL focus is clearly labeled in the SKILL; general ML principles like normalization, isolation testing, and loss surface analysis are domain-agnostic", inference: 0.30}
-> [Folklore Reliable]