=== title: ML Debugging Folklore - Evidence Map author: ml_debug SKILL synthesis date: 2026-03-05 model: mode: strict === // This argdown maps claims from the ML Debugging Folklore SKILL.md // back to sourced quotes across 21 evidence files. Each claim // is traced to 2-3 independent sources with verbatim quotes. // // Credence guide: // 0.90 = canonical textbook / primary algorithm author // 0.85 = peer-reviewed paper / authoritative course // 0.80 = established practitioner blog, widely cited // 0.70 = popular blog / course notes // 0.60 = reddit thread / community consensus [Folklore Reliable]: ML debugging folklore -- practitioner heuristics transmitted via talks, blog posts, and course materials -- provides reliable, independently corroborated guidance for diagnosing and fixing ML training failures. + + + + + + + + + + - - # Section 1: General ML Debugging ## Normalize Inputs (1) [Schulman Normalize]: Schulman recommends standardizing all observations via running mean/std, clipping, and rescaling rewards without shifting the mean. #observation [Schulman 2016 Slides](http://joschu.net/docs/nuts-and-bolts.pdf) [evidence](evidence/joschu_nuts_and_bolts.md#L120-L125) > (cid:73) If observations have unknown range, standardize > (cid:73) Compute running estimate of mean and standard deviation > (cid:73) = clip((x −µ)/σ,−10,10) > **(cid:73) Rescale the rewards, but don't shift mean, as that affects agent's will to live** > (cid:73) Standardize prediction targets (e.g., value functions) the same way {reason: "Schulman is PPO/TRPO author; these slides are from his canonical 'Nuts and Bolts' talk, widely adopted in OpenAI baselines and stable-baselines3. Note: PDF conversion artifacts (cid:73 = bullet markers) present in evidence file.", credence: 0.90} (2) [FSDL Normalize]: FSDL Lecture 7 lists normalizing input data as a default step in the 'Start Simple' phase. #observation [FSDL Spring 2021 Lecture 7](https://fullstackdeeplearning.com/) [evidence](evidence/fsdl_spring2021_lecture7.md#L334-L338) > The next step is to **normalize the input data**, subtracting the mean > and dividing by the variance. Note that for images, it's fine to scale > values to 0-1 or -0.5 to 0.5 (for example, by dividing by 255). {reason: "FSDL is Josh Tobin's industry-focused course; independent from Schulman's RL lineage", credence: 0.80} (3) [Slavv Normalize]: Ivanov's '37 Reasons' checklist lists standardization as item #12 and preprocessing consistency as items #14-#15. #observation [Slavv 2017 - 37 Reasons](https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607) [evidence](evidence/slavv_37_reasons_nn.md#L122-L132) > **12. Standardize the features.** > Did you standardize your input to have zero mean and unit variance? > 13. Do you have too much data augmentation? > Augmentation has a regularizing effect. Too much of this combined with > other forms of regularization (weight L2, dropout, etc.) can cause > the net to underfit. > **14. Check the preprocessing of your pretrained model.** > If you are using a pretrained model, make sure you are using the same > normalization and preprocessing as the model was when training. For > example, should an image pixel be in the range (0, 1), (-1, 1) or (0, 255)? {reason: "popular debugging checklist (2017), independent practitioner; items 12-14 all address normalization/preprocessing", credence: 0.70} ---- (4) [Normalize Robust]: Normalizing inputs to mean=0, std=1 is a robustly supported heuristic across 3 independent lineages (RL research, industry courses, practitioner checklists). {reason: "three independent sources from different sub-communities all converge on the same recommendation; community adoption in major frameworks confirms", inference: 0.90} +> [Folklore Reliable] ## Overfit First / Test in Isolation (1) [CS231n Overfit]: CS231n lists overfitting a tiny subset as the most important sanity check before full training. #observation [CS231n Neural Networks 3](https://cs231n.github.io/neural-networks-3/) [evidence](evidence/cs231n_neural_networks_3.md#L87-L89) > **Overfit a tiny subset of data**. Lastly and most importantly, before > training on the full dataset try to train on a tiny portion (e.g. 20 > examples) of your data and make sure you can achieve zero cost. For this > experiment it's also best to set regularization to zero, otherwise this > can prevent you from getting zero cost. **Unless you pass this sanity > check with a small dataset it is not worth proceeding to the full > dataset.** {reason: "CS231n (Karpathy/Li/Johnson) is the canonical deep learning course; 'most importantly' framing shows high confidence in this heuristic", credence: 0.85} (2) [FSDL Overfit]: FSDL positions single-batch overfitting as the step immediately after getting the model to run. #observation [FSDL Spring 2021 Lecture 7](https://fullstackdeeplearning.com/) [evidence](evidence/fsdl_spring2021_lecture7.md#L433-L451) > After getting your model to run, the next thing you need to do is to > **overfit a single batch of data**. This is a heuristic that can catch > an absurd number of bugs. This really means that you want to drive your > training error arbitrarily close to 0. There are a few things that can > happen when you try to overfit a single batch and it fails: > **Error goes up**: Commonly, this is due to a flip sign somewhere in > the loss function/gradient. **Error explodes**: This is usually a > numerical issue but can also be caused by a high learning rate. {reason: "FSDL (Josh Tobin) independent from CS231n; 'absurd number of bugs' is strong practitioner endorsement", credence: 0.80} (3) [Goodfellow Overfit]: Goodfellow et al. state that inability to fit a single example indicates a software defect. #observation [Goodfellow et al. Deep Learning Ch11](https://www.deeplearningbook.org/) [evidence](evidence/goodfellow_ch11_practical_methodology.md#L218) > Fit a tiny dataset: If you have high error on the training set, > determine whether it is due to genuine underfitting or due to a > software defect. **Usually even small models can be guaranteed to be > able fit a sufficiently small dataset.** For example, a classification > dataset with only one example can be fit just by setting the biases of > the output layer correctly. Usually if you cannot train a classifier to > correctly label a single example... **there is a software defect > preventing successful optimization on the training set.** {reason: "Goodfellow/Bengio/Courville textbook is the canonical deep learning reference; independent from CS231n and FSDL", credence: 0.90} ---- (4) [Overfit First Robust]: 'Overfit a tiny dataset as sanity check' is robustly supported across 3 independent authoritative sources. {reason: "canonical textbook + two major courses all prescribe the same test; consistent across supervised and RL contexts", inference: 0.90} +> [Folklore Reliable] ## Assume You Have a Bug (1) [Jones Bug]: Andy Jones argues that RL practitioners are reluctant to admit bugs, but bugs are the most common cause of failure. #observation [Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html) [evidence](evidence/andyljones_rl_debugging.md#L174-L182) > When their RL implementation doesn't work, people are often keen to > either (a) adjust their network architecture or (b) adjust their > hyperparameters. On the other hand, they're reluctant to say they've > got a bug. **Most often, it turns out they've got a bug.** Why bugs > are so much more common in RL code is discussed above, but there's > another advantage to assuming you've got a bug: bugs are a damn sight > faster to find and fix than validating that your new architecture is > an improvement over the old one. {reason: "Jones is an experienced RL practitioner; this blog post is widely cited in the RL community and is the primary source for structured RL debugging advice", credence: 0.80} (2) [Goodfellow Bug Mask]: Goodfellow et al. warn that neural net components can adapt to compensate for bugs, masking them. #observation [Goodfellow et al. Deep Learning Ch11](https://www.deeplearningbook.org/) [evidence](evidence/goodfellow_ch11_practical_methodology.md#L194-L206) > When a machine learning system performs poorly, it is usually difficult > to tell whether the poor performance is intrinsic to the algorithm > itself or whether there is a bug in the implementation of the > algorithm. **If one part is broken, the other parts can adapt and > still achieve roughly acceptable performance.** The bug may not be > apparent just from examining the output of the model though. Depending > on the distribution of the input, the weights may be able to adapt to > compensate for the negative biases. {reason: "canonical textbook; the 'adaptive compensation' mechanism explains why ML bugs are uniquely hard to detect", credence: 0.90} (3) [Jones Loss Herring]: Jones explicitly warns that loss curves don't localize errors and are therefore a red herring for debugging. #observation [Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html) [evidence](evidence/andyljones_rl_debugging.md#L184-L188) > When someone's RL implementation isn't working, they *luuuuuurv* to > copy-paste a screenshot of their loss curve to you. The problem with > using the loss curve as an indicator of correctness is somewhat that > it's not reliable, but mostly because **it doesn't localise errors.** > The shape of your loss curve says very little about where in your code > you've messed up, and so says very little about what you need to > change to get things working. {reason: "same source as (1) but independent observation about loss curves specifically; consistent with Goodfellow's point about adaptive masking", credence: 0.80} ---- (4) [Bug First Robust]: 'Assume you have a bug' is well-supported: bugs are common, adaptive compensation masks them, and loss curves don't help localize them. {reason: "Jones (practitioner) and Goodfellow (textbook) independently describe the same mechanism: ML systems mask bugs through adaptation. Loss curves give false comfort.", inference: 0.85} +> [Folklore Reliable] # Section 2: RL-Specific Debugging ## Seed Variance (1) [Schulman Seeds]: Schulman demonstrates that 3 seemingly different algorithms on 7 MuJoCo tasks were actually the same algorithm with different random seeds. #observation [Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ) [evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L377-L412) > you can see seven different tasks these are the Jim Moo Joko tasks > like half cheetah and hopper and so on and you have three different > algorithms here the red one the green one and the blue one... but as > it turns out **these are all the exact same algorithms and just random > seeds different random seeds** so it's easy to imagine that you're > just looking at one of these problems then you see that blue curve and > you think you get really excited than you think you found some huge > improvement to your algorithm but it's really that you just got a > lucky seed... **even if you had like 20 seeds here there's a still a > pretty big error bar** {reason: "Schulman (PPO author) showing his own data; this demonstration is one of the most cited examples of RL noise in the community", credence: 0.90} (2) [Henderson Seeds]: Henderson et al. show that same hyperparameters with different seeds produce statistically different learning curves on standard benchmarks. #observation [Henderson et al. 2018 - Deep RL That Matters](https://arxiv.org/abs/1709.06560) [evidence](evidence/henderson_2018_deep_rl_matters.md#L233-L235) > We perform 10 experiment trials, for the same hyperparameter > configuration, only varying the random seed across all 10 trials. We > then split the trials into two sets of 5 and average these two > groupings together. As shown in Figure 5, we find that **the > performance of algorithms can be drastically different.** We > demonstrate that the variance between runs is enough to create > statistically different distributions just from varying random seeds. > Our experiment with random seeds shows that this can be potentially > misleading. {reason: "peer-reviewed AAAI 2018 paper; systematic experimental study with statistical testing (t-test results reported); independent from Schulman's demo", credence: 0.85} (3) [Irpan Seeds]: Irpan reports a 30% failure rate on Pendulum across 10 seeds with identical hyperparameters, and notes that this would be considered a bug in supervised learning. #observation [Alex Irpan - RL Hard](https://www.alexirpan.com/2018/02/14/rl-hard.html) [evidence](evidence/alexirpan_rl_hard.md#L651-L678) > Here is a plot of performance, after I fixed all the bugs. Each line > is the reward curve from one of 10 independent runs. Same > hyperparameters, the only difference is the random seed. **Seven of > these runs worked. Three of these runs didn't. A 30% failure rate > counts as working.** Look, there's variance in supervised learning > too, but it's rarely this bad. If my supervised learning code failed > to beat random chance 30% of the time, I'd have super high confidence > there was a bug in data loading or training. **If my reinforcement > learning code does no better than random, I have no idea if it's a > bug, if my hyperparameters are bad, or if I simply got unlucky.** {reason: "Google Brain engineer's direct experience; the SL vs RL comparison makes the point vivid; independent from Schulman and Henderson", credence: 0.80} ---- (4) [Seed Variance Robust]: RL seed variance is extreme -- same algo with different seeds can look like different algorithms. This is robustly demonstrated across 3 independent sources with quantitative evidence. {reason: "primary algorithm author (Schulman), peer-reviewed study (Henderson), and independent practitioner (Irpan) all demonstrate the same effect with data; the 30% failure rate on Pendulum is a striking data point", inference: 0.90} +> [Folklore Reliable] ## Batch Size (1) [Schulman Batch]: Schulman warns that batch sizes too small cause noise to overwhelm signal, citing his own TRPO debugging experience needing 100K timesteps per batch. #observation [Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ) [evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L279-L307) > sometimes you should use more samples than you think you're going > to need because usually things just work better when you have more > samples almost always... if you want to just get something working at > all often you need to use bigger batch sizes and you thought because > **if your batch size is too small than the noise will overwhelm the > signal and you won't learn anything** {reason: "Schulman's personal debugging story: TRPO needed 100K timesteps per batch (documented in slides). Direct experience from algorithm author.", credence: 0.90} (2) [Schulman Batch Slides]: Schulman's slides give specific batch size numbers for TRPO and DQN on Atari. #observation [Schulman 2016 Slides](http://joschu.net/docs/nuts-and-bolts.pdf) [evidence](evidence/joschu_nuts_and_bolts.md#L61-L72) > Run with More Samples Than Expected. **Early in tuning process, may > need huge number of samples.** Don't be deterred by published work. > Examples: TRPO on Atari: 100K timesteps per batch for KL=0.01. > DQN on Atari: update freq=10K, replay buffer size=1M. {reason: "same author as (1) but written slides with specific numbers; corroborates the talk", credence: 0.90} (3) [McCandlish Critical Batch]: McCandlish et al. derive a critical batch size that predicts speed/efficiency tradeoffs, finding it grows during training as gradients shrink. #observation [McCandlish et al. 2018](https://arxiv.org/abs/1812.06162) [evidence](evidence/mccandlish_2018_large_batch.md#L180-L196) > Equation 2.7 nevertheless predicts the dependence of training speed > on batch size remarkably well, even for full training runs that range > over many points in the loss landscape. **By averaging Equation 2.7 > over multiple optimization steps, we find a simple relationship > between training speed and data efficiency.** Here, S and Smin > represent the actual and minimum possible number of steps taken to > reach a specified level of performance, respectively. {reason: "OpenAI paper providing theoretical foundation for batch size effects; peer-reviewed; explains WHY Schulman's observation holds", credence: 0.85} ---- (4) [Batch Size Robust]: 'Use bigger batches than you think' is supported by both practitioner experience and theoretical analysis. {reason: "Schulman's empirical observation is explained by McCandlish's noise scale theory; the critical batch size concept provides a principled way to reason about it", inference: 0.85} +> [Folklore Reliable] ## Reward Engineering (1) [Schulman Reward Mean]: Schulman warns that shifting reward mean changes the agent's 'will to live' -- how long it wants to survive -- thereby changing the problem. #observation [Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ) [evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L513-L519) > for the rewards I'd recommend rescaling it but not shifting them > because **that affects the agents will to live so if you shift the > mean reward that'll affect whether how long it wants to survive > you're actually changing the problem** {reason: "Schulman explains the causal mechanism: reward mean shift changes the MDP's optimal policy, not just scaling. This is not obvious to beginners.", credence: 0.90} (2) [Jones Reward Scale]: Jones identifies reward scaling as the single most common issue for RL newbies, and warns against adaptive reward scaling as extra nonstationarity. #observation [Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html) [evidence](evidence/andyljones_rl_debugging.md#L115-L119) > The single most common issue for newbies writing custom RL > implementations is that the targets arriving at their neural net > aren't [-1, +1]. Actually, anything [-.1, +.1]ish to [-10, +10]ish > is good. Having read that, you might be tempted to write some > adaptive scheme to scale your rewards for you. **Don't: it's an extra > bit of nonstationarity that'll make life more difficult. Just > hand-scale, hand-clip the rewards** from your env so that the targets > passed to your network are sensible. {reason: "Jones independently converges on same advice as Schulman; labels it 'single most common issue'; explicitly warns against adaptive schemes", credence: 0.80} (3) [Henderson Reward Scale]: Henderson et al. show that multiplying rewards by a scalar causes significant performance differences in DDPG, with inconsistent effects across environments. #observation [Henderson et al. 2018 - Deep RL That Matters](https://arxiv.org/abs/1709.06560) [evidence](evidence/henderson_2018_deep_rl_matters.md#L181) > Reward rescaling has been used in several recent works (Duan et al . > 2016; Gu et al . 2016) to improve results for DDPG. This involves > simply multiplying the rewards gen-erated from an environment by some > scalar ( rhat = r*sigma ) for training. Often, these works report using a > reward scale of sigma = 0 .1. In Atari domains, this is akin to clipping > the rewards to (0 , 1) . **By intuition, in gradient based methods (as > used in most deep RL) a large and sparse output scale can result in > problems regarding saturation and inefficiency in learning** (LeCun > et al . 2012; Glorot and Bengio 2010; Vincent, de Brebisson, and > Bouthillier 2015). {reason: "peer-reviewed AAAI paper; provides gradient-based mechanism explaining WHY reward scale matters; cites 3 supporting references", credence: 0.85} ---- (4) [Reward Robust]: Reward scaling advice (hand-scale, don't shift mean, target [-10, +10]) is well-supported across practitioner experience, RL research, and controlled experiments. {reason: "Schulman (causal mechanism), Jones (practitioner experience), Henderson (experimental validation) all converge; the 'will to live' explanation is especially compelling", inference: 0.85} +> [Folklore Reliable] ## Reference Implementations (1) [Jones Ref Impl]: Jones calls writing RL from scratch 'the most catastrophically self-sabotaging thing you can do' as a newcomer. #observation [Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html) [evidence](evidence/andyljones_rl_debugging.md#L153-L165) > **If you're new to reinforcement learning, writing things from scratch > is the most catastrophically self-sabotaging thing you can do.** There > is an alluring masochism in writing things from scratch. There's > concrete value in it too: by writing things from scratch, you're both > forced to fully understand what you're doing and you're more likely to > come up with a fresh perspective. **In reinforcement learning, these > benefits are not worth it.** At all. As discussed above, the nature > of RL work makes it extremely hard for you to self-correct. {reason: "Jones provides three graduated risk levels for using references (out-of-box, components, one-eye-on); concrete implementation lists (spinningup, stable-baselines3, cleanrl, OpenSpiel)", credence: 0.80} (2) [Rahtz Ref]: Rahtz spent 8 months reproducing a Deep RL paper and found that even small normalization bugs can hide for months, supporting the case for starting from reference code. #observation [Matthew Rahtz - Reproducing Deep RL](http://amid.fish/reproducing-deep-rl) [evidence](evidence/amid_fish_reproducing_deep_rl.md#L36-L52) > reinforcement learning turned out to be a lot trickier than expected. > A big part of it is that right now, reinforcement learning is really > sensitive. There are a lot of details to get just right, and **if you > don't get them right, it can be difficult to diagnose where you've > gone wrong.** After finishing the basic implementation, training runs > just weren't succeeding... it turned out to be because of **problems > with normalization of rewards and pixel data at a key stage.** {reason: "first-person account of 2-month debugging session caused by normalization bug; vivid illustration of why references save time", credence: 0.75} ---- (3) [Ref Impl Robust]: Starting from reference implementations is strongly supported by practitioner experience: the self-correction mechanisms in RL are too weak for solo implementation. {reason: "Jones's forceful advice is validated by Rahtz's 8-month experience; the underlying theory (RL's weak error signals) provides a causal explanation", inference: 0.85} +> [Folklore Reliable] ## Pursue Anomalies (1) [Jones Anomaly]: Jones recommends chasing anomalies immediately, calling it 'one of the most powerful ways to debug'. #observation [Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html) [evidence](evidence/andyljones_rl_debugging.md#L101-L109) > If you ever see a plot or a behaviour that just *seems weird*, chase > right after it! Do not - do *not* - just 'hope it goes away'. > **Chasing anomalies is one of the most powerful ways to debug your > system**, because if you've noticed a problem without having had to go > look for it, that means it's a *really big problem*. It's really > tempting to think that the cool extra functionality you were planning > to write today might just magically fix this anomalous behaviour. > It won't. Give up on your plan for the day and chase the anomaly > instead. {reason: "strong practitioner endorsement; the causal reasoning (visible anomaly = big problem) is sound", credence: 0.80} (2) [Rahtz Confusion]: Rahtz independently converges on the same advice, calling it 'noticing confusion' -- following confusion led to finding a normalization bug. #observation [Matthew Rahtz - Reproducing Deep RL](http://amid.fish/reproducing-deep-rl) [evidence](evidence/amid_fish_reproducing_deep_rl.md#L77-L102) > A corollary is to **try and be as sensitive as possible in noticing > confusion**. There were a lot of points in this project where the > only clues came from noticing some small thing that didn't make sense. > It was only by following that confusion and realising that taking the > difference between frames zeroed out the background that gave the > hint of a problem with normalization. Learn to **recognise what > confusion *feels* like**... **commit yourself to always investigate > whenever you notice confusion.** {reason: "independent practitioner arriving at same principle through personal experience; the normalization bug was only found this way", credence: 0.75} ---- (3) [Anomaly Robust]: 'Pursue anomalies immediately' is supported by two independent practitioners who both found it was the key debugging strategy for hard-to-diagnose issues. {reason: "Jones and Rahtz independently describe the same strategy with different language (anomalies vs confusion) and different stories but the same conclusion", inference: 0.85} +> [Folklore Reliable] ## Comprehensive Logging (1) [Rahtz Log]: Rahtz recommends logging all metrics you can to maximize diagnostic evidence per run. #observation [Matthew Rahtz - Reproducing Deep RL](http://amid.fish/reproducing-deep-rl) [evidence](evidence/amid_fish_reproducing_deep_rl.md#L197-L201) > First, adopting an attitude of **log all the metrics you can** to > maximise the amount of evidence you gather on each run. There are > obvious metrics like training/validation accuracy, but it might also > be worth spending a good chunk of time at the start of the project > brainstorming and researching which other metrics might be important > for diagnosing potential problems. {reason: "learned from 8-month reproduction attempt; specifically regrets not logging policy entropy earlier", credence: 0.75} (2) [Goodfellow Monitor]: Goodfellow et al. recommend visualizing activation and gradient statistics collected over many training iterations. #observation [Goodfellow et al. Deep Learning Ch11](https://www.deeplearningbook.org/) [evidence](evidence/goodfellow_ch11_practical_methodology.md#L238) > **It is often useful to visualize statistics of neural network > activations and gradients, collected over a large amount of training > iterations.** The preactivation value of hidden units can tell us if > the units saturate, or how often they do... it is useful to compare > the magnitude of parameter gradients to the magnitude of the > parameters themselves. {reason: "canonical textbook; independent from Rahtz; specifies what to log (activations, gradients, parameter magnitudes)", credence: 0.90} ---- (3) [Logging Robust]: Comprehensive logging is unanimously recommended across textbooks, courses, and practitioner accounts. {reason: "Goodfellow (textbook), Rahtz (practitioner), and multiple other sources (FSDL, reddit threads) all emphasize logging as foundational; no dissenting voice found", inference: 0.90} +> [Folklore Reliable] ## Random HP Search (1) [CS231n Random]: CS231n cites Bergstra and Bengio (2012) showing random search is more efficient than grid search for hyperparameter optimization. #observation [CS231n Neural Networks 3](https://cs231n.github.io/neural-networks-3/) [evidence](evidence/cs231n_neural_networks_3.md#L306-L312) > **Prefer random search to grid search.** As argued by Bergstra and > Bengio in Random Search for Hyper-Parameter Optimization, "randomly > chosen trials are more efficient for hyper-parameter optimization > than trials on a grid". It is very often the case that **some of the > hyperparameters matter much more than others**. Performing random > search rather than grid search allows you to much more precisely > discover good values for the important ones. {reason: "CS231n citing peer-reviewed JMLR paper (Bergstra & Bengio 2012); the intuition (some HPs matter more) is well-established", credence: 0.85} (2) [Schulman Random]: Schulman endorses random sampling + human regression as his preferred HP search method. #observation [Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ) [evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L995-L1013) > favorite hyper parameter optimization framework... I just like to > use the **uniform random sampling yeah that works really well I mean > you just run a bunch of experiments with random hyper parameters and > then you just look at the results the next day and do some regression > to figure out which parameters actually mattered** and then you've > run another experiment with better parameter ranges... I use the > human version of it. {reason: "Schulman's personal method; independent endorsement of random search from RL context (CS231n focuses on supervised)", credence: 0.85} ---- (3) [Random Search Robust]: Random HP search + manual analysis is supported by both theory (Bergstra & Bengio) and practitioner preference (Schulman). {reason: "peer-reviewed paper provides theoretical justification; leading practitioner independently uses same method; the intuition (some HPs matter more than others) is universally recognized", inference: 0.85} +> [Folklore Reliable] ## Probe Environments (1) [Jones Probes]: Jones describes a sequence of probe environments that progressively isolate value network, backprop, reward discounting, and policy errors. #observation [Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html) [evidence](evidence/andyljones_rl_debugging.md#L204-L221) > Instead, construct environments that do localise errors. In a recent > project, I used 1. **One action, zero observation, one timestep long, > +1 reward every timestep**: This isolates the value network. 2. **One > action, random +1/-1 observation, one timestep long, obs-dependent > +1/-1 reward every time**: If my agent can learn the value in (1.) but > not this one, it must be that backpropagation through my network is > broken. 3. **One action, zero-then-one observation, two timesteps > long, +1 reward at the end**: If my agent can learn the value in (2.) > but not this one, it must be that my reward discounting is broken. > You get the idea: (1.) is the simplest possible environment, and > **each new env adds the smallest possible bit of functionality. If the > old env works but the successor doesn't, that gives you a lot of > information about where the problem is.** {reason: "Jones is the only source with this detailed probe env methodology; but the technique is a direct application of the general 'test in isolation' principle (Goodfellow, CS231n) to RL specifically", credence: 0.80} ---- (2) [Probe Env Useful]: Probe environments are a practical application of component isolation testing for RL, where standard envs like CartPole don't localize errors. {reason: "single source but the methodology is a rigorous instantiation of widely-supported isolation testing; each probe takes seconds, making it fast to verify", inference: 0.80} +> [Folklore Reliable] ## Policy Entropy and KL Diagnostics (1) [Schulman Entropy]: Schulman recommends monitoring policy entropy carefully: dropping too fast means premature determinism, not dropping means no learning. #observation [Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ) [evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L631-L652) > look at the entropy really carefully if your entropy is going down too > fast that means your policy is becoming deterministic and it's not > going to explore anything... also if it's not going down your policy is > never going to be that good because it's always really random... **you > can sort of alleviate this issue by using an entropy bonus or a KL > penalty** so by stopping yourself from changing the policy the > probability distribution too fast as a side effect you also prevent > the entropy from going down too fast... I also look at the KL as a > diagnostic like look at how big of an update you're doing in terms of > KL divergence {reason: "Schulman designed PPO's clipped objective specifically to control KL; these diagnostics come from the algorithm author's direct practice", credence: 0.90} ---- (2) [Entropy KL Useful]: Policy entropy and KL divergence are essential RL-specific diagnostics that detect exploration failure and update instability. {reason: "single source but from the algorithm designer; entropy/KL monitoring is now built into stable-baselines3, RLlib, and cleanrl as standard", inference: 0.85} +> [Folklore Reliable] # Evidence Against ## Sources Are Dated (1) [Dated Sources]: Most sources are from 2017-2018, before transformers, RLHF, large-scale pretraining, and modern frameworks became dominant. #assumption {reason: "Schulman 2017, Jones 2021, Henderson 2018, Irpan 2018, CS231n ~2017; the RL landscape has shifted substantially since then (PPO is now standard, RLHF is a major use case, JAX/PyTorch 2.0 changed workflows)", credence: 0.65} ---- (2) [Age Limits]: Some folklore may not transfer to modern settings (e.g., batch size advice may differ for LLM fine-tuning vs classic RL; reward scaling is less relevant for RLHF). {reason: "core debugging principles (test isolation, logging, seed variance) are architecture-agnostic and likely durable; specific HP defaults and RL diagnostics may need updating", inference: 0.40} -> [Folklore Reliable] ## RL-Specific Focus (1) [RL Heavy]: The SKILL is heavily weighted toward RL debugging, with ~60% of content RL-specific (probe envs, reward scaling, policy entropy, KL diagnostics). #assumption {reason: "Parts 2 and 4 are RL-only; Part 1 is general but many examples are RL-flavored; limits applicability for users doing pure supervised learning or generative modeling", credence: 0.80} ---- (2) [Scope Limits]: RL-heavy focus limits the SKILL's applicability but the general debugging principles (Parts 1, 3, 5) transfer broadly. {reason: "the RL focus is clearly labeled in the SKILL; general ML principles like normalization, isolation testing, and loss surface analysis are domain-agnostic", inference: 0.30} -> [Folklore Reliable]