From 4393cceefd51865b6ccb2bb20dff4cbaa2b06c8a Mon Sep 17 00:00:00 2001 From: wassname <1103714+wassname@users.noreply.github.com> Date: Fri, 6 Mar 2026 10:11:30 +0800 Subject: [PATCH] initial: ML debugging folklore skill Deep research to uplift LLMs for ML debugging, opinionated by source selection. Distilled from Schulman, Jones, Rahtz, Goodfellow, CS231n, FSDL, and more. Includes runnable diagnostic scripts and LLM-specific anti-patterns. Author: wassname (https://github.com/wassname) --- .gitignore | 10 + SKILL.md | 1083 ++++++++++ docs/dlbooks | 1 + ...olts of Deep RL Experimentation [Do(1).txt | 1008 +++++++++ ...d Bolts of Deep RL Experimentation [Do.txt | 1008 +++++++++ docs/evidence/alexirpan_rl_hard.md | 1103 ++++++++++ .../evidence/amid_fish_reproducing_deep_rl.md | 679 ++++++ docs/evidence/andyljones_rl_debugging.md | 399 ++++ docs/evidence/cs229_ml_advice.md | 719 +++++++ docs/evidence/cs231n_neural_networks_3.md | 353 +++ docs/evidence/fsdl_spring2021_lecture7.md | 786 +++++++ .../henderson_2018_deep_rl_matters.md | 1906 +++++++++++++++++ docs/evidence/joschu_nuts_and_bolts.md | 199 ++ docs/evidence/mccandlish_2018_large_batch.md | 660 ++++++ .../reddit_deeprl_bootcamp_2017_75m5vd.md | 15 + .../reddit_icml2017_tutorial_levine_6vcvu1.md | 18 + .../reddit_rl_debugging_tips_9sh77q.md | 78 + .../reddit_rl_practical_tips_7s8px9.md | 110 + docs/evidence/reddit_rl_roadblocks_bzg3l2.md | 197 ++ .../reddit_schulman_nuts_bolts_5hereu.md | 12 + ...ts_bolts_deeprl_bootcamp_2017_subtitles.md | 1014 +++++++++ docs/evidence/slavv_37_reasons_nn.md | 272 +++ docs/evidence/williamfalcon_deeprl_hacks.md | 220 ++ docs/ml_debug_folklore.argdown | 586 +++++ docs/ml_debug_folklore_log.md | 76 + 25 files changed, 12512 insertions(+) create mode 100644 .gitignore create mode 100644 SKILL.md create mode 120000 docs/dlbooks create mode 100644 docs/evidence/[English (auto-generated)] Deep RL Bootcamp Lecture 6 Nuts and Bolts of Deep RL Experimentation [Do(1).txt create mode 100644 docs/evidence/[English (auto-generated)] Deep RL Bootcamp Lecture 6 Nuts and Bolts of Deep RL Experimentation [Do.txt create mode 100644 docs/evidence/alexirpan_rl_hard.md create mode 100644 docs/evidence/amid_fish_reproducing_deep_rl.md create mode 100644 docs/evidence/andyljones_rl_debugging.md create mode 100644 docs/evidence/cs229_ml_advice.md create mode 100644 docs/evidence/cs231n_neural_networks_3.md create mode 100644 docs/evidence/fsdl_spring2021_lecture7.md create mode 100644 docs/evidence/henderson_2018_deep_rl_matters.md create mode 100644 docs/evidence/joschu_nuts_and_bolts.md create mode 100644 docs/evidence/mccandlish_2018_large_batch.md create mode 100644 docs/evidence/reddit_deeprl_bootcamp_2017_75m5vd.md create mode 100644 docs/evidence/reddit_icml2017_tutorial_levine_6vcvu1.md create mode 100644 docs/evidence/reddit_rl_debugging_tips_9sh77q.md create mode 100644 docs/evidence/reddit_rl_practical_tips_7s8px9.md create mode 100644 docs/evidence/reddit_rl_roadblocks_bzg3l2.md create mode 100644 docs/evidence/reddit_schulman_nuts_bolts_5hereu.md create mode 100644 docs/evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md create mode 100644 docs/evidence/slavv_37_reasons_nn.md create mode 100644 docs/evidence/williamfalcon_deeprl_hacks.md create mode 100644 docs/ml_debug_folklore.argdown create mode 100644 docs/ml_debug_folklore_log.md diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..82ccded --- /dev/null +++ b/.gitignore @@ -0,0 +1,10 @@ +# Personal notes +docs/wassname.md + +# Internal dev specs (implementation notes, not useful publicly) +docs/spec/ + +# Book chapters (copyright, not redistributable) +docs/evidence/goodfellow_ch11_practical_methodology.md +docs/evidence/goodfellow_ch15_representation_learning.md +docs/evidence/deeplearning_book.md diff --git a/SKILL.md b/SKILL.md new file mode 100644 index 0000000..0ecabc8 --- /dev/null +++ b/SKILL.md @@ -0,0 +1,1083 @@ +--- +name: ml-debugging +description: "Practical folklore for debugging ML systems: convergence issues, loss surface analysis, gradient analysis, sweep methodology, and same-seed comparisons. Use when stuck on training, designing sweeps, or analyzing experiment results." +--- + +# ML Debugging Folklore + +Practitioner knowledge that's hard to find in papers. Distilled from Schulman's "Nuts and Bolts" talk, Andy Jones' debugging guide, r/reinforcementlearning threads, competition write-ups, and personal experience. Most multi-source claims are traced to sourced quotes in [docs/ml_debug_folklore.argdown](docs/ml_debug_folklore.argdown) (vargdown format); uncovered claims are listed in the [process log](docs/ml_debug_folklore_log.md). + +The core problem: in ML (especially RL), errors aren't local [Goodfellow Ch11]. Information flows in loops, so a numerical bug in one spot gets smeared through the whole system in seconds. From outside, everything goes weird at once -- loss explodes, KL collapses, rewards oscillate. You can tell something's wrong but not *what* or *where* [Jones 2021]. + +**When debugging, work in this order:** +1. Run static analysis (grep for silent bugs) -- Part 6.1 +2. Run diagnostics (data check, init loss, overfit-one-batch) -- Part 6.2 +3. Follow the triage decision tree -- Part 6.3 +4. Use mental models to brainstorm hypotheses -- Part 7 +5. Only then read Parts 1-5 for deeper understanding of specific issues + +--- + +## Part 1: General ML Debugging + +### The hierarchy (work in order, don't skip to hyperparameters) + +**Step 1: Verify components in isolation.** [Goodfellow Ch11, CS229] +Most bugs are "doing the wrong calculation." Test each piece independently. + +- Network forward pass: feed known inputs, check output shapes and ranges. `assert` shapes everywhere -- `(None,)` vs `(None, 1)` silently broadcasts into `(None, None)`. +- Loss computation: hand-compute a few targets and compare to code output. +- Data pipeline: sample a batch, print it, eyeball it. Are labels aligned with inputs? Are transforms applied correctly? +- Preprocessing: look at your processed inputs as a human. Can *you* solve the task from them? If you downsampled images, can you still tell what's going on? + +**Five most common deep learning bugs** [FSDL]: (1) incorrect tensor shapes that fail silently via broadcasting, (2) preprocessing inputs incorrectly (wrong normalization, over-augmentation), (3) incorrect loss function or wrong sign in loss/gradient, (4) forgot to set up train vs eval mode (dropout/batchnorm behave differently), (5) numerical instability (NaN from log(0), overflow, vanishing grads). + +**Step 2: Get signs of life on a toy problem. Work the baseline ladder.** [CS231n, FSDL, Goodfellow Ch11] +Before your real task, solve something trivial with the same codebase. This establishes what "healthy" looks like. Run on CartPole (or equivalent) and log the same curves so you know what healthy learning looks like for your setup [reddit]. If it works on the toy but not your real task, the gap is usually scale/normalization, not fundamental correctness. + +Also try to overfit to train. If you can't do that, you likely won't be able to generalise. [CS231n: "Unless you pass this sanity check with a small dataset it is not worth proceeding to the full dataset."] Start with a lightweight implementation (<200 lines of new code), no complicated data pipelines [FSDL]. Build those later once the core works. + +**Baseline ladder** (for physics/simulation models -- make each step beat the previous one): +1. Persistence: y(t) = y(t-1). Bar for "does the model capture any dynamics at all?" +2. Exponential decay to steady state (first-order response fit). +3. Linear state-space / OLS on finite differences. +4. Pure data MLP (same architecture, no physics). If PINN doesn't beat this, the physics constraint is hurting. +5. Classical solver with fixed parameters (scipy solve_bvp, ODE, etc.). +6. Classical solver with fitted parameters. +7. Then and only then: PINN / learned physics. + +Make complexity pay rent. Every added component (more physics, more dimensions, more losses) should improve a metric you care about. If it doesn't, remove it. + +**Step 3: Log everything, look for specific pathologies.** [Goodfellow Ch11, Rahtz 2018, CS231n] + +What to log: +- Losses (train and val, per-component if multi-objective) +- Gradient norms (per module if possible) +- Learning rates +- Parameter norms / update magnitudes +- Activation statistics (mean, std, fraction of dead ReLUs) +- Data statistics (input distributions, label distributions) + +**Sanity check at init** [CS231n]: verify you get the expected loss at chance performance before training starts. E.g., for 10-class softmax the initial loss should be -ln(0.1) = 2.302 with small random weights. If not, something is wrong with initialization or the loss function. Then verify that increasing regularization increases the loss. + +| Symptom | Likely cause | +|---|---| +| Loss stuck from the start | LR too low, bad init, data pipeline broken, wrong loss function | +| Loss decreases then explodes | LR too high, numerical instability (log(0), div by 0), gradient accumulation bug | +| Loss NaN | log(0), 0/0, overflow. Use `log(x.clamp(min=1e-8))`, `1/(std + 1e-5)` | +| Train loss good, val loss bad | Overfitting. More data, regularization, smaller model | +| Loss oscillates wildly | LR too high, batch size too small, data shuffling broken | +| Gradients vanish | Too-deep network without skip connections, saturating activations (tanh with large inputs), bad init | +| Gradients explode | No gradient clipping, learning rate too high, recurrent networks without gradient clipping | +| Different results per seed | Normal if small variance; suspicious if large. Check init sensitivity, batch ordering, floating point nondeterminism | +| Model outputs constant | Dead neurons, vanishing gradients, mode collapse, all-zero init | +| Physics loss low but BCs violated | Gradient imbalance -- PDE residual dominates BC gradient; use adaptive loss weighting or hard BCs | +| PINN worse than pure-data MLP | Wrong equations, bad scaling (forgot to nondimensionalize), or physics constraint fighting the data | +| PINN fails on hard PDE regime, works on easy | Curriculum regularization: start with easy parameters, warm-start and increase to target | +| Scalar parameter (U, alpha) stuck at 0 or bound | Degenerate solution; bound and initialize it, or estimate separately before joint training | + +**Step 4: Numerical hygiene.** [CS231n] + +```python +# Clamp log values +log_prob = prob.clamp(min=1e-8).log() + +# Never divide by zero +ratio = x / (std + 1e-5) + +# Clip gradients and LOG the pre-clip norm +grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=20.0) +logger.log("grad_norm", grad_norm) + +# Catch NaNs early +assert torch.isfinite(loss), f"Loss is {loss}" + +# Verify custom gradients (use float64! relative error plummets from 1e-2 to 1e-8) +torch.autograd.gradcheck(my_custom_fn, inputs.double()) +``` + +Gradient clipping *masks* problems -- always log the pre-clip norm to see if it's constantly being triggered. [CS231n: "the ratio of the update magnitudes to the value magnitudes... should be somewhere around 1e-3."] + +**Gradient check thresholds** [CS231n]: use relative error, not absolute. Compare analytic vs numerical gradient using centered difference formula. Relative error > 1e-2 = probably wrong. 1e-4 = uncomfortable. 1e-7 = happy. Before checking: (a) turn off regularization and check data loss alone first (regularization can mask data loss bugs), (b) disable dropout and augmentation, (c) use float64 not float32. + +**Step 5: Normalization and Nondimensionalization.** [Schulman 2017, CS231n, FSDL, Slavv 2017] +Most ML training issues trace back to scale problems. + +- Input normalization: mean 0, std 1 per feature. Use running statistics over ALL data seen so far, not just recent data [Schulman 2017]. Using only recent data silently changes the input distribution in a way the policy doesn't know about, which can collapse performance. [Schulman slides: "Compute running estimate of mean and standard deviation, x' = clip((x-mu)/sigma, -10, 10)"] +- Schulman: "plot histograms of all observations and rewards and make sure each component has the right mean and standard deviation and doesn't have crazy outliers." +- Layer normalization helps stability. +- For targets/labels: think about whether the scale is reasonable for your loss function. +- **For physics/PDE models (PINNs)**: nondimensionalize *before* training. Raw SI units (Kelvin, Joules, meters) create loss terms with wildly different magnitudes -- this is the multi-scale problem that adaptive weighting tries to fix downstream. Nondimensionalizing fixes it at the source by making all PDE coefficients O(1). Recipe: pick characteristic scales (T_ref, L_ref, etc.), define dimensionless variables (T* = T/T_ref, z* = z/L), substitute into the PDE. The resulting groups (NTU, Biot, etc.) are all O(1). +- **Train/test split**: use temporal split (not random) for time-series or plant data. Random splitting leaks temporal correlation and gives optimistic test RMSE. Conventional: first 75% train, last 25% test. + +**Step 6: Check your assumptions about the optimizer.** + +- Adam's moment estimates can mask gradient problems. If step statistics look weird, check raw gradients separately. +- `abs_max(param_update)` should be small (e.g., ~1e-3 at LR 1e-2); `mean_square(param_update)` should be very small but substantially smaller than abs_max. +- Supervised learning tricks (batch norm, dropout, big networks) often *don't* transfer to RL. People tried them. They usually don't help. + +### Assume you have a bug [Jones 2021, Goodfellow Ch11] + +> When their RL implementation doesn't work, people are often keen to either (a) adjust their network architecture or (b) adjust their hyperparameters. On the other hand, they're reluctant to say they've got a bug. Most often, it turns out they've got a bug. -- Andy Jones + +Bugs are faster to find and fix than validating that a new architecture is an improvement. Dramatically raise your threshold for "OK, I think this is correct." Neural net components can adapt to compensate for bugs, masking them [Goodfellow Ch11: "If one part is broken, the other parts can adapt and still achieve roughly acceptable performance."] + +### Loss curves are a red herring [Jones 2021] + +They give global information about performance but don't localize errors. Don't debug by staring at loss curves. Use them *after* you've exhausted better methods. Their main value: splitting performance into "how fast it learns" vs "where it plateaus." [Jones: "The shape of your loss curve says very little about where in your code you've messed up."] + +### Pursue anomalies [Jones 2021, Rahtz 2018] + +> If you ever see a plot or a behaviour that just *seems weird*, chase right after it. Do not just 'hope it goes away'. -- Andy Jones + +That cool new feature you were going to add today? It won't magically fix the anomaly. Give up on your plan and chase the anomaly instead. Rahtz independently calls this "noticing confusion" -- following confusion about a frame-differencing improvement led to finding a normalization bug that had hidden for months. + +### With long feedback loops, think more, experiment less [Rahtz 2018] + +> Switching from experimenting a lot and thinking a little to experimenting a little and thinking a lot was a key turnaround in productivity. -- Rahtz + +When runs take hours, pour time into hypothesis-forming *before* launching. Spend 30-60 minutes mapping out possibilities, ranking them by likelihood given all evidence so far. Reserve experiments for distinguishing between your top hypotheses. + +Keep a structured work log for long debugging sessions: +1. What specific output am I working on right now? +2. Thinking out loud -- hypotheses about the current problem +3. Record of currently running experiments with what each one is supposed to answer +4. Results of runs (graphs, observations), separated by type + +--- + +## Part 2: RL-Specific Debugging + +Everything in Part 1 applies, plus RL has unique challenges: + +- **Errors aren't local**: actor -> learner -> actor loop smears bugs everywhere +- **Performance is noisy**: a bug-free implementation can fail due to hyperparameters; a buggy one might seem to work [Irpan 2018: "If my reinforcement learning code does no better than random, I have no idea if it's a bug, if my hyperparameters are bad, or if I simply got unlucky."] +- **Few narrow interfaces**: components consume/produce huge arrays and are heavily stateful +- **Few good abstractions**: you need to understand env, network, optimizer, backprop, multiprocessing, GPU, all at once +- **Compound cost**: sample inefficiency x instability x HP sensitivity multiplies compute needed. 1M steps x 5 seeds x HP tuning = exploding compute to test a single hypothesis [Irpan 2018] + +### RL debugging hierarchy + +**1. Use probe environments (from Andy Jones).** [Jones 2021] + +Don't just test on CartPole -- construct environments that *localize* errors: + +1. **One action, zero obs, one step, +1 reward**: Isolates value network. If it can't learn value=1, value loss or optimizer is broken. +2. **One action, random ±1 obs, one step, obs-dependent ±1 reward**: If (1) works but not this, backprop through the network is broken. +3. **One action, zero-then-one obs, two steps, +1 reward at end**: If (2) works but not this, reward discounting is broken. +4. **Two actions, zero obs, one step, action-dependent ±1 reward**: First to exercise policy. If it fails, check advantage calculation, policy loss, or policy update. +5. **Two actions, random ±1 obs, one step, action-and-obs dependent ±1 reward**: Policy and value networks interact. Check that policy picks right action per state AND value network learns value=+1. +6. **Progressively harder from there.** + +Each env should solve in seconds. If it takes longer, you have a bug. + +**2. Use probe agents.** + +- **Cheat agents**: Leak extra info (e.g., goal direction). If it can't solve the task even with extra info, the problem is elsewhere. +- **Automatons**: Hand-written algorithms, no NN. Tests that the environment is actually solvable. +- **Tabular agents**: Replace NN with a lookup table on simple envs. Far easier to inspect. + +**3. Unit test the tricky bits.** [Jones 2021] + +Most bugs cluster in the same few places: +- Reward discounting around episode resets +- Advantage calculations around resets +- Buffering: pairing wrong rewards with wrong observations +- Done signal handling (wrapped envs silently truncate episodes -- if you ignore `done` but store it in your buffer, all updates are wrong) + +These are deterministic, easy to factor out, fast to test. + +**4. Verify reward and observation scale.** + +Schulman: "as a rule of thumb, usually want everything to be mean 0 and standard deviation 1 for observations." + +For rewards: hand-scale so that value targets land in [-10, +10], ideally [-3, +3] [Jones 2021]. The hyperparameters from papers are tuned for this range. + +> Don't be tempted to write an adaptive reward scaling scheme. It's extra nonstationarity. Just hand-scale. -- Andy Jones + +For reward normalization: rescale but DON'T shift the mean [Schulman 2017]. Shifting the mean changes the agent's "will to live" -- how long it wants to survive. You're changing the problem. Henderson et al. 2018 confirm experimentally that reward rescaling can have large effects on DDPG performance. + +**5. RL-specific diagnostics to log.** [Schulman 2017, Jones 2021] + +| Metric | Healthy behavior | What it tells you | +|---|---|---| +| **Policy entropy** (relative to max) | Starts near 1, falls, flattens | If stays ~1: not learning any policy. If drops to 0: collapsed, not exploring. If oscillates wildly: LR too high. | +| **KL divergence** (old vs new policy) | Small but positive, stable | Very large: stale experience. Very low: can increase LR. Growing over time: feeding same old experience repeatedly. Negative: calculation bug. | +| **Residual variance** (var(target - predicted) / var(target)) | Starts ~1, falls rapidly, then slowly | Stays ~1: value net not learning. Drops to 0: policy collapsed or one scenario dominates. *Negative* explained variance (= 1 - residual_var) means the value net is worse than predicting zero -- likely overfitting or broken [Schulman 2017]. | +| **Value target distribution** | In [-10, +10], ideally [-3, +3] | Blowing up: discount too high or reward discounting broken. | +| **Advantage distribution** | Approximately mean-zero, in [-10, +10] | Persistently non-zero mean: advantage calculation broken. | +| **Episode length** | Depends on env | All episodes same unexpected length: env broken or degenerate policy. | +| **Max episode return** | Look at it, not just mean | If max is high but mean is low, the policy knows the good strategy but can't consistently execute it. Schulman: "if you have a deterministic system, that maximum return is something your policy can hone in on pretty straightforwardly." | +| **Std of policy** (continuous) | Should decrease as it learns (PPO) | Not decreasing: not learning. Collapsed to 0: no exploration. | +| **Critic loss** then **actor loss** | Critic converges first, actor follows | Actor loss initially increases: normal -- value function is a moving target during critic warmup. | +| **Sample staleness** | Steady throughout training | Growing: buffering problem. | + +**6. RL hyperparameter defaults (only after verifying correctness).** + +| Parameter | Default | Notes | +|---|---|---| +| Hidden layers | 2 × 64 or 2 × 256 | Small networks can accomplish a lot. Jones: 4×256 FC learned *perfect* play on 9×9 board game. | +| Activation | ReLU for hidden (tanh sensitive to init/scale) | tanh on output layers where needed (e.g., action bounds) | +| Optimizer | Adam | | +| γ (discount) | 0.99 | Think about what 1/(1-γ) = 100 timesteps means in real time. With TD(λ), can use γ→0.999 if λ<1. | +| Batch size | BIG. Pong ~1k, Space Invaders ~10k, Dota ~100k | Schulman: "sometimes you need to use bigger batch sizes than you thought." TRPO needed 100k. McCandlish et al. 2018 provide theoretical foundation via critical batch size. | +| Critic LR | ≥ Actor LR | Critic needs to learn first to provide signal | +| Replay buffer | Big as you can afford (DQN: 1M steps) | | +| Entropy coeff | Start 0.01, decrease | | +| Exploration | Epsilon-greedy schedules help for Q-learning | | +| Policy init | Final layer zero or very small | Ensures random exploration initially instead of arbitrary strong opinions | + +**7. Reward engineering.** [Irpan 2018, Schulman 2017] + +- Rewards must have *variance*. If all rewards are equal, there's nothing to learn. +- Rescaling rewards from [-1, 1] to [0, 1] was the game-changer for at least one practitioner stuck on Pendulum. +- Shaping rewards (e.g., distance to target instead of sparse success/fail) gives much faster learning but changes the problem. +- Don't shift reward mean (changes the MDP). + +**Reward misspecification war stories** [Irpan 2018]: (a) Agent trained to navigate a room with no penalty for going out of bounds. Negative reward was plentiful, positive reward too hard. Policy learned to be *suicidal* -- quick death at 0 reward was preferable to a long life risking negative reward. (b) Robot arm reaching toward a point defined relative to a table. Policy learned to slam the table, making it fall over, moving the target point. Both are cases where a minor reward specification error changed the learned optimization target fundamentally. + +**8. Environment setup.** + +- If all vectorized envs start from the same state, initial batches are highly correlated. Mix envs by taking random steps first. Check: if resets cluster on one timestep, not well-mixed. +- Think about time discretization: can a human control the system at this frame skip rate? What does random exploration look like at this discretization? If you repeat the same action too many times, you get weird Brownian motion. +- Avoid pixels if you can. Before your agent does anything interesting, it has to learn to *see* -- from sparse rewards. Use gridworlds, state vectors, or simple observations first. + +**9. Working from reference implementations.** [Jones 2021, Rahtz 2018] + +> If you're new to RL, writing things from scratch is the most catastrophically self-sabotaging thing you can do. -- Andy Jones + +The allure of writing from scratch is real but the self-correction mechanisms in RL are too weak. Henderson et al. 2018 found that "implementation differences which are often not reflected in publications can have dramatic impacts on performance" -- different implementations of the *same* algorithm with the *same* hyperparameters perform differently. Options, from safest to riskiest: +1. Use reference impl out-of-the-box, make small changes, verify nothing broke +2. Use reference impl as source of reliable components, work to the same API +3. Have one eye on reference while you write your own -- copy hyperparameters, discounting code, termination handling + +References: spinning-up (OpenAI), stable-baselines3, cleanrl (single-file per algo), OpenSpiel (multi-agent). + +**10. Don't over-interpret noise.** [Schulman 2017, Henderson 2018, Irpan 2018] + +Schulman showed 7 MuJoCo tasks x 3 "different algorithms" that were actually the same algorithm with different seeds. Easy to think blue is best on one task, red on another. Need multiple tasks x multiple seeds. Even 20 seeds leaves a pretty big error bar. Henderson et al. 2018 confirmed this with t-tests: "the variance between runs is enough to create statistically different distributions just from varying random seeds." Irpan reports a 30% failure rate across 10 seeds on Pendulum with identical hyperparameters. + +Corollary: don't keep adding modifications until your algorithm is complicated. Many tricks substitute for each other (especially normalization tricks). Simplify -- simpler algorithms generalize better. [Schulman slides: "Different tricks may substitute. Especially whitening."] + +**11. When you really have no bug.** + +Sometimes (rarely) you don't. Schulman: +- Policy gradient: "if it's going to learn it'll learn at the beginning" -- less burn-in than Q-learning +- DQN: has a "serious warmup period" -- the original authors needed patience and bravery +- Some easy problems (cart-pole swing-up) can defeat state-of-the-art algorithms without careful tuning. Don't get stuck on one problem your method fails on. + +**12. Meta-advice from Schulman.** [Schulman 2017, CS231n/Bergstra & Bengio 2012] + +- HP search: uniform random sampling, look at results, do regression to find which parameters matter, narrow ranges, repeat. "I use the human version of it." CS231n independently recommends: "Prefer random search to grid search" citing Bergstra & Bengio 2012. +- Read older textbooks and theses -- denser source of useful information than conference papers. +- Automate experiments. Don't spend all day watching your code print numbers. +- Have a battery of benchmark problems you run frequently. Easy to overfit one problem. +- Once something works, check sensitivity to every HP. If it's too sensitive, "you probably just got lucky." + +--- + +## Sources + +**Evidence map**: [docs/ml_debug_folklore.argdown](docs/ml_debug_folklore.argdown) traces each claim to verbatim quotes across 21 evidence files in [docs/evidence/](docs/evidence/). Process log at [docs/ml_debug_folklore_log.md](docs/ml_debug_folklore_log.md). + +### Talks +- Schulman, "Nuts and Bolts of Deep RL Experimentation," Deep RL Bootcamp 2017 + - Video: https://www.youtube.com/watch?v=8EcdaCk9KaQ + - Summary: https://github.com/williamFalcon/Deep-Reinforcement-Learning-Bootcamp/blob/master/lecture6.md + +### Articles +- Andy Jones, "Debugging RL, Without the Agonizing Pain" (2021): https://andyljones.com/posts/rl-debugging.html +- Matthew Rahtz, "Lessons Learned Reproducing a Deep RL Paper" (2018): http://amid.fish/reproducing-deep-rl +- Henderson et al., "Deep Reinforcement Learning that Matters" (2018): https://arxiv.org/abs/1709.06560 +- Alex Irpan, "Deep Reinforcement Learning Doesn't Work Yet" (2018): https://www.alexirpan.com/2018/02/14/rl-hard.html +- McCandlish & Kaplan, "An Empirical Model of Large-Batch Training" (2018): https://arxiv.org/abs/1812.06162 +- Slav Ivanov, "37 Reasons why your Neural Network is not working" (2017): https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607 + +### Reference implementations +- OpenAI Spinning Up: https://github.com/openai/spinningup +- Stable Baselines3: https://github.com/DLR-RM/stable-baselines3 +- CleanRL: https://github.com/vwxyzjn/cleanrl +- OpenSpiel: https://github.com/deepmind/open_spiel + +### Reddit threads +- "Deep RL practical tips" (2018): https://old.reddit.com/r/reinforcementlearning/comments/7s8px9/deep_reinforcement_learning_practical_tips/ +- "What are your best tips for debugging RL problems?" (2018): https://old.reddit.com/r/reinforcementlearning/comments/9sh77q/what_are_your_best_tips_for_debugging_rl_problems/ +- "How to more intelligently debug RL roadblocks?" (2019): https://old.reddit.com/r/reinforcementlearning/comments/bwjp3r/how_to_more_intelligently_debug_rl_roadblocks/ + +### Other talks/slides +- Schulman, "The Nuts and Bolts of Deep RL Research" (NIPS 2017 Deep RL Workshop): https://www.reddit.com/r/reinforcementlearning/comments/5hereu/the_nuts_and_bolts_of_deep_rl_research_schulman/ +- Levine & Finn, ICML 2017 Tutorial: https://www.reddit.com/r/reinforcementlearning/comments/6vcvu1/icml_2017_tutorial_slides_levine_finn_deep/ +- Deep RL Bootcamp 2017 (all slides/talks): https://www.reddit.com/r/reinforcementlearning/comments/75m5vd/deep_rl_bootcamp_2017_slides_and_talks/ + +### Debugging deep networks (general) +- Goodfellow et al., Deep Learning Book, "Practical Methodology" chapter: https://www.deeplearningbook.org/ +- Stanford CS231n, Neural Networks Part 3: https://cs231n.github.io/neural-networks-3/ +- Glorot & Bengio, "Understanding the difficulty of training deep feedforward neural networks" (2010) +- Josh Tobin, FSDL Spring 2021 Lecture 7 "Troubleshooting Deep Neural Networks": https://fullstackdeeplearning.com/ +- Andrew Ng, CS229 Machine Learning Advice: Stanford CS229 + +### Tools +- PyTorch memory profiling: https://github.com/Stonesjtu/pytorch_memlab +- GPU profiling: nsight, snakeviz, tuna +- Gradient debugging: `torch.autograd.gradcheck`, `torch.autograd.detect_anomaly()` + +--- + +## Part 3: Loss Surface & Gradient Analysis (No Model Required) + +When a loss isn't behaving as expected, don't guess -- visualize the loss surface and check gradient flow directly. This technique uses *synthetic tensors* fed into loss sub-components. No model, no forward pass, no GPU. Pure math. + +### The method + +1. Identify each loss sub-component as a function of its immediate inputs. +2. Pick 1-2 axes that matter (the "natural axes" you think about when reasoning about the loss). +3. Grid over those axes, feed through the loss, call `.backward()`, collect gradients. +4. Plot: contour heatmap + quiver overlay (negative gradient = optimization direction). +5. Build a summary table: component x representative_input -> loss_value, grad_value. Flag zero or non-finite gradients. + +### Pseudocode + +```py +# ── 2D loss surface with gradient quiver ────── +def analyze_component(loss_fn, x_range, y_range, n=80): + xs = linspace(*x_range, n) + ys = linspace(*y_range, n) + X, Y = meshgrid(xs, ys) + x_flat = X.flatten().requires_grad_(True) + y_flat = Y.flatten().requires_grad_(True) + + losses = loss_fn(x_flat, y_flat) # vectorized, returns (n*n,) + losses.sum().backward() + + loss_grid = losses.detach().reshape(n, n) + gx = x_flat.grad.reshape(n, n) + gy = y_flat.grad.reshape(n, n) + + # contourf(X, Y, loss_grid) + quiver(X, Y, -gx, -gy) + # negative gradient = direction optimizer moves + +# ── Gradient flow verification table ────────── +# +# For each component, evaluate at representative inputs +# (zero, small, converged, degenerate). Report loss + grad. +# Flag: zero grad (dead zone), non-finite (numerical issue). +# +# | Component | Param | Input | Loss | Grad | +# |----------------|----------|--------------|----------|----------| +# | barrier_penalty | v | v=0.0 | +0.000 | +0.000 | <-- zero grad! +# | barrier_penalty | v | v=0.5 | +12.50 | +50.00 | +# | pair_loss | dot_pos | (0.3, -0.3) | -2.340 | -3.000 | +# | pair_loss | dot_neg | (0.3, -0.3) | -2.340 | +3.000 | <-- antisym, good +# | pair_loss | dot_pos | (0.0, 0.0) | +0.000 | +0.000 | <-- dead at init! +``` + +### What to look for + +| Pattern | Meaning | Action | +|---------|---------|--------| +| Gradient arrows point toward desired region | Loss is well-shaped | Ship it | +| Large flat region (zero gradient) | Dead zone -- optimizer stuck if it lands here | Add curvature, change init, or use different parameterization | +| Gradient magnitude 1000x in one axis vs another | Imbalanced -- one axis dominates | Rescale, use log-space, or normalize | +| Saddle point at origin | Common with product-form losses (A*B) | Switch to additive (log A + log B) for independent gradients | +| Arrows point away from desired region | Loss is wrong or has unexpected local min | Rethink the formula | +| Non-finite values in a region | Numerical issue (log(0), 0/0) | Add eps, clamp, or use log1p | + +### The log-space decomposition trick + +When your loss involves a product of factors A*B and one factor can be near zero: + +``` +# BAD: symlog(A * B) -- when B~0, chain rule gives 0 grad to A too +# GOOD: sign * (log|A| + log|B|) -- independent gradients +# d/dA = 1/A regardless of B +# d/dB = 1/B regardless of A +``` + +General principle: **if you want gradient to flow independently through two factors, decompose multiplicatively in log space.** + +### Structural ceiling analysis + +Sometimes a metric is stuck not because the optimizer fails but because the parameterization can't express a higher value. To diagnose: + +```py +# 1. Check: is d(loss)/d(metric) large? If yes, optimizer IS trying. +metric = torch.tensor(0.5, requires_grad=True) +loss = loss_fn(metric) +loss.backward() +print(metric.grad) # if large (e.g. 350x the other gradients), it's trying + +# 2. Check: can the parameter CHANGE the metric? +# Trace the chain: loss -> metric -> intermediate -> parameter +# If d(metric)/d(parameter) ~ 0, the param is structurally unable to move it. +# Example: V-rotation can't change output basis (U is fixed) so r_sub is capped. + +# 3. Confirm empirically: set the exponent to 0 (disable the term). +# If metric reaches the SAME value, it's purely structural (not learned). +``` + +### When to use this + +- New loss function: always visualize before training. 5 minutes of plotting saves hours of puzzling over curves. +- Metric stuck at a value: distinguish "optimizer can't" from "parameterization can't" from "competing losses cancel out." +- After changing loss formula: verify gradient flow didn't break, especially at the operating point (not just at init). +- Comparing loss variants: grid the same axes for both, compare arrow fields side by side. + +--- + +## Part 4: Experiment Sweeps & Statistical Analysis + +Principled hyperparameter sweeps with same-seed comparisons, within-group z-scores, and t-stat stability. This is the difference between "I tried it and it seemed better" and "I have evidence it's reliably better." + +### Sweep design (justfile pattern) + +Each sweep is a justfile recipe. Key conventions: + +```just +set shell := ["bash", "-c"] +SEEDS_4 := "2024 4096 8192 9000" +BASE := "uv run python train.py gemma1b" + +# Q: Does rotation type matter? block vs full vs givens. +# Hypothesis: block should balance expressiveness vs cost. +# 12 runs, ~3 hours +sweep-rotation-type: + #!/usr/bin/env bash + set -x + export WANDB_RUN_GROUP="sweep-rotation-type-$(date +%Y%m%d-%H%M)" + for seed in {{ SEEDS_4 }}; do + {{ BASE }} --seed=$seed --svd_rotation_type=block + {{ BASE }} --seed=$seed --svd_rotation_type=full + {{ BASE }} --seed=$seed --svd_rotation_type=givens + done +``` + +**Rules:** +- One WANDB_RUN_GROUP per sweep, timestamped. +- Same seeds across all values within a sweep (enables paired comparison). +- Vary ONE parameter per sweep when possible (all-else-equal). If you must vary two, the analysis script warns about confounders. +- Comment the recipe with: question, hypothesis, run count, time estimate. +- Queue sweeps in a `queue` recipe in priority order. + +### Logging to wandb + +Every run logs to wandb with: group name, seed, all config as hyperparams, final eval metric (SI = TPR - FPR). + +Cache locally as parquet to avoid slow API calls on every analysis: + +```py +# download_wandb.py pattern: +# 1. Load cached parquet (if exists) +# 2. Find latest cached run date, subtract safety margin (1 day) +# 3. Fetch only runs newer than that +# 4. Merge (diagonal concat, dedup on run_id) +# 5. Save back to parquet + TSV +# +# Also downloads output.log per run for post-hoc log diagnosis. +``` + +### Analysis: within-group z-scores -> t-stat + +The core insight: don't compare raw SI across groups (different base configs, different dates). Compare *within* each group, then aggregate stability across seeds. + +```py +# analyze_results.py pseudocode: + +for group in groups: + for seed in seeds_in_group: + # 1. Collect all SI values for this (group, seed) combo + si_values = {param_value: SI for runs matching (group, seed, param)} + + # 2. Compute within-(group,seed) z-score + mu = mean(si_values) + sigma = std(si_values) + z[value] = (si[value] - mu) / sigma + # This normalizes out seed-level baseline differences + + # 3. Aggregate z-scores across seeds for each param value + for value in param_values: + mean_z = mean(z[value] across seeds) + std_z = std(z[value] across seeds) + t_stat = mean_z / (std_z / sqrt(n_seeds)) + # t_stat >> 2: reliably better across seeds + # t_stat ~ 0: no consistent effect + # t_stat << -2: reliably worse + + # 4. Also compute linear trend (Pearson r) for numeric params + # r > 0: more is better. t_stat on r tests reliability. +``` + +### Interpreting results + +| Metric | What it tells you | +|--------|-------------------| +| SI_mean | Raw effect size (higher = better behavioral control) | +| si_q10, si_q90 | Spread. Wide = seed-sensitive. | +| t_stat | Cross-seed reliability. \|t\| > 2 with 4+ seeds is meaningful. | +| linear r | Monotonic trend. r near +1/-1 with significant t_stat = dose-response. | +| "Also varies" warning | Confounders. Can't attribute effect to this param alone. | + +**What you're looking for**: high SI_mean *and* strong t_stat (reliable). A value with SI_mean=20 but t_stat=0.5 is a lucky seed. A value with SI_mean=10 but t_stat=4.0 is a real (if modest) effect. + +### Common pitfalls + +- **Stale cache**: always `download_wandb.py` before analyzing. Stale cache hides new groups. +- **Cross-group comparisons**: different groups have different base configs. "Group A's best value vs Group B's best value" is apples-to-oranges. Compare within groups. +- **n_seeds=1**: t_stat is NaN. You have one data point. Replicate before concluding. +- **Too many params varied**: if a sweep varies 3 params simultaneously, effects are confounded. Split into separate sweeps. +- **Interpreting NaN SI**: usually means eval crashed or the model diverged. Investigate the run log, don't just skip it. +- **"Fill" sweeps**: if a sweep is 13/16 runs done (missing a seed), run the missing seed in a separate group with a clear name (e.g. `sweep-coh-tau-fill`). The analysis script treats it as a separate group -- you merge mentally. + +### The full workflow + +```bash +# 1. Design sweep: write justfile recipe with hypothesis +# 2. Run it +just sweep-rotation-type +# 3. Wait for completion, then: +uv run python scripts/download_wandb.py +uv run python scripts/analyze_results.py --after $(date +%Y-%m-%d) +# 4. Read the output: +# - Group Summary table: SI_mean, n_seeds per group +# - Param tables: per-value SI with t_stat +# - Linear trends: dose-response for numeric params +# 5. Record findings in research journal +# 6. Update default config if result is clear and reliable +``` + +--- + +## Part 5: Diagnosing "Why Won't This Metric Move?" + +A structured decision tree for when a metric is stuck. Applies to any training scenario where a quantity you're optimizing plateaus. + +### Step 1: Is the gradient nonzero at the metric level? + +```py +metric_val = torch.tensor(current_value, requires_grad=True) +loss = loss_fn(metric_val) +loss.backward() +print(f"d(loss)/d(metric) = {metric_val.grad}") +``` + +- If ~0: the loss doesn't care about this metric at the current operating point. Likely saturated (log1p of huge value), in a dead zone, or the metric is disconnected from the loss. +- If large: the loss IS trying to move it. Problem is downstream. + +### Step 2: Can the parameter change the metric? + +Trace the chain rule: `loss -> metric -> ... -> parameter`. The metric is a function of intermediate quantities, which are functions of learned parameters. Check `d(metric)/d(parameter)`: + +- Analytically: is there a structural reason this derivative is ~0? (e.g., a rotation of V can't change span(U)) +- Empirically: disable the loss term entirely (set coefficient to 0). Does the metric reach the same value? If yes, it's structural -- the optimization never moved it in the first place. + +### Step 3: Is something else fighting it? + +If gradient is nonzero and the parameter CAN change the metric: +- Check competing loss terms: compute gradient contribution from each loss component separately. If two terms have opposite-sign gradients on the same parameter, they cancel. +- Check optimizer state: AdamW momentum from earlier training may resist direction changes. Try resetting optimizer state or using a warmup schedule. +- Check conditioning: if the metric requires coordinated changes across many parameters (e.g., rotating multiple layers simultaneously), the gradient per-parameter may be too small even though the aggregate signal is large. + +### Decision table + +| d(loss)/d(metric) | d(metric)/d(param) | Same value without loss term? | Diagnosis | +|---|---|---|---| +| ~0 | any | any | Loss saturated or disconnected. Change loss formula. | +| large | ~0 | yes | Structural ceiling. Change parameterization. | +| large | large | no | Competing losses or optimizer inertia. Isolate. | +| large | large | yes | The term helps but converges to same basin. Coincidence or weak effect. | + +--- + +## Part 6: LLM Debugging Playbook + +Concrete procedures for an LLM agent debugging ML code. Work top-to-bottom: static analysis first, then diagnostics, then the decision tree. Don't skip to hyperparameter suggestions. + +### 6.1 Static analysis: grep for silent bugs + +Run these searches on the codebase before anything else. Each catches a common bug that produces no error but wrong results. + +**Shape mismatches (silent broadcasting)** +``` +# Grep patterns: +\.view\(|\.reshape\( # check dims match intent +unsqueeze\(|squeeze\( # dimension insertion/removal +\.expand\(|\.repeat\( # broadcasting +# Action: for every hit, trace the tensor shape backward. Add assert statements. +``` + +**Autograd breakers** +``` +# Grep patterns: +\.detach\(\) # breaks gradient flow +\.data\b # bypasses autograd entirely +with torch\.no_grad # check this isn't wrapping training code +\.item\(\) # in a loss computation = broken +\.numpy\(\) # in forward pass = broken +# Action: every .detach() should have a comment explaining WHY grad is intentionally stopped. +``` + +**Missing train/eval mode** +``` +# Grep patterns: +\.train\(\) # count occurrences +\.eval\(\) # should pair with .train() +# Action: verify .eval() before every val loop, .train() before every train loop. +# Dropout and batchnorm behave differently -- this silently degrades results. +``` + +**In-place ops on tensors requiring grad** +``` +# Grep patterns: +\+=|\-=|\*=|/= # in-place assignment on tensors +\.add_\(|\.mul_\(|\.zero_\( # in-place methods +\[.*\]\s*=[^=] # index assignment (excludes ==) +# Action: in-place ops on leaf tensors with requires_grad=True corrupt autograd. +# Replace x += y with x = x + y. +``` + +**Double softmax (softmax input to CrossEntropyLoss)** +``` +# Grep patterns: +CrossEntropyLoss|cross_entropy # expects raw logits +softmax|log_softmax|\.softmax # if applied BEFORE CrossEntropyLoss = double softmax +# Action: CrossEntropyLoss = log_softmax + NLLLoss internally. +# If you softmax first, CE computes log_softmax(softmax(x)) -- the softmax +# compresses logits into (0,1), so log_softmax sees near-uniform inputs. +# Gradients vanish. Loss plateaus near ln(n_classes). +``` + +**Wrong optimizer step ordering** +``` +# Grep patterns -- verify this exact order exists: +# 1. optimizer.zero_grad() +# 2. loss.backward() +# 3. [optional: clip_grad_norm_] +# 4. optimizer.step() +# 5. [optional: scheduler.step()] +# Common bugs: zero_grad after backward (kills grads), step before backward (stale grads), +# scheduler.step() in wrong loop: per-epoch schedulers (StepLR, CosineAnnealingLR) +# called per-batch = decays too fast. Per-step schedulers (OneCycleLR) called per-epoch = too slow. +``` + +**Broadcasting traps** +```python +# Diagnostic: print shapes at every binary operation between tensors of different ndim +# Shapes (3,) and (3,1) silently broadcast to (3,3) -- probably not intended. +# Shapes (B,1) and (B,N) broadcast fine but verify it's intentional. +a = torch.randn(3) +b = torch.randn(3, 1) +print((a + b).shape) # (3, 3) -- wanted (3,)? +``` + +**Wrong loss sign** +``` +# Grep patterns: +maximize|ascent # gradient ascent when descent intended? +\-\s*loss # negating loss -- intentional (e.g., reward maximization)? +1\.0\s*-\s*|1\s*-\s* # 1 - metric as loss -- is the metric bounded [0,1]? +# Action: verify that minimizing the loss = improving the metric you care about. +``` + +**Frozen parameters not intended** +``` +# Grep patterns: +requires_grad\s*=\s*False # intentional freeze? +\.freeze\(|\.requires_grad_ # parameter freezing +for.*param.*\.parameters # check nothing is skipped +# Diagnostic: +for name, p in model.named_parameters(): + if not p.requires_grad: + print(f"FROZEN: {name}") +``` + +**Data leakage** +``` +# Grep patterns: +\.fit_transform\( # on test data = leakage +train_test_split.*shuffle=True # for time series = leakage +# Action: fit on train only, transform on both. Use temporal split for time series. +``` + +**Class imbalance** +``` +# Grep patterns: +CrossEntropyLoss\(\) # no weight= argument? check if classes balanced +weight=.*class # existing balancing -- verify weights are correct +# Diagnostic: count labels per class (see 6.2 "Class imbalance check"). +# 100:1 ratio with unweighted loss = model predicts majority class. +``` + +### 6.2 Diagnostic code snippets + +Copy-paste these. Each tests one thing. + +**Data pipeline sanity check** +```python +batch = next(iter(train_loader)) +for k, v in (batch.items() if isinstance(batch, dict) else enumerate(batch)): + if isinstance(v, torch.Tensor): + print(f"{k}: shape={v.shape}, dtype={v.dtype}, " + f"range=[{v.min():.3f}, {v.max():.3f}], " + f"mean={v.float().mean():.3f}, std={v.float().std():.3f}, " + f"nan={v.isnan().sum()}, inf={v.isinf().sum()}") + else: + print(f"{k}: type={type(v)}, len={len(v) if hasattr(v, '__len__') else 'scalar'}") +# Check: inputs ~mean 0, std 1? Labels in expected range? No NaN/Inf? Shapes match model? +``` + +**Init loss check** +```python +model.eval() +with torch.no_grad(): + batch = next(iter(train_loader)) + out = model(batch['input']) # adapt to your interface + loss = loss_fn(out, batch['target']) + print(f"Init loss: {loss.item():.4f}") + +# Expected init loss (random predictions): +# - CrossEntropy, C classes: -ln(1/C) = ln(C) +# C=2: 0.693, C=10: 2.303, C=100: 4.605, C=1000: 6.908 +# - Binary CrossEntropy: -ln(0.5) = 0.693 +# - MSE (targets ~N(0,1)): ~1.0 (if init outputs ~0) or ~var(targets) +# - L1 (targets ~N(0,1)): ~0.8 +# +# If init loss << expected: model is cheating (data leakage, shortcut) +# If init loss >> expected: wrong loss fn, bad init, or data pipeline broken +``` + +**Overfit-one-batch test** +```python +model.train() +batch = next(iter(train_loader)) +optimizer = torch.optim.Adam(model.parameters(), lr=1e-3) + +for step in range(200): + optimizer.zero_grad() + out = model(batch['input']) + loss = loss_fn(out, batch['target']) + loss.backward() + grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 100.0) + optimizer.step() + if step % 20 == 0: + print(f"step {step:3d} loss={loss.item():.4f} grad_norm={grad_norm:.4f}") + +# Expected: loss drops to ~0 within 200 steps. +# If not: model can't even memorize 1 batch -- architecture or gradient problem. +``` + +**Gradient flow check (per-layer)** +```python +loss.backward() +for name, p in model.named_parameters(): + if p.grad is not None: + g = p.grad + print(f"{name:40s} grad: mean={g.mean():+.2e}, std={g.std():.2e}, " + f"max={g.abs().max():.2e}, zero%={100*(g==0).float().mean():.0f}") + else: + print(f"{name:40s} grad: None") # <-- not in computation graph! +# Check: no None grads (disconnected), no all-zero grads (dead layer), +# no huge grads (explosion), reasonable magnitude across layers. +``` + +**NaN/Inf detector hooks** +```python +def nan_hook(module, input, output): + def _check(t, label): + if isinstance(t, torch.Tensor) and (torch.isnan(t).any() or torch.isinf(t).any()): + raise RuntimeError( + f"NaN/Inf in {module.__class__.__name__} {label}, " + f"shape={t.shape}, nan={t.isnan().sum()}, inf={t.isinf().sum()}") + if isinstance(output, torch.Tensor): + _check(output, "output") + elif isinstance(output, dict): + for k, v in output.items(): + _check(v, f"output[{k!r}]") + elif isinstance(output, (tuple, list)): + for i, o in enumerate(output): + _check(o, f"output[{i}]") + +for name, module in model.named_modules(): + module.register_forward_hook(nan_hook) +# Run one forward pass. First module to raise = source of the NaN. +``` + +**Random input test** [Slavv] +```python +# Pass random noise instead of real data. If loss/error behaves the same, +# the data pipeline is destroying information before the model sees it. +model.eval() +real_batch = next(iter(train_loader)) +fake_input = torch.randn_like(real_batch['input']) +with torch.no_grad(): + real_out = model(real_batch['input']) + fake_out = model(fake_input) + real_loss = loss_fn(real_out, real_batch['target']).item() + fake_loss = loss_fn(fake_out, real_batch['target']).item() + print(f"Real input loss: {real_loss:.4f}") + print(f"Random input loss: {fake_loss:.4f}") +# If similar: model isn't using the input. Check preprocessing, data loading, feature selection. +# If very different: model sees real signal. Problem is elsewhere. +``` + +**Prime dimension trick** [Slavv] +```python +# Use prime/weird numbers for each dimension to catch silent broadcasting. +# If batch=7, seq=13, hidden=17, any mismatched reshape/view that "works" +# by accident with powers-of-2 will fail with primes. +x = torch.randn(7, 13, 17) # (batch=7, seq=13, hidden=17) +out = model(x) +print(f"in={x.shape} -> out={out.shape}") +# If this crashes but normal shapes don't: you have a broadcasting bug. +``` + +**Class imbalance check** +```python +from collections import Counter +all_labels = [] +for batch in train_loader: + labels = batch['target'] if isinstance(batch, dict) else batch[1] + all_labels.extend(labels.flatten().tolist()) +counts = Counter(all_labels) +total = sum(counts.values()) +for cls, n in sorted(counts.items(), key=lambda x: -x[1]): + print(f" class {cls}: {n:6d} ({100*n/total:.1f}%)") +# Ratio > 10:1 = likely need weighted loss or resampling. +# Ratio > 100:1 = model will predict majority class and look "accurate". +``` + +**Confidence-sorted error inspection** [common practice, cf. FSDL error analysis] +```python +# Find the model's most confident wrong predictions. These reveal +# systematic bugs (e.g., cropping cutting off relevant features). +model.eval() +errors = [] +with torch.no_grad(): + for batch in val_loader: + logits = model(batch['input']) + probs = torch.softmax(logits, dim=-1) + confidence, predicted = probs.max(dim=-1) + wrong = predicted != batch['target'] + for i in wrong.nonzero(as_tuple=True)[0]: + errors.append((confidence[i].item(), predicted[i].item(), + batch['target'][i].item(), i.item())) +errors.sort(reverse=True) # most confident mistakes first +for conf, pred, true, idx in errors[:10]: + print(f" conf={conf:.3f} predicted={pred} true={true} idx={idx}") +# Inspect the actual inputs for these indices. Pattern = systematic bug. +``` + +**Weight/bias distribution check** [Slavv, CS231n] +```python +for name, p in model.named_parameters(): + print(f"{name:40s} mean={p.data.mean():+.4f} std={p.data.std():.4f} " + f"min={p.data.min():+.4f} max={p.data.max():+.4f} " + f"shape={list(p.shape)}") +# Healthy: roughly Gaussian, std ~0.01-1.0 depending on init scheme. +# Bad signs: all zeros, huge values (>100), std ~0 (collapsed), NaN. +# After training: weights diverging to +/-inf = exploding. All same value = dead. +``` + +### 6.3 Triage decision tree + +Follow top-to-bottom. Stop at the first match. + +``` +START + | + v +[Exception / traceback?] --yes--> Read the traceback. Fix the error. Done. + |no + v +[Loss is NaN/Inf?] --yes--> Attach NaN hooks (6.2). Find first module producing NaN. + | Common: log(0), 0/0, exp(large). Add clamp/eps. + |no + v +[Init loss wrong?] --yes--> Check data pipeline (6.2). Check loss function. + | (see expected Check for double softmax (6.1). + | values in 6.2) Check labels match model output format. + | Run random input test (6.2): same loss? -> data destroyed. + | Init loss << expected? -> data leakage (Part 7.4). + |no + v +[Can't overfit 1 batch?] --yes--> Run gradient flow check (6.2). + | Any None grads? -> disconnected layer + | All-zero grads? -> dead layer / detach + | Check for autograd breakers (6.1) + | Check optimizer step ordering (6.1) + |no + v +[Loss stuck from step 0 (but CAN overfit 1 batch)?] --yes--> LR too low? Try 10x. + | Frozen params? Check requires_grad (6.1). + | Wrong loss function? + |no + v +[Loss decreases then explodes?] --yes--> LR too high? Try 0.1x. + | Gradient clipping? Log pre-clip norm. + | Numerical instability? (log, exp, div) + |no + v +[Train loss good, val loss bad?] --yes--> Overfitting. Not a bug. + | More data, regularization, smaller model. + |no + v +[Train loss okay but metric bad?] --yes--> Loss-metric misalignment. + | Is minimizing the loss equivalent to + | improving the metric? (Part 5) + |no + v +[Model outputs constant?] --yes--> Mode collapse. Check: + | - Class imbalance? Run label count (6.2). + | - All-zero init? Run weight check (6.2). + | - Dead ReLUs (try LeakyReLU)? + | - Confidence-sorted errors (6.2) reveal pattern? + |no + v +[Training is slow but not stuck?] --yes--> Not a bug. Consider: + | - Batch size (Part 1 Step 6) + | - Architecture depth/width + | - Data quality + |no + v +[None of the above?] + Read Part 1 (general) or Part 2 (RL-specific) for deeper diagnostics. + Log everything (Part 1 Step 3) and pursue anomalies. +``` + +### 6.4 LLM anti-patterns + +Things an LLM should NOT suggest when debugging ML code. + +**Don't suggest hyperparameter changes before verifying correctness.** +"Try reducing the learning rate" is the #1 wrong response to any training problem. Verify the code is correct first (Parts 1-2). HP tuning on buggy code wastes time. + +**Don't add try/except around training code.** +Training code should crash loudly. A caught exception in a training loop hides the bug and produces silently wrong results. The only exception: checkpoint saving on KeyboardInterrupt. + +**Don't suggest "try a different optimizer" as a debugging step.** +If Adam doesn't converge, the problem is almost never the optimizer choice. It's the loss, the data, the architecture, or a bug. + +**Don't add .detach() or .item() to "fix" gradient errors.** +If autograd complains, something is wrong with the computation graph. Adding .detach() silences the error by cutting gradient flow -- it doesn't fix anything, it makes the model stop learning from that path. Understand why autograd is complaining first. + +**Don't suggest lr_scheduler as a fix for non-convergence.** +Schedulers refine convergence, they don't cause it. If the model doesn't learn with constant LR, a scheduler won't help. + +**Don't suggest adding more layers or making the model bigger.** +If the model can't overfit one batch, more parameters won't help. The problem is gradient flow, loss function, or data. Fix those first. + +**Don't suggest "normalize your data" without checking if it's already normalized.** +Run the data sanity check (6.2) first. If data is already mean~0, std~1, normalization isn't the problem. + +**Don't wrap things in `float()` or `.to(dtype)` to suppress type warnings.** +Type mismatches are signals. A float32/float64 mismatch might mean you're mixing model weights with double-precision data. Fix the root cause. + +--- + +## Part 7: Debugging Folklore & Mental Models + +Part 6 tells you what to DO. This part tells you how to THINK. Use these frameworks when generating hypotheses, brainstorming causes, or deciding what to investigate next. + +### 7.1 Five mental models for ML debugging + +Pick the model that fits your situation. Each gives a different angle on the same problem. + +**1. Information flow: trace forward, trace backward.** +Data flows forward through the model; gradients flow backward. A bug anywhere in either direction corrupts everything downstream. When stuck: manually trace shapes and values forward from input through each layer. Then trace gradients backward from loss through each parameter. The break-point is where values go wrong. +- Forward: input -> preprocess -> embed -> layers -> head -> loss +- Backward: loss -> d(loss)/d(head) -> d(head)/d(layers) -> ... -> d(layer1)/d(params) +- Tool: gradient flow check (6.2), NaN hooks (6.2) + +**2. Ablation: remove things until it works.** [CS229] +Systematically remove components (regularization, augmentation, auxiliary losses, fancy layers). If removing X fixes the problem, X is the problem. If nothing helps, the bug is in the core (data or main loss). +- Start: turn off ALL regularization, augmentation, dropout, scheduling +- If it works now: add back one-at-a-time until it breaks +- If still broken: problem is in data pipeline, loss, or base architecture +- Tool: just comment things out and rerun overfit-one-batch (6.2) + +**3. Oracle substitution: replace each component with ground truth.** [CS229] +For pipeline systems (data -> features -> model -> postprocess -> metric), replace one component at a time with a perfect/oracle version. The component whose oracle gives the biggest accuracy jump is the bottleneck. +- Example: replace learned features with hand-crafted features. Big jump? -> feature learning is the problem. +- Example: replace model predictions with ground truth labels. Small jump? -> model is fine, problem is upstream (data) or downstream (metric). +- This is especially useful for multi-stage systems (NLP pipelines, detection + classification, etc.) + +**4. Bias-variance via learning curves.** [CS229, FSDL] +Plot train error and val error as a function of dataset size (or training steps). The shape tells you what to do: +- Both high (converging together): high bias. Model too simple, wrong features, or bug reducing capacity. +- Train low, val high (diverging): high variance. Overfitting. More data, regularization, smaller model. +- Both low: working. Ship it. +- Train low, val high, but val improves with more data: getting there, need more data. +- Val error flat even with 10x more data: not a data problem. Fix the model. + +**5. Structural ceiling: can the parameterization express what you want?** (Part 5 expands this) +Sometimes the metric is stuck not because the optimizer fails but because the architecture/parameterization literally cannot represent the desired function. Check: disable the loss term entirely. Does the metric reach the same value? If yes, the loss never moved it -- the model can't express higher values. + +### 7.2 Practitioner priors: what's usually wrong + +When you have no other information, investigate in this order. Rough estimates synthesized from [Goodfellow, FSDL, Slavv, Jones, CS231n] -- not measured frequencies, just practitioner consensus on what's usually wrong: + +1. **Data pipeline** (~40% of bugs). Wrong preprocessing, labels misaligned with inputs, normalization missing or wrong, train/test leakage, data loader returning stale/wrong batches. "It's almost always the data." [FSDL, Slavv] +2. **Loss function** (~20%). Wrong loss for the task, wrong sign, double softmax, loss not connected to metric, competing losses canceling gradients. +3. **Training procedure** (~15%). Wrong optimizer step order, missing zero_grad, wrong LR, frozen params, in-place ops breaking autograd. +4. **Architecture** (~10%). Too small (can't express), too deep (vanishing grads), wrong activation, missing skip connections. +5. **Hyperparameters** (~5%). LR, batch size, weight decay. Almost never the real problem if the code is buggy. +6. **Numerical issues** (~5%). NaN, overflow, underflow. Usually a symptom of something else. +7. **Environment/infrastructure** (~5%). Wrong library version, GPU memory, nondeterminism, stale cache. + +For RL specifically, add: +- **Reward scale/sign** as a top-3 issue [Henderson, Schulman]. Rescaling from [-1,1] to [0,1] or vice versa can be the entire difference. +- **Episode boundary handling** (done signals, reward discounting across resets) [Jones]. + +### 7.3 The debugging mindset + +Core attitudes are covered in Part 1 ("Assume you have a bug," "Pursue anomalies," "Loss curves are a red herring") and Part 2 ("Working from reference implementations"). Here are the additional mental habits not covered there: + +**"Think more, experiment less."** [Rahtz 2018] +When runs take hours, spend 30-60 minutes mapping hypotheses before launching. Rank by likelihood given all evidence. Only run experiments that distinguish between your top hypotheses. Rahtz: "Switching from experimenting a lot and thinking a little to experimenting a little and thinking a lot was a key turnaround." + +**MurphyJitsu pre-flight.** [Rahtz 2018] +Before starting a run, ask: "If this run fails, what would the most likely cause be?" If you can name it, test for it first. This is the rationalist habit of "pre-hindsight" -- imagining the failure and working backward. + +**"Tricks substitute for each other."** [Schulman 2017] +Many normalization/regularization tricks do roughly the same thing. Adding more tricks adds complexity without proportional benefit. If you have three normalization schemes and the model still doesn't work, the problem isn't normalization. + +**Diff against reference implementations.** [Henderson 2018, Jones 2021] +When stuck, diff your code line-by-line against a working reference. The bug is usually in something "trivial" -- episode resets, advantage normalization, dtype. Henderson et al. 2018: "implementation differences which are often not reflected in publications can have dramatic impacts on performance." See Part 2.9 for details. + +### 7.4 When to suspect the data + +Specific signal patterns that point to data problems. + +| Signal | Diagnosis | Action | +|--------|-----------|--------| +| Init loss << expected (e.g., 0.01 instead of 2.3) | Data leakage or shortcut. Model "knows" the answer at init. | Check: are labels in the input? Is test data in train? Is there a trivial feature? | +| Random input gives same loss as real input (6.2) | Data pipeline destroying information. Preprocessing too aggressive, wrong transforms, input all zeros. | Print raw data at each pipeline stage. Visualize. | +| Model predicts same class for everything | Class imbalance. 100:1 ratio = model learns "always predict majority." | Run class balance check (6.2). Use weighted loss or resample. | +| Val loss much worse than expected but train is fine | Distribution shift. Val set from different distribution than train. | Check: same preprocessing? Same time period? Same source? Use dual val sets [FSDL]. | +| Learning curve flat even with 10x more data | NOT a data problem. High bias. Model too simple or wrong features. | Add capacity, fix features, check for bugs reducing effective capacity. | +| Adding data makes val WORSE | Data quality issue. New data is noisier or from wrong distribution. | Inspect recent additions. Check label quality. | +| Model works on reference dataset (MNIST/CIFAR) but not yours | Your data is the problem, not the model. | Simplify your data (fewer classes, clean labels, easy examples only). Scale up gradually. [Slavv] | + +### 7.5 Batch size & learning rate folklore + +These interact in non-obvious ways. Get them wrong and training looks broken even with correct code. + +**Critical batch size** [McCandlish 2018]: there's a batch size B_crit below which doubling batch size ~halves training time (compute-efficient), and above which it doesn't help (just wastes compute). B_crit depends on the task and increases during training as the loss decreases. + +**LR must scale with batch size.** [McCandlish 2018, Goyal et al. 2017] +- Linear scaling rule (SGD): if you double batch size, double LR. [Goyal et al. 2017] +- For Adam: the scaling exponent is between 0.5 and 1 (between sqrt and linear), task-dependent. [McCandlish 2018] +- Changing batch size without adjusting LR is a common silent mistake. + +**Adam default LR = 3e-4.** [FSDL, Karpathy] +This is the "just works" starting point. If you're using Adam and haven't tuned LR, start here. Karpathy: "3e-4 is the best learning rate for Adam." + +**Big batches need warmup.** [Goyal et al. 2017] +Large batch training with high LR diverges at the start. Warm up LR linearly over the first few hundred steps. Without warmup, you'll see loss spike/NaN in the first epoch and think the code is broken. + +**Batch size signals:** + +| Symptom | Likely cause | +|---------|-------------| +| Training very noisy, loss oscillates | Batch too small. Gradient noise overwhelms signal. Try 4-8x larger. | +| Training smooth but slow, poor generalization | Batch too large without LR scaling. Try higher LR or smaller batch. | +| Loss spikes at start then recovers | Normal with large batch + warmup. If no warmup: add it. | +| Different results at different batch sizes (same total steps) | Missing LR scaling. Adjust LR proportionally. | diff --git a/docs/dlbooks b/docs/dlbooks new file mode 120000 index 0000000..66f0d05 --- /dev/null +++ b/docs/dlbooks @@ -0,0 +1 @@ +/media/wassname/SGIronWolf/projects5/2026/far_ai/dlbooks \ No newline at end of file diff --git a/docs/evidence/[English (auto-generated)] Deep RL Bootcamp Lecture 6 Nuts and Bolts of Deep RL Experimentation [Do(1).txt b/docs/evidence/[English (auto-generated)] Deep RL Bootcamp Lecture 6 Nuts and Bolts of Deep RL Experimentation [Do(1).txt new file mode 100644 index 0000000..69e9285 --- /dev/null +++ b/docs/evidence/[English (auto-generated)] Deep RL Bootcamp Lecture 6 Nuts and Bolts of Deep RL Experimentation [Do(1).txt @@ -0,0 +1,1008 @@ +so last year at nips I was slated to +give a talk at the deep RL workshop and +I wasn't sure what I was going to talk +about because everything I had prepared +I had already talked about it so many +times that I just didn't want to didn't +want to give another talk on it so I I +asked Peter for his advice on what I +should talk about and he said that +Entering had given the talk earlier in +the conference I called the nuts and +bolts of deep learning where he sort of +went through the flowchart of what you +do when you see a new problem and like +if you if you're overfitting you regular +eyes and if you're underfitting then you +use a bigger model and so on so so Peter +suggested to come up to write a talk +called the nuts and bolts of deep RL +research where I would talk about some +of the similar lessons and the tips and +tricks for the RL setting so I put +together a talk for that and actually +people seem to like it +so I'll give a slightly updated version +of that talk right now so I'm going to +talk about a few different things some +of which are general and sort of apply +to RL using reinforcement learning in +general and some of them pertain to +particular classes of methods like +policy grading methods and these are +just sort of little tips and tricks for +how you how you get your algorithm to +work and what you do day to day so let's +say you have a totally new problem +you're trying to solve like you have you +have some new tasks and you figured out +how to you defined an observation in an +action space and you have your neural +network policy or Q function but and you +want to start learning learning how to +solve it but you but you've never tried +it before +um so or okay or if you have a new +algorithm you're trying to get working +that you've never you you've never used +it before so so what do you do what's +the first thing you do if you have a new +algorithm +so so that I mean my first advice would +be to use the small problems so you can +run a lot of experiments really quickly +and do a hyper parameter search and it's +really useful too +to be able to visualize the learning +process in as many ways as possible so +look at the state visitation like how +that's evolving over time and look at +how well your value function is fitting +and so on so like I spent a lot of time +looking at the pendulum problem where +you're trying to swing up a pendulum +because this problem has a 2d state +space where it's just the angular and +the angular velocity of the pendulum and +I would visit visualize here's exactly +what the value function looks like +here's exactly what the state +distribution looks like and here's how +they evolve over time so I would get a +sense for like what's if my algorithm +isn't working is it because it's like +oscillating in some funny way or maybe +it's just giving a bad fit or maybe the +function it's learnt the value function +alerting isn't smooth enough and so on +so I would say try to visualize +everything and maybe use small problems +where you can visualize everything also +yeah it's useful to construct toy +problems where your idea is going to be +the strongest where you think okay if +this idea has any possibility of working +it's going to work there so for example +let's say you're trying to do something +with hierarchical reinforcement learning +then construct some problem where +there's some kind of obvious hierarchy +that it should learn and you'll be able +to tell if it's doing the right thing +also construct the the problems where +it's going to be weakest obviously and +also as a counterpoint to that don't +over fit your method to some contrived +problem so let's say you've come up with +some toy problem where your method is +really good then don't realize that it's +a toy problem and don't like tweak +everything to just work on this toy +problem perfectly because yeah it's also +pretty useful to have medium-sized +problems that you're very familiar with +and you know exactly how fast the +learning should be and what the reward +should be at every iteration and so on +so +a few problems that I use a lot like +training on pong Atari and the hopper +would the hopper like problem which is +this simulated robot problem with this +hopping robot and I know exactly how +fast an algorithm that's working should +learn on these problems so so I can sort +of it's it makes it easier to tune +things if you have okay that's if you +have a new algorithm let's say you have +a new task I would recommend just making +the task easier until you start seeing +some signs of life you see it learning +something so so there are various ways +you can make it easier you can try doing +some feature engineering so your input +features you think that the you think +that the policy should be a simple +function of your input features like +let's say you're trying to get pong to +work and you tried setting it up with +the images as input and you weren't +learning anything then you can set up +the problem where you pass in XY +coordinates as input and then try +running your algorithm and it's a much +simpler function you're trying to learn +so that's much more likely at work and +then you can try to make it harder and +harder until you're solving the full +problem another way you can make it +easier is by shaping the reward function +that means you if you come up with some +reward function that gives you fast +feedback code on whether you're doing +the right thing or not so let's say we +can define one task where we have this +reaching robot and we just give it a +reward if it reaches if it hits the +target so it gets a reward of one if it +hits the target in zero otherwise so +that might be hard to learn because +you're not getting any feedback as +you're flailing around but we could +define a better shaped reward function +where the where it's just distance to +target then learning is going to be much +faster in that problem there's also the +problem on exactly how to turn your +problem into a pom DP in the first place +so so often it's not clear what your +observation features should be and it's +not even clear what the reward function +should be so or it's not clear if this +problem you're trying to solve is +if it's feasible at all so so let's say +you're trying to solve you you have some +game or some robotics task or something +new like and you you want to turn it +into a reinforcement learning problem +but you're not sure if this is feasible +at all +so the first thing to do is to just +visualize a random policy acting on this +problem and see see what happens so if +the random policy occasionally does the +right thing then there's a high chance +of reinforcement learning is going to +work because bringing forth a policy +grading method is just going to take +this random behavior and it's going to +make the look the good behaviors more +likely +so it'll gradually like hone in on the +good behaviors whereas if you're never +doing the right thing then then there's +RL isn't going to get any signal that +tells it to do the right thing sometimes +RL is able to learn even though it seems +like it it's not clear how it's going to +learn like learning how to walk it's not +clear that that should work but because +you would think that like you really +have to have the whole thing in the +whole policy in place before it does +anything useful but as it turns out you +sort of learn to take one step and then +fall over and then take two steps and +then fall over and so on until you've +got a proper walking gait okay another +thing to do is to make to make your +observations make sure your observations +are useable try to look at them as a +human and see if you can control the +system using the same observations +you're giving to the agent so let's say +you're doing some pre-processing on your +images look at those pre processed +images yourself and make sure you're not +like losing too much detail when you +downsample them or losing too much or in +the color transformations and so on +another thing to do is you want to make +sure that everything is reasonably +scaled so that for example well as a +rule of thumb you usually want +everything to be mean 0 and standard +deviation 1 for the observations and for +the rewards well it's a little less +obvious but that's a reasonable +heuristic so +so you might want to like a scaler using +some kind of filter I mean that's that's +another good thing you can do but if you +don't want to mess with some kind of +filters on your observations and rewards +what you can do it you can just kind of +if you're allowed to define those +yourself then you might want to just +scale them yourself so what I'd +recommend doing is plot histograms of +all of your observations and your +rewards and make sure that for each +component of the observations and +rewards you've scaled it properly so +that it has the right mean insanity +deviation and it doesn't have crazy +outliers okay another thing to do is you +should have some good baselines that you +can use whenever you see a new when you +whenever you see a new problem so just +it's not clear which algorithm is going +to work beforehand so make sure you you +just have a bunch of a bunch of like +well tune things that you can run on +each problem yeah okay the question was +if you're gonna do some kind of reward +normalization should you do this over +your whole training like all of your +training data or just like the recent +data I would yeah that's a there's a lot +of subtlety there so I would say use all +of your data so far because you're +making everything non-stationary if you +do some kind of filtering actually I'm +going to talk about this at a later +slide so anyway I would recommend as +just a few baselines you should have a +cross and to be method some policy +grading methods some kind of cue +learning or sarsa type method there's a +lot of code online now that you can use +other people's code that that's already +written so you can use like we have this +open AI baselines repository and also +our L lab has a bunch of algorithms okay +another thing to do which people often +get tripped up on especially when +they're trying to reproduce published +work is so you implement the algorithm +based on the paper +and then it doesn't really learn +anything at all and then you think oh +maybe mike is my code like wrong or what +happened so I would say early on you +might need to run with more samples than +expected +so one hyper parameter that you can +usually adjust is how big of a batch +size to use or how many samples to use +and I would say sometimes you should use +more samples than you think you're going +to need because usually things just work +better when you have more samples almost +always so often sometimes when you're +trying to reproduce a published paper +you've got it mostly right but not +exactly right like maybe you haven't +scaled everything properly or there's +some like there's some really like +obscure hyper parameter that you have +wrong and then you just find that the +code doesn't learn anything so then I +would say just try to make it work a +little bit and then you can work from +there and try to tweak all the hyper +parameters to to get up to the like to +get fully up to the publish performance +but if you want to just get something +working at all often you need to use +bigger batch sizes and you thought +because if your batch size is too small +than the nor the noise will overwhelm +the signal and you won't learn anything +so like for example for TRP oh I wasn't +seeing any learning for a while and then +it turned out it's just because I was +using too small of a batch size and I +had to use a hundred thousand time steps +of a batch for the batch size but and +for Atari they for dqn the type of +parameters that were found to be best +where you update every ten thousand time +steps you update your queue function +every ten thousand time steps and you +have a 1 million time steps in your +replay buffer which is a lot okay so now +I'll talk about some guidelines for on +for the ongoing development and tuning +process as opposed to the initial +process of I have a totally new problem +or a new algorithm that I want to see +some signs of life on so +let's say you get something working I +recommend looking how sensitive your +algorithm is to every hyper parameter +and if it's too sensitive it it's not +actually a robust algorithm then you +shouldn't be happy with it you probably +just got luck lucky on that one problem +and it's it's actually kind of possible +to have a method that does that is a +fluke and it works in one way because +it's I mean one problem because of some +funny dynamics but then it doesn't work +in general so you kind of have to it +need some serious improvements so yeah +so that's okay there's also a few things +you can look at to see that actually I'm +going to talk about more of these kind +of Diagnostics a little later but there +are some indicators that'll tell you if +that if your algorithm is working +besides just looking at the final +performance but other in look for other +indicators that are going to tell you +that your optimization process is kind +of healthy so this is going to vary +based on the algorithm but for example +you can look at whether your value +function is actually accurate like +whether it's actually predicting returns +well you can look at how big the updates +are in terms of some either parameter +space or the output space standard +Diagnostics for deep networks like you +can look at norms of gradients and so on +okay one thing that takes some +discipline but is very useful is to have +a system for continually benchmarking +your code and that includes all of your +code not just the one thing you're +tuning right now because often it's easy +to tune your algorithm to work well in +one problem and then mess up the +performance on other problems and it's +really easy to overfit on single +problems when you're just adjusting +hyper parameters so I'd really recommend +having some kind of benchmark you can +run frequently and some kind of battery +of benchmarks that you've run +occasionally along as similar lines of +like overfitting of sort of reading +too far into noise or over interpreting +noise it's really easy to just to think +you're improving your algorithm or +you're making it worse but really you're +just seeing random noise so so you can +see seven different tasks these are the +Jim Moo Joko tasks like half cheetah and +hopper and so on and you have three +different algorithms here the red one +the green one and the blue one and you +can see ok let's I mean we can see that +the performance is a little different on +all the problems but let's it looks like +the the red let's see does the green +which one looks like the best well it +kind of varies by problem like the blue +one looks better on this problem and the +red one is worse on this problem and so +on but as it turns out these are all the +exact same algorithms and just random +seeds different random seeds so so it's +easy to imagine that you're just looking +at one of these problems then you see +that blue curve and you think you get +really excited than you think you found +some huge improvement to your algorithm +but it's really that you just got a +lucky seed that one run so yeah really +you've got to run your algorithm +multiple times an average and even if +you're averaging over a lot of seeds +like even if you had like 20 seeds here +there's a still a pretty big error bar +so it's yeah that makes it particularly +hard +I mean I'd recommend having like +multiple tasks and multiple seeds and if +you don't do that then you're probably +just overfitting unless you see a really +drastically large improvement another +thing to do is it's easy to keep adding +little modifications to your algorithm +until it gets really complicated and +then you're not sure and then you think +you have this really complicated +algorithm which is perfect but it turns +out that most of the things you did are +unnecessary because base some of the +tricks substitute for each other this is +often true because a lot of tricks help +because they're like normalizing things +in a better way or improving your +optimization like making +your optimization less susceptible to +like big spikes I don't know a lot of +different modifications you make have +similar effects so so often you you can +remove them and simplify your algorithm +and this is pretty important so it's +like especially with regard to changes +that do whitening these kind of these +kind of all substitute for each other +and also substitute for changes to your +optimization algorithm and yeah I would +and I would simplify things because it's +then it's more likely that your insights +will generalize to other problems and +also lastly it's pretty useful to +automate your experiments because +otherwise you're going to end up +spending all your day your whole day +just watching your code prints out +numbers and and it's actually really +it's it's really tempting to spend all +day doing that but I would I mean +especially if you need to run multiple +random seeds then it's then you you +really need to get your work flow down +so the year you're automating this +process and launching lots of +experiments at the same time so I'd +recommend just getting set up with one +of these cloud computing services so you +can just launch experiments on remote +instances and pull the results back when +you're done question oh yeah question is +you have a recommendation on what +framework to use to keep track of your +experiment results I personally use no +framework at all and I just have like +ipython notebooks and scripts that +collect a bunch of data that's stored in +various log files so I just have scripts +that read all my log files and plot them +I don't use some people like having +databases and stuff where they store all +their hyper parameter results but on I +think I don't find it necessary +personally okay so now I'm let's see I'm +going to talk about general tuning +strategies for RL and then after that +I'll talk about some specific tuning +strategies for different classes of +algorithms +okay so one thing is widening or +standardizing your data so if your +observations have unknown range you +should definitely standardize them I +would do that by computing a running +estimate of the mean and the standard +deviation and then just transform it Z +transforming it like this +and I would recommend computing the mean +and the standard deviation over all data +you've seen so far not just your recent +data because otherwise you're +effectively changing your data in some +way that the policy doesn't know about +like you have your that your policy +grading algorithm doesn't know about +like your policy grading algorithm is +actually optimizing some objective so +then if you just go and change the +problem out from under it then you're +often going to make things a lot worse +like if you rescale your observations +then your optimization algorithm didn't +know about that so you might just +collapse the performance so that's why I +would recommend using your whole all of +your data from the start of time so that +at least it's going to slow down over +time how fast it's how fast your +scalings are changing so yeah that's +what I would recommend doing with the +observations and for the rewards +I'd recommend rescaling it but not +shifting them because that affects the +agents will to live so if you if you +shift the mean reward that'll affect +whether how long it wants to survive +you're actually changing the problem ok +another yeah you might also want to try +to standardize prediction targets in the +same way though that's a little more +complicated to do using okay yeah so +question is what about pca widening +instead of just this element why scaling +yeah that could that could definitely +help I haven't I haven't experimented +with that but yeah that could help it's +hard to predict with like with neural +nets if it's going to help or not +because they seem to be pretty good at +disentangling things so I know that if +you have things that are terribly scaled +like they're from negative one thousand +two one thousand and other coordinates +are from negative point +point one then it's gonna be slow for +learning so this kind of scaling helps a +lot even though you're having their own +networks okay there's some parameters +that are really generally important like +discount factor that determines whether +you're that determines how long how far +away you're doing credit assignments so +whether you're paying attention to +effects that are delayed by a certain +time so if your discount is gamma equals +point 99 then you're basically ignoring +effects that are more delayed by a +hundred time steps so so you're kind of +short-sighted that gamma is controlling +your shortsightedness and you might want +to actually look at if how long that +corresponds to in real time so usually +in reinforcement learning you're sort of +discretizing time in a certain way and +it's worth paying attention to like is +that 100 time steps like three seconds +of real time or what and what happens +during that time also note that if you +have TD lamda kind of methods for either +for value function estimation or for +policy grading methods you can get away +with using a Lambda gamma that's really +close to one like 0.999 and things +aren't going to go unstable because if +you have a lower land of like 0.9 then +that's going to make it so the algorithm +is still stable even though gamma is +really close to one also okay so so as I +mentioned you might want to in in +practice we're usually discretizing some +continuous-time system so then it's +worth seeing if the problem can actually +be solved at this discretization level +so so for example in a game let's say +you're you're doing frame skip a meaning +that you repeat the action multiple +times as a human can you control it at +this rate or is it just impossible to +control is it just too like you're doing +the action too many times in a row and +you have to slow responses to control it +and I would also just look at the what +the random exploration looks like and if +you make sure that you're exploring like +the the +this Croatian is going to determine like +how far your Brownian motion goes +because if you're doing the same action +many times in a row then you're going to +be able to then you're going to tend to +explore further so so it's worth just +looking at what the random exploration +does and and choosing your time +discretization in a sensible way so that +it does interesting things question yeah +so the question is if you have a DQ n +how would you get started like tuning it +with tuning all the hyper parameters +actually I'm going to talk about DQ n +pretty soon so yeah I'll get to that +okay also look at the episode returns +very closely look at don't just look at +the mean look at the minimum and the +maximum so the maximum especially if you +have a deterministic system if you have +a certain maximum return that's +basically something that your policy can +hone in on pretty straightforwardly +because if if you just do that every +time then you're going to increase your +mean return to that level so so it's +worth so so it's useful to look at the +max return to see if your policy is ever +doing like the right thing according to +that max return or if it's just kind of +stuck and it's never discovering the +high return strategy also look at the +episode in length which is sometimes +more informative than the episode reward +like if because sometimes well yeah I +won't go into details on that like well +if you have a game you're it might mean +that like you might be losing every time +so you're never seeing yourself win but +the episode length will tell you if +you're losing slower so you might see an +improvement in episode length at the +beginning but not in reward okay for +Policy gradient there are specific +strategies or prediction there are +specific Diagnostics that are really +helpful so look at the entropy really +carefully if your entropy is going down +too fast that means your policy is +becoming deterministic and it's not +going to explore anything so +so be careful and also if it's not going +down your policy is never going to be +that good because it's always really +random else so you can sort of alleviate +this issue by using an entropy bonus or +a KL penalty so by stopping yourself +from move changing the policy the +probability distribution too fast as a +side effect you also prevent the entropy +from going down too fast when you use +the KL penalties I also look at the KL +as a diagnostic like look at how big of +an update you're doing in terms of KL +divergence if your KL is like 0.01 +that's a pretty small update but if it's +like a 10 that's a really big update +question oh yeah how do you question is +how do you measure entropy so so if you +have for most policies you can compute +the entropy analytically so if you have +a discrete action space then you usually +can just compute it analytically and if +you have a continuous policy you're +usually you're using a Gaussian +distribution or something so you can +compute the differential entropy +analytically so here we're talking about +entropy in action space so the average +over state space of the action space +entropy what you actually might care +about even more is the entropy in state +space but you have no hope at actually +calculating that except maybe to do some +really crude approximation of it +okay yeah so KL is really useful look at +explain variants like whether your value +function is actually explaining is +actually a good predictor of the returns +or if it's just worse than predicting +nothing so if you just predict zeroes +then your explained variance is zero but +sometimes if you have some neural +network that's predicting then you find +that it's actually negative because it's +overfitting or it's just noisy and it's +not doing anything useful so that +probably means you need to tune some +hyper parameters so that your neural +networks actually predicting better than +the constant predicting zero question +okay yeah question is why does the KL +spike give you a loss in performance +well it doesn't always be a lot it's not +always a loss in performance sometimes +it's a gain in performance but in +practice it's usually a loss in +performance because it usually the +approximation that your policy gradient +is just taking you way outside the +region where your local approximation to +the policy performance is accurate so +you're you're probably just overshooting +like if you take your policy and you +take a really big step in any direction +you're probably making it worse so so +that's so usually if you take a big step +you're getting worse like if you have a +convex function if you take a big step +in any direction you're probably going +to make it worse let's see okay +initialize your policy that's pretty +important more important than in +supervised learning because that in +determines what data you're going to see +initially and you're going to learn from +at the beginning so I would recommend +using have initializing the final layer +to be either zero or really small so +that at least you you have the maximum +and you sort of explore randomly at the +beginning we randomly at the beginning +as opposed to having some kind of +particular like policy that has a strong +opinion on the right thing to do which +is based on no information at all okay +that's for Policy gradient for Q +learning so a few thing a few things one +is okay you often it helps to have a +really big replay buffer and to be able +to do this you need to be a little +careful about memory usage so it's worth +putting in the extra effort to do that +learning rate schedules are often quite +helpful here in practice as our +exploration schedules so in qdq any +you're usually using epsilon greedy and +it often helps to do to play with the +schedule on that also it converges +pretty slowly and it has a miss +serious warmup period at the beginning +often so so sometimes you just so I +actually have a lot of admiration for +the authors who originally got this the +people people who got this to work +originally because they had to just let +their code run for a while before it did +anything so so you have to have a lot of +patience - a lot of bravery to do that +ok this is just miscellaneous advice for +not necessarily for tuning algorithms +but just for for personal development so +I recommend reading older textbooks and +theses not just the latest conference +papers because often they up in them +like there are more dense source of +useful information whereas each +conference paper just has one idea ok +yeah don't get too stuck on problems +because often you actually have a +legitimately good algorithm but it's has +like some flaws so its might fail +miserably at some easy problem so in RL +there's some like simple problems like +cart will swing up where you have this +stick and you're trying to swing it up +by moving the cart around and this +problem like you might have a great +algorithm but it's gonna in my but like +some of the state-of-the-art algorithms +are gonna fail on that problem unless +you really tuned them carefully and +that's just because maybe it's not +exactly the right problem to start to I +mean maybe like the thing that makes +this problem hard is not the thing that +your algorithm is doing that's +interesting so you might have like come +up with a better policy grading method +but still it'll converge to the same +local minimum on that swing up problem +and you're not gonna fix that problem so +I I would say just don't get too stuck +on a single problem that your method +bails on and enough in like maybe the +ultimate algorithm will solve all of +these problems but we're not there yet +so you might as well just try to improve +and some like decently large subset of +problems so also like one funny thing is +the dqn performs pretty poorly on a lot +of problems especially with continuous +control +I think it does I mean for cartful it +probably solves it pretty well if with a +reasonable amount of tuning but some of +the other like fairly small continuous +control problems it fails on but that +doesn't mean it's like that doesn't mean +it's a bad algorithm because it solves a +different problem extremely well so yeah +I would say just these these things are +at least right now it's not gonna you +shouldn't expect to be able to solve +everything with the same method without +any tuning also techniques from +supervised learning often don't transfer +over to reinforcement learning so so +don't be surprised if you find that I +guess that's not I said this slide was +gonna be a bad personal development +that's not about personal development +but yeah I guess this is just a grab bag +of miscellaneous advice so yeah so like +Bachner a lot of people look at what +people are doing in RL and they think +why aren't you using batch norm or drop +out or or big networks why are you using +like two layers of 64 units and it's not +like people didn't think of trying these +other things they tried them and then +they found that those other like +architectures and methods don't actually +help here I mean if you figure out how +to make batch norm and drop out actually +help in RL that'll actually be really +great and a big that would be a big +development but yeah I don't know it's +not totally straightforward all right +that's all thank you +okay I have a few minutes for questions +yeah so the question is how long do you +wait until you've decided that your +algorithm is new at work either because +your code is wrong or it's just too hard +I don't have a good general answer to +that I think the problem is worse for +some algorithms and others I'd say for +policy gradient methods you don't see +that burnin period as much like often if +it's going to learn it'll learn it at +the beginning but that's not always true +either I mean sometimes it will kind of +take some time to get into the right +numerical regime so I don't yeah I don't +have general advice I would say you have +to just I would say go back and start +with the easy problems and you'll get +some intuition about whether you're you +should expect a you should expect a burn +in period or not where it's not learning +anything see I want to get some people +in the back because okay oh yeah +question is - do I use unit tests I use +unit tests for code that where there's +it's doing a very particular +mathematical thing that you can actually +write a test for like let's say I'm +computing the KL divergence then I'll +write a test to check I don't know their +various ways of testing it so and it's +easy to get those things wrong like you +have it's I don't know as you're off by +a constant or something so yeah I would +write tests for I write tests for things +where it's nothing that there's a very +well-defined correct thing to do it's +harder to write it for an algorithm +where it has a lot of different moving +parts where you it's not clear how fast +it should learn and it's also there's +some randomness involved +so if you try to write a test saying I +should be at performance 100 after this +many iterations it might fail just out +of random noise but yeah I think +probably unit tests are a good idea oh +yeah so the question is do I have +guidelines on matching the algorithm to +the task like when to use policy +gradients versus a value iteration style +method it's yeah it's hard to give some +general guidelines I think people have +found that and and the guidelines I give +you might just be just be kind of +historical accidents like someone got +this to work here and this to work there +so I think the well certainly if you +don't care that much about sample +complexity policy gradient methods are +are probably are probably the way to go +if you don't care about sample +complexity or using off policy data then +policy grading methods are probably the +safest bet because you I don't know it's +more understandable exactly what it's +doing it's just doing gradient descent +whereas q-learning it's a little bit +indirect what it's doing so it's and it +in practice is more finicky yeah if you +do care about sample complexity though +or need off policy data then hue +learning is usually better or yeah or a +few students like sample complexity is +relevant if your simulator is expensive +of course I would also say that people +have found that dqn and relatives have +worked well on game-like tasks with +images as input whereas policy grading +methods work better on the continuous +control tasks like these robotic +locomotion problems though that this +might not be fundamental it might be +more of a historical accident let's +oh yeah recommendations on older +textbooks let's see there's like brutes +a cuss for take us as books +that's approximate dinette what is it +optimal control and why am i blanking on +the name optimal control and dynamic +programming something like that and the +set I mean sudden embargo is a good one +to read butterman has a textbook kind of +a classic textbook on Markov decision +processes that's in the RL space then +there's books on numerical optimization +that are good and yeah I'd say obviously +the machine learning textbooks have a +lot of good material that might be +useful in the RL setting too +oh yeah can I comment on evolution +strategies and the blog posts the the +opening I blog post on it let's see do +you have any specific questions about it +or like how it compares +oh yeah okay yeah yeah so there's +there's a lot of policy grading methods +out there and some of them are quite +complicated so we've had a couple of +talks on them so far like all these +different work +it's excessively more complicated policy +grading methods but then there's this +old algorithm called evolution +strategies which is an extremely simple +algorithm and and there's a paper by +some of my colleagues where they show it +was called evolutionary strategies as a +scaleable alternative to reinforcement +learning which really meant like to +policy grading methods so and they +claimed that it worked basically as well +as policy grading methods or at least +it's sort of in the same order oh and +beer is one of the authors of that paper +so the claim was that it works it works +like similarly well to policy grading +method so why should we bother with +these policy grading methods if es works +just as well well I think in practice it +works well it works but it works not not +as well like it's it takes me the sample +complexity is is is worse by some +constant factor or it's not clear that +it's a constant factor or if this factor +scales with the size of the network but +it's it is a lot it is significantly +slower and the question is just what is +that constant factor so is that constant +factor like one or is it three or is it +10 or 100 so that's not that's going to +vary between problems and also the that +paper had some innovations in exactly +how to parameterize the networks and so +forth that made everything better +numerically everything better scaled so +that yes did work well but I would say +that if you that it's usually quote like +I don't know it's usually a pretty +decent constant factor slower than +policy grading methods especially the +more advanced ones like the PPO and +actor so so i'm i think it's it's not +really a clear win in the RL setting +where policy gradients work I think if +policy gradients work it's usually going +to be a lot better +and the es is going to be is going to be +better on problems where policy +gradients aren't going to work for some +reason like if you've got really long +you depend time dependencies where the +discounts are gonna are gonna ignore +them +then es might be less sensitive to that +let's see I think I'm okay last question +oh yeah favorite hyper parameter +optimization framework I've used some of +these than I just like to use the +uniform random sampling yeah that works +really well I mean you just run a bunch +of experiments with random hyper +parameters and then you just look at the +results the next day and to do some +regression to figure out which +parameters actually mattered and then +you've run another experiment with +better parameter ranges and so on so I +use the human version of it because +often it's just - it's like a it's it's +useful to be able to look at the results +yourself - to get some to figure out +which parameters actually matter so +you're not wasting a lot of computation +because that information transferred +between problems all right +[Applause] \ No newline at end of file diff --git a/docs/evidence/[English (auto-generated)] Deep RL Bootcamp Lecture 6 Nuts and Bolts of Deep RL Experimentation [Do.txt b/docs/evidence/[English (auto-generated)] Deep RL Bootcamp Lecture 6 Nuts and Bolts of Deep RL Experimentation [Do.txt new file mode 100644 index 0000000..69e9285 --- /dev/null +++ b/docs/evidence/[English (auto-generated)] Deep RL Bootcamp Lecture 6 Nuts and Bolts of Deep RL Experimentation [Do.txt @@ -0,0 +1,1008 @@ +so last year at nips I was slated to +give a talk at the deep RL workshop and +I wasn't sure what I was going to talk +about because everything I had prepared +I had already talked about it so many +times that I just didn't want to didn't +want to give another talk on it so I I +asked Peter for his advice on what I +should talk about and he said that +Entering had given the talk earlier in +the conference I called the nuts and +bolts of deep learning where he sort of +went through the flowchart of what you +do when you see a new problem and like +if you if you're overfitting you regular +eyes and if you're underfitting then you +use a bigger model and so on so so Peter +suggested to come up to write a talk +called the nuts and bolts of deep RL +research where I would talk about some +of the similar lessons and the tips and +tricks for the RL setting so I put +together a talk for that and actually +people seem to like it +so I'll give a slightly updated version +of that talk right now so I'm going to +talk about a few different things some +of which are general and sort of apply +to RL using reinforcement learning in +general and some of them pertain to +particular classes of methods like +policy grading methods and these are +just sort of little tips and tricks for +how you how you get your algorithm to +work and what you do day to day so let's +say you have a totally new problem +you're trying to solve like you have you +have some new tasks and you figured out +how to you defined an observation in an +action space and you have your neural +network policy or Q function but and you +want to start learning learning how to +solve it but you but you've never tried +it before +um so or okay or if you have a new +algorithm you're trying to get working +that you've never you you've never used +it before so so what do you do what's +the first thing you do if you have a new +algorithm +so so that I mean my first advice would +be to use the small problems so you can +run a lot of experiments really quickly +and do a hyper parameter search and it's +really useful too +to be able to visualize the learning +process in as many ways as possible so +look at the state visitation like how +that's evolving over time and look at +how well your value function is fitting +and so on so like I spent a lot of time +looking at the pendulum problem where +you're trying to swing up a pendulum +because this problem has a 2d state +space where it's just the angular and +the angular velocity of the pendulum and +I would visit visualize here's exactly +what the value function looks like +here's exactly what the state +distribution looks like and here's how +they evolve over time so I would get a +sense for like what's if my algorithm +isn't working is it because it's like +oscillating in some funny way or maybe +it's just giving a bad fit or maybe the +function it's learnt the value function +alerting isn't smooth enough and so on +so I would say try to visualize +everything and maybe use small problems +where you can visualize everything also +yeah it's useful to construct toy +problems where your idea is going to be +the strongest where you think okay if +this idea has any possibility of working +it's going to work there so for example +let's say you're trying to do something +with hierarchical reinforcement learning +then construct some problem where +there's some kind of obvious hierarchy +that it should learn and you'll be able +to tell if it's doing the right thing +also construct the the problems where +it's going to be weakest obviously and +also as a counterpoint to that don't +over fit your method to some contrived +problem so let's say you've come up with +some toy problem where your method is +really good then don't realize that it's +a toy problem and don't like tweak +everything to just work on this toy +problem perfectly because yeah it's also +pretty useful to have medium-sized +problems that you're very familiar with +and you know exactly how fast the +learning should be and what the reward +should be at every iteration and so on +so +a few problems that I use a lot like +training on pong Atari and the hopper +would the hopper like problem which is +this simulated robot problem with this +hopping robot and I know exactly how +fast an algorithm that's working should +learn on these problems so so I can sort +of it's it makes it easier to tune +things if you have okay that's if you +have a new algorithm let's say you have +a new task I would recommend just making +the task easier until you start seeing +some signs of life you see it learning +something so so there are various ways +you can make it easier you can try doing +some feature engineering so your input +features you think that the you think +that the policy should be a simple +function of your input features like +let's say you're trying to get pong to +work and you tried setting it up with +the images as input and you weren't +learning anything then you can set up +the problem where you pass in XY +coordinates as input and then try +running your algorithm and it's a much +simpler function you're trying to learn +so that's much more likely at work and +then you can try to make it harder and +harder until you're solving the full +problem another way you can make it +easier is by shaping the reward function +that means you if you come up with some +reward function that gives you fast +feedback code on whether you're doing +the right thing or not so let's say we +can define one task where we have this +reaching robot and we just give it a +reward if it reaches if it hits the +target so it gets a reward of one if it +hits the target in zero otherwise so +that might be hard to learn because +you're not getting any feedback as +you're flailing around but we could +define a better shaped reward function +where the where it's just distance to +target then learning is going to be much +faster in that problem there's also the +problem on exactly how to turn your +problem into a pom DP in the first place +so so often it's not clear what your +observation features should be and it's +not even clear what the reward function +should be so or it's not clear if this +problem you're trying to solve is +if it's feasible at all so so let's say +you're trying to solve you you have some +game or some robotics task or something +new like and you you want to turn it +into a reinforcement learning problem +but you're not sure if this is feasible +at all +so the first thing to do is to just +visualize a random policy acting on this +problem and see see what happens so if +the random policy occasionally does the +right thing then there's a high chance +of reinforcement learning is going to +work because bringing forth a policy +grading method is just going to take +this random behavior and it's going to +make the look the good behaviors more +likely +so it'll gradually like hone in on the +good behaviors whereas if you're never +doing the right thing then then there's +RL isn't going to get any signal that +tells it to do the right thing sometimes +RL is able to learn even though it seems +like it it's not clear how it's going to +learn like learning how to walk it's not +clear that that should work but because +you would think that like you really +have to have the whole thing in the +whole policy in place before it does +anything useful but as it turns out you +sort of learn to take one step and then +fall over and then take two steps and +then fall over and so on until you've +got a proper walking gait okay another +thing to do is to make to make your +observations make sure your observations +are useable try to look at them as a +human and see if you can control the +system using the same observations +you're giving to the agent so let's say +you're doing some pre-processing on your +images look at those pre processed +images yourself and make sure you're not +like losing too much detail when you +downsample them or losing too much or in +the color transformations and so on +another thing to do is you want to make +sure that everything is reasonably +scaled so that for example well as a +rule of thumb you usually want +everything to be mean 0 and standard +deviation 1 for the observations and for +the rewards well it's a little less +obvious but that's a reasonable +heuristic so +so you might want to like a scaler using +some kind of filter I mean that's that's +another good thing you can do but if you +don't want to mess with some kind of +filters on your observations and rewards +what you can do it you can just kind of +if you're allowed to define those +yourself then you might want to just +scale them yourself so what I'd +recommend doing is plot histograms of +all of your observations and your +rewards and make sure that for each +component of the observations and +rewards you've scaled it properly so +that it has the right mean insanity +deviation and it doesn't have crazy +outliers okay another thing to do is you +should have some good baselines that you +can use whenever you see a new when you +whenever you see a new problem so just +it's not clear which algorithm is going +to work beforehand so make sure you you +just have a bunch of a bunch of like +well tune things that you can run on +each problem yeah okay the question was +if you're gonna do some kind of reward +normalization should you do this over +your whole training like all of your +training data or just like the recent +data I would yeah that's a there's a lot +of subtlety there so I would say use all +of your data so far because you're +making everything non-stationary if you +do some kind of filtering actually I'm +going to talk about this at a later +slide so anyway I would recommend as +just a few baselines you should have a +cross and to be method some policy +grading methods some kind of cue +learning or sarsa type method there's a +lot of code online now that you can use +other people's code that that's already +written so you can use like we have this +open AI baselines repository and also +our L lab has a bunch of algorithms okay +another thing to do which people often +get tripped up on especially when +they're trying to reproduce published +work is so you implement the algorithm +based on the paper +and then it doesn't really learn +anything at all and then you think oh +maybe mike is my code like wrong or what +happened so I would say early on you +might need to run with more samples than +expected +so one hyper parameter that you can +usually adjust is how big of a batch +size to use or how many samples to use +and I would say sometimes you should use +more samples than you think you're going +to need because usually things just work +better when you have more samples almost +always so often sometimes when you're +trying to reproduce a published paper +you've got it mostly right but not +exactly right like maybe you haven't +scaled everything properly or there's +some like there's some really like +obscure hyper parameter that you have +wrong and then you just find that the +code doesn't learn anything so then I +would say just try to make it work a +little bit and then you can work from +there and try to tweak all the hyper +parameters to to get up to the like to +get fully up to the publish performance +but if you want to just get something +working at all often you need to use +bigger batch sizes and you thought +because if your batch size is too small +than the nor the noise will overwhelm +the signal and you won't learn anything +so like for example for TRP oh I wasn't +seeing any learning for a while and then +it turned out it's just because I was +using too small of a batch size and I +had to use a hundred thousand time steps +of a batch for the batch size but and +for Atari they for dqn the type of +parameters that were found to be best +where you update every ten thousand time +steps you update your queue function +every ten thousand time steps and you +have a 1 million time steps in your +replay buffer which is a lot okay so now +I'll talk about some guidelines for on +for the ongoing development and tuning +process as opposed to the initial +process of I have a totally new problem +or a new algorithm that I want to see +some signs of life on so +let's say you get something working I +recommend looking how sensitive your +algorithm is to every hyper parameter +and if it's too sensitive it it's not +actually a robust algorithm then you +shouldn't be happy with it you probably +just got luck lucky on that one problem +and it's it's actually kind of possible +to have a method that does that is a +fluke and it works in one way because +it's I mean one problem because of some +funny dynamics but then it doesn't work +in general so you kind of have to it +need some serious improvements so yeah +so that's okay there's also a few things +you can look at to see that actually I'm +going to talk about more of these kind +of Diagnostics a little later but there +are some indicators that'll tell you if +that if your algorithm is working +besides just looking at the final +performance but other in look for other +indicators that are going to tell you +that your optimization process is kind +of healthy so this is going to vary +based on the algorithm but for example +you can look at whether your value +function is actually accurate like +whether it's actually predicting returns +well you can look at how big the updates +are in terms of some either parameter +space or the output space standard +Diagnostics for deep networks like you +can look at norms of gradients and so on +okay one thing that takes some +discipline but is very useful is to have +a system for continually benchmarking +your code and that includes all of your +code not just the one thing you're +tuning right now because often it's easy +to tune your algorithm to work well in +one problem and then mess up the +performance on other problems and it's +really easy to overfit on single +problems when you're just adjusting +hyper parameters so I'd really recommend +having some kind of benchmark you can +run frequently and some kind of battery +of benchmarks that you've run +occasionally along as similar lines of +like overfitting of sort of reading +too far into noise or over interpreting +noise it's really easy to just to think +you're improving your algorithm or +you're making it worse but really you're +just seeing random noise so so you can +see seven different tasks these are the +Jim Moo Joko tasks like half cheetah and +hopper and so on and you have three +different algorithms here the red one +the green one and the blue one and you +can see ok let's I mean we can see that +the performance is a little different on +all the problems but let's it looks like +the the red let's see does the green +which one looks like the best well it +kind of varies by problem like the blue +one looks better on this problem and the +red one is worse on this problem and so +on but as it turns out these are all the +exact same algorithms and just random +seeds different random seeds so so it's +easy to imagine that you're just looking +at one of these problems then you see +that blue curve and you think you get +really excited than you think you found +some huge improvement to your algorithm +but it's really that you just got a +lucky seed that one run so yeah really +you've got to run your algorithm +multiple times an average and even if +you're averaging over a lot of seeds +like even if you had like 20 seeds here +there's a still a pretty big error bar +so it's yeah that makes it particularly +hard +I mean I'd recommend having like +multiple tasks and multiple seeds and if +you don't do that then you're probably +just overfitting unless you see a really +drastically large improvement another +thing to do is it's easy to keep adding +little modifications to your algorithm +until it gets really complicated and +then you're not sure and then you think +you have this really complicated +algorithm which is perfect but it turns +out that most of the things you did are +unnecessary because base some of the +tricks substitute for each other this is +often true because a lot of tricks help +because they're like normalizing things +in a better way or improving your +optimization like making +your optimization less susceptible to +like big spikes I don't know a lot of +different modifications you make have +similar effects so so often you you can +remove them and simplify your algorithm +and this is pretty important so it's +like especially with regard to changes +that do whitening these kind of these +kind of all substitute for each other +and also substitute for changes to your +optimization algorithm and yeah I would +and I would simplify things because it's +then it's more likely that your insights +will generalize to other problems and +also lastly it's pretty useful to +automate your experiments because +otherwise you're going to end up +spending all your day your whole day +just watching your code prints out +numbers and and it's actually really +it's it's really tempting to spend all +day doing that but I would I mean +especially if you need to run multiple +random seeds then it's then you you +really need to get your work flow down +so the year you're automating this +process and launching lots of +experiments at the same time so I'd +recommend just getting set up with one +of these cloud computing services so you +can just launch experiments on remote +instances and pull the results back when +you're done question oh yeah question is +you have a recommendation on what +framework to use to keep track of your +experiment results I personally use no +framework at all and I just have like +ipython notebooks and scripts that +collect a bunch of data that's stored in +various log files so I just have scripts +that read all my log files and plot them +I don't use some people like having +databases and stuff where they store all +their hyper parameter results but on I +think I don't find it necessary +personally okay so now I'm let's see I'm +going to talk about general tuning +strategies for RL and then after that +I'll talk about some specific tuning +strategies for different classes of +algorithms +okay so one thing is widening or +standardizing your data so if your +observations have unknown range you +should definitely standardize them I +would do that by computing a running +estimate of the mean and the standard +deviation and then just transform it Z +transforming it like this +and I would recommend computing the mean +and the standard deviation over all data +you've seen so far not just your recent +data because otherwise you're +effectively changing your data in some +way that the policy doesn't know about +like you have your that your policy +grading algorithm doesn't know about +like your policy grading algorithm is +actually optimizing some objective so +then if you just go and change the +problem out from under it then you're +often going to make things a lot worse +like if you rescale your observations +then your optimization algorithm didn't +know about that so you might just +collapse the performance so that's why I +would recommend using your whole all of +your data from the start of time so that +at least it's going to slow down over +time how fast it's how fast your +scalings are changing so yeah that's +what I would recommend doing with the +observations and for the rewards +I'd recommend rescaling it but not +shifting them because that affects the +agents will to live so if you if you +shift the mean reward that'll affect +whether how long it wants to survive +you're actually changing the problem ok +another yeah you might also want to try +to standardize prediction targets in the +same way though that's a little more +complicated to do using okay yeah so +question is what about pca widening +instead of just this element why scaling +yeah that could that could definitely +help I haven't I haven't experimented +with that but yeah that could help it's +hard to predict with like with neural +nets if it's going to help or not +because they seem to be pretty good at +disentangling things so I know that if +you have things that are terribly scaled +like they're from negative one thousand +two one thousand and other coordinates +are from negative point +point one then it's gonna be slow for +learning so this kind of scaling helps a +lot even though you're having their own +networks okay there's some parameters +that are really generally important like +discount factor that determines whether +you're that determines how long how far +away you're doing credit assignments so +whether you're paying attention to +effects that are delayed by a certain +time so if your discount is gamma equals +point 99 then you're basically ignoring +effects that are more delayed by a +hundred time steps so so you're kind of +short-sighted that gamma is controlling +your shortsightedness and you might want +to actually look at if how long that +corresponds to in real time so usually +in reinforcement learning you're sort of +discretizing time in a certain way and +it's worth paying attention to like is +that 100 time steps like three seconds +of real time or what and what happens +during that time also note that if you +have TD lamda kind of methods for either +for value function estimation or for +policy grading methods you can get away +with using a Lambda gamma that's really +close to one like 0.999 and things +aren't going to go unstable because if +you have a lower land of like 0.9 then +that's going to make it so the algorithm +is still stable even though gamma is +really close to one also okay so so as I +mentioned you might want to in in +practice we're usually discretizing some +continuous-time system so then it's +worth seeing if the problem can actually +be solved at this discretization level +so so for example in a game let's say +you're you're doing frame skip a meaning +that you repeat the action multiple +times as a human can you control it at +this rate or is it just impossible to +control is it just too like you're doing +the action too many times in a row and +you have to slow responses to control it +and I would also just look at the what +the random exploration looks like and if +you make sure that you're exploring like +the the +this Croatian is going to determine like +how far your Brownian motion goes +because if you're doing the same action +many times in a row then you're going to +be able to then you're going to tend to +explore further so so it's worth just +looking at what the random exploration +does and and choosing your time +discretization in a sensible way so that +it does interesting things question yeah +so the question is if you have a DQ n +how would you get started like tuning it +with tuning all the hyper parameters +actually I'm going to talk about DQ n +pretty soon so yeah I'll get to that +okay also look at the episode returns +very closely look at don't just look at +the mean look at the minimum and the +maximum so the maximum especially if you +have a deterministic system if you have +a certain maximum return that's +basically something that your policy can +hone in on pretty straightforwardly +because if if you just do that every +time then you're going to increase your +mean return to that level so so it's +worth so so it's useful to look at the +max return to see if your policy is ever +doing like the right thing according to +that max return or if it's just kind of +stuck and it's never discovering the +high return strategy also look at the +episode in length which is sometimes +more informative than the episode reward +like if because sometimes well yeah I +won't go into details on that like well +if you have a game you're it might mean +that like you might be losing every time +so you're never seeing yourself win but +the episode length will tell you if +you're losing slower so you might see an +improvement in episode length at the +beginning but not in reward okay for +Policy gradient there are specific +strategies or prediction there are +specific Diagnostics that are really +helpful so look at the entropy really +carefully if your entropy is going down +too fast that means your policy is +becoming deterministic and it's not +going to explore anything so +so be careful and also if it's not going +down your policy is never going to be +that good because it's always really +random else so you can sort of alleviate +this issue by using an entropy bonus or +a KL penalty so by stopping yourself +from move changing the policy the +probability distribution too fast as a +side effect you also prevent the entropy +from going down too fast when you use +the KL penalties I also look at the KL +as a diagnostic like look at how big of +an update you're doing in terms of KL +divergence if your KL is like 0.01 +that's a pretty small update but if it's +like a 10 that's a really big update +question oh yeah how do you question is +how do you measure entropy so so if you +have for most policies you can compute +the entropy analytically so if you have +a discrete action space then you usually +can just compute it analytically and if +you have a continuous policy you're +usually you're using a Gaussian +distribution or something so you can +compute the differential entropy +analytically so here we're talking about +entropy in action space so the average +over state space of the action space +entropy what you actually might care +about even more is the entropy in state +space but you have no hope at actually +calculating that except maybe to do some +really crude approximation of it +okay yeah so KL is really useful look at +explain variants like whether your value +function is actually explaining is +actually a good predictor of the returns +or if it's just worse than predicting +nothing so if you just predict zeroes +then your explained variance is zero but +sometimes if you have some neural +network that's predicting then you find +that it's actually negative because it's +overfitting or it's just noisy and it's +not doing anything useful so that +probably means you need to tune some +hyper parameters so that your neural +networks actually predicting better than +the constant predicting zero question +okay yeah question is why does the KL +spike give you a loss in performance +well it doesn't always be a lot it's not +always a loss in performance sometimes +it's a gain in performance but in +practice it's usually a loss in +performance because it usually the +approximation that your policy gradient +is just taking you way outside the +region where your local approximation to +the policy performance is accurate so +you're you're probably just overshooting +like if you take your policy and you +take a really big step in any direction +you're probably making it worse so so +that's so usually if you take a big step +you're getting worse like if you have a +convex function if you take a big step +in any direction you're probably going +to make it worse let's see okay +initialize your policy that's pretty +important more important than in +supervised learning because that in +determines what data you're going to see +initially and you're going to learn from +at the beginning so I would recommend +using have initializing the final layer +to be either zero or really small so +that at least you you have the maximum +and you sort of explore randomly at the +beginning we randomly at the beginning +as opposed to having some kind of +particular like policy that has a strong +opinion on the right thing to do which +is based on no information at all okay +that's for Policy gradient for Q +learning so a few thing a few things one +is okay you often it helps to have a +really big replay buffer and to be able +to do this you need to be a little +careful about memory usage so it's worth +putting in the extra effort to do that +learning rate schedules are often quite +helpful here in practice as our +exploration schedules so in qdq any +you're usually using epsilon greedy and +it often helps to do to play with the +schedule on that also it converges +pretty slowly and it has a miss +serious warmup period at the beginning +often so so sometimes you just so I +actually have a lot of admiration for +the authors who originally got this the +people people who got this to work +originally because they had to just let +their code run for a while before it did +anything so so you have to have a lot of +patience - a lot of bravery to do that +ok this is just miscellaneous advice for +not necessarily for tuning algorithms +but just for for personal development so +I recommend reading older textbooks and +theses not just the latest conference +papers because often they up in them +like there are more dense source of +useful information whereas each +conference paper just has one idea ok +yeah don't get too stuck on problems +because often you actually have a +legitimately good algorithm but it's has +like some flaws so its might fail +miserably at some easy problem so in RL +there's some like simple problems like +cart will swing up where you have this +stick and you're trying to swing it up +by moving the cart around and this +problem like you might have a great +algorithm but it's gonna in my but like +some of the state-of-the-art algorithms +are gonna fail on that problem unless +you really tuned them carefully and +that's just because maybe it's not +exactly the right problem to start to I +mean maybe like the thing that makes +this problem hard is not the thing that +your algorithm is doing that's +interesting so you might have like come +up with a better policy grading method +but still it'll converge to the same +local minimum on that swing up problem +and you're not gonna fix that problem so +I I would say just don't get too stuck +on a single problem that your method +bails on and enough in like maybe the +ultimate algorithm will solve all of +these problems but we're not there yet +so you might as well just try to improve +and some like decently large subset of +problems so also like one funny thing is +the dqn performs pretty poorly on a lot +of problems especially with continuous +control +I think it does I mean for cartful it +probably solves it pretty well if with a +reasonable amount of tuning but some of +the other like fairly small continuous +control problems it fails on but that +doesn't mean it's like that doesn't mean +it's a bad algorithm because it solves a +different problem extremely well so yeah +I would say just these these things are +at least right now it's not gonna you +shouldn't expect to be able to solve +everything with the same method without +any tuning also techniques from +supervised learning often don't transfer +over to reinforcement learning so so +don't be surprised if you find that I +guess that's not I said this slide was +gonna be a bad personal development +that's not about personal development +but yeah I guess this is just a grab bag +of miscellaneous advice so yeah so like +Bachner a lot of people look at what +people are doing in RL and they think +why aren't you using batch norm or drop +out or or big networks why are you using +like two layers of 64 units and it's not +like people didn't think of trying these +other things they tried them and then +they found that those other like +architectures and methods don't actually +help here I mean if you figure out how +to make batch norm and drop out actually +help in RL that'll actually be really +great and a big that would be a big +development but yeah I don't know it's +not totally straightforward all right +that's all thank you +okay I have a few minutes for questions +yeah so the question is how long do you +wait until you've decided that your +algorithm is new at work either because +your code is wrong or it's just too hard +I don't have a good general answer to +that I think the problem is worse for +some algorithms and others I'd say for +policy gradient methods you don't see +that burnin period as much like often if +it's going to learn it'll learn it at +the beginning but that's not always true +either I mean sometimes it will kind of +take some time to get into the right +numerical regime so I don't yeah I don't +have general advice I would say you have +to just I would say go back and start +with the easy problems and you'll get +some intuition about whether you're you +should expect a you should expect a burn +in period or not where it's not learning +anything see I want to get some people +in the back because okay oh yeah +question is - do I use unit tests I use +unit tests for code that where there's +it's doing a very particular +mathematical thing that you can actually +write a test for like let's say I'm +computing the KL divergence then I'll +write a test to check I don't know their +various ways of testing it so and it's +easy to get those things wrong like you +have it's I don't know as you're off by +a constant or something so yeah I would +write tests for I write tests for things +where it's nothing that there's a very +well-defined correct thing to do it's +harder to write it for an algorithm +where it has a lot of different moving +parts where you it's not clear how fast +it should learn and it's also there's +some randomness involved +so if you try to write a test saying I +should be at performance 100 after this +many iterations it might fail just out +of random noise but yeah I think +probably unit tests are a good idea oh +yeah so the question is do I have +guidelines on matching the algorithm to +the task like when to use policy +gradients versus a value iteration style +method it's yeah it's hard to give some +general guidelines I think people have +found that and and the guidelines I give +you might just be just be kind of +historical accidents like someone got +this to work here and this to work there +so I think the well certainly if you +don't care that much about sample +complexity policy gradient methods are +are probably are probably the way to go +if you don't care about sample +complexity or using off policy data then +policy grading methods are probably the +safest bet because you I don't know it's +more understandable exactly what it's +doing it's just doing gradient descent +whereas q-learning it's a little bit +indirect what it's doing so it's and it +in practice is more finicky yeah if you +do care about sample complexity though +or need off policy data then hue +learning is usually better or yeah or a +few students like sample complexity is +relevant if your simulator is expensive +of course I would also say that people +have found that dqn and relatives have +worked well on game-like tasks with +images as input whereas policy grading +methods work better on the continuous +control tasks like these robotic +locomotion problems though that this +might not be fundamental it might be +more of a historical accident let's +oh yeah recommendations on older +textbooks let's see there's like brutes +a cuss for take us as books +that's approximate dinette what is it +optimal control and why am i blanking on +the name optimal control and dynamic +programming something like that and the +set I mean sudden embargo is a good one +to read butterman has a textbook kind of +a classic textbook on Markov decision +processes that's in the RL space then +there's books on numerical optimization +that are good and yeah I'd say obviously +the machine learning textbooks have a +lot of good material that might be +useful in the RL setting too +oh yeah can I comment on evolution +strategies and the blog posts the the +opening I blog post on it let's see do +you have any specific questions about it +or like how it compares +oh yeah okay yeah yeah so there's +there's a lot of policy grading methods +out there and some of them are quite +complicated so we've had a couple of +talks on them so far like all these +different work +it's excessively more complicated policy +grading methods but then there's this +old algorithm called evolution +strategies which is an extremely simple +algorithm and and there's a paper by +some of my colleagues where they show it +was called evolutionary strategies as a +scaleable alternative to reinforcement +learning which really meant like to +policy grading methods so and they +claimed that it worked basically as well +as policy grading methods or at least +it's sort of in the same order oh and +beer is one of the authors of that paper +so the claim was that it works it works +like similarly well to policy grading +method so why should we bother with +these policy grading methods if es works +just as well well I think in practice it +works well it works but it works not not +as well like it's it takes me the sample +complexity is is is worse by some +constant factor or it's not clear that +it's a constant factor or if this factor +scales with the size of the network but +it's it is a lot it is significantly +slower and the question is just what is +that constant factor so is that constant +factor like one or is it three or is it +10 or 100 so that's not that's going to +vary between problems and also the that +paper had some innovations in exactly +how to parameterize the networks and so +forth that made everything better +numerically everything better scaled so +that yes did work well but I would say +that if you that it's usually quote like +I don't know it's usually a pretty +decent constant factor slower than +policy grading methods especially the +more advanced ones like the PPO and +actor so so i'm i think it's it's not +really a clear win in the RL setting +where policy gradients work I think if +policy gradients work it's usually going +to be a lot better +and the es is going to be is going to be +better on problems where policy +gradients aren't going to work for some +reason like if you've got really long +you depend time dependencies where the +discounts are gonna are gonna ignore +them +then es might be less sensitive to that +let's see I think I'm okay last question +oh yeah favorite hyper parameter +optimization framework I've used some of +these than I just like to use the +uniform random sampling yeah that works +really well I mean you just run a bunch +of experiments with random hyper +parameters and then you just look at the +results the next day and to do some +regression to figure out which +parameters actually mattered and then +you've run another experiment with +better parameter ranges and so on so I +use the human version of it because +often it's just - it's like a it's it's +useful to be able to look at the results +yourself - to get some to figure out +which parameters actually matter so +you're not wasting a lot of computation +because that information transferred +between problems all right +[Applause] \ No newline at end of file diff --git a/docs/evidence/alexirpan_rl_hard.md b/docs/evidence/alexirpan_rl_hard.md new file mode 100644 index 0000000..1baf544 --- /dev/null +++ b/docs/evidence/alexirpan_rl_hard.md @@ -0,0 +1,1103 @@ +Source: https://www.alexirpan.com/2018/02/14/rl-hard.html +Title: Deep Reinforcement Learning Doesn't Work Yet - Alex Irpan (2018) +Fetched-via: uvx markitdown https://www.alexirpan.com/2018/02/14/rl-hard.html +Fetch-status: verbatim + +[Sorta Insightful](/) + +[Reviews](/recs/) +[Projects](/projects/) +[Puzzles](/puzzles/) +[Archive](/archive/) +[Research](/research/) +[About](/about/) +[![](/public/feed-icon.png)](/feed.xml) + +In a world where everyone has opinions, one man...also has opinions + +# Deep Reinforcement Learning Doesn't Work Yet + +Feb 14, 2018 + +*June 24, 2018 note: If you want to cite an example from the post, please +cite the paper which that example came from. If you want to cite the +post as a whole, you can use the following BibTeX:* + +``` +@misc{rlblogpost, + title={Deep Reinforcement Learning Doesn't Work Yet}, + author={Irpan, Alex}, + howpublished={\url{https://www.alexirpan.com/2018/02/14/rl-hard.html}}, + year={2018} +} +``` + +--- + +*This mostly cites papers from Berkeley, Google Brain, DeepMind, and OpenAI +from the past few years, because that work is most visible to me. +I’m almost certainly missing stuff from older literature and other +institutions, and for that I apologize - I’m just one guy, after all.* + +# Introduction + +Once, on Facebook, I made the following claim. + +> Whenever someone asks me if reinforcement learning can solve their problem, I tell them it can’t. I think this is right at least 70% of the time. + +![Futurama Bender meme](/public/rl-hard/bender-70.jpg) + +Deep reinforcement learning is surrounded by mountains and mountains of hype. And +for good reasons! +Reinforcement learning is an incredibly general paradigm, +and in principle, a robust and performant RL system should be great at +everything. Merging this paradigm with the empirical power of deep learning +is an obvious fit. Deep RL is one of the closest things that looks anything like +AGI, and that’s the kind of dream that fuels billions +of dollars of funding. + +Unfortunately, it doesn’t really work yet. + +Now, I believe it *can* work. If I didn’t believe in reinforcement learning, +I wouldn’t be working on it. +But there are a lot of problems in the way, many of which feel fundamentally +difficult. The beautiful demos of learned agents hide all the blood, sweat, and +tears that go into creating them. + +Several times now, I’ve seen people get lured by recent work. They try +deep reinforcement learning for the first time, and without fail, they +underestimate deep RL’s difficulties. +Without fail, the “toy problem” is not as easy as it looks. And without fail, +the field destroys them a few times, until they learn how to set realistic +research expectations. + +This isn’t the fault of anyone in particular. It’s more of a systemic problem. +It’s easy to write a story around a positive result. It’s hard to do the same +for negative ones. The problem is that the negative ones are the ones that +researchers run into the most often. In some ways, the negative cases are +actually more important than the positives. + +In the rest of the post, I explain why deep RL doesn’t work, cases where +it does work, and ways I can see it working more reliably in the future. +I’m not doing this because I want people to stop working on deep RL. +I’m doing this because I believe it’s easier to make progress on problems if +there’s agreement on what those problems are, and it’s easier to build +agreement if people actually talk about the problems, instead of independently +re-discovering the same issues over and over again. + +I want to see more deep RL research. I want new people to join the field. I +also want new people to know what they’re getting into. + +Before getting into the rest of the post, a few remarks. + +* I cite several papers in this post. Usually, I cite the paper for its + compelling negative examples, leaving out the positive ones. **This doesn’t + mean I don’t like the paper.** I like these papers - they’re worth a read, if + you have the time. +* I use “reinforcement learning” and “deep reinforcement learning” + interchangeably, because in my day-to-day, “RL” always implicitly + means deep RL. + **I am criticizing the empirical behavior of deep reinforcement + learning, not reinforcement learning in general.** + The papers I cite usually represent the agent with a deep neural net. + Although the empirical criticisms *may* apply to linear RL or tabular RL, I’m not + confident they generalize to smaller problems. + The hype around deep RL is driven by the promise of applying RL to large, complex, + high-dimensional environments where good function approximation is necessary. + It is that hype in particular that needs to be addressed. +* This post is structured to go from pessimistic to optimistic. + I know it’s a bit long, but I’d appreciate it if you would take the time to + read the entire post before replying. + +Without further ado, here are some of the failure cases of deep RL. + +# Deep Reinforcement Learning Can Be Horribly Sample Inefficient + +The most well-known benchmark for deep reinforcement learning is Atari. As shown +in the now-famous Deep Q-Networks paper, if you combine Q-Learning with +reasonably sized neural networks and some optimization tricks, you can achieve +human or superhuman performance in several Atari games. + +Atari games run at 60 frames per second. Off the +top of your head, can you estimate how many frames a state of the art DQN +needs to reach human performance? + +The answer depends on the game, so let’s take a look at a recent Deepmind +paper, [Rainbow DQN (Hessel et al, 2017)](https://arxiv.org/abs/1710.02298). +This paper does an ablation study over several incremental advances made to the +original DQN architecture, demonstrating that a combination of all advances gives +the best performance. It exceeds human-level performance on over 40 of the 57 Atari +games attempted. The results are displayed in this handy chart. + +![Figure from Rainbow DQN](/public/rl-hard/rainbow_dqn.png) + +The y-axis is “median human-normalized score”. This is computed by training +57 DQNs, one for each Atari game, normalizing the score of each agent such that +human performance is 100%, then plotting the median performance across the +57 games. RainbowDQN passes the 100% threshold at about *18 million* frames. +This corresponds to about 83 hours of play experience, plus however long it takes +to train the model. A lot of time, for an Atari game that most humans pick up +within a few minutes. + +Mind you, 18 million frames is actually pretty good, when you consider that the +previous record +([Distributional DQN (Bellemare et al, 2017)](https://arxiv.org/pdf/1707.06887.pdf)) +needed 70 million frames to hit 100% median performance, which is about 4x more +time. As for the [Nature DQN (Mnih et al, 2015)](https://www.nature.com/articles/nature14236), +it never hits 100% median performance, even after 200 million frames of +experience. + +The planning fallacy says that finishing something usually takes longer than +you think it will. Reinforcement +learning has its own planning fallacy - learning a policy usually needs more +samples than you think it will. + +This is not an Atari-specific issue. The 2nd most popular benchmark is the +MuJoCo benchmarks, a set of tasks set in the MuJoCo physics +simulator. In these tasks, the input state is usually the position and velocity +of each joint of some simulated robot. Even without having to solve vision, +these benchmarks take between \(10^5\) to \(10^7\) steps to learn, depending +on the task. This is an astoundingly large amount of experience to +control such a simple environment. + +The [DeepMind parkour paper (Heess et al, 2017)](https://arxiv.org/abs/1707.02286), +demoed below, +trained policies by using 64 workers for over 100 hours. The paper does not clarify what “worker” +means, but I assume it means 1 CPU. + +These results are *super cool*. When it first came out, I was surprised +deep RL was even able to learn these running gaits. + +At the same time, the fact that this needed 6400 CPU hours is a bit +disheartening. It’s not that I expected it to need less time…it’s more that +it’s disappointing that deep RL is still orders of magnitude above a practical +level of sample efficiency. + +There’s an obvious counterpoint here: what if we just ignore sample efficiency? +There are several settings where it’s easy to generate experience. Games are +a big example. But, for any setting where this *isn’t* true, RL faces an uphill +battle, and unfortunately, most real-world settings fall under this category. + +# If You Just Care About Final Performance, Many Problems are Better Solved by Other Methods + +When searching for solutions to any research problem, there are usually +trade-offs between different objectives. You can optimize for getting a really +good solution for that research problem, or you can optimize for making a good +research contribution. The best problems are ones where getting a good solution +requires making good research contributions, but it can be hard to find +approachable problems that meet that criteria. + +For purely getting good performance, deep RL’s track record isn’t +that great, because it consistently gets beaten by other methods. +Here is a video of the MuJoCo robots, controlled with online trajectory +optimization. The correct actions are computed in near real-time, online, with +no offline training. Oh, and it’s running on 2012 hardware. ([Tassa et al, IROS 2012](https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf)). + +I think these behaviors compare well to the parkour +paper. What’s different between this paper and that one? + +The difference is that Tassa et al use model predictive control, which gets to +perform planning against a ground-truth world model (the physics simulator). +Model-free RL doesn’t do this planning, and therefore has a much harder +job. On the other hand, if planning against a model helps this much, why +bother with the bells and whistles of training an RL policy? + +In a similar vein, you can easily outperform DQN in Atari with off-the-shelf +Monte Carlo Tree Search. Here are baseline +numbers from [Guo et al, NIPS 2014](https://papers.nips.cc/paper/5421-deep-learning-for-real-time-atari-game-play-using-offline-monte-carlo-tree-search-planning). +They compare the scores of a trained DQN to the scores of a UCT agent +(where UCT is the standard version of MCTS used today.) + +![DQN results](/public/rl-hard/dqn_atari.png) + +![MCTS results](/public/rl-hard/uct_atari.png) + +Again, this isn’t a fair comparison, because DQN does no search, and MCTS gets to +perform search against a ground truth model (the Atari emulator). +However, sometimes you don’t care about fair comparisons. Sometimes you just +want the thing to work. (If you’re interested in a full evaluation of UCT, +see the appendix of the original +[Arcade Learning Environment paper (Bellemare et al, JAIR 2013)](http://www.marcgbellemare.info/static/publications/bellemare13arcade.pdf).) + +Reinforcement learning can theoretically work for anything, including +environments where a model of the world isn’t known. However, this generality +comes at a price: it’s hard to exploit any problem-specific information that +could help with learning, which forces you to use tons of samples to learn +things that could have been hardcoded. + +The rule-of-thumb is that except in rare cases, domain-specific algorithms +work faster and better than reinforcement learning. This isn’t a problem if +you’re doing deep RL for deep RL’s sake, but I +personally find it frustrating when I compare RL’s performance to, well, +anything else. One reason I liked AlphaGo so much was *because* it was an +unambiguous win for deep RL, and that doesn’t happen very often. + +This makes it harder for me to explain to laypeople why my problems +are cool and hard and interesting, because they often don’t have the context +or experience to appreciate *why* they’re hard. +There’s an explanation gap between what people think deep RL can do, and +what it can really do. I’m working in robotics right now. Consider the company +most people think of +when you mention robotics: +[Boston Dynamics](https://www.youtube.com/channel/UC7vVhkEfw4nOGp8TyDk7RcQ). + +This doesn’t use reinforcement learning. I’ve had a few conversations where +people thought it used RL, but it doesn’t. +If you look up research papers from the group, you find papers mentioning +[time-varying LQR, QP solvers, and convex optimization](https://dspace.mit.edu/openaccess-disseminate/1721.1/110533). +In other words, they mostly apply classical robotics techniques. Turns out +those classical techniques can work pretty well, when you apply them right. + +# Reinforcement Learning Usually Requires a Reward Function + +Reinforcement learning assumes the existence of a reward function. Usually, +this is either given, or it is hand-tuned offline and kept fixed over the course +of learning. I say “usually” because there are exceptions, such as imitation +learning or inverse RL, but most RL approaches treat the reward as an oracle. + +Importantly, +for RL to do the right thing, your reward function must capture *exactly* what +you want. +And I mean *exactly*. RL has an annoying tendency to overfit to your reward, +leading to things you didn’t expect. This is why Atari is such a nice benchmark. Not only +is it easy to get lots of samples, the goal in every game is to maximize score, +so you never have to worry about defining your reward, and you know everyone +else has the same reward function. + +This is also why the MuJoCo tasks are popular. Because they’re run in simulation, +you have perfect knowledge of all object state, which makes reward function design +a lot easier. + +In the Reacher task, you control a two-segment arm, that’s connected to a central +point, and the goal is to move the end of the arm to a target location. Below +is a video of a successfully learned policy. + +Since all locations are known, reward can be defined as the distance +from the end of the arm to the target, plus a small control cost. In principle, +you can do this in the real world too, if you have enough sensors to get +accurate enough positions for your environment. But depending on what you want +your system to do, it could be hard to define a reasonable reward. + +By itself, requiring a reward function wouldn’t be a big deal, except… + +# Reward Function Design is Difficult + +Making *a* reward function isn’t that difficult. The difficulty comes when +you try to design a reward function that encourages the behaviors you want +while still being learnable. + +In the HalfCheetah environment, you have a two-legged robot, restricted to a +vertical plane, meaning it can only run forward or backward. + +[ + + +Your browser does not support the video element. + +](/public/rl-hard/upright_half_cheetah.mp4) + +The goal is to learn a running gait. Reward is the velocity of the HalfCheetah. + +This is a *shaped* reward, meaning it gives increasing reward in states +that are closer to the end goal. This is in contrast to *sparse* rewards, which +give reward at the goal state, and no reward anywhere else. +Shaped rewards are often much easier to learn, because they provide positive feedback +even when the policy hasn’t figured out a full solution to the problem. + +Unfortunately, shaped rewards can bias learning. As said earlier, this can lead +to behaviors that don’t match what you want. +A good example is the boat racing game, from an [OpenAI blog post](https://blog.openai.com/faulty-reward-functions/). +The intended goal is to finish the race. You can imagine that a sparse reward +would give +1 reward for finishing under a given time, and 0 reward otherwise. + +The provided reward gives points for hitting checkpoints, and also +gives reward for collecting powerups that let you finish the race faster. +It turns out farming the powerups gives more points than finishing the race. + +To be honest, I was a bit annoyed when this blog post first came out. This +wasn’t because I thought it was making a bad point! It was because I thought +the point it made was blindingly obvious. Of course reinforcement learning +does weird things when the reward is misspecified! It felt like the post +was making an unnecessarily large deal out of the given example. + +Then I started writing this blog post, and realized the most compelling video +of misspecified reward *was* the boat racing video. And since then, that video’s +been used in several presentations bringing awareness to the problem. So, okay, +I’ll begrudgingly admit this was a good blog post. + +RL algorithms fall along a continuum, where they get to assume more or less +knowledge about the environment they’re in. The broadest category, model-free RL, +is almost the same as black-box optimization. These methods are only allowed to +assume they are in an MDP. Otherwise, they are given nothing else. The agent +is simply told that this gives +1 reward, this doesn’t, and it has to learn +the rest on its own. +And like black-box optimization, the problem is that anything that gives ++1 reward is good, even if the +1 reward isn’t coming for the right reasons. + +A classic non-RL example is the time someone applied genetic algorithms to +circuit design, and +[got a circuit where an unconnected logic gate was necessary to the final +design](https://en.wikipedia.org/wiki/Evolvable_hardware#Introduction). + +![Circuit with crazy gates](/public/rl-hard/circuit.png) + +The gray cells are required to get correct behavior, including the one in the top-left corner, +even though it’s connected to nothing. +[From “An Evolved Circuit, Intrinsic in Silicon, Entwined with Physics”](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.50.9691&rep=rep1&type=pdf) + +For a more recent example, see this +[2017 blog post from Salesforce](https://www.salesforce.com/products/einstein/ai-research/tl-dr-reinforced-model-abstractive-summarization/). +Their goal is text summarization. +Their baseline model is trained with supervised learning, then evaluated with +an automated metric called ROUGE. ROUGE is non-differentiable, but RL can +deal with non-differentiable rewards, so they tried applying RL to optimize +ROUGE directly. This gives high ROUGE (hooray!), but it doesn’t actually +give good summaries. Here’s an example. + +> Button was denied his 100th race for McLaren after an ERS prevented him from making it to the start-line. It capped a miserable weekend for the Briton. Button has out-qualified. Finished ahead of Nico Rosberg at Bahrain. Lewis Hamilton has. In 11 races. . The race. To lead 2,000 laps. . In. . . And. + +[Paulus et al, 2017](https://arxiv.org/abs/1705.04304) + +So, despite the RL model giving the highest ROUGE score… + +![Salesforce ROUGE performance](/public/rl-hard/salesforce_rouge.png) + +they ended up using a different model instead. + +Here’s another fun example. This is [Popov et al, 2017](https://arxiv.org/abs/1704.03073), +sometimes known as “the Lego stacking paper”. +The authors use a distributed version of DDPG to learn a grasping policy. The +goal is to grasp the red block, and stack it on top of the blue block. + +They got it to work, but they ran into a neat failure case. +For the initial lifting motion, reward is given based on how high the red block +is. This is defined by the z-coordinate of the +bottom face of the block. One of the failure modes was that the policy learned +to tip the red block over, instead of picking it up. + +Now, clearly this isn’t the intended solution. But RL doesn’t care. +From the perspective of reinforcement learning, it got rewarded for flipping +a block, so it’s going to keep flipping blocks. + +One way to address this is to make the reward sparse, by only giving positive +reward after the robot stacks the block. Sometimes, this works, because the +sparse reward is learnable. Often, it doesn’t, because the lack of positive +reinforcement makes everything too difficult. + +The other way to address this is to do careful reward shaping, adding new +reward terms and tweaking coefficients of existing ones until the behaviors +you want to learn fall out of the RL algorithm. It’s *possible* to fight +RL on this front, but it’s a very unfulfilling fight. On occasion, it’s +necessary, but I’ve never felt like I’ve learnt anything by doing it. + +For reference, here is one of the reward functions from the Lego stacking +paper. + +![Lego grasp reward function](/public/rl-hard/lego_reward.png) + +I don’t know how much time was spent designing this reward, but based on the +number of terms and the number of different coefficients, I’m going to +guess “a lot”. + +In talks with other RL researchers, I’ve heard several anecdotes about +the novel behavior they’ve seen from improperly defined rewards. + +* A coworker is teaching an + agent to navigate a room. The episode terminates if the agent + walks out of bounds. He didn’t add any penalty if the episode terminates this + way. The final policy learned to be suicidal, because negative reward was + plentiful, positive reward was too hard to achieve, and a quick death ending + in 0 reward was preferable to a long life that risked negative reward. +* A friend is training a simulated robot arm to reach towards a point + above a table. It turns out the point was defined *with respect to the table*, + and the table wasn’t anchored to anything. + The policy learned to slam the table really hard, making the table fall + over, which moved the target point too. The target point *just so happened* + to fall next to the end of the arm. +* A researcher gives a talk about using RL to train a simulated robot hand to + pick up a hammer and hammer in a nail. Initially, the reward was defined + by how far the nail was pushed into the hole. Instead of + picking up the hammer, the robot used its own limbs to punch the nail in. + So, they added a reward term to encourage picking up the hammer, and retrained + the policy. + They got the policy to pick up the hammer…but then it threw the hammer at the + nail instead of actually using it. + +Admittedly, these are all secondhand accounts, and I haven’t seen videos of +any of these behaviors. +However, none of it sounds implausible to me. +I’ve been burned by RL too many times to believe otherwise. + +I know people who like to tell stories about [paperclip optimizers](https://en.wikipedia.org/wiki/Instrumental_convergence#Paperclip_maximizer). I get it, +I really do. But honestly, I’m sick of hearing those stories, because they +always speculate up some superhuman misaligned AGI to create a just-so story. +There’s no reason to speculate that far when present-day examples happen +all the time. + +# Even Given a Good Reward, Local Optima Can Be Hard To Escape + +The previous examples of RL are sometimes called “reward hacking”. To me, +this implies a clever, out-of-the-box solution that gives more reward than the +intended answer of the reward function designer. + +Reward hacking is the exception. The much more common case is a poor local optima +that comes from getting the exploration-exploitation trade-off wrong. + +Here’s one of my favorite videos. This is an implementation of +[Normalized Advantage Function](https://arxiv.org/abs/1603.00748), learning +on the HalfCheetah environment. + +[ + + +Your browser does not support the video element. + +](/public/rl-hard/upsidedown_half_cheetah.mp4) + +From an outside perspective, this is really, *really* dumb. But we can +only say it’s dumb because we can see the 3rd person view, and have a bunch of +prebuilt knowledge that tells us running on your feet is better. +RL doesn’t know this! It sees a state vector, it sends action vectors, and it +knows it’s getting some positive reward. That’s it. + +Here’s my best guess for what happened during learning. + +* In random exploration, the policy found falling forward was better than + standing still. +* It did so enough to “burn in” that behavior, so now it’s falling forward + consistently. +* After falling forward, the policy learned that if it does a one-time application + of a lot of force, it’ll do a backflip that gives a bit more reward. +* It explored the backflip enough to become confident this was a good idea, + and now backflipping is burned into the policy. +* Once the policy is backflipping consistently, which is easier for the + policy: learning to right itself and then run “the standard way”, or learning + or figuring out how to move forward while lying on its back? I would + guess the latter. + +It’s very funny, but it definitely isn’t what I wanted the robot to do. + +Here’s another failed run, this time on the Reacher environment. + +[ + + +Your browser does not support the video element. + +](/public/rl-hard/failed_reacher.mp4) + +In this run, the initial random weights tended to output highly positive or +highly negative action outputs. This makes most of the actions output the +maximum or minimum acceleration possible. It’s really easy to spin super fast: +just output high magnitude forces at every joint. Once the robot gets going, it’s hard +to deviate from this policy in a meaningful way - to deviate, you have to take +several exploration steps to stop the rampant spinning. It’s certainly +possible, but in this run, it didn’t happen. + +These are both cases of the classic exploration-exploitation problem that has dogged +reinforcement learning since time immemorial. +Your data comes from your current policy. If your current policy explores too +much you get junk data and learn nothing. Exploit too much and you burn-in +behaviors that aren’t optimal. + +There are several intuitively pleasing ideas for addressing this - intrinsic +motivation, curiosity-driven exploration, count-based exploration, and so forth. +Many of these approaches were first proposed in the 1980s or earlier, and +several of them have been revisited with deep learning models. However, as far +as I know, none of them work consistently across all environments. Sometimes +they help, sometimes they don’t. It would be nice if there was an exploration +trick that worked everywhere, but I’m skeptical a silver bullet of that caliber +will be discovered anytime soon. Not because people aren’t trying, but because +exploration-exploitation +is really, really, really, really, really hard. +To quote [Wikipedia](https://en.wikipedia.org/wiki/Multi-armed_bandit), + +> Originally considered by Allied scientists in World War II, it proved so intractable that, according to Peter Whittle, the problem was proposed to be dropped over Germany so that German scientists could also waste their time on it. + +(Reference: [Q-Learning for Bandit Problems, Duff 1995](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.57.1916&rep=rep1&type=pdf)) + +I’ve taken to imagining deep RL as a demon that’s +deliberately misinterpreting your reward and actively searching for the laziest +possible local optima. It’s a bit ridiculous, but I’ve found it’s actually a +productive mindset to have. + +# Even When Deep RL Works, It May Just Be Overfitting to Weird Patterns In the Environment + +> Deep RL is popular because it’s the only area in ML where it’s socially +> acceptable to train on the test set. + +[Source](https://twitter.com/jacobandreas/status/924356906344267776) + +The upside of reinforcement learning is that if you want to do +well in an environment, you’re free to overfit like crazy. The downside is that +if you want to generalize to any other environment, you’re probably going to +do poorly, because you overfit like crazy. + +DQN can solve a lot of the Atari games, but it does so by focusing all of +learning on a single goal - getting really good at one game. The final model +won’t generalize to other games, because it hasn’t been trained that way. +You can finetune a learned DQN to a new Atari game +(see [Progressive Neural Networks (Rusu et al, 2016)](https://arxiv.org/abs/1606.04671)), +but there’s no guarantee it’ll transfer and people usually don’t expect it to +transfer. +It’s not the wild success people see from pretrained ImageNet features. + +To forestall some obvious comments: yes, in principle, training on a wide +distribution of environments should make these issues go away. +In some cases, you get such a distribution for free. An example +is navigation, where you can sample goal locations randomly, and use +universal value functions to generalize. +(See [Universal Value Function Approximators, Schaul et al, ICML 2015](http://proceedings.mlr.press/v37/schaul15.pdf).) +I find this work very promising, and I give more examples of this work later. +However, I don’t think the +generalization capabilities of deep RL are strong enough to handle a diverse +set of tasks yet. Perception has gotten a lot better, but deep RL has yet to +have its “ImageNet for control” moment. OpenAI Universe tried to spark +this, but from what I heard, it was too difficult to solve, so not much got +done. + +Until we have that kind of generalization moment, we’re stuck with policies that +can be surprisingly narrow in scope. +As an example of this (and as an opportunity to poke fun at some of my own work), +consider [Can Deep RL Solve Erdos-Selfridge-Spencer Games? (Raghu et al, 2017)](https://arxiv.org/abs/1711.02301). +We studied a toy 2-player combinatorial game, where there’s a closed-form analytic solution +for optimal play. +In one of our first experiments, we fixed player 1’s behavior, then trained +player 2 with RL. By doing this, you can treat player 1’s actions as part +of the environment. By training player 2 against the optimal player 1, we showed +RL could reach high performance. But when we deployed the same +policy against a non-optimal player 1, its performance dropped, because it +didn’t generalize to non-optimal opponents. + +[Lanctot et al, NIPS 2017](https://arxiv.org/abs/1711.00832) showed a +similar result. Here, there are two agents +playing laser tag. The agents are trained with multiagent reinforcement +learning. To test generalization, they run the training with 5 random +seeds. Here’s a video of agents that have been trained against one +another. + +As you can see, they learn to move towards and shoot each other. Then, they +took player 1 from one experiment, and pitted it against player 2 from a +*different* experiment. If the learned policies generalize, we should see +similar behavior. + +Spoiler alert: they don’t. + +This seems to be a running theme in multiagent RL. When agents are trained +against one another, a kind of co-evolution happens. The agents get really good +at beating each other, but when they get deployed against an unseen player, +performance drops. I’d also like to point out that the +only difference between these videos is the random seed. Same learning +algorithm, same hyperparameters. The diverging behavior is purely from randomness +in initial conditions. + +That being said, there are some neat results from competitive self-play environments +that seem to contradict this. [OpenAI has a nice blog post of some of their work in this space](https://blog.openai.com/competitive-self-play/). +Self-play is also an important part of both +AlphaGo and AlphaZero. My intuition is that if your agents are learning at +the same pace, they can continually challenge each other and speed up each other’s +learning, but if one of them learns much faster, it exploits the weaker player +too much and overfits. As you relax from symmetric self-play to general +multiagent settings, it gets harder to ensure learning happens at the same +speed. + +# Even Ignoring Generalization Issues, The Final Results Can be Unstable and Hard to Reproduce + +Almost every ML algorithm has hyperparameters, which influence the behavior +of the learning system. Often, these are picked by hand, or by random search. + +Supervised learning is stable. Fixed dataset, ground truth targets. If you +change the hyperparameters a little bit, +your performance won’t change that much. Not all hyperparameters perform +well, but with all the empirical tricks discovered over the years, +many hyperparams will show signs of life during training. These signs of life are +super important, because they tell you that you’re on the right track, you’re +doing something reasonable, and it’s worth investing more time. + +Currently, deep RL isn’t stable at all, and it’s just hugely annoying for research. + +When I started working at Google Brain, one of the first +things I did was implement the algorithm from the Normalized Advantage Function +paper. I figured it would only take me about 2-3 weeks. I had several things +going for me: some familiarity with Theano (which transferred to TensorFlow +well), some deep RL experience, and the first author of the NAF paper was +interning at Brain, so I could bug him with questions. + +It ended up taking me 6 weeks to reproduce results, thanks to several software +bugs. The question is, why did it take so long to find these bugs? + +To answer this, let’s consider the simplest continuous control task in +OpenAI Gym: the Pendulum task. In this task, there’s a pendulum, anchored +at a point, with gravity acting on the pendulum. The input state is +3-dimensional. The action space is 1-dimensional, the amount of torque to apply. +The goal is to balance the pendulum perfectly straight up. + +This is a tiny problem, and it’s made even easier by a well shaped reward. +Reward is defined by the angle of the pendulum. Actions bringing the pendulum +closer to the vertical not only give reward, they give *increasing* reward. +The reward landscape is basically concave. + +Below is a video of a policy that *mostly* works. Although the policy doesn’t +balance straight up, it outputs the exact torque needed to counteract +gravity. + +[ + + +Your browser does not support the video element. + +](/public/rl-hard/pendulum_example.mp4) + +Here is a plot of performance, after I fixed all the bugs. Each line is the +reward curve from one of 10 independent runs. Same hyperparameters, the only +difference is the random seed. + +![Graph of Pendulum results](/public/rl-hard/pendulum_results.png) + +Seven of these runs worked. Three of these runs didn’t. *A 30% +failure rate counts as working.* +Here’s another plot from some published work, +[“Variational Information Maximizing Exploration” (Houthooft et al, NIPS 2016)](https://arxiv.org/abs/1605.09674). +The environment is HalfCheetah. The reward is modified to be sparser, but the +details aren’t too important. +The y-axis is episode reward, the x-axis is number of timesteps, and the +algorithm used is TRPO. + +![Plot from VIME paper](/public/rl-hard/vime.png) + +The dark line is the median performance over 10 random seeds, and the shaded +region is the 25th to 75th percentile. Don’t get me wrong, this plot is a good +argument in favor of VIME. But on the other hand, the 25th percentile line +is really close to 0 reward. That means about 25% of runs are failing, just +because of random seed. + +Look, there’s variance in supervised learning too, but it’s rarely this bad. +If my supervised learning code failed to beat random chance 30% of the time, I’d +have super high confidence there was a bug in data loading or training. If +my reinforcement learning code does no better than random, I have no idea if +it’s a bug, if my hyperparameters are bad, or if I simply got unlucky. + +![Dimensions of debugging](/public/rl-hard/dimensionsdebugging.png) + +This picture is from [“Why is Machine Learning ‘Hard’?”](http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html). +The core thesis is that machine learning adds more dimensions to your space +of failure cases, which exponentially increases the number of ways you can fail. +Deep RL adds a new dimension: random chance. And the only way you can address +random chance is by throwing enough experiments at the problem to drown out +the noise. + +**When your training algorithm is both sample inefficient and unstable, it heavily +slows down your rate of productive research.** Maybe it only takes 1 million +steps. But when you multiply that by 5 random seeds, and then multiply that with +hyperparam tuning, you need an exploding amount of compute to test hypotheses +effectively. + +> If it makes you feel any better, I’ve been doing this for a while and it took me last ~6 weeks to get a from-scratch policy gradients implementation to work 50% of the time on a bunch of RL problems. And I also have a GPU cluster available to me, and a number of friends I get lunch with every day who’ve been in the area for the last few years. +> +> Also, what we know about good CNN design from supervised learning land doesn’t seem to apply to reinforcement learning land, because you’re mostly bottlenecked by credit assignment / supervision bitrate, not by a lack of a powerful representation. Your ResNets, batchnorms, or very deep networks have no power here. +> +> [Supervised learning] wants to work. Even if you screw something up you’ll usually get something non-random back. RL must be forced to work. If you screw something up or don’t tune something well enough you’re exceedingly likely to get a policy that is even worse than random. And even if it’s all well tuned you’ll get a bad policy 30% of the time, just because. +> +> Long story short your failure is more due to the difficulty of deep RL, and much less due to the difficulty of “designing neural networks”. + +[Hacker News comment from Andrej Karpathy, back when he was at OpenAI](https://news.ycombinator.com/item?id=13519044) + +Instability to random seed is like a canary in a coal mine. If pure randomness +is enough to lead to this much variance between runs, imagine how much an actual +difference in the code could make. + +Luckily, we don’t have to imagine, because this was inspected by +the paper [“Deep Reinforcement Learning That Matters” (Henderson et al, AAAI 2018)](https://arxiv.org/abs/1709.06560). +Among its conclusions are: + +* Multiplying the reward by a constant can cause significant differences in performance. +* Five random seeds (a common reporting metric) may not be enough to argue + significant results, since with careful selection you can get non-overlapping + confidence intervals. +* Different implementations of the same algorithm have different performance on + the same task, even when the same hyperparameters are used. + +My theory is that RL is very sensitive to both your initialization and to the +dynamics of your training process, because your data is always collected online +and the only supervision you get is a single scalar for reward. A policy that +randomly stumbles onto good training examples will bootstrap itself much +faster than a policy that doesn’t. A policy that fails to discover good training +examples in time will collapse towards learning nothing at all, as it becomes +more confident that any deviation it tries will fail. + +# But What About All The Great Things Deep RL Has Done For Us? + +Deep reinforcement learning has certainly done some very cool things. DQN is +old news now, but was absolutely *nuts* at the time. A single model was able to +learn directly from raw pixels, without tuning for each game individually. +And AlphaGo and AlphaZero continue to be very impressive achievements. + +However, outside of these successes, it’s hard to find cases where deep RL +has created practical real world value. + +I tried to think of real-world, productionized uses of deep RL, and it was +surprisingly difficult. I expected to find something in recommendation systems, +but I believe those are still dominated by [collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering) +and [contextual bandits](https://research.yahoo.com/publications/5863/contextual-bandit-approach-personalized-news-article-recommendation). + +In the end, the best I could find were two Google projects: [reducing data +center power usage](https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/), +and the recently announced [AutoML Vision effort](https://cloud.google.com/automl/). +Jack Clark from OpenAI +[tweeted a similar request and found a similar conclusion](https://twitter.com/jackclarkSF/status/919584404472602624). +(The tweet is from last year, before AutoML was announced.) + +I know Audi’s doing something with deep RL, since they demoed a self-driving +RC car at NIPS and said it used deep RL. I know there’s some +neat work +[optimizing device placement for large Tensorflow graphs (Mirhoseini et al, ICML 2017)](https://arxiv.org/abs/1706.04972). +Salesforce has their text summarization model, which worked if you massaged the +RL carefully enough. +Finance companies are surely experimenting with RL as we speak, but so far +there’s no definitive proof. (Of course, finance companies have reasons to be cagey +about how they play the market, so perhaps the evidence there is never going to +be strong.) Facebook’s been doing some neat work with deep RL for chatbots and +conversation. Every Internet company ever has probably thought about adding RL to +their ad-serving model, but if anyone’s done it, they’ve kept quiet about it. + +The way I see it, either deep RL is still a research topic that isn’t robust +enough for widespread use, or it’s usable and the people who’ve gotten it +to work aren’t publicizing it. I think the former is more likely. + +If you came to me with an image classification problem, I’d point you to +pretrained ImageNet models, and they’d probably do great. +We’re in a world where +[the people behind *Silicon Valley* can build a real Not Hotdog app](https://medium.com/%40timanglade/how-hbos-silicon-valley-built-not-hotdog-with-mobile-tensorflow-keras-react-native-ef03260747f3) +as a joke. I have trouble seeing the same happen with deep RL. + +# Given Current Limitations, When Could Deep RL Work For Me? + +A priori, it’s really hard to say. The problem with trying to solve everything +with RL is that you’re trying to solve several very different environments +with the same approach. It’s only natural that it won’t work all the time. + +That being said, we can draw conclusions from the current list of deep +reinforcement learning successes. These are projects where deep RL either +learns some qualitatively impressive behavior, or +it learns something better than comparable prior work. (Admittedly, this +is a very subjective criteria.) + +Here’s my list so far. + +* Things mentioned in the previous sections: DQN, AlphaGo, AlphaZero, + the parkour bot, reducing power center usage, and AutoML with Neural + Architecture Search. +* [OpenAI’s Dota 2 1v1 Shadow Fiend bot, which beat top pro players in a + simplified duel setting.](https://blog.openai.com/dota-2/) +* [A Super Smash Brothers Melee bot](https://arxiv.org/abs/1702.06230) that can beat + pro players at [1v1 Falcon dittos](https://www.youtube.com/watch?v=dXJUlqBsZtE). + (Firoiu et al, 2017). + +(A quick aside: machine learning recently beat pro players at no-limit +heads up Texas Hold’Em. This was done by both [Libratus (Brown et al, IJCAI 2017)](https://www.ijcai.org/proceedings/2017/0772.pdf) +and [DeepStack (Moravčík et al, 2017)](https://arxiv.org/abs/1701.01724). +I’ve talked to a few people who believed this was done with deep RL. +They’re both very cool, but they don’t use deep RL. +They use counterfactual regret minimization and clever iterative solving of +subgames.) + +From this list, we can identify common properties that make learning easier. +None of the properties below are *required* for learning, but satisfying more +of them is definitively better. + +* **It is easy to generate near unbounded amounts of experience.** + It should be clear why this helps. The more data you have, the easier the learning + problem is. This applies to + Atari, Go, Chess, Shogi, and the simulated environments for the parkour bot. + It likely applies to the power center project too, because + [in prior work (Gao, 2014)](https://googleblog.blogspot.com/2014/05/better-data-centers-through-machine.html), + it was shown that neural nets can predict energy efficiency with high + accuracy. That’s exactly the kind of simulated model you’d want for training an + RL system. + + It might apply to the Dota 2 and SSBM work, but it depends on the throughput + of how quickly the games can be run, and how many machines were available to + run them. +* **The problem is simplified into an easier form.** One of the common errors + I’ve seen in deep RL is to dream too big. Reinforcement learning can do + anything! That doesn’t mean you have to do everything at once. + + The OpenAI Dota 2 bot only played the early game, only played Shadow Fiend against Shadow + Fiend in a 1v1 laning setting, used hardcoded item builds, and presumably + called the [Dota 2 API](https://developer.valvesoftware.com/wiki/Dota_2_Workshop_Tools/Scripting/API) + to avoid having to solve perception. The SSBM bot acheived superhuman performance, + but it was only in 1v1 games, with Captain Falcon only, on Battlefield only, + in an infinite time match. + + This isn’t a dig at either bot. Why work on a hard problem when you don’t + even know the easier one is solvable? The + broad trend of all research is to demonstrate the smallest proof-of-concept + first and generalize it later. OpenAI is extending their Dota 2 work, and + there’s ongoing work to extend the SSBM bot [to other characters](https://github.com/vladfi1/phillip). +* **There is a way to introduce self-play into learning.** This is a component + of AlphaGo, AlphaZero, the Dota 2 Shadow Fiend bot, and the SSBM Falcon bot. I + should note that by self-play, I mean exactly the setting where the game is + competitive, and both players can be controlled by the same agent. So far, + that setting seems to have the most stable and well-performing behavior. +* **There’s a clean way to define a learnable, ungameable reward.** Two player games + have this: +1 for a win, -1 for a loss. The [original neural architecture search paper from Zoph et al, ICLR 2017](https://openreview.net/forum?id=r1Ue8Hcxg) had this: validation accuracy of + the trained model. Any time you introduce reward shaping, you introduce a chance + for learning a non-optimal policy that optimizes the wrong objective. + + If you’re interested in further reading on what makes a good reward, + a good search term is [“proper scoring rule”](https://en.wikipedia.org/wiki/Scoring_rule#Proper_scoring_rules). + See [this Terrence Tao blog post](https://terrytao.wordpress.com/2016/06/01/how-to-assign-partial-credit-on-an-exam-of-true-false-questions/) for an approachable example. + + As for learnability, I have no advice besides trying it out to see if it + works. +* **If the reward has to be shaped, it should at least be rich.** + In Dota 2, reward can come from last hits (triggers after every monster kill + by either player), and health (triggers after every attack or skill that + hits a target.) These reward signals + come quick and often. For the SSBM bot, reward can be given for damage dealt + and taken, which gives signal for every attack that successfully lands. The shorter + the delay between action and consequence, the faster the feedback loop gets + closed, and the easier it is for reinforcement learning to figure out a path to high reward. + +# A Case Study: Neural Architecture Search + +We can combine a few of the principles to analyze the success of Neural +Architecture Search. According to the initial [ICLR 2017 version](https://arxiv.org/abs/1611.01578), +after 12800 examples, deep RL was able to design state-of-the art neural +net architectures. Admittedly, each example required training a neural net +to convergence, but this is still very sample efficient. + +As mentioned above, the reward is validation accuracy. +This is a very rich reward signal - if a neural net design decision only increases +accuracy from 70% to 71%, RL will still pick up on this. Additionally, there’s +evidence that hyperparameters in deep learning are close to +linearly independent. (This was empirically shown in [Hyperparameter +Optimization: A Spectral Approach (Hazan et al, 2017)](https://arxiv.org/abs/1706.00764) - a summary by me is +[here](/2017/06/27/hyperparam-spectral.html) if interested.) +NAS isn’t exactly tuning hyperparameters, but I think it’s reasonable +that neural net design decisions would act similarly. This is good +news for learning, because the correlations between decision and performance +are strong. Finally, not only is the reward rich, it’s actually what we care +about when we train models. + +The combination of all these points helps me understand why it “only” takes about +12800 trained networks to learn a better one, compared to the millions of examples +needed in other environments. Several parts of the problem are all pushing in +RL’s favor. + +Overall, success stories this strong are still the exception, not the rule. +Many things have to go right for reinforcement learning to be a plausible +solution, and even then, it’s not a free ride to make that solution happen. + +In short: deep RL is currently not a plug-and-play technology. + +# Looking to The Future + +There’s an old saying - every researcher learns how to hate their area of +study. The trick is that researchers will press on despite this, because they +like the problems too much. + +That’s roughly how I feel about deep reinforcement learning. Despite my +reservations, I think people absolutely should be throwing RL at different +problems, including ones where it probably shouldn’t work. How else are we +supposed to make RL better? + +I see no reason why deep RL couldn’t work, given more time. Several very +interesting things are going to happen when deep RL is robust enough for wider +use. The question is +how it’ll get there. + +Below, I’ve listed some futures I find plausible. For the futures +based on further research, I’ve provided citations to relevant papers in those +research areas. + +**Local optima are good enough:** It would be very arrogant to claim humans are +globally optimal at anything. I would guess we’re juuuuust good enough to get +to civilization stage, compared to any other species. In the same vein, an +RL solution doesn’t have to achieve a global optima, as long as its local optima +is better than the human baseline. + +**Hardware solves everything:** I know some people who believe that the most +influential thing that can be done for AI is simply scaling up hardware. Personally, +I’m skeptical that hardware will fix everything, but it’s certainly going to +be important. The faster you can run things, the less you care about sample +inefficiency, and the easier it is to brute-force your way past exploration +problems. + +**Add more learning signal:** Sparse rewards are hard to learn because you get +very little information about what thing help you. It’s possible we can either hallucinate +positive rewards ([Hindsight Experience Replay, Andrychowicz et al, NIPS 2017](https://arxiv.org/abs/1707.01495)), define auxiliary tasks ([UNREAL, Jaderberg et al, NIPS 2016](https://arxiv.org/abs/1611.05397)), +or bootstrap with self-supervised learning to build good world model. Adding +more cherries to the cake, so to speak. + +**Model-based learning unlocks sample efficiency:** Here’s how I describe +model-based RL: “Everyone wants to do it, not many people know how.” In principle, +a good model fixes a bunch of problems. As seen in AlphaGo, having a model +at all makes it much easier to learn a good solution. +Good world models will transfer well to new tasks, +and rollouts of the world model let you imagine new experience. From what I’ve +seen, model-based approaches use fewer samples as well. + +The problem is that learning good models is hard. My impression is that +low-dimensional state models work sometimes, and image +models are usually too hard. But, if this gets easier, some interesting things +could happen. + +[Dyna (Sutton, 1991)](http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=711FEF6BA26BBF98C28BC111B26F8761?doi=10.1.1.48.6005&rep=rep1&type=pdf) and +[Dyna-2 (Silver et al., ICML 2008)](http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Applications_files/dyna2.pdf) are +classical papers in this space. +For papers combining model-based learning with deep nets, I would recommend a few recent papers from the Berkeley robotics labs: +[Neural Network Dynamics for Model-Based Deep RL with Model-Free Fine-Tuning (Nagabandi et al, 2017](http://bair.berkeley.edu/blog/2017/11/30/model-based-rl/), +[Self-Supervised Visual Planning with Temporal Skip Connections (Ebert et al, CoRL 2017)](https://arxiv.org/abs/1710.05268), +[Combining Model-Based and Model-Free Updates for Trajectory-Centric Reinforcement Learning (Chebotar et al, ICML 2017)](https://arxiv.org/abs/1703.03078). +[Deep Spatial Autoencoders for Visuomotor Learning (Finn et al, ICRA 2016)](http://rll.berkeley.edu/dsae/dsae.pdf), +and [Guided Policy Search (Levine et al, JMLR 2016)](http://jmlr.org/papers/v17/15-522.html). + +**Use reinforcement learning just as the fine-tuning step:** The first AlphaGo +paper started with supervised learning, and then did RL fine-tuning on top of it. +This is a nice recipe, since it lets you use a faster-but-less-powerful method +to speed up initial learning. It’s worked in other contexts - see +[Sequence Tutor (Jaques et al, ICML 2017)](https://arxiv.org/abs/1611.02796). +You can view this as starting the RL process with a reasonable prior, instead of +a random one, where the problem of learning the prior is offloaded to some +other approach. + +**Reward functions could be learnable:** The promise of ML is that we can use +data to learn things that are better than human design. If reward function design +is so hard, Why not apply this to learn better reward functions? Imitation +learning and inverse reinforcement learning are both rich fields that have +shown reward functions can be implicitly +defined by human demonstrations or human ratings. + +For famous papers in inverse RL and imitation learning, see +[Algorithms for Inverse Reinforcement Learning (Ng and Russell, ICML 2000)](http://ai.stanford.edu/~ang/papers/icml00-irl.pdf), +[Apprenticeship Learning via Inverse Reinforcement Learning (Abbeel and Ng, ICML 2004)](http://ai.stanford.edu/~ang/papers/icml04-apprentice.pdf), +and [DAgger (Ross, Gordon, and Bagnell, AISTATS 2011)](https://www.cs.cmu.edu/~sross1/publications/Ross-AIStats11-NoRegret.pdf). + +For recent work scaling these ideas to deep learning, see [Guided Cost Learning (Finn et al, ICML 2016)](https://arxiv.org/abs/1603.00448), [Time-Constrastive Networks (Sermanet et al, 2017)](https://arxiv.org/abs/1704.06888), +and [Learning From Human Preferences (Christiano et al, NIPS 2017)](https://blog.openai.com/deep-reinforcement-learning-from-human-preferences/). (The Human Preferences paper in particular showed +that a reward learned from human ratings was actually better-shaped for learning +than the original hardcoded reward, which is a neat practical result.) + +For longer term work that doesn’t use deep learning, I liked +[Inverse Reward Design (Hadfield-Menell et al, NIPS 2017)](https://arxiv.org/abs/1711.02827) +and [Learning Robot Objectives from Physical Human Interaction (Bajcsy et al, CoRL 2017)](http://proceedings.mlr.press/v78/bajcsy17a/bajcsy17a.pdf). + +**Transfer learning saves the day:** The promise of transfer learning is that +you can leverage knowledge from previous tasks to speed up learning of new ones. +I think this is absolutely the future, when task learning is robust enough to +solve several disparate tasks. It’s hard to do transfer learning if you can’t +learn at all, and given task A and task B, it can be very hard to predict +whether A transfers to B. In my experience, it’s either super obvious, or super +unclear, and even the super obvious cases aren’t trivial to get working. + +[Universal Value Function Approximators (Schaul et al, ICML 2015)](http://proceedings.mlr.press/v37/schaul15.pdf), +[Distral (Whye Teh et al, NIPS 2017)](https://arxiv.org/abs/1707.04175), +and [Overcoming Catastrophic Forgetting (Kirkpatrick et al, PNAS 2017)](https://deepmind.com/blog/enabling-continual-learning-in-neural-networks/) are recent works in this direction. +For older work, consider reading [Horde (Sutton et al, AAMAS 2011)](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.6455&rep=rep1&type=pdf). + +Robotics +in particular has had lots of progress in sim-to-real transfer (transfer learning +between a simulated version of a task and the real task). See [Domain Randomization (Tobin et al, IROS 2017)](https://blog.openai.com/spam-detection-in-the-physical-world/), [Sim-to-Real Robot +Learning with Progressive Nets (Rusu et al, CoRL 2017)](https://arxiv.org/abs/1610.04286), +and +[GraspGAN (Bousmalis et al, 2017)](https://research.googleblog.com/2017/10/closing-simulation-to-reality-gap-for.html). (Disclaimer: I worked on GraspGAN.) + +**Good priors could heavily reduce learning time:** This is closely tied to +several of the previous points. In one view, transfer learning is about using +past experience to build a good prior for learning other tasks. +RL algorithms are designed to apply to any Markov Decision Process, which is +where the pain of generality comes in. +If we accept that our solutions will only perform well on a small section of +environments, we should be able to leverage shared structure to solve those +environments in an efficient way. + +One point Pieter Abbeel +likes to mention in his talks is that deep RL only needs to solve tasks that +we expect to need in the real world. I agree it makes a lot of sense. +There should exist +a real-world prior that lets us quickly learn new real-world tasks, at the cost +of slower learning on non-realistic tasks, but that’s a perfectly acceptable trade-off. + +The difficulty is that such a real-world prior will be very hard to design. +However, I think there’s a good chance it won’t be impossible. +Personally, I’m excited by the recent work in metalearning, since it provides +a data-driven way to generate reasonable priors. +For example, if I wanted to use RL to do warehouse navigation, I’d get pretty +curious about using metalearning to learn a good navigation prior, +and then fine-tuning the prior for the specific warehouse the robot will be +deployed in. +This very much seems like the future, and the question is whether metalearning +will get there or not. + +A summary of recent learning-to-learn work can be found in +[this post from BAIR (Berkeley AI Research)](http://bair.berkeley.edu/blog/2017/07/18/learning-to-learn/). + +**Harder environments could paradoxically be easier:** One of the big lessons +from the DeepMind parkour paper is that if you make your task very difficult +by adding several task variations, you can actually make the learning +easier, because the policy cannot overfit to any one setting without losing +performance on all the other settings. We’ve seen a similar thing in the +domain randomization papers, and even back to ImageNet: models trained on +ImageNet will generalize way better than ones trained on CIFAR-100. As I said +above, maybe we’re just an “ImageNet for control” away from making RL +considerably more generic. + +Environment wise, there are a lot of options. [OpenAI Gym](https://github.com/openai/gym) +easily has the most traction, but there’s also the [Arcade Learning Environment](https://github.com/mgbellemare/Arcade-Learning-Environment), [Roboschool](https://github.com/openai/roboschool), +[DeepMind Lab](https://github.com/deepmind/lab), the [DeepMind Control Suite](https://github.com/deepmind/dm_control), and [ELF](https://github.com/facebookresearch/ELF). + +Finally, although it’s unsatisfying from a research +perspective, the empirical issues of deep RL may not matter for practical purposes. As a +hypothetical example, suppose a finance company is using deep RL. They train a +trading agent based on past data from the US stock market, using 3 random seeds. +In live A/B testing, one gives 2% less revenue, one performs the +same, and one gives 2% more revenue. In that hypothetical, reproducibility +doesn’t matter - you deploy the model with 2% more revenue and celebrate. +Similarly, it doesn’t matter that the trading agent may only perform well +in the United States - if it generalizes poorly to the worldwide market, +just don’t deploy it there. There is a large gap between doing something +extraordinary and making that extraordinary success reproducible, and maybe it’s +worth focusing on the former first. + +# Where We Are Now + +In many ways, I find myself annoyed with the current state of deep RL. +And yet, it’s attracted some of the strongest research +interest I’ve ever seen. My feelings are best summarized by a mindset Andrew +Ng mentioned in his [Nuts and Bolts of Applying Deep Learning](https://www.youtube.com/watch?v=F1ka6a13S9I) +talk - a lot of short-term pessimism, balanced by even more long-term optimism. +Deep RL is a bit messy right now, but I still believe in where it could be. + +That being said, +the next time someone asks me whether reinforcement learning can solve their +problem, I’m still going to tell them that no, it can’t. But I’ll also tell +them to ask me again in a few years. By then, maybe it can. + +\* \* \* + +*This post went through a lot of revision. Thanks go to following people +for reading earlier drafts: +Daniel Abolafia, +[Kumar Krishna Agrawal](https://kumarkrishna.github.io), +[Surya Bhupatiraju](https://www.linkedin.com/in/suryabhupa/), +[Jared Quincy Davis](https://www.linkedin.com/in/jaredquincydavis/), +[Ashley Edwards](http://cc.gatech.edu/~aedwards/), +[Peter Gao](https://www.linkedin.com/in/pgaooo/), +Julian Ibarz, +Sherjil Ozair, +[Vitchyr Pong](https://people.eecs.berkeley.edu/~vitchyr/), +Alex Ray, +and Kelvin Xu. There were several more reviewers who I’m crediting +anonymously - thanks for all the feedback.* + +[← MIT Mystery Hunt 2018](/2018/01/18/mh-2018.html) +[Blog Posts and Research Papers →](/2018/03/07/blog-paper.html) + +Please enable JavaScript to view the [comments powered by Disqus.](https://disqus.com/?ref_noscript) + +## Sorta Insightful + +* Email: alexirpan [at] berkeley [dot] edu + +* [alexirpan](https://github.com/alexirpan) diff --git a/docs/evidence/amid_fish_reproducing_deep_rl.md b/docs/evidence/amid_fish_reproducing_deep_rl.md new file mode 100644 index 0000000..2cfc810 --- /dev/null +++ b/docs/evidence/amid_fish_reproducing_deep_rl.md @@ -0,0 +1,679 @@ +Source: http://amid.fish/reproducing-deep-rl +Title: Lessons Learned Reproducing a Deep Reinforcement Learning Paper - Matthew Rahtz (2018) +Fetched-via: uvx markitdown http://amid.fish/reproducing-deep-rl +Fetch-status: verbatim + +[Amid Fish](/) + +# Lessons Learned Reproducing a Deep Reinforcement Learning Paper + +Apr 6, 2018 + +There are a lot of neat things going on in deep reinforcement learning. One of +the coolest things from last year was OpenAI and DeepMind’s work on training an +agent using feedback from a human rather than a classical reward signal. +There’s a great blog post about it at [Learning from Human +Preferences](https://blog.openai.com/deep-reinforcement-learning-from-human-preferences/), +and the original paper is at [Deep Reinforcement Learning from Human +Preferences](https://arxiv.org/pdf/1706.03741.pdf). + +![](images/humanfeedbackjump.gif) + +Learn some deep reinforcement learning, and you too can train a noodle to do backflip. From [Learning from Human Preferences](https://blog.openai.com/deep-reinforcement-learning-from-human-preferences/). + +I’ve seen a few recommendations that reproducing papers is a good way of +levelling up machine learning skills, and I decided this could be an +interesting one to try with. It was indeed a [super fun +project](https://github.com/mrahtz/learning-from-human-preferences), and I’m +happy to have tackled it - but looking back, I realise it wasn’t exactly the +experience I thought it would be. + +If you’re thinking about reproducing papers too, here are some notes on what +surprised me about working with deep RL. + +--- + +First, in general, **reinforcement learning turned out to be a lot trickier +than expected**. + +A big part of it is that right now, reinforcement learning is really sensitive. +There are a lot of details to get *just* right, and if you don’t get them +right, it can be difficult to diagnose where you’ve gone wrong. + +Example 1: after finishing the basic implementation, training runs just weren’t +succeeding. I had all sorts of ideas about what the problem might be, but after +a couple of months of head scratching, it turned out to be because of problems +with normalization of rewards and pixel data at a key stage[1](#fn:normproblems). +Even with the benefit of hindsight, there were no obvious clues pointing in +that direction: the accuracy of the reward predictor network the pixel data +went into was just fine, and it took a long time to occur to me to examine the +rewards predicted carefully enough to notice the reward normalization bug. +Figuring out what the problem was happened almost accidentally, noticing a +small inconsistency that eventually lead to the right path. + +Example 2: doing a final code cleanup, I realised I’d implemented dropout kind +of wrong. The reward predictor network takes as input a pair of video clips, +each processed identically by two networks with shared weights. If you add +dropout and you’re not careful about giving it the same random seed in each +network, you’ll drop out differently for each network, so the video clips won’t +be processed identically. As it turned out, though, fixing it completely broke +training, despite prediction accuracy of the network looking exactly the same! + +![](images/broken_dropout.png) + +Spot which one is broken. Yeah, I don't see it either. + +I get the impression this is a pretty common story (e.g. [Deep Reinforcement +Learning Doesn’t Work Yet](https://www.alexirpan.com/2018/02/14/rl-hard.html)). +My takeaway is that, starting a reinforcement learning project, you should +**expect to get stuck like you get stuck on a math problem**. It’s not like my +experience of programming in general so far where you get stuck but there’s +usually a clear trail to follow and you can get unstuck within a couple of days +at most. It’s more like when you’re trying to solve a puzzle, there are no +clear inroads into the problem, and the only way to proceed is to try things +until you find the key piece of evidence or get the key spark that lets you +figure it out. + +A corollary is to **try and be as sensitive as possible in noticing +confusion**. + +There were a lot of points in this project where the only clues came from +noticing some small thing that didn’t make sense. For example, at some point it +turned out that taking the difference between frames as features made things +work much better. It was tempting to just forge ahead with the new features, +but I realised I was confused about *why* it made such a big difference for the +simple environment I was working with back then. It was only by following that +confusion and realising that taking the difference between frames zeroed out +the background that gave the hint of a problem with normalization. + +I’m not entirely sure how to make one’s mind do more of this, but my best +guesses at the moment are: + +* Learn to **recognise what confusion *feels* like**. There are a lot of + different shades of the “something’s not quite right” feeling. Sometimes it’s + code you know is ugly. Sometimes it’s worry about wasting time on the wrong + thing. But sometimes it’s that *you’ve seen something you didn’t expect*: + confusion. Being able to recognise that exact shade of discomfort is + important, so that you can… +* Develop the habit of following through on confusion. There are some + sources of discomfort that it can be better to ignore in the moment (e.g. + code smell while prototyping), but confusion isn’t one of them. It seems + important to really **commit yourself to *always* investigate whenever you + notice confusion**. + +In any case: expect to get stuck for several weeks at a time. (And have +confidence you will be able to get to the other side if you keep at it, paying +attention to those small details.) + +--- + +Speaking of differences to past programming experiences, a second major +learning experience was the **difference in mindset required for working with +long iteration times**. + +Debugging seems to involve four basic steps: + +* Gather evidence about what the problem might be. +* Form hypotheses about the problem based on the evidence you have so far. +* Choose the most likely hypothesis, implement a fix, and see what happens. +* Repeat until the problem goes away. + +In most of the programming I’ve done before, I’ve been used to rapid feedback. +If something doesn’t work, you can make a change and see what difference it +makes within seconds or minutes. Gathering evidence is very cheap. + +In fact, in rapid-feedback situations, gathering evidence can be a lot cheaper +than forming hypotheses. Why spend 15 minutes carefully considering everything +that could be causing what you see when you can check the first idea that jumps +to mind in a fraction of that (and gather more evidence in the process)? To put +it another way: if you have rapid feedback, you can narrow down the hypothesis +space a lot faster by trying things than thinking carefully. + +If you keep that strategy when each run takes 10 hours, though, you can easily +waste a *lot* of time. Last run didn’t work? OK, I think it’s this thing. Let’s +set off another run to check. Coming back the next morning: still doesn’t work? +OK, maybe it’s this other thing. Let’s set off another run. A week later, you +still haven’t solved the problem. + +Doing multiple runs at the same time, each trying a different thing, can help +to some extent, but a) unless you have access to a cluster you can end up +racking up a lot of costs on cloud compute (see below), and b) because of the +kinds of difficulties with reinforcement learning mentioned above, if you try +to iterate too quickly, you might never realise what kind of evidence you +actually need. + +Switching from **experimenting a lot and thinking a little** to **experimenting +a little and thinking a lot** was a key turnaround in productivity. When +debugging with long iteration times, you really need to *pour* time into the +hypothesis-forming step - thinking about what all the possibilities are, how +likely they seem on their own, and how likely they seem in light of everything +you’ve seen so far. Spend as much time as you need, even if it takes 30 +minutes, or an hour. Reserve experiments for once you’ve fleshed out the +hypothesis space as thoroughly as possible and know which pieces of evidence +would allow you to best distinguish between the different possibilities. + +(It’s especially important to be deliberate about this if you’re working on +something as a side project. If you’re only working on it for an hour a day and +each iteration takes a day to run, the number of runs you can do per week ends +up feeling a precious commodity you have to make the most of. It’s easy to +then feel a sense of pressure to spend your working hour each day rushing to +figure out something to do for that day’s run. Another turnaround was being +willing to spend several days just *thinking*, not starting any runs, until I +felt really confident I had a strong hypothesis about what the problem was.) + +A key enabler of the switch to thinking more was **keeping a much more detailed +work log**. Working without a log is fine when each chunk of progress takes +less than a few hours, but anything longer than that and it’s easy to forget +what you’ve tried so far and end up just going in circles. The log format I +converged on was: + +* Log 1: what specific output am I working on right now? +* Log 2: thinking out loud - e.g. hypotheses about the current problem, what to + work on next +* Log 3: record of currently ongoing runs along with a short reminder of what + question each run is supposed to answer +* Log 4: results of runs (TensorBoard graphs, any other significant + observations), separated by type of run (e.g. by environment the agent is + being trained in) + +I started out with relatively sparse logs, but towards the end of the project +my attitude moved more towards “log absolutely everything going through my +head”. The overhead was significant, but I think it was worth it - partly +because some debugging required cross-referencing results and thoughts that +were days or weeks apart, and partly for (at least, this is my impression) +general improvements in thinking quality from the massive upgrade to effective +mental RAM. + +![](images/rl_logs.jpg) + +A typical day's log. + +--- + +In terms of **getting the most out of the experiments you do run**, there are +two things I started experimenting with towards the end of the project which +seem like they could be helpful in the future. + +First, adopting an attitude of **log all the metrics you can** to maximise the +amount of evidence you gather on each run. There are obvious metrics like +training/validation accuracy, but it might also be worth spending a good chunk +of time at the start of the project brainstorming and researching which other +metrics might be important for diagnosing potential problems. + +I might be making this recommendation partly out of hindsight bias where I +*know* which metrics I should have started logging earlier. It’s hard to +predict which metrics will be useful in advance. Still, heuristics that might +be useful are: + +* For every important component in the system, consider what *can* be measured + about it. If there’s a database, measure how quickly it’s growing in size. + If there’s a queue, measure how quickly items are being processed. +* For every complex procedure, measure how long different parts of it take. If + you’ve got a training loop, measure how long each batch takes to run. If + you’ve got a complex inference procedure, measure how long each sub-inference + takes. Those times are going to help a lot for performance debugging later + on, and can sometimes reveal bugs that are otherwise hard to spot. (For + example, if you see something taking longer and longer, it might be because + of a memory leak.) +* Similarly, consider profiling memory usage of different components. Small + memory leaks can be indicative of all sorts of things. + +Another strategy is to look at what other people are measuring. In the context +of deep reinforcement learning, John Schulman has some good tips in his [Nuts +and Bolts of Deep RL talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ) +([slides](http://joschu.net/docs/nuts-and-bolts.pdf); [summary +notes](https://github.com/williamFalcon/DeepRLHacks)). For policy gradient +methods, I’ve found policy entropy in particular to be a good indicator of +whether training is going anywhere - much more sensitive than per-episode +rewards. + +![](images/entropies.png) + +Examples of unhealthy and healthy +policy entropy graphs. Failure mode 1 (left): convergence to constant entropy (random choice among a subset of actions). Failure mode 2 (centre): convergence to zero entropy (choosing the same action every time). Right: policy entropy from a successful Pong training run. + +When you do see something suspicious in metrics recorded, remembering to +*notice confusion*, err on the side of assuming it’s something important rather +than just e.g. an inefficient implementation of some data structure. (I missed +a multithreading bug for several months by ignoring a small but mysterious +decay in frames per second.) + +Debugging is much easier if you can see all your metrics in one place. I like +to have as much as possible on TensorBoard. Logging arbitrary metrics with +TensorFlow can be awkward, though, so **consider checking out +[easy-tf-log](https://github.com/mrahtz/easy-tf-log)**, which provides an easy +`tflog(key, value)` interface without any extra setup. + +A second thing that seems promising for getting more out of runs is +**taking the time to try and predict failure in advance**. + +Thanks to hindsight bias, failures often seem obvious in retrospect. But the +*really* frustrating thing is when the failure mode is obvious *before you’ve +even observed what it was*. You know when you’ve set off a run, you come back +the next day, you see it’s failed, and even before you’ve investigated, you +realise, “Oh, it must have been because I forgot to set the frobulator”? That’s +what I’m talking about. + +The neat thing is that sometimes you can trigger that kind of +half-hindsight-realisation in advance. It does take conscious effort, though - +really stopping for a good five minutes before launching a run to think about +what might go wrong. The particular script I found most helpful to go through +was: [2](#fn:murphyjitsu) + +1. Ask yourself, “How surprised would I be if this run failed?” +2. If the answer is ‘not very surprised’, put yourself in the shoes of + future-you where the run *has* failed, and ask, “If I’m here, what might + have gone wrong?” +3. Fix whatever comes to mind. +4. Repeat until the answer to question 1 is “very surprised” (or at least “as + surprised as I can get”). + +There are always going to be failures you couldn’t have predicted, and +sometimes you still miss obvious things, but this does at least seem to *cut +down* on the number of times something fails in a way you feel *really* stupid +for not having thought of earlier. + +--- + +Finally, though, **the biggest surprise with this project was just how long it +took** - and related, the amount of compute resources it needed. + +The first surprise was in terms of calendar time. My original estimate was that +as a side project it would take about 3 months. It actually took around *8 +months*. (And the original estimate was supposed to be pessimistic!) Some of +that was down to underestimating how many hours each stage would take, but a +big chunk of the underestimate was failing to anticipate other things coming up +outside the project. It’s hard to say how well this generalises, but **for +side projects, taking your original (already pessimistic) time estimates and +doubling them** might not be a bad rule-of-thumb. + +The more interesting surprise was in how many hours each stage actually took. +The main stages of my initial project plan were basically: + +![](images/pretime.png) + +Here’s how long each stage *actually* took. + +![](images/posttime.png) + +It wasn’t writing code that took a long time - it was debugging it. In fact, +getting it working on even a [supposedly-simple +environment](https://github.com/mrahtz/gym-moving-dot) took *four times* as +long as initial implementation. (This is the first side project where I’ve been +keeping track of hours, but experiences with past machine learning projects +have been similar.) + +(Side note: be careful about designing from scratch what you hope should be an +‘easy’ environment for reinforcement learning. In particular, think carefully +about a) whether your rewards really convey the right information to be able to +solve the task - yes, this is easy to mess up - and b) whether rewards depend +only on previous observations or also on current action. The latter, in +particular, might be relevant if you’re doing any kind of reward prediction, +e.g. with a critic.) + +**Another surprise was the amount of compute time needed.** I was lucky having +access to my university’s cluster - only CPU machines, but that was fine for +some tasks. For work which needed a GPU (e.g. to iterate quickly on some small +part) or when the cluster was too busy, I experimented with two cloud services: +VMs on [Google Cloud Compute +Engine](https://console.cloud.google.com/projectselector/compute/instances?supportedpurview=project), +and [FloydHub](http://floydhub.com/). + +Compute Engine is fine if you just want shell access to a GPU machine, but I +tried to do as much as possible on FloydHub. FloydHub is basically a cloud +compute service targeted at machine learning. You run `floyd run python +awesomecode.py` and FloydHub sets up a container, uploads your code to it, and +runs the code. The two key things which make FloydHub awesome are: + +* Containers come preinstalled with GPU drivers and common libraries. (Even in + 2018, I wasted a good few hours fiddling with CUDA versions while upgrading + TensorFlow on the Compute Engine VM.) +* Each run is automatically archived. For each run, the code used, the exact + command used to start the run, any command-line output, and any data outputs + are saved automatically, and indexed through a web interface. + + [![](images/floydhub.png)](images/floydhub.png) + +FloydHub's web interface. Top: index of past runs, +and overview of a single run. Bottom: both the code used for each run and any +data output from the run are automatically archived. + +I can’t stress enough how important that second feature is. For any project +this long, detailed records of what you’ve tried and the ability to reproduce +past experiments are an absolute must. Version control software can help, but +a) managing large outputs can be painful, and b) requires extreme diligence. +(For example, if you’ve set off some runs, then make a small change and launch +another run, when you commit the results of the first runs, is it going to be +clear which code was used?) You could take careful notes or roll your own +system, but with FloydHub, *it just works* and you save *so* much mental +energy. + +(Update: check out some example FloydHub runs at +.) + +Other things I like about FloydHub are: + +* Containers are automatically shut down once the run is finished. Not having + to worry about checking runs to see whether they’ve finished and the VM can + be turned off is a big relief. +* Billing is much more straightforward than with cloud VMs. You pay for usage + in, say, 10-hour blocks, and you’re charged immediately. That makes keeping + weekly budgets much easier. + +The one pain point I’ve had with FloydHub is that you can’t customize +containers. If your code has a lot of dependencies, you’ll need to install them +at the start of every run. That limits the rate at which you can iterate on +short runs. You *can* get around this, though, by creating a ‘dataset’ which +contains the changes to the filesystem from installing dependencies, then +copying files from that dataset at the start of each run (e.g. +[`create_floyd_base.sh`](https://github.com/mrahtz/learning-from-human-preferences/blob/master/floydhub_utils/create_floyd_base.sh)). +It’s awkward, but still probably less awkward than having to deal with GPU +drivers. + +FloydHub is a little more expensive than Compute Engine: as of writing, +$1.20/hour for a machine with a K80 GPU, compared to about $0.85/hour for a +similarly-specced VM (though less if you don’t need as much as 61 GB of RAM). +Unless your budget is really limited, I think the extra convenience of FloydHub +is worth it. The only case where Compute Engine can be a lot cheaper is doing a +lot of runs in parallel, which you can stack up on a single large VM. + +(A third option is Google’s new +[Colaboratory](https://colab.research.google.com) service, which gives you a +hosted Jupyter notebook with free access to a single K80 GPU. Don’t be put off +by Jupyter: you can execute arbitrary commands, and set up shell access if you +really want it. The main drawbacks are that your code doesn’t keep running if +you close the browser window, and there are time limits on how long you can run +before the container hosting the notebook gets reset. So it’s not suitable for +doing long runs, but can be useful for quick prototyping on a GPU.) + +In total, the project took: + +* **150 hours of GPU time and 7,700 hours (wall time × cores) of CPU time** on + Compute Engine, +* **292 hours of GPU time** on FloydHub, +* and **1,500 hours (wall time, 4 to 16 cores) of CPU time** on my university’s + cluster. + +I was horrified to realise that in total, that added up to **about $850** ($200 +on FloydHub, $650 on Compute Engine) over the 8 months of the project. + +Some of that’s down to me being ham-fisted (see the above section on mindset +for slow iteration). Some of it’s down to the fact that reinforcement learning +is still so sample-inefficient that runs do just take a long time (up to 10 +hours to train a Pong agent that beats the computer every time). + +But a big chunk of it was down to a horrible surprise I had during the final +stages of the project: **reinforcement learning can be so unstable that you +need to repeat every run multiple times with different seeds to be confident**. + +For example, once I thought everything was basically working, I sat down to +make end-to-end tests for the environments I’d been working with. But I was +having trouble getting even the simplest environment I’d been working with, +[training a dot to move to the centre of a +square](https://github.com/mrahtz/gym-moving-dot), to train successfully. I +went back to the FloydHub job that had originally worked and re-ran three +copies. It turned out that the hyperparameters I thought were fine actually +only succeeded one out of three times. + +![](images/failed_reproductions.png) + +It's not uncommon for two out of three random seeds (red/blue) to fail. + +To give a visceral sense of how much compute that means you need: + +* Using A3C with 16 workers, Pong would take about 10 hours to train. +* That’s 160 hours of CPU time. +* Running 3 random seeds, that 480 hours (20 days) of CPU time. + +In terms of costs: + +* FloydHub charges about $0.50 per hour for an 8-core machine. +* So 10 hours costs about $5 per run. +* **Running 3 different random seeds at the same time, that’s $15 per run.** + +**That’s, like, 3 sandwiches every time you want to test an idea.** + +Again, from [Deep Reinforcement Learning Doesn’t Work +Yet](https://www.alexirpan.com/2018/02/14/rl-hard.html), that kind of +instability seems normal and accepted right now. In fact, even “Five random +seeds (a common reporting metric) may not be enough to argue significant +results, since with careful selection you can get non-overlapping confidence +intervals.” + +(All of a sudden the $25,000 of AWS credits that the [OpenAI Scholars +programme](https://blog.openai.com/openai-scholars/) provides doesn’t seem +quite so crazy. That probably *is* about the amount you need to give someone so +that compute isn’t a worry at all.) + +My point here is that **if you want to tackle a deep reinforcement learning +project, make sure you know what you’re getting yourself into**. Make sure +you’re prepared for how much time it could take and how much it might cost. + +--- + +Overall, reproducing a reinforcement learning paper was a fun side project to +try. But looking back, thinking about which skills it actually levelled up, I’m +also wondering whether reproducing a paper was really the best use of time over +the past months. + +On one hand, I definitely feel like my machine learning *engineering* ability +improved a lot. I feel more confident in being able to recognise common RL +implementation mistakes; my workflow got a whole lot better; and from this +particular paper I got to learn a bunch about Distributed TensorFlow and +asynchronous design in general. + +On the other hand, I don’t feel like my machine learning *research* ability +improved much (which is, in retrospect, what I was actually aiming for). Rather +than implementation, the much more difficult part of research seems to be +coming up with ideas that are interesting but also *tractable and concrete*; +ideas which give you the best bang-for-your-buck for the time you *do* spend +implementing. Coming up with interesting ideas seems to be a matter of a) +having a large vocabulary of concepts to draw on, and b) having good ‘taste’ +for ideas (e.g. what kind of work is likely to be useful to the community). I +think a better project for both of those might have been to, say, read +influential papers and write summaries and critical analyses of them. + +So I think my main meta-takeaway from this project is that **it’s worth +thinking carefully whether you want to level up engineering skills or research +skills**. Not that there’s no overlap; but if you’re particularly weak on one +of them you might be better off with a project specifically targeting that one. + +If you want to level up both, a better project might be to read papers until +you find something you’re really interested in that comes with clean code, and +trying to implement an extension to it. + +--- + +If you *do* want to tackle a deep RL project, here are some more specific +things to watch out for. + +#### Choosing papers to reproduce + +* Look for papers with few moving parts. Avoid papers which require multiple + parts working together in coordination. + +#### Reinforcement learning + +* If you’re doing anything that involves an RL algorithm as a component in a + larger system, don’t try and implement the RL algorithm yourself. It’s a fun + challenge, and you’ll learn a lot, but RL is unstable enough at the moment + that you’ll never be sure whether your system doesn’t work because of a bug + in your RL implementation or because of a bug in your larger system. +* Before doing anything, see how easily an agent can be trained on your + environment with a baseline algorithm. +* Don’t forget to normalize observations. *Everywhere* that observations might + be being used. [3](#fn:norm2) +* Write end-to-end tests as soon as you think you’ve got something working. + Successful training can be more fragile than you expected. +* If you’re working with OpenAI Gym environments, note that with `-v0` + environments, 25% of the time, the current action is ignored and the previous + action is repeated (to make the environment less deterministic). Use `-v4` + environments if you don’t want that extra randomness. Also note that + environments by default only give you every 4th frame from the emulator, + matching the early DeepMind papers. Use `NoFrameSkip` environments if you + don’t want that. For a fully deterministic environment that gives you exactly + what the emulator gives you, use e.g. `PongNoFrameskip-v4`. + +#### General machine learning + +* Because of how long end-to-end tests take to run, you’ll waste a lot of time + if you have to do major refactoring later on. Err on the side of implementing + things well the first time rather than hacking something up and saving + refactoring for later. +* Initialising a model can easily take ~ 20 seconds. That’s a painful amount of + time to waste because of e.g. syntax errors. If you don’t like using IDEs, or + you can’t because you’re editing on a server with only shell access, it’s + worth investing the time to set up a linter for your editor. (For Vim, I like + [ALE](https://github.com/w0rp/ale) with *both* + [Pylint](https://www.pylint.org/) and + [Flake8](http://flake8.pycqa.org/en/latest/). Though Flake8 is more of a + style checker, it can catch some things that Pylint can’t, like wrong + arguments to a function.) Either way, every time you hit a stupid error while + trying to start a run, invest time in making your linter catch it in the + future. +* It’s not just dropout you have to be careful about implementing in networks + with weight-sharing - it’s also batchnorm. Don’t forget there are + normalization statistics and extra variables in the network to match. +* Seeing regular spikes in memory usage while training? It might be that your + validation batch size is too large. +* If you’re seeing strange things when using Adam as an optimizer, it might be + because of Adam’s momentum. Try using an optimizer without momentum like + RMSprop, or disable Adam’s momentum by setting β1 to zero. + +#### TensorFlow + +* If you want to debug what’s happening with some node buried deep in the + middle of your graph, check out + [`tf.Print`](https://www.tensorflow.org/api_docs/python/tf/Print), an + identity operation which prints the value of its input every time the graph + is run. +* If you’re saving checkpoints only for inference, you can save a lot of space + by omitting optimizer parameters from the set of variables that are saved. +* `session.run()` can have a large overhead. Group up multiple calls in a batch + wherever possible. +* If you’re getting out-of-GPU-memory errors when trying to run more than one + TensorFlow instance on the same machine, it could just be because one of your + instances is trying to reserve all the GPU memory, rather than because your + models are too large. This is TensorFlow’s default behaviour. To tell + TensorFlow to only reserve the memory it needs, see the + [`allow_growth`](https://www.tensorflow.org/programmers_guide/using_gpu#allowing_gpu_memory_growth) + option. +* If you want to access the graph from multiple things running at once, it + looks like you *can* access the same graph from multiple threads, but there’s + a lock somewhere which only allows one thread at a time to actually do + anything. This seems to be distinct from the Python global interpreter lock, + which TensorFlow is [supposed + to](https://stackoverflow.com/questions/38206695/python-parallelizing-gpu-and-cpu-work) + release before doing heavy lifting. I’m uncertain about this, and didn’t have + time to debug more thoroughly, but if you’re in the same boat, it might be + simpler to just use multiple processes and replicate the graph between them + with [Distributed + TensorFlow](http://amid.fish/distributed-tensorflow-a-gentle-introduction). +* Working with Python, you get used to not having to worry about overflows. In + TensorFlow, though, you still need to be careful: + +``` +> a = np.array([255, 200]).astype(np.uint8) +> sess.run(tf.reduce_sum(a)) +199 +``` + +* Be careful about using `allow_soft_placement` to fall back to a CPU if a GPU + isn’t available. If you’ve accidentally coded something that can’t be run on + a GPU, it’ll be silently moved to a CPU. For example: + +``` +with tf.device("/device:GPU:0"): + a = tf.placeholder(tf.uint8, shape=(4)) + b = a[..., -1] + +sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) +sess.run(tf.global_variables_initializer()) + +# Seems to work fine. But with allow_soft_placement=False + +sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=False)) +sess.run(tf.global_variables_initializer()) + +# we get + +# Cannot assign a device for operation 'strided_slice_5': +# Could not satisfy explicit device specification '/device:GPU:0' +# because no supported kernel for GPU devices is available. +``` + +* I don’t know how many operations there are like this that can’t be run on a + GPU, but to be safe, do CPU fallback manually: + +``` +gpu_name = tf.test.gpu_device_name() +device = gpu_name if gpu_name else "/cpu:0" +with tf.device(device): + # graph code +``` + +#### Mental health + +* Don’t get addicted to TensorBoard. I’m serious. It’s the perfect example of + addiction through unpredictable rewards: most of the time you check how your + run is doing and it’s just pootling away, but as training progresses, + sometimes you check and all of the sudden - jackpot! It’s doing something + super exciting. If you start feeling urges to check TensorBoard every few + minutes, it might be worth setting rules for yourself about how often it’s + reasonable to check. + +--- + +If you’ve read this far and haven’t been put off, awesome! If you’d like to get +into deep RL too, here are some resources for getting started. + +* Andrej Karpathy’s [Deep Reinforcement Learning: Pong from + Pixels](http://karpathy.github.io/2016/05/31/rl/) is a great introduction to + build motivation and intuition. +* For more on the theory of reinforcement learning, check out [David Silver’s + lectures](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html). There + isn’t much on deep RL (reinforcement learning using neural networks), but it + does teach the vocabulary you’ll need to be able to understand papers. +* John Schulman’s [Nuts and Bolts of Deep RL + talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ) + ([slides](http://joschu.net/docs/nuts-and-bolts.pdf); [summary + notes](https://github.com/williamFalcon/DeepRLHacks)) has lots more tips + about practical issues you might run into. + +For a sense of the bigger picture of what’s going on in deep RL at the moment, +check out some of these. + +* Alex Irpan’s [Deep Reinforcement Learning Doesn’t Work + Yet](https://www.alexirpan.com/2018/02/14/rl-hard.html) has a great overview + of where things are right now. +* Vlad Mnih’s talk on [Recent Advances and Frontiers in Deep + RL](https://www.youtube.com/watch?v=bsuvM1jO-4w) has more examples of work on + some of the problems mentioned in Alex’s post. +* Sergey Levine’s [Deep Robotic + Learning](https://www.youtube.com/watch?v=eKaYnXQUb2g) talk, with a focus on + improving generalization and sample efficiency in robotics. +* Pieter Abbeel’s [Deep Learning for + Robotics](https://www.youtube.com/watch?v=TyOooJC_bLY) keynote at NIPS 2017 + with some of the more recent tricks in deep RL. + +Good luck! + +Thanks to [Michal Pokorný](http://agentydragon.com/about.html) and Marko Thiel for thoughts on +a first draft on this post. + +1. Observations are fed into two different training loops, policy training and reward predictor training, and I’d forgotten to normalize observations for the second one. Also, calculating running statistics (specifically, variance) is tricky. Check out [John Schulman’s code](https://github.com/joschu/modular_rl/blob/master/modular_rl/running_stat.py) for a good reference. [↩](#fnref:normproblems) +2. This is basically [CFAR’s](http://www.rationality.org/) ‘MurphyJitsu’ script. [↩](#fnref:murphyjitsu) +3. As mentioned above, I was stuck for a good while because of forgetting to normalize observations used for training the reward predictor. Derp. [↩](#fnref:norm2) + +Please enable JavaScript to view the [comments powered by Disqus.](https://disqus.com/?ref_noscript) + +![](/images/me.png) + +## Amid Fish + +is Matthew Rahtz's blog + +[GitHub](https://github.com/mrahtz), +[LinkedIn](https://uk.linkedin.com/pub/matthew-rahtz/b8/a47/540), +or say hello at +[[email protected]](/cdn-cgi/l/email-protection#ee838f9a9a868b99c09c8f869a94ae89838f8782c08d8183)! diff --git a/docs/evidence/andyljones_rl_debugging.md b/docs/evidence/andyljones_rl_debugging.md new file mode 100644 index 0000000..3f3420b --- /dev/null +++ b/docs/evidence/andyljones_rl_debugging.md @@ -0,0 +1,399 @@ +Source: https://andyljones.com/posts/rl-debugging.html +Title: Debugging RL, Without the Agonizing Pain - Andy Jones (2021) +Fetched-via: uvx markitdown https://andyljones.com/posts/rl-debugging.html +Fetch-status: verbatim + +[andy jones](/) + +[![RSS](/icons/rss-solid.svg)](/rss.xml) + +![Email](/icons/at-solid.svg) + +[![Scholar](/icons/scholar-brands.svg)](https://scholar.google.com/citations?user=wjU_zmMAAAAJ) +[![Github](/icons/github-brands.svg)](https://github.com/andyljones) +[![LinkedIn](/icons/linkedin-in-brands.svg)](https://www.linkedin.com/in/andyjonescs) + +[![Twitter](/icons/twitter-brands.svg)](https://twitter.com/andy_l_jones) +[![Reddit](/icons/reddit-brands.svg)](https://www.reddit.com/u/bluecoffee) +[![StackOverflow](/icons/stack-overflow-brands.svg)](https://stackoverflow.com/users/2565457/andy-jones) + +# Debugging RL, Without the Agonizing Pain + +Debugging reinforcement learning systems combines the pain of debugging distributed systems with the pain of debugging numerical optimizers. Which is to say, it *sucks*. If this is your first time, you might have a few hundred lines of code that you *think* are correct in an hour, and a system that's *actually* correct two months later. [Here's the head of Tesla AI having just that experience](https://news.ycombinator.com/item?id=13519044). + +This is a collection of debugging advice that has served me well over the past few years. It was formed both from my personal experiences, and from several months of helping people out in the [RL Discord](https://discord.com/invite/xhfNqQv). It is intended as compliment to the [other excellent articles on debugging RL that can be found elsewhere](https://github.com/andyljones/reinforcement-learning-discord-wiki/wiki#debugging-advice). I recommend you read all of them; each one has their own unique set of bugbears to warn you away from. + +There are three sections: one on [theory](#theory), one on [common fixes](#fixes), and one on [practical advice](#tactics). Things flow a little better if you read them in order, but you can skip on ahead if you wish. + +# Theory + +## Why is debugging RL so hard? + +A combination of issues. These issues show up in debugging any kind of system, but in RL they're more common, and they'll show up starting with the first system you ever write. + +### Feedback is poor + +**Errors aren't local**: The vast majority of the bugs you'll make are the 'doing the wrong calculation' sort. Because information in an RL system flows in a loop - actor to learner and then back to actor - a numerical error in one spot gets smeared throughout the system in seconds, poisoning everything. This means that most numerical errors manifest as *all* your metrics going weird at the same time; your loss exploding, your KL div collapsing, your rewards oscillating. From the outside, you can tell something is wrong but you've no idea *what* is wrong or where to start looking. + +To my mind this is the single biggest issue with debugging RL systems, and much of the advice below is about how to better-localise errors. + +**Performance is noisy**: The ultimate arbiter of an RL system - how good it is at collecting reward - is only weakly related to how good of an implementation you've written. You could write a bug-free implementation the first time and other factors (like hyperparameters, architecture or your environment) could sabotage performance. In the worst case, your evaluation run could just get an unlucky seed. Conversely, you could write a bug-laden implementation and it might seem to work! After all, bugs are just one more source of noise and your neural net is going to [try its damnedest](https://twitter.com/gwern/status/1014978860369182722) to pull the signal out of that mess you're feeding it. + +The real kicker though is that because run-to-run variability is so high, it's very easy to fix - or introduce - a bug and then see no change in performance at all. + +### Simplifying is hard + +**There're few narrow interfaces**: Smart software development involves splitting the system up into components so that each component only talks to the others through a narrow interface. This way you can easily pinch a component off from the the rest of the system, feed it some mock inputs and see if it gives the correct answers. + +This is difficult in RL systems. In RL systems, each component typically consumes a large number of mega- or gigabyte arrays and returns the same. The components are also unavoidably stateful, with the principal two components - the actor and learner - hefting around the state of the environnment and the network weights respectively. State can be thought of an interface with the own component's past, and in RL this interface is *huge*. + +Consequently while you *can* isolate components in RL (and we'll talk about how to below), it's much more painful to do than it is in other kinds of software. + +**There are few black boxes**: A black box is a component that works in a complex way, but which you can reason about in a simple way. Another name for a black box would be 'a good abstraction'. The prototypical example is your computer: there's a hierarchy of concepts in there, from doped silicon through to operating systems, but as far as you the programmer are concerned it's all about for loops and function calls. + +RL has surprisingly few of these black boxes. You're required to know how your environment works, how your network works, how your optimizer works, how backprop works, how multiprocessing works, how stat collection and logging work. How GPUs work! There are [lots](https://docs.ray.io/en/latest/rllib.html) of [attempts](https://github.com/thu-ml/tianshou) at [writing](https://github.com/deepmind/acme) black-box [RL](https://github.com/astooke/rlpyt) libraries, but as of Jan 2021 my experience has been that these libraries have yet to be both flexible *and* easy-to-use. This might be a symptom of my odd strand of research, but I've heard several other researchers echo my frustrations. + +### We're bad at writing RL systems + +**Your expectations suck**: In any domain, problems evaporate as you get used to them. The first stack trace you see in your life is a nightmare; the millionth a triviality. All of the problems with RL listed above are only really problems because people new to the field expect something much more refined and reliable, as they've come to expect from other fields of programming and numerical research. If instead you arrive in RL expecting a garbage fire, you might just stay zen throughout. + +Obviously though, this begs the question of *why* RL development is a garbage fire. + +**The community is young**: While reinforcement learning as a field stretches back decades, it has *exploded* in the past few years and continues apace today. Finding good abstractions requires in part that the userbase's requirements stabilize, and that just isn't the case yet. Some of that is because it's very much a community of researchers rather than a community of practitioners, and the terrible thing about researchers it that they're very keen on doing new and different things. Maybe it'll be different once someone figures out how to turn RL into an industry. + +**The community has other priorities**: Again, the community is a community of researchers. The population sets the priorities, and the priority is publication. Reliable, reproducible research contributes to publishing high-impact papers, but it also costs time and effort that is arguably better spent working on something *new*. And, well, it's hard to argue with the results: the current standards of RL development have carried us [a](https://deepmind.com/blog/article/muzero-mastering-go-chess-shogi-and-atari-without-rules) [long](https://openai.com/blog/learning-dexterity/) [way](https://deepmind.com/blog/article/AlphaStar-Grandmaster-level-in-StarCraft-II-using-multi-agent-reinforcement-learning). + +Don't take this as a clarion call for better practices, nor a stalwart defense of practices as they are. It's not a hill I wish to die on. I'm only giving an explanation for why things are the way they are, rather than a justification for it. My preferences are towards improved practices, but I can see the sense in the other side's position. + +## Debugging Strategies + +With all that in mind, here are some broad strategies to keep in mind when chasing a bug. + +### Design reliable tests + +Write tests that either clearly pass or clearly fail. There's some amount of true randomness in RL, but most of that can be controlled with a seed. What's harder to deal with is psuedorandomness such that on one seed a test might pass and another seed the test might fail. This is *awful* to deal with, and you should go out of your way to avoid it. + +While the ideal is a test that is guaranteed to cleanly pass or fail, a good fallback is one that is simply *overwhelmingly likely* to pass or fail. Typically, this means substituting out environments or algorithms with simpler ones that behave more predictably, and which you can run through your implementation with some massive batch size that'll suppress a lot of the wackiness that you might otherwise suffer. + +### Design *fast* tests + +Iteration speed is a huge determinant of debugging speed. Running a test should take at most as long as it takes you to make a potential fix, which is to say 'a few seconds'. + +This means: don't try to debug your implementation by just running it on your full task. That might take days! That way madness lies. Instead, design setups that can execute more quickly, but still exercise the code you're looking at. For specific tips, look at the [probe environments](#probe) section below. + +### Localise errors + +Write test code that'll tell you the most about where the error is. The classic example of this is binary search: if you're looking for an specific item in a sorted list, then taking a look at the middle item tells you a *lot* more about where your target item is than looking at the first item. + +Similarly, when debugging RL systems try to find tests that cut your system in half in some way, and tell you which half the problem is in. Incrementally testing every.single.chunk of code - well, sometimes that's what it comes down to! But it's something to try and avoid. + +### Be Bayesian + +But sometimes you can't avoid it! Binary search wouldn't have been much help in [finding the wreck of the USS Scorpion](https://en.wikipedia.org/wiki/USS_Scorpion_%28SSN-589%29). There they had to do a location-by-location search, and the key turned out to be prioritising the areas where + +* the Scorpion was likely to be and +* where it was likely to be *spotted*. + +This kind of thinking isn't so critical in traditional software development because isolating components is much easier, so you can do the sort of binary search I mentioned previously. But in RL, well, sometimes you just can't untangle something. Then you should reflect on which bits of your code are most likely to *contain* bugs, and which bits of your code you're going to be able to *easily spot* those bugs in. Prioritise looking in those places! + +As an aside, the [parable of the drunk and his keys](https://en.wikipedia.org/wiki/Streetlight_effect) has always confused me: I don't know if it's saying the wise thing to do is to look under the streetlight, or to look in the dark. Best moral I've heard for it is 'it depends'. + +### Pursue Anomalies + +If you ever see a plot or a behaviour that just *seems weird*, chase right after it! Do not - do *not* - just 'hope it goes away'. Chasing anomalies is one of the most powerful ways to debug your system, because if you've noticed a problem without having had to go look for it, that means it's a *really big problem*. + +This takes quite a bit of a mindset change though. It's really tempting to think that the cool extra functionality you were planning to write today - a tournament, adaptive reward scaling, a transformer - might just magically fix this anomalous behaviour. + +It won't. + +Give up on your plan for the day and chase the anomaly instead. + +# Common Fixes + +These are specific things that frequently trip people up. + +## Hand-tune your reward scale + +The single most common issue for newbies writing custom RL implementations is that the targets arriving at their neural net aren't [-1, +1]. Actually, anything [-.1, +.1]ish to [-10, +10]ish is good. The point is to have rewards that generate 'sensible' targets for your network. The hyperparameters you've pulled from the literature are adapted to work with these nicely-scaled targets, but lots of envs don't natively provide rewards of the right size so as to generate these nicely-scaled targets. + +Having read that, you might be tempted to write some adaptive scheme to scale your rewards for you. Don't: it's an extra bit of nonstationarity that'll make life more difficult. Just hand-scale, hand-clip the rewards from your env so that the targets passed to your network are sensible. When everything else is working, you can come back and replace this with something less artificial. + +## Use a really large batch size + +One of the most reliable ways to make life easier in RL is to use a really large batch size. A *really* large batch size. There's an [excellent paper on picking batch sizes](https://arxiv.org/abs/1812.06162), and to pull some examples from there: + +* Pong: ~1k batch size +* Space Invaders: ~10k batch size +* 1v1 Dota: ~100k batch size + +The idea behind this is that with small batches and complex envs, it's easy for your learner to end up with a batch that represents some weird idiosyncratic part of the problem. Big batches do a lot to suppress this. + +## Use a really small network + +Hand in hand with really large batch sizes is really small networks. When you use really large batches, your binding constraint is likely to be the memory it takes to hold the forward pass activations on your GPU. By making the network smaller, you can fit bigger batches! And frankly, small networks can accomplish a *lot*. In my [boardlaw](https://andyljones.com/boardlaw/) project, I found that a fully connected network with 4 layers of 256 neurons was enough to learn perfect play on a 9x9 board. Perfect play! That's really complex! + +## Avoid pixels + +And hand-in-hand with 'use a small network' is: *avoid pixels*. Especially if you're an independent researcher with hardware constraints, just... don't work on environments with hefty, expensive-to-ingest observations like Atari. Pixel-based observations mean that before it does anything interesting, your agent has to learn to *see*. From sparse rewards! That's hard, and it's compute-intensive, and it's *boring*. If you've got any choice in the matter, pick the simplest env that will be able to generate the behaviour you're after. For example: + +* Gridworlds like [Griddly](https://github.com/Bam4d/Griddly) and [minigrid](https://github.com/maximecb/gym-minigrid). Gridworlds can support most of the interesting behaviours you'd find in a continuous environment, but are much more resource-efficient. If you've just graduated out of [the Gym envs](https://gym.openai.com/envs/#classic_control), gridworlds are an excellent next step. +* Multi-agent setups like the boardgames from [OpenSpiel](https://openspiel.readthedocs.io/en/latest/games.html), [microRTS](https://github.com/santiontanon/microrts) or [Neural MMO](https://github.com/jsuarez5341/neural-mmo). A multi-agent env shouldn't be your *first* foray into RL - they're substantially more complex than the single-agent case - but competition and cooperation can generate a lot of complexity from very lightweight environments. +* Unusual envs like [WordCraft](https://github.com/minqi/wordcraft). WordCraft is unique in that it isolates learning about the real world from actually having to model the real world! But again, possibly not the best choice for a first RL project; I've included it here as an example of how powerful simple environments can be. + +In all, fast environments with small networks and big batches are far easier to debug than slow environments with big networks and small batches. Make sure you can walk before you try running. + +## Mix your vectorized envs + +If you've got a long-lived env and you're simulating a lot of them in parallel, you might find that your system behaves a bit strangely at the start of training. One common issue is that if all your envs start from the same state, then your learner gets passed very highly-correlated samples, and so it tries to optimise for, say, steps 0-10 of the env in the first batch, then 10-20 in the second batch, etc. You can avoid this by '[mixing](https://en.wikipedia.org/wiki/Markov_chain_mixing_time)' your envs: taking enough random steps in the env that they become uncorrelated with one another. A good way to check that things are well-mixed is to look at the number of resets at each timestep: if they look pretty uniform, things are well-mixed. If they all cluster on a specific timestep, you need to take some more random actions. + +# Practical Advice + +This advice sits somewhere between the 'common mistakes' and the more general 'theory' we discussed earlier. + +## Work from a reference implementation + +*If you're new to reinforcement learning, writing things from scratch is the most catastrophically self-sabotaging thing you can do.* + +There is an alluring masochism in writing things from scratch. There's concrete value in it too: by writing things from scratch, you're both forced to fully understand what you're doing and you're more likely to come up with a fresh perspective. In many other fields of software development these benefits would be worth the slow-down you suffer from having to work everything out yourself. + +In reinforcement learning, these benefits are not worth it. At all. As discussed [above](#theory), the nature of RL work makes it extremely hard for you to self-correct. + +When I say 'use a reference implementation', there are several interpretations you can take depending on your risk tolerance. + +* The safest thing to do is to use a reference implementation out-of-the-box. Check that it works on your task, then repeatedly make a small change and check that it works as it did before. +* Less safe is to just use the reference implementation as a source of reliable components. Work to the same API, and check that giving your version of a component and their version give the same outputs. +* Least safe (but still dramatically better than going in blind) is to have one eye on the reference implementation while you write your own. Copy their hyperparameters, copy their discounting code, copy how they handle termination and invalid actions and a hundred other little things that you're likely to muck up otherwise. + +Here are some excellent reference implementations to choose from: + +* [spinning-up](https://github.com/openai/spinningup) has been written by OpenAI, and has a [short course to go along with it](https://spinningup.openai.com/). +* [stable-baselines3](https://github.com/DLR-RM/stable-baselines3) is based on an older set of OpenAI implementations, but cleaned up and actively maintained. +* [cleanrl](https://github.com/vwxyzjn/cleanrl/tree/master/cleanrl) isolates every algorithm in its own file. +* [OpenSpiel](https://github.com/deepmind/open_spiel) is DeepMind's multi-agent reinforcement learning library. They provide both Python and C++ implementations of many algorithms - you'll probably want the Python ones. + +## Assume you have a bug + +When their RL implementation doesn't work, people are often keen to either (a) adjust their network architecture or (b) adjust their hyperparameters. On the other hand, they're reluctant to say they've got a bug. + +Most often, it turns out they've got a bug. + +Why bugs are so much more common in RL code is discussed [above](#theory), but there's another advantage to assuming you've got a bug: bugs are a damn sight faster to find and fix than validating that your new architecture is an improvement over the old one. + +Now having said that you should assume you have a bug, it's worth mentioning that sometimes - rarely - you don't have a bug. What I'm advocating for here is not a blind faith in the buginess of your code, but for dramatically raising the threshold at which you start thinking 'OK, I think this is correct.' + +## Loss curves are a red herring + +When someone's RL implementation isn't working, they *luuuuuurv* to copy-paste a screenshot of their loss curve to you. They do this because they know they want a pretty, exponentially-decaying loss curve, and they know what they have *isn't that*. + +The problem with using the loss curve as an indicator of correctness is somewhat that it's not reliable, but mostly because it doesn't localise errors. The shape of your loss curve says very little about where in your code you've messed up, and so says very little about what you need to change to get things working. + +As in the previous section, my sweeping proclamation comes with some qualifiers. Once you have a semi-functional implementation and you've exhausted other, better methods of error localisation (as documented in the rest of this post), there *is* valuable information in a loss curve. If nothing else, being able to split a model's performance into 'how fast it learns' and 'where it plateaus' is a useful way to think about the next improvement you might want to make. But because it only offers *global* information about the performance of your implementation, it makes for a really poor debugging tool. + +## Unit test the tricky bits + +Most of the bugs in a typical attempt at an RL implementation turn up in the same few places. Some of the usual suspects are + +* reward discounting, especially around episode resets +* advantage calculations, again especially around resets +* buffering and batching, especially pairing the wrong rewards with the wrong observations + +Fortunately, these components are all really easy to test! They've got none of the issues that validating RL algorithms as a whole has. These components are deterministic, they're easy to factor out, and they're fast. Checking you've got the termination right on your reward discounting is [a few lines](https://github.com/andyljones/megastep/blob/master/megastep/demo/learning.py#L134-L159). + +What's even better is that most of the time, *as you write these things* you know you're messing them up. If you're not certain whether you've just accumulated the reward on one side of the reset or the other, *put a test in*. + +## Use probe environments. + +The usual advice to people writing RL algorithms is to use a simple environment like the [classic control ones from the Gym](https://gym.openai.com/envs/#classic_control). + +Thing is, these envs have the same problem as looking at loss curves: at best they give you a noisy indicator, and if the noisy indicator looks poor you don't know *why* it looks poor. They don't localise errors. + +Instead, construct environments that *do* localise errors. In a recent project, I used + +1. **One action, zero observation, one timestep long, +1 reward every timestep**: This isolates the value network. If my agent can't learn that the value of the only observation it ever sees it 1, there's a problem with the value loss calculation or the optimizer. +2. **One action, random +1/-1 observation, one timestep long, obs-dependent +1/-1 reward every time**: If my agent can learn the value in (1.) but not this one - meaning it can learn a constant reward but not a predictable one! - it must be that backpropagation through my network is broken. +3. **One action, zero-then-one observation, *two* timesteps long, +1 reward at the end**: If my agent can learn the value in (2.) but not this one, it must be that my reward discounting is broken. +4. **Two actions, zero observation, one timestep long, action-dependent +1/-1 reward**: The first env to exercise the policy! If my agent can't learn to pick the better action, there's something wrong with either my advantage calculations, my policy loss or my policy update. That's three things, but it's easy to work out by hand the expected values for each one and check that the values produced by your actual code line up with them. +5. **Two actions, random +1/-1 observation, one timestep long, action-and-obs dependent +1/-1 reward**: Now we've got a dependence on both obs and action. The policy and value networks interact here, so there's a couple of things to verify: that the policy network learns to pick the right action in each of the two states, and that the value network learns that the value of each state is +1. If everything's worked up until now, then if - for example - the value network fails to learn here, it likely means your batching process is feeding the value network stale experience. +6. Etc. + +You get the idea: (1.) is the simplest possible environment, and each new env adds the smallest possible bit of functionality. If the old env works but the successor doesn't, that gives you a *lot* of information about where the problem is. + +Even better, these environments are extraordinarily fast. When you've a correct implementation, it should only take a second or two to learn them. And they're *decisive*: if your value network in (1.) ends up more than an epsilon away from the correct value, it means you've got a bug. + +## Use probe agents. + +In much the same way that you can simplify your environments to localise errors, you can do the same with your agents too. + +*Cheat* agents are ones that you leak extra information to. For example, if I'm writing an agent to navigate to a goal, then slipping the agent an extra vector saying which direction the goal is in should help a *lot*. My agent should be able to solve this problem *much* faster, and if it can't then how the heck can I expect it to solve the original problem? + +*Automatons* are agents that don't use a neural network at all. Instead, they're hand-written algorithms. The point of writing something like this is to check that your environment is actually solvable. On an navigation environment I wrote once, I set up a room with a red post behind the agent. Then I wrote an automaton which would just turn left until a block of red was in the middle of it's view. Shocker: my automaton couldn't solve this task, because it turned out I'd mucked up the observation generation on odd-numbered environments. + +It's worth keeping in mind that automatons can be handed cheat information too! Combining automatons and progressively more cheat information is a powerful way to debug an environment. + +*Tabular* agents a good match for probe environments. If you've set up a real simple environment and *still* nothing works, then replacing your NN with a far-easier-to-interpret lookup table of state values is a great way to figure out what you're missing. Be aware that it might take some time with a pen and paper to check that the values that you're seeing in the table are the ones you expect, but it's a hard setup to fool. + +## Use adaptive network definitions + +One of the issues with probe environments and probe agents is that every time you swap out your environment or agent, you'll find yourself having to rewrite the interface of the network with the rest of the world. By 'interface' I mean 'the bit that eats the observation and the bit that spits out the action'. + +One way to avoid this is to write a function that takes the observation space and action space of the environment, and generates 'heads' for the network that convert the observation into a fixed-width vector, and which convert a fixed-width vector to the action. Then you can hand-implement *just* the body of the net that converts the intake vector to the output vector, and the rest will be slotted in by your function based on the env it has to work with. + +You can see [one](https://github.com/andyljones/megastep/blob/master/megastep/demo/heads.py) [implementation](https://github.com/andyljones/megastep/blob/master/megastep/demo/__init__.py#L17-L26) of this in my [megastep](https://andyljones.com/megastep/) work, but it's an idea that's been independently developed a few times. I haven't yet seen a general library for it. + +## Log excessively. + +The last few sections have involved controlled experiments of a sort, where you place your components in a known setup and see how they act. The complement to a controlled experiment is an observational study: watching your system in its natural habitat *very carefully* and seeing if you can spot anything anomalous. + +In reinforcement learning, watching your system carefully means logging. Lots of logging. Below are some of the logs I've found particularly useful. + +### Relative policy entropy + +The entropy of your policy network's outputs, relative to the maximum possible entropy. It'll usually start near 1, then rapidly fall for a while, then flatten out for the rest of training. + +If it stays very near 1, your agent is failing to learn any policy at all. You should check that your policy targets are being computed correctly, that the gradient's being backpropagated correctly, and - if you've defined a custom environment - then your environment is actually correct! + +If it drops to zero or close to zero, then your agent has 'collapsed' into some - likely myopic - policy, and isn't exploring any more. This is usually because you'v either forgotten to include an exploration mechanism of some sort (like epsilon-greedy actions or an entropy term in the loss), or because your rewards are much larger than whatever you're using to encourage exploration. + +Sometimes it'll go up for a while; don't stress about that unless it's a large, permanent increase. If it *is* a large permanent increase and the minimum was very early in training, that can be an indicator that your policy fell into some myopic obviously-good behaviour that it's having to gradually climb back out of. It might help to turn up the exploration incentives. + +If the entropy oscillates wildly, that usually means your learning rate is too high. + +### Kullback-Leibler divergence + +The KL div between the policy that was used to collect the experience in the batch, and the policy that your learner's just generated for the same batch. This should be small but positive. + +If it's very large then your agent is having to learn from experience that's very different to the current policy. In some algorithms - like those with a replay buffer - that's expected, and all that's important is the KL div is stable. In other algorithms (like PPO), a very large KL div is an indicator that the experience reaching your network is 'stale', and that'll slow down training. + +If it's very low then that suggests your network hasn't changed much in the time since the experience was generated, and you can probably get away with turning the learning rate up. + +If it's growing steadily over time, that means you're probably feeding the same experience from early on in training back into the network again and again. Check your buffering system. + +If it's negative - that shouldn't happen, and it means you're likely calculating the KL div incorrectly (probably by not handling invalid actions). + +### Residual variance + +The variance of (target values - network values), divided by the variance of the target values. + +Like the policy entropy, this should start close to 1, fall very rapidly early on, and then decrease more gradually over the course of training. + +If it stays near 1, your value network isn't learning to predict the rewards. Check that your rewards are what you think they are, and check that your value loss and backprop through the value net are all working correctly. + +If it drops to zero, that's usually because the policy entropy has dropped to zero too, the policy has collapsed into some deterministic behaviour, and the value network has learned the rewards it is collecting perfectly. Another common reason is that some scenarios are generating vastly larger returns than the others, and the value net's learned to identify when that happens. + +If the residual variance oscillates wildly, that usually means your learning rate is too high. + +### Terminal correlation + +The correlation between the value in the final state and the reward in the final step. This is only useful when there's lots of reward in the final step (like in boardgames). + +It should start near zero, rise rapidly, then plateau near 1. + +If it stays near zero but all the other value-related logs look good, then check that your reward-to-gos are being calculated correctly near termination! + +If reward is more evenly distributed through the episode, you could write a version of this that looks at the correlation of (next state's value - this state's value) with the reward in that step. I haven't used this myself though, so can't offer commentary. + +### Penultimate terminal correlation + +The correlation between the value in the penultimate step and the final reward. Again, only useful when there's lots of reward at the end of the episode. If terminal correlation is high but penultimate terminal correlation is low, that's a strong indicator that your reward-to-gos aren't being carried backwards properly. + +### Value target distribution + +Either plot a histogram, or the min/max/mean/std. The plots should indicate 'reasonable' value targets in the range [-10, +10] (and ideally [-3, +3]). + +If they're larger than that, make your rewards proportionately smaller; if they're smaller than that, make your rewards larger. + +If they blow up, check that your reward discounting is correct, and possibly make your discount rate smaller. + +If they're blowing up but you're insistent on leaving the discount rate where it is, one alternative is to increase the number of steps used to bootstrap the value targets. In PPO, this'd mean using longer chunks. Longer chunks mean that the values used for bootstrapping get shrunk more before they're fed back to the value net as targets, increasing the stability. You could also consider annealing the discount factor from a smaller value up towards 1. + +### Reward distribution + +Again, as a histogram or min/max/mean/std. What a reasonable reward distribution is depends on the environment; some envs have a few large rewards, while others have lots of small rewards. Either way, if it doesn't match your expectations then you should investigate. + +### Value distribution + +Again, as a histogram or min/max/mean/std. This is a complement to the previous two distributions and *should* closely match the value target distribution. If it doesn't, and it stays different from the value target distribution, that's an indicator that your value network is having trouble learning. + +It's also worth keeping an eye on the sign of the distribution. If your env only produces positive rewards but there are persistently negatives values in the value target distribution, that suggests your reward-to-go mechanism is badly broken or your value network is failing to learn. + +### Advantage distribution + +Again, as a histogram or min/max/mean/std. As with the value targets, these should be in the range [-10, +10] (and ideally [-3, +3]). + +Advantages should also be approximately mean-zero due to how they're constructed; if they're persistently not then you've messed up your advantage calculations. + +### Episode length distribution + +Again, as a histogram or a min/max/mean/std. As with the reward distribution, interpreting this depends on the environment. If your environment should have arbitrary-length episodes, but you're seeing that every episode here is length 7, that indicates your environment is broken or your network's fallen into some degenerate behaviour. + +### Sample staleness + +Sample staleness is the number of learner steps between the network used to generate a sample, and the network currently learning from that sample. You can generate this by setting an 'age' attribute on the network, and incrementing it at every learner step. Then when a sample arrives at the learner, diff it against the learner's current age. + +How to interpret this depends on the algorithm, but it should generally stay at a steady value throughout training. In on-policy algorithms, lower sample stalenesses are better; in off-policy algorithms it's a tradeoff between fresh samples that let the network bootstrap quickly, and aged samples that stabilise things. + +### Step statistics + +Step statistics are the abs-max and mean-square-value of the difference between the network's parameters when it enters the learner, and the network's parameters when it leaves the learner. + +Interpreting this depends on a whole bunch of things, but the mean-square value should typically be very small (1e-3 in my current training run with a LR of 1e-2), while the abs-max should small yet substantially larger than the mean-square-value. + +If the statistics are much smaller than that, you might be able to increase your learning rate; if they're much larger than that then be on the lookout for instability in your training. + +### Gradient statistics + +Gradient statistics re the abs-max and mean-square-value of the gradient. In the age of Adam and other quasi-Newton optimizers, this isn't as informative as it once was, because normalising by the curvature estimates can dramatically inflate or collapse the gradient. + +That said, if the step statistics are looking strange, this can help diagnose whether the problem is with the gradient calculation or with Adam's second-order magic. + +### Gradient noise + +This is from [McCandlish and Kaplan](https://arxiv.org/abs/1812.06162), and it's intended to help you choose your batch size. Unfortunately it's *spectacularly* noisy, to the point where you likely want to average over all steps in your run. + +I've been thinking that it might be possible to get more stable estimates of the gradient noise from Adam's moment estimates, but that's decidedly on the to-do list. + +### Component throughput + +At the least, keep track of the actor throughput and learner throughput in terms of samples per second, and steps per second. + +Typically the actor should be generating *at most* as many samples as the learner is consuming. If the actor is generating excess samples there are weak reasons that might be a good thing - it'll refresh the replay buffer more rapidly - but typically it's considered a waste of compute. + +More generally, you want to see these remain stable throughout training. If your throughputs gradually decay, you're accumulating some costly state somewhere in your system. + +(For me, problems with gradually-slowing-down systems have always turned out to be with stats and logging, but I suspect that's because I've rolled my own stats and logging systems) + +### Value trace + +The trace of the value over a random episode from recent history, plotted together with the rewards. This can be useful if you suspect your value function or rewards of 'being weird' in some way; the value trace should typically be a collection of exponentially-increasing curves leading up to rewards, followed by vertical drops as the agent collects those rewards. + +### GPU stats + +There are several GPU-related stats that are worth tracking. First are the memory stats, which in PyTorch include + +* the *memory allocation*, as reported by `torch.cuda.max_memory_allocated`. This is how much memory has actually been *used* by your computations, +* the *memory reserve*, as reported by `torch.cuda.max_memory_reserved`. This is how much memory PyTorch has *set aside* for your computations, +* the *memory gross*, as reported by `nvidia-smi`. This is how much memory PyTorch is using overall, [including the ~gigabyte it needs for its own kernels](https://github.com/pytorch/pytorch/issues/20532#issuecomment-540628939). It's this figure that'll crash your program if it hits the GPU's memory limit. + +Keeping track of all three is useful for diagnosing memory issues: figuring out if it's you that's hanging onto too many tensors, or PyTorch that's being too aggressive with its caching. + +If you're running out of memory and you can't immediately figure out why, [memlab](https://github.com/Stonesjtu/pytorch_memlab#memory-profiler) can help a lot. Disclosure: I wrote the frontend. + +As well as the memory stats, it's also useful to track the utilization, fan speed and temperature reported by `nvidia-smi`. You can get these values in [machine-readable form](https://github.com/andyljones/megastep/blob/master/rebar/stats/gpu.py#L17-L29). + +In particular, if the utilization is persistently low then you should profile your code. Make sure to set `CUDA_LAUNCH_BLOCKING=1` before importing your tensor library, and then use [snakeviz](https://jiffyclub.github.io/snakeviz/) or [tuna](https://github.com/nschloe/tuna) to profile things in a broad way. If that's not enough detail, you can dig into things further with [nsight](https://developer.nvidia.com/nsight-systems). + +### Traditional metrics + +As well as the above, I also plot some other things out of habit + +* **Reward per trajectory**: should increase dramatically at the start of training. This is, usually, what you care about. Unfortunately it's incredibly noisy and does little to localise errors. Closely related is the **reward per step**, which is typically what you care about in infinite environments. +* **Mean value**: is (if your value network is working well) a less-noisy proxy for the reward per trajectory. If your trajectories are particularly long compared to your reward discount factor however, this can be dramatically different from the reward per trajectory. +* **Policy and value losses**: should fall dramatically at the start of training, then level out. + +## Credit + +* **kfir.b.y**, for spotting an error in my description of the probe environments. + +2021/01/01 + +[icons by dave gandy](https://fontawesome.com/license), theme by [#6d2e98](https://color-hex.org/color/6d2e98 "i have never been funny") diff --git a/docs/evidence/cs229_ml_advice.md b/docs/evidence/cs229_ml_advice.md new file mode 100644 index 0000000..3d22d19 --- /dev/null +++ b/docs/evidence/cs229_ml_advice.md @@ -0,0 +1,719 @@ +Source: https://cs229.stanford.edu/materials/ML-advice.pdf +Title: CS229 - Advice for Applying Machine Learning (Andrew Ng) +Fetched-via: bash -c 'uvx "markitdown[pdf]" https://cs229.stanford.edu/materials/ML-advice.pdf' +Fetch-status: verbatim + +Advice for applying +Machine Learning + +Andrew Ng + +Stanford University + +Andrew Y. Ng + + Today’s Lecture + +• Advice on how getting learning algorithms to different applications. + +• Most of today’s material is not very mathematical. But it’s also some of the + +hardest material in this class to understand. + +• Some of what I’ll say today is debatable. + +• Some of what I’ll say is not good advice for doing novel machine learning + +research. + +• Key ideas: + +1. Diagnostics for debugging learning algorithms. +2. Error analyses and ablative analysis. +3. How to get started on a machine learning problem. + +– Premature (statistical) optimization. + +Andrew Y. Ng + + Debugging Learning +Algorithms + +Andrew Y. Ng + + Debugging learning algorithms + +Motivating example: + +• Anti-spam. You carefully choose a small set of 100 words to use as + +features. (Instead of using all 50000+ words in English.) + +• Bayesian logistic regression, implemented with gradient descent, gets 20% + +test error, which is unacceptably high. + +• What to do next? + +Andrew Y. Ng + + Fixing the learning algorithm + +• Bayesian logistic regression: + +• Common approach: Try improving the algorithm in different ways. + +– Try getting more training examples. +– Try a smaller set of features. +– Try a larger set of features. +– Try changing the features: Email header vs. email body features. +– Run gradient descent for more iterations. +– Try Newton’s method. +– Use a different value for λ. +– Try using an SVM. + +• This approach might work, but it’s very time-consuming, and largely a matter + +of luck whether you end up fixing what the problem really is. + +Andrew Y. Ng + + Diagnostic for bias vs. variance + +Better approach: + +– Run diagnostics to figure out what the problem is. +– Fix whatever the problem is. + +Bayesian logistic regression’s test error is 20% (unacceptably high). + +Suppose you suspect the problem is either: + +– Overfitting (high variance). +– Too few features to classify spam (high bias). + +Diagnostic: + +– Variance: Training error will be much lower than test error. +– Bias: Training error will also be high. + +Andrew Y. Ng + + More on bias vs. variance + +Typical learning curve for high variance: + +r +o +r +r +e + +Test error + +Desired performance + +Training error + +m (training set size) + +• Test error still decreasing as m increases. Suggests larger training set +will help. +• Large gap between training and test error. + +Andrew Y. Ng + + More on bias vs. variance + +Typical learning curve for high bias: + +r +o +r +r +e + +Test error + +Training error + +Desired performance + +m (training set size) + +• Even training error is unacceptably high. +• Small gap between training and test error. + +Andrew Y. Ng + + Diagnostics tell you what to try next + +Bayesian logistic regression, implemented with gradient descent. + +Fixes to try: + +– Try getting more training examples. +– Try a smaller set of features. +– Try a larger set of features. +– Try email header features. +– Run gradient descent for more iterations. +– Try Newton’s method. +– Use a different value for λ. +– Try using an SVM. + +Fixes high variance. +Fixes high variance. +Fixes high bias. +Fixes high bias. + +Andrew Y. Ng + + Optimization algorithm diagnostics + +• Bias vs. variance is one common diagnostic. + +• For other problems, it’s usually up to your own ingenuity to construct your + +own diagnostics to figure out what’s wrong. + +• Another example: + +– Bayesian logistic regression gets 2% error on spam, and 2% error on non-spam. + +(Unacceptably high error on non-spam.) + +– SVM using a linear kernel gets 10% error on spam, and 0.01% error on non- + +spam. (Acceptable performance.) + +– But you want to use logistic regression, because of computational efficiency, etc. + +• What to do next? + +Andrew Y. Ng + + More diagnostics + +• Other common questions: + +– Is the algorithm (gradient descent for logistic regression) converging? + +J(θ) + +e +v +i +t +c +e +b +O + +j + +Iterations + +It’s often very hard to tell if an algorithm has converged yet by looking at the objective. + +Andrew Y. Ng + + More diagnostics + +• Other common questions: + +– Is the algorithm (gradient descent for logistic regression) converging? +– Are you optimizing the right function? +– I.e., what you care about: + +(weights w(i) higher for non-spam than for spam). +– Bayesian logistic regression? Correct value for λ? + +– SVM? Correct value for C? + +Andrew Y. Ng + + Diagnostic + +An SVM outperforms Bayesian logistic regression, but you really want to deploy Bayesian + +logistic regression for your application. + +Let θSVM be the parameters learned by an SVM. + +Let θBLR be the parameters learned by Bayesian logistic regression. + +You care about weighted accuracy: + +θSVM outperforms θBLR. So: + +BLR tries to maximize: + +Diagnostic: + +Andrew Y. Ng + + Two cases + +Case 1: + +But BLR was trying to maximize J(θ). This means that θBLR fails to maximize J, and the + +problem is with the convergence of the algorithm. Problem is with optimization +algorithm. + +Case 2: + +This means that BLR succeeded at maximizing J(θ). But the SVM, which does worse on + +J(θ), actually does better on weighted accuracy a(θ). + +This means that J(θ) is the wrong function to be maximizing, if you care about a(θ). + +Problem is with objective function of the maximization problem. + +Andrew Y. Ng + + Diagnostics tell you what to try next + +Bayesian logistic regression, implemented with gradient descent. + +Fixes to try: + +– Try getting more training examples. +– Try a smaller set of features. +– Try a larger set of features. +– Try email header features. +– Run gradient descent for more iterations. +– Try Newton’s method. +– Use a different value for λ. +– Try using an SVM. + +Fixes high variance. +Fixes high variance. +Fixes high bias. +Fixes high bias. +Fixes optimization algorithm. +Fixes optimization algorithm. +Fixes optimization objective. +Fixes optimization objective. + +Andrew Y. Ng + + The Stanford Autonomous Helicopter + +Payload: 14 pounds +Weight: 32 pounds + +Andrew Y. Ng + + Machine learning algorithm + +1. Build a simulator of helicopter. + +Simulator + +2. Choose a cost function. Say J(θ) = ||x – xdesired||2 (x = helicopter position) + +3. Run reinforcement learning (RL) algorithm to fly helicopter in simulation, so + +as to try to minimize cost function: + +θRL = arg minθ J(θ) + +Suppose you do this, and the resulting controller parameters θRL gives much worse + +performance than your human pilot. What to do next? + +Improve simulator? +Modify cost function J? +Modify RL algorithm? + +Andrew Y. Ng + + Debugging an RL algorithm + +The controller given by θRL performs poorly. +Suppose that: + +1. The helicopter simulator is accurate. + +2. The RL algorithm correctly controls the helicopter (in simulation) so as to + +minimize J(θ). + +3. Minimizing J(θ) corresponds to correct autonomous flight. + +Then: The learned parameters θRL should fly well on the actual helicopter. + +Diagnostics: + +1. + +If θRL flies well in simulation, but not in real life, then the problem is in the +simulator. Otherwise: + +2. Let θhuman be the human control policy. If J(θhuman) < J(θRL), then the problem is +in the reinforcement learning algorithm. (Failing to minimize the cost function J.) +If J(θhuman) + +J(θRL), then the problem is in the cost function. (Maximizing it + +3. + +≥ + +doesn’t correspond to good autonomous flight.) + +Andrew Y. Ng + + More on diagnostics + +• Quite often, you’ll need to come up with your own diagnostics to figure out + +what’s happening in an algorithm. + +• Even if a learning algorithm is working well, you might also run diagnostics to + +make sure you understand what’s going on. This is useful for: + +– Understanding your application problem: If you’re working on one important ML + +application for months/years, it’s very valuable for you personally to get a intuitive +understand of what works and what doesn’t work in your problem. + +– Writing research papers: Diagnostics and error analysis help convey insight about + +the problem, and justify your research claims. + +– I.e., Rather than saying “Here’s an algorithm that works,” it’s more interesting to +say “Here’s an algorithm that works because of component X, and here’s my +justification.” + +• Good machine learning practice: Error analysis. Try to understand what + +your sources of error are. + +Andrew Y. Ng + + Error Analysis + +Andrew Y. Ng + + Error analysis + +Many applications combine many different learning components into a +“pipeline.” E.g., Face recognition from images: [contrived example] + +Camera +image + +Preprocess +(remove background) + +Eyes segmentation + +Face detection + +Nose segmentation + +Logistic regression + +Label + +Mouth segmentation + +Andrew Y. Ng + + Camera +image + +Preprocess +Preprocess +(remove background) +(remove background) + +Error analysis + +Eyes segmentation +Eyes segmentation + +Face detection +Face detection + +Nose segmentation +Nose segmentation + +Logistic regression +Logistic regression + +Label + +Mouth segmentation +Mouth segmentation + +How much error is attributable to each of the + +components? + +Plug in ground-truth for each component, and + +see how accuracy changes. + +Conclusion: Most room for improvement in face + +detection and eyes segmentation. + +Component + +Accuracy + +Overall system + +85% + +Preprocess (remove +background) + +Face detection + +Eyes segmentation + +Nose segmentation + +Mouth segmentation + +85.1% + +91% + +95% + +96% + +97% + +Logistic regression + +100% +Andrew Y. Ng + + Ablative analysis + +Error analysis tries to explain the difference between current performance and + +perfect performance. + +Ablative analysis tries to explain the difference between some baseline (much + +poorer) performance and current performance. + +E.g., Suppose that you’ve build a good anti-spam classifier by adding lots of + +clever features to logistic regression: + +– Spelling correction. +– Sender host features. +– Email header features. +– Email text parser features. +– Javascript parser. +– Features from embedded images. + +Question: How much did each of these components really help? + +Andrew Y. Ng + + Ablative analysis + +Simple logistic regression without any clever features get 94% performance. + +Just what accounts for your improvement from 94 to 99.9%? + +Ablative analysis: Remove components from your system one at a time, to see + +how it breaks. + +Component + +Accuracy + +Overall system + +Spelling correction + +Sender host features + +Email header features + +Email text parser features + +Javascript parser + +Features from images + +99.9% + +99.0 + +98.9% + +98.9% + +95% + +94.5% + +94.0% + +[baseline] + +Conclusion: The email text parser features account for most of the + +improvement. + +Andrew Y. Ng + + Getting started on a +learning problem + +Andrew Y. Ng + + Getting started on a problem + +Approach #1: Careful design. + +• Spend a long term designing exactly the right features, collecting the right dataset, + +and designing the right algorithmic architecture. + +• + +Implement it and hope it works. + +• Benefit: Nicer, perhaps more scalable algorithms. May come up with new, elegant, + +learning algorithms; contribute to basic research in machine learning. + +Approach #2: Build-and-fix. + +• + +Implement something quick-and-dirty. + +• Run error analyses and diagnostics to see what’s wrong with it, and fix its errors. + +• Benefit: Will often get your application problem working more quickly. Faster time to + +market. + +Andrew Y. Ng + + Premature statistical optimization + +Very often, it’s not clear what parts of a system are easy or difficult to build, and + +which parts you need to spend lots of time focusing on. E.g., + +Camera +image + +Preprocess +(remove background) + +This system’s much too +complicated for a first attempt. + +Eyes segmentation + +Step 1 of designing a learning +system: Plot the data. + +Face detection + +Nose segmentation + +Logistic regression + +Label + +The only way to find out what needs work is to implement something quickly, + +and find out what parts break. + +Mouth segmentation + +[But this may be bad advice if your goal is to come up with new machine + +learning algorithms.] + +Andrew Y. Ng + + The danger of over-theorizing + +3d similarity +learning + +Color +invariance + +Object +detection + +Navigation + +Differential +geometry of +3d manifolds + +Complexity of +non-Riemannian +geometries + +VC +dimension + +… Convergence + +bounds for +sampled non- +monotonic logic + +Mail +delivery +robot + +Obstacle +avoidance + +Robot +manipulation + +[Based on Papadimitriou, 1995] + +Andrew Y. Ng + + Summary + +Andrew Y. Ng + + Summary + +• Time spent coming up with diagnostics for learning algorithms is time well- + +spent. + +• + +It’s often up to your own ingenuity to come up with right diagnostics. + +• Error analyses and ablative analyses also give insight into the problem. + +• Two approaches to applying learning algorithms: + +– Design very carefully, then implement. + +• Risk of premature (statistical) optimization. +– Build a quick-and-dirty prototype, diagnose, and fix. + +Andrew Y. Ng + + diff --git a/docs/evidence/cs231n_neural_networks_3.md b/docs/evidence/cs231n_neural_networks_3.md new file mode 100644 index 0000000..5a613a5 --- /dev/null +++ b/docs/evidence/cs231n_neural_networks_3.md @@ -0,0 +1,353 @@ +Source: https://cs231n.github.io/neural-networks-3/ +Title: CS231n - Neural Networks Part 3: Learning and Evaluation +Fetched-via: uvx markitdown https://cs231n.github.io/neural-networks-3/ +Fetch-status: verbatim + +[CS231n Deep Learning for Computer Vision](https://cs231n.github.io) +[Course Website](http://cs231n.stanford.edu/) + +# + +Table of Contents: + +* [Gradient checks](#gradcheck) +* [Sanity checks](#sanitycheck) +* [Babysitting the learning process](#baby) + + [Loss function](#loss) + + [Train/val accuracy](#accuracy) + + [Weights:Updates ratio](#ratio) + + [Activation/Gradient distributions per layer](#distr) + + [Visualization](#vis) +* [Parameter updates](#update) + + [First-order (SGD), momentum, Nesterov momentum](#sgd) + + [Annealing the learning rate](#anneal) + + [Second-order methods](#second) + + [Per-parameter adaptive learning rates (Adagrad, RMSProp)](#ada) +* [Hyperparameter Optimization](#hyper) +* [Evaluation](#eval) + + [Model Ensembles](#ensemble) +* [Summary](#summary) +* [Additional References](#add) + +## Learning + +In the previous sections we’ve discussed the static parts of a Neural Networks: how we can set up the network connectivity, the data, and the loss function. This section is devoted to the dynamics, or in other words, the process of learning the parameters and finding good hyperparameters. + +### Gradient Checks + +In theory, performing a gradient check is as simple as comparing the analytic gradient to the numerical gradient. In practice, the process is much more involved and error prone. Here are some tips, tricks, and issues to watch out for: + +**Use the centered formula**. The formula you may have seen for the finite difference approximation when evaluating the numerical gradient looks as follows: + +\[\frac{df(x)}{dx} = \frac{f(x + h) - f(x)}{h} \hspace{0.1in} \text{(bad, do not use)}\] + +where \(h\) is a very small number, in practice approximately 1e-5 or so. In practice, it turns out that it is much better to use the *centered* difference formula of the form: + +\[\frac{df(x)}{dx} = \frac{f(x + h) - f(x - h)}{2h} \hspace{0.1in} \text{(use instead)}\] + +This requires you to evaluate the loss function twice to check every single dimension of the gradient (so it is about 2 times as expensive), but the gradient approximation turns out to be much more precise. To see this, you can use Taylor expansion of \(f(x+h)\) and \(f(x-h)\) and verify that the first formula has an error on order of \(O(h)\), while the second formula only has error terms on order of \(O(h^2)\) (i.e. it is a second order approximation). + +**Use relative error for the comparison**. What are the details of comparing the numerical gradient \(f’\_n\) and analytic gradient \(f’\_a\)? That is, how do we know if the two are not compatible? You might be temped to keep track of the difference \(\mid f’\_a - f’\_n \mid \) or its square and define the gradient check as failed if that difference is above a threshold. However, this is problematic. For example, consider the case where their difference is 1e-4. This seems like a very appropriate difference if the two gradients are about 1.0, so we’d consider the two gradients to match. But if the gradients were both on order of 1e-5 or lower, then we’d consider 1e-4 to be a huge difference and likely a failure. Hence, it is always more appropriate to consider the *relative error*: + +\[\frac{\mid f'\_a - f'\_n \mid}{\max(\mid f'\_a \mid, \mid f'\_n \mid)}\] + +which considers their ratio of the differences to the ratio of the absolute values of both gradients. Notice that normally the relative error formula only includes one of the two terms (either one), but I prefer to max (or add) both to make it symmetric and to prevent dividing by zero in the case where one of the two is zero (which can often happen, especially with ReLUs). However, one must explicitly keep track of the case where both are zero and pass the gradient check in that edge case. In practice: + +* relative error > 1e-2 usually means the gradient is probably wrong +* 1e-2 > relative error > 1e-4 should make you feel uncomfortable +* 1e-4 > relative error is usually okay for objectives with kinks. But if there are no kinks (e.g. use of tanh nonlinearities and softmax), then 1e-4 is too high. +* 1e-7 and less you should be happy. + +Also keep in mind that the deeper the network, the higher the relative errors will be. So if you are gradient checking the input data for a 10-layer network, a relative error of 1e-2 might be okay because the errors build up on the way. Conversely, an error of 1e-2 for a single differentiable function likely indicates incorrect gradient. + +**Use double precision**. A common pitfall is using single precision floating point to compute gradient check. It is often that case that you might get high relative errors (as high as 1e-2) even with a correct gradient implementation. In my experience I’ve sometimes seen my relative errors plummet from 1e-2 to 1e-8 by switching to double precision. + +**Stick around active range of floating point**. It’s a good idea to read through [“What Every Computer Scientist Should Know About Floating-Point Arithmetic”](http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html), as it may demystify your errors and enable you to write more careful code. For example, in neural nets it can be common to normalize the loss function over the batch. However, if your gradients per datapoint are very small, then *additionally* dividing them by the number of data points is starting to give very small numbers, which in turn will lead to more numerical issues. This is why I like to always print the raw numerical/analytic gradient, and make sure that the numbers you are comparing are not extremely small (e.g. roughly 1e-10 and smaller in absolute value is worrying). If they are you may want to temporarily scale your loss function up by a constant to bring them to a “nicer” range where floats are more dense - ideally on the order of 1.0, where your float exponent is 0. + +**Kinks in the objective**. One source of inaccuracy to be aware of during gradient checking is the problem of *kinks*. Kinks refer to non-differentiable parts of an objective function, introduced by functions such as ReLU (\(max(0,x)\)), or the SVM loss, Maxout neurons, etc. Consider gradient checking the ReLU function at \(x = -1e6\). Since \(x < 0\), the analytic gradient at this point is exactly zero. However, the numerical gradient would suddenly compute a non-zero gradient because \(f(x+h)\) might cross over the kink (e.g. if \(h > 1e-6\)) and introduce a non-zero contribution. You might think that this is a pathological case, but in fact this case can be very common. For example, an SVM for CIFAR-10 contains up to 450,000 \(max(0,x)\) terms because there are 50,000 examples and each example yields 9 terms to the objective. Moreover, a Neural Network with an SVM classifier will contain many more kinks due to ReLUs. + +Note that it is possible to know if a kink was crossed in the evaluation of the loss. This can be done by keeping track of the identities of all “winners” in a function of form \(max(x,y)\); That is, was x or y higher during the forward pass. If the identity of at least one winner changes when evaluating \(f(x+h)\) and then \(f(x-h)\), then a kink was crossed and the numerical gradient will not be exact. + +**Use only few datapoints**. One fix to the above problem of kinks is to use fewer datapoints, since loss functions that contain kinks (e.g. due to use of ReLUs or margin losses etc.) will have fewer kinks with fewer datapoints, so it is less likely for you to cross one when you perform the finite different approximation. Moreover, if your gradcheck for only ~2 or 3 datapoints then you would almost certainly gradcheck for an entire batch. Using very few datapoints also makes your gradient check faster and more efficient. + +**Be careful with the step size h**. It is not necessarily the case that smaller is better, because when \(h\) is much smaller, you may start running into numerical precision problems. Sometimes when the gradient doesn’t check, it is possible that you change \(h\) to be 1e-4 or 1e-6 and suddenly the gradient will be correct. This [wikipedia article](http://en.wikipedia.org/wiki/Numerical_differentiation) contains a chart that plots the value of **h** on the x-axis and the numerical gradient error on the y-axis. + +**Gradcheck during a “characteristic” mode of operation**. It is important to realize that a gradient check is performed at a particular (and usually random), single point in the space of parameters. Even if the gradient check succeeds at that point, it is not immediately certain that the gradient is correctly implemented globally. Additionally, a random initialization might not be the most “characteristic” point in the space of parameters and may in fact introduce pathological situations where the gradient seems to be correctly implemented but isn’t. For instance, an SVM with very small weight initialization will assign almost exactly zero scores to all datapoints and the gradients will exhibit a particular pattern across all datapoints. An incorrect implementation of the gradient could still produce this pattern and not generalize to a more characteristic mode of operation where some scores are larger than others. Therefore, to be safe it is best to use a short **burn-in** time during which the network is allowed to learn and perform the gradient check after the loss starts to go down. The danger of performing it at the first iteration is that this could introduce pathological edge cases and mask an incorrect implementation of the gradient. + +**Don’t let the regularization overwhelm the data**. It is often the case that a loss function is a sum of the data loss and the regularization loss (e.g. L2 penalty on weights). One danger to be aware of is that the regularization loss may overwhelm the data loss, in which case the gradients will be primarily coming from the regularization term (which usually has a much simpler gradient expression). This can mask an incorrect implementation of the data loss gradient. Therefore, it is recommended to turn off regularization and check the data loss alone first, and then the regularization term second and independently. One way to perform the latter is to hack the code to remove the data loss contribution. Another way is to increase the regularization strength so as to ensure that its effect is non-negligible in the gradient check, and that an incorrect implementation would be spotted. + +**Remember to turn off dropout/augmentations**. When performing gradient check, remember to turn off any non-deterministic effects in the network, such as dropout, random data augmentations, etc. Otherwise these can clearly introduce huge errors when estimating the numerical gradient. The downside of turning off these effects is that you wouldn’t be gradient checking them (e.g. it might be that dropout isn’t backpropagated correctly). Therefore, a better solution might be to force a particular random seed before evaluating both \(f(x+h)\) and \(f(x-h)\), and when evaluating the analytic gradient. + +**Check only few dimensions**. In practice the gradients can have sizes of million parameters. In these cases it is only practical to check some of the dimensions of the gradient and assume that the others are correct. **Be careful**: One issue to be careful with is to make sure to gradient check a few dimensions for every separate parameter. In some applications, people combine the parameters into a single large parameter vector for convenience. In these cases, for example, the biases could only take up a tiny number of parameters from the whole vector, so it is important to not sample at random but to take this into account and check that all parameters receive the correct gradients. + +### Before learning: sanity checks Tips/Tricks + +Here are a few sanity checks you might consider running before you plunge into expensive optimization: + +* **Look for correct loss at chance performance.** Make sure you’re getting the loss you expect when you initialize with small parameters. It’s best to first check the data loss alone (so set regularization strength to zero). For example, for CIFAR-10 with a Softmax classifier we would expect the initial loss to be 2.302, because we expect a diffuse probability of 0.1 for each class (since there are 10 classes), and Softmax loss is the negative log probability of the correct class so: -ln(0.1) = 2.302. For The Weston Watkins SVM, we expect all desired margins to be violated (since all scores are approximately zero), and hence expect a loss of 9 (since margin is 1 for each wrong class). If you’re not seeing these losses there might be issue with initialization. +* As a second sanity check, increasing the regularization strength should increase the loss +* **Overfit a tiny subset of data**. Lastly and most importantly, before training on the full dataset try to train on a tiny portion (e.g. 20 examples) of your data and make sure you can achieve zero cost. For this experiment it’s also best to set regularization to zero, otherwise this can prevent you from getting zero cost. Unless you pass this sanity check with a small dataset it is not worth proceeding to the full dataset. Note that it may happen that you can overfit very small dataset but still have an incorrect implementation. For instance, if your datapoints’ features are random due to some bug, then it will be possible to overfit your small training set but you will never notice any generalization when you fold it your full dataset. + +### Babysitting the learning process + +There are multiple useful quantities you should monitor during training of a neural network. These plots are the window into the training process and should be utilized to get intuitions about different hyperparameter settings and how they should be changed for more efficient learning. + +The x-axis of the plots below are always in units of epochs, which measure how many times every example has been seen during training in expectation (e.g. one epoch means that every example has been seen once). It is preferable to track epochs rather than iterations since the number of iterations depends on the arbitrary setting of batch size. + +#### Loss function + +The first quantity that is useful to track during training is the loss, as it is evaluated on the individual batches during the forward pass. Below is a cartoon diagram showing the loss over time, and especially what the shape might tell you about the learning rate: + +![](/assets/nn3/learningrates.jpeg) +![](/assets/nn3/loss.jpeg) + +**Left:** A cartoon depicting the effects of different learning rates. With low learning rates the improvements will be linear. With high learning rates they will start to look more exponential. Higher learning rates will decay the loss faster, but they get stuck at worse values of loss (green line). This is because there is too much "energy" in the optimization and the parameters are bouncing around chaotically, unable to settle in a nice spot in the optimization landscape. **Right:** An example of a typical loss function over time, while training a small network on CIFAR-10 dataset. This loss function looks reasonable (it might indicate a slightly too small learning rate based on its speed of decay, but it's hard to say), and also indicates that the batch size might be a little too low (since the cost is a little too noisy). + +The amount of “wiggle” in the loss is related to the batch size. When the batch size is 1, the wiggle will be relatively high. When the batch size is the full dataset, the wiggle will be minimal because every gradient update should be improving the loss function monotonically (unless the learning rate is set too high). + +Some people prefer to plot their loss functions in the log domain. Since learning progress generally takes an exponential form shape, the plot appears as a slightly more interpretable straight line, rather than a hockey stick. Additionally, if multiple cross-validated models are plotted on the same loss graph, the differences between them become more apparent. + +Sometimes loss functions can look funny [lossfunctions.tumblr.com](http://lossfunctions.tumblr.com/). + +#### Train/Val accuracy + +The second important quantity to track while training a classifier is the validation/training accuracy. This plot can give you valuable insights into the amount of overfitting in your model: + +![](/assets/nn3/accuracies.jpeg) + +The gap between the training and validation accuracy indicates the amount of overfitting. Two possible cases are shown in the diagram on the left. The blue validation error curve shows very small validation accuracy compared to the training accuracy, indicating strong overfitting (note, it's possible for the validation accuracy to even start to go down after some point). When you see this in practice you probably want to increase regularization (stronger L2 weight penalty, more dropout, etc.) or collect more data. The other possible case is when the validation accuracy tracks the training accuracy fairly well. This case indicates that your model capacity is not high enough: make the model larger by increasing the number of parameters. + +#### Ratio of weights:updates + +The last quantity you might want to track is the ratio of the update magnitudes to the value magnitudes. Note: *updates*, not the raw gradients (e.g. in vanilla sgd this would be the gradient multiplied by the learning rate). You might want to evaluate and track this ratio for every set of parameters independently. A rough heuristic is that this ratio should be somewhere around 1e-3. If it is lower than this then the learning rate might be too low. If it is higher then the learning rate is likely too high. Here is a specific example: + +``` +# assume parameter vector W and its gradient vector dW +param_scale = np.linalg.norm(W.ravel()) +update = -learning_rate*dW # simple SGD update +update_scale = np.linalg.norm(update.ravel()) +W += update # the actual update +print update_scale / param_scale # want ~1e-3 +``` + +Instead of tracking the min or the max, some people prefer to compute and track the norm of the gradients and their updates instead. These metrics are usually correlated and often give approximately the same results. + +#### Activation / Gradient distributions per layer + +An incorrect initialization can slow down or even completely stall the learning process. Luckily, this issue can be diagnosed relatively easily. One way to do so is to plot activation/gradient histograms for all layers of the network. Intuitively, it is not a good sign to see any strange distributions - e.g. with tanh neurons we would like to see a distribution of neuron activations between the full range of [-1,1], instead of seeing all neurons outputting zero, or all neurons being completely saturated at either -1 or 1. + +#### First-layer Visualizations + +Lastly, when one is working with image pixels it can be helpful and satisfying to plot the first-layer features visually: + +![](/assets/nn3/weights.jpeg) +![](/assets/nn3/cnnweights.jpg) + +Examples of visualized weights for the first layer of a neural network. **Left**: Noisy features indicate could be a symptom: Unconverged network, improperly set learning rate, very low weight regularization penalty. **Right:** Nice, smooth, clean and diverse features are a good indication that the training is proceeding well. + +### Parameter updates + +Once the analytic gradient is computed with backpropagation, the gradients are used to perform a parameter update. There are several approaches for performing the update, which we discuss next. + +We note that optimization for deep networks is currently a very active area of research. In this section we highlight some established and common techniques you may see in practice, briefly describe their intuition, but leave a detailed analysis outside of the scope of the class. We provide some further pointers for an interested reader. + +#### SGD and bells and whistles + +**Vanilla update**. The simplest form of update is to change the parameters along the negative gradient direction (since the gradient indicates the direction of increase, but we usually wish to minimize a loss function). Assuming a vector of parameters `x` and the gradient `dx`, the simplest update has the form: + +``` +# Vanilla update +x += - learning_rate * dx +``` + +where `learning_rate` is a hyperparameter - a fixed constant. When evaluated on the full dataset, and when the learning rate is low enough, this is guaranteed to make non-negative progress on the loss function. + +**Momentum update** is another approach that almost always enjoys better converge rates on deep networks. This update can be motivated from a physical perspective of the optimization problem. In particular, the loss can be interpreted as the height of a hilly terrain (and therefore also to the potential energy since \(U = mgh\) and therefore \( U \propto h \) ). Initializing the parameters with random numbers is equivalent to setting a particle with zero initial velocity at some location. The optimization process can then be seen as equivalent to the process of simulating the parameter vector (i.e. a particle) as rolling on the landscape. + +Since the force on the particle is related to the gradient of potential energy (i.e. \(F = - \nabla U \) ), the **force** felt by the particle is precisely the (negative) **gradient** of the loss function. Moreover, \(F = ma \) so the (negative) gradient is in this view proportional to the acceleration of the particle. Note that this is different from the SGD update shown above, where the gradient directly integrates the position. Instead, the physics view suggests an update in which the gradient only directly influences the velocity, which in turn has an effect on the position: + +``` +# Momentum update +v = mu * v - learning_rate * dx # integrate velocity +x += v # integrate position +``` + +Here we see an introduction of a `v` variable that is initialized at zero, and an additional hyperparameter (`mu`). As an unfortunate misnomer, this variable is in optimization referred to as *momentum* (its typical value is about 0.9), but its physical meaning is more consistent with the coefficient of friction. Effectively, this variable damps the velocity and reduces the kinetic energy of the system, or otherwise the particle would never come to a stop at the bottom of a hill. When cross-validated, this parameter is usually set to values such as [0.5, 0.9, 0.95, 0.99]. Similar to annealing schedules for learning rates (discussed later, below), optimization can sometimes benefit a little from momentum schedules, where the momentum is increased in later stages of learning. A typical setting is to start with momentum of about 0.5 and anneal it to 0.99 or so over multiple epochs. + +> With Momentum update, the parameter vector will build up velocity in any direction that has consistent gradient. + +**Nesterov Momentum** is a slightly different version of the momentum update that has recently been gaining popularity. It enjoys stronger theoretical converge guarantees for convex functions and in practice it also consistenly works slightly better than standard momentum. + +The core idea behind Nesterov momentum is that when the current parameter vector is at some position `x`, then looking at the momentum update above, we know that the momentum term alone (i.e. ignoring the second term with the gradient) is about to nudge the parameter vector by `mu * v`. Therefore, if we are about to compute the gradient, we can treat the future approximate position `x + mu * v` as a “lookahead” - this is a point in the vicinity of where we are soon going to end up. Hence, it makes sense to compute the gradient at `x + mu * v` instead of at the “old/stale” position `x`. + +![](/assets/nn3/nesterov.jpeg) + +Nesterov momentum. Instead of evaluating gradient at the current position (red circle), we know that our momentum is about to carry us to the tip of the green arrow. With Nesterov momentum we therefore instead evaluate the gradient at this "looked-ahead" position. + +That is, in a slightly awkward notation, we would like to do the following: + +``` +x_ahead = x + mu * v +# evaluate dx_ahead (the gradient at x_ahead instead of at x) +v = mu * v - learning_rate * dx_ahead +x += v +``` + +However, in practice people prefer to express the update to look as similar to vanilla SGD or to the previous momentum update as possible. This is possible to achieve by manipulating the update above with a variable transform `x_ahead = x + mu * v`, and then expressing the update in terms of `x_ahead` instead of `x`. That is, the parameter vector we are actually storing is always the ahead version. The equations in terms of `x_ahead` (but renaming it back to `x`) then become: + +``` +v_prev = v # back this up +v = mu * v - learning_rate * dx # velocity update stays the same +x += -mu * v_prev + (1 + mu) * v # position update changes form +``` + +We recommend this further reading to understand the source of these equations and the mathematical formulation of Nesterov’s Accelerated Momentum (NAG): + +* [Advances in optimizing Recurrent Networks](http://arxiv.org/pdf/1212.0901v2.pdf) by Yoshua Bengio, Section 3.5. +* [Ilya Sutskever’s thesis](http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf) (pdf) contains a longer exposition of the topic in section 7.2 + +#### Annealing the learning rate + +In training deep networks, it is usually helpful to anneal the learning rate over time. Good intuition to have in mind is that with a high learning rate, the system contains too much kinetic energy and the parameter vector bounces around chaotically, unable to settle down into deeper, but narrower parts of the loss function. Knowing when to decay the learning rate can be tricky: Decay it slowly and you’ll be wasting computation bouncing around chaotically with little improvement for a long time. But decay it too aggressively and the system will cool too quickly, unable to reach the best position it can. There are three common types of implementing the learning rate decay: + +* **Step decay**: Reduce the learning rate by some factor every few epochs. Typical values might be reducing the learning rate by a half every 5 epochs, or by 0.1 every 20 epochs. These numbers depend heavily on the type of problem and the model. One heuristic you may see in practice is to watch the validation error while training with a fixed learning rate, and reduce the learning rate by a constant (e.g. 0.5) whenever the validation error stops improving. +* **Exponential decay.** has the mathematical form \(\alpha = \alpha\_0 e^{-k t}\), where \(\alpha\_0, k\) are hyperparameters and \(t\) is the iteration number (but you can also use units of epochs). +* **1/t decay** has the mathematical form \(\alpha = \alpha\_0 / (1 + k t )\) where \(a\_0, k\) are hyperparameters and \(t\) is the iteration number. + +In practice, we find that the step decay is slightly preferable because the hyperparameters it involves (the fraction of decay and the step timings in units of epochs) are more interpretable than the hyperparameter \(k\). Lastly, if you can afford the computational budget, err on the side of slower decay and train for a longer time. + +#### Second order methods + +A second, popular group of methods for optimization in context of deep learning is based on [Newton’s method](http://en.wikipedia.org/wiki/Newton%27s_method_in_optimization), which iterates the following update: + +\[x \leftarrow x - [H f(x)]^{-1} \nabla f(x)\] + +Here, \(H f(x)\) is the [Hessian matrix](http://en.wikipedia.org/wiki/Hessian_matrix), which is a square matrix of second-order partial derivatives of the function. The term \(\nabla f(x)\) is the gradient vector, as seen in Gradient Descent. Intuitively, the Hessian describes the local curvature of the loss function, which allows us to perform a more efficient update. In particular, multiplying by the inverse Hessian leads the optimization to take more aggressive steps in directions of shallow curvature and shorter steps in directions of steep curvature. Note, crucially, the absence of any learning rate hyperparameters in the update formula, which the proponents of these methods cite this as a large advantage over first-order methods. + +However, the update above is impractical for most deep learning applications because computing (and inverting) the Hessian in its explicit form is a very costly process in both space and time. For instance, a Neural Network with one million parameters would have a Hessian matrix of size [1,000,000 x 1,000,000], occupying approximately 3725 gigabytes of RAM. Hence, a large variety of *quasi-Newton* methods have been developed that seek to approximate the inverse Hessian. Among these, the most popular is [L-BFGS](http://en.wikipedia.org/wiki/Limited-memory_BFGS), which uses the information in the gradients over time to form the approximation implicitly (i.e. the full matrix is never computed). + +However, even after we eliminate the memory concerns, a large downside of a naive application of L-BFGS is that it must be computed over the entire training set, which could contain millions of examples. Unlike mini-batch SGD, getting L-BFGS to work on mini-batches is more tricky and an active area of research. + +**In practice**, it is currently not common to see L-BFGS or similar second-order methods applied to large-scale Deep Learning and Convolutional Neural Networks. Instead, SGD variants based on (Nesterov’s) momentum are more standard because they are simpler and scale more easily. + +Additional references: + +* [Large Scale Distributed Deep Networks](http://research.google.com/archive/large_deep_networks_nips2012.html) is a paper from the Google Brain team, comparing L-BFGS and SGD variants in large-scale distributed optimization. +* [SFO](http://arxiv.org/abs/1311.2115) algorithm strives to combine the advantages of SGD with advantages of L-BFGS. + +#### Per-parameter adaptive learning rate methods + +All previous approaches we’ve discussed so far manipulated the learning rate globally and equally for all parameters. Tuning the learning rates is an expensive process, so much work has gone into devising methods that can adaptively tune the learning rates, and even do so per parameter. Many of these methods may still require other hyperparameter settings, but the argument is that they are well-behaved for a broader range of hyperparameter values than the raw learning rate. In this section we highlight some common adaptive methods you may encounter in practice: + +**Adagrad** is an adaptive learning rate method originally proposed by [Duchi et al.](http://jmlr.org/papers/v12/duchi11a.html). + +``` +# Assume the gradient dx and parameter vector x +cache += dx**2 +x += - learning_rate * dx / (np.sqrt(cache) + eps) +``` + +Notice that the variable `cache` has size equal to the size of the gradient, and keeps track of per-parameter sum of squared gradients. This is then used to normalize the parameter update step, element-wise. Notice that the weights that receive high gradients will have their effective learning rate reduced, while weights that receive small or infrequent updates will have their effective learning rate increased. Amusingly, the square root operation turns out to be very important and without it the algorithm performs much worse. The smoothing term `eps` (usually set somewhere in range from 1e-4 to 1e-8) avoids division by zero. A downside of Adagrad is that in case of Deep Learning, the monotonic learning rate usually proves too aggressive and stops learning too early. + +**RMSprop.** RMSprop is a very effective, but currently unpublished adaptive learning rate method. Amusingly, everyone who uses this method in their work currently cites [slide 29 of Lecture 6](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) of Geoff Hinton’s Coursera class. The RMSProp update adjusts the Adagrad method in a very simple way in an attempt to reduce its aggressive, monotonically decreasing learning rate. In particular, it uses a moving average of squared gradients instead, giving: + +``` +cache = decay_rate * cache + (1 - decay_rate) * dx**2 +x += - learning_rate * dx / (np.sqrt(cache) + eps) +``` + +Here, `decay_rate` is a hyperparameter and typical values are [0.9, 0.99, 0.999]. Notice that the `x+=` update is identical to Adagrad, but the `cache` variable is a “leaky”. Hence, RMSProp still modulates the learning rate of each weight based on the magnitudes of its gradients, which has a beneficial equalizing effect, but unlike Adagrad the updates do not get monotonically smaller. + +**Adam.** [Adam](http://arxiv.org/abs/1412.6980) is a recently proposed update that looks a bit like RMSProp with momentum. The (simplified) update looks as follows: + +``` +m = beta1*m + (1-beta1)*dx +v = beta2*v + (1-beta2)*(dx**2) +x += - learning_rate * m / (np.sqrt(v) + eps) +``` + +Notice that the update looks exactly as RMSProp update, except the “smooth” version of the gradient `m` is used instead of the raw (and perhaps noisy) gradient vector `dx`. Recommended values in the paper are `eps = 1e-8`, `beta1 = 0.9`, `beta2 = 0.999`. In practice Adam is currently recommended as the default algorithm to use, and often works slightly better than RMSProp. However, it is often also worth trying SGD+Nesterov Momentum as an alternative. The full Adam update also includes a *bias correction* mechanism, which compensates for the fact that in the first few time steps the vectors `m,v` are both initialized and therefore biased at zero, before they fully “warm up”. With the *bias correction* mechanism, the update looks as follows: + +``` +# t is your iteration counter going from 1 to infinity +m = beta1*m + (1-beta1)*dx +mt = m / (1-beta1**t) +v = beta2*v + (1-beta2)*(dx**2) +vt = v / (1-beta2**t) +x += - learning_rate * mt / (np.sqrt(vt) + eps) +``` + +Note that the update is now a function of the iteration as well as the other parameters. +We refer the reader to the paper for the details, or the course slides where this is expanded on. + +Additional References: + +* [Unit Tests for Stochastic Optimization](http://arxiv.org/abs/1312.6055) proposes a series of tests as a standardized benchmark for stochastic optimization. + +![](/assets/nn3/opt2.gif) +![](/assets/nn3/opt1.gif) + +Animations that may help your intuitions about the learning process dynamics. **Left:** Contours of a loss surface and time evolution of different optimization algorithms. Notice the "overshooting" behavior of momentum-based methods, which make the optimization look like a ball rolling down the hill. **Right:** A visualization of a saddle point in the optimization landscape, where the curvature along different dimension has different signs (one dimension curves up and another down). Notice that SGD has a very hard time breaking symmetry and gets stuck on the top. Conversely, algorithms such as RMSprop will see very low gradients in the saddle direction. Due to the denominator term in the RMSprop update, this will increase the effective learning rate along this direction, helping RMSProp proceed. Images credit: [Alec Radford](https://twitter.com/alecrad). + +### Hyperparameter optimization + +As we’ve seen, training Neural Networks can involve many hyperparameter settings. The most common hyperparameters in context of Neural Networks include: + +* the initial learning rate +* learning rate decay schedule (such as the decay constant) +* regularization strength (L2 penalty, dropout strength) + +But as we saw, there are many more relatively less sensitive hyperparameters, for example in per-parameter adaptive learning methods, the setting of momentum and its schedule, etc. In this section we describe some additional tips and tricks for performing the hyperparameter search: + +**Implementation**. Larger Neural Networks typically require a long time to train, so performing hyperparameter search can take many days/weeks. It is important to keep this in mind since it influences the design of your code base. One particular design is to have a **worker** that continuously samples random hyperparameters and performs the optimization. During the training, the worker will keep track of the validation performance after every epoch, and writes a model checkpoint (together with miscellaneous training statistics such as the loss over time) to a file, preferably on a shared file system. It is useful to include the validation performance directly in the filename, so that it is simple to inspect and sort the progress. Then there is a second program which we will call a **master**, which launches or kills workers across a computing cluster, and may additionally inspect the checkpoints written by workers and plot their training statistics, etc. + +**Prefer one validation fold to cross-validation**. In most cases a single validation set of respectable size substantially simplifies the code base, without the need for cross-validation with multiple folds. You’ll hear people say they “cross-validated” a parameter, but many times it is assumed that they still only used a single validation set. + +**Hyperparameter ranges**. Search for hyperparameters on log scale. For example, a typical sampling of the learning rate would look as follows: `learning_rate = 10 ** uniform(-6, 1)`. That is, we are generating a random number from a uniform distribution, but then raising it to the power of 10. The same strategy should be used for the regularization strength. Intuitively, this is because learning rate and regularization strength have multiplicative effects on the training dynamics. For example, a fixed change of adding 0.01 to a learning rate has huge effects on the dynamics if the learning rate is 0.001, but nearly no effect if the learning rate when it is 10. This is because the learning rate multiplies the computed gradient in the update. Therefore, it is much more natural to consider a range of learning rate multiplied or divided by some value, than a range of learning rate added or subtracted to by some value. Some parameters (e.g. dropout) are instead usually searched in the original scale (e.g. `dropout = uniform(0,1)`). + +**Prefer random search to grid search**. As argued by Bergstra and Bengio in [Random Search for Hyper-Parameter Optimization](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf), “randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid”. As it turns out, this is also usually easier to implement. + +![](/assets/nn3/gridsearchbad.jpeg) + +Core illustration from [Random Search for Hyper-Parameter Optimization](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf) by Bergstra and Bengio. It is very often the case that some of the hyperparameters matter much more than others (e.g. top hyperparam vs. left one in this figure). Performing random search rather than grid search allows you to much more precisely discover good values for the important ones. + +**Careful with best values on border**. Sometimes it can happen that you’re searching for a hyperparameter (e.g. learning rate) in a bad range. For example, suppose we use `learning_rate = 10 ** uniform(-6, 1)`. Once we receive the results, it is important to double check that the final learning rate is not at the edge of this interval, or otherwise you may be missing more optimal hyperparameter setting beyond the interval. + +**Stage your search from coarse to fine**. In practice, it can be helpful to first search in coarse ranges (e.g. 10 \*\* [-6, 1]), and then depending on where the best results are turning up, narrow the range. Also, it can be helpful to perform the initial coarse search while only training for 1 epoch or even less, because many hyperparameter settings can lead the model to not learn at all, or immediately explode with infinite cost. The second stage could then perform a narrower search with 5 epochs, and the last stage could perform a detailed search in the final range for many more epochs (for example). + +**Bayesian Hyperparameter Optimization** is a whole area of research devoted to coming up with algorithms that try to more efficiently navigate the space of hyperparameters. The core idea is to appropriately balance the exploration - exploitation trade-off when querying the performance at different hyperparameters. Multiple libraries have been developed based on these models as well, among some of the better known ones are [Spearmint](https://github.com/JasperSnoek/spearmint), [SMAC](http://www.cs.ubc.ca/labs/beta/Projects/SMAC/), and [Hyperopt](http://jaberg.github.io/hyperopt/). However, in practical settings with ConvNets it is still relatively difficult to beat random search in a carefully-chosen intervals. See some additional from-the-trenches discussion [here](http://nlpers.blogspot.com/2014/10/hyperparameter-search-bayesian.html). + +## Evaluation + +### Model Ensembles + +In practice, one reliable approach to improving the performance of Neural Networks by a few percent is to train multiple independent models, and at test time average their predictions. As the number of models in the ensemble increases, the performance typically monotonically improves (though with diminishing returns). Moreover, the improvements are more dramatic with higher model variety in the ensemble. There are a few approaches to forming an ensemble: + +* **Same model, different initializations**. Use cross-validation to determine the best hyperparameters, then train multiple models with the best set of hyperparameters but with different random initialization. The danger with this approach is that the variety is only due to initialization. +* **Top models discovered during cross-validation**. Use cross-validation to determine the best hyperparameters, then pick the top few (e.g. 10) models to form the ensemble. This improves the variety of the ensemble but has the danger of including suboptimal models. In practice, this can be easier to perform since it doesn’t require additional retraining of models after cross-validation +* **Different checkpoints of a single model**. If training is very expensive, some people have had limited success in taking different checkpoints of a single network over time (for example after every epoch) and using those to form an ensemble. Clearly, this suffers from some lack of variety, but can still work reasonably well in practice. The advantage of this approach is that is very cheap. +* **Running average of parameters during training**. Related to the last point, a cheap way of almost always getting an extra percent or two of performance is to maintain a second copy of the network’s weights in memory that maintains an exponentially decaying sum of previous weights during training. This way you’re averaging the state of the network over last several iterations. You will find that this “smoothed” version of the weights over last few steps almost always achieves better validation error. The rough intuition to have in mind is that the objective is bowl-shaped and your network is jumping around the mode, so the average has a higher chance of being somewhere nearer the mode. + +One disadvantage of model ensembles is that they take longer to evaluate on test example. An interested reader may find the recent work from Geoff Hinton on [“Dark Knowledge”](https://www.youtube.com/watch?v=EK61htlw8hY) inspiring, where the idea is to “distill” a good ensemble back to a single model by incorporating the ensemble log likelihoods into a modified objective. + +## Summary + +To train a Neural Network: + +* Gradient check your implementation with a small batch of data and be aware of the pitfalls. +* As a sanity check, make sure your initial loss is reasonable, and that you can achieve 100% training accuracy on a very small portion of the data +* During training, monitor the loss, the training/validation accuracy, and if you’re feeling fancier, the magnitude of updates in relation to parameter values (it should be ~1e-3), and when dealing with ConvNets, the first-layer weights. +* The two recommended updates to use are either SGD+Nesterov Momentum or Adam. +* Decay your learning rate over the period of the training. For example, halve the learning rate after a fixed number of epochs, or whenever the validation accuracy tops off. +* Search for good hyperparameters with random search (not grid search). Stage your search from coarse (wide hyperparameter ranges, training only for 1-5 epochs), to fine (narrower rangers, training for many more epochs) +* Form model ensembles for extra performance + +## Additional References + +* [SGD](http://research.microsoft.com/pubs/192769/tricks-2012.pdf) tips and tricks from Leon Bottou +* [Efficient BackProp](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf) (pdf) from Yann LeCun +* [Practical Recommendations for Gradient-Based Training of Deep + Architectures](http://arxiv.org/pdf/1206.5533v2.pdf) from Yoshua Bengio + +* [cs231n](https://github.com/cs231n) +* [cs231n](https://twitter.com/cs231n) diff --git a/docs/evidence/fsdl_spring2021_lecture7.md b/docs/evidence/fsdl_spring2021_lecture7.md new file mode 100644 index 0000000..c14406a --- /dev/null +++ b/docs/evidence/fsdl_spring2021_lecture7.md @@ -0,0 +1,786 @@ +Source: https://fullstackdeeplearning.com/spring2021/lecture-7/ +Title: FSDL Spring 2021 - Lecture 7: Troubleshooting Deep Neural Networks +Fetched-via: uvx markitdown https://fullstackdeeplearning.com/spring2021/lecture-7/ +Fetch-status: verbatim + +[Skip to content](#lecture-7-troubleshooting-deep-neural-networks) + +[Sign up for our latest in-person course!](https://www.scale.bythebay.io/llm-workshop) + +[![logo](../../images/favicon.png)](../.. "The Full Stack") + +The Full Stack + +Lecture 7: Troubleshooting Deep Neural Networks + +Initializing search + +[The Full Stack Website](https://github.com/the-full-stack/website "Go to repository") + +* [Home](../..) +* [LLM Bootcamp](../../llm-bootcamp/) +* [Deep Learning Course](../../course/) +* [Blog](../../blog/) +* [Cloud GPUs](../../cloud-gpus/) + +[![logo](../../images/favicon.png)](../.. "The Full Stack") +The Full Stack + +[The Full Stack Website](https://github.com/the-full-stack/website "Go to repository") + +* [Home](../..) +* [ ] + + [LLM Bootcamp](../../llm-bootcamp/) + + LLM Bootcamp + + [ ] + + [Spring 2023](../../llm-bootcamp/spring-2023/) + + Spring 2023 + - [Launch an LLM App in One Hour](../../llm-bootcamp/spring-2023/launch-an-llm-app-in-one-hour/) + - [LLM Foundations](../../llm-bootcamp/spring-2023/llm-foundations/) + - [Learn to Spell: Prompt Engineering](../../llm-bootcamp/spring-2023/prompt-engineering/) + - [Augmented Language Models](../../llm-bootcamp/spring-2023/augmented-language-models/) + - [Project Walkthrough: askFSDL](../../llm-bootcamp/spring-2023/askfsdl-walkthrough/) + - [UX for Language User Interfaces](../../llm-bootcamp/spring-2023/ux-for-luis/) + - [LLMOps](../../llm-bootcamp/spring-2023/llmops/) + - [What's Next?](../../llm-bootcamp/spring-2023/whats-next/) + - [Reza Shabani: How to train your own LLM](../../llm-bootcamp/spring-2023/shabani-train-your-own/) + - [Harrison Chase: Agents](../../llm-bootcamp/spring-2023/chase-agents/) + - [Fireside Chat with Peter Welinder](../../llm-bootcamp/spring-2023/welinder-fireside-chat/) +* [x] + + [Deep Learning Course](../../course/) + + Deep Learning Course + + [ ] + + [FSDL 2022](../../course/2022/) + + FSDL 2022 + - [Lecture 1: Course Vision and When to Use ML](../../course/2022/lecture-1-course-vision-and-when-to-use-ml/) + - [Lab Overview](../../course/2022/lab-0-overview/) + - [Lecture 2: Development Infrastructure & Tooling](../../course/2022/lecture-2-development-infrastructure-and-tooling/) + - [Lab 4: Experiment Management](../../course/2022/lab-4-experiment-management/) + - [Lecture 3: Troubleshooting & Testing](../../course/2022/lecture-3-troubleshooting-and-testing/) + - [Lab 5: Troubleshooting & Testing](../../course/2022/lab-5-troubleshooting-and-testing/) + - [Lecture 4: Data Management](../../course/2022/lecture-4-data-management/) + - [Lab 6: Data Annotation](../../course/2022/lab-6-data-annotation/) + - [Lecture 5: Deployment](../../course/2022/lecture-5-deployment/) + - [Lab 7: Web Deployment](../../course/2022/lab-7-web-deployment/) + - [Lecture 6: Continual Learning](../../course/2022/lecture-6-continual-learning/) + - [Lab 8: Model Monitoring](../../course/2022/lab-8-model-monitoring/) + - [Lecture 7: Foundation Models](../../course/2022/lecture-7-foundation-models/) + - [Lecture 8: ML Teams and Project Management](../../course/2022/lecture-8-teams-and-pm/) + - [Lecture 9: Ethics](../../course/2022/lecture-9-ethics/) + - [Project Showcase](../../course/2022/project-showcase/) + - [Course Announcement](../../course/2022/announcement/) + + [x] + + Older + + Older + - [x] + + [FSDL 2021](../) + + FSDL 2021 + * [Synchronous Online Course](../synchronous/) + * [Course Projects Showcase](../projects/) + * [Lecture 1: DL Fundamentals](../lecture-1/) + * [Lab 1: Setup and Introduction](../lab-1/) + * [Notebook: Coding a neural net](../notebook-1/) + * [Lecture 2A: CNNs](../lecture-2a/) + * [Lecture 2B: Computer Vision](../lecture-2b/) + * [Lab 2: CNNs and Synthetic Data](../lab-2/) + * [Lecture 3: RNNs](../lecture-3/) + * [Lab 3: RNNs](../lab-3/) + * [Lecture 4: Transformers](../lecture-4/) + * [Lab 4: Transformers](../lab-4/) + * [Lecture 5: ML Projects](../lecture-5/) + * [Lecture 6: MLOps Infrastructure & Tooling](../lecture-6/) + * [Lab 5: Experiment Management](../lab-5/) + * [ ] + + Lecture 7: Troubleshooting Deep Neural Networks + + [Lecture 7: Troubleshooting Deep Neural Networks](./) + + Table of contents + + [Video](#video) + + [Slides](#slides) + + [Notes](#notes) + + - [1 - Why Is Deep Learning Troubleshooting Hard?](#1-why-is-deep-learning-troubleshooting-hard) + - [2 - Strategy to Debug Neural Networks](#2-strategy-to-debug-neural-networks) + - [3 - Start Simple](#3-start-simple) + + * [Choose A Simple Architecture](#choose-a-simple-architecture) + * [Use Sensible Defaults](#use-sensible-defaults) + * [Normalize Inputs](#normalize-inputs) + * [Simplify The Problem](#simplify-the-problem) + - [4 - Implement and Debug](#4-implement-and-debug) + + * [Get Your Model To Run](#get-your-model-to-run) + * [Overfit A Single Batch](#overfit-a-single-batch) + * [Compare To A Known Result](#compare-to-a-known-result) + - [5 - Evaluate](#5-evaluate) + + * [Bias-Variance Decomposition](#bias-variance-decomposition) + * [Distribution Shift](#distribution-shift) + - [6 - Improve Model and Data](#6-improve-model-and-data) + + * [Step 1: Address Underfitting](#step-1-address-underfitting) + * [Step 2: Address Overfitting](#step-2-address-overfitting) + * [Step 3: Address Distribution Shift](#step-3-address-distribution-shift) + + + [Error Analysis](#error-analysis) + + [Domain Adaptation](#domain-adaptation) + * [Step 4: Rebalance datasets](#step-4-rebalance-datasets) + - [7 - Tune Hyperparameters](#7-tune-hyperparameters) + + * [Techniques for Tuning Hyperparameter Optimization](#techniques-for-tuning-hyperparameter-optimization) + - [8 - Conclusion](#8-conclusion) + * [Lecture 8: Data Management](../lecture-8/) + * [Lab 6: Data Labeling](../lab-6/) + * [Lecture 9: AI Ethics](../lecture-9/) + * [Lab 7: Paragraph Recognition](../lab-7/) + * [Lecture 10: Testing & Explainability](../lecture-10/) + * [Lab 8: Testing & CI](../lab-8/) + * [Lecture 11: Deployment & Monitoring](../lecture-11/) + * [Lab 9: Web Deployment](../lab-9/) + * [Lecture 12: Research Directions](../lecture-12/) + * [Lecture 13: ML Teams and Startups](../lecture-13/) + * [Panel Discussion: Do I need a PhD to work in ML?](../panel/) + - [FSDL 2021 (Berkeley)](https://bit.ly/berkeleyfsdl) + - [FSDL 2020 (UW)](https://bit.ly/uwfsdl) + - [FSDL 2019 (Online)](https://fall2019.fullstackdeeplearning.com) + - [FSDL 2019 (Bootcamp)](/march2019.html) + - [FSDL 2018 (Bootcamp)](/august2018.html) +* [Blog](../../blog/) +* [Cloud GPUs](../../cloud-gpus/) + +Table of contents + +* [Video](#video) +* [Slides](#slides) +* [Notes](#notes) + + + [1 - Why Is Deep Learning Troubleshooting Hard?](#1-why-is-deep-learning-troubleshooting-hard) + + [2 - Strategy to Debug Neural Networks](#2-strategy-to-debug-neural-networks) + + [3 - Start Simple](#3-start-simple) + + - [Choose A Simple Architecture](#choose-a-simple-architecture) + - [Use Sensible Defaults](#use-sensible-defaults) + - [Normalize Inputs](#normalize-inputs) + - [Simplify The Problem](#simplify-the-problem) + + [4 - Implement and Debug](#4-implement-and-debug) + + - [Get Your Model To Run](#get-your-model-to-run) + - [Overfit A Single Batch](#overfit-a-single-batch) + - [Compare To A Known Result](#compare-to-a-known-result) + + [5 - Evaluate](#5-evaluate) + + - [Bias-Variance Decomposition](#bias-variance-decomposition) + - [Distribution Shift](#distribution-shift) + + [6 - Improve Model and Data](#6-improve-model-and-data) + + - [Step 1: Address Underfitting](#step-1-address-underfitting) + - [Step 2: Address Overfitting](#step-2-address-overfitting) + - [Step 3: Address Distribution Shift](#step-3-address-distribution-shift) + + * [Error Analysis](#error-analysis) + * [Domain Adaptation](#domain-adaptation) + - [Step 4: Rebalance datasets](#step-4-rebalance-datasets) + + [7 - Tune Hyperparameters](#7-tune-hyperparameters) + + - [Techniques for Tuning Hyperparameter Optimization](#techniques-for-tuning-hyperparameter-optimization) + + [8 - Conclusion](#8-conclusion) + +# Lecture 7: Troubleshooting Deep Neural Networks + +## Video + +## Slides + +[Download slides as PDF](https://drive.google.com/file/d/1yXQCnGGp3wWdoCf6nSP5b758cXF92rtg/view?usp=sharing) + +## Notes + +*Lecture by [Josh Tobin](http://josh-tobin.com). +Notes transcribed by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).* + +In traditional software engineering, a bug usually leads to the program +crashing. While this is annoying for the user, it is critical for the +developer to inspect the errors to understand why. With deep learning, +we sometimes encounter errors, but all too often, the program crashes +without a clear reason why. While these issues can be debugged manually, +deep learning models most often fail because of poor output predictions. +What’s worse is that when the model performance is low, there is usually +no signal about why or when the models failed. + +A common sentiment among practitioners is that they spend **80–90% of +time debugging and tuning the models** and only 10–20% of time deriving +math equations and implementing things. This is confirmed by Andrej +Kaparthy, [as seen in this +tweet](https://twitter.com/karpathy/status/423990618289733632). + +### 1 - Why Is Deep Learning Troubleshooting Hard? + +Suppose you are trying to reproduce a research paper result for your +work, but your results are worse. You might wonder why your model’s +performance is significantly worse than the paper that you’re trying to +reproduce? + +![](/spring2021/lecture-7-notes-media/image3.png) + +Many different things can cause this: + +* It can be **implementation bugs**. Most bugs in deep learning are + actually invisible. +* **Hyper-parameter choices** can also cause your performance to + degrade. Deep learning models are very sensitive to + hyper-parameters. Even very subtle choices of learning rate and + weight initialization can make a big difference. +* Performance can also be worse just because of **data/model fit**. + For example, you pre-train your model on ImageNet data and fit it + on self-driving car images, which are harder to learn. +* Finally, poor model performance could be caused not by your model + but your **dataset construction**. Typical issues here include not + having enough examples, dealing with noisy labels and imbalanced + classes, splitting train and test set with different + distributions. + +### 2 - Strategy to Debug Neural Networks + +The key idea of deep learning troubleshooting is: *Since it is hard to +disambiguate errors, it’s best to start simple and gradually ramp up +complexity.* + +This lecture provides **a decision tree for debugging deep learning +models and improving performance**. This guide assumes that you already +have an initial test dataset, a single metric to improve, and target +performance based on human-level performance, published results, +previous baselines, etc. + +![](/spring2021/lecture-7-notes-media/image4.png) + +### 3 - Start Simple + +The first step is the troubleshooting workflow is **starting simple**. + +#### Choose A Simple Architecture + +There are a few things to consider when you want to start simple. The +first is how to **choose a simple architecture**. These are +architectures that are easy to implement and are likely to get you part +of the way towards solving your problem without introducing as many +bugs. + +Architecture selection is one of the many intimidating parts of getting +into deep learning because there are tons of papers coming out +all-the-time and claiming to be state-of-the-art on some problems. They +get very complicated fast. In the limit, if you’re trying to get to +maximal performance, then architecture selection is challenging. But +when starting on a new problem, you can just solve a simple set of rules +that will allow you to pick an architecture that enables you to do a +decent job on the problem you’re working on. + +* If your data looks like **images**, start with a LeNet-like + architecture and consider using something like ResNet as your + codebase gets more mature. +* If your data looks like **sequences**, start with an LSTM with one + hidden layer and/or temporal/classical convolutions. Then, when + your problem gets more mature, you can move to an Attention-based + model or a WaveNet-like model. +* For **all other tasks**, start with a fully-connected neural network + with one hidden layer and use more advanced networks later, + depending on the problem. + +![](/spring2021/lecture-7-notes-media/image7.png) + +In reality, many times, the input data contains multiple of those things +above. So how to deal with **multiple input modalities** into a neural +network? Here is the 3-step strategy that we recommend: + +* First, map each of these modalities into a lower-dimensional feature + space. In the example above, the images are passed through a + ConvNet, and the words are passed through an LSTM. +* Then we flatten the outputs of those networks to get a single vector + for each of the inputs that will go into the model. Then we + concatenate those inputs. +* Finally, we pass them through some fully-connected layers to an + output. + +#### Use Sensible Defaults + +After choosing a simple architecture, the next thing to do is to +**select sensible hyper-parameter defaults** to start with. Here are the +defaults that we recommend: + +* [Adam optimizer with a “magic” learning rate value of + 3e-4](https://twitter.com/karpathy/status/801621764144971776?lang=en). +* [ReLU](https://stats.stackexchange.com/questions/226923/why-do-we-use-relu-in-neural-networks-and-how-do-we-use-it) + activation for fully-connected and convolutional models and + [Tanh](https://stats.stackexchange.com/questions/330559/why-is-tanh-almost-always-better-than-sigmoid-as-an-activation-function) + activation for LSTM models. +* [He initialization for ReLU activation function and Glorot + initialization for Tanh activation + function](https://datascience.stackexchange.com/questions/13061/when-to-use-he-or-glorot-normal-initialization-over-uniform-init-and-what-are). +* No regularization and data normalization. + +#### Normalize Inputs + +The next step is to **normalize the input data**, subtracting the mean +and dividing by the variance. Note that for images, it’s fine to scale +values to [0, 1] or [-0.5, 0.5] (for example, by dividing by 255). + +#### Simplify The Problem + +The final thing you should do is consider **simplifying the problem** +itself. If you have a complicated problem with massive data and tons of +classes to deal with, then you should consider: + +* Working with a small training set around 10,000 examples. +* Using a fixed number of objects, classes, input size, etc. +* Creating a simpler synthetic training set like in research labs. + +This is important because (1) you will have reasonable confidence that +your model should be able to solve, and (2) your iteration speed will +increase. + +The diagram below neatly summarizes how to start simple: + +![](/spring2021/lecture-7-notes-media/image6.png) + +### 4 - Implement and Debug + +To give you a preview, below are the five most common bugs in deep +learning models that we recognize: + +* **Incorrect shapes for the network tensors**: This bug is a common + one and can fail silently. This happens many times because the + automatic differentiation systems in the deep learning framework + do silent broadcasting. Tensors become different shapes in the + network and can cause a lot of problems. +* **Pre-processing inputs incorrectly**: For example, you forget to + normalize your inputs or apply too much input pre-processing + (over-normalization and excessive data augmentation). +* **Incorrect input to the model’s loss function**: For example, you + use softmax outputs to a loss that expects logits. +* **Forgot to set up train mode for the network correctly**: For + example, toggling train/evaluation mode or controlling batch norm + dependencies. +* **Numerical instability**: For example, you get `inf` or `NaN` + as outputs. This bug often stems from using an exponent, a log, or + a division operation somewhere in the code. + +Here are three pieces of general advice for implementing your model: + +* **Start with a lightweight implementation**. You want minimum + possible new lines of code for the 1st version of your model. The + rule of thumb is less than 200 lines. This doesn’t count tested + infrastructure components or TensorFlow/PyTorch code. +* **Use off-the-shelf components** such as Keras if possible, since + most of the stuff in Keras works well out-of-the-box. If you have + to use TensorFlow, use the built-in functions, don’t do the math + yourself. This would help you avoid a lot of numerical instability + issues. +* **Build complicated data pipelines later**. These are important for + large-scale ML systems, but you should not start with them because + data pipelines themselves can be a big source of bugs. Just start + with a dataset that you can load into memory. + +![](/spring2021/lecture-7-notes-media/image11.png) + +#### Get Your Model To Run + +The first step of implementing bug-free deep learning models is +**getting your model to run at all**. There are a few things that can +prevent this from happening: + +* **Shape mismatch/casting issue**: To address this type of problem, + you should step through your model creation and inference + step-by-step in a debugger, checking for correct shapes and data + types of your tensors. +* **Out-of-memory issues**: This can be very difficult to debug. You + can scale back your memory-intensive operations one-by-one. For + example, if you create large matrices anywhere in your code, you + can reduce the size of their dimensions or cut your batch size in + half. +* **Other issues**: You can simply Google it. Stack Overflow would be + great most of the time. + +Let’s zoom in on the process of stepping through model creation in a +debugger and talk about **debuggers for deep learning code**: + +* In PyTorch, you can use + [ipdb](https://pypi.org/project/ipdb/) — which exports + functions to access the interactive + [IPython](http://ipython.org/) debugger. +* In TensorFlow, it’s trickier. TensorFlow separates the process of + creating the graph and executing operations in the graph. There + are three options you can try: (1) step through the graph creation + itself and inspect each tensor layer, (2) step into the training + loop and evaluate the tensor layers, or (3) use [TensorFlow + Debugger](https://mullikine.github.io/posts/tensorflow-debugger-tfdb-and-emacs/) + (tfdb), which does option 1 and 2 automatically. + +![](/spring2021/lecture-7-notes-media/image14.png) + +#### Overfit A Single Batch + +After getting your model to run, the next thing you need to do is to +**overfit a single batch of data**. This is a heuristic that can catch +an absurd number of bugs. This really means that you want to drive your +training error arbitrarily close to 0. + +There are a few things that can happen when you try to overfit a single +batch and it fails: + +* **Error goes up**: Commonly, this is due to a flip sign somewhere in + the loss function/gradient. +* **Error explodes**: This is usually a numerical issue but can also + be caused by a high learning rate. +* **Error oscillates**: You can lower the learning rate and inspect + the data for shuffled labels or incorrect data augmentation. +* **Error plateaus**: You can increase the learning rate and get rid + of regulation. Then you can inspect the loss function and the data + pipeline for correctness. + +![](/spring2021/lecture-7-notes-media/image10.png) + +#### Compare To A Known Result + +Once your model overfits in a single batch, there can still be some +other issues that cause bugs. The last step here is to **compare your +results to a known result**. So what sort of known results are useful? + +* The most useful results come from **an official model implementation + evaluated on a similar dataset to yours**. You can step through + the code in both models line-by-line and ensure your model has the + same output. You want to ensure that your model performance is up + to par with expectations. +* If you can’t find an official implementation on a similar dataset, + you can compare your approach to results from **an official model + implementation evaluated on a benchmark dataset**. You most + definitely want to walk through the code line-by-line and ensure + you have the same output. +* If there is no official implementation of your approach, you can + compare it to results from **an unofficial model implementation**. + You can review the code the same as before but with lower + confidence (because almost all the unofficial implementations on + GitHub have bugs). +* Then, you can compare to results from **a paper with no code** (to + ensure that your performance is up to par with expectations), + results from **your model on a benchmark dataset** (to make sure + your model performs well in a simpler setting), and results from + **a similar model on a similar dataset** (to help you get a + general sense of what kind of performance can be expected). +* An under-rated source of results comes from **simple baselines** + (for example, the average of outputs or linear regression), which + can help make sure that your model is learning anything at all. + +The diagram below neatly summarizes how to implement and debug deep +neural networks: + +![](/spring2021/lecture-7-notes-media/image8.png) + +### 5 - Evaluate + +#### Bias-Variance Decomposition + +To evaluate models and prioritize the next steps in model development, +we will apply the bias-variance decomposition. The [bias-variance +decomposition](http://scott.fortmann-roe.com/docs/BiasVariance.html) +is the fundamental model fitting tradeoff. In our application, let’s +talk more specifically about the formula for bias-variance tradeoff with +respect to the **test error;** this will help us apply the concept more +directly to our model’s performance. There are four terms in the formula +for test error: + +*Test error = irreducible error + bias + variance + validation +overfitting* + +1. **Irreducible error** is the baseline error you don’t expect your + model to do better. It can be estimated through strong baselines, + like human performance. +2. **Avoidable bias**, a measure of underfitting, is the difference + between our train error and irreducible error. +3. **Variance**, a measure of overfitting, is the difference between + validation error and training error. +4. **Validation set overfitting** is the difference between test error + and validation error. + +Consider the chart of learning curves and errors below. Using the test +error formula for bias and variance, we can calculate each component of +test error and make decisions based on the value. For example, our +avoidable bias is rather low (only 2 points), while the variance is much +higher (5 points). With this knowledge, we should prioritize methods of +preventing overfitting, like regularization. + +![](/spring2021/lecture-7-notes-media/image12.png) + +#### Distribution Shift + +Clearly, the application of the bias-variance decomposition to the test +error has already helped prioritize our next steps for model +development. However, until now, we’ve assumed that the samples +(training, validation, testing) all come from the same distribution. +What if this isn’t the case? In practical ML situations, this +**distribution shift** often cars. In building self-driving cars, a +frequent occurrence might be training with samples from one distribution +(e.g., daytime driving video) but testing or inferring on samples from a +totally different distribution (e.g., night time driving). + +A simple way of handling this wrinkle in our assumption is to create two +validation sets: one from the training distribution and one from the +test distribution. This can be helpful even with a very small testing +set. If we apply this, we can actually estimate our distribution shift, +which is the difference between testing validation error and testing +error. This is really useful for practical applications of ML! With this +new term, let’s update our test error formula of bias and variance: + +*Test error = irreducible error + bias + variance + distribution shift + +validation overfitting* + +### 6 - Improve Model and Data + +Using the updated formula from the last section, we’ll be able to decide +on and prioritize the right next steps for each iteration of a model. In +particular, we’ll follow a specific process (shown below). + +![](/spring2021/lecture-7-notes-media/image1.png) + +#### Step 1: Address Underfitting + +We’ll start by addressing underfitting (i.e., reducing bias). The first +thing to try in this case is to make your model bigger (e.g., add +layers, more units per layer). Next, consider regularization, which can +prevent a tight fit to your data. Other options are error analysis, +choosing a different model architecture (e.g., something more state of +the art), tuning hyperparameters, or adding features. Some notes: + +* Choosing different architectures, especially a SOTA one, can be very + helpful but is also risky. Bugs are easily introduced in the + implementation process. +* Adding features is uncommon in the deep learning paradigm (vs. + traditional machine learning). We usually want the network to + learn features of its own accord. If all else fails, it can be + beneficial in a practical setting. + +![](/spring2021/lecture-7-notes-media/image13.png) + +#### Step 2: Address Overfitting + +After addressing underfitting, move on to solving overfitting. +Similarly, there’s a recommended series of methods to try in order. +Starting with collecting training data (if possible) is the soundest way +to address overfitting, though it can be challenging in certain +applications. Next, tactical improvements like normalization, data +augmentation, and regularization can help. Following these steps, +traditional defaults like tuning hyperparameters, choosing a different +architecture, or error analysis are useful. Finally, if overfitting is +rather intractable, there’s a series of less recommended steps, such as +early stopping, removing features, and reducing model size. Early +stopping is a personal choice; the fast.ai community is a strong +proponent. + +![](/spring2021/lecture-7-notes-media/image15.png) + +#### Step 3: Address Distribution Shift + +After addressing underfitting and overfitting, If there’s a difference +between the error on our training validation set vs. our test validation +set, we need to address the error caused by the distribution shift. This +is a harder problem to solve, so there’s less in our toolkit to apply. + +Start by looking manually at the errors in the test-validation set. +Compare the potential logic behind these errors to the performance in +the train-validation set, and use the errors to guide further data +collection. Essentially, reason about why your model may be suffering +from distribution shift error. This is the most principled way to deal +with distribution shift, though it’s the most challenging way +practically. If collecting more data to address these errors isn’t +possible, try synthesizing data. Additionally, you can try [domain +adaptation](https://ece.engin.umich.edu/wp-content/uploads/2019/09/4142.pdf). + +![](/spring2021/lecture-7-notes-media/image9.png) + +##### Error Analysis + +Manually evaluating errors to understand model performance is generally +a high-yield way of figuring out how to improve the model. +Systematically performing this **error analysis** process and +decomposing the error from different error types can help prioritize +model improvements. For example, in a self-driving car use case with +error types like hard-to-see pedestrians, reflections, and nighttime +scenes, decomposing the error contribution of each and where it occurs +(train-val vs. test-val) can give rise to a clear set of prioritized +action items. See the table for an example of how this error analysis +can be effectively structured. + +![](/spring2021/lecture-7-notes-media/image5.png) + +##### Domain Adaptation + +Domain adaptation is a class of techniques that train on a “source” +distribution and generalize to another “target” using only unlabeled +data or limited labeled data. You should use domain adaptation when +access to labeled data from the test distribution is limited, but access +to relatively similar data is plentiful. + +There are a few different types of domain adaptation: + +1. **Supervised domain adaptation**: In this case, we have limited data + from the target domain to adapt to. Some example applications of + the concept include fine-tuning a pre-trained model or adding + target data to a training set. +2. **Unsupervised domain adaptation**: In this case, we have lots of + unlabeled data from the target domain. Some techniques you might + see are CORAL, domain confusion, and CycleGAN. + +Practically speaking, supervised domain adaptation can work really well! +Unsupervised domain adaptation has a little bit further to go. + +#### Step 4: Rebalance datasets + +If the test-validation set performance starts to look considerably +better than the test performance, you may have overfit the validation +set. This commonly occurs with small validation sets or lots of +hyperparameter training. If this occurs, resample the validation set +from the test distribution and get a fresh estimate of the performance. + +### 7 - Tune Hyperparameters + +One of the core challenges in hyperparameter optimization is very basic: +**which hyperparameters should you tune?** As we consider this +fundamental question, let’s keep the following in mind: + +* Models are more sensitive to some hyperparameters than others. This + means we should focus our efforts on the more impactful + hyperparameters. +* However, which hyperparameters are most important depends heavily on + our choice of model. +* Certain rules of thumbs can help guide our initial thinking. +* Sensitivity is always relative to default values; if you use good + defaults, you might start in a good place! + +See the following table for a ranked list of hyperparameters and their +impact on the model: + +![](/spring2021/lecture-7-notes-media/image2.png) + +#### Techniques for Tuning Hyperparameter Optimization + +Now that we know which hyperparameters make the most sense to tune +(using rules of thumb), let’s consider the various methods of actually +tuning them: + +1. **Manual Hyperparameter Optimization**. Colloquially referred to as + Graduate Student Descent, this method works by taking a manual, + detailed look at your algorithm, building intuition, and + considering which hyperparameters would make the most difference. + After figuring out these parameters, you train, evaluate, and + guess a better hyperparameter value using your intuition for the + algorithm and intelligence. While it may seem archaic, this method + combines well with other methods (e.g., setting a range of values + for hyperparameters) and has the main benefit of reducing + computation time and cost if used skillfully. It can be + time-consuming and challenging, but it can be a good starting + point. +2. **Grid Search**. Imagine each of your parameters plotted against + each other on a grid, from which you uniformly sample values to + test. For each point, you run a training run and evaluate + performance. The advantages are that it’s very simple and can + often produce good results. However, it’s quite inefficient, as + you must run every combination of hyperparameters. It also often + requires prior knowledge about the hyperparameters since we must + manually set the range of values. +3. **Random Search**: This method is recommended over grid search. + Rather than sampling from the grid of values for the + hyperparameter evenly, we’ll choose n points sampled randomly + across the grid. Empirically, this method produces better results + than grid search. However, the results can be somewhat + uninterpretable, with unexpected values in certain hyperparameters + returned. +4. **Coarse-to-fine Search**: Rather than running entirely random runs, + we can gradually narrow in on the best hyperparameters through + this method. Initially, start by defining a very large range to + run a randomized search on. Within the pool of results, you can + find N best results and hone in on the hyperparameter values used + to generate those samples. As you iteratively perform this method, + you can get excellent performance. This doesn’t remove the manual + component, as you have to select which range to continuously + narrow your search to, but it’s perhaps the most popular method + available. +5. **Bayesian Hyperparameter Optimization**: This is a reasonably + sophisticated method, which you can read more about + [here](http://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/slides/lec21.pdf) + and + [here](https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f). + At a high level, start with a prior estimate of parameter + distributions. Subsequently, maintain a probabilistic model of the + relationship between hyperparameter values and model performance. + As you maintain this model, you toggle between training with + hyperparameter values that maximize the expected improvement (per + the model) and use training results to update the initial + probabilistic model and its expectations. This is a great, + hands-off, efficient method to choose hyperparameters. However, + these techniques can be quite challenging to implement from + scratch. As libraries and infrastructure mature, the integration + of these methods into training will become easier. + +In summary, you should probably start with coarse-to-fine random +searches and move to Bayesian methods as your codebase matures and +you’re more certain of your model. + +### 8 - Conclusion + +To wrap up this lecture, deep learning troubleshooting and debugging is +really hard. It’s difficult to tell if you have a bug because there are +many possible sources for the same degradation in performance. +Furthermore, the results can be sensitive to small changes in +hyper-parameters and dataset makeup. + +To train bug-free deep learning models, we need to treat building them +as an iterative process. If you skipped to the end, the following steps +can make this process easier and catch errors as early as possible: + +* **Start Simple**: Choose the simplest model and data possible. +* **Implement and Debug**: Once the model runs, overfit a single batch + and reproduce a known result. +* **Evaluate**: Apply the bias-variance decomposition to decide what + to do next. +* **Tune Hyper-parameters**: Use coarse-to-fine random searches to + tune the model’s hyper-parameters. +* **Improve Model and Data**: Make your model bigger if your model + under-fits and add more data and/or regularization if your model + over-fits. + +Here are additional resources that you can go to learn more: + +* Andrew Ng’s “[Machine Learning + Yearning](https://www.deeplearning.ai/machine-learning-yearning/)” + book. +* This [Twitter + thread](https://twitter.com/karpathy/status/1013244313327681536) + from Andrej Karpathy. +* BYU’s “[Practical Advice for Building Deep Neural + Networks](https://pcc.cs.byu.edu/2017/10/02/practical-advice-for-building-deep-neural-networks/)” + blog post. + +## We are excited to share this course with you for **free**. + +We have more upcoming great content. +Subscribe to stay up to date as we release it. + +We take your privacy and attention very seriously and will never spam you. +I am already a subscriber + +The Full Stack, 2023 + +Made with +[Material for MkDocs](https://squidfunk.github.io/mkdocs-material/) diff --git a/docs/evidence/henderson_2018_deep_rl_matters.md b/docs/evidence/henderson_2018_deep_rl_matters.md new file mode 100644 index 0000000..825b4a9 --- /dev/null +++ b/docs/evidence/henderson_2018_deep_rl_matters.md @@ -0,0 +1,1906 @@ +Source: https://arxiv.org/abs/1709.06560 +Title: Deep Reinforcement Learning that Matters - Henderson et al. (2018) +Fetched-via: curl https://r.jina.ai/https://arxiv.org/pdf/1709.06560 +Fetch-status: verbatim + +Title: Deep Reinforcement Learning that Matters + +URL Source: https://arxiv.org/pdf/1709.06560 + +Published Time: Sun, 22 Jan 2023 19:19:13 GMT + +Number of Pages: 26 + +Markdown Content: +# Deep Reinforcement Learning that Matters + +## Peter Henderson 1∗, Riashat Islam 1,2 ∗, Philip Bachman 2 + +## Joelle Pineau 1, Doina Precup 1, David Meger 1 + +> 1 + +McGill University, Montreal, Canada + +> 2 + +Microsoft Maluuba, Montreal, Canada + +{peter.henderson,riashat.islam}@mail.mcgill.ca , phbachma@microsoft.com {jpineau,dprecup}@cs.mcgill.ca , dmeger@cim.mcgill.ca + +Abstract + +In recent years, significant progress has been made in solving challenging problems across various domains using deep re-inforcement learning (RL). Reproducing existing work and accurately judging the improvements offered by novel meth-ods is vital to sustaining this progress. Unfortunately, repro-ducing results for state-of-the-art deep RL methods is seldom straightforward. In particular, non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results tough to interpret. Without significance metrics and tighter standardization of experimental reporting, it is difficult to determine whether im-provements over the prior state-of-the-art are meaningful. In this paper, we investigate challenges posed by reproducibility, proper experimental techniques, and reporting procedures. We illustrate the variability in reported metrics and results when comparing against common baselines and suggest guidelines to make future results in deep RL more reproducible. We aim to spur discussion about how to ensure continued progress in the field by minimizing wasted effort stemming from results that are non-reproducible and easily misinterpreted. + +## Introduction + +Reinforcement learning (RL) is the study of how an agent can interact with its environment to learn a policy which maximizes expected cumulative rewards for a task. Recently, RL has experienced dramatic growth in attention and interest due to promising results in areas like: controlling continuous systems in robotics (Lillicrap et al . 2015a), playing Go (Silver et al . 2016), Atari (Mnih et al . 2013), and competitive video games (Vinyals et al . 2017; Silva and Chaimowicz 2017). Figure 1 illustrates growth of the field through the number of publications per year. To maintain rapid progress in RL research, it is important that existing works can be easily reproduced and compared to accurately judge improvements offered by novel methods. However, reproducing deep RL results is seldom straight-forward, and the literature reports a wide range of results for the same baseline algorithms (Islam et al . 2017). Re-producibility can be affected by extrinsic factors (e.g. hy-perparameters or codebases) and intrinsic factors (e.g. ef- + +> ∗ + +These two authors contributed equally Copyright c© 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. + +Figure 1: Growth of published reinforcement learning papers. Shown are the number of RL-related publications (y-axis) per year (x-axis) scraped from Google Scholar searches. fects of random seeds or environment properties). We inves-tigate these sources of variance in reported results through a representative set of experiments. For clarity, we focus our investigation on policy gradient (PG) methods in con-tinuous control. Policy gradient methods with neural net-work function approximators have been particularly suc-cessful in continuous control (Schulman et al . 2015a; 2017; Lillicrap et al . 2015b) and are competitive with value-based methods in discrete settings. We note that the diversity of metrics and lack of significance testing in the RL literature creates the potential for misleading reporting of results. We demonstrate possible benefits of significance testing using techniques common in machine learning and statistics. Several works touch upon evaluating RL algorithms. Duan et al . (2016) benchmark several RL algorithms and provide the community with baseline implementations. Generaliz-able RL evaluation metrics are proposed in (Whiteson et al .2011). Machado et al . (2017) revisit the Arcade Learning Environment to propose better evaluation methods in these benchmarks. However, while the question of reproducibility and good experimental practice has been examined in related fields (Wagstaff 2012; Boulesteix, Lauer, and Eugster 2013; Stodden, Leisch, and Peng 2014; Bouckaert and Frank 2004; Bouckaert 2004; Vaughan and Wawerla 2012), to the best of our knowledge this is the first work to address this important question in the context of deep RL. In each section of our experimental analysis, we pose ques-tions regarding key factors affecting reproducibility. We find that there are numerous sources of non-determinism when reproducing and comparing RL algorithms. To this end, we show that fine details of experimental procedure can be crit- + +> arXiv:1709.06560v3 [cs.LG] 30 Jan 2019 + +ical. Based on our experiments, we conclude with possible recommendations, lines of investigation, and points of dis-cussion for future works to ensure that deep reinforcement learning is reproducible and continues to matter. + +## Technical Background + +This work focuses on several model-free policy gradient algorithms with publicly available implementations which appear frequently in the literature as baselines for compar-ison against novel methods. We experiment with Trust Re-gion Policy Optimization (TRPO) (Schulman et al . 2015a), Deep Deterministic Policy Gradients (DDPG) (Lillicrap et al . 2015b), Proximal Policy Optimization (PPO) (Schulman et al . 2017), and Actor Critic using Kronecker-Factored Trust Region (ACKTR) (Wu et al . 2017). These methods have shown promising results in continuous control MuJoCo domain tasks (Todorov, Erez, and Tassa 2012) from Ope-nAI Gym (Brockman et al . 2016). Generally, they optimize + +ρ(θ, s 0) = Eπθ [∑∞ + +> t=0 + +γtr(st)|s0], using the policy gradient theorem: δρ (θ,s 0) + +> δθ + += ∑ + +> s + +μπθ (s|s0) ∑ + +> aδπ θ(a|s) +> δθ + +Qπθ (s, a ).Here, μπθ (s|s0) = ∑∞ + +> t=0 + +γtP (st = s|s0) (Sutton et al .2000). TRPO (Schulman et al . 2015a) and PPO (Schulman et al . 2017) use constraints and advantage estimation to per-form this update, reformulating the optimization problem as: max θ Et + +[ πθ (at|st) + +> πθold (at|st) + +At(st, a t) + +] + +. Here, At is the general-ized advantage function (Schulman et al . 2015b). TRPO uses conjugate gradient descent as the optimization method with a KL constraint: Et [KL [πθold (·| st), π θ (·| st)]] ≤ δ. PPO re-formulates the constraint as a penalty (or clipping objective). DDPG and ACKTR use actor-critic methods which estimate + +Q(s, a ) and optimize a policy that maximizes the Q-function based on Monte-Carlo rollouts. DDPG does this using deter-ministic policies, while ACKTR uses Kronecketer-factored trust regions to ensure stability with stochastic policies. + +## Experimental Analysis + +We pose several questions about the factors affecting repro-ducibility of state-of-the-art RL methods. We perform a set of experiments designed to provide insight into the questions posed. In particular, we investigate the effects of: specific hyperparameters on algorithm performance if not properly tuned; random seeds and the number of averaged experi-ment trials; specific environment characteristics; differences in algorithm performance due to stochastic environments; differences due to codebases with most other factors held constant. For most of our experiments 1, except for those com-paring codebases, we generally use the OpenAI Baselines 2 + +implementations of the following algorithms: ACKTR (Wu et al . 2017), PPO (Schulman et al . 2017), DDPG (Plappert et al . 2017), TRPO (Schulman et al . 2017). We use the Hopper-v1 and HalfCheetah-v1 MuJoCo (Todorov, Erez, and Tassa 2012) environments from OpenAI Gym (Brockman et al .2016). These two environments provide contrasting dynam-ics (the former being more unstable). + +> 1Specific details can be found in the supplemental and code can be found at: https://git.io/vFHnf +> 2https://www.github.com/openai/baselines + +To ensure fairness we run five experiment trials for each evaluation, each with a different preset random seed (all experiments use the same set of random seeds). In all cases, we highlight important results here, with full descriptions of experimental setups and additional learning curves included in the supplemental material. Unless otherwise mentioned, we use default settings whenever possible, while modifying only the hyperparameters of interest. All results (including graphs) show mean and standard error across random seeds. We use multilayer perceptron function approximators in all cases. We denote the hidden layer sizes and activations as (N, M, activation ). For default settings, we vary the hy-perparameters under investigation one at a time. For DDPG we use a network structure of (64 , 64 , ReLU ) for both actor and critic. For TRPO and PPO, we use (64 , 64 , tanh ) for the policy. For ACKTR, we use (64 , 64 , tanh ) for the actor and + +(64 , 64 , ELU ) for the critic. + +Hyperparameters + +What is the magnitude of the effect hyperparameter settings can have on baseline performance? + +Tuned hyperparameters play a large role in eliciting the best results from many algorithms. However, the choice of op-timal hyperparameter configuration is often not consistent in related literature, and the range of values considered is often not reported 3. Furthermore, poor hyperparameter selec-tion can be detrimental to a fair comparison against baseline algorithms. Here, we investigate several aspects of hyperpa-rameter selection on performance. + +Network Architecture + +How does the choice of network architecture for the policy and value function approximation affect performance? + +In (Islam et al . 2017), it is shown that policy network architec-ture can significantly impact results in both TRPO and DDPG. Furthermore, certain activation functions such as Rectified Linear Unit (ReLU) have been shown to cause worsened learning performance due to the “dying relu” problem (Xu et al . 2015). As such, we examine network architecture and ac-tivation functions for both policy and value function approxi-mators. In the literature, similar lines of investigation have shown the differences in performance when comparing linear approximators, RBFs, and neural networks (Rajeswaran et al . 2017). Tables 1 and 2 summarize the final evaluation per-formance of all architectural variations after training on 2M samples (i.e. 2M timesteps in the environment). All learning curves and details on setup can be found in the supplemental material. We vary hyperparameters one at a time, while using a default setting for all others. We investigate three multilayer perceptron (MLP) architectures commonly seen in the liter-ature: (64 , 64) , (100 , 50 , 25) , and (400 , 300) . Furthermore, we vary the activation functions of both the value and policy networks across tanh, ReLU, and Leaky ReLU activations. + +Results Figure 2 shows how significantly performance can be affected by simple changes to the policy or value network + +> 3A sampled literature review can be found in the supplemental. 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 +> Timesteps ×10 6 +> −2000 +> −1000 +> 0 +> 1000 +> 2000 +> Average Return +> HalfCheetah-v1 (PPO, Policy Network Structure) +> (64,64) +> (100,50,25) +> (400,300) 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 +> Timesteps ×10 6 +> −750 +> −500 +> −250 +> 0 +> 250 +> 500 +> 750 +> 1000 +> Average Return +> HalfCheetah-v1 (TRPO, Policy Network Activation) +> tanh +> relu +> leaky relu + +Figure 2: Significance of Policy Network Structure and Activation Functions PPO (left), TRPO (middle) and DDPG (right). 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +> Timesteps ×10 6 +> 0 +> 1000 +> 2000 +> 3000 +> 4000 +> 5000 +> Average Return +> HalfCheetah-v1 (DDPG, Reward Scale, Layer Norm) +> rs=1e-4 +> rs=1e-3 +> rs=1e-2 +> rs=1e-1 +> rs=1 +> rs=10 +> rs=100 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 +> Timesteps ×10 6 +> 0 +> 1000 +> 2000 +> 3000 +> Average Return +> HalfCheetah-v1 (DDPG, Reward Scale, No Layer Norm) +> rs=1e-4 +> rs=1e-3 +> rs=1e-2 +> rs=1e-1 +> rs=1 +> rs=10 +> rs=100 + +Figure 3: DDPG reward rescaling on HalfCheetah-v1, with and without layer norm. activations. We find that usually ReLU or Leaky ReLU acti-vations perform the best across environments and algorithms. The effects are not consistent across algorithms or environ-ments. This inconsistency demonstrates how interconnected network architecture is to algorithm methodology. For exam-ple, using a large network with PPO may require tweaking other hyperparameters such as the trust region clipping or learning rate to compensate for the architectural change 4.This intricate interplay of hyperparameters is one of the rea-sons reproducing current policy gradient methods is so dif-ficult. It is exceedingly important to choose an appropriate architecture for proper baseline results. This also suggests a possible need for hyperparameter agnostic algorithms—that is algorithms that incorporate hyperparameter adaptation as part of the design—such that fair comparisons can be made without concern about improper settings for the task at hand. + +Reward Scale + +How can the reward scale affect results? Why is reward rescaling used? + +Reward rescaling has been used in several recent works (Duan et al . 2016; Gu et al . 2016) to improve results for DDPG. This involves simply multiplying the rewards gen-erated from an environment by some scalar ( ˆr = rˆσ) for training. Often, these works report using a reward scale of ˆσ = 0 .1. In Atari domains, this is akin to clipping the rewards to [0 , 1] . By intuition, in gradient based methods (as used in most deep RL) a large and sparse output scale can result in problems regarding saturation and inefficiency in learning (LeCun et al . 2012; Glorot and Bengio 2010; Vincent, de Br ´ebisson, and Bouthillier 2015). Therefore clip-ping or rescaling rewards compresses the space of estimated + +> 4 + +We find that the KL divergence of updates with the large net-work (400 , 300) seen in Figure 2 is on average 33 .52 times higher than the KL divergence of updates with the (64 , 64) network. + +expected returns in action value function based methods such as DDPG. We run a set of experiments using reward rescaling in DDPG (with and without layer normalization) for insights into how this aspect affects performance. + +Results Our analysis shows that reward rescaling can have a large effect (full experiment results can be found in the supplemental material), but results were inconsistent across environments and scaling values. Figure 3 shows one such ex-ample where reward rescaling affects results, causing a failure to learn in small settings below ˆσ = 0 .01 . In particular, layer normalization changes how the rescaling factor affects results, suggesting that these impacts are due to the use of deep net-works and gradient-based methods. With the value function approximator tracking a moving target distribution, this can potentially affect learning in unstable environments where a deep Q-value function approximator is used. Furthermore, some environments may have untuned reward scales (e.g. the HumanoidStandup-v1 of OpenAI gym which can reach rewards in the scale of millions). Therefore, we suggest that this hyperparameter has the potential to have a large impact if considered properly. Rather than rescaling rewards in some environments, a more principled approach should be taken to address this. An initial foray into this problem is made in (van Hasselt et al . 2016), where the authors adaptively rescale reward targets with normalized stochastic gradient, but further research is needed. + +Random Seeds and Trials + +Can random seeds drastically alter performance? Can one distort results by averaging an improper number of trials? + +A major concern with deep RL is the variance in results due to environment stochasticity or stochasticity in the learning process (e.g. random weight initialization). As such, even averaging several learning results together across totally dif-ferent random seeds can lead to the reporting of misleading results. We highlight this in the form of an experiment. Algorithm Environment 400,300 64,64 100,50,25 tanh ReLU LeakyReLU + +TRPO Hopper-v1 2980 ± 35 2674 ± 227 3110 ± 78 2674 ± 227 2772 ± 211 -(Schulman et al. 2015a) HalfCheetah-v1 1791 ± 224 1939 ± 140 2151 ± 27 1939 ± 140 3041 ± 161 -TRPO Hopper-v1 1243 ± 55 1303 ± 89 1243 ± 55 1303 ± 89 1131 ± 65 1341 ± 127 (Duan et al. 2016) HalfCheetah-v1 738 ± 240 834 ± 317 850 ±378 834 ± 317 784 ± 352 1139 ±364 TRPO Hopper-v1 2909 ± 87 2828 ± 70 2812 ± 88 2828 ± 70 2941 ± 91 2865 ± 189 (Schulman et al. 2017) HalfCheetah-v1 -155 ± 188 205 ± 256 306 ± 261 205 ± 256 1045 ± 114 778 ± 177 PPO Hopper-v1 61 ± 33 2790 ± 62 2592 ± 196 2790 ± 62 2695 ± 86 2587 ± 53 (Schulman et al. 2017) HalfCheetah-v1 -1180 ± 444 2201 ± 323 1314 ± 340 2201 ± 323 2971 ± 364 2895 ± 365 DDPG Hopper-v1 1419 ± 313 1632 ± 459 2142 ± 436 1491 ± 205 1632 ± 459 1384 ± 285 (Plappert et al. 2017) HalfCheetah-v1 5579 ± 354 4198 ± 606 5600 ± 601 5325 ± 281 4198 ± 606 4094 ± 233 DDPG Hopper-v1 600 ± 126 593 ± 155 501 ± 129 436 ± 48 593 ± 155 319 ± 127 (Gu et al. 2016) HalfCheetah-v1 2845 ± 589 2771 ± 535 1638 ± 624 1638 ± 624 2771 ± 535 1405 ± 511 DDPG Hopper-v1 506 ± 208 749 ± 271 629 ± 138 354 ± 91 749 ± 271 -(Duan et al. 2016) HalfCheetah-v1 850 ± 41 1573 ± 385 1224 ± 553 1311 ± 271 1573 ± 385 -ACKTR Hopper-v1 2577 ± 529 1608 ± 66 2287 ± 946 1608 ± 66 2835 ± 503 2718 ± 434 (Wu et al. 2017) HalfCheetah-v1 2653 ± 408 2691 ± 231 2498 ± 112 2621 ± 381 2160 ± 151 2691 ± 231 + +Table 1: Results for our policy architecture permutations across various implementations and algorithms. Final average ± + +standard error across 5 trials of returns across the last 100 trajectories after 2M training samples. For ACKTR, we use ELU + +activations instead of leaky ReLU. + +Algorithm Environment 400,300 64,64 100,50,25 tanh ReLU LeakyReLU + +TRPO Hopper-v1 3011 ± 171 2674 ± 227 2782 ± 120 2674 ± 227 3104 ± 84 -(Schulman et al. 2015a) HalfCheetah-v1 2355 ± 48 1939 ± 140 1673 ± 148 1939 ± 140 2281 ± 91 -TRPO Hopper-v1 2909 ± 87 2828 ± 70 2812 ± 88 2828 ± 70 2829 ± 76 3047 ± 68 (Schulman et al. 2017) HalfCheetah-v1 178 ± 242 205 ± 256 172 ± 257 205 ± 256 235 ± 260 325 ± 208 PPO Hopper-v1 2704 ± 37 2790 ± 62 2969 ± 111 2790 ± 62 2687 ± 144 2748 ± 77 (Schulman et al. 2017) HalfCheetah-v1 1523 ± 297 2201 ± 323 1807 ± 309 2201 ± 323 1288 ± 12 1227 ± 462 DDPG Hopper-v1 1419 ± 312 1632 ± 458 1569 ± 453 971 ± 137 852 ± 143 843 ± 160 (Plappert et al. 2017) HalfCheetah-v1 5600 ± 601 4197 ± 606 4713 ± 374 3908 ± 293 4197 ± 606 5324 ± 280 DDPG Hopper-v1 523 ± 248 343 ± 34 345 ± 44 436 ± 48 343 ± 34 -(Gu et al. 2016) HalfCheetah-v1 1373 ± 678 1717 ± 508 1868 ± 620 1128 ± 511 1717 ± 508 -DDPG Hopper-v1 1208 ± 423 394 ± 144 380 ± 65 354 ± 91 394 ± 144 -(Duan et al. 2016) HalfCheetah-v1 789 ± 91 1095 ± 139 988 ± 52 1311 ± 271 1095 ± 139 -ACKTR Hopper-v1 152 ± 47 1930 ± 185 1589 ± 225 691 ± 55 500 ± 379 1930 ± 185 (Wu et al. 2017) HalfCheetah-v1 518 ± 632 3018 ± 386 2554 ± 219 2547 ± 172 3362 ± 682 3018 ± 38 + +Table 2: Results for our value function ( Q or V ) architecture permutations across various implementations and algorithms. Final average ± standard error across 5 trials of returns across the last 100 trajectories after 2M training samples. For ACKTR, we use + +ELU activations instead of leaky ReLU. + +Figure 4: Performance of several policy gradient algorithms across benchmark MuJoCo environment suites + +Environment DDPG ACKTR TRPO PPO + +HalfCheetah-v1 5037 (3664, 6574) 3888 (2288, 5131) 1254.5 (999, 1464) 3043 (1920, 4165) Hopper-v1 1632 (607, 2370) 2546 (1875, 3217) 2965 (2854, 3076) 2715 (2589, 2847) Walker2d-v1 1582 (901, 2174) 2285 (1246, 3235) 3072 (2957, 3183) 2926 (2514, 3361) Swimmer-v1 31 (21, 46) 50 (42, 55) 214 (141, 287) 107 (101, 118) + +Table 3: Bootstrap mean and 95% confidence bounds for a subset of environment experiments. 10k bootstrap iterations and the pivotal method were used. 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +> Timesteps ×10 6 +> 0 +> 1000 +> 2000 +> 3000 +> 4000 +> 5000 +> Average Return +> HalfCheetah-v1 (TRPO, Different Random Seeds) +> Random Average (5 runs) +> Random Average (5 runs) + +Figure 5: TRPO on HalfCheetah-v1 using the same hyperpa-rameter configurations averaged over two sets of 5 different random seeds each. The average 2-sample t-test across entire training distribution resulted in t = −9.0916 , p = 0 .0016 . + +Results We perform 10 experiment trials, for the same hyperparameter configuration, only varying the random seed across all 10 trials. We then split the trials into two sets of 5 and average these two groupings together. As shown in Figure 5, we find that the performance of algorithms can be drastically different. We demonstrate that the variance between runs is enough to create statistically different dis-tributions just from varying random seeds. Unfortunately, in recent reported results, it is not uncommon for the top-N tri-als to be selected from among several trials (Wu et al . 2017; Mnih et al . 2016) or averaged over only small number of tri-als ( N < 5) (Gu et al . 2017; Wu et al . 2017). Our experiment with random seeds shows that this can be potentially mislead-ing. Particularly for HalfCheetah, it is possible to get learning curves that do not fall within the same distribution at all, just by averaging different runs with the same hyperparameters, but different random seeds. While there can be no specific number of trials specified as a recommendation, it is possible that power analysis methods can be used to give a general idea to this extent as we will discuss later. However, more investigation is needed to answer this open problem. + +Environments + +How do the environment properties affect variability in re-ported RL algorithm performance? + +To assess how the choice of evaluation environment can af-fect the presented results, we use our aforementioned default set of hyperparameters across our chosen testbed of algo-rithms and investigate how well each algorithm performs across an extended suite of continuous control tasks. For these experiments, we use the following environments from OpenAI Gym: Hopper-v1, HalfCheetah-v1, Swimmer-v1 and Walker2d-v1. The choice of environment often plays an im-portant role in demonstrating how well a new proposed algo-rithm performs against baselines. In continuous control tasks, often the environments have random stochasticity, shortened trajectories, or different dynamic properties. We demonstrate that, as a result of these differences, algorithm performance can vary across environments and the best performing algo-rithm across all environments is not always clear. Thus it is increasingly important to present results for a wide range of environments and not only pick those which show a novel work outperforming other methods. + +Results As shown in Figure 4, in environments with sta-ble dynamics (e.g. HalfCheetah-v1), DDPG outperforms all other algorithsm. However, as dynamics become more unsta-ble (e.g. in Hopper-v1) performance gains rapidly diminish. As DDPG is an off-policy method, exploration noise can cause sudden failures in unstable environments. Therefore, learning a proper Q-value estimation of expected returns is difficult, particularly since many exploratory paths will result in failure. Since failures in such tasks are characterized by shortened trajectories, a local optimum in this case would be simply to survive until the maximum length of the trajectory (corresponding to one thousand timesteps and similar reward due to a survival bonus in the case of Hopper-v1). As can be seen in Figure 4, DDPG with Hopper does exactly this. This is a clear example where showing only the favourable and sta-ble HalfCheetah when reporting DDPG-based experiments would be unfair. Furthermore, let us consider the Swimmer-v1 environment shown in Figure 4. Here, TRPO significantly outperforms all other algorithms. Due to the dynamics of the water-like environment, a local optimum for the system is to curl up and flail without proper swimming. However, this corresponds to a return of ∼130 . By reaching a local optimum, learning curves can indicate successful optimization of the policy over time, when in reality the returns achieved are not qualitatively representative of learning the desired behaviour, as demon-strated in video replays of the learned policy 5. Therefore, it is important to show not only returns but demonstrations of the learned policy in action. Without understanding what the evaluation returns indicate, it is possible that misleading results can be reported which in reality only optimize local optima rather than reaching the desired behaviour. + +Codebases + +Are commonly used baseline implementations comparable? + +In many cases, authors implement their own versions of base-line algorithms to compare against. We investigate the Ope-nAI baselines implementation of TRPO as used in (Schulman et al . 2017), the original TRPO code (Schulman et al . 2015a), and the rllab (Duan et al . 2016) Tensorflow implementation of TRPO. We also compare the rllab Theano (Duan et al . 2016), rllabplusplus (Gu et al . 2016), and OpenAI baselines (Plap-pert et al . 2017) implementations of DDPG. Our goal is to draw attention to the variance due to implementation details across algorithms. We run a subset of our architecture experi-ments as with the OpenAI baselines implementations using the same hyperparameters as in those experiments 6. + +Results We find that implementation differences which are often not reflected in publications can have dramatic impacts on performance. This can be seen for our final evalu-ation performance after training on 2M samples in Tables 1 and 2, as well as a sample comparison in Figure 6. This + +> 5https://youtu.be/lKpUQYjgm80 +> 6Differences are discussed in the supplemental (e.g. use of dif-ferent optimizers for the value function baseline). Leaky ReLU activations are left out to narrow the experiment scope. 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 +> Timesteps ×10 6 +> −500 +> 0 +> 500 +> 1000 +> 1500 +> 2000 +> Average Return +> HalfCheetah-v1 (TRPO, Codebase Comparison) +> Schulman 2015 +> Schulman 2017 +> Duan 2016 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 +> Timesteps ×10 6 +> 0 +> 1000 +> 2000 +> 3000 +> 4000 +> 5000 +> Average Return +> HalfCheetah-v1 (DDPG, Codebase Comparison) +> Duan 2016 +> Gu 2016 +> Plapper 2017 + +Figure 6: TRPO codebase comparison using our default set of hyperparameters (as used in other experiments). demonstrates the necessity that implementation details be enumerated, codebases packaged with publications, and that performance of baseline experiments in novel works matches the original baseline publication code. + +## Reporting Evaluation Metrics + +In this section we analyze some of the evaluation metrics commonly used in the reinforcement learning literature. In practice, RL algorithms are often evaluated by simply pre-senting plots or tables of average cumulative reward (average returns) and, more recently, of maximum reward achieved over a fixed number of timesteps. Due to the unstable na-ture of many of these algorithms, simply reporting the max-imum returns is typically inadequate for fair comparison; even reporting average returns can be misleading as the range of performance across seeds and trials is unknown. Alone, these may not provide a clear picture of an algorithm’s range of performance. However, when combined with confidence intervals, this may be adequate to make an informed deci-sion given a large enough number of trials. As such, we investigate using the bootstrap and significance testing as in ML (Kohavi and others 1995; Bouckaert and Frank 2004; Nadeau and Bengio 2000) to evaluate algorithm performance. + +Online View vs. Policy Optimization An important dis-tinction when reporting results is the online learning view versus the policy optimization view of RL. In the online view, an agent will optimize the returns across the entire learning process and there is not necessarily an end to the agent’s trajectory. In this view, evaluations can use the average cumu-lative rewards across the entire learning process (balancing exploration and exploitation) as in (Hofer and Gimbert 2016), or can possibly use offline evaluation as in (Mandel et al .2016). The alternate view corresponds to policy optimization, where evaluation is performed using a target policy in an of-fline manner. In the policy optimization view it is important to run evaluations across the entire length of the task trajectory with a single target policy to determine the average returns that the target can obtain. We focus on evaluation methods for the policy optimization view (with offline evaluation), but the same principles can be applied to the online view. + +Confidence Bounds The sample bootstrap has been a pop-ular method to gain insight into a population distribution from a smaller sample (Efron and Tibshirani 1994). Boot-strap methods are particularly popular for A/B testing, and we can borrow some ideas from this field. Generally a boot-strap estimator is obtained by resampling with replacement many times to generate a statistically relevant mean and con-fidence bound. Using this technique, we can gain insight into what is the 95% confidence interval of the results from our section on environments. Table 3 shows the bootstrap mean and 95% confidence bounds on our environment experiments. Confidence intervals can vary wildly between algorithms and environments. We find that TRPO and PPO are the most stable with small confidence bounds from the bootstrap. In cases where confidence bounds are exceedingly large, it may be necessary to run more trials (i.e. increase the sample size). + +Power Analysis Another method to determine if the sample size must be increased is bootstrap power analy-sis (Tuff ´ery 2011; Yuan and Hayashi 2003). If we use our sample and give it some uniform lift (for example, scaling uni-formly by 1.25), we can run many bootstrap simulations and determine what percentage of the simulations result in statis-tically significant values with the lift. If there is a small per-centage of significant values, a larger sample size is needed (more trials must be run). We do this across all environment experiment trial runs and indeed find that, in more unstable settings, the bootstrap power percentage leans towards in-significant results in the lift experiment. Conversely, in stable trials (e.g. TRPO on Hopper-v1) with a small sample size, the lift experiment shows that no more trials are needed to generate significant comparisons. These results are provided in the supplemental material. + +Significance An important factor when deciding on an RL algorithm to use is the significance of the reported gains based on a given metric. Several works have investigated the use of significance metrics to assess the reliability of reported evaluation metrics in ML. However, few works in reinforcement learning assess the significance of reported metrics. Based on our experimental results which indicate that algorithm performance can vary wildly based simply on perturbations of random seeds, it is clear that some metric is necessary for assessing the significance of algorithm perfor-mance gains and the confidence of reported metrics. While more research and investigation is needed to determine the best metrics for assessing RL algorithms, we investigate an initial set of metrics based on results from ML. In supervised learning, k-fold t-test, corrected resampled t-test, and other significance metrics have been discussed when comparing machine learning results (Bouckaert and Frank 2004; Nadeau and Bengio 2000). However, the assumptions pertaining to the underlying data with corrected metrics do not necessarily apply in RL. Further work is needed to inves-tigate proper corrected significance tests for RL. Nonetheless, we explore several significance measures which give insight into whether a novel algorithm is truly performing as the state-of-the-art. We consider the simple 2-sample t-test (sorting all final evaluation returns across N random trials with different random seeds); the Kolmogorov-Smirnov test (Wilcox 2005); and bootstrap percent differences with 95% confidence in-tervals. All calculated metrics can be found in the supple-mental. Generally, we find that the significance values match up to what is to be expected. Take, for example, comparing Walker2d-v1 performance of ACKTR vs. DDPG. ACKTR performs slightly better, but this performance is not signifi-cant due to the overlapping confidence intervals of the two: + +t = 1 .03 , p = 0 .334 , KS = 0 .40 , p = 0 .697 , bootstrapped percent difference 44.47% (-80.62%, 111.72%). + +## Discussion and Conclusion + +Through experimental methods focusing on PG methods for continuous control, we investigate problems with repro-ducibility in deep RL. We find that both intrinsic (e.g. random seeds, environment properties) and extrinsic sources (e.g. hy-perparameters, codebases) of non-determinism can contribute to difficulties in reproducing baseline algorithms. Moreover, we find that highly varied results due to intrinsic sources bolster the need for using proper significance analysis. We propose several such methods and show their value on a subset of our experiments. + +What recommendations can we draw from our experiments? + +Based on our experimental results and investigations, we can provide some general recommendations. Hyperparame-ters can have significantly different effects across algorithms and environments. Thus it is important to find the work-ing set which at least matches the original reported perfor-mance of baseline algorithms through standard hyperparame-ter searches. Similarly, new baseline algorithm implementa-tions used for comparison should match the original codebase results if available. Overall, due to the high variance across trials and random seeds of reinforcement learning algorithms, many trials must be run with different random seeds when comparing performance. Unless random seed selection is explicitly part of the algorithm, averaging multiple runs over different random seeds gives insight into the population dis-tribution of the algorithm performance on an environment. Similarly, due to these effects, it is important to perform proper significance testing to determine if the higher average returns are in fact representative of better performance. We highlight several forms of significance testing and find that they give generally expected results when taking confi-dence intervals into consideration. Furthermore, we demon-strate that bootstrapping and power analysis are possible ways to gain insight into the number of trial runs necessary to make an informed decision about the significance of algorithm per-formance gains. In general, however, the most important step to reproducibility is to report all hyperparameters, implemen-tation details, experimental setup, and evaluation methods for both baseline comparison methods and novel work. Without the publication of implementations and related details, wasted effort on reproducing state-of-the-art works will plague the community and slow down progress. + +What are possible future lines of investigation? + +Due to the significant effects of hyperparameters (partic-ularly reward scaling), another possibly important line of future investigation is in building hyperparameter agnostic algorithms. Such an approach would ensure that there is no unfairness introduced from external sources when compar-ing algorithms agnostic to parameters such as reward scale, batch size, or network structure. Furthermore, while we in-vestigate an initial set of significance metrics here, they may not be the best fit for comparing RL algorithms. Several works have begun investigating policy evaluation methods for the purposes of safe RL (Thomas and Brunskill 2016; Thomas, Theocharous, and Ghavamzadeh 2015), but further work is needed in significance testing and statistical analysis. Similar lines of investigation to (Nadeau and Bengio 2000; Bouckaert and Frank 2004) would be helpful to determine the best methods for evaluating performance gain significance. + +How can we ensure that deep RL matters? + +We discuss many different factors affecting reproducibility of RL algorithms. The sensitivity of these algorithms to changes in reward scale, environment dynamics, and random seeds can be considerable and varies between algorithms and set-tings. Since benchmark environments are proxies for real-world applications to gauge generalized algorithm perfor-mance, perhaps more emphasis should be placed on the appli-cability of RL algorithms to real-world tasks. That is, as there is often no clear winner among all benchmark environments, perhaps recommended areas of application should be demon-strated along with benchmark environment results when pre-senting a new algorithm. Maybe new methods should be answering the question: in what setting would this work be useful? This is something that is addressed for machine learn-ing in (Wagstaff 2012) and may warrant more discussion for RL. As a community, we must not only ensure reproducible results with fair comparisons, but we must also consider what are the best ways to demonstrate that RL continues to matter. + +## Acknowledgements + +We thank NSERC, CIFAR, the Open Philanthropy Project, and the AWS Cloud Credits for Research Program. + +## References + +> Bouckaert, R. R., and Frank, E. 2004. Evaluating the replicability of significance tests for comparing learning algorithms. In PAKDD ,3–12. Springer. Bouckaert, R. R. 2004. Estimating replicability of classifier learning experiments. In Proceedings of the 21st International Conference on Machine Learning (ICML) .Boulesteix, A.-L.; Lauer, S.; and Eugster, M. J. 2013. A plea for neutral comparison studies in computational sciences. PloS one +> 8(4):e61562. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI gym. arXiv preprint arXiv:1606.01540 .Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; and Abbeel, P. 2016. Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning (ICML) . + +Efron, B., and Tibshirani, R. J. 1994. An introduction to the boot-strap . CRC press. Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics , 249–256. Gu, S.; Lillicrap, T.; Ghahramani, Z.; Turner, R. E.; and Levine, S. 2016. Q-prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247 .Gu, S.; Lillicrap, T.; Ghahramani, Z.; Turner, R. E.; Sch ¨olkopf, B.; and Levine, S. 2017. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. arXiv preprint arXiv:1706.00387 .Hofer, L., and Gimbert, H. 2016. Online reinforcement learning for real-time exploration in continuous state and action markov decision processes. arXiv preprint arXiv:1612.03780 .Islam, R.; Henderson, P.; Gomrokchi, M.; and Precup, D. 2017. Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. ICML Reproducibility in Machine Learning Workshop .Kohavi, R., et al. 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI , volume 14. LeCun, Y. A.; Bottou, L.; Orr, G. B.; and M ¨uller, K.-R. 2012. Effi-cient backprop. In Neural Networks: Tricks of the Trade . Springer. Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015a. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 .Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015b. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 .Machado, M. C.; Bellemare, M. G.; Talvitie, E.; Veness, J.; Hausknecht, M.; and Bowling, M. 2017. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. arXiv preprint arXiv:1709.06009 .Mandel, T.; Liu, Y.-E.; Brunskill, E.; and Popovic, Z. 2016. Offline Evaluation of Online Reinforcement Learning Algorithms. In AAAI .Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 .Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning , 1928–1937. Nadeau, C., and Bengio, Y. 2000. Inference for the generalization error. In Advances in neural information processing systems .Plappert, M.; Houthooft, R.; Dhariwal, P.; Sidor, S.; Chen, R.; Chen, X.; Asfour, T.; Abbeel, P.; and Andrychowicz, M. 2017. Parameter space noise for exploration. arXiv preprint arXiv:1706.01905 .Rajeswaran, A.; Lowrey, K.; Todorov, E.; and Kakade, S. 2017. Towards generalization and simplicity in continuous control. arXiv preprint arXiv:1703.02660 .Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015a. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML) .Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; and Abbeel, P. 2015b. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438 .Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 .Silva, V. d. N., and Chaimowicz, L. 2017. Moba: a new arena for game ai. arXiv preprint arXiv:1705.10443 .Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershel-vam, V.; Lanctot, M.; et al. 2016. Mastering the game of go with deep neural networks and tree search. Nature 529(7587):484–489. Stadie, B. C.; Abbeel, P.; and Sutskever, I. 2017. Third-person imitation learning. arXiv preprint arXiv:1703.01703 .Stodden, V.; Leisch, F.; and Peng, R. D. 2014. Implementing reproducible research . CRC Press. Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with func-tion approximation. In Advances in neural information processing systems .Thomas, P., and Brunskill, E. 2016. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning , 2139–2148. Thomas, P. S.; Theocharous, G.; and Ghavamzadeh, M. 2015. High-Confidence Off-Policy Evaluation. In AAAI .Todorov, E.; Erez, T.; and Tassa, Y. 2012. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Confer-ence on Intelligent Robots and Systems, IROS 2012, Vilamoura, Algarve, Portugal, October 7-12, 2012 , 5026–5033. Tuff ´ery, S. 2011. Data mining and statistics for decision making ,volume 2. Wiley Chichester. van Hasselt, H. P.; Guez, A.; Hessel, M.; Mnih, V.; and Silver, D. 2016. Learning values across many orders of magnitude. In + +Advances in Neural Information Processing Systems , 4287–4295. Vaughan, R., and Wawerla, J. 2012. Publishing identifiable exper-iment code and configuration is important, good and easy. arXiv preprint arXiv:1204.2235 .Vincent, P.; de Br ´ebisson, A.; and Bouthillier, X. 2015. Efficient exact gradient update for training deep networks with very large sparse targets. In Advances in Neural Information Processing Sys-tems , 1108–1116. Vinyals, O.; Ewalds, T.; Bartunov, S.; Georgiev, P.; Vezhnevets, A. S.; Yeo, M.; Makhzani, A.; K ¨uttler, H.; Agapiou, J.; Schrittwieser, J.; et al. 2017. Starcraft ii: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782 .Wagstaff, K. 2012. Machine learning that matters. arXiv preprint arXiv:1206.4656 .Whiteson, S.; Tanner, B.; Taylor, M. E.; and Stone, P. 2011. Pro-tecting against evaluation overfitting in empirical reinforcement learning. In 2011 IEEE Symposium on Adaptive Dynamic Program-ming And Reinforcement Learning, ADPRL 2011, Paris, France, April 12-14, 2011 , 120–127. Wilcox, R. 2005. Kolmogorov–smirnov test. Encyclopedia of biostatistics .Wu, Y.; Mansimov, E.; Liao, S.; Grosse, R.; and Ba, J. 2017. Scal-able trust-region method for deep reinforcement learning using kronecker-factored approximation. arXiv preprint:1708.05144 .Xu, B.; Wang, N.; Chen, T.; and Li, M. 2015. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853 .Yuan, K.-H., and Hayashi, K. 2003. Bootstrap approach to inference and power analysis based on three test statistics for covariance structure models. British Journal of Mathematical and Statistical Psychology 56(1):93–110. Supplemental Material + +In this supplemental material, we include a detailed review of experiment configurations of related work with policy gradient methods in continuous control MuJoCo (Todorov, Erez, and Tassa 2012) environment tasks from OpenAI Gym (Brockman et al . 2016). We include a detailed list of the hyperparameters and reported metrics typically used in policy gradient literature in deep RL. We also include all our experimental results, with baseline algorithms DDPG (Lillicrap et al . 2015b), TRPO (Schulman et al . 2015a), PPO (Schulman et al . 2017) and ACKTR (Wu et al . 2017)) as discussed in the paper. Our experimental results include figures with different hyperparameters (network architectures, activation functions) to highlight the differences this can have across algorithms and environments. Finally, as discussed in the paper, we include discussion of significance metrics and show how these metrics can be useful for evaluating deep RL algorithms. + +## Literature Reviews + +Hyperparameters + +In this section, we include a list of hyperparameters that are reported in related literature, as shown in figure 4. Our analysis shows that often there is no consistency in the type of network architectures and activation functions that are used in related literature. As shown in the paper and from our experimental results in later sections, we find, however, that these hyperparameters can have a significant effect in the performance of algorithms across benchmark environments typically used. + +Table 4: Evaluation Hyperparameters of baseline algorithms reported in related literature + +Related Work (Algorithm) Policy Network Policy Network Activation Value Network Value Network Activation Reward Scaling Batch Size + +DDPG 64x64 ReLU 64x64 ReLU 1.0 128 TRPO 64x64 TanH 64x64 TanH - 5k PPO 64x64 TanH 64x64 TanH - 2048 ACKTR 64x64 TanH 64x64 ELU - 2500 Q-Prop (DDPG) 100x50x25 TanH 100x100 ReLU 0.1 64 Q-Prop (TRPO) 100x50x25 TanH 100x100 ReLU - 5k IPG (TRPO) 100x50x25 TanH 100x100 ReLU - 10k Param Noise (DDPG) 64x64 ReLU 64x64 ReLU - 128 Param Noise (TRPO) 64x64 TanH 64x64 TanH - 5k Benchmarking (DDPG) 400x300 ReLU 400x300 ReLU 0.1 64 Benchmarking (TRPO) 100x50x25 TanH 100x50x25 TanH - 25k + +Reported Results on Benchmarked Environments + +We then demonstrate how experimental reported results, on two different environments (HalfCheetah-v1 and Hopper-v1) can vary across different related work that uses these algorithms for baseline comparison. We further show the results we get, using the same hyperparameter configuration, but using two different codebase implementations (note that these implementations are often used as baseline codebase to develop algorithms). We highlight that, depending on the codebase used, experimental results can vary significantly. + +Table 5: Comparison with Related Reported Results with Hopper Environment + +Environment Metric rllab QProp IPG TRPO Our Results (rllab) Our Results (Baselines) TRPO on Hopper Environment Number of Iterations 500 500 500 500 500 500 Average Return 1183.3 - - - 2021.34 2965.3 Max Average Return - 2486 3668.8 3229.1 3034.4 Table 6: Comparison with Related Reported Results with HalfCheetah Environment + +Environment Metric rllab QProp IPG TRPO Our Results (rllab) Our Results (Baselines) TRPO on HalfCheetah Environment Number of Iterations 500 500 500 500 500 500 Average Return 1914.0 - - 3576.08 1045.6 Max Average Return - 4734 2889 4855 5197 1045.6 + +Work Number of Trials (Mnih et al. 2016) top-5 (Schulman et al. 2017) 3-9 (Duan et al. 2016) 5 (5) (Gu et al. 2017) 3(Lillicrap et al. 2015b) 5(Schulman et al. 2015a) 5(Wu et al. 2017) top-2, top-3 Table 7: Number of trials reported during evaluation in various works. + +Reported Evaluation Metrics in Related Work + +In table 8 we show the evaluation metrics, and reported results in further details across related work. + +Table 8: Reported Evaluation Metrics of baseline algorithms in related literature + +Related Work (Algorithm) Environments Timesteps or Episodes or Iterations Evaluation Metrics Average Return Max Return Std Error + +PPO HalfCheetah Hopper 1M ∼1800 + +∼2200 - -ACKTR HalfCheetah Hopper 1M ∼2400 + +∼3500 - -Q-Prop (DDPG) HalfCheetah Hopper 6k (eps) ∼6000 -7490 2604 --Q-Prop (TRPO) HalfCheetah Hopper 5k (timesteps) ∼4000 -4734 2486 --IPG (TRPO) HalfCheetah Hopper 10k (eps) ∼3000 - 2889 --Param Noise (DDPG) HalfCheetah Hopper 1M ∼1800 + +∼500 ----Param Noise (TRPO) HalfCheetah Hopper 1M ∼3900 + +∼2400 ----Benchmarking (DDPG) HalfCheetah Hopper 500 iters (25k eps) + +∼2148 + +∼267 --702 43 Benchmarking (TRPO) HalfCheetah Hopper 500 iters (925k eps) + +∼1914 + +∼1183 --150 120 + +## Experimental Setup + +In this section, we show detailed analysis of our experimental results, using same hyperparameter configurations used in related work. Experimental results are included for the OpenAI Gym (Brockman et al . 2016) Hopper-v1 and HalfCheetah-v1 environments, using the policy gradient algorithms including DDPG, TRPO, PPO and ACKTR. Our experiments are done using the available codebase from OpenAI rllab (Duan et al . 2016) and OpenAI Baselines. Each of our experiments are performed over 5 experimental trials with different random seeds, and results averaged over all trials. Unless explicitly specified as otherwise (such as in hyperparameter modifications where we alter a hyperparameter under investigation), hyperparameters were as follows. All results (including graphs) show mean and standard error across random seeds. + +• DDPG – Policy Network: (64, relu, 64, relu, tanh); Q Network (64, relu, 64, relu, linear) + +– Normalized observations with running mean filter + +– Actor LR: 1e − 4; Critic LR: 1e − 3 + +– Reward Scale: 1.0 + +– Noise type: O-U 0.2 + +– Soft target update τ = .01 + +– γ = 0 .995 + +– batch size = 128 + +– Critic L2 reg 1e − 2 + +• PPO + +– Policy Network: (64, tanh, 64, tanh, Linear) + Standard Deviation variable; Value Network (64, tanh, 64, tanh, linear) + +– Normalized observations with running mean filter + +– Timesteps per batch 2048 + +– clip param = 0.2 + +– entropy coeff = 0.0 + +– Optimizer epochs per iteration = 10 + +– Optimizer step size 3e − 4 + +– Optimizer batch size 64 + +– Discount γ = 0 .995 , GAE λ = 0 .97 + +– learning rate schedule is constant + +• TRPO + +– Policy Network: (64, tanh, 64, tanh, Linear) + Standard Deviation variable; Value Network (64, tanh, 64, tanh, linear) + +– Normalized observations with running mean filter + +– Timesteps per batch 5000 + +– max KL=0.01 + +– Conjugate gradient iterations = 20 + +– CG damping = 0.1 + +– VF Iterations = 5 + +– VF Batch Size = 64 + +– VF Step Size = 1e − 3 + +– entropy coeff = 0.0 + +– Discount γ = 0 .995 , GAE λ = 0 .97 + +• ACKTR + +– Policy Network: (64, tanh, 64, tanh, Linear) + Standard Deviation variable; Value Network (64, elu, 64, elu, linear) + +– Normalized observations with running mean filter + +– Timesteps per batch 2500 + +– desired KL = .002 + +– Discount γ = 0 .995 , GAE λ = 0 .97 + +Modifications to Baseline Implementations + +To ensure fairness of comparison, we make several modifications to the existing implementations. First, we change evaluation in DDPG (Plappert et al . 2017) such that during evaluation at the end of an epoch, 10 full trajectories are evaluated. In the current implementation, only a partial trajectory is evaluated immediately after training such that a full trajectory will be evaluated across several different policies, this corresponds more closely to the online view of evaluation, while we take a policy optimization view when evaluating algorithms. + +Hyperparameters : Network Structures and Activation Functions + +Below, we examine the significance of the network configurations used for the non-linear function approximators in policy gradient methods. Several related work have used different sets of network configurations (network sizes and activation functions). We use the reported network configurations from other works, and demonstrate the significance of careful fine tuning that is required. We demonstrate results using the network activation functions, ReLU, TanH and Leaky ReLU, where most papers use ReLU and TanH as activation functions without detailed reporting of the effect of these activation functions. We analyse the signifcance of using different activations in the policy and action value networks. Previously, we included a detailed table showing average reward with standard error obtained for each of the hyperparameter configurations. In the results below, we show detailed results of how each of these policy gradient algorithms are affected by the choice of the network configuration. Proximal Policy Optimization (PPO) 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +−500 0500 1000 1500 2000 2500 + +> Average Return + +HalfCheetah-v1 (PPO, Value Network Activation) + +> tanh relu leaky relu + +Figure 7: PPO Policy and Value Network activation + +Experiment results in Figure 7, 8, and 9 in this section show the effect of the policy network structures and activation functions in the Proximal Policy Optimization (PPO) algorithm. 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +−2000 + +−1000 + +0 + +1000 + +2000 + +> Average Return + +HalfCheetah-v1 (PPO, Policy Network Structure) + +> (64,64) +> (100,50,25) +> (400,300) + +Figure 8: PPO Policy Network structure Figure 9: PPO Value Network structure + +Actor Critic using Kronecker-Factored Trust Region (ACKTR) 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +−500 + +0 + +500 + +1000 + +1500 + +2000 + +2500 + +3000 + +> Average Return + +HalfCheetah-v1 (ACKTR, Policy Network Structure) + +> (64,64) +> (100,50,25) +> (400,300) + +0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +0 + +500 + +1000 + +1500 + +2000 + +2500 + +3000 + +> Average Return + +Hopper-v1 (ACKTR, Policy Network Structure) + +> (64,64) +> (100,50,25) +> (400,300) + +Figure 10: ACKTR Policy Network structure 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +−500 + +0 + +500 + +1000 + +1500 + +2000 + +2500 + +3000 + +> Average Return + +HalfCheetah-v1 (ACKTR, Value Network Structure) + +> (64,64) +> (100,50,25) +> (400,300) + +0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +0 + +500 + +1000 + +1500 + +2000 + +2500 + +3000 + +> Average Return + +Hopper-v1 (ACKTR, Value Network Structure) + +> (64,64) +> (100,50,25) +> (400,300) + +Figure 11: ACKTR Value Network structure 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +−500 + +0 + +500 + +1000 + +1500 + +2000 + +2500 + +3000 + +> Average Return + +HalfCheetah-v1 (ACKTR, Policy Network Activation) + +> tanh +> relu +> elu + +0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +0 + +500 + +1000 + +1500 + +2000 + +2500 + +3000 + +> Average Return + +Hopper-v1 (ACKTR, Policy Network Activation) + +> tanh +> relu +> elu + +Figure 12: ACKTR Policy Network Activation 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +0 + +1000 + +2000 + +3000 + +4000 + +> Average Return + +HalfCheetah-v1 (ACKTR, Value Network Activation) + +> tanh +> relu +> elu + +0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +0 + +500 + +1000 + +1500 + +2000 + +2500 + +3000 + +> Average Return + +Hopper-v1 (ACKTR, Value Network Activation) + +> tanh +> relu +> elu + +Figure 13: ACKTR Value Network Activation + +We then similarly, show the significance of these hyperparameters in the ACKTR algorithm. Our results show that the value network structure can have a significant effect on the performance of ACKTR algorithm. + +Trust Region Policy Optimization (TRPO) + +Figure 14: TRPO Policy Network structure Figure 15: TRPO Value Network structure 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +> −750 +> −500 +> −250 +> 0 +> 250 +> 500 +> 750 +> 1000 +> Average Return + +HalfCheetah-v1 (TRPO, Policy Network Activation) + +> tanh +> relu +> leaky relu + +Figure 16: TRPO Policy and Value Network activation + +Figure 17: TRPO Policy and Value Network activation + +In Figures 14, 15, 16, and 17 we show the effects of network structure on the OpenAI baselines implementation of TRPO. In this case, only the policy architecture seems to have a large effect on the performance of the algorithm’s ability to learn. Deep Deterministic Policy Gradient (DDPG) + +Figure 18: Policy or Actor Network Architecture experiments for DDPG on HalfCheetah and Hopper Environment + +We further analyze the actor and critic network configurations for use in DDPG. As in default configurations, we first use the ReLU activation function for policy networks, and examine the effect of different activations and network sizes for the critic networks. Similarly, keeping critic network configurations under default setting, we also examine the effect of actor network activation functions and network sizes. Figure 19: Significance of Value Function or Critic Network Activations for DDPG on HalfCheetah and Hopper Environment + +## Reward Scaling Parameter in DDPG + +Figure 20: DDPG reward rescaling on Hopper-v1, with and without layer norm. 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +> 0 +> 1000 +> 2000 +> 3000 +> 4000 +> 5000 +> Average Return + +HalfCheetah-v1 (DDPG, Reward Scale, Layer Norm) + +> rs=1e-4 +> rs=1e-3 +> rs=1e-2 +> rs=1e-1 +> rs=1 +> rs=10 +> rs=100 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +> 0 +> 1000 +> 2000 +> 3000 +> Average Return + +HalfCheetah-v1 (DDPG, Reward Scale, No Layer Norm) + +> rs=1e-4 +> rs=1e-3 +> rs=1e-2 +> rs=1e-1 +> rs=1 +> rs=10 +> rs=100 + +Figure 21: DDPG reward rescaling on HalfCheetah-v1, with and without layer norm. + +Several related work (Gu et al . 2016; 2017; Duan et al . 2016) have often reported that for DDPG the reward scaling parameter often needs to be fine-tuned for stabilizing the performance of DDPG. It can make a significant impact in performance of DDPG based on the choice of environment. We examine several reward scaling parameters and demonstrate the effect this parameter can have on the stability and performance of DDPG, based on the HalfCheetah and Hopper environments. Our experiment results, as demonstrated in Figure 21 and 20, show that the reward scaling parameter indeed can have a significant impact on performance. Our results show that, very small or negligible reward scaling parameter can significantly detriment the performance of DDPG across all environments. Furthermore, a scaling parameter of 10 or 1 often performs good. Based on our analysis, we suggest that every time DDPG is reported as a baseline algorithm for comparison, the reward scaling parameter should be fine-tuned, specific to the algorithm. Batch Size in TRPO 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +> 0 +> 500 +> 1000 +> 1500 +> 2000 +> 2500 +> 3000 +> 3500 +> Average Return + +Hopper-v1 (TRPO, original, Batch Size) + +> 1024 +> 2048 +> 4096 +> 8192 +> 16384 +> 32768 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +> −500 +> 0 +> 500 +> 1000 +> 1500 +> 2000 +> 2500 +> 3000 +> Average Return + +HalfCheetah-v1 (TRPO, original, Batch Size) + +> 1024 +> 2048 +> 4096 +> 8192 +> 16384 +> 32768 + +Figure 22: TRPO (Schulman et al. 2015a) original code batch size experiments. 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +> 0 +> 500 +> 1000 +> 1500 +> 2000 +> 2500 +> 3000 +> Average Return + +Hopper-v1 (TRPO, baselines, Batch Size) + +> 1024 +> 2048 +> 4096 +> 8192 +> 16384 +> 32768 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +> −600 +> −400 +> −200 +> 0 +> 200 +> Average Return + +HalfCheetah-v1 (TRPO, baselines, Batch Size) + +> 1024 +> 2048 +> 4096 +> 8192 +> 16384 +> 32768 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +> 0 +> 500 +> 1000 +> 1500 +> 2000 +> 2500 +> 3000 +> Average Return + +Walker2d-v1 (TRPO, baselines, Batch Size) + +> 1024 +> 2048 +> 4096 +> 8192 +> 16384 +> 32768 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +> −135 +> −130 +> −125 +> −120 +> −115 +> −110 +> Average Return + +Reacher-v1 (TRPO, baselines, Batch Size) + +> 1024 +> 2048 +> 4096 +> 8192 +> 16384 +> 32768 + +Figure 23: TRPO (Schulman et al. 2017) baselines code batch size experiments. + +We run batch size experiments using the original TRPO code (Schulman et al . 2015a) and the OpenAI baselines code (Schulman et al . 2017). These results can be found in Experiment results in Figure 22 and Figure 23, show that for both HalfCheetah-v1 and Hopper-v1 environments, a batch size of 1024 for TRPO performs best, while perform degrades consecutively as the batch size is increased. + +## Random Seeds + +To determine much random seeds can affect results, we run 10 trials total on two environments using the default previously described settings usign the (Gu et al . 2016) implementation of DDPG and the (Duan et al . 2016) version of TRPO. We divide our trials random into 2 partitions and plot them in Figures 24 and Fig 25. As can be seen, statistically different distributions can be attained just from the random seeds with the same exact hyperparameters. As we will discuss later, bootstrapping off of the sample can give an idea for how drastic this effect will be, though too small a bootstrap will still not give concrete enough results. 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +> 0 +> 1000 +> 2000 +> 3000 +> 4000 +> 5000 +> Average Return + +HalfCheetah-v1 (TRPO, Different Random Seeds) + +> Random Average (5 runs) +> Random Average (5 runs) 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +> 0 +> 500 +> 1000 +> 1500 +> 2000 +> 2500 +> 3000 +> 3500 +> Average Return + +Hopper-v1 (TRPO, Different Random Seeds) + +> Random Average (5 runs) +> Random Average (5 runs) + +Figure 24: Two different TRPO experiment runs, with same hyperparameter configurations, averaged over two splits of 5 different random seeds. 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +> 0 +> 1000 +> 2000 +> 3000 +> 4000 +> Average Return + +HalfCheetah-v1 (DDPG, Different Random Seeds) + +> Random Average (5 runs) +> Random Average (5 runs) 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +> 0 +> 250 +> 500 +> 750 +> 1000 +> 1250 +> 1500 +> 1750 +> Average Return + +Hopper-v1 (DDPG, Different Random Seeds) + +> Random Average (5 runs) +> Random Average (5 runs) + +Figure 25: Two different DDPG experiment runs, with same hyperparameter configurations, averaged over two splits of 5 different random seeds. + +## Choice of Benchmark Continuous Control Environment + +We previously demonstrated that the performance of policy gradient algorithms can be highly biased based on the choice of the environment. In this section, we include further results examining the impact the choice of environment can have. We show that no single algorithm can perform consistenly better in all environments. This is often unlike the results we see with DQN networks in Atari domains, where results can often be demonstrated across a wide range of Atari games. Our results, for example, shows that while TRPO can perform significantly better than other algorithms on the Swimmer environment, it may perform quite poorly n the HalfCheetah environment, and marginally better on the Hopper environment compared to PPO. We demonstrate our results using the OpenAI MuJoCo Gym environments including Hopper, HalfCheetah, Swimmer and Walker environments. It is notable to see the varying performance these algorithms can have even in this small set of environment domains. The choice of reporting algorithm performance results can therefore often be biased based on the algorithm designer’s experience with these environments. Figure 26: Comparing Policy Gradients across various environments + +## Codebases + +We include a detailed analysis of performance comparison, with different network structures and activations, based on the choice of the algorithm implementation codebase. + +Figure 27: TRPO Policy and Value Network structure Figure 28: TRPO Policy and Value Network activations. 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +−500 + +−250 + +0 + +250 + +500 + +750 + +1000 + +1250 + +> Average Return + +HalfCheetah-v1 (TRPO, rllab, Policy Network Structure) + +> (64,64) +> (100,50,25) +> (400,300) + +0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +0 + +200 + +400 + +600 + +800 + +1000 + +1200 + +1400 + +> Average Return + +Hopper-v1 (TRPO, rllab, Policy Network Structure) + +> (64,64) +> (100,50,25) +> (400,300) + +0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +−500 + +−250 + +0 + +250 + +500 + +750 + +1000 + +1250 + +1500 + +> Average Return + +HalfCheetah-v1 (TRPO, rllab, Policy Network Activation) + +> tanh +> relu +> leaky relu + +0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +0 + +200 + +400 + +600 + +800 + +1000 + +1200 + +1400 + +> Average Return + +Hopper-v1 (TRPO, rllab, Policy Network Activation) + +> tanh +> relu +> leaky relu + +Figure 29: TRPO rllab Policy Structure and Activation 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +−500 + +0 + +500 + +1000 + +1500 + +2000 + +2500 + +3000 + +3500 + +> Average Return + +HalfCheetah-v1 (DDPG, rllab++, Policy Network Structure) + +> (64,64) +> (100,50,25) +> (400,300) + +0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +0 + +100 + +200 + +300 + +400 + +500 + +600 + +700 + +800 + +> Average Return + +Hopper-v1 (DDPG, rllab++, Policy Network Structure) + +> (64,64) +> (100,50,25) +> (400,300) + +0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +0 + +200 + +400 + +600 + +800 + +1000 + +1200 + +> Average Return + +Hopper-v1 (DDPG, rllab++, Value Network Structure) + +> (64,64) +> (100,50,25) +> (400,300) + +0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +−500 + +0 + +500 + +1000 + +1500 + +2000 + +2500 + +> Average Return + +HalfCheetah-v1 (DDPG, rllab++, Value Network Structure) + +> (64,64) +> (100,50,25) +> (400,300) + +Figure 30: DDPG rllab++ Policy and Value Network structure 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +−500 + +0 + +500 + +1000 + +1500 + +2000 + +2500 + +3000 + +3500 + +> Average Return + +HalfCheetah-v1 (DDPG, rllab++, Policy Network Activation) + +> tanh +> relu +> leaky relu + +0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +0 + +200 + +400 + +600 + +800 + +1000 + +> Average Return + +Hopper-v1 (DDPG, rllab++, Policy Network Activation) + +> tanh +> relu +> leaky + +0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +−500 + +0 + +500 + +1000 + +1500 + +2000 + +2500 + +> Average Return + +HalfCheetah-v1 (DDPG, rllab++, Value Network Activation) + +> tanh +> relu +> leaky relu + +0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +0 + +200 + +400 + +600 + +800 + +1000 + +> Average Return + +Hopper-v1 (DDPG, rllab++, Value Network Activation) + +> tanh +> relu +> leaky relu + +Figure 31: DDPG rllab++ Policy and Value Network activations. + +Similarly, Figures 32 and 33 show the same network experiments for DDPG with the Theano implementation of rllab code (Duan et al .2016). 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +−500 + +0 + +500 + +1000 + +1500 + +2000 + +2500 + +3000 + +3500 + +> Average Return + +HalfCheetah-v1 (DDPG, rllab, Policy Network Structure) + +> (64,64) +> (100,50,25) +> (400,300) + +0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +0 + +100 + +200 + +300 + +400 + +500 + +600 + +700 + +800 + +> Average Return + +Hopper-v1 (DDPG, rllab, Policy Network Structure) + +> (64,64) +> (100,50,25) +> (400,300) + +0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +0 + +250 + +500 + +750 + +1000 + +1250 + +1500 + +1750 + +2000 + +> Average Return + +Hopper-v1 (DDPG, rllab, Value Network Structure) + +> (64,64) +> (100,50,25) +> (400,300) + +0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +−500 + +0 + +500 + +1000 + +1500 + +2000 + +> Average Return + +HalfCheetah-v1 (DDPG, rllab, Value Network Structure) + +> (64,64) +> (100,50,25) +> (400,300) + +Figure 32: DDPG rllab Policy and Value Network structure 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +0 + +500 + +1000 + +1500 + +2000 + +> Average Return + +HalfCheetah-v1 (DDPG, rllab, Policy Network Activation) + +> tanh +> relu + +0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +0 + +200 + +400 + +600 + +800 + +1000 + +> Average Return + +Hopper-v1 (DDPG, rllab, Policy Network Activation) + +> tanh +> relu + +0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +0 + +500 + +1000 + +1500 + +> Average Return + +HalfCheetah-v1 (DDPG, rllab, Value Network Activation) + +> tanh +> relu + +0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +0 + +200 + +400 + +600 + +800 + +1000 + +> Average Return + +Hopper-v1 (DDPG, rllab, Value Network Activation) + +> tanh +> relu + +Figure 33: DDPG rllab Policy and Value Network activations. + +Often in related literature, there is different baseline codebase people use for implementation of algorithms. One such example is for the TRPO algorithm. It is a commonly used policy gradient method for continuous control tasks, and there exists several implementations from OpenAI Baselines (Plappert et al . 2017), OpenAI rllab (Duan et al . 2016) and the original TRPO codebase (Schulman et al . 2015a). In this section, we perform an analysis of the impact the choice of algorithm codebase can have on the performance. Figures 27 and 28 summarizes our results with TRPO policy network and value networks, using the original TRPO codebase from (Schulman et al . 2015a). Figure 29 shows the results using the rllab implementation of TRPO using the same hyperparameters as our default experiments aforementioned. Note, we use a linear function approximator rather than a neural network due to the fact that the Tensorflow implementation of OpenAI rllab doesn’t provide anything else. We note that this is commonly used in other works (Duan et al . 2016; Stadie, Abbeel, and Sutskever 2017), but may cause differences in performance. Furthermore, we leave out our value function network experiments due to this. 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +> 0 +> 1000 +> 2000 +> 3000 +> 4000 +> 5000 +> Average Return + +HalfCheetah-v1 (DDPG, Codebase Comparison) + +> Duan 2016 +> Gu 2016 +> Plapper 2017 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +> 0 +> 250 +> 500 +> 750 +> 1000 +> 1250 +> 1500 +> 1750 +> Average Return + +Hopper-v1 (DDPG, Codebase Comparison) + +> Duan 2016 +> Gu 2016 +> Plapper 2017 + +Figure 34: DDPG codebase comparison using our default set of hyperparameters (as used in other experiments). 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +> −500 +> 0 +> 500 +> 1000 +> 1500 +> 2000 +> Average Return + +HalfCheetah-v1 (TRPO, Codebase Comparison) + +> Schulman 2015 +> Schulman 2017 +> Duan 2016 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 + +Timesteps ×10 6 + +> 0 +> 500 +> 1000 +> 1500 +> 2000 +> 2500 +> 3000 +> Average Return + +Hopper-v1 (TRPO, Codebase Comparison) + +> Schulman 2015 +> Schulman 2017 +> Duan 2016 + +Figure 35: TRPO codebase comparison using our default set of hyperparameters (as used in other experiments). + +Figure 35 shows a comparison of the TRPO implementations using the default hyperparamters as specified earlier in the supplemental. Note, the exception is that we use a larger batch size for rllab and original TRPO code of 20k samples per batch, as optimized in a second set of experiments. Figure 30 and 31 show the same network experiments for DDPG with the rllab++ code (Gu et al . 2016). We can then compare the performance of the algorithm across 3 codebases (keeping all hyperparameters constant at the defaults), this can be seen in Figure 34. + +## Significance + +Our full results from significance testing with difference metrics can be found in Table 9 and Table 10. Our bootstrap mean and confidence intervals can be found in Table 13. Bootstrap power analysis can be found in Table 14. To performance significance testing, we use our 5 sample trials to generate a bootstrap with 10k bootstraps. From this confidence intervals can be obtained. For the t-test and KS-test, the average returns from the 5 trials are sorted and compared using the normal 2-sample versions of these tests. Scipy ( https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ks_2samp. html , https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html ) and Facebook Boostrapped ( https://github.com/facebookincubator/bootstrapped ) are used for the KS test, t-test, and bootstrap analysis. For power analysis, we attempt to determine if a sample is enough to game the significance of a 25% lift. This is commonly used in A/B testing (Tuff´ ery 2011). - DDPG ACKTR TRPO PPO DDPG - + +t = 1 .85 , p = 0 .102 + +KS = 0 .60 , p = 0 .209 + +61.91 % (-32.27 %, 122.99 %) + +t = 4 .59 , p = 0 .002 + +KS = 1 .00 , p = 0 .004 + +301.48 % (150.50 %, 431.67 %) + +t = 2 .67 , p = 0 .029 + +KS = 0 .80 , p = 0 .036 + +106.91 % (-37.62 %, 185.26 %) ACKTR + +t = −1.85 , p = 0 .102 + +KS = 0 .60 , p = 0 .209 + +-38.24 % (-75.42 %, -15.19 %) - + +t = 2 .78 , p = 0 .024 + +KS = 0 .80 , p = 0 .036 + +147.96 % (30.84 %, 234.60 %) + +t = 0 .80 , p = 0 .448 + +KS = 0 .60 , p = 0 .209 + +27.79 % (-67.77 %, 79.56 %) TRPO + +t = −4.59 , p = 0 .002 + +KS = 1 .00 , p = 0 .004 + +-75.09 % (-86.44 %, -68.36 %) + +t = −2.78 , p = 0 .024 + +KS = 0 .80 , p = 0 .036 + +-59.67 % (-81.70 %, -46.84 %) - + +t = −2.12 , p = 0 .067 + +KS = 0 .80 , p = 0 .036 + +-48.46 % (-81.23 %, -32.05 %) PPO + +t = −2.67 , p = 0 .029 + +KS = 0 .80 , p = 0 .036 + +-51.67 % (-80.69 %, -31.94 %) + +t = −0.80 , p = 0 .448 + +KS = 0 .60 , p = 0 .209 + +-21.75 % (-75.99 %, 11.68 %) + +t = 2 .12 , p = 0 .067 + +KS = 0 .80 , p = 0 .036 + +94.04 % (2.73 %, 169.06 %) - + +Table 9: HalfCheetah Significance values and metrics for different algorithms. Rows in cells are: sorted 2-sample t-test, Kolmogorov-Smirnov test, bootstrap A/B comparison % difference with 95% confidence bounds. + +- DDPG ACKTR TRPO PPO DDPG - + +t = −1.41 , p = 0 .196 + +KS = 0 .60 , p = 0 .209 + +-35.92 % (-85.62 %, -5.38 %) + +t = −2.58 , p = 0 .033 + +KS = 0 .80 , p = 0 .036 + +-44.96 % (-78.82 %, -20.29 %) + +t = −2.09 , p = 0 .070 + +KS = 0 .80 , p = 0 .036 + +-39.90 % (-77.12 %, -12.95 %) ACKTR + +t = 1 .41 , p = 0 .196 + +KS = 0 .60 , p = 0 .209 + +56.05 % (-87.98 %, 123.15 %) - + +t = −1.05 , p = 0 .326 + +KS = 0 .60 , p = 0 .209 + +-14.11 % (-37.17 %, 9.11 %) + +t = −0.42 , p = 0 .686 + +KS = 0 .40 , p = 0 .697 + +-6.22 % (-31.58 %, 18.98 %) TRPO + +t = 2 .58 , p = 0 .033 + +KS = 0 .80 , p = 0 .036 + +81.68 % (-67.76 %, 151.64 %) + +t = 1 .05 , p = 0 .326 + +KS = 0 .60 , p = 0 .209 + +16.43 % (-27.92 %, 41.17 %) - + +t = 2 .57 , p = 0 .033 + +KS = 0 .60 , p = 0 .209 + +9.19 % (2.37 %, 15.58 %) PPO + +t = 2 .09 , p = 0 .070 + +KS = 0 .80 , p = 0 .036 + +66.39 % (-67.80 %, 130.16 %) + +t = 0 .42 , p = 0 .686 + +KS = 0 .40 , p = 0 .697 + +6.63 % (-33.54 %, 29.59 %) + +t = −2.57 , p = 0 .033 + +KS = 0 .60 , p = 0 .209 + +-8.42 % (-14.08 %, -2.97 %) - + +Table 10: Hopper Significance values and metrics for different algorithms. Rows in cells are: sorted 2-sample t-test, Kolmogorov-Smirnov test, bootstrap A/B comparison % difference with 95% confidence bounds. + +- DDPG ACKTR TRPO PPO DDPG - + +t = −1.03 , p = 0 .334 + +KS = 0 .40 , p = 0 .697 + +-30.78 % (-91.35 %, 1.06 %) + +t = −4.04 , p = 0 .004 + +KS = 1 .00 , p = 0 .004 + +-48.52 % (-70.33 %, -28.62 %) + +t = −3.07 , p = 0 .015 + +KS = 0 .80 , p = 0 .036 + +-45.95 % (-70.85 %, -24.65 %) ACKTR + +t = 1 .03 , p = 0 .334 + +KS = 0 .40 , p = 0 .697 + +44.47 % (-80.62 %, 111.72 %) - + +t = −1.35 , p = 0 .214 + +KS = 0 .60 , p = 0 .209 + +-25.63 % (-61.28 %, 5.54 %) + +t = −1.02 , p = 0 .338 + +KS = 0 .60 , p = 0 .209 + +-21.91 % (-61.53 %, 11.02 %) TRPO + +t = 4 .04 , p = 0 .004 + +KS = 1 .00 , p = 0 .004 + +94.24 % (-22.59 %, 152.61 %) + +t = 1 .35 , p = 0 .214 + +KS = 0 .60 , p = 0 .209 + +34.46 % (-60.47 %, 77.32 %) -PPO + +t = 3 .07 , p = 0 .015 + +KS = 0 .80 , p = 0 .036 + +85.01 % (-31.02 %, 144.35 %) + +t = 1 .02 , p = 0 .338 + +KS = 0 .60 , p = 0 .209 + +28.07 % (-65.67 %, 71.71 %) + +t = −0.57 , p = 0 .582 + +KS = 0 .40 , p = 0 .697 + +-4.75 % (-19.06 %, 10.02 %) - + +Table 11: Walker2d Significance values and metrics for different algorithms. Rows in cells are: sorted 2-sample t-test, Kolmogorov-Smirnov test, bootstrap A/B comparison % difference with 95% confidence bounds. + +- DDPG ACKTR TRPO PPO DDPG - + +t = −2.18 , p = 0 .061 + +KS = 0 .80 , p = 0 .036 + +-36.44 % (-61.04 %, -6.94 %) + +t = −4.06 , p = 0 .004 + +KS = 1 .00 , p = 0 .004 + +-85.13 % (-97.17 %, -77.95 %) + +t = −8.33 , p = 0 .000 + +KS = 1 .00 , p = 0 .004 + +-70.41 % (-80.86 %, -56.52 %) ACKTR + +t = 2 .18 , p = 0 .061 + +KS = 0 .80 , p = 0 .036 + +57.34 % (-80.96 %, 101.11 %) - + +t = −3.69 , p = 0 .006 + +KS = 1 .00 , p = 0 .004 + +-76.61 % (-90.68 %, -70.06 %) + +t = −8.85 , p = 0 .000 + +KS = 1 .00 , p = 0 .004 + +-53.45 % (-62.22 %, -47.30 %) TRPO + +t = 4 .06 , p = 0 .004 + +KS = 1 .00 , p = 0 .004 + +572.61 % (-73.29 %, 869.24 %) + +t = 3 .69 , p = 0 .006 + +KS = 1 .00 , p = 0 .004 + +327.48 % (165.47 %, 488.66 %) - + +t = 2 .39 , p = 0 .044 + +KS = 0 .60 , p = 0 .209 + +99.01 % (28.44 %, 171.85 %) PPO + +t = 8 .33 , p = 0 .000 + +KS = 1 .00 , p = 0 .004 + +237.97 % (-59.74 %, 326.85 %) + +t = 8 .85 , p = 0 .000 + +KS = 1 .00 , p = 0 .004 + +114.80 % (81.85 %, 147.33 %) + +t = −2.39 , p = 0 .044 + +KS = 0 .60 , p = 0 .209 + +-49.75 % (-78.58 %, -36.43 %) - + +Table 12: Swimmer Significance values and metrics for different algorithms. Rows in cells are: sorted 2-sample t-test, Kolmogorov-Smirnov test, bootstrap A/B comparison % difference with 95% confidence bounds. Environment DDPG ACKTR TRPO PPO HalfCheetah-v1 5037.26 (3664.11, 6574.01) 3888.85 (2288.13, 5131.96) 1254.55 (999.52, 1464.86) 3043.1 (1920.4, 4165.86) Hopper-v1 1632.13 (607.98, 2370.21) 2546.89 (1875.79, 3217.98) 2965.33 (2854.66, 3076.00) 2715.72 (2589.06, 2847.93) Walker2d-v1 1582.04 (901.66, 2174.66) 2285.49 (1246.00, 3235.96) 3072.97 (2957.94, 3183.10) 2926.92 (2514.83, 3361.43) Swimmer-v1 31.92 (21.68, 46.23) 50.22 (42.47, 55.37) 214.69 (141.52, 287.92) 107.88 (101.13, 118.56) + +Table 13: Envs bootstrap mean and 95% confidence bounds Environment DDPG ACKTR TRPO PPO HalfCheetah-v1 100.00 % 0.00 % 0.00 % 79.03 % 11.53 % 9.43 % 79.47 % 20.53 % 0.00 % 61.07 % 10.50 % 28.43 % Hopper-v1 60.90 % 10.00 % 29.10 % 79.60 % 11.00 % 9.40 % 0.00 % 100.00 % 0.00 % 0.00 % 100.00 % 0.00 % Walker2d-v1 89.50 % 0.00 % 10.50 % 60.33 % 9.73 % 29.93 % 0.00 % 100.00 % 0.00 % 59.80 % 31.27 % 8.93 % Swimmer-v1 89.97 % 0.00 % 10.03 % 59.90 % 40.10 % 0.00 % 89.47 % 0.00 % 10.53 % 40.27 % 59.73 % 0.00 % Table 14: Power Analysis for predicted significance of 25% lift. Rows in cells are: % insignificant simulations,% positive significant, % negative significant. diff --git a/docs/evidence/joschu_nuts_and_bolts.md b/docs/evidence/joschu_nuts_and_bolts.md new file mode 100644 index 0000000..3c73249 --- /dev/null +++ b/docs/evidence/joschu_nuts_and_bolts.md @@ -0,0 +1,199 @@ +Source: http://joschu.net/docs/nuts-and-bolts.pdf +Title: Nuts and Bolts of Deep RL Research - John Schulman (2016) +Fetched-via: bash -c 'uvx "markitdown[pdf]" http://joschu.net/docs/nuts-and-bolts.pdf' +Fetch-status: verbatim + +| The Nuts | and Bolts | of Deep | RL Research | +| -------- | --------- | --------- | ----------- | +| | John | Schulman | | +| | December | 9th, 2016 | | + +Outline +| Approaching | New Problems | | +| --------------------- | ------------ | ---------- | +| Ongoing Development | | and Tuning | +| General Tuning | Strategies | for RL | +| Policy Gradient | Strategies | | +| Q-Learning Strategies | | | +| Miscellaneous | Advice | | + +Approaching New Problems + +| New Algorithm? | Use Small | Test Problems | +| -------------------------- | --------- | ------------- | +| (cid:73) Run experiments | quickly | | +| (cid:73) Do hyperparameter | search | | +(cid:73) Interpret and visualize learning process: state visitation, value function, etc. +(cid:73) Counterpoint: don’t overfit algorithm to contrived problem +(cid:73) Useful to have medium-sized problems that you’re intimately familiar with +(Hopper, Atari Pong) + +| New Task? | Make | It Easier Until | Signs | of Life | +| ---------------- | --------------- | --------------- | ----- | ------- | +| (cid:73) Provide | good input | features | | | +| (cid:73) Shape | reward function | | | | + +POMDP Design +(cid:73) Visualize random policy: does it sometimes exhibit desired behavior? +| (cid:73) Human | control | | | | +| -------------- | ------- | --- | --- | --- | +(cid:73) Atari: can you see game features in downsampled image? +(cid:73) Plot time series for observations and rewards. Are they on a reasonable +scale? +| (cid:73) hopper.py | in gym: | | | | +| ------------------ | ------------ | --------------------------- | ------- | ----------- | +| reward | = 1.0 | - 1e-3 * np.square(a).sum() | + delta | x / delta t | +| (cid:73) Histogram | observations | and rewards | | | + +Run Your Baselines +| (cid:73) Don’t expect | them to | work with default | parameters | +| --------------------- | ------- | ----------------- | ---------- | +(cid:73) Recommended: +| Cross-entropy | method1 | | | +| ------------- | ------- | --- | --- | +(cid:73) +| (cid:73) Well-tuned | policy gradient | method2 | | +| ------------------- | --------------- | -------------- | --- | +| (cid:73) Well-tuned | Q-learning | + SARSA method | | +1Istv´anSzitaandAndr´asL¨orincz(2006).“LearningTetrisusingthenoisycross-entropymethod”. In:Neuralcomputation. +2https://github.com/openai/rllab + +| Run with | More Samples | Than | Expected | | +| -------- | ------------ | ---- | -------- | --- | +(cid:73) Early in tuning process, may need huge number of samples +| | Don’t be deterred | by published | work | | +| --- | ----------------- | ------------ | ---- | --- | +(cid:73) +| (cid:73) Examples: | | | | | +| ------------------ | --- | --- | --- | --- | +(cid:73) TRPO on Atari: 100K timesteps per batch for KL= 0.01 +| | DQN on Atari: | update freq=10K, | replay buffer | size=1M | +| --- | ------------- | ---------------- | ------------- | ------- | +(cid:73) + +| Ongoing | Development | and Tuning | +| ------- | ----------- | ---------- | + +| It | Works! | But | Don’t | Be Satisfied | | | +| --- | ---------------- | ----------- | ----- | ----------------- | --- | --- | +| | (cid:73) Explore | sensitivity | | to each parameter | | | +(cid:73) If too sensitive, it doesn’t really work, you just got lucky +| | (cid:73) Look | for health | indicators | | | | +| --- | ------------- | --------------- | ---------- | --- | --- | --- | +| | | (cid:73) VF fit | quality | | | | +| | | Policy | entropy | | | | +(cid:73) +| | | (cid:73) Update | size in | output space | and parameter | space | +| --- | --- | ----------------- | ----------- | ------------ | ------------- | ----- | +| | | (cid:73) Standard | diagnostics | for | deep networks | | + +| Continually | Benchmark | | Your Code | +| ------------------- | --------- | ------------- | ------------ | +| (cid:73) If reusing | code, | regressions | occur | +| (cid:73) Run | a battery | of benchmarks | occasionally | + +| Always | Use Multiple | Random | Seeds | +| ------ | ------------ | ------ | ----- | + +| Always Be | Ablating | | +| ------------------ | ---------- | ---------- | +| (cid:73) Different | tricks may | substitute | +| Especially | whitening | | +(cid:73) +(cid:73) “Regularize” to favor simplicity in algorithm design space +| (cid:73) As | usual, simplicity | → generalization | +| ----------- | ----------------- | ---------------- | + +| Automate Your | Experiments | | | +| ------------- | ---------------- | --------- | ----------------- | +| Don’t spend | all day watching | your code | print out numbers | +(cid:73) +(cid:73) Consider using a cloud computing platform (Microsoft Azure, Amazon EC2, +| Google Compute | Engine) | | | +| -------------- | ------- | --- | --- | + +| General | Tuning | Strategies | for RL | +| ------- | ------ | ---------- | ------ | + +| Whitening | / Standardizing | Data | +| ------------------------ | --------------- | ------------------ | +| (cid:73) If observations | have unknown | range, standardize | +(cid:73) Compute running estimate of mean and standard deviation +x(cid:48) +(cid:73) = clip((x −µ)/σ,−10,10) +(cid:73) Rescale the rewards, but don’t shift mean, as that affects agent’s will to live +(cid:73) Standardize prediction targets (e.g., value functions) the same way + +| Generally | Important | Parameters | | | | +| --------- | --------------- | ------------- | ---- | ------- | --------- | +| (cid:73) | Discount | | | | | +| | (cid:73) Return | = r +γr | +γ2r | +... | | +| | | t t | t+1 | t+2 | | +| | Effective | time horizon: | 1+γ | +γ2+··· | = 1/(1−γ) | +(cid:73) +(cid:73) I.e., γ =0.99⇒ ignore rewards delayed by more than 100 timesteps +| | Low | γ works well | for well-shaped | reward | | +| --- | --- | ------------ | --------------- | ------ | --- | +(cid:73) +(cid:73) In TD(λ) methods, can get away with high γ when λ < 1 +| (cid:73) | Action frequency | | | | | +| -------- | ---------------- | ---------- | ------- | ------------- | --- | +| | Solvable | with human | control | (if possible) | | +(cid:73) +| | (cid:73) View | random exploration | | | | +| --- | ------------- | ------------------ | --- | --- | --- | + +General RL Diagnostics +(cid:73) Look at min/max/stdev of episode returns, along with mean +(cid:73) Look at episode lengths: sometimes provides additional information +| (cid:73) Solving problem | faster, losing | game slower | +| ------------------------ | -------------- | ----------- | + +Policy Gradient Strategies + +| Entropy as | Diagnostic | | | +| ------------------ | ---------------- | ------- | ------------- | +| (cid:73) Premature | drop in policy | entropy | ⇒ no learning | +| (cid:73) Alleviate | by using entropy | bonus | or KL penalty | + +KL as Diagnostic +(cid:2) (cid:3) +| (cid:73) Compute | KL π | (·|s),π(·|s) | | +| ---------------- | ---- | ------------ | --- | +old +| (cid:73) KL spike | ⇒ drastic | loss of performance | | +| -------------------- | --------- | ------------------- | ------------- | +| (cid:73) No learning | progress | might mean steps | are too large | +(cid:73) batchsize=100K converges to different result than batchsize=20K. + +| Baseline | Explained | Variance | +| -------- | --------- | -------- | +1−Var[empiricalreturn−predictedvalue] +| (cid:73) | explained variance | = | +| -------- | ------------------ | --- | +Var[empiricalreturn] + +Policy Initialization +(cid:73) More important than in supervised learning: determines initial state +visitation +| (cid:73) Zero | or tiny final layer, | to maximize | entropy | +| ------------- | -------------------- | ----------- | ------- | + +| Q-Learning Strategies | | | +| --------------------- | --- | --- | +(cid:73) Optimize memory usage carefully: you’ll need it for replay buffer +| (cid:73) Learning | rate schedules | | +| -------------------- | -------------- | ------ | +| (cid:73) Exploration | schedules | | +| (cid:73) Be patient. | DQN converges | slowly | +(cid:73) On Atari, often 10-40M frames to get policy much better than random +ThankstoSzymonSidorforsuggestions + +Miscellaneous Advice +(cid:73) Read older textbooks and theses, not just conference papers +(cid:73) Don’t get stuck on problems—can’t solve everything at once +| (cid:73) Exploration | problems | like cart-pole swing-up | +| -------------------- | ----------------- | ----------------------- | +| (cid:73) DQN on | Atari vs CartPole | | + +Thanks! diff --git a/docs/evidence/mccandlish_2018_large_batch.md b/docs/evidence/mccandlish_2018_large_batch.md new file mode 100644 index 0000000..1049f17 --- /dev/null +++ b/docs/evidence/mccandlish_2018_large_batch.md @@ -0,0 +1,660 @@ +Source: https://arxiv.org/abs/1812.06162 +Title: An Empirical Model of Large-Batch Training - McCandlish & Kaplan (2018) +Fetched-via: curl https://r.jina.ai/https://arxiv.org/pdf/1812.06162 +Fetch-status: verbatim + +Title: 1812.06162v1.pdf + +URL Source: https://arxiv.org/pdf/1812.06162 + +Published Time: Sun, 22 Jan 2023 23:25:04 GMT + +Number of Pages: 35 + +Markdown Content: +# An Empirical Model of Large-Batch Training + +Sam McCandlish ∗ + +OpenAI + +sam@openai.com + +Jared Kaplan + +Johns Hopkins University, OpenAI + +jaredk@jhu.edu + +Dario Amodei + +OpenAI + +damodei@openai.com + +and the OpenAI Dota Team † + +# Abstract + +In an increasing number of domains it has been demonstrated that deep learning models can be trained using relatively large batch sizes without sacrificing data efficiency. However the limits of this massive data parallelism seem to differ from domain to domain, ranging from batches of tens of thousands in ImageNet to batches of millions in RL agents that play the game Dota 2. To our knowledge there is limited conceptual understanding of why these limits to batch size differ or how we might choose the correct batch size in a new domain. In this paper, we demonstrate that a simple and easy-to-measure statistic called the gradient noise scale predicts the largest useful batch size across many domains and applications, including a number of supervised learning datasets (MNIST, SVHN, CIFAR-10, ImageNet, Billion Word), reinforcement learning domains (Atari and Dota), and even generative model training (autoencoders on SVHN). We find that the noise scale increases as the loss decreases over a training run and depends on the model size primarily through improved model performance. Our empirically-motivated theory also describes the tradeoff between compute-efficiency and time-efficiency, and provides a rough model of the benefits of adaptive batch-size training. + +> ∗ + +Work done as an OpenAI Fellow. + +> † + +The OpenAI Dota Team (Greg Brockman, Brooke Chan, Przemysław Debiak, Christy Dennison, David Farhi, Rafał Józefowicz, Jakub Pachocki, Michael Petrov, Henrique Pondé, Jonathan Raiman, Szymon Sidor, Jie Tang, Filip Wolski, and Susan Zhang) performed measurements of the reinforcement learning agents they developed for the game Dota 2. The Dota team’s work can be cited as [BCD +18]. + +> arXiv:1812.06162v1 [cs.LG] 14 Dec 2018 + +# Contents + +1 Introduction 22 Theory and Predictions for the Gradient Noise Scale 43 Experiments 94 Related Work 14 5 Discussion 15 A Methods 17 B Results for All Tasks 19 C Temperature and the Noise Scale 25 D Dynamically Varying the Batch Size 26 E Comments on Optimization 29 + +# 1 Introduction + +The last few years have seen a rapid increase in the amount of computation used to train deep learning models [AH18]. A major enabler as well as a limiting factor in this growth has been parallelism – the extent to which a training process can be usefully spread across multiple devices. Regardless of how much total computation is available, if model training cannot be sufficiently parallelized, then it may take too much serial time and therefore may be practically infeasible. A very common source of parallelism in deep learning has been data parallelism, which involves splitting a batch of data across multiple devices and then aggregating and applying the resulting gradients. Data parallelism requires fast communication between devices, but also requires that large batches are algorithmi-cally effective in accelerating learning. Recently, a number of papers have shown empirically that on spe-cific datasets or tasks, large batch sizes can achieve almost linear speed-ups in training without substantially harming sample efficiency or generalization. For example, batch sizes of 8 thousand [GDG +17], 16 thousand [SKYL17], 32 thousand [YGG17, YZH +17, ASF17], and even 64 thousand [JSH +18] examples have been effectively employed to train ImageNet, and batch sizes of thousands have been effective for language mod-els and generative models [OEGA18, PKYC18, BDS18]. This phenomenon is not confined to supervised learning: in reinforcement learning, batch sizes of over a million timesteps (with tens of thousands of envi-ronments running in parallel) have been used in a Dota-playing agent [BCD +18], and even in simple Atari environments batch sizes of several thousand timesteps have proved effective [AAG +18, HQB +18, SA18]. These discoveries have allowed massive amounts of data and computation to be productively poured into models in a reasonable amount of time, enabling more powerful models in supervised learning, RL, and other domains. However, for a given dataset and model, there is little to guide us in predicting how large a batch size we can feasibly use, why that number takes a particular value, or how we would expect it to differ if we used a different dataset or model. For example, why can we apparently use a batch size of over a million when training a Dota agent, but only thousands or tens of thousands when training an image recognition model? In practice researchers tend to simply experiment with batch sizes and see what works, but a downside of this is that large batch sizes often require careful tuning to be effective (for example, they may require a warmup 2Figure 1: The tradeoff between time and compute resources spent to train a model to a given level of perfor-mance takes the form of a Pareto frontier (left). Training time and compute cost are primarily determined by the number of optimization steps and the number of training examples processed, respectively. We can train a model more quickly at the cost of using more compute resources. On the right we show a concrete example of the Pareto frontiers obtained from training a model to solve the Atari Breakout game to different levels of performance. The cost and training time depend on the computing architecture and are shown approximately. period or an unusual learning rate schedule), so the fact that it is possible to use a large batch size can remain undiscovered for a long time. For example, both the Atari and ImageNet tasks were for several years conventionally run with a substantially smaller batch size than is now understood to be possible. Knowing ahead of time what batch size we expect to be effective would be a significant practical advantage in training new models. In this paper we attempt to answer some of these questions. We measure a simple empirical statistic, the gradient noise scale 3 (essentially a measure of the signal-to-noise ratio of gradient across training examples), and show that it can approximately predict the largest efficient batch size for a wide range of tasks. Our model also predicts a specific shape for the compute/time tradeoff curve, illustrated in Figure 1. Our contributions are a mix of fairly elementary theory and extensive empirical testing of that theory. On the conceptual side, we derive a framework which predicts, under some basic assumptions, that training should parallelize almost linearly up to a batch size equal to the noise scale, after which there should be a smooth but relatively rapid switch to a regime where further parallelism provides minimal benefits. Addition-ally, we expect that the noise scale should increase during training as models get more accurate, and should be larger for more complex tasks, but should not have a strong dependence on model size per se. We also provide an analysis of the efficiency gains to be expected from dynamically adjusting the batch size according to noise scale during training. Finally, we predict that, all else equal, the noise scale will tend to be larger in complex RL tasks due to the stochasticity of the environment and the additional variance introduced by the credit assignment problem. On the empirical side, we verify these predictions across 8 tasks in supervised learning, RL, and gener-ative models, including ImageNet, CIFAR-10, SVHN, MNIST, BillionWord, Atari, OpenAI’s Dota agent [BCD +18], and a variational autoencoder for images. For each of these tasks we demonstrate that the noise scale accurately predicts the largest usable batch size (at the order of magnitude level) and that gains to paral-lelism degrade in the manner predicted by theory. We also show that the noise scale increases over the course of training and demonstrate efficiency gains from dynamic batch size tuning. The noise scale eventually becomes larger for more performant models, but this appears to be caused by the fact that more performant models simply achieve a better loss. The rest of this paper is organized as follows. In Section 2, we derive a simple conceptual picture of the noise scale, data parallelism, and batch sizes, and explain what it predicts about optimal batch sizes and how + +> 3Similar metrics have appeared previously in the literature. We discuss related work in Section 4. + +3Figure 2: Less noisy gradient estimates allow SGD-type optimizers to take larger steps, leading to conver-gence in a smaller number of iterations. As an illustration, we show two optimization trajectories using momentum in a quadratic loss, with different step sizes and different amounts of artificial noise added to the gradient. they vary over the course of training and across tasks. We build on this analysis to study training efficiency in Section 2.3. Then in Section 3 we empirically test the predictions in Section 2 and explore how the noise scale varies with dataset, model size, and learning paradigm (supervised learning vs RL vs generative models). Section 4 describes related work and Section 5 discusses the implications of these results and possible future experiments. + +# 2 Theory and Predictions for the Gradient Noise Scale + +2.1 Intuitive Picture + +Before working through the details of the gradient noise scale and the batch size, it is useful to present the intuitive picture. Suppose we have a function we wish to optimize via stochastic gradient descent (SGD). There is some underlying true optimization landscape, corresponding to the loss over the entire dataset (or, more abstractly, the loss over the distribution it is drawn from). When we perform an SGD update with a finite batch size, we’re approximating the gradient to this true loss. How should we decide what batch size to use? When the batch size is very small, the approximation will have very high variance, and the resulting gradient update will be mostly noise. Applying a bunch of these SGD updates successively will average out the variance and push us overall in the right direction, but the individual updates to the parameters won’t be very helpful, and we could have done almost as well by aggregating these updates in parallel and applying them all at once (in other words, by using a larger batch size). For an illustrative comparison between large and small batch training, see Figure 2. By contrast, when the batch size is very large, the batch gradient will almost exactly match the true gradient, and correspondingly two randomly sampled batches will have almost the same gradient. As a result, doubling the batch size will barely improve the update – we will use twice as much computation for little gain. Intuitively, the transition between the first regime (where increasing the batch size leads to almost perfectly linear speedups) and the second regime (where increasing the batch size mostly wastes computation) should occur roughly where the noise and signal of the gradient are balanced – where the variance of the gradient is at the same scale as the gradient itself 4. Formalizing this heuristic observation leads to the noise scale. The situation is shown pictorially in Figure 1. For a given model, we’d like to train it in as little wall time as possible (x-axis) while also using as little total computation as possible (y-axis) – this is the usual goal + +> 4Note that these considerations are completely agnostic about the size of the dataset itself. + +4of parallelization. Changing the batch size moves us along a tradeoff curve between the two. Initially, we can increase the batch size without much increase in total computation, then there is a “turning point” where there is a substantive tradeoff between the two, and finally when the batch size is large we cannot make further gains in training time. In the conceptual and experimental results below, we formalize these concepts and show that the bend in the curve (and thus the approximate largest effective batch size) is in fact set roughly by the noise scale. + +2.2 Gradients, Batches, and the Gradient Noise Scale + +We’ll now formalize the intuitions described in Section 2.1. Consider a model, parameterized by variables + +θ ∈ RD , whose performance is assessed by a loss function L (θ). The loss function is given by an average over a distribution ρ (x) over data points x. Each data point x has an associated loss function Lx (θ), and the full loss is given by L (θ) = Ex∼ρ [Lx (θ)] 5.We would like to minimize L (θ) using an SGD-like optimizer, so the relevant quantity is the gradient G (θ) = + +∇L (θ). However, optimizing L (θ) directly would be wasteful if not impossible, since it would require processing the entire data distribution every optimization step. Instead, we obtain an estimate of the gradient by averaging over a collection of samples from ρ, called a batch: + +Gest (θ) = 1 + +B + +> B + +∑ + +> i=1 + +∇θ Lxi (θ) ; xi ∼ ρ (2.1) This approximation forms the basis for stochastic optimization methods such as mini-batch stochastic gradi-ent descent (SGD) and Adam [KB14]. The gradient is now a random variable whose expected value (averaged over random batches) is given by the true gradient. Its variance scales inversely with the batch size B6: + +Ex1··· B ∼ρ [Gest (θ)] = G (θ)cov x1··· B ∼ρ (Gest (θ)) = 1 + +B Σ ( θ) , (2.2) where the per-example covariance matrix is defined by + +Σ ( θ) ≡ cov x∼ρ (∇θ Lx (θ)) = Ex∼ρ + +[ + +(∇θ Lx (θ)) ( ∇θ Lx (θ)) T ] + +− G (θ) G (θ)T . (2.3) The key point here is that the minibatch gradient gives a noisy estimate of the true gradient, and that larger batches give higher quality estimates. We are interested in how useful the gradient is for optimization pur-poses as a function of B, and how that might guide us in choosing a good B. We can do this by connecting the noise in the gradient to the maximum improvement in true loss that we can expect from a single gradient update. To start, let G denote the true gradient and H the true Hessian at parameter values θ. If we perturb the parameters θ by some vector V to θ − V , where  is the step size, we can expand true loss at this new point to quadratic order in : + +L (θ − V ) ≈ L (θ) − G T V + 12 2V T HV. (2.4) If we had access to the noiseless true gradient G and used it to perturb the parameters, then Equation 2.4 with + +V = G would be minimized by setting  = max ≡ |G|2 + +> GTHG + +. However, in reality we have access only to the noisy estimated gradient Gest from a batch of size B, thus the best we can do is minimize the expectation + +> 5In the context of reinforcement learning, the loss could be the surrogate policy gradient loss, and the distribution ρ +> would be nonstationary. +> 6This is strictly true only when training examples are sampled independently from the same data distribution. For ex-ample, when batches are sampled without replacement from a dataset of size D, the variance instead scales like (1 +> B−1 +> D +> ).For simplicity, we restrict ourself to the case where BDor where batches are sampled with replacement, but our conclusions can be altered straightforwardly to account for correlated samples. + +5Figure 3: Larger batch sizes yield estimated gradients that are closer to the true gradient, on average. Larger step sizes can be used when the estimated gradient is closer to the true gradient, so more progress can be made per step. Left: A large step size used with a small batch size can lead to instability, as illustrated for a quadratic loss. Right: Equation 2.6 predicts that the ‘turning point’ after which larger batch sizes become less helpful is the noise scale B, where the training speed drops to 50% of the maximum possible. + +E[L (θ − G est )] with respect to . This expected value can be evaluated using Equation 2.2: + +E[L (θ − G est )] = L (θ) − |G|2 + 12 2 + +( + +GT HG + tr( HΣ) + +B + +) + +. (2.5) Minimizing this equation with respect to  leads to: + +opt (B) = argmin E [L (θ − G est )] = max + +1 + Bnoise /B (2.6) as the optimal step size, which produces an optimal improvement in the loss from the noisy gradient: + +∆Lopt (B) = ∆Lmax + +1 + Bnoise /B ; ∆Lmax = 12 + +|G|4 + +GT HG . (2.7) Above, we have defined the noise scale as: + +Bnoise = tr ( HΣ) + +GT HG , (2.8) Note that our definition of the noise scale is independent of the size of the full training set. If we use a step size larger than twice opt , the loss may increase , leading to divergence, as illustrated in Figure 3. Despite the many unfounded assumptions in the above derivation, we will find that equations 2.7 and 2.8 provide a helpful guide to the behavior of large-batch training, even when using other optimizers (including momentum, Adam, and RMSProp). For a discussion of the dependence of the noise scale on the learning rate, see Appendix C on the ‘temperature’ of training. + +Implications and Simplifications + +Equation 2.7 implies that when the batch size is much smaller than the noise scale, B  B noise , the second term in the denominator dominates the first, so increasing the batch size B linearly increases the progress in loss. This is the small batch regime, where increases in batch size linearly speed up training. By contrast, when B  B noise , then the first term dominates, so that increasing B has almost no effect on the progress in loss. This is the large batch regime where increases in batch size do not speed up training and simply waste computation; the switch between the two occurs at B ≈ B noise (see Figure 3). The noise scale in Equation 2.8 requires some overhead to compute due to the presence of the Hessian H. We can estimate it by measuring ∆Lopt (B) using a series of line searches in the direction of a gradient measured 6with various batch sizes B and fitting the result to Equation 2.7. This allows us to estimate Bnoise as well as to empirically test whether Equation 2.7 actually fits the data (we discuss these local tests more in Section 3). The situation gets even simpler if we make the (unrealistic) assumption that the optimization is perfectly well-conditioned – that the Hessian is a multiple of the identity matrix. If that is the case, then Equation 2.8 reduces to: + +Bsimple = tr(Σ) + +|G|2 , (2.9) which says that the noise scale is equal to the sum of the variances of the individual gradient components, divided by the global norm of the gradient 7 – essentially a measure of how large the gradient is compared to its variance. It is also a measure of the scale at which the estimated and true gradient become close in L2 + +space (having non-trivial dot product) – the expected normalized L2 distance is given by: + +E + +[ + +|Gest − G|2] + +|G|2 = 1 + +B + +tr(Σ) + +|G|2 = Bsimple + +B , (2.10) In practice, we find that Bsimple and Bnoise typically differ only by a small constant multiplicative factor, particularly when we employ common training schemes that improve conditioning. In our empirical work we will sometimes compute Bnoise , but will primarily compute Bsimple instead, as it requires less computational expense. In Appendix A.1, we provide an extremely simple method to measure this simplified noise scale with negligible overhead in the context of data-parallel training. + +2.3 Predictions for Data/Time Efficiency Tradeoffs + +Thus far our analysis has only involved a single point in the loss landscape. But in Section 3 we will show that Equation 2.7 nevertheless predicts the dependence of training speed on batch size remarkably well, even for full training runs that range over many points in the loss landscape. By averaging Equation 2.7 over multiple optimization steps (see Appendix D), we find a simple relationship between training speed and data efficiency: + +SSmin + +− 1 = + +( EEmin + +− 1 + +)−1 + +. (2.11) Here, S and Smin represent the actual and minimum possible number of steps taken to reach a specified level of performance, respectively, and E and Emin represent the actual and minimum possible number of training examples processed to reach that same level of performance. Since we are training at fixed batch size 8, we have Etot = BS tot . We define the critical batch size by an empirical fit to the above equation, as + +Bcrit = Emin + +Smin + +. (2.12) Our model predicts Bcrit ≈ B noise , where Bnoise is appropriately averaged over training (see Appendix D). Note that the noise scale can vary significantly over the course of a training run, so the critical batch size also depends on the level of performance to which we train the model. The resulting tradeoff curve in serial time vs total compute has a hyperbolic shape represented in Figure 1. The goal of optimization is to reach a given level of performance with minimal S and E – but as depicted in Figure 1, there are tradeoffs involved, as very small S may require very large E, and vice versa. When we choose B = Bcrit , the two sides of Equation 2.11 are both 1, so that training takes twice as many passes through the training data as an optimally data-efficient (small-batch) run would take, and twice as many optimization steps as an optimally time-efficient (large-batch) run would take. + +> 7One might also use preconditioned gradients, obtained for example by dividing gradient components by the square root of the Adam optimizer’s [KB14] accumulated variances. We experimented with this but found mixed results. +> 8We discuss the benefits of dynamically varying the batch size in Appendix D + +72.4 Assumptions and Caveats + +The mathematical argument in the previous sections depends on several assumptions and caveats, and it is useful to list these all in one place, in order to help clarify where and why we might expect the quantities in equations 2.8 and 2.9 to be relevant to training: 1. Short-horizon bias: The picture in Section 2.2 is a strictly local picture – it tells us how to best improve the loss on the next gradient step. Greedily choosing the best local improvement is gen-erally not the best way to globally optimize the loss (see e.g. [WRLG18]). For example, greedy optimization might perform poorly in the presence of bad local minima or when the landscape is ill-conditioned. The critical batch size would then be reduced by the extent to which noise is beneficial. 2. Poor conditioning: In poorly conditioned optimization problems, parameter values often oscillate along the large-curvature directions rather than decreasing in a predictable way (see e.g. [Goh17] and Appendix E.1). This means that Equation 2.7 will not perfectly reflect the amount of optimization progress made per step. Nevertheless, we will see that it still accurately predicts the relative speed of training at different batch sizes via the resulting tradeoff Equation 2.11. 3. Simplified noise scale: As noted in Section 2.2, whenever we use the simplified noise scale (Equa-tion 2.9) rather than the exact noise scale (Equation 2.8), this number may be inaccurate to the extent that the Hessian is not well-conditioned. Different components of the gradient can have very different noise scales. 4. Learning rate tuning: The arguments in Section 2.2 assume that we take the optimal step size and maximize the expected improvement in loss, Equation 2.6. In practice learning rates are unlikely to be perfectly tuned, so that the actual improvement in loss (and thus the scaling of training with batch size) may not perfectly reflect Equation 2.7. However, by trying to choose the best learning rate schedules (or by simply doing a grid search) we can reduce this source of error. In addition, the noise scale depends strongly on the learning rate via a ‘temperature’ of training, though this source of error is small as long as the learning rate is reasonably close to optimal. We provide a more detailed discussion of this dependence in Appendix C. 5. Quadratic approximation: The Taylor expansion in Equation 2.4 is only to second order, so if third order terms are important, in either the distribution of gradient samples or the optimization landscape, then this may introduce deviations from our conceptual model, and in particular devi-ations from Equation 2.7. Intuitively, since parameter updates are local and often quite small we suspect that the previous two sources of error will be more important than this third one. 6. Generalization: The picture in Section 2.2 says nothing about generalization – it is strictly about optimizing the training loss as a mathematical function. Some papers have reported a “generaliza-tion gap” in which large batch sizes lead to good training loss but cause a degradation in test loss, apparently unrelated to overfitting [KMN +16, HHS17]. The arguments in Section 2.2 don’t exclude this possibility, but recent work [SLA +18] has found no evidence of a generalization gap when hyperparameters are properly tuned. Despite these potential issues in our conceptual model, we’ll show in Section 3 that the noise scale is overall a good empirical predictor of the critical batch size. Furthermore, we will see that most training runs fit Equation 2.11 remarkably well. + +2.5 Expected Patterns in the Noise Scale + +In the next section we will measure the noise scale for a number of datasets and confirm its properties. However, it is worth laying out a few patterns we would expect it to exhibit on general grounds: + +• Larger for difficult tasks: We expect B to be larger for more complex/difficult 9 tasks, because individual data points will be less correlated, or only correlated in a more abstract way. This may + +> 9To be clear, we do not expect this to be the primary difference between more and less difficult tasks. Other difficulty metrics such as the intrinsic dimensionality [LFLY18] appear to be unrelated to the amount of gradient noise, though it would be interesting if there were some connection. + +8apply both over the course of training on a given dataset (where we may pick the ‘low-hanging fruit’ first, leaving a long tail of more complicated things to learn) or in moving from easier to harder datasets and environments. In reinforcement learning, we expect environments with sparse rewards or long time horizons to have larger noise scale. We also expect generative models to have smaller + +B as compared to classifiers training on the same dataset, as generative models may obtain more information from each example. + +• Growth over training: B will grow when the gradient decreases in magnitude, as long as the noise + +tr(Σ) stays roughly constant. Since |G| decreases as we approach the minimum of a smooth loss, we would expect B to increase during neural network training. + +• Weak dependence on model size: The number of model parameters in the neural network cancels in the noise scale, so we do not expect B to exhibit a strong dependence on model size (at fixed loss). As discussed above, models that achieve better loss will tend to have a higher noise scale, and larger models often achieve better loss, so in practice we do expect larger models to have higher noise scale, but only through the mechanism of achieving better loss. + +• Learning rate tuning: The noise scale will be artificially inflated if the learning rate is too small, due to the ‘temperature’ dependence described in Appendix C. To get a useful measurement of the noise scale, the learning rate needs to be appropriate to the current point in parameter space. The first and last points can be exhibited analytically in toy models (see Appendix C), but we do not expect theoretical analyses to provide a great deal of insight beyond the intuitions above. Instead, we will focus on confirming these expectations empirically. + +2.6 Summary + +To summarize, our model makes the following predictions about large-batch training: + +• The tradeoff between the speed and efficiency of neural network training is controlled by the batch size and follows the form of Equation 2.11. + +• The critical batch size Bcrit characterizing cost/time tradeoffs can be predicted at the order of mag-nitude level by measuring the gradient noise scale, most easily in the simplified form Bsimple from Equation 2.9. + +• The noise scale can vary significantly over the course of a training run, which suggests that the critical batch size also depends on the chosen level of model performance. + +• The noise scale depends on the learning rate via the ‘temperature’ of training, but is consistent between well-tuned training runs (see Appendix C). + +# 3 Experiments + +We now test the predictions of Section 2 on a range of tasks, including image classification, language mod-eling, reinforcement learning, and generative modeling. The tasks range from very simple (MNIST) to very complex (5v5 Dota), which allows us to test our model’s predictions in drastically varying circumstances. Our central experimental test is to compare the prediction made by the gradient noise scale Bsimple for each task, to the actual limits of batch size Bcrit found by carefully tuned full training runs at an exhaustive range of batch sizes. The overall results of this comparison are summarized in Figure 4. We find that the gradient noise scale predicts the critical batch size at the order of magnitude level, even as the latter varies from 20 (for an SVHN autoencoder) to over 10 million (consistent with prior results reported in [BCD +18]). Details about the hyperparameters, network architectures, and batch size searches are described in Appendix A.4. Below we describe the individual tasks, the detailed measurements we perform on each task, and the results of these measurements. 9Figure 4: The “simple noise scale” roughly predicts the maximum useful batch size for many ML tasks. We define this “critical batch size” to be the point at which compute efficiency drops below 50% optimal, at which point training speed is also typically 50% of optimal. Batch sizes reported in number of images, tokens (for language models), or observations (for games). We show the critical batch size for a full training run, and the noise scale appropriately averaged over a training run (see Appendix D). Due to resource constraints, for Dota 5v5 we show the batch size used by the OpenAI Dota team as a lower bound for the critical batch size. + +Figure 5: Left: The optimal learning rate is displayed for a range of batch sizes, for an SVHN classifier trained with SGD. The optimal learning rate initially scales linearly as we increase the batch size, leveling off in the way predicted by Equation 2.7. Right: For a range of batch sizes, we display the average loss progress + +∆L (B) that can be made from a batch of size B via a line search, normalized by the measured ∆L (Bmax ).Early in training, smaller batches are sufficient to make optimal progress, while larger batches are required later in training. + +3.1 Quantities Measured + +In order to test our model we train each task on a range of batch sizes, selecting the optimal constant learning rate separately for each batch size using a simple grid search. Across a range of tasks, we produce the following results and compare with our model: + +• Optimal learning rates: When optimizing with plain SGD or momentum, we find that the optimal learning rate follows the functional form of Equation 2.6, as shown in Figure 5. For Adam and RMSProp the optimal learning rate initially obeys a power law  (B) ∝ Bα with α between 0.5 and 1.0 depending on the task, then becomes roughly constant. The scale at which the optimal learning 10 Figure 6: Training runs for a simple CNN classifier on the SVHN dataset at constant batch sizes. Small batch training is more compute-efficient (right), while large-batch training requires fewer optimizer steps (left). The turning point between time-efficient and compute-efficient training occurs roughly at B = 64 for the initial phase of training and increases later in training. rate stops increasing is generally somewhat smaller than the typical noise scale. (See Appendix E.2 for a potential explanation for this power law behavior.) + +• Pareto frontiers: For each batch size, we observe the number of optimization steps and total number of data samples needed to achieve various levels of performance. This allows us to visualize the tradeoff between time-efficiency and compute-efficiency as a Pareto frontier (see Figures 6 and 7). We find that Equation 2.11 fits the shape of these tradeoff curves remarkably well in most cases. + +• Critical batch size ( Bcrit ): We determine the critical batch size over the course of a training run by fitting the Pareto fronts to the functional form of Equation 2.11 (see Figure 7). This quantifies the point at which scaling efficiency begins to drop. In particular, training runs at batch sizes much less than Bcrit behave similarly per training example, while training runs at batch sizes much larger than Bcrit behave similarly per optimization step (see Figure 6). The critical batch size typically increases by an order of magnitude or more over the course of training. + +• Simple noise scale ( Bsimple ): We measure the simple noise scale of Equation 2.9 over the course of a single training run using the minimal-overhead procedure described in Appendix A.1. Note that some care must be taken to obtain a proper estimate of Bsimple due to its dependence on the learning rate via the ‘temperature’ of training. We find that the noise scale agrees between different well-tuned training runs when compared at equal values of the loss, so it can be accurately measured at small batch size (see Appendix C). We also find that, aside from some fluctuations early in training, + +Bsimple typically predicts the critical batch size at the order of magnitude level (see Figure 7). The noise scale also typically increases over the course of training, tracking the critical batch size. To obtain a single value for the noise scale representing a full training run, we average over a training run as described in Appendix D. + +• Full noise scale ( Bnoise ): For SVHN trained with SGD, we also measure the full noise scale Bnoise + +by performing line searches for gradients obtained by batches of varying size, then fit to the func-tional form 2.11 (see Figures 5 and 7). This is a somewhat better estimate of Bcrit but is less computationally convenient, so we choose to focus on Bsimple for the remaining tasks. + +3.2 Results + +We summarize our findings in Figure 4: across many tasks, the typical simple noise scale approximately predicts the batch size at which the returns from increasing scale begin to diminish significantly. Results for all of the tasks can be found in Appendix B. We provide a detailed discussion of our methods in Appendix A. 11 Figure 7: The tradeoff between time-efficiency and compute-efficiency can be visualized as a Pareto frontier. Each point on the diagram above (left) represents the number of optimizer steps and processed examples needed to achieve a particular level of classification accuracy. Fits to Equation 2.11 are also shown. + +Supervised Learning + +Basic image classification results are pictured in Figure 14. + +• SVHN We train a simple CNN image classifier on the extended SVHN dataset [NWC +11]. We display all three of Bcrit , Bsimple , and Bnoise for SVHN optimized with SGD in Figure 7. We find that Bnoise better-predicts Bcrit as compared to the more naive Bsimple . We compare to training using the Adam optimizer [KB14] in Figure 14, where Bsimple provides a very accurate prediction for Bcrit . + +• MNIST We train a simple CNN on the MNIST dataset [LC10] using SGD, and find that Bsimple + +roughly estimates Bcrit , though the latter is significantly smaller. + +• CIFAR10 We train a size 32 ResNet [HZRS15] with Momentum on CIFAR10 [Kri09] and find that + +Bsimple predicts Bcrit . + +• ImageNet We train a size 50 ResNet [HZRS15] with Momentum on ImageNet [DDS +09], and use a learning rate schedule that decays the learning rate three times during training. Due to the schedule, both Bsimple , and Bcrit change significantly during training (see Appendix C for a discussion) and must be measured separately at each learning rate. Results are pictured in Figure 10. We find that the noise scale varies from 2,000 to 100,000 in the main phase of training, which matches empirical work (e.g. [JSH +18]) showing that constant batch sizes up to 64 thousand can be used without a loss of efficiency. During the later fine-tuning phase of training, the noise scale increases further to hundreds of thousands and even millions, suggesting that even larger batch sizes may be useful at the very end of training. Our critical batch sizes are slightly lower (15k vs 64k) than those reported in the literature, but we did not use the latest techniques such as layer-wise adaptive learning rates [YGG17]. Overall we find that more complex image datasets have larger noise scales in a way that is not directly determined by dataset size. + +Generative Modeling + +The results for these tasks are pictured in Figure 9. + +• VAE and Autoencoder We train a VAE [KW13] and a simple Autoencoder on the SVHN dataset [NWC +11]; we were motivated to compare these models because VAEs introduce additional stochasticity. As expected, the VAE had larger Bcrit and Bsimple as compared to the Autoencoder, and both models had much lower Bsimple as compared to SVHN image classifiers. However, unlike most of the other tasks, for these generative models Bsimple was significantly smaller than Bcrit . + +• Language Modeling We train a single-layer LSTM for autoregressive prediction on the Billion Word dataset [CMS +13], and find good agreement between Bcrit and Bsimple . We also illustrate the 12 dependence on LSTM size in Figure 8, finding that the noise scale is roughly independent of LSTM size at fixed values of the loss, but that larger LSTMs eventually achieve lower values of the loss and a larger noise scale. + +Reinforcement Learning + +• Atari We train RL agents with the policy gradient algorithm A2C [MBM +16] on seven Atari games [BNVB12] (Alien, Beamrider, Breakout, Pong, Qbert, Seaquest, Space Invaders), with results pic-tured in Figures 11 and 12. The tradeoff curves generally agree well with the prediction of Equation 2.11, though they are somewhat noisy e.g. for Pong since we do not average over multiple seeds. For some Atari games, we find some consistent deviation from 12 at very small batch sizes (see e.g. Beam Rider in Figure 11). It would be interesting to study this phenomenon further, though this could simply indicate greater sensitivity to other hyperparameters (e.g. momentum) at small batch size. Overall, we see that patterns in the noise scale match intuition, such as Pong being much easier to learn than other Atari games. + +• Dota The OpenAI Dota team has made it possible to train PPO [SWD +17] agents on both Dota 1v1 and 5v5 environments (the latter being preliminary ongoing work). We vary the two hyperparameters batch size and learning rate on the existing code, experiment setup, and training infrastructure as described in [BCD +18]. The Dota 1v1 environment features two agents fighting in a restricted part of the map (although they are free to walk anywhere) with a fixed set of abilities and skills, whereas Dota 5v5 involves the whole map, 5 heroes on each side, and vastly more configurations in which heroes might engage each other. This is reflected in the higher noise scale for Dota 5v5 (at least 10 million) relative to Dota 1v1 – we suspect the higher diversity of situations gives rise to more variance in the gradients. Due to resource constraints we were not able to measure the Pareto fronts for Dota 5v5, and so we can only report the batch size used by the Dota team and the measured noise scale. Results for the tasks described above were generally within a reasonable margin of state-of-the-art results, though we did not explicitly try to match SOTA or use special algorithmic or architectural tricks. Our goal was simply to confirm that we were in a reasonably well-performing regime that is typical of ML practice. For the supervised learning and generative modeling tasks listed above, we have the option of using either training set or test set performance to compare different batch sizes. For the main results in this paper, we choose train set performance because it is what is directly predicted by our model, and because it is easier to measure in the presence of overfitting. The choice makes negligible difference in most of our experiments, either because they involve RL or because the datasets are large enough that we don’t overfit. On the small datasets MNIST, CIFAR10, and SVHN, overfitting makes measurement of test error more difficult, but we do measure the test error behavior in Appendix E.3, and both the Pareto fronts and critical batch size generally do not change much. The fact that the noise scale is consistent between well-tuned training runs suggests that the corresponding optimization trajectories are similar in some sense. In Appendix C we investigate this idea further and relate the noise scale to a characteristic ‘temperature’ of the training process. + +Model Size Dependence + +The definitions of the noise scales do not have any manifest dependence on the number of parameters in a model. We have conjectured that they will be roughly independent of model size at fixed values of the loss. LSTM language models provide a natural test case, since LSTM sizes can be scaled up without qualitatively altering model architecture. As shown in Figure 8, the simple noise scale appears to be roughly independent of model size at a fixed value of the loss. However, due to their higher performance and lower achieved loss, larger models eventually reach larger noise scales than smaller models. We do not have specific hypotheses for how the noise scale should vary with model architecture, but interesting results along these lines were recently obtained [SLA +18]. 13 Figure 8: We show the relationship between training perplexity and the simple noise scale (left) for a range of LSTM sizes on the Billion Word dataset. These results show that at fixed values of the loss, the noise scale does not depend significantly on model size . On the right we show the simple noise scale during training, plotted in terms of the number of tokens processed. After processing a given number of examples, larger models will tend to have a larger noise scale, but only as a consequence of having achieved smaller loss. + +# 4 Related Work + +A great deal of prior work has studied large-batch training, investigated versions of the noise scale, explored adaptive batch size and learning rate schedules, and demonstrated that large batch training can be effective on specific datasets. We attempt to summarize this work below. Recent papers have probed the limits of large batch training empirically, especially for ImageNet [GDG +17, YZH +17, JSH +18], in some cases using layer-wise adaptive learning-rates [YGG17]. More recent work has demonstrated that large batch training can also be applied to RL [AAG +18, BCD +18, SA18, HQB +18]. The use of second order optimization methods [BGM17] might increase the utility of data parallelism even further. A thorough review of large batch training and potential issues with generalization was provided in a very nice recent empirical study [SLA +18] done in parallel with this work. [GVY +18] also systematically studied large-batch training, though it did not tune the learning rate separately for each batch size. Other recent work has explored the impact of gradient noise on optimization speed and batch size selec-tion. [SZL13] connected gradient noise and the locally optimal step size to identify an adaptive learning rate. [MHB17] derived a sampling distribution for SGD, motivating our definition of ‘temperature’. [SL17] connected this temperature to the critical batch size, though they predict a dependence on dataset size which we do not observe. [SZT17] identified a signal-dominated and noise-dominated phase of training. [SKYL17] showed that decaying the learning rate and increasing the batch size have the same effect, motivated by the SGD training temperature. ([DNG17] also suggested increasing learning rate and batch size together, but with different motivation.) [IES +18] empirically investigated the role of gradient noise in reinforcement learning. The gradient noise scale in particular has also been studied in earlier work to aid in batch size selection. The noise scale itself is used implicitly in basic statistical techniques for sample size selection (see e.g. [Wik, NIS]). [BCNW12] implicitly uses the gradient noise scale for a theoretical analysis of batch size selection. [BCN16, DYJG16, BRH16] propose adaptive sampling methods based on the gradient noise scale in the context of neural network optimization. [YPL +17] analyzed the gradient noise scale for a particular class of functions and related it to the critical batch size, though it predicts a sharp change in learning speed with batch size rather than the smooth change we observe. [CWZ +18] theoretically analyzed the dependence of the gradient noise scale on network width for shallow or linear networks, though they find inconsistent empirical results on neural networks. [MBB17] found a formula for the optimization speedup in terms of batch size resembling ours, though their critical batch size depends on smoothness parameters of the loss rather than directly gradient noise. There has been a variety of work studying the Neural Network loss landscape and using it to draw conclusions about optimal training. Local properties of the loss landscape are not necessarily a good guide to overall optimal training [WRLG18]. The loss tends to be fairly smooth when interpolating between the start and end 14 of training [GVS14]. But noise may be useful early in training [NVL +15, YPL +17], perhaps because it leads to minima that generalize better [KMN +16]. A big-picture motivation for our work was to better understand the scaling of learning with computational and data resources; this question was addressed from the perspective of scaling the model size in [HNA +17]. Our key contributions include connecting the gradient noise scale to the speed of optimization with a simple model, as well as systematically measuring the critical batch size and noise scale for a variety of tasks. We also clarify the role of the training temperature in SGD and propose an optimal batch size schedule. + +# 5 Discussion + +We have shown that the simplified gradient noise scale Bsimple approximately predicts the actual point of diminishing return on batch size Bcrit on diverse problems where these quantities vary by six orders of magnitude. Furthermore, the tradeoff curve between total compute and optimization steps associated with changing the batch size has roughly the hyperbolic form predicted by our theory. Finally, our theory also roughly predicts how the optimal learning rate scales with batch size, although its predictions are not as precise. What does the validity of this theory mean, and in what way is it useful? At the level of a given task, it allows us to use the noise scale from a single run (even an only partially complete run with much smaller batch size, though see caveats about learning rate tuning in the appendix) to estimate the largest useful batch size, and thus reduces the extensive hyperparameter searches that are necessary to find this batch size by trial and error. It also tells us to expect that larger batch sizes will show diminishing returns in a predictable way that has the same form regardless of the task. Across tasks, it tells us that the largest useful batch size for a task is likely to be correlated to informal notions of the “complexity” of the task, because the noise scale essentially measures how diverse the data is (as seen by the model), which is one aspect of task complexity. We have argued that a specific formula characterizes the time/compute tradeoff between optimization steps and total data processed in neural network training: + +( Optimization Steps Min Steps − 1 + +) ( Data Examples Min Examples − 1 + +) + += 1 (5.1) From this relation we can identify a critical value of the batch size when training to a given value of the loss + +Bcrit (Loss) = Min Examples Min Steps Training at this critical batch size provides a natural compromise between time and compute, as we take only twice the minimum number of optimization steps and use only twice the minimum amount of data. The critical batch size represents a turning point, so that for B > Bcrit there are diminishing returns from greater data parallelism. Our main goal was to provide a simple way to predict Bcrit . We have shown that it can be estimated as + +Bcrit ≈ B simple (5.2) where the easily-measured Bsimple is the ratio of the gradient variance to its squared mean. Theoretical arguments suggest that a more refined quantity, the Hessian-weighted Bnoise of Equation 2.8, may provide an even better 10 estimate of Bcrit .The tradeoff curve of Equation 5.1 provides a remarkably good fit across datasets, models, and optimizers, and the approximate equality of Bcrit and Bsimple holds even as both quantities vary greatly between tasks and training regimes. We have established that as anticipated, both Bcrit and Bsimple tend to increase significantly during training, that they are larger for more complex tasks, and that they are roughly independent of model + +> 10 We have also investigated using gradients preconditioned by the Adam optimizer; the results were mixed. + +15 size (for LSTMs) at fixed values of the loss. We also saw that image classification has a significantly larger per-image noise scale as compared to generative models training on the same dataset, a fact that could have interesting implications for model-based RL. In the case of RL, while the noise scale for Dota was roughly a thousand times larger than that of Atari, the total number of optimization steps needed to train a Dota agent is not so much larger [BCD +18]. Perhaps this suggests that much of the additional compute needed to train more powerful models will be parallelizable. While Bsimple roughly matches Bcrit for all datasets, the ratio Bsimple /Bcrit can vary by about an order of magnitude between tasks. This may not be so surprising, since Bsimple does not take into account Hessian conditioning or global aspects of the loss landscape. But it would be very interesting to obtain a better understanding of this ratio. It was smaller than one for the Autoencoder, VAE, and for Dota 1v1, roughly equal to one for LSTMs, and greater than one for both image classification tasks and Atari, and we lack an explanation for these variations. It would certainly be interesting to study this ratio in other classes of models, and to further explore the behavior of generative models. Due to its crucial role in data-parallelism, we have focused on the batch size B, presuming that the learning rate or effective ‘temperature’ will be optimized after B has been chosen. And our theoretical treatment focused on a specific point in the loss landscape, ignoring issues such as the relationship between early and late training and the necessity of a ‘warm-up’ period. It would be interesting to address these issues, particularly insofar as they may provide motivation for adaptive batch sizes. + +Acknowledgements + +We are grateful to Paul Christiano for initial ideas and discussions about this project. We would like to thank the other members of OpenAI for discussions and help with this project, including Josh Achiam, Danny Hernandez, Geoffrey Irving, Alec Radford, Alex Ray, John Schulman, Jeff Wu, and Daniel Ziegler. We would also like to thank Chris Berner, Chris Hesse, and Eric Sigler for their work on our training infrastructure. We thank Joel Hestness, Heewoo Jun, Jaehoon Lee, and Aleksander Madry for feedback on drafts of this paper. JK would also like to thank Ethan Dyer for discussions. 16 A Methods + +A.1 Unbiased Estimate of the Simple Noise Scale with No Overhead + +In this section, we describe a method for measuring the noise scale that comes essentially for free in a data-parallel training environment. We estimate the noise scale by comparing the norm of the gradient for different batch sizes. From Equation 2.2, the expected gradient norm for a batch of size B is given by: + +E [|Gest |2] = |G|2 + 1 + +B tr(Σ) . (A.1) Given estimates of |Gest |2 for both B = Bsmall and B = Bbig , we can obtain unbiased estimates |G| 2 and S + +for |G|2 and tr(Σ) , respectively: + +|G| 2 ≡ 1 + +Bbig − Bsmall + +(Bbig |GBbig |2 − Bsmall |GBsmall |2) + +S ≡ 11/B small − 1/B big + +(|GBsmall |2 − | GBbig |2) . (A.2) We can verify with Equation A.1 that E [|G| 2] = |G|2 and E [S] = tr(Σ) .11 + +Note that the ratio S/|G| 2 is not an unbiased estimator for Bnoise 12 . It is possible to correct for this bias, but to minimize complexity we instead ensure that |G| 2 has relatively low variance by averaging over many batches. This is especially important due to the precise cancellation involved in the definition of |G| 2.When training a model using a data parallel method, we can compute |GBsmall |2 and |GBbig |2 with minimal effort by computing the norm of gradient before and after averaging between devices. In that case Bsmall is the “local” batch size before averaging, and Bbig is the “global” batch size after averaging. In practice, to account for the noisiness of |G| 2 when computed this way, we calculate |G| 2 and S on every training step and use their values to compute separate exponentially-weighted moving averages. We tune the exponential decay parameters so that the estimates are stable. Then, the ratio of the moving averages provides a good estimate of the noise scale. In our experiments we measure and report the noise scale during training for a single run with a well-optimized learning rate. Note that while the noise scale measurement is consistent between runs at different batch sizes, it is not consistent at different learning rates (see Appendix C). So, it is important to use a run with a well-tuned learning rate in order to get a meaningful noise scale measurement. + +A.2 Systematic Searches Over Batch Sizes + +When doing systematic measurements of how performance scales with batch size (Pareto fronts), we sepa-rately tune the learning rate at each batch size, in order to approximate the ideal batch scaling curve as closely as possible. We tune the learning rate via the following procedure. For each task, we performed a coarse grid search over both batch size and learning rate to determine reasonable bounds for a fine-grained search. The central value typically followed the form + +central (B) = ∗ + +(1 + B∗/B )α , (A.3) where α = 1 for SGD or momentum, and 0.5 < α < 1 for Adam [KB14] or RMSProp. Then, we performed an independent grid search for each batch size centered at central , expanding the bounds of the search if the best value was on the edge of the range. + +> 11 Note that when Bsmall = 1 and Bbig =n, this becomes the familiar Bessel correction nn−1to the sample variance. +> 12 In fact E[x/y ]≥E[x]/E[y]in general for positive variables, see e.g. https://en.wikipedia.org/wiki/ Ratio_estimator for details. + +17 We explain the motivation for Equation A.3 in Appendix E.2. But regardless of the theoretical motivations, we have found that this scaling rule provides a reasonable starting point for grid searches, though we are not suggesting that they produce precisely optimized learning rates. + +A.3 Pareto Front Measurements + +To produce the Pareto front plots, and thus to measure the important parameter Bcrit for a given dataset and optimizer, we begin by performing a grid search over batch sizes and learning rates, as de-scribed in Appendix A.2. With that data in hand, we fix a list of goal values – either loss, perplex-ity, or game-score. For example for SVHN in Figure 7 we chose the training classification error values + +[0 .2, 0.1, 0.07 , 0.05 , 0.035 , 0.025 , 0.015 , 0.004] as the goals. These were generally chosen to provide a vari-ety of evenly spaced Pareto fronts indicative of optimization progress. Then for each value of the goal, and for each value of the batch size, we identified the number of optimization steps and examples processed for the run (among those in the grid search) that achieved that goal most quickly. These optimal runs are the data points on the Pareto front plots. Note that at fixed batch size, different values of the learning rate might be optimal for different values of the goal (this was certainly the case for LSTMs on Billion Word, for example). Next, for each value of the goal, we used the optimal runs at each value of the batch size to fit Equation 2.11 to the relation between examples processed and optimization steps. Note that we performed the fits and extracted the errors in log-space. This was how we produced the lines on the Pareto front plots. Finally, given this fit, we directly measured Bcrit = Emin + +> Smin + +for each value of the goal, as well as the standard error in this quantity. This was how we produced the ‘Noise Scale Comparison’ plots, where we compared + +Bcrit to Bsimple . Errors in Bcrit are standard errors from the fit to Equation 2.11. When we report an overall number for Bcrit for a given dataset and optimizer, we are averaging over optimization steps throughout training. Note that it can be difficult to determine at what point in a training run the model’s performance reaches the specified target. For example, the loss may oscillate significantly, entering and exiting the target region multi-ple times. To remedy this issue, we smooth the loss using an exponentially-weighted moving average before checking whether it has reached the target. The decay parameter of this moving average can affect results noticeably. Though we choose this parameter by hand based on the noisiness of the model’s performance, this could be automated using an adaptive smoothing algorithm. + +A.4 Details of Learning Tasks + +We train a variety of architectures on a variety of ML tasks described below. We use either basic stochastic gradient descent (SGD), SGD with momentum [SMDH13], or the Adam optimizer [KB14] unless otherwise specified. We measure and report the noise scale Bsimple during training for a single run of each task with a well-optimized learning rate. + +A.4.1 Classification + +For image classification, we use the following datasets: + +• MNIST handwritten digits [LC10] + +• Street View House Numbers (SVHN) [NWC +11] + +• CIFAR10 [Kri09] + +• ImageNet [DDS +09] For CIFAR10 and ImageNet classification, we use Residual Networks [HZRS15] of size 32 and 50 respec-tively, based on the TensorFlow Models implementation [Goo]. All hyperparameters are unchanged aside from the learning rate schedule; Instead of decaying the learning rate by a factor of 10 at specified epochs, we decay by a factor of 10 when the training classification error (appropriately smoothed) reaches 0.487, 0.312, and 0.229. For MNIST and SVHN, we use a simple deep network with two sets of convolutional and pooling 18 layers (32 and 64 filters, respectively, with 5x5 filters), one fully-connected hidden layer with 1024 units, and a final dropout layer with dropout rate of 0.4. We train MNIST models using SGD, SVHN with both SGD and Adam [KB14] (with the default parameter settings momentum = 0 .9, β2 = 0 .999 ), and CIFAR10 and ImageNet with momentum [SMDH13] (with momentum = 0 .9). + +A.4.2 Reinforcement Learning + +For reinforcement learning, we use the following tasks via OpenAI Gym [BCP +16]: + +• Atari Arcade Learning Environment [BNVB12] + +• Dota 1v1 and 5v5 [BCD +18] For Atari, we use A2C [MBM +16] with a pure convolutional policy, adapted from OpenAI Baselines [DHK +17]. We train using RMSProp with α = 0 .99 and  = 10 −5. We roll out the environments 5 steps at a time, and vary the batch size by varying the number of environments running parallel. At the beginning of training, we randomly step each parallel environment by a random number of steps up to 500, as suggested in [SA18]. As described in [BCD +18] for Dota an asynchronous version of PPO [SWD +17] was used. The TrueSkill metric [HMG07] was used to measure the skill of an agent. Given the fact that the OpenAI Five effort is ongoing, the values for TrueSkill reported in this paper are incomparable with those in [BCD +18]; on this paper’s scale, TrueSkill 50 is roughly the level of the best semi-pro players. + +A.4.3 Generative and Language Modeling + +For language modeling, we train a size-2048 LSTM [HS97] on the One Billion Word Benchmark corpus [CMS +13], using byte pair encoding (BPE) [SHB15] with a vocabulary of size 40,000 and a 512 -dimensional embedding space. The LSTMs were trained with Adam using momentum 0.5, without dropout, with the gradients clipped to norm 10, and with 20 -token sequences. For both training and evaluation LSTM cell states were reset to zero between samples, and so we have reported perplexity for the last token of the 20-token sequences. We chose to report the batch size in tokens (rather than sequences) because we have found that when the number of sequences and the sequence lengths are varied, both Bsimple and Bcrit depend predominantly on the total number of tokens. We also trained 1024 and 512 size LSTMs for model size comparison; for the last we used a smaller 256 -dimensional embedding space. The model size comparison training runs were conducted with a batch size of + +1024 and Adam learning rate of 0.0007 . The learning rates were chosen from a grid search, which showed that the optimal learning rate did not have a significant dependence on model size. For generative image modeling, we train a Variational Autoencoder [KW13] using the InfoGAN architecture [CDH +16] (see their appendix C.2) on the SVHN dataset. Since VAEs introduce additional stochasticity beyond gradient noise, we also provide training data on a simple autoencoder with the same architecture. + +# B Results for All Tasks + +In Figures 9, 10, 11, 12, 13, and 14, we display the results of a series of training runs for for classifica-tion, reinforcement learning, and generative modeling tasks. On the left, we show tradeoff curves between compute-efficiency and time-efficiency. Each point on each tradeoff curve represents the number of optimizer steps and processed training examples necessary to reach a given level of performance for a particular train-ing run. Fits to the prediction of Equation 2.11 are shown. On the right, we compare the critical batch size, defined as the point where training is within 50% of maximum efficiency in terms of both compute power and speed, and compare to the simple noise scale Bsimple of Equation 2.9 and the true noise scale Bnoise of 2.8, when available. The results are summarized in Figure 4 and table 1. 19 Figure 9: Scaling behavior of generative and language modeling tasks. + +Figure 10: For ImageNet, the typical training schedule decays the learning rate by a factor of 10 at 30, 60, and 80 epochs [HZRS15, GDG +17]. To provide a fair comparison between batch sizes, we instead decay by a factor of 10 when the training classification error reaches 0.487, 0.312, and 0.229. We display Pareto fronts and compute the critical batch size separately for each span. 20 Figure 11: Scaling behavior of Atari tasks (Beam Rider, Breakout, Pong, and Space Invaders) trained with A2C. 21 Figure 12: Scaling behavior of more Atari tasks (Alien, Qbert, and Seaquest) trained with A2C. + +Figure 13: Scaling behavior for Dota 1v1 [Ope17] trained to top-level pro performance. 22 Figure 14: Scaling behavior of image classification tasks. 23 Critical Batch Size Simple Noise Scale + +Start Average Start Average + +Image Classification: + +MNIST 20 200 50 900 SVHN 50 500 300 4,000 CIFAR10 300 900 400 2,000 ImageNet 1,000 15,000 4,000 30,000 + +Generative and Language Modeling: + +Autoencoder (SVHN) 10 40 2 2Variational Autoencoder (SVHN) 10 200 10 10 Billion Word (per token) 700 100,000 1000 150,000 + +Reinforcement Learning: + +Atari (per frame) 100 - 1,000 400 - 8,000 100-1,000 1,000-20,000 Dota 1v1 (per frame) 50,000 3,000,000 100,000 300,000 Dota 5v5 (per frame) (not measured) >8,000,000 (est.) 100,000 24,000,000 Table 1: We report the simple noise scale, both early in training and averaged over a training run, as well as the critical batch size, both early in the run and at the end of the run. The noise scale provides a good estimate for the critical batch size throughout training. Batch sizes reported in number of images, tokens (for language models), or observations (for games). These data are summarized in Figure 4. 24 Figure 15: The noise scale is proportional to the inverse temperature. On the left we display results for SVHN optimized via SGD, while on the right we have an LSTM on the Billion Word dataset optimized via Adam. For each of the three curves, we modified either the learning rate , the batch size B, or both, so that the temperature B was decreased by a factor of 16 between epochs 1 and 1.5 (SVHN) or 0.02 and 0.03 (BW). In all cases we see that the simple noise scale increased by a factor of 16 , then returned to roughly its original value once , B were reset. + +# C Temperature and the Noise Scale + +The noise scale measured during neural network training could depend on a variety of hyperparameters, such as the learning rate  or momentum. However, we have empirically found that noise scale primarily depends on  and B roughly through the ratio + +T (, B ) ≡ max (B) , (C.1) which we refer to as the ‘temperature’ of training. The terminology reflects an idea of the loss as a potential energy function, so that high temperature training explores a larger range of energies. In the case of pure SGD it is approximated by T ≈ /B in the small batch regime. Our definition of T can then be obtained from a toy model of a quadratic loss, which is described below. In that case one can show explicitly [MHB17] that the equilibrium distribution of gradients is characterized by this temperature 13 .In equilibrium, the noise scales vary in proportion to the inverse temperature, so that + +Bnoise ∝ B simple ∝ 1 + +T . (C.2) It may seem surprising that higher temperature results in a smaller noise scale. The intuition is that at larger + +T the neural network parameters are further from the minimum of the loss, or higher up the ‘walls’ of the potential, so that the gradient magnitude is larger relative to the variance. Of course the loss landscape will be much more complicated than this toy model, but we have also observed that this scaling rule provides a good empirical rule of thumb, even away from pure SGD. In particular, when we decay the learning rate  by a constant factor, we often find that the noise scale grows by roughly the same factor. ImageNet training provides an example in Figure 10. A more direct investigation of the relation between Bsimple and T is provided in Figure 15. Since the noise depends primarily on the training temperature, and well-tuned training runs should have the same temperature at different batch sizes, the measured noise scale will also be consistent between optimally-tuned runs at different batch sizes. 14 . The noise scale then depends only on the temperature and the loss. + +> 13 This definition can also be motivated by the empirical results of [SKYL17], which show that decaying the learning rate and increasing the batch size by the same factor have the same effect on training in the small-batch regime. +> 14 This suggests that we can use the noise scale to define the temperature via Equation C.2. Then, once we have tuned the learning rate and measured the noise scale at small batch size, we can tune the learning rate at larger batch sizes to the noise scale constant. Though we have not investigated this idea thoroughly, it could significantly simplify the problem of learning rate tuning at large batch size. + +25 To summarize, the noise scale does not provide an optimal training temperature schedule, but it instead prescribes an optimal batch size at any given temperature. + +A Toy Model for the Temperature + +Now let us consider a simple explanation for the behavior of the noise scale in response to changes in the learning rate  and batch size B. We start by approximating the loss as locally quadratic: + +L (θ) = 12 θT Hθ + const . + +where we set θ = 0 at the minimum without loss of generality. To compute the noise scale, we need a model for the gradient covariance matrix Σ. A simple model appearing in [SZL13] suggests treating the per-example loss Li as a shifted version of the true loss, Li (θ) = L (θ − ci), where ci is a random variable with mean zero and covariance matrix Σc. The gradient covariance matrix is then given by Σ = HΣcH, which is independent of θ. The average gradient itself is given by G = Hθ , with θ changing in response to  or B. As shown in [MHB17] over sufficiently long times SGD 15 will approximately sample θ from the distribution + +pSGD (θ) ∝ exp + +[ + +− 12 θT M −1θ + +] + +where the matrix M satisfies + +M H + HM = B Σ. + +From these results, we can estimate the noise scale: + +Bsimple = tr(Σ) + +|G|2 ≈ B + +tr (Σ) tr ( H2Σ) + +Bnoise = tr(Σ) HGT HG ≈ B + +tr ( HΣ) tr ( H3Σ) + +So, in this model, the noise scale is expected to increase as we decrease the learning rate or increase the batch size. We also expect that scaling the learning rate and batch size together should leave the noise scale unchanged. When B  B simple , the ratio B plays the role of a “temperature”. Since our analysis was only based on a toy model optimized using pure SGD, one might not expect it to work very well in practice. However, as shown in Figure 15, we have found that it provides a quite accurate model of the dependence of the noise scale on  + +and B during neural network training, even when using the Adam 16 optimizer. For these tests, on SVHN we used an initial (, B ) = (0 .18 , 128) while for billion word results we used (, B ) = (6 × 10 −4, 128) .Note that this result relies on the assumption that the optimizer has approached an effective equilibrium. We expect the equilibration timescale to be larger in the directions of low curvature, so that this effect will be strongest when the gradient points mostly in the large-curvature directions of the Hessian. It would be interesting to investigate the timescale for equilibration. + +# D Dynamically Varying the Batch Size + +As one can see from Figure 4 and Section 3, both the measured Bnoise and Bsimple , as well as the empirical + +Bcrit fit to Equation 2.11 all increase by at least an order of magnitude during training. Thus its natural to ask if we should expect to improve efficiency by dynamically scaling the batch size B in response. We will see that the predicted gains are relatively modest unless the Bcrit changes greatly during training, although preliminary empirical tests suggest the benefits may be larger than predicted. + +> 15 With momentum, the same statements hold with replaced by / (1 −m). +> 16 Note that with β2= 0 .999 the Adam variance accumulators would take of order ∼1000 steps to fully react. On the right in Figure 15 we changed and Bfor 0.01 epochs, corresponding to between 100 and 1500 optimizer steps. + +26 D.1 Theory + +Consider a single full-batch optimizer step, over which the loss increases by an amount δL . If we instead use a batch of size B, it will take δS = 1 + B + +> B + +optimizer steps and δE = BδS training examples to make the same amount of progress, where B is the noise scale. Over a full training run, the total number of steps and data examples processed can be written as + +S = + +∫ ( + +1 + B(s) + +B(s) + +) + +ds (D.1) + +E = + +∫ + +(B(s) + B(s)) ds + +where we parameterize the training trajectory by the number s of full-batch optimizer steps (we abbreviated + +Smin above to s for notational simplicity). The question is how to optimally distribute the training examples over the full training trajectory. At each point along the trajectory, we have the choice of trading examples for optimizer steps by increasing or de-creasing the batch size. This “exchange rate” between examples and steps is + +r = − + +> ddB + +δE + +> ddB + +δS = B2(s) + +B(s) . (D.2) If the distribution of training examples (and hence the batch size schedule) is optimal, then transferring examples from one part of training to another should not save any optimization steps. This means that the exchange rate r should be constant throughout training . Thus the batch size should be varied in proportion with the square root of the noise scale: + +B(s) = √rB(s). (D.3) We can determine the resultant Pareto front parameterizing the tradeoff between training cost and time by inserting Equation D.3 into Equation D.1 and eliminating the exchange rate 17 + +Stot + +Smin + +− 1 = γ + +( Etot + +Emin + +− 1 + +)−1 + +, (D.4) where we define Smin ≡ ∫ ds and Emin ≡ ∫ Bds to be the minimum possible number of optimizer steps and training examples needed to reach the desired level of performance, obtained by inserting B  B and + +B  B respectively into D.3. We also define + +γ ≡ + +(∫ √Bds + +)2 + +Smin Emin + +, (D.5) which parameterizes the amount of variation of the noise scale over the course of training. When the noise scale is constant γ = 1 and there is no benefit from using an adaptive batch size; more variation in B pushes γ + +closer to 0, yielding a corresponding predicted improvement in the Pareto front. 18 Note that since γ involves the variation in the square root of B, practically speaking B must vary quite a bit during training for adaptive batch sizes to provide efficiency benefits via these effects. Adaptive batch sizes may also have other benefits, such as replacing adaptive learning rates [SKYL17] and managing the proportion of gradient noise during training. + +> 17 + +The exchange rate r is a free parameter. It can be chosen according to preference from the value of training time vs compute. There is also the fairly natural choice r = Emin + +> Smin + +at which we have Stot + +> Smin + += Etot + +> Emin + += 1 + √γ, so that cost-efficiency and time-efficiency are both within the same factor of optimal, corresponding to the turning point in Figure 16. + +> 18 + +To see explicitly the dependence of γ on the variability of the noise scale, we can rewrite it as γ = E[√B]2 + +> E[B] + += + +> 11+ σ2√B/E + +[√B]2 , where the expectation is over a training run, weighting each full-batch step equally. + +27 Figure 16: Left: We compare training using an adaptive batch size (data points) to the hyperbolic fit to Pareto fronts at fixed batch size (lines). We note a modest but visible improvement to training efficiency. Adaptive batch sizes appear to decrease the minimum number of optimization steps Smin , which was not anticipated by theoretical analysis. Right: Depending on the degree to which the noise scale varies over training, we can predict the potential Pareto improvement from using an adaptive batch size. + +D.2 An SVHN Case Study + +We have argued that a batch size of order the noise scale can simultaneously optimize data parallelism and total resource use. We have also shown that the noise scale tends to grow quite significantly during training. This suggests that one can further optimize resource use by adaptively scaling 19 the batch size with the noise scale as training progresses, as discussed above. For adaptive batch training, we can follow a simple and pragmatic procedure and dynamically set + +B = √rBsimple , (D.6) with Bsimple measured periodically during training. The results from dynamic batch training with this pro-cedure and various values of r are compared to fixed batch size training in Figure 16. We see that adaptive training produces a modest 20 efficiency benefit. We can combine our fixed batch size results with theoretical analysis to predict the magnitude of efficiency gains that we should expect from adaptive batch size training. We displayed Bcrit for fixed batch training of SVHN in Figure 7. We have found that these results are fit very well by Bcrit (s) ≈ 10 √s, where s is the number of steps taken in the limit of very large batch training. Using Equation D.5, we would predict the quite modest efficiency gain of + +γ = + +(∫ ds 4 + +√s)2 + +s ∫ ds √s = 24 25 (D.7) or around 4% . The benefits visible in Figure 16 in some cases appear too large to be fully explained by this analysis. In particular, the adaptive batch size seems to benefit training in the regime of large batch size, decreasing the minimum number of optimization steps Smin . However, our theoretical analysis would predict negligible benefits at large Etot /S tot . This may be due to the fact that the adaptive BS schedule also ‘warms up’ the learning rate, or it may be an effect of a larger and more consistent proportion of gradient noise during training. It would be interesting to disentangle these and other factors in future work, and to study adaptive batch size training on other datasets. + +> 19 This may have an additional advantage compared to training with a fixed, large batch size: it allows for a constant proportion of gradient noise during training, and some have argued [KMN +16, HHS17] that noise benefits generalization. +> 20 It’s challenging to provide a fair comparison between fixed and adaptive batch size training. Here we determined a roughly optimal relation =0.27 B +> 96+ Bbetween learning rate and Bfor fixed batch size training, and used this same function to determine the learning rate for both fixed and adaptive batch size training runs. This meant the adaptive batch size training used a corresponding adaptive learning rate. We did not experiment with learning rate schedules. + +28 Figure 17: Left: This figure shows the magnitude of the optimal step size in the direction of the parameter update divided by the magnitude of the actual update. Optimal step sizes are determined by a line search of the loss. We show training of two quite different models with different optimizers – an LSTM trained with Adam (momentum = 0.5) on Billion Word, and a CNN trained on SVHN with SGD. In both cases, training converges to an approximate steady state where the average update is about twice the optimal update. Right: Learning curves included to clarify that this phenomenon is not due to the cessation of learning. + +Figure 18: The gradient exhibits rapid, long-lived oscillations over the course of training, even when using adaptive optimizers such as Adam. These oscillations are typical when optimizing functions with a large Hi-erarchy in the Hessian spectrum. We measure the moving average of the gradient with decay 0.5, computing its correlations over time. Results are shown for a simple CNN trained on SVHN using the Adam optimizer. + +# E Comments on Optimization + +E.1 Deterministic Training Performs Poorly + +From the discussion of the noise scale in Section 2, we expect that at large batch size B  B noise we can obtain a very good estimate of the gradient. This would then suggest a minimally stochastic approach to training, where at each step one performs a line search of the true loss in the direction of the true gradient, and updates parameters accordingly. This nearly deterministic ‘greedient descent’ method performs poorly in practice [WRLG18]. While its first few steps tend to decrease the loss significantly, subsequent step sizes decrease rapidly and provide minimal further training progress. In fact, when training with a fixed learning rate, we have observed that training 29 often tends towards a regime where the optimal step size (determined by a line search in the direction of the parameter update) is almost exactly half of the actual update magnitude. We have found that this phenomenon occurs regardless of the learning rate (scanning over several orders of magnitude), and seems common to a variety of different models, as shown in Figures 17 and 18. A natural interpretation of these results is that large Hessian directions are dominating the update [GARD18], so that training involves rapid oscillations, as seen in Figure 18 (see [Goh17] for an intuitive picture of these oscillations). Because of this, line searches do not appear to be useful in determining the optimal step size for a full training run. + +E.2 Motivations for Learning Rate Scaling Rules + +The learning rate scaling rule from Appendix A.2 can be motivated as follows. Equation A.3 generalizes Equation 2.6, which was derived for plain SGD. The SGD linear scaling rule ( α = 1 ) means that the step size per data example stays fixed up to B∗; one might intuitively expect that this is necessary to avoid dimin-ishing returns as B increases. In the case of Adam 21 , we can use noise scale considerations to motivate the generalization to 0.5 < α < 1. The Adam update to the parameter θi takes the form + +δθ i =  Eβ1 [Gi] + +√Eβ2 [G2 + +> i + +] + Adam + +where Eβ refers to an exponentially-weighted moving average with decay parameter β, and G refers to a gradient from a batch of size B. If we disregard β1, β 2, and Adam , this is roughly equivalent to + +δθ i ≈  sign ( E [Gi]) + +√1 + si + +> E[Gi]2 + +. + +where si is the variance of Gi over timesteps. If the step-to-step noise in the gradient is primarily due to batch statistics, si should scale inversely with B. Comparing with 2.6, this implies a square-root scaling rule (α = 0 .5) to maintain a constant learning rate per data example. However, since β2 is often set to large values around 0.999, the second moment accumulator may not have time to adapt to quick changes in the gradient noise; this pushes α back towards 1.0. This may explain the variation in α between different tasks. + +E.3 Preliminary Tests of Generalization + +Throughout the paper we have studied the noise scale as a function of the training loss or RL score. But for non-RL tasks, it is also interesting to study the relationship between the noise scale and the critical batch size associated with minimization of the test loss. The difference between the training and test results provides information about generalization. In Figure 19 we report results using the test loss for the small image classification datasets. As expected, early in training there is no difference between train and test results. However, at the very end of training we observe a small dip in Bcrit for the test loss, which appears to occur consistently across datasets. It would be very interesting to further investigate this phenomenon in the future. + +> 21 These considerations apply equally well to RMSProp. + +30 Figure 19: Scaling behavior of image classification tasks, using test set goals rather than Train set goals. These results should be compared to those of Figure 14, which use training goals. 31 References + +[AAG +18] Igor Adamski, Robert Adamski, Tomasz Grel, Adam J˛ edrych, Kamil Kaczmarek, and Henryk Michalewski. Distributed deep reinforcement learning: Learn how to play atari games in 21 minutes, 2018, 1801.02852. 2, 14 [AH18] Dario Amodei and Danny Hernandez. AI and Compute, May 2018. URL https://blog. openai.com/ai-and-compute/ . 2 [ASF17] Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes, 2017, 1711.04325. 2 [BCD +18] Greg Brockman, Brooke Chan, Przemyslaw Debiak, Christy Dennison, David Farhi, Rafal Józefowicz, Jakub Pachocki, Michael Petrov, Henrique Pondé, Jonathan Raiman, Szymon Sidor, Jie Tang, Filip Wolski, and Susan Zhang. OpenAI Five, Jun 2018. URL https: //blog.openai.com/openai-five/ . 1, 2, 3, 9, 13, 14, 16, 19 [BCN16] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning, 2016, 1606.04838. 14 [BCNW12] Richard H. Byrd, Gillian M. Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical Programming , 134(1):127–155, Aug 2012. doi:10.1007/s10107-012-0572-5. 14 [BCP +16] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016, 1606.01540. 19 [BDS18] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis, 2018, 1809.11096. 2 [BGM17] Jimmy Ba, Roger Grosse, and James Martens. Distributed second-order optimization us-ing kronecker-factored approximations, 2017. URL https://openreview.net/forum?id= SkkTMpjex . 14 [BNVB12] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. CoRR , 2012, 1207.4708. 13, 19 [BRH16] Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates, 2016, 1612.05086. 14 [CDH +16] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. In-fogan: Interpretable representation learning by information maximizing generative adversarial nets, 2016, 1606.03657. 19 [CMS +13] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling, 2013, 1312.3005. 12, 19 [CWZ +18] Lingjiao Chen, Hongyi Wang, Jinman Zhao, Dimitris Papailiopoulos, and Paraschos Koutris. The effect of network width on the performance of large-batch training, 2018, 1806.03791. 14 [DDS +09] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hier-archical Image Database. In CVPR09 , 2009. URL http://www.image-net.org/papers/ imagenet_cvpr09.pdf . 12, 18 [DHK +17] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. Openai baselines. + +https://github.com/openai/baselines , 2017. 19 [DNG17] Aditya Devarakonda, Maxim Naumov, and Michael Garland. Adabatch: Adaptive batch sizes for training deep neural networks, 2017, 1712.02029. 14 [DYJG16] Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Big batch sgd: Automated infer-ence using adaptive batch sizes, 2016, 1610.05792. 14 [GARD18] Guy Gur-Ari, Daniel A. Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. 2018, arXiv:1812.04754. 30 32 [GDG +17] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour, 2017, 1706.02677. 2, 14, 20 [Goh17] Gabriel Goh. Why momentum really works. Distill , 2017. doi:10.23915/distill.00006. 8, 30 [Goo] Google. Tensorflow official models. URL https://github.com/tensorflow/models/ tree/master/official/resnet . 18 [GVS14] Ian J. Goodfellow, Oriol Vinyals, and Andrew M. Saxe. Qualitatively characterizing neural network optimization problems, 2014, 1412.6544. 15 [GVY +18] Noah Golmant, Nikita Vemuri, Zhewei Yao, Vladimir Feinberg, Amir Gholami, Kai Rothauge, Michael W. Mahoney, and Joseph Gonzalez. On the computational inefficiency of large batch sizes for stochastic gradient descent, 2018, 1811.12941. 14 [HHS17] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the gener-alization gap in large batch training of neural networks. 2017, 1705.08741. 8, 28 [HMG07] Ralf Herbrich, Tom Minka, and Thore Graepel. Trueskill TM : A bayesian skill rating system. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Process-ing Systems 19 , pages 569–576. MIT Press, 2007. URL http://papers.nips.cc/paper/ 3079-trueskilltm-a-bayesian-skill-rating-system.pdf . 19 [HNA +17] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kia-ninejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is pre-dictable, empirically, 2017, 1712.00409. 15 [HQB +18] Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Has-selt, and David Silver. Distributed prioritized experience replay, 2018, 1803.00933. 2, 14 [HS97] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput. ,9(8):1735–1780, November 1997. doi:10.1162/neco.1997.9.8.1735. 19 [HZRS15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR , 2015, 1512.03385. 12, 18, 20 [IES +18] Andrew Ilyas, Logan Engstrom, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. Are deep policy gradient algorithms truly policy gradient algorithms?, 2018, 1811.02553. 14 [JSH +18] Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, Tiegang Chen, Guangxiao Hu, Shaohuai Shi, and Xiaowen Chu. Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes, 2018, 1807.11205. 2, 12, 14 [KB14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014, 1412.6980. 5, 7, 12, 17, 18, 19 [KMN +16] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima, 2016, 1609.04836. 8, 15, 28 [Kri09] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009. URL https: //www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf . 12, 18 [KW13] Diederik P Kingma and Max Welling. Auto-encoding variational bayes, 2013, arXiv:1312.6114. 12, 19 [LC10] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http: //yann.lecun.com/exdb/mnist/ . 12, 18 [LFLY18] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes, 2018, 1804.08838. 8 [MBB17] Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of sgd in modern over-parametrized learning, 2017, 1712.06559. 14 33 [MBM +16] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforce-ment learning. CoRR , 2016, 1602.01783. 13, 19 [MHB17] Stephan Mandt, Matthew D. Hoffman, and David M. Blei. Stochastic gradient descent as ap-proximate bayesian inference. 2017, 1704.04289. 14, 25, 26 [NIS] Selecting sample sizes. URL https://www.itl.nist.gov/div898/handbook/ppc/ section3/ppc333.htm . 14 [NVL +15] Arvind Neelakantan, Luke Vilnis, Quoc V. Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks, 2015, 1511.06807. 15 [NWC +11] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011. URL http:// ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf . 12, 18 [OEGA18] Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine transla-tion, 2018, 1806.00187. 2 [Ope17] OpenAI. More on Dota 2, Aug 2017. URL https://blog.openai.com/more-on-dota-2/ .22 [PKYC18] Raul Puri, Robert Kirby, Nikolai Yakovenko, and Bryan Catanzaro. Large scale language mod-eling: Converging on 40gb of text in four hours, 2018, 1808.01371. 2 [SA18] Adam Stooke and Pieter Abbeel. Accelerated methods for deep reinforcement learning, 2018, 1803.02811. 2, 14, 19 [SHB15] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. CoRR , 2015, 1508.07909. 19 [SKYL17] Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le. Don’t decay the learning rate, increase the batch size, 2017, 1711.00489. 2, 14, 25, 27 [SL17] Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent, 2017, 1710.06451. 14 [SLA +18] Christopher J. Shallue, Jaehoon Lee, Joe Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training, 2018, arXiv:1811.03600. 8, 13, 14 [SMDH13] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning , volume 28, pages 1139–1147, 17–19 Jun 2013. URL + +http://proceedings.mlr.press/v28/sutskever13.html . 18, 19 [SWD +17] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR , 2017, 1707.06347. 13, 19 [SZL13] Tom Schaul, Sixin Zhang, and Yann LeCun. No more pesky learning rates. In ICML (3) ,volume 28 of JMLR Workshop and Conference Proceedings , pages 343–351. JMLR.org, 2013. URL http://jmlr.org/proceedings/papers/v28/schaul13.html . 14, 26 [SZT17] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information, 2017, 1703.00810. 14 [Wik] Sample size determination. URL https://en.wikipedia.org/wiki/Sample_size_ determination#Estimation_of_a_mean . 14 [WRLG18] Yuhuai Wu, Mengye Ren, Renjie Liao, and Roger Grosse. Understanding short-horizon bias in stochastic meta-optimization, 2018, 1803.02021. 8, 14, 29 [YGG17] Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks, 2017, 1708.03888. 2, 12, 14 34 [YPL +17] Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, and Peter Bartlett. Gradient diversity: a key ingredient for scalable distributed learning, 2017, 1706.05699. 14, 15 [YZH +17] Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes, 2017, 1709.05011. 2, 14 35 diff --git a/docs/evidence/reddit_deeprl_bootcamp_2017_75m5vd.md b/docs/evidence/reddit_deeprl_bootcamp_2017_75m5vd.md new file mode 100644 index 0000000..362a2b6 --- /dev/null +++ b/docs/evidence/reddit_deeprl_bootcamp_2017_75m5vd.md @@ -0,0 +1,15 @@ +Source: https://old.reddit.com/r/reinforcementlearning/comments/75m5vd/ +Title: Deep RL Bootcamp 2017 - Slides and Talks +Fetched-via: Reddit JSON API (limit=500, depth=10) +Fetch-status: verbatim + +# Deep RL Bootcamp 2017 - Slides and Talks + +**Posted by:** u/gwern | Score: 8 | 1 comments + +Link: https://sites.google.com/view/deep-rl-bootcamp/lectures + +## Comments + +**u/obsoletelearner** (score: 2): +thank you! diff --git a/docs/evidence/reddit_icml2017_tutorial_levine_6vcvu1.md b/docs/evidence/reddit_icml2017_tutorial_levine_6vcvu1.md new file mode 100644 index 0000000..1683a6a --- /dev/null +++ b/docs/evidence/reddit_icml2017_tutorial_levine_6vcvu1.md @@ -0,0 +1,18 @@ +Source: https://old.reddit.com/r/reinforcementlearning/comments/6vcvu1/ +Title: ICML 2017 Tutorial slides (Levine & Finn): Deep Reinforcement Learning, Decision Making, and Control +Fetched-via: Reddit JSON API (limit=500, depth=10) +Fetch-status: verbatim + +# ICML 2017 Tutorial slides (Levine & Finn): Deep Reinforcement Learning, Decision Making, and Control + +**Posted by:** u/gwern | Score: 12 | 1 comments + +Link: https://sites.google.com/view/icml17deeprl + +## Comments + +**u/[deleted]** (score: 4): +[deleted] + + **u/cbfinn** (score: 1): + Videos of ICML tutorials (as well as conference talks) will be posted by the conference staff at some point. Though, typically they take quite awhile to be released. diff --git a/docs/evidence/reddit_rl_debugging_tips_9sh77q.md b/docs/evidence/reddit_rl_debugging_tips_9sh77q.md new file mode 100644 index 0000000..85c9fa3 --- /dev/null +++ b/docs/evidence/reddit_rl_debugging_tips_9sh77q.md @@ -0,0 +1,78 @@ +Source: https://old.reddit.com/r/reinforcementlearning/comments/9sh77q/ +Title: What are your best tips for debugging RL problems? +Fetched-via: Reddit JSON API (limit=500, depth=10) +Fetch-status: verbatim + +# What are your best tips for debugging RL problems? + +**Posted by:** u/GrundleMoof | Score: 21 | 8 comments + +I've done a few RL toy problems, but I'm still pretty new to the field. In each of the problems I've done, there has been some point where it seems like I've implemented everything correctly, the environment is working correctly, etc, but it's still just not working, or is, with some really strange problem. + +RL seems to be harder to debug than any other type of programming I've done before. There's an element of randomness usually. It often takes a while (in the run) for the problem to manifest, so it's hard to pinpoint exactly *where* something is going wrong. Lastly, stuff just takes a while to even run, so my "attempt solution/code/evaluate" loop takes a long time, which makes it even harder. + +Does anyone have any tips? The things I've figured out so far are to log everything feasible, and to try to isolate things to find the problem, but those are pretty general tips. I've found some help a few times from reading relevant papers, but that's rarer. + +Do any experts in the field have any tips? + +## Comments + +**u/marcin_gumer** (score: 18): +Hi + +This is exactly what I was struggling with for a long time (and still am). RL agent modules are really closely interconnected. No matter which module has issue (neural net, Bellman backups, memory buffer, environment, pre-processing) it will immediately affect all other modules by feeding them bad data. Looking from outside it looks like big gooey mess. + +First, I'm not an RL expert, sorry if my advice sounds basic. Couple of things I have learned so far: + +* RL is very difficult to debug, especially when neural nets are involved +* DO NOT "try stuff" and run to "see if it works" - this approach doesn't work in RL - too many things need to happen exactly right to see any learning at all +* RL agent modules implementation - this is just good programming practices, but even more important in RL: + * most modules can be tested independently. Environment, neural net, RL backups, memory reply buffers all can be tested in isolation. + * I try to unit test everything, usually unit tests take more code than what they test + * I try to put asserts absolutely everywhere, input matrix dimensions (1d array may broadcast differently than 2d array etc.), input/output ranges (state/actions valid?), output matrix dimensions. Input/output data types (in Python np.ndarray behaves differently than np.matrix in some cases) +* Agent modules integration - generally stepping through code at least once after every change to confirm it is doing what I think it should be doing. It's a bit like programming a bomb detonator or something. Really make sure it is working correctly *before* running long experiment. +* Visualise as much as possible, log absolutely everything + * record and display agent observations/actions/rewards + * rewards should have some variance, if all rewards are always equal (e.g. always 0), then there is nothing to learn, it's environment or exploration issue + * record and display current q-function approximation across whole state space (works only on simple tabular problems and 2d continuous state spaces). + * pick couple states (can pick by running random policy) and plot predicted q-values over time for these states. q-values should change and stabilise. + * record inputs/outputs to/from every module (environment, neural net, memory buffer, etc.) + * neural networks are making everything 10x more difficult - I try to make agent work with linear approximator first (on small problem like mountain car), then when I know everything else is working, swap in neural net and try bigger problem. + * with neural nets, one can record gradients, individual neuron activations etc to evaluate if neural net is learning over time, even without access to loss function. Debugging neural networks: + * [https://www.deeplearningbook.org/](https://www.deeplearningbook.org/) "Practical Methodology" chapter + * [Understanding the difficulty of training deep feedforward neural networks](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf) + * [https://cs231n.github.io/neural-networks-3](https://cs231n.github.io/neural-networks-3/#baby)/ + * [https://deeplearning4j.org/docs/latest/deeplearning4j-nn-visualization](https://deeplearning4j.org/docs/latest/deeplearning4j-nn-visualization) +* Sometimes It helps to freeze random seed everywhere (numpy, tensorflow, python hashing, gym) and force single threaded CPU execution (to remove randomness from concurrent execution). This way you can reproduce runs exactly to full floating point precision and debug where these NaNs came from or why q-values exploded to infinity etc. +* I would be careful with reference implementations. Some work on hacked environments that are much easier than normal (seen it couple times in blog posts). Or have some "weird" reward function engineering. Or use older version of 3rd party library with easier version of environment. +* Try hyper parameters from reference implementation. + +Hope this helps! + + **u/AlexanderYau** (score: 1): + Wow, a lot of experience on debugging RL, I can't agree more on "DO NOT "try stuff" and run to "see if it works"". Have you ever published any paper on RL? + + **u/marcin_gumer** (score: 1): + I wouldn't say a lot of experience, just figuring things out as I go. I haven't published any RL paper. Currently just building portfolio on my [github.com/marcinbogdanski](https://github.com/marcinbogdanski). But there is not that much there yet, I implemented some algorithms from Sutton & Barto, currently working on DQN and Atari games, but it will take some time. + +**u/p-morais** (score: 7): +Here’s a good talk by John Schulman on just this: + +https://m.youtube.com/watch?v=8EcdaCk9KaQ + +**u/WhichPressure** (score: 5): +I think anyone who touch the RL has the same problem as you! Me too:P This guy wrote nice article about his adventure with some RL side project and he gave some tips. Maybe somehow it'll be helpful for you: +[http://amid.fish/reproducing-deep-rl?fbclid=IwAR1VPZm3FSTrV8BZ4UdFc2ExZy0olusmaewmloTPhpA4QOnHKRI2LLOz3mM](http://amid.fish/reproducing-deep-rl?fbclid=IwAR1VPZm3FSTrV8BZ4UdFc2ExZy0olusmaewmloTPhpA4QOnHKRI2LLOz3mM) + +**u/lmericle** (score: 4): +It's true, there's a lot of moving parts. I'm no expert but lately I've been experiencing similar setbacks. + +Typically I find the first place to look is the hyperparameters, and then to consider which particular optimization algorithm you're using and how it might explore the space/optimize toward suboptimal behavior. Next I'd consider the reward function and the interplay between that and the exploration/exploitation behavior of the optimization algorithm. Finally, consider where stochasticity is introduced into the problem -- perhaps there's too much, or not enough, or the stochasticity prevents convergence due to inadequate penalty terms (e.g. low entropy coefficient in PPO). + +**u/wassname** (score: 2): +Also checkout [this](https://old.reddit.com/r/reinforcementlearning/comments/7s8px9/deep_reinforcement_learning_practical_tips/) previous discussion. + +**u/[deleted]** (score: 1): +I usually just plot every metric, every layer weight on tensorboard and look for anomalies. + +&#x200B; diff --git a/docs/evidence/reddit_rl_practical_tips_7s8px9.md b/docs/evidence/reddit_rl_practical_tips_7s8px9.md new file mode 100644 index 0000000..e0c5d2f --- /dev/null +++ b/docs/evidence/reddit_rl_practical_tips_7s8px9.md @@ -0,0 +1,110 @@ +Source: https://old.reddit.com/r/reinforcementlearning/comments/7s8px9/deep_reinforcement_learning_practical_tips/ +Title: Deep Reinforcement Learning practical tips +Fetched-via: browser paste (user) +Fetch-status: verbatim + +# Deep Reinforcement Learning practical tips + +submitted 8 years ago by grupiotr | 14 points (90% upvoted) | 13 comments + +I would be particularly grateful for pointers to things you don't seem to be able to find in papers. Examples include: + +- How to choose learning rate? +- Problems that work surprisingly well with high learning rates +- Problems that require surprisingly low learning rates +- Unhealthy-looking learning curves and what to do about them +- Q estimators deciding to always give low scores to a subset of actions effectively limiting their search space +- How to choose decay rate depending on the problem? +- How to design reward function? Rescale? If so, linearly or non-linearly? Introduce/remove bias? +- What to do when learning seems very inconsistent between runs? +- In general, how to estimate how low one should be expecting the loss to get? +- How to tell whether my learning is too low and I'm learning very slowly or too high and loss cannot be decreased further? + +## Comments + +**u/wassname** (11 points): + +Resources: I found these very useful + +- Deep RL Bootcamp Lecture 6: Nuts and Bolts of Deep RL Experimentation (slides) and a written summary +- The 3 NIPS2017 Learning to run write ups contain practical advice from a competition +- Lessons Learned Reproducing a Deep Reinforcement Learning Paper +- Deep Reinforcement Learning that Matters - this gives you an idea of what does and doesn't matter +- Deep Reinforcement Learning Doesn't Work Yet (at least as well as the hype suggests) +- General deep learning tips from Slav Ivanov + +Lessons learnt: + +- log everything with tensorboard/tensorboardX: policy and critic losses, advantages, ratio, actions (mean and std), states, noise. Check values, check losses are decreasing etc. +- keep track of experiments with an experiments log (git commit messages with non-committed data or logs stored by date) +- clip and clamp: mistakes not obvious as they can cause values to blow up instead of NaN + - clamp all values, logarithmic values: `logvalue.clamp(-np.log(1e-5), np.log(1e-5))` + - watch out for dividing by a value: `1/std` should be `1/(std+eps)` where `eps=1e-5` + - clip gradients: `grad_norm = torch.nn.utils.clip_grad(model.params, 20)`, then log grad norm +- normalise everything: use running norms for state and reward; layer norms help +- check everything: plot and sanity check as many values as possible. Check initial outputs, inits, distributions, action range. +- think about step-size/sampling-rate: RL is sensitive to it (action repeat, frame skipping). Papers found skipping 4 Atari frames helped, repeating 4 actions in "Learning to Run" helped. + +Curves: + +- in PPO the std should decrease as it learns +- in actor-critic the critic loss should start converging then the actor loss follows +- watch for local minima where it outputs a constant action +- watch gradients for actor and critic; if much lower than 20 or much larger than 100 often run into problems (20 and 40 are where projects often clip gradient norm) +- run on CartPole and log same curves to see what healthy looks like + +Reward: + +- It's not the scaling factor that matters but the final value. Papers have gotten good results with rewards between 100-1000. + +Learning rate: + +- Use decaying learning rates, watch loss curves to see when they begin to converge. +- loss_actor will often initially increase while the critic is doing its initial learning (value function is a moving target). Focus on making the critic learning rate work first. +- Critic learning rates are often set higher, with larger batches. +- Use cyclical learning rate trick: slowly increase LR to find the min where model learns and max where it still converges. + +My own questions: + +- How do you know if you've set exploration/variance too high or low? +- Should you use a multi-headed actor/critic? Or separate networks? + +"What to do when learning seems very inconsistent between runs?" - This could be an init issue. Try to init so it defaults to reasonable action values even before training. + +--- + +**u/gwern** (8 points): + +I've seen similar engineering details & folklore, but mostly in slides/talks: +- https://www.reddit.com/r/reinforcementlearning/comments/6vcvu1/icml_2017_tutorial_slides_levine_finn_deep/ +- https://www.reddit.com/r/reinforcementlearning/comments/75m5vd/deep_rl_bootcamp_2017_slides_and_talks/ +- https://www.reddit.com/r/reinforcementlearning/comments/5i67zh/deep_reinforcement_learning_through_policy/ +- https://www.reddit.com/r/reinforcementlearning/comments/5hereu/the_nuts_and_bolts_of_deep_rl_research_schulman/ + + **u/twkillian** (1 point): I was about to post John Schulman's talk here as well. Great resource. + + **u/wassname** (1 point): Summarising the ones I hadn't seen: + - 5i67zh: fix random seed to reduce variance; think about step-size/sampling-rate; RL sensitive to optimizer choice (SGD, Adam) + - 6vcvu1: slides focused more on algorithm choice/design, not application tips + +--- + +**u/grupiotr** [OP] (5 points): + +John Schulman's talk wins, particularly: + +- rescaling observations, rewards, targets and prediction targets +- using big replay buffers, bigger batch size and generally more iterations to start with +- always starting with a simple version of the task to get signs of life + +--- + +**u/Kaixhin** (2 points): + +My first bit of advice is actually don't do RL. If the answer is still yes, find some other useful task for the network to do, like predicting something. Get supervised gradients flowing through your network. Training end-to-end on purely an RL signal is impressive, but adding easier learning signals can potentially help a lot. + +--- + +**u/grupiotr** [OP] (1 point): + +What turned out to be the game-changer (made my RL agents actually learn something) was **rescaling the reward from [-1, 1] to [0, 1]**. Thanks again to everyone that contributed! diff --git a/docs/evidence/reddit_rl_roadblocks_bzg3l2.md b/docs/evidence/reddit_rl_roadblocks_bzg3l2.md new file mode 100644 index 0000000..ca27bf0 --- /dev/null +++ b/docs/evidence/reddit_rl_roadblocks_bzg3l2.md @@ -0,0 +1,197 @@ +Source: https://old.reddit.com/r/reinforcementlearning/comments/bzg3l2/ +Title: How to *more intelligently* debug RL roadblocks? +Fetched-via: Reddit JSON API (limit=500, depth=10) +Fetch-status: verbatim + +# How to *more intelligently* debug RL roadblocks? + +**Posted by:** u/GrundleMoof | Score: 4 | 7 comments + +A while ago I [made this post](https://www.reddit.com/r/reinforcementlearning/comments/9sh77q/what_are_your_best_tips_for_debugging_rl_problems/) asking for tips on debugging when you run into a problem with RL. + +However, I think the majority of the advice can be summed up with: + +1) Test bits individually to make sure they're doing what they should + +2) Don't go down a rabbit hole of fiddling with hyperparameters + +3) Log/record/display everything, and "look for things that are acting funny" + + +and I just want to be clear that I'm not disparaging that advice, it's actually really good, I'm thankful, and I know I'm asking a tricky, general question! + +But I want to get to the "next level". I think I know the theory well enough, and I've successfully done a few toy problems, but I'm still here banging my head against the wall. + +I'll take a practical example I'm struggling with now: gym's `Pendulum-v0`, which has a continuous action space of [-2, 2], and three state variables (`(cos(theta), sin(theta), theta_dot)`). I'm trying to solve it with a fairly simple AC setup and PyTorch. I'm using the RMSprop optimizer, and 2 (or 3) fully connected NN layers, with 50 (or 100) units in each layer, to approximate pi (the policy) and V (the value function/baseline). + +To select the actions, like in [the A3C paper](https://arxiv.org/pdf/1602.01783.pdf), I have the pi NN have two outputs, mu and sd2 (the standard deviation squared). Every time step, I select an action `a` from a normal distribution with that mu and `sd**2`. Then, I calculate that `pi(a)` (just from the equation of a normal dist. with that mu, `sd**2`), and iterate the agent to get the reward from that time step. + +Also like the A3C paper (for the Pendulum problem), I'm doing all the updates at once, at the end of each episode (so it's basically MC with V as the baseline). For each time step (after the episode) I accumulate the rewards from t to t_max as `r_accum` (with gamma = 0.99), then say `V_loss = (r_accum - V_list).pow(2).sum()`. For the policy gradient, I do `policy_loss = -(torch.log(pi_list)*(r_accum - V_list)).sum()`, and then zero grads, backwards the losses, step the optimizer, etc. + +And I'm just not seeing any learning, going up to about 20k episodes. I'm plotting to TensorBoard (losses, rewards, weights, biases, gradients), but nothing is striking me as an obvious culprit. It gets varying rewards, the V_loss seems to decrease to 0, and the policy_loss usually kind of wanders but eventually goes to 0 (I think because it's also proportional to (r_accum - V_list) which is also going to 0). + +But I think this is a perfect learning example. This is doable (...right?), it seems mostly correctly set up, and it's probably a fairly simple fix if I knew how to diagnose it. For the more experienced RL'ers out there, where would you start? What would you look at? What would you verify is working correctly? + +Here are some of my guesses/notes: + +* I haven't actually seen any straightforward implementations of a vanilla PG algo solving Pendulum-v0. In the A3C paper, they add an LSTM to it. There are a bunch of DDPG papers online, but that's a pretty different story. I found one A3C that doesn't seem to have an LSTM, so I'll check that out. +* Do I need experience replay? Maybe the variance is just too high using essentially REINFORCE with this problem, so I need to be getting much better data efficiency (or running it for a ton longer) ? +* I was worried that maybe it was never actually getting to positions where it could get a high enough reward (to "motivate" it to reach those positions), but I plotted some trajectories and it's definitely getting up to the top (by swinging wildly anyway), where R = 0, so it's definitely experiencing them. + +Things I've tried (but maybe not systematically enough): + +* Different initial LRs +* Different optimizers +* Different number of hidden layers/units +* Shared pi/V NN body (with diff output layers) vs not +* Changing amount of entropy +* Adding correlated noise +* Using TD residual instead of MC version +* Clipping the gradient +* Different gamma values + +Anyway, I'd love it if anyone has any more general advice for how to think about and go about solving RL problems. I of course want to solve this one, but I want a more general way of thinking. + +## Comments + +**u/i_do_floss** (score: 3): +I dont have the answer for you, but I had an algorithm that was stuck on pendulum for a while and these eventually ended up being the issues: + +1. The environment is wrapped with a wrapper that kills the environment after 200 steps. I was ignoring that so I could use 1024 steps. So I ignored the "done" / "is-terminal" variable but I forgot to exclude it from my stored memories in my memory buffer so my updates were all wrong. + +I decided observing the predicted value and seeing if it was crazy (high variance) may be an indicator of an issue. + +Also I could use tensorboard and visualize runtime information so I could see what was going into my placeholders. + +2. My q value was shaped (none x 1), and my placeholders for rewards/terminals were shaped ( none) when I compared those in a tensor I ended up with a tensor shaped ( none , none) which didnt do what I expected + +I decided I could mitigate that type of issue in the future by writing my expected shapes of the networks in a notebook and checking if they match afterward using tensorboard. Some people use assert shape functions. + +Also, just so you know, I'm training soft actor critic in about 20 episodes of length 1024. I don't think you should wait for 1000s of episodes. + +Pendulum v0 is an easy environment for your algorithm to learn. I suggest sticking with these hyper parameters. If they don't work, it's probably your algorithm. + +policy network size: [64, 64] +batch size: 256 +gamma: 0.99 +adam optimizer +relu network activations (on every layer except the last one which has no activation) + +Lastly, make sure your action space allows your algorithm to output actions in the space of -2 to 2. + + **u/GrundleMoof** (score: 1): + > The environment is wrapped with a wrapper that kills the environment after 200 steps. I was ignoring that so I could use 1024 steps. So I ignored the "done" / "is-terminal" variable but I forgot to exclude it from my stored memories in my memory buffer so my updates were all wrong. + + So I currently have my agent as a wrapper for the gym env, and it returns a tuple of (reward, state_next, done), and I break on done. + + > I decided observing the predicted value and seeing if it was crazy (high variance) may be an indicator of an issue. + + Hmmm, by value, you mean the value function? And do you mean variance across different states, or the same state over time? + + > Also I could use tensorboard and visualize runtime information so I could see what was going into my placeholders. + > + > My q value was shaped (none x 1), and my placeholders for rewards/terminals were shaped ( none) when I compared those in a tensor I ended up with a tensor shaped ( none , none) which didnt do what I expected + > I decided I could mitigate that type of issue in the future by writing my expected shapes of the networks in a notebook and checking if they match afterward using tensorboard. Some people use assert shape functions. + + ahh yeah that's some good advice. I actually got burned by that earlier in this project, but figured it out by printing the sizes. PyTorch is a little tricky in that it will accept multiplying tensors of various combinations of sizes, with different results... so I should probably do asserts from now on. + + > Also, just so you know, I'm training soft actor critic in about 20 episodes of length 1024. I don't think you should wait for 1000s of episodes. + + Hmm, so right now I'm trying a pretty simple setup, just a policy gradient with a value function. I don't know much about SAC, but it seems more advanced. + + I was starting to get skeptical whether this setup could even learn a continuous action space problem like Pendulum-v0, because when I searched for stuff, almost everything I found was using at least DDPG or more complex. But then I found [this guy's project](https://github.com/MorvanZhou/pytorch-A3C), just A3C, and it solves it pretty quickly and reliably. + + I started going through his code and it's nearly exactly the same as mine. I thought that it's possible that using 4 workers has a "decorrelating" effect (like experience replay), so I changed his code to drop it to 1 worker, and it still works! So it's clearly something else and I haven't figured it out yet. It's so similar to mine though, both in terms of setup and hyperparameters... + + > Pendulum v0 is an easy environment for your algorithm to learn. I suggest sticking with these hyper parameters. If they don't work, it's probably your algorithm. + > + > policy network size: [64, 64] batch size: 256 gamma: 0.99 adam optimizer relu network activations (on every layer except the last one which has no activation) + + You mean, two hidden layers of size 64 each? And are you outputting a value function too? + + So, maybe I'm missing something here -- do you mean batches of episodes, or batches of steps? I'm using gamma = 0.9 or 0.99. I've tried Adam and RMSprop, no success with either... I'm using tanh activations, but that probably shouldn't change anything significantly, right? + + > Lastly, make sure your action space allows your algorithm to output actions in the space of -2 to 2. + + Yeah, my policy outputs a mu and sigma. The mu output is 2*tanh, so it's mapped to -2, 2, and the sigma one (actually sigma^2 ) is put through a softplus output. + + **u/i_do_floss** (score: 1): + Yes two hidden layers with 64 nodes. The value function is a third layer basically. + + Tanh on last layer makes sense for policy. What are you using on hidden layers and value function final layer? + + Also, have you tried different reward scales? + + **u/GrundleMoof** (score: 1): + Hi again, sorry for the delay! I was traveling with no service... + + I've tried a few different topologies. Right now I'm doing this: + + self.actor_lin1 = nn.Linear(3, 200) + self.mu = nn.Linear(200, 1) + self.sigma = nn.Linear(200, 1) + self.critic_lin1 = nn.Linear(3, 100) + self.v = nn.Linear(100, 1) + + and for my forward(): + + y = torch.tanh(self.critic_lin1(x)) + v = self.v(y) + + z = torch.tanh(self.actor_lin1(x)) + mu = 2*torch.tanh(self.mu(z)) + sd2 = softplus(self.sigma(z)) + 0.001 + return(v, (mu, sd2)) + + So I'm using tanh() for the nonlinearities as well. I'm adding that 0.001 to the sd2 because it keeps it from getting too small (which should be enforced by the entropy term anyway) and I've seen it done in a few formulations of this. + + I also tried with having the mu/sigma layers combined into a nn.Linear(200, 2) layer (which should be functionally the same I think), as well as having the mu/sigma and v outputs share the first nn.Linear(3, 200) layer before splitting off (which is different, the shared head thing, but I've used elsewhere and seen people use). + + I'm scaling the rewards in a way I've seen a bunch of other people do. Since the reward each step has the range [-16, 0], I'm normalizing it by doing (r + 8.0)/8.0, which should put it about in the range [-1, 1]. + + At this point I'm basically trying to replicate the guy's A3C implementation from above (minus the multiple workers part, but I ran his with 1 worker and it reliably improves every time). Mine *does* seem to improve, but really slowly compared to his, and sometimes seems to get worse after a while. Like, it's not not improving *at all*, just very slowly and also not reliably, which means something must be off. + + **u/i_do_floss** (score: 1): + tanh activations are really sensitive to the weights and bias initialization. Is he using tanh activations? + + tanh makes sense to me for the actor output. But I would probably use relu for the nonlinearities so the initialization is easier. + + tanh starts to experience issues when the inputs and outputs are too big. (bigger than like -.6 and 6.) + + **u/GrundleMoof** (score: 1): + by the way, just to give an example: + + [Here's an image of the reward per episode using his code](https://imgur.com/9K2clLs) + + [and here's mine](https://imgur.com/IMy6Rb5) + + Both with moving averages shown, to smooth it out. + + You can see that mine *does* improve, up to about episode 2000, but then gets worse. It pretty consistently does that. His on the other hand, always improves and stays good. + + To me, that indicates that it's almost there, but something's going on with the optimizer or something, like maybe it becomes unstable or something. But I'm using the same LR he is (2e-4), and I've tried both Adam (like him) and RMSprop. + +**u/i_do_floss** (score: 3): +Also, this only applies to pendulum v0, but it's a great first environment for spinning up an algorithm because you can graph the entire state space on a 2 dimensional plane (as an image). I graph my policy / q assessments in polar coordinates where radius is the velocity and theta is the angle of the pendulum. + +&#x200B; + +Here's a link to the code I use to do it. + +[https://github.com/DanielSmithMichigan/reinforcement-learning/blob/72b0e939c47234fd63c8459fdfa35f18d9053b49/soft-actor-critic/agent/Agent.py#L287](https://github.com/DanielSmithMichigan/reinforcement-learning/blob/72b0e939c47234fd63c8459fdfa35f18d9053b49/soft-actor-critic/agent/Agent.py#L287) + +&#x200B; + +Here is an album of screenshots I took while training a successful policy. + +&#x200B; + +[https://imgur.com/a/05C5vVa](https://imgur.com/a/05C5vVa) + +&#x200B; + +I hope that helps. Feel free to hit me up on discord if you want someone to talk through it with. I'm kind of in the same boat as you with regard to wishing I knew the better ways to debug RL algorithms. + +&#x200B; + +(Discord: Perseus#5383) diff --git a/docs/evidence/reddit_schulman_nuts_bolts_5hereu.md b/docs/evidence/reddit_schulman_nuts_bolts_5hereu.md new file mode 100644 index 0000000..62896c8 --- /dev/null +++ b/docs/evidence/reddit_schulman_nuts_bolts_5hereu.md @@ -0,0 +1,12 @@ +Source: https://old.reddit.com/r/reinforcementlearning/comments/5hereu/ +Title: "The Nuts and Bolts of Deep RL Research" (Schulman December 2016 slides) +Fetched-via: Reddit JSON API (limit=500, depth=10) +Fetch-status: verbatim + +# "The Nuts and Bolts of Deep RL Research" (Schulman December 2016 slides) + +**Posted by:** u/gwern | Score: 5 | 0 comments + +Link: http://rll.berkeley.edu/deeprlcourse/docs/nuts-and-bolts.pdf + +## Comments diff --git a/docs/evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md b/docs/evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md new file mode 100644 index 0000000..a9138ff --- /dev/null +++ b/docs/evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md @@ -0,0 +1,1014 @@ +Source: https://www.youtube.com/watch?v=8EcdaCk9KaQ +Title: Nuts and Bolts of Deep RL Research - John Schulman, Deep RL Bootcamp Lecture 6 (2017) +Fetched-via: YouTube auto-generated subtitles (local file) +Fetch-status: verbatim (auto-generated captions, no punctuation) +Compliance-note: Video cannot be downloaded; subtitles are the closest verbatim source. Primary document source is joschu_nuts_and_bolts.md + +so last year at nips I was slated to +give a talk at the deep RL workshop and +I wasn't sure what I was going to talk +about because everything I had prepared +I had already talked about it so many +times that I just didn't want to didn't +want to give another talk on it so I I +asked Peter for his advice on what I +should talk about and he said that +Entering had given the talk earlier in +the conference I called the nuts and +bolts of deep learning where he sort of +went through the flowchart of what you +do when you see a new problem and like +if you if you're overfitting you regular +eyes and if you're underfitting then you +use a bigger model and so on so so Peter +suggested to come up to write a talk +called the nuts and bolts of deep RL +research where I would talk about some +of the similar lessons and the tips and +tricks for the RL setting so I put +together a talk for that and actually +people seem to like it +so I'll give a slightly updated version +of that talk right now so I'm going to +talk about a few different things some +of which are general and sort of apply +to RL using reinforcement learning in +general and some of them pertain to +particular classes of methods like +policy grading methods and these are +just sort of little tips and tricks for +how you how you get your algorithm to +work and what you do day to day so let's +say you have a totally new problem +you're trying to solve like you have you +have some new tasks and you figured out +how to you defined an observation in an +action space and you have your neural +network policy or Q function but and you +want to start learning learning how to +solve it but you but you've never tried +it before +um so or okay or if you have a new +algorithm you're trying to get working +that you've never you you've never used +it before so so what do you do what's +the first thing you do if you have a new +algorithm +so so that I mean my first advice would +be to use the small problems so you can +run a lot of experiments really quickly +and do a hyper parameter search and it's +really useful too +to be able to visualize the learning +process in as many ways as possible so +look at the state visitation like how +that's evolving over time and look at +how well your value function is fitting +and so on so like I spent a lot of time +looking at the pendulum problem where +you're trying to swing up a pendulum +because this problem has a 2d state +space where it's just the angular and +the angular velocity of the pendulum and +I would visit visualize here's exactly +what the value function looks like +here's exactly what the state +distribution looks like and here's how +they evolve over time so I would get a +sense for like what's if my algorithm +isn't working is it because it's like +oscillating in some funny way or maybe +it's just giving a bad fit or maybe the +function it's learnt the value function +alerting isn't smooth enough and so on +so I would say try to visualize +everything and maybe use small problems +where you can visualize everything also +yeah it's useful to construct toy +problems where your idea is going to be +the strongest where you think okay if +this idea has any possibility of working +it's going to work there so for example +let's say you're trying to do something +with hierarchical reinforcement learning +then construct some problem where +there's some kind of obvious hierarchy +that it should learn and you'll be able +to tell if it's doing the right thing +also construct the the problems where +it's going to be weakest obviously and +also as a counterpoint to that don't +over fit your method to some contrived +problem so let's say you've come up with +some toy problem where your method is +really good then don't realize that it's +a toy problem and don't like tweak +everything to just work on this toy +problem perfectly because yeah it's also +pretty useful to have medium-sized +problems that you're very familiar with +and you know exactly how fast the +learning should be and what the reward +should be at every iteration and so on +so +a few problems that I use a lot like +training on pong Atari and the hopper +would the hopper like problem which is +this simulated robot problem with this +hopping robot and I know exactly how +fast an algorithm that's working should +learn on these problems so so I can sort +of it's it makes it easier to tune +things if you have okay that's if you +have a new algorithm let's say you have +a new task I would recommend just making +the task easier until you start seeing +some signs of life you see it learning +something so so there are various ways +you can make it easier you can try doing +some feature engineering so your input +features you think that the you think +that the policy should be a simple +function of your input features like +let's say you're trying to get pong to +work and you tried setting it up with +the images as input and you weren't +learning anything then you can set up +the problem where you pass in XY +coordinates as input and then try +running your algorithm and it's a much +simpler function you're trying to learn +so that's much more likely at work and +then you can try to make it harder and +harder until you're solving the full +problem another way you can make it +easier is by shaping the reward function +that means you if you come up with some +reward function that gives you fast +feedback code on whether you're doing +the right thing or not so let's say we +can define one task where we have this +reaching robot and we just give it a +reward if it reaches if it hits the +target so it gets a reward of one if it +hits the target in zero otherwise so +that might be hard to learn because +you're not getting any feedback as +you're flailing around but we could +define a better shaped reward function +where the where it's just distance to +target then learning is going to be much +faster in that problem there's also the +problem on exactly how to turn your +problem into a pom DP in the first place +so so often it's not clear what your +observation features should be and it's +not even clear what the reward function +should be so or it's not clear if this +problem you're trying to solve is +if it's feasible at all so so let's say +you're trying to solve you you have some +game or some robotics task or something +new like and you you want to turn it +into a reinforcement learning problem +but you're not sure if this is feasible +at all +so the first thing to do is to just +visualize a random policy acting on this +problem and see see what happens so if +the random policy occasionally does the +right thing then there's a high chance +of reinforcement learning is going to +work because bringing forth a policy +grading method is just going to take +this random behavior and it's going to +make the look the good behaviors more +likely +so it'll gradually like hone in on the +good behaviors whereas if you're never +doing the right thing then then there's +RL isn't going to get any signal that +tells it to do the right thing sometimes +RL is able to learn even though it seems +like it it's not clear how it's going to +learn like learning how to walk it's not +clear that that should work but because +you would think that like you really +have to have the whole thing in the +whole policy in place before it does +anything useful but as it turns out you +sort of learn to take one step and then +fall over and then take two steps and +then fall over and so on until you've +got a proper walking gait okay another +thing to do is to make to make your +observations make sure your observations +are useable try to look at them as a +human and see if you can control the +system using the same observations +you're giving to the agent so let's say +you're doing some pre-processing on your +images look at those pre processed +images yourself and make sure you're not +like losing too much detail when you +downsample them or losing too much or in +the color transformations and so on +another thing to do is you want to make +sure that everything is reasonably +scaled so that for example well as a +rule of thumb you usually want +everything to be mean 0 and standard +deviation 1 for the observations and for +the rewards well it's a little less +obvious but that's a reasonable +heuristic so +so you might want to like a scaler using +some kind of filter I mean that's that's +another good thing you can do but if you +don't want to mess with some kind of +filters on your observations and rewards +what you can do it you can just kind of +if you're allowed to define those +yourself then you might want to just +scale them yourself so what I'd +recommend doing is plot histograms of +all of your observations and your +rewards and make sure that for each +component of the observations and +rewards you've scaled it properly so +that it has the right mean insanity +deviation and it doesn't have crazy +outliers okay another thing to do is you +should have some good baselines that you +can use whenever you see a new when you +whenever you see a new problem so just +it's not clear which algorithm is going +to work beforehand so make sure you you +just have a bunch of a bunch of like +well tune things that you can run on +each problem yeah okay the question was +if you're gonna do some kind of reward +normalization should you do this over +your whole training like all of your +training data or just like the recent +data I would yeah that's a there's a lot +of subtlety there so I would say use all +of your data so far because you're +making everything non-stationary if you +do some kind of filtering actually I'm +going to talk about this at a later +slide so anyway I would recommend as +just a few baselines you should have a +cross and to be method some policy +grading methods some kind of cue +learning or sarsa type method there's a +lot of code online now that you can use +other people's code that that's already +written so you can use like we have this +open AI baselines repository and also +our L lab has a bunch of algorithms okay +another thing to do which people often +get tripped up on especially when +they're trying to reproduce published +work is so you implement the algorithm +based on the paper +and then it doesn't really learn +anything at all and then you think oh +maybe mike is my code like wrong or what +happened so I would say early on you +might need to run with more samples than +expected +so one hyper parameter that you can +usually adjust is how big of a batch +size to use or how many samples to use +and I would say sometimes you should use +more samples than you think you're going +to need because usually things just work +better when you have more samples almost +always so often sometimes when you're +trying to reproduce a published paper +you've got it mostly right but not +exactly right like maybe you haven't +scaled everything properly or there's +some like there's some really like +obscure hyper parameter that you have +wrong and then you just find that the +code doesn't learn anything so then I +would say just try to make it work a +little bit and then you can work from +there and try to tweak all the hyper +parameters to to get up to the like to +get fully up to the publish performance +but if you want to just get something +working at all often you need to use +bigger batch sizes and you thought +because if your batch size is too small +than the nor the noise will overwhelm +the signal and you won't learn anything +so like for example for TRP oh I wasn't +seeing any learning for a while and then +it turned out it's just because I was +using too small of a batch size and I +had to use a hundred thousand time steps +of a batch for the batch size but and +for Atari they for dqn the type of +parameters that were found to be best +where you update every ten thousand time +steps you update your queue function +every ten thousand time steps and you +have a 1 million time steps in your +replay buffer which is a lot okay so now +I'll talk about some guidelines for on +for the ongoing development and tuning +process as opposed to the initial +process of I have a totally new problem +or a new algorithm that I want to see +some signs of life on so +let's say you get something working I +recommend looking how sensitive your +algorithm is to every hyper parameter +and if it's too sensitive it it's not +actually a robust algorithm then you +shouldn't be happy with it you probably +just got luck lucky on that one problem +and it's it's actually kind of possible +to have a method that does that is a +fluke and it works in one way because +it's I mean one problem because of some +funny dynamics but then it doesn't work +in general so you kind of have to it +need some serious improvements so yeah +so that's okay there's also a few things +you can look at to see that actually I'm +going to talk about more of these kind +of Diagnostics a little later but there +are some indicators that'll tell you if +that if your algorithm is working +besides just looking at the final +performance but other in look for other +indicators that are going to tell you +that your optimization process is kind +of healthy so this is going to vary +based on the algorithm but for example +you can look at whether your value +function is actually accurate like +whether it's actually predicting returns +well you can look at how big the updates +are in terms of some either parameter +space or the output space standard +Diagnostics for deep networks like you +can look at norms of gradients and so on +okay one thing that takes some +discipline but is very useful is to have +a system for continually benchmarking +your code and that includes all of your +code not just the one thing you're +tuning right now because often it's easy +to tune your algorithm to work well in +one problem and then mess up the +performance on other problems and it's +really easy to overfit on single +problems when you're just adjusting +hyper parameters so I'd really recommend +having some kind of benchmark you can +run frequently and some kind of battery +of benchmarks that you've run +occasionally along as similar lines of +like overfitting of sort of reading +too far into noise or over interpreting +noise it's really easy to just to think +you're improving your algorithm or +you're making it worse but really you're +just seeing random noise so so you can +see seven different tasks these are the +Jim Moo Joko tasks like half cheetah and +hopper and so on and you have three +different algorithms here the red one +the green one and the blue one and you +can see ok let's I mean we can see that +the performance is a little different on +all the problems but let's it looks like +the the red let's see does the green +which one looks like the best well it +kind of varies by problem like the blue +one looks better on this problem and the +red one is worse on this problem and so +on but as it turns out these are all the +exact same algorithms and just random +seeds different random seeds so so it's +easy to imagine that you're just looking +at one of these problems then you see +that blue curve and you think you get +really excited than you think you found +some huge improvement to your algorithm +but it's really that you just got a +lucky seed that one run so yeah really +you've got to run your algorithm +multiple times an average and even if +you're averaging over a lot of seeds +like even if you had like 20 seeds here +there's a still a pretty big error bar +so it's yeah that makes it particularly +hard +I mean I'd recommend having like +multiple tasks and multiple seeds and if +you don't do that then you're probably +just overfitting unless you see a really +drastically large improvement another +thing to do is it's easy to keep adding +little modifications to your algorithm +until it gets really complicated and +then you're not sure and then you think +you have this really complicated +algorithm which is perfect but it turns +out that most of the things you did are +unnecessary because base some of the +tricks substitute for each other this is +often true because a lot of tricks help +because they're like normalizing things +in a better way or improving your +optimization like making +your optimization less susceptible to +like big spikes I don't know a lot of +different modifications you make have +similar effects so so often you you can +remove them and simplify your algorithm +and this is pretty important so it's +like especially with regard to changes +that do whitening these kind of these +kind of all substitute for each other +and also substitute for changes to your +optimization algorithm and yeah I would +and I would simplify things because it's +then it's more likely that your insights +will generalize to other problems and +also lastly it's pretty useful to +automate your experiments because +otherwise you're going to end up +spending all your day your whole day +just watching your code prints out +numbers and and it's actually really +it's it's really tempting to spend all +day doing that but I would I mean +especially if you need to run multiple +random seeds then it's then you you +really need to get your work flow down +so the year you're automating this +process and launching lots of +experiments at the same time so I'd +recommend just getting set up with one +of these cloud computing services so you +can just launch experiments on remote +instances and pull the results back when +you're done question oh yeah question is +you have a recommendation on what +framework to use to keep track of your +experiment results I personally use no +framework at all and I just have like +ipython notebooks and scripts that +collect a bunch of data that's stored in +various log files so I just have scripts +that read all my log files and plot them +I don't use some people like having +databases and stuff where they store all +their hyper parameter results but on I +think I don't find it necessary +personally okay so now I'm let's see I'm +going to talk about general tuning +strategies for RL and then after that +I'll talk about some specific tuning +strategies for different classes of +algorithms +okay so one thing is widening or +standardizing your data so if your +observations have unknown range you +should definitely standardize them I +would do that by computing a running +estimate of the mean and the standard +deviation and then just transform it Z +transforming it like this +and I would recommend computing the mean +and the standard deviation over all data +you've seen so far not just your recent +data because otherwise you're +effectively changing your data in some +way that the policy doesn't know about +like you have your that your policy +grading algorithm doesn't know about +like your policy grading algorithm is +actually optimizing some objective so +then if you just go and change the +problem out from under it then you're +often going to make things a lot worse +like if you rescale your observations +then your optimization algorithm didn't +know about that so you might just +collapse the performance so that's why I +would recommend using your whole all of +your data from the start of time so that +at least it's going to slow down over +time how fast it's how fast your +scalings are changing so yeah that's +what I would recommend doing with the +observations and for the rewards +I'd recommend rescaling it but not +shifting them because that affects the +agents will to live so if you if you +shift the mean reward that'll affect +whether how long it wants to survive +you're actually changing the problem ok +another yeah you might also want to try +to standardize prediction targets in the +same way though that's a little more +complicated to do using okay yeah so +question is what about pca widening +instead of just this element why scaling +yeah that could that could definitely +help I haven't I haven't experimented +with that but yeah that could help it's +hard to predict with like with neural +nets if it's going to help or not +because they seem to be pretty good at +disentangling things so I know that if +you have things that are terribly scaled +like they're from negative one thousand +two one thousand and other coordinates +are from negative point +point one then it's gonna be slow for +learning so this kind of scaling helps a +lot even though you're having their own +networks okay there's some parameters +that are really generally important like +discount factor that determines whether +you're that determines how long how far +away you're doing credit assignments so +whether you're paying attention to +effects that are delayed by a certain +time so if your discount is gamma equals +point 99 then you're basically ignoring +effects that are more delayed by a +hundred time steps so so you're kind of +short-sighted that gamma is controlling +your shortsightedness and you might want +to actually look at if how long that +corresponds to in real time so usually +in reinforcement learning you're sort of +discretizing time in a certain way and +it's worth paying attention to like is +that 100 time steps like three seconds +of real time or what and what happens +during that time also note that if you +have TD lamda kind of methods for either +for value function estimation or for +policy grading methods you can get away +with using a Lambda gamma that's really +close to one like 0.999 and things +aren't going to go unstable because if +you have a lower land of like 0.9 then +that's going to make it so the algorithm +is still stable even though gamma is +really close to one also okay so so as I +mentioned you might want to in in +practice we're usually discretizing some +continuous-time system so then it's +worth seeing if the problem can actually +be solved at this discretization level +so so for example in a game let's say +you're you're doing frame skip a meaning +that you repeat the action multiple +times as a human can you control it at +this rate or is it just impossible to +control is it just too like you're doing +the action too many times in a row and +you have to slow responses to control it +and I would also just look at the what +the random exploration looks like and if +you make sure that you're exploring like +the the +this Croatian is going to determine like +how far your Brownian motion goes +because if you're doing the same action +many times in a row then you're going to +be able to then you're going to tend to +explore further so so it's worth just +looking at what the random exploration +does and and choosing your time +discretization in a sensible way so that +it does interesting things question yeah +so the question is if you have a DQ n +how would you get started like tuning it +with tuning all the hyper parameters +actually I'm going to talk about DQ n +pretty soon so yeah I'll get to that +okay also look at the episode returns +very closely look at don't just look at +the mean look at the minimum and the +maximum so the maximum especially if you +have a deterministic system if you have +a certain maximum return that's +basically something that your policy can +hone in on pretty straightforwardly +because if if you just do that every +time then you're going to increase your +mean return to that level so so it's +worth so so it's useful to look at the +max return to see if your policy is ever +doing like the right thing according to +that max return or if it's just kind of +stuck and it's never discovering the +high return strategy also look at the +episode in length which is sometimes +more informative than the episode reward +like if because sometimes well yeah I +won't go into details on that like well +if you have a game you're it might mean +that like you might be losing every time +so you're never seeing yourself win but +the episode length will tell you if +you're losing slower so you might see an +improvement in episode length at the +beginning but not in reward okay for +Policy gradient there are specific +strategies or prediction there are +specific Diagnostics that are really +helpful so look at the entropy really +carefully if your entropy is going down +too fast that means your policy is +becoming deterministic and it's not +going to explore anything so +so be careful and also if it's not going +down your policy is never going to be +that good because it's always really +random else so you can sort of alleviate +this issue by using an entropy bonus or +a KL penalty so by stopping yourself +from move changing the policy the +probability distribution too fast as a +side effect you also prevent the entropy +from going down too fast when you use +the KL penalties I also look at the KL +as a diagnostic like look at how big of +an update you're doing in terms of KL +divergence if your KL is like 0.01 +that's a pretty small update but if it's +like a 10 that's a really big update +question oh yeah how do you question is +how do you measure entropy so so if you +have for most policies you can compute +the entropy analytically so if you have +a discrete action space then you usually +can just compute it analytically and if +you have a continuous policy you're +usually you're using a Gaussian +distribution or something so you can +compute the differential entropy +analytically so here we're talking about +entropy in action space so the average +over state space of the action space +entropy what you actually might care +about even more is the entropy in state +space but you have no hope at actually +calculating that except maybe to do some +really crude approximation of it +okay yeah so KL is really useful look at +explain variants like whether your value +function is actually explaining is +actually a good predictor of the returns +or if it's just worse than predicting +nothing so if you just predict zeroes +then your explained variance is zero but +sometimes if you have some neural +network that's predicting then you find +that it's actually negative because it's +overfitting or it's just noisy and it's +not doing anything useful so that +probably means you need to tune some +hyper parameters so that your neural +networks actually predicting better than +the constant predicting zero question +okay yeah question is why does the KL +spike give you a loss in performance +well it doesn't always be a lot it's not +always a loss in performance sometimes +it's a gain in performance but in +practice it's usually a loss in +performance because it usually the +approximation that your policy gradient +is just taking you way outside the +region where your local approximation to +the policy performance is accurate so +you're you're probably just overshooting +like if you take your policy and you +take a really big step in any direction +you're probably making it worse so so +that's so usually if you take a big step +you're getting worse like if you have a +convex function if you take a big step +in any direction you're probably going +to make it worse let's see okay +initialize your policy that's pretty +important more important than in +supervised learning because that in +determines what data you're going to see +initially and you're going to learn from +at the beginning so I would recommend +using have initializing the final layer +to be either zero or really small so +that at least you you have the maximum +and you sort of explore randomly at the +beginning we randomly at the beginning +as opposed to having some kind of +particular like policy that has a strong +opinion on the right thing to do which +is based on no information at all okay +that's for Policy gradient for Q +learning so a few thing a few things one +is okay you often it helps to have a +really big replay buffer and to be able +to do this you need to be a little +careful about memory usage so it's worth +putting in the extra effort to do that +learning rate schedules are often quite +helpful here in practice as our +exploration schedules so in qdq any +you're usually using epsilon greedy and +it often helps to do to play with the +schedule on that also it converges +pretty slowly and it has a miss +serious warmup period at the beginning +often so so sometimes you just so I +actually have a lot of admiration for +the authors who originally got this the +people people who got this to work +originally because they had to just let +their code run for a while before it did +anything so so you have to have a lot of +patience - a lot of bravery to do that +ok this is just miscellaneous advice for +not necessarily for tuning algorithms +but just for for personal development so +I recommend reading older textbooks and +theses not just the latest conference +papers because often they up in them +like there are more dense source of +useful information whereas each +conference paper just has one idea ok +yeah don't get too stuck on problems +because often you actually have a +legitimately good algorithm but it's has +like some flaws so its might fail +miserably at some easy problem so in RL +there's some like simple problems like +cart will swing up where you have this +stick and you're trying to swing it up +by moving the cart around and this +problem like you might have a great +algorithm but it's gonna in my but like +some of the state-of-the-art algorithms +are gonna fail on that problem unless +you really tuned them carefully and +that's just because maybe it's not +exactly the right problem to start to I +mean maybe like the thing that makes +this problem hard is not the thing that +your algorithm is doing that's +interesting so you might have like come +up with a better policy grading method +but still it'll converge to the same +local minimum on that swing up problem +and you're not gonna fix that problem so +I I would say just don't get too stuck +on a single problem that your method +bails on and enough in like maybe the +ultimate algorithm will solve all of +these problems but we're not there yet +so you might as well just try to improve +and some like decently large subset of +problems so also like one funny thing is +the dqn performs pretty poorly on a lot +of problems especially with continuous +control +I think it does I mean for cartful it +probably solves it pretty well if with a +reasonable amount of tuning but some of +the other like fairly small continuous +control problems it fails on but that +doesn't mean it's like that doesn't mean +it's a bad algorithm because it solves a +different problem extremely well so yeah +I would say just these these things are +at least right now it's not gonna you +shouldn't expect to be able to solve +everything with the same method without +any tuning also techniques from +supervised learning often don't transfer +over to reinforcement learning so so +don't be surprised if you find that I +guess that's not I said this slide was +gonna be a bad personal development +that's not about personal development +but yeah I guess this is just a grab bag +of miscellaneous advice so yeah so like +Bachner a lot of people look at what +people are doing in RL and they think +why aren't you using batch norm or drop +out or or big networks why are you using +like two layers of 64 units and it's not +like people didn't think of trying these +other things they tried them and then +they found that those other like +architectures and methods don't actually +help here I mean if you figure out how +to make batch norm and drop out actually +help in RL that'll actually be really +great and a big that would be a big +development but yeah I don't know it's +not totally straightforward all right +that's all thank you +okay I have a few minutes for questions +yeah so the question is how long do you +wait until you've decided that your +algorithm is new at work either because +your code is wrong or it's just too hard +I don't have a good general answer to +that I think the problem is worse for +some algorithms and others I'd say for +policy gradient methods you don't see +that burnin period as much like often if +it's going to learn it'll learn it at +the beginning but that's not always true +either I mean sometimes it will kind of +take some time to get into the right +numerical regime so I don't yeah I don't +have general advice I would say you have +to just I would say go back and start +with the easy problems and you'll get +some intuition about whether you're you +should expect a you should expect a burn +in period or not where it's not learning +anything see I want to get some people +in the back because okay oh yeah +question is - do I use unit tests I use +unit tests for code that where there's +it's doing a very particular +mathematical thing that you can actually +write a test for like let's say I'm +computing the KL divergence then I'll +write a test to check I don't know their +various ways of testing it so and it's +easy to get those things wrong like you +have it's I don't know as you're off by +a constant or something so yeah I would +write tests for I write tests for things +where it's nothing that there's a very +well-defined correct thing to do it's +harder to write it for an algorithm +where it has a lot of different moving +parts where you it's not clear how fast +it should learn and it's also there's +some randomness involved +so if you try to write a test saying I +should be at performance 100 after this +many iterations it might fail just out +of random noise but yeah I think +probably unit tests are a good idea oh +yeah so the question is do I have +guidelines on matching the algorithm to +the task like when to use policy +gradients versus a value iteration style +method it's yeah it's hard to give some +general guidelines I think people have +found that and and the guidelines I give +you might just be just be kind of +historical accidents like someone got +this to work here and this to work there +so I think the well certainly if you +don't care that much about sample +complexity policy gradient methods are +are probably are probably the way to go +if you don't care about sample +complexity or using off policy data then +policy grading methods are probably the +safest bet because you I don't know it's +more understandable exactly what it's +doing it's just doing gradient descent +whereas q-learning it's a little bit +indirect what it's doing so it's and it +in practice is more finicky yeah if you +do care about sample complexity though +or need off policy data then hue +learning is usually better or yeah or a +few students like sample complexity is +relevant if your simulator is expensive +of course I would also say that people +have found that dqn and relatives have +worked well on game-like tasks with +images as input whereas policy grading +methods work better on the continuous +control tasks like these robotic +locomotion problems though that this +might not be fundamental it might be +more of a historical accident let's +oh yeah recommendations on older +textbooks let's see there's like brutes +a cuss for take us as books +that's approximate dinette what is it +optimal control and why am i blanking on +the name optimal control and dynamic +programming something like that and the +set I mean sudden embargo is a good one +to read butterman has a textbook kind of +a classic textbook on Markov decision +processes that's in the RL space then +there's books on numerical optimization +that are good and yeah I'd say obviously +the machine learning textbooks have a +lot of good material that might be +useful in the RL setting too +oh yeah can I comment on evolution +strategies and the blog posts the the +opening I blog post on it let's see do +you have any specific questions about it +or like how it compares +oh yeah okay yeah yeah so there's +there's a lot of policy grading methods +out there and some of them are quite +complicated so we've had a couple of +talks on them so far like all these +different work +it's excessively more complicated policy +grading methods but then there's this +old algorithm called evolution +strategies which is an extremely simple +algorithm and and there's a paper by +some of my colleagues where they show it +was called evolutionary strategies as a +scaleable alternative to reinforcement +learning which really meant like to +policy grading methods so and they +claimed that it worked basically as well +as policy grading methods or at least +it's sort of in the same order oh and +beer is one of the authors of that paper +so the claim was that it works it works +like similarly well to policy grading +method so why should we bother with +these policy grading methods if es works +just as well well I think in practice it +works well it works but it works not not +as well like it's it takes me the sample +complexity is is is worse by some +constant factor or it's not clear that +it's a constant factor or if this factor +scales with the size of the network but +it's it is a lot it is significantly +slower and the question is just what is +that constant factor so is that constant +factor like one or is it three or is it +10 or 100 so that's not that's going to +vary between problems and also the that +paper had some innovations in exactly +how to parameterize the networks and so +forth that made everything better +numerically everything better scaled so +that yes did work well but I would say +that if you that it's usually quote like +I don't know it's usually a pretty +decent constant factor slower than +policy grading methods especially the +more advanced ones like the PPO and +actor so so i'm i think it's it's not +really a clear win in the RL setting +where policy gradients work I think if +policy gradients work it's usually going +to be a lot better +and the es is going to be is going to be +better on problems where policy +gradients aren't going to work for some +reason like if you've got really long +you depend time dependencies where the +discounts are gonna are gonna ignore +them +then es might be less sensitive to that +let's see I think I'm okay last question +oh yeah favorite hyper parameter +optimization framework I've used some of +these than I just like to use the +uniform random sampling yeah that works +really well I mean you just run a bunch +of experiments with random hyper +parameters and then you just look at the +results the next day and to do some +regression to figure out which +parameters actually mattered and then +you've run another experiment with +better parameter ranges and so on so I +use the human version of it because +often it's just - it's like a it's it's +useful to be able to look at the results +yourself - to get some to figure out +which parameters actually matter so +you're not wasting a lot of computation +because that information transferred +between problems all right +[Applause] \ No newline at end of file diff --git a/docs/evidence/slavv_37_reasons_nn.md b/docs/evidence/slavv_37_reasons_nn.md new file mode 100644 index 0000000..69cb76d --- /dev/null +++ b/docs/evidence/slavv_37_reasons_nn.md @@ -0,0 +1,272 @@ +Source: https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607 +Title: 37 Reasons why your Neural Network is not working - Slav Ivanov (2017) +Fetched-via: curl https://r.jina.ai/https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607 +Fetch-status: verbatim + +Title: 37 Reasons why your Neural Network is not working + +URL Source: https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607 + +Published Time: 2017-07-25T08:13:45Z + +Markdown Content: +[![Image 1: Slav Ivanov](https://miro.medium.com/v2/resize:fill:32:32/1*EkrMhH3YffQBM18wAoHTTw.jpeg)](https://medium.com/@slavivanov?source=post_page---byline--4020854bd607---------------------------------------) + +10 min read + +Jul 25, 2017 + +The network had been training for the last 12 hours. It all looked good: the gradients were flowing and the loss was decreasing. But then came the predictions: all zeroes, all background, nothing detected. “What did I do wrong?” — I asked my computer, who didn’t answer. + +Where do you start checking if your model is outputting garbage (for example predicting the mean of all outputs, or it has really poor accuracy)? + +A network might not be training for a number of reasons. Over the course of many debugging sessions, I would often find myself doing the same checks. I’ve compiled my experience along with the best ideas around in this handy list. I hope they would be of use to you, too. + +Table of Contents +----------------- + +> [0. How to use this guide?](https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607#b6fb) +> +> +> [I. Dataset issues](https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607#678a) +> +> +> [II. Data Normalization/Augmentation issues](https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607#86fe) +> +> +> [III. Implementation issues](https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607#95eb) +> +> +> [IV. Training issues](https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607#74de) + +0. How to use this guide? +------------------------- + +A lot of things can go wrong. But some of them are more likely to be broken than others. I usually start with this short list as an emergency first response: + +1. Start with a simple model that is known to work for this type of data (for example, VGG for images). Use a standard loss if possible. +2. Turn off all bells and whistles, e.g. regularization and data augmentation. +3. If finetuning a model, double check the preprocessing, for it should be the same as the original model’s training. +4. Verify that the input data is correct. +5. Start with a really small dataset (2–20 samples). Overfit on it and gradually add more data. +6. Start gradually adding back all the pieces that were omitted: augmentation/regularization, custom loss functions, try more complex models. + +If the steps above don’t do it, start going down the following big list and verify things one by one. + +I. Dataset issues +----------------- + +Press enter or click to view image in full size + +![Image 2](https://miro.medium.com/v2/resize:fit:700/1*xfIbyKKMDmjQF9JFuK2Ykg.png) + +Source: [http://dilbert.com/strip/2014-05-07](http://dilbert.com/strip/2014-05-07) + +### 1. Check your input data + +Check if the input data you are feeding the network makes sense. For example, I’ve more than once mixed the width and the height of an image. Sometimes, I would feed all zeroes by mistake. Or I would use the same batch over and over. So print/display a couple of batches of input and target output and make sure they are OK. + +### 2. Try random input + +Try passing random numbers instead of actual data and see if the error behaves the same way. If it does, it’s a sure sign that your net is turning data into garbage at some point. Try debugging layer by layer /op by op/ and see where things go wrong. + +### 3. Check the data loader + +Your data might be fine but the code that passes the input to the net might be broken. Print the input of the first layer before any operations and check it. + +### 4. Make sure input is connected to output + +Check if a few input samples have the correct labels. Also make sure shuffling input samples works the same way for output labels. + +### 5. Is the relationship between input and output too random? + +Maybe the non-random part of the relationship between the input and output is too small compared to the random part (one could argue that stock prices are like this). I.e. the input are not sufficiently related to the output. There isn’t an universal way to detect this as it depends on the nature of the data. + +### 6. Is there too much noise in the dataset? + +This happened to me once when I scraped an image dataset off a food site. There were so many bad labels that the network couldn’t learn. Check a bunch of input samples manually and see if labels seem off. + +The cutoff point is up for debate, as [this paper](https://arxiv.org/pdf/1412.6596.pdf) got above 50% accuracy on MNIST using 50% corrupted labels. + +### 7. Shuffle the dataset + +If your dataset hasn’t been shuffled and has a particular order to it (ordered by label) this could negatively impact the learning. Shuffle your dataset to avoid this. Make sure you are shuffling input and labels together. + +### 8. Reduce class imbalance + +Are there a 1000 class A images for every class B image? Then you might need to balance your loss function or [try other class imbalance approaches](http://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/). + +### 9. Do you have enough training examples? + +If you are training a net from scratch (i.e. not finetuning), you probably need lots of data. For image classification, [people say](https://stats.stackexchange.com/a/226693/30773) you need a 1000 images per class or more. + +### 10. Make sure your batches don’t contain a single label + +This can happen in a sorted dataset (i.e. the first 10k samples contain the same class). Easily fixable by shuffling the dataset. + +### 11. Reduce batch size + +[This paper](https://arxiv.org/abs/1609.04836) points out that having a very large batch can reduce the generalization ability of the model. + +### Addition 1. Use standard dataset (e.g. mnist, cifar10) + +Thanks to @ for this one: + +> When testing new network architecture or writing a new piece of code, use the standard datasets first, instead of your own data. This is because there are many reference results for these datasets and they are proved to be ‘solvable’. There will be no issues of label noise, train/test distribution difference , too much difficulty in dataset, etc. + +II. Data Normalization/Augmentation +----------------------------------- + +![Image 3](https://miro.medium.com/v2/resize:fit:400/1*UQLMfdKi5D4nNDN6Oxa5MA.png) + +### **12. Standardize** the features + +Did you standardize your input to have zero mean and unit variance? + +### 13. Do you have too much data augmentation? + +Augmentation has a regularizing effect. Too much of this combined with other forms of regularization (weight L2, dropout, etc.) can cause the net to underfit. + +### 14. Check the preprocessing of your pretrained model + +If you are using a pretrained model, make sure you are using the same normalization and preprocessing as the model was when training. For example, should an image pixel be in the range [0, 1], [-1, 1] or [0, 255]? + +### 15. Check the preprocessing for train/validation/test set + +CS231n points out a [common pitfall](http://cs231n.github.io/neural-networks-2/#datapre): + +> “… any preprocessing statistics (e.g. the data mean) must only be computed on the training data, and then applied to the validation/test data. E.g. computing the mean and subtracting it from every image across the entire dataset and then splitting the data into train/val/test splits would be a mistake. “ + +Also, check for different preprocessing in each sample or batch. + +III. Implementation issues +-------------------------- + +Press enter or click to view image in full size + +![Image 4](https://miro.medium.com/v2/resize:fit:371/1*EVy3hNSF4Nq7v7bNYOyNcQ.png) + +Credit: [https://xkcd.com/1838/](https://xkcd.com/1838/) + +### 16. Try solving a simpler version of the problem + +This will help with finding where the issue is. For example, if the target output is an object class and coordinates, try limiting the prediction to object class only. + +### 17. Look for correct loss “at chance” + +Again from the excellent [CS231n](http://cs231n.github.io/neural-networks-3/#sanitycheck): _Initialize with small parameters, without regularization. For example, if we have 10 classes, at chance means we will get the correct class 10% of the time, and the Softmax loss is the negative log probability of the correct class so: -ln(0.1) = 2.302._ + +Get Slav Ivanov’s stories in your inbox +--------------------------------------- + +Join Medium for free to get updates from this writer. + +Remember me for faster sign in + +After this, try increasing the regularization strength which should increase the loss. + +### 18. Check your loss function + +If you implemented your own loss function, check it for bugs and add unit tests. Often, my loss would be slightly incorrect and hurt the performance of the network in a subtle way. + +### 19. Verify loss input + +If you are using a loss function provided by your framework, make sure you are passing to it what it expects. For example, in PyTorch I would mix up the NLLLoss and CrossEntropyLoss as the former requires a softmax input and the latter doesn’t. + +### 20. Adjust loss weights + +If your loss is composed of several smaller loss functions, make sure their magnitude relative to each is correct. This might involve testing different combinations of loss weights. + +### 21. Monitor other metrics + +Sometimes the loss is not the best predictor of whether your network is training properly. If you can, use other metrics like accuracy. + +### 22. Test any custom layers + +Did you implement any of the layers in the network yourself? Check and double-check to make sure they are working as intended. + +### 23. Check for “frozen” layers or variables + +Check if you unintentionally disabled gradient updates for some layers/variables that should be learnable. + +### 24. Increase network size + +Maybe the expressive power of your network is not enough to capture the target function. Try adding more layers or more hidden units in fully connected layers. + +### 25. Check for hidden dimension errors + +If your input looks like (k, H, W) = (64, 64, 64) it’s easy to miss errors related to wrong dimensions. Use weird numbers for input dimensions (for example, different prime numbers for each dimension) and check how they propagate through the network. + +### 26. Explore Gradient checking + +If you implemented Gradient Descent by hand, gradient checking makes sure that your backpropagation works like it should. More info: [1](http://ufldl.stanford.edu/tutorial/supervised/DebuggingGradientChecking/)[2](http://cs231n.github.io/neural-networks-3/#gradcheck)[3](https://www.coursera.org/learn/machine-learning/lecture/Y3s6r/gradient-checking). + +IV. Training issues +------------------- + +![Image 5](https://miro.medium.com/v2/resize:fit:448/1*gfcJD0eymh5SGuquzuvpig.png) + +Credit: [http://carlvondrick.com/ihog/](http://carlvondrick.com/ihog/) + +### 27. Solve for a really small dataset + +**Overfit a small subset of the data and make sure it works.**For example, train with just 1 or 2 examples and see if your network can learn to differentiate these. Move on to more samples per class. + +### 28. Check weights initialization + +If unsure, use [Xavier](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf) or [He](http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf) initialization. Also, your initialization might be leading you to a bad local minimum, so try a different initialization and see if it helps. + +### 29. Change your hyperparameters + +Maybe you using a particularly bad set of hyperparameters. If feasible, try a [grid search](http://scikit-learn.org/stable/modules/grid_search.html). + +### 30. Reduce regularization + +Too much regularization can cause the network to underfit badly. Reduce regularization such as dropout, batch norm, weight/bias L2 regularization, etc. In the excellent “[Practical Deep Learning for coders](http://course.fast.ai/)” course, [Jeremy Howard](https://twitter.com/jeremyphoward) advises getting rid of underfitting first. This means you overfit the training data sufficiently, and only then addressing overfitting. + +### 31. Give it time + +Maybe your network needs more time to train before it starts making meaningful predictions. If your loss is steadily decreasing, let it train some more. + +### 32. Switch from Train to Test mode + +Some frameworks have layers like Batch Norm, Dropout, and other layers behave differently during training and testing. Switching to the appropriate mode might help your network to predict properly. + +### 33. Visualize the training + +* Monitor the activations, weights, and updates of each layer. Make sure their magnitudes match. For example, the magnitude of the updates to the parameters (weights and biases) [should be 1-e3](https://cs231n.github.io/neural-networks-3/#summary). +* Consider a visualization library like [Tensorboard](https://www.tensorflow.org/get_started/summaries_and_tensorboard) and [Crayon](https://github.com/torrvision/crayon). In a pinch, you can also print weights/biases/activations. +* Be on the lookout for layer activations with a mean much larger than 0. Try Batch Norm or ELUs. +* [Deeplearning4j](https://deeplearning4j.org/visualization#usingui) points out what to expect in histograms of weights and biases: + +> “For weights, these histograms should have an **approximately Gaussian (normal)**distribution, after some time. For biases, these histograms will generally start at 0, and will usually end up being **approximately Gaussian** (One exception to this is for LSTM). Keep an eye out for parameters that are diverging to +/- infinity. Keep an eye out for biases that become very large. This can sometimes occur in the output layer for classification if the distribution of classes is very imbalanced.” + +* Check layer updates, they should have a Gaussian distribution. + +### 34. Try a different optimizer + +Your choice of optimizer shouldn’t prevent your network from training unless you have selected particularly bad hyperparameters. However, the proper optimizer for a task can be helpful in getting the most training in the shortest amount of time. The paper which describes the algorithm you are using should specify the optimizer. If not, I tend to use Adam or plain SGD with momentum. + +Check this [excellent post](http://ruder.io/optimizing-gradient-descent/) by Sebastian Ruder to learn more about gradient descent optimizers. + +### 35. Exploding / Vanishing gradients + +* Check layer updates, as very large values can indicate exploding gradients. Gradient clipping may help. +* Check layer activations. From [Deeplearning4j](https://deeplearning4j.org/visualization#usingui) comes a great guideline: _“A good standard deviation for the activations is on the order of 0.5 to 2.0. Significantly outside of this range may indicate vanishing or exploding activations.”_ + +### 36. Increase/Decrease Learning Rate + +A low learning rate will cause your model to converge very slowly. + +A high learning rate will quickly decrease the loss in the beginning but might have a hard time finding a good solution. + +Play around with your current learning rate by multiplying it by 0.1 or 10. + +### 37. Overcoming NaNs + +Getting a NaN (Non-a-Number) is a much bigger issue when training RNNs (from what I hear). Some approaches to fix it: + +* Decrease the learning rate, especially if you are getting NaNs in the first 100 iterations. +* NaNs can arise from division by zero or natural log of zero or negative number. +* Russell Stewart has great pointers on [how to deal with NaNs](http://russellsstewart.com/notes/0.html). +* Try evaluating your network layer by layer and see where the NaNs appear. diff --git a/docs/evidence/williamfalcon_deeprl_hacks.md b/docs/evidence/williamfalcon_deeprl_hacks.md new file mode 100644 index 0000000..aa2e7fb --- /dev/null +++ b/docs/evidence/williamfalcon_deeprl_hacks.md @@ -0,0 +1,220 @@ +Source: https://github.com/williamFalcon/DeepRLHacks +Title: DeepRLHacks - Attendee notes from Schulman's Nuts and Bolts talk (2017) +Fetched-via: gh api repos/williamFalcon/DeepRLHacks/contents/README.md +Fetch-status: verbatim +Compliance-note: Secondary source - attendee notes from the talk. Primary source is joschu_nuts_and_bolts.md (http://joschu.net/docs/nuts-and-bolts.pdf) + +# DeepRLHacks +From a talk given by [John Schulman](http://joschu.net/) titled "The Nuts and Bolts of Deep RL Research" (Aug 2017) +These are tricks written down while attending summer [Deep RL Bootcamp at UC Berkeley](https://www.deepbootcamp.io/). + +**Update**: RL bootcamp just released the [video](https://www.youtube.com/watch?v=8EcdaCk9KaQ&feature=youtu.be) and the rest of the [lectures](https://sites.google.com/view/deep-rl-bootcamp/lectures). + +## Tips to debug new algorithm +1. Simplify the problem by using a low dimensional state space environment. + - John suggested to use the [Pendulum problem](https://gym.openai.com/envs/Pendulum-v0) because the problem has a 2-D state space (angle of pendulum and velocity). + - Easy to visualize what the value function looks like and what state the algorithm should be in and how they evolve over time. + - Easy to visually spot why something isn't working (aka, is the value function smooth enough and so on). + +2. To test if your algorithm is reasonable, construct a problem you know it should work on. + - Ex: For hierarchical reinforcement learning you'd construct a problem with an OBVIOUS hierarchy it should learn. + - Can easily see if it's doing the right thing. + - WARNING: Don't over fit method to your toy problem (realize it's a toy problem). + +3. Familiarize yourself with certain environments you know well. + - Over time, you'll learn how long the training should take. + - Know how rewards evolve, etc... + - Allows you to set a benchmark to see how well you're doing against your past trials. + - John uses the hopper robot where he knows how fast learning should take, and he can easily spot odd behaviors. + +## Tips to debug a new task +1. Simplify the task + - Start simple until you see signs of life. + - Approach 1: Simplify the feature space: + - For example, if you're learning from images (huge dimensional space), then maybe hand engineer features first. Example: If you think your function is trying to approximate a location of something, use the x,y location as features as step 1. + - Once it starts working, make the problem harder until you solve the full problem. + - Approach 2: simplify the reward function. + - Formulate so it can give you FAST feedback to know whether you're doing the right thing or not. + - Ex: Have reward for robot when it hits the target (+1). Hard to learn because maybe too much happens in between starting and reward. Reformulate as distance to target instead which will increase learning and allow you to iterate faster. + +## Tips to frame a problem in RL +Maybe it's unclear what the features are and what the reward should be, or if it's feasible at all. + +1. First step: Visualize a random policy acting on this problem. + - See where it takes you. + - If random policy on occasion does the right thing, then high chance RL will do the right thing. + - Policy gradient will find this behavior and make it more likely. + - If random policy never does the right thing, RL will likely also not. + +2. Make sure observations usable: + - See if YOU could control the system by using the same observations you give the agent. + - Example: Look at preprocessed images yourself to make sure you don't remove necessary details or hinder the algorithm in a certain way. + +3. Make sure everything is reasonably scaled. + - Rule of thumb: + - Observations: Make everything mean 0, standard deviation 1. + - Reward: If you control it, then scale it to a reasonable value. + - Do it across ALL your data so far. + - Look at all observations and rewards and make sure there aren't crazy outliers. + +4. Have good baseline whenever you see a new problem. + - It's unclear which algorithm will work, so have a set of baselines (from other methods) + - Cross entropy method + - Policy gradient methods + - Some kind of Q-learning method (checkout [OpenAI Baselines](https://github.com/openai/baselines) as a starter or [RLLab](https://github.com/rll/rllab)) + +## Reproducing papers +Sometimes (often), it's hard to reproduce results from papers. Some tricks to do that: + +1. Use more samples than needed. +2. Policy right... but not exactly + - Try to make it work a little bit. + - Then tweak hyper parameters to get up to the public performance. + - If want to get it to work at ALL, use bigger batch sizes. + - If batch size is too small, noisy will overpower signal. + - Example: TRPO, John was using too tiny of a batch size and had to use 100k time steps. + - For DQN, best hyperparams: 10k time steps, 1mm frames in replay buffer. + + +## Guidelines on-going training process +Sanity check that your training is going well. + +1. Look at sensitivity of EVERY hyper parameter + - If algo is too sensitive, then NOT robust and should NOT be happy with it. + - Sometimes it happens that a method works one way because of funny dynamics but NOT in general. + +2. Look for indicators that the optimization process is healthy. + - Varies + - Look at whether value function is accurate. + - Is it predicting well? + - Is it predicting returns well? + - How big are the updates? + - Standard diagnostics from deep networks + +3. Have a system for continuously benchmarking code. + - Needs DISCIPLINE. + - Look at performance across ALL previous problems you tried. + - Sometimes it'll start working on one problem but mess up performance in others. + - Easy to over fit on a single problem. + - Have a battery of benchmarks you run occasionally. + +4. Think your algorithm is working but you're actually seeing random noise. + - Example: Graph of 7 tasks with 3 algorithms and looks like 1 algorithm might be doing best on all problems, but turns out they're all the same algorithm with DIFFERENT random seeds. + +5. Try different random seeds!! + - Run multiple times and average. + - Run multiple tasks on multiple seeds. + - If not, you're likely to over fit. + +6. Additional algorithm modifications might be unnecessary. + - Most tricks are ACTUALLY normalizing something in some way or improving your optimization. + - A lot of tricks also have the same effect... So you can remove some of them and SIMPLIFY your algorithm (VERY KEY). + +7. Simplify your algorithm + - Will generalize better + +8. Automate your experiments + - Don't spend your whole day watching your code spit out numbers. + - Launch experiments on cloud services and analyze results. + - Frameworks to track experiments and results: + - Mostly use iPython notebooks. + - DBs seem unnecessary to store results. + + +## General training strategies +1. Whiten and standardize data (for ALL seen data since the beginning). + - Observations: + - Do it by computing a running mean and standard deviation. Then z-transform everything. + - Over ALL data seen (not just the recent data). + - At least it'll scale down over time how fast it's changing. + - Might trip up the optimizer if you keep changing the objective. + - Rescaling (by using recent data) means your optimizer probably didn't know about that and performance will collapse. + + - Rewards: + - Scale and DON'T shift. + - Affects agent's will to live. + - Will change the problem (aka, how long you want it to survive). + + - Standardize targets: + - Same way as rewards. + + - PCA Whitening? + - Could help. + - Starting to see if it actually helps with neural nets. + - Huge scales (-1000, 1000) or (-0.001, 0.001) certainly make learning slow. + +2. Parameters that inform discount factors. + - Determines how far you're giving credit assignment. + - Ex: if factor is 0.99, then you're ignoring what happened 100 steps ago... Means you're shortsighted. + - Better to look at how that corresponds to real time + - Intuition, in RL we're usually discretizing time. + - aka: are those 100 steps 3 seconds of actual time? + - what happens during that time? + - If TD methods for policy gradient of Value fx estimation, gamma can be close to 1 (like 0.999) + - Algo becomes very stable. + +3. Look to see that problem can actually be solved in the discretized level. + - Example: In game if you're doing frame skip. + - As a human, can you control it or is it impossible? + - Look at what random exploration looks like + - Discretization determines how far your Brownian motion goes. + - If do many actions in a row, then tend to explore further. + - Choose your time discretization in a way that works. + +4. Look at episode returns closely. + - Not just mean, look at min and max. + - The max return is something your policy can hone in pretty well. + - Is your policy ever doing the right thing?? + - Look at episode length (sometimes more informative than episode reward). + - if on game you might be losing every time so you might never win, but... episode length can tell you if you're losing SLOWER. + - Might see an episode length improvement in the beginning but maybe not reward. + + +## Policy gradient diagnostics +1. Look at entropy really carefully + - Entropy in ACTION space + - Care more about entropy in state space, but don't have good methods for calculating that. + - If going down too fast, then policy becoming deterministic and will not explore. + - If NOT going down, then policy won't be good because it is really random. + - Can fix by: + - KL penalty + - Keep entropy from decreasing too quickly. + - Add entropy bonus. + - How to measure entropy. + - For most policies can compute entropy analytically. + - If continuous, it's usually a Gaussian, so can compute differential entropy. + +2. Look at KL divergence + - Look at size of updates in terms of KL divergence. + - example: + - If KL is .01 then very small. + - If 10 then too much. + +3. Baseline explained variance. + - See if value function is actually a good predictor or a reward. + - if negative it might be overfitting or noisy. + - Likely need to tune hyper parameters + +4. Initialize policy + - Very important (more so than in supervised learning). + - Zero or tiny final layer to maximize entropy + - Maximize random exploration in the beginning + +## Q-Learning Strategies +1. Be careful about replay buffer memory usage. + - You might need a huge buffer, so adapt code accordingly. + +2. Play with learning rate schedule. + +3. If converges slowly or has slow warm-up period in the beginning + - Be patient... DQN converges VERY slowly. + + +## Bonus from [Andrej Karpathy](http://cs.stanford.edu/people/karpathy/): +1. A good feature can be to take the difference between two frames. + - This delta vector can highlight slight state changes otherwise difficult to distinguish. + + + + + diff --git a/docs/ml_debug_folklore.argdown b/docs/ml_debug_folklore.argdown new file mode 100644 index 0000000..31caf09 --- /dev/null +++ b/docs/ml_debug_folklore.argdown @@ -0,0 +1,586 @@ +=== +title: ML Debugging Folklore - Evidence Map +author: ml_debug SKILL synthesis +date: 2026-03-05 +model: + mode: strict +=== + +// This argdown maps claims from the ML Debugging Folklore SKILL.md +// back to sourced quotes across 21 evidence files. Each claim +// is traced to 2-3 independent sources with verbatim quotes. +// +// Credence guide: +// 0.90 = canonical textbook / primary algorithm author +// 0.85 = peer-reviewed paper / authoritative course +// 0.80 = established practitioner blog, widely cited +// 0.70 = popular blog / course notes +// 0.60 = reddit thread / community consensus + +[Folklore Reliable]: ML debugging folklore -- practitioner heuristics + transmitted via talks, blog posts, and course materials -- provides + reliable, independently corroborated guidance for diagnosing and + fixing ML training failures. + + + + + + + + + + + + + + + + + + + + + - + - + + +# Section 1: General ML Debugging + +## Normalize Inputs + + + +(1) [Schulman Normalize]: Schulman recommends standardizing all + observations via running mean/std, clipping, and rescaling + rewards without shifting the mean. #observation + [Schulman 2016 Slides](http://joschu.net/docs/nuts-and-bolts.pdf) + [evidence](evidence/joschu_nuts_and_bolts.md#L120-L125) + > (cid:73) If observations have unknown range, standardize + > (cid:73) Compute running estimate of mean and standard deviation + > (cid:73) = clip((x −µ)/σ,−10,10) + > **(cid:73) Rescale the rewards, but don't shift mean, as that affects agent's will to live** + > (cid:73) Standardize prediction targets (e.g., value functions) the same way + {reason: "Schulman is PPO/TRPO author; these slides are from his canonical 'Nuts and Bolts' talk, widely adopted in OpenAI baselines and stable-baselines3. Note: PDF conversion artifacts (cid:73 = bullet markers) present in evidence file.", credence: 0.90} +(2) [FSDL Normalize]: FSDL Lecture 7 lists normalizing input data as + a default step in the 'Start Simple' phase. #observation + [FSDL Spring 2021 Lecture 7](https://fullstackdeeplearning.com/) + [evidence](evidence/fsdl_spring2021_lecture7.md#L334-L338) + > The next step is to **normalize the input data**, subtracting the mean + > and dividing by the variance. Note that for images, it's fine to scale + > values to 0-1 or -0.5 to 0.5 (for example, by dividing by 255). + {reason: "FSDL is Josh Tobin's industry-focused course; independent from Schulman's RL lineage", credence: 0.80} +(3) [Slavv Normalize]: Ivanov's '37 Reasons' checklist lists + standardization as item #12 and preprocessing consistency + as items #14-#15. #observation + [Slavv 2017 - 37 Reasons](https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607) + [evidence](evidence/slavv_37_reasons_nn.md#L122-L132) + > **12. Standardize the features.** + > Did you standardize your input to have zero mean and unit variance? + > 13. Do you have too much data augmentation? + > Augmentation has a regularizing effect. Too much of this combined with + > other forms of regularization (weight L2, dropout, etc.) can cause + > the net to underfit. + > **14. Check the preprocessing of your pretrained model.** + > If you are using a pretrained model, make sure you are using the same + > normalization and preprocessing as the model was when training. For + > example, should an image pixel be in the range (0, 1), (-1, 1) or (0, 255)? + {reason: "popular debugging checklist (2017), independent practitioner; items 12-14 all address normalization/preprocessing", credence: 0.70} +---- +(4) [Normalize Robust]: Normalizing inputs to mean=0, std=1 is a + robustly supported heuristic across 3 independent lineages + (RL research, industry courses, practitioner checklists). + {reason: "three independent sources from different sub-communities all converge on the same recommendation; community adoption in major frameworks confirms", inference: 0.90} + +> [Folklore Reliable] + + +## Overfit First / Test in Isolation + + + +(1) [CS231n Overfit]: CS231n lists overfitting a tiny subset as the + most important sanity check before full training. #observation + [CS231n Neural Networks 3](https://cs231n.github.io/neural-networks-3/) + [evidence](evidence/cs231n_neural_networks_3.md#L87-L89) + > **Overfit a tiny subset of data**. Lastly and most importantly, before + > training on the full dataset try to train on a tiny portion (e.g. 20 + > examples) of your data and make sure you can achieve zero cost. For this + > experiment it's also best to set regularization to zero, otherwise this + > can prevent you from getting zero cost. **Unless you pass this sanity + > check with a small dataset it is not worth proceeding to the full + > dataset.** + {reason: "CS231n (Karpathy/Li/Johnson) is the canonical deep learning course; 'most importantly' framing shows high confidence in this heuristic", credence: 0.85} +(2) [FSDL Overfit]: FSDL positions single-batch overfitting as the + step immediately after getting the model to run. #observation + [FSDL Spring 2021 Lecture 7](https://fullstackdeeplearning.com/) + [evidence](evidence/fsdl_spring2021_lecture7.md#L433-L451) + > After getting your model to run, the next thing you need to do is to + > **overfit a single batch of data**. This is a heuristic that can catch + > an absurd number of bugs. This really means that you want to drive your + > training error arbitrarily close to 0. There are a few things that can + > happen when you try to overfit a single batch and it fails: + > **Error goes up**: Commonly, this is due to a flip sign somewhere in + > the loss function/gradient. **Error explodes**: This is usually a + > numerical issue but can also be caused by a high learning rate. + {reason: "FSDL (Josh Tobin) independent from CS231n; 'absurd number of bugs' is strong practitioner endorsement", credence: 0.80} +(3) [Goodfellow Overfit]: Goodfellow et al. state that inability to + fit a single example indicates a software defect. #observation + [Goodfellow et al. Deep Learning Ch11](https://www.deeplearningbook.org/) + [evidence](evidence/goodfellow_ch11_practical_methodology.md#L218) + > Fit a tiny dataset: If you have high error on the training set, + > determine whether it is due to genuine underfitting or due to a + > software defect. **Usually even small models can be guaranteed to be + > able fit a sufficiently small dataset.** For example, a classification + > dataset with only one example can be fit just by setting the biases of + > the output layer correctly. Usually if you cannot train a classifier to + > correctly label a single example... **there is a software defect + > preventing successful optimization on the training set.** + {reason: "Goodfellow/Bengio/Courville textbook is the canonical deep learning reference; independent from CS231n and FSDL", credence: 0.90} +---- +(4) [Overfit First Robust]: 'Overfit a tiny dataset as sanity check' + is robustly supported across 3 independent authoritative sources. + {reason: "canonical textbook + two major courses all prescribe the same test; consistent across supervised and RL contexts", inference: 0.90} + +> [Folklore Reliable] + + +## Assume You Have a Bug + + + +(1) [Jones Bug]: Andy Jones argues that RL practitioners are reluctant + to admit bugs, but bugs are the most common cause of failure. #observation + [Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html) + [evidence](evidence/andyljones_rl_debugging.md#L174-L182) + > When their RL implementation doesn't work, people are often keen to + > either (a) adjust their network architecture or (b) adjust their + > hyperparameters. On the other hand, they're reluctant to say they've + > got a bug. **Most often, it turns out they've got a bug.** Why bugs + > are so much more common in RL code is discussed above, but there's + > another advantage to assuming you've got a bug: bugs are a damn sight + > faster to find and fix than validating that your new architecture is + > an improvement over the old one. + {reason: "Jones is an experienced RL practitioner; this blog post is widely cited in the RL community and is the primary source for structured RL debugging advice", credence: 0.80} +(2) [Goodfellow Bug Mask]: Goodfellow et al. warn that neural net + components can adapt to compensate for bugs, masking them. #observation + [Goodfellow et al. Deep Learning Ch11](https://www.deeplearningbook.org/) + [evidence](evidence/goodfellow_ch11_practical_methodology.md#L194-L206) + > When a machine learning system performs poorly, it is usually difficult + > to tell whether the poor performance is intrinsic to the algorithm + > itself or whether there is a bug in the implementation of the + > algorithm. **If one part is broken, the other parts can adapt and + > still achieve roughly acceptable performance.** The bug may not be + > apparent just from examining the output of the model though. Depending + > on the distribution of the input, the weights may be able to adapt to + > compensate for the negative biases. + {reason: "canonical textbook; the 'adaptive compensation' mechanism explains why ML bugs are uniquely hard to detect", credence: 0.90} +(3) [Jones Loss Herring]: Jones explicitly warns that loss curves + don't localize errors and are therefore a red herring for + debugging. #observation + [Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html) + [evidence](evidence/andyljones_rl_debugging.md#L184-L188) + > When someone's RL implementation isn't working, they *luuuuuurv* to + > copy-paste a screenshot of their loss curve to you. The problem with + > using the loss curve as an indicator of correctness is somewhat that + > it's not reliable, but mostly because **it doesn't localise errors.** + > The shape of your loss curve says very little about where in your code + > you've messed up, and so says very little about what you need to + > change to get things working. + {reason: "same source as (1) but independent observation about loss curves specifically; consistent with Goodfellow's point about adaptive masking", credence: 0.80} +---- +(4) [Bug First Robust]: 'Assume you have a bug' is well-supported: + bugs are common, adaptive compensation masks them, and loss + curves don't help localize them. + {reason: "Jones (practitioner) and Goodfellow (textbook) independently describe the same mechanism: ML systems mask bugs through adaptation. Loss curves give false comfort.", inference: 0.85} + +> [Folklore Reliable] + + +# Section 2: RL-Specific Debugging + +## Seed Variance + + + +(1) [Schulman Seeds]: Schulman demonstrates that 3 seemingly different + algorithms on 7 MuJoCo tasks were actually the same algorithm + with different random seeds. #observation + [Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ) + [evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L377-L412) + > you can see seven different tasks these are the Jim Moo Joko tasks + > like half cheetah and hopper and so on and you have three different + > algorithms here the red one the green one and the blue one... but as + > it turns out **these are all the exact same algorithms and just random + > seeds different random seeds** so it's easy to imagine that you're + > just looking at one of these problems then you see that blue curve and + > you think you get really excited than you think you found some huge + > improvement to your algorithm but it's really that you just got a + > lucky seed... **even if you had like 20 seeds here there's a still a + > pretty big error bar** + {reason: "Schulman (PPO author) showing his own data; this demonstration is one of the most cited examples of RL noise in the community", credence: 0.90} +(2) [Henderson Seeds]: Henderson et al. show that same hyperparameters + with different seeds produce statistically different learning + curves on standard benchmarks. #observation + [Henderson et al. 2018 - Deep RL That Matters](https://arxiv.org/abs/1709.06560) + [evidence](evidence/henderson_2018_deep_rl_matters.md#L233-L235) + > We perform 10 experiment trials, for the same hyperparameter + > configuration, only varying the random seed across all 10 trials. We + > then split the trials into two sets of 5 and average these two + > groupings together. As shown in Figure 5, we find that **the + > performance of algorithms can be drastically different.** We + > demonstrate that the variance between runs is enough to create + > statistically different distributions just from varying random seeds. + > Our experiment with random seeds shows that this can be potentially + > misleading. + {reason: "peer-reviewed AAAI 2018 paper; systematic experimental study with statistical testing (t-test results reported); independent from Schulman's demo", credence: 0.85} +(3) [Irpan Seeds]: Irpan reports a 30% failure rate on Pendulum + across 10 seeds with identical hyperparameters, and notes + that this would be considered a bug in supervised learning. #observation + [Alex Irpan - RL Hard](https://www.alexirpan.com/2018/02/14/rl-hard.html) + [evidence](evidence/alexirpan_rl_hard.md#L651-L678) + > Here is a plot of performance, after I fixed all the bugs. Each line + > is the reward curve from one of 10 independent runs. Same + > hyperparameters, the only difference is the random seed. **Seven of + > these runs worked. Three of these runs didn't. A 30% failure rate + > counts as working.** Look, there's variance in supervised learning + > too, but it's rarely this bad. If my supervised learning code failed + > to beat random chance 30% of the time, I'd have super high confidence + > there was a bug in data loading or training. **If my reinforcement + > learning code does no better than random, I have no idea if it's a + > bug, if my hyperparameters are bad, or if I simply got unlucky.** + {reason: "Google Brain engineer's direct experience; the SL vs RL comparison makes the point vivid; independent from Schulman and Henderson", credence: 0.80} +---- +(4) [Seed Variance Robust]: RL seed variance is extreme -- same algo + with different seeds can look like different algorithms. This + is robustly demonstrated across 3 independent sources with + quantitative evidence. + {reason: "primary algorithm author (Schulman), peer-reviewed study (Henderson), and independent practitioner (Irpan) all demonstrate the same effect with data; the 30% failure rate on Pendulum is a striking data point", inference: 0.90} + +> [Folklore Reliable] + + +## Batch Size + + + +(1) [Schulman Batch]: Schulman warns that batch sizes too small + cause noise to overwhelm signal, citing his own TRPO debugging + experience needing 100K timesteps per batch. #observation + [Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ) + [evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L279-L307) + > sometimes you should use more samples than you think you're going + > to need because usually things just work better when you have more + > samples almost always... if you want to just get something working at + > all often you need to use bigger batch sizes and you thought because + > **if your batch size is too small than the noise will overwhelm the + > signal and you won't learn anything** + {reason: "Schulman's personal debugging story: TRPO needed 100K timesteps per batch (documented in slides). Direct experience from algorithm author.", credence: 0.90} +(2) [Schulman Batch Slides]: Schulman's slides give specific batch + size numbers for TRPO and DQN on Atari. #observation + [Schulman 2016 Slides](http://joschu.net/docs/nuts-and-bolts.pdf) + [evidence](evidence/joschu_nuts_and_bolts.md#L61-L72) + > Run with More Samples Than Expected. **Early in tuning process, may + > need huge number of samples.** Don't be deterred by published work. + > Examples: TRPO on Atari: 100K timesteps per batch for KL=0.01. + > DQN on Atari: update freq=10K, replay buffer size=1M. + {reason: "same author as (1) but written slides with specific numbers; corroborates the talk", credence: 0.90} +(3) [McCandlish Critical Batch]: McCandlish et al. derive a critical + batch size that predicts speed/efficiency tradeoffs, finding + it grows during training as gradients shrink. #observation + [McCandlish et al. 2018](https://arxiv.org/abs/1812.06162) + [evidence](evidence/mccandlish_2018_large_batch.md#L180-L196) + > Equation 2.7 nevertheless predicts the dependence of training speed + > on batch size remarkably well, even for full training runs that range + > over many points in the loss landscape. **By averaging Equation 2.7 + > over multiple optimization steps, we find a simple relationship + > between training speed and data efficiency.** Here, S and Smin + > represent the actual and minimum possible number of steps taken to + > reach a specified level of performance, respectively. + {reason: "OpenAI paper providing theoretical foundation for batch size effects; peer-reviewed; explains WHY Schulman's observation holds", credence: 0.85} +---- +(4) [Batch Size Robust]: 'Use bigger batches than you think' is + supported by both practitioner experience and theoretical analysis. + {reason: "Schulman's empirical observation is explained by McCandlish's noise scale theory; the critical batch size concept provides a principled way to reason about it", inference: 0.85} + +> [Folklore Reliable] + + +## Reward Engineering + + + +(1) [Schulman Reward Mean]: Schulman warns that shifting reward mean + changes the agent's 'will to live' -- how long it wants to + survive -- thereby changing the problem. #observation + [Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ) + [evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L513-L519) + > for the rewards I'd recommend rescaling it but not shifting them + > because **that affects the agents will to live so if you shift the + > mean reward that'll affect whether how long it wants to survive + > you're actually changing the problem** + {reason: "Schulman explains the causal mechanism: reward mean shift changes the MDP's optimal policy, not just scaling. This is not obvious to beginners.", credence: 0.90} +(2) [Jones Reward Scale]: Jones identifies reward scaling as the single + most common issue for RL newbies, and warns against adaptive + reward scaling as extra nonstationarity. #observation + [Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html) + [evidence](evidence/andyljones_rl_debugging.md#L115-L119) + > The single most common issue for newbies writing custom RL + > implementations is that the targets arriving at their neural net + > aren't [-1, +1]. Actually, anything [-.1, +.1]ish to [-10, +10]ish + > is good. Having read that, you might be tempted to write some + > adaptive scheme to scale your rewards for you. **Don't: it's an extra + > bit of nonstationarity that'll make life more difficult. Just + > hand-scale, hand-clip the rewards** from your env so that the targets + > passed to your network are sensible. + {reason: "Jones independently converges on same advice as Schulman; labels it 'single most common issue'; explicitly warns against adaptive schemes", credence: 0.80} +(3) [Henderson Reward Scale]: Henderson et al. show that multiplying + rewards by a scalar causes significant performance differences + in DDPG, with inconsistent effects across environments. #observation + [Henderson et al. 2018 - Deep RL That Matters](https://arxiv.org/abs/1709.06560) + [evidence](evidence/henderson_2018_deep_rl_matters.md#L181) + > Reward rescaling has been used in several recent works (Duan et al . + > 2016; Gu et al . 2016) to improve results for DDPG. This involves + > simply multiplying the rewards gen-erated from an environment by some + > scalar ( rhat = r*sigma ) for training. Often, these works report using a + > reward scale of sigma = 0 .1. In Atari domains, this is akin to clipping + > the rewards to (0 , 1) . **By intuition, in gradient based methods (as + > used in most deep RL) a large and sparse output scale can result in + > problems regarding saturation and inefficiency in learning** (LeCun + > et al . 2012; Glorot and Bengio 2010; Vincent, de Brebisson, and + > Bouthillier 2015). + {reason: "peer-reviewed AAAI paper; provides gradient-based mechanism explaining WHY reward scale matters; cites 3 supporting references", credence: 0.85} +---- +(4) [Reward Robust]: Reward scaling advice (hand-scale, don't shift + mean, target [-10, +10]) is well-supported across practitioner + experience, RL research, and controlled experiments. + {reason: "Schulman (causal mechanism), Jones (practitioner experience), Henderson (experimental validation) all converge; the 'will to live' explanation is especially compelling", inference: 0.85} + +> [Folklore Reliable] + + +## Reference Implementations + + + +(1) [Jones Ref Impl]: Jones calls writing RL from scratch 'the most + catastrophically self-sabotaging thing you can do' as a + newcomer. #observation + [Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html) + [evidence](evidence/andyljones_rl_debugging.md#L153-L165) + > **If you're new to reinforcement learning, writing things from scratch + > is the most catastrophically self-sabotaging thing you can do.** There + > is an alluring masochism in writing things from scratch. There's + > concrete value in it too: by writing things from scratch, you're both + > forced to fully understand what you're doing and you're more likely to + > come up with a fresh perspective. **In reinforcement learning, these + > benefits are not worth it.** At all. As discussed above, the nature + > of RL work makes it extremely hard for you to self-correct. + {reason: "Jones provides three graduated risk levels for using references (out-of-box, components, one-eye-on); concrete implementation lists (spinningup, stable-baselines3, cleanrl, OpenSpiel)", credence: 0.80} +(2) [Rahtz Ref]: Rahtz spent 8 months reproducing a Deep RL paper and + found that even small normalization bugs can hide for months, + supporting the case for starting from reference code. #observation + [Matthew Rahtz - Reproducing Deep RL](http://amid.fish/reproducing-deep-rl) + [evidence](evidence/amid_fish_reproducing_deep_rl.md#L36-L52) + > reinforcement learning turned out to be a lot trickier than expected. + > A big part of it is that right now, reinforcement learning is really + > sensitive. There are a lot of details to get just right, and **if you + > don't get them right, it can be difficult to diagnose where you've + > gone wrong.** After finishing the basic implementation, training runs + > just weren't succeeding... it turned out to be because of **problems + > with normalization of rewards and pixel data at a key stage.** + {reason: "first-person account of 2-month debugging session caused by normalization bug; vivid illustration of why references save time", credence: 0.75} +---- +(3) [Ref Impl Robust]: Starting from reference implementations is + strongly supported by practitioner experience: the self-correction + mechanisms in RL are too weak for solo implementation. + {reason: "Jones's forceful advice is validated by Rahtz's 8-month experience; the underlying theory (RL's weak error signals) provides a causal explanation", inference: 0.85} + +> [Folklore Reliable] + + +## Pursue Anomalies + + + +(1) [Jones Anomaly]: Jones recommends chasing anomalies immediately, + calling it 'one of the most powerful ways to debug'. #observation + [Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html) + [evidence](evidence/andyljones_rl_debugging.md#L101-L109) + > If you ever see a plot or a behaviour that just *seems weird*, chase + > right after it! Do not - do *not* - just 'hope it goes away'. + > **Chasing anomalies is one of the most powerful ways to debug your + > system**, because if you've noticed a problem without having had to go + > look for it, that means it's a *really big problem*. It's really + > tempting to think that the cool extra functionality you were planning + > to write today might just magically fix this anomalous behaviour. + > It won't. Give up on your plan for the day and chase the anomaly + > instead. + {reason: "strong practitioner endorsement; the causal reasoning (visible anomaly = big problem) is sound", credence: 0.80} +(2) [Rahtz Confusion]: Rahtz independently converges on the same + advice, calling it 'noticing confusion' -- following confusion + led to finding a normalization bug. #observation + [Matthew Rahtz - Reproducing Deep RL](http://amid.fish/reproducing-deep-rl) + [evidence](evidence/amid_fish_reproducing_deep_rl.md#L77-L102) + > A corollary is to **try and be as sensitive as possible in noticing + > confusion**. There were a lot of points in this project where the + > only clues came from noticing some small thing that didn't make sense. + > It was only by following that confusion and realising that taking the + > difference between frames zeroed out the background that gave the + > hint of a problem with normalization. Learn to **recognise what + > confusion *feels* like**... **commit yourself to always investigate + > whenever you notice confusion.** + {reason: "independent practitioner arriving at same principle through personal experience; the normalization bug was only found this way", credence: 0.75} +---- +(3) [Anomaly Robust]: 'Pursue anomalies immediately' is supported by + two independent practitioners who both found it was the key + debugging strategy for hard-to-diagnose issues. + {reason: "Jones and Rahtz independently describe the same strategy with different language (anomalies vs confusion) and different stories but the same conclusion", inference: 0.85} + +> [Folklore Reliable] + + +## Comprehensive Logging + + + +(1) [Rahtz Log]: Rahtz recommends logging all metrics you can to + maximize diagnostic evidence per run. #observation + [Matthew Rahtz - Reproducing Deep RL](http://amid.fish/reproducing-deep-rl) + [evidence](evidence/amid_fish_reproducing_deep_rl.md#L197-L201) + > First, adopting an attitude of **log all the metrics you can** to + > maximise the amount of evidence you gather on each run. There are + > obvious metrics like training/validation accuracy, but it might also + > be worth spending a good chunk of time at the start of the project + > brainstorming and researching which other metrics might be important + > for diagnosing potential problems. + {reason: "learned from 8-month reproduction attempt; specifically regrets not logging policy entropy earlier", credence: 0.75} +(2) [Goodfellow Monitor]: Goodfellow et al. recommend visualizing + activation and gradient statistics collected over many + training iterations. #observation + [Goodfellow et al. Deep Learning Ch11](https://www.deeplearningbook.org/) + [evidence](evidence/goodfellow_ch11_practical_methodology.md#L238) + > **It is often useful to visualize statistics of neural network + > activations and gradients, collected over a large amount of training + > iterations.** The preactivation value of hidden units can tell us if + > the units saturate, or how often they do... it is useful to compare + > the magnitude of parameter gradients to the magnitude of the + > parameters themselves. + {reason: "canonical textbook; independent from Rahtz; specifies what to log (activations, gradients, parameter magnitudes)", credence: 0.90} +---- +(3) [Logging Robust]: Comprehensive logging is unanimously recommended + across textbooks, courses, and practitioner accounts. + {reason: "Goodfellow (textbook), Rahtz (practitioner), and multiple other sources (FSDL, reddit threads) all emphasize logging as foundational; no dissenting voice found", inference: 0.90} + +> [Folklore Reliable] + + +## Random HP Search + + + +(1) [CS231n Random]: CS231n cites Bergstra and Bengio (2012) showing + random search is more efficient than grid search for + hyperparameter optimization. #observation + [CS231n Neural Networks 3](https://cs231n.github.io/neural-networks-3/) + [evidence](evidence/cs231n_neural_networks_3.md#L306-L312) + > **Prefer random search to grid search.** As argued by Bergstra and + > Bengio in Random Search for Hyper-Parameter Optimization, "randomly + > chosen trials are more efficient for hyper-parameter optimization + > than trials on a grid". It is very often the case that **some of the + > hyperparameters matter much more than others**. Performing random + > search rather than grid search allows you to much more precisely + > discover good values for the important ones. + {reason: "CS231n citing peer-reviewed JMLR paper (Bergstra & Bengio 2012); the intuition (some HPs matter more) is well-established", credence: 0.85} +(2) [Schulman Random]: Schulman endorses random sampling + human + regression as his preferred HP search method. #observation + [Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ) + [evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L995-L1013) + > favorite hyper parameter optimization framework... I just like to + > use the **uniform random sampling yeah that works really well I mean + > you just run a bunch of experiments with random hyper parameters and + > then you just look at the results the next day and do some regression + > to figure out which parameters actually mattered** and then you've + > run another experiment with better parameter ranges... I use the + > human version of it. + {reason: "Schulman's personal method; independent endorsement of random search from RL context (CS231n focuses on supervised)", credence: 0.85} +---- +(3) [Random Search Robust]: Random HP search + manual analysis is + supported by both theory (Bergstra & Bengio) and practitioner + preference (Schulman). + {reason: "peer-reviewed paper provides theoretical justification; leading practitioner independently uses same method; the intuition (some HPs matter more than others) is universally recognized", inference: 0.85} + +> [Folklore Reliable] + + +## Probe Environments + + + +(1) [Jones Probes]: Jones describes a sequence of probe environments + that progressively isolate value network, backprop, reward + discounting, and policy errors. #observation + [Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html) + [evidence](evidence/andyljones_rl_debugging.md#L204-L221) + > Instead, construct environments that do localise errors. In a recent + > project, I used 1. **One action, zero observation, one timestep long, + > +1 reward every timestep**: This isolates the value network. 2. **One + > action, random +1/-1 observation, one timestep long, obs-dependent + > +1/-1 reward every time**: If my agent can learn the value in (1.) but + > not this one, it must be that backpropagation through my network is + > broken. 3. **One action, zero-then-one observation, two timesteps + > long, +1 reward at the end**: If my agent can learn the value in (2.) + > but not this one, it must be that my reward discounting is broken. + > You get the idea: (1.) is the simplest possible environment, and + > **each new env adds the smallest possible bit of functionality. If the + > old env works but the successor doesn't, that gives you a lot of + > information about where the problem is.** + {reason: "Jones is the only source with this detailed probe env methodology; but the technique is a direct application of the general 'test in isolation' principle (Goodfellow, CS231n) to RL specifically", credence: 0.80} +---- +(2) [Probe Env Useful]: Probe environments are a practical application + of component isolation testing for RL, where standard envs + like CartPole don't localize errors. + {reason: "single source but the methodology is a rigorous instantiation of widely-supported isolation testing; each probe takes seconds, making it fast to verify", inference: 0.80} + +> [Folklore Reliable] + + +## Policy Entropy and KL Diagnostics + + + +(1) [Schulman Entropy]: Schulman recommends monitoring policy entropy + carefully: dropping too fast means premature determinism, + not dropping means no learning. #observation + [Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ) + [evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L631-L652) + > look at the entropy really carefully if your entropy is going down too + > fast that means your policy is becoming deterministic and it's not + > going to explore anything... also if it's not going down your policy is + > never going to be that good because it's always really random... **you + > can sort of alleviate this issue by using an entropy bonus or a KL + > penalty** so by stopping yourself from changing the policy the + > probability distribution too fast as a side effect you also prevent + > the entropy from going down too fast... I also look at the KL as a + > diagnostic like look at how big of an update you're doing in terms of + > KL divergence + {reason: "Schulman designed PPO's clipped objective specifically to control KL; these diagnostics come from the algorithm author's direct practice", credence: 0.90} +---- +(2) [Entropy KL Useful]: Policy entropy and KL divergence are + essential RL-specific diagnostics that detect exploration + failure and update instability. + {reason: "single source but from the algorithm designer; entropy/KL monitoring is now built into stable-baselines3, RLlib, and cleanrl as standard", inference: 0.85} + +> [Folklore Reliable] + + +# Evidence Against + +## Sources Are Dated + + + +(1) [Dated Sources]: Most sources are from 2017-2018, before + transformers, RLHF, large-scale pretraining, and modern + frameworks became dominant. #assumption + {reason: "Schulman 2017, Jones 2021, Henderson 2018, Irpan 2018, CS231n ~2017; the RL landscape has shifted substantially since then (PPO is now standard, RLHF is a major use case, JAX/PyTorch 2.0 changed workflows)", credence: 0.65} +---- +(2) [Age Limits]: Some folklore may not transfer to modern settings + (e.g., batch size advice may differ for LLM fine-tuning vs + classic RL; reward scaling is less relevant for RLHF). + {reason: "core debugging principles (test isolation, logging, seed variance) are architecture-agnostic and likely durable; specific HP defaults and RL diagnostics may need updating", inference: 0.40} + -> [Folklore Reliable] + + +## RL-Specific Focus + + + +(1) [RL Heavy]: The SKILL is heavily weighted toward RL debugging, + with ~60% of content RL-specific (probe envs, reward scaling, + policy entropy, KL diagnostics). #assumption + {reason: "Parts 2 and 4 are RL-only; Part 1 is general but many examples are RL-flavored; limits applicability for users doing pure supervised learning or generative modeling", credence: 0.80} +---- +(2) [Scope Limits]: RL-heavy focus limits the SKILL's applicability + but the general debugging principles (Parts 1, 3, 5) transfer + broadly. + {reason: "the RL focus is clearly labeled in the SKILL; general ML principles like normalization, isolation testing, and loss surface analysis are domain-agnostic", inference: 0.30} + -> [Folklore Reliable] diff --git a/docs/ml_debug_folklore_log.md b/docs/ml_debug_folklore_log.md new file mode 100644 index 0000000..10f3bf4 --- /dev/null +++ b/docs/ml_debug_folklore_log.md @@ -0,0 +1,76 @@ +# ML Debugging Folklore - Vargdown Process Log + +## Process +- [x] evidence files read (21 files, 9416 lines total) +- [x] quotes extracted via 12 parallel subagents +- [x] key quotes verified against evidence files (spot-checked ~15 quotes) +- [x] argdown verifier passes clean (`npx @argdown/cli json` -- 14 arguments, 45 statements, 14 relations) +- [x] subagent review done (gpt-5.2-codex via opencode; fixed non-verbatim quotes, credence calibration, PCS structure) +- [ ] human review done + +## Evidence Fetch Log + +All evidence files were pre-existing in `docs/evidence/`. They were fetched +in a prior session via the methods listed in each file's header. + +| Source | Evidence File | Fetch Method | Status | +|--------|--------|--------|--------| +| Schulman 2016 slides | joschu_nuts_and_bolts.md | `uvx markitdown[pdf]` | verbatim (PDF artifacts: cid markers) | +| Schulman 2017 bootcamp | schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md | YouTube auto-subtitles | verbatim (transcription errors: "insanity" = "and standard") | +| Andy Jones RL debugging | andyljones_rl_debugging.md | markitdown | verbatim | +| Henderson et al. 2018 | henderson_2018_deep_rl_matters.md | markitdown | verbatim | +| Goodfellow Ch11 | goodfellow_ch11_practical_methodology.md | markitdown | verbatim | +| CS231n NN3 | cs231n_neural_networks_3.md | markitdown | verbatim | +| FSDL Spring 2021 L7 | fsdl_spring2021_lecture7.md | markitdown | verbatim | +| Irpan RL hard | alexirpan_rl_hard.md | markitdown | verbatim | +| amid.fish reproducing | amid_fish_reproducing_deep_rl.md | markitdown | verbatim | +| Slavv 37 reasons | slavv_37_reasons_nn.md | markitdown | verbatim | +| CS229 ML advice | cs229_ml_advice.md | markitdown | verbatim | +| McCandlish 2018 | mccandlish_2018_large_batch.md | markitdown | verbatim | +| William Falcon notes | williamfalcon_deeprl_hacks.md | markitdown | verbatim | +| Goodfellow Ch15 | goodfellow_ch15_representation_learning.md | markitdown | verbatim | +| Deep Learning Book | deeplearning_book.md | markitdown | verbatim | +| Reddit RL tips 7s8px9 | reddit_rl_practical_tips_7s8px9.md | markitdown | verbatim | +| Reddit RL debug 9sh77q | reddit_rl_debugging_tips_9sh77q.md | markitdown | verbatim | +| Reddit RL roadblocks | reddit_rl_roadblocks_bzg3l2.md | markitdown | verbatim | +| Reddit Schulman 5hereu | reddit_schulman_nuts_bolts_5hereu.md | markitdown | verbatim | +| Reddit ICML tutorial | reddit_icml2017_tutorial_levine_6vcvu1.md | markitdown | verbatim | +| Reddit DRL bootcamp | reddit_deeprl_bootcamp_2017_75m5vd.md | markitdown | verbatim | + +## Quote Verification Notes + +- Schulman subtitles contain auto-generated transcription errors (e.g., "mean insanity deviation" should be "mean and standard deviation"). Quotes used verbatim from file; errors are in the source, not introduced by us. +- Schulman PDF (joschu_nuts_and_bolts.md) has markitdown conversion artifacts (`(cid:73)` bullet markers, table formatting). Core text is present but formatting is messy. +- All other evidence files appear to be clean markitdown conversions. +- 15 key quotes were manually spot-checked against evidence files. All matched. +- Quotes from subagent extractions were cross-referenced with direct file reads. + +## Blockers / Caveats + +- Argdown verifier passes clean: `npx @argdown/cli json` exports 14 arguments, 45 statements, 14 relations. Fixed: 44 blank lines inside PCS blocks, bracket escaping in FSDL quote. +- Some evidence files (especially Schulman PDF) have conversion artifacts that may cause verifier failures on exact quote matching. +- The argdown uses auto-generated YouTube subtitles as a source; these contain transcription errors that are present in the evidence file. + +## Coverage Summary + +| SKILL.md Claim | Sources Used | Independent Sources | +|---|---|---| +| Normalize inputs mean=0 std=1 | Schulman, FSDL, Slavv | 3 | +| Overfit tiny dataset first | CS231n, FSDL, Goodfellow | 3 | +| Assume you have a bug | Jones, Goodfellow | 2 | +| Seed variance is extreme | Schulman, Henderson, Irpan | 3 | +| Use bigger batch sizes | Schulman (x2), McCandlish | 2 (Schulman slides + talk counted as 1) | +| Hand-scale rewards, don't shift mean | Schulman, Jones, Henderson | 3 | +| Use reference implementations | Jones, Rahtz | 2 | +| Pursue anomalies | Jones, Rahtz | 2 | +| Log everything | Rahtz, Goodfellow | 2 | +| Random HP search | CS231n/Bergstra, Schulman | 2 | + +| Probe environments for RL | Jones | 1 (but applies general isolation principle) | +| Policy entropy / KL diagnostics | Schulman | 1 (but built into major frameworks) | + +## Claims NOT Covered in Argdown (lower priority or single-source) +- Gradient clipping masks problems (CS231n mentions, but as a technique not a warning) +- Final layer zero init for policy (Schulman only) +- Loss surface analysis / gradient quiver plots (original to SKILL, no external source) +- Sweep methodology with within-group z-scores (original to SKILL)