diff --git a/docs/evidence/joschu_nuts_and_bolts.md b/docs/evidence/joschu_nuts_and_bolts.md
index 3c73249..1cc1ee6 100644
--- a/docs/evidence/joschu_nuts_and_bolts.md
+++ b/docs/evidence/joschu_nuts_and_bolts.md
@@ -1,199 +1,143 @@
 Source: http://joschu.net/docs/nuts-and-bolts.pdf
 Title: Nuts and Bolts of Deep RL Research - John Schulman (2016)
-Fetched-via: bash -c 'uvx "markitdown[pdf]" http://joschu.net/docs/nuts-and-bolts.pdf'
-Fetch-status: verbatim
+Fetched-via: clean transcript of the slide deck (prior markitdown PDF extract was OCR-garbled with `(cid:73)` bullet glyphs; replaced with hand-pasted text)
+Fetch-status: verbatim (clean paste)
-| The Nuts | and Bolts | of Deep | RL Research |
-| -------- | --------- | --------- | ----------- |
-| | John | Schulman | |
-| | December | 9th, 2016 | |
+# The Nuts and Bolts of Deep RL Research
-Outline
-| Approaching | New Problems | |
-| --------------------- | ------------ | ---------- |
-| Ongoing Development | | and Tuning |
-| General Tuning | Strategies | for RL |
-| Policy Gradient | Strategies | |
-| Q-Learning Strategies | | |
-| Miscellaneous | Advice | |
+John Schulman, December 9th, 2016
-Approaching New Problems
+## Outline
-| New Algorithm? | Use Small | Test Problems |
-| -------------------------- | --------- | ------------- |
-| (cid:73) Run experiments | quickly | |
-| (cid:73) Do hyperparameter | search | |
-(cid:73) Interpret and visualize learning process: state visitation, value function, etc.
-(cid:73) Counterpoint: don’t overfit algorithm to contrived problem
-(cid:73) Useful to have medium-sized problems that you’re intimately familiar with
-(Hopper, Atari Pong)
+- Approaching New Problems
+- Ongoing Development and Tuning
+- General Tuning Strategies for RL
+- Policy Gradient Strategies
+- Q-Learning Strategies
+- Miscellaneous Advice
-| New Task? | Make | It Easier Until | Signs | of Life |
-| ---------------- | --------------- | --------------- | ----- | ------- |
-| (cid:73) Provide | good input | features | | |
-| (cid:73) Shape | reward function | | | |
+## Approaching New Problems
+
+New Algorithm? Use Small Test Problems
+- Run experiments quickly
+- Do hyperparameter search
+- Interpret and visualize learning process: state visitation, value function, etc.
+- Counterpoint: don't overfit algorithm to contrived problem
+- Useful to have medium-sized problems that you're intimately familiar with (Hopper, Atari Pong)
+
+New Task? Make It Easier Until Signs of Life
+- Provide good input features
+- Shape reward function
 POMDP Design
-(cid:73) Visualize random policy: does it sometimes exhibit desired behavior?
-| (cid:73) Human | control | | | |
-| -------------- | ------- | --- | --- | --- |
-(cid:73) Atari: can you see game features in downsampled image?
-(cid:73) Plot time series for observations and rewards. Are they on a reasonable
-scale?
-| (cid:73) hopper.py | in gym: | | | |
-| ------------------ | ------------ | --------------------------- | ------- | ----------- |
-| reward | = 1.0 | - 1e-3 * np.square(a).sum() | + delta | x / delta t |
-| (cid:73) Histogram | observations | and rewards | | |
+- Visualize random policy: does it sometimes exhibit desired behavior?
+- Human control
+- Atari: can you see game features in downsampled image?
+- Plot time series for observations and rewards. Are they on a reasonable scale?
+- hopper.py in gym: reward = 1.0 - 1e-3 * np.square(a).sum() + delta x / delta t
+- Histogram observations and rewards
 Run Your Baselines
-| (cid:73) Don’t expect | them to | work with default | parameters |
-| --------------------- | ------- | ----------------- | ---------- |
-(cid:73) Recommended:
-| Cross-entropy | method1 | | |
-| ------------- | ------- | --- | --- |
-(cid:73)
-| (cid:73) Well-tuned | policy gradient | method2 | |
-| ------------------- | --------------- | -------------- | --- |
-| (cid:73) Well-tuned | Q-learning | + SARSA method | |
-1Istv´anSzitaandAndr´asL¨orincz(2006).“LearningTetrisusingthenoisycross-entropymethod”. In:Neuralcomputation.
-2https://github.com/openai/rllab
+- Don't expect them to work with default parameters
+- Recommended:
+  - Cross-entropy method[^1]
+  - Well-tuned policy gradient method[^2]
+  - Well-tuned Q-learning + SARSA method
-| Run with | More Samples | Than | Expected | |
-| -------- | ------------ | ---- | -------- | --- |
-(cid:73) Early in tuning process, may need huge number of samples
-| | Don’t be deterred | by published | work | |
-| --- | ----------------- | ------------ | ---- | --- |
-(cid:73)
-| (cid:73) Examples: | | | | |
-| ------------------ | --- | --- | --- | --- |
-(cid:73) TRPO on Atari: 100K timesteps per batch for KL= 0.01
-| | DQN on Atari: | update freq=10K, | replay buffer | size=1M |
-| --- | ------------- | ---------------- | ------------- | ------- |
-(cid:73)
+Run with More Samples Than Expected
+- Early in tuning process, may need huge number of samples
+- Don't be deterred by published work
+- Examples:
+  - TRPO on Atari: 100K timesteps per batch for KL= 0.01
+  - DQN on Atari: update freq=10K, replay buffer size=1M
-| Ongoing | Development | and Tuning |
-| ------- | ----------- | ---------- |
+## Ongoing Development and Tuning
-| It | Works! | But | Don’t | Be Satisfied | | |
-| --- | ---------------- | ----------- | ----- | ----------------- | --- | --- |
-| | (cid:73) Explore | sensitivity | | to each parameter | | |
-(cid:73) If too sensitive, it doesn’t really work, you just got lucky
-| | (cid:73) Look | for health | indicators | | | |
-| --- | ------------- | --------------- | ---------- | --- | --- | --- |
-| | | (cid:73) VF fit | quality | | | |
-| | | Policy | entropy | | | |
-(cid:73)
-| | | (cid:73) Update | size in | output space | and parameter | space |
-| --- | --- | ----------------- | ----------- | ------------ | ------------- | ----- |
-| | | (cid:73) Standard | diagnostics | for | deep networks | |
+It Works! But Don't Be Satisfied
+- Explore sensitivity to each parameter
+- If too sensitive, it doesn't really work, you just got lucky
+- Look for health indicators
+  - VF fit quality
+  - Policy entropy
+  - Update size in output space and parameter space
+  - Standard diagnostics for deep networks
-| Continually | Benchmark | | Your Code |
-| ------------------- | --------- | ------------- | ------------ |
-| (cid:73) If reusing | code, | regressions | occur |
-| (cid:73) Run | a battery | of benchmarks | occasionally |
+Continually Benchmark Your Code
+- If reusing code, regressions occur
+- Run a battery of benchmarks occasionally
-| Always | Use Multiple | Random | Seeds |
-| ------ | ------------ | ------ | ----- |
+Always Use Multiple Random Seeds
-| Always Be | Ablating | |
-| ------------------ | ---------- | ---------- |
-| (cid:73) Different | tricks may | substitute |
-| Especially | whitening | |
-(cid:73)
-(cid:73) “Regularize” to favor simplicity in algorithm design space
-| (cid:73) As | usual, simplicity | → generalization |
-| ----------- | ----------------- | ---------------- |
+Always Be Ablating
+- Different tricks may substitute
+- Especially whitening
+- "Regularize" to favor simplicity in algorithm design space
+- As usual, simplicity → generalization
-| Automate Your | Experiments | | |
-| ------------- | ---------------- | --------- | ----------------- |
-| Don’t spend | all day watching | your code | print out numbers |
-(cid:73)
-(cid:73) Consider using a cloud computing platform (Microsoft Azure, Amazon EC2,
-| Google Compute | Engine) | | |
-| -------------- | ------- | --- | --- |
+Automate Your Experiments
+- Don't spend all day watching your code print out numbers
+- Consider using a cloud computing platform (Microsoft Azure, Amazon EC2, Google Compute Engine)
-| General | Tuning | Strategies | for RL |
-| ------- | ------ | ---------- | ------ |
+## General Tuning Strategies for RL
-| Whitening | / Standardizing | Data |
-| ------------------------ | --------------- | ------------------ |
-| (cid:73) If observations | have unknown | range, standardize |
-(cid:73) Compute running estimate of mean and standard deviation
-x(cid:48)
-(cid:73) = clip((x −µ)/σ,−10,10)
-(cid:73) Rescale the rewards, but don’t shift mean, as that affects agent’s will to live
-(cid:73) Standardize prediction targets (e.g., value functions) the same way
+Whitening / Standardizing Data
+- If observations have unknown range, standardize
+- Compute running estimate of mean and standard deviation
+- x' = clip((x − μ)/σ, −10, 10)
+- Rescale the rewards, but don't shift mean, as that affects agent's will to live
+- Standardize prediction targets (e.g., value functions) the same way
-| Generally | Important | Parameters | | | |
-| --------- | --------------- | ------------- | ---- | ------- | --------- |
-| (cid:73) | Discount | | | | |
-| | (cid:73) Return | = r +γr | +γ2r | +... | |
-| | | t t | t+1 | t+2 | |
-| | Effective | time horizon: | 1+γ | +γ2+··· | = 1/(1−γ) |
-(cid:73)
-(cid:73) I.e., γ =0.99⇒ ignore rewards delayed by more than 100 timesteps
-| | Low | γ works well | for well-shaped | reward | |
-| --- | --- | ------------ | --------------- | ------ | --- |
-(cid:73)
-(cid:73) In TD(λ) methods, can get away with high γ when λ < 1
-| (cid:73) | Action frequency | | | | |
-| -------- | ---------------- | ---------- | ------- | ------------- | --- |
-| | Solvable | with human | control | (if possible) | |
-(cid:73)
-| | (cid:73) View | random exploration | | | |
-| --- | ------------- | ------------------ | --- | --- | --- |
+Generally Important Parameters
+- Discount
+  - Return_t = r_t + γr_{t+1} + γ²r_{t+2} + . . .
+  - Effective time horizon: 1 + γ + γ² + · · · = 1/(1 − γ)
+  - I.e., γ = 0.99 ⇒ ignore rewards delayed by more than 100 timesteps
+  - Low γ works well for well-shaped reward
+  - In TD(λ) methods, can get away with high γ when λ < 1
+- Action frequency
+  - Solvable with human control (if possible)
+  - View random exploration
 General RL Diagnostics
-(cid:73) Look at min/max/stdev of episode returns, along with mean
-(cid:73) Look at episode lengths: sometimes provides additional information
-| (cid:73) Solving problem | faster, losing | game slower |
-| ------------------------ | -------------- | ----------- |
+- Look at min/max/stdev of episode returns, along with mean
+- Look at episode lengths: sometimes provides additional information
+- Solving problem faster, losing game slower
-Policy Gradient Strategies
+## Policy Gradient Strategies
-| Entropy as | Diagnostic | | |
-| ------------------ | ---------------- | ------- | ------------- |
-| (cid:73) Premature | drop in policy | entropy | ⇒ no learning |
-| (cid:73) Alleviate | by using entropy | bonus | or KL penalty |
+Entropy as Diagnostic
+- Premature drop in policy entropy ⇒ no learning
+- Alleviate by using entropy bonus or KL penalty
 KL as Diagnostic
-(cid:2) (cid:3)
-| (cid:73) Compute | KL π | (·|s),π(·|s) | |
-| ---------------- | ---- | ------------ | --- |
-old
-| (cid:73) KL spike | ⇒ drastic | loss of performance | |
-| -------------------- | --------- | ------------------- | ------------- |
-| (cid:73) No learning | progress | might mean steps | are too large |
-(cid:73) batchsize=100K converges to different result than batchsize=20K.
+- Compute KL [π_old(· | s), π(· | s)]
+- KL spike ⇒ drastic loss of performance
+- No learning progress might mean steps are too large
+- batchsize=100K converges to different result than batchsize=20K.
-| Baseline | Explained | Variance |
-| -------- | --------- | -------- |
-1−Var[empiricalreturn−predictedvalue]
-| (cid:73) | explained variance | = |
-| -------- | ------------------ | --- |
-Var[empiricalreturn]
+Baseline Explained Variance
+- explained variance = 1 − Var[empirical return − predicted value] / Var[empirical return]
 Policy Initialization
-(cid:73) More important than in supervised learning: determines initial state
-visitation
-| (cid:73) Zero | or tiny final layer, | to maximize | entropy |
-| ------------- | -------------------- | ----------- | ------- |
+- More important than in supervised learning: determines initial state visitation
+- Zero or tiny final layer, to maximize entropy
-| Q-Learning Strategies | | |
-| --------------------- | --- | --- |
-(cid:73) Optimize memory usage carefully: you’ll need it for replay buffer
-| (cid:73) Learning | rate schedules | |
-| -------------------- | -------------- | ------ |
-| (cid:73) Exploration | schedules | |
-| (cid:73) Be patient. | DQN converges | slowly |
-(cid:73) On Atari, often 10-40M frames to get policy much better than random
-ThankstoSzymonSidorforsuggestions
+## Q-Learning Strategies
-Miscellaneous Advice
-(cid:73) Read older textbooks and theses, not just conference papers
-(cid:73) Don’t get stuck on problems—can’t solve everything at once
-| (cid:73) Exploration | problems | like cart-pole swing-up |
-| -------------------- | ----------------- | ----------------------- |
-| (cid:73) DQN on | Atari vs CartPole | |
+- Optimize memory usage carefully: you'll need it for replay buffer
+- Learning rate schedules
+- Exploration schedules
+- Be patient. DQN converges slowly
+- On Atari, often 10-40M frames to get policy much better than random
-Thanks!
+Thanks to Szymon Sidor for suggestions
+
+## Miscellaneous Advice
+
+- Read older textbooks and theses, not just conference papers
+- Don't get stuck on problems—can't solve everything at once
+- Exploration problems like cart-pole swing-up
+- DQN on Atari vs CartPole
+
+[^1]: István Szita and András Lőrincz (2006). "Learning Tetris using the noisy cross-entropy method". In: Neural computation.
+[^2]: https://github.com/openai/rllab
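
Reviewer note (not part of the patch): a minimal NumPy sketch of three formulas from the added transcript text may help when checking it against the slides: the clipped standardization x' = clip((x − μ)/σ, −10, 10), the baseline explained variance, and the effective time horizon 1/(1 − γ). The names `RunningStandardizer`, `explained_variance`, and `effective_horizon`, the `eps` guard, and the choice of Welford's online update with a population standard deviation are assumptions for illustration; the slides only say to keep a running estimate of the mean and standard deviation.

```python
import numpy as np


class RunningStandardizer:
    """Tracks a running mean/std and applies x' = clip((x - mean) / std, -10, 10)."""

    def __init__(self, shape, clip=10.0, eps=1e-8):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.m2 = np.zeros(shape, dtype=np.float64)  # running sum of squared deviations
        self.count = 0
        self.clip = clip
        self.eps = eps  # hypothetical guard against division by zero

    def update(self, x):
        # Welford's online update for one observation (an assumed choice of estimator).
        x = np.asarray(x, dtype=np.float64)
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def __call__(self, x):
        # Population standard deviation of everything seen so far.
        std = np.sqrt(self.m2 / max(self.count, 1))
        z = (np.asarray(x, dtype=np.float64) - self.mean) / (std + self.eps)
        return np.clip(z, -self.clip, self.clip)


def explained_variance(empirical_return, predicted_value):
    """1 - Var[empirical return - predicted value] / Var[empirical return]."""
    y = np.asarray(empirical_return, dtype=np.float64)
    pred = np.asarray(predicted_value, dtype=np.float64)
    var_y = np.var(y)
    if var_y == 0.0:
        return float("nan")  # undefined when all returns are identical
    return float(1.0 - np.var(y - pred) / var_y)


def effective_horizon(gamma):
    """1 + gamma + gamma^2 + ... = 1 / (1 - gamma), valid for 0 <= gamma < 1."""
    return 1.0 / (1.0 - gamma)
```

With γ = 0.99, `effective_horizon` gives roughly 100, which is the slide's rule of thumb for how far ahead rewards still matter. An explained variance near 1 means the baseline predicts returns well, and a value at or below 0 means it is no better than predicting a constant.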