Files
ml-debug/docs/evidence/joschu_nuts_and_bolts.md
T

144 lines
4.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
Source: http://joschu.net/docs/nuts-and-bolts.pdf
Title: Nuts and Bolts of Deep RL Research - John Schulman (2016)
Fetched-via: clean transcript of the slide deck (prior markitdown PDF extract was OCR-garbled with `(cid:73)` bullet glyphs; replaced with hand-pasted text)
Fetch-status: verbatim (clean paste)
# The Nuts and Bolts of Deep RL Research
John Schulman, December 9th, 2016
## Outline
- Approaching New Problems
- Ongoing Development and Tuning
- General Tuning Strategies for RL
- Policy Gradient Strategies
- Q-Learning Strategies
- Miscellaneous Advice
## Approaching New Problems
New Algorithm? Use Small Test Problems
- Run experiments quickly
- Do hyperparameter search
- Interpret and visualize learning process: state visitation, value function, etc.
- Counterpoint: don't overfit algorithm to contrived problem
- Useful to have medium-sized problems that you're intimately familiar with (Hopper, Atari Pong)
New Task? Make It Easier Until Signs of Life
- Provide good input features
- Shape reward function
POMDP Design
- Visualize random policy: does it sometimes exhibit desired behavior?
- Human control
- Atari: can you see game features in downsampled image?
- Plot time series for observations and rewards. Are they on a reasonable scale?
- hopper.py in gym: reward = 1.0 - 1e-3 * np.square(a).sum() + delta x / delta t
- Histogram observations and rewards
Run Your Baselines
- Don't expect them to work with default parameters
- Recommended:
- Cross-entropy method[^1]
- Well-tuned policy gradient method[^2]
- Well-tuned Q-learning + SARSA method
Run with More Samples Than Expected
- Early in tuning process, may need huge number of samples
- Don't be deterred by published work
- Examples:
- TRPO on Atari: 100K timesteps per batch for KL= 0.01
- DQN on Atari: update freq=10K, replay buffer size=1M
## Ongoing Development and Tuning
It Works! But Don't Be Satisfied
- Explore sensitivity to each parameter
- If too sensitive, it doesn't really work, you just got lucky
- Look for health indicators
- VF fit quality
- Policy entropy
- Update size in output space and parameter space
- Standard diagnostics for deep networks
Continually Benchmark Your Code
- If reusing code, regressions occur
- Run a battery of benchmarks occasionally
Always Use Multiple Random Seeds
Always Be Ablating
- Different tricks may substitute
- Especially whitening
- "Regularize" to favor simplicity in algorithm design space
- As usual, simplicity → generalization
Automate Your Experiments
- Don't spend all day watching your code print out numbers
- Consider using a cloud computing platform (Microsoft Azure, Amazon EC2, Google Compute Engine)
## General Tuning Strategies for RL
Whitening / Standardizing Data
- If observations have unknown range, standardize
- Compute running estimate of mean and standard deviation
- x' = clip((x μ)/σ, 10, 10)
- Rescale the rewards, but don't shift mean, as that affects agent's will to live
- Standardize prediction targets (e.g., value functions) the same way
Generally Important Parameters
- Discount
- Return_t = r_t + γr_{t+1} + γ²r_{t+2} + . . .
- Effective time horizon: 1 + γ + γ² + · · · = 1/(1 γ)
- I.e., γ = 0.99 ⇒ ignore rewards delayed by more than 100 timesteps
- Low γ works well for well-shaped reward
- In TD(λ) methods, can get away with high γ when λ < 1
- Action frequency
- Solvable with human control (if possible)
- View random exploration
General RL Diagnostics
- Look at min/max/stdev of episode returns, along with mean
- Look at episode lengths: sometimes provides additional information
- Solving problem faster, losing game slower
## Policy Gradient Strategies
Entropy as Diagnostic
- Premature drop in policy entropy ⇒ no learning
- Alleviate by using entropy bonus or KL penalty
KL as Diagnostic
- Compute KL [π_old(· | s), π(· | s)]
- KL spike ⇒ drastic loss of performance
- No learning progress might mean steps are too large
- batchsize=100K converges to different result than batchsize=20K.
Baseline Explained Variance
- explained variance = 1 Var[empirical return predicted value] / Var[empirical return]
Policy Initialization
- More important than in supervised learning: determines initial state visitation
- Zero or tiny final layer, to maximize entropy
## Q-Learning Strategies
- Optimize memory usage carefully: you'll need it for replay buffer
- Learning rate schedules
- Exploration schedules
- Be patient. DQN converges slowly
- On Atari, often 10-40M frames to get policy much better than random
Thanks to Szymon Sidor for suggestions
## Miscellaneous Advice
- Read older textbooks and theses, not just conference papers
- Don't get stuck on problems—can't solve everything at once
- Exploration problems like cart-pole swing-up
- DQN on Atari vs CartPole
[^1]: István Szita and András Lőrincz (2006). "Learning Tetris using the noisy cross-entropy method". In: Neural computation.
[^2]: https://github.com/openai/rllab