mirror of
https://github.com/wassname/ml-debug.git
synced 2026-06-27 15:16:02 +08:00
Replace OCR-garbled Schulman cache with clean slide transcript
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -1,199 +1,143 @@
|
|||||||
Source: http://joschu.net/docs/nuts-and-bolts.pdf
|
Source: http://joschu.net/docs/nuts-and-bolts.pdf
|
||||||
Title: Nuts and Bolts of Deep RL Research - John Schulman (2016)
|
Title: Nuts and Bolts of Deep RL Research - John Schulman (2016)
|
||||||
Fetched-via: bash -c 'uvx "markitdown[pdf]" http://joschu.net/docs/nuts-and-bolts.pdf'
|
Fetched-via: clean transcript of the slide deck (prior markitdown PDF extract was OCR-garbled with `(cid:73)` bullet glyphs; replaced with hand-pasted text)
|
||||||
Fetch-status: verbatim
|
Fetch-status: verbatim (clean paste)
|
||||||
|
|
||||||
| The Nuts | and Bolts | of Deep | RL Research |
|
# The Nuts and Bolts of Deep RL Research
|
||||||
| -------- | --------- | --------- | ----------- |
|
|
||||||
| | John | Schulman | |
|
|
||||||
| | December | 9th, 2016 | |
|
|
||||||
|
|
||||||
Outline
|
John Schulman, December 9th, 2016
|
||||||
| Approaching | New Problems | |
|
|
||||||
| --------------------- | ------------ | ---------- |
|
|
||||||
| Ongoing Development | | and Tuning |
|
|
||||||
| General Tuning | Strategies | for RL |
|
|
||||||
| Policy Gradient | Strategies | |
|
|
||||||
| Q-Learning Strategies | | |
|
|
||||||
| Miscellaneous | Advice | |
|
|
||||||
|
|
||||||
Approaching New Problems
|
## Outline
|
||||||
|
|
||||||
| New Algorithm? | Use Small | Test Problems |
|
- Approaching New Problems
|
||||||
| -------------------------- | --------- | ------------- |
|
- Ongoing Development and Tuning
|
||||||
| (cid:73) Run experiments | quickly | |
|
- General Tuning Strategies for RL
|
||||||
| (cid:73) Do hyperparameter | search | |
|
- Policy Gradient Strategies
|
||||||
(cid:73) Interpret and visualize learning process: state visitation, value function, etc.
|
- Q-Learning Strategies
|
||||||
(cid:73) Counterpoint: don’t overfit algorithm to contrived problem
|
- Miscellaneous Advice
|
||||||
(cid:73) Useful to have medium-sized problems that you’re intimately familiar with
|
|
||||||
(Hopper, Atari Pong)
|
|
||||||
|
|
||||||
| New Task? | Make | It Easier Until | Signs | of Life |
|
## Approaching New Problems
|
||||||
| ---------------- | --------------- | --------------- | ----- | ------- |
|
|
||||||
| (cid:73) Provide | good input | features | | |
|
New Algorithm? Use Small Test Problems
|
||||||
| (cid:73) Shape | reward function | | | |
|
- Run experiments quickly
|
||||||
|
- Do hyperparameter search
|
||||||
|
- Interpret and visualize learning process: state visitation, value function, etc.
|
||||||
|
- Counterpoint: don't overfit algorithm to contrived problem
|
||||||
|
- Useful to have medium-sized problems that you're intimately familiar with (Hopper, Atari Pong)
|
||||||
|
|
||||||
|
New Task? Make It Easier Until Signs of Life
|
||||||
|
- Provide good input features
|
||||||
|
- Shape reward function
|
||||||
|
|
||||||
POMDP Design
|
POMDP Design
|
||||||
(cid:73) Visualize random policy: does it sometimes exhibit desired behavior?
|
- Visualize random policy: does it sometimes exhibit desired behavior?
|
||||||
| (cid:73) Human | control | | | |
|
- Human control
|
||||||
| -------------- | ------- | --- | --- | --- |
|
- Atari: can you see game features in downsampled image?
|
||||||
(cid:73) Atari: can you see game features in downsampled image?
|
- Plot time series for observations and rewards. Are they on a reasonable scale?
|
||||||
(cid:73) Plot time series for observations and rewards. Are they on a reasonable
|
- hopper.py in gym: reward = 1.0 - 1e-3 * np.square(a).sum() + delta x / delta t
|
||||||
scale?
|
- Histogram observations and rewards
|
||||||
| (cid:73) hopper.py | in gym: | | | |
|
|
||||||
| ------------------ | ------------ | --------------------------- | ------- | ----------- |
|
|
||||||
| reward | = 1.0 | - 1e-3 * np.square(a).sum() | + delta | x / delta t |
|
|
||||||
| (cid:73) Histogram | observations | and rewards | | |
|
|
||||||
|
|
||||||
Run Your Baselines
|
Run Your Baselines
|
||||||
| (cid:73) Don’t expect | them to | work with default | parameters |
|
- Don't expect them to work with default parameters
|
||||||
| --------------------- | ------- | ----------------- | ---------- |
|
- Recommended:
|
||||||
(cid:73) Recommended:
|
- Cross-entropy method[^1]
|
||||||
| Cross-entropy | method1 | | |
|
- Well-tuned policy gradient method[^2]
|
||||||
| ------------- | ------- | --- | --- |
|
- Well-tuned Q-learning + SARSA method
|
||||||
(cid:73)
|
|
||||||
| (cid:73) Well-tuned | policy gradient | method2 | |
|
|
||||||
| ------------------- | --------------- | -------------- | --- |
|
|
||||||
| (cid:73) Well-tuned | Q-learning | + SARSA method | |
|
|
||||||
1Istv´anSzitaandAndr´asL¨orincz(2006).“LearningTetrisusingthenoisycross-entropymethod”. In:Neuralcomputation.
|
|
||||||
2https://github.com/openai/rllab
|
|
||||||
|
|
||||||
| Run with | More Samples | Than | Expected | |
|
Run with More Samples Than Expected
|
||||||
| -------- | ------------ | ---- | -------- | --- |
|
- Early in tuning process, may need huge number of samples
|
||||||
(cid:73) Early in tuning process, may need huge number of samples
|
- Don't be deterred by published work
|
||||||
| | Don’t be deterred | by published | work | |
|
- Examples:
|
||||||
| --- | ----------------- | ------------ | ---- | --- |
|
- TRPO on Atari: 100K timesteps per batch for KL= 0.01
|
||||||
(cid:73)
|
- DQN on Atari: update freq=10K, replay buffer size=1M
|
||||||
| (cid:73) Examples: | | | | |
|
|
||||||
| ------------------ | --- | --- | --- | --- |
|
|
||||||
(cid:73) TRPO on Atari: 100K timesteps per batch for KL= 0.01
|
|
||||||
| | DQN on Atari: | update freq=10K, | replay buffer | size=1M |
|
|
||||||
| --- | ------------- | ---------------- | ------------- | ------- |
|
|
||||||
(cid:73)
|
|
||||||
|
|
||||||
| Ongoing | Development | and Tuning |
|
## Ongoing Development and Tuning
|
||||||
| ------- | ----------- | ---------- |
|
|
||||||
|
|
||||||
| It | Works! | But | Don’t | Be Satisfied | | |
|
It Works! But Don't Be Satisfied
|
||||||
| --- | ---------------- | ----------- | ----- | ----------------- | --- | --- |
|
- Explore sensitivity to each parameter
|
||||||
| | (cid:73) Explore | sensitivity | | to each parameter | | |
|
- If too sensitive, it doesn't really work, you just got lucky
|
||||||
(cid:73) If too sensitive, it doesn’t really work, you just got lucky
|
- Look for health indicators
|
||||||
| | (cid:73) Look | for health | indicators | | | |
|
- VF fit quality
|
||||||
| --- | ------------- | --------------- | ---------- | --- | --- | --- |
|
- Policy entropy
|
||||||
| | | (cid:73) VF fit | quality | | | |
|
- Update size in output space and parameter space
|
||||||
| | | Policy | entropy | | | |
|
- Standard diagnostics for deep networks
|
||||||
(cid:73)
|
|
||||||
| | | (cid:73) Update | size in | output space | and parameter | space |
|
|
||||||
| --- | --- | ----------------- | ----------- | ------------ | ------------- | ----- |
|
|
||||||
| | | (cid:73) Standard | diagnostics | for | deep networks | |
|
|
||||||
|
|
||||||
| Continually | Benchmark | | Your Code |
|
Continually Benchmark Your Code
|
||||||
| ------------------- | --------- | ------------- | ------------ |
|
- If reusing code, regressions occur
|
||||||
| (cid:73) If reusing | code, | regressions | occur |
|
- Run a battery of benchmarks occasionally
|
||||||
| (cid:73) Run | a battery | of benchmarks | occasionally |
|
|
||||||
|
|
||||||
| Always | Use Multiple | Random | Seeds |
|
Always Use Multiple Random Seeds
|
||||||
| ------ | ------------ | ------ | ----- |
|
|
||||||
|
|
||||||
| Always Be | Ablating | |
|
Always Be Ablating
|
||||||
| ------------------ | ---------- | ---------- |
|
- Different tricks may substitute
|
||||||
| (cid:73) Different | tricks may | substitute |
|
- Especially whitening
|
||||||
| Especially | whitening | |
|
- "Regularize" to favor simplicity in algorithm design space
|
||||||
(cid:73)
|
- As usual, simplicity → generalization
|
||||||
(cid:73) “Regularize” to favor simplicity in algorithm design space
|
|
||||||
| (cid:73) As | usual, simplicity | → generalization |
|
|
||||||
| ----------- | ----------------- | ---------------- |
|
|
||||||
|
|
||||||
| Automate Your | Experiments | | |
|
Automate Your Experiments
|
||||||
| ------------- | ---------------- | --------- | ----------------- |
|
- Don't spend all day watching your code print out numbers
|
||||||
| Don’t spend | all day watching | your code | print out numbers |
|
- Consider using a cloud computing platform (Microsoft Azure, Amazon EC2, Google Compute Engine)
|
||||||
(cid:73)
|
|
||||||
(cid:73) Consider using a cloud computing platform (Microsoft Azure, Amazon EC2,
|
|
||||||
| Google Compute | Engine) | | |
|
|
||||||
| -------------- | ------- | --- | --- |
|
|
||||||
|
|
||||||
| General | Tuning | Strategies | for RL |
|
## General Tuning Strategies for RL
|
||||||
| ------- | ------ | ---------- | ------ |
|
|
||||||
|
|
||||||
| Whitening | / Standardizing | Data |
|
Whitening / Standardizing Data
|
||||||
| ------------------------ | --------------- | ------------------ |
|
- If observations have unknown range, standardize
|
||||||
| (cid:73) If observations | have unknown | range, standardize |
|
- Compute running estimate of mean and standard deviation
|
||||||
(cid:73) Compute running estimate of mean and standard deviation
|
- x' = clip((x − μ)/σ, −10, 10)
|
||||||
x(cid:48)
|
- Rescale the rewards, but don't shift mean, as that affects agent's will to live
|
||||||
(cid:73) = clip((x −µ)/σ,−10,10)
|
- Standardize prediction targets (e.g., value functions) the same way
|
||||||
(cid:73) Rescale the rewards, but don’t shift mean, as that affects agent’s will to live
|
|
||||||
(cid:73) Standardize prediction targets (e.g., value functions) the same way
|
|
||||||
|
|
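The whitening rule in the new transcript, x' = clip((x − μ)/σ, −10, 10), can be sketched roughly as below. This is a minimal illustration, not code from the slides: the class name `RunningStandardizer`, the use of Welford's online algorithm for the running statistics, the scalar inputs, and the `eps` guard are all assumptions made for the example. Rewards are divided by the running standard deviation only, following the slide's advice not to shift their mean.

```python
import math


class RunningStandardizer:
    """Hypothetical helper: running mean/std via Welford's online algorithm."""

    def __init__(self, clip=10.0, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0     # running sum of squared deviations from the mean
        self.clip = clip  # clip range from the slide: [-10, 10]
        self.eps = eps    # assumption: small constant to avoid dividing by zero

    def update(self, x):
        """Fold one scalar sample into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Fall back to 1.0 until there are enough samples for a variance.
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / self.count)

    def standardize(self, x):
        """x' = clip((x - mean) / std, -clip, clip), for observations."""
        z = (x - self.mean) / (self.std + self.eps)
        return max(-self.clip, min(self.clip, z))

    def rescale_reward(self, r):
        """Scale only; the mean is deliberately not subtracted."""
        return r / (self.std + self.eps)


obs_stats = RunningStandardizer()
for value in [1.0, 2.0, 3.0, 4.0, 5.0]:
    obs_stats.update(value)
print(obs_stats.standardize(3.0), obs_stats.standardize(1000.0))
```

In practice one would probably keep separate statistics for observations, rewards, and value targets, and vectorize the update over observation dimensions.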
||||||
| Generally | Important | Parameters | | | |
|
Generally Important Parameters
|
||||||
| --------- | --------------- | ------------- | ---- | ------- | --------- |
|
- Discount
|
||||||
| (cid:73) | Discount | | | | |
|
- Return_t = r_t + γr_{t+1} + γ²r_{t+2} + . . .
|
||||||
| | (cid:73) Return | = r +γr | +γ2r | +... | |
|
- Effective time horizon: 1 + γ + γ² + · · · = 1/(1 − γ)
|
||||||
| | | t t | t+1 | t+2 | |
|
- I.e., γ = 0.99 ⇒ ignore rewards delayed by more than 100 timesteps
|
||||||
| | Effective | time horizon: | 1+γ | +γ2+··· | = 1/(1−γ) |
|
- Low γ works well for well-shaped reward
|
||||||
(cid:73)
|
- In TD(λ) methods, can get away with high γ when λ < 1
|
||||||
(cid:73) I.e., γ =0.99⇒ ignore rewards delayed by more than 100 timesteps
|
- Action frequency
|
||||||
| | Low | γ works well | for well-shaped | reward | |
|
- Solvable with human control (if possible)
|
||||||
| --- | --- | ------------ | --------------- | ------ | --- |
|
- View random exploration
|
||||||
(cid:73)
|
|
||||||
(cid:73) In TD(λ) methods, can get away with high γ when λ < 1
|
|
||||||
| (cid:73) | Action frequency | | | | |
|
|
||||||
| -------- | ---------------- | ---------- | ------- | ------------- | --- |
|
|
||||||
| | Solvable | with human | control | (if possible) | |
|
|
||||||
(cid:73)
|
|
||||||
| | (cid:73) View | random exploration | | | |
|
|
||||||
| --- | ------------- | ------------------ | --- | --- | --- |
|
|
||||||
|
|
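The discount arithmetic in the transcript can be checked with a short sketch. The function names `effective_horizon` and `discounted_return` are made up for this example. Note that the "ignore rewards delayed by more than 100 timesteps" reading of γ = 0.99 is a rule of thumb: a reward delayed by 100 steps still carries weight 0.99^100 ≈ 0.37.

```python
def effective_horizon(gamma):
    """Sum of the geometric series 1 + gamma + gamma^2 + ... = 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


def discounted_return(rewards, gamma):
    """Return_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for t = 0."""
    total = 0.0
    # Accumulate from the last reward backwards so each step is one multiply-add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


print(effective_horizon(0.99))                  # roughly 100 timesteps
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25
print(0.99 ** 100)                              # weight of a reward 100 steps away
```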
||||||
General RL Diagnostics
|
General RL Diagnostics
|
||||||
(cid:73) Look at min/max/stdev of episode returns, along with mean
|
- Look at min/max/stdev of episode returns, along with mean
|
||||||
(cid:73) Look at episode lengths: sometimes provides additional information
|
- Look at episode lengths: sometimes provides additional information
|
||||||
| (cid:73) Solving problem | faster, losing | game slower |
|
- Solving problem faster, losing game slower
|
||||||
| ------------------------ | -------------- | ----------- |
|
|
||||||
|
|
||||||
Policy Gradient Strategies
|
## Policy Gradient Strategies
|
||||||
|
|
||||||
| Entropy as | Diagnostic | | |
|
Entropy as Diagnostic
|
||||||
| ------------------ | ---------------- | ------- | ------------- |
|
- Premature drop in policy entropy ⇒ no learning
|
||||||
| (cid:73) Premature | drop in policy | entropy | ⇒ no learning |
|
- Alleviate by using entropy bonus or KL penalty
|
||||||
| (cid:73) Alleviate | by using entropy | bonus | or KL penalty |
|
|
||||||
|
|
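As a rough illustration of the entropy diagnostic, the sketch below computes the entropy of a discrete action distribution. It assumes a categorical policy whose action probabilities are available as a plain list; the function name `policy_entropy` is hypothetical. A uniform policy over n actions has the maximum entropy log(n), and a deterministic policy has entropy 0, so a value that falls toward 0 early in training is the "premature drop" the slide warns about.

```python
import math


def policy_entropy(probs):
    """Entropy -sum(p * log p) of one categorical action distribution, in nats."""
    # Terms with p == 0 contribute nothing and are skipped to avoid log(0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = [0.25, 0.25, 0.25, 0.25]
collapsed = [1.0, 0.0, 0.0, 0.0]
print(policy_entropy(uniform), math.log(4))  # maximum entropy for 4 actions
print(policy_entropy(collapsed))             # no exploration left
```

Logging the batch average of this quantity at every iteration would give the time series to watch.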
||||||
KL as Diagnostic
|
KL as Diagnostic
|
||||||
(cid:2) (cid:3)
|
- Compute KL [π_old(· | s), π(· | s)]
|
||||||
| (cid:73) Compute | KL π | (·|s),π(·|s) | |
|
- KL spike ⇒ drastic loss of performance
|
||||||
| ---------------- | ---- | ------------ | --- |
|
- No learning progress might mean steps are too large
|
||||||
old
|
- batchsize=100K converges to different result than batchsize=20K.
|
||||||
| (cid:73) KL spike | ⇒ drastic | loss of performance | |
|
|
||||||
| -------------------- | --------- | ------------------- | ------------- |
|
|
||||||
| (cid:73) No learning | progress | might mean steps | are too large |
|
|
||||||
(cid:73) batchsize=100K converges to different result than batchsize=20K.
|
|
||||||
|
|
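The KL diagnostic, KL[π_old(· | s), π(· | s)], could be computed for a discrete-action policy along the following lines. This is a sketch under assumptions: both policies are given as lists of action probabilities per state, the names `categorical_kl` and `mean_kl` are invented for the example, and `eps` is a small constant added here to keep the logarithm finite.

```python
import math


def categorical_kl(p_old, p_new, eps=1e-12):
    """KL[p_old, p_new] = sum_a p_old(a) * log(p_old(a) / p_new(a)), in nats."""
    return sum(
        po * math.log((po + eps) / (pn + eps))
        for po, pn in zip(p_old, p_new)
        if po > 0.0  # zero-probability actions under p_old contribute nothing
    )


def mean_kl(old_batch, new_batch):
    """Average the per-state KL over a batch of visited states."""
    kls = [categorical_kl(po, pn) for po, pn in zip(old_batch, new_batch)]
    return sum(kls) / len(kls)


old_policy = [[0.5, 0.5], [0.5, 0.5]]
new_policy = [[0.5, 0.5], [0.9, 0.1]]
print(mean_kl(old_policy, new_policy))
```

Tracking this value after each policy update makes a spike easy to spot and to line up against a drop in return.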
||||||
| Baseline | Explained | Variance |
|
Baseline Explained Variance
|
||||||
| -------- | --------- | -------- |
|
- explained variance = 1 − Var[empirical return − predicted value] / Var[empirical return]
|
||||||
1−Var[empiricalreturn−predictedvalue]
|
|
||||||
| (cid:73) | explained variance | = |
|
|
||||||
| -------- | ------------------ | --- |
|
|
||||||
Var[empiricalreturn]
|
|
||||||
|
|
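The explained-variance formula in the transcript translates directly into code. The sketch below uses plain Python lists and population variance; the function names and the choice to return NaN when the returns have zero variance are assumptions for the example. A value of 1 means the baseline predicts returns perfectly, 0 means it does no better than a constant, and a negative value means it does worse than a constant.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def explained_variance(empirical_returns, predicted_values):
    """1 - Var[empirical return - predicted value] / Var[empirical return]."""
    var_returns = variance(empirical_returns)
    if var_returns == 0.0:
        return float("nan")  # assumption: undefined when returns never vary
    residuals = [r - v for r, v in zip(empirical_returns, predicted_values)]
    return 1.0 - variance(residuals) / var_returns


returns = [1.0, 2.0, 3.0, 4.0]
print(explained_variance(returns, returns))               # perfect baseline
print(explained_variance(returns, [0.0, 0.0, 0.0, 0.0]))  # constant baseline
print(explained_variance(returns, [4.0, 3.0, 2.0, 1.0]))  # anti-correlated baseline
```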
||||||
Policy Initialization
|
Policy Initialization
|
||||||
(cid:73) More important than in supervised learning: determines initial state
|
- More important than in supervised learning: determines initial state visitation
|
||||||
visitation
|
- Zero or tiny final layer, to maximize entropy
|
||||||
| (cid:73) Zero | or tiny final layer, | to maximize | entropy |
|
|
||||||
| ------------- | -------------------- | ----------- | ------- |
|
|
||||||
|
|
||||||
| Q-Learning Strategies | | |
|
## Q-Learning Strategies
|
||||||
| --------------------- | --- | --- |
|
|
||||||
(cid:73) Optimize memory usage carefully: you’ll need it for replay buffer
|
|
||||||
| (cid:73) Learning | rate schedules | |
|
|
||||||
| -------------------- | -------------- | ------ |
|
|
||||||
| (cid:73) Exploration | schedules | |
|
|
||||||
| (cid:73) Be patient. | DQN converges | slowly |
|
|
||||||
(cid:73) On Atari, often 10-40M frames to get policy much better than random
|
|
||||||
ThankstoSzymonSidorforsuggestions
|
|
||||||
|
|
||||||
Miscellaneous Advice
|
- Optimize memory usage carefully: you'll need it for replay buffer
|
||||||
(cid:73) Read older textbooks and theses, not just conference papers
|
- Learning rate schedules
|
||||||
(cid:73) Don’t get stuck on problems—can’t solve everything at once
|
- Exploration schedules
|
||||||
| (cid:73) Exploration | problems | like cart-pole swing-up |
|
- Be patient. DQN converges slowly
|
||||||
| -------------------- | ----------------- | ----------------------- |
|
- On Atari, often 10-40M frames to get policy much better than random
|
||||||
| (cid:73) DQN on | Atari vs CartPole | |
|
|
||||||
|
|
||||||
Thanks!
|
Thanks to Szymon Sidor for suggestions
|
||||||
|
|
||||||
|
## Miscellaneous Advice
|
||||||
|
|
||||||
|
- Read older textbooks and theses, not just conference papers
|
||||||
|
- Don't get stuck on problems—can't solve everything at once
|
||||||
|
- Exploration problems like cart-pole swing-up
|
||||||
|
- DQN on Atari vs CartPole
|
||||||
|
|
||||||
|
[^1]: István Szita and András Lőrincz (2006). "Learning Tetris using the noisy cross-entropy method". In: Neural computation.
|
||||||
|
[^2]: https://github.com/openai/rllab
|
||||||
|
|||||||