Replace OCR-garbled Schulman cache with clean slide transcript

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 01:00:14 +08:00 · 2026-06-25 10:26:10 +08:00
parent 67d4dc90bb
commit 3fe6cb9ad9
1 changed files with 110 additions and 166 deletions
@@ -1,199 +1,143 @@
 Source: http://joschu.net/docs/nuts-and-bolts.pdf
 Title: Nuts and Bolts of Deep RL Research - John Schulman (2016)
-Fetched-via: bash -c 'uvx "markitdown[pdf]" http://joschu.net/docs/nuts-and-bolts.pdf'
+Fetched-via: clean transcript of the slide deck (prior markitdown PDF extract was OCR-garbled with `(cid:73)` bullet glyphs; replaced with hand-pasted text)
-Fetch-status: verbatim
+Fetch-status: verbatim (clean paste)
-| The Nuts | and Bolts | of Deep   | RL Research |
+# The Nuts and Bolts of Deep RL Research
 | -------- | --------- | --------- | ----------- |
 |          | John      | Schulman  |             |
 |          | December  | 9th, 2016 |             |
-Outline
+John Schulman, December 9th, 2016
 | Approaching           | New Problems |            |
 | --------------------- | ------------ | ---------- |
 | Ongoing Development   |              | and Tuning |
 | General Tuning        | Strategies   | for RL     |
 | Policy Gradient       | Strategies   |            |
 | Q-Learning Strategies |              |            |
 | Miscellaneous         | Advice       |            |
-Approaching New Problems
+## Outline
-| New Algorithm?             | Use Small | Test Problems |
+- Approaching New Problems
-| -------------------------- | --------- | ------------- |
+- Ongoing Development and Tuning
-| (cid:73) Run experiments   | quickly   |               |
+- General Tuning Strategies for RL
-| (cid:73) Do hyperparameter | search    |               |
+- Policy Gradient Strategies
-(cid:73) Interpret and visualize learning process: state visitation, value function, etc.
+- Q-Learning Strategies
-(cid:73) Counterpoint: don’t overfit algorithm to contrived problem
+- Miscellaneous Advice
 (cid:73) Useful to have medium-sized problems that you’re intimately familiar with
 (Hopper, Atari Pong)
-| New Task?        | Make            | It Easier Until | Signs | of Life |
+## Approaching New Problems
-| ---------------- | --------------- | --------------- | ----- | ------- |
+
-| (cid:73) Provide | good input      | features        |       |         |
+New Algorithm? Use Small Test Problems
-| (cid:73) Shape   | reward function |                 |       |         |
+- Run experiments quickly
 - Do hyperparameter search
 - Interpret and visualize learning process: state visitation, value function, etc.
 - Counterpoint: don't overfit algorithm to contrived problem
 - Useful to have medium-sized problems that you're intimately familiar with (Hopper, Atari Pong)
 New Task? Make It Easier Until Signs of Life
 - Provide good input features
 - Shape reward function
 POMDP Design
-(cid:73) Visualize random policy: does it sometimes exhibit desired behavior?
+- Visualize random policy: does it sometimes exhibit desired behavior?
-| (cid:73) Human | control |     |     |     |
+- Human control
-| -------------- | ------- | --- | --- | --- |
+- Atari: can you see game features in downsampled image?
-(cid:73) Atari: can you see game features in downsampled image?
+- Plot time series for observations and rewards. Are they on a reasonable scale?
-(cid:73) Plot time series for observations and rewards. Are they on a reasonable
+- hopper.py in gym: reward = 1.0 - 1e-3 * np.square(a).sum() + delta x / delta t
-scale?
+- Histogram observations and rewards
 | (cid:73) hopper.py | in gym:      |                             |         |             |
 | ------------------ | ------------ | --------------------------- | ------- | ----------- |
 | reward             | = 1.0        | - 1e-3 * np.square(a).sum() | + delta | x / delta t |
 | (cid:73) Histogram | observations | and rewards                 |         |             |
 Run Your Baselines
-| (cid:73) Don’t expect | them to | work with default | parameters |
+- Don't expect them to work with default parameters
-| --------------------- | ------- | ----------------- | ---------- |
+- Recommended:
-(cid:73) Recommended:
+  - Cross-entropy method[^1]
-| Cross-entropy | method1 |     |     |
+  - Well-tuned policy gradient method[^2]
-| ------------- | ------- | --- | --- |
+  - Well-tuned Q-learning + SARSA method
 (cid:73)
 | (cid:73) Well-tuned | policy gradient | method2        |     |
 | ------------------- | --------------- | -------------- | --- |
 | (cid:73) Well-tuned | Q-learning      | + SARSA method |     |
 1Istv´anSzitaandAndr´asL¨orincz(2006).“LearningTetrisusingthenoisycross-entropymethod”. In:Neuralcomputation.
 2https://github.com/openai/rllab
-| Run with | More Samples | Than | Expected |     |
+Run with More Samples Than Expected
-| -------- | ------------ | ---- | -------- | --- |
+- Early in tuning process, may need huge number of samples
-(cid:73) Early in tuning process, may need huge number of samples
+- Don't be deterred by published work
-|     | Don’t be deterred | by published | work |     |
+- Examples:
-| --- | ----------------- | ------------ | ---- | --- |
+  - TRPO on Atari: 100K timesteps per batch for KL= 0.01
-(cid:73)
+  - DQN on Atari: update freq=10K, replay buffer size=1M
 | (cid:73) Examples: |     |     |     |     |
 | ------------------ | --- | --- | --- | --- |
 (cid:73) TRPO on Atari: 100K timesteps per batch for KL= 0.01
 |     | DQN on Atari: | update freq=10K, | replay buffer | size=1M |
 | --- | ------------- | ---------------- | ------------- | ------- |
 (cid:73)
-| Ongoing | Development | and Tuning |
+## Ongoing Development and Tuning
 | ------- | ----------- | ---------- |
-| It  | Works!           | But         | Don’t | Be Satisfied      |     |     |
+It Works! But Don't Be Satisfied
-| --- | ---------------- | ----------- | ----- | ----------------- | --- | --- |
+- Explore sensitivity to each parameter
-|     | (cid:73) Explore | sensitivity |       | to each parameter |     |     |
+- If too sensitive, it doesn't really work, you just got lucky
-(cid:73) If too sensitive, it doesn’t really work, you just got lucky
+- Look for health indicators
-|     | (cid:73) Look | for health      | indicators |     |     |     |
+  - VF fit quality
-| --- | ------------- | --------------- | ---------- | --- | --- | --- |
+  - Policy entropy
-|     |               | (cid:73) VF fit | quality    |     |     |     |
+  - Update size in output space and parameter space
-|     |               | Policy          | entropy    |     |     |     |
+  - Standard diagnostics for deep networks
 (cid:73)
 |     |     | (cid:73) Update   | size in     | output space | and parameter | space |
 | --- | --- | ----------------- | ----------- | ------------ | ------------- | ----- |
 |     |     | (cid:73) Standard | diagnostics | for          | deep networks |       |
-| Continually         | Benchmark |               | Your Code    |
+Continually Benchmark Your Code
-| ------------------- | --------- | ------------- | ------------ |
+- If reusing code, regressions occur
-| (cid:73) If reusing | code,     | regressions   | occur        |
+- Run a battery of benchmarks occasionally
 | (cid:73) Run        | a battery | of benchmarks | occasionally |
-| Always | Use Multiple | Random | Seeds |
+Always Use Multiple Random Seeds
 | ------ | ------------ | ------ | ----- |
-| Always Be          | Ablating   |            |
+Always Be Ablating
-| ------------------ | ---------- | ---------- |
+- Different tricks may substitute
-| (cid:73) Different | tricks may | substitute |
+- Especially whitening
-| Especially         | whitening  |            |
+- "Regularize" to favor simplicity in algorithm design space
-(cid:73)
+- As usual, simplicity → generalization
 (cid:73) “Regularize” to favor simplicity in algorithm design space
 | (cid:73) As | usual, simplicity | → generalization |
 | ----------- | ----------------- | ---------------- |
-| Automate Your | Experiments      |           |                   |
+Automate Your Experiments
-| ------------- | ---------------- | --------- | ----------------- |
+- Don't spend all day watching your code print out numbers
-| Don’t spend   | all day watching | your code | print out numbers |
+- Consider using a cloud computing platform (Microsoft Azure, Amazon EC2, Google Compute Engine)
 (cid:73)
 (cid:73) Consider using a cloud computing platform (Microsoft Azure, Amazon EC2,
 | Google Compute | Engine) |     |     |
 | -------------- | ------- | --- | --- |
-| General | Tuning | Strategies | for RL |
+## General Tuning Strategies for RL
 | ------- | ------ | ---------- | ------ |
-| Whitening                | / Standardizing | Data               |
+Whitening / Standardizing Data
-| ------------------------ | --------------- | ------------------ |
+- If observations have unknown range, standardize
-| (cid:73) If observations | have unknown    | range, standardize |
+- Compute running estimate of mean and standard deviation
-(cid:73) Compute running estimate of mean and standard deviation
+- x' = clip((x − μ)/σ, −10, 10)
-x(cid:48)
+- Rescale the rewards, but don't shift mean, as that affects agent's will to live
-(cid:73) = clip((x −µ)/σ,−10,10)
+- Standardize prediction targets (e.g., value functions) the same way
 (cid:73) Rescale the rewards, but don’t shift mean, as that affects agent’s will to live
 (cid:73) Standardize prediction targets (e.g., value functions) the same way
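Illustrative sketch, not part of the slide text: one way the whitening bullets above could be implemented with NumPy. The class name `RunningNormalizer`, the helper `scale_reward`, and the `eps` guard are assumptions made for this example; only the clip range of ±10 and the "rescale but don't shift rewards" rule come from the slides.

```python
import numpy as np


class RunningNormalizer:
    """Tracks a running mean and standard deviation and standardizes inputs."""

    def __init__(self, shape, clip=10.0, eps=1e-8):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.m2 = np.zeros(shape, dtype=np.float64)  # running sum of squared deviations
        self.count = 0
        self.clip = clip
        self.eps = eps  # assumption: small constant to avoid division by zero

    def update(self, x):
        # Welford's online update for a single observation.
        x = np.asarray(x, dtype=np.float64)
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        if self.count < 2:
            return np.ones_like(self.mean)
        return np.sqrt(self.m2 / self.count) + self.eps

    def normalize(self, x):
        # x' = clip((x - mean) / std, -clip, clip)
        z = (np.asarray(x, dtype=np.float64) - self.mean) / self.std
        return np.clip(z, -self.clip, self.clip)


def scale_reward(r, reward_std, eps=1e-8):
    # Rescale only: the mean is deliberately not subtracted, so the sign of
    # the reward (and with it the incentive to keep the episode going) is kept.
    return r / (reward_std + eps)
```

A typical use would call `update` on every raw observation and feed `normalize(obs)` to the policy; value-function targets could be standardized with a second instance.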
-| Generally | Important       | Parameters    |      |         |           |
+Generally Important Parameters
-| --------- | --------------- | ------------- | ---- | ------- | --------- |
+- Discount
-| (cid:73)  | Discount        |               |      |         |           |
+  - Return_t = r_t + γr_{t+1} + γ²r_{t+2} + . . .
-|           | (cid:73) Return | = r +γr       | +γ2r | +...    |           |
+  - Effective time horizon: 1 + γ + γ² + · · · = 1/(1 − γ)
-|           |                 | t t           | t+1  | t+2     |           |
+  - I.e., γ = 0.99 ⇒ ignore rewards delayed by more than 100 timesteps
-|           | Effective       | time horizon: | 1+γ  | +γ2+··· | = 1/(1−γ) |
+  - Low γ works well for well-shaped reward
-(cid:73)
+  - In TD(λ) methods, can get away with high γ when λ < 1
-(cid:73) I.e., γ =0.99⇒ ignore rewards delayed by more than 100 timesteps
+- Action frequency
-|     | Low | γ works well | for well-shaped | reward |     |
+  - Solvable with human control (if possible)
-| --- | --- | ------------ | --------------- | ------ | --- |
+  - View random exploration
 (cid:73)
 (cid:73) In TD(λ) methods, can get away with high γ when λ < 1
 | (cid:73) | Action frequency |            |         |               |     |
 | -------- | ---------------- | ---------- | ------- | ------------- | --- |
 |          | Solvable         | with human | control | (if possible) |     |
 (cid:73)
 |     | (cid:73) View | random exploration |     |     |     |
 | --- | ------------- | ------------------ | --- | --- | --- |
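Illustrative sketch, not part of the slide text: the discount formulas above written out in Python. The function names are assumptions for this example; the formulas are the ones on the slide.

```python
import numpy as np


def effective_horizon(gamma):
    # 1 + gamma + gamma^2 + ... = 1 / (1 - gamma), valid for 0 <= gamma < 1.
    return 1.0 / (1.0 - gamma)


def discounted_returns(rewards, gamma):
    # Return_t = r_t + gamma * Return_{t+1}, computed backwards over one episode.
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

With `gamma = 0.99` the effective horizon comes out at roughly 100 timesteps, matching the rule of thumb in the bullet above.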
 General RL Diagnostics
-(cid:73) Look at min/max/stdev of episode returns, along with mean
+- Look at min/max/stdev of episode returns, along with mean
-(cid:73) Look at episode lengths: sometimes provides additional information
+- Look at episode lengths: sometimes provides additional information
-| (cid:73) Solving problem | faster, losing | game slower |
+- Solving problem faster, losing game slower
 | ------------------------ | -------------- | ----------- |
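Illustrative sketch, not part of the slide text: a small helper that reports the statistics named in the diagnostics bullets above. The function name and the dictionary keys are assumptions for this example.

```python
import numpy as np


def summarize_episodes(episode_returns, episode_lengths):
    # Report spread as well as mean: min/max/stdev can reveal occasional
    # catastrophic or lucky episodes that the mean alone hides.
    returns = np.asarray(episode_returns, dtype=np.float64)
    lengths = np.asarray(episode_lengths, dtype=np.float64)
    return {
        "return_mean": float(returns.mean()),
        "return_std": float(returns.std()),
        "return_min": float(returns.min()),
        "return_max": float(returns.max()),
        "length_mean": float(lengths.mean()),
    }
```

Logging `length_mean` next to the return statistics makes it possible to see, for example, an agent that still loses but takes longer to do so.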
-Policy Gradient Strategies
+## Policy Gradient Strategies
-| Entropy as         | Diagnostic       |         |               |
+Entropy as Diagnostic
-| ------------------ | ---------------- | ------- | ------------- |
+- Premature drop in policy entropy ⇒ no learning
-| (cid:73) Premature | drop in policy   | entropy | ⇒ no learning |
+- Alleviate by using entropy bonus or KL penalty
 | (cid:73) Alleviate | by using entropy | bonus   | or KL penalty |
 KL as Diagnostic
-(cid:2) (cid:3)
+- Compute KL [π_old(· | s), π(· | s)]
-| (cid:73) Compute | KL π | (·|s),π(·|s) |     |
+- KL spike ⇒ drastic loss of performance
-| ---------------- | ---- | ------------ | --- |
+- No learning progress might mean steps are too large
-old
+- batchsize=100K converges to different result than batchsize=20K.
 | (cid:73) KL spike    | ⇒ drastic | loss of performance |               |
 | -------------------- | --------- | ------------------- | ------------- |
 | (cid:73) No learning | progress  | might mean steps    | are too large |
 (cid:73) batchsize=100K converges to different result than batchsize=20K.
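Illustrative sketch, not part of the slide text: the KL diagnostic above for a discrete-action policy, assuming action probabilities are available as arrays of shape `(batch, n_actions)`. The function name and the `eps` guard are assumptions for this example.

```python
import numpy as np


def categorical_kl(p_old, p_new, eps=1e-12):
    # KL[pi_old(.|s), pi(.|s)] for each state in the batch.
    p_old = np.asarray(p_old, dtype=np.float64)
    p_new = np.asarray(p_new, dtype=np.float64)
    return np.sum(p_old * (np.log(p_old + eps) - np.log(p_new + eps)), axis=-1)
```

Averaging the result over the batch after each policy update and plotting it over time would make the spikes mentioned above visible.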
-| Baseline | Explained | Variance |
+Baseline Explained Variance
-| -------- | --------- | -------- |
+- explained variance = 1 − Var[empirical return − predicted value] / Var[empirical return]
 1−Var[empiricalreturn−predictedvalue]
 | (cid:73) | explained variance | =   |
 | -------- | ------------------ | --- |
 Var[empiricalreturn]
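Illustrative sketch, not part of the slide text: the explained-variance formula above as a NumPy function. The function name and the NaN convention for constant returns are assumptions for this example.

```python
import numpy as np


def explained_variance(predicted_values, empirical_returns):
    # 1 - Var[empirical return - predicted value] / Var[empirical return]
    predicted_values = np.asarray(predicted_values, dtype=np.float64)
    empirical_returns = np.asarray(empirical_returns, dtype=np.float64)
    var_returns = np.var(empirical_returns)
    if var_returns == 0.0:
        return float("nan")  # assumption: undefined when returns are constant
    return 1.0 - np.var(empirical_returns - predicted_values) / var_returns
```

A value near 1 indicates the baseline predicts returns well, a value near 0 means it does no better than a constant, and a negative value means it is worse than a constant.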
 Policy Initialization
-(cid:73) More important than in supervised learning: determines initial state
+- More important than in supervised learning: determines initial state visitation
-visitation
+- Zero or tiny final layer, to maximize entropy
 | (cid:73) Zero | or tiny final layer, | to maximize | entropy |
 | ------------- | -------------------- | ----------- | ------- |
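Illustrative sketch, not part of the slide text: why a zero final layer maximizes entropy for a softmax policy. The feature matrix, layer sizes, and weight scale are made-up values for this example; with zero weights every logit is 0, so the policy is uniform and its entropy equals log(n_actions).

```python
import numpy as np


def softmax(logits):
    z = logits - np.max(logits, axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)


def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=-1)


rng = np.random.default_rng(0)
features = rng.normal(size=(4, 16))  # hypothetical last-hidden-layer activations
n_actions = 6

w_zero = np.zeros((16, n_actions))                      # zero final layer
w_large = rng.normal(scale=1.0, size=(16, n_actions))   # large, untuned final layer

h_zero = entropy(softmax(features @ w_zero))    # uniform policy in every state
h_large = entropy(softmax(features @ w_large))  # more peaked, lower entropy
```

The same `entropy` helper can serve as the policy-entropy diagnostic: tracking its batch mean during training would show the premature drops described in the "Entropy as Diagnostic" bullets.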
-| Q-Learning Strategies |     |     |
+## Q-Learning Strategies
 | --------------------- | --- | --- |
 (cid:73) Optimize memory usage carefully: you’ll need it for replay buffer
 | (cid:73) Learning    | rate schedules |        |
 | -------------------- | -------------- | ------ |
 | (cid:73) Exploration | schedules      |        |
 | (cid:73) Be patient. | DQN converges  | slowly |
 (cid:73) On Atari, often 10-40M frames to get policy much better than random
 ThankstoSzymonSidorforsuggestions
-Miscellaneous Advice
+- Optimize memory usage carefully: you'll need it for replay buffer
-(cid:73) Read older textbooks and theses, not just conference papers
+- Learning rate schedules
-(cid:73) Don’t get stuck on problems—can’t solve everything at once
+- Exploration schedules
-| (cid:73) Exploration | problems          | like cart-pole swing-up |
+- Be patient. DQN converges slowly
-| -------------------- | ----------------- | ----------------------- |
+- On Atari, often 10-40M frames to get policy much better than random
 | (cid:73) DQN on      | Atari vs CartPole |                         |
-Thanks!
+Thanks to Szymon Sidor for suggestions
 ## Miscellaneous Advice
 - Read older textbooks and theses, not just conference papers
 - Don't get stuck on problems—can't solve everything at once
 - Exploration problems like cart-pole swing-up
 - DQN on Atari vs CartPole
 [^1]: István Szita and András Lőrincz (2006). "Learning Tetris using the noisy cross-entropy method". In: Neural computation.
 [^2]: https://github.com/openai/rllab