mirror of https://github.com/wassname/ml_debug.git synced 2026-06-27 01:00:14 +08:00

Files

T

wassname 3fe6cb9ad9 Replace OCR-garbled Schulman cache with clean slide transcript

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-25 10:31:39 +08:00

4.9 KiB

Raw Blame History

Source: http://joschu.net/docs/nuts-and-bolts.pdf Title: Nuts and Bolts of Deep RL Research - John Schulman (2016) Fetched-via: clean transcript of the slide deck (prior markitdown PDF extract was OCR-garbled with (cid:73) bullet glyphs; replaced with hand-pasted text) Fetch-status: verbatim (clean paste)

The Nuts and Bolts of Deep RL Research

John Schulman, December 9th, 2016

Outline

Approaching New Problems
Ongoing Development and Tuning
General Tuning Strategies for RL
Policy Gradient Strategies
Q-Learning Strategies
Miscellaneous Advice

Approaching New Problems

New Algorithm? Use Small Test Problems

Run experiments quickly
Do hyperparameter search
Interpret and visualize learning process: state visitation, value function, etc.
Counterpoint: don't overfit algorithm to contrived problem
Useful to have medium-sized problems that you're intimately familiar with (Hopper, Atari Pong)

New Task? Make It Easier Until Signs of Life

Provide good input features
Shape reward function

POMDP Design

Visualize random policy: does it sometimes exhibit desired behavior?
Human control
Atari: can you see game features in downsampled image?
Plot time series for observations and rewards. Are they on a reasonable scale?
hopper.py in gym: reward = 1.0 - 1e-3 * np.square(a).sum() + delta x / delta t
Histogram observations and rewards

Run Your Baselines

Don't expect them to work with default parameters
Recommended:
- Cross-entropy method¹
- Well-tuned policy gradient method²
- Well-tuned Q-learning + SARSA method

Run with More Samples Than Expected

Early in tuning process, may need huge number of samples
Don't be deterred by published work
Examples:
- TRPO on Atari: 100K timesteps per batch for KL= 0.01
- DQN on Atari: update freq=10K, replay buffer size=1M

Ongoing Development and Tuning

It Works! But Don't Be Satisfied

Explore sensitivity to each parameter
If too sensitive, it doesn't really work, you just got lucky
Look for health indicators
- VF fit quality
- Policy entropy
- Update size in output space and parameter space
- Standard diagnostics for deep networks

Continually Benchmark Your Code

If reusing code, regressions occur
Run a battery of benchmarks occasionally

Always Use Multiple Random Seeds

Always Be Ablating

Different tricks may substitute
Especially whitening
"Regularize" to favor simplicity in algorithm design space
As usual, simplicity → generalization

Automate Your Experiments

Don't spend all day watching your code print out numbers
Consider using a cloud computing platform (Microsoft Azure, Amazon EC2, Google Compute Engine)

General Tuning Strategies for RL

Whitening / Standardizing Data

If observations have unknown range, standardize
Compute running estimate of mean and standard deviation
x' = clip((x − μ)/σ, −10, 10)
Rescale the rewards, but don't shift mean, as that affects agent's will to live
Standardize prediction targets (e.g., value functions) the same way

Generally Important Parameters

Discount
- Return_t = r_t + γr_{t+1} + γ²r_{t+2} + . . .
- Effective time horizon: 1 + γ + γ² + · · · = 1/(1 − γ)
- I.e., γ = 0.99 ⇒ ignore rewards delayed by more than 100 timesteps
- Low γ works well for well-shaped reward
- In TD(λ) methods, can get away with high γ when λ < 1
Action frequency
- Solvable with human control (if possible)
- View random exploration

General RL Diagnostics

Look at min/max/stdev of episode returns, along with mean
Look at episode lengths: sometimes provides additional information
Solving problem faster, losing game slower

Policy Gradient Strategies

Entropy as Diagnostic

Premature drop in policy entropy ⇒ no learning
Alleviate by using entropy bonus or KL penalty

KL as Diagnostic

Compute KL [π_old(· | s), π(· | s)]
KL spike ⇒ drastic loss of performance
No learning progress might mean steps are too large
batchsize=100K converges to different result than batchsize=20K.

Baseline Explained Variance

explained variance = 1 − Var[empirical return − predicted value] / Var[empirical return]

Policy Initialization

More important than in supervised learning: determines initial state visitation
Zero or tiny final layer, to maximize entropy

Q-Learning Strategies

Optimize memory usage carefully: you'll need it for replay buffer
Learning rate schedules
Exploration schedules
Be patient. DQN converges slowly
On Atari, often 10-40M frames to get policy much better than random

Thanks to Szymon Sidor for suggestions

Miscellaneous Advice

Read older textbooks and theses, not just conference papers
Don't get stuck on problems—can't solve everything at once
Exploration problems like cart-pole swing-up
DQN on Atari vs CartPole

István Szita and András Lőrincz (2006). "Learning Tetris using the noisy cross-entropy method". In: Neural computation. ↩︎
https://github.com/openai/rllab ↩︎

4.9 KiB Raw Blame History Unescape Escape