Files
ml-debug/docs/evidence/joschu_nuts_and_bolts.md
T

4.9 KiB
Raw Blame History

Source: http://joschu.net/docs/nuts-and-bolts.pdf Title: Nuts and Bolts of Deep RL Research - John Schulman (2016) Fetched-via: clean transcript of the slide deck (prior markitdown PDF extract was OCR-garbled with (cid:73) bullet glyphs; replaced with hand-pasted text) Fetch-status: verbatim (clean paste)

The Nuts and Bolts of Deep RL Research

John Schulman, December 9th, 2016

Outline

  • Approaching New Problems
  • Ongoing Development and Tuning
  • General Tuning Strategies for RL
  • Policy Gradient Strategies
  • Q-Learning Strategies
  • Miscellaneous Advice

Approaching New Problems

New Algorithm? Use Small Test Problems

  • Run experiments quickly
  • Do hyperparameter search
  • Interpret and visualize learning process: state visitation, value function, etc.
  • Counterpoint: don't overfit algorithm to contrived problem
  • Useful to have medium-sized problems that you're intimately familiar with (Hopper, Atari Pong)

New Task? Make It Easier Until Signs of Life

  • Provide good input features
  • Shape reward function

POMDP Design

  • Visualize random policy: does it sometimes exhibit desired behavior?
  • Human control
  • Atari: can you see game features in downsampled image?
  • Plot time series for observations and rewards. Are they on a reasonable scale?
  • hopper.py in gym: reward = 1.0 - 1e-3 * np.square(a).sum() + delta x / delta t
  • Histogram observations and rewards

Run Your Baselines

  • Don't expect them to work with default parameters
  • Recommended:
    • Cross-entropy method1
    • Well-tuned policy gradient method2
    • Well-tuned Q-learning + SARSA method

Run with More Samples Than Expected

  • Early in tuning process, may need huge number of samples
  • Don't be deterred by published work
  • Examples:
    • TRPO on Atari: 100K timesteps per batch for KL= 0.01
    • DQN on Atari: update freq=10K, replay buffer size=1M

Ongoing Development and Tuning

It Works! But Don't Be Satisfied

  • Explore sensitivity to each parameter
  • If too sensitive, it doesn't really work, you just got lucky
  • Look for health indicators
    • VF fit quality
    • Policy entropy
    • Update size in output space and parameter space
    • Standard diagnostics for deep networks

Continually Benchmark Your Code

  • If reusing code, regressions occur
  • Run a battery of benchmarks occasionally

Always Use Multiple Random Seeds

Always Be Ablating

  • Different tricks may substitute
  • Especially whitening
  • "Regularize" to favor simplicity in algorithm design space
  • As usual, simplicity → generalization

Automate Your Experiments

  • Don't spend all day watching your code print out numbers
  • Consider using a cloud computing platform (Microsoft Azure, Amazon EC2, Google Compute Engine)

General Tuning Strategies for RL

Whitening / Standardizing Data

  • If observations have unknown range, standardize
  • Compute running estimate of mean and standard deviation
  • x' = clip((x μ)/σ, 10, 10)
  • Rescale the rewards, but don't shift mean, as that affects agent's will to live
  • Standardize prediction targets (e.g., value functions) the same way

Generally Important Parameters

  • Discount
    • Return_t = r_t + γr_{t+1} + γ²r_{t+2} + . . .
    • Effective time horizon: 1 + γ + γ² + · · · = 1/(1 γ)
    • I.e., γ = 0.99 ⇒ ignore rewards delayed by more than 100 timesteps
    • Low γ works well for well-shaped reward
    • In TD(λ) methods, can get away with high γ when λ < 1
  • Action frequency
    • Solvable with human control (if possible)
    • View random exploration

General RL Diagnostics

  • Look at min/max/stdev of episode returns, along with mean
  • Look at episode lengths: sometimes provides additional information
  • Solving problem faster, losing game slower

Policy Gradient Strategies

Entropy as Diagnostic

  • Premature drop in policy entropy ⇒ no learning
  • Alleviate by using entropy bonus or KL penalty

KL as Diagnostic

  • Compute KL [π_old(· | s), π(· | s)]
  • KL spike ⇒ drastic loss of performance
  • No learning progress might mean steps are too large
  • batchsize=100K converges to different result than batchsize=20K.

Baseline Explained Variance

  • explained variance = 1 Var[empirical return predicted value] / Var[empirical return]

Policy Initialization

  • More important than in supervised learning: determines initial state visitation
  • Zero or tiny final layer, to maximize entropy

Q-Learning Strategies

  • Optimize memory usage carefully: you'll need it for replay buffer
  • Learning rate schedules
  • Exploration schedules
  • Be patient. DQN converges slowly
  • On Atari, often 10-40M frames to get policy much better than random

Thanks to Szymon Sidor for suggestions

Miscellaneous Advice

  • Read older textbooks and theses, not just conference papers
  • Don't get stuck on problems—can't solve everything at once
  • Exploration problems like cart-pole swing-up
  • DQN on Atari vs CartPole

  1. István Szita and András Lőrincz (2006). "Learning Tetris using the noisy cross-entropy method". In: Neural computation. ↩︎

  2. https://github.com/openai/rllab ↩︎