ml_debug/docs/evidence/williamfalcon_deeprl_hacks.md

Source: https://github.com/williamFalcon/DeepRLHacks
Title: DeepRLHacks - Attendee notes from Schulman's Nuts and Bolts talk (2017)
Fetched-via: gh api repos/williamFalcon/DeepRLHacks/contents/README.md
Fetch-status: verbatim
Compliance-note: Secondary source - attendee notes from the talk. Primary source is joschu_nuts_and_bolts.md (http://joschu.net/docs/nuts-and-bolts.pdf)

# DeepRLHacks
From a talk given by [John Schulman](http://joschu.net/) titled "The Nuts and Bolts of Deep RL Research" (Aug 2017)
These are tricks written down while attending summer [Deep RL Bootcamp at UC Berkeley](https://www.deepbootcamp.io/).

**Update**: RL bootcamp just released the [video](https://www.youtube.com/watch?v=8EcdaCk9KaQ&feature=youtu.be) and the rest of the [lectures](https://sites.google.com/view/deep-rl-bootcamp/lectures).

## Tips to debug new algorithm
1. Simplify the problem by using a low dimensional state space environment.
    - John suggested using the [Pendulum problem](https://gym.openai.com/envs/Pendulum-v0) because it has a 2-D state space (angle of pendulum and velocity).
    - Easy to visualize what the value function looks like and what state the algorithm should be in and how they evolve over time.
    - Easy to visually spot why something isn't working (e.g., is the value function smooth enough, and so on).

2. To test if your algorithm is reasonable, construct a problem you know it should work on.
    - Ex: For hierarchical reinforcement learning you'd construct a problem with an OBVIOUS hierarchy it should learn.
    - Can easily see if it's doing the right thing.
    - WARNING: Don't overfit your method to your toy problem (realize it's a toy problem).

3. Familiarize yourself with certain environments you know well.
    - Over time, you'll learn how long the training should take.
    - Know how rewards evolve, etc...
    - Allows you to set a benchmark to see how well you're doing against your past trials.
    - John uses the hopper robot where he knows how fast learning should take, and he can easily spot odd behaviors.
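
One way to act on the "easy to visualize the value function" point in tip 1 is to evaluate the value function on a grid over the 2-D state space and look at its shape. The sketch below is only an illustration: the angle and velocity ranges are assumptions, `toy_value` is a hypothetical stand-in for whatever value function your algorithm has learned, and the ASCII rendering is just a dependency-free substitute for a real heatmap plot.

```python
import math

# Assumed ranges for the pendulum state: angle in [-pi, pi], angular velocity in [-8, 8].
ANGLE_RANGE = (-math.pi, math.pi)
VELOCITY_RANGE = (-8.0, 8.0)


def linspace(lo, hi, n):
    """Return n evenly spaced points from lo to hi inclusive (n >= 2)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def value_grid(value_fn, n_angle=9, n_velocity=9):
    """Evaluate value_fn(angle, velocity) on a grid; rows index velocity, columns index angle."""
    angles = linspace(*ANGLE_RANGE, n_angle)
    velocities = linspace(*VELOCITY_RANGE, n_velocity)
    return [[value_fn(a, v) for a in angles] for v in velocities]


def render_ascii(grid, shades=" .:-=+*#%@"):
    """Map each value to a character so the shape of the value function shows up in a terminal."""
    flat = [x for row in grid for x in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0
    lines = []
    for row in grid:
        lines.append("".join(shades[int((x - lo) / span * (len(shades) - 1))] for x in row))
    return "\n".join(lines)


def toy_value(angle, velocity):
    """Hypothetical stand-in for a learned value function (highest when upright and still)."""
    return -(angle ** 2 + 0.1 * velocity ** 2)


if __name__ == "__main__":
    print(render_ascii(value_grid(toy_value)))
```

Rendering the grid after every few training iterations gives a quick picture of how the value estimates evolve and whether they look smooth.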

## Tips to debug a new task
1. Simplify the task
    - Start simple until you see signs of life.
    - Approach 1: Simplify the feature space:
      - For example, if you're learning from images (a high-dimensional space), then maybe hand-engineer features first. Example: If you think your function is trying to approximate the location of something, use the x,y location as features as step 1.
      - Once it starts working, make the problem harder until you solve the full problem.
    - Approach 2: Simplify the reward function.
      - Formulate so it can give you FAST feedback to know whether you're doing the right thing or not.
      - Ex: Have a reward for the robot when it hits the target (+1). Hard to learn because maybe too much happens between the start and the reward. Reformulate as distance to target instead, which will speed up learning and allow you to iterate faster.
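
The reward reformulation in Approach 2 can be sketched as two small functions. This is a minimal illustration, assuming a 2-D position and target; the function names and the `radius` hit threshold are made up for the example and do not come from the talk.

```python
import math


def distance(position, target):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(position[0] - target[0], position[1] - target[1])


def sparse_reward(position, target, radius=0.05):
    """+1 only when the target is hit; zero everywhere else, so feedback is rare."""
    return 1.0 if distance(position, target) <= radius else 0.0


def dense_reward(position, target):
    """Negative distance to the target: every step says whether the agent got closer."""
    return -distance(position, target)


if __name__ == "__main__":
    target = (1.0, 1.0)
    for position in [(0.0, 0.0), (0.5, 0.5), (0.9, 0.9), (1.0, 1.0)]:
        print(position, sparse_reward(position, target), round(dense_reward(position, target), 3))
```

With the sparse version, the first three positions are indistinguishable (all reward 0); the dense version ranks them, which is what gives the fast feedback.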

## Tips to frame a problem in RL
Maybe it's unclear what the features are and what the reward should be, or if it's feasible at all.

1. First step: Visualize a random policy acting on this problem.
    - See where it takes you.
    - If random policy on occasion does the right thing, then high chance RL will do the right thing.
      - Policy gradient will find this behavior and make it more likely.
    - If random policy never does the right thing, RL will likely also not.

2. Make sure observations are usable:
    - See if YOU could control the system by using the same observations you give the agent.
      - Example: Look at preprocessed images yourself to make sure you don't remove necessary details or hinder the algorithm in a certain way.

3. Make sure everything is reasonably scaled.
    - Rule of thumb:
      - Observations: Make everything mean 0, standard deviation 1.
      - Reward: If you control it, then scale it to a reasonable value.
        - Do it across ALL your data so far.
    - Look at all observations and rewards and make sure there aren't crazy outliers.

4. Have good baseline whenever you see a new problem.
    - It's unclear which algorithm will work, so have a set of baselines (from other methods)
      - Cross entropy method
      - Policy gradient methods
      - Some kind of Q-learning method (check out [OpenAI Baselines](https://github.com/openai/baselines) as a starter or [RLLab](https://github.com/rll/rllab))
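
Of the baselines listed in tip 4, the cross entropy method is short enough to sketch in full. The version below is a generic diagonal-Gaussian variant, not necessarily the one used in the talk. `toy_score` is a hypothetical stand-in for "average episode return of a policy with parameters theta", and the population size, elite count, and initial standard deviation are arbitrary choices for illustration.

```python
import random


def cross_entropy_method(score_fn, dim, iterations=30, population=50, n_elite=10,
                         init_std=2.0, seed=0):
    """Maximize score_fn over R^dim with a diagonal-Gaussian cross-entropy method."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [init_std] * dim
    for _ in range(iterations):
        # Sample candidate parameter vectors around the current mean.
        candidates = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(population)]
        # Keep the best-scoring candidates (the "elite" set).
        elite = sorted(candidates, key=score_fn, reverse=True)[:n_elite]
        # Refit the sampling distribution to the elite set.
        for d in range(dim):
            values = [c[d] for c in elite]
            mean[d] = sum(values) / n_elite
            std[d] = (sum((v - mean[d]) ** 2 for v in values) / n_elite) ** 0.5
    return mean


def toy_score(theta):
    """Hypothetical stand-in for an episode return, maximized at theta = (1, -2)."""
    return -((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2)


if __name__ == "__main__":
    print(cross_entropy_method(toy_score, dim=2))
```

In an actual RL setting `score_fn` would roll out the policy for one or more episodes, so it is noisy and much more expensive than this toy objective.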

## Reproducing papers
Sometimes (often), it's hard to reproduce results from papers. Some tricks to do that:

1. Use more samples than needed.
2. Policy right... but not exactly
     - Try to make it work a little bit.
     - Then tweak hyperparameters to get up to the published performance.
     - If you want to get it to work at ALL, use bigger batch sizes.
       - If batch size is too small, noise will overpower the signal.
       - Example: With TRPO, John was using too small a batch size and had to use 100k time steps.
       - For DQN, best hyperparams: 10k time steps, 1M frames in replay buffer.
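
The "noise will overpower the signal" point can be made concrete with a small simulation. This is only a sketch under made-up numbers: each sample is a weak signal (`true_mean=0.1`) buried in unit-variance noise, and the function measures how much the batch average jumps around from batch to batch. It is not a gradient estimator from any particular algorithm.

```python
import random
import statistics


def spread_of_batch_means(batch_size, n_batches=200, true_mean=0.1, noise_std=1.0, seed=0):
    """Standard deviation of the batch-mean estimate across many independent batches."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_batches):
        batch = [rng.gauss(true_mean, noise_std) for _ in range(batch_size)]
        means.append(sum(batch) / batch_size)
    return statistics.stdev(means)


if __name__ == "__main__":
    for batch_size in (10, 100, 1000):
        spread = spread_of_batch_means(batch_size)
        print(f"batch_size={batch_size:5d}  spread={spread:.3f}  signal/noise={0.1 / spread:.2f}")
```

The spread shrinks roughly as 1/sqrt(batch_size), so with a small batch the estimate's noise is larger than the signal it is trying to measure.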


## Guidelines for the ongoing training process
Sanity check that your training is going well.

1. Look at sensitivity of EVERY hyper parameter
    - If algo is too sensitive, then NOT robust and should NOT be happy with it.
    - Sometimes it happens that a method works one way because of funny dynamics but NOT in general.

2. Look for indicators that the optimization process is healthy.
    - Varies
    - Look at whether value function is accurate.
      - Is it predicting well?
      - Is it predicting returns well?
      - How big are the updates?
    - Standard diagnostics from deep networks

3. Have a system for continuously benchmarking code.
    - Needs DISCIPLINE.
    - Look at performance across ALL previous problems you tried.
      - Sometimes it'll start working on one problem but mess up performance in others.
      - Easy to overfit on a single problem.
    - Have a battery of benchmarks you run occasionally.

4. You might think your algorithm is working when you're actually seeing random noise.
    - Example: Graph of 7 tasks with 3 algorithms and looks like 1 algorithm might be doing best on all problems, but turns out they're all the same algorithm with DIFFERENT random seeds.

5. Try different random seeds!!
    - Run multiple times and average.
    - Run multiple tasks on multiple seeds.
      - If not, you're likely to overfit.

6. Additional algorithm modifications might be unnecessary.
    - Most tricks are ACTUALLY normalizing something in some way or improving your optimization.
    - A lot of tricks also have the same effect... So you can remove some of them and SIMPLIFY your algorithm (VERY KEY).

7. Simplify your algorithm
    - Will generalize better

8. Automate your experiments
    - Don't spend your whole day watching your code spit out numbers.
    - Launch experiments on cloud services and analyze results.
    - Frameworks to track experiments and results:
      - Mostly use iPython notebooks.
      - DBs seem unnecessary to store results.
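
Tips 4 and 5 above (random noise that looks like progress, and trying different seeds) suggest always reporting results across seeds. A minimal sketch, assuming a hypothetical `run_experiment(seed)` that stands in for a full training run and returns a final score; here it just draws noisy numbers, so the two "variants" below are by construction the same algorithm.

```python
import random
import statistics


def run_experiment(seed, n_episodes=20):
    """Hypothetical stand-in for one training run; the final score depends only on the seed."""
    rng = random.Random(seed)
    return statistics.mean(rng.gauss(100.0, 30.0) for _ in range(n_episodes))


def summarize(seeds):
    """Mean and standard deviation of the final score across seeds."""
    scores = [run_experiment(seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    # Two "variants" that are really the same algorithm run with different seeds.
    mean_a, std_a = summarize(range(0, 5))
    mean_b, std_b = summarize(range(5, 10))
    print(f"variant A: {mean_a:.1f} +/- {std_a:.1f}")
    print(f"variant B: {mean_b:.1f} +/- {std_b:.1f}")
    # A gap smaller than the seed-to-seed spread is not evidence of a real improvement.
    print("gap:", round(abs(mean_a - mean_b), 1))
```

Comparing the gap between variants against the seed-to-seed spread is a cheap guard against reading an improvement into noise.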


## General training strategies
1. Whiten and standardize data (for ALL seen data since the beginning).
    - Observations:
      - Do it by computing a running mean and standard deviation. Then z-transform everything.
      - Over ALL data seen (not just the recent data).
        - At least it'll scale down over time how fast it's changing.
        - Might trip up the optimizer if you keep changing the objective.
        - Rescaling (by using recent data) means your optimizer probably didn't know about that and performance will collapse.

    - Rewards:
      - Scale and DON'T shift.
        - Affects agent's will to live.
        - Will change the problem (aka, how long you want it to survive).

    - Standardize targets:
      - Same way as rewards.

    - PCA Whitening?
      - Could help.
      - Starting to see if it actually helps with neural nets.
      - Huge scales (-1000, 1000) or (-0.001, 0.001) certainly make learning slow.

2. Parameters that inform discount factors.
    - Determines how far you're giving credit assignment.
    - Ex: If the factor is 0.99, then you're effectively ignoring what happens more than ~100 steps away (effective horizon of about 1 / (1 - gamma))... Means you're shortsighted.
      - Better to look at how that corresponds to real time
        - Intuition, in RL we're usually discretizing time.
        - aka: are those 100 steps 3 seconds of actual time?
        - what happens during that time?
    - If using TD methods for policy gradient or value function estimation, gamma can be close to 1 (like 0.999).
      - Algo becomes very stable.

3. Look to see that problem can actually be solved in the discretized level.
    - Example: In game if you're doing frame skip.
      - As a human, can you control it or is it impossible?
      - Look at what random exploration looks like
        - Discretization determines how far your Brownian motion goes.
        - If you do many actions in a row, then you tend to explore further.
        - Choose your time discretization in a way that works.

4. Look at episode returns closely.
    - Not just mean, look at min and max.
      - The max return is something your policy can hone in on pretty well.
      - Is your policy ever doing the right thing??
    - Look at episode length (sometimes more informative than episode reward).
      - In a game you might be losing every time, so you never see a win, but episode length can tell you if you're losing SLOWER.
      - Might see an episode length improvement in the beginning but maybe not reward.
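
Going back to strategy 1 (standardize over ALL data seen so far): a running mean and standard deviation can be kept without storing past data. The sketch below uses Welford's online algorithm for a single scalar feature; the class and method names are made up for the example, and in practice you would likely keep one such tracker per observation dimension. `scale_only` reflects the "scale and DON'T shift" advice for rewards.

```python
import math


class RunningNormalizer:
    """Tracks the mean and variance of every value seen so far (Welford's algorithm)."""

    def __init__(self, epsilon=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # Sum of squared deviations from the running mean.
        self.epsilon = epsilon

    def update(self, x):
        """Fold one new value into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        """Population standard deviation of all values seen; 1.0 until there are two values."""
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / self.count)

    def normalize(self, x):
        """z-transform an observation: shift by the mean and scale by the std."""
        return (x - self.mean) / (self.std + self.epsilon)

    def scale_only(self, x):
        """For rewards: divide by the running std but do NOT subtract the mean."""
        return x / (self.std + self.epsilon)
```

Because the statistics cover everything seen since the beginning, each new sample moves them less and less, which matches the note that the rescaling slows down over time.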


## Policy gradient diagnostics
1. Look at entropy really carefully
    - Entropy in ACTION space
      - Care more about entropy in state space, but don't have good methods for calculating that.
    - If going down too fast, then policy becoming deterministic and will not explore.
    - If NOT going down, then policy won't be good because it is really random.
    - Can fix by:
      - KL penalty
        - Keep entropy from decreasing too quickly.
      - Add entropy bonus.
    - How to measure entropy.
      - For most policies can compute entropy analytically.
        - If continuous, it's usually a Gaussian, so can compute differential entropy.

2. Look at KL divergence
    - Look at size of updates in terms of KL divergence.
    - example:
      - If KL is .01 then very small.
      - If 10 then too much.

3. Baseline explained variance.
    - See if the value function is actually a good predictor of the returns.
      - If negative, it might be overfitting or noisy.
        - Likely need to tune hyper parameters

4. Initialize policy
    - Very important (more so than in supervised learning).
    - Zero or tiny final layer to maximize entropy
      - Maximize random exploration in the beginning
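
The first three diagnostics above have simple closed forms when the policy is a diagonal Gaussian, which is the assumption made in this sketch. The function names are illustrative, and explained variance is computed here as 1 - Var[returns - predictions] / Var[returns], a common definition that the notes do not spell out.

```python
import math
import statistics


def gaussian_entropy(stds):
    """Differential entropy of a diagonal Gaussian with the given per-dimension stds."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s ** 2) for s in stds)


def gaussian_kl(mean_old, std_old, mean_new, std_new):
    """KL(old || new) between two diagonal Gaussians."""
    total = 0.0
    for m0, s0, m1, s1 in zip(mean_old, std_old, mean_new, std_new):
        total += math.log(s1 / s0) + (s0 ** 2 + (m0 - m1) ** 2) / (2.0 * s1 ** 2) - 0.5
    return total


def explained_variance(predictions, returns):
    """1 is a perfect baseline, 0 is no better than a constant, negative is worse than that."""
    residuals = [r - p for r, p in zip(returns, predictions)]
    return 1.0 - statistics.pvariance(residuals) / statistics.pvariance(returns)


if __name__ == "__main__":
    print("entropy:", round(gaussian_entropy([1.0, 1.0]), 3))
    print("KL after a small mean shift:", round(gaussian_kl([0.0], [1.0], [0.1], [1.0]), 4))
    print("explained variance:", explained_variance([1.0, 2.0, 3.0], [1.5, 2.0, 3.5]))
```

Logging these three numbers every iteration makes it easy to spot entropy collapsing, an update that was too large, or a baseline that has stopped predicting returns.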

## Q-Learning Strategies
1. Be careful about replay buffer memory usage.
    - You might need a huge buffer, so adapt code accordingly.

2. Play with learning rate schedule.

3. If converges slowly or has slow warm-up period in the beginning
    - Be patient... DQN converges VERY slowly.
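
For strategy 1, a quick back-of-the-envelope estimate shows why replay buffer memory needs attention. The sketch assumes 84x84 single-channel frames (a common Atari preprocessing choice, not something stated in these notes) and counts only the frames themselves, ignoring actions, rewards, and container overhead.

```python
def replay_buffer_bytes(capacity, frame_shape=(84, 84), bytes_per_value=1):
    """Approximate memory for `capacity` frames, ignoring actions, rewards, and overhead."""
    values_per_frame = 1
    for size in frame_shape:
        values_per_frame *= size
    return capacity * values_per_frame * bytes_per_value


if __name__ == "__main__":
    for label, width in (("uint8", 1), ("float32", 4)):
        gigabytes = replay_buffer_bytes(1_000_000, bytes_per_value=width) / 1e9
        print(f"1M frames as {label}: {gigabytes:.1f} GB")
```

Under these assumptions, storing frames as bytes and converting to floats only when sampling a batch cuts the footprint by a factor of four.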


## Bonus from [Andrej Karpathy](http://cs.stanford.edu/people/karpathy/):
1. A good feature can be to take the difference between two frames.
   - This delta vector can highlight slight state changes otherwise difficult to distinguish.
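
The frame-difference feature is a one-liner in most array libraries; the sketch below spells it out with plain nested lists so it runs without dependencies. The tiny 3x3 frames are made up for illustration.

```python
def frame_difference(previous, current):
    """Element-wise current - previous for two equally sized 2-D frames (lists of rows)."""
    return [[c - p for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(previous, current)]


if __name__ == "__main__":
    # A single bright pixel (a "ball") moves one column to the right.
    previous = [[0, 0, 0],
                [0, 255, 0],
                [0, 0, 0]]
    current = [[0, 0, 0],
               [0, 0, 255],
               [0, 0, 0]]
    for row in frame_difference(previous, current):
        print(row)
```

Static background cancels to zero, and the moving pixel shows up as a negative value where it was and a positive value where it is now, so the difference carries direction of motion that a single frame does not.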