Source: https://github.com/williamFalcon/DeepRLHacks
Title: DeepRLHacks - Attendee notes from Schulman's Nuts and Bolts talk (2017)
Fetched-via: gh api repos/williamFalcon/DeepRLHacks/contents/README.md
Fetch-status: verbatim
Compliance-note: Secondary source - attendee notes from the talk. Primary source is joschu_nuts_and_bolts.md (http://joschu.net/docs/nuts-and-bolts.pdf)

# DeepRLHacks
From a talk given by [John Schulman](http://joschu.net/) titled "The Nuts and Bolts of Deep RL Research" (Aug 2017)
These are tricks written down while attending the summer [Deep RL Bootcamp at UC Berkeley](https://www.deepbootcamp.io/).

**Update**: RL bootcamp just released the [video](https://www.youtube.com/watch?v=8EcdaCk9KaQ&feature=youtu.be) and the rest of the [lectures](https://sites.google.com/view/deep-rl-bootcamp/lectures). 

## Tips to debug a new algorithm
1. Simplify the problem by using a low-dimensional state space environment.
    - John suggested using the [Pendulum problem](https://gym.openai.com/envs/Pendulum-v0) because it has a 2-D state space (angle of the pendulum and velocity).
    - Easy to visualize what the value function looks like, what state the algorithm should be in, and how they evolve over time.
    - Easy to visually spot why something isn't working (e.g., is the value function smooth enough, and so on).

2. To test if your algorithm is reasonable, construct a problem you know it should work on.   
    - Ex: For hierarchical reinforcement learning you'd construct a problem with an OBVIOUS hierarchy it should learn. 
    - Can easily see if it's doing the right thing.   
    - WARNING: Don't overfit your method to your toy problem (remember it's a toy problem).

3. Familiarize yourself with certain environments you know well.
    - Over time, you'll learn how long the training should take.   
    - Know how rewards evolve, etc... 
    - Allows you to set a benchmark to see how well you're doing against your past trials.    
    - John uses the hopper robot, where he knows how fast learning should be, so he can easily spot odd behaviors.

## Tips to debug a new task   
1. Simplify the task
    - Start simple until you see signs of life.   
    - Approach 1: Simplify the feature space: 
      - For example, if you're learning from images (a very high-dimensional space), then maybe hand-engineer features first. Example: If you think your function is trying to approximate the location of something, use the x,y location as features as step 1.
      - Once it starts working, make the problem harder until you solve the full problem.
    - Approach 2: Simplify the reward function.
      - Formulate it so it gives you FAST feedback on whether you're doing the right thing or not.
      - Ex: Give the robot a reward (+1) only when it hits the target. This is hard to learn because too much may happen between the start and the reward. Reformulate the reward in terms of distance to the target instead, which will speed up learning and let you iterate faster.
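
The reward reformulation in Approach 2 can be sketched as below. This is only an illustration, not code from the talk: the function names, the `radius` threshold, and the 2-D positions are all assumptions made for the example.

```python
import math


def sparse_reward(pos, target, radius=0.05):
    """+1 only when the agent is within `radius` of the target, else 0."""
    return 1.0 if math.dist(pos, target) <= radius else 0.0


def dense_reward(pos, target):
    """Negative Euclidean distance: improves smoothly as the agent approaches."""
    return -math.dist(pos, target)


target = (1.0, 1.0)  # hypothetical target location
for pos in [(0.0, 0.0), (0.5, 0.5), (0.9, 0.9), (1.0, 1.0)]:
    # The sparse reward stays at 0 until the very end; the dense one changes at every step.
    print(pos, sparse_reward(pos, target), round(dense_reward(pos, target), 3))
```

The dense version gives a learning signal on every step, which is the "FAST feedback" the notes ask for. It is still a different objective from the sparse one, so it is worth checking that the behavior you get is the behavior you wanted.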

## Tips to frame a problem in RL   
Maybe it's unclear what the features are and what the reward should be, or if it's feasible at all.    

1. First step: Visualize a random policy acting on this problem.   
    - See where it takes you.    
    - If random policy on occasion does the right thing, then high chance RL will do the right thing.   
      - Policy gradient will find this behavior and make it more likely.  
    - If random policy never does the right thing, RL will likely also not.   

2. Make sure observations are usable:
    - See if YOU could control the system by using the same observations you give the agent.   
      - Example: Look at preprocessed images yourself to make sure you don't remove necessary details or hinder the algorithm in a certain way.

3. Make sure everything is reasonably scaled.   
    - Rule of thumb: 
      - Observations: Make everything mean 0, standard deviation 1.
      - Reward: If you control it, then scale it to a reasonable value.
        - Do it across ALL your data so far.   
    - Look at all observations and rewards and make sure there aren't crazy outliers.    

4. Have a good baseline whenever you see a new problem.
    - It's unclear which algorithm will work, so have a set of baselines (from other methods)
      - Cross entropy method
      - Policy gradient methods
      - Some kind of Q-learning method (check out [OpenAI Baselines](https://github.com/openai/baselines) as a starter or [RLLab](https://github.com/rll/rllab))
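
As a rough idea of how small a cross entropy method baseline can be, here is a minimal sketch. It is not taken from the talk or from any library: `toy_score` is a hypothetical stand-in for "run one episode with these policy parameters and return the episode return", and the population size, elite fraction, and initial standard deviation are arbitrary choices.

```python
import random
import statistics


def cross_entropy_method(score_fn, dim, iters=30, pop_size=100,
                         elite_frac=0.2, init_std=2.0, seed=0):
    """Maximize score_fn over R^dim with a diagonal-Gaussian search distribution."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [init_std] * dim
    n_elite = max(2, int(pop_size * elite_frac))
    for _ in range(iters):
        # Sample candidate parameter vectors from the current distribution.
        population = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                      for _ in range(pop_size)]
        # Keep the top-scoring fraction (the "elite" set).
        population.sort(key=score_fn, reverse=True)
        elite = population[:n_elite]
        # Refit the Gaussian to the elite set, one dimension at a time.
        for d in range(dim):
            column = [theta[d] for theta in elite]
            mean[d] = statistics.fmean(column)
            std[d] = statistics.pstdev(column) + 1e-3  # small floor keeps some exploration
    return mean


def toy_score(theta):
    """Hypothetical stand-in for an episode return: peaks at theta = (1, -1)."""
    goal = (1.0, -1.0)
    return -sum((t - g) ** 2 for t, g in zip(theta, goal))


best = cross_entropy_method(toy_score, dim=2)
print("best parameters found:", [round(x, 2) for x in best])
```

In a real task, `score_fn` would roll out a policy parameterized by `theta` in the environment, ideally averaged over a few episodes to reduce noise.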

## Reproducing papers    
Sometimes (often), it's hard to reproduce results from papers. Some tricks to do that:   

1. Use more samples than needed.    
2. Policy right... but not exactly
     - Try to make it work a little bit.   
     - Then tweak hyperparameters to get up to the published performance.
     - If you want to get it to work at ALL, use bigger batch sizes.
       - If the batch size is too small, noise will overpower the signal.
       - Example: With TRPO, John was using too small a batch size and had to use 100k time steps.
       - For DQN, best hyperparams: 10k time steps, 1M frames in the replay buffer.


## Guidelines for the ongoing training process
Sanity check that your training is going well.    

1. Look at sensitivity of EVERY hyper parameter
    - If the algo is too sensitive, then it is NOT robust and you should NOT be happy with it.
    - Sometimes it happens that a method works one way because of funny dynamics but NOT in general.

2. Look for indicators that the optimization process is healthy.  
    - Varies 
    - Look at whether value function is accurate.
      - Is it predicting well?    
      - Is it predicting returns well?
      - How big are the updates?   
    - Standard diagnostics from deep networks   

3. Have a system for continuously benchmarking code.    
    - Needs DISCIPLINE.   
    - Look at performance across ALL previous problems you tried.   
      - Sometimes it'll start working on one problem but mess up performance in others.   
      - Easy to overfit on a single problem.
    - Have a battery of benchmarks you run occasionally.   

4. You might think your algorithm is working when you're actually seeing random noise.
    - Example: Graph of 7 tasks with 3 algorithms and looks like 1 algorithm might be doing best on all problems, but turns out they're all the same algorithm with DIFFERENT random seeds.   

5. Try different random seeds!!
    - Run multiple times and average.   
    - Run multiple tasks on multiple seeds. 
      - If not, you're likely to overfit.

6. Additional algorithm modifications might be unnecessary.      
    - Most tricks are ACTUALLY normalizing something in some way or improving your optimization.  
    - A lot of tricks also have the same effect... So you can remove some of them and SIMPLIFY your algorithm (VERY KEY).   

7. Simplify your algorithm   
    - Will generalize better

8. Automate your experiments   
    - Don't spend your whole day watching your code spit out numbers.   
    - Launch experiments on cloud services and analyze results.   
    - Frameworks to track experiments and results:
      - Mostly use iPython notebooks.
      - DBs seem unnecessary to store results.   
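
The points about random noise and random seeds can be made concrete with a small sketch. Everything here is hypothetical: `run_experiment` just draws a noisy number in place of a real training run, and the three "algorithms" are deliberately the same code with different seeds, mirroring the example in item 4.

```python
import random
import statistics


def run_experiment(seed):
    """Hypothetical stand-in for one training run; returns a noisy final score."""
    rng = random.Random(seed)
    return 100.0 + rng.gauss(0.0, 15.0)


def summarize(scores):
    """Mean and standard error of the mean across seeds."""
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, sem


# Three "algorithms" that are really the same code with different seeds.
for name, first_seed in [("algo_a", 0), ("algo_b", 100), ("algo_c", 200)]:
    single = run_experiment(first_seed)
    scores = [run_experiment(first_seed + i) for i in range(10)]
    mean, sem = summarize(scores)
    print(f"{name}: single seed = {single:.1f}, 10 seeds = {mean:.1f} +/- {sem:.1f}")
```

Single-seed numbers can differ a lot between the three rows even though nothing about the method changed. Reporting the mean together with the standard error across seeds makes it easier to tell a real improvement from noise.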


## General training strategies
1. Whiten and standardize data (for ALL seen data since the beginning).   
    - Observations:
      - Do it by computing a running mean and standard deviation. Then z-transform everything.   
      - Over ALL data seen (not just the recent data).
        - At least it'll scale down over time how fast it's changing.
        - Might trip up the optimizer if you keep changing the objective. 
        - Rescaling (by using recent data) means your optimizer probably didn't know about that and performance will collapse.
  
    - Rewards:
      - Scale and DON'T shift.
        - Shifting affects the agent's will to live.
        - It will change the problem (i.e., how long you want the agent to survive).

    - Standardize targets:
      - Same way as rewards.
  
    - PCA Whitening?
      - Could help.
      - Starting to see if it actually helps with neural nets.
      - Huge scales (-1000, 1000) or (-0.001, 0.001) certainly make learning slow.   

2. Parameters that inform discount factors.
    - Determines how far you're giving credit assignment.   
    - Ex: if the factor is 0.99, then you're effectively ignoring what happens more than about 100 steps away (the effective horizon is roughly 1/(1 - gamma))... This means you're shortsighted.
      - Better to look at how that corresponds to real time 
        - Intuition, in RL we're usually discretizing time.  
        - aka: are those 100 steps 3 seconds of actual time? 
        - what happens during that time?
    - If you use TD methods for policy gradients or value function estimation, gamma can be close to 1 (like 0.999)
      - Algo becomes very stable.   

3. Check that the problem can actually be solved at the discretized level.
    - Example: In a game, if you're using frame skip.
      - As a human, can you control it or is it impossible?
      - Look at what random exploration looks like 
        - Discretization determines how far your Brownian motion goes. 
        - If you do many actions in a row, then you tend to explore further.
        - Choose your time discretization in a way that works.

4. Look at episode returns closely.   
    - Not just mean, look at min and max.
      - The max return is something your policy can home in on pretty well.
      - Is your policy ever doing the right thing??
    - Look at episode length (sometimes more informative than episode reward).
      - In a game, you might be losing every time and never win, but... episode length can tell you if you're losing SLOWER.
      - Might see an episode length improvement in the beginning but maybe not reward.
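
The running mean and standard deviation from item 1 can be sketched as follows. This is a scalar version using Welford's online algorithm; the class and method names are made up for this example, and real observations would usually be vectors handled per dimension.

```python
import math


class RunningNormalizer:
    """Tracks mean and std over ALL values seen so far (Welford's algorithm)."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean
        self.eps = eps  # avoids division by zero

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        if self.count < 2:
            return 1.0  # no meaningful estimate yet
        return math.sqrt(self.m2 / self.count)

    def normalize(self, x):
        """z-transform: for observations (shift and scale)."""
        return (x - self.mean) / (self.std + self.eps)

    def scale_only(self, x):
        """For rewards: scale but DON'T shift."""
        return x / (self.std + self.eps)


norm = RunningNormalizer()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    norm.update(value)
print(norm.mean, norm.std, norm.normalize(9.0), norm.scale_only(9.0))
```

Because the statistics cover all data since the beginning, each new sample moves them less and less, which matches the note that the scaling changes more slowly over time.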

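
To relate the discount factor from item 2 to real time, a common rule of thumb is an effective horizon of about 1/(1 - gamma) steps. The sketch below applies it; the `seconds_per_step` value of 0.03 is a made-up number chosen so that 100 steps correspond to roughly 3 seconds.

```python
def effective_horizon(gamma):
    """Rule-of-thumb number of steps over which rewards still matter."""
    return 1.0 / (1.0 - gamma)


def horizon_seconds(gamma, seconds_per_step):
    """The same horizon expressed in real time."""
    return effective_horizon(gamma) * seconds_per_step


seconds_per_step = 0.03  # hypothetical environment time step
for gamma in (0.9, 0.99, 0.999):
    steps = effective_horizon(gamma)
    print(f"gamma={gamma}: ~{steps:.0f} steps, ~{horizon_seconds(gamma, seconds_per_step):.1f} s")
```

If the behavior you care about plays out over a longer stretch of real time than this horizon, the agent will tend to ignore it.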

## Policy gradient diagnostics   
1. Look at entropy really carefully   
    - Entropy in ACTION space
      - Care more about entropy in state space, but don't have good methods for calculating that.
    - If going down too fast, then policy becoming deterministic and will not explore.   
    - If NOT going down, then policy won't be good because it is really random.   
    - Can fix by:
      - KL penalty
        - Keep entropy from decreasing too quickly.    
      - Add entropy bonus.
    - How to measure entropy.   
      - For most policies can compute entropy analytically. 
        - If continuous, it's usually a Gaussian, so can compute differential entropy.  
    
2. Look at KL divergence
    - Look at size of updates in terms of KL divergence.   
    - example:
      - If KL is .01 then very small.
      - If 10 then too much.
  
3. Baseline explained variance.   
    - See if the value function is actually a good predictor of the returns.
      - If negative, it might be overfitting or noisy.
        - Likely need to tune hyper parameters

4. Initialize policy   
    - Very important (more so than in supervised learning).   
    - Zero or tiny final layer to maximize entropy
      - Maximize random exploration in the beginning   
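
The three diagnostics above (entropy, KL divergence, explained variance) can be computed in a few lines. This is a sketch that assumes a diagonal Gaussian policy, as mentioned for the continuous case; the function names are illustrative and do not come from any particular library.

```python
import math
import statistics


def gaussian_entropy(stds):
    """Differential entropy of a diagonal Gaussian with the given std per dimension."""
    return sum(0.5 * math.log(2 * math.pi * math.e * s * s) for s in stds)


def gaussian_kl(mu_old, std_old, mu_new, std_new):
    """KL(old || new) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for m0, s0, m1, s1 in zip(mu_old, std_old, mu_new, std_new):
        total += math.log(s1 / s0) + (s0 ** 2 + (m0 - m1) ** 2) / (2 * s1 ** 2) - 0.5
    return total


def explained_variance(predictions, returns):
    """1 - Var[returns - predictions] / Var[returns]; 1 is perfect, <= 0 is useless."""
    var_returns = statistics.pvariance(returns)
    if var_returns == 0:
        return float("nan")
    residuals = [r - p for r, p in zip(returns, predictions)]
    return 1.0 - statistics.pvariance(residuals) / var_returns


returns = [1.0, 2.0, 3.0, 4.0]  # made-up empirical returns
print(gaussian_entropy([1.0, 0.5]))
print(gaussian_kl([0.0], [1.0], [0.1], [0.9]))
print(explained_variance([1.1, 1.9, 3.2, 3.8], returns))
```

Logging these three numbers at every policy update gives an early warning when entropy collapses, an update is too large, or the baseline stops predicting returns.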

## Q-Learning Strategies 
1. Be careful about replay buffer memory usage.  
    - You might need a huge buffer, so adapt code accordingly.   

2. Play with learning rate schedule.   

3. If it converges slowly or has a slow warm-up period in the beginning
    - Be patient... DQN converges VERY slowly.   


## Bonus from [Andrej Karpathy](http://cs.stanford.edu/people/karpathy/):   
1. A good feature can be to take the difference between two frames.   
   - This delta vector can highlight slight state changes otherwise difficult to distinguish.
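
A minimal sketch of the frame-difference feature, using plain nested lists as stand-ins for grayscale frames (a real pipeline would more likely use arrays); the function name and the 3x3 frames are made up for illustration.

```python
def frame_delta(prev_frame, curr_frame):
    """Pixel-wise difference between two equally sized grayscale frames."""
    return [
        [curr - prev for prev, curr in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


# A single bright pixel moves one step to the right between frames.
prev_frame = [[0, 0, 0],
              [0, 1, 0],
              [0, 0, 0]]
curr_frame = [[0, 0, 0],
              [0, 0, 1],
              [0, 0, 0]]

delta = frame_delta(prev_frame, curr_frame)
for row in delta:
    print(row)
```

Static background cancels to zero, so only the moving pixel shows up: negative where it left, positive where it arrived.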