mirror of
https://github.com/wassname/ml_debug.git
synced 2026-06-27 01:00:14 +08:00
4393cceefd
Deep research to uplift LLMs for ML debugging, opinionated by source selection. Distilled from Schulman, Jones, Rahtz, Goodfellow, CS231n, FSDL, and more. Includes runnable diagnostic scripts and LLM-specific anti-patterns. Author: wassname (https://github.com/wassname)
Source: https://github.com/williamFalcon/DeepRLHacks
Title: DeepRLHacks - Attendee notes from Schulman's Nuts and Bolts talk (2017)
Fetched-via: gh api repos/williamFalcon/DeepRLHacks/contents/README.md
Fetch-status: verbatim
Compliance-note: Secondary source - attendee notes from the talk. Primary source is joschu_nuts_and_bolts.md (http://joschu.net/docs/nuts-and-bolts.pdf)

# DeepRLHacks

From a talk given by [John Schulman](http://joschu.net/) titled "The Nuts and Bolts of Deep RL Research" (Aug 2017).
These are tricks written down while attending the summer [Deep RL Bootcamp at UC Berkeley](https://www.deepbootcamp.io/).

**Update**: The RL bootcamp just released the [video](https://www.youtube.com/watch?v=8EcdaCk9KaQ&feature=youtu.be) and the rest of the [lectures](https://sites.google.com/view/deep-rl-bootcamp/lectures).

## Tips to debug a new algorithm

1. Simplify the problem by using a low-dimensional state space environment.
    - John suggested using the [Pendulum problem](https://gym.openai.com/envs/Pendulum-v0) because it has a 2-D state space (angle of the pendulum and velocity).
    - It is easy to visualize what the value function looks like, what state the algorithm should be in, and how they evolve over time.
    - It is easy to visually spot why something isn't working (e.g., whether the value function is smooth enough).

2. To test whether your algorithm is reasonable, construct a problem you know it should work on.
    - Ex: For hierarchical reinforcement learning, you'd construct a problem with an OBVIOUS hierarchy it should learn.
    - You can easily see if it's doing the right thing.
    - WARNING: Don't overfit your method to your toy problem (remember that it's a toy problem).

3. Familiarize yourself with certain environments you know well.
    - Over time, you'll learn how long training should take.
    - You'll know how rewards evolve, etc.
    - This lets you set a benchmark to see how well you're doing against your past trials.
    - John uses the Hopper robot, where he knows how long learning should take and can easily spot odd behaviors.

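
The first tip above (a 2-D state space makes the value function easy to inspect) can be sketched as follows. This is a minimal illustration, not anything shown in the talk: `value_grid` and `toy_value` are hypothetical names, `toy_value` stands in for a learned critic, and the velocity range of ±8 is an assumption about the Pendulum environment.

```python
import numpy as np

def value_grid(value_fn, n_theta=50, n_vel=50, max_speed=8.0):
    """Evaluate value_fn(theta, theta_dot) on a regular 2-D grid."""
    thetas = np.linspace(-np.pi, np.pi, n_theta)
    vels = np.linspace(-max_speed, max_speed, n_vel)  # assumed velocity range
    grid = np.empty((n_theta, n_vel))
    for i, th in enumerate(thetas):
        for j, v in enumerate(vels):
            grid[i, j] = value_fn(th, v)
    return thetas, vels, grid

def toy_value(theta, theta_dot):
    # Hypothetical stand-in for a learned critic: best when upright and still.
    return -(theta ** 2 + 0.1 * theta_dot ** 2)

thetas, vels, grid = value_grid(toy_value)
# The grid can be shown as a heat map (for example with matplotlib's imshow)
# after each training iteration to see whether the surface stays smooth.
```

Swapping `toy_value` for your own critic and redrawing the grid over training is one way to watch how the value function evolves.
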
## Tips to debug a new task

1. Simplify the task
    - Start simple until you see signs of life.
    - Approach 1: Simplify the feature space.
        - For example, if you're learning from images (a very high-dimensional space), then maybe hand-engineer features first. Example: If you think your function is trying to approximate the location of something, use the x,y location as features as step 1.
        - Once it starts working, make the problem harder until you solve the full problem.
    - Approach 2: Simplify the reward function.
        - Formulate it so it gives you FAST feedback on whether you're doing the right thing or not.
        - Ex: Give the robot a reward (+1) when it hits the target. This is hard to learn from because too much may happen between the start and the reward. Reformulate the reward as distance to the target instead, which will speed up learning and let you iterate faster.

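
The reward reformulation in Approach 2 can be illustrated with a small sketch. The function names, the 2-D positions, the hit radius, and the choice of negative distance as the dense reward are all assumptions made for illustration.

```python
import math

def sparse_reward(pos, target, radius=0.05):
    """+1 only when the position is within `radius` of the target."""
    return 1.0 if math.dist(pos, target) <= radius else 0.0

def dense_reward(pos, target):
    """Negative distance to the target: informative at every step."""
    return -math.dist(pos, target)

target = (1.0, 1.0)
path = [(0.0, 0.0), (0.5, 0.5), (0.9, 0.9), (1.0, 1.0)]  # hypothetical trajectory
sparse = [sparse_reward(p, target) for p in path]
dense = [dense_reward(p, target) for p in path]
```

Along this path the sparse reward stays at zero until the final step, while the dense reward improves at every step, which is the faster feedback the notes describe.
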
## Tips to frame a problem in RL

Maybe it's unclear what the features are and what the reward should be, or whether it's feasible at all.

1. First step: Visualize a random policy acting on this problem.
    - See where it takes you.
    - If the random policy occasionally does the right thing, then there is a high chance RL will do the right thing.
        - Policy gradient will find this behavior and make it more likely.
    - If the random policy never does the right thing, RL likely won't either.

2. Make sure the observations are usable:
    - See if YOU could control the system using the same observations you give the agent.
        - Example: Look at the preprocessed images yourself to make sure you don't remove necessary details or hinder the algorithm in some way.

3. Make sure everything is reasonably scaled.
    - Rule of thumb:
        - Observations: Make everything mean 0, standard deviation 1.
        - Reward: If you control it, then scale it to a reasonable value.
            - Do it across ALL your data so far.
    - Look at all observations and rewards and make sure there aren't crazy outliers.

4. Have a good baseline whenever you see a new problem.
    - It's unclear which algorithm will work, so have a set of baselines (from other methods):
        - Cross-entropy method
        - Policy gradient methods
        - Some kind of Q-learning method (check out [OpenAI Baselines](https://github.com/openai/baselines) as a starter, or [RLLab](https://github.com/rll/rllab))

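
The random-policy check in step 1 can be sketched on a toy problem. The 1-D corridor below is a hypothetical environment invented for this example, not one from the talk; the point is only to measure how often purely random actions stumble onto the goal.

```python
import random

def random_rollout(rng, goal=5, max_steps=50):
    """Random walk on a 1-D line starting at 0; success means reaching `goal`."""
    pos = 0
    for t in range(1, max_steps + 1):
        pos += rng.choice((-1, 1))  # uniformly random action
        if pos == goal:
            return True, t
    return False, max_steps

rng = random.Random(0)
results = [random_rollout(rng) for _ in range(1000)]
success_rate = sum(ok for ok, _ in results) / len(results)
print(f"random policy success rate: {success_rate:.2%}")
```

A success rate well above zero suggests there is signal for a policy gradient method to amplify; a rate of exactly zero suggests simplifying the task or reward first.
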
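
The scaling and outlier check in step 3 might look like the sketch below. `scale_report` and the 6-sigma threshold are assumptions chosen for illustration, and the data is synthetic.

```python
import numpy as np

def scale_report(x, z_thresh=6.0):
    """Per-dimension mean/std plus the fraction of |z| > z_thresh outliers."""
    x = np.asarray(x, dtype=np.float64)
    mean, std = x.mean(axis=0), x.std(axis=0)
    z = (x - mean) / (std + 1e-8)
    return {"mean": mean, "std": std,
            "outlier_frac": (np.abs(z) > z_thresh).mean(axis=0)}

rng = np.random.default_rng(0)
# Synthetic observations: dimension 1 is badly scaled on purpose.
obs = rng.normal(loc=[0.0, 500.0], scale=[1.0, 200.0], size=(10_000, 2))
obs[0, 0] = 1e4  # inject one extreme outlier into dimension 0
report = scale_report(obs)
```

Here the report flags both problems: dimension 1 has a mean and standard deviation far from 0 and 1, and dimension 0 has a nonzero outlier fraction.
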
## Reproducing papers

Sometimes (often), it's hard to reproduce results from papers. Some tricks to do that:

1. Use more samples than needed.

2. Policy right... but not exactly
    - Try to make it work a little bit.
    - Then tweak hyperparameters to get up to the published performance.
    - If you want to get it to work at ALL, use bigger batch sizes.
        - If the batch size is too small, noise will overpower the signal.
        - Example: With TRPO, John was using too small a batch size and had to use 100k time steps.
        - For DQN, best hyperparams: 10k time steps, 1M frames in the replay buffer.

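
The batch size remark above follows from the fact that the standard deviation of a batch mean shrinks as 1/sqrt(batch size). A minimal numerical sketch, assuming a hypothetical scalar gradient with a small true value and large per-sample Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)
true_grad, noise_std = 0.1, 5.0  # assumed signal and per-sample noise

def estimate_spread(batch_size, n_trials=2000):
    """Std of the batch-mean gradient estimate across repeated batches."""
    samples = true_grad + noise_std * rng.standard_normal((n_trials, batch_size))
    return samples.mean(axis=1).std()

spread_small = estimate_spread(10)
spread_large = estimate_spread(1_000)
print(f"batch 10: {spread_small:.3f}, batch 1000: {spread_large:.3f}")
```

With a batch of 10 the noise in the estimate is far larger than the true gradient of 0.1, so individual updates point in nearly random directions; with a batch of 1000 the noise is about ten times smaller.
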
## Guidelines for the ongoing training process

Sanity check that your training is going well.

1. Look at the sensitivity of EVERY hyperparameter
    - If the algorithm is too sensitive, then it is NOT robust and you should NOT be happy with it.
    - Sometimes a method works one way because of funny dynamics but NOT in general.

2. Look for indicators that the optimization process is healthy.
    - Varies
    - Look at whether the value function is accurate.
        - Is it predicting well?
        - Is it predicting returns well?
        - How big are the updates?
    - Standard diagnostics from deep networks

3. Have a system for continuously benchmarking code.
    - Needs DISCIPLINE.
    - Look at performance across ALL previous problems you tried.
        - Sometimes it'll start working on one problem but mess up performance on others.
        - It's easy to overfit on a single problem.
    - Have a battery of benchmarks you run occasionally.

4. You might think your algorithm is working when you're actually seeing random noise.
    - Example: A graph of 7 tasks with 3 algorithms looks like 1 algorithm might be doing best on all problems, but it turns out they're all the same algorithm with DIFFERENT random seeds.

5. Try different random seeds!!
    - Run multiple times and average.
    - Run multiple tasks on multiple seeds.
        - If not, you're likely to overfit.

6. Additional algorithm modifications might be unnecessary.
    - Most tricks are ACTUALLY normalizing something in some way or improving your optimization.
    - A lot of tricks also have the same effect... So you can remove some of them and SIMPLIFY your algorithm (VERY KEY).

7. Simplify your algorithm
    - Will generalize better

8. Automate your experiments
    - Don't spend your whole day watching your code spit out numbers.
    - Launch experiments on cloud services and analyze results.
    - Frameworks to track experiments and results:
        - Mostly use iPython notebooks.
        - DBs seem unnecessary to store results.

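
The advice about random seeds can be made concrete by comparing algorithms on their mean across seeds together with a standard error, rather than on single runs. The sketch below uses made-up final returns; `summarize_seeds` is a hypothetical helper.

```python
import numpy as np

def summarize_seeds(final_returns):
    """Mean and standard error of the mean across independent seeds."""
    r = np.asarray(final_returns, dtype=np.float64)
    sem = r.std(ddof=1) / np.sqrt(len(r))
    return r.mean(), sem

# Hypothetical final returns, one entry per random seed.
algo_a = [210.0, 180.0, 250.0, 190.0, 230.0]
algo_b = [220.0, 170.0, 260.0, 200.0, 215.0]

mean_a, sem_a = summarize_seeds(algo_a)
mean_b, sem_b = summarize_seeds(algo_b)
# A gap that is small relative to the combined standard error may be seed noise.
gap_in_sems = abs(mean_a - mean_b) / np.hypot(sem_a, sem_b)
```

In this made-up data the two means differ by far less than one combined standard error, which is the situation the notes warn about: an apparent winner that is indistinguishable from seed-to-seed variation.
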
## General training strategies

1. Whiten and standardize data (for ALL seen data since the beginning).
    - Observations:
        - Do it by computing a running mean and standard deviation. Then z-transform everything.
        - Over ALL data seen (not just the recent data).
            - At least the statistics will change more and more slowly over time.
            - It might trip up the optimizer if you keep changing the objective.
            - Rescaling (by using only recent data) means your optimizer probably didn't know about the change, and performance will collapse.
    - Rewards:
        - Scale and DON'T shift.
            - Shifting affects the agent's will to live.
            - It will change the problem (aka, how long you want it to survive).
    - Standardize targets:
        - Same way as rewards.
    - PCA Whitening?
        - Could help.
        - Starting to see if it actually helps with neural nets.
        - Huge scales (-1000, 1000) or (-0.001, 0.001) certainly make learning slow.

2. Parameters that inform discount factors.
    - The discount factor determines how far you're assigning credit.
    - Ex: if the factor is 0.99, then you're mostly ignoring what happens more than about 100 steps away (the effective horizon is 1/(1 - 0.99) = 100 steps)... Means you're shortsighted.
        - Better to look at how that corresponds to real time.
            - Intuition: in RL we're usually discretizing time.
            - aka: are those 100 steps 3 seconds of actual time?
            - What happens during that time?
    - If you use TD methods for value function estimation in policy gradient methods, gamma can be close to 1 (like 0.999).
        - Algo becomes very stable.

3. Check that the problem can actually be solved at the discretized level.
    - Example: In a game, if you're doing frame skip.
        - As a human, can you control it or is it impossible?
    - Look at what random exploration looks like.
        - Discretization determines how far your Brownian motion goes.
        - If you do many actions in a row, then you tend to explore further.
        - Choose your time discretization in a way that works.

4. Look at episode returns closely.
    - Not just the mean; look at the min and max.
        - The max return is something your policy can home in on pretty well.
        - Is your policy ever doing the right thing??
    - Look at episode length (sometimes more informative than episode reward).
        - In a game you might be losing every time, so you might never win, but... episode length can tell you if you're losing SLOWER.
        - You might see an episode length improvement in the beginning but maybe not in reward.

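
The running mean and standard deviation from item 1 can be maintained incrementally over all data seen so far. The class below is a minimal sketch using the standard batched mean/variance merge formula; `RunningNorm` and `scale_reward` are hypothetical names, and this is not necessarily how the speaker implemented it.

```python
import numpy as np

class RunningNorm:
    """Tracks mean/variance over ALL data seen so far (batched update)."""

    def __init__(self, shape):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 0

    def update(self, batch):
        batch = np.asarray(batch, dtype=np.float64)
        b_mean, b_var, b_n = batch.mean(axis=0), batch.var(axis=0), batch.shape[0]
        total = self.count + b_n
        delta = b_mean - self.mean
        # Merge the sums of squared deviations of the old data and the new batch.
        m2 = self.var * self.count + b_var * b_n + delta ** 2 * self.count * b_n / total
        self.mean = self.mean + delta * b_n / total
        self.var = m2 / total
        self.count = total

    def normalize(self, x):
        return (x - self.mean) / np.sqrt(self.var + 1e-8)

def scale_reward(r, return_std):
    # Rewards: divide by a scale only, never subtract a mean.
    return r / (return_std + 1e-8)

rng = np.random.default_rng(0)
data = rng.normal(3.0, 2.0, size=(1000, 4))  # synthetic observations
norm = RunningNorm(4)
for chunk in np.array_split(data, 10):
    norm.update(chunk)
```

Because every batch is merged into the same statistics, the normalization drifts less and less as more data arrives, which matches the point above about not surprising the optimizer.
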
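
The link between the discount factor and real time in item 2 can be worked out with the usual rule of thumb that the effective horizon is about 1/(1 - gamma) steps. In the sketch below, the 1/60 s simulation step and the frame skip of 4 are assumptions picked for illustration.

```python
def effective_horizon_steps(gamma):
    """Rough credit-assignment horizon, 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)

def horizon_seconds(gamma, env_dt, frame_skip=1):
    """Effective horizon expressed in seconds of simulated time."""
    return effective_horizon_steps(gamma) * env_dt * frame_skip

for gamma in (0.99, 0.999):
    # Assumed: 60 simulation frames per second, agent acts every 4th frame.
    seconds = horizon_seconds(gamma, env_dt=1 / 60, frame_skip=4)
    print(f"gamma={gamma}: about {seconds:.1f} s of simulated time")
```

Note that the horizon is a soft cutoff: with gamma = 0.99, a reward 100 steps away is still weighted by 0.99^100, which is roughly 0.37.
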
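
The episode statistics from item 4 are cheap to log. A minimal sketch, with a hypothetical `episode_summary` helper and made-up logs in which the agent always loses but survives longer later in training:

```python
import numpy as np

def episode_summary(returns, lengths):
    """Min/mean/max of episode returns plus mean episode length."""
    returns = np.asarray(returns, dtype=np.float64)
    lengths = np.asarray(lengths, dtype=np.float64)
    return {
        "return_min": returns.min(),
        "return_mean": returns.mean(),
        "return_max": returns.max(),
        "length_mean": lengths.mean(),
    }

# Hypothetical logs: every episode ends in a loss (-1).
early = episode_summary(returns=[-1, -1, -1], lengths=[20, 25, 22])
late = episode_summary(returns=[-1, -1, -1], lengths=[60, 75, 70])
```

The return statistics are identical in both summaries, while the mean episode length shows the agent is losing more slowly.
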
## Policy gradient diagnostics

1. Look at entropy really carefully
    - Entropy in ACTION space
        - We care more about entropy in state space, but don't have good methods for calculating that.
    - If it's going down too fast, then the policy is becoming deterministic and will not explore.
    - If it's NOT going down, then the policy won't be good because it is really random.
    - Can fix by:
        - KL penalty
            - Keeps entropy from decreasing too quickly.
        - Add an entropy bonus.
    - How to measure entropy:
        - For most policies, you can compute entropy analytically.
            - If continuous, it's usually a Gaussian, so you can compute differential entropy.

2. Look at KL divergence
    - Look at the size of updates in terms of KL divergence.
    - Example:
        - If KL is .01 then very small.
        - If 10 then too much.

3. Baseline explained variance.
    - See if the value function is actually a good predictor of the reward.
        - If negative, it might be overfitting or noisy.
            - Likely need to tune hyperparameters.

4. Initialize policy
    - Very important (more so than in supervised learning).
    - Zero or tiny final layer to maximize entropy
        - Maximize random exploration in the beginning
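
The analytic entropy computations mentioned in item 1 can be written directly. The sketch below covers a softmax policy over discrete actions and a diagonal Gaussian policy; the function names are hypothetical. It also illustrates item 4: all-zero logits, which is what a zero-initialized final layer produces, give the maximum possible entropy.

```python
import numpy as np

def categorical_entropy(logits):
    """Entropy (nats) of softmax(logits), computed stably."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    log_p = z - np.log(np.sum(np.exp(z), axis=-1, keepdims=True))
    return -np.sum(np.exp(log_p) * log_p, axis=-1)

def diag_gaussian_entropy(log_std):
    """Differential entropy (nats) of a diagonal Gaussian policy."""
    return np.sum(log_std + 0.5 * np.log(2.0 * np.pi * np.e), axis=-1)

n_actions = 4
uniform = categorical_entropy(np.zeros(n_actions))            # all logits zero
peaked = categorical_entropy(np.array([10.0, 0.0, 0.0, 0.0]))  # near-deterministic
unit_gauss = diag_gaussian_entropy(np.zeros(1))               # std = 1, one dim
```

Logging these values every iteration makes it easy to see whether entropy is collapsing too quickly or not falling at all.
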
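
The update-size check in item 2 can be sketched for discrete action distributions. The direction KL(old || new) and the averaging over a batch of states are assumptions made for this example, and the probabilities are made up.

```python
import numpy as np

def categorical_kl(p_old, p_new, eps=1e-12):
    """Mean KL(old || new) over a batch of action distributions."""
    p_old = np.asarray(p_old, dtype=np.float64)
    p_new = np.asarray(p_new, dtype=np.float64)
    kl = np.sum(p_old * (np.log(p_old + eps) - np.log(p_new + eps)), axis=-1)
    return kl.mean()

# Hypothetical action probabilities for two states, before and after an update.
old = np.array([[0.5, 0.5], [0.9, 0.1]])
small_step = np.array([[0.52, 0.48], [0.88, 0.12]])
big_step = np.array([[0.01, 0.99], [0.01, 0.99]])

kl_small = categorical_kl(old, small_step)
kl_big = categorical_kl(old, big_step)
```

The small step lands below the 0.01 scale described above as very small, while the big step is more than two orders of magnitude larger.
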
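
Explained variance from item 3 is commonly defined as 1 - Var[returns - predictions] / Var[returns]; that definition is assumed here, since the notes do not spell it out. A minimal sketch with made-up returns:

```python
import numpy as np

def explained_variance(value_pred, returns):
    """1 - Var[returns - value_pred] / Var[returns]; NaN if returns are constant."""
    value_pred = np.asarray(value_pred, dtype=np.float64)
    returns = np.asarray(returns, dtype=np.float64)
    var_ret = returns.var()
    if var_ret == 0:
        return float("nan")
    return 1.0 - (returns - value_pred).var() / var_ret

returns = np.array([1.0, 2.0, 3.0, 4.0])                 # hypothetical returns
perfect = explained_variance(returns, returns)            # ideal baseline
constant = explained_variance(np.full(4, 2.5), returns)   # predicts the mean
anti = explained_variance(-returns, returns)              # worse than a constant
```

A value near 1 means the baseline tracks returns well, a value near 0 means it is no better than a constant, and a negative value means it is actively worse than a constant.
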
## Q-Learning Strategies

1. Be careful about replay buffer memory usage.
    - You might need a huge buffer, so adapt code accordingly.

2. Play with the learning rate schedule.

3. If it converges slowly or has a slow warm-up period in the beginning
    - Be patient... DQN converges VERY slowly.

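
The replay buffer warning in item 1 is easy to quantify before launching a run. The sketch below uses the one-million-frame buffer size quoted earlier in these notes for DQN; the 84x84 grayscale frame shape is an assumption, and `replay_bytes` is a hypothetical helper.

```python
import numpy as np

def replay_bytes(capacity, frame_shape, dtype):
    """Bytes needed to store `capacity` frames of the given shape and dtype."""
    return capacity * int(np.prod(frame_shape)) * np.dtype(dtype).itemsize

capacity = 1_000_000    # one million frames
frame_shape = (84, 84)  # assumed preprocessed grayscale frame size
as_uint8 = replay_bytes(capacity, frame_shape, np.uint8)
as_float32 = replay_bytes(capacity, frame_shape, np.float32)
print(f"uint8: {as_uint8 / 1e9:.1f} GB, float32: {as_float32 / 1e9:.1f} GB")
```

Storing frames as `uint8` and converting to floating point only when a batch is sampled cuts the memory footprint by a factor of four compared with storing `float32`.
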
## Bonus from [Andrej Karpathy](http://cs.stanford.edu/people/karpathy/):

1. A good feature can be to take the difference between two frames.
    - This delta vector can highlight slight state changes that are otherwise difficult to distinguish.

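
The frame-difference feature can be computed in a couple of lines. This is a minimal sketch on tiny synthetic frames; `frame_delta` is a hypothetical name.

```python
import numpy as np

def frame_delta(prev_frame, frame):
    """Signed per-pixel difference between consecutive frames."""
    # Cast first: subtracting uint8 arrays directly would wrap around.
    return frame.astype(np.int16) - prev_frame.astype(np.int16)

prev_frame = np.zeros((4, 4), dtype=np.uint8)
frame = prev_frame.copy()
prev_frame[1, 1] = 255  # object at (1, 1) in the previous frame...
frame[1, 2] = 255       # ...moves one pixel to the right in the current frame
delta = frame_delta(prev_frame, frame)
```

Only the two pixels involved in the motion are nonzero in `delta`, with opposite signs that encode the direction of movement.
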