initial: ML debugging folklore skill

Deep research to uplift LLMs for ML debugging, opinionated by source
selection. Distilled from Schulman, Jones, Rahtz, Goodfellow, CS231n,
FSDL, and more. Includes runnable diagnostic scripts and LLM-specific
anti-patterns.

Author: wassname (https://github.com/wassname)
wassname
2026-03-06 10:11:30 +08:00
commit 4393cceefd
25 changed files with 12512 additions and 0 deletions
@@ -0,0 +1,679 @@
Source: http://amid.fish/reproducing-deep-rl
Title: Lessons Learned Reproducing a Deep Reinforcement Learning Paper - Matthew Rahtz (2018)
Fetched-via: uvx markitdown http://amid.fish/reproducing-deep-rl
Fetch-status: verbatim
# Lessons Learned Reproducing a Deep Reinforcement Learning Paper
Apr 6, 2018
There are a lot of neat things going on in deep reinforcement learning. One of
the coolest things from last year was OpenAI and DeepMind's work on training an
agent using feedback from a human rather than a classical reward signal.
There's a great blog post about it at [Learning from Human
Preferences](https://blog.openai.com/deep-reinforcement-learning-from-human-preferences/),
and the original paper is at [Deep Reinforcement Learning from Human
Preferences](https://arxiv.org/pdf/1706.03741.pdf).
![](images/humanfeedbackjump.gif)
Learn some deep reinforcement learning, and you too can train a noodle to do a backflip. From [Learning from Human Preferences](https://blog.openai.com/deep-reinforcement-learning-from-human-preferences/).
I've seen a few recommendations that reproducing papers is a good way of
levelling up machine learning skills, and I decided this could be an
interesting one to try with. It was indeed a [super fun
project](https://github.com/mrahtz/learning-from-human-preferences), and I'm
happy to have tackled it - but looking back, I realise it wasn't exactly the
experience I thought it would be.
If you're thinking about reproducing papers too, here are some notes on what
surprised me about working with deep RL.
---
First, in general, **reinforcement learning turned out to be a lot trickier
than expected**.
A big part of it is that right now, reinforcement learning is really sensitive.
There are a lot of details to get *just* right, and if you don't get them
right, it can be difficult to diagnose where you've gone wrong.
Example 1: after finishing the basic implementation, training runs just weren't
succeeding. I had all sorts of ideas about what the problem might be, but after
a couple of months of head scratching, it turned out to be because of problems
with normalization of rewards and pixel data at a key stage[1](#fn:normproblems).
Even with the benefit of hindsight, there were no obvious clues pointing in
that direction: the accuracy of the reward predictor network the pixel data
went into was just fine, and it took a long time to occur to me to examine the
rewards predicted carefully enough to notice the reward normalization bug.
Figuring out what the problem was happened almost accidentally, noticing a
small inconsistency that eventually led to the right path.
Example 2: doing a final code cleanup, I realised I'd implemented dropout kind
of wrong. The reward predictor network takes as input a pair of video clips,
each processed identically by two networks with shared weights. If you add
dropout and you're not careful about giving it the same random seed in each
network, you'll drop out differently for each network, so the video clips won't
be processed identically. As it turned out, though, fixing it completely broke
training, despite prediction accuracy of the network looking exactly the same!
![](images/broken_dropout.png)
Spot which one is broken. Yeah, I don't see it either.
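The dropout issue in Example 2 can be sketched without any framework. This is a minimal illustration, not the TensorFlow code from the project: `shared_dropout_mask` and `apply_mask` are hypothetical helpers, and each "network" is reduced to a plain list of activations. It only shows the mechanics: sample one mask and reuse it for both clips, so the two shared-weight branches process their inputs identically.

```python
import random


def shared_dropout_mask(size, keep_prob, rng):
    """Sample one inverted-dropout mask: 1/keep_prob for kept units, 0 for dropped ones."""
    return [1.0 / keep_prob if rng.random() < keep_prob else 0.0 for _ in range(size)]


def apply_mask(activations, mask):
    """Apply a dropout mask elementwise."""
    return [a * m for a, m in zip(activations, mask)]


rng = random.Random(0)
clip_a = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75]
clip_b = list(clip_a)  # identical input, so identical processing should give identical output

# One mask, sampled once and used for both branches (hypothetical setup).
mask = shared_dropout_mask(len(clip_a), keep_prob=0.5, rng=rng)
out_a = apply_mask(clip_a, mask)
out_b = apply_mask(clip_b, mask)
print("branches agree:", out_a == out_b)
```

Sampling a fresh mask per branch instead would, in general, make `out_a` and `out_b` differ even for identical inputs, which is the inconsistency described above.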
I get the impression this is a pretty common story (e.g. [Deep Reinforcement
Learning Doesn't Work Yet](https://www.alexirpan.com/2018/02/14/rl-hard.html)).
My takeaway is that, starting a reinforcement learning project, you should
**expect to get stuck like you get stuck on a math problem**. It's not like my
experience of programming in general so far where you get stuck but there's
usually a clear trail to follow and you can get unstuck within a couple of days
at most. It's more like when you're trying to solve a puzzle, there are no
clear inroads into the problem, and the only way to proceed is to try things
until you find the key piece of evidence or get the key spark that lets you
figure it out.
A corollary is to **try and be as sensitive as possible in noticing
confusion**.
There were a lot of points in this project where the only clues came from
noticing some small thing that didn't make sense. For example, at some point it
turned out that taking the difference between frames as features made things
work much better. It was tempting to just forge ahead with the new features,
but I realised I was confused about *why* it made such a big difference for the
simple environment I was working with back then. It was only by following that
confusion, and realising that taking the difference between frames zeroed out
the background, that I got the hint of a problem with normalization.
I'm not entirely sure how to make one's mind do more of this, but my best
guesses at the moment are:
* Learn to **recognise what confusion *feels* like**. There are a lot of
different shades of the “something's not quite right” feeling. Sometimes it's
code you know is ugly. Sometimes it's worry about wasting time on the wrong
thing. But sometimes it's that *you've seen something you didn't expect*:
confusion. Being able to recognise that exact shade of discomfort is
important, so that you can…
* Develop the habit of following through on confusion. There are some
sources of discomfort that it can be better to ignore in the moment (e.g.
code smell while prototyping), but confusion isn't one of them. It seems
important to really **commit yourself to *always* investigate whenever you
notice confusion**.
In any case: expect to get stuck for several weeks at a time. (And have
confidence you will be able to get to the other side if you keep at it, paying
attention to those small details.)
---
Speaking of differences to past programming experiences, a second major
learning experience was the **difference in mindset required for working with
long iteration times**.
Debugging seems to involve four basic steps:
* Gather evidence about what the problem might be.
* Form hypotheses about the problem based on the evidence you have so far.
* Choose the most likely hypothesis, implement a fix, and see what happens.
* Repeat until the problem goes away.
In most of the programming I've done before, I've been used to rapid feedback.
If something doesn't work, you can make a change and see what difference it
makes within seconds or minutes. Gathering evidence is very cheap.
In fact, in rapid-feedback situations, gathering evidence can be a lot cheaper
than forming hypotheses. Why spend 15 minutes carefully considering everything
that could be causing what you see when you can check the first idea that jumps
to mind in a fraction of that (and gather more evidence in the process)? To put
it another way: if you have rapid feedback, you can narrow down the hypothesis
space a lot faster by trying things than thinking carefully.
If you keep that strategy when each run takes 10 hours, though, you can easily
waste a *lot* of time. Last run didn't work? OK, I think it's this thing. Let's
set off another run to check. Coming back the next morning: still doesn't work?
OK, maybe it's this other thing. Let's set off another run. A week later, you
still haven't solved the problem.
Doing multiple runs at the same time, each trying a different thing, can help
to some extent, but a) unless you have access to a cluster you can end up
racking up a lot of costs on cloud compute (see below), and b) because of the
kinds of difficulties with reinforcement learning mentioned above, if you try
to iterate too quickly, you might never realise what kind of evidence you
actually need.
Switching from **experimenting a lot and thinking a little** to **experimenting
a little and thinking a lot** was a key turnaround in productivity. When
debugging with long iteration times, you really need to *pour* time into the
hypothesis-forming step - thinking about what all the possibilities are, how
likely they seem on their own, and how likely they seem in light of everything
you've seen so far. Spend as much time as you need, even if it takes 30
minutes, or an hour. Reserve experiments for once you've fleshed out the
hypothesis space as thoroughly as possible and know which pieces of evidence
would allow you to best distinguish between the different possibilities.
(It's especially important to be deliberate about this if you're working on
something as a side project. If you're only working on it for an hour a day and
each iteration takes a day to run, the number of runs you can do per week ends
up feeling like a precious commodity you have to make the most of. It's easy to
then feel a sense of pressure to spend your working hour each day rushing to
figure out something to do for that day's run. Another turnaround was being
willing to spend several days just *thinking*, not starting any runs, until I
felt really confident I had a strong hypothesis about what the problem was.)
A key enabler of the switch to thinking more was **keeping a much more detailed
work log**. Working without a log is fine when each chunk of progress takes
less than a few hours, but anything longer than that and it's easy to forget
what you've tried so far and end up just going in circles. The log format I
converged on was:
* Log 1: what specific output am I working on right now?
* Log 2: thinking out loud - e.g. hypotheses about the current problem, what to
work on next
* Log 3: record of currently ongoing runs along with a short reminder of what
question each run is supposed to answer
* Log 4: results of runs (TensorBoard graphs, any other significant
observations), separated by type of run (e.g. by environment the agent is
being trained in)
I started out with relatively sparse logs, but towards the end of the project
my attitude moved more towards “log absolutely everything going through my
head”. The overhead was significant, but I think it was worth it - partly
because some debugging required cross-referencing results and thoughts that
were days or weeks apart, and partly for (at least, this is my impression)
general improvements in thinking quality from the massive upgrade to effective
mental RAM.
![](images/rl_logs.jpg)
A typical day's log.
---
In terms of **getting the most out of the experiments you do run**, there are
two things I started experimenting with towards the end of the project which
seem like they could be helpful in the future.
First, adopting an attitude of **log all the metrics you can** to maximise the
amount of evidence you gather on each run. There are obvious metrics like
training/validation accuracy, but it might also be worth spending a good chunk
of time at the start of the project brainstorming and researching which other
metrics might be important for diagnosing potential problems.
I might be making this recommendation partly out of hindsight bias where I
*know* which metrics I should have started logging earlier. It's hard to
predict which metrics will be useful in advance. Still, heuristics that might
be useful are:
* For every important component in the system, consider what *can* be measured
about it. If there's a database, measure how quickly it's growing in size.
If there's a queue, measure how quickly items are being processed.
* For every complex procedure, measure how long different parts of it take. If
you've got a training loop, measure how long each batch takes to run. If
you've got a complex inference procedure, measure how long each sub-inference
takes. Those times are going to help a lot for performance debugging later
on, and can sometimes reveal bugs that are otherwise hard to spot. (For
example, if you see something taking longer and longer, it might be because
of a memory leak.)
* Similarly, consider profiling memory usage of different components. Small
memory leaks can be indicative of all sorts of things.
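One lightweight way to act on the timing heuristic above is a small context manager that records how long each named section takes. This is only a sketch using the standard library: `timed`, `durations`, and `summarize` are hypothetical names, and the loop body is a stand-in for real data loading and training code.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

durations = defaultdict(list)  # section name -> list of elapsed seconds


@contextmanager
def timed(name):
    """Record how long the enclosed block takes under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations[name].append(time.perf_counter() - start)


def summarize(samples):
    """Return (mean of first half, mean of second half) to make slow drifts visible."""
    half = len(samples) // 2
    first, second = samples[:half], samples[half:]
    return sum(first) / len(first), sum(second) / len(second)


for step in range(4):
    with timed("load_batch"):
        batch = [float(i) for i in range(1000)]  # stand-in for data loading
    with timed("train_step"):
        loss = sum(x * x for x in batch)  # stand-in for the real training step

early, late = summarize(durations["train_step"])
print(f"train_step: early mean {early:.6f}s, late mean {late:.6f}s")
```

Comparing early and late means is a crude check, but a section whose time keeps growing over a run is exactly the kind of signal worth following up on.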
Another strategy is to look at what other people are measuring. In the context
of deep reinforcement learning, John Schulman has some good tips in his [Nuts
and Bolts of Deep RL talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ)
([slides](http://joschu.net/docs/nuts-and-bolts.pdf); [summary
notes](https://github.com/williamFalcon/DeepRLHacks)). For policy gradient
methods, I've found policy entropy in particular to be a good indicator of
whether training is going anywhere - much more sensitive than per-episode
rewards.
![](images/entropies.png)
Examples of unhealthy and healthy
policy entropy graphs. Failure mode 1 (left): convergence to constant entropy (random choice among a subset of actions). Failure mode 2 (centre): convergence to zero entropy (choosing the same action every time). Right: policy entropy from a successful Pong training run.
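For a discrete action space, policy entropy is cheap to compute from the action probabilities. The sketch below is illustrative only: the six-action distributions are made up to mirror the two failure modes in the caption, and `policy_entropy` is a hypothetical helper rather than anything from the project's code.

```python
import math


def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)


uniform_all = [1 / 6] * 6                # fully random policy over 6 actions
uniform_subset = [0.5, 0.5, 0, 0, 0, 0]  # failure mode 1: random choice among a subset
deterministic = [1.0, 0, 0, 0, 0, 0]     # failure mode 2: same action every time

for name, probs in [
    ("uniform", uniform_all),
    ("subset", uniform_subset),
    ("deterministic", deterministic),
]:
    print(name, round(policy_entropy(probs), 3))
```

A policy stuck choosing uniformly among k actions sits at a constant entropy of log k, while a policy that always picks the same action sits at zero, which is why both show up as flat lines on the graph.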
When you do see something suspicious in metrics recorded, remembering to
*notice confusion*, err on the side of assuming it's something important rather
than just e.g. an inefficient implementation of some data structure. (I missed
a multithreading bug for several months by ignoring a small but mysterious
decay in frames per second.)
Debugging is much easier if you can see all your metrics in one place. I like
to have as much as possible on TensorBoard. Logging arbitrary metrics with
TensorFlow can be awkward, though, so **consider checking out
[easy-tf-log](https://github.com/mrahtz/easy-tf-log)**, which provides an easy
`tflog(key, value)` interface without any extra setup.
A second thing that seems promising for getting more out of runs is
**taking the time to try and predict failure in advance**.
Thanks to hindsight bias, failures often seem obvious in retrospect. But the
*really* frustrating thing is when the failure mode is obvious *before you've
even observed what it was*. You know when you've set off a run, you come back
the next day, you see it's failed, and even before you've investigated, you
realise, “Oh, it must have been because I forgot to set the frobulator”? That's
what I'm talking about.
The neat thing is that sometimes you can trigger that kind of
half-hindsight-realisation in advance. It does take conscious effort, though -
really stopping for a good five minutes before launching a run to think about
what might go wrong. The particular script I found most helpful to go through
was: [2](#fn:murphyjitsu)
1. Ask yourself, “How surprised would I be if this run failed?”
2. If the answer is not very surprised, put yourself in the shoes of
future-you where the run *has* failed, and ask, “If I'm here, what might
have gone wrong?”
3. Fix whatever comes to mind.
4. Repeat until the answer to question 1 is “very surprised” (or at least “as
surprised as I can get”).
There are always going to be failures you couldn't have predicted, and
sometimes you still miss obvious things, but this does at least seem to *cut
down* on the number of times something fails in a way you feel *really* stupid
for not having thought of earlier.
---
Finally, though, **the biggest surprise with this project was just how long it
took** - and related, the amount of compute resources it needed.
The first surprise was in terms of calendar time. My original estimate was that
as a side project it would take about 3 months. It actually took around *8
months*. (And the original estimate was supposed to be pessimistic!) Some of
that was down to underestimating how many hours each stage would take, but a
big chunk of the underestimate was failing to anticipate other things coming up
outside the project. It's hard to say how well this generalises, but **for
side projects, taking your original (already pessimistic) time estimates and
doubling them** might not be a bad rule-of-thumb.
The more interesting surprise was in how many hours each stage actually took.
The main stages of my initial project plan were basically:
![](images/pretime.png)
Here's how long each stage *actually* took.
![](images/posttime.png)
It wasn't writing code that took a long time - it was debugging it. In fact,
getting it working on even a [supposedly-simple
environment](https://github.com/mrahtz/gym-moving-dot) took *four times* as
long as initial implementation. (This is the first side project where I've been
keeping track of hours, but experiences with past machine learning projects
have been similar.)
(Side note: be careful about designing from scratch what you hope should be an
easy environment for reinforcement learning. In particular, think carefully
about a) whether your rewards really convey the right information to be able to
solve the task - yes, this is easy to mess up - and b) whether rewards depend
only on previous observations or also on current action. The latter, in
particular, might be relevant if you're doing any kind of reward prediction,
e.g. with a critic.)
**Another surprise was the amount of compute time needed.** I was lucky having
access to my university's cluster - only CPU machines, but that was fine for
some tasks. For work which needed a GPU (e.g. to iterate quickly on some small
part) or when the cluster was too busy, I experimented with two cloud services:
VMs on [Google Cloud Compute
Engine](https://console.cloud.google.com/projectselector/compute/instances?supportedpurview=project),
and [FloydHub](http://floydhub.com/).
Compute Engine is fine if you just want shell access to a GPU machine, but I
tried to do as much as possible on FloydHub. FloydHub is basically a cloud
compute service targeted at machine learning. You run `floyd run python
awesomecode.py` and FloydHub sets up a container, uploads your code to it, and
runs the code. The two key things which make FloydHub awesome are:
* Containers come preinstalled with GPU drivers and common libraries. (Even in
2018, I wasted a good few hours fiddling with CUDA versions while upgrading
TensorFlow on the Compute Engine VM.)
* Each run is automatically archived. For each run, the code used, the exact
command used to start the run, any command-line output, and any data outputs
are saved automatically, and indexed through a web interface.
[![](images/floydhub.png)](images/floydhub.png)
FloydHub's web interface. Top: index of past runs,
and overview of a single run. Bottom: both the code used for each run and any
data output from the run are automatically archived.
I can't stress enough how important that second feature is. For any project
this long, detailed records of what you've tried and the ability to reproduce
past experiments are an absolute must. Version control software can help, but
a) managing large outputs can be painful, and b) requires extreme diligence.
(For example, if you've set off some runs, then make a small change and launch
another run, when you commit the results of the first runs, is it going to be
clear which code was used?) You could take careful notes or roll your own
system, but with FloydHub, *it just works* and you save *so* much mental
energy.
(Update: check out some example FloydHub runs at
<https://www.floydhub.com/mrahtz/projects/learning-from-human-preferences>.)
Other things I like about FloydHub are:
* Containers are automatically shut down once the run is finished. Not having
to worry about checking runs to see whether they've finished and the VM can
be turned off is a big relief.
* Billing is much more straightforward than with cloud VMs. You pay for usage
in, say, 10-hour blocks, and you're charged immediately. That makes keeping
weekly budgets much easier.
The one pain point I've had with FloydHub is that you can't customize
containers. If your code has a lot of dependencies, you'll need to install them
at the start of every run. That limits the rate at which you can iterate on
short runs. You *can* get around this, though, by creating a dataset which
contains the changes to the filesystem from installing dependencies, then
copying files from that dataset at the start of each run (e.g.
[`create_floyd_base.sh`](https://github.com/mrahtz/learning-from-human-preferences/blob/master/floydhub_utils/create_floyd_base.sh)).
It's awkward, but still probably less awkward than having to deal with GPU
drivers.
FloydHub is a little more expensive than Compute Engine: as of writing,
$1.20/hour for a machine with a K80 GPU, compared to about $0.85/hour for a
similarly-specced VM (though less if you don't need as much as 61 GB of RAM).
Unless your budget is really limited, I think the extra convenience of FloydHub
is worth it. The only case where Compute Engine can be a lot cheaper is doing a
lot of runs in parallel, which you can stack up on a single large VM.
(A third option is Google's new
[Colaboratory](https://colab.research.google.com) service, which gives you a
hosted Jupyter notebook with free access to a single K80 GPU. Don't be put off
by Jupyter: you can execute arbitrary commands, and set up shell access if you
really want it. The main drawbacks are that your code doesn't keep running if
you close the browser window, and there are time limits on how long you can run
before the container hosting the notebook gets reset. So it's not suitable for
doing long runs, but can be useful for quick prototyping on a GPU.)
In total, the project took:
* **150 hours of GPU time and 7,700 hours (wall time × cores) of CPU time** on
Compute Engine,
* **292 hours of GPU time** on FloydHub,
* and **1,500 hours (wall time, 4 to 16 cores) of CPU time** on my university's
cluster.
I was horrified to realise that in total, that added up to **about $850** ($200
on FloydHub, $650 on Compute Engine) over the 8 months of the project.
Some of that's down to me being ham-fisted (see the above section on mindset
for slow iteration). Some of it's down to the fact that reinforcement learning
is still so sample-inefficient that runs do just take a long time (up to 10
hours to train a Pong agent that beats the computer every time).
But a big chunk of it was down to a horrible surprise I had during the final
stages of the project: **reinforcement learning can be so unstable that you
need to repeat every run multiple times with different seeds to be confident**.
For example, once I thought everything was basically working, I sat down to
make end-to-end tests for the environments I'd been working with. But I was
having trouble getting even the simplest environment I'd been working with,
[training a dot to move to the centre of a
square](https://github.com/mrahtz/gym-moving-dot), to train successfully. I
went back to the FloydHub job that had originally worked and re-ran three
copies. It turned out that the hyperparameters I thought were fine actually
only succeeded one out of three times.
![](images/failed_reproductions.png)
It's not uncommon for two out of three random seeds (red/blue) to fail.
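A simple habit that follows from this is to wrap every experiment in a loop over seeds and report how many of them succeed, rather than trusting a single run. The sketch below is purely illustrative: `run_experiment` is a hypothetical stand-in that returns a random score, where a real version would launch training with the given seed and return something like the final mean episode reward.

```python
import random


def run_experiment(seed):
    """Hypothetical stand-in for a full training run; returns a final score."""
    rng = random.Random(seed)
    return rng.gauss(0.0, 1.0)


def success_rate(seeds, threshold):
    """Run once per seed and report the fraction of runs whose score clears `threshold`."""
    scores = {seed: run_experiment(seed) for seed in seeds}
    successes = sum(score >= threshold for score in scores.values())
    return successes / len(scores), scores


rate, scores = success_rate(seeds=[0, 1, 2], threshold=0.0)
print(f"{rate:.0%} of seeds succeeded: {scores}")
```

Keeping the per-seed scores alongside the aggregate makes it easier to go back and inspect the specific runs that failed.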
To give a visceral sense of how much compute that means you need:
* Using A3C with 16 workers, Pong would take about 10 hours to train.
* That's 160 hours of CPU time.
* Running 3 random seeds, that's 480 hours (20 days) of CPU time.
In terms of costs:
* FloydHub charges about $0.50 per hour for an 8-core machine.
* So 10 hours costs about $5 per run.
* **Running 3 different random seeds at the same time, that's $15 per run.**
**That's, like, 3 sandwiches every time you want to test an idea.**
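The arithmetic behind those numbers, written out so it can be rerun with different assumptions. The values are the approximate figures quoted above; the variable names are just for illustration.

```python
workers = 16         # A3C workers
hours_per_run = 10   # wall-clock hours to train Pong
seeds = 3            # repeated runs with different random seeds

cpu_hours_per_run = workers * hours_per_run    # CPU time for one run
cpu_hours_total = cpu_hours_per_run * seeds    # CPU time across all seeds
cpu_days_total = cpu_hours_total / 24

price_per_hour = 0.50  # approximate FloydHub price quoted above, in USD
cost_per_run = price_per_hour * hours_per_run
cost_per_idea = cost_per_run * seeds

print(cpu_hours_per_run, cpu_hours_total, cpu_days_total, cost_per_run, cost_per_idea)
```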
Again, from [Deep Reinforcement Learning Doesn't Work
Yet](https://www.alexirpan.com/2018/02/14/rl-hard.html), that kind of
instability seems normal and accepted right now. In fact, even “Five random
seeds (a common reporting metric) may not be enough to argue significant
results, since with careful selection you can get non-overlapping confidence
intervals.”
(All of a sudden the $25,000 of AWS credits that the [OpenAI Scholars
programme](https://blog.openai.com/openai-scholars/) provides doesn't seem
quite so crazy. That probably *is* about the amount you need to give someone so
that compute isn't a worry at all.)
My point here is that **if you want to tackle a deep reinforcement learning
project, make sure you know what you're getting yourself into**. Make sure
you're prepared for how much time it could take and how much it might cost.
---
Overall, reproducing a reinforcement learning paper was a fun side project to
try. But looking back, thinking about which skills it actually levelled up, I'm
also wondering whether reproducing a paper was really the best use of time over
the past months.
On one hand, I definitely feel like my machine learning *engineering* ability
improved a lot. I feel more confident in being able to recognise common RL
implementation mistakes; my workflow got a whole lot better; and from this
particular paper I got to learn a bunch about Distributed TensorFlow and
asynchronous design in general.
On the other hand, I don't feel like my machine learning *research* ability
improved much (which is, in retrospect, what I was actually aiming for). Rather
than implementation, the much more difficult part of research seems to be
coming up with ideas that are interesting but also *tractable and concrete*;
ideas which give you the best bang-for-your-buck for the time you *do* spend
implementing. Coming up with interesting ideas seems to be a matter of a)
having a large vocabulary of concepts to draw on, and b) having good taste
for ideas (e.g. what kind of work is likely to be useful to the community). I
think a better project for both of those might have been to, say, read
influential papers and write summaries and critical analyses of them.
So I think my main meta-takeaway from this project is that **it's worth
thinking carefully whether you want to level up engineering skills or research
skills**. Not that there's no overlap; but if you're particularly weak on one
of them you might be better off with a project specifically targeting that one.
If you want to level up both, a better project might be to read papers until
you find something you're really interested in that comes with clean code, and
try to implement an extension to it.
---
If you *do* want to tackle a deep RL project, here are some more specific
things to watch out for.
#### Choosing papers to reproduce
* Look for papers with few moving parts. Avoid papers which require multiple
parts working together in coordination.
#### Reinforcement learning
* If you're doing anything that involves an RL algorithm as a component in a
larger system, don't try and implement the RL algorithm yourself. It's a fun
challenge, and you'll learn a lot, but RL is unstable enough at the moment
that you'll never be sure whether your system doesn't work because of a bug
in your RL implementation or because of a bug in your larger system.
* Before doing anything, see how easily an agent can be trained on your
environment with a baseline algorithm.
* Don't forget to normalize observations. *Everywhere* that observations might
be being used. [3](#fn:norm2)
* Write end-to-end tests as soon as you think you've got something working.
Successful training can be more fragile than you expected.
* If you're working with OpenAI Gym environments, note that with `-v0`
environments, 25% of the time, the current action is ignored and the previous
action is repeated (to make the environment less deterministic). Use `-v4`
environments if you don't want that extra randomness. Also note that
environments by default only give you every 4th frame from the emulator,
matching the early DeepMind papers. Use `NoFrameskip` environments if you
don't want that. For a fully deterministic environment that gives you exactly
what the emulator gives you, use e.g. `PongNoFrameskip-v4`.
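As a rough illustration of the normalization advice in the list above, here is a running normalizer based on Welford's algorithm. It is a sketch under simplifying assumptions: it works on scalars for clarity (real observations would be arrays), and `RunningNormalizer` is a hypothetical class, not something from the project or from Gym. The idea is to route every consumer of observations through the same normalizer so no code path sees raw values by accident.

```python
import math


class RunningNormalizer:
    """Track a running mean and variance (Welford's algorithm) and standardize values."""

    def __init__(self, epsilon=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.epsilon = epsilon

    def update(self, x):
        """Fold one new value into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        """Standardize `x` using the statistics seen so far."""
        variance = self.m2 / self.count if self.count > 0 else 1.0
        return (x - self.mean) / math.sqrt(variance + self.epsilon)


normalizer = RunningNormalizer()
for pixel in [0.0, 255.0, 128.0, 64.0, 192.0]:
    normalizer.update(pixel)

# The same normalizer feeds every component that consumes the observation.
policy_input = normalizer.normalize(128.0)
predictor_input = normalizer.normalize(128.0)
print(policy_input == predictor_input)
```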
#### General machine learning
* Because of how long end-to-end tests take to run, you'll waste a lot of time
if you have to do major refactoring later on. Err on the side of implementing
things well the first time rather than hacking something up and saving
refactoring for later.
* Initialising a model can easily take ~ 20 seconds. That's a painful amount of
time to waste because of e.g. syntax errors. If you don't like using IDEs, or
you can't because you're editing on a server with only shell access, it's
worth investing the time to set up a linter for your editor. (For Vim, I like
[ALE](https://github.com/w0rp/ale) with *both*
[Pylint](https://www.pylint.org/) and
[Flake8](http://flake8.pycqa.org/en/latest/). Though Flake8 is more of a
style checker, it can catch some things that Pylint can't, like wrong
arguments to a function.) Either way, every time you hit a stupid error while
trying to start a run, invest time in making your linter catch it in the
future.
* It's not just dropout you have to be careful about implementing in networks
with weight-sharing - it's also batchnorm. Don't forget there are
normalization statistics and extra variables in the network to match.
* Seeing regular spikes in memory usage while training? It might be that your
validation batch size is too large.
* If you're seeing strange things when using Adam as an optimizer, it might be
because of Adam's momentum. Try using an optimizer without momentum like
RMSprop, or disable Adam's momentum by setting β1 to zero.
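To see why setting β1 to zero removes Adam's momentum, it helps to write out the textbook update for a single scalar parameter. This is a plain-Python sketch of the standard Adam rule, not TensorFlow's implementation, and `adam_step` is a hypothetical helper. With `beta1=0.0` the first-moment estimate is just the current gradient, so nothing carries over from earlier steps and the update behaves like an RMSprop-style scaled gradient step.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum term)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (gradient scale)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


param, m, v = 1.0, 0.0, 0.0
for t, grad in enumerate([0.5, -2.0, 0.1], start=1):
    param, m, v = adam_step(param, grad, m, v, t, beta1=0.0)
    print(f"step {t}: first moment equals current gradient: {m == grad}")
```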
#### TensorFlow
* If you want to debug what's happening with some node buried deep in the
middle of your graph, check out
[`tf.Print`](https://www.tensorflow.org/api_docs/python/tf/Print), an
identity operation which prints the value of its input every time the graph
is run.
* If you're saving checkpoints only for inference, you can save a lot of space
by omitting optimizer parameters from the set of variables that are saved.
* `session.run()` can have a large overhead. Group up multiple calls in a batch
wherever possible.
* If you're getting out-of-GPU-memory errors when trying to run more than one
TensorFlow instance on the same machine, it could just be because one of your
instances is trying to reserve all the GPU memory, rather than because your
models are too large. This is TensorFlow's default behaviour. To tell
TensorFlow to only reserve the memory it needs, see the
[`allow_growth`](https://www.tensorflow.org/programmers_guide/using_gpu#allowing_gpu_memory_growth)
option.
* If you want to access the graph from multiple things running at once, it
looks like you *can* access the same graph from multiple threads, but there's
a lock somewhere which only allows one thread at a time to actually do
anything. This seems to be distinct from the Python global interpreter lock,
which TensorFlow is [supposed
to](https://stackoverflow.com/questions/38206695/python-parallelizing-gpu-and-cpu-work)
release before doing heavy lifting. I'm uncertain about this, and didn't have
time to debug more thoroughly, but if you're in the same boat, it might be
simpler to just use multiple processes and replicate the graph between them
with [Distributed
TensorFlow](http://amid.fish/distributed-tensorflow-a-gentle-introduction).
* Working with Python, you get used to not having to worry about overflows. In
TensorFlow, though, you still need to be careful:
```
> a = np.array([255, 200]).astype(np.uint8)
> sess.run(tf.reduce_sum(a))
199
```
* Be careful about using `allow_soft_placement` to fall back to a CPU if a GPU
isn't available. If you've accidentally coded something that can't be run on
a GPU, it'll be silently moved to a CPU. For example:
```
with tf.device("/device:GPU:0"):
    a = tf.placeholder(tf.uint8, shape=(4))
    b = a[..., -1]
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
sess.run(tf.global_variables_initializer())
# Seems to work fine. But with allow_soft_placement=False
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=False))
sess.run(tf.global_variables_initializer())
# we get
# Cannot assign a device for operation 'strided_slice_5':
# Could not satisfy explicit device specification '/device:GPU:0'
# because no supported kernel for GPU devices is available.
```
* I don't know how many operations there are like this that can't be run on a
GPU, but to be safe, do CPU fallback manually:
```
gpu_name = tf.test.gpu_device_name()
device = gpu_name if gpu_name else "/cpu:0"
with tf.device(device):
    pass  # graph code goes here
```
#### Mental health
* Don't get addicted to TensorBoard. I'm serious. It's the perfect example of
addiction through unpredictable rewards: most of the time you check how your
run is doing and it's just pootling away, but as training progresses,
sometimes you check and all of a sudden - jackpot! It's doing something
super exciting. If you start feeling urges to check TensorBoard every few
minutes, it might be worth setting rules for yourself about how often it's
reasonable to check.
---
If you've read this far and haven't been put off, awesome! If you'd like to get
into deep RL too, here are some resources for getting started.
* Andrej Karpathy's [Deep Reinforcement Learning: Pong from
Pixels](http://karpathy.github.io/2016/05/31/rl/) is a great introduction to
build motivation and intuition.
* For more on the theory of reinforcement learning, check out [David Silver's
lectures](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html). There
isn't much on deep RL (reinforcement learning using neural networks), but it
does teach the vocabulary you'll need to be able to understand papers.
* John Schulman's [Nuts and Bolts of Deep RL
talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ)
([slides](http://joschu.net/docs/nuts-and-bolts.pdf); [summary
notes](https://github.com/williamFalcon/DeepRLHacks)) has lots more tips
about practical issues you might run into.
For a sense of the bigger picture of what's going on in deep RL at the moment,
check out some of these.
* Alex Irpan's [Deep Reinforcement Learning Doesn't Work
Yet](https://www.alexirpan.com/2018/02/14/rl-hard.html) has a great overview
of where things are right now.
* Vlad Mnih's talk on [Recent Advances and Frontiers in Deep
RL](https://www.youtube.com/watch?v=bsuvM1jO-4w) has more examples of work on
some of the problems mentioned in Alex's post.
* Sergey Levine's [Deep Robotic
Learning](https://www.youtube.com/watch?v=eKaYnXQUb2g) talk, with a focus on
improving generalization and sample efficiency in robotics.
* Pieter Abbeel's [Deep Learning for
Robotics](https://www.youtube.com/watch?v=TyOooJC_bLY) keynote at NIPS 2017
with some of the more recent tricks in deep RL.
Good luck!
Thanks to [Michal Pokorný](http://agentydragon.com/about.html) and Marko Thiel for thoughts on
a first draft of this post.
1. Observations are fed into two different training loops, policy training and reward predictor training, and I'd forgotten to normalize observations for the second one. Also, calculating running statistics (specifically, variance) is tricky. Check out [John Schulman's code](https://github.com/joschu/modular_rl/blob/master/modular_rl/running_stat.py) for a good reference.
2. This is basically [CFAR's](http://www.rationality.org/) MurphyJitsu script.
3. As mentioned above, I was stuck for a good while because of forgetting to normalize observations used for training the reward predictor. Derp.
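To illustrate why running variance is tricky (note 1), here is a minimal sketch of Welford's online algorithm for scalar observations. It is not the linked `running_stat.py`; the class name `RunningMeanVar` and its attributes are made up for this example.

```python
import math


class RunningMeanVar:
    """Running mean and sample variance via Welford's online algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # The second factor uses the *updated* mean; this avoids the
        # catastrophic cancellation of the naive E[x^2] - E[x]^2 formula.
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        # Sample variance; reported as 0 until there are two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    @property
    def std(self):
        return math.sqrt(self.var)


stat = RunningMeanVar()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stat.push(x)
```

An observation would then be normalized as `(x - stat.mean) / (stat.std + eps)` for some small `eps`, with the same statistics applied in every training loop that consumes observations.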
## Amid Fish
is Matthew Rahtz's blog
@@ -0,0 +1,399 @@
Source: https://andyljones.com/posts/rl-debugging.html
Title: Debugging RL, Without the Agonizing Pain - Andy Jones (2021)
Fetched-via: uvx markitdown https://andyljones.com/posts/rl-debugging.html
Fetch-status: verbatim
# Debugging RL, Without the Agonizing Pain
Debugging reinforcement learning systems combines the pain of debugging distributed systems with the pain of debugging numerical optimizers. Which is to say, it *sucks*. If this is your first time, you might have a few hundred lines of code that you *think* are correct in an hour, and a system that's *actually* correct two months later. [Here's the head of Tesla AI having just that experience](https://news.ycombinator.com/item?id=13519044).
This is a collection of debugging advice that has served me well over the past few years. It was formed both from my personal experiences, and from several months of helping people out in the [RL Discord](https://discord.com/invite/xhfNqQv). It is intended as a complement to the [other excellent articles on debugging RL that can be found elsewhere](https://github.com/andyljones/reinforcement-learning-discord-wiki/wiki#debugging-advice). I recommend you read all of them; each one has its own unique set of bugbears to warn you away from.
There are three sections: one on [theory](#theory), one on [common fixes](#fixes), and one on [practical advice](#tactics). Things flow a little better if you read them in order, but you can skip on ahead if you wish.
# Theory
## Why is debugging RL so hard?
A combination of issues. These issues show up in debugging any kind of system, but in RL they're more common, and they'll show up starting with the first system you ever write.
### Feedback is poor
**Errors aren't local**: The vast majority of the bugs you'll make are the 'doing the wrong calculation' sort. Because information in an RL system flows in a loop - actor to learner and then back to actor - a numerical error in one spot gets smeared throughout the system in seconds, poisoning everything. This means that most numerical errors manifest as *all* your metrics going weird at the same time; your loss exploding, your KL div collapsing, your rewards oscillating. From the outside, you can tell something is wrong but you've no idea *what* is wrong or where to start looking.
To my mind this is the single biggest issue with debugging RL systems, and much of the advice below is about how to better-localise errors.
**Performance is noisy**: The ultimate arbiter of an RL system - how good it is at collecting reward - is only weakly related to how good of an implementation you've written. You could write a bug-free implementation the first time and other factors (like hyperparameters, architecture or your environment) could sabotage performance. In the worst case, your evaluation run could just get an unlucky seed. Conversely, you could write a bug-laden implementation and it might seem to work! After all, bugs are just one more source of noise and your neural net is going to [try its damnedest](https://twitter.com/gwern/status/1014978860369182722) to pull the signal out of that mess you're feeding it.
The real kicker though is that because run-to-run variability is so high, it's very easy to fix - or introduce - a bug and then see no change in performance at all.
### Simplifying is hard
**There're few narrow interfaces**: Smart software development involves splitting the system up into components so that each component only talks to the others through a narrow interface. This way you can easily pinch a component off from the rest of the system, feed it some mock inputs and see if it gives the correct answers.
This is difficult in RL systems. In RL systems, each component typically consumes a large number of mega- or gigabyte arrays and returns the same. The components are also unavoidably stateful, with the principal two components - the actor and learner - hefting around the state of the environment and the network weights respectively. State can be thought of as an interface with the component's own past, and in RL this interface is *huge*.
Consequently while you *can* isolate components in RL (and we'll talk about how to below), it's much more painful to do than it is in other kinds of software.
**There are few black boxes**: A black box is a component that works in a complex way, but which you can reason about in a simple way. Another name for a black box would be 'a good abstraction'. The prototypical example is your computer: there's a hierarchy of concepts in there, from doped silicon through to operating systems, but as far as you the programmer are concerned it's all about for loops and function calls.
RL has surprisingly few of these black boxes. You're required to know how your environment works, how your network works, how your optimizer works, how backprop works, how multiprocessing works, how stat collection and logging work. How GPUs work! There are [lots](https://docs.ray.io/en/latest/rllib.html) of [attempts](https://github.com/thu-ml/tianshou) at [writing](https://github.com/deepmind/acme) black-box [RL](https://github.com/astooke/rlpyt) libraries, but as of Jan 2021 my experience has been that these libraries have yet to be both flexible *and* easy-to-use. This might be a symptom of my odd strand of research, but I've heard several other researchers echo my frustrations.
### We're bad at writing RL systems
**Your expectations suck**: In any domain, problems evaporate as you get used to them. The first stack trace you see in your life is a nightmare; the millionth a triviality. All of the problems with RL listed above are only really problems because people new to the field expect something much more refined and reliable, as they've come to expect from other fields of programming and numerical research. If instead you arrive in RL expecting a garbage fire, you might just stay zen throughout.
Obviously though, this begs the question of *why* RL development is a garbage fire.
**The community is young**: While reinforcement learning as a field stretches back decades, it has *exploded* in the past few years and continues apace today. Finding good abstractions requires in part that the userbase's requirements stabilize, and that just isn't the case yet. Some of that is because it's very much a community of researchers rather than a community of practitioners, and the terrible thing about researchers is that they're very keen on doing new and different things. Maybe it'll be different once someone figures out how to turn RL into an industry.
**The community has other priorities**: Again, the community is a community of researchers. The population sets the priorities, and the priority is publication. Reliable, reproducible research contributes to publishing high-impact papers, but it also costs time and effort that is arguably better spent working on something *new*. And, well, it's hard to argue with the results: the current standards of RL development have carried us [a](https://deepmind.com/blog/article/muzero-mastering-go-chess-shogi-and-atari-without-rules) [long](https://openai.com/blog/learning-dexterity/) [way](https://deepmind.com/blog/article/AlphaStar-Grandmaster-level-in-StarCraft-II-using-multi-agent-reinforcement-learning).
Don't take this as a clarion call for better practices, nor a stalwart defense of practices as they are. It's not a hill I wish to die on. I'm only giving an explanation for why things are the way they are, rather than a justification for it. My preferences are towards improved practices, but I can see the sense in the other side's position.
## Debugging Strategies
With all that in mind, here are some broad strategies to keep in mind when chasing a bug.
### Design reliable tests
Write tests that either clearly pass or clearly fail. There's some amount of true randomness in RL, but most of that can be controlled with a seed. What's harder to deal with is pseudorandomness such that on one seed a test might pass and on another seed the test might fail. This is *awful* to deal with, and you should go out of your way to avoid it.
While the ideal is a test that is guaranteed to cleanly pass or fail, a good fallback is one that is simply *overwhelmingly likely* to pass or fail. Typically, this means substituting out environments or algorithms with simpler ones that behave more predictably, and which you can run through your implementation with some massive batch size that'll suppress a lot of the wackiness that you might otherwise suffer.
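As a rough illustration of why a massive batch makes a test 'overwhelmingly likely' to pass or fail, the sketch below measures how much a batch-mean reward estimate moves between seeds. The reward distribution (+1 with probability 0.6, else -1) and the function names are invented for the example.

```python
import random
import statistics


def batch_mean_reward(batch_size, rng):
    """Mean of `batch_size` noisy rewards: +1 with probability 0.6, else -1."""
    rewards = [1.0 if rng.random() < 0.6 else -1.0 for _ in range(batch_size)]
    return statistics.fmean(rewards)


def spread_across_seeds(batch_size, seeds=range(20)):
    """Standard deviation of the batch mean across independently seeded runs."""
    means = [batch_mean_reward(batch_size, random.Random(seed)) for seed in seeds]
    return statistics.pstdev(means)


small_batch_spread = spread_across_seeds(10)
large_batch_spread = spread_across_seeds(10_000)
```

The true mean here is 0.2. A test that asserts on a batch of 10 will flip between seeds; one that asserts on a batch of 10,000 with a sensible tolerance should essentially never do so.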
### Design *fast* tests
Iteration speed is a huge determinant of debugging speed. Running a test should take at most as long as it takes you to make a potential fix, which is to say 'a few seconds'.
This means: don't try to debug your implementation by just running it on your full task. That might take days! That way madness lies. Instead, design setups that can execute more quickly, but still exercise the code you're looking at. For specific tips, look at the [probe environments](#probe) section below.
### Localise errors
Write test code that'll tell you the most about where the error is. The classic example of this is binary search: if you're looking for a specific item in a sorted list, then taking a look at the middle item tells you a *lot* more about where your target item is than looking at the first item.
Similarly, when debugging RL systems try to find tests that cut your system in half in some way, and tell you which half the problem is in. Incrementally testing every.single.chunk of code - well, sometimes that's what it comes down to! But it's something to try and avoid.
### Be Bayesian
But sometimes you can't avoid it! Binary search wouldn't have been much help in [finding the wreck of the USS Scorpion](https://en.wikipedia.org/wiki/USS_Scorpion_%28SSN-589%29). There they had to do a location-by-location search, and the key turned out to be prioritising the areas where
* the Scorpion was likely to be and
* where it was likely to be *spotted*.
This kind of thinking isn't so critical in traditional software development because isolating components is much easier, so you can do the sort of binary search I mentioned previously. But in RL, well, sometimes you just can't untangle something. Then you should reflect on which bits of your code are most likely to *contain* bugs, and which bits of your code you're going to be able to *easily spot* those bugs in. Prioritise looking in those places!
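One way to make that prioritisation concrete is to score each suspect by the product of the two probabilities and look in the highest-scoring place first. The component names and numbers below are entirely hypothetical, and a fuller treatment would also weigh the cost of each check and update the probabilities after every failed search.

```python
# Hypothetical guesses for each component:
# (probability the bug is there, probability a check would spot it if so).
suspects = {
    "reward discounting": (0.30, 0.9),
    "advantage calculation": (0.25, 0.8),
    "network architecture": (0.05, 0.2),
    "multiprocess env workers": (0.40, 0.1),
}


def search_order(suspects):
    """Rank components by P(bug is here) * P(spotted | bug is here), highest first."""
    score = {name: p_here * p_spot for name, (p_here, p_spot) in suspects.items()}
    return sorted(score, key=score.get, reverse=True)


order = search_order(suspects)
```

Note that the env workers are the most likely home for the bug in this made-up table, but they rank third because a bug there is so hard to see.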
As an aside, the [parable of the drunk and his keys](https://en.wikipedia.org/wiki/Streetlight_effect) has always confused me: I don't know if it's saying the wise thing to do is to look under the streetlight, or to look in the dark. Best moral I've heard for it is 'it depends'.
### Pursue Anomalies
If you ever see a plot or a behaviour that just *seems weird*, chase right after it! Do not - do *not* - just 'hope it goes away'. Chasing anomalies is one of the most powerful ways to debug your system, because if you've noticed a problem without having had to go look for it, that means it's a *really big problem*.
This takes quite a bit of a mindset change though. It's really tempting to think that the cool extra functionality you were planning to write today - a tournament, adaptive reward scaling, a transformer - might just magically fix this anomalous behaviour.
It won't.
Give up on your plan for the day and chase the anomaly instead.
# Common Fixes
These are specific things that frequently trip people up.
## Hand-tune your reward scale
The single most common issue for newbies writing custom RL implementations is that the targets arriving at their neural net aren't in [-1, +1]. Actually, anything [-.1, +.1]ish to [-10, +10]ish is good. The point is to have rewards that generate 'sensible' targets for your network. The hyperparameters you've pulled from the literature are adapted to work with these nicely-scaled targets, but lots of envs don't natively provide rewards of the right size so as to generate these nicely-scaled targets.
Having read that, you might be tempted to write some adaptive scheme to scale your rewards for you. Don't: it's an extra bit of nonstationarity that'll make life more difficult. Just hand-scale, hand-clip the rewards from your env so that the targets passed to your network are sensible. When everything else is working, you can come back and replace this with something less artificial.
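A minimal sketch of hand-scaling and hand-clipping, assuming an env whose raw rewards run into the hundreds. The function name and the constants `scale=0.01` and `clip=10.0` are placeholders to be picked by eye for your own env, not recommended values.

```python
def scale_reward(raw_reward, scale=0.01, clip=10.0):
    """Multiply by a fixed, hand-picked scale, then clip to [-clip, +clip]."""
    scaled = raw_reward * scale
    return max(-clip, min(clip, scaled))
```

Both constants are fixed for the whole run, so the transformation adds no nonstationarity.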
## Use a really large batch size
One of the most reliable ways to make life easier in RL is to use a really large batch size. A *really* large batch size. There's an [excellent paper on picking batch sizes](https://arxiv.org/abs/1812.06162), and to pull some examples from there:
* Pong: ~1k batch size
* Space Invaders: ~10k batch size
* 1v1 Dota: ~100k batch size
The idea behind this is that with small batches and complex envs, it's easy for your learner to end up with a batch that represents some weird idiosyncratic part of the problem. Big batches do a lot to suppress this.
## Use a really small network
Hand in hand with really large batch sizes is really small networks. When you use really large batches, your binding constraint is likely to be the memory it takes to hold the forward pass activations on your GPU. By making the network smaller, you can fit bigger batches! And frankly, small networks can accomplish a *lot*. In my [boardlaw](https://andyljones.com/boardlaw/) project, I found that a fully connected network with 4 layers of 256 neurons was enough to learn perfect play on a 9x9 board. Perfect play! That's really complex!
## Avoid pixels
And hand-in-hand with 'use a small network' is: *avoid pixels*. Especially if you're an independent researcher with hardware constraints, just... don't work on environments with hefty, expensive-to-ingest observations like Atari. Pixel-based observations mean that before it does anything interesting, your agent has to learn to *see*. From sparse rewards! That's hard, and it's compute-intensive, and it's *boring*. If you've got any choice in the matter, pick the simplest env that will be able to generate the behaviour you're after. For example:
* Gridworlds like [Griddly](https://github.com/Bam4d/Griddly) and [minigrid](https://github.com/maximecb/gym-minigrid). Gridworlds can support most of the interesting behaviours you'd find in a continuous environment, but are much more resource-efficient. If you've just graduated out of [the Gym envs](https://gym.openai.com/envs/#classic_control), gridworlds are an excellent next step.
* Multi-agent setups like the boardgames from [OpenSpiel](https://openspiel.readthedocs.io/en/latest/games.html), [microRTS](https://github.com/santiontanon/microrts) or [Neural MMO](https://github.com/jsuarez5341/neural-mmo). A multi-agent env shouldn't be your *first* foray into RL - they're substantially more complex than the single-agent case - but competition and cooperation can generate a lot of complexity from very lightweight environments.
* Unusual envs like [WordCraft](https://github.com/minqi/wordcraft). WordCraft is unique in that it isolates learning about the real world from actually having to model the real world! But again, possibly not the best choice for a first RL project; I've included it here as an example of how powerful simple environments can be.
In all, fast environments with small networks and big batches are far easier to debug than slow environments with big networks and small batches. Make sure you can walk before you try running.
## Mix your vectorized envs
If you've got a long-lived env and you're simulating a lot of them in parallel, you might find that your system behaves a bit strangely at the start of training. One common issue is that if all your envs start from the same state, then your learner gets passed very highly-correlated samples, and so it tries to optimise for, say, steps 0-10 of the env in the first batch, then 10-20 in the second batch, etc. You can avoid this by '[mixing](https://en.wikipedia.org/wiki/Markov_chain_mixing_time)' your envs: taking enough random steps in the env that they become uncorrelated with one another. A good way to check that things are well-mixed is to look at the number of resets at each timestep: if they look pretty uniform, things are well-mixed. If they all cluster on a specific timestep, you need to take some more random actions.
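The reset-count check can be sketched with a toy simulation. It assumes a hypothetical env with a fixed episode length, and models 'mixing' as giving each env a random number of warm-up steps; real envs have variable-length episodes, but the diagnostic is the same.

```python
import random
from collections import Counter

EPISODE_LENGTH = 50  # hypothetical fixed-length env


def reset_counts(start_ages, n_steps):
    """Count how many envs reset at each timestep, given each env's initial age."""
    counts = Counter()
    for age in start_ages:
        for t in range(n_steps):
            age += 1
            if age == EPISODE_LENGTH:
                counts[t] += 1
                age = 0
    return counts


n_envs, n_steps = 1000, 200
rng = random.Random(0)

# Every env starts from a fresh reset, so they all finish on the same timestep.
unmixed = reset_counts([0] * n_envs, n_steps)
# 'Mixed': each env has already taken a random number of warm-up steps.
mixed = reset_counts([rng.randrange(EPISODE_LENGTH) for _ in range(n_envs)], n_steps)

print("unmixed: busiest timestep has", max(unmixed.values()), "resets")
print("mixed:   busiest timestep has", max(mixed.values()), "resets")
```

In the unmixed case all resets pile onto a handful of timesteps; in the mixed case they are spread roughly evenly.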
# Practical Advice
This advice sits somewhere between the 'common mistakes' and the more general 'theory' we discussed earlier.
## Work from a reference implementation
*If you're new to reinforcement learning, writing things from scratch is the most catastrophically self-sabotaging thing you can do.*
There is an alluring masochism in writing things from scratch. There's concrete value in it too: by writing things from scratch, you're both forced to fully understand what you're doing and you're more likely to come up with a fresh perspective. In many other fields of software development these benefits would be worth the slow-down you suffer from having to work everything out yourself.
In reinforcement learning, these benefits are not worth it. At all. As discussed [above](#theory), the nature of RL work makes it extremely hard for you to self-correct.
When I say 'use a reference implementation', there are several interpretations you can take depending on your risk tolerance.
* The safest thing to do is to use a reference implementation out-of-the-box. Check that it works on your task, then repeatedly make a small change and check that it works as it did before.
* Less safe is to just use the reference implementation as a source of reliable components. Work to the same API, and check that your version of a component and their version give the same outputs for the same inputs.
* Least safe (but still dramatically better than going in blind) is to have one eye on the reference implementation while you write your own. Copy their hyperparameters, copy their discounting code, copy how they handle termination and invalid actions and a hundred other little things that you're likely to muck up otherwise.
Here are some excellent reference implementations to choose from:
* [spinning-up](https://github.com/openai/spinningup) has been written by OpenAI, and has a [short course to go along with it](https://spinningup.openai.com/).
* [stable-baselines3](https://github.com/DLR-RM/stable-baselines3) is based on an older set of OpenAI implementations, but cleaned up and actively maintained.
* [cleanrl](https://github.com/vwxyzjn/cleanrl/tree/master/cleanrl) isolates every algorithm in its own file.
* [OpenSpiel](https://github.com/deepmind/open_spiel) is DeepMind's multi-agent reinforcement learning library. They provide both Python and C++ implementations of many algorithms - you'll probably want the Python ones.
## Assume you have a bug
When their RL implementation doesn't work, people are often keen to either (a) adjust their network architecture or (b) adjust their hyperparameters. On the other hand, they're reluctant to say they've got a bug.
Most often, it turns out they've got a bug.
Why bugs are so much more common in RL code is discussed [above](#theory), but there's another advantage to assuming you've got a bug: bugs are a damn sight faster to find and fix than validating that your new architecture is an improvement over the old one.
Now having said that you should assume you have a bug, it's worth mentioning that sometimes - rarely - you don't have a bug. What I'm advocating for here is not a blind faith in the buginess of your code, but for dramatically raising the threshold at which you start thinking 'OK, I think this is correct.'
## Loss curves are a red herring
When someone's RL implementation isn't working, they *luuuuuurv* to copy-paste a screenshot of their loss curve to you. They do this because they know they want a pretty, exponentially-decaying loss curve, and they know what they have *isn't that*.
The problem with using the loss curve as an indicator of correctness is partly that it's not reliable, but mostly that it doesn't localise errors. The shape of your loss curve says very little about where in your code you've messed up, and so says very little about what you need to change to get things working.
As in the previous section, my sweeping proclamation comes with some qualifiers. Once you have a semi-functional implementation and you've exhausted other, better methods of error localisation (as documented in the rest of this post), there *is* valuable information in a loss curve. If nothing else, being able to split a model's performance into 'how fast it learns' and 'where it plateaus' is a useful way to think about the next improvement you might want to make. But because it only offers *global* information about the performance of your implementation, it makes for a really poor debugging tool.
## Unit test the tricky bits
Most of the bugs in a typical attempt at an RL implementation turn up in the same few places. Some of the usual suspects are
* reward discounting, especially around episode resets
* advantage calculations, again especially around resets
* buffering and batching, especially pairing the wrong rewards with the wrong observations
Fortunately, these components are all really easy to test! They've got none of the issues that validating RL algorithms as a whole has. These components are deterministic, they're easy to factor out, and they're fast. Checking you've got the termination right on your reward discounting is [a few lines](https://github.com/andyljones/megastep/blob/master/megastep/demo/learning.py#L134-L159).
What's even better is that most of the time, *as you write these things* you know you're messing them up. If you're not certain whether you've just accumulated the reward on one side of the reset or the other, *put a test in*.
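As an illustration of how small such a test can be, here is a sketch of a reward-to-go function plus a test that pins down its behaviour at an episode boundary. This is not the linked megastep code; the function name, the flat-buffer layout and the convention that `terminals[t]` marks the *last* step of an episode are assumptions of this example.

```python
def discounted_returns(rewards, terminals, gamma, bootstrap=0.0):
    """Reward-to-go for a flat buffer of steps that may span several episodes.

    terminals[t] is True when step t is the last step of its episode, so the
    return at t must not include anything from step t+1 onwards.
    `bootstrap` is the value estimate for whatever follows the final step.
    """
    returns = [0.0] * len(rewards)
    running = bootstrap
    for t in reversed(range(len(rewards))):
        if terminals[t]:
            running = 0.0  # cut the chain at the episode boundary
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def test_returns_do_not_leak_across_resets():
    rewards = [1.0, 1.0, 1.0, 1.0]
    terminals = [False, True, False, False]
    out = discounted_returns(rewards, terminals, gamma=0.5, bootstrap=8.0)
    # Episode one is steps 0-1. Episode two is steps 2-3 and is still running,
    # so it bootstraps from the value estimate of 8.0.
    assert out == [1.5, 1.0, 3.5, 5.0]


test_returns_do_not_leak_across_resets()
```

If your buffer instead flags the *first* step of each new episode, the reset has to move to the other side of the update, which is exactly the kind of off-by-one this test is meant to catch.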
## Use probe environments
The usual advice to people writing RL algorithms is to use a simple environment like the [classic control ones from the Gym](https://gym.openai.com/envs/#classic_control).
Thing is, these envs have the same problem as looking at loss curves: at best they give you a noisy indicator, and if the noisy indicator looks poor you don't know *why* it looks poor. They don't localise errors.
Instead, construct environments that *do* localise errors. In a recent project, I used
1. **One action, zero observation, one timestep long, +1 reward every timestep**: This isolates the value network. If my agent can't learn that the value of the only observation it ever sees is 1, there's a problem with the value loss calculation or the optimizer.
2. **One action, random +1/-1 observation, one timestep long, obs-dependent +1/-1 reward every time**: If my agent can learn the value in (1.) but not this one - meaning it can learn a constant reward but not a predictable one! - it must be that backpropagation through my network is broken.
3. **One action, zero-then-one observation, *two* timesteps long, +1 reward at the end**: If my agent can learn the value in (2.) but not this one, it must be that my reward discounting is broken.
4. **Two actions, zero observation, one timestep long, action-dependent +1/-1 reward**: The first env to exercise the policy! If my agent can't learn to pick the better action, there's something wrong with either my advantage calculations, my policy loss or my policy update. That's three things, but it's easy to work out by hand the expected values for each one and check that the values produced by your actual code line up with them.
5. **Two actions, random +1/-1 observation, one timestep long, action-and-obs dependent +1/-1 reward**: Now we've got a dependence on both obs and action. The policy and value networks interact here, so there's a couple of things to verify: that the policy network learns to pick the right action in each of the two states, and that the value network learns that the value of each state is +1. If everything's worked up until now, then if - for example - the value network fails to learn here, it likely means your batching process is feeding the value network stale experience.
6. Etc.
You get the idea: (1.) is the simplest possible environment, and each new env adds the smallest possible bit of functionality. If the old env works but the successor doesn't, that gives you a *lot* of information about where the problem is.
Even better, these environments are extraordinarily fast. When you've a correct implementation, it should only take a second or two to learn them. And they're *decisive*: if your value network in (1.) ends up more than an epsilon away from the correct value, it means you've got a bug.
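The first two probe environments can be sketched in a few lines. The `reset()`/`step()` interface returning `(observation, reward, done)` is a simplification invented for this example, and the one-parameter-pair linear model is only a stand-in for a real value network.

```python
import random


class ConstantRewardEnv:
    """Probe 1: one action, zero observation, one timestep, +1 reward."""

    def reset(self):
        return 0.0

    def step(self, action):
        return 0.0, 1.0, True  # observation, reward, done


class ObsDependentRewardEnv:
    """Probe 2: one action, random +1/-1 observation, reward equal to that observation."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.obs = 0.0

    def reset(self):
        self.obs = self.rng.choice([-1.0, 1.0])
        return self.obs

    def step(self, action):
        return 0.0, self.obs, True


def fit_linear_value(env, episodes=2000, lr=0.1):
    """Fit v(obs) = w * obs + b by SGD on the squared error against the reward."""
    w, b = 0.0, 0.0
    for _ in range(episodes):
        obs = env.reset()
        _, reward, _ = env.step(0)
        error = reward - (w * obs + b)
        w += lr * error * obs
        b += lr * error
    return w, b


w1, b1 = fit_linear_value(ConstantRewardEnv())
w2, b2 = fit_linear_value(ObsDependentRewardEnv())
```

On probe 1 only the bias can learn, and it should land on 1. On probe 2 the weight has to carry the signal, so a correct update ends near `w = 1, b = 0`; anything else points at the gradient path rather than the loss.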
## Use probe agents
In much the same way that you can simplify your environments to localise errors, you can do the same with your agents too.
*Cheat* agents are ones that you leak extra information to. For example, if I'm writing an agent to navigate to a goal, then slipping the agent an extra vector saying which direction the goal is in should help a *lot*. My agent should be able to solve this problem *much* faster, and if it can't then how the heck can I expect it to solve the original problem?
*Automatons* are agents that don't use a neural network at all. Instead, they're hand-written algorithms. The point of writing something like this is to check that your environment is actually solvable. On a navigation environment I wrote once, I set up a room with a red post behind the agent. Then I wrote an automaton which would just turn left until a block of red was in the middle of its view. Shocker: my automaton couldn't solve this task, because it turned out I'd mucked up the observation generation on odd-numbered environments.
It's worth keeping in mind that automatons can be handed cheat information too! Combining automatons and progressively more cheat information is a powerful way to debug an environment.
*Tabular* agents are a good match for probe environments. If you've set up a real simple environment and *still* nothing works, then replacing your NN with a far-easier-to-interpret lookup table of state values is a great way to figure out what you're missing. Be aware that it might take some time with a pen and paper to check that the values that you're seeing in the table are the ones you expect, but it's a hard setup to fool.
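A tabular agent on the two-timestep probe environment might look like the sketch below. The discount `GAMMA = 0.9`, the transition-tuple layout and the use of TD(0) are choices made for this example; the pen-and-paper answer is a value of 1 for the second observation and `GAMMA` for the first.

```python
GAMMA = 0.9  # hypothetical discount factor


def two_step_episode():
    """The two-timestep probe env as (obs, reward, next_obs, done) transitions."""
    return [(0, 0.0, 1, False), (1, 1.0, None, True)]


def tabular_td0(n_episodes=500, lr=0.1):
    """TD(0) on a lookup table of state values instead of a network."""
    values = {0: 0.0, 1: 0.0}
    for _ in range(n_episodes):
        for obs, reward, next_obs, done in two_step_episode():
            # No bootstrapping past the end of the episode.
            target = reward if done else reward + GAMMA * values[next_obs]
            values[obs] += lr * (target - values[obs])
    return values


values = tabular_td0()
```

If the table converges to the expected numbers but the network version does not, the environment and the targets are probably fine and the problem is in the network or its optimisation.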
## Use adaptive network definitions
One of the issues with probe environments and probe agents is that every time you swap out your environment or agent, you'll find yourself having to rewrite the interface of the network with the rest of the world. By 'interface' I mean 'the bit that eats the observation and the bit that spits out the action'.
One way to avoid this is to write a function that takes the observation space and action space of the environment, and generates 'heads' for the network that convert the observation into a fixed-width vector, and which convert a fixed-width vector to the action. Then you can hand-implement *just* the body of the net that converts the intake vector to the output vector, and the rest will be slotted in by your function based on the env it has to work with.
You can see [one](https://github.com/andyljones/megastep/blob/master/megastep/demo/heads.py) [implementation](https://github.com/andyljones/megastep/blob/master/megastep/demo/__init__.py#L17-L26) of this in my [megastep](https://andyljones.com/megastep/) work, but it's an idea that's been independently developed a few times. I haven't yet seen a general library for it.
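The idea can be sketched without any deep learning library. Everything here is illustrative: `Discrete` and `Box` are toy stand-ins for real space classes, the 'layers' are plain randomly initialised functions with no training, and none of it is taken from megastep.

```python
import random
from dataclasses import dataclass


@dataclass
class Discrete:
    n: int  # number of distinct values


@dataclass
class Box:
    size: int  # length of a flat real-valued vector


def linear(n_in, n_out, rng):
    """A randomly initialised linear layer, as a plain function on lists."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return lambda xs: [sum(w * x for w, x in zip(row, xs)) for row in weights]


def one_hot(index, n):
    return [1.0 if i == index else 0.0 for i in range(n)]


def intake_head(obs_space, width, rng):
    """Build a function mapping an observation to a fixed-width vector."""
    if isinstance(obs_space, Discrete):
        layer = linear(obs_space.n, width, rng)
        return lambda obs: layer(one_hot(obs, obs_space.n))
    if isinstance(obs_space, Box):
        layer = linear(obs_space.size, width, rng)
        return lambda obs: layer(list(obs))
    raise TypeError(f"unsupported observation space: {obs_space!r}")


def output_head(action_space, width, rng):
    """Build a function mapping a fixed-width vector to one logit per action."""
    if isinstance(action_space, Discrete):
        return linear(width, action_space.n, rng)
    raise TypeError(f"unsupported action space: {action_space!r}")


def build_network(obs_space, action_space, body, width=8, seed=0):
    """Wrap a hand-written `body` in heads generated from the env's spaces."""
    rng = random.Random(seed)
    intake = intake_head(obs_space, width, rng)
    output = output_head(action_space, width, rng)
    return lambda obs: output(body(intake(obs)))


def relu_body(xs):
    # The only part written by hand; it never sees the env's shapes.
    return [max(0.0, x) for x in xs]


grid_net = build_network(Discrete(5), Discrete(2), relu_body)
vector_net = build_network(Box(3), Discrete(4), relu_body)
```

The same `relu_body` serves both envs; only the generated heads differ, which is what makes swapping probe environments in and out cheap.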
## Log excessively
The last few sections have involved controlled experiments of a sort, where you place your components in a known setup and see how they act. The complement to a controlled experiment is an observational study: watching your system in its natural habitat *very carefully* and seeing if you can spot anything anomalous.
In reinforcement learning, watching your system carefully means logging. Lots of logging. Below are some of the logs I've found particularly useful.
### Relative policy entropy
The entropy of your policy network's outputs, relative to the maximum possible entropy. It'll usually start near 1, then rapidly fall for a while, then flatten out for the rest of training.
If it stays very near 1, your agent is failing to learn any policy at all. You should check that your policy targets are being computed correctly, that the gradient's being backpropagated correctly, and - if you've defined a custom environment - then your environment is actually correct!
If it drops to zero or close to zero, then your agent has 'collapsed' into some - likely myopic - policy, and isn't exploring any more. This is usually because you've either forgotten to include an exploration mechanism of some sort (like epsilon-greedy actions or an entropy term in the loss), or because your rewards are much larger than whatever you're using to encourage exploration.
Sometimes it'll go up for a while; don't stress about that unless it's a large, permanent increase. If it *is* a large permanent increase and the minimum was very early in training, that can be an indicator that your policy fell into some myopic obviously-good behaviour that it's having to gradually climb back out of. It might help to turn up the exploration incentives.
If the entropy oscillates wildly, that usually means your learning rate is too high.
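As a minimal sketch of the quantity being logged, assuming a discrete action space in which every action is valid (with masked actions, the maximum entropy would be the log of the number of *valid* actions instead). The function name `relative_entropy` is made up for illustration.

```python
import math

def relative_entropy(probs):
    """Entropy of an action distribution divided by the maximum possible entropy."""
    n = len(probs)
    if n < 2:
        return 0.0  # a single action has no entropy to normalise
    # Terms with p == 0 contribute nothing, since p * log(p) -> 0 as p -> 0.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(n)

uniform = relative_entropy([0.25, 0.25, 0.25, 0.25])  # an untrained policy
collapsed = relative_entropy([1.0, 0.0, 0.0, 0.0])    # a deterministic policy
```

Here `uniform` sits at the top of the range and `collapsed` at the bottom, matching the two failure modes described above.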
### Kullback-Leibler divergence
The KL div between the policy that was used to collect the experience in the batch, and the policy that your learner's just generated for the same batch. This should be small but positive.
If it's very large then your agent is having to learn from experience that's very different to the current policy. In some algorithms - like those with a replay buffer - that's expected, and all that's important is the KL div is stable. In other algorithms (like PPO), a very large KL div is an indicator that the experience reaching your network is 'stale', and that'll slow down training.
If it's very low then that suggests your network hasn't changed much in the time since the experience was generated, and you can probably get away with turning the learning rate up.
If it's growing steadily over time, that means you're probably feeding the same experience from early on in training back into the network again and again. Check your buffering system.
If it's negative - that shouldn't happen, and it means you're likely calculating the KL div incorrectly (probably by not handling invalid actions).
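A rough sketch of the calculation for one state, assuming discrete actions and the direction KL(behaviour policy || learner policy); the direction and the names here are assumptions, not something the text above pins down. Skipping actions the behaviour policy gave zero probability is one way to handle invalid actions without producing NaNs or negative values.

```python
import math

def kl_divergence(old_probs, new_probs, eps=1e-12):
    """KL(old || new) over one state's action distribution."""
    total = 0.0
    for p_old, p_new in zip(old_probs, new_probs):
        if p_old <= 0.0:
            continue  # masked/invalid actions (p_old == 0) contribute exactly zero
        # eps guards against log(0) if the learner assigns zero probability
        total += p_old * (math.log(p_old) - math.log(max(p_new, eps)))
    return total

behaviour = [0.5, 0.5, 0.0]  # third action was masked out as invalid
learner = [0.6, 0.4, 0.0]
kl = kl_divergence(behaviour, learner)
```

In practice you would average this over every state in the batch before logging it.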
### Residual variance
The variance of (target values - network values), divided by the variance of the target values.
Like the policy entropy, this should start close to 1, fall very rapidly early on, and then decrease more gradually over the course of training.
If it stays near 1, your value network isn't learning to predict the rewards. Check that your rewards are what you think they are, and check that your value loss and backprop through the value net are all working correctly.
If it drops to zero, that's usually because the policy entropy has dropped to zero too, the policy has collapsed into some deterministic behaviour, and the value network has learned the rewards it is collecting perfectly. Another common reason is that some scenarios are generating vastly larger returns than the others, and the value net's learned to identify when that happens.
If the residual variance oscillates wildly, that usually means your learning rate is too high.
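The definition above can be sketched in a few lines, assuming the targets and predictions for a batch are available as flat lists of floats. It will divide by zero if every target is identical, which a real logger would need to guard against.

```python
from statistics import pvariance

def residual_variance(targets, values):
    """Variance of (target - prediction), divided by the variance of the targets."""
    residuals = [t - v for t, v in zip(targets, values)]
    return pvariance(residuals) / pvariance(targets)

targets = [0.0, 1.0, 2.0, 3.0]
untrained = residual_variance(targets, [0.0, 0.0, 0.0, 0.0])  # constant prediction
perfect = residual_variance(targets, targets)                 # exact prediction
```

A value network that predicts a constant explains none of the target variance, so `untrained` comes out at the "stays near 1" end of the scale.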
### Terminal correlation
The correlation between the value in the final state and the reward in the final step. This is only useful when there's lots of reward in the final step (like in boardgames).
It should start near zero, rise rapidly, then plateau near 1.
If it stays near zero but all the other value-related logs look good, then check that your reward-to-gos are being calculated correctly near termination!
If reward is more evenly distributed through the episode, you could write a version of this that looks at the correlation of (next state's value - this state's value) with the reward in that step. I haven't used this myself though, so can't offer commentary.
### Penultimate terminal correlation
The correlation between the value in the penultimate step and the final reward. Again, only useful when there's lots of reward at the end of the episode. If terminal correlation is high but penultimate terminal correlation is low, that's a strong indicator that your reward-to-gos aren't being carried backwards properly.
### Value target distribution
Either plot a histogram, or the min/max/mean/std. The plots should indicate 'reasonable' value targets in the range [-10, +10] (and ideally [-3, +3]).
If they're larger than that, make your rewards proportionately smaller; if they're smaller than that, make your rewards larger.
If they blow up, check that your reward discounting is correct, and possibly make your discount rate smaller.
If they're blowing up but you're insistent on leaving the discount rate where it is, one alternative is to increase the number of steps used to bootstrap the value targets. In PPO, this'd mean using longer chunks. Longer chunks mean that the values used for bootstrapping get shrunk more before they're fed back to the value net as targets, increasing the stability. You could also consider annealing the discount factor from a smaller value up towards 1.
### Reward distribution
Again, as a histogram or min/max/mean/std. What a reasonable reward distribution is depends on the environment; some envs have a few large rewards, while others have lots of small rewards. Either way, if it doesn't match your expectations then you should investigate.
### Value distribution
Again, as a histogram or min/max/mean/std. This is a complement to the previous two distributions and *should* closely match the value target distribution. If it doesn't, and it stays different from the value target distribution, that's an indicator that your value network is having trouble learning.
It's also worth keeping an eye on the sign of the distribution. If your env only produces positive rewards but there are persistently negative values in the value target distribution, that suggests your reward-to-go mechanism is badly broken or your value network is failing to learn.
### Advantage distribution
Again, as a histogram or min/max/mean/std. As with the value targets, these should be in the range [-10, +10] (and ideally [-3, +3]).
Advantages should also be approximately mean-zero due to how they're constructed; if they're persistently not then you've messed up your advantage calculations.
### Episode length distribution
Again, as a histogram or a min/max/mean/std. As with the reward distribution, interpreting this depends on the environment. If your environment should have arbitrary-length episodes, but you're seeing that every episode here is length 7, that indicates your environment is broken or your network's fallen into some degenerate behaviour.
### Sample staleness
Sample staleness is the number of learner steps between the network used to generate a sample, and the network currently learning from that sample. You can generate this by setting an 'age' attribute on the network, and incrementing it at every learner step. Then when a sample arrives at the learner, diff it against the learner's current age.
How to interpret this depends on the algorithm, but it should generally stay at a steady value throughout training. In on-policy algorithms, lower sample stalenesses are better; in off-policy algorithms it's a tradeoff between fresh samples that let the network bootstrap quickly, and aged samples that stabilise things.
### Step statistics
Step statistics are the abs-max and mean-square-value of the difference between the network's parameters when it enters the learner, and the network's parameters when it leaves the learner.
Interpreting this depends on a whole bunch of things, but the mean-square value should typically be very small (1e-3 in my current training run with a LR of 1e-2), while the abs-max should be small yet substantially larger than the mean-square-value.
If the statistics are much smaller than that, you might be able to increase your learning rate; if they're much larger than that then be on the lookout for instability in your training.
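A sketch of the two statistics, assuming the parameters have been flattened into plain lists of floats snapshotted before and after the learner step. With a real framework you would flatten and concatenate the parameter tensors first; `step_statistics` is a hypothetical helper.

```python
def step_statistics(before, after):
    """Abs-max and mean-square of the parameter change across one learner step."""
    deltas = [a - b for b, a in zip(before, after)]
    abs_max = max(abs(d) for d in deltas)
    mean_square = sum(d * d for d in deltas) / len(deltas)
    return abs_max, mean_square

before = [0.0, 1.0, -2.0, 0.5]  # parameters entering the learner
after = [0.25, 1.0, -2.5, 0.5]  # parameters leaving the learner
abs_max, mean_square = step_statistics(before, after)
```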
### Gradient statistics
Gradient statistics are the abs-max and mean-square-value of the gradient. In the age of Adam and other quasi-Newton optimizers, this isn't as informative as it once was, because normalising by the curvature estimates can dramatically inflate or collapse the gradient.
That said, if the step statistics are looking strange, this can help diagnose whether the problem is with the gradient calculation or with Adam's second-order magic.
### Gradient noise
This is from [McCandlish and Kaplan](https://arxiv.org/abs/1812.06162), and it's intended to help you choose your batch size. Unfortunately it's *spectacularly* noisy, to the point where you likely want to average over all steps in your run.
I've been thinking that it might be possible to get more stable estimates of the gradient noise from Adam's moment estimates, but that's decidedly on the to-do list.
### Component throughput
At the least, keep track of the actor throughput and learner throughput in terms of samples per second, and steps per second.
Typically the actor should be generating *at most* as many samples as the learner is consuming. If the actor is generating excess samples there are weak reasons that might be a good thing - it'll refresh the replay buffer more rapidly - but typically it's considered a waste of compute.
More generally, you want to see these remain stable throughout training. If your throughputs gradually decay, you're accumulating some costly state somewhere in your system.
(For me, problems with gradually-slowing-down systems have always turned out to be with stats and logging, but I suspect that's because I've rolled my own stats and logging systems)
### Value trace
The trace of the value over a random episode from recent history, plotted together with the rewards. This can be useful if you suspect your value function or rewards of 'being weird' in some way; the value trace should typically be a collection of exponentially-increasing curves leading up to rewards, followed by vertical drops as the agent collects those rewards.
### GPU stats
There are several GPU-related stats that are worth tracking. First are the memory stats, which in PyTorch include
* the *memory allocation*, as reported by `torch.cuda.max_memory_allocated`. This is how much memory has actually been *used* by your computations,
* the *memory reserve*, as reported by `torch.cuda.max_memory_reserved`. This is how much memory PyTorch has *set aside* for your computations,
* the *memory gross*, as reported by `nvidia-smi`. This is how much memory PyTorch is using overall, [including the ~gigabyte it needs for its own kernels](https://github.com/pytorch/pytorch/issues/20532#issuecomment-540628939). It's this figure that'll crash your program if it hits the GPU's memory limit.
Keeping track of all three is useful for diagnosing memory issues: figuring out if it's you that's hanging onto too many tensors, or PyTorch that's being too aggressive with its caching.
If you're running out of memory and you can't immediately figure out why, [memlab](https://github.com/Stonesjtu/pytorch_memlab#memory-profiler) can help a lot. Disclosure: I wrote the frontend.
As well as the memory stats, it's also useful to track the utilization, fan speed and temperature reported by `nvidia-smi`. You can get these values in [machine-readable form](https://github.com/andyljones/megastep/blob/master/rebar/stats/gpu.py#L17-L29).
In particular, if the utilization is persistently low then you should profile your code. Make sure to set `CUDA_LAUNCH_BLOCKING=1` before importing your tensor library, and then use [snakeviz](https://jiffyclub.github.io/snakeviz/) or [tuna](https://github.com/nschloe/tuna) to profile things in a broad way. If that's not enough detail, you can dig into things further with [nsight](https://developer.nvidia.com/nsight-systems).
### Traditional metrics
As well as the above, I also plot some other things out of habit:
* **Reward per trajectory**: should increase dramatically at the start of training. This is, usually, what you care about. Unfortunately it's incredibly noisy and does little to localise errors. Closely related is the **reward per step**, which is typically what you care about in infinite environments.
* **Mean value**: is (if your value network is working well) a less-noisy proxy for the reward per trajectory. If your trajectories are particularly long compared to your reward discount factor however, this can be dramatically different from the reward per trajectory.
* **Policy and value losses**: should fall dramatically at the start of training, then level out.
## Credit
* **kfir.b.y**, for spotting an error in my description of the probe environments.
2021/01/01
@@ -0,0 +1,719 @@
Source: https://cs229.stanford.edu/materials/ML-advice.pdf
Title: CS229 - Advice for Applying Machine Learning (Andrew Ng)
Fetched-via: bash -c 'uvx "markitdown[pdf]" https://cs229.stanford.edu/materials/ML-advice.pdf'
Fetch-status: verbatim
Advice for applying
Machine Learning
Andrew Ng
Stanford University
Andrew Y. Ng
Today's Lecture
• Advice on how to get learning algorithms to work in different applications.
• Most of today's material is not very mathematical. But it's also some of the
hardest material in this class to understand.
• Some of what I'll say today is debatable.
• Some of what I'll say is not good advice for doing novel machine learning
research.
• Key ideas:
1. Diagnostics for debugging learning algorithms.
2. Error analyses and ablative analysis.
3. How to get started on a machine learning problem.
Premature (statistical) optimization.
Debugging Learning
Algorithms
Debugging learning algorithms
Motivating example:
• Anti-spam. You carefully choose a small set of 100 words to use as
features. (Instead of using all 50000+ words in English.)
• Bayesian logistic regression, implemented with gradient descent, gets 20%
test error, which is unacceptably high.
• What to do next?
Fixing the learning algorithm
• Bayesian logistic regression:
• Common approach: Try improving the algorithm in different ways.
Try getting more training examples.
Try a smaller set of features.
Try a larger set of features.
Try changing the features: Email header vs. email body features.
Run gradient descent for more iterations.
Try Newton's method.
Use a different value for λ.
Try using an SVM.
• This approach might work, but it's very time-consuming, and largely a matter
of luck whether you end up fixing what the problem really is.
Diagnostic for bias vs. variance
Better approach:
Run diagnostics to figure out what the problem is.
Fix whatever the problem is.
Bayesian logistic regression's test error is 20% (unacceptably high).
Suppose you suspect the problem is either:
Overfitting (high variance).
Too few features to classify spam (high bias).
Diagnostic:
Variance: Training error will be much lower than test error.
Bias: Training error will also be high.
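The diagnostic can be written down as a small check. This is only a sketch: the `gap_tolerance` threshold and the training-error figures in the example are illustrative assumptions, not values from the lecture (only the 20% test error is).

```python
def diagnose(train_error, test_error, desired_error, gap_tolerance=0.05):
    """Return which of the two suspected problems the error figures point at."""
    findings = []
    if test_error - train_error > gap_tolerance:
        findings.append("high variance")  # training error much lower than test error
    if train_error > desired_error:
        findings.append("high bias")      # even the training error is too high
    return findings

overfit = diagnose(train_error=0.02, test_error=0.20, desired_error=0.05)
underfit = diagnose(train_error=0.18, test_error=0.20, desired_error=0.05)
```

Both findings can come back at once, in which case the model is neither fitting the training set nor generalising from it.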
More on bias vs. variance
Typical learning curve for high variance:
[Figure: learning curve of error against m (training set size), showing the test error and training error curves and the desired performance level.]
• Test error still decreasing as m increases. Suggests larger training set
will help.
• Large gap between training and test error.
More on bias vs. variance
Typical learning curve for high bias:
[Figure: learning curve of error against m (training set size), showing the test error and training error curves and the desired performance level.]
• Even training error is unacceptably high.
• Small gap between training and test error.
Diagnostics tell you what to try next
Bayesian logistic regression, implemented with gradient descent.
Fixes to try:
Try getting more training examples. (Fixes high variance.)
Try a smaller set of features. (Fixes high variance.)
Try a larger set of features. (Fixes high bias.)
Try email header features. (Fixes high bias.)
Run gradient descent for more iterations.
Try Newton's method.
Use a different value for λ.
Try using an SVM.
Optimization algorithm diagnostics
• Bias vs. variance is one common diagnostic.
• For other problems, it's usually up to your own ingenuity to construct your
own diagnostics to figure out what's wrong.
• Another example:
Bayesian logistic regression gets 2% error on spam, and 2% error on non-spam.
(Unacceptably high error on non-spam.)
SVM using a linear kernel gets 10% error on spam, and 0.01% error on non-spam.
(Acceptable performance.)
But you want to use logistic regression, because of computational efficiency, etc.
• What to do next?
More diagnostics
• Other common questions:
Is the algorithm (gradient descent for logistic regression) converging?
[Figure: the objective J(θ) plotted against the number of iterations.]
It's often very hard to tell if an algorithm has converged yet by looking at the objective.
More diagnostics
• Other common questions:
Is the algorithm (gradient descent for logistic regression) converging?
Are you optimizing the right function?
I.e., what you care about:
(weights w(i) higher for non-spam than for spam).
Bayesian logistic regression? Correct value for λ?
SVM? Correct value for C?
Diagnostic
An SVM outperforms Bayesian logistic regression, but you really want to deploy Bayesian
logistic regression for your application.
Let θSVM be the parameters learned by an SVM.
Let θBLR be the parameters learned by Bayesian logistic regression.
You care about weighted accuracy a(θ).
θSVM outperforms θBLR. So: a(θSVM) > a(θBLR).
BLR tries to maximize J(θ).
Diagnostic: is J(θSVM) > J(θBLR)?
Two cases
Case 1: a(θSVM) > a(θBLR) and J(θSVM) > J(θBLR).
But BLR was trying to maximize J(θ). This means that θBLR fails to maximize J, and the
problem is with the convergence of the algorithm. Problem is with optimization
algorithm.
Case 2: a(θSVM) > a(θBLR) and J(θSVM) ≤ J(θBLR).
This means that BLR succeeded at maximizing J(θ). But the SVM, which does worse on
J(θ), actually does better on weighted accuracy a(θ).
This means that J(θ) is the wrong function to be maximizing, if you care about a(θ).
Problem is with objective function of the maximization problem.
Diagnostics tell you what to try next
Bayesian logistic regression, implemented with gradient descent.
Fixes to try:
Try getting more training examples. (Fixes high variance.)
Try a smaller set of features. (Fixes high variance.)
Try a larger set of features. (Fixes high bias.)
Try email header features. (Fixes high bias.)
Run gradient descent for more iterations. (Fixes optimization algorithm.)
Try Newton's method. (Fixes optimization algorithm.)
Use a different value for λ. (Fixes optimization objective.)
Try using an SVM. (Fixes optimization objective.)
The Stanford Autonomous Helicopter
Payload: 14 pounds
Weight: 32 pounds
Machine learning algorithm
1. Build a simulator of helicopter.
2. Choose a cost function. Say J(θ) = ||x − xdesired||² (x = helicopter position)
3. Run reinforcement learning (RL) algorithm to fly helicopter in simulation, so
as to try to minimize cost function:
θRL = arg minθ J(θ)
Suppose you do this, and the resulting controller parameters θRL give much worse
performance than your human pilot. What to do next?
Improve simulator?
Modify cost function J?
Modify RL algorithm?
Debugging an RL algorithm
The controller given by θRL performs poorly.
Suppose that:
1. The helicopter simulator is accurate.
2. The RL algorithm correctly controls the helicopter (in simulation) so as to
minimize J(θ).
3. Minimizing J(θ) corresponds to correct autonomous flight.
Then: The learned parameters θRL should fly well on the actual helicopter.
Diagnostics:
1. If θRL flies well in simulation, but not in real life, then the problem is in the
simulator. Otherwise:
2. Let θhuman be the human control policy. If J(θhuman) < J(θRL), then the problem is
in the reinforcement learning algorithm. (Failing to minimize the cost function J.)
3. If J(θhuman) ≥ J(θRL), then the problem is in the cost function. (Minimizing it
doesn't correspond to good autonomous flight.)
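The three diagnostics form a decision procedure, sketched here under the slide's premise that the learned controller performs poorly on the real helicopter. The function and argument names are hypothetical; the costs are J evaluated for the human pilot's policy and for the learned policy.

```python
def diagnose_rl(flies_well_in_sim, flies_well_in_real_life, cost_human, cost_rl):
    """Return which of the three components the diagnostics point at."""
    if flies_well_in_sim and not flies_well_in_real_life:
        return "simulator"
    if cost_human < cost_rl:
        return "RL algorithm"  # the learner failed to minimise J
    return "cost function"     # J(θhuman) >= J(θRL): low cost is not good flight

sim_gap = diagnose_rl(True, False, cost_human=1.0, cost_rl=2.0)
bad_optimiser = diagnose_rl(False, False, cost_human=1.0, cost_rl=2.0)
bad_objective = diagnose_rl(False, False, cost_human=3.0, cost_rl=2.0)
```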
More on diagnostics
• Quite often, you'll need to come up with your own diagnostics to figure out
what's happening in an algorithm.
• Even if a learning algorithm is working well, you might also run diagnostics to
make sure you understand what's going on. This is useful for:
Understanding your application problem: If you're working on one important ML
application for months/years, it's very valuable for you personally to get an intuitive
understanding of what works and what doesn't work in your problem.
Writing research papers: Diagnostics and error analysis help convey insight about
the problem, and justify your research claims.
I.e., Rather than saying “Here's an algorithm that works,” it's more interesting to
say “Here's an algorithm that works because of component X, and here's my
justification.”
• Good machine learning practice: Error analysis. Try to understand what
your sources of error are.
Error Analysis
Error analysis
Many applications combine many different learning components into a
“pipeline.” E.g., Face recognition from images: [contrived example]
[Figure: pipeline diagram. Camera image → Preprocess (remove background) → Face detection → Eyes segmentation, Nose segmentation, Mouth segmentation → Logistic regression → Label.]
Error analysis
How much error is attributable to each of the components?
Plug in ground-truth for each component, and see how accuracy changes.

Component                        Accuracy
Overall system                   85%
Preprocess (remove background)   85.1%
Face detection                   91%
Eyes segmentation                95%
Nose segmentation                96%
Mouth segmentation               97%
Logistic regression              100%

Conclusion: Most room for improvement in face detection and eyes segmentation.
Ablative analysis
Error analysis tries to explain the difference between current performance and
perfect performance.
Ablative analysis tries to explain the difference between some baseline (much
poorer) performance and current performance.
E.g., Suppose that you've built a good anti-spam classifier by adding lots of
clever features to logistic regression:
Spelling correction.
Sender host features.
Email header features.
Email text parser features.
Javascript parser.
Features from embedded images.
Question: How much did each of these components really help?
Ablative analysis
Simple logistic regression without any clever features gets 94% performance.
Just what accounts for your improvement from 94 to 99.9%?
Ablative analysis: Remove components from your system one at a time, to see
how it breaks.

Component                    Accuracy
Overall system               99.9%
Spelling correction          99.0%
Sender host features         98.9%
Email header features        98.9%
Email text parser features   95%
Javascript parser            94.5%
Features from images         94.0% [baseline]

Conclusion: The email text parser features account for most of the
improvement.
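The conclusion can be checked by differencing the ablation table. This sketch assumes the rows are cumulative, so each accuracy is measured with that component and all the ones listed before it removed; the variable names are made up.

```python
# Accuracies (in percent) copied from the ablative analysis table.
stages = [
    ("Overall system", 99.9),
    ("Spelling correction", 99.0),
    ("Sender host features", 98.9),
    ("Email header features", 98.9),
    ("Email text parser features", 95.0),
    ("Javascript parser", 94.5),
    ("Features from images", 94.0),
]

# Accuracy lost when each component is removed, relative to the previous row.
drops = {
    name: round(prev_acc - acc, 1)
    for (_, prev_acc), (name, acc) in zip(stages, stages[1:])
}
biggest = max(drops, key=drops.get)
```

Removing the email text parser features costs 3.9 points, against less than one point for any other component.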
Getting started on a
learning problem
Getting started on a problem
Approach #1: Careful design.
• Spend a long time designing exactly the right features, collecting the right dataset,
and designing the right algorithmic architecture.
• Implement it and hope it works.
• Benefit: Nicer, perhaps more scalable algorithms. May come up with new, elegant,
learning algorithms; contribute to basic research in machine learning.
Approach #2: Build-and-fix.
• Implement something quick-and-dirty.
• Run error analyses and diagnostics to see what's wrong with it, and fix its errors.
• Benefit: Will often get your application problem working more quickly. Faster time to
market.
Premature statistical optimization
Very often, it's not clear what parts of a system are easy or difficult to build, and
which parts you need to spend lots of time focusing on. E.g.,
[Figure: pipeline diagram. Camera image → Preprocess (remove background) → Face detection → Eyes segmentation, Nose segmentation, Mouth segmentation → Logistic regression → Label.]
This system's much too complicated for a first attempt.
Step 1 of designing a learning system: Plot the data.
The only way to find out what needs work is to implement something quickly,
and find out what parts break.
[But this may be bad advice if your goal is to come up with new machine
learning algorithms.]
The danger of over-theorizing
[Figure: diagram of topics surrounding the goal of a mail delivery robot. Labels: mail delivery robot; obstacle avoidance; robot manipulation; navigation; object detection; color invariance; 3d similarity learning; differential geometry of 3d manifolds; complexity of non-Riemannian geometries; VC dimension; convergence bounds for sampled non-monotonic logic.]
[Based on Papadimitriou, 1995]
Summary
Summary
• Time spent coming up with diagnostics for learning algorithms is time well spent.
• It's often up to your own ingenuity to come up with the right diagnostics.
• Error analyses and ablative analyses also give insight into the problem.
• Two approaches to applying learning algorithms:
Design very carefully, then implement.
• Risk of premature (statistical) optimization.
Build a quick-and-dirty prototype, diagnose, and fix.
@@ -0,0 +1,353 @@
Source: https://cs231n.github.io/neural-networks-3/
Title: CS231n - Neural Networks Part 3: Learning and Evaluation
Fetched-via: uvx markitdown https://cs231n.github.io/neural-networks-3/
Fetch-status: verbatim
[CS231n Deep Learning for Computer Vision](https://cs231n.github.io)
[Course Website](http://cs231n.stanford.edu/)
Table of Contents:
* [Gradient checks](#gradcheck)
* [Sanity checks](#sanitycheck)
* [Babysitting the learning process](#baby)
+ [Loss function](#loss)
+ [Train/val accuracy](#accuracy)
+ [Weights:Updates ratio](#ratio)
+ [Activation/Gradient distributions per layer](#distr)
+ [Visualization](#vis)
* [Parameter updates](#update)
+ [First-order (SGD), momentum, Nesterov momentum](#sgd)
+ [Annealing the learning rate](#anneal)
+ [Second-order methods](#second)
+ [Per-parameter adaptive learning rates (Adagrad, RMSProp)](#ada)
* [Hyperparameter Optimization](#hyper)
* [Evaluation](#eval)
+ [Model Ensembles](#ensemble)
* [Summary](#summary)
* [Additional References](#add)
## Learning
In the previous sections we've discussed the static parts of a Neural Network: how we can set up the network connectivity, the data, and the loss function. This section is devoted to the dynamics, or in other words, the process of learning the parameters and finding good hyperparameters.
### Gradient Checks
In theory, performing a gradient check is as simple as comparing the analytic gradient to the numerical gradient. In practice, the process is much more involved and error prone. Here are some tips, tricks, and issues to watch out for:
**Use the centered formula**. The formula you may have seen for the finite difference approximation when evaluating the numerical gradient looks as follows:
\[\frac{df(x)}{dx} = \frac{f(x + h) - f(x)}{h} \hspace{0.1in} \text{(bad, do not use)}\]
where \(h\) is a very small number, in practice approximately 1e-5 or so. In practice, it turns out that it is much better to use the *centered* difference formula of the form:
\[\frac{df(x)}{dx} = \frac{f(x + h) - f(x - h)}{2h} \hspace{0.1in} \text{(use instead)}\]
This requires you to evaluate the loss function twice to check every single dimension of the gradient (so it is about 2 times as expensive), but the gradient approximation turns out to be much more precise. To see this, you can use Taylor expansion of \(f(x+h)\) and \(f(x-h)\) and verify that the first formula has an error on order of \(O(h)\), while the second formula only has error terms on order of \(O(h^2)\) (i.e. it is a second order approximation).
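To make the difference concrete, here is a small sketch comparing the two formulas on a function with a known derivative. The choice of `sin` at `x = 1.0` is just an illustrative assumption.

```python
import math

def forward_diff(f, x, h):
    """One-sided difference; error is O(h)."""
    return (f(x + h) - f(x)) / h

def centered_diff(f, x, h):
    """Centered difference; error is O(h^2)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x, h = 1.0, 1e-5
exact = math.cos(x)  # the true derivative of sin at x
forward_error = abs(forward_diff(math.sin, x, h) - exact)
centered_error = abs(centered_diff(math.sin, x, h) - exact)
```

With this step size the centered estimate is several orders of magnitude closer to the true derivative than the one-sided estimate.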
**Use relative error for the comparison**. What are the details of comparing the numerical gradient \(f'_n\) and analytic gradient \(f'_a\)? That is, how do we know if the two are not compatible? You might be tempted to keep track of the difference \(\mid f'_a - f'_n \mid \) or its square and define the gradient check as failed if that difference is above a threshold. However, this is problematic. For example, consider the case where their difference is 1e-4. This seems like a very appropriate difference if the two gradients are about 1.0, so we'd consider the two gradients to match. But if the gradients were both on order of 1e-5 or lower, then we'd consider 1e-4 to be a huge difference and likely a failure. Hence, it is always more appropriate to consider the *relative error*:
\[\frac{\mid f'_a - f'_n \mid}{\max(\mid f'_a \mid, \mid f'_n \mid)}\]
which considers the ratio of the difference to the absolute values of both gradients. Notice that normally the relative error formula only includes one of the two terms (either one), but I prefer to max (or add) both to make it symmetric and to prevent dividing by zero in the case where one of the two is zero (which can often happen, especially with ReLUs). However, one must explicitly keep track of the case where both are zero and pass the gradient check in that edge case. In practice:
* relative error > 1e-2 usually means the gradient is probably wrong
* 1e-2 > relative error > 1e-4 should make you feel uncomfortable
* 1e-4 > relative error is usually okay for objectives with kinks. But if there are no kinks (e.g. use of tanh nonlinearities and softmax), then 1e-4 is too high.
* 1e-7 and less you should be happy.
Also keep in mind that the deeper the network, the higher the relative errors will be. So if you are gradient checking the input data for a 10-layer network, a relative error of 1e-2 might be okay because the errors build up on the way. Conversely, an error of 1e-2 for a single differentiable function likely indicates incorrect gradient.
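A minimal sketch of the relative error with the symmetric max denominator and the both-zero edge case handled explicitly. The example values are illustrative only.

```python
def relative_error(analytic, numerical):
    """Symmetric relative error between two gradient estimates."""
    denom = max(abs(analytic), abs(numerical))
    if denom == 0.0:
        return 0.0  # both gradients are exactly zero: pass the check
    return abs(analytic - numerical) / denom

# The same absolute difference of 1e-4 at two very different gradient scales.
at_scale_one = relative_error(1.0, 1.0001)
at_scale_tiny = relative_error(1e-5, 1.1e-4)
```

The first comparison lands around 1e-4, while the second is close to 1 and would clearly fail, even though the absolute difference is the same.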
**Use double precision**. A common pitfall is using single precision floating point to compute the gradient check. It is often the case that you might get high relative errors (as high as 1e-2) even with a correct gradient implementation. In my experience I've sometimes seen my relative errors plummet from 1e-2 to 1e-8 by switching to double precision.
**Stick around active range of floating point**. It's a good idea to read through [“What Every Computer Scientist Should Know About Floating-Point Arithmetic”](http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html), as it may demystify your errors and enable you to write more careful code. For example, in neural nets it can be common to normalize the loss function over the batch. However, if your gradients per datapoint are very small, then *additionally* dividing them by the number of data points is starting to give very small numbers, which in turn will lead to more numerical issues. This is why I like to always print the raw numerical/analytic gradient, and make sure that the numbers you are comparing are not extremely small (e.g. roughly 1e-10 and smaller in absolute value is worrying). If they are you may want to temporarily scale your loss function up by a constant to bring them to a “nicer” range where floats are more dense - ideally on the order of 1.0, where your float exponent is 0.
**Kinks in the objective**. One source of inaccuracy to be aware of during gradient checking is the problem of *kinks*. Kinks refer to non-differentiable parts of an objective function, introduced by functions such as ReLU (\(max(0,x)\)), or the SVM loss, Maxout neurons, etc. Consider gradient checking the ReLU function at \(x = -1e-6\). Since \(x < 0\), the analytic gradient at this point is exactly zero. However, the numerical gradient would suddenly compute a non-zero gradient because \(f(x+h)\) might cross over the kink (e.g. if \(h > 1e-6\)) and introduce a non-zero contribution. You might think that this is a pathological case, but in fact this case can be very common. For example, an SVM for CIFAR-10 contains up to 450,000 \(max(0,x)\) terms because there are 50,000 examples and each example yields 9 terms to the objective. Moreover, a Neural Network with an SVM classifier will contain many more kinks due to ReLUs.
Note that it is possible to know if a kink was crossed in the evaluation of the loss. This can be done by keeping track of the identities of all “winners” in a function of form \(max(x,y)\); that is, whether x or y was higher during the forward pass. If the identity of at least one winner changes between evaluating \(f(x+h)\) and \(f(x-h)\), then a kink was crossed and the numerical gradient will not be exact.
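The winner-tracking idea can be sketched as follows for a sum of \(max(0, z)\) terms. The helper names `relu_winners` and `kink_crossed`, the toy inputs and the step `h = 1e-5` are assumptions chosen for illustration.

```python
import numpy as np

def relu_winners(z):
    # For each max(0, z) term, record whether z (True) or the constant 0 (False) won.
    return z > 0

def kink_crossed(z_plus, z_minus):
    # A kink was crossed if any winner differs between the two evaluations.
    return bool(np.any(relu_winners(z_plus) != relu_winners(z_minus)))

def f(z):
    return np.maximum(0, z).sum()

z = np.array([-1e-6, 0.5, -2.0])  # the first input sits just left of the kink
h = 1e-5

results = []
for i in range(z.size):
    step = np.zeros_like(z)
    step[i] = h
    numeric = (f(z + step) - f(z - step)) / (2 * h)
    analytic = float(z[i] > 0)
    results.append((kink_crossed(z + step, z - step), numeric, analytic))
    print(i, results[-1])
```

For the first coordinate the check reports a crossed kink, and the numerical gradient disagrees with the analytic value of zero even though the analytic gradient is right. The other two coordinates stay on one side of their kinks, so the two gradients agree there.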
**Use only few datapoints**. One fix to the above problem of kinks is to use fewer datapoints, since loss functions that contain kinks (e.g. due to use of ReLUs or margin losses etc.) will have fewer kinks with fewer datapoints, so it is less likely for you to cross one when you perform the finite difference approximation. Moreover, if your gradient check passes for only ~2 or 3 datapoints then it would almost certainly pass for an entire batch. Using very few datapoints also makes your gradient check faster and more efficient.
**Be careful with the step size h**. It is not necessarily the case that smaller is better, because when \(h\) is much smaller, you may start running into numerical precision problems. Sometimes when the gradient doesn't check, it is possible that you change \(h\) to be 1e-4 or 1e-6 and suddenly the gradient will be correct. This [wikipedia article](http://en.wikipedia.org/wiki/Numerical_differentiation) contains a chart that plots the value of **h** on the x-axis and the numerical gradient error on the y-axis.
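A small sweep reproduces the shape of that chart. This is a sketch under the assumption that we check \(f(x) = e^x\) at \(x = 1\) in double precision; the function and the grid of step sizes are arbitrary illustrative choices.

```python
import math

def centered_difference(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.0
exact = math.exp(x)  # d/dx exp(x) = exp(x)

errors = {}  # maps exponent k to the absolute error obtained with h = 10**-k
for k in range(1, 14, 2):
    h = 10.0 ** -k
    errors[k] = abs(centered_difference(math.exp, x, h) - exact)
    print(f"h = 1e-{k:<2d}  abs error = {errors[k]:.3e}")
```

The error first shrinks as `h` decreases (less truncation error) and then grows again once `h` is so small that round-off in `f(x + h) - f(x - h)` dominates.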
**Gradcheck during a “characteristic” mode of operation**. It is important to realize that a gradient check is performed at a particular (and usually random), single point in the space of parameters. Even if the gradient check succeeds at that point, it is not immediately certain that the gradient is correctly implemented globally. Additionally, a random initialization might not be the most “characteristic” point in the space of parameters and may in fact introduce pathological situations where the gradient seems to be correctly implemented but isn't. For instance, an SVM with very small weight initialization will assign almost exactly zero scores to all datapoints and the gradients will exhibit a particular pattern across all datapoints. An incorrect implementation of the gradient could still produce this pattern and not generalize to a more characteristic mode of operation where some scores are larger than others. Therefore, to be safe it is best to use a short **burn-in** time during which the network is allowed to learn and perform the gradient check after the loss starts to go down. The danger of performing it at the first iteration is that this could introduce pathological edge cases and mask an incorrect implementation of the gradient.
**Don't let the regularization overwhelm the data**. It is often the case that a loss function is a sum of the data loss and the regularization loss (e.g. L2 penalty on weights). One danger to be aware of is that the regularization loss may overwhelm the data loss, in which case the gradients will be primarily coming from the regularization term (which usually has a much simpler gradient expression). This can mask an incorrect implementation of the data loss gradient. Therefore, it is recommended to turn off regularization and check the data loss alone first, and then the regularization term second and independently. One way to perform the latter is to hack the code to remove the data loss contribution. Another way is to increase the regularization strength so as to ensure that its effect is non-negligible in the gradient check, and that an incorrect implementation would be spotted.
**Remember to turn off dropout/augmentations**. When performing gradient check, remember to turn off any non-deterministic effects in the network, such as dropout, random data augmentations, etc. Otherwise these can clearly introduce huge errors when estimating the numerical gradient. The downside of turning off these effects is that you wouldn't be gradient checking them (e.g. it might be that dropout isn't backpropagated correctly). Therefore, a better solution might be to force a particular random seed before evaluating both \(f(x+h)\) and \(f(x-h)\), and when evaluating the analytic gradient.
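The fixed-seed approach can be sketched as follows. The toy loss (a sum of squares after dropout), the scaled mask and the function name `dropout_loss` are assumptions made for illustration. The point is only that every evaluation re-creates the generator from the same seed and therefore sees the same mask.

```python
import numpy as np

def dropout_loss(x, p_keep, seed):
    # Re-seeding on every call makes the dropout mask identical across calls.
    rng = np.random.default_rng(seed)
    mask = (rng.random(x.shape) < p_keep) / p_keep  # keep with prob. p_keep, then rescale
    loss = np.sum((x * mask) ** 2)
    grad = 2 * x * mask ** 2                        # analytic gradient of the loss
    return loss, grad

x = np.array([0.3, -1.2, 0.7, 2.0, -0.5])
p_keep, seed, h = 0.5, 0, 1e-5

_, analytic = dropout_loss(x, p_keep, seed)
numeric = np.zeros_like(x)
for i in range(x.size):
    step = np.zeros_like(x)
    step[i] = h
    loss_plus, _ = dropout_loss(x + step, p_keep, seed)
    loss_minus, _ = dropout_loss(x - step, p_keep, seed)
    numeric[i] = (loss_plus - loss_minus) / (2 * h)

max_abs_diff = float(np.max(np.abs(numeric - analytic)))
print(max_abs_diff)
```

If the seed were allowed to differ between the three evaluations, the masks would differ and the comparison would be meaningless.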
**Check only few dimensions**. In practice the gradients can have millions of parameters. In these cases it is only practical to check some of the dimensions of the gradient and assume that the others are correct. **Be careful**: make sure to gradient check a few dimensions for every separate parameter. In some applications, people combine the parameters into a single large parameter vector for convenience. In these cases, for example, the biases could only take up a tiny number of parameters from the whole vector, so it is important to not sample at random but to take this into account and check that all parameters receive the correct gradients.
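One way to guarantee that every parameter is covered is to sample indices inside each parameter array separately, rather than from one concatenated vector. A minimal sketch, assuming the parameters live in a dict of NumPy arrays (the function name, the dict layout and the shapes are hypothetical):

```python
import numpy as np

def sample_check_indices(params, num_per_param=3, seed=0):
    # Sample indices separately inside each parameter array, so that small
    # arrays (e.g. biases) are always covered.
    rng = np.random.default_rng(seed)
    picks = {}
    for name, value in params.items():
        k = min(num_per_param, value.size)
        flat = rng.choice(value.size, size=k, replace=False)
        picks[name] = [tuple(int(j) for j in np.unravel_index(i, value.shape))
                       for i in flat]
    return picks

params = {"W": np.zeros((3072, 10)), "b": np.zeros(10)}
picks = sample_check_indices(params)
print(picks)
```

Sampling three positions uniformly from the concatenated vector here would almost never land on a bias, since the biases make up 10 of 30,730 entries.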
### Before learning: sanity checks Tips/Tricks
Here are a few sanity checks you might consider running before you plunge into expensive optimization:
* **Look for correct loss at chance performance.** Make sure you're getting the loss you expect when you initialize with small parameters. It's best to first check the data loss alone (so set regularization strength to zero). For example, for CIFAR-10 with a Softmax classifier we would expect the initial loss to be 2.302, because we expect a diffuse probability of 0.1 for each class (since there are 10 classes), and Softmax loss is the negative log probability of the correct class so: -ln(0.1) = 2.302. For the Weston Watkins SVM, we expect all desired margins to be violated (since all scores are approximately zero), and hence expect a loss of 9 (since margin is 1 for each wrong class). If you're not seeing these losses there might be an issue with initialization.
* As a second sanity check, increasing the regularization strength should increase the loss.
* **Overfit a tiny subset of data**. Lastly and most importantly, before training on the full dataset try to train on a tiny portion (e.g. 20 examples) of your data and make sure you can achieve zero cost. For this experiment it's also best to set regularization to zero, otherwise this can prevent you from getting zero cost. Unless you pass this sanity check with a small dataset it is not worth proceeding to the full dataset. Note that it may happen that you can overfit a very small dataset but still have an incorrect implementation. For instance, if your datapoints' features are random due to some bug, then it will be possible to overfit your small training set but you will never notice any generalization when you fold in your full dataset.
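The two chance-level loss values quoted in the first check can be reproduced in a few lines. This sketch assumes all class scores are exactly zero at initialization and picks an arbitrary correct class `y`; both are simplifications for illustration.

```python
import numpy as np

num_classes = 10
scores = np.zeros(num_classes)  # small initial weights give scores of about zero
y = 3                           # index of the correct class (arbitrary)

# Softmax loss: negative log probability of the correct class
probs = np.exp(scores - scores.max())
probs /= probs.sum()
softmax_loss = float(-np.log(probs[y]))

# Weston Watkins SVM loss with margin 1: sum of violated margins over wrong classes
margins = np.maximum(0, scores - scores[y] + 1.0)
margins[y] = 0
svm_loss = float(margins.sum())

print(softmax_loss, svm_loss)
```

If a freshly initialized model reports something far from these numbers with regularization switched off, that points to the initialization or the loss code.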
### Babysitting the learning process
There are multiple useful quantities you should monitor during training of a neural network. These plots are the window into the training process and should be utilized to get intuitions about different hyperparameter settings and how they should be changed for more efficient learning.
The x-axis of the plots below is always in units of epochs, which measure how many times every example has been seen during training in expectation (e.g. one epoch means that every example has been seen once). It is preferable to track epochs rather than iterations since the number of iterations depends on the arbitrary setting of batch size.
#### Loss function
The first quantity that is useful to track during training is the loss, as it is evaluated on the individual batches during the forward pass. Below is a cartoon diagram showing the loss over time, and especially what the shape might tell you about the learning rate:
![](/assets/nn3/learningrates.jpeg)
![](/assets/nn3/loss.jpeg)
**Left:** A cartoon depicting the effects of different learning rates. With low learning rates the improvements will be linear. With high learning rates they will start to look more exponential. Higher learning rates will decay the loss faster, but they get stuck at worse values of loss (green line). This is because there is too much "energy" in the optimization and the parameters are bouncing around chaotically, unable to settle in a nice spot in the optimization landscape. **Right:** An example of a typical loss function over time, while training a small network on CIFAR-10 dataset. This loss function looks reasonable (it might indicate a slightly too small learning rate based on its speed of decay, but it's hard to say), and also indicates that the batch size might be a little too low (since the cost is a little too noisy).
The amount of “wiggle” in the loss is related to the batch size. When the batch size is 1, the wiggle will be relatively high. When the batch size is the full dataset, the wiggle will be minimal because every gradient update should be improving the loss function monotonically (unless the learning rate is set too high).
Some people prefer to plot their loss functions in the log domain. Since learning progress generally takes an exponential shape, the plot appears as a slightly more interpretable straight line, rather than a hockey stick. Additionally, if multiple cross-validated models are plotted on the same loss graph, the differences between them become more apparent.
Sometimes loss functions can look funny: [lossfunctions.tumblr.com](http://lossfunctions.tumblr.com/).
#### Train/Val accuracy
The second important quantity to track while training a classifier is the validation/training accuracy. This plot can give you valuable insights into the amount of overfitting in your model:
![](/assets/nn3/accuracies.jpeg)
The gap between the training and validation accuracy indicates the amount of overfitting. Two possible cases are shown in the diagram on the left. The blue validation curve shows much lower validation accuracy than the training accuracy, indicating strong overfitting (note, it's possible for the validation accuracy to even start to go down after some point). When you see this in practice you probably want to increase regularization (stronger L2 weight penalty, more dropout, etc.) or collect more data. The other possible case is when the validation accuracy tracks the training accuracy fairly well. This case indicates that your model capacity is not high enough: make the model larger by increasing the number of parameters.
#### Ratio of weights:updates
The last quantity you might want to track is the ratio of the update magnitudes to the value magnitudes. Note: *updates*, not the raw gradients (e.g. in vanilla sgd this would be the gradient multiplied by the learning rate). You might want to evaluate and track this ratio for every set of parameters independently. A rough heuristic is that this ratio should be somewhere around 1e-3. If it is lower than this then the learning rate might be too low. If it is higher then the learning rate is likely too high. Here is a specific example:
```
import numpy as np
# assume parameter vector W, its gradient vector dW, and a scalar learning_rate
param_scale = np.linalg.norm(W.ravel())
update = -learning_rate*dW # simple SGD update
update_scale = np.linalg.norm(update.ravel())
W += update # the actual update
print(update_scale / param_scale) # want ~1e-3
```
Instead of tracking the min or the max, some people prefer to compute and track the norm of the gradients and their updates. These metrics are usually correlated and often give approximately the same results.
#### Activation / Gradient distributions per layer
An incorrect initialization can slow down or even completely stall the learning process. Luckily, this issue can be diagnosed relatively easily. One way to do so is to plot activation/gradient histograms for all layers of the network. Intuitively, it is not a good sign to see any strange distributions - e.g. with tanh neurons we would like to see a distribution of neuron activations across the full range of [-1,1], instead of seeing all neurons outputting zero, or all neurons being completely saturated at either -1 or 1.
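Summary statistics can stand in for the histograms when a quick check is enough. The sketch below pushes random data through a stack of tanh layers and reports the standard deviation and the saturated fraction of the last layer's activations. The layer width, depth, random inputs and the three weight scales (including `1/sqrt(width)`) are assumptions chosen to show the two failure modes next to a healthier case.

```python
import numpy as np

def tanh_layer_stats(weight_scale, num_layers=5, width=500, batch=200, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((batch, width))  # stand-in for input data
    stats = []
    for _ in range(num_layers):
        W = weight_scale * rng.standard_normal((width, width))
        a = np.tanh(a @ W)
        saturated = float(np.mean(np.abs(a) > 0.99))  # fraction stuck near -1 or 1
        stats.append((float(a.std()), saturated))
    return stats

for scale in (0.01, 1.0 / np.sqrt(500), 1.0):
    std, saturated = tanh_layer_stats(scale)[-1]
    print(f"scale={scale:.4f}  last-layer std={std:.4f}  saturated={saturated:.2f}")
```

With the smallest scale the activations collapse towards zero, with the largest almost all units saturate, and the intermediate scale keeps activations spread over the range.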
#### First-layer Visualizations
Lastly, when one is working with image pixels it can be helpful and satisfying to plot the first-layer features visually:
![](/assets/nn3/weights.jpeg)
![](/assets/nn3/cnnweights.jpg)
Examples of visualized weights for the first layer of a neural network. **Left**: Noisy features could be a symptom of an unconverged network, an improperly set learning rate, or a very low weight regularization penalty. **Right:** Nice, smooth, clean and diverse features are a good indication that the training is proceeding well.
### Parameter updates
Once the analytic gradient is computed with backpropagation, the gradients are used to perform a parameter update. There are several approaches for performing the update, which we discuss next.
We note that optimization for deep networks is currently a very active area of research. In this section we highlight some established and common techniques you may see in practice, briefly describe their intuition, but leave a detailed analysis outside of the scope of the class. We provide some further pointers for an interested reader.
#### SGD and bells and whistles
**Vanilla update**. The simplest form of update is to change the parameters along the negative gradient direction (since the gradient indicates the direction of increase, but we usually wish to minimize a loss function). Assuming a vector of parameters `x` and the gradient `dx`, the simplest update has the form:
```
# Vanilla update
x += - learning_rate * dx
```
where `learning_rate` is a hyperparameter - a fixed constant. When evaluated on the full dataset, and when the learning rate is low enough, this is guaranteed to make non-negative progress on the loss function.
**Momentum update** is another approach that almost always enjoys better convergence rates on deep networks. This update can be motivated from a physical perspective of the optimization problem. In particular, the loss can be interpreted as the height of a hilly terrain (and therefore also as the potential energy, since \(U = mgh\) and therefore \( U \propto h \) ). Initializing the parameters with random numbers is equivalent to setting a particle with zero initial velocity at some location. The optimization process can then be seen as equivalent to the process of simulating the parameter vector (i.e. a particle) as rolling on the landscape.
Since the force on the particle is related to the gradient of potential energy (i.e. \(F = - \nabla U \) ), the **force** felt by the particle is precisely the (negative) **gradient** of the loss function. Moreover, \(F = ma \) so the (negative) gradient is in this view proportional to the acceleration of the particle. Note that this is different from the SGD update shown above, where the gradient directly integrates the position. Instead, the physics view suggests an update in which the gradient only directly influences the velocity, which in turn has an effect on the position:
```
# Momentum update
v = mu * v - learning_rate * dx # integrate velocity
x += v # integrate position
```
Here we see an introduction of a `v` variable that is initialized at zero, and an additional hyperparameter (`mu`). As an unfortunate misnomer, this variable is in optimization referred to as *momentum* (its typical value is about 0.9), but its physical meaning is more consistent with the coefficient of friction. Effectively, this variable damps the velocity and reduces the kinetic energy of the system, or otherwise the particle would never come to a stop at the bottom of a hill. When cross-validated, this parameter is usually set to values such as [0.5, 0.9, 0.95, 0.99]. Similar to annealing schedules for learning rates (discussed later, below), optimization can sometimes benefit a little from momentum schedules, where the momentum is increased in later stages of learning. A typical setting is to start with momentum of about 0.5 and anneal it to 0.99 or so over multiple epochs.
> With Momentum update, the parameter vector will build up velocity in any direction that has consistent gradient.
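To see the velocity build-up in action, here is a small sketch comparing the vanilla and momentum updates on a toy quadratic with one shallow and one steep direction. The objective, starting point, learning rate and step count are illustrative assumptions, not recommended settings.

```python
import numpy as np

curvature = np.array([1.0, 100.0])  # f(x) = 0.5 * sum(curvature * x**2)

def loss(x):
    return 0.5 * np.sum(curvature * x ** 2)

def grad(x):
    return curvature * x

learning_rate, mu, num_steps = 0.01, 0.9, 200

x_sgd = np.array([1.0, 1.0])
x_mom = np.array([1.0, 1.0])
v = np.zeros(2)
for _ in range(num_steps):
    x_sgd += -learning_rate * grad(x_sgd)     # vanilla update
    v = mu * v - learning_rate * grad(x_mom)  # integrate velocity
    x_mom += v                                # integrate position

print("vanilla SGD loss:", loss(x_sgd))
print("momentum loss:   ", loss(x_mom))
```

The learning rate is capped by the steep direction, so the vanilla update crawls along the shallow one. Momentum accumulates velocity there because the gradient keeps pointing the same way.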
**Nesterov Momentum** is a slightly different version of the momentum update that has recently been gaining popularity. It enjoys stronger theoretical convergence guarantees for convex functions and in practice it also consistently works slightly better than standard momentum.
The core idea behind Nesterov momentum is that when the current parameter vector is at some position `x`, then looking at the momentum update above, we know that the momentum term alone (i.e. ignoring the second term with the gradient) is about to nudge the parameter vector by `mu * v`. Therefore, if we are about to compute the gradient, we can treat the future approximate position `x + mu * v` as a “lookahead” - this is a point in the vicinity of where we are soon going to end up. Hence, it makes sense to compute the gradient at `x + mu * v` instead of at the “old/stale” position `x`.
![](/assets/nn3/nesterov.jpeg)
Nesterov momentum. Instead of evaluating gradient at the current position (red circle), we know that our momentum is about to carry us to the tip of the green arrow. With Nesterov momentum we therefore instead evaluate the gradient at this "looked-ahead" position.
That is, in a slightly awkward notation, we would like to do the following:
```
x_ahead = x + mu * v
# evaluate dx_ahead (the gradient at x_ahead instead of at x)
v = mu * v - learning_rate * dx_ahead
x += v
```
However, in practice people prefer to express the update to look as similar to vanilla SGD or to the previous momentum update as possible. This is possible to achieve by manipulating the update above with a variable transform `x_ahead = x + mu * v`, and then expressing the update in terms of `x_ahead` instead of `x`. That is, the parameter vector we are actually storing is always the ahead version. The equations in terms of `x_ahead` (but renaming it back to `x`) then become:
```
v_prev = v # back this up
v = mu * v - learning_rate * dx # velocity update stays the same
x += -mu * v_prev + (1 + mu) * v # position update changes form
```
We recommend this further reading to understand the source of these equations and the mathematical formulation of Nesterov's Accelerated Gradient (NAG):
* [Advances in optimizing Recurrent Networks](http://arxiv.org/pdf/1212.0901v2.pdf) by Yoshua Bengio, Section 3.5.
* [Ilya Sutskever's thesis](http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf) (pdf) contains a longer exposition of the topic in section 7.2
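The change of variables behind the two Nesterov forms above can also be checked numerically. The sketch below runs both on a toy quadratic (an assumption, as are the step size and starting point) and verifies that the vector stored by the second form always equals the first form's lookahead point `x + mu * v`.

```python
import numpy as np

def grad(x):
    # gradient of the toy objective f(x) = 0.5 * (x[0]**2 + 10 * x[1]**2)
    return np.array([1.0, 10.0]) * x

learning_rate, mu = 0.05, 0.9

# Form 1: store x and evaluate the gradient at the lookahead point.
x1, v1 = np.array([1.0, -2.0]), np.zeros(2)
# Form 2: store the lookahead point itself (called x2 here).
x2, v2 = np.array([1.0, -2.0]), np.zeros(2)

max_gap = 0.0
for _ in range(50):
    x_ahead = x1 + mu * v1
    v1 = mu * v1 - learning_rate * grad(x_ahead)
    x1 += v1

    v_prev = v2                                   # back this up
    v2 = mu * v2 - learning_rate * grad(x2)       # velocity update stays the same
    x2 += -mu * v_prev + (1 + mu) * v2            # position update changes form

    # Form 2's stored vector should match Form 1's next lookahead point.
    max_gap = max(max_gap, float(np.max(np.abs(x2 - (x1 + mu * v1)))))

print(max_gap)
```

The gap stays at round-off level, which confirms that the two forms trace the same trajectory and differ only in which point is stored.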
#### Annealing the learning rate
In training deep networks, it is usually helpful to anneal the learning rate over time. Good intuition to have in mind is that with a high learning rate, the system contains too much kinetic energy and the parameter vector bounces around chaotically, unable to settle down into deeper, but narrower parts of the loss function. Knowing when to decay the learning rate can be tricky: Decay it slowly and you'll be wasting computation bouncing around chaotically with little improvement for a long time. But decay it too aggressively and the system will cool too quickly, unable to reach the best position it can. There are three common ways of implementing the learning rate decay:
* **Step decay**: Reduce the learning rate by some factor every few epochs. Typical values might be reducing the learning rate by a half every 5 epochs, or by 0.1 every 20 epochs. These numbers depend heavily on the type of problem and the model. One heuristic you may see in practice is to watch the validation error while training with a fixed learning rate, and reduce the learning rate by a constant (e.g. 0.5) whenever the validation error stops improving.
* **Exponential decay** has the mathematical form \(\alpha = \alpha_0 e^{-k t}\), where \(\alpha_0, k\) are hyperparameters and \(t\) is the iteration number (but you can also use units of epochs).
* **1/t decay** has the mathematical form \(\alpha = \alpha_0 / (1 + k t )\) where \(\alpha_0, k\) are hyperparameters and \(t\) is the iteration number.
In practice, we find that the step decay is slightly preferable because the hyperparameters it involves (the fraction of decay and the step timings in units of epochs) are more interpretable than the hyperparameter \(k\). Lastly, if you can afford the computational budget, err on the side of slower decay and train for a longer time.
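The three schedules can be written as plain functions of the epoch or iteration count. This is a minimal sketch: the function names and the default arguments (halving every 5 epochs) are illustrative assumptions taken from the example values in the text.

```python
import math

def step_decay(alpha0, epoch, drop=0.5, epochs_per_drop=5):
    # multiply the learning rate by `drop` once every `epochs_per_drop` epochs
    return alpha0 * drop ** (epoch // epochs_per_drop)

def exponential_decay(alpha0, t, k):
    # alpha = alpha0 * exp(-k * t)
    return alpha0 * math.exp(-k * t)

def inverse_time_decay(alpha0, t, k):
    # alpha = alpha0 / (1 + k * t)
    return alpha0 / (1 + k * t)

for epoch in (0, 5, 10, 20):
    print(epoch,
          step_decay(0.1, epoch),
          exponential_decay(0.1, epoch, k=0.1),
          inverse_time_decay(0.1, epoch, k=0.1))
```

Written this way, the step schedule is controlled by two quantities with direct meaning (how much, how often), while the other two are governed by the single rate `k`.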
#### Second order methods
A second, popular group of methods for optimization in the context of deep learning is based on [Newton's method](http://en.wikipedia.org/wiki/Newton%27s_method_in_optimization), which iterates the following update:
\[x \leftarrow x - [H f(x)]^{-1} \nabla f(x)\]
Here, \(H f(x)\) is the [Hessian matrix](http://en.wikipedia.org/wiki/Hessian_matrix), which is a square matrix of second-order partial derivatives of the function. The term \(\nabla f(x)\) is the gradient vector, as seen in Gradient Descent. Intuitively, the Hessian describes the local curvature of the loss function, which allows us to perform a more efficient update. In particular, multiplying by the inverse Hessian leads the optimization to take more aggressive steps in directions of shallow curvature and shorter steps in directions of steep curvature. Note, crucially, the absence of any learning rate hyperparameters in the update formula, which the proponents of these methods cite as a large advantage over first-order methods.
However, the update above is impractical for most deep learning applications because computing (and inverting) the Hessian in its explicit form is a very costly process in both space and time. For instance, a Neural Network with one million parameters would have a Hessian matrix of size [1,000,000 x 1,000,000], occupying approximately 3725 gigabytes of RAM. Hence, a large variety of *quasi-Newton* methods have been developed that seek to approximate the inverse Hessian. Among these, the most popular is [L-BFGS](http://en.wikipedia.org/wiki/Limited-memory_BFGS), which uses the information in the gradients over time to form the approximation implicitly (i.e. the full matrix is never computed).
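The memory figure can be recovered with a few lines of arithmetic, assuming each Hessian entry is a 4-byte single-precision float and a gigabyte is counted as \(2^{30}\) bytes (both assumptions; the text does not state them):

```python
num_params = 1_000_000
bytes_per_entry = 4                      # assumption: single-precision floats
hessian_bytes = num_params ** 2 * bytes_per_entry
hessian_gib = hessian_bytes / 2 ** 30    # assumption: 1 gigabyte = 2**30 bytes
print(round(hessian_gib))
```

The cost grows quadratically with the number of parameters, so a network ten times larger would need a hundred times more memory.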
However, even after we eliminate the memory concerns, a large downside of a naive application of L-BFGS is that it must be computed over the entire training set, which could contain millions of examples. Unlike mini-batch SGD, getting L-BFGS to work on mini-batches is more tricky and an active area of research.
**In practice**, it is currently not common to see L-BFGS or similar second-order methods applied to large-scale Deep Learning and Convolutional Neural Networks. Instead, SGD variants based on (Nesterov's) momentum are more standard because they are simpler and scale more easily.
Additional references:
* [Large Scale Distributed Deep Networks](http://research.google.com/archive/large_deep_networks_nips2012.html) is a paper from the Google Brain team, comparing L-BFGS and SGD variants in large-scale distributed optimization.
* [SFO](http://arxiv.org/abs/1311.2115) algorithm strives to combine the advantages of SGD with advantages of L-BFGS.
#### Per-parameter adaptive learning rate methods
All the approaches we've discussed so far manipulated the learning rate globally and equally for all parameters. Tuning the learning rates is an expensive process, so much work has gone into devising methods that can adaptively tune the learning rates, and even do so per parameter. Many of these methods may still require other hyperparameter settings, but the argument is that they are well-behaved for a broader range of hyperparameter values than the raw learning rate. In this section we highlight some common adaptive methods you may encounter in practice:
**Adagrad** is an adaptive learning rate method originally proposed by [Duchi et al.](http://jmlr.org/papers/v12/duchi11a.html).
```
# Assume the gradient dx and parameter vector x
cache += dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)
```
Notice that the variable `cache` has size equal to the size of the gradient, and keeps track of per-parameter sum of squared gradients. This is then used to normalize the parameter update step, element-wise. Notice that the weights that receive high gradients will have their effective learning rate reduced, while weights that receive small or infrequent updates will have their effective learning rate increased. Amusingly, the square root operation turns out to be very important and without it the algorithm performs much worse. The smoothing term `eps` (usually set somewhere in the range from 1e-4 to 1e-8) avoids division by zero. A downside of Adagrad is that in the case of Deep Learning, the monotonically decreasing learning rate usually proves too aggressive and stops learning too early.
**RMSprop.** RMSprop is a very effective, but currently unpublished adaptive learning rate method. Amusingly, everyone who uses this method in their work currently cites [slide 29 of Lecture 6](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) of Geoff Hinton's Coursera class. The RMSProp update adjusts the Adagrad method in a very simple way in an attempt to reduce its aggressive, monotonically decreasing learning rate. In particular, it uses a moving average of squared gradients instead, giving:
```
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)
```
Here, `decay_rate` is a hyperparameter and typical values are [0.9, 0.99, 0.999]. Notice that the `x+=` update is identical to Adagrad, but the `cache` variable is “leaky”. Hence, RMSProp still modulates the learning rate of each weight based on the magnitudes of its gradients, which has a beneficial equalizing effect, but unlike Adagrad the updates do not get monotonically smaller.
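The difference between the accumulating and the leaky cache is easiest to see with a constant gradient. The sketch below feeds the same gradient to both rules for 1000 steps and records the size of each update. The constant gradient and the hyperparameter values are assumptions for illustration only.

```python
import math

learning_rate, decay_rate, eps = 0.01, 0.99, 1e-8
dx = 1.0  # pretend the gradient is constant

adagrad_cache, rmsprop_cache = 0.0, 0.0
adagrad_steps, rmsprop_steps = [], []
for t in range(1000):
    adagrad_cache += dx ** 2
    adagrad_steps.append(learning_rate * dx / (math.sqrt(adagrad_cache) + eps))

    rmsprop_cache = decay_rate * rmsprop_cache + (1 - decay_rate) * dx ** 2
    rmsprop_steps.append(learning_rate * dx / (math.sqrt(rmsprop_cache) + eps))

print("Adagrad step size, first vs last:", adagrad_steps[0], adagrad_steps[-1])
print("RMSProp step size, first vs last:", rmsprop_steps[0], rmsprop_steps[-1])
```

Adagrad's step keeps shrinking like `1/sqrt(t)`, while RMSProp's settles near `learning_rate`. The first few RMSProp steps are inflated because `cache` starts at zero.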
**Adam.** [Adam](http://arxiv.org/abs/1412.6980) is a recently proposed update that looks a bit like RMSProp with momentum. The (simplified) update looks as follows:
```
m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
x += - learning_rate * m / (np.sqrt(v) + eps)
```
Notice that the update looks exactly like the RMSProp update, except the “smooth” version of the gradient `m` is used instead of the raw (and perhaps noisy) gradient vector `dx`. Recommended values in the paper are `eps = 1e-8`, `beta1 = 0.9`, `beta2 = 0.999`. In practice Adam is currently recommended as the default algorithm to use, and often works slightly better than RMSProp. However, it is often also worth trying SGD+Nesterov Momentum as an alternative. The full Adam update also includes a *bias correction* mechanism, which compensates for the fact that in the first few time steps the vectors `m,v` are both initialized at zero and therefore biased towards zero, before they fully “warm up”. With the *bias correction* mechanism, the update looks as follows:
```
# t is your iteration counter going from 1 to infinity
m = beta1*m + (1-beta1)*dx
mt = m / (1-beta1**t)
v = beta2*v + (1-beta2)*(dx**2)
vt = v / (1-beta2**t)
x += - learning_rate * mt / (np.sqrt(vt) + eps)
```
Note that the update is now a function of the iteration as well as the other parameters.
We refer the reader to the paper for the details, or the course slides where this is expanded on.
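To get a feel for what the bias correction does, it helps to work through the very first step by hand. The sketch below uses a scalar gradient of 0.5 (an arbitrary assumption) with the recommended hyperparameters and compares the simplified and the bias-corrected step, each expressed as a multiple of `learning_rate`.

```python
import math

learning_rate, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
dx = 0.5            # gradient seen at the very first step
m, v, t = 0.0, 0.0, 1

m = beta1 * m + (1 - beta1) * dx
v = beta2 * v + (1 - beta2) * dx ** 2

# simplified update: uses the zero-biased m and v directly
step_simplified = -learning_rate * m / (math.sqrt(v) + eps)

# full update: rescale m and v to undo the bias from the zero initialization
mt = m / (1 - beta1 ** t)
vt = v / (1 - beta2 ** t)
step_corrected = -learning_rate * mt / (math.sqrt(vt) + eps)

print(step_simplified / learning_rate)  # roughly -3.16
print(step_corrected / learning_rate)   # roughly -1.0
```

At `t = 1` the corrected estimates reduce to `mt = dx` and `vt = dx**2`, so the first step has magnitude close to `learning_rate` regardless of the gradient's scale. Without the correction `v` is underestimated more strongly than `m`, and the first step comes out about three times larger.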
Additional References:
* [Unit Tests for Stochastic Optimization](http://arxiv.org/abs/1312.6055) proposes a series of tests as a standardized benchmark for stochastic optimization.
![](/assets/nn3/opt2.gif)
![](/assets/nn3/opt1.gif)
Animations that may help your intuitions about the learning process dynamics. **Left:** Contours of a loss surface and time evolution of different optimization algorithms. Notice the "overshooting" behavior of momentum-based methods, which make the optimization look like a ball rolling down the hill. **Right:** A visualization of a saddle point in the optimization landscape, where the curvature along different dimension has different signs (one dimension curves up and another down). Notice that SGD has a very hard time breaking symmetry and gets stuck on the top. Conversely, algorithms such as RMSprop will see very low gradients in the saddle direction. Due to the denominator term in the RMSprop update, this will increase the effective learning rate along this direction, helping RMSProp proceed. Images credit: [Alec Radford](https://twitter.com/alecrad).
### Hyperparameter optimization
As we've seen, training Neural Networks can involve many hyperparameter settings. The most common hyperparameters in context of Neural Networks include:
* the initial learning rate
* learning rate decay schedule (such as the decay constant)
* regularization strength (L2 penalty, dropout strength)
But as we saw, there are many more relatively less sensitive hyperparameters, for example in per-parameter adaptive learning methods, the setting of momentum and its schedule, etc. In this section we describe some additional tips and tricks for performing the hyperparameter search:
**Implementation**. Larger Neural Networks typically require a long time to train, so performing hyperparameter search can take many days/weeks. It is important to keep this in mind since it influences the design of your code base. One particular design is to have a **worker** that continuously samples random hyperparameters and performs the optimization. During the training, the worker will keep track of the validation performance after every epoch, and writes a model checkpoint (together with miscellaneous training statistics such as the loss over time) to a file, preferably on a shared file system. It is useful to include the validation performance directly in the filename, so that it is simple to inspect and sort the progress. Then there is a second program which we will call a **master**, which launches or kills workers across a computing cluster, and may additionally inspect the checkpoints written by workers and plot their training statistics, etc.
**Prefer one validation fold to cross-validation**. In most cases a single validation set of respectable size substantially simplifies the code base, without the need for cross-validation with multiple folds. You'll hear people say they “cross-validated” a parameter, but many times it is assumed that they still only used a single validation set.
**Hyperparameter ranges**. Search for hyperparameters on log scale. For example, a typical sampling of the learning rate would look as follows: `learning_rate = 10 ** uniform(-6, 1)`. That is, we are generating a random number from a uniform distribution, but then raising 10 to the power of that number. The same strategy should be used for the regularization strength. Intuitively, this is because learning rate and regularization strength have multiplicative effects on the training dynamics. For example, a fixed change of adding 0.01 to a learning rate has huge effects on the dynamics if the learning rate is 0.001, but nearly no effect if the learning rate is 10. This is because the learning rate multiplies the computed gradient in the update. Therefore, it is much more natural to consider a range of learning rates multiplied or divided by some value, than a range of learning rates with some value added or subtracted. Some parameters (e.g. dropout) are instead usually searched in the original scale (e.g. `dropout = uniform(0,1)`).
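The practical difference between the two sampling schemes shows up when counting how many samples land on small learning rates. A minimal sketch using the standard library (the seed, the sample count and the 1e-3 threshold are arbitrary assumptions):

```python
import random

rng = random.Random(0)
num_samples = 10_000

# log-scale sampling: uniform in the exponent
log_uniform = [10 ** rng.uniform(-6, 1) for _ in range(num_samples)]
# linear-scale sampling over the same interval, for comparison
linear_uniform = [rng.uniform(1e-6, 10) for _ in range(num_samples)]

def fraction_below(samples, threshold):
    return sum(s < threshold for s in samples) / len(samples)

frac_log = fraction_below(log_uniform, 1e-3)
frac_linear = fraction_below(linear_uniform, 1e-3)
print("log-scale samples below 1e-3:   ", frac_log)
print("linear-scale samples below 1e-3:", frac_linear)
```

Log-scale sampling spends about three sevenths of its budget below 1e-3 (three of the seven decades), whereas linear sampling almost never tries such values.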
**Prefer random search to grid search**. As argued by Bergstra and Bengio in [Random Search for Hyper-Parameter Optimization](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf), “randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid”. As it turns out, this is also usually easier to implement.
![](/assets/nn3/gridsearchbad.jpeg)
Core illustration from [Random Search for Hyper-Parameter Optimization](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf) by Bergstra and Bengio. It is very often the case that some of the hyperparameters matter much more than others (e.g. top hyperparam vs. left one in this figure). Performing random search rather than grid search allows you to much more precisely discover good values for the important ones.
**Careful with best values on border**. Sometimes it can happen that you're searching for a hyperparameter (e.g. learning rate) in a bad range. For example, suppose we use `learning_rate = 10 ** uniform(-6, 1)`. Once we receive the results, it is important to double check that the final learning rate is not at the edge of this interval, or otherwise you may be missing a better hyperparameter setting beyond the interval.
**Stage your search from coarse to fine**. In practice, it can be helpful to first search in coarse ranges (e.g. 10 \*\* [-6, 1]), and then depending on where the best results are turning up, narrow the range. Also, it can be helpful to perform the initial coarse search while only training for 1 epoch or even less, because many hyperparameter settings can lead the model to not learn at all, or immediately explode with infinite cost. The second stage could then perform a narrower search with 5 epochs, and the last stage could perform a detailed search in the final range for many more epochs (for example).
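The staging described above can be sketched as two rounds of random search. Everything named here is hypothetical: `toy_validation_loss` is a made-up stand-in for "train for a few epochs and report validation loss", and the window of one decade either side of the coarse winner is just one reasonable choice.

```python
import random


def toy_validation_loss(log10_lr, epochs):
    """Hypothetical stand-in for a real training run.
    Minimum at log10_lr = -3; more epochs shrink a constant offset."""
    return (log10_lr + 3.0) ** 2 + 1.0 / epochs


def random_search(low_exp, high_exp, n_trials, epochs, rng):
    """Return (best_loss, best_exponent) over log-uniform random trials."""
    trials = []
    for _ in range(n_trials):
        exp = rng.uniform(low_exp, high_exp)
        trials.append((toy_validation_loss(exp, epochs), exp))
    return min(trials)


rng = random.Random(0)

# Stage 1: wide range, only 1 epoch per trial.
_, coarse_best = random_search(-6.0, 1.0, n_trials=20, epochs=1, rng=rng)

# Stage 2: one decade either side of the coarse winner, 5 epochs per trial.
_, fine_best = random_search(coarse_best - 1.0, coarse_best + 1.0,
                             n_trials=20, epochs=5, rng=rng)
```

In a real search the cheap first stage mainly serves to discard settings that do not learn at all or that diverge, so the expensive later stages are spent in a plausible region.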
**Bayesian Hyperparameter Optimization** is a whole area of research devoted to coming up with algorithms that try to more efficiently navigate the space of hyperparameters. The core idea is to appropriately balance the exploration-exploitation trade-off when querying the performance at different hyperparameters. Multiple libraries have been developed based on these models as well; some of the better known ones are [Spearmint](https://github.com/JasperSnoek/spearmint), [SMAC](http://www.cs.ubc.ca/labs/beta/Projects/SMAC/), and [Hyperopt](http://jaberg.github.io/hyperopt/). However, in practical settings with ConvNets it is still relatively difficult to beat random search in carefully chosen intervals. See some additional from-the-trenches discussion [here](http://nlpers.blogspot.com/2014/10/hyperparameter-search-bayesian.html).
## Evaluation
### Model Ensembles
In practice, one reliable approach to improving the performance of Neural Networks by a few percent is to train multiple independent models, and at test time average their predictions. As the number of models in the ensemble increases, the performance typically monotonically improves (though with diminishing returns). Moreover, the improvements are more dramatic with higher model variety in the ensemble. There are a few approaches to forming an ensemble:
* **Same model, different initializations**. Use cross-validation to determine the best hyperparameters, then train multiple models with the best set of hyperparameters but with different random initialization. The danger with this approach is that the variety is only due to initialization.
* **Top models discovered during cross-validation**. Use cross-validation to determine the best hyperparameters, then pick the top few (e.g. 10) models to form the ensemble. This improves the variety of the ensemble but has the danger of including suboptimal models. In practice, this can be easier to perform since it doesn't require additional retraining of models after cross-validation.
* **Different checkpoints of a single model**. If training is very expensive, some people have had limited success in taking different checkpoints of a single network over time (for example after every epoch) and using those to form an ensemble. Clearly, this suffers from some lack of variety, but can still work reasonably well in practice. The advantage of this approach is that it is very cheap.
* **Running average of parameters during training**. Related to the last point, a cheap way of almost always getting an extra percent or two of performance is to maintain a second copy of the network's weights in memory that maintains an exponentially decaying sum of previous weights during training. This way you're averaging the state of the network over the last several iterations. You will find that this “smoothed” version of the weights over the last few steps almost always achieves better validation error. The rough intuition to have in mind is that the objective is bowl-shaped and your network is jumping around the mode, so the average has a higher chance of being somewhere nearer the mode.
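The running-average idea in the last bullet can be sketched with plain lists standing in for weight tensors. This sketch assumes the common normalized form of the exponentially decaying average (`decay * shadow + (1 - decay) * weights`); the toy trajectory and the decay value of 0.9 are made up for illustration.

```python
def ema_update(shadow, weights, decay=0.99):
    """Return decay * shadow + (1 - decay) * weights, element by element."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]


# Toy trajectory: weights bouncing around an optimum at [1.0, -2.0],
# overshooting by 0.5 in alternating directions on every step.
trajectory = [[1.0 + (-1) ** t * 0.5, -2.0 - (-1) ** t * 0.5]
              for t in range(200)]

shadow = list(trajectory[0])  # start the average at the first weights
for weights in trajectory[1:]:
    shadow = ema_update(shadow, weights, decay=0.9)

last_weights = trajectory[-1]  # still 0.5 away from the optimum
```

Here the raw weights never get closer than 0.5 to the optimum, while the averaged copy settles within a few hundredths of it. At evaluation time the averaged copy would be used in place of the raw weights.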
One disadvantage of model ensembles is that they take longer to evaluate on test examples. An interested reader may find the recent work from Geoff Hinton on [“Dark Knowledge”](https://www.youtube.com/watch?v=EK61htlw8hY) inspiring, where the idea is to “distill” a good ensemble back to a single model by incorporating the ensemble log likelihoods into a modified objective.
## Summary
To train a Neural Network:
* Gradient check your implementation with a small batch of data and be aware of the pitfalls.
* As a sanity check, make sure your initial loss is reasonable, and that you can achieve 100% training accuracy on a very small portion of the data
* During training, monitor the loss, the training/validation accuracy, and if you're feeling fancier, the magnitude of updates in relation to parameter values (it should be ~1e-3), and when dealing with ConvNets, the first-layer weights.
* The two recommended updates to use are either SGD+Nesterov Momentum or Adam.
* Decay your learning rate over the period of the training. For example, halve the learning rate after a fixed number of epochs, or whenever the validation accuracy tops off.
* Search for good hyperparameters with random search (not grid search). Stage your search from coarse (wide hyperparameter ranges, training only for 1-5 epochs), to fine (narrower ranges, training for many more epochs)
* Form model ensembles for extra performance
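The update-magnitude check mentioned in the list above can be sketched as follows, assuming a vanilla SGD update (`-learning_rate * gradient`) and using flat lists in place of parameter tensors. The example weights and gradients are made up.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values))


def update_ratio(weights, grads, learning_rate):
    """Norm of the vanilla-SGD update divided by the norm of the weights."""
    updates = [-learning_rate * g for g in grads]
    return l2_norm(updates) / l2_norm(weights)


weights = [0.5, -1.2, 0.8, 2.0]
grads = [0.1, -0.3, 0.2, 0.4]
ratio = update_ratio(weights, grads, learning_rate=1e-2)  # roughly 2e-3
```

A ratio far below 1e-3 suggests the learning rate may be too low, and one far above it suggests it may be too high. With momentum or Adam, the ratio should be computed from the actual parameter change rather than from the raw gradient.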
## Additional References
* [SGD](http://research.microsoft.com/pubs/192769/tricks-2012.pdf) tips and tricks from Leon Bottou
* [Efficient BackProp](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf) (pdf) from Yann LeCun
* [Practical Recommendations for Gradient-Based Training of Deep
Architectures](http://arxiv.org/pdf/1206.5533v2.pdf) from Yoshua Bengio
@@ -0,0 +1,786 @@
Source: https://fullstackdeeplearning.com/spring2021/lecture-7/
Title: FSDL Spring 2021 - Lecture 7: Troubleshooting Deep Neural Networks
Fetched-via: uvx markitdown https://fullstackdeeplearning.com/spring2021/lecture-7/
Fetch-status: verbatim
Table of contents
* [Video](#video)
* [Slides](#slides)
* [Notes](#notes)
+ [1 - Why Is Deep Learning Troubleshooting Hard?](#1-why-is-deep-learning-troubleshooting-hard)
+ [2 - Strategy to Debug Neural Networks](#2-strategy-to-debug-neural-networks)
+ [3 - Start Simple](#3-start-simple)
- [Choose A Simple Architecture](#choose-a-simple-architecture)
- [Use Sensible Defaults](#use-sensible-defaults)
- [Normalize Inputs](#normalize-inputs)
- [Simplify The Problem](#simplify-the-problem)
+ [4 - Implement and Debug](#4-implement-and-debug)
- [Get Your Model To Run](#get-your-model-to-run)
- [Overfit A Single Batch](#overfit-a-single-batch)
- [Compare To A Known Result](#compare-to-a-known-result)
+ [5 - Evaluate](#5-evaluate)
- [Bias-Variance Decomposition](#bias-variance-decomposition)
- [Distribution Shift](#distribution-shift)
+ [6 - Improve Model and Data](#6-improve-model-and-data)
- [Step 1: Address Underfitting](#step-1-address-underfitting)
- [Step 2: Address Overfitting](#step-2-address-overfitting)
- [Step 3: Address Distribution Shift](#step-3-address-distribution-shift)
* [Error Analysis](#error-analysis)
* [Domain Adaptation](#domain-adaptation)
- [Step 4: Rebalance datasets](#step-4-rebalance-datasets)
+ [7 - Tune Hyperparameters](#7-tune-hyperparameters)
- [Techniques for Tuning Hyperparameter Optimization](#techniques-for-tuning-hyperparameter-optimization)
+ [8 - Conclusion](#8-conclusion)
# Lecture 7: Troubleshooting Deep Neural Networks
## Video
## Slides
[Download slides as PDF](https://drive.google.com/file/d/1yXQCnGGp3wWdoCf6nSP5b758cXF92rtg/view?usp=sharing)
## Notes
*Lecture by [Josh Tobin](http://josh-tobin.com).
Notes transcribed by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).*
In traditional software engineering, a bug usually leads to the program
crashing. While this is annoying for the user, it is critical for the
developer to inspect the errors to understand why. With deep learning,
we sometimes encounter errors, but all too often, the program crashes
without a clear reason why. While these issues can be debugged manually,
deep learning models most often fail because of poor output predictions.
What's worse is that when the model performance is low, there is usually
no signal about why or when the models failed.
A common sentiment among practitioners is that they spend **80-90% of
time debugging and tuning the models** and only 10-20% of time deriving
math equations and implementing things. This is confirmed by Andrej
Karpathy, [as seen in this
tweet](https://twitter.com/karpathy/status/423990618289733632).
### 1 - Why Is Deep Learning Troubleshooting Hard?
Suppose you are trying to reproduce a research paper result for your
work, but your results are worse. You might wonder why your model's
performance is significantly worse than that of the paper you're trying
to reproduce.
![](/spring2021/lecture-7-notes-media/image3.png)
Many different things can cause this:
* It can be **implementation bugs**. Most bugs in deep learning are
actually invisible.
* **Hyper-parameter choices** can also cause your performance to
degrade. Deep learning models are very sensitive to
hyper-parameters. Even very subtle choices of learning rate and
weight initialization can make a big difference.
* Performance can also be worse just because of **data/model fit**.
For example, you pre-train your model on ImageNet data and fit it
on self-driving car images, which are harder to learn.
* Finally, poor model performance could be caused not by your model
but your **dataset construction**. Typical issues here include not
having enough examples, dealing with noisy labels and imbalanced
classes, splitting train and test set with different
distributions.
### 2 - Strategy to Debug Neural Networks
The key idea of deep learning troubleshooting is: *Since it is hard to
disambiguate errors, it's best to start simple and gradually ramp up
complexity.*
This lecture provides **a decision tree for debugging deep learning
models and improving performance**. This guide assumes that you already
have an initial test dataset, a single metric to improve, and target
performance based on human-level performance, published results,
previous baselines, etc.
![](/spring2021/lecture-7-notes-media/image4.png)
### 3 - Start Simple
The first step in the troubleshooting workflow is **starting simple**.
#### Choose A Simple Architecture
There are a few things to consider when you want to start simple. The
first is how to **choose a simple architecture**. These are
architectures that are easy to implement and are likely to get you part
of the way towards solving your problem without introducing as many
bugs.
Architecture selection is one of the many intimidating parts of getting
into deep learning because there are tons of papers coming out
all the time, each claiming to be state-of-the-art on some problem. They
get very complicated fast. In the limit, if you're trying to get to
maximal performance, then architecture selection is challenging. But
when starting on a new problem, you can just follow a simple set of rules
that will allow you to pick an architecture that enables you to do a
decent job on the problem you're working on.
* If your data looks like **images**, start with a LeNet-like
architecture and consider using something like ResNet as your
codebase gets more mature.
* If your data looks like **sequences**, start with an LSTM with one
hidden layer and/or temporal/classical convolutions. Then, when
your problem gets more mature, you can move to an Attention-based
model or a WaveNet-like model.
* For **all other tasks**, start with a fully-connected neural network
with one hidden layer and use more advanced networks later,
depending on the problem.
![](/spring2021/lecture-7-notes-media/image7.png)
In reality, the input data often contains several of the types
above. So how do we feed **multiple input modalities** into a neural
network? Here is the 3-step strategy that we recommend:
* First, map each of these modalities into a lower-dimensional feature
space. In the example above, the images are passed through a
ConvNet, and the words are passed through an LSTM.
* Then we flatten the outputs of those networks to get a single vector
for each of the inputs that will go into the model. Then we
concatenate those inputs.
* Finally, we pass them through some fully-connected layers to an
output.
#### Use Sensible Defaults
After choosing a simple architecture, the next thing to do is to
**select sensible hyper-parameter defaults** to start with. Here are the
defaults that we recommend:
* [Adam optimizer with a “magic” learning rate value of
3e-4](https://twitter.com/karpathy/status/801621764144971776?lang=en).
* [ReLU](https://stats.stackexchange.com/questions/226923/why-do-we-use-relu-in-neural-networks-and-how-do-we-use-it)
activation for fully-connected and convolutional models and
[Tanh](https://stats.stackexchange.com/questions/330559/why-is-tanh-almost-always-better-than-sigmoid-as-an-activation-function)
activation for LSTM models.
* [He initialization for ReLU activation function and Glorot
initialization for Tanh activation
function](https://datascience.stackexchange.com/questions/13061/when-to-use-he-or-glorot-normal-initialization-over-uniform-init-and-what-are).
* No regularization and no data normalization.
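To make the initialization defaults above concrete, here is a minimal sketch of the standard deviations used by the normal variants of He and Glorot initialization. The function names, layer sizes, and the nested-list "weight matrix" are illustrative assumptions; in practice you would call the equivalent initializer in your framework.

```python
import math
import random


def he_normal_std(fan_in):
    """Std of He (Kaiming) normal init, suited to ReLU: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)


def glorot_normal_std(fan_in, fan_out):
    """Std of Glorot (Xavier) normal init, suited to tanh:
    sqrt(2 / (fan_in + fan_out))."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def init_matrix(fan_in, fan_out, std, rng):
    """fan_in x fan_out matrix of zero-mean Gaussian weights."""
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]


ADAM_LEARNING_RATE = 3e-4  # the "magic" default quoted above

rng = random.Random(0)
relu_layer = init_matrix(256, 128, he_normal_std(256), rng)
tanh_layer = init_matrix(256, 128, glorot_normal_std(256, 128), rng)
```

Both schemes scale the weights so that activations keep a roughly constant variance from layer to layer. They differ only in which fan counts enter the formula.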
#### Normalize Inputs
The next step is to **normalize the input data**, subtracting the mean
and dividing by the standard deviation. Note that for images, it's fine
to scale values to [0, 1] or [-0.5, 0.5] (for example, by dividing by 255).
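A minimal sketch of both normalization options, assuming the conventional form of standardization (dividing by the standard deviation so that each feature ends up with unit variance). The helper names and sample values are made up.

```python
import statistics


def standardize(column):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(x - mean) / std for x in column]


def scale_pixels(pixels):
    """Map 8-bit pixel intensities from [0, 255] to [0, 1]."""
    return [p / 255.0 for p in pixels]


feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, std 2
z = standardize(feature)          # zero mean, unit variance
pixels = scale_pixels([0, 51, 255])
```

The mean and standard deviation should be computed on the training set only and then reused unchanged for validation and test data. A constant feature (standard deviation of zero) needs special handling that this sketch omits.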
#### Simplify The Problem
The final thing you should do is consider **simplifying the problem**
itself. If you have a complicated problem with massive data and tons of
classes to deal with, then you should consider:
* Working with a small training set of around 10,000 examples.
* Using a fixed number of objects, classes, input size, etc.
* Creating a simpler synthetic training set like in research labs.
This is important because (1) you will have reasonable confidence that
your model should be able to solve the problem, and (2) your iteration
speed will increase.
The diagram below neatly summarizes how to start simple:
![](/spring2021/lecture-7-notes-media/image6.png)
### 4 - Implement and Debug
To give you a preview, below are the five most common bugs in deep
learning models that we recognize:
* **Incorrect shapes for the network tensors**: This bug is a common
one and can fail silently. This happens many times because the
automatic differentiation systems in the deep learning framework
do silent broadcasting. Tensors become different shapes in the
network and can cause a lot of problems.
* **Pre-processing inputs incorrectly**: For example, you forget to
normalize your inputs or apply too much input pre-processing
(over-normalization and excessive data augmentation).
* **Incorrect input to the model's loss function**: For example, you
pass softmax outputs to a loss that expects logits.
* **Forgot to set up train mode for the network correctly**: For
example, toggling train/evaluation mode or controlling batch norm
dependencies.
* **Numerical instability**: For example, you get `inf` or `NaN`
as outputs. This bug often stems from using an exponent, a log, or
a division operation somewhere in the code.
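Two of the bugs in the list above (feeding softmax outputs to a loss that expects logits, and numerical instability from exponents and logs) can be illustrated together. This is a sketch in plain Python with hypothetical function names; framework losses that take logits implement the same log-sum-exp trick internally.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax using the log-sum-exp trick."""
    m = max(logits)  # subtracting the max keeps every exponent <= 0
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


def cross_entropy_from_logits(logits, target_index):
    """Cross-entropy for one example; expects raw logits, not probabilities."""
    return -log_softmax(logits)[target_index]


# A naive math.exp(1000.0) would raise OverflowError; the stable form is fine.
logits = [1000.0, 0.0, -1000.0]
loss_correct = cross_entropy_from_logits(logits, target_index=0)

# The bug: apply softmax first, then pass the result to a loss that
# applies softmax again internally.
probs = [math.exp(v) for v in log_softmax(logits)]
loss_double_softmax = cross_entropy_from_logits(probs, target_index=0)
```

With the correct input, the loss for this perfectly confident prediction is zero. With the double softmax it is stuck near 0.55 however confident the model is, which shows up in training as a loss that plateaus above zero.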
Here are three pieces of general advice for implementing your model:
* **Start with a lightweight implementation**. You want minimum
possible new lines of code for the 1st version of your model. The
rule of thumb is less than 200 lines. This doesn't count tested
infrastructure components or TensorFlow/PyTorch code.
* **Use off-the-shelf components** such as Keras if possible, since
most of the stuff in Keras works well out-of-the-box. If you have
to use TensorFlow, use the built-in functions, don't do the math
yourself. This would help you avoid a lot of numerical instability
issues.
* **Build complicated data pipelines later**. These are important for
large-scale ML systems, but you should not start with them because
data pipelines themselves can be a big source of bugs. Just start
with a dataset that you can load into memory.
![](/spring2021/lecture-7-notes-media/image11.png)
#### Get Your Model To Run
The first step of implementing bug-free deep learning models is
**getting your model to run at all**. There are a few things that can
prevent this from happening:
* **Shape mismatch/casting issue**: To address this type of problem,
you should step through your model creation and inference
step-by-step in a debugger, checking for correct shapes and data
types of your tensors.
* **Out-of-memory issues**: This can be very difficult to debug. You
can scale back your memory-intensive operations one-by-one. For
example, if you create large matrices anywhere in your code, you
can reduce the size of their dimensions or cut your batch size in
half.
* **Other issues**: You can simply Google it. Stack Overflow would be
great most of the time.
Let's zoom in on the process of stepping through model creation in a
debugger and talk about **debuggers for deep learning code**:
* In PyTorch, you can use
[ipdb](https://pypi.org/project/ipdb/) — which exports
functions to access the interactive
[IPython](http://ipython.org/) debugger.
* In TensorFlow, it's trickier. TensorFlow separates the process of
creating the graph and executing operations in the graph. There
are three options you can try: (1) step through the graph creation
itself and inspect each tensor layer, (2) step into the training
loop and evaluate the tensor layers, or (3) use [TensorFlow
Debugger](https://mullikine.github.io/posts/tensorflow-debugger-tfdb-and-emacs/)
(tfdbg), which does options 1 and 2 automatically.
![](/spring2021/lecture-7-notes-media/image14.png)
#### Overfit A Single Batch
After getting your model to run, the next thing you need to do is to
**overfit a single batch of data**. This is a heuristic that can catch
an absurd number of bugs. This really means that you want to drive your
training error arbitrarily close to 0.
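As a toy illustration of this check, the sketch below fits a one-variable linear model to a single four-example batch with plain gradient descent and records the loss. The model, data, learning rate, and step count are all made up; with a real network you would reuse your actual training step on one fixed batch.

```python
def mse_and_grads(w, b, batch):
    """Mean squared error of y = w*x + b on the batch, plus dL/dw and dL/db."""
    n = len(batch)
    loss = grad_w = grad_b = 0.0
    for x, y in batch:
        err = (w * x + b) - y
        loss += err * err / n
        grad_w += 2.0 * err * x / n
        grad_b += 2.0 * err / n
    return loss, grad_w, grad_b


# One fixed batch, generated exactly by y = 2x + 1, reused on every step.
batch = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w, b, lr = 0.0, 0.0, 0.05
losses = []
for step in range(2000):
    loss, grad_w, grad_b = mse_and_grads(w, b, batch)
    losses.append(loss)
    w -= lr * grad_w  # gradient descent: step against the gradient
    b -= lr * grad_b
```

If `losses` does not fall towards zero on a batch the model can represent exactly, something in the model, loss, or update step is wrong. Flipping the sign in the update (`w += lr * grad_w`) is an easy way to see the "error goes up" failure.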
There are a few things that can happen when you try to overfit a single
batch and it fails:
* **Error goes up**: Commonly, this is due to a flipped sign somewhere in
the loss function/gradient.
* **Error explodes**: This is usually a numerical issue but can also
be caused by a high learning rate.
* **Error oscillates**: You can lower the learning rate and inspect
the data for shuffled labels or incorrect data augmentation.
* **Error plateaus**: You can increase the learning rate and get rid
of regularization. Then you can inspect the loss function and the data
pipeline for correctness.
![](/spring2021/lecture-7-notes-media/image10.png)
#### Compare To A Known Result
Once your model overfits a single batch, there can still be some
other issues that cause bugs. The last step here is to **compare your
results to a known result**. So what sort of known results are useful?
* The most useful results come from **an official model implementation
evaluated on a similar dataset to yours**. You can step through
the code in both models line-by-line and ensure your model has the
same output. You want to ensure that your model performance is up
to par with expectations.
* If you can't find an official implementation on a similar dataset,
you can compare your approach to results from **an official model
implementation evaluated on a benchmark dataset**. You most
definitely want to walk through the code line-by-line and ensure
you have the same output.
* If there is no official implementation of your approach, you can
compare it to results from **an unofficial model implementation**.
You can review the code the same as before but with lower
confidence (because almost all the unofficial implementations on
GitHub have bugs).
* Then, you can compare to results from **a paper with no code** (to
ensure that your performance is up to par with expectations),
results from **your model on a benchmark dataset** (to make sure
your model performs well in a simpler setting), and results from
**a similar model on a similar dataset** (to help you get a
general sense of what kind of performance can be expected).
* An under-rated source of results comes from **simple baselines**
(for example, the average of outputs or linear regression), which
can help make sure that your model is learning anything at all.
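The simple-baseline comparison in the last bullet can be sketched for a regression task as follows. The data and the `model_mse` value are hypothetical placeholders; the point is only that a trained model should clearly beat "always predict the training mean".

```python
import statistics


def mean_baseline_mse(train_targets, val_targets):
    """Validation MSE of always predicting the mean of the training targets."""
    prediction = statistics.fmean(train_targets)
    squared_errors = [(y - prediction) ** 2 for y in val_targets]
    return statistics.fmean(squared_errors)


train_targets = [1.0, 2.0, 3.0, 4.0]   # made-up data
val_targets = [2.0, 3.0]
baseline_mse = mean_baseline_mse(train_targets, val_targets)

model_mse = 0.40  # hypothetical validation MSE reported by a trained model
model_beats_baseline = model_mse < baseline_mse
```

In this made-up case the model loses to the constant predictor, which would suggest that it has not learned anything useful from the inputs yet. For classification, the analogous baseline is always predicting the most frequent class.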
The diagram below neatly summarizes how to implement and debug deep
neural networks:
![](/spring2021/lecture-7-notes-media/image8.png)
### 5 - Evaluate
#### Bias-Variance Decomposition
To evaluate models and prioritize the next steps in model development,
we will apply the bias-variance decomposition. The [bias-variance
decomposition](http://scott.fortmann-roe.com/docs/BiasVariance.html)
is the fundamental model fitting tradeoff. In our application, let's
talk more specifically about the formula for bias-variance tradeoff with
respect to the **test error**; this will help us apply the concept more
directly to our model's performance. There are four terms in the formula
for test error:
*Test error = irreducible error + bias + variance + validation
overfitting*
1. **Irreducible error** is the baseline error you don't expect your
model to do better than. It can be estimated through strong baselines,
like human performance.
2. **Avoidable bias**, a measure of underfitting, is the difference
between our train error and irreducible error.
3. **Variance**, a measure of overfitting, is the difference between
validation error and training error.
4. **Validation set overfitting** is the difference between test error
and validation error.
Consider the chart of learning curves and errors below. Using the test
error formula for bias and variance, we can calculate each component of
test error and make decisions based on the value. For example, our
avoidable bias is rather low (only 2 points), while the variance is much
higher (5 points). With this knowledge, we should prioritize methods of
preventing overfitting, like regularization.
![](/spring2021/lecture-7-notes-media/image12.png)
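The four-term decomposition can be computed directly from the measured errors. The numbers below are made up so that they reproduce the 2-point bias and 5-point variance of the example; they are not read off the chart.

```python
def decompose_test_error(irreducible, train, val, test):
    """Split the test error (in percentage points) into the four terms."""
    return {
        "irreducible error": irreducible,
        "avoidable bias": train - irreducible,
        "variance": val - train,
        "validation overfitting": test - val,
    }


# Hypothetical error rates, in percentage points.
terms = decompose_test_error(irreducible=25, train=27, val=32, test=34)

# The largest reducible term indicates what to work on next.
reducible = {k: v for k, v in terms.items() if k != "irreducible error"}
largest_term = max(reducible, key=reducible.get)
```

By construction the four terms sum to the test error. Here the largest reducible term is the variance, so overfitting would be addressed first.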
#### Distribution Shift
Clearly, the application of the bias-variance decomposition to the test
error has already helped prioritize our next steps for model
development. However, until now, we've assumed that the samples
(training, validation, testing) all come from the same distribution.
What if this isn't the case? In practical ML situations, this
**distribution shift** often occurs. In building self-driving cars, a
frequent occurrence might be training with samples from one distribution
(e.g., daytime driving video) but testing or inferring on samples from a
totally different distribution (e.g., night time driving).
A simple way of handling this wrinkle in our assumption is to create two
validation sets: one from the training distribution and one from the
test distribution. This can be helpful even with a very small testing
set. If we apply this, we can actually estimate our distribution shift,
which is the difference between the test-validation error and the
train-validation error. This is really useful for practical applications
of ML! With this new term, let's update our test error formula of bias
and variance:
*Test error = irreducible error + bias + variance + distribution shift +
validation overfitting*
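A sketch of the five-term version follows. It assumes that distribution shift is measured as test-validation error minus train-validation error, and that variance and validation overfitting are measured against the train-validation and test-validation sets respectively, so that the terms sum to the test error. All numbers are made up.

```python
def decompose_with_shift(irreducible, train, train_val, test_val, test):
    """Split the test error into five terms using two validation sets."""
    return {
        "irreducible error": irreducible,
        "avoidable bias": train - irreducible,
        "variance": train_val - train,
        "distribution shift": test_val - train_val,
        "validation overfitting": test - test_val,
    }


# Hypothetical error rates, in percentage points.
shift_terms = decompose_with_shift(
    irreducible=25, train=27, train_val=32, test_val=35, test=36
)
```

In this made-up case, 3 of the 36 points of test error are attributable to the gap between the training and test distributions.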
### 6 - Improve Model and Data
Using the updated formula from the last section, we'll be able to decide
on and prioritize the right next steps for each iteration of a model. In
particular, we'll follow a specific process (shown below).
![](/spring2021/lecture-7-notes-media/image1.png)
#### Step 1: Address Underfitting
We'll start by addressing underfitting (i.e., reducing bias). The first
thing to try in this case is to make your model bigger (e.g., add
layers, more units per layer). Next, consider reducing regularization,
which can prevent a tight fit to your data. Other options are error
analysis, choosing a different model architecture (e.g., something more
state of the art), tuning hyperparameters, or adding features. Some notes:
* Choosing different architectures, especially a SOTA one, can be very
helpful but is also risky. Bugs are easily introduced in the
implementation process.
* Adding features is uncommon in the deep learning paradigm (vs.
traditional machine learning). We usually want the network to
learn features of its own accord. If all else fails, it can be
beneficial in a practical setting.
![](/spring2021/lecture-7-notes-media/image13.png)
#### Step 2: Address Overfitting
After addressing underfitting, move on to solving overfitting.
Similarly, there's a recommended series of methods to try in order.
Starting with collecting training data (if possible) is the soundest way
to address overfitting, though it can be challenging in certain
applications. Next, tactical improvements like normalization, data
augmentation, and regularization can help. Following these steps,
traditional defaults like tuning hyperparameters, choosing a different
architecture, or error analysis are useful. Finally, if overfitting is
rather intractable, there's a series of less recommended steps, such as
early stopping, removing features, and reducing model size. Early
stopping is a personal choice; the fast.ai community is a strong
proponent.
![](/spring2021/lecture-7-notes-media/image15.png)
#### Step 3: Address Distribution Shift
After addressing underfitting and overfitting, if there's a difference
between the error on our training validation set vs. our test validation
set, we need to address the error caused by the distribution shift. This
is a harder problem to solve, so there's less in our toolkit to apply.
Start by looking manually at the errors in the test-validation set.
Compare the potential logic behind these errors to the performance in
the train-validation set, and use the errors to guide further data
collection. Essentially, reason about why your model may be suffering
from distribution shift error. This is the most principled way to deal
with distribution shift, though it's the most challenging way
practically. If collecting more data to address these errors isn't
possible, try synthesizing data. Additionally, you can try [domain
adaptation](https://ece.engin.umich.edu/wp-content/uploads/2019/09/4142.pdf).
![](/spring2021/lecture-7-notes-media/image9.png)
##### Error Analysis
Manually evaluating errors to understand model performance is generally
a high-yield way of figuring out how to improve the model.
Systematically performing this **error analysis** process and
decomposing the error from different error types can help prioritize
model improvements. For example, in a self-driving car use case with
error types like hard-to-see pedestrians, reflections, and nighttime
scenes, decomposing the error contribution of each and where it occurs
(train-val vs. test-val) can give rise to a clear set of prioritized
action items. See the table for an example of how this error analysis
can be effectively structured.
![](/spring2021/lecture-7-notes-media/image5.png)
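One way to tabulate this kind of error analysis is sketched below. The error types mirror the self-driving example above, but the counts are made-up placeholders and `error_breakdown` is a hypothetical helper, not something from the lecture.

```python
from collections import Counter


def error_breakdown(tagged_errors, n_examples):
    """Return {error_type: fraction of all examples}, largest contribution first.

    tagged_errors: one manually assigned error-type label per misclassified example.
    n_examples: total number of examples in the split that was reviewed.
    """
    counts = Counter(tagged_errors)
    rates = {k: v / n_examples for k, v in counts.items()}
    return dict(sorted(rates.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical labels from manually reviewing 1000 test-val examples.
test_val_errors = (
    ["hard-to-see pedestrian"] * 10 + ["reflection"] * 15 + ["nighttime scene"] * 60
)
breakdown = error_breakdown(test_val_errors, n_examples=1000)
for error_type, rate in breakdown.items():
    print(f"{error_type:25s} {rate:.1%} of examples")
```

Running the same tally on the train-val split and comparing the two breakdowns shows which error types are specific to the test distribution.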
##### Domain Adaptation
Domain adaptation is a class of techniques that train on a “source”
distribution and generalize to another “target” using only unlabeled
data or limited labeled data. You should use domain adaptation when
access to labeled data from the test distribution is limited, but access
to relatively similar data is plentiful.
There are a few different types of domain adaptation:
1. **Supervised domain adaptation**: In this case, we have limited data
from the target domain to adapt to. Some example applications of
the concept include fine-tuning a pre-trained model or adding
target data to a training set.
2. **Unsupervised domain adaptation**: In this case, we have lots of
unlabeled data from the target domain. Some techniques you might
see are CORAL, domain confusion, and CycleGAN.
Practically speaking, supervised domain adaptation can work really well!
Unsupervised domain adaptation has a little bit further to go.
#### Step 4: Rebalance datasets
If the test-validation set performance starts to look considerably
better than the test performance, you may have overfit the validation
set. This commonly occurs with small validation sets or lots of
hyperparameter tuning. If this occurs, resample the validation set
from the test distribution and get a fresh estimate of the performance.
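The order of Steps 1 to 4 can be read off a handful of error numbers. The following is a minimal sketch, assuming you have error rates for the train, train-val, test-val, and test splits plus a target error; the function name and the example numbers are hypothetical.

```python
def error_gaps(goal, train, train_val, test_val, test):
    """Split the test error into the gaps that Steps 1-4 each try to close."""
    return {
        "underfitting (train - goal)": train - goal,
        "overfitting (train_val - train)": train_val - train,
        "distribution shift (test_val - train_val)": test_val - train_val,
        "val overfitting (test - test_val)": test - test_val,
    }


# Made-up error rates, for illustration only.
gaps = error_gaps(goal=0.01, train=0.02, train_val=0.05, test_val=0.11, test=0.12)
largest = max(gaps, key=gaps.get)
print(largest)
```

The gaps sum to the distance between the test error and the goal, so the largest one is a reasonable first candidate to work on.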
### 7 - Tune Hyperparameters
One of the core challenges in hyperparameter optimization is very basic:
**which hyperparameters should you tune?** As we consider this
fundamental question, lets keep the following in mind:
* Models are more sensitive to some hyperparameters than others. This
means we should focus our efforts on the more impactful
hyperparameters.
* However, which hyperparameters are most important depends heavily on
our choice of model.
* Certain rules of thumb can help guide our initial thinking.
* Sensitivity is always relative to default values; if you use good
defaults, you might start in a good place!
See the following table for a ranked list of hyperparameters and their
impact on the model:
![](/spring2021/lecture-7-notes-media/image2.png)
#### Techniques for Hyperparameter Optimization
Now that we know which hyperparameters make the most sense to tune
(using rules of thumb), lets consider the various methods of actually
tuning them:
1. **Manual Hyperparameter Optimization**. Colloquially referred to as
Graduate Student Descent, this method works by taking a manual,
detailed look at your algorithm, building intuition, and
considering which hyperparameters would make the most difference.
After figuring out these parameters, you train, evaluate, and
guess a better hyperparameter value using your intuition for the
algorithm and intelligence. While it may seem archaic, this method
combines well with other methods (e.g., setting a range of values
for hyperparameters) and has the main benefit of reducing
computation time and cost if used skillfully. It can be
time-consuming and challenging, but it can be a good starting
point.
2. **Grid Search**. Imagine each of your parameters plotted against
each other on a grid, from which you uniformly sample values to
test. For each point, you run a training run and evaluate
performance. The advantages are that its very simple and can
often produce good results. However, its quite inefficient, as
you must run every combination of hyperparameters. It also often
requires prior knowledge about the hyperparameters since we must
manually set the range of values.
3. **Random Search**: This method is recommended over grid search.
Rather than sampling from the grid of values for the
hyperparameter evenly, well choose n points sampled randomly
across the grid. Empirically, this method produces better results
than grid search. However, the results can be somewhat
uninterpretable, with unexpected values in certain hyperparameters
returned.
4. **Coarse-to-fine Search**: Rather than running entirely random runs,
we can gradually narrow in on the best hyperparameters through
this method. Initially, start by defining a very large range to
run a randomized search on. Within the pool of results, you can
find the N best results and home in on the hyperparameter values used
to generate those samples. As you iteratively perform this method,
you can get excellent performance. This doesnt remove the manual
component, as you have to select which range to continuously
narrow your search to, but its perhaps the most popular method
available.
5. **Bayesian Hyperparameter Optimization**: This is a reasonably
sophisticated method, which you can read more about
[here](http://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/slides/lec21.pdf)
and
[here](https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f).
At a high level, start with a prior estimate of parameter
distributions. Subsequently, maintain a probabilistic model of the
relationship between hyperparameter values and model performance.
As you maintain this model, you alternate between training with
hyperparameter values that maximize the expected improvement (per
the model) and using the training results to update the
probabilistic model and its expectations. This is a great,
hands-off, efficient method to choose hyperparameters. However,
these techniques can be quite challenging to implement from
scratch. As libraries and infrastructure mature, the integration
of these methods into training will become easier.
In summary, you should probably start with coarse-to-fine random
searches and move to Bayesian methods as your codebase matures and
youre more certain of your model.
### 8 - Conclusion
To wrap up this lecture, deep learning troubleshooting and debugging is
really hard. Its difficult to tell if you have a bug because there are
many possible sources for the same degradation in performance.
Furthermore, the results can be sensitive to small changes in
hyper-parameters and dataset makeup.
To train bug-free deep learning models, we need to treat building them
as an iterative process. If you skipped to the end, the following steps
can make this process easier and catch errors as early as possible:
* **Start Simple**: Choose the simplest model and data possible.
* **Implement and Debug**: Once the model runs, overfit a single batch
and reproduce a known result.
* **Evaluate**: Apply the bias-variance decomposition to decide what
to do next.
* **Tune Hyper-parameters**: Use coarse-to-fine random searches to
tune the models hyper-parameters.
* **Improve Model and Data**: Make your model bigger if your model
under-fits and add more data and/or regularization if your model
over-fits.
Here are additional resources that you can consult to learn more:
* Andrew Ngs “[Machine Learning
Yearning](https://www.deeplearning.ai/machine-learning-yearning/)”
book.
* This [Twitter
thread](https://twitter.com/karpathy/status/1013244313327681536)
from Andrej Karpathy.
* BYUs “[Practical Advice for Building Deep Neural
Networks](https://pcc.cs.byu.edu/2017/10/02/practical-advice-for-building-deep-neural-networks/)”
blog post.
The Full Stack, 2023
@@ -0,0 +1,199 @@
Source: http://joschu.net/docs/nuts-and-bolts.pdf
Title: Nuts and Bolts of Deep RL Research - John Schulman (2016)
Fetched-via: bash -c 'uvx "markitdown[pdf]" http://joschu.net/docs/nuts-and-bolts.pdf'
Fetch-status: verbatim
The Nuts and Bolts of Deep RL Research
John Schulman
December 9th, 2016

Outline
* Approaching New Problems
* Ongoing Development and Tuning
* General Tuning Strategies for RL
* Policy Gradient Strategies
* Q-Learning Strategies
* Miscellaneous Advice

Approaching New Problems

New Algorithm? Use Small Test Problems
* Run experiments quickly
* Do hyperparameter search
* Interpret and visualize learning process: state visitation, value function, etc.
* Counterpoint: don’t overfit algorithm to contrived problem
* Useful to have medium-sized problems that you’re intimately familiar with (Hopper, Atari Pong)

New Task? Make It Easier Until Signs of Life
* Provide good input features
* Shape reward function

POMDP Design
* Visualize random policy: does it sometimes exhibit desired behavior?
* Human control
    * Atari: can you see game features in downsampled image?
* Plot time series for observations and rewards. Are they on a reasonable scale?
    * hopper.py in gym: `reward = 1.0 - 1e-3 * np.square(a).sum() + delta_x / delta_t`
* Histogram observations and rewards

Run Your Baselines
* Don’t expect them to work with default parameters
* Recommended:
    * Cross-entropy method [1]
    * Well-tuned policy gradient method [2]
    * Well-tuned Q-learning + SARSA method

[1] István Szita and András Lőrincz (2006). “Learning Tetris using the noisy cross-entropy method”. In: Neural Computation.
[2] https://github.com/openai/rllab

Run with More Samples Than Expected
* Early in tuning process, may need huge number of samples
    * Don’t be deterred by published work
* Examples:
    * TRPO on Atari: 100K timesteps per batch for KL = 0.01
    * DQN on Atari: update freq=10K, replay buffer size=1M

Ongoing Development and Tuning

It Works! But Don’t Be Satisfied
* Explore sensitivity to each parameter
    * If too sensitive, it doesn’t really work, you just got lucky
* Look for health indicators
    * VF fit quality
    * Policy entropy
    * Update size in output space and parameter space
    * Standard diagnostics for deep networks

Continually Benchmark Your Code
* If reusing code, regressions occur
* Run a battery of benchmarks occasionally

Always Use Multiple Random Seeds

Always Be Ablating
* Different tricks may substitute
    * Especially whitening
* “Regularize” to favor simplicity in algorithm design space
    * As usual, simplicity → generalization

Automate Your Experiments
* Don’t spend all day watching your code print out numbers
* Consider using a cloud computing platform (Microsoft Azure, Amazon EC2, Google Compute Engine)

General Tuning Strategies for RL

Whitening / Standardizing Data
* If observations have unknown range, standardize
    * Compute running estimate of mean and standard deviation
    * $x' = \mathrm{clip}\big((x - \mu)/\sigma,\ -10,\ 10\big)$
* Rescale the rewards, but don’t shift mean, as that affects agent’s will to live
* Standardize prediction targets (e.g., value functions) the same way
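The running standardization on this slide could look roughly like the sketch below. The class name `RunningStandardizer` is hypothetical, and the use of Welford's online update for the running statistics is my assumption; the slide only asks for a running estimate of mean and standard deviation followed by a clip to [-10, 10].

```python
import numpy as np


class RunningStandardizer:
    """Tracks a running mean/std (Welford's algorithm) and standardizes inputs."""

    def __init__(self, shape, clip=10.0, eps=1e-8):
        self.n = 0
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)  # running sum of squared deviations
        self.clip = clip
        self.eps = eps

    def update(self, x):
        """Fold one new observation into the running statistics."""
        x = np.asarray(x, dtype=np.float64)
        self.n += 1
        delta = x - self.mean
        self.mean = self.mean + delta / self.n
        self.m2 = self.m2 + delta * (x - self.mean)

    @property
    def std(self):
        return np.sqrt(self.m2 / max(self.n, 1))

    def __call__(self, x):
        """Return clip((x - mean) / std, -clip, clip)."""
        z = (np.asarray(x, dtype=np.float64) - self.mean) / (self.std + self.eps)
        return np.clip(z, -self.clip, self.clip)


# Synthetic observations with an "unknown" offset and scale.
rng = np.random.default_rng(0)
obs = rng.normal(loc=50.0, scale=5.0, size=(1000, 3))
norm = RunningStandardizer(shape=(3,))
for o in obs:
    norm.update(o)
z = norm(obs)
print(z.mean(axis=0), z.std(axis=0))
```

For rewards, the slide's advice would correspond to dividing by the running standard deviation without subtracting the mean.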

Generally Important Parameters
* Discount
    * $\text{Return}_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$
    * Effective time horizon: $1 + \gamma + \gamma^2 + \dots = 1/(1-\gamma)$
    * I.e., $\gamma = 0.99 \Rightarrow$ ignore rewards delayed by more than 100 timesteps
    * Low $\gamma$ works well for well-shaped reward
    * In TD($\lambda$) methods, can get away with high $\gamma$ when $\lambda < 1$
* Action frequency
    * Solvable with human control (if possible)
    * View random exploration
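The effective-horizon rule of thumb is easy to check numerically. A small sketch, with hypothetical function names:

```python
def effective_horizon(gamma):
    """Sum of the geometric series 1 + gamma + gamma^2 + ... = 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


def discounted_return(rewards, gamma):
    """r_0 + gamma * r_1 + gamma^2 * r_2 + ..., accumulated from the last step back."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


for gamma in (0.9, 0.99, 0.999):
    print(gamma, round(effective_horizon(gamma)))

# With a constant reward of 1, a long episode's return approaches the horizon.
long_return = discounted_return([1.0] * 10000, gamma=0.99)
print(long_return)
```

With a constant reward of 1 per step, the discounted return of a long episode approaches the effective horizon, which is why rewards much further away than that contribute little.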

General RL Diagnostics
* Look at min/max/stdev of episode returns, along with mean
* Look at episode lengths: sometimes provides additional information
    * Solving problem faster, losing game slower
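A minimal sketch of logging these statistics per batch of episodes. The helper `episode_stats` and the example numbers are hypothetical.

```python
import numpy as np


def episode_stats(returns, lengths):
    """Summarize a batch of episodes: spread of returns plus mean length."""
    returns = np.asarray(returns, dtype=np.float64)
    lengths = np.asarray(lengths, dtype=np.float64)
    return {
        "return_mean": returns.mean(),
        "return_std": returns.std(),
        "return_min": returns.min(),
        "return_max": returns.max(),
        "length_mean": lengths.mean(),
    }


# Made-up numbers: returns barely move, but episodes are getting longer.
stats = episode_stats(
    returns=[-21.0, -20.0, -21.0, -19.0],
    lengths=[760, 820, 790, 900],
)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

In a case like this the mean return hides progress that the episode lengths reveal: the agent is "losing the game slower".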

Policy Gradient Strategies

Entropy as Diagnostic
* Premature drop in policy entropy ⇒ no learning
* Alleviate by using entropy bonus or KL penalty

KL as Diagnostic
* Compute $\mathrm{KL}\left[\pi_{\text{old}}(\cdot \mid s),\ \pi(\cdot \mid s)\right]$
* KL spike ⇒ drastic loss of performance
* No learning progress might mean steps are too large
    * batchsize=100K converges to different result than batchsize=20K.
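Both diagnostics are cheap to compute for a discrete-action policy. The sketch below assumes a categorical policy given by logits over 4 actions; the random logits stand in for real network outputs, and all names are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def mean_entropy(probs, eps=1e-12):
    """Policy entropy, averaged over states."""
    return float(-(probs * np.log(probs + eps)).sum(axis=-1).mean())


def mean_kl(p_old, p_new, eps=1e-12):
    """KL[p_old || p_new], averaged over states."""
    log_ratio = np.log(p_old + eps) - np.log(p_new + eps)
    return float((p_old * log_ratio).sum(axis=-1).mean())


rng = np.random.default_rng(0)
logits_old = rng.normal(size=(64, 4))                     # 64 states, 4 actions
logits_new = logits_old + 0.1 * rng.normal(size=(64, 4))  # a small policy update
p_old, p_new = softmax(logits_old), softmax(logits_new)
print("entropy:", mean_entropy(p_new), "KL:", mean_kl(p_old, p_new))
```

Logged once per update, a sudden drop in the entropy or a spike in the KL points at the failure modes the slides describe.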

Baseline Explained Variance
* $\text{explained variance} = 1 - \dfrac{\mathrm{Var}[\text{empirical return} - \text{predicted value}]}{\mathrm{Var}[\text{empirical return}]}$
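A short sketch of this diagnostic, using the standard definition (one minus the variance of the residual over the variance of the empirical returns). The function name and the example arrays are hypothetical.

```python
import numpy as np


def explained_variance(returns, values):
    """1 - Var[returns - values] / Var[returns]; NaN if the returns are constant."""
    returns = np.asarray(returns, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    var_returns = returns.var()
    if var_returns == 0:
        return float("nan")  # undefined when the returns do not vary
    return float(1.0 - (returns - values).var() / var_returns)


returns = np.array([1.0, 2.0, 3.0, 4.0])
ev_perfect = explained_variance(returns, returns)             # baseline predicts exactly
ev_constant = explained_variance(returns, np.full(4, 2.5))    # baseline predicts the mean
ev_bad = explained_variance(returns, -returns)                # baseline is anti-correlated
print(ev_perfect, ev_constant, ev_bad)
```

A value near 1 means the value function fits the returns well, a value near 0 means it does no better than a constant, and negative values mean it is worse than a constant.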

Policy Initialization
* More important than in supervised learning: determines initial state visitation
* Zero or tiny final layer, to maximize entropy
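To see why a zero final layer maximizes entropy, compare the initial action distribution under two initializations. This is a NumPy sketch under assumptions: a softmax policy over 6 discrete actions, with random features standing in for the penultimate-layer activations.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def entropy(probs, eps=1e-12):
    """Entropy of each row, averaged over rows."""
    return float(-(probs * np.log(probs + eps)).sum(axis=-1).mean())


rng = np.random.default_rng(0)
features = rng.normal(size=(32, 16))          # hypothetical penultimate activations
n_actions = 6

w_default = rng.normal(size=(16, n_actions))  # unit-scale final layer
w_zero = np.zeros((16, n_actions))            # zero final layer

h_default = entropy(softmax(features @ w_default))
h_zero = entropy(softmax(features @ w_zero))
print(h_default, h_zero, np.log(n_actions))
```

With zero weights every logit is equal, so the policy starts out uniform and its entropy equals log(number of actions), the maximum possible.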

Q-Learning Strategies
* Optimize memory usage carefully: you’ll need it for replay buffer
* Learning rate schedules
* Exploration schedules
* Be patient. DQN converges slowly
    * On Atari, often 10-40M frames to get policy much better than random

Thanks to Szymon Sidor for suggestions

Miscellaneous Advice
* Read older textbooks and theses, not just conference papers
* Don’t get stuck on problems; can’t solve everything at once
    * Exploration problems like cart-pole swing-up
    * DQN on Atari vs CartPole

Thanks!
@@ -0,0 +1,15 @@
Source: https://old.reddit.com/r/reinforcementlearning/comments/75m5vd/
Title: Deep RL Bootcamp 2017 - Slides and Talks
Fetched-via: Reddit JSON API (limit=500, depth=10)
Fetch-status: verbatim
# Deep RL Bootcamp 2017 - Slides and Talks
**Posted by:** u/gwern | Score: 8 | 1 comments
Link: https://sites.google.com/view/deep-rl-bootcamp/lectures
## Comments
**u/obsoletelearner** (score: 2):
thank you!
@@ -0,0 +1,18 @@
Source: https://old.reddit.com/r/reinforcementlearning/comments/6vcvu1/
Title: ICML 2017 Tutorial slides (Levine & Finn): Deep Reinforcement Learning, Decision Making, and Control
Fetched-via: Reddit JSON API (limit=500, depth=10)
Fetch-status: verbatim
# ICML 2017 Tutorial slides (Levine & Finn): Deep Reinforcement Learning, Decision Making, and Control
**Posted by:** u/gwern | Score: 12 | 1 comments
Link: https://sites.google.com/view/icml17deeprl
## Comments
**u/[deleted]** (score: 4):
[deleted]
**u/cbfinn** (score: 1):
Videos of ICML tutorials (as well as conference talks) will be posted by the conference staff at some point. Though, typically they take quite a while to be released.
@@ -0,0 +1,78 @@
Source: https://old.reddit.com/r/reinforcementlearning/comments/9sh77q/
Title: What are your best tips for debugging RL problems?
Fetched-via: Reddit JSON API (limit=500, depth=10)
Fetch-status: verbatim
# What are your best tips for debugging RL problems?
**Posted by:** u/GrundleMoof | Score: 21 | 8 comments
I've done a few RL toy problems, but I'm still pretty new to the field. In each of the problems I've done, there has been some point where it seems like I've implemented everything correctly, the environment is working correctly, etc, but it's still just not working, or is, with some really strange problem.
RL seems to be harder to debug than any other type of programming I've done before. There's an element of randomness usually. It often takes a while (in the run) for the problem to manifest, so it's hard to pinpoint exactly *where* something is going wrong. Lastly, stuff just takes a while to even run, so my "attempt solution/code/evaluate" loop takes a long time, which makes it even harder.
Does anyone have any tips? The things I've figured out so far are to log everything feasible, and to try to isolate things to find the problem, but those are pretty general tips. I've found some help a few times from reading relevant papers, but that's rarer.
Do any experts in the field have any tips?
## Comments
**u/marcin_gumer** (score: 18):
Hi
This is exactly what I was struggling with for a long time (and still am). RL agent modules are really closely interconnected. No matter which module has issue (neural net, Bellman backups, memory buffer, environment, pre-processing) it will immediately affect all other modules by feeding them bad data. Looking from outside it looks like big gooey mess.
First, I'm not an RL expert, sorry if my advice sounds basic. Couple of things I have learned so far:
* RL is very difficult to debug, especially when neural nets are involved
* DO NOT "try stuff" and run to "see if it works" - this approach doesn't work in RL - too many things need to happen exactly right to see any learning at all
* RL agent modules implementation - this is just good programming practices, but even more important in RL:
* most modules can be tested independently. Environment, neural net, RL backups, memory replay buffers all can be tested in isolation.
* I try to unit test everything, usually unit tests take more code than what they test
* I try to put asserts absolutely everywhere, input matrix dimensions (1d array may broadcast differently than 2d array etc.), input/output ranges (state/actions valid?), output matrix dimensions. Input/output data types (in Python np.ndarray behaves differently than np.matrix in some cases)
* Agent modules integration - generally stepping through code at least once after every change to confirm it is doing what I think it should be doing. It's a bit like programming a bomb detonator or something. Really make sure it is working correctly *before* running long experiment.
* Visualise as much as possible, log absolutely everything
* record and display agent observations/actions/rewards
* rewards should have some variance, if all rewards are always equal (e.g. always 0), then there is nothing to learn, it's environment or exploration issue
* record and display current q-function approximation across whole state space (works only on simple tabular problems and 2d continuous state spaces).
* pick couple states (can pick by running random policy) and plot predicted q-values over time for these states. q-values should change and stabilise.
* record inputs/outputs to/from every module (environment, neural net, memory buffer, etc.)
* neural networks are making everything 10x more difficult - I try to make agent work with linear approximator first (on small problem like mountain car), then when I know everything else is working, swap in neural net and try bigger problem.
* with neural nets, one can record gradients, individual neuron activations etc to evaluate if neural net is learning over time, even without access to loss function. Debugging neural networks:
* [https://www.deeplearningbook.org/](https://www.deeplearningbook.org/) "Practical Methodology" chapter
* [Understanding the difficulty of training deep feedforward neural networks](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
* [https://cs231n.github.io/neural-networks-3/](https://cs231n.github.io/neural-networks-3/#baby)
* [https://deeplearning4j.org/docs/latest/deeplearning4j-nn-visualization](https://deeplearning4j.org/docs/latest/deeplearning4j-nn-visualization)
* Sometimes it helps to freeze random seed everywhere (numpy, tensorflow, python hashing, gym) and force single threaded CPU execution (to remove randomness from concurrent execution). This way you can reproduce runs exactly to full floating point precision and debug where these NaNs came from or why q-values exploded to infinity etc.
* I would be careful with reference implementations. Some work on hacked environments that are much easier than normal (seen it couple times in blog posts). Or have some "weird" reward function engineering. Or use older version of 3rd party library with easier version of environment.
* Try hyper parameters from reference implementation.
Hope this helps!
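As an illustration of the "asserts everywhere" and "freeze the random seed" advice above, here is a NumPy-only sketch. `seed_everything` and `td_target` are hypothetical helpers, not part of any library; the seeding covers only Python's `random` and NumPy, so any framework or environment seeding would need to be added separately.

```python
import random

import numpy as np


def seed_everything(seed):
    """Seed the RNGs used in this sketch and return a NumPy Generator."""
    random.seed(seed)
    np.random.seed(seed)
    return np.random.default_rng(seed)


def td_target(rewards, next_q, dones, gamma=0.99):
    """One-step Q-learning target with shape and range checks on every input."""
    rewards = np.asarray(rewards, dtype=np.float64)
    next_q = np.asarray(next_q, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    assert rewards.ndim == 1, f"rewards must be 1-D, got shape {rewards.shape}"
    assert next_q.ndim == 2 and next_q.shape[0] == rewards.shape[0], next_q.shape
    assert dones.shape == rewards.shape, (dones.shape, rewards.shape)
    assert np.isin(dones, (0.0, 1.0)).all(), "dones must be 0 or 1"
    target = rewards + gamma * (1.0 - dones) * next_q.max(axis=1)
    assert target.shape == rewards.shape  # guards against silent broadcasting
    assert np.isfinite(target).all(), "non-finite target"
    return target


rng = seed_everything(0)
batch = 8
target = td_target(
    rewards=rng.normal(size=batch),
    next_q=rng.normal(size=(batch, 3)),
    dones=rng.integers(0, 2, size=batch),
)
print(target)
```

Passing `rewards` with shape `(8, 1)` instead of `(8,)` would otherwise broadcast silently to an `(8, 8)` target; here it trips the first assert.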
**u/AlexanderYau** (score: 1):
Wow, a lot of experience on debugging RL, I can't agree more on "DO NOT "try stuff" and run to "see if it works"". Have you ever published any paper on RL?
**u/marcin_gumer** (score: 1):
I wouldn't say a lot of experience, just figuring things out as I go. I haven't published any RL paper. Currently just building portfolio on my [github.com/marcinbogdanski](https://github.com/marcinbogdanski). But there is not that much there yet, I implemented some algorithms from Sutton & Barto, currently working on DQN and Atari games, but it will take some time.
**u/p-morais** (score: 7):
Heres a good talk by John Schulman on just this:
https://m.youtube.com/watch?v=8EcdaCk9KaQ
**u/WhichPressure** (score: 5):
I think anyone who touch the RL has the same problem as you! Me too:P This guy wrote nice article about his adventure with some RL side project and he gave some tips. Maybe somehow it'll be helpful for you:
[http://amid.fish/reproducing-deep-rl?fbclid=IwAR1VPZm3FSTrV8BZ4UdFc2ExZy0olusmaewmloTPhpA4QOnHKRI2LLOz3mM](http://amid.fish/reproducing-deep-rl?fbclid=IwAR1VPZm3FSTrV8BZ4UdFc2ExZy0olusmaewmloTPhpA4QOnHKRI2LLOz3mM)
**u/lmericle** (score: 4):
It's true, there's a lot of moving parts. I'm no expert but lately I've been experiencing similar setbacks.
Typically I find the first place to look is the hyperparameters, and then to consider which particular optimization algorithm you're using and how it might explore the space/optimize toward suboptimal behavior. Next I'd consider the reward function and the interplay between that and the exploration/exploitation behavior of the optimization algorithm. Finally, consider where stochasticity is introduced into the problem -- perhaps there's too much, or not enough, or the stochasticity prevents convergence due to inadequate penalty terms (e.g. low entropy coefficient in PPO).
**u/wassname** (score: 2):
Also checkout [this](https://old.reddit.com/r/reinforcementlearning/comments/7s8px9/deep_reinforcement_learning_practical_tips/) previous discussion.
**u/[deleted]** (score: 1):
I usually just plot every metric, every layer weight on tensorboard and look for anomalies.
@@ -0,0 +1,110 @@
Source: https://old.reddit.com/r/reinforcementlearning/comments/7s8px9/deep_reinforcement_learning_practical_tips/
Title: Deep Reinforcement Learning practical tips
Fetched-via: browser paste (user)
Fetch-status: verbatim
# Deep Reinforcement Learning practical tips
submitted 8 years ago by grupiotr | 14 points (90% upvoted) | 13 comments
I would be particularly grateful for pointers to things you don't seem to be able to find in papers. Examples include:
- How to choose learning rate?
- Problems that work surprisingly well with high learning rates
- Problems that require surprisingly low learning rates
- Unhealthy-looking learning curves and what to do about them
- Q estimators deciding to always give low scores to a subset of actions effectively limiting their search space
- How to choose decay rate depending on the problem?
- How to design reward function? Rescale? If so, linearly or non-linearly? Introduce/remove bias?
- What to do when learning seems very inconsistent between runs?
- In general, how to estimate how low one should be expecting the loss to get?
- How to tell whether my learning rate is too low and I'm learning very slowly or too high and loss cannot be decreased further?
## Comments
**u/wassname** (11 points):
Resources: I found these very useful
- Deep RL Bootcamp Lecture 6: Nuts and Bolts of Deep RL Experimentation (slides) and a written summary
- The 3 NIPS2017 Learning to run write ups contain practical advice from a competition
- Lessons Learned Reproducing a Deep Reinforcement Learning Paper
- Deep Reinforcement Learning that Matters - this gives you an idea of what does and doesn't matter
- Deep Reinforcement Learning Doesn't Work Yet (at least as well as the hype suggests)
- General deep learning tips from Slav Ivanov
Lessons learnt:
- log everything with tensorboard/tensorboardX: policy and critic losses, advantages, ratio, actions (mean and std), states, noise. Check values, check losses are decreasing etc.
- keep track of experiments with an experiments log (git commit messages with non-committed data or logs stored by date)
- clip and clamp: mistakes not obvious as they can cause values to blow up instead of NaN
- clamp all values, logarithmic values: `logvalue.clamp(np.log(1e-5), -np.log(1e-5))`
- watch out for dividing by a value: `1/std` should be `1/(std+eps)` where `eps=1e-5`
- clip gradients: `grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 20)`, then log grad norm
- normalise everything: use running norms for state and reward; layer norms help
- check everything: plot and sanity check as many values as possible. Check initial outputs, inits, distributions, action range.
- think about step-size/sampling-rate: RL is sensitive to it (action repeat, frame skipping). Papers found skipping 4 Atari frames helped, repeating 4 actions in "Learning to Run" helped.
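The clamp, epsilon, and gradient-clipping tips can be tried without a deep learning framework. The sketch below is a NumPy stand-in for the PyTorch calls quoted in the list, not the commenter's code; all function names are hypothetical.

```python
import numpy as np

EPS = 1e-5


def clamp_log(log_value, floor=EPS):
    """Keep a log-value within [log(floor), -log(floor)]."""
    bound = -np.log(floor)  # positive, about 11.5 for floor=1e-5
    return np.clip(log_value, -bound, bound)


def safe_inverse(std, eps=EPS):
    """1 / (std + eps) instead of 1 / std, so std == 0 stays finite."""
    return 1.0 / (std + eps)


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm.

    Returns the clipped gradients and the norm before clipping (worth logging).
    """
    total_norm = float(np.sqrt(sum(float((g ** 2).sum()) for g in grads)))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


grads = [np.full((3, 3), 10.0), np.full((3,), -10.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=20.0)
norm_after = float(np.sqrt(sum(float((g ** 2).sum()) for g in clipped)))
print(norm_before, norm_after)
```

Note that the lower bound passed to a clamp must be the smaller number; with the arguments swapped the clamp no longer describes a valid range.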
Curves:
- in PPO the std should decrease as it learns
- in actor-critic the critic loss should start converging then the actor loss follows
- watch for local minima where it outputs a constant action
- watch gradients for actor and critic; if much lower than 20 or much larger than 100 often run into problems (20 and 40 are where projects often clip gradient norm)
- run on CartPole and log same curves to see what healthy looks like
Reward:
- It's not the scaling factor that matters but the final value. Papers have gotten good results with rewards between 100-1000.
Learning rate:
- Use decaying learning rates, watch loss curves to see when they begin to converge.
- loss_actor will often initially increase while the critic is doing its initial learning (value function is a moving target). Focus on making the critic learning rate work first.
- Critic learning rates are often set higher, with larger batches.
- Use cyclical learning rate trick: slowly increase LR to find the min where model learns and max where it still converges.
My own questions:
- How do you know if you've set exploration/variance too high or low?
- Should you use a multi-headed actor/critic? Or separate networks?
"What to do when learning seems very inconsistent between runs?" - This could be an init issue. Try to init so it defaults to reasonable action values even before training.
---
**u/gwern** (8 points):
I've seen similar engineering details & folklore, but mostly in slides/talks:
- https://www.reddit.com/r/reinforcementlearning/comments/6vcvu1/icml_2017_tutorial_slides_levine_finn_deep/
- https://www.reddit.com/r/reinforcementlearning/comments/75m5vd/deep_rl_bootcamp_2017_slides_and_talks/
- https://www.reddit.com/r/reinforcementlearning/comments/5i67zh/deep_reinforcement_learning_through_policy/
- https://www.reddit.com/r/reinforcementlearning/comments/5hereu/the_nuts_and_bolts_of_deep_rl_research_schulman/
**u/twkillian** (1 point): I was about to post John Schulman's talk here as well. Great resource.
**u/wassname** (1 point): Summarising the ones I hadn't seen:
- 5i67zh: fix random seed to reduce variance; think about step-size/sampling-rate; RL sensitive to optimizer choice (SGD, Adam)
- 6vcvu1: slides focused more on algorithm choice/design, not application tips
---
**u/grupiotr** [OP] (5 points):
John Schulman's talk wins, particularly:
- rescaling observations, rewards, targets and prediction targets
- using big replay buffers, bigger batch size and generally more iterations to start with
- always starting with a simple version of the task to get signs of life
---
**u/Kaixhin** (2 points):
My first bit of advice is actually don't do RL. If the answer is still yes, find some other useful task for the network to do, like predicting something. Get supervised gradients flowing through your network. Training end-to-end on purely an RL signal is impressive, but adding easier learning signals can potentially help a lot.
---
**u/grupiotr** [OP] (1 point):
What turned out to be the game-changer (made my RL agents actually learn something) was **rescaling the reward from [-1, 1] to [0, 1]**. Thanks again to everyone that contributed!
@@ -0,0 +1,197 @@
Source: https://old.reddit.com/r/reinforcementlearning/comments/bzg3l2/
Title: How to *more intelligently* debug RL roadblocks?
Fetched-via: Reddit JSON API (limit=500, depth=10)
Fetch-status: verbatim
# How to *more intelligently* debug RL roadblocks?
**Posted by:** u/GrundleMoof | Score: 4 | 7 comments
A while ago I [made this post](https://www.reddit.com/r/reinforcementlearning/comments/9sh77q/what_are_your_best_tips_for_debugging_rl_problems/) asking for tips on debugging when you run into a problem with RL.
However, I think the majority of the advice can be summed up with:
1) Test bits individually to make sure they're doing what they should
2) Don't go down a rabbit hole of fiddling with hyperparameters
3) Log/record/display everything, and "look for things that are acting funny"
and I just want to be clear that I'm not disparaging that advice, it's actually really good, I'm thankful, and I know I'm asking a tricky, general question!
But I want to get to the "next level". I think I know the theory well enough, and I've successfully done a few toy problems, but I'm still here banging my head against the wall.
I'll take a practical example I'm struggling with now: gym's `Pendulum-v0`, which has a continuous action space of [-2, 2], and three state variables (`(cos(theta), sin(theta), theta_dot)`). I'm trying to solve it with a fairly simple AC setup and PyTorch. I'm using the RMSprop optimizer, and 2 (or 3) fully connected NN layers, with 50 (or 100) units in each layer, to approximate pi (the policy) and V (the value function/baseline).
To select the actions, like in [the A3C paper](https://arxiv.org/pdf/1602.01783.pdf), I have the pi NN have two outputs, mu and sd2 (the standard deviation squared). Every time step, I select an action `a` from a normal distribution with that mu and `sd**2`. Then, I calculate that `pi(a)` (just from the equation of a normal dist. with that mu, `sd**2`), and iterate the agent to get the reward from that time step.
Also like the A3C paper (for the Pendulum problem), I'm doing all the updates at once, at the end of each episode (so it's basically MC with V as the baseline). For each time step (after the episode) I accumulate the rewards from t to t_max as `r_accum` (with gamma = 0.99), then say `V_loss = (r_accum - V_list).pow(2).sum()`. For the policy gradient, I do `policy_loss = -(torch.log(pi_list)*(r_accum - V_list)).sum()`, and then zero grads, backwards the losses, step the optimizer, etc.
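A minimal sketch of the return accumulation described above, assuming the per-step rewards and value estimates of one episode are held in plain Python lists. The helper names `discounted_returns` and `advantages` are hypothetical and do not come from the post.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantages(returns, values):
    """Advantage estimate: Monte Carlo return minus the value baseline."""
    return [g - v for g, v in zip(returns, values)]


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

In an autograd framework the advantage is usually treated as a constant inside the policy term, so that the policy loss does not push gradients into the value network. Whether that matters for the setup in this post is not something the post settles.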
And I'm just not seeing any learning, going up to about 20k episodes. I'm plotting to TensorBoard (losses, rewards, weights, biases, gradients), but nothing is striking me as an obvious culprit. It gets varying rewards, the V_loss seems to decrease to 0, and the policy_loss usually kind of wanders but eventually goes to 0 (I think because it's also proportional to (r_accum - V_list) which is also going to 0).
But I think this is a perfect learning example. This is doable (...right?), it seems mostly correctly set up, and it's probably a fairly simple fix if I knew how to diagnose it. For the more experienced RL'ers out there, where would you start? What would you look at? What would you verify is working correctly?
Here are some of my guesses/notes:
* I haven't actually seen any straightforward implementations of a vanilla PG algo solving Pendulum-v0. In the A3C paper, they add an LSTM to it. There are a bunch of DDPG papers online, but that's a pretty different story. I found one A3C that doesn't seem to have an LSTM, so I'll check that out.
* Do I need experience replay? Maybe the variance is just too high using essentially REINFORCE with this problem, so I need to be getting much better data efficiency (or running it for a ton longer) ?
* I was worried that maybe it was never actually getting to positions where it could get a high enough reward (to "motivate" it to reach those positions), but I plotted some trajectories and it's definitely getting up to the top (by swinging wildly anyway), where R = 0, so it's definitely experiencing them.
Things I've tried (but maybe not systematically enough):
* Different initial LRs
* Different optimizers
* Different number of hidden layers/units
* Shared pi/V NN body (with diff output layers) vs not
* Changing amount of entropy
* Adding correlated noise
* Using TD residual instead of MC version
* Clipping the gradient
* Different gamma values
Anyway, I'd love it if anyone has any more general advice for how to think about and go about solving RL problems. I of course want to solve this one, but I want a more general way of thinking.
## Comments
**u/i_do_floss** (score: 3):
I don't have the answer for you, but I had an algorithm that was stuck on pendulum for a while and these eventually ended up being the issues:
1. The environment is wrapped with a wrapper that kills the environment after 200 steps. I was ignoring that so I could use 1024 steps. So I ignored the "done" / "is-terminal" variable but I forgot to exclude it from my stored memories in my memory buffer so my updates were all wrong.
I decided observing the predicted value and seeing if it was crazy (high variance) may be an indicator of an issue.
Also I could use tensorboard and visualize runtime information so I could see what was going into my placeholders.
2. My q value was shaped (none x 1), and my placeholders for rewards/terminals were shaped (none). When I compared those in a tensor I ended up with a tensor shaped (none, none), which didn't do what I expected.
I decided I could mitigate that type of issue in the future by writing my expected shapes of the networks in a notebook and checking if they match afterward using tensorboard. Some people use assert shape functions.
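The shape problem in item 2 can be reproduced in a few lines. This is a sketch in NumPy rather than the commenter's TensorFlow code, relying on the fact that NumPy, TensorFlow and PyTorch share the same broadcasting rules. `assert_shape` is a hypothetical helper name.

```python
import numpy as np


def assert_shape(array, expected):
    """Fail loudly if `array` does not have exactly the expected shape."""
    assert array.shape == expected, f"expected {expected}, got {array.shape}"


batch = 5
q_values = np.zeros((batch, 1))   # network output, shape (batch, 1)
rewards = np.ones(batch)          # stored rewards, shape (batch,)

# Silent bug: (batch, 1) - (batch,) broadcasts to (batch, batch).
bad_error = q_values - rewards
print(bad_error.shape)  # (5, 5)

# Fix: make the shapes agree explicitly, then assert before computing a loss.
td_error = q_values.squeeze(-1) - rewards
assert_shape(td_error, (batch,))
```

Reducing `bad_error` with a mean or sum still yields a scalar loss, which is why this kind of bug tends to go unnoticed without an explicit shape check.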
Also, just so you know, I'm training soft actor critic in about 20 episodes of length 1024. I don't think you should wait for 1000s of episodes.
Pendulum v0 is an easy environment for your algorithm to learn. I suggest sticking with these hyper parameters. If they don't work, it's probably your algorithm.
policy network size: [64, 64]
batch size: 256
gamma: 0.99
adam optimizer
relu network activations (on every layer except the last one which has no activation)
Lastly, make sure your action space allows your algorithm to output actions in the space of -2 to 2.
**u/GrundleMoof** (score: 1):
> The environment is wrapped with a wrapper that kills the environment after 200 steps. I was ignoring that so I could use 1024 steps. So I ignored the "done" / "is-terminal" variable but I forgot to exclude it from my stored memories in my memory buffer so my updates were all wrong.
So I currently have my agent as a wrapper for the gym env, and it returns a tuple of (reward, state_next, done), and I break on done.
> I decided observing the predicted value and seeing if it was crazy (high variance) may be an indicator of an issue.
Hmmm, by value, you mean the value function? And do you mean variance across different states, or the same state over time?
> Also I could use tensorboard and visualize runtime information so I could see what was going into my placeholders.
>
> My q value was shaped (none x 1), and my placeholders for rewards/terminals were shaped (none). When I compared those in a tensor I ended up with a tensor shaped (none, none), which didn't do what I expected.
> I decided I could mitigate that type of issue in the future by writing my expected shapes of the networks in a notebook and checking if they match afterward using tensorboard. Some people use assert shape functions.
ahh yeah that's some good advice. I actually got burned by that earlier in this project, but figured it out by printing the sizes. PyTorch is a little tricky in that it will accept multiplying tensors of various combinations of sizes, with different results... so I should probably do asserts from now on.
> Also, just so you know, I'm training soft actor critic in about 20 episodes of length 1024. I don't think you should wait for 1000s of episodes.
Hmm, so right now I'm trying a pretty simple setup, just a policy gradient with a value function. I don't know much about SAC, but it seems more advanced.
I was starting to get skeptical whether this setup could even learn a continuous action space problem like Pendulum-v0, because when I searched for stuff, almost everything I found was using at least DDPG or more complex. But then I found [this guy's project](https://github.com/MorvanZhou/pytorch-A3C), just A3C, and it solves it pretty quickly and reliably.
I started going through his code and it's nearly exactly the same as mine. I thought that it's possible that using 4 workers has a "decorrelating" effect (like experience replay), so I changed his code to drop it to 1 worker, and it still works! So it's clearly something else and I haven't figured it out yet. It's so similar to mine though, both in terms of setup and hyperparameters...
> Pendulum v0 is an easy environment for your algorithm to learn. I suggest sticking with these hyper parameters. If they don't work, it's probably your algorithm.
>
> policy network size: [64, 64] batch size: 256 gamma: 0.99 adam optimizer relu network activations (on every layer except the last one which has no activation)
You mean, two hidden layers of size 64 each? And are you outputting a value function too?
So, maybe I'm missing something here -- do you mean batches of episodes, or batches of steps? I'm using gamma = 0.9 or 0.99. I've tried Adam and RMSprop, no success with either... I'm using tanh activations, but that probably shouldn't change anything significantly, right?
> Lastly, make sure your action space allows your algorithm to output actions in the space of -2 to 2.
Yeah, my policy outputs a mu and sigma. The mu output is 2*tanh, so it's mapped to -2, 2, and the sigma one (actually sigma^2 ) is put through a softplus output.
**u/i_do_floss** (score: 1):
Yes two hidden layers with 64 nodes. The value function is a third layer basically.
Tanh on last layer makes sense for policy. What are you using on hidden layers and value function final layer?
Also, have you tried different reward scales?
**u/GrundleMoof** (score: 1):
Hi again, sorry for the delay! I was traveling with no service...
I've tried a few different topologies. Right now I'm doing this:
    self.actor_lin1 = nn.Linear(3, 200)
    self.mu = nn.Linear(200, 1)
    self.sigma = nn.Linear(200, 1)
    self.critic_lin1 = nn.Linear(3, 100)
    self.v = nn.Linear(100, 1)
and for my forward():
    y = torch.tanh(self.critic_lin1(x))
    v = self.v(y)
    z = torch.tanh(self.actor_lin1(x))
    mu = 2*torch.tanh(self.mu(z))
    sd2 = softplus(self.sigma(z)) + 0.001
    return(v, (mu, sd2))
So I'm using tanh() for the nonlinearities as well. I'm adding that 0.001 to the sd2 because it keeps it from getting too small (which should be enforced by the entropy term anyway) and I've seen it done in a few formulations of this.
I also tried with having the mu/sigma layers combined into a nn.Linear(200, 2) layer (which should be functionally the same I think), as well as having the mu/sigma and v outputs share the first nn.Linear(3, 200) layer before splitting off (which is different, the shared head thing, but I've used elsewhere and seen people use).
I'm scaling the rewards in a way I've seen a bunch of other people do. Since the reward each step has the range [-16, 0], I'm normalizing it by doing (r + 8.0)/8.0, which should put it about in the range [-1, 1].
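As a small sketch of that scaling, the snippet below compares the `(r + 8.0)/8.0` rule from the post with a general affine rescale. The per-step range of [-16, 0] is taken from the post as given, and both function names are hypothetical.

```python
def rescale(value, low, high, new_low=-1.0, new_high=1.0):
    """Affinely map `value` from [low, high] to [new_low, new_high]."""
    fraction = (value - low) / (high - low)
    return new_low + fraction * (new_high - new_low)


def pendulum_scale(reward):
    """The specific scaling quoted in the post: (r + 8) / 8."""
    return (reward + 8.0) / 8.0


print(pendulum_scale(-16.0), pendulum_scale(0.0))            # -1.0 1.0
print(rescale(-16.0, -16.0, 0.0), rescale(0.0, -16.0, 0.0))  # -1.0 1.0
```

The general form makes it easy to try a different target range, for example `rescale(r, -16.0, 0.0, 0.0, 1.0)`, when experimenting with reward scales.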
At this point I'm basically trying to replicate the guy's A3C implementation from above (minus the multiple workers part, but I ran his with 1 worker and it reliably improves every time). Mine *does* seem to improve, but really slowly compared to his, and sometimes seems to get worse after a while. Like, it's not not improving *at all*, just very slowly and also not reliably, which means something must be off.
**u/i_do_floss** (score: 1):
tanh activations are really sensitive to the weights and bias initialization. Is he using tanh activations?
tanh makes sense to me for the actor output. But I would probably use relu for the nonlinearities so the initialization is easier.
tanh starts to experience issues when the inputs and outputs are too big. (bigger than like -.6 and 6.)
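To make the saturation concrete, this sketch prints the derivative of tanh, `1 - tanh(x)^2`, at a few input magnitudes. The sample points are arbitrary choices for illustration, not values from the thread.

```python
import math


def tanh_grad(x):
    """Derivative of tanh at x: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


# The gradient is 1 at the origin and shrinks rapidly as |x| grows,
# so units with large pre-activations pass almost no gradient back.
for x in (0.0, 1.0, 3.0, 6.0):
    print(f"x={x:>4}: tanh={math.tanh(x):.6f}  grad={tanh_grad(x):.2e}")
```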
**u/GrundleMoof** (score: 1):
by the way, just to give an example:
[Here's an image of the reward per episode using his code](https://imgur.com/9K2clLs)
[and here's mine](https://imgur.com/IMy6Rb5)
Both with moving averages shown, to smooth it out.
You can see that mine *does* improve, up to about episode 2000, but then gets worse. It pretty consistently does that. His on the other hand, always improves and stays good.
To me, that indicates that it's almost there, but something's going on with the optimizer or something, like maybe it becomes unstable or something. But I'm using the same LR he is (2e-4), and I've tried both Adam (like him) and RMSprop.
**u/i_do_floss** (score: 3):
Also, this only applies to pendulum v0, but it's a great first environment for spinning up an algorithm because you can graph the entire state space on a 2 dimensional plane (as an image). I graph my policy / q assessments in polar coordinates where radius is the velocity and theta is the angle of the pendulum.
Here's a link to the code I use to do it.
[https://github.com/DanielSmithMichigan/reinforcement-learning/blob/72b0e939c47234fd63c8459fdfa35f18d9053b49/soft-actor-critic/agent/Agent.py#L287](https://github.com/DanielSmithMichigan/reinforcement-learning/blob/72b0e939c47234fd63c8459fdfa35f18d9053b49/soft-actor-critic/agent/Agent.py#L287)
Here is an album of screenshots I took while training a successful policy.
[https://imgur.com/a/05C5vVa](https://imgur.com/a/05C5vVa)
I hope that helps. Feel free to hit me up on discord if you want someone to talk through it with. I'm kind of in the same boat as you with regard to wishing I knew the better ways to debug RL algorithms.
(Discord: Perseus#5383)
@@ -0,0 +1,12 @@
Source: https://old.reddit.com/r/reinforcementlearning/comments/5hereu/
Title: "The Nuts and Bolts of Deep RL Research" (Schulman December 2016 slides)
Fetched-via: Reddit JSON API (limit=500, depth=10)
Fetch-status: verbatim
# "The Nuts and Bolts of Deep RL Research" (Schulman December 2016 slides)
**Posted by:** u/gwern | Score: 5 | 0 comments
Link: http://rll.berkeley.edu/deeprlcourse/docs/nuts-and-bolts.pdf
## Comments
@@ -0,0 +1,272 @@
Source: https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607
Title: 37 Reasons why your Neural Network is not working - Slav Ivanov (2017)
Fetched-via: curl https://r.jina.ai/https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607
Fetch-status: verbatim
Title: 37 Reasons why your Neural Network is not working
URL Source: https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607
Published Time: 2017-07-25T08:13:45Z
Markdown Content:
Jul 25, 2017
The network had been training for the last 12 hours. It all looked good: the gradients were flowing and the loss was decreasing. But then came the predictions: all zeroes, all background, nothing detected. “What did I do wrong?” I asked my computer, who didn't answer.
Where do you start checking if your model is outputting garbage (for example predicting the mean of all outputs, or it has really poor accuracy)?
A network might not be training for a number of reasons. Over the course of many debugging sessions, I would often find myself doing the same checks. I've compiled my experience along with the best ideas around in this handy list. I hope they will be of use to you, too.
Table of Contents
-----------------
> [0. How to use this guide?](https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607#b6fb)
>
>
> [I. Dataset issues](https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607#678a)
>
>
> [II. Data Normalization/Augmentation issues](https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607#86fe)
>
>
> [III. Implementation issues](https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607#95eb)
>
>
> [IV. Training issues](https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607#74de)
0. How to use this guide?
-------------------------
A lot of things can go wrong. But some of them are more likely to be broken than others. I usually start with this short list as an emergency first response:
1. Start with a simple model that is known to work for this type of data (for example, VGG for images). Use a standard loss if possible.
2. Turn off all bells and whistles, e.g. regularization and data augmentation.
3. If finetuning a model, double check the preprocessing, for it should be the same as the original model's training.
4. Verify that the input data is correct.
5. Start with a really small dataset (2-20 samples). Overfit on it and gradually add more data.
6. Start gradually adding back all the pieces that were omitted: augmentation/regularization, custom loss functions, try more complex models.
If the steps above don't do it, start going down the following big list and verify things one by one.
I. Dataset issues
-----------------
![Image 2](https://miro.medium.com/v2/resize:fit:700/1*xfIbyKKMDmjQF9JFuK2Ykg.png)
Source: [http://dilbert.com/strip/2014-05-07](http://dilbert.com/strip/2014-05-07)
### 1. Check your input data
Check if the input data you are feeding the network makes sense. For example, I've more than once mixed the width and the height of an image. Sometimes, I would feed all zeroes by mistake. Or I would use the same batch over and over. So print/display a couple of batches of input and target output and make sure they are OK.
### 2. Try random input
Try passing random numbers instead of actual data and see if the error behaves the same way. If it does, it's a sure sign that your net is turning data into garbage at some point. Try debugging layer by layer (op by op) and see where things go wrong.
### 3. Check the data loader
Your data might be fine but the code that passes the input to the net might be broken. Print the input of the first layer before any operations and check it.
### 4. Make sure input is connected to output
Check if a few input samples have the correct labels. Also make sure shuffling input samples works the same way for output labels.
### 5. Is the relationship between input and output too random?
Maybe the non-random part of the relationship between the input and output is too small compared to the random part (one could argue that stock prices are like this), i.e. the inputs are not sufficiently related to the outputs. There isn't a universal way to detect this as it depends on the nature of the data.
### 6. Is there too much noise in the dataset?
This happened to me once when I scraped an image dataset off a food site. There were so many bad labels that the network couldn't learn. Check a bunch of input samples manually and see if labels seem off.
The cutoff point is up for debate, as [this paper](https://arxiv.org/pdf/1412.6596.pdf) got above 50% accuracy on MNIST using 50% corrupted labels.
### 7. Shuffle the dataset
If your dataset hasn't been shuffled and has a particular order to it (ordered by label) this could negatively impact the learning. Shuffle your dataset to avoid this. Make sure you are shuffling input and labels together.
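One way to keep inputs and labels aligned is to draw a single permutation and apply it to both. The sketch below uses only the standard library; `shuffle_together` is a hypothetical helper, and the toy data is made up so that alignment is easy to verify.

```python
import random


def shuffle_together(inputs, labels, seed=0):
    """Shuffle inputs and labels with one shared permutation."""
    assert len(inputs) == len(labels), "inputs and labels must align"
    order = list(range(len(inputs)))
    random.Random(seed).shuffle(order)
    return [inputs[i] for i in order], [labels[i] for i in order]


inputs = ["img0", "img1", "img2", "img3"]
labels = [0, 1, 2, 3]
shuffled_inputs, shuffled_labels = shuffle_together(inputs, labels)

# Every input still sits next to its original label.
assert all(x == f"img{y}" for x, y in zip(shuffled_inputs, shuffled_labels))
```

Shuffling the two lists with two independent calls would break the pairing silently, which is the failure mode described above.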
### 8. Reduce class imbalance
Are there 1000 class A images for every class B image? Then you might need to balance your loss function or [try other class imbalance approaches](http://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/).
### 9. Do you have enough training examples?
If you are training a net from scratch (i.e. not finetuning), you probably need lots of data. For image classification, [people say](https://stats.stackexchange.com/a/226693/30773) you need 1000 images per class or more.
### 10. Make sure your batches don't contain a single label
This can happen in a sorted dataset (i.e. the first 10k samples contain the same class). Easily fixable by shuffling the dataset.
### 11. Reduce batch size
[This paper](https://arxiv.org/abs/1609.04836) points out that having a very large batch can reduce the generalization ability of the model.
### Addition 1. Use standard dataset (e.g. mnist, cifar10)
Thanks to @ for this one:
> When testing new network architecture or writing a new piece of code, use the standard datasets first, instead of your own data. This is because there are many reference results for these datasets and they are proved to be solvable. There will be no issues of label noise, train/test distribution difference , too much difficulty in dataset, etc.
II. Data Normalization/Augmentation
-----------------------------------
![Image 3](https://miro.medium.com/v2/resize:fit:400/1*UQLMfdKi5D4nNDN6Oxa5MA.png)
### 12. Standardize the features
Did you standardize your input to have zero mean and unit variance?
### 13. Do you have too much data augmentation?
Augmentation has a regularizing effect. Too much of this combined with other forms of regularization (weight L2, dropout, etc.) can cause the net to underfit.
### 14. Check the preprocessing of your pretrained model
If you are using a pretrained model, make sure you are using the same normalization and preprocessing as the model was when training. For example, should an image pixel be in the range [0, 1], [-1, 1] or [0, 255]?
### 15. Check the preprocessing for train/validation/test set
CS231n points out a [common pitfall](http://cs231n.github.io/neural-networks-2/#datapre):
> “… any preprocessing statistics (e.g. the data mean) must only be computed on the training data, and then applied to the validation/test data. E.g. computing the mean and subtracting it from every image across the entire dataset and then splitting the data into train/val/test splits would be a mistake.”
Also, check for different preprocessing in each sample or batch.
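The standardization and train-only statistics points above can be sketched together. This example uses synthetic NumPy data, and the 800/200 split and the epsilon value are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))
train, val = data[:800], data[800:]

# Statistics come from the training split only.
mean = train.mean(axis=0)
std = train.std(axis=0) + 1e-8  # epsilon guards against constant features

train_scaled = (train - mean) / std
val_scaled = (val - mean) / std  # reuse the training statistics, do not recompute

print(np.allclose(train_scaled.mean(axis=0), 0.0, atol=1e-6))  # True
```

The validation split will not have exactly zero mean and unit variance after this transform, and that is expected: it is scaled the same way the model's training inputs were.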
III. Implementation issues
--------------------------
![Image 4](https://miro.medium.com/v2/resize:fit:371/1*EVy3hNSF4Nq7v7bNYOyNcQ.png)
Credit: [https://xkcd.com/1838/](https://xkcd.com/1838/)
### 16. Try solving a simpler version of the problem
This will help with finding where the issue is. For example, if the target output is an object class and coordinates, try limiting the prediction to object class only.
### 17. Look for correct loss “at chance”
Again from the excellent [CS231n](http://cs231n.github.io/neural-networks-3/#sanitycheck): _Initialize with small parameters, without regularization. For example, if we have 10 classes, at chance means we will get the correct class 10% of the time, and the Softmax loss is the negative log probability of the correct class so: -ln(0.1) = 2.302._
After this, try increasing the regularization strength which should increase the loss.
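The expected loss "at chance" is quick to compute for any number of classes. A minimal sketch, where `chance_loss` is a hypothetical helper name:

```python
import math


def chance_loss(num_classes):
    """Cross-entropy of a uniform prediction over `num_classes` classes."""
    return -math.log(1.0 / num_classes)


print(round(chance_loss(10), 3))  # 2.303
```

If the first few training iterations of a freshly initialized softmax classifier report a loss far from this value, the loss computation or the initialization is worth a closer look.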
### 18. Check your loss function
If you implemented your own loss function, check it for bugs and add unit tests. Often, my loss would be slightly incorrect and hurt the performance of the network in a subtle way.
### 19. Verify loss input
If you are using a loss function provided by your framework, make sure you are passing to it what it expects. For example, in PyTorch I would mix up the NLLLoss and CrossEntropyLoss: the former expects log-probabilities (log-softmax output) as input, while the latter takes raw logits and applies log-softmax itself.
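The sketch below re-implements both losses in NumPy to show what goes wrong when raw logits are fed to a negative log-likelihood loss that expects log-probabilities. These functions are illustrative stand-ins written for this example, not PyTorch's own implementations.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def nll_loss(log_probs, targets):
    """Mean negative log-likelihood; expects log-probabilities."""
    return -log_probs[np.arange(len(targets)), targets].mean()


def cross_entropy(logits, targets):
    """Cross-entropy on raw logits: log-softmax followed by NLL."""
    return nll_loss(log_softmax(logits), targets)


logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
targets = np.array([0, 2])

correct = cross_entropy(logits, targets)
wrong = nll_loss(logits, targets)  # bug: raw logits where log-probs are expected
print(correct, wrong)
```

A negative classification loss, as `wrong` is here, is a useful warning sign: a mean negative log-likelihood computed on proper log-probabilities cannot go below zero.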
### 20. Adjust loss weights
If your loss is composed of several smaller loss functions, make sure their magnitude relative to each is correct. This might involve testing different combinations of loss weights.
### 21. Monitor other metrics
Sometimes the loss is not the best predictor of whether your network is training properly. If you can, use other metrics like accuracy.
### 22. Test any custom layers
Did you implement any of the layers in the network yourself? Check and double-check to make sure they are working as intended.
### 23. Check for “frozen” layers or variables
Check if you unintentionally disabled gradient updates for some layers/variables that should be learnable.
### 24. Increase network size
Maybe the expressive power of your network is not enough to capture the target function. Try adding more layers or more hidden units in fully connected layers.
### 25. Check for hidden dimension errors
If your input looks like (k, H, W) = (64, 64, 64) it's easy to miss errors related to wrong dimensions. Use weird numbers for input dimensions (for example, different prime numbers for each dimension) and check how they propagate through the network.
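A small illustration of why distinct prime dimensions help, using a deliberately buggy function. The function `column_means` and its bug are invented for this sketch.

```python
import numpy as np


def column_means(image):
    """Intended: average over rows, giving one value per column, shape (W,)."""
    return image.mean(axis=1)  # bug: averages over columns instead of rows


def check(height, width):
    """Return True if the output has the intended shape (width,)."""
    image = np.zeros((height, width))
    return column_means(image).shape == (width,)


print(check(64, 64))  # True: the square input hides the bug
print(check(5, 7))    # False: distinct primes expose it
```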
### 26. Explore Gradient checking
If you implemented Gradient Descent by hand, gradient checking makes sure that your backpropagation works like it should. More info: [1](http://ufldl.stanford.edu/tutorial/supervised/DebuggingGradientChecking/), [2](http://cs231n.github.io/neural-networks-3/#gradcheck), [3](https://www.coursera.org/learn/machine-learning/lecture/Y3s6r/gradient-checking).
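A minimal gradient-check sketch, assuming a scalar loss over a small parameter vector. It compares a centered finite difference against a hand-derived gradient. The example function, the step size `eps`, and the tolerance are illustrative choices, and the relative-error formula is one common variant rather than the only one.

```python
import numpy as np


def numerical_grad(f, x, eps=1e-5):
    """Centered finite-difference gradient of scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step.flat[i] = eps
        grad.flat[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad


def relative_error(a, b):
    """Max elementwise relative error, guarded against division by zero."""
    return np.max(np.abs(a - b) / np.maximum(np.abs(a) + np.abs(b), 1e-12))


def loss(x):
    """Example function: f(x) = sum(x^2)."""
    return np.sum(x ** 2)


def analytic_grad(x):
    """Hand-derived gradient of the example function: 2x."""
    return 2 * x


x = np.array([0.3, -1.2, 2.0])
err = relative_error(numerical_grad(loss, x), analytic_grad(x))
print(err < 1e-6)  # True
```

Because the loop evaluates `f` twice per parameter, this is only practical on small parameter vectors or on a random subset of coordinates.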
IV. Training issues
-------------------
![Image 5](https://miro.medium.com/v2/resize:fit:448/1*gfcJD0eymh5SGuquzuvpig.png)
Credit: [http://carlvondrick.com/ihog/](http://carlvondrick.com/ihog/)
### 27. Solve for a really small dataset
**Overfit a small subset of the data and make sure it works.** For example, train with just 1 or 2 examples and see if your network can learn to differentiate these. Move on to more samples per class.
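As a toy version of this check, the sketch below trains a logistic regression by plain gradient descent on four hand-made points and confirms that it memorizes them. The data, learning rate and step count are arbitrary, and a real check would use your own model and a handful of real samples.

```python
import numpy as np

# Four points, two classes: small enough that any working
# model and training loop should be able to memorize it.
x = np.array([[0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [0.9, 0.1]])
y = np.array([0.0, 1.0, 0.0, 1.0])

w = np.zeros(2)
b = 0.0
lr = 1.0


def predict(x, w, b):
    """Sigmoid of the linear score."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))


for step in range(500):
    p = predict(x, w, b)
    grad_logits = (p - y) / len(y)  # gradient of mean binary cross-entropy w.r.t. logits
    w -= lr * (x.T @ grad_logits)
    b -= lr * grad_logits.sum()

p = predict(x, w, b)
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
accuracy = np.mean((p > 0.5) == (y == 1.0))
print(accuracy)  # 1.0
```

If a model cannot drive the training loss toward zero on a set this small, the problem is more likely in the code than in the data or the hyperparameters.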
### 28. Check weights initialization
If unsure, use [Xavier](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf) or [He](http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf) initialization. Also, your initialization might be leading you to a bad local minimum, so try a different initialization and see if it helps.
### 29. Change your hyperparameters
Maybe you are using a particularly bad set of hyperparameters. If feasible, try a [grid search](http://scikit-learn.org/stable/modules/grid_search.html).
### 30. Reduce regularization
Too much regularization can cause the network to underfit badly. Reduce regularization such as dropout, batch norm, weight/bias L2 regularization, etc. In the excellent “[Practical Deep Learning for coders](http://course.fast.ai/)” course, [Jeremy Howard](https://twitter.com/jeremyphoward) advises getting rid of underfitting first. This means you first overfit the training data sufficiently, and only then address overfitting.
### 31. Give it time
Maybe your network needs more time to train before it starts making meaningful predictions. If your loss is steadily decreasing, let it train some more.
### 32. Switch from Train to Test mode
Some frameworks have layers, like Batch Norm and Dropout, that behave differently during training and testing. Switching to the appropriate mode might help your network to predict properly.
### 33. Visualize the training
* Monitor the activations, weights, and updates of each layer. Make sure their magnitudes match. For example, the ratio of the magnitude of the updates to the magnitude of the parameters (weights and biases) [should be around 1e-3](https://cs231n.github.io/neural-networks-3/#summary).
* Consider a visualization library like [Tensorboard](https://www.tensorflow.org/get_started/summaries_and_tensorboard) and [Crayon](https://github.com/torrvision/crayon). In a pinch, you can also print weights/biases/activations.
* Be on the lookout for layer activations with a mean much larger than 0. Try Batch Norm or ELUs.
* [Deeplearning4j](https://deeplearning4j.org/visualization#usingui) points out what to expect in histograms of weights and biases:
> “For weights, these histograms should have an **approximately Gaussian (normal)** distribution, after some time. For biases, these histograms will generally start at 0, and will usually end up being **approximately Gaussian** (One exception to this is for LSTM). Keep an eye out for parameters that are diverging to +/- infinity. Keep an eye out for biases that become very large. This can sometimes occur in the output layer for classification if the distribution of classes is very imbalanced.”
* Check layer updates, they should have a Gaussian distribution.
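The update-to-parameter ratio mentioned in the first bullet can be monitored with a few lines. This sketch assumes plain SGD, where the update is `-lr * grad`, and uses random arrays as stand-ins for one layer's weights and gradients. `update_ratio` is a hypothetical helper name.

```python
import numpy as np


def update_ratio(weights, grads, lr):
    """Norm of one plain-SGD update divided by the norm of the weights."""
    update = -lr * grads
    return np.linalg.norm(update) / (np.linalg.norm(weights) + 1e-12)


rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=(256, 128))   # stand-in layer weights
grads = rng.normal(scale=0.01, size=(256, 128))    # stand-in gradients

for lr in (1e-4, 1e-2, 1.0):
    print(f"lr={lr:g}  update/weight ratio={update_ratio(weights, grads, lr):.1e}")
```

With an adaptive optimizer such as Adam the actual update is not `-lr * grad`, so in that case it is more reliable to measure the ratio from the parameter values before and after the optimizer step.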
### 34. Try a different optimizer
Your choice of optimizer shouldn't prevent your network from training unless you have selected particularly bad hyperparameters. However, the proper optimizer for a task can be helpful in getting the most training in the shortest amount of time. The paper which describes the algorithm you are using should specify the optimizer. If not, I tend to use Adam or plain SGD with momentum.
Check this [excellent post](http://ruder.io/optimizing-gradient-descent/) by Sebastian Ruder to learn more about gradient descent optimizers.
### 35. Exploding / Vanishing gradients
* Check layer updates, as very large values can indicate exploding gradients. Gradient clipping may help.
* Check layer activations. From [Deeplearning4j](https://deeplearning4j.org/visualization#usingui) comes a great guideline: _“A good standard deviation for the activations is on the order of 0.5 to 2.0. Significantly outside of this range may indicate vanishing or exploding activations.”_
### 36. Increase/Decrease Learning Rate
A low learning rate will cause your model to converge very slowly.
A high learning rate will quickly decrease the loss in the beginning but might have a hard time finding a good solution.
Play around with your current learning rate by multiplying it by 0.1 or 10.
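The effect of scaling the learning rate by 0.1 or 10 shows up even on a one-dimensional toy problem. This sketch runs gradient descent on f(x) = x^2, where the base learning rate of 0.1 and the step count are arbitrary illustrative values.

```python
def final_loss(lr, steps=50, x0=1.0):
    """Run gradient descent on f(x) = x^2 and return the final loss."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # gradient of x^2 is 2x
    return x * x


base_lr = 0.1
for lr in (base_lr * 0.1, base_lr, base_lr * 10):
    print(f"lr={lr:g}  final loss={final_loss(lr):.3e}")
```

On this toy problem the smallest rate is still far from the minimum after 50 steps, the middle one converges, and the largest one bounces between +1 and -1 without making progress.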
### 37. Overcoming NaNs
Getting a NaN (Not-a-Number) is a much bigger issue when training RNNs (from what I hear). Some approaches to fix it:
* Decrease the learning rate, especially if you are getting NaNs in the first 100 iterations.
* NaNs can arise from division by zero or natural log of zero or negative number.
* Russell Stewart has great pointers on [how to deal with NaNs](http://russellsstewart.com/notes/0.html).
* Try evaluating your network layer by layer and see where the NaNs appear.
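The layer-by-layer check from the last bullet can be sketched as follows, with plain functions standing in for network layers. `first_non_finite` is a hypothetical helper, and the three toy layers are chosen so that a log of zero appears in the middle.

```python
import numpy as np


def first_non_finite(layers, x):
    """Run x through `layers` in order; return the index of the first layer
    whose output contains NaN or inf, or None if every output is finite."""
    for index, layer in enumerate(layers):
        x = layer(x)
        if not np.all(np.isfinite(x)):
            return index
    return None


layers = [
    lambda x: x - 1.0,    # layer 0: shifts one input to exactly 0
    lambda x: np.log(x),  # layer 1: log(0) = -inf
    lambda x: x * 0.0,    # layer 2: -inf * 0 = NaN
]

# Silence NumPy's floating-point warnings for this deliberate failure.
with np.errstate(divide="ignore", invalid="ignore"):
    bad_index = first_non_finite(layers, np.array([1.0, 2.0, 3.0]))
print(bad_index)  # 1
```

Checking for inf as well as NaN matters here: the NaN would only appear at layer 2, but the real culprit is the infinity produced one layer earlier.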
@@ -0,0 +1,220 @@
Source: https://github.com/williamFalcon/DeepRLHacks
Title: DeepRLHacks - Attendee notes from Schulman's Nuts and Bolts talk (2017)
Fetched-via: gh api repos/williamFalcon/DeepRLHacks/contents/README.md
Fetch-status: verbatim
Compliance-note: Secondary source - attendee notes from the talk. Primary source is joschu_nuts_and_bolts.md (http://joschu.net/docs/nuts-and-bolts.pdf)
# DeepRLHacks
From a talk given by [John Schulman](http://joschu.net/) titled "The Nuts and Bolts of Deep RL Research" (Aug 2017)
These are tricks written down while attending summer [Deep RL Bootcamp at UC Berkeley](https://www.deepbootcamp.io/).
**Update**: RL bootcamp just released the [video](https://www.youtube.com/watch?v=8EcdaCk9KaQ&feature=youtu.be) and the rest of the [lectures](https://sites.google.com/view/deep-rl-bootcamp/lectures).
## Tips to debug new algorithm
1. Simplify the problem by using a low dimensional state space environment.
- John suggested to use the [Pendulum problem](https://gym.openai.com/envs/Pendulum-v0) because the problem has a 2-D state space (angle of pendulum and velocity).
- Easy to visualize what the value function looks like and what state the algorithm should be in and how they evolve over time.
- Easy to visually spot why something isn't working (aka, is the value function smooth enough and so on).
2. To test if your algorithm is reasonable, construct a problem you know it should work on.
- Ex: For hierarchical reinforcement learning you'd construct a problem with an OBVIOUS hierarchy it should learn.
- Can easily see if it's doing the right thing.
- WARNING: Don't over fit method to your toy problem (realize it's a toy problem).
3. Familiarize yourself with certain environments you know well.
- Over time, you'll learn how long the training should take.
- Know how rewards evolve, etc...
- Allows you to set a benchmark to see how well you're doing against your past trials.
- John uses the hopper robot where he knows how fast learning should take, and he can easily spot odd behaviors.
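For the Pendulum suggestion in tip 1, a value function can be evaluated over the whole 2-D state space and then shown as an image. The sketch below builds that grid. The `max_speed` of 8.0 and the `(cos, sin, theta_dot)` observation layout are assumptions about the Gym environment, and `toy_value` is a made-up stand-in for a learned value function.

```python
import numpy as np


def pendulum_grid(n_theta=61, n_speed=41, max_speed=8.0):
    """Grid over the 2-D pendulum state (angle, angular velocity)."""
    theta = np.linspace(-np.pi, np.pi, n_theta)
    speed = np.linspace(-max_speed, max_speed, n_speed)
    return np.meshgrid(theta, speed, indexing="ij")


def to_observation(theta, speed):
    """Assumed Gym-style observation: (cos(theta), sin(theta), theta_dot)."""
    return np.stack([np.cos(theta), np.sin(theta), speed], axis=-1)


def toy_value(obs):
    """Stand-in for a learned value function (an assumption for this sketch):
    higher when upright (cos(theta) near 1) and slow."""
    return obs[..., 0] - 0.1 * obs[..., 2] ** 2


theta, speed = pendulum_grid()
values = toy_value(to_observation(theta, speed))
print(values.shape)  # (61, 41)
```

The `values` array can be passed to any 2-D plotting routine, and replotting it every few training iterations shows how the value estimate evolves and whether it looks smooth.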
## Tips to debug a new task
1. Simplify the task
- Start simple until you see signs of life.
- Approach 1: Simplify the feature space:
- For example, if you're learning from images (a very high-dimensional space), then maybe hand-engineer features first. Example: if you think your function is trying to approximate the location of something, use the x,y location as features as step 1.
- Once it starts working, make the problem harder until you solve the full problem.
- Approach 2: simplify the reward function.
- Formulate so it can give you FAST feedback to know whether you're doing the right thing or not.
- Ex: Give the robot a reward (+1) when it hits the target. This is hard to learn because too much may happen between the start and the reward. Reformulate the reward as distance to the target instead, which will speed up learning and let you iterate faster.
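As a rough illustration of simplifying the reward, the sketch below contrasts a sparse hit-the-target reward with a dense distance-based one. The function names, the 2-D positions, and the `radius` threshold are assumptions made up for this example, not something from the talk.

```python
import math

def sparse_reward(position, target, radius=0.1):
    """+1 only when the agent is within `radius` of the target, else 0."""
    return 1.0 if math.dist(position, target) <= radius else 0.0

def distance_reward(position, target):
    """Dense alternative: negative Euclidean distance to the target."""
    return -math.dist(position, target)

# Hypothetical trajectory moving toward a hypothetical target.
target = (1.0, 1.0)
path = [(0.0, 0.0), (0.5, 0.5), (0.9, 0.9), (1.0, 1.0)]
for position in path:
    print(position, sparse_reward(position, target),
          round(distance_reward(position, target), 3))
```

The sparse reward stays at zero until the final step, while the distance reward improves at every step, which is what gives the faster feedback.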
## Tips to frame a problem in RL
Maybe it's unclear what the features are and what the reward should be, or if it's feasible at all.
1. First step: Visualize a random policy acting on this problem.
- See where it takes you.
- If a random policy occasionally does the right thing, there's a high chance RL will do the right thing too.
- Policy gradient will find this behavior and make it more likely.
- If random policy never does the right thing, RL will likely also not.
2. Make sure observations are usable:
- See if YOU could control the system by using the same observations you give the agent.
- Example: Look at preprocessed images yourself to make sure you don't remove necessary details or hinder the algorithm in a certain way.
3. Make sure everything is reasonably scaled.
- Rule of thumb:
- Observations: Make everything mean 0, standard deviation 1.
- Reward: If you control it, then scale it to a reasonable value.
- Do it across ALL your data so far.
- Look at all observations and rewards and make sure there aren't crazy outliers.
4. Have good baseline whenever you see a new problem.
- It's unclear which algorithm will work, so have a set of baselines (from other methods)
- Cross entropy method
- Policy gradient methods
- Some kind of Q-learning method (check out [OpenAI Baselines](https://github.com/openai/baselines) as a starter or [RLLab](https://github.com/rll/rllab))
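The scaling check from item 3 can be sketched as a small report over everything collected so far. This is a minimal example, assuming scalar values (one observation dimension, or the rewards); `scale_report` and the z-score threshold of 5 are hypothetical choices, and the reward data is invented.

```python
import statistics

def scale_report(values, z_threshold=5.0):
    """Summarize the scale of `values` and list points far from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    outliers = [v for v in values
                if std > 0 and abs(v - mean) / std > z_threshold]
    return {"mean": mean, "std": std,
            "min": min(values), "max": max(values), "outliers": outliers}

# Invented rewards: mostly 1.0, with one crazy outlier.
rewards = [1.0] * 99 + [1000.0]
report = scale_report(rewards)
print(report)
```

A mean far from 0, a standard deviation far from 1, or a non-empty `outliers` list are the things to look at before blaming the algorithm.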
## Reproducing papers
Sometimes (often), it's hard to reproduce results from papers. Some tricks to do that:
1. Use more samples than needed.
2. Policy right... but not exactly
- Try to make it work a little bit.
- Then tweak hyperparameters to get up to the published performance.
- If you want to get it to work at ALL, use bigger batch sizes.
- If the batch size is too small, noise will overpower signal.
- Example: for TRPO, John was using too tiny a batch size and had to use 100k time steps.
- For DQN, best hyperparams: 10k time steps, 1M frames in replay buffer.
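To see why a small batch lets noise overpower signal, here is a toy sketch (not an RL algorithm): each sample is a noisy estimate of a small true signal, and the batch estimate is their average. The signal size, noise level, and function names are assumptions for illustration only.

```python
import random
import statistics

def batch_estimate(batch_size, signal=0.1, noise_std=1.0, rng=random):
    """Average of `batch_size` noisy per-sample estimates of `signal`."""
    return statistics.fmean(rng.gauss(signal, noise_std)
                            for _ in range(batch_size))

def estimate_spread(batch_size, trials=200, seed=0):
    """Standard deviation of the batch estimate across repeated trials."""
    rng = random.Random(seed)
    estimates = [batch_estimate(batch_size, rng=rng) for _ in range(trials)]
    return statistics.pstdev(estimates)

for batch_size in (10, 100, 1000):
    print(batch_size, round(estimate_spread(batch_size), 3))
```

With these assumed numbers, the spread at batch size 10 is larger than the signal itself, and it shrinks roughly with the square root of the batch size.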
## Guidelines for the ongoing training process
Sanity check that your training is going well.
1. Look at sensitivity of EVERY hyper parameter
- If the algorithm is too sensitive, it is NOT robust and you should NOT be happy with it.
- Sometimes a method works in one setting because of funny dynamics but NOT in general.
2. Look for indicators that the optimization process is healthy.
- Varies
- Look at whether value function is accurate.
- Is it predicting well?
- Is it predicting returns well?
- How big are the updates?
- Standard diagnostics from deep networks
3. Have a system for continuously benchmarking code.
- Needs DISCIPLINE.
- Look at performance across ALL previous problems you tried.
- Sometimes it'll start working on one problem but mess up performance in others.
- Easy to overfit on a single problem.
- Have a battery of benchmarks you run occasionally.
4. You may think your algorithm is working when you're actually seeing random noise.
- Example: Graph of 7 tasks with 3 algorithms and looks like 1 algorithm might be doing best on all problems, but turns out they're all the same algorithm with DIFFERENT random seeds.
5. Try different random seeds!!
- Run multiple times and average.
- Run multiple tasks on multiple seeds.
- If not, you're likely to overfit.
6. Additional algorithm modifications might be unnecessary.
- Most tricks are ACTUALLY normalizing something in some way or improving your optimization.
- A lot of tricks also have the same effect... So you can remove some of them and SIMPLIFY your algorithm (VERY KEY).
7. Simplify your algorithm
- Will generalize better
8. Automate your experiments
- Don't spend your whole day watching your code spit out numbers.
- Launch experiments on cloud services and analyze results.
- Frameworks to track experiments and results:
- Mostly use iPython notebooks.
- DBs seem unnecessary to store results.
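One way to act on the random-seed advice is to summarize each algorithm over several seeds before comparing. The sketch below is an assumption-laden example: `summarize_seeds`, `clearly_better`, the two-standard-error rule, and the return values are all made up for illustration.

```python
import math
import statistics

def summarize_seeds(final_returns):
    """Mean, sample std, and standard error of final returns across seeds."""
    n = len(final_returns)
    mean = statistics.fmean(final_returns)
    std = statistics.stdev(final_returns) if n > 1 else 0.0
    return {"n": n, "mean": mean, "std": std, "stderr": std / math.sqrt(n)}

def clearly_better(returns_a, returns_b, k=2.0):
    """Crude check: A's mean exceeds B's by more than k combined std errors."""
    a, b = summarize_seeds(returns_a), summarize_seeds(returns_b)
    gap = a["mean"] - b["mean"]
    return gap > k * math.sqrt(a["stderr"] ** 2 + b["stderr"] ** 2)

# Invented final returns, one entry per random seed.
algo_a = [105.0, 98.0, 110.0, 92.0, 101.0]
algo_b = [99.0, 103.0, 95.0, 108.0, 97.0]
print(summarize_seeds(algo_a))
print(summarize_seeds(algo_b))
print("A clearly better than B:", clearly_better(algo_a, algo_b))
```

Here the two sets of returns differ only by seed-level noise, so the check refuses to call a winner. It is a rough heuristic, not a substitute for a proper statistical test.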
## General training strategies
1. Whiten and standardize data (for ALL seen data since the beginning).
- Observations:
- Do it by computing a running mean and standard deviation. Then z-transform everything.
- Over ALL data seen (not just the recent data).
- That way, the normalization at least changes more slowly over time.
- Might trip up the optimizer if you keep changing the objective.
- Rescaling using only recent data changes the inputs without the optimizer knowing about it, and performance may collapse.
- Rewards:
- Scale and DON'T shift.
- Shifting adds a constant to every step, which affects the agent's will to live.
- It will change the problem (i.e., how long the agent wants to survive).
- Standardize targets:
- Same way as rewards.
- PCA Whitening?
- Could help.
- Starting to see if it actually helps with neural nets.
- Huge scales (-1000, 1000) or (-0.001, 0.001) certainly make learning slow.
2. Think carefully about the discount factor.
- It determines how far back you assign credit.
- Ex: if the factor is 0.99, the effective horizon is roughly 1/(1 - 0.99) = 100 steps, so you're mostly ignoring what happened more than 100 steps ago... Means you're shortsighted.
- Better to look at how that corresponds to real time
- Intuition: in RL we're usually discretizing time.
- e.g., are those 100 steps 3 seconds of actual time?
- what happens during that time?
- If you use TD methods for policy gradient or value function estimation, gamma can be close to 1 (like 0.999).
- Algo becomes very stable.
3. Check that the problem can actually be solved at the discretized level.
- Example: in a game where you're using frame skip.
- As a human, can you control it or is it impossible?
- Look at what random exploration looks like
- Discretization determines how far your Brownian motion goes.
- If you repeat an action many times in a row (as with frame skip), you tend to explore further.
- Choose your time discretization in a way that works.
4. Look at episode returns closely.
- Not just mean, look at min and max.
- The max return is something your policy can home in on pretty well.
- Is your policy ever doing the right thing??
- Look at episode length (sometimes more informative than episode reward).
- In a game you might be losing every time, so you never see a win, but... episode length can tell you if you're losing SLOWER.
- You might see episode length improve in the beginning even when reward doesn't.
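The running standardization from item 1 can be sketched with Welford's online algorithm, which tracks mean and variance over ALL data seen so far. This is a minimal scalar version under stated assumptions: the class name `RunningNorm`, the `eps` guard, and the `scale_only` helper (scale without shifting, as suggested for rewards) are hypothetical, and real code would usually keep one such statistic per observation dimension.

```python
import math

class RunningNorm:
    """Running mean/std over every value seen so far (Welford's algorithm)."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean
        self.eps = eps  # avoids division by zero

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        if self.count < 2:
            return 1.0  # not enough data yet, leave values unscaled
        return math.sqrt(self.m2 / self.count)

    def normalize(self, x):
        """z-transform: for observations."""
        return (x - self.mean) / (self.std + self.eps)

    def scale_only(self, x):
        """Scale without shifting: for rewards, so their sign is preserved."""
        return x / (self.std + self.eps)

norm = RunningNorm()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    norm.update(value)
print(norm.mean, norm.std, norm.normalize(9.0), norm.scale_only(-4.0))
```

Because each new sample moves the statistics by less and less, the normalization drifts more slowly as training goes on.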
## Policy gradient diagnostics
1. Look at entropy really carefully
- Entropy in ACTION space
- We care more about entropy in state space, but there are no good methods for calculating it.
- If it goes down too fast, the policy is becoming deterministic and will not explore.
- If it is NOT going down, the policy won't be good because it is still really random.
- Can fix by:
- KL penalty
- Keep entropy from decreasing too quickly.
- Add entropy bonus.
- How to measure entropy.
- For most policies can compute entropy analytically.
- If continuous, it's usually a Gaussian, so can compute differential entropy.
2. Look at KL divergence
- Look at size of updates in terms of KL divergence.
- example:
- If KL is .01 then very small.
- If 10 then too much.
3. Baseline explained variance.
- See if the value function is actually a good predictor of the returns.
- If explained variance is negative, the value function might be overfitting or too noisy.
- You likely need to tune hyperparameters.
4. Initialize policy
- Very important (more so than in supervised learning).
- Zero or tiny final layer to maximize entropy
- Maximize random exploration in the beginning
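The diagnostics above can be computed with a few short functions. This is a sketch using standard textbook formulas, not code from the talk: the function names are hypothetical, the KL helper assumes the new policy gives non-zero probability wherever the old one does, and explained variance is taken here as 1 - Var[returns - predictions] / Var[returns].

```python
import math
import statistics

def categorical_entropy(probs):
    """Entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def gaussian_entropy(stds):
    """Differential entropy of a diagonal Gaussian with per-dimension stds."""
    return sum(0.5 * math.log(2 * math.pi * math.e * s * s) for s in stds)

def categorical_kl(old_probs, new_probs):
    """KL(old || new) between two discrete action distributions."""
    return sum(p * math.log(p / q)
               for p, q in zip(old_probs, new_probs) if p > 0)

def explained_variance(predicted, returns):
    """1 is a perfect baseline, 0 is no better than a constant, < 0 is worse."""
    var_returns = statistics.pvariance(returns)
    if var_returns == 0:
        return float("nan")
    residuals = [r - v for r, v in zip(returns, predicted)]
    return 1.0 - statistics.pvariance(residuals) / var_returns

# Invented numbers for illustration.
old_policy = [0.25, 0.25, 0.25, 0.25]
new_policy = [0.40, 0.30, 0.20, 0.10]
print("entropy old/new:", categorical_entropy(old_policy),
      categorical_entropy(new_policy))
print("KL of update:", categorical_kl(old_policy, new_policy))
print("explained variance:",
      explained_variance([1.1, 1.9, 3.2, 3.8], [1.0, 2.0, 3.0, 4.0]))
```

Logging these once per update makes it easy to spot entropy collapsing, an update that is too large in KL terms, or a baseline that has stopped predicting returns.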
## Q-Learning Strategies
1. Be careful about replay buffer memory usage.
- You might need a huge buffer, so adapt code accordingly.
2. Play with learning rate schedule.
3. If it converges slowly or has a slow warm-up period in the beginning
- Be patient... DQN converges VERY slowly.
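A quick back-of-the-envelope calculation shows why replay buffer memory needs care. The sketch below assumes 84x84 single-channel frames (a common Atari preprocessing choice, not stated in these notes) and a hypothetical helper `replay_buffer_bytes`; it ignores actions, rewards, and any frame stacking.

```python
def replay_buffer_bytes(num_frames, height, width, channels=1, bytes_per_value=1):
    """Raw memory needed to store `num_frames` frames."""
    return num_frames * height * width * channels * bytes_per_value

GIB = 1024 ** 3
as_uint8 = replay_buffer_bytes(1_000_000, 84, 84)                       # 1 byte per pixel
as_float32 = replay_buffer_bytes(1_000_000, 84, 84, bytes_per_value=4)  # 4 bytes per pixel
print(f"uint8:   {as_uint8 / GIB:.1f} GiB")
print(f"float32: {as_float32 / GIB:.1f} GiB")
```

Storing frames as 8-bit integers and converting to floats only when sampling a batch is one way to keep a large buffer within memory.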
## Bonus from [Andrej Karpathy](http://cs.stanford.edu/people/karpathy/):
1. A good feature can be to take the difference between two frames.
- This delta vector can highlight slight state changes otherwise difficult to distinguish.
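The frame-difference feature can be sketched in a few lines. This example uses plain nested lists and a hypothetical `frame_delta` helper; with an array library, frames stored as unsigned 8-bit values would need casting to a signed type before subtracting so that negative differences don't wrap around.

```python
def frame_delta(previous_frame, current_frame):
    """Per-pixel difference between two equally sized 2-D frames."""
    return [[cur - prev for prev, cur in zip(prev_row, cur_row)]
            for prev_row, cur_row in zip(previous_frame, current_frame)]

# Invented 3x3 frames: a single bright pixel moves one step to the right.
frame_t0 = [[0, 0, 0],
            [1, 0, 0],
            [0, 0, 0]]
frame_t1 = [[0, 0, 0],
            [0, 1, 0],
            [0, 0, 0]]
delta = frame_delta(frame_t0, frame_t1)
for row in delta:
    print(row)
```

Static background cancels to zero, so only the pixels that changed carry a signal.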