mirror of
https://github.com/wassname/ml-debug.git
synced 2026-06-27 22:07:42 +08:00
initial: ML debugging folklore skill
Deep research to uplift LLMs for ML debugging, opinionated by source selection. Distilled from Schulman, Jones, Rahtz, Goodfellow, CS231n, FSDL, and more. Includes runnable diagnostic scripts and LLM-specific anti-patterns. Author: wassname (https://github.com/wassname)
This commit is contained in:
Symlink
+1
@@ -0,0 +1 @@
|
||||
/media/wassname/SGIronWolf/projects5/2026/far_ai/dlbooks
|
||||
@@ -0,0 +1,679 @@
|
||||
Source: http://amid.fish/reproducing-deep-rl
|
||||
Title: Lessons Learned Reproducing a Deep Reinforcement Learning Paper - Matthew Rahtz (2018)
|
||||
Fetched-via: uvx markitdown http://amid.fish/reproducing-deep-rl
|
||||
Fetch-status: verbatim
|
||||
|
||||
|
||||
|
||||
# Lessons Learned Reproducing a Deep Reinforcement Learning Paper
|
||||
|
||||
Apr 6, 2018
|
||||
|
||||
There are a lot of neat things going on in deep reinforcement learning. One of
|
||||
the coolest things from last year was OpenAI and DeepMind’s work on training an
|
||||
agent using feedback from a human rather than a classical reward signal.
|
||||
There’s a great blog post about it at [Learning from Human
|
||||
Preferences](https://blog.openai.com/deep-reinforcement-learning-from-human-preferences/),
|
||||
and the original paper is at [Deep Reinforcement Learning from Human
|
||||
Preferences](https://arxiv.org/pdf/1706.03741.pdf).
|
||||
|
||||

|
||||
|
||||
Learn some deep reinforcement learning, and you too can train a noodle to do backflip. From [Learning from Human Preferences](https://blog.openai.com/deep-reinforcement-learning-from-human-preferences/).
|
||||
|
||||
I’ve seen a few recommendations that reproducing papers is a good way of
|
||||
levelling up machine learning skills, and I decided this could be an
|
||||
interesting one to try with. It was indeed a [super fun
|
||||
project](https://github.com/mrahtz/learning-from-human-preferences), and I’m
|
||||
happy to have tackled it - but looking back, I realise it wasn’t exactly the
|
||||
experience I thought it would be.
|
||||
|
||||
If you’re thinking about reproducing papers too, here are some notes on what
|
||||
surprised me about working with deep RL.
|
||||
|
||||
---
|
||||
|
||||
First, in general, **reinforcement learning turned out to be a lot trickier
|
||||
than expected**.
|
||||
|
||||
A big part of it is that right now, reinforcement learning is really sensitive.
|
||||
There are a lot of details to get *just* right, and if you don’t get them
|
||||
right, it can be difficult to diagnose where you’ve gone wrong.
|
||||
|
||||
Example 1: after finishing the basic implementation, training runs just weren’t
|
||||
succeeding. I had all sorts of ideas about what the problem might be, but after
|
||||
a couple of months of head scratching, it turned out to be because of problems
|
||||
with normalization of rewards and pixel data at a key stage[1](#fn:normproblems).
|
||||
Even with the benefit of hindsight, there were no obvious clues pointing in
|
||||
that direction: the accuracy of the reward predictor network the pixel data
|
||||
went into was just fine, and it took a long time to occur to me to examine the
|
||||
rewards predicted carefully enough to notice the reward normalization bug.
|
||||
Figuring out what the problem was happened almost accidentally, through noticing a
small inconsistency that eventually led to the right path.
|
||||
|
||||
Example 2: doing a final code cleanup, I realised I’d implemented dropout kind
|
||||
of wrong. The reward predictor network takes as input a pair of video clips,
|
||||
each processed identically by two networks with shared weights. If you add
|
||||
dropout and you’re not careful about giving it the same random seed in each
|
||||
network, you’ll drop out differently for each network, so the video clips won’t
|
||||
be processed identically. As it turned out, though, fixing it completely broke
|
||||
training, despite prediction accuracy of the network looking exactly the same!
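
The dropout pitfall described above can be illustrated with a minimal sketch in plain Python. This is not the project's TensorFlow code: `dropout_mask`, `apply_dropout` and the toy `clip_features` are hypothetical names chosen for illustration. It only shows the consistency issue (independent masks versus one shared mask) and says nothing about which variant trains better.

```python
import random

def dropout_mask(n_units, keep_prob, rng):
    """Sample one keep/drop decision per unit (True = keep)."""
    return [rng.random() < keep_prob for _ in range(n_units)]

def apply_dropout(features, mask, keep_prob):
    """Inverted dropout: zero the dropped units and rescale the kept ones."""
    return [x / keep_prob if keep else 0.0 for x, keep in zip(features, mask)]

keep_prob = 0.5
clip_features = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # toy features, fed to both branches

# Careless version: each branch draws its own mask from its own random stream,
# so the two branches will generally drop different units.
mask_a = dropout_mask(len(clip_features), keep_prob, random.Random(0))
mask_b = dropout_mask(len(clip_features), keep_prob, random.Random(1))

# Consistent version: sample one mask and reuse it in both branches,
# so identical inputs are guaranteed to be processed identically.
shared_mask = dropout_mask(len(clip_features), keep_prob, random.Random(0))
branch_1 = apply_dropout(clip_features, shared_mask, keep_prob)
branch_2 = apply_dropout(clip_features, shared_mask, keep_prob)

print("independent masks equal:", mask_a == mask_b)
print("shared-mask branches equal:", branch_1 == branch_2)
```

In a real graph-based framework the equivalent is to make sure both branches consume the same mask tensor (or the same seed), rather than each creating its own dropout op.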
|
||||
|
||||

|
||||
|
||||
Spot which one is broken. Yeah, I don't see it either.
|
||||
|
||||
I get the impression this is a pretty common story (e.g. [Deep Reinforcement
|
||||
Learning Doesn’t Work Yet](https://www.alexirpan.com/2018/02/14/rl-hard.html)).
|
||||
My takeaway is that, starting a reinforcement learning project, you should
|
||||
**expect to get stuck like you get stuck on a math problem**. It’s not like my
|
||||
experience of programming in general so far where you get stuck but there’s
|
||||
usually a clear trail to follow and you can get unstuck within a couple of days
|
||||
at most. It’s more like when you’re trying to solve a puzzle, there are no
|
||||
clear inroads into the problem, and the only way to proceed is to try things
|
||||
until you find the key piece of evidence or get the key spark that lets you
|
||||
figure it out.
|
||||
|
||||
A corollary is to **try and be as sensitive as possible in noticing
|
||||
confusion**.
|
||||
|
||||
There were a lot of points in this project where the only clues came from
|
||||
noticing some small thing that didn’t make sense. For example, at some point it
|
||||
turned out that taking the difference between frames as features made things
|
||||
work much better. It was tempting to just forge ahead with the new features,
|
||||
but I realised I was confused about *why* it made such a big difference for the
|
||||
simple environment I was working with back then. It was only following that
confusion, and realising that taking the difference between frames zeroed out
the background, that gave the hint of a problem with normalization.
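
To make the frame-difference observation concrete, here is a small NumPy sketch with made-up pixel values (a constant background of 100 and a dot of 255; neither number comes from the actual environment). It shows how differencing cancels a static background exactly, which is why features built from differences can mask a scaling problem in the raw pixels.

```python
import numpy as np

# Two consecutive 5x5 toy frames: a constant grey background (value 100)
# with a single bright "dot" (value 255) that moves one pixel to the right.
background = np.full((5, 5), 100, dtype=np.int16)
frame_t0 = background.copy()
frame_t1 = background.copy()
frame_t0[2, 1] = 255
frame_t1[2, 2] = 255

# Difference between frames: the static background cancels to exactly zero,
# and only the pixels where the dot was, or now is, remain non-zero.
diff = frame_t1 - frame_t0

nonzero_raw = np.count_nonzero(frame_t1)   # every pixel of the raw frame is non-zero
nonzero_diff = np.count_nonzero(diff)      # only the two dot positions survive
print(nonzero_raw, nonzero_diff)
```

If a network behaves much better on `diff` than on the raw frames, the large constant offset in the raw input is a natural suspect.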
|
||||
|
||||
I’m not entirely sure how to make one’s mind do more of this, but my best
|
||||
guesses at the moment are:
|
||||
|
||||
* Learn to **recognise what confusion *feels* like**. There are a lot of
|
||||
different shades of the “something’s not quite right” feeling. Sometimes it’s
|
||||
code you know is ugly. Sometimes it’s worry about wasting time on the wrong
|
||||
thing. But sometimes it’s that *you’ve seen something you didn’t expect*:
|
||||
confusion. Being able to recognise that exact shade of discomfort is
|
||||
important, so that you can…
|
||||
* Develop the habit of following through on confusion. There are some
|
||||
sources of discomfort that it can be better to ignore in the moment (e.g.
|
||||
code smell while prototyping), but confusion isn’t one of them. It seems
|
||||
important to really **commit yourself to *always* investigate whenever you
|
||||
notice confusion**.
|
||||
|
||||
In any case: expect to get stuck for several weeks at a time. (And have
|
||||
confidence you will be able to get to the other side if you keep at it, paying
|
||||
attention to those small details.)
|
||||
|
||||
---
|
||||
|
||||
Speaking of differences to past programming experiences, a second major
|
||||
learning experience was the **difference in mindset required for working with
|
||||
long iteration times**.
|
||||
|
||||
Debugging seems to involve four basic steps:
|
||||
|
||||
* Gather evidence about what the problem might be.
|
||||
* Form hypotheses about the problem based on the evidence you have so far.
|
||||
* Choose the most likely hypothesis, implement a fix, and see what happens.
|
||||
* Repeat until the problem goes away.
|
||||
|
||||
In most of the programming I’ve done before, I’ve been used to rapid feedback.
|
||||
If something doesn’t work, you can make a change and see what difference it
|
||||
makes within seconds or minutes. Gathering evidence is very cheap.
|
||||
|
||||
In fact, in rapid-feedback situations, gathering evidence can be a lot cheaper
|
||||
than forming hypotheses. Why spend 15 minutes carefully considering everything
|
||||
that could be causing what you see when you can check the first idea that jumps
|
||||
to mind in a fraction of that (and gather more evidence in the process)? To put
|
||||
it another way: if you have rapid feedback, you can narrow down the hypothesis
|
||||
space a lot faster by trying things than thinking carefully.
|
||||
|
||||
If you keep that strategy when each run takes 10 hours, though, you can easily
|
||||
waste a *lot* of time. Last run didn’t work? OK, I think it’s this thing. Let’s
|
||||
set off another run to check. Coming back the next morning: still doesn’t work?
|
||||
OK, maybe it’s this other thing. Let’s set off another run. A week later, you
|
||||
still haven’t solved the problem.
|
||||
|
||||
Doing multiple runs at the same time, each trying a different thing, can help
|
||||
to some extent, but a) unless you have access to a cluster you can end up
|
||||
racking up a lot of costs on cloud compute (see below), and b) because of the
|
||||
kinds of difficulties with reinforcement learning mentioned above, if you try
|
||||
to iterate too quickly, you might never realise what kind of evidence you
|
||||
actually need.
|
||||
|
||||
Switching from **experimenting a lot and thinking a little** to **experimenting
|
||||
a little and thinking a lot** was a key turnaround in productivity. When
|
||||
debugging with long iteration times, you really need to *pour* time into the
|
||||
hypothesis-forming step - thinking about what all the possibilities are, how
|
||||
likely they seem on their own, and how likely they seem in light of everything
|
||||
you’ve seen so far. Spend as much time as you need, even if it takes 30
|
||||
minutes, or an hour. Reserve experiments for once you’ve fleshed out the
|
||||
hypothesis space as thoroughly as possible and know which pieces of evidence
|
||||
would allow you to best distinguish between the different possibilities.
|
||||
|
||||
(It’s especially important to be deliberate about this if you’re working on
|
||||
something as a side project. If you’re only working on it for an hour a day and
|
||||
each iteration takes a day to run, the number of runs you can do per week ends
|
||||
up feeling like a precious commodity you have to make the most of. It’s easy to
|
||||
then feel a sense of pressure to spend your working hour each day rushing to
|
||||
figure out something to do for that day’s run. Another turnaround was being
|
||||
willing to spend several days just *thinking*, not starting any runs, until I
|
||||
felt really confident I had a strong hypothesis about what the problem was.)
|
||||
|
||||
A key enabler of the switch to thinking more was **keeping a much more detailed
|
||||
work log**. Working without a log is fine when each chunk of progress takes
|
||||
less than a few hours, but anything longer than that and it’s easy to forget
|
||||
what you’ve tried so far and end up just going in circles. The log format I
|
||||
converged on was:
|
||||
|
||||
* Log 1: what specific output am I working on right now?
|
||||
* Log 2: thinking out loud - e.g. hypotheses about the current problem, what to
|
||||
work on next
|
||||
* Log 3: record of currently ongoing runs along with a short reminder of what
|
||||
question each run is supposed to answer
|
||||
* Log 4: results of runs (TensorBoard graphs, any other significant
|
||||
observations), separated by type of run (e.g. by environment the agent is
|
||||
being trained in)
|
||||
|
||||
I started out with relatively sparse logs, but towards the end of the project
|
||||
my attitude moved more towards “log absolutely everything going through my
|
||||
head”. The overhead was significant, but I think it was worth it - partly
|
||||
because some debugging required cross-referencing results and thoughts that
|
||||
were days or weeks apart, and partly for (at least, this is my impression)
|
||||
general improvements in thinking quality from the massive upgrade to effective
|
||||
mental RAM.
|
||||
|
||||

|
||||
|
||||
A typical day's log.
|
||||
|
||||
---
|
||||
|
||||
In terms of **getting the most out of the experiments you do run**, there are
|
||||
two things I started experimenting with towards the end of the project which
|
||||
seem like they could be helpful in the future.
|
||||
|
||||
First, adopting an attitude of **log all the metrics you can** to maximise the
|
||||
amount of evidence you gather on each run. There are obvious metrics like
|
||||
training/validation accuracy, but it might also be worth spending a good chunk
|
||||
of time at the start of the project brainstorming and researching which other
|
||||
metrics might be important for diagnosing potential problems.
|
||||
|
||||
I might be making this recommendation partly out of hindsight bias where I
|
||||
*know* which metrics I should have started logging earlier. It’s hard to
|
||||
predict which metrics will be useful in advance. Still, heuristics that might
|
||||
be useful are:
|
||||
|
||||
* For every important component in the system, consider what *can* be measured
|
||||
about it. If there’s a database, measure how quickly it’s growing in size.
|
||||
If there’s a queue, measure how quickly items are being processed.
|
||||
* For every complex procedure, measure how long different parts of it take. If
|
||||
you’ve got a training loop, measure how long each batch takes to run. If
|
||||
you’ve got a complex inference procedure, measure how long each sub-inference
|
||||
takes. Those times are going to help a lot for performance debugging later
|
||||
on, and can sometimes reveal bugs that are otherwise hard to spot. (For
|
||||
example, if you see something taking longer and longer, it might be because
|
||||
of a memory leak.)
|
||||
* Similarly, consider profiling memory usage of different components. Small
|
||||
memory leaks can be indicative of all sorts of things.
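
The timing and memory heuristics in the list above can be sketched with the Python standard library alone. `train_on_batch` is a hypothetical stand-in for a real training step, and the printed format is just one possible choice; in practice the numbers would go to whatever metrics dashboard is in use.

```python
import time
import tracemalloc

def train_on_batch(batch):
    """Hypothetical stand-in for one training step."""
    return sum(x * x for x in batch)

tracemalloc.start()
batch_times = []
memory_samples = []

for step in range(5):
    batch = list(range(10_000))
    start = time.perf_counter()
    train_on_batch(batch)
    batch_times.append(time.perf_counter() - start)
    # Currently traced allocation in bytes; steady growth across steps can hint at a leak.
    current_bytes, _peak_bytes = tracemalloc.get_traced_memory()
    memory_samples.append(current_bytes)

tracemalloc.stop()

for step, (seconds, mem) in enumerate(zip(batch_times, memory_samples)):
    print(f"step={step} seconds_per_batch={seconds:.6f} traced_bytes={mem}")
```

Note that `tracemalloc` only sees allocations made through Python's allocator, so memory held by native libraries would need a process-level measurement instead.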
|
||||
|
||||
Another strategy is to look at what other people are measuring. In the context
|
||||
of deep reinforcement learning, John Schulman has some good tips in his [Nuts
|
||||
and Bolts of Deep RL talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ)
|
||||
([slides](http://joschu.net/docs/nuts-and-bolts.pdf); [summary
|
||||
notes](https://github.com/williamFalcon/DeepRLHacks)). For policy gradient
|
||||
methods, I’ve found policy entropy in particular to be a good indicator of
|
||||
whether training is going anywhere - much more sensitive than per-episode
|
||||
rewards.
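
For reference, policy entropy for a discrete action space can be computed from the policy's action logits as below. This is a generic sketch, not code from the project; the function names and example logits are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of action logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_entropy(logits):
    """Shannon entropy (in nats) of the action distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Uniform over 4 actions: the maximum possible entropy, ln(4).
uniform = policy_entropy([0.0, 0.0, 0.0, 0.0])
# Random choice among a subset of 2 actions: entropy stuck near ln(2).
subset = policy_entropy([0.0, 0.0, -50.0, -50.0])
# Same action every time: entropy collapses towards zero.
peaked = policy_entropy([50.0, 0.0, 0.0, 0.0])

print(uniform, subset, peaked)
```

The `subset` and `peaked` cases correspond to the two kinds of unhealthy behaviour worth watching for: entropy flat-lining at a constant value, and entropy collapsing to zero.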
|
||||
|
||||

|
||||
|
||||
Examples of unhealthy and healthy
|
||||
policy entropy graphs. Failure mode 1 (left): convergence to constant entropy (random choice among a subset of actions). Failure mode 2 (centre): convergence to zero entropy (choosing the same action every time). Right: policy entropy from a successful Pong training run.
|
||||
|
||||
When you do see something suspicious in the metrics you’ve recorded, remember to
*notice confusion*: err on the side of assuming it’s something important rather
|
||||
than just e.g. an inefficient implementation of some data structure. (I missed
|
||||
a multithreading bug for several months by ignoring a small but mysterious
|
||||
decay in frames per second.)
|
||||
|
||||
Debugging is much easier if you can see all your metrics in one place. I like
|
||||
to have as much as possible on TensorBoard. Logging arbitrary metrics with
|
||||
TensorFlow can be awkward, though, so **consider checking out
|
||||
[easy-tf-log](https://github.com/mrahtz/easy-tf-log)**, which provides an easy
|
||||
`tflog(key, value)` interface without any extra setup.
|
||||
|
||||
A second thing that seems promising for getting more out of runs is
|
||||
**taking the time to try and predict failure in advance**.
|
||||
|
||||
Thanks to hindsight bias, failures often seem obvious in retrospect. But the
|
||||
*really* frustrating thing is when the failure mode is obvious *before you’ve
|
||||
even observed what it was*. You know when you’ve set off a run, you come back
|
||||
the next day, you see it’s failed, and even before you’ve investigated, you
|
||||
realise, “Oh, it must have been because I forgot to set the frobulator”? That’s
|
||||
what I’m talking about.
|
||||
|
||||
The neat thing is that sometimes you can trigger that kind of
|
||||
half-hindsight-realisation in advance. It does take conscious effort, though -
|
||||
really stopping for a good five minutes before launching a run to think about
|
||||
what might go wrong. The particular script I found most helpful to go through
|
||||
was: [2](#fn:murphyjitsu)
|
||||
|
||||
1. Ask yourself, “How surprised would I be if this run failed?”
|
||||
2. If the answer is ‘not very surprised’, put yourself in the shoes of
|
||||
future-you where the run *has* failed, and ask, “If I’m here, what might
|
||||
have gone wrong?”
|
||||
3. Fix whatever comes to mind.
|
||||
4. Repeat until the answer to question 1 is “very surprised” (or at least “as
|
||||
surprised as I can get”).
|
||||
|
||||
There are always going to be failures you couldn’t have predicted, and
|
||||
sometimes you still miss obvious things, but this does at least seem to *cut
|
||||
down* on the number of times something fails in a way you feel *really* stupid
|
||||
for not having thought of earlier.
|
||||
|
||||
---
|
||||
|
||||
Finally, though, **the biggest surprise with this project was just how long it
|
||||
took** - and, relatedly, the amount of compute resources it needed.
|
||||
|
||||
The first surprise was in terms of calendar time. My original estimate was that
|
||||
as a side project it would take about 3 months. It actually took around *8
|
||||
months*. (And the original estimate was supposed to be pessimistic!) Some of
|
||||
that was down to underestimating how many hours each stage would take, but a
|
||||
big chunk of the underestimate was failing to anticipate other things coming up
|
||||
outside the project. It’s hard to say how well this generalises, but **for
|
||||
side projects, taking your original (already pessimistic) time estimates and
|
||||
doubling them** might not be a bad rule-of-thumb.
|
||||
|
||||
The more interesting surprise was in how many hours each stage actually took.
|
||||
The main stages of my initial project plan were basically:
|
||||
|
||||

|
||||
|
||||
Here’s how long each stage *actually* took.
|
||||
|
||||

|
||||
|
||||
It wasn’t writing code that took a long time - it was debugging it. In fact,
|
||||
getting it working on even a [supposedly-simple
|
||||
environment](https://github.com/mrahtz/gym-moving-dot) took *four times* as
|
||||
long as initial implementation. (This is the first side project where I’ve been
|
||||
keeping track of hours, but experiences with past machine learning projects
|
||||
have been similar.)
|
||||
|
||||
(Side note: be careful about designing from scratch what you hope should be an
|
||||
‘easy’ environment for reinforcement learning. In particular, think carefully
|
||||
about a) whether your rewards really convey the right information to be able to
|
||||
solve the task - yes, this is easy to mess up - and b) whether rewards depend
|
||||
only on previous observations or also on current action. The latter, in
|
||||
particular, might be relevant if you’re doing any kind of reward prediction,
|
||||
e.g. with a critic.)
|
||||
|
||||
**Another surprise was the amount of compute time needed.** I was lucky to have
|
||||
access to my university’s cluster - only CPU machines, but that was fine for
|
||||
some tasks. For work which needed a GPU (e.g. to iterate quickly on some small
|
||||
part) or when the cluster was too busy, I experimented with two cloud services:
|
||||
VMs on [Google Cloud Compute
|
||||
Engine](https://console.cloud.google.com/projectselector/compute/instances?supportedpurview=project),
|
||||
and [FloydHub](http://floydhub.com/).
|
||||
|
||||
Compute Engine is fine if you just want shell access to a GPU machine, but I
|
||||
tried to do as much as possible on FloydHub. FloydHub is basically a cloud
|
||||
compute service targeted at machine learning. You run `floyd run python
|
||||
awesomecode.py` and FloydHub sets up a container, uploads your code to it, and
|
||||
runs the code. The two key things which make FloydHub awesome are:
|
||||
|
||||
* Containers come preinstalled with GPU drivers and common libraries. (Even in
|
||||
2018, I wasted a good few hours fiddling with CUDA versions while upgrading
|
||||
TensorFlow on the Compute Engine VM.)
|
||||
* Each run is automatically archived. For each run, the code used, the exact
|
||||
command used to start the run, any command-line output, and any data outputs
|
||||
are saved automatically, and indexed through a web interface.
|
||||
|
||||
[](images/floydhub.png)
|
||||
|
||||
FloydHub's web interface. Top: index of past runs,
|
||||
and overview of a single run. Bottom: both the code used for each run and any
|
||||
data output from the run are automatically archived.
|
||||
|
||||
I can’t stress enough how important that second feature is. For any project
|
||||
this long, detailed records of what you’ve tried and the ability to reproduce
|
||||
past experiments are an absolute must. Version control software can help, but
|
||||
a) managing large outputs can be painful, and b) requires extreme diligence.
|
||||
(For example, if you’ve set off some runs, then make a small change and launch
|
||||
another run, when you commit the results of the first runs, is it going to be
|
||||
clear which code was used?) You could take careful notes or roll your own
|
||||
system, but with FloydHub, *it just works* and you save *so* much mental
|
||||
energy.
|
||||
|
||||
(Update: check out some example FloydHub runs at
|
||||
<https://www.floydhub.com/mrahtz/projects/learning-from-human-preferences>.)
|
||||
|
||||
Other things I like about FloydHub are:
|
||||
|
||||
* Containers are automatically shut down once the run is finished. Not having
|
||||
to worry about checking runs to see whether they’ve finished and the VM can
|
||||
be turned off is a big relief.
|
||||
* Billing is much more straightforward than with cloud VMs. You pay for usage
|
||||
in, say, 10-hour blocks, and you’re charged immediately. That makes keeping
|
||||
weekly budgets much easier.
|
||||
|
||||
The one pain point I’ve had with FloydHub is that you can’t customize
|
||||
containers. If your code has a lot of dependencies, you’ll need to install them
|
||||
at the start of every run. That limits the rate at which you can iterate on
|
||||
short runs. You *can* get around this, though, by creating a ‘dataset’ which
|
||||
contains the changes to the filesystem from installing dependencies, then
|
||||
copying files from that dataset at the start of each run (e.g.
|
||||
[`create_floyd_base.sh`](https://github.com/mrahtz/learning-from-human-preferences/blob/master/floydhub_utils/create_floyd_base.sh)).
|
||||
It’s awkward, but still probably less awkward than having to deal with GPU
|
||||
drivers.
|
||||
|
||||
FloydHub is a little more expensive than Compute Engine: as of writing,
|
||||
$1.20/hour for a machine with a K80 GPU, compared to about $0.85/hour for a
|
||||
similarly-specced VM (though less if you don’t need as much as 61 GB of RAM).
|
||||
Unless your budget is really limited, I think the extra convenience of FloydHub
|
||||
is worth it. The only case where Compute Engine can be a lot cheaper is doing a
|
||||
lot of runs in parallel, which you can stack up on a single large VM.
|
||||
|
||||
(A third option is Google’s new
|
||||
[Colaboratory](https://colab.research.google.com) service, which gives you a
|
||||
hosted Jupyter notebook with free access to a single K80 GPU. Don’t be put off
|
||||
by Jupyter: you can execute arbitrary commands, and set up shell access if you
|
||||
really want it. The main drawbacks are that your code doesn’t keep running if
|
||||
you close the browser window, and there are time limits on how long you can run
|
||||
before the container hosting the notebook gets reset. So it’s not suitable for
|
||||
doing long runs, but can be useful for quick prototyping on a GPU.)
|
||||
|
||||
In total, the project took:
|
||||
|
||||
* **150 hours of GPU time and 7,700 hours (wall time × cores) of CPU time** on
|
||||
Compute Engine,
|
||||
* **292 hours of GPU time** on FloydHub,
|
||||
* and **1,500 hours (wall time, 4 to 16 cores) of CPU time** on my university’s
|
||||
cluster.
|
||||
|
||||
I was horrified to realise that in total, that added up to **about $850** ($200
|
||||
on FloydHub, $650 on Compute Engine) over the 8 months of the project.
|
||||
|
||||
Some of that’s down to me being ham-fisted (see the above section on mindset
|
||||
for slow iteration). Some of it’s down to the fact that reinforcement learning
|
||||
is still so sample-inefficient that runs do just take a long time (up to 10
|
||||
hours to train a Pong agent that beats the computer every time).
|
||||
|
||||
But a big chunk of it was down to a horrible surprise I had during the final
|
||||
stages of the project: **reinforcement learning can be so unstable that you
|
||||
need to repeat every run multiple times with different seeds to be confident**.
|
||||
|
||||
For example, once I thought everything was basically working, I sat down to
|
||||
make end-to-end tests for the environments I’d been working with. But I was
|
||||
having trouble getting even the simplest environment I’d been working with,
|
||||
[training a dot to move to the centre of a
|
||||
square](https://github.com/mrahtz/gym-moving-dot), to train successfully. I
|
||||
went back to the FloydHub job that had originally worked and re-ran three
|
||||
copies. It turned out that the hyperparameters I thought were fine actually
|
||||
only succeeded one out of three times.
|
||||
|
||||

|
||||
|
||||
It's not uncommon for two out of three random seeds (red/blue) to fail.
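
One way to make the "repeat every run with different seeds" habit routine is a small harness like the sketch below. `train_once` is a hypothetical placeholder that merely simulates a run succeeding about one time in three; a real version would launch training with the given seed and apply some success criterion.

```python
import random

def train_once(seed):
    """Hypothetical stand-in for a full training run; returns True on success.

    Success is simulated with a fixed 1-in-3 chance purely for illustration.
    """
    rng = random.Random(seed)
    return rng.random() < 1 / 3

def success_rate(seeds):
    """Run the same configuration once per seed and report the fraction that succeed."""
    results = {seed: train_once(seed) for seed in seeds}
    rate = sum(results.values()) / len(results)
    return results, rate

results, rate = success_rate(seeds=[0, 1, 2])
print(results, f"success rate: {rate:.2f}")
```

Recording the per-seed outcome, rather than only the best run, is what exposes a configuration that works only some of the time.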
|
||||
|
||||
To give a visceral sense of how much compute that means you need:
|
||||
|
||||
* Using A3C with 16 workers, Pong would take about 10 hours to train.
|
||||
* That’s 160 hours of CPU time.
|
||||
* Running 3 random seeds, that’s 480 hours (20 days) of CPU time.
|
||||
|
||||
In terms of costs:
|
||||
|
||||
* FloydHub charges about $0.50 per hour for an 8-core machine.
|
||||
* So 10 hours costs about $5 per run.
|
||||
* **Running 3 different random seeds at the same time, that’s $15 per experiment.**
|
||||
|
||||
**That’s, like, 3 sandwiches every time you want to test an idea.**
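
The arithmetic behind those two lists, written out so the figures can be re-run with other assumptions. The numbers are the ones quoted above; treating the cost as one machine per run for 10 hours is an assumption read off the list, not a statement about FloydHub's pricing model.

```python
workers = 16
hours_per_run = 10
seeds = 3

cpu_hours_per_run = workers * hours_per_run        # 16 workers x 10 h
cpu_hours_all_seeds = cpu_hours_per_run * seeds    # repeated for 3 seeds
cpu_days_all_seeds = cpu_hours_all_seeds / 24

price_per_machine_hour = 0.50                      # USD, the 8-core figure quoted above
cost_per_run = price_per_machine_hour * hours_per_run
cost_all_seeds = cost_per_run * seeds

print(cpu_hours_per_run, cpu_hours_all_seeds, cpu_days_all_seeds)
print(cost_per_run, cost_all_seeds)
```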
|
||||
|
||||
Again, from [Deep Reinforcement Learning Doesn’t Work
|
||||
Yet](https://www.alexirpan.com/2018/02/14/rl-hard.html), that kind of
|
||||
instability seems normal and accepted right now. In fact, even “Five random
|
||||
seeds (a common reporting metric) may not be enough to argue significant
|
||||
results, since with careful selection you can get non-overlapping confidence
|
||||
intervals.”
|
||||
|
||||
(All of a sudden the $25,000 of AWS credits that the [OpenAI Scholars
|
||||
programme](https://blog.openai.com/openai-scholars/) provides doesn’t seem
|
||||
quite so crazy. That probably *is* about the amount you need to give someone so
|
||||
that compute isn’t a worry at all.)
|
||||
|
||||
My point here is that **if you want to tackle a deep reinforcement learning
|
||||
project, make sure you know what you’re getting yourself into**. Make sure
|
||||
you’re prepared for how much time it could take and how much it might cost.
|
||||
|
||||
---
|
||||
|
||||
Overall, reproducing a reinforcement learning paper was a fun side project to
|
||||
try. But looking back, thinking about which skills it actually levelled up, I’m
|
||||
also wondering whether reproducing a paper was really the best use of time over
|
||||
the past months.
|
||||
|
||||
On one hand, I definitely feel like my machine learning *engineering* ability
|
||||
improved a lot. I feel more confident in being able to recognise common RL
|
||||
implementation mistakes; my workflow got a whole lot better; and from this
|
||||
particular paper I got to learn a bunch about Distributed TensorFlow and
|
||||
asynchronous design in general.
|
||||
|
||||
On the other hand, I don’t feel like my machine learning *research* ability
|
||||
improved much (which is, in retrospect, what I was actually aiming for). Rather
|
||||
than implementation, the much more difficult part of research seems to be
|
||||
coming up with ideas that are interesting but also *tractable and concrete*;
|
||||
ideas which give you the best bang-for-your-buck for the time you *do* spend
|
||||
implementing. Coming up with interesting ideas seems to be a matter of a)
|
||||
having a large vocabulary of concepts to draw on, and b) having good ‘taste’
|
||||
for ideas (e.g. what kind of work is likely to be useful to the community). I
|
||||
think a better project for both of those might have been to, say, read
|
||||
influential papers and write summaries and critical analyses of them.
|
||||
|
||||
So I think my main meta-takeaway from this project is that **it’s worth
|
||||
thinking carefully whether you want to level up engineering skills or research
|
||||
skills**. Not that there’s no overlap; but if you’re particularly weak on one
|
||||
of them you might be better off with a project specifically targeting that one.
|
||||
|
||||
If you want to level up both, a better project might be to read papers until
|
||||
you find something you’re really interested in that comes with clean code, and
|
||||
then try to implement an extension to it.
|
||||
|
||||
---
|
||||
|
||||
If you *do* want to tackle a deep RL project, here are some more specific
|
||||
things to watch out for.
|
||||
|
||||
#### Choosing papers to reproduce
|
||||
|
||||
* Look for papers with few moving parts. Avoid papers which require multiple
|
||||
parts working together in coordination.
|
||||
|
||||
#### Reinforcement learning
|
||||
|
||||
* If you’re doing anything that involves an RL algorithm as a component in a
|
||||
larger system, don’t try and implement the RL algorithm yourself. It’s a fun
|
||||
challenge, and you’ll learn a lot, but RL is unstable enough at the moment
|
||||
that you’ll never be sure whether your system doesn’t work because of a bug
|
||||
in your RL implementation or because of a bug in your larger system.
|
||||
* Before doing anything, see how easily an agent can be trained on your
|
||||
environment with a baseline algorithm.
|
||||
* Don’t forget to normalize observations. *Everywhere* that observations might
|
||||
be being used. [3](#fn:norm2)
|
||||
* Write end-to-end tests as soon as you think you’ve got something working.
|
||||
Successful training can be more fragile than you expected.
|
||||
* If you’re working with OpenAI Gym environments, note that with `-v0`
|
||||
environments, 25% of the time, the current action is ignored and the previous
|
||||
action is repeated (to make the environment less deterministic). Use `-v4`
|
||||
environments if you don’t want that extra randomness. Also note that
|
||||
environments by default only give you every 4th frame from the emulator,
|
||||
matching the early DeepMind papers. Use `NoFrameskip` environments if you
|
||||
don’t want that. For a fully deterministic environment that gives you exactly
|
||||
what the emulator gives you, use e.g. `PongNoFrameskip-v4`.
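
On the "normalize observations everywhere" point in the list above, one common approach is a running mean/variance normalizer that is shared by every component consuming observations. The sketch below is a generic NumPy version, not the project's implementation; `RunningNormalizer` and the random "observations" are illustrative.

```python
import numpy as np

class RunningNormalizer:
    """Tracks a running mean and variance per observation dimension."""

    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 0
        self.eps = eps

    def update(self, batch):
        """Fold a batch of observations (first axis = batch) into the running statistics."""
        batch = np.asarray(batch, dtype=np.float64)
        batch_mean = batch.mean(axis=0)
        batch_var = batch.var(axis=0)
        batch_count = batch.shape[0]
        total = self.count + batch_count
        delta = batch_mean - self.mean
        # Standard formula for combining the mean/variance of two groups of samples.
        new_mean = self.mean + delta * batch_count / total
        m2 = (self.var * self.count + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / total)
        self.mean, self.var, self.count = new_mean, m2 / total, total

    def normalize(self, obs):
        return (np.asarray(obs, dtype=np.float64) - self.mean) / np.sqrt(self.var + self.eps)

rng = np.random.default_rng(0)
normalizer = RunningNormalizer(shape=(4,))
first = rng.uniform(0, 255, size=(100, 4))   # toy pixel-range observations
second = rng.uniform(0, 255, size=(50, 4))
normalizer.update(first)
normalizer.update(second)

all_obs = np.concatenate([first, second])
normalized = normalizer.normalize(all_obs)
print(normalized.mean(axis=0), normalized.std(axis=0))
```

The design point is that a single `normalizer` instance is passed to each consumer of observations, so no component can silently see raw values.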
|
||||
|
||||
#### General machine learning
|
||||
|
||||
* Because of how long end-to-end tests take to run, you’ll waste a lot of time
|
||||
if you have to do major refactoring later on. Err on the side of implementing
|
||||
things well the first time rather than hacking something up and saving
|
||||
refactoring for later.
|
||||
* Initialising a model can easily take ~20 seconds. That’s a painful amount of
|
||||
time to waste because of e.g. syntax errors. If you don’t like using IDEs, or
|
||||
you can’t because you’re editing on a server with only shell access, it’s
|
||||
worth investing the time to set up a linter for your editor. (For Vim, I like
|
||||
[ALE](https://github.com/w0rp/ale) with *both*
|
||||
[Pylint](https://www.pylint.org/) and
|
||||
[Flake8](http://flake8.pycqa.org/en/latest/). Though Flake8 is more of a
|
||||
style checker, it can catch some things that Pylint can’t, like wrong
|
||||
arguments to a function.) Either way, every time you hit a stupid error while
|
||||
trying to start a run, invest time in making your linter catch it in the
|
||||
future.
|
||||
* It’s not just dropout you have to be careful about implementing in networks
|
||||
with weight-sharing - it’s also batchnorm. Don’t forget there are
|
||||
normalization statistics and extra variables in the network to match.
|
||||
* Seeing regular spikes in memory usage while training? It might be that your
|
||||
validation batch size is too large.
|
||||
* If you’re seeing strange things when using Adam as an optimizer, it might be
|
||||
because of Adam’s momentum. Try using an optimizer without momentum like
|
||||
RMSprop, or disable Adam’s momentum by setting β1 to zero.
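
To make the last bullet concrete, here is a minimal scalar sketch of the standard Adam update, written from the published update rule rather than taken from any library. The function name `adam_step` and the `state` tuple are hypothetical. With `beta1=0.0` the first moment is just the current gradient, so no gradient history carries over into the direction of the step.

```python
import math


def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on a scalar parameter; `state` holds (m, v, t)."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, (m, v, t)


# beta1 = 0: the first moment equals the latest gradient, so a sign
# flip in the gradient shows up in full on the very next update.
param, state = 0.0, (0.0, 0.0, 0)
param, state = adam_step(param, 1.0, state, beta1=0.0)
param, state = adam_step(param, -1.0, state, beta1=0.0)
first_moment_no_momentum = state[0]

# Default beta1 = 0.9: the earlier gradient partly cancels the new one.
param, state = 0.0, (0.0, 0.0, 0)
param, state = adam_step(param, 1.0, state)
param, state = adam_step(param, -1.0, state)
first_moment_with_momentum = state[0]
```

Most Adam implementations expose this setting as `beta1` or as the first entry of a `betas` pair.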
|
||||
|
||||
#### TensorFlow
|
||||
|
||||
* If you want to debug what’s happening with some node buried deep in the
|
||||
middle of your graph, check out
|
||||
[`tf.Print`](https://www.tensorflow.org/api_docs/python/tf/Print), an
|
||||
identity operation which prints the value of its input every time the graph
|
||||
is run.
|
||||
* If you’re saving checkpoints only for inference, you can save a lot of space
|
||||
by omitting optimizer parameters from the set of variables that are saved.
|
||||
* `session.run()` can have a large overhead. Group up multiple calls in a batch
|
||||
wherever possible.
|
||||
* If you’re getting out-of-GPU-memory errors when trying to run more than one
|
||||
TensorFlow instance on the same machine, it could just be because one of your
|
||||
instances is trying to reserve all the GPU memory, rather than because your
|
||||
models are too large. This is TensorFlow’s default behaviour. To tell
|
||||
TensorFlow to only reserve the memory it needs, see the
|
||||
[`allow_growth`](https://www.tensorflow.org/programmers_guide/using_gpu#allowing_gpu_memory_growth)
|
||||
option.
|
||||
* If you want to access the graph from multiple things running at once, it
|
||||
looks like you *can* access the same graph from multiple threads, but there’s
|
||||
a lock somewhere which only allows one thread at a time to actually do
|
||||
anything. This seems to be distinct from the Python global interpreter lock,
|
||||
which TensorFlow is [supposed
|
||||
to](https://stackoverflow.com/questions/38206695/python-parallelizing-gpu-and-cpu-work)
|
||||
release before doing heavy lifting. I’m uncertain about this, and didn’t have
|
||||
time to debug more thoroughly, but if you’re in the same boat, it might be
|
||||
simpler to just use multiple processes and replicate the graph between them
|
||||
with [Distributed
|
||||
TensorFlow](http://amid.fish/distributed-tensorflow-a-gentle-introduction).
|
||||
* Working with Python, you get used to not having to worry about overflows. In
|
||||
TensorFlow, though, you still need to be careful:
|
||||
|
||||
```
> a = np.array([255, 200]).astype(np.uint8)
> sess.run(tf.reduce_sum(a))
199
```
|
||||
|
||||
* Be careful about using `allow_soft_placement` to fall back to a CPU if a GPU
|
||||
isn’t available. If you’ve accidentally coded something that can’t be run on
|
||||
a GPU, it’ll be silently moved to a CPU. For example:
|
||||
|
||||
```
import tensorflow as tf  # TensorFlow 1.x API

with tf.device("/device:GPU:0"):
    a = tf.placeholder(tf.uint8, shape=(4))
    b = a[..., -1]

sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
sess.run(tf.global_variables_initializer())

# Seems to work fine. But with allow_soft_placement=False

sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=False))
sess.run(tf.global_variables_initializer())

# we get

# Cannot assign a device for operation 'strided_slice_5':
# Could not satisfy explicit device specification '/device:GPU:0'
# because no supported kernel for GPU devices is available.
```
|
||||
|
||||
* I don’t know how many operations there are like this that can’t be run on a
|
||||
GPU, but to be safe, do CPU fallback manually:
|
||||
|
||||
```
import tensorflow as tf  # TensorFlow 1.x API

gpu_name = tf.test.gpu_device_name()
device = gpu_name if gpu_name else "/cpu:0"
with tf.device(device):
    # graph code goes here
    pass
```
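
The `199` in the overflow bullet above is 255 + 200 = 455 wrapped into 8 bits: 455 mod 256 = 199. The snippet below is a plain-Python sketch of that arithmetic, not TensorFlow code. The helper `wrapping_sum` is hypothetical and only mimics a fixed-width unsigned accumulator.

```python
def wrapping_sum(values, bits=8):
    """Sum integers the way a fixed-width unsigned accumulator would."""
    modulus = 1 << bits
    total = 0
    for v in values:
        total = (total + v) % modulus
    return total


pixels = [255, 200]
wrapped = wrapping_sum(pixels, bits=8)    # mimics summing in uint8
widened = wrapping_sum(pixels, bits=32)   # mimics summing in a wider type
```

In TensorFlow the analogous fix is usually to cast to a wider dtype, for example with `tf.cast`, before reducing.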
|
||||
|
||||
#### Mental health
|
||||
|
||||
* Don’t get addicted to TensorBoard. I’m serious. It’s the perfect example of
|
||||
addiction through unpredictable rewards: most of the time you check how your
|
||||
run is doing and it’s just pootling away, but as training progresses,
|
||||
sometimes you check and all of a sudden - jackpot! It’s doing something
|
||||
super exciting. If you start feeling urges to check TensorBoard every few
|
||||
minutes, it might be worth setting rules for yourself about how often it’s
|
||||
reasonable to check.
|
||||
|
||||
---
|
||||
|
||||
If you’ve read this far and haven’t been put off, awesome! If you’d like to get
|
||||
into deep RL too, here are some resources for getting started.
|
||||
|
||||
* Andrej Karpathy’s [Deep Reinforcement Learning: Pong from
|
||||
Pixels](http://karpathy.github.io/2016/05/31/rl/) is a great introduction to
|
||||
build motivation and intuition.
|
||||
* For more on the theory of reinforcement learning, check out [David Silver’s
|
||||
lectures](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html). There
|
||||
isn’t much on deep RL (reinforcement learning using neural networks), but it
|
||||
does teach the vocabulary you’ll need to be able to understand papers.
|
||||
* John Schulman’s [Nuts and Bolts of Deep RL
|
||||
talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ)
|
||||
([slides](http://joschu.net/docs/nuts-and-bolts.pdf); [summary
|
||||
notes](https://github.com/williamFalcon/DeepRLHacks)) has lots more tips
|
||||
about practical issues you might run into.
|
||||
|
||||
For a sense of the bigger picture of what’s going on in deep RL at the moment,
|
||||
check out some of these.
|
||||
|
||||
* Alex Irpan’s [Deep Reinforcement Learning Doesn’t Work
|
||||
Yet](https://www.alexirpan.com/2018/02/14/rl-hard.html) has a great overview
|
||||
of where things are right now.
|
||||
* Vlad Mnih’s talk on [Recent Advances and Frontiers in Deep
|
||||
RL](https://www.youtube.com/watch?v=bsuvM1jO-4w) has more examples of work on
|
||||
some of the problems mentioned in Alex’s post.
|
||||
* Sergey Levine’s [Deep Robotic
|
||||
Learning](https://www.youtube.com/watch?v=eKaYnXQUb2g) talk, with a focus on
|
||||
improving generalization and sample efficiency in robotics.
|
||||
* Pieter Abbeel’s [Deep Learning for
|
||||
Robotics](https://www.youtube.com/watch?v=TyOooJC_bLY) keynote at NIPS 2017
|
||||
with some of the more recent tricks in deep RL.
|
||||
|
||||
Good luck!
|
||||
|
||||
Thanks to [Michal Pokorný](http://agentydragon.com/about.html) and Marko Thiel for thoughts on
|
||||
a first draft of this post.
|
||||
|
||||
1. Observations are fed into two different training loops, policy training and reward predictor training, and I’d forgotten to normalize observations for the second one. Also, calculating running statistics (specifically, variance) is tricky. Check out [John Schulman’s code](https://github.com/joschu/modular_rl/blob/master/modular_rl/running_stat.py) for a good reference. [↩](#fnref:normproblems)
|
||||
2. This is basically [CFAR’s](http://www.rationality.org/) ‘MurphyJitsu’ script. [↩](#fnref:murphyjitsu)
|
||||
3. As mentioned above, I was stuck for a good while because of forgetting to normalize observations used for training the reward predictor. Derp. [↩](#fnref:norm2)
|
||||
|
||||
|
||||
|
||||
## Amid Fish
|
||||
|
||||
is Matthew Rahtz's blog
|
||||
|
||||
[GitHub](https://github.com/mrahtz),
|
||||
[LinkedIn](https://uk.linkedin.com/pub/matthew-rahtz/b8/a47/540)
|
||||
@@ -0,0 +1,399 @@
|
||||
Source: https://andyljones.com/posts/rl-debugging.html
|
||||
Title: Debugging RL, Without the Agonizing Pain - Andy Jones (2021)
|
||||
Fetched-via: uvx markitdown https://andyljones.com/posts/rl-debugging.html
|
||||
Fetch-status: verbatim
|
||||
|
||||
[andy jones](/)
|
||||
|
||||
[](/rss.xml)
|
||||
|
||||

|
||||
|
||||
[](https://scholar.google.com/citations?user=wjU_zmMAAAAJ)
|
||||
[](https://github.com/andyljones)
|
||||
[](https://www.linkedin.com/in/andyjonescs)
|
||||
|
||||
[](https://twitter.com/andy_l_jones)
|
||||
[](https://www.reddit.com/u/bluecoffee)
|
||||
[](https://stackoverflow.com/users/2565457/andy-jones)
|
||||
|
||||
# Debugging RL, Without the Agonizing Pain
|
||||
|
||||
Debugging reinforcement learning systems combines the pain of debugging distributed systems with the pain of debugging numerical optimizers. Which is to say, it *sucks*. If this is your first time, you might have a few hundred lines of code that you *think* are correct in an hour, and a system that's *actually* correct two months later. [Here's the head of Tesla AI having just that experience](https://news.ycombinator.com/item?id=13519044).
|
||||
|
||||
This is a collection of debugging advice that has served me well over the past few years. It was formed both from my personal experiences, and from several months of helping people out in the [RL Discord](https://discord.com/invite/xhfNqQv). It is intended as a complement to the [other excellent articles on debugging RL that can be found elsewhere](https://github.com/andyljones/reinforcement-learning-discord-wiki/wiki#debugging-advice). I recommend you read all of them; each one has its own unique set of bugbears to warn you away from.
|
||||
|
||||
There are three sections: one on [theory](#theory), one on [common fixes](#fixes), and one on [practical advice](#tactics). Things flow a little better if you read them in order, but you can skip on ahead if you wish.
|
||||
|
||||
# Theory
|
||||
|
||||
## Why is debugging RL so hard?
|
||||
|
||||
A combination of issues. These issues show up in debugging any kind of system, but in RL they're more common, and they'll show up starting with the first system you ever write.
|
||||
|
||||
### Feedback is poor
|
||||
|
||||
**Errors aren't local**: The vast majority of the bugs you'll make are the 'doing the wrong calculation' sort. Because information in an RL system flows in a loop - actor to learner and then back to actor - a numerical error in one spot gets smeared throughout the system in seconds, poisoning everything. This means that most numerical errors manifest as *all* your metrics going weird at the same time; your loss exploding, your KL div collapsing, your rewards oscillating. From the outside, you can tell something is wrong but you've no idea *what* is wrong or where to start looking.
|
||||
|
||||
To my mind this is the single biggest issue with debugging RL systems, and much of the advice below is about how to better-localise errors.
|
||||
|
||||
**Performance is noisy**: The ultimate arbiter of an RL system - how good it is at collecting reward - is only weakly related to how good of an implementation you've written. You could write a bug-free implementation the first time and other factors (like hyperparameters, architecture or your environment) could sabotage performance. In the worst case, your evaluation run could just get an unlucky seed. Conversely, you could write a bug-laden implementation and it might seem to work! After all, bugs are just one more source of noise and your neural net is going to [try its damnedest](https://twitter.com/gwern/status/1014978860369182722) to pull the signal out of that mess you're feeding it.
|
||||
|
||||
The real kicker though is that because run-to-run variability is so high, it's very easy to fix - or introduce - a bug and then see no change in performance at all.
|
||||
|
||||
### Simplifying is hard
|
||||
|
||||
**There're few narrow interfaces**: Smart software development involves splitting the system up into components so that each component only talks to the others through a narrow interface. This way you can easily pinch a component off from the rest of the system, feed it some mock inputs and see if it gives the correct answers.
|
||||
|
||||
This is difficult in RL systems. In RL systems, each component typically consumes a large number of mega- or gigabyte arrays and returns the same. The components are also unavoidably stateful, with the principal two components - the actor and learner - hefting around the state of the environment and the network weights respectively. State can be thought of as an interface with the component's own past, and in RL this interface is *huge*.
|
||||
|
||||
Consequently while you *can* isolate components in RL (and we'll talk about how to below), it's much more painful to do than it is in other kinds of software.
|
||||
|
||||
**There are few black boxes**: A black box is a component that works in a complex way, but which you can reason about in a simple way. Another name for a black box would be 'a good abstraction'. The prototypical example is your computer: there's a hierarchy of concepts in there, from doped silicon through to operating systems, but as far as you the programmer are concerned it's all about for loops and function calls.
|
||||
|
||||
RL has surprisingly few of these black boxes. You're required to know how your environment works, how your network works, how your optimizer works, how backprop works, how multiprocessing works, how stat collection and logging work. How GPUs work! There are [lots](https://docs.ray.io/en/latest/rllib.html) of [attempts](https://github.com/thu-ml/tianshou) at [writing](https://github.com/deepmind/acme) black-box [RL](https://github.com/astooke/rlpyt) libraries, but as of Jan 2021 my experience has been that these libraries have yet to be both flexible *and* easy-to-use. This might be a symptom of my odd strand of research, but I've heard several other researchers echo my frustrations.
|
||||
|
||||
### We're bad at writing RL systems
|
||||
|
||||
**Your expectations suck**: In any domain, problems evaporate as you get used to them. The first stack trace you see in your life is a nightmare; the millionth a triviality. All of the problems with RL listed above are only really problems because people new to the field expect something much more refined and reliable, as they've come to expect from other fields of programming and numerical research. If instead you arrive in RL expecting a garbage fire, you might just stay zen throughout.
|
||||
|
||||
Obviously though, this begs the question of *why* RL development is a garbage fire.
|
||||
|
||||
**The community is young**: While reinforcement learning as a field stretches back decades, it has *exploded* in the past few years and continues apace today. Finding good abstractions requires in part that the userbase's requirements stabilize, and that just isn't the case yet. Some of that is because it's very much a community of researchers rather than a community of practitioners, and the terrible thing about researchers is that they're very keen on doing new and different things. Maybe it'll be different once someone figures out how to turn RL into an industry.
|
||||
|
||||
**The community has other priorities**: Again, the community is a community of researchers. The population sets the priorities, and the priority is publication. Reliable, reproducible research contributes to publishing high-impact papers, but it also costs time and effort that is arguably better spent working on something *new*. And, well, it's hard to argue with the results: the current standards of RL development have carried us [a](https://deepmind.com/blog/article/muzero-mastering-go-chess-shogi-and-atari-without-rules) [long](https://openai.com/blog/learning-dexterity/) [way](https://deepmind.com/blog/article/AlphaStar-Grandmaster-level-in-StarCraft-II-using-multi-agent-reinforcement-learning).
|
||||
|
||||
Don't take this as a clarion call for better practices, nor a stalwart defense of practices as they are. It's not a hill I wish to die on. I'm only giving an explanation for why things are the way they are, rather than a justification for it. My preferences are towards improved practices, but I can see the sense in the other side's position.
|
||||
|
||||
## Debugging Strategies
|
||||
|
||||
With all that in mind, here are some broad strategies to keep in mind when chasing a bug.
|
||||
|
||||
### Design reliable tests
|
||||
|
||||
Write tests that either clearly pass or clearly fail. There's some amount of true randomness in RL, but most of that can be controlled with a seed. What's harder to deal with is pseudorandomness such that on one seed a test might pass and on another seed it might fail. This is *awful* to deal with, and you should go out of your way to avoid it.
|
||||
|
||||
While the ideal is a test that is guaranteed to cleanly pass or fail, a good fallback is one that is simply *overwhelmingly likely* to pass or fail. Typically, this means substituting out environments or algorithms with simpler ones that behave more predictably, and which you can run through your implementation with some massive batch size that'll suppress a lot of the wackiness that you might otherwise suffer.
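
As a rough illustration of a test that is 'overwhelmingly likely' to pass, the sketch below checks a noisy reward estimate from a toy bandit at two batch sizes. Everything here is hypothetical (the bandit, the 0.1 tolerance, the function names); the point is only that a big enough batch makes the outcome stop depending on the seed.

```python
import random


def mean_reward(true_value, n_samples, seed):
    """Average reward of a toy bandit: true_value plus unit Gaussian noise."""
    rng = random.Random(seed)
    return sum(true_value + rng.gauss(0.0, 1.0) for _ in range(n_samples)) / n_samples


def passes(n_samples, seed):
    # The "test": is the estimated reward within 0.1 of the true value 0.5?
    return abs(mean_reward(0.5, n_samples, seed) - 0.5) < 0.1


# Small batch: standard error is 0.25, so the result can flip with the seed.
small = [passes(n_samples=16, seed=s) for s in range(20)]
# Large batch: standard error is 0.01, so a failure points at a real bug.
large = [passes(n_samples=10_000, seed=s) for s in range(20)]
```

With the large batch the tolerance sits ten standard errors out, so a failing run is worth investigating instead of rerunning.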
|
||||
|
||||
### Design *fast* tests
|
||||
|
||||
Iteration speed is a huge determinant of debugging speed. Running a test should take at most as long as it takes you to make a potential fix, which is to say 'a few seconds'.
|
||||
|
||||
This means: don't try to debug your implementation by just running it on your full task. That might take days! That way madness lies. Instead, design setups that can execute more quickly, but still exercise the code you're looking at. For specific tips, look at the [probe environments](#probe) section below.
|
||||
|
||||
### Localise errors
|
||||
|
||||
Write test code that'll tell you the most about where the error is. The classic example of this is binary search: if you're looking for a specific item in a sorted list, then taking a look at the middle item tells you a *lot* more about where your target item is than looking at the first item.
|
||||
|
||||
Similarly, when debugging RL systems try to find tests that cut your system in half in some way, and tell you which half the problem is in. Incrementally testing every.single.chunk of code - well, sometimes that's what it comes down to! But it's something to try and avoid.
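
One way to picture 'cutting the system in half' is a binary search over a chain of processing stages, comparing against outputs you trust. This is a toy sketch: the function `first_bad_stage` is hypothetical, and it assumes you have reference outputs after each stage and that an error, once introduced, persists downstream.

```python
def first_bad_stage(stages, references, x):
    """Binary-search a chain of stages for the first one whose output is wrong.

    stages:     list of functions applied in order
    references: trusted outputs after each stage, for the same input x
    Assumes that once an output is wrong, every later output is wrong too.
    """
    def output_after(k):
        y = x
        for stage in stages[:k + 1]:
            y = stage(y)
        return y

    lo, hi = 0, len(stages) - 1
    if output_after(hi) == references[hi]:
        return None  # the whole chain agrees with the reference
    while lo < hi:
        mid = (lo + hi) // 2
        if output_after(mid) == references[mid]:
            lo = mid + 1   # everything up to mid is fine
        else:
            hi = mid       # the bug is at mid or earlier
    return lo


# Toy pipeline: stage 2 should subtract 1 but (bug) subtracts 2.
stages = [lambda v: v * 2, lambda v: v + 3, lambda v: v - 2, lambda v: v * 10]
references = [8, 11, 10, 100]  # hand-computed for x = 4 with the correct stages
bad = first_bad_stage(stages, references, 4)
```

In a real RL system the 'stages' rarely line up this neatly, which is the difficulty the rest of this section is about.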
|
||||
|
||||
### Be Bayesian
|
||||
|
||||
But sometimes you can't avoid it! Binary search wouldn't have been much help in [finding the wreck of the USS Scorpion](https://en.wikipedia.org/wiki/USS_Scorpion_%28SSN-589%29). There they had to do a location-by-location search, and the key turned out to be prioritising the areas where
|
||||
|
||||
* the Scorpion was likely to be and
|
||||
* where it was likely to be *spotted*.
|
||||
|
||||
This kind of thinking isn't so critical in traditional software development because isolating components is much easier, so you can do the sort of binary search I mentioned previously. But in RL, well, sometimes you just can't untangle something. Then you should reflect on which bits of your code are most likely to *contain* bugs, and which bits of your code you're going to be able to *easily spot* those bugs in. Prioritise looking in those places!
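
A back-of-the-envelope version of this prioritisation: score each component by the chance the bug is there times the chance you'd spot it if it were. The component names and probabilities below are made-up placeholders, to be replaced with your own guesses.

```python
# Hypothetical guesses for each component:
# name -> (P(bug is here), P(you spot it | it is here))
components = {
    "advantage calculation": (0.35, 0.9),
    "replay buffer indexing": (0.30, 0.8),
    "network architecture": (0.10, 0.2),
    "environment wrapper": (0.25, 0.6),
}


def priority(p_bug, p_spot):
    """Chance that looking here actually turns up the bug."""
    return p_bug * p_spot


search_order = sorted(
    components,
    key=lambda name: priority(*components[name]),
    reverse=True,
)
```

Here the architecture lands last: it is an unlikely home for the bug and a hard place to recognise one.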
|
||||
|
||||
As an aside, the [parable of the drunk and his keys](https://en.wikipedia.org/wiki/Streetlight_effect) has always confused me: I don't know if it's saying the wise thing to do is to look under the streetlight, or to look in the dark. Best moral I've heard for it is 'it depends'.
|
||||
|
||||
### Pursue Anomalies
|
||||
|
||||
If you ever see a plot or a behaviour that just *seems weird*, chase right after it! Do not - do *not* - just 'hope it goes away'. Chasing anomalies is one of the most powerful ways to debug your system, because if you've noticed a problem without having had to go look for it, that means it's a *really big problem*.
|
||||
|
||||
This takes quite a bit of a mindset change though. It's really tempting to think that the cool extra functionality you were planning to write today - a tournament, adaptive reward scaling, a transformer - might just magically fix this anomalous behaviour.
|
||||
|
||||
It won't.
|
||||
|
||||
Give up on your plan for the day and chase the anomaly instead.
|
||||
|
||||
# Common Fixes
|
||||
|
||||
These are specific things that frequently trip people up.
|
||||
|
||||
## Hand-tune your reward scale
|
||||
|
||||
The single most common issue for newbies writing custom RL implementations is that the targets arriving at their neural net aren't [-1, +1]. Actually, anything [-.1, +.1]ish to [-10, +10]ish is good. The point is to have rewards that generate 'sensible' targets for your network. The hyperparameters you've pulled from the literature are adapted to work with these nicely-scaled targets, but lots of envs don't natively provide rewards of the right size so as to generate these nicely-scaled targets.
|
||||
|
||||
Having read that, you might be tempted to write some adaptive scheme to scale your rewards for you. Don't: it's an extra bit of nonstationarity that'll make life more difficult. Just hand-scale, hand-clip the rewards from your env so that the targets passed to your network are sensible. When everything else is working, you can come back and replace this with something less artificial.
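
Hand-scaling and hand-clipping can be as small as the sketch below. The function name and the constants `scale=100.0` and `clip=10.0` are hypothetical, picked by eye for an imaginary env whose rewards are in the hundreds.

```python
def scale_reward(raw_reward, scale=100.0, clip=10.0):
    """Hand-scale then hand-clip an env reward; both constants are picked by eye."""
    scaled = raw_reward / scale
    return max(-clip, min(clip, scaled))


# Suppose the env hands out rewards in the hundreds, with rare huge outliers.
rewards = [250.0, -120.0, 0.0, 50_000.0]
targets = [scale_reward(r) for r in rewards]
```

The constants are fixed, so this adds no nonstationarity of its own.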
|
||||
|
||||
## Use a really large batch size
|
||||
|
||||
One of the most reliable ways to make life easier in RL is to use a really large batch size. A *really* large batch size. There's an [excellent paper on picking batch sizes](https://arxiv.org/abs/1812.06162), and to pull some examples from there:
|
||||
|
||||
* Pong: ~1k batch size
|
||||
* Space Invaders: ~10k batch size
|
||||
* 1v1 Dota: ~100k batch size
|
||||
|
||||
The idea behind this is that with small batches and complex envs, it's easy for your learner to end up with a batch that represents some weird idiosyncratic part of the problem. Big batches do a lot to suppress this.
|
||||
|
||||
## Use a really small network
|
||||
|
||||
Hand in hand with really large batch sizes is really small networks. When you use really large batches, your binding constraint is likely to be the memory it takes to hold the forward pass activations on your GPU. By making the network smaller, you can fit bigger batches! And frankly, small networks can accomplish a *lot*. In my [boardlaw](https://andyljones.com/boardlaw/) project, I found that a fully connected network with 4 layers of 256 neurons was enough to learn perfect play on a 9x9 board. Perfect play! That's really complex!
|
||||
|
||||
## Avoid pixels
|
||||
|
||||
And hand-in-hand with 'use a small network' is: *avoid pixels*. Especially if you're an independent researcher with hardware constraints, just... don't work on environments with hefty, expensive-to-ingest observations like Atari. Pixel-based observations mean that before it does anything interesting, your agent has to learn to *see*. From sparse rewards! That's hard, and it's compute-intensive, and it's *boring*. If you've got any choice in the matter, pick the simplest env that will be able to generate the behaviour you're after. For example:
|
||||
|
||||
* Gridworlds like [Griddly](https://github.com/Bam4d/Griddly) and [minigrid](https://github.com/maximecb/gym-minigrid). Gridworlds can support most of the interesting behaviours you'd find in a continuous environment, but are much more resource-efficient. If you've just graduated out of [the Gym envs](https://gym.openai.com/envs/#classic_control), gridworlds are an excellent next step.
|
||||
* Multi-agent setups like the boardgames from [OpenSpiel](https://openspiel.readthedocs.io/en/latest/games.html), [microRTS](https://github.com/santiontanon/microrts) or [Neural MMO](https://github.com/jsuarez5341/neural-mmo). A multi-agent env shouldn't be your *first* foray into RL - they're substantially more complex than the single-agent case - but competition and cooperation can generate a lot of complexity from very lightweight environments.
|
||||
* Unusual envs like [WordCraft](https://github.com/minqi/wordcraft). WordCraft is unique in that it isolates learning about the real world from actually having to model the real world! But again, possibly not the best choice for a first RL project; I've included it here as an example of how powerful simple environments can be.
|
||||
|
||||
In all, fast environments with small networks and big batches are far easier to debug than slow environments with big networks and small batches. Make sure you can walk before you try running.
|
||||
|
||||
## Mix your vectorized envs
|
||||
|
||||
If you've got a long-lived env and you're simulating a lot of them in parallel, you might find that your system behaves a bit strangely at the start of training. One common issue is that if all your envs start from the same state, then your learner gets passed very highly-correlated samples, and so it tries to optimise for, say, steps 0-10 of the env in the first batch, then 10-20 in the second batch, etc. You can avoid this by '[mixing](https://en.wikipedia.org/wiki/Markov_chain_mixing_time)' your envs: taking enough random steps in the env that they become uncorrelated with one another. A good way to check that things are well-mixed is to look at the number of resets at each timestep: if they look pretty uniform, things are well-mixed. If they all cluster on a specific timestep, you need to take some more random actions.
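
The reset-histogram check can be sketched with a toy env whose episodes all last exactly 50 steps. That fixed length is a simplifying assumption, as are the names below; real envs have variable episode lengths, but the diagnostic is the same.

```python
import random
from collections import Counter

EPISODE_LEN = 50   # toy env: every episode lasts exactly 50 steps
N_ENVS = 1000
HORIZON = 200


def reset_histogram(warmup_steps):
    """Count how many envs reset at each timestep, given per-env warmup offsets."""
    resets = Counter()
    for offset in warmup_steps:
        # An env that has already taken `offset` steps resets whenever
        # (t + offset) is a multiple of the episode length.
        for t in range(1, HORIZON + 1):
            if (t + offset) % EPISODE_LEN == 0:
                resets[t] += 1
    return resets


rng = random.Random(0)
unmixed = reset_histogram([0] * N_ENVS)
mixed = reset_histogram([rng.randrange(EPISODE_LEN) for _ in range(N_ENVS)])

# Unmixed: all 1000 envs reset together at t = 50, 100, 150, 200.
# Mixed: resets are spread out, roughly N_ENVS / EPISODE_LEN per timestep.
peak_unmixed = max(unmixed.values())
peak_mixed = max(mixed.values())
```

A tall spike in the histogram means the envs are still in lockstep and need more random warmup steps.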
|
||||
|
||||
# Practical Advice
|
||||
|
||||
This advice sits somewhere between the 'common mistakes' and the more general 'theory' we discussed earlier.
|
||||
|
||||
## Work from a reference implementation
|
||||
|
||||
*If you're new to reinforcement learning, writing things from scratch is the most catastrophically self-sabotaging thing you can do.*
|
||||
|
||||
There is an alluring masochism in writing things from scratch. There's concrete value in it too: by writing things from scratch, you're both forced to fully understand what you're doing and you're more likely to come up with a fresh perspective. In many other fields of software development these benefits would be worth the slow-down you suffer from having to work everything out yourself.
|
||||
|
||||
In reinforcement learning, these benefits are not worth it. At all. As discussed [above](#theory), the nature of RL work makes it extremely hard for you to self-correct.
|
||||
|
||||
When I say 'use a reference implementation', there are several interpretations you can take depending on your risk tolerance.
|
||||
|
||||
* The safest thing to do is to use a reference implementation out-of-the-box. Check that it works on your task, then repeatedly make a small change and check that it works as it did before.
|
||||
* Less safe is to just use the reference implementation as a source of reliable components. Work to the same API, and check that your version of a component and their version give the same outputs when fed the same inputs.
|
||||
* Least safe (but still dramatically better than going in blind) is to have one eye on the reference implementation while you write your own. Copy their hyperparameters, copy their discounting code, copy how they handle termination and invalid actions and a hundred other little things that you're likely to muck up otherwise.
|
||||
|
||||
Here are some excellent reference implementations to choose from:
|
||||
|
||||
* [spinning-up](https://github.com/openai/spinningup) has been written by OpenAI, and has a [short course to go along with it](https://spinningup.openai.com/).
|
||||
* [stable-baselines3](https://github.com/DLR-RM/stable-baselines3) is based on an older set of OpenAI implementations, but cleaned up and actively maintained.
|
||||
* [cleanrl](https://github.com/vwxyzjn/cleanrl/tree/master/cleanrl) isolates every algorithm in its own file.
|
||||
* [OpenSpiel](https://github.com/deepmind/open_spiel) is DeepMind's multi-agent reinforcement learning library. They provide both Python and C++ implementations of many algorithms - you'll probably want the Python ones.
|
||||
|
||||
## Assume you have a bug
|
||||
|
||||
When their RL implementation doesn't work, people are often keen to either (a) adjust their network architecture or (b) adjust their hyperparameters. On the other hand, they're reluctant to say they've got a bug.
|
||||
|
||||
Most often, it turns out they've got a bug.
|
||||
|
||||
Why bugs are so much more common in RL code is discussed [above](#theory), but there's another advantage to assuming you've got a bug: bugs are a damn sight faster to find and fix than validating that your new architecture is an improvement over the old one.
|
||||
|
||||
Now having said that you should assume you have a bug, it's worth mentioning that sometimes - rarely - you don't have a bug. What I'm advocating for here is not a blind faith in the bugginess of your code, but for dramatically raising the threshold at which you start thinking 'OK, I think this is correct.'
|
||||
|
||||
## Loss curves are a red herring
|
||||
|
||||
When someone's RL implementation isn't working, they *luuuuuurv* to copy-paste a screenshot of their loss curve to you. They do this because they know they want a pretty, exponentially-decaying loss curve, and they know what they have *isn't that*.
|
||||
|
||||
The problem with using the loss curve as an indicator of correctness is partly that it's not reliable, but mostly that it doesn't localise errors. The shape of your loss curve says very little about where in your code you've messed up, and so says very little about what you need to change to get things working.
|
||||
|
||||
As in the previous section, my sweeping proclamation comes with some qualifiers. Once you have a semi-functional implementation and you've exhausted other, better methods of error localisation (as documented in the rest of this post), there *is* valuable information in a loss curve. If nothing else, being able to split a model's performance into 'how fast it learns' and 'where it plateaus' is a useful way to think about the next improvement you might want to make. But because it only offers *global* information about the performance of your implementation, it makes for a really poor debugging tool.
|
||||
|
||||
## Unit test the tricky bits
|
||||
|
||||
Most of the bugs in a typical attempt at an RL implementation turn up in the same few places. Some of the usual suspects are
|
||||
|
||||
* reward discounting, especially around episode resets
|
||||
* advantage calculations, again especially around resets
|
||||
* buffering and batching, especially pairing the wrong rewards with the wrong observations
|
||||
|
||||
Fortunately, these components are all really easy to test! They've got none of the issues that validating RL algorithms as a whole has. These components are deterministic, they're easy to factor out, and they're fast. Checking you've got the termination right on your reward discounting is [a few lines](https://github.com/andyljones/megastep/blob/master/megastep/demo/learning.py#L134-L159).
|
||||
|
||||
What's even better is that most of the time, *as you write these things* you know you're messing them up. If you're not certain whether you've just accumulated the reward on one side of the reset or the other, *put a test in*.
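
As an example of such a test, here is a sketch of reward-to-go discounting that cuts the sum at episode boundaries, with a check that reward from the next episode doesn't leak backwards. This is not the linked megastep code. The function names and the convention that `dones[t]` marks the step ending an episode are assumptions.

```python
def discounted_returns(rewards, dones, gamma, bootstrap_value=0.0):
    """Reward-to-go for one env's trajectory, cutting the sum at episode ends.

    dones[t] is True when the step that produced rewards[t] ended the episode.
    """
    returns = [0.0] * len(rewards)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        if dones[t]:
            running = 0.0  # nothing after a terminal step leaks backwards
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def test_reset_blocks_future_reward():
    # Two episodes back to back; the +10 belongs only to the second one.
    rewards = [1.0, 1.0, 0.0, 10.0]
    dones = [False, True, False, True]
    got = discounted_returns(rewards, dones, gamma=0.5)
    assert got == [1.5, 1.0, 5.0, 10.0]


test_reset_blocks_future_reward()
```

If the reset were handled on the wrong side, the first entry would pick up part of the +10 and the test would fail loudly.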
|
||||
|
||||
## Use probe environments
|
||||
|
||||
The usual advice to people writing RL algorithms is to use a simple environment like the [classic control ones from the Gym](https://gym.openai.com/envs/#classic_control).
|
||||
|
||||
Thing is, these envs have the same problem as looking at loss curves: at best they give you a noisy indicator, and if the noisy indicator looks poor you don't know *why* it looks poor. They don't localise errors.
|
||||
|
||||
Instead, construct environments that *do* localise errors. In a recent project, I used
|
||||
|
||||
1. **One action, zero observation, one timestep long, +1 reward every timestep**: This isolates the value network. If my agent can't learn that the value of the only observation it ever sees is 1, there's a problem with the value loss calculation or the optimizer.
|
||||
2. **One action, random +1/-1 observation, one timestep long, obs-dependent +1/-1 reward every time**: If my agent can learn the value in (1.) but not this one - meaning it can learn a constant reward but not a predictable one! - it must be that backpropagation through my network is broken.
|
||||
3. **One action, zero-then-one observation, *two* timesteps long, +1 reward at the end**: If my agent can learn the value in (2.) but not this one, it must be that my reward discounting is broken.
|
||||
4. **Two actions, zero observation, one timestep long, action-dependent +1/-1 reward**: The first env to exercise the policy! If my agent can't learn to pick the better action, there's something wrong with either my advantage calculations, my policy loss or my policy update. That's three things, but it's easy to work out by hand the expected values for each one and check that the values produced by your actual code line up with them.
|
||||
5. **Two actions, random +1/-1 observation, one timestep long, action-and-obs dependent +1/-1 reward**: Now we've got a dependence on both obs and action. The policy and value networks interact here, so there's a couple of things to verify: that the policy network learns to pick the right action in each of the two states, and that the value network learns that the value of each state is +1. If everything's worked up until now, then if - for example - the value network fails to learn here, it likely means your batching process is feeding the value network stale experience.
|
||||
6. Etc.
|
||||
|
||||
You get the idea: (1.) is the simplest possible environment, and each new env adds the smallest possible bit of functionality. If the old env works but the successor doesn't, that gives you a *lot* of information about where the problem is.
|
||||
|
||||
Even better, these environments are extraordinarily fast. When you've a correct implementation, it should only take a second or two to learn them. And they're *decisive*: if your value network in (1.) ends up more than an epsilon away from the correct value, it means you've got a bug.
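
A minimal sketch of the first two probe environments, assuming a hypothetical `reset()`/`step()` interface that returns `(obs, reward, done)`. This is not the Gym API. The linear `fit_linear_value` function is only a stand-in for a value network, so that the example runs without a deep learning library.

```python
import random


class ConstantRewardEnv:
    """Probe env 1: one action, zero observation, one step, +1 reward."""

    def reset(self):
        return 0.0

    def step(self, action):
        return 0.0, 1.0, True  # obs, reward, done


class ObsDependentRewardEnv:
    """Probe env 2: one action, random +1/-1 observation, reward equal to the obs."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.obs = 0.0

    def reset(self):
        self.obs = self.rng.choice([-1.0, 1.0])
        return self.obs

    def step(self, action):
        return 0.0, self.obs, True


def fit_linear_value(env, steps=2000, lr=0.1):
    """Stand-in value function v(obs) = w * obs + b, fitted by SGD on squared error."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        obs = env.reset()
        _, reward, _ = env.step(0)
        error = (w * obs + b) - reward
        w -= lr * error * obs
        b -= lr * error
    return w, b
```

On the first env the fitted value should end up at 1 through the bias alone; on the second it has to come from the weight on the observation, which is the part that exercises the gradient path through the input.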
|
||||
|
||||
## Use probe agents.
|
||||
|
||||
In much the same way that you can simplify your environments to localise errors, you can do the same with your agents too.
|
||||
|
||||
*Cheat* agents are ones that you leak extra information to. For example, if I'm writing an agent to navigate to a goal, then slipping the agent an extra vector saying which direction the goal is in should help a *lot*. My agent should be able to solve this problem *much* faster, and if it can't then how the heck can I expect it to solve the original problem?
|
||||
|
||||
*Automatons* are agents that don't use a neural network at all. Instead, they're hand-written algorithms. The point of writing something like this is to check that your environment is actually solvable. On a navigation environment I wrote once, I set up a room with a red post behind the agent. Then I wrote an automaton which would just turn left until a block of red was in the middle of its view. Shocker: my automaton couldn't solve this task, because it turned out I'd mucked up the observation generation on odd-numbered environments.
|
||||
|
||||
It's worth keeping in mind that automatons can be handed cheat information too! Combining automatons and progressively more cheat information is a powerful way to debug an environment.
|
||||
|
||||
*Tabular* agents are a good match for probe environments. If you've set up a real simple environment and *still* nothing works, then replacing your NN with a far-easier-to-interpret lookup table of state values is a great way to figure out what you're missing. Be aware that it might take some time with a pen and paper to check that the values that you're seeing in the table are the ones you expect, but it's a hard setup to fool.
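
A sketch of what a tabular agent could look like on the two-timestep probe environment (zero-then-one observation, +1 reward at the end). It uses plain TD(0) updates on a lookup table. The transition list, learning rate and discount are assumptions chosen for illustration.

```python
def tabular_td0(episodes, gamma=0.9, lr=0.5):
    """TD(0) on the two-step probe env: obs 0 -> obs 1 -> end, +1 on the final step."""
    values = {0: 0.0, 1: 0.0}  # lookup table in place of a value network
    # Each transition is (obs, reward, next_obs, done).
    transitions = [(0, 0.0, 1, False), (1, 1.0, None, True)]
    for _ in range(episodes):
        for obs, reward, next_obs, done in transitions:
            target = reward if done else reward + gamma * values[next_obs]
            values[obs] += lr * (target - values[obs])
    return values
```

The pen-and-paper answer here is a value of 1 for the second observation and `gamma` for the first, so any other table contents point at the discounting or bootstrapping code.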
|
||||
|
||||
## Use adaptive network definitions
|
||||
|
||||
One of the issues with probe environments and probe agents is that every time you swap out your environment or agent, you'll find yourself having to rewrite the interface of the network with the rest of the world. By 'interface' I mean 'the bit that eats the observation and the bit that spits out the action'.
|
||||
|
||||
One way to avoid this is to write a function that takes the observation space and action space of the environment, and generates 'heads' for the network that convert the observation into a fixed-width vector, and which convert a fixed-width vector to the action. Then you can hand-implement *just* the body of the net that converts the intake vector to the output vector, and the rest will be slotted in by your function based on the env it has to work with.
|
||||
|
||||
You can see [one](https://github.com/andyljones/megastep/blob/master/megastep/demo/heads.py) [implementation](https://github.com/andyljones/megastep/blob/master/megastep/demo/__init__.py#L17-L26) of this in my [megastep](https://andyljones.com/megastep/) work, but it's an idea that's been independently developed a few times. I haven't yet seen a general library for it.
|
||||
|
||||
## Log excessively.
|
||||
|
||||
The last few sections have involved controlled experiments of a sort, where you place your components in a known setup and see how they act. The complement to a controlled experiment is an observational study: watching your system in its natural habitat *very carefully* and seeing if you can spot anything anomalous.
|
||||
|
||||
In reinforcement learning, watching your system carefully means logging. Lots of logging. Below are some of the logs I've found particularly useful.
|
||||
|
||||
### Relative policy entropy
|
||||
|
||||
The entropy of your policy network's outputs, relative to the maximum possible entropy. It'll usually start near 1, then rapidly fall for a while, then flatten out for the rest of training.
|
||||
|
||||
If it stays very near 1, your agent is failing to learn any policy at all. You should check that your policy targets are being computed correctly, that the gradient's being backpropagated correctly, and - if you've defined a custom environment - that your environment is actually correct!
|
||||
|
||||
If it drops to zero or close to zero, then your agent has 'collapsed' into some - likely myopic - policy, and isn't exploring any more. This is usually because you've either forgotten to include an exploration mechanism of some sort (like epsilon-greedy actions or an entropy term in the loss), or because your rewards are much larger than whatever you're using to encourage exploration.
|
||||
|
||||
Sometimes it'll go up for a while; don't stress about that unless it's a large, permanent increase. If it *is* a large permanent increase and the minimum was very early in training, that can be an indicator that your policy fell into some myopic obviously-good behaviour that it's having to gradually climb back out of. It might help to turn up the exploration incentives.
|
||||
|
||||
If the entropy oscillates wildly, that usually means your learning rate is too high.
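
For a discrete action space, the relative entropy can be computed along these lines. This sketch assumes the policy outputs a list of action probabilities and that the maximum entropy is that of the uniform distribution over all actions.

```python
import math


def relative_entropy(probs):
    """Entropy of a discrete action distribution divided by log(n_actions)."""
    if len(probs) < 2:
        raise ValueError("relative entropy needs at least two actions")
    # Zero-probability actions contribute nothing to the entropy.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))
```

A uniform policy gives 1 and a deterministic policy gives 0, which matches the two failure cases described above.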
|
||||
|
||||
### Kullback-Leibler divergence
|
||||
|
||||
The KL div between the policy that was used to collect the experience in the batch, and the policy that your learner's just generated for the same batch. This should be small but positive.
|
||||
|
||||
If it's very large then your agent is having to learn from experience that's very different to the current policy. In some algorithms - like those with a replay buffer - that's expected, and all that's important is the KL div is stable. In other algorithms (like PPO), a very large KL div is an indicator that the experience reaching your network is 'stale', and that'll slow down training.
|
||||
|
||||
If it's very low then that suggests your network hasn't changed much in the time since the experience was generated, and you can probably get away with turning the learning rate up.
|
||||
|
||||
If it's growing steadily over time, that means you're probably feeding the same experience from early on in training back into the network again and again. Check your buffering system.
|
||||
|
||||
If it's negative - that shouldn't happen, and it means you're likely calculating the KL div incorrectly (probably by not handling invalid actions).
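
A sketch of the calculation for a single discrete action distribution, assuming the direction KL(old || new), where 'old' is the policy that collected the experience. The treatment of zero-probability (for example masked) actions is one reasonable choice among several, not the only correct one.

```python
import math


def kl_divergence(old_probs, new_probs):
    """KL(old || new) for one discrete action distribution.

    Actions with zero probability under the old policy (such as masked,
    invalid actions) contribute nothing and are skipped.
    """
    total = 0.0
    for p_old, p_new in zip(old_probs, new_probs):
        if p_old == 0.0:
            continue
        if p_new == 0.0:
            return math.inf  # the new policy forbids an action the old one could take
        total += p_old * math.log(p_old / p_new)
    return total
```

Computed over a full distribution like this, the result cannot be negative. Estimates built from sampled log-probabilities can dip below zero through noise, which is a separate issue from the invalid-action bug mentioned above.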
|
||||
|
||||
### Residual variance
|
||||
|
||||
The variance of (target values - network values), divided by the variance of the target values.
|
||||
|
||||
Like the policy entropy, this should start close to 1, fall very rapidly early on, and then decrease more gradually over the course of training.
|
||||
|
||||
If it stays near 1, your value network isn't learning to predict the rewards. Check that your rewards are what you think they are, and check that your value loss and backprop through the value net are all working correctly.
|
||||
|
||||
If it drops to zero, that's usually because the policy entropy has dropped to zero too, the policy has collapsed into some deterministic behaviour, and the value network has learned the rewards it is collecting perfectly. Another common reason is that some scenarios are generating vastly larger returns than the others, and the value net's learned to identify when that happens.
|
||||
|
||||
If the residual variance oscillates wildly, that usually means your learning rate is too high.
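
The definition above translates directly into a couple of lines. This sketch uses population variances from the standard library and assumes the targets are not all identical, since the ratio is undefined in that case.

```python
from statistics import pvariance


def residual_variance(targets, values):
    """Variance of (target - value) divided by the variance of the targets."""
    residuals = [t - v for t, v in zip(targets, values)]
    return pvariance(residuals) / pvariance(targets)
```

A value network that always predicts the mean target scores 1, and one that predicts every target exactly scores 0.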
|
||||
|
||||
### Terminal correlation
|
||||
|
||||
The correlation between the value in the final state and the reward in the final step. This is only useful when there's lots of reward in the final step (like in boardgames).
|
||||
|
||||
It should start near zero, rise rapidly, then plateau near 1.
|
||||
|
||||
If it stays near zero but all the other value-related logs look good, then check that your reward-to-gos are being calculated correctly near termination!
|
||||
|
||||
If reward is more evenly distributed through the episode, you could write a version of this that looks at the correlation of (next state's value - this state's value) with the reward in that step. I haven't used this myself though, so can't offer commentary.
|
||||
|
||||
### Penultimate terminal correlation
|
||||
|
||||
The correlation between the value in the penultimate step and the final reward. Again, only useful when there's lots of reward at the end of the episode. If terminal correlation is high but penultimate terminal correlation is low, that's a strong indicator that your reward-to-gos aren't being carried backwards properly.
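
Both correlations can be computed from the same logged episodes. In this sketch the episode format (a pair of per-step value and reward lists) is an assumption made for illustration, and the Pearson correlation is written out by hand to stay within the standard library.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def terminal_correlations(episodes):
    """episodes: list of (values, rewards) pairs, one entry per step.

    Returns the (terminal, penultimate) correlations with the final reward.
    """
    final_rewards = [rewards[-1] for _, rewards in episodes]
    terminal = pearson([values[-1] for values, _ in episodes], final_rewards)
    penultimate = pearson([values[-2] for values, _ in episodes], final_rewards)
    return terminal, penultimate
```

Every episode needs at least two steps for the penultimate figure to exist, and both figures are undefined if all final rewards are identical.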
|
||||
|
||||
### Value target distribution
|
||||
|
||||
Either plot a histogram, or the min/max/mean/std. The plots should indicate 'reasonable' value targets in the range [-10, +10] (and ideally [-3, +3]).
|
||||
|
||||
If they're larger than that, make your rewards proportionately smaller; if they're smaller than that, make your rewards larger.
|
||||
|
||||
If they blow up, check that your reward discounting is correct, and possibly make your discount rate smaller.
|
||||
|
||||
If they're blowing up but you're insistent on leaving the discount rate where it is, one alternative is to increase the number of steps used to bootstrap the value targets. In PPO, this'd mean using longer chunks. Longer chunks mean that the values used for bootstrapping get shrunk more before they're fed back to the value net as targets, increasing the stability. You could also consider annealing the discount factor from a smaller value up towards 1.
|
||||
|
||||
### Reward distribution
|
||||
|
||||
Again, as a histogram or min/max/mean/std. What a reasonable reward distribution is depends on the environment; some envs have a few large rewards, while others have lots of small rewards. Either way, if it doesn't match your expectations then you should investigate.
|
||||
|
||||
### Value distribution
|
||||
|
||||
Again, as a histogram or min/max/mean/std. This is a complement to the previous two distributions and *should* closely match the value target distribution. If it doesn't, and it stays different from the value target distribution, that's an indicator that your value network is having trouble learning.
|
||||
|
||||
It's also worth keeping an eye on the sign of the distribution. If your env only produces positive rewards but there are persistently negative values in the value target distribution, that suggests your reward-to-go mechanism is badly broken or your value network is failing to learn.
|
||||
|
||||
### Advantage distribution
|
||||
|
||||
Again, as a histogram or min/max/mean/std. As with the value targets, these should be in the range [-10, +10] (and ideally [-3, +3]).
|
||||
|
||||
Advantages should also be approximately mean-zero due to how they're constructed; if they're persistently not then you've messed up your advantage calculations.
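
The min/max/mean/std summaries used for these distributions can share one helper. This is a sketch: the `soft_limit` default mirrors the [-10, +10] range suggested above, and the dictionary layout is an arbitrary choice.

```python
from statistics import mean, pstdev


def summarise(xs, soft_limit=10.0):
    """Min/max/mean/std of a logged quantity, flagged when it leaves +/- soft_limit."""
    stats = {"min": min(xs), "max": max(xs), "mean": mean(xs), "std": pstdev(xs)}
    stats["out_of_range"] = stats["min"] < -soft_limit or stats["max"] > soft_limit
    return stats
```

For advantages, the `mean` entry is the one to watch for the persistent offset from zero described above.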
|
||||
|
||||
### Episode length distribution
|
||||
|
||||
Again, as a histogram or a min/max/mean/std. As with the reward distribution, interpreting this depends on the environment. If your environment should have arbitrary-length episodes, but you're seeing that every episode here is length 7, that indicates your environment is broken or your network's fallen into some degenerate behaviour.
|
||||
|
||||
### Sample staleness
|
||||
|
||||
Sample staleness is the number of learner steps between the network used to generate a sample, and the network currently learning from that sample. You can generate this by setting an 'age' attribute on the network, and incrementing it at every learner step. Then when a sample arrives at the learner, diff it against the learner's current age.
|
||||
|
||||
How to interpret this depends on the algorithm, but it should generally stay at a steady value throughout training. In on-policy algorithms, lower sample stalenesses are better; in off-policy algorithms it's a tradeoff between fresh samples that let the network bootstrap quickly, and aged samples that stabilise things.
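
One way the age bookkeeping might look is sketched below. The `Learner` class and its method names are hypothetical; the only idea taken from the text is stamping each sample with the network's age and diffing against the learner's current age.

```python
class Learner:
    """Tracks how many learner steps have been taken."""

    def __init__(self):
        self.age = 0

    def snapshot_age(self):
        # The actor stamps every sample it generates with this number.
        return self.age

    def step(self, batch_ages):
        """Return the mean staleness of a batch, then advance the learner's age."""
        staleness = [self.age - sample_age for sample_age in batch_ages]
        self.age += 1
        return sum(staleness) / len(staleness)
```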
|
||||
|
||||
### Step statistics
|
||||
|
||||
Step statistics are the abs-max and mean-square-value of the difference between the network's parameters when it enters the learner, and the network's parameters when it leaves the learner.
|
||||
|
||||
Interpreting this depends on a whole bunch of things, but the mean-square value should typically be very small (1e-3 in my current training run with a LR of 1e-2), while the abs-max should be small yet substantially larger than the mean-square-value.
|
||||
|
||||
If the statistics are much smaller than that, you might be able to increase your learning rate; if they're much larger than that then be on the lookout for instability in your training.
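
A sketch of the two step statistics, assuming the parameters have been flattened into plain lists of floats before and after the learner step.

```python
def step_statistics(params_before, params_after):
    """Abs-max and mean-square of the parameter change over one learner step."""
    diffs = [after - before for before, after in zip(params_before, params_after)]
    abs_max = max(abs(d) for d in diffs)
    mean_square = sum(d * d for d in diffs) / len(diffs)
    return abs_max, mean_square
```

The same function applied to a flattened gradient, with a list of zeros as `params_before`, gives the abs-max and mean-square of the gradient.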
|
||||
|
||||
### Gradient statistics
|
||||
|
||||
Gradient statistics are the abs-max and mean-square-value of the gradient. In the age of Adam and other quasi-Newton optimizers, this isn't as informative as it once was, because normalising by the curvature estimates can dramatically inflate or collapse the gradient.
|
||||
|
||||
That said, if the step statistics are looking strange, this can help diagnose whether the problem is with the gradient calculation or with Adam's second-order magic.
|
||||
|
||||
### Gradient noise
|
||||
|
||||
This is from [McCandlish and Kaplan](https://arxiv.org/abs/1812.06162), and it's intended to help you choose your batch size. Unfortunately it's *spectacularly* noisy, to the point where you likely want to average over all steps in your run.
|
||||
|
||||
I've been thinking that it might be possible to get more stable estimates of the gradient noise from Adam's moment estimates, but that's decidedly on the to-do list.
|
||||
|
||||
### Component throughput
|
||||
|
||||
At the least, keep track of the actor throughput and learner throughput in terms of samples per second, and steps per second.
|
||||
|
||||
Typically the actor should be generating *at most* as many samples as the learner is consuming. If the actor is generating excess samples there are weak reasons that might be a good thing - it'll refresh the replay buffer more rapidly - but typically it's considered a waste of compute.
|
||||
|
||||
More generally, you want to see these remain stable throughout training. If your throughputs gradually decay, you're accumulating some costly state somewhere in your system.
|
||||
|
||||
(For me, problems with gradually-slowing-down systems have always turned out to be with stats and logging, but I suspect that's because I've rolled my own stats and logging systems)
|
||||
|
||||
### Value trace
|
||||
|
||||
The trace of the value over a random episode from recent history, plotted together with the rewards. This can be useful if you suspect your value function or rewards of 'being weird' in some way; the value trace should typically be a collection of exponentially-increasing curves leading up to rewards, followed by vertical drops as the agent collects those rewards.
|
||||
|
||||
### GPU stats
|
||||
|
||||
There are several GPU-related stats that are worth tracking. First are the memory stats, which in PyTorch include
|
||||
|
||||
* the *memory allocation*, as reported by `torch.cuda.max_memory_allocated`. This is how much memory has actually been *used* by your computations,
|
||||
* the *memory reserve*, as reported by `torch.cuda.max_memory_reserved`. This is how much memory PyTorch has *set aside* for your computations,
|
||||
* the *memory gross*, as reported by `nvidia-smi`. This is how much memory PyTorch is using overall, [including the ~gigabyte it needs for its own kernels](https://github.com/pytorch/pytorch/issues/20532#issuecomment-540628939). It's this figure that'll crash your program if it hits the GPU's memory limit.
|
||||
|
||||
Keeping track of all three is useful for diagnosing memory issues: figuring out if it's you that's hanging onto too many tensors, or PyTorch that's being too aggressive with its caching.
|
||||
|
||||
If you're running out of memory and you can't immediately figure out why, [memlab](https://github.com/Stonesjtu/pytorch_memlab#memory-profiler) can help a lot. Disclosure: I wrote the frontend.
|
||||
|
||||
As well as the memory stats, it's also useful to track the utilization, fan speed and temperature reported by `nvidia-smi`. You can get these values in [machine-readable form](https://github.com/andyljones/megastep/blob/master/rebar/stats/gpu.py#L17-L29).
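
A sketch of one way to read these values, not the linked implementation. It relies on `nvidia-smi`'s `--query-gpu` and `--format=csv,noheader,nounits` options; the choice of fields and the handling of unavailable values are assumptions for this example.

```python
import subprocess

FIELDS = ["utilization.gpu", "fan.speed", "temperature.gpu", "memory.used"]


def _to_float(cell):
    # Some GPUs report '[N/A]' for fields they lack, such as fan speed.
    try:
        return float(cell)
    except ValueError:
        return None


def parse_gpu_stats(csv_text):
    """Turn nvidia-smi CSV output into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        cells = [cell.strip() for cell in line.split(",")]
        rows.append(dict(zip(FIELDS, (_to_float(cell) for cell in cells))))
    return rows


def query_gpu_stats():
    """Run nvidia-smi and parse its output; needs an NVIDIA driver installed."""
    command = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return parse_gpu_stats(subprocess.check_output(command, text=True))
```

Keeping the parser separate from the subprocess call means it can be tested on machines without a GPU.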
|
||||
|
||||
In particular, if the utilization is persistently low then you should profile your code. Make sure to set `CUDA_LAUNCH_BLOCKING=1` before importing your tensor library, and then use [snakeviz](https://jiffyclub.github.io/snakeviz/) or [tuna](https://github.com/nschloe/tuna) to profile things in a broad way. If that's not enough detail, you can dig into things further with [nsight](https://developer.nvidia.com/nsight-systems).
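
A minimal sketch of the broad profiling step using the standard library's `cProfile`. The `profile_to_file` helper and the default `train.prof` path are hypothetical names for this example. The environment variable is set at the very top so that it is in place before any tensor library is imported.

```python
import os

# Must be set before the tensor library is imported.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import cProfile
import pstats


def profile_to_file(fn, path="train.prof"):
    """Profile a zero-argument callable and dump the results to `path`."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        fn()
    finally:
        profiler.disable()
    profiler.dump_stats(path)
    return pstats.Stats(path)
```

The dumped file can then be opened with `snakeviz train.prof` or `tuna train.prof`.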
|
||||
|
||||
### Traditional metrics
|
||||
|
||||
As well as the above, I also plot some other things out of habit
|
||||
|
||||
* **Reward per trajectory**: should increase dramatically at the start of training. This is, usually, what you care about. Unfortunately it's incredibly noisy and does little to localise errors. Closely related is the **reward per step**, which is typically what you care about in infinite environments.
|
||||
* **Mean value**: is (if your value network is working well) a less-noisy proxy for the reward per trajectory. If your trajectories are particularly long compared to your reward discount factor however, this can be dramatically different from the reward per trajectory.
|
||||
* **Policy and value losses**: should fall dramatically at the start of training, then level out.
|
||||
|
||||
## Credit
|
||||
|
||||
* **kfir.b.y**, for spotting an error in my description of the probe environments.
|
||||
|
||||
2021/01/01
|
||||
|
||||
|
||||
@@ -0,0 +1,719 @@
|
||||
Source: https://cs229.stanford.edu/materials/ML-advice.pdf
|
||||
Title: CS229 - Advice for Applying Machine Learning (Andrew Ng)
|
||||
Fetched-via: bash -c 'uvx "markitdown[pdf]" https://cs229.stanford.edu/materials/ML-advice.pdf'
|
||||
Fetch-status: verbatim
|
||||
|
||||
Advice for applying
|
||||
Machine Learning
|
||||
|
||||
Andrew Ng
|
||||
|
||||
Stanford University
|
||||
|
||||
Andrew Y. Ng
|
||||
|
||||
Today’s Lecture
|
||||
|
||||
• Advice on how to apply learning algorithms to different applications.
|
||||
|
||||
• Most of today’s material is not very mathematical. But it’s also some of the
|
||||
|
||||
hardest material in this class to understand.
|
||||
|
||||
• Some of what I’ll say today is debatable.
|
||||
|
||||
• Some of what I’ll say is not good advice for doing novel machine learning
|
||||
|
||||
research.
|
||||
|
||||
• Key ideas:
|
||||
|
||||
1. Diagnostics for debugging learning algorithms.
|
||||
2. Error analyses and ablative analysis.
|
||||
3. How to get started on a machine learning problem.
|
||||
|
||||
– Premature (statistical) optimization.
|
||||
|
||||
|
||||
|
||||
Debugging Learning
|
||||
Algorithms
|
||||
|
||||
|
||||
|
||||
Debugging learning algorithms
|
||||
|
||||
Motivating example:
|
||||
|
||||
• Anti-spam. You carefully choose a small set of 100 words to use as
|
||||
|
||||
features. (Instead of using all 50000+ words in English.)
|
||||
|
||||
• Bayesian logistic regression, implemented with gradient descent, gets 20%
|
||||
|
||||
test error, which is unacceptably high.
|
||||
|
||||
• What to do next?
|
||||
|
||||
|
||||
|
||||
Fixing the learning algorithm
|
||||
|
||||
• Bayesian logistic regression:
|
||||
|
||||
• Common approach: Try improving the algorithm in different ways.
|
||||
|
||||
– Try getting more training examples.
|
||||
– Try a smaller set of features.
|
||||
– Try a larger set of features.
|
||||
– Try changing the features: Email header vs. email body features.
|
||||
– Run gradient descent for more iterations.
|
||||
– Try Newton’s method.
|
||||
– Use a different value for λ.
|
||||
– Try using an SVM.
|
||||
|
||||
• This approach might work, but it’s very time-consuming, and largely a matter
|
||||
|
||||
of luck whether you end up fixing what the problem really is.
|
||||
|
||||
|
||||
|
||||
Diagnostic for bias vs. variance
|
||||
|
||||
Better approach:
|
||||
|
||||
– Run diagnostics to figure out what the problem is.
|
||||
– Fix whatever the problem is.
|
||||
|
||||
Bayesian logistic regression’s test error is 20% (unacceptably high).
|
||||
|
||||
Suppose you suspect the problem is either:
|
||||
|
||||
– Overfitting (high variance).
|
||||
– Too few features to classify spam (high bias).
|
||||
|
||||
Diagnostic:
|
||||
|
||||
– Variance: Training error will be much lower than test error.
|
||||
– Bias: Training error will also be high.
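
The diagnostic above can be sketched as a small function. The `gap_threshold` default and the returned labels are assumptions for illustration; what counts as "much lower" depends on the application.

```python
def diagnose_bias_variance(train_error, test_error, desired_error, gap_threshold=0.05):
    """Apply the train/test error diagnostic for bias vs. variance."""
    findings = []
    if test_error - train_error > gap_threshold:
        findings.append("high variance")  # training error much lower than test error
    if train_error > desired_error:
        findings.append("high bias")  # even the training error is too high
    return findings or ["no obvious bias/variance problem"]
```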
|
||||
|
||||
|
||||
|
||||
More on bias vs. variance
|
||||
|
||||
Typical learning curve for high variance:
|
||||
|
||||
[Figure: error vs. m (training set size), with curves for test error and training error and the desired performance level marked.]
|
||||
|
||||
• Test error still decreasing as m increases. Suggests larger training set
|
||||
will help.
|
||||
• Large gap between training and test error.
|
||||
|
||||
|
||||
|
||||
More on bias vs. variance
|
||||
|
||||
Typical learning curve for high bias:
|
||||
|
||||
[Figure: error vs. m (training set size), with curves for test error and training error and the desired performance level marked.]
|
||||
|
||||
• Even training error is unacceptably high.
|
||||
• Small gap between training and test error.
|
||||
|
||||
|
||||
|
||||
Diagnostics tell you what to try next
|
||||
|
||||
Bayesian logistic regression, implemented with gradient descent.
|
||||
|
||||
Fixes to try:
|
||||
|
||||
– Try getting more training examples. (Fixes high variance.)
– Try a smaller set of features. (Fixes high variance.)
– Try a larger set of features. (Fixes high bias.)
– Try email header features. (Fixes high bias.)
– Run gradient descent for more iterations.
– Try Newton’s method.
– Use a different value for λ.
– Try using an SVM.
|
||||
|
||||
|
||||
|
||||
Optimization algorithm diagnostics
|
||||
|
||||
• Bias vs. variance is one common diagnostic.
|
||||
|
||||
• For other problems, it’s usually up to your own ingenuity to construct your
|
||||
|
||||
own diagnostics to figure out what’s wrong.
|
||||
|
||||
• Another example:
|
||||
|
||||
– Bayesian logistic regression gets 2% error on spam, and 2% error on non-spam.
|
||||
|
||||
(Unacceptably high error on non-spam.)
|
||||
|
||||
– SVM using a linear kernel gets 10% error on spam, and 0.01% error on non-spam. (Acceptable performance.)
|
||||
|
||||
– But you want to use logistic regression, because of computational efficiency, etc.
|
||||
|
||||
• What to do next?
|
||||
|
||||
|
||||
|
||||
More diagnostics
|
||||
|
||||
• Other common questions:
|
||||
|
||||
– Is the algorithm (gradient descent for logistic regression) converging?
|
||||
|
||||
[Figure: objective J(θ) plotted against iterations.]
|
||||
|
||||
It’s often very hard to tell if an algorithm has converged yet by looking at the objective.
|
||||
|
||||
|
||||
|
||||
More diagnostics
|
||||
|
||||
• Other common questions:
|
||||
|
||||
– Is the algorithm (gradient descent for logistic regression) converging?
|
||||
– Are you optimizing the right function?
|
||||
– I.e., what you care about:
|
||||
|
||||
(weights w(i) higher for non-spam than for spam).
|
||||
– Bayesian logistic regression? Correct value for λ?
|
||||
|
||||
– SVM? Correct value for C?
|
||||
|
||||
|
||||
|
||||
Diagnostic
|
||||
|
||||
An SVM outperforms Bayesian logistic regression, but you really want to deploy Bayesian
|
||||
|
||||
logistic regression for your application.
|
||||
|
||||
Let θSVM be the parameters learned by an SVM.
|
||||
|
||||
Let θBLR be the parameters learned by Bayesian logistic regression.
|
||||
|
||||
You care about weighted accuracy a(θ).

θSVM outperforms θBLR. So: a(θSVM) > a(θBLR).

BLR tries to maximize J(θ).

Diagnostic: is J(θSVM) > J(θBLR)?
|
||||
|
||||
|
||||
|
||||
Two cases
|
||||
|
||||
Case 1: a(θSVM) > a(θBLR) and J(θSVM) > J(θBLR).

But BLR was trying to maximize J(θ). This means that θBLR fails to maximize J, and the problem is with the convergence of the algorithm. The problem is with the optimization algorithm.
|
||||
|
||||
Case 2: a(θSVM) > a(θBLR) and J(θSVM) ≤ J(θBLR).

This means that BLR succeeded at maximizing J(θ). But the SVM, which does worse on J(θ), actually does better on weighted accuracy a(θ).

This means that J(θ) is the wrong function to be maximizing, if you care about a(θ).

The problem is with the objective function of the maximization problem.
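
The two cases can be sketched as a comparison of four numbers. The function and argument names are hypothetical; `a_*` is the weighted accuracy and `j_*` is the BLR objective J evaluated at each model's parameters.

```python
def diagnose_blr(a_svm, a_blr, j_svm, j_blr):
    """Decide whether BLR's problem lies in the optimizer or in the objective."""
    if a_svm <= a_blr:
        return "nothing to diagnose"  # the SVM is not actually outperforming BLR
    if j_svm > j_blr:
        # Case 1: the SVM's parameters score higher on J than BLR's own
        # optimum, so BLR failed to maximize J.
        return "optimization algorithm"
    # Case 2: BLR maximized J, but maximizing J does not give the best a.
    return "optimization objective"
```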
|
||||
|
||||
|
||||
|
||||
Diagnostics tell you what to try next
|
||||
|
||||
Bayesian logistic regression, implemented with gradient descent.
|
||||
|
||||
Fixes to try:
|
||||
|
||||
– Try getting more training examples. (Fixes high variance.)
– Try a smaller set of features. (Fixes high variance.)
– Try a larger set of features. (Fixes high bias.)
– Try email header features. (Fixes high bias.)
– Run gradient descent for more iterations. (Fixes optimization algorithm.)
– Try Newton’s method. (Fixes optimization algorithm.)
– Use a different value for λ. (Fixes optimization objective.)
– Try using an SVM. (Fixes optimization objective.)
|
||||
|
||||
|
||||
|
||||
The Stanford Autonomous Helicopter
|
||||
|
||||
Payload: 14 pounds
|
||||
Weight: 32 pounds
|
||||
|
||||
|
||||
|
||||
Machine learning algorithm
|
||||
|
||||
1. Build a simulator of helicopter.
|
||||
|
||||
Simulator
|
||||
|
||||
2. Choose a cost function. Say J(θ) = ||x – xdesired||² (x = helicopter position)
|
||||
|
||||
3. Run reinforcement learning (RL) algorithm to fly helicopter in simulation, so as to try to minimize cost function:

θRL = arg min_θ J(θ)
|
||||
|
||||
Suppose you do this, and the resulting controller parameters θRL give much worse performance than your human pilot. What to do next?
|
||||
|
||||
Improve simulator?
|
||||
Modify cost function J?
|
||||
Modify RL algorithm?
|
||||
|
||||
|
||||
|
||||
Debugging an RL algorithm
|
||||
|
||||
The controller given by θRL performs poorly.
|
||||
Suppose that:
|
||||
|
||||
1. The helicopter simulator is accurate.
|
||||
|
||||
2. The RL algorithm correctly controls the helicopter (in simulation) so as to
|
||||
|
||||
minimize J(θ).
|
||||
|
||||
3. Minimizing J(θ) corresponds to correct autonomous flight.
|
||||
|
||||
Then: The learned parameters θRL should fly well on the actual helicopter.
|
||||
|
||||
Diagnostics:
|
||||
|
||||
1. If θRL flies well in simulation, but not in real life, then the problem is in the simulator. Otherwise:
2. Let θhuman be the human control policy. If J(θhuman) < J(θRL), then the problem is in the reinforcement learning algorithm. (Failing to minimize the cost function J.)
3. If J(θhuman) ≥ J(θRL), then the problem is in the cost function. (Minimizing it doesn’t correspond to good autonomous flight.)
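
The three diagnostics form a short decision chain, sketched here with hypothetical argument names. `cost_human` and `cost_rl` stand for J evaluated on the human pilot's policy and on the learned controller.

```python
def diagnose_controller(flies_well_in_sim, flies_well_in_real_life, cost_human, cost_rl):
    """Point at the simulator, the RL algorithm, or the cost function."""
    if flies_well_in_real_life:
        return "nothing to fix"
    if flies_well_in_sim:
        return "simulator"  # good in simulation, bad in reality
    if cost_human < cost_rl:
        return "RL algorithm"  # a lower-cost policy exists, so J was not minimized
    return "cost function"  # RL achieved low cost, yet still flies badly
```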
|
||||
|
||||
|
||||
|
||||
More on diagnostics
|
||||
|
||||
• Quite often, you’ll need to come up with your own diagnostics to figure out
|
||||
|
||||
what’s happening in an algorithm.
|
||||
|
||||
• Even if a learning algorithm is working well, you might also run diagnostics to
|
||||
|
||||
make sure you understand what’s going on. This is useful for:
|
||||
|
||||
– Understanding your application problem: If you’re working on one important ML application for months/years, it’s very valuable for you personally to get an intuitive understanding of what works and what doesn’t work in your problem.
|
||||
|
||||
– Writing research papers: Diagnostics and error analysis help convey insight about
|
||||
|
||||
the problem, and justify your research claims.
|
||||
|
||||
– I.e., Rather than saying “Here’s an algorithm that works,” it’s more interesting to
|
||||
say “Here’s an algorithm that works because of component X, and here’s my
|
||||
justification.”
|
||||
|
||||
• Good machine learning practice: Error analysis. Try to understand what
|
||||
|
||||
your sources of error are.
|
||||
|
||||
|
||||
|
||||
Error Analysis
|
||||
|
||||
|
||||
|
||||
Error analysis
|
||||
|
||||
Many applications combine many different learning components into a
|
||||
“pipeline.” E.g., Face recognition from images: [contrived example]
|
||||
|
||||
[Pipeline diagram: Camera image → Preprocess (remove background) → Face detection → Eyes segmentation, Nose segmentation, Mouth segmentation → Logistic regression → Label]
|
||||
|
||||
|
||||
|
||||
Error analysis

[Pipeline diagram, repeated: Camera image → Preprocess (remove background) → Face detection → Eyes segmentation, Nose segmentation, Mouth segmentation → Logistic regression → Label]
|
||||
|
||||
How much error is attributable to each of the
|
||||
|
||||
components?
|
||||
|
||||
Plug in ground-truth for each component, and
|
||||
|
||||
see how accuracy changes.
|
||||
|
||||
Conclusion: Most room for improvement in face
|
||||
|
||||
detection and eyes segmentation.
|
||||
|
||||
Component
|
||||
|
||||
Accuracy
|
||||
|
||||
Overall system
|
||||
|
||||
85%
|
||||
|
||||
Preprocess (remove
|
||||
background)
|
||||
|
||||
Face detection
|
||||
|
||||
Eyes segmentation
|
||||
|
||||
Nose segmentation
|
||||
|
||||
Mouth segmentation
|
||||
|
||||
85.1%
|
||||
|
||||
91%
|
||||
|
||||
95%
|
||||
|
||||
96%
|
||||
|
||||
97%
|
||||
|
||||
Logistic regression
|
||||
|
||||
100%
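The attribution behind that conclusion is just a running difference over the accuracy table. The sketch below assumes the rows are cumulative, that is, each row keeps ground truth plugged in for all earlier stages; the slide does not state this explicitly, and the function name `attribute_gains` is made up for illustration.

```python
# Accuracy (percent) from the error-analysis table, read as cumulative:
# each row also keeps ground truth for every stage listed before it.
cumulative_accuracy = [
    ("Overall system", 85.0),
    ("Preprocess (remove background)", 85.1),
    ("Face detection", 91.0),
    ("Eyes segmentation", 95.0),
    ("Nose segmentation", 96.0),
    ("Mouth segmentation", 97.0),
    ("Logistic regression", 100.0),
]


def attribute_gains(rows):
    """Return (component, gain) pairs: accuracy gained by making each stage perfect."""
    gains = []
    for (_, previous), (name, accuracy) in zip(rows, rows[1:]):
        gains.append((name, round(accuracy - previous, 1)))
    return gains


gains = attribute_gains(cumulative_accuracy)
# Print the stages with the most headroom first.
for name, gain in sorted(gains, key=lambda item: item[1], reverse=True):
    print(f"{name:32s} +{gain:.1f} points")
```

Under that reading, the two largest gains belong to face detection and eyes segmentation, which is where the slide says to spend effort.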
|
||||
|
||||
|
||||
Ablative analysis
|
||||
|
||||
Error analysis tries to explain the difference between current performance and
|
||||
|
||||
perfect performance.
|
||||
|
||||
Ablative analysis tries to explain the difference between some baseline (much
|
||||
|
||||
poorer) performance and current performance.
|
||||
|
||||
E.g., Suppose that you’ve built a good anti-spam classifier by adding lots of clever features to logistic regression:
|
||||
|
||||
– Spelling correction.
|
||||
– Sender host features.
|
||||
– Email header features.
|
||||
– Email text parser features.
|
||||
– Javascript parser.
|
||||
– Features from embedded images.
|
||||
|
||||
Question: How much did each of these components really help?
|
||||
|
||||
|
||||
|
||||
Ablative analysis
|
||||
|
||||
Simple logistic regression without any clever features gets 94% performance.
|
||||
|
||||
Just what accounts for your improvement from 94 to 99.9%?
|
||||
|
||||
Ablative analysis: Remove components from your system one at a time, to see
|
||||
|
||||
how it breaks.
|
||||
|
||||
| Component | Accuracy |
|---|---|
| Overall system | 99.9% |
| Spelling correction | 99.0% |
| Sender host features | 98.9% |
| Email header features | 98.9% |
| Email text parser features | 95% |
| Javascript parser | 94.5% |
| Features from images | 94.0% [baseline] |
|
||||
|
||||
Conclusion: The email text parser features account for most of the
|
||||
|
||||
improvement.
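The same bookkeeping works in the other direction for ablation. This sketch assumes each row of the ablation table is the accuracy after removing that component in addition to the ones above it (the slide implies this but does not say so); `largest_drop` is a hypothetical helper name.

```python
# Accuracy (percent) from the ablation table, read as cumulative removals.
ablation = [
    ("Overall system", 99.9),
    ("Spelling correction", 99.0),
    ("Sender host features", 98.9),
    ("Email header features", 98.9),
    ("Email text parser features", 95.0),
    ("Javascript parser", 94.5),
    ("Features from images", 94.0),  # baseline: plain logistic regression
]


def largest_drop(rows):
    """Return the (component, drop) pair whose removal costs the most accuracy."""
    drops = [
        (name, round(previous - accuracy, 1))
        for (_, previous), (name, accuracy) in zip(rows, rows[1:])
    ]
    return max(drops, key=lambda item: item[1])


print(largest_drop(ablation))
```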
|
||||
|
||||
|
||||
|
||||
Getting started on a
|
||||
learning problem
|
||||
|
||||
|
||||
|
||||
Getting started on a problem
|
||||
|
||||
Approach #1: Careful design.
|
||||
|
||||
• Spend a long time designing exactly the right features, collecting the right dataset, and designing the right algorithmic architecture.
|
||||
|
||||
• Implement it and hope it works.
|
||||
|
||||
• Benefit: Nicer, perhaps more scalable algorithms. May come up with new, elegant,
|
||||
|
||||
learning algorithms; contribute to basic research in machine learning.
|
||||
|
||||
Approach #2: Build-and-fix.
|
||||
|
||||
• Implement something quick-and-dirty.
|
||||
|
||||
• Run error analyses and diagnostics to see what’s wrong with it, and fix its errors.
|
||||
|
||||
• Benefit: Will often get your application problem working more quickly. Faster time to
|
||||
|
||||
market.
|
||||
|
||||
|
||||
|
||||
Premature statistical optimization
|
||||
|
||||
Very often, it’s not clear what parts of a system are easy or difficult to build, and which parts you need to spend lots of time focusing on. E.g.,

[Diagram: Camera image → Preprocess (remove background) → Face detection → Eyes segmentation, Nose segmentation, Mouth segmentation → Logistic regression → Label]

This system’s much too complicated for a first attempt.

Step 1 of designing a learning system: Plot the data.

The only way to find out what needs work is to implement something quickly, and find out what parts break.

[But this may be bad advice if your goal is to come up with new machine learning algorithms.]
|
||||
|
||||
|
||||
|
||||
The danger of over-theorizing
|
||||
|
||||
[Figure; node labels: Mail delivery robot; Obstacle avoidance; Navigation; Robot manipulation; Object detection; Color invariance; 3d similarity learning; Differential geometry of 3d manifolds; Complexity of non-Riemannian geometries; VC dimension; … Convergence bounds for sampled non-monotonic logic.]

[Based on Papadimitriou, 1995]
|
||||
|
||||
|
||||
|
||||
Summary
|
||||
|
||||
|
||||
|
||||
Summary
|
||||
|
||||
• Time spent coming up with diagnostics for learning algorithms is time well spent.
|
||||
|
||||
• It’s often up to your own ingenuity to come up with the right diagnostics.
|
||||
|
||||
• Error analyses and ablative analyses also give insight into the problem.
|
||||
|
||||
• Two approaches to applying learning algorithms:
|
||||
|
||||
– Design very carefully, then implement.
|
||||
|
||||
• Risk of premature (statistical) optimization.
|
||||
– Build a quick-and-dirty prototype, diagnose, and fix.
|
||||
|
||||
|
||||
|
||||
|
||||
@@ -0,0 +1,353 @@
|
||||
Source: https://cs231n.github.io/neural-networks-3/
|
||||
Title: CS231n - Neural Networks Part 3: Learning and Evaluation
|
||||
Fetched-via: uvx markitdown https://cs231n.github.io/neural-networks-3/
|
||||
Fetch-status: verbatim
|
||||
|
||||
[CS231n Deep Learning for Computer Vision](https://cs231n.github.io)
|
||||
[Course Website](http://cs231n.stanford.edu/)
|
||||
|
||||
|
||||
|
||||
Table of Contents:
|
||||
|
||||
* [Gradient checks](#gradcheck)
|
||||
* [Sanity checks](#sanitycheck)
|
||||
* [Babysitting the learning process](#baby)
|
||||
+ [Loss function](#loss)
|
||||
+ [Train/val accuracy](#accuracy)
|
||||
+ [Weights:Updates ratio](#ratio)
|
||||
+ [Activation/Gradient distributions per layer](#distr)
|
||||
+ [Visualization](#vis)
|
||||
* [Parameter updates](#update)
|
||||
+ [First-order (SGD), momentum, Nesterov momentum](#sgd)
|
||||
+ [Annealing the learning rate](#anneal)
|
||||
+ [Second-order methods](#second)
|
||||
+ [Per-parameter adaptive learning rates (Adagrad, RMSProp)](#ada)
|
||||
* [Hyperparameter Optimization](#hyper)
|
||||
* [Evaluation](#eval)
|
||||
+ [Model Ensembles](#ensemble)
|
||||
* [Summary](#summary)
|
||||
* [Additional References](#add)
|
||||
|
||||
## Learning
|
||||
|
||||
In the previous sections we’ve discussed the static parts of a neural network: how we can set up the network connectivity, the data, and the loss function. This section is devoted to the dynamics, or in other words, the process of learning the parameters and finding good hyperparameters.
|
||||
|
||||
### Gradient Checks
|
||||
|
||||
In theory, performing a gradient check is as simple as comparing the analytic gradient to the numerical gradient. In practice, the process is much more involved and error prone. Here are some tips, tricks, and issues to watch out for:
|
||||
|
||||
**Use the centered formula**. The formula you may have seen for the finite difference approximation when evaluating the numerical gradient looks as follows:
|
||||
|
||||
\[\frac{df(x)}{dx} = \frac{f(x + h) - f(x)}{h} \hspace{0.1in} \text{(bad, do not use)}\]
|
||||
|
||||
where \(h\) is a very small number, approximately 1e-5 or so in practice. It turns out that it is much better to use the *centered* difference formula of the form:
|
||||
|
||||
\[\frac{df(x)}{dx} = \frac{f(x + h) - f(x - h)}{2h} \hspace{0.1in} \text{(use instead)}\]
|
||||
|
||||
This requires you to evaluate the loss function twice to check every single dimension of the gradient (so it is about 2 times as expensive), but the gradient approximation turns out to be much more precise. To see this, you can use Taylor expansion of \(f(x+h)\) and \(f(x-h)\) and verify that the first formula has an error on order of \(O(h)\), while the second formula only has error terms on order of \(O(h^2)\) (i.e. it is a second order approximation).
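A quick numerical experiment makes the difference in error order visible. The sketch below is only an illustration on a function whose derivative is known in closed form (sin, with derivative cos); the helper names are not from the course code. Shrinking \(h\) by 10x should shrink the forward-difference error by roughly 10x and the centered-difference error by roughly 100x.

```python
import math


def forward_difference(f, x, h):
    """One-sided estimate of f'(x); error is O(h)."""
    return (f(x + h) - f(x)) / h


def centered_difference(f, x, h):
    """Two-sided estimate of f'(x); error is O(h^2)."""
    return (f(x + h) - f(x - h)) / (2 * h)


x = 1.0
exact = math.cos(x)  # analytic derivative of sin at x
for h in (1e-2, 1e-3, 1e-4):
    forward_error = abs(forward_difference(math.sin, x, h) - exact)
    centered_error = abs(centered_difference(math.sin, x, h) - exact)
    print(f"h={h:.0e}  forward error={forward_error:.2e}  centered error={centered_error:.2e}")
```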
|
||||
|
||||
**Use relative error for the comparison**. What are the details of comparing the numerical gradient \(f'\_n\) and analytic gradient \(f'\_a\)? That is, how do we know if the two are not compatible? You might be tempted to keep track of the difference \(\mid f'\_a - f'\_n \mid \) or its square and define the gradient check as failed if that difference is above a threshold. However, this is problematic. For example, consider the case where their difference is 1e-4. This seems like a very appropriate difference if the two gradients are about 1.0, so we’d consider the two gradients to match. But if the gradients were both on order of 1e-5 or lower, then we’d consider 1e-4 to be a huge difference and likely a failure. Hence, it is always more appropriate to consider the *relative error*:
|
||||
|
||||
\[\frac{\mid f'\_a - f'\_n \mid}{\max(\mid f'\_a \mid, \mid f'\_n \mid)}\]
|
||||
|
||||
which considers the ratio of the difference to the absolute values of both gradients. Notice that normally the relative error formula only includes one of the two terms (either one), but I prefer to max (or add) both to make it symmetric and to prevent dividing by zero in the case where one of the two is zero (which can often happen, especially with ReLUs). However, one must explicitly keep track of the case where both are zero and pass the gradient check in that edge case. In practice:
|
||||
|
||||
* relative error > 1e-2 usually means the gradient is probably wrong
|
||||
* 1e-2 > relative error > 1e-4 should make you feel uncomfortable
|
||||
* 1e-4 > relative error is usually okay for objectives with kinks. But if there are no kinks (e.g. use of tanh nonlinearities and softmax), then 1e-4 is too high.
|
||||
* 1e-7 and less you should be happy.
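The relative error formula, including the both-zero edge case, fits in a few lines of NumPy. This is a minimal sketch; the function name `relative_error` and the choice to return 0 when both gradients are exactly zero are assumptions made for illustration.

```python
import numpy as np


def relative_error(analytic, numeric):
    """Elementwise |a - n| / max(|a|, |n|), defined as 0 where both are exactly 0."""
    analytic = np.asarray(analytic, dtype=np.float64)
    numeric = np.asarray(numeric, dtype=np.float64)
    scale = np.maximum(np.abs(analytic), np.abs(numeric))
    both_zero = scale == 0.0
    safe_scale = np.where(both_zero, 1.0, scale)  # avoid 0/0 in the division
    return np.where(both_zero, 0.0, np.abs(analytic - numeric) / safe_scale)


# The same absolute difference of 1e-4 is harmless at scale 1.0
# and a clear failure at scale 1e-5.
print(relative_error(1.0, 1.0001))
print(relative_error(1e-5, 1.1e-4))
print(relative_error(0.0, 0.0))
```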
|
||||
|
||||
Also keep in mind that the deeper the network, the higher the relative errors will be. So if you are gradient checking the input data for a 10-layer network, a relative error of 1e-2 might be okay because the errors build up on the way. Conversely, an error of 1e-2 for a single differentiable function likely indicates an incorrect gradient.
|
||||
|
||||
**Use double precision**. A common pitfall is using single precision floating point to compute the gradient check. It is often the case that you might get high relative errors (as high as 1e-2) even with a correct gradient implementation. In my experience I’ve sometimes seen my relative errors plummet from 1e-2 to 1e-8 by switching to double precision.
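The effect is easy to reproduce on a toy function. The sketch below estimates the derivative of \(x^3\) at \(x = 1.5\) with the centered formula, once with every value held in `float32` and once in `float64`. The function and the step size are arbitrary choices for illustration, and the exact error values depend on them.

```python
import numpy as np


def centered_grad_of_cube(x, h, dtype):
    """Centered-difference estimate of d/dx x**3 with all arithmetic in `dtype`."""
    x, h, two = dtype(x), dtype(h), dtype(2.0)

    def cube(t):
        return t * t * t

    return (cube(x + h) - cube(x - h)) / (two * h)


exact = 3 * 1.5 ** 2  # analytic derivative of x**3 at x = 1.5
relative_errors = {}
for dtype in (np.float32, np.float64):
    estimate = float(centered_grad_of_cube(1.5, 1e-5, dtype))
    relative_errors[dtype.__name__] = abs(estimate - exact) / abs(exact)
    print(dtype.__name__, relative_errors[dtype.__name__])
```

The implementation is correct in both cases; only the precision of the arithmetic differs, yet the single-precision estimate lands many orders of magnitude further from the analytic value.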
|
||||
|
||||
**Stick around active range of floating point**. It’s a good idea to read through [“What Every Computer Scientist Should Know About Floating-Point Arithmetic”](http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html), as it may demystify your errors and enable you to write more careful code. For example, in neural nets it can be common to normalize the loss function over the batch. However, if your gradients per datapoint are very small, then *additionally* dividing them by the number of data points is starting to give very small numbers, which in turn will lead to more numerical issues. This is why I like to always print the raw numerical/analytic gradient, and make sure that the numbers you are comparing are not extremely small (e.g. roughly 1e-10 and smaller in absolute value is worrying). If they are you may want to temporarily scale your loss function up by a constant to bring them to a “nicer” range where floats are more dense - ideally on the order of 1.0, where your float exponent is 0.
|
||||
|
||||
**Kinks in the objective**. One source of inaccuracy to be aware of during gradient checking is the problem of *kinks*. Kinks refer to non-differentiable parts of an objective function, introduced by functions such as ReLU (\(max(0,x)\)), or the SVM loss, Maxout neurons, etc. Consider gradient checking the ReLU function at \(x = -1e-6\). Since \(x < 0\), the analytic gradient at this point is exactly zero. However, the numerical gradient would suddenly compute a non-zero gradient because \(f(x+h)\) might cross over the kink (e.g. if \(h > 1e-6\)) and introduce a non-zero contribution. You might think that this is a pathological case, but in fact this case can be very common. For example, an SVM for CIFAR-10 contains up to 450,000 \(max(0,x)\) terms because there are 50,000 examples and each example yields 9 terms to the objective. Moreover, a Neural Network with an SVM classifier will contain many more kinks due to ReLUs.
|
||||
|
||||
Note that it is possible to know if a kink was crossed in the evaluation of the loss. This can be done by keeping track of the identities of all “winners” in a function of form \(max(x,y)\); That is, was x or y higher during the forward pass. If the identity of at least one winner changes when evaluating \(f(x+h)\) and then \(f(x-h)\), then a kink was crossed and the numerical gradient will not be exact.
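For a single ReLU layer, tracking the "winners" amounts to comparing which inputs are positive at the two perturbed points. This is a simplified sketch with hypothetical helper names; a real network would need to record the winners of every max in the forward pass.

```python
import numpy as np


def relu_winners(x):
    """Which branch of max(0, x) wins for each element: True where x wins over 0."""
    return x > 0


def kink_crossed(x, index, h):
    """True if perturbing x[index] by +h and -h changes any max(0, x) winner."""
    plus, minus = x.copy(), x.copy()
    plus[index] += h
    minus[index] -= h
    return bool(np.any(relu_winners(plus) != relu_winners(minus)))


x = np.array([-1e-6, 0.5, -2.0])
print(kink_crossed(x, 0, h=1e-5))  # x[0] lies within h of the kink at 0
print(kink_crossed(x, 1, h=1e-5))  # x[1] is far from the kink
```

When `kink_crossed` reports a crossing, a mismatch between the numerical and analytic gradient for that coordinate does not by itself indicate a bug.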
|
||||
|
||||
**Use only few datapoints**. One fix to the above problem of kinks is to use fewer datapoints, since loss functions that contain kinks (e.g. due to use of ReLUs or margin losses etc.) will have fewer kinks with fewer datapoints, so it is less likely for you to cross one when you perform the finite difference approximation. Moreover, if your gradient check passes for only ~2 or 3 datapoints then it would almost certainly pass for an entire batch. Using very few datapoints also makes your gradient check faster and more efficient.
|
||||
|
||||
**Be careful with the step size h**. It is not necessarily the case that smaller is better, because when \(h\) is much smaller, you may start running into numerical precision problems. Sometimes when the gradient doesn’t check, it is possible that you change \(h\) to be 1e-4 or 1e-6 and suddenly the gradient will be correct. This [wikipedia article](http://en.wikipedia.org/wiki/Numerical_differentiation) contains a chart that plots the value of **h** on the x-axis and the numerical gradient error on the y-axis.
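The trade-off can be seen by sweeping \(h\) on a function with a known derivative. The following sketch uses the exponential function at \(x = 1\), an arbitrary choice: the error first falls as \(h\) shrinks (less truncation error) and then rises again once floating-point cancellation dominates.

```python
import math


def centered_difference(f, x, h):
    """Two-sided finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


x = 1.0
exact = math.exp(x)  # exp is its own derivative
errors = {}
for exponent in range(1, 14):
    h = 10.0 ** -exponent
    errors[exponent] = abs(centered_difference(math.exp, x, h) - exact) / exact
    print(f"h=1e-{exponent:02d}  relative error={errors[exponent]:.2e}")

best_exponent = min(errors, key=errors.get)
print("smallest error at h = 1e-%d" % best_exponent)
```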
|
||||
|
||||
**Gradcheck during a “characteristic” mode of operation**. It is important to realize that a gradient check is performed at a particular (and usually random), single point in the space of parameters. Even if the gradient check succeeds at that point, it is not immediately certain that the gradient is correctly implemented globally. Additionally, a random initialization might not be the most “characteristic” point in the space of parameters and may in fact introduce pathological situations where the gradient seems to be correctly implemented but isn’t. For instance, an SVM with very small weight initialization will assign almost exactly zero scores to all datapoints and the gradients will exhibit a particular pattern across all datapoints. An incorrect implementation of the gradient could still produce this pattern and not generalize to a more characteristic mode of operation where some scores are larger than others. Therefore, to be safe it is best to use a short **burn-in** time during which the network is allowed to learn and perform the gradient check after the loss starts to go down. The danger of performing it at the first iteration is that this could introduce pathological edge cases and mask an incorrect implementation of the gradient.
|
||||
|
||||
**Don’t let the regularization overwhelm the data**. It is often the case that a loss function is a sum of the data loss and the regularization loss (e.g. L2 penalty on weights). One danger to be aware of is that the regularization loss may overwhelm the data loss, in which case the gradients will be primarily coming from the regularization term (which usually has a much simpler gradient expression). This can mask an incorrect implementation of the data loss gradient. Therefore, it is recommended to turn off regularization and check the data loss alone first, and then the regularization term second and independently. One way to perform the latter is to hack the code to remove the data loss contribution. Another way is to increase the regularization strength so as to ensure that its effect is non-negligible in the gradient check, and that an incorrect implementation would be spotted.
|
||||
|
||||
**Remember to turn off dropout/augmentations**. When performing gradient check, remember to turn off any non-deterministic effects in the network, such as dropout, random data augmentations, etc. Otherwise these can clearly introduce huge errors when estimating the numerical gradient. The downside of turning off these effects is that you wouldn’t be gradient checking them (e.g. it might be that dropout isn’t backpropagated correctly). Therefore, a better solution might be to force a particular random seed before evaluating both \(f(x+h)\) and \(f(x-h)\), and when evaluating the analytic gradient.
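Forcing the seed can look like the sketch below, which re-creates the random generator inside the loss so that \(f(x+h)\), \(f(x-h)\) and the analytic gradient all see the same dropout mask. The toy loss, the inverted-dropout-on-inputs setup and the name `dropout_loss` are assumptions chosen to keep the example small.

```python
import numpy as np


def dropout_loss(w, x, seed, p_keep=0.5):
    """Toy quadratic loss with dropout on the inputs; returns (loss, gradient wrt w)."""
    rng = np.random.default_rng(seed)  # re-seeding reproduces the same mask
    mask = rng.random(x.shape) < p_keep
    dropped = x * mask / p_keep
    loss = float(np.sum((dropped @ w) ** 2))
    grad = 2.0 * dropped.T @ (dropped @ w)  # analytic gradient under this mask
    return loss, grad


data_rng = np.random.default_rng(0)
x = data_rng.standard_normal((4, 3))
w = data_rng.standard_normal(3)
h, index, seed = 1e-5, 1, 123

w_plus, w_minus = w.copy(), w.copy()
w_plus[index] += h
w_minus[index] -= h
loss_plus, _ = dropout_loss(w_plus, x, seed)
loss_minus, _ = dropout_loss(w_minus, x, seed)
_, analytic = dropout_loss(w, x, seed)

numeric = (loss_plus - loss_minus) / (2 * h)
print(numeric, analytic[index])
```

Using a different seed for each evaluation would change the mask between calls and make the finite difference meaningless.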
|
||||
|
||||
**Check only few dimensions**. In practice the gradients can have sizes of millions of parameters. In these cases it is only practical to check some of the dimensions of the gradient and assume that the others are correct. **Be careful**: One issue to be careful with is to make sure to gradient check a few dimensions for every separate parameter. In some applications, people combine the parameters into a single large parameter vector for convenience. In these cases, for example, the biases could only take up a tiny number of parameters from the whole vector, so it is important to not sample at random but to take this into account and check that all parameters receive the correct gradients.
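One way to guarantee coverage is to sample indices per parameter rather than from the concatenated vector. The sketch assumes parameters are kept in a dict of named arrays; the parameter names and the helper `sample_check_indices` are hypothetical.

```python
import numpy as np


def sample_check_indices(params, per_param=5, seed=0):
    """Pick up to `per_param` random flat indices from every named parameter."""
    rng = np.random.default_rng(seed)
    picks = {}
    for name, value in params.items():
        count = min(per_param, value.size)
        picks[name] = rng.choice(value.size, size=count, replace=False)
    return picks


params = {"W1": np.zeros((100, 50)), "b1": np.zeros(50), "b2": np.zeros(3)}
picks = sample_check_indices(params)
for name, indices in picks.items():
    print(name, sorted(indices.tolist()))
```

With uniform sampling over the flattened vector, `b2` would hold 3 of 5,053 entries and would almost never be checked; sampling per parameter checks it every time.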
|
||||
|
||||
### Before learning: sanity checks Tips/Tricks
|
||||
|
||||
Here are a few sanity checks you might consider running before you plunge into expensive optimization:
|
||||
|
||||
* **Look for correct loss at chance performance.** Make sure you’re getting the loss you expect when you initialize with small parameters. It’s best to first check the data loss alone (so set regularization strength to zero). For example, for CIFAR-10 with a Softmax classifier we would expect the initial loss to be 2.302, because we expect a diffuse probability of 0.1 for each class (since there are 10 classes), and Softmax loss is the negative log probability of the correct class so: -ln(0.1) = 2.302. For the Weston Watkins SVM, we expect all desired margins to be violated (since all scores are approximately zero), and hence expect a loss of 9 (since margin is 1 for each wrong class). If you’re not seeing these losses there might be an issue with initialization.
|
||||
* As a second sanity check, increasing the regularization strength should increase the loss.
|
||||
* **Overfit a tiny subset of data**. Lastly and most importantly, before training on the full dataset try to train on a tiny portion (e.g. 20 examples) of your data and make sure you can achieve zero cost. For this experiment it’s also best to set regularization to zero, otherwise this can prevent you from getting zero cost. Unless you pass this sanity check with a small dataset it is not worth proceeding to the full dataset. Note that it may happen that you can overfit a very small dataset but still have an incorrect implementation. For instance, if your datapoints’ features are random due to some bug, then it will be possible to overfit your small training set but you will never notice any generalization when you move on to your full dataset.
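The chance-level losses quoted in the first check can be reproduced by feeding all-zero scores, which is what a tiny initialization approximately produces, into each loss. This is a minimal sketch with hand-written loss functions (not the course's reference code), with regularization omitted.

```python
import numpy as np


def softmax_loss(scores, labels):
    """Mean negative log probability of the correct class."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def multiclass_svm_loss(scores, labels, margin=1.0):
    """Mean Weston-Watkins hinge loss, summed over the wrong classes."""
    rows = np.arange(len(labels))
    correct = scores[rows, labels][:, None]
    margins = np.maximum(0.0, scores - correct + margin)
    margins[rows, labels] = 0.0  # the correct class contributes nothing
    return float(margins.sum(axis=1).mean())


num_examples, num_classes = 8, 10
scores = np.zeros((num_examples, num_classes))  # stand-in for a tiny initialization
labels = np.arange(num_examples) % num_classes

print(softmax_loss(scores, labels), np.log(num_classes))
print(multiclass_svm_loss(scores, labels))
```

If a freshly initialized model reports a data loss far from these values, the initialization or the loss code deserves a look before any training run.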
|
||||
|
||||
### Babysitting the learning process
|
||||
|
||||
There are multiple useful quantities you should monitor during training of a neural network. These plots are the window into the training process and should be utilized to get intuitions about different hyperparameter settings and how they should be changed for more efficient learning.
|
||||
|
||||
The x-axis of the plots below is always in units of epochs, which measure how many times every example has been seen during training in expectation (e.g. one epoch means that every example has been seen once). It is preferable to track epochs rather than iterations since the number of iterations depends on the arbitrary setting of batch size.
|
||||
|
||||
#### Loss function
|
||||
|
||||
The first quantity that is useful to track during training is the loss, as it is evaluated on the individual batches during the forward pass. Below is a cartoon diagram showing the loss over time, and especially what the shape might tell you about the learning rate:
|
||||
|
||||

|
||||

|
||||
|
||||
**Left:** A cartoon depicting the effects of different learning rates. With low learning rates the improvements will be linear. With high learning rates they will start to look more exponential. Higher learning rates will decay the loss faster, but they get stuck at worse values of loss (green line). This is because there is too much "energy" in the optimization and the parameters are bouncing around chaotically, unable to settle in a nice spot in the optimization landscape. **Right:** An example of a typical loss function over time, while training a small network on CIFAR-10 dataset. This loss function looks reasonable (it might indicate a slightly too small learning rate based on its speed of decay, but it's hard to say), and also indicates that the batch size might be a little too low (since the cost is a little too noisy).
|
||||
|
||||
The amount of “wiggle” in the loss is related to the batch size. When the batch size is 1, the wiggle will be relatively high. When the batch size is the full dataset, the wiggle will be minimal because every gradient update should be improving the loss function monotonically (unless the learning rate is set too high).
|
||||
|
||||
Some people prefer to plot their loss functions in the log domain. Since learning progress generally takes an exponential shape, the plot appears as a slightly more interpretable straight line, rather than a hockey stick. Additionally, if multiple cross-validated models are plotted on the same loss graph, the differences between them become more apparent.
|
||||
|
||||
Sometimes loss functions can look funny [lossfunctions.tumblr.com](http://lossfunctions.tumblr.com/).
|
||||
|
||||
#### Train/Val accuracy
|
||||
|
||||
The second important quantity to track while training a classifier is the validation/training accuracy. This plot can give you valuable insights into the amount of overfitting in your model:
|
||||
|
||||

|
||||
|
||||
The gap between the training and validation accuracy indicates the amount of overfitting. Two possible cases are shown in the diagram on the left. The blue validation error curve shows very small validation accuracy compared to the training accuracy, indicating strong overfitting (note, it's possible for the validation accuracy to even start to go down after some point). When you see this in practice you probably want to increase regularization (stronger L2 weight penalty, more dropout, etc.) or collect more data. The other possible case is when the validation accuracy tracks the training accuracy fairly well. This case indicates that your model capacity is not high enough: make the model larger by increasing the number of parameters.
|
||||
|
||||
#### Ratio of weights:updates
|
||||
|
||||
The last quantity you might want to track is the ratio of the update magnitudes to the value magnitudes. Note: *updates*, not the raw gradients (e.g. in vanilla sgd this would be the gradient multiplied by the learning rate). You might want to evaluate and track this ratio for every set of parameters independently. A rough heuristic is that this ratio should be somewhere around 1e-3. If it is lower than this then the learning rate might be too low. If it is higher then the learning rate is likely too high. Here is a specific example:
|
||||
|
||||
```
import numpy as np

# assume parameter vector W, its gradient vector dW, and a scalar learning_rate
param_scale = np.linalg.norm(W.ravel())
update = -learning_rate * dW  # simple SGD update
update_scale = np.linalg.norm(update.ravel())
W += update  # the actual update
print(update_scale / param_scale)  # want ~1e-3
```
|
||||
|
||||
Instead of tracking the min or the max, some people prefer to compute and track the norm of the gradients and their updates. These metrics are usually correlated and often give approximately the same results.
|
||||
|
||||
#### Activation / Gradient distributions per layer
|
||||
|
||||
An incorrect initialization can slow down or even completely stall the learning process. Luckily, this issue can be diagnosed relatively easily. One way to do so is to plot activation/gradient histograms for all layers of the network. Intuitively, it is not a good sign to see any strange distributions - e.g. with tanh neurons we would like to see a distribution of neuron activations across the full range of [-1,1], instead of seeing all neurons outputting zero, or all neurons being completely saturated at either -1 or 1.
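Short of plotting full histograms, two summary numbers per layer already expose both failure modes: the standard deviation of the activations (collapse toward zero) and the fraction of units with magnitude above 0.99 (saturation). The sketch below pushes random data through a stack of tanh layers at two weight scales; the layer sizes, scales and the 0.99 threshold are arbitrary illustrative choices.

```python
import numpy as np


def tanh_layer_stats(weight_scale, num_layers=10, width=200, batch=200, seed=0):
    """Return per-layer (activation std, saturated fraction) for a random tanh MLP."""
    rng = np.random.default_rng(seed)
    hidden = rng.standard_normal((batch, width))
    stats = []
    for _ in range(num_layers):
        weights = rng.standard_normal((width, width)) * weight_scale
        hidden = np.tanh(hidden @ weights)
        saturated = float(np.mean(np.abs(hidden) > 0.99))
        stats.append((float(hidden.std()), saturated))
    return stats


for scale in (0.01, 1.0):
    last_std, last_saturated = tanh_layer_stats(scale)[-1]
    print(f"scale={scale}: last-layer std={last_std:.2e}, saturated={last_saturated:.2f}")
```

With the small scale the activations shrink toward zero layer by layer, and with the large scale most units sit near -1 or 1; both are the "strange distributions" to look for.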
|
||||
|
||||
#### First-layer Visualizations
|
||||
|
||||
Lastly, when one is working with image pixels it can be helpful and satisfying to plot the first-layer features visually:
|
||||
|
||||

|
||||

|
||||
|
||||
Examples of visualized weights for the first layer of a neural network. **Left**: Noisy features could be a symptom of an unconverged network, an improperly set learning rate, or a very low weight regularization penalty. **Right:** Nice, smooth, clean and diverse features are a good indication that the training is proceeding well.
|
||||
|
||||
### Parameter updates
|
||||
|
||||
Once the analytic gradient is computed with backpropagation, the gradients are used to perform a parameter update. There are several approaches for performing the update, which we discuss next.
|
||||
|
||||
We note that optimization for deep networks is currently a very active area of research. In this section we highlight some established and common techniques you may see in practice, briefly describe their intuition, but leave a detailed analysis outside of the scope of the class. We provide some further pointers for an interested reader.
|
||||
|
||||
#### SGD and bells and whistles
|
||||
|
||||
**Vanilla update**. The simplest form of update is to change the parameters along the negative gradient direction (since the gradient indicates the direction of increase, but we usually wish to minimize a loss function). Assuming a vector of parameters `x` and the gradient `dx`, the simplest update has the form:
|
||||
|
||||
```
# Vanilla update
x += - learning_rate * dx
```
|
||||
|
||||
where `learning_rate` is a hyperparameter - a fixed constant. When evaluated on the full dataset, and when the learning rate is low enough, this is guaranteed to make non-negative progress on the loss function.
|
||||
|
||||
**Momentum update** is another approach that almost always enjoys better convergence rates on deep networks. This update can be motivated from a physical perspective of the optimization problem. In particular, the loss can be interpreted as the height of a hilly terrain (and therefore also as the potential energy, since \(U = mgh\) and therefore \( U \propto h \) ). Initializing the parameters with random numbers is equivalent to setting a particle with zero initial velocity at some location. The optimization process can then be seen as equivalent to the process of simulating the parameter vector (i.e. a particle) as rolling on the landscape.
|
||||
|
||||
Since the force on the particle is related to the gradient of potential energy (i.e. \(F = - \nabla U \) ), the **force** felt by the particle is precisely the (negative) **gradient** of the loss function. Moreover, \(F = ma \) so the (negative) gradient is in this view proportional to the acceleration of the particle. Note that this is different from the SGD update shown above, where the gradient directly integrates the position. Instead, the physics view suggests an update in which the gradient only directly influences the velocity, which in turn has an effect on the position:
|
||||
|
||||
```
# Momentum update
v = mu * v - learning_rate * dx # integrate velocity
x += v # integrate position
```
|
||||
|
||||
Here we see an introduction of a `v` variable that is initialized at zero, and an additional hyperparameter (`mu`). As an unfortunate misnomer, this variable is in optimization referred to as *momentum* (its typical value is about 0.9), but its physical meaning is more consistent with the coefficient of friction. Effectively, this variable damps the velocity and reduces the kinetic energy of the system, or otherwise the particle would never come to a stop at the bottom of a hill. When cross-validated, this parameter is usually set to values such as [0.5, 0.9, 0.95, 0.99]. Similar to annealing schedules for learning rates (discussed later, below), optimization can sometimes benefit a little from momentum schedules, where the momentum is increased in later stages of learning. A typical setting is to start with momentum of about 0.5 and anneal it to 0.99 or so over multiple epochs.
|
||||
|
||||
> With Momentum update, the parameter vector will build up velocity in any direction that has consistent gradient.
|
||||
|
||||
**Nesterov Momentum** is a slightly different version of the momentum update that has recently been gaining popularity. It enjoys stronger theoretical convergence guarantees for convex functions and in practice it also consistently works slightly better than standard momentum.
|
||||
|
||||
The core idea behind Nesterov momentum is that when the current parameter vector is at some position `x`, then looking at the momentum update above, we know that the momentum term alone (i.e. ignoring the second term with the gradient) is about to nudge the parameter vector by `mu * v`. Therefore, if we are about to compute the gradient, we can treat the future approximate position `x + mu * v` as a “lookahead” - this is a point in the vicinity of where we are soon going to end up. Hence, it makes sense to compute the gradient at `x + mu * v` instead of at the “old/stale” position `x`.
|
||||
|
||||

|
||||
|
||||
Nesterov momentum. Instead of evaluating gradient at the current position (red circle), we know that our momentum is about to carry us to the tip of the green arrow. With Nesterov momentum we therefore instead evaluate the gradient at this "looked-ahead" position.
|
||||
|
||||
That is, in a slightly awkward notation, we would like to do the following:
|
||||
|
||||
```
x_ahead = x + mu * v
# evaluate dx_ahead (the gradient at x_ahead instead of at x)
v = mu * v - learning_rate * dx_ahead
x += v
```
|
||||
|
||||
However, in practice people prefer to express the update to look as similar to vanilla SGD or to the previous momentum update as possible. This is possible to achieve by manipulating the update above with a variable transform `x_ahead = x + mu * v`, and then expressing the update in terms of `x_ahead` instead of `x`. That is, the parameter vector we are actually storing is always the ahead version. The equations in terms of `x_ahead` (but renaming it back to `x`) then become:
|
||||
|
||||
```
v_prev = v # back this up
v = mu * v - learning_rate * dx # velocity update stays the same
x += -mu * v_prev + (1 + mu) * v # position update changes form
```
|
||||
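As a quick check on the variable transform above, the following sketch runs both forms side by side on a toy quadratic and confirms that the stored "ahead" vector of the second form equals `x + mu * v` of the first. The quadratic, its curvature vector `A`, the step count, and the hyperparameter values are illustrative assumptions, not part of the original notes.

```python
import numpy as np

# Toy objective (assumption): f(x) = 0.5 * sum(A * x**2), so the gradient is A * x.
A = np.array([1.0, 10.0])

def grad(x):
    return A * x

mu, learning_rate = 0.9, 0.01

# Form 1: store x, evaluate the gradient at the lookahead point x + mu * v.
x = np.array([1.0, 1.0])
v = np.zeros_like(x)

# Form 2: store the lookahead point itself (called x_ahead here for clarity).
x_ahead = x + mu * v
v2 = np.zeros_like(x)

for _ in range(50):
    # Form 1 update
    dx_ahead = grad(x + mu * v)
    v = mu * v - learning_rate * dx_ahead
    x += v

    # Form 2 update (gradient is evaluated at the stored vector)
    v_prev = v2
    v2 = mu * v2 - learning_rate * grad(x_ahead)
    x_ahead += -mu * v_prev + (1 + mu) * v2

# The two forms track the same trajectory up to the change of variables.
nesterov_forms_match = np.allclose(x + mu * v, x_ahead)
```

Both forms also keep identical velocities, which is why only the position update changes shape.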
|
||||
We recommend this further reading to understand the source of these equations and the mathematical formulation of Nesterov’s Accelerated Gradient (NAG):
|
||||
|
||||
* [Advances in optimizing Recurrent Networks](http://arxiv.org/pdf/1212.0901v2.pdf) by Yoshua Bengio, Section 3.5.
|
||||
* [Ilya Sutskever’s thesis](http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf) (pdf) contains a longer exposition of the topic in section 7.2
|
||||
|
||||
#### Annealing the learning rate
|
||||
|
||||
In training deep networks, it is usually helpful to anneal the learning rate over time. A good intuition to have in mind is that with a high learning rate, the system contains too much kinetic energy and the parameter vector bounces around chaotically, unable to settle down into deeper, but narrower parts of the loss function. Knowing when to decay the learning rate can be tricky: decay it too slowly and you’ll waste computation bouncing around chaotically with little improvement for a long time, but decay it too aggressively and the system will cool too quickly, unable to reach the best position it can. There are three common ways of implementing learning rate decay:
|
||||
|
||||
* **Step decay**: Reduce the learning rate by some factor every few epochs. Typical values might be halving the learning rate every 5 epochs, or multiplying it by 0.1 every 20 epochs. These numbers depend heavily on the type of problem and the model. One heuristic you may see in practice is to watch the validation error while training with a fixed learning rate, and multiply the learning rate by a constant (e.g. 0.5) whenever the validation error stops improving.
|
||||
* **Exponential decay** has the mathematical form \(\alpha = \alpha\_0 e^{-k t}\), where \(\alpha\_0, k\) are hyperparameters and \(t\) is the iteration number (but you can also use units of epochs).
|
||||
* **1/t decay** has the mathematical form \(\alpha = \alpha\_0 / (1 + k t )\) where \(\alpha\_0, k\) are hyperparameters and \(t\) is the iteration number.
|
||||
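The three schedules above can be written as small functions. This is a minimal sketch: the function names and the default `factor` and `every` values are illustrative choices, with `alpha0`, `k`, and `t` following the symbols in the formulas.

```python
import math

def step_decay(alpha0, epoch, factor=0.5, every=5):
    """Multiply the learning rate by `factor` once per `every` epochs."""
    return alpha0 * factor ** (epoch // every)

def exponential_decay(alpha0, t, k):
    """alpha = alpha0 * exp(-k * t)"""
    return alpha0 * math.exp(-k * t)

def inverse_time_decay(alpha0, t, k):
    """alpha = alpha0 / (1 + k * t)"""
    return alpha0 / (1.0 + k * t)

# Example: learning rate at epoch 10 when halving every 5 epochs from 0.1.
lr_at_epoch_10 = step_decay(0.1, epoch=10)
```

With step decay the only knobs are the factor and the interval, which is the interpretability advantage over tuning `k`.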
|
||||
In practice, we find that the step decay is slightly preferable because the hyperparameters it involves (the fraction of decay and the step timings in units of epochs) are more interpretable than the hyperparameter \(k\). Lastly, if you can afford the computational budget, err on the side of slower decay and train for a longer time.
|
||||
|
||||
#### Second order methods
|
||||
|
||||
A second popular group of methods for optimization in the context of deep learning is based on [Newton’s method](http://en.wikipedia.org/wiki/Newton%27s_method_in_optimization), which iterates the following update:
|
||||
|
||||
\[x \leftarrow x - [H f(x)]^{-1} \nabla f(x)\]
|
||||
|
||||
Here, \(H f(x)\) is the [Hessian matrix](http://en.wikipedia.org/wiki/Hessian_matrix), which is a square matrix of second-order partial derivatives of the function. The term \(\nabla f(x)\) is the gradient vector, as seen in Gradient Descent. Intuitively, the Hessian describes the local curvature of the loss function, which allows us to perform a more efficient update. In particular, multiplying by the inverse Hessian leads the optimization to take more aggressive steps in directions of shallow curvature and shorter steps in directions of steep curvature. Note, crucially, the absence of any learning rate hyperparameters in the update formula, which the proponents of these methods cite as a large advantage over first-order methods.
|
||||
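To make the update concrete, here is a small sketch on a quadratic, where a single Newton step lands exactly on the minimum regardless of how badly conditioned the curvature is. The matrix `H`, vector `b`, and starting point are made-up values for illustration.

```python
import numpy as np

# Toy objective (assumption): f(x) = 0.5 * x^T H x - b^T x, with a 100x
# difference in curvature between the two directions.
H = np.array([[1.0, 0.0],
              [0.0, 100.0]])
b = np.array([1.0, -2.0])

def grad_f(x):
    return H @ x - b

x0 = np.array([5.0, 5.0])

# Newton update: x <- x - H^{-1} grad f(x). Solving the linear system
# avoids forming the inverse explicitly.
x_newton = x0 - np.linalg.solve(H, grad_f(x0))

# Closed-form minimizer of the quadratic, for comparison.
x_star = np.linalg.solve(H, b)
```

Note that no learning rate appears anywhere: the inverse Hessian rescales each direction by its curvature.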
|
||||
However, the update above is impractical for most deep learning applications because computing (and inverting) the Hessian in its explicit form is a very costly process in both space and time. For instance, a Neural Network with one million parameters would have a Hessian matrix of size [1,000,000 x 1,000,000], occupying approximately 3725 gigabytes of RAM. Hence, a large variety of *quasi-Newton* methods have been developed that seek to approximate the inverse Hessian. Among these, the most popular is [L-BFGS](http://en.wikipedia.org/wiki/Limited-memory_BFGS), which uses the information in the gradients over time to form the approximation implicitly (i.e. the full matrix is never computed).
|
||||
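The memory figure quoted above can be reproduced with a few lines of arithmetic. The sketch assumes 4-byte (float32) entries and binary gigabytes (2^30 bytes); neither assumption is stated in the text, but together they match the quoted number.

```python
n_params = 1_000_000
bytes_per_float32 = 4                     # assumption: single-precision entries

hessian_entries = n_params ** 2           # the Hessian is n_params x n_params
hessian_bytes = hessian_entries * bytes_per_float32
hessian_gib = hessian_bytes / 2 ** 30     # assumption: 1 GB taken as 2**30 bytes

print(f"{hessian_gib:.0f} GB")
```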
|
||||
However, even after we eliminate the memory concerns, a large downside of a naive application of L-BFGS is that it must be computed over the entire training set, which could contain millions of examples. Unlike mini-batch SGD, getting L-BFGS to work on mini-batches is more tricky and an active area of research.
|
||||
|
||||
**In practice**, it is currently not common to see L-BFGS or similar second-order methods applied to large-scale Deep Learning and Convolutional Neural Networks. Instead, SGD variants based on (Nesterov’s) momentum are more standard because they are simpler and scale more easily.
|
||||
|
||||
Additional references:
|
||||
|
||||
* [Large Scale Distributed Deep Networks](http://research.google.com/archive/large_deep_networks_nips2012.html) is a paper from the Google Brain team, comparing L-BFGS and SGD variants in large-scale distributed optimization.
|
||||
* [SFO](http://arxiv.org/abs/1311.2115) algorithm strives to combine the advantages of SGD with advantages of L-BFGS.
|
||||
|
||||
#### Per-parameter adaptive learning rate methods
|
||||
|
||||
All previous approaches we’ve discussed so far manipulated the learning rate globally and equally for all parameters. Tuning the learning rates is an expensive process, so much work has gone into devising methods that can adaptively tune the learning rates, and even do so per parameter. Many of these methods may still require other hyperparameter settings, but the argument is that they are well-behaved for a broader range of hyperparameter values than the raw learning rate. In this section we highlight some common adaptive methods you may encounter in practice:
|
||||
|
||||
**Adagrad** is an adaptive learning rate method originally proposed by [Duchi et al.](http://jmlr.org/papers/v12/duchi11a.html).
|
||||
|
||||
```
|
||||
# Assume the gradient dx and parameter vector x
|
||||
cache += dx**2
|
||||
x += - learning_rate * dx / (np.sqrt(cache) + eps)
|
||||
```
|
||||
|
||||
Notice that the variable `cache` has size equal to the size of the gradient, and keeps track of the per-parameter sum of squared gradients. This is then used to normalize the parameter update step, element-wise. Notice that the weights that receive high gradients will have their effective learning rate reduced, while weights that receive small or infrequent updates will have their effective learning rate increased. Amusingly, the square root operation turns out to be very important and without it the algorithm performs much worse. The smoothing term `eps` (usually set somewhere in the range from 1e-4 to 1e-8) avoids division by zero. A downside of Adagrad is that in the case of Deep Learning, the monotonically decreasing learning rate usually proves too aggressive and stops learning too early.
|
||||
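A runnable version of the Adagrad update, tracking the effective per-parameter learning rate `learning_rate / (sqrt(cache) + eps)`, might look like the sketch below. The toy gradient (a quadratic with a 100x scale difference between coordinates), the step count, and the hyperparameter values are assumptions made for illustration.

```python
import numpy as np

def adagrad_step(x, dx, cache, learning_rate=0.1, eps=1e-8):
    """One Adagrad update; returns the new parameters and the new cache."""
    cache = cache + dx ** 2
    x = x - learning_rate * dx / (np.sqrt(cache) + eps)
    return x, cache

scales = np.array([1.0, 100.0])   # gradient is 100x larger in coordinate 1
x = np.array([1.0, 1.0])
cache = np.zeros_like(x)

effective_lrs = []
for _ in range(20):
    dx = scales * x               # gradient of 0.5 * sum(scales * x**2)
    x, cache = adagrad_step(x, dx, cache)
    effective_lrs.append(0.1 / (np.sqrt(cache) + 1e-8))
effective_lrs = np.array(effective_lrs)   # shape (steps, parameters)
```

Because `cache` only ever grows, each column of `effective_lrs` is non-increasing, which is the monotonic behaviour described above; the coordinate with the larger gradients ends up with the smaller effective rate.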
|
||||
**RMSprop.** RMSprop is a very effective, but currently unpublished adaptive learning rate method. Amusingly, everyone who uses this method in their work currently cites [slide 29 of Lecture 6](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) of Geoff Hinton’s Coursera class. The RMSProp update adjusts the Adagrad method in a very simple way in an attempt to reduce its aggressive, monotonically decreasing learning rate. In particular, it uses a moving average of squared gradients instead, giving:
|
||||
|
||||
```
|
||||
cache = decay_rate * cache + (1 - decay_rate) * dx**2
|
||||
x += - learning_rate * dx / (np.sqrt(cache) + eps)
|
||||
```
|
||||
|
||||
Here, `decay_rate` is a hyperparameter and typical values are [0.9, 0.99, 0.999]. Notice that the `x+=` update is identical to Adagrad, but the `cache` variable is “leaky”. Hence, RMSProp still modulates the learning rate of each weight based on the magnitudes of its gradients, which has a beneficial equalizing effect, but unlike Adagrad the updates do not get monotonically smaller.
|
||||
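The difference between the two caches is easy to see on a synthetic gradient sequence. In this sketch (the sequence of 50 large gradients followed by 50 small ones is invented for illustration) the RMSProp cache forgets the early large gradients, while the Adagrad cache can only grow.

```python
import numpy as np

def rmsprop_cache(grads, decay_rate=0.9):
    """Leaky moving average of squared gradients, one value per step."""
    cache = 0.0
    history = []
    for dx in grads:
        cache = decay_rate * cache + (1 - decay_rate) * dx ** 2
        history.append(cache)
    return np.array(history)

def adagrad_cache(grads):
    """Running sum of squared gradients, one value per step."""
    return np.cumsum(np.square(grads))

# Synthetic scalar gradients: large for 50 steps, then small for 50 steps.
grads = np.array([1.0] * 50 + [0.1] * 50)
rms = rmsprop_cache(grads)
ada = adagrad_cache(grads)
```

After the gradients shrink, `rms` decays toward the new squared magnitude, so the effective learning rate recovers; `ada` keeps increasing.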
|
||||
**Adam.** [Adam](http://arxiv.org/abs/1412.6980) is a recently proposed update that looks a bit like RMSProp with momentum. The (simplified) update looks as follows:
|
||||
|
||||
```
|
||||
m = beta1*m + (1-beta1)*dx
|
||||
v = beta2*v + (1-beta2)*(dx**2)
|
||||
x += - learning_rate * m / (np.sqrt(v) + eps)
|
||||
```
|
||||
|
||||
Notice that the update looks exactly like the RMSProp update, except the “smooth” version of the gradient `m` is used instead of the raw (and perhaps noisy) gradient vector `dx`. Recommended values in the paper are `eps = 1e-8`, `beta1 = 0.9`, `beta2 = 0.999`. In practice Adam is currently recommended as the default algorithm to use, and often works slightly better than RMSProp. However, it is often also worth trying SGD+Nesterov Momentum as an alternative. The full Adam update also includes a *bias correction* mechanism, which compensates for the fact that in the first few time steps the vectors `m,v` are both initialized at zero and are therefore biased toward zero before they fully “warm up”. With the *bias correction* mechanism, the update looks as follows:
|
||||
|
||||
```
|
||||
# t is your iteration counter going from 1 to infinity
|
||||
m = beta1*m + (1-beta1)*dx
|
||||
mt = m / (1-beta1**t)
|
||||
v = beta2*v + (1-beta2)*(dx**2)
|
||||
vt = v / (1-beta2**t)
|
||||
x += - learning_rate * mt / (np.sqrt(vt) + eps)
|
||||
```
|
||||
|
||||
Note that the update is now a function of the iteration as well as the other parameters.
|
||||
We refer the reader to the paper for the details, or the course slides where this is expanded on.
|
||||
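The effect of the bias correction is clearest at the first step. In the sketch below (the parameter and gradient values are arbitrary), `m` and `v` start at zero, so the raw averages are only 10% and 0.1% of their targets, while the corrected `mt` and `vt` equal `dx` and `dx**2` exactly.

```python
import numpy as np

def adam_step(x, dx, m, v, t, learning_rate=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One bias-corrected Adam update; t is the 1-based iteration counter."""
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * dx ** 2
    mt = m / (1 - beta1 ** t)          # bias-corrected first moment
    vt = v / (1 - beta2 ** t)          # bias-corrected second moment
    x = x - learning_rate * mt / (np.sqrt(vt) + eps)
    return x, m, v, mt, vt

x = np.array([1.0, -2.0])
dx = np.array([0.5, -4.0])             # arbitrary example gradient
m = np.zeros_like(x)
v = np.zeros_like(x)

x_new, m, v, mt, vt = adam_step(x, dx, m, v, t=1)
```

As a consequence, the very first Adam step moves every parameter by roughly `learning_rate` in the direction opposite to the sign of its gradient, independent of the gradient's magnitude.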
|
||||
Additional References:
|
||||
|
||||
* [Unit Tests for Stochastic Optimization](http://arxiv.org/abs/1312.6055) proposes a series of tests as a standardized benchmark for stochastic optimization.
|
||||
|
||||

|
||||

|
||||
|
||||
Animations that may help your intuitions about the learning process dynamics. **Left:** Contours of a loss surface and time evolution of different optimization algorithms. Notice the "overshooting" behavior of momentum-based methods, which makes the optimization look like a ball rolling down the hill. **Right:** A visualization of a saddle point in the optimization landscape, where the curvature along different dimensions has different signs (one dimension curves up and another down). Notice that SGD has a very hard time breaking symmetry and gets stuck on the top. Conversely, algorithms such as RMSprop will see very low gradients in the saddle direction. Due to the denominator term in the RMSprop update, this will increase the effective learning rate along this direction, helping RMSProp proceed. Images credit: [Alec Radford](https://twitter.com/alecrad).
|
||||
|
||||
### Hyperparameter optimization
|
||||
|
||||
As we’ve seen, training Neural Networks can involve many hyperparameter settings. The most common hyperparameters in the context of Neural Networks include:
|
||||
|
||||
* the initial learning rate
|
||||
* learning rate decay schedule (such as the decay constant)
|
||||
* regularization strength (L2 penalty, dropout strength)
|
||||
|
||||
But as we saw, there are many more relatively less sensitive hyperparameters, for example in per-parameter adaptive learning methods, the setting of momentum and its schedule, etc. In this section we describe some additional tips and tricks for performing the hyperparameter search:
|
||||
|
||||
**Implementation**. Larger Neural Networks typically require a long time to train, so performing hyperparameter search can take many days/weeks. It is important to keep this in mind since it influences the design of your code base. One particular design is to have a **worker** that continuously samples random hyperparameters and performs the optimization. During training, the worker keeps track of the validation performance after every epoch and writes a model checkpoint (together with miscellaneous training statistics such as the loss over time) to a file, preferably on a shared file system. It is useful to include the validation performance directly in the filename, so that it is simple to inspect and sort the progress. Then there is a second program, which we will call a **master**, that launches or kills workers across a computing cluster, and may additionally inspect the checkpoints written by workers and plot their training statistics, etc.
|
||||
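One way to put the validation performance in the filename is sketched below. The naming scheme, the `checkpoint_name` helper, and the example values are hypothetical; the only point is that a fixed-width accuracy prefix makes a plain lexicographic sort order the checkpoints by performance.

```python
def checkpoint_name(val_acc, epoch, learning_rate):
    """Hypothetical naming scheme: fixed-width fields so names sort by accuracy."""
    return f"val{val_acc:.4f}_epoch{epoch:03d}_lr{learning_rate:.2e}.ckpt"

# Made-up results from three workers.
names = [
    checkpoint_name(0.7950, 3, 1e-3),
    checkpoint_name(0.8123, 7, 3e-4),
    checkpoint_name(0.6010, 1, 1e-1),
]

# Lexicographic sort, best validation accuracy first.
best_name = sorted(names, reverse=True)[0]
```

A master process (or a plain `ls | sort`) can then find the current best run without opening any file.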
|
||||
**Prefer one validation fold to cross-validation**. In most cases a single validation set of respectable size substantially simplifies the code base, without the need for cross-validation with multiple folds. You’ll hear people say they “cross-validated” a parameter, but many times it is assumed that they still only used a single validation set.
|
||||
|
||||
**Hyperparameter ranges**. Search for hyperparameters on log scale. For example, a typical sampling of the learning rate would look as follows: `learning_rate = 10 ** uniform(-6, 1)`. That is, we draw a random exponent from a uniform distribution and then raise 10 to that power. The same strategy should be used for the regularization strength. Intuitively, this is because learning rate and regularization strength have multiplicative effects on the training dynamics. For example, a fixed change of adding 0.01 to a learning rate has huge effects on the dynamics if the learning rate is 0.001, but nearly no effect if the learning rate is 10. This is because the learning rate multiplies the computed gradient in the update. Therefore, it is much more natural to consider a range of learning rates multiplied or divided by some value than a range of learning rates with some value added or subtracted. Some parameters (e.g. dropout) are instead usually searched in the original scale (e.g. `dropout = uniform(0,1)`).
|
||||
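A minimal NumPy sketch of the two sampling schemes follows. The sample count and seed are arbitrary; the histogram over decades is there only to show that log-scale sampling spends roughly equal effort on each order of magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)     # arbitrary seed for reproducibility
n_trials = 10_000

# Log-scale: draw the exponent uniformly, then raise 10 to that power.
learning_rates = 10 ** rng.uniform(-6, 1, size=n_trials)

# Original scale: draw the value itself uniformly.
dropouts = rng.uniform(0, 1, size=n_trials)

# Count samples per decade: [1e-6, 1e-5), [1e-5, 1e-4), ..., [1, 10).
decade_counts, _ = np.histogram(np.log10(learning_rates), bins=7, range=(-6, 1))
decade_fractions = decade_counts / n_trials
```

Sampling the learning rate uniformly on the original scale instead would put about 90% of the trials between 1 and 10 and almost none below 1e-3.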
|
||||
**Prefer random search to grid search**. As argued by Bergstra and Bengio in [Random Search for Hyper-Parameter Optimization](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf), “randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid”. As it turns out, this is also usually easier to implement.
|
||||
|
||||

|
||||
|
||||
Core illustration from [Random Search for Hyper-Parameter Optimization](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf) by Bergstra and Bengio. It is very often the case that some of the hyperparameters matter much more than others (e.g. top hyperparam vs. left one in this figure). Performing random search rather than grid search allows you to much more precisely discover good values for the important ones.
|
||||
|
||||
**Careful with best values on border**. Sometimes it can happen that you’re searching for a hyperparameter (e.g. learning rate) in a bad range. For example, suppose we use `learning_rate = 10 ** uniform(-6, 1)`. Once we receive the results, it is important to double check that the best learning rate found is not at the edge of this interval, or otherwise you may be missing a better hyperparameter setting beyond the interval.
|
||||
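This check is simple to automate. The helper below is a sketch: the name `near_border` and the 10% margin are arbitrary choices, and the comparison is done on the exponent (log10 of the learning rate), since that is the scale the search was run on.

```python
def near_border(best_exponent, low, high, margin=0.1):
    """True if the best exponent lies within `margin` of the range width from an edge."""
    width = high - low
    return (best_exponent - low) < margin * width or (high - best_exponent) < margin * width

# Search range from the text: learning_rate = 10 ** uniform(-6, 1).
suspicious = near_border(0.8, low=-6, high=1)   # best lr about 6.3, close to the upper edge
fine = near_border(-3.0, low=-6, high=1)        # best lr 1e-3, well inside the range
```

If the check fires, widen the range on that side and rerun before trusting the result.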
|
||||
**Stage your search from coarse to fine**. In practice, it can be helpful to first search in coarse ranges (e.g. 10 \*\* [-6, 1]), and then depending on where the best results are turning up, narrow the range. Also, it can be helpful to perform the initial coarse search while only training for 1 epoch or even less, because many hyperparameter settings can lead the model to not learn at all, or immediately explode with infinite cost. The second stage could then perform a narrower search with 5 epochs, and the last stage could perform a detailed search in the final range for many more epochs (for example).
|
||||
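A staged search can be sketched as a loop that narrows the sampling range around the best value found so far. Everything specific here is an assumption for illustration: `toy_validation_score` stands in for "train briefly and return validation performance", and the number of trials and the half-widths of the later stages are arbitrary. In a real search each stage would also train for more epochs, which this toy score ignores.

```python
import numpy as np

def toy_validation_score(log_lr):
    # Hypothetical stand-in for a short training run; peaks at log10(lr) = -3.
    return -(log_lr + 3.0) ** 2

rng = np.random.default_rng(0)
low, high = -6.0, 1.0                  # coarse range for log10(learning rate)
best, best_score = None, -np.inf
stage_scores = []

for half_width in (None, 1.0, 0.25):   # None = use the coarse range as-is
    if half_width is not None:
        low, high = best - half_width, best + half_width
    samples = rng.uniform(low, high, size=20)
    scores = toy_validation_score(samples)
    i = int(np.argmax(scores))
    if scores[i] > best_score:         # keep the incumbent if no sample beats it
        best, best_score = samples[i], scores[i]
    stage_scores.append(best_score)
```

Keeping the incumbent between stages means the best score never gets worse as the range narrows.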
|
||||
**Bayesian Hyperparameter Optimization** is a whole area of research devoted to coming up with algorithms that try to more efficiently navigate the space of hyperparameters. The core idea is to appropriately balance the exploration - exploitation trade-off when querying the performance at different hyperparameters. Multiple libraries have been developed based on these models as well; some of the better-known ones are [Spearmint](https://github.com/JasperSnoek/spearmint), [SMAC](http://www.cs.ubc.ca/labs/beta/Projects/SMAC/), and [Hyperopt](http://jaberg.github.io/hyperopt/). However, in practical settings with ConvNets it is still relatively difficult to beat random search within carefully chosen intervals. See some additional from-the-trenches discussion [here](http://nlpers.blogspot.com/2014/10/hyperparameter-search-bayesian.html).
|
||||
|
||||
## Evaluation
|
||||
|
||||
### Model Ensembles
|
||||
|
||||
In practice, one reliable approach to improving the performance of Neural Networks by a few percent is to train multiple independent models, and at test time average their predictions. As the number of models in the ensemble increases, the performance typically monotonically improves (though with diminishing returns). Moreover, the improvements are more dramatic with higher model variety in the ensemble. There are a few approaches to forming an ensemble:
|
||||
|
||||
* **Same model, different initializations**. Use cross-validation to determine the best hyperparameters, then train multiple models with the best set of hyperparameters but with different random initialization. The danger with this approach is that the variety is only due to initialization.
|
||||
* **Top models discovered during cross-validation**. Use cross-validation to determine the best hyperparameters, then pick the top few (e.g. 10) models to form the ensemble. This improves the variety of the ensemble but has the danger of including suboptimal models. In practice, this can be easier to perform since it doesn’t require additional retraining of models after cross-validation.
|
||||
* **Different checkpoints of a single model**. If training is very expensive, some people have had limited success in taking different checkpoints of a single network over time (for example after every epoch) and using those to form an ensemble. Clearly, this suffers from some lack of variety, but can still work reasonably well in practice. The advantage of this approach is that it is very cheap.
|
||||
* **Running average of parameters during training**. Related to the last point, a cheap way of almost always getting an extra percent or two of performance is to maintain a second copy of the network’s weights in memory that maintains an exponentially decaying sum of previous weights during training. This way you’re averaging the state of the network over the last several iterations. You will find that this “smoothed” version of the weights over the last few steps almost always achieves better validation error. The rough intuition to have in mind is that the objective is bowl-shaped and your network is jumping around the mode, so the average has a higher chance of being somewhere nearer the mode.
|
||||
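Two of the ideas in the list above reduce to a line or two of NumPy each: averaging the predicted class probabilities of several models, and keeping an exponential moving average of the weights. The sketch below uses made-up probability tables, and the `decay` value is an arbitrary choice. The moving average is written in its normalized form (weights `decay` and `1 - decay`), which is one common way to realize the "exponentially decaying sum" described above.

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average class probabilities over models; each array is (examples, classes)."""
    return np.mean(prob_list, axis=0)

def ema_update(shadow, weights, decay=0.99):
    """Exponential moving average of the weights, updated once per training step."""
    return decay * shadow + (1 - decay) * weights

# Made-up predictions from three models on two examples with two classes.
probs_a = np.array([[0.6, 0.4], [0.3, 0.7]])
probs_b = np.array([[0.4, 0.6], [0.4, 0.6]])
probs_c = np.array([[0.7, 0.3], [0.1, 0.9]])

avg_probs = ensemble_predict([probs_a, probs_b, probs_c])
labels = avg_probs.argmax(axis=1)
```

In a training loop, `shadow = ema_update(shadow, weights)` would be called after every parameter update, and `shadow` would be used for validation.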
|
||||
One disadvantage of model ensembles is that they take longer to evaluate on a test example. An interested reader may find the recent work from Geoff Hinton on [“Dark Knowledge”](https://www.youtube.com/watch?v=EK61htlw8hY) inspiring, where the idea is to “distill” a good ensemble back to a single model by incorporating the ensemble log likelihoods into a modified objective.
|
||||
|
||||
## Summary
|
||||
|
||||
To train a Neural Network:
|
||||
|
||||
* Gradient check your implementation with a small batch of data and be aware of the pitfalls.
|
||||
* As a sanity check, make sure your initial loss is reasonable, and that you can achieve 100% training accuracy on a very small portion of the data.
|
||||
* During training, monitor the loss, the training/validation accuracy, and if you’re feeling fancier, the magnitude of updates in relation to parameter values (it should be ~1e-3), and when dealing with ConvNets, the first-layer weights.
|
||||
* The two recommended updates to use are either SGD+Nesterov Momentum or Adam.
|
||||
* Decay your learning rate over the period of the training. For example, halve the learning rate after a fixed number of epochs, or whenever the validation accuracy tops off.
|
||||
* Search for good hyperparameters with random search (not grid search). Stage your search from coarse (wide hyperparameter ranges, training only for 1-5 epochs) to fine (narrower ranges, training for many more epochs).
|
||||
* Form model ensembles for extra performance
|
||||
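Two of the checks in this summary are cheap to compute. The sketch below shows the expected initial loss of a softmax classifier with small random weights (about `ln(C)` for `C` classes, since every class gets probability close to `1/C`) and the ratio of update magnitude to parameter magnitude. The layer shape, the weight scale, the stand-in gradient, and the learning rate are all invented for illustration.

```python
import numpy as np

# Sanity check 1: with C classes and near-uniform predictions, the initial
# softmax cross-entropy loss should be close to -ln(1/C) = ln(C).
num_classes = 10
expected_initial_loss = np.log(num_classes)

# Sanity check 2: ratio of the update norm to the parameter norm.
def update_ratio(params, update):
    return np.linalg.norm(update) / np.linalg.norm(params)

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((100, 50))   # made-up weight matrix
dW = rng.standard_normal((100, 50))         # made-up gradient
learning_rate = 1e-5

ratio = update_ratio(W, -learning_rate * dW)
```

A ratio far below 1e-3 suggests the learning rate may be too low, and a ratio far above it suggests the learning rate may be too high.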
|
||||
## Additional References
|
||||
|
||||
* [SGD](http://research.microsoft.com/pubs/192769/tricks-2012.pdf) tips and tricks from Leon Bottou
|
||||
* [Efficient BackProp](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf) (pdf) from Yann LeCun
|
||||
* [Practical Recommendations for Gradient-Based Training of Deep Architectures](http://arxiv.org/pdf/1206.5533v2.pdf) from Yoshua Bengio
|
||||
|
||||
@@ -0,0 +1,786 @@
|
||||
Source: https://fullstackdeeplearning.com/spring2021/lecture-7/
|
||||
Title: FSDL Spring 2021 - Lecture 7: Troubleshooting Deep Neural Networks
|
||||
Fetched-via: uvx markitdown https://fullstackdeeplearning.com/spring2021/lecture-7/
|
||||
Fetch-status: verbatim
|
||||
|
||||
Table of contents
|
||||
+ [Video](#video)
|
||||
+ [Slides](#slides)
|
||||
+ [Notes](#notes)
|
||||
|
||||
- [1 - Why Is Deep Learning Troubleshooting Hard?](#1-why-is-deep-learning-troubleshooting-hard)
|
||||
- [2 - Strategy to Debug Neural Networks](#2-strategy-to-debug-neural-networks)
|
||||
- [3 - Start Simple](#3-start-simple)
|
||||
|
||||
* [Choose A Simple Architecture](#choose-a-simple-architecture)
|
||||
* [Use Sensible Defaults](#use-sensible-defaults)
|
||||
* [Normalize Inputs](#normalize-inputs)
|
||||
* [Simplify The Problem](#simplify-the-problem)
|
||||
- [4 - Implement and Debug](#4-implement-and-debug)
|
||||
|
||||
* [Get Your Model To Run](#get-your-model-to-run)
|
||||
* [Overfit A Single Batch](#overfit-a-single-batch)
|
||||
* [Compare To A Known Result](#compare-to-a-known-result)
|
||||
- [5 - Evaluate](#5-evaluate)
|
||||
|
||||
* [Bias-Variance Decomposition](#bias-variance-decomposition)
|
||||
* [Distribution Shift](#distribution-shift)
|
||||
- [6 - Improve Model and Data](#6-improve-model-and-data)
|
||||
|
||||
* [Step 1: Address Underfitting](#step-1-address-underfitting)
|
||||
* [Step 2: Address Overfitting](#step-2-address-overfitting)
|
||||
* [Step 3: Address Distribution Shift](#step-3-address-distribution-shift)
|
||||
|
||||
+ [Error Analysis](#error-analysis)
|
||||
+ [Domain Adaptation](#domain-adaptation)
|
||||
* [Step 4: Rebalance datasets](#step-4-rebalance-datasets)
|
||||
- [7 - Tune Hyperparameters](#7-tune-hyperparameters)
|
||||
|
||||
* [Techniques for Tuning Hyperparameter Optimization](#techniques-for-tuning-hyperparameter-optimization)
|
||||
- [8 - Conclusion](#8-conclusion)
|
||||
|
||||
# Lecture 7: Troubleshooting Deep Neural Networks
|
||||
|
||||
## Video
|
||||
|
||||
## Slides
|
||||
|
||||
[Download slides as PDF](https://drive.google.com/file/d/1yXQCnGGp3wWdoCf6nSP5b758cXF92rtg/view?usp=sharing)
|
||||
|
||||
## Notes
|
||||
|
||||
*Lecture by [Josh Tobin](http://josh-tobin.com).
|
||||
Notes transcribed by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).*
|
||||
|
||||
In traditional software engineering, a bug usually leads to the program
|
||||
crashing. While this is annoying for the user, it is critical for the
|
||||
developer to inspect the errors to understand why. With deep learning,
|
||||
we sometimes encounter errors, but all too often, the program crashes
|
||||
without a clear reason why. While these issues can be debugged manually,
|
||||
deep learning models most often fail because of poor output predictions.
|
||||
What’s worse is that when the model performance is low, there is usually
|
||||
no signal about why or when the models failed.
|
||||
|
||||
A common sentiment among practitioners is that they spend **80–90% of
|
||||
time debugging and tuning the models** and only 10–20% of time deriving
|
||||
math equations and implementing things. This is confirmed by Andrej
|
||||
Karpathy, [as seen in this
|
||||
tweet](https://twitter.com/karpathy/status/423990618289733632).
|
||||
|
||||
### 1 - Why Is Deep Learning Troubleshooting Hard?
|
||||
|
||||
Suppose you are trying to reproduce a research paper result for your
|
||||
work, but your results are worse. You might wonder why your model’s
|
||||
performance is significantly worse than the paper that you’re trying to
|
||||
reproduce.
|
||||
|
||||

|
||||
|
||||
Many different things can cause this:
|
||||
|
||||
* It can be **implementation bugs**. Most bugs in deep learning are
|
||||
actually invisible.
|
||||
* **Hyper-parameter choices** can also cause your performance to
|
||||
degrade. Deep learning models are very sensitive to
|
||||
hyper-parameters. Even very subtle choices of learning rate and
|
||||
weight initialization can make a big difference.
|
||||
* Performance can also be worse just because of **data/model fit**.
|
||||
For example, you pre-train your model on ImageNet data and fit it
|
||||
on self-driving car images, which are harder to learn.
|
||||
* Finally, poor model performance could be caused not by your model
|
||||
but your **dataset construction**. Typical issues here include not
|
||||
having enough examples, dealing with noisy labels and imbalanced
|
||||
classes, and splitting train and test sets with different
|
||||
distributions.
|
||||
|
||||
### 2 - Strategy to Debug Neural Networks
|
||||
|
||||
The key idea of deep learning troubleshooting is: *Since it is hard to
|
||||
disambiguate errors, it’s best to start simple and gradually ramp up
|
||||
complexity.*
|
||||
|
||||
This lecture provides **a decision tree for debugging deep learning
|
||||
models and improving performance**. This guide assumes that you already
|
||||
have an initial test dataset, a single metric to improve, and target
|
||||
performance based on human-level performance, published results,
|
||||
previous baselines, etc.
|
||||
|
||||

|
||||
|
||||
### 3 - Start Simple
|
||||
|
||||
The first step in the troubleshooting workflow is **starting simple**.
|
||||
|
||||
#### Choose A Simple Architecture
|
||||
|
||||
There are a few things to consider when you want to start simple. The
|
||||
first is how to **choose a simple architecture**. These are
|
||||
architectures that are easy to implement and are likely to get you part
|
||||
of the way towards solving your problem without introducing as many
|
||||
bugs.
|
||||
|
||||
Architecture selection is one of the many intimidating parts of getting
|
||||
into deep learning because there are tons of papers coming out
|
||||
all-the-time and claiming to be state-of-the-art on some problems. They
|
||||
get very complicated fast. In the limit, if you’re trying to get to
|
||||
maximal performance, then architecture selection is challenging. But
|
||||
when starting on a new problem, you can just follow a simple set of rules
|
||||
that will allow you to pick an architecture that enables you to do a
|
||||
decent job on the problem you’re working on.
|
||||
|
||||
* If your data looks like **images**, start with a LeNet-like
|
||||
architecture and consider using something like ResNet as your
|
||||
codebase gets more mature.
|
||||
* If your data looks like **sequences**, start with an LSTM with one
|
||||
hidden layer and/or temporal/classical convolutions. Then, when
|
||||
your problem gets more mature, you can move to an Attention-based
|
||||
model or a WaveNet-like model.
|
||||
* For **all other tasks**, start with a fully-connected neural network
|
||||
with one hidden layer and use more advanced networks later,
|
||||
depending on the problem.
|
||||
|
||||

|
||||
|
||||
In reality, many times, the input data contains multiple of those things
|
||||
above. So how do you feed **multiple input modalities** into a neural
|
||||
network? Here is the 3-step strategy that we recommend:
|
||||
|
||||
* First, map each of these modalities into a lower-dimensional feature
|
||||
space. In the example above, the images are passed through a
|
||||
ConvNet, and the words are passed through an LSTM.
|
||||
* Then we flatten the outputs of those networks to get a single vector
|
||||
for each of the inputs that will go into the model. Then we
|
||||
concatenate those inputs.
|
||||
* Finally, we pass them through some fully-connected layers to an
|
||||
output.
|
||||
|
||||
#### Use Sensible Defaults
|
||||
|
||||
After choosing a simple architecture, the next thing to do is to
|
||||
**select sensible hyper-parameter defaults** to start with. Here are the
|
||||
defaults that we recommend:
|
||||
|
||||
* [Adam optimizer with a “magic” learning rate value of
|
||||
3e-4](https://twitter.com/karpathy/status/801621764144971776?lang=en).
|
||||
* [ReLU](https://stats.stackexchange.com/questions/226923/why-do-we-use-relu-in-neural-networks-and-how-do-we-use-it)
|
||||
activation for fully-connected and convolutional models and
|
||||
[Tanh](https://stats.stackexchange.com/questions/330559/why-is-tanh-almost-always-better-than-sigmoid-as-an-activation-function)
|
||||
activation for LSTM models.
|
||||
* [He initialization for ReLU activation function and Glorot
|
||||
initialization for Tanh activation
|
||||
function](https://datascience.stackexchange.com/questions/13061/when-to-use-he-or-glorot-normal-initialization-over-uniform-init-and-what-are).
|
||||
* No regularization and no data normalization to start with.
|
||||
|
||||
#### Normalize Inputs
|
||||
|
||||
The next step is to **normalize the input data**, subtracting the mean
|
||||
and dividing by the standard deviation. Note that for images, it’s fine to scale
|
||||
values to [0, 1] or [-0.5, 0.5] (for example, by dividing by 255).
|
||||
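As a rough illustration of the normalization step, here is a minimal sketch in plain Python. The helper names `standardize` and `scale_pixels` and the sample values are hypothetical, chosen only for this example; a real pipeline would typically compute the statistics on the training split and reuse them for validation and test data.

```python
import statistics


def standardize(column):
    """Return (z_scores, mean, std) for one feature column (list of floats)."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0.0:
        std = 1.0  # constant feature: avoid dividing by zero
    return [(x - mean) / std for x in column], mean, std


def scale_pixels(pixels):
    """Map 8-bit pixel intensities from [0, 255] to [0, 1]."""
    return [p / 255.0 for p in pixels]


# Hypothetical training feature, used only to exercise the helpers.
train_feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z_scores, mean, std = standardize(train_feature)
unit_pixels = scale_pixels([0, 255])
```

After standardizing, the column has zero mean and unit standard deviation, which is the property the defaults above (such as the 3e-4 learning rate) tend to assume.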
|
||||
#### Simplify The Problem
|
||||
|
||||
The final thing you should do is consider **simplifying the problem**
|
||||
itself. If you have a complicated problem with massive data and tons of
|
||||
classes to deal with, then you should consider:
|
||||
|
||||
* Working with a small training set around 10,000 examples.
|
||||
* Using a fixed number of objects, classes, input size, etc.
|
||||
* Creating a simpler synthetic training set like in research labs.
|
||||
|
||||
This is important because (1) you will have reasonable confidence that
|
||||
your model should be able to solve the problem, and (2) your iteration speed will
|
||||
increase.
|
||||
|
||||
The diagram below neatly summarizes how to start simple:
|
||||
|
||||

|
||||
|
||||
### 4 - Implement and Debug
|
||||
|
||||
To give you a preview, below are the five most common bugs in deep
|
||||
learning models that we recognize:
|
||||
|
||||
* **Incorrect shapes for the network tensors**: This bug is a common
|
||||
one and can fail silently. This happens many times because the
|
||||
automatic differentiation systems in the deep learning framework
|
||||
do silent broadcasting. Tensors become different shapes in the
|
||||
network and can cause a lot of problems.
|
||||
* **Pre-processing inputs incorrectly**: For example, you forget to
|
||||
normalize your inputs or apply too much input pre-processing
|
||||
(over-normalization and excessive data augmentation).
|
||||
* **Incorrect input to the model’s loss function**: For example, you
|
||||
use softmax outputs to a loss that expects logits.
|
||||
* **Forgot to set up train mode for the network correctly**: For
|
||||
example, toggling train/evaluation mode or controlling batch norm
|
||||
dependencies.
|
||||
* **Numerical instability**: For example, you get `inf` or `NaN`
|
||||
as outputs. This bug often stems from using an exponent, a log, or
|
||||
a division operation somewhere in the code.
|
||||
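To make the numerical-instability and loss-input bugs above concrete, the following sketch compares a naive cross-entropy with one computed from logits using the max-subtraction trick. It is plain Python rather than any framework's API, and the function names are hypothetical; in practice the framework's built-in "from logits" loss is usually the safer choice.

```python
import math


def log_softmax(logits):
    """Log-softmax with the maximum subtracted first to avoid overflow."""
    m = max(logits)
    shifted = [x - m for x in logits]
    log_sum = math.log(sum(math.exp(s) for s in shifted))
    return [s - log_sum for s in shifted]


def cross_entropy_from_logits(logits, target_index):
    """Cross-entropy that takes raw logits, not softmax probabilities."""
    return -log_softmax(logits)[target_index]


def naive_cross_entropy(logits, target_index):
    """Textbook formula; math.exp overflows for large logits."""
    exps = [math.exp(x) for x in logits]
    return -math.log(exps[target_index] / sum(exps))


big_logits = [1000.0, 0.0, -1000.0]
stable_loss = cross_entropy_from_logits(big_logits, 0)

try:
    naive_cross_entropy(big_logits, 0)
    naive_failed = False
except OverflowError:
    naive_failed = True
```

The stable version returns a finite loss on the same logits that make the naive version raise, which is the kind of failure that often surfaces as `inf` or `NaN` inside a framework.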
|
||||
Here are three pieces of general advice for implementing your model:
|
||||
|
||||
* **Start with a lightweight implementation**. You want minimum
|
||||
possible new lines of code for the 1st version of your model. The
|
||||
rule of thumb is less than 200 lines. This doesn’t count tested
|
||||
infrastructure components or TensorFlow/PyTorch code.
|
||||
* **Use off-the-shelf components** such as Keras if possible, since
|
||||
most of the stuff in Keras works well out-of-the-box. If you have
|
||||
to use TensorFlow, use the built-in functions, don’t do the math
|
||||
yourself. This would help you avoid a lot of numerical instability
|
||||
issues.
|
||||
* **Build complicated data pipelines later**. These are important for
|
||||
large-scale ML systems, but you should not start with them because
|
||||
data pipelines themselves can be a big source of bugs. Just start
|
||||
with a dataset that you can load into memory.
|
||||
|
||||

|
||||
|
||||
#### Get Your Model To Run
|
||||
|
||||
The first step of implementing bug-free deep learning models is
|
||||
**getting your model to run at all**. There are a few things that can
|
||||
prevent this from happening:
|
||||
|
||||
* **Shape mismatch/casting issue**: To address this type of problem,
|
||||
you should step through your model creation and inference
|
||||
step-by-step in a debugger, checking for correct shapes and data
|
||||
types of your tensors.
|
||||
* **Out-of-memory issues**: This can be very difficult to debug. You
|
||||
can scale back your memory-intensive operations one-by-one. For
|
||||
example, if you create large matrices anywhere in your code, you
|
||||
can reduce the size of their dimensions or cut your batch size in
|
||||
half.
|
||||
* **Other issues**: You can simply Google it. Stack Overflow would be
|
||||
great most of the time.
|
||||
|
||||
Let’s zoom in on the process of stepping through model creation in a
|
||||
debugger and talk about **debuggers for deep learning code**:
|
||||
|
||||
* In PyTorch, you can use
|
||||
[ipdb](https://pypi.org/project/ipdb/) — which exports
|
||||
functions to access the interactive
|
||||
[IPython](http://ipython.org/) debugger.
|
||||
* In TensorFlow, it’s trickier. TensorFlow separates the process of
|
||||
creating the graph and executing operations in the graph. There
|
||||
are three options you can try: (1) step through the graph creation
|
||||
itself and inspect each tensor layer, (2) step into the training
|
||||
loop and evaluate the tensor layers, or (3) use [TensorFlow
|
||||
Debugger](https://mullikine.github.io/posts/tensorflow-debugger-tfdb-and-emacs/)
|
||||
(tfdbg), which does options 1 and 2 automatically.
|
||||
|
||||

|
||||
|
||||
#### Overfit A Single Batch
|
||||
|
||||
After getting your model to run, the next thing you need to do is to
|
||||
**overfit a single batch of data**. This is a heuristic that can catch
|
||||
an absurd number of bugs. This really means that you want to drive your
|
||||
training error arbitrarily close to 0.
|
||||
|
||||
There are a few things that can happen when you try to overfit a single
|
||||
batch and it fails:
|
||||
|
||||
* **Error goes up**: Commonly, this is due to a flipped sign somewhere in
|
||||
the loss function/gradient.
|
||||
* **Error explodes**: This is usually a numerical issue but can also
|
||||
be caused by a high learning rate.
|
||||
* **Error oscillates**: You can lower the learning rate and inspect
|
||||
the data for shuffled labels or incorrect data augmentation.
|
||||
* **Error plateaus**: You can increase the learning rate and get rid
|
||||
of regularization. Then you can inspect the loss function and the data
|
||||
pipeline for correctness.
|
||||
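The single-batch check can be sketched on a toy problem. The example below is a hypothetical stand-in (a two-parameter linear model trained with hand-written gradient descent, not a real network); it shows the expected outcome, loss driven close to 0, and the "error goes up" symptom when a sign is deliberately flipped through the made-up `flip_sign` argument.

```python
def mse_and_grads(w, b, xs, ys):
    """Mean squared error of y = w*x + b on one batch, plus its gradients."""
    n = len(xs)
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    loss = sum(e * e for e in errors) / n
    grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
    grad_b = 2.0 * sum(errors) / n
    return loss, grad_w, grad_b


def fit_single_batch(xs, ys, lr=0.05, steps=2000, flip_sign=False):
    """Run gradient descent on one fixed batch and record the loss curve.

    flip_sign=True deliberately injects a flipped-sign bug in the update.
    """
    w, b = 0.0, 0.0
    direction = 1.0 if flip_sign else -1.0
    history = []
    for _ in range(steps):
        loss, grad_w, grad_b = mse_and_grads(w, b, xs, ys)
        history.append(loss)
        w += direction * lr * grad_w
        b += direction * lr * grad_b
    return w, b, history


# One tiny batch generated from y = 2x + 1.
batch_x = [0.0, 1.0, 2.0, 3.0]
batch_y = [1.0, 3.0, 5.0, 7.0]

w, b, history = fit_single_batch(batch_x, batch_y)
_, _, buggy_history = fit_single_batch(batch_x, batch_y, steps=20, flip_sign=True)
```

If the correct run cannot reach near-zero loss on a handful of examples, the symptoms listed above are a reasonable place to start looking.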
|
||||

|
||||
|
||||
#### Compare To A Known Result
|
||||
|
||||
Once your model overfits a single batch, there can still be some
|
||||
other issues that cause bugs. The last step here is to **compare your
|
||||
results to a known result**. So what sort of known results are useful?
|
||||
|
||||
* The most useful results come from **an official model implementation
|
||||
evaluated on a similar dataset to yours**. You can step through
|
||||
the code in both models line-by-line and ensure your model has the
|
||||
same output. You want to ensure that your model performance is up
|
||||
to par with expectations.
|
||||
* If you can’t find an official implementation on a similar dataset,
|
||||
you can compare your approach to results from **an official model
|
||||
implementation evaluated on a benchmark dataset**. You most
|
||||
definitely want to walk through the code line-by-line and ensure
|
||||
you have the same output.
|
||||
* If there is no official implementation of your approach, you can
|
||||
compare it to results from **an unofficial model implementation**.
|
||||
You can review the code the same as before but with lower
|
||||
confidence (because almost all the unofficial implementations on
|
||||
GitHub have bugs).
|
||||
* Then, you can compare to results from **a paper with no code** (to
|
||||
ensure that your performance is up to par with expectations),
|
||||
results from **your model on a benchmark dataset** (to make sure
|
||||
your model performs well in a simpler setting), and results from
|
||||
**a similar model on a similar dataset** (to help you get a
|
||||
general sense of what kind of performance can be expected).
|
||||
* An under-rated source of results comes from **simple baselines**
|
||||
(for example, the average of outputs or linear regression), which
|
||||
can help make sure that your model is learning anything at all.
|
||||
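A simple baseline of the kind mentioned in the last bullet can be as small as predicting the training-set mean. The sketch below uses made-up targets and made-up model predictions purely for illustration; the function names are not from any library.

```python
import statistics


def mse(preds, targets):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def mean_baseline_mse(train_targets, eval_targets):
    """Error of a 'model' that always predicts the training-set mean."""
    constant = statistics.fmean(train_targets)
    return mse([constant] * len(eval_targets), eval_targets)


# Hypothetical data: targets and the predictions of some trained model.
train_y = [1.0, 2.0, 3.0, 4.0]
val_y = [2.0, 3.0]
model_val_preds = [2.1, 2.8]

baseline_error = mean_baseline_mse(train_y, val_y)
model_error = mse(model_val_preds, val_y)
beats_baseline = model_error < baseline_error
```

A model that cannot beat this constant predictor is probably not learning anything useful from its inputs yet.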
|
||||
The diagram below neatly summarizes how to implement and debug deep
|
||||
neural networks:
|
||||
|
||||

|
||||
|
||||
### 5 - Evaluate
|
||||
|
||||
#### Bias-Variance Decomposition
|
||||
|
||||
To evaluate models and prioritize the next steps in model development,
|
||||
we will apply the bias-variance decomposition. The [bias-variance
|
||||
decomposition](http://scott.fortmann-roe.com/docs/BiasVariance.html)
|
||||
is the fundamental model fitting tradeoff. In our application, let’s
|
||||
talk more specifically about the formula for bias-variance tradeoff with
|
||||
respect to the **test error;** this will help us apply the concept more
|
||||
directly to our model’s performance. There are four terms in the formula
|
||||
for test error:
|
||||
|
||||
*Test error = irreducible error + bias + variance + validation
|
||||
overfitting*
|
||||
|
||||
1. **Irreducible error** is the baseline error you don’t expect your
|
||||
model to improve on. It can be estimated through strong baselines,
|
||||
like human performance.
|
||||
2. **Avoidable bias**, a measure of underfitting, is the difference
|
||||
between our train error and irreducible error.
|
||||
3. **Variance**, a measure of overfitting, is the difference between
|
||||
validation error and training error.
|
||||
4. **Validation set overfitting** is the difference between test error
|
||||
and validation error.
|
||||
|
||||
Consider the chart of learning curves and errors below. Using the test
|
||||
error formula for bias and variance, we can calculate each component of
|
||||
test error and make decisions based on the value. For example, our
|
||||
avoidable bias is rather low (only 2 points), while the variance is much
|
||||
higher (5 points). With this knowledge, we should prioritize methods of
|
||||
preventing overfitting, like regularization.
|
||||
|
||||

|
||||
|
||||
#### Distribution Shift
|
||||
|
||||
Clearly, the application of the bias-variance decomposition to the test
|
||||
error has already helped prioritize our next steps for model
|
||||
development. However, until now, we’ve assumed that the samples
|
||||
(training, validation, testing) all come from the same distribution.
|
||||
What if this isn’t the case? In practical ML situations, this
|
||||
**distribution shift** often occurs. In building self-driving cars, a
|
||||
frequent occurrence might be training with samples from one distribution
|
||||
(e.g., daytime driving video) but testing or inferring on samples from a
|
||||
totally different distribution (e.g., night time driving).
|
||||
|
||||
A simple way of handling this wrinkle in our assumption is to create two
|
||||
validation sets: one from the training distribution and one from the
|
||||
test distribution. This can be helpful even with a very small testing
|
||||
set. If we apply this, we can actually estimate our distribution shift,
|
||||
which is the difference between test-validation error and train-validation
|
||||
error. This is really useful for practical applications of ML! With this
|
||||
new term, let’s update our test error formula of bias and variance:
|
||||
|
||||
*Test error = irreducible error + bias + variance + distribution shift +
|
||||
validation overfitting*
|
||||
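The updated formula can be turned into a small bookkeeping helper. This is a sketch under stated assumptions: the function name, the dictionary keys, and all error values (in percentage points) are hypothetical, and each gap is defined as in the text, with distribution shift taken as the gap between the two validation sets.

```python
def decompose_test_error(irreducible, train, train_val, test_val, test):
    """Split test error into the five terms of the updated formula."""
    return {
        "irreducible_error": irreducible,
        "avoidable_bias": train - irreducible,
        "variance": train_val - train,
        "distribution_shift": test_val - train_val,
        "validation_overfitting": test - test_val,
    }


# Hypothetical error rates in percentage points.
parts = decompose_test_error(
    irreducible=25.0, train=27.0, train_val=32.0, test_val=33.5, test=34.0
)

# The largest reducible term suggests what to work on first.
reducible = {k: v for k, v in parts.items() if k != "irreducible_error"}
largest_gap = max(reducible, key=reducible.get)
```

With these made-up numbers the terms sum back to the test error, and variance is the largest gap, so overfitting would be the first thing to address.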
|
||||
### 6 - Improve Model and Data
|
||||
|
||||
Using the updated formula from the last section, we’ll be able to decide
|
||||
on and prioritize the right next steps for each iteration of a model. In
|
||||
particular, we’ll follow a specific process (shown below).
|
||||
|
||||

|
||||
|
||||
#### Step 1: Address Underfitting
|
||||
|
||||
We’ll start by addressing underfitting (i.e., reducing bias). The first
|
||||
thing to try in this case is to make your model bigger (e.g., add
|
||||
layers, more units per layer). Next, consider reducing regularization, which can
|
||||
prevent a tight fit to your data. Other options are error analysis,
|
||||
choosing a different model architecture (e.g., something more state of
|
||||
the art), tuning hyperparameters, or adding features. Some notes:
|
||||
|
||||
* Choosing different architectures, especially a SOTA one, can be very
|
||||
helpful but is also risky. Bugs are easily introduced in the
|
||||
implementation process.
|
||||
* Adding features is uncommon in the deep learning paradigm (vs.
|
||||
traditional machine learning). We usually want the network to
|
||||
learn features of its own accord. If all else fails, it can be
|
||||
beneficial in a practical setting.
|
||||
|
||||

|
||||
|
||||
#### Step 2: Address Overfitting
|
||||
|
||||
After addressing underfitting, move on to solving overfitting.
|
||||
Similarly, there’s a recommended series of methods to try in order.
|
||||
Starting with collecting training data (if possible) is the soundest way
|
||||
to address overfitting, though it can be challenging in certain
|
||||
applications. Next, tactical improvements like normalization, data
|
||||
augmentation, and regularization can help. Following these steps,
|
||||
traditional defaults like tuning hyperparameters, choosing a different
|
||||
architecture, or error analysis are useful. Finally, if overfitting is
|
||||
rather intractable, there’s a series of less recommended steps, such as
|
||||
early stopping, removing features, and reducing model size. Early
|
||||
stopping is a personal choice; the fast.ai community is a strong
|
||||
proponent.
|
||||
|
||||

|
||||
|
||||
#### Step 3: Address Distribution Shift
|
||||
|
||||
After addressing underfitting and overfitting, if there’s a difference
|
||||
between the error on our training validation set vs. our test validation
|
||||
set, we need to address the error caused by the distribution shift. This
|
||||
is a harder problem to solve, so there’s less in our toolkit to apply.
|
||||
|
||||
Start by looking manually at the errors in the test-validation set.
|
||||
Compare the potential logic behind these errors to the performance in
|
||||
the train-validation set, and use the errors to guide further data
|
||||
collection. Essentially, reason about why your model may be suffering
|
||||
from distribution shift error. This is the most principled way to deal
|
||||
with distribution shift, though it’s the most challenging way
|
||||
practically. If collecting more data to address these errors isn’t
|
||||
possible, try synthesizing data. Additionally, you can try [domain
|
||||
adaptation](https://ece.engin.umich.edu/wp-content/uploads/2019/09/4142.pdf).
|
||||
|
||||

|
||||
|
||||
##### Error Analysis
|
||||
|
||||
Manually evaluating errors to understand model performance is generally
|
||||
a high-yield way of figuring out how to improve the model.
|
||||
Systematically performing this **error analysis** process and
|
||||
decomposing the error from different error types can help prioritize
|
||||
model improvements. For example, in a self-driving car use case with
|
||||
error types like hard-to-see pedestrians, reflections, and nighttime
|
||||
scenes, decomposing the error contribution of each and where it occurs
|
||||
(train-val vs. test-val) can give rise to a clear set of prioritized
|
||||
action items. See the table for an example of how this error analysis
|
||||
can be effectively structured.
|
||||
|
||||

|
||||
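An error-analysis table like the one described can be approximated with a simple tally. The sketch below assumes each misclassified example has been hand-labeled with a split and an error type; the records, split sizes, and function name are all hypothetical.

```python
from collections import Counter


def error_breakdown(errors, split_sizes):
    """Percentage of each split's examples lost to each error type."""
    counts = Counter((e["split"], e["error_type"]) for e in errors)
    return {
        key: 100.0 * count / split_sizes[key[0]]
        for key, count in counts.items()
    }


# Hypothetical hand-labeled mistakes from two validation sets.
misclassified = [
    {"split": "train_val", "error_type": "hard_to_see_pedestrian"},
    {"split": "test_val", "error_type": "hard_to_see_pedestrian"},
    {"split": "test_val", "error_type": "night_scene"},
    {"split": "test_val", "error_type": "night_scene"},
    {"split": "test_val", "error_type": "reflection"},
]
split_sizes = {"train_val": 100, "test_val": 50}

breakdown = error_breakdown(misclassified, split_sizes)
worst = max(breakdown, key=breakdown.get)
```

Sorting the breakdown by contribution gives a rough priority list; here the night scenes in the test-validation set would come first.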
|
||||
##### Domain Adaptation
|
||||
|
||||
Domain adaptation is a class of techniques that train on a “source”
|
||||
distribution and generalize to another “target” using only unlabeled
|
||||
data or limited labeled data. You should use domain adaptation when
|
||||
access to labeled data from the test distribution is limited, but access
|
||||
to relatively similar data is plentiful.
|
||||
|
||||
There are a few different types of domain adaptation:
|
||||
|
||||
1. **Supervised domain adaptation**: In this case, we have limited data
|
||||
from the target domain to adapt to. Some example applications of
|
||||
the concept include fine-tuning a pre-trained model or adding
|
||||
target data to a training set.
|
||||
2. **Unsupervised domain adaptation**: In this case, we have lots of
|
||||
unlabeled data from the target domain. Some techniques you might
|
||||
see are CORAL, domain confusion, and CycleGAN.
|
||||
|
||||
Practically speaking, supervised domain adaptation can work really well!
|
||||
Unsupervised domain adaptation has a little bit further to go.
|
||||
|
||||
#### Step 4: Rebalance datasets
|
||||
|
||||
If the test-validation set performance starts to look considerably
|
||||
better than the test performance, you may have overfit the validation
|
||||
set. This commonly occurs with small validation sets or lots of
|
||||
hyperparameter tuning. If this occurs, resample the validation set
|
||||
from the test distribution and get a fresh estimate of the performance.
|
||||
|
||||
### 7 - Tune Hyperparameters
|
||||
|
||||
One of the core challenges in hyperparameter optimization is very basic:
|
||||
**which hyperparameters should you tune?** As we consider this
|
||||
fundamental question, let’s keep the following in mind:
|
||||
|
||||
* Models are more sensitive to some hyperparameters than others. This
|
||||
means we should focus our efforts on the more impactful
|
||||
hyperparameters.
|
||||
* However, which hyperparameters are most important depends heavily on
|
||||
our choice of model.
|
||||
* Certain rules of thumb can help guide our initial thinking.
|
||||
* Sensitivity is always relative to default values; if you use good
|
||||
defaults, you might start in a good place!
|
||||
|
||||
See the following table for a ranked list of hyperparameters and their
|
||||
impact on the model:
|
||||
|
||||

|
||||
|
||||
#### Techniques for Tuning Hyperparameter Optimization
|
||||
|
||||
Now that we know which hyperparameters make the most sense to tune
|
||||
(using rules of thumb), let’s consider the various methods of actually
|
||||
tuning them:
|
||||
|
||||
1. **Manual Hyperparameter Optimization**. Colloquially referred to as
|
||||
Graduate Student Descent, this method works by taking a manual,
|
||||
detailed look at your algorithm, building intuition, and
|
||||
considering which hyperparameters would make the most difference.
|
||||
After figuring out these parameters, you train, evaluate, and
|
||||
guess a better hyperparameter value using your intuition for the
|
||||
algorithm and intelligence. While it may seem archaic, this method
|
||||
combines well with other methods (e.g., setting a range of values
|
||||
for hyperparameters) and has the main benefit of reducing
|
||||
computation time and cost if used skillfully. It can be
|
||||
time-consuming and challenging, but it can be a good starting
|
||||
point.
|
||||
2. **Grid Search**. Imagine each of your parameters plotted against
|
||||
each other on a grid, from which you uniformly sample values to
|
||||
test. For each point, you run a training run and evaluate
|
||||
performance. The advantages are that it’s very simple and can
|
||||
often produce good results. However, it’s quite inefficient, as
|
||||
you must run every combination of hyperparameters. It also often
|
||||
requires prior knowledge about the hyperparameters since we must
|
||||
manually set the range of values.
|
||||
3. **Random Search**: This method is recommended over grid search.
|
||||
Rather than sampling from the grid of values for the
|
||||
hyperparameter evenly, we’ll choose n points sampled randomly
|
||||
across the grid. Empirically, this method produces better results
|
||||
than grid search. However, the results can be somewhat
|
||||
uninterpretable, with unexpected values in certain hyperparameters
|
||||
returned.
|
||||
4. **Coarse-to-fine Search**: Rather than running entirely random runs,
|
||||
we can gradually narrow in on the best hyperparameters through
|
||||
this method. Initially, start by defining a very large range to
|
||||
run a randomized search on. Within the pool of results, you can
|
||||
find N best results and home in on the hyperparameter values used
|
||||
to generate those samples. As you iteratively perform this method,
|
||||
you can get excellent performance. This doesn’t remove the manual
|
||||
component, as you have to select which range to continuously
|
||||
narrow your search to, but it’s perhaps the most popular method
|
||||
available.
|
||||
5. **Bayesian Hyperparameter Optimization**: This is a reasonably
|
||||
sophisticated method, which you can read more about
|
||||
[here](http://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/slides/lec21.pdf)
|
||||
and
|
||||
[here](https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f).
|
||||
At a high level, start with a prior estimate of parameter
|
||||
distributions. Subsequently, maintain a probabilistic model of the
|
||||
relationship between hyperparameter values and model performance.
|
||||
As you maintain this model, you toggle between training with
|
||||
hyperparameter values that maximize the expected improvement (per
|
||||
the model) and using the training results to update the
|
||||
probabilistic model and its expectations. This is a great,
|
||||
hands-off, efficient method to choose hyperparameters. However,
|
||||
these techniques can be quite challenging to implement from
|
||||
scratch. As libraries and infrastructure mature, the integration
|
||||
of these methods into training will become easier.
|
||||
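Coarse-to-fine random search can be sketched in a few lines. In this example the objective `toy_validation_loss` is a hypothetical stand-in for "train a model and return its validation loss", and the range, trial counts, and seed are arbitrary choices; the learning rate is sampled log-uniformly, which is a common convention rather than something the lecture prescribes.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose base-10 logarithm is uniform on [log10(low), log10(high)]."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))


def toy_validation_loss(lr):
    """Hypothetical objective with its minimum at lr = 3e-4."""
    return (math.log10(lr) - math.log10(3e-4)) ** 2


def coarse_to_fine(objective, low, high, rounds=3, trials=20, keep=5, seed=0):
    """Random search that shrinks the range around the best trials each round."""
    rng = random.Random(seed)
    best_lr, best_loss = None, float("inf")
    for _ in range(rounds):
        candidates = [sample_log_uniform(rng, low, high) for _ in range(trials)]
        results = sorted((objective(lr), lr) for lr in candidates)
        top = results[:keep]
        if top[0][0] < best_loss:
            best_loss, best_lr = top[0]
        # Narrow the next round to the span covered by the best trials.
        low = min(lr for _, lr in top)
        high = max(lr for _, lr in top)
    return best_lr, best_loss


best_lr, best_loss = coarse_to_fine(toy_validation_loss, low=1e-6, high=1e-1)
```

The narrowing step is automated here for brevity; as the text notes, choosing the range for the next round is usually a manual judgment call.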
|
||||
In summary, you should probably start with coarse-to-fine random
|
||||
searches and move to Bayesian methods as your codebase matures and
|
||||
you’re more certain of your model.
|
||||
|
||||
### 8 - Conclusion
|
||||
|
||||
To wrap up this lecture, deep learning troubleshooting and debugging is
|
||||
really hard. It’s difficult to tell if you have a bug because there are
|
||||
many possible sources for the same degradation in performance.
|
||||
Furthermore, the results can be sensitive to small changes in
|
||||
hyper-parameters and dataset makeup.
|
||||
|
||||
To train bug-free deep learning models, we need to treat building them
|
||||
as an iterative process. If you skipped to the end, the following steps
|
||||
can make this process easier and catch errors as early as possible:
|
||||
|
||||
* **Start Simple**: Choose the simplest model and data possible.
|
||||
* **Implement and Debug**: Once the model runs, overfit a single batch
|
||||
and reproduce a known result.
|
||||
* **Evaluate**: Apply the bias-variance decomposition to decide what
|
||||
to do next.
|
||||
* **Tune Hyper-parameters**: Use coarse-to-fine random searches to
|
||||
tune the model’s hyper-parameters.
|
||||
* **Improve Model and Data**: Make your model bigger if your model
|
||||
under-fits and add more data and/or regularization if your model
|
||||
over-fits.
|
||||
|
||||
Here are additional resources that you can use to learn more:
|
||||
|
||||
* Andrew Ng’s “[Machine Learning
|
||||
Yearning](https://www.deeplearning.ai/machine-learning-yearning/)”
|
||||
book.
|
||||
* This [Twitter
|
||||
thread](https://twitter.com/karpathy/status/1013244313327681536)
|
||||
from Andrej Karpathy.
|
||||
* BYU’s “[Practical Advice for Building Deep Neural
|
||||
Networks](https://pcc.cs.byu.edu/2017/10/02/practical-advice-for-building-deep-neural-networks/)”
|
||||
blog post.
|
||||
|
||||
The Full Stack, 2023
|
||||
|
||||
File diff suppressed because one or more lines are too long
@@ -0,0 +1,199 @@
|
||||
Source: http://joschu.net/docs/nuts-and-bolts.pdf
|
||||
Title: Nuts and Bolts of Deep RL Research - John Schulman (2016)
|
||||
Fetched-via: bash -c 'uvx "markitdown[pdf]" http://joschu.net/docs/nuts-and-bolts.pdf'
|
||||
Fetch-status: verbatim
|
||||
|
||||
The Nuts and Bolts of Deep RL Research
|
||||
John Schulman
|
||||
December 9th, 2016
|
||||
|
||||
Outline
|
||||
* Approaching New Problems
|
||||
* Ongoing Development and Tuning
|
||||
* General Tuning Strategies for RL
|
||||
* Policy Gradient Strategies
|
||||
* Q-Learning Strategies
|
||||
* Miscellaneous Advice
|
||||
|
||||
Approaching New Problems
|
||||
|
||||
New Algorithm? Use Small Test Problems
|
||||
* Run experiments quickly
|
||||
* Do hyperparameter search
|
||||
* Interpret and visualize learning process: state visitation, value function, etc.
|
||||
* Counterpoint: don’t overfit algorithm to contrived problem
|
||||
* Useful to have medium-sized problems that you’re intimately familiar with (Hopper, Atari Pong)
|
||||
|
||||
New Task? Make It Easier Until Signs of Life
|
||||
* Provide good input features
|
||||
* Shape reward function
|
||||
|
||||
POMDP Design
|
||||
* Visualize random policy: does it sometimes exhibit desired behavior?
|
||||
* Human control
|
||||
  * Atari: can you see game features in downsampled image?
|
||||
* Plot time series for observations and rewards. Are they on a reasonable scale?
|
||||
  * `hopper.py` in gym: `reward = 1.0 - 1e-3 * np.square(a).sum() + delta x / delta t`
|
||||
* Histogram observations and rewards
|
||||
|
||||
Run Your Baselines
|
||||
* Don’t expect them to work with default parameters
|
||||
* Recommended:
|
||||
  * Cross-entropy method [1]
|
||||
  * Well-tuned policy gradient method [2]
|
||||
  * Well-tuned Q-learning + SARSA method
|
||||
[1] István Szita and András Lőrincz (2006). “Learning Tetris using the noisy cross-entropy method”. In: Neural Computation.
|
||||
[2] https://github.com/openai/rllab
|
||||
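For readers unfamiliar with the cross-entropy method recommended above, here is a minimal one-parameter sketch. It is the plain version, not the noisy variant from the cited Szita and Lőrincz paper, and `toy_return`, the population size, elite fraction, and seed are all hypothetical choices; in RL the objective would be the average return of a policy with parameters `theta`.

```python
import random
import statistics


def cross_entropy_method(objective, mean, std, iterations=30,
                         population=50, elite_frac=0.2, seed=0):
    """Maximize objective(theta) over a single real-valued parameter."""
    rng = random.Random(seed)
    n_elite = max(2, int(population * elite_frac))
    for _ in range(iterations):
        # Sample candidates from the current Gaussian search distribution.
        samples = [rng.gauss(mean, std) for _ in range(population)]
        # Keep the highest-scoring candidates and refit the Gaussian to them.
        elites = sorted(samples, key=objective, reverse=True)[:n_elite]
        mean = statistics.fmean(elites)
        std = statistics.pstdev(elites) + 1e-6  # small floor keeps std positive
    return mean


def toy_return(theta):
    """Hypothetical stand-in for an episode return; maximum at theta = 2."""
    return -(theta - 2.0) ** 2


best_theta = cross_entropy_method(toy_return, mean=0.0, std=2.0)
```

Because it needs no gradients and has few moving parts, this kind of baseline is useful for checking whether a task is solvable at all before debugging a more complex algorithm.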
|
||||
Run with More Samples Than Expected
|
||||
* Early in tuning process, may need huge number of samples
|
||||
  * Don’t be deterred by published work
|
||||
* Examples:
|
||||
  * TRPO on Atari: 100K timesteps per batch for KL = 0.01
|
||||
  * DQN on Atari: update freq=10K, replay buffer size=1M
|
||||
|
||||
Ongoing Development and Tuning
|
||||
|
||||
It Works! But Don’t Be Satisfied
- Explore sensitivity to each parameter
  - If too sensitive, it doesn’t really work, you just got lucky
- Look for health indicators
  - VF fit quality
  - Policy entropy
  - Update size in output space and parameter space
  - Standard diagnostics for deep networks
|
||||
|
||||
Continually Benchmark Your Code
- If reusing code, regressions occur
- Run a battery of benchmarks occasionally
|
||||
|
||||
Always Use Multiple Random Seeds
|
||||
|
||||
Always Be Ablating
- Different tricks may substitute
  - Especially whitening
- “Regularize” to favor simplicity in algorithm design space
  - As usual, simplicity → generalization
|
||||
|
||||
Automate Your Experiments
- Don’t spend all day watching your code print out numbers
- Consider using a cloud computing platform (Microsoft Azure, Amazon EC2, Google Compute Engine)
|
||||
|
||||
General Tuning Strategies for RL
|
||||
|
||||
Whitening / Standardizing Data
- If observations have unknown range, standardize
  - Compute running estimate of mean and standard deviation
  - $x' = \mathrm{clip}\left((x - \mu)/\sigma,\ -10,\ 10\right)$
- Rescale the rewards, but don’t shift mean, as that affects agent’s will to live
- Standardize prediction targets (e.g., value functions) the same way
|
||||
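The standardization recipe on this slide can be sketched roughly as follows. This is an illustration rather than code from the talk: the class name `RunningNormalizer`, the helper `scale_reward`, the epsilon values, and the batch-merge rule for the running variance are all assumptions.

```python
import numpy as np


class RunningNormalizer:
    """Tracks a running mean/variance and standardizes inputs (hypothetical helper)."""

    def __init__(self, shape, clip=10.0, eps=1e-8):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = 0
        self.clip = clip
        self.eps = eps

    def update(self, batch):
        """Merge a batch of observations (one per row) into the running statistics."""
        batch = np.asarray(batch, dtype=np.float64)
        n = batch.shape[0]
        b_mean = batch.mean(axis=0)
        b_var = batch.var(axis=0)
        total = self.count + n
        delta = b_mean - self.mean
        # Combine the old and new (mean, variance) pairs into one estimate.
        new_mean = self.mean + delta * n / total
        m2 = self.var * self.count + b_var * n + delta**2 * self.count * n / total
        self.mean, self.var, self.count = new_mean, m2 / total, total

    def normalize(self, x):
        """x' = clip((x - mean) / std, -clip, clip)."""
        z = (np.asarray(x, dtype=np.float64) - self.mean) / np.sqrt(self.var + self.eps)
        return np.clip(z, -self.clip, self.clip)


def scale_reward(r, running_std, eps=1e-8):
    """Rescale a reward by a running std without shifting its mean, so its sign is kept."""
    return r / (running_std + eps)
```

Typical use would be to call `update` on each new batch of observations before `normalize`, and to freeze the statistics at evaluation time.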
|
||||
Generally Important Parameters
- Discount
  - $\text{Return}_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$
  - Effective time horizon: $1 + \gamma + \gamma^2 + \cdots = 1/(1-\gamma)$
    - I.e., $\gamma = 0.99 \Rightarrow$ ignore rewards delayed by more than 100 timesteps
  - Low γ works well for well-shaped reward
  - In TD(λ) methods, can get away with high γ when λ < 1
- Action frequency
  - Solvable with human control (if possible)
  - View random exploration
|
||||
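To make the effective-horizon rule of thumb concrete, the sketch below computes 1/(1 − γ) and the weight γ^k that a reward delayed by k steps receives. The function names are illustrative and not from the slides.

```python
def effective_horizon(gamma):
    """Sum of the geometric series 1 + gamma + gamma^2 + ... = 1 / (1 - gamma)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    return 1.0 / (1.0 - gamma)


def discounted_return(rewards, gamma):
    """Return from the first step: r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    for gamma in (0.9, 0.99, 0.999):
        h = effective_horizon(gamma)
        # Weight given to a reward delayed by one full horizon.
        print(f"gamma={gamma}: horizon ~ {h:.0f} steps, weight at horizon = {gamma ** h:.3f}")
```

Note that "ignore" is approximate: a reward delayed by one full horizon still carries a weight of roughly 1/e, so the horizon is better read as a decay time scale than as a hard cutoff.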
|
||||
General RL Diagnostics
- Look at min/max/stdev of episode returns, along with mean
- Look at episode lengths: sometimes provides additional information
  - Solving problem faster, losing game slower
|
||||
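A small logging helper along these lines covers the diagnostics listed on this slide. The function name and dictionary keys are made up for illustration, and the numbers in the demo are invented.

```python
import statistics


def summarize_episodes(returns, lengths):
    """Summarize one batch of episodes: spread of returns plus episode lengths."""
    return {
        "return_mean": statistics.fmean(returns),
        "return_std": statistics.pstdev(returns),
        "return_min": min(returns),
        "return_max": max(returns),
        "length_mean": statistics.fmean(lengths),
        "length_min": min(lengths),
        "length_max": max(lengths),
    }


if __name__ == "__main__":
    # Invented numbers: the mean return barely moves, but episodes are getting longer,
    # which in a game like Pong can mean the agent is losing more slowly.
    stats = summarize_episodes(
        returns=[-21.0, -21.0, -19.0, -20.0],
        lengths=[760, 810, 1150, 905],
    )
    for key, value in stats.items():
        print(f"{key:>12}: {value:.2f}")
```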
|
||||
Policy Gradient Strategies
|
||||
|
||||
Entropy as Diagnostic
- Premature drop in policy entropy ⇒ no learning
- Alleviate by using entropy bonus or KL penalty
|
||||
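For a discrete-action policy, the entropy diagnostic can be computed directly from the action probabilities. The sketch below is one way to do it; the helper names, and the idea of reporting entropy as a fraction of the uniform policy's entropy, are assumptions rather than anything from the talk.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_fraction(probs):
    """Entropy relative to the uniform policy: 1.0 = uniform, 0.0 = deterministic."""
    n = len(probs)
    return categorical_entropy(probs) / math.log(n) if n > 1 else 0.0


if __name__ == "__main__":
    print(entropy_fraction([0.25, 0.25, 0.25, 0.25]))  # uniform policy
    print(entropy_fraction([0.97, 0.01, 0.01, 0.01]))  # nearly collapsed policy
```

An entropy bonus is commonly implemented by subtracting a small coefficient times the mean entropy from the policy loss; the coefficient is a hyperparameter to tune.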
|
||||
KL as Diagnostic
- Compute $\mathrm{KL}\left[\pi_{\text{old}}(\cdot \mid s),\ \pi(\cdot \mid s)\right]$
- KL spike ⇒ drastic loss of performance
- No learning progress might mean steps are too large
  - batchsize=100K converges to different result than batchsize=20K.
|
||||
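A minimal sketch of the KL diagnostic for a discrete-action policy, averaged over a batch of states. The function names and the small `eps` guard are assumptions; continuous (e.g. Gaussian) policies need the corresponding closed-form KL instead.

```python
import math


def categorical_kl(p_old, p_new, eps=1e-12):
    """KL[p_old || p_new] in nats for two discrete action distributions."""
    return sum(
        po * (math.log(po + eps) - math.log(pn + eps))
        for po, pn in zip(p_old, p_new)
        if po > 0.0
    )


def mean_kl(old_batch, new_batch):
    """Average KL over a batch of states (one action distribution per state)."""
    kls = [categorical_kl(po, pn) for po, pn in zip(old_batch, new_batch)]
    return sum(kls) / len(kls)
```

Logging `mean_kl` between the policy before and after each update, on the states from that update's batch, makes spikes easy to line up with drops in return.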
|
||||
Baseline Explained Variance
- explained variance $= 1 - \dfrac{\mathrm{Var}[\text{empirical return} - \text{predicted value}]}{\mathrm{Var}[\text{empirical return}]}$
|
||||
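The explained-variance diagnostic is short enough to write out. The sketch below uses the standard definition, 1 − Var[return − prediction] / Var[return]; the function name and the NaN convention for constant returns are assumptions.

```python
import numpy as np


def explained_variance(predicted_values, empirical_returns):
    """1 - Var[returns - predictions] / Var[returns].

    1.0 means a perfect value fit, 0.0 is no better than predicting a constant,
    and negative values are worse than predicting a constant.
    """
    y_pred = np.asarray(predicted_values, dtype=np.float64)
    y = np.asarray(empirical_returns, dtype=np.float64)
    var_y = y.var()
    if var_y == 0.0:
        return float("nan")  # undefined when all returns are identical
    return float(1.0 - (y - y_pred).var() / var_y)
```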
|
||||
Policy Initialization
- More important than in supervised learning: determines initial state visitation
- Zero or tiny final layer, to maximize entropy
|
||||
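The effect of a zero or tiny final layer on initial entropy can be checked numerically. The sketch below uses a single linear layer followed by a softmax as a stand-in for a policy network; the layer sizes, the "default" weight scale, and the 0.01 shrink factor are arbitrary assumptions.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; the max is subtracted for numerical stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def entropy(probs):
    """Mean entropy (in nats) over a batch of action distributions."""
    return float(-(probs * np.log(probs)).sum(axis=-1).mean())


rng = np.random.default_rng(0)
obs = rng.normal(size=(256, 8))  # fake batch of observations
n_actions = 4

w_default = rng.normal(scale=1.0, size=(8, n_actions))  # unscaled final layer (assumed scale)
w_tiny = 0.01 * w_default                               # same weights, shrunk
w_zero = np.zeros((8, n_actions))                       # zero final layer

for name, w in [("default", w_default), ("tiny", w_tiny), ("zero", w_zero)]:
    h = entropy(softmax(obs @ w))
    print(f"{name:>8}: mean entropy = {h:.4f} (max = {np.log(n_actions):.4f})")
```

With zero weights every logit is equal, so the policy starts exactly uniform; the tiny-weight version starts close to uniform while still breaking symmetry.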
|
||||
Q-Learning Strategies
- Optimize memory usage carefully: you’ll need it for replay buffer
- Learning rate schedules
- Exploration schedules
- Be patient. DQN converges slowly
  - On Atari, often 10-40M frames to get policy much better than random

Thanks to Szymon Sidor for suggestions
|
||||
|
||||
Miscellaneous Advice
- Read older textbooks and theses, not just conference papers
- Don’t get stuck on problems; can’t solve everything at once
  - Exploration problems like cart-pole swing-up
  - DQN on Atari vs CartPole
|
||||
|
||||
Thanks!
|
||||
File diff suppressed because one or more lines are too long
@@ -0,0 +1,15 @@
|
||||
Source: https://old.reddit.com/r/reinforcementlearning/comments/75m5vd/
|
||||
Title: Deep RL Bootcamp 2017 - Slides and Talks
|
||||
Fetched-via: Reddit JSON API (limit=500, depth=10)
|
||||
Fetch-status: verbatim
|
||||
|
||||
# Deep RL Bootcamp 2017 - Slides and Talks
|
||||
|
||||
**Posted by:** u/gwern | Score: 8 | 1 comments
|
||||
|
||||
Link: https://sites.google.com/view/deep-rl-bootcamp/lectures
|
||||
|
||||
## Comments
|
||||
|
||||
**u/obsoletelearner** (score: 2):
|
||||
thank you!
|
||||
@@ -0,0 +1,18 @@
|
||||
Source: https://old.reddit.com/r/reinforcementlearning/comments/6vcvu1/
|
||||
Title: ICML 2017 Tutorial slides (Levine & Finn): Deep Reinforcement Learning, Decision Making, and Control
|
||||
Fetched-via: Reddit JSON API (limit=500, depth=10)
|
||||
Fetch-status: verbatim
|
||||
|
||||
# ICML 2017 Tutorial slides (Levine & Finn): Deep Reinforcement Learning, Decision Making, and Control
|
||||
|
||||
**Posted by:** u/gwern | Score: 12 | 1 comments
|
||||
|
||||
Link: https://sites.google.com/view/icml17deeprl
|
||||
|
||||
## Comments
|
||||
|
||||
**u/[deleted]** (score: 4):
|
||||
[deleted]
|
||||
|
||||
**u/cbfinn** (score: 1):
|
||||
Videos of ICML tutorials (as well as conference talks) will be posted by the conference staff at some point. Though, typically they take quite a while to be released.
|
||||
@@ -0,0 +1,78 @@
|
||||
Source: https://old.reddit.com/r/reinforcementlearning/comments/9sh77q/
|
||||
Title: What are your best tips for debugging RL problems?
|
||||
Fetched-via: Reddit JSON API (limit=500, depth=10)
|
||||
Fetch-status: verbatim
|
||||
|
||||
# What are your best tips for debugging RL problems?
|
||||
|
||||
**Posted by:** u/GrundleMoof | Score: 21 | 8 comments
|
||||
|
||||
I've done a few RL toy problems, but I'm still pretty new to the field. In each of the problems I've done, there has been some point where it seems like I've implemented everything correctly, the environment is working correctly, etc, but it's still just not working, or is, with some really strange problem.
|
||||
|
||||
RL seems to be harder to debug than any other type of programming I've done before. There's an element of randomness usually. It often takes a while (in the run) for the problem to manifest, so it's hard to pinpoint exactly *where* something is going wrong. Lastly, stuff just takes a while to even run, so my "attempt solution/code/evaluate" loop takes a long time, which makes it even harder.
|
||||
|
||||
Does anyone have any tips? The things I've figured out so far are to log everything feasible, and to try to isolate things to find the problem, but those are pretty general tips. I've found some help a few times from reading relevant papers, but that's rarer.
|
||||
|
||||
Do any experts in the field have any tips?
|
||||
|
||||
## Comments
|
||||
|
||||
**u/marcin_gumer** (score: 18):
|
||||
Hi
|
||||
|
||||
This is exactly what I struggled with for a long time (and still do). RL agent modules are closely interconnected. No matter which module has an issue (neural net, Bellman backups, memory buffer, environment, pre-processing), it immediately affects all the other modules by feeding them bad data. From the outside it looks like one big gooey mess.
|
||||
|
||||
First, I'm not an RL expert, sorry if my advice sounds basic. Couple of things I have learned so far:
|
||||
|
||||
* RL is very difficult to debug, especially when neural nets are involved
|
||||
* DO NOT "try stuff" and run to "see if it works" - this approach doesn't work in RL - too many things need to happen exactly right to see any learning at all
|
||||
* RL agent modules implementation - this is just good programming practices, but even more important in RL:
|
||||
* most modules can be tested independently. Environment, neural net, RL backups, and memory replay buffers can all be tested in isolation.
|
||||
* I try to unit test everything, usually unit tests take more code than what they test
|
||||
* I try to put asserts absolutely everywhere, input matrix dimensions (1d array may broadcast differently than 2d array etc.), input/output ranges (state/actions valid?), output matrix dimensions. Input/output data types (in Python np.ndarray behaves differently than np.matrix in some cases)
|
||||
* Agent modules integration - generally stepping through code at least once after every change to confirm it is doing what I think it should be doing. It's a bit like programming a bomb detonator or something. Really make sure it is working correctly *before* running long experiment.
|
||||
* Visualise as much as possible, log absolutely everything
|
||||
* record and display agent observations/actions/rewards
|
||||
* rewards should have some variance, if all rewards are always equal (e.g. always 0), then there is nothing to learn, it's environment or exploration issue
|
||||
* record and display current q-function approximation across whole state space (works only on simple tabular problems and 2d continuous state spaces).
|
||||
* pick couple states (can pick by running random policy) and plot predicted q-values over time for these states. q-values should change and stabilise.
|
||||
* record inputs/outputs to/from every module (environment, neural net, memory buffer, etc.)
|
||||
* neural networks are making everything 10x more difficult - I try to make agent work with linear approximator first (on small problem like mountain car), then when I know everything else is working, swap in neural net and try bigger problem.
|
||||
* with neural nets, one can record gradients, individual neuron activations etc to evaluate if neural net is learning over time, even without access to loss function. Debugging neural networks:
|
||||
* [https://www.deeplearningbook.org/](https://www.deeplearningbook.org/) "Practical Methodology" chapter
|
||||
* [Understanding the difficulty of training deep feedforward neural networks](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
|
||||
* [https://cs231n.github.io/neural-networks-3/](https://cs231n.github.io/neural-networks-3/#baby)
|
||||
* [https://deeplearning4j.org/docs/latest/deeplearning4j-nn-visualization](https://deeplearning4j.org/docs/latest/deeplearning4j-nn-visualization)
|
||||
* Sometimes it helps to freeze the random seed everywhere (numpy, tensorflow, python hashing, gym) and force single-threaded CPU execution (to remove randomness from concurrent execution). This way you can reproduce runs exactly, to full floating-point precision, and debug where those NaNs came from or why the q-values exploded to infinity.
|
||||
* I would be careful with reference implementations. Some work on hacked environments that are much easier than normal (seen it couple times in blog posts). Or have some "weird" reward function engineering. Or use older version of 3rd party library with easier version of environment.
|
||||
* Try hyper parameters from reference implementation.
|
||||
|
||||
Hope this helps!
|
||||
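The seed-freezing tip in the comment above could look something like the sketch below. It only covers Python's `random`, NumPy, and the hash seed; the comment also mentions tensorflow and gym, which need their own seeding calls and are left out here to keep the example dependency-light. The names `seed_everything` and `rollout` are hypothetical.

```python
import os
import random

import numpy as np


def seed_everything(seed):
    """Seed the RNGs this sketch knows about; extend for your framework and environment."""
    random.seed(seed)
    np.random.seed(seed)
    # PYTHONHASHSEED only takes effect if set before the interpreter starts;
    # setting it here affects child processes, not the current one.
    os.environ["PYTHONHASHSEED"] = str(seed)
    return np.random.default_rng(seed)


def rollout(rng, steps=5):
    """Stand-in for an experiment: all randomness is drawn from the seeded generator."""
    return [float(rng.normal()) for _ in range(steps)]


if __name__ == "__main__":
    first = rollout(seed_everything(0))
    second = rollout(seed_everything(0))
    print("reproducible:", first == second)
```

Comparing two runs with the same seed, as in the demo, is a cheap check that no unseeded source of randomness is left.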
|
||||
**u/AlexanderYau** (score: 1):
|
||||
Wow, a lot of experience on debugging RL, I can't agree more on "DO NOT "try stuff" and run to "see if it works"". Have you ever published any paper on RL?
|
||||
|
||||
**u/marcin_gumer** (score: 1):
|
||||
I wouldn't say a lot of experience, just figuring things out as I go. I haven't published any RL paper. Currently just building portfolio on my [github.com/marcinbogdanski](https://github.com/marcinbogdanski). But there is not that much there yet, I implemented some algorithms from Sutton & Barto, currently working on DQN and Atari games, but it will take some time.
|
||||
|
||||
**u/p-morais** (score: 7):
|
||||
Here’s a good talk by John Schulman on just this:
|
||||
|
||||
https://m.youtube.com/watch?v=8EcdaCk9KaQ
|
||||
|
||||
**u/WhichPressure** (score: 5):
|
||||
I think anyone who touches RL has the same problem as you! Me too :P This guy wrote a nice article about his adventure with an RL side project, and he gave some tips. Maybe it'll be helpful for you:
|
||||
[http://amid.fish/reproducing-deep-rl?fbclid=IwAR1VPZm3FSTrV8BZ4UdFc2ExZy0olusmaewmloTPhpA4QOnHKRI2LLOz3mM](http://amid.fish/reproducing-deep-rl?fbclid=IwAR1VPZm3FSTrV8BZ4UdFc2ExZy0olusmaewmloTPhpA4QOnHKRI2LLOz3mM)
|
||||
|
||||
**u/lmericle** (score: 4):
|
||||
It's true, there's a lot of moving parts. I'm no expert but lately I've been experiencing similar setbacks.
|
||||
|
||||
Typically I find the first place to look is the hyperparameters, and then to consider which particular optimization algorithm you're using and how it might explore the space/optimize toward suboptimal behavior. Next I'd consider the reward function and the interplay between that and the exploration/exploitation behavior of the optimization algorithm. Finally, consider where stochasticity is introduced into the problem -- perhaps there's too much, or not enough, or the stochasticity prevents convergence due to inadequate penalty terms (e.g. low entropy coefficient in PPO).
|
||||
|
||||
**u/wassname** (score: 2):
|
||||
Also checkout [this](https://old.reddit.com/r/reinforcementlearning/comments/7s8px9/deep_reinforcement_learning_practical_tips/) previous discussion.
|
||||
|
||||
**u/[deleted]** (score: 1):
|
||||
I usually just plot every metric, every layer weight on tensorboard and look for anomalies.
|
||||
|
||||
|
||||
@@ -0,0 +1,110 @@
|
||||
Source: https://old.reddit.com/r/reinforcementlearning/comments/7s8px9/deep_reinforcement_learning_practical_tips/
|
||||
Title: Deep Reinforcement Learning practical tips
|
||||
Fetched-via: browser paste (user)
|
||||
Fetch-status: verbatim
|
||||
|
||||
# Deep Reinforcement Learning practical tips
|
||||
|
||||
submitted 8 years ago by grupiotr | 14 points (90% upvoted) | 13 comments
|
||||
|
||||
I would be particularly grateful for pointers to things you don't seem to be able to find in papers. Examples include:
|
||||
|
||||
- How to choose learning rate?
|
||||
- Problems that work surprisingly well with high learning rates
|
||||
- Problems that require surprisingly low learning rates
|
||||
- Unhealthy-looking learning curves and what to do about them
|
||||
- Q estimators deciding to always give low scores to a subset of actions effectively limiting their search space
|
||||
- How to choose decay rate depending on the problem?
|
||||
- How to design reward function? Rescale? If so, linearly or non-linearly? Introduce/remove bias?
|
||||
- What to do when learning seems very inconsistent between runs?
|
||||
- In general, how to estimate how low one should be expecting the loss to get?
|
||||
- How to tell whether my learning rate is too low and I'm learning very slowly, or too high and the loss cannot be decreased further?
|
||||
|
||||
## Comments
|
||||
|
||||
**u/wassname** (11 points):
|
||||
|
||||
Resources: I found these very useful
|
||||
|
||||
- Deep RL Bootcamp Lecture 6: Nuts and Bolts of Deep RL Experimentation (slides) and a written summary
|
||||
- The 3 NIPS2017 Learning to run write ups contain practical advice from a competition
|
||||
- Lessons Learned Reproducing a Deep Reinforcement Learning Paper
|
||||
- Deep Reinforcement Learning that Matters - this gives you an idea of what does and doesn't matter
|
||||
- Deep Reinforcement Learning Doesn't Work Yet (at least as well as the hype suggests)
|
||||
- General deep learning tips from Slav Ivanov
|
||||
|
||||
Lessons learnt:
|
||||
|
||||
- log everything with tensorboard/tensorboardX: policy and critic losses, advantages, ratio, actions (mean and std), states, noise. Check values, check losses are decreasing etc.
|
||||
- keep track of experiments with an experiments log (git commit messages with non-committed data or logs stored by date)
|
||||
- clip and clamp: mistakes not obvious as they can cause values to blow up instead of NaN
|
||||
- clamp all values, logarithmic values: `logvalue.clamp(np.log(1e-5), -np.log(1e-5))`
|
||||
- watch out for dividing by a value: `1/std` should be `1/(std+eps)` where `eps=1e-5`
|
||||
- clip gradients: `grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 20)`, then log grad norm
|
||||
- normalise everything: use running norms for state and reward; layer norms help
|
||||
- check everything: plot and sanity check as many values as possible. Check initial outputs, inits, distributions, action range.
|
||||
- think about step-size/sampling-rate: RL is sensitive to it (action repeat, frame skipping). Papers found skipping 4 Atari frames helped, repeating 4 actions in "Learning to Run" helped.
|
||||
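The clamp, epsilon, and gradient-clipping tips in the list above can be sketched with NumPy stand-ins, as below. This is not the commenter's code: the helper names are hypothetical, and in PyTorch the clipping step would normally use `torch.nn.utils.clip_grad_norm_` rather than a hand-written function.

```python
import numpy as np

EPS = 1e-5


def clamp_log(log_value, floor=1e-5):
    """Keep log-values inside [log(floor), -log(floor)], about [-11.5, 11.5] for 1e-5."""
    lo, hi = np.log(floor), -np.log(floor)
    return np.clip(log_value, lo, hi)


def safe_standardize(x):
    """Divide by (std + eps) so a constant batch yields zeros instead of inf or NaN."""
    x = np.asarray(x, dtype=np.float64)
    return (x - x.mean()) / (x.std() + EPS)


def clip_grad_norm(grads, max_norm):
    """Scale a list of gradient arrays so their global L2 norm is at most max_norm.

    Returns the clipped gradients and the pre-clipping norm, which is worth logging.
    """
    total_norm = float(np.sqrt(sum(float((g ** 2).sum()) for g in grads)))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm
```

Logging the pre-clipping norm is the useful part: a norm that sits far above the clipping threshold for long stretches suggests the threshold, not the learning rate, is setting the step size.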
|
||||
Curves:
|
||||
|
||||
- in PPO the std should decrease as it learns
|
||||
- in actor-critic the critic loss should start converging then the actor loss follows
|
||||
- watch for local minima where it outputs a constant action
|
||||
- watch gradients for actor and critic; if much lower than 20 or much larger than 100 often run into problems (20 and 40 are where projects often clip gradient norm)
|
||||
- run on CartPole and log same curves to see what healthy looks like
|
||||
|
||||
Reward:
|
||||
|
||||
- It's not the scaling factor that matters but the final value. Papers have gotten good results with rewards between 100-1000.
|
||||
|
||||
Learning rate:
|
||||
|
||||
- Use decaying learning rates, watch loss curves to see when they begin to converge.
|
||||
- loss_actor will often initially increase while the critic is doing its initial learning (value function is a moving target). Focus on making the critic learning rate work first.
|
||||
- Critic learning rates are often set higher, with larger batches.
|
||||
- Use cyclical learning rate trick: slowly increase LR to find the min where model learns and max where it still converges.
|
||||
|
||||
My own questions:
|
||||
|
||||
- How do you know if you've set exploration/variance too high or low?
|
||||
- Should you use a multi-headed actor/critic? Or separate networks?
|
||||
|
||||
"What to do when learning seems very inconsistent between runs?" - This could be an init issue. Try to init so it defaults to reasonable action values even before training.
|
||||
|
||||
---
|
||||
|
||||
**u/gwern** (8 points):
|
||||
|
||||
I've seen similar engineering details & folklore, but mostly in slides/talks:
|
||||
- https://www.reddit.com/r/reinforcementlearning/comments/6vcvu1/icml_2017_tutorial_slides_levine_finn_deep/
|
||||
- https://www.reddit.com/r/reinforcementlearning/comments/75m5vd/deep_rl_bootcamp_2017_slides_and_talks/
|
||||
- https://www.reddit.com/r/reinforcementlearning/comments/5i67zh/deep_reinforcement_learning_through_policy/
|
||||
- https://www.reddit.com/r/reinforcementlearning/comments/5hereu/the_nuts_and_bolts_of_deep_rl_research_schulman/
|
||||
|
||||
**u/twkillian** (1 point): I was about to post John Schulman's talk here as well. Great resource.
|
||||
|
||||
**u/wassname** (1 point): Summarising the ones I hadn't seen:
|
||||
- 5i67zh: fix random seed to reduce variance; think about step-size/sampling-rate; RL sensitive to optimizer choice (SGD, Adam)
|
||||
- 6vcvu1: slides focused more on algorithm choice/design, not application tips
|
||||
|
||||
---
|
||||
|
||||
**u/grupiotr** [OP] (5 points):
|
||||
|
||||
John Schulman's talk wins, particularly:
|
||||
|
||||
- rescaling observations, rewards, targets and prediction targets
|
||||
- using big replay buffers, bigger batch size and generally more iterations to start with
|
||||
- always starting with a simple version of the task to get signs of life
|
||||
|
||||
---
|
||||
|
||||
**u/Kaixhin** (2 points):
|
||||
|
||||
My first bit of advice is actually don't do RL. If the answer is still yes, find some other useful task for the network to do, like predicting something. Get supervised gradients flowing through your network. Training end-to-end on purely an RL signal is impressive, but adding easier learning signals can potentially help a lot.
|
||||
|
||||
---
|
||||
|
||||
**u/grupiotr** [OP] (1 point):
|
||||
|
||||
What turned out to be the game-changer (made my RL agents actually learn something) was **rescaling the reward from [-1, 1] to [0, 1]**. Thanks again to everyone that contributed!
|
||||
@@ -0,0 +1,197 @@
|
||||
Source: https://old.reddit.com/r/reinforcementlearning/comments/bzg3l2/
|
||||
Title: How to *more intelligently* debug RL roadblocks?
|
||||
Fetched-via: Reddit JSON API (limit=500, depth=10)
|
||||
Fetch-status: verbatim
|
||||
|
||||
# How to *more intelligently* debug RL roadblocks?
|
||||
|
||||
**Posted by:** u/GrundleMoof | Score: 4 | 7 comments
|
||||
|
||||
A while ago I [made this post](https://www.reddit.com/r/reinforcementlearning/comments/9sh77q/what_are_your_best_tips_for_debugging_rl_problems/) asking for tips on debugging when you run into a problem with RL.
|
||||
|
||||
However, I think the majority of the advice can be summed up with:
|
||||
|
||||
1) Test bits individually to make sure they're doing what they should
|
||||
|
||||
2) Don't go down a rabbit hole of fiddling with hyperparameters
|
||||
|
||||
3) Log/record/display everything, and "look for things that are acting funny"
|
||||
|
||||
|
||||
and I just want to be clear that I'm not disparaging that advice, it's actually really good, I'm thankful, and I know I'm asking a tricky, general question!
|
||||
|
||||
But I want to get to the "next level". I think I know the theory well enough, and I've successfully done a few toy problems, but I'm still here banging my head against the wall.
|
||||
|
||||
I'll take a practical example I'm struggling with now: gym's `Pendulum-v0`, which has a continuous action space of [-2, 2], and three state variables (`(cos(theta), sin(theta), theta_dot)`). I'm trying to solve it with a fairly simple AC setup and PyTorch. I'm using the RMSprop optimizer, and 2 (or 3) fully connected NN layers, with 50 (or 100) units in each layer, to approximate pi (the policy) and V (the value function/baseline).
|
||||
|
||||
To select the actions, like in [the A3C paper](https://arxiv.org/pdf/1602.01783.pdf), I have the pi NN have two outputs, mu and sd2 (the standard deviation squared). Every time step, I select an action `a` from a normal distribution with that mu and `sd**2`. Then, I calculate that `pi(a)` (just from the equation of a normal dist. with that mu, `sd**2`), and iterate the agent to get the reward from that time step.
|
||||
|
||||
Also like the A3C paper (for the Pendulum problem), I'm doing all the updates at once, at the end of each episode (so it's basically MC with V as the baseline). For each time step (after the episode) I accumulate the rewards from t to t_max as `r_accum` (with gamma = 0.99), then say `V_loss = (r_accum - V_list).pow(2).sum()`. For the policy gradient, I do `policy_loss = -(torch.log(pi_list)*(r_accum - V_list)).sum()`, and then zero grads, backwards the losses, step the optimizer, etc.
|
||||
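For reference, the end-of-episode update described in the paragraph above can be written out numerically as below. This is a NumPy sketch of the stated formulas with no gradients involved, not the poster's code; the function names are assumptions.

```python
import numpy as np


def rewards_to_go(rewards, gamma):
    """r_accum[t] = r_t + gamma * r_{t+1} + ... up to the end of the episode."""
    out = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def episode_losses(rewards, values, log_probs, gamma=0.99):
    """Loss values for one episode, mirroring the formulas in the post.

    In an autograd framework the advantage in the policy term is normally
    treated as a constant (for example by detaching it) so that the policy
    loss does not push gradients into the value estimate.
    """
    r_accum = rewards_to_go(rewards, gamma)
    advantage = r_accum - np.asarray(values, dtype=np.float64)
    value_loss = float((advantage ** 2).sum())
    policy_loss = float(-(np.asarray(log_probs, dtype=np.float64) * advantage).sum())
    return r_accum, value_loss, policy_loss
```

Checking a tiny hand-computable episode against a function like this is one way to confirm that the reward accumulation and the sign of the policy loss match the intended formulas.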
|
||||
And I'm just not seeing any learning, going up to about 20k episodes. I'm plotting to TensorBoard (losses, rewards, weights, biases, gradients), but nothing is striking me as an obvious culprit. It gets varying rewards, the V_loss seems to decrease to 0, and the policy_loss usually kind of wanders but eventually goes to 0 (I think because it's also proportional to (r_accum - V_list) which is also going to 0).
|
||||
|
||||
But I think this is a perfect learning example. This is doable (...right?), it seems mostly correctly set up, and it's probably a fairly simple fix if I knew how to diagnose it. For the more experienced RL'ers out there, where would you start? What would you look at? What would you verify is working correctly?
|
||||
|
||||
Here are some of my guesses/notes:
|
||||
|
||||
* I haven't actually seen any straightforward implementations of a vanilla PG algo solving Pendulum-v0. In the A3C paper, they add an LSTM to it. There are a bunch of DDPG papers online, but that's a pretty different story. I found one A3C that doesn't seem to have an LSTM, so I'll check that out.
|
||||
* Do I need experience replay? Maybe the variance is just too high using essentially REINFORCE with this problem, so I need to be getting much better data efficiency (or running it for a ton longer) ?
|
||||
* I was worried that maybe it was never actually getting to positions where it could get a high enough reward (to "motivate" it to reach those positions), but I plotted some trajectories and it's definitely getting up to the top (by swinging wildly anyway), where R = 0, so it's definitely experiencing them.
|
||||
|
||||
Things I've tried (but maybe not systematically enough):
|
||||
|
||||
* Different initial LRs
|
||||
* Different optimizers
|
||||
* Different number of hidden layers/units
|
||||
* Shared pi/V NN body (with diff output layers) vs not
|
||||
* Changing amount of entropy
|
||||
* Adding correlated noise
|
||||
* Using TD residual instead of MC version
|
||||
* Clipping the gradient
|
||||
* Different gamma values
|
||||
|
||||
Anyway, I'd love it if anyone has any more general advice for how to think about and go about solving RL problems. I of course want to solve this one, but I want a more general way of thinking.
|
||||
|
||||
## Comments
|
||||
|
||||
**u/i_do_floss** (score: 3):
|
||||
I dont have the answer for you, but I had an algorithm that was stuck on pendulum for a while and these eventually ended up being the issues:
|
||||
|
||||
1. The environment is wrapped with a wrapper that kills the environment after 200 steps. I was ignoring that so I could use 1024 steps. So I ignored the "done" / "is-terminal" variable but I forgot to exclude it from my stored memories in my memory buffer so my updates were all wrong.
|
||||
|
||||
I decided observing the predicted value and seeing if it was crazy (high variance) may be an indicator of an issue.
|
||||
|
||||
Also I could use tensorboard and visualize runtime information so I could see what was going into my placeholders.
|
||||
|
||||
2. My Q-value was shaped (None, 1), and my placeholders for rewards/terminals were shaped (None,). When I combined those in a tensor op, I ended up with a tensor shaped (None, None), which didn't do what I expected.
|
||||
|
||||
I decided I could mitigate that type of issue in the future by writing my expected shapes of the networks in a notebook and checking if they match afterward using tensorboard. Some people use assert shape functions.
|
||||
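The shape mismatch described in point 2 is easy to reproduce. The sketch below uses NumPy rather than the commenter's TensorFlow placeholders, and the function name `td_error` is made up; it shows the silent broadcast and one way to guard against it with shape asserts.

```python
import numpy as np


def td_error(q_values, targets):
    """Elementwise TD error; refuses inputs whose shapes would broadcast silently."""
    assert q_values.ndim == 1, f"expected shape (batch,), got {q_values.shape}"
    assert q_values.shape == targets.shape, f"{q_values.shape} vs {targets.shape}"
    return targets - q_values


batch = 4
q = np.zeros((batch, 1))   # network output with a trailing unit axis
rewards = np.ones(batch)   # rewards stored as a flat vector

bad = rewards - q          # broadcasts to (batch, batch) without raising an error
print(bad.shape)           # (4, 4)

good = td_error(q.squeeze(-1), rewards)
print(good.shape)          # (4,)
```

The broadcast result still reduces to a scalar loss without complaint, which is why this kind of bug tends to show up as "no learning" rather than as an exception.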
|
||||
Also, just so you know, I'm training soft actor critic in about 20 episodes of length 1024. I don't think you should wait for 1000s of episodes.
|
||||
|
||||
Pendulum v0 is an easy environment for your algorithm to learn. I suggest sticking with these hyper parameters. If they don't work, it's probably your algorithm.
|
||||
|
||||
policy network size: [64, 64]
|
||||
batch size: 256
|
||||
gamma: 0.99
|
||||
adam optimizer
|
||||
relu network activations (on every layer except the last one which has no activation)
|
||||
|
||||
Lastly, make sure your action space allows your algorithm to output actions in the space of -2 to 2.
|
||||
|
||||
**u/GrundleMoof** (score: 1):
|
||||
> The environment is wrapped with a wrapper that kills the environment after 200 steps. I was ignoring that so I could use 1024 steps. So I ignored the "done" / "is-terminal" variable but I forgot to exclude it from my stored memories in my memory buffer so my updates were all wrong.
|
||||
|
||||
So I currently have my agent as a wrapper for the gym env, and it returns a tuple of (reward, state_next, done), and I break on done.
|
||||
|
||||
> I decided observing the predicted value and seeing if it was crazy (high variance) may be an indicator of an issue.
|
||||
|
||||
Hmmm, by value, you mean the value function? And do you mean variance across different states, or the same state over time?
|
||||
|
||||
> Also I could use tensorboard and visualize runtime information so I could see what was going into my placeholders.
|
||||
>
|
||||
> My Q-value was shaped (None, 1), and my placeholders for rewards/terminals were shaped (None,). When I combined those in a tensor op, I ended up with a tensor shaped (None, None), which didn't do what I expected.
|
||||
> I decided I could mitigate that type of issue in the future by writing my expected shapes of the networks in a notebook and checking if they match afterward using tensorboard. Some people use assert shape functions.
|
||||
|
||||
ahh yeah that's some good advice. I actually got burned by that earlier in this project, but figured it out by printing the sizes. PyTorch is a little tricky in that it will accept multiplying tensors of various combinations of sizes, with different results... so I should probably do asserts from now on.
|
||||
|
||||
> Also, just so you know, I'm training soft actor critic in about 20 episodes of length 1024. I don't think you should wait for 1000s of episodes.
|
||||
|
||||
Hmm, so right now I'm trying a pretty simple setup, just a policy gradient with a value function. I don't know much about SAC, but it seems more advanced.
|
||||
|
||||
I was starting to get skeptical whether this setup could even learn a continuous action space problem like Pendulum-v0, because when I searched for stuff, almost everything I found was using at least DDPG or more complex. But then I found [this guy's project](https://github.com/MorvanZhou/pytorch-A3C), just A3C, and it solves it pretty quickly and reliably.
|
||||
|
||||
I started going through his code and it's nearly exactly the same as mine. I thought that it's possible that using 4 workers has a "decorrelating" effect (like experience replay), so I changed his code to drop it to 1 worker, and it still works! So it's clearly something else and I haven't figured it out yet. It's so similar to mine though, both in terms of setup and hyperparameters...
|
||||
|
||||
> Pendulum v0 is an easy environment for your algorithm to learn. I suggest sticking with these hyper parameters. If they don't work, it's probably your algorithm.
|
||||
>
|
||||
> policy network size: [64, 64] batch size: 256 gamma: 0.99 adam optimizer relu network activations (on every layer except the last one which has no activation)
|
||||
|
||||
You mean, two hidden layers of size 64 each? And are you outputting a value function too?
|
||||
|
||||
So, maybe I'm missing something here -- do you mean batches of episodes, or batches of steps? I'm using gamma = 0.9 or 0.99. I've tried Adam and RMSprop, no success with either... I'm using tanh activations, but that probably shouldn't change anything significantly, right?
|
||||
|
||||
> Lastly, make sure your action space allows your algorithm to output actions in the space of -2 to 2.
|
||||
|
||||
Yeah, my policy outputs a mu and sigma. The mu output is 2*tanh, so it's mapped to -2, 2, and the sigma one (actually sigma^2 ) is put through a softplus output.
|
||||
|
||||
**u/i_do_floss** (score: 1):
|
||||
Yes two hidden layers with 64 nodes. The value function is a third layer basically.
|
||||
|
||||
Tanh on last layer makes sense for policy. What are you using on hidden layers and value function final layer?
|
||||
|
||||
Also, have you tried different reward scales?
|
||||
|
||||
**u/GrundleMoof** (score: 1):
|
||||
Hi again, sorry for the delay! I was traveling with no service...
|
||||
|
||||
I've tried a few different topologies. Right now I'm doing this:
|
||||
|
||||
```python
# Layers as defined in the module's __init__ (assumes `import torch.nn as nn`)
self.actor_lin1 = nn.Linear(3, 200)
self.mu = nn.Linear(200, 1)
self.sigma = nn.Linear(200, 1)
self.critic_lin1 = nn.Linear(3, 100)
self.v = nn.Linear(100, 1)
```
|
||||
|
||||
and for my forward():
|
||||
|
||||
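```python
# Critic branch: state value
y = torch.tanh(self.critic_lin1(x))
v = self.v(y)

# Actor branch: mean squashed to [-2, 2], variance kept strictly positive
z = torch.tanh(self.actor_lin1(x))
mu = 2 * torch.tanh(self.mu(z))
# Assumes `from torch.nn.functional import softplus`
sd2 = softplus(self.sigma(z)) + 0.001
return v, (mu, sd2)
```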
|
||||
|
||||
So I'm using tanh() for the nonlinearities as well. I'm adding that 0.001 to the sd2 because it keeps it from getting too small (which should be enforced by the entropy term anyway) and I've seen it done in a few formulations of this.
|
||||
|
||||
I also tried with having the mu/sigma layers combined into a nn.Linear(200, 2) layer (which should be functionally the same I think), as well as having the mu/sigma and v outputs share the first nn.Linear(3, 200) layer before splitting off (which is different, the shared head thing, but I've used elsewhere and seen people use).
|
||||
|
||||
I'm scaling the rewards in a way I've seen a bunch of other people do. Since the reward each step has the range [-16, 0], I'm normalizing it by doing (r + 8.0)/8.0, which should put it about in the range [-1, 1].
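
A quick check of that rescaling, taking the [-16, 0] range stated above as given (the function name is only illustrative):

```python
def scale_reward(r):
    """Affine rescaling from the comment above: (r + 8) / 8."""
    return (r + 8.0) / 8.0


for r in (-16.0, -8.0, 0.0):
    print(r, "->", scale_reward(r))
```

The endpoints map to -1 and 1 as intended. Note that this transform shifts the reward as well as scaling it, so the sign of a typical per-step reward changes.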
|
||||
|
||||
At this point I'm basically trying to replicate the guy's A3C implementation from above (minus the multiple workers part, but I ran his with 1 worker and it reliably improves every time). Mine *does* seem to improve, but really slowly compared to his, and sometimes seems to get worse after a while. Like, it's not not improving *at all*, just very slowly and also not reliably, which means something must be off.
|
||||
|
||||
**u/i_do_floss** (score: 1):
|
||||
tanh activations are really sensitive to the weights and bias initialization. Is he using tanh activations?
|
||||
|
||||
tanh makes sense to me for the actor output. But I would probably use relu for the nonlinearities so the initialization is easier.
|
||||
|
||||
tanh starts to experience issues when the inputs and outputs are too big. (bigger than like -.6 and 6.)
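
To see the saturation concretely, here is a small sketch that prints tanh and its derivative, 1 - tanh(x)^2, at a few input magnitudes. The sample points are arbitrary choices for illustration:

```python
import math


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


for x in (0.0, 0.5, 1.0, 2.0, 4.0, 6.0):
    print(f"x={x:>4}: tanh={math.tanh(x):.4f}  grad={tanh_grad(x):.6f}")
```

The gradient is 1 at zero and shrinks quickly as the pre-activation grows, which is why large initial weights or unscaled inputs slow learning with tanh units.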
|
||||
|
||||
**u/GrundleMoof** (score: 1):
|
||||
by the way, just to give an example:
|
||||
|
||||
[Here's an image of the reward per episode using his code](https://imgur.com/9K2clLs)
|
||||
|
||||
[and here's mine](https://imgur.com/IMy6Rb5)
|
||||
|
||||
Both with moving averages shown, to smooth it out.
|
||||
|
||||
You can see that mine *does* improve, up to about episode 2000, but then gets worse. It pretty consistently does that. His on the other hand, always improves and stays good.
|
||||
|
||||
To me, that indicates that it's almost there, but something's going on with the optimizer or something, like maybe it becomes unstable or something. But I'm using the same LR he is (2e-4), and I've tried both Adam (like him) and RMSprop.
|
||||
|
||||
**u/i_do_floss** (score: 3):
|
||||
Also, this only applies to pendulum v0, but it's a great first environment for spinning up an algorithm because you can graph the entire state space on a 2 dimensional plane (as an image). I graph my policy / q assessments in polar coordinates where radius is the velocity and theta is the angle of the pendulum.
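
A sketch of how such a state-space sweep could be set up, assuming the standard Pendulum-v0 observation layout (cos θ, sin θ, angular velocity) with velocity limited to [-8, 8]. This is not the linked code: `pendulum_grid`, `evaluate_on_grid`, and `toy_value` are illustrative names, and `toy_value` stands in for a learned policy or Q function.

```python
import math

MAX_SPEED = 8.0  # assumed velocity limit of Pendulum-v0


def pendulum_grid(n_theta=36, n_vel=17):
    """Yield (theta, velocity, observation) covering the whole state space."""
    for i in range(n_theta):
        theta = -math.pi + 2.0 * math.pi * i / n_theta
        for j in range(n_vel):
            vel = -MAX_SPEED + 2.0 * MAX_SPEED * j / (n_vel - 1)
            obs = (math.cos(theta), math.sin(theta), vel)
            yield theta, vel, obs


def evaluate_on_grid(fn, **grid_kwargs):
    """Apply fn(observation) at every grid point; returns (theta, vel, value) rows."""
    return [(t, v, fn(obs)) for t, v, obs in pendulum_grid(**grid_kwargs)]


def toy_value(obs):
    """Stand-in critic: highest when the pendulum is upright and slow."""
    cos_t, _sin_t, vel = obs
    return cos_t - 0.1 * vel ** 2


rows = evaluate_on_grid(toy_value)
best = max(rows, key=lambda row: row[2])
print(len(rows), best)
```

Each row can then be drawn on a polar axis, with the angle as theta and the (shifted) velocity as radius, and re-rendered every few updates to watch the function evolve.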
|
||||
|
||||
|
||||
Here's a link to the code I use to do it.
|
||||
|
||||
[https://github.com/DanielSmithMichigan/reinforcement-learning/blob/72b0e939c47234fd63c8459fdfa35f18d9053b49/soft-actor-critic/agent/Agent.py#L287](https://github.com/DanielSmithMichigan/reinforcement-learning/blob/72b0e939c47234fd63c8459fdfa35f18d9053b49/soft-actor-critic/agent/Agent.py#L287)
|
||||
|
||||
|
||||
Here is an album of screenshots I took while training a successful policy.
|
||||
|
||||
|
||||
[https://imgur.com/a/05C5vVa](https://imgur.com/a/05C5vVa)
|
||||
|
||||
|
||||
I hope that helps. Feel free to hit me up on discord if you want someone to talk through it with. I'm kind of in the same boat as you with regard to wishing I knew the better ways to debug RL algorithms.
|
||||
|
||||
|
||||
(Discord: Perseus#5383)
|
||||
@@ -0,0 +1,12 @@
|
||||
Source: https://old.reddit.com/r/reinforcementlearning/comments/5hereu/
|
||||
Title: "The Nuts and Bolts of Deep RL Research" (Schulman December 2016 slides)
|
||||
Fetched-via: Reddit JSON API (limit=500, depth=10)
|
||||
Fetch-status: verbatim
|
||||
|
||||
# "The Nuts and Bolts of Deep RL Research" (Schulman December 2016 slides)
|
||||
|
||||
**Posted by:** u/gwern | Score: 5 | 0 comments
|
||||
|
||||
Link: http://rll.berkeley.edu/deeprlcourse/docs/nuts-and-bolts.pdf
|
||||
|
||||
## Comments
|
||||
@@ -0,0 +1,272 @@
|
||||
Source: https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607
|
||||
Title: 37 Reasons why your Neural Network is not working - Slav Ivanov (2017)
|
||||
Fetched-via: curl https://r.jina.ai/https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607
|
||||
Fetch-status: verbatim
|
||||
|
||||
|
||||
Published Time: 2017-07-25T08:13:45Z
|
||||
|
||||
|
||||
Jul 25, 2017
|
||||
|
||||
The network had been training for the last 12 hours. It all looked good: the gradients were flowing and the loss was decreasing. But then came the predictions: all zeroes, all background, nothing detected. “What did I do wrong?” — I asked my computer, who didn’t answer.
|
||||
|
||||
Where do you start checking if your model is outputting garbage (for example predicting the mean of all outputs, or it has really poor accuracy)?
|
||||
|
||||
A network might not be training for a number of reasons. Over the course of many debugging sessions, I would often find myself doing the same checks. I’ve compiled my experience along with the best ideas around in this handy list. I hope they would be of use to you, too.
|
||||
|
||||
Table of Contents
|
||||
-----------------
|
||||
|
||||
> [0. How to use this guide?](https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607#b6fb)
|
||||
>
|
||||
>
|
||||
> [I. Dataset issues](https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607#678a)
|
||||
>
|
||||
>
|
||||
> [II. Data Normalization/Augmentation issues](https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607#86fe)
|
||||
>
|
||||
>
|
||||
> [III. Implementation issues](https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607#95eb)
|
||||
>
|
||||
>
|
||||
> [IV. Training issues](https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607#74de)
|
||||
|
||||
0. How to use this guide?
|
||||
-------------------------
|
||||
|
||||
A lot of things can go wrong. But some of them are more likely to be broken than others. I usually start with this short list as an emergency first response:
|
||||
|
||||
1. Start with a simple model that is known to work for this type of data (for example, VGG for images). Use a standard loss if possible.
|
||||
2. Turn off all bells and whistles, e.g. regularization and data augmentation.
|
||||
3. If finetuning a model, double check the preprocessing, for it should be the same as the original model’s training.
|
||||
4. Verify that the input data is correct.
|
||||
5. Start with a really small dataset (2–20 samples). Overfit on it and gradually add more data.
|
||||
6. Start gradually adding back all the pieces that were omitted: augmentation/regularization, custom loss functions, try more complex models.
|
||||
|
||||
If the steps above don’t do it, start going down the following big list and verify things one by one.
|
||||
|
||||
I. Dataset issues
|
||||
-----------------
|
||||
|
||||
|
||||

|
||||
|
||||
Source: [http://dilbert.com/strip/2014-05-07](http://dilbert.com/strip/2014-05-07)
|
||||
|
||||
### 1. Check your input data
|
||||
|
||||
Check if the input data you are feeding the network makes sense. For example, I’ve more than once mixed the width and the height of an image. Sometimes, I would feed all zeroes by mistake. Or I would use the same batch over and over. So print/display a couple of batches of input and target output and make sure they are OK.
|
||||
|
||||
### 2. Try random input
|
||||
|
||||
Try passing random numbers instead of actual data and see if the error behaves the same way. If it does, it’s a sure sign that your net is turning data into garbage at some point. Try debugging layer by layer /op by op/ and see where things go wrong.
|
||||
|
||||
### 3. Check the data loader
|
||||
|
||||
Your data might be fine but the code that passes the input to the net might be broken. Print the input of the first layer before any operations and check it.
|
||||
|
||||
### 4. Make sure input is connected to output
|
||||
|
||||
Check if a few input samples have the correct labels. Also make sure shuffling input samples works the same way for output labels.
|
||||
|
||||
### 5. Is the relationship between input and output too random?
|
||||
|
||||
Maybe the non-random part of the relationship between the input and output is too small compared to the random part (one could argue that stock prices are like this), i.e. the inputs are not sufficiently related to the output. There isn’t a universal way to detect this, as it depends on the nature of the data.
|
||||
|
||||
### 6. Is there too much noise in the dataset?
|
||||
|
||||
This happened to me once when I scraped an image dataset off a food site. There were so many bad labels that the network couldn’t learn. Check a bunch of input samples manually and see if labels seem off.
|
||||
|
||||
The cutoff point is up for debate, as [this paper](https://arxiv.org/pdf/1412.6596.pdf) got above 50% accuracy on MNIST using 50% corrupted labels.
|
||||
|
||||
### 7. Shuffle the dataset
|
||||
|
||||
If your dataset hasn’t been shuffled and has a particular order to it (ordered by label) this could negatively impact the learning. Shuffle your dataset to avoid this. Make sure you are shuffling input and labels together.
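
A minimal sketch of shuffling inputs and labels with one shared permutation, so that the pairing survives. The function name and the toy data are made up for illustration:

```python
import random


def shuffle_together(inputs, labels, seed=0):
    """Shuffle two parallel lists with a single shared permutation."""
    if len(inputs) != len(labels):
        raise ValueError("inputs and labels must have the same length")
    order = list(range(len(inputs)))
    random.Random(seed).shuffle(order)
    return [inputs[i] for i in order], [labels[i] for i in order]


inputs = [f"img_{i}" for i in range(6)]
labels = [i % 2 for i in range(6)]
shuffled_x, shuffled_y = shuffle_together(inputs, labels)

# Every input still carries its original label after shuffling.
original = dict(zip(inputs, labels))
assert all(original[x] == y for x, y in zip(shuffled_x, shuffled_y))
```

Shuffling the two lists with separate calls is the bug to avoid: each call draws its own permutation and the labels no longer match.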
|
||||
|
||||
### 8. Reduce class imbalance
|
||||
|
||||
Are there 1,000 class A images for every class B image? Then you might need to balance your loss function or [try other class imbalance approaches](http://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/).
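
One common heuristic for balancing the loss is to weight each class inversely to its frequency. A minimal sketch; the function name is illustrative, and the formula n_samples / (n_classes * class_count) is one choice among several:

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


labels = ["A"] * 1000 + ["B"] * 1
weights = inverse_frequency_weights(labels)
print(weights)
```

The resulting weights would be passed to the per-class weighting option of whichever loss function is in use.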
|
||||
|
||||
### 9. Do you have enough training examples?
|
||||
|
||||
If you are training a net from scratch (i.e. not finetuning), you probably need lots of data. For image classification, [people say](https://stats.stackexchange.com/a/226693/30773) you need 1,000 images per class or more.
|
||||
|
||||
### 10. Make sure your batches don’t contain a single label
|
||||
|
||||
This can happen in a sorted dataset (i.e. the first 10k samples contain the same class). Easily fixable by shuffling the dataset.
|
||||
|
||||
### 11. Reduce batch size
|
||||
|
||||
[This paper](https://arxiv.org/abs/1609.04836) points out that having a very large batch can reduce the generalization ability of the model.
|
||||
|
||||
### Addition 1. Use standard dataset (e.g. mnist, cifar10)
|
||||
|
||||
Thanks to @ for this one:
|
||||
|
||||
> When testing new network architecture or writing a new piece of code, use the standard datasets first, instead of your own data. This is because there are many reference results for these datasets and they are proved to be ‘solvable’. There will be no issues of label noise, train/test distribution difference , too much difficulty in dataset, etc.
|
||||
|
||||
II. Data Normalization/Augmentation
|
||||
-----------------------------------
|
||||
|
||||

|
||||
|
||||
### **12. Standardize** the features
|
||||
|
||||
Did you standardize your input to have zero mean and unit variance?
|
||||
|
||||
### 13. Do you have too much data augmentation?
|
||||
|
||||
Augmentation has a regularizing effect. Too much of this combined with other forms of regularization (weight L2, dropout, etc.) can cause the net to underfit.
|
||||
|
||||
### 14. Check the preprocessing of your pretrained model
|
||||
|
||||
If you are using a pretrained model, make sure you are using the same normalization and preprocessing that was used when the model was trained. For example, should an image pixel be in the range [0, 1], [-1, 1] or [0, 255]?
|
||||
|
||||
### 15. Check the preprocessing for train/validation/test set
|
||||
|
||||
CS231n points out a [common pitfall](http://cs231n.github.io/neural-networks-2/#datapre):
|
||||
|
||||
> “… any preprocessing statistics (e.g. the data mean) must only be computed on the training data, and then applied to the validation/test data. E.g. computing the mean and subtracting it from every image across the entire dataset and then splitting the data into train/val/test splits would be a mistake.”
|
||||
|
||||
Also, check for different preprocessing in each sample or batch.
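
The CS231n pitfall above comes down to ordering: split first, fit the statistics on the training split, then reuse them everywhere. A small sketch with made-up numbers and illustrative function names:

```python
import statistics


def fit_standardizer(train_values):
    """Compute mean and standard deviation from the training split only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values) or 1.0  # guard against zero spread
    return mean, std


def standardize(values, mean, std):
    """Apply previously fitted statistics to any split."""
    return [(v - mean) / std for v in values]


data = [2.0, 4.0, 6.0, 8.0, 100.0, 120.0]
train, val = data[:4], data[4:]       # split first ...
mean, std = fit_standardizer(train)   # ... then fit on the training split only
train_z = standardize(train, mean, std)
val_z = standardize(val, mean, std)   # validation reuses the training statistics
print(mean, std, val_z)
```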
|
||||
|
||||
III. Implementation issues
|
||||
--------------------------
|
||||
|
||||
|
||||

|
||||
|
||||
Credit: [https://xkcd.com/1838/](https://xkcd.com/1838/)
|
||||
|
||||
### 16. Try solving a simpler version of the problem
|
||||
|
||||
This will help with finding where the issue is. For example, if the target output is an object class and coordinates, try limiting the prediction to object class only.
|
||||
|
||||
### 17. Look for correct loss “at chance”
|
||||
|
||||
Again from the excellent [CS231n](http://cs231n.github.io/neural-networks-3/#sanitycheck): _Initialize with small parameters, without regularization. For example, if we have 10 classes, at chance means we will get the correct class 10% of the time, and the Softmax loss is the negative log probability of the correct class so: -ln(0.1) = 2.302._
|
||||
|
||||
After this, try increasing the regularization strength which should increase the loss.
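
The at-chance value is easy to compute for any number of classes. A minimal sketch (the function name is illustrative):

```python
import math


def chance_cross_entropy(num_classes):
    """Softmax loss when every class is assigned probability 1 / num_classes."""
    return -math.log(1.0 / num_classes)


print(round(chance_cross_entropy(10), 3))    # 2.303
print(round(chance_cross_entropy(1000), 3))  # 6.908
```

If the first logged loss of a freshly initialized classifier is far from this number, the initialization or the loss wiring deserves a look.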
|
||||
|
||||
### 18. Check your loss function
|
||||
|
||||
If you implemented your own loss function, check it for bugs and add unit tests. Often, my loss would be slightly incorrect and hurt the performance of the network in a subtle way.
|
||||
|
||||
### 19. Verify loss input
|
||||
|
||||
If you are using a loss function provided by your framework, make sure you are passing to it what it expects. For example, in PyTorch I would mix up NLLLoss and CrossEntropyLoss: the former expects log-probabilities (the output of a log-softmax), while the latter takes raw logits and applies the log-softmax itself.
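
The difference between the two input contracts can be shown without any framework. The functions below are plain-Python stand-ins that mirror the documented behaviour of PyTorch's `NLLLoss` (expects log-probabilities) and `CrossEntropyLoss` (expects raw logits); they are not the library code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def nll_loss(log_probs, target):
    """Negative log-likelihood: expects log-probabilities."""
    return -log_probs[target]


def cross_entropy(logits, target):
    """Cross-entropy: expects raw logits and applies log-softmax itself."""
    return nll_loss(log_softmax(logits), target)


logits, target = [2.0, 0.5, -1.0], 0
correct = cross_entropy(logits, target)
wrong = nll_loss(logits, target)  # raw logits passed where log-probs are expected
print(correct, wrong)
```

The mistaken call runs without any error and returns a negative "loss", which is exactly why this bug is easy to miss.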
|
||||
|
||||
### 20. Adjust loss weights
|
||||
|
||||
If your loss is composed of several smaller loss functions, make sure their magnitudes relative to each other are correct. This might involve testing different combinations of loss weights.
|
||||
|
||||
### 21. Monitor other metrics
|
||||
|
||||
Sometimes the loss is not the best predictor of whether your network is training properly. If you can, use other metrics like accuracy.
|
||||
|
||||
### 22. Test any custom layers
|
||||
|
||||
Did you implement any of the layers in the network yourself? Check and double-check to make sure they are working as intended.
|
||||
|
||||
### 23. Check for “frozen” layers or variables
|
||||
|
||||
Check if you unintentionally disabled gradient updates for some layers/variables that should be learnable.
|
||||
|
||||
### 24. Increase network size
|
||||
|
||||
Maybe the expressive power of your network is not enough to capture the target function. Try adding more layers or more hidden units in fully connected layers.
|
||||
|
||||
### 25. Check for hidden dimension errors
|
||||
|
||||
If your input looks like (k, H, W) = (64, 64, 64) it’s easy to miss errors related to wrong dimensions. Use weird numbers for input dimensions (for example, different prime numbers for each dimension) and check how they propagate through the network.
|
||||
|
||||
### 26. Explore Gradient checking
|
||||
|
||||
If you implemented Gradient Descent by hand, gradient checking makes sure that your backpropagation works like it should. More info: [1](http://ufldl.stanford.edu/tutorial/supervised/DebuggingGradientChecking/), [2](http://cs231n.github.io/neural-networks-3/#gradcheck), [3](https://www.coursera.org/learn/machine-learning/lecture/Y3s6r/gradient-checking).
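
A minimal gradient check compares a central-difference estimate against the hand-derived gradient using a relative error. The toy function and all names below are illustrative:

```python
def numerical_gradient(f, x, eps=1e-5):
    """Central-difference estimate of df/dx_i for each coordinate."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


def relative_error(a, b):
    """Largest |a_i - b_i| / max(|a_i|, |b_i|) over all coordinates."""
    return max(abs(p - q) / max(abs(p), abs(q), 1e-12) for p, q in zip(a, b))


def f(x):
    """Toy function whose gradient is easy to derive by hand."""
    return x[0] ** 2 + 3.0 * x[0] * x[1]


def analytic_grad(x):
    """Hand-derived gradient of f, the thing being verified."""
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]


point = [1.5, -2.0]
err = relative_error(numerical_gradient(f, point), analytic_grad(point))
print(err)
```

A relative error that stays large for a smooth function usually points at a mistake in the analytic gradient.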
|
||||
|
||||
IV. Training issues
|
||||
-------------------
|
||||
|
||||

|
||||
|
||||
Credit: [http://carlvondrick.com/ihog/](http://carlvondrick.com/ihog/)
|
||||
|
||||
### 27. Solve for a really small dataset
|
||||
|
||||
**Overfit a small subset of the data and make sure it works.** For example, train with just 1 or 2 examples and see if your network can learn to differentiate these. Move on to more samples per class.
|
||||
|
||||
### 28. Check weights initialization
|
||||
|
||||
If unsure, use [Xavier](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf) or [He](http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf) initialization. Also, your initialization might be leading you to a bad local minimum, so try a different initialization and see if it helps.
|
||||
|
||||
### 29. Change your hyperparameters
|
||||
|
||||
Maybe you are using a particularly bad set of hyperparameters. If feasible, try a [grid search](http://scikit-learn.org/stable/modules/grid_search.html).
|
||||
|
||||
### 30. Reduce regularization
|
||||
|
||||
Too much regularization can cause the network to underfit badly. Reduce regularization such as dropout, batch norm, weight/bias L2 regularization, etc. In the excellent “[Practical Deep Learning for coders](http://course.fast.ai/)” course, [Jeremy Howard](https://twitter.com/jeremyphoward) advises getting rid of underfitting first. This means you overfit the training data sufficiently, and only then address overfitting.
|
||||
|
||||
### 31. Give it time
|
||||
|
||||
Maybe your network needs more time to train before it starts making meaningful predictions. If your loss is steadily decreasing, let it train some more.
|
||||
|
||||
### 32. Switch from Train to Test mode
|
||||
|
||||
Some frameworks have layers, such as Batch Norm and Dropout, that behave differently during training and testing. Switching to the appropriate mode might help your network to predict properly.
|
||||
|
||||
### 33. Visualize the training
|
||||
|
||||
* Monitor the activations, weights, and updates of each layer. Make sure their magnitudes match. For example, the ratio of the update magnitudes to the parameter magnitudes (weights and biases) [should be around 1e-3](https://cs231n.github.io/neural-networks-3/#summary).
|
||||
* Consider a visualization library like [Tensorboard](https://www.tensorflow.org/get_started/summaries_and_tensorboard) and [Crayon](https://github.com/torrvision/crayon). In a pinch, you can also print weights/biases/activations.
|
||||
* Be on the lookout for layer activations with a mean much larger than 0. Try Batch Norm or ELUs.
|
||||
* [Deeplearning4j](https://deeplearning4j.org/visualization#usingui) points out what to expect in histograms of weights and biases:
|
||||
|
||||
> “For weights, these histograms should have an **approximately Gaussian (normal)** distribution, after some time. For biases, these histograms will generally start at 0, and will usually end up being **approximately Gaussian** (One exception to this is for LSTM). Keep an eye out for parameters that are diverging to +/- infinity. Keep an eye out for biases that become very large. This can sometimes occur in the output layer for classification if the distribution of classes is very imbalanced.”
|
||||
|
||||
* Check layer updates; they should have a Gaussian distribution.
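
The update-to-weight ratio mentioned in the first bullet can be computed per layer in a few lines. This sketch assumes plain SGD (update = -lr * gradient) and uses made-up numbers; the function names are illustrative:

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values))


def update_ratio(weights, grads, lr):
    """Ratio of the SGD update norm to the weight norm for one layer."""
    updates = [-lr * g for g in grads]
    return l2_norm(updates) / l2_norm(weights)


weights = [0.5, -0.3, 0.8, 0.1]
grads = [0.2, -0.1, 0.05, 0.4]
print(update_ratio(weights, grads, lr=1e-3))
print(update_ratio(weights, grads, lr=1.0))
```

A ratio several orders of magnitude away from roughly 1e-3 suggests the learning rate is too low or too high for that layer.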
|
||||
|
||||
### 34. Try a different optimizer
|
||||
|
||||
Your choice of optimizer shouldn’t prevent your network from training unless you have selected particularly bad hyperparameters. However, the proper optimizer for a task can be helpful in getting the most training in the shortest amount of time. The paper which describes the algorithm you are using should specify the optimizer. If not, I tend to use Adam or plain SGD with momentum.
|
||||
|
||||
Check this [excellent post](http://ruder.io/optimizing-gradient-descent/) by Sebastian Ruder to learn more about gradient descent optimizers.
|
||||
|
||||
### 35. Exploding / Vanishing gradients
|
||||
|
||||
* Check layer updates, as very large values can indicate exploding gradients. Gradient clipping may help.
|
||||
* Check layer activations. From [Deeplearning4j](https://deeplearning4j.org/visualization#usingui) comes a great guideline: _“A good standard deviation for the activations is on the order of 0.5 to 2.0. Significantly outside of this range may indicate vanishing or exploding activations.”_
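
Gradient clipping by global norm can be sketched in a few lines. This is a plain-Python illustration of the idea rather than any framework's implementation; the nested lists stand in for per-layer gradient tensors.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for layer in grads for g in layer))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [[g * scale for g in layer] for layer in grads], total


grads = [[3.0, 4.0], [12.0]]  # joint norm is 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before, clipped)
```

Logging the pre-clipping norm, as returned here, doubles as a cheap exploding-gradient monitor.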
|
||||
|
||||
### 36. Increase/Decrease Learning Rate
|
||||
|
||||
A low learning rate will cause your model to converge very slowly.
|
||||
|
||||
A high learning rate will quickly decrease the loss in the beginning but might have a hard time finding a good solution.
|
||||
|
||||
Play around with your current learning rate by multiplying it by 0.1 or 10.
|
||||
|
||||
### 37. Overcoming NaNs
|
||||
|
||||
Getting a NaN (Not-a-Number) is a much bigger issue when training RNNs (from what I hear). Some approaches to fix it:
|
||||
|
||||
* Decrease the learning rate, especially if you are getting NaNs in the first 100 iterations.
|
||||
* NaNs can arise from division by zero or from taking the natural log of zero or a negative number.
|
||||
* Russell Stewart has great pointers on [how to deal with NaNs](http://russellsstewart.com/notes/0.html).
|
||||
* Try evaluating your network layer by layer and see where the NaNs appear.
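
Two of the points above, the log of zero and the stage-by-stage search, can be sketched together. The helper names are hypothetical, and the hand-written `-inf` imitates what tensor libraries return for log(0), since Python's `math.log(0)` raises instead:

```python
import math


def first_non_finite(named_values):
    """Return the name of the first stage whose output contains NaN or inf."""
    for name, values in named_values:
        if any(not math.isfinite(v) for v in values):
            return name
    return None


def safe_log(p, eps=1e-12):
    """Clamp before the log so that p == 0 cannot produce -inf."""
    return math.log(max(p, eps))


probs = [0.7, 0.3, 0.0]
stages = [
    ("probs", probs),
    ("naive_log", [math.log(p) if p > 0 else float("-inf") for p in probs]),
    ("safe_log", [safe_log(p) for p in probs]),
]
print(first_non_finite(stages))  # naive_log
```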
|
||||
@@ -0,0 +1,220 @@
|
||||
Source: https://github.com/williamFalcon/DeepRLHacks
|
||||
Title: DeepRLHacks - Attendee notes from Schulman's Nuts and Bolts talk (2017)
|
||||
Fetched-via: gh api repos/williamFalcon/DeepRLHacks/contents/README.md
|
||||
Fetch-status: verbatim
|
||||
Compliance-note: Secondary source - attendee notes from the talk. Primary source is joschu_nuts_and_bolts.md (http://joschu.net/docs/nuts-and-bolts.pdf)
|
||||
|
||||
# DeepRLHacks
|
||||
From a talk given by [John Schulman](http://joschu.net/) titled "The Nuts and Bolts of Deep RL Research" (Aug 2017)
|
||||
These are tricks written down while attending summer [Deep RL Bootcamp at UC Berkeley](https://www.deepbootcamp.io/).
|
||||
|
||||
**Update**: RL bootcamp just released the [video](https://www.youtube.com/watch?v=8EcdaCk9KaQ&feature=youtu.be) and the rest of the [lectures](https://sites.google.com/view/deep-rl-bootcamp/lectures).
|
||||
|
||||
## Tips to debug new algorithm
|
||||
1. Simplify the problem by using a low dimensional state space environment.
|
||||
- John suggested to use the [Pendulum problem](https://gym.openai.com/envs/Pendulum-v0) because the problem has a 2-D state space (angle of pendulum and velocity).
|
||||
- Easy to visualize what the value function looks like and what state the algorithm should be in and how they evolve over time.
|
||||
- Easy to visually spot why something isn't working (aka, is the value function smooth enough and so on).
|
||||
|
||||
2. To test if your algorithm is reasonable, construct a problem you know it should work on.
|
||||
- Ex: For hierarchical reinforcement learning you'd construct a problem with an OBVIOUS hierarchy it should learn.
|
||||
- Can easily see if it's doing the right thing.
|
||||
- WARNING: Don't overfit your method to the toy problem (remember it's a toy problem).
|
||||
|
||||
3. Familiarize yourself with certain environments you know well.
|
||||
- Over time, you'll learn how long the training should take.
|
||||
- Know how rewards evolve, etc...
|
||||
- Allows you to set a benchmark to see how well you're doing against your past trials.
|
||||
- John uses the hopper robot where he knows how fast learning should take, and he can easily spot odd behaviors.
|
||||
|
||||
## Tips to debug a new task
|
||||
1. Simplify the task
|
||||
- Start simple until you see signs of life.
|
||||
- Approach 1: Simplify the feature space:
|
||||
- For example, if you're learning from images (huge dimensional space), then maybe hand engineer features first. Example: If you think your function is trying to approximate a location of something, use the x,y location as features as step 1.
|
||||
- Once it starts working, make the problem harder until you solve the full problem.
|
||||
- Approach 2: simplify the reward function.
|
||||
- Formulate so it can give you FAST feedback to know whether you're doing the right thing or not.
|
||||
- Ex: Have reward for robot when it hits the target (+1). Hard to learn because maybe too much happens in between starting and reward. Reformulate as distance to target instead which will increase learning and allow you to iterate faster.
|
||||
|
||||
## Tips to frame a problem in RL
|
||||
Maybe it's unclear what the features are and what the reward should be, or if it's feasible at all.
|
||||
|
||||
1. First step: Visualize a random policy acting on this problem.
|
||||
- See where it takes you.
|
||||
- If random policy on occasion does the right thing, then high chance RL will do the right thing.
|
||||
- Policy gradient will find this behavior and make it more likely.
|
||||
- If random policy never does the right thing, RL will likely also not.
|
||||
|
||||
2. Make sure observations usable:
|
||||
- See if YOU could control the system by using the same observations you give the agent.
|
||||
- Example: Look at preprocessed images yourself to make sure you don't remove necessary details or hinder the algorithm in a certain way.
|
||||
|
||||
3. Make sure everything is reasonably scaled.
|
||||
- Rule of thumb:
|
||||
- Observations: Make everything mean 0, standard deviation 1.
|
||||
- Reward: If you control it, then scale it to a reasonable value.
|
||||
- Do it across ALL your data so far.
|
||||
- Look at all observations and rewards and make sure there aren't crazy outliers.
|
||||
|
||||
4. Have good baseline whenever you see a new problem.
|
||||
- It's unclear which algorithm will work, so have a set of baselines (from other methods)
|
||||
- Cross entropy method
|
||||
- Policy gradient methods
|
||||
- Some kind of Q-learning method (checkout [OpenAI Baselines](https://github.com/openai/baselines) as a starter or [RLLab](https://github.com/rll/rllab))
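
As an example of how small such a baseline can be, here is a sketch of the Gaussian cross-entropy method on a toy objective. All names and hyperparameters are illustrative choices, and `toy_return` stands in for the average episode return of a policy with the given parameters:

```python
import random
import statistics


def cross_entropy_method(score, dim, iterations=30, population=50,
                         elite_frac=0.2, init_std=5.0, seed=0):
    """Gaussian cross-entropy method: sample, keep the elite, refit, repeat."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [init_std] * dim
    n_elite = max(2, int(population * elite_frac))
    for _ in range(iterations):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(population)]
        elite = sorted(samples, key=score, reverse=True)[:n_elite]
        columns = list(zip(*elite))
        mean = [statistics.fmean(col) for col in columns]
        # A small floor keeps the search distribution from collapsing to a point.
        std = [max(statistics.pstdev(col), 1e-3) for col in columns]
    return mean


def toy_return(params):
    """Toy objective to maximize, with its optimum at (3, -1)."""
    return -((params[0] - 3.0) ** 2 + (params[1] + 1.0) ** 2)


best = cross_entropy_method(toy_return, dim=2)
print(best)
```

Because it needs no gradients, a baseline like this helps separate "the task is hard" from "my gradient-based implementation is broken".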
|
||||
|
||||
## Reproducing papers
|
||||
Sometimes (often), it's hard to reproduce results from papers. Some tricks to do that:
|
||||
|
||||
1. Use more samples than needed.
|
||||
2. Policy right... but not exactly
|
||||
- Try to make it work a little bit.
|
||||
- Then tweak hyperparameters to get up to the published performance.
|
||||
- If want to get it to work at ALL, use bigger batch sizes.
|
||||
- If batch size is too small, noise will overpower signal.
|
||||
- Example: TRPO, John was using too tiny of a batch size and had to use 100k time steps.
|
||||
- For DQN, best hyperparams: 10k time steps, 1M frames in replay buffer.
|
||||
|
||||
|
||||
## Guidelines for the ongoing training process
|
||||
Sanity check that your training is going well.
|
||||
|
||||
1. Look at sensitivity of EVERY hyper parameter
|
||||
- If algo is too sensitive, then NOT robust and should NOT be happy with it.
|
||||
- Sometimes it happens that a method works one way because of funny dynamics but NOT in general.
|
||||
|
||||
2. Look for indicators that the optimization process is healthy.
|
||||
- Varies
|
||||
- Look at whether value function is accurate.
|
||||
- Is it predicting well?
|
||||
- Is it predicting returns well?
|
||||
- How big are the updates?
|
||||
- Standard diagnostics from deep networks
|
||||
|
||||
3. Have a system for continuously benchmarking code.
|
||||
- Needs DISCIPLINE.
|
||||
- Look at performance across ALL previous problems you tried.
|
||||
- Sometimes it'll start working on one problem but mess up performance in others.
|
||||
- Easy to over fit on a single problem.
|
||||
- Have a battery of benchmarks you run occasionally.
|
||||
|
||||
4. You may think your algorithm is working when you're actually seeing random noise.
|
||||
- Example: Graph of 7 tasks with 3 algorithms and looks like 1 algorithm might be doing best on all problems, but turns out they're all the same algorithm with DIFFERENT random seeds.
|
||||
|
||||
5. Try different random seeds!!
|
||||
- Run multiple times and average.
|
||||
- Run multiple tasks on multiple seeds.
|
||||
- If not, you're likely to over fit.
|
||||
|
||||
6. Additional algorithm modifications might be unnecessary.
|
||||
- Most tricks are ACTUALLY normalizing something in some way or improving your optimization.
|
||||
- A lot of tricks also have the same effect... So you can remove some of them and SIMPLIFY your algorithm (VERY KEY).
|
||||
|
||||
7. Simplify your algorithm
|
||||
- Will generalize better
|
||||
|
||||
8. Automate your experiments
|
||||
- Don't spend your whole day watching your code spit out numbers.
|
||||
- Launch experiments on cloud services and analyze results.
|
||||
- Frameworks to track experiments and results:
|
||||
- Mostly use iPython notebooks.
|
||||
- DBs seem unnecessary to store results.
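
The random-seed warning in items 4 and 5 above can be made concrete with a small simulation. Here `run_experiment` is a hypothetical stand-in for a full training run that returns a noisy final score; the three "algorithms" are the same function and differ only in their seeds:

```python
import random
import statistics


def run_experiment(seed, episodes=20):
    """Stand-in for a training run: returns a noisy final score."""
    rng = random.Random(seed)
    return statistics.fmean([rng.gauss(100.0, 30.0) for _ in range(episodes)])


def summarize(scores):
    """Mean and standard error of the mean across seeds."""
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, sem


# The same "algorithm" under three labels, differing only in the seeds used.
for label, seeds in (("A", range(0, 5)), ("B", range(5, 10)), ("C", range(10, 15))):
    mean, sem = summarize([run_experiment(s) for s in seeds])
    print(f"{label}: {mean:.1f} +/- {sem:.1f}")
```

Reporting the spread next to the mean makes it harder to mistake seed-to-seed variation for a real difference between methods.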
|
||||
|
||||
|
||||
## General training strategies
|
||||
1. Whiten and standardize data (for ALL seen data since the beginning).
|
||||
- Observations:
|
||||
- Do it by computing a running mean and standard deviation. Then z-transform everything.
|
||||
- Over ALL data seen (not just the recent data).
|
||||
- That way, the normalization statistics change more slowly as more data accumulates.
|
||||
- Might trip up the optimizer if you keep changing the objective.
|
||||
- Rescaling using only recent data changes the inputs in a way the optimizer doesn't know about, and performance will probably collapse.
|
||||
|
||||
- Rewards:
|
||||
- Scale and DON'T shift.
|
||||
- Affects agent's will to live.
|
||||
- Will change the problem (aka, how long you want it to survive).
|
||||
|
||||
- Standardize targets:
|
||||
- Same way as rewards.
|
||||
|
||||
- PCA Whitening?
|
||||
- Could help.
|
||||
- Starting to see if it actually helps with neural nets.
|
||||
- Huge scales (-1000, 1000) or (-0.001, 0.001) certainly make learning slow.
|
||||
|
||||
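The normalization advice above can be sketched roughly as follows, assuming scalar values for brevity (a real implementation would track one mean/std per observation dimension). The class name, the clip range of 10, and the `eps` term are illustrative assumptions, not part of the notes.

```python
import math


class RunningNormalizer:
    """Tracks mean/std over ALL values seen so far (Welford's algorithm)."""

    def __init__(self, clip: float = 10.0, eps: float = 1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.clip = clip  # assumed clip range, not from the notes
        self.eps = eps

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / self.count)

    def normalize(self, x: float) -> float:
        """Observations: z-transform, then clip."""
        z = (x - self.mean) / (self.std + self.eps)
        return max(-self.clip, min(self.clip, z))

    def scale_only(self, x: float) -> float:
        """Rewards: divide by std but do NOT subtract the mean."""
        return x / (self.std + self.eps)


norm = RunningNormalizer()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    norm.update(value)
print(norm.mean, norm.std, norm.normalize(9.0), norm.scale_only(-4.0))
```

Because the statistics cover every value seen since the start, each new sample moves them less and less, which is the slow drift the notes ask for. `scale_only` keeps the sign of every reward, so a per-step penalty stays a penalty.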
2. Pay attention to the discount factor.
|
||||
- It determines how far back in time credit is assigned.
|
||||
- Ex: if the factor is 0.99, then you're largely ignoring anything more than ~100 steps away (effective horizon 1/(1 - 0.99) = 100). That means you're shortsighted.
|
||||
- Better to look at how that corresponds to real time
|
||||
- Intuition, in RL we're usually discretizing time.
|
||||
- aka: are those 100 steps 3 seconds of actual time?
|
||||
- what happens during that time?
|
||||
- If using TD methods for policy gradient or value function estimation, gamma can be close to 1 (like 0.999)
|
||||
- Algo becomes very stable.
|
||||
|
||||
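A small sketch of the discount arithmetic above: the effective horizon 1/(1 - gamma) in steps, converted to wall-clock time. The value of `seconds_per_step` is a hypothetical example and depends on your environment's frame rate and frame skip.

```python
def effective_horizon(gamma: float) -> float:
    """Rough number of steps over which rewards still matter."""
    return 1.0 / (1.0 - gamma)


def horizon_in_seconds(gamma: float, seconds_per_step: float) -> float:
    """Same horizon expressed in real time."""
    return effective_horizon(gamma) * seconds_per_step


seconds_per_step = 0.03  # hypothetical: one agent step = 30 ms of real time
for gamma in (0.9, 0.99, 0.999):
    steps = effective_horizon(gamma)
    secs = horizon_in_seconds(gamma, seconds_per_step)
    print(f"gamma={gamma}: ~{steps:.0f} steps, ~{secs:.1f} s of real time")
```

If the behaviour you want takes longer than this many seconds to pay off, the agent will struggle to assign credit for it at that gamma.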
3. Check that the problem can actually be solved at the chosen time discretization.
|
||||
- Example: frame skip in a game.
|
||||
- As a human, can you control it or is it impossible?
|
||||
- Look at what random exploration looks like
|
||||
- Discretization determines how far your Brownian motion goes.
|
||||
- If you take the same action many times in a row, you tend to explore further.
|
||||
- Choose your time discretization in a way that works.
|
||||
|
||||
4. Look at episode returns closely.
|
||||
- Not just mean, look at min and max.
|
||||
- The max return is something your policy can home in on pretty well.
|
||||
- Is your policy ever doing the right thing??
|
||||
- Look at episode length (sometimes more informative than episode reward).
|
||||
- In a game you might be losing every time, so reward never changes, but episode length can tell you if you're losing SLOWER.
|
||||
- Might see an episode length improvement in the beginning but maybe not reward.
|
||||
|
||||
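A minimal sketch of the episode statistics described above. The function name and the example numbers are hypothetical; the point is only that min, max, and episode length are logged next to the mean.

```python
import statistics


def summarize_episodes(returns, lengths):
    """Summary stats for one batch of finished episodes."""
    return {
        "return_mean": statistics.mean(returns),
        "return_min": min(returns),
        "return_max": max(returns),
        "length_mean": statistics.mean(lengths),
    }


# Hypothetical game where the agent loses every episode (return is always -1),
# but survives longer later in training.
early = summarize_episodes(returns=[-1.0, -1.0, -1.0, -1.0], lengths=[40, 50, 60, 70])
late = summarize_episodes(returns=[-1.0, -1.0, -1.0, -1.0], lengths=[140, 150, 160, 170])
print(early)
print(late)
```

Here the mean return is identical in both batches, and only the episode length shows that the policy is improving.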
|
||||
## Policy gradient diagnostics
|
||||
1. Look at entropy really carefully
|
||||
- Entropy in ACTION space
|
||||
- You care more about entropy in state space, but there are no good methods for calculating that.
|
||||
- If it's going down too fast, the policy is becoming deterministic and will not explore.
|
||||
- If it's NOT going down, the policy won't be good because it is still really random.
|
||||
- Can fix by:
|
||||
- KL penalty
|
||||
- Keep entropy from decreasing too quickly.
|
||||
- Add entropy bonus.
|
||||
- How to measure entropy.
|
||||
- For most policies can compute entropy analytically.
|
||||
- If continuous, it's usually a Gaussian, so can compute differential entropy.
|
||||
|
||||
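A sketch of the analytic entropy computations mentioned above, assuming a categorical policy for discrete actions and a diagonal Gaussian policy for continuous actions. Function names are illustrative, and values are in nats.

```python
import math


def categorical_entropy(probs):
    """Entropy of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def gaussian_entropy(stds):
    """Differential entropy of a diagonal Gaussian with the given per-dimension stds."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s * s) for s in stds)


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
print("uniform policy entropy:", categorical_entropy(uniform))
print("peaked policy entropy: ", categorical_entropy(peaked))
print("gaussian entropy (std=1.0, 2 dims):", gaussian_entropy([1.0, 1.0]))
print("gaussian entropy (std=0.1, 2 dims):", gaussian_entropy([0.1, 0.1]))
```

Logging the batch average of this quantity at every update gives the entropy curve to watch. Note that differential entropy can go negative for small standard deviations, so track its trend rather than its sign.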
2. Look at KL divergence
|
||||
- Look at size of updates in terms of KL divergence.
|
||||
- example:
|
||||
- If KL is .01 then very small.
|
||||
- If 10 then too much.
|
||||
|
||||
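A rough sketch of measuring update size as the KL divergence between the policy before and after an update, assuming discrete action probabilities evaluated on the same batch of states. The probabilities below are made up, and the code assumes the new policy gives nonzero probability wherever the old one does.

```python
import math


def categorical_kl(p_old, p_new):
    """KL(old || new) for one state's action distribution."""
    return sum(p * math.log(p / q) for p, q in zip(p_old, p_new) if p > 0.0)


def mean_kl(old_batch, new_batch):
    """Average KL over a batch of states."""
    kls = [categorical_kl(p, q) for p, q in zip(old_batch, new_batch)]
    return sum(kls) / len(kls)


# Hypothetical action probabilities on two states, before and after one update.
old_probs = [[0.5, 0.5], [0.9, 0.1]]
new_probs = [[0.6, 0.4], [0.8, 0.2]]
update_kl = mean_kl(old_probs, new_probs)
print(f"mean KL of this update: {update_kl:.4f}")
```

The result can be read against the rule of thumb above: values around 0.01 are small updates, values around 10 are far too large.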
3. Baseline explained variance.
|
||||
- See if the value function is actually a good predictor of the returns.
|
||||
- If negative, it might be overfitting or noisy.
|
||||
- Likely need to tune hyperparameters
|
||||
|
||||
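A minimal sketch of the explained-variance diagnostic, using the common definition 1 - Var(returns - predictions) / Var(returns). The function name and example numbers are illustrative.

```python
import statistics


def explained_variance(predictions, returns):
    """1.0 = perfect baseline, 0.0 = no better than a constant, < 0 = worse than a constant."""
    var_returns = statistics.pvariance(returns)
    if var_returns == 0.0:
        return float("nan")  # undefined when all returns are identical
    residuals = [r - p for p, r in zip(predictions, returns)]
    return 1.0 - statistics.pvariance(residuals) / var_returns


returns = [1.0, 2.0, 3.0, 4.0]
print(explained_variance([1.0, 2.0, 3.0, 4.0], returns))  # perfect predictions
print(explained_variance([2.5, 2.5, 2.5, 2.5], returns))  # constant predictions
print(explained_variance([4.0, 3.0, 2.0, 1.0], returns))  # anti-correlated predictions
```

A value near or below zero means the baseline is not reducing variance in the policy gradient, which is the case the notes flag for hyperparameter tuning.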
4. Initialize policy
|
||||
- Very important (more so than in supervised learning).
|
||||
- Zero or tiny final layer to maximize entropy
|
||||
- Maximize random exploration in the beginning
|
||||
|
||||
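A small sketch of why a zero final layer maximizes initial entropy, assuming a softmax policy over discrete actions with a linear output layer. The layer sizes, the weight scale of 3.0, and the helper names are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def final_layer(features, weights, biases):
    """Logits = W @ features + b for a linear output layer."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, biases)]


rng = random.Random(0)
n_actions, n_features = 4, 8
features = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
biases = [0.0] * n_actions

zero_w = [[0.0] * n_features for _ in range(n_actions)]
large_w = [[rng.gauss(0.0, 3.0) for _ in range(n_features)] for _ in range(n_actions)]

h_zero = entropy(softmax(final_layer(features, zero_w, biases)))
h_large = entropy(softmax(final_layer(features, large_w, biases)))
print(f"zero final layer:  entropy {h_zero:.3f} (max possible {math.log(n_actions):.3f})")
print(f"large final layer: entropy {h_large:.3f}")
```

With zero weights every state maps to identical logits, so the initial policy is uniform regardless of the input. Large initial weights produce a policy that is already nearly deterministic before any learning has happened.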
## Q-Learning Strategies
|
||||
1. Be careful about replay buffer memory usage.
|
||||
- You might need a huge buffer, so adapt code accordingly.
|
||||
|
||||
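A back-of-the-envelope sketch for replay buffer memory. The capacity, the 84x84 frame size, the 4-frame stack, and the dtypes are assumptions borrowed from common Atari setups, not from the notes.

```python
def replay_buffer_bytes(capacity, obs_shape, bytes_per_element, store_next_obs=True):
    """Memory needed for observations alone (ignores actions, rewards, done flags)."""
    obs_elems = 1
    for dim in obs_shape:
        obs_elems *= dim
    copies = 2 if store_next_obs else 1  # naive buffers store obs and next_obs separately
    return capacity * obs_elems * bytes_per_element * copies


GB = 1e9
naive_float32 = replay_buffer_bytes(1_000_000, (84, 84, 4), bytes_per_element=4)
naive_uint8 = replay_buffer_bytes(1_000_000, (84, 84, 4), bytes_per_element=1)
single_frames = replay_buffer_bytes(1_000_000, (84, 84), bytes_per_element=1,
                                    store_next_obs=False)
print(f"float32, stacked, obs + next_obs: {naive_float32 / GB:.1f} GB")
print(f"uint8, stacked, obs + next_obs:   {naive_uint8 / GB:.1f} GB")
print(f"uint8, single frames stored once: {single_frames / GB:.1f} GB")
```

Under these assumptions, storing each frame once as `uint8` and rebuilding stacks at sample time is the difference between fitting in RAM and not.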
2. Play with learning rate schedule.
|
||||
|
||||
3. If it converges slowly or has a slow warm-up period in the beginning
|
||||
- Be patient... DQN converges VERY slowly.
|
||||
|
||||
|
||||
## Bonus from [Andrej Karpathy](http://cs.stanford.edu/people/karpathy/):
|
||||
1. A good feature can be to take the difference between two frames.
|
||||
- This delta vector can highlight slight state changes otherwise difficult to distinguish.
|
||||
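A toy sketch of the frame-difference feature, using nested lists of floats as a stand-in for image frames. The tiny 3x4 "frames" and the pixel values are made up for illustration.

```python
def frame_delta(prev_frame, curr_frame):
    """Element-wise difference curr - prev between two equally sized frames."""
    return [[c - p for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev_frame, curr_frame)]


background = 0.5
prev_frame = [[background] * 4 for _ in range(3)]
curr_frame = [[background] * 4 for _ in range(3)]
prev_frame[1][1] = 1.0  # hypothetical ball position at time t-1
curr_frame[1][2] = 1.0  # ball has moved one pixel to the right at time t

delta = frame_delta(prev_frame, curr_frame)
for row in delta:
    print(row)
```

The static background cancels to zero and only the moving object remains, with opposite signs at its old and new positions. With real `uint8` image arrays, cast to a signed or float type before subtracting so negative differences do not wrap around.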
|
||||
|
||||
|
||||
|
||||
|
||||
@@ -0,0 +1,586 @@
|
||||
===
|
||||
title: ML Debugging Folklore - Evidence Map
|
||||
author: ml_debug SKILL synthesis
|
||||
date: 2026-03-05
|
||||
model:
|
||||
mode: strict
|
||||
===
|
||||
|
||||
// This argdown maps claims from the ML Debugging Folklore SKILL.md
|
||||
// back to sourced quotes across 21 evidence files. Each claim
|
||||
// is traced to 2-3 independent sources with verbatim quotes.
|
||||
//
|
||||
// Credence guide:
|
||||
// 0.90 = canonical textbook / primary algorithm author
|
||||
// 0.85 = peer-reviewed paper / authoritative course
|
||||
// 0.80 = established practitioner blog, widely cited
|
||||
// 0.70 = popular blog / course notes
|
||||
// 0.60 = reddit thread / community consensus
|
||||
|
||||
[Folklore Reliable]: ML debugging folklore -- practitioner heuristics
|
||||
transmitted via talks, blog posts, and course materials -- provides
|
||||
reliable, independently corroborated guidance for diagnosing and
|
||||
fixing ML training failures.
|
||||
+ <Normalization Consensus>
|
||||
+ <Isolation Testing Consensus>
|
||||
+ <Bug First Consensus>
|
||||
+ <Seed Variance Evidence>
|
||||
+ <Batch Size Evidence>
|
||||
+ <Reward Engineering Evidence>
|
||||
+ <Logging Consensus>
|
||||
+ <Reference Impl Consensus>
|
||||
+ <Anomaly Pursuit Evidence>
|
||||
+ <Random Search Evidence>
|
||||
- <Source Age Concern>
|
||||
- <RL Specificity Concern>
|
||||
|
||||
|
||||
# Section 1: General ML Debugging
|
||||
|
||||
## Normalize Inputs
|
||||
|
||||
<Normalization Consensus>
|
||||
|
||||
(1) [Schulman Normalize]: Schulman recommends standardizing all
|
||||
observations via running mean/std, clipping, and rescaling
|
||||
rewards without shifting the mean. #observation
|
||||
[Schulman 2016 Slides](http://joschu.net/docs/nuts-and-bolts.pdf)
|
||||
[evidence](evidence/joschu_nuts_and_bolts.md#L120-L125)
|
||||
> (cid:73) If observations have unknown range, standardize
|
||||
> (cid:73) Compute running estimate of mean and standard deviation
|
||||
> (cid:73) = clip((x −µ)/σ,−10,10)
|
||||
> **(cid:73) Rescale the rewards, but don't shift mean, as that affects agent's will to live**
|
||||
> (cid:73) Standardize prediction targets (e.g., value functions) the same way
|
||||
{reason: "Schulman is PPO/TRPO author; these slides are from his canonical 'Nuts and Bolts' talk, widely adopted in OpenAI baselines and stable-baselines3. Note: PDF conversion artifacts (cid:73 = bullet markers) present in evidence file.", credence: 0.90}
|
||||
(2) [FSDL Normalize]: FSDL Lecture 7 lists normalizing input data as
|
||||
a default step in the 'Start Simple' phase. #observation
|
||||
[FSDL Spring 2021 Lecture 7](https://fullstackdeeplearning.com/)
|
||||
[evidence](evidence/fsdl_spring2021_lecture7.md#L334-L338)
|
||||
> The next step is to **normalize the input data**, subtracting the mean
|
||||
> and dividing by the variance. Note that for images, it's fine to scale
|
||||
> values to 0-1 or -0.5 to 0.5 (for example, by dividing by 255).
|
||||
{reason: "FSDL is Josh Tobin's industry-focused course; independent from Schulman's RL lineage", credence: 0.80}
|
||||
(3) [Slavv Normalize]: Ivanov's '37 Reasons' checklist lists
|
||||
standardization as item #12 and preprocessing consistency
|
||||
as items #14-#15. #observation
|
||||
[Slavv 2017 - 37 Reasons](https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607)
|
||||
[evidence](evidence/slavv_37_reasons_nn.md#L122-L132)
|
||||
> **12. Standardize the features.**
|
||||
> Did you standardize your input to have zero mean and unit variance?
|
||||
> 13. Do you have too much data augmentation?
|
||||
> Augmentation has a regularizing effect. Too much of this combined with
|
||||
> other forms of regularization (weight L2, dropout, etc.) can cause
|
||||
> the net to underfit.
|
||||
> **14. Check the preprocessing of your pretrained model.**
|
||||
> If you are using a pretrained model, make sure you are using the same
|
||||
> normalization and preprocessing as the model was when training. For
|
||||
> example, should an image pixel be in the range (0, 1), (-1, 1) or (0, 255)?
|
||||
{reason: "popular debugging checklist (2017), independent practitioner; items 12-14 all address normalization/preprocessing", credence: 0.70}
|
||||
----
|
||||
(4) [Normalize Robust]: Normalizing inputs to mean=0, std=1 is a
|
||||
robustly supported heuristic across 3 independent lineages
|
||||
(RL research, industry courses, practitioner checklists).
|
||||
{reason: "three independent sources from different sub-communities all converge on the same recommendation; community adoption in major frameworks confirms", inference: 0.90}
|
||||
+> [Folklore Reliable]
|
||||
|
||||
|
||||
## Overfit First / Test in Isolation
|
||||
|
||||
<Isolation Testing Consensus>
|
||||
|
||||
(1) [CS231n Overfit]: CS231n lists overfitting a tiny subset as the
|
||||
most important sanity check before full training. #observation
|
||||
[CS231n Neural Networks 3](https://cs231n.github.io/neural-networks-3/)
|
||||
[evidence](evidence/cs231n_neural_networks_3.md#L87-L89)
|
||||
> **Overfit a tiny subset of data**. Lastly and most importantly, before
|
||||
> training on the full dataset try to train on a tiny portion (e.g. 20
|
||||
> examples) of your data and make sure you can achieve zero cost. For this
|
||||
> experiment it's also best to set regularization to zero, otherwise this
|
||||
> can prevent you from getting zero cost. **Unless you pass this sanity
|
||||
> check with a small dataset it is not worth proceeding to the full
|
||||
> dataset.**
|
||||
{reason: "CS231n (Karpathy/Li/Johnson) is the canonical deep learning course; 'most importantly' framing shows high confidence in this heuristic", credence: 0.85}
|
||||
(2) [FSDL Overfit]: FSDL positions single-batch overfitting as the
|
||||
step immediately after getting the model to run. #observation
|
||||
[FSDL Spring 2021 Lecture 7](https://fullstackdeeplearning.com/)
|
||||
[evidence](evidence/fsdl_spring2021_lecture7.md#L433-L451)
|
||||
> After getting your model to run, the next thing you need to do is to
|
||||
> **overfit a single batch of data**. This is a heuristic that can catch
|
||||
> an absurd number of bugs. This really means that you want to drive your
|
||||
> training error arbitrarily close to 0. There are a few things that can
|
||||
> happen when you try to overfit a single batch and it fails:
|
||||
> **Error goes up**: Commonly, this is due to a flip sign somewhere in
|
||||
> the loss function/gradient. **Error explodes**: This is usually a
|
||||
> numerical issue but can also be caused by a high learning rate.
|
||||
{reason: "FSDL (Josh Tobin) independent from CS231n; 'absurd number of bugs' is strong practitioner endorsement", credence: 0.80}
|
||||
(3) [Goodfellow Overfit]: Goodfellow et al. state that inability to
|
||||
fit a single example indicates a software defect. #observation
|
||||
[Goodfellow et al. Deep Learning Ch11](https://www.deeplearningbook.org/)
|
||||
[evidence](evidence/goodfellow_ch11_practical_methodology.md#L218)
|
||||
> Fit a tiny dataset: If you have high error on the training set,
|
||||
> determine whether it is due to genuine underfitting or due to a
|
||||
> software defect. **Usually even small models can be guaranteed to be
|
||||
> able fit a sufficiently small dataset.** For example, a classification
|
||||
> dataset with only one example can be fit just by setting the biases of
|
||||
> the output layer correctly. Usually if you cannot train a classifier to
|
||||
> correctly label a single example... **there is a software defect
|
||||
> preventing successful optimization on the training set.**
|
||||
{reason: "Goodfellow/Bengio/Courville textbook is the canonical deep learning reference; independent from CS231n and FSDL", credence: 0.90}
|
||||
----
|
||||
(4) [Overfit First Robust]: 'Overfit a tiny dataset as sanity check'
|
||||
is robustly supported across 3 independent authoritative sources.
|
||||
{reason: "canonical textbook + two major courses all prescribe the same test; consistent across supervised and RL contexts", inference: 0.90}
|
||||
+> [Folklore Reliable]
|
||||
|
||||
|
||||
## Assume You Have a Bug
|
||||
|
||||
<Bug First Consensus>
|
||||
|
||||
(1) [Jones Bug]: Andy Jones argues that RL practitioners are reluctant
|
||||
to admit bugs, but bugs are the most common cause of failure. #observation
|
||||
[Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html)
|
||||
[evidence](evidence/andyljones_rl_debugging.md#L174-L182)
|
||||
> When their RL implementation doesn't work, people are often keen to
|
||||
> either (a) adjust their network architecture or (b) adjust their
|
||||
> hyperparameters. On the other hand, they're reluctant to say they've
|
||||
> got a bug. **Most often, it turns out they've got a bug.** Why bugs
|
||||
> are so much more common in RL code is discussed above, but there's
|
||||
> another advantage to assuming you've got a bug: bugs are a damn sight
|
||||
> faster to find and fix than validating that your new architecture is
|
||||
> an improvement over the old one.
|
||||
{reason: "Jones is an experienced RL practitioner; this blog post is widely cited in the RL community and is the primary source for structured RL debugging advice", credence: 0.80}
|
||||
(2) [Goodfellow Bug Mask]: Goodfellow et al. warn that neural net
|
||||
components can adapt to compensate for bugs, masking them. #observation
|
||||
[Goodfellow et al. Deep Learning Ch11](https://www.deeplearningbook.org/)
|
||||
[evidence](evidence/goodfellow_ch11_practical_methodology.md#L194-L206)
|
||||
> When a machine learning system performs poorly, it is usually difficult
|
||||
> to tell whether the poor performance is intrinsic to the algorithm
|
||||
> itself or whether there is a bug in the implementation of the
|
||||
> algorithm. **If one part is broken, the other parts can adapt and
|
||||
> still achieve roughly acceptable performance.** The bug may not be
|
||||
> apparent just from examining the output of the model though. Depending
|
||||
> on the distribution of the input, the weights may be able to adapt to
|
||||
> compensate for the negative biases.
|
||||
{reason: "canonical textbook; the 'adaptive compensation' mechanism explains why ML bugs are uniquely hard to detect", credence: 0.90}
|
||||
(3) [Jones Loss Herring]: Jones explicitly warns that loss curves
|
||||
don't localize errors and are therefore a red herring for
|
||||
debugging. #observation
|
||||
[Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html)
|
||||
[evidence](evidence/andyljones_rl_debugging.md#L184-L188)
|
||||
> When someone's RL implementation isn't working, they *luuuuuurv* to
|
||||
> copy-paste a screenshot of their loss curve to you. The problem with
|
||||
> using the loss curve as an indicator of correctness is somewhat that
|
||||
> it's not reliable, but mostly because **it doesn't localise errors.**
|
||||
> The shape of your loss curve says very little about where in your code
|
||||
> you've messed up, and so says very little about what you need to
|
||||
> change to get things working.
|
||||
{reason: "same source as (1) but independent observation about loss curves specifically; consistent with Goodfellow's point about adaptive masking", credence: 0.80}
|
||||
----
|
||||
(4) [Bug First Robust]: 'Assume you have a bug' is well-supported:
|
||||
bugs are common, adaptive compensation masks them, and loss
|
||||
curves don't help localize them.
|
||||
{reason: "Jones (practitioner) and Goodfellow (textbook) independently describe the same mechanism: ML systems mask bugs through adaptation. Loss curves give false comfort.", inference: 0.85}
|
||||
+> [Folklore Reliable]
|
||||
|
||||
|
||||
# Section 2: RL-Specific Debugging
|
||||
|
||||
## Seed Variance
|
||||
|
||||
<Seed Variance Evidence>
|
||||
|
||||
(1) [Schulman Seeds]: Schulman demonstrates that 3 seemingly different
|
||||
algorithms on 7 MuJoCo tasks were actually the same algorithm
|
||||
with different random seeds. #observation
|
||||
[Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ)
|
||||
[evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L377-L412)
|
||||
> you can see seven different tasks these are the Gym MuJoCo tasks
|
||||
> like half cheetah and hopper and so on and you have three different
|
||||
> algorithms here the red one the green one and the blue one... but as
|
||||
> it turns out **these are all the exact same algorithms and just random
|
||||
> seeds different random seeds** so it's easy to imagine that you're
|
||||
> just looking at one of these problems then you see that blue curve and
|
||||
> you think you get really excited than you think you found some huge
|
||||
> improvement to your algorithm but it's really that you just got a
|
||||
> lucky seed... **even if you had like 20 seeds here there's a still a
|
||||
> pretty big error bar**
|
||||
{reason: "Schulman (PPO author) showing his own data; this demonstration is one of the most cited examples of RL noise in the community", credence: 0.90}
|
||||
(2) [Henderson Seeds]: Henderson et al. show that same hyperparameters
|
||||
with different seeds produce statistically different learning
|
||||
curves on standard benchmarks. #observation
|
||||
[Henderson et al. 2018 - Deep RL That Matters](https://arxiv.org/abs/1709.06560)
|
||||
[evidence](evidence/henderson_2018_deep_rl_matters.md#L233-L235)
|
||||
> We perform 10 experiment trials, for the same hyperparameter
|
||||
> configuration, only varying the random seed across all 10 trials. We
|
||||
> then split the trials into two sets of 5 and average these two
|
||||
> groupings together. As shown in Figure 5, we find that **the
|
||||
> performance of algorithms can be drastically different.** We
|
||||
> demonstrate that the variance between runs is enough to create
|
||||
> statistically different distributions just from varying random seeds.
|
||||
> Our experiment with random seeds shows that this can be potentially
|
||||
> misleading.
|
||||
{reason: "peer-reviewed AAAI 2018 paper; systematic experimental study with statistical testing (t-test results reported); independent from Schulman's demo", credence: 0.85}
|
||||
(3) [Irpan Seeds]: Irpan reports a 30% failure rate on Pendulum
|
||||
across 10 seeds with identical hyperparameters, and notes
|
||||
that this would be considered a bug in supervised learning. #observation
|
||||
[Alex Irpan - RL Hard](https://www.alexirpan.com/2018/02/14/rl-hard.html)
|
||||
[evidence](evidence/alexirpan_rl_hard.md#L651-L678)
|
||||
> Here is a plot of performance, after I fixed all the bugs. Each line
|
||||
> is the reward curve from one of 10 independent runs. Same
|
||||
> hyperparameters, the only difference is the random seed. **Seven of
|
||||
> these runs worked. Three of these runs didn't. A 30% failure rate
|
||||
> counts as working.** Look, there's variance in supervised learning
|
||||
> too, but it's rarely this bad. If my supervised learning code failed
|
||||
> to beat random chance 30% of the time, I'd have super high confidence
|
||||
> there was a bug in data loading or training. **If my reinforcement
|
||||
> learning code does no better than random, I have no idea if it's a
|
||||
> bug, if my hyperparameters are bad, or if I simply got unlucky.**
|
||||
{reason: "Google Brain engineer's direct experience; the SL vs RL comparison makes the point vivid; independent from Schulman and Henderson", credence: 0.80}
|
||||
----
|
||||
(4) [Seed Variance Robust]: RL seed variance is extreme -- same algo
|
||||
with different seeds can look like different algorithms. This
|
||||
is robustly demonstrated across 3 independent sources with
|
||||
quantitative evidence.
|
||||
{reason: "primary algorithm author (Schulman), peer-reviewed study (Henderson), and independent practitioner (Irpan) all demonstrate the same effect with data; the 30% failure rate on Pendulum is a striking data point", inference: 0.90}
|
||||
+> [Folklore Reliable]
|
||||
|
||||
|
||||
## Batch Size
|
||||
|
||||
<Batch Size Evidence>
|
||||
|
||||
(1) [Schulman Batch]: Schulman warns that batch sizes too small
|
||||
cause noise to overwhelm signal, citing his own TRPO debugging
|
||||
experience needing 100K timesteps per batch. #observation
|
||||
[Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ)
|
||||
[evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L279-L307)
|
||||
> sometimes you should use more samples than you think you're going
|
||||
> to need because usually things just work better when you have more
|
||||
> samples almost always... if you want to just get something working at
|
||||
> all often you need to use bigger batch sizes and you thought because
|
||||
> **if your batch size is too small than the noise will overwhelm the
|
||||
> signal and you won't learn anything**
|
||||
{reason: "Schulman's personal debugging story: TRPO needed 100K timesteps per batch (documented in slides). Direct experience from algorithm author.", credence: 0.90}
|
||||
(2) [Schulman Batch Slides]: Schulman's slides give specific batch
|
||||
size numbers for TRPO and DQN on Atari. #observation
|
||||
[Schulman 2016 Slides](http://joschu.net/docs/nuts-and-bolts.pdf)
|
||||
[evidence](evidence/joschu_nuts_and_bolts.md#L61-L72)
|
||||
> Run with More Samples Than Expected. **Early in tuning process, may
|
||||
> need huge number of samples.** Don't be deterred by published work.
|
||||
> Examples: TRPO on Atari: 100K timesteps per batch for KL=0.01.
|
||||
> DQN on Atari: update freq=10K, replay buffer size=1M.
|
||||
{reason: "same author as (1) but written slides with specific numbers; corroborates the talk", credence: 0.90}
|
||||
(3) [McCandlish Critical Batch]: McCandlish et al. derive a critical
|
||||
batch size that predicts speed/efficiency tradeoffs, finding
|
||||
it grows during training as gradients shrink. #observation
|
||||
[McCandlish et al. 2018](https://arxiv.org/abs/1812.06162)
|
||||
[evidence](evidence/mccandlish_2018_large_batch.md#L180-L196)
|
||||
> Equation 2.7 nevertheless predicts the dependence of training speed
|
||||
> on batch size remarkably well, even for full training runs that range
|
||||
> over many points in the loss landscape. **By averaging Equation 2.7
|
||||
> over multiple optimization steps, we find a simple relationship
|
||||
> between training speed and data efficiency.** Here, S and Smin
|
||||
> represent the actual and minimum possible number of steps taken to
|
||||
> reach a specified level of performance, respectively.
|
||||
{reason: "OpenAI paper providing theoretical foundation for batch size effects; peer-reviewed; explains WHY Schulman's observation holds", credence: 0.85}
|
||||
----
|
||||
(4) [Batch Size Robust]: 'Use bigger batches than you think' is
|
||||
supported by both practitioner experience and theoretical analysis.
|
||||
{reason: "Schulman's empirical observation is explained by McCandlish's noise scale theory; the critical batch size concept provides a principled way to reason about it", inference: 0.85}
|
||||
+> [Folklore Reliable]
|
||||
|
||||
|
||||
## Reward Engineering
|
||||
|
||||
<Reward Engineering Evidence>
|
||||
|
||||
(1) [Schulman Reward Mean]: Schulman warns that shifting reward mean
|
||||
changes the agent's 'will to live' -- how long it wants to
|
||||
survive -- thereby changing the problem. #observation
|
||||
[Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ)
|
||||
[evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L513-L519)
|
||||
> for the rewards I'd recommend rescaling it but not shifting them
|
||||
> because **that affects the agents will to live so if you shift the
|
||||
> mean reward that'll affect whether how long it wants to survive
|
||||
> you're actually changing the problem**
|
||||
{reason: "Schulman explains the causal mechanism: reward mean shift changes the MDP's optimal policy, not just scaling. This is not obvious to beginners.", credence: 0.90}
|
||||
(2) [Jones Reward Scale]: Jones identifies reward scaling as the single
|
||||
most common issue for RL newbies, and warns against adaptive
|
||||
reward scaling as extra nonstationarity. #observation
|
||||
[Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html)
|
||||
[evidence](evidence/andyljones_rl_debugging.md#L115-L119)
|
||||
> The single most common issue for newbies writing custom RL
|
||||
> implementations is that the targets arriving at their neural net
|
||||
> aren't [-1, +1]. Actually, anything [-.1, +.1]ish to [-10, +10]ish
|
||||
> is good. Having read that, you might be tempted to write some
|
||||
> adaptive scheme to scale your rewards for you. **Don't: it's an extra
|
||||
> bit of nonstationarity that'll make life more difficult. Just
|
||||
> hand-scale, hand-clip the rewards** from your env so that the targets
|
||||
> passed to your network are sensible.
|
||||
{reason: "Jones independently converges on same advice as Schulman; labels it 'single most common issue'; explicitly warns against adaptive schemes", credence: 0.80}
|
||||
(3) [Henderson Reward Scale]: Henderson et al. show that multiplying
|
||||
rewards by a scalar causes significant performance differences
|
||||
in DDPG, with inconsistent effects across environments. #observation
|
||||
[Henderson et al. 2018 - Deep RL That Matters](https://arxiv.org/abs/1709.06560)
|
||||
[evidence](evidence/henderson_2018_deep_rl_matters.md#L181)
|
||||
> Reward rescaling has been used in several recent works (Duan et al.
|
||||
> 2016; Gu et al. 2016) to improve results for DDPG. This involves
|
||||
> simply multiplying the rewards generated from an environment by some
|
||||
> scalar (rhat = r*sigma) for training. Often, these works report using a
|
||||
> reward scale of sigma = 0.1. In Atari domains, this is akin to clipping
|
||||
> the rewards to (0, 1). **By intuition, in gradient based methods (as
|
||||
> used in most deep RL) a large and sparse output scale can result in
|
||||
> problems regarding saturation and inefficiency in learning** (LeCun
|
||||
> et al. 2012; Glorot and Bengio 2010; Vincent, de Brebisson, and
|
||||
> Bouthillier 2015).
|
||||
{reason: "peer-reviewed AAAI paper; provides gradient-based mechanism explaining WHY reward scale matters; cites 3 supporting references", credence: 0.85}
|
||||
----
|
||||
(4) [Reward Robust]: Reward scaling advice (hand-scale, don't shift
|
||||
mean, target [-10, +10]) is well-supported across practitioner
|
||||
experience, RL research, and controlled experiments.
|
||||
{reason: "Schulman (causal mechanism), Jones (practitioner experience), Henderson (experimental validation) all converge; the 'will to live' explanation is especially compelling", inference: 0.85}
|
||||
+> [Folklore Reliable]
|
||||
|
||||
|
||||
## Reference Implementations
|
||||
|
||||
<Reference Impl Consensus>
|
||||
|
||||
(1) [Jones Ref Impl]: Jones calls writing RL from scratch 'the most
|
||||
catastrophically self-sabotaging thing you can do' as a
|
||||
newcomer. #observation
|
||||
[Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html)
|
||||
[evidence](evidence/andyljones_rl_debugging.md#L153-L165)
|
||||
> **If you're new to reinforcement learning, writing things from scratch
|
||||
> is the most catastrophically self-sabotaging thing you can do.** There
|
||||
> is an alluring masochism in writing things from scratch. There's
|
||||
> concrete value in it too: by writing things from scratch, you're both
|
||||
> forced to fully understand what you're doing and you're more likely to
|
||||
> come up with a fresh perspective. **In reinforcement learning, these
|
||||
> benefits are not worth it.** At all. As discussed above, the nature
|
||||
> of RL work makes it extremely hard for you to self-correct.
|
||||
{reason: "Jones provides three graduated risk levels for using references (out-of-box, components, one-eye-on); concrete implementation lists (spinningup, stable-baselines3, cleanrl, OpenSpiel)", credence: 0.80}
|
||||
(2) [Rahtz Ref]: Rahtz spent 8 months reproducing a Deep RL paper and
|
||||
found that even small normalization bugs can hide for months,
|
||||
supporting the case for starting from reference code. #observation
|
||||
[Matthew Rahtz - Reproducing Deep RL](http://amid.fish/reproducing-deep-rl)
|
||||
[evidence](evidence/amid_fish_reproducing_deep_rl.md#L36-L52)
|
||||
> reinforcement learning turned out to be a lot trickier than expected.
|
||||
> A big part of it is that right now, reinforcement learning is really
|
||||
> sensitive. There are a lot of details to get just right, and **if you
|
||||
> don't get them right, it can be difficult to diagnose where you've
|
||||
> gone wrong.** After finishing the basic implementation, training runs
|
||||
> just weren't succeeding... it turned out to be because of **problems
|
||||
> with normalization of rewards and pixel data at a key stage.**
|
||||
{reason: "first-person account of 2-month debugging session caused by normalization bug; vivid illustration of why references save time", credence: 0.75}
|
||||
----
|
||||
(3) [Ref Impl Robust]: Starting from reference implementations is
|
||||
strongly supported by practitioner experience: the self-correction
|
||||
mechanisms in RL are too weak for solo implementation.
|
||||
{reason: "Jones's forceful advice is validated by Rahtz's 8-month experience; the underlying theory (RL's weak error signals) provides a causal explanation", inference: 0.85}
|
||||
+> [Folklore Reliable]
|
||||
|
||||
|
||||
## Pursue Anomalies
|
||||
|
||||
<Anomaly Pursuit Evidence>
|
||||
|
||||
(1) [Jones Anomaly]: Jones recommends chasing anomalies immediately,
|
||||
calling it 'one of the most powerful ways to debug'. #observation
|
||||
[Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html)
|
||||
[evidence](evidence/andyljones_rl_debugging.md#L101-L109)
|
||||
> If you ever see a plot or a behaviour that just *seems weird*, chase
|
||||
> right after it! Do not - do *not* - just 'hope it goes away'.
|
||||
> **Chasing anomalies is one of the most powerful ways to debug your
|
||||
> system**, because if you've noticed a problem without having had to go
|
||||
> look for it, that means it's a *really big problem*. It's really
|
||||
> tempting to think that the cool extra functionality you were planning
|
||||
> to write today might just magically fix this anomalous behaviour.
|
||||
> It won't. Give up on your plan for the day and chase the anomaly
|
||||
> instead.
|
||||
{reason: "strong practitioner endorsement; the causal reasoning (visible anomaly = big problem) is sound", credence: 0.80}
|
||||
(2) [Rahtz Confusion]: Rahtz independently converges on the same
|
||||
advice, calling it 'noticing confusion' -- following confusion
|
||||
led to finding a normalization bug. #observation
|
||||
[Matthew Rahtz - Reproducing Deep RL](http://amid.fish/reproducing-deep-rl)
|
||||
[evidence](evidence/amid_fish_reproducing_deep_rl.md#L77-L102)
|
||||
> A corollary is to **try and be as sensitive as possible in noticing
|
||||
> confusion**. There were a lot of points in this project where the
|
||||
> only clues came from noticing some small thing that didn't make sense.
|
||||
> It was only by following that confusion and realising that taking the
|
||||
> difference between frames zeroed out the background that gave the
|
||||
> hint of a problem with normalization. Learn to **recognise what
|
||||
> confusion *feels* like**... **commit yourself to always investigate
|
||||
> whenever you notice confusion.**
|
||||
{reason: "independent practitioner arriving at same principle through personal experience; the normalization bug was only found this way", credence: 0.75}
|
||||
----
|
||||
(3) [Anomaly Robust]: 'Pursue anomalies immediately' is supported by
|
||||
two independent practitioners who both found it was the key
|
||||
debugging strategy for hard-to-diagnose issues.
|
||||
{reason: "Jones and Rahtz independently describe the same strategy with different language (anomalies vs confusion) and different stories but the same conclusion", inference: 0.85}
|
||||
+> [Folklore Reliable]
|
||||
|
||||
|
||||
## Comprehensive Logging
|
||||
|
||||
<Logging Consensus>
|
||||
|
||||
(1) [Rahtz Log]: Rahtz recommends logging all metrics you can to
|
||||
maximize diagnostic evidence per run. #observation
|
||||
[Matthew Rahtz - Reproducing Deep RL](http://amid.fish/reproducing-deep-rl)
|
||||
[evidence](evidence/amid_fish_reproducing_deep_rl.md#L197-L201)
|
||||
> First, adopting an attitude of **log all the metrics you can** to
|
||||
> maximise the amount of evidence you gather on each run. There are
|
||||
> obvious metrics like training/validation accuracy, but it might also
|
||||
> be worth spending a good chunk of time at the start of the project
|
||||
> brainstorming and researching which other metrics might be important
|
||||
> for diagnosing potential problems.
|
||||
{reason: "learned from 8-month reproduction attempt; specifically regrets not logging policy entropy earlier", credence: 0.75}
|
||||
(2) [Goodfellow Monitor]: Goodfellow et al. recommend visualizing
|
||||
activation and gradient statistics collected over many
|
||||
training iterations. #observation
|
||||
[Goodfellow et al. Deep Learning Ch11](https://www.deeplearningbook.org/)
|
||||
[evidence](evidence/goodfellow_ch11_practical_methodology.md#L238)
|
||||
> **It is often useful to visualize statistics of neural network
|
||||
> activations and gradients, collected over a large amount of training
|
||||
> iterations.** The preactivation value of hidden units can tell us if
|
||||
> the units saturate, or how often they do... it is useful to compare
|
||||
> the magnitude of parameter gradients to the magnitude of the
|
||||
> parameters themselves.
|
||||
{reason: "canonical textbook; independent from Rahtz; specifies what to log (activations, gradients, parameter magnitudes)", credence: 0.90}
|
||||
----
|
||||
(3) [Logging Robust]: Comprehensive logging is unanimously recommended
|
||||
across textbooks, courses, and practitioner accounts.
|
||||
{reason: "Goodfellow (textbook), Rahtz (practitioner), and multiple other sources (FSDL, reddit threads) all emphasize logging as foundational; no dissenting voice found", inference: 0.90}
|
||||
+> [Folklore Reliable]
|
||||
|
||||
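The Goodfellow quote above suggests comparing the magnitude of parameter gradients to the magnitude of the parameters themselves. A minimal sketch of that one diagnostic follows, using plain Python lists in place of real tensors. The function names (`l2_norm`, `grad_to_param_ratio`, `summarize`) and the sample numbers are hypothetical and do not come from any cited source.

```python
import math

def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))

def grad_to_param_ratio(params, grads, eps=1e-12):
    """Ratio ||grad|| / ||param|| for one flattened parameter tensor."""
    return l2_norm(grads) / (l2_norm(params) + eps)

def summarize(history):
    """Min / mean / max of a metric logged over many iterations."""
    return {
        "min": min(history),
        "mean": sum(history) / len(history),
        "max": max(history),
    }

# Hypothetical flattened weights and gradients for one layer over three steps.
steps = [
    ([3.0, 4.0], [0.03, 0.04]),  # small gradient relative to the weights
    ([3.0, 4.0], [0.3, 0.4]),    # ten times larger gradient
    ([3.0, 4.0], [0.0, 0.0]),    # dead gradient: worth chasing as an anomaly
]
ratios = [grad_to_param_ratio(p, g) for p, g in steps]
stats = summarize(ratios)
print(ratios, stats)
```

In a real training loop the same ratio would be computed per layer and written to whatever logger the run already uses, so that a collapse to zero or a sudden spike shows up as a curve rather than a guess.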
|
||||
## Random HP Search
|
||||
|
||||
<Random Search Evidence>
|
||||
|
||||
(1) [CS231n Random]: CS231n cites Bergstra and Bengio (2012) showing
|
||||
random search is more efficient than grid search for
|
||||
hyperparameter optimization. #observation
|
||||
[CS231n Neural Networks 3](https://cs231n.github.io/neural-networks-3/)
|
||||
[evidence](evidence/cs231n_neural_networks_3.md#L306-L312)
|
||||
> **Prefer random search to grid search.** As argued by Bergstra and
|
||||
> Bengio in Random Search for Hyper-Parameter Optimization, "randomly
|
||||
> chosen trials are more efficient for hyper-parameter optimization
|
||||
> than trials on a grid". It is very often the case that **some of the
|
||||
> hyperparameters matter much more than others**. Performing random
|
||||
> search rather than grid search allows you to much more precisely
|
||||
> discover good values for the important ones.
|
||||
{reason: "CS231n citing peer-reviewed JMLR paper (Bergstra & Bengio 2012); the intuition (some HPs matter more) is well-established", credence: 0.85}
|
||||
(2) [Schulman Random]: Schulman endorses random sampling + human
|
||||
regression as his preferred HP search method. #observation
|
||||
[Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ)
|
||||
[evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L995-L1013)
|
||||
> favorite hyper parameter optimization framework... I just like to
|
||||
> use the **uniform random sampling yeah that works really well I mean
|
||||
> you just run a bunch of experiments with random hyper parameters and
|
||||
> then you just look at the results the next day and do some regression
|
||||
> to figure out which parameters actually mattered** and then you've
|
||||
> run another experiment with better parameter ranges... I use the
|
||||
> human version of it.
|
||||
{reason: "Schulman's personal method; independent endorsement of random search from RL context (CS231n focuses on supervised)", credence: 0.85}
|
||||
----
|
||||
(3) [Random Search Robust]: Random HP search + manual analysis is
|
||||
supported by both theory (Bergstra & Bengio) and practitioner
|
||||
preference (Schulman).
|
||||
{reason: "peer-reviewed paper provides theoretical justification; leading practitioner independently uses same method; the intuition (some HPs matter more than others) is universally recognized", inference: 0.85}
|
||||
+> [Folklore Reliable]
|
||||
|
||||
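The workflow Schulman describes (sample hyperparameters at random, run the experiments, then regress to see which parameters mattered) can be sketched as below. Everything here is an illustrative assumption: the two hyperparameters, their ranges, and `toy_score`, which stands in for a real training run and is deliberately linear in `log10_lr` so that a simple correlation can detect it. Real response curves are often non-monotonic, so a scatter plot per hyperparameter is a sensible companion to any linear fit.

```python
import math
import random

def sample_config(rng):
    """Draw one random configuration; the ranges are illustrative assumptions."""
    return {
        "log10_lr": rng.uniform(-5.0, -1.0),  # log-uniform learning rate
        "dropout": rng.uniform(0.0, 0.5),
    }

def toy_score(cfg, rng):
    """Stand-in for a training run: only the learning rate matters, plus noise."""
    return cfg["log10_lr"] + rng.gauss(0.0, 0.1)

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of floats."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(200)]
scores = [toy_score(cfg, rng) for cfg in configs]

# Crude "which parameters mattered" check: correlation of each HP with the score.
importance = {
    name: pearson([cfg[name] for cfg in configs], scores)
    for name in ("log10_lr", "dropout")
}
print(importance)
```

The next round would narrow the sampling range of the hyperparameters that show a strong relationship and leave the others wide or fixed.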
|
||||
## Probe Environments
|
||||
|
||||
<Probe Env Evidence>
|
||||
|
||||
(1) [Jones Probes]: Jones describes a sequence of probe environments
|
||||
that progressively isolate value network, backprop, reward
|
||||
discounting, and policy errors. #observation
|
||||
[Andy Jones - RL Debugging](https://andyljones.com/posts/rl-debugging.html)
|
||||
[evidence](evidence/andyljones_rl_debugging.md#L204-L221)
|
||||
> Instead, construct environments that do localise errors. In a recent
|
||||
> project, I used 1. **One action, zero observation, one timestep long,
|
||||
> +1 reward every timestep**: This isolates the value network. 2. **One
|
||||
> action, random +1/-1 observation, one timestep long, obs-dependent
|
||||
> +1/-1 reward every time**: If my agent can learn the value in (1.) but
|
||||
> not this one, it must be that backpropagation through my network is
|
||||
> broken. 3. **One action, zero-then-one observation, two timesteps
|
||||
> long, +1 reward at the end**: If my agent can learn the value in (2.)
|
||||
> but not this one, it must be that my reward discounting is broken.
|
||||
> You get the idea: (1.) is the simplest possible environment, and
|
||||
> **each new env adds the smallest possible bit of functionality. If the
|
||||
> old env works but the successor doesn't, that gives you a lot of
|
||||
> information about where the problem is.**
|
||||
{reason: "Jones is the only source with this detailed probe env methodology; but the technique is a direct application of the general 'test in isolation' principle (Goodfellow, CS231n) to RL specifically", credence: 0.80}
|
||||
----
|
||||
(2) [Probe Env Useful]: Probe environments are a practical application
|
||||
of component isolation testing for RL, where standard envs
|
||||
like CartPole don't localize errors.
|
||||
{reason: "single source but the methodology is a rigorous instantiation of widely-supported isolation testing; each probe takes seconds, making it fast to verify", inference: 0.80}
|
||||
+> [Folklore Reliable]
|
||||
|
||||
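The three probe environments quoted from Jones can be sketched as tiny classes. This is not Jones's code: the `reset`/`step` interface, the class names, and the choice that the probe 2 reward equals the observation are all assumptions made for illustration. `discounted_returns` computes the value targets a correct value network should converge to, which gives something concrete to compare learned values against.

```python
import random

class ConstantRewardEnv:
    """Probe 1: zero observation, one timestep, reward +1."""
    def reset(self):
        return 0.0

    def step(self, action):
        return 0.0, 1.0, True  # (next_obs, reward, done)

class ObsDependentRewardEnv:
    """Probe 2: random +1/-1 observation, one timestep.
    Assumption: the reward equals the observation."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.obs = 0.0

    def reset(self):
        self.obs = self.rng.choice([-1.0, 1.0])
        return self.obs

    def step(self, action):
        return 0.0, self.obs, True

class DelayedRewardEnv:
    """Probe 3: observation 0 then 1, two timesteps, reward +1 at the end."""
    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        if self.t == 1:
            return 1.0, 0.0, False
        return 1.0, 1.0, True

def discounted_returns(env, gamma, episodes=1):
    """Roll out with the single action 0; return (obs, discounted return) pairs."""
    pairs = []
    for _ in range(episodes):
        obs, done, trajectory = env.reset(), False, []
        while not done:
            next_obs, reward, done = env.step(0)
            trajectory.append((obs, reward))
            obs = next_obs
        g = 0.0
        for o, r in reversed(trajectory):
            g = r + gamma * g
            pairs.append((o, g))
    return pairs

probe1 = discounted_returns(ConstantRewardEnv(), gamma=0.5)
probe2 = discounted_returns(ObsDependentRewardEnv(seed=0), gamma=0.5, episodes=20)
probe3 = sorted(discounted_returns(DelayedRewardEnv(), gamma=0.5))
print(probe1, probe3)
```

With `gamma=0.5`, probe 3 should yield a value of 0.5 for the first observation and 1.0 for the second; an agent that learns probe 2 but predicts 1.0 for both states in probe 3 points at the discounting code, as the quote describes.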
|
||||
## Policy Entropy and KL Diagnostics
|
||||
|
||||
<Entropy KL Evidence>
|
||||
|
||||
(1) [Schulman Entropy]: Schulman recommends monitoring policy entropy
|
||||
carefully: dropping too fast means premature determinism,
|
||||
not dropping means no learning. #observation
|
||||
[Schulman 2017 Bootcamp Talk](https://www.youtube.com/watch?v=8EcdaCk9KaQ)
|
||||
[evidence](evidence/schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md#L631-L652)
|
||||
> look at the entropy really carefully if your entropy is going down too
|
||||
> fast that means your policy is becoming deterministic and it's not
|
||||
> going to explore anything... also if it's not going down your policy is
|
||||
> never going to be that good because it's always really random... **you
|
||||
> can sort of alleviate this issue by using an entropy bonus or a KL
|
||||
> penalty** so by stopping yourself from changing the policy the
|
||||
> probability distribution too fast as a side effect you also prevent
|
||||
> the entropy from going down too fast... I also look at the KL as a
|
||||
> diagnostic like look at how big of an update you're doing in terms of
|
||||
> KL divergence
|
||||
{reason: "Schulman designed PPO's clipped objective specifically to control KL; these diagnostics come from the algorithm author's direct practice", credence: 0.90}
|
||||
----
|
||||
(2) [Entropy KL Useful]: Policy entropy and KL divergence are
|
||||
essential RL-specific diagnostics that detect exploration
|
||||
failure and update instability.
|
||||
{reason: "single source but from the algorithm designer; entropy/KL monitoring is now built into stable-baselines3, RLlib, and cleanrl as standard", inference: 0.85}
|
||||
+> [Folklore Reliable]
|
||||
|
||||
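For a discrete action space, both diagnostics Schulman mentions reduce to short formulas over the action probabilities. The sketch below assumes categorical policies given as plain probability lists and measures the update size as KL(old || new); the direction of the KL and the example distributions are assumptions for illustration, not something the talk specifies.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(old, new):
    """KL(old || new) for categorical distributions.
    Assumes `new` is non-zero wherever `old` is non-zero."""
    return sum(p * math.log(p / q) for p, q in zip(old, new) if p > 0.0)

# Hypothetical action distributions before and after one policy update.
uniform = [0.25, 0.25, 0.25, 0.25]
after_small_update = [0.30, 0.25, 0.25, 0.20]
after_large_update = [0.97, 0.01, 0.01, 0.01]

for name, new in [("small", after_small_update), ("large", after_large_update)]:
    print(
        name,
        "entropy:", round(entropy(new), 4),
        "kl_from_old:", round(kl_divergence(uniform, new), 4),
    )
```

Logged every update, a sharp entropy drop paired with a large KL is the "becoming deterministic too fast" pattern from the quote, while flat entropy with near-zero KL suggests the policy is not changing at all.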
|
||||
# Evidence Against
|
||||
|
||||
## Sources Are Dated
|
||||
|
||||
<Source Age Concern>
|
||||
|
||||
(1) [Dated Sources]: Most sources are from 2017-2018, before
|
||||
transformers, RLHF, large-scale pretraining, and modern
|
||||
frameworks became dominant. #assumption
|
||||
{reason: "Schulman 2017, Jones 2021, Henderson 2018, Irpan 2018, CS231n ~2017; the RL landscape has shifted substantially since then (PPO is now standard, RLHF is a major use case, JAX/PyTorch 2.0 changed workflows)", credence: 0.65}
|
||||
----
|
||||
(2) [Age Limits]: Some folklore may not transfer to modern settings
|
||||
(e.g., batch size advice may differ for LLM fine-tuning vs
|
||||
classic RL; reward scaling is less relevant for RLHF).
|
||||
{reason: "core debugging principles (test isolation, logging, seed variance) are architecture-agnostic and likely durable; specific HP defaults and RL diagnostics may need updating", inference: 0.40}
|
||||
-> [Folklore Reliable]
|
||||
|
||||
|
||||
## RL-Specific Focus
|
||||
|
||||
<RL Specificity Concern>
|
||||
|
||||
(1) [RL Heavy]: The SKILL is heavily weighted toward RL debugging,
|
||||
with ~60% of content RL-specific (probe envs, reward scaling,
|
||||
policy entropy, KL diagnostics). #assumption
|
||||
{reason: "Parts 2 and 4 are RL-only; Part 1 is general but many examples are RL-flavored; limits applicability for users doing pure supervised learning or generative modeling", credence: 0.80}
|
||||
----
|
||||
(2) [Scope Limits]: RL-heavy focus limits the SKILL's applicability
|
||||
but the general debugging principles (Parts 1, 3, 5) transfer
|
||||
broadly.
|
||||
{reason: "the RL focus is clearly labeled in the SKILL; general ML principles like normalization, isolation testing, and loss surface analysis are domain-agnostic", inference: 0.30}
|
||||
-> [Folklore Reliable]
|
||||
@@ -0,0 +1,76 @@
|
||||
# ML Debugging Folklore - Vargdown Process Log
|
||||
|
||||
## Process
|
||||
- [x] evidence files read (21 files, 9416 lines total)
|
||||
- [x] quotes extracted via 12 parallel subagents
|
||||
- [x] key quotes verified against evidence files (spot-checked ~15 quotes)
|
||||
- [x] argdown verifier passes clean (`npx @argdown/cli json` -- 14 arguments, 45 statements, 14 relations)
|
||||
- [x] subagent review done (gpt-5.2-codex via opencode; fixed non-verbatim quotes, credence calibration, PCS structure)
|
||||
- [ ] human review done
|
||||
|
||||
## Evidence Fetch Log
|
||||
|
||||
All evidence files were pre-existing in `docs/evidence/`. They were fetched
|
||||
in a prior session via the methods listed in each file's header.
|
||||
|
||||
| Source | Evidence File | Fetch Method | Status |
|
||||
|--------|--------|--------|--------|
|
||||
| Schulman 2016 slides | joschu_nuts_and_bolts.md | `uvx markitdown[pdf]` | verbatim (PDF artifacts: cid markers) |
|
||||
| Schulman 2017 bootcamp | schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md | YouTube auto-subtitles | verbatim (transcription errors, e.g. "mean insanity deviation" for "mean and standard deviation") |
|
||||
| Andy Jones RL debugging | andyljones_rl_debugging.md | markitdown | verbatim |
|
||||
| Henderson et al. 2018 | henderson_2018_deep_rl_matters.md | markitdown | verbatim |
|
||||
| Goodfellow Ch11 | goodfellow_ch11_practical_methodology.md | markitdown | verbatim |
|
||||
| CS231n NN3 | cs231n_neural_networks_3.md | markitdown | verbatim |
|
||||
| FSDL Spring 2021 L7 | fsdl_spring2021_lecture7.md | markitdown | verbatim |
|
||||
| Irpan RL hard | alexirpan_rl_hard.md | markitdown | verbatim |
|
||||
| amid.fish reproducing | amid_fish_reproducing_deep_rl.md | markitdown | verbatim |
|
||||
| Slavv 37 reasons | slavv_37_reasons_nn.md | markitdown | verbatim |
|
||||
| CS229 ML advice | cs229_ml_advice.md | markitdown | verbatim |
|
||||
| McCandlish 2018 | mccandlish_2018_large_batch.md | markitdown | verbatim |
|
||||
| William Falcon notes | williamfalcon_deeprl_hacks.md | markitdown | verbatim |
|
||||
| Goodfellow Ch15 | goodfellow_ch15_representation_learning.md | markitdown | verbatim |
|
||||
| Deep Learning Book | deeplearning_book.md | markitdown | verbatim |
|
||||
| Reddit RL tips 7s8px9 | reddit_rl_practical_tips_7s8px9.md | markitdown | verbatim |
|
||||
| Reddit RL debug 9sh77q | reddit_rl_debugging_tips_9sh77q.md | markitdown | verbatim |
|
||||
| Reddit RL roadblocks | reddit_rl_roadblocks_bzg3l2.md | markitdown | verbatim |
|
||||
| Reddit Schulman 5hereu | reddit_schulman_nuts_bolts_5hereu.md | markitdown | verbatim |
|
||||
| Reddit ICML tutorial | reddit_icml2017_tutorial_levine_6vcvu1.md | markitdown | verbatim |
|
||||
| Reddit DRL bootcamp | reddit_deeprl_bootcamp_2017_75m5vd.md | markitdown | verbatim |
|
||||
|
||||
## Quote Verification Notes
|
||||
|
||||
- Schulman subtitles contain auto-generated transcription errors (e.g., "mean insanity deviation" should be "mean and standard deviation"). Quotes used verbatim from file; errors are in the source, not introduced by us.
|
||||
- Schulman PDF (joschu_nuts_and_bolts.md) has markitdown conversion artifacts (`(cid:73)` bullet markers, table formatting). Core text is present but formatting is messy.
|
||||
- All other evidence files appear to be clean markitdown conversions.
|
||||
- 15 key quotes were manually spot-checked against evidence files. All matched.
|
||||
- Quotes from subagent extractions were cross-referenced with direct file reads.
|
||||
|
||||
## Blockers / Caveats
|
||||
|
||||
- Argdown verifier passes clean: `npx @argdown/cli json` exports 14 arguments, 45 statements, 14 relations. Fixed: 44 blank lines inside PCS blocks, bracket escaping in FSDL quote.
|
||||
- Some evidence files (especially Schulman PDF) have conversion artifacts that may cause verifier failures on exact quote matching.
|
||||
- The argdown uses auto-generated YouTube subtitles as a source; these contain transcription errors that are present in the evidence file.
|
||||
|
||||
## Coverage Summary
|
||||
|
||||
| SKILL.md Claim | Sources Used | Independent Sources |
|
||||
|---|---|---|
|
||||
| Normalize inputs mean=0 std=1 | Schulman, FSDL, Slavv | 3 |
|
||||
| Overfit tiny dataset first | CS231n, FSDL, Goodfellow | 3 |
|
||||
| Assume you have a bug | Jones, Goodfellow | 2 |
|
||||
| Seed variance is extreme | Schulman, Henderson, Irpan | 3 |
|
||||
| Use bigger batch sizes | Schulman (x2), McCandlish | 2 (Schulman slides + talk counted as 1) |
|
||||
| Hand-scale rewards, don't shift mean | Schulman, Jones, Henderson | 3 |
|
||||
| Use reference implementations | Jones, Rahtz | 2 |
|
||||
| Pursue anomalies | Jones, Rahtz | 2 |
|
||||
| Log everything | Rahtz, Goodfellow | 2 |
|
||||
| Random HP search | CS231n/Bergstra, Schulman | 2 |
|
||||
| Probe environments for RL | Jones | 1 (but applies general isolation principle) |
|
||||
| Policy entropy / KL diagnostics | Schulman | 1 (but built into major frameworks) |
|
||||
|
||||
## Claims NOT Covered in Argdown (lower priority or single-source)
|
||||
- Gradient clipping masks problems (CS231n mentions clipping, but as a technique rather than as a warning)
|
||||
- Final layer zero init for policy (Schulman only)
|
||||
- Loss surface analysis / gradient quiver plots (original to SKILL, no external source)
|
||||
- Sweep methodology with within-group z-scores (original to SKILL)
|
||||