mirror of
https://github.com/wassname/ml-debug.git
synced 2026-06-27 15:00:40 +08:00
4393cceefd
Deep research to uplift LLMs for ML debugging, opinionated by source selection. Distilled from Schulman, Jones, Rahtz, Goodfellow, CS231n, FSDL, and more. Includes runnable diagnostic scripts and LLM-specific anti-patterns. Author: wassname (https://github.com/wassname)
1008 lines
37 KiB
Plaintext
So last year at NIPS I was slated to give a talk at the Deep RL Workshop, and I wasn't sure what I was going to talk about, because everything I had prepared I had already talked about so many times that I just didn't want to give another talk on it. So I asked Peter for his advice on what I should talk about, and he said that Andrew Ng had given a talk earlier in the conference called "The Nuts and Bolts of Deep Learning", where he went through the flowchart of what you do when you see a new problem: if you're overfitting, you regularize, and if you're underfitting, you use a bigger model, and so on. So Peter suggested that I write a talk called "The Nuts and Bolts of Deep RL Research", where I would talk about some of the similar lessons and the tips and tricks for the RL setting. So I put together a talk for that, and actually people seemed to like it.

So I'll give a slightly updated version of that talk right now. I'm going to talk about a few different things, some of which are general and apply to reinforcement learning in general, and some of which pertain to particular classes of methods, like policy gradient methods. These are just little tips and tricks for how you get your algorithm to work and what you do day to day.

So let's say you have a totally new problem you're trying to solve. You have some new task, you've defined an observation and an action space, and you have your neural network policy or Q-function, and you want to start learning how to solve it, but you've never tried it before. Or you have a new algorithm you're trying to get working that you've never used before. So what do you do? What's the first thing you do if you have a new algorithm?

My first advice would be to use small problems, so you can run a lot of experiments really quickly and do a hyperparameter search. It's also really useful to be able to visualize the learning process in as many ways as possible: look at the state visitation and how that's evolving over time, look at how well your value function is fitting, and so on. I spent a lot of time looking at the pendulum problem, where you're trying to swing up a pendulum, because this problem has a 2D state space: it's just the angle and the angular velocity of the pendulum. I would visualize exactly what the value function looks like, exactly what the state distribution looks like, and how they evolve over time. So I would get a sense for why my algorithm isn't working: is it oscillating in some funny way, is it just giving a bad fit, or maybe the value function it's learning isn't smooth enough, and so on. So I would say try to visualize everything, and maybe use small problems where you can visualize everything.

Also, it's useful to construct toy problems where your idea is going to be the strongest, where you think, okay, if this idea has any possibility of working, it's going to work there. For example, let's say you're trying to do something with hierarchical reinforcement learning: then construct some problem where there's some kind of obvious hierarchy that it should learn, and you'll be able to tell if it's doing the right thing. Also construct the problems where it's going to be weakest. And as a counterpoint to that, don't overfit your method to some contrived problem. Let's say you've come up with some toy problem where your method is really good: then realize that it's a toy problem, and don't tweak everything to work on this toy problem perfectly.

It's also pretty useful to have medium-sized problems that you're very familiar with, where you know exactly how fast the learning should be and what the reward should be at every iteration, and so on.

So a few problems that I use a lot are training on Pong in Atari and the Hopper problem, which is a simulated robot problem with a hopping robot. I know exactly how fast an algorithm that's working should learn on these problems, so it makes it easier to tune things.

Okay, that's if you have a new algorithm. Let's say you have a new task. I would recommend just making the task easier until you start seeing some signs of life, until you see it learning something. There are various ways you can make it easier. You can try doing some feature engineering, so that you think the policy should be a simple function of your input features. Let's say you're trying to get Pong to work, and you tried setting it up with the images as input and you weren't learning anything. Then you can set up the problem where you pass in x-y coordinates as input and try running your algorithm. It's a much simpler function you're trying to learn, so that's much more likely to work, and then you can make it harder and harder until you're solving the full problem.

Another way you can make it easier is by shaping the reward function. That means you come up with some reward function that gives you fast feedback on whether you're doing the right thing or not. Let's say we define a task where we have a reaching robot, and we just give it a reward if it hits the target: it gets a reward of one if it hits the target and zero otherwise. That might be hard to learn, because you're not getting any feedback as you're flailing around. But we could define a better-shaped reward function where it's just the distance to the target; then learning is going to be much faster in that problem.

There's also the problem of exactly how to turn your problem into a POMDP in the first place. Often it's not clear what your observation features should be, and it's not even clear what the reward function should be, or it's not clear if the problem you're trying to solve is feasible at all. So let's say you have some game or some robotics task or something new, and you want to turn it into a reinforcement learning problem, but you're not sure if this is feasible at all.

So the first thing to do is to just visualize a random policy acting on this problem and see what happens. If the random policy occasionally does the right thing, then there's a high chance that reinforcement learning is going to work, because a policy gradient method is just going to take this random behavior and make the good behaviors more likely.

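The random-policy check can also be made quantitative. The following is a minimal sketch, not anything shown in the talk: `ToyCorridor` is a hypothetical one-dimensional task invented for illustration, and `random_policy_success_rate` simply counts how often uniformly random actions ever collect the sparse reward.

```python
import random


class ToyCorridor:
    """Hypothetical 1-D task: start at 0, sparse reward 1.0 on reaching `goal`."""

    def __init__(self, goal=5, max_steps=50):
        self.goal, self.max_steps = goal, max_steps

    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos

    def step(self, action):  # action is -1 or +1
        self.pos += action
        self.t += 1
        reached = self.pos >= self.goal
        done = reached or self.t >= self.max_steps
        return self.pos, (1.0 if reached else 0.0), done


def random_policy_success_rate(env, episodes=2000, seed=0):
    """Fraction of episodes in which a uniformly random policy gets any reward."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(episodes):
        env.reset()
        done, total = False, 0.0
        while not done:
            _, reward, done = env.step(rng.choice((-1, 1)))
            total += reward
        successes += total > 0
    return successes / episodes


easy = random_policy_success_rate(ToyCorridor(goal=5))
hard = random_policy_success_rate(ToyCorridor(goal=40))
print(f"easy task: {easy:.3f}, hard task: {hard:.3f}")
```

A success rate well above zero suggests there is some signal to reinforce; a rate of essentially zero is a hint to make the task easier or reshape the reward first.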
So it'll gradually home in on the good behaviors, whereas if you're never doing the right thing, then RL isn't going to get any signal that tells it to do the right thing. Sometimes RL is able to learn even though it's not clear how it's going to learn. Take learning how to walk: it's not clear that that should work, because you would think that you really have to have the whole policy in place before it does anything useful. But as it turns out, you learn to take one step and then fall over, and then take two steps and then fall over, and so on, until you've got a proper walking gait.

Okay, another thing to do is to make sure your observations are usable. Try to look at them as a human and see if you can control the system using the same observations you're giving to the agent. Let's say you're doing some preprocessing on your images: look at those preprocessed images yourself and make sure you're not losing too much detail when you downsample them, or losing too much in the color transformations, and so on.

Another thing to do is to make sure that everything is reasonably scaled. As a rule of thumb, you usually want everything to be mean 0 and standard deviation 1. That goes for the observations; for the rewards it's a little less obvious, but that's a reasonable heuristic. So you might want to rescale them using some kind of filter. That's another good thing you can do. But if you don't want to mess with filters on your observations and rewards, and you're allowed to define those yourself, then you might want to just scale them yourself. What I'd recommend doing is plotting histograms of all of your observations and your rewards, and making sure that each component of the observations and rewards is scaled properly, so that it has the right mean and standard deviation and doesn't have crazy outliers.

Okay, another thing to do is to have some good baselines that you can use whenever you see a new problem. It's not clear beforehand which algorithm is going to work, so make sure you have a bunch of well-tuned things that you can run on each problem.

Yeah? Okay, the question was: if you're going to do some kind of reward normalization, should you do this over all of your training data or just the recent data? There's a lot of subtlety there. I would say use all of your data so far, because you're making everything non-stationary if you do some kind of filtering. Actually, I'm going to talk about this on a later slide.

So anyway, as just a few baselines, I would recommend having a cross-entropy method, some policy gradient method, and some kind of Q-learning or SARSA-type method. There's a lot of code online now, so you can use other people's code that's already written: we have the OpenAI Baselines repository, and rllab also has a bunch of algorithms.

Okay, another thing, which people often get tripped up on, especially when they're trying to reproduce published work: you implement the algorithm based on the paper, and then it doesn't really learn anything at all, and you think, oh, maybe my code is wrong, or what happened? I would say early on you might need to run with more samples than expected.

One hyperparameter that you can usually adjust is how big a batch size to use, or how many samples to use. I would say sometimes you should use more samples than you think you're going to need, because things almost always work better when you have more samples. Often when you're trying to reproduce a published paper, you've got it mostly right but not exactly right: maybe you haven't scaled everything properly, or there's some really obscure hyperparameter that you have wrong, and then you just find that the code doesn't learn anything. Then I would say just try to make it work a little bit, and you can work from there and tweak all the hyperparameters to get fully up to the published performance. But if you want to get something working at all, often you need to use bigger batch sizes than you thought, because if your batch size is too small, then the noise will overwhelm the signal and you won't learn anything.

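To see how noise can overwhelm the signal at small batch sizes, here is a toy simulation. It is only an illustration under made-up assumptions, not an RL algorithm: each "sample" is a noisy scalar estimate of a small positive quantity (`true_signal` and `noise_std` are invented values), and we ask how often the batch average even has the correct sign.

```python
import random
import statistics


def sign_accuracy(batch_size, true_signal=0.05, noise_std=1.0, trials=1000, seed=0):
    """Fraction of batches whose averaged noisy estimate has the correct (positive) sign."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        estimate = statistics.fmean(
            rng.gauss(true_signal, noise_std) for _ in range(batch_size)
        )
        correct += estimate > 0
    return correct / trials


# Same weak signal, three batch sizes.
results = {n: sign_accuracy(n) for n in (10, 100, 1000)}
for n, acc in results.items():
    print(f"batch size {n:>4}: correct sign in {acc:.0%} of batches")
```

The standard error of the batch average shrinks like one over the square root of the batch size, so with a tiny batch the update direction is close to a coin flip, while a large batch recovers the sign most of the time.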
For example, for TRPO, I wasn't seeing any learning for a while, and then it turned out it was just because I was using too small a batch size, and I had to use a hundred thousand time steps for the batch size. And for DQN on Atari, the hyperparameters that were found to be best were that you update your Q-function every ten thousand time steps and you have 1 million time steps in your replay buffer, which is a lot.

Okay, so now I'll talk about some guidelines for the ongoing development and tuning process, as opposed to the initial process of "I have a totally new problem or a new algorithm that I want to see some signs of life on."

Let's say you get something working. I recommend looking at how sensitive your algorithm is to every hyperparameter. If it's too sensitive, it's not actually a robust algorithm, and you shouldn't be happy with it; you probably just got lucky on that one problem. It's actually quite possible to have a method that is a fluke: it works on one problem because of some funny dynamics, but then it doesn't work in general, so it needs some serious improvements.

There are also some indicators that'll tell you if your algorithm is working, besides just looking at the final performance. (I'm going to talk about more of these kinds of diagnostics a little later.) Look for indicators that are going to tell you that your optimization process is healthy. This is going to vary based on the algorithm, but for example, you can look at whether your value function is actually accurate, that is, whether it's actually predicting returns well. You can look at how big the updates are, in terms of either parameter space or output space. And there are the standard diagnostics for deep networks: you can look at norms of gradients, and so on.

Okay, one thing that takes some discipline but is very useful is to have a system for continually benchmarking your code, and that includes all of your code, not just the one thing you're tuning right now. Often it's easy to tune your algorithm to work well on one problem and then mess up the performance on other problems, and it's really easy to overfit on single problems when you're just adjusting hyperparameters. So I'd really recommend having some kind of benchmark you can run frequently, and some kind of battery of benchmarks that you run occasionally.

Along similar lines of overfitting, of reading too far into noise or overinterpreting noise: it's really easy to think you're improving your algorithm, or making it worse, when really you're just seeing random noise. Here you can see seven different tasks. These are the Gym MuJoCo tasks, like HalfCheetah and Hopper and so on, and you have three different algorithms here: the red one, the green one, and the blue one. We can see that the performance is a little different on all the problems. Which one looks like the best? Well, it kind of varies by problem: the blue one looks better on this problem, and the red one is worse on this problem, and so on. But as it turns out, these are all the exact same algorithm, just with different random seeds. So it's easy to imagine that you're just looking at one of these problems, you see that blue curve, and you get really excited and think you've found some huge improvement to your algorithm, but really you just got a lucky seed on that one run. So really you've got to run your algorithm multiple times and average. And even if you're averaging over a lot of seeds, even if you had 20 seeds here, there's still a pretty big error bar, so that makes it particularly hard.

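One simple way to keep yourself honest about seed noise is to report a mean and standard error over seeds rather than a single curve. The sketch below assumes you have one final return per seed; the numbers in `baseline` and `tweaked` are made up, and the two-standard-error threshold in `clearly_different` is a crude rule of thumb rather than a proper statistical test.

```python
import math
import statistics


def summarize_seeds(final_returns):
    """Mean and standard error of the mean for one algorithm's per-seed results."""
    mean = statistics.fmean(final_returns)
    sem = statistics.stdev(final_returns) / math.sqrt(len(final_returns))
    return mean, sem


def clearly_different(returns_a, returns_b, n_sems=2.0):
    """Crude check: do the means differ by more than n_sems combined standard errors?"""
    mean_a, sem_a = summarize_seeds(returns_a)
    mean_b, sem_b = summarize_seeds(returns_b)
    return abs(mean_a - mean_b) > n_sems * math.hypot(sem_a, sem_b)


# Hypothetical final returns, one number per random seed (made-up values).
baseline = [3100.0, 2400.0, 3600.0, 1900.0, 2900.0]
tweaked = [3300.0, 2700.0, 2200.0, 3500.0, 3000.0]

for name, runs in (("baseline", baseline), ("tweaked", tweaked)):
    mean, sem = summarize_seeds(runs)
    print(f"{name}: {mean:.0f} +/- {sem:.0f} (standard error over {len(runs)} seeds)")
print("clearly different:", clearly_different(baseline, tweaked))
```

With these made-up numbers the "tweaked" mean is higher, but the gap is well inside the error bars, which is exactly the situation where a single lucky seed would be misleading.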
I'd recommend having multiple tasks and multiple seeds. If you don't do that, then you're probably just overfitting, unless you see a really drastically large improvement.

Another thing: it's easy to keep adding little modifications to your algorithm until it gets really complicated, and then you think you have this really complicated algorithm which is perfect, but it turns out that most of the things you did are unnecessary, because some of the tricks substitute for each other. This is often true because a lot of tricks help by normalizing things in a better way, or by improving your optimization, like making your optimization less susceptible to big spikes. A lot of the different modifications you make have similar effects, so often you can remove them and simplify your algorithm. This is pretty important, especially with regard to changes that do whitening: these kind of all substitute for each other, and they also substitute for changes to your optimization algorithm. I would simplify things, because then it's more likely that your insights will generalize to other problems.

Lastly, it's pretty useful to automate your experiments, because otherwise you're going to end up spending your whole day just watching your code print out numbers. It's actually really tempting to spend all day doing that. But especially if you need to run multiple random seeds, you really need to get your workflow down so that you're automating this process and launching lots of experiments at the same time. I'd recommend getting set up with one of these cloud computing services, so you can launch experiments on remote instances and pull the results back when you're done.

Question? Oh yeah, the question is whether I have a recommendation on what framework to use to keep track of experiment results. I personally use no framework at all. I just have IPython notebooks and scripts that collect a bunch of data that's stored in various log files, so I just have scripts that read all my log files and plot them. Some people like having databases and stuff where they store all their hyperparameter results, but I don't find it necessary personally.

Okay, so now I'm going to talk about general tuning strategies for RL, and after that I'll talk about some specific tuning strategies for different classes of algorithms.

Okay, so one thing is whitening or standardizing your data. If your observations have an unknown range, you should definitely standardize them. I would do that by computing a running estimate of the mean and the standard deviation, and then z-transforming the data, that is, subtracting the mean and dividing by the standard deviation.

I would recommend computing the mean and the standard deviation over all the data you've seen so far, not just your recent data, because otherwise you're effectively changing your data in some way that your policy gradient algorithm doesn't know about. Your policy gradient algorithm is actually optimizing some objective, so if you just go and change the problem out from under it, you're often going to make things a lot worse. If you rescale your observations, your optimization algorithm doesn't know about that, so you might just collapse the performance. That's why I would recommend using all of your data from the start of time, so that at least how fast your scalings are changing slows down over time. So that's what I would recommend doing with the observations.

For the rewards, I'd recommend rescaling them but not shifting them, because shifting affects the agent's will to live: if you shift the mean reward, that'll affect how long it wants to survive, so you're actually changing the problem.

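The standardization advice above can be sketched as follows. This is a minimal scalar version under stated assumptions: `RunningStat` uses Welford's online method (my choice, not necessarily what the speaker used) to track statistics over all data seen so far, and for vector observations you would keep one such statistic per component. Observations are shifted and scaled; rewards are only scaled.

```python
import math


class RunningStat:
    """Running mean and variance over ALL values seen so far (Welford's method)."""

    def __init__(self):
        self.count, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        if self.count < 2:
            return 1.0  # no meaningful estimate yet; leave the scale unchanged
        return math.sqrt(self.m2 / (self.count - 1))


def standardize_observation(x, stat, eps=1e-8):
    """Z-transform: subtract the running mean, divide by the running std."""
    stat.update(x)
    return (x - stat.mean) / (stat.std + eps)


def scale_reward(r, stat, eps=1e-8):
    """Rescale by the running std only; the mean is NOT subtracted."""
    stat.update(r)
    return r / (stat.std + eps)


obs_stat, rew_stat = RunningStat(), RunningStat()
for obs, rew in [(1000.0, 1.0), (1200.0, 2.0), (800.0, 3.0), (1100.0, 1.0)]:
    print(standardize_observation(obs, obs_stat), scale_reward(rew, rew_stat))
```

Because the statistics accumulate from the start of training, each new sample moves the mean and standard deviation less and less, so the transformation the policy sees drifts more slowly over time. Leaving the reward mean alone keeps the sign of every reward, and therefore the incentive to keep an episode going, unchanged.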
You might also want to try to standardize prediction targets in the same way, though that's a little more complicated to do.

Okay, yeah, so the question is: what about PCA whitening instead of just this elementwise scaling? That could definitely help. I haven't experimented with that. It's hard to predict with neural nets whether it's going to help or not, because they seem to be pretty good at disentangling things. I do know that if you have things that are terribly scaled, like some coordinates are from negative one thousand to one thousand and other coordinates are from negative 0.1 to 0.1, then learning is going to be slow, so this kind of scaling helps a lot even though you're using neural networks.

Okay, there are some parameters that are really generally important, like the discount factor. That determines how far away you're doing credit assignment, that is, whether you're paying attention to effects that are delayed by a certain time. If your discount is gamma equals 0.99, then you're basically ignoring effects that are delayed by more than a hundred time steps, so you're kind of short-sighted; gamma is controlling your shortsightedness. You might want to actually look at how long that corresponds to in real time. Usually in reinforcement learning you're discretizing time in a certain way, and it's worth paying attention to whether that 100 time steps is, like, three seconds of real time or what, and what happens during that time.

Also note that if you have TD(lambda) kinds of methods, either for value function estimation or for policy gradient methods, you can get away with using a gamma that's really close to one, like 0.999, and things aren't going to go unstable, because if you have a lower lambda, like 0.9, then that's going to make it so the algorithm is still stable even though gamma is really close to one.

As I mentioned, in practice we're usually discretizing some continuous-time system, so it's worth seeing if the problem can actually be solved at this discretization level. For example, in a game, let's say you're doing frame skip, meaning that you repeat the action multiple times. As a human, can you control it at this rate, or is it just impossible to control because you're doing the action too many times in a row and your responses are too slow? I would also just look at what the random exploration looks like. The discretization is going to determine how far your Brownian motion goes, because if you're doing the same action many times in a row, then you're going to tend to explore further. So it's worth just looking at what the random exploration does, and choosing your time discretization in a sensible way so that it does interesting things.

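The link between the discount factor and real time, discussed above, is easy to check with a few lines of arithmetic. This sketch uses the common rule of thumb that the effective horizon is about 1 / (1 - gamma) steps; the 60 frames per second and frame skip of 4 are hypothetical values chosen for illustration, not settings from the talk.

```python
def effective_horizon_steps(gamma):
    """Rule-of-thumb horizon 1 / (1 - gamma), in agent time steps."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    return 1.0 / (1.0 - gamma)


def effective_horizon_seconds(gamma, frame_skip, frames_per_second):
    """Convert the horizon to seconds of the simulated system."""
    seconds_per_step = frame_skip / frames_per_second
    return effective_horizon_steps(gamma) * seconds_per_step


# Hypothetical setup: a 60 fps game with each action repeated for 4 frames.
for gamma in (0.9, 0.99, 0.999):
    steps = effective_horizon_steps(gamma)
    secs = effective_horizon_seconds(gamma, frame_skip=4, frames_per_second=60)
    print(f"gamma={gamma}: about {steps:.0f} steps, about {secs:.1f} s")
```

Changing the frame skip changes how much real time each step covers, so the same gamma can mean a very different amount of foresight after you change the time discretization.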
So the question is: if you have a DQN, how would you get started with tuning all the hyperparameters? Actually, I'm going to talk about DQN pretty soon, so I'll get to that.

Okay, also look at the episode returns very closely. Don't just look at the mean; look at the minimum and the maximum. The maximum, especially if you have a deterministic system, is basically something that your policy can home in on pretty straightforwardly, because if you just do that every time, then you're going to increase your mean return to that level. So it's useful to look at the max return to see if your policy is ever doing the right thing, or if it's just kind of stuck and never discovering the high-return strategy. Also look at the episode length, which is sometimes more informative than the episode reward. For example, if you have a game, you might be losing every time, so you're never seeing yourself win, but the episode length will tell you if you're losing slower. So you might see an improvement in episode length at the beginning, but not in reward.

Okay, for policy gradient there are specific diagnostics that are really helpful. Look at the entropy really carefully. If your entropy is going down too fast, that means your policy is becoming deterministic and it's not going to explore anything, so be careful. And if it's not going down, your policy is never going to be that good, because it's always really random. You can alleviate this issue by using an entropy bonus or a KL penalty. By stopping yourself from changing the probability distribution too fast, as a side effect you also prevent the entropy from going down too fast when you use a KL penalty. I also look at the KL as a diagnostic: look at how big an update you're doing in terms of KL divergence. If your KL is like 0.01, that's a pretty small update, but if it's like 10, that's a really big update.

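For a discrete action space, both diagnostics are a few lines of code. The sketch below is illustrative only: the probability tables `old_policy` and `new_policy` are made-up numbers standing in for the action distributions your policy outputs at a batch of visited states before and after an update.

```python
import math


def categorical_entropy(probs):
    """Entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def categorical_kl(old_probs, new_probs):
    """KL(old || new) in nats; assumes new_probs > 0 wherever old_probs > 0."""
    return sum(
        p * math.log(p / q) for p, q in zip(old_probs, new_probs) if p > 0.0
    )


def mean_over_states(fn, *batches):
    """Average a per-state quantity over a batch of visited states."""
    values = [fn(*row) for row in zip(*batches)]
    return sum(values) / len(values)


# Made-up action probabilities at two visited states, before and after an update.
old_policy = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
new_policy = [[0.30, 0.25, 0.25, 0.20], [0.5, 0.3, 0.1, 0.1]]

entropy_before = mean_over_states(categorical_entropy, old_policy)
entropy_after = mean_over_states(categorical_entropy, new_policy)
update_kl = mean_over_states(categorical_kl, old_policy, new_policy)
print(f"entropy {entropy_before:.3f} -> {entropy_after:.3f}, KL {update_kl:.4f}")
```

Logging these two numbers every iteration makes it easy to spot an entropy collapse or an unusually large policy step before it shows up in the returns.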
Question? Oh yeah, the question is: how do you measure entropy? For most policies you can compute the entropy analytically. If you have a discrete action space, then you usually can just compute it analytically, and if you have a continuous policy, you're usually using a Gaussian distribution or something, so you can compute the differential entropy analytically. Here we're talking about entropy in action space, so the average over state space of the action-space entropy. What you actually might care about even more is the entropy in state space, but you have no hope of actually calculating that, except maybe by doing some really crude approximation of it.

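For the continuous case mentioned in that answer, here is a small sketch of the closed-form differential entropy. It assumes a Gaussian policy with independent action dimensions (a diagonal covariance), which is a common choice but an assumption here; `stds` holds the per-dimension standard deviations at one state.

```python
import math


def diag_gaussian_entropy(stds):
    """Differential entropy (nats) of a Gaussian with independent action dimensions."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s * s) for s in stds)


# A wide policy versus a nearly deterministic one, for a 2-D action space.
print(diag_gaussian_entropy([1.0, 1.0]))
print(diag_gaussian_entropy([0.1, 0.1]))
```

Unlike the discrete entropy, this value can be negative when the standard deviations are small, so it is the trend over training that matters rather than the sign.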
Okay, so KL is really useful. Also look at explained variance, that is, whether your value function is actually a good predictor of the returns, or whether it's worse than predicting nothing. If you just predict zeros, then your explained variance is zero, but sometimes, if you have some neural network that's predicting, you find that it's actually negative, because it's overfitting or it's just noisy and not doing anything useful. That probably means you need to tune some hyperparameters so that your neural network is actually predicting better than the constant prediction of zero.

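The explained-variance diagnostic can be sketched as below, using the usual definition 1 - Var[returns - predicted] / Var[returns]. The function name and the NaN convention for constant returns are my own choices for illustration.

```python
import statistics


def explained_variance(predicted, returns):
    """1 - Var[returns - predicted] / Var[returns]; 1 is perfect, 0 matches a constant."""
    var_returns = statistics.pvariance(returns)
    if var_returns == 0.0:
        return float("nan")  # undefined when the returns do not vary
    residuals = [r - p for r, p in zip(returns, predicted)]
    return 1.0 - statistics.pvariance(residuals) / var_returns


returns = [1.0, 2.0, 3.0, 4.0]
print(explained_variance(returns, returns))               # perfect value function
print(explained_variance([0.0] * 4, returns))             # always predicting zero
print(explained_variance([4.0, 3.0, 2.0, 1.0], returns))  # worse than a constant
```

A value near one means the value function tracks the empirical returns; a negative value means the predictions add more variance than they remove, which is the case the speaker warns about.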
Okay, yeah, the question is: why does a KL spike give you a loss in performance? Well, it's not always a loss in performance; sometimes it's a gain in performance. But in practice it's usually a loss, because your policy gradient step is taking you way outside the region where your local approximation to the policy performance is accurate, so you're probably just overshooting. If you take your policy and you take a really big step in any direction, you're probably making it worse. It's like if you have a convex function: if you take a big step in any direction, you're probably going to make it worse.

Let's see. Okay, initializing your policy is pretty important, more important than in supervised learning, because that determines what data you're going to see initially and learn from at the beginning. I would recommend initializing the final layer to be either zero or really small, so that at least you have the maximum entropy and you explore randomly at the beginning, as opposed to having some particular policy that has a strong opinion on the right thing to do, which is based on no information at all.

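To see why a zero final layer gives random exploration at the start, consider a softmax policy over discrete actions. This is a bare-bones sketch in plain Python rather than any particular deep learning framework; the layer sizes and the `features` vector are arbitrary made-up values standing in for the output of the hidden layers.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def final_layer(features, weights, biases):
    """Linear output layer: one logit per action."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]


n_features, n_actions = 8, 4
zero_weights = [[0.0] * n_features for _ in range(n_actions)]
zero_biases = [0.0] * n_actions

# Arbitrary made-up activations from the (randomly initialized) hidden layers.
features = [0.7, -1.3, 2.1, 0.0, 0.4, -0.9, 1.5, -0.2]
probs = softmax(final_layer(features, zero_weights, zero_biases))
print(probs)
```

Whatever the hidden layers produce, zero output weights give equal logits, so every action is equally likely and the initial policy has maximum entropy regardless of the state.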
That's for policy gradient. For Q-learning, a few things. One is that it often helps to have a really big replay buffer, and to be able to do this you need to be a little careful about memory usage, so it's worth putting in the extra effort to do that. Learning rate schedules are often quite helpful here in practice, as are exploration schedules. In DQN you're usually using epsilon-greedy, and it often helps to play with the schedule on that. Also, it converges pretty slowly, and it often has a mysterious warmup period at the beginning. I actually have a lot of admiration for the people who originally got this to work, because they had to just let their code run for a while before it did anything. So you have to have a lot of patience, and a lot of bravery, to do that.

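The learning rate and exploration schedules mentioned above are often just simple functions of the step count. Here is one minimal sketch, a linear anneal followed by a constant floor; the start value, end value, and number of annealing steps in `epsilon_at` are made-up settings for illustration, not values recommended in the talk.

```python
def linear_schedule(step, start_value, end_value, anneal_steps):
    """Linearly interpolate from start_value to end_value, then hold end_value."""
    fraction = min(max(step / anneal_steps, 0.0), 1.0)
    return start_value + fraction * (end_value - start_value)


# Made-up settings: anneal epsilon from 1.0 to 0.1 over the first 1,000,000 steps.
def epsilon_at(step):
    return linear_schedule(step, start_value=1.0, end_value=0.1, anneal_steps=1_000_000)


for step in (0, 250_000, 1_000_000, 5_000_000):
    print(step, round(epsilon_at(step), 3))
```

The same helper can drive a learning rate schedule by passing different start and end values, which makes it cheap to try a few schedules when tuning.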
ok this is just miscellaneous advice for
|
|
not necessarily for tuning algorithms
|
|
but just for for personal development so
|
|
I recommend reading older textbooks and
|
|
theses not just the latest conference
|
|
papers because often they up in them
|
|
like there are more dense source of
|
|
useful information whereas each
|
|
conference paper just has one idea ok
|
|
yeah don't get too stuck on problems
|
|
because often you actually have a
|
|
legitimately good algorithm but it's has
|
|
like some flaws so its might fail
|
|
miserably at some easy problem so in RL
|
|
there's some like simple problems like
|
|
cart will swing up where you have this
|
|
stick and you're trying to swing it up
|
|
by moving the cart around and this
|
|
problem like you might have a great
|
|
algorithm but like
|
|
some of the state-of-the-art algorithms
|
|
are gonna fail on that problem unless
|
|
you really tune them carefully and
|
|
that's just because maybe it's not
|
|
exactly the right problem to start with I
|
|
mean maybe like the thing that makes
|
|
this problem hard is not the thing that
|
|
your algorithm is doing that's
|
|
interesting so you might have like come
|
|
up with a better policy gradient method
|
|
but still it'll converge to the same
|
|
local minimum on that swing up problem
|
|
and you're not gonna fix that problem so
|
|
I would say just don't get too stuck
|
|
on a single problem that your method
|
|
fails on and maybe the
|
|
ultimate algorithm will solve all of
|
|
these problems but we're not there yet
|
|
so you might as well just try to improve
|
|
on some decently large subset of
|
|
problems so also like one funny thing is
|
|
that DQN performs pretty poorly on a lot
|
|
of problems especially with continuous
|
|
control
|
|
I think it does I mean for cart-pole it
|
|
probably solves it pretty well with a
|
|
reasonable amount of tuning but some of
|
|
the other like fairly small continuous
|
|
control problems it fails on but that
|
|
doesn't mean it's a bad algorithm
|
|
because it solves a
|
|
different problem extremely well so yeah
|
|
I would say at least right now you
|
|
shouldn't expect to be able to solve
|
|
everything with the same method without
|
|
any tuning also techniques from
|
|
supervised learning often don't transfer
|
|
over to reinforcement learning so
|
|
don't be surprised if you find that I
|
|
guess that's not I said this slide was
|
|
gonna be about personal development
|
|
that's not about personal development
|
|
but yeah I guess this is just a grab bag
|
|
of miscellaneous advice so yeah so like
|
|
batch norm a lot of people look at what
|
|
people are doing in RL and they think
|
|
why aren't you using batch norm or
|
|
dropout or big networks why are you using
|
|
like two layers of 64 units and it's not
|
|
like people didn't think of trying these
|
|
other things they tried them and then
|
|
they found that those other like
|
|
architectures and methods don't actually
|
|
help here I mean if you figure out how
|
|
to make batch norm and dropout actually
|
|
help in RL that'll actually be really
|
|
great that would be a big
|
|
development but yeah I don't know it's
|
|
not totally straightforward all right
|
|
that's all thank you
|
|
okay I have a few minutes for questions
|
|
yeah so the question is how long do you
|
|
wait until you've decided that your
|
|
algorithm is not going to work either because
|
|
your code is wrong or it's just too hard
|
|
I don't have a good general answer to
|
|
that I think the problem is worse for
|
|
some algorithms than others I'd say for
|
|
policy gradient methods you don't see
|
|
that burn-in period as much like often if
|
|
it's going to learn it'll learn it at
|
|
the beginning but that's not always true
|
|
either I mean sometimes it will kind of
|
|
take some time to get into the right
|
|
numerical regime so I don't yeah I don't
|
|
have general advice I would say
|
|
just go back and start
|
|
with the easy problems and you'll get
|
|
some intuition about whether you
|
|
should expect a burn-in
|
|
period or not where it's not learning
|
|
anything see I want to get some people
|
|
in the back because okay oh yeah
|
|
question is do I use unit tests I use
|
|
unit tests for code where
|
|
it's doing a very particular
|
|
mathematical thing that you can actually
|
|
write a test for like let's say I'm
|
|
computing the KL divergence then I'll
|
|
write a test to check I don't know there are
|
|
various ways of testing it so and it's
|
|
easy to get those things wrong like
|
|
I don't know you're off by
|
|
a constant or something so yeah I
|
|
write tests for things
|
|
where there's a very
|
|
well-defined correct thing to do it's
|
|
harder to write it for an algorithm
|
|
where it has a lot of different moving
|
|
parts where it's not clear how fast
|
|
it should learn and also there's
|
|
some randomness involved
|
|
so if you try to write a test saying I
|
|
should be at performance 100 after this
|
|
many iterations it might fail just out
|
|
of random noise but yeah I think
|
|
probably unit tests are a good idea
|
|
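A small sketch of the kind of unit test described in this answer, for a KL divergence between two discrete distributions. The function `categorical_kl` and the specific checks are illustrative assumptions, not the speaker's own test suite. The hand-computed case is the one that would catch an off-by-a-constant mistake.

```python
import math


def categorical_kl(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def test_kl_of_distribution_with_itself_is_zero():
    p = [0.2, 0.5, 0.3]
    assert abs(categorical_kl(p, p)) < 1e-12


def test_kl_is_positive_and_asymmetric():
    p, q = [0.9, 0.1], [0.5, 0.5]
    assert categorical_kl(p, q) > 0.0
    assert categorical_kl(q, p) > 0.0
    assert abs(categorical_kl(p, q) - categorical_kl(q, p)) > 1e-3


def test_kl_matches_hand_computed_value():
    # KL([1, 0] || [0.5, 0.5]) = 1 * log(1 / 0.5) = log 2
    assert abs(categorical_kl([1.0, 0.0], [0.5, 0.5]) - math.log(2.0)) < 1e-12


test_kl_of_distribution_with_itself_is_zero()
test_kl_is_positive_and_asymmetric()
test_kl_matches_hand_computed_value()
```

Tests like these are deterministic, unlike a test that asserts a learning curve reaches some score after a fixed number of iterations.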
yeah so the question is do I have
|
|
guidelines on matching the algorithm to
|
|
the task like when to use policy
|
|
gradients versus a value iteration style
|
|
method it's yeah it's hard to give some
|
|
general guidelines I think people have
|
|
found that and the guidelines I give
|
|
you might just be kind of
|
|
historical accidents like someone got
|
|
this to work here and this to work there
|
|
so I think the well certainly if you
|
|
don't care that much about sample
|
|
complexity policy gradient methods are
|
|
probably the way to go
|
|
if you don't care about sample
|
|
complexity or using off policy data then
|
|
policy gradient methods are probably the
|
|
safest bet because I don't know it's
|
|
more understandable exactly what it's
|
|
doing it's just doing gradient descent
|
|
whereas q-learning it's a little bit
|
|
indirect what it's doing and it
|
|
in practice is more finicky yeah if you
|
|
do care about sample complexity though
|
|
or need off-policy data then
|
|
Q-learning is usually better yeah
|
|
sample complexity is
|
|
relevant if your simulator is expensive
|
|
of course I would also say that people
|
|
have found that DQN and relatives have
|
|
worked well on game-like tasks with
|
|
images as input whereas policy gradient
|
|
methods work better on the continuous
|
|
control tasks like these robotic
|
|
locomotion problems though this
|
|
might not be fundamental it might be
|
|
more of a historical accident let's
|
|
oh yeah recommendations on older
|
|
textbooks let's see there's
|
|
Bertsekas's books
|
|
that's approximate dynamic what is it
|
|
optimal control and why am I blanking on
|
|
the name optimal control and dynamic
|
|
programming something like that and
|
|
I mean Sutton and Barto is a good one
|
|
to read Puterman has a textbook kind of
|
|
a classic textbook on Markov decision
|
|
processes that's in the RL space then
|
|
there's books on numerical optimization
|
|
that are good and yeah I'd say obviously
|
|
the machine learning textbooks have a
|
|
lot of good material that might be
|
|
useful in the RL setting too
|
|
oh yeah can I comment on evolution
|
|
strategies and the blog post the
|
|
OpenAI blog post on it let's see do
|
|
you have any specific questions about it
|
|
or like how it compares
|
|
oh yeah okay yeah so
|
|
there's a lot of policy gradient methods
|
|
out there and some of them are quite
|
|
complicated so we've had a couple of
|
|
talks on them so far like all these
|
|
different
|
|
successively more complicated policy
|
|
gradient methods but then there's this
|
|
old algorithm called evolution
|
|
strategies which is an extremely simple
|
|
algorithm and there's a paper by
|
|
some of my colleagues where they show it
|
|
was called Evolution Strategies as a
|
|
Scalable Alternative to Reinforcement
|
|
Learning which really meant like to
|
|
policy gradient methods so they
|
|
claimed that it worked basically as well
|
|
as policy gradient methods or at least
|
|
it's sort of in the same order oh and
|
|
beer is one of the authors of that paper
|
|
so the claim was that it works
|
|
similarly well to policy gradient
|
|
methods so why should we bother with
|
|
these policy gradient methods if ES works
|
|
just as well well I think in practice it
|
|
works but not
|
|
as well like the sample
|
|
complexity is worse by some
|
|
constant factor or it's not clear if
|
|
it's a constant factor or if this factor
|
|
scales with the size of the network but
|
|
it is significantly
|
|
slower and the question is just what is
|
|
that constant factor so is that constant
|
|
factor like one or is it three or is it
|
|
10 or 100 so that's going to
|
|
vary between problems and also that
|
|
paper had some innovations in exactly
|
|
how to parameterize the networks and so
|
|
forth that made everything better
|
|
numerically everything better scaled so
|
|
that ES did work well but I would say
|
|
that it's usually like
|
|
I don't know it's usually a pretty
|
|
decent constant factor slower than
|
|
policy gradient methods especially the
|
|
more advanced ones like the PPO and
|
|
actor so I think it's not
|
|
really a clear win in the RL setting
|
|
where policy gradients work I think if
|
|
policy gradients work it's usually going
|
|
to be a lot better
|
|
and ES is going to be
|
|
better on problems where policy
|
|
gradients aren't going to work for some
|
|
reason like if you've got really long
|
|
time dependencies where the
|
|
discounts are gonna ignore
|
|
them
|
|
then ES might be less sensitive to that
|
|
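To show how simple the evolution strategies update is, here is a minimal sketch on a toy objective. It is a basic variant with standardized returns, not the exact setup of the paper discussed above. The quadratic objective, `TARGET`, and every hyperparameter value are assumptions chosen only so the example runs quickly.

```python
import random


def evolution_strategies(objective, theta, iterations=200, population=50,
                         sigma=0.1, learning_rate=0.05, seed=0):
    """Maximize `objective` by perturbing parameters with Gaussian noise."""
    rng = random.Random(seed)
    theta = list(theta)
    dim = len(theta)
    for _ in range(iterations):
        # Sample one noise vector per population member and score it.
        noises = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(population)]
        returns = [objective([t + sigma * e for t, e in zip(theta, eps)])
                   for eps in noises]
        # Standardize returns so the step size does not depend on reward scale.
        mean = sum(returns) / population
        std = (sum((r - mean) ** 2 for r in returns) / population) ** 0.5
        weights = [(r - mean) / (std + 1e-8) for r in returns]
        # Move along the return-weighted average of the noise directions.
        for i in range(dim):
            step = sum(w * eps[i] for w, eps in zip(weights, noises)) / population
            theta[i] += learning_rate * step
    return theta


TARGET = [1.0, -2.0, 0.5]  # hypothetical optimum of the toy problem


def toy_objective(params):
    """Negative squared distance to TARGET, so higher is better."""
    return -sum((p - t) ** 2 for p, t in zip(params, TARGET))


start = [0.0, 0.0, 0.0]
result = evolution_strategies(toy_objective, start)
print(toy_objective(start), toy_objective(result))
```

Note that the update only ever uses whole-episode returns, with no discounting and no backpropagation through the policy, which is why it can be less sensitive to long time dependencies and also why it tends to need more samples.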
let's see I think I'm okay last question
|
|
oh yeah favorite hyperparameter
|
|
optimization framework I've used some of
|
|
these but I just like to use
|
|
uniform random sampling yeah that works
|
|
really well I mean you just run a bunch
|
|
of experiments with random
|
|
hyperparameters and then you just look at the
|
|
results the next day and do some
|
|
regression to figure out which
|
|
parameters actually mattered and then
|
|
you run another experiment with
|
|
better parameter ranges and so on so I
|
|
use the human version of it because
|
|
often it's
|
|
useful to be able to look at the results
|
|
yourself to figure out
|
|
which parameters actually matter so
|
|
you're not wasting a lot of computation
|
|
because that information transfers
|
|
between problems all right
|
|
[Applause] |
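The random-search loop described in the last answer can be sketched as follows. Everything here is an assumption for illustration: the three hyperparameters and their ranges, the `fake_experiment` stand-in for a real training run, and the use of a per-parameter correlation as a crude substitute for the regression mentioned in the talk.

```python
import math
import random


def sample_config(rng):
    """Draw one configuration uniformly from hypothetical ranges."""
    return {
        "log10_learning_rate": rng.uniform(-5.0, -2.0),
        "log2_batch_size": rng.uniform(4.0, 9.0),
        "discount": rng.uniform(0.9, 0.999),
    }


def fake_experiment(config, rng):
    """Stand-in for a training run: only the learning rate matters here."""
    return -(config["log10_learning_rate"] + 2.0) ** 2 + rng.gauss(0.0, 0.1)


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)
configs = [sample_config(rng) for _ in range(64)]
scores = [fake_experiment(config, rng) for config in configs]

# Rank hyperparameters by how strongly they correlate with the score.
importance = {
    name: abs(pearson([config[name] for config in configs], scores))
    for name in configs[0]
}
ranked = sorted(importance, key=importance.get, reverse=True)
for name in ranked:
    print(f"{name}: {importance[name]:.2f}")
```

After looking at a ranking like this, the next batch of runs would narrow the ranges of the parameters that mattered and stop sweeping the ones that did not.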