ml-debug/docs/evidence/[English (auto-generated)] Deep RL Bootcamp Lecture 6 Nuts and Bolts of Deep RL Experimentation [Do.txt
wassname 4393cceefd initial: ML debugging folklore skill
Deep research to uplift LLMs for ML debugging, opinionated by source
selection. Distilled from Schulman, Jones, Rahtz, Goodfellow, CS231n,
FSDL, and more. Includes runnable diagnostic scripts and LLM-specific
anti-patterns.

Author: wassname (https://github.com/wassname)
2026-03-06 10:11:30 +08:00


so last year at NIPS I was slated to
give a talk at the deep RL workshop and
I wasn't sure what I was going to talk
about because everything I had prepared
I had already talked about it so many
times that I just didn't want to didn't
want to give another talk on it so I I
asked Peter for his advice on what I
should talk about and he said that
Andrew Ng had given a talk earlier in
the conference called the nuts and
bolts of deep learning where he sort of
went through the flowchart of what you
do when you see a new problem and like
if you're overfitting you regularize
and if you're underfitting then you
use a bigger model and so on so so Peter
suggested to come up to write a talk
called the nuts and bolts of deep RL
research where I would talk about some
of the similar lessons and the tips and
tricks for the RL setting so I put
together a talk for that and actually
people seem to like it
so I'll give a slightly updated version
of that talk right now so I'm going to
talk about a few different things some
of which are general and sort of apply
to RL using reinforcement learning in
general and some of them pertain to
particular classes of methods like
policy gradient methods and these are
just sort of little tips and tricks for
how you how you get your algorithm to
work and what you do day to day so let's
say you have a totally new problem
you're trying to solve like you have you
have some new task and you figured out
how to define an observation and an
action space and you have your neural
network policy or Q function but and you
want to start learning learning how to
solve it but you but you've never tried
it before
um so or okay or if you have a new
algorithm you're trying to get working
that you've never you you've never used
it before so so what do you do what's
the first thing you do if you have a new
algorithm
so so that I mean my first advice would
be to use the small problems so you can
run a lot of experiments really quickly
and do a hyper parameter search and it's
really useful too
to be able to visualize the learning
process in as many ways as possible so
look at the state visitation like how
that's evolving over time and look at
how well your value function is fitting
and so on so like I spent a lot of time
looking at the pendulum problem where
you're trying to swing up a pendulum
because this problem has a 2d state
space where it's just the angle and
the angular velocity of the pendulum and
I would visualize here's exactly
what the value function looks like
here's exactly what the state
distribution looks like and here's how
they evolve over time so I would get a
sense for like what's if my algorithm
isn't working is it because it's like
oscillating in some funny way or maybe
it's just giving a bad fit or maybe the
function it's learning the value function
it's learning isn't smooth enough and so on
so I would say try to visualize
everything and maybe use small problems
where you can visualize everything also
yeah it's useful to construct toy
problems where your idea is going to be
the strongest where you think okay if
this idea has any possibility of working
it's going to work there so for example
let's say you're trying to do something
with hierarchical reinforcement learning
then construct some problem where
there's some kind of obvious hierarchy
that it should learn and you'll be able
to tell if it's doing the right thing
also construct the the problems where
it's going to be weakest obviously and
also as a counterpoint to that don't
over fit your method to some contrived
problem so let's say you've come up with
some toy problem where your method is
really good then do realize that it's
a toy problem and don't like tweak
everything to just work on this toy
problem perfectly because yeah it's also
pretty useful to have medium-sized
problems that you're very familiar with
and you know exactly how fast the
learning should be and what the reward
should be at every iteration and so on
so
a few problems that I use a lot like
training on Pong on Atari and the
hopper problem which is
this simulated robot problem with this
hopping robot and I know exactly how
fast an algorithm that's working should
learn on these problems so so I can sort
of it's it makes it easier to tune
things if you have okay that's if you
have a new algorithm let's say you have
a new task I would recommend just making
the task easier until you start seeing
some signs of life you see it learning
something so so there are various ways
you can make it easier you can try doing
some feature engineering so your input
features you think that the you think
that the policy should be a simple
function of your input features like
let's say you're trying to get pong to
work and you tried setting it up with
the images as input and you weren't
learning anything then you can set up
the problem where you pass in XY
coordinates as input and then try
running your algorithm and it's a much
simpler function you're trying to learn
so that's much more likely to work and
then you can try to make it harder and
harder until you're solving the full
problem another way you can make it
easier is by shaping the reward function
that means you come up with some
reward function that gives you fast
feedback on whether you're doing
the right thing or not so let's say we
can define one task where we have this
reaching robot and we just give it a
reward if it reaches if it hits the
target so it gets a reward of one if it
hits the target and zero otherwise so
that might be hard to learn because
you're not getting any feedback as
you're flailing around but we could
define a better shaped reward function
where the where it's just distance to
target then learning is going to be much
faster in that problem there's also the
problem of exactly how to turn your
problem into a POMDP in the first place
so so often it's not clear what your
observation features should be and it's
not even clear what the reward function
should be so or it's not clear if this
problem you're trying to solve is
if it's feasible at all so so let's say
you're trying to solve you you have some
game or some robotics task or something
new like and you you want to turn it
into a reinforcement learning problem
but you're not sure if this is feasible
at all
so the first thing to do is to just
visualize a random policy acting on this
problem and see see what happens so if
the random policy occasionally does the
right thing then there's a high chance
that reinforcement learning is going to
work because a policy
gradient method is just going to take
this random behavior and it's going to
make the good behaviors more
likely
so it'll gradually like hone in on the
good behaviors whereas if you're never
doing the right thing then
RL isn't going to get any signal that
tells it to do the right thing sometimes
RL is able to learn even though it seems
like it it's not clear how it's going to
learn like learning how to walk it's not
clear that that should work but because
you would think that like you really
have to have the whole thing in the
whole policy in place before it does
anything useful but as it turns out you
sort of learn to take one step and then
fall over and then take two steps and
then fall over and so on until you've
got a proper walking gait okay another
thing to do is to make to make your
observations make sure your observations
are useable try to look at them as a
human and see if you can control the
system using the same observations
you're giving to the agent so let's say
you're doing some pre-processing on your
images look at those pre processed
images yourself and make sure you're not
like losing too much detail when you
downsample them or losing too much in
the color transformations and so on
another thing to do is you want to make
sure that everything is reasonably
scaled so that for example well as a
rule of thumb you usually want
everything to be mean 0 and standard
deviation 1 for the observations and for
the rewards well it's a little less
obvious but that's a reasonable
heuristic so
so you might want to like rescale using
some kind of filter I mean that's that's
another good thing you can do but if you
don't want to mess with some kind of
filters on your observations and rewards
what you can do is you can just kind of
if you're allowed to define those
yourself then you might want to just
scale them yourself so what I'd
recommend doing is plot histograms of
all of your observations and your
rewards and make sure that for each
component of the observations and
rewards you've scaled it properly so
that it has the right mean and standard
deviation and it doesn't have crazy
outliers okay another thing to do is you
should have some good baselines that you
can use whenever you see a new when you
whenever you see a new problem so just
it's not clear which algorithm is going
to work beforehand so make sure you you
just have a bunch of like
well-tuned things that you can run on
each problem yeah okay the question was
if you're gonna do some kind of reward
normalization should you do this over
your whole training like all of your
training data or just like the recent
data I would yeah that's a there's a lot
of subtlety there so I would say use all
of your data so far because you're
making everything non-stationary if you
do some kind of filtering actually I'm
going to talk about this at a later
slide so anyway I would recommend as
just a few baselines you should have the
cross-entropy method some policy
gradient methods some kind of Q
learning or SARSA type method there's a
lot of code online now that you can use
other people's code that that's already
written so you can use like we have this
OpenAI baselines repository and also
rllab has a bunch of algorithms okay
another thing to do which people often
get tripped up on especially when
they're trying to reproduce published
work is so you implement the algorithm
based on the paper
and then it doesn't really learn
anything at all and then you think oh
maybe is my code like wrong or what
happened so I would say early on you
might need to run with more samples than
expected
so one hyper parameter that you can
usually adjust is how big of a batch
size to use or how many samples to use
and I would say sometimes you should use
more samples than you think you're going
to need because usually things just work
better when you have more samples almost
always so often sometimes when you're
trying to reproduce a published paper
you've got it mostly right but not
exactly right like maybe you haven't
scaled everything properly or there's
some like there's some really like
obscure hyper parameter that you have
wrong and then you just find that the
code doesn't learn anything so then I
would say just try to make it work a
little bit and then you can work from
there and try to tweak all the hyper
parameters to to get up to the like to
get fully up to the published performance
but if you want to just get something
working at all often you need to use
bigger batch sizes than you thought
because if your batch size is too small
then the noise will overwhelm
the signal and you won't learn anything
so like for example for TRPO I wasn't
seeing any learning for a while and then
it turned out it's just because I was
using too small of a batch size and I
had to use a hundred thousand time steps
of a batch for the batch size and
for Atari for DQN the
hyperparameters that were found to be best
were you update every ten thousand time
steps you update your Q function
every ten thousand time steps and you
have a 1 million time steps in your
replay buffer which is a lot okay so now
I'll talk about some guidelines
for the ongoing development and tuning
process as opposed to the initial
process of I have a totally new problem
or a new algorithm that I want to see
some signs of life on so
let's say you get something working I
recommend looking at how sensitive your
algorithm is to every hyper parameter
and if it's too sensitive it it's not
actually a robust algorithm then you
shouldn't be happy with it you probably
just got lucky on that one problem
and it's it's actually kind of possible
to have a method that does that is a
fluke and it works in one way because
it's I mean one problem because of some
funny dynamics but then it doesn't work
in general so it kind of
needs some serious improvements so yeah
so that's okay there's also a few things
you can look at to see that actually I'm
going to talk about more of these kind
of Diagnostics a little later but there
are some indicators that'll tell you if
that if your algorithm is working
besides just looking at the final
performance but look for other
indicators that are going to tell you
that your optimization process is kind
of healthy so this is going to vary
based on the algorithm but for example
you can look at whether your value
function is actually accurate like
whether it's actually predicting returns
well you can look at how big the updates
are in terms of some either parameter
space or the output space standard
Diagnostics for deep networks like you
can look at norms of gradients and so on
okay one thing that takes some
discipline but is very useful is to have
a system for continually benchmarking
your code and that includes all of your
code not just the one thing you're
tuning right now because often it's easy
to tune your algorithm to work well in
one problem and then mess up the
performance on other problems and it's
really easy to overfit on single
problems when you're just adjusting
hyper parameters so I'd really recommend
having some kind of benchmark you can
run frequently and some kind of battery
of benchmarks that you've run
occasionally along similar lines of
like overfitting of sort of reading
too far into noise or over interpreting
noise it's really easy to just to think
you're improving your algorithm or
you're making it worse but really you're
just seeing random noise so so you can
see seven different tasks these are the
Gym MuJoCo tasks like half cheetah and
hopper and so on and you have three
different algorithms here the red one
the green one and the blue one and you
can see ok let's I mean we can see that
the performance is a little different on
all the problems but let's it looks like
the the red let's see does the green
which one looks like the best well it
kind of varies by problem like the blue
one looks better on this problem and the
red one is worse on this problem and so
on but as it turns out these are all the
exact same algorithms and just random
seeds different random seeds so so it's
easy to imagine that you're just looking
at one of these problems then you see
that blue curve and you think you get
really excited than you think you found
some huge improvement to your algorithm
but it's really that you just got a
lucky seed that one run so yeah really
you've got to run your algorithm
multiple times and average and even if
you're averaging over a lot of seeds
like even if you had like 20 seeds here
there's a still a pretty big error bar
so it's yeah that makes it particularly
hard
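The point about seeds can be made concrete with a small sketch. It summarizes per-seed final returns as a mean and a standard error, and applies a deliberately crude rule of thumb (a gap of more than two combined standard errors) before calling one variant better. The function names, the threshold, and the numbers are all illustrative assumptions, not something from the talk.

```python
import math
import statistics


def summarize_seeds(final_returns):
    """Return (mean, standard error of the mean) over per-seed final returns."""
    n = len(final_returns)
    mean = statistics.mean(final_returns)
    # Sample standard deviation needs at least two seeds.
    std = statistics.stdev(final_returns) if n > 1 else 0.0
    return mean, std / math.sqrt(n)


def clearly_better(returns_a, returns_b, num_stderr=2.0):
    """Crude check: does A beat B by more than num_stderr combined standard errors?"""
    mean_a, se_a = summarize_seeds(returns_a)
    mean_b, se_b = summarize_seeds(returns_b)
    combined = math.sqrt(se_a ** 2 + se_b ** 2)
    return (mean_a - mean_b) > num_stderr * combined


# Hypothetical final returns for five seeds of each variant.
baseline = [1000.0, 1400.0, 900.0, 1300.0, 1100.0]
variant = [1200.0, 1500.0, 1000.0, 1250.0, 1150.0]

print(summarize_seeds(baseline))
print(summarize_seeds(variant))
print(clearly_better(variant, baseline))
```

With these made-up numbers the variant has the higher mean, but the gap is well inside the seed-to-seed spread, so the check does not call it an improvement.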
I mean I'd recommend having like
multiple tasks and multiple seeds and if
you don't do that then you're probably
just overfitting unless you see a really
drastically large improvement another
thing to do is it's easy to keep adding
little modifications to your algorithm
until it gets really complicated and
then you're not sure and then you think
you have this really complicated
algorithm which is perfect but it turns
out that most of the things you did are
unnecessary because some of the
tricks substitute for each other this is
often true because a lot of tricks help
because they're like normalizing things
in a better way or improving your
optimization like making
your optimization less susceptible to
like big spikes I don't know a lot of
different modifications you make have
similar effects so so often you you can
remove them and simplify your algorithm
and this is pretty important so it's
like especially with regard to changes
that do whitening these kind of these
kind of all substitute for each other
and also substitute for changes to your
optimization algorithm and yeah I would
and I would simplify things because it's
then it's more likely that your insights
will generalize to other problems and
also lastly it's pretty useful to
automate your experiments because
otherwise you're going to end up
spending all your day your whole day
just watching your code print out
numbers and and it's actually really
it's it's really tempting to spend all
day doing that but I would I mean
especially if you need to run multiple
random seeds then it's then you you
really need to get your work flow down
so that you're automating this
process and launching lots of
experiments at the same time so I'd
recommend just getting set up with one
of these cloud computing services so you
can just launch experiments on remote
instances and pull the results back when
you're done question oh yeah question is
do you have a recommendation on what
framework to use to keep track of your
experiment results I personally use no
framework at all and I just have like
ipython notebooks and scripts that
collect a bunch of data that's stored in
various log files so I just have scripts
that read all my log files and plot them
I don't use some people like having
databases and stuff where they store all
their hyper parameter results but I
think I don't find it necessary
personally okay so now I'm let's see I'm
going to talk about general tuning
strategies for RL and then after that
I'll talk about some specific tuning
strategies for different classes of
algorithms
okay so one thing is whitening or
standardizing your data so if your
observations have unknown range you
should definitely standardize them I
would do that by computing a running
estimate of the mean and the standard
deviation and then just
z-transforming it like this
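A minimal sketch of that z-transform, assuming a scalar observation component and a running estimate kept over all data seen so far. The class name and the use of Welford's update are illustrative choices, not taken from the talk; for vector observations one such tracker per component would be one way to do it.

```python
import math


class RunningStandardizer:
    """Running mean and variance over all data seen so far (Welford's method)."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def std(self):
        if self.count < 2:
            return 1.0  # no spread estimate yet, so leave the scale alone
        return math.sqrt(self.m2 / self.count)

    def transform(self, x):
        """z-transform: subtract the running mean, divide by the running std."""
        return (x - self.mean) / (self.std() + self.eps)


norm = RunningStandardizer()
for obs in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    norm.update(obs)

print(norm.mean, norm.std(), norm.transform(9.0))
```

Because the statistics cover every sample since the start, each new sample moves the mean and scale less and less, which is one way to get a rescaling that slows down over time.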
and I would recommend computing the mean
and the standard deviation over all data
you've seen so far not just your recent
data because otherwise you're
effectively changing your data in some
way that the policy doesn't know about
like your policy
gradient algorithm doesn't know about
like your policy gradient algorithm is
actually optimizing some objective so
then if you just go and change the
problem out from under it then you're
often going to make things a lot worse
like if you rescale your observations
then your optimization algorithm didn't
know about that so you might just
collapse the performance so that's why I
would recommend using your whole all of
your data from the start of time so that
at least it's going to slow down over
time how fast it's how fast your
scalings are changing so yeah that's
what I would recommend doing with the
observations and for the rewards
I'd recommend rescaling them but not
shifting them because that affects the
agent's will to live so if you if you
shift the mean reward that'll affect
whether how long it wants to survive
you're actually changing the problem ok
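A small worked example of why shifting changes the problem while rescaling does not. It assumes a hypothetical task that pays +1 per step survived, and compares a short episode with a long one under the original, a rescaled, and a shifted reward. All names and numbers are made up for illustration.

```python
def episode_return(rewards, gamma=1.0):
    """Discounted sum of a list of per-step rewards."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total


# Hypothetical task: +1 reward for every step the agent survives.
short_episode = [1.0] * 5
long_episode = [1.0] * 50


def prefers_long(transform):
    """True if the long episode has the higher return after transforming rewards."""
    short = episode_return([transform(r) for r in short_episode])
    long_ = episode_return([transform(r) for r in long_episode])
    return long_ > short


original = prefers_long(lambda r: r)
scaled = prefers_long(lambda r: 0.1 * r)   # rescaling keeps the ordering
shifted = prefers_long(lambda r: r - 2.0)  # shifting makes every step cost -1

print(original, scaled, shifted)
```

After the shift, every extra step lowers the return, so ending the episode early becomes the better strategy even though nothing about the task itself changed.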
another yeah you might also want to try
to standardize prediction targets in the
same way though that's a little more
complicated to do okay yeah so
question is what about PCA whitening
instead of just this elementwise scaling
yeah that could that could definitely
help I haven't I haven't experimented
with that but yeah that could help it's
hard to predict with like with neural
nets if it's going to help or not
because they seem to be pretty good at
disentangling things so I know that if
you have things that are terribly scaled
like they're from negative one thousand
to one thousand and other coordinates
are from negative point one to
point one then it's gonna be slow for
learning so this kind of scaling helps a
lot even though you're using neural
networks okay there's some parameters
that are really generally important like
discount factor that determines whether
you're that determines how long how far
away you're doing credit assignments so
whether you're paying attention to
effects that are delayed by a certain
time so if your discount is gamma equals
point 99 then you're basically ignoring
effects that are delayed by more than a
hundred time steps so so you're kind of
short-sighted that gamma is controlling
your shortsightedness and you might want
to actually look at how long that
corresponds to in real time so usually
in reinforcement learning you're sort of
discretizing time in a certain way and
it's worth paying attention to like is
that 100 time steps like three seconds
of real time or what and what happens
during that time also note that if you
have TD lambda kind of methods for either
for value function estimation or for
policy gradient methods you can get away
with using a gamma that's really
close to one like 0.999 and things
aren't going to go unstable because if
you have a lower lambda of like 0.9 then
that's going to make it so the algorithm
is still stable even though gamma is
really close to one also okay so so as I
mentioned you might want to in in
practice we're usually discretizing some
continuous-time system so then it's
worth seeing if the problem can actually
be solved at this discretization level
so so for example in a game let's say
you're you're doing frame skip meaning
that you repeat the action multiple
times as a human can you control it at
this rate or is it just impossible to
control is it just too like you're doing
the action too many times in a row and
you have too slow responses to control it
and I would also just look at the what
the random exploration looks like and
make sure that you're exploring like the
discretization is going to determine like
how far your Brownian motion goes
because if you're doing the same action
many times in a row then you're going to
be able to then you're going to tend to
explore further so so it's worth just
looking at what the random exploration
does and and choosing your time
discretization in a sensible way so that
it does interesting things question yeah
so the question is if you have a DQN
how would you get started like tuning it
with tuning all the hyper parameters
actually I'm going to talk about DQN
pretty soon so yeah I'll get to that
okay also look at the episode returns
very closely look at don't just look at
the mean look at the minimum and the
maximum so the maximum especially if you
have a deterministic system if you have
a certain maximum return that's
basically something that your policy can
hone in on pretty straightforwardly
because if if you just do that every
time then you're going to increase your
mean return to that level so so it's
worth so so it's useful to look at the
max return to see if your policy is ever
doing like the right thing according to
that max return or if it's just kind of
stuck and it's never discovering the
high return strategy also look at the
episode length which is sometimes
more informative than the episode reward
like if because sometimes well yeah I
won't go into details on that like well
if you have a game you're it might mean
that like you might be losing every time
so you're never seeing yourself win but
the episode length will tell you if
you're losing slower so you might see an
improvement in episode length at the
beginning but not in reward okay for
Policy gradient there are specific
strategies or prediction there are
specific Diagnostics that are really
helpful so look at the entropy really
carefully if your entropy is going down
too fast that means your policy is
becoming deterministic and it's not
going to explore anything so
so be careful and also if it's not going
down your policy is never going to be
that good because it's always really
random also you can sort of alleviate
this issue by using an entropy bonus or
a KL penalty so by stopping yourself
from changing the policy the
probability distribution too fast as a
side effect you also prevent the entropy
from going down too fast when you use
the KL penalty also look at the KL
as a diagnostic like look at how big of
an update you're doing in terms of KL
divergence if your KL is like 0.01
that's a pretty small update but if it's
like a 10 that's a really big update
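As a sketch of this diagnostic, the snippet below computes the KL divergence between the old and new policy for a discrete action space, averaged over a batch of states. It assumes each policy is given as a list of per-state action probabilities and that the new policy gives nonzero probability to every action the old one could take. The function names and numbers are illustrative.

```python
import math


def categorical_kl(p, q):
    """KL(p || q) for two discrete distributions given as probability lists.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mean_kl(old_probs, new_probs):
    """Average KL(old || new) over a batch of states."""
    kls = [categorical_kl(p, q) for p, q in zip(old_probs, new_probs)]
    return sum(kls) / len(kls)


# Hypothetical action probabilities in two states, before and after an update.
old = [[0.5, 0.5], [0.9, 0.1]]
small_update = [[0.52, 0.48], [0.88, 0.12]]
big_update = [[0.001, 0.999], [0.001, 0.999]]

print(mean_kl(old, small_update))  # a small step
print(mean_kl(old, big_update))    # a very large step
```

Logging this number once per policy update is usually enough to spot the spikes.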
question oh yeah how do you question is
how do you measure entropy so so if you
have for most policies you can compute
the entropy analytically so if you have
a discrete action space then you usually
can just compute it analytically and if
you have a continuous policy you're
usually you're using a Gaussian
distribution or something so you can
compute the differential entropy
analytically so here we're talking about
entropy in action space so the average
over state space of the action space
entropy what you actually might care
about even more is the entropy in state
space but you have no hope at actually
calculating that except maybe to do some
really crude approximation of it
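A minimal sketch of the two analytic entropies mentioned here: a discrete policy given as action probabilities, and a Gaussian policy with independent action dimensions given as standard deviations. The helper names are illustrative assumptions. The Gaussian formula is the standard differential entropy, 0.5 * log(2 * pi * e * sigma^2) per dimension.

```python
import math


def categorical_entropy(probs):
    """Entropy in nats of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def diag_gaussian_entropy(stds):
    """Differential entropy of a Gaussian with independent action dimensions."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s * s) for s in stds)


def mean_action_entropy(per_state_probs):
    """Average over sampled states of the action-space entropy."""
    entropies = [categorical_entropy(p) for p in per_state_probs]
    return sum(entropies) / len(entropies)


print(categorical_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 actions
print(categorical_entropy([1.0, 0.0, 0.0, 0.0]))      # deterministic policy
print(diag_gaussian_entropy([1.0, 0.5]))
```

The Gaussian entropy depends only on the standard deviations, so for a state-independent standard deviation it can be logged without looking at any states.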
okay yeah so KL is really useful look at
explained variance like whether your value
function is actually explaining is
actually a good predictor of the returns
or if it's just worse than predicting
nothing so if you just predict zeroes
then your explained variance is zero but
sometimes if you have some neural
network that's predicting then you find
that it's actually negative because it's
overfitting or it's just noisy and it's
not doing anything useful so that
probably means you need to tune some
hyper parameters so that your neural
networks actually predicting better than
the constant predicting zero question
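One common way to compute explained variance, assumed here since the talk does not spell out a formula, is 1 minus the variance of the residuals divided by the variance of the returns. With this definition a constant predictor scores zero, a perfect predictor scores one, and a noisy predictor can go negative. The numbers below are made up.

```python
import statistics


def explained_variance(predictions, returns):
    """1 - Var(returns - predictions) / Var(returns)."""
    var_returns = statistics.pvariance(returns)
    if var_returns == 0.0:
        return float("nan")  # undefined when the returns do not vary
    residuals = [r - p for p, r in zip(predictions, returns)]
    return 1.0 - statistics.pvariance(residuals) / var_returns


# Hypothetical empirical returns from four states.
returns = [1.0, 2.0, 3.0, 4.0]

print(explained_variance([0.0, 0.0, 0.0, 0.0], returns))  # constant predictor
print(explained_variance(returns, returns))               # perfect predictor
print(explained_variance([4.0, 1.0, 4.0, 1.0], returns))  # noisy predictor
```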
okay yeah question is why does the KL
spike give you a loss in performance
well it doesn't always it's not
always a loss in performance sometimes
it's a gain in performance but in
practice it's usually a loss in
performance because usually
your policy gradient
is just taking you way outside the
region where your local approximation to
the policy performance is accurate so
you're you're probably just overshooting
like if you take your policy and you
take a really big step in any direction
you're probably making it worse so so
that's so usually if you take a big step
you're getting worse like if you have a
convex function if you take a big step
in any direction you're probably going
to make it worse let's see okay
initialize your policy that's pretty
important more important than in
supervised learning because that
determines what data you're going to see
initially and you're going to learn from
at the beginning so I would recommend
initializing the final layer
to be either zero or really small so
that at least you have the maximum
entropy and you sort of explore
randomly at the beginning
as opposed to having some kind of
particular like policy that has a strong
opinion on the right thing to do which
is based on no information at all okay
that's for Policy gradient for Q
learning so a few thing a few things one
is okay you often it helps to have a
really big replay buffer and to be able
to do this you need to be a little
careful about memory usage so it's worth
putting in the extra effort to do that
learning rate schedules are often quite
helpful here in practice as are
exploration schedules so in DQN
you're usually using epsilon greedy and
it often helps to do to play with the
schedule on that also it converges
pretty slowly and it has a
mysterious warmup period at the beginning
often so so sometimes you just so I
actually have a lot of admiration for
the authors who originally got this the
people people who got this to work
originally because they had to just let
their code run for a while before it did
anything so so you have to have a lot of
patience and a lot of bravery to do that
ok this is just miscellaneous advice for
not necessarily for tuning algorithms
but just for for personal development so
I recommend reading older textbooks and
theses not just the latest conference
papers because often
they're a more dense source of
useful information whereas each
conference paper just has one idea ok
yeah don't get too stuck on problems
because often you actually have a
legitimately good algorithm but it has
like some flaws so it might fail
miserably at some easy problem so in RL
there's some like simple problems like
cart-pole swing-up where you have this
stick and you're trying to swing it up
by moving the cart around and this
problem like you might have a great
algorithm but like
some of the state-of-the-art algorithms
are gonna fail on that problem unless
you really tuned them carefully and
that's just because maybe it's not
exactly the right problem to start to I
mean maybe like the thing that makes
this problem hard is not the thing that
your algorithm is doing that's
interesting so you might have like come
up with a better policy gradient method
but still it'll converge to the same
local minimum on that swing up problem
and you're not gonna fix that problem so
I I would say just don't get too stuck
on a single problem that your method
fails on and like maybe the
ultimate algorithm will solve all of
these problems but we're not there yet
so you might as well just try to improve
on some like decently large subset of
problems so also like one funny thing is
that DQN performs pretty poorly on a lot
of problems especially with continuous
control
I think it does I mean for cart-pole it
probably solves it pretty well if with a
reasonable amount of tuning but some of
the other like fairly small continuous
control problems it fails on but that
doesn't mean it's like that doesn't mean
it's a bad algorithm because it solves a
different problem extremely well so yeah
I would say just these these things are
at least right now it's not gonna you
shouldn't expect to be able to solve
everything with the same method without
any tuning also techniques from
supervised learning often don't transfer
over to reinforcement learning so so
don't be surprised if you find that I
guess that's not I said this slide was
gonna be about personal development
that's not about personal development
but yeah I guess this is just a grab bag
of miscellaneous advice so yeah so like
batch norm a lot of people look at what
people are doing in RL and they think
why aren't you using batch norm or drop
out or or big networks why are you using
like two layers of 64 units and it's not
like people didn't think of trying these
other things they tried them and then
they found that those other like
architectures and methods don't actually
help here I mean if you figure out how
to make batch norm and drop out actually
help in RL that'll actually be really
great and a big that would be a big
development but yeah I don't know it's
not totally straightforward all right
that's all thank you
okay I have a few minutes for questions
yeah so the question is how long do you
wait until you've decided that your
algorithm is not going to work either because
your code is wrong or it's just too hard
I don't have a good general answer to
that I think the problem is worse for
some algorithms than others I'd say for
policy gradient methods you don't see
that burn-in period as much like often if
it's going to learn it'll learn it at
the beginning but that's not always true
either I mean sometimes it will kind of
take some time to get into the right
numerical regime so I don't yeah I don't
have general advice I would say you have
to just I would say go back and start
with the easy problems and you'll get
some intuition about whether you're you
should expect a you should expect a burn
in period or not where it's not learning
anything see I want to get some people
in the back because okay oh yeah
the question is: do I use unit tests? I
use unit tests for code where it's doing
a very particular mathematical thing
that you can actually write a test for.
Like, let's say I'm computing the KL
divergence; then I'll write a test to
check it. I don't know, there are
various ways of testing it. And it's
easy to get those things wrong, like
you're off by a constant or something.
So yeah, I write tests for things where
there's a very well-defined correct
thing to do. It's harder to write one
for an algorithm that has a lot of
different moving parts, where it's not
clear how fast it should learn and
there's also some randomness involved.
So if you try to write a test saying I
should be at performance 100 after this
many iterations, it might fail just out
of random noise. But yeah, I think unit
tests are probably a good idea.
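As a minimal sketch of the kind of test described here (not the speaker's actual code), assume a hypothetical helper `kl_categorical(p, q)` for discrete distributions. The checks use standard properties of the KL divergence: it is zero for identical distributions, non-negative, asymmetric, and equal to a value that can be worked out by hand, which is the check most likely to catch an off-by-a-constant bug.

```python
import math
import unittest


def kl_categorical(p, q):
    """KL(p || q) in nats for two discrete distributions given as lists.

    Hypothetical helper for illustration; terms with p_i == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


class TestKL(unittest.TestCase):
    def test_identical_distributions_give_zero(self):
        p = [0.2, 0.3, 0.5]
        self.assertAlmostEqual(kl_categorical(p, p), 0.0)

    def test_non_negative(self):
        self.assertGreaterEqual(kl_categorical([0.9, 0.1], [0.5, 0.5]), 0.0)

    def test_matches_hand_computed_value(self):
        # KL([1, 0] || [0.5, 0.5]) = 1 * log(1 / 0.5) = log 2.
        # A stray constant factor or offset would break this check.
        self.assertAlmostEqual(
            kl_categorical([1.0, 0.0], [0.5, 0.5]), math.log(2)
        )

    def test_asymmetric(self):
        p, q = [0.9, 0.1], [0.5, 0.5]
        self.assertNotAlmostEqual(kl_categorical(p, q), kl_categorical(q, p))
```

Saved in a file, these can be run with `python -m unittest`. Every assertion is deterministic, so none of them can fail from random noise the way a learning-curve threshold test can.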
Yeah, so the question is: do I have
guidelines on matching the algorithm to
the task, like when to use policy
gradients versus a value-iteration-style
method? It's hard to give general
guidelines, and the guidelines I give
you might just be historical accidents,
like someone got this to work here and
that to work there. Certainly, if you
don't care that much about sample
complexity or using off-policy data,
then policy gradient methods are
probably the safest bet, because it's
more understandable exactly what it's
doing: it's just doing gradient descent.
Whereas with Q-learning it's a little
bit indirect what it's doing, and in
practice it's more finicky. If you do
care about sample complexity, though, or
need off-policy data, then Q-learning is
usually better. Sample complexity is
relevant if your simulator is expensive,
of course. I would also say that people
have found that DQN and relatives have
worked well on game-like tasks with
images as input, whereas policy gradient
methods work better on continuous
control tasks like these robotic
locomotion problems. Though this might
not be fundamental; it might be more of
a historical accident. Let's see.
Oh yeah, recommendations on older
textbooks. Let's see, there's
Bertsekas's books. That's Approximate
Dyn... what is it, Optimal Control... why
am I blanking on the name? Optimal
control and dynamic programming,
something like that. And, I mean, Sutton
and Barto is a good one to read.
Puterman has a textbook, kind of a
classic textbook, on Markov decision
processes. That's in the RL space. Then
there's books on numerical optimization
that are good. And I'd say obviously the
machine learning textbooks have a lot of
good material that might be useful in
the RL setting too.
Oh yeah, can I comment on evolution
strategies and the blog post, the OpenAI
blog post on it? Let's see, do you have
any specific questions about it, or like
how it compares? Oh yeah, okay. So
there's a lot of policy gradient methods
out there, and some of them are quite
complicated. We've had a couple of talks
on them so far, all these different,
successively more complicated policy
gradient methods. But then there's this
old algorithm called evolution
strategies, which is an extremely simple
algorithm. And there's a paper by some
of my colleagues, called "Evolution
Strategies as a Scalable Alternative to
Reinforcement Learning", which really
meant an alternative to policy gradient
methods. They claimed that it worked
basically as well
as policy gradient methods, or at least
it's sort of in the same order. Oh, and
[inaudible] is one of the authors of that paper.
So the claim was that it works similarly
well to policy gradient methods. So why
should we bother with these policy
gradient methods if ES works just as
well? Well, I think in practice it
works, but not as well. The sample
complexity is worse by some constant
factor, or it's not clear whether it's a
constant factor or whether this factor
scales with the size of the network. But
it is significantly slower, and the
question is just what that factor is: is
it 1, or 3, or 10, or 100? That's going
to vary between problems. Also, that
paper had some innovations in exactly
how to parameterize the networks and so
forth that made everything better scaled
numerically, so that ES did work well.
But I would say
that it's usually a pretty
decent constant factor slower than
policy gradient methods, especially the
more advanced ones like PPO and ACKTR.
So I think it's not
really a clear win in the RL setting
where policy gradients work. I think if
policy gradients work, they're usually
going to be a lot better, and ES is
going to be better on problems where
policy gradients aren't going to work
for some reason. Like if you've got
really long time dependencies, where the
discounts are going to ignore them, then
ES might be less sensitive to that.
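For reference, a minimal sketch of a basic evolution strategies loop, assuming the common form with Gaussian parameter perturbations and a score-weighted update. The toy objective `reward`, the function name `evolution_strategies`, and every hyperparameter value are illustrative assumptions, not taken from the talk or the paper. Each update costs `population` evaluations of the objective, which is where the sample-complexity concern shows up.

```python
import random


def reward(theta):
    """Toy stand-in for an episode return; maximized at theta = [3, -2]."""
    return -((theta[0] - 3.0) ** 2 + (theta[1] + 2.0) ** 2)


def evolution_strategies(f, theta, iterations=300, population=50,
                         sigma=0.1, lr=0.005, seed=0):
    rng = random.Random(seed)
    theta = list(theta)  # work on a copy so the caller's list is untouched
    dim = len(theta)
    for _ in range(iterations):
        # Sample Gaussian perturbations and score each perturbed vector.
        noises = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(population)]
        scores = [f([t + sigma * e for t, e in zip(theta, eps)])
                  for eps in noises]
        # Standardize scores so the step does not depend on reward scale.
        mean = sum(scores) / population
        std = (sum((s - mean) ** 2 for s in scores) / population) ** 0.5
        if std == 0.0:
            std = 1.0
        # Move theta along the score-weighted average of the perturbations.
        for i in range(dim):
            weighted = sum((s - mean) / std * eps[i]
                           for s, eps in zip(scores, noises))
            theta[i] += lr * weighted / (population * sigma)
    return theta


theta = evolution_strategies(reward, [0.0, 0.0])
print(theta, reward(theta))
```

The loop only ever calls `f` on whole parameter vectors and uses the returned score, so nothing in it depends on discounting or per-timestep credit assignment.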
Let's see, I think... okay, last
question. Oh yeah, favorite
hyperparameter optimization framework.
I've used some of these, but I just like
to use uniform random sampling. That
works really well. I mean, you just run
a bunch of experiments with random
hyperparameters, and then you look at
the results the next day and do some
regression to figure out which
parameters actually mattered. Then you
run another experiment with better
parameter ranges, and so on. So I use
the human version of it, because often
it's useful to be able to look at the
results yourself to figure out which
parameters actually matter, so you're
not wasting a lot of computation,
because that information transfers
between problems. All right.
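The workflow just described can be sketched roughly as follows. Everything here is an assumption for illustration: `run_experiment` is a toy stand-in for a real training run, the hyperparameter names and ranges are made up, and the "regression" is reduced to a one-variable linear fit per hyperparameter (reported as R^2 against the log of the value), which only detects monotone trends.

```python
import math
import random


def run_experiment(lr, entropy_coef, rng):
    """Hypothetical stand-in for a training run; returns a final score.

    In this toy, the score depends on the learning rate plus noise and
    ignores entropy_coef entirely.
    """
    return 2.0 * math.log10(lr) + rng.gauss(0.0, 0.5)


def sample_config(rng):
    # Log-uniform sampling: draw the exponent uniformly, then exponentiate.
    return {
        "lr": 10 ** rng.uniform(-5, -1),
        "entropy_coef": 10 ** rng.uniform(-4, -1),
    }


def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. R^2 of a one-variable linear fit."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


rng = random.Random(0)
configs = [sample_config(rng) for _ in range(200)]
scores = [run_experiment(rng=rng, **c) for c in configs]

# How much of the score variance does each hyperparameter explain alone?
importance = {
    name: r_squared([math.log10(c[name]) for c in configs], scores)
    for name in ("lr", "entropy_coef")
}
print(importance)
```

A hyperparameter with R^2 near zero can be fixed in the next round, and the range of one that matters can be narrowed around the best-scoring runs. Plotting score against each hyperparameter is still worth doing by eye, since a peak in the middle of the range would not show up in a linear fit.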
[Applause]