diff --git a/docs/evidence/[English (auto-generated)] Deep RL Bootcamp Lecture 6 Nuts and Bolts of Deep RL Experimentation [Do(1).txt b/docs/evidence/[English (auto-generated)] Deep RL Bootcamp Lecture 6 Nuts and Bolts of Deep RL Experimentation [Do(1).txt
deleted file mode 100644
index 69e9285..0000000
--- a/docs/evidence/[English (auto-generated)] Deep RL Bootcamp Lecture 6 Nuts and Bolts of Deep RL Experimentation [Do(1).txt	
+++ /dev/null
@@ -1,1008 +0,0 @@
-so last year at nips I was slated to
-give a talk at the deep RL workshop and
-I wasn't sure what I was going to talk
-about because everything I had prepared
-I had already talked about it so many
-times that I just didn't want to didn't
-want to give another talk on it so I I
-asked Peter for his advice on what I
-should talk about and he said that
-Andrew Ng had given a talk earlier in
-the conference called the nuts and
-bolts of deep learning where he sort of
-went through the flowchart of what you
-do when you see a new problem and like
-if you if you're overfitting you
-regularize and if you're underfitting then you
-use a bigger model and so on so so Peter
-suggested to come up to write a talk
-called the nuts and bolts of deep RL
-research where I would talk about some
-of the similar lessons and the tips and
-tricks for the RL setting so I put
-together a talk for that and actually
-people seem to like it
-so I'll give a slightly updated version
-of that talk right now so I'm going to
-talk about a few different things some
-of which are general and sort of apply
-to RL using reinforcement learning in
-general and some of them pertain to
-particular classes of methods like
-policy gradient methods and these are
-just sort of little tips and tricks for
-how you how you get your algorithm to
-work and what you do day to day so let's
-say you have a totally new problem
-you're trying to solve like you have you
-have some new tasks and you figured out
-how to you defined an observation in an
-action space and you have your neural
-network policy or Q function but and you
-want to start learning learning how to
-solve it but you but you've never tried
-it before
-um so or okay or if you have a new
-algorithm you're trying to get working
-that you've never you you've never used
-it before so so what do you do what's
-the first thing you do if you have a new
-algorithm
-so so that I mean my first advice would
-be to use the small problems so you can
-run a lot of experiments really quickly
-and do a hyper parameter search and it's
-really useful too
-to be able to visualize the learning
-process in as many ways as possible so
-look at the state visitation like how
-that's evolving over time and look at
-how well your value function is fitting
-and so on so like I spent a lot of time
-looking at the pendulum problem where
-you're trying to swing up a pendulum
-because this problem has a 2d state
-space where it's just the angle and
-the angular velocity of the pendulum and
-I would visualize here's exactly
-what the value function looks like
-here's exactly what the state
-distribution looks like and here's how
-they evolve over time so I would get a
-sense for like what's if my algorithm
-isn't working is it because it's like
-oscillating in some funny way or maybe
-it's just giving a bad fit or maybe the
-function it's learning the value function
-it's learning isn't smooth enough and so on
-so I would say try to visualize
-everything and maybe use small problems
-where you can visualize everything also
-yeah it's useful to construct toy
-problems where your idea is going to be
-the strongest where you think okay if
-this idea has any possibility of working
-it's going to work there so for example
-let's say you're trying to do something
-with hierarchical reinforcement learning
-then construct some problem where
-there's some kind of obvious hierarchy
-that it should learn and you'll be able
-to tell if it's doing the right thing
-also construct the the problems where
-it's going to be weakest obviously and
-also as a counterpoint to that don't
-over fit your method to some contrived
-problem so let's say you've come up with
-some toy problem where your method is
-really good then do realize that it's
-a toy problem and don't like tweak
-everything to just work on this toy
-problem perfectly because yeah it's also
-pretty useful to have medium-sized
-problems that you're very familiar with
-and you know exactly how fast the
-learning should be and what the reward
-should be at every iteration and so on
-so
-a few problems that I use a lot like
-training on pong Atari and the hopper
-or the hopper-like problem which is
-this simulated robot problem with this
-hopping robot and I know exactly how
-fast an algorithm that's working should
-learn on these problems so so I can sort
-of it's it makes it easier to tune
-things if you have okay that's if you
-have a new algorithm let's say you have
-a new task I would recommend just making
-the task easier until you start seeing
-some signs of life you see it learning
-something so so there are various ways
-you can make it easier you can try doing
-some feature engineering so your input
-features you think that the you think
-that the policy should be a simple
-function of your input features like
-let's say you're trying to get pong to
-work and you tried setting it up with
-the images as input and you weren't
-learning anything then you can set up
-the problem where you pass in XY
-coordinates as input and then try
-running your algorithm and it's a much
-simpler function you're trying to learn
-so that's much more likely to work and
-then you can try to make it harder and
-harder until you're solving the full
-problem another way you can make it
-easier is by shaping the reward function
-that means you if you come up with some
-reward function that gives you fast
-feedback on whether you're doing
-the right thing or not so let's say we
-can define one task where we have this
-reaching robot and we just give it a
-reward if it reaches if it hits the
-target so it gets a reward of one if it
-hits the target and zero otherwise so
-that might be hard to learn because
-you're not getting any feedback as
-you're flailing around but we could
-define a better shaped reward function
-where the where it's just distance to
-target then learning is going to be much
-faster in that problem there's also the
-problem on exactly how to turn your
-problem into a POMDP in the first place
-so so often it's not clear what your
-observation features should be and it's
-not even clear what the reward function
-should be so or it's not clear if this
-problem you're trying to solve is
-if it's feasible at all so so let's say
-you're trying to solve you you have some
-game or some robotics task or something
-new like and you you want to turn it
-into a reinforcement learning problem
-but you're not sure if this is feasible
-at all
-so the first thing to do is to just
-visualize a random policy acting on this
-problem and see see what happens so if
-the random policy occasionally does the
-right thing then there's a high chance
-of reinforcement learning is going to
-work because REINFORCE or a policy
-gradient method is just going to take
-this random behavior and it's going to
-make the good behaviors more
-likely
-so it'll gradually like hone in on the
-good behaviors whereas if you're never
-doing the right thing then then there's
-RL isn't going to get any signal that
-tells it to do the right thing sometimes
-RL is able to learn even though it seems
-like it it's not clear how it's going to
-learn like learning how to walk it's not
-clear that that should work but because
-you would think that like you really
-have to have the whole thing in the
-whole policy in place before it does
-anything useful but as it turns out you
-sort of learn to take one step and then
-fall over and then take two steps and
-then fall over and so on until you've
-got a proper walking gait okay another
-thing to do is to make to make your
-observations make sure your observations
-are useable try to look at them as a
-human and see if you can control the
-system using the same observations
-you're giving to the agent so let's say
-you're doing some pre-processing on your
-images look at those pre processed
-images yourself and make sure you're not
-like losing too much detail when you
-downsample them or losing too much in
-the color transformations and so on
-another thing to do is you want to make
-sure that everything is reasonably
-scaled so that for example well as a
-rule of thumb you usually want
-everything to be mean 0 and standard
-deviation 1 for the observations and for
-the rewards well it's a little less
-obvious but that's a reasonable
-heuristic so
-so you might want to like rescale using
-some kind of filter I mean that's that's
-another good thing you can do but if you
-don't want to mess with some kind of
-filters on your observations and rewards
-what you can do it you can just kind of
-if you're allowed to define those
-yourself then you might want to just
-scale them yourself so what I'd
-recommend doing is plot histograms of
-all of your observations and your
-rewards and make sure that for each
-component of the observations and
-rewards you've scaled it properly so
-that it has the right mean and standard
-deviation and it doesn't have crazy
-outliers okay another thing to do is you
-should have some good baselines that you
-can use whenever you see a new when you
-whenever you see a new problem so just
-it's not clear which algorithm is going
-to work beforehand so make sure you you
-just have a bunch of a bunch of like
-well tune things that you can run on
-each problem yeah okay the question was
-if you're gonna do some kind of reward
-normalization should you do this over
-your whole training like all of your
-training data or just like the recent
-data I would yeah that's a there's a lot
-of subtlety there so I would say use all
-of your data so far because you're
-making everything non-stationary if you
-do some kind of filtering actually I'm
-going to talk about this at a later
-slide so anyway I would recommend as
-just a few baselines you should have a
-cross-entropy method some policy
-gradient methods some kind of Q
-learning or SARSA type method there's a
-lot of code online now that you can use
-other people's code that that's already
-written so you can use like we have this
-OpenAI Baselines repository and also
-rllab has a bunch of algorithms okay
-another thing to do which people often
-get tripped up on especially when
-they're trying to reproduce published
-work is so you implement the algorithm
-based on the paper
-and then it doesn't really learn
-anything at all and then you think oh
-maybe is my code like wrong or what
-happened so I would say early on you
-might need to run with more samples than
-expected
-so one hyper parameter that you can
-usually adjust is how big of a batch
-size to use or how many samples to use
-and I would say sometimes you should use
-more samples than you think you're going
-to need because usually things just work
-better when you have more samples almost
-always so often sometimes when you're
-trying to reproduce a published paper
-you've got it mostly right but not
-exactly right like maybe you haven't
-scaled everything properly or there's
-some like there's some really like
-obscure hyper parameter that you have
-wrong and then you just find that the
-code doesn't learn anything so then I
-would say just try to make it work a
-little bit and then you can work from
-there and try to tweak all the hyper
-parameters to to get up to the like to
-get fully up to the published performance
-but if you want to just get something
-working at all often you need to use
-bigger batch sizes than you thought
-because if your batch size is too small
-then the noise will overwhelm
-the signal and you won't learn anything
-so like for example for TRPO I wasn't
-seeing any learning for a while and then
-it turned out it's just because I was
-using too small of a batch size and I
-had to use a hundred thousand time steps
-of a batch for the batch size but and
-for Atari they for DQN the hyper
-parameters that were found to be best
-were you update every ten thousand time
-steps you update your Q function
-every ten thousand time steps and you
-have a 1 million time steps in your
-replay buffer which is a lot okay so now
-I'll talk about some guidelines for on
-for the ongoing development and tuning
-process as opposed to the initial
-process of I have a totally new problem
-or a new algorithm that I want to see
-some signs of life on so
-let's say you get something working I
-recommend looking how sensitive your
-algorithm is to every hyper parameter
-and if it's too sensitive it it's not
-actually a robust algorithm then you
-shouldn't be happy with it you probably
-just got lucky on that one problem
-and it's it's actually kind of possible
-to have a method that does that is a
-fluke and it works in one way because
-it's I mean one problem because of some
-funny dynamics but then it doesn't work
-in general so you kind of have to it
-need some serious improvements so yeah
-so that's okay there's also a few things
-you can look at to see that actually I'm
-going to talk about more of these kind
-of Diagnostics a little later but there
-are some indicators that'll tell you if
-that if your algorithm is working
-besides just looking at the final
-performance but look for other
-indicators that are going to tell you
-that your optimization process is kind
-of healthy so this is going to vary
-based on the algorithm but for example
-you can look at whether your value
-function is actually accurate like
-whether it's actually predicting returns
-well you can look at how big the updates
-are in terms of some either parameter
-space or the output space standard
-Diagnostics for deep networks like you
-can look at norms of gradients and so on
-okay one thing that takes some
-discipline but is very useful is to have
-a system for continually benchmarking
-your code and that includes all of your
-code not just the one thing you're
-tuning right now because often it's easy
-to tune your algorithm to work well in
-one problem and then mess up the
-performance on other problems and it's
-really easy to overfit on single
-problems when you're just adjusting
-hyper parameters so I'd really recommend
-having some kind of benchmark you can
-run frequently and some kind of battery
-of benchmarks that you've run
-occasionally along similar lines of
-like overfitting or sort of reading
-too far into noise or over interpreting
-noise it's really easy to just to think
-you're improving your algorithm or
-you're making it worse but really you're
-just seeing random noise so so you can
-see seven different tasks these are the
-Gym MuJoCo tasks like HalfCheetah and
-hopper and so on and you have three
-different algorithms here the red one
-the green one and the blue one and you
-can see ok let's I mean we can see that
-the performance is a little different on
-all the problems but let's it looks like
-the the red let's see does the green
-which one looks like the best well it
-kind of varies by problem like the blue
-one looks better on this problem and the
-red one is worse on this problem and so
-on but as it turns out these are all the
-exact same algorithms and just random
-seeds different random seeds so so it's
-easy to imagine that you're just looking
-at one of these problems then you see
-that blue curve and you think you get
-really excited and you think you found
-some huge improvement to your algorithm
-but it's really that you just got a
-lucky seed that one run so yeah really
-you've got to run your algorithm
-multiple times and average and even if
-you're averaging over a lot of seeds
-like even if you had like 20 seeds here
-there's a still a pretty big error bar
-so it's yeah that makes it particularly
-hard
-I mean I'd recommend having like
-multiple tasks and multiple seeds and if
-you don't do that then you're probably
-just overfitting unless you see a really
-drastically large improvement another
-thing to do is it's easy to keep adding
-little modifications to your algorithm
-until it gets really complicated and
-then you're not sure and then you think
-you have this really complicated
-algorithm which is perfect but it turns
-out that most of the things you did are
-unnecessary because some of the
-tricks substitute for each other this is
-often true because a lot of tricks help
-because they're like normalizing things
-in a better way or improving your
-optimization like making
-your optimization less susceptible to
-like big spikes I don't know a lot of
-different modifications you make have
-similar effects so so often you you can
-remove them and simplify your algorithm
-and this is pretty important so it's
-like especially with regard to changes
-that do whitening these kind of these
-kind of all substitute for each other
-and also substitute for changes to your
-optimization algorithm and yeah I would
-and I would simplify things because it's
-then it's more likely that your insights
-will generalize to other problems and
-also lastly it's pretty useful to
-automate your experiments because
-otherwise you're going to end up
-spending all your day your whole day
-just watching your code print out
-numbers and and it's actually really
-it's it's really tempting to spend all
-day doing that but I would I mean
-especially if you need to run multiple
-random seeds then it's then you you
-really need to get your work flow down
-so that you're automating this
-process and launching lots of
-experiments at the same time so I'd
-recommend just getting set up with one
-of these cloud computing services so you
-can just launch experiments on remote
-instances and pull the results back when
-you're done question oh yeah question is
-you have a recommendation on what
-framework to use to keep track of your
-experiment results I personally use no
-framework at all and I just have like
-ipython notebooks and scripts that
-collect a bunch of data that's stored in
-various log files so I just have scripts
-that read all my log files and plot them
-I don't use some people like having
-databases and stuff where they store all
-their hyper parameter results but I
-think I don't find it necessary
-personally okay so now I'm let's see I'm
-going to talk about general tuning
-strategies for RL and then after that
-I'll talk about some specific tuning
-strategies for different classes of
-algorithms
-okay so one thing is whitening or
-standardizing your data so if your
-observations have unknown range you
-should definitely standardize them I
-would do that by computing a running
-estimate of the mean and the standard
-deviation and then just transform it
-z-transforming it like this
-and I would recommend computing the mean
-and the standard deviation over all data
-you've seen so far not just your recent
-data because otherwise you're
-effectively changing your data in some
-way that the policy doesn't know about
-like you have your that your policy
-gradient algorithm doesn't know about
-like your policy gradient algorithm is
-actually optimizing some objective so
-then if you just go and change the
-problem out from under it then you're
-often going to make things a lot worse
-like if you rescale your observations
-then your optimization algorithm didn't
-know about that so you might just
-collapse the performance so that's why I
-would recommend using your whole all of
-your data from the start of time so that
-at least it's going to slow down over
-time how fast it's how fast your
-scalings are changing so yeah that's
-what I would recommend doing with the
-observations and for the rewards
-I'd recommend rescaling it but not
-shifting them because that affects the
-agent's will to live so if you if you
-shift the mean reward that'll affect
-whether how long it wants to survive
-you're actually changing the problem ok
-another yeah you might also want to try
-to standardize prediction targets in the
-same way though that's a little more
-complicated to do using okay yeah so
-question is what about PCA whitening
-instead of just this elementwise scaling
-yeah that could that could definitely
-help I haven't I haven't experimented
-with that but yeah that could help it's
-hard to predict with like with neural
-nets if it's going to help or not
-because they seem to be pretty good at
-disentangling things so I know that if
-you have things that are terribly scaled
-like they're from negative one thousand
-to one thousand and other coordinates
-are from negative point one to
-point one then it's gonna be slow for
-learning so this kind of scaling helps a
-lot even though you're having their own
-networks okay there's some parameters
-that are really generally important like
-discount factor that determines whether
-you're that determines how long how far
-away you're doing credit assignments so
-whether you're paying attention to
-effects that are delayed by a certain
-time so if your discount is gamma equals
-point 99 then you're basically ignoring
-effects that are delayed by more than a
-hundred time steps so so you're kind of
-short-sighted that gamma is controlling
-your shortsightedness and you might want
-to actually look at if how long that
-corresponds to in real time so usually
-in reinforcement learning you're sort of
-discretizing time in a certain way and
-it's worth paying attention to like is
-that 100 time steps like three seconds
-of real time or what and what happens
-during that time also note that if you
-have TD(lambda) kind of methods for either
-for value function estimation or for
-policy gradient methods you can get away
-with using a gamma that's really
-close to one like 0.999 and things
-aren't going to go unstable because if
-you have a lower lambda like 0.9 then
-that's going to make it so the algorithm
-is still stable even though gamma is
-really close to one also okay so so as I
-mentioned you might want to in in
-practice we're usually discretizing some
-continuous-time system so then it's
-worth seeing if the problem can actually
-be solved at this discretization level
-so so for example in a game let's say
-you're you're doing frame skip meaning
-that you repeat the action multiple
-times as a human can you control it at
-this rate or is it just impossible to
-control is it just too like you're doing
-the action too many times in a row and
-you have too slow responses to control it
-and I would also just look at the what
-the random exploration looks like and if
-you make sure that you're exploring like
-the the
-discretization is going to determine like
-how far your Brownian motion goes
-because if you're doing the same action
-many times in a row then you're going to
-be able to then you're going to tend to
-explore further so so it's worth just
-looking at what the random exploration
-does and and choosing your time
-discretization in a sensible way so that
-it does interesting things question yeah
-so the question is if you have a DQN
-how would you get started like tuning it
-with tuning all the hyper parameters
-actually I'm going to talk about DQN
-pretty soon so yeah I'll get to that
-okay also look at the episode returns
-very closely look at don't just look at
-the mean look at the minimum and the
-maximum so the maximum especially if you
-have a deterministic system if you have
-a certain maximum return that's
-basically something that your policy can
-hone in on pretty straightforwardly
-because if if you just do that every
-time then you're going to increase your
-mean return to that level so so it's
-worth so so it's useful to look at the
-max return to see if your policy is ever
-doing like the right thing according to
-that max return or if it's just kind of
-stuck and it's never discovering the
-high return strategy also look at the
-episode length which is sometimes
-more informative than the episode reward
-like if because sometimes well yeah I
-won't go into details on that like well
-if you have a game you're it might mean
-that like you might be losing every time
-so you're never seeing yourself win but
-the episode length will tell you if
-you're losing slower so you might see an
-improvement in episode length at the
-beginning but not in reward okay for
-Policy gradient there are specific
-strategies or prediction there are
-specific Diagnostics that are really
-helpful so look at the entropy really
-carefully if your entropy is going down
-too fast that means your policy is
-becoming deterministic and it's not
-going to explore anything so
-so be careful and also if it's not going
-down your policy is never going to be
-that good because it's always really
-random also you can sort of alleviate
-this issue by using an entropy bonus or
-a KL penalty so by stopping yourself
-from changing the policy the
-probability distribution too fast as a
-side effect you also prevent the entropy
-from going down too fast when you use
-the KL penalty also look at the KL
-as a diagnostic like look at how big of
-an update you're doing in terms of KL
-divergence if your KL is like 0.01
-that's a pretty small update but if it's
-like a 10 that's a really big update
-question oh yeah how do you question is
-how do you measure entropy so so if you
-have for most policies you can compute
-the entropy analytically so if you have
-a discrete action space then you usually
-can just compute it analytically and if
-you have a continuous policy you're
-usually you're using a Gaussian
-distribution or something so you can
-compute the differential entropy
-analytically so here we're talking about
-entropy in action space so the average
-over state space of the action space
-entropy what you actually might care
-about even more is the entropy in state
-space but you have no hope at actually
-calculating that except maybe to do some
-really crude approximation of it
-okay yeah so KL is really useful look at
-explained variance like whether your value
-function is actually explaining is
-actually a good predictor of the returns
-or if it's just worse than predicting
-nothing so if you just predict zeroes
-then your explained variance is zero but
-sometimes if you have some neural
-network that's predicting then you find
-that it's actually negative because it's
-overfitting or it's just noisy and it's
-not doing anything useful so that
-probably means you need to tune some
-hyper parameters so that your neural
-networks actually predicting better than
-the constant predicting zero question
-okay yeah question is why does the KL
-spike give you a loss in performance
-well it doesn't always it's not
-always a loss in performance sometimes
-it's a gain in performance but in
-practice it's usually a loss in
-performance because it usually the
-approximation that your policy gradient
-is just taking you way outside the
-region where your local approximation to
-the policy performance is accurate so
-you're you're probably just overshooting
-like if you take your policy and you
-take a really big step in any direction
-you're probably making it worse so so
-that's so usually if you take a big step
-you're getting worse like if you have a
-convex function if you take a big step
-in any direction you're probably going
-to make it worse let's see okay
-initialize your policy that's pretty
-important more important than in
-supervised learning because that
-determines what data you're going to see
-initially and you're going to learn from
-at the beginning so I would recommend
-initializing the final layer
-to be either zero or really small so
-that at least you you have the maximum
-and you sort of explore randomly at the
-beginning we randomly at the beginning
-as opposed to having some kind of
-particular like policy that has a strong
-opinion on the right thing to do which
-is based on no information at all okay
-that's for Policy gradient for Q
-learning so a few thing a few things one
-is okay you often it helps to have a
-really big replay buffer and to be able
-to do this you need to be a little
-careful about memory usage so it's worth
-putting in the extra effort to do that
-learning rate schedules are often quite
-helpful here in practice as are
-exploration schedules so in DQN
-you're usually using epsilon greedy and
-it often helps to do to play with the
-schedule on that also it converges
-pretty slowly and it has a mysterious
-warmup period at the beginning
-often so so sometimes you just so I
-actually have a lot of admiration for
-the authors who originally got this the
-people people who got this to work
-originally because they had to just let
-their code run for a while before it did
-anything so so you have to have a lot of
-patience - a lot of bravery to do that
-ok this is just miscellaneous advice for
-not necessarily for tuning algorithms
-but just for for personal development so
-I recommend reading older textbooks and
-theses not just the latest conference
-papers because often they
-like they're a more dense source of
-useful information whereas each
-conference paper just has one idea ok
-yeah don't get too stuck on problems
-because often you actually have a
-legitimately good algorithm but it has
-like some flaws so it might fail
-miserably at some easy problem so in RL
-there's some like simple problems like
-cart-pole swing up where you have this
-stick and you're trying to swing it up
-by moving the cart around and this
-problem like you might have a great
-algorithm but it's gonna I mean but like
-some of the state-of-the-art algorithms
-are gonna fail on that problem unless
-you really tuned them carefully and
-that's just because maybe it's not
-exactly the right problem to start to I
-mean maybe like the thing that makes
-this problem hard is not the thing that
-your algorithm is doing that's
-interesting so you might have like come
-up with a better policy gradient method
-but still it'll converge to the same
-local minimum on that swing up problem
-and you're not gonna fix that problem so
-I I would say just don't get too stuck
-on a single problem that your method
-fails on and I mean like maybe the
-ultimate algorithm will solve all of
-these problems but we're not there yet
-so you might as well just try to improve
-on some like decently large subset of
-problems so also like one funny thing is
-that DQN performs pretty poorly on a lot
-of problems especially with continuous
-control
-I think it does I mean for cart-pole it
-probably solves it pretty well if with a
-reasonable amount of tuning but some of
-the other like fairly small continuous
-control problems it fails on but that
-doesn't mean it's like that doesn't mean
-it's a bad algorithm because it solves a
-different problem extremely well so yeah
-I would say just these these things are
-at least right now it's not gonna you
-shouldn't expect to be able to solve
-everything with the same method without
-any tuning also techniques from
-supervised learning often don't transfer
-over to reinforcement learning so so
-don't be surprised if you find that I
-guess that's not I said this slide was
-gonna be about personal development
-that's not about personal development
-but yeah I guess this is just a grab bag
-of miscellaneous advice so yeah so like
-batch norm a lot of people look at what
-people are doing in RL and they think
-why aren't you using batch norm or drop
-out or or big networks why are you using
-like two layers of 64 units and it's not
-like people didn't think of trying these
-other things they tried them and then
-they found that those other like
-architectures and methods don't actually
-help here I mean if you figure out how
-to make batch norm and drop out actually
-help in RL that'll actually be really
-great and a big that would be a big
-development but yeah I don't know it's
-not totally straightforward all right
-that's all thank you
-okay I have a few minutes for questions
-yeah so the question is how long do you
-wait until you've decided that your
-algorithm isn't going to work either because
-your code is wrong or it's just too hard
-I don't have a good general answer to
-that I think the problem is worse for
-some algorithms than others I'd say for
-policy gradient methods you don't see
-that burn-in period as much like often if
-it's going to learn it'll learn it at
-the beginning but that's not always true
-either I mean sometimes it will kind of
-take some time to get into the right
-numerical regime so I don't yeah I don't
-have general advice I would say you have
-to just I would say go back and start
-with the easy problems and you'll get
-some intuition about whether you're you
-should expect a you should expect a burn
-in period or not where it's not learning
-anything see I want to get some people
-in the back because okay oh yeah
-question is do I use unit tests I use
-unit tests for code that where there's
-it's doing a very particular
-mathematical thing that you can actually
-write a test for like let's say I'm
-computing the KL divergence then I'll
-write a test to check I don't know there are
-various ways of testing it so and it's
-easy to get those things wrong like you
-have it's I don't know as you're off by
-a constant or something so yeah I would
-write tests for I write tests for things
-where it's nothing that there's a very
-well-defined correct thing to do it's
-harder to write it for an algorithm
-where it has a lot of different moving
-parts where you it's not clear how fast
-it should learn and it's also there's
-some randomness involved
-so if you try to write a test saying I
-should be at performance 100 after this
-many iterations it might fail just out
-of random noise but yeah I think
-probably unit tests are a good idea oh
-yeah so the question is do I have
-guidelines on matching the algorithm to
-the task like when to use policy
-gradients versus a value iteration style
-method it's yeah it's hard to give some
-general guidelines I think people have
-found that and and the guidelines I give
-you might just be just be kind of
-historical accidents like someone got
-this to work here and this to work there
-so I think the well certainly if you
-don't care that much about sample
-complexity policy gradient methods are
-are probably are probably the way to go
-if you don't care about sample
-complexity or using off policy data then
-policy gradient methods are probably the
-safest bet because you I don't know it's
-more understandable exactly what it's
-doing it's just doing gradient descent
-whereas q-learning it's a little bit
-indirect what it's doing so it's and it
-in practice is more finicky yeah if you
-do care about sample complexity though
-or need off policy data then q-learning
-is usually better or yeah or a
-few students like sample complexity is
-relevant if your simulator is expensive
-of course I would also say that people
-have found that dqn and relatives have
-worked well on game-like tasks with
-images as input whereas policy gradient
-methods work better on the continuous
-control tasks like these robotic
-locomotion problems though that this
-might not be fundamental it might be
-more of a historical accident let's
-oh yeah recommendations on older
-textbooks let's see there's like Bertsekas
-Bertsekas's books
-that's approximate dynamic what is it
-optimal control and why am i blanking on
-the name optimal control and dynamic
-programming something like that and the
-set I mean Sutton and Barto is a good one
-to read Puterman has a textbook kind of
-a classic textbook on Markov decision
-processes that's in the RL space then
-there's books on numerical optimization
-that are good and yeah I'd say obviously
-the machine learning textbooks have a
-lot of good material that might be
-useful in the RL setting too
-oh yeah can I comment on evolution
-strategies and the blog posts the the
-OpenAI blog post on it let's see do
-you have any specific questions about it
-or like how it compares
-oh yeah okay yeah yeah so there's
-there's a lot of policy gradient methods
-out there and some of them are quite
-complicated so we've had a couple of
-talks on them so far like all these
-different work
-it's successively more complicated policy
-gradient methods but then there's this
-old algorithm called evolution
-strategies which is an extremely simple
-algorithm and and there's a paper by
-some of my colleagues where they show it
-was called evolution strategies as a
-scalable alternative to reinforcement
-learning which really meant like to
-policy gradient methods so and they
-claimed that it worked basically as well
-as policy gradient methods or at least
-it's sort of in the same order oh and
-beer is one of the authors of that paper
-so the claim was that it works it works
-like similarly well to policy gradient
-methods so why should we bother with
-these policy gradient methods if ES works
-just as well well I think in practice it
-works well it works but it works not not
-as well like it's it takes me the sample
-complexity is is is worse by some
-constant factor or it's not clear that
-it's a constant factor or if this factor
-scales with the size of the network but
-it's it is a lot it is significantly
-slower and the question is just what is
-that constant factor so is that constant
-factor like one or is it three or is it
-10 or 100 so that's not that's going to
-vary between problems and also the that
-paper had some innovations in exactly
-how to parameterize the networks and so
-forth that made everything better
-numerically everything better scaled so
-that ES did work well but I would say
-that if you that it's usually quote like
-I don't know it's usually a pretty
-decent constant factor slower than
-policy gradient methods especially the
-more advanced ones like the PPO and
-ACKTR so so i'm i think it's it's not
-really a clear win in the RL setting
-where policy gradients work I think if
-policy gradients work it's usually going
-to be a lot better
-and the ES is going to be is going to be
-better on problems where policy
-gradients aren't going to work for some
-reason like if you've got really long
-time dependencies where the
-discounts are gonna are gonna ignore
-them
-then ES might be less sensitive to that
-let's see I think I'm okay last question
-oh yeah favorite hyper parameter
-optimization framework I've used some of
-these than I just like to use the
-uniform random sampling yeah that works
-really well I mean you just run a bunch
-of experiments with random hyper
-parameters and then you just look at the
-results the next day and do some
-regression to figure out which
-parameters actually mattered and then
-you run another experiment with
-better parameter ranges and so on so I
-use the human version of it because
-often it's just - it's like a it's it's
-useful to be able to look at the results
-yourself - to get some to figure out
-which parameters actually matter so
-you're not wasting a lot of computation
-because that information transfers
-between problems all right
-[Applause]
\ No newline at end of file
diff --git a/docs/ml_debug_folklore_log.md b/docs/ml_debug_folklore_log.md
deleted file mode 100644
index 10f3bf4..0000000
--- a/docs/ml_debug_folklore_log.md
+++ /dev/null
@@ -1,76 +0,0 @@
-# ML Debugging Folklore - Vargdown Process Log
-
-## Process
-- [x] evidence files read (21 files, 9416 lines total)
-- [x] quotes extracted via 12 parallel subagents
-- [x] key quotes verified against evidence files (spot-checked ~15 quotes)
-- [x] argdown verifier passes clean (`npx @argdown/cli json` -- 14 arguments, 45 statements, 14 relations)
-- [x] subagent review done (gpt-5.2-codex via opencode; fixed non-verbatim quotes, credence calibration, PCS structure)
-- [ ] human review done
-
-## Evidence Fetch Log
-
-All evidence files were pre-existing in `docs/evidence/`. They were fetched
-in a prior session via the methods listed in each file's header.
-
-| Source | Evidence File | Fetch Method | Status |
-|--------|--------|--------|--------|
-| Schulman 2016 slides | joschu_nuts_and_bolts.md | `uvx markitdown[pdf]` | verbatim (PDF artifacts: cid markers) |
-| Schulman 2017 bootcamp | schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md | YouTube auto-subtitles | verbatim (transcription errors: "insanity" = "and standard") |
-| Andy Jones RL debugging | andyljones_rl_debugging.md | markitdown | verbatim |
-| Henderson et al. 2018 | henderson_2018_deep_rl_matters.md | markitdown | verbatim |
-| Goodfellow Ch11 | goodfellow_ch11_practical_methodology.md | markitdown | verbatim |
-| CS231n NN3 | cs231n_neural_networks_3.md | markitdown | verbatim |
-| FSDL Spring 2021 L7 | fsdl_spring2021_lecture7.md | markitdown | verbatim |
-| Irpan RL hard | alexirpan_rl_hard.md | markitdown | verbatim |
-| amid.fish reproducing | amid_fish_reproducing_deep_rl.md | markitdown | verbatim |
-| Slavv 37 reasons | slavv_37_reasons_nn.md | markitdown | verbatim |
-| CS229 ML advice | cs229_ml_advice.md | markitdown | verbatim |
-| McCandlish 2018 | mccandlish_2018_large_batch.md | markitdown | verbatim |
-| William Falcon notes | williamfalcon_deeprl_hacks.md | markitdown | verbatim |
-| Goodfellow Ch15 | goodfellow_ch15_representation_learning.md | markitdown | verbatim |
-| Deep Learning Book | deeplearning_book.md | markitdown | verbatim |
-| Reddit RL tips 7s8px9 | reddit_rl_practical_tips_7s8px9.md | markitdown | verbatim |
-| Reddit RL debug 9sh77q | reddit_rl_debugging_tips_9sh77q.md | markitdown | verbatim |
-| Reddit RL roadblocks | reddit_rl_roadblocks_bzg3l2.md | markitdown | verbatim |
-| Reddit Schulman 5hereu | reddit_schulman_nuts_bolts_5hereu.md | markitdown | verbatim |
-| Reddit ICML tutorial | reddit_icml2017_tutorial_levine_6vcvu1.md | markitdown | verbatim |
-| Reddit DRL bootcamp | reddit_deeprl_bootcamp_2017_75m5vd.md | markitdown | verbatim |
-
-## Quote Verification Notes
-
-- Schulman subtitles contain auto-generated transcription errors (e.g., "mean insanity deviation" should be "mean and standard deviation"). Quotes used verbatim from file; errors are in the source, not introduced by us.
-- Schulman PDF (joschu_nuts_and_bolts.md) has markitdown conversion artifacts (`(cid:73)` bullet markers, table formatting). Core text is present but formatting is messy.
-- All other evidence files appear to be clean markitdown conversions.
-- 15 key quotes were manually spot-checked against evidence files. All matched.
-- Quotes from subagent extractions were cross-referenced with direct file reads.
-
-## Blockers / Caveats
-
-- Argdown verifier passes clean: `npx @argdown/cli json` exports 14 arguments, 45 statements, 14 relations. Fixed: 44 blank lines inside PCS blocks, bracket escaping in FSDL quote.
-- Some evidence files (especially Schulman PDF) have conversion artifacts that may cause verifier failures on exact quote matching.
-- The argdown uses auto-generated YouTube subtitles as a source; these contain transcription errors that are present in the evidence file.
-
-## Coverage Summary
-
-| SKILL.md Claim | Sources Used | Independent Sources |
-|---|---|---|
-| Normalize inputs mean=0 std=1 | Schulman, FSDL, Slavv | 3 |
-| Overfit tiny dataset first | CS231n, FSDL, Goodfellow | 3 |
-| Assume you have a bug | Jones, Goodfellow | 2 |
-| Seed variance is extreme | Schulman, Henderson, Irpan | 3 |
-| Use bigger batch sizes | Schulman (x2), McCandlish | 2 (Schulman slides + talk counted as 1) |
-| Hand-scale rewards, don't shift mean | Schulman, Jones, Henderson | 3 |
-| Use reference implementations | Jones, Rahtz | 2 |
-| Pursue anomalies | Jones, Rahtz | 2 |
-| Log everything | Rahtz, Goodfellow | 2 |
-| Random HP search | CS231n/Bergstra, Schulman | 2 |
-
-| Probe environments for RL | Jones | 1 (but applies general isolation principle) |
-| Policy entropy / KL diagnostics | Schulman | 1 (but built into major frameworks) |
-
-## Claims NOT Covered in Argdown (lower priority or single-source)
-- Gradient clipping masks problems (CS231n mentions, but as a technique not a warning)
-- Final layer zero init for policy (Schulman only)
-- Loss surface analysis / gradient quiver plots (original to SKILL, no external source)
-- Sweep methodology with within-group z-scores (original to SKILL)