diff --git a/docs/evidence/[English (auto-generated)] Deep RL Bootcamp Lecture 6 Nuts and Bolts of Deep RL Experimentation [Do(1).txt b/docs/evidence/[English (auto-generated)] Deep RL Bootcamp Lecture 6 Nuts and Bolts of Deep RL Experimentation [Do(1).txt deleted file mode 100644 index 69e9285..0000000 --- a/docs/evidence/[English (auto-generated)] Deep RL Bootcamp Lecture 6 Nuts and Bolts of Deep RL Experimentation [Do(1).txt +++ /dev/null @@ -1,1008 +0,0 @@ -so last year at NIPS I was slated to -give a talk at the deep RL workshop and -I wasn't sure what I was going to talk -about because everything I had prepared -I had already talked about it so many -times that I just didn't want to didn't -want to give another talk on it so I I -asked Pieter for his advice on what I -should talk about and he said that -Andrew Ng had given a talk earlier in -the conference called the nuts and -bolts of deep learning where he sort of -went through the flowchart of what you -do when you see a new problem and like -if you if you're overfitting you regularize -and if you're underfitting then you -use a bigger model and so on so so Pieter -suggested to come up to write a talk -called the nuts and bolts of deep RL -research where I would talk about some -of the similar lessons and the tips and -tricks for the RL setting so I put -together a talk for that and actually -people seem to like it -so I'll give a slightly updated version -of that talk right now so I'm going to -talk about a few different things some -of which are general and sort of apply -to RL using reinforcement learning in -general and some of them pertain to -particular classes of methods like -policy gradient methods and these are -just sort of little tips and tricks for -how you how you get your algorithm to -work and what you do day to day so let's -say you have a totally new problem -you're trying to solve like you have you -have some new task and you figured out -how to you defined an observation in an 
-action space and you have your neural -network policy or Q function but and you -want to start learning learning how to -solve it but you but you've never tried -it before -um so or okay or if you have a new -algorithm you're trying to get working -that you've never you you've never used -it before so so what do you do what's -the first thing you do if you have a new -algorithm -so so that I mean my first advice would -be to use small problems so you can -run a lot of experiments really quickly -and do a hyperparameter search and it's -really useful -to be able to visualize the learning -process in as many ways as possible so -look at the state visitation like how -that's evolving over time and look at -how well your value function is fitting -and so on so like I spent a lot of time -looking at the pendulum problem where -you're trying to swing up a pendulum -because this problem has a 2D state -space where it's just the angle and -the angular velocity of the pendulum and -I would visualize here's exactly -what the value function looks like -here's exactly what the state -distribution looks like and here's how -they evolve over time so I would get a -sense for like what's if my algorithm -isn't working is it because it's like -oscillating in some funny way or maybe -it's just giving a bad fit or maybe the -function it's learning the value function -it's learning isn't smooth enough and so on -so I would say try to visualize -everything and maybe use small problems -where you can visualize everything also -yeah it's useful to construct toy -problems where your idea is going to be -the strongest where you think okay if -this idea has any possibility of working -it's going to work there so for example -let's say you're trying to do something -with hierarchical reinforcement learning -then construct some problem where -there's some kind of obvious hierarchy -that it should learn and you'll be able -to tell if it's doing the right thing -also construct the 
the problems where -it's going to be weakest obviously and -also as a counterpoint to that don't -overfit your method to some contrived -problem so let's say you've come up with -some toy problem where your method is -really good then do realize that it's -a toy problem and don't like tweak -everything to just work on this toy -problem perfectly because yeah it's also -pretty useful to have medium-sized -problems that you're very familiar with -and you know exactly how fast the -learning should be and what the reward -should be at every iteration and so on -so -a few problems that I use a lot like -training on Pong Atari and the Hopper -the Hopper like problem which is -this simulated robot problem with this -hopping robot and I know exactly how -fast an algorithm that's working should -learn on these problems so so I can sort -of it's it makes it easier to tune -things if you have okay that's if you -have a new algorithm let's say you have -a new task I would recommend just making -the task easier until you start seeing -some signs of life you see it learning -something so so there are various ways -you can make it easier you can try doing -some feature engineering so your input -features you think that the you think -that the policy should be a simple -function of your input features like -let's say you're trying to get Pong to -work and you tried setting it up with -the images as input and you weren't -learning anything then you can set up -the problem where you pass in XY -coordinates as input and then try -running your algorithm and it's a much -simpler function you're trying to learn -so that's much more likely to work and -then you can try to make it harder and -harder until you're solving the full -problem another way you can make it -easier is by shaping the reward function -that means you if you come up with some -reward function that gives you fast -feedback on whether you're doing -the right thing or not so let's say we -can define one 
task where we have this -reaching robot and we just give it a -reward if it reaches if it hits the -target so it gets a reward of one if it -hits the target and zero otherwise so -that might be hard to learn because -you're not getting any feedback as -you're flailing around but we could -define a better shaped reward function -where the where it's just distance to -target then learning is going to be much -faster in that problem there's also the -problem of exactly how to turn your -problem into a POMDP in the first place -so so often it's not clear what your -observation features should be and it's -not even clear what the reward function -should be so or it's not clear if this -problem you're trying to solve is -if it's feasible at all so so let's say -you're trying to solve you you have some -game or some robotics task or something -new like and you you want to turn it -into a reinforcement learning problem -but you're not sure if this is feasible -at all -so the first thing to do is to just -visualize a random policy acting on this -problem and see see what happens so if -the random policy occasionally does the -right thing then there's a high chance -that reinforcement learning is going to -work because a policy -gradient method is just going to take -this random behavior and it's going to -make the good behaviors more -likely -so it'll gradually like hone in on the -good behaviors whereas if you're never -doing the right thing then then -RL isn't going to get any signal that -tells it to do the right thing sometimes -RL is able to learn even though it seems -like it it's not clear how it's going to -learn like learning how to walk it's not -clear that that should work but because -you would think that like you really -have to have the whole thing in the -whole policy in place before it does -anything useful but as it turns out you -sort of learn to take one step and then -fall over and then take two steps and -then fall over and 
so on until you've -got a proper walking gait okay another -thing to do is to make to make your -observations make sure your observations -are usable try to look at them as a -human and see if you can control the -system using the same observations -you're giving to the agent so let's say -you're doing some pre-processing on your -images look at those preprocessed -images yourself and make sure you're not -like losing too much detail when you -downsample them or losing too much in -the color transformations and so on -another thing to do is you want to make -sure that everything is reasonably -scaled so that for example well as a -rule of thumb you usually want -everything to be mean 0 and standard -deviation 1 for the observations and for -the rewards well it's a little less -obvious but that's a reasonable -heuristic so -so you might want to like rescale using -some kind of filter I mean that's that's -another good thing you can do but if you -don't want to mess with some kind of -filters on your observations and rewards -what you can do is you can just kind of -if you're allowed to define those -yourself then you might want to just -scale them yourself so what I'd -recommend doing is plot histograms of -all of your observations and your -rewards and make sure that for each -component of the observations and -rewards you've scaled it properly so -that it has the right mean and standard -deviation and it doesn't have crazy -outliers okay another thing to do is you -should have some good baselines that you -can use whenever you see a new when you -whenever you see a new problem so just -it's not clear which algorithm is going -to work beforehand so make sure you you -just have a bunch of a bunch of like -well tuned things that you can run on -each problem yeah okay the question was -if you're gonna do some kind of reward -normalization should you do this over -your whole training like all of your -training data or just like the recent -data I would yeah that's a 
there's a lot -of subtlety there so I would say use all -of your data so far because you're -making everything non-stationary if you -do some kind of filtering actually I'm -going to talk about this at a later -slide so anyway I would recommend as -just a few baselines you should have a -cross entropy method some policy -gradient methods some kind of Q -learning or SARSA type method there's a -lot of code online now that you can use -other people's code that that's already -written so you can use like we have this -OpenAI Baselines repository and also -rllab has a bunch of algorithms okay -another thing to do which people often -get tripped up on especially when -they're trying to reproduce published -work is so you implement the algorithm -based on the paper -and then it doesn't really learn -anything at all and then you think oh -maybe is my code like wrong or what -happened so I would say early on you -might need to run with more samples than -expected -so one hyperparameter that you can -usually adjust is how big of a batch -size to use or how many samples to use -and I would say sometimes you should use -more samples than you think you're going -to need because usually things just work -better when you have more samples almost -always so often sometimes when you're -trying to reproduce a published paper -you've got it mostly right but not -exactly right like maybe you haven't -scaled everything properly or there's -some like there's some really like -obscure hyperparameter that you have -wrong and then you just find that the -code doesn't learn anything so then I -would say just try to make it work a -little bit and then you can work from -there and try to tweak all the -hyperparameters to to get up to the like to -get fully up to the published performance -but if you want to just get something -working at all often you need to use -bigger batch sizes than you thought -because if your batch size is too small -then the noise will overwhelm 
-the signal and you won't learn anything -so like for example for TRPO I wasn't -seeing any learning for a while and then -it turned out it's just because I was -using too small of a batch size and I -had to use a hundred thousand time steps -of a batch for the batch size but and -for Atari they for DQN the hyper -parameters that were found to be best -were you update every ten thousand time -steps you update your Q function -every ten thousand time steps and you -have a 1 million time steps in your -replay buffer which is a lot okay so now -I'll talk about some guidelines for on -for the ongoing development and tuning -process as opposed to the initial -process of I have a totally new problem -or a new algorithm that I want to see -some signs of life on so -let's say you get something working I -recommend looking at how sensitive your -algorithm is to every hyperparameter -and if it's too sensitive it it's not -actually a robust algorithm then you -shouldn't be happy with it you probably -just got lucky on that one problem -and it's it's actually kind of possible -to have a method that does that is a -fluke and it works in one way because -it's I mean one problem because of some -funny dynamics but then it doesn't work -in general so you kind of have to it -needs some serious improvements so yeah -so that's okay there's also a few things -you can look at to see that actually I'm -going to talk about more of these kind -of diagnostics a little later but there -are some indicators that'll tell you if -that if your algorithm is working -besides just looking at the final -performance but look for other -indicators that are going to tell you -that your optimization process is kind -of healthy so this is going to vary -based on the algorithm but for example -you can look at whether your value -function is actually accurate like -whether it's actually predicting returns -well you can look at how big the updates -are in terms of some either parameter 
-space or the output space standard -diagnostics for deep networks like you -can look at norms of gradients and so on -okay one thing that takes some -discipline but is very useful is to have -a system for continually benchmarking -your code and that includes all of your -code not just the one thing you're -tuning right now because often it's easy -to tune your algorithm to work well in -one problem and then mess up the -performance on other problems and it's -really easy to overfit on single -problems when you're just adjusting -hyperparameters so I'd really recommend -having some kind of benchmark you can -run frequently and some kind of battery -of benchmarks that you run -occasionally along similar lines of -like overfitting of sort of reading -too far into noise or overinterpreting -noise it's really easy to just to think -you're improving your algorithm or -you're making it worse but really you're -just seeing random noise so so you can -see seven different tasks these are the -Gym MuJoCo tasks like HalfCheetah and -Hopper and so on and you have three -different algorithms here the red one -the green one and the blue one and you -can see ok let's I mean we can see that -the performance is a little different on -all the problems but let's it looks like -the the red let's see does the green -which one looks like the best well it -kind of varies by problem like the blue -one looks better on this problem and the -red one is worse on this problem and so -on but as it turns out these are all the -exact same algorithm and just random -seeds different random seeds so so it's -easy to imagine that you're just looking -at one of these problems then you see -that blue curve and you think you get -really excited and you think you found -some huge improvement to your algorithm -but it's really that you just got a -lucky seed that one run so yeah really -you've got to run your algorithm -multiple times and average and even if -you're averaging over a lot of 
seeds -like even if you had like 20 seeds here -there's still a pretty big error bar -so it's yeah that makes it particularly -hard -I mean I'd recommend having like -multiple tasks and multiple seeds and if -you don't do that then you're probably -just overfitting unless you see a really -drastically large improvement another -thing to do is it's easy to keep adding -little modifications to your algorithm -until it gets really complicated and -then you're not sure and then you think -you have this really complicated -algorithm which is perfect but it turns -out that most of the things you did are -unnecessary because some of the -tricks substitute for each other this is -often true because a lot of tricks help -because they're like normalizing things -in a better way or improving your -optimization like making -your optimization less susceptible to -like big spikes I don't know a lot of -different modifications you make have -similar effects so so often you you can -remove them and simplify your algorithm -and this is pretty important so it's -like especially with regard to changes -that do whitening these kind of these -kind of all substitute for each other -and also substitute for changes to your -optimization algorithm and yeah I would -and I would simplify things because it's -then it's more likely that your insights -will generalize to other problems and -also lastly it's pretty useful to -automate your experiments because -otherwise you're going to end up -spending all your day your whole day -just watching your code print out -numbers and and it's actually really -it's it's really tempting to spend all -day doing that but I would I mean -especially if you need to run multiple -random seeds then it's then you you -really need to get your workflow down -so that you're automating this -process and launching lots of -experiments at the same time so I'd -recommend just getting set up with one -of these cloud computing services so you -can just 
launch experiments on remote -instances and pull the results back when -you're done question oh yeah question is -do I have a recommendation on what -framework to use to keep track of your -experiment results I personally use no -framework at all and I just have like -IPython notebooks and scripts that -collect a bunch of data that's stored in -various log files so I just have scripts -that read all my log files and plot them -I don't use some people like having -databases and stuff where they store all -their hyperparameter results but I -think I don't find it necessary -personally okay so now I'm let's see I'm -going to talk about general tuning -strategies for RL and then after that -I'll talk about some specific tuning -strategies for different classes of -algorithms -okay so one thing is whitening or -standardizing your data so if your -observations have unknown range you -should definitely standardize them I -would do that by computing a running -estimate of the mean and the standard -deviation and then just transform it z -transforming it like this -and I would recommend computing the mean -and the standard deviation over all data -you've seen so far not just your recent -data because otherwise you're -effectively changing your data in some -way that the policy doesn't know about -like you have your that your policy -gradient algorithm doesn't know about -like your policy gradient algorithm is -actually optimizing some objective so -then if you just go and change the -problem out from under it then you're -often going to make things a lot worse -like if you rescale your observations -then your optimization algorithm didn't -know about that so you might just -collapse the performance so that's why I -would recommend using your whole all of -your data from the start of time so that -at least it's going to slow down over -time how fast it's how fast your -scalings are changing so yeah that's -what I would recommend doing with the -observations and for the 
rewards -I'd recommend rescaling them but not -shifting them because that affects the -agent's will to live so if you if you -shift the mean reward that'll affect -whether how long it wants to survive -you're actually changing the problem ok -another yeah you might also want to try -to standardize prediction targets in the -same way though that's a little more -complicated to do using okay yeah so -question is what about PCA whitening -instead of just this elementwise scaling -yeah that could that could definitely -help I haven't I haven't experimented -with that but yeah that could help it's -hard to predict with like with neural -nets if it's going to help or not -because they seem to be pretty good at -disentangling things so I know that if -you have things that are terribly scaled -like they're from negative one thousand -to one thousand and other coordinates -are from negative point one to -point one then it's gonna be slow for -learning so this kind of scaling helps a -lot even though you're using neural -networks okay there's some parameters -that are really generally important like -discount factor that determines whether -you're that determines how long how far -away you're doing credit assignment so -whether you're paying attention to -effects that are delayed by a certain -time so if your discount is gamma equals -point 99 then you're basically ignoring -effects that are delayed by more than a -hundred time steps so so you're kind of -short-sighted that gamma is controlling -your shortsightedness and you might want -to actually look at how long that -corresponds to in real time so usually -in reinforcement learning you're sort of -discretizing time in a certain way and -it's worth paying attention to like is -that 100 time steps like three seconds -of real time or what and what happens -during that time also note that if you -have TD lambda kind of methods for either -for value function estimation or for -policy gradient methods you can get away -with using a 
gamma that's really -close to one like 0.999 and things -aren't going to go unstable because if -you have a lower lambda of like 0.9 then -that's going to make it so the algorithm -is still stable even though gamma is -really close to one also okay so so as I -mentioned you might want to in in -practice we're usually discretizing some -continuous-time system so then it's -worth seeing if the problem can actually -be solved at this discretization level -so so for example in a game let's say -you're you're doing frame skip meaning -that you repeat the action multiple -times as a human can you control it at -this rate or is it just impossible to -control is it just too like you're doing -the action too many times in a row and -you have too slow responses to control it -and I would also just look at the what -the random exploration looks like and -make sure that you're exploring like -the the -discretization is going to determine like -how far your Brownian motion goes -because if you're doing the same action -many times in a row then you're going to -be able to then you're going to tend to -explore further so so it's worth just -looking at what the random exploration -does and and choosing your time -discretization in a sensible way so that -it does interesting things question yeah -so the question is if you have a DQN -how would you get started like tuning it -with tuning all the hyperparameters -actually I'm going to talk about DQN -pretty soon so yeah I'll get to that -okay also look at the episode returns -very closely look at don't just look at -the mean look at the minimum and the -maximum so the maximum especially if you -have a deterministic system if you have -a certain maximum return that's -basically something that your policy can -hone in on pretty straightforwardly -because if if you just do that every -time then you're going to increase your -mean return to that level so so it's -worth so so it's useful to look at the -max return to see if 
your policy is ever -doing like the right thing according to -that max return or if it's just kind of -stuck and it's never discovering the -high return strategy also look at the -episode length which is sometimes -more informative than the episode reward -like if because sometimes well yeah I -won't go into details on that like well -if you have a game you're it might mean -that like you might be losing every time -so you're never seeing yourself win but -the episode length will tell you if -you're losing slower so you might see an -improvement in episode length at the -beginning but not in reward okay for -policy gradient there are specific -strategies or prediction there are -specific diagnostics that are really -helpful so look at the entropy really -carefully if your entropy is going down -too fast that means your policy is -becoming deterministic and it's not -going to explore anything so -so be careful and also if it's not going -down your policy is never going to be -that good because it's always really -random also you can sort of alleviate -this issue by using an entropy bonus or -a KL penalty so by stopping yourself -from changing the policy the -probability distribution too fast as a -side effect you also prevent the entropy -from going down too fast when you use -the KL penalties I also look at the KL -as a diagnostic like look at how big of -an update you're doing in terms of KL -divergence if your KL is like 0.01 -that's a pretty small update but if it's -like 10 that's a really big update -question oh yeah how do you question is -how do you measure entropy so so if you -have for most policies you can compute -the entropy analytically so if you have -a discrete action space then you usually -can just compute it analytically and if -you have a continuous policy you're -usually you're using a Gaussian -distribution or something so you can -compute the differential entropy -analytically so here we're talking about -entropy in action space 
so the average -over state space of the action space -entropy what you actually might care -about even more is the entropy in state -space but you have no hope of actually -calculating that except maybe to do some -really crude approximation of it -okay yeah so KL is really useful look at -explained variance like whether your value -function is actually explaining is -actually a good predictor of the returns -or if it's just worse than predicting -nothing so if you just predict zeroes -then your explained variance is zero but -sometimes if you have some neural -network that's predicting then you find -that it's actually negative because it's -overfitting or it's just noisy and it's -not doing anything useful so that -probably means you need to tune some -hyperparameters so that your neural -network's actually predicting better than -the constant predicting zero question -okay yeah question is why does the KL -spike give you a loss in performance -well it doesn't always it's not -always a loss in performance sometimes -it's a gain in performance but in -practice it's usually a loss in -performance because usually the -step from your policy gradient -is just taking you way outside the -region where your local approximation to -the policy performance is accurate so -you're you're probably just overshooting -like if you take your policy and you -take a really big step in any direction -you're probably making it worse so so -that's so usually if you take a big step -you're getting worse like if you have a -convex function if you take a big step -in any direction you're probably going -to make it worse let's see okay -initialize your policy that's pretty -important more important than in -supervised learning because that -determines what data you're going to see -initially and you're going to learn from -at the beginning so I would recommend -initializing the final layer -to be either zero or really small so -that at least you you have 
the maximum entropy -and you sort of explore randomly at the -beginning randomly at the beginning -as opposed to having some kind of -particular like policy that has a strong -opinion on the right thing to do which -is based on no information at all okay -that's for policy gradient for Q -learning so a few thing a few things one -is okay you often it helps to have a -really big replay buffer and to be able -to do this you need to be a little -careful about memory usage so it's worth -putting in the extra effort to do that -learning rate schedules are often quite -helpful here in practice as are -exploration schedules so in DQN -you're usually using epsilon greedy and -it often helps to do to play with the -schedule on that also it converges -pretty slowly and it has a -mysterious warmup period at the beginning -often so so sometimes you just so I -actually have a lot of admiration for -the authors who originally got this the -people people who got this to work -originally because they had to just let -their code run for a while before it did -anything so so you have to have a lot of -patience and a lot of bravery to do that -ok this is just miscellaneous advice for -not necessarily for tuning algorithms -but just for for personal development so -I recommend reading older textbooks and -theses not just the latest conference -papers because often they're -like a more dense source of -useful information whereas each -conference paper just has one idea ok -yeah don't get too stuck on problems -because often you actually have a -legitimately good algorithm but it has -like some flaws so it might fail -miserably at some easy problem so in RL -there's some like simple problems like -cart-pole swing up where you have this -stick and you're trying to swing it up -by moving the cart around and this -problem like you might have a great -algorithm but it's gonna I mean but like -some of the state-of-the-art algorithms -are gonna fail on that problem unless 
-you really tune them carefully and -that's just because maybe it's not -exactly the right problem to start to I -mean maybe like the thing that makes -this problem hard is not the thing that -your algorithm is doing that's -interesting so you might have like come -up with a better policy gradient method -but still it'll converge to the same -local minimum on that swing up problem -and you're not gonna fix that problem so -I I would say just don't get too stuck -on a single problem that your method -fails on and like maybe the -ultimate algorithm will solve all of -these problems but we're not there yet -so you might as well just try to improve -on some like decently large subset of -problems so also like one funny thing is -DQN performs pretty poorly on a lot -of problems especially with continuous -control -I think it does I mean for cart-pole it -probably solves it pretty well with a -reasonable amount of tuning but some of -the other like fairly small continuous -control problems it fails on but that -doesn't mean it's like that doesn't mean -it's a bad algorithm because it solves a -different problem extremely well so yeah -I would say just these these things are -at least right now it's not gonna you -shouldn't expect to be able to solve -everything with the same method without -any tuning also techniques from -supervised learning often don't transfer -over to reinforcement learning so so -don't be surprised if you find that I -guess that's not I said this slide was -gonna be about personal development -that's not about personal development -but yeah I guess this is just a grab bag -of miscellaneous advice so yeah so like -batch norm a lot of people look at what -people are doing in RL and they think -why aren't you using batch norm or -dropout or or big networks why are you using -like two layers of 64 units and it's not -like people didn't think of trying these -other things they tried them and then -they found that those other like 
-architectures and methods don't actually -help here I mean if you figure out how -to make batch norm and dropout actually -help in RL that'll actually be really -great and a big that would be a big -development but yeah I don't know it's -not totally straightforward all right -that's all thank you -okay I have a few minutes for questions -yeah so the question is how long do you -wait until you've decided that your -algorithm isn't going to work either because -your code is wrong or it's just too hard -I don't have a good general answer to -that I think the problem is worse for -some algorithms than others I'd say for -policy gradient methods you don't see -that burn-in period as much like often if -it's going to learn it'll learn it at -the beginning but that's not always true -either I mean sometimes it will kind of -take some time to get into the right -numerical regime so I don't yeah I don't -have general advice I would say you have -to just I would say go back and start -with the easy problems and you'll get -some intuition about whether you're you -should expect a you should expect a burn -in period or not where it's not learning -anything see I want to get some people -in the back because okay oh yeah -question is do I use unit tests I use -unit tests for code that where there's -it's doing a very particular -mathematical thing that you can actually -write a test for like let's say I'm -computing the KL divergence then I'll -write a test to check I don't know there are -various ways of testing it so and it's -easy to get those things wrong like you -have it's I don't know like you're off by -a constant or something so yeah I would -write tests for I write tests for things -where there's a very -well-defined correct thing to do it's -harder to write it for an algorithm -where it has a lot of different moving -parts where you it's not clear how fast -it should learn and it's also there's -some randomness involved -so if you try to write a test saying 
I -should be at performance 100 after this -many iterations it might fail just out -of random noise but yeah I think -probably unit tests are a good idea oh -yeah so the question is do I have -guidelines on matching the algorithm to -the task like when to use policy -gradients versus a value iteration style -method it's yeah it's hard to give some -general guidelines I think people have -found that and and the guidelines I give -you might just be just be kind of -historical accidents like someone got -this to work here and this to work there -so I think the well certainly if you -don't care that much about sample -complexity policy gradient methods are -are probably are probably the way to go -if you don't care about sample -complexity or using off policy data then -policy grading methods are probably the -safest bet because you I don't know it's -more understandable exactly what it's -doing it's just doing gradient descent -whereas q-learning it's a little bit -indirect what it's doing so it's and it -in practice is more finicky yeah if you -do care about sample complexity though -or need off policy data then hue -learning is usually better or yeah or a -few students like sample complexity is -relevant if your simulator is expensive -of course I would also say that people -have found that dqn and relatives have -worked well on game-like tasks with -images as input whereas policy grading -methods work better on the continuous -control tasks like these robotic -locomotion problems though that this -might not be fundamental it might be -more of a historical accident let's -oh yeah recommendations on older -textbooks let's see there's like brutes -a cuss for take us as books -that's approximate dinette what is it -optimal control and why am i blanking on -the name optimal control and dynamic -programming something like that and the -set I mean sudden embargo is a good one -to read butterman has a textbook kind of -a classic textbook on Markov decision -processes 
that's in the RL space then -there's books on numerical optimization -that are good and yeah I'd say obviously -the machine learning textbooks have a -lot of good material that might be -useful in the RL setting too -oh yeah can I comment on evolution -strategies and the blog posts the the -opening I blog post on it let's see do -you have any specific questions about it -or like how it compares -oh yeah okay yeah yeah so there's -there's a lot of policy grading methods -out there and some of them are quite -complicated so we've had a couple of -talks on them so far like all these -different work -it's excessively more complicated policy -grading methods but then there's this -old algorithm called evolution -strategies which is an extremely simple -algorithm and and there's a paper by -some of my colleagues where they show it -was called evolutionary strategies as a -scaleable alternative to reinforcement -learning which really meant like to -policy grading methods so and they -claimed that it worked basically as well -as policy grading methods or at least -it's sort of in the same order oh and -beer is one of the authors of that paper -so the claim was that it works it works -like similarly well to policy grading -method so why should we bother with -these policy grading methods if es works -just as well well I think in practice it -works well it works but it works not not -as well like it's it takes me the sample -complexity is is is worse by some -constant factor or it's not clear that -it's a constant factor or if this factor -scales with the size of the network but -it's it is a lot it is significantly -slower and the question is just what is -that constant factor so is that constant -factor like one or is it three or is it -10 or 100 so that's not that's going to -vary between problems and also the that -paper had some innovations in exactly -how to parameterize the networks and so -forth that made everything better -numerically everything better scaled so 
-that ES did work well but I would say
-that if you that it's usually quote like
-I don't know it's usually a pretty
-decent constant factor slower than
-policy gradient methods especially the
-more advanced ones like the PPO and
-ACKTR so so I'm I think it's it's not
-really a clear win in the RL setting
-where policy gradients work I think if
-policy gradients work it's usually going
-to be a lot better
-and the ES is going to be is going to be
-better on problems where policy
-gradients aren't going to work for some
-reason like if you've got really long
-you depend time dependencies where the
-discounts are gonna are gonna ignore
-them
-then ES might be less sensitive to that
-let's see I think I'm okay last question
-oh yeah favorite hyperparameter
-optimization framework I've used some of
-these but I just like to use the
-uniform random sampling yeah that works
-really well I mean you just run a bunch
-of experiments with random
-hyperparameters and then you just look at the
-results the next day and do some
-regression to figure out which
-parameters actually mattered and then
-you run another experiment with
-better parameter ranges and so on so I
-use the human version of it because
-often it's just it's like a it's it's
-useful to be able to look at the results
-yourself to get some to figure out
-which parameters actually matter so
-you're not wasting a lot of computation
-because that information transfers
-between problems all right
-[Applause]
\ No newline at end of file
diff --git a/docs/ml_debug_folklore_log.md b/docs/ml_debug_folklore_log.md
deleted file mode 100644
index 10f3bf4..0000000
--- a/docs/ml_debug_folklore_log.md
+++ /dev/null
@@ -1,76 +0,0 @@
-# ML Debugging Folklore - Vargdown Process Log
-
-## Process
-- [x] evidence files read (21 files, 9416 lines total)
-- [x] quotes extracted via 12 parallel subagents
-- [x] key quotes verified against evidence files (spot-checked ~15 quotes)
-- [x] argdown verifier passes clean (`npx @argdown/cli json` -- 14 arguments, 45 statements, 14 relations)
-- [x] subagent review done (gpt-5.2-codex via opencode; fixed non-verbatim quotes, credence calibration, PCS structure)
-- [ ] human review done
-
-## Evidence Fetch Log
-
-All evidence files were pre-existing in `docs/evidence/`. They were fetched
-in a prior session via the methods listed in each file's header.
-
-| Source | Evidence File | Fetch Method | Status |
-|--------|--------|--------|--------|
-| Schulman 2016 slides | joschu_nuts_and_bolts.md | `uvx markitdown[pdf]` | verbatim (PDF artifacts: cid markers) |
-| Schulman 2017 bootcamp | schulman_nuts_bolts_deeprl_bootcamp_2017_subtitles.md | YouTube auto-subtitles | verbatim (transcription errors: "insanity" = "and standard") |
-| Andy Jones RL debugging | andyljones_rl_debugging.md | markitdown | verbatim |
-| Henderson et al. 2018 | henderson_2018_deep_rl_matters.md | markitdown | verbatim |
-| Goodfellow Ch11 | goodfellow_ch11_practical_methodology.md | markitdown | verbatim |
-| CS231n NN3 | cs231n_neural_networks_3.md | markitdown | verbatim |
-| FSDL Spring 2021 L7 | fsdl_spring2021_lecture7.md | markitdown | verbatim |
-| Irpan RL hard | alexirpan_rl_hard.md | markitdown | verbatim |
-| amid.fish reproducing | amid_fish_reproducing_deep_rl.md | markitdown | verbatim |
-| Slavv 37 reasons | slavv_37_reasons_nn.md | markitdown | verbatim |
-| CS229 ML advice | cs229_ml_advice.md | markitdown | verbatim |
-| McCandlish 2018 | mccandlish_2018_large_batch.md | markitdown | verbatim |
-| William Falcon notes | williamfalcon_deeprl_hacks.md | markitdown | verbatim |
-| Goodfellow Ch15 | goodfellow_ch15_representation_learning.md | markitdown | verbatim |
-| Deep Learning Book | deeplearning_book.md | markitdown | verbatim |
-| Reddit RL tips 7s8px9 | reddit_rl_practical_tips_7s8px9.md | markitdown | verbatim |
-| Reddit RL debug 9sh77q | reddit_rl_debugging_tips_9sh77q.md | markitdown | verbatim |
-| Reddit RL roadblocks | reddit_rl_roadblocks_bzg3l2.md | markitdown | verbatim |
-| Reddit Schulman 5hereu | reddit_schulman_nuts_bolts_5hereu.md | markitdown | verbatim |
-| Reddit ICML tutorial | reddit_icml2017_tutorial_levine_6vcvu1.md | markitdown | verbatim |
-| Reddit DRL bootcamp | reddit_deeprl_bootcamp_2017_75m5vd.md | markitdown | verbatim |
-
-## Quote Verification Notes
-
-- Schulman subtitles contain auto-generated transcription errors (e.g., "mean insanity deviation" should be "mean and standard deviation"). Quotes used verbatim from file; errors are in the source, not introduced by us.
-- Schulman PDF (joschu_nuts_and_bolts.md) has markitdown conversion artifacts (`(cid:73)` bullet markers, table formatting). Core text is present but formatting is messy.
-- All other evidence files appear to be clean markitdown conversions.
-- 15 key quotes were manually spot-checked against evidence files. All matched.
-- Quotes from subagent extractions were cross-referenced with direct file reads.
-
-## Blockers / Caveats
-
-- Argdown verifier passes clean: `npx @argdown/cli json` exports 14 arguments, 45 statements, 14 relations. Fixed: 44 blank lines inside PCS blocks, bracket escaping in FSDL quote.
-- Some evidence files (especially Schulman PDF) have conversion artifacts that may cause verifier failures on exact quote matching.
-- The argdown uses auto-generated YouTube subtitles as a source; these contain transcription errors that are present in the evidence file.
-
-## Coverage Summary
-
-| SKILL.md Claim | Sources Used | Independent Sources |
-|---|---|---|
-| Normalize inputs mean=0 std=1 | Schulman, FSDL, Slavv | 3 |
-| Overfit tiny dataset first | CS231n, FSDL, Goodfellow | 3 |
-| Assume you have a bug | Jones, Goodfellow | 2 |
-| Seed variance is extreme | Schulman, Henderson, Irpan | 3 |
-| Use bigger batch sizes | Schulman (x2), McCandlish | 2 (Schulman slides + talk counted as 1) |
-| Hand-scale rewards, don't shift mean | Schulman, Jones, Henderson | 3 |
-| Use reference implementations | Jones, Rahtz | 2 |
-| Pursue anomalies | Jones, Rahtz | 2 |
-| Log everything | Rahtz, Goodfellow | 2 |
-| Random HP search | CS231n/Bergstra, Schulman | 2 |
-
-| Probe environments for RL | Jones | 1 (but applies general isolation principle) |
-| Policy entropy / KL diagnostics | Schulman | 1 (but built into major frameworks) |
-
-## Claims NOT Covered in Argdown (lower priority or single-source)
-- Gradient clipping masks problems (CS231n mentions, but as a technique not a warning)
-- Final layer zero init for policy (Schulman only)
-- Loss surface analysis / gradient quiver plots (original to SKILL, no external source)
-- Sweep methodology with within-group z-scores (original to SKILL)