ml-debug/docs/evidence/reddit_rl_practical_tips_7s8px9.md
wassname 4393cceefd initial: ML debugging folklore skill
Deep research to uplift LLMs for ML debugging, opinionated by source
selection. Distilled from Schulman, Jones, Rahtz, Goodfellow, CS231n,
FSDL, and more. Includes runnable diagnostic scripts and LLM-specific
anti-patterns.

Author: wassname (https://github.com/wassname)
2026-03-06 10:11:30 +08:00

Source: https://old.reddit.com/r/reinforcementlearning/comments/7s8px9/deep_reinforcement_learning_practical_tips/
Title: Deep Reinforcement Learning practical tips
Fetched-via: browser paste (user)
Fetch-status: verbatim

Deep Reinforcement Learning practical tips

submitted 8 years ago by grupiotr | 14 points (90% upvoted) | 13 comments

I would be particularly grateful for pointers to things you don't seem to be able to find in papers. Examples include:

  • How to choose learning rate?
  • Problems that work surprisingly well with high learning rates
  • Problems that require surprisingly low learning rates
  • Unhealthy-looking learning curves and what to do about them
  • Q estimators deciding to always give low scores to a subset of actions effectively limiting their search space
  • How to choose decay rate depending on the problem?
  • How to design reward function? Rescale? If so, linearly or non-linearly? Introduce/remove bias?
  • What to do when learning seems very inconsistent between runs?
  • In general, how to estimate how low one should be expecting the loss to get?
  • How to tell whether my learning rate is too low and I'm learning very slowly, or too high and the loss cannot be decreased further?

Comments

u/wassname (11 points):

Resources: I found these very useful

  • Deep RL Bootcamp Lecture 6: Nuts and Bolts of Deep RL Experimentation (slides) and a written summary
  • The 3 NIPS2017 Learning to run write ups contain practical advice from a competition
  • Lessons Learned Reproducing a Deep Reinforcement Learning Paper
  • Deep Reinforcement Learning that Matters - this gives you an idea of what does and doesn't matter
  • Deep Reinforcement Learning Doesn't Work Yet (at least as well as the hype suggests)
  • General deep learning tips from Slav Ivanov

Lessons learnt:

  • log everything with tensorboard/tensorboardX: policy and critic losses, advantages, ratio, actions (mean and std), states, noise. Check values, check losses are decreasing etc.
  • keep track of experiments with an experiments log (git commit messages with non-committed data or logs stored by date)
  • clip and clamp: mistakes here are not obvious, because they can make values blow up silently instead of producing NaN
    • clamp all values, e.g. logarithmic values: logvalue.clamp(np.log(1e-5), -np.log(1e-5)) (lower bound first, then upper bound)
    • watch out for dividing by a value: 1/std should be 1/(std+eps) where eps=1e-5
    • clip gradients: grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 20), then log grad_norm
  • normalise everything: use running norms for state and reward; layer norms help
  • check everything: plot and sanity check as many values as possible. Check initial outputs, inits, distributions, action range.
  • think about step-size/sampling-rate: RL is sensitive to it (action repeat, frame skipping). Papers found skipping 4 Atari frames helped, repeating 4 actions in "Learning to Run" helped.
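
The clip-and-clamp bullets above can be sketched as plain Python arithmetic, without any framework. This is an illustrative sketch rather than the commenter's code: the helper names (`clamp`, `safe_inverse_std`, `clip_by_global_norm`) and the nested-list gradient layout are assumptions made for the example. In PyTorch the corresponding calls would be `Tensor.clamp` and `torch.nn.utils.clip_grad_norm_`.

```python
import math

LOG_MIN = math.log(1e-5)    # about -11.5, lower clamp bound
LOG_MAX = -math.log(1e-5)   # about +11.5, upper clamp bound
EPS = 1e-5


def clamp(x, low, high):
    """Clamp x into [low, high]; note the order: lower bound first."""
    return max(low, min(high, x))


def safe_inverse_std(std, eps=EPS):
    """1 / (std + eps), so a zero std gives a large finite value, not an error."""
    return 1.0 / (std + eps)


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so their joint L2 norm is at most max_norm.

    grads is a list of per-layer lists of floats (hypothetical layout).
    Returns the clipped gradients and the pre-clip norm, which is the
    value worth logging.
    """
    total_norm = math.sqrt(sum(g * g for layer in grads for g in layer))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = [[g * scale for g in layer] for layer in grads]
    return clipped, total_norm


grads = [[30.0, 40.0], [0.0]]
clipped, grad_norm = clip_by_global_norm(grads, max_norm=20.0)
print("pre-clip grad norm:", grad_norm)
print("clamped log value:", clamp(-100.0, LOG_MIN, LOG_MAX))
```

Logging the pre-clip norm rather than the clipped one keeps the signal about how often clipping actually triggers.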

Curves:

  • in PPO the std should decrease as it learns
  • in actor-critic the critic loss should start converging then the actor loss follows
  • watch for local minima where it outputs a constant action
  • watch the gradient norms for actor and critic; if they are much lower than 20 or much larger than 100 you will often run into problems (20 and 40 are where projects often clip the gradient norm)
  • run on CartPole and log same curves to see what healthy looks like
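
One way to watch for the constant-action local minimum mentioned above is to check the spread of recently logged actions. The following is a minimal sketch under assumptions: the function name `action_collapsed`, the list-of-vectors input layout, and the `min_std` threshold are all hypothetical and would need tuning for a real action space.

```python
import statistics


def action_collapsed(actions, min_std=1e-3):
    """Return (collapsed, per_dim_std) for a batch of recent actions.

    actions: list of equal-length action vectors (lists of floats).
    The policy is flagged as collapsed when every action dimension has
    a standard deviation below min_std (an arbitrary, illustrative cut-off).
    """
    per_dim = list(zip(*actions))  # transpose: one tuple per action dimension
    stds = [statistics.pstdev(values) for values in per_dim]
    return all(s < min_std for s in stds), stds


recent_actions = [[0.5, -1.0]] * 10  # the same action repeated ten times
collapsed, stds = action_collapsed(recent_actions)
print("collapsed:", collapsed, "per-dimension std:", stds)
```

Running the same check on a known-good CartPole run gives a baseline for what a healthy spread looks like.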

Reward:

  • What matters is the final magnitude of the reward, not the scaling factor used to get there. Papers have gotten good results with rewards between 100 and 1000.

Learning rate:

  • Use decaying learning rates, watch loss curves to see when they begin to converge.
  • loss_actor will often initially increase while the critic is doing its initial learning (value function is a moving target). Focus on making the critic learning rate work first.
  • Critic learning rates are often set higher, with larger batches.
  • Use the cyclical learning rate trick: slowly increase the LR to find the minimum at which the model starts to learn and the maximum at which it still converges.
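
The idea of sweeping the learning rate upwards to find a usable range can be illustrated on a toy problem. This sketch makes several simplifying assumptions: it uses the scalar loss f(w) = w**2 instead of an RL loss, it restarts a short run for each learning rate rather than raising the LR within one run, and the 10% improvement cut-off is arbitrary. The function names are hypothetical.

```python
def lr_range_schedule(lr_min, lr_max, num_steps):
    """Geometrically spaced learning rates from lr_min up to lr_max."""
    ratio = (lr_max / lr_min) ** (1.0 / (num_steps - 1))
    return [lr_min * ratio ** i for i in range(num_steps)]


def toy_final_loss(lr, w0=1.0, num_updates=20):
    """Run gradient descent on f(w) = w**2 and return the final loss."""
    w = w0
    for _ in range(num_updates):
        w -= lr * 2.0 * w  # the gradient of w**2 is 2*w
    return w * w


lrs = lr_range_schedule(1e-4, 10.0, 26)
initial_loss = 1.0  # f(w0) with w0 = 1.0

# "Usable" means the loss dropped by at least 10% (illustrative cut-off).
usable = [lr for lr in lrs if toy_final_loss(lr) < 0.9 * initial_loss]
lr_low, lr_high = min(usable), max(usable)
print(f"usable LR range on the toy problem: {lr_low:.4g} to {lr_high:.4g}")
```

On this toy loss, rates that are too small barely move the loss and rates at or above 1.0 stop converging, which mirrors the "min where it learns, max where it still converges" bounds described above.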

My own questions:

  • How do you know if you've set exploration/variance too high or low?
  • Should you use a multi-headed actor/critic? Or separate networks?

"What to do when learning seems very inconsistent between runs?" - This could be an init issue. Try to init so it defaults to reasonable action values even before training.


u/gwern (8 points):

I've seen similar engineering details & folklore, but mostly in slides/talks:


u/grupiotr [OP] (5 points):

John Schulman's talk wins, particularly:

  • rescaling observations, rewards, targets and prediction targets
  • using big replay buffers, bigger batch size and generally more iterations to start with
  • always starting with a simple version of the task to get signs of life
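
Rescaling observations and rewards is often done with running statistics, since the full data distribution is not known up front. Below is a minimal sketch for scalar values using Welford's online algorithm. The class name `RunningNorm` and the `eps` default are assumptions for illustration, not something taken from the talk or the thread.

```python
import math


class RunningNorm:
    """Tracks a running mean and std of scalars and normalises new values."""

    def __init__(self, eps=1e-5):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, x):
        """Welford's online update for mean and variance."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # With fewer than two samples there is no spread yet; use 1.0.
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / self.count)

    def normalize(self, x):
        """Subtract the running mean and divide by (std + eps)."""
        return (x - self.mean) / (self.std + self.eps)


reward_norm = RunningNorm()
for r in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    reward_norm.update(r)
print("mean:", reward_norm.mean, "std:", reward_norm.std)
print("normalised 9.0:", reward_norm.normalize(9.0))
```

For vector observations the same update would be applied per dimension.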

u/Kaixhin (2 points):

My first bit of advice is actually: don't do RL. If you still decide to do it, find some other useful task for the network to do, like predicting something. Get supervised gradients flowing through your network. Training end-to-end on purely an RL signal is impressive, but adding easier learning signals can potentially help a lot.


u/grupiotr [OP] (1 point):

What turned out to be the game-changer (made my RL agents actually learn something) was rescaling the reward from [-1, 1] to [0, 1]. Thanks again to everyone that contributed!
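
The rescaling from [-1, 1] to [0, 1] described above is a simple affine map. A minimal sketch follows; the function name `rescale_reward` and its parameters are hypothetical. Note that this particular map both halves the reward scale and shifts it by +0.5, and the thread does not say which of the two effects made the difference.

```python
def rescale_reward(r, old_low=-1.0, old_high=1.0, new_low=0.0, new_high=1.0):
    """Affinely map r from [old_low, old_high] to [new_low, new_high]."""
    fraction = (r - old_low) / (old_high - old_low)
    return new_low + fraction * (new_high - new_low)


for r in (-1.0, 0.0, 1.0):
    print(r, "->", rescale_reward(r))
```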