Files
ml_debug/docs/evidence/cleanrl_37_ppo_details.md
wassname ee4e9a5caa folklore: add koaning, gwern, kidger, nanochat, cleanrl; trim lucidrains
Gather debugging folklore from more practitioners, each a verbatim quote
checked against a cached source copy (footnoted with line numbers):
- koaning (Vincent Warmerdam), "Bad Labels": benchmark labels are often wrong;
  find them with confidence-sorted errors.
- gwern, the tank-detection legend: the canonical data-leakage parable, plus
  the scout-mindset twist that it's a likely-unsourced urban legend.
- Patrick Kidger, "Just Know Stuff": why research code is buggy ("kludge ...
  bugs that don't cripple things only because some other bug stops them") and
  "never accept the kludge". Plus a one-line jaxtyping pointer for shape bugs.
- nanochat (Karpathy): BOS-alignment fake metric improvement; all-ranks must
  clip on inf (a multi-GPU bug single-GPU testing hides).
- cleanrl "37 Implementation Details of PPO" -> RL sub-skill, as the canonical
  proof that reference-impl details (not ideas) decide whether PPO works.

Trim the lucidrains item to one quote (it had ballooned). Add wassname credit
+ companion-gist link. All 20 footnotes resolve.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 20:59:36 +08:00

1.8 KiB

The 37 Implementation Details of Proximal Policy Optimization

Authors: Huang, Shengyi; Dossa, Rousslan Fernand Julien; Raffin, Antonin; Kanervisto, Anssi; Wang, Weixun. Source: ICLR Blog Track, 2022-03-25 — https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/ Code: https://github.com/vwxyzjn/ppo-implementation-details ; CleanRL: https://github.com/vwxyzjn/cleanrl

Excerpt cached for the ML-debugging skill (the full post is long; key framing passages below, verbatim).


Instead of doing ablation studies and making recommendations on which details matter, this blog post takes a step back and focuses on reproductions of PPO's results in all accounts.

During our re-implementation, we have compiled an implementation checklist containing 37 details as follows. For each implementation detail, we display the permanent link to its code (which is not done in academic papers) and point out its literature connection.

The 37 details break down as:

  • 13 core implementation details
  • 9 Atari-specific implementation details
  • 9 implementation details for robotics tasks (continuous action spaces)
  • 5 LSTM implementation details
  • 1 MultiDiscrete action-spaces implementation detail
  • (plus 4 situational details not used in the official implementation)

Our ultimate purpose is to help people understand the PPO implementation through and through, reproduce past results with high fidelity, and facilitate customization for new research.

Context: the official PPO implementation (openai/baselines, ppo2) has undergone several refactorings, so "it is important to recognize which version of the official implementation is worth studying." Libraries that match ppo2's details closely (Stable-Baselines3, CleanRL) reproduce similar results; others report more diverse (worse) results.