[rllib] Feature/soft actor critic v2 (#5328)

* Add base for Soft Actor-Critic
* Pick changes from old SAC branch
* Update sac.py
* First implementation of sac model
* Remove unnecessary SAC imports
* Prune unnecessary noise and exploration code
* Implement SAC model and use that in SAC policy
* runs but doesn't learn
* clear state
* fix batch size
* Add missing alpha grads and vars
* -200 by 2k timesteps
* doc
* lazy squash
* one file
* ignore tfp
* revert done
2026-07-04 17:39:55 +08:00 · 2019-08-01 23:37:36 -07:00
parent 3ae54a2b20
commit 13fb9fe3db
21 changed files with 827 additions and 26 deletions
@@ -28,6 +28,7 @@ MOCK_MODULES = [
    "scipy",
    "scipy.signal",
    "scipy.stats",
+    "tensorflow_probability",
    "tensorflow",
    "tensorflow.contrib",
    "tensorflow.contrib.all_reduce",
@@ -164,12 +164,12 @@ Tuned examples: `PongDeterministic-v4 <https://github.com/ray-project/ray/blob/m
 **Atari results @10M steps**: `more details <https://github.com/ray-project/rl-experiments>`__

 =============  ========================  =============================  ==============================  ===============================
- Atari env     RLlib DQN                 RLlib Dueling DDQN             RLlib Dist. DQN                 Hessel et al. DQN              
+ Atari env     RLlib DQN                 RLlib Dueling DDQN             RLlib Dist. DQN                 Hessel et al. DQN
 =============  ========================  =============================  ==============================  ===============================
-BeamRider      2869                      1910                           4447                            ~2000                          
-Breakout       287                       312                            410                             ~150                           
-Qbert          3921                      7968                           15780                           ~4000                          
-SpaceInvaders  650                       1001                           1025                            ~500                           
+BeamRider      2869                      1910                           4447                            ~2000
+Breakout       287                       312                            410                             ~150
+Qbert          3921                      7968                           15780                           ~4000
+SpaceInvaders  650                       1001                           1025                            ~500
 =============  ========================  =============================  ==============================  ===============================

 **DQN-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):
@@ -217,7 +217,7 @@ SpaceInvaders  671             944             ~800
 =============  =========================  =============================
 MuJoCo env     RLlib PPO 16-workers @ 1h  Fan et al PPO 16-workers @ 1h
 =============  =========================  =============================
-HalfCheetah    9664                       ~7700 
+HalfCheetah    9664                       ~7700
 =============  =========================  =============================

 .. figure:: ppo.png
@@ -232,6 +232,21 @@ HalfCheetah    9664                       ~7700
   :start-after: __sphinx_doc_begin__
   :end-before: __sphinx_doc_end__

+Soft Actor Critic (SAC)
+------------------------
+`[paper] <https://arxiv.org/pdf/1801.01290>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/sac/sac.py>`__
+
+RLlib's Soft Actor-Critic (SAC) implementation is ported from the `official SAC repo <https://github.com/rail-berkeley/softlearning>`__ to better integrate with RLlib APIs. SAC exposes two fields for configuring custom models, ``policy_model`` and ``Q_model``, and currently supports only continuous action spaces. It is also currently *experimental*.
+
+Tuned examples: `Pendulum-v0 <https://github.com/ray-project/ray/blob/master/python/ray/rllib/tuned_examples/regression_tests/pendulum-sac.yaml>`__
+
+**SAC-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):
+
+.. literalinclude:: ../../python/ray/rllib/agents/sac/sac.py
+   :language: python
+   :start-after: __sphinx_doc_begin__
+   :end-before: __sphinx_doc_end__
+
 Derivative-free
 ~~~~~~~~~~~~~~~

@@ -18,6 +18,7 @@ DQN, Rainbow    **Yes** `+parametric`_  No                  **Yes**      No
 DDPG, TD3       No                      **Yes**             **Yes**      No
 APEX-DQN        **Yes** `+parametric`_  No                  **Yes**      No
 APEX-DDPG       No                      **Yes**             **Yes**      No
+SAC             (todo)                  **Yes**             **Yes**      No
 ES              **Yes**                 **Yes**             No           No
 ARS             **Yes**                 **Yes**             No           No
 QMIX            **Yes**                 No                  **Yes**      **Yes**
@@ -37,7 +37,7 @@ The ``rllib train`` command (same as the ``train.py`` script in the repo) has a
 The most important options are for choosing the environment
 with ``--env`` (any OpenAI gym environment including ones registered by the user
 can be used) and for choosing the algorithm with ``--run``
-(available options are ``PPO``, ``PG``, ``A2C``, ``A3C``, ``IMPALA``, ``ES``, ``DDPG``, ``DQN``, ``MARWIL``, ``APEX``, and ``APEX_DDPG``).
+(available options are ``SAC``, ``PPO``, ``PG``, ``A2C``, ``A3C``, ``IMPALA``, ``ES``, ``DDPG``, ``DQN``, ``MARWIL``, ``APEX``, and ``APEX_DDPG``).

 Evaluating Trained Policies
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -73,6 +73,8 @@ Algorithms

   -  `Proximal Policy Optimization (PPO) <rllib-algorithms.html#proximal-policy-optimization-ppo>`__

+   -  `Soft Actor Critic (SAC) <rllib-algorithms.html#soft-actor-critic-sac>`__
+
 *  Derivative-free

   -  `Augmented Random Search (ARS) <rllib-algorithms.html#augmented-random-search-ars>`__
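
As a rough illustration of the doc changes above, the sketch below assembles an ``rllib train`` invocation that selects SAC with ``--run`` and Pendulum-v0 with ``--env``, passing the two SAC model fields through ``--config``. The helper ``build_rllib_train_command`` is hypothetical, and the ``hidden_layer_sizes`` sub-key is an assumption; the authoritative keys are the ones in the ``sac.py`` default config pulled in by the ``literalinclude``.

```python
import json
import shlex


def build_rllib_train_command(run, env, config=None):
    """Return the argv list for an `rllib train` call (hypothetical helper)."""
    argv = ["rllib", "train", "--run", run, "--env", env]
    if config:
        # `--config` takes a JSON string of algorithm-specific overrides.
        argv += ["--config", json.dumps(config, sort_keys=True)]
    return argv


# ``policy_model`` and ``Q_model`` are the two custom-model fields named in the
# docs; the ``hidden_layer_sizes`` sub-key is assumed for illustration only.
sac_config = {
    "policy_model": {"hidden_layer_sizes": [256, 256]},
    "Q_model": {"hidden_layer_sizes": [256, 256]},
}

command = build_rllib_train_command("SAC", "Pendulum-v0", sac_config)

# Print a shell-safe version of the command for copy/paste.
print(" ".join(shlex.quote(part) for part in command))
```

Pendulum-v0 is used here because it is the tuned example the new section links to and has a continuous action space, which is the only kind SAC supports at this commit.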