[rllib] Feature/soft actor critic v2 (#5328)

* Add base for Soft Actor-Critic

* Pick changes from old SAC branch

* Update sac.py

* First implementation of sac model

* Remove unnecessary SAC imports

* Prune unnecessary noise and exploration code

* Implement SAC model and use that in SAC policy

* runs but doesn't learn

* clear state

* fix batch size

* Add missing alpha grads and vars

* -200 by 2k timesteps

* doc

* lazy squash

* one file

* ignore tfp

* revert done
Kristian Hartikainen
2019-08-01 23:37:36 -07:00
committed by Eric Liang
parent 3ae54a2b20
commit 13fb9fe3db
21 changed files with 827 additions and 26 deletions
+1
@@ -28,6 +28,7 @@ MOCK_MODULES = [
"scipy",
"scipy.signal",
"scipy.stats",
"tensorflow_probability",
"tensorflow",
"tensorflow.contrib",
"tensorflow.contrib.all_reduce",
+21 -6
@@ -164,12 +164,12 @@ Tuned examples: `PongDeterministic-v4 <https://github.com/ray-project/ray/blob/m
**Atari results @10M steps**: `more details <https://github.com/ray-project/rl-experiments>`__
============= ======================== ============================= ============================== ===============================
Atari env     RLlib DQN                RLlib Dueling DDQN            RLlib Dist. DQN                Hessel et al. DQN
============= ======================== ============================= ============================== ===============================
BeamRider     2869                     1910                          4447                           ~2000
Breakout      287                      312                           410                            ~150
Qbert         3921                     7968                          15780                          ~4000
SpaceInvaders 650                      1001                          1025                           ~500
============= ======================== ============================= ============================== ===============================
**DQN-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):
@@ -217,7 +217,7 @@ SpaceInvaders 671 944 ~800
============= ========================= =============================
MuJoCo env    RLlib PPO 16-workers @ 1h Fan et al PPO 16-workers @ 1h
============= ========================= =============================
HalfCheetah   9664                      ~7700
============= ========================= =============================
.. figure:: ppo.png
@@ -232,6 +232,21 @@ HalfCheetah 9664 ~7700
:start-after: __sphinx_doc_begin__
:end-before: __sphinx_doc_end__
Soft Actor Critic (SAC)
-----------------------
`[paper] <https://arxiv.org/pdf/1801.01290>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/sac/sac.py>`__
RLlib's Soft Actor-Critic implementation is ported from the `official SAC repo <https://github.com/rail-berkeley/softlearning>`__ to better integrate with RLlib APIs. SAC has two fields for configuring custom models, ``policy_model`` and ``Q_model``, and currently supports only continuous action distributions. The implementation is currently *experimental*.
Tuned examples: `Pendulum-v0 <https://github.com/ray-project/ray/blob/master/python/ray/rllib/tuned_examples/regression_tests/pendulum-sac.yaml>`__
**SAC-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):
.. literalinclude:: ../../python/ray/rllib/agents/sac/sac.py
:language: python
:start-after: __sphinx_doc_begin__
:end-before: __sphinx_doc_end__
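As a rough illustration of the two model fields mentioned above, the sketch below builds a config dict with separate ``policy_model`` and ``Q_model`` entries. The sub-keys ``hidden_activation`` and ``hidden_layer_sizes``, the helper names ``build_sac_config`` and ``train_sac``, and the stopping criterion are assumptions made for this example; check the ``DEFAULT_CONFIG`` in ``sac.py`` for the keys the trainer actually accepts.

```python
import copy

# Assumed per-network settings. These sub-keys are illustrative and may not
# match the keys that sac.py actually reads.
SAC_MODEL_DEFAULTS = {
    "hidden_activation": "relu",
    "hidden_layer_sizes": (256, 256),
}


def build_sac_config(policy_model=None, q_model=None):
    """Return a config dict with separate policy and Q-network settings."""
    config = {
        "policy_model": copy.deepcopy(SAC_MODEL_DEFAULTS),
        "Q_model": copy.deepcopy(SAC_MODEL_DEFAULTS),
    }
    # Overrides are merged on top of the assumed defaults.
    config["policy_model"].update(policy_model or {})
    config["Q_model"].update(q_model or {})
    return config


def train_sac(config, env="Pendulum-v0", timesteps=2000):
    """Hypothetical helper: launch SAC through Tune (requires Ray)."""
    # Imported lazily so that building the config does not require Ray.
    import ray
    from ray import tune

    ray.init()
    return tune.run(
        "SAC",
        config=dict(config, env=env),
        stop={"timesteps_total": timesteps},
    )


# Use a smaller Q-network while keeping the assumed policy defaults.
config = build_sac_config(q_model={"hidden_layer_sizes": (64, 64)})
```

Calling ``train_sac(config)`` would start a run on ``Pendulum-v0``, the environment used by the tuned example. Only continuous action spaces are expected to work.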
Derivative-free
~~~~~~~~~~~~~~~
+1
@@ -18,6 +18,7 @@ DQN, Rainbow **Yes** `+parametric`_ No **Yes** No
DDPG, TD3 No **Yes** **Yes** No
APEX-DQN **Yes** `+parametric`_ No **Yes** No
APEX-DDPG No **Yes** **Yes** No
SAC (todo) **Yes** **Yes** No
ES **Yes** **Yes** No No
ARS **Yes** **Yes** No No
QMIX **Yes** No **Yes** **Yes**
+1 -1
@@ -37,7 +37,7 @@ The ``rllib train`` command (same as the ``train.py`` script in the repo) has a
The most important options are for choosing the environment
with ``--env`` (any OpenAI gym environment including ones registered by the user
can be used) and for choosing the algorithm with ``--run``
(available options are ``PPO``, ``PG``, ``A2C``, ``A3C``, ``IMPALA``, ``ES``, ``DDPG``, ``DQN``, ``MARWIL``, ``APEX``, and ``APEX_DDPG``).
(available options are ``SAC``, ``PPO``, ``PG``, ``A2C``, ``A3C``, ``IMPALA``, ``ES``, ``DDPG``, ``DQN``, ``MARWIL``, ``APEX``, and ``APEX_DDPG``).
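The two options described above can be combined as in the following sketch, which assembles an ``rllib train`` command line and checks the algorithm name against the options listed in the text. The helper name ``rllib_train_command`` is hypothetical, and ``Pendulum-v0`` is used only because it is the environment of the SAC tuned example.

```python
import shlex

# Algorithm names listed in the documentation text for ``--run``.
AVAILABLE_ALGORITHMS = (
    "SAC", "PPO", "PG", "A2C", "A3C", "IMPALA", "ES",
    "DDPG", "DQN", "MARWIL", "APEX", "APEX_DDPG",
)


def rllib_train_command(run, env):
    """Build the argument list for ``rllib train`` (hypothetical helper)."""
    if run not in AVAILABLE_ALGORITHMS:
        raise ValueError("unknown algorithm: {!r}".format(run))
    return ["rllib", "train", "--run", run, "--env", env]


command = rllib_train_command("SAC", "Pendulum-v0")
# Quote each argument so the line can be pasted into a shell.
command_line = " ".join(shlex.quote(arg) for arg in command)
print(command_line)
```

The resulting list can be passed to ``subprocess.run`` on a machine where RLlib is installed.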
Evaluating Trained Policies
~~~~~~~~~~~~~~~~~~~~~~~~~~~
+2
@@ -73,6 +73,8 @@ Algorithms
- `Proximal Policy Optimization (PPO) <rllib-algorithms.html#proximal-policy-optimization-ppo>`__
- `Soft Actor Critic (SAC) <rllib-algorithms.html#soft-actor-critic-sac>`__
* Derivative-free
- `Augmented Random Search (ARS) <rllib-algorithms.html#augmented-random-search-ars>`__