mirror of
https://github.com/wassname/ray.git
synced 2026-07-04 17:39:55 +08:00
[rllib] Feature/soft actor critic v2 (#5328)
* Add base for Soft Actor-Critic
* Pick changes from old SAC branch
* Update sac.py
* First implementation of sac model
* Remove unnecessary SAC imports
* Prune unnecessary noise and exploration code
* Implement SAC model and use that in SAC policy
* runs but doesn't learn
* clear state
* fix batch size
* Add missing alpha grads and vars
* -200 by 2k timesteps
* doc
* lazy squash
* one file
* ignore tfp
* revert done
committed by Eric Liang
parent 3ae54a2b20
commit 13fb9fe3db
@@ -28,6 +28,7 @@ MOCK_MODULES = [
     "scipy",
     "scipy.signal",
     "scipy.stats",
+    "tensorflow_probability",
     "tensorflow",
     "tensorflow.contrib",
     "tensorflow.contrib.all_reduce",
@@ -164,12 +164,12 @@ Tuned examples: `PongDeterministic-v4 <https://github.com/ray-project/ray/blob/m
 **Atari results @10M steps**: `more details <https://github.com/ray-project/rl-experiments>`__

 ============= ======================== ============================= ============================== ===============================
-Atari env     RLlib DQN                RLlib Dueling DDQN            RLlib Dist. DQN                Hessel et al. DQN
+Atari env     RLlib DQN                RLlib Dueling DDQN            RLlib Dist. DQN                Hessel et al. DQN
 ============= ======================== ============================= ============================== ===============================
-BeamRider     2869                     1910                          4447                           ~2000
-Breakout      287                      312                           410                            ~150
-Qbert         3921                     7968                          15780                          ~4000
-SpaceInvaders 650                      1001                          1025                           ~500
+BeamRider     2869                     1910                          4447                           ~2000
+Breakout      287                      312                           410                            ~150
+Qbert         3921                     7968                          15780                          ~4000
+SpaceInvaders 650                      1001                          1025                           ~500
 ============= ======================== ============================= ============================== ===============================

 **DQN-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):
@@ -217,7 +217,7 @@ SpaceInvaders 671 944 ~800
 ============= ========================= =============================
 MuJoCo env    RLlib PPO 16-workers @ 1h Fan et al PPO 16-workers @ 1h
 ============= ========================= =============================
-HalfCheetah   9664                      ~7700
+HalfCheetah   9664                      ~7700
 ============= ========================= =============================

 .. figure:: ppo.png
@@ -232,6 +232,21 @@ HalfCheetah 9664 ~7700
    :start-after: __sphinx_doc_begin__
    :end-before: __sphinx_doc_end__

+Soft Actor Critic (SAC)
+------------------------
+`[paper] <https://arxiv.org/pdf/1801.01290>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/sac/sac.py>`__
+
+RLlib's soft-actor critic implementation is ported from the `official SAC repo <https://github.com/rail-berkeley/softlearning>`__ to better integrate with RLlib APIs. Note that SAC has two fields to configure for custom models: ``policy_model`` and ``Q_model``, and currently has no support for non-continuous action distributions. It is also currently *experimental*.
+
+Tuned examples: `Pendulum-v0 <https://github.com/ray-project/ray/blob/master/python/ray/rllib/tuned_examples/regression_tests/pendulum-sac.yaml>`__
+
+**SAC-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):
+
+.. literalinclude:: ../../python/ray/rllib/agents/sac/sac.py
+   :language: python
+   :start-after: __sphinx_doc_begin__
+   :end-before: __sphinx_doc_end__
+
 Derivative-free
 ~~~~~~~~~~~~~~~

@@ -18,6 +18,7 @@ DQN, Rainbow **Yes** `+parametric`_ No **Yes** No
 DDPG, TD3      No                      **Yes**  **Yes**  No
 APEX-DQN       **Yes** `+parametric`_  No       **Yes**  No
 APEX-DDPG      No                      **Yes**  **Yes**  No
+SAC            (todo)                  **Yes**  **Yes**  No
 ES             **Yes**                 **Yes**  No       No
 ARS            **Yes**                 **Yes**  No       No
 QMIX           **Yes**                 No       **Yes**  **Yes**
@@ -37,7 +37,7 @@ The ``rllib train`` command (same as the ``train.py`` script in the repo) has a
 The most important options are for choosing the environment
 with ``--env`` (any OpenAI gym environment including ones registered by the user
 can be used) and for choosing the algorithm with ``--run``
-(available options are ``PPO``, ``PG``, ``A2C``, ``A3C``, ``IMPALA``, ``ES``, ``DDPG``, ``DQN``, ``MARWIL``, ``APEX``, and ``APEX_DDPG``).
+(available options are ``SAC``, ``PPO``, ``PG``, ``A2C``, ``A3C``, ``IMPALA``, ``ES``, ``DDPG``, ``DQN``, ``MARWIL``, ``APEX``, and ``APEX_DDPG``).

 Evaluating Trained Policies
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -73,6 +73,8 @@ Algorithms

   - `Proximal Policy Optimization (PPO) <rllib-algorithms.html#proximal-policy-optimization-ppo>`__

+  - `Soft Actor Critic (SAC) <rllib-algorithms.html#soft-actor-critic-sac>`__
+
 * Derivative-free

   - `Augmented Random Search (ARS) <rllib-algorithms.html#augmented-random-search-ars>`__
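
The documentation changes in this commit add ``SAC`` to the ``--run`` options and name two custom-model fields, ``policy_model`` and ``Q_model``. As a minimal sketch of how these might be combined, the snippet below only builds and prints an ``rllib train`` command line; it does not import Ray or start training. The helper name ``build_train_command`` and the keys inside each model dict are hypothetical and are not confirmed by this commit, and passing the config as a JSON string through ``--config`` is assumed from the ``rllib train`` CLI.

```python
import json
import shlex

# "SAC", "Pendulum-v0", "policy_model" and "Q_model" come from the docs in
# this commit. The keys inside each model dict are ASSUMPTIONS for
# illustration; check the SAC-specific configs in sac.py for the real ones.
sac_config = {
    "policy_model": {"hidden_layer_sizes": [64, 64]},  # assumed key
    "Q_model": {"hidden_layer_sizes": [64, 64]},       # assumed key
}


def build_train_command(run, env, config):
    """Return an `rllib train` shell command with the config passed as JSON.

    Hypothetical helper: it only formats a string and never calls Ray.
    """
    args = [
        "rllib", "train",
        "--run", run,
        "--env", env,
        "--config", json.dumps(config, sort_keys=True),
    ]
    # Quote each argument so the JSON survives copy-pasting into a shell.
    return " ".join(shlex.quote(arg) for arg in args)


command = build_train_command("SAC", "Pendulum-v0", sac_config)
print(command)
```

Because the docs describe SAC as experimental and limited to continuous action spaces, a continuous-control environment such as the Pendulum-v0 tuned example is the natural choice for a first run.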