[rllib] Update RLlib docs and README (#1288)

Updates the rllib docs and README.
2026-07-05 07:39:02 +08:00 · 2017-12-06 18:17:51 -08:00
parent 2d543b6e19
commit 35f7398666
7 changed files with 151 additions and 94 deletions
@@ -1,31 +1,57 @@
-RLLib: A Scalable Reinforcement Learning Library
+RLlib: A Scalable Reinforcement Learning Library
 ================================================

-Getting Started
---------------
+This README provides a brief technical overview of RLlib. See also the `user documentation <http://ray.readthedocs.io/en/latest/rllib.html>`__.

-You can run training with
+RLlib currently provides the following algorithms:

-::
+-  `Proximal Policy Optimization <https://arxiv.org/abs/1707.06347>`__, which
+   is a proximal variant of `TRPO <https://arxiv.org/abs/1502.05477>`__.

-    python train.py --env CartPole-v0 --run PPO
-
-The available algorithms are:
-
-  ``PPO`` is a proximal variant of
-   `TRPO <https://arxiv.org/abs/1502.05477>`__.
-
-  ``ES`` is decribed in `this
+-  Evolution Strategies, which is described in `this
   paper <https://arxiv.org/abs/1703.03864>`__. Our implementation
   borrows code from
   `here <https://github.com/openai/evolution-strategies-starter>`__.

-  ``DQN`` is an implementation of `Deep Q
-   Networks <https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf>`__ based on
-   `OpenAI baselines <https://github.com/openai/baselines>`__.
+-  `The Asynchronous Advantage Actor-Critic <https://arxiv.org/abs/1602.01783>`__
+   based on `the OpenAI starter agent <https://github.com/openai/universe-starter-agent>`__.

-  ``A3C`` is an implementation of
-   `A3C <https://arxiv.org/abs/1602.01783>`__ based on `the OpenAI
-   starter agent <https://github.com/openai/universe-starter-agent>`__.
+- `Deep Q Network (DQN) <https://arxiv.org/abs/1312.5602>`__.

-User documentation can be `found here <http://ray.readthedocs.io/en/latest/rllib.html>`__.
+Proximal Policy Optimization scales to hundreds of cores and several GPUs, Evolution Strategies scales to clusters with thousands of cores, and the Asynchronous Advantage Actor-Critic scales to dozens of cores on a single node.
+
+These algorithms can be run on any OpenAI Gym MDP, including custom ones written and registered by the user.
+
+For more detailed usage information, see the `user documentation <http://ray.readthedocs.io/en/latest/rllib.html>`__.
+
+Training API
+------------
+
+All RLlib algorithms implement a common training API (agent.py), which enables multiple algorithms to be easily evaluated:
+
+::
+
+    # Train a model on a single environment
+    python train.py --env CartPole-v0 --run PPO
+
+    # Integration with ray.tune for hyperparameter evaluation
+    python train.py -f tuned_examples/cartpole-grid-search-example.yaml
+
+Evaluator and Optimizer abstractions
+------------------------------------
+
+RLlib's gradient-based algorithms are composed using two abstractions: Evaluators (evaluator.py) and Optimizers (optimizers/optimizer.py). Optimizers encapsulate a particular distributed optimization strategy for RL. Evaluators encapsulate the model graph. Once an algorithm implements the Evaluator interface, any Optimizer may be "plugged in" to it.
+
+This pluggability enables optimization strategies to be re-used and improved across different algorithms and deep learning frameworks (RLlib's optimizers work with both TensorFlow and PyTorch, though currently only A3C has a PyTorch graph implementation).
+
+These are the currently available optimizers:
+
+-  ``AsyncOptimizer`` is an asynchronous RL optimizer, similar to the one used in A3C. It asynchronously pulls and applies gradients from evaluators, sending updated weights back as needed.
+-  ``LocalSyncOptimizer`` is a simple synchronous RL optimizer. It pulls samples from remote evaluators, concatenates them, and then updates a local model. The updated model weights are then broadcast to all remote evaluators.
+-  ``LocalMultiGPUOptimizer`` (currently available for PPO) performs SGD over a number of local GPUs, and pins experience data in GPU memory to amortize the copy overhead for multiple SGD passes.
+-  ``AllReduceOptimizer`` (planned) would use the Allreduce primitive to synchronize weights scalably among a number of remote GPU workers.
+
+Common utilities
+----------------
+
+RLlib defines common action distributions, preprocessors, and neural network models, found in ``models/catalog.py``, which are shared by all algorithms. More information on these classes can be found in the `developer API docs <http://ray.readthedocs.io/en/latest/rllib.html#the-developer-api>`__.
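
Review note: the Evaluator and Optimizer split described in the README hunk above may be easier to follow with a toy example. The following is only a minimal sketch in plain Python with no Ray dependency. The names ``ToyEvaluator``, ``ToyLocalSyncOptimizer``, and ``train`` and all of their methods are hypothetical and are not RLlib's actual interfaces. The sketch mirrors the three steps the README lists for ``LocalSyncOptimizer`` (pull samples, update a local model, broadcast weights) on a one-parameter regression problem rather than an RL task.

```python
# Illustrative sketch only: these class and method names are hypothetical
# and are not RLlib's actual Evaluator/Optimizer interfaces.
import random


class ToyEvaluator:
    """Holds a one-parameter model w and fits y = 3 * x by least squares."""

    def __init__(self, seed):
        self.w = 0.0
        self.rng = random.Random(seed)

    def sample(self, n=32):
        # Stand-in for collecting experience: (x, y) pairs with y = 3 * x.
        xs = [self.rng.uniform(-1.0, 1.0) for _ in range(n)]
        return [(x, 3.0 * x) for x in xs]

    def compute_gradients(self, samples):
        # Gradient w.r.t. w of the mean of 0.5 * (w * x - y)^2.
        return sum((self.w * x - y) * x for x, y in samples) / len(samples)

    def apply_gradients(self, grad, lr=0.5):
        self.w -= lr * grad

    def get_weights(self):
        return self.w

    def set_weights(self, w):
        self.w = w


class ToyLocalSyncOptimizer:
    """Synchronous strategy: gather samples, update locally, broadcast."""

    def __init__(self, local_evaluator, remote_evaluators):
        self.local = local_evaluator
        self.remotes = remote_evaluators

    def step(self):
        # 1. Pull samples from every "remote" evaluator and concatenate them.
        batch = []
        for ev in self.remotes:
            batch.extend(ev.sample())
        # 2. Update the local model on the combined batch.
        grad = self.local.compute_gradients(batch)
        self.local.apply_gradients(grad)
        # 3. Broadcast the updated weights back to the remote evaluators.
        weights = self.local.get_weights()
        for ev in self.remotes:
            ev.set_weights(weights)


def train(num_steps=200):
    local = ToyEvaluator(seed=0)
    remotes = [ToyEvaluator(seed=i + 1) for i in range(4)]
    optimizer = ToyLocalSyncOptimizer(local, remotes)
    for _ in range(num_steps):
        optimizer.step()
    return local, remotes
```

Because the optimizer in this sketch only touches the evaluator through ``sample``, ``compute_gradients``, ``apply_gradients``, ``get_weights``, and ``set_weights``, a different strategy (for example an asynchronous one) could be swapped in without changing the evaluator, which is the pluggability the README describes.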
@@ -35,7 +35,8 @@ class Agent(Trainable):
        env_creator (func): Function that creates a new training env.
        config (obj): Algorithm-specific configuration data.
        logdir (str): Directory in which training outputs should be placed.
-        registry (obj): Object registry.
+        registry (obj): Tune object registry, for registering user-defined
+            classes and objects by name.
    """

    _allow_unknown_configs = False
@@ -118,7 +118,7 @@ class TrialRunner(object):
                    self._committed_resources.gpu,
                    self._avail_resources.gpu))
        for local_dir in sorted(set([t.local_dir for t in self._trials])):
-            messages.append("Tensorboard logdir: {}".format(local_dir))
+            messages.append("Result logdir: {}".format(local_dir))
            for t in self._trials:
                if t.local_dir == local_dir:
                    messages.append(