diff --git a/doc/source/apex.png b/doc/source/apex.png
new file mode 100644
index 000000000..4d6a3c5e6
Binary files /dev/null and b/doc/source/apex.png differ
diff --git a/doc/source/es.png b/doc/source/es.png
new file mode 100644
index 000000000..c0f3db237
Binary files /dev/null and b/doc/source/es.png differ
diff --git a/doc/source/index.rst b/doc/source/index.rst
index ae43afd39..da488dd73 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -79,8 +79,11 @@ Ray comes with libraries that accelerate deep learning and reinforcement learnin
:caption: Ray RLlib
rllib.rst
- policy-optimizers.rst
- rllib-dev.rst
+ rllib-training.rst
+ rllib-env.rst
+ rllib-algorithms.rst
+ rllib-models.rst
+ rllib-package-ref.rst
.. toctree::
:maxdepth: 1
diff --git a/doc/source/multi-agent.svg b/doc/source/multi-agent.svg
new file mode 100644
index 000000000..a99b604aa
--- /dev/null
+++ b/doc/source/multi-agent.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/doc/source/policy-optimizers.rst b/doc/source/policy-optimizers.rst
deleted file mode 100644
index 8753c2932..000000000
--- a/doc/source/policy-optimizers.rst
+++ /dev/null
@@ -1,69 +0,0 @@
-Policy Optimizers
-=================
-
-RLlib supports using its policy optimizer implementations from external algorithms.
-
-Example of constructing and using a policy optimizer `(link to full example) `__:
-
-.. code-block:: python
-
- ray.init()
- env_creator = lambda env_config: gym.make("PongNoFrameskip-v4")
- optimizer = LocalSyncReplayOptimizer.make(
- YourEvaluatorClass, [env_creator], num_workers=0, optimizer_config={})
-
- i = 0
- while optimizer.num_steps_sampled < 100000:
- i += 1
- print("== optimizer step {} ==".format(i))
- optimizer.step()
- print("optimizer stats", optimizer.stats())
- print("local evaluator stats", optimizer.local_evaluator.stats())
-
-Read more about policy optimizers in this post: `Distributed Policy Optimizers for Scalable and Reproducible Deep RL `__.
-
-Here are the steps for using a RLlib policy optimizer with an existing algorithm.
-
-1. Implement the `Policy evaluator interface `__.
-
- - Here is an example of porting a `PyTorch Rainbow implementation `__.
-
- - Another example porting a `TensorFlow DQN implementation `__.
-
-2. Pick a `Policy optimizer class `__. The `LocalSyncOptimizer `__ is a reasonable choice for local testing. You can also implement your own. Policy optimizers can be constructed using their ``make`` method (e.g., ``LocalSyncOptimizer.make(evaluator_cls, evaluator_args, num_workers, optimizer_config)``), or you can construct them by passing in a list of evaluators instantiated as Ray actors.
-
- - Here is code showing the `simple Policy Gradient agent `__ using ``make()``.
-
- - A different example showing an `A3C agent `__ passing in Ray actors directly.
-
-3. Decide how you want to drive the training loop.
-
- - Option 1: call ``optimizer.step()`` from some existing training code. Training statistics can be retrieved by querying the ``optimizer.local_evaluator`` evaluator instance, or mapping over the remote evaluators (e.g., ``ray.get([ev.some_fn.remote() for ev in optimizer.remote_evaluators])``) if you are running with multiple workers.
-
- - Option 2: define a full RLlib `Agent class `__. This might be preferable if you don't have an existing training harness or want to use features provided by `Ray Tune `__.
-
-Available Policy Optimizers
----------------------------
-
-+-----------------------------+---------------------+-----------------+------------------------------+
-| **Policy optimizer class** | **Operating range** | **Works with** | **Description** |
-+=============================+=====================+=================+==============================+
-|AsyncOptimizer |1-10s of CPUs |(any) |Asynchronous gradient-based |
-| | | |optimization (e.g., A3C) |
-+-----------------------------+---------------------+-----------------+------------------------------+
-|LocalSyncOptimizer |0-1 GPUs + |(any) |Synchronous gradient-based |
-| |1-100s of CPUs | |optimization with parallel |
-| | | |sample collection |
-+-----------------------------+---------------------+-----------------+------------------------------+
-|LocalSyncReplayOptimizer |0-1 GPUs + | Off-policy |Adds a replay buffer |
-| |1-100s of CPUs | algorithms |to LocalSyncOptimizer |
-+-----------------------------+---------------------+-----------------+------------------------------+
-|LocalMultiGPUOptimizer |0-10 GPUs + | Algorithms |Implements data-parallel |
-| |1-100s of CPUs | written in |optimization over multiple |
-| | | TensorFlow |GPUs, e.g., for PPO |
-+-----------------------------+---------------------+-----------------+------------------------------+
-|ApexOptimizer |1 GPU + | Off-policy |Implements the Ape-X |
-| |10-100s of CPUs | algorithms |distributed prioritization |
-| | | w/sample |algorithm |
-| | | prioritization | |
-+-----------------------------+---------------------+-----------------+------------------------------+
diff --git a/doc/source/ppo.png b/doc/source/ppo.png
new file mode 100644
index 000000000..c9d358f05
Binary files /dev/null and b/doc/source/ppo.png differ
diff --git a/doc/source/rllib-algorithms.rst b/doc/source/rllib-algorithms.rst
new file mode 100644
index 000000000..b2f804dd0
--- /dev/null
+++ b/doc/source/rllib-algorithms.rst
@@ -0,0 +1,67 @@
+RLlib Algorithms
+================
+
+Ape-X Distributed Prioritized Experience Replay
+-----------------------------------------------
+`[paper] `__
+`[implementation] `__
+Ape-X variations of DQN and DDPG (`APEX_DQN `__, `APEX_DDPG `__ in RLlib) use a single GPU learner and many CPU workers for experience collection. Experience collection can scale to hundreds of CPU workers due to the distributed prioritization of experience prior to storage in replay buffers.
+
+Tuned examples: `PongNoFrameskip-v4 `__, `Pendulum-v0 `__, `MountainCarContinuous-v0 `__
+
+.. figure:: apex.png
+
+ Ape-X using 32 workers in RLlib vs vanilla DQN (orange) and A3C (blue) on PongNoFrameskip-v4.
+
+Asynchronous Advantage Actor-Critic
+-----------------------------------
+`[paper] `__ `[implementation] `__
+RLlib's A3C uses the AsyncGradientsOptimizer to apply gradients computed remotely on policy evaluation actors. It scales up to 16-32 worker processes, depending on the environment. Both a TensorFlow (LSTM) and a PyTorch version are available.
+
+Tuned examples: `PongDeterministic-v4 `__, `PyTorch version `__
+
+Deep Deterministic Policy Gradients
+-----------------------------------
+`[paper] `__ `[implementation] `__
+DDPG is implemented similarly to DQN (below). The algorithm can be scaled by increasing the number of workers, switching to AsyncGradientsOptimizer, or using Ape-X.
+
+Tuned examples: `Pendulum-v0 `__, `MountainCarContinuous-v0 `__, `HalfCheetah-v2 `__
+
+Deep Q Networks
+---------------
+`[paper] `__ `[implementation] `__
+RLlib DQN is implemented using the SyncReplayOptimizer. The algorithm can be scaled by increasing the number of workers, using the AsyncGradientsOptimizer for async DQN, or using Ape-X. Memory usage is reduced by compressing samples in the replay buffer with LZ4.
+
+Tuned examples: `PongDeterministic-v4 `__
+
+Evolution Strategies
+--------------------
+`[paper] `__ `[implementation] `__
+Code here is adapted from https://github.com/openai/evolution-strategies-starter to execute in the distributed setting with Ray.
+
+Tuned examples: `Humanoid-v1 `__
+
+.. figure:: es.png
+ :width: 500px
+ :align: center
+
+ RLlib's ES implementation scales further and is faster than a reference Redis implementation.
+
+Policy Gradients
+----------------
+`[paper] `__ `[implementation] `__ We include a vanilla policy gradients implementation as an example algorithm. This is usually outperformed by PPO.
+
+Tuned examples: `CartPole-v0 `__
+
+Proximal Policy Optimization
+----------------------------
+`[paper] `__ `[implementation] `__
+PPO's clipped objective supports multiple SGD passes over the same batch of experiences. RLlib's multi-GPU optimizer pins that data in GPU memory to avoid unnecessary transfers from host memory, substantially improving performance over a naive implementation. RLlib's PPO scales out using multiple workers for experience collection, and also with multiple GPUs for SGD.
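+
+As a rough illustration of why the clipped objective tolerates several SGD passes over one batch, the per-sample surrogate from the PPO paper can be sketched as follows. This is a minimal sketch and not RLlib's implementation; the function name ``clipped_surrogate`` and the default ``clip_param`` of 0.2 are assumptions made for illustration.
+
```python
def clipped_surrogate(ratio, advantage, clip_param=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    `ratio` is the new-to-old action probability ratio r, and `advantage`
    is the advantage estimate A. Both names are illustrative.
    """
    clipped_ratio = max(1.0 - clip_param, min(ratio, 1.0 + clip_param))
    return min(ratio * advantage, clipped_ratio * advantage)


# Once the ratio moves past 1 + clip_param for a positive advantage, the
# objective stops increasing, so extra SGD passes gain nothing from pushing
# the policy even farther from the one that collected the batch.
inside = clipped_surrogate(ratio=1.1, advantage=2.0)
outside = clipped_surrogate(ratio=1.5, advantage=2.0)
```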
+
+Tuned examples: `Humanoid-v1 `__, `Hopper-v1 `__, `Pendulum-v0 `__, `PongDeterministic-v4 `__, `Walker2d-v1 `__
+
+.. figure:: ppo.png
+ :width: 500px
+ :align: center
+
+ RLlib's PPO is more cost effective and faster than a reference PPO implementation.
diff --git a/doc/source/rllib-dev.rst b/doc/source/rllib-dev.rst
deleted file mode 100644
index 6b5358f58..000000000
--- a/doc/source/rllib-dev.rst
+++ /dev/null
@@ -1,102 +0,0 @@
-RLlib Developer Guide
-=====================
-
-.. note::
-
- This guide will take you through steps for implementing a new algorithm in RLlib. To apply existing algorithms already implemented in RLlib, please see the `user docs `__.
-
-Recipe for an RLlib algorithm
------------------------------
-
-Here are the steps for implementing a new algorithm in RLlib:
-
-1. Define an algorithm-specific `Policy evaluator class <#policy-evaluators-and-optimizers>`__ (the core of the algorithm). Evaluators encapsulate framework-specific components such as the policy and loss functions. For an example, see the `simple policy gradient evaluator example `__.
-
-
-2. Pick an appropriate `Policy optimizer class <#policy-evaluators-and-optimizers>`__. Optimizers manage the parallel execution of the algorithm. RLlib provides several built-in optimizers for gradient-based algorithms. Advanced algorithms may find it beneficial to implement their own optimizers.
-
-
-3. Wrap the two up in an `Agent class <#agents>`__. Agents are the user-facing API of RLlib. They provide the necessary "glue" and implement accessory functionality such as statistics reporting and checkpointing.
-
-To help with implementation, RLlib provides common action distributions, preprocessors, and neural network models, found in `catalog.py `__, which are shared by all algorithms. Note that most of these utilities are currently Tensorflow specific.
-
-.. image:: rllib-api.svg
-
-
-The Developer API
------------------
-
-The following APIs are the building blocks of RLlib algorithms (also take a look at the `user components overview `__).
-
-Agents
-~~~~~~
-
-Agents implement a particular algorithm and can be used to run
-some number of iterations of the algorithm, save and load the state
-of training and evaluate the current policy. All agents inherit from
-a common base class:
-
-.. autoclass:: ray.rllib.agent.Agent
- :members:
-
-Policy Evaluators and Optimizers
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. autoclass:: ray.rllib.optimizers.policy_evaluator.PolicyEvaluator
- :members:
-
-.. autoclass:: ray.rllib.optimizers.policy_optimizer.PolicyOptimizer
- :members:
-
-Sample Batches
-~~~~~~~~~~~~~~
-
-In order for Optimizers to manipulate sample data, they should be returned from Evaluators
-in the SampleBatch format (a wrapper around a dict).
-
-.. autoclass:: ray.rllib.optimizers.SampleBatch
- :members:
-
-Models and Preprocessors
-~~~~~~~~~~~~~~~~~~~~~~~~
-
-Algorithms share neural network models which inherit from the following class:
-
-.. autoclass:: ray.rllib.models.Model
- :members:
-
-Currently we support fully connected and convolutional TensorFlow policies on all algorithms:
-
-.. autoclass:: ray.rllib.models.FullyConnectedNetwork
-
-A3C also supports a TensorFlow LSTM policy.
-
-.. autoclass:: ray.rllib.models.LSTM
-
-Observations are transformed by Preprocessors before used in the model:
-
-.. autoclass:: ray.rllib.models.preprocessors.Preprocessor
- :members:
-
-Action Distributions
-~~~~~~~~~~~~~~~~~~~~
-
-Actions can be sampled from different distributions which have a common base
-class:
-
-.. autoclass:: ray.rllib.models.ActionDistribution
- :members:
-
-Currently we support the following action distributions:
-
-.. autoclass:: ray.rllib.models.Categorical
-.. autoclass:: ray.rllib.models.DiagGaussian
-.. autoclass:: ray.rllib.models.Deterministic
-
-The Model Catalog
-~~~~~~~~~~~~~~~~~
-
-The Model Catalog is the mechanism for algorithms to get canonical preprocessors, models, and action distributions for varying gym environments. It enables easy reuse of these components across different algorithms.
-
-.. autoclass:: ray.rllib.models.ModelCatalog
- :members:
diff --git a/doc/source/rllib-env.rst b/doc/source/rllib-env.rst
new file mode 100644
index 000000000..20e6eed3b
--- /dev/null
+++ b/doc/source/rllib-env.rst
@@ -0,0 +1,142 @@
+RLlib Environments
+==================
+
+RLlib works with several different types of environments, including `OpenAI Gym `__, user-defined, multi-agent, and also batched environments.
+
+.. image:: rllib-envs.svg
+
+In the high-level agent APIs, environments are identified with string names. By default, the string is interpreted as a gym `environment name `__; however, you can also register custom environments by name:
+
+.. code-block:: python
+
+ import ray
+ import ray.rllib.agents.ppo as ppo
+ from ray.tune.registry import register_env
+
+ def env_creator(env_config):
+     import gym
+     return gym.make("CartPole-v0")  # or return your own custom env
+
+ # The name passed to register_env must match the env name used below.
+ register_env("my_env", env_creator)
+ ray.init()
+ trainer = ppo.PPOAgent(env="my_env", config={
+     "env_config": {},  # config to pass to env creator
+ })
+
+ while True:
+     print(trainer.train())
+
+
+OpenAI Gym
+----------
+
+RLlib uses Gym as its environment interface for single-agent training. For more information on how to implement a custom Gym environment, see the `gym.Env class definition `__. You may also find the `SimpleCorridor `__ and `Carla simulator `__ example env implementations useful as a reference.
+
+Performance
+~~~~~~~~~~~
+
+There are two ways to scale experience collection with Gym environments:
+
+ 1. **Vectorization within a single process:** Though many envs can achieve high frame rates per core, their throughput is limited in practice by policy evaluation between steps. For example, even small TensorFlow models incur a couple of milliseconds of latency to evaluate. This can be worked around by creating multiple envs per process and batching policy evaluations across these envs.
+
+ You can configure ``{"num_envs": M}`` to have RLlib create ``M`` concurrent environments per worker. RLlib auto-vectorizes Gym environments via `VectorEnv.wrap() `__.
+
+ 2. **Distribute across multiple processes:** You can also have RLlib create multiple processes (Ray actors) for experience collection. In most algorithms this can be controlled by setting the ``{"num_workers": N}`` config.
+
+.. image:: throughput.png
+
+You can also combine vectorization and distributed execution, as shown in the above figure. Here we plot just the throughput of RLlib policy evaluation from 1 to 128 CPUs. PongNoFrameskip-v4 on GPU scales from 2.4k to ∼200k actions/s, and Pendulum-v0 on CPU from 15k to 1.5M actions/s. One machine was used for 1-16 workers, and a Ray cluster of four machines for 32-128 workers. Each worker was configured with ``num_envs=64``.
+
+
+Vectorized
+----------
+
+RLlib will auto-vectorize Gym envs for batch evaluation if the ``num_envs`` config is set, or you can define a custom environment class that subclasses `VectorEnv `__ to implement ``vector_step()`` and ``vector_reset()``.
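+
+To make the batching idea concrete, here is a minimal, self-contained sketch of a vectorized environment. It does not use RLlib: ``ToyEnv``, ``ToyVectorEnv``, and ``batched_policy`` are hypothetical stand-ins, and the exact signatures of ``vector_reset()`` and ``vector_step()`` shown here are assumptions rather than the ``VectorEnv`` interface itself.
+
```python
class ToyEnv:
    """Hypothetical single env: the episode ends after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return float(self.t)

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return float(self.t), float(action), done, {}


class ToyVectorEnv:
    """Steps several ToyEnv copies together so the policy sees one batch."""

    def __init__(self, num_envs):
        self.envs = [ToyEnv() for _ in range(num_envs)]

    def vector_reset(self):
        return [env.reset() for env in self.envs]

    def vector_step(self, actions):
        results = [env.step(a) for env, a in zip(self.envs, actions)]
        obs, rewards, dones, infos = (list(x) for x in zip(*results))
        return obs, rewards, dones, infos


def batched_policy(obs_batch):
    # Stand-in for a single batched forward pass over all observations.
    return [1 if obs < 3 else 0 for obs in obs_batch]


vec_env = ToyVectorEnv(num_envs=4)
obs_batch = vec_env.vector_reset()
obs_batch, rewards, dones, infos = vec_env.vector_step(batched_policy(obs_batch))
```
+
+The policy is called once per step for all four envs, which is where the per-call evaluation latency gets amortized.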
+
+Multi-Agent
+-----------
+
+A multi-agent environment is one which has multiple acting entities per step, e.g., in a traffic simulation, there may be multiple "car" and "traffic light" agents in the environment. The model for multi-agent in RLlib is as follows: (1) as a user you define the number of policies available up front, and (2) you provide a function that maps agent ids to policy ids. This is summarized by the figure below:
+
+.. image:: multi-agent.svg
+
+The environment itself must subclass the `MultiAgentEnv `__ interface, which can return observations and rewards from multiple ready agents per step:
+
+.. code-block:: python
+
+ # Example: using a multi-agent env
+ > env = MultiAgentTrafficEnv(num_cars=20, num_traffic_lights=5)
+
+ # Observations are a dict mapping agent names to their obs. Not all agents
+ # may be present in the dict in each time step.
+ > print(env.reset())
+ {
+ "car_1": [[...]],
+ "car_2": [[...]],
+ "traffic_light_1": [[...]],
+ }
+
+ # Actions should be provided for each agent that returned an observation.
+ > new_obs, rewards, dones, infos = env.step(actions={"car_1": ..., "car_2": ...})
+
+ # Similarly, new_obs, rewards, dones, etc. also become dicts
+ > print(rewards)
+ {"car_1": 3, "car_2": -1, "traffic_light_1": 0}
+
+ # Individual agents can early exit; env is done when "__all__" = True
+ > print(dones)
+ {"car_2": True, "__all__": False}
+
+If all the agents will be using the same algorithm class to train, then you can set up multi-agent training as follows:
+
+.. code-block:: python
+
+ trainer = pg.PGAgent(env="my_multiagent_env", config={
+     "multiagent": {
+         "policy_graphs": {
+             "car1": (PGPolicyGraph, car_obs_space, car_act_space, {"gamma": 0.85}),
+             "car2": (PGPolicyGraph, car_obs_space, car_act_space, {"gamma": 0.99}),
+             "traffic_light": (PGPolicyGraph, tl_obs_space, tl_act_space, {}),
+         },
+         "policy_mapping_fn":
+             lambda agent_id:
+                 "traffic_light"  # Traffic lights are always controlled by this policy
+                 if agent_id.startswith("traffic_light_")
+                 else random.choice(["car1", "car2"])  # Randomly choose from car policies
+     },
+ })
+
+ while True:
+     print(trainer.train())
+
+RLlib will create three distinct policies and route each agent's decisions to its bound policy. When an agent first appears in the env, ``policy_mapping_fn`` will be called to determine which policy it is bound to. RLlib reports separate training statistics for each policy in the return from ``train()``, along with the combined reward.
+
+Here is a simple `example training script `__ in which you can vary the number of agents and policies in the environment. For more advanced usage, e.g., different classes of policies per agent, or more control over the training process, you can use the lower-level RLlib APIs directly to define custom policy graphs or algorithms.
+
+To scale to hundreds of agents, MultiAgentEnv batches policy evaluations across multiple agents internally. It can also be auto-vectorized by setting ``num_envs > 1``.
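+
+The binding and batching behavior described in this section can be sketched without RLlib as follows. This is only an illustration under stated assumptions: ``PolicyBinder`` and ``group_by_policy`` are hypothetical names, and RLlib's internal bookkeeping may differ.
+
```python
import random


def policy_mapping_fn(agent_id):
    # Same rule as the lambda in the training config: one shared policy for
    # traffic lights, and a randomly chosen car policy for everything else.
    if agent_id.startswith("traffic_light_"):
        return "traffic_light"
    return random.choice(["car1", "car2"])


class PolicyBinder:
    """Hypothetical helper: remembers the policy each agent was first bound to."""

    def __init__(self, mapping_fn):
        self.mapping_fn = mapping_fn
        self.bound = {}

    def policy_for(self, agent_id):
        # The mapping function is only consulted the first time an agent appears.
        if agent_id not in self.bound:
            self.bound[agent_id] = self.mapping_fn(agent_id)
        return self.bound[agent_id]

    def group_by_policy(self, obs_dict):
        """Groups one step's observations so each policy is evaluated once."""
        batches = {}
        for agent_id, obs in obs_dict.items():
            batches.setdefault(self.policy_for(agent_id), []).append((agent_id, obs))
        return batches


binder = PolicyBinder(policy_mapping_fn)
obs = {"car_1": [0.1], "car_2": [0.2], "traffic_light_1": [1.0]}
batches = binder.group_by_policy(obs)
```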
+
+Serving
+-------
+
+In many situations, it does not make sense for an environment to be "stepped" by RLlib. For example, if a policy is to be used in a web serving system, then it is more natural to instead *query* a service that serves policy decisions, and for that service to learn from experience over time.
+
+RLlib provides the `ServingEnv `__ class for this purpose. Unlike other envs, ServingEnv runs as its own thread of control. At any point, that thread can query the current policy for decisions via ``self.get_action()`` and report rewards via ``self.log_returns()``. This can be done for multiple concurrent episodes as well.
+
+For example, ServingEnv can be used to implement a simple REST policy `server `__ that learns over time using RLlib. In this example RLlib runs with ``num_workers=0`` to avoid port allocation issues, but in principle this could be scaled by increasing ``num_workers``.
+
+Offline Data
+~~~~~~~~~~~~
+
+ServingEnv also provides a ``self.log_action()`` call to support off-policy actions. This allows the client to make independent decisions, e.g., to compare two different policies, and for RLlib to still learn from those off-policy actions. Note that this requires the algorithm used to support learning from off-policy decisions (e.g., DQN).
+
+The ``log_action`` API of ServingEnv can be used to ingest data from offline logs. The pattern would be as follows: First, some policy is followed to produce experience data which is stored in some offline storage system. Then, RLlib creates a number of workers that use a ServingEnv to read the logs in parallel and ingest the experiences. After a round of training completes, the new policy can be deployed to collect more experiences.
+
+Note that envs can read from different partitions of the logs based on the ``worker_index`` attribute of the `env context `__ passed into the environment constructor.
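+
+One simple way to split logs across workers is a round-robin assignment by index, sketched below. The helper ``partition_for_worker`` is hypothetical, and the sketch assumes zero-based indices in ``range(num_workers)``; if ``worker_index`` is numbered differently in your setup, adjust the offset accordingly.
+
```python
def partition_for_worker(log_files, worker_index, num_workers):
    """Returns the log files a given worker should read (round-robin split).

    Assumes 0 <= worker_index < num_workers; this indexing convention is an
    assumption made for the sketch.
    """
    if not 0 <= worker_index < num_workers:
        raise ValueError("worker_index must be in range(num_workers)")
    return log_files[worker_index::num_workers]


log_files = ["log-0", "log-1", "log-2", "log-3", "log-4"]
first_worker_files = partition_for_worker(log_files, worker_index=0, num_workers=2)
second_worker_files = partition_for_worker(log_files, worker_index=1, num_workers=2)
```
+
+Every file is read by exactly one worker, so no experience is ingested twice.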
+
+Batch Asynchronous
+------------------
+
+The lowest-level "catch-all" environment supported by RLlib is `AsyncVectorEnv `__. AsyncVectorEnv models multiple agents executing asynchronously in multiple environments. A call to ``poll()`` returns observations from ready agents keyed by their environment and agent ids, and actions for those agents can be sent back via ``send_actions()``. This interface can be subclassed directly to support batched simulators such as `ELF `__.
+
+Under the hood, all other envs are converted to AsyncVectorEnv by RLlib so that there is a common internal path for policy evaluation.
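+
+The ``poll()`` / ``send_actions()`` cycle can be illustrated with a toy stand-in. ``ToyAsyncEnv`` is hypothetical and deliberately simplified: here ``poll()`` returns only observations, whereas the real interface may return additional values such as rewards and done flags.
+
```python
class ToyAsyncEnv:
    """Hypothetical stand-in exposing poll() / send_actions() for two envs."""

    def __init__(self):
        # Observations waiting for an action, keyed by env id, then agent id.
        self.pending = {0: {"agent_a": 0.0}, 1: {"agent_b": 1.0}}
        self.received = []

    def poll(self):
        """Returns and clears the observations from agents that are ready."""
        ready, self.pending = self.pending, {}
        return ready

    def send_actions(self, action_dict):
        """Records the actions; a real simulator would advance the agents."""
        self.received.append(action_dict)


env = ToyAsyncEnv()
ready_obs = env.poll()
# Compute one action per ready agent, keeping the env id / agent id keys.
actions = {
    env_id: {agent_id: obs + 1.0 for agent_id, obs in agent_obs.items()}
    for env_id, agent_obs in ready_obs.items()
}
env.send_actions(actions)
```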
diff --git a/doc/source/rllib-envs.svg b/doc/source/rllib-envs.svg
new file mode 100644
index 000000000..37d6d66e6
--- /dev/null
+++ b/doc/source/rllib-envs.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/doc/source/rllib-models.rst b/doc/source/rllib-models.rst
new file mode 100644
index 000000000..4978dd4e2
--- /dev/null
+++ b/doc/source/rllib-models.rst
@@ -0,0 +1,77 @@
+RLlib Models and Preprocessors
+==============================
+
+The following diagram provides a conceptual overview of data flow between different components in RLlib. We start with an ``Environment``, which given an action produces an observation. The observation is preprocessed by a ``Preprocessor`` and ``Filter`` (e.g. for running mean normalization) before being sent to a neural network ``Model``. The model output is in turn interpreted by an ``ActionDistribution`` to determine the next action.
+
+.. image:: rllib-components.svg
+
+The components highlighted in green can be replaced with custom user-defined implementations, as described in the next sections. The purple components are RLlib internal, which means they can only be modified by changing the algorithm source code.
+
+
+Built-in Models and Preprocessors
+---------------------------------
+
+RLlib picks default models based on a simple heuristic: a `vision network `__ for image observations, and a `fully connected network `__ for everything else. These models can be configured via the ``model`` config key, documented in the model `catalog `__. Note that you'll probably have to configure ``conv_filters`` if your environment observations have custom sizes, e.g., ``"model": {"dim": 42, "conv_filters": [[16, [4, 4], 2], [32, [4, 4], 2], [512, [11, 11], 1]]}`` for 42x42 observations.
+
+In addition, if you set ``"model": {"use_lstm": true}``, then the model output will be further processed by an `LSTM cell `__. More generally, RLlib supports the use of recurrent models for its algorithms (A3C, PG out of the box), and RNN support is built into its policy evaluation utilities.
+
+For preprocessors, RLlib tries to pick one of its built-in preprocessors based on the environment's observation space. Discrete observations are one-hot encoded, Atari observations are downscaled, and Tuple observations are flattened (there isn't native tuple support yet, but you can reshape the flattened observation in a custom model). Note that for Atari, DQN defaults to using the `DeepMind preprocessors `__, which are also used by the OpenAI baselines library.
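+
+As a rough sketch of what two of these transformations do, the helpers below one-hot encode a discrete observation and flatten a tuple observation. They are plain-Python illustrations with hypothetical names (``one_hot``, ``flatten_tuple``), not RLlib's preprocessor classes.
+
```python
def one_hot(observation, num_categories):
    """Encodes a discrete observation as a one-hot list of floats."""
    encoded = [0.0] * num_categories
    encoded[observation] = 1.0
    return encoded


def flatten_tuple(observation):
    """Concatenates the components of a tuple observation into one flat list."""
    flat = []
    for component in observation:
        flat.extend(component)
    return flat


discrete = one_hot(2, num_categories=4)
# A tuple observation made of a 2-element vector and a one-hot encoded value.
flat = flatten_tuple(([0.5, 0.25], one_hot(1, num_categories=3)))
```
+
+A custom model that needs the original structure back can split the flat vector at the known component sizes (here, 2 and 3).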
+
+
+Custom Models
+-------------
+
+Custom models should subclass the common RLlib `model class `__ and override the ``_build_layers`` method. This method takes in a tensor input (observation), and returns the output layer (a float vector of the specified output size) followed by a feature layer. The model can then be registered and used in place of a built-in model:
+
+.. code-block:: python
+
+ import ray
+ import ray.rllib.agents.ppo as ppo
+ import tensorflow.contrib.slim as slim
+ from ray.rllib.models import ModelCatalog, Model
+
+ class MyModelClass(Model):
+     def _build_layers(self, inputs, num_outputs, options):
+         layer1 = slim.fully_connected(inputs, 64, ...)
+         layer2 = slim.fully_connected(layer1, 64, ...)
+         ...
+         # Return the output layer first, then the feature layer.
+         return layerN, layerN_minus_1
+
+ ModelCatalog.register_custom_model("my_model", MyModelClass)
+
+ ray.init()
+ agent = ppo.PPOAgent(env="CartPole-v0", config={
+     "model": {
+         "custom_model": "my_model",
+         "custom_options": {},  # extra options to pass to your model
+     },
+ })
+
+For a full example of a custom model in code, see the `Carla RLlib model `__ and associated `training scripts `__. The ``CarlaModel`` class defined there operates over a composite (Tuple) observation space including both images and scalar measurements.
+
+Custom Preprocessors
+--------------------
+
+Similarly, custom preprocessors should subclass the RLlib `preprocessor class `__ and be registered in the model catalog:
+
+.. code-block:: python
+
+ import ray
+ import ray.rllib.agents.ppo as ppo
+ from ray.rllib.models import ModelCatalog
+ from ray.rllib.models.preprocessors import Preprocessor
+
+ class MyPreprocessorClass(Preprocessor):
+     def _init(self):
+         self.shape = ...  # perhaps varies depending on self._options
+
+     def transform(self, observation):
+         return ...  # return the preprocessed observation
+
+ ModelCatalog.register_custom_preprocessor("my_prep", MyPreprocessorClass)
+
+ ray.init()
+ agent = ppo.PPOAgent(env="CartPole-v0", config={
+     "model": {
+         "custom_preprocessor": "my_prep",
+         "custom_options": {},  # extra options to pass to your preprocessor
+     },
+ })
diff --git a/doc/source/rllib-package-ref.rst b/doc/source/rllib-package-ref.rst
new file mode 100644
index 000000000..38a578dbd
--- /dev/null
+++ b/doc/source/rllib-package-ref.rst
@@ -0,0 +1,47 @@
+RLlib Package Reference
+=======================
+
+ray.rllib.agents
+----------------
+
+.. automodule:: ray.rllib.agents
+ :members:
+
+.. autoclass:: ray.rllib.agents.a3c.A3CAgent
+.. autoclass:: ray.rllib.agents.ddpg.ApexDDPGAgent
+.. autoclass:: ray.rllib.agents.ddpg.DDPGAgent
+.. autoclass:: ray.rllib.agents.dqn.ApexAgent
+.. autoclass:: ray.rllib.agents.dqn.DQNAgent
+.. autoclass:: ray.rllib.agents.es.ESAgent
+.. autoclass:: ray.rllib.agents.pg.PGAgent
+.. autoclass:: ray.rllib.agents.ppo.PPOAgent
+
+ray.rllib.env
+-------------
+
+.. automodule:: ray.rllib.env
+ :members:
+
+ray.rllib.evaluation
+--------------------
+
+.. automodule:: ray.rllib.evaluation
+ :members:
+
+ray.rllib.models
+----------------
+
+.. automodule:: ray.rllib.models
+ :members:
+
+ray.rllib.optimizers
+--------------------
+
+.. automodule:: ray.rllib.optimizers
+ :members:
+
+ray.rllib.utils
+---------------
+
+.. automodule:: ray.rllib.utils
+ :members:
diff --git a/doc/source/rllib-stack.svg b/doc/source/rllib-stack.svg
new file mode 100644
index 000000000..c3c18f0be
--- /dev/null
+++ b/doc/source/rllib-stack.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/doc/source/rllib-training.rst b/doc/source/rllib-training.rst
new file mode 100644
index 000000000..0171de62f
--- /dev/null
+++ b/doc/source/rllib-training.rst
@@ -0,0 +1,130 @@
+RLlib Training APIs
+===================
+
+Getting Started
+---------------
+
+At a high level, RLlib provides an ``Agent`` class which
+holds a policy for environment interaction. Through the agent interface, the policy can
+be trained or checkpointed, or an action can be computed.
+
+.. image:: rllib-api.svg
+
+You can train a simple DQN agent with the following command:
+
+.. code-block:: bash
+
+ python ray/python/ray/rllib/train.py --run DQN --env CartPole-v0
+
+By default, the results will be logged to a subdirectory of ``~/ray_results``.
+This subdirectory will contain a file ``params.json`` which contains the
+hyperparameters, a file ``result.json`` which contains a training summary
+for each episode, and a TensorBoard file that can be used to visualize
+the training process by running:
+
+.. code-block:: bash
+
+ tensorboard --logdir=~/ray_results
+
+
+The ``train.py`` script has a number of options you can show by running:
+
+.. code-block:: bash
+
+ python ray/python/ray/rllib/train.py --help
+
+The most important options are for choosing the environment
+with ``--env`` (any OpenAI gym environment including ones registered by the user
+can be used) and for choosing the algorithm with ``--run``
+(available options are ``PPO``, ``PG``, ``A3C``, ``ES``, ``DDPG``, ``DDPG2``, ``DQN``, ``APEX``, and ``APEX_DDPG``).
+
+Specifying Parameters
+~~~~~~~~~~~~~~~~~~~~~
+
+Each algorithm has specific hyperparameters that can be set with ``--config``. See the
+`algorithms documentation `__ for more information.
+
+In the example below, we train A3C with 8 workers, specified through the config flag:
+
+
+.. code-block:: bash
+
+ python ray/python/ray/rllib/train.py --env=PongDeterministic-v4 \
+ --run=A3C --config '{"num_workers": 8}'
+
+Evaluating Trained Agents
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In order to save checkpoints from which to evaluate agents,
+set ``--checkpoint-freq`` (number of training iterations between checkpoints)
+when running ``train.py``.
+
+
+An example of evaluating a previously trained DQN agent is as follows:
+
+.. code-block:: bash
+
+ python ray/python/ray/rllib/rollout.py \
+ ~/ray_results/default/DQN_CartPole-v0_0upjmdgr0/checkpoint-1 \
+ --run DQN --env CartPole-v0
+
+The ``rollout.py`` helper script reconstructs a DQN agent from the checkpoint
+located at ``~/ray_results/default/DQN_CartPole-v0_0upjmdgr0/checkpoint-1``
+and renders its behavior in the environment specified by ``--env``.
+
+Tuned Examples
+--------------
+
+Some good hyperparameters and settings are available in
+`the repository `__
+(some of them are tuned to run on GPUs). If you find better settings or tune
+an algorithm on a different domain, consider submitting a Pull Request!
+
+Python API
+----------
+
+The Python API provides the needed flexibility for applying RLlib to new problems. You will need to use this API if you wish to use custom environments, preprocessors, or models with RLlib.
+
+Here is an example of the basic usage:
+
+.. code-block:: python
+
+ import ray
+ import ray.rllib.agents.ppo as ppo
+
+ ray.init()
+ config = ppo.DEFAULT_CONFIG.copy()
+ agent = ppo.PPOAgent(config=config, env="CartPole-v0")
+
+ # Can optionally call agent.restore(path) to load a checkpoint.
+
+ for i in range(1000):
+     # Perform one iteration of training the policy with PPO
+     result = agent.train()
+     print("result: {}".format(result))
+
+     if i % 100 == 0:
+         checkpoint = agent.save()
+         print("checkpoint saved at", checkpoint)
+
+All RLlib agents implement the tune Trainable API, which means they support incremental training and checkpointing. This enables them to be easily used in experiments with Ray Tune.
+
+Accessing Global State
+~~~~~~~~~~~~~~~~~~~~~~
+It is common to need to access an agent's internal state, e.g., to set or get internal weights. In RLlib an agent's state is replicated across multiple *policy evaluators* (Ray actors) in the cluster. However, you can easily get and update this state between calls to ``train()`` via ``agent.optimizer.foreach_evaluator()`` or ``agent.optimizer.foreach_evaluator_with_index()``. These functions take a lambda function that is applied with the evaluator as an arg. You can also return values from these functions and those will be returned as a list.
+
+You can also access just the "master" copy of the agent state through ``agent.optimizer.local_evaluator``, but note that updates here may not be reflected in remote replicas if you have configured ``num_workers > 0``.
+
+REST API
+--------
+
+In some cases (e.g., when interacting with an external environment) it makes more sense to interact with RLlib as if it were an independently running service, rather than RLlib hosting the simulations itself. This is possible via RLlib's serving env `interface `__.
+
+.. autoclass:: ray.rllib.utils.policy_client.PolicyClient
+ :members:
+
+.. autoclass:: ray.rllib.utils.policy_server.PolicyServer
+ :members:
+
+For a full client / server example that you can run, see the example `client script `__ and also the corresponding `server script `__, here configured to serve a policy for the toy CartPole-v0 environment.
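The control flow of an application that owns its own simulation and only asks a policy service for actions can be sketched as follows. ``StubPolicyClient`` is a hypothetical local stand-in written for this example: it never contacts a server, and its method names (``start_episode``, ``get_action``, ``log_returns``, ``end_episode``) are assumptions about the episode-style interface. Check the ``PolicyClient`` reference above for the actual methods and signatures.

```python
import random


class StubPolicyClient:
    """Hypothetical local stand-in; it does not talk to a policy server.

    It only records what the loop sends, so the control flow can be run
    and inspected without any running service.
    """

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._next_id = 0
        self.returns = {}      # episode_id -> accumulated reward
        self.finished = set()  # episode ids that have been ended

    def start_episode(self):
        episode_id = "episode-{}".format(self._next_id)
        self._next_id += 1
        self.returns[episode_id] = 0.0
        return episode_id

    def get_action(self, episode_id, observation):
        # A real service would query the policy; the stub picks 0 or 1.
        return self._rng.choice([0, 1])

    def log_returns(self, episode_id, reward):
        self.returns[episode_id] += reward

    def end_episode(self, episode_id, observation):
        self.finished.add(episode_id)


def run_episode(client, num_steps):
    """Drive one episode of an externally owned toy simulation."""
    episode_id = client.start_episode()
    observation = 0
    for _ in range(num_steps):
        action = client.get_action(episode_id, observation)
        # The external application applies the action itself.
        observation += 1 if action == 1 else -1
        client.log_returns(episode_id, reward=1.0)
    client.end_episode(episode_id, observation)
    return episode_id


client = StubPolicyClient()
episode_id = run_episode(client, num_steps=5)
print(episode_id, client.returns[episode_id])
```

The point of the sketch is the division of labor: the application steps its own environment and reports rewards, while the service is only responsible for choosing actions and learning from the logged episodes.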
diff --git a/doc/source/rllib.rst b/doc/source/rllib.rst
index 569975135..7316fc127 100644
--- a/doc/source/rllib.rst
+++ b/doc/source/rllib.rst
@@ -1,347 +1,73 @@
-Ray RLlib: Scalable Reinforcement Learning
-==========================================
+RLlib: Scalable Reinforcement Learning
+======================================
-Ray RLlib is an RL execution toolkit built on the Ray distributed execution framework. RLlib implements a collection of distributed *policy optimizers* that make it easy to use a variety of training strategies with existing RL algorithms written in frameworks such as PyTorch, TensorFlow, and Theano.
+RLlib is an open-source library for reinforcement learning that offers both a collection of reference algorithms and scalable primitives for composing new ones.
-You can find the code for RLlib `here on GitHub `__, and the paper `here `__.
+.. image:: rllib-stack.svg
-RLlib's policy optimizers serve as the basis for RLlib's reference algorithms, which include:
-
-- Proximal Policy Optimization (`PPO `__) which is a proximal variant of `TRPO `__.
-
-- Policy Gradients (`PG `__).
-
-- Asynchronous Advantage Actor-Critic (`A3C `__).
-
-- Deep Q Networks (`DQN `__).
-
-- Deep Deterministic Policy Gradients (`DDPG `__).
-
-- Ape-X Distributed Prioritized Experience Replay, including both `DQN `__ and `DDPG `__ variants.
-
-- Evolution Strategies (`ES `__), as described in `this paper `__.
-
-These algorithms can be run on any `OpenAI Gym MDP `__,
-including custom ones written and registered by the user.
-
-.. note::
-
- To use RLlib's policy optimizers outside of RLlib, see the `policy optimizers documentation `__.
+Learn more about RLlib's design by reading the `ICML paper `__.
Installation
------------
-RLlib has extra dependencies on top of **ray**. First, you'll need into install either PyTorch or TensorFlow.
-For usage of PyTorch models, visit the `PyTorch website `__
-for instructions on installing PyTorch.
+RLlib has extra dependencies on top of ``ray``. First, you'll need to install either `PyTorch `__ or `TensorFlow `__. Then, install the Ray RLlib module:
.. code-block:: bash
pip install tensorflow # or tensorflow-gpu
-
-Then, install Ray with extra RLlib dependencies:
-
-.. code-block:: bash
-
- pip install 'ray[rllib]'
+ pip install 'ray[rllib]'  # quoted so shells such as zsh do not expand the brackets
You might also want to clone the Ray repo for convenient access to RLlib helper scripts:
.. code-block:: bash
git clone https://github.com/ray-project/ray
-
-
-
-Getting Started
----------------
-
-At a high level, RLlib provides an ``Agent`` class which
-holds a policy for environment interaction. Through the agent interface, the policy can
-be trained, checkpointed, or an action computed.
-
-.. image:: rllib-api.svg
-
-You can train a simple DQN agent with the following command
-
-.. code-block:: bash
-
- python ray/python/ray/rllib/train.py --run DQN --env CartPole-v0
-
-By default, the results will be logged to a subdirectory of ``~/ray_results``.
-This subdirectory will contain a file ``params.json`` which contains the
-hyperparameters, a file ``result.json`` which contains a training summary
-for each episode and a TensorBoard file that can be used to visualize
-training process with TensorBoard by running
-
-.. code-block:: bash
-
- tensorboard --logdir=~/ray_results
-
-
-The ``train.py`` script has a number of options you can show by running
-
-.. code-block:: bash
-
- python ray/python/ray/rllib/train.py --help
-
-The most important options are for choosing the environment
-with ``--env`` (any OpenAI gym environment including ones registered by the user
-can be used) and for choosing the algorithm with ``--run``
-(available options are ``PPO``, ``PG``, ``A3C``, ``ES``, ``DDPG``, ``DDPG2``, ``DQN``, ``APEX``, and ``APEX_DDPG``).
-
-Specifying Parameters
-~~~~~~~~~~~~~~~~~~~~~
-
-Each algorithm has specific hyperparameters that can be set with ``--config`` - see the
-``DEFAULT_CONFIG`` variable in
-`PPO `__,
-`PG `__,
-`A3C `__,
-`ES `__,
-`DQN `__,
-`DDPG `__,
-`DDPG2 `__,
-`APEX `__, and
-`APEX_DDPG `__.
-
-In an example below, we train A3C by specifying 8 workers through the config flag.
-function that creates the env to refer to it by name. The contents of the env_config agent config field will be passed to that function to allow the environment to be configured. The return type should be an OpenAI gym.Env. For example:
-
-
-.. code-block:: bash
-
- python ray/python/ray/rllib/train.py --env=PongDeterministic-v4 \
- --run=A3C --config '{"num_workers": 8}'
-
-Evaluating Trained Agents
-~~~~~~~~~~~~~~~~~~~~~~~~~
-
-In order to save checkpoints from which to evaluate agents,
-set ``--checkpoint-freq`` (number of training iterations between checkpoints)
-when running ``train.py``.
-
-
-An example of evaluating a previously trained DQN agent is as follows:
-
-.. code-block:: bash
-
- python ray/python/ray/rllib/rollout.py \
- ~/ray_results/default/DQN_CartPole-v0_0upjmdgr0/checkpoint-1 \
- --run DQN --env CartPole-v0
-
-
-The ``rollout.py`` helper script reconstructs a DQN agent from the checkpoint
-located at ``~/ray_results/default/DQN_CartPole-v0_0upjmdgr0/checkpoint-1``
-and renders its behavior in the environment specified by ``--env``.
-
-Tuned Examples
---------------
-
-Some good hyperparameters and settings are available in
-`the repository `__
-(some of them are tuned to run on GPUs). If you find better settings or tune
-an algorithm on a different domain, consider submitting a Pull Request!
-
-Python User API
----------------
-
-The Python API provides the needed flexibility for applying RLlib to new problems. You will need to use this API if you wish to use custom environments, preprocesors, or models with RLlib.
-
-Here is an example of the basic usage:
-
-.. code-block:: python
-
- import ray
- import ray.rllib.ppo as ppo
-
- ray.init()
- config = ppo.DEFAULT_CONFIG.copy()
- agent = ppo.PPOAgent(config=config, env="CartPole-v0")
-
- # Can optionally call agent.restore(path) to load a checkpoint.
-
- for i in range(1000):
- # Perform one iteration of training the policy with PPO
- result = agent.train()
- print("result: {}".format(result))
-
- if i % 100 == 0:
- checkpoint = agent.save()
- print("checkpoint saved at", checkpoint)
-
-Components: User-customizable and Internal
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The following diagram provides a conceptual overview of data flow between different components in RLlib. We start with an ``Environment``, which given an action produces an observation. The observation is preprocessed by a ``Preprocessor`` and ``Filter`` (e.g. for running mean normalization) before being sent to a neural network ``Model``. The model output is in turn interpreted by an ``ActionDistribution`` to determine the next action.
-
-.. image:: rllib-components.svg
-
-The components highlighted in green above are *User-customizable*, which means RLlib provides APIs for swapping in user-defined implementations, as described in the next sections. The purple components are *RLlib internal*, which means they currently can only be modified by changing the RLlib source code.
-
-For more information about these components, also see the `RLlib Developer Guide `__.
-
-Custom Environments
-~~~~~~~~~~~~~~~~~~~
-
-To train against a custom environment, i.e. one not in the gym catalog, you
-can register a function that creates the env to refer to it by name. The contents of the
-``env_config`` agent config field will be passed to that function to allow the
-environment to be configured. The return type should be an `OpenAI gym.Env `__. For example:
-
-.. code-block:: python
-
- import ray
- from ray.tune.registry import register_env
- from ray.rllib import ppo
-
- def env_creator(env_config):
- import gym
- return gym.make("CartPole-v0") # or return your own custom env
-
- env_creator_name = "custom_env"
- register_env(env_creator_name, env_creator)
-
- ray.init()
- agent = ppo.PPOAgent(env=env_creator_name, config={
- "env_config": {}, # config to pass to env creator
- })
-
-For a code example of a custom env, see the `SimpleCorridor example `__. For a more complex example, also see the `Carla RLlib env `__.
-
-Custom Preprocessors and Models
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-RLlib includes default preprocessors and models for common gym
-environments, but you can also specify your own as follows. At a high level, your neural
-network model needs to take an input tensor of the preprocessed observation shape and
-output a vector of the size specified in the constructor. The interfaces for
-these custom classes can be found in the
-`RLlib Developer Guide `__.
-
-.. code-block:: python
-
- import ray
- from ray.rllib.models import ModelCatalog, Model
- from ray.rllib.models.preprocessors import Preprocessor
-
- class MyPreprocessorClass(Preprocessor):
- def _init(self):
- self.shape = ...
-
- def transform(self, observation):
- return ...
-
- class MyModelClass(Model):
- def _init(self, inputs, num_outputs, options):
- layer1 = slim.fully_connected(inputs, 64, ...)
- layer2 = slim.fully_connected(inputs, 64, ...)
- ...
- return layerN, layerN_minus_1
-
- ModelCatalog.register_custom_preprocessor("my_prep", MyPreprocessorClass)
- ModelCatalog.register_custom_model("my_model", MyModelClass)
-
- ray.init()
- agent = ppo.PPOAgent(env="CartPole-v0", config={
- "model": {
- "custom_preprocessor": "my_prep",
- "custom_model": "my_model",
- "custom_options": {}, # extra options to pass to your classes
- },
- })
-
-For a full example of a custom model in code, see the `Carla RLlib model `__ and associated `training scripts `__. The ``CarlaModel`` class defined there operates over a composite (Tuple) observation space including both images and scalar measurements.
-
-Multi-Agent Models
-~~~~~~~~~~~~~~~~~~
-RLlib supports multi-agent training with PPO. Currently it supports both
-shared, i.e. all agents have the same model, and non-shared multi-agent models. However, it only supports shared
-rewards and does not yet support individual rewards for each agent.
-
-
-While Generalized Advantage Estimation is supported in multiagent scenarios,
-it is assumed that it possible for the estimator to access the observations of
-all of the agents.
-
-
-Important config parameters are described below
-
-.. code-block:: python
-
- config["model"].update({"fcnet_hiddens": [256, 256]}) # dimension of value function
- options = {"multiagent_obs_shapes": [3, 3], # length of each observation space
- "multiagent_act_shapes": [1, 1], # length of each action space
- "multiagent_shared_model": True, # whether the model should be shared
- # list of dimensions of multiagent feedforward nets
- "multiagent_fcnet_hiddens": [[32, 32]] * 2}
- config["model"].update({"custom_options": options})
-
-For a full example of a multiagent model in code, see the
-`MultiAgent Pendulum `__.
-The ``MultiAgentPendulumEnv`` defined there operates
-over a composite (Tuple) enclosing a list of Boxes; each Box represents the
-observation of an agent. The action space is a list of Discrete actions, each
-element corresponding to half of the total torque. The environment will return a list of actions
-that can be iterated over and applied to each agent.
-
-External Data API
-~~~~~~~~~~~~~~~~~
-*coming soon!*
-
-
-Using RLlib with Ray Tune
--------------------------
-
-All Agents implemented in RLlib support the
-`tune Trainable `__ interface.
-
-Here is an example of using the command-line interface with RLlib:
-
-.. code-block:: bash
-
- python ray/python/ray/rllib/train.py -f tuned_examples/cartpole-grid-search-example.yaml
-
-Here is an example using the Python API. The same config passed to ``Agents`` may be placed
-in the ``config`` section of the experiments. RLlib agents automatically declare their
-resources requirements (e.g., based on ``num_workers``) to Tune, so you don't have to.
-
-.. code-block:: python
-
- import ray
- from ray.tune.tune import run_experiments
- from ray.tune.variant_generator import grid_search
-
-
- experiment = {
- 'cartpole-ppo': {
- 'run': 'PPO',
- 'env': 'CartPole-v0',
- 'stop': {
- 'episode_reward_mean': 200,
- 'time_total_s': 180
- },
- 'config': {
- 'num_sgd_iter': grid_search([1, 4]),
- 'num_workers': 2,
- 'sgd_batchsize': grid_search([128, 256, 512])
- }
- },
- # put additional experiments to run concurrently here
- }
-
- ray.init()
- run_experiments(experiment)
-
-For an advanced example of using Population Based Training (PBT) with RLlib,
-see the `PPO + PBT Walker2D training example `__.
-
-Using Policy Optimizers outside of RLlib
-----------------------------------------
-
-See the `RLlib policy optimizers documentation `__.
-
-Contributing to RLlib
----------------------
-
-See the `RLlib Developer Guide `__.
+ cd ray/python/ray/rllib
+
+Training APIs
+-------------
+* `Command-line `__
+* `Python API `__
+* `REST API `__
+
+Environments
+------------
+* `RLlib Environments Overview `__
+* `OpenAI Gym `__
+* `Vectorized (Batch) `__
+* `Multi-Agent `__
+* `Serving (Agent-oriented) `__
+* `Offline Data Ingest `__
+* `Batch Asynchronous `__
+
+Algorithms
+----------
+* `Ape-X Distributed Prioritized Experience Replay `__
+* `Asynchronous Advantage Actor-Critic `__
+* `Deep Deterministic Policy Gradients `__
+* `Deep Q Networks `__
+* `Evolution Strategies `__
+* `Policy Gradients `__
+* `Proximal Policy Optimization `__
+
+Models and Preprocessors
+------------------------
+* `RLlib Models and Preprocessors Overview `__
+* `Built-in Models and Preprocessors `__
+* `Custom Models `__
+* `Custom Preprocessors `__
+
+RL Building Blocks
+------------------
+* Policy Models, Losses, Postprocessing
+* Policy Evaluation
+* Policy Optimization
+
+Package Reference
+-----------------
+* `ray.rllib.agents `__
+* `ray.rllib.env `__
+* `ray.rllib.evaluation `__
+* `ray.rllib.models `__
+* `ray.rllib.optimizers `__
+* `ray.rllib.utils `__
diff --git a/doc/source/throughput.png b/doc/source/throughput.png
new file mode 100644
index 000000000..3bde99bed
Binary files /dev/null and b/doc/source/throughput.png differ
diff --git a/python/ray/experimental/internal_kv.py b/python/ray/experimental/internal_kv.py
index 573669d7d..85476a7c3 100644
--- a/python/ray/experimental/internal_kv.py
+++ b/python/ray/experimental/internal_kv.py
@@ -22,7 +22,7 @@ def _internal_kv_put(key, value, overwrite=False):
This only has an effect if the key does not already have a value.
- Returns
+ Returns:
already_exists (bool): whether the value already exists.
"""
diff --git a/python/ray/rllib/README.rst b/python/ray/rllib/README.rst
index 2e9833533..32571cf14 100644
--- a/python/ray/rllib/README.rst
+++ b/python/ray/rllib/README.rst
@@ -1,22 +1,25 @@
-Ray RLlib: Scalable Reinforcement Learning
-==========================================
+RLlib: Scalable Reinforcement Learning
+======================================
-Ray RLlib is an RL execution toolkit built on the Ray distributed execution framework. See the `user documentation `__ and `paper `__.
+RLlib is an open-source library for reinforcement learning that offers both a collection of reference algorithms and scalable primitives for composing new ones.
-RLlib includes the following reference algorithms:
+For an overview of RLlib, see the `documentation `__.
-- Proximal Policy Optimization (`PPO `__) which is a proximal variant of `TRPO `__.
+If you've found RLlib useful for your research, you can cite the `paper `__ as follows:
-- Policy Gradients (`PG `__).
-
-- Asynchronous Advantage Actor-Critic (`A3C `__).
-
-- Deep Q Networks (`DQN `__).
-
-- Deep Deterministic Policy Gradients (`DDPG `__).
-
-- Ape-X Distributed Prioritized Experience Replay, including both `DQN `__ and `DDPG `__ variants.
-
-- Evolution Strategies (`ES `__), as described in `this paper `__.
-
-These algorithms can be run on any OpenAI Gym MDP, including custom ones written and registered by the user.
+.. code-block:: bibtex
+
+    @inproceedings{liang2018rllib,
+        Author = {Eric Liang and
+                  Richard Liaw and
+                  Robert Nishihara and
+                  Philipp Moritz and
+                  Roy Fox and
+                  Ken Goldberg and
+                  Joseph E. Gonzalez and
+                  Michael I. Jordan and
+                  Ion Stoica},
+        Title = {{RLlib}: Abstractions for Distributed Reinforcement Learning},
+        Booktitle = {International Conference on Machine Learning ({ICML})},
+        Year = {2018}
+    }
diff --git a/python/ray/rllib/__init__.py b/python/ray/rllib/__init__.py
index 58aa97f19..ee09c4579 100644
--- a/python/ray/rllib/__init__.py
+++ b/python/ray/rllib/__init__.py
@@ -6,20 +6,21 @@ from __future__ import print_function
# This file is imported from the tune module in order to register RLlib agents.
from ray.tune.registry import register_trainable
-from ray.rllib.utils.policy_graph import PolicyGraph
-from ray.rllib.utils.tf_policy_graph import TFPolicyGraph
-from ray.rllib.utils.common_policy_evaluator import CommonPolicyEvaluator
-from ray.rllib.utils.async_vector_env import AsyncVectorEnv
-from ray.rllib.utils.vector_env import VectorEnv
-from ray.rllib.utils.serving_env import ServingEnv
-from ray.rllib.optimizers.sample_batch import SampleBatch
+from ray.rllib.evaluation.policy_graph import PolicyGraph
+from ray.rllib.evaluation.tf_policy_graph import TFPolicyGraph
+from ray.rllib.env.async_vector_env import AsyncVectorEnv
+from ray.rllib.env.multi_agent_env import MultiAgentEnv
+from ray.rllib.env.vector_env import VectorEnv
+from ray.rllib.env.serving_env import ServingEnv
+from ray.rllib.evaluation.common_policy_evaluator import CommonPolicyEvaluator
+from ray.rllib.evaluation.sample_batch import SampleBatch
def _register_all():
for key in ["PPO", "ES", "DQN", "APEX", "A3C", "BC", "PG", "DDPG",
"APEX_DDPG", "__fake", "__sigmoid_fake_data",
"__parameter_tuning"]:
- from ray.rllib.agent import get_agent_class
+ from ray.rllib.agents.agent import get_agent_class
register_trainable(key, get_agent_class(key))
@@ -27,5 +28,5 @@ _register_all()
__all__ = [
"PolicyGraph", "TFPolicyGraph", "CommonPolicyEvaluator", "SampleBatch",
- "AsyncVectorEnv", "VectorEnv", "ServingEnv",
+ "AsyncVectorEnv", "MultiAgentEnv", "VectorEnv", "ServingEnv",
]
diff --git a/python/ray/rllib/a3c/__init__.py b/python/ray/rllib/a3c/__init__.py
deleted file mode 100644
index 2d9aaede4..000000000
--- a/python/ray/rllib/a3c/__init__.py
+++ /dev/null
@@ -1,3 +0,0 @@
-from ray.rllib.a3c.a3c import A3CAgent, DEFAULT_CONFIG
-
-__all__ = ["A3CAgent", "DEFAULT_CONFIG"]
diff --git a/python/ray/rllib/agents/__init__.py b/python/ray/rllib/agents/__init__.py
new file mode 100644
index 000000000..da8494a2b
--- /dev/null
+++ b/python/ray/rllib/agents/__init__.py
@@ -0,0 +1,3 @@
+from ray.rllib.agents.agent import Agent, with_common_config
+
+__all__ = ["Agent", "with_common_config"]
diff --git a/python/ray/rllib/agents/a3c/__init__.py b/python/ray/rllib/agents/a3c/__init__.py
new file mode 100644
index 000000000..e4ab31764
--- /dev/null
+++ b/python/ray/rllib/agents/a3c/__init__.py
@@ -0,0 +1,3 @@
+from ray.rllib.agents.a3c.a3c import A3CAgent, DEFAULT_CONFIG
+
+__all__ = ["A3CAgent", "DEFAULT_CONFIG"]
diff --git a/python/ray/rllib/a3c/a3c.py b/python/ray/rllib/agents/a3c/a3c.py
similarity index 53%
rename from python/ray/rllib/a3c/a3c.py
rename to python/ray/rllib/agents/a3c/a3c.py
index e50a1a04c..f37af7c42 100644
--- a/python/ray/rllib/a3c/a3c.py
+++ b/python/ray/rllib/agents/a3c/a3c.py
@@ -6,26 +6,17 @@ import pickle
import os
import ray
-from ray.rllib.agent import Agent
+from ray.rllib.agents.agent import Agent, with_common_config
from ray.rllib.optimizers import AsyncGradientsOptimizer
from ray.rllib.utils import FilterManager
-from ray.rllib.utils.common_policy_evaluator import CommonPolicyEvaluator, \
- collect_metrics
+from ray.rllib.evaluation.metrics import collect_metrics
from ray.tune.trial import Resources
-DEFAULT_CONFIG = {
- # Number of workers (excluding master)
- "num_workers": 2,
- # Number of environments to evaluate vectorwise per worker.
- "num_envs": 1,
+DEFAULT_CONFIG = with_common_config({
# Size of rollout batch
- "batch_size": 10,
+ "sample_batch_size": 10,
# Use PyTorch as backend - no LSTM support
"use_pytorch": False,
- # Which observation filter to apply to the observation
- "observation_filter": "NoFilter",
- # Discount factor of MDP
- "gamma": 0.99,
# GAE(gamma) parameter
"lambda": 1.0,
# Max global norm for each gradient calculated by worker
@@ -40,6 +31,8 @@ DEFAULT_CONFIG = {
"use_gpu_for_workers": False,
# Whether to emit extra summary stats
"summarize": False,
+ # Workers sample async
+ "sample_async": True,
# Model and preprocessor options
"model": {
# Use LSTM model. Requires TF.
@@ -55,23 +48,25 @@ DEFAULT_CONFIG = {
# (Image statespace) - Converts image shape to (C, dim, dim)
"channel_major": False,
},
+ # Configure TF for single-process operation
+ "tf_session_args": {
+ "intra_op_parallelism_threads": 1,
+ "inter_op_parallelism_threads": 1,
+ "gpu_options": {
+ "allow_growth": True,
+ },
+ },
# Arguments to pass to the rllib optimizer
"optimizer": {
# Number of gradients applied for each `train` step
"grads_per_step": 100,
},
- # Arguments to pass to the env creator
- "env_config": {},
-
- # === Multiagent ===
- "multiagent": {
- "policy_graphs": {},
- "policy_mapping_fn": None,
- },
-}
+})
class A3CAgent(Agent):
+ """A3C implementations in TensorFlow and PyTorch."""
+
_agent_name = "A3C"
_default_config = DEFAULT_CONFIG
@@ -86,51 +81,18 @@ class A3CAgent(Agent):
def _init(self):
if self.config["use_pytorch"]:
- from ray.rllib.a3c.a3c_torch_policy import A3CTorchPolicyGraph
- self.policy_cls = A3CTorchPolicyGraph
+ from ray.rllib.agents.a3c.a3c_torch_policy import \
+ A3CTorchPolicyGraph
+ policy_cls = A3CTorchPolicyGraph
else:
- from ray.rllib.a3c.a3c_tf_policy import A3CPolicyGraph
- self.policy_cls = A3CPolicyGraph
-
- if self.config["use_pytorch"]:
- session_creator = None
- else:
- import tensorflow as tf
-
- def session_creator():
- return tf.Session(
- config=tf.ConfigProto(
- intra_op_parallelism_threads=1,
- inter_op_parallelism_threads=1,
- gpu_options=tf.GPUOptions(allow_growth=True)))
-
- remote_cls = CommonPolicyEvaluator.as_remote(
- num_gpus=1 if self.config["use_gpu_for_workers"] else 0)
- self.local_evaluator = CommonPolicyEvaluator(
- self.env_creator,
- self.config["multiagent"]["policy_graphs"] or self.policy_cls,
- policy_mapping_fn=self.config["multiagent"]["policy_mapping_fn"],
- batch_steps=self.config["batch_size"],
- batch_mode="truncate_episodes",
- tf_session_creator=session_creator,
- env_config=self.config["env_config"],
- model_config=self.config["model"], policy_config=self.config,
- num_envs=self.config["num_envs"])
- self.remote_evaluators = [
- remote_cls.remote(
- self.env_creator,
- self.config["multiagent"]["policy_graphs"] or self.policy_cls,
- policy_mapping_fn=(
- self.config["multiagent"]["policy_mapping_fn"]),
- batch_steps=self.config["batch_size"],
- batch_mode="truncate_episodes", sample_async=True,
- tf_session_creator=session_creator,
- env_config=self.config["env_config"],
- model_config=self.config["model"], policy_config=self.config,
- num_envs=self.config["num_envs"],
- worker_index=i+1)
- for i in range(self.config["num_workers"])]
+ from ray.rllib.agents.a3c.a3c_tf_policy import A3CPolicyGraph
+ policy_cls = A3CPolicyGraph
+ self.local_evaluator = self.make_local_evaluator(
+ self.env_creator, policy_cls)
+ self.remote_evaluators = self.make_remote_evaluators(
+ self.env_creator, policy_cls, self.config["num_workers"],
+ {"num_gpus": 1 if self.config["use_gpu_for_workers"] else 0})
self.optimizer = AsyncGradientsOptimizer(
self.config["optimizer"], self.local_evaluator,
self.remote_evaluators)
@@ -168,12 +130,3 @@ class A3CAgent(Agent):
for a, o in zip(self.remote_evaluators, extra_data["remote_state"])
])
self.local_evaluator.restore(extra_data["local_state"])
-
- def compute_action(self, observation, state=None):
- if state is None:
- state = []
- obs = self.local_evaluator.filters["default"](
- observation, update=False)
- return self.local_evaluator.for_policy(
- lambda p: p.compute_single_action(
- obs, state, is_training=False)[0])
diff --git a/python/ray/rllib/a3c/a3c_tf_policy.py b/python/ray/rllib/agents/a3c/a3c_tf_policy.py
similarity index 96%
rename from python/ray/rllib/a3c/a3c_tf_policy.py
rename to python/ray/rllib/agents/a3c/a3c_tf_policy.py
index 9657d9b05..706f6824e 100644
--- a/python/ray/rllib/a3c/a3c_tf_policy.py
+++ b/python/ray/rllib/agents/a3c/a3c_tf_policy.py
@@ -7,8 +7,8 @@ import gym
import ray
from ray.rllib.utils.error import UnsupportedSpaceException
-from ray.rllib.utils.postprocessing import compute_advantages
-from ray.rllib.utils.tf_policy_graph import TFPolicyGraph
+from ray.rllib.evaluation.postprocessing import compute_advantages
+from ray.rllib.evaluation.tf_policy_graph import TFPolicyGraph
from ray.rllib.models.misc import linear, normc_initializer
from ray.rllib.models.catalog import ModelCatalog
@@ -32,7 +32,7 @@ class A3CLoss(object):
class A3CPolicyGraph(TFPolicyGraph):
def __init__(self, observation_space, action_space, config):
- config = dict(ray.rllib.a3c.a3c.DEFAULT_CONFIG, **config)
+ config = dict(ray.rllib.agents.a3c.a3c.DEFAULT_CONFIG, **config)
self.config = config
self.sess = tf.get_default_session()
diff --git a/python/ray/rllib/a3c/a3c_torch_policy.py b/python/ray/rllib/agents/a3c/a3c_torch_policy.py
similarity index 92%
rename from python/ray/rllib/a3c/a3c_torch_policy.py
rename to python/ray/rllib/agents/a3c/a3c_torch_policy.py
index 3813f1e20..a277de945 100644
--- a/python/ray/rllib/a3c/a3c_torch_policy.py
+++ b/python/ray/rllib/agents/a3c/a3c_torch_policy.py
@@ -9,8 +9,8 @@ from torch import nn
import ray
from ray.rllib.models.pytorch.misc import var_to_np
from ray.rllib.models.catalog import ModelCatalog
-from ray.rllib.utils.postprocessing import compute_advantages
-from ray.rllib.utils.torch_policy_graph import TorchPolicyGraph
+from ray.rllib.evaluation.postprocessing import compute_advantages
+from ray.rllib.evaluation.torch_policy_graph import TorchPolicyGraph
class A3CLoss(nn.Module):
@@ -40,7 +40,7 @@ class A3CTorchPolicyGraph(TorchPolicyGraph):
"""A simple, non-recurrent PyTorch policy example."""
def __init__(self, obs_space, action_space, config):
- config = dict(ray.rllib.a3c.a3c.DEFAULT_CONFIG, **config)
+ config = dict(ray.rllib.agents.a3c.a3c.DEFAULT_CONFIG, **config)
self.config = config
_, self.logit_dim = ModelCatalog.get_action_dist(
action_space, self.config["model"])
diff --git a/python/ray/rllib/agent.py b/python/ray/rllib/agents/agent.py
similarity index 67%
rename from python/ray/rllib/agent.py
rename to python/ray/rllib/agents/agent.py
index 195d76a9b..9739d1f64 100644
--- a/python/ray/rllib/agent.py
+++ b/python/ray/rllib/agents/agent.py
@@ -2,19 +2,62 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
-import logging
-import numpy as np
+import copy
import json
+import numpy as np
import os
import pickle
import tensorflow as tf
+from ray.rllib.evaluation.common_policy_evaluator import CommonPolicyEvaluator
from ray.tune.registry import ENV_CREATOR, _global_registry
from ray.tune.result import TrainingResult
from ray.tune.trainable import Trainable
-logger = logging.getLogger(__name__)
-logger.setLevel(logging.INFO)
+COMMON_CONFIG = {
+ # Discount factor of the MDP
+ "gamma": 0.99,
+ # Number of steps after which the rollout gets cut
+ "horizon": None,
+ # Number of environments to evaluate vectorwise per worker.
+ "num_envs": 1,
+ # Number of actors used for parallelism
+ "num_workers": 2,
+ # Default sample batch size
+ "sample_batch_size": 200,
+ # Whether to rollout "complete_episodes" or "truncate_episodes"
+ "batch_mode": "truncate_episodes",
+ # Whether to use a background thread for sampling (slightly off-policy)
+ "sample_async": False,
+ # Which observation filter to apply to the observation
+ "observation_filter": "NoFilter",
+ # Whether to use rllib or deepmind preprocessors
+ "preprocessor_pref": "rllib",
+ # Arguments to pass to the env creator
+ "env_config": {},
+ # Arguments to pass to model
+ "model": {},
+ # Arguments to pass to the rllib optimizer
+ "optimizer": {},
+ # Override default TF session args if non-empty
+ "tf_session_args": {},
+ # Whether to LZ4 compress observations
+ "compress_observations": False,
+
+ # === Multiagent ===
+ "multiagent": {
+ "policy_graphs": {},
+ "policy_mapping_fn": None,
+ },
+}
+
+
+def with_common_config(extra_config):
+ """Returns the given config dict merged with common agent confs."""
+
+ config = copy.deepcopy(COMMON_CONFIG)
+ config.update(extra_config)
+ return config
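Because ``with_common_config`` calls ``dict.update`` on a deep copy of the defaults, the merge is shallow: top-level keys are overridden one by one, but a nested dict supplied by the caller replaces the default nested dict wholesale. The sketch below reproduces the helper with a trimmed, illustrative subset of ``COMMON_CONFIG`` so that it runs on its own; the override values are made up for the example.

```python
import copy

# Trimmed copy of the common defaults, for illustration only.
COMMON_CONFIG = {
    "gamma": 0.99,
    "num_workers": 2,
    "sample_batch_size": 200,
    "optimizer": {},
    "multiagent": {
        "policy_graphs": {},
        "policy_mapping_fn": None,
    },
}


def with_common_config(extra_config):
    """Returns the given config dict merged with common agent confs."""
    config = copy.deepcopy(COMMON_CONFIG)
    config.update(extra_config)
    return config


# Top-level keys are overridden; untouched keys keep their defaults.
merged = with_common_config({
    "sample_batch_size": 10,
    "optimizer": {"grads_per_step": 100},
})

# A nested dict replaces the default nested dict as a whole, so keys that
# the caller leaves out of "multiagent" are dropped rather than inherited.
override = with_common_config({
    "multiagent": {"policy_mapping_fn": str},
})

# The deep copy keeps later mutations away from the shared defaults.
merged["multiagent"]["policy_graphs"]["example"] = "placeholder"

print(merged["sample_batch_size"], merged["gamma"])
print(sorted(override["multiagent"]))
print(COMMON_CONFIG["multiagent"]["policy_graphs"])
```

An algorithm that overrides a nested section therefore needs to spell out every key it wants in that section.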
def _deep_update(original, new_dict, new_keys_allowed, whitelist):
@@ -62,6 +105,47 @@ class Agent(Trainable):
_allow_unknown_subkeys = [
"tf_session_args", "env_config", "model", "optimizer", "multiagent"]
+ def make_local_evaluator(self, env_creator, policy_graph):
+ """Convenience method to return configured local evaluator."""
+
+ return self._make_evaluator(
+ CommonPolicyEvaluator, env_creator, policy_graph, 0)
+
+ def make_remote_evaluators(
+ self, env_creator, policy_graph, count, remote_args):
+ """Convenience method to return a number of remote evaluators."""
+
+ cls = CommonPolicyEvaluator.as_remote(**remote_args).remote
+ return [
+ self._make_evaluator(cls, env_creator, policy_graph, i+1)
+ for i in range(count)]
+
+ def _make_evaluator(self, cls, env_creator, policy_graph, worker_index):
+ config = self.config
+
+ def session_creator():
+ return tf.Session(
+ config=tf.ConfigProto(**config["tf_session_args"]))
+
+ return cls(
+ env_creator,
+ self.config["multiagent"]["policy_graphs"] or policy_graph,
+ policy_mapping_fn=self.config["multiagent"]["policy_mapping_fn"],
+ tf_session_creator=(
+ session_creator if config["tf_session_args"] else None),
+ batch_steps=config["sample_batch_size"],
+ batch_mode=config["batch_mode"],
+ episode_horizon=config["horizon"],
+ preprocessor_pref=config["preprocessor_pref"],
+ sample_async=config["sample_async"],
+ compress_observations=config["compress_observations"],
+ num_envs=config["num_envs"],
+ observation_filter=config["observation_filter"],
+ env_config=config["env_config"],
+ model_config=config["model"],
+ policy_config=config,
+ worker_index=worker_index)
+
@classmethod
def resource_help(cls, config):
return (
@@ -116,11 +200,6 @@ class Agent(Trainable):
raise NotImplementedError
- def compute_action(self, observation):
- """Computes an action using the current trained policy."""
-
- raise NotImplementedError
-
@property
def iteration(self):
"""Current training iter, auto-incremented with each train() call."""
@@ -139,6 +218,17 @@ class Agent(Trainable):
raise NotImplementedError
+ def compute_action(self, observation, state=None):
+ """Computes an action using the current trained policy."""
+
+ if state is None:
+ state = []
+ obs = self.local_evaluator.filters["default"](
+ observation, update=False)
+ return self.local_evaluator.for_policy(
+ lambda p: p.compute_single_action(
+ obs, state, is_training=False)[0])
+
class _MockAgent(Agent):
"""Mock agent for use in tests"""
@@ -228,31 +318,31 @@ def get_agent_class(alg):
"""Returns the class of a known agent given its name."""
if alg == "DDPG":
- from ray.rllib import ddpg
+ from ray.rllib.agents import ddpg
return ddpg.DDPGAgent
elif alg == "APEX_DDPG":
- from ray.rllib import ddpg
+ from ray.rllib.agents import ddpg
return ddpg.ApexDDPGAgent
elif alg == "PPO":
- from ray.rllib import ppo
+ from ray.rllib.agents import ppo
return ppo.PPOAgent
elif alg == "ES":
- from ray.rllib import es
+ from ray.rllib.agents import es
return es.ESAgent
elif alg == "DQN":
- from ray.rllib import dqn
+ from ray.rllib.agents import dqn
return dqn.DQNAgent
elif alg == "APEX":
- from ray.rllib import dqn
+ from ray.rllib.agents import dqn
return dqn.ApexAgent
elif alg == "A3C":
- from ray.rllib import a3c
+ from ray.rllib.agents import a3c
return a3c.A3CAgent
elif alg == "BC":
- from ray.rllib import bc
+ from ray.rllib.agents import bc
return bc.BCAgent
elif alg == "PG":
- from ray.rllib import pg
+ from ray.rllib.agents import pg
return pg.PGAgent
elif alg == "script":
from ray.tune import script_runner
diff --git a/python/ray/rllib/agents/bc/__init__.py b/python/ray/rllib/agents/bc/__init__.py
new file mode 100644
index 000000000..eb0f8dc2d
--- /dev/null
+++ b/python/ray/rllib/agents/bc/__init__.py
@@ -0,0 +1,3 @@
+from ray.rllib.agents.bc.bc import BCAgent, DEFAULT_CONFIG
+
+__all__ = ["BCAgent", "DEFAULT_CONFIG"]
diff --git a/python/ray/rllib/bc/bc.py b/python/ray/rllib/agents/bc/bc.py
similarity index 95%
rename from python/ray/rllib/bc/bc.py
rename to python/ray/rllib/agents/bc/bc.py
index 1cc05e599..8dee9f6e9 100644
--- a/python/ray/rllib/bc/bc.py
+++ b/python/ray/rllib/agents/bc/bc.py
@@ -3,9 +3,9 @@ from __future__ import division
from __future__ import print_function
import ray
-from ray.rllib.agent import Agent
-from ray.rllib.bc.bc_evaluator import BCEvaluator, GPURemoteBCEvaluator, \
- RemoteBCEvaluator
+from ray.rllib.agents.agent import Agent
+from ray.rllib.agents.bc.bc_evaluator import BCEvaluator, \
+ GPURemoteBCEvaluator, RemoteBCEvaluator
from ray.rllib.optimizers import AsyncGradientsOptimizer
from ray.tune.result import TrainingResult
from ray.tune.trial import Resources
diff --git a/python/ray/rllib/bc/bc_evaluator.py b/python/ray/rllib/agents/bc/bc_evaluator.py
similarity index 90%
rename from python/ray/rllib/bc/bc_evaluator.py
rename to python/ray/rllib/agents/bc/bc_evaluator.py
index 87a7d4976..a856858c9 100644
--- a/python/ray/rllib/bc/bc_evaluator.py
+++ b/python/ray/rllib/agents/bc/bc_evaluator.py
@@ -6,10 +6,10 @@ import pickle
from six.moves import queue
import ray
-from ray.rllib.bc.experience_dataset import ExperienceDataset
-from ray.rllib.bc.policy import BCPolicy
+from ray.rllib.agents.bc.experience_dataset import ExperienceDataset
+from ray.rllib.agents.bc.policy import BCPolicy
+from ray.rllib.evaluation.interface import PolicyEvaluator
from ray.rllib.models import ModelCatalog
-from ray.rllib.optimizers import PolicyEvaluator
class BCEvaluator(PolicyEvaluator):
diff --git a/python/ray/rllib/bc/experience_dataset.py b/python/ray/rllib/agents/bc/experience_dataset.py
similarity index 100%
rename from python/ray/rllib/bc/experience_dataset.py
rename to python/ray/rllib/agents/bc/experience_dataset.py
diff --git a/python/ray/rllib/bc/policy.py b/python/ray/rllib/agents/bc/policy.py
similarity index 100%
rename from python/ray/rllib/bc/policy.py
rename to python/ray/rllib/agents/bc/policy.py
diff --git a/python/ray/rllib/ddpg/README.md b/python/ray/rllib/agents/ddpg/README.md
similarity index 100%
rename from python/ray/rllib/ddpg/README.md
rename to python/ray/rllib/agents/ddpg/README.md
diff --git a/python/ray/rllib/ddpg/__init__.py b/python/ray/rllib/agents/ddpg/__init__.py
similarity index 59%
rename from python/ray/rllib/ddpg/__init__.py
rename to python/ray/rllib/agents/ddpg/__init__.py
index 932b9f0c8..7d3390b20 100644
--- a/python/ray/rllib/ddpg/__init__.py
+++ b/python/ray/rllib/agents/ddpg/__init__.py
@@ -2,7 +2,7 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
-from ray.rllib.ddpg.apex import ApexDDPGAgent
-from ray.rllib.ddpg.ddpg import DDPGAgent, DEFAULT_CONFIG
+from ray.rllib.agents.ddpg.apex import ApexDDPGAgent
+from ray.rllib.agents.ddpg.ddpg import DDPGAgent, DEFAULT_CONFIG
__all__ = ["DDPGAgent", "ApexDDPGAgent", "DEFAULT_CONFIG"]
diff --git a/python/ray/rllib/ddpg/apex.py b/python/ray/rllib/agents/ddpg/apex.py
similarity index 91%
rename from python/ray/rllib/ddpg/apex.py
rename to python/ray/rllib/agents/ddpg/apex.py
index 8ede5109f..b53d4178e 100644
--- a/python/ray/rllib/ddpg/apex.py
+++ b/python/ray/rllib/agents/ddpg/apex.py
@@ -2,16 +2,16 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
-from ray.rllib.ddpg.ddpg import DDPGAgent, DEFAULT_CONFIG as DDPG_CONFIG
+from ray.rllib.agents.ddpg.ddpg import DDPGAgent, DEFAULT_CONFIG as DDPG_CONFIG
from ray.utils import merge_dicts
APEX_DDPG_DEFAULT_CONFIG = merge_dicts(
DDPG_CONFIG,
{
"optimizer_class": "AsyncSamplesOptimizer",
- "optimizer_config":
+ "optimizer":
merge_dicts(
- DDPG_CONFIG["optimizer_config"], {
+ DDPG_CONFIG["optimizer"], {
"max_weight_sync_delay": 400,
"num_replay_buffer_shards": 4,
"debug": False
diff --git a/python/ray/rllib/ddpg/common/__init__.py b/python/ray/rllib/agents/ddpg/common/__init__.py
similarity index 100%
rename from python/ray/rllib/ddpg/common/__init__.py
rename to python/ray/rllib/agents/ddpg/common/__init__.py
diff --git a/python/ray/rllib/ddpg/ddpg.py b/python/ray/rllib/agents/ddpg/ddpg.py
similarity index 88%
rename from python/ray/rllib/ddpg/ddpg.py
rename to python/ray/rllib/agents/ddpg/ddpg.py
index 9a93e57c1..c7e45f1b3 100644
--- a/python/ray/rllib/ddpg/ddpg.py
+++ b/python/ray/rllib/agents/ddpg/ddpg.py
@@ -2,9 +2,10 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
-from ray.rllib.dqn.common.schedules import ConstantSchedule, LinearSchedule
-from ray.rllib.dqn.dqn import DQNAgent
-from ray.rllib.ddpg.ddpg_policy_graph import DDPGPolicyGraph
+from ray.rllib.agents.agent import with_common_config
+from ray.rllib.agents.dqn.dqn import DQNAgent
+from ray.rllib.agents.ddpg.ddpg_policy_graph import DDPGPolicyGraph
+from ray.rllib.utils.schedules import ConstantSchedule, LinearSchedule
OPTIMIZER_SHARED_CONFIGS = [
"buffer_size", "prioritized_replay", "prioritized_replay_alpha",
@@ -12,7 +13,7 @@ OPTIMIZER_SHARED_CONFIGS = [
"train_batch_size", "learning_starts", "clip_rewards"
]
-DEFAULT_CONFIG = {
+DEFAULT_CONFIG = with_common_config({
# === Model ===
# Hidden layer sizes of the policy network
"actor_hiddens": [64, 64],
@@ -24,12 +25,6 @@ DEFAULT_CONFIG = {
"critic_hidden_activation": "relu",
# N-step Q learning
"n_step": 1,
- # Config options to pass to the model constructor
- "model": {},
- # Discount factor for the MDP
- "gamma": 0.99,
- # Arguments to pass to the env creator
- "env_config": {},
# === Exploration ===
# Max num timesteps for annealing schedules. Exploration is annealed from
@@ -99,30 +94,21 @@ DEFAULT_CONFIG = {
# to increase if your environment is particularly slow to sample, or if
# you're using the Async or Ape-X optimizers.
"num_workers": 0,
- # Number of environments to evaluate vectorwise per worker.
- "num_envs": 1,
# Whether to allocate GPUs for workers (if > 0).
"num_gpus_per_worker": 0,
# Whether to allocate CPUs for workers (if > 0).
"num_cpus_per_worker": 1,
# Optimizer class to use.
"optimizer_class": "SyncReplayOptimizer",
- # Config to pass to the optimizer.
- "optimizer_config": {},
# Whether to use a distribution of epsilons across workers for exploration.
"per_worker_exploration": False,
# Whether to compute priorities on workers.
"worker_side_prioritization": False,
-
- # === Multiagent ===
- "multiagent": {
- "policy_graphs": {},
- "policy_mapping_fn": None,
- },
-}
+})
class DDPGAgent(DQNAgent):
+ """DDPG implementation in TensorFlow."""
_agent_name = "DDPG"
_default_config = DEFAULT_CONFIG
_policy_graph = DDPGPolicyGraph
diff --git a/python/ray/rllib/ddpg/ddpg_policy_graph.py b/python/ray/rllib/agents/ddpg/ddpg_policy_graph.py
similarity index 98%
rename from python/ray/rllib/ddpg/ddpg_policy_graph.py
rename to python/ray/rllib/agents/ddpg/ddpg_policy_graph.py
index 34aa9682b..a8a44980b 100644
--- a/python/ray/rllib/ddpg/ddpg_policy_graph.py
+++ b/python/ray/rllib/agents/ddpg/ddpg_policy_graph.py
@@ -8,11 +8,11 @@ import tensorflow as tf
import tensorflow.contrib.layers as layers
import ray
-from ray.rllib.dqn.dqn_policy_graph import _huber_loss, _minimize_and_clip, \
- _scope_vars, _postprocess_dqn
+from ray.rllib.agents.dqn.dqn_policy_graph import _huber_loss, \
+ _minimize_and_clip, _scope_vars, _postprocess_dqn
from ray.rllib.models import ModelCatalog
from ray.rllib.utils.error import UnsupportedSpaceException
-from ray.rllib.utils.tf_policy_graph import TFPolicyGraph
+from ray.rllib.evaluation.tf_policy_graph import TFPolicyGraph
A_SCOPE = "a_func"
@@ -113,7 +113,7 @@ class ActorCriticLoss(object):
class DDPGPolicyGraph(TFPolicyGraph):
def __init__(self, observation_space, action_space, config):
- config = dict(ray.rllib.ddpg.ddpg.DEFAULT_CONFIG, **config)
+ config = dict(ray.rllib.agents.ddpg.ddpg.DEFAULT_CONFIG, **config)
if not isinstance(action_space, Box):
raise UnsupportedSpaceException(
"Action space {} is not supported for DDPG.".format(
diff --git a/python/ray/rllib/dqn/README.md b/python/ray/rllib/agents/dqn/README.md
similarity index 100%
rename from python/ray/rllib/dqn/README.md
rename to python/ray/rllib/agents/dqn/README.md
diff --git a/python/ray/rllib/dqn/__init__.py b/python/ray/rllib/agents/dqn/__init__.py
similarity index 60%
rename from python/ray/rllib/dqn/__init__.py
rename to python/ray/rllib/agents/dqn/__init__.py
index a383adeb4..b46c249e6 100644
--- a/python/ray/rllib/dqn/__init__.py
+++ b/python/ray/rllib/agents/dqn/__init__.py
@@ -2,7 +2,7 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
-from ray.rllib.dqn.apex import ApexAgent
-from ray.rllib.dqn.dqn import DQNAgent, DEFAULT_CONFIG
+from ray.rllib.agents.dqn.apex import ApexAgent
+from ray.rllib.agents.dqn.dqn import DQNAgent, DEFAULT_CONFIG
__all__ = ["ApexAgent", "DQNAgent", "DEFAULT_CONFIG"]
diff --git a/python/ray/rllib/dqn/apex.py b/python/ray/rllib/agents/dqn/apex.py
similarity index 89%
rename from python/ray/rllib/dqn/apex.py
rename to python/ray/rllib/agents/dqn/apex.py
index d12754b89..1c8b2f6b3 100644
--- a/python/ray/rllib/dqn/apex.py
+++ b/python/ray/rllib/agents/dqn/apex.py
@@ -2,7 +2,7 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
-from ray.rllib.dqn.dqn import DQNAgent, DEFAULT_CONFIG as DQN_CONFIG
+from ray.rllib.agents.dqn.dqn import DQNAgent, DEFAULT_CONFIG as DQN_CONFIG
from ray.tune.trial import Resources
from ray.utils import merge_dicts
@@ -10,9 +10,9 @@ APEX_DEFAULT_CONFIG = merge_dicts(
DQN_CONFIG,
{
"optimizer_class": "AsyncSamplesOptimizer",
- "optimizer_config":
+ "optimizer":
merge_dicts(
- DQN_CONFIG["optimizer_config"], {
+ DQN_CONFIG["optimizer"], {
"max_weight_sync_delay": 400,
"num_replay_buffer_shards": 4,
"debug": False
@@ -47,7 +47,7 @@ class ApexAgent(DQNAgent):
def default_resource_request(cls, config):
cf = dict(cls._default_config, **config)
return Resources(
- cpu=1 + cf["optimizer_config"]["num_replay_buffer_shards"],
+ cpu=1 + cf["optimizer"]["num_replay_buffer_shards"],
gpu=cf["gpu"] and 1 or 0,
extra_cpu=cf["num_cpus_per_worker"] * cf["num_workers"],
extra_gpu=cf["num_gpus_per_worker"] * cf["num_workers"])
diff --git a/python/ray/rllib/dqn/common/__init__.py b/python/ray/rllib/agents/dqn/common/__init__.py
similarity index 100%
rename from python/ray/rllib/dqn/common/__init__.py
rename to python/ray/rllib/agents/dqn/common/__init__.py
diff --git a/python/ray/rllib/dqn/common/wrappers.py b/python/ray/rllib/agents/dqn/common/wrappers.py
similarity index 100%
rename from python/ray/rllib/dqn/common/wrappers.py
rename to python/ray/rllib/agents/dqn/common/wrappers.py
diff --git a/python/ray/rllib/dqn/dqn.py b/python/ray/rllib/agents/dqn/dqn.py
similarity index 75%
rename from python/ray/rllib/dqn/dqn.py
rename to python/ray/rllib/agents/dqn/dqn.py
index 8c0e55391..ba1224732 100644
--- a/python/ray/rllib/dqn/dqn.py
+++ b/python/ray/rllib/agents/dqn/dqn.py
@@ -7,11 +7,10 @@ import os
import ray
from ray.rllib import optimizers
-from ray.rllib.dqn.common.schedules import ConstantSchedule, LinearSchedule
-from ray.rllib.dqn.dqn_policy_graph import DQNPolicyGraph
-from ray.rllib.utils.common_policy_evaluator import CommonPolicyEvaluator, \
- collect_metrics
-from ray.rllib.agent import Agent
+from ray.rllib.agents.agent import Agent, with_common_config
+from ray.rllib.agents.dqn.dqn_policy_graph import DQNPolicyGraph
+from ray.rllib.evaluation.metrics import collect_metrics
+from ray.rllib.utils.schedules import ConstantSchedule, LinearSchedule
from ray.tune.trial import Resources
@@ -20,7 +19,7 @@ OPTIMIZER_SHARED_CONFIGS = [
"prioritized_replay_beta", "prioritized_replay_eps", "sample_batch_size",
"train_batch_size", "learning_starts", "clip_rewards"]
-DEFAULT_CONFIG = {
+DEFAULT_CONFIG = with_common_config({
# === Model ===
# Whether to use dueling dqn
"dueling": True,
@@ -30,12 +29,8 @@ DEFAULT_CONFIG = {
"hiddens": [256],
# N-step Q learning
"n_step": 1,
- # Config options to pass to the model constructor
- "model": {},
- # Discount factor for the MDP
- "gamma": 0.99,
- # Arguments to pass to the env creator
- "env_config": {},
+ # Whether to use rllib or deepmind preprocessors
+ "preprocessor_pref": "deepmind",
# === Exploration ===
# Max num timesteps for annealing schedules. Exploration is annealed from
@@ -66,6 +61,8 @@ DEFAULT_CONFIG = {
"prioritized_replay_eps": 1e-6,
# Whether to clip rewards to [-1, 1] prior to adding to the replay buffer.
"clip_rewards": True,
+ # Whether to LZ4 compress observations
+ "compress_observations": True,
# === Optimization ===
# Learning rate for adam optimizer
@@ -89,30 +86,22 @@ DEFAULT_CONFIG = {
# to increase if your environment is particularly slow to sample, or if
# you're using the Async or Ape-X optimizers.
"num_workers": 0,
- # Number of environments to evaluate vectorwise per worker.
- "num_envs": 1,
# Whether to allocate GPUs for workers (if > 0).
"num_gpus_per_worker": 0,
# Whether to allocate CPUs for workers (if > 0).
"num_cpus_per_worker": 1,
# Optimizer class to use.
"optimizer_class": "SyncReplayOptimizer",
- # Config to pass to the optimizer.
- "optimizer_config": {},
# Whether to use a distribution of epsilons across workers for exploration.
"per_worker_exploration": False,
# Whether to compute priorities on workers.
"worker_side_prioritization": False,
-
- # === Multiagent ===
- "multiagent": {
- "policy_graphs": {},
- "policy_mapping_fn": None,
- },
-}
+})
class DQNAgent(Agent):
+ """DQN implementation in TensorFlow."""
+
_agent_name = "DQN"
_default_config = DEFAULT_CONFIG
_policy_graph = DQNPolicyGraph
@@ -126,32 +115,10 @@ class DQNAgent(Agent):
extra_gpu=cf["num_gpus_per_worker"] * cf["num_workers"])
def _init(self):
+ # Update effective batch size to include n-step
adjusted_batch_size = (
self.config["sample_batch_size"] + self.config["n_step"] - 1)
- self.local_evaluator = CommonPolicyEvaluator(
- self.env_creator,
- self.config["multiagent"]["policy_graphs"] or self._policy_graph,
- policy_mapping_fn=self.config["multiagent"]["policy_mapping_fn"],
- batch_steps=adjusted_batch_size,
- batch_mode="truncate_episodes", preprocessor_pref="deepmind",
- compress_observations=True,
- env_config=self.config["env_config"],
- model_config=self.config["model"], policy_config=self.config,
- num_envs=self.config["num_envs"])
- remote_cls = CommonPolicyEvaluator.as_remote(
- num_cpus=self.config["num_cpus_per_worker"],
- num_gpus=self.config["num_gpus_per_worker"])
- self.remote_evaluators = [
- remote_cls.remote(
- self.env_creator, self._policy_graph,
- batch_steps=adjusted_batch_size,
- batch_mode="truncate_episodes", preprocessor_pref="deepmind",
- compress_observations=True,
- env_config=self.config["env_config"],
- model_config=self.config["model"], policy_config=self.config,
- num_envs=self.config["num_envs"],
- worker_index=i+1)
- for i in range(self.config["num_workers"])]
+ self.config["sample_batch_size"] = adjusted_batch_size
self.exploration0 = self._make_exploration_schedule(0)
self.explorations = [
@@ -159,11 +126,17 @@ class DQNAgent(Agent):
for i in range(self.config["num_workers"])]
for k in OPTIMIZER_SHARED_CONFIGS:
- if k not in self.config["optimizer_config"]:
- self.config["optimizer_config"][k] = self.config[k]
+ if k not in self.config["optimizer"]:
+ self.config["optimizer"][k] = self.config[k]
+ self.local_evaluator = self.make_local_evaluator(
+ self.env_creator, self._policy_graph)
+ self.remote_evaluators = self.make_remote_evaluators(
+ self.env_creator, self._policy_graph, self.config["num_workers"],
+ {"num_cpus": self.config["num_cpus_per_worker"],
+ "num_gpus": self.config["num_gpus_per_worker"]})
self.optimizer = getattr(optimizers, self.config["optimizer_class"])(
- self.config["optimizer_config"], self.local_evaluator,
+ self.config["optimizer"], self.local_evaluator,
self.remote_evaluators)
self.last_target_update_ts = 0
@@ -247,10 +220,3 @@ class DQNAgent(Agent):
self.optimizer.restore(extra_data[2])
self.num_target_updates = extra_data[3]
self.last_target_update_ts = extra_data[4]
-
- def compute_action(self, observation, state=None):
- if state is None:
- state = []
- return self.local_evaluator.for_policy(
- lambda p: p.compute_single_action(
- observation, state, is_training=False)[0])
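Reviewer note on the `dqn.py` changes above: `_init` now writes the n-step-adjusted batch size back into `self.config["sample_batch_size"]` and then copies each key in `OPTIMIZER_SHARED_CONFIGS` into `self.config["optimizer"]` unless the user already set it there. The sketch below reproduces only that ordering with plain dicts. The helper name `prepare_config` and the shortened key list are assumptions made for illustration and are not part of RLlib.

```python
# Illustrative sketch only: plain dicts stand in for the agent config.
# The key list is a shortened, hypothetical subset of OPTIMIZER_SHARED_CONFIGS.
SHARED_KEYS = ["sample_batch_size", "train_batch_size", "learning_starts"]


def prepare_config(config):
    """Return a copy of `config` with the n-step adjustment applied and
    shared keys propagated into the nested "optimizer" dict."""
    config = dict(config)
    config["optimizer"] = dict(config.get("optimizer", {}))

    # Effective batch size includes the extra n_step - 1 transitions.
    config["sample_batch_size"] = (
        config["sample_batch_size"] + config["n_step"] - 1)

    # Explicit optimizer settings win; everything else is inherited.
    for k in SHARED_KEYS:
        if k not in config["optimizer"]:
            config["optimizer"][k] = config[k]
    return config


example = prepare_config({
    "sample_batch_size": 4,
    "train_batch_size": 32,
    "learning_starts": 1000,
    "n_step": 3,
    "optimizer": {"train_batch_size": 64},
})
```

Because the adjustment happens before the copy loop, the optimizer inherits the adjusted `sample_batch_size`, while the explicit `train_batch_size` in the optimizer dict is left untouched.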
diff --git a/python/ray/rllib/dqn/dqn_policy_graph.py b/python/ray/rllib/agents/dqn/dqn_policy_graph.py
similarity index 98%
rename from python/ray/rllib/dqn/dqn_policy_graph.py
rename to python/ray/rllib/agents/dqn/dqn_policy_graph.py
index ecf6ac5dc..f94cc16a3 100644
--- a/python/ray/rllib/dqn/dqn_policy_graph.py
+++ b/python/ray/rllib/agents/dqn/dqn_policy_graph.py
@@ -9,9 +9,9 @@ import tensorflow.contrib.layers as layers
import ray
from ray.rllib.models import ModelCatalog
-from ray.rllib.optimizers.sample_batch import SampleBatch
+from ray.rllib.evaluation.sample_batch import SampleBatch
from ray.rllib.utils.error import UnsupportedSpaceException
-from ray.rllib.utils.tf_policy_graph import TFPolicyGraph
+from ray.rllib.evaluation.tf_policy_graph import TFPolicyGraph
Q_SCOPE = "q_func"
@@ -79,7 +79,7 @@ class QLoss(object):
class DQNPolicyGraph(TFPolicyGraph):
def __init__(self, observation_space, action_space, config):
- config = dict(ray.rllib.dqn.dqn.DEFAULT_CONFIG, **config)
+ config = dict(ray.rllib.agents.dqn.dqn.DEFAULT_CONFIG, **config)
if not isinstance(action_space, Discrete):
raise UnsupportedSpaceException(
"Action space {} is not supported for DQN.".format(
diff --git a/python/ray/rllib/agents/es/__init__.py b/python/ray/rllib/agents/es/__init__.py
new file mode 100644
index 000000000..3ea5f2edc
--- /dev/null
+++ b/python/ray/rllib/agents/es/__init__.py
@@ -0,0 +1,3 @@
+from ray.rllib.agents.es.es import (ESAgent, DEFAULT_CONFIG)
+
+__all__ = ["ESAgent", "DEFAULT_CONFIG"]
diff --git a/python/ray/rllib/es/es.py b/python/ray/rllib/agents/es/es.py
similarity index 97%
rename from python/ray/rllib/es/es.py
rename to python/ray/rllib/agents/es/es.py
index b900f88a7..62249e380 100644
--- a/python/ray/rllib/es/es.py
+++ b/python/ray/rllib/agents/es/es.py
@@ -12,13 +12,13 @@ import pickle
import time
import ray
-from ray.rllib import agent
+from ray.rllib.agents import Agent
from ray.tune.trial import Resources
-from ray.rllib.es import optimizers
-from ray.rllib.es import policies
-from ray.rllib.es import tabular_logger as tlogger
-from ray.rllib.es import utils
+from ray.rllib.agents.es import optimizers
+from ray.rllib.agents.es import policies
+from ray.rllib.agents.es import tabular_logger as tlogger
+from ray.rllib.agents.es import utils
Result = namedtuple("Result", [
@@ -134,7 +134,9 @@ class Worker(object):
eval_lengths=eval_lengths)
-class ESAgent(agent.Agent):
+class ESAgent(Agent):
+ """Large-scale implementation of Evolution Strategies in Ray."""
+
_agent_name = "ES"
_default_config = DEFAULT_CONFIG
diff --git a/python/ray/rllib/es/optimizers.py b/python/ray/rllib/agents/es/optimizers.py
similarity index 100%
rename from python/ray/rllib/es/optimizers.py
rename to python/ray/rllib/agents/es/optimizers.py
diff --git a/python/ray/rllib/es/policies.py b/python/ray/rllib/agents/es/policies.py
similarity index 100%
rename from python/ray/rllib/es/policies.py
rename to python/ray/rllib/agents/es/policies.py
diff --git a/python/ray/rllib/es/tabular_logger.py b/python/ray/rllib/agents/es/tabular_logger.py
similarity index 100%
rename from python/ray/rllib/es/tabular_logger.py
rename to python/ray/rllib/agents/es/tabular_logger.py
diff --git a/python/ray/rllib/es/utils.py b/python/ray/rllib/agents/es/utils.py
similarity index 100%
rename from python/ray/rllib/es/utils.py
rename to python/ray/rllib/agents/es/utils.py
diff --git a/python/ray/rllib/agents/pg/__init__.py b/python/ray/rllib/agents/pg/__init__.py
new file mode 100644
index 000000000..f0665f448
--- /dev/null
+++ b/python/ray/rllib/agents/pg/__init__.py
@@ -0,0 +1,3 @@
+from ray.rllib.agents.pg.pg import PGAgent, DEFAULT_CONFIG
+
+__all__ = ["PGAgent", "DEFAULT_CONFIG"]
diff --git a/python/ray/rllib/agents/pg/pg.py b/python/ray/rllib/agents/pg/pg.py
new file mode 100644
index 000000000..05a600cdb
--- /dev/null
+++ b/python/ray/rllib/agents/pg/pg.py
@@ -0,0 +1,54 @@
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+from ray.rllib.agents.agent import Agent, with_common_config
+from ray.rllib.agents.pg.pg_policy_graph import PGPolicyGraph
+from ray.rllib.evaluation.metrics import collect_metrics
+from ray.rllib.optimizers import SyncSamplesOptimizer
+from ray.tune.trial import Resources
+
+
+DEFAULT_CONFIG = with_common_config({
+ # No remote workers by default
+ "num_workers": 0,
+ # Learning rate
+ "lr": 0.0004,
+ # Override model config
+ "model": {
+ # Use LSTM model.
+ "use_lstm": False,
+ # Max seq length for LSTM training.
+ "max_seq_len": 20,
+ },
+})
+
+
+class PGAgent(Agent):
+ """Simple policy gradient agent.
+
+ This is an example agent to show how to implement algorithms in RLlib.
+ In most cases, you will probably want to use the PPO agent instead.
+ """
+
+ _agent_name = "PG"
+ _default_config = DEFAULT_CONFIG
+
+ @classmethod
+ def default_resource_request(cls, config):
+ cf = dict(cls._default_config, **config)
+ return Resources(cpu=1, gpu=0, extra_cpu=cf["num_workers"])
+
+ def _init(self):
+ self.local_evaluator = self.make_local_evaluator(
+ self.env_creator, PGPolicyGraph)
+ self.remote_evaluators = self.make_remote_evaluators(
+ self.env_creator, PGPolicyGraph, self.config["num_workers"], {})
+ self.optimizer = SyncSamplesOptimizer(
+ self.config["optimizer"], self.local_evaluator,
+ self.remote_evaluators)
+
+ def _train(self):
+ self.optimizer.step()
+ return collect_metrics(
+ self.optimizer.local_evaluator, self.optimizer.remote_evaluators)
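Reviewer note on `default_resource_request` in the new `pg.py`: `dict(cls._default_config, **config)` is a shallow merge, so a user-supplied nested dict such as `"model"` replaces the default nested dict wholesale within this method. The sketch below shows that behavior with a simplified, hypothetical stand-in for the defaults (the real `DEFAULT_CONFIG` is built by `with_common_config` and has more keys). Whether the agent's own config handling merges nested dicts more deeply is not shown in this hunk.

```python
# Simplified, hypothetical defaults mirroring the shape of the PG config above.
DEFAULTS = {
    "num_workers": 0,
    "lr": 0.0004,
    "model": {"use_lstm": False, "max_seq_len": 20},
}


def effective_config(user_config):
    """Shallow merge, as in default_resource_request: top-level keys from
    `user_config` override the defaults; nested dicts are not merged."""
    return dict(DEFAULTS, **user_config)


cf = effective_config({"num_workers": 4, "model": {"use_lstm": True}})

# The resource request above would use this value for extra_cpu.
extra_cpu = cf["num_workers"]
```

For the resource request this is harmless, since only top-level keys such as `num_workers` are read. The nested-dict replacement matters only if such a merged dict were used to read model options.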
diff --git a/python/ray/rllib/pg/pg_policy_graph.py b/python/ray/rllib/agents/pg/pg_policy_graph.py
similarity index 91%
rename from python/ray/rllib/pg/pg_policy_graph.py
rename to python/ray/rllib/agents/pg/pg_policy_graph.py
index 2fec360f2..42124e3d1 100644
--- a/python/ray/rllib/pg/pg_policy_graph.py
+++ b/python/ray/rllib/agents/pg/pg_policy_graph.py
@@ -6,8 +6,8 @@ import tensorflow as tf
import ray
from ray.rllib.models.catalog import ModelCatalog
-from ray.rllib.utils.postprocessing import compute_advantages
-from ray.rllib.utils.tf_policy_graph import TFPolicyGraph
+from ray.rllib.evaluation.postprocessing import compute_advantages
+from ray.rllib.evaluation.tf_policy_graph import TFPolicyGraph
class PGLoss(object):
@@ -17,7 +17,7 @@ class PGLoss(object):
class PGPolicyGraph(TFPolicyGraph):
def __init__(self, obs_space, action_space, config):
- config = dict(ray.rllib.pg.pg.DEFAULT_CONFIG, **config)
+ config = dict(ray.rllib.agents.pg.pg.DEFAULT_CONFIG, **config)
self.config = config
# Setup policy
diff --git a/python/ray/rllib/agents/ppo/__init__.py b/python/ray/rllib/agents/ppo/__init__.py
new file mode 100644
index 000000000..e4d0c7cf0
--- /dev/null
+++ b/python/ray/rllib/agents/ppo/__init__.py
@@ -0,0 +1,3 @@
+from ray.rllib.agents.ppo.ppo import (PPOAgent, DEFAULT_CONFIG)
+
+__all__ = ["PPOAgent", "DEFAULT_CONFIG"]
diff --git a/python/ray/rllib/ppo/ppo.py b/python/ray/rllib/agents/ppo/ppo.py
similarity index 53%
rename from python/ray/rllib/ppo/ppo.py
rename to python/ray/rllib/agents/ppo/ppo.py
index dfd2c594d..a83c10f3b 100644
--- a/python/ray/rllib/ppo/ppo.py
+++ b/python/ray/rllib/agents/ppo/ppo.py
@@ -5,22 +5,16 @@ from __future__ import print_function
import os
import numpy as np
import pickle
-import tensorflow as tf
import ray
-from ray.tune.trial import Resources
-from ray.rllib.agent import Agent
-from ray.rllib.utils.common_policy_evaluator import (
- CommonPolicyEvaluator, collect_metrics)
+from ray.rllib.agents import Agent, with_common_config
+from ray.rllib.agents.ppo.ppo_tf_policy import PPOTFPolicyGraph
+from ray.rllib.evaluation.metrics import collect_metrics
from ray.rllib.utils import FilterManager
-from ray.rllib.ppo.ppo_tf_policy import PPOTFPolicyGraph
from ray.rllib.optimizers.multi_gpu_optimizer import LocalMultiGPUOptimizer
+from ray.tune.trial import Resources
-DEFAULT_CONFIG = {
- # Discount factor of the MDP
- "gamma": 0.995,
- # Number of steps after which the rollout gets cut
- "horizon": 2000,
+DEFAULT_CONFIG = with_common_config({
# If true, use the Generalized Advantage Estimator (GAE)
# with a value function, see https://arxiv.org/pdf/1506.02438.pdf.
"use_gae": True,
@@ -28,22 +22,12 @@ DEFAULT_CONFIG = {
"lambda": 1.0,
# Initial coefficient for KL divergence
"kl_coeff": 0.2,
+ # Number of timesteps collected for each SGD round
+ "timesteps_per_batch": 4000,
# Number of SGD iterations in each outer loop
"num_sgd_iter": 30,
# Stepsize of SGD
"sgd_stepsize": 5e-5,
- # TODO(pcm): Expose the choice between gpus and cpus
- # as a command line argument.
- "devices": ["/cpu:%d" % i for i in range(4)],
- "tf_session_args": {
- "device_count": {"CPU": 4},
- "log_device_placement": False,
- "allow_soft_placement": True,
- "intra_op_parallelism_threads": 1,
- "inter_op_parallelism_threads": 1,
- },
- # Batch size for policy evaluations for rollouts
- "rollout_batchsize": 1,
# Total SGD batch size across all devices for SGD
"sgd_batchsize": 128,
# Coefficient of the value function loss
@@ -54,82 +38,41 @@ DEFAULT_CONFIG = {
"clip_param": 0.3,
# Target value for KL divergence
"kl_target": 0.01,
- # Config params to pass to the model
- "model": {"free_log_std": False},
- # Which observation filter to apply to the observation
- "observation_filter": "MeanStdFilter",
- # If >1, adds frameskip
- "extra_frameskip": 1,
- # Number of timesteps collected in each outer loop
- "timesteps_per_batch": 4000,
- # Each tasks performs rollouts until at least this
- # number of steps is obtained
- "min_steps_per_task": 200,
- # Number of actors used to collect the rollouts
- "num_workers": 2,
+ # Number of GPUs to use for SGD
+ "num_gpus": 0,
# Whether to allocate GPUs for workers (if > 0).
"num_gpus_per_worker": 0,
# Whether to allocate CPUs for workers (if > 0).
"num_cpus_per_worker": 1,
- # Dump TensorFlow timeline after this many SGD minibatches
- "full_trace_nth_sgd_batch": -1,
- # Whether to profile data loading
- "full_trace_data_load": False,
- # Outer loop iteration index when we drop into the TensorFlow debugger
- "tf_debug_iteration": -1,
- # If this is True, the TensorFlow debugger is invoked if an Inf or NaN
- # is detected
- "tf_debug_inf_or_nan": False,
- # If True, we write tensorflow logs and checkpoints
- "write_logs": True,
- # Arguments to pass to the env creator
- "env_config": {},
-}
+ # Whether to roll out "complete_episodes" or "truncate_episodes"
+ "batch_mode": "complete_episodes",
+ # Which observation filter to apply to the observation
+ "observation_filter": "MeanStdFilter",
+})
class PPOAgent(Agent):
+ """Multi-GPU optimized implementation of PPO in TensorFlow."""
+
_agent_name = "PPO"
_default_config = DEFAULT_CONFIG
- _default_policy_graph = PPOTFPolicyGraph
@classmethod
def default_resource_request(cls, config):
cf = dict(cls._default_config, **config)
return Resources(
cpu=1,
- gpu=len([d for d in cf["devices"] if "gpu" in d.lower()]),
+ gpu=cf["num_gpus"],
extra_cpu=cf["num_cpus_per_worker"] * cf["num_workers"],
extra_gpu=cf["num_gpus_per_worker"] * cf["num_workers"])
def _init(self):
- def session_creator():
- return tf.Session(
- config=tf.ConfigProto(**self.config["tf_session_args"]))
- self.local_evaluator = CommonPolicyEvaluator(
- self.env_creator,
- self._default_policy_graph,
- tf_session_creator=session_creator,
- batch_mode="complete_episodes",
- observation_filter=self.config["observation_filter"],
- env_config=self.config["env_config"],
- model_config=self.config["model"],
- policy_config=self.config
- )
- RemoteEvaluator = CommonPolicyEvaluator.as_remote(
- num_cpus=self.config["num_cpus_per_worker"],
- num_gpus=self.config["num_gpus_per_worker"])
- self.remote_evaluators = [
- RemoteEvaluator.remote(
- self.env_creator,
- self._default_policy_graph,
- batch_mode="complete_episodes",
- observation_filter=self.config["observation_filter"],
- env_config=self.config["env_config"],
- model_config=self.config["model"],
- policy_config=self.config
- )
- for _ in range(self.config["num_workers"])]
-
+ self.local_evaluator = self.make_local_evaluator(
+ self.env_creator, PPOTFPolicyGraph)
+ self.remote_evaluators = self.make_remote_evaluators(
+ self.env_creator, PPOTFPolicyGraph, self.config["num_workers"],
+ {"num_cpus": self.config["num_cpus_per_worker"],
+ "num_gpus": self.config["num_gpus_per_worker"]})
self.optimizer = LocalMultiGPUOptimizer(
{"sgd_batch_size": self.config["sgd_batchsize"],
"sgd_stepsize": self.config["sgd_stepsize"],
@@ -137,10 +80,6 @@ class PPOAgent(Agent):
"timesteps_per_batch": self.config["timesteps_per_batch"]},
self.local_evaluator, self.remote_evaluators)
- # TODO(rliaw): Push into Policy Graph
- with self.local_evaluator.tf_sess.graph.as_default():
- self.saver = tf.train.Saver()
-
def _train(self):
def postprocess_samples(batch):
# Divide by the maximum of value.std() and 1e-4
@@ -183,10 +122,8 @@ class PPOAgent(Agent):
ev.__ray_terminate__.remote()
def _save(self, checkpoint_dir):
- checkpoint_path = self.saver.save(
- self.local_evaluator.tf_sess,
- os.path.join(checkpoint_dir, "checkpoint"),
- global_step=self.iteration)
+ checkpoint_path = os.path.join(checkpoint_dir,
+ "checkpoint-{}".format(self.iteration))
agent_state = ray.get(
[a.save.remote() for a in self.remote_evaluators])
extra_data = [
@@ -196,18 +133,8 @@ class PPOAgent(Agent):
return checkpoint_path
def _restore(self, checkpoint_path):
- self.saver.restore(self.local_evaluator.tf_sess, checkpoint_path)
extra_data = pickle.load(open(checkpoint_path + ".extra_data", "rb"))
self.local_evaluator.restore(extra_data[0])
ray.get([
a.restore.remote(o)
for (a, o) in zip(self.remote_evaluators, extra_data[1])])
-
- def compute_action(self, observation, state=None):
- if state is None:
- state = []
- obs = self.local_evaluator.filters["default"](
- observation, update=False)
- return self.local_evaluator.for_policy(
- lambda p: p.compute_single_action(
- obs, state, is_training=False)[0])
diff --git a/python/ray/rllib/ppo/ppo_tf_policy.py b/python/ray/rllib/agents/ppo/ppo_tf_policy.py
similarity index 98%
rename from python/ray/rllib/ppo/ppo_tf_policy.py
rename to python/ray/rllib/agents/ppo/ppo_tf_policy.py
index 3fd8b06fc..887357d9a 100644
--- a/python/ray/rllib/ppo/ppo_tf_policy.py
+++ b/python/ray/rllib/agents/ppo/ppo_tf_policy.py
@@ -4,9 +4,9 @@ from __future__ import print_function
import tensorflow as tf
+from ray.rllib.evaluation.postprocessing import compute_advantages
+from ray.rllib.evaluation.tf_policy_graph import TFPolicyGraph
from ray.rllib.models.catalog import ModelCatalog
-from ray.rllib.utils.postprocessing import compute_advantages
-from ray.rllib.utils.tf_policy_graph import TFPolicyGraph
class PPOLoss(object):
@@ -120,6 +120,7 @@ class PPOTFPolicyGraph(TFPolicyGraph):
("logprobs", logprobs_ph),
("vf_preds", vf_preds_ph)
]
+ # TODO(ekl) feed RNN states in here
# KL Coefficient
self.kl_coeff = tf.get_variable(
diff --git a/python/ray/rllib/ppo/rollout.py b/python/ray/rllib/agents/ppo/rollout.py
similarity index 95%
rename from python/ray/rllib/ppo/rollout.py
rename to python/ray/rllib/agents/ppo/rollout.py
index 9f7c39a30..54a235680 100644
--- a/python/ray/rllib/ppo/rollout.py
+++ b/python/ray/rllib/agents/ppo/rollout.py
@@ -3,7 +3,7 @@ from __future__ import division
from __future__ import print_function
import ray
-from ray.rllib.optimizers import SampleBatch
+from ray.rllib.evaluation.sample_batch import SampleBatch
def collect_samples(agents, timesteps_per_batch):
diff --git a/python/ray/rllib/ppo/test/test.py b/python/ray/rllib/agents/ppo/test/test.py
similarity index 97%
rename from python/ray/rllib/ppo/test/test.py
rename to python/ray/rllib/agents/ppo/test/test.py
index 6ab59af93..d6454eb56 100644
--- a/python/ray/rllib/ppo/test/test.py
+++ b/python/ray/rllib/agents/ppo/test/test.py
@@ -8,7 +8,7 @@ import tensorflow as tf
from numpy.testing import assert_allclose
from ray.rllib.models.action_dist import Categorical
-from ray.rllib.ppo.utils import flatten, concatenate
+from ray.rllib.agents.ppo.utils import flatten, concatenate
# TODO(ekl): move to rllib/models dir
diff --git a/python/ray/rllib/ppo/utils.py b/python/ray/rllib/agents/ppo/utils.py
similarity index 100%
rename from python/ray/rllib/ppo/utils.py
rename to python/ray/rllib/agents/ppo/utils.py
diff --git a/python/ray/rllib/bc/__init__.py b/python/ray/rllib/bc/__init__.py
deleted file mode 100644
index 8b6e41297..000000000
--- a/python/ray/rllib/bc/__init__.py
+++ /dev/null
@@ -1,3 +0,0 @@
-from ray.rllib.bc.bc import BCAgent, DEFAULT_CONFIG
-
-__all__ = ["BCAgent", "DEFAULT_CONFIG"]
diff --git a/python/ray/rllib/env/__init__.py b/python/ray/rllib/env/__init__.py
new file mode 100644
index 000000000..752d27cec
--- /dev/null
+++ b/python/ray/rllib/env/__init__.py
@@ -0,0 +1,9 @@
+from ray.rllib.env.async_vector_env import AsyncVectorEnv
+from ray.rllib.env.multi_agent_env import MultiAgentEnv
+from ray.rllib.env.serving_env import ServingEnv
+from ray.rllib.env.vector_env import VectorEnv
+from ray.rllib.env.env_context import EnvContext
+
+__all__ = [
+ "AsyncVectorEnv", "MultiAgentEnv", "ServingEnv", "VectorEnv", "EnvContext"
+]
diff --git a/python/ray/rllib/utils/async_vector_env.py b/python/ray/rllib/env/async_vector_env.py
similarity index 98%
rename from python/ray/rllib/utils/async_vector_env.py
rename to python/ray/rllib/env/async_vector_env.py
index 268a7896c..fcd661cb9 100644
--- a/python/ray/rllib/utils/async_vector_env.py
+++ b/python/ray/rllib/env/async_vector_env.py
@@ -2,9 +2,9 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
-from ray.rllib.utils.serving_env import ServingEnv
-from ray.rllib.utils.vector_env import VectorEnv
-from ray.rllib.utils.multi_agent_env import MultiAgentEnv
+from ray.rllib.env.serving_env import ServingEnv
+from ray.rllib.env.vector_env import VectorEnv
+from ray.rllib.env.multi_agent_env import MultiAgentEnv
class AsyncVectorEnv(object):
@@ -84,7 +84,8 @@ class AsyncVectorEnv(object):
The returns are two-level dicts mapping from env_id to a dict of
agent_id to values. The number of agents and envs can vary over time.
- Returns:
+ Returns
+ -------
obs (dict): New observations for each ready agent.
rewards (dict): Reward values for each ready agent. If the
episode is just started, the value will be None.
@@ -95,6 +96,7 @@ class AsyncVectorEnv(object):
that happens, there will be an entry in this dict that contains
the taken action. There is no need to send_actions() for agents
that have already chosen off-policy actions.
+
"""
raise NotImplementedError
diff --git a/python/ray/rllib/utils/atari_wrappers.py b/python/ray/rllib/env/atari_wrappers.py
similarity index 100%
rename from python/ray/rllib/utils/atari_wrappers.py
rename to python/ray/rllib/env/atari_wrappers.py
diff --git a/python/ray/rllib/utils/env_context.py b/python/ray/rllib/env/env_context.py
similarity index 100%
rename from python/ray/rllib/utils/env_context.py
rename to python/ray/rllib/env/env_context.py
diff --git a/python/ray/rllib/utils/multi_agent_env.py b/python/ray/rllib/env/multi_agent_env.py
similarity index 91%
rename from python/ray/rllib/utils/multi_agent_env.py
rename to python/ray/rllib/env/multi_agent_env.py
index 9a3015fff..42f7cee8c 100644
--- a/python/ray/rllib/utils/multi_agent_env.py
+++ b/python/ray/rllib/env/multi_agent_env.py
@@ -6,7 +6,8 @@ from __future__ import print_function
class MultiAgentEnv(object):
"""An environment that hosts multiple independent agents.
- Agents are identified by (string) agent ids.
+ Agents are identified by (string) agent ids. Note that these "agents" are
+ not to be confused with RLlib agents.
Examples:
>>> env = MyMultiAgentEnv()
@@ -49,7 +50,8 @@ class MultiAgentEnv(object):
The returns are dicts mapping from agent_id strings to values. The
number of agents in the env can vary over time.
- Returns:
+ Returns
+ -------
obs (dict): New observations for each ready agent.
rewards (dict): Reward values for each ready agent. If the
episode is just started, the value will be None.
diff --git a/python/ray/rllib/utils/serving_env.py b/python/ray/rllib/env/serving_env.py
similarity index 100%
rename from python/ray/rllib/utils/serving_env.py
rename to python/ray/rllib/env/serving_env.py
diff --git a/python/ray/rllib/utils/vector_env.py b/python/ray/rllib/env/vector_env.py
similarity index 98%
rename from python/ray/rllib/utils/vector_env.py
rename to python/ray/rllib/env/vector_env.py
index 926048c48..ef57be859 100644
--- a/python/ray/rllib/utils/vector_env.py
+++ b/python/ray/rllib/env/vector_env.py
@@ -14,7 +14,7 @@ class VectorEnv(object):
Attributes:
action_space (gym.Space): Action space of individual envs.
observation_space (gym.Space): Observation space of individual envs.
- num_envs (int): Number of envs to batch over.
+ num_envs (int): Number of envs in this vector env.
"""
@staticmethod
diff --git a/python/ray/rllib/es/__init__.py b/python/ray/rllib/es/__init__.py
deleted file mode 100644
index b459494a9..000000000
--- a/python/ray/rllib/es/__init__.py
+++ /dev/null
@@ -1,3 +0,0 @@
-from ray.rllib.es.es import (ESAgent, DEFAULT_CONFIG)
-
-__all__ = ["ESAgent", "DEFAULT_CONFIG"]
diff --git a/python/ray/rllib/evaluation/__init__.py b/python/ray/rllib/evaluation/__init__.py
new file mode 100644
index 000000000..bc6acb472
--- /dev/null
+++ b/python/ray/rllib/evaluation/__init__.py
@@ -0,0 +1,14 @@
+from ray.rllib.evaluation.common_policy_evaluator import CommonPolicyEvaluator
+from ray.rllib.evaluation.interface import PolicyEvaluator
+from ray.rllib.evaluation.policy_graph import PolicyGraph
+from ray.rllib.evaluation.tf_policy_graph import TFPolicyGraph
+from ray.rllib.evaluation.torch_policy_graph import TorchPolicyGraph
+from ray.rllib.evaluation.sample_batch import SampleBatch, MultiAgentBatch, \
+ SampleBatchBuilder, MultiAgentSampleBatchBuilder
+from ray.rllib.evaluation.sampler import SyncSampler, AsyncSampler
+
+__all__ = [
+ "PolicyEvaluator", "CommonPolicyEvaluator", "PolicyGraph", "TFPolicyGraph",
+ "TorchPolicyGraph", "SampleBatch", "MultiAgentBatch", "SampleBatchBuilder",
+ "MultiAgentSampleBatchBuilder", "SyncSampler", "AsyncSampler",
+]
diff --git a/python/ray/rllib/utils/common_policy_evaluator.py b/python/ray/rllib/evaluation/common_policy_evaluator.py
similarity index 82%
rename from python/ray/rllib/utils/common_policy_evaluator.py
rename to python/ray/rllib/evaluation/common_policy_evaluator.py
index c25ad30b0..76cefe9f9 100644
--- a/python/ray/rllib/utils/common_policy_evaluator.py
+++ b/python/ray/rllib/evaluation/common_policy_evaluator.py
@@ -2,112 +2,75 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
-import collections
import gym
-import numpy as np
import pickle
import tensorflow as tf
import ray
from ray.rllib.models import ModelCatalog
-from ray.rllib.optimizers.policy_evaluator import PolicyEvaluator
-from ray.rllib.optimizers.sample_batch import MultiAgentBatch, \
+from ray.rllib.env.async_vector_env import AsyncVectorEnv
+from ray.rllib.env.atari_wrappers import wrap_deepmind, is_atari
+from ray.rllib.env.env_context import EnvContext
+from ray.rllib.env.serving_env import ServingEnv
+from ray.rllib.env.vector_env import VectorEnv
+from ray.rllib.env.multi_agent_env import MultiAgentEnv
+from ray.rllib.evaluation.interface import PolicyEvaluator
+from ray.rllib.evaluation.sample_batch import MultiAgentBatch, \
DEFAULT_POLICY_ID
-from ray.rllib.utils.async_vector_env import AsyncVectorEnv
-from ray.rllib.utils.atari_wrappers import wrap_deepmind, is_atari
+from ray.rllib.evaluation.sampler import AsyncSampler, SyncSampler
from ray.rllib.utils.compression import pack
-from ray.rllib.utils.env_context import EnvContext
from ray.rllib.utils.filter import get_filter
-from ray.rllib.utils.multi_agent_env import MultiAgentEnv
-from ray.rllib.utils.policy_graph import PolicyGraph
-from ray.rllib.utils.sampler import AsyncSampler, SyncSampler
-from ray.rllib.utils.serving_env import ServingEnv
-from ray.rllib.utils.tf_policy_graph import TFPolicyGraph
+from ray.rllib.evaluation.policy_graph import PolicyGraph
+from ray.rllib.evaluation.tf_policy_graph import TFPolicyGraph
from ray.rllib.utils.tf_run_builder import TFRunBuilder
-from ray.rllib.utils.vector_env import VectorEnv
-from ray.tune.result import TrainingResult
-
-
-def collect_metrics(local_evaluator, remote_evaluators=[]):
- """Gathers episode metrics from CommonPolicyEvaluator instances."""
-
- episode_rewards = []
- episode_lengths = []
- policy_rewards = collections.defaultdict(list)
- metric_lists = ray.get(
- [a.apply.remote(lambda ev: ev.sampler.get_metrics())
- for a in remote_evaluators])
- metric_lists.append(local_evaluator.sampler.get_metrics())
- for metrics in metric_lists:
- for episode in metrics:
- episode_lengths.append(episode.episode_length)
- episode_rewards.append(episode.episode_reward)
- for (_, policy_id), reward in episode.agent_rewards.items():
- policy_rewards[policy_id].append(reward)
- if episode_rewards:
- min_reward = min(episode_rewards)
- max_reward = max(episode_rewards)
- else:
- min_reward = float('nan')
- max_reward = float('nan')
- avg_reward = np.mean(episode_rewards)
- avg_length = np.mean(episode_lengths)
- timesteps = np.sum(episode_lengths)
-
- for policy_id, rewards in policy_rewards.copy().items():
- policy_rewards[policy_id] = np.mean(rewards)
-
- return TrainingResult(
- episode_reward_max=max_reward,
- episode_reward_min=min_reward,
- episode_reward_mean=avg_reward,
- episode_len_mean=avg_length,
- episodes_total=len(episode_lengths),
- timesteps_this_iter=timesteps,
- policy_reward_mean=dict(policy_rewards))
class CommonPolicyEvaluator(PolicyEvaluator):
- """Policy evaluator implementation that operates on a rllib.PolicyGraph.
+ """Common ``PolicyEvaluator`` implementation that wraps a ``PolicyGraph``.
- TODO: multi-gpu
+ This class wraps a policy graph instance and an environment class to
+ collect experiences from the environment. You can create many replicas of
+ this class as Ray actors to scale RL training.
+
+ This class supports vectorized and multi-agent policy evaluation (e.g.,
+ VectorEnv, MultiAgentEnv).
Examples:
- # Create a policy evaluator and using it to collect experiences.
+ >>> # Create a policy evaluator and use it to collect experiences.
>>> evaluator = CommonPolicyEvaluator(
- env_creator=lambda _: gym.make("CartPole-v0"),
- policy_graph=PGPolicyGraph)
+ ... env_creator=lambda _: gym.make("CartPole-v0"),
+ ... policy_graph=PGPolicyGraph)
>>> print(evaluator.sample())
SampleBatch({
"obs": [[...]], "actions": [[...]], "rewards": [[...]],
"dones": [[...]], "new_obs": [[...]]})
- # Creating policy evaluators using optimizer_cls.make().
+ >>> # Create policy evaluators using optimizer_cls.make().
>>> optimizer = SyncSamplesOptimizer.make(
- evaluator_cls=CommonPolicyEvaluator,
- evaluator_args={
- "env_creator": lambda _: gym.make("CartPole-v0"),
- "policy_graph": PGPolicyGraph,
- },
- num_workers=10)
+ ... evaluator_cls=CommonPolicyEvaluator,
+ ... evaluator_args={
+ ... "env_creator": lambda _: gym.make("CartPole-v0"),
+ ... "policy_graph": PGPolicyGraph,
+ ... },
+ ... num_workers=10)
>>> for _ in range(10): optimizer.step()
- # Creating a multi-agent policy evaluator
+ >>> # Create a multi-agent policy evaluator.
>>> evaluator = CommonPolicyEvaluator(
- env_creator=lambda _: MultiAgentTrafficGrid(num_cars=25),
- policy_graph={
- # Use an ensemble of two policies for car agents
- "car_policy1":
- (PGPolicyGraph, Box(...), Discrete(...), {"gamma": 0.99}),
- "car_policy2":
- (PGPolicyGraph, Box(...), Discrete(...), {"gamma": 0.95}),
- # Use a single shared policy for all traffic lights
- "traffic_light_policy":
- (PGPolicyGraph, Box(...), Discrete(...), {}),
- },
- policy_mapping_fn=lambda agent_id:
- random.choice(["car_policy1", "car_policy2"])
- if agent_id.startswith("car_") else "traffic_light_policy")
+ ... env_creator=lambda _: MultiAgentTrafficGrid(num_cars=25),
+ ... policy_graphs={
+ ... # Use an ensemble of two policies for car agents
+ ... "car_policy1":
+ ... (PGPolicyGraph, Box(...), Discrete(...), {"gamma": 0.99}),
+ ... "car_policy2":
+ ... (PGPolicyGraph, Box(...), Discrete(...), {"gamma": 0.95}),
+ ... # Use a single shared policy for all traffic lights
+ ... "traffic_light_policy":
+ ... (PGPolicyGraph, Box(...), Discrete(...), {}),
+ ... },
+ ... policy_mapping_fn=lambda agent_id:
+ ... random.choice(["car_policy1", "car_policy2"])
+ ... if agent_id.startswith("car_") else "traffic_light_policy")
>>> print(evaluator.sample().keys())
MultiAgentBatch({
"car_policy1": SampleBatch(...),
diff --git a/python/ray/rllib/optimizers/policy_evaluator.py b/python/ray/rllib/evaluation/interface.py
similarity index 88%
rename from python/ray/rllib/optimizers/policy_evaluator.py
rename to python/ray/rllib/evaluation/interface.py
index e3bf9518e..f419bf0d6 100644
--- a/python/ray/rllib/optimizers/policy_evaluator.py
+++ b/python/ray/rllib/evaluation/interface.py
@@ -6,12 +6,9 @@ import os
class PolicyEvaluator(object):
- """Algorithms implement this interface to leverage policy optimizers.
+ """This is the interface between policy optimizers and policy evaluation.
- Policy evaluators are the "data plane" of an algorithm.
-
- Any algorithm that implements Evaluator can plug in any PolicyOptimizer,
- e.g. async SGD, Ape-X, local multi-GPU SGD, etc.
+ See also: CommonPolicyEvaluator
"""
def sample(self):
@@ -21,7 +18,7 @@ class PolicyEvaluator(object):
Returns:
SampleBatch|MultiAgentBatch: A columnar batch of experiences
- (e.g., tensors), or a multi-agent batch.
+ (e.g., tensors), or a multi-agent batch.
Examples:
>>> print(ev.sample())
@@ -37,9 +34,9 @@ class PolicyEvaluator(object):
Returns:
(grads, info): A list of gradients that can be applied on a
- compatible evaluator. In the multi-agent case, returns a dict
- of gradients keyed by policy graph ids. An info dictionary of
- extra metadata is also returned.
+ compatible evaluator. In the multi-agent case, returns a dict
+ of gradients keyed by policy graph ids. An info dictionary of
+ extra metadata is also returned.
Examples:
>>> batch = ev.sample()
diff --git a/python/ray/rllib/evaluation/metrics.py b/python/ray/rllib/evaluation/metrics.py
new file mode 100644
index 000000000..5c5f0cfba
--- /dev/null
+++ b/python/ray/rllib/evaluation/metrics.py
@@ -0,0 +1,48 @@
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+import collections
+
+import ray
+from ray.tune.result import TrainingResult
+
+
+def collect_metrics(local_evaluator, remote_evaluators=[]):
+ """Gathers episode metrics from CommonPolicyEvaluator instances."""
+
+ episode_rewards = []
+ episode_lengths = []
+ policy_rewards = collections.defaultdict(list)
+ metric_lists = ray.get(
+ [a.apply.remote(lambda ev: ev.sampler.get_metrics())
+ for a in remote_evaluators])
+ metric_lists.append(local_evaluator.sampler.get_metrics())
+ for metrics in metric_lists:
+ for episode in metrics:
+ episode_lengths.append(episode.episode_length)
+ episode_rewards.append(episode.episode_reward)
+ for (_, policy_id), reward in episode.agent_rewards.items():
+ policy_rewards[policy_id].append(reward)
+ if episode_rewards:
+ min_reward = min(episode_rewards)
+ max_reward = max(episode_rewards)
+ else:
+ min_reward = float('nan')
+ max_reward = float('nan')
+ avg_reward = np.mean(episode_rewards)
+ avg_length = np.mean(episode_lengths)
+ timesteps = np.sum(episode_lengths)
+
+ for policy_id, rewards in policy_rewards.copy().items():
+ policy_rewards[policy_id] = np.mean(rewards)
+
+ return TrainingResult(
+ episode_reward_max=max_reward,
+ episode_reward_min=min_reward,
+ episode_reward_mean=avg_reward,
+ episode_len_mean=avg_length,
+ episodes_total=len(episode_lengths),
+ timesteps_this_iter=timesteps,
+ policy_reward_mean=dict(policy_rewards))
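Note on `collect_metrics`: the aggregation it performs can be tried without a Ray cluster. The following is a minimal sketch, not the RLlib implementation: `Episode` and `summarize` are hypothetical names introduced here, the records are hand-written rather than pulled from `sampler.get_metrics()`, and a plain dict stands in for `TrainingResult`. It also guards the empty case explicitly for the means, whereas the moved code relies on `np.mean([])` returning nan.

```python
import collections

import numpy as np

# Hypothetical stand-in for the per-episode records that
# sampler.get_metrics() yields; only the fields collect_metrics reads.
Episode = collections.namedtuple(
    "Episode", ["episode_length", "episode_reward", "agent_rewards"])


def summarize(episodes):
    """Aggregate episode records the way collect_metrics does, minus Ray."""
    lengths = [e.episode_length for e in episodes]
    rewards = [e.episode_reward for e in episodes]
    policy_rewards = collections.defaultdict(list)
    for e in episodes:
        # agent_rewards is keyed by (agent_id, policy_id) tuples.
        for (_, policy_id), reward in e.agent_rewards.items():
            policy_rewards[policy_id].append(reward)
    nan = float("nan")
    return {
        "episode_reward_max": max(rewards) if rewards else nan,
        "episode_reward_min": min(rewards) if rewards else nan,
        # Explicit empty-case guard (an assumption of this sketch).
        "episode_reward_mean": float(np.mean(rewards)) if rewards else nan,
        "episode_len_mean": float(np.mean(lengths)) if lengths else nan,
        "episodes_total": len(lengths),
        "timesteps_this_iter": int(np.sum(lengths)),
        "policy_reward_mean": {
            pid: float(np.mean(r)) for pid, r in policy_rewards.items()},
    }


episodes = [
    Episode(10, 4.0, {("car_0", "car_policy1"): 3.0,
                      ("light_0", "traffic_light_policy"): 1.0}),
    Episode(20, 8.0, {("car_0", "car_policy1"): 5.0,
                      ("light_0", "traffic_light_policy"): 3.0}),
]
summary = summarize(episodes)
```

Per-policy means are computed across episodes, so a policy that controls several agents contributes one entry per agent per episode.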
diff --git a/python/ray/rllib/utils/policy_graph.py b/python/ray/rllib/evaluation/policy_graph.py
similarity index 100%
rename from python/ray/rllib/utils/policy_graph.py
rename to python/ray/rllib/evaluation/policy_graph.py
diff --git a/python/ray/rllib/utils/postprocessing.py b/python/ray/rllib/evaluation/postprocessing.py
similarity index 96%
rename from python/ray/rllib/utils/postprocessing.py
rename to python/ray/rllib/evaluation/postprocessing.py
index 1e2f2ebef..667d8eea4 100644
--- a/python/ray/rllib/utils/postprocessing.py
+++ b/python/ray/rllib/evaluation/postprocessing.py
@@ -4,7 +4,7 @@ from __future__ import print_function
import numpy as np
import scipy.signal
-from ray.rllib.optimizers import SampleBatch
+from ray.rllib.evaluation.sample_batch import SampleBatch
def discount(x, gamma):
diff --git a/python/ray/rllib/optimizers/sample_batch.py b/python/ray/rllib/evaluation/sample_batch.py
similarity index 100%
rename from python/ray/rllib/optimizers/sample_batch.py
rename to python/ray/rllib/evaluation/sample_batch.py
diff --git a/python/ray/rllib/utils/sampler.py b/python/ray/rllib/evaluation/sampler.py
similarity index 99%
rename from python/ray/rllib/utils/sampler.py
rename to python/ray/rllib/evaluation/sampler.py
index 1d8509179..4ea09652c 100644
--- a/python/ray/rllib/utils/sampler.py
+++ b/python/ray/rllib/evaluation/sampler.py
@@ -7,9 +7,9 @@ import numpy as np
import six.moves.queue as queue
import threading
-from ray.rllib.optimizers.sample_batch import MultiAgentSampleBatchBuilder, \
+from ray.rllib.evaluation.sample_batch import MultiAgentSampleBatchBuilder, \
MultiAgentBatch
-from ray.rllib.utils.async_vector_env import AsyncVectorEnv
+from ray.rllib.env.async_vector_env import AsyncVectorEnv
from ray.rllib.utils.tf_run_builder import TFRunBuilder
diff --git a/python/ray/rllib/utils/tf_policy_graph.py b/python/ray/rllib/evaluation/tf_policy_graph.py
similarity index 99%
rename from python/ray/rllib/utils/tf_policy_graph.py
rename to python/ray/rllib/evaluation/tf_policy_graph.py
index 23e6bf02a..0df9d9935 100644
--- a/python/ray/rllib/utils/tf_policy_graph.py
+++ b/python/ray/rllib/evaluation/tf_policy_graph.py
@@ -5,8 +5,8 @@ from __future__ import print_function
import tensorflow as tf
import ray
+from ray.rllib.evaluation.policy_graph import PolicyGraph
from ray.rllib.models.lstm import chop_into_sequences
-from ray.rllib.utils.policy_graph import PolicyGraph
from ray.rllib.utils.tf_run_builder import TFRunBuilder
diff --git a/python/ray/rllib/utils/torch_policy_graph.py b/python/ray/rllib/evaluation/torch_policy_graph.py
similarity index 88%
rename from python/ray/rllib/utils/torch_policy_graph.py
rename to python/ray/rllib/evaluation/torch_policy_graph.py
index 96114cc5c..778eeff2e 100644
--- a/python/ray/rllib/utils/torch_policy_graph.py
+++ b/python/ray/rllib/evaluation/torch_policy_graph.py
@@ -5,11 +5,14 @@ from __future__ import print_function
import numpy as np
from threading import Lock
-import torch
-import torch.nn.functional as F
+try:
+ import torch
+ import torch.nn.functional as F
+ from ray.rllib.models.pytorch.misc import var_to_np
+except ImportError:
+ pass # soft dep
-from ray.rllib.models.pytorch.misc import var_to_np
-from ray.rllib.utils.policy_graph import PolicyGraph
+from ray.rllib.evaluation.policy_graph import PolicyGraph
class TorchPolicyGraph(PolicyGraph):
@@ -35,11 +38,12 @@ class TorchPolicyGraph(PolicyGraph):
observation_space (gym.Space): observation space of the policy.
action_space (gym.Space): action space of the policy.
model (nn.Module): PyTorch policy module. Given observations as
- input, this module must a list of outputs where the first item
- are action logits, and the remainder can be any value.
+ input, this module must return a list of outputs where the
+ first item is action logits, and the rest can be any value.
loss (nn.Module): Loss defined as a PyTorch module. The inputs for
this module are defined by the `loss_inputs` param. This module
- returns a single scalar loss.
+ returns a single scalar loss. Note that this module should
+ use the model module internally.
loss_inputs (list): List of SampleBatch columns that will be
passed to the loss module's forward() function when computing
the loss. For example, ["obs", "action", "advantages"].
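Note on the soft `torch` dependency: the `try`/`except ImportError: pass` guard above lets the module import without PyTorch, but a later use of `torch` would then fail with a `NameError`. One possible variation, sketched below under the assumption that a clearer error at first use is wanted, binds the name to `None` and checks it in a helper. `require_torch` and `torch_available` are hypothetical names, not part of RLlib.

```python
try:
    import torch  # optional: only needed for Torch-backed policy graphs
except ImportError:
    torch = None  # soft dependency; defer the failure to first use


def torch_available():
    """Report whether the optional PyTorch import succeeded."""
    return torch is not None


def require_torch():
    """Return the torch module, or raise a descriptive error if missing."""
    if torch is None:
        raise ImportError(
            "PyTorch is required for this feature but is not installed.")
    return torch
```

Code paths that need PyTorch would call `require_torch()` before touching the module, so users without it installed see an `ImportError` that names the missing package.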
diff --git a/python/ray/rllib/examples/legacy_multiagent/multiagent_mountaincar.py b/python/ray/rllib/examples/legacy_multiagent/multiagent_mountaincar.py
index e3e20344b..1e97264a5 100644
--- a/python/ray/rllib/examples/legacy_multiagent/multiagent_mountaincar.py
+++ b/python/ray/rllib/examples/legacy_multiagent/multiagent_mountaincar.py
@@ -7,7 +7,7 @@ import gym
from gym.envs.registration import register
import ray
-import ray.rllib.ppo as ppo
+import ray.rllib.agents.ppo as ppo
from ray.tune.registry import register_env
env_name = "MultiAgentMountainCarEnv"
diff --git a/python/ray/rllib/examples/legacy_multiagent/multiagent_pendulum.py b/python/ray/rllib/examples/legacy_multiagent/multiagent_pendulum.py
index baf5bc29a..c78b5d601 100644
--- a/python/ray/rllib/examples/legacy_multiagent/multiagent_pendulum.py
+++ b/python/ray/rllib/examples/legacy_multiagent/multiagent_pendulum.py
@@ -7,7 +7,7 @@ import gym
from gym.envs.registration import register
import ray
-import ray.rllib.ppo as ppo
+import ray.rllib.agents.ppo as ppo
from ray.tune.registry import register_env
env_name = "MultiAgentPendulumEnv"
diff --git a/python/ray/rllib/examples/multiagent_cartpole.py b/python/ray/rllib/examples/multiagent_cartpole.py
index 158fec293..75c678c53 100644
--- a/python/ray/rllib/examples/multiagent_cartpole.py
+++ b/python/ray/rllib/examples/multiagent_cartpole.py
@@ -18,8 +18,8 @@ import gym
import random
import ray
-from ray.rllib.pg.pg import PGAgent
-from ray.rllib.pg.pg_policy_graph import PGPolicyGraph
+from ray.rllib.agents.pg.pg import PGAgent
+from ray.rllib.agents.pg.pg_policy_graph import PGPolicyGraph
from ray.rllib.test.test_multi_agent_env import MultiCartpole
from ray.tune.logger import pretty_print
from ray.tune.registry import register_env
diff --git a/python/ray/rllib/examples/serving/cartpole_server.py b/python/ray/rllib/examples/serving/cartpole_server.py
index ffbf9f6c6..7e6d79996 100755
--- a/python/ray/rllib/examples/serving/cartpole_server.py
+++ b/python/ray/rllib/examples/serving/cartpole_server.py
@@ -13,8 +13,8 @@ import os
from gym import spaces
import ray
-from ray.rllib.dqn import DQNAgent
-from ray.rllib.utils.serving_env import ServingEnv
+from ray.rllib.agents.dqn import DQNAgent
+from ray.rllib.env.serving_env import ServingEnv
from ray.rllib.utils.policy_server import PolicyServer
from ray.tune.logger import pretty_print
from ray.tune.registry import register_env
diff --git a/python/ray/rllib/models/__init__.py b/python/ray/rllib/models/__init__.py
index d381985dc..8ece52228 100644
--- a/python/ray/rllib/models/__init__.py
+++ b/python/ray/rllib/models/__init__.py
@@ -2,11 +2,11 @@ from ray.rllib.models.catalog import ModelCatalog
from ray.rllib.models.action_dist import (ActionDistribution, Categorical,
DiagGaussian, Deterministic)
from ray.rllib.models.model import Model
+from ray.rllib.models.preprocessors import Preprocessor
from ray.rllib.models.fcnet import FullyConnectedNetwork
from ray.rllib.models.lstm import LSTM
-from ray.rllib.models.multiagentfcnet import MultiAgentFullyConnectedNetwork
__all__ = ["ActionDistribution", "ActionDistribution", "Categorical",
"DiagGaussian", "Deterministic", "ModelCatalog", "Model",
- "FullyConnectedNetwork", "LSTM", "MultiAgentFullyConnectedNetwork"]
+ "Preprocessor", "FullyConnectedNetwork", "LSTM"]
diff --git a/python/ray/rllib/optimizers/__init__.py b/python/ray/rllib/optimizers/__init__.py
index 70d70a197..eaddfd0cd 100644
--- a/python/ray/rllib/optimizers/__init__.py
+++ b/python/ray/rllib/optimizers/__init__.py
@@ -1,15 +1,12 @@
+from ray.rllib.optimizers.policy_optimizer import PolicyOptimizer
from ray.rllib.optimizers.async_samples_optimizer import AsyncSamplesOptimizer
from ray.rllib.optimizers.async_gradients_optimizer import \
AsyncGradientsOptimizer
from ray.rllib.optimizers.sync_samples_optimizer import SyncSamplesOptimizer
from ray.rllib.optimizers.sync_replay_optimizer import SyncReplayOptimizer
from ray.rllib.optimizers.multi_gpu_optimizer import LocalMultiGPUOptimizer
-from ray.rllib.optimizers.sample_batch import SampleBatch, MultiAgentBatch
-from ray.rllib.optimizers.policy_evaluator import PolicyEvaluator, \
- TFMultiGPUSupport
__all__ = [
- "AsyncSamplesOptimizer", "AsyncGradientsOptimizer", "SyncSamplesOptimizer",
- "SyncReplayOptimizer", "LocalMultiGPUOptimizer", "SampleBatch",
- "PolicyEvaluator", "TFMultiGPUSupport", "MultiAgentBatch"]
+ "PolicyOptimizer", "AsyncSamplesOptimizer", "AsyncGradientsOptimizer",
+ "SyncSamplesOptimizer", "SyncReplayOptimizer", "LocalMultiGPUOptimizer"]
diff --git a/python/ray/rllib/optimizers/async_samples_optimizer.py b/python/ray/rllib/optimizers/async_samples_optimizer.py
index 8e4772909..dfc52e1d8 100644
--- a/python/ray/rllib/optimizers/async_samples_optimizer.py
+++ b/python/ray/rllib/optimizers/async_samples_optimizer.py
@@ -17,7 +17,7 @@ from six.moves import queue
import ray
from ray.rllib.optimizers.policy_optimizer import PolicyOptimizer
from ray.rllib.optimizers.replay_buffer import PrioritizedReplayBuffer
-from ray.rllib.optimizers.sample_batch import SampleBatch
+from ray.rllib.evaluation.sample_batch import SampleBatch
from ray.rllib.utils.actors import TaskPool, create_colocated
from ray.rllib.utils.timer import TimerStat
from ray.rllib.utils.window_stat import WindowStat
diff --git a/python/ray/rllib/optimizers/multi_gpu_optimizer.py b/python/ray/rllib/optimizers/multi_gpu_optimizer.py
index f1a80d749..1562b96ea 100644
--- a/python/ray/rllib/optimizers/multi_gpu_optimizer.py
+++ b/python/ray/rllib/optimizers/multi_gpu_optimizer.py
@@ -8,9 +8,9 @@ import os
import tensorflow as tf
import ray
+from ray.rllib.evaluation.tf_policy_graph import TFPolicyGraph
from ray.rllib.optimizers.policy_optimizer import PolicyOptimizer
from ray.rllib.optimizers.multi_gpu_impl import LocalSyncParallelOptimizer
-from ray.rllib.utils.tf_policy_graph import TFPolicyGraph
from ray.rllib.utils.timer import TimerStat
@@ -87,7 +87,7 @@ class LocalMultiGPUOptimizer(PolicyOptimizer):
with self.sample_timer:
if self.remote_evaluators:
# TODO(rliaw): remove when refactoring
- from ray.rllib.ppo.rollout import collect_samples
+ from ray.rllib.agents.ppo.rollout import collect_samples
samples = collect_samples(self.remote_evaluators,
self.timesteps_per_batch)
else:
diff --git a/python/ray/rllib/optimizers/policy_optimizer.py b/python/ray/rllib/optimizers/policy_optimizer.py
index f44aa4847..4a30b7521 100644
--- a/python/ray/rllib/optimizers/policy_optimizer.py
+++ b/python/ray/rllib/optimizers/policy_optimizer.py
@@ -3,7 +3,7 @@ from __future__ import division
from __future__ import print_function
import ray
-from ray.rllib.optimizers.sample_batch import MultiAgentBatch
+from ray.rllib.evaluation.sample_batch import MultiAgentBatch
class PolicyOptimizer(object):
@@ -31,34 +31,6 @@ class PolicyOptimizer(object):
evaluators created by this optimizer.
"""
- @classmethod
- def make(
- cls, evaluator_cls, evaluator_args, num_workers, optimizer_config,
- evaluator_resources={"num_cpus": None}):
- """Create evaluators and an optimizer instance using those evaluators.
-
- Args:
- evaluator_cls (class): Python class of the evaluators to create.
- evaluator_args (list|dict): Constructor args for the evaluators.
- num_workers (int): Number of remote evaluators to create in
- addition to a local evaluator. This can be zero or greater.
- optimizer_config (dict): Keyword arguments to pass to the
- optimizer class constructor.
- """
-
- remote_cls = ray.remote(**evaluator_resources)(evaluator_cls)
- if isinstance(evaluator_args, list):
- local_evaluator = evaluator_cls(*evaluator_args)
- remote_evaluators = [
- remote_cls.remote(*evaluator_args)
- for _ in range(num_workers)]
- else:
- local_evaluator = evaluator_cls(**evaluator_args)
- remote_evaluators = [
- remote_cls.remote(worker_index=i+1, **evaluator_args)
- for i in range(num_workers)]
- return cls(optimizer_config, local_evaluator, remote_evaluators)
-
def __init__(self, config, local_evaluator, remote_evaluators):
"""Create an optimizer instance.
diff --git a/python/ray/rllib/optimizers/sync_replay_optimizer.py b/python/ray/rllib/optimizers/sync_replay_optimizer.py
index 771695472..1058b0d5a 100644
--- a/python/ray/rllib/optimizers/sync_replay_optimizer.py
+++ b/python/ray/rllib/optimizers/sync_replay_optimizer.py
@@ -9,7 +9,7 @@ import ray
from ray.rllib.optimizers.replay_buffer import ReplayBuffer, \
PrioritizedReplayBuffer
from ray.rllib.optimizers.policy_optimizer import PolicyOptimizer
-from ray.rllib.optimizers.sample_batch import SampleBatch, DEFAULT_POLICY_ID, \
+from ray.rllib.evaluation.sample_batch import SampleBatch, DEFAULT_POLICY_ID, \
MultiAgentBatch
from ray.rllib.utils.compression import pack_if_needed
from ray.rllib.utils.filter import RunningStat
diff --git a/python/ray/rllib/optimizers/sync_samples_optimizer.py b/python/ray/rllib/optimizers/sync_samples_optimizer.py
index 2995a1034..c1c8e7c1a 100644
--- a/python/ray/rllib/optimizers/sync_samples_optimizer.py
+++ b/python/ray/rllib/optimizers/sync_samples_optimizer.py
@@ -4,7 +4,7 @@ from __future__ import print_function
import ray
from ray.rllib.optimizers.policy_optimizer import PolicyOptimizer
-from ray.rllib.optimizers.sample_batch import SampleBatch
+from ray.rllib.evaluation.sample_batch import SampleBatch
from ray.rllib.utils.filter import RunningStat
from ray.rllib.utils.timer import TimerStat
diff --git a/python/ray/rllib/pg/__init__.py b/python/ray/rllib/pg/__init__.py
deleted file mode 100644
index fa566536d..000000000
--- a/python/ray/rllib/pg/__init__.py
+++ /dev/null
@@ -1,3 +0,0 @@
-from ray.rllib.pg.pg import PGAgent, DEFAULT_CONFIG
-
-__all__ = ["PGAgent", "DEFAULT_CONFIG"]
diff --git a/python/ray/rllib/pg/pg.py b/python/ray/rllib/pg/pg.py
deleted file mode 100644
index e64af9fe6..000000000
--- a/python/ray/rllib/pg/pg.py
+++ /dev/null
@@ -1,86 +0,0 @@
-from __future__ import absolute_import
-from __future__ import division
-from __future__ import print_function
-
-from ray.rllib.agent import Agent
-from ray.rllib.optimizers import SyncSamplesOptimizer
-from ray.rllib.pg.pg_policy_graph import PGPolicyGraph
-from ray.rllib.utils.common_policy_evaluator import CommonPolicyEvaluator, \
- collect_metrics
-from ray.tune.trial import Resources
-
-
-DEFAULT_CONFIG = {
- # Number of workers (excluding master)
- "num_workers": 0,
- # Number of environments to evaluate vectorwise per worker.
- "num_envs": 1,
- # Size of rollout batch
- "batch_size": 512,
- # Discount factor of MDP
- "gamma": 0.99,
- # Number of steps after which the rollout gets cut
- "horizon": 500,
- # Learning rate
- "lr": 0.0004,
- # Arguments to pass to the rllib optimizer
- "optimizer": {},
- # Model parameters
- "model": {"fcnet_hiddens": [128, 128], "max_seq_len": 20},
- # Arguments to pass to the env creator
- "env_config": {},
-
- # === Multiagent ===
- "multiagent": {
- "policy_graphs": {},
- "policy_mapping_fn": None,
- },
-}
-
-
-class PGAgent(Agent):
- """Simple policy gradient agent.
-
- This is an example agent to show how to implement algorithms in RLlib.
- In most cases, you will probably want to use the PPO agent instead.
- """
-
- _agent_name = "PG"
- _default_config = DEFAULT_CONFIG
-
- @classmethod
- def default_resource_request(cls, config):
- cf = dict(cls._default_config, **config)
- return Resources(cpu=1, gpu=0, extra_cpu=cf["num_workers"])
-
- def _init(self):
- self.optimizer = SyncSamplesOptimizer.make(
- evaluator_cls=CommonPolicyEvaluator,
- evaluator_args={
- "env_creator": self.env_creator,
- "policy_graph": (
- self.config["multiagent"]["policy_graphs"] or
- PGPolicyGraph),
- "policy_mapping_fn":
- self.config["multiagent"]["policy_mapping_fn"],
- "batch_steps": self.config["batch_size"],
- "batch_mode": "truncate_episodes",
- "model_config": self.config["model"],
- "env_config": self.config["env_config"],
- "policy_config": self.config,
- "num_envs": self.config["num_envs"],
- },
- num_workers=self.config["num_workers"],
- optimizer_config=self.config["optimizer"])
-
- def _train(self):
- self.optimizer.step()
- return collect_metrics(
- self.optimizer.local_evaluator, self.optimizer.remote_evaluators)
-
- def compute_action(self, observation, state=None):
- if state is None:
- state = []
- return self.local_evaluator.for_policy(
- lambda p: p.compute_single_action(
- observation, state, is_training=False)[0])
diff --git a/python/ray/rllib/ppo/__init__.py b/python/ray/rllib/ppo/__init__.py
deleted file mode 100644
index c039f1248..000000000
--- a/python/ray/rllib/ppo/__init__.py
+++ /dev/null
@@ -1,3 +0,0 @@
-from ray.rllib.ppo.ppo import (PPOAgent, DEFAULT_CONFIG)
-
-__all__ = ["PPOAgent", "DEFAULT_CONFIG"]
diff --git a/python/ray/rllib/rollout.py b/python/ray/rllib/rollout.py
index 09fd52a31..ab647b8f2 100755
--- a/python/ray/rllib/rollout.py
+++ b/python/ray/rllib/rollout.py
@@ -11,8 +11,8 @@ import pickle
import gym
import ray
-from ray.rllib.agent import get_agent_class
-from ray.rllib.dqn.common.wrappers import wrap_dqn
+from ray.rllib.agents.agent import get_agent_class
+from ray.rllib.agents.dqn.common.wrappers import wrap_dqn
from ray.rllib.models import ModelCatalog
EXAMPLE_USAGE = """
diff --git a/python/ray/rllib/test/mock_evaluator.py b/python/ray/rllib/test/mock_evaluator.py
index 4762bb877..711a250e7 100644
--- a/python/ray/rllib/test/mock_evaluator.py
+++ b/python/ray/rllib/test/mock_evaluator.py
@@ -3,7 +3,7 @@ from __future__ import division
from __future__ import print_function
import numpy as np
-from ray.rllib.optimizers import SampleBatch
+from ray.rllib.evaluation import SampleBatch
from ray.rllib.utils.filter import MeanStdFilter
diff --git a/python/ray/rllib/test/test_checkpoint_restore.py b/python/ray/rllib/test/test_checkpoint_restore.py
index fe954423e..f94e08b5a 100644
--- a/python/ray/rllib/test/test_checkpoint_restore.py
+++ b/python/ray/rllib/test/test_checkpoint_restore.py
@@ -7,7 +7,7 @@ from __future__ import print_function
import numpy as np
import ray
-from ray.rllib.agent import get_agent_class
+from ray.rllib.agents.agent import get_agent_class
def get_mean_action(alg, obs):
diff --git a/python/ray/rllib/test/test_common_policy_evaluator.py b/python/ray/rllib/test/test_common_policy_evaluator.py
index 1f6b77956..a86d902bf 100644
--- a/python/ray/rllib/test/test_common_policy_evaluator.py
+++ b/python/ray/rllib/test/test_common_policy_evaluator.py
@@ -7,12 +7,12 @@ import time
import unittest
import ray
-from ray.rllib.pg import PGAgent
-from ray.rllib.utils.common_policy_evaluator import CommonPolicyEvaluator, \
- collect_metrics
-from ray.rllib.utils.policy_graph import PolicyGraph
-from ray.rllib.utils.postprocessing import compute_advantages
-from ray.rllib.utils.vector_env import VectorEnv
+from ray.rllib.agents.pg import PGAgent
+from ray.rllib.evaluation.common_policy_evaluator import CommonPolicyEvaluator
+from ray.rllib.evaluation.metrics import collect_metrics
+from ray.rllib.evaluation.policy_graph import PolicyGraph
+from ray.rllib.evaluation.postprocessing import compute_advantages
+from ray.rllib.env.vector_env import VectorEnv
from ray.tune.registry import register_env
@@ -101,7 +101,8 @@ class TestCommonPolicyEvaluator(unittest.TestCase):
def testQueryEvaluators(self):
register_env("test", lambda _: gym.make("CartPole-v0"))
- pg = PGAgent(env="test", config={"num_workers": 2, "batch_size": 5})
+ pg = PGAgent(
+ env="test", config={"num_workers": 2, "sample_batch_size": 5})
results = pg.optimizer.foreach_evaluator(lambda ev: ev.batch_steps)
results2 = pg.optimizer.foreach_evaluator_with_index(
lambda ev, i: (i, ev.batch_steps))
diff --git a/python/ray/rllib/test/test_evaluators.py b/python/ray/rllib/test/test_evaluators.py
index d2abf1e6d..6d493099e 100644
--- a/python/ray/rllib/test/test_evaluators.py
+++ b/python/ray/rllib/test/test_evaluators.py
@@ -4,7 +4,7 @@ from __future__ import print_function
import unittest
-from ray.rllib.dqn.dqn_policy_graph import adjust_nstep
+from ray.rllib.agents.dqn.dqn_policy_graph import adjust_nstep
class DQNTest(unittest.TestCase):
diff --git a/python/ray/rllib/test/test_multi_agent_env.py b/python/ray/rllib/test/test_multi_agent_env.py
index c058e7714..e1146dcca 100644
--- a/python/ray/rllib/test/test_multi_agent_env.py
+++ b/python/ray/rllib/test/test_multi_agent_env.py
@@ -7,17 +7,17 @@ import random
import unittest
import ray
-from ray.rllib.pg import PGAgent
-from ray.rllib.pg.pg_policy_graph import PGPolicyGraph
-from ray.rllib.dqn.dqn_policy_graph import DQNPolicyGraph
+from ray.rllib.agents.pg import PGAgent
+from ray.rllib.agents.pg.pg_policy_graph import PGPolicyGraph
+from ray.rllib.agents.dqn.dqn_policy_graph import DQNPolicyGraph
from ray.rllib.optimizers import SyncSamplesOptimizer, \
SyncReplayOptimizer, AsyncGradientsOptimizer
from ray.rllib.test.test_common_policy_evaluator import MockEnv, MockEnv2, \
MockPolicyGraph
-from ray.rllib.utils.common_policy_evaluator import CommonPolicyEvaluator, \
- collect_metrics
-from ray.rllib.utils.async_vector_env import _MultiAgentEnvToAsync
-from ray.rllib.utils.multi_agent_env import MultiAgentEnv
+from ray.rllib.evaluation.common_policy_evaluator import CommonPolicyEvaluator
+from ray.rllib.evaluation.metrics import collect_metrics
+from ray.rllib.env.async_vector_env import _MultiAgentEnvToAsync
+from ray.rllib.env.multi_agent_env import MultiAgentEnv
from ray.tune.registry import register_env
diff --git a/python/ray/rllib/test/test_optimizers.py b/python/ray/rllib/test/test_optimizers.py
index a9a109aa3..f3a4fc917 100644
--- a/python/ray/rllib/test/test_optimizers.py
+++ b/python/ray/rllib/test/test_optimizers.py
@@ -8,7 +8,8 @@ import numpy as np
import ray
from ray.rllib.test.mock_evaluator import _MockEvaluator
-from ray.rllib.optimizers import AsyncGradientsOptimizer, SampleBatch
+from ray.rllib.optimizers import AsyncGradientsOptimizer
+from ray.rllib.evaluation import SampleBatch
class AsyncOptimizerTest(unittest.TestCase):
diff --git a/python/ray/rllib/test/test_serving_env.py b/python/ray/rllib/test/test_serving_env.py
index 94b7f8673..5802b66f2 100644
--- a/python/ray/rllib/test/test_serving_env.py
+++ b/python/ray/rllib/test/test_serving_env.py
@@ -9,10 +9,10 @@ import unittest
import uuid
import ray
-from ray.rllib.dqn import DQNAgent
-from ray.rllib.pg import PGAgent
-from ray.rllib.utils.common_policy_evaluator import CommonPolicyEvaluator
-from ray.rllib.utils.serving_env import ServingEnv
+from ray.rllib.agents.dqn import DQNAgent
+from ray.rllib.agents.pg import PGAgent
+from ray.rllib.evaluation.common_policy_evaluator import CommonPolicyEvaluator
+from ray.rllib.env.serving_env import ServingEnv
from ray.rllib.test.test_common_policy_evaluator import BadPolicyGraph, \
MockPolicyGraph, MockEnv
from ray.tune.registry import register_env
diff --git a/python/ray/rllib/test/test_supported_spaces.py b/python/ray/rllib/test/test_supported_spaces.py
index 5c6e8c362..cb14fa93b 100644
--- a/python/ray/rllib/test/test_supported_spaces.py
+++ b/python/ray/rllib/test/test_supported_spaces.py
@@ -7,7 +7,7 @@ from gym.envs.registration import EnvSpec
import numpy as np
import ray
-from ray.rllib.agent import get_agent_class
+from ray.rllib.agents.agent import get_agent_class
from ray.rllib.utils.error import UnsupportedSpaceException
from ray.tune.registry import register_env
@@ -95,7 +95,6 @@ class ModelSupportedSpaces(unittest.TestCase):
check_support(
"PPO",
{"num_workers": 1, "num_sgd_iter": 1, "timesteps_per_batch": 1,
- "devices": ["/cpu:0"], "min_steps_per_task": 1,
"sgd_batchsize": 1},
stats)
check_support(
diff --git a/python/ray/rllib/tuned_examples/hopper-ppo.yaml b/python/ray/rllib/tuned_examples/hopper-ppo.yaml
index d881362ac..27441d394 100644
--- a/python/ray/rllib/tuned_examples/hopper-ppo.yaml
+++ b/python/ray/rllib/tuned_examples/hopper-ppo.yaml
@@ -1,4 +1,12 @@
hopper-ppo:
env: Hopper-v1
run: PPO
- config: {"gamma": 0.995, "kl_coeff": 1.0, "num_sgd_iter": 20, "sgd_stepsize": .0001, "sgd_batchsize": 32768, "devices": ["/gpu:0", "/gpu:1", "/gpu:2", "/gpu:3"], "tf_session_args": {"device_count": {"GPU": 4}, "log_device_placement": false, "allow_soft_placement": true}, "timesteps_per_batch": 160000, "num_workers": 64}
+ config:
+ gamma: 0.995
+ kl_coeff: 1.0
+ num_sgd_iter: 20
+ sgd_stepsize: .0001
+ sgd_batchsize: 32768
+ timesteps_per_batch: 160000
+ num_workers: 64
+ num_gpus: 4
diff --git a/python/ray/rllib/tuned_examples/humanoid-ppo-gae.yaml b/python/ray/rllib/tuned_examples/humanoid-ppo-gae.yaml
index 60055d2aa..5dfbf4315 100644
--- a/python/ray/rllib/tuned_examples/humanoid-ppo-gae.yaml
+++ b/python/ray/rllib/tuned_examples/humanoid-ppo-gae.yaml
@@ -3,5 +3,17 @@ humanoid-ppo-gae:
run: PPO
stop:
episode_reward_mean: 6000
- config: {"lambda": 0.95, "clip_param": 0.2, "kl_coeff": 1.0, "num_sgd_iter": 20, "sgd_stepsize": .0001, "sgd_batchsize": 32768, "horizon": 5000, "devices": ["/gpu:0", "/gpu:1", "/gpu:2", "/gpu:3"], "tf_session_args": {"device_count": {"GPU": 4}, "log_device_placement": false, "allow_soft_placement": true}, "timesteps_per_batch": 320000, "num_workers": 64, "model": {"free_log_std": true}, "write_logs": false}
-
+ config:
+ gamma: 0.995
+ lambda: 0.95
+ clip_param: 0.2
+ kl_coeff: 1.0
+ num_sgd_iter: 20
+ sgd_stepsize: .0001
+ sgd_batchsize: 32768
+ horizon: 5000
+ timesteps_per_batch: 320000
+ model:
+ free_log_std: true
+ num_workers: 64
+ num_gpus: 4
diff --git a/python/ray/rllib/tuned_examples/humanoid-ppo.yaml b/python/ray/rllib/tuned_examples/humanoid-ppo.yaml
index 9619d5389..c896f7d3b 100644
--- a/python/ray/rllib/tuned_examples/humanoid-ppo.yaml
+++ b/python/ray/rllib/tuned_examples/humanoid-ppo.yaml
@@ -2,5 +2,16 @@ humanoid-ppo:
env: Humanoid-v1
run: PPO
stop:
- episode_reward_mean: 6000
- config: {"kl_coeff": 1.0, "num_sgd_iter": 20, "sgd_stepsize": .0001, "sgd_batchsize": 32768, "devices": ["/gpu:0", "/gpu:1", "/gpu:2", "/gpu:3"], "tf_session_args": {"device_count": {"GPU": 4}, "log_device_placement": false, "allow_soft_placement": true}, "timesteps_per_batch": 320000, "num_workers": 64, "model": {"free_log_std": true}, "use_gae": false}
+ episode_reward_mean: 6000
+ config:
+ gamma: 0.995
+ kl_coeff: 1.0
+ num_sgd_iter: 20
+ sgd_stepsize: .0001
+ sgd_batchsize: 32768
+ timesteps_per_batch: 320000
+ model:
+ free_log_std: true
+ use_gae: false
+ num_workers: 64
+ num_gpus: 4
diff --git a/python/ray/rllib/tuned_examples/pong-a3c-pytorch.yaml b/python/ray/rllib/tuned_examples/pong-a3c-pytorch.yaml
index 46d84e6ae..891c4b991 100644
--- a/python/ray/rllib/tuned_examples/pong-a3c-pytorch.yaml
+++ b/python/ray/rllib/tuned_examples/pong-a3c-pytorch.yaml
@@ -3,7 +3,7 @@ pong-a3c-pytorch-cnn:
run: A3C
config:
num_workers: 16
- batch_size: 20
+ sample_batch_size: 20
use_pytorch: true
vf_loss_coeff: 0.5
entropy_coeff: -0.01
diff --git a/python/ray/rllib/tuned_examples/pong-a3c.yaml b/python/ray/rllib/tuned_examples/pong-a3c.yaml
index 4cb868bb5..d5b011ee3 100644
--- a/python/ray/rllib/tuned_examples/pong-a3c.yaml
+++ b/python/ray/rllib/tuned_examples/pong-a3c.yaml
@@ -5,7 +5,7 @@ pong-a3c:
run: A3C
config:
num_workers: 16
- batch_size: 20
+ sample_batch_size: 20
use_pytorch: false
vf_loss_coeff: 0.5
entropy_coeff: -0.01
diff --git a/python/ray/rllib/tuned_examples/pong-ppo.yaml b/python/ray/rllib/tuned_examples/pong-ppo.yaml
index fcb27c1f7..144748164 100644
--- a/python/ray/rllib/tuned_examples/pong-ppo.yaml
+++ b/python/ray/rllib/tuned_examples/pong-ppo.yaml
@@ -14,4 +14,4 @@ pong-deterministic-ppo:
gamma: 0.99
num_workers: 4
num_sgd_iter: 20
- devices: ["/gpu:0"]
+ num_gpus: 1
diff --git a/python/ray/rllib/tuned_examples/walker2d-ppo.yaml b/python/ray/rllib/tuned_examples/walker2d-ppo.yaml
index 22197081b..4591b4b58 100644
--- a/python/ray/rllib/tuned_examples/walker2d-ppo.yaml
+++ b/python/ray/rllib/tuned_examples/walker2d-ppo.yaml
@@ -1,4 +1,11 @@
walker2d-v1-ppo:
env: Walker2d-v1
run: PPO
- config: {"kl_coeff": 1.0, "num_sgd_iter": 20, "sgd_stepsize": .0001, "sgd_batchsize": 32768, "devices": ["/gpu:0", "/gpu:1", "/gpu:2", "/gpu:3"], "tf_session_args": {"device_count": {"GPU": 4}, "log_device_placement": false, "allow_soft_placement": true}, "timesteps_per_batch": 320000, "num_workers": 64}
+ config:
+ kl_coeff: 1.0
+ num_sgd_iter: 20
+ sgd_stepsize: .0001
+ sgd_batchsize: 32768
+ timesteps_per_batch: 320000
+ num_workers: 64
+ num_gpus: 4
diff --git a/python/ray/rllib/utils/__init__.py b/python/ray/rllib/utils/__init__.py
index 3e2b5e0e6..9c1d441df 100644
--- a/python/ray/rllib/utils/__init__.py
+++ b/python/ray/rllib/utils/__init__.py
@@ -1,3 +1,11 @@
from ray.rllib.utils.filter_manager import FilterManager
+from ray.rllib.utils.filter import Filter
+from ray.rllib.utils.policy_client import PolicyClient
+from ray.rllib.utils.policy_server import PolicyServer
-__all__ = ["FilterManager"]
+__all__ = [
+ "Filter",
+ "FilterManager",
+ "PolicyClient",
+ "PolicyServer",
+]
diff --git a/python/ray/rllib/utils/policy_client.py b/python/ray/rllib/utils/policy_client.py
index 623d32c1e..901dc983b 100644
--- a/python/ray/rllib/utils/policy_client.py
+++ b/python/ray/rllib/utils/policy_client.py
@@ -13,7 +13,7 @@ except ImportError:
class PolicyClient(object):
- """Client to interact with a RLlib policy server."""
+ """REST client to interact with an RLlib policy server."""
START_EPISODE = "START_EPISODE"
GET_ACTION = "GET_ACTION"
diff --git a/python/ray/rllib/utils/policy_server.py b/python/ray/rllib/utils/policy_server.py
index 708b14e05..554d74974 100644
--- a/python/ray/rllib/utils/policy_server.py
+++ b/python/ray/rllib/utils/policy_server.py
@@ -18,6 +18,34 @@ elif sys.version_info[0] == 3:
class PolicyServer(ThreadingMixIn, HTTPServer):
+ """REST server that can be launched from a ServingEnv.
+
+ This launches a multi-threaded server that listens on the specified host
+ and port to serve policy requests and forward experiences to RLlib.
+
+ Examples:
+ >>> class CartpoleServing(ServingEnv):
+ def __init__(self):
+ ServingEnv.__init__(
+ self, spaces.Discrete(2),
+ spaces.Box(low=-10, high=10, shape=(4,)))
+ def run(self):
+ server = PolicyServer(self, "localhost", 8900)
+ server.serve_forever()
+ >>> register_env("srv", lambda _: CartpoleServing())
+ >>> pg = PGAgent(env="srv", config={"num_workers": 0})
+ >>> while True:
+ pg.train()
+
+ >>> client = PolicyClient("localhost:8900")
+ >>> eps_id = client.start_episode()
+ >>> action = client.get_action(eps_id, obs)
+ >>> ...
+ >>> client.log_returns(eps_id, reward)
+ >>> ...
+ >>> client.log_returns(eps_id, reward)
+ """
+
def __init__(self, serving_env, address, port):
handler = _make_handler(serving_env)
HTTPServer.__init__(self, (address, port), handler)
diff --git a/python/ray/rllib/dqn/common/schedules.py b/python/ray/rllib/utils/schedules.py
similarity index 100%
rename from python/ray/rllib/dqn/common/schedules.py
rename to python/ray/rllib/utils/schedules.py
diff --git a/test/jenkins_tests/run_multi_node_tests.sh b/test/jenkins_tests/run_multi_node_tests.sh
index 15cb2beb7..86c325da1 100755
--- a/test/jenkins_tests/run_multi_node_tests.sh
+++ b/test/jenkins_tests/run_multi_node_tests.sh
@@ -98,7 +98,7 @@ docker run --rm --shm-size=10G --memory=10G $DOCKER_SHA \
--env MontezumaRevenge-v0 \
--run PPO \
--stop '{"training_iteration": 2}' \
- --config '{"kl_coeff": 1.0, "num_sgd_iter": 10, "sgd_stepsize": 1e-4, "sgd_batchsize": 64, "timesteps_per_batch": 2000, "num_workers": 1, "model": {"dim": 40, "conv_filters": [[16, [8, 8], 4], [32, [4, 4], 2], [512, [5, 5], 1]]}, "extra_frameskip": 4}'
+ --config '{"kl_coeff": 1.0, "num_sgd_iter": 10, "sgd_stepsize": 1e-4, "sgd_batchsize": 64, "timesteps_per_batch": 2000, "num_workers": 1, "model": {"dim": 40, "conv_filters": [[16, [8, 8], 4], [32, [4, 4], 2], [512, [5, 5], 1]]}}'
docker run --rm --shm-size=10G --memory=10G $DOCKER_SHA \
python /ray/python/ray/rllib/train.py \
@@ -126,35 +126,35 @@ docker run --rm --shm-size=10G --memory=10G $DOCKER_SHA \
--env CartPole-v0 \
--run PG \
--stop '{"training_iteration": 2}' \
- --config '{"batch_size": 500, "num_workers": 1}'
+ --config '{"sample_batch_size": 500, "num_workers": 1}'
docker run --rm --shm-size=10G --memory=10G $DOCKER_SHA \
python /ray/python/ray/rllib/train.py \
--env CartPole-v0 \
--run PG \
--stop '{"training_iteration": 2}' \
- --config '{"batch_size": 500, "num_workers": 1, "model": {"use_lstm": true, "max_seq_len": 100}}'
+ --config '{"sample_batch_size": 500, "num_workers": 1, "model": {"use_lstm": true, "max_seq_len": 100}}'
docker run --rm --shm-size=10G --memory=10G $DOCKER_SHA \
python /ray/python/ray/rllib/train.py \
--env CartPole-v0 \
--run PG \
--stop '{"training_iteration": 2}' \
- --config '{"batch_size": 500, "num_workers": 1, "num_envs": 10}'
+ --config '{"sample_batch_size": 500, "num_workers": 1, "num_envs": 10}'
docker run --rm --shm-size=10G --memory=10G $DOCKER_SHA \
python /ray/python/ray/rllib/train.py \
--env Pong-v0 \
--run PG \
--stop '{"training_iteration": 2}' \
- --config '{"batch_size": 500, "num_workers": 1}'
+ --config '{"sample_batch_size": 500, "num_workers": 1}'
docker run --rm --shm-size=10G --memory=10G $DOCKER_SHA \
python /ray/python/ray/rllib/train.py \
--env FrozenLake-v0 \
--run PG \
--stop '{"training_iteration": 2}' \
- --config '{"batch_size": 500, "num_workers": 1}'
+ --config '{"sample_batch_size": 500, "num_workers": 1}'
docker run --rm --shm-size=10G --memory=10G $DOCKER_SHA \
python /ray/python/ray/rllib/train.py \