[rllib] Sync filters at end of iteration not start; hierarchical docs (#3769)

Eric Liang
2019-01-15 16:25:25 -08:00
committed by Richard Liaw
parent 3918934dfd
commit 401e656b95
4 changed files with 39 additions and 6 deletions
+34 -3
@@ -108,8 +108,8 @@ Vectorized
RLlib will auto-vectorize Gym envs for batch evaluation if the ``num_envs_per_worker`` config is set, or you can define a custom environment class that subclasses `VectorEnv <https://github.com/ray-project/ray/blob/master/python/ray/rllib/env/vector_env.py>`__ to implement ``vector_step()`` and ``vector_reset()``.
Multi-Agent
-----------
Multi-Agent and Hierarchical
----------------------------
.. note::
@@ -162,7 +162,6 @@ If all the agents will be using the same algorithm class to train, then you can
"traffic_light" # Traffic lights are always controlled by this policy
if agent_id.startswith("traffic_light_")
else random.choice(["car1", "car2"]) # Randomly choose from car policies
},
},
})
@@ -203,6 +202,38 @@ Here is a simple `example training script <https://github.com/ray-project/ray/bl
To scale to hundreds of agents, MultiAgentEnv batches policy evaluations across multiple agents internally. It can also be auto-vectorized by setting ``num_envs_per_worker > 1``.
Hierarchical Environments
~~~~~~~~~~~~~~~~~~~~~~~~~
Hierarchical training can sometimes be implemented as a special case of multi-agent RL. For example, consider a three-level hierarchy of policies, where a top-level policy issues high-level actions that are executed at finer timescales by mid-level and low-level policies. The following timeline shows one step of the top-level policy, which corresponds to two mid-level actions and five low-level actions:
.. code-block:: text
top_level ---------------------------------------------------------------> top_level --->
mid_level_0 -------------------------------> mid_level_0 ----------------> mid_level_1 ->
low_level_0 -> low_level_0 -> low_level_0 -> low_level_1 -> low_level_1 -> low_level_2 ->
This can be implemented as a multi-agent environment with three types of agents. Each higher-level action creates a new lower-level agent instance with a new id (e.g., ``low_level_0``, ``low_level_1``, ``low_level_2`` in the above example). These lower-level agents pop into existence at the start of higher-level steps, and terminate when their higher-level action ends. Their experiences are aggregated by policy, so from RLlib's perspective it's just optimizing three different types of policies. The configuration might look something like this:
.. code-block:: python
    "multiagent": {
        "policy_graphs": {
            "top_level": (some_policy_graph, ...),
            "mid_level": (some_policy_graph, ...),
            "low_level": (some_policy_graph, ...),
        },
        "policy_mapping_fn":
            lambda agent_id:
                "low_level" if agent_id.startswith("low_level_") else
                "mid_level" if agent_id.startswith("mid_level_") else "top_level",
        "policies_to_train": ["top_level"],
    },
In this setup, the appropriate rewards for training lower-level agents must be provided by the multi-agent env implementation. The environment class is also responsible for routing between the agents, e.g., conveying `goals <https://arxiv.org/pdf/1703.01161.pdf>`__ from higher-level agents to lower-level agents as part of the lower-level agent observation.
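As a rough illustration of the environment's responsibilities described above, here is a minimal sketch of a two-level hierarchy (simplified from the three-level example). It does not import RLlib. ``ToyHierarchicalEnv``, its constructor arguments, and its reward shaping are hypothetical names chosen for this sketch. It only follows the multi-agent convention of returning per-agent dicts of observations, rewards, and dones, with an ``"__all__"`` key marking the end of the episode.

```python
class ToyHierarchicalEnv:
    """Hypothetical two-level env on a 1-D line (a sketch, not RLlib code).

    The top-level agent picks a goal position. Each top-level action spawns
    a fresh low-level agent that moves by -1, 0, or +1 per step until it
    reaches the goal or runs out of steps.
    """

    def __init__(self, low_steps=3, max_top_steps=2):
        self.low_steps = low_steps
        self.max_top_steps = max_top_steps

    def reset(self):
        self.pos = 0
        self.top_steps = 0
        self.num_low_agents = 0
        self.low_id = None
        # Only the top-level agent is asked to act first.
        return {"top_level": self.pos}

    def step(self, action_dict):
        if "top_level" in action_dict:
            return self._top_step(action_dict["top_level"])
        return self._low_step(action_dict[self.low_id])

    def _top_step(self, goal):
        self.goal = goal
        self.start_pos = self.pos
        self.steps_left = self.low_steps
        self.top_steps += 1
        # Each higher-level action creates a lower-level agent with a new id.
        self.low_id = "low_level_{}".format(self.num_low_agents)
        self.num_low_agents += 1
        # Route the goal to the low-level agent as part of its observation.
        obs = {self.low_id: (self.pos, self.goal)}
        return obs, {self.low_id: 0}, {"__all__": False}, {}

    def _low_step(self, move):
        old_dist = abs(self.goal - self.pos)
        self.pos += move
        self.steps_left -= 1
        new_dist = abs(self.goal - self.pos)
        # The env supplies the low-level reward: progress toward the goal.
        low_reward = old_dist - new_dist
        if new_dist > 0 and self.steps_left > 0:
            obs = {self.low_id: (self.pos, self.goal)}
            return obs, {self.low_id: low_reward}, {"__all__": False}, {}
        # The higher-level action has ended: terminate the low-level agent
        # and hand control back to the top-level agent.
        obs = {self.low_id: (self.pos, self.goal), "top_level": self.pos}
        rewards = {
            self.low_id: low_reward,
            "top_level": float(self.pos - self.start_pos),
        }
        dones = {
            self.low_id: True,
            "__all__": self.top_steps >= self.max_top_steps,
        }
        return obs, rewards, dones, {}
```

With the ``policy_mapping_fn`` shown earlier, every ``low_level_N`` agent id produced by this sketch would map to the single ``low_level`` policy, so their experiences are pooled even though each instance is short-lived.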
Grouping Agents
~~~~~~~~~~~~~~~
File diff suppressed because one or more lines are too long


+1 -1
@@ -37,7 +37,7 @@ Environments
* `RLlib Environments Overview <rllib-env.html>`__
* `OpenAI Gym <rllib-env.html#openai-gym>`__
* `Vectorized <rllib-env.html#vectorized>`__
* `Multi-Agent <rllib-env.html#multi-agent>`__
* `Multi-Agent and Hierarchical <rllib-env.html#multi-agent-and-hierarchical>`__
* `Interfacing with External Agents <rllib-env.html#interfacing-with-external-agents>`__
* `Batch Asynchronous <rllib-env.html#batch-asynchronous>`__
+3 -1
@@ -271,6 +271,8 @@ class Agent(Trainable):
ev.set_global_vars.remote(self.global_vars)
logger.debug("updated global vars: {}".format(self.global_vars))
result = Trainable.train(self)
if (self.config.get("observation_filter", "NoFilter") != "NoFilter"
and hasattr(self, "local_evaluator")):
FilterManager.synchronize(
@@ -280,12 +282,12 @@ class Agent(Trainable):
logger.debug("synchronized filters: {}".format(
self.local_evaluator.filters))
result = Trainable.train(self)
if self.config["callbacks"].get("on_train_result"):
self.config["callbacks"]["on_train_result"]({
"agent": self,
"result": result,
})
return result
@override(Trainable)
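
The ordering this change establishes (train first, then synchronize filters, then invoke the ``on_train_result`` callback) can be sketched with a stand-in class. ``ToyAgent``, ``_train``, and ``_sync_filters`` are hypothetical placeholders for this sketch and are not RLlib code. Only the config keys and the callback's ``{"agent": ..., "result": ...}`` argument are taken from the diff above.

```python
class ToyAgent:
    """Hypothetical stand-in that mimics the reordered train() flow."""

    def __init__(self, config):
        self.config = config
        self.events = []  # records the order in which steps run

    def _train(self):
        # Placeholder for Trainable.train(self).
        self.events.append("train")
        return {"episode_reward_mean": 1.0}

    def _sync_filters(self):
        # Placeholder for FilterManager.synchronize(...).
        self.events.append("sync_filters")

    def train(self):
        result = self._train()
        # Filters are now synchronized at the end of the iteration.
        if self.config.get("observation_filter", "NoFilter") != "NoFilter":
            self._sync_filters()
        callback = self.config["callbacks"].get("on_train_result")
        if callback:
            callback({"agent": self, "result": result})
        return result


def on_train_result(info):
    # The callback receives a dict with the agent and its latest result.
    info["agent"].events.append("callback")
    info["result"]["callback_ok"] = True


agent = ToyAgent({
    "observation_filter": "MeanStdFilter",
    "callbacks": {"on_train_result": on_train_result},
})
result = agent.train()
```

Because the callback runs after filter synchronization, anything it reads from the agent in this sketch reflects the filters as updated by the iteration that just finished.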