[rllib] Sync filters at end of iteration not start; hierarchical docs (#3769)

2026-06-27 19:16:19 +08:00 · 2019-01-15 16:25:25 -08:00
parent 3918934dfd
commit 401e656b95
4 changed files with 39 additions and 6 deletions
@@ -108,8 +108,8 @@ Vectorized

 RLlib will auto-vectorize Gym envs for batch evaluation if the ``num_envs_per_worker`` config is set, or you can define a custom environment class that subclasses `VectorEnv <https://github.com/ray-project/ray/blob/master/python/ray/rllib/env/vector_env.py>`__ to implement ``vector_step()`` and ``vector_reset()``.

-Multi-Agent
-----------
+Multi-Agent and Hierarchical
+----------------------------

 .. note::

@@ -162,7 +162,6 @@ If all the agents will be using the same algorithm class to train, then you can
                    "traffic_light"  # Traffic lights are always controlled by this policy
                    if agent_id.startswith("traffic_light_")
                    else random.choice(["car1", "car2"])  # Randomly choose from car policies
-            },
        },
    })

@@ -203,6 +202,38 @@ Here is a simple `example training script <https://github.com/ray-project/ray/bl

 To scale to hundreds of agents, MultiAgentEnv batches policy evaluations across multiple agents internally. It can also be auto-vectorized by setting ``num_envs_per_worker > 1``.

+Hierarchical Environments
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Hierarchical training can sometimes be implemented as a special case of multi-agent RL. For example, consider a three-level hierarchy of policies, where a top-level policy issues high-level actions that are executed at finer timescales by a mid-level and a low-level policy. The following timeline shows one step of the top-level policy, which corresponds to two mid-level actions and five low-level actions:
+
+.. code-block:: text
+
+   top_level ---------------------------------------------------------------> top_level --->
+   mid_level_0 -------------------------------> mid_level_0 ----------------> mid_level_1 ->
+   low_level_0 -> low_level_0 -> low_level_0 -> low_level_1 -> low_level_1 -> low_level_2 ->
+
+This can be implemented as a multi-agent environment with three types of agents. Each higher-level action creates a new lower-level agent instance with a new id (e.g., ``low_level_0``, ``low_level_1``, ``low_level_2`` in the above example). These lower-level agents pop into existence at the start of higher-level steps, and terminate when their higher-level action ends. Their experiences are aggregated by policy, so from RLlib's perspective it's just optimizing three different types of policies. The configuration might look something like this:
+
+.. code-block:: python
+
+    "multiagent": {
+        "policy_graphs": {
+            "top_level": (some_policy_graph, ...),
+            "mid_level": (some_policy_graph, ...),
+            "low_level": (some_policy_graph, ...),
+        },
+        "policy_mapping_fn":
+            lambda agent_id:
+                "low_level" if agent_id.startswith("low_level_") else
+                "mid_level" if agent_id.startswith("mid_level_") else "top_level",
+        "policies_to_train": ["top_level"],
+    },
+
+
+In this setup, the appropriate rewards for training lower-level agents must be provided by the multi-agent env implementation. The environment class is also responsible for routing between the agents, e.g., conveying `goals <https://arxiv.org/pdf/1703.01161.pdf>`__ from higher-level agents to lower-level agents as part of the lower-level agent observation.
+
+
 Grouping Agents
 ~~~~~~~~~~~~~~~

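Reviewer sketch (not part of the commit): the ``policy_mapping_fn`` lambda added in the hierarchical docs above can be written as a named function and checked against the agent ids from the timeline. This is a standalone illustration that does not import RLlib. The ``timeline_ids`` list and the ``assignments`` dict are names introduced here for the example only.

```python
def policy_mapping_fn(agent_id):
    """Map an agent id such as "low_level_2" to the policy that controls it."""
    if agent_id.startswith("low_level_"):
        return "low_level"
    if agent_id.startswith("mid_level_"):
        return "mid_level"
    # Anything else is assumed to be the single top-level agent.
    return "top_level"


# Agent ids that appear in the timeline from the docs (illustrative list).
timeline_ids = [
    "top_level",
    "mid_level_0", "mid_level_1",
    "low_level_0", "low_level_1", "low_level_2",
]

# Every per-step agent instance collapses onto one of three policies.
assignments = {agent_id: policy_mapping_fn(agent_id) for agent_id in timeline_ids}
```

Because new lower-level agent ids share a prefix with their policy name, the environment can keep creating fresh ids without any change to the mapping.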
@@ -37,7 +37,7 @@ Environments
 * `RLlib Environments Overview <rllib-env.html>`__
 * `OpenAI Gym <rllib-env.html#openai-gym>`__
 * `Vectorized <rllib-env.html#vectorized>`__
-* `Multi-Agent <rllib-env.html#multi-agent>`__
+* `Multi-Agent and Hierarchical <rllib-env.html#multi-agent-and-hierarchical>`__
 * `Interfacing with External Agents <rllib-env.html#interfacing-with-external-agents>`__
 * `Batch Asynchronous <rllib-env.html#batch-asynchronous>`__

@@ -271,6 +271,8 @@ class Agent(Trainable):
                ev.set_global_vars.remote(self.global_vars)
            logger.debug("updated global vars: {}".format(self.global_vars))

+        result = Trainable.train(self)
+
        if (self.config.get("observation_filter", "NoFilter") != "NoFilter"
                and hasattr(self, "local_evaluator")):
            FilterManager.synchronize(
@@ -280,12 +282,12 @@ class Agent(Trainable):
            logger.debug("synchronized filters: {}".format(
                self.local_evaluator.filters))

-        result = Trainable.train(self)
        if self.config["callbacks"].get("on_train_result"):
            self.config["callbacks"]["on_train_result"]({
                "agent": self,
                "result": result,
            })
+
        return result

    @override(Trainable)
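Reviewer sketch (not part of the commit): a toy model of one observable effect of moving the filter synchronization after ``Trainable.train(self)``. It does not use RLlib. ``ToyWorker``, ``synchronize``, and ``run`` are hypothetical stand-ins, and the "filter" is reduced to a count of observations, which is an assumption made purely for illustration.

```python
class ToyWorker:
    """Hypothetical worker that buffers how many observations it has seen."""

    def __init__(self):
        self.buffer = 0  # observations seen since the last sync

    def sample(self, n):
        self.buffer += n


def synchronize(local_total, workers):
    """Fold each worker's unsynced count into the local total and clear it."""
    for worker in workers:
        local_total += worker.buffer
        worker.buffer = 0
    return local_total


def run(iterations, sync_at_end, samples_per_worker=10, num_workers=2):
    """Return the local count visible after the last iteration finishes."""
    workers = [ToyWorker() for _ in range(num_workers)]
    local_total = 0
    for _ in range(iterations):
        if not sync_at_end:
            local_total = synchronize(local_total, workers)
        for worker in workers:
            worker.sample(samples_per_worker)  # stands in for the train step
        if sync_at_end:
            local_total = synchronize(local_total, workers)
    return local_total


sync_at_start_total = run(iterations=3, sync_at_end=False)
sync_at_end_total = run(iterations=3, sync_at_end=True)
```

In this toy model, syncing at the start leaves the local state one iteration behind the workers when the iteration returns, while syncing at the end leaves it current.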