diff --git a/doc/source/rllib-env.rst b/doc/source/rllib-env.rst
index c95def692..a4a659b8e 100644
--- a/doc/source/rllib-env.rst
+++ b/doc/source/rllib-env.rst
@@ -136,6 +136,48 @@ Here is a simple `example training script 1``.
+Variable-Sharing Between Policies
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+RLlib will create each policy's model in a separate ``tf.variable_scope``. However, variables can still be shared between policies by explicitly entering a globally shared variable scope with ``tf.VariableScope(reuse=tf.AUTO_REUSE)``:
+
+.. code-block:: python
+
+    with tf.variable_scope(
+            tf.VariableScope(tf.AUTO_REUSE, "name_of_global_shared_scope"),
+            reuse=tf.AUTO_REUSE,
+            auxiliary_name_scope=False):
+        ...  # build the shared layers here
+
+There is a full example of this in the `example training script `__.
+
+Implementing a Centralized Critic
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Implementing a shared critic between multiple policies requires the definition of custom policy graphs. It can be done as follows:
+
+1. Querying the critic: this can be done in the ``postprocess_trajectory`` method of a custom policy graph, which has full access to the policies and observations of concurrent agents via the ``other_agent_batches`` and ``episode`` arguments. This assumes you use variable sharing to access the critic network from multiple policies. The critic predictions can then be added to the postprocessed trajectory. Here's an example:
+
+.. code-block:: python
+
+    def postprocess_trajectory(self, sample_batch, other_agent_batches, episode):
+        agents = ["agent_1", "agent_2", "agent_3"]  # simple example of 3 agents
+        global_obs_batch = np.stack(
+            [other_agent_batches[agent_id][1]["obs"] for agent_id in agents],
+            axis=1)
+        # add the global obs and global critic value
+        sample_batch["global_obs"] = global_obs_batch
+        sample_batch["global_vf"] = self.sess.run(
+            self.global_critic_network, feed_dict={"obs": global_obs_batch})
+        # metrics like "global reward" can be retrieved from the info return of the environment
+        sample_batch["global_rewards"] = [
+            info["global_reward"] for info in sample_batch["infos"]]
+        return sample_batch
+
+2. Updating the critic: the centralized critic loss can be added to the loss of some arbitrary policy graph. The policy graph that is chosen must add the inputs for the critic loss to its postprocessed trajectory batches.
+
+For an example of defining loss inputs, see the `PGPolicyGraph example `__.
+
 Agent-Driven
 ------------
diff --git a/doc/source/rllib-models.rst b/doc/source/rllib-models.rst
index a2a9233ef..79df3ab5c 100644
--- a/doc/source/rllib-models.rst
+++ b/doc/source/rllib-models.rst
@@ -30,7 +30,7 @@ The following is a list of the built-in model hyperparameters:
 Custom Models
 -------------
 
-Custom models should subclass the common RLlib `model class `__ and override the ``_build_layers_v2`` method. This method takes in a dict of tensor inputs (the observation ``obs``, ``prev_action``, and ``prev_reward``), and returns a feature layer and float vector of the specified output size. The model can then be registered and used in place of a built-in model:
+Custom models should subclass the common RLlib `model class `__ and override the ``_build_layers_v2`` method. This method takes in a dict of tensor inputs (the observation ``obs``, ``prev_action``, and ``prev_reward``), and returns a feature layer and float vector of the specified output size. You can also override the ``value_function`` method to implement a custom value branch. The model can then be registered and used in place of a built-in model:
 
 .. code-block:: python
 
@@ -74,6 +74,18 @@ Custom models should subclass the common RLlib `model class