[rllib] Documentation for I/O API and multi-agent support / cleanup (#3650)

2026-06-30 23:11:40 +08:00 · 2019-01-03 15:15:36 +08:00
parent 2177e2f410
commit ca864faece
19 changed files with 431 additions and 61 deletions
@@ -94,6 +94,7 @@ Ray comes with libraries that accelerate deep learning and reinforcement learnin
   rllib-env.rst
   rllib-algorithms.rst
   rllib-models.rst
+   rllib-offline.rst
   rllib-dev.rst
   rllib-concepts.rst
   rllib-package-ref.rst
@@ -271,6 +271,10 @@ The ``log_action`` API of ExternalEnv can be used to ingest data from offline lo

 Note that envs can read from different partitions of the logs based on the ``worker_index`` attribute of the `env context <https://github.com/ray-project/ray/blob/master/python/ray/rllib/env/env_context.py>`__ passed into the environment constructor.

+.. seealso::
+
+    `RLlib I/O <rllib-offline.html>`__ provides higher-level interfaces for working with offline experience datasets.
+
 Batch Asynchronous
 ------------------

@@ -325,7 +325,7 @@ Custom models can be used to work with environments where (1) the set of valid a
            return masked_logits, last_layer


-Depending on your use case it may make sense to use just the masking, just action embeddings, or both. For a runnable example of this in code, check out `parametric_action_cartpole.py <https://github.com/ray-project/ray/blob/master/python/ray/rllib/examples/parametric_action_cartpole.py>`__. Note that since masking introduces ``tf.float32.min`` values into the model output, this technique might not work with all algorithm options. For example, algorithms might crash if they incorrectly process the ``tf.float32.min`` values. The cartpole example has working configurations for DQN and several policy gradient algorithms.
+Depending on your use case it may make sense to use just the masking, just action embeddings, or both. For a runnable example of this in code, check out `parametric_action_cartpole.py <https://github.com/ray-project/ray/blob/master/python/ray/rllib/examples/parametric_action_cartpole.py>`__. Note that since masking introduces ``tf.float32.min`` values into the model output, this technique might not work with all algorithm options. For example, algorithms might crash if they incorrectly process the ``tf.float32.min`` values. The cartpole example has working configurations for DQN (must set ``hiddens=[]``), PPO (must disable running mean and set ``vf_share_layers=True``), and several other algorithms.


 Model-Based Rollouts
@@ -0,0 +1,123 @@
+RLlib Offline Data Input / Output
+=================================
+
+Working with Offline Datasets
+-----------------------------
+
+RLlib's I/O APIs enable you to work with datasets of experiences read from offline storage (e.g., disk, cloud storage, streaming systems, HDFS). For example, you might want to read experiences saved from previous training runs, or gathered from policies deployed in `web applications <https://arxiv.org/abs/1811.00260>`__. You can also log new agent experiences produced during online training for future use.
+
+RLlib represents trajectory sequences (i.e., ``(s, a, r, s', ...)`` tuples) with `SampleBatch <https://github.com/ray-project/ray/blob/master/python/ray/rllib/evaluation/sample_batch.py>`__ objects. Using a batch format enables efficient encoding and compression of experiences. During online training, RLlib uses `policy evaluation <rllib-concepts.html#policy-evaluation>`__ actors to generate batches of experiences in parallel using the current policy. RLlib also uses this same batch format for reading and writing experiences to offline storage.
+
+Example: Training on previously saved experiences
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+In this example, we will save batches of experiences generated during online training to disk, and then leverage this saved data to train a policy offline using DQN. First, we run a simple policy gradient algorithm for 100k steps with ``"output": "/tmp/cartpole-out"`` to tell RLlib to write simulation outputs to the ``/tmp/cartpole-out`` directory.
+
+.. code-block:: bash
+
+    $ rllib train
+        --run=PG \
+        --env=CartPole-v0 \
+        --config='{"output": "/tmp/cartpole-out", "output_max_file_size": 5000000}' \
+        --stop='{"timesteps_total": 100000}'
+
+The experiences will be saved in compressed JSON batch format:
+
+.. code-block:: text
+
+    $ ls -l /tmp/cartpole-out
+    total 11636
+    -rw-rw-r-- 1 eric eric 5022257 output-2019-01-01_15-58-57_worker-0_0.json
+    -rw-rw-r-- 1 eric eric 5002416 output-2019-01-01_15-59-22_worker-0_1.json
+    -rw-rw-r-- 1 eric eric 1881666 output-2019-01-01_15-59-47_worker-0_2.json
+
+Then, we can tell DQN to train using these previously generated experiences with ``"input": "/tmp/cartpole-out"``. We disable exploration since it has no effect on the input:
+
+.. code-block:: bash
+
+    $ rllib train \
+        --run=DQN \
+        --env=CartPole-v0 \
+        --config='{
+            "input": "/tmp/cartpole-out",
+            "exploration_final_eps": 0,
+            "exploration_fraction": 0}'
+
+Since the input experiences are not from running simulations, RLlib cannot report the true policy performance during training. However, you can use ``tensorboard --logdir=~/ray_results`` to monitor training progress via other metrics such as estimated Q-value:
+
+.. image:: offline-q.png
+
+In offline input mode, no simulations are run, though you still need to specify the environment in order to define the action and observation spaces. If true simulation is also possible (i.e., your env supports ``step()``), you can also set ``"input_evaluation": "simulation"`` to tell RLlib to run background simulations to estimate current policy performance. The output of these simulations will not be used for learning.
+
+Example: Converting external experiences to batch format
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When the env does not support simulation (e.g., it is a web application), it is necessary to generate the ``*.json`` experience batch files outside of RLlib. This can be done by using the `JsonWriter <https://github.com/ray-project/ray/blob/master/python/ray/rllib/offline/json_writer.py>`__ class to write out batches.
+This `runnable example <https://github.com/ray-project/ray/blob/master/python/ray/rllib/examples/saving_experiences.py>`__ shows how to generate and save experience batches for CartPole-v0 to disk:
+
+.. literalinclude:: ../../python/ray/rllib/examples/saving_experiences.py
+   :language: python
+   :start-after: __sphinx_doc_begin__
+   :end-before: __sphinx_doc_end__
+
+On-policy algorithms and experience postprocessing
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+RLlib assumes that input batches are of `postprocessed experiences <https://github.com/ray-project/ray/blob/b8a9e3f1064c6f8d754884fd9c75e0b2f88df4d6/python/ray/rllib/evaluation/policy_graph.py#L103>`__. This isn't typically critical for off-policy algorithms (e.g., DQN's `post-processing <https://github.com/ray-project/ray/blob/b8a9e3f1064c6f8d754884fd9c75e0b2f88df4d6/python/ray/rllib/agents/dqn/dqn_policy_graph.py#L514>`__ is only needed if ``n_step > 1`` or ``worker_side_prioritization: True``). For off-policy algorithms, you can also safely set the ``postprocess_inputs: True`` config to auto-postprocess data.
+
+However, for on-policy algorithms like PPO, you'll need to pass in the extra values added during policy evaluation and postprocessing to ``batch_builder.add_values()``, e.g., ``logits``, ``vf_preds``, ``value_target``, and ``advantages`` for PPO. This is needed since the calculation of these values depends on the parameters of the *behaviour* policy, which RLlib does not have access to in the offline setting (in online training, these values are automatically added during policy evaluation).
+
+Note that for on-policy algorithms, you'll also have to throw away experiences generated by prior versions of the policy. This greatly reduces sample efficiency, which is typically undesirable for offline training, but can make sense for certain applications.
+
+Mixing simulation and offline data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+RLlib supports multiplexing inputs from multiple input sources, including simulation. For example, in the following example we read 40% of our experiences from ``/tmp/cartpole-out``, 30% from ``hdfs:/archive/cartpole``, and the last 30% is produced via policy evaluation. Input sources are multiplexed using `np.random.choice <https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.choice.html>`__:
+
+.. code-block:: bash
+
+    $ rllib train \
+        --run=DQN \
+        --env=CartPole-v0 \
+        --config='{
+            "input": {
+                "/tmp/cartpole-out": 0.4,
+                "hdfs:/archive/cartpole": 0.3,
+                "sampler": 0.3,
+            },
+            "exploration_final_eps": 0,
+            "exploration_fraction": 0}'
+
+Scaling I/O throughput
+~~~~~~~~~~~~~~~~~~~~~~
+
+Similar to scaling online training, you can scale offline I/O throughput by increasing the number of RLlib workers via the ``num_workers`` config. Each worker accesses offline storage independently in parallel, for linear scaling of I/O throughput. Within each read worker, files are chosen in random order for reads, but file contents are read sequentially.
+
+Input API
+---------
+
+You can configure experience input for an agent using the following options:
+
+.. literalinclude:: ../../python/ray/rllib/agents/agent.py
+   :language: python
+   :start-after: __sphinx_doc_input_begin__
+   :end-before: __sphinx_doc_input_end__
+
+The interface for a custom input reader is as follows:
+
+.. autoclass:: ray.rllib.offline.InputReader
+    :members:
+
+Output API
+----------
+
+You can configure experience output for an agent using the following options:
+
+.. literalinclude:: ../../python/ray/rllib/agents/agent.py
+   :language: python
+   :start-after: __sphinx_doc_output_begin__
+   :end-before: __sphinx_doc_output_end__
+
+The interface for a custom output writer is as follows:
+
+.. autoclass:: ray.rllib.offline.OutputWriter
+    :members:
@@ -83,6 +83,12 @@ Models and Preprocessors
 * `Variable-length / Parametric Action Spaces <rllib-models.html#variable-length-parametric-action-spaces>`__
 * `Model-Based Rollouts <rllib-models.html#model-based-rollouts>`__

+Offline Data Input / Output
+---------------------------
+* `Working with Offline Datasets <rllib-offline.html>`__
+* `Input API <rllib-offline.html#input-api>`__
+* `Output API <rllib-offline.html#output-api>`__
+
 RLlib Development
 -----------------

@@ -130,7 +130,8 @@ COMMON_CONFIG = {
    # Drop metric batches from unresponsive workers after this many seconds
    "collect_metrics_timeout": 180,

-    # === Offline Data Input / Output (Experimental) ===
+    # === Offline Data Input / Output ===
+    # __sphinx_doc_input_begin__
    # Specify how to generate experiences:
    #  - "sampler": generate experiences via online simulation (default)
    #  - a local directory or file glob expression (e.g., "/tmp/*.json")
@@ -146,9 +147,14 @@ COMMON_CONFIG = {
    #    metrics will be NaN if using offline data.
    #  - "simulation": run the environment in the background, but use
    #    this data for evaluation only and not for learning.
-    #  - "counterfactual": use counterfactual policy evaluation to estimate
-    #    performance (this option is not implemented yet).
    "input_evaluation": None,
+    # Whether to run postprocess_trajectory() on the trajectory fragments from
+    # offline inputs. Note that postprocessing will be done using the *current*
+    # policy, not the *behaviour* policy, which is typically undesirable for
+    # on-policy algorithms.
+    "postprocess_inputs": False,
+    # __sphinx_doc_input_end__
+    # __sphinx_doc_output_begin__
    # Specify where experiences should be saved:
    #  - None: don't save any experiences
    #  - "logdir" to save to the agent log dir
@@ -159,10 +165,7 @@ COMMON_CONFIG = {
    "output_compress_columns": ["obs", "new_obs"],
    # Max output file size before rolling over to a new file.
    "output_max_file_size": 64 * 1024 * 1024,
-    # Whether to run postprocess_trajectory() on the trajectory fragments from
-    # offline inputs. Whether this makes sense is algorithm-specific.
-    # TODO(ekl) implement this and multi-agent batch handling
-    # "postprocess_inputs": False,
+    # __sphinx_doc_output_end__

    # === Multiagent ===
    "multiagent": {
@@ -503,9 +506,9 @@ class Agent(Trainable):
        elif config["input"] == "sampler":
            input_creator = (lambda ioctx: ioctx.default_sampler_input())
        elif isinstance(config["input"], dict):
-            input_creator = (lambda ioctx: MixedInput(ioctx, config["input"]))
+            input_creator = (lambda ioctx: MixedInput(config["input"], ioctx))
        else:
-            input_creator = (lambda ioctx: JsonReader(ioctx, config["input"]))
+            input_creator = (lambda ioctx: JsonReader(config["input"], ioctx))

        if isinstance(config["output"], FunctionType):
            output_creator = config["output"]
@@ -513,14 +516,14 @@ class Agent(Trainable):
            output_creator = (lambda ioctx: NoopOutput())
        elif config["output"] == "logdir":
            output_creator = (lambda ioctx: JsonWriter(
-                ioctx,
                ioctx.log_dir,
+                ioctx,
                max_file_size=config["output_max_file_size"],
                compress_columns=config["output_compress_columns"]))
        else:
            output_creator = (lambda ioctx: JsonWriter(
-                    ioctx,
                    config["output"],
+                    ioctx,
                    max_file_size=config["output_max_file_size"],
                    compress_columns=config["output_compress_columns"]))

@@ -187,8 +187,6 @@ class PolicyEvaluator(EvaluatorInterface):
                    other metrics will be NaN.
                  - "simulation": run the environment in the background, but
                    use this data for evaluation only and never for learning.
-                  - "counterfactual": use counterfactual policy evaluation to
-                    estimate performance.
            output_creator (func): Function that returns an OutputWriter object
                for saving generated experiences.
        """
@@ -309,8 +307,6 @@ class PolicyEvaluator(EvaluatorInterface):
                "Requested 'simulation' input evaluation method: "
                "will discard all sampler outputs and keep only metrics.")
            sample_async = True
-        elif input_evaluation_method == "counterfactual":
-            raise NotImplementedError
        elif input_evaluation_method is None:
            pass
        else:
@@ -388,6 +384,10 @@ class PolicyEvaluator(EvaluatorInterface):
                "samples": batch
            })

+        # Always do writes prior to compression for consistency and to allow
+        # for better compression inside the writer.
+        self.output_writer.write(batch)
+
        if self.compress_observations:
            if isinstance(batch, MultiAgentBatch):
                for data in batch.policy_batches.values():
@@ -397,7 +397,6 @@ class PolicyEvaluator(EvaluatorInterface):
                batch["obs"] = [pack(o) for o in batch["obs"]]
                batch["new_obs"] = [pack(o) for o in batch["new_obs"]]

-        self.output_writer.write(batch)
        return batch

    @ray.method(num_return_vals=2)
@@ -306,10 +306,48 @@ class SampleBatch(object):
        return out

    def shuffle(self):
+        """Shuffles the rows of this batch in-place."""
+
        permutation = np.random.permutation(self.count)
        for key, val in self.items():
            self[key] = val[permutation]

+    def split_by_episode(self):
+        """Splits this batch's data by `eps_id`.
+
+        Returns:
+            list of SampleBatch, one per distinct episode.
+        """
+
+        slices = []
+        cur_eps_id = self.data["eps_id"][0]
+        offset = 0
+        for i in range(self.count):
+            next_eps_id = self.data["eps_id"][i]
+            if next_eps_id != cur_eps_id:
+                slices.append(self.slice(offset, i))
+                offset = i
+                cur_eps_id = next_eps_id
+        slices.append(self.slice(offset, self.count))
+        for s in slices:
+            slen = len(set(s["eps_id"]))
+            assert slen == 1, (s, slen)
+        assert sum(s.count for s in slices) == self.count, (slices, self.count)
+        return slices
+
+    def slice(self, start, end):
+        """Returns a slice of the row data of this batch.
+
+        Arguments:
+            start (int): Starting index.
+            end (int): Ending index.
+
+        Returns:
+            SampleBatch which has a slice of this batch's data.
+        """
+
+        return SampleBatch({k: v[start:end] for k, v in self.data.items()})
+
    def __getitem__(self, key):
        return self.data[key]

@@ -175,10 +175,10 @@ if __name__ == "__main__":
        }
    elif args.run == "DQN":
        cfg = {
-            "hiddens": [],  # don't postprocess the action scores
+            "hiddens": [],  # important: don't postprocess the action scores
        }
    else:
-        cfg = {}
+        cfg = {}  # PG, IMPALA, A2C, etc.
    run_experiments({
        "parametric_cartpole": {
            "run": args.run,
@@ -0,0 +1,47 @@
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+"""Simple example of writing experiences to a file using JsonWriter."""
+
+# __sphinx_doc_begin__
+import gym
+import numpy as np
+
+from ray.rllib.evaluation.sample_batch import SampleBatchBuilder
+from ray.rllib.offline.json_writer import JsonWriter
+
+if __name__ == "__main__":
+    batch_builder = SampleBatchBuilder()  # or MultiAgentSampleBatchBuilder
+    writer = JsonWriter("/tmp/demo-out")
+
+    # You normally wouldn't want to manually create sample batches if a
+    # simulator is available, but let's do it anyways for example purposes:
+    env = gym.make("CartPole-v0")
+
+    for eps_id in range(100):
+        obs = env.reset()
+        prev_action = np.zeros_like(env.action_space.sample())
+        prev_reward = 0
+        done = False
+        t = 0
+        while not done:
+            action = env.action_space.sample()
+            new_obs, rew, done, info = env.step(action)
+            batch_builder.add_values(
+                t=t,
+                eps_id=eps_id,
+                agent_index=0,
+                obs=obs,
+                actions=action,
+                rewards=rew,
+                prev_actions=prev_action,
+                prev_rewards=prev_reward,
+                dones=done,
+                infos=info,
+                new_obs=new_obs)
+            obs = new_obs
+            prev_action = action
+            prev_reward = rew
+            t += 1
+        writer.write(batch_builder.build_and_reset())
+# __sphinx_doc_end__
@@ -9,8 +9,11 @@ class InputReader(object):
    """Input object for loading experiences in policy evaluation."""

    def next(self):
-        """Return the next batch of experiences read."""
+        """Return the next batch of experiences read.

+        Returns:
+            SampleBatch or MultiAgentBatch read.
+        """
        raise NotImplementedError


@@ -2,6 +2,8 @@ from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function

+import os
+
 from ray.rllib.offline.input_reader import SamplerInput


@@ -18,9 +20,13 @@ class IOContext(object):
        evaluator (PolicyEvaluator): policy evaluator object reference.
    """

-    def __init__(self, log_dir, config, worker_index, evaluator):
-        self.log_dir = log_dir
-        self.config = config
+    def __init__(self,
+                 log_dir=None,
+                 config=None,
+                 worker_index=0,
+                 evaluator=None):
+        self.log_dir = log_dir or os.getcwd()
+        self.config = config or {}
        self.worker_index = worker_index
        self.evaluator = evaluator

@@ -16,7 +16,9 @@ except ImportError:
    smart_open = None

 from ray.rllib.offline.input_reader import InputReader
-from ray.rllib.evaluation.sample_batch import SampleBatch
+from ray.rllib.offline.io_context import IOContext
+from ray.rllib.evaluation.sample_batch import MultiAgentBatch, SampleBatch, \
+    DEFAULT_POLICY_ID
 from ray.rllib.utils.annotations import override
 from ray.rllib.utils.compression import unpack_if_needed

@@ -28,17 +30,17 @@ class JsonReader(InputReader):

    The input files will be read from in an random order."""

-    def __init__(self, ioctx, inputs):
+    def __init__(self, inputs, ioctx=None):
        """Initialize a JsonReader.

        Arguments:
-            ioctx (IOContext): current IO context object.
            inputs (str|list): either a glob expression for files, e.g.,
                "/tmp/**/*.json", or a list of single file paths or URIs, e.g.,
                ["s3://bucket/file.json", "s3://bucket/file2.json"].
+            ioctx (IOContext): current IO context object.
        """

-        self.ioctx = ioctx
+        self.ioctx = ioctx or IOContext()
        if isinstance(inputs, six.string_types):
            if os.path.isdir(inputs):
                inputs = os.path.join(inputs, "*.json")
@@ -74,7 +76,23 @@ class JsonReader(InputReader):
            raise ValueError(
                "Failed to read valid experience batch from file: {}".format(
                    self.cur_file))
-        return batch
+        return self._postprocess_if_needed(batch)
+
+    def _postprocess_if_needed(self, batch):
+        if not self.ioctx.config.get("postprocess_inputs"):
+            return batch
+
+        if isinstance(batch, SampleBatch):
+            out = []
+            for sub_batch in batch.split_by_episode():
+                out.append(self.ioctx.evaluator.policy_map[DEFAULT_POLICY_ID]
+                           .postprocess_trajectory(sub_batch))
+            return SampleBatch.concat_samples(out)
+        else:
+            # TODO(ekl) this is trickier since the alignments between agent
+            # trajectories in the episode are not available any more.
+            raise NotImplementedError(
+                "Postprocessing of multi-agent data not implemented yet.")

    def _try_parse(self, line):
        line = line.strip()
@@ -121,6 +139,25 @@ def _from_json(batch):
    if isinstance(batch, bytes):  # smart_open S3 doesn't respect "r"
        batch = batch.decode("utf-8")
    data = json.loads(batch)
-    for k, v in data.items():
-        data[k] = [unpack_if_needed(x) for x in unpack_if_needed(v)]
-    return SampleBatch(data)
+
+    if "type" in data:
+        data_type = data.pop("type")
+    else:
+        raise ValueError("JSON record missing 'type' field")
+
+    if data_type == "SampleBatch":
+        for k, v in data.items():
+            data[k] = unpack_if_needed(v)
+        return SampleBatch(data)
+    elif data_type == "MultiAgentBatch":
+        policy_batches = {}
+        for policy_id, policy_batch in data["policy_batches"].items():
+            inner = {}
+            for k, v in policy_batch.items():
+                inner[k] = unpack_if_needed(v)
+            policy_batches[policy_id] = SampleBatch(inner)
+        return MultiAgentBatch(policy_batches, data["count"])
+    else:
+        raise ValueError(
+            "Type field must be one of ['SampleBatch', 'MultiAgentBatch']",
+            data_type)
@@ -15,6 +15,8 @@ try:
 except ImportError:
    smart_open = None

+from ray.rllib.evaluation.sample_batch import MultiAgentBatch
+from ray.rllib.offline.io_context import IOContext
 from ray.rllib.offline.output_writer import OutputWriter
 from ray.rllib.utils.annotations import override
 from ray.rllib.utils.compression import pack
@@ -26,21 +28,21 @@ class JsonWriter(OutputWriter):
    """Writer object that saves experiences in JSON file chunks."""

    def __init__(self,
-                 ioctx,
                 path,
+                 ioctx=None,
                 max_file_size=64 * 1024 * 1024,
                 compress_columns=frozenset(["obs", "new_obs"])):
        """Initialize a JsonWriter.

        Arguments:
-            ioctx (IOContext): current IO context object.
            path (str): a path/URI of the output directory to save files in.
+            ioctx (IOContext): current IO context object.
            max_file_size (int): max size of single files before rolling over.
            compress_columns (list): list of sample batch columns to compress.
        """

-        self.ioctx = ioctx
        self.path = path
+        self.ioctx = ioctx or IOContext()
        self.max_file_size = max_file_size
        self.compress_columns = compress_columns
        if urlparse(path).scheme:
@@ -102,7 +104,19 @@ def _to_jsonable(v, compress):


 def _to_json(batch, compress_columns):
-    return json.dumps({
-        k: _to_jsonable(v, compress=k in compress_columns)
-        for k, v in batch.data.items()
-    })
+    out = {}
+    if isinstance(batch, MultiAgentBatch):
+        out["type"] = "MultiAgentBatch"
+        out["count"] = batch.count
+        policy_batches = {}
+        for policy_id, sub_batch in batch.policy_batches.items():
+            policy_batches[policy_id] = {}
+            for k, v in sub_batch.data.items():
+                policy_batches[policy_id][k] = _to_jsonable(
+                    v, compress=k in compress_columns)
+        out["policy_batches"] = policy_batches
+    else:
+        out["type"] = "SampleBatch"
+        for k, v in batch.data.items():
+            out[k] = _to_jsonable(v, compress=k in compress_columns)
+    return json.dumps(out)
@@ -13,20 +13,20 @@ class MixedInput(InputReader):
    """Mixes input from a number of other input sources.

    Examples:
-        >>> MixedInput(ioctx, {
+        >>> MixedInput({
            "sampler": 0.4,
            "/tmp/experiences/*.json": 0.4,
            "s3://bucket/expert.json": 0.2,
-        })
+        }, ioctx)
    """

-    def __init__(self, ioctx, dist):
+    def __init__(self, dist, ioctx):
        """Initialize a MixedInput.

        Arguments:
-            ioctx (IOContext): current IO context object.
            dist (dict): dict mapping JSONReader paths or "sampler" to
                probabilities. The probabilities must sum to 1.0.
+            ioctx (IOContext): current IO context object.
        """
        if sum(dist.values()) != 1.0:
            raise ValueError("Values must sum to 1.0: {}".format(dist))
@@ -36,7 +36,7 @@ class MixedInput(InputReader):
            if k == "sampler":
                self.choices.append(ioctx.default_sampler_input())
            else:
-                self.choices.append(JsonReader(ioctx, k))
+                self.choices.append(JsonReader(k))
            self.p.append(v)

    @override(InputReader)
@@ -21,7 +21,7 @@ if __name__ == '__main__':
        print(yaml.dump(experiments))

        for i in range(3):
-            trials = run_experiments(experiments)
+            trials = run_experiments(experiments, resume=False)

            num_failures = 0
            for t in trials:
@@ -3,8 +3,11 @@ from __future__ import division
 from __future__ import print_function

 import glob
+import gym
+import json
 import numpy as np
 import os
+import random
 import shutil
 import tempfile
 import time
@@ -12,13 +15,17 @@ import unittest

 import ray
 from ray.rllib.agents.pg import PGAgent
+from ray.rllib.agents.pg.pg_policy_graph import PGPolicyGraph
 from ray.rllib.evaluation import SampleBatch
 from ray.rllib.offline import IOContext, JsonWriter, JsonReader
 from ray.rllib.offline.json_writer import _to_json
+from ray.rllib.test.test_multi_agent_env import MultiCartpole
+from ray.tune.registry import register_env

 SAMPLES = SampleBatch({
-    "actions": np.array([1, 2, 3]),
-    "obs": np.array([4, 5, 6])
+    "actions": np.array([1, 2, 3, 4]),
+    "obs": np.array([4, 5, 6, 7]),
+    "eps_id": [1, 1, 2, 3],
 })


@@ -49,8 +56,7 @@ class AgentIOTest(unittest.TestCase):
    def testAgentOutputOk(self):
        self.writeOutputs(self.test_dir)
        self.assertEqual(len(os.listdir(self.test_dir)), 1)
-        ioctx = IOContext(self.test_dir, {}, 0, None)
-        reader = JsonReader(ioctx, self.test_dir + "/*.json")
+        reader = JsonReader(self.test_dir + "/*.json")
        reader.next()

    def testAgentOutputLogdir(self):
@@ -69,6 +75,40 @@ class AgentIOTest(unittest.TestCase):
        self.assertEqual(result["timesteps_total"], 250)  # read from input
        self.assertTrue(np.isnan(result["episode_reward_mean"]))

+    def testSplitByEpisode(self):
+        splits = SAMPLES.split_by_episode()
+        self.assertEqual(len(splits), 3)
+        self.assertEqual(splits[0].count, 2)
+        self.assertEqual(splits[1].count, 1)
+        self.assertEqual(splits[2].count, 1)
+
+    def testAgentInputPostprocessingEnabled(self):
+        self.writeOutputs(self.test_dir)
+
+        # Rewrite the files to drop advantages and value_targets for testing
+        for path in glob.glob(self.test_dir + "/*.json"):
+            out = []
+            for line in open(path).readlines():
+                data = json.loads(line)
+                del data["advantages"]
+                del data["value_targets"]
+                out.append(data)
+            with open(path, "w") as f:
+                for data in out:
+                    f.write(json.dumps(data))
+
+        agent = PGAgent(
+            env="CartPole-v0",
+            config={
+                "input": self.test_dir,
+                "input_evaluation": None,
+                "postprocess_inputs": True,  # adds back 'advantages'
+            })
+
+        result = agent.train()
+        self.assertEqual(result["timesteps_total"], 250)  # read from input
+        self.assertTrue(np.isnan(result["episode_reward_mean"]))
+
    def testAgentInputEvalSim(self):
        self.writeOutputs(self.test_dir)
        agent = PGAgent(
@@ -112,6 +152,58 @@ class AgentIOTest(unittest.TestCase):
        result = agent.train()
        self.assertTrue(not np.isnan(result["episode_reward_mean"]))

+    def testMultiAgent(self):
+        register_env("multi_cartpole", lambda _: MultiCartpole(10))
+        single_env = gym.make("CartPole-v0")
+
+        def gen_policy():
+            obs_space = single_env.observation_space
+            act_space = single_env.action_space
+            return (PGPolicyGraph, obs_space, act_space, {})
+
+        pg = PGAgent(
+            env="multi_cartpole",
+            config={
+                "num_workers": 0,
+                "output": self.test_dir,
+                "multiagent": {
+                    "policy_graphs": {
+                        "policy_1": gen_policy(),
+                        "policy_2": gen_policy(),
+                    },
+                    "policy_mapping_fn": (
+                        lambda agent_id: random.choice(
+                            ["policy_1", "policy_2"])),
+                },
+            })
+        pg.train()
+        self.assertEqual(len(os.listdir(self.test_dir)), 1)
+
+        pg.stop()
+        pg = PGAgent(
+            env="multi_cartpole",
+            config={
+                "num_workers": 0,
+                "input": self.test_dir,
+                "input_evaluation": "simulation",
+                "train_batch_size": 2000,
+                "multiagent": {
+                    "policy_graphs": {
+                        "policy_1": gen_policy(),
+                        "policy_2": gen_policy(),
+                    },
+                    "policy_mapping_fn": (
+                        lambda agent_id: random.choice(
+                            ["policy_1", "policy_2"])),
+                },
+            })
+        for _ in range(50):
+            result = pg.train()
+            if not np.isnan(result["episode_reward_mean"]):
+                return  # simulation ok
+            time.sleep(0.1)
+        assert False, "did not see any simulation results"
+

 class JsonIOTest(unittest.TestCase):
    def setUp(self):
@@ -123,7 +215,7 @@ class JsonIOTest(unittest.TestCase):
    def testWriteSimple(self):
        ioctx = IOContext(self.test_dir, {}, 0, None)
        writer = JsonWriter(
-            ioctx, self.test_dir, max_file_size=1000, compress_columns=["obs"])
+            self.test_dir, ioctx, max_file_size=1000, compress_columns=["obs"])
        self.assertEqual(len(os.listdir(self.test_dir)), 0)
        writer.write(SAMPLES)
        writer.write(SAMPLES)
@@ -132,8 +224,8 @@ class JsonIOTest(unittest.TestCase):
    def testWriteFileURI(self):
        ioctx = IOContext(self.test_dir, {}, 0, None)
        writer = JsonWriter(
-            ioctx,
            "file:" + self.test_dir,
+            ioctx,
            max_file_size=1000,
            compress_columns=["obs"])
        self.assertEqual(len(os.listdir(self.test_dir)), 0)
@@ -144,7 +236,7 @@ class JsonIOTest(unittest.TestCase):
    def testWritePaginate(self):
        ioctx = IOContext(self.test_dir, {}, 0, None)
        writer = JsonWriter(
-            ioctx, self.test_dir, max_file_size=5000, compress_columns=["obs"])
+            self.test_dir, ioctx, max_file_size=5000, compress_columns=["obs"])
        self.assertEqual(len(os.listdir(self.test_dir)), 0)
        for _ in range(100):
            writer.write(SAMPLES)
@@ -153,10 +245,10 @@ class JsonIOTest(unittest.TestCase):
    def testReadWrite(self):
        ioctx = IOContext(self.test_dir, {}, 0, None)
        writer = JsonWriter(
-            ioctx, self.test_dir, max_file_size=5000, compress_columns=["obs"])
+            self.test_dir, ioctx, max_file_size=5000, compress_columns=["obs"])
        for i in range(100):
            writer.write(make_sample_batch(i))
-        reader = JsonReader(ioctx, self.test_dir + "/*.json")
+        reader = JsonReader(self.test_dir + "/*.json")
        seen_a = set()
        seen_o = set()
        for i in range(1000):
@@ -169,7 +261,6 @@ class JsonIOTest(unittest.TestCase):
        self.assertLess(len(seen_o), 101)

    def testSkipsOverEmptyLinesAndFiles(self):
-        ioctx = IOContext(self.test_dir, {}, 0, None)
        open(self.test_dir + "/empty", "w").close()
        with open(self.test_dir + "/f1", "w") as f:
            f.write("\n")
@@ -178,7 +269,7 @@ class JsonIOTest(unittest.TestCase):
        with open(self.test_dir + "/f2", "w") as f:
            f.write(_to_json(make_sample_batch(1), []))
            f.write("\n")
-        reader = JsonReader(ioctx, [
+        reader = JsonReader([
            self.test_dir + "/empty",
            self.test_dir + "/f1",
            "file:" + self.test_dir + "/f2",
@@ -190,7 +281,6 @@ class JsonIOTest(unittest.TestCase):
        self.assertEqual(len(seen_a), 2)

    def testSkipsOverCorruptedLines(self):
-        ioctx = IOContext(self.test_dir, {}, 0, None)
        with open(self.test_dir + "/f1", "w") as f:
            f.write(_to_json(make_sample_batch(0), []))
            f.write("\n")
@@ -201,7 +291,7 @@ class JsonIOTest(unittest.TestCase):
            f.write(_to_json(make_sample_batch(3), []))
            f.write("\n")
            f.write("{..corrupted_json_record")
-        reader = JsonReader(ioctx, [
+        reader = JsonReader([
            self.test_dir + "/f1",
        ])
        seen_a = set()
@@ -211,9 +301,8 @@ class JsonIOTest(unittest.TestCase):
        self.assertEqual(len(seen_a), 4)

    def testAbortOnAllEmptyInputs(self):
-        ioctx = IOContext(self.test_dir, {}, 0, None)
        open(self.test_dir + "/empty", "w").close()
-        reader = JsonReader(ioctx, [
+        reader = JsonReader([
            self.test_dir + "/empty",
        ])
        self.assertRaises(ValueError, lambda: reader.next())
@@ -223,7 +312,7 @@ class JsonIOTest(unittest.TestCase):
        with open(self.test_dir + "/empty2", "w") as f:
            for _ in range(100):
                f.write("\n")
-        reader = JsonReader(ioctx, [
+        reader = JsonReader([
            self.test_dir + "/empty1",
            self.test_dir + "/empty2",
        ])
@@ -104,5 +104,5 @@ class LinearSchedule(object):

    def value(self, t):
        """See Schedule.value"""
-        fraction = min(float(t) / self.schedule_timesteps, 1.0)
+        fraction = min(float(t) / max(1, self.schedule_timesteps), 1.0)
        return self.initial_p + fraction * (self.final_p - self.initial_p)