Robert Nishihara 0eae917766 [rllib] Clean up evolution strategies example. (#1225)
2017-11-16 21:58:30 -08:00


Evolution Strategies
====================

This document provides a walkthrough of the evolution strategies example.
To run the application, first install some dependencies.

.. code-block:: bash

  pip install tensorflow
  pip install gym

You can view the `code for this example`_.

.. _`code for this example`: https://github.com/ray-project/ray/tree/master/python/ray/rllib/es


The script can be run as follows. Note that the configuration is tuned to work
on the ``Humanoid-v1`` gym environment.

.. code-block:: bash

  python ray/python/ray/rllib/train.py --env=Humanoid-v1 --alg=ES


To train a policy on a cluster (e.g., using 900 workers), run the following.

.. code-block:: bash

  python ray/python/ray/rllib/train.py \
      --env=Humanoid-v1 \
      --alg=ES \
      --redis-address=<redis-address> \
      --config='{"num_workers": 900, "episodes_per_batch": 10000, "timesteps_per_batch": 100000}'

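
The ``--config`` value above is a JSON object. As a small illustration (not
part of the example itself), the same string can be built from a Python
dictionary with the standard ``json`` module, which avoids hand-written quoting
mistakes. The variable names ``config`` and ``config_arg`` are chosen here for
illustration only.

.. code-block:: python

  import json

  # The keys and values are the ones used in the cluster command above.
  config = {
      "num_workers": 900,
      "episodes_per_batch": 10000,
      "timesteps_per_batch": 100000,
  }

  # Serialize to the JSON text expected after ``--config=``.
  config_arg = "--config=" + json.dumps(config)
  print(config_arg)

When the command is typed into a shell, the JSON text still needs to be wrapped
in single quotes as shown in the command above.
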

At the heart of this example, we define a ``Worker`` class. These workers have
a method ``do_rollouts``, which is used to simulate randomly perturbed
policies in a given environment.

.. code-block:: python

  @ray.remote
  class Worker(object):
      def __init__(self, config, policy_params, env_name, noise):
          self.env = ...  # Initialize environment.
          self.policy = ...  # Construct policy.
          # Details omitted.

      def do_rollouts(self, params):
          perturbation = ...  # Generate a random perturbation to the policy.

          self.policy.set_weights(params + perturbation)
          # Do rollout with the positively perturbed policy.

          self.policy.set_weights(params - perturbation)
          # Do rollout with the negatively perturbed policy.

          # Return the rewards.

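
The mirrored evaluation performed in ``do_rollouts`` can be tried out without
Ray or TensorFlow. The following is a minimal sketch, not the RLlib
implementation: ``toy_return`` is a hypothetical stand-in for an environment
rollout, ``noise_std`` is an assumed name for the perturbation scale, and plain
Python lists stand in for the policy weights.

.. code-block:: python

  import random


  def toy_return(weights):
      """Hypothetical stand-in for a rollout; the best weights are all 1.0."""
      return -sum((w - 1.0) ** 2 for w in weights)


  def do_rollouts(params, noise_std=0.1, rng=None):
      """Evaluate one random perturbation in both directions."""
      rng = rng or random.Random()
      # Generate a random perturbation with the same shape as the weights.
      perturbation = [noise_std * rng.gauss(0.0, 1.0) for _ in params]

      # Rollout with the weights shifted by +perturbation.
      plus = [p + e for p, e in zip(params, perturbation)]
      reward_plus = toy_return(plus)

      # Rollout with the weights shifted by -perturbation.
      minus = [p - e for p, e in zip(params, perturbation)]
      reward_minus = toy_return(minus)

      return perturbation, reward_plus, reward_minus


  perturbation, reward_plus, reward_minus = do_rollouts(
      [0.0, 0.0, 0.0], rng=random.Random(0))
  print(reward_plus, reward_minus)

Evaluating the same perturbation with both signs gives a pair of rewards whose
difference indicates whether moving along that direction helps or hurts.
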

In the main loop, we create a number of actors with this class.

.. code-block:: python

  workers = [Worker.remote(config, policy_params, env_name, noise_id)
             for _ in range(num_workers)]


We then enter an infinite loop in which we use the actors to perform rollouts
and use the rewards from the rollouts to update the policy.

.. code-block:: python

  while True:
      # Get the current policy weights.
      theta = policy.get_weights()
      # Put the current policy weights in the object store.
      theta_id = ray.put(theta)
      # Use the actors to do rollouts. Note that we pass in the ID of the
      # policy weights rather than the weights themselves.
      rollout_ids = [worker.do_rollouts.remote(theta_id) for worker in workers]
      # Get the results of the rollouts.
      results = ray.get(rollout_ids)
      # Update the policy.
      optimizer.update(...)

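
The update step is elided in the loop above. As a rough illustration of how
rewards from mirrored perturbations can drive an update, the sketch below
estimates a gradient from reward differences and takes a plain gradient-ascent
step on a toy objective. It is an assumption-laden simplification, not the
optimizer used by the example: ``toy_return``, ``es_step``, ``num_pairs``,
``noise_std``, and ``step_size`` are all hypothetical names, and everything
runs in a single process.

.. code-block:: python

  import random


  def toy_return(weights):
      """Hypothetical stand-in for a rollout; the best weights are all 1.0."""
      return -sum((w - 1.0) ** 2 for w in weights)


  def es_step(theta, num_pairs, noise_std, step_size, rng):
      """One update from num_pairs mirrored perturbations (toy version)."""
      grad = [0.0] * len(theta)
      for _ in range(num_pairs):
          eps = [rng.gauss(0.0, 1.0) for _ in theta]
          r_plus = toy_return([t + noise_std * e for t, e in zip(theta, eps)])
          r_minus = toy_return([t - noise_std * e for t, e in zip(theta, eps)])
          # Weight each noise direction by the difference of its two rewards.
          for i, e in enumerate(eps):
              grad[i] += (r_plus - r_minus) * e
      # Average over pairs and undo the 2 * noise_std finite-difference scale.
      scale = 1.0 / (2.0 * noise_std * num_pairs)
      return [t + step_size * scale * g for t, g in zip(theta, grad)]


  rng = random.Random(0)
  theta = [0.0, 0.0, 0.0]
  initial_return = toy_return(theta)
  for _ in range(200):
      theta = es_step(theta, num_pairs=20, noise_std=0.1, step_size=0.05, rng=rng)
  final_return = toy_return(theta)
  print(initial_return, final_return)

In the distributed version described in this document, the inner loop over
perturbations is what the ``Worker`` actors compute in parallel.
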

In addition, note that we create a large object representing a shared block of
random noise. We then put the block in the object store so that each ``Worker``
actor can use it without creating its own copy.

.. code-block:: python

  @ray.remote
  def create_shared_noise():
      noise = np.random.randn(250000000)
      return noise

  noise_id = create_shared_noise.remote()

Recall that the ``noise_id`` argument is passed into the actor constructor.

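
One way a worker could draw perturbations from such a shared block is to pick a
random start offset and read a contiguous slice of the required length. The
sketch below illustrates that idea under stated assumptions: ``NoiseTable``,
``sample_index``, and ``get`` are hypothetical names introduced here, and a
small Python list stands in for the large NumPy array.

.. code-block:: python

  import random


  class NoiseTable:
      """Hypothetical helper that hands out slices of one shared noise block."""

      def __init__(self, noise, seed=None):
          self.noise = noise
          self.rng = random.Random(seed)

      def sample_index(self, dim):
          # Any start offset that leaves room for dim consecutive entries.
          # random.Random.randint includes both endpoints.
          return self.rng.randint(0, len(self.noise) - dim)

      def get(self, index, dim):
          # The perturbation is fully determined by (index, dim).
          return self.noise[index:index + dim]


  # Small stand-in for the shared block of Gaussian noise.
  block_rng = random.Random(123)
  shared_noise = [block_rng.gauss(0.0, 1.0) for _ in range(10000)]

  table = NoiseTable(shared_noise, seed=0)
  index = table.sample_index(dim=8)
  perturbation = table.get(index, dim=8)
  print(index, len(perturbation))

With this layout, a perturbation can be identified by a single integer offset,
so a worker could report the offset alongside its rewards instead of sending
the full perturbation vector. Note that slicing a Python list copies the
elements, whereas slicing a NumPy array returns a view onto the same memory.
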