mirror of
https://github.com/wassname/ray.git
synced 2026-07-02 06:47:50 +08:00
[docs] update to expose libraries + landing page (#1642)
@@ -1,245 +0,0 @@
Hyperparameter Optimization
===========================

This document provides a walkthrough of the hyperparameter optimization example.

.. note::

    To learn about Ray's built-in hyperparameter optimization framework, see `Ray Tune <http://ray.readthedocs.io/en/latest/tune.html>`__.

To run the application, first install some dependencies.

.. code-block:: bash

  pip install tensorflow

You can view the `code for this example`_.

.. _`code for this example`: https://github.com/ray-project/ray/tree/master/examples/hyperopt

The simple script that processes results as they become available and launches
new experiments can be run as follows.

.. code-block:: bash

  python ray/examples/hyperopt/hyperopt_simple.py --trials=5 --steps=10

The variant that divides training into multiple segments and aggressively
terminates poorly performing models can be run as follows.

.. code-block:: bash

  python ray/examples/hyperopt/hyperopt_adaptive.py --num-starting-segments=5 \
                                                    --num-segments=10 \
                                                    --steps-per-segment=20

Machine learning algorithms often have a number of *hyperparameters* whose
values must be chosen by the practitioner. For example, an optimization
algorithm may have a step size, a decay rate, and a regularization coefficient.
In a deep network, the network parameterization itself (e.g., the number of
layers and the number of units per layer) can be considered a hyperparameter.

Choosing these parameters can be challenging, and so a common practice is to
search over the space of hyperparameters. One approach that works surprisingly
well is to randomly sample different options.

Problem Setup
-------------

Suppose that we want to train a convolutional network, but we aren't sure how to
choose the following hyperparameters:

- the learning rate
- the batch size
- the dropout probability
- the standard deviation of the distribution from which to initialize the
  network weights

Suppose that we've defined a remote function ``train_cnn_and_compute_accuracy``,
which takes values for these hyperparameters as its input (along with the
dataset), trains a convolutional network using those hyperparameters, and
returns the accuracy of the trained model on a validation set.

.. code-block:: python

  import numpy as np
  import ray

  @ray.remote
  def train_cnn_and_compute_accuracy(hyperparameters,
                                     train_images,
                                     train_labels,
                                     validation_images,
                                     validation_labels):
      # Construct a deep network, train it, and return the accuracy on the
      # validation data.
      return np.random.uniform(0, 1)

Basic random search
-------------------

Something that works surprisingly well is to try random values for the
hyperparameters. For example, we can write a function that randomly generates
hyperparameter configurations.

.. code-block:: python

  def generate_hyperparameters():
      # Randomly choose values for the hyperparameters.
      return {"learning_rate": 10 ** np.random.uniform(-5, 5),
              "batch_size": np.random.randint(1, 100),
              "dropout": np.random.uniform(0, 1),
              "stddev": 10 ** np.random.uniform(-5, 5)}

In addition, let's assume that we've started Ray and loaded some data.

.. code-block:: python

  import ray

  ray.init()

  from tensorflow.examples.tutorials.mnist import input_data
  mnist = input_data.read_data_sets("MNIST_data", one_hot=True)
  train_images = ray.put(mnist.train.images)
  train_labels = ray.put(mnist.train.labels)
  validation_images = ray.put(mnist.validation.images)
  validation_labels = ray.put(mnist.validation.labels)

Then basic random hyperparameter search looks something like this. We launch a
bunch of experiments, and we get the results.

.. code-block:: python

  # Generate a bunch of hyperparameter configurations.
  hyperparameter_configurations = [generate_hyperparameters() for _ in range(20)]

  # Launch some experiments.
  results = []
  for hyperparameters in hyperparameter_configurations:
      results.append(train_cnn_and_compute_accuracy.remote(hyperparameters,
                                                           train_images,
                                                           train_labels,
                                                           validation_images,
                                                           validation_labels))

  # Get the results.
  accuracies = ray.get(results)

Then we can inspect the contents of ``accuracies`` and see which set of
hyperparameters worked the best. Note that in the above example, the ``for``
loop returns almost immediately, because each ``.remote()`` call only submits a
task and returns an object ID. The program then blocks in the call to
``ray.get``, which waits until all of the experiments have finished.
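
Picking the winner from the collected accuracies is a small reduction. The
sketch below is illustrative only: it uses made-up values in place of real
search results, and the helper name ``best_configuration`` is hypothetical
rather than part of the example scripts.

.. code-block:: python

  def best_configuration(configurations, accuracies):
      # Find the index of the highest accuracy and return the matching pair.
      best_index = max(range(len(accuracies)), key=lambda i: accuracies[i])
      return configurations[best_index], accuracies[best_index]

  # Made-up values standing in for the output of a real search.
  example_configurations = [{"learning_rate": 0.1},
                            {"learning_rate": 0.01},
                            {"learning_rate": 0.001}]
  example_accuracies = [0.62, 0.91, 0.88]

  best, best_accuracy = best_configuration(example_configurations,
                                           example_accuracies)
  print(best, best_accuracy)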

Processing results as they become available
-------------------------------------------

One problem with the above approach is that you have to wait for all of the
experiments to finish before you can process the results. Instead, you may want
to process the results as they become available, perhaps in order to adaptively
choose new experiments to run, or perhaps simply so you know how well the
experiments are doing. To process the results as they become available, we can
use the ``ray.wait`` primitive.

The simplest usage is the following. This example is implemented in more
detail in driver.py_.

.. code-block:: python

  # Launch some experiments.
  remaining_ids = []
  for hyperparameters in hyperparameter_configurations:
      remaining_ids.append(train_cnn_and_compute_accuracy.remote(hyperparameters,
                                                                 train_images,
                                                                 train_labels,
                                                                 validation_images,
                                                                 validation_labels))

  # Whenever a new experiment finishes, print the value and start a new
  # experiment.
  for i in range(100):
      ready_ids, remaining_ids = ray.wait(remaining_ids, num_returns=1)
      accuracy = ray.get(ready_ids[0])
      print("Accuracy is {}".format(accuracy))
      # Start a new experiment.
      new_hyperparameters = generate_hyperparameters()
      remaining_ids.append(train_cnn_and_compute_accuracy.remote(new_hyperparameters,
                                                                 train_images,
                                                                 train_labels,
                                                                 validation_images,
                                                                 validation_labels))

.. _driver.py: https://github.com/ray-project/ray/blob/master/examples/hyperopt/driver.py
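
For readers who want to try the "handle whichever result is ready first, then
launch a replacement" pattern without a Ray installation, the sketch below
mimics it with Python's standard ``concurrent.futures`` module. It is an
analogy, not Ray code: ``fake_experiment`` and ``run_search`` are hypothetical
names, threads stand in for Ray workers, and
``concurrent.futures.wait(..., return_when=FIRST_COMPLETED)`` plays the role of
``ray.wait(..., num_returns=1)``.

.. code-block:: python

  import concurrent.futures
  import random
  import time

  def fake_experiment(seed):
      # Stand-in for a training run: sleep briefly, return a pseudo-accuracy.
      rng = random.Random(seed)
      time.sleep(rng.uniform(0.01, 0.05))
      return rng.uniform(0, 1)

  def run_search(num_initial=4, num_total=10):
      accuracies = []
      next_seed = num_initial
      with concurrent.futures.ThreadPoolExecutor(max_workers=num_initial) as pool:
          pending = {pool.submit(fake_experiment, seed)
                     for seed in range(num_initial)}
          while pending:
              # Block until at least one experiment has finished.
              done, pending = concurrent.futures.wait(
                  pending, return_when=concurrent.futures.FIRST_COMPLETED)
              for future in done:
                  accuracies.append(future.result())
                  # Launch a replacement until the budget is spent.
                  if next_seed < num_total:
                      pending.add(pool.submit(fake_experiment, next_seed))
                      next_seed += 1
      return accuracies

  results = run_search()
  print(len(results))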

More sophisticated hyperparameter search
----------------------------------------

Hyperparameter search algorithms can get much more sophisticated. So far, we've
been treating the function ``train_cnn_and_compute_accuracy`` as a black box:
we can choose its inputs and inspect its outputs, but once we decide to run
it, we have to run it until it finishes.

However, there is often more structure to be exploited. For example, if the
training procedure is going poorly, we can end the session early and invest more
resources in the more promising hyperparameter experiments. And if we've saved
the state of the training procedure, we can always restart it again later.

This is one of the ideas of the Hyperband_ algorithm. Start with a huge number
of hyperparameter configurations, aggressively stop the bad ones, and invest
more resources in the promising experiments.

To implement this, we can first adapt our training method to optionally take a
model and to return the updated model.

.. code-block:: python

  @ray.remote
  def train_cnn_and_compute_accuracy(hyperparameters, model=None):
      # Construct a deep network, train it, and return the accuracy on the
      # validation data as well as the latest version of the model. If the model
      # argument is not None, this will continue training an existing model.
      validation_accuracy = np.random.uniform(0, 1)
      new_model = model
      return validation_accuracy, new_model

Here's a different variant that uses the same principles. Divide each training
session into a series of shorter training sessions. Whenever a short session
finishes, if it still looks promising, then continue running it. If it isn't
doing well, then terminate it and start a new experiment.

.. code-block:: python

  import numpy as np

  def is_promising(model):
      # Return True if the model is doing well and False otherwise. In practice,
      # this function will want more information than just the model.
      return np.random.choice([True, False])

  # Start 10 experiments, remembering which hyperparameters each one uses.
  remaining_ids = []
  hyperparameters_by_id = {}
  for _ in range(10):
      hyperparameters = generate_hyperparameters()
      experiment_id = train_cnn_and_compute_accuracy.remote(hyperparameters, model=None)
      remaining_ids.append(experiment_id)
      hyperparameters_by_id[experiment_id] = hyperparameters

  accuracies = []
  for i in range(100):
      # Whenever a segment of an experiment finishes, decide if it looks promising
      # or not.
      ready_ids, remaining_ids = ray.wait(remaining_ids, num_returns=1)
      experiment_id = ready_ids[0]
      current_accuracy, current_model = ray.get(experiment_id)
      accuracies.append(current_accuracy)
      hyperparameters = hyperparameters_by_id.pop(experiment_id)

      if is_promising(current_model):
          # Continue running the experiment with the same hyperparameters.
          experiment_id = train_cnn_and_compute_accuracy.remote(hyperparameters,
                                                                model=current_model)
      else:
          # Start a new experiment with freshly sampled hyperparameters.
          hyperparameters = generate_hyperparameters()
          experiment_id = train_cnn_and_compute_accuracy.remote(hyperparameters)

      remaining_ids.append(experiment_id)
      hyperparameters_by_id[experiment_id] = hyperparameters

.. _Hyperband: https://arxiv.org/abs/1603.06560
@@ -0,0 +1,99 @@
HyperBand and Early Stopping
============================

Ray Tune includes distributed implementations of early stopping algorithms such as `Median Stopping Rule <https://research.google.com/pubs/pub46180.html>`__, `HyperBand <https://arxiv.org/abs/1603.06560>`__, and an `asynchronous version of HyperBand <https://openreview.net/forum?id=S1Y7OOlRZ>`__. These algorithms are very resource efficient and can outperform Bayesian Optimization methods in `many cases <https://people.eecs.berkeley.edu/~kjamieson/hyperband.html>`__.

Asynchronous HyperBand
----------------------

The `asynchronous version of HyperBand <https://openreview.net/forum?id=S1Y7OOlRZ>`__ scheduler can be plugged in on top of an existing grid or random search. This can be done by setting the ``scheduler`` parameter of ``run_experiments``, e.g.

.. code-block:: python

    run_experiments({...}, scheduler=AsyncHyperBandScheduler())

Compared to the original version of HyperBand, this implementation provides better parallelism and avoids straggler issues during eliminations. An example of this can be found in `async_hyperband_example.py <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/async_hyperband_example.py>`__. **We recommend using this over the standard HyperBand scheduler.**

.. autoclass:: ray.tune.async_hyperband.AsyncHyperBandScheduler

HyperBand
---------

.. note:: The HyperBand scheduler requires your trainable to support checkpointing, which is described in the `Ray Tune documentation <tune.html#trial-checkpointing>`__. Checkpointing enables the scheduler to multiplex many concurrent trials onto a limited-size cluster.

Ray Tune also implements the `standard version of HyperBand <https://arxiv.org/abs/1603.06560>`__. You can use it as follows:

.. code-block:: python

    run_experiments({...}, scheduler=HyperBandScheduler())

An example of this can be found in `hyperband_example.py <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/hyperband_example.py>`__. The progress of one such HyperBand run is shown below.

::

    == Status ==
    Using HyperBand: num_stopped=0 total_brackets=5
    Round #0:
      Bracket(n=5, r=100, completed=80%): {'PAUSED': 4, 'PENDING': 1}
      Bracket(n=8, r=33, completed=23%): {'PAUSED': 4, 'PENDING': 4}
      Bracket(n=15, r=11, completed=4%): {'RUNNING': 2, 'PAUSED': 2, 'PENDING': 11}
      Bracket(n=34, r=3, completed=0%): {'RUNNING': 2, 'PENDING': 32}
      Bracket(n=81, r=1, completed=0%): {'PENDING': 38}
    Resources used: 4/4 CPUs, 0/0 GPUs
    Result logdir: ~/ray_results/hyperband_test
    PAUSED trials:
     - my_class_0_height=99,width=43: PAUSED [pid=11664], 0 s, 100 ts, 97.1 rew
     - my_class_11_height=85,width=81: PAUSED [pid=11771], 0 s, 33 ts, 32.8 rew
     - my_class_12_height=0,width=52: PAUSED [pid=11785], 0 s, 33 ts, 0 rew
     - my_class_19_height=44,width=88: PAUSED [pid=11811], 0 s, 11 ts, 5.47 rew
     - my_class_27_height=96,width=84: PAUSED [pid=11840], 0 s, 11 ts, 12.5 rew
      ... 5 more not shown
    PENDING trials:
     - my_class_10_height=12,width=25: PENDING
     - my_class_13_height=90,width=45: PENDING
     - my_class_14_height=69,width=45: PENDING
     - my_class_15_height=41,width=11: PENDING
     - my_class_16_height=57,width=69: PENDING
      ... 81 more not shown
    RUNNING trials:
     - my_class_23_height=75,width=51: RUNNING [pid=11843], 0 s, 1 ts, 1.47 rew
     - my_class_26_height=16,width=48: RUNNING
     - my_class_31_height=40,width=10: RUNNING
     - my_class_53_height=28,width=96: RUNNING

.. autoclass:: ray.tune.hyperband.HyperBandScheduler

HyperBand Implementation Details
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Implementation details may deviate slightly from theory but are focused on increasing usability. Note: ``R``, ``s_max``, and ``eta`` are parameters of HyperBand given by the paper. See `this post <https://people.eecs.berkeley.edu/~kjamieson/hyperband.html>`_ for context.

1. Both ``s_max`` (representing the ``number of brackets - 1``) and ``eta``, representing the downsampling rate, are fixed. In many practical settings, ``R``, which represents some resource unit and often the number of training iterations, can be set reasonably large, like ``R >= 200``. For simplicity, assume ``eta = 3``. Varying ``R`` between ``R = 200`` and ``R = 1000`` produces a very wide range in the number of trials needed to fill all brackets.

   .. image:: images/hyperband_bracket.png

   On the other hand, holding ``R`` constant at ``R = 300`` and varying ``eta`` also leads to HyperBand configurations that are not very intuitive:

   .. image:: images/hyperband_eta.png

   The implementation takes the same configuration as the example given in the paper and exposes ``max_t``, which is not a parameter in the paper.

2. The example in the `post <https://people.eecs.berkeley.edu/~kjamieson/hyperband.html>`_ to calculate ``n_0`` is actually a little different from the algorithm given in the paper. In this implementation, we implement ``n_0`` according to the paper (which is ``n`` in the example below):

   .. image:: images/hyperband_allocation.png

3. There are also implementation-specific details, such as how trials are placed into brackets, which are not covered in the paper. This implementation fills smaller brackets first, meaning that with a low number of trials there will be less early stopping.
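
To make the roles of ``R``, ``eta``, and ``s_max`` concrete, the sketch below
computes the initial size ``n`` and initial resource ``r`` of each bracket. It
assumes the allocation rule from the paper's algorithm as commonly stated,
``n = ceil((B / R) * eta**s / (s + 1))`` with ``B = (s_max + 1) * R`` and
``r = R / eta**s``. The function name ``hyperband_brackets`` is hypothetical,
and this is not Ray Tune's source code. With ``R = 81`` and ``eta = 3`` it
yields bracket sizes 81, 34, 15, 8, and 5, the same ``n`` values that appear in
the status output above.

.. code-block:: python

  def hyperband_brackets(R, eta):
      # s_max is the largest integer s such that eta**s <= R.
      s_max = 0
      while eta ** (s_max + 1) <= R:
          s_max += 1

      brackets = []
      for s in range(s_max, -1, -1):
          # n = ceil((B / R) * eta**s / (s + 1)) with B = (s_max + 1) * R,
          # done in integer arithmetic to avoid floating-point rounding.
          numerator = (s_max + 1) * eta ** s
          n = -(-numerator // (s + 1))
          # Initial resource given to each trial in this bracket.
          r = R / eta ** s
          brackets.append((s, n, r))
      return brackets

  for s, n, r in hyperband_brackets(R=81, eta=3):
      print("s={} n={} r={}".format(s, n, r))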

Median Stopping Rule
--------------------

The Median Stopping Rule implements the simple strategy of stopping a trial if its performance falls below the median of other trials at similar points in time. You can set the ``scheduler`` parameter as follows:

.. code-block:: python

    run_experiments({...}, scheduler=MedianStoppingRule())

.. autoclass:: ray.tune.median_stopping_rule.MedianStoppingRule
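
The decision itself can be sketched in a few lines. The version below is one
common formulation, assumed here for illustration: a trial is stopped when its
best reward so far is below the median of the other trials' running averages at
the same step. ``should_stop`` is a hypothetical helper, not the
``MedianStoppingRule`` API, and the real scheduler may differ in details such as
grace periods.

.. code-block:: python

  import statistics

  def should_stop(trial_rewards, other_trials_rewards):
      # trial_rewards: rewards reported so far by the trial being judged.
      # other_trials_rewards: one list of rewards per other trial.
      step = len(trial_rewards)
      if step == 0:
          return False
      # Running average of every other trial that has reached this step.
      peers = [sum(rewards[:step]) / step
               for rewards in other_trials_rewards
               if len(rewards) >= step]
      if not peers:
          return False
      return max(trial_rewards) < statistics.median(peers)

  others = [[0.5, 0.6], [0.4, 0.5], [0.3, 0.9]]
  print(should_stop([0.1, 0.2], others))
  print(should_stop([0.9, 0.8], others))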
Binary file not shown.
|
After Width: | Height: | Size: 32 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 25 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 34 KiB |
+38
-23
@@ -9,6 +9,31 @@ Ray

*Ray is a flexible, high-performance distributed execution framework.*


Ray is easy to install: ``pip install ray``

Example Use
-----------

+------------------------------------------------+----------------------------------------------------+
| **Basic Python**                               | **Distributed with Ray**                           |
+------------------------------------------------+----------------------------------------------------+
|.. code-block:: python                          |.. code-block:: python                              |
|                                                |                                                    |
|  # Execute f serially.                         |  # Execute f in parallel.                          |
|                                                |                                                    |
|                                                |  @ray.remote                                       |
|  def f():                                      |  def f():                                          |
|      time.sleep(1)                             |      time.sleep(1)                                 |
|      return 1                                  |      return 1                                      |
|                                                |                                                    |
|                                                |                                                    |
|                                                |  ray.init()                                        |
|  results = [f() for i in range(4)]             |  results = ray.get([f.remote() for i in range(4)]) |
+------------------------------------------------+----------------------------------------------------+


View the `codebase on GitHub`_.

.. _`codebase on GitHub`: https://github.com/ray-project/ray

@@ -21,27 +46,6 @@ Ray comes with libraries that accelerate deep learning and reinforcement learnin
.. _`Ray Tune`: tune.html
.. _`Ray RLlib`: rllib.html

Example Program
---------------

+------------------------------------------------+----------------------------------------------------+
| **Basic Python**                               | **Distributed with Ray**                           |
+------------------------------------------------+----------------------------------------------------+
|.. code:: python                                |.. code-block:: python                              |
|                                                |                                                    |
|  import time                                   |  import time                                       |
|                                                |  import ray                                        |
|                                                |                                                    |
|                                                |  ray.init()                                        |
|                                                |                                                    |
|                                                |  @ray.remote                                       |
|  def f():                                      |  def f():                                          |
|      time.sleep(1)                             |      time.sleep(1)                                 |
|      return 1                                  |      return 1                                      |
|                                                |                                                    |
|  # Execute f serially.                         |  # Execute f in parallel.                          |
|  results = [f() for i in range(4)]             |  results = ray.get([f.remote() for i in range(4)]) |
+------------------------------------------------+----------------------------------------------------+

.. toctree::
   :maxdepth: 1

@@ -60,16 +64,27 @@ Example Program
   api.rst
   actors.rst
   using-ray-with-gpus.rst
   webui.rst

.. toctree::
   :maxdepth: 1
   :caption: Ray Tune

   tune.rst
   hyperband.rst
   pbt.rst

.. toctree::
   :maxdepth: 1
   :caption: Ray RLlib

   rllib.rst
   rllib-dev.rst
   webui.rst

.. toctree::
   :maxdepth: 1
   :caption: Examples

   example-hyperopt.rst
   example-rl-pong.rst
   example-policy-gradient.rst
   example-parameter-server.rst

@@ -0,0 +1,33 @@
Population Based Training
=========================

Ray Tune includes a distributed implementation of `Population Based Training (PBT) <https://deepmind.com/blog/population-based-training-neural-networks>`__.

PBT Scheduler
-------------

Ray Tune's PBT scheduler can be plugged in on top of an existing grid or random search experiment. This can be enabled by setting the ``scheduler`` parameter of ``run_experiments``, e.g.

.. code-block:: python

    run_experiments(
        {...},
        scheduler=PopulationBasedTraining(
            time_attr='time_total_s',
            reward_attr='mean_accuracy',
            perturbation_interval=600.0,
            hyperparameter_mutations={
                "lr": [1e-3, 5e-4, 1e-4, 5e-5, 1e-5],
                "alpha": lambda: random.uniform(0.0, 1.0),
                ...
            }))

When the PBT scheduler is enabled, each trial variant is treated as a member of the population. Periodically, top-performing trials are checkpointed (this requires your Trainable to support `checkpointing <tune.html#trial-checkpointing>`__). Low-performing trials clone the checkpoints of top performers and perturb the configurations in the hope of discovering an even better variation.
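
The clone-and-perturb step described above can be sketched without Ray. The
code below is a simplified illustration under stated assumptions: ``pbt_step``
and the dictionary layout of each population member are hypothetical, the
bottom and top 25% are compared, and perturbation simply resamples each
hyperparameter from a list or a callable, mirroring the shape of
``hyperparameter_mutations``. The actual scheduler may choose donors and
perturb values differently.

.. code-block:: python

  import copy
  import random

  def pbt_step(population, mutations, rng, quantile=0.25):
      # population: list of dicts with "config", "checkpoint", and "score".
      ranked = sorted(population, key=lambda trial: trial["score"])
      cutoff = max(1, int(len(ranked) * quantile))
      bottom, top = ranked[:cutoff], ranked[-cutoff:]
      for trial in bottom:
          donor = rng.choice(top)
          # Exploit: clone the checkpoint and config of a top performer.
          trial["checkpoint"] = copy.deepcopy(donor["checkpoint"])
          trial["config"] = dict(donor["config"])
          # Explore: resample every mutable hyperparameter.
          for key, spec in mutations.items():
              trial["config"][key] = spec() if callable(spec) else rng.choice(spec)
      return population

  rng = random.Random(0)
  mutations = {"lr": [1e-3, 5e-4, 1e-4],
               "alpha": lambda: rng.uniform(0.0, 1.0)}
  population = [
      {"config": {"lr": 1e-2, "alpha": 0.5}, "checkpoint": {"owner": i}, "score": score}
      for i, score in enumerate([0.2, 0.9, 0.5, 0.7])
  ]
  pbt_step(population, mutations, rng)
  print(population[0])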

You can run this `toy PBT example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/pbt_example.py>`__ to get an idea of how PBT operates. When training in PBT mode, a single trial may see many different hyperparameters over its lifetime, which is recorded in its ``result.json`` file. The following figure generated by the example shows PBT discovering new hyperparameters over the course of a single experiment:

.. image:: pbt.png

.. autoclass:: ray.tune.pbt.PopulationBasedTraining

File diff suppressed because one or more lines are too long
|
After Width: | Height: | Size: 200 KiB |
+9
-64
@@ -5,7 +5,7 @@ This document describes Ray Tune, a hyperparameter tuning framework for long-run

It has the following features:

- Scalable implementations of search algorithms such as `Population Based Training (PBT) <#population-based-training>`__, `Median Stopping Rule <https://research.google.com/pubs/pub46180.html>`__, and `HyperBand <https://arxiv.org/abs/1603.06560>`__.
- Scalable implementations of search algorithms such as `Population Based Training (PBT) <pbt.html>`__, `Median Stopping Rule <hyperband.html#median-stopping-rule>`__, and `HyperBand <hyperband.html>`__.

- Integration with visualization tools such as `TensorBoard <https://www.tensorflow.org/get_started/summaries_and_tensorboard>`__, `rllab's VisKit <https://media.readthedocs.org/pdf/rllab/latest/rllab.pdf>`__, and a `parallel coordinates visualization <https://en.wikipedia.org/wiki/Parallel_coordinates>`__.

@@ -19,6 +19,8 @@ You can find the code for Ray Tune `here on GitHub <https://github.com/ray-proje
Concepts
--------

.. image:: tune-api.svg

Ray Tune schedules a number of *trials* in a cluster. Each trial runs a user-defined Python function or class and is parameterized by a JSON *config* variation passed to the user code.

Ray Tune provides a ``run_experiments(spec)`` function that generates and runs the trials described by the experiment specification. The trials are scheduled and managed by a *trial scheduler* that implements the search algorithm (default is FIFO).

@@ -75,6 +77,11 @@ This script runs a small grid search over the ``my_func`` function using Ray Tun

In order to report incremental progress, ``my_func`` periodically calls the ``reporter`` function passed in by Ray Tune to return the current timestep and other metrics as defined in `ray.tune.result.TrainingResult <https://github.com/ray-project/ray/blob/master/python/ray/tune/result.py>`__. Incremental results will be synced to local disk on the head node of the cluster and optionally uploaded to the specified ``upload_dir`` (e.g. S3 path).

Trial Schedulers
----------------

By default, Ray Tune schedules trials in serial order with the ``FIFOScheduler`` class. However, you can also specify a custom scheduling algorithm that can stop trials early, perturb parameters, or incorporate suggestions from an external service. Currently implemented trial schedulers include `Population Based Training (PBT) <pbt.html>`__, `Median Stopping Rule <hyperband.html#median-stopping-rule>`__, and `HyperBand <hyperband.html>`__.

Visualizing Results
-------------------

@@ -135,72 +142,10 @@ By default, each random variable and grid search point is sampled once. To take

For more information on variant generation, see `variant_generator.py <https://github.com/ray-project/ray/blob/master/python/ray/tune/variant_generator.py>`__.

Early Stopping
--------------

To reduce costs, long-running trials can often be stopped early if their initial performance is not promising. Ray Tune allows early stopping algorithms to be plugged in on top of existing grid or random searches. This can be enabled by setting the ``scheduler`` parameter of ``run_experiments``, e.g.

.. code-block:: python

    run_experiments({...}, scheduler=HyperBandScheduler())

An example of this can be found in `hyperband_example.py <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/hyperband_example.py>`__. The progress of one such HyperBand run is shown below.

::

    == Status ==
    Using HyperBand: num_stopped=0 total_brackets=5
    Round #0:
      Bracket(n=5, r=100, completed=80%): {'PAUSED': 4, 'PENDING': 1}
      Bracket(n=8, r=33, completed=23%): {'PAUSED': 4, 'PENDING': 4}
      Bracket(n=15, r=11, completed=4%): {'RUNNING': 2, 'PAUSED': 2, 'PENDING': 11}
      Bracket(n=34, r=3, completed=0%): {'RUNNING': 2, 'PENDING': 32}
      Bracket(n=81, r=1, completed=0%): {'PENDING': 38}
    Resources used: 4/4 CPUs, 0/0 GPUs
    Result logdir: ~/ray_results/hyperband_test
    PAUSED trials:
     - my_class_0_height=99,width=43: PAUSED [pid=11664], 0 s, 100 ts, 97.1 rew
     - my_class_11_height=85,width=81: PAUSED [pid=11771], 0 s, 33 ts, 32.8 rew
     - my_class_12_height=0,width=52: PAUSED [pid=11785], 0 s, 33 ts, 0 rew
     - my_class_19_height=44,width=88: PAUSED [pid=11811], 0 s, 11 ts, 5.47 rew
     - my_class_27_height=96,width=84: PAUSED [pid=11840], 0 s, 11 ts, 12.5 rew
      ... 5 more not shown
    PENDING trials:
     - my_class_10_height=12,width=25: PENDING
     - my_class_13_height=90,width=45: PENDING
     - my_class_14_height=69,width=45: PENDING
     - my_class_15_height=41,width=11: PENDING
     - my_class_16_height=57,width=69: PENDING
      ... 81 more not shown
    RUNNING trials:
     - my_class_23_height=75,width=51: RUNNING [pid=11843], 0 s, 1 ts, 1.47 rew
     - my_class_26_height=16,width=48: RUNNING
     - my_class_31_height=40,width=10: RUNNING
     - my_class_53_height=28,width=96: RUNNING

Ray Tune also implements an `asynchronous version of HyperBand <https://openreview.net/forum?id=S1Y7OOlRZ>`__, which provides better parallelism and avoids straggler issues during eliminations. An example of this can be found in `async_hyperband_example.py <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/async_hyperband_example.py>`__. We recommend using this over the vanilla HyperBand scheduler.

.. note:: Some trial schedulers such as HyperBand and PBT require your Trainable to support checkpointing, which is described in the next section. Checkpointing enables the scheduler to multiplex many concurrent trials onto a limited-size cluster.

Currently we support the following early stopping algorithms, or you can write your own that implements the `TrialScheduler <https://github.com/ray-project/ray/blob/master/python/ray/tune/trial_scheduler.py>`__ interface.

.. autoclass:: ray.tune.median_stopping_rule.MedianStoppingRule
.. autoclass:: ray.tune.hyperband.HyperBandScheduler
.. autoclass:: ray.tune.async_hyperband.AsyncHyperBandScheduler

Population Based Training
-------------------------

Ray Tune includes a distributed implementation of `Population Based Training (PBT) <https://deepmind.com/blog/population-based-training-neural-networks>`__. PBT also requires your Trainable to support checkpointing. You can run this `toy PBT example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/pbt_example.py>`__ to get an idea of how PBT operates. When training in PBT mode, the set of trial variations is treated as the population, so a single trial may see many different hyperparameters over its lifetime, which is recorded in the ``result.json`` file. The following figure generated by the example shows PBT discovering new hyperparameters over the course of a single experiment:

.. image:: pbt.png

.. autoclass:: ray.tune.pbt.PopulationBasedTraining

Trial Checkpointing
-------------------

To enable checkpoint / resume, you must subclass ``Trainable`` and implement its ``_train``, ``_save``, and ``_restore`` abstract methods `(example) <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/hyperband_example.py>`__. Implementing this interface is required to support resource multiplexing in schedulers such as HyperBand and PBT.
To enable checkpointing, you must implement a Trainable class (Trainable functions are not checkpointable, since they never return control back to their caller). The easiest way to do this is to subclass the pre-defined ``Trainable`` class and implement its ``_train``, ``_save``, and ``_restore`` abstract methods `(example) <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/hyperband_example.py>`__. Implementing this interface is required to support resource multiplexing in schedulers such as HyperBand and PBT.

For TensorFlow model training, this would look something like this `(full tensorflow example) <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/tune_mnist_ray_hyperband.py>`__:
