[tune] Improve user guides and API docs (#7716)

* create guide gallery for Tune

* mods

* ok

* fix

* fix_up_gallery

* ok

* Apply suggestions from code review

Co-Authored-By: Sven Mika <sven@anyscale.io>

* Apply suggestions from code review

Co-Authored-By: Sven Mika <sven@anyscale.io>

Co-authored-by: Sven Mika <sven@anyscale.io>
This commit is contained in:
Richard Liaw
2020-04-06 12:16:35 -07:00
committed by GitHub
parent 22ccc43670
commit a67edc4051
22 changed files with 574 additions and 340 deletions
+1 -1
View File
@@ -6,7 +6,7 @@ SPHINXOPTS =
SPHINXBUILD = sphinx-build
PAPER =
BUILDDIR = _build
AUTOGALLERYDIR= source/auto_examples
AUTOGALLERYDIR= source/auto_examples source/tune/generated_guides
# User-friendly check for sphinx-build
ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1)
+1
View File
@@ -0,0 +1 @@
:orphan:
+13 -12
View File
@@ -4,6 +4,19 @@ Ray Tutorials and Examples
Get started with Ray, Tune, and RLlib with these notebooks that you can run online in Colab or Binder:
* `Ray Tutorial Notebooks <https://github.com/ray-project/tutorial>`__
.. toctree::
:hidden:
plot_parameter_server.rst
plot_example-a3c.rst
plot_hyperparameter.rst
plot_pong_example.rst
plot_lbfgs.rst
plot_newsreader.rst
plot_streaming.rst
plot_example-lm.rst
Example Gallery
---------------
@@ -42,15 +55,3 @@ Example Gallery
.. customgalleryitem::
:tooltip: Distributed Fault-Tolerant BERT training for FAIRSeq using Ray.
:description: :doc:`/auto_examples/plot_example-lm`
.. toctree::
:hidden:
plot_parameter_server.rst
plot_example-a3c.rst
plot_hyperparameter.rst
plot_pong_example.rst
plot_lbfgs.rst
plot_newsreader.rst
plot_streaming.rst
plot_example-lm.rst
+4
View File
@@ -14,3 +14,7 @@
color: #2980B9;
text-transform: uppercase
}
.rst-content .section ol p, .rst-content .section ul p {
margin-bottom: 0px;
}
+3 -3
View File
@@ -66,8 +66,9 @@ extensions = [
]
sphinx_gallery_conf = {
"examples_dirs": ["../examples"], # path to example scripts
"gallery_dirs": ["auto_examples"], # path where to save generated examples
"examples_dirs": ["../examples", "tune/guides"], # path to example scripts
# path where to save generated examples
"gallery_dirs": ["auto_examples", "tune/generated_guides"],
"ignore_pattern": "../examples/doc_code/",
"plot_gallery": "False",
# "filename_pattern": "tutorial.py",
@@ -138,7 +139,6 @@ language = None
# directories to ignore when looking for source files.
exclude_patterns = ['_build']
exclude_patterns += sphinx_gallery_conf['examples_dirs']
exclude_patterns += ["*/README.rst"]
# The reST default role (used for this markup: `text`) to use for all
# documents.
+4 -4
View File
@@ -11,8 +11,6 @@ try:
except NameError:
FileNotFoundError = IOError
# This is not a top level item in the directory, so we use `../` to refer
# to images located at the top level.
GALLERY_TEMPLATE = """
.. raw:: html
@@ -20,7 +18,7 @@ GALLERY_TEMPLATE = """
.. only:: html
.. figure:: ../{thumbnail}
.. figure:: {thumbnail}
{description}
@@ -78,9 +76,11 @@ class CustomGalleryItemDirective(Directive):
os.makedirs(thumb_dir, exist_ok=True)
image_path = os.path.join(thumb_dir, os.path.basename(figname))
sphinx_gallery.gen_rst.scale_image(figname, image_path, 400, 280)
thumbnail = os.path.relpath(image_path, env.srcdir)
# https://stackoverflow.com/questions/52138336/sphinx-reference-to-an-image-from-different-locations
thumbnail = "/" + thumbnail
else:
# "/" is the top level srcdir
thumbnail = "/_static/img/thumbnails/default.png"
if "description" in self.options:
Binary file not shown (after: 9.4 KiB).

Binary file not shown (before: 1011 KiB).
+1 -3
View File
@@ -248,10 +248,8 @@ Getting Involved
:caption: Tune
tune.rst
tune-tutorial.rst
tune-advanced-tutorial.rst
Tune Guides and Tutorials <tune/generated_guides/overview.rst>
tune-usage.rst
tune-distributed.rst
tune-schedulers.rst
tune-searchalg.rst
tune-examples.rst
+4 -1
View File
@@ -1,3 +1,5 @@
.. _tune-schedulers:
Tune Trial Schedulers
=====================
@@ -15,6 +17,7 @@ Current Available Trial Schedulers:
:local:
:backlinks: none
.. _tune-scheduler-pbt:
Population Based Training (PBT)
-------------------------------
@@ -31,7 +34,7 @@ Tune includes a distributed implementation of `Population Based Training (PBT) <
hyperparam_mutations={
"lr": [1e-3, 5e-4, 1e-4, 5e-5, 1e-5],
"alpha": lambda: random.uniform(0.0, 1.0),
...
...
})
tune.run( ... , scheduler=pbt_scheduler)
+2
View File
@@ -1,3 +1,5 @@
.. _tune-search-alg:
Tune Search Algorithms
======================
+65 -177
View File
@@ -1,80 +1,58 @@
.. _tune-user-guide:
Tune User Guide
===============
Tune Overview
-------------
The basic Tune API [``tune.run(Trainable)``] has two main parts: a :ref:`Training API <guide-training-api>` and :ref:`tune.run <guide-running-tune>`.
Tune takes a user-defined Python function or class and evaluates it on a set of hyperparameter configurations.
.. _guide-training-api:
Each hyperparameter configuration evaluation is called a *trial*, and multiple trials are run in parallel. Configurations are either generated by Tune or drawn from a user-specified **search algorithm**. The trials are scheduled and managed by a **trial scheduler**.
Training API
------------
.. image:: images/tune-api.svg
More information about Tune's `search algorithms can be found here <tune-searchalg.html>`__. More information about Tune's `trial schedulers can be found here <tune-schedulers.html>`__. You can check out our `examples page <tune-examples.html>`__ for more code examples.
Tune Training API
-----------------
The Tune training API [``tune.run(Trainable)``] has two concepts:
1. The `Trainable <tune-usage.html#trainable-api>`__ API, and
2. `tune.run <tune-usage.html#launching-tune>`__.
Training can be done with either the Trainable **Class API** or **function-based API**.
Trainable API
~~~~~~~~~~~~~
The class-based API will require users to subclass ``ray.tune.Trainable``. See the API documentation: :ref:`trainable-docstring`.
Here is an example:
Training can be done with either a **Class API** (``tune.Trainable``) or **function-based API** (``track.log``). Here is an example ``tune.Trainable`` that you can use to dry-run Tune:
.. code-block:: python
class Example(Trainable):
from ray import tune
class trainable(tune.Trainable):
def _setup(self, config):
...
if config["print_me"]:
print(config["print_me"])
def _train(self):
# run training code
# run one step of training code.
# important: this method is called repeatedly!
result_dict = {"accuracy": 0.5, "f1": 0.1, ...}
return result_dict
tune.run(trainable, config={"print_me": "hello-world"}, stop={"training_iteration": 200})
.. autoclass:: ray.tune.Trainable
:noindex:
Tune function-based API
~~~~~~~~~~~~~~~~~~~~~~~
User-defined functions will need to have following signature and call ``tune.track.log``, which will allow you to report metrics used for scheduling, search, or early stopping:
The **function-based API** is for fast prototyping but has limited functionality. Here is a **function-based API** example:
.. code-block:: python
from ray import tune
import time
def trainable(config):
"""
Args:
config (dict): Parameters provided from the search algorithm
or variant generation.
"""
if config["print_me"]:
print(config["print_me"])
while True:
# ...
tune.track.log(**kwargs)
for i in range(200):
time.sleep(1)
result_dict = {"accuracy": 0.5, "f1": 0.1, ...}
tune.track.log(**result_dict)
tune.run(trainable, config={"print_me": "hello-world"})
Tune will run this function on a separate thread in a Ray actor process. Note that this API is not checkpointable, since the thread will never return control back to its caller. ``tune.track`` documentation can be found here: :ref:`track-docstring`.
To read more, check out the :ref:`Trainable API docs<trainable-docs>`.
Both the Trainable and function-based API will have `autofilled metrics <tune-usage.html#auto-filled-results>`__ in addition to the metrics reported.
.. _guide-running-tune:
.. note::
If you have a lambda function that you want to train, you will need to first register the function: ``tune.register_trainable("lambda_id", lambda x: ...)``. You can then use ``lambda_id`` in place of ``my_trainable``.
.. note:: See previous versions of the documentation for the ``reporter`` API.
Launching Tune
~~~~~~~~~~~~~~
Running Tune
------------
Use ``tune.run`` to generate and execute your hyperparameter sweep:
@@ -88,20 +66,24 @@ Use ``tune.run`` to generate and execute your hyperparameter sweep:
This function will report status on the command line until all Trials stop:
::
.. code-block:: bash
== Status ==
Memory usage on this node: 11.4/16.0 GiB
Using FIFO scheduling algorithm.
Resources used: 4/8 CPUs, 0/0 GPUs
Result logdir: ~/ray_results/my_experiment
- train_func_0_lr=0.2,momentum=1: RUNNING [pid=6778], 209 s, 20604 ts, 7.29 acc
- train_func_1_lr=0.4,momentum=1: RUNNING [pid=6780], 208 s, 20522 ts, 53.1 acc
- train_func_2_lr=0.6,momentum=1: TERMINATED [pid=6789], 21 s, 2190 ts, 100 acc
- train_func_3_lr=0.2,momentum=2: RUNNING [pid=6791], 208 s, 41004 ts, 8.37 acc
- train_func_4_lr=0.4,momentum=2: RUNNING [pid=6800], 209 s, 41204 ts, 70.1 acc
- train_func_5_lr=0.6,momentum=2: TERMINATED [pid=6809], 10 s, 2164 ts, 100 acc
Resources requested: 4/12 CPUs, 0/0 GPUs, 0.0/3.17 GiB heap, 0.0/1.07 GiB objects
Result logdir: /Users/foo/ray_results/myexp
Number of trials: 4 (4 RUNNING)
+----------------------+----------+---------------------+-----------+--------+--------+--------+--------+------------------+-------+
| Trial name | status | loc | param1 | param2 | param3 | acc | loss | total time (s) | iter |
|----------------------+----------+---------------------+-----------+--------+--------+--------+--------+------------------+-------|
| MyTrainable_a826033a | RUNNING | 10.234.98.164:31115 | 0.303706 | 0.0761 | 0.4328 | 0.1289 | 1.8572 | 7.54952 | 15 |
| MyTrainable_a8263fc6 | RUNNING | 10.234.98.164:31117 | 0.929276 | 0.158 | 0.3417 | 0.4865 | 1.6307 | 7.0501 | 14 |
| MyTrainable_a8267914 | RUNNING | 10.234.98.164:31111 | 0.068426 | 0.0319 | 0.1147 | 0.9585 | 1.9603 | 7.0477 | 14 |
| MyTrainable_a826b7bc | RUNNING | 10.234.98.164:31112 | 0.729127 | 0.0748 | 0.1784 | 0.1797 | 1.7161 | 7.05715 | 14 |
+----------------------+----------+---------------------+-----------+--------+--------+--------+--------+------------------+-------+
All results reported by the trainable will be logged locally to a unique directory per experiment, e.g. ``~/ray_results/example-experiment`` in the above example. On a cluster, incremental results will be synced to local disk on the head node.
All results reported by the trainable will be logged locally to a unique directory per experiment, e.g. ``~/ray_results/example-experiment`` in the above example. On a cluster, incremental results will be synced to local disk on the head node. All results will have `autofilled metrics <tune-usage.html#auto-filled-results>`__ in addition to your own user-defined metrics.
Trial Parallelism
~~~~~~~~~~~~~~~~~
@@ -162,73 +144,31 @@ You can use the ``ExperimentAnalysis`` object to obtain the best configuration o
>>> print("Best config is", analysis.get_best_config(metric="mean_accuracy"))
Best config is: {'lr': 0.011537575723482687, 'momentum': 0.8921971713692662}
Here are some example operations for obtaining a summary of your experiment:
.. code-block:: python
# Get a dataframe for the last reported results of all of the trials
df = analysis.dataframe()
# Get a dataframe for the max accuracy seen for each trial
df = analysis.dataframe(metric="mean_accuracy", mode="max")
# Get a dict mapping {trial logdir -> dataframes} for all trials in the experiment.
all_dataframes = analysis.trial_dataframes
# Get a list of trials
trials = analysis.trials
You may want to get a summary of multiple experiments that point to the same ``local_dir``. For this, you can use the ``Analysis`` class.
.. code-block:: python
from ray.tune import Analysis
analysis = Analysis("~/ray_results/example-experiment")
See the full documentation for the ``Analysis`` object: :ref:`analysis-docstring`.
See the full documentation for the ``Analysis`` object: :ref:`exp-analysis-docstring`.
Tune Search Space (Default)
---------------------------
Grid Search/Random Search
-------------------------
You can use ``tune.grid_search`` to specify an axis of a grid search. By default, Tune also supports sampling parameters from user-specified lambda functions, which can be used independently or in combination with grid search.
.. warning:: If you use a Search Algorithm, you may not be able to specify lambdas or grid search with this
interface, as the search algorithm may require a different search space declaration.
.. note::
If you specify an explicit Search Algorithm such as any SuggestionAlgorithm, you may not be able to specify lambdas or grid search with this interface, as the search algorithm may require a different search space declaration.
Use ``tune.sample_from(<func>)`` to sample a value for a hyperparameter. The ``func`` should take in a ``spec`` object, which has a ``config`` namespace from which you can access other hyperparameters. This is useful for conditional distributions:
You can specify a grid search or random search via the dict passed into ``tune.run(config=)``.
.. code-block:: python
tune.run(
...,
trainable,
config={
"alpha": tune.sample_from(lambda spec: np.random.uniform(100)),
"beta": tune.sample_from(lambda spec: spec.config.alpha * np.random.normal())
"qux": tune.sample_from(lambda spec: 2 + 2),
"bar": tune.grid_search([True, False]),
"foo": tune.grid_search([1, 2, 3]),
"baz": "asd",
}
)
Tune provides a couple of helper functions for common parameter distributions, wrapping numpy random utilities such as ``np.random.uniform``, ``np.random.choice``, and ``np.random.randn``. See :ref:`tune-sample-docs` for more details.
The following shows grid search over two nested parameters combined with random sampling from two lambda functions, generating 9 different trials. Note that the value of ``beta`` depends on the value of ``alpha``, which is represented by referencing ``spec.config.alpha`` in the lambda function. This lets you specify conditional parameter distributions.
.. code-block:: python
:emphasize-lines: 4-11
tune.run(
my_trainable,
name="my_trainable",
config={
"alpha": tune.sample_from(lambda spec: np.random.uniform(100)),
"beta": tune.sample_from(lambda spec: spec.config.alpha * np.random.normal()),
"nn_layers": [
tune.grid_search([16, 64, 256]),
tune.grid_search([16, 64, 256]),
],
}
)
Read about this in the :ref:`Grid/Random Search API <tune-grid-random>` page.
Custom Trial Names
------------------
@@ -316,8 +256,8 @@ The ``Trainable`` also provides the ``default_resource_requests`` interface to a
:noindex:
Save and Restore
----------------
Trainable (Trial) Checkpointing
-------------------------------
When running a hyperparameter search, Tune can automatically and periodically save/checkpoint your model. Checkpointing is used for
@@ -326,36 +266,7 @@ When running a hyperparameter search, Tune can automatically and periodically sa
* fault-tolerance in experiments with pre-emptible machines.
* enables certain Trial Schedulers such as HyperBand and PBT.
To enable checkpointing, you must implement a `Trainable class <tune-usage.html#trainable-api>`__ (Trainable functions are not checkpointable, since they never return control back to their caller). The easiest way to do this is to subclass the pre-defined ``Trainable`` class and implement ``_save``, and ``_restore`` abstract methods, as seen in `this example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/hyperband_example.py>`__.
For PyTorch model training, this would look something like this `PyTorch example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/mnist_pytorch_trainable.py>`__:
.. code-block:: python
class MyTrainableClass(Trainable):
def _save(self, tmp_checkpoint_dir):
checkpoint_path = os.path.join(tmp_checkpoint_dir, "model.pth")
torch.save(self.model.state_dict(), checkpoint_path)
return tmp_checkpoint_dir
def _restore(self, tmp_checkpoint_dir):
checkpoint_path = os.path.join(tmp_checkpoint_dir, "model.pth")
self.model.load_state_dict(torch.load(checkpoint_path))
Checkpoints will be saved by training iteration to ``local_dir/exp_name/trial_name/checkpoint_<iter>``. You can restore a single trial checkpoint by using ``tune.run(restore=<checkpoint_dir>)``.
Tune also generates temporary checkpoints for pausing and switching between trials. For this purpose, it is important not to depend on absolute paths in the implementation of ``save``. See the below reference:
.. automethod:: ray.tune.Trainable._save
:noindex:
.. automethod:: ray.tune.Trainable._restore
:noindex:
Trainable (Trial) Checkpointing
-------------------------------
To enable checkpointing, you must implement a `Trainable class <tune-usage.html#trainable-api>`__ (Trainable functions are not checkpointable, since they never return control back to their caller).
Checkpointing assumes that the model state will be saved to disk on whichever node the Trainable is running on. You can checkpoint with three different mechanisms: manually, periodically, and at termination.
@@ -407,6 +318,8 @@ The checkpoint will be saved at a path that looks like ``local_dir/exp_name/tria
config={"env": "CartPole-v0"},
)
.. _tune-fault-tol:
Fault Tolerance
---------------
@@ -416,7 +329,7 @@ Tune will restore trials from the latest checkpoint, where available. In the dis
If the trial/actor is placed on a different node, Tune will automatically push the previous checkpoint file to that node and restore the remote trial actor state, allowing the trial to resume from the latest checkpoint even after failure.
Take a look at `an example <tune-distributed.html#example-for-using-spot-instances-aws>`_.
Take a look at an example: :ref:`tune-distributed-spot`.
Recovering From Failures
~~~~~~~~~~~~~~~~~~~~~~~~
@@ -542,19 +455,15 @@ The following fields will automatically show up on the console output, if provid
3. ``mean_accuracy``
4. ``timesteps_this_iter`` (aggregated into ``timesteps_total``).
.. code-block:: bash
Example_0: TERMINATED [pid=68248], 179 s, 2 iter, 60000 ts, 94 rew
TensorBoard
-----------
To visualize learning in tensorboard, install TensorFlow or tensorboardX:
To visualize learning in tensorboard, install tensorboardX:
.. code-block:: bash
$ pip install tensorboardX # or pip install tensorflow
$ pip install tensorboardX
Then, after you run an experiment, you can visualize your experiment with TensorBoard by specifying the output directory of your results. Note that if you are running Ray on a remote cluster, you can forward the TensorBoard port to your local machine through SSH using ``ssh -L 6006:localhost:6006 <address>``:
@@ -585,22 +494,6 @@ If using TF2, Tune also automatically generates TensorBoard HParams output, as s
.. image:: images/tune-hparams.png
The nonrelevant metrics (like timing stats) can be disabled on the left to show only the relevant ones (like accuracy, loss, etc.).
Viskit
------
To use VisKit (you may have to install some dependencies), run:
.. code-block:: bash
$ git clone https://github.com/rll/rllab.git
$ python rllab/rllab/viskit/frontend.py ~/ray_results/my_experiment
.. image:: ray-tune-viskit.png
Logging
-------
@@ -616,12 +509,7 @@ You can pass in your own logging mechanisms to output logs in custom formats as
loggers=DEFAULT_LOGGERS + (CustomLogger1, CustomLogger2)
)
These loggers will be called along with the default Tune loggers. All loggers must inherit the Logger interface (:ref:`logger-interface`). Tune enables default loggers for Tensorboard, CSV, and JSON formats. You can also check out `logger.py <https://github.com/ray-project/ray/blob/master/python/ray/tune/logger.py>`__ for implementation details. An example can be found in `logging_example.py <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/logging_example.py>`__.
MLFlow
~~~~~~
Tune also provides a default logger for `MLFlow <https://mlflow.org>`_. You can install MLFlow via ``pip install mlflow``. An example can be found `mlflow_example.py <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/mlflow_example.py>`__. Note that this currently does not include artifact logging support. For this, you can use the native MLFlow APIs inside your Trainable definition.
These loggers will be called along with the default Tune loggers. All loggers must inherit the Logger interface (:ref:`logger-interface`). Tune enables default loggers for Tensorboard, CSV, and JSON formats. You can also check out `logger.py <https://github.com/ray-project/ray/blob/master/python/ray/tune/logger.py>`__ for implementation details. An example can be found in `logging_example.py <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/logging_example.py>`__. See the :ref:`Logging API <logger-interface>`.
Uploading/Syncing
-----------------
+4 -8
View File
@@ -7,7 +7,7 @@ Tune: Scalable Hyperparameter Tuning
Tune is a Python library for experiment execution and hyperparameter tuning at any scale. Core features:
* Launch a multi-node `distributed hyperparameter sweep <tune-distributed.html>`_ in less than 10 lines of code.
* Launch a multi-node :ref:`distributed hyperparameter sweep <tune-distributed>` in less than 10 lines of code.
* Supports any machine learning framework, including PyTorch, XGBoost, MXNet, and Keras. See `examples here <tune-examples.html>`_.
* Natively `integrates with optimization libraries <tune-searchalg.html>`_ such as `HyperOpt <https://github.com/hyperopt/hyperopt>`_, `Bayesian Optimization <https://github.com/fmfn/BayesianOptimization>`_, and `Facebook Ax <http://ax.dev>`_.
* Choose among `scalable algorithms <tune-schedulers.html>`_ such as `Population Based Training (PBT)`_, `Vizier's Median Stopping Rule`_, `HyperBand/ASHA`_.
@@ -36,14 +36,10 @@ For more information, check out:
Quick Start
-----------
To run this example, you will need to install the following:
.. code-block:: bash
$ pip install 'ray[tune]' torch torchvision
To run this example, install the following: ``pip install 'ray[tune]' torch torchvision``.
This example runs a small grid search to train a CNN using PyTorch and Tune.
This example runs a small grid search to train a convolutional neural network using PyTorch and Tune.
.. literalinclude:: ../../python/ray/tune/tests/example.py
:language: python
@@ -67,7 +63,7 @@ If using TF2 and TensorBoard, Tune will also automatically generate TensorBoard
:scale: 20%
:align: center
Take a look at the `Distributed Experiments <tune-distributed.html>`_ documentation for:
Take a look at the :ref:`Distributed Experiments <tune-distributed>` documentation for:
1. Setting up distributed experiments on your local cluster
2. Using AWS and GCP
+52 -3
View File
@@ -4,6 +4,41 @@ Analysis/Logging (tune.analysis / tune.logger)
Analyzing Results
-----------------
You can use the ``ExperimentAnalysis`` object for analyzing results. It is returned automatically when calling ``tune.run``.
.. code-block:: python
analysis = tune.run(
trainable,
name="example-experiment",
num_samples=10,
)
Here are some example operations for obtaining a summary of your experiment:
.. code-block:: python
# Get a dataframe for the last reported results of all of the trials
df = analysis.dataframe()
# Get a dataframe for the max accuracy seen for each trial
df = analysis.dataframe(metric="mean_accuracy", mode="max")
# Get a dict mapping {trial logdir -> dataframes} for all trials in the experiment.
all_dataframes = analysis.trial_dataframes
# Get a list of trials
trials = analysis.trials
You may want to get a summary of multiple experiments that point to the same ``local_dir``. For this, you can use the ``Analysis`` class.
.. code-block:: python
from ray.tune import Analysis
analysis = Analysis("~/ray_results/example-experiment")
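The idea behind picking a best configuration from an analysis object can be sketched in plain Python. This is only an illustration of the selection logic, not Tune's implementation: the ``best_config`` helper, the trial dictionaries, and the metric values are all hypothetical.

```python
def best_config(trials, metric, mode="max"):
    """Return the config of the trial whose last result is best for ``metric``."""
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    pick = max if mode == "max" else min
    best = pick(trials, key=lambda trial: trial["last_result"][metric])
    return best["config"]


# Made-up trial records, used only to exercise the helper.
trials = [
    {"config": {"lr": 0.1}, "last_result": {"mean_accuracy": 0.71}},
    {"config": {"lr": 0.01}, "last_result": {"mean_accuracy": 0.89}},
    {"config": {"lr": 0.001}, "last_result": {"mean_accuracy": 0.80}},
]
print(best_config(trials, metric="mean_accuracy"))  # {'lr': 0.01}
```

With ``mode="min"`` the same helper would return the configuration with the smallest metric value, which is the usual choice for a loss.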
.. _exp-analysis-docstring:
ExperimentAnalysis
~~~~~~~~~~~~~~~~~~
@@ -11,8 +46,6 @@ ExperimentAnalysis
:show-inheritance:
:members:
.. _analysis-docstring:
Analysis
~~~~~~~~
@@ -24,6 +57,21 @@ Analysis
Loggers (tune.logger)
---------------------
Viskit
~~~~~~
Tune automatically integrates with Viskit via the ``CSVLogger`` outputs. To use VisKit (you may have to install some dependencies), run:
.. code-block:: bash
$ git clone https://github.com/rll/rllab.git
$ python rllab/rllab/viskit/frontend.py ~/ray_results/my_experiment
The nonrelevant metrics (like timing stats) can be disabled on the left to show only the relevant ones (like accuracy, loss, etc.).
.. image:: /ray-tune-viskit.png
.. _logger-interface:
Logger
@@ -54,5 +102,6 @@ CSVLogger
MLFLowLogger
~~~~~~~~~~~~
.. autoclass:: ray.tune.logger.MLFLowLogger
Tune also provides a default logger for `MLFlow <https://mlflow.org>`_. You can install MLFlow via ``pip install mlflow``. An example can be found in `mlflow_example.py <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/mlflow_example.py>`__. Note that this currently does not include artifact logging support. For this, you can use the native MLFlow APIs inside your Trainable definition.
.. autoclass:: ray.tune.logger.MLFLowLogger
+158 -7
View File
@@ -1,10 +1,162 @@
.. _tune-grid-random:
Grid/Random Search
==================
Overview
--------
Tune has a native interface for specifying a grid search or random search. You can specify the search space via ``tune.run(config=...)``.
You can either use the ``tune.grid_search`` primitive to specify an axis of a grid search...
.. code-block:: python
tune.run(
trainable,
config={"bar": tune.grid_search([True, False])})
... or one of the random sampling primitives to specify distributions (:ref:`tune-sample-docs`):
.. code-block:: python
tune.run(
trainable,
config={
"param1": tune.choice([True, False]),
"bar": tune.uniform(0, 10),
"alpha": tune.sample_from(lambda _: np.random.uniform(100) ** 2),
"const": "hello" # It is also ok to specify constant values.
})
.. caution:: If you use a Search Algorithm, you may not be able to specify lambdas or grid search with this
interface, as the search algorithm may require a different search space declaration.
To sample multiple times/run multiple trials, specify ``tune.run(num_samples=N)``. If ``grid_search`` is provided as an argument, the *same* grid will be repeated ``N`` times.
.. code-block:: python
# 13 different configs.
tune.run(trainable, num_samples=13, config={
"x": tune.choice([0, 1, 2]),
}
)
# 13 different configs.
tune.run(trainable, num_samples=13, config={
"x": tune.choice([0, 1, 2]),
"y": tune.randn(),
}
)
# 4 different configs.
tune.run(trainable, config={"x": tune.grid_search([1, 2, 3, 4])}, num_samples=1)
# 3 different configs.
tune.run(trainable, config={"x": tune.grid_search([1, 2, 3])}, num_samples=1)
# 6 different configs.
tune.run(trainable, config={"x": tune.grid_search([1, 2, 3])}, num_samples=2)
# 9 different configs.
tune.run(trainable, num_samples=1, config={
"x": tune.grid_search([1, 2, 3]),
"y": tune.grid_search(["a", "b", "c"])}
)
# 18 different configs.
tune.run(trainable, num_samples=2, config={
"x": tune.grid_search([1, 2, 3]),
"y": tune.grid_search(["a", "b", "c"])}
)
# 45 different configs.
tune.run(trainable, num_samples=5, config={
"x": tune.grid_search([1, 2, 3]),
"y": tune.grid_search(["a", "b", "c"])}
)
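The counts in the comments above follow one rule: the number of trials is the product of the grid axis lengths multiplied by ``num_samples``, and random primitives are drawn once per trial without adding trials of their own. The following is a minimal sketch of that rule in plain Python. The ``expand_configs`` helper is hypothetical and is not how Tune generates variants internally.

```python
import itertools
import random


def expand_configs(grid_axes, samplers, num_samples):
    """Expand grid axes and samplers into a list of concrete configs.

    grid_axes: dict mapping name -> list of values (every value is tried).
    samplers: dict mapping name -> zero-argument callable (drawn per config).
    """
    names = list(grid_axes)
    configs = []
    for _ in range(num_samples):  # the whole grid is repeated num_samples times
        for values in itertools.product(*(grid_axes[n] for n in names)):
            config = dict(zip(names, values))
            for name, draw in samplers.items():
                config[name] = draw()  # one fresh draw for each config
            configs.append(config)
    return configs


grid = {"x": [1, 2, 3], "y": ["a", "b", "c"]}
choice_x = {"x": lambda: random.choice([0, 1, 2])}

print(len(expand_configs(grid, {}, num_samples=1)))       # 3 * 3 * 1 = 9
print(len(expand_configs(grid, {}, num_samples=5)))       # 3 * 3 * 5 = 45
print(len(expand_configs({}, choice_x, num_samples=13)))  # 13
```

Note that with random primitives alone the sampled values may repeat, so "13 different configs" means 13 trials rather than 13 distinct values.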
Note that grid search and random search primitives are inter-operable. Each can be used independently or in combination with each other.
.. code-block:: python
# 6 different configs.
tune.run(trainable, num_samples=2, config={
"x": tune.sample_from(...),
"y": tune.grid_search(["a", "b", "c"])
}
)
In the below example, ``num_samples=10`` repeats the 3x3 grid search 10 times, for a total of 90 trials, each with randomly sampled values of ``alpha`` and ``beta``.
.. code-block:: python
:emphasize-lines: 12
tune.run(
my_trainable,
name="my_trainable",
# num_samples will repeat the entire config 10 times.
num_samples=10,
config={
# ``sample_from`` creates a generator to call the lambda once per trial.
"alpha": tune.sample_from(lambda spec: np.random.uniform(100)),
# ``sample_from`` also supports "conditional search spaces"
"beta": tune.sample_from(lambda spec: spec.config.alpha * np.random.normal()),
"nn_layers": [
# tune.grid_search will make it so that all values are evaluated.
tune.grid_search([16, 64, 256]),
tune.grid_search([16, 64, 256]),
],
},
)
Custom/Conditional Search Spaces
--------------------------------
You'll often run into awkward search spaces (e.g., when one hyperparameter depends on another). Use ``tune.sample_from(func)`` to provide a **custom** callable function for generating a search space.
The parameter ``func`` should take in a ``spec`` object, which has a ``config`` namespace from which you can access other hyperparameters. This is useful for conditional distributions:
.. code-block:: python
tune.run(
...,
config={
# A random function
"alpha": tune.sample_from(lambda _: np.random.uniform(100)),
# Use the `spec.config` namespace to access other hyperparameters
"beta": tune.sample_from(lambda spec: spec.config.alpha * np.random.normal())
}
)
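To make the ``spec.config`` mechanism concrete, here is a simplified model of how dependent values could be resolved. It is a sketch under stated assumptions, not Tune's variant generator: the ``resolve`` helper is hypothetical, it assumes entries are declared in dependency order, and it uses the standard ``random`` module instead of NumPy.

```python
import random
from types import SimpleNamespace


def resolve(config_fns):
    """Resolve sampling functions in declaration order.

    Each function receives a ``spec`` whose ``config`` attribute exposes the
    values resolved so far, so later entries may depend on earlier ones.
    """
    resolved = SimpleNamespace()
    spec = SimpleNamespace(config=resolved)
    for name, fn in config_fns.items():
        setattr(resolved, name, fn(spec))
    return vars(resolved)


random.seed(0)
config = resolve({
    # An independent random value.
    "alpha": lambda _: random.uniform(0, 100),
    # A value that depends on the already-resolved ``alpha``.
    "beta": lambda spec: spec.config.alpha * 2,
})
print(config)
```

Because ``beta`` reads ``spec.config.alpha``, every resolved config satisfies ``beta == 2 * alpha``, which is the kind of conditional relationship described above.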
Here's an example showing a grid search over two nested parameters combined with random sampling from two lambda functions, generating 9 different trials. Note that the value of ``beta`` depends on the value of ``alpha``, which is represented by referencing ``spec.config.alpha`` in the lambda function. This lets you specify conditional parameter distributions.
.. code-block:: python
:emphasize-lines: 4-11
tune.run(
my_trainable,
name="my_trainable",
config={
"alpha": tune.sample_from(lambda spec: np.random.uniform(100)),
"beta": tune.sample_from(lambda spec: spec.config.alpha * np.random.normal()),
"nn_layers": [
tune.grid_search([16, 64, 256]),
tune.grid_search([16, 64, 256]),
],
}
)
.. _tune-sample-docs:
Random Distributions
--------------------
Random Distributions API
------------------------
tune.randn
~~~~~~~~~~
@@ -31,14 +183,13 @@ tune.sample_from
.. autoclass:: ray.tune.sample_from
Grid Search
-----------
tune.grid_search
~~~~~~~~~~~~~~~~
Grid Search API
---------------
.. autofunction:: ray.tune.grid_search
Internals
---------
BasicVariantGenerator
~~~~~~~~~~~~~~~~~~~~~
+1 -1
View File
@@ -11,11 +11,11 @@ on `Github`_.
execution.rst
trainable.rst
reporters.rst
analysis.rst
grid_random.rst
suggestion.rst
schedulers.rst
internals.rst
reporters.rst
client.rst
cli.rst
+152 -6
View File
@@ -1,10 +1,149 @@
.. _trainable-docs:
Training (tune.Trainable, tune.track)
=====================================
.. _trainable-docstring:
Training can be done with either a **Class API** (``tune.Trainable``) or **function-based API** (``track.log``).
You can use the **function-based API** for fast prototyping. On the other hand, the ``tune.Trainable`` interface supports checkpoint/restore functionality and provides more control for advanced algorithms.
Function-based API
------------------
.. code-block:: python
def trainable(config):
"""
Args:
config (dict): Parameters provided from the search algorithm
or variant generation.
"""
while True:
# ...
tune.track.log(**kwargs)
.. tip:: Do not use ``tune.track.log`` within a ``Trainable`` class.
Tune will run this function on a separate thread in a Ray actor process. Note that this API is not checkpointable, since the thread will never return control back to its caller.
.. note:: If you have a lambda function that you want to train, you will need to first register the function: ``tune.register_trainable("lambda_id", lambda x: ...)``. You can then use ``lambda_id`` in place of ``my_trainable``.
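The threading behavior described in this section can be illustrated without Ray. In the sketch below, ``run_function_trainable`` and the ``log`` callback are hypothetical stand-ins (the real API reports through ``tune.track.log`` and takes only ``config``). The point is that the caller only regains control once the function returns, which is why this style cannot be paused for checkpointing.

```python
import queue
import threading


def run_function_trainable(train_fn, config):
    """Run ``train_fn`` on a worker thread and collect what it reports."""
    reported = queue.Queue()

    def log(**metrics):  # hypothetical stand-in for tune.track.log
        reported.put(metrics)

    worker = threading.Thread(target=train_fn, args=(config, log))
    worker.start()
    worker.join()  # the caller regains control only when train_fn returns

    results = []
    while not reported.empty():
        results.append(reported.get())
    return results


def trainable(config, log):
    # Report one result per step; a real function would train a model here.
    for step in range(3):
        log(step=step, accuracy=config["base"] + 0.1 * step)


results = run_function_trainable(trainable, {"base": 0.5})
print(len(results))  # 3
```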
Trainable API
-------------
.. caution:: Do not use ``tune.track.log`` within a ``Trainable`` class.
The Trainable **class API** will require users to subclass ``ray.tune.Trainable``. Here's a naive example of this API:
.. code-block:: python
from ray import tune
class Guesser(tune.Trainable):
"""Randomly picks 10 numbers from [1, 10000) to find the password."""
def _setup(self, config):
self.config = config
self.password = 1024
def _train(self):
"""Execute one step of 'training'."""
result_dict = {"diff": abs(self.config['guess'] - self.password)}
return result_dict
def _stop(self):
# perform any cleanup necessary.
pass
analysis = tune.run(
Guesser,
stop={
"training_iteration": 1,
},
num_samples=10,
config={
"guess": tune.randint(1, 10000)
})
print('best config: ', analysis.get_best_config(metric="diff", mode="min"))
Because ``Guesser`` is a subclass of ``tune.Trainable``, Tune will create a ``Guesser`` object on a separate process (using the Ray Actor API).
1. ``_setup`` is invoked once, when training starts.
2. ``_train`` is invoked **multiple times**. Each time, the Guesser object executes one logical iteration of training in the tuning process, which may include one or more iterations of actual training.
3. ``_stop`` is invoked when training is finished.
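The call order listed above can be sketched without Ray at all. The snippet below is a plain-Python stand-in, not Tune's implementation: ``LifecycleRecorder`` and ``drive`` are hypothetical names, and the loop only imitates how a runner might call ``_setup``, ``_train``, and ``_stop``.

```python
class LifecycleRecorder:
    """Hypothetical stand-in for a Trainable; it does not import Ray."""

    def __init__(self, config):
        self.calls = []
        self._setup(config)

    def _setup(self, config):
        self.calls.append("_setup")
        self.config = config
        self.password = 1024

    def _train(self):
        self.calls.append("_train")
        return {"diff": abs(self.config["guess"] - self.password)}

    def _stop(self):
        self.calls.append("_stop")


def drive(trainable_cls, config, iterations):
    """Imitate a runner: set up once, train repeatedly, then stop."""
    trainable = trainable_cls(config)
    results = [trainable._train() for _ in range(iterations)]
    trainable._stop()
    return trainable, results


trainable, results = drive(LifecycleRecorder, {"guess": 1000}, iterations=3)
```

Inspecting ``trainable.calls`` afterwards shows a single ``_setup``, one ``_train`` per iteration, and a final ``_stop``.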
.. tip:: As a rule of thumb, the execution time of ``_train`` should be large enough to avoid overheads (i.e. more than a few seconds), but short enough to report progress periodically (i.e. at most a few minutes).
In this example, we only implemented the ``_setup``, ``_train``, and ``_stop`` methods for simplicity. Next, we'll implement ``_save`` and ``_restore`` for checkpointing and fault tolerance.
Save and Restore
~~~~~~~~~~~~~~~~
Many Tune features rely on ``_save`` and ``_restore``, including the usage of certain Trial Schedulers, fault tolerance, and checkpointing.
.. code-block:: python
import os
import torch
from ray.tune import Trainable

class MyTrainableClass(Trainable):
def _save(self, tmp_checkpoint_dir):
checkpoint_path = os.path.join(tmp_checkpoint_dir, "model.pth")
torch.save(self.model.state_dict(), checkpoint_path)
return tmp_checkpoint_dir
def _restore(self, tmp_checkpoint_dir):
checkpoint_path = os.path.join(tmp_checkpoint_dir, "model.pth")
self.model.load_state_dict(torch.load(checkpoint_path))
Checkpoints will be saved by training iteration to ``local_dir/exp_name/trial_name/checkpoint_<iter>``. You can restore a single trial checkpoint by using ``tune.run(restore=<checkpoint_dir>)``.
Tune also generates temporary checkpoints for pausing and switching between trials. For this reason, it is important not to depend on absolute paths in the implementation of ``_save``.
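To see why absolute paths are a problem, consider a checkpoint directory that is relocated between save and restore. The sketch below uses only the standard library: ``CounterState`` is a hypothetical stand-in for a Trainable, and JSON replaces ``torch.save`` so the snippet runs without PyTorch. Both hooks rebuild the file path from the directory they are handed.

```python
import json
import os
import shutil
import tempfile


class CounterState:
    """Hypothetical stand-in for a Trainable with save/restore hooks."""

    def __init__(self):
        self.iteration = 0

    def _save(self, tmp_checkpoint_dir):
        # Build the path from the directory we are given on every call.
        path = os.path.join(tmp_checkpoint_dir, "state.json")
        with open(path, "w") as f:
            json.dump({"iteration": self.iteration}, f)
        return tmp_checkpoint_dir

    def _restore(self, tmp_checkpoint_dir):
        # Never reuse a path remembered from _save; the directory may differ.
        path = os.path.join(tmp_checkpoint_dir, "state.json")
        with open(path) as f:
            self.iteration = json.load(f)["iteration"]


original = CounterState()
original.iteration = 7

with tempfile.TemporaryDirectory() as root:
    saved_dir = original._save(tempfile.mkdtemp(dir=root))
    # Simulate the checkpoint being relocated, e.g. copied to another node.
    moved_dir = os.path.join(root, "moved")
    shutil.move(saved_dir, moved_dir)

    restored = CounterState()
    restored._restore(moved_dir)
```

Had ``_save`` stored the full file path on ``self`` and ``_restore`` reused it, the restore would fail once the directory moved.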
Use ``validate_save_restore`` to catch ``_save``/``_restore`` errors before execution.
.. code-block:: python
from ray.tune.utils import validate_save_restore
# both of these should return
validate_save_restore(MyTrainableClass)
validate_save_restore(MyTrainableClass, use_object_store=True)
Advanced: Reusing Actors
~~~~~~~~~~~~~~~~~~~~~~~~
Your Trainable can often take a long time to start. To avoid this, you can do ``tune.run(reuse_actors=True)`` to reuse the same Trainable Python process and object for multiple hyperparameters.
This requires you to implement ``Trainable.reset_config``, which provides a new set of hyperparameters. It is up to you to correctly update the hyperparameters of your Trainable.
.. code-block:: python
class PytorchTrainable(tune.Trainable):
"""Train a Pytorch ConvNet."""
def _setup(self, config):
self.train_loader, self.test_loader = get_data_loaders()
self.model = ConvNet()
self.optimizer = optim.SGD(
self.model.parameters(),
lr=config.get("lr", 0.01),
momentum=config.get("momentum", 0.9))
def reset_config(self, new_config):
for param_group in self.optimizer.param_groups:
if "lr" in new_config:
param_group["lr"] = new_config["lr"]
if "momentum" in new_config:
param_group["momentum"] = new_config["momentum"]
self.model = ConvNet()
self.config = new_config
return True
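The benefit of reuse can be illustrated with a plain-Python sketch. Nothing here calls Ray: ``ReusableTrainable`` is a hypothetical class, and the loop merely imitates a runner that hands new hyperparameters to an existing object instead of constructing a fresh one. Returning ``True`` is taken to mean the reset succeeded, as in the example above.

```python
class ReusableTrainable:
    """Hypothetical stand-in for a Trainable that is expensive to set up."""

    setup_calls = 0

    def __init__(self, config):
        ReusableTrainable.setup_calls += 1  # pretend this step is slow
        self.config = dict(config)
        self.steps = 0

    def train(self):
        self.steps += 1
        return {"lr": self.config["lr"], "steps": self.steps}

    def reset_config(self, new_config):
        # Swap hyperparameters and clear per-trial state in place.
        self.config = dict(new_config)
        self.steps = 0
        return True


configs = [{"lr": 0.1}, {"lr": 0.01}, {"lr": 0.001}]
trainable = ReusableTrainable(configs[0])
results = [trainable.train()]
for config in configs[1:]:
    assert trainable.reset_config(config)
    results.append(trainable.train())
```

Three configurations are evaluated, yet the expensive constructor runs only once. Any state that should not leak between trials (here, ``steps``) has to be cleared explicitly in ``reset_config``.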
tune.Trainable
~~~~~~~~~~~~~~
--------------
.. autoclass:: ray.tune.Trainable
:member-order: groupwise
@@ -12,21 +151,28 @@ tune.Trainable
:members:
tune.DurableTrainable
~~~~~~~~~~~~~~~~~~~~~
---------------------
.. autoclass:: ray.tune.DurableTrainable
.. _track-docstring:
tune.track
~~~~~~~~~~
----------
.. automodule:: ray.tune.track
:members:
:exclude-members: init, shutdown
:exclude-members: init,
KerasCallback
-------------
.. automodule:: ray.tune.integration.keras
:members:
StatusReporter
~~~~~~~~~~~~~~
--------------
.. autoclass:: ray.tune.function_runner.StatusReporter
:members: __call__, logdir
+1
View File
@@ -0,0 +1 @@
:orphan:
+38
View File
@@ -0,0 +1,38 @@
Tune Guides and Tutorials
=========================
Tune takes a user-defined Python function or class and evaluates it on a set of hyperparameter configurations.
Each hyperparameter configuration evaluation is called a *trial*, and multiple trials are run in parallel. Configurations are either generated by Tune or drawn from a user-specified **search algorithm**. The trials are scheduled and managed by a **trial scheduler**.
.. image:: /images/tune-api.svg
.. customgalleryitem::
:tooltip: Getting started with Tune.
:figure: /images/tune.png
:description: :doc:`plot_tune-tutorial`
.. customgalleryitem::
:tooltip: A simple guide to Population-based Training
:figure: /images/tune-pbt-small.png
:description: :doc:`plot_tune-advanced-tutorial`
.. customgalleryitem::
:tooltip: Distributed Tuning
:figure: /images/tune.png
:description: :doc:`plot_tune-distributed`
.. toctree::
:hidden:
plot_tune-tutorial.rst
plot_tune-advanced-tutorial.rst
plot_tune-distributed.rst
.. :figure: /images/param_actor.png
@@ -1,80 +1,24 @@
Tune Advanced Tutorials
=======================
Guide to Population Based Training (PBT)
========================================
In this page, we will explore some advanced functionality in Tune with more examples.
Tune includes a distributed implementation of `Population Based Training (PBT) <https://deepmind.com/blog/population-based-training-neural-networks>`__ as
a :ref:`scheduler <tune-scheduler-pbt>`.
On this page:
.. image:: /images/tune_advanced_paper1.png
PBT starts by training many neural networks in parallel with random hyperparameters, using information from the rest of the population to refine these
hyperparameters and allocate resources to promising models. Let's walk through how to use this algorithm.
.. contents::
:local:
:backlinks: none
A native example of Trainable
-----------------------------
As mentioned in `Tune User Guide <tune-usage.html#Tune Training API>`_, Training can be done
with either the `Trainable <tune-usage.html#trainable-api>`__ **Class API** or
**function-based API**. Comparably, ``Trainable`` is stateful, supports checkpoint/restore functionality,
and is preferable for advanced algorithms.
A naive example for ``Trainable`` is a simple number guesser:
Trainable API with Population Based Training
--------------------------------------------
.. code-block:: python
import ray
from ray import tune
from ray.tune import Trainable
class Guesser(Trainable):
def _setup(self, config):
self.config = config
self.password = 1024
def _train(self):
result_dict = {"diff": abs(self.config['guess'] - self.password)}
return result_dict
ray.init()
analysis = tune.run(
Guesser,
stop={
"training_iteration": 1,
},
num_samples=10,
config={
"guess": tune.randint(1, 10000)
})
print('best config: ', analysis.get_best_config(metric="diff", mode="min"))
The program randomly picks 10 number from [1, 10000) and finds which is closer to the password.
As a subclass of ``ray.tune.Trainable``, Tune will convert ``Guesser`` into a Ray actor, which
runs on a separate process on a worker. ``_setup`` function is invoked once for each Actor for custom
initialization.
``_train`` execute one logical iteration of training in the tuning process,
which may include several iterations of actual training (see the next example). As a rule of
thumb, the execution time of one train call should be large enough to avoid overheads
(i.e. more than a few seconds), but short enough to report progress periodically
(i.e. at most a few minutes).
We only implemented ``_setup`` and ``_train`` methods for simplification, usually it's also required
to implement ``_save``, and ``_restore`` for checkpoint and fault tolerance.
Next, we train a Pytorch convolution model with Trainable and PBT.
Trainable with Population Based Training (PBT)
----------------------------------------------
Tune includes a distributed implementation of `Population Based Training (PBT) <https://deepmind.com/blog/population-based-training-neural-networks>`__ as
a scheduler `PopulationBasedTraining <tune-schedulers.html#Population Based Training (PBT)>`__ .
PBT starts by training many neural networks in parallel with random hyperparameters. But instead of the
networks training independently, it uses information from the rest of the population to refine the
hyperparameters and direct computational resources to models which show promise.
.. image:: images/tune_advanced_paper1.png
This takes its inspiration from genetic algorithms where each member of the population
PBT takes its inspiration from genetic algorithms where each member of the population
can exploit information from the remainder of the population. For example, a worker might
copy the model parameters from a better performing worker. It can also explore new hyperparameters by
changing the current values randomly.
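The exploit and explore steps described above can be sketched in a few lines of plain Python. This is an illustration, not Tune's scheduler: ``exploit_and_explore`` is a hypothetical helper, and the quantile (0.25) and perturbation factors (0.8 and 1.2) are assumptions chosen for the example.

```python
import copy
import random


def exploit_and_explore(population, rng, quantile=0.25, factors=(0.8, 1.2)):
    """One PBT-style step over a list of {"score", "weights", "lr"} dicts.

    Illustrative only: the quantile and perturbation factors are assumptions.
    """
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    cutoff = max(1, int(len(ranked) * quantile))
    top, bottom = ranked[:cutoff], ranked[-cutoff:]
    for member in bottom:
        donor = rng.choice(top)
        # Exploit: copy the weights of a better-performing member.
        member["weights"] = copy.deepcopy(donor["weights"])
        # Explore: randomly perturb the donor's hyperparameter.
        member["lr"] = donor["lr"] * rng.choice(factors)
    return population


rng = random.Random(0)
population = [
    {"score": score, "weights": [score], "lr": 0.01 * (i + 1)}
    for i, score in enumerate([0.9, 0.5, 0.7, 0.1])
]
exploit_and_explore(population, rng)
```

After the step, the lowest-scoring member carries the best member's weights and a learning rate perturbed from the best member's value, while the top performer is left untouched.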
@@ -87,9 +31,9 @@ This means that PBT can quickly exploit good hyperparameters, can dedicate more
promising models and, crucially, can adapt the hyperparameter values throughout training,
leading to automatic learning of the best configurations.
First we define a Trainable that wraps a ConvNet model.
First, we define a Trainable that wraps a ConvNet model.
.. literalinclude:: ../../python/ray/tune/examples/pbt_convnet_example.py
.. literalinclude:: /../../python/ray/tune/examples/pbt_convnet_example.py
:language: python
:start-after: __trainable_begin__
:end-before: __trainable_end__
@@ -103,7 +47,7 @@ with ``reuse_actors=True``.
Then, we define a PBT scheduler:
.. literalinclude:: ../../python/ray/tune/examples/pbt_convnet_example.py
.. literalinclude:: /../../python/ray/tune/examples/pbt_convnet_example.py
:language: python
:start-after: __pbt_begin__
:end-before: __pbt_end__
@@ -123,7 +67,7 @@ Some of the most important parameters are:
Now we can kick off the tuning process by invoking ``tune.run``:
.. literalinclude:: ../../python/ray/tune/examples/pbt_convnet_example.py
.. literalinclude:: /../../python/ray/tune/examples/pbt_convnet_example.py
:language: python
:start-after: __tune_begin__
:end-before: __tune_end__
@@ -168,7 +112,7 @@ Checking the accuracy:
print('best config:', analysis.get_best_config("mean_accuracy"))
.. image:: images/tune_advanced_plot1.png
.. image:: /images/tune_advanced_plot1.png
DCGAN with Trainable and PBT
----------------------------
@@ -184,7 +128,7 @@ Complete code example at `github <https://github.com/ray-project/ray/tree/master
We define the Generator and Discriminator with standard Pytorch API:
.. literalinclude:: ../../python/ray/tune/examples/pbt_dcgan_mnist/pbt_dcgan_mnist.py
.. literalinclude:: /../../python/ray/tune/examples/pbt_dcgan_mnist/pbt_dcgan_mnist.py
:language: python
:start-after: __GANmodel_begin__
:end-before: __GANmodel_end__
@@ -194,7 +138,7 @@ the model candidates. For a GAN network, inception score is arguably the most
commonly used metric. We trained an MNIST classification model (LeNet) and use
it to run inference on the generated images and evaluate the image quality.
.. literalinclude:: ../../python/ray/tune/examples/pbt_dcgan_mnist/pbt_dcgan_mnist.py
.. literalinclude:: /../../python/ray/tune/examples/pbt_dcgan_mnist/pbt_dcgan_mnist.py
:language: python
:start-after: __INCEPTION_SCORE_begin__
:end-before: __INCEPTION_SCORE_end__
@@ -202,14 +146,14 @@ it to inference the generated images and evaluate the image quality.
The ``Trainable`` class includes a Generator and a Discriminator, each with an
independent learning rate and optimizer.
.. literalinclude:: ../../python/ray/tune/examples/pbt_dcgan_mnist/pbt_dcgan_mnist.py
.. literalinclude:: /../../python/ray/tune/examples/pbt_dcgan_mnist/pbt_dcgan_mnist.py
:language: python
:start-after: __Trainable_begin__
:end-before: __Trainable_end__
We specify inception score as the metric and start the tuning:
.. literalinclude:: ../../python/ray/tune/examples/pbt_dcgan_mnist/pbt_dcgan_mnist.py
.. literalinclude:: /../../python/ray/tune/examples/pbt_dcgan_mnist/pbt_dcgan_mnist.py
:language: python
:start-after: __tune_begin__
:end-before: __tune_end__
@@ -217,9 +161,10 @@ We specify inception score as the metric and start the tuning:
The trained Generator models can be loaded from the log directory and used to generate images
from noise signals.
.. image:: images/tune_advanced_dcgan_generated.gif
Visualization
~~~~~~~~~~~~~
Visualize the increasing inception score from the training logs.
Below, we visualize the increasing inception score from the training logs.
.. code-block:: python
@@ -235,7 +180,7 @@ Visualize the increasing inception score from the training logs.
plt.legend()
plt.show()
.. image:: images/tune_advanced_dcgan_inscore.png
.. image:: /images/tune_advanced_dcgan_inscore.png
And the Generator loss:
@@ -253,7 +198,7 @@ And the Generator loss:
plt.legend()
plt.show()
.. image:: images/tune_advanced_dcgan_Gloss.png
.. image:: /images/tune_advanced_dcgan_Gloss.png
Training of the MNist Generator takes about several minutes. The example can be easily
altered to generate images for other dataset, e.g. cifar10 or LSUN.
Training of the MNIST Generator takes a couple of minutes. The example can be easily
altered to generate images for other datasets, e.g. CIFAR10 or LSUN.
@@ -1,21 +1,27 @@
.. _tune-distributed:
Tune Distributed Experiments
============================
Tune is commonly used for large-scale distributed hyperparameter optimization. This page will overview:
1. How to set up and launch a distributed experiment,
2. `commonly used commands <tune-distributed.html#common-commands>`_, including fast file mounting, one-line cluster launching, and result uploading to cloud storage.
2. :ref:`Commonly used commands <tune-distributed-common>`, including fast file mounting, one-line cluster launching, and result uploading to cloud storage.
**Quick Summary**: To run a distributed experiment with Tune, you need to:
1. Make sure your script has ``ray.init(address=...)`` to connect to the existing Ray cluster.
2. If a ray cluster does not exist, start a Ray cluster (instructions for `local machines <tune-distributed.html#local-cluster-setup>`_, `cloud <tune-distributed.html#launching-a-cloud-cluster>`_).
2. If a Ray cluster does not exist, start a Ray cluster.
3. Run the script on the head node (or use ``ray submit``).
.. contents::
:local:
:backlinks: none
Running a distributed experiment
--------------------------------
Running a distributed (multi-node) experiment requires Ray to be started already. You can do this on local machines or on the cloud (instructions for `local machines <tune-distributed.html#local-cluster-setup>`_, `cloud <tune-distributed.html#launching-a-cloud-cluster>`_).
Running a distributed (multi-node) experiment requires Ray to be started already. You can do this on local machines or on the cloud.
Across your machines, Tune will automatically detect the number of GPUs and CPUs without you needing to manage ``CUDA_VISIBLE_DEVICES``.
@@ -29,9 +35,9 @@ One common approach to modifying an existing Tune experiment to go distributed i
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--ray-address")
parser.add_argument("--address")
args = parser.parse_args()
ray.init(address=args.ray_address)
ray.init(address=args.address)
tune.run(...)
@@ -51,12 +57,14 @@ If you used a cluster configuration (starting a cluster with ``ray up`` or ``ray
1. In the examples, the Ray redis address commonly used is ``localhost:6379``.
2. If the Ray cluster is already started, you should not need to run anything on the worker nodes.
.. _tune-distributed-local:
Local Cluster Setup
-------------------
If you have already have a list of nodes, you can follow the local private cluster setup `instructions here <autoscaling.html#quick-start-private-cluster>`_. Below is an example cluster configuration as ``tune-default.yaml``:
If you already have a list of nodes, you can follow the local private cluster setup `instructions here <autoscaling.html#quick-start-private-cluster>`_. Below is an example cluster configuration as ``tune-default.yaml``:
.. literalinclude:: ../../python/ray/tune/examples/tune-local-default.yaml
.. literalinclude:: /../../python/ray/tune/examples/tune-local-default.yaml
:language: yaml
``ray up`` starts Ray on the cluster of nodes.
@@ -98,16 +106,18 @@ Then, you can run your Tune Python script on the head node like:
# On the head node, execute using existing ray cluster
$ python tune_script.py --ray-address=<address>
.. _tune-distributed-cloud:
Launching a cloud cluster
-------------------------
.. tip::
If you have already have a list of nodes, go to the `Local Cluster Setup`_ section.
If you already have a list of nodes, go to :ref:`tune-distributed-local`.
Ray currently supports AWS and GCP. Follow the instructions below to launch nodes on AWS (using the Deep Learning AMI). See the `cluster setup documentation <autoscaling.html>`_. Save the below cluster configuration (``tune-default.yaml``):
.. literalinclude:: ../../python/ray/tune/examples/tune-default.yaml
.. literalinclude:: /../../python/ray/tune/examples/tune-default.yaml
:language: yaml
:name: tune-default.yaml
@@ -125,7 +135,7 @@ Ray currently supports AWS and GCP. Follow the instructions below to launch node
ray submit tune-default.yaml tune_script.py --start --args="--ray-address=localhost:6379"
.. image:: images/tune-upload.png
.. image:: /images/tune-upload.png
:scale: 50%
:align: center
@@ -139,6 +149,8 @@ Analyze your results on TensorBoard by starting TensorBoard on the remote head m
Note that you can customize the directory of results by running: ``tune.run(local_dir=..)``. You can then point TensorBoard to that directory to visualize results. You can also use `awless <https://github.com/wallix/awless>`_ for easy cluster management on AWS.
.. _tune-distributed-spot:
Pre-emptible Instances (Cloud)
------------------------------
@@ -180,14 +192,14 @@ Spot instances may be removed suddenly while trials are still running. Often tim
The easiest way to do this is to subclass the pre-defined ``Trainable`` class and implement the ``_save`` and ``_restore`` abstract methods, as seen in the example below:
.. literalinclude:: ../../python/ray/tune/examples/mnist_pytorch_trainable.py
.. literalinclude:: /../../python/ray/tune/examples/mnist_pytorch_trainable.py
:language: python
:start-after: __trainable_example_begin__
:end-before: __trainable_example_end__
This can then be used similarly to the Function API as before:
.. literalinclude:: ../../python/ray/tune/tests/tutorial.py
.. literalinclude:: /../../python/ray/tune/tests/tutorial.py
:language: python
:start-after: __trainable_run_begin__
:end-before: __trainable_run_end__
@@ -198,13 +210,13 @@ Example for using spot instances (AWS)
Here is an example for running Tune on spot instances. This assumes your AWS credentials have already been set up (``aws configure``):
1. Download a full example Tune experiment script here. This includes a Trainable with checkpointing: :download:`mnist_pytorch_trainable.py <../../python/ray/tune/examples/mnist_pytorch_trainable.py>`. To run this example, you will need to install the following:
1. Download a full example Tune experiment script here. This includes a Trainable with checkpointing: :download:`mnist_pytorch_trainable.py </../../python/ray/tune/examples/mnist_pytorch_trainable.py>`. To run this example, you will need to install the following:
.. code-block:: bash
$ pip install ray torch torchvision filelock
2. Download an example cluster yaml here: :download:`tune-default.yaml <../../python/ray/tune/examples/tune-default.yaml>`
2. Download an example cluster yaml here: :download:`tune-default.yaml </../../python/ray/tune/examples/tune-default.yaml>`
3. Run ``ray submit`` as below to run Tune across them. Append ``[--start]`` if the cluster is not up yet. Append ``[--stop]`` to automatically shut down your nodes after running.
.. code-block:: bash
@@ -230,9 +242,11 @@ To summarize, here are the commands to run:
# wait a while until after all nodes have started
ray kill-random-node tune-default.yaml --hard
You should see Tune eventually continue the trials on a different worker node. See the `Save and Restore <tune-usage.html#save-and-restore>`__ section for more details.
You should see Tune eventually continue the trials on a different worker node. See the :ref:`Fault Tolerance <tune-fault-tol>` section for more details.
You can also specify ``tune.run(upload_dir=...)`` to sync results with a cloud storage like S3, persisting results in case you want to start and stop your cluster automatically.
You can also specify ``tune.run(upload_dir=...)`` to sync results with a cloud storage like S3, allowing you to persist results in case you want to start and stop your cluster automatically.
.. _tune-distributed-common:
Common Commands
---------------
@@ -284,6 +298,3 @@ Sometimes, your program may freeze. Run this to restart the Ray cluster without
.. code-block:: bash
$ ray up CLUSTER.YAML --restart-only
.. Local Cluster Setup: tune-distributed.html#local-cluster-setup
@@ -18,7 +18,7 @@ This tutorial will walk you through the following process to setup a Tune experi
We first run some imports:
.. literalinclude:: ../../python/ray/tune/tests/tutorial.py
.. literalinclude:: /../../python/ray/tune/tests/tutorial.py
:language: python
:start-after: __tutorial_imports_begin__
:end-before: __tutorial_imports_end__
@@ -26,7 +26,7 @@ We first run some imports:
Below, we have some boilerplate code for a PyTorch training function.
.. literalinclude:: ../../python/ray/tune/tests/tutorial.py
.. literalinclude:: /../../python/ray/tune/tests/tutorial.py
:language: python
:start-after: __train_func_begin__
:end-before: __train_func_end__
@@ -48,14 +48,14 @@ Notice that there's a couple helper functions in the above training script. You
Let's run 1 trial, randomly sampling from a uniform distribution for learning rate and momentum.
.. literalinclude:: ../../python/ray/tune/tests/tutorial.py
.. literalinclude:: /../../python/ray/tune/tests/tutorial.py
:language: python
:start-after: __eval_func_begin__
:end-before: __eval_func_end__
We can then plot the performance of this trial.
.. literalinclude:: ../../python/ray/tune/tests/tutorial.py
.. literalinclude:: /../../python/ray/tune/tests/tutorial.py
:language: python
:start-after: __plot_begin__
:end-before: __plot_end__
@@ -71,21 +71,21 @@ Let's integrate an early stopping algorithm to our search - ASHA, a scalable alg
How does it work? On a high level, it terminates trials that are less promising and
allocates more time and resources to more promising trials. See `this blog post <https://blog.ml.cmu.edu/2018/12/12/massively-parallel-hyperparameter-optimization/>`__ for more details.
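The core idea can be sketched with a simplified, synchronous successive-halving loop. This is not ASHA itself, which promotes trials asynchronously, and it does not use Tune: ``successive_halving`` and ``toy_score`` are hypothetical names, and the reduction factor of 2 is an assumption for illustration.

```python
def successive_halving(trials, evaluate, min_budget=1, reduction_factor=2):
    """Keep the best 1/reduction_factor of trials at each rung.

    `evaluate(trial, budget)` returns a score where higher is better.
    """
    budget = min_budget
    survivors = list(trials)
    while len(survivors) > 1:
        scored = sorted(survivors, key=lambda t: evaluate(t, budget), reverse=True)
        # Terminate the less promising trials...
        survivors = scored[: max(1, len(scored) // reduction_factor)]
        # ...and give the remaining ones a larger budget.
        budget *= reduction_factor
    return survivors[0], budget


# Toy objective: learning rates closer to 0.1 score higher at any budget.
def toy_score(lr, budget):
    return -abs(lr - 0.1) * budget


best_lr, final_budget = successive_halving([0.001, 0.01, 0.1, 1.0], toy_score)
```

Each round halves the number of live trials and doubles the budget of the survivors, so most of the compute ends up on the configurations that looked best early on.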
We can afford to **increase the search space by 5x**, by adjusting the parameter ``num_samples``. See the `Trial Scheduler section <tune-schedulers.html>`__ for more details of available schedulers and library integrations.
We can afford to **increase the search space by 5x**, by adjusting the parameter ``num_samples``. See :ref:`tune-schedulers` for more details of available schedulers and library integrations.
.. literalinclude:: ../../python/ray/tune/tests/tutorial.py
.. literalinclude:: /../../python/ray/tune/tests/tutorial.py
:language: python
:start-after: __run_scheduler_begin__
:end-before: __run_scheduler_end__
You can run the below in a Jupyter notebook to visualize trial progress.
.. literalinclude:: ../../python/ray/tune/tests/tutorial.py
.. literalinclude:: /../../python/ray/tune/tests/tutorial.py
:language: python
:start-after: __plot_scheduler_begin__
:end-before: __plot_scheduler_end__
.. image:: images/tune-df-plot.png
.. image:: /images/tune-df-plot.png
:scale: 50%
:align: center
@@ -99,9 +99,9 @@ You can also use Tensorboard for visualizing results.
Search Algorithms in Tune
~~~~~~~~~~~~~~~~~~~~~~~~~
With Tune you can combine powerful hyperparameter search libraries such as `HyperOpt <https://github.com/hyperopt/hyperopt>`_ and `Ax <https://ax.dev>`_ with state-of-the-art algorithms such as HyperBand without modifying any model training code. Tune allows you to use different search algorithms in combination with different trial schedulers. See the `Search Algorithm section <tune-searchalg.html>`__ for more details of available algorithms and library integrations.
With Tune you can combine powerful hyperparameter search libraries such as `HyperOpt <https://github.com/hyperopt/hyperopt>`_ and `Ax <https://ax.dev>`_ with state-of-the-art algorithms such as HyperBand without modifying any model training code. Tune allows you to use different search algorithms in combination with different trial schedulers. See :ref:`tune-search-alg` for more details of available algorithms and library integrations.
.. literalinclude:: ../../python/ray/tune/tests/tutorial.py
.. literalinclude:: /../../python/ray/tune/tests/tutorial.py
:language: python
:start-after: __run_searchalg_begin__
:end-before: __run_searchalg_end__
@@ -112,7 +112,7 @@ Evaluate your model
You can evaluate the best trained model by using the Analysis object to retrieve it:
.. literalinclude:: ../../python/ray/tune/tests/tutorial.py
.. literalinclude:: /../../python/ray/tune/tests/tutorial.py
:language: python
:start-after: __run_analysis_begin__
:end-before: __run_analysis_end__
@@ -120,4 +120,4 @@ You can evaluate best trained model using the Analysis object to retrieve the be
Next Steps
----------
Take a look at the `Usage Guide <tune-usage.html>`__ for more comprehensive overview of Tune features.
Take a look at the :ref:`tune-user-guide` for a more comprehensive overview of Tune's features.