[tune] New Doc edits, add Concepts page (#8083)

Co-Authored-By: Sven Mika <sven@anyscale.io>
Richard Liaw
2020-04-25 18:25:56 -07:00
committed by GitHub
parent 69ff7e3e35
commit b506f87117
29 changed files with 1041 additions and 734 deletions
+2 -2
@@ -2,11 +2,11 @@
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXOPTS =
SPHINXBUILD = sphinx-build
PAPER =
BUILDDIR = _build
AUTOGALLERYDIR= source/auto_examples source/tune/generated_guides
AUTOGALLERYDIR= source/auto_examples source/tune/tutorials source/tune/generated_guides
# User-friendly check for sphinx-build
ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1)
+12
@@ -18,3 +18,15 @@
.rst-content .section ol p, .rst-content .section ul p {
margin-bottom: 0px;
}
div.sphx-glr-bigcontainer {
display: inline-block;
width: 100%
}
td.tune-colab, th.tune-colab {
border: 1px solid #dddddd;
text-align: left;
padding: 8px;
}
+2 -2
@@ -80,9 +80,9 @@ versionwarning_messages = {
versionwarning_body_selector = "div.document"
sphinx_gallery_conf = {
"examples_dirs": ["../examples", "tune/guides"], # path to example scripts
"examples_dirs": ["../examples", "tune/_tutorials"], # path to example scripts
# path where to save generated examples
"gallery_dirs": ["auto_examples", "tune/generated_guides"],
"gallery_dirs": ["auto_examples", "tune/tutorials"],
"ignore_pattern": "../examples/doc_code/",
"plot_gallery": "False",
# "filename_pattern": "tutorial.py",
Binary file not shown.


+1 -3
@@ -249,11 +249,9 @@ Getting Involved
:caption: Tune
tune.rst
Tune Guides and Tutorials <tune/generated_guides/overview.rst>
tune-usage.rst
Tutorials, Guides, Examples <tune/tutorials/overview.rst>
tune-schedulers.rst
tune-searchalg.rst
tune-examples.rst
tune/api_docs/overview.rst
tune-contrib.rst
+1 -1
@@ -605,7 +605,7 @@ This is how the example in the previous section looks when written using a polic
Trainers
--------
Trainers are the boilerplate classes that put the above components together, making algorithms accessible via Python API and the command line. They manage algorithm configuration, setup of the rollout workers and optimizer, and collection of training metrics. Trainers also implement the `Trainable API <tune-usage.html#trainable-api>`__ for easy experiment management.
Trainers are the boilerplate classes that put the above components together, making algorithms accessible via Python API and the command line. They manage algorithm configuration, setup of the rollout workers and optimizer, and collection of training metrics. Trainers also implement the :ref:`Tune Trainable API <tune-60-seconds>` for easy experiment management.
Example of three equivalent ways of interacting with the PPO trainer, all of which log results in ``~/ray_results``:
+3 -3
@@ -172,9 +172,9 @@ Here is an example of the basic usage (for a more complete example, see `custom_
.. note::
It's recommended that you run RLlib trainers with `Tune <tune.html>`__, for easy experiment management and visualization of results. Just set ``"run": ALG_NAME, "env": ENV_NAME`` in the experiment config.
It's recommended that you run RLlib trainers with :ref:`Tune <tune-index>`, for easy experiment management and visualization of results. Just set ``"run": ALG_NAME, "env": ENV_NAME`` in the experiment config.
All RLlib trainers are compatible with the `Tune API <tune-usage.html>`__. This enables them to be easily used in experiments with `Tune <tune.html>`__. For example, the following code performs a simple hyperparam sweep of PPO:
All RLlib trainers are compatible with the :ref:`Tune API <tune-60-seconds>`. This enables them to be easily used in experiments with :ref:`Tune <tune-index>`. For example, the following code performs a simple hyperparam sweep of PPO:
.. code-block:: python
@@ -461,7 +461,7 @@ Advanced Python APIs
Custom Training Workflows
~~~~~~~~~~~~~~~~~~~~~~~~~
In the `basic training example <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_env.py>`__, Tune will call ``train()`` on your trainer once per training iteration and report the new training results. Sometimes, it is desirable to have full control over training, but still run inside Tune. Tune supports `custom trainable functions <tune-usage.html#trainable-api>`__ that can be used to implement `custom training workflows (example) <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_train_fn.py>`__.
In the `basic training example <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_env.py>`__, Tune will call ``train()`` on your trainer once per training iteration and report the new training results. Sometimes, it is desirable to have full control over training, but still run inside Tune. Tune supports :ref:`custom trainable functions <trainable-docs>` that can be used to implement `custom training workflows (example) <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_train_fn.py>`__.
For even finer-grained control over training, you can use RLlib's lower-level `building blocks <rllib-concepts.html>`__ directly to implement `fully customized training workflows <https://github.com/ray-project/ray/blob/master/rllib/examples/rollout_worker_custom_workflow.py>`__.
+2
@@ -1,3 +1,5 @@
.. _tune-contrib:
Contributing to Tune
====================
+2 -2
@@ -38,7 +38,7 @@ Tune includes a distributed implementation of `Population Based Training (PBT) <
})
tune.run( ... , scheduler=pbt_scheduler)
When the PBT scheduler is enabled, each trial variant is treated as a member of the population. Periodically, top-performing trials are checkpointed (this requires your Trainable to support `save and restore <tune-usage.html#save-and-restore>`__). Low-performing trials clone the checkpoints of top performers and perturb the configurations in the hope of discovering an even better variation.
When the PBT scheduler is enabled, each trial variant is treated as a member of the population. Periodically, top-performing trials are checkpointed (this requires your Trainable to support :ref:`save and restore <tune-checkpoint>`). Low-performing trials clone the checkpoints of top performers and perturb the configurations in the hope of discovering an even better variation.
You can run this `toy PBT example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/pbt_example.py>`__ to get an idea of how PBT operates. When training in PBT mode, a single trial may see many different hyperparameters over its lifetime, which are recorded in its ``result.json`` file. The following figure, generated by the example, shows PBT optimizing an LR schedule over the course of a single experiment:
@@ -72,7 +72,7 @@ Compared to the original version of HyperBand, this implementation provides bett
HyperBand
---------
.. note:: Note that the HyperBand scheduler requires your trainable to support saving and restoring, which is described in `Tune User Guide <tune-usage.html#save-and-restore>`__. Checkpointing enables the scheduler to multiplex many concurrent trials onto a limited size cluster.
.. note:: Note that the HyperBand scheduler requires your trainable to support :ref:`saving and restoring <tune-checkpoint>`. Checkpointing enables the scheduler to multiplex many concurrent trials onto a limited size cluster.
Tune also implements the `standard version of HyperBand <https://arxiv.org/abs/1603.06560>`__. You can use it as such:
+3 -2
@@ -26,14 +26,15 @@ Currently, Tune offers the following search algorithms (and library integrations
Variant Generation (Grid Search/Random Search)
----------------------------------------------
By default, Tune uses the `default search space and variant generation process <tune-usage.html#tune-search-space-default>`__ to create and queue trials. This supports random search and grid search as specified by the ``config`` parameter of ``tune.run``.
By default, Tune uses a BasicVariantGenerator to sample trials. This supports random search and grid search as specified by the ``config`` parameter of ``tune.run``.
.. autoclass:: ray.tune.suggest.BasicVariantGenerator
:show-inheritance:
:noindex:
Read about this in the :ref:`Grid/Random Search API <tune-grid-random>`.
Note that other search algorithms will not necessarily extend this class and may require a different search space declaration than the default Tune format.
Note that other search algorithms will require a different search space declaration than the default Tune format.
Repeated Evaluations
-576
@@ -1,576 +0,0 @@
.. _tune-user-guide:
Tune User Guide
===============
The basic Tune API [``tune.run(Trainable)``] has two main parts: a :ref:`Training API <guide-training-api>` and :ref:`tune.run <guide-running-tune>`.
.. _guide-training-api:
Training API
------------
Training can be done with either a **Class API** (``tune.Trainable``) or **function-based API** (``track.log``). Here is an example ``tune.Trainable`` that you can use to dry-run Tune:
.. code-block:: python
from ray import tune
class trainable(tune.Trainable):
    def _setup(self, config):
        if config["print_me"]:
            print(config["print_me"])
    def _train(self):
        # Run one step of training code.
        # Important: this method is called repeatedly!
        result_dict = {"accuracy": 0.5, "f1": 0.1}  # plus any other metrics
        return result_dict
tune.run(trainable, config={"print_me": "hello-world"}, stop={"training_iteration": 200})
The **function-based API** is for fast prototyping but has limited functionality. Here is a **function-based API** example:
.. code-block:: python
from ray import tune
import time
def trainable(config):
    if config["print_me"]:
        print(config["print_me"])
    for i in range(200):
        time.sleep(1)
        result_dict = {"accuracy": 0.5, "f1": 0.1}  # plus any other metrics
        tune.track.log(**result_dict)
tune.run(trainable, config={"print_me": "hello-world"})
To read more, check out the :ref:`Trainable API docs<trainable-docs>`.
.. _guide-running-tune:
Running Tune
------------
Use ``tune.run`` to generate and execute your hyperparameter sweep:
.. code-block:: python
tune.run(trainable)
# Run a total of 10 evaluations of the Trainable. Tune runs in
# parallel and automatically determines concurrency.
tune.run(trainable, num_samples=10)
This function will report status on the command line until all Trials stop:
.. code-block:: bash
== Status ==
Memory usage on this node: 11.4/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 4/12 CPUs, 0/0 GPUs, 0.0/3.17 GiB heap, 0.0/1.07 GiB objects
Result logdir: /Users/foo/ray_results/myexp
Number of trials: 4 (4 RUNNING)
+----------------------+----------+---------------------+-----------+--------+--------+--------+--------+------------------+-------+
| Trial name | status | loc | param1 | param2 | param3 | acc | loss | total time (s) | iter |
|----------------------+----------+---------------------+-----------+--------+--------+--------+--------+------------------+-------|
| MyTrainable_a826033a | RUNNING | 10.234.98.164:31115 | 0.303706 | 0.0761 | 0.4328 | 0.1289 | 1.8572 | 7.54952 | 15 |
| MyTrainable_a8263fc6 | RUNNING | 10.234.98.164:31117 | 0.929276 | 0.158 | 0.3417 | 0.4865 | 1.6307 | 7.0501 | 14 |
| MyTrainable_a8267914 | RUNNING | 10.234.98.164:31111 | 0.068426 | 0.0319 | 0.1147 | 0.9585 | 1.9603 | 7.0477 | 14 |
| MyTrainable_a826b7bc | RUNNING | 10.234.98.164:31112 | 0.729127 | 0.0748 | 0.1784 | 0.1797 | 1.7161 | 7.05715 | 14 |
+----------------------+----------+---------------------+-----------+--------+--------+--------+--------+------------------+-------+
All results reported by the trainable will be logged locally to a unique directory per experiment (``/Users/foo/ray_results/myexp`` in the status output above). On a cluster, incremental results will be synced to local disk on the head node. All results will have `autofilled metrics <tune-usage.html#auto-filled-results>`__ in addition to your own user-defined metrics.
Trial Parallelism
~~~~~~~~~~~~~~~~~
Tune automatically runs N concurrent trials, where N is the number of CPUs (cores) on your machine. By default, Tune assumes that each trial will only require 1 CPU. You can override this with ``resources_per_trial``:
.. code-block:: python
# If you have 4 CPUs on your machine, this will run 4 concurrent trials at a time.
tune.run(trainable, num_samples=10)
# If you have 4 CPUs on your machine, this will run 2 concurrent trials at a time.
tune.run(trainable, num_samples=10, resources_per_trial={"cpu": 2})
# If you have 4 CPUs on your machine, this will run 1 trial at a time.
tune.run(trainable, num_samples=10, resources_per_trial={"cpu": 4})
To leverage GPUs, you can set ``gpu`` in ``resources_per_trial``. A trial will only be executed if there are resources available. See the section on `resource allocation <tune-usage.html#resource-allocation-using-gpus>`_, which provides more details about GPU usage and trials that are distributed:
.. code-block:: python
# If you have 4 CPUs on your machine and 1 GPU, this will run 1 trial at a time.
tune.run(trainable, num_samples=10, resources_per_trial={"cpu": 2, "gpu": 1})
To attach to a Ray cluster or use ``ray.init`` manual resource overrides, simply run ``ray.init`` before ``tune.run``:
.. code-block:: python
# Setup a local ray cluster and override resources. This will run 50 trials in parallel:
ray.init(num_cpus=100)
tune.run(trainable, num_samples=100, resources_per_trial={"cpu": 2})
# Connect to an existing distributed Ray cluster
ray.init(address=<ray_redis_address>)
tune.run(trainable, num_samples=100, resources_per_trial={"cpu": 2, "gpu": 1})
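The concurrency figures quoted in the comments above follow from simple division. The sketch below reproduces that arithmetic; ``max_concurrent_trials`` is a hypothetical helper written for illustration and is not part of Tune. It assumes only that a trial is scheduled when every resource it requests is available.

```python
def max_concurrent_trials(available, per_trial):
    """Hypothetical helper (not a Tune API): how many trials fit at once.

    `available` and `per_trial` map resource names ("cpu", "gpu") to amounts.
    """
    if not per_trial:
        raise ValueError("per_trial must request at least one resource")
    # The scarcest resource determines how many trials can run together.
    return min(
        int(available.get(name, 0) // amount)
        for name, amount in per_trial.items()
        if amount > 0
    )


print(max_concurrent_trials({"cpu": 4}, {"cpu": 2}))                      # 2
print(max_concurrent_trials({"cpu": 4, "gpu": 1}, {"cpu": 2, "gpu": 1}))  # 1
print(max_concurrent_trials({"cpu": 100}, {"cpu": 2}))                    # 50
```

The second call shows why adding a single GPU request limits a 4-CPU machine to one trial at a time: the GPU, not the CPU count, is the bottleneck.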
.. tip:: To run everything sequentially, use `Ray Local Mode <tune-usage.html#debugging>`_.
Analyzing Results
-----------------
Tune provides an ``ExperimentAnalysis`` object for analyzing results from ``tune.run``.
.. code-block:: python
analysis = tune.run(
trainable,
name="example-experiment",
num_samples=10,
)
You can use the ``ExperimentAnalysis`` object to obtain the best configuration of the experiment:
.. code-block:: python
>>> print("Best config is", analysis.get_best_config(metric="mean_accuracy"))
Best config is {'lr': 0.011537575723482687, 'momentum': 0.8921971713692662}
See the full documentation for the ``Analysis`` object: :ref:`exp-analysis-docstring`.
Grid Search/Random Search
-------------------------
.. warning:: If you use a Search Algorithm, you may not be able to specify lambdas or grid search with this
interface, as the search algorithm may require a different search space declaration.
You can specify a grid search or random search via the dict passed into ``tune.run(config=)``.
.. code-block:: python
tune.run(
trainable,
config={
"qux": tune.sample_from(lambda spec: 2 + 2),
"bar": tune.grid_search([True, False]),
"foo": tune.grid_search([1, 2, 3]),
"baz": "asd",
}
)
Read about this in the :ref:`Grid/Random Search API <tune-grid-random>` page.
Custom Trial Names
------------------
To specify custom trial names, use the ``trial_name_creator`` argument
to ``tune.run``. This takes a function with the following signature:
.. code-block:: python
def trial_name_string(trial):
    """
    Args:
        trial (Trial): A generated trial object.
    Returns:
        trial_name (str): String representation of Trial.
    """
    return str(trial)
tune.run(
    MyTrainableClass,
    name="example-experiment",
    num_samples=1,
    trial_name_creator=trial_name_string
)
An example can be found in `logging_example.py <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/logging_example.py>`__.
Sampling Multiple Times
-----------------------
By default, each random variable and grid search point is sampled once. To take multiple random samples, add ``num_samples: N`` to the experiment config. If ``grid_search`` is provided as an argument, the grid will be repeated ``num_samples`` times.
.. code-block:: python
:emphasize-lines: 12
tune.run(
my_trainable,
name="my_trainable",
config={
"alpha": tune.sample_from(lambda spec: np.random.uniform(100)),
"beta": tune.sample_from(lambda spec: spec.config.alpha * np.random.normal()),
"nn_layers": [
tune.grid_search([16, 64, 256]),
tune.grid_search([16, 64, 256]),
],
},
num_samples=10
)
E.g. in the above, ``num_samples=10`` repeats the 3x3 grid search 10 times, for a total of 90 trials, each with randomly sampled values of ``alpha`` and ``beta``.
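The trial count can be checked by hand. The following sketch uses only the standard library (no Tune calls) and enumerates the grid from the example above; the variable names are illustrative.

```python
import itertools

num_samples = 10
# The two grid_search lists used for "nn_layers" in the example above.
grid_axes = [[16, 64, 256], [16, 64, 256]]

# Every combination of the grid axes is one grid point.
grid_points = list(itertools.product(*grid_axes))

# The whole grid is repeated once per sample.
total_trials = num_samples * len(grid_points)

print(len(grid_points), total_trials)  # 9 90
```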
Resource Allocation (Using GPUs)
--------------------------------
Tune will allocate the specified GPU and CPU ``resources_per_trial`` to each individual trial (defaulting to 1 CPU per trial). Under the hood, Tune runs each trial as a Ray actor, using Ray's resource handling to allocate resources and place actors. A trial will not be scheduled unless at least that amount of resources is available in the cluster, preventing the cluster from being overloaded.
Fractional values are also supported (e.g., ``"gpu": 0.2``). You can find an example of this in the `Keras MNIST example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/tune_mnist_keras.py>`__.
If GPU resources are not requested, the ``CUDA_VISIBLE_DEVICES`` environment variable will be set as empty, disallowing GPU access.
Otherwise, it will be set to the GPUs in the list (this is managed by Ray).
Advanced Resource Allocation
----------------------------
Trainables can themselves be distributed. If your trainable function / class creates further Ray actors or tasks that also consume CPU / GPU resources, you will also want to set ``extra_cpu`` or ``extra_gpu`` to reserve extra resource slots for the actors you will create. For example, if a trainable class requires 1 GPU itself, but will launch 4 actors each using another GPU, then it should set ``"gpu": 1, "extra_gpu": 4``.
.. code-block:: python
:emphasize-lines: 4-8
tune.run(
my_trainable,
name="my_trainable",
resources_per_trial={
"cpu": 1,
"gpu": 1,
"extra_gpu": 4
}
)
The ``Trainable`` also provides the ``default_resource_request`` interface to automatically declare the ``resources_per_trial`` based on the given configuration.
.. automethod:: ray.tune.Trainable.default_resource_request
:noindex:
Trainable (Trial) Checkpointing
-------------------------------
When running a hyperparameter search, Tune can automatically and periodically save/checkpoint your model. Checkpointing is used for:
* saving a model at the end of training
* modifying a model in the middle of training
* fault tolerance in experiments with pre-emptible machines
* enabling certain Trial Schedulers, such as HyperBand and PBT
To enable checkpointing, you must implement a `Trainable class <tune-usage.html#trainable-api>`__ (Trainable functions are not checkpointable, since they never return control back to their caller).
Checkpointing assumes that the model state will be saved to disk on whichever node the Trainable is running on. You can checkpoint with three different mechanisms: manually, periodically, and at termination.
**Manual Checkpointing**: A custom Trainable can manually trigger checkpointing by returning ``should_checkpoint: True`` (or ``tune.result.SHOULD_CHECKPOINT: True``) in the result dictionary of `_train`. This can be especially helpful in spot instances:
.. code-block:: python
def _train(self):
    # training code
    result = {"mean_accuracy": accuracy}
    if detect_instance_preemption():
        result.update(should_checkpoint=True)
    return result
**Periodic Checkpointing**: periodic checkpointing can be used to provide fault-tolerance for experiments. This can be enabled by setting ``checkpoint_freq=<int>`` and ``max_failures=<int>`` to checkpoint trials every *N* iterations and recover from up to *M* crashes per trial, e.g.:
.. code-block:: python
tune.run(
my_trainable,
checkpoint_freq=10,
max_failures=5,
)
**Checkpointing at Termination**: The ``checkpoint_freq`` may not coincide with the exact end of an experiment. If you want a checkpoint to be created at the end
of a trial, you can additionally set ``checkpoint_at_end=True``:
.. code-block:: python
:emphasize-lines: 5
tune.run(
my_trainable,
checkpoint_freq=10,
checkpoint_at_end=True,
max_failures=5,
)
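To make the interaction between ``checkpoint_freq`` and ``checkpoint_at_end`` concrete, here is a minimal sketch of which iterations would produce a checkpoint. ``checkpoint_iterations`` is a hypothetical helper, not a Tune function, and it assumes a periodic checkpoint is taken whenever the iteration count is a multiple of ``checkpoint_freq``.

```python
def checkpoint_iterations(total_iters, checkpoint_freq, checkpoint_at_end=False):
    """Hypothetical helper: iterations at which a trial would be checkpointed."""
    iterations = []
    if checkpoint_freq:
        # Periodic checkpoints: every `checkpoint_freq` iterations (assumed rule).
        iterations = [
            i for i in range(1, total_iters + 1) if i % checkpoint_freq == 0
        ]
    # Final checkpoint, unless the last iteration was already checkpointed.
    if checkpoint_at_end and total_iters > 0:
        if not iterations or iterations[-1] != total_iters:
            iterations.append(total_iters)
    return iterations


print(checkpoint_iterations(25, 10))                          # [10, 20]
print(checkpoint_iterations(25, 10, checkpoint_at_end=True))  # [10, 20, 25]
```

With a trial that ends at iteration 25, the periodic setting alone leaves the last five iterations uncheckpointed, which is the gap ``checkpoint_at_end`` closes.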
The checkpoint will be saved at a path that looks like ``local_dir/exp_name/trial_name/checkpoint_x/``, where ``x`` is the number of iterations completed when the checkpoint was saved. To restore from a checkpoint, use the ``restore`` argument and specify a checkpoint file. When restoring this way, you can change parts of the experiment's configuration, such as the experiment name or the number of training iterations:
.. code-block:: python
# Restored previous trial from the given checkpoint
tune.run(
"PG",
name="RestoredExp", # The name can be different.
stop={"training_iteration": 10}, # train 5 more iterations than previous
restore="~/ray_results/Original/PG_<xxx>/checkpoint_5/checkpoint-5",
config={"env": "CartPole-v0"},
)
.. _tune-fault-tol:
Fault Tolerance
---------------
Tune will automatically restart trials in case of trial failures/error (if ``max_failures != 0``), both in the single node and distributed setting.
Tune will restore trials from the latest checkpoint, where available. In the distributed setting, if using the autoscaler with ``rsync`` enabled, Tune will automatically sync the trial folder with the driver. For example, if a node is lost while a trial (specifically, the corresponding Trainable actor of the trial) is still executing on that node and a checkpoint of the trial exists, Tune will wait until resources are available to begin executing the trial again.
If the trial/actor is placed on a different node, Tune will automatically push the previous checkpoint file to that node and restore the remote trial actor state, allowing the trial to resume from the latest checkpoint even after failure.
Take a look at an example: :ref:`tune-distributed-spot`.
Recovering From Failures
~~~~~~~~~~~~~~~~~~~~~~~~
Tune automatically persists the progress of your entire experiment (a ``tune.run`` session), so if an experiment crashes or is otherwise cancelled, it can be resumed by passing one of True, False, "LOCAL", "REMOTE", or "PROMPT" to ``tune.run(resume=...)``. Note that this only works if trial checkpoints are detected, whether it be by manual or periodic checkpointing.
**Settings:**
- The default setting of ``resume=False`` creates a new experiment.
- ``resume="LOCAL"`` and ``resume=True`` restore the experiment from ``local_dir/[experiment_name]``.
- ``resume="REMOTE"`` syncs the upload dir down to the local dir and then restores the experiment from ``local_dir/experiment_name``.
- ``resume="PROMPT"`` will cause Tune to prompt you for whether you want to resume. You can always force a new experiment to be created by changing the experiment name.
Note that trials will be restored to their last checkpoint. If trial checkpointing is not enabled, unfinished trials will be restarted from scratch.
E.g.:
.. code-block:: python
tune.run(
my_trainable,
checkpoint_freq=10,
local_dir="~/path/to/results",
resume=True
)
Upon a second run, this will restore the entire experiment state from ``~/path/to/results/my_experiment_name``. Importantly, any changes to the experiment specification upon resume will be ignored. For example, if the previous experiment has reached its termination, then resuming it with a new stop criterion has no effect: the new experiment will terminate immediately after initialization. If you want to change the configuration, such as training for more iterations, you can do so by restoring from a checkpoint with ``restore=<path-to-checkpoint>``. Note that this only works for a single trial.
.. warning::
This feature is still experimental, so any provided Trial Scheduler or Search Algorithm will not be preserved. Only ``FIFOScheduler`` and ``BasicVariantGenerator`` will be supported.
Handling Large Datasets
-----------------------
You often will want to compute a large object (e.g., training data, model weights) on the driver and use that object within each trial. Tune provides a ``pin_in_object_store`` utility function that can be used to broadcast such large objects. Objects pinned in this way will never be evicted from the Ray object store while the driver process is running, and can be efficiently retrieved from any task via ``get_pinned_object``.
.. code-block:: python
import ray
from ray import tune
from ray.tune.utils import pin_in_object_store, get_pinned_object
import numpy as np
ray.init()
# X_id can be referenced in closures
X_id = pin_in_object_store(np.random.random(size=100000000))
def f(config, reporter):
    X = get_pinned_object(X_id)
    # use X
tune.run(f)
Custom Stopping Criteria
------------------------
You can control when trials are stopped early by passing the ``stop`` argument to ``tune.run``. This argument takes either a dictionary or a function.
If a dictionary is passed in, the keys may be any field in the return result of ``tune.track.log`` in the Function API or ``train()`` (including the results from ``_train`` and auto-filled metrics).
In the example below, each trial will be stopped either when it completes 10 iterations OR when it reaches a mean accuracy of 0.98. Note that `training_iteration` is an auto-filled metric by Tune.
.. code-block:: python
tune.run(
my_trainable,
stop={"training_iteration": 10, "mean_accuracy": 0.98}
)
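The "either ... OR" behaviour of a ``stop`` dictionary can be sketched without Tune. ``should_stop`` below is a hypothetical helper that illustrates the semantics described above and is not Tune's implementation. It assumes a criterion is met once the reported value is greater than or equal to its threshold, and it ignores criteria whose key is missing from the result.

```python
def should_stop(result, stop):
    """Hypothetical helper: True once ANY stop criterion is met."""
    return any(
        key in result and result[key] >= threshold
        for key, threshold in stop.items()
    )


stop = {"training_iteration": 10, "mean_accuracy": 0.98}

# Neither threshold reached yet: keep training.
print(should_stop({"training_iteration": 3, "mean_accuracy": 0.70}, stop))  # False
# Accuracy threshold reached early: stop, even though only 4 iterations ran.
print(should_stop({"training_iteration": 4, "mean_accuracy": 0.99}, stop))  # True
# Iteration budget exhausted: stop regardless of accuracy.
print(should_stop({"training_iteration": 10, "mean_accuracy": 0.50}, stop))  # True
```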
For more flexibility, you can pass in a function instead. If a function is passed in, it must take ``(trial_id, result)`` as arguments and return a boolean (``True`` if trial should be stopped and ``False`` otherwise).
.. code-block:: python
def stopper(trial_id, result):
    return result["mean_accuracy"] / result["training_iteration"] > 5
tune.run(my_trainable, stop=stopper)
Finally, you can implement the ``Stopper`` abstract class for stopping entire experiments. For example, the following stops all trials once the criterion is fulfilled by any individual trial, and prevents new ones from starting:
.. code-block:: python
from ray.tune import Stopper
class CustomStopper(Stopper):
    def __init__(self):
        self.should_stop = False
    def __call__(self, trial_id, result):
        if not self.should_stop and result['foo'] > 10:
            self.should_stop = True
        return self.should_stop
    def stop_all(self):
        """Returns whether to stop trials and prevent new ones from starting."""
        return self.should_stop
stopper = CustomStopper()
tune.run(my_trainable, stop=stopper)
Note that in the above example the currently running trials will not stop immediately but will do so once their current iterations are complete.
Auto-Filled Results
-------------------
During training, Tune will automatically fill certain fields if not already provided. All of these can be used as stopping conditions or in the Scheduler/Search Algorithm specification.
.. literalinclude:: ../../python/ray/tune/result.py
:language: python
:start-after: __sphinx_doc_begin__
:end-before: __sphinx_doc_end__
The following fields will automatically show up on the console output, if provided:
1. ``episode_reward_mean``
2. ``mean_loss``
3. ``mean_accuracy``
4. ``timesteps_this_iter`` (aggregated into ``timesteps_total``).
TensorBoard
-----------
To visualize learning in tensorboard, install tensorboardX:
.. code-block:: bash
$ pip install tensorboardX
Then, after you run an experiment, you can visualize it with TensorBoard by specifying the output directory of your results. Note that if you are running Ray on a remote cluster, you can forward the TensorBoard port to your local machine through SSH using ``ssh -L 6006:localhost:6006 <address>``:
.. code-block:: bash
$ tensorboard --logdir=~/ray_results/my_experiment
If you are running Ray on a remote multi-user cluster where you do not have sudo access, you can run the following commands to make sure tensorboard is able to write to the tmp directory:
.. code-block:: bash
$ export TMPDIR=/tmp/$USER; mkdir -p $TMPDIR; tensorboard --logdir=~/ray_results
.. image:: ray-tune-tensorboard.png
If using TF2, Tune also automatically generates TensorBoard HParams output, as shown below:
.. code-block:: python
tune.run(
...,
config={
"lr": tune.grid_search([1e-5, 1e-4]),
"momentum": tune.grid_search([0, 0.9])
}
)
.. image:: images/tune-hparams.png
Logging
-------
You can pass in your own logging mechanisms to output logs in custom formats as follows:
.. code-block:: python
from ray.tune.logger import DEFAULT_LOGGERS
tune.run(
MyTrainableClass,
name="experiment_name",
loggers=DEFAULT_LOGGERS + (CustomLogger1, CustomLogger2)
)
These loggers will be called along with the default Tune loggers. All loggers must inherit the Logger interface (:ref:`logger-interface`). Tune enables default loggers for Tensorboard, CSV, and JSON formats. You can also check out `logger.py <https://github.com/ray-project/ray/blob/master/python/ray/tune/logger.py>`__ for implementation details. An example can be found in `logging_example.py <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/logging_example.py>`__. See the :ref:`Logging API <logger-interface>`.
Uploading/Syncing
-----------------
Tune automatically syncs the trial folder on remote nodes back to the head node. This requires the ray cluster to be started with the `autoscaler <autoscaling.html>`__.
By default, local syncing requires rsync to be installed. You can customize the sync command with the ``sync_to_driver`` argument in ``tune.run`` by providing either a function or a string.
If a string is provided, then it must include replacement fields ``{source}`` and ``{target}``, like ``rsync -savz -e "ssh -i ssh_key.pem" {source} {target}``. Alternatively, a function can be provided with the following signature:
.. code-block:: python
import subprocess
def custom_sync_func(source, target):
    sync_cmd = "rsync {source} {target}".format(
        source=source,
        target=target)
    sync_process = subprocess.Popen(sync_cmd, shell=True)
    sync_process.wait()
tune.run(
MyTrainableClass,
name="experiment_name",
sync_to_driver=custom_sync_func,
)
When syncing results back to the driver, the source would be a path similar to ``ubuntu@192.0.0.1:/home/ubuntu/ray_results/trial1``, and the target would be a local path.
This custom sync command would also be used in node failures, where the source argument would be the path to the trial directory and the target would be a remote path. The ``sync_to_driver`` would be invoked to push a checkpoint to a new node for a queued trial to resume.
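For the string form of the sync command, the ``{source}`` and ``{target}`` replacement fields are filled in with the two paths. The sketch below shows that substitution with plain ``str.format``; ``build_sync_command`` is a hypothetical helper, and quoting the paths with ``shlex.quote`` is an assumption made here for safety rather than a documented Tune behaviour.

```python
import shlex


def build_sync_command(template, source, target):
    """Hypothetical helper: fill a sync template with shell-quoted paths."""
    if "{source}" not in template or "{target}" not in template:
        raise ValueError("template must contain {source} and {target}")
    return template.format(
        source=shlex.quote(source),
        target=shlex.quote(target),
    )


template = 'rsync -savz -e "ssh -i ssh_key.pem" {source} {target}'
command = build_sync_command(
    template,
    "ubuntu@192.0.0.1:/home/ubuntu/ray_results/trial1",
    "/home/me/ray_results/trial1",  # illustrative local path
)
print(command)
```

Validating the template up front gives an early, readable error if a replacement field is misspelled, instead of a failed sync later on.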
If an upload directory is provided, Tune will automatically sync results to the given directory, natively supporting standard S3/gsutil commands.
You can customize this to specify arbitrary storage with the ``sync_to_cloud`` argument. This argument is similar to ``sync_to_driver`` in that it supports strings with the same replacement fields and arbitrary functions. See `syncer.py <https://github.com/ray-project/ray/blob/master/python/ray/tune/syncer.py>`__ for implementation details.
.. code-block:: python
tune.run(
MyTrainableClass,
name="experiment_name",
sync_to_cloud=custom_sync_func,
)
Debugging
---------
By default, Tune will run hyperparameter evaluations on multiple processes. However, if you need to debug your training process, it may be easier to do everything on a single process. You can force all Ray functions to occur on a single process with ``local_mode`` by calling the following before ``tune.run``.
.. code-block:: python
ray.init(local_mode=True)
Note that some behavior may not work as expected in local mode, such as writing files to paths that depend on the current working directory of a Trainable, or setting global process variables. Local mode with multiple configuration evaluations will interleave computation, so it is most naturally used when running a single configuration evaluation.
Further Questions or Issues?
----------------------------
You can post questions or issues or feedback through the following channels:
1. `ray-dev@googlegroups.com`_: For discussions about development or any general
questions and feedback.
2. `StackOverflow`_: For questions about how to use Ray.
3. `GitHub Issues`_: For bug reports and feature requests.
.. _`ray-dev@googlegroups.com`: https://groups.google.com/forum/#!forum/ray-dev
.. _`StackOverflow`: https://stackoverflow.com/questions/tagged/ray
.. _`GitHub Issues`: https://github.com/ray-project/ray/issues
+5 -11
@@ -1,3 +1,5 @@
.. _tune-index:
Tune: Scalable Hyperparameter Tuning
====================================
@@ -8,7 +10,7 @@ Tune: Scalable Hyperparameter Tuning
Tune is a Python library for experiment execution and hyperparameter tuning at any scale. Core features:
* Launch a multi-node :ref:`distributed hyperparameter sweep <tune-distributed>` in less than 10 lines of code.
* Supports any machine learning framework, including PyTorch, XGBoost, MXNet, and Keras. See `examples here <tune-examples.html>`_.
* Supports any machine learning framework, including PyTorch, XGBoost, MXNet, and Keras. See :ref:`examples here <tune-guides-overview>`.
* Natively `integrates with optimization libraries <tune-searchalg.html>`_ such as `HyperOpt <https://github.com/hyperopt/hyperopt>`_, `Bayesian Optimization <https://github.com/fmfn/BayesianOptimization>`_, and `Facebook Ax <http://ax.dev>`_.
* Choose among `scalable algorithms <tune-schedulers.html>`_ such as `Population Based Training (PBT)`_, `Vizier's Median Stopping Rule`_, `HyperBand/ASHA`_.
* Visualize results with `TensorBoard <https://www.tensorflow.org/get_started/summaries_and_tensorboard>`__.
@@ -21,24 +23,16 @@ Tune is a Python library for experiment execution and hyperparameter tuning at a
For more information, check out:
* :ref:`Tune in 60 Seconds <tune-60-seconds>`: A quick overview of Tune and its key concepts.
* :ref:`Tune Guides and Examples <tune-guides-overview>`: Examples, Tutorials, and Guides for how to use Tune.
* `Code <https://github.com/ray-project/ray/tree/master/python/ray/tune>`__: GitHub repository for Tune.
* `User Guide <tune-usage.html>`__: A comprehensive overview on how to use Tune's features.
* `Tutorial Notebooks <https://github.com/ray-project/tutorial/blob/master/tune_exercises/>`__: Our tutorial notebooks of using Tune with Keras or PyTorch.
**Try out a tutorial notebook on Colab**:
.. raw:: html
<a href="https://colab.research.google.com/github/ray-project/tutorial/blob/master/tune_exercises/exercise_1_basics.ipynb" target="_parent">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Tune Tutorial"/>
</a>
Quick Start
-----------
To run this example, install the following: ``pip install 'ray[tune]' torch torchvision``.
This example runs a small grid search to train a convolutional neural network using PyTorch and Tune.
.. literalinclude:: ../../python/ray/tune/tests/example.py
+1 -1
@@ -1 +1 @@
generated_guides/
tutorials/
@@ -1,15 +1,132 @@
.. _tune-guides-overview:
Tutorials, User Guides, Examples
================================
In this section, you can find material on how to use Tune and its various features. If any of the material is out of date or broken, or if you'd like to add an example to this page, feel free to raise an issue on our GitHub repository.
Tutorials
---------
Take a look at any of the below tutorials to get started with Tune.
.. raw:: html
<div class="sphx-glr-bigcontainer">
.. customgalleryitem::
:tooltip: A gentle 60 second tour of core Tune concepts.
:figure: /images/tune-workflow.png
:description: :doc:`A gentle 60 second tour of Tune <tune-60-seconds>`
.. customgalleryitem::
:tooltip: A simple Tune walkthrough.
:figure: /images/tune.png
:description: :doc:`A walkthrough to set up your first Tune experiment <tune-tutorial>`
.. raw:: html
</div>
.. toctree::
:hidden:
tune-60-seconds.rst
tune-tutorial.rst
User Guides
-----------
These pages will demonstrate the various features and configurations of Tune.
.. raw:: html
<div class="sphx-glr-bigcontainer">
.. customgalleryitem::
:tooltip: A guide to Tune features.
:figure: /images/tune.png
:description: :doc:`A guide to Tune features <tune-usage>`
.. customgalleryitem::
:tooltip: A simple guide to Population-based Training
:figure: /images/tune-pbt-small.png
:description: :doc:`A simple guide to Population-based Training <tune-advanced-tutorial>`
.. customgalleryitem::
:tooltip: A guide to distributed hyperparameter tuning
:figure: /images/tune.png
:description: :doc:`A guide to distributed hyperparameter tuning <tune-distributed>`
.. raw:: html
</div>
.. toctree::
:hidden:
tune-usage.rst
tune-advanced-tutorial.rst
tune-distributed.rst
Colab Exercises
---------------
Learn how to use Tune in your browser with the following Colab-based exercises.
.. raw:: html
<table>
<tr>
<th class="tune-colab">Exercise Description</th>
<th class="tune-colab">Library</th>
<th class="tune-colab">Colab Link</th>
</tr>
<tr>
<td class="tune-colab">Basics of using Tune.</td>
<td class="tune-colab">TF/Keras</td>
<td class="tune-colab">
<a href="https://colab.research.google.com/github/ray-project/tutorial/blob/master/tune_exercises/exercise_1_basics.ipynb" target="_parent">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Tune Tutorial"/>
</a>
</td>
</tr>
<tr>
<td class="tune-colab">Using Search algorithms and Trial Schedulers to optimize your model.</td>
<td class="tune-colab">PyTorch</td>
<td class="tune-colab">
<a href="https://colab.research.google.com/github/ray-project/tutorial/blob/master/tune_exercises/exercise_2_optimize.ipynb" target="_parent">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Tune Tutorial"/>
</a>
</td>
</tr>
<tr>
<td class="tune-colab">Using Population-Based Training (PBT).</td>
<td class="tune-colab">PyTorch</td>
<td class="tune-colab">
<a href="https://colab.research.google.com/github/ray-project/tutorial/blob/master/tune_exercises/exercise_3_pbt.ipynb" target="_parent">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Tune Tutorial"/>
</a>
</td>
</tr>
</table>
Tutorial source files `can be found here <https://github.com/ray-project/tutorial>`_.
Tune Examples
=============
-------------
.. Keep this in sync with ray/python/ray/tune/examples/README.rst
In our repository, we provide a variety of examples for the various use cases and features of Tune.
If any example is broken, or if you'd like to add an example to this page, feel free to raise an issue on our GitHub repository.
General Examples
----------------
~~~~~~~~~~~~~~~~
- `async_hyperband_example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/async_hyperband_example.py>`__: Example of using a Trainable class with AsyncHyperBandScheduler.
- `hyperband_example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/hyperband_example.py>`__: Example of using a Trainable class with HyperBandScheduler. Also uses the Experiment class API for specifying the experiment configuration. Also uses the AsyncHyperBandScheduler.
@@ -18,7 +135,7 @@ General Examples
- `logging_example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/logging_example.py>`__: Example of custom loggers and custom trial directory naming.
Search Algorithm Examples
-------------------------
~~~~~~~~~~~~~~~~~~~~~~~~~
- `Ax example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/ax_example.py>`__: Optimize a Hartmann function with `Ax <https://ax.dev>`_ with 4 parallel workers.
- `HyperOpt Example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/hyperopt_example.py>`__: Optimizes a basic function using the function-based API and the HyperOptSearch (SearchAlgorithm wrapper for HyperOpt TPE).
@@ -26,7 +143,7 @@ Search Algorithm Examples
- `Bayesian Optimization example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/bayesopt_example.py>`__: Optimize a simple toy function using `Bayesian Optimization <https://github.com/fmfn/BayesianOptimization>`_ with 4 parallel workers.
Tensorflow/Keras Examples
-------------------------
~~~~~~~~~~~~~~~~~~~~~~~~~
- `tune_mnist_keras <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/tune_mnist_keras.py>`__: Converts the Keras MNIST example to use Tune with the function-based API and a Keras callback. Also shows how to easily convert something relying on argparse to use Tune.
- `pbt_memnn_example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/pbt_memnn_example.py>`__: Example of training a Memory NN on bAbI with Keras using PBT.
@@ -34,27 +151,27 @@ Tensorflow/Keras Examples
PyTorch Examples
----------------
~~~~~~~~~~~~~~~~
- `mnist_pytorch <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/mnist_pytorch.py>`__: Converts the PyTorch MNIST example to use Tune with the function-based API. Also shows how to easily convert something relying on argparse to use Tune.
- `mnist_pytorch_trainable <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/mnist_pytorch_trainable.py>`__: Converts the PyTorch MNIST example to use Tune with Trainable API. Also uses the HyperBandScheduler and checkpoints the model at the end.
XGBoost Example
---------------
~~~~~~~~~~~~~~~
- `xgboost_example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/xgboost_example.py>`__: Trains a basic XGBoost model with Tune with the function-based API and a XGBoost callback.
- `xgboost_example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/xgboost_example.py>`__: Trains a basic XGBoost model with Tune with the function-based API and an XGBoost callback.
LightGBM Example
----------------
~~~~~~~~~~~~~~~~
- `lightgbm_example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/lightgbm_example.py>`__: Trains a basic LightGBM model with Tune with the function-based API and a LightGBM callback.
Contributed Examples
--------------------
~~~~~~~~~~~~~~~~~~~~
- `pbt_tune_cifar10_with_keras <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/pbt_tune_cifar10_with_keras.py>`__: A contributed example of tuning a Keras model on CIFAR10 with the PopulationBasedTraining scheduler.
- `genetic_example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/genetic_example.py>`__: Optimizing the michalewicz function using the contributed GeneticSearch search algorithm with AsyncHyperBandScheduler.
- `genetic_example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/genetic_example.py>`__: Optimizing the michalewicz function using the contributed GeneticSearch algorithm with AsyncHyperBandScheduler.
- `tune_cifar10_gluon <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/tune_cifar10_gluon.py>`__: MXNet Gluon example to use Tune with the function-based API on CIFAR-10 dataset.
@@ -0,0 +1,193 @@
.. _tune-60-seconds:
Tune in 60 Seconds
==================
Let's quickly walk through the key concepts you need to know to use Tune. In this guide, we'll be covering the following:
.. contents::
:local:
:depth: 1
Tune takes a user-defined Python function or class and evaluates it on a set of hyperparameter configurations. Each hyperparameter configuration evaluation is called a *trial*, and Tune runs multiple trials in parallel, leveraging Search Algorithms and Trial Schedulers to optimize your hyperparameters.
.. image:: /images/tune-workflow.png
Trainables
----------
To allow Tune to optimize your model, Tune will need to control your training process. This is done via the Trainable API. Each *trial* corresponds to one instance of a Trainable; Tune will create multiple instances of the Trainable.
The Trainable API is where you specify how to set up your model and track intermediate training progress. There are two types of Trainables: a **function-based API** for fast prototyping, and a **class-based API** that unlocks many Tune features such as checkpointing and pausing.
.. code-block:: python

    from ray import tune


    class Trainable(tune.Trainable):
        """Tries to iteratively find the password."""

        def _setup(self, config):
            self.iter = 0
            self.password = 1024

        def _train(self):
            """Execute one step of 'training'. This function will be called iteratively."""
            self.iter += 1
            return {
                "accuracy": abs(self.iter - self.password),
                "training_iteration": self.iter  # Tune will automatically provide this.
            }

        def _stop(self):
            # Perform any necessary cleanup.
            pass

Function API example:
.. code-block:: python

    def trainable(config):
        """
        Args:
            config (dict): Parameters provided from the search algorithm
                or variant generation.
        """
        while True:
            # ...
            tune.track.log(**kwargs)  # kwargs: the metrics to report for this step

.. tip:: Do not use ``tune.track.log`` within a ``Trainable`` class.
See the documentation: :ref:`trainable-docs`.
tune.run
--------
Use ``tune.run`` to execute hyperparameter tuning using the core Ray APIs. This function manages your distributed experiment and provides many features such as logging, checkpointing, and early stopping.
.. code-block:: python
# Pass in a Trainable class or function to tune.run.
tune.run(trainable)
# Run 10 trials (each trial is one instance of a Trainable). Tune runs in
# parallel and automatically determines concurrency.
tune.run(trainable, num_samples=10)
# Run 1 trial, stop when the trial has reached 10 iterations OR a mean accuracy of 0.98.
tune.run(trainable, stop={"training_iteration": 10, "mean_accuracy": 0.98})
# Run 1 trial, search over hyperparameters, stop after 10 iterations.
hyperparameters = {"lr": tune.uniform(0, 1), "momentum": tune.uniform(0, 1)}
tune.run(trainable, config=hyperparameters, stop={"training_iteration": 10})
This function will report status on the command line until all Trials stop:
.. code-block:: bash
== Status ==
Memory usage on this node: 11.4/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 4/12 CPUs, 0/0 GPUs, 0.0/3.17 GiB heap, 0.0/1.07 GiB objects
Result logdir: /Users/foo/ray_results/myexp
Number of trials: 4 (4 RUNNING)
+----------------------+----------+---------------------+-----------+--------+--------+----------------+-------+
| Trial name | status | loc | param1 | param2 | acc | total time (s) | iter |
|----------------------+----------+---------------------+-----------+--------+--------+----------------+-------|
| MyTrainable_a826033a | RUNNING | 10.234.98.164:31115 | 0.303706 | 0.0761 | 0.1289 | 7.54952 | 15 |
| MyTrainable_a8263fc6 | RUNNING | 10.234.98.164:31117 | 0.929276 | 0.158 | 0.4865 | 7.0501 | 14 |
| MyTrainable_a8267914 | RUNNING | 10.234.98.164:31111 | 0.068426 | 0.0319 | 0.9585 | 7.0477 | 14 |
| MyTrainable_a826b7bc | RUNNING | 10.234.98.164:31112 | 0.729127 | 0.0748 | 0.1797 | 7.05715 | 14 |
+----------------------+----------+---------------------+-----------+--------+--------+----------------+-------+
See the documentation: :ref:`tune-run-ref`.
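
The ``stop`` dictionary in the examples above combines its entries with OR: a trial ends as soon as any listed metric reaches its threshold. The following is a minimal pure-Python sketch of that check, not Tune's actual implementation; ``should_stop`` and the sample results are hypothetical names used only for illustration.

```python
def should_stop(result, stop):
    """Return True once any metric in `stop` has reached its threshold."""
    return any(
        metric in result and result[metric] >= threshold
        for metric, threshold in stop.items()
    )


stop = {"training_iteration": 10, "mean_accuracy": 0.98}

# Accuracy threshold reached early, so the trial would stop at iteration 3.
early = should_stop({"training_iteration": 3, "mean_accuracy": 0.99}, stop)

# Neither threshold reached, so the trial would keep running.
running = should_stop({"training_iteration": 3, "mean_accuracy": 0.50}, stop)

# Iteration limit reached although accuracy is still low.
limit = should_stop({"training_iteration": 10, "mean_accuracy": 0.50}, stop)

print(early, running, limit)
```

Metrics missing from a result are simply skipped in this sketch, which is an assumption made to keep the example short.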
Search Algorithms
-----------------
To optimize the hyperparameters of your training process, you will want to explore a “search space”.
Search Algorithms are Tune modules that help explore a provided search space. They use previous results from evaluating different hyperparameters to suggest better hyperparameters. Tune has SearchAlgorithms that integrate with many popular **optimization** libraries, such as `Nevergrad <https://github.com/facebookresearch/nevergrad>`_ and `Hyperopt <https://github.com/hyperopt/hyperopt/>`_.
.. code-block:: python
# https://github.com/hyperopt/hyperopt/
# pip install hyperopt
from hyperopt import hp
from ray.tune.suggest.hyperopt import HyperOptSearch
# Create a HyperOpt search space
space = {"momentum": hp.uniform("momentum", 0, 20), "lr": hp.uniform("lr", 0, 1)}
# Pass the search space into Tune's HyperOpt wrapper and maximize accuracy
hyperopt = HyperOptSearch(space, metric="accuracy", mode="max")
# Execute 20 trials using HyperOpt, stop after 20 iterations
max_iters = {"training_iteration": 20}
tune.run(trainable, search_alg=hyperopt, num_samples=20, stop=max_iters)
See the documentation: :ref:`searchalg-ref`.
Trial Schedulers
----------------
In addition, you can make your training process more efficient by stopping, pausing, or changing the hyperparameters of running trials.
Trial Schedulers are Tune modules that adjust and change distributed training runs during execution. These modules can stop/pause/tweak the hyperparameters of running trials, making your hyperparameter tuning process much faster. Population-based training and HyperBand are examples of popular optimization algorithms implemented as Trial Schedulers.
.. code-block:: python
from ray.tune.schedulers import HyperBandScheduler
# Create HyperBand scheduler and maximize accuracy
hyperband = HyperBandScheduler(metric="accuracy", mode="max")
# Execute 20 trials using HyperBand using a search space
configs = {"lr": tune.uniform(0, 1), "momentum": tune.uniform(0, 1)}
tune.run(MyTrainableClass, num_samples=20, config=configs, scheduler=hyperband)
Unlike **Search Algorithms**, Trial Schedulers do not select which hyperparameter configurations to evaluate. However, you can use them together.
See the documentation: :ref:`schedulers-ref`.
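
To make the idea of stopping unpromising trials concrete, here is a toy successive-halving loop in plain Python. It is a simplified illustration of the principle behind HyperBand-style schedulers, not the logic of ``HyperBandScheduler`` itself; ``successive_halving`` and ``toy_score`` are hypothetical helpers, and the scoring function is invented for the example.

```python
def toy_score(lr, iterations):
    """Hypothetical metric: higher is better, peaks at lr == 0.3."""
    return iterations * (1.0 - abs(lr - 0.3))


def successive_halving(configs, rounds=3, iters_per_round=5):
    """Train all survivors for a while, then keep only the better half."""
    survivors = list(configs)
    iterations = 0
    for _ in range(rounds):
        iterations += iters_per_round
        ranked = sorted(
            survivors, key=lambda lr: toy_score(lr, iterations), reverse=True
        )
        # Stop the less promising half; at least one trial always survives.
        survivors = ranked[: max(1, len(ranked) // 2)]
    return survivors


learning_rates = [0.01, 0.1, 0.25, 0.3, 0.5, 0.7, 0.9, 1.0]
best = successive_halving(learning_rates)
print(best)
```

Each round spends more iterations on fewer trials, which is how schedulers of this kind shift time and resources toward promising configurations.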
Analysis
--------
After running a hyperparameter tuning job, you will want to analyze your results to determine what specific parameters are important and which hyperparameter values are the best.
``tune.run`` returns an :ref:`Analysis <tune-analysis-docs>` object which has methods you can use for analyzing your results. This object can also retrieve all training runs as dataframes, allowing you to do ad-hoc data analysis over your results.
.. code-block:: python
analysis = tune.run(trainable, search_alg=algo, stop={"training_iteration": 20})
# Get the best hyperparameters
best_hyperparameters = analysis.get_best_config(metric="mean_accuracy")
# Get a dataframe for the max accuracy seen for each trial
df = analysis.dataframe(metric="mean_accuracy", mode="max")
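
Conceptually, picking the best configuration means comparing each trial's metric and returning the configuration of the winner. A minimal sketch with plain dictionaries follows; the ``trials`` records and the ``best_config`` helper are made up for illustration and are not part of the Analysis API.

```python
def best_config(trials, metric, mode="max"):
    """Return the config of the trial with the best value for `metric`."""
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    pick = max if mode == "max" else min
    return pick(trials, key=lambda trial: trial[metric])["config"]


# Hypothetical per-trial results, one record per trial.
trials = [
    {"config": {"lr": 0.30, "momentum": 0.08}, "mean_accuracy": 0.13},
    {"config": {"lr": 0.07, "momentum": 0.03}, "mean_accuracy": 0.96},
    {"config": {"lr": 0.73, "momentum": 0.07}, "mean_accuracy": 0.18},
]

winner = best_config(trials, metric="mean_accuracy", mode="max")
print(winner)
```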
What's Next?
~~~~~~~~~~~~
Now that you have a working understanding of Tune, check out:
* :ref:`Tune Guides and Examples <tune-guides-overview>`: Examples and templates for using Tune with your preferred machine learning library.
* :ref:`tune-tutorial`: A simple tutorial that walks you through the process of setting up a Tune experiment.
* :ref:`tune-user-guide`: A comprehensive overview of Tune's features.
Further Questions or Issues?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Reach out to us if you have any questions, issues, or feedback through the following channels:
1. `StackOverflow`_: For questions about how to use Ray.
2. `GitHub Issues`_: For bug reports and feature requests.
.. _`StackOverflow`: https://stackoverflow.com/questions/tagged/ray
.. _`GitHub Issues`: https://github.com/ray-project/ray/issues
@@ -115,7 +115,7 @@ Launching a cloud cluster
If you already have a list of nodes, go to :ref:`tune-distributed-local`.
Ray currently supports AWS and GCP. Follow the instructions below to launch nodes on AWS (using the Deep Learning AMI). See the `cluster setup documentation <autoscaling.html>`_. Save the below cluster configuration (``tune-default.yaml``):
Ray currently supports AWS and GCP. Follow the instructions below to launch nodes on AWS (using the Deep Learning AMI). See the :ref:`cluster setup documentation <ref-automatic-cluster>`. Save the below cluster configuration (``tune-default.yaml``):
.. literalinclude:: /../../python/ray/tune/examples/tune-default.yaml
:language: yaml
@@ -149,6 +149,33 @@ Analyze your results on TensorBoard by starting TensorBoard on the remote head m
Note that you can customize the directory of results by running: ``tune.run(local_dir=..)``. You can then point TensorBoard to that directory to visualize results. You can also use `awless <https://github.com/wallix/awless>`_ for easy cluster management on AWS.
Syncing
-------
Tune automatically syncs the trial folder on remote nodes back to the head node. This requires the Ray cluster to be started with the :ref:`autoscaler <ref-automatic-cluster>`.
By default, local syncing requires rsync to be installed. You can customize the sync command with the ``sync_to_driver`` argument in ``tune.run`` by providing either a function or a string.
If a string is provided, then it must include replacement fields ``{source}`` and ``{target}``, like ``rsync -savz -e "ssh -i ssh_key.pem" {source} {target}``. Alternatively, a function can be provided with the following signature:
.. code-block:: python

    import subprocess


    def custom_sync_func(source, target):
        # Run rsync in a shell and block until the transfer finishes.
        sync_cmd = "rsync {source} {target}".format(
            source=source,
            target=target)
        sync_process = subprocess.Popen(sync_cmd, shell=True)
        sync_process.wait()


    tune.run(
        MyTrainableClass,
        name="experiment_name",
        sync_to_driver=custom_sync_func,
    )

When syncing results back to the driver, the source would be a path similar to ``ubuntu@192.0.0.1:/home/ubuntu/ray_results/trial1``, and the target would be a local path.
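
As an illustration of the string form, the sketch below fills the ``{source}`` and ``{target}`` replacement fields with ``str.format``. This only demonstrates the template mechanics; how Tune expands the string internally is not shown here, and the paths and the ``build_sync_command`` helper are example values, not part of Tune.

```python
import shlex


def build_sync_command(template, source, target):
    """Fill the {source} and {target} fields of a sync command template."""
    for field in ("{source}", "{target}"):
        if field not in template:
            raise ValueError("template is missing " + field)
    # Quote both paths so spaces cannot split them into extra arguments.
    return template.format(source=shlex.quote(source), target=shlex.quote(target))


template = 'rsync -savz -e "ssh -i ssh_key.pem" {source} {target}'
command = build_sync_command(
    template,
    source="ubuntu@192.0.0.1:/home/ubuntu/ray_results/trial1",
    target="/home/me/ray_results/trial1",
)
print(command)
```

Checking for both fields up front gives an early, readable error instead of a sync command that silently copies nothing.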
This custom sync command is also used to restart trials after a failure. The ``sync_to_driver`` function is invoked to push a checkpoint to a new node so that a paused or pre-empted trial can resume.
.. _tune-distributed-spot:
Pre-emptible Instances (Cloud)
@@ -245,12 +272,54 @@ You should see Tune eventually continue the trials on a different worker node. S
You can also specify ``tune.run(upload_dir=...)`` to sync results with a cloud storage like S3, allowing you to persist results in case you want to start and stop your cluster automatically.
.. _tune-fault-tol:
Fault Tolerance
---------------
Tune will automatically restart trials in case of trial failures or errors (if ``max_failures != 0``), both in the single-node and distributed settings.
Tune will restore trials from the latest checkpoint, where available. In the distributed setting, if using the autoscaler with ``rsync`` enabled, Tune will automatically sync the trial folder with the driver. For example, if a node is lost while a trial (specifically, the corresponding Trainable actor of the trial) is still executing on that node and a checkpoint of the trial exists, Tune will wait until resources are available to begin executing the trial again.
If the trial/actor is placed on a different node, Tune will automatically push the previous checkpoint file to that node and restore the remote trial actor state, allowing the trial to resume from the latest checkpoint even after failure.
Recovering From Failures
~~~~~~~~~~~~~~~~~~~~~~~~
Tune automatically persists the progress of your entire experiment (a ``tune.run`` session), so if an experiment crashes or is otherwise cancelled, it can be resumed by passing one of True, False, "LOCAL", "REMOTE", or "PROMPT" to ``tune.run(resume=...)``. Note that this only works if trial checkpoints are detected, whether it be by manual or periodic checkpointing.
**Settings:**
- The default setting of ``resume=False`` creates a new experiment.
- ``resume="LOCAL"`` and ``resume=True`` restore the experiment from ``local_dir/[experiment_name]``.
- ``resume="REMOTE"`` syncs the upload dir down to the local dir and then restores the experiment from ``local_dir/experiment_name``.
- ``resume="PROMPT"`` will cause Tune to prompt you for whether you want to resume. You can always force a new experiment to be created by changing the experiment name.
Note that trials will be restored to their last checkpoint. If trial checkpointing is not enabled, unfinished trials will be restarted from scratch.
E.g.:
.. code-block:: python
tune.run(
my_trainable,
checkpoint_freq=10,
local_dir="~/path/to/results",
resume=True
)
Upon a second run, this will restore the entire experiment state from ``~/path/to/results/my_experiment_name``. Importantly, any changes to the experiment specification upon resume will be ignored. For example, if the previous experiment has already reached its termination condition, resuming it with a new stop criterion will not run any further training; the experiment will terminate immediately after initialization. If you want to change the configuration, for example to train for more iterations, you can instead restore from a checkpoint by setting ``restore=<path-to-checkpoint>``. Note that this only works for a single trial.
.. warning::
This feature is still experimental, so any provided Trial Scheduler or Search Algorithm will not be checkpointed and cannot be resumed. Only ``FIFOScheduler`` and ``BasicVariantGenerator`` are supported.
.. _tune-distributed-common:
Common Commands
---------------
Below are some commonly used commands for submitting experiments. Please see the `Autoscaler page <autoscaling.html>`__ for more comprehensive documentation of commands.
Below are some commonly used commands for submitting experiments. Please see the :ref:`Autoscaler page <ref-automatic-cluster>` for more comprehensive documentation of commands.
.. code-block:: bash
@@ -1,5 +1,9 @@
Tune Walkthrough
================
.. _tune-tutorial:
A Basic Tune Tutorial
=====================
.. image:: /images/tune-api.svg
This tutorial will walk you through setting up a Tune experiment. Specifically, we'll leverage ASHA and Bayesian Optimization (via HyperOpt) in the following steps:
@@ -14,7 +18,7 @@ This tutorial will walk you through the following process to setup a Tune experi
.. code-block:: bash
$ pip install ray torch torchvision filelock
$ pip install ray torch torchvision
We first run some imports:
@@ -35,6 +39,8 @@ Notice that there's a couple helper functions in the above training script. You
.. code:: python
EPOCH_SIZE = 20
def train(model, optimizer, train_loader):
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
@@ -66,7 +72,7 @@ We can then plot the performance of this trial.
Early Stopping with ASHA
~~~~~~~~~~~~~~~~~~~~~~~~
Let's integrate an early stopping algorithm to our search - ASHA, a scalable algorithm for principled early stopping.
Let's integrate a Trial Scheduler to our search - ASHA, a scalable algorithm for principled early stopping.
How does it work? On a high level, it terminates trials that are less promising and
allocates more time and resources to more promising trials. See `this blog post <https://blog.ml.cmu.edu/2018/12/12/massively-parallel-hyperparameter-optimization/>`__ for more details.
@@ -120,4 +126,4 @@ You can evaluate best trained model using the Analysis object to retrieve the be
Next Steps
----------
Take a look at the :ref`tune-user-guide` for a more comprehensive overview of Tune's features.
Take a look at the :ref:`tune-user-guide` for a more comprehensive overview of Tune's features.
+424
@@ -0,0 +1,424 @@
.. _tune-user-guide:
Tune User Guide
===============
.. warning:: Before you continue, be sure to have read :ref:`tune-60-seconds`.
This document provides an overview of the core concepts as well as some of the configurations for running Tune.
.. contents:: :local:
Parallelism / GPUs
------------------
.. tip:: To run everything sequentially, use :ref:`Ray Local Mode <tune-debugging>`.
Parallelism is determined by ``resources_per_trial`` (defaulting to 1 CPU, 0 GPU per trial) and the resources available to Tune (``ray.cluster_resources()``).
Tune will allocate the specified GPU and CPU from ``resources_per_trial`` to each individual trial. A trial will not be scheduled unless at least that amount of resources is available, preventing the cluster from being overloaded.
By default, Tune automatically runs N concurrent trials, where N is the number of CPUs (cores) on your machine.
.. code-block:: python
# If you have 4 CPUs on your machine, this will run 4 concurrent trials at a time.
tune.run(trainable, num_samples=10)
You can override this parallelism with ``resources_per_trial``:
.. code-block:: python
# If you have 4 CPUs on your machine, this will run 2 concurrent trials at a time.
tune.run(trainable, num_samples=10, resources_per_trial={"cpu": 2})
# If you have 4 CPUs on your machine, this will run 1 trial at a time.
tune.run(trainable, num_samples=10, resources_per_trial={"cpu": 4})
# Fractional values are also supported (e.g., {"cpu": 0.5}).
tune.run(trainable, num_samples=10, resources_per_trial={"cpu": 0.5})
To leverage GPUs, you must set ``gpu`` in ``resources_per_trial``. This will automatically set ``CUDA_VISIBLE_DEVICES`` for each trial.
.. code-block:: python
# If you have 8 GPUs, this will run 8 trials at once.
tune.run(trainable, num_samples=10, resources_per_trial={"gpu": 1})
# If you have 4 CPUs on your machine and 1 GPU, this will run 1 trial at a time.
tune.run(trainable, num_samples=10, resources_per_trial={"cpu": 2, "gpu": 1})
You can find an example of this in the `Keras MNIST example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/tune_mnist_keras.py>`__.
.. warning:: If ``gpu`` is not set, the ``CUDA_VISIBLE_DEVICES`` environment variable will be set to empty, disallowing GPU access.
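
The arithmetic in the comments above can be written down directly: the number of concurrent trials is bounded by how many times ``resources_per_trial`` fits into the available resources. The following sketch is a simplified model of that rule (it ignores custom resources and multi-node placement), and ``max_concurrent_trials`` is a hypothetical helper, not a Tune function.

```python
import math


def max_concurrent_trials(available, per_trial):
    """How many trials fit into `available` if each needs `per_trial`."""
    limits = []
    for resource, needed in per_trial.items():
        if needed > 0:
            limits.append(math.floor(available.get(resource, 0) / needed))
    # With no positive requirement there is no resource-based limit.
    return min(limits) if limits else math.inf


# Assumed machine for the example: 4 CPUs and 1 GPU.
machine = {"cpu": 4, "gpu": 1}

default_request = max_concurrent_trials(machine, {"cpu": 1})
two_cpus = max_concurrent_trials(machine, {"cpu": 2})
half_cpu = max_concurrent_trials(machine, {"cpu": 0.5})
cpu_and_gpu = max_concurrent_trials(machine, {"cpu": 2, "gpu": 1})

print(default_request, two_cpus, half_cpu, cpu_and_gpu)
```

The scarcest resource decides the result, which is why requesting a GPU on a single-GPU machine limits the run to one trial at a time regardless of the CPU count.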
To attach to a Ray cluster, simply run ``ray.init`` before ``tune.run``:
.. code-block:: python
# Connect to an existing distributed Ray cluster
ray.init(address=<ray_address>)
tune.run(trainable, num_samples=100, resources_per_trial={"cpu": 2, "gpu": 1})
Search Space (Grid/Random)
--------------------------
.. warning:: If you use a Search Algorithm, you will need to use a different search space API.
You can specify a grid search or random search via the dict passed into ``tune.run(config=)``.
.. code-block:: python
parameters = {
"qux": tune.sample_from(lambda spec: 2 + 2),
"bar": tune.grid_search([True, False]),
"foo": tune.grid_search([1, 2, 3]),
"baz": "asd", # a constant value
}
tune.run(trainable, config=parameters)
By default, each random variable and grid search point is sampled once. To take multiple random samples, pass ``num_samples=N`` to ``tune.run``. If ``grid_search`` is provided as an argument, the grid will be repeated ``num_samples`` times.
.. code-block:: python
:emphasize-lines: 13
# num_samples=10 repeats the 3x3 grid search 10 times, for a total of 90 trials
tune.run(
my_trainable,
name="my_trainable",
config={
"alpha": tune.uniform(100),
"beta": tune.sample_from(lambda spec: spec.config.alpha * np.random.normal()),
"nn_layers": [
tune.grid_search([16, 64, 256]),
tune.grid_search([16, 64, 256]),
],
},
num_samples=10
)
Read about this in the :ref:`Grid/Random Search API <tune-grid-random>` page.
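
To see where the trial count comes from, the grid can be enumerated by hand: every combination of ``grid_search`` values is one grid point, and the whole grid is repeated ``num_samples`` times. The sketch uses only ``itertools`` and mirrors the ``nn_layers`` example above; it is not how Tune generates variants internally.

```python
import itertools

# The two grid_search entries from the nn_layers example.
grid_axes = [[16, 64, 256], [16, 64, 256]]
num_samples = 10

# Every combination of one value per axis is a single grid point.
grid_points = list(itertools.product(*grid_axes))
total_trials = len(grid_points) * num_samples

print(len(grid_points), total_trials)
```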
Reporting Metrics
-----------------
You can log arbitrary values and metrics in both training APIs:
.. code-block:: python

    def trainable(config):
        num_epochs = 100
        for i in range(num_epochs):
            accuracy = model.train()
            metric_1 = f(model)
            metric_2 = model.get_loss()
            tune.track.log(acc=accuracy, metric_foo=metric_1, bar=metric_2)


    class Trainable(tune.Trainable):
        ...

        def _train(self):  # this is called iteratively
            accuracy = self.model.train()
            metric_1 = f(self.model)
            metric_2 = self.model.get_loss()
            # don't call track.log here!
            return dict(acc=accuracy, metric_foo=metric_1, bar=metric_2)

During training, Tune will automatically log the below metrics in addition to the user-provided values. All of these can be used as stopping conditions or passed as a parameter to Trial Schedulers/Search Algorithms.
.. literalinclude:: ../../../../python/ray/tune/result.py
:language: python
:start-after: __sphinx_doc_begin__
:end-before: __sphinx_doc_end__
.. _tune-checkpoint:
Checkpointing
-------------
When running a hyperparameter search, Tune can automatically and periodically save/checkpoint your model. Checkpointing is used for:

* Saving a model throughout training.
* Fault tolerance when using pre-emptible machines.
* Pausing trials when using Trial Schedulers such as HyperBand and PBT.

To enable checkpointing, you must implement a :ref:`Trainable class <trainable-docs>` (the function-based API is not checkpointable, since such functions never return control back to their caller).
Checkpointing assumes that the model state will be saved to disk on whichever node the Trainable is running on. You can checkpoint with three different mechanisms: manually, periodically, and at termination.
**Manual Checkpointing**: A custom Trainable can manually trigger checkpointing by returning ``should_checkpoint: True`` (or ``tune.result.SHOULD_CHECKPOINT: True``) in the result dictionary of ``_train``. This can be especially helpful on spot instances:
.. code-block:: python

    def _train(self):
        # training code
        result = {"mean_accuracy": accuracy}
        if detect_instance_preemption():
            result.update(should_checkpoint=True)
        return result

**Periodic Checkpointing**: Periodic checkpointing provides fault tolerance for experiments. Enable it by setting ``checkpoint_freq=<int>`` and ``max_failures=<int>`` to checkpoint trials every *N* iterations and recover from up to *M* crashes per trial, e.g.:
.. code-block:: python
tune.run(
my_trainable,
checkpoint_freq=10,
max_failures=5,
)
**Checkpointing at Termination**: The ``checkpoint_freq`` interval may not coincide with the exact end of an experiment. If you want a checkpoint to be created at the end
of a trial, you can additionally set ``checkpoint_at_end=True``:
.. code-block:: python
:emphasize-lines: 4
tune.run(
my_trainable,
checkpoint_freq=10,
checkpoint_at_end=True,
max_failures=5,
)
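To see why the periodic schedule can miss the end of a trial, the following sketch lists the iterations after which a checkpoint would be written. It is a simplified model that assumes a checkpoint is taken whenever the iteration count is a multiple of ``checkpoint_freq``; the helper name ``checkpoint_iterations`` is hypothetical and this is not Tune's scheduling code.

```python
def checkpoint_iterations(total_iters, checkpoint_freq, checkpoint_at_end=False):
    """Return the iterations after which a checkpoint would be written."""
    iters = [i for i in range(1, total_iters + 1)
             if checkpoint_freq and i % checkpoint_freq == 0]
    # Add a final checkpoint only if the periodic schedule did not cover it.
    if checkpoint_at_end and (not iters or iters[-1] != total_iters):
        iters.append(total_iters)
    return iters


# A trial that stops after 25 iterations, checkpointing every 10.
without_final = checkpoint_iterations(25, checkpoint_freq=10)
with_final = checkpoint_iterations(25, checkpoint_freq=10, checkpoint_at_end=True)
```

Under these assumptions, the last five iterations of the 25-iteration trial are only captured when ``checkpoint_at_end=True`` is set.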
The checkpoint will be saved at a path that looks like ``local_dir/exp_name/trial_name/checkpoint_x/``, where ``x`` is the number of iterations completed when the checkpoint was saved. To restore from a checkpoint, pass the checkpoint file to the ``restore`` argument. When restoring, you can change parts of the experiment's configuration, such as its name or the number of training iterations:
.. code-block:: python
# Restore a previous trial from the given checkpoint
tune.run(
"PG",
name="RestoredExp", # The name can be different.
stop={"training_iteration": 10}, # train 5 more iterations than previous
restore="~/ray_results/Original/PG_<xxx>/checkpoint_5/checkpoint-5",
config={"env": "CartPole-v0"},
)
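Given the ``checkpoint_x`` directory layout described above, a small helper can locate the most recent checkpoint directory of a trial. This is a sketch under that naming assumption only; the function name ``latest_checkpoint_dir`` is hypothetical and not part of Tune.

```python
import os
import re
import tempfile


def latest_checkpoint_dir(trial_dir):
    """Return the checkpoint_<x> subdirectory with the largest x, or None."""
    best_iter, best_path = -1, None
    for name in os.listdir(trial_dir):
        match = re.fullmatch(r"checkpoint_(\d+)", name)
        path = os.path.join(trial_dir, name)
        # Compare iteration numbers numerically, not as strings.
        if match and os.path.isdir(path) and int(match.group(1)) > best_iter:
            best_iter, best_path = int(match.group(1)), path
    return best_path


with tempfile.TemporaryDirectory() as trial_dir:
    for iteration in (5, 10, 20):
        os.mkdir(os.path.join(trial_dir, "checkpoint_{}".format(iteration)))
    latest_name = os.path.basename(latest_checkpoint_dir(trial_dir))
```

Note that ``restore`` expects the checkpoint file inside that directory (as in the ``checkpoint_5/checkpoint-5`` path above), so the returned directory is only the first step.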
Handling Large Datasets
-----------------------
You will often want to compute a large object (e.g., training data, model weights) on the driver and use that object within each trial. Tune provides a ``pin_in_object_store`` utility function that can be used to broadcast such large objects. Objects pinned in this way will never be evicted from the Ray object store while the driver process is running, and can be efficiently retrieved from any task via ``get_pinned_object``.

.. code-block:: python

    import ray
    from ray import tune
    from ray.tune.utils import pin_in_object_store, get_pinned_object
    import numpy as np

    ray.init()

    # X_id can be referenced in closures
    X_id = pin_in_object_store(np.random.random(size=100000000))

    def f(config, reporter):
        X = get_pinned_object(X_id)
        # use X

    tune.run(f)

Stopping Trials
---------------
You can control when trials are stopped early by passing the ``stop`` argument to ``tune.run``. This argument takes either a dictionary or a function.
If a dictionary is passed in, the keys may be any field in the result reported by ``tune.track.log`` in the function-based API or returned by ``_train()`` in the class API, including the auto-filled metrics.
In the example below, each trial will be stopped either when it completes 10 iterations OR when it reaches a mean accuracy of 0.98. These metrics are assumed to be **increasing**.
.. code-block:: python
# training_iteration is an auto-filled metric by Tune.
tune.run(
my_trainable,
stop={"training_iteration": 10, "mean_accuracy": 0.98}
)
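The dictionary semantics described above can be sketched in plain Python: a trial stops as soon as any listed metric reaches its threshold. The helper ``should_stop`` is a hypothetical illustration, not Tune's implementation, and it simply ignores metrics that are missing from the result.

```python
def should_stop(result, stop_criteria):
    """Return True once any listed metric reaches its threshold."""
    return any(
        metric in result and result[metric] >= threshold
        for metric, threshold in stop_criteria.items()
    )


criteria = {"training_iteration": 10, "mean_accuracy": 0.98}
early = should_stop({"training_iteration": 3, "mean_accuracy": 0.50}, criteria)
accurate = should_stop({"training_iteration": 3, "mean_accuracy": 0.99}, criteria)
timed_out = should_stop({"training_iteration": 10, "mean_accuracy": 0.50}, criteria)
```

Because the comparison is "greater than or equal", a decreasing metric such as a loss would satisfy its threshold immediately; conditions of that kind are better expressed with a stopping function.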
For more flexibility, you can pass in a function instead. If a function is passed in, it must take ``(trial_id, result)`` as arguments and return a boolean (``True`` if trial should be stopped and ``False`` otherwise).

.. code-block:: python

    def stopper(trial_id, result):
        return result["mean_accuracy"] / result["training_iteration"] > 5

    tune.run(my_trainable, stop=stopper)

Finally, you can implement the ``Stopper`` abstract class to stop entire experiments. For example, the following stopper stops all trials once any individual trial fulfills the criterion, and prevents new ones from starting:

.. code-block:: python

    from ray.tune import Stopper

    class CustomStopper(Stopper):
        def __init__(self):
            self.should_stop = False

        def __call__(self, trial_id, result):
            if not self.should_stop and result['foo'] > 10:
                self.should_stop = True
            return self.should_stop

        def stop_all(self):
            """Returns whether to stop trials and prevent new ones from starting."""
            return self.should_stop

    stopper = CustomStopper()
    tune.run(my_trainable, stop=stopper)

Note that in the above example the currently running trials will not stop immediately but will do so once their current iterations are complete. See the :ref:`tune-stop-ref` documentation.
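As a further illustration of the same two-method interface, the sketch below ends the whole experiment after a fixed number of reported results, regardless of which trial produced them. It assumes ``__call__`` is invoked once per reported result; the class ``BudgetStopper`` is hypothetical and is written without the ``Stopper`` base class so that it runs without Ray. To pass it to ``tune.run``, it would need to derive from ``ray.tune.Stopper``.

```python
class BudgetStopper:
    """Stops the whole experiment after a fixed number of reported results."""

    def __init__(self, max_results):
        self.max_results = max_results
        self.num_results = 0

    def __call__(self, trial_id, result):
        # Count every reported result and stop the reporting trial
        # once the shared budget is used up.
        self.num_results += 1
        return self.num_results >= self.max_results

    def stop_all(self):
        # True once the budget is exhausted, which also blocks new trials.
        return self.num_results >= self.max_results


budget_stopper = BudgetStopper(max_results=3)
decisions = [
    budget_stopper("trial_{}".format(i % 2), {"foo": i}) for i in range(3)
]
```

Keeping the counter on the stopper object, rather than per trial, is what lets a single condition govern the entire experiment.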
Logging/Tensorboard
-------------------
Tune will log the results of each trial to a subfolder under a specified local dir, which defaults to ``~/ray_results``.
Tune by default will log results for Tensorboard, CSV, and JSON formats.
.. code-block:: python
# This logs to 2 different trial folders:
# ~/ray_results/trainable_name/trial_name_1 and ~/ray_results/trainable_name/trial_name_2
# trainable_name and trial_name are autogenerated.
tune.run(trainable, num_samples=2)
Learn about how to customize logging paths and outputs: :ref:`loggers-docstring`.
Tune automatically outputs TensorBoard files during ``tune.run``. To visualize learning in TensorBoard, install tensorboardX:
.. code-block:: bash
$ pip install tensorboardX
Then, after you run an experiment, you can visualize your experiment with TensorBoard by specifying the output directory of your results.
.. code-block:: bash
$ tensorboard --logdir=~/ray_results/my_experiment
If you are running Ray on a remote multi-user cluster where you do not have sudo access, you can run the following commands to make sure tensorboard is able to write to the tmp directory:
.. code-block:: bash
$ export TMPDIR=/tmp/$USER; mkdir -p $TMPDIR; tensorboard --logdir=~/ray_results
.. image:: ../../ray-tune-tensorboard.png
If using TF2, Tune also automatically generates TensorBoard HParams output, as shown below:
.. code-block:: python
tune.run(
...,
config={
"lr": tune.grid_search([1e-5, 1e-4]),
"momentum": tune.grid_search([0, 0.9])
}
)
.. image:: ../../images/tune-hparams.png
Console Output
--------------
The following fields will automatically show up on the console output, if provided:
1. ``episode_reward_mean``
2. ``mean_loss``
3. ``mean_accuracy``
4. ``timesteps_this_iter`` (aggregated into ``timesteps_total``).
Below is an example of the console output:
.. code-block:: bash
== Status ==
Memory usage on this node: 11.4/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 4/12 CPUs, 0/0 GPUs, 0.0/3.17 GiB heap, 0.0/1.07 GiB objects
Result logdir: /Users/foo/ray_results/myexp
Number of trials: 4 (4 RUNNING)
+----------------------+----------+---------------------+-----------+--------+--------+----------------+-------+
| Trial name | status | loc | param1 | param2 | acc | total time (s) | iter |
|----------------------+----------+---------------------+-----------+--------+--------+----------------+-------|
| MyTrainable_a826033a | RUNNING | 10.234.98.164:31115 | 0.303706 | 0.0761 | 0.1289 | 7.54952 | 15 |
| MyTrainable_a8263fc6 | RUNNING | 10.234.98.164:31117 | 0.929276 | 0.158 | 0.4865 | 7.0501 | 14 |
| MyTrainable_a8267914 | RUNNING | 10.234.98.164:31111 | 0.068426 | 0.0319 | 0.9585 | 7.0477 | 14 |
| MyTrainable_a826b7bc | RUNNING | 10.234.98.164:31112 | 0.729127 | 0.0748 | 0.1797 | 7.05715 | 14 |
+----------------------+----------+---------------------+-----------+--------+--------+----------------+-------+
You can use a :ref:`Reporter <tune-reporter-doc>` object to customize the console output.
Uploading Results
-----------------
If an upload directory is provided, Tune will automatically sync results from the ``local_dir`` to the given directory, natively supporting standard S3/gsutil URIs.
.. code-block:: python
tune.run(
MyTrainableClass,
local_dir="~/ray_results",
upload_dir="s3://my-log-dir"
)
You can customize this to specify arbitrary storage with the ``sync_to_cloud`` argument in ``tune.run``. This argument accepts either a command string template or an arbitrary function.
.. code-block:: python
tune.run(
MyTrainableClass,
upload_dir="s3://my-log-dir",
sync_to_cloud=custom_sync_str_or_func,
)
If a string is provided, then it must include the replacement fields ``{source}`` and ``{target}``, like ``aws s3 sync {source} {target}``. Alternatively, a function can be provided with the following signature:

.. code-block:: python

    import subprocess

    def custom_sync_func(source, target):
        # do arbitrary things inside
        sync_cmd = "aws s3 sync {source} {target}".format(
            source=source,
            target=target)
        sync_process = subprocess.Popen(sync_cmd, shell=True)
        sync_process.wait()

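A sync function does not have to shell out to a cloud CLI. The sketch below uses the same ``(source, target)`` signature to mirror results into another local directory, for example a mounted network drive. The function name ``local_copy_sync`` and the file names in the demonstration are hypothetical.

```python
import os
import shutil
import tempfile


def local_copy_sync(source, target):
    """Mirror the files under `source` into `target` on the same machine."""
    for root, _dirs, files in os.walk(source):
        rel_root = os.path.relpath(root, source)
        dest_root = os.path.normpath(os.path.join(target, rel_root))
        os.makedirs(dest_root, exist_ok=True)
        for name in files:
            shutil.copy2(os.path.join(root, name), os.path.join(dest_root, name))


# Demonstration with temporary directories standing in for the two locations.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    os.makedirs(os.path.join(src, "trial_1"))
    with open(os.path.join(src, "trial_1", "result.json"), "w") as f:
        f.write("{}\n")
    local_copy_sync(src, dst)
    copied = os.path.exists(os.path.join(dst, "trial_1", "result.json"))
```

The function copies files one by one rather than replacing the target directory, so repeated calls overwrite changed files without deleting anything already uploaded.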
.. _tune-debugging:
Debugging
---------
By default, Tune will run hyperparameter evaluations on multiple processes. However, if you need to debug your training process, it may be easier to do everything on a single process. You can force all Ray functions to occur on a single process with ``local_mode`` by calling the following before ``tune.run``.
.. code-block:: python
ray.init(local_mode=True)
Local mode with multiple configuration evaluations will interleave computation, so it is most naturally used when running a single configuration evaluation.
Stopping after the first failure
--------------------------------
By default, ``tune.run`` will continue executing until all trials have terminated or errored. To stop the entire Tune run as soon as **any** trial errors:
.. code-block:: python
tune.run(trainable, fail_fast=True)
This is useful when you are trying to set up a large hyperparameter experiment.
Further Questions or Issues?
----------------------------
You can post questions, issues, or feedback through the following channels:
1. `StackOverflow`_: For questions about how to use Ray.
2. `GitHub Issues`_: For bug reports and feature requests.
.. _`StackOverflow`: https://stackoverflow.com/questions/tagged/ray
.. _`GitHub Issues`: https://github.com/ray-project/ray/issues
+4 -55
View File
@@ -1,5 +1,7 @@
Analysis/Logging (tune.analysis / tune.logger)
==============================================
.. _tune-analysis-docs:
Analysis (tune.analysis)
========================
Analyzing Results
-----------------
@@ -52,56 +54,3 @@ Analysis
.. autoclass:: ray.tune.Analysis
:members:
.. _loggers-docstring:
Loggers (tune.logger)
---------------------
Viskit
~~~~~~
Tune automatically integrates with Viskit via the ``CSVLogger`` outputs. To use VisKit (you may have to install some dependencies), run:
.. code-block:: bash
$ git clone https://github.com/rll/rllab.git
$ python rllab/rllab/viskit/frontend.py ~/ray_results/my_experiment
The nonrelevant metrics (like timing stats) can be disabled on the left to show only the relevant ones (like accuracy, loss, etc.).
.. image:: /ray-tune-viskit.png
.. _logger-interface:
Logger
~~~~~~
.. autoclass:: ray.tune.logger.Logger
UnifiedLogger
~~~~~~~~~~~~~
.. autoclass:: ray.tune.logger.UnifiedLogger
TBXLogger
~~~~~~~~~
.. autoclass:: ray.tune.logger.TBXLogger
JsonLogger
~~~~~~~~~~
.. autoclass:: ray.tune.logger.JsonLogger
CSVLogger
~~~~~~~~~
.. autoclass:: ray.tune.logger.CSVLogger
MLFLowLogger
~~~~~~~~~~~~
Tune also provides a default logger for `MLFlow <https://mlflow.org>`_. You can install MLFlow via ``pip install mlflow``. An example can be found `mlflow_example.py <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/mlflow_example.py>`__. Note that this currently does not include artifact logging support. For this, you can use the native MLFlow APIs inside your Trainable definition.
.. autoclass:: ray.tune.logger.MLFLowLogger
+3
View File
@@ -1,6 +1,8 @@
Training (tune.run, tune.Experiment)
====================================
.. _tune-run-ref:
tune.run
--------
@@ -16,6 +18,7 @@ tune.Experiment
.. autofunction:: ray.tune.Experiment
.. _tune-stop-ref:
Stopper (tune.Stopper)
----------------------
+120
View File
@@ -0,0 +1,120 @@
.. _loggers-docstring:
Loggers (tune.logger)
=====================
Tune has default loggers for Tensorboard, CSV, and JSON formats.
Logging Path
------------
Tune will log the results of each trial to a subfolder under a specified local dir, which defaults to ``~/ray_results``.
.. code-block:: python
# This logs to 2 different trial folders:
# ~/ray_results/trainable_name/trial_name_1 and ~/ray_results/trainable_name/trial_name_2
# trainable_name and trial_name are autogenerated.
tune.run(trainable, num_samples=2)
You can specify the ``local_dir`` and the experiment ``name``:
.. code-block:: python
# This logs to 2 different trial folders:
# ./results/test_experiment/trial_name_1 and ./results/test_experiment/trial_name_2
# Only trial_name is autogenerated.
tune.run(trainable, num_samples=2, local_dir="./results", name="test_experiment")
To specify custom trial folder names, you can use the ``trial_name_creator`` argument
to ``tune.run``. This takes a function with the following signature:

.. code-block:: python

    def trial_name_string(trial):
        """
        Args:
            trial (Trial): A generated trial object.

        Returns:
            trial_name (str): String representation of Trial.
        """
        return str(trial)

    tune.run(
        MyTrainableClass,
        name="example-experiment",
        num_samples=1,
        trial_name_creator=trial_name_string
    )

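A name creator can also build a more compact folder name from individual fields of the trial. The sketch below assumes the trial object exposes ``trainable_name`` and ``trial_id`` attributes; ``FakeTrial`` is a stand-in used only so that the example runs without Ray, and ``short_trial_name`` is a hypothetical function name.

```python
from collections import namedtuple

# Stand-in for Tune's Trial object, used only to make this sketch runnable.
# The attribute names `trainable_name` and `trial_id` are assumptions here.
FakeTrial = namedtuple("FakeTrial", ["trainable_name", "trial_id"])


def short_trial_name(trial):
    """Build a compact folder name from the trainable name and trial ID."""
    return "{}_{}".format(trial.trainable_name, trial.trial_id)


example_name = short_trial_name(FakeTrial("MyTrainableClass", "a826033a"))
```

Such a function would be passed as ``trial_name_creator=short_trial_name`` in the same way as ``trial_name_string``.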
See the documentation on Trials: :ref:`trial-docstring`.
Custom Loggers
--------------
You can pass in your own logging mechanisms to output logs in custom formats as follows:
.. code-block:: python
from ray.tune.logger import DEFAULT_LOGGERS
tune.run(
MyTrainableClass,
name="experiment_name",
loggers=DEFAULT_LOGGERS + (CustomLogger1, CustomLogger2)
)
These loggers will be called along with the default Tune loggers. All loggers must inherit the Logger interface (:ref:`logger-interface`). You can also check out `logger.py <https://github.com/ray-project/ray/blob/master/python/ray/tune/logger.py>`__ for implementation details.
An example can be found in `logging_example.py <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/logging_example.py>`__.
Viskit
------
Tune automatically integrates with `Viskit <https://github.com/vitchyr/viskit>`_ via the ``CSVLogger`` outputs. To use VisKit (you may have to install some dependencies), run:
.. code-block:: bash
$ git clone https://github.com/rll/rllab.git
$ python rllab/rllab/viskit/frontend.py ~/ray_results/my_experiment
Irrelevant metrics (like timing stats) can be disabled on the left to show only the relevant ones (like accuracy, loss, etc.).
.. image:: /ray-tune-viskit.png
.. _logger-interface:
Logger
------
.. autoclass:: ray.tune.logger.Logger
UnifiedLogger
-------------
.. autoclass:: ray.tune.logger.UnifiedLogger
TBXLogger
---------
.. autoclass:: ray.tune.logger.TBXLogger
JsonLogger
----------
.. autoclass:: ray.tune.logger.JsonLogger
CSVLogger
---------
.. autoclass:: ray.tune.logger.CSVLogger
MLFLowLogger
------------
Tune also provides a default logger for `MLFlow <https://mlflow.org>`_. You can install MLFlow via ``pip install mlflow``. An example can be found in `mlflow_example.py <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/mlflow_example.py>`__. Note that this currently does not include artifact logging support. For this, you can use the native MLFlow APIs inside your Trainable definition.
.. autoclass:: ray.tune.logger.MLFLowLogger
+3
View File
@@ -1,3 +1,5 @@
.. _tune-api-ref:
Tune API Reference
==================
@@ -16,6 +18,7 @@ on `Github`_.
grid_random.rst
suggestion.rst
schedulers.rst
logging.rst
internals.rst
client.rst
cli.rst
+9 -5
View File
@@ -1,3 +1,5 @@
.. _tune-reporter-doc:
Console Output (Reporters)
==========================
@@ -73,20 +75,22 @@ The default reporting style can also be overriden more broadly by extending the
tune.run(my_trainable, progress_reporter=CustomReporter())
ProgressReporter
----------------
.. autoclass:: ray.tune.ProgressReporter
:members:
CLIReporter
-----------
.. autoclass:: ray.tune.CLIReporter
:members: add_metric_column
JupyterNotebookReporter
-----------------------
.. autoclass:: ray.tune.JupyterNotebookReporter
:members: add_metric_column
ProgressReporter
----------------
.. autoclass:: ray.tune.ProgressReporter
:members:
+4 -2
View File
@@ -1,5 +1,7 @@
Schedulers (tune.schedulers)
============================
.. _schedulers-ref:
Trial Schedulers (tune.schedulers)
==================================
FIFOScheduler
~~~~~~~~~~~~~
+2
View File
@@ -1,3 +1,5 @@
.. _searchalg-ref:
Search Algorithms (tune.suggest)
================================
+34 -12
View File
@@ -3,7 +3,7 @@
Training (tune.Trainable, tune.track)
=====================================
Training can be done with either a **Class API** (``tune.Trainable``) < or **function-based API** (``track.log``).
Training can be done with either a **Class API** (``tune.Trainable``) or **function-based API** (``track.log``).
You can use the **function-based API** for fast prototyping. On the other hand, the ``tune.Trainable`` interface supports checkpoint/restore functionality and provides more control for advanced algorithms.
@@ -41,26 +41,26 @@ The Trainable **class API** will require users to subclass ``ray.tune.Trainable`
from ray import tune
class Guesser(tune.Trainable):
"""Randomly picks 10 number from [1, 10000) to find the password."""
"""Randomly picks a number from [1, 10000) to find the password."""
def _setup(self, config):
self.config = config
self.guess = config["guess"]
self.iter = 0
self.password = 1024
def _train(self):
"""Execute one step of 'training'."""
result_dict = {"diff": abs(self.config['guess'] - self.password)}
return result_dict
"""Execute one step of 'training'. This function will be called iteratively"""
self.iter += 1
self.guess += 1
return {
"accuracy": abs(self.guess - self.password),
"training_iteration": self.iter # Tune will automatically provide this.
}
def _stop(self):
# perform any cleanup necessary.
pass
analysis = tune.run(
Guesser,
stop={
"training_iteration": 1,
},
stop={"training_iteration": 10},
num_samples=10,
config={
"guess": tune.randint(1, 10000)
@@ -109,6 +109,28 @@ Use ``validate_save_restore`` to catch ``_save``/``_restore`` errors before exec
validate_save_restore(MyTrainableClass)
validate_save_restore(MyTrainableClass, use_object_store=True)
Advanced Resource Allocation
----------------------------
Trainables can themselves be distributed. If your trainable function / class creates further Ray actors or tasks that also consume CPU / GPU resources, you will want to set ``extra_cpu`` or ``extra_gpu`` inside ``tune.run`` to reserve extra resource slots. For example, if a trainable class requires 1 GPU itself, but also launches 4 actors, each using another GPU, then you should set ``"gpu": 1, "extra_gpu": 4``.
.. code-block:: python
:emphasize-lines: 4-8
tune.run(
my_trainable,
name="my_trainable",
resources_per_trial={
"cpu": 1,
"gpu": 1,
"extra_gpu": 4
}
)
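The effect of the extra slots on parallelism can be estimated with simple arithmetic: each trial reserves ``cpu + extra_cpu`` CPUs and ``gpu + extra_gpu`` GPUs. The helper below is a rough sketch that treats the cluster as one pool of resources and ignores per-node placement; ``max_concurrent_trials`` and the cluster sizes are hypothetical.

```python
def max_concurrent_trials(cluster, resources_per_trial):
    """Estimate how many trials fit on the cluster at the same time."""
    cpus = resources_per_trial.get("cpu", 0) + resources_per_trial.get("extra_cpu", 0)
    gpus = resources_per_trial.get("gpu", 0) + resources_per_trial.get("extra_gpu", 0)
    limits = []
    if cpus:
        limits.append(cluster.get("cpu", 0) // cpus)
    if gpus:
        limits.append(cluster.get("gpu", 0) // gpus)
    # The scarcest resource determines how many trials can run at once.
    return min(limits) if limits else 0


per_trial = {"cpu": 1, "gpu": 1, "extra_gpu": 4}
concurrent = max_concurrent_trials({"cpu": 16, "gpu": 8}, per_trial)
```

With 5 GPUs reserved per trial, an 8-GPU cluster runs only one trial at a time under this estimate, even though 16 CPUs are available.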
The ``Trainable`` also provides the ``default_resource_requests`` interface to automatically declare the ``resources_per_trial`` based on the given configuration.
Advanced: Reusing Actors
~~~~~~~~~~~~~~~~~~~~~~~~
-38
View File
@@ -1,38 +0,0 @@
Tune Guides and Tutorials
=========================
Tune takes a user-defined Python function or class and evaluates it on a set of hyperparameter configurations.
Each hyperparameter configuration evaluation is called a *trial*, and multiple trials are run in parallel. Configurations are either generated by Tune or drawn from a user-specified **search algorithm**. The trials are scheduled and managed by a **trial scheduler**.
.. image:: /images/tune-api.svg
.. customgalleryitem::
:tooltip: Getting started with Tune.
:figure: /images/tune.png
:description: :doc:`plot_tune-tutorial`
.. customgalleryitem::
:tooltip: A simple guide to Population-based Training
:figure: /images/tune-pbt-small.png
:description: :doc:`plot_tune-advanced-tutorial`
.. customgalleryitem::
:tooltip: Distributed Tuning
:figure: /images/tune.png
:description: :doc:`plot_tune-distributed`
.. toctree::
:hidden:
plot_tune-tutorial.rst
plot_tune-advanced-tutorial.rst
plot_tune-distributed.rst
.. :figure: /images/param_actor.png