diff --git a/doc/source/tune-distributed.rst b/doc/source/tune-distributed.rst index e521e1b07..82ff76b9c 100644 --- a/doc/source/tune-distributed.rst +++ b/doc/source/tune-distributed.rst @@ -1,14 +1,27 @@ Tune Distributed Experiments ============================ -Tune is commonly used for large-scale distributed hyperparameter optimization. Tune and Ray provide many utilities that enable an effective workflow for interacting with a cluster, including fast file mounting, one-line cluster launching, and result uploading to cloud storage. +Tune is commonly used for large-scale distributed hyperparameter optimization. This page will overview: -This page will overview the tooling for distributed experiments, covering how to connect to a cluster, how to launch a distributed experiment, and commonly used commands. + 1. How to setup and launch a distributed experiment, + 2. `commonly used commands `_, including fast file mounting, one-line cluster launching, and result uploading to cloud storage. -Connecting to a cluster ------------------------ +**Quick Summary**: To run a distributed experiment with Tune, you need to: -One common approach to modifying an existing Tune experiment to go distributed is to set an `argparse` variable so that toggling between distributed and single-node is seamless. This allows Tune to utilize all the resources available to the Ray cluster. + 1. Make sure your script has ``ray.init(redis_address=...)`` to connect to the existing Ray cluster. + 2. If a ray cluster does not exist, start a Ray cluster (instructions for `local machines `_, `cloud `_). + 3. Run the script on the head node (or use ``ray submit``). + +Running a distributed experiment +-------------------------------- + +Running a distributed (multi-node) experiment requires Ray to be started already. You can do this on local machines or on the cloud (instructions for `local machines `_, `cloud `_). + +Across your machines, Tune will automatically detect the number of GPUs and CPUs without you needing to manage ``CUDA_VISIBLE_DEVICES``. + +To execute a distributed experiment, call ``ray.init(redis_address=XXX)`` before ``tune.run``, where ``XXX`` is the Ray redis address, which defaults to ``localhost:6379``. The Tune python script should be executed only on the head node of the Ray cluster. + +One common approach to modifying an existing Tune experiment to go distributed is to set an ``argparse`` variable so that toggling between distributed and single-node is seamless. .. code-block:: python @@ -20,58 +33,50 @@ One common approach to modifying an existing Tune experiment to go distributed i args = parser.parse_args() ray.init(redis_address=args.ray_redis_address) -Note that connecting to cluster requires a pre-existing Ray cluster to be started already (`Manual Cluster Setup `_). The script should be run on the head node of the Ray cluster. Below, ``tune_script.py`` can be any script that runs a Tune hyperparameter search. + tune.run(...) .. code-block:: bash - # Single-node execution - $ python tune_script.py - # On the head node, connect to an existing ray cluster $ python tune_script.py --ray-redis-address=localhost:XXXX -.. literalinclude:: ../../python/ray/tune/examples/mnist_pytorch.py - :language: python - :start-after: if __name__ == "__main__": +If you used a cluster configuration (starting a cluster with ``ray up`` or ``ray submit --start``), use: +.. code-block:: bash -Launching a cloud cluster -------------------------- + ray submit tune-default.yaml tune_script.py --args="--ray-redis-address=localhost:6379" .. tip:: - If you have already have a list of nodes, skip down to the `Local Cluster Setup`_ section. - -Ray currently supports AWS and GCP. Below, we will launch nodes on AWS that will default to using the Deep Learning AMI. See the `cluster setup documentation `_. - -.. literalinclude:: ../../python/ray/tune/examples/tune-default.yaml - :language: yaml - :name: tune-default.yaml - -This code starts a cluster as specified by the given cluster configuration YAML file, uploads ``tune_script.py`` to the cluster, and runs ``python tune_script.py``. - -.. code-block:: bash - - ray submit tune-default.yaml tune_script.py --start - -.. image:: images/tune-upload.png - :scale: 50% - :align: center - -Analyze your results on TensorBoard by starting TensorBoard on the remote head machine. - -.. code-block:: bash - - # Go to http://localhost:6006 to access TensorBoard. - ray exec tune-default.yaml 'tensorboard --logdir=~/ray_results/ --port 6006' --port-forward 6006 - - -Note that you can customize the directory of results by running: ``tune.run(local_dir=..)``. You can then point TensorBoard to that directory to visualize results. You can also use `awless `_ for easy cluster management on AWS. + 1. In the examples, the Ray redis address commonly used is ``localhost:6379``. + 2. If the Ray cluster is already started, you should not need to run anything on the worker nodes. Local Cluster Setup ------------------- -If you run into issues (or want to add nodes manually), you can use the manual cluster setup `documentation here `__. At a glance, On the head node, run the following. +If you have already have a list of nodes, you can follow the local private cluster setup `instructions here `_. Below is an example cluster configuration as ``tune-default.yaml``: + +.. literalinclude:: ../../python/ray/tune/examples/tune-local-default.yaml + :language: yaml + +``ray up`` starts Ray on the cluster of nodes. + +.. code-block:: bash + + ray up tune-default.yaml + +``ray submit`` uploads ``tune_script.py`` to the cluster and runs ``python tune_script.py [args]``. + +.. code-block:: bash + + ray submit tune-default.yaml tune_script.py --args="--ray-redis-address=localhost:6379" + +Manual Local Cluster Setup +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If you run into issues using the local cluster setup (or want to add nodes manually), you can use the manual cluster setup. `Full documentation here `__. At a glance, + +**On the head node**: .. code-block:: bash @@ -86,8 +91,51 @@ The command will print out the address of the Redis server that was started (and $ ray start --redis-address= -If you have already have a list of nodes, you can follow the private autoscaling cluster setup `instructions here `_. +Then, you can run your Tune Python script on the head node like: +.. code-block:: bash + + # On the head node, execute using existing ray cluster + $ python tune_script.py --ray-redis-address= + +Launching a cloud cluster +------------------------- + +.. tip:: + + If you have already have a list of nodes, go to the `Local Cluster Setup`_ section. + +Ray currently supports AWS and GCP. Below, we will launch nodes on AWS that will default to using the Deep Learning AMI. See the `cluster setup documentation `_. Save the below cluster configuration (``tune-default.yaml``): + +.. literalinclude:: ../../python/ray/tune/examples/tune-default.yaml + :language: yaml + :name: tune-default.yaml + +``ray up`` starts Ray on the cluster of nodes. + +.. code-block:: bash + + ray up tune-default.yaml + +``ray submit --start`` starts a cluster as specified by the given cluster configuration YAML file, uploads ``tune_script.py`` to the cluster, and runs ``python tune_script.py [args]``. + +.. code-block:: bash + + ray submit tune-default.yaml tune_script.py --start --args="--ray-redis-address=localhost:6379" + +.. image:: images/tune-upload.png + :scale: 50% + :align: center + +Analyze your results on TensorBoard by starting TensorBoard on the remote head machine. + +.. code-block:: bash + + # Go to http://localhost:6006 to access TensorBoard. + ray exec tune-default.yaml 'tensorboard --logdir=~/ray_results/ --port 6006' --port-forward 6006 + + +Note that you can customize the directory of results by running: ``tune.run(local_dir=..)``. You can then point TensorBoard to that directory to visualize results. You can also use `awless `_ for easy cluster management on AWS. Pre-emptible Instances (Cloud) ------------------------------ @@ -180,7 +228,7 @@ To summarize, here are the commands to run: # wait a while until after all nodes have started ray kill-random-node tune-default.yaml --hard -You should see Tune eventually continue the trials on a different worker node. See the `Saving and Recovery `__ section for more details. +You should see Tune eventually continue the trials on a different worker node. See the `Save and Restore `__ section for more details. You can also specify ``tune.run(upload_dir=...)`` to sync results with a cloud storage like S3, persisting results in case you want to start and stop your cluster automatically. diff --git a/doc/source/tune-schedulers.rst b/doc/source/tune-schedulers.rst index cb3f1aced..12b802d06 100644 --- a/doc/source/tune-schedulers.rst +++ b/doc/source/tune-schedulers.rst @@ -35,7 +35,7 @@ Tune includes a distributed implementation of `Population Based Training (PBT) < }) tune.run( ... , scheduler=pbt_scheduler) -When the PBT scheduler is enabled, each trial variant is treated as a member of the population. Periodically, top-performing trials are checkpointed (this requires your Trainable to support `save and restore `__). Low-performing trials clone the checkpoints of top performers and perturb the configurations in the hope of discovering an even better variation. +When the PBT scheduler is enabled, each trial variant is treated as a member of the population. Periodically, top-performing trials are checkpointed (this requires your Trainable to support `save and restore `__). Low-performing trials clone the checkpoints of top performers and perturb the configurations in the hope of discovering an even better variation. You can run this `toy PBT example `__ to get an idea of how how PBT operates. When training in PBT mode, a single trial may see many different hyperparameters over its lifetime, which is recorded in its ``result.json`` file. The following figure generated by the example shows PBT with optimizing a LR schedule over the course of a single experiment: @@ -69,7 +69,7 @@ Compared to the original version of HyperBand, this implementation provides bett HyperBand --------- -.. note:: Note that the HyperBand scheduler requires your trainable to support saving and restoring, which is described in `Tune User Guide `__. Checkpointing enables the scheduler to multiplex many concurrent trials onto a limited size cluster. +.. note:: Note that the HyperBand scheduler requires your trainable to support saving and restoring, which is described in `Tune User Guide `__. Checkpointing enables the scheduler to multiplex many concurrent trials onto a limited size cluster. Tune also implements the `standard version of HyperBand `__. You can use it as such: diff --git a/doc/source/tune-usage.rst b/doc/source/tune-usage.rst index ac0ce698d..a24e9c2bf 100644 --- a/doc/source/tune-usage.rst +++ b/doc/source/tune-usage.rst @@ -255,8 +255,8 @@ If your trainable function / class creates further Ray actors or tasks that also } ) -Saving and Recovery -------------------- +Save and Restore +---------------- When running a hyperparameter search, Tune can automatically and periodically save/checkpoint your model. Checkpointing is used for @@ -296,7 +296,7 @@ Checkpoints will be saved by training iteration to ``local_dir/exp_name/trial_na Trainable (Trial) Checkpointing -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +------------------------------- Checkpointing assumes that the model state will be saved to disk on whichever node the Trainable is running on. You can checkpoint with three different mechanisms: manually, periodically, and at termination. @@ -349,7 +349,7 @@ The checkpoint will be saved at a path that looks like ``local_dir/exp_name/tria ) Fault Tolerance -~~~~~~~~~~~~~~~ +--------------- Tune will automatically restart trials from the last checkpoint in case of trial failures/error (if ``max_failures`` is set), both in the single node and distributed setting. diff --git a/doc/source/tune.rst b/doc/source/tune.rst index 63b69f4dc..f559797b1 100644 --- a/doc/source/tune.rst +++ b/doc/source/tune.rst @@ -48,14 +48,6 @@ If TensorBoard is installed, automatically visualize all trial results: Distributed Quick Start ----------------------- -.. note:: - - This assumes that you have already setup your AWS account and AWS credentials (``aws configure``). To run this example, you will need to install the following: - - .. code-block:: bash - - $ pip install ray torch torchvision filelock - 1. Import and initialize Ray by appending the following to your example script. .. code-block:: python @@ -71,25 +63,35 @@ Distributed Quick Start Alternatively, download a full example script here: :download:`mnist_pytorch.py <../../python/ray/tune/examples/mnist_pytorch.py>` -2. Download an example cluster yaml here: :download:`tune-default.yaml <../../python/ray/tune/examples/tune-default.yaml>` +2. Download the following example Ray cluster configuration as ``tune-local-default.yaml`` and replace the appropriate fields: + +.. literalinclude:: ../../python/ray/tune/examples/tune-local-default.yaml + :language: yaml + +Alternatively, download it here: :download:`tune-local-default.yaml <../../python/ray/tune/examples/tune-local-default.yaml>`. See `Ray cluster docs here `_. + 3. Run ``ray submit`` like the following. .. code-block:: bash - ray submit tune-default.yaml mnist_pytorch.py --args="--ray-redis-address=localhost:6379" --start + ray submit tune-local-default.yaml mnist_pytorch.py --args="--ray-redis-address=localhost:6379" --start -This will start 3 AWS machines and run a distributed hyperparameter search across them. Append ``[--stop]`` to automatically shutdown your nodes afterwards. +This will start Ray on all of your machines and run a distributed hyperparameter search across them. To summarize, here are the full set of commands: .. code-block:: bash wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/tune/examples/mnist_pytorch.py - wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/tune/tune-default.yaml - ray submit tune-default.yaml mnist_pytorch.py --args="--ray-redis-address=localhost:6379" --start + wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/tune/tune-local-default.yaml + ray submit tune-local-default.yaml mnist_pytorch.py --args="--ray-redis-address=localhost:6379" --start -Take a look at the `Distributed Experiments `_ documentation for more details, including setting up distributed experiments on local machines, using GCP, adding resilience to spot instance usage, and more. +Take a look at the `Distributed Experiments `_ documentation for more details, including: + + 1. Setting up distributed experiments on your local cluster + 2. Using AWS and GCP + 3. Spot instance usage/pre-emptible instances, and more. Getting Started --------------- diff --git a/python/ray/tests/test_autoscaler_yaml.py b/python/ray/tests/test_autoscaler_yaml.py new file mode 100644 index 000000000..ac7d6657b --- /dev/null +++ b/python/ray/tests/test_autoscaler_yaml.py @@ -0,0 +1,34 @@ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os +import unittest +import yaml + +from ray.autoscaler.autoscaler import fillout_defaults, validate_config +from ray.tests.utils import recursive_fnmatch + +RAY_PATH = os.path.abspath(os.path.join(__file__, "../../")) +CONFIG_PATHS = recursive_fnmatch( + os.path.join(RAY_PATH, "autoscaler"), "*.yaml") + +CONFIG_PATHS += recursive_fnmatch( + os.path.join(RAY_PATH, "tune/examples/"), "*.yaml") + + +class AutoscalingConfigTest(unittest.TestCase): + def testValidateDefaultConfig(self): + + for config_path in CONFIG_PATHS: + with open(config_path) as f: + config = yaml.safe_load(f) + config = fillout_defaults(config) + try: + validate_config(config) + except Exception: + self.fail("Config did not pass validation test!") + + +if __name__ == "__main__": + unittest.main(verbosity=2) diff --git a/python/ray/tests/utils.py b/python/ray/tests/utils.py index 98ee13017..f01b22fe9 100644 --- a/python/ray/tests/utils.py +++ b/python/ray/tests/utils.py @@ -2,6 +2,7 @@ from __future__ import absolute_import from __future__ import division from __future__ import print_function +import fnmatch import os import subprocess import sys @@ -116,3 +117,15 @@ def wait_for_condition(condition_predictor, time_elapsed += retry_interval_ms time.sleep(retry_interval_ms / 1000.0) return False + + +def recursive_fnmatch(dirpath, pattern): + """Looks at a file directory subtree for a filename pattern. + + Similar to glob.glob(..., recursive=True) but also supports 2.7 + """ + matches = [] + for root, dirnames, filenames in os.walk(dirpath): + for filename in fnmatch.filter(filenames, pattern): + matches.append(os.path.join(root, filename)) + return matches diff --git a/python/ray/tune/examples/tune-default.yaml b/python/ray/tune/examples/tune-default.yaml index 877ab1c11..1f51cf4ce 100644 --- a/python/ray/tune/examples/tune-default.yaml +++ b/python/ray/tune/examples/tune-default.yaml @@ -1,51 +1,10 @@ -# An unique identifier for the head node and workers of this cluster. -cluster_name: tune-example - -# The minimum number of workers nodes to launch in addition to the head -# node. This number should be >= 0. -min_workers: 2 - -# The maximum number of workers nodes to launch in addition to the head -# node. This takes precedence over min_workers. -max_workers: 2 - -# Cloud-provider specific configuration. -provider: - type: aws - region: us-west-2 - # Availability zone(s), comma-separated, that nodes may be launched in. - # Nodes are currently spread between zones by a round-robin approach, - # however this implementation detail should not be relied upon. - availability_zone: us-west-2a,us-west-2b - -# How Ray will authenticate with newly launched nodes. -# By default Ray creates a new private keypair, but you can also use your own. -auth: - ssh_user: ubuntu - -# Provider-specific config for the head node, e.g. instance type. -head_node: - InstanceType: c5.xlarge - ImageId: ami-0b294f219d14e6a82 # Deep Learning AMI (Ubuntu) Version 21.0 - -# Provider-specific config for worker nodes, e.g. instance type. -worker_nodes: - InstanceType: c5.xlarge - ImageId: ami-0b294f219d14e6a82 # Deep Learning AMI (Ubuntu) Version 21.0 - - # Run workers on spot by default. Comment this out to use on-demand. - InstanceMarketOptions: - MarketType: spot - -# Files or directories to copy to the head and worker nodes. The format is a -# dictionary from REMOTE_PATH: LOCAL_PATH, e.g. -file_mounts: { -# "/path1/on/remote/machine": "/path1/on/local/machine", -# "/path2/on/remote/machine": "/path2/on/local/machine", -} - -# List of shell commands to run to set up each node. -setup_commands: - - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.8.0.dev3-cp36-cp36m-manylinux1_x86_64.whl - - pip install torch torchvision tabulate tensorboard filelock - +cluster_name: tune-default +provider: {type: aws, region: us-west-2} +auth: {ssh_user: ubuntu} +min_workers: 3 +max_workers: 3 +# Deep Learning AMI (Ubuntu) Version 21.0 +head_node: {InstanceType: c5.xlarge, ImageId: ami-0b294f219d14e6a82} +worker_nodes: {InstanceType: c5.xlarge, ImageId: ami-0b294f219d14e6a82} +setup_commands: # Set up each node. + - pip install ray torch torchvision tabulate tensorboard diff --git a/python/ray/tune/examples/tune-local-default.yaml b/python/ray/tune/examples/tune-local-default.yaml new file mode 100644 index 000000000..8cf5a3fe1 --- /dev/null +++ b/python/ray/tune/examples/tune-local-default.yaml @@ -0,0 +1,11 @@ +cluster_name: local-default +provider: + type: local + head_ip: YOUR_HEAD_NODE_HOSTNAME + worker_ips: [WORKER_NODE_1_HOSTNAME, WORKER_NODE_2_HOSTNAME, ... ] +auth: {ssh_user: YOUR_USERNAME, ssh_private_key: ~/.ssh/id_rsa} +## Typically for local clusters, min_workers == max_workers. +min_workers: 3 +max_workers: 3 +setup_commands: # Set up each node. + - pip install ray torch torchvision tabulate tensorboard diff --git a/python/ray/tune/tests/test_tune_restore.py b/python/ray/tune/tests/test_tune_restore.py index 306e7d27b..4130cf809 100644 --- a/python/ray/tune/tests/test_tune_restore.py +++ b/python/ray/tune/tests/test_tune_restore.py @@ -10,7 +10,8 @@ import unittest import ray from ray import tune -from ray.tune.util import recursive_fnmatch, validate_save_restore +from ray.tests.utils import recursive_fnmatch +from ray.tune.util import validate_save_restore from ray.rllib import _register_all diff --git a/python/ray/tune/util.py b/python/ray/tune/util.py index e1dbd4db9..19e5ce9c6 100644 --- a/python/ray/tune/util.py +++ b/python/ray/tune/util.py @@ -4,9 +4,7 @@ from __future__ import print_function import base64 import copy -import fnmatch import logging -import os import threading import time from collections import defaultdict @@ -213,18 +211,6 @@ def _from_pinnable(obj): return obj[0] -def recursive_fnmatch(dirpath, pattern): - """Looks at a file directory subtree for a filename pattern. - - Similar to glob.glob(..., recursive=True) but also supports 2.7 - """ - matches = [] - for root, dirnames, filenames in os.walk(dirpath): - for filename in fnmatch.filter(filenames, pattern): - matches.append(os.path.join(root, filename)) - return matches - - def validate_save_restore(trainable_cls, config=None, use_object_store=False): """Helper method to check if your Trainable class will resume correctly.