diff --git a/doc/source/actors.rst b/doc/source/actors.rst index 3e0f43ffe..e367f7309 100644 --- a/doc/source/actors.rst +++ b/doc/source/actors.rst @@ -475,20 +475,21 @@ Actor Pool Actor pool hasn't been implemented in Java yet. -Actors, Workers and Resources ------------------------------ +FAQ: Actors, Workers and Resources +---------------------------------- + What's the difference between a worker and an actor? -Each "Ray worker" is a python process. +Each "Ray worker" is a python process. Workers are treated differently for tasks and actors. Any "Ray worker" is either 1. used to execute multiple Ray tasks or 2. is started as a dedicated Ray actor. - * Tasks: When Ray starts on a machine, a number of Ray workers will be started automatically (1 per CPU by default). They will be used to execute tasks (like a process pool). If you execute 8 tasks with `num_cpus=2`, and total number of CPUs is 16 (`ray.cluster_resources()["CPU"] == 16`), you will end up with 8 of your 16 workers idling. +* **Tasks**: When Ray starts on a machine, a number of Ray workers will be started automatically (1 per CPU by default). They will be used to execute tasks (like a process pool). If you execute 8 tasks with `num_cpus=2`, and total number of CPUs is 16 (`ray.cluster_resources()["CPU"] == 16`), you will end up with 8 of your 16 workers idling. - * Actor: A Ray Actor is also a "Ray worker" but is instantiated at runtime (upon `actor_cls.remote()`). All of its methods will run on the same process, using the same resources (designated when defining the Actor). Note that unlike tasks, the python processes that runs Ray Actors are not reused and will be terminated when the Actor is deleted. +* **Actor**: A Ray Actor is also a "Ray worker" but is instantiated at runtime (upon `actor_cls.remote()`). All of its methods will run on the same process, using the same resources (designated when defining the Actor). Note that unlike tasks, the python processes that runs Ray Actors are not reused and will be terminated when the Actor is deleted. To maximally utilize your resources, you want to maximize the time that -your workers are working. You also want to allocate enough cluster resources +your workers are working. You also want to allocate enough cluster resources so that both all of your needed actors can run and any other tasks you define can run. This also implies that tasks are scheduled more flexibly, and that if you don't need the stateful part of an actor, you're mostly @@ -505,8 +506,8 @@ Concurrency within an actor Ray offers two types of concurrency within an actor: - * :ref:`async execution ` - * :ref:`threading ` + * :ref:`async execution ` + * :ref:`threading ` See the above links for more details. diff --git a/doc/source/cluster/index.rst b/doc/source/cluster/index.rst index aa341bb6a..88fe75bf2 100644 --- a/doc/source/cluster/index.rst +++ b/doc/source/cluster/index.rst @@ -10,14 +10,47 @@ Key Concepts * **Ray Nodes**: A Ray cluster consists of a **head node** and a set of **worker nodes**. The head node needs to be started first, and the worker nodes are given the address of the head node to form the cluster. The Ray cluster itself can also "auto-scale," meaning that it can interact with a Cloud Provider to request or release instances according to application workload. -* **Ports**: Ray processes communicate via TCP ports. When starting a Ray cluster, either on prem or on the cloud, it is important to open the right ports so that Ray functions correctly. +* **Ports**: Ray processes communicate via TCP ports. When starting a Ray cluster, either on prem or on the cloud, it is important to open the right ports so that Ray functions correctly. See :ref:`the Ray Ports documentation ` for more details. * **Ray Cluster Launcher**: The :ref:`Ray Cluster Launcher ` is a simple tool that automatically provisions machines and launches a multi-node Ray cluster. You can use the cluster launcher on GCP, Amazon EC2, Azure, or even Kubernetes. -Starting a Ray cluster ----------------------- +Summary +------- -Clusters can be started with the :ref:`Cluster Launcher ` or :ref:`manually `. You can also create a Ray cluster using a standard cluster manager such as :ref:`Kubernetes `, :ref:`YARN `, or :ref:`SLURM `. +Clusters are started with the :ref:`Ray Cluster Launcher ` or :ref:`manually `. + +You can also create a Ray cluster using a standard cluster manager such as :ref:`Kubernetes `, :ref:`YARN `, or :ref:`SLURM `. + +After a cluster is started, you need to connect your program to the Ray cluster. + +.. tabs:: + .. group-tab:: python + + You can connect to this Ray runtime by starting a Python process that calls the following on the same node as where you ran ``ray start``: + + .. code-block:: python + + # This must + import ray + ray.init(address='auto') + + .. group-tab:: java + + If you want to run Java code, you need to specify the classpath via the ``--code-search-path`` option. See :ref:`code_search_path` for more details. + + .. code-block:: bash + + $ ray start ... --code-search-path=/path/to/jars + +and then the rest of your script should be able to leverage Ray as a distributed framework! + + +Using the cluster launcher +-------------------------- + +The ``ray up`` command uses the :ref:`Ray Cluster Launcher ` to start a cluster on the cloud, creating a designated "head node" and worker nodes. Any Python process that runs ``ray.init(address=...)`` on any of the cluster nodes will connect to the ray cluster. + +.. important:: Calling ``ray.init`` on your laptop will not work if using ``ray up``, since your laptop will not be the head node. Here is an example of using the Cluster Launcher on AWS: @@ -29,42 +62,20 @@ Here is an example of using the Cluster Launcher on AWS: # out the command that can be used to SSH into the cluster head node. $ ray up ray/python/ray/autoscaler/aws/example-full.yaml -Running a Ray program on the Ray cluster ----------------------------------------- - -To run a distributed Ray program, you'll need to execute your program on the same machine as one of the nodes. For example, start up Python on one of the nodes in the cluster: - -.. code-block:: python - - import ray - ray.init(address="auto") - -.. note:: A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes. - -To verify that the correct number of nodes have joined the cluster, you can run the following. - -.. code-block:: python - - import time - - @ray.remote - def f(): - time.sleep(0.01) - return ray.services.get_node_ip_address() - - # Get a list of the IP addresses of the nodes that have joined the cluster. - set(ray.get([f.remote() for _ in range(1000)])) +You can monitor the Ray cluster status with ``ray monitor cluster.yaml`` and ssh into the head node with ``ray attach cluster.yaml``. .. _manual-cluster: Manual Ray Cluster Setup ------------------------ -The most preferable way to run a Ray cluster is via the Ray Cluster Launcher. However, it is also possible to start a Ray cluster by hand. +The most preferable way to run a Ray cluster is via the :ref:`Ray Cluster Launcher `. However, it is also possible to start a Ray cluster by hand. This section assumes that you have a list of machines and that the nodes in the cluster can communicate with each other. It also assumes that Ray is installed on each machine. To install Ray, follow the `installation instructions`_. +To configure the Ray cluster to run Java code, you need to add the ``--code-search-path`` option. See :ref:`code_search_path` for more details. + .. _`installation instructions`: http://docs.ray.io/en/latest/installation.html Starting Ray on each machine @@ -92,14 +103,56 @@ should look something like ``123.45.67.89:6379``). If you wish to specify that a machine has 10 CPUs and 1 GPU, you can do this with the flags ``--num-cpus=10`` and ``--num-gpus=1``. See the :ref:`Configuration ` page for more information. -Now we've started all of the Ray processes on each node Ray. This includes - -- Some worker processes on each machine. -- An object store on each machine. -- A raylet on each machine. -- Multiple Redis servers (on the head node). +Now we've started the Ray runtime. Stopping Ray ~~~~~~~~~~~~ When you want to stop the Ray processes, run ``ray stop`` on each node. + +.. _using-ray-on-a-cluster: + +Running a Ray program on the Ray cluster +---------------------------------------- + +To run a distributed Ray program, you'll need to execute your program on the same machine as one of the nodes. + +.. tabs:: + .. group-tab:: Python + + Within your program/script, you must call ``ray.init`` and add the ``address`` parameter to ``ray.init`` (like ``ray.init(address=...)``). This causes Ray to connect to the existing cluster. For example: + + .. code-block:: python + + ray.init(address="auto") + + .. group-tab:: Java + + You need to add the ``ray.address`` parameter to your command line (like ``-Dray.address=...``). + + To connect your program to the Ray cluster, run it like this: + + .. code-block:: bash + + java -classpath /path/to/jars/ \ + -Dray.address=
\ + + + .. note:: Specifying ``auto`` as the address hasn't been implemented in Java yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command. + + +.. note:: A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes. + +To verify that the correct number of nodes have joined the cluster, you can run the following. + +.. code-block:: python + + import time + + @ray.remote + def f(): + time.sleep(0.01) + return ray.services.get_node_ip_address() + + # Get a list of the IP addresses of the nodes that have joined the cluster. + set(ray.get([f.remote() for _ in range(1000)])) diff --git a/doc/source/cluster/launcher.rst b/doc/source/cluster/launcher.rst index 947b08ccd..2a6c6d4a0 100644 --- a/doc/source/cluster/launcher.rst +++ b/doc/source/cluster/launcher.rst @@ -135,13 +135,18 @@ If you want to run applications on the cluster that are accessible from a web br Running Ray scripts on the cluster (``ray submit``) --------------------------------------------------- -You can also use ``ray submit`` to execute Python scripts on clusters. This will ``rsync`` the designated file onto the cluster and execute it with the given arguments. See :ref:`the documentation ` for ``ray submit``. +You can also use ``ray submit`` to execute Python scripts on clusters. This will ``rsync`` the designated file onto the head node cluster and execute it with the given arguments. See :ref:`the documentation ` for ``ray submit``. .. code-block:: shell # Run a Python script in a detached tmux session $ ray submit cluster.yaml --tmux --start --stop tune_experiment.py + # Run a Python script with arguments. + # This executes script.py on the head node of the cluster, using + # the command: python ~/script.py --arg1 --arg2 --arg3 + $ ray submit cluster.yaml script.py -- --arg1 --arg2 --arg3 + Attaching to a running cluster (``ray attach``) ----------------------------------------------- diff --git a/doc/source/cluster/slurm.rst b/doc/source/cluster/slurm.rst index 9be636332..9f033c165 100644 --- a/doc/source/cluster/slurm.rst +++ b/doc/source/cluster/slurm.rst @@ -57,6 +57,12 @@ Starter SLURM script # the worker will not be able to connect to redis. In case of longer delay, # adjust the sleeptime above to ensure proper order. + # Now we execute worker_num worker nodes on all nodes in the allocation except hostname by + # specifying --nodes=${worker_num} and --exclude=`hostname`. Use 1 task per node, so worker_num tasks in total + # (--ntasks=${worker_num}) and 5 CPUs per task (--cps-per-task=${SLURM_CPUS_PER_TASK}). + srun --nodes=${worker_num} --ntasks=${worker_num} --cpus-per-task=${SLURM_CPUS_PER_TASK} --exclude=`hostname` ray start --address $ip_head --block --num-cpus ${SLURM_CPUS_PER_TASK} & + sleep 5 + for (( i=1; i<=$worker_num; i++ )) do node2=${nodes_array[$i]} diff --git a/doc/source/configure.rst b/doc/source/configure.rst index dacc19ec0..6de549d95 100644 --- a/doc/source/configure.rst +++ b/doc/source/configure.rst @@ -115,6 +115,8 @@ start a new worker with the given *root temporary directory*. ├── plasma_store └── raylet # this could be deleted by Ray's shutdown cleanup. +.. _ray-ports: + Ports configurations -------------------- Ray requires bi-directional communication among its nodes in a cluster. Each of node is supposed to open specific ports to receive incoming network requests. @@ -130,7 +132,7 @@ The following options specify the range of ports used by worker processes across - ``--max-worker-port``: Maximum port number worker can be bound to. Default: 10999. Head Node -~~~~~~~~~~~ +~~~~~~~~~ In addition to ports specified above, the head node needs to open several more ports. - ``--port``: Port of GCS. Default: 6379. diff --git a/doc/source/installation.rst b/doc/source/installation.rst index ddf0e059b..7e603ee22 100644 --- a/doc/source/installation.rst +++ b/doc/source/installation.rst @@ -1,3 +1,5 @@ +.. _installation: + Installing Ray ============== @@ -75,7 +77,64 @@ For example, here are the Ray 1.0.0 wheels for Python 3.5, MacOS for commit ``a0 .. code-block:: bash - pip install https://ray-wheels.s3-us-west-2.amazonaws.com/master/a0ba4499ac645c9d3e82e68f3a281e48ad57f873/ray-1.0.0-cp35-cp35m-macosx_10_13_intel.whl + pip install https://ray-wheels.s3-us-west-2.amazonaws.com/master/a0ba4499ac645c9d3e82e68f3a281e48ad57f873/ray-1.1.0.dev0-cp35-cp35m-macosx_10_13_intel.whl + +.. _ray-install-java: + +Install Ray With Maven +---------------------- + +The latest Ray Java release can be found in `central repository `__. To use the latest Ray Java release in your application, add the following entries in your ``pom.xml``: + +.. code-block:: xml + + + io.ray + ray-api + ${ray.version} + + + io.ray + ray-runtime + ${ray.version} + + +The latest Ray Java snapshot can be found in `sonatype repository `__. To use the latest Ray Java snapshot in your application, add the following entries in your ``pom.xml``: + +.. code-block:: xml + + + + + sonatype + https://oss.sonatype.org/content/repositories/snapshots/ + + false + + + true + + + + + + + io.ray + ray-api + ${ray.version} + + + io.ray + ray-runtime + ${ray.version} + + + +.. note:: + + When you run ``pip install`` to install Ray, Java jars are installed as well. The above dependencies are only used to build your Java code and to run your code in local or single machine mode. + + If you want to run your Java code in a multi-node Ray cluster, it's better to exclude Ray jars when packaging your code to avoid jar conficts if the versions (installed Ray with ``pip install`` and maven dependencies) don't match. .. _windows-support: diff --git a/doc/source/starting-ray.rst b/doc/source/starting-ray.rst index 372d3f7a8..c94c22f24 100644 --- a/doc/source/starting-ray.rst +++ b/doc/source/starting-ray.rst @@ -3,45 +3,29 @@ Starting Ray This page covers how to start Ray on your single machine or cluster of machines. -.. contents:: :local: +.. tip:: Be sure to have :ref:`installed Ray ` before following the instructions on this page. -Installation ------------- -Install Ray with ``pip install -U ray``. For the latest wheels (a snapshot of the ``master`` branch), you can use the instructions at :ref:`install-nightlies`. +What is the Ray runtime? +------------------------ -.. note:: This step is not required if you are writing a Ray application in Java and you don't have the need of running your Java application in a multi-node Ray cluster at the development stage. See `Local mode`_ for more details. +Ray programs are able to parallelize and distribute by leveraging an underlying *Ray runtime*. +The Ray runtime consists of multiple services/processes started in the background for communication, data transfer, scheduling, and more. The Ray runtime can be started on a laptop, a single server, or multiple servers. -Build your Java code --------------------- +There are three ways of starting the Ray runtime: -If your application is written in Java, you need to add Ray dependencies to your project in order to build it. +* Implicitly via ``ray.init()`` (:ref:`start-ray-init`) +* Explicitly via CLI (:ref:`start-ray-cli`) +* Explicitly via the cluster launcher (:ref:`start-ray-up`) -.. code-block:: xml - - - - io.ray - ray-api - ... - - - io.ray - ray-runtime - ... - - - -.. note:: - - When you run ``pip install`` to install Ray, Java jars are installed as well. The above dependencies are only used to build your Java code and to run your code in local or single machine mode. - - If you want to run your Java code in a multi-node Ray cluster, it's better to exclude Ray jars when packaging your code to avoid jar conficts if the versions (installed Ray with ``pip install`` and maven dependencies) don't match. +.. _start-ray-init: Starting Ray on a single machine -------------------------------- -You can start Ray with the ``init`` API (see the code snippet below). It will start the local services that Ray uses to schedule remote tasks and actors and then connect to them. Note that you must initialize Ray before any tasks or actors are called. +Calling ``ray.init()`` (without any ``address`` args) starts a Ray runtime on your laptop/machine. This laptop/machine becomes the "head node". + +You must initialize Ray before any tasks or actors are called. .. tabs:: .. code-tab:: python @@ -63,7 +47,7 @@ You can start Ray with the ``init`` API (see the code snippet below). It will st } } -To stop or restart Ray, use the shutdown API. +When the process calling ``ray.init()`` terminates, the Ray runtime will also terminate. To explicitly stop or restart Ray, use the shutdown API. .. tabs:: .. code-tab:: python @@ -120,79 +104,67 @@ To stop or restart Ray, use the shutdown API. See the `Configuration `__ documentation for the various ways to configure Ray. -.. _using-ray-on-a-cluster: +.. _start-ray-cli: -Using Ray on a cluster ----------------------- +Starting Ray via the CLI (``ray start``) +---------------------------------------- -There are two steps needed to use Ray in a distributed setting: - - 1. You must first start the Ray cluster. - - If you have a Ray cluster specification (:ref:`ref-automatic-cluster`), you can launch a multi-node cluster with Ray initialized on each node with ``ray up``. **From your local machine/laptop**: - - .. code-block:: bash - - ray up cluster.yaml - - To configure the Ray cluster to run Java code, you need to add the ``--code-search-path`` option. See :ref:`code_search_path` for more details. - - You can monitor the Ray cluster status with ``ray monitor cluster.yaml`` and ssh into the head node with ``ray attach cluster.yaml``. - - 2. Specify the address of the Ray cluster when initializing Ray in your code. This causes Ray to connect to the existing cluster instead of starting a new one on the local node. - - .. tabs:: - .. group-tab:: Python - - You need to add the ``address`` parameter to ``ray.init`` (like ``ray.init(address=...)``). To connect your program to the Ray cluster, add the following to your Python script: - - .. code-block:: python - - ray.init(address="auto") - - .. group-tab:: Java - - You need to add the ``ray.address`` parameter to your command line (like ``-Dray.address=...``). - - To connect your program to the Ray cluster, run it like this: - - .. code-block:: bash - - java -classpath /path/to/jars/ \ - -Dray.address=
\ - - - .. note:: Specifying ``auto`` as the address hasn't been implemented in Java yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command. - - Your driver code **only** needs to execute on one machine in the cluster (usually the head node). - - .. note:: Without the address parameter, your Ray program will only be parallelized across a single machine! - -Manual cluster setup -~~~~~~~~~~~~~~~~~~~~ - -You can also use the manual cluster setup (:ref:`ref-cluster-setup`) by running initialization commands on each node. - -**On the head node**: +Use ``ray start`` from the CLI to start a 1 node ray runtime on a machine. This machine becomes the "head node". .. code-block:: bash - # If the ``--redis-port`` argument is omitted, Ray will choose a port at random. - $ ray start --head --redis-port=6379 + $ ray start --head --port=6379 -The command will print out the address of the Redis server that was started (and some other address information). + Local node IP: 192.123.1.123 + 2020-09-20 10:38:54,193 INFO services.py:1166 -- View the Ray dashboard at http://localhost:8265 -**Then on all of the other nodes**, run the following. Make sure to replace ``
`` with the value printed by the command on the head node (it should look something like ``123.45.67.89:6379``). + -------------------- + Ray runtime started. + -------------------- -.. code-block:: bash + ... - $ ray start --address=
-If you want to run Java code, you need to specify the classpath via the ``--code-search-path`` option. See :ref:`code_search_path` for more details. +.. tabs:: + .. group-tab:: python -.. code-block:: bash + You can connect to this Ray runtime by starting a Python process that calls the following on the same node as where you ran ``ray start``: - $ ray start ... --code-search-path=/path/to/jars + .. code-block:: python + + # This must + import ray + ray.init(address='auto') + + .. group-tab:: java + + + If you want to run Java code, you need to specify the classpath via the ``--code-search-path`` option. See :ref:`code_search_path` for more details. + + .. code-block:: bash + + $ ray start ... --code-search-path=/path/to/jars + + +You can connect other nodes to the head node, creating a Ray cluster by also calling ``ray start`` on those nodes. See :ref:`manual-cluster` for more details. Calling ``ray.init(address="auto")`` on any of the cluster machines will connect to the ray cluster. + +.. _start-ray-up: + +Launching a Ray cluster (``ray up``) +------------------------------------ + +Ray clusters can be launched with the :ref:`Cluster Launcher `. +The ``ray up`` command uses the Ray cluster launcher to start a cluster on the cloud, creating a designated "head node" and worker nodes. Underneath the hood, it automatically calls ``ray start`` to create a Ray cluster. + +Your code **only** needs to execute on one machine in the cluster (usually the head node). Read more about :ref:`running programs on a Ray cluster `. + +To connect to the existing cluster, similar to the method outlined in :ref:`start-ray-cli`, you must call ``ray.init`` and specify the address of the Ray cluster when initializing Ray in your code. This allows Ray to connect to the cluster. + +.. code-block:: python + + ray.init(address="auto") + +Note that the machine calling ``ray up`` will not be considered as part of the Ray cluster, and therefore calling ``ray.init`` on that same machine will not attach to the cluster. .. _local_mode: @@ -217,9 +189,10 @@ By default, Ray will parallelize its workload and run tasks on multiple processe -Dray.local-mode=true \ + .. note:: If you just want to run your Java code in local mode, you can run it without Ray or even Python installed. + Note that there are some known issues with local mode. Please read :ref:`these tips ` for more information. -.. note:: If you just want to run your Java code in local mode, you can run it without Ray or even Python installed. What's next? ------------ diff --git a/doc/source/tune/_tutorials/_faq.rst b/doc/source/tune/_tutorials/_faq.rst index 96ff28590..8fcb05ace 100644 --- a/doc/source/tune/_tutorials/_faq.rst +++ b/doc/source/tune/_tutorials/_faq.rst @@ -183,6 +183,7 @@ additional outputs: called * ``trial_id``: Unique trial ID + How do I set resources? ~~~~~~~~~~~~~~~~~~~~~~~ If you want to allocate specific resources to a trial, you can use the diff --git a/doc/source/tune/_tutorials/overview.rst b/doc/source/tune/_tutorials/overview.rst index 9dac8b5a6..5aa6360be 100644 --- a/doc/source/tune/_tutorials/overview.rst +++ b/doc/source/tune/_tutorials/overview.rst @@ -24,6 +24,11 @@ Take a look at any of the below tutorials to get started with Tune. :figure: /images/tune.png :description: :doc:`A walkthrough to setup your first Tune experiment ` +.. customgalleryitem:: + :tooltip: A deep dive into Tune's workings. + :figure: /images/tune.png + :description: :doc:`How does Tune work? ` + .. customgalleryitem:: :tooltip: A simple guide to Population-based Training :figure: /images/tune-pbt-small.png @@ -76,6 +81,7 @@ Take a look at any of the below tutorials to get started with Tune. tune-tutorial.rst tune-advanced-tutorial.rst + tune-lifecycle.rst tune-distributed.rst tune-sklearn.rst tune-pytorch-cifar.rst diff --git a/doc/source/tune/_tutorials/tune-distributed.rst b/doc/source/tune/_tutorials/tune-distributed.rst index 97e57321c..498576e5b 100644 --- a/doc/source/tune/_tutorials/tune-distributed.rst +++ b/doc/source/tune/_tutorials/tune-distributed.rst @@ -3,59 +3,21 @@ Tune Distributed Experiments ============================ -Tune is commonly used for large-scale distributed hyperparameter optimization. This page will overview: - - 1. How to setup and launch a distributed experiment, - 2. :ref:`Commonly used commands `, including fast file mounting, one-line cluster launching, and result uploading to cloud storage. - -**Quick Summary**: To run a distributed experiment with Tune, you need to: - - 1. Make sure your script has ``ray.init(address=...)`` to connect to the existing Ray cluster. - 2. If a ray cluster does not exist, start a Ray cluster. - 3. Run the script on the head node (or use ``ray submit``). +Tune is commonly used for large-scale distributed hyperparameter optimization. This page will overview how to setup and launch a distributed experiment along with :ref:`commonly used commands ` for Tune when running distributed experiments. .. contents:: :local: :backlinks: none -Running a distributed experiment --------------------------------- +Summary +------- -Running a distributed (multi-node) experiment requires Ray to be started already. You can do this on local machines or on the cloud. +To run a distributed experiment with Tune, you need to: -Across your machines, Tune will automatically detect the number of GPUs and CPUs without you needing to manage ``CUDA_VISIBLE_DEVICES``. +1. First, :ref:`start a Ray cluster ` if you have not already. +2. Specify ``ray.init(address=...)`` in your script :ref:`to connect to the existing Ray cluster `. +3. Run the script on the head node (or use :ref:`ray submit `). -To execute a distributed experiment, call ``ray.init(address=XXX)`` before ``tune.run``, where ``XXX`` is the Ray redis address, which defaults to ``localhost:6379``. The Tune python script should be executed only on the head node of the Ray cluster. - -One common approach to modifying an existing Tune experiment to go distributed is to set an ``argparse`` variable so that toggling between distributed and single-node is seamless. - -.. code-block:: python - - import ray - import argparse - - parser = argparse.ArgumentParser() - parser.add_argument("--address") - args = parser.parse_args() - ray.init(address=args.address) - - tune.run(...) - -.. code-block:: bash - - # On the head node, connect to an existing ray cluster - $ python tune_script.py --ray-address=localhost:XXXX - -If you used a cluster configuration (starting a cluster with ``ray up`` or ``ray submit --start``), use: - -.. code-block:: bash - - ray submit tune-default.yaml tune_script.py -- --ray-address=localhost:6379 - -.. tip:: - - 1. In the examples, the Ray redis address commonly used is ``localhost:6379``. - 2. If the Ray cluster is already started, you should not need to run anything on the worker nodes. .. _tune-distributed-local: @@ -84,28 +46,6 @@ Manual Local Cluster Setup If you run into issues using the local cluster setup (or want to add nodes manually), you can use :ref:`the manual cluster setup `. At a glance, -**On the head node**: - -.. code-block:: bash - - # If the ``--redis-port`` argument is omitted, Ray will choose a port at random. - $ ray start --head --redis-port=6379 - -The command will print out the address of the Redis server that was started (and some other address information). - -**Then on all of the other nodes**, run the following. Make sure to replace ``
`` with the value printed by the command on the head node (it should look something like ``123.45.67.89:6379``). - -.. code-block:: bash - - $ ray start --address=
- -Then, you can run your Tune Python script on the head node like: - -.. code-block:: bash - - # On the head node, execute using existing ray cluster - $ python tune_script.py --ray-address=
- .. tune-distributed-cloud: Launching a cloud cluster @@ -147,6 +87,46 @@ Analyze your results on TensorBoard by starting TensorBoard on the remote head m Note that you can customize the directory of results by running: ``tune.run(local_dir=..)``. You can then point TensorBoard to that directory to visualize results. You can also use `awless `_ for easy cluster management on AWS. + +Running a distributed experiment +-------------------------------- + +Running a distributed (multi-node) experiment requires Ray to be started already. You can do this on local machines or on the cloud. + +Across your machines, Tune will automatically detect the number of GPUs and CPUs without you needing to manage ``CUDA_VISIBLE_DEVICES``. + +To execute a distributed experiment, call ``ray.init(address=XXX)`` before ``tune.run``, where ``XXX`` is the Ray redis address, which defaults to ``localhost:6379``. The Tune python script should be executed only on the head node of the Ray cluster. + +One common approach to modifying an existing Tune experiment to go distributed is to set an ``argparse`` variable so that toggling between distributed and single-node is seamless. + +.. code-block:: python + + import ray + import argparse + + parser = argparse.ArgumentParser() + parser.add_argument("--address") + args = parser.parse_args() + ray.init(address=args.address) + + tune.run(...) + +.. code-block:: bash + + # On the head node, connect to an existing ray cluster + $ python tune_script.py --ray-address=localhost:XXXX + +If you used a cluster configuration (starting a cluster with ``ray up`` or ``ray submit --start``), use: + +.. code-block:: bash + + ray submit tune-default.yaml tune_script.py -- --ray-address=localhost:6379 + +.. tip:: + + 1. In the examples, the Ray redis address commonly used is ``localhost:6379``. + 2. If the Ray cluster is already started, you should not need to run anything on the worker nodes. + Syncing ------- @@ -281,10 +261,10 @@ Tune automatically persists the progress of your entire experiment (a ``tune.run **Settings:** - - The default setting of ``resume=False`` creates a new experiment. - - ``resume="LOCAL"`` and ``resume=True`` restore the experiment from ``local_dir/[experiment_name]``. - - ``resume="REMOTE"`` syncs the upload dir down to the local dir and then restores the experiment from ``local_dir/experiment_name``. - - ``resume="PROMPT"`` will cause Tune to prompt you for whether you want to resume. You can always force a new experiment to be created by changing the experiment name. +- The default setting of ``resume=False`` creates a new experiment. +- ``resume="LOCAL"`` and ``resume=True`` restore the experiment from ``local_dir/[experiment_name]``. +- ``resume="REMOTE"`` syncs the upload dir down to the local dir and then restores the experiment from ``local_dir/experiment_name``. +- ``resume="PROMPT"`` will cause Tune to prompt you for whether you want to resume. You can always force a new experiment to be created by changing the experiment name. Note that trials will be restored to their last checkpoint. If trial checkpointing is not enabled, unfinished trials will be restarted from scratch. diff --git a/doc/source/tune/_tutorials/tune-lifecycle.rst b/doc/source/tune/_tutorials/tune-lifecycle.rst new file mode 100644 index 000000000..a721106ac --- /dev/null +++ b/doc/source/tune/_tutorials/tune-lifecycle.rst @@ -0,0 +1,94 @@ +.. _tune-lifecycle: + +How does Tune work? +=================== + +This page provides an overview of Tune's inner workings. + +.. tip:: Before you continue, be sure to have read :ref:`the Tune Key Concepts page `. + +Definitions +----------- + +**Trainable** + +The function passed to tune.run. + +**Trial** + +Logical representation of a single hyperparameter configuration. Each trial is associated with an instance of a Trainable. + +**Driver/worker process** + +The driver process is the python process that calls ``tune.run`` (which calls ``ray.init()`` underneath the hood). + +The Tune driver process runs on the node where you run your script (which calls ``tune.run``), while Ray Tune trainable "actors" run on any node (either on the same node or on worker nodes (distributed Ray only)). + +**Ray Actors** + +Tune uses Ray Actors as worker processes to evaluate multiple Trainables in parallel. + +:ref:`Ray Actors ` allow you to parallelize an instance of a class in Python. When you instantiate a class that is a Ray actor, Ray will start a instance of that class on a separate process either on the same machine (or another distributed machine, if running a Ray cluster). This actor can then asynchronously execute method calls and maintain its own internal state. + +What happens in ``tune.run``? +----------------------------- + +When calling the following: + +.. code-block:: python + + space = {"x": tune.uniform(0, 1)} + tune.run(my_trainable, config=space, num_samples=10) + +The provided function/trainable is evaluated multiple times in parallel with different hyperparameters (sampled from ``uniform(0, 1)``). + +Every Tune run consists of "driver process" and many "worker processes". As mentioned in the Definitions section, the driver process (Tune Driver) is the python process in which you call ``tune.run``. + +The driver spawns parallel worker processes (:ref:`Ray actors `) +that are responsible for evaluating each trial using its hyperparameter configuration and the provided trainable (see the `trial executor source code `__). + +While the Trainable is executing (:ref:`trainable-execution`), the Tune Driver communicates with each actor via actor methods to receive intermediate training results and pause/stop actors (see :ref:`trial-lifecycle`). + +When the Trainable terminates (or is stopped), the actor is also terminated. + +.. _trainable-execution: + +The execution of a trainable +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Tune uses :ref:`Ray actors ` to parallelize the evaluation of multiple hyperparameter configurations. Each actor is a Python process that executes an instance of the user-provided Trainable. + +The definition of the user-provided Trainable will be :ref:`serialized via cloudpickle `) and sent to each actor process. Each Ray actor will start an instance of the Trainable to be executed. + +If the Trainable is a class, it will be executed iteratively by calling ``train/step``. After each invocation, the driver is notified that a "result dict" is ready. The driver will then pull the result via ``ray.get``. + +If the trainable is a callable or a function, it will be executed on the Ray actor process on a separate execution thread. Whenever ``tune.report`` is called, the execution thread is paused and waits for the driver to pull a result (see `function_runner.py `__. After pulling, the actor’s execution thread will automatically resume. + + +Resource Management in Tune +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Before running a trial, the Tune Driver will check whether there are available resources on the cluster (see :ref:`resource-requirements`). It will compare the available resources with the resources required by the trial. + +If there is space on the cluster, then the Tune Driver will start a Ray actor (worker). This actor will be scheduled and executed on some node where the resources are available. + +See :ref:`tune-parallelism` for more information. + +.. _trial-lifecycle: + +Lifecycle of a trial +-------------------- + +A trial's life cycle consists of 6 stages: + +* **Initialization** (generation): A trial is first generated as a hyperparameter sample, and its parameters are configured according to what was provided in tune.run. Trials are then placed into a queue to be executed (with status PENDING). + +* **PENDING**: A pending trial is a trial to be executed on the machine. Every trial is configured with resource values. Whenever the trial’s resource values are available, tune will run the trial (by starting a ray actor holding the config and the training function. + +* **RUNNING**: A running trial is assigned a Ray Actor. There can be multiple running trials in parallel. See the :ref:`trainable execution ` section for more details. + +* **ERRORED**: If a running trial throws an exception, Tune will catch that exception and mark the trial as errored. Note that exceptions can be propagated from an actor to the main Tune driver process. If max_retries is set, Tune will set the trial back into "PENDING" and later start it from the last checkpoint. + +* **TERMINATED**: A trial is terminated if it is stopped by a Stopper/Scheduler. If using the Function API, the trial is also terminated when the function stops. + +* **PAUSED**: A trial can be paused by a Trial scheduler. This means that the trial’s actor will be stopped. A paused trial can later be resumed from the most recent checkpoint. diff --git a/doc/source/tune/_tutorials/tune-sklearn.py b/doc/source/tune/_tutorials/tune-sklearn.py index c21f0ff5a..fb38c7597 100644 --- a/doc/source/tune/_tutorials/tune-sklearn.py +++ b/doc/source/tune/_tutorials/tune-sklearn.py @@ -145,11 +145,11 @@ print(tune_search.best_params_) # # Check out more detailed examples and get started with tune-sklearn! # -# * `Skorch with tune-sklearn `_ -# * `Scikit-Learn Pipelines with tune-sklearn `_ -# * `XGBoost with tune-sklearn `_ -# * `KerasClassifier with tune-sklearn `_ -# * `LightGBM with tune-sklearn `_ +# * `Skorch with tune-sklearn `_ +# * `Scikit-Learn Pipelines with tune-sklearn `_ +# * `XGBoost with tune-sklearn `_ +# * `KerasClassifier with tune-sklearn `_ +# * `LightGBM with tune-sklearn `_ # # # Further Reading diff --git a/doc/source/tune/index.rst b/doc/source/tune/index.rst index 7bc5574ac..45a0fa5f7 100644 --- a/doc/source/tune/index.rst +++ b/doc/source/tune/index.rst @@ -7,11 +7,11 @@ Tune: Scalable Hyperparameter Tuning Tune is a Python library for experiment execution and hyperparameter tuning at any scale. Core features: - * Launch a multi-node :ref:`distributed hyperparameter sweep ` in less than 10 lines of code. - * Supports any machine learning framework, :ref:`including PyTorch, XGBoost, MXNet, and Keras `. - * Automatically manages :ref:`checkpoints ` and logging to :ref:`TensorBoard `. - * Choose among state of the art algorithms such as :ref:`Population Based Training (PBT) `, :ref:`BayesOptSearch `, :ref:`HyperBand/ASHA `. - * Move your models from training to serving on the same infrastructure with `Ray Serve`_. +* Launch a multi-node :ref:`distributed hyperparameter sweep ` in less than 10 lines of code. +* Supports any machine learning framework, :ref:`including PyTorch, XGBoost, MXNet, and Keras `. +* Automatically manages :ref:`checkpoints ` and logging to :ref:`TensorBoard `. +* Choose among state of the art algorithms such as :ref:`Population Based Training (PBT) `, :ref:`BayesOptSearch `, :ref:`HyperBand/ASHA `. +* Move your models from training to serving on the same infrastructure with `Ray Serve`_. .. _`Ray Serve`: serve/index.html @@ -94,18 +94,18 @@ Reference Materials Here are some reference materials for Tune: - * :doc:`/tune/user-guide` - * :ref:`Frequently asked questions ` - * `Code `__: GitHub repository for Tune +* :doc:`/tune/user-guide` +* :ref:`Frequently asked questions ` +* `Code `__: GitHub repository for Tune Below are some blog posts and talks about Tune: - - [blog] `Tune: a Python library for fast hyperparameter tuning at any scale `_ - - [blog] `Cutting edge hyperparameter tuning with Ray Tune `_ - - [blog] `Simple hyperparameter and architecture search in tensorflow with Ray Tune `_ - - [slides] `Talk given at RISECamp 2019 `_ - - [video] `Talk given at RISECamp 2018 `_ - - [video] `A Guide to Modern Hyperparameter Optimization (PyData LA 2019) `_ (`slides `_) +- [blog] `Tune: a Python library for fast hyperparameter tuning at any scale `_ +- [blog] `Cutting edge hyperparameter tuning with Ray Tune `_ +- [blog] `Simple hyperparameter and architecture search in tensorflow with Ray Tune `_ +- [slides] `Talk given at RISECamp 2019 `_ +- [video] `Talk given at RISECamp 2018 `_ +- [video] `A Guide to Modern Hyperparameter Optimization (PyData LA 2019) `_ (`slides `_) Citing Tune ----------- diff --git a/doc/source/tune/key-concepts.rst b/doc/source/tune/key-concepts.rst index 213d680a0..958daae9c 100644 --- a/doc/source/tune/key-concepts.rst +++ b/doc/source/tune/key-concepts.rst @@ -58,8 +58,8 @@ The other is a :ref:`class-based API `. Here's an example of spe See the documentation: :ref:`trainable-docs` and :ref:`examples `. -tune.run --------- +tune.run and Trials +------------------- Use ``tune.run`` execute hyperparameter tuning using the core Ray APIs. This function manages your experiment and provides many features such as :ref:`logging `, :ref:`checkpointing `, and :ref:`early stopping `. @@ -68,7 +68,11 @@ Use ``tune.run`` execute hyperparameter tuning using the core Ray APIs. This fun # Pass in a Trainable class or function to tune.run. tune.run(trainable) -This function will report status on the command line until all trials stop (each trial is one instance of a :ref:`Trainable `): +``tune.run`` will generate a couple hyperparameter configurations from its arguments, and each hyperparameter configuration is logically represented by a Trial object. + +Each trial has a resource specification (``resources_per_trial`` or ``trial.resources``), a hyperparameter configuration (``trial.config``), id (``trial.trial_id``), among other configuration values. Each trial is also associated with one instance of a :ref:`Trainable `. You can access trial objects through the :ref:`Analysis object ` provided after ``tune.run`` finishes. + +``tune.run`` will execute until all trials stop or error: .. code-block:: bash @@ -210,6 +214,8 @@ Unlike **Search Algorithms**, :ref:`Trial Scheduler ` do not se See the documentation: :ref:`schedulers-ref`. +.. _tune-concepts-analysis: + Analysis -------- @@ -242,10 +248,10 @@ What's Next? Now that you have a working understanding of Tune, check out: - * :doc:`/tune/user-guide`: A comprehensive overview of Tune's features. - * :ref:`tune-guides`: Tutorials for using Tune with your preferred machine learning library. - * :doc:`/tune/examples/index`: End-to-end examples and templates for using Tune with your preferred machine learning library. - * :ref:`tune-tutorial`: A simple tutorial that walks you through the process of setting up a Tune experiment. +* :doc:`/tune/user-guide`: A comprehensive overview of Tune's features. +* :ref:`tune-guides`: Tutorials for using Tune with your preferred machine learning library. +* :doc:`/tune/examples/index`: End-to-end examples and templates for using Tune with your preferred machine learning library. +* :ref:`tune-tutorial`: A simple tutorial that walks you through the process of setting up a Tune experiment. Further Questions or Issues? diff --git a/doc/source/tune/user-guide.rst b/doc/source/tune/user-guide.rst index 0050920db..375d8a79f 100644 --- a/doc/source/tune/user-guide.rst +++ b/doc/source/tune/user-guide.rst @@ -56,7 +56,7 @@ You can find an example of this in the :doc:`Keras MNIST example ) tune.run(trainable, num_samples=100, resources_per_trial={"cpu": 2, "gpu": 1}) + + .. _tune-default-search-space: Search Space (Grid/Random) diff --git a/doc/source/walkthrough.rst b/doc/source/walkthrough.rst index f45a4b03f..f6cb6266d 100644 --- a/doc/source/walkthrough.rst +++ b/doc/source/walkthrough.rst @@ -179,6 +179,7 @@ Note the following behaviors: first task (the value corresponding to ``obj_ref1/objRef1``) will be sent over the network to the machine where the second task is scheduled. +.. _resource-requirements: Specifying required resources ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~