[docs] new Ray Cluster documentation (#13839)

Co-authored-by: Javier Redondo <javier@anyscale.com>
Co-authored-by: AmeerHajAli <ameerh@berkeley.edu>
javi-redondo
2021-02-15 00:47:14 -08:00
committed by GitHub
parent 82539f2da4
commit b8b2d6410d
19 changed files with 1503 additions and 551 deletions
+1 -1
@@ -11,7 +11,7 @@ You can view the `code for this example`_.
.. _`code for this example`: https://github.com/ray-project/ray/tree/master/doc/examples/lm
To use Ray cluster launcher on AWS, install boto (``pip install boto3``) and configure your AWS credentials in ``~/.aws/credentials`` as described on the :ref:`Automatic Cluster Setup page <ref-automatic-cluster>`.
To use Ray cluster launcher on AWS, install boto (``pip install boto3``) and configure your AWS credentials in ``~/.aws/credentials`` as described on the :ref:`Automatic Cluster Setup page <cluster-cloud>`.
We provide an `example config file <https://github.com/ray-project/ray/tree/master/doc/examples/lm/lm-cluster.yaml>`__ (``lm-cluster.yaml``).
In the example config file, we use an ``m5.xlarge`` on-demand instance as the head node, and use ``p3.2xlarge`` GPU spot instances as the worker nodes. We set the minimum number of workers to 1 and the maximum to 2 in the config; adjust these values to suit your own needs.
+1
@@ -25,6 +25,7 @@ sphinx-jsonschema
sphinx-tabs
sphinx-version-warning
sphinx-book-theme
sphinxcontrib.yt
starlette
tabulate
uvicorn
-164
@@ -1,164 +0,0 @@
.. _ref-autoscaling:
Cluster Autoscaling
===================
.. tip:: Before you continue, be sure to have read :ref:`cluster-cloud`.
Basics
------
The Ray Cluster Launcher will automatically enable a load-based autoscaler. The scheduler looks at the task, actor, and placement group resource demands from the cluster and tries to add the minimum set of nodes that can fulfill these demands. When nodes are idle for longer than a timeout, they are removed, down to the ``min_workers`` limit. The head node is never removed.
To avoid launching too many nodes at once, the number of nodes allowed to be pending is limited by the ``upscaling_speed`` setting. By default it is set to ``1.0``, which means the cluster can grow by at most ``100%`` at any time (e.g., if the cluster currently has 20 nodes, at most 20 pending launches are allowed). This fraction can be set as high as needed, e.g., ``99999`` to allow the cluster to quickly grow to its max size.
In more detail, the autoscaler implements the following control loop:
1. It calculates the number of nodes required to satisfy all currently pending task, actor, and placement group requests.
2. If the total number of nodes required divided by the number of current nodes exceeds ``1 + upscaling_speed``, then the number of nodes launched will be limited by that threshold.
3. If a node is idle for a timeout (5 minutes by default), it is removed from the cluster.
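As a rough illustration of step 2, the cap on new launches can be modeled in a few lines of Python. This is a simplified sketch, not the autoscaler's actual implementation: the function name ``launches_allowed`` and its arguments are hypothetical, and details such as per-node-type limits are ignored.

```python
import math


def launches_allowed(current_nodes, required_nodes, upscaling_speed=1.0):
    """Return how many new nodes may be launched in one iteration.

    Hypothetical model: the number of pending launches is capped at
    upscaling_speed * current_nodes, as described in the text.
    """
    # Nodes still missing to satisfy the pending demand.
    wanted = max(0, required_nodes - current_nodes)
    # Upper bound on pending launches imposed by upscaling_speed.
    cap = math.floor(upscaling_speed * current_nodes)
    return min(wanted, cap)
```

With 20 current nodes, 100 required nodes, and an ``upscaling_speed`` of ``1.0``, the sketch allows 20 launches, which matches the 20-node example given earlier in this section.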
The basic autoscaling config settings are as follows:
.. code-block:: yaml
# A unique identifier for the head node and workers of this cluster.
cluster_name: default
# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
min_workers: 0
# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0
# If a node is idle for this many minutes, it will be removed. A node is
# considered idle if there are no tasks or actors running on it.
idle_timeout_minutes: 5
Programmatically Scaling a Cluster
----------------------------------
From within a Ray program, you can command the autoscaler to scale the cluster up to a desired size with a ``request_resources()`` call. The cluster will immediately attempt to scale to accommodate the requested resources, bypassing normal upscaling speed constraints.
.. autofunction:: ray.autoscaler.sdk.request_resources
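A minimal usage sketch follows. It assumes the script runs on a node of an autoscaling cluster and that ``request_resources`` accepts a ``bundles`` list of resource dictionaries; the helper names ``build_bundles`` and ``scale_cluster`` are hypothetical and only for illustration.

```python
def build_bundles(num_gpu_workers, cpus_per_worker):
    """Build one resource bundle (a plain dict) per desired GPU worker."""
    return [{"CPU": cpus_per_worker, "GPU": 1} for _ in range(num_gpu_workers)]


def scale_cluster(bundles):
    """Ask the autoscaler to make room for the given bundles."""
    # Imported lazily so build_bundles() can be used without Ray installed.
    import ray
    from ray.autoscaler.sdk import request_resources

    ray.init(address="auto")  # Connect to the already running cluster.
    request_resources(bundles=bundles)


# Example call (requires a running Ray cluster):
# scale_cluster(build_bundles(num_gpu_workers=2, cpus_per_worker=4))
```

Keeping the bundle construction separate from the Ray call makes it easy to inspect what will be requested before the cluster starts scaling.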
Manually Adding Nodes without Resources (Unmanaged Nodes)
---------------------------------------------------------
In some cases, adding special nodes without any resources (i.e. ``num_cpus=0``) may be desirable. Such nodes can be used as a driver which connects to the cluster to launch jobs.
In order to manually add a node to an autoscaled cluster, the ``ray-cluster-name`` tag should be set and the ``ray-node-type`` tag should be set to ``unmanaged``.
Unmanaged nodes **must have 0 resources**.
If you are using the ``available_node_types`` field, you should create a custom node type with ``resources: {}`` and ``max_workers: 0`` when configuring the autoscaler.
The autoscaler will not attempt to start, stop, or update unmanaged nodes. The user is responsible for properly setting up and cleaning up unmanaged nodes.
Multiple Node Type Autoscaling
------------------------------
Ray supports multiple node types in a single cluster. In this mode of operation, the scheduler will choose the types of nodes to add based on the resource demands, instead of always adding the same kind of node type.
The concept of a cluster node type encompasses both the physical instance type (e.g., AWS p3.8xl GPU nodes vs m4.16xl CPU nodes), as well as other attributes (e.g., IAM role, the machine image, etc). `Custom resources <configure.html>`__ can be specified for each node type so that Ray is aware of the demand for specific node types at the application level (e.g., a task may request to be placed on a machine with a specific role or machine image via custom resource).
An example of configuring multiple node types is as follows `(full example) <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-multi-node-type.yaml>`__:
.. code-block:: yaml
# Specify the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    cpu_4_ondemand:
        node_config:
            InstanceType: m4.xlarge
        # For AWS instances, autoscaler will automatically add the available
        # CPUs/GPUs/accelerator_type ({"CPU": 4} for m4.xlarge) in "resources".
        # resources: {"CPU": 4}
        min_workers: 1
        max_workers: 5
    cpu_16_spot:
        node_config:
            InstanceType: m4.4xlarge
            InstanceMarketOptions:
                MarketType: spot
        # Autoscaler will auto fill the CPU resources below.
        resources: {"Custom1": 1, "is_spot": 1}
        max_workers: 10
    gpu_1_ondemand:
        node_config:
            InstanceType: p2.xlarge
        # Autoscaler will auto fill the CPU/GPU resources below.
        resources: {"Custom2": 2}
        max_workers: 4
        worker_setup_commands:
            - pip install tensorflow-gpu # Example command.
    gpu_8_ondemand:
        node_config:
            InstanceType: p3.8xlarge
        # Autoscaler autofills the "resources" below.
        # resources: {"CPU": 32, "GPU": 4, "accelerator_type:V100": 1}
        max_workers: 2
        worker_setup_commands:
            - pip install tensorflow-gpu # Example command.
# Specify the node type of the head node (as configured above).
head_node_type: cpu_4_ondemand
The above config defines two CPU node types (``cpu_4_ondemand`` and ``cpu_16_spot``), and two GPU types (``gpu_1_ondemand`` and ``gpu_8_ondemand``). Each node type has a name (e.g., ``cpu_4_ondemand``), which has no semantic meaning and is only for debugging. Let's look at the inner fields of the ``gpu_1_ondemand`` node type:
The node config tells the underlying cloud provider how to launch a node of this type. This node config is merged with the top-level node config of the YAML and can override fields (e.g., to specify the p2.xlarge instance type here):
.. code-block:: yaml
node_config:
    InstanceType: p2.xlarge
The resources field tells the autoscaler what kinds of resources this node provides. This can include custom resources as well (e.g., "Custom2"). This field enables the autoscaler to automatically select the right kind of nodes to launch given the resource demands of the application. The resources specified here will be automatically passed to the ``ray start`` command for the node via an environment variable. For more information, see also the `resource demand scheduler <https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/resource_demand_scheduler.py>`__:
.. code-block:: yaml
resources: {"CPU": 4, "GPU": 1, "Custom2": 2}
The ``min_workers`` and ``max_workers`` fields constrain the minimum and maximum number of nodes of this type to launch, respectively:
.. code-block:: yaml
min_workers: 1
max_workers: 4
The ``worker_setup_commands`` field (and also the ``initialization_commands`` field, not shown) can be used to override the setup and initialization commands for a node type. Note that you can only override the setup for worker nodes. The head node's setup commands are always configured via the top level field in the cluster YAML:
.. code-block:: yaml
worker_setup_commands:
    - pip install tensorflow-gpu # Example command.
Docker Support for Multi-type clusters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For each node type, you can specify ``worker_image`` and ``pull_before_run`` fields. These will override any top level ``docker`` section values (see :ref:`autoscaler-docker`). The ``worker_run_options`` field is combined with top level ``docker: run_options`` field to produce the docker run command for the given node_type. Ray will automatically select the Nvidia docker runtime if it is available.
The following configuration is for a GPU enabled node type:
.. code-block:: yaml
available_node_types:
    gpu_1_ondemand:
        max_workers: 2
        worker_setup_commands:
            - pip install tensorflow-gpu # Example command.
        # Docker specific commands for gpu_1_ondemand
        pull_before_run: True
        worker_image:
            - rayproject/ray-ml:latest-gpu
        worker_run_options: # Appended to top-level docker field.
            - "-v /home:/home"
+159 -3
@@ -272,6 +272,116 @@ There are two ways of running private clusters:
$ ray down ray/python/ray/autoscaler/local/example-full.yaml
.. _manual-cluster:
Manual Ray Cluster Setup
------------------------
The preferred way to run a Ray cluster is via the Ray Cluster Launcher. However, it is also possible to start a Ray cluster by hand.
This section assumes that you have a list of machines and that the nodes in the cluster can communicate with each other. It also assumes that Ray is installed
on each machine. To install Ray, follow the `installation instructions`_.
.. _`installation instructions`: http://docs.ray.io/en/master/installation.html
Starting Ray on each machine
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On the head node (just choose some node to be the head node), run the following.
If the ``--port`` argument is omitted, Ray will choose port 6379, falling back to a
random port.
.. code-block:: bash
$ ray start --head --port=6379
...
Next steps
To connect to this Ray runtime from another node, run
ray start --address='<ip address>:6379' --redis-password='<password>'
If connection fails, check your firewall settings and network configuration.
The command will print out the address of the Redis server that was started
(the local node IP address plus the port number you specified).
**Then on each of the other nodes**, run the following. Make sure to replace
``<address>`` with the value printed by the command on the head node (it
should look something like ``123.45.67.89:6379``).
Note that if your compute nodes are on their own subnetwork with Network
Address Translation, to connect from a regular machine outside that subnetwork,
the command printed by the head node will not work. You need to find the
address that will reach the head node from the second machine. If the head node
has a domain address like compute04.berkeley.edu, you can simply use that in
place of an IP address and rely on the DNS.
.. code-block:: bash
$ ray start --address=<address> --redis-password='<password>'
--------------------
Ray runtime started.
--------------------
To terminate the Ray runtime, run
ray stop
If you wish to specify that a machine has 10 CPUs and 1 GPU, you can do this
with the flags ``--num-cpus=10`` and ``--num-gpus=1``. See the :ref:`Configuration <configuring-ray>` page for more information.
If you see ``Unable to connect to Redis. If the Redis instance is on a
different machine, check that your firewall is configured properly.``,
this means the ``--port`` is inaccessible at the given IP address (because, for
example, the head node is not actually running Ray, or you have the wrong IP
address).
If you see ``Ray runtime started.``, then the node successfully connected to
the IP address at the ``--port``. You should now be able to connect to the
cluster with ``ray.init(address='auto')``.
If ``ray.init(address='auto')`` keeps repeating
``redis_context.cc:303: Failed to connect to Redis, retrying.``, then the node
is failing to connect to some other port(s) besides the main port.
.. code-block:: bash
If connection fails, check your firewall settings and network configuration.
If the connection fails, to check whether each port can be reached from a node,
you can use a tool such as ``nmap`` or ``nc``.
.. code-block:: bash
$ nmap -sV --reason -p $PORT $HEAD_ADDRESS
Nmap scan report for compute04.berkeley.edu (123.456.78.910)
Host is up, received echo-reply ttl 60 (0.00087s latency).
rDNS record for 123.456.78.910: compute04.berkeley.edu
PORT STATE SERVICE REASON VERSION
6379/tcp open redis syn-ack ttl 60 Redis key-value store
Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
$ nc -vv -z $HEAD_ADDRESS $PORT
Connection to compute04.berkeley.edu 6379 port [tcp/*] succeeded!
If the node cannot access that port at that IP address, you might see
.. code-block:: bash
$ nmap -sV --reason -p $PORT $HEAD_ADDRESS
Nmap scan report for compute04.berkeley.edu (123.456.78.910)
Host is up (0.0011s latency).
rDNS record for 123.456.78.910: compute04.berkeley.edu
PORT STATE SERVICE REASON VERSION
6379/tcp closed redis reset ttl 60
Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
$ nc -vv -z $HEAD_ADDRESS $PORT
nc: connect to compute04.berkeley.edu port 6379 (tcp) failed: Connection refused
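If neither ``nmap`` nor ``nc`` is available on a node, the same reachability check can be approximated with the Python standard library. The sketch below is only an illustration; the function name ``port_is_open`` and the default timeout are assumptions, not part of Ray.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example call, using the head node from the output above:
# port_is_open("compute04.berkeley.edu", 6379)
```

Like ``nc -z``, this only confirms that something accepts TCP connections on the port; it does not verify that the listener is actually Redis.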
Stopping Ray
~~~~~~~~~~~~
When you want to stop the Ray processes, run ``ray stop`` on each node.
Additional Cloud Providers
--------------------------
@@ -283,16 +393,62 @@ Security
On cloud providers, nodes will be launched into their own security group by default, with traffic allowed only between nodes in the same group. A new SSH key will also be created and saved to your local machine for access to the cluster.
.. _using-ray-on-a-cluster:
Running a Ray program on the Ray cluster
----------------------------------------
To run a distributed Ray program, you'll need to execute your program on the same machine as one of the nodes.
.. tabs::
.. group-tab:: Python
Within your program/script, you must call ``ray.init`` and add the ``address`` parameter to ``ray.init`` (like ``ray.init(address=...)``). This causes Ray to connect to the existing cluster. For example:
.. code-block:: python
ray.init(address="auto")
.. group-tab:: Java
You need to add the ``ray.address`` parameter to your command line (like ``-Dray.address=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
java -classpath <classpath> \
-Dray.address=<address> \
<classname> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in Java yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. note:: A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes.
To verify that the correct number of nodes have joined the cluster, you can run the following.
.. code-block:: python
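import time

import ray

ray.init(address="auto")  # Connect to the running cluster.

@ray.remote
def f():
    time.sleep(0.01)
    # Public helper in recent Ray releases (formerly ray.services.get_node_ip_address).
    return ray.util.get_node_ip_address()

# Get the set of IP addresses of the nodes that have joined the cluster.
set(ray.get([f.remote() for _ in range(1000)]))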
What's Next?
-------------
Now that you have a working understanding of the cluster launcher, check out:
* :ref:`cluster-config`: A guide to configuring your Ray cluster.
* :ref:`ref-cluster-quick-start`: An end-to-end demo to run an application that autoscales.
* :ref:`cluster-config`: A complete reference of how to configure your Ray cluster.
* :ref:`cluster-commands`: A short user guide to the various cluster launcher commands.
* A `step by step guide`_ to using the cluster launcher
* :ref:`ref-autoscaling`: An overview of how Ray autoscaling works.
File diff suppressed because it is too large
+4
@@ -3,6 +3,10 @@
Ray with Cluster Managers
=========================
.. note::
If you're using AWS, Azure or GCP you can use the :ref:`Ray Cluster Launcher <cluster-cloud>` to simplify the cluster setup process.
.. toctree::
:maxdepth: 2
+13 -216
@@ -1,229 +1,26 @@
.. _cluster-index:
Distributed Ray Overview
========================
Ray Cluster Overview
====================
One of Ray's strengths is the ability to leverage multiple machines in the same program. Ray can, of course, be run on a single machine (and often is), but its real power comes from running on a cluster of machines.
Key Concepts
------------
* **Ray Nodes**: A Ray cluster consists of a **head node** and a set of **worker nodes**. The head node needs to be started first, and the worker nodes are given the address of the head node to form the cluster. The Ray cluster itself can also "auto-scale," meaning that it can interact with a Cloud Provider to request or release instances according to application workload.
* **Ports**: Ray processes communicate via TCP ports. When starting a Ray cluster, either on prem or on the cloud, it is important to open the right ports so that Ray functions correctly. See :ref:`the Ray Ports documentation <ray-ports>` for more details.
* **Ray Cluster Launcher**: The :ref:`Ray Cluster Launcher <ref-automatic-cluster>` is a simple tool that automatically provisions machines and launches a multi-node Ray cluster. You can use the cluster launcher on GCP, Amazon EC2, Azure, or even Kubernetes.
Summary
-------
Clusters are started with the :ref:`Ray Cluster Launcher <ref-automatic-cluster>` or :ref:`manually <manual-cluster>`.
You can also create a Ray cluster using a standard cluster manager such as :ref:`Kubernetes <ray-k8s-deploy>`, :ref:`YARN <ray-yarn-deploy>`, or :ref:`SLURM <ray-slurm-deploy>`.
After a cluster is started, you need to connect your program to the Ray cluster by starting a driver process on the same node as where you ran ``ray start``:
.. tabs::
.. code-tab:: python
# This must
import ray
ray.init(address='auto')
.. group-tab:: java
.. code-block:: java
import io.ray.api.Ray;
public class MyRayApp {
public static void main(String[] args) {
Ray.init();
...
}
}
.. code-block:: bash
java -classpath <classpath> \
-Dray.address=<address> \
<classname> <args>
and then the rest of your script should be able to leverage Ray as a distributed framework!
Using the cluster launcher
--------------------------
The ``ray up`` command uses the :ref:`Ray Cluster Launcher <ref-automatic-cluster>` to start a cluster on the cloud, creating a designated "head node" and worker nodes. Any Python process that runs ``ray.init(address=...)`` on any of the cluster nodes will connect to the ray cluster.
.. important:: Calling ``ray.init`` on your laptop will not work if using ``ray up``, since your laptop will not be the head node.
Here is an example of using the Cluster Launcher on AWS:
.. code-block:: shell
# First, run `pip install boto3` and `aws configure`
#
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aws/example-full.yaml
You can monitor the Ray cluster status with ``ray monitor cluster.yaml`` and ssh into the head node with ``ray attach cluster.yaml``.
.. _manual-cluster:
Manual Ray Cluster Setup
What is a Ray cluster?
------------------------
The preferred way to run a Ray cluster is via the :ref:`Ray Cluster Launcher <ref-automatic-cluster>`. However, it is also possible to start a Ray cluster by hand.
One of Ray's strengths is the ability to leverage multiple machines in the same program. Ray can, of course, be run on a single machine (and often is), but its real power comes from running on a cluster of machines.
This section assumes that you have a list of machines and that the nodes in the cluster can communicate with each other. It also assumes that Ray is installed
on each machine. To install Ray, follow the `installation instructions`_.
A Ray cluster consists of a **head node** and a set of **worker nodes**. The head node needs to be started first, and the worker nodes are given the address of the head node to form the cluster.
.. _`installation instructions`: http://docs.ray.io/en/master/installation.html
You can use the Ray Cluster Launcher to provision machines and launch a multi-node Ray cluster. The cluster launcher works on AWS, GCP, Azure, Kubernetes, on-premises machines, Staroid, or even your own custom node provider. Ray clusters can also make use of the Ray Autoscaler, which allows Ray to interact with a cloud provider to request or release instances according to application workload.
Starting Ray on each machine
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
How does it work?
-----------------
On the head node (just choose some node to be the head node), run the following.
If the ``--port`` argument is omitted, Ray will choose port 6379, falling back to a
random port.
The Ray Cluster Launcher will automatically enable a load-based autoscaler. The autoscaler's resource demand scheduler looks at the resource demands of pending tasks, actors, and placement groups in the cluster, and tries to add the minimum list of nodes that can fulfill these demands. When worker nodes are idle for more than :ref:`idle_timeout_minutes <cluster-configuration-idle-timeout-minutes>`, they will be removed (the head node is never removed unless the cluster is torn down).
.. code-block:: bash
The autoscaler uses a simple bin-packing algorithm to pack the user demands into the available cluster resources. The remaining unfulfilled demands are placed on the smallest list of nodes that satisfies the demand while maximizing utilization (starting from the smallest node).
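The packing idea can be illustrated with a small first-fit sketch in Python. It is a simplified model under stated assumptions, not Ray's resource demand scheduler: the names ``fits``, ``nodes_to_add``, and the node type names in the usage note are hypothetical, and "smallest node" is approximated by the sum of a node type's resource amounts.

```python
def fits(demand, free):
    """Check whether a resource demand fits into a node's free resources."""
    return all(free.get(name, 0) >= amount for name, amount in demand.items())


def nodes_to_add(demands, existing_free, node_types):
    """Return the node type names to launch so that all demands are placed.

    demands:       list of resource dicts, e.g. [{"CPU": 2}, {"GPU": 1}]
    existing_free: list of free-resource dicts, one per running node
    node_types:    dict mapping a node type name to the resources it provides
    """
    free = [dict(resources) for resources in existing_free]
    launched = []
    # Consider smaller node types first.
    ordered = sorted(node_types.items(), key=lambda item: sum(item[1].values()))
    for demand in demands:
        # First try to pack the demand into a running or already planned node.
        target = next((node for node in free if fits(demand, node)), None)
        if target is None:
            # Otherwise plan the smallest node type that can hold the demand.
            for name, resources in ordered:
                if fits(demand, resources):
                    target = dict(resources)
                    free.append(target)
                    launched.append(name)
                    break
        if target is None:
            continue  # No node type can satisfy this demand.
        for name, amount in demand.items():
            target[name] = target.get(name, 0) - amount
    return launched
```

For example, with node types ``{"cpu_4": {"CPU": 4}, "gpu_1": {"CPU": 4, "GPU": 1}}``, one running node that has 2 free CPUs, and demands for three 2-CPU tasks plus one 1-GPU task, the sketch plans one ``cpu_4`` node and one ``gpu_1`` node.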
$ ray start --head --port=6379
...
Next steps
To connect to this Ray runtime from another node, run
ray start --address='<ip address>:6379' --redis-password='<password>'
**Here is "A Glimpse into the Ray Autoscaler" and how to debug/monitor your cluster:**
If connection fails, check your firewall settings and network configuration.
2021-19-01 by Ameer Haj-Ali, Anyscale, Inc.
The command will print out the address of the Redis server that was started
(the local node IP address plus the port number you specified).
**Then on each of the other nodes**, run the following. Make sure to replace
``<address>`` with the value printed by the command on the head node (it
should look something like ``123.45.67.89:6379``).
Note that if your compute nodes are on their own subnetwork with Network
Address Translation, to connect from a regular machine outside that subnetwork,
the command printed by the head node will not work. You need to find the
address that will reach the head node from the second machine. If the head node
has a domain address like compute04.berkeley.edu, you can simply use that in
place of an IP address and rely on the DNS.
.. code-block:: bash
$ ray start --address=<address> --redis-password='<password>'
--------------------
Ray runtime started.
--------------------
To terminate the Ray runtime, run
ray stop
If you wish to specify that a machine has 10 CPUs and 1 GPU, you can do this
with the flags ``--num-cpus=10`` and ``--num-gpus=1``. See the :ref:`Configuration <configuring-ray>` page for more information.
If you see ``Unable to connect to Redis. If the Redis instance is on a
different machine, check that your firewall is configured properly.``,
this means the ``--port`` is inaccessible at the given IP address (because, for
example, the head node is not actually running Ray, or you have the wrong IP
address).
If you see ``Ray runtime started.``, then the node successfully connected to
the IP address at the ``--port``. You should now be able to connect to the
cluster with ``ray.init(address='auto')``.
If ``ray.init(address='auto')`` keeps repeating
``redis_context.cc:303: Failed to connect to Redis, retrying.``, then the node
is failing to connect to some other port(s) besides the main port.
.. code-block:: bash
If connection fails, check your firewall settings and network configuration.
If the connection fails, to check whether each port can be reached from a node,
you can use a tool such as ``nmap`` or ``nc``.
.. code-block:: bash
$ nmap -sV --reason -p $PORT $HEAD_ADDRESS
Nmap scan report for compute04.berkeley.edu (123.456.78.910)
Host is up, received echo-reply ttl 60 (0.00087s latency).
rDNS record for 123.456.78.910: compute04.berkeley.edu
PORT STATE SERVICE REASON VERSION
6379/tcp open redis syn-ack ttl 60 Redis key-value store
Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
$ nc -vv -z $HEAD_ADDRESS $PORT
Connection to compute04.berkeley.edu 6379 port [tcp/*] succeeded!
If the node cannot access that port at that IP address, you might see
.. code-block:: bash
$ nmap -sV --reason -p $PORT $HEAD_ADDRESS
Nmap scan report for compute04.berkeley.edu (123.456.78.910)
Host is up (0.0011s latency).
rDNS record for 123.456.78.910: compute04.berkeley.edu
PORT STATE SERVICE REASON VERSION
6379/tcp closed redis reset ttl 60
Service detection performed. Please report any incorrect results at https://nmap.org/submit/ .
$ nc -vv -z $HEAD_ADDRESS $PORT
nc: connect to compute04.berkeley.edu port 6379 (tcp) failed: Connection refused
Stopping Ray
~~~~~~~~~~~~
When you want to stop the Ray processes, run ``ray stop`` on each node.
.. _using-ray-on-a-cluster:
Running a Ray program on the Ray cluster
----------------------------------------
To run a distributed Ray program, you'll need to execute your program on the same machine as one of the nodes.
.. tabs::
.. group-tab:: Python
Within your program/script, you must call ``ray.init`` and add the ``address`` parameter to ``ray.init`` (like ``ray.init(address=...)``). This causes Ray to connect to the existing cluster. For example:
.. code-block:: python
ray.init(address="auto")
.. group-tab:: Java
You need to add the ``ray.address`` parameter to your command line (like ``-Dray.address=...``).
To connect your program to the Ray cluster, run it like this:
.. code-block:: bash
java -classpath <classpath> \
-Dray.address=<address> \
<classname> <args>
.. note:: Specifying ``auto`` as the address hasn't been implemented in Java yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command.
.. note:: A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes.
To verify that the correct number of nodes have joined the cluster, you can run the following.
.. code-block:: python
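import time

import ray

ray.init(address="auto")  # Connect to the running cluster.

@ray.remote
def f():
    time.sleep(0.01)
    # Public helper in recent Ray releases (formerly ray.services.get_node_ip_address).
    return ray.util.get_node_ip_address()

# Get the set of IP addresses of the nodes that have joined the cluster.
set(ray.get([f.remote() for _ in range(1000)]))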
.. youtube:: BJ06eJasdu4
+1 -1
@@ -41,7 +41,7 @@ Below is a brief overview of the two tools.
The Ray Cluster Launcher
------------------------
The :ref:`Ray Cluster Launcher <ref-automatic-cluster>` is geared towards experimentation and development and can be used to launch Ray clusters on Kubernetes (among other backends).
The :ref:`Ray Cluster Launcher <cluster-cloud>` is geared towards experimentation and development and can be used to launch Ray clusters on Kubernetes (among other backends).
It allows you to manage an autoscaling Ray Cluster from your local environment using the :ref:`Ray CLI <cluster-commands>`.
For example, you can use ``ray up`` to launch a Ray cluster on Kubernetes and ``ray exec`` to execute commands in the Ray head node's pod.
Note that using the Cluster Launcher requires Ray to be :ref:`installed locally <installation>`.
-66
@@ -1,66 +0,0 @@
.. _ref-automatic-cluster:
Launching Cloud Clusters with Ray
=================================
Ray comes with a built-in cluster launcher that makes deploying a Ray cluster simple.
The cluster launcher will provision resources from a node provider (like :ref:`AWS EC2 <ref-cloud-setup>` or :ref:`Kubernetes <ray-launch-k8s>`) to instantiate the specified cluster, and start a Ray cluster on the provisioned resources.
You can configure the Ray Cluster Launcher to use with :ref:`a cloud provider <cluster-cloud>`, an existing :ref:`Kubernetes cluster <ray-launch-k8s>`, or a private cluster of machines.
.. tabs::
.. group-tab:: AWS
.. code-block:: shell
# First, run `pip install boto3` and `aws configure`
#
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/aws/example-full.yaml
See :ref:`the AWS section <ref-cloud-setup>` for full instructions.
.. group-tab:: GCP
.. code-block:: shell
# First, ``pip install google-api-python-client``
# set up your GCP credentials, and
# create a new GCP project.
#
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/gcp/example-full.yaml
See :ref:`the GCP section <ref-cloud-setup>` for full instructions.
.. group-tab:: Azure
.. code-block:: shell
# First, install the Azure CLI (``pip install azure-cli azure-core``),
# then log in using ``az login``.
#
# Create or update the cluster. When the command finishes, it will print
# out the command that can be used to SSH into the cluster head node.
$ ray up ray/python/ray/autoscaler/azure/example-full.yaml
See :ref:`the Azure section <ref-cloud-setup>` for full instructions.
Once the Ray cluster is running, you can manually SSH into it or use provided commands like ``ray attach``, ``ray rsync-up``, and ``ray exec`` to access it and run Ray programs.
.. toctree::
/cluster/cloud.rst
/cluster/config.rst
/cluster/commands.rst
Questions or Issues?
--------------------
.. include:: /_help.rst
+240
@@ -0,0 +1,240 @@
.. _ref-cluster-quick-start:
Quick Start Cluster Autoscaling Demo
====================================
This quick start demonstrates the capabilities of the Ray cluster. Using the Ray cluster, we'll take a sample application designed to run on a laptop and scale it up in the cloud. Ray will launch clusters and scale Python with just a few commands.
About the demo
--------------
This demo will walk through an end-to-end flow:
1. Create a (basic) Python application.
2. Launch a cluster on a cloud provider.
3. Run the application in the cloud.
Requirements
~~~~~~~~~~~~
To run this demo, you will need:
* Python installed on your development machine (typically your laptop), and
* an account at your preferred cloud provider (AWS, Azure or GCP).
Setup
~~~~~
Before we start, you will need to install some Python dependencies as follows:
.. tabs::
.. group-tab:: AWS
.. code-block:: shell
$ pip install -U ray boto3
.. group-tab:: Azure
.. code-block:: shell
$ pip install -U ray azure-cli azure-core
.. group-tab:: GCP
.. code-block:: shell
$ pip install -U ray google-api-python-client
Next, if you're not set up to use your cloud provider from the command line, you'll have to configure your credentials:
.. tabs::
.. group-tab:: AWS
Configure your credentials in ``~/.aws/credentials`` as described in `the AWS docs <https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html>`_.
.. group-tab:: Azure
Log in using ``az login``, then configure your credentials with ``az account set -s <subscription_id>``.
.. group-tab:: GCP
Set the ``GOOGLE_APPLICATION_CREDENTIALS`` environment variable as described in `the GCP docs <https://cloud.google.com/docs/authentication/getting-started>`_.
Create a (basic) Python application
-----------------------------------
We will write a simple Python application that tracks the IP addresses of the machines that its tasks are executed on:
.. code-block:: python

    from collections import Counter
    import socket
    import time

    def f():
        time.sleep(0.001)
        # Return IP address.
        return socket.gethostbyname(socket.gethostname())

    ip_addresses = [f() for _ in range(10000)]
    print(Counter(ip_addresses))
Save this application as ``script.py`` and execute it by running the command ``python script.py``. The application should take about 10 seconds to run (10,000 sequential calls that each sleep for 1 ms) and output something similar to ``Counter({'127.0.0.1': 10000})``.
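The run time follows from the arithmetic: 10,000 sequential calls that each sleep for 0.001 seconds add up to at least 10 seconds. The sketch below is one way to check this on your own machine. The helper ``time_serial`` is hypothetical (it is not part of the demo script), and it uses a smaller number of calls so that it finishes quickly.

```python
import time


def f():
    # Same sleep as the demo task, without the IP lookup.
    time.sleep(0.001)


def time_serial(num_calls):
    """Run f() num_calls times in sequence and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(num_calls):
        f()
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = time_serial(200)
    # Extrapolate linearly from 200 calls to the 10,000 calls of the demo.
    print("200 calls took {:.2f}s; 10000 calls would take roughly {:.1f}s".format(
        elapsed, elapsed * 50))
```

On most systems the measured time is somewhat above the lower bound, because ``time.sleep`` may pause for longer than requested.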
With some small changes, we can make this application run on Ray (for more information on how to do this, refer to :ref:`the Ray Core Walkthrough<core-walkthrough>`):
.. code-block:: python

    from collections import Counter
    import socket
    import time

    import ray

    ray.init()

    @ray.remote
    def f():
        time.sleep(0.001)
        # Return IP address.
        return socket.gethostbyname(socket.gethostname())

    object_ids = [f.remote() for _ in range(10000)]
    ip_addresses = ray.get(object_ids)
    print(Counter(ip_addresses))
Finally, let's add some code to make the output more interesting:
.. code-block:: python

    from collections import Counter
    import socket
    import time

    import ray

    ray.init()

    print('''This cluster consists of
        {} nodes in total
        {} CPU resources in total
    '''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

    @ray.remote
    def f():
        time.sleep(0.001)
        # Return IP address.
        return socket.gethostbyname(socket.gethostname())

    object_ids = [f.remote() for _ in range(10000)]
    ip_addresses = ray.get(object_ids)

    print('Tasks executed')
    for ip_address, num_tasks in Counter(ip_addresses).items():
        print('    {} tasks on {}'.format(num_tasks, ip_address))
Running ``python script.py`` should now output something like:
.. parsed-literal::
This cluster consists of
1 nodes in total
4.0 CPU resources in total
Tasks executed
10000 tasks on 127.0.0.1
Launch a cluster on a cloud provider
------------------------------------
To start a Ray Cluster, first we need to define the cluster configuration. The cluster configuration is defined within a YAML file that will be used by the Cluster Launcher to launch the head node, and by the Autoscaler to launch worker nodes.
A minimal sample cluster configuration file looks as follows:
.. tabs::
.. group-tab:: AWS
.. code-block:: yaml
# A unique identifier for the head node and workers of this cluster.
cluster_name: minimal
# Cloud-provider specific configuration.
provider:
type: aws
region: us-west-2
.. group-tab:: Azure
.. code-block:: yaml
# A unique identifier for the head node and workers of this cluster.
cluster_name: minimal
# Cloud-provider specific configuration.
provider:
type: azure
location: westus2
resource_group: ray-cluster
# How Ray will authenticate with newly launched nodes.
auth:
ssh_user: ubuntu
# you must specify paths to matching private and public key pair files
# use `ssh-keygen -t rsa -b 4096` to generate a new ssh key pair
ssh_private_key: ~/.ssh/id_rsa
# changes to this should match what is specified in file_mounts
ssh_public_key: ~/.ssh/id_rsa.pub
.. group-tab:: GCP
.. code-block:: yaml
# A unique identifier for the head node and workers of this cluster.
cluster_name: minimal
# Cloud-provider specific configuration.
provider:
type: gcp
region: us-west1
Save this configuration file as ``config.yaml``. You can specify a lot more details in the configuration file: instance types to use, minimum and maximum number of workers to start, autoscaling strategy, files to sync, and more. For a full reference on the available configuration properties, please refer to the :ref:`cluster YAML configuration options reference <cluster-config>`.
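As a rough illustration of the shape of this file, the sketch below builds the minimal AWS configuration as a Python dictionary, adds one optional setting (``max_workers``), and renders it as YAML text. The helper ``to_yaml`` is hypothetical and deliberately tiny: it only handles nested dictionaries of scalar values. The value ``2`` for ``max_workers`` is an arbitrary example, not a recommendation.

```python
def to_yaml(mapping, indent=0):
    """Render a nested dict of scalars as block-style YAML text."""
    lines = []
    prefix = " " * indent
    for key, value in mapping.items():
        if isinstance(value, dict):
            # Nested mapping: emit the key, then the children one level deeper.
            lines.append("{}{}:".format(prefix, key))
            lines.extend(to_yaml(value, indent + 4).splitlines())
        else:
            lines.append("{}{}: {}".format(prefix, key, value))
    return "\n".join(lines)


config = {
    "cluster_name": "minimal",
    "max_workers": 2,  # example value, assumed for illustration
    "provider": {"type": "aws", "region": "us-west-2"},
}

print(to_yaml(config))
```

In practice you would usually edit ``config.yaml`` by hand or use a YAML library; the point here is only that the file is a small nested mapping with a few top-level keys.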
After defining our configuration, we will use the Ray Cluster Launcher to start a cluster on the cloud, creating a designated "head node" and worker nodes. To start the Ray cluster, we will use the :ref:`Ray CLI <ray-cli>`. Run the following command:
.. code-block:: shell
$ ray up -y config.yaml
Run the application in the cloud
--------------------------------
We are now ready to execute the application across multiple machines on our Ray cloud cluster. Run the following command:
.. code-block:: shell
$ ray submit config.yaml script.py
The output should now look similar to the following:
.. parsed-literal::
This cluster consists of
3 nodes in total
6.0 CPU resources in total
Tasks executed
3425 tasks on xxx.xxx.xxx.xxx
3834 tasks on xxx.xxx.xxx.xxx
2741 tasks on xxx.xxx.xxx.xxx
In this sample output, 3 nodes were started. If the output only shows 1 node, you may want to increase the ``secs`` in ``time.sleep(secs)`` to give Ray more time to start additional nodes.
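To see why a longer sleep helps, it can be useful to estimate how long the workload keeps the cluster busy. The sketch below is a back-of-the-envelope calculation, assuming each task occupies one CPU and ignoring scheduling overhead; ``busy_seconds`` is a hypothetical helper, and the value ``0.1`` is only an example of a larger ``secs``.

```python
def busy_seconds(num_tasks, secs_per_task, num_cpus):
    """Lower bound on wall-clock time if tasks spread evenly over the CPUs."""
    return num_tasks * secs_per_task / num_cpus


# The demo as written, on a single CPU.
print(busy_seconds(10000, 0.001, 1))

# The demo with a longer sleep, on the 6 CPUs from the sample output.
print(busy_seconds(10000, 0.1, 6))
```

If the first estimate is shorter than the time your cloud provider needs to start an instance, the tasks can all finish on the head node before any worker joins, which matches the single-node output described above.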
The Ray CLI offers additional functionality. For example, you can monitor the Ray cluster status with ``ray monitor config.yaml``, and you can connect to the cluster (ssh into the head node) with ``ray attach config.yaml``. For a full reference on the Ray CLI, please refer to :ref:`the cluster commands reference <cluster-commands>`.
To finish, don't forget to shut down the cluster. Run the following command:
.. code-block:: shell
$ ray down -y config.yaml
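If you prefer to drive the whole demo from one script, the three commands used above (``ray up``, ``ray submit``, ``ray down``) can be wrapped in Python. This is a minimal sketch, not an official Ray utility: ``cluster_commands`` and ``run_demo`` are hypothetical names, and the code assumes the ``ray`` CLI is on your ``PATH`` and that ``config.yaml`` and ``script.py`` are in the current directory.

```python
import subprocess


def cluster_commands(config_path, script_path):
    """Return the up, submit, and down command lines used in this demo."""
    up = ["ray", "up", "-y", config_path]
    submit = ["ray", "submit", config_path, script_path]
    down = ["ray", "down", "-y", config_path]
    return up, submit, down


def run_demo(config_path="config.yaml", script_path="script.py"):
    """Start the cluster, run the script, and always shut the cluster down."""
    up, submit, down = cluster_commands(config_path, script_path)
    subprocess.run(up, check=True)
    try:
        subprocess.run(submit, check=True)
    finally:
        # Runs even if the submit step fails, so no cluster is left behind.
        subprocess.run(down, check=True)
```

Calling ``run_demo()`` performs the three steps in order. The ``try``/``finally`` is the main design choice: it keeps a failed run from leaving cloud instances running.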
+11
@@ -0,0 +1,11 @@
.. _cluster-reference:
Config YAML and CLI Reference
=============================
.. toctree::
:maxdepth: 2
config.rst
commands.rst
sdk.rst
+13
@@ -0,0 +1,13 @@
.. _ref-autoscaler-sdk:
Autoscaler SDK
==============
.. _ref-autoscaler-sdk-request-resources:
ray.autoscaler.sdk.request_resources
------------------------------------
Within a Ray program, you can command the autoscaler to scale the cluster up to a desired size with a ``request_resources()`` call. The cluster will immediately attempt to scale to accommodate the requested resources, bypassing normal upscaling speed constraints.
.. autofunction:: ray.autoscaler.sdk.request_resources
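A minimal usage sketch follows. It assumes the program is already connected to a running cluster (for example via ``ray.init(address="auto")``) and uses the ``bundles`` argument of ``request_resources``. The helpers ``cpu_bundles`` and ``scale_to`` are hypothetical names introduced here for illustration, and the worker counts are example values.

```python
def cpu_bundles(num_workers, cpus_per_worker):
    """Build one resource bundle per desired worker slot."""
    return [{"CPU": cpus_per_worker} for _ in range(num_workers)]


def scale_to(bundles):
    """Ask the autoscaler to make room for the given bundles."""
    # Imported lazily so this module can be loaded on machines without Ray.
    from ray.autoscaler.sdk import request_resources

    request_resources(bundles=bundles)


# Example: room for 4 workers with 2 CPUs each (call scale_to(...) on a cluster).
example_bundles = cpu_bundles(4, 2)
print(example_bundles)
```

For a simple CPU-only target, passing ``num_cpus`` to ``request_resources`` directly is an alternative to building bundles.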
+1
@@ -148,6 +148,7 @@ extensions = [
'sphinx_gallery.gen_gallery',
'sphinxemoji.sphinxemoji',
'sphinx_copybutton',
'sphinxcontrib.yt',
'versionwarning.extension',
]
+1 -1
@@ -71,7 +71,7 @@ Here's an example:
Why use Dask on Ray?
1. To take advantage of Ray-specific features such as the
:ref:`cluster launcher <ref-automatic-cluster>` and
:ref:`launching cloud clusters <cluster-cloud>` and
:ref:`shared-memory store <memory>`.
2. If you'd like to use Dask and Ray libraries in the same application without having two different clusters.
3. If you'd like to create data analyses using the familiar NumPy and Pandas APIs provided by Dask and execute them on a fast, fault-tolerant distributed task execution system geared towards production, like Ray.
+4 -3
@@ -231,11 +231,12 @@ Papers
.. toctree::
:hidden:
:maxdepth: -1
:caption: Ray Cluster
:caption: Ray Clusters/Autoscaler
cluster/index.rst
cluster/launcher.rst
cluster/autoscaling.rst
cluster/quickstart.rst
cluster/reference.rst
cluster/cloud.rst
cluster/deploy.rst
.. toctree::
+1 -1
@@ -140,7 +140,7 @@ In order to deploy Ray Serve on Kubernetes, we need to do the following:
3. Start Ray Serve on the cluster.
There are multiple ways to start a Ray cluster on Kubernetes, see :ref:`ray-k8s-deploy` for more information.
Here, we will be using the :ref:`Ray Cluster Launcher <ref-automatic-cluster>` tool, which has support for Kubernetes as a backend.
Here, we will be using the :ref:`Ray Cluster Launcher <cluster-cloud>` tool, which has support for Kubernetes as a backend.
The cluster launcher takes in a yaml config file that describes the cluster.
Here, we'll be using the `Kubernetes default config`_ with a few small modifications.
+1 -1
@@ -164,7 +164,7 @@ You can connect other nodes to the head node, creating a Ray cluster by also cal
Launching a Ray cluster (``ray up``)
------------------------------------
Ray clusters can be launched with the :ref:`Cluster Launcher <ref-automatic-cluster>`.
Ray clusters can be launched with the :ref:`Cluster Launcher <cluster-cloud>`.
The ``ray up`` command uses the Ray cluster launcher to start a cluster on the cloud, creating a designated "head node" and worker nodes. Underneath the hood, it automatically calls ``ray start`` to create a Ray cluster.
Your code **only** needs to execute on one machine in the cluster (usually the head node). Read more about :ref:`running programs on a Ray cluster <using-ray-on-a-cluster>`.
@@ -55,7 +55,7 @@ Launching a cloud cluster
If you already have a list of nodes, go to :ref:`tune-distributed-local`.
Ray currently supports AWS and GCP. Follow the instructions below to launch nodes on AWS (using the Deep Learning AMI). See the :ref:`cluster setup documentation <ref-automatic-cluster>`. Save the below cluster configuration (``tune-default.yaml``):
Ray currently supports AWS and GCP. Follow the instructions below to launch nodes on AWS (using the Deep Learning AMI). See the :ref:`cluster setup documentation <cluster-cloud>`. Save the below cluster configuration (``tune-default.yaml``):
.. literalinclude:: /../../python/ray/tune/examples/tune-default.yaml
:language: yaml
@@ -130,7 +130,7 @@ If you used a cluster configuration (starting a cluster with ``ray up`` or ``ray
Syncing
-------
Tune automatically syncs the trial folder on remote nodes back to the head node. This requires the ray cluster to be started with the :ref:`cluster launcher <ref-automatic-cluster>`.
Tune automatically syncs the trial folder on remote nodes back to the head node. This requires the ray cluster to be started with the :ref:`cluster launcher <cluster-cloud>`.
By default, local syncing requires rsync to be installed. You can customize the sync command with the ``sync_to_driver`` argument in ``tune.SyncConfig`` by providing either a function or a string.
If a string is provided, then it must include replacement fields ``{source}`` and ``{target}``, like ``rsync -savz -e "ssh -i ssh_key.pem" {source} {target}``. Alternatively, a function can be provided with the following signature:
@@ -290,7 +290,7 @@ Upon a second run, this will restore the entire experiment state from ``~/path/t
Common Commands
---------------
Below are some commonly used commands for submitting experiments. Please see the :ref:`Autoscaler page <ref-automatic-cluster>` to find more comprehensive documentation of commands.
Below are some commonly used commands for submitting experiments. Please see the :ref:`Autoscaler page <cluster-cloud>` to find more comprehensive documentation of commands.
.. code-block:: bash
+1 -1
@@ -265,7 +265,7 @@ You can restore a single trial checkpoint by using ``tune.run(restore=<checkpoin
Distributed Checkpointing
~~~~~~~~~~~~~~~~~~~~~~~~~
On a multinode cluster, Tune automatically creates a copy of all trial checkpoints on the head node. This requires the Ray cluster to be started with the :ref:`cluster launcher <ref-automatic-cluster>` and also requires rsync to be installed.
On a multinode cluster, Tune automatically creates a copy of all trial checkpoints on the head node. This requires the Ray cluster to be started with the :ref:`cluster launcher <cluster-cloud>` and also requires rsync to be installed.
Note that you must use the ``tune.checkpoint_dir`` API to trigger syncing. Also, if running Tune on Kubernetes, be sure to use the :ref:`KubernetesSyncer <tune-kubernetes>` to transfer files between different pods.