From b8b2d6410d450b23c2b81fa82ff52343b71a7223 Mon Sep 17 00:00:00 2001 From: javi-redondo <53356357+javi-redondo@users.noreply.github.com> Date: Mon, 15 Feb 2021 00:47:14 -0800 Subject: [PATCH] [docs] new Ray Cluster documentation (#13839) Co-authored-by: Javier Redondo Co-authored-by: AmeerHajAli --- doc/examples/plot_example-lm.rst | 2 +- doc/requirements-doc.txt | 1 + doc/source/cluster/autoscaling.rst | 164 --- doc/source/cluster/cloud.rst | 162 ++- doc/source/cluster/config.rst | 1138 +++++++++++++++-- doc/source/cluster/deploy.rst | 4 + doc/source/cluster/index.rst | 229 +--- doc/source/cluster/kubernetes.rst | 2 +- doc/source/cluster/launcher.rst | 66 - doc/source/cluster/quickstart.rst | 240 ++++ doc/source/cluster/reference.rst | 11 + doc/source/cluster/sdk.rst | 13 + doc/source/conf.py | 1 + doc/source/dask-on-ray.rst | 2 +- doc/source/index.rst | 7 +- doc/source/serve/deployment.rst | 2 +- doc/source/starting-ray.rst | 2 +- .../tune/_tutorials/tune-distributed.rst | 6 +- doc/source/tune/user-guide.rst | 2 +- 19 files changed, 1503 insertions(+), 551 deletions(-) delete mode 100644 doc/source/cluster/autoscaling.rst delete mode 100644 doc/source/cluster/launcher.rst create mode 100644 doc/source/cluster/quickstart.rst create mode 100644 doc/source/cluster/reference.rst create mode 100644 doc/source/cluster/sdk.rst diff --git a/doc/examples/plot_example-lm.rst b/doc/examples/plot_example-lm.rst index 843a7e782..204f470b3 100644 --- a/doc/examples/plot_example-lm.rst +++ b/doc/examples/plot_example-lm.rst @@ -11,7 +11,7 @@ You can view the `code for this example`_. .. _`code for this example`: https://github.com/ray-project/ray/tree/master/doc/examples/lm -To use Ray cluster launcher on AWS, install boto (``pip install boto3``) and configure your AWS credentials in ``~/.aws/credentials`` as described on the :ref:`Automatic Cluster Setup page `. +To use Ray cluster launcher on AWS, install boto (``pip install boto3``) and configure your AWS credentials in ``~/.aws/credentials`` as described on the :ref:`Automatic Cluster Setup page `. We provide an `example config file `__ (``lm-cluster.yaml``). In the example config file, we use an ``m5.xlarge`` on-demand instance as the head node, and use ``p3.2xlarge`` GPU spot instances as the worker nodes. We set the minimal number of workers to 1 and maximum workers to 2 in the config, which can be modified according to your own demand. diff --git a/doc/requirements-doc.txt b/doc/requirements-doc.txt index cb2c358fa..a9a34624a 100644 --- a/doc/requirements-doc.txt +++ b/doc/requirements-doc.txt @@ -25,6 +25,7 @@ sphinx-jsonschema sphinx-tabs sphinx-version-warning sphinx-book-theme +sphinxcontrib.yt starlette tabulate uvicorn diff --git a/doc/source/cluster/autoscaling.rst b/doc/source/cluster/autoscaling.rst deleted file mode 100644 index ecb7af155..000000000 --- a/doc/source/cluster/autoscaling.rst +++ /dev/null @@ -1,164 +0,0 @@ -.. _ref-autoscaling: - -Cluster Autoscaling -=================== - -.. tip:: Before you continue, be sure to have read :ref:`cluster-cloud`. - -Basics ------- - -The Ray Cluster Launcher will automatically enable a load-based autoscaler. The scheduler will look at the task, actor, and placement group resource demands from the cluster, and tries to add the minimum set of nodes that can fulfill these demands. When nodes are idle for more than a timeout, they will be removed, down to the ``min_workers`` limit. The head node is never removed. - -To avoid launching too many nodes at once, the number of nodes allowed to be pending is limited by the ``upscaling_speed`` setting. By default it is set to ``1.0``, which means the cluster can be growing in size by at most ``100%`` at any time (e.g., if the cluster currently has 20 nodes, at most 20 pending launches are allowed). This fraction can be set to as high as needed, e.g., ``99999`` to allow the cluster to quickly grow to its max size. - -In more detail, the autoscaler implements the following control loop: - - 1. It calculates the number of nodes required to satisfy all currently pending tasks, actor, and placement group requests. - 2. If the number of nodes required total divided by the number of current nodes exceeds ``1 + upscaling_speed``, then the number of nodes launched will be limited by that threshold. - 3. If a node is idle for a timeout (5 minutes by default), it is removed from the cluster. - -The basic autoscaling config settings are as follows: - -.. code-block:: yaml - - # An unique identifier for the head node and workers of this cluster. - cluster_name: default - - # The minimum number of workers nodes to launch in addition to the head - # node. This number should be >= 0. - min_workers: 0 - - # The autoscaler will scale up the cluster faster with higher upscaling speed. - # E.g., if the task requires adding more nodes then autoscaler will gradually - # scale up the cluster in chunks of upscaling_speed*currently_running_nodes. - # This number should be > 0. - upscaling_speed: 1.0 - - # If a node is idle for this many minutes, it will be removed. A node is - # considered idle if there are no tasks or actors running on it. - idle_timeout_minutes: 5 - -Programmatically Scaling a Cluster ----------------------------------- - -You can from within a Ray program command the autoscaler to scale the cluster up to a desired size with ``request_resources()`` call. The cluster will immediately attempt to scale to accomodate the requested resources, bypassing normal upscaling speed constraints. - -.. autofunction:: ray.autoscaler.sdk.request_resources - -Manually Adding Nodes without Resources (Unmanaged Nodes) ---------------------------------------------------------- - -In some cases, adding special nodes without any resources (i.e. `num_cpus=0`) may be desirable. Such nodes can be used as a driver which connects to the cluster to launch jobs. - -In order to manually add a node to an autoscaled cluster, the `ray-cluster-name` tag should be set and `ray-node-type` tag should be set to `unmanaged`. - -Unmanaged nodes **must have 0 resources**. - -If you are using the `available_node_types` field, you should create a custom node type with `resources: {}`, and `max_workers: 0` when configuring the autoscaler. - -The autoscaler will not attempt to start, stop, or update unmanaged nodes. The user is responsible for properly setting up and cleaning up unmanaged nodes. - - -Multiple Node Type Autoscaling ------------------------------- - -Ray supports multiple node types in a single cluster. In this mode of operation, the scheduler will choose the types of nodes to add based on the resource demands, instead of always adding the same kind of node type. - -The concept of a cluster node type encompasses both the physical instance type (e.g., AWS p3.8xl GPU nodes vs m4.16xl CPU nodes), as well as other attributes (e.g., IAM role, the machine image, etc). `Custom resources `__ can be specified for each node type so that Ray is aware of the demand for specific node types at the application level (e.g., a task may request to be placed on a machine with a specific role or machine image via custom resource). - -An example of configuring multiple node types is as follows `(full example) `__: - -.. code-block:: yaml - - # Specify the allowed node types and the resources they provide. - # The key is the name of the node type, which is just for debugging purposes. - # The node config specifies the launch config and physical instance type. - available_node_types: - cpu_4_ondemand: - node_config: - InstanceType: m4.xlarge - # For AWS instances, autoscaler will automatically add the available - # CPUs/GPUs/accelerator_type ({"CPU": 4} for m4.xlarge) in "resources". - # resources: {"CPU": 4} - min_workers: 1 - max_workers: 5 - cpu_16_spot: - node_config: - InstanceType: m4.4xlarge - InstanceMarketOptions: - MarketType: spot - # Autoscaler will auto fill the CPU resources below. - resources: {"Custom1": 1, "is_spot": 1} - max_workers: 10 - gpu_1_ondemand: - node_config: - InstanceType: p2.xlarge - # Autoscaler will auto fill the CPU/GPU resources below. - resources: {"Custom2": 2} - max_workers: 4 - worker_setup_commands: - - pip install tensorflow-gpu # Example command. - gpu_8_ondemand: - node_config: - InstanceType: p3.8xlarge - # Autoscaler autofills the "resources" below. - # resources: {"CPU": 32, "GPU": 4, "accelerator_type:V100": 1} - max_workers: 2 - worker_setup_commands: - - pip install tensorflow-gpu # Example command. - - # Specify the node type of the head node (as configured above). - head_node_type: cpu_4_ondemand - - -The above config defines two CPU node types (``cpu_4_ondemand`` and ``cpu_16_spot``), and two GPU types (``gpu_1_ondemand`` and ``gpu_8_ondemand``). Each node type has a name (e.g., ``cpu_4_ondemand``), which has no semantic meaning and is only for debugging. Let's look at the inner fields of the ``gpu_1_ondemand`` node type: - -The node config tells the underlying Cloud provider how to launch a node of this type. This node config is merged with the top level node config of the YAML and can override fields (i.e., to specify the p2.xlarge instance type here): - -.. code-block:: yaml - - node_config: - InstanceType: p2.xlarge - -The resources field tells the autoscaler what kinds of resources this node provides. This can include custom resources as well (e.g., "Custom2"). This field enables the autoscaler to automatically select the right kind of nodes to launch given the resource demands of the application. The resources specified here will be automatically passed to the ``ray start`` command for the node via an environment variable. For more information, see also the `resource demand scheduler `__: - -.. code-block:: yaml - - resources: {"CPU": 4, "GPU": 1, "Custom2": 2} - -The ``min_workers`` and ``max_workers`` fields constrain the minimum and maximum number of nodes of this type to launch, respectively: - -.. code-block:: yaml - - min_workers: 1 - max_workers: 4 - -The ``worker_setup_commands`` field (and also the ``initialization_commands`` field, not shown) can be used to override the setup and initialization commands for a node type. Note that you can only override the setup for worker nodes. The head node's setup commands are always configured via the top level field in the cluster YAML: - -.. code-block:: yaml - - worker_setup_commands: - - pip install tensorflow-gpu # Example command. - -Docker Support for Multi-type clusters -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -For each node type, you can specify ``worker_image`` and ``pull_before_run`` fields. These will override any top level ``docker`` section values (see :ref:`autoscaler-docker`). The ``worker_run_options`` field is combined with top level ``docker: run_options`` field to produce the docker run command for the given node_type. Ray will automatically select the Nvidia docker runtime if it is available. - -The following configuration is for a GPU enabled node type: - -.. code-block:: yaml - - available_node_types: - gpu_1_ondemand: - max_workers: 2 - worker_setup_commands: - - pip install tensorflow-gpu # Example command. - - # Docker specific commands for gpu_1_ondemand - pull_before_run: True - worker_image: - - rayproject/ray-ml:latest-gpu - worker_run_options: # Appended to top-level docker field. - - "-v /home:/home" diff --git a/doc/source/cluster/cloud.rst b/doc/source/cluster/cloud.rst index ea59f95ea..d2e7b90d5 100644 --- a/doc/source/cluster/cloud.rst +++ b/doc/source/cluster/cloud.rst @@ -272,6 +272,116 @@ There are two ways of running private clusters: $ ray down ray/python/ray/autoscaler/local/example-full.yaml +.. _manual-cluster: + +Manual Ray Cluster Setup +------------------------ + +The most preferable way to run a Ray cluster is via the Ray Cluster Launcher. However, it is also possible to start a Ray cluster by hand. + +This section assumes that you have a list of machines and that the nodes in the cluster can communicate with each other. It also assumes that Ray is installed +on each machine. To install Ray, follow the `installation instructions`_. + +.. _`installation instructions`: http://docs.ray.io/en/master/installation.html + +Starting Ray on each machine +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +On the head node (just choose some node to be the head node), run the following. +If the ``--port`` argument is omitted, Ray will choose port 6379, falling back to a +random port. + +.. code-block:: bash + + $ ray start --head --port=6379 + ... + Next steps + To connect to this Ray runtime from another node, run + ray start --address=':6379' --redis-password='' + + If connection fails, check your firewall settings and network configuration. + +The command will print out the address of the Redis server that was started +(the local node IP address plus the port number you specified). + +**Then on each of the other nodes**, run the following. Make sure to replace +``
`` with the value printed by the command on the head node (it +should look something like ``123.45.67.89:6379``). + +Note that if your compute nodes are on their own subnetwork with Network +Address Translation, to connect from a regular machine outside that subnetwork, +the command printed by the head node will not work. You need to find the +address that will reach the head node from the second machine. If the head node +has a domain address like compute04.berkeley.edu, you can simply use that in +place of an IP address and rely on the DNS. + +.. code-block:: bash + + $ ray start --address=
--redis-password='' + -------------------- + Ray runtime started. + -------------------- + + To terminate the Ray runtime, run + ray stop + +If you wish to specify that a machine has 10 CPUs and 1 GPU, you can do this +with the flags ``--num-cpus=10`` and ``--num-gpus=1``. See the :ref:`Configuration ` page for more information. + +If you see ``Unable to connect to Redis. If the Redis instance is on a +different machine, check that your firewall is configured properly.``, +this means the ``--port`` is inaccessible at the given IP address (because, for +example, the head node is not actually running Ray, or you have the wrong IP +address). + +If you see ``Ray runtime started.``, then the node successfully connected to +the IP address at the ``--port``. You should now be able to connect to the +cluster with ``ray.init(address='auto')``. + +If ``ray.init(address='auto')`` keeps repeating +``redis_context.cc:303: Failed to connect to Redis, retrying.``, then the node +is failing to connect to some other port(s) besides the main port. + +.. code-block:: bash + + If connection fails, check your firewall settings and network configuration. + +If the connection fails, to check whether each port can be reached from a node, +you can use a tool such as ``nmap`` or ``nc``. + +.. code-block:: bash + + $ nmap -sV --reason -p $PORT $HEAD_ADDRESS + Nmap scan report for compute04.berkeley.edu (123.456.78.910) + Host is up, received echo-reply ttl 60 (0.00087s latency). + rDNS record for 123.456.78.910: compute04.berkeley.edu + PORT STATE SERVICE REASON VERSION + 6379/tcp open redis syn-ack ttl 60 Redis key-value store + Service detection performed. Please report any incorrect results at https://nmap.org/submit/ . + $ nc -vv -z $HEAD_ADDRESS $PORT + Connection to compute04.berkeley.edu 6379 port [tcp/*] succeeded! + +If the node cannot access that port at that IP address, you might see + +.. code-block:: bash + + $ nmap -sV --reason -p $PORT $HEAD_ADDRESS + Nmap scan report for compute04.berkeley.edu (123.456.78.910) + Host is up (0.0011s latency). + rDNS record for 123.456.78.910: compute04.berkeley.edu + PORT STATE SERVICE REASON VERSION + 6379/tcp closed redis reset ttl 60 + Service detection performed. Please report any incorrect results at https://nmap.org/submit/ . + $ nc -vv -z $HEAD_ADDRESS $PORT + nc: connect to compute04.berkeley.edu port 6379 (tcp) failed: Connection refused + + +Stopping Ray +~~~~~~~~~~~~ + +When you want to stop the Ray processes, run ``ray stop`` on each node. + + Additional Cloud Providers -------------------------- @@ -283,16 +393,62 @@ Security On cloud providers, nodes will be launched into their own security group by default, with traffic allowed only between nodes in the same group. A new SSH key will also be created and saved to your local machine for access to the cluster. +.. _using-ray-on-a-cluster: + +Running a Ray program on the Ray cluster +---------------------------------------- + +To run a distributed Ray program, you'll need to execute your program on the same machine as one of the nodes. + +.. tabs:: + .. group-tab:: Python + + Within your program/script, you must call ``ray.init`` and add the ``address`` parameter to ``ray.init`` (like ``ray.init(address=...)``). This causes Ray to connect to the existing cluster. For example: + + .. code-block:: python + + ray.init(address="auto") + + .. group-tab:: Java + + You need to add the ``ray.address`` parameter to your command line (like ``-Dray.address=...``). + + To connect your program to the Ray cluster, run it like this: + + .. code-block:: bash + + java -classpath \ + -Dray.address=
\ + + + .. note:: Specifying ``auto`` as the address hasn't been implemented in Java yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command. + + +.. note:: A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes. + +To verify that the correct number of nodes have joined the cluster, you can run the following. + +.. code-block:: python + + import time + + @ray.remote + def f(): + time.sleep(0.01) + return ray.services.get_node_ip_address() + + # Get a list of the IP addresses of the nodes that have joined the cluster. + set(ray.get([f.remote() for _ in range(1000)])) + What's Next? ------------- Now that you have a working understanding of the cluster launcher, check out: -* :ref:`cluster-config`: A guide to configuring your Ray cluster. +* :ref:`ref-cluster-quick-start`: A end-to-end demo to run an application that autoscales. +* :ref:`cluster-config`: A complete reference of how to configure your Ray cluster. * :ref:`cluster-commands`: A short user guide to the various cluster launcher commands. -* A `step by step guide`_ to using the cluster launcher -* :ref:`ref-autoscaling`: An overview of how Ray autoscaling works. diff --git a/doc/source/cluster/config.rst b/doc/source/cluster/config.rst index 8260e8f6b..430d5473d 100644 --- a/doc/source/cluster/config.rst +++ b/doc/source/cluster/config.rst @@ -1,82 +1,286 @@ .. _cluster-config: -Configuring your Cluster -======================== +Cluster YAML Configuration Options +================================== -.. tip:: Before you continue, be sure to have read :ref:`cluster-cloud`. +The cluster configuration is defined within a YAML file that will be used by the Cluster Launcher to launch the head node, and by the Autoscaler to launch worker nodes. Once the cluster configuration is defined, you will need to use the :ref:`Ray CLI ` to perform any operations such as starting and stopping the cluster. -To launch a cluster, you must first create a *cluster configuration file*, which specifies some important details about the cluster. +Syntax +------ -Quickstart ----------- +.. parsed-literal:: -At a minimum, we need to specify: + :ref:`cluster_name `: str + :ref:`max_workers `: int + :ref:`upscaling_speed `: float + :ref:`idle_timeout_minutes `: int + :ref:`docker `: + :ref:`docker ` + :ref:`provider `: + :ref:`provider ` + :ref:`auth `: + :ref:`auth ` + :ref:`available_node_types `: + :ref:`node_types ` + :ref:`worker_nodes `: + :ref:`node_config ` + :ref:`head_node_type `: str + :ref:`file_mounts `: + :ref:`file_mounts ` + :ref:`cluster_synced_files `: + - str + :ref:`rsync_exclude `: + - str + :ref:`rsync_filter `: + - str + :ref:`initialization_commands `: + - str + :ref:`setup_commands `: + - str + :ref:`head_setup_commands `: + - str + :ref:`worker_setup_commands `: + - str + :ref:`head_start_ray_commands `: + - str + :ref:`worker_start_ray_commands `: + - str -* the name of your cluster, -* the number of workers in the cluster -* the cloud provider -* any setup commands that should run on the node upon launch. +Custom types +------------ -Here is an example cluster configuration file: +.. _cluster-configuration-docker-type: -.. code-block:: yaml +Docker +~~~~~~ - # A unique identifier for this cluster. - cluster_name: basic-ray +.. parsed-literal:: + :ref:`image `: str + :ref:`head_image `: str + :ref:`worker_image `: str + :ref:`container_name `: str + :ref:`pull_before_run `: bool + :ref:`run_options `: + - str + :ref:`head_run_options `: + - str + :ref:`worker_run_options `: + - str + :ref:`disable_automatic_runtime_detection `: bool + :ref:`disable_shm_size_detection `: bool - # The maximum number of workers nodes to launch in addition to the head - # node. - max_workers: 0 # this means zero workers +.. _cluster-configuration-auth-type: - # Cloud-provider specific configuration. - provider: - type: aws - region: us-west-2 - availability_zone: us-west-2a +Auth +~~~~ - # How Ray will authenticate with newly launched nodes. - auth: - ssh_user: ubuntu +.. tabs:: + .. group-tab:: AWS - setup_commands: - - pip install ray[all] - # The following line demonstrate that you can specify arbitrary - # startup scripts on the cluster. - - touch /tmp/some_file.txt + .. parsed-literal:: -Most of the example YAML file is optional. Here is a `reference minimal YAML file `__, and you can find the defaults for `optional fields in this YAML file `__. + :ref:`ssh_user `: str + :ref:`ssh_private_key `: str -In another example, the `AWS example configuration file `__ cluster config file will create a small cluster with an m5.large head node (on-demand) configured to autoscale up to two m5.large `spot workers `__. + .. group-tab:: Azure -**You are encouraged to copy the example YAML file and modify it to your needs. This may include adding additional setup commands to install libraries or sync local data files.** + .. parsed-literal:: -Setup Commands --------------- + :ref:`ssh_user `: str + :ref:`ssh_private_key `: str + :ref:`ssh_public_key `: str -.. tip:: After you have customized the nodes, create a new machine image (or docker container) and use that in the config file to reduce setup times. + .. group-tab:: GCP -The setup commands you use should ideally be *idempotent* (i.e., can be run multiple times without changing the result). This allows Ray to safely update nodes after they have been created. + .. parsed-literal:: -You can usually make commands idempotent with small modifications, e.g. ``git clone foo`` can be rewritten as ``test -e foo || git clone foo`` which checks if the repo is already cloned first. + :ref:`ssh_user `: str + :ref:`ssh_private_key `: str -.. _autoscaler-docker: +.. _cluster-configuration-provider-type: -Docker Support --------------- +Provider +~~~~~~~~ -The cluster launcher is fully compatible with Docker images. To use Docker, provide a ``docker_image`` and ``container_name`` in the ``docker`` field of the YAML. +.. tabs:: + .. group-tab:: AWS -.. code-block:: yaml + .. parsed-literal:: - docker: - container_name: "ray_container" - image: "rayproject/ray-ml:latest-gpu" + :ref:`type `: str + :ref:`region `: str + :ref:`availability_zone `: str + :ref:`cache_stopped_nodes `: bool -We provide docker images on `DockerHub `__. The ``rayproject/ray-ml:latest`` image is a quick way to get up and running . + .. group-tab:: Azure -When the cluster is launched, all of the Ray tasks will be executed completely inside of the container. For GPU support, Ray will automatically select the Nvidia docker runtime if available, and you just need to specify a docker image with the CUDA support (``rayproject/ray-ml:latest-gpu`` and all of our ``-gpu`` images have this). + .. parsed-literal:: -If Docker is not installed, add the following commands to ``initialization_commands`` to install it. + :ref:`type `: str + :ref:`location `: str + :ref:`resource_group `: str + :ref:`subscription_id `: str + :ref:`cache_stopped_nodes `: bool + + .. group-tab:: GCP + + .. parsed-literal:: + + :ref:`type `: str + :ref:`region `: str + :ref:`availability_zone `: str + :ref:`project_id `: str + :ref:`cache_stopped_nodes `: bool + +.. _cluster-configuration-node-types-type: + +Node types +~~~~~~~~~~ + +The nodes types object's keys represent the names of the different node types. + +.. parsed-literal:: + : + :ref:`node_config `: + :ref:`Node config ` + :ref:`resources `: + :ref:`Resources ` + :ref:`min_workers `: int + :ref:`max_workers `: int + :ref:`worker_setup_commands `: + - str + :ref:`docker `: + :ref:`Node Docker ` + : + ... + ... + +.. _cluster-configuration-node-config-type: + +Node config +~~~~~~~~~~~ + +.. tabs:: + .. group-tab:: AWS + + A YAML object as defined in `the AWS docs `_. + + .. group-tab:: Azure + + A YAML object as defined in `the deployment template `_ whose resources are defined in `the Azure docs `_. + + .. group-tab:: GCP + + A YAML object as defined in `the GCP docs `_. + +.. _cluster-configuration-node-docker-type: + +Node Docker +~~~~~~~~~~~ + +.. parsed-literal:: + + :ref:`image `: str + :ref:`pull_before_run `: bool + :ref:`run_options `: + - str + :ref:`disable_automatic_runtime_detection `: bool + :ref:`disable_shm_size_detection `: bool + +.. _cluster-configuration-resources-type: + +Resources +~~~~~~~~~ + +.. parsed-literal:: + + :ref:`CPU `: int + :ref:`GPU `: int + : int + : int + ... + +.. _cluster-configuration-file-mounts-type: + +File mounts +~~~~~~~~~~~ + +.. parsed-literal:: + : str # Path 1 on local machine + : str # Path 2 on local machine + ... + +Properties and Definitions +-------------------------- + +.. _cluster-configuration-cluster-name: + +``cluster_name`` +~~~~~~~~~~~~~~~~ + +The name of the cluster. This is the namespace of the cluster. + +* **Required:** Yes +* **Importance:** High +* **Type:** String +* **Default:** "default" +* **Pattern:** ``[a-zA-Z0-9_]+`` + +.. _cluster-configuration-max-workers: + +``max_workers`` +~~~~~~~~~~~~~~~ + +The maximum number of workers the cluster will have at any given time. + +* **Required:** No +* **Importance:** High +* **Type:** Integer +* **Default:** ``2`` +* **Minimum:** ``0`` +* **Maximum:** Unbounded + +.. _cluster-configuration-upscaling-speed: + +``upscaling_speed`` +~~~~~~~~~~~~~~~~~~~ + +The number of nodes allowed to be pending as a multiple of the current number of nodes. For example, if set to 1.0, the cluster can grow in size by at most 100% at any time, so if the cluster currently has 20 nodes, at most 20 pending launches are allowed. + +* **Required:** No +* **Importance:** Medium +* **Type:** Float +* **Default:** ``1.0`` +* **Minimum:** ``0.0`` +* **Maximum:** Unbounded + +.. _cluster-configuration-idle-timeout-minutes: + +``idle_timeout_minutes`` +~~~~~~~~~~~~~~~~~~~~~~~~ + +The number of minutes that need to pass before an idle worker node is removed by the Autoscaler. + +* **Required:** No +* **Importance:** Medium +* **Type:** Integer +* **Default:** ``5`` +* **Minimum:** ``0`` +* **Maximum:** Unbounded + +.. _cluster-configuration-docker: + +``docker`` +~~~~~~~~~~ + +Configure Ray to run in Docker containers. + +* **Required:** No +* **Importance:** High +* **Type:** :ref:`Docker ` +* **Default:** ``{}`` + +In rare cases when Docker is not available on the system by default (e.g., bad AMI), add the following commands to :ref:`initialization_commands ` to install it. .. code-block:: yaml @@ -86,59 +290,813 @@ If Docker is not installed, add the following commands to ``initialization_comma - sudo usermod -aG docker $USER - sudo systemctl restart docker -f -Common cluster configurations ------------------------------ +.. _cluster-configuration-provider: -The `example-full.yaml `__ configuration is enough to get started with Ray, but for more compute intensive workloads you will want to change the instance types to e.g. use GPU or larger compute instance by editing the yaml file. +``provider`` +~~~~~~~~~~~~ -Here are a few common configurations (note that we use AWS in the examples, but these examples are generic): +The cloud provider-specific configuration properties. -**GPU single node**: use Ray on a single large GPU instance. +* **Required:** Yes +* **Importance:** High +* **Type:** :ref:`Provider ` -.. code-block:: yaml +.. _cluster-configuration-auth: - max_workers: 0 - head_node: - InstanceType: p2.8xlarge +``auth`` +~~~~~~~~ + +Authentication credentials that Ray will use to launch nodes. + +* **Required:** Yes +* **Importance:** High +* **Type:** :ref:`Auth ` + +.. _cluster-configuration-available-node-types: + +``available_node_types`` +~~~~~~~~~~~~~~~~~~~~~~~~ + +Tells the autoscaler the allowed node types and the resources they provide. +The key is the name of the node type, which is just for debugging purposes. + +* **Required:** No +* **Importance:** High +* **Type:** :ref:`Node types ` +* **Default:** + +.. tabs:: + .. group-tab:: AWS + + .. code-block:: yaml + + available_node_types: + ray.head.default: + node_config: + InstanceType: m5.large + BlockDeviceMappings: + - DeviceName: /dev/sda1 + Ebs: + VolumeSize: 100 + resources: {"CPU": 2} + min_workers: 0 + max_workers: 0 + ray.worker.small: + node_config: + InstanceType: m5.large + InstanceMarketOptions: + MarketType: spot + resources: {"CPU": 2} + min_workers: 0 + max_workers: 1 + +.. _cluster-configuration-head-node-type: + +``head_node_type`` +~~~~~~~~~~~~~~~~~~ + +The key for one of the node types in :ref:`available_node_types `. This node type will be used to launch the head node. -**Mixed GPU and CPU nodes**: for RL applications that require proportionally more -CPU than GPU resources, you can use additional CPU workers with a GPU head node. +* **Required:** Yes +* **Importance:** High +* **Type:** String +* **Pattern:** ``[a-zA-Z0-9_]+`` -.. code-block:: yaml +.. _cluster-configuration-worker-nodes: - max_workers: 10 - head_node: - InstanceType: p2.8xlarge - worker_nodes: - InstanceType: m4.16xlarge +``worker_nodes`` +~~~~~~~~~~~~~~~~ -**Autoscaling CPU cluster**: use a small head node and have Ray auto-scale -workers as needed. This can be a cost-efficient configuration for clusters with -bursty workloads. You can also request spot workers for additional cost savings. +The configuration to be used to launch worker nodes on the cloud service provider. Generally, node configs are set in the :ref:`node config of each node type `. Setting this property allows propagation of a default value to all the node types when they launch as workers (e.g., using spot instances across all workers can be configured here so that it doesn't have to be set across all instance types). -.. code-block:: yaml +* **Required:** No +* **Importance:** Low +* **Type:** :ref:`Node config ` +* **Default:** ``{}`` - min_workers: 0 - max_workers: 10 - head_node: - InstanceType: m4.large - worker_nodes: - InstanceMarketOptions: - MarketType: spot - InstanceType: m4.16xlarge +.. _cluster-configuration-file-mounts: -**Autoscaling GPU cluster**: similar to the autoscaling CPU cluster, but -with GPU worker nodes instead. +``file_mounts`` +~~~~~~~~~~~~~~~ -.. code-block:: yaml +The files or directories to copy to the head and worker nodes. - min_workers: 0 # NOTE: older Ray versions may need 1+ GPU workers (#2106) - max_workers: 10 - head_node: - InstanceType: m4.large - worker_nodes: - InstanceMarketOptions: - MarketType: spot - InstanceType: p2.xlarge +* **Required:** No +* **Importance:** High +* **Type:** :ref:`File mounts ` +* **Default:** ``[]`` +.. _cluster-configuration-cluster-synced-files: + +``cluster_synced_files`` +~~~~~~~~~~~~~~~~~~~~~~~~ + +A list of paths to the files or directories to copy from the head node to the worker nodes. The same path on the head node will be copied to the worker node. This behavior is a subset of the file_mounts behavior, so in the vast majority of cases one should just use :ref:`file_mounts `. + +* **Required:** No +* **Importance:** Low +* **Type:** List of String +* **Default:** ``[]`` + +.. _cluster-configuration-rsync-exclude: + +``rsync_exclude`` +~~~~~~~~~~~~~~~~~ + +A list of patterns for files to exclude when running ``rsync up`` or ``rsync down``. The filter is applied on the source directory only. + +Example for a pattern in the list: ``**/.git/**``. + +* **Required:** No +* **Importance:** Low +* **Type:** List of String +* **Default:** ``[]`` + +.. _cluster-configuration-rsync-filter: + +``rsync_filter`` +~~~~~~~~~~~~~~~~ + +A list of patterns for files to exclude when running ``rsync up`` or ``rsync down``. The filter is applied on the source directory and recursively through all subdirectories. + +Example for a pattern in the list: ``.gitignore``. + +* **Required:** No +* **Importance:** Low +* **Type:** List of String +* **Default:** ``[]`` + +.. _cluster-configuration-initialization-commands: + +``initialization_commands`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +A list of commands that will be run before the :ref:`setup commands `. If Docker is enabled, these commands will run outside the container and before Docker is setup. + +* **Required:** No +* **Importance:** Medium +* **Type:** List of String +* **Default:** ``[]`` + +.. _cluster-configuration-setup-commands: + +``setup_commands`` +~~~~~~~~~~~~~~~~~~ + +A list of commands to run to set up nodes. These commands will always run on the head and worker nodes and will be merged with :ref:`head setup commands ` for head and with :ref:`worker setup commands ` for workers. + +* **Required:** No +* **Importance:** Medium +* **Type:** List of String +* **Default:** + +.. tabs:: + .. group-tab:: AWS + + .. code-block:: yaml + + # Default setup_commands: + setup_commands: + - echo 'export PATH="$HOME/anaconda3/envs/tensorflow_p36/bin:$PATH"' >> ~/.bashrc + - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp36-cp36m-manylinux2014_x86_64.whl + +- Setup commands should ideally be *idempotent* (i.e., can be run multiple times without changing the result); this allows Ray to safely update nodes after they have been created. You can usually make commands idempotent with small modifications, e.g. ``git clone foo`` can be rewritten as ``test -e foo || git clone foo`` which checks if the repo is already cloned first. + +- Setup commands are run sequentially but separately. For example, if you are using anaconda, you need to run ``conda activate env && pip install -U ray`` because splitting the command into two setup commands will not work. + +- Ideally, you should avoid using setup_commands by creating a docker image with all the dependencies preinstalled to minimize startup time. + +- **Tip**: if you also want to run apt-get commands during setup add the following list of commands: + + .. code-block:: yaml + + setup_commands: + - sudo pkill -9 apt-get || true + - sudo pkill -9 dpkg || true + - sudo dpkg --configure -a + +.. _cluster-configuration-head-setup-commands: + +``head_setup_commands`` +~~~~~~~~~~~~~~~~~~~~~~~ + +A list of commands to run to set up the head node. These commands will be merged with the general :ref:`setup commands `. + +* **Required:** No +* **Importance:** Low +* **Type:** List of String +* **Default:** ``[]`` + +.. _cluster-configuration-worker-setup-commands: + +``worker_setup_commands`` +~~~~~~~~~~~~~~~~~~~~~~~~~ + +A list of commands to run to set up the worker nodes. These commands will be merged with the general :ref:`setup commands `. + +* **Required:** No +* **Importance:** Low +* **Type:** List of String +* **Default:** ``[]`` + +.. _cluster-configuration-head-start-ray-commands: + +``head_start_ray_commands`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Commands to start ray on the head node. You don't need to change this. + +* **Required:** No +* **Importance:** Low +* **Type:** List of String +* **Default:** + +.. tabs:: + .. group-tab:: AWS + + .. code-block:: yaml + + head_start_ray_commands: + - ray stop + - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml + +.. _cluster-configuration-worker-start-ray-commands: + +``worker_start_ray_commands`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Command to start ray on worker nodes. You don't need to change this. + +* **Required:** No +* **Importance:** Low +* **Type:** List of String +* **Default:** + +.. tabs:: + .. group-tab:: AWS + + .. code-block:: yaml + + worker_start_ray_commands: + - ray stop + - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076 + +.. _cluster-configuration-image: + +``docker.image`` +~~~~~~~~~~~~~~~~ + +The default Docker image to pull in the head and worker nodes. This can be overridden by the :ref:`head_image ` and :ref:`worker_image ` fields. If neither `image` nor (:ref:`head_image ` and :ref:`worker_image `) are specified, Ray will not use Docker. + +* **Required:** Yes (If Docker is in use.) +* **Importance:** High +* **Type:** String + +The Ray project provides Docker images on `DockerHub `_. The repository includes following images: + +* ``rayproject/ray-ml:latest-gpu``: CUDA support, includes ML dependencies. +* ``rayproject/ray:latest-gpu``: CUDA support, no ML dependencies. +* ``rayproject/ray-ml:latest``: No CUDA support, includes ML dependencies. +* ``rayproject/ray:latest``: No CUDA support, no ML dependencies. + +.. _cluster-configuration-head-image: + +``docker.head_image`` +~~~~~~~~~~~~~~~~~~~~~ +Docker image for the head node to override the default :ref:`docker image `. + +* **Required:** No +* **Importance:** Low +* **Type:** String + +.. _cluster-configuration-worker-image: + +``docker.worker_image`` +~~~~~~~~~~~~~~~~~~~~~~~ +Docker image for the worker nodes to override the default :ref:`docker image `. + +* **Required:** No +* **Importance:** Low +* **Type:** String + +.. _cluster-configuration-container-name: + +``docker.container_name`` +~~~~~~~~~~~~~~~~~~~~~~~~~ + +The name to use when starting the Docker container. + +* **Required:** Yes (If Docker is in use.) +* **Importance:** Low +* **Type:** String +* **Default:** ray_container + +.. _cluster-configuration-pull-before-run: + +``docker.pull_before_run`` +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If enabled, the latest version of image will be pulled when starting Docker. If disabled, ``docker run`` will only pull the image if no cached version is present. + +* **Required:** No +* **Importance:** Medium +* **Type:** Boolean +* **Default:** ``True`` + +.. _cluster-configuration-run-options: + +``docker.run_options`` +~~~~~~~~~~~~~~~~~~~~~~ + +The extra options to pass to ``docker run``. + +* **Required:** No +* **Importance:** Medium +* **Type:** List of String +* **Default:** ``[]`` + +.. _cluster-configuration-head-run-options: + +``docker.head_run_options`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The extra options to pass to ``docker run`` for head node only. + +* **Required:** No +* **Importance:** Low +* **Type:** List of String +* **Default:** ``[]`` + +.. _cluster-configuration-worker-run-options: + +``docker.worker_run_options`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The extra options to pass to ``docker run`` for worker nodes only. + +* **Required:** No +* **Importance:** Low +* **Type:** List of String +* **Default:** ``[]`` + +.. _cluster-configuration-disable-automatic-runtime-detection: + +``docker.disable_automatic_runtime_detection`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If enabled, Ray will not try to use the NVIDIA Container Runtime if GPUs are present. + +* **Required:** No +* **Importance:** Low +* **Type:** Boolean +* **Default:** ``False`` + + +.. _cluster-configuration-disable-shm-size-detection: + +``docker.disable_shm_size_detection`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If enabled, Ray will not automatically specify the size ``/dev/shm`` for the started container and the runtime's default value (64MiB for Docker) will be used. + +* **Required:** No +* **Importance:** Low +* **Type:** Boolean +* **Default:** ``False`` + + +.. _cluster-configuration-ssh-user: + +``auth.ssh_user`` +~~~~~~~~~~~~~~~~~ + +The user that Ray will authenticate with when launching new nodes. + +* **Required:** Yes +* **Importance:** High +* **Type:** String + +.. _cluster-configuration-ssh-private-key: + +``auth.ssh_private_key`` +~~~~~~~~~~~~~~~~~~~~~~~~ + +.. tabs:: + .. group-tab:: AWS + + The path to an existing private key for Ray to use. If not configured, Ray will create a new private keypair (default behavior). If configured, the key must be added to the project-wide metadata and ``KeyName`` has to be defined in the :ref:`node configuration `. + + * **Required:** No + * **Importance:** Low + * **Type:** String + + .. group-tab:: Azure + + The path to an existing private key for Ray to use. + + * **Required:** Yes + * **Importance:** High + * **Type:** String + + You may use ``ssh-keygen -t rsa -b 4096`` to generate a new ssh keypair. + + .. group-tab:: GCP + + The path to an existing private key for Ray to use. If not configured, Ray will create a new private keypair (default behavior). If configured, the key must be added to the project-wide metadata and ``KeyName`` has to be defined in the :ref:`node configuration `. + + * **Required:** No + * **Importance:** Low + * **Type:** String + +.. _cluster-configuration-ssh-public-key: + +``auth.ssh_public_key`` +~~~~~~~~~~~~~~~~~~~~~~~ + +.. tabs:: + .. group-tab:: AWS + + Not available. + + .. group-tab:: Azure + + The path to an existing public key for Ray to use. + + * **Required:** Yes + * **Importance:** High + * **Type:** String + + .. group-tab:: GCP + + Not available. + +.. _cluster-configuration-type: + +``provider.type`` +~~~~~~~~~~~~~~~~~ + +.. tabs:: + .. group-tab:: AWS + + The cloud service provider. For AWS, this must be set to ``aws``. + + * **Required:** Yes + * **Importance:** High + * **Type:** String + + .. group-tab:: Azure + + The cloud service provider. For Azure, this must be set to ``azure``. + + * **Required:** Yes + * **Importance:** High + * **Type:** String + + .. group-tab:: GCP + + The cloud service provider. For GCP, this must be set to ``gcp``. + + * **Required:** Yes + * **Importance:** High + * **Type:** String + +.. _cluster-configuration-region: + +``provider.region`` +~~~~~~~~~~~~~~~~~~~ + +.. tabs:: + .. group-tab:: AWS + + The region to use for deployment of the Ray cluster. + + * **Required:** Yes + * **Importance:** High + * **Type:** String + * **Default:** us-west-2 + + .. group-tab:: Azure + + Not available. + + .. group-tab:: GCP + + The region to use for deployment of the Ray cluster. + + * **Required:** Yes + * **Importance:** High + * **Type:** String + * **Default:** us-west1 + +.. _cluster-configuration-availability-zone: + +``provider.availability_zone`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. tabs:: + .. group-tab:: AWS + + A string specifying a comma-separated list of availability zone(s) that nodes may be launched in. + + * **Required:** No + * **Importance:** Low + * **Type:** String + * **Default:** us-west-2a,us-west-2b + + .. group-tab:: Azure + + Not available. + + .. group-tab:: GCP + + A string specifying a comma-separated list of availability zone(s) that nodes may be launched in. + + * **Required:** No + * **Importance:** Low + * **Type:** String + * **Default:** us-west1-a + +.. _cluster-configuration-location: + +``provider.location`` +~~~~~~~~~~~~~~~~~~~~~ + +.. tabs:: + .. group-tab:: AWS + + Not available. + + .. group-tab:: Azure + + The location to use for deployment of the Ray cluster. + + * **Required:** Yes + * **Importance:** High + * **Type:** String + * **Default:** westus2 + + .. group-tab:: GCP + + Not available. + +.. _cluster-configuration-resource-group: + +``provider.resource_group`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. tabs:: + .. group-tab:: AWS + + Not available. + + .. group-tab:: Azure + + The resource group to use for deployment of the Ray cluster. + + * **Required:** Yes + * **Importance:** High + * **Type:** String + * **Default:** ray-cluster + + .. group-tab:: GCP + + Not available. + +.. _cluster-configuration-subscription-id: + +``provider.subscription_id`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. tabs:: + .. group-tab:: AWS + + Not available. + + .. group-tab:: Azure + + The subscription ID to use for deployment of the Ray cluster. If not specified, Ray will use the default from the Azure CLI. + + * **Required:** No + * **Importance:** High + * **Type:** String + * **Default:** ``""`` + + .. group-tab:: GCP + + Not available. + +.. _cluster-configuration-project-id: + +``provider.project_id`` +~~~~~~~~~~~~~~~~~~~~~~~ + +.. tabs:: + .. group-tab:: AWS + + Not available. + + .. group-tab:: Azure + + Not available. + + .. group-tab:: GCP + + The globally unique project ID to use for deployment of the Ray cluster. + + * **Required:** No + * **Importance:** Low + * **Type:** String + * **Default:** ``null`` + +.. _cluster-configuration-cache-stopped-nodes: + +``provider.cache_stopped_nodes`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If enabled, nodes will be *stopped* when the cluster scales down. If disabled, nodes will be *terminated* instead. Stopped nodes launch faster than terminated nodes. + + +* **Required:** No +* **Importance:** Low +* **Type:** Boolean +* **Default:** ``True`` + +.. _cluster-configuration-node-config: + +``available_node_types..node_type.node_config`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The configuration to be used to launch the nodes on the cloud service provider. Among other things, this will specify the instance type to be launched. + +* **Required:** Yes +* **Importance:** High +* **Type:** :ref:`Node config ` + +.. _cluster-configuration-resources: + +``available_node_types..node_type.resources`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The resources that a node type provides, which enables the autoscaler to automatically select the right type of nodes to launch given the resource demands of the application. The resources specified will be automatically passed to the ``ray start`` command for the node via an environment variable. If not provided, Autoscaler can automatically detect them only for AWS/Kubernetes cloud providers. For more information, see also the `resource demand scheduler `_ + +* **Required:** Yes (except for AWS/K8s) +* **Importance:** High +* **Type:** :ref:`Resources ` +* **Default:** ``{}`` + +In some cases, adding special nodes without any resources may be desirable. Such nodes can be used as a driver which connects to the cluster to launch jobs. In order to manually add a node to an autoscaled cluster, the *ray-cluster-name* tag should be set and *ray-node-type* tag should be set to unmanaged. Unmanaged nodes can be created by setting the resources to ``{}`` and the :ref:`maximum workers ` to 0. The Autoscaler will not attempt to start, stop, or update unmanaged nodes. The user is responsible for properly setting up and cleaning up unmanaged nodes. + +.. _cluster-configuration-node-min-workers: + +``available_node_types..node_type.min_workers`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The minimum number of workers to maintain for this node type regardless of utilization. + +* **Required:** No +* **Importance:** High +* **Type:** Integer +* **Default:** ``0`` +* **Minimum:** ``0`` +* **Maximum:** Unbounded + +.. _cluster-configuration-node-max-workers: + +``available_node_types..node_type.max_workers`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The maximum number of workers to have in the cluster for this node type regardless of utilization. This takes precedence over :ref:`minimum workers `. + +* **Required:** No +* **Importance:** High +* **Type:** Integer +* **Default:** ``0`` +* **Minimum:** ``0`` +* **Maximum:** Unbounded + +.. _cluster-configuration-node-type-worker-setup-commands: + +``available_node_types..node_type.worker_setup_commands`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +A list of commands to run to set up worker nodes of this type. These commands will replace the general :ref:`worker setup commands ` for the node. + +* **Required:** No +* **Importance:** low +* **Type:** List of String +* **Default:** ``[]`` + +.. _cluster-configuration-cpu: + +``available_node_types..node_type.resources.CPU`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. tabs:: + .. group-tab:: AWS + + The number of CPUs made available by this node. If not configured, Autoscaler can automatically detect them only for AWS/Kubernetes cloud providers. + + * **Required:** Yes (except for AWS/K8s) + * **Importance:** High + * **Type:** Integer + + .. group-tab:: Azure + + The number of CPUs made available by this node. + + * **Required:** Yes + * **Importance:** High + * **Type:** Integer + + .. group-tab:: GCP + + The number of CPUs made available by this node. + + * **Required:** No + * **Importance:** High + * **Type:** Integer + + +.. _cluster-configuration-gpu: + +``available_node_types..node_type.resources.GPU`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. tabs:: + .. group-tab:: AWS + + The number of GPUs made available by this node. If not configured, Autoscaler can automatically detect them only for AWS/Kubernetes cloud providers. + + * **Required:** No + * **Importance:** Low + * **Type:** Integer + + .. group-tab:: Azure + + The number of GPUs made available by this node. + + * **Required:** No + * **Importance:** High + * **Type:** Integer + + .. group-tab:: GCP + + The number of GPUs made available by this node. + + * **Required:** No + * **Importance:** High + * **Type:** Integer + +.. _cluster-configuration-node-docker: + +``available_node_types..docker`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +A set of overrides to the top-level :ref:`Docker ` configuration. + +* **Required:** No +* **Importance:** Low +* **Type:** :ref:`docker ` +* **Default:** ``{}`` + +Examples +-------- + +Minimal configuration +~~~~~~~~~~~~~~~~~~~~~ + +.. tabs:: + .. group-tab:: AWS + + .. literalinclude:: ../../../python/ray/autoscaler/aws/example-minimal.yaml + :language: yaml + + .. group-tab:: Azure + + .. literalinclude:: ../../../python/ray/autoscaler/azure/example-minimal.yaml + :language: yaml + + .. group-tab:: GCP + + .. literalinclude:: ../../../python/ray/autoscaler/gcp/example-minimal.yaml + :language: yaml + +Full configuration +~~~~~~~~~~~~~~~~~~ + +.. tabs:: + .. group-tab:: AWS + + .. literalinclude:: ../../../python/ray/autoscaler/aws/example-full.yaml + :language: yaml + + .. group-tab:: Azure + + .. literalinclude:: ../../../python/ray/autoscaler/azure/example-full.yaml + :language: yaml + + .. group-tab:: GCP + + .. literalinclude:: ../../../python/ray/autoscaler/gcp/example-full.yaml + :language: yaml diff --git a/doc/source/cluster/deploy.rst b/doc/source/cluster/deploy.rst index e9253614f..24bcfe456 100644 --- a/doc/source/cluster/deploy.rst +++ b/doc/source/cluster/deploy.rst @@ -3,6 +3,10 @@ Ray with Cluster Managers ========================= +.. note:: + + If you're using AWS, Azure or GCP you can use the :ref:`Ray Cluster Launcher ` to simplify the cluster setup process. + .. toctree:: :maxdepth: 2 diff --git a/doc/source/cluster/index.rst b/doc/source/cluster/index.rst index c95eca1cb..f32fab548 100644 --- a/doc/source/cluster/index.rst +++ b/doc/source/cluster/index.rst @@ -1,229 +1,26 @@ .. _cluster-index: -Distributed Ray Overview -======================== +Ray Cluster Overview +==================== -One of Ray's strengths is the ability to leverage multiple machines in the same program. Ray can, of course, be run on a single machine (and is done so often) but the real power is using Ray on a cluster of machines. - -Key Concepts ------------- - -* **Ray Nodes**: A Ray cluster consists of a **head node** and a set of **worker nodes**. The head node needs to be started first, and the worker nodes are given the address of the head node to form the cluster. The Ray cluster itself can also "auto-scale," meaning that it can interact with a Cloud Provider to request or release instances according to application workload. - -* **Ports**: Ray processes communicate via TCP ports. When starting a Ray cluster, either on prem or on the cloud, it is important to open the right ports so that Ray functions correctly. See :ref:`the Ray Ports documentation ` for more details. - -* **Ray Cluster Launcher**: The :ref:`Ray Cluster Launcher ` is a simple tool that automatically provisions machines and launches a multi-node Ray cluster. You can use the cluster launcher on GCP, Amazon EC2, Azure, or even Kubernetes. - -Summary -------- - -Clusters are started with the :ref:`Ray Cluster Launcher ` or :ref:`manually `. - -You can also create a Ray cluster using a standard cluster manager such as :ref:`Kubernetes `, :ref:`YARN `, or :ref:`SLURM `. - -After a cluster is started, you need to connect your program to the Ray cluster by starting a driver process on the same node as where you ran ``ray start``: - -.. tabs:: - .. code-tab:: python - - # This must - import ray - ray.init(address='auto') - - .. group-tab:: java - - .. code-block:: java - - import io.ray.api.Ray; - - public class MyRayApp { - - public static void main(String[] args) { - Ray.init(); - ... - } - } - - .. code-block:: bash - - java -classpath \ - -Dray.address=
\ - - -and then the rest of your script should be able to leverage Ray as a distributed framework! - - -Using the cluster launcher --------------------------- - -The ``ray up`` command uses the :ref:`Ray Cluster Launcher ` to start a cluster on the cloud, creating a designated "head node" and worker nodes. Any Python process that runs ``ray.init(address=...)`` on any of the cluster nodes will connect to the ray cluster. - -.. important:: Calling ``ray.init`` on your laptop will not work if using ``ray up``, since your laptop will not be the head node. - -Here is an example of using the Cluster Launcher on AWS: - -.. code-block:: shell - - # First, run `pip install boto3` and `aws configure` - # - # Create or update the cluster. When the command finishes, it will print - # out the command that can be used to SSH into the cluster head node. - $ ray up ray/python/ray/autoscaler/aws/example-full.yaml - -You can monitor the Ray cluster status with ``ray monitor cluster.yaml`` and ssh into the head node with ``ray attach cluster.yaml``. - -.. _manual-cluster: - -Manual Ray Cluster Setup +What is a Ray cluster? ------------------------ -The most preferable way to run a Ray cluster is via the :ref:`Ray Cluster Launcher `. However, it is also possible to start a Ray cluster by hand. +One of Ray's strengths is the ability to leverage multiple machines in the same program. Ray can, of course, be run on a single machine (and is done so often), but the real power is using Ray on a cluster of machines. -This section assumes that you have a list of machines and that the nodes in the cluster can communicate with each other. It also assumes that Ray is installed -on each machine. To install Ray, follow the `installation instructions`_. +A Ray cluster consists of a **head node** and a set of **worker nodes**. The head node needs to be started first, and the worker nodes are given the address of the head node to form the cluster. -.. _`installation instructions`: http://docs.ray.io/en/master/installation.html +You can use the Ray Cluster Launcher to provision machines and launch a multi-node Ray cluster. You can use the cluster launcher on AWS, GCP, Azure, Kubernetes, on-premise, and Staroid or even on your custom node provider. Ray clusters can also make use of the Ray Autoscaler, which allows Ray to interact with a cloud provider to request or release instances according to application workload. -Starting Ray on each machine -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +How does it work? +----------------- -On the head node (just choose some node to be the head node), run the following. -If the ``--port`` argument is omitted, Ray will choose port 6379, falling back to a -random port. +The Ray Cluster Launcher will automatically enable a load-based autoscaler. The autoscaler resource demand scheduler will look at the pending tasks, actors, and placement groups resource demands from the cluster, and try to add the minimum list of nodes that can fulfill these demands. When worker nodes are idle for more than :ref:`idle_timeout_minutes `, they will be removed (the head node is never removed unless the cluster is teared down). -.. code-block:: bash +Autoscaler uses a simple binpacking algorithm to binpack the user demands into the available cluster resources. The remaining unfulfilled demands are placed on the smallest list of nodes that satisfies the demand while maximizing utilization (starting from the smallest node). - $ ray start --head --port=6379 - ... - Next steps - To connect to this Ray runtime from another node, run - ray start --address=':6379' --redis-password='' +**Here is "A Glimpse into the Ray Autoscaler" and how to debug/monitor your cluster:** - If connection fails, check your firewall settings and network configuration. +2021-19-01 by Ameer Haj-Ali, Anyscale, Inc. -The command will print out the address of the Redis server that was started -(the local node IP address plus the port number you specified). - -**Then on each of the other nodes**, run the following. Make sure to replace -``
`` with the value printed by the command on the head node (it -should look something like ``123.45.67.89:6379``). - -Note that if your compute nodes are on their own subnetwork with Network -Address Translation, to connect from a regular machine outside that subnetwork, -the command printed by the head node will not work. You need to find the -address that will reach the head node from the second machine. If the head node -has a domain address like compute04.berkeley.edu, you can simply use that in -place of an IP address and rely on the DNS. - -.. code-block:: bash - - $ ray start --address=
--redis-password='' - -------------------- - Ray runtime started. - -------------------- - - To terminate the Ray runtime, run - ray stop - -If you wish to specify that a machine has 10 CPUs and 1 GPU, you can do this -with the flags ``--num-cpus=10`` and ``--num-gpus=1``. See the :ref:`Configuration ` page for more information. - -If you see ``Unable to connect to Redis. If the Redis instance is on a -different machine, check that your firewall is configured properly.``, -this means the ``--port`` is inaccessible at the given IP address (because, for -example, the head node is not actually running Ray, or you have the wrong IP -address). - -If you see ``Ray runtime started.``, then the node successfully connected to -the IP address at the ``--port``. You should now be able to connect to the -cluster with ``ray.init(address='auto')``. - -If ``ray.init(address='auto')`` keeps repeating -``redis_context.cc:303: Failed to connect to Redis, retrying.``, then the node -is failing to connect to some other port(s) besides the main port. - -.. code-block:: bash - - If connection fails, check your firewall settings and network configuration. - -If the connection fails, to check whether each port can be reached from a node, -you can use a tool such as ``nmap`` or ``nc``. - -.. code-block:: bash - - $ nmap -sV --reason -p $PORT $HEAD_ADDRESS - Nmap scan report for compute04.berkeley.edu (123.456.78.910) - Host is up, received echo-reply ttl 60 (0.00087s latency). - rDNS record for 123.456.78.910: compute04.berkeley.edu - PORT STATE SERVICE REASON VERSION - 6379/tcp open redis syn-ack ttl 60 Redis key-value store - Service detection performed. Please report any incorrect results at https://nmap.org/submit/ . - $ nc -vv -z $HEAD_ADDRESS $PORT - Connection to compute04.berkeley.edu 6379 port [tcp/*] succeeded! - -If the node cannot access that port at that IP address, you might see - -.. code-block:: bash - - $ nmap -sV --reason -p $PORT $HEAD_ADDRESS - Nmap scan report for compute04.berkeley.edu (123.456.78.910) - Host is up (0.0011s latency). - rDNS record for 123.456.78.910: compute04.berkeley.edu - PORT STATE SERVICE REASON VERSION - 6379/tcp closed redis reset ttl 60 - Service detection performed. Please report any incorrect results at https://nmap.org/submit/ . - $ nc -vv -z $HEAD_ADDRESS $PORT - nc: connect to compute04.berkeley.edu port 6379 (tcp) failed: Connection refused - - -Stopping Ray -~~~~~~~~~~~~ - -When you want to stop the Ray processes, run ``ray stop`` on each node. - -.. _using-ray-on-a-cluster: - -Running a Ray program on the Ray cluster ----------------------------------------- - -To run a distributed Ray program, you'll need to execute your program on the same machine as one of the nodes. - -.. tabs:: - .. group-tab:: Python - - Within your program/script, you must call ``ray.init`` and add the ``address`` parameter to ``ray.init`` (like ``ray.init(address=...)``). This causes Ray to connect to the existing cluster. For example: - - .. code-block:: python - - ray.init(address="auto") - - .. group-tab:: Java - - You need to add the ``ray.address`` parameter to your command line (like ``-Dray.address=...``). - - To connect your program to the Ray cluster, run it like this: - - .. code-block:: bash - - java -classpath \ - -Dray.address=
\ - - - .. note:: Specifying ``auto`` as the address hasn't been implemented in Java yet. You need to provide the actual address. You can find the address of the server from the output of the ``ray up`` command. - - -.. note:: A common mistake is setting the address to be a cluster node while running the script on your laptop. This will not work because the script needs to be started/executed on one of the Ray nodes. - -To verify that the correct number of nodes have joined the cluster, you can run the following. - -.. code-block:: python - - import time - - @ray.remote - def f(): - time.sleep(0.01) - return ray.services.get_node_ip_address() - - # Get a list of the IP addresses of the nodes that have joined the cluster. - set(ray.get([f.remote() for _ in range(1000)])) +.. youtube:: BJ06eJasdu4 diff --git a/doc/source/cluster/kubernetes.rst b/doc/source/cluster/kubernetes.rst index 94711b595..1234ece99 100644 --- a/doc/source/cluster/kubernetes.rst +++ b/doc/source/cluster/kubernetes.rst @@ -41,7 +41,7 @@ Below is a brief overview of the two tools. The Ray Cluster Launcher ------------------------ -The :ref:`Ray Cluster Launcher ` is geared towards experimentation and development and can be used to launch Ray clusters on Kubernetes (among other backends). +The :ref:`Ray Cluster Launcher ` is geared towards experimentation and development and can be used to launch Ray clusters on Kubernetes (among other backends). It allows you to manage an autoscaling Ray Cluster from your local environment using the :ref:`Ray CLI `. For example, you can use ``ray up`` to launch a Ray cluster on Kubernetes and ``ray exec`` to execute commands in the Ray head node's pod. Note that using the Cluster Launcher requires Ray to be :ref:`installed locally `. diff --git a/doc/source/cluster/launcher.rst b/doc/source/cluster/launcher.rst deleted file mode 100644 index 8c63f04f9..000000000 --- a/doc/source/cluster/launcher.rst +++ /dev/null @@ -1,66 +0,0 @@ -.. _ref-automatic-cluster: - -Launching Cloud Clusters with Ray -================================= - -Ray comes with a built-in cluster launcher that makes deploying a Ray cluster simple. - -The cluster launcher will provision resources from a node provider (like :ref:`AWS EC2 ` or :ref:`Kubernetes `) to instantiate the specified cluster, and start a Ray cluster on the provisioned resources. - -You can configure the Ray Cluster Launcher to use with :ref:`a cloud provider `, an existing :ref:`Kubernetes cluster `, or a private cluster of machines. - -.. tabs:: - .. group-tab:: AWS - - .. code-block:: shell - - # First, run `pip install boto3` and `aws configure` - # - # Create or update the cluster. When the command finishes, it will print - # out the command that can be used to SSH into the cluster head node. - $ ray up ray/python/ray/autoscaler/aws/example-full.yaml - - See :ref:`the AWS section ` for full instructions. - - .. group-tab:: GCP - - .. code-block:: shell - - # First, ``pip install google-api-python-client`` - # set up your GCP credentials, and - # create a new GCP project. - # - # Create or update the cluster. When the command finishes, it will print - # out the command that can be used to SSH into the cluster head node. - $ ray up ray/python/ray/autoscaler/gcp/example-full.yaml - - See :ref:`the GCP section ` for full instructions. - - .. group-tab:: Azure - - .. code-block:: shell - - # First, install the Azure CLI - # ``pip install azure-cli azure-core``) then - # login using (``az login``). - # - # Create or update the cluster. When the command finishes, it will print - # out the command that can be used to SSH into the cluster head node. - $ ray up ray/python/ray/autoscaler/azure/example-full.yaml - - See :ref:`the Azure section ` for full instructions. - - -Once the Ray cluster is running, you can manually SSH into it or use provided commands like ``ray attach``, ``ray rsync-up``, and ``ray exec`` to access it and run Ray programs. - - -.. toctree:: - - /cluster/cloud.rst - /cluster/config.rst - /cluster/commands.rst - -Questions or Issues? --------------------- - -.. include:: /_help.rst diff --git a/doc/source/cluster/quickstart.rst b/doc/source/cluster/quickstart.rst new file mode 100644 index 000000000..f02db280e --- /dev/null +++ b/doc/source/cluster/quickstart.rst @@ -0,0 +1,240 @@ +.. _ref-cluster-quick-start: + +Quick Start Cluster Autoscaling Demo +==================================== + +This quick start demonstrates the capabilities of the Ray cluster. Using the Ray cluster, we'll take a sample application designed to run on a laptop and scale it up in the cloud. Ray will launch clusters and scale Python with just a few commands. + +About the demo +-------------- + +This demo will walk through an end-to-end flow: + +1. Create a (basic) Python application. +2. Launch a cluster on a cloud provider. +3. Run the application in the cloud. + +Requirements +~~~~~~~~~~~~ + +To run this demo, you will need: + +* Python installed on your development machine (typically your laptop), and +* an account at your preferred cloud provider (AWS, Azure or GCP). + +Setup +~~~~~ + +Before we start, you will need to install some Python dependencies as follows: + +.. tabs:: + .. group-tab:: AWS + + .. code-block:: shell + + $ pip install -U ray boto3 + + .. group-tab:: Azure + + .. code-block:: shell + + $ pip install -U ray azure-cli azure-core + + .. group-tab:: GCP + + .. code-block:: shell + + $ pip install -U ray google-api-python-client + +Next, if you're not set up to use your cloud provider from the command line, you'll have to configure your credentials: + +.. tabs:: + .. group-tab:: AWS + + Configure your credentials in ``~/.aws/credentials`` as described in `the AWS docs `_. + + .. group-tab:: Azure + + Log in using ``az login``, then configure your credentials with ``az account set -s ``. + + .. group-tab:: GCP + + Set the ``GOOGLE_APPLICATION_CREDENTIALS`` environment variable as described in `the GCP docs `_. + +Create a (basic) Python application +----------------------------------- + +We will write a simple Python application that tracks the IP addresses of the machines that its tasks are executed on: + +.. code-block:: python + + from collections import Counter + import socket + import time + + def f(): + time.sleep(0.001) + # Return IP address. + return socket.gethostbyname(socket.gethostname()) + + ip_addresses = [f() for _ in range(10000)] + print(Counter(ip_addresses)) + +Save this application as ``script.py`` and execute it by running the command ``python script.py``. The application should take 10 seconds to run and output something similar to ``Counter({'127.0.0.1': 10000})``. + +With some small changes, we can make this application run on Ray (for more information on how to do this, refer to :ref:`the Ray Core Walkthrough`): + +.. code-block:: python + + from collections import Counter + import socket + import time + + import ray + + ray.init() + + @ray.remote + def f(): + time.sleep(0.001) + # Return IP address. + return socket.gethostbyname(socket.gethostname()) + + object_ids = [f.remote() for _ in range(10000)] + ip_addresses = ray.get(object_ids) + print(Counter(ip_addresses)) + +Finally, let's add some code to make the output more interesting: + +.. code-block:: python + + from collections import Counter + import socket + import time + + import ray + + ray.init() + + print('''This cluster consists of + {} nodes in total + {} CPU resources in total + '''.format(len(ray.nodes()), ray.cluster_resources()['CPU'])) + + @ray.remote + def f(): + time.sleep(0.001) + # Return IP address. + return socket.gethostbyname(socket.gethostname()) + + object_ids = [f.remote() for _ in range(10000)] + ip_addresses = ray.get(object_ids) + + print('Tasks executed') + for ip_address, num_tasks in Counter(ip_addresses).items(): + print(' {} tasks on {}'.format(num_tasks, ip_address)) + +Running ``python script.py`` should now output something like: + +.. parsed-literal:: + + This cluster consists of + 1 nodes in total + 4.0 CPU resources in total + + Tasks executed + 10000 tasks on 127.0.0.1 + +Launch a cluster on a cloud provider +------------------------------------ + +To start a Ray Cluster, first we need to define the cluster configuration. The cluster configuration is defined within a YAML file that will be used by the Cluster Launcher to launch the head node, and by the Autoscaler to launch worker nodes. + +A minimal sample cluster configuration file looks as follows: + +.. tabs:: + .. group-tab:: AWS + + .. code-block:: yaml + + # An unique identifier for the head node and workers of this cluster. + cluster_name: minimal + + # Cloud-provider specific configuration. + provider: + type: aws + region: us-west-2 + + .. group-tab:: Azure + + .. code-block:: yaml + + # An unique identifier for the head node and workers of this cluster. + cluster_name: minimal + + # Cloud-provider specific configuration. + provider: + type: azure + location: westus2 + resource_group: ray-cluster + + # How Ray will authenticate with newly launched nodes. + auth: + ssh_user: ubuntu + # you must specify paths to matching private and public key pair files + # use `ssh-keygen -t rsa -b 4096` to generate a new ssh key pair + ssh_private_key: ~/.ssh/id_rsa + # changes to this should match what is specified in file_mounts + ssh_public_key: ~/.ssh/id_rsa.pub + + .. group-tab:: GCP + + .. code-block:: yaml + + # A unique identifier for the head node and workers of this cluster. + cluster_name: minimal + + # Cloud-provider specific configuration. + provider: + type: gcp + region: us-west1 + +Save this configuration file as ``config.yaml``. You can specify a lot more details in the configuration file: instance types to use, minimum and maximum number of workers to start, autoscaling strategy, files to sync, and more. For a full reference on the available configuration properties, please refer to the :ref:`cluster YAML configuration options reference `. + +After defining our configuration, we will use the Ray Cluster Launcher to start a cluster on the cloud, creating a designated "head node" and worker nodes. To start the Ray cluster, we will use the :ref:`Ray CLI `. Run the following command: + +.. code-block:: shell + + $ ray up -y config.yaml + +Run the application in the cloud +-------------------------------- + +We are now ready to execute the application in across multiple machines on our Ray cloud cluster. Run the following command: + +.. code-block:: shell + + $ ray submit config.yaml script.py + +The output should now look similar to the following: + +.. parsed-literal:: + + This cluster consists of + 3 nodes in total + 6.0 CPU resources in total + + Tasks executed + 3425 tasks on xxx.xxx.xxx.xxx + 3834 tasks on xxx.xxx.xxx.xxx + 2741 tasks on xxx.xxx.xxx.xxx + +In this sample output, 3 nodes were started. If the output only shows 1 node, you may want to increase the ``secs`` in ``time.sleep(secs)`` to give Ray more time to start additional nodes. + +The Ray CLI offers additional functionality. For example, you can monitor the Ray cluster status with ``ray monitor config.yaml``, and you can connect to the cluster (ssh into the head node) with ``ray attach config.yaml``. For a full reference on the Ray CLI, please refer to :ref:`the cluster commands reference `. + +To finish, don't forget to shut down the cluster. Run the following command: + +.. code-block:: shell + + $ ray down -y config.yaml diff --git a/doc/source/cluster/reference.rst b/doc/source/cluster/reference.rst new file mode 100644 index 000000000..ad9388060 --- /dev/null +++ b/doc/source/cluster/reference.rst @@ -0,0 +1,11 @@ +.. _cluster-reference: + +Config YAML and CLI Reference +============================= + +.. toctree:: + :maxdepth: 2 + + config.rst + commands.rst + sdk.rst diff --git a/doc/source/cluster/sdk.rst b/doc/source/cluster/sdk.rst new file mode 100644 index 000000000..7238ee558 --- /dev/null +++ b/doc/source/cluster/sdk.rst @@ -0,0 +1,13 @@ +.. _ref-autoscaler-sdk: + +Autoscaler SDK +============== + +.. _ref-autoscaler-sdk-request-resources: + +ray.autoscaler.sdk.request_resources +------------------------------------ + +Within a Ray program, you can command the autoscaler to scale the cluster up to a desired size with ``request_resources()`` call. The cluster will immediately attempt to scale to accommodate the requested resources, bypassing normal upscaling speed constraints. + +.. autofunction:: ray.autoscaler.sdk.request_resources \ No newline at end of file diff --git a/doc/source/conf.py b/doc/source/conf.py index bdff928f7..b1a74f263 100644 --- a/doc/source/conf.py +++ b/doc/source/conf.py @@ -148,6 +148,7 @@ extensions = [ 'sphinx_gallery.gen_gallery', 'sphinxemoji.sphinxemoji', 'sphinx_copybutton', + 'sphinxcontrib.yt', 'versionwarning.extension', ] diff --git a/doc/source/dask-on-ray.rst b/doc/source/dask-on-ray.rst index 0530fdc4c..486dc9a1f 100644 --- a/doc/source/dask-on-ray.rst +++ b/doc/source/dask-on-ray.rst @@ -71,7 +71,7 @@ Here's an example: Why use Dask on Ray? 1. To take advantage of Ray-specific features such as the - :ref:`cluster launcher ` and + :ref:`launching cloud clusters ` and :ref:`shared-memory store `. 2. If you'd like to use Dask and Ray libraries in the same application without having two different clusters. 3. If you'd like to create data analyses using the familiar NumPy and Pandas APIs provided by Dask and execute them on a fast, fault-tolerant distributed task execution system geared towards production, like Ray. diff --git a/doc/source/index.rst b/doc/source/index.rst index e90b52299..182ff7ef7 100644 --- a/doc/source/index.rst +++ b/doc/source/index.rst @@ -231,11 +231,12 @@ Papers .. toctree:: :hidden: :maxdepth: -1 - :caption: Ray Cluster + :caption: Ray Clusters/Autoscaler cluster/index.rst - cluster/launcher.rst - cluster/autoscaling.rst + cluster/quickstart.rst + cluster/reference.rst + cluster/cloud.rst cluster/deploy.rst .. toctree:: diff --git a/doc/source/serve/deployment.rst b/doc/source/serve/deployment.rst index 1ab190595..ed397ec83 100644 --- a/doc/source/serve/deployment.rst +++ b/doc/source/serve/deployment.rst @@ -140,7 +140,7 @@ In order to deploy Ray Serve on Kubernetes, we need to do the following: 3. Start Ray Serve on the cluster. There are multiple ways to start a Ray cluster on Kubernetes, see :ref:`ray-k8s-deploy` for more information. -Here, we will be using the :ref:`Ray Cluster Launcher ` tool, which has support for Kubernetes as a backend. +Here, we will be using the :ref:`Ray Cluster Launcher ` tool, which has support for Kubernetes as a backend. The cluster launcher takes in a yaml config file that describes the cluster. Here, we'll be using the `Kubernetes default config`_ with a few small modifications. diff --git a/doc/source/starting-ray.rst b/doc/source/starting-ray.rst index 1791cc25b..b4bf4ce02 100644 --- a/doc/source/starting-ray.rst +++ b/doc/source/starting-ray.rst @@ -164,7 +164,7 @@ You can connect other nodes to the head node, creating a Ray cluster by also cal Launching a Ray cluster (``ray up``) ------------------------------------ -Ray clusters can be launched with the :ref:`Cluster Launcher `. +Ray clusters can be launched with the :ref:`Cluster Launcher `. The ``ray up`` command uses the Ray cluster launcher to start a cluster on the cloud, creating a designated "head node" and worker nodes. Underneath the hood, it automatically calls ``ray start`` to create a Ray cluster. Your code **only** needs to execute on one machine in the cluster (usually the head node). Read more about :ref:`running programs on a Ray cluster `. diff --git a/doc/source/tune/_tutorials/tune-distributed.rst b/doc/source/tune/_tutorials/tune-distributed.rst index 498576e5b..46b47e3bc 100644 --- a/doc/source/tune/_tutorials/tune-distributed.rst +++ b/doc/source/tune/_tutorials/tune-distributed.rst @@ -55,7 +55,7 @@ Launching a cloud cluster If you have already have a list of nodes, go to :ref:`tune-distributed-local`. -Ray currently supports AWS and GCP. Follow the instructions below to launch nodes on AWS (using the Deep Learning AMI). See the :ref:`cluster setup documentation `. Save the below cluster configuration (``tune-default.yaml``): +Ray currently supports AWS and GCP. Follow the instructions below to launch nodes on AWS (using the Deep Learning AMI). See the :ref:`cluster setup documentation `. Save the below cluster configuration (``tune-default.yaml``): .. literalinclude:: /../../python/ray/tune/examples/tune-default.yaml :language: yaml @@ -130,7 +130,7 @@ If you used a cluster configuration (starting a cluster with ``ray up`` or ``ray Syncing ------- -Tune automatically syncs the trial folder on remote nodes back to the head node. This requires the ray cluster to be started with the :ref:`cluster launcher `. +Tune automatically syncs the trial folder on remote nodes back to the head node. This requires the ray cluster to be started with the :ref:`cluster launcher `. By default, local syncing requires rsync to be installed. You can customize the sync command with the ``sync_to_driver`` argument in ``tune.SyncConfig`` by providing either a function or a string. If a string is provided, then it must include replacement fields ``{source}`` and ``{target}``, like ``rsync -savz -e "ssh -i ssh_key.pem" {source} {target}``. Alternatively, a function can be provided with the following signature: @@ -290,7 +290,7 @@ Upon a second run, this will restore the entire experiment state from ``~/path/t Common Commands --------------- -Below are some commonly used commands for submitting experiments. Please see the :ref:`Autoscaler page ` to see find more comprehensive documentation of commands. +Below are some commonly used commands for submitting experiments. Please see the :ref:`Autoscaler page ` to see find more comprehensive documentation of commands. .. code-block:: bash diff --git a/doc/source/tune/user-guide.rst b/doc/source/tune/user-guide.rst index 909ebbc9f..8dd636042 100644 --- a/doc/source/tune/user-guide.rst +++ b/doc/source/tune/user-guide.rst @@ -265,7 +265,7 @@ You can restore a single trial checkpoint by using ``tune.run(restore=` and also requires rsync to be installed. +On a multinode cluster, Tune automatically creates a copy of all trial checkpoints on the head node. This requires the Ray cluster to be started with the :ref:`cluster launcher ` and also requires rsync to be installed. Note that you must use the ``tune.checkpoint_dir`` API to trigger syncing. Also, if running Tune on Kubernetes, be sure to use the :ref:`KubernetesSyncer ` to transfer files between different pods.