From 981df65b91f9c55962c2efce55b6a4deefabd8e9 Mon Sep 17 00:00:00 2001 From: SangBin Cho Date: Tue, 1 Dec 2020 13:15:30 -0800 Subject: [PATCH] [Doc] Improve the placement group document (#12507) * Improve the placement group document. * Fix grammar. * Addressed code review. --- doc/examples/overview.rst | 5 +++ doc/examples/placement-group.rst | 58 +++++++++++++++++++++++++++++ doc/source/placement-group.rst | 63 ++++++++++++++++++++++++++++---- 3 files changed, 119 insertions(+), 7 deletions(-) create mode 100644 doc/examples/placement-group.rst diff --git a/doc/examples/overview.rst b/doc/examples/overview.rst index 738923e96..f69fecb4e 100644 --- a/doc/examples/overview.rst +++ b/doc/examples/overview.rst @@ -18,6 +18,7 @@ Ray Examples testing-tips.rst progress_bar.rst plot_streaming.rst + placement-group.rst .. customgalleryitem:: :tooltip: Tips for first time users. @@ -36,6 +37,10 @@ Ray Examples :tooltip: Implement a simple streaming application using Ray’s actors. :description: :doc:`/auto_examples/plot_streaming` +.. customgalleryitem:: + :tooltip: Learn placement group use cases with examples. + :description: :doc:`/auto_examples/placement-group` + .. raw:: html diff --git a/doc/examples/placement-group.rst b/doc/examples/placement-group.rst new file mode 100644 index 000000000..3e833dd06 --- /dev/null +++ b/doc/examples/placement-group.rst @@ -0,0 +1,58 @@ +Placement Group Examples +======================== + +.. _ray-placement-group-examples-ref: + +Ray placement groups are an advanced feature that helps you to optimize the performance and improve the resiliency (fault tolerance) of your distributed applications. + +This page assumes you already have basic knowledge about the placement group APIs. To learn more about APIs, go to the :ref:`placement group document `. + +Colocation +---------- +**Recommended Strategy**: `STRICT_PACK`. + +Colocation helps you to maximize data locality and minimize communication costs. + +When a Ray application runs in a multi-node cluster, inter-node communication for tasks/actors is much more expensive than inter-process communication within the same node. +Here are possible system overheads when your tasks and actors are located in a different node. Imagine you have two Ray primitives (task or actor) A (in a node NA) and B (in a node NB). + +- **Object transfer cost**: When A accesses an object created by B (O), the O is not immediately available to A. Here, Ray fetches the O from NB to NA. That says, there will be a network cost to transfer the object and latency overhead to wait for the object to be fetched. The overhead linearly grows to the object size that needs to be transferred. +- **Network cost**: When A and B communicate with each other, there is additional networking overhead since A and B are located in a different node. + +In some cases, this type of overhead can have a significant impact on your application's performance. For example, imagine you'd like to read +large size dataset using S3 and distribute it to multiple tasks or actors. + +You can minimize the overhead using the placement group's `STRICT_PACK` strategy. `STRICT_PACK` guarantees that all the tasks and actors are located in the same node. + +Gang Scheduling +--------------- +**Recommended Strategy**: `STRICT_SPREAD`. + +Sometimes, you'd like to schedule multiple tasks/actors in a separate physical machine (node) "at the same time". For example, "write the gang scheduling example". + +You can use placement groups' `STRICT_SPREAD` strategy to achieve it. `STRICT_SPREAD` ensures that all actors and tasks scheduled with the placement group will be located in a separate node. + +Improve Fault tolerance +----------------------- +**Recommended Strategy**: `SPREAD`, `STRICT_SPREAD`. + +Imagine you have a set of stateless actors that are used for serving requests. Ray Serve is one of the examples. + +In this case, it is common to have multiple replicas of the serving actor to improve the throughput of the applications. But what if every actor is located in the same node? +It means that if that specific node fails, the application will be crashed, or the throughput becomes much lower until all of the dead actors are backed up. + +To avoid this, you can use either the `SPREAD` or `STRICT_SPREAD` strategy. +It ensures the most replica actors are located in a different node, which improves the fault tolerance of your applications. + +.. note:: + + If `STRICT` strategies are not absolutely necessary, it is encouraged to use just `SPREAD` or `PACK` strategies. Strict strategies can lead to lower resource utilization and is harder to schedule. + +Load Balancing +-------------- +**Recommended Strategy**: `SPREAD`, `STRICT_SPREAD`. + +Imagine you have a set of actors that read a large dataset from S3. It means all these actors' performance is bound to the network bandwidth of physical machines. +At this time, if you start multiple actors in the same physical node, it won't improve the read/write throughput. + +Instead, you can use the placement group's `SPREAD` or `STRICT_SPREAD` strategy to ensure all S3 reader actors are located in a different physical node. diff --git a/doc/source/placement-group.rst b/doc/source/placement-group.rst index 19118fdba..de26b0498 100644 --- a/doc/source/placement-group.rst +++ b/doc/source/placement-group.rst @@ -1,6 +1,8 @@ Placement Groups ================ +.. _ray-placement-group-doc-ref: + Placement groups allow users to atomically reserve groups of resources across multiple nodes (i.e., gang scheduling). They can be then used to schedule Ray tasks and actors to be packed as close as possible for locality (PACK), or spread apart (SPREAD). Here are some use cases: @@ -9,6 +11,7 @@ Here are some use cases: - **Maximizing data locality**: You'd like to place or schedule your tasks and actors close to your data to avoid object transfer overheads. - **Load balancing**: To improve application availability and avoid resource overload, you'd like to place your actors or tasks into different physical machines as much as possible. +To learn more about production use cases, check out the :ref:`examples `. Key Concepts ------------ @@ -192,10 +195,40 @@ Now let's define an actor that uses GPU. We'll also define a task that use ``ext Now, you can guarantee all gpu actors and extra_resource tasks are located on the same node because they are scheduled on a placement group with the STRICT_PACK strategy. -Note that you must remove the placement group once you are finished with your application. -Workers of actors and tasks that are scheduled on placement group will be all killed. +.. note:: -.. warning:: Do not lose the reference to the placement group - you will not be able to remove it. This behavior will change in a later release. + In order to fully utilize resources pre-reserved by the placement group, + Ray automatically schedules children tasks/actors to the same placement group as its parent. + + .. code-block:: python + + # Create a placement group with the STRICT_SPREAD strategy. + pg = placement_group([{"CPU": 2}, {"CPU": 2}], strategy="STRICT_SPREAD") + ray.get(pg.ready()) + + @ray.remote + def child(): + pass + + @ray.remote + def parent(): + # The child task is scheduled with the same placement group as its parent + # although child.options(placement_group=pg).remote() wasn't called. + ray.get(child.remote()) + + ray.get(parent.options(placement_group=pg).remote()) + + To avoid it, you should specify `options(placement_group=None)` in a child task/actor remote call. + + .. code-block:: python + + @ray.remote + def parent(): + # In this case, the child task won't be + # scheduled with the parent's placement group. + ray.get(child.options(placement_group=None).remote()) + +Note that you can anytime remove the placement group to clean up resources. .. code-block:: python @@ -218,16 +251,32 @@ Workers of actors and tasks that are scheduled on placement group will be all ki ray.shutdown() +Tips for Using Placement Groups +------------------------------- +- Learn the :ref:`lifecycle ` of placement groups. +- Learn the :ref:`fault tolerance ` of placement groups. +- See more :ref:`examples ` to learn real world use cases of placement group APIs. + Lifecycle --------- -When placement group is first created, the request is sent to the GCS. The GCS reserve resources to nodes based on its scheduling strategy. Ray guarantees the atomic creation of placement group. +.. _ray-placement-group-lifecycle-ref: -Placement groups are pending creation if there are no nodes that can satisfy resource requirements for a given strategy. The Ray Autoscaler will be aware of placement groups, and auto-scale the cluster to ensure pending groups can be placed as needed. +**Creation**: When placement groups are first created, the request is sent to the GCS. The GCS sends resource reservation requests to nodes based on its scheduling strategy. Ray guarantees placement groups are placed atomically. -If nodes that contain some bundles of a placement group die, bundles will be rescheduled on different nodes by GCS. This means that the initial creation of placement group is "atomic", but once it is created, there could be partial placement groups. +**Autoscaling**: Placement groups are pending creation if there are no nodes that can satisfy resource requirements for a given strategy. The Ray Autoscaler will be aware of placement groups, and auto-scale the cluster to ensure pending groups can be placed as needed. -Placement groups are tolerant to worker nodes failures (bundles on dead nodes are rescheduled). However, placement groups are currently unable to tolerate head node failures (GCS failures). +**Cleanup**: Placement groups are automatically removed when the job that created the placement group is finished. The only exception is that it is created by detached actors. In this case, placement groups fate-share with the detached actors. + + +Fault Tolerance +--------------- + +.. _ray-placement-group-ft-ref: + +If nodes that contain some bundles of a placement group die, all the bundles will be rescheduled on different nodes by GCS. This means that the initial creation of placement group is "atomic", but once it is created, there could be partial placement groups. + +Placement groups are tolerant to worker nodes failures (bundles on dead nodes are rescheduled). However, placement groups are currently unable to tolerate head node failures (GCS failures), which is a single point of failure of Ray. API Reference -------------