From 6b93ad11d01d2b971908e6d1b548d8f044480b64 Mon Sep 17 00:00:00 2001 From: Simon Mo Date: Thu, 20 Aug 2020 11:40:47 -0700 Subject: [PATCH] [Doc] Add Architecture Doc for Ray Serve (#10204) --- doc/source/fault-tolerance.rst | 2 + doc/source/index.rst | 1 + doc/source/serialization.rst | 2 + doc/source/serve/advanced.rst | 6 +- doc/source/serve/architecture.rst | 92 +++++++++++++++++++++++++++++++ doc/source/serve/architecture.svg | 1 + doc/source/serve/deployment.rst | 2 + 7 files changed, 105 insertions(+), 1 deletion(-) create mode 100644 doc/source/serve/architecture.rst create mode 100644 doc/source/serve/architecture.svg diff --git a/doc/source/fault-tolerance.rst b/doc/source/fault-tolerance.rst index 9991e56c0..fa3abb094 100644 --- a/doc/source/fault-tolerance.rst +++ b/doc/source/fault-tolerance.rst @@ -41,6 +41,8 @@ You can experiment with this behavior by running the following code. except ray.exceptions.RayWorkerError: print('FAILURE') +.. _actor-fault-tolerance: + Actors ------ diff --git a/doc/source/index.rst b/doc/source/index.rst index cf5a1bb8d..7a7d99ed7 100644 --- a/doc/source/index.rst +++ b/doc/source/index.rst @@ -153,6 +153,7 @@ Academic Papers serve/tutorials/index.rst serve/deployment.rst serve/advanced.rst + serve/architecture.rst serve/package-ref.rst .. toctree:: diff --git a/doc/source/serialization.rst b/doc/source/serialization.rst index 0b53f9a5b..732e54797 100644 --- a/doc/source/serialization.rst +++ b/doc/source/serialization.rst @@ -5,6 +5,8 @@ Serialization Since Ray processes do not share memory space, data transferred between workers and nodes will need to **serialized** and **deserialized**. Ray uses the `Plasma object store `_ to efficiently transfer objects across different processes and different nodes. Numpy arrays in the object store are shared between workers on the same node (zero-copy deserialization). +.. _plasma-store: + Plasma Object Store ------------------- diff --git a/doc/source/serve/advanced.rst b/doc/source/serve/advanced.rst index 9e6c7c539..ebf4fa62f 100644 --- a/doc/source/serve/advanced.rst +++ b/doc/source/serve/advanced.rst @@ -188,6 +188,8 @@ The shard key can either be specified via the X-SERVE-SHARD-KEY HTTP header or ` handle = serve.get_handle("api_endpoint") handler.options(shard_key=session_id).remote(args) +.. _serve-shadow-testing: + Shadow Testing -------------- @@ -220,6 +222,8 @@ This is demonstrated in the example below, where we create an endpoint serviced serve.shadow_traffic("shadowed_endpoint", "new_backend_1", 0) serve.shadow_traffic("shadowed_endpoint", "new_backend_2", 0) +.. _serve-model-composition: + Composing Multiple Models ========================= Ray Serve supports composing individually scalable models into a single model @@ -286,7 +290,7 @@ How do I call an endpoint from Python code? use the following to get a "handle" to that endpoint. .. code-block:: python - + handle = serve.get_handle("api_endpoint") diff --git a/doc/source/serve/architecture.rst b/doc/source/serve/architecture.rst new file mode 100644 index 000000000..854063d0f --- /dev/null +++ b/doc/source/serve/architecture.rst @@ -0,0 +1,92 @@ +Serve Architecture +================== +This document provides an overview of how each component in Serve works. + +.. image:: architecture.svg + :align: center + :width: 600px + +High Level View +--------------- + +Serve runs on Ray and utilizes :ref:`Ray actors`. + +There are three kinds of actors that are created to make up a Serve instance: + +- Controller: A global actor unique to each :ref:`Serve instance ` that manages + the control plane. The Controller is responsible for creating, updating, and + destroying other actors. Serve API calls like :mod:`serve.create_backend `, + :mod:`serve.create_endpoint ` make remote calls to the Controller. +- Router: There is one router per node. Each router is a `Uvicorn `_ HTTP + server that accepts incoming requests, forwards them to the worker replicas, and + responds once they are completed. +- Worker Replica: Worker replicas actually execute the code in response to a + request. For example, they may contain an instantiation of an ML model. Each + replica processes individual requests or batches of requests from the routers. + + +Lifetime of a Request +--------------------- +When an HTTP request is sent to the router, the follow things happen: + +- The HTTP request is received and parsed. +- The correct :ref:`endpoint ` associated with the HTTP url path is looked up. +- One or more :ref:`backends ` is selected to handle the request given the :ref:`traffic + splitting ` and :ref:`shadow testing ` rules. The requests for each backend + are placed on a queue. +- For each request in a backend queue, an available worker replica is looked up + and the request is sent to it. If there are no available worker replicas (there + are more than ``max_concurrent_queries`` requests outstanding), the request + is left in the queue until an outstanding request is finished. + +Each worker maintains a queue of requests and processes one batch of requests at +a time. By default the batch size is 1, you can increase the batch size to +increase throughput. If the handler (the function for the backend or +``__call__``) is ``async``, worker will not wait for the handler to run; +otherwise, worker will block until the handler returns. + +FAQ +--- +How does Serve handle fault tolerance? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Application errors like exceptions in your model evaluation code is catched and +wrapped. A 500 status code will be returned with the traceback information. The +worker replica will be able to continue to handle requests. + +Machine errors and faults will be handled by Ray. Serve utilizes the :ref:`actor +reconstruction ` capability. For example, when a machine hosting any of the +actors crashes, those actors will be automatically restarted on another +available machine. All data in the Controller (routing policies, backend +configurations, etc) is checkpointed to the Ray. Transient data in the +router and the worker replica (like network connections and internal request +queues) will be lost upon failure. + +How does Serve ensure horizontal scalability and availability? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Serve starts one router per node. Each router will bind the same port. You +should be able to reach Serve and send requests to any models via any of the +servers. + +This architecture ensures horizontal scalability for Serve. You can scale the +router by adding more nodes and scale the model workers by increasing the number +of replicas. + +How do ServeHandles work? +^^^^^^^^^^^^^^^^^^^^^^^^^ + +:mod:`ServeHandles ` wrap a handle to the router actor on the same node. When a +request is sent from one via worker replica to another via the handle, the +requests go through the same data path as incoming HTTP requests. This enables +the same backend selection and batching procedures to happen. ServeHandles are +often used to implement :ref:`model composition `. + + +What happens to large requests? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Serve utilizes Ray’s :ref:`shared memory object store ` and in process memory +store. Small request objects are directly sent between actors via network +call. Larger request objects (100KiB+) are written to a distributed shared +memory store and the worker can read them via zero-copy read. diff --git a/doc/source/serve/architecture.svg b/doc/source/serve/architecture.svg new file mode 100644 index 000000000..93edd9864 --- /dev/null +++ b/doc/source/serve/architecture.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/doc/source/serve/deployment.rst b/doc/source/serve/deployment.rst index 8686193b5..223d11a7a 100644 --- a/doc/source/serve/deployment.rst +++ b/doc/source/serve/deployment.rst @@ -278,6 +278,8 @@ opt for launching a Ray Cluster locally. Specify a Ray cluster like we did in :r To learn more, in general, about Ray Clusters see :ref:`cluster-index`. +.. _serve-instance: + Deploying Multiple Serve Instances on a Single Ray Cluster ----------------------------------------------------------