diff --git a/doc/source/serve/advanced.rst b/doc/source/serve/advanced.rst
index fdf2684e7..5ea7b8dae 100644
--- a/doc/source/serve/advanced.rst
+++ b/doc/source/serve/advanced.rst
@@ -29,8 +29,10 @@ To scale out a backend to many instances, simply configure the number of replica
 
 This will scale up or down the number of replicas that can accept requests.
 
-Using Resources (CPUs, GPUs)
-============================
+.. _`serve-cpus-gpus`:
+
+Resource Management (CPUs, GPUs)
+================================
 
 To assign hardware resources per replica, you can pass resource requirements to
 ``ray_actor_options``.
@@ -89,7 +91,7 @@ If you *do* want to enable this parallelism in your Serve backend, just set OMP_
 
 .. _serve-batching:
 
-Batching to improve performance
+Batching to Improve Performance
 ===============================
 
 You can also have Ray Serve batch requests for performance. In order to do use this feature, you need to:
diff --git a/doc/source/serve/index.rst b/doc/source/serve/index.rst
index af64475a3..8e731120d 100644
--- a/doc/source/serve/index.rst
+++ b/doc/source/serve/index.rst
@@ -11,79 +11,78 @@ Ray Serve: Scalable and Programmable Serving
 
 .. _rayserve-overview:
 
-Ray Serve is a scalable model-serving library built on Ray.
+Ray Serve is an easy-to-use scalable model serving library built on Ray. Ray Serve is:
 
-For users, Ray Serve is:
-
-- **Framework Agnostic**: Use the same toolkit to serve everything from deep learning models
+- **Framework-agnostic**: Use a single toolkit to serve everything from deep learning models
   built with frameworks like :ref:`PyTorch `,
-  :ref:`Tensorflow, and Keras `, to :ref:`Scikit-Learn ` models, to arbitrary business logic.
-- **Python First**: Configure your model serving with pure Python code - no more YAML or
+  :ref:`Tensorflow, and Keras `, to :ref:`Scikit-Learn ` models, to arbitrary Python business logic.
+- **Python-first**: Configure your model serving with pure Python code---no more YAML or
   JSON configs.
 
-As a library, Ray Serve enables:
+Since Ray Serve is built on Ray, it allows you to easily scale to many machines, both in your datacenter and in the cloud.
 
-- :ref:`serve-split-traffic` with zero downtime, by decoupling routing logic from response handling logic.
-- :ref:`serve-batching` is built in to help you meet your performance objectives. You can also use your model for batch and online processing.
-- Because Serve is a library, it's easy to integrate it with other tools in your environment, such as CI/CD.
+Ray Serve can be used in two primary ways to deploy your models at scale:
+
+1. Have Python functions and classes automatically placed behind HTTP endpoints.
+
+2. Alternatively, call them from within your existing Python web server using the Python-native :ref:`servehandle-api` .
 
-Since Serve is built on Ray, it also allows you to scale to many machines, in your datacenter or in cloud environments, and it allows you to leverage all of the other Ray frameworks.
 
 .. note::
-  If you want to try out Serve, join our `community slack `_
-  and discuss in the #serve channel.
+  Chat with Ray Serve users and developers on our `community Slack `_ in the #serve channel and on our `forum `_!
 
 .. note::
-  Starting with Ray version 1.3.0, Ray Serve backends must take in a Starlette Request object instead of a Flask Request object.
+  Starting with Ray version 1.2.0, Ray Serve backends take in a Starlette Request object instead of a Flask Request object.
   See the `migration guide `_ for details.
 
-Installation
-============
+Ray Serve Quickstart
+====================
 
-Ray Serve supports Python versions 3.6 through 3.8. To install Ray Serve:
+Ray Serve supports Python versions 3.6 through 3.8. To install Ray Serve, run the following command:
 
 .. code-block:: bash
 
   pip install "ray[serve]"
 
-Ray Serve in 90 Seconds
-=======================
-
-Serve a function by defining a function, an endpoint, and a backend (in this case a stateless function) then
-connecting the two by setting traffic from the endpoint to the backend.
+Now you can serve a function...
 
 .. literalinclude:: ../../../python/ray/serve/examples/doc/quickstart_function.py
-  :lines: 2-4,6-7,9-
 
-Serve a stateful class by defining a class (``Counter``), creating an endpoint and a backend, then connecting
-the two by setting traffic from the endpoint to the backend.
+
+...or serve a stateful class.
 
 .. literalinclude:: ../../../python/ray/serve/examples/doc/quickstart_class.py
-  :lines: 2-4,6-7,9-
 
-See :doc:`key-concepts` for more exhaustive coverage about Ray Serve and its core concepts.
+
+See :doc:`key-concepts` for more exhaustive coverage about Ray Serve and its core concepts: backends and endpoints.
+For a high-level view of the architecture underlying Ray Serve, see :doc:`architecture`.
 
 Why Ray Serve?
 ==============
 
 There are generally two ways of serving machine learning applications, both with serious limitations:
-you can build using a **traditional webserver** - your own Flask app or you can use a cloud hosted solution.
+you can use a **traditional web server**---your own Flask app---or you can use a cloud-hosted solution.
 The first approach is easy to get started with, but it's hard to scale each component. The second approach
-requires vendor lock-in (SageMaker), framework specific tooling (TFServing), and a general
+requires vendor lock-in (SageMaker), framework-specific tooling (TFServing), and a general
 lack of flexibility.
 
-Ray Serve solves these problems by giving a user the ability to leverage the simplicity
-of deployment of a simple webserver but handles the complex routing, scaling, and testing logic
+Ray Serve solves these problems by giving you a simple web server (and the ability to use your own) while still handling the complex routing, scaling, and testing logic
 necessary for production deployments.
 
-For more on the motivation behind Ray Serve, check out these `meetup slides `_.
+Beyond scaling up your backends with multiple replicas, Ray Serve also enables:
+
+- :ref:`serve-split-traffic` with zero downtime, by decoupling routing logic from response-handling logic.
+- :ref:`serve-batching`---built in to help you meet your performance objectives.
+- :ref:`serve-cpus-gpus`---specify fractional resource requirements to fully saturate each of your GPUs with several models.
+
+For more on the motivation behind Ray Serve, check out these `meetup slides `_ and this `blog post `_.
 
 When should I use Ray Serve?
 ----------------------------
 
-Ray Serve is a simple (but flexible) tool for deploying, operating, and monitoring Python based machine learning models.
-Ray Serve excels when scaling out to serve models in production is a necessity. This might be because of large scale batch processing
+Ray Serve is a simple (but flexible) tool for deploying, operating, and monitoring Python-based machine learning models.
+Ray Serve excels when scaling out to serve models in production is a necessity. This might be because of large-scale batch processing
 requirements or because you're going to serve a number of models behind different endpoints and may need to run A/B tests or
 control traffic between different models.
@@ -93,6 +92,12 @@ What's next?
 ============
 
 Check out the :doc:`key-concepts`, learn more about :doc:`advanced`, look at the :ref:`serve-faq`,
-or head over to the :doc:`tutorials/index` to get started building your Ray Serve Applications.
+or head over to the :doc:`tutorials/index` to get started building your Ray Serve applications.
+
+For more, see the following blog posts about Ray Serve:
+
+- `How to Scale Up Your FastAPI Application Using Ray Serve `_ by Archit Kulkarni
+- `Machine Learning is Broken `_ by Simon Mo
+- `The Simplest Way to Serve your NLP Model in Production with Pure Python `_ by Edward Oakes and Bill Chambers
diff --git a/doc/source/serve/package-ref.rst b/doc/source/serve/package-ref.rst
index e64794500..b7014ab45 100644
--- a/doc/source/serve/package-ref.rst
+++ b/doc/source/serve/package-ref.rst
@@ -17,8 +17,10 @@ Backend Configuration
 
 .. autoclass:: ray.serve.CondaEnv
 
-Handle API
-----------
+.. _`servehandle-api`:
+
+ServeHandle API
+---------------
 
 .. autoclass:: ray.serve.handle.RayServeHandle
     :members: remote, options
diff --git a/python/ray/serve/examples/doc/quickstart_class.py b/python/ray/serve/examples/doc/quickstart_class.py
index d4238ea6a..fbdec6c7a 100644
--- a/python/ray/serve/examples/doc/quickstart_class.py
+++ b/python/ray/serve/examples/doc/quickstart_class.py
@@ -2,7 +2,7 @@ import ray
 from ray import serve
 import requests
 
-ray.init(num_cpus=8)
+ray.init()
 
 client = serve.start()
 
@@ -10,13 +10,17 @@ class Counter:
     def __init__(self):
         self.count = 0
 
-    def __call__(self, starlette_request):
+    def __call__(self, request):
         self.count += 1
-        return {"current_counter": self.count}
+        return {"count": self.count}
 
 
-client.create_backend("counter", Counter)
-client.create_endpoint("counter", backend="counter", route="/counter")
+# Form a backend from our class and connect it to an endpoint.
+client.create_backend("my_backend", Counter)
+client.create_endpoint("my_endpoint", backend="my_backend", route="/counter")
 
+# Query our endpoint in two different ways: from HTTP and from Python.
 print(requests.get("http://127.0.0.1:8000/counter").json())
-# > {"current_counter": 1}
+# > {"count": 1}
+print(ray.get(client.get_handle("my_endpoint").remote()))
+# > {"count": 2}
diff --git a/python/ray/serve/examples/doc/quickstart_function.py b/python/ray/serve/examples/doc/quickstart_function.py
index 81ae4b7f1..4d14dd8b0 100644
--- a/python/ray/serve/examples/doc/quickstart_function.py
+++ b/python/ray/serve/examples/doc/quickstart_function.py
@@ -2,16 +2,20 @@ import ray
 from ray import serve
 import requests
 
-ray.init(num_cpus=8)
+ray.init()
 
 client = serve.start()
 
-def echo(starlette_request):
-    return "hello " + starlette_request.query_params.get("name", "serve!")
+def say_hello(request):
+    return "hello " + request.query_params["name"] + "!"
 
 
-client.create_backend("hello", echo)
-client.create_endpoint("hello", backend="hello", route="/hello")
+# Form a backend from our function and connect it to an endpoint.
+client.create_backend("my_backend", say_hello)
+client.create_endpoint("my_endpoint", backend="my_backend", route="/hello")
 
-print(requests.get("http://127.0.0.1:8000/hello").text)
+# Query our endpoint in two different ways: from HTTP and from Python.
+print(requests.get("http://127.0.0.1:8000/hello?name=serve").text)
+# > hello serve!
+print(ray.get(client.get_handle("my_endpoint").remote(name="serve")))
 # > hello serve!
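
Note on the `quickstart_function.py` change: the old `echo` read the `name` query parameter with a default, while the new `say_hello` indexes `query_params` directly, so `name` becomes a required parameter. The sketch below illustrates that difference without starting Ray. It assumes a plain `dict` behaves like the request's `query_params` mapping for these two operations; `fake_request` is a hypothetical stand-in introduced only for this illustration and is not part of the PR or of Ray Serve.

```python
from types import SimpleNamespace


def echo(request):
    # Old version from the diff: falls back to "serve!" when "name" is absent.
    return "hello " + request.query_params.get("name", "serve!")


def say_hello(request):
    # New version from the diff: indexes directly, so "name" must be present.
    return "hello " + request.query_params["name"] + "!"


def fake_request(**params):
    # Hypothetical stand-in for a request object: only exposes .query_params,
    # backed by a plain dict (an assumption made for this sketch).
    return SimpleNamespace(query_params=dict(params))


with_name = fake_request(name="serve")
without_name = fake_request()

print(say_hello(with_name))   # hello serve!
print(echo(without_name))     # hello serve!

try:
    say_hello(without_name)
except KeyError as err:
    # Direct indexing raises when the parameter is missing.
    print("missing query parameter:", err)
```

Under that assumption, a request to `/hello` without `?name=...` would no longer return a greeting, which is consistent with the updated example always passing `name=serve` in both the HTTP and the handle call.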