diff --git a/doc/source/serve/index.rst b/doc/source/serve/index.rst
index 8e731120d..6ed26d89a 100644
--- a/doc/source/serve/index.rst
+++ b/doc/source/serve/index.rst
@@ -25,10 +25,11 @@ Ray Serve can be used in two primary ways to deploy your models at scale:
 
 1. Have Python functions and classes automatically placed behind HTTP endpoints.
 
-2. Alternatively, call them from within your existing Python web server using the Python-native :ref:`servehandle-api` .
+2. Alternatively, call them from :ref:`within your existing Python web server <serve-web-server-integration-tutorial>` using the Python-native :ref:`servehandle-api`.
 
 
-.. note::
+
+.. tip::
   Chat with Ray Serve users and developers on our `community Slack `_ in the #serve channel and on our `forum `_!
 
 .. note::
@@ -67,7 +68,7 @@ The first approach is easy to get started with, but it's hard to scale each comp
 requires vendor lock-in (SageMaker), framework-specific tooling (TFServing), and a general lack of flexibility.
 
 
-Ray Serve solves these problems by giving you a simple web server (and the ability to use your own) while still handling the complex routing, scaling, and testing logic
+Ray Serve solves these problems by giving you a simple web server (and the ability to :ref:`use your own <serve-web-server-integration-tutorial>`) while still handling the complex routing, scaling, and testing logic
 necessary for production deployments.
 
 Beyond scaling up your backends with multiple replicas, Ray Serve also enables:
diff --git a/doc/source/serve/tutorials/index.rst b/doc/source/serve/tutorials/index.rst
index 34380019f..375be165b 100644
--- a/doc/source/serve/tutorials/index.rst
+++ b/doc/source/serve/tutorials/index.rst
@@ -2,7 +2,7 @@ Tutorials
 =========
 
 
-Below are a list of tutorials that you can use to learn more about the different pieces of
+Below is a list of tutorials that you can use to learn more about the different pieces of
 Ray Serve functionality and how to integrate different modeling frameworks.
 
 .. toctree::
@@ -14,7 +14,9 @@ Ray Serve functionality and how to integrate different modeling frameworks.
     pytorch.rst
     sklearn.rst
     batch.rst
+    web-server-integration.rst
 
 
 Other Topics:
+
 - :doc:`../deployment`
\ No newline at end of file
diff --git a/doc/source/serve/tutorials/web-server-integration.rst b/doc/source/serve/tutorials/web-server-integration.rst
new file mode 100644
index 000000000..2d308aef6
--- /dev/null
+++ b/doc/source/serve/tutorials/web-server-integration.rst
@@ -0,0 +1,81 @@
+.. _serve-web-server-integration-tutorial:
+
+Integration with Existing Web Servers
+=====================================
+
+In this guide, you will learn how to use Ray Serve to scale up your existing web application. The key feature of Ray Serve that makes this possible is the Python-native :ref:`servehandle-api`, which allows you to keep using your existing Python web server while offloading your heavy computation to Ray Serve.
+
+We give two examples, one using a `FastAPI `__ web server and another using an `AIOHTTP `__ web server, but the same approach will work with any Python web server.
+
+
+Scaling Up a FastAPI Application
+--------------------------------
+
+For this example, you must have either `PyTorch `_ or `TensorFlow `_ installed, as well as `Hugging Face Transformers `_ and `FastAPI `_. For example:
+
+.. code-block:: bash
+
+    pip install "ray[serve]" tensorflow transformers fastapi
+
+Here’s a simple FastAPI web server. It uses Hugging Face Transformers to auto-generate text based on a short initial input using `OpenAI’s GPT-2 model `_.
+
+.. literalinclude:: ../../../../python/ray/serve/examples/doc/fastapi/fastapi.py
+
+To scale this up, we define a Ray Serve backend containing our text model and call it from Python using a ServeHandle:
+
+.. literalinclude:: ../../../../python/ray/serve/examples/doc/fastapi/servehandle_fastapi.py
+
+To run this example, save it as ``main.py`` and then in the same directory, run the following commands to start a local Ray cluster on your machine and run the FastAPI application:
+
+.. code-block:: bash
+
+    ray start --head
+    uvicorn main:app
+
+Now you can query your web server, for example by running the following in another terminal:
+
+.. code-block:: bash
+
+    curl "http://127.0.0.1:8000/generate?query=Hello%20friend%2C%20how"
+
+The terminal should then print the generated text:
+
+.. code-block:: bash
+
+    [{"generated_text":"Hello friend, how's your morning?\n\nSven: Thank you.\n\nMRS. MELISSA: I feel like it really has done to you.\n\nMRS. MELISSA: The only thing I"}]
+
+To clean up the Ray cluster, run ``ray stop`` in the terminal.
+
+.. tip::
+    According to the backend configuration parameter ``num_replicas``, Ray Serve will place multiple replicas of your model across multiple CPU cores and multiple machines (provided you have :ref:`started a multi-node Ray cluster `), which will correspondingly multiply your throughput.
+
+Scaling Up an AIOHTTP Application
+---------------------------------
+
+In this section, we'll integrate Ray Serve with an `AIOHTTP `_ web server run using `Gunicorn `_. You'll need to install AIOHTTP and Gunicorn with the command ``pip install aiohttp gunicorn``.
+
+First, here is the script that deploys Ray Serve:
+
+.. literalinclude:: ../../../../python/ray/serve/examples/doc/aiohttp/aiohttp_deploy_serve.py
+
+Next is the script that defines the AIOHTTP server:
+
+.. literalinclude:: ../../../../python/ray/serve/examples/doc/aiohttp/aiohttp_app.py
+
+Here's how to run this example:
+
+1. Run ``ray start --head`` to start a local Ray cluster in the background.
+
+2. In the directory where the example files are saved, run ``python deploy_serve.py`` to deploy our Ray Serve endpoint.
+
+.. note::
+    Because we have omitted the keyword argument ``route`` in ``client.create_endpoint()``, our endpoint will not be exposed over HTTP by Ray Serve.
+
+3. Run ``gunicorn aiohttp_app:app --worker-class aiohttp.GunicornWebWorker --bind localhost:8001`` to start the AIOHTTP app using Gunicorn. We bind to port 8001 because the Ray Dashboard is already using port 8000 by default.
+
+.. tip::
+    You can change the Ray Dashboard port with the command ``ray start --dashboard-port XXXX``.
+
+4. To test out the server, run ``curl localhost:8001/dummy-model``. This should output ``Model received data: dummy input``.
+
+5. For cleanup, you can press Ctrl-C to stop the Gunicorn server, and run ``ray stop`` to stop the background Ray cluster.
diff --git a/python/ray/serve/api.py b/python/ray/serve/api.py
index 483917182..564a29fc5 100644
--- a/python/ray/serve/api.py
+++ b/python/ray/serve/api.py
@@ -523,7 +523,7 @@ class Client:
 
 
 def start(detached: bool = False,
-          http_host: str = DEFAULT_HTTP_HOST,
+          http_host: Optional[str] = DEFAULT_HTTP_HOST,
           http_port: int = DEFAULT_HTTP_PORT,
           http_middlewares: List[Any] = []) -> Client:
     """Initialize a serve instance.
@@ -537,8 +537,8 @@ def start(detached: bool = False,
     Args:
         detached (bool): Whether not the instance should be detached from this
             script.
-        http_host (str): Host for HTTP servers to listen on. Defaults to
-            "127.0.0.1". To expose Serve publicly, you probably want to set
+        http_host (str, optional): Host for HTTP servers to listen on. Defaults
+            to "127.0.0.1". To expose Serve publicly, you probably want to set
             this to "0.0.0.0". One HTTP server will be started on each node in
             the Ray cluster. To not start HTTP servers, set this to None.
         http_port (int): Port for HTTP server. Defaults to 8000.
diff --git a/python/ray/serve/examples/doc/aiohttp/aiohttp_app.py b/python/ray/serve/examples/doc/aiohttp/aiohttp_app.py
new file mode 100644
index 000000000..38cf1ab23
--- /dev/null
+++ b/python/ray/serve/examples/doc/aiohttp/aiohttp_app.py
@@ -0,0 +1,28 @@
+# File name: aiohttp_app.py
+from aiohttp import web
+
+import ray
+from ray import serve
+
+# Connect to the running Ray cluster.
+ray.init(address="auto")
+
+# Connect to the running Ray Serve instance.
+client = serve.connect()
+
+my_handle = client.get_handle("my_endpoint")  # Returns a ServeHandle object.
+
+
+# Define our AIOHTTP request handler.
+async def handle_request(request):
+    # Offload the computation to our Ray Serve backend.
+    result = await my_handle.remote("dummy input")
+    return web.Response(text=result)
+
+
+# Set up an HTTP endpoint.
+app = web.Application()
+app.add_routes([web.get("/dummy-model", handle_request)])
+
+if __name__ == "__main__":
+    web.run_app(app)
diff --git a/python/ray/serve/examples/doc/aiohttp/aiohttp_deploy_serve.py b/python/ray/serve/examples/doc/aiohttp/aiohttp_deploy_serve.py
new file mode 100644
index 000000000..194e91458
--- /dev/null
+++ b/python/ray/serve/examples/doc/aiohttp/aiohttp_deploy_serve.py
@@ -0,0 +1,21 @@
+# File name: deploy_serve.py
+import ray
+from ray import serve
+
+# Connect to the running Ray cluster.
+ray.init(address="auto")
+
+# Start a detached Ray Serve instance. It will persist after the script exits.
+client = serve.start(http_host=None, detached=True)
+
+
+# Define a function to serve. Alternatively, you could define a stateful class.
+async def my_model(request):
+    data = await request.body()
+    return f"Model received data: {data}"
+
+
+# Set up a backend with the desired number of replicas and set up an endpoint.
+backend_config = serve.BackendConfig(num_replicas=2)
+client.create_backend("my_backend", my_model, config=backend_config)
+client.create_endpoint("my_endpoint", backend="my_backend")
diff --git a/python/ray/serve/examples/doc/fastapi/fastapi.py b/python/ray/serve/examples/doc/fastapi/fastapi.py
new file mode 100644
index 000000000..018dc0d5e
--- /dev/null
+++ b/python/ray/serve/examples/doc/fastapi/fastapi.py
@@ -0,0 +1,12 @@
+from fastapi import FastAPI
+from transformers import pipeline  # A simple API for NLP tasks.
+
+app = FastAPI()
+
+nlp_model = pipeline("text-generation", model="gpt2")  # Load the model.
+
+
+# The function below handles GET requests to the URL `/generate`.
+@app.get("/generate")
+def generate(query: str):
+    return nlp_model(query, max_length=50)  # Cap the output at 50 tokens.
diff --git a/python/ray/serve/examples/doc/fastapi/servehandle_fastapi.py b/python/ray/serve/examples/doc/fastapi/servehandle_fastapi.py
new file mode 100644
index 000000000..4bc89d8b7
--- /dev/null
+++ b/python/ray/serve/examples/doc/fastapi/servehandle_fastapi.py
@@ -0,0 +1,37 @@
+import ray
+from ray import serve
+
+from fastapi import FastAPI
+from transformers import pipeline
+
+app = FastAPI()
+
+serve_handle = None
+
+
+@app.on_event("startup")  # Code to be run when the server starts.
+async def startup_event():
+    ray.init(address="auto")  # Connect to the running Ray cluster.
+    client = serve.start(http_host=None)  # Start the Ray Serve client.
+
+    # Define a callable class to use for our Ray Serve backend.
+    class GPT2:
+        def __init__(self):
+            self.nlp_model = pipeline("text-generation", model="gpt2")
+
+        async def __call__(self, request):
+            return self.nlp_model(await request.body(), max_length=50)
+
+    # Set up a Ray Serve backend with the desired number of replicas.
+    backend_config = serve.BackendConfig(num_replicas=2)
+    client.create_backend("gpt-2", GPT2, config=backend_config)
+    client.create_endpoint("generate", backend="gpt-2")
+
+    # Get a handle to our Ray Serve endpoint so we can query it in Python.
+    global serve_handle
+    serve_handle = client.get_handle("generate")
+
+
+@app.get("/generate")
+async def generate(query: str):
+    return await serve_handle.remote(query)