From e78945db963f7c52af7abda05b8b0675fc8ae61f Mon Sep 17 00:00:00 2001
From: Simon Mo <simon.mo@hey.com>
Date: Fri, 17 Jul 2020 11:17:37 -0700
Subject: [PATCH] [Serve] Add internal instruction for running benchmarks
 (#9531)

---
 python/ray/serve/benchmarks/README.md | 67 +++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)
 create mode 100644 python/ray/serve/benchmarks/README.md

diff --git a/python/ray/serve/benchmarks/README.md b/python/ray/serve/benchmarks/README.md
new file mode 100644
index 000000000..e4d6e42df
--- /dev/null
+++ b/python/ray/serve/benchmarks/README.md
@@ -0,0 +1,67 @@
+## Ray Serve Benchmarks
+
+This directory contains code for setting up and running benchmarks for Ray Serve.
+
+### `microbenchmark.py` runs several scenarios
+
+```
+{'max_batch_size': 1, 'max_concurrent_queries': 1, 'num_replicas': 1}:
+        single client small data 629.38 +- 5.34 requests/s
+        8 clients small data 888.99 +- 14.0 requests/s
+{'max_batch_size': 1, 'max_concurrent_queries': 10000, 'num_replicas': 1}:
+        single client small data 610.65 +- 11.99 requests/s
+        8 clients small data 1856.55 +- 17.33 requests/s
+{'max_batch_size': 10000, 'max_concurrent_queries': 10000, 'num_replicas': 1}:
+        single client small data 602.93 +- 5.57 requests/s
+        8 clients small data 1723.73 +- 88.33 requests/s
+{'max_batch_size': 1, 'max_concurrent_queries': 1, 'num_replicas': 8}:
+        single client small data 492.09 +- 6.11 requests/s
+        8 clients small data 1662.08 +- 92.08 requests/s
+{'max_batch_size': 1, 'max_concurrent_queries': 10000, 'num_replicas': 8}:
+        single client small data 459.71 +- 25.66 requests/s
+        8 clients small data 1860.39 +- 24.45 requests/s
+{'max_batch_size': 10000, 'max_concurrent_queries': 10000, 'num_replicas': 8}:
+        single client small data 487.65 +- 15.61 requests/s
+        8 clients small data 1917.84 +- 95.61 requests/s
+```
+
+### `noop_latency.py` sets up a noop backend for external benchmarks
+
+```
+python noop_latency.py --blocking --num-replicas 8 --num-queries 500 --max-concurrent-queries 10000
+```
+
+- `--blocking` makes the script block, keeping the server up, after it fires `--num-queries` warm-up requests.
+- `--num-replicas` and `--max-concurrent-queries` configure the backend replicas.
+
+Once the deployment is set up, you can use external benchmarking tools such as `wrk`, ApacheBench, or Locust. For example:
+
+```
+$ wrk -c 100 -t 2 -d 10s --latency http://127.0.0.1:8000/noop
+Running 10s test @ http://127.0.0.1:8000/noop
+  2 threads and 100 connections
+  Thread Stats   Avg      Stdev     Max   +/- Stdev
+    Latency    44.14ms   11.24ms 151.03ms   94.78%
+    Req/Sec     1.15k   185.88     1.36k    91.50%
+  Latency Distribution
+     50%   42.10ms
+     75%   44.78ms
+     90%   49.40ms
+     99%  106.94ms
+  22917 requests in 10.04s, 3.56MB read
+Requests/sec:   2283.17
+Transfer/sec:    363.43KB
+```
+
+Typically, 100 to 200 connections are enough to profile throughput.
+
+### Use py-spy to generate flamegraphs
+
+```
+sudo env "PATH=$PATH" py-spy record --duration 30 -o out.svg --pid PID --native
+```
+
+Tips:
+
+- If a process is overloaded, py-spy might not be able to find the Python stacks because Ray makes heavy use of Cython
+  extensions. In that case, start py-spy first and then start the load generation.