Add script for running infinitely long stress tests. (#4163)

Running `./ci/long_running_tests/start_workloads.sh` will start several workloads running (each in their own EC2 instance).
- The workloads run forever.
- The workloads all simulate multiple nodes but use a single machine.
- You can get the tail of each workload by running `./ci/long_running_tests/check_workloads.sh`.
- You have to manually shut down the instances.

As discussed with @ericl @richardliaw, the idea here is to optimize for the debuggability of the tests. If one of them fails, you can ssh to the relevant instance and see all of the logs.
This commit is contained in:
Robert Nishihara
2019-02-27 14:33:06 -08:00
committed by Richard Liaw
parent 41b81af11b
commit 75504b9586
8 changed files with 391 additions and 4 deletions
+2 -4
View File
@@ -126,10 +126,8 @@ def run(args, parser):
cluster = Cluster()
for _ in range(args.ray_num_nodes):
cluster.add_node(
resources={
"num_cpus": args.ray_num_cpus or 1,
"num_gpus": args.ray_num_gpus or 0,
},
num_cpus=args.ray_num_cpus or 1,
num_gpus=args.ray_num_gpus or 0,
object_store_memory=args.ray_object_store_memory,
redis_max_memory=args.ray_redis_max_memory)
ray.init(redis_address=cluster.redis_address)