Auto-scale Ray clusters based on GCS load metrics (#1348)

This adds (experimental) auto-scaling support for Ray clusters based on GCS load metrics. The auto-scaling algorithm is as follows:

Based on current (instantaneous) load information, we compute the approximate number of "used workers". This is based on the bottleneck resource, e.g. if 8/8 GPUs are used in an 8-node cluster but all the CPUs are idle, the number of used workers is still counted as 8. This number can also be fractional.
We scale that number by 1 / target_utilization_fraction and round up to determine the target cluster size (subject to the max_workers constraint). The autoscaler control loop takes care of launching new nodes until the target cluster size is met.
When a node is idle for more than idle_timeout_minutes, we remove it from the cluster if that would not drop the cluster size below min_workers.
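
The three steps above can be sketched roughly as follows. This is only an illustration of the described arithmetic, not the code in this commit: the function names, the dictionary shapes for load and idle-time data, and the choice to scan nodes in sorted order are all assumptions made for the example.

```python
import math


def approx_used_workers(used, total, num_workers):
    """Estimate "used workers" from the bottleneck (most utilized) resource.

    `used` and `total` map a resource name to a cluster-wide amount.
    This data shape is hypothetical, chosen for the sketch.
    """
    fractions = [
        used.get(name, 0) / capacity
        for name, capacity in total.items()
        if capacity > 0
    ]
    bottleneck = max(fractions, default=0.0)
    return bottleneck * num_workers  # may be fractional


def target_cluster_size(used_workers, target_utilization_fraction,
                        max_workers):
    """Scale by 1 / target_utilization_fraction, round up, cap at max."""
    ideal = math.ceil(used_workers / target_utilization_fraction)
    return min(max_workers, ideal)


def nodes_to_terminate(idle_minutes_by_node, idle_timeout_minutes,
                       num_workers, min_workers):
    """Pick idle nodes to remove without dropping below min_workers."""
    doomed = []
    # Sorted order is an assumption; it just makes the sketch deterministic.
    for node, idle in sorted(idle_minutes_by_node.items()):
        remaining = num_workers - len(doomed)
        if idle > idle_timeout_minutes and remaining > min_workers:
            doomed.append(node)
    return doomed
```

With the GPU example from the description, `approx_used_workers({"GPU": 8, "CPU": 0}, {"GPU": 8, "CPU": 64}, 8)` counts all 8 workers as used, and a target utilization of 0.5 would then ask for 16 nodes unless max_workers is lower.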
Note that we'll need to update the wheel in the example yaml file after this PR is merged.
Eric Liang
2017-12-31 14:39:57 -08:00
committed by GitHub
parent e970e24ea5
commit b6c42f96be
12 changed files with 657 additions and 176 deletions
+5 -6
@@ -245,10 +245,9 @@ def stop():
 @click.command()
 @click.argument("cluster_config_file", required=True, type=str)
 @click.option(
-    "--sync-only", is_flag=True, default=False, help=(
-        "Whether to only perform the file sync stage when updating nodes. "
-        "This avoids interrupting running jobs. You can use this when "
-        "resizing the cluster with the min/max_workers flag."))
+    "--no-restart", is_flag=True, default=False, help=(
+        "Whether to skip restarting Ray services during the update. "
+        "This avoids interrupting running jobs."))
 @click.option(
     "--min-workers", required=False, type=int, help=(
         "Override the configured min worker node count for the cluster."))
@@ -256,9 +255,9 @@ def stop():
     "--max-workers", required=False, type=int, help=(
         "Override the configured max worker node count for the cluster."))
 def create_or_update(
-        cluster_config_file, min_workers, max_workers, sync_only):
+        cluster_config_file, min_workers, max_workers, no_restart):
     create_or_update_cluster(
-        cluster_config_file, min_workers, max_workers, sync_only)
+        cluster_config_file, min_workers, max_workers, no_restart)
 @click.command()