Auto-scale ray clusters based on GCS load metrics (#1348)
This adds (experimental) auto-scaling support for Ray clusters based on GCS load metrics. The auto-scaling algorithm is as follows:

- Based on current (instantaneous) load information, we compute the approximate number of "used workers". This is based on the bottleneck resource, e.g. if 8/8 GPUs are used in an 8-node cluster but all the CPUs are idle, the number of used workers is still counted as 8. This number can also be fractional.
- We scale that number by 1 / target_utilization_fraction and round up to determine the target cluster size (subject to the max_workers constraint).
- The autoscaler control loop takes care of launching new nodes until the target cluster size is met.
- When a node is idle for more than idle_timeout_minutes, we remove it from the cluster if that would not drop the cluster size below min_workers.

Note that we'll need to update the wheel in the example yaml file after this PR is merged.
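The sizing and idle-removal rules described in this commit message can be sketched roughly as follows. This is an illustrative sketch only, not the code added by this PR: the function names (`approx_workers_used`, `target_num_workers`, `idle_nodes_to_remove`) and the input shapes (per-node `(used, total)` resource tuples, a map of last-used timestamps) are assumptions made for the example. Only `target_utilization_fraction`, `min_workers`, `max_workers`, and `idle_timeout_minutes` come from the message itself.

```python
import math


def approx_workers_used(nodes):
    """Sum, over nodes, of the utilization of each node's bottleneck resource.

    `nodes` is a hypothetical list of dicts mapping a resource name to a
    (used, total) tuple. The result may be fractional.
    """
    total = 0.0
    for resources in nodes:
        fractions = [used / float(cap) for used, cap in resources.values() if cap > 0]
        # The busiest resource on a node decides how "used" that node is.
        total += max(fractions, default=0.0)
    return total


def target_num_workers(used, target_utilization_fraction, min_workers, max_workers):
    """Scale by 1 / target_utilization_fraction, round up, then clamp."""
    ideal = int(math.ceil(used / float(target_utilization_fraction)))
    return min(max_workers, max(min_workers, ideal))


def idle_nodes_to_remove(last_used_by_node, now, idle_timeout_minutes, min_workers):
    """Pick idle nodes to remove without dropping below min_workers.

    `last_used_by_node` is a hypothetical map of node id to the time
    (in seconds) at which the node was last busy.
    """
    horizon = now - 60 * idle_timeout_minutes
    remaining = len(last_used_by_node)
    to_remove = []
    for node_id, last_used in sorted(last_used_by_node.items()):
        if last_used < horizon and remaining > min_workers:
            to_remove.append(node_id)
            remaining -= 1
    return to_remove


# The example from the message: 8/8 GPUs busy on 8 nodes, all CPUs idle.
gpu_bound = [{"GPU": (1, 1), "CPU": (0, 4)} for _ in range(8)]
used = approx_workers_used(gpu_bound)
target = target_num_workers(used, 0.5, min_workers=0, max_workers=20)
```

With a target utilization of 0.5, the fully GPU-bound 8-node cluster above asks for twice as many workers, capped by `max_workers`. How the autoscaler in this PR actually gathers load data and tracks idle time is not shown in this excerpt.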
@@ -245,10 +245,9 @@ def stop():
 @click.command()
 @click.argument("cluster_config_file", required=True, type=str)
 @click.option(
-    "--sync-only", is_flag=True, default=False, help=(
-        "Whether to only perform the file sync stage when updating nodes. "
-        "This avoids interrupting running jobs. You can use this when "
-        "resizing the cluster with the min/max_workers flag."))
+    "--no-restart", is_flag=True, default=False, help=(
+        "Whether to skip restarting Ray services during the update. "
+        "This avoids interrupting running jobs."))
 @click.option(
     "--min-workers", required=False, type=int, help=(
         "Override the configured min worker node count for the cluster."))
@@ -256,9 +255,9 @@ def stop():
     "--max-workers", required=False, type=int, help=(
         "Override the configured max worker node count for the cluster."))
 def create_or_update(
-        cluster_config_file, min_workers, max_workers, sync_only):
+        cluster_config_file, min_workers, max_workers, no_restart):
     create_or_update_cluster(
-        cluster_config_file, min_workers, max_workers, sync_only)
+        cluster_config_file, min_workers, max_workers, no_restart)
 
 
 @click.command()