mirror of
https://github.com/wassname/ray.git
synced 2026-06-29 11:34:25 +08:00
b6c42f96be
This adds (experimental) auto-scaling support for Ray clusters based on GCS load metrics. The auto-scaling algorithm is as follows: Based on current (instantaneous) load information, we compute the approximate number of "used workers". This is based on the bottleneck resource, e.g. if 8/8 GPUs are used in a 8-node cluster but all the CPUs are idle, the number of used nodes is still counted as 8. This number can also be fractional. We scale that number by 1 / target_utilization_fraction and round up to determine the target cluster size (subject to the max_workers constraint). The autoscaler control loop takes care of launching new nodes until the target cluster size is met. When a node is idle for more than idle_timeout_minutes, we remove it from the cluster if that would not drop the cluster size below min_workers. Note that we'll need to update the wheel in the example yaml file after this PR is merged.
21 lines
639 B
Python
21 lines
639 B
Python
from __future__ import absolute_import
|
|
from __future__ import division
|
|
from __future__ import print_function
|
|
|
|
"""Ray constants used in the Python code."""
|
|
|
|
|
|
# Abort autoscaling if more than this number of errors are encountered. This
|
|
# is a safety feature to prevent e.g. runaway node launches.
|
|
AUTOSCALER_MAX_NUM_FAILURES = 5
|
|
|
|
# Max number of nodes to launch at a time.
|
|
AUTOSCALER_MAX_CONCURRENT_LAUNCHES = 10
|
|
|
|
# Interval at which to perform autoscaling updates.
|
|
AUTOSCALER_UPDATE_INTERVAL_S = 5
|
|
|
|
# The autoscaler will attempt to restart Ray on nodes it hasn't heard from
|
|
# in more than this interval.
|
|
AUTOSCALER_HEARTBEAT_TIMEOUT_S = 30
|