mirror of
https://github.com/wassname/ray.git
synced 2026-06-28 09:29:05 +08:00
[tune] add scalability release tests (#13986)
* Add scalability tests
* Network overhead cluster
* Update xgboost tests
* Document release tests
* Don't raise on failed trial
* Update to multi node yamls
* Update yamls
* Revert xgboost test changes
* Fix import
* Update release/tune_tests/scalability_tests/workloads/test_bookkeeping_overhead.py

  Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* Pass aws credentials (WIP)
* Update durable trainable example
* Update xgboost sweep
* Change xgboost scope, fix durable trainable stop condition
* Fix max depth to limit total test length
* Add cluster information to test descriptions. Update release checklist/process docs

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
@@ -62,8 +62,14 @@ This checklist is meant to be used in conjunction with the RELEASE_PROCESS.rst d
 - [ ] K8s operator test
 - [ ] Data processing tests
   - [ ] streaming_shuffle
-- [x] Tune tests
-  - [x] ignore for now
+- [ ] Tune tests
+  - [ ] test_bookkeeping_overhead
+  - [x] test_result_throughput_cluster (ignore final time)
+  - [x] test_result_throughput_single_node (ignore final time)
+  - [x] test_network_overhead (ignore final time)
+  - [ ] test_long_running_large_checkpoints
+  - [ ] test_xgboost_sweep
+  - [ ] test_durable_trainable
 - [ ] XGBoost Tests
   - [ ] distributed_api_test
   - [ ] train_small
@@ -167,8 +167,14 @@ is generally the easiest way to run release tests.

 General Ray Tune functionality is implicitly tested via RLLib and XGBoost release tests.
 We are in the process of introducing scalability envelopes for Ray Tune.
-This is an ongoing effort and will only be introduced in the next release.
-For now, **you can ignore the tune_tests directory**.
+
+Of the seven existing tests, three are currently not reaching their target time.
+These three tests (test_result_throughput_cluster, test_result_throughput_single_node, and
+test_network_overhead) are marked in the release checklist and don't have to be run at this time.
+
+The other release tests are expected to run through without errors and to pass within a pre-specified time.
+The time is checked in the test function and the output will let you know if a run was fast enough and
+thus passed the test.

 10. **XGBoost release tests**

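The pass/fail rule described in the documentation hunk above (a run passes only if it finishes within a pre-specified time, and the output says so) can be sketched roughly as follows. This is a minimal illustration and not the helper added by this commit: `run_with_budget` and `workload` are hypothetical names, and the message format is only modelled on the one used in the test helper further down.

```python
import time


def run_with_budget(name, workload, max_runtime_s):
    """Run `workload()` and report whether it stayed within `max_runtime_s`."""
    start = time.monotonic()  # monotonic clock: unaffected by wall-clock changes
    workload()
    time_taken = time.monotonic() - start

    passed = time_taken < max_runtime_s
    verdict = "PASSED" if passed else "FAILED"
    print(f"--- {verdict}: {name.upper()} ::: "
          f"{time_taken:.2f}s (budget {max_runtime_s:.2f}s) ---")
    return passed, time_taken


if __name__ == "__main__":
    # A trivial workload that should finish well inside a 5 second budget.
    run_with_budget("noop", lambda: None, max_runtime_s=5.0)
```

Returning the verdict instead of asserting keeps the sketch reusable; a release test would typically turn a failed verdict into a non-zero exit code.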
@@ -1,31 +0,0 @@
cluster_name: ray-tune-scalability-tests

min_workers: 15
max_workers: 15

idle_timeout_minutes: 15

docker:
    image: anyscale/ray:nightly
    container_name: ray_container
    pull_before_run: true

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a
    cache_stopped_nodes: false

auth:
    ssh_user: ubuntu

head_node:
    # 64 CPUs
    InstanceType: m5.16xlarge

worker_nodes:
    # 64 CPUs
    InstanceType: m5.16xlarge

setup_commands:
    - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
@@ -0,0 +1,47 @@
cluster_name: ray-tune-scalability-tests-16x2

max_workers: 15
upscaling_speed: 15

idle_timeout_minutes: 0

docker:
    image: anyscale/ray:nightly
    container_name: ray_container
    pull_before_run: true

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a
    cache_stopped_nodes: false

available_node_types:
    cpu_2_ondemand:
        node_config:
            InstanceType: m5.large
        resources: {"CPU": 2}
        min_workers: 0
        max_workers: 0
    cpu_2_spot:
        node_config:
            InstanceType: m5.large
            InstanceMarketOptions:
                MarketType: spot
        resources: {"CPU": 2}
        min_workers: 15
        max_workers: 15

auth:
    ssh_user: ubuntu

head_node_type: cpu_2_ondemand
worker_default_node_type: cpu_2_spot

setup_commands:
  - ray install-nightly
  - pip install -U awscli

file_mounts: {
  "~/release-automation-tune_scalability_tests": "."
}
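The cluster names in this commit appear to follow a `<nodes>x<CPUs per node>` pattern. As a rough sanity check, the shape of a fully scaled-up cluster can be derived from `available_node_types` and `head_node_type`. The sketch below is illustrative only: `cluster_shape` is a hypothetical helper, and the dictionary simply repeats the values from the 16x2 config above rather than parsing the YAML file.

```python
def cluster_shape(node_types, head_node_type):
    """Return (node_count, total_cpus) for a fully scaled-up cluster."""
    nodes = 1  # the head node
    cpus = node_types[head_node_type]["resources"]["CPU"]
    for spec in node_types.values():
        nodes += spec["max_workers"]
        cpus += spec["max_workers"] * spec["resources"]["CPU"]
    return nodes, cpus


# Values copied by hand from the 16x2 config above.
node_types_16x2 = {
    "cpu_2_ondemand": {"resources": {"CPU": 2}, "max_workers": 0},
    "cpu_2_spot": {"resources": {"CPU": 2}, "max_workers": 15},
}

if __name__ == "__main__":
    print(cluster_shape(node_types_16x2, "cpu_2_ondemand"))  # (16, 32)
```

One on-demand head node plus 15 spot workers gives 16 nodes with 2 CPUs each, which matches the `16x2` suffix.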
@@ -0,0 +1,42 @@
cluster_name: ray-tune-scalability-tests-16x64

max_workers: 15
upscaling_speed: 15

idle_timeout_minutes: 0

docker:
    image: anyscale/ray:nightly
    container_name: ray_container
    pull_before_run: true

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a
    cache_stopped_nodes: false

available_node_types:
    cpu_64_ondemand:
        node_config:
            InstanceType: m5.16xlarge
        resources: {"CPU": 64}
        min_workers: 0
        max_workers: 0
    cpu_64_spot:
        node_config:
            InstanceType: m5.16xlarge
            InstanceMarketOptions:
                MarketType: spot
        resources: {"CPU": 64}
        min_workers: 15
        max_workers: 15

auth:
    ssh_user: ubuntu

head_node_type: cpu_64_ondemand
worker_default_node_type: cpu_64_spot

setup_commands:
  - ray install-nightly
@@ -0,0 +1,53 @@
cluster_name: ray-tune-scalability-tests-16x64_data

max_workers: 16
upscaling_speed: 16

idle_timeout_minutes: 0

docker:
    image: anyscale/ray:nightly
    container_name: ray_container
    pull_before_run: true

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a
    cache_stopped_nodes: false

available_node_types:
    cpu_64_ondemand:
        node_config:
            InstanceType: m5.16xlarge
        resources: {"CPU": 64}
        min_workers: 0
        max_workers: 0
    cpu_64_spot:
        node_config:
            InstanceType: m5.16xlarge
            InstanceMarketOptions:
                MarketType: spot
        resources: {"CPU": 64}
        min_workers: 15
        max_workers: 15

auth:
    ssh_user: ubuntu

head_node_type: cpu_64_ondemand
worker_default_node_type: cpu_64_spot

file_mounts: {
  "~/release-automation-tune_scalability_tests": "."
}

setup_commands:
  - ray install-nightly
  - pip install pytest xgboost_ray
  - mkdir -p ~/data || true
  - rm -rf ~/data/train.parquet || true
  - rm -rf ~/data/test.parquet || true
  - cp -R /tmp/ray_tmp_mount/release-automation-tune_scalability_tests ~/release-automation-tune_scalability_tests || echo "Copy failed"
  - python ~/release-automation-tune_scalability_tests/create_test_data.py ~/data/train.parquet --seed 1234 --num-rows 40000000 --num-cols 40 --num-partitions 128 --num-classes 2
  - python ~/release-automation-tune_scalability_tests/create_test_data.py ~/data/test.parquet --seed 1234 --num-rows 10000000 --num-cols 40 --num-partitions 128 --num-classes 2
@@ -0,0 +1,34 @@
cluster_name: ray-tune-scalability-tests-1x16

max_workers: 0
upscaling_speed: 1

idle_timeout_minutes: 0

docker:
    image: anyscale/ray:nightly
    container_name: ray_container
    pull_before_run: true

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a
    cache_stopped_nodes: false

available_node_types:
    cpu_4_ondemand:
        node_config:
            InstanceType: m5.xlarge
        resources: {"CPU": 4}
        min_workers: 0
        max_workers: 0

auth:
    ssh_user: ubuntu

head_node_type: cpu_4_ondemand
worker_default_node_type: cpu_4_ondemand

setup_commands:
  - ray install-nightly
@@ -0,0 +1,40 @@
cluster_name: ray-tune-scalability-tests-1x32_hd

max_workers: 0
upscaling_speed: 1

idle_timeout_minutes: 0

docker:
    image: anyscale/ray:nightly
    container_name: ray_container
    pull_before_run: true

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a
    cache_stopped_nodes: false

available_node_types:
    cpu_32_hd_ondemand:
        node_config:
            InstanceType: m5.8xlarge

            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 160

        resources: {"CPU": 32}  # 128 GB memory
        min_workers: 0
        max_workers: 0

auth:
    ssh_user: ubuntu

head_node_type: cpu_32_hd_ondemand
worker_default_node_type: cpu_32_hd_ondemand

setup_commands:
  - ray install-nightly
@@ -0,0 +1,34 @@
cluster_name: ray-tune-scalability-tests-1x96

max_workers: 0
upscaling_speed: 1

idle_timeout_minutes: 0

docker:
    image: anyscale/ray:nightly
    container_name: ray_container
    pull_before_run: true

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a
    cache_stopped_nodes: false

available_node_types:
    cpu_96_ondemand:
        node_config:
            InstanceType: m5.24xlarge
        resources: {"CPU": 96}
        min_workers: 0
        max_workers: 0

auth:
    ssh_user: ubuntu

head_node_type: cpu_96_ondemand
worker_default_node_type: cpu_96_ondemand

setup_commands:
  - ray install-nightly
@@ -0,0 +1,42 @@
cluster_name: ray-tune-scalability-tests-200x2

max_workers: 199
upscaling_speed: 199

idle_timeout_minutes: 0

docker:
    image: anyscale/ray:nightly
    container_name: ray_container
    pull_before_run: true

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a
    cache_stopped_nodes: false

available_node_types:
    cpu_2_ondemand:
        node_config:
            InstanceType: m5.large
        resources: {"CPU": 2}
        min_workers: 0
        max_workers: 0
    cpu_2_spot:
        node_config:
            InstanceType: m5.large
            InstanceMarketOptions:
                MarketType: spot
        resources: {"CPU": 2}
        min_workers: 199
        max_workers: 199

auth:
    ssh_user: ubuntu

head_node_type: cpu_2_ondemand
worker_default_node_type: cpu_2_spot

setup_commands:
  - ray install-nightly
@@ -0,0 +1,61 @@
import argparse
import numpy as np
import os

from xgboost_ray.tests.utils import create_parquet

if __name__ == "__main__":
    if "OMP_NUM_THREADS" in os.environ:
        del os.environ["OMP_NUM_THREADS"]

    parser = argparse.ArgumentParser(description="Create fake data.")
    parser.add_argument(
        "filename", type=str, default="/data/parted.parquet/", help="ray/dask")
    parser.add_argument(
        "-r",
        "--num-rows",
        required=False,
        type=int,
        default=1e8,
        help="num rows")
    parser.add_argument(
        "-p",
        "--num-partitions",
        required=False,
        type=int,
        default=100,
        help="num partitions")
    parser.add_argument(
        "-c",
        "--num-cols",
        required=False,
        type=int,
        default=4,
        help="num columns (features)")
    parser.add_argument(
        "-C",
        "--num-classes",
        required=False,
        type=int,
        default=2,
        help="num classes")
    parser.add_argument(
        "-s",
        "--seed",
        required=False,
        type=int,
        default=1234,
        help="random seed")

    args = parser.parse_args()

    if os.path.exists(args.filename):
        print(f"File already exists: {args.filename}. Skipping creation.")
    else:
        # Only generate data when the target does not exist yet, so the
        # "Skipping creation" message above matches what actually happens.
        np.random.seed(args.seed)
        create_parquet(
            args.filename,
            num_rows=int(args.num_rows),
            num_partitions=int(args.num_partitions),
            num_features=int(args.num_cols),
            num_classes=int(args.num_classes))
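Two argparse details are relevant to a script of this shape. For a positional argument, argparse only falls back to `default` when the argument is declared optional with `nargs="?"`; and `type` is applied to string defaults but not to non-string defaults such as a float. The sketch below illustrates both points with a reduced, hypothetical parser (`build_parser` is not part of the commit).

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Create fake data.")
    # nargs="?" makes the positional optional so that `default` can apply.
    parser.add_argument(
        "filename", nargs="?", default="/data/parted.parquet/",
        help="output path")
    # A float default such as 1e8 would be stored as-is (argparse only
    # converts string defaults), so an int default is spelled out here.
    parser.add_argument("--num-rows", type=int, default=int(1e8))
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args([]))
```

With these two changes the parsed namespace already carries the intended types, so later `int(...)` casts become a safety net rather than a requirement.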
@@ -1,6 +1,7 @@
 #!/usr/bin/env bash

-ray_version=""
+nodes=""
+ray_version=""
 commit=""
 ray_branch=""

@@ -8,9 +9,11 @@ for i in "$@"
 do
 echo "$i"
 case "$i" in
+    --nodes=*)
+    nodes="${i#*=}"
+    ;;
     --ray-version=*)
     ray_version="${i#*=}"
-
     ;;
     --commit=*)
     commit="${i#*=}"
@@ -32,25 +35,22 @@ case "$i" in
 esac
 done

-if [[ $ray_version == "" || $commit == "" || $ray_branch == "" ]]
+if [[ $nodes == "" || $ray_version == "" || $commit == "" || $ray_branch == "" ]]
 then
-    echo "Provide --ray-version, --commit, and --ray-branch"
+    echo "Provide --nodes --ray-version, --commit, and --ray-branch"
     exit 1
 fi

+echo "nodes: $nodes"
 echo "version: $ray_version"
 echo "commit: $commit"
 echo "branch: $ray_branch"
 echo "workload: ignored"

-wheel="https://s3-us-west-2.amazonaws.com/ray-wheels/$ray_branch/$commit/ray-$ray_version-cp37-cp37m-manylinux2014_x86_64.whl"
+# wheel="https://s3-us-west-2.amazonaws.com/ray-wheels/$ray_branch/$commit/ray-$ray_version-cp37-cp37m-manylinux2014_x86_64.whl"
+# pip install -U "$wheel"

-pip install -U pip
-pip install -U "$wheel"
-pip install "ray[tune]" "ray"
-pip install boto3==1.4.8 cython==0.29.0
-
-if ! python "wait_cluster.py" 16 450; then
+if ! python "wait_cluster.py" "$nodes" 600; then
     echo "Cluster did not come up in time. Aborting test."
     exit 1
 fi
@@ -0,0 +1,153 @@
import os
import time

import numpy as np
import pickle

from ray import tune

from ray.tune.durable_trainable import DurableTrainable


class TestDurableTrainable(DurableTrainable):
    def __init__(self, remote_checkpoint_dir, config, logger_creator=None):
        self.setup_env()

        super(TestDurableTrainable, self).__init__(
            remote_checkpoint_dir,
            config=config,
            logger_creator=logger_creator)

    def setup_env(self):
        pass

    def setup(self, config):
        self._num_iters = int(config["num_iters"])
        self._sleep_time = config["sleep_time"]
        self._score = config["score"]

        self._checkpoint_iters = config["checkpoint_iters"]
        self._checkpoint_size_b = config["checkpoint_size_b"]
        self._checkpoint_num_items = self._checkpoint_size_b // 8  # np.float64

        self._iter = 0

    def step(self):
        if self._iter > 0:
            time.sleep(self._sleep_time)

        res = dict(score=self._iter + self._score)

        if self._iter >= self._num_iters:
            res["done"] = True

        self._iter += 1
        return res

    def save_checkpoint(self, tmp_checkpoint_dir):
        checkpoint_file = os.path.join(tmp_checkpoint_dir, "bogus.ckpt")
        checkpoint_data = np.random.uniform(
            0, 1, size=self._checkpoint_num_items)
        with open(checkpoint_file, "wb") as fp:
            pickle.dump(checkpoint_data, fp)
        return checkpoint_file

    def load_checkpoint(self, checkpoint):
        pass


def function_trainable(config):
    num_iters = int(config["num_iters"])
    sleep_time = config["sleep_time"]
    score = config["score"]

    checkpoint_iters = config["checkpoint_iters"]
    checkpoint_size_b = config["checkpoint_size_b"]
    checkpoint_num_items = checkpoint_size_b // 8  # np.float64

    for i in range(num_iters):
        # checkpoint_iters must be strictly positive: a value of 0 would
        # raise ZeroDivisionError in the modulo below.
        if checkpoint_iters > 0 and checkpoint_size_b > 0 and \
                i % checkpoint_iters == 0:
            with tune.checkpoint_dir(step=i) as dir:
                checkpoint_file = os.path.join(dir, "bogus.ckpt")
                checkpoint_data = np.random.uniform(
                    0, 1, size=checkpoint_num_items)
                with open(checkpoint_file, "wb") as fp:
                    pickle.dump(checkpoint_data, fp)

        tune.report(score=i + score)
        time.sleep(sleep_time)


def timed_tune_run(name: str,
                   num_samples: int,
                   results_per_second: int = 1,
                   trial_length_s: int = 1,
                   max_runtime: int = 300,
                   checkpoint_freq_s: int = -1,
                   checkpoint_size_b: int = 0,
                   **tune_kwargs):
    # upload_dir may be unset (None) when only sync_to_driver is configured,
    # so check it before calling startswith().
    durable = "sync_config" in tune_kwargs and \
        tune_kwargs["sync_config"].upload_dir and \
        tune_kwargs["sync_config"].upload_dir.startswith("s3://")

    sleep_time = 1. / results_per_second
    num_iters = int(trial_length_s / sleep_time)
    checkpoint_iters = -1
    if checkpoint_freq_s >= 0:
        checkpoint_iters = int(checkpoint_freq_s / sleep_time)

    config = {
        "score": tune.uniform(0., 1.),
        "num_iters": num_iters,
        "sleep_time": sleep_time,
        "checkpoint_iters": checkpoint_iters,
        "checkpoint_size_b": checkpoint_size_b,
    }

    print(f"Starting benchmark with config: {config}")

    run_kwargs = {"reuse_actors": True, "verbose": 2}
    run_kwargs.update(tune_kwargs)

    _train = function_trainable

    aws_key_id = os.getenv("AWS_ACCESS_KEY_ID", "")
    aws_secret = os.getenv("AWS_SECRET_ACCESS_KEY", "")
    aws_session = os.getenv("AWS_SESSION_TOKEN", "")

    if durable:

        class AwsDurableTrainable(TestDurableTrainable):
            AWS_ACCESS_KEY_ID = aws_key_id
            AWS_SECRET_ACCESS_KEY = aws_secret
            AWS_SESSION_TOKEN = aws_session

            def setup_env(self):
                os.environ["AWS_ACCESS_KEY_ID"] = self.AWS_ACCESS_KEY_ID
                os.environ[
                    "AWS_SECRET_ACCESS_KEY"] = self.AWS_SECRET_ACCESS_KEY
                os.environ["AWS_SESSION_TOKEN"] = self.AWS_SESSION_TOKEN

        _train = AwsDurableTrainable
        run_kwargs["checkpoint_freq"] = checkpoint_iters

    start_time = time.monotonic()
    tune.run(
        _train,
        config=config,
        num_samples=num_samples,
        raise_on_failed_trial=False,
        **run_kwargs)
    time_taken = time.monotonic() - start_time

    assert time_taken < max_runtime, \
        f"The {name} test took {time_taken:.2f} seconds, but should not " \
        f"have exceeded {max_runtime:.2f} seconds. Test failed. \n\n" \
        f"--- FAILED: {name.upper()} ::: " \
        f"{time_taken:.2f} > {max_runtime:.2f} ---"

    print(f"The {name} test took {time_taken:.2f} seconds, which "
          f"is below the budget of {max_runtime:.2f} seconds. "
          f"Test successful. \n\n"
          f"--- PASSED: {name.upper()} ::: "
          f"{time_taken:.2f} <= {max_runtime:.2f} ---")
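The helper above turns three user-facing knobs (`results_per_second`, `trial_length_s`, `checkpoint_freq_s`) into per-trial settings. The arithmetic can be isolated into a small pure function to see what each test ends up with. `derive_schedule` is a hypothetical name used only for this sketch; the formulas mirror the ones in `timed_tune_run`.

```python
def derive_schedule(results_per_second, trial_length_s, checkpoint_freq_s=-1):
    """Mirror the sleep/iteration arithmetic used by the test helper."""
    sleep_time = 1.0 / results_per_second            # seconds between reports
    num_iters = int(trial_length_s / sleep_time)     # reports per trial
    checkpoint_iters = -1                            # -1 means "never"
    if checkpoint_freq_s >= 0:
        # int() truncates, so a checkpoint interval shorter than the
        # reporting interval yields 0 here.
        checkpoint_iters = int(checkpoint_freq_s / sleep_time)
    return sleep_time, num_iters, checkpoint_iters


if __name__ == "__main__":
    print(derive_schedule(2, 10, checkpoint_freq_s=5))    # (0.5, 20, 10)
    print(derive_schedule(1, 10, checkpoint_freq_s=0.5))  # (1.0, 10, 0)
```

The second call shows the edge case worth guarding against: a computed `checkpoint_iters` of 0 must not be used as a modulo divisor.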
@@ -0,0 +1,42 @@
"""Bookkeeping overhead (1 node, 10k trials)

In this run, we will start a large number of trials (10k) that take just a
second to run. We thus measure overhead that comes with dealing with a
large number of trials, e.g. experiment checkpointing.

Cluster: cluster_1x16.yaml

Test owner: krfricke

Acceptance criteria: Should run faster than 800 seconds.

Theoretical minimum time: 10000/16 = 625 seconds
"""
import os

import ray

from _trainable import timed_tune_run


def main():
    os.environ["TUNE_GLOBAL_CHECKPOINT_S"] = "100"  # Tweak

    ray.init(address="auto")

    num_samples = 10000
    results_per_second = 1
    trial_length_s = 1

    max_runtime = 800

    timed_tune_run(
        name="bookkeeping overhead",
        num_samples=num_samples,
        results_per_second=results_per_second,
        trial_length_s=trial_length_s,
        max_runtime=max_runtime)


if __name__ == "__main__":
    main()
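The "theoretical minimum time" quoted in these test docstrings follows from how many trials can run side by side. A rough sketch of that arithmetic is given below; `theoretical_min_runtime` is a hypothetical helper, and it assumes every trial takes exactly `trial_length_s` seconds with no scheduling overhead.

```python
import math


def theoretical_min_runtime(num_trials, trial_length_s, total_cpus,
                            cpus_per_trial=1):
    """Lower bound on wall-clock time if trials run in full parallel waves."""
    slots = total_cpus // cpus_per_trial      # trials that fit concurrently
    waves = math.ceil(num_trials / slots)     # sequential batches needed
    return waves * trial_length_s


if __name__ == "__main__":
    # Bookkeeping test: 10k one-second trials on 16 one-CPU slots.
    print(theoretical_min_runtime(10000, 1, total_cpus=16))  # 625
```

The 625 seconds in the docstring corresponds to 16 concurrent slots. The `cluster_1x16.yaml` hunk in this diff declares `{"CPU": 4}` for its single node; if that value is what the scheduler actually sees, the same formula gives 2500 seconds, so the node size may be worth double-checking.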
@@ -0,0 +1,47 @@
"""Durable trainable (16 trials, checkpoint to cloud)

In this run, we will start 16 trials on a cluster. The trials create
10 MB checkpoints every 10 seconds and should only keep 2 of these. This test
ensures that durable checkpoints don't slow down experiment progress too much.

Cluster: cluster_16x2.yaml

Test owner: krfricke

Acceptance criteria: Should run faster than 500 seconds.

Theoretical minimum time: 300 seconds
"""
import ray
from ray import tune

from _trainable import timed_tune_run


def main():
    ray.init(address="auto")

    num_samples = 16
    results_per_second = 10 / 60
    trial_length_s = 300

    max_runtime = 500

    timed_tune_run(
        name="durable trainable",
        num_samples=num_samples,
        results_per_second=results_per_second,
        trial_length_s=trial_length_s,
        max_runtime=max_runtime,
        checkpoint_freq_s=10,  # Once every 10 seconds
        checkpoint_size_b=int(10 * 1000**2),  # 10 MB
        keep_checkpoints_num=2,
        resources_per_trial={"cpu": 2},
        sync_config=tune.SyncConfig(
            sync_to_driver=False,
            upload_dir="s3://ray-tune-scalability-test/durable/",
        ))


if __name__ == "__main__":
    main()
@@ -0,0 +1,44 @@
"""Large checkpoints in long running trials (16 trials, 4 GB checkpoints).

In this run, we will start 16 trials on a single node. The trials create
4 GB checkpoints every 15 minutes and should only keep 2 of these. This test
ensures that handling large checkpoints doesn't lead to much overhead.

Cluster: cluster_1x32_hd.yaml

Test owner: krfricke

Acceptance criteria: Should run faster than 90,000 seconds.

Theoretical minimum time: 86,400 seconds
"""
import ray
from ray import tune

from _trainable import timed_tune_run


def main():
    ray.init(address="auto")

    num_samples = 16
    results_per_second = 1 / 60
    trial_length_s = 86400

    max_runtime = 90000

    timed_tune_run(
        name="long running large checkpoints",
        num_samples=num_samples,
        results_per_second=results_per_second,
        trial_length_s=trial_length_s,
        max_runtime=max_runtime,
        checkpoint_freq_s=900,  # Once every 15 minutes
        checkpoint_size_b=int(3.75 * 1000**3),  # 3.75 GB, "4 GB" in the docstring
        keep_checkpoints_num=2,  # 2 * 16 * 4 = 128 GB
        resources_per_trial={"cpu": 1},
        sync_config=tune.SyncConfig(sync_to_driver=True))


if __name__ == "__main__":
    main()
@@ -0,0 +1,41 @@
"""Networking overhead (200 trials on 200 nodes)

In this run, we will start 200 trials and run them on 200 different nodes.
This test will thus measure the overhead that comes with network communication
and specifically log synchronization.

Cluster: cluster_200x2.yaml

Test owner: krfricke

Acceptance criteria: Should run faster than 500 seconds.

Theoretical minimum time: 300 seconds
"""
import ray
from ray import tune

from _trainable import timed_tune_run


def main():
    ray.init(address="auto")

    num_samples = 200
    results_per_second = 1
    trial_length_s = 300

    max_runtime = 500

    timed_tune_run(
        name="result network overhead",
        num_samples=num_samples,
        results_per_second=results_per_second,
        trial_length_s=trial_length_s,
        max_runtime=max_runtime,
        resources_per_trial={"cpu": 2},  # One per node
        sync_config=tune.SyncConfig(sync_to_driver=True))


if __name__ == "__main__":
    main()
@@ -1,54 +0,0 @@
import time

import ray
from ray import tune
from ray.tune.cluster_info import is_ray_cluster


def my_naive_trainable(config):
    for i in range(int(config["num_iters"])):
        tune.report(score=i + config["score"])
        time.sleep(config["sleep_time"])


def main():
    ray.init(address="auto")

    num_samples = 1000

    sleep_time = 0.1
    num_iters = 300

    expected_run_time = num_iters * sleep_time

    # Allow minimum of 20 % overhead (or 10 seconds for short runs)
    expected_run_time += max(expected_run_time * 0.2, 10.)

    if is_ray_cluster():
        # Add constant overhead for SSH connection
        expected_run_time += 0.3 * num_samples

    start_time = time.time()
    tune.run(
        my_naive_trainable,
        config={
            "score": tune.uniform(0., 1.),
            "num_iters": num_iters,
            "sleep_time": sleep_time
        },
        reuse_actors=True,
        verbose=2,
        num_samples=num_samples)
    time_taken = time.time() - start_time

    assert time_taken < expected_run_time, \
        f"The buffering test took {time_taken:.2f} seconds, but should not " \
        f"have exceeded {expected_run_time:.2f} seconds. Test failed."

    print(f"The buffering test took {time_taken:.2f} seconds, which "
          f"is below the budget of {expected_run_time:.2f} seconds. "
          f"Test successful.")


if __name__ == "__main__":
    main()
@@ -0,0 +1,49 @@
"""Result throughput on a cluster

In this run, we will start 1000 trials concurrently that report often
(10 results per second). We thus measure the amount of overhead incurred when
dealing with a large number of results from distributed trials.

Cluster: cluster_16x64.yaml

Test owner: krfricke

Acceptance criteria: Should run faster than 120 seconds.

Theoretical minimum time: 100 seconds
"""
import os

import ray
from ray import tune
from ray.tune.cluster_info import is_ray_cluster

from _trainable import timed_tune_run


def main():
    os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"  # Tweak

    ray.init(address="auto")

    num_samples = 1000
    results_per_second = 10
    trial_length_s = 100

    max_runtime = 120

    if is_ray_cluster():
        # Add constant overhead for SSH connection
        max_runtime = 120

    timed_tune_run(
        name="result throughput cluster",
        num_samples=num_samples,
        results_per_second=results_per_second,
        trial_length_s=trial_length_s,
        max_runtime=max_runtime,
        sync_config=tune.SyncConfig(sync_to_driver=False))  # Tweak!


if __name__ == "__main__":
    main()
@@ -0,0 +1,42 @@
"""Result throughput on a single node

In this run, we will start 96 trials concurrently that report very often
(500 results per second). We thus measure the amount of overhead incurred when
dealing with a large number of results.

Cluster: cluster_1x96.yaml

Test owner: krfricke

Acceptance criteria: Should run faster than 120 seconds.

Theoretical minimum time: 100 seconds
"""
import os

import ray

from _trainable import timed_tune_run


def main():
    os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"  # Tweak

    ray.init(address="auto")

    num_samples = 96
    results_per_second = 500
    trial_length_s = 100

    max_runtime = 120

    timed_tune_run(
        name="result throughput single node",
        num_samples=num_samples,
        results_per_second=results_per_second,
        trial_length_s=trial_length_s,
        max_runtime=max_runtime)


if __name__ == "__main__":
    main()
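To put the two throughput tests side by side, the aggregate reporting load can be estimated from their parameters. This is back-of-the-envelope arithmetic only, assuming every trial reports at exactly its nominal rate; `expected_results` is a hypothetical helper.

```python
def expected_results(num_trials, results_per_second, trial_length_s):
    """Return (results per second across all trials, total results)."""
    per_second = num_trials * results_per_second
    total = per_second * trial_length_s
    return per_second, total


if __name__ == "__main__":
    # Cluster test: 1000 trials, 10 results/s each, 100 s long.
    print(expected_results(1000, 10, 100))   # (10000, 1000000)
    # Single-node test: 96 trials, 500 results/s each, 100 s long.
    print(expected_results(96, 500, 100))    # (48000, 4800000)
```

Under these assumptions the single-node test pushes almost five times as many results per second through the driver as the cluster test, despite using far fewer trials.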
@@ -0,0 +1,98 @@
"""Large-scale XGBoost parameter sweep

In this run, we will start 32 trials of 32 actors each running distributed
XGBoost training. This test is more about making sure that the run succeeds
than about total runtime. However, it is expected that this is faster than
1 hour.

We fix the max_depth to 4 and the number of boosting rounds to 100. The
fastest observed training time for 32 actors (1 CPU each) was about 2000
seconds. We allow up to 10 minutes of slack, so aim for 2600 seconds total
tuning time.

Cluster: cluster_16x64_data.yaml

Test owner: krfricke

Acceptance criteria: Should run faster than 2600 seconds. Should run without
errors.
"""
import os
import time

import ray
from ray import tune

from xgboost_ray import train, RayParams, RayDMatrix


def xgboost_train(config, num_actors=128, num_boost_round=200):
    train_set = RayDMatrix(
        os.path.expanduser("~/data/train.parquet"), "labels")
    test_set = RayDMatrix(os.path.expanduser("~/data/test.parquet"), "labels")

    evals_result = {}

    bst = train(
        params=config,
        dtrain=train_set,
        evals=[(test_set, "eval")],
        evals_result=evals_result,
        ray_params=RayParams(
            max_actor_restarts=1,
            gpus_per_actor=0,
            cpus_per_actor=1,
            num_actors=num_actors),
        verbose_eval=False,
        num_boost_round=num_boost_round)

    model_path = "tuned.xgb"
    bst.save_model(model_path)
    print("Final validation error: {:.4f}".format(
        evals_result["eval"]["error"][-1]))


def main():
    name = "large xgboost sweep"

    ray.init(address="auto")

    num_samples = 32
    num_actors_per_sample = 32

    max_runtime = 2600

    config = {
        "tree_method": "approx",
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
        "eta": tune.loguniform(1e-4, 1e-1),
        "subsample": tune.uniform(0.5, 1.0),
        "max_depth": 4
    }

    start_time = time.monotonic()
    tune.run(
        tune.with_parameters(
            xgboost_train,
            num_actors=num_actors_per_sample,
            num_boost_round=100),
        config=config,
        num_samples=num_samples)
    time_taken = time.monotonic() - start_time

    assert time_taken < max_runtime, \
        f"The {name} test took {time_taken:.2f} seconds, but should not " \
        f"have exceeded {max_runtime:.2f} seconds. Test failed. \n\n" \
        f"--- FAILED: {name.upper()} ::: " \
        f"{time_taken:.2f} > {max_runtime:.2f} ---"

    print(f"The {name} test took {time_taken:.2f} seconds, which "
          f"is below the budget of {max_runtime:.2f} seconds. "
          f"Test successful. \n\n"
          f"--- PASSED: {name.upper()} ::: "
          f"{time_taken:.2f} <= {max_runtime:.2f} ---")


if __name__ == "__main__":
    main()
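The sweep's CPU demand can be compared against the cluster it targets. The sketch below is a rough estimate: `sweep_cpu_demand` is a hypothetical helper, and the `cpus_per_trial_driver` parameter reflects an assumption that each Tune trial may also reserve CPUs for the function that launches the XGBoost actors (this test does not set `resources_per_trial`, so the actual reservation depends on Tune's defaults).

```python
def sweep_cpu_demand(num_trials, actors_per_trial, cpus_per_actor=1,
                     cpus_per_trial_driver=0):
    """CPUs needed to run every trial of the sweep at the same time."""
    per_trial = actors_per_trial * cpus_per_actor + cpus_per_trial_driver
    return num_trials * per_trial


if __name__ == "__main__":
    cluster_cpus = 16 * 64  # cluster_16x64_data.yaml: 16 nodes, 64 CPUs each
    print(sweep_cpu_demand(32, 32), cluster_cpus)  # 1024 1024
    # If each trial driver also holds one CPU, demand exceeds the cluster.
    print(sweep_cpu_demand(32, 32, cpus_per_trial_driver=1))  # 1056
```

Counting only the training actors, the 32 x 32 sweep exactly fills the 1024 CPUs; any additional per-trial reservation would mean that not all 32 trials can run concurrently.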