ray/python at 784a6399b08ec74549c09f01e3ad362a346d7b67 - ray

mirror of https://github.com/wassname/ray.git synced 2026-06-27 21:23:10 +08:00

Richard Liaw 784a6399b0 [tune] Node Fault Tolerance (#3238)

This PR introduces single-node fault tolerance for Tune.

## Previous behavior:
 - Actors are restarted without checking whether resources are available. This can cause problems when resources are lost, for example when a node fails.

## New behavior:
 - RUNNING trials will be resumed on another node on a best-effort basis (meaning they will run if resources are available).
 - If the cluster is saturated, RUNNING trials on the failed node will become PENDING and be queued.
 - During recovery, TrialSchedulers and SearchAlgorithms should be notified (via `trial_runner.stop_trial`) so that they don’t wait or block on a trial that isn’t running.
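
The recovery behavior listed above can be illustrated with a minimal sketch. This is not Tune's implementation: `Trial`, `Status`, `recover_from_node_failure`, the `free_cpus` mapping, and the `notify_stopped` callback are all hypothetical names invented for this example, and CPU count stands in for whatever resources a trial needs.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Optional


class Status(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"


@dataclass
class Trial:
    # Hypothetical stand-in for a Tune trial; not the real class.
    name: str
    cpus: int
    node: Optional[str] = None
    status: Status = Status.PENDING


def recover_from_node_failure(
    trials: List[Trial],
    free_cpus: Dict[str, int],
    failed_node: str,
    notify_stopped: Callable[[Trial], None],
) -> None:
    """Resume trials from `failed_node` elsewhere, or queue them as PENDING.

    `free_cpus` maps each surviving node to its free CPU count (an assumed
    resource model). `notify_stopped` plays the role of telling schedulers
    and search algorithms that the trial is no longer running.
    """
    for trial in trials:
        if trial.status is not Status.RUNNING or trial.node != failed_node:
            continue  # only trials that were running on the failed node

        # Notify first so nothing blocks waiting on a trial that is gone.
        notify_stopped(trial)

        # Best effort: pick the first surviving node with enough free CPUs.
        target = next(
            (node for node, free in free_cpus.items() if free >= trial.cpus),
            None,
        )
        if target is None:
            # Cluster is saturated: queue the trial instead of restarting it.
            trial.status = Status.PENDING
            trial.node = None
        else:
            free_cpus[target] -= trial.cpus
            trial.node = target  # status stays RUNNING


trials = [
    Trial("a", cpus=2, node="node1", status=Status.RUNNING),
    Trial("b", cpus=2, node="node1", status=Status.RUNNING),
    Trial("c", cpus=1, node="node2", status=Status.RUNNING),
]
free = {"node2": 2}
stopped: List[Trial] = []
recover_from_node_failure(trials, free, "node1", stopped.append)
```

In this sketch, trial `a` moves to `node2`, trial `b` finds no free capacity and is queued as PENDING, and trial `c` is untouched because its node did not fail. The key difference from the previous behavior is the resource check before restarting.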


Remaining questions:
 - Should `last_result` be consistent during restore?
   Yes, but not for earlier trials (trials that are yet to be checkpointed).

 - Waiting for some PRs to merge first (#3239)

Closes #2851.

2018-11-21 12:38:16 -08:00

benchmarks

Deprecate num_workers argument to ray.init and ray start. (#3114)

2018-10-28 20:12:49 -07:00

ray

[tune] Node Fault Tolerance (#3238)

2018-11-21 12:38:16 -08:00

asv.conf.json

[asv] Pushing to s3 (#2246)

2018-06-20 10:43:44 -07:00

build-wheel-macos.sh

Adding Python3.7 wheels support (#2546)