mirror of
https://github.com/wassname/ray.git
synced 2026-06-28 01:16:06 +08:00
784a6399b0
This PR introduces single-node fault tolerance for Tune. ## Previous behavior: - Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources. ## New behavior: - RUNNING trials will be resumed on another node on a best effort basis (meaning they will run if resources available). - If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued. - During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running. Remaining questions: - Should `last_result` be consistent during restore? Yes; but not for earlier trials (trials that are yet to be checkpointed). - Waiting for some PRs to merge first (#3239) Closes #2851.
Tune: Scalable Hyperparameter Search
====================================
Tune is a scalable framework for hyperparameter search with a focus on deep learning and deep reinforcement learning.
User documentation can be `found here <http://ray.readthedocs.io/en/latest/tune.html>`__.
Tutorial
--------
To get started with Tune, try going through `our tutorial of using Tune with Keras <https://github.com/ray-project/tutorial/blob/master/tune_exercises/Tune.ipynb>`__.
(Experimental): You can try out `the above tutorial on a free hosted server via Binder <https://mybinder.org/v2/gh/ray-project/tutorial/master?filepath=tune_exercises%2FTune.ipynb>`__.
Citing Tune
-----------
If Tune helps you in your academic research, you are encouraged to cite `our paper <https://arxiv.org/abs/1807.05118>`__. Here is an example bibtex:
.. code-block:: tex
@article{liaw2018tune,
title={Tune: A Research Platform for Distributed Model Selection and Training},
author={Liaw, Richard and Liang, Eric and Nishihara, Robert and
Moritz, Philipp and Gonzalez, Joseph E and Stoica, Ion},
journal={arXiv preprint arXiv:1807.05118},
year={2018}
}