ray/python/ray/tune at 0f0099fb909a862ed93ffc93c12fa411d5cc119b - ray

mirror of https://github.com/wassname/ray.git synced 2026-06-28 01:16:06 +08:00

Files

Richard Liaw 784a6399b0 [tune] Node Fault Tolerance (#3238)

This PR introduces single-node fault tolerance for Tune.

## Previous behavior:
 - Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources.

## New behavior:
 - RUNNING trials will be resumed on another node on a best-effort basis (that is, they will run only if resources are available).
 - If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued.
 - During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running.
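
The recovery flow described above can be sketched as a small, self-contained simulation. This is only an illustration of the three bullet points, not Tune's implementation: the `Trial` dataclass, the `recover_from_node_failure` helper, the `free_cpus` mapping, and the `on_stop` callback (standing in for the `trial_runner.stop_trial` notification) are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

RUNNING, PENDING = "RUNNING", "PENDING"


@dataclass
class Trial:
    """Hypothetical stand-in for a Tune trial (not the real class)."""
    name: str
    cpus: int
    status: str = PENDING
    node: Optional[str] = None


def recover_from_node_failure(
    failed_node: str,
    trials: List[Trial],
    free_cpus: Dict[str, int],
    on_stop: Callable[[Trial], None],
) -> None:
    """Reschedule trials that were RUNNING on `failed_node`.

    `free_cpus` maps each surviving node to its free CPU count.
    """
    for trial in trials:
        if trial.status != RUNNING or trial.node != failed_node:
            continue
        # Notify schedulers / search algorithms so they do not wait
        # on a trial that is no longer running.
        on_stop(trial)
        # Best effort: pick the first surviving node with enough free CPUs.
        target = next(
            (n for n, free in free_cpus.items() if free >= trial.cpus), None
        )
        if target is None:
            # Cluster is saturated: queue the trial.
            trial.status, trial.node = PENDING, None
        else:
            free_cpus[target] -= trial.cpus
            trial.node = target  # stays RUNNING on the new node


trials = [
    Trial("a", cpus=2, status=RUNNING, node="node1"),
    Trial("b", cpus=2, status=RUNNING, node="node1"),
    Trial("c", cpus=1, status=RUNNING, node="node2"),
]
free = {"node2": 3}
stopped: List[Trial] = []
recover_from_node_failure("node1", trials, free, stopped.append)
for t in trials:
    print(t.name, t.status, t.node)
```

In this run, trial `a` fits on `node2` and keeps running there, trial `b` no longer fits and is queued as PENDING, and trial `c` is untouched because its node did not fail.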


Remaining questions:
 - Should `last_result` be consistent during restore?
   Yes, but not for earlier trials (trials that have not yet been checkpointed).

 - Waiting for some PRs to merge first (#3239).

Closes #2851.

2018-11-21 12:38:16 -08:00

| Name | Last commit | Date |
| --- | --- | --- |
| `automl` | [tune] Add AutoML algorithm of GeneticSearcher (#2699) | 2018-09-12 09:17:04 -07:00 |
| `automlboard` | [tune] Tune onto Logging Module (#2882) | 2018-09-16 12:09:36 -07:00 |
| `examples` | [sgd] Document and add simple MNIST example (#3236) | 2018-11-10 21:52:20 -08:00 |
| `schedulers` | [tune] Tweaks to Trainable and Verbosity (#2889) | 2018-10-11 23:42:13 -07:00 |
| `suggest` | Fix linting errors. (#3127) | 2018-10-24 16:30:00 -07:00 |
| `test` | [tune] Node Fault Tolerance (#3238) | 2018-11-21 12:38:16 -08:00 |
| `__init__.py` | [tune] Support lambda functions in hyperparameters / tune rllib multiagent support (#2568) | 2018-08-07 16:29:21 -07:00 |
| `cluster_info.py` | [tune] Sync logs from workers and improve tensorboard reporting (#1567) | 2018-02-26 11:35:51 -08:00 |
| `config_parser.py` | [tune] Add PyTorch MNIST Example + Misc. Tweaks (#2708) | 2018-08-30 16:18:56 -07:00 |
| `error.py` | [tune] Experiment Management API (#1328) | 2018-01-24 13:45:10 -08:00 |
| `experiment.py` | [tune] Support "None" for upload_dir | 2018-11-09 22:02:08 -08:00 |
| `function_runner.py` | [tune] Doc: Autofilled, StatusReporter (#3294) | 2018-11-13 13:15:56 -08:00 |
| `log_sync.py` | Bug/log syncer fails with parentheses (#2653) | 2018-10-06 00:34:53 -07:00 |
| `logger.py` | [tune] Tune onto Logging Module (#2882) | 2018-09-16 12:09:36 -07:00 |
| `ParallelCoordinatesVisualization.ipynb` | [rllib, tune] TrainingResult -> Dict, Removes C408 from flake8 (#2565) | 2018-08-07 12:17:44 -07:00 |
| `ray_trial_executor.py` | [tune] Node Fault Tolerance (#3238) | 2018-11-21 12:38:16 -08:00 |
| `README.rst` | [tune] Annotated Example Page and showcase Tutorials (#3267) | 2018-11-08 23:45:05 -08:00 |
| `registry.py` | [tune] Fix registering trainable twice (#2293) | 2018-06-27 16:29:39 -07:00 |
| `result.py` | [tune] Doc: Autofilled, StatusReporter (#3294) | 2018-11-13 13:15:56 -08:00 |
| `trainable.py` | [tune] Fix default handling for timesteps (#3293) | 2018-11-12 15:52:17 -08:00 |
| `trial_executor.py` | [tune] Node Fault Tolerance (#3238) | 2018-11-21 12:38:16 -08:00 |
| `trial_runner.py` | [tune] Node Fault Tolerance (#3238) | 2018-11-21 12:38:16 -08:00 |
| `trial.py` | [tune] Node Fault Tolerance (#3238) | 2018-11-21 12:38:16 -08:00 |
| `tune.py` | [tune] Add a raise_on_failed_trial flag in run_experiments (#2961) | 2018-09-29 11:29:46 -07:00 |
| `TuneClient.ipynb` | [tune] Split Search from Scheduling (#2452) | 2018-08-04 21:27:39 -07:00 |
| `util.py` | Remove legacy Ray code. (#3121) | 2018-10-26 13:36:58 -07:00 |
| `visual_utils.py` | [rllib] Fix LSTM regression on truncated sequences and add regression test (#2898) | 2018-09-18 15:09:16 -07:00 |
| `web_server.py` | [tune] Tune onto Logging Module (#2882) | 2018-09-16 12:09:36 -07:00 |

README.rst

Tune: Scalable Hyperparameter Search
====================================

Tune is a scalable framework for hyperparameter search with a focus on deep learning and deep reinforcement learning.

User documentation can be `found here <http://ray.readthedocs.io/en/latest/tune.html>`__.


Tutorial
--------

To get started with Tune, try going through `our tutorial on using Tune with Keras <https://github.com/ray-project/tutorial/blob/master/tune_exercises/Tune.ipynb>`__.

(Experimental): You can try out `the above tutorial on a free hosted server via Binder <https://mybinder.org/v2/gh/ray-project/tutorial/master?filepath=tune_exercises%2FTune.ipynb>`__.


Citing Tune
-----------

If Tune helps you in your academic research, you are encouraged to cite `our paper <https://arxiv.org/abs/1807.05118>`__. Here is an example BibTeX entry:

.. code-block:: tex

    @article{liaw2018tune,
        title={Tune: A Research Platform for Distributed Model Selection and Training},
        author={Liaw, Richard and Liang, Eric and Nishihara, Robert and
                Moritz, Philipp and Gonzalez, Joseph E and Stoica, Ion},
        journal={arXiv preprint arXiv:1807.05118},
        year={2018}
    }