10 Commits

Author SHA1 Message Date
Ujval Misra 375852af23 [tune] Check node liveness before result fetch (#5844)
* Check if trial's node is alive before trying to fetch result

* Added function for failed trials to trial_executor interface

* Address comments, add test.
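
The liveness check described above can be illustrated with a minimal sketch. This is not the code from the commit: `fetch_result_if_alive`, `alive_nodes`, and `fetch` are hypothetical names chosen for illustration, and the real trial_executor interface may differ.

```python
from typing import Callable, Optional, Set


def fetch_result_if_alive(
    trial_node: str,
    alive_nodes: Set[str],
    fetch: Callable[[], dict],
) -> Optional[dict]:
    """Return the trial's result, or None if its node is no longer alive.

    All names here are hypothetical; this only sketches the idea of
    checking node liveness before attempting a result fetch.
    """
    if trial_node not in alive_nodes:
        # The caller would treat the trial as failed instead of
        # blocking on a fetch that can never complete.
        return None
    return fetch()


# The trial's node is gone, so no fetch is attempted.
dead_result = fetch_result_if_alive("node1", {"node2"}, lambda: {"loss": 0.1})
# The trial's node is alive, so the result is fetched normally.
live_result = fetch_result_if_alive("node2", {"node2"}, lambda: {"loss": 0.1})
```

Returning `None` keeps the sketch short; an implementation could equally raise an exception or call a dedicated failed-trial handler.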
2019-10-08 11:41:01 -07:00
Tianming Xu 1302fafc0b [Tune] Add export_formats option to export policy graphs (#3868)
Earlier PRs (#3585 and #3637) introduced export_policy_model and export_policy_checkpoint so that users can export a TensorFlow model and checkpoint.

For Ray Tune users, these APIs are not accessible through YAML configurations.

This pull request adds an export_formats option that lets users choose the desired export format.
2019-01-31 17:07:27 -08:00
Richard Liaw aad3c50e2d [tune] Cluster Fault Tolerance (#3309)
This PR introduces cluster-level fault tolerance for Tune by checkpointing global state. Checkpointing occurs with relatively high frequency, which lets users easily resume experiments when the cluster crashes.

Note that this PR may affect automated workflows due to auto-prompting, but this is resolvable.
2018-12-29 11:42:25 +08:00
Richard Liaw 9d0bd50e78 [tune] Component notification on node failure + Tests (#3414)
Changes include:
 - Notify Components on Requeue
 - Slight refactoring of Node Failure handling
 - Better tests
2018-12-04 14:47:31 -08:00
Richard Liaw 784a6399b0 [tune] Node Fault Tolerance (#3238)
This PR introduces single-node fault tolerance for Tune.

## Previous behavior:
 - Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources.

## New behavior:
 - RUNNING trials will be resumed on another node on a best-effort basis (that is, they will run if resources are available).
 - If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued.
 - During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running.
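
The new behavior listed above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the PR's implementation: the `Trial` dataclass, `handle_node_failure`, `free_cpus`, and `notify` are all hypothetical names, and resources are reduced to a single CPU count per node.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

PENDING = "PENDING"
RUNNING = "RUNNING"


@dataclass
class Trial:
    # Hypothetical, simplified trial record (not Tune's Trial class).
    trial_id: str
    node: str
    cpus: int
    status: str = RUNNING


def handle_node_failure(
    trials: List[Trial],
    failed_node: str,
    free_cpus: Dict[str, int],
    notify: Callable[[Trial], None],
) -> None:
    """Resume trials from a failed node elsewhere, or queue them as PENDING."""
    free_cpus.pop(failed_node, None)  # the failed node's resources are gone
    for trial in trials:
        if trial.status != RUNNING or trial.node != failed_node:
            continue
        # Stand-in for notifying schedulers and search algorithms so they
        # do not wait on a trial that is no longer running.
        notify(trial)
        target = next(
            (node for node, cpus in free_cpus.items() if cpus >= trial.cpus),
            None,
        )
        if target is None:
            trial.status = PENDING  # cluster saturated: queue the trial
            trial.node = ""
        else:
            free_cpus[target] -= trial.cpus  # best-effort resume elsewhere
            trial.node = target


trials = [Trial("a", "node1", 2), Trial("b", "node1", 4), Trial("c", "node2", 1)]
free = {"node1": 0, "node2": 3}
notified: List[str] = []
handle_node_failure(trials, "node1", free, lambda t: notified.append(t.trial_id))
```

In this example, trial `a` fits on `node2` and keeps running there, trial `b` needs more CPUs than any surviving node has free and becomes PENDING, and trial `c` is untouched because its node did not fail.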


Remaining questions:
 - Should `last_result` be consistent during restore?
   Yes, but not for earlier trials (trials that have yet to be checkpointed).

 - Waiting for some PRs to merge first (#3239)

Closes #2851.
2018-11-21 12:38:16 -08:00
Eric Liang 65c27c70cf [rllib] Clean up agent resource configurations (#3296)
Closes #3284
2018-11-13 18:00:03 -08:00
Richard Liaw f372f48bf3 [tune] Tune onto Logging Module (#2882)
Moves Tune onto logging in Python. Ignores examples and tests.
2018-09-16 12:09:36 -07:00
Kaahan 045861c9b0 [tune] Reset Config for Trainables (#2831)
Adds the ability for trainables to reset their configurations during experiments. In particular, these changes add the base functions to the trial_executor and trainable interfaces and provide a basic implementation in the PopulationBasedTraining scheduler.

Related issue number: #2741
2018-09-11 08:45:04 -07:00
Richard Liaw 72542c9016 [tune] Fix Pausing and Error Propagation (#2815)
* add new tests

* Try-catch errors from ray get

* longer pbt run

* Update pbt_example.py

* Split trial and result and fix tests
2018-09-04 15:22:11 -07:00
joyyoj 38867eea4e [tune] Cross-Framework Compatibility (#2646)
This commit is a first pass at restructuring the Trial execution logic to support running on multiple frameworks.
2018-08-22 10:55:45 -07:00