* Check if trial's node is alive before trying to fetch result
* Add function for failed trials to trial_executor interface
* Address comments, add test.
Earlier PRs (#3585 and #3637) introduced export_policy_model and export_policy_checkpoint, which let users export a TensorFlow model and checkpoint.
However, Ray Tune users cannot reach these APIs through YAML configurations.
This pull request adds an export_formats option that lets users choose the desired export format.
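
As a rough illustration of the option, the sketch below writes an experiment entry as a plain Python dict (the shape a YAML file would load into) and checks `export_formats` against a set of allowed values. The experiment name, the `validate_export_formats` helper, and the allowed values `"checkpoint"` and `"model"` are assumptions made for this example and are not confirmed by the description above.

```python
# Hypothetical sketch: helper name and allowed values are assumptions.
ALLOWED_EXPORT_FORMATS = {"checkpoint", "model"}  # assumed values


def validate_export_formats(spec):
    """Return the requested export formats, rejecting unknown ones."""
    formats = spec.get("export_formats", [])
    if isinstance(formats, str):
        # Accept a single format given as a bare string.
        formats = [formats]
    unknown = [f for f in formats if f not in ALLOWED_EXPORT_FORMATS]
    if unknown:
        raise ValueError("Unsupported export formats: {}".format(unknown))
    return list(formats)


# Dict equivalent of a YAML experiment entry.
experiment_spec = {
    "cartpole-ppo": {  # hypothetical experiment name
        "run": "PPO",
        "env": "CartPole-v0",
        "stop": {"training_iteration": 1},
        "export_formats": ["model", "checkpoint"],
    }
}

requested = validate_export_formats(experiment_spec["cartpole-ppo"])
```

Validating the list up front means a typo in the configuration fails before any training time is spent.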
This PR introduces cluster-level fault tolerance for Tune by checkpointing global state. Checkpointing occurs at relatively high frequency, so users can easily resume experiments after the cluster crashes.
Note that this PR may affect automated workflows due to auto-prompting, but this is resolvable.
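
The general idea of checkpointing global state can be sketched as follows. This is a minimal illustration rather than the code in this PR: the function names, the JSON format, and the `experiment_state.json` filename are all assumptions. The state is written to a temporary file and then moved into place, so a crash in the middle of a write does not corrupt the last good checkpoint.

```python
import json
import os
import tempfile


def save_experiment_state(state, checkpoint_dir,
                          filename="experiment_state.json"):
    """Write state to a temp file, then move it over the final path."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    final_path = os.path.join(checkpoint_dir, filename)
    fd, tmp_path = tempfile.mkstemp(dir=checkpoint_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # os.replace overwrites the destination in one step when both
    # paths are on the same filesystem.
    os.replace(tmp_path, final_path)
    return final_path


def load_experiment_state(checkpoint_dir, filename="experiment_state.json"):
    """Return the last saved state, or None if no checkpoint exists."""
    path = os.path.join(checkpoint_dir, filename)
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


# Hypothetical global state: one trial that was running at save time.
checkpoint_dir = tempfile.mkdtemp()
state = {"trials": [{"trial_id": "t1", "status": "RUNNING", "iteration": 7}]}
save_experiment_state(state, checkpoint_dir)
restored = load_experiment_state(checkpoint_dir)
```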
This PR introduces single-node fault tolerance for Tune.
## Previous behavior:
- Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources.
## New behavior:
- RUNNING trials will be resumed on another node on a best-effort basis (meaning they will run if resources are available).
- If the cluster is saturated, RUNNING trials on the failed node will become PENDING and be queued.
- During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running.
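
The new behavior listed above can be sketched with a toy model. Everything here is hypothetical: the `Trial` class, `recover_from_node_failure`, and the CPU-only resource accounting are stand-ins, and `notify_stopped` only imitates the notification that the PR routes through `trial_runner.stop_trial`. Notifying for every interrupted trial, including ones that are moved immediately, is also an assumption.

```python
PENDING, RUNNING = "PENDING", "RUNNING"


class Trial:
    """Toy trial record (hypothetical, not Tune's Trial class)."""

    def __init__(self, trial_id, node, cpus=1):
        self.trial_id = trial_id
        self.node = node
        self.cpus = cpus
        self.status = RUNNING


def recover_from_node_failure(trials, failed_node, free_cpus, notify_stopped):
    """Move or requeue trials that were RUNNING on failed_node.

    free_cpus maps each live node to its free CPUs and is updated in place.
    notify_stopped is called once per interrupted trial so schedulers and
    search algorithms do not block on a trial that is no longer running.
    """
    for trial in trials:
        if trial.status != RUNNING or trial.node != failed_node:
            continue
        notify_stopped(trial)
        # Best effort: take the first live node with enough free CPUs.
        target = next(
            (n for n, free in free_cpus.items() if free >= trial.cpus), None)
        if target is None:
            # Cluster is saturated: queue the trial.
            trial.node = None
            trial.status = PENDING
        else:
            free_cpus[target] -= trial.cpus
            trial.node = target


trials = [Trial("a", "node1"), Trial("b", "node1"), Trial("c", "node2")]
stopped = []
recover_from_node_failure(
    trials, "node1", {"node2": 1}, lambda t: stopped.append(t.trial_id))
```

With one free CPU on `node2`, trial `a` is moved there, trial `b` is queued as PENDING, and trial `c` is untouched.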
Remaining questions:
- Should `last_result` be consistent during restore?
Yes, but not for earlier trials (trials that are yet to be checkpointed).
- Waiting for some PRs to merge first (#3239)
Closes #2851.
Adds the ability for trainables to reset their configurations during experiments. In particular, these changes add the base functions to the trial_executor and trainable interfaces and provide a basic implementation in the PopulationBasedTraining scheduler.
Related issue number: #2741
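
A minimal sketch of how a configuration reset hook might fit together, assuming a trainable-side method that reports whether it could apply the new configuration and an executor-side helper that falls back to rebuilding. The names `reset_config` and `reset_trial`, the return-value convention, and the `lr` hyperparameter are assumptions for illustration and are not taken from the description above.

```python
class Trainable:
    """Minimal stand-in for a trainable (hypothetical)."""

    def __init__(self, config):
        self.config = dict(config)
        self.setup_calls = 0
        self._setup(self.config)

    def _setup(self, config):
        # Stands in for expensive initialization.
        self.setup_calls += 1
        self.lr = config["lr"]

    def reset_config(self, new_config):
        """Apply new_config in place; return False if unsupported."""
        return False


class ResettableTrainable(Trainable):
    def reset_config(self, new_config):
        # Update hyperparameters without repeating _setup.
        self.config = dict(new_config)
        self.lr = new_config["lr"]
        return True


def reset_trial(trainable, new_config):
    """Executor-side helper: reuse the instance if it can reset itself,
    otherwise build a fresh one with the new configuration."""
    if trainable.reset_config(new_config):
        return trainable
    return type(trainable)(new_config)


a = ResettableTrainable({"lr": 0.1})
a2 = reset_trial(a, {"lr": 0.01})  # same object, updated in place

b = Trainable({"lr": 0.1})
b2 = reset_trial(b, {"lr": 0.01})  # new object, full setup repeated
```

For a scheduler such as PopulationBasedTraining, which changes hyperparameters of live trials, reusing the instance avoids paying the setup cost on every perturbation.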