aad3c50e2d
This PR introduces cluster-level fault tolerance for Tune by checkpointing global state. The global state is checkpointed at a relatively high frequency, so users can easily resume experiments after a cluster crash. Note that this PR may affect automated workflows, because it introduces automatic prompting, but this is resolvable.
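The general idea of frequent global-state checkpointing with resume can be sketched as follows. This is a minimal, generic illustration and not Tune's actual implementation: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the `experiment_state.json` file name, the `resume` flag, and the `crash_at` parameter are all hypothetical and exist only for this example.

```python
import json
import os
import tempfile


def save_checkpoint(state, checkpoint_dir, filename="experiment_state.json"):
    """Atomically write the global state so a crash never leaves a partial file."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    final_path = os.path.join(checkpoint_dir, filename)
    # Write to a temporary file in the same directory, then rename over the target.
    fd, tmp_path = tempfile.mkstemp(dir=checkpoint_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, final_path)
    return final_path


def load_checkpoint(checkpoint_dir, filename="experiment_state.json"):
    """Return the last saved state, or None if no checkpoint exists."""
    path = os.path.join(checkpoint_dir, filename)
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(checkpoint_dir, total_steps, checkpoint_every=1, resume=True, crash_at=None):
    """Run a toy experiment loop, checkpointing global state every few steps.

    `crash_at` (hypothetical) simulates a cluster crash before the given step.
    """
    state = load_checkpoint(checkpoint_dir) if resume else None
    if state is None:
        state = {"step": 0, "results": []}
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated cluster crash")
        state["results"].append(state["step"] ** 2)  # stand-in for real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, checkpoint_dir)
    return state
```

In this sketch, an explicit `resume` argument avoids any interactive decision, which is one way an automated workflow could sidestep a prompt. Whether Tune exposes an equivalent option is not stated here, so treat that as an assumption.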