[tune] Cluster Fault Tolerance (#3309)

This PR introduces cluster-level fault tolerance for Tune by checkpointing global experiment state. The checkpoint is written at relatively high frequency, so users can easily resume an experiment after the cluster crashes.

Note that this PR adds automatic prompting, which may affect automated (non-interactive) workflows, but this is resolvable.
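
As a rough illustration of the checkpoint-and-resume idea described above, here is a minimal sketch. It is not the code from this PR: the class name `ExperimentCheckpointer`, the file name `experiment_state.json`, and the `period_s` parameter are all hypothetical, and the state is assumed to be JSON-serializable.

```python
import json
import os
import tempfile
import time


class ExperimentCheckpointer:
    """Hypothetical helper that periodically persists global experiment state."""

    def __init__(self, checkpoint_dir, period_s=10.0):
        self.checkpoint_dir = checkpoint_dir
        # Assumed file name, chosen for this sketch only.
        self.path = os.path.join(checkpoint_dir, "experiment_state.json")
        self.period_s = period_s
        self._last_save = float("-inf")  # guarantees the first save goes through

    def maybe_save(self, state, force=False):
        """Write `state` if `period_s` seconds have elapsed; return True if written."""
        now = time.monotonic()
        if not force and now - self._last_save < self.period_s:
            return False
        # Write to a temporary file and rename it into place, so a crash
        # in the middle of a write never leaves a truncated checkpoint.
        fd, tmp_path = tempfile.mkstemp(dir=self.checkpoint_dir)
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, self.path)
        self._last_save = now
        return True

    def load(self):
        """Return the last saved state, or None if no checkpoint exists."""
        if not os.path.exists(self.path):
            return None
        with open(self.path) as f:
            return json.load(f)
```

A driver loop would call `maybe_save` after each scheduling step and call `load` once at startup, resuming from the returned state if it is not `None`. The write-then-rename step is a design choice for this sketch: it keeps the previous checkpoint readable even if the process dies while saving.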
Richard Liaw
2018-12-29 11:42:25 +08:00
committed by GitHub
parent 382b138fc7
commit aad3c50e2d
16 changed files with 806 additions and 128 deletions
+3 -1
@@ -51,7 +51,9 @@ class Cluster(object):
assert not self.connected
redis_password = head_node_args.get("redis_password")
output_info = ray.init(
-    redis_address=self.redis_address, redis_password=redis_password)
+    ignore_reinit_error=True,
+    redis_address=self.redis_address,
+    redis_password=redis_password)
logger.info(output_info)
self.connected = True