Implement actor checkpointing (#3839)

* Implement Actor checkpointing

* docs

* fix

* fix

* fix

* move restore-from-checkpoint to HandleActorStateTransition

* Revert "move restore-from-checkpoint to HandleActorStateTransition"

This reverts commit 9aa4447c1e3e321f42a1d895d72f17098b72de12.

* resubmit waiting tasks when actor frontier restored

* add doc about num_actor_checkpoints_to_keep=1

* add num_actor_checkpoints_to_keep to Cython

* add checkpoint_expired api

* check if actor class is abstract

* change checkpoint_ids to long string

* implement java

* Refactor to delay actor creation publish until checkpoint is resumed

* debug, lint

* Erase from checkpoints to restore if task fails

* fix lint

* update comments

* avoid duplicated actor notification log

* fix unintended change

* add actor_id to checkpoint_expired

* small java updates

* make checkpoint info per actor

* lint

* Remove logging

* Remove old actor checkpointing Python code, move new checkpointing code to FunctionActorManager

* Replace old actor checkpointing tests

* Fix test and lint

* address comments

* consolidate kill_actor

* Remove __ray_checkpoint__

* fix non-ascii char

* Loosen test checks

* fix java

* fix sphinx-build
Hao Chen
2019-02-13 19:39:02 +08:00
committed by GitHub
parent 57dcd3033e
commit f31a79f3f7
41 changed files with 1708 additions and 490 deletions
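Several items in the commit message (num_actor_checkpoints_to_keep, checkpoint_expired with an actor_id argument, checkpoint info kept per actor) describe a retention policy: only the newest N checkpoints of each actor are kept, and the actor is notified when an older one is dropped. The sketch below illustrates that idea in plain Python. It is not Ray's implementation and does not import Ray; the class CheckpointRetention, its methods, and the on_expired callback are hypothetical names chosen for this example.

```python
from collections import defaultdict, deque


class CheckpointRetention:
    """Hypothetical helper (not part of Ray): keeps the newest
    `num_to_keep` checkpoint IDs for each actor."""

    def __init__(self, num_to_keep, on_expired):
        if num_to_keep < 1:
            raise ValueError("num_to_keep must be at least 1")
        self.num_to_keep = num_to_keep
        # Callback invoked as on_expired(actor_id, checkpoint_id).
        self.on_expired = on_expired
        # actor_id -> checkpoint IDs, oldest first.
        self._checkpoints = defaultdict(deque)

    def record(self, actor_id, checkpoint_id):
        """Register a new checkpoint and expire the oldest ones if needed."""
        kept = self._checkpoints[actor_id]
        kept.append(checkpoint_id)
        while len(kept) > self.num_to_keep:
            self.on_expired(actor_id, kept.popleft())

    def available(self, actor_id):
        """Return the retained checkpoint IDs, newest first."""
        return list(reversed(self._checkpoints.get(actor_id, ())))


expired = []
retention = CheckpointRetention(
    num_to_keep=1,
    on_expired=lambda actor_id, ckpt_id: expired.append((actor_id, ckpt_id)),
)
retention.record("actor-1", "ckpt-a")
retention.record("actor-1", "ckpt-b")
print(retention.available("actor-1"))  # ['ckpt-b']
print(expired)  # [('actor-1', 'ckpt-a')]
```

With num_to_keep=1, as in the example, every new checkpoint immediately expires the previous one, so the callback is the natural place to delete the old checkpoint's stored data.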
+24 -5
@@ -49,9 +49,20 @@ except ImportError as e:
 modin_path = os.path.join(os.path.abspath(os.path.dirname(__file__)), "modin")
 sys.path.append(modin_path)
-from ray._raylet import (UniqueID, ObjectID, DriverID, ClientID, ActorID,
-                         ActorHandleID, FunctionID, ActorClassID, TaskID,
-                         _ID_TYPES, Config as _Config)  # noqa: E402
+from ray._raylet import (
+    ActorCheckpointID,
+    ActorClassID,
+    ActorHandleID,
+    ActorID,
+    ClientID,
+    Config as _Config,
+    DriverID,
+    FunctionID,
+    ObjectID,
+    TaskID,
+    UniqueID,
+    _ID_TYPES,
+)  # noqa: E402
 _config = _Config()
@@ -82,8 +93,16 @@ __all__ = [
 ]
 __all__ += [
-    "UniqueID", "ObjectID", "DriverID", "ClientID", "ActorID", "ActorHandleID",
-    "FunctionID", "ActorClassID", "TaskID"
+    "ActorCheckpointID",
+    "ActorClassID",
+    "ActorHandleID",
+    "ActorID",
+    "ClientID",
+    "DriverID",
+    "FunctionID",
+    "ObjectID",
+    "TaskID",
+    "UniqueID",
 ]
 import ctypes  # noqa: E402