mirror of
https://github.com/wassname/ray.git
synced 2026-06-27 19:32:11 +08:00
Option to retry failed actor tasks (#8330)
* Python * Consolidate state in the direct actor transport, set the caller starts at * todo * Remove unused * Update and unit tests * Doc * Remove unused * doc * Remove debug * Update src/ray/core_worker/transport/direct_actor_transport.h Co-authored-by: Eric Liang <ekhliang@gmail.com> * Update src/ray/core_worker/transport/direct_actor_transport.cc Co-authored-by: Eric Liang <ekhliang@gmail.com> * lint and fix build * Update * Fix build * Fix tests * Unit test for max_task_retries=0 * Fix java? * Fix bad test * Cross language fix * fix java Co-authored-by: Eric Liang <ekhliang@gmail.com>
This commit is contained in:
@@ -78,12 +78,86 @@ You can experiment with this behavior by running the following code.
|
||||
actor = Actor.remote()
|
||||
|
||||
# The actor will be restarted up to 5 times. After that, methods will
|
||||
# raise exceptions. The actor is restarted by rerunning its
|
||||
# constructor. Methods that were executing when the actor died will also
|
||||
# raise exceptions.
|
||||
# always raise a `RayActorError` exception. The actor is restarted by
|
||||
# rerunning its constructor. Methods that were sent or executing when the
|
||||
# actor died will also raise a `RayActorError` exception.
|
||||
for _ in range(100):
|
||||
try:
|
||||
counter = ray.get(actor.increment_and_possibly_fail.remote())
|
||||
print(counter)
|
||||
except ray.exceptions.RayActorError:
|
||||
print('FAILURE')
|
||||
|
||||
By default, actor tasks execute with at-most-once semantics
|
||||
(``max_task_retries=0`` in the ``@ray.remote`` decorator). This means that if an
|
||||
actor task is submitted to an actor that is unreachable, Ray will report the
|
||||
error with ``RayActorError``, a Python-level exception that is thrown when
|
||||
``ray.get`` is called on the future returned by the task. Note that this
|
||||
exception may be thrown even though the task did indeed execute successfully.
|
||||
For example, this can happen if the actor dies immediately after executing the
|
||||
task.
|
||||
|
||||
Ray also offers at-least-once execution semantics for actor tasks
|
||||
(``max_task_retries=-1`` or ``max_task_retries > 0``). This means that if an
|
||||
actor task is submitted to an actor that is unreachable, the system will
|
||||
automatically retry the task until it receives a reply from the actor. With
|
||||
this option, the system will only throw a ``RayActorError`` to the application
|
||||
if one of the following occurs: (1) the actor’s ``max_restarts`` limit has been
|
||||
exceeded and the actor cannot be restarted anymore, or (2) the
|
||||
``max_task_retries`` limit has been exceeded for this particular task. The
|
||||
limit can be set to infinity with ``max_task_retries = -1``.
|
||||
|
||||
You can experiment with this behavior by running the following code.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
import os
|
||||
import ray
|
||||
|
||||
ray.init(ignore_reinit_error=True)
|
||||
|
||||
@ray.remote(max_restarts=5, max_task_retries=-1)
|
||||
class Actor:
|
||||
def __init__(self):
|
||||
self.counter = 0
|
||||
|
||||
def increment_and_possibly_fail(self):
|
||||
# Exit after every 10 tasks.
|
||||
if self.counter == 10:
|
||||
os._exit(0)
|
||||
self.counter += 1
|
||||
return self.counter
|
||||
|
||||
actor = Actor.remote()
|
||||
|
||||
# The actor will be reconstructed up to 5 times. The actor is
|
||||
# reconstructed by rerunning its constructor. Methods that were
|
||||
# executing when the actor died will be retried and will not
|
||||
# raise a `RayActorError`. Retried methods may execute twice, once
|
||||
# on the failed actor and a second time on the restarted actor.
|
||||
for _ in range(50):
|
||||
counter = ray.get(actor.increment_and_possibly_fail.remote())
|
||||
print(counter) # Prints the sequence 1-10 5 times.
|
||||
|
||||
# After the actor has been restarted 5 times, all subsequent methods will
|
||||
# raise a `RayActorError`.
|
||||
for _ in range(10):
|
||||
try:
|
||||
counter = ray.get(actor.increment_and_possibly_fail.remote())
|
||||
print(counter) # Unreachable.
|
||||
except ray.exceptions.RayActorError:
|
||||
print('FAILURE') # Prints 10 times.
|
||||
|
||||
For at-least-once actors, the system will still guarantee execution ordering
|
||||
according to the initial submission order. For example, any tasks submitted
|
||||
after a failed actor task will not execute on the actor until the failed actor
|
||||
task has been successfully retried. The system also will not attempt to
|
||||
re-execute any tasks that executed successfully before the failure.
|
||||
|
||||
At-least-once execution is best suited for read-only actors or actors with
|
||||
ephemeral state that does not need to be rebuilt after a failure. For actors
|
||||
that have critical state, it is best to take periodic checkpoints and either
|
||||
manually restart the actor or automatically restart the actor with at-most-once
|
||||
semantics. If the actor’s exact state at the time of failure is needed, the
|
||||
application is responsible for resubmitting all tasks since the last
|
||||
checkpoint.
|
||||
|
||||
@@ -903,6 +903,7 @@ cdef class CoreWorker:
|
||||
FunctionDescriptor function_descriptor,
|
||||
args,
|
||||
int64_t max_restarts,
|
||||
int64_t max_task_retries,
|
||||
resources,
|
||||
placement_resources,
|
||||
int32_t max_concurrency,
|
||||
@@ -929,7 +930,7 @@ cdef class CoreWorker:
|
||||
check_status(CCoreWorkerProcess.GetCoreWorker().CreateActor(
|
||||
ray_function, args_vector,
|
||||
CActorCreationOptions(
|
||||
max_restarts, max_concurrency,
|
||||
max_restarts, max_task_retries, max_concurrency,
|
||||
c_resources, c_placement_resources,
|
||||
dynamic_worker_options, is_detached, name, is_asyncio),
|
||||
extension_data,
|
||||
|
||||
+21
-9
@@ -244,7 +244,8 @@ class ActorClassMetadata:
|
||||
|
||||
def __init__(self, language, modified_class,
|
||||
actor_creation_function_descriptor, class_id, max_restarts,
|
||||
num_cpus, num_gpus, memory, object_store_memory, resources):
|
||||
max_task_retries, num_cpus, num_gpus, memory,
|
||||
object_store_memory, resources):
|
||||
self.language = language
|
||||
self.modified_class = modified_class
|
||||
self.actor_creation_function_descriptor = \
|
||||
@@ -253,6 +254,7 @@ class ActorClassMetadata:
|
||||
self.is_cross_language = language != Language.PYTHON
|
||||
self.class_id = class_id
|
||||
self.max_restarts = max_restarts
|
||||
self.max_task_retries = max_task_retries
|
||||
self.num_cpus = num_cpus
|
||||
self.num_gpus = num_gpus
|
||||
self.memory = memory
|
||||
@@ -314,7 +316,7 @@ class ActorClass:
|
||||
|
||||
@classmethod
|
||||
def _ray_from_modified_class(cls, modified_class, class_id, max_restarts,
|
||||
num_cpus, num_gpus, memory,
|
||||
max_task_retries, num_cpus, num_gpus, memory,
|
||||
object_store_memory, resources):
|
||||
for attribute in [
|
||||
"remote", "_remote", "_ray_from_modified_class",
|
||||
@@ -344,20 +346,22 @@ class ActorClass:
|
||||
self.__ray_metadata__ = ActorClassMetadata(
|
||||
Language.PYTHON, modified_class,
|
||||
actor_creation_function_descriptor, class_id, max_restarts,
|
||||
num_cpus, num_gpus, memory, object_store_memory, resources)
|
||||
max_task_retries, num_cpus, num_gpus, memory, object_store_memory,
|
||||
resources)
|
||||
|
||||
return self
|
||||
|
||||
@classmethod
|
||||
def _ray_from_function_descriptor(
|
||||
cls, language, actor_creation_function_descriptor, max_restarts,
|
||||
num_cpus, num_gpus, memory, object_store_memory, resources):
|
||||
max_task_retries, num_cpus, num_gpus, memory, object_store_memory,
|
||||
resources):
|
||||
self = ActorClass.__new__(ActorClass)
|
||||
|
||||
self.__ray_metadata__ = ActorClassMetadata(
|
||||
language, None, actor_creation_function_descriptor, None,
|
||||
max_restarts, num_cpus, num_gpus, memory, object_store_memory,
|
||||
resources)
|
||||
max_restarts, max_task_retries, num_cpus, num_gpus, memory,
|
||||
object_store_memory, resources)
|
||||
|
||||
return self
|
||||
|
||||
@@ -406,6 +410,7 @@ class ActorClass:
|
||||
is_direct_call=None,
|
||||
max_concurrency=None,
|
||||
max_restarts=None,
|
||||
max_task_retries=None,
|
||||
name=None,
|
||||
detached=False):
|
||||
"""Create an actor.
|
||||
@@ -557,6 +562,7 @@ class ActorClass:
|
||||
meta.actor_creation_function_descriptor,
|
||||
creation_args,
|
||||
max_restarts or meta.max_restarts,
|
||||
max_task_retries or meta.max_task_retries,
|
||||
resources,
|
||||
actor_placement_resources,
|
||||
max_concurrency,
|
||||
@@ -891,11 +897,13 @@ def modify_class(cls):
|
||||
|
||||
|
||||
def make_actor(cls, num_cpus, num_gpus, memory, object_store_memory, resources,
|
||||
max_restarts):
|
||||
max_restarts, max_task_retries):
|
||||
Class = modify_class(cls)
|
||||
|
||||
if max_restarts is None:
|
||||
max_restarts = 0
|
||||
if max_task_retries is None:
|
||||
max_task_retries = 0
|
||||
|
||||
infinite_restart = max_restarts == -1
|
||||
if not infinite_restart:
|
||||
@@ -907,9 +915,13 @@ def make_actor(cls, num_cpus, num_gpus, memory, object_store_memory, resources,
|
||||
# an overflow.
|
||||
max_restarts = min(max_restarts, ray_constants.MAX_INT64_VALUE)
|
||||
|
||||
if max_restarts == 0 and max_task_retries != 0:
|
||||
raise ValueError(
|
||||
"max_task_retries cannot be set if max_restarts is 0.")
|
||||
|
||||
return ActorClass._ray_from_modified_class(
|
||||
Class, ActorClassID.from_random(), max_restarts, num_cpus, num_gpus,
|
||||
memory, object_store_memory, resources)
|
||||
Class, ActorClassID.from_random(), max_restarts, max_task_retries,
|
||||
num_cpus, num_gpus, memory, object_store_memory, resources)
|
||||
|
||||
|
||||
def exit_actor():
|
||||
|
||||
@@ -77,6 +77,7 @@ def java_actor_class(class_name):
|
||||
Language.JAVA,
|
||||
JavaFunctionDescriptor(class_name, "<init>", ""),
|
||||
0, # max_restarts,
|
||||
0, # max_task_retries,
|
||||
None, # num_cpus,
|
||||
None, # num_gpus,
|
||||
None, # memory,
|
||||
|
||||
@@ -231,6 +231,7 @@ cdef extern from "ray/core_worker/common.h" nogil:
|
||||
CActorCreationOptions()
|
||||
CActorCreationOptions(
|
||||
int64_t max_restarts,
|
||||
int64_t max_task_retries,
|
||||
int32_t max_concurrency,
|
||||
const unordered_map[c_string, double] &resources,
|
||||
const unordered_map[c_string, double] &placement_resources,
|
||||
|
||||
@@ -51,7 +51,7 @@ py_test(
|
||||
size = "medium",
|
||||
srcs = ["test_actor_failures.py"],
|
||||
# TODO(ekl) enable this once we support actor reconstruction again
|
||||
tags = ["exclusive", "manual"],
|
||||
tags = ["exclusive"],
|
||||
deps = ["//:ray_lib"],
|
||||
)
|
||||
|
||||
|
||||
@@ -3,10 +3,6 @@ import json
|
||||
import numpy as np
|
||||
import os
|
||||
import pytest
|
||||
try:
|
||||
import pytest_timeout
|
||||
except ImportError:
|
||||
pytest_timeout = None
|
||||
import signal
|
||||
import sys
|
||||
import time
|
||||
@@ -88,9 +84,13 @@ def ray_checkpointable_actor_cls(request):
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"ray_start_object_store_memory", [150 * 1024 * 1024], indirect=True)
|
||||
def test_actor_eviction(ray_start_object_store_memory):
|
||||
object_store_memory = ray_start_object_store_memory
|
||||
"ray_start_regular", [{
|
||||
"object_store_memory": 150 * 1024 * 1024,
|
||||
"lru_evict": True,
|
||||
}],
|
||||
indirect=True)
|
||||
def test_actor_eviction(ray_start_regular):
|
||||
object_store_memory = 150 * 1024 * 1024
|
||||
|
||||
@ray.remote
|
||||
class Actor:
|
||||
@@ -127,10 +127,14 @@ def test_actor_eviction(ray_start_object_store_memory):
|
||||
assert num_success > 0
|
||||
|
||||
|
||||
def test_actor_restart(ray_start_regular):
|
||||
"""Test actor reconstruction when actor process is killed."""
|
||||
def test_actor_restart():
|
||||
"""Test actor restart when actor process is killed."""
|
||||
ray.init(
|
||||
_internal_config=json.dumps({
|
||||
"task_retry_delay_ms": 100,
|
||||
}), )
|
||||
|
||||
@ray.remote(max_restarts=1)
|
||||
@ray.remote(max_restarts=1, max_task_retries=-1)
|
||||
class RestartableActor:
|
||||
"""An actor that will be restarted at most once."""
|
||||
|
||||
@@ -147,20 +151,48 @@ def test_actor_restart(ray_start_regular):
|
||||
|
||||
actor = RestartableActor.remote()
|
||||
pid = ray.get(actor.get_pid.remote())
|
||||
# Call increase 3 times
|
||||
for _ in range(3):
|
||||
ray.get(actor.increase.remote())
|
||||
# Call increase again with some delay.
|
||||
result = actor.increase.remote(delay=0.5)
|
||||
# Sleep some time to wait for the above task to start execution.
|
||||
time.sleep(0.2)
|
||||
results = [actor.increase.remote() for _ in range(100)]
|
||||
# Kill actor process, while the above task is still being executed.
|
||||
os.kill(pid, signal.SIGKILL)
|
||||
# Check that the above task didn't fail and the actor is restarted.
|
||||
assert ray.get(result) == 4
|
||||
# Make sure that all tasks were executed in order before the actor's death.
|
||||
res = results.pop(0)
|
||||
i = 1
|
||||
while True:
|
||||
try:
|
||||
r = ray.get(res)
|
||||
if r != i:
|
||||
# Actor restarted without any failed tasks.
|
||||
break
|
||||
res = results.pop(0)
|
||||
i += 1
|
||||
except ray.exceptions.RayActorError:
|
||||
# Actor restarted.
|
||||
break
|
||||
# Find the first task to execute after the actor was restarted.
|
||||
while True:
|
||||
try:
|
||||
r = ray.get(res)
|
||||
break
|
||||
except ray.exceptions.RayActorError:
|
||||
res = results.pop(0)
|
||||
pass
|
||||
# Make sure that all tasks were executed in order after the actor's death.
|
||||
i = 1
|
||||
while True:
|
||||
r = ray.get(res)
|
||||
assert r == i
|
||||
if results:
|
||||
res = results.pop(0)
|
||||
i += 1
|
||||
else:
|
||||
break
|
||||
|
||||
# Check that we can still call the actor.
|
||||
assert ray.get(actor.increase.remote()) == 5
|
||||
result = actor.increase.remote()
|
||||
assert ray.get(result) == r + 1
|
||||
|
||||
# kill actor process one more time.
|
||||
results = [actor.increase.remote() for _ in range(100)]
|
||||
pid = ray.get(actor.get_pid.remote())
|
||||
os.kill(pid, signal.SIGKILL)
|
||||
# The actor has exceeded max restarts, and this task should fail.
|
||||
@@ -174,37 +206,154 @@ def test_actor_restart(ray_start_regular):
|
||||
# Check that the actor won't be restarted.
|
||||
with pytest.raises(ray.exceptions.RayActorError):
|
||||
ray.get(actor.increase.remote())
|
||||
ray.shutdown()
|
||||
|
||||
|
||||
def test_actor_restart_with_retry():
|
||||
"""Test actor restart when actor process is killed."""
|
||||
ray.init(
|
||||
_internal_config=json.dumps({
|
||||
"task_retry_delay_ms": 100,
|
||||
}), )
|
||||
|
||||
@ray.remote(max_restarts=1, max_task_retries=-1)
|
||||
class RestartableActor:
|
||||
"""An actor that will be restarted at most once."""
|
||||
|
||||
def __init__(self):
|
||||
self.value = 0
|
||||
|
||||
def increase(self, delay=0):
|
||||
time.sleep(delay)
|
||||
self.value += 1
|
||||
return self.value
|
||||
|
||||
def get_pid(self):
|
||||
return os.getpid()
|
||||
|
||||
actor = RestartableActor.remote()
|
||||
pid = ray.get(actor.get_pid.remote())
|
||||
results = [actor.increase.remote() for _ in range(100)]
|
||||
# Kill actor process, while the above task is still being executed.
|
||||
os.kill(pid, signal.SIGKILL)
|
||||
# Check that none of the tasks failed and the actor is restarted.
|
||||
seq = list(range(1, 101))
|
||||
results = ray.get(results)
|
||||
failed_task_index = None
|
||||
# Make sure that all tasks were executed in order before and after the
|
||||
# actor's death.
|
||||
for i, res in enumerate(results):
|
||||
if res != seq[0]:
|
||||
if failed_task_index is None:
|
||||
failed_task_index = i
|
||||
assert res + failed_task_index == seq[0]
|
||||
seq.pop(0)
|
||||
# Check that we can still call the actor.
|
||||
result = actor.increase.remote()
|
||||
assert ray.get(result) == results[-1] + 1
|
||||
|
||||
# kill actor process one more time.
|
||||
results = [actor.increase.remote() for _ in range(100)]
|
||||
pid = ray.get(actor.get_pid.remote())
|
||||
os.kill(pid, signal.SIGKILL)
|
||||
# The actor has exceeded max restarts, and this task should fail.
|
||||
with pytest.raises(ray.exceptions.RayActorError):
|
||||
ray.get(actor.increase.remote())
|
||||
|
||||
# Create another actor.
|
||||
actor = RestartableActor.remote()
|
||||
# Intentionlly exit the actor
|
||||
actor.__ray_terminate__.remote()
|
||||
# Check that the actor won't be restarted.
|
||||
with pytest.raises(ray.exceptions.RayActorError):
|
||||
ray.get(actor.increase.remote())
|
||||
ray.shutdown()
|
||||
|
||||
|
||||
def test_actor_restart_on_node_failure(ray_start_cluster):
|
||||
config = json.dumps({
|
||||
"num_heartbeats_timeout": 10,
|
||||
"raylet_heartbeat_timeout_milliseconds": 100,
|
||||
"initial_reconstruction_timeout_milliseconds": 1000,
|
||||
"task_retry_delay_ms": 100,
|
||||
})
|
||||
cluster = ray_start_cluster
|
||||
# Head node with no resources.
|
||||
cluster.add_node(num_cpus=0, _internal_config=config)
|
||||
# Node to place the actor.
|
||||
cluster.add_node(num_cpus=1, _internal_config=config)
|
||||
cluster.wait_for_nodes()
|
||||
ray.init(address=cluster.address)
|
||||
|
||||
@ray.remote(num_cpus=1, max_restarts=1, max_task_retries=-1)
|
||||
class RestartableActor:
|
||||
"""An actor that will be reconstructed at most once."""
|
||||
|
||||
def __init__(self):
|
||||
self.value = 0
|
||||
|
||||
def increase(self):
|
||||
self.value += 1
|
||||
return self.value
|
||||
|
||||
def ready(self):
|
||||
return
|
||||
|
||||
actor = RestartableActor.remote()
|
||||
ray.get(actor.ready.remote())
|
||||
results = [actor.increase.remote() for _ in range(100)]
|
||||
# Kill actor node, while the above task is still being executed.
|
||||
cluster.remove_node(cluster.list_all_nodes()[-1])
|
||||
cluster.add_node(num_cpus=1, _internal_config=config)
|
||||
cluster.wait_for_nodes()
|
||||
# Check that none of the tasks failed and the actor is restarted.
|
||||
seq = list(range(1, 101))
|
||||
results = ray.get(results)
|
||||
failed_task_index = None
|
||||
# Make sure that all tasks were executed in order before and after the
|
||||
# actor's death.
|
||||
for i, res in enumerate(results):
|
||||
elm = seq.pop(0)
|
||||
if res != elm:
|
||||
if failed_task_index is None:
|
||||
failed_task_index = i
|
||||
assert res + failed_task_index == elm
|
||||
# Check that we can still call the actor.
|
||||
result = ray.get(actor.increase.remote())
|
||||
assert result == 1 or result == results[-1] + 1
|
||||
|
||||
|
||||
def test_actor_restart_without_task(ray_start_regular):
|
||||
"""Test a dead actor can be restarted without sending task to it."""
|
||||
|
||||
@ray.remote(max_restarts=1)
|
||||
@ray.remote(max_restarts=1, resources={"actor": 1})
|
||||
class RestartableActor:
|
||||
def __init__(self, obj_ids):
|
||||
for obj_id in obj_ids:
|
||||
# Every time the actor gets constructed,
|
||||
# put a new object in plasma store.
|
||||
global_worker = ray.worker.global_worker
|
||||
if not global_worker.core_worker.object_exists(obj_id):
|
||||
global_worker.put_object(1, obj_id)
|
||||
break
|
||||
def __init__(self):
|
||||
pass
|
||||
|
||||
def get_pid(self):
|
||||
return os.getpid()
|
||||
|
||||
obj_ids = [ray.ObjectID.from_random() for _ in range(2)]
|
||||
actor = RestartableActor.remote(obj_ids)
|
||||
@ray.remote(resources={"actor": 1})
|
||||
def probe():
|
||||
return
|
||||
|
||||
# Returns whether the "actor" resource is available.
|
||||
def actor_resource_available():
|
||||
p = probe.remote()
|
||||
ready, _ = ray.wait([p], timeout=1)
|
||||
return len(ready) > 0
|
||||
|
||||
ray.experimental.set_resource("actor", 1)
|
||||
actor = RestartableActor.remote()
|
||||
assert wait_for_condition(lambda: not actor_resource_available())
|
||||
# Kill the actor.
|
||||
pid = ray.get(actor.get_pid.remote())
|
||||
|
||||
p = probe.remote()
|
||||
os.kill(pid, signal.SIGKILL)
|
||||
|
||||
# Wait until the actor is reconstructed.
|
||||
def check_restarted():
|
||||
worker = ray.worker.global_worker
|
||||
return worker.core_worker.object_exists(obj_ids[1])
|
||||
|
||||
assert wait_for_condition(check_restarted)
|
||||
ray.get(p)
|
||||
assert wait_for_condition(lambda: not actor_resource_available())
|
||||
|
||||
|
||||
def test_caller_actor_restart(ray_start_regular):
|
||||
@@ -277,66 +426,6 @@ def test_caller_task_reconstruction(ray_start_regular):
|
||||
assert ray.get(RetryableTask.remote(remote_actor)) == 3
|
||||
|
||||
|
||||
def test_actor_restart_on_node_failure(ray_start_cluster_head):
|
||||
"""Test actor reconstruction when node dies unexpectedly."""
|
||||
cluster = ray_start_cluster_head
|
||||
max_restarts = 3
|
||||
# Add a few nodes to the cluster.
|
||||
# Use custom resource to make sure the actor is only created on worker
|
||||
# nodes, not on the head node.
|
||||
for _ in range(max_restarts + 2):
|
||||
cluster.add_node(
|
||||
resources={"a": 1},
|
||||
_internal_config=json.dumps({
|
||||
"initial_reconstruction_timeout_milliseconds": 200,
|
||||
"num_heartbeats_timeout": 10,
|
||||
}),
|
||||
)
|
||||
|
||||
def kill_node(node_id):
|
||||
node_to_remove = None
|
||||
for node in cluster.worker_nodes:
|
||||
if node_id == node.unique_id:
|
||||
node_to_remove = node
|
||||
cluster.remove_node(node_to_remove)
|
||||
|
||||
@ray.remote(max_restarts=max_restarts, resources={"a": 1})
|
||||
class MyActor:
|
||||
def __init__(self):
|
||||
self.value = 0
|
||||
|
||||
def increase(self):
|
||||
self.value += 1
|
||||
return self.value
|
||||
|
||||
def get_object_store_socket(self):
|
||||
return ray.worker.global_worker.node.unique_id
|
||||
|
||||
actor = MyActor.remote()
|
||||
# Call increase 3 times.
|
||||
for _ in range(3):
|
||||
ray.get(actor.increase.remote())
|
||||
|
||||
for i in range(max_restarts):
|
||||
object_store_socket = ray.get(actor.get_object_store_socket.remote())
|
||||
# Kill actor's node and the actor should be restarted
|
||||
# on a different node.
|
||||
kill_node(object_store_socket)
|
||||
# Call increase again.
|
||||
# Check that the actor is restarted and value is correct.
|
||||
assert ray.get(actor.increase.remote()) == 4 + i
|
||||
# Check that the actor is now on a different node.
|
||||
assert object_store_socket != ray.get(
|
||||
actor.get_object_store_socket.remote())
|
||||
|
||||
# kill the node again.
|
||||
object_store_socket = ray.get(actor.get_object_store_socket.remote())
|
||||
kill_node(object_store_socket)
|
||||
# The actor has exceeded max restarts, and this task should fail.
|
||||
with pytest.raises(ray.exceptions.RayActorError):
|
||||
ray.get(actor.increase.remote())
|
||||
|
||||
|
||||
# NOTE(hchen): we set initial_reconstruction_timeout_milliseconds to 1s for
|
||||
# this test. Because if this value is too small, suprious task reconstruction
|
||||
# may happen and cause the test fauilure. If the value is too large, this test
|
||||
@@ -365,7 +454,7 @@ def test_multiple_actor_restart(ray_start_cluster_head):
|
||||
})) for _ in range(num_nodes)
|
||||
]
|
||||
|
||||
@ray.remote(max_restarts=-1)
|
||||
@ray.remote(max_restarts=-1, max_task_retries=-1)
|
||||
class SlowCounter:
|
||||
def __init__(self):
|
||||
self.x = 0
|
||||
@@ -407,8 +496,12 @@ def test_multiple_actor_restart(ray_start_cluster_head):
|
||||
|
||||
# Get the results and check that they have the correct values.
|
||||
for _, result_id_list in result_ids.items():
|
||||
results = list(range(1, len(result_id_list) + 1))
|
||||
assert ray.get(result_id_list) == results
|
||||
results = ray.get(result_id_list)
|
||||
for i, result in enumerate(results):
|
||||
if i == 0:
|
||||
assert result == 1
|
||||
else:
|
||||
assert result == results[i - 1] + 1 or result == 1
|
||||
|
||||
|
||||
def kill_actor(actor):
|
||||
@@ -418,6 +511,7 @@ def kill_actor(actor):
|
||||
wait_for_pid_to_exit(pid)
|
||||
|
||||
|
||||
@pytest.mark.skip(reason="TODO: Actor checkpointing")
|
||||
def test_checkpointing(ray_start_regular, ray_checkpointable_actor_cls):
|
||||
"""Test actor checkpointing and restoring from a checkpoint."""
|
||||
actor = ray.remote(max_restarts=2)(ray_checkpointable_actor_cls).remote()
|
||||
@@ -440,13 +534,14 @@ def test_checkpointing(ray_start_regular, ray_checkpointable_actor_cls):
|
||||
for _ in range(3):
|
||||
ray.get(actor.increase.remote())
|
||||
expected += 1
|
||||
# Kill actor again and check that reconstruction still works after the
|
||||
# Kill actor again and check that restart still works after the
|
||||
# actor resuming from a checkpoint.
|
||||
kill_actor(actor)
|
||||
assert ray.get(actor.get.remote()) == expected
|
||||
assert ray.get(actor.was_resumed_from_checkpoint.remote()) is True
|
||||
|
||||
|
||||
@pytest.mark.skip(reason="TODO: Actor checkpointing")
|
||||
def test_remote_checkpointing(ray_start_regular, ray_checkpointable_actor_cls):
|
||||
"""Test checkpointing of a remote actor through method invocation."""
|
||||
|
||||
@@ -487,13 +582,14 @@ def test_remote_checkpointing(ray_start_regular, ray_checkpointable_actor_cls):
|
||||
for _ in range(3):
|
||||
ray.get(actor.increase.remote())
|
||||
expected += 1
|
||||
# Kill actor again and check that reconstruction still works after the
|
||||
# Kill actor again and check that restart still works after the
|
||||
# actor resuming from a checkpoint.
|
||||
kill_actor(actor)
|
||||
assert ray.get(actor.get.remote()) == expected
|
||||
assert ray.get(actor.was_resumed_from_checkpoint.remote()) is True
|
||||
|
||||
|
||||
@pytest.mark.skip(reason="TODO: Actor checkpointing")
|
||||
def test_checkpointing_on_node_failure(ray_start_cluster_2_nodes,
|
||||
ray_checkpointable_actor_cls):
|
||||
"""Test actor checkpointing on a remote node."""
|
||||
@@ -520,6 +616,7 @@ def test_checkpointing_on_node_failure(ray_start_cluster_2_nodes,
|
||||
assert ray.get(actor.was_resumed_from_checkpoint.remote()) is True
|
||||
|
||||
|
||||
@pytest.mark.skip(reason="TODO: Actor checkpointing")
|
||||
def test_checkpointing_save_exception(ray_start_regular,
|
||||
ray_checkpointable_actor_cls):
|
||||
"""Test actor can still be recovered if checkpoints fail to complete."""
|
||||
@@ -549,7 +646,7 @@ def test_checkpointing_save_exception(ray_start_regular,
|
||||
for _ in range(3):
|
||||
ray.get(actor.increase.remote())
|
||||
expected += 1
|
||||
# Kill actor again, and check that reconstruction still works and the actor
|
||||
# Kill actor again, and check that restart still works and the actor
|
||||
# wasn't resumed from a checkpoint.
|
||||
kill_actor(actor)
|
||||
assert ray.get(actor.get.remote()) == expected
|
||||
@@ -559,6 +656,7 @@ def test_checkpointing_save_exception(ray_start_regular,
|
||||
wait_for_errors(ray_constants.CHECKPOINT_PUSH_ERROR, 1)
|
||||
|
||||
|
||||
@pytest.mark.skip(reason="TODO: Actor checkpointing")
|
||||
def test_checkpointing_load_exception(ray_start_regular,
|
||||
ray_checkpointable_actor_cls):
|
||||
"""Test actor can still be recovered if checkpoints fail to load."""
|
||||
@@ -589,7 +687,7 @@ def test_checkpointing_load_exception(ray_start_regular,
|
||||
for _ in range(3):
|
||||
ray.get(actor.increase.remote())
|
||||
expected += 1
|
||||
# Kill actor again, and check that reconstruction still works and the actor
|
||||
# Kill actor again, and check that restart still works and the actor
|
||||
# wasn't resumed from a checkpoint.
|
||||
kill_actor(actor)
|
||||
assert ray.get(actor.get.remote()) == expected
|
||||
@@ -718,10 +816,6 @@ def test_decorated_method(ray_start_regular):
|
||||
assert ray.get(object_id) == 7 # 2 * 3 + 1
|
||||
|
||||
|
||||
@pytest.mark.skipif(
|
||||
pytest_timeout is None,
|
||||
reason="Timeout package not installed; skipping test that may hang.")
|
||||
@pytest.mark.timeout(20)
|
||||
@pytest.mark.parametrize(
|
||||
"ray_start_cluster", [{
|
||||
"num_cpus": 1,
|
||||
@@ -747,53 +841,45 @@ def test_ray_wait_dead_actor(ray_start_cluster):
|
||||
actors = [Actor.remote() for _ in range(num_nodes)]
|
||||
ray.get([actor.ping.remote() for actor in actors])
|
||||
|
||||
# Ping the actors and make sure the tasks complete.
|
||||
ping_ids = [actor.ping.remote() for actor in actors]
|
||||
ray.get(ping_ids)
|
||||
# Evict the result from the node that we're about to kill.
|
||||
remote_node = cluster.list_all_nodes()[-1]
|
||||
remote_ping_id = None
|
||||
for i, actor in enumerate(actors):
|
||||
if ray.get(actor.node_id.remote()) == remote_node.unique_id:
|
||||
remote_ping_id = ping_ids[i]
|
||||
ray.internal.free([remote_ping_id], local_only=True)
|
||||
cluster.remove_node(remote_node)
|
||||
def actor_dead():
|
||||
# Ping the actors and make sure the tasks complete.
|
||||
ping_ids = [actor.ping.remote() for actor in actors]
|
||||
unready = ping_ids[:]
|
||||
while unready:
|
||||
_, unready = ray.wait(unready, timeout=0)
|
||||
time.sleep(1)
|
||||
|
||||
# Repeatedly call ray.wait until the exception for the dead actor is
|
||||
# received.
|
||||
unready = ping_ids[:]
|
||||
while unready:
|
||||
_, unready = ray.wait(unready, timeout=0)
|
||||
time.sleep(1)
|
||||
try:
|
||||
ray.get(ping_ids)
|
||||
return False
|
||||
except ray.exceptions.RayActorError:
|
||||
return True
|
||||
|
||||
with pytest.raises(ray.exceptions.RayActorError):
|
||||
ray.get(ping_ids)
|
||||
# Kill a node.
|
||||
cluster.remove_node(cluster.list_all_nodes()[-1])
|
||||
# Repeatedly submit tasks and call ray.wait until the exception for the
|
||||
# dead actor is received.
|
||||
assert wait_for_condition(actor_dead)
|
||||
|
||||
# Evict the result from the dead node.
|
||||
ray.internal.free([remote_ping_id], local_only=True)
|
||||
# Create an actor on the local node that will call ray.wait in a loop.
|
||||
head_node_resource = "HEAD_NODE"
|
||||
ray.experimental.set_resource(head_node_resource, 1)
|
||||
|
||||
@ray.remote(num_cpus=0, resources={head_node_resource: 1})
|
||||
class ParentActor:
|
||||
def __init__(self, ping_ids):
|
||||
self.unready = ping_ids
|
||||
def __init__(self):
|
||||
pass
|
||||
|
||||
def wait(self):
|
||||
_, self.unready = ray.wait(self.unready, timeout=0)
|
||||
return len(self.unready) == 0
|
||||
return actor_dead()
|
||||
|
||||
def ping(self):
|
||||
return
|
||||
|
||||
# Repeatedly call ray.wait through the local actor until the exception for
|
||||
# the dead actor is received.
|
||||
parent_actor = ParentActor.remote(ping_ids)
|
||||
ray.get(parent_actor.ping.remote())
|
||||
failure_detected = False
|
||||
while not failure_detected:
|
||||
failure_detected = ray.get(parent_actor.wait.remote())
|
||||
parent_actor = ParentActor.remote()
|
||||
assert wait_for_condition(lambda: ray.get(parent_actor.wait.remote()))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
+16
-1
@@ -1730,6 +1730,7 @@ def make_decorator(num_return_vals=None,
|
||||
max_calls=None,
|
||||
max_retries=None,
|
||||
max_restarts=None,
|
||||
max_task_retries=None,
|
||||
worker=None):
|
||||
def decorator(function_or_class):
|
||||
if (inspect.isfunction(function_or_class)
|
||||
@@ -1738,6 +1739,9 @@ def make_decorator(num_return_vals=None,
|
||||
if max_restarts is not None:
|
||||
raise ValueError("The keyword 'max_restarts' is not "
|
||||
"allowed for remote functions.")
|
||||
if max_task_retries is not None:
|
||||
raise ValueError("The keyword 'max_task_retries' is not "
|
||||
"allowed for remote functions.")
|
||||
|
||||
return ray.remote_function.RemoteFunction(
|
||||
Language.PYTHON, function_or_class, None, num_cpus, num_gpus,
|
||||
@@ -1754,7 +1758,7 @@ def make_decorator(num_return_vals=None,
|
||||
|
||||
return ray.actor.make_actor(function_or_class, num_cpus, num_gpus,
|
||||
memory, object_store_memory, resources,
|
||||
max_restarts)
|
||||
max_restarts, max_task_retries)
|
||||
|
||||
raise TypeError("The @ray.remote decorator must be applied to "
|
||||
"either a function or to a class.")
|
||||
@@ -1801,6 +1805,14 @@ def remote(*args, **kwargs):
|
||||
unexpectedly. The minimum valid value is 0 (default), which indicates
|
||||
that the actor doesn't need to be restarted. A value of -1
|
||||
indicates that an actor should be restarted indefinitely.
|
||||
* **max_task_retries**: Only for *actors*. How many times to retry an actor
|
||||
task if the task fails due to a system error, e.g., the actor has died.
|
||||
If set to -1, the system will retry the failed task until the task
|
||||
succeeds, or the actor has reached its max_restarts limit. If set to n >
|
||||
0, the system will retry the failed task up to n times, after which the
|
||||
task will throw a `RayActorError` exception upon `ray.get`. Note that
|
||||
Python exceptions are not considered system errors and will not trigger
|
||||
retries.
|
||||
* **max_retries**: Only for *remote functions*. This specifies the maximum
|
||||
number of times that the remote function should be rerun when the worker
|
||||
process executing it crashes unexpectedly. The minimum valid value is 0,
|
||||
@@ -1867,6 +1879,7 @@ def remote(*args, **kwargs):
|
||||
"resources",
|
||||
"max_calls",
|
||||
"max_restarts",
|
||||
"max_task_retries",
|
||||
"max_retries",
|
||||
], error_string
|
||||
|
||||
@@ -1885,6 +1898,7 @@ def remote(*args, **kwargs):
|
||||
num_return_vals = kwargs.get("num_return_vals")
|
||||
max_calls = kwargs.get("max_calls")
|
||||
max_restarts = kwargs.get("max_restarts")
|
||||
max_task_retries = kwargs.get("max_task_retries")
|
||||
memory = kwargs.get("memory")
|
||||
object_store_memory = kwargs.get("object_store_memory")
|
||||
max_retries = kwargs.get("max_retries")
|
||||
@@ -1898,5 +1912,6 @@ def remote(*args, **kwargs):
|
||||
resources=resources,
|
||||
max_calls=max_calls,
|
||||
max_restarts=max_restarts,
|
||||
max_task_retries=max_task_retries,
|
||||
max_retries=max_retries,
|
||||
worker=worker)
|
||||
|
||||
@@ -209,6 +209,10 @@ const rpc::Address &TaskSpecification::CallerAddress() const {
|
||||
return message_->caller_address();
|
||||
}
|
||||
|
||||
WorkerID TaskSpecification::CallerWorkerId() const {
|
||||
return WorkerID::FromBinary(message_->caller_address().worker_id());
|
||||
}
|
||||
|
||||
// === Below are getter methods specific to actor tasks.
|
||||
|
||||
ActorID TaskSpecification::ActorId() const {
|
||||
|
||||
@@ -151,6 +151,8 @@ class TaskSpecification : public MessageWrapper<rpc::TaskSpec> {
|
||||
|
||||
const rpc::Address &CallerAddress() const;
|
||||
|
||||
WorkerID CallerWorkerId() const;
|
||||
|
||||
uint64_t ActorCounter() const;
|
||||
|
||||
ObjectID ActorCreationDummyObjectId() const;
|
||||
|
||||
@@ -23,7 +23,7 @@ ray::rpc::ActorHandle CreateInnerActorHandle(
|
||||
const ray::rpc::Address &owner_address, const class JobID &job_id,
|
||||
const ObjectID &initial_cursor, const Language actor_language,
|
||||
const ray::FunctionDescriptor &actor_creation_task_function_descriptor,
|
||||
const std::string &extension_data) {
|
||||
const std::string &extension_data, int64_t max_task_retries) {
|
||||
ray::rpc::ActorHandle inner;
|
||||
inner.set_actor_id(actor_id.Data(), actor_id.Size());
|
||||
inner.set_owner_id(owner_id.Binary());
|
||||
@@ -34,6 +34,7 @@ ray::rpc::ActorHandle CreateInnerActorHandle(
|
||||
actor_creation_task_function_descriptor->GetMessage();
|
||||
inner.set_actor_cursor(initial_cursor.Binary());
|
||||
inner.set_extension_data(extension_data);
|
||||
inner.set_max_task_retries(max_task_retries);
|
||||
return inner;
|
||||
}
|
||||
|
||||
@@ -69,10 +70,10 @@ ActorHandle::ActorHandle(
|
||||
const rpc::Address &owner_address, const class JobID &job_id,
|
||||
const ObjectID &initial_cursor, const Language actor_language,
|
||||
const ray::FunctionDescriptor &actor_creation_task_function_descriptor,
|
||||
const std::string &extension_data)
|
||||
const std::string &extension_data, int64_t max_task_retries)
|
||||
: ActorHandle(CreateInnerActorHandle(
|
||||
actor_id, owner_id, owner_address, job_id, initial_cursor, actor_language,
|
||||
actor_creation_task_function_descriptor, extension_data)) {}
|
||||
actor_creation_task_function_descriptor, extension_data, max_task_retries)) {}
|
||||
|
||||
ActorHandle::ActorHandle(const std::string &serialized)
|
||||
: ActorHandle(CreateInnerActorHandleFromString(serialized)) {}
|
||||
@@ -95,10 +96,4 @@ void ActorHandle::SetActorTaskSpec(TaskSpecBuilder &builder, const ObjectID new_
|
||||
|
||||
void ActorHandle::Serialize(std::string *output) { inner_.SerializeToString(output); }
|
||||
|
||||
void ActorHandle::Reset() {
|
||||
absl::MutexLock guard(&mutex_);
|
||||
task_counter_ = 0;
|
||||
actor_cursor_ = ObjectID::FromBinary(inner_.actor_cursor());
|
||||
}
|
||||
|
||||
} // namespace ray
|
||||
|
||||
@@ -37,7 +37,7 @@ class ActorHandle {
|
||||
const rpc::Address &owner_address, const JobID &job_id,
|
||||
const ObjectID &initial_cursor, const Language actor_language,
|
||||
const ray::FunctionDescriptor &actor_creation_task_function_descriptor,
|
||||
const std::string &extension_data);
|
||||
const std::string &extension_data, int64_t max_task_retries);
|
||||
|
||||
/// Constructs an ActorHandle from a serialized string.
|
||||
ActorHandle(const std::string &serialized);
|
||||
@@ -68,34 +68,12 @@ class ActorHandle {
|
||||
|
||||
void Serialize(std::string *output);
|
||||
|
||||
/// Reset the handle state next task submitted.
|
||||
///
|
||||
/// This should be called whenever the actor is restarted, since the new
|
||||
/// instance of the actor does not have the previous sequence number.
|
||||
/// TODO: We should also move the other actor state (status and IP) inside
|
||||
/// ActorHandle and reset them in this method.
|
||||
void Reset();
|
||||
|
||||
// Mark the actor handle as dead.
|
||||
void MarkDead() {
|
||||
absl::MutexLock lock(&mutex_);
|
||||
state_ = rpc::ActorTableData::DEAD;
|
||||
}
|
||||
|
||||
// Returns whether the actor is known to be dead.
|
||||
bool IsDead() const {
|
||||
absl::MutexLock lock(&mutex_);
|
||||
return state_ == rpc::ActorTableData::DEAD;
|
||||
}
|
||||
int64_t MaxTaskRetries() const { return inner_.max_task_retries(); }
|
||||
|
||||
private:
|
||||
// Protobuf-defined persistent state of the actor handle.
|
||||
const ray::rpc::ActorHandle inner_;
|
||||
|
||||
/// The actor's state (alive or dead). This defaults to ALIVE. Once marked
|
||||
/// DEAD, the actor handle can never go back to being ALIVE.
|
||||
rpc::ActorTableData::ActorState state_ GUARDED_BY(mutex_) = rpc::ActorTableData::ALIVE;
|
||||
|
||||
/// The unique id of the dummy object returned by the previous task.
|
||||
/// TODO: This can be removed once we schedule actor tasks by task counter
|
||||
/// only.
|
||||
|
||||
@@ -111,12 +111,14 @@ struct TaskOptions {
|
||||
/// Options for actor creation tasks.
|
||||
struct ActorCreationOptions {
|
||||
ActorCreationOptions() {}
|
||||
ActorCreationOptions(int64_t max_restarts, int max_concurrency,
|
||||
ActorCreationOptions(int64_t max_restarts, int64_t max_task_retries,
|
||||
int max_concurrency,
|
||||
const std::unordered_map<std::string, double> &resources,
|
||||
const std::unordered_map<std::string, double> &placement_resources,
|
||||
const std::vector<std::string> &dynamic_worker_options,
|
||||
bool is_detached, std::string &name, bool is_asyncio)
|
||||
: max_restarts(max_restarts),
|
||||
max_task_retries(max_task_retries),
|
||||
max_concurrency(max_concurrency),
|
||||
resources(resources),
|
||||
placement_resources(placement_resources),
|
||||
@@ -125,10 +127,14 @@ struct ActorCreationOptions {
|
||||
name(name),
|
||||
is_asyncio(is_asyncio){};
|
||||
|
||||
/// Maximum number of times that the actor should be reconstructed when it dies
|
||||
/// unexpectedly. A value of -1 indicates infinite restarts.
|
||||
/// If it's 0, the actor won't be restarted.
|
||||
/// Maximum number of times that the actor should be restarted if it dies
|
||||
/// unexpectedly. A value of -1 indicates infinite restarts. If it's 0, the
|
||||
/// actor won't be restarted.
|
||||
const int64_t max_restarts = 0;
|
||||
/// Maximum number of times that individual tasks can be retried at the
|
||||
/// actor, if the actor dies unexpectedly. If -1, then the task may be
|
||||
/// retried infinitely many times.
|
||||
const int64_t max_task_retries = 0;
|
||||
/// The max number of concurrent tasks to run on this direct call actor.
|
||||
const int max_concurrency = 1;
|
||||
/// Resources required by the whole lifetime of this actor.
|
||||
|
||||
@@ -428,7 +428,7 @@ CoreWorker::CoreWorker(const CoreWorkerOptions &options, const WorkerID &worker_
|
||||
}
|
||||
|
||||
direct_actor_submitter_ = std::unique_ptr<CoreWorkerDirectActorTaskSubmitter>(
|
||||
new CoreWorkerDirectActorTaskSubmitter(rpc_address_, client_factory, memory_store_,
|
||||
new CoreWorkerDirectActorTaskSubmitter(client_factory, memory_store_,
|
||||
task_manager_));
|
||||
|
||||
direct_task_submitter_ =
|
||||
@@ -604,20 +604,9 @@ const WorkerID &CoreWorker::GetWorkerID() const { return worker_context_.GetWork
|
||||
|
||||
void CoreWorker::SetCurrentTaskId(const TaskID &task_id) {
|
||||
worker_context_.SetCurrentTaskId(task_id);
|
||||
bool not_actor_task = false;
|
||||
{
|
||||
absl::MutexLock lock(&mutex_);
|
||||
main_thread_task_id_ = task_id;
|
||||
not_actor_task = actor_id_.IsNil();
|
||||
}
|
||||
if (not_actor_task && task_id.IsNil()) {
|
||||
absl::MutexLock lock(&actor_handles_mutex_);
|
||||
// Reset the seqnos so that for the next task it start off at 0.
|
||||
for (const auto &handle : actor_handles_) {
|
||||
handle.second->Reset();
|
||||
}
|
||||
// TODO(ekl) we can't unsubscribe to actor notifications here due to
|
||||
// https://github.com/ray-project/ray/pull/6885
|
||||
}
|
||||
}
|
||||
|
||||
@@ -676,7 +665,12 @@ void CoreWorker::InternalHeartbeat(const boost::system::error_code &error) {
|
||||
|
||||
absl::MutexLock lock(&mutex_);
|
||||
while (!to_resubmit_.empty() && current_time_ms() > to_resubmit_.front().first) {
|
||||
RAY_CHECK_OK(direct_task_submitter_->SubmitTask(to_resubmit_.front().second));
|
||||
auto &spec = to_resubmit_.front().second;
|
||||
if (spec.IsActorTask()) {
|
||||
RAY_CHECK_OK(direct_actor_submitter_->SubmitTask(spec));
|
||||
} else {
|
||||
RAY_CHECK_OK(direct_task_submitter_->SubmitTask(spec));
|
||||
}
|
||||
to_resubmit_.pop_front();
|
||||
}
|
||||
internal_timer_.expires_at(internal_timer_.expiry() +
|
||||
@@ -1156,7 +1150,8 @@ Status CoreWorker::CreateActor(const RayFunction &function,
|
||||
// WaitForActorOutOfScopeRequest.
|
||||
std::unique_ptr<ActorHandle> actor_handle(new ActorHandle(
|
||||
actor_id, GetCallerId(), rpc_address_, job_id, /*actor_cursor=*/return_ids[0],
|
||||
function.GetLanguage(), function.GetFunctionDescriptor(), extension_data));
|
||||
function.GetLanguage(), function.GetFunctionDescriptor(), extension_data,
|
||||
actor_creation_options.max_task_retries));
|
||||
RAY_CHECK(AddActorHandle(std::move(actor_handle),
|
||||
/*is_owner_handle=*/!actor_creation_options.is_detached))
|
||||
<< "Actor " << actor_id << " already exists";
|
||||
@@ -1215,14 +1210,8 @@ Status CoreWorker::SubmitActorTask(const ActorID &actor_id, const RayFunction &f
|
||||
ExecuteTaskLocalMode(task_spec, actor_id);
|
||||
} else {
|
||||
task_manager_->AddPendingTask(GetCallerId(), rpc_address_, task_spec,
|
||||
CurrentCallSite());
|
||||
if (actor_handle->IsDead()) {
|
||||
auto status = Status::IOError("sent task to dead actor");
|
||||
task_manager_->PendingTaskFailed(task_spec.TaskId(), rpc::ErrorType::ACTOR_DIED,
|
||||
&status);
|
||||
} else {
|
||||
status = direct_actor_submitter_->SubmitTask(task_spec);
|
||||
}
|
||||
CurrentCallSite(), actor_handle->MaxTaskRetries());
|
||||
status = direct_actor_submitter_->SubmitTask(task_spec);
|
||||
}
|
||||
return status;
|
||||
}
|
||||
@@ -1299,6 +1288,7 @@ bool CoreWorker::AddActorHandle(std::unique_ptr<ActorHandle> actor_handle,
|
||||
}
|
||||
|
||||
reference_counter_->AddLocalReference(actor_creation_return_id, CurrentCallSite());
|
||||
direct_actor_submitter_->AddActorQueueIfNotExists(actor_id);
|
||||
|
||||
bool inserted;
|
||||
{
|
||||
@@ -1313,19 +1303,9 @@ bool CoreWorker::AddActorHandle(std::unique_ptr<ActorHandle> actor_handle,
|
||||
if (actor_data.state() == gcs::ActorTableData::PENDING) {
|
||||
// The actor is being created and not yet ready, just ignore!
|
||||
} else if (actor_data.state() == gcs::ActorTableData::RESTARTING) {
|
||||
absl::MutexLock lock(&actor_handles_mutex_);
|
||||
auto it = actor_handles_.find(actor_id);
|
||||
RAY_CHECK(it != actor_handles_.end());
|
||||
// We have to reset the actor handle since the next instance of the
|
||||
// actor will not have the last sequence number that we sent.
|
||||
it->second->Reset();
|
||||
direct_actor_submitter_->DisconnectActor(actor_id, false);
|
||||
} else if (actor_data.state() == gcs::ActorTableData::DEAD) {
|
||||
direct_actor_submitter_->DisconnectActor(actor_id, true);
|
||||
|
||||
ActorHandle *actor_handle = nullptr;
|
||||
RAY_CHECK_OK(GetActorHandle(actor_id, &actor_handle));
|
||||
actor_handle->MarkDead();
|
||||
// We cannot erase the actor handle here because clients can still
|
||||
// submit tasks to dead actors. This also means we defer unsubscription,
|
||||
// otherwise we crash when bulk unsubscribing all actor handles.
|
||||
@@ -1369,8 +1349,8 @@ bool CoreWorker::AddActorHandle(std::unique_ptr<ActorHandle> actor_handle,
|
||||
// language frontend receives another ref to the same actor. In this
|
||||
// case, we must remember the last task counter that we sent to the
|
||||
// actor.
|
||||
// TODO(swang): Unsubscribe from the actor table once all local refs
|
||||
// to the actor have gone out of scope.
|
||||
// TODO(ekl) we can't unsubscribe to actor notifications here due to
|
||||
// https://github.com/ray-project/ray/pull/6885
|
||||
auto callback = actor_out_of_scope_callbacks_.extract(actor_id);
|
||||
if (callback) {
|
||||
callback.mapped()(actor_id);
|
||||
@@ -1555,16 +1535,11 @@ Status CoreWorker::ExecuteTask(const TaskSpecification &task_spec,
|
||||
return_ids.pop_back();
|
||||
task_type = TaskType::ACTOR_CREATION_TASK;
|
||||
SetActorId(task_spec.ActorCreationId());
|
||||
// For an actor, set the timestamp as the time its creation task starts execution.
|
||||
SetCallerCreationTimestamp();
|
||||
RAY_LOG(INFO) << "Creating actor: " << task_spec.ActorCreationId();
|
||||
} else if (task_spec.IsActorTask()) {
|
||||
RAY_CHECK(return_ids.size() > 0);
|
||||
return_ids.pop_back();
|
||||
task_type = TaskType::ACTOR_TASK;
|
||||
} else {
|
||||
// For a non-actor task, set the timestamp as the time it starts execution.
|
||||
SetCallerCreationTimestamp();
|
||||
}
|
||||
|
||||
// Because we support concurrent actor calls, we need to update the
|
||||
@@ -2080,9 +2055,4 @@ void CoreWorker::SetActorTitle(const std::string &title) {
|
||||
actor_title_ = title;
|
||||
}
|
||||
|
||||
void CoreWorker::SetCallerCreationTimestamp() {
|
||||
absl::MutexLock lock(&mutex_);
|
||||
direct_actor_submitter_->SetCallerCreationTimestamp(current_sys_time_ms());
|
||||
}
|
||||
|
||||
} // namespace ray
|
||||
|
||||
@@ -110,14 +110,16 @@ inline ray::ActorCreationOptions ToActorCreationOptions(JNIEnv *env,
|
||||
}
|
||||
|
||||
std::string name = "";
|
||||
ray::ActorCreationOptions actor_creation_options{max_restarts,
|
||||
static_cast<int>(max_concurrency),
|
||||
resources,
|
||||
resources,
|
||||
dynamic_worker_options,
|
||||
/*is_detached=*/false,
|
||||
name,
|
||||
/*is_asyncio=*/false};
|
||||
ray::ActorCreationOptions actor_creation_options{
|
||||
max_restarts,
|
||||
0, // TODO: Allow setting max_task_retries from Java.
|
||||
static_cast<int>(max_concurrency),
|
||||
resources,
|
||||
resources,
|
||||
dynamic_worker_options,
|
||||
/*is_detached=*/false,
|
||||
name,
|
||||
/*is_asyncio=*/false};
|
||||
return actor_creation_options;
|
||||
}
|
||||
|
||||
|
||||
@@ -28,7 +28,8 @@ void TaskManager::AddPendingTask(const TaskID &caller_id,
|
||||
const rpc::Address &caller_address,
|
||||
const TaskSpecification &spec,
|
||||
const std::string &call_site, int max_retries) {
|
||||
RAY_LOG(DEBUG) << "Adding pending task " << spec.TaskId();
|
||||
RAY_LOG(DEBUG) << "Adding pending task " << spec.TaskId() << " with " << max_retries
|
||||
<< " retries";
|
||||
|
||||
// Add references for the dependencies to the task.
|
||||
std::vector<ObjectID> task_deps;
|
||||
@@ -90,12 +91,18 @@ Status TaskManager::ResubmitTask(const TaskID &task_id,
|
||||
if (it == submissible_tasks_.end()) {
|
||||
return Status::Invalid("Task spec missing");
|
||||
}
|
||||
if (it->second.spec.IsActorTask()) {
|
||||
return Status::Invalid("Cannot reconstruct objects returned by actors");
|
||||
}
|
||||
|
||||
if (!it->second.pending) {
|
||||
resubmit = true;
|
||||
it->second.pending = true;
|
||||
RAY_CHECK(it->second.num_retries_left != 0);
|
||||
it->second.num_retries_left--;
|
||||
if (it->second.num_retries_left > 0) {
|
||||
it->second.num_retries_left--;
|
||||
} else {
|
||||
RAY_CHECK(it->second.num_retries_left == -1);
|
||||
}
|
||||
spec = it->second.spec;
|
||||
}
|
||||
}
|
||||
@@ -256,7 +263,7 @@ void TaskManager::CompletePendingTask(const TaskID &task_id,
|
||||
ShutdownIfNeeded();
|
||||
}
|
||||
|
||||
void TaskManager::PendingTaskFailed(const TaskID &task_id, rpc::ErrorType error_type,
|
||||
bool TaskManager::PendingTaskFailed(const TaskID &task_id, rpc::ErrorType error_type,
|
||||
Status *status) {
|
||||
// Note that this might be the __ray_terminate__ task, so we don't log
|
||||
// loudly with ERROR here.
|
||||
@@ -286,6 +293,7 @@ void TaskManager::PendingTaskFailed(const TaskID &task_id, rpc::ErrorType error_
|
||||
}
|
||||
}
|
||||
|
||||
bool will_retry = false;
|
||||
// We should not hold the lock during these calls because they may trigger
|
||||
// callbacks in this or other classes.
|
||||
if (num_retries_left != 0) {
|
||||
@@ -294,6 +302,7 @@ void TaskManager::PendingTaskFailed(const TaskID &task_id, rpc::ErrorType error_
|
||||
RAY_LOG(ERROR) << retries_str << " retries left for task " << spec.TaskId()
|
||||
<< ", attempting to resubmit.";
|
||||
retry_task_callback_(spec, /*delay=*/true);
|
||||
will_retry = true;
|
||||
} else {
|
||||
// Throttled logging of task failure errors.
|
||||
{
|
||||
@@ -323,6 +332,8 @@ void TaskManager::PendingTaskFailed(const TaskID &task_id, rpc::ErrorType error_
|
||||
}
|
||||
|
||||
ShutdownIfNeeded();
|
||||
|
||||
return will_retry;
|
||||
}
|
||||
|
||||
void TaskManager::ShutdownIfNeeded() {
|
||||
|
||||
@@ -32,7 +32,7 @@ class TaskFinisherInterface {
|
||||
virtual void CompletePendingTask(const TaskID &task_id, const rpc::PushTaskReply &reply,
|
||||
const rpc::Address &actor_addr) = 0;
|
||||
|
||||
virtual void PendingTaskFailed(const TaskID &task_id, rpc::ErrorType error_type,
|
||||
virtual bool PendingTaskFailed(const TaskID &task_id, rpc::ErrorType error_type,
|
||||
Status *status = nullptr) = 0;
|
||||
|
||||
virtual void OnTaskDependenciesInlined(
|
||||
@@ -116,7 +116,8 @@ class TaskManager : public TaskFinisherInterface, public TaskResubmissionInterfa
|
||||
/// \param[in] task_id ID of the pending task.
|
||||
/// \param[in] error_type The type of the specific error.
|
||||
/// \param[in] status Optional status message.
|
||||
void PendingTaskFailed(const TaskID &task_id, rpc::ErrorType error_type,
|
||||
/// \return Whether the task will be retried or not.
|
||||
bool PendingTaskFailed(const TaskID &task_id, rpc::ErrorType error_type,
|
||||
Status *status = nullptr) override;
|
||||
|
||||
/// A task's dependencies were inlined in the task spec. This will decrement
|
||||
|
||||
@@ -73,8 +73,9 @@ ActorID CreateActorHelper(std::unordered_map<std::string, double> &resources,
|
||||
std::string name = "";
|
||||
ActorCreationOptions actor_options{
|
||||
max_restarts,
|
||||
/*max_concurrency*/ 1, resources, resources, {},
|
||||
/*is_detached=*/false, name, /*is_asyncio=*/false};
|
||||
/*max_task_retries=*/0,
|
||||
/*max_concurrency*/ 1, resources, resources, {},
|
||||
/*is_detached=*/false, name, /*is_asyncio=*/false};
|
||||
|
||||
// Create an actor.
|
||||
ActorID actor_id;
|
||||
@@ -644,6 +645,7 @@ TEST_F(ZeroNodeTest, TestTaskSpecPerf) {
|
||||
std::unordered_map<std::string, double> resources;
|
||||
std::string name = "";
|
||||
ActorCreationOptions actor_options{0,
|
||||
0,
|
||||
1,
|
||||
resources,
|
||||
resources,
|
||||
@@ -654,7 +656,8 @@ TEST_F(ZeroNodeTest, TestTaskSpecPerf) {
|
||||
const auto job_id = NextJobId();
|
||||
ActorHandle actor_handle(ActorID::Of(job_id, TaskID::ForDriverTask(job_id), 1),
|
||||
TaskID::Nil(), rpc::Address(), job_id, ObjectID::FromRandom(),
|
||||
function.GetLanguage(), function.GetFunctionDescriptor(), "");
|
||||
function.GetLanguage(), function.GetFunctionDescriptor(), "",
|
||||
0);
|
||||
|
||||
// Manually create `num_tasks` task specs, and for each of them create a
|
||||
// `PushTaskRequest`, this is to batch performance of TaskSpec
|
||||
@@ -763,10 +766,10 @@ TEST_F(ZeroNodeTest, TestWorkerContext) {
|
||||
TEST_F(ZeroNodeTest, TestActorHandle) {
|
||||
// Test actor handle serialization and deserialization round trip.
|
||||
JobID job_id = NextJobId();
|
||||
ActorHandle original(ActorID::Of(job_id, TaskID::ForDriverTask(job_id), 0),
|
||||
TaskID::Nil(), rpc::Address(), job_id, ObjectID::FromRandom(),
|
||||
Language::PYTHON,
|
||||
ray::FunctionDescriptorBuilder::BuildPython("", "", "", ""), "");
|
||||
ActorHandle original(
|
||||
ActorID::Of(job_id, TaskID::ForDriverTask(job_id), 0), TaskID::Nil(),
|
||||
rpc::Address(), job_id, ObjectID::FromRandom(), Language::PYTHON,
|
||||
ray::FunctionDescriptorBuilder::BuildPython("", "", "", ""), "", 0);
|
||||
std::string output;
|
||||
original.Serialize(&output);
|
||||
ActorHandle deserialized(output);
|
||||
|
||||
@@ -24,16 +24,45 @@
|
||||
namespace ray {
|
||||
|
||||
using ::testing::_;
|
||||
using ::testing::ElementsAre;
|
||||
using ::testing::Return;
|
||||
|
||||
TaskSpecification CreateActorTaskHelper(ActorID actor_id, WorkerID caller_worker_id,
|
||||
int64_t counter,
|
||||
TaskID caller_id = TaskID::Nil()) {
|
||||
TaskSpecification task;
|
||||
task.GetMutableMessage().set_task_id(TaskID::Nil().Binary());
|
||||
task.GetMutableMessage().set_caller_id(caller_id.Binary());
|
||||
task.GetMutableMessage().set_type(TaskType::ACTOR_TASK);
|
||||
task.GetMutableMessage().mutable_caller_address()->set_worker_id(
|
||||
caller_worker_id.Binary());
|
||||
task.GetMutableMessage().mutable_actor_task_spec()->set_actor_id(actor_id.Binary());
|
||||
task.GetMutableMessage().mutable_actor_task_spec()->set_actor_counter(counter);
|
||||
task.GetMutableMessage().set_num_returns(1);
|
||||
return task;
|
||||
}
|
||||
|
||||
rpc::PushTaskRequest CreatePushTaskRequestHelper(ActorID actor_id, int64_t counter,
|
||||
WorkerID caller_worker_id,
|
||||
TaskID caller_id,
|
||||
int64_t caller_timestamp) {
|
||||
auto task_spec = CreateActorTaskHelper(actor_id, caller_worker_id, counter, caller_id);
|
||||
|
||||
rpc::PushTaskRequest request;
|
||||
request.mutable_task_spec()->CopyFrom(task_spec.GetMessage());
|
||||
request.set_sequence_number(request.task_spec().actor_task_spec().actor_counter());
|
||||
request.set_client_processed_up_to(-1);
|
||||
return request;
|
||||
}
|
||||
|
||||
class MockWorkerClient : public rpc::CoreWorkerClientInterface {
|
||||
public:
|
||||
const rpc::Address &Addr() const override { return addr; }
|
||||
|
||||
ray::Status PushActorTask(
|
||||
std::unique_ptr<rpc::PushTaskRequest> request,
|
||||
std::unique_ptr<rpc::PushTaskRequest> request, bool skip_queue,
|
||||
const rpc::ClientCallback<rpc::PushTaskReply> &callback) override {
|
||||
RAY_CHECK(counter == request->task_spec().actor_task_spec().actor_counter());
|
||||
counter++;
|
||||
received_seq_nos.push_back(request->sequence_number());
|
||||
callbacks.push_back(callback);
|
||||
return Status::OK();
|
||||
}
|
||||
@@ -50,7 +79,7 @@ class MockWorkerClient : public rpc::CoreWorkerClientInterface {
|
||||
|
||||
rpc::Address addr;
|
||||
std::list<rpc::ClientCallback<rpc::PushTaskReply>> callbacks;
|
||||
uint64_t counter = 0;
|
||||
std::vector<uint64_t> received_seq_nos;
|
||||
};
|
||||
|
||||
class MockTaskFinisher : public TaskFinisherInterface {
|
||||
@@ -60,7 +89,7 @@ class MockTaskFinisher : public TaskFinisherInterface {
|
||||
MOCK_METHOD3(CompletePendingTask, void(const TaskID &, const rpc::PushTaskReply &,
|
||||
const rpc::Address &addr));
|
||||
MOCK_METHOD3(PendingTaskFailed,
|
||||
void(const TaskID &task_id, rpc::ErrorType error_type, Status *status));
|
||||
bool(const TaskID &task_id, rpc::ErrorType error_type, Status *status));
|
||||
|
||||
MOCK_METHOD2(OnTaskDependenciesInlined,
|
||||
void(const std::vector<ObjectID> &, const std::vector<ObjectID> &));
|
||||
@@ -68,25 +97,15 @@ class MockTaskFinisher : public TaskFinisherInterface {
|
||||
MOCK_METHOD1(MarkTaskCanceled, bool(const TaskID &task_id));
|
||||
};
|
||||
|
||||
TaskSpecification CreateActorTaskHelper(ActorID actor_id, int64_t counter) {
|
||||
TaskSpecification task;
|
||||
task.GetMutableMessage().set_task_id(TaskID::Nil().Binary());
|
||||
task.GetMutableMessage().set_type(TaskType::ACTOR_TASK);
|
||||
task.GetMutableMessage().mutable_actor_task_spec()->set_actor_id(actor_id.Binary());
|
||||
task.GetMutableMessage().mutable_actor_task_spec()->set_actor_counter(counter);
|
||||
return task;
|
||||
}
|
||||
|
||||
class DirectActorSubmitterTest : public ::testing::Test {
|
||||
public:
|
||||
DirectActorSubmitterTest()
|
||||
: worker_client_(std::shared_ptr<MockWorkerClient>(new MockWorkerClient())),
|
||||
store_(std::shared_ptr<CoreWorkerMemoryStore>(new CoreWorkerMemoryStore())),
|
||||
task_finisher_(std::make_shared<MockTaskFinisher>()),
|
||||
submitter_(address_, [&](const rpc::Address &addr) { return worker_client_; },
|
||||
store_, task_finisher_) {}
|
||||
submitter_([&](const rpc::Address &addr) { return worker_client_; }, store_,
|
||||
task_finisher_) {}
|
||||
|
||||
rpc::Address address_;
|
||||
std::shared_ptr<MockWorkerClient> worker_client_;
|
||||
std::shared_ptr<CoreWorkerMemoryStore> store_;
|
||||
std::shared_ptr<MockTaskFinisher> task_finisher_;
|
||||
@@ -95,16 +114,19 @@ class DirectActorSubmitterTest : public ::testing::Test {
|
||||
|
||||
TEST_F(DirectActorSubmitterTest, TestSubmitTask) {
|
||||
rpc::Address addr;
|
||||
auto worker_id = WorkerID::FromRandom();
|
||||
addr.set_worker_id(worker_id.Binary());
|
||||
ActorID actor_id = ActorID::Of(JobID::FromInt(0), TaskID::Nil(), 0);
|
||||
submitter_.AddActorQueueIfNotExists(actor_id);
|
||||
|
||||
auto task = CreateActorTaskHelper(actor_id, 0);
|
||||
auto task = CreateActorTaskHelper(actor_id, worker_id, 0);
|
||||
ASSERT_TRUE(submitter_.SubmitTask(task).ok());
|
||||
ASSERT_EQ(worker_client_->callbacks.size(), 0);
|
||||
|
||||
submitter_.ConnectActor(actor_id, addr);
|
||||
ASSERT_EQ(worker_client_->callbacks.size(), 1);
|
||||
|
||||
task = CreateActorTaskHelper(actor_id, 1);
|
||||
task = CreateActorTaskHelper(actor_id, worker_id, 1);
|
||||
ASSERT_TRUE(submitter_.SubmitTask(task).ok());
|
||||
ASSERT_EQ(worker_client_->callbacks.size(), 2);
|
||||
|
||||
@@ -114,20 +136,24 @@ TEST_F(DirectActorSubmitterTest, TestSubmitTask) {
|
||||
while (!worker_client_->callbacks.empty()) {
|
||||
ASSERT_TRUE(worker_client_->ReplyPushTask());
|
||||
}
|
||||
ASSERT_THAT(worker_client_->received_seq_nos, ElementsAre(0, 1));
|
||||
}
|
||||
|
||||
TEST_F(DirectActorSubmitterTest, TestDependencies) {
|
||||
rpc::Address addr;
|
||||
auto worker_id = WorkerID::FromRandom();
|
||||
addr.set_worker_id(worker_id.Binary());
|
||||
ActorID actor_id = ActorID::Of(JobID::FromInt(0), TaskID::Nil(), 0);
|
||||
submitter_.AddActorQueueIfNotExists(actor_id);
|
||||
submitter_.ConnectActor(actor_id, addr);
|
||||
ASSERT_EQ(worker_client_->callbacks.size(), 0);
|
||||
|
||||
// Create two tasks for the actor with different arguments.
|
||||
ObjectID obj1 = ObjectID::FromRandom().WithTransportType(TaskTransportType::DIRECT);
|
||||
ObjectID obj2 = ObjectID::FromRandom().WithTransportType(TaskTransportType::DIRECT);
|
||||
auto task1 = CreateActorTaskHelper(actor_id, 0);
|
||||
auto task1 = CreateActorTaskHelper(actor_id, worker_id, 0);
|
||||
task1.GetMutableMessage().add_args()->add_object_ids(obj1.Binary());
|
||||
auto task2 = CreateActorTaskHelper(actor_id, 1);
|
||||
auto task2 = CreateActorTaskHelper(actor_id, worker_id, 1);
|
||||
task2.GetMutableMessage().add_args()->add_object_ids(obj2.Binary());
|
||||
|
||||
// Neither task can be submitted yet because they are still waiting on
|
||||
@@ -142,20 +168,24 @@ TEST_F(DirectActorSubmitterTest, TestDependencies) {
|
||||
ASSERT_EQ(worker_client_->callbacks.size(), 1);
|
||||
ASSERT_TRUE(store_->Put(*data, obj2));
|
||||
ASSERT_EQ(worker_client_->callbacks.size(), 2);
|
||||
ASSERT_THAT(worker_client_->received_seq_nos, ElementsAre(0, 1));
|
||||
}
|
||||
|
||||
TEST_F(DirectActorSubmitterTest, TestOutOfOrderDependencies) {
|
||||
rpc::Address addr;
|
||||
auto worker_id = WorkerID::FromRandom();
|
||||
addr.set_worker_id(worker_id.Binary());
|
||||
ActorID actor_id = ActorID::Of(JobID::FromInt(0), TaskID::Nil(), 0);
|
||||
submitter_.AddActorQueueIfNotExists(actor_id);
|
||||
submitter_.ConnectActor(actor_id, addr);
|
||||
ASSERT_EQ(worker_client_->callbacks.size(), 0);
|
||||
|
||||
// Create two tasks for the actor with different arguments.
|
||||
ObjectID obj1 = ObjectID::FromRandom().WithTransportType(TaskTransportType::DIRECT);
|
||||
ObjectID obj2 = ObjectID::FromRandom().WithTransportType(TaskTransportType::DIRECT);
|
||||
auto task1 = CreateActorTaskHelper(actor_id, 0);
|
||||
auto task1 = CreateActorTaskHelper(actor_id, worker_id, 0);
|
||||
task1.GetMutableMessage().add_args()->add_object_ids(obj1.Binary());
|
||||
auto task2 = CreateActorTaskHelper(actor_id, 1);
|
||||
auto task2 = CreateActorTaskHelper(actor_id, worker_id, 1);
|
||||
task2.GetMutableMessage().add_args()->add_object_ids(obj2.Binary());
|
||||
|
||||
// Neither task can be submitted yet because they are still waiting on
|
||||
@@ -171,28 +201,131 @@ TEST_F(DirectActorSubmitterTest, TestOutOfOrderDependencies) {
|
||||
ASSERT_EQ(worker_client_->callbacks.size(), 0);
|
||||
ASSERT_TRUE(store_->Put(*data, obj1));
|
||||
ASSERT_EQ(worker_client_->callbacks.size(), 2);
|
||||
ASSERT_THAT(worker_client_->received_seq_nos, ElementsAre(0, 1));
|
||||
}
|
||||
|
||||
TEST_F(DirectActorSubmitterTest, TestActorFailure) {
|
||||
TEST_F(DirectActorSubmitterTest, TestActorDead) {
|
||||
rpc::Address addr;
|
||||
auto worker_id = WorkerID::FromRandom();
|
||||
addr.set_worker_id(worker_id.Binary());
|
||||
ActorID actor_id = ActorID::Of(JobID::FromInt(0), TaskID::Nil(), 0);
|
||||
submitter_.AddActorQueueIfNotExists(actor_id);
|
||||
gcs::ActorTableData actor_data;
|
||||
submitter_.ConnectActor(actor_id, addr);
|
||||
ASSERT_EQ(worker_client_->callbacks.size(), 0);
|
||||
|
||||
// Create two tasks for the actor.
|
||||
auto task1 = CreateActorTaskHelper(actor_id, 0);
|
||||
auto task2 = CreateActorTaskHelper(actor_id, 1);
|
||||
// Create two tasks for the actor. One depends on an object that is not yet available.
|
||||
auto task1 = CreateActorTaskHelper(actor_id, worker_id, 0);
|
||||
ObjectID obj = ObjectID::FromRandom().WithTransportType(TaskTransportType::DIRECT);
|
||||
auto task2 = CreateActorTaskHelper(actor_id, worker_id, 1);
|
||||
task2.GetMutableMessage().add_args()->add_object_ids(obj.Binary());
|
||||
ASSERT_TRUE(submitter_.SubmitTask(task1).ok());
|
||||
ASSERT_TRUE(submitter_.SubmitTask(task2).ok());
|
||||
ASSERT_EQ(worker_client_->callbacks.size(), 2);
|
||||
ASSERT_EQ(worker_client_->callbacks.size(), 1);
|
||||
|
||||
// Simulate the actor dying. All submitted tasks should get failed.
|
||||
EXPECT_CALL(*task_finisher_, PendingTaskFailed(_, _, _)).Times(2);
|
||||
// Simulate the actor dying. All in-flight tasks should get failed.
|
||||
EXPECT_CALL(*task_finisher_, PendingTaskFailed(task1.TaskId(), _, _)).Times(1);
|
||||
EXPECT_CALL(*task_finisher_, CompletePendingTask(_, _, _)).Times(0);
|
||||
while (!worker_client_->callbacks.empty()) {
|
||||
ASSERT_TRUE(worker_client_->ReplyPushTask(Status::IOError("")));
|
||||
}
|
||||
|
||||
EXPECT_CALL(*task_finisher_, PendingTaskFailed(_, _, _)).Times(0);
|
||||
submitter_.DisconnectActor(actor_id, /*dead=*/false);
|
||||
// Actor marked as dead. All queued tasks should get failed.
|
||||
EXPECT_CALL(*task_finisher_, PendingTaskFailed(task2.TaskId(), _, _)).Times(1);
|
||||
submitter_.DisconnectActor(actor_id, /*dead=*/true);
|
||||
}
|
||||
|
||||
TEST_F(DirectActorSubmitterTest, TestActorRestartNoRetry) {
|
||||
rpc::Address addr;
|
||||
auto worker_id = WorkerID::FromRandom();
|
||||
addr.set_worker_id(worker_id.Binary());
|
||||
ActorID actor_id = ActorID::Of(JobID::FromInt(0), TaskID::Nil(), 0);
|
||||
submitter_.AddActorQueueIfNotExists(actor_id);
|
||||
gcs::ActorTableData actor_data;
|
||||
submitter_.ConnectActor(actor_id, addr);
|
||||
ASSERT_EQ(worker_client_->callbacks.size(), 0);
|
||||
|
||||
// Create four tasks for the actor.
|
||||
auto task1 = CreateActorTaskHelper(actor_id, worker_id, 0);
|
||||
auto task2 = CreateActorTaskHelper(actor_id, worker_id, 1);
|
||||
auto task3 = CreateActorTaskHelper(actor_id, worker_id, 2);
|
||||
auto task4 = CreateActorTaskHelper(actor_id, worker_id, 3);
|
||||
// Submit three tasks.
|
||||
ASSERT_TRUE(submitter_.SubmitTask(task1).ok());
|
||||
ASSERT_TRUE(submitter_.SubmitTask(task2).ok());
|
||||
ASSERT_TRUE(submitter_.SubmitTask(task3).ok());
|
||||
|
||||
EXPECT_CALL(*task_finisher_, CompletePendingTask(task1.TaskId(), _, _)).Times(2);
|
||||
EXPECT_CALL(*task_finisher_, PendingTaskFailed(task2.TaskId(), _, _)).Times(2);
|
||||
// First task finishes. Second task fails.
|
||||
ASSERT_TRUE(worker_client_->ReplyPushTask(Status::OK()));
|
||||
ASSERT_TRUE(worker_client_->ReplyPushTask(Status::IOError("")));
|
||||
|
||||
// Simulate the actor failing.
|
||||
submitter_.DisconnectActor(actor_id, /*dead=*/false);
|
||||
// Third task fails after the actor is disconnected. It should not get
|
||||
// retried.
|
||||
ASSERT_TRUE(worker_client_->ReplyPushTask(Status::IOError("")));
|
||||
|
||||
// Actor gets restarted.
|
||||
submitter_.ConnectActor(actor_id, addr);
|
||||
ASSERT_TRUE(submitter_.SubmitTask(task4).ok());
|
||||
ASSERT_TRUE(worker_client_->ReplyPushTask(Status::OK()));
|
||||
ASSERT_TRUE(worker_client_->callbacks.empty());
|
||||
// Actor counter restarts at 0 after the actor is restarted.
|
||||
ASSERT_THAT(worker_client_->received_seq_nos, ElementsAre(0, 1, 2, 0));
|
||||
}
|
||||
|
||||
TEST_F(DirectActorSubmitterTest, TestActorRestartRetry) {
|
||||
rpc::Address addr;
|
||||
auto worker_id = WorkerID::FromRandom();
|
||||
addr.set_worker_id(worker_id.Binary());
|
||||
ActorID actor_id = ActorID::Of(JobID::FromInt(0), TaskID::Nil(), 0);
|
||||
submitter_.AddActorQueueIfNotExists(actor_id);
|
||||
gcs::ActorTableData actor_data;
|
||||
submitter_.ConnectActor(actor_id, addr);
|
||||
ASSERT_EQ(worker_client_->callbacks.size(), 0);
|
||||
|
||||
// Create four tasks for the actor.
|
||||
auto task1 = CreateActorTaskHelper(actor_id, worker_id, 0);
|
||||
auto task2 = CreateActorTaskHelper(actor_id, worker_id, 1);
|
||||
auto task3 = CreateActorTaskHelper(actor_id, worker_id, 2);
|
||||
auto task4 = CreateActorTaskHelper(actor_id, worker_id, 3);
|
||||
// Submit three tasks.
|
||||
ASSERT_TRUE(submitter_.SubmitTask(task1).ok());
|
||||
ASSERT_TRUE(submitter_.SubmitTask(task2).ok());
|
||||
ASSERT_TRUE(submitter_.SubmitTask(task3).ok());
|
||||
|
||||
// All tasks will eventually finish.
|
||||
EXPECT_CALL(*task_finisher_, CompletePendingTask(task1.TaskId(), _, _)).Times(4);
|
||||
// Tasks 2 and 3 will be retried.
|
||||
EXPECT_CALL(*task_finisher_, PendingTaskFailed(task2.TaskId(), _, _))
|
||||
.Times(2)
|
||||
.WillRepeatedly(Return(true));
|
||||
// First task finishes. Second task fails.
|
||||
ASSERT_TRUE(worker_client_->ReplyPushTask(Status::OK()));
|
||||
ASSERT_TRUE(worker_client_->ReplyPushTask(Status::IOError("")));
|
||||
|
||||
// Simulate the actor failing.
|
||||
submitter_.DisconnectActor(actor_id, /*dead=*/false);
|
||||
// Third task fails after the actor is disconnected.
|
||||
ASSERT_TRUE(worker_client_->ReplyPushTask(Status::IOError("")));
|
||||
|
||||
// Actor gets restarted.
|
||||
submitter_.ConnectActor(actor_id, addr);
|
||||
// A new task is submitted.
|
||||
ASSERT_TRUE(submitter_.SubmitTask(task4).ok());
|
||||
// Tasks 2 and 3 get retried.
|
||||
ASSERT_TRUE(submitter_.SubmitTask(task2).ok());
|
||||
ASSERT_TRUE(submitter_.SubmitTask(task3).ok());
|
||||
while (!worker_client_->callbacks.empty()) {
|
||||
ASSERT_TRUE(worker_client_->ReplyPushTask(Status::OK()));
|
||||
}
|
||||
// Actor counter restarts at 0 after the actor is restarted. New task cannot
|
||||
// execute until after tasks 2 and 3 are re-executed.
|
||||
ASSERT_THAT(worker_client_->received_seq_nos, ElementsAre(0, 1, 2, 2, 0, 1));
|
||||
}
|
||||
|
||||
class MockDependencyWaiterInterface : public DependencyWaiterInterface {
|
||||
@@ -203,35 +336,6 @@ class MockDependencyWaiterInterface : public DependencyWaiterInterface {
|
||||
}
|
||||
};
|
||||
|
||||
TaskSpecification CreateActorTaskHelper(ActorID actor_id, int64_t counter,
|
||||
TaskID caller_id) {
|
||||
TaskSpecification task;
|
||||
task.GetMutableMessage().set_task_id(TaskID::Nil().Binary());
|
||||
task.GetMutableMessage().set_caller_id(caller_id.Binary());
|
||||
task.GetMutableMessage().set_type(TaskType::ACTOR_TASK);
|
||||
task.GetMutableMessage().mutable_actor_task_spec()->set_actor_id(actor_id.Binary());
|
||||
task.GetMutableMessage().mutable_actor_task_spec()->set_actor_counter(counter);
|
||||
task.GetMutableMessage().set_num_returns(1);
|
||||
return task;
|
||||
}
|
||||
|
||||
rpc::PushTaskRequest CreatePushTaskRequestHelper(ActorID actor_id, int64_t counter,
|
||||
WorkerID caller_worker_id,
|
||||
TaskID caller_id,
|
||||
int64_t caller_timestamp) {
|
||||
auto task_spec = CreateActorTaskHelper(actor_id, counter, caller_id);
|
||||
rpc::Address rpc_address;
|
||||
rpc_address.set_worker_id(caller_worker_id.Binary());
|
||||
|
||||
rpc::PushTaskRequest request;
|
||||
request.mutable_caller_address()->CopyFrom(rpc_address);
|
||||
request.mutable_task_spec()->CopyFrom(task_spec.GetMessage());
|
||||
request.set_caller_version(caller_timestamp);
|
||||
request.set_sequence_number(request.task_spec().actor_task_spec().actor_counter());
|
||||
request.set_client_processed_up_to(-1);
|
||||
return request;
|
||||
}
|
||||
|
||||
class MockWorkerContext : public WorkerContext {
|
||||
public:
|
||||
MockWorkerContext(WorkerType worker_type, const JobID &job_id)
|
||||
@@ -278,7 +382,6 @@ class DirectActorReceiverTest : public ::testing::Test {
|
||||
};
|
||||
|
||||
TEST_F(DirectActorReceiverTest, TestNewTaskFromDifferentWorker) {
|
||||
rpc::Address addr;
|
||||
TaskID current_task_id = TaskID::Nil();
|
||||
ActorID actor_id = ActorID::Of(JobID::FromInt(0), TaskID::Nil(), 0);
|
||||
WorkerID worker_id = WorkerID::FromRandom();
|
||||
@@ -319,7 +422,7 @@ TEST_F(DirectActorReceiverTest, TestNewTaskFromDifferentWorker) {
|
||||
receiver_->HandlePushTask(request, &reply, reply_callback);
|
||||
}
|
||||
|
||||
// Create another request with the same caller id, but a differnt worker id,
|
||||
// Create another request with the same caller id, but a different worker id,
|
||||
// and a newer timestamp. This simulates caller reconstruction.
|
||||
// Note that here the task request still has counter 0, which should be
|
||||
// ignored normally, but here it's from a different worker and with a newer
|
||||
|
||||
@@ -72,9 +72,10 @@ class MockTaskFinisher : public TaskFinisherInterface {
|
||||
num_tasks_complete++;
|
||||
}
|
||||
|
||||
void PendingTaskFailed(const TaskID &task_id, rpc::ErrorType error_type,
|
||||
bool PendingTaskFailed(const TaskID &task_id, rpc::ErrorType error_type,
|
||||
Status *status) override {
|
||||
num_tasks_failed++;
|
||||
return true;
|
||||
}
|
||||
|
||||
void OnTaskDependenciesInlined(const std::vector<ObjectID> &inlined_dependency_ids,
|
||||
|
||||
@@ -22,6 +22,14 @@ using ray::rpc::ActorTableData;
|
||||
|
||||
namespace ray {
|
||||
|
||||
void CoreWorkerDirectActorTaskSubmitter::AddActorQueueIfNotExists(
|
||||
const ActorID &actor_id) {
|
||||
absl::MutexLock lock(&mu_);
|
||||
// No need to check whether the insert was successful, since it is possible
|
||||
// for this worker to have multiple references to the same actor.
|
||||
client_queues_.emplace(actor_id, ClientQueue());
|
||||
}
|
||||
|
||||
void CoreWorkerDirectActorTaskSubmitter::KillActor(const ActorID &actor_id,
|
||||
bool force_kill, bool no_restart) {
|
||||
absl::MutexLock lock(&mu_);
|
||||
@@ -29,72 +37,72 @@ void CoreWorkerDirectActorTaskSubmitter::KillActor(const ActorID &actor_id,
|
||||
request.set_intended_actor_id(actor_id.Binary());
|
||||
request.set_force_kill(force_kill);
|
||||
request.set_no_restart(no_restart);
|
||||
auto inserted = pending_force_kills_.emplace(actor_id, request);
|
||||
if (!inserted.second && force_kill) {
|
||||
|
||||
auto it = client_queues_.find(actor_id);
|
||||
// The language frontend can only kill actors that it has a reference to.
|
||||
RAY_CHECK(it != client_queues_.end());
|
||||
|
||||
if (!it->second.pending_force_kill) {
|
||||
it->second.pending_force_kill = request;
|
||||
} else if (force_kill) {
|
||||
// Overwrite the previous request to kill the actor if the new request is a
|
||||
// force kill.
|
||||
inserted.first->second.set_force_kill(true);
|
||||
it->second.pending_force_kill->set_force_kill(true);
|
||||
if (no_restart) {
|
||||
// Overwrite the previous request to disable reconstruction if the new request's
|
||||
// Overwrite the previous request to disable restart if the new request's
|
||||
// no_restart flag is set to true.
|
||||
inserted.first->second.set_no_restart(true);
|
||||
it->second.pending_force_kill->set_no_restart(true);
|
||||
}
|
||||
}
|
||||
auto it = rpc_clients_.find(actor_id);
|
||||
if (it == rpc_clients_.end()) {
|
||||
// Actor is not yet created, or is being restarted, cache the request
|
||||
// and submit after actor is alive.
|
||||
// TODO(zhijunfu): it might be possible for a user to specify an invalid
|
||||
// actor handle (e.g. from unpickling), in that case it might be desirable
|
||||
// to have a timeout to mark it as invalid if it doesn't show up in the
|
||||
// specified time.
|
||||
RAY_LOG(DEBUG) << "Actor " << actor_id << " is not yet created.";
|
||||
} else {
|
||||
SendPendingTasks(actor_id);
|
||||
}
|
||||
|
||||
SendPendingTasks(actor_id);
|
||||
}
|
||||
|
||||
Status CoreWorkerDirectActorTaskSubmitter::SubmitTask(TaskSpecification task_spec) {
|
||||
RAY_LOG(DEBUG) << "Submitting task " << task_spec.TaskId();
|
||||
RAY_CHECK(task_spec.IsActorTask());
|
||||
|
||||
// We must fix the send order prior to resolving dependencies, which may complete
|
||||
// out of order. This ensures we preserve the client-side send order.
|
||||
int64_t send_pos = -1;
|
||||
bool task_queued = false;
|
||||
uint64_t send_pos = 0;
|
||||
{
|
||||
absl::MutexLock lock(&mu_);
|
||||
send_pos = next_send_position_to_assign_[task_spec.ActorId()]++;
|
||||
auto queue = client_queues_.find(task_spec.ActorId());
|
||||
RAY_CHECK(queue != client_queues_.end());
|
||||
if (queue->second.state == rpc::ActorTableData::DEAD) {
|
||||
task_finisher_->MarkTaskCanceled(task_spec.TaskId());
|
||||
auto status = Status::IOError("cancelling all pending tasks of dead actor");
|
||||
// No need to increment the number of completed tasks since the actor is
|
||||
// dead.
|
||||
RAY_UNUSED(!task_finisher_->PendingTaskFailed(task_spec.TaskId(),
|
||||
rpc::ErrorType::ACTOR_DIED, &status));
|
||||
} else {
|
||||
// We must fix the send order prior to resolving dependencies, which may
|
||||
// complete out of order. This ensures that we will not deadlock due to
|
||||
// backpressure. The receiving actor will execute the tasks according to
|
||||
// this sequence number.
|
||||
send_pos = task_spec.ActorCounter();
|
||||
auto inserted =
|
||||
queue->second.requests.emplace(send_pos, std::make_pair(task_spec, false));
|
||||
RAY_CHECK(inserted.second);
|
||||
task_queued = true;
|
||||
}
|
||||
}
|
||||
|
||||
resolver_.ResolveDependencies(task_spec, [this, send_pos, task_spec]() mutable {
|
||||
const auto &actor_id = task_spec.ActorId();
|
||||
|
||||
auto request = std::unique_ptr<rpc::PushTaskRequest>(new rpc::PushTaskRequest);
|
||||
request->mutable_caller_address()->CopyFrom(rpc_address_);
|
||||
// NOTE(swang): CopyFrom is needed because if we use Swap here and the task
|
||||
// fails, then the task data will be gone when the TaskManager attempts to
|
||||
// access the task.
|
||||
request->mutable_task_spec()->CopyFrom(task_spec.GetMessage());
|
||||
request->set_caller_version(caller_creation_timestamp_ms_);
|
||||
|
||||
absl::MutexLock lock(&mu_);
|
||||
|
||||
auto inserted = pending_requests_[actor_id].emplace(send_pos, std::move(request));
|
||||
RAY_CHECK(inserted.second);
|
||||
|
||||
auto it = rpc_clients_.find(actor_id);
|
||||
if (it == rpc_clients_.end()) {
|
||||
// Actor is not yet created, or is being restarted, cache the request
|
||||
// and submit after actor is alive.
|
||||
// TODO(zhijunfu): it might be possible for a user to specify an invalid
|
||||
// actor handle (e.g. from unpickling), in that case it might be desirable
|
||||
// to have a timeout to mark it as invalid if it doesn't show up in the
|
||||
// specified time.
|
||||
RAY_LOG(DEBUG) << "Actor " << actor_id << " is not yet created.";
|
||||
} else {
|
||||
SendPendingTasks(actor_id);
|
||||
}
|
||||
});
|
||||
if (task_queued) {
|
||||
const auto actor_id = task_spec.ActorId();
|
||||
resolver_.ResolveDependencies(task_spec, [this, send_pos, actor_id]() {
|
||||
absl::MutexLock lock(&mu_);
|
||||
auto queue = client_queues_.find(actor_id);
|
||||
RAY_CHECK(queue != client_queues_.end());
|
||||
auto it = queue->second.requests.find(send_pos);
|
||||
// Only dispatch tasks if the submitted task is still queued. The task
|
||||
// may have been dequeued if the actor has since failed.
|
||||
if (it != queue->second.requests.end()) {
|
||||
it->second.second = true;
|
||||
SendPendingTasks(actor_id);
|
||||
}
|
||||
});
|
||||
}
|
||||
|
||||
// If the task submission subsequently fails, then the client will receive
|
||||
// the error in a callback.
|
||||
@@ -104,105 +112,156 @@ Status CoreWorkerDirectActorTaskSubmitter::SubmitTask(TaskSpecification task_spe
|
||||
void CoreWorkerDirectActorTaskSubmitter::ConnectActor(const ActorID &actor_id,
|
||||
const rpc::Address &address) {
|
||||
absl::MutexLock lock(&mu_);
|
||||
// Update the mapping so new RPCs go out with the right intended worker id.
|
||||
worker_ids_[actor_id] = address.worker_id();
|
||||
// Create a new connection to the actor.
|
||||
// TODO(edoakes): are these clients cleaned up properly?
|
||||
if (rpc_clients_.count(actor_id) == 0) {
|
||||
rpc_clients_[actor_id] =
|
||||
std::shared_ptr<rpc::CoreWorkerClientInterface>(client_factory_(address));
|
||||
|
||||
auto queue = client_queues_.find(actor_id);
|
||||
RAY_CHECK(queue != client_queues_.end());
|
||||
if (queue->second.rpc_client) {
|
||||
// Skip reconnection if we already have a client to this actor.
|
||||
// NOTE(swang): This seems to only trigger in multithreaded Java tests.
|
||||
RAY_CHECK(queue->second.worker_id == address.worker_id());
|
||||
return;
|
||||
}
|
||||
|
||||
queue->second.state = rpc::ActorTableData::ALIVE;
|
||||
// Update the mapping so new RPCs go out with the right intended worker id.
|
||||
queue->second.worker_id = address.worker_id();
|
||||
// Create a new connection to the actor.
|
||||
queue->second.rpc_client =
|
||||
std::shared_ptr<rpc::CoreWorkerClientInterface>(client_factory_(address));
|
||||
// TODO(swang): This assumes that all replies from the previous incarnation
|
||||
// of the actor have been received. Fix this by setting an epoch for each
|
||||
// actor task, so we can ignore completed tasks from old epochs.
|
||||
RAY_LOG(INFO) << "Resetting caller starts at for actor " << actor_id << " from "
|
||||
<< queue->second.caller_starts_at << " to "
|
||||
<< queue->second.num_completed_tasks;
|
||||
queue->second.caller_starts_at = queue->second.num_completed_tasks;
|
||||
SendPendingTasks(actor_id);
|
||||
}
|
||||
|
||||
void CoreWorkerDirectActorTaskSubmitter::DisconnectActor(const ActorID &actor_id,
|
||||
bool dead) {
|
||||
absl::MutexLock lock(&mu_);
|
||||
if (!dead) {
|
||||
// We're restarting the actor, so erase the client for now. The new client
|
||||
// will be inserted once actor reconstruction completes. We don't erase the
|
||||
// client when the actor is DEAD, so that all further tasks will be failed.
|
||||
rpc_clients_.erase(actor_id);
|
||||
worker_ids_.erase(actor_id);
|
||||
auto queue = client_queues_.find(actor_id);
|
||||
RAY_CHECK(queue != client_queues_.end());
|
||||
|
||||
if (dead) {
|
||||
queue->second.state = rpc::ActorTableData::DEAD;
|
||||
} else {
|
||||
queue->second.state = rpc::ActorTableData::RESTARTING;
|
||||
}
|
||||
|
||||
// The actor failed, so erase the client for now. Either the actor is
|
||||
// permanently dead or the new client will be inserted once the actor is
|
||||
// restarted.
|
||||
queue->second.rpc_client = nullptr;
|
||||
queue->second.worker_id.clear();
|
||||
queue->second.pending_force_kill.reset();
|
||||
|
||||
// If there are pending requests, treat the pending tasks as failed.
|
||||
if (dead) {
|
||||
RAY_LOG(INFO) << "Failing pending tasks for actor " << actor_id;
|
||||
// If there are pending requests, treat the pending tasks as failed.
|
||||
auto pending_it = pending_requests_.find(actor_id);
|
||||
if (pending_it != pending_requests_.end()) {
|
||||
auto head = pending_it->second.begin();
|
||||
while (head != pending_it->second.end()) {
|
||||
auto request = std::move(head->second);
|
||||
head = pending_it->second.erase(head);
|
||||
auto task_id = TaskID::FromBinary(request->task_spec().task_id());
|
||||
auto status = Status::IOError("cancelling all pending tasks of dead actor");
|
||||
task_finisher_->PendingTaskFailed(task_id, rpc::ErrorType::ACTOR_DIED, &status);
|
||||
}
|
||||
pending_requests_.erase(pending_it);
|
||||
auto &requests = queue->second.requests;
|
||||
auto head = requests.begin();
|
||||
while (head != requests.end()) {
|
||||
const auto &task_spec = head->second.first;
|
||||
task_finisher_->MarkTaskCanceled(task_spec.TaskId());
|
||||
auto status = Status::IOError("cancelling all pending tasks of dead actor");
|
||||
// No need to increment the number of completed tasks since the actor is
|
||||
// dead.
|
||||
RAY_UNUSED(!task_finisher_->PendingTaskFailed(task_spec.TaskId(),
|
||||
rpc::ErrorType::ACTOR_DIED, &status));
|
||||
head = requests.erase(head);
|
||||
}
|
||||
|
||||
// No need to clean up tasks that have been sent and are waiting for
|
||||
// replies. They will be treated as failed once the connection dies.
|
||||
// We retain the sequencing information so that we can properly fail
|
||||
// any tasks submitted after the actor death.
|
||||
|
||||
pending_force_kills_.erase(actor_id);
|
||||
}
|
||||
}
|
||||
|
||||
void CoreWorkerDirectActorTaskSubmitter::SendPendingTasks(const ActorID &actor_id) {
|
||||
auto &client = rpc_clients_[actor_id];
|
||||
RAY_CHECK(client);
|
||||
auto it = client_queues_.find(actor_id);
|
||||
RAY_CHECK(it != client_queues_.end());
|
||||
if (!it->second.rpc_client) {
|
||||
return;
|
||||
}
|
||||
|
||||
// Check if there is a pending force kill. If there is, send it and disconnect the
|
||||
// client.
|
||||
auto it = pending_force_kills_.find(actor_id);
|
||||
if (it != pending_force_kills_.end()) {
|
||||
if (it->second.pending_force_kill) {
|
||||
RAY_LOG(INFO) << "Sending KillActor request to actor " << actor_id;
|
||||
// It's okay if this fails because this means the worker is already dead.
|
||||
RAY_UNUSED(client->KillActor(it->second, nullptr));
|
||||
pending_force_kills_.erase(it);
|
||||
RAY_UNUSED(it->second.rpc_client->KillActor(*it->second.pending_force_kill, nullptr));
|
||||
it->second.pending_force_kill.reset();
|
||||
}
|
||||
|
||||
// Submit all pending requests.
|
||||
auto &requests = pending_requests_[actor_id];
|
||||
auto &requests = it->second.requests;
|
||||
auto head = requests.begin();
|
||||
while (head != requests.end() && head->first == next_send_position_[actor_id]) {
|
||||
auto request = std::move(head->second);
|
||||
while (head != requests.end() && head->first <= it->second.next_send_position &&
|
||||
head->second.second) {
|
||||
// If the task has been sent before, skip the other tasks in the send
|
||||
// queue.
|
||||
bool skip_queue = head->first < it->second.next_send_position;
|
||||
auto task_spec = std::move(head->second.first);
|
||||
head = requests.erase(head);
|
||||
|
||||
auto num_returns = request->task_spec().num_returns();
|
||||
auto task_id = TaskID::FromBinary(request->task_spec().task_id());
|
||||
PushActorTask(*client, std::move(request), actor_id, task_id, num_returns);
|
||||
RAY_CHECK(!it->second.worker_id.empty());
|
||||
PushActorTask(it->second, task_spec, skip_queue);
|
||||
it->second.next_send_position++;
|
||||
}
|
||||
}
|
||||
|
||||
void CoreWorkerDirectActorTaskSubmitter::PushActorTask(
|
||||
rpc::CoreWorkerClientInterface &client, std::unique_ptr<rpc::PushTaskRequest> request,
|
||||
const ActorID &actor_id, const TaskID &task_id, int num_returns) {
|
||||
RAY_LOG(DEBUG) << "Pushing task " << task_id << " to actor " << actor_id;
|
||||
next_send_position_[actor_id]++;
|
||||
auto it = worker_ids_.find(actor_id);
|
||||
RAY_CHECK(it != worker_ids_.end()) << "Actor worker id not found " << actor_id.Hex();
|
||||
request->set_intended_worker_id(it->second);
|
||||
rpc::Address addr(client.Addr());
|
||||
RAY_CHECK_OK(client.PushActorTask(
|
||||
std::move(request),
|
||||
[this, addr, task_id](Status status, const rpc::PushTaskReply &reply) {
|
||||
void CoreWorkerDirectActorTaskSubmitter::PushActorTask(const ClientQueue &queue,
|
||||
const TaskSpecification &task_spec,
|
||||
bool skip_queue) {
|
||||
auto request = std::unique_ptr<rpc::PushTaskRequest>(new rpc::PushTaskRequest());
|
||||
// NOTE(swang): CopyFrom is needed because if we use Swap here and the task
|
||||
// fails, then the task data will be gone when the TaskManager attempts to
|
||||
// access the task.
|
||||
request->mutable_task_spec()->CopyFrom(task_spec.GetMessage());
|
||||
|
||||
request->set_intended_worker_id(queue.worker_id);
|
||||
RAY_CHECK(task_spec.ActorCounter() >= queue.caller_starts_at)
|
||||
<< "actor counter " << task_spec.ActorCounter() << " " << queue.caller_starts_at;
|
||||
request->set_sequence_number(task_spec.ActorCounter() - queue.caller_starts_at);
|
||||
|
||||
const auto task_id = task_spec.TaskId();
|
||||
const auto actor_id = task_spec.ActorId();
|
||||
const auto counter = task_spec.ActorCounter();
|
||||
RAY_LOG(DEBUG) << "Pushing task " << task_id << " to actor " << actor_id
|
||||
<< " actor counter " << counter << " seq no "
|
||||
<< request->sequence_number();
|
||||
rpc::Address addr(queue.rpc_client->Addr());
|
||||
RAY_UNUSED(queue.rpc_client->PushActorTask(
|
||||
std::move(request), skip_queue,
|
||||
[this, addr, task_id, actor_id](Status status, const rpc::PushTaskReply &reply) {
|
||||
bool increment_completed_tasks = true;
|
||||
if (!status.ok()) {
|
||||
task_finisher_->PendingTaskFailed(task_id, rpc::ErrorType::ACTOR_DIED, &status);
|
||||
bool will_retry = task_finisher_->PendingTaskFailed(
|
||||
task_id, rpc::ErrorType::ACTOR_DIED, &status);
|
||||
if (will_retry) {
|
||||
increment_completed_tasks = false;
|
||||
}
|
||||
} else {
|
||||
task_finisher_->CompletePendingTask(task_id, reply, addr);
|
||||
}
|
||||
|
||||
if (increment_completed_tasks) {
|
||||
absl::MutexLock lock(&mu_);
|
||||
auto queue = client_queues_.find(actor_id);
|
||||
RAY_CHECK(queue != client_queues_.end());
|
||||
queue->second.num_completed_tasks++;
|
||||
}
|
||||
}));
|
||||
}
|
||||
|
||||
bool CoreWorkerDirectActorTaskSubmitter::IsActorAlive(const ActorID &actor_id) const {
|
||||
absl::MutexLock lock(&mu_);
|
||||
|
||||
auto iter = rpc_clients_.find(actor_id);
|
||||
return (iter != rpc_clients_.end());
|
||||
}
|
||||
|
||||
void CoreWorkerDirectActorTaskSubmitter::SetCallerCreationTimestamp(int64_t timestamp) {
|
||||
caller_creation_timestamp_ms_ = timestamp;
|
||||
auto iter = client_queues_.find(actor_id);
|
||||
return (iter != client_queues_.end() && iter->second.rpc_client);
|
||||
}
|
||||
|
||||
void CoreWorkerDirectTaskReceiver::Init(
|
||||
@@ -284,7 +343,7 @@ void CoreWorkerDirectTaskReceiver::HandlePushTask(
|
||||
<< ", actor_id: " << task_spec.ActorCreationId();
|
||||
// Tell raylet that an actor creation task has finished execution, so that
|
||||
// raylet can publish actor creation event to GCS, and mark this worker as
|
||||
// actor, thus if this worker dies later raylet will reconstruct the actor.
|
||||
// actor, thus if this worker dies later raylet will restart the actor.
|
||||
RAY_CHECK_OK(task_done_());
|
||||
}
|
||||
}
|
||||
@@ -315,50 +374,15 @@ void CoreWorkerDirectTaskReceiver::HandlePushTask(
|
||||
send_reply_callback(Status::Invalid("client cancelled stale rpc"), nullptr, nullptr);
|
||||
};
|
||||
|
||||
auto caller_worker_id = WorkerID::FromBinary(request.caller_address().worker_id());
|
||||
auto caller_version = request.caller_version();
|
||||
auto it = scheduling_queue_.find(task_spec.CallerId());
|
||||
if (it != scheduling_queue_.end()) {
|
||||
if (it->second.first.caller_worker_id != caller_worker_id) {
|
||||
// We received a request with the same caller ID, but from a different worker,
|
||||
// this indicates the caller (actor) is restarted.
|
||||
if (it->second.first.caller_creation_timestamp_ms < caller_version) {
|
||||
// The new request has a newer caller version, then remove the old entry
|
||||
// from scheduling queue since it's invalid now.
|
||||
RAY_LOG(INFO) << "Remove existing scheduling queue for caller "
|
||||
<< task_spec.CallerId() << " after receiving a "
|
||||
<< "request from a different worker ID with a newer "
|
||||
<< "version, old worker ID: " << it->second.first.caller_worker_id
|
||||
<< ", new worker ID" << caller_worker_id;
|
||||
scheduling_queue_.erase(task_spec.CallerId());
|
||||
it = scheduling_queue_.end();
|
||||
} else {
|
||||
// The existing caller has the newer version, this indicates the request
|
||||
// is from an old caller, which might be possible when network has problems.
|
||||
// In this case fail this request.
|
||||
RAY_LOG(WARNING) << "Ignoring request from an old caller because "
|
||||
<< "it has a smaller timestamp, old worker ID: "
|
||||
<< caller_worker_id << ", current worker ID"
|
||||
<< it->second.first.caller_worker_id;
|
||||
// Fail request with an old caller version.
|
||||
reject_callback();
|
||||
return;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
auto it = scheduling_queue_.find(task_spec.CallerWorkerId());
|
||||
if (it == scheduling_queue_.end()) {
|
||||
SchedulingQueueTag tag;
|
||||
tag.caller_worker_id = caller_worker_id;
|
||||
tag.caller_creation_timestamp_ms = caller_version;
|
||||
auto result = scheduling_queue_.emplace(
|
||||
task_spec.CallerId(),
|
||||
std::make_pair(tag, std::unique_ptr<SchedulingQueue>(new SchedulingQueue(
|
||||
task_main_io_service_, *waiter_, worker_context_))));
|
||||
task_spec.CallerWorkerId(),
|
||||
SchedulingQueue(task_main_io_service_, *waiter_, worker_context_));
|
||||
it = result.first;
|
||||
}
|
||||
it->second.second->Add(request.sequence_number(), request.client_processed_up_to(),
|
||||
accept_callback, reject_callback, dependencies);
|
||||
it->second.Add(request.sequence_number(), request.client_processed_up_to(),
|
||||
accept_callback, reject_callback, dependencies);
|
||||
}
|
||||
|
||||
void CoreWorkerDirectTaskReceiver::HandleDirectActorCallArgWaitComplete(
|
||||
|
||||
@@ -50,15 +50,21 @@ const int kMaxReorderWaitSeconds = 30;
|
||||
// This class is thread-safe.
|
||||
class CoreWorkerDirectActorTaskSubmitter {
|
||||
public:
|
||||
CoreWorkerDirectActorTaskSubmitter(rpc::Address rpc_address,
|
||||
rpc::ClientFactoryFn client_factory,
|
||||
CoreWorkerDirectActorTaskSubmitter(rpc::ClientFactoryFn client_factory,
|
||||
std::shared_ptr<CoreWorkerMemoryStore> store,
|
||||
std::shared_ptr<TaskFinisherInterface> task_finisher)
|
||||
: rpc_address_(rpc_address),
|
||||
client_factory_(client_factory),
|
||||
: client_factory_(client_factory),
|
||||
resolver_(store, task_finisher),
|
||||
task_finisher_(task_finisher) {}
|
||||
|
||||
/// Add an actor queue. This should be called whenever a reference to an
|
||||
/// actor is created in the language frontend.
|
||||
/// TODO(swang): Remove the actor queue once it is sure that this worker will
|
||||
/// not receive another reference to the same actor.
|
||||
///
|
||||
/// \param[in] actor_id The actor for whom to add a queue.
|
||||
void AddActorQueueIfNotExists(const ActorID &actor_id);
|
||||
|
||||
/// Submit a task to an actor for execution.
|
||||
///
|
||||
/// \param[in] task The task spec to submit.
|
||||
@@ -89,20 +95,84 @@ class CoreWorkerDirectActorTaskSubmitter {
|
||||
void SetCallerCreationTimestamp(int64_t timestamp);
|
||||
|
||||
private:
|
||||
struct ClientQueue {
|
||||
/// The current state of the actor. If this is ALIVE, then we should have
|
||||
/// an RPC client to the actor. If this is DEAD, then all tasks in the
|
||||
/// queue will be marked failed and all other ClientQueue state is ignored.
|
||||
rpc::ActorTableData::ActorState state = rpc::ActorTableData::PENDING;
|
||||
/// The RPC client. We use shared_ptr to enable shared_from_this for
|
||||
/// pending client callbacks.
|
||||
std::shared_ptr<rpc::CoreWorkerClientInterface> rpc_client = nullptr;
|
||||
/// The intended worker ID of the actor.
|
||||
std::string worker_id = "";
|
||||
|
||||
/// The actor's pending requests, ordered by the task number (see below
|
||||
/// diagram) in the request. The bool indicates whether the dependencies
|
||||
/// for that task have been resolved yet. A task will be sent after its
|
||||
/// dependencies have been resolved and its task number matches
|
||||
/// next_send_position.
|
||||
std::map<uint64_t, std::pair<TaskSpecification, bool>> requests;
|
||||
|
||||
/// Diagram of the sequence numbers assigned to actor tasks during actor
|
||||
/// crash and restart:
|
||||
///
|
||||
/// The actor starts, and 10 tasks are submitted. We have sent 6 tasks
|
||||
/// (0-5) so far, and have received a successful reply for 4 tasks (0-3).
|
||||
/// 0 1 2 3 4 5 6 7 8 9
|
||||
/// ^ next_send_position
|
||||
/// ^ num_completed_tasks
|
||||
/// ^ caller_starts_at
|
||||
///
|
||||
/// Suppose the actor crashes and recovers. Then, caller_starts_at is reset
|
||||
/// to the current num_completed_tasks. caller_starts_at is then subtracted
|
||||
/// from each task's counter, so the recovered actor will receive the
|
||||
/// sequence numbers 0, 1, 2 (and so on) for tasks 4, 5, 6, respectively.
|
||||
/// Therefore, the recovered actor will restart execution from task 4.
|
||||
/// 0 1 2 3 4 5 6 7 8 9
|
||||
/// ^ next_send_position
|
||||
/// ^ num_completed_tasks
|
||||
/// ^ caller_starts_at
|
||||
///
|
||||
/// New actor tasks will continue to be sent even while tasks are being
|
||||
/// resubmitted, but the receiving worker will only execute them after the
|
||||
/// resent tasks. For example, this diagram shows what happens if task 6 is
|
||||
/// sent for the first time, tasks 4 and 5 have been resent, and we have
|
||||
/// received a successful reply for task 4.
|
||||
/// 0 1 2 3 4 5 6 7 8 9
|
||||
/// ^ next_send_position
|
||||
/// ^ num_completed_tasks
|
||||
/// ^ caller_starts_at
|
||||
///
|
||||
/// The send position of the next task to send to this actor. This sequence
|
||||
/// number increases monotonically.
|
||||
uint64_t next_send_position = 0;
|
||||
/// The offset at which the the actor should start its counter for this
|
||||
/// caller. This is used for actors that can be restarted, so that the new
|
||||
/// instance of the actor knows from which task to start executing.
|
||||
uint64_t caller_starts_at = 0;
|
||||
/// Out of the tasks sent by this worker to the actor, the number of tasks
|
||||
/// that we will never send to the actor again. This is used to reset
|
||||
/// caller_starts_at if the actor dies and is restarted. We only include
|
||||
/// tasks that will not be sent again, to support automatic task retry on
|
||||
/// actor failure.
|
||||
uint64_t num_completed_tasks = 0;
|
||||
|
||||
/// A force-kill request that should be sent to the actor once an RPC
|
||||
/// client to the actor is available.
|
||||
absl::optional<rpc::KillActorRequest> pending_force_kill;
|
||||
};
|
||||
|
||||
/// Push a task to a remote actor via the given client.
|
||||
/// Note, this function doesn't return any error status code. If an error occurs while
|
||||
/// sending the request, this task will be treated as failed.
|
||||
///
|
||||
/// \param[in] client The RPC client to send tasks to an actor.
|
||||
/// \param[in] request The request to send.
|
||||
/// \param[in] actor_id Actor ID.
|
||||
/// \param[in] task_id The ID of a task.
|
||||
/// \param[in] num_returns Number of return objects.
|
||||
/// \param[in] queue The actor queue. Contains the RPC client state.
|
||||
/// \param[in] task_spec The task to send.
|
||||
/// \param[in] skip_queue Whether to skip the task queue. This will send the
|
||||
/// task for execution immediately.
|
||||
/// \return Void.
|
||||
void PushActorTask(rpc::CoreWorkerClientInterface &client,
|
||||
std::unique_ptr<rpc::PushTaskRequest> request,
|
||||
const ActorID &actor_id, const TaskID &task_id, int num_returns)
|
||||
EXCLUSIVE_LOCKS_REQUIRED(mu_);
|
||||
void PushActorTask(const ClientQueue &queue, const TaskSpecification &task_spec,
|
||||
bool skip_queue) EXCLUSIVE_LOCKS_REQUIRED(mu_);
|
||||
|
||||
/// Send all pending tasks for an actor.
|
||||
/// Note that this function doesn't take lock, the caller is expected to hold
|
||||
@@ -124,39 +194,7 @@ class CoreWorkerDirectActorTaskSubmitter {
|
||||
/// Mutex to proect the various maps below.
|
||||
mutable absl::Mutex mu_;
|
||||
|
||||
/// Address of our RPC server.
|
||||
rpc::Address rpc_address_;
|
||||
|
||||
/// Map from actor id to rpc client. This only includes actors that we send tasks to.
|
||||
/// We use shared_ptr to enable shared_from_this for pending client callbacks.
|
||||
///
|
||||
/// TODO(zhijunfu): this will be moved into `actor_states_` later when we can
|
||||
/// subscribe updates for a specific actor.
|
||||
absl::flat_hash_map<ActorID, std::shared_ptr<rpc::CoreWorkerClientInterface>>
|
||||
rpc_clients_ GUARDED_BY(mu_);
|
||||
|
||||
/// Map from actor ids to worker ids. TODO(ekl) consider unifying this with the
|
||||
/// rpc_clients_ map.
|
||||
absl::flat_hash_map<ActorID, std::string> worker_ids_ GUARDED_BY(mu_);
|
||||
|
||||
/// Map from actor ids that should be force killed once a client is available to the
|
||||
/// pending kill actor requests.
|
||||
absl::flat_hash_map<ActorID, rpc::KillActorRequest> pending_force_kills_
|
||||
GUARDED_BY(mu_);
|
||||
|
||||
/// Map from actor id to the actor's pending requests. Each actor's requests
|
||||
/// are ordered by the task number in the request.
|
||||
absl::flat_hash_map<ActorID, std::map<int64_t, std::unique_ptr<rpc::PushTaskRequest>>>
|
||||
pending_requests_ GUARDED_BY(mu_);
|
||||
|
||||
/// Map from actor id to the send position of the next task to queue for send
|
||||
/// for that actor. This is always greater than or equal to next_send_position_.
|
||||
absl::flat_hash_map<ActorID, int64_t> next_send_position_to_assign_ GUARDED_BY(mu_);
|
||||
|
||||
/// Map from actor id to the send position of the next task to send to that actor.
|
||||
/// Note that this differs from the PushTaskRequest's sequence number in that it
|
||||
/// increases monotonically in this process independently of CallerId changes.
|
||||
absl::flat_hash_map<ActorID, int64_t> next_send_position_ GUARDED_BY(mu_);
|
||||
absl::flat_hash_map<ActorID, ClientQueue> client_queues_ GUARDED_BY(mu_);
|
||||
|
||||
/// Resolve direct call object dependencies;
|
||||
LocalDependencyResolver resolver_;
|
||||
@@ -164,12 +202,6 @@ class CoreWorkerDirectActorTaskSubmitter {
|
||||
/// Used to complete tasks.
|
||||
std::shared_ptr<TaskFinisherInterface> task_finisher_;
|
||||
|
||||
/// Timestamp when the caller is created.
|
||||
/// - if this worker is an actor, this is set to the time that the actor creation
|
||||
/// task starts execution;
|
||||
/// - otherwise, it's set to the time that the current task starts execution.
|
||||
int64_t caller_creation_timestamp_ms_ = 0;
|
||||
|
||||
friend class CoreWorkerTest;
|
||||
};
|
||||
|
||||
@@ -262,13 +294,6 @@ class BoundedExecutor {
|
||||
boost::asio::thread_pool pool_;
|
||||
};
|
||||
|
||||
struct SchedulingQueueTag {
|
||||
/// Worker ID for the caller.
|
||||
WorkerID caller_worker_id;
|
||||
/// Timestamp for the caller, which is used as a version.
|
||||
int64_t caller_creation_timestamp_ms = 0;
|
||||
};
|
||||
|
||||
/// Used to ensure serial order of task execution per actor handle.
|
||||
/// See direct_actor.proto for a description of the ordering protocol.
|
||||
class SchedulingQueue {
|
||||
@@ -475,9 +500,7 @@ class CoreWorkerDirectTaskReceiver {
|
||||
std::unique_ptr<DependencyWaiterImpl> waiter_;
|
||||
/// Queue of pending requests per actor handle.
|
||||
/// TODO(ekl) GC these queues once the handle is no longer active.
|
||||
std::unordered_map<TaskID,
|
||||
std::pair<SchedulingQueueTag, std::unique_ptr<SchedulingQueue>>>
|
||||
scheduling_queue_;
|
||||
std::unordered_map<WorkerID, SchedulingQueue> scheduling_queue_;
|
||||
};
|
||||
|
||||
} // namespace ray
|
||||
|
||||
@@ -39,8 +39,8 @@ Status CoreWorkerDirectTaskSubmitter::SubmitTask(TaskSpecification task_spec) {
|
||||
} else {
|
||||
RAY_LOG(ERROR) << "Failed to create actor " << actor_id
|
||||
<< " with: " << status.ToString();
|
||||
task_finisher_->PendingTaskFailed(
|
||||
task_id, rpc::ErrorType::ACTOR_CREATION_FAILED, &status);
|
||||
RAY_UNUSED(task_finisher_->PendingTaskFailed(
|
||||
task_id, rpc::ErrorType::ACTOR_CREATION_FAILED, &status));
|
||||
}
|
||||
}));
|
||||
return;
|
||||
@@ -70,8 +70,8 @@ Status CoreWorkerDirectTaskSubmitter::SubmitTask(TaskSpecification task_spec) {
|
||||
}
|
||||
}
|
||||
if (!keep_executing) {
|
||||
task_finisher_->PendingTaskFailed(task_spec.TaskId(),
|
||||
rpc::ErrorType::TASK_CANCELLED, nullptr);
|
||||
RAY_UNUSED(task_finisher_->PendingTaskFailed(
|
||||
task_spec.TaskId(), rpc::ErrorType::TASK_CANCELLED, nullptr));
|
||||
}
|
||||
});
|
||||
return Status::OK();
|
||||
@@ -256,7 +256,6 @@ void CoreWorkerDirectTaskSubmitter::PushNormalTask(
|
||||
// NOTE(swang): CopyFrom is needed because if we use Swap here and the task
|
||||
// fails, then the task data will be gone when the TaskManager attempts to
|
||||
// access the task.
|
||||
request->mutable_caller_address()->CopyFrom(rpc_address_);
|
||||
request->mutable_task_spec()->CopyFrom(task_spec.GetMessage());
|
||||
request->mutable_resource_mapping()->CopyFrom(assigned_resources);
|
||||
request->set_intended_worker_id(addr.worker_id.Binary());
|
||||
@@ -284,10 +283,10 @@ void CoreWorkerDirectTaskSubmitter::PushNormalTask(
|
||||
// failure (e.g., by contacting the raylet). If it was a process
|
||||
// failure, it may have been an application-level error and it may
|
||||
// not make sense to retry the task.
|
||||
task_finisher_->PendingTaskFailed(
|
||||
RAY_UNUSED(task_finisher_->PendingTaskFailed(
|
||||
task_id,
|
||||
is_actor ? rpc::ErrorType::ACTOR_DIED : rpc::ErrorType::WORKER_DIED,
|
||||
&status);
|
||||
&status));
|
||||
} else {
|
||||
task_finisher_->CompletePendingTask(task_id, reply, addr.ToProto());
|
||||
}
|
||||
@@ -321,8 +320,8 @@ Status CoreWorkerDirectTaskSubmitter::CancelTask(TaskSpecification task_spec,
|
||||
task_queues_.erase(scheduling_key);
|
||||
CancelWorkerLeaseIfNeeded(scheduling_key);
|
||||
}
|
||||
task_finisher_->PendingTaskFailed(task_spec.TaskId(),
|
||||
rpc::ErrorType::TASK_CANCELLED);
|
||||
RAY_UNUSED(task_finisher_->PendingTaskFailed(task_spec.TaskId(),
|
||||
rpc::ErrorType::TASK_CANCELLED));
|
||||
return Status::OK();
|
||||
}
|
||||
}
|
||||
|
||||
@@ -50,6 +50,9 @@ message ActorHandle {
|
||||
|
||||
// An extension field that is used for storing app-language-specific data.
|
||||
bytes extension_data = 8;
|
||||
|
||||
// How many times tasks may be retried on this actor if the actor fails.
|
||||
int64 max_task_retries = 9;
|
||||
}
|
||||
|
||||
message AssignTaskRequest {
|
||||
@@ -89,30 +92,21 @@ message ReturnObject {
|
||||
message PushTaskRequest {
|
||||
// The ID of the worker this message is intended for.
|
||||
bytes intended_worker_id = 1;
|
||||
// Address of the caller.
|
||||
Address caller_address = 2;
|
||||
// The task to be pushed.
|
||||
TaskSpec task_spec = 3;
|
||||
TaskSpec task_spec = 2;
|
||||
// The sequence number of the task for this client. This must increase
|
||||
// sequentially starting from zero for each actor handle. The server
|
||||
// will guarantee tasks execute in this sequence, waiting for any
|
||||
// out-of-order request messages to arrive as necessary.
|
||||
// If set to -1, ordering is disabled and the task executes immediately.
|
||||
int64 sequence_number = 4;
|
||||
int64 sequence_number = 3;
|
||||
// The max sequence number the client has processed responses for. This
|
||||
// is a performance optimization that allows the client to tell the server
|
||||
// to cancel any PushTaskRequests with seqno <= this value, rather than
|
||||
// waiting for the server to time out waiting for missing messages.
|
||||
int64 client_processed_up_to = 5;
|
||||
int64 client_processed_up_to = 4;
|
||||
// Resource mapping ids assigned to the worker executing the task.
|
||||
repeated ResourceMapEntry resource_mapping = 6;
|
||||
// The version of the caller. This is used to distinguish on-the-fly
|
||||
// requests from a caller before it die, and requests from the restarted
|
||||
// caller, which might happen theoretically when network has issues.
|
||||
// - For an actor, this is set to the timestamp when the actor is created,
|
||||
// so it can be used to differentiate which is the newly restarted actor.
|
||||
// - For a non-actor task, it's set to the timestamp the task starts execution.
|
||||
int64 caller_version = 7;
|
||||
repeated ResourceMapEntry resource_mapping = 5;
|
||||
}
|
||||
|
||||
message PushTaskReply {
|
||||
|
||||
@@ -112,9 +112,12 @@ class CoreWorkerClientInterface {
|
||||
/// Push an actor task directly from worker to worker.
|
||||
///
|
||||
/// \param[in] request The request message.
|
||||
/// \param[in] skip_queue Whether to skip the task queue. This will send the
|
||||
/// task for execution immediately.
|
||||
/// \param[in] callback The callback function that handles reply.
|
||||
/// \return if the rpc call succeeds
|
||||
virtual ray::Status PushActorTask(std::unique_ptr<PushTaskRequest> request,
|
||||
bool skip_queue,
|
||||
const ClientCallback<PushTaskReply> &callback) {
|
||||
return Status::NotImplemented("");
|
||||
}
|
||||
@@ -244,16 +247,19 @@ class CoreWorkerClient : public std::enable_shared_from_this<CoreWorkerClient>,
|
||||
|
||||
RPC_CLIENT_METHOD(CoreWorkerService, PlasmaObjectReady, grpc_client_, override)
|
||||
|
||||
ray::Status PushActorTask(std::unique_ptr<PushTaskRequest> request,
|
||||
ray::Status PushActorTask(std::unique_ptr<PushTaskRequest> request, bool skip_queue,
|
||||
const ClientCallback<PushTaskReply> &callback) override {
|
||||
request->set_sequence_number(request->task_spec().actor_task_spec().actor_counter());
|
||||
if (skip_queue) {
|
||||
// Set this value so that the actor does not skip any tasks when
|
||||
// processing this request. We could also set it to max_finished_seq_no_,
|
||||
// but we just set it to the default of -1 to avoid taking the lock.
|
||||
request->set_client_processed_up_to(-1);
|
||||
return INVOKE_RPC_CALL(CoreWorkerService, PushTask, *request, callback,
|
||||
grpc_client_);
|
||||
}
|
||||
|
||||
{
|
||||
std::lock_guard<std::mutex> lock(mutex_);
|
||||
if (request->task_spec().caller_id() != cur_caller_id_) {
|
||||
// We are running a new task, reset the seq no counter.
|
||||
max_finished_seq_no_ = -1;
|
||||
cur_caller_id_ = request->task_spec().caller_id();
|
||||
}
|
||||
send_queue_.push_back(std::make_pair(std::move(request), callback));
|
||||
}
|
||||
SendRequests();
|
||||
@@ -331,10 +337,6 @@ class CoreWorkerClient : public std::enable_shared_from_this<CoreWorkerClient>,
|
||||
|
||||
/// The max sequence number we have processed responses for.
|
||||
int64_t max_finished_seq_no_ GUARDED_BY(mutex_) = -1;
|
||||
|
||||
/// The task id we are currently sending requests for. When this changes,
|
||||
/// the max finished seq no counter is reset.
|
||||
std::string cur_caller_id_;
|
||||
};
|
||||
|
||||
} // namespace rpc
|
||||
|
||||
@@ -275,8 +275,9 @@ class StreamingQueueTestBase : public ::testing::TestWithParam<uint64_t> {
|
||||
std::string name = "";
|
||||
ActorCreationOptions actor_options{
|
||||
max_restarts,
|
||||
/*max_concurrency=*/1, resources, resources, {},
|
||||
/*is_detached=*/false, name, /*is_asyncio=*/false};
|
||||
/*max_task_retries=*/0,
|
||||
/*max_concurrency=*/1, resources, resources, {},
|
||||
/*is_detached=*/false, name, /*is_asyncio=*/false};
|
||||
// Create an actor.
|
||||
ActorID actor_id;
|
||||
RAY_CHECK_OK(CoreWorkerProcess::GetCoreWorker().CreateActor(
|
||||
|
||||
Reference in New Issue
Block a user