1747 Commits

Author SHA1 Message Date
Tao Wang 516eb77080 [GCS] Remove task info publish as nowhere uses it (#13509)
* Remove task info publish as nowhere uses it

* simplify right publish channel
2021-01-18 01:15:03 -08:00
Tao Wang 3a0710130c [GCS]Only publish changed field when node dead (#13364)
* Only update changed field when node dead

* node_id missed
2021-01-17 21:28:35 -08:00
ZhuSenlin a4ebdbd7da Refactor node manager to eliminate new_scheduler_enabled_ (#12936) 2021-01-18 00:15:35 +08:00
ZhuSenlin 2cd51ce608 sync write internal config in gcs (#13197) 2021-01-17 12:00:01 +08:00
Eric Liang ee6332dbb0 Bump dev branch to 2.0 to avoid endless version bump toil (#13497)
* wip

* fix

* fix
2021-01-15 17:41:17 -08:00
SangBin Cho d09df55b14 Update ID specification doc (#13356) 2021-01-15 15:15:51 -08:00
Eric Liang 4aeb0ea550 Return version info from Ray client connect, to allow for discovering version mismatches 2021-01-15 14:27:26 -08:00
SangBin Cho f6d9996874 [Object Spilling] Dedup restore objects (#13470)
* done.

* Addressed code review.
2021-01-14 23:51:11 -08:00
fangfengbin ce1b208e41 [GCS]Remove unused class variable (#13454) 2021-01-15 14:48:18 +08:00
Barak Michener 84e110a949 [ray_client]: Support runtime_context as metadata (#13428) 2021-01-14 14:37:00 -08:00
Clark Zinzow 9a658b568f [Core] Ownership-based Object Directory: Consolidate location table and reference table. (#13220)
* Added owned object reference before Plasma put on Create() + Seal() path.

* Consolidated location table and reference table in reference counter.

* Restore type in definition.

* Clean up owned reference on failed Seal().

* Added RemoveOwnedObject test for reference counter.

* Guard against ref going out of scope before location RPCs.

* Add 'owner must have ref in scope' precondition to documentation for object location methods.

* Move to separate Create() + Seal() methods for existing objects.

* Clearer distinction between Create() and Seal() methods.

* Make it clear that references will normally be cleaned up by reference counting.
2021-01-14 13:48:10 -08:00
fangfengbin 4a6c53da46 [Core]Fix raylet scheduling bug (#13452)
* [Core]Fix raylet scheduling bug

* fix lint error

* fix lint error

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2021-01-14 14:50:32 +01:00
fangfengbin 33b092de28 [GCS]Add gcs resource scheduler (#13072) 2021-01-14 20:05:55 +08:00
Kai Fricke b296642646 Fix linter error (#13451) 2021-01-14 10:28:44 +01:00
fyrestone 8697d67791 Fix raylet::MockWorker::GetProcess crashes (#13440)
Co-authored-by: 刘宝 <po.lb@antfin.com>
2021-01-14 12:19:21 +08:00
Tao Wang 062b7efc93 Remove unused handler methods (#13394) 2021-01-14 10:51:31 +08:00
fyrestone 4853aa96cb [Dashboard] Fix missing actor pid (#13229) 2021-01-13 16:45:12 +08:00
Tao Wang f587b9a50c Remove unimplemented GetAll method in actor info accessor (#13362) 2021-01-13 09:55:27 +08:00
Eric Liang 470fda190a Forgot overwrite parameter in Ray client internal kv 2021-01-11 17:50:06 -08:00
Eric Liang de5bc24c60 Implement internal kv in ray client (#13344)
* kv internal

* fix
2021-01-11 14:54:52 -08:00
Eric Liang fbb9795374 [client] Report number of currently active clients on connect (#13326)
* wip

* update

* update

* reset worker

* fix conn

* fix

* disable pycodestyle
2021-01-11 14:53:12 -08:00
ZhuSenlin c39658f368 fix removal of task dependencies (#13333)
Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>
2021-01-11 09:55:48 -08:00
Alex Wu 6ca4fb1054 [Pull manager] Only pull once per retry period (#13245)
* .

* docs

* cleanup

* .

* .

* .

* .

Co-authored-by: Alex <alex@anyscale.com>
2021-01-08 14:51:11 -08:00
Hao Chen 77cd0d5a21 Fix a crash problem caused by GetActorHandle in ActorManager (#13164) 2021-01-08 12:11:08 +08:00
Tao Wang ab2229dcb7 [GCS] Remove old lightweight resource usage report code path (#13192) 2021-01-08 10:30:00 +08:00
Tao Wang 82c54c67ee Publish job/worker info with Hex format instead of Binary (#13235) 2021-01-07 20:31:58 +08:00
fangfengbin 3669c02821 [GCS]Add gcs actor schedule strategy (#13156) 2021-01-07 15:44:33 +08:00
fangfengbin 9ae5bba7cf [GCS]Fix gcs table storage GetAll and GetByJobId api bug (#13195) 2021-01-07 10:37:00 +08:00
Siyuan (Ryans) Zhuang 02ae6c5a9a [Core] Fix incorrect comment (#13228) 2021-01-06 11:37:29 -08:00
Lingxuan Zuo 01d4638b49 [Log] fix spdlog init race (#12973)
* fix spdlog init race

* use global logger

* refine logger name and constructor
2021-01-06 11:02:54 -08:00
dHannasch 695833082d [Redis] Note that each Redis Connect retry takes two minutes (#12183)
* Slightly alter error message so it's the same in both cases.

* Each retry takes about two minutes.
2021-01-06 11:00:58 -08:00
SangBin Cho 32dc5676b4 [Metrics] Record per node and raylet cpu / mem usage (#12982)
* Record per node and raylet cpu / mem usage

* Add comments.

* Addressed code review.
2021-01-05 21:57:21 -08:00
fangfengbin 779b3876f6 [GCS]Fix TestActorSubscribeAll bug (#13193) 2021-01-06 13:52:39 +08:00
fangfengbin dd14e5a3b3 [BugFix][GCS]Fix gcs_actor_manager_test multithreading bug (#13158) 2021-01-06 10:47:06 +08:00
Tao Wang a0bbf2bfc2 Notify listeners after registered node stored (#13069) 2021-01-05 11:18:03 +08:00
fangfengbin 88eaa87e3a Remove unused file(object_manager_integration_test.cc) (#12989) 2021-01-05 11:09:36 +08:00
Eric Liang dfb326d4b5 Surface object store spilling statistics in ray memory (#13124) 2021-01-04 17:35:39 -08:00
Stephanie Wang b765914a1b Revert "Enabling the cancellation of non-actor tasks in a worker's queue (#12117)" (#13178)
This reverts commit b4d688b4a6.
2021-01-04 17:27:48 -08:00
Siyuan (Ryans) Zhuang 46cf433f0e [Core] Remove Arrow dependencies (#13157)
* remove arrow ubsan

* remove arrow build depend

* remove arrow buffer
2021-01-04 11:19:09 -08:00
Gabriele Oliaro b4d688b4a6 Enabling the cancellation of non-actor tasks in a worker's queue (#12117)
* wrote code to enable cancellation of queued non-actor tasks

* minor changes

* bug fixes

* added comments

* rev1

* linting

* making ActorSchedulingQueue::CancelTaskIfFound raise a fatal error

* bug fix

* added two unit tests

* linting

* iterating through pending_normal_tasks starting from end

* fixup! iterating through pending_normal_tasks starting from end

* fixup! fixup! iterating through pending_normal_tasks starting from end

* post merge fixes

* added debugging instructions, pulled Accept() out of guarded loop

* removed debugging instructions, linting
2021-01-04 09:52:29 -08:00
Clark Zinzow c2bff64699 [Core] Locality-aware leasing: Milestone 1 - Owned refs, pinned location (#12817)
* Locality-aware leasing for owned refs (pinned locations).

* LessorPicker --> LeasePolicy.

* Consolidate GetBestNodeIdForTask and GetBestNodeIdForObjects.

* Update comments.

* Turn on locality-aware leasing feature flag by default.

* Move local fallback logic to LeasePolicy, move feature flag check to CoreWorker constructor, add local-only lease policy.

* Add lease policy consulting assertions to the direct task submitter tests.

* Add lease policy tests.

* LocalityLeasePolicy --> LocalityAwareLeasePolicy.

* Add missing const declarations.

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* Add RAY_CHECK for raylet address nullptr when creating lease client.

* Make the fact that LocalLeasePolicy always returns the local node more explicit.

* Flatten GetLocalityData conditionals to make it more readable.

* Add ReferenceCounter::GetLocalityData() unit test.

* Add data-intensive microbenchmarks for single-node perf testing.

* Add data-intensive microbenchmarks for simulated cluster perf testing.

* Remove redundant comment.

* Remove data-intensive benchmarks.

* Add locality-aware leasing Python test.

* Formatting changes in ray_perf.py.

Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2021-01-04 09:49:08 -08:00
fangfengbin 25f9f0d781 [GCS] Move resource usage info to gcs resource manager (#13059) 2020-12-25 15:17:45 +08:00
Siyuan (Ryans) Zhuang cf9952a028 [Core] Remove outdated external store (#13080)
* remove outdated external store
2020-12-24 17:30:06 -08:00
Siyuan (Ryans) Zhuang bf7f6a7de3 [Core] Remove cuda support in plasma store (#13070)
* remove cuda support in plasma store
2020-12-24 13:24:56 -08:00
Stephanie Wang 4461f9980a Refactor TaskDependencyManager, allow passing bundles of objects to ObjectManager (#13006)
* New dependency manager

* Switch raylet to new DependencyManager

* PullManager accepts bundles

* Cleanup, remove old task dependency manager

* x

* PullManager unit tests

* lint

* Unit tests

* Rename

* lint

* test

* Update src/ray/raylet/dependency_manager.cc

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* Update src/ray/raylet/dependency_manager.cc

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* x

* lint

Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2020-12-23 18:36:00 -08:00
Stephanie Wang d95c8b8a41 [core][new scheduler] Move tasks from ready to dispatch to waiting on argument eviction (#13048)
* Add index for tasks to dispatch

* Task dependency manager interface

* Unsubscribe dependencies and tests

* NodeManager

* Revert "Add index for tasks to dispatch"

This reverts commit c6ccb9aa306e00f80d34b991055e4e83872595ea.

* tmp

* Move back to waiting if args not ready

* update
2020-12-23 09:33:43 -08:00
DK.Pino 6e19facc7f [GCS] Delete redis gcs client and redis_xxx_accessor (#12996) 2020-12-23 20:31:46 +08:00
fangfengbin 646c4201ac [GCS]Decouple gcs resource manager and gcs node manager (#13012) 2020-12-23 11:25:01 +08:00
fyrestone 62a5832007 [Dashboard] Add GET /logical/actors API (#12913) 2020-12-23 11:14:23 +08:00
Alex Wu ea8d782be1 [core] Pull Manager exponential backoff (#13024) 2020-12-21 19:17:51 -08:00