Commit Graph

1725 Commits

Author SHA1 Message Date
Alex Wu 6ca4fb1054 [Pull manager] Only pull once per retry period (#13245)
* .

* docs

* cleanup

* .

* .

* .

* .

Co-authored-by: Alex <alex@anyscale.com>
2021-01-08 14:51:11 -08:00
Hao Chen 77cd0d5a21 Fix a crash problem caused by GetActorHandle in ActorManager (#13164) 2021-01-08 12:11:08 +08:00
Tao Wang ab2229dcb7 [GCS] Remove old lightweight resource usage report code path (#13192) 2021-01-08 10:30:00 +08:00
Tao Wang 82c54c67ee Publish job/worker info with Hex format instead of Binary (#13235) 2021-01-07 20:31:58 +08:00
fangfengbin 3669c02821 [GCS]Add gcs actor schedule strategy (#13156) 2021-01-07 15:44:33 +08:00
fangfengbin 9ae5bba7cf [GCS]Fix gcs table storage GetAll and GetByJobId api bug (#13195) 2021-01-07 10:37:00 +08:00
Siyuan (Ryans) Zhuang 02ae6c5a9a [Core] Fix incorrect comment (#13228) 2021-01-06 11:37:29 -08:00
Lingxuan Zuo 01d4638b49 [Log] fix spdlog init race (#12973)
* fix spdlog init race

* use global logger

* refine logger name and constructor
2021-01-06 11:02:54 -08:00
dHannasch 695833082d [Redis] Note that each Redis Connect retry takes two minutes (#12183)
* Slightly alter error message so it's the same in both cases.

* Each retry takes about two minutes.
2021-01-06 11:00:58 -08:00
SangBin Cho 32dc5676b4 [Metrics] Record per node and raylet cpu / mem usage (#12982)
* Record per node and raylet cpu / mem usage

* Add comments.

* Addressed code review.
2021-01-05 21:57:21 -08:00
fangfengbin 779b3876f6 [GCS]Fix TestActorSubscribeAll bug (#13193) 2021-01-06 13:52:39 +08:00
fangfengbin dd14e5a3b3 [BugFix][GCS]Fix gcs_actor_manager_test multithreading bug (#13158) 2021-01-06 10:47:06 +08:00
Tao Wang a0bbf2bfc2 Notify listeners after registered node stored (#13069) 2021-01-05 11:18:03 +08:00
fangfengbin 88eaa87e3a Remove unused file(object_manager_integration_test.cc) (#12989) 2021-01-05 11:09:36 +08:00
Eric Liang dfb326d4b5 Surface object store spilling statistics in ray memory (#13124) 2021-01-04 17:35:39 -08:00
Stephanie Wang b765914a1b Revert "Enabling the cancellation of non-actor tasks in a worker's queue (#12117)" (#13178)
This reverts commit b4d688b4a6.
2021-01-04 17:27:48 -08:00
Siyuan (Ryans) Zhuang 46cf433f0e [Core] Remove Arrow dependencies (#13157)
* remove arrow ubsan

* remove arrow build depend

* remove arrow buffer
2021-01-04 11:19:09 -08:00
Gabriele Oliaro b4d688b4a6 Enabling the cancellation of non-actor tasks in a worker's queue (#12117)
* wrote code to enable cancellation of queued non-actor tasks

* minor changes

* bug fixes

* added comments

* rev1

* linting

* making ActorSchedulingQueue::CancelTaskIfFound raise a fatal error

* bug fix

* added two unit tests

* linting

* iterating through pending_normal_tasks starting from end

* fixup! iterating through pending_normal_tasks starting from end

* fixup! fixup! iterating through pending_normal_tasks starting from end

* post merge fixes

* added debugging instructions, pulled Accept() out of guarded loop

* removed debugging instructions, linting
2021-01-04 09:52:29 -08:00
Clark Zinzow c2bff64699 [Core] Locality-aware leasing: Milestone 1 - Owned refs, pinned location (#12817)
* Locality-aware leasing for owned refs (pinned locations).

* LessorPicker --> LeasePolicy.

* Consolidate GetBestNodeIdForTask and GetBestNodeIdForObjects.

* Update comments.

* Turn on locality-aware leasing feature flag by default.

* Move local fallback logic to LeasePolicy, move feature flag check to CoreWorker constructor, add local-only lease policy.

* Add lease policy consulting assertions to the direct task submitter tests.

* Add lease policy tests.

* LocalityLeasePolicy --> LocalityAwareLeasePolicy.

* Add missing const declarations.

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* Add RAY_CHECK for raylet address nullptr when creating lease client.

* Make the fact that LocalLeasePolicy always returns the local node more explicit.

* Flatten GetLocalityData conditionals to make it more readable.

* Add ReferenceCounter::GetLocalityData() unit test.

* Add data-intensive microbenchmarks for single-node perf testing.

* Add data-intensive microbenchmarks for simulated cluster perf testing.

* Remove redundant comment.

* Remove data-intensive benchmarks.

* Add locality-aware leasing Python test.

* Formatting changes in ray_perf.py.

Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2021-01-04 09:49:08 -08:00
fangfengbin 25f9f0d781 [GCS] Move resource usage info to gcs resource manager (#13059) 2020-12-25 15:17:45 +08:00
Siyuan (Ryans) Zhuang cf9952a028 [Core] Remove outdated external store (#13080)
* remove outdated external store
2020-12-24 17:30:06 -08:00
Siyuan (Ryans) Zhuang bf7f6a7de3 [Core] Remove cuda support in plasma store (#13070)
* remove cuda support in plasma store
2020-12-24 13:24:56 -08:00
Stephanie Wang 4461f9980a Refactor TaskDependencyManager, allow passing bundles of objects to ObjectManager (#13006)
* New dependency manager

* Switch raylet to new DependencyManager

* PullManager accepts bundles

* Cleanup, remove old task dependency manager

* x

* PullManager unit tests

* lint

* Unit tests

* Rename

* lint

* test

* Update src/ray/raylet/dependency_manager.cc

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* Update src/ray/raylet/dependency_manager.cc

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* x

* lint

Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2020-12-23 18:36:00 -08:00
Stephanie Wang d95c8b8a41 [core][new scheduler] Move tasks from ready to dispatch to waiting on argument eviction (#13048)
* Add index for tasks to dispatch

* Task dependency manager interface

* Unsubscribe dependencies and tests

* NodeManager

* Revert "Add index for tasks to dispatch"

This reverts commit c6ccb9aa306e00f80d34b991055e4e83872595ea.

* tmp

* Move back to waiting if args not ready

* update
2020-12-23 09:33:43 -08:00
DK.Pino 6e19facc7f [GCS] Delete redis gcs client and redis_xxx_accessor (#12996) 2020-12-23 20:31:46 +08:00
fangfengbin 646c4201ac [GCS]Decouple gcs resource manager and gcs node manager (#13012) 2020-12-23 11:25:01 +08:00
fyrestone 62a5832007 [Dashboard] Add GET /logical/actors API (#12913) 2020-12-23 11:14:23 +08:00
Alex Wu ea8d782be1 [core] Pull Manager exponential backoff (#13024) 2020-12-21 19:17:51 -08:00
Eric Liang 8068041006 Don't release resources during plasma fetch (#13025) 2020-12-21 18:32:40 -08:00
Eric Liang 03a5b90ed6 Revert "Revert "Increase the number of unique bits for actors to avoi… (#12990) 2020-12-21 15:16:42 -08:00
Kai Yang 5a6801dde7 [Core] Remove delete_creating_tasks (#12962) 2020-12-22 00:01:27 +08:00
fangfengbin 85a4435ba0 [GCS]Fix redis store client AsyncPutWithIndex unordered bug (#13002) 2020-12-21 20:02:50 +08:00
Barak Michener c576f0b073 [ray_client] Implement a gRPC streaming logs API for the client (#13001) 2020-12-20 19:35:34 -08:00
fangfengbin 4caa6c6d78 [GCS]GCS resource manager remove cluster_resources_ (#12972) 2020-12-21 11:00:25 +08:00
Barak Michener e715ade2d1 Support retrieval of named actor handles (#13000)
Change-Id: I05d31c9c67943d2a0230782cbdaa98341584cbc7
2020-12-20 16:34:50 -08:00
Barak Michener 80f6dd16b2 [ray_client] Implement optional arguments to ray.remote() and f.options() (#12985) 2020-12-20 15:43:48 -08:00
Barak Michener 7ab9164f1b [ray_client] Integrate with test_basic, test_basic_2 and test_actor (#12964) 2020-12-20 14:54:18 -08:00
fangfengbin 3fab93b61b Fix scheduling_resources comment errors (#12991)
* Fix scheduling_resources comment error

* add part code

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-20 20:20:07 +08:00
Eric Liang 64c97d25d3 Enable by default new scheduler (#12735) 2020-12-19 13:22:24 -08:00
Eric Liang 5d987f5988 Revert "Increase the number of unique bits for actors to avoid handle collisions (#12894)" (#12988)
This reverts commit 3e492a79ec.
2020-12-18 23:51:44 -08:00
dHannasch a092433bc8 [core] Use the ConnectWithoutRetries error message (#12732) 2020-12-18 22:34:34 -08:00
SangBin Cho 9d939e6674 [Object Spilling] Implement level triggered logic to make streaming shuffle work + additional cleanup (#12773) 2020-12-18 19:31:14 -08:00
Alex Wu 404161a3ff [Autoscaler/Core] Remove autoscaler spam (#12952) 2020-12-18 18:22:45 -08:00
Kai Yang ac5ea2c13d [Java] Fix output parsing in RunManager (#12968)
* Fix output parsing in RunManager

* change log level

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-18 18:22:12 -08:00
Eric Liang 6ece291f35 Clean up block/unblock handling of resources in new scheduler (#12963) 2020-12-18 16:00:54 -08:00
Eric Liang 3e492a79ec Increase the number of unique bits for actors to avoid handle collisions (#12894) 2020-12-18 15:59:03 -08:00
Eric Liang 92812f2e8a Implement resource deadlock detection for new scheduler (#12961) 2020-12-18 12:17:54 -08:00
Barak Michener 5cfa1934e4 [ray_client]: Implement object retain/release and Data Streaming API (#12818) 2020-12-18 11:47:38 -08:00
fangfengbin a442cd17e0 [GCS]Optimize gcs client reconnection (#12878)
* [GCS]Optimize gcs client reconnection

* fix review comment

* fix review comment

* add part code

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-17 21:57:37 -08:00
dHannasch cfefd7c70e Test PingPort (#12954)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-12-17 21:15:42 -08:00