Commit Graph

1761 Commits

Author SHA1 Message Date
SangBin Cho edbb2937d3 [Object Spilling] Multi node file spilling V2. (#13542)
* done.

* done.

* Fix a mistake.

* Ready.

* Fix issues.

* fix.

* Finished the first round of code review.

* formatting.

* In progress.

* Formatting.

* Addressed code review.

* Formatting

* Fix tests.

* fix bugs.

* Skip flaky tests for now.
2021-01-23 23:15:32 -08:00
Qing Wang 8ef835ff03 Remove idle actor from worker pool. (#13523) 2021-01-23 13:57:30 +08:00
Kai Yang 90f1e408de [Java] Add fetchLocal parameter in Ray.wait() (#13604) 2021-01-22 17:55:00 +08:00
Stephanie Wang 0998d69968 [core] Admission control for pulling objects to the local node (#13514)
* Admission control, TODO: tests, object size

* Unit tests for admission control and some bug fixes

* Add object size to object table, only activate pull if object size is known

* Some fixes, reset timer on eviction

* doc

* update

* Trigger OOM from the pull manager

* don't spam

* doc

* Update src/ray/object_manager/pull_manager.cc

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Remove useless tests

* Fix test

* osx build

* Skip broken test

* tests

* Skip failing tests

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2021-01-21 16:46:42 -08:00
Amog Kamsetty 20acc3b05e Revert "Inline small objects in GetObjectStatus response. (#13309)" (#13615)
This reverts commit a82fa80f7b.
2021-01-21 16:10:34 -08:00
Clark Zinzow a82fa80f7b Inline small objects in GetObjectStatus response. (#13309) 2021-01-21 09:15:18 -08:00
Siyuan (Ryans) Zhuang a09997dc9e [Core] Remove 'PlasmaBuffer' in the buffer header (#13188) 2021-01-20 12:01:44 -08:00
ZhuSenlin 2e7c2b774f [Core] add thread name to help performance profiling (#13506) 2021-01-20 20:34:28 +08:00
Tao Wang b2a6e55289 [GCS]Only publish fields used by sub clients in WorkerTableData (#13508) 2021-01-20 16:14:59 +08:00
Keqiu Hu 6c9088eb62 [core] refactor disconnect message processing and enrich WorkExitType (#13527)
* [core] refactor disconnect message processing and enrich WorkExitType

add changes from refactor pr

fix type typo

fix typo

fix

* address comments

* also update WorkerTableData

* fix tests
2021-01-19 22:09:46 -08:00
SangBin Cho e544c008df Fix restoration request dedup issues. (#13546) 2021-01-19 15:28:54 -08:00
Stephanie Wang bfe147a6a8 Debug info to GCS pub sub (#13564) 2021-01-19 14:55:23 -08:00
SangBin Cho 99375c4cfc [Object Spilling] Remove retries and use a timer instead. (#13175) 2021-01-19 11:01:45 -08:00
fyrestone 86d5000047 Fix passing env on windows (#13253) 2021-01-19 10:04:38 -06:00
Tao Wang 516eb77080 [GCS] Remove task info publish as nowhere uses it (#13509)
* Remove task info publish as nowhere uses it

* simplify right publish channel
2021-01-18 01:15:03 -08:00
Tao Wang 3a0710130c [GCS]Only publish changed field when node dead (#13364)
* Only update changed field when node dead

* node_id missed
2021-01-17 21:28:35 -08:00
ZhuSenlin a4ebdbd7da Refactor node manager to eliminate new_scheduler_enabled_ (#12936) 2021-01-18 00:15:35 +08:00
ZhuSenlin 2cd51ce608 sync write internal config in gcs (#13197) 2021-01-17 12:00:01 +08:00
Eric Liang ee6332dbb0 Bump dev branch to 2.0 to avoid endless version bump toil (#13497)
* wip

* fix

* fix
2021-01-15 17:41:17 -08:00
SangBin Cho d09df55b14 Update ID specification doc (#13356) 2021-01-15 15:15:51 -08:00
Eric Liang 4aeb0ea550 Return version info from Ray client connect, to allow for discovering version mismatches 2021-01-15 14:27:26 -08:00
SangBin Cho f6d9996874 [Object Spilling] Dedup restore objects (#13470)
* done.

* Addressed code review.
2021-01-14 23:51:11 -08:00
fangfengbin ce1b208e41 [GCS]Remove unused class variable (#13454) 2021-01-15 14:48:18 +08:00
Barak Michener 84e110a949 [ray_client]: Support runtime_context as metadata (#13428) 2021-01-14 14:37:00 -08:00
Clark Zinzow 9a658b568f [Core] Ownership-based Object Directory: Consolidate location table and reference table. (#13220)
* Added owned object reference before Plasma put on Create() + Seal() path.

* Consolidated location table and reference table in reference counter.

* Restore type in definition.

* Clean up owned reference on failed Seal().

* Added RemoveOwnedObject test for reference counter.

* Guard against ref going out of scope before location RPCs.

* Add 'owner must have ref in scope' precondition to documentation for object location methods.

* Move to separate Create() + Seal() methods for existing objects.

* Clearer distinction between Create() and Seal() methods.

* Make it clear that references will normally be cleaned up by reference counting.
2021-01-14 13:48:10 -08:00
fangfengbin 4a6c53da46 [Core]Fix raylet scheduling bug (#13452)
* [Core]Fix raylet scheduling bug

* fix lint error

* fix lint error

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2021-01-14 14:50:32 +01:00
fangfengbin 33b092de28 [GCS]Add gcs resource scheduler (#13072) 2021-01-14 20:05:55 +08:00
Kai Fricke b296642646 Fix linter error (#13451) 2021-01-14 10:28:44 +01:00
fyrestone 8697d67791 Fix raylet::MockWorker::GetProcess crashes (#13440)
Co-authored-by: 刘宝 <po.lb@antfin.com>
2021-01-14 12:19:21 +08:00
Tao Wang 062b7efc93 Remove unused handler methods (#13394) 2021-01-14 10:51:31 +08:00
fyrestone 4853aa96cb [Dashboard] Fix missing actor pid (#13229) 2021-01-13 16:45:12 +08:00
Tao Wang f587b9a50c Remove unimplemented GetAll method in actor info accessor (#13362) 2021-01-13 09:55:27 +08:00
Eric Liang 470fda190a Forgot overwrite parameter in Ray client internal kv 2021-01-11 17:50:06 -08:00
Eric Liang de5bc24c60 Implement internal kv in ray client (#13344)
* kv internal

* fix
2021-01-11 14:54:52 -08:00
Eric Liang fbb9795374 [client] Report number of currently active clients on connect (#13326)
* wip

* update

* update

* reset worker

* fix conn

* fix

* disable pycodestyle
2021-01-11 14:53:12 -08:00
ZhuSenlin c39658f368 fix removal of task dependencies (#13333)
Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>
2021-01-11 09:55:48 -08:00
Alex Wu 6ca4fb1054 [Pull manager] Only pull once per retry period (#13245)
* .

* docs

* cleanup

* .

* .

* .

* .

Co-authored-by: Alex <alex@anyscale.com>
2021-01-08 14:51:11 -08:00
Hao Chen 77cd0d5a21 Fix a crash problem caused by GetActorHandle in ActorManager (#13164) 2021-01-08 12:11:08 +08:00
Tao Wang ab2229dcb7 [GCS] Remove old lightweight resource usage report code path (#13192) 2021-01-08 10:30:00 +08:00
Tao Wang 82c54c67ee Publish job/worker info with Hex format instead of Binary (#13235) 2021-01-07 20:31:58 +08:00
fangfengbin 3669c02821 [GCS]Add gcs actor schedule strategy (#13156) 2021-01-07 15:44:33 +08:00
fangfengbin 9ae5bba7cf [GCS]Fix gcs table storage GetAll and GetByJobId api bug (#13195) 2021-01-07 10:37:00 +08:00
Siyuan (Ryans) Zhuang 02ae6c5a9a [Core] Fix incorrect comment (#13228) 2021-01-06 11:37:29 -08:00
Lingxuan Zuo 01d4638b49 [Log] fix spdlog init race (#12973)
* fix spdlog init race

* use global logger

* refine logger name and constructor
2021-01-06 11:02:54 -08:00
dHannasch 695833082d [Redis] Note that each Redis Connect retry takes two minutes (#12183)
* Slightly alter error message so it's the same in both cases.

* Each retry takes about two minutes.
2021-01-06 11:00:58 -08:00
SangBin Cho 32dc5676b4 [Metrics] Record per node and raylet cpu / mem usage (#12982)
* Record per node and raylet cpu / mem usage

* Add comments.

* Addressed code review.
2021-01-05 21:57:21 -08:00
fangfengbin 779b3876f6 [GCS]Fix TestActorSubscribeAll bug (#13193) 2021-01-06 13:52:39 +08:00
fangfengbin dd14e5a3b3 [BugFix][GCS]Fix gcs_actor_manager_test multithreading bug (#13158) 2021-01-06 10:47:06 +08:00
Tao Wang a0bbf2bfc2 Notify listeners after registered node stored (#13069) 2021-01-05 11:18:03 +08:00
fangfengbin 88eaa87e3a Remove unused file(object_manager_integration_test.cc) (#12989) 2021-01-05 11:09:36 +08:00