Commit Graph

1655 Commits

Author SHA1 Message Date
fangfengbin c22990a537 [GCS]GCS node manager rename GetNode to GetAliveNode (#12781) 2020-12-12 20:34:43 +08:00
Alex Wu aa64cd4534 [New scheduler] Fix test_global_state (#12586) 2020-12-11 21:47:01 -08:00
Eric Liang 1ce745cf44 Add automatic local GC and plasma debug logs every 10 minutes by default (#12804) 2020-12-11 17:09:58 -08:00
Alex Wu 676ec363f6 [Object Manager] Pull Manager refactor (#12335) 2020-12-11 11:56:23 -08:00
Eric Liang 4ad4463be6 Add comments to clarify purpose of new scheduler queues (#12730)
* update

* clarify

* update
2020-12-11 11:53:09 -08:00
Tao Wang 295b6e5ce4 Split heartbeat message (#12535)
* first

* xxx

* Split heartbeat message

* only report resource usage when changed

* Fix GetAllResourceUsage

* Fix report resource usage

* Increase default heartbeat interval

* regularize heartbeat interval in test case
2020-12-11 21:19:57 +08:00
Stephanie Wang 86b0741026 [new scheduler] Allocate resources for spilled back task to a local view of the remote node (#12711)
* Force report heartbeats if remote resources may be dirty

* lint

* typo

* typo

* unit test

* debug

* Revert "lint"

This reverts commit 6dc7e982ffee98185665eb7c3c8fde0d91938919.

* Revert "Force report heartbeats if remote resources may be dirty"

This reverts commit cbfa9405197df62f874107d55b46715ceae2abd2.

* Local view of resources

* debug travis

* debug

* debug

* debug

* weaken test

* cleanups

* lint

* Revert "debug travis"

This reverts commit 11ff5f4f84e64e9fbd4eecda5b3c7fd07a7130a4.

* revert

* const view, remove unused
2020-12-10 22:43:29 -05:00
Barak Michener b7f246c451 [ray_client] Include multiple facets of the Ray API (#12736) 2020-12-10 19:09:34 -08:00
Edward Oakes 62d6b0a558 Fix max_task_retries for named actors (#12762) 2020-12-10 18:24:55 -06:00
Kai Yang e3b5deb741 [Multi-tenancy] Delete flag enable_multi_tenancy and remove old code path (#10573) 2020-12-10 19:01:40 +08:00
Stephanie Wang a776209aec Revert "Fix dashboard agent check ppid is raylet pid (#12256)" (#12729)
This reverts commit 3ce9286977.
2020-12-09 17:20:38 -05:00
dHannasch d455cae036 Add period to error message. (#12716) 2020-12-09 15:58:21 -06:00
Keqiu Hu ee012532fb [core] Use node manager client pool for GCS service #10398 (#12368)
* raylet client pool

* Fix merging conflict

* Fix documentation typo

* fix linting

* address comments

* fix typo

* remove unintended logging

* address comments

* fix bazel file lint error
2020-12-09 12:44:40 -08:00
Alex Wu 0b6e44efb8 [New scheduler] Cluster Resource Scheduler dynamic resources (for placement groups) (#12518)
* prepare implemented

* dynamic resources

* .

* commit

* .

* .

* Still needs to be cleaned up

* Passes basic tests + cleanup

* .

* .

* .

* Apply suggestions from code review

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* fix

* lint

Co-authored-by: Alex <alex@anyscale.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2020-12-09 12:05:31 -08:00
fangfengbin ef9ebbc636 [GCS]GCS based Actor Scheduling support actor colocation (#12707)
* [GCS]GCS based Actor Scheduling support actor colocation

* fix review comment

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-09 11:54:23 -08:00
fyrestone 3ce9286977 Fix dashboard agent check ppid is raylet pid (#12256)
* Dashboard agent check ppid is raylet pid

* Improve implementation

* Refine code

* Make the RAY_NODE_PID environment required for dashboard agent

Co-authored-by: 刘宝 <po.lb@antfin.com>
2020-12-09 09:12:34 -05:00
Stephanie Wang 840de49161 Fix race condition between failure detection and references going out of scope (#12573)
* fix

* lint

* fix initialization
2020-12-08 23:49:55 -08:00
Stephanie Wang 50f28811ac [new scheduler] Always spill back to a feasible node if the local node is not feasible (#12557)
* fix

lint

* feasible nodes

* Enable test, cleanup

* Revert "fix"

This reverts commit aef81d04c0b4560b758f846e1afdafbdb5552efe.

* unit test

* doc
2020-12-08 13:46:58 -05:00
fangfengbin 93c0eb249c [PlacementGroup]Support acquire and return bundle resource from gcs resource manager (#12349) 2020-12-08 10:29:57 +08:00
fangfengbin 7e1422e925 [PlacementGroup]Fix placement group strict spread bug when node dead (#12647)
* [PlacementGroup]Fix strict spread bug when node dead

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-07 21:50:28 +08:00
Philipp Moritz 73a1a232b9 Ray debugger stepping between tasks (#12075) 2020-12-06 21:50:18 -08:00
fangfengbin 260b07cf0c [PlacementGroup]Add PlacementGroup wait java api (#12499)
* add part code

* add part code

* add part code

* add part code

* fix review comments

* fix compile bug

* fix compile bug

* fix review comments

* fix review comments

* fix code style

* add part code

* fix review comments

* fix review comments

* fix code style

* rebase master

* fix bug

* fix lint error

* fix compile bug

* fix newline issue

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-05 16:40:04 +08:00
SangBin Cho 0138c2dbb4 [Metrics] Remove redundant unit specification. (#12595) 2020-12-04 00:06:21 -08:00
Kai Yang 21fcee28f9 [Java] Simplify Ray.init() by invoking ray start internally (#10762) 2020-12-04 14:33:45 +08:00
fangfengbin ff34563539 [PlacementGroup]Fix bug that kill workers mistakenly when gcs restarts (#12568) 2020-12-03 17:50:48 +08:00
Stephanie Wang 443339ab19 [core] Move out-of-memory handling into the plasma store and support async object creation (#12186)
* Refactor to extract creation request queue

* timer on oom

* move timer out

* Move evict_if_full and on_store_full into plasma store

* Remove client-side code

* revert

* Distinguish between transient and permanent OOM delays

* update

* Move out create request queue, unit test

* unit test

* Fix max retries

* test

* Do not pin restored objects

* First pass to add polling requests, unit test passes

* worker plasma client retries plasma requests

* cleanup

* Clean up after disconnected clients, check memory leaks

* Support immediate requests in request queue

* Option to try creating immediately

* lint

* Fix build, address comments

* doc

* fixes

* debug travis

* debug

* debug

* debug

* debug

* Revert "debug"

This reverts commit 6bf2f6ee5640e71630c4aecdb7ebf54911ea32db.

Revert "debug"

This reverts commit 73017099c9b06cdaae1217bf0e0f4d23ed68a9e5.

Revert "debug"

This reverts commit 5a155529e28cee9461a598b0cdf7b6a3cc194c93.

Revert "debug"

This reverts commit b50c2101afd45d4cf663daae857bfe1b40387703.

Revert "debug travis"

This reverts commit 012b8721dedf9bca46294ae75eee2815b160368b.

* Skip if new scheduler enabled

* error message

* merge
2020-12-02 13:25:54 -05:00
Ian Rodney 786f839ff3 [Windows] Fix windows build (#12555)
* fix remote watch

* remove const

* unfix remote-watch

* format
2020-12-02 09:37:40 -08:00
Kai Fricke 0a12eba603 Revert "Fix race condition between failure detection and references going out of scope (#12548)" (#12570)
This reverts commit 8801e87a
2020-12-02 10:20:17 -05:00
Stephanie Wang 8801e87afd Fix race condition between failure detection and references going out of scope (#12548)
* fix

* lint
2020-12-01 20:52:30 -05:00
Barak Michener 6412dfaf38 [ray_client] actors v0 (#12388) 2020-12-01 13:12:08 -08:00
SangBin Cho 0e892908f7 [Object Spilling] Delete spilled objects when references are gone out of scope. (#12341) 2020-12-01 13:10:39 -08:00
Simon Mo f596113fc7 [Core] Actor Retries Out of Order Tasks on Restart (#12338) 2020-12-01 09:35:54 -08:00
SangBin Cho f6f3cc9af1 [Core]Remove checkpoint table (#12235)
* Delete an actor entry from node manager.

* Remove checkpoint table

* remote checkpoint interface

* remove checkpoint interface

* fix ExitActorTest

Co-authored-by: chaokunyang <shawn.ck.yang@gmail.com>
2020-12-01 08:58:36 -08:00
Tao Wang b85c6abc3e Rename fields/variables from client id to node id (#12457) 2020-11-30 14:33:36 +08:00
Alex Wu f1cc33a6a6 Actor resource backlog hotfix (#12471)
* prepare implemented

* works?

* deflek

* git

* deflek round 2

* .

* improve the test

Co-authored-by: Alex <alex@anyscale.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2020-11-29 20:55:50 -08:00
Eric Liang 9ad0f173d6 Prestart workers to avoid slow start when multi-tenancy is enabled (#12430) 2020-11-27 21:47:46 -08:00
Eric Liang 569eee5e71 Enable more new scheduler tests (#12421) 2020-11-27 16:10:38 -08:00
fangfengbin d5215745e4 [PlacementGroup] Introduce GcsResourceManager and avoid copying resources when scheduling placement groups (#12253) 2020-11-26 11:21:58 +08:00
SangBin Cho 2e4e285ef0 [Object Spilling] Fusion small objects (#12087) 2020-11-25 10:13:32 -08:00
Tao Wang 4dd0aa7822 [GCS]make thread number of gcs rpc server configurable (#12257) 2020-11-25 11:40:29 +08:00
Tao Wang 5d47d02f81 [GCS]add callback for RegisterSelf api, make it done first (#12252) 2020-11-25 11:36:44 +08:00
Tao Wang e025b9e788 [TEST]Move all WaitReady together (#12254) 2020-11-25 11:21:24 +08:00
Tao Wang 2af10c1b78 [GCS]Add new message ReportResourceUsage (#11848) 2020-11-25 11:18:26 +08:00
Tao Wang e1075c0a82 [GCS]Fill resource fields when re-report heartbeat after gcs restarted (#12097) 2020-11-25 11:07:02 +08:00
fangfengbin 1d909321c9 [PlacementGroup]Fix node manager release unused bundles bug (#12346) 2020-11-25 11:02:43 +08:00
fangfengbin 5934b20b96 [PlacementGroup]Fix destroy bundle resources bug (#12336)
* [PlacementGroup]Fix destroy bundle resources bug

* revert AddBundleLocations code change

* add comment

* fix review comments

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-11-25 09:45:26 +08:00
Lixin Wei 462c7fb575 [streaming] export aligned_ symbols from raylet.so (#12345) 2020-11-24 10:16:12 -06:00
ZhuSenlin 1ae4d2873a [GCS] refactor gcs initialization (#11890) 2020-11-24 21:11:18 +08:00
fangfengbin be7938ee09 [PlacementGroup]Fix AddBundleLocations bug (#12330)
Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-11-24 16:57:17 +08:00
dHannasch 2c4514a2c0 [minor] Refactor to expose RedisContext::PingPort (#12022) 2020-11-23 20:39:50 -08:00