Commit Graph

1683 Commits

Author SHA1 Message Date
Alex Wu 404161a3ff [Autoscaler/Core] Remove autoscaler spam (#12952) 2020-12-18 18:22:45 -08:00
Kai Yang ac5ea2c13d [Java] Fix output parsing in RunManager (#12968)
* Fix output parsing in RunManager

* change log level

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-18 18:22:12 -08:00
Eric Liang 6ece291f35 Clean up block/unblock handling of resources in new scheduler (#12963) 2020-12-18 16:00:54 -08:00
Eric Liang 3e492a79ec Increase the number of unique bits for actors to avoid handle collisions (#12894) 2020-12-18 15:59:03 -08:00
Eric Liang 92812f2e8a Implement resource deadlock detection for new scheduler (#12961) 2020-12-18 12:17:54 -08:00
Barak Michener 5cfa1934e4 [ray_client]: Implement object retain/release and Data Streaming API (#12818) 2020-12-18 11:47:38 -08:00
fangfengbin a442cd17e0 [GCS]Optimize gcs client reconnection (#12878)
* [GCS]Optimize gcs client reconnection

* fix review comment

* fix review comment

* add part code

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-17 21:57:37 -08:00
dHannasch cfefd7c70e Test PingPort (#12954)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-12-17 21:15:42 -08:00
DK.Pino 6404f1e609 [Placement Group][New scheduler] New scheduler pg implementation (#12910) 2020-12-18 11:56:45 +08:00
Tao Wang 17152c84a7 [Tiny]Print raylet info after register (#12566) 2020-12-18 11:22:13 +08:00
dHannasch d747071dd9 Test shard_context on already-created boost::asio::io_service. (#12917) 2020-12-17 14:26:30 -08:00
Allen e6cb4f4bd7 [Core] Add log of address and port (#12908)
Co-authored-by: Allen Yin <allenyin@anyscale.io>
2020-12-17 00:25:29 -08:00
Yi Cheng 40032541dc [core] Introduce fetch_local to ray.wait (#12526) 2020-12-16 23:44:28 -08:00
Tao Wang 12231ec2a6 Optimize heartbeat manager initialization (#12911) 2020-12-17 14:24:23 +08:00
SangBin Cho 057687e534 [New Scheduler] Fix test_failure.py by supporting infeasible tasks (#12738)
* Fix the first issue.

* ip

* In Progress.

* In progress.

* done.

* Remove unnecessary logs.

* Addressed code review + fix some test failures.

* Try fixing issues.

* Fix issues.

* Fix test issues.

* Fix issues.

* done.
2020-12-16 21:27:50 -08:00
Alex Wu 8b783ecafa Fix pull manager retry (#12907) 2020-12-16 14:18:43 -08:00
fangfengbin 91878d18b5 [PlacementGroup]Fix placement group wait api disorder bug (#12827)
* [PlacementGroup]Fix placement group wait api disorder bug

* fix review comment

* fix review comment

* fix review comment

* fix review comments

* increase num_heartbeats_timeout

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-16 18:45:53 +08:00
Eric Liang 7ff314a5df [New scheduler] Also unsubscribe get dependencies on unblock 2020-12-15 20:29:44 -08:00
Alex Wu 0031723ace [New scheduler] Object spilling (#12857) 2020-12-15 11:05:38 -08:00
Edward Oakes 261b2f9053 Check for raylet PID as ppid in dashboard agent fate-sharing (#12867) 2020-12-15 12:13:11 -06:00
Max Fitton e077bc4206 [Release] Bump master to 1.2.0 for 1.1.0 release (#12856) 2020-12-15 09:40:26 -08:00
Simon Mo b291dd4486 [Metrics] Call GetMeasureDoubleByName to prevent override (#12860) 2020-12-15 09:39:39 -08:00
fangfengbin 43b9259d40 [GCS]GCS resource manager support scheduling resource (#12780)
* add part code

* add part code

* fix review comments

* rebase master

* add part code

* add part code

* fix review comments

* add part code

* fix code style

* fix ut bug

* fix ut bug

* fix review comments

* fix review comment

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-15 10:27:55 +08:00
Tao Wang ac53e2f857 [GCS]Tell dead nodes to commit suicide (#12792)
* [GCS]Tell dead nodes to commit suicide

* fix comment, add ut
2020-12-14 11:42:00 -08:00
Tao Wang 35f7d84dbe Revert heartbeat interval to keep ci stable (#12836)
* Revert heartbeat interval to keep ci stable

* fix missing one
2020-12-14 16:58:40 +08:00
fangfengbin 1e02b28abe [GCS]Move node resource info to gcs resource manager (#12775)
* add part code

* add part code

* fix review comments

* fix ut bug

* rebase master

* add part code

* fix ut bug

* fix ut bug

* fix review comments

* fix review comment

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-13 20:37:34 +08:00
DK.Pino 153b24746c [Placement Group] Refactor pg resource constrain in node manager (#12538)
* first version by pointer

* second version reference

* clean up

* add cpp ut

* lint

* extract LocalPlacementGroupManagerInterface

* lint

* fix comment

* add idempotency test

* lint

* fix pg ut

* fix pg ut

* python lint

* fix pg ut timeout

* python lint

* fix comment

* lint

* lint
2020-12-12 23:32:15 -08:00
Eric Liang b73d4831d4 Add grace period before warning of resource deadlock 2020-12-12 12:02:13 -08:00
fangfengbin c22990a537 [GCS]GCS node manager rename GetNode to GetAliveNode (#12781) 2020-12-12 20:34:43 +08:00
Alex Wu aa64cd4534 [New scheduler] Fix test_global_state (#12586) 2020-12-11 21:47:01 -08:00
Eric Liang 1ce745cf44 Add automatic local GC and plasma debug logs every 10 minutes by default (#12804) 2020-12-11 17:09:58 -08:00
Alex Wu 676ec363f6 [Object Manager] Pull Manager refactor (#12335) 2020-12-11 11:56:23 -08:00
Eric Liang 4ad4463be6 Add comments to clarify purpose of new scheduler queues (#12730)
* update

* clarify

* update
2020-12-11 11:53:09 -08:00
Tao Wang 295b6e5ce4 Split heartbeat message (#12535)
* first

* xxx

* Split heartbeat message

* only report resource usage when changed

* Fix GetAllResourceUsage

* Fix report resource usage

* Increase default heartbeat interval

* regularize heartbeat interval in test case
2020-12-11 21:19:57 +08:00
Stephanie Wang 86b0741026 [new scheduler] Allocate resources for spilled back task to a local view of the remote node (#12711)
* Force report heartbeats if remote resources may be dirty

* lint

* typo

* typo

* unit test

* debug

* Revert "lint"

This reverts commit 6dc7e982ffee98185665eb7c3c8fde0d91938919.

* Revert "Force report heartbeats if remote resources may be dirty"

This reverts commit cbfa9405197df62f874107d55b46715ceae2abd2.

* Local view of resources

* debug travis

* debug

* debug

* debug

* weaken test

* cleanups

* lint

* Revert "debug travis"

This reverts commit 11ff5f4f84e64e9fbd4eecda5b3c7fd07a7130a4.

* revert

* const view, remove unused
2020-12-10 22:43:29 -05:00
Barak Michener b7f246c451 [ray_client] Include multiple facets of the Ray API (#12736) 2020-12-10 19:09:34 -08:00
Edward Oakes 62d6b0a558 Fix max_task_retries for named actors (#12762) 2020-12-10 18:24:55 -06:00
Kai Yang e3b5deb741 [Multi-tenancy] Delete flag enable_multi_tenancy and remove old code path (#10573) 2020-12-10 19:01:40 +08:00
Stephanie Wang a776209aec Revert "Fix dashboard agent check ppid is raylet pid (#12256)" (#12729)
This reverts commit 3ce9286977.
2020-12-09 17:20:38 -05:00
dHannasch d455cae036 Add period to error message. (#12716) 2020-12-09 15:58:21 -06:00
Keqiu Hu ee012532fb [core] Use node manager client pool for GCS service #10398 (#12368)
* raylet client pool

* Fix merging conflict

* Fix documentation typo

* fix linting

* address comments

* fix typo

* remove unintended logging

* address comments

* fix bazel file lint error
2020-12-09 12:44:40 -08:00
Alex Wu 0b6e44efb8 [New scheduler] Cluster Resource Scheduler dynamic resources (for placement groups) (#12518)
* prepare implemented

* dynamic resources

* .

* commit

* .

* .

* Still needs to be cleaned up

* Passes basic tests + cleanup

* .

* .

* .

* Apply suggestions from code review

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* fix

* lint

Co-authored-by: Alex <alex@anyscale.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2020-12-09 12:05:31 -08:00
fangfengbin ef9ebbc636 [GCS]GCS based Actor Scheduling support actor colocation (#12707)
* [GCS]GCS based Actor Scheduling support actor colocation

* fix review comment

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-09 11:54:23 -08:00
fyrestone 3ce9286977 Fix dashboard agent check ppid is raylet pid (#12256)
* Dashboard agent check ppid is raylet pid

* Improve implementation

* Refine code

* Make the RAY_NODE_PID environment required for dashboard agent

Co-authored-by: 刘宝 <po.lb@antfin.com>
2020-12-09 09:12:34 -05:00
Stephanie Wang 840de49161 Fix race condition between failure detection and references going out of scope (#12573)
* fix

* lint

* fix initialization
2020-12-08 23:49:55 -08:00
Stephanie Wang 50f28811ac [new scheduler] Always spill back to a feasible node if the local node is not feasible (#12557)
* fix

lint

* feasible nodes

* Enable test, cleanup

* Revert "fix"

This reverts commit aef81d04c0b4560b758f846e1afdafbdb5552efe.

* unit test

* doc
2020-12-08 13:46:58 -05:00
fangfengbin 93c0eb249c [PlacementGroup]Support acquire and return bundle resource from gcs resource manager (#12349) 2020-12-08 10:29:57 +08:00
fangfengbin 7e1422e925 [PlacementGroup]Fix placement group strict spread bug when node dead (#12647)
* [PlacementGroup]Fix strict spread bug when node dead

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-07 21:50:28 +08:00
Philipp Moritz 73a1a232b9 Ray debugger stepping between tasks (#12075) 2020-12-06 21:50:18 -08:00
fangfengbin 260b07cf0c [PlacementGroup]Add PlacementGroup wait java api (#12499)
* add part code

* add part code

* add part code

* add part code

* fix review comments

* fix compile bug

* fix compile bug

* fix review comments

* fix review comments

* fix code style

* add part code

* fix review comments

* fix review comments

* fix code style

* rebase master

* fix bug

* fix lint error

* fix compile bug

* fix newline issue

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-05 16:40:04 +08:00