Commit Graph

40 Commits

Author SHA1 Message Date
Simon Mo f51c26bae6 Revert "[Core]Fix ray.kill doesn't cancel pending actor bug (#13254)" (#14013)
This reverts commit 2092b097ea.
2021-02-09 11:36:38 -08:00
fangfengbin 2092b097ea [Core]Fix ray.kill doesn't cancel pending actor bug (#13254) 2021-02-09 10:59:14 +08:00
DK.Pino fb89f9c2c8 [Placement Group] Support named placement group (#13755) 2021-02-05 11:04:51 +08:00
DK.Pino 7f6d326ad8 [Placement Group]Add detached support for placement group. (#13582) 2021-01-27 18:51:26 +08:00
fangfengbin 4a6c53da46 [Core]Fix raylet scheduling bug (#13452)
* [Core]Fix raylet scheduling bug

* fix lint error

* fix lint error

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2021-01-14 14:50:32 +01:00
fangfengbin 91878d18b5 [PlacementGroup]Fix placement group wait api disorder bug (#12827)
* [PlacementGroup]Fix placement group wait api disorder bug

* fix review comment

* fix review comment

* fix review comment

* fix review comments

* increase num_heartbeats_timeout

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-16 18:45:53 +08:00
Tao Wang 35f7d84dbe Revert heartbeat interval to keep ci stable (#12836)
* Revert heartbeat interval to keep ci stable

* fix missing one
2020-12-14 16:58:40 +08:00
DK.Pino 153b24746c [Placement Group] Refactor pg resource constrain in node manager (#12538)
* first version by pointer

* second version reference

* clean up

* add cpp ut

* lint

* extract LocalPlacementGroupManagerInterface

* lint

* fix comment

* add idempotency test

* lint

* fix pg ut

* fix pg ut

* python lint

* fix pg ut timeout

* python lint

* fix comment

* lint

* lint
2020-12-12 23:32:15 -08:00
Eric Liang bdc6624da8 Revert "[PlacementGroup]Add PlacementGroup wait python api (#12601)" (#12825)
This reverts commit 401d342602.
2020-12-12 12:13:48 -08:00
Tao Wang 295b6e5ce4 Split heartbeat message (#12535)
* first

* xxx

* Split heartbeat message

* only report resource usage when changed

* Fix GetAllResourceUsage

* Fix report resource usage

* Increase default heartbeat interval

* regularize heartbeat interval in test case
2020-12-11 21:19:57 +08:00
fangfengbin 401d342602 [PlacementGroup]Add PlacementGroup wait python api (#12601) 2020-12-07 13:53:49 +08:00
fangfengbin ff34563539 [PlacementGroup]Fix bug that kill workers mistakenly when gcs restarts (#12568) 2020-12-03 17:50:48 +08:00
Tao Wang e1075c0a82 [GCS]Fill resource fields when re-report heartbeat after gcs restarted (#12097) 2020-11-25 11:07:02 +08:00
fangfengbin f400333841 [Placement Group]Placement Group supports gcs failover(Part2) (#12003)
* add testcase

* fix ut

* fix review comment

* fix review comment

* fix review comments

* fix ut bug

* add part code

* add part code

* add part code

* add testcase

* add part code

* fix ut bug

* fix ut timeout bug

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-11-18 10:59:26 +08:00
fangfengbin 7f050c706b [PlacementGroup]Skip flaky testcase (#12065)
Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-11-17 12:21:34 -08:00
fangfengbin 8fb926565c [Placement Group]Placement Group supports gcs failover (Part1) (#11933) 2020-11-16 14:42:56 +08:00
SangBin Cho 6e2a1eac36 [Placement Group] Placement group automatic cleanup. (#11546)
* In progress. Done with all placement group manager code.

* It is working with job.

* Finished detached actor implementation.

* Fix minor issue.

* In progress.

* Addressed code review.

* Addressed code review.

* Addressed code review.

* Fix a build error.
2020-10-30 10:55:43 -07:00
DK.Pino 9f804ade5f [Placement Group]Add get all placement group api (#11460)
* add get all interface for placement group

* add get all interface for placement group

* make it work

* fix lint

* fix lint

* fix comment

* add cpp test

* fix python lint
2020-10-23 11:46:48 -07:00
SangBin Cho 666fcde8ca [Placement group] Input validation (#11152)
* Add a basic input validation.

* Addressed code review.
2020-10-14 13:56:41 -07:00
SangBin Cho 1e39c40370 [Placement Group] Capture child tasks by default. (#11025)
* In progress.

* Finished up.

* Improve comment.

* Addressed code review.

* Fix test failure.

* Fix ci failures.

* Fix CI issues.
2020-09-27 19:33:00 -07:00
SangBin Cho 29663d89f1 [Placement Group] Remove warning msg for placement groups. (#11034)
* Done.

* Addressed code review.

* Fixed typo.

* Addressed code review.
2020-09-25 20:53:42 -07:00
SangBin Cho 5e6b887f2d [Placement Group] Capture Child Task Part 1 (#10968)
* In progress.

* In progress.

* Done.

* Addressed code review.

* Increase timeout to make a test less flaky.

* Addressed code review.

* Addressed code review.
2020-09-24 09:02:03 -07:00
SangBin Cho dcb9e03fde [Placement Group] Atomic Creation using 2 phase protocol part 2. (#10599)
* In progress.

* In Progress

* Basic done.

* Fix build issues.

* Addressed code review.

* Change the confusing test name.

* Fix comments.

* Addressed code review.
2020-09-08 13:11:11 -07:00
Eric Liang 8ee7c182f5 [1.0] move placement groups from experimental to util. Note they are still undocumented. (#10554)
* move files

* Update __init__.py

* remove

* Update __init__.py
2020-09-04 19:01:24 -07:00
SangBin Cho dc7fe1a4c5 [Placement Group] Atomic Placement Group Part 1, Basic Structure. (#10482)
* Write a test.

* Basic structure done.

* Reduce flakiness of tests.

* Addressed code review.

* Skipping tests because it is flaky for now.

* Fix linting issues.

* Increase sleep time to see lint messages.

* Lint issue fixed.
2020-09-02 18:14:46 -07:00
Eric Liang 2a204260a8 [api] Second round of 1.0 API changes: exceptions, num_return_vals (#10377) 2020-08-28 19:57:02 -07:00
Eric Liang bd245a1c18 [api] Clean up and document Actor name / lifetime API (#10332) 2020-08-27 13:38:39 -07:00
SangBin Cho 3b3ca96a4e [Placement Group] Wait (#10259)
* Initial progress done.

* Fix wrong test.

* Improve tests.

* Update code.

* Addressed code review and merge conflict.

* Addressed code review.
2020-08-24 20:14:48 -07:00
fangfengbin b61a79efd7 [Placement Group]Fix SigSegv bug (#10262)
* fix SigSegv bug

* fix review comments

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-08-23 11:33:40 -07:00
fangfengbin 36c6c4b298 [Placement group] Check if placement group bundle index is valid (#10194)
* add part code

* rebase master

* add java testcase

* fix review comments

* fix lint error

* rebase master

* fix lint error

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-08-21 11:04:56 -07:00
fangfengbin a462ae2747 [Placement Group]Add strict spread strategy (#10174)
* support STRICT_SPREAD strategy

* fix review comments

* rebase master

* fix lint error

* fix lint error

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-08-20 10:18:58 -07:00
SangBin Cho 224933b5e4 [Placement Group] Remove API part 2 (#10215)
* Initial progress done.

* Fix mistake.

* Addressed code review.

* Fix cpp build issue.

* Addressed code review.
2020-08-20 09:50:13 -07:00
fangfengbin 9734dbca3e [Placement Group]Reschedule bundles when the node of bundles is dead (#10021) 2020-08-19 13:24:42 -07:00
SangBin Cho 263df6163c [Placement Group] Placement group remove api part 1 (#10063)
* Added basic rpc calls.

* fix issues.

* Fix the gcs server not getting request issue.

* In Progress.

* Basic logic done. Tests are required.

* In progress.

* In progress in refactoring context.

* Revert "In progress in refactoring context."

This reverts commit 38236256cf1306c60dd203e75d45ceb4509c8106.

* Working now.

* Python test works.

* Lint.

* Addressed code review.

* Addressed code review.

* Lint.

* Added unit tests.

* Done, but one of unit tests fail

* Addressed code review.

* Addressed the last code review.

* Fix the wrong test case.
2020-08-18 12:44:00 -07:00
SangBin Cho 053188dfbe [Placement Group] Support Placement Group state table. (#10090)
* Done.

* Addressed code review.

* Linting.

* Fix lint.

* Fix lint.

* Fix a test.

* Lint.

* Add a lint sleep to test.

* Fix the lint issue.

* Fixed doc build error.
2020-08-17 09:24:50 -07:00
fangfengbin edd783bc32 [Placement Group]Add soft pack strategy (#10099) 2020-08-17 12:01:34 +08:00
Eric Liang c9f13b0833 [Placement Groups] Support CUDA_VISIBLE_DEVICES (#10053) 2020-08-13 18:00:04 -07:00
Eric Liang 7d4f204aa8 [Placement Group] Allow scheduling a task on any bundle (-1, default) (#9885)
* wip

* wip

* fix tests

* wip

* wip

* wip

* wip

* wip

* add test

* update

* update

* remove debug

* comments
2020-08-06 00:05:21 -07:00
Eric Liang b73080c85f Allow tasks to be used with placement groups (#9738) 2020-07-31 10:51:37 -07:00
Alisa 51e12ee97c Python api of placement group (#9243) 2020-07-27 14:57:05 -07:00