6812 Commits

Author SHA1 Message Date
Tao Wang 12231ec2a6 Optimize heartbeat manager initialization (#12911) 2020-12-17 14:24:23 +08:00
acxz 020ad98f6f install setproctitle from pypi instead of building from source 2020-12-17 00:36:12 -05:00
SangBin Cho 057687e534 [New Scheduler] Fix test_failure.py by supporting infeasible tasks (#12738)
* Fix the first issue.

* ip

* In Progress.

* In progress.

* done.

* Remove unnecessary logs.

* Addressed code review + fix some test failures.

* Try fixing issues.

* Fix issues.

* Fix test issues.

* Fix issues.

* done.
2020-12-16 21:27:50 -08:00
acxz c8d14eb3c5 update setproctitle to use with py39 2020-12-16 22:42:31 -05:00
Philipp Moritz ad036fd564 Fix continue for debugger (#12862) 2020-12-16 16:09:13 -08:00
Amog Kamsetty dd522a71a1 [SGD] Disable Elastic Training by default when using with Tune (#12927) 2020-12-16 15:37:44 -08:00
Alex Wu 8b783ecafa Fix pull manager retry (#12907) 2020-12-16 14:18:43 -08:00
Ameer Haj Ali c677b9e201 [autoscaler] Fix flaky autoscaler test (#12918) 2020-12-16 14:18:27 -08:00
Edward Oakes aedcf0c9d9 Disable test_distributions (#12919) 2020-12-16 14:17:49 -08:00
Edward Oakes fdb4c6eb1c Better message for too little /dev/shm memory (#12896) 2020-12-16 10:30:20 -06:00
acxz 2b38938305 remove extra newline 2020-12-16 11:06:33 -05:00
Akash Patel 7d8a008aeb Merge branch 'master' into py39 2020-12-16 11:04:27 -05:00
fangfengbin 91878d18b5 [PlacementGroup]Fix placement group wait api disorder bug (#12827)
* [PlacementGroup]Fix placement group wait api disorder bug

* fix review comment

* fix review comment

* fix review comment

* fix review comments

* increase num_heartbeats_timeout

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-16 18:45:53 +08:00
Eric Liang 7ff314a5df [New scheduler] Also unsubscribe get dependencies on unblock 2020-12-15 20:29:44 -08:00
Richard Liaw a7caa14d3d [k8s] avoid bad error messages (#12871) 2020-12-15 15:00:02 -08:00
Edward Oakes f4b5a8b2f7 [serve] Re-enable test_failure.py (#12891) 2020-12-15 16:02:04 -06:00
Richard Liaw 87cf1a97e5 [core] recover startup logs (#12876) 2020-12-15 13:49:45 -08:00
Edward Oakes 6795d7c75c [serve] Fix flaky test_api.py::test_backend_user_config (#12892) 2020-12-15 15:35:30 -06:00
Kai Fricke ea1228074d [tune] enable points_to_eval for all search algorithms (#12790)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-12-15 11:51:53 -08:00
Simon Mo fdd85e3af4 [Serve] Add benchmark for async handles (#12858) 2020-12-15 11:21:51 -08:00
Alex Wu 0031723ace [New scheduler] Object spilling (#12857) 2020-12-15 11:05:38 -08:00
Edward Oakes cde711aaf1 Revert "[RLLib] Execution-Folder Type Annotations (#12760)" (#12886)
This reverts commit becca1424d.
2020-12-15 11:03:02 -08:00
architkulkarni ba12fb1451 Fix for RLIMIT patch (#12882)
Implement new soft limit introduced by https://github.com/ray-project/ray/pull/12853.
2020-12-15 10:38:46 -08:00
SangBin Cho de7848231c [Doc] Fix placement group doc (#12875) 2020-12-15 10:36:51 -08:00
Edward Oakes 261b2f9053 Check for raylet PID as ppid in dashboard agent fate-sharing (#12867) 2020-12-15 12:13:11 -06:00
Max Fitton e077bc4206 [Release] Bump master to 1.2.0 for 1.1.0 release (#12856) 2020-12-15 09:40:26 -08:00
Simon Mo b291dd4486 [Metrics] Call GetMeasureDoubleByName to prevent override (#12860) 2020-12-15 09:39:39 -08:00
Gekho457 5a142d5bd6 Use nightly images in all kubernetes examples. (#12868) 2020-12-14 20:49:41 -08:00
fangfengbin 43b9259d40 [GCS]GCS resource manager support scheduling resource (#12780)
* add part code

* add part code

* fix review comments

* rebase master

* add part code

* add part code

* fix review comments

* add part code

* fix code style

* fix ut bug

* fix ut bug

* fix review comments

* fix review comment

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-15 10:27:55 +08:00
Gekho457 8cebe5cbe9 [docs][autoscaler][k8s][minor] quotes #12866 2020-12-14 18:24:13 -08:00
Gekho457 44f5be04ca [autoscaler][k8s][doc][minor] Fix typo in k8s doc. (#12865) 2020-12-14 17:30:43 -08:00
Simon Mo b56db5a22f [Serve] Wait for actor name to be cleaned up (#12215) 2020-12-14 15:09:43 -08:00
architkulkarni 231518e86f [Serve] Support basic Starlette response types (#12811) 2020-12-14 17:03:56 -06:00
Max Fitton d0813c1c58 [Dashboard] Add dashboard multi-node churn test (#11768) 2020-12-14 17:03:33 -06:00
Richard Liaw c56799e3da disable-for-now (#12838)
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2020-12-14 14:18:31 -08:00
Eric Liang 1eb4ac12b1 Clip RLIMIT_NOFILE increase to avoid redis failing to start on Big Sur 2020-12-14 14:05:19 -08:00
SangBin Cho 69b0bc2132 [Logging] Use file handle temporarily (#12839) 2020-12-14 11:42:44 -08:00
Tao Wang ac53e2f857 [GCS]Tell dead nodes to commit suicide (#12792)
* [GCS]Tell dead nodes to commit suicide

* fix comment, add ut
2020-12-14 11:42:00 -08:00
Michael Luo becca1424d [RLLib] Execution-Folder Type Annotations (#12760) 2020-12-14 19:16:44 +01:00
Gekho457 11ce1dc743 Ray cluster CRD and example CR + multi-ray-cluster operator (#12098) 2020-12-14 10:26:01 -06:00
Tao Wang 35f7d84dbe Revert heartbeat interval to keep ci stable (#12836)
* Revert heartbeat interval to keep ci stable

* fix missing one
2020-12-14 16:58:40 +08:00
Eric Squires 22c1968d62 Runing -> Running (#12826) 2020-12-13 22:23:48 -08:00
Ameer Haj Ali aaa11941f6 [autoscaler] Fix flaky autoscaler test (#12829) 2020-12-13 17:09:30 -08:00
Sven Mika 3c808835a5 [RLlib] Issue 12831: AttributeError: 'NoneType' object has no attribute 'id' when using custom Atari env. (#12832) 2020-12-13 16:15:54 +01:00
fangfengbin 1e02b28abe [GCS]Move node resource info to gcs resource manager (#12775)
* add part code

* add part code

* fix review comments

* fix ut bug

* rebase master

* add part code

* fix ut bug

* fix ut bug

* fix review comments

* fix review comment

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-13 20:37:34 +08:00
Max Fitton ac24d1db30 [Dashboard][Bugfix] Fix GPU List Bug (#12666)
* Fix bug where None was passed as the empty value for ActorInfo.gpu_stats instead of an empty list

* lint

* dashboard/modules/logical_view

* fix test

* trigger build
2020-12-12 23:34:24 -08:00
DK.Pino 153b24746c [Placement Group] Refactor pg resource constrain in node manager (#12538)
* first version by pointer

* second version reference

* clean up

* add cpp ut

* lint

* extract LocalPlacementGroupManagerInterface

* lint

* fix comment

* add idempotency test

* lint

* fix pg ut

* fix pg ut

* python lint

* fix pg ut timeout

* python lint

* fix comment

* lint

* lint
2020-12-12 23:32:15 -08:00
Eric Liang bdc6624da8 Revert "[PlacementGroup]Add PlacementGroup wait python api (#12601)" (#12825)
This reverts commit 401d342602.
2020-12-12 12:13:48 -08:00
Eric Liang b73d4831d4 Add grace period before warning of resource deadlock 2020-12-12 12:02:13 -08:00
Barak Michener 6eb0e6f734 [format] Improve formatting with a real .flake8 file (#12800)
Change-Id: I42acd948dd915bad6b132f8caa9038898b55d6e4
2020-12-12 11:34:30 -08:00