6593 Commits

Author SHA1 Message Date
Edward Oakes fd4e025da6 [serve] Add docs on configuring cv2 parallelism (#12652) 2020-12-08 16:03:13 -06:00
Stephanie Wang 50f28811ac [new scheduler] Always spill back to a feasible node if the local node is not feasible (#12557)
* fix

lint

* feasible nodes

* Enable test, cleanup

* Revert "fix"

This reverts commit aef81d04c0b4560b758f846e1afdafbdb5552efe.

* unit test

* doc
2020-12-08 13:46:58 -05:00
Sumanth Ratna b7404e7955 [dashboard] Resolve npm vulnerabilities (#12620)
* npm audit fix

* npm dedupe
2020-12-08 10:26:49 -08:00
Kai Fricke df10b84113 [Release] release tests yamls for Tune & GPU (#12496) 2020-12-08 10:15:07 -08:00
Gekho457 f61bc79a87 Dmitri/k8s command runner home try again (#12609) 2020-12-08 11:44:22 -06:00
Keqiu Hu 2a9079aef9 [grpc]'ray memory' fails if there are many objects in scope #8502 (#12673) 2020-12-08 09:36:53 -08:00
Felipe Antunes 4c0f0ce3a9 [RLlib] In OffPolicyEstimators (Offline RL): Include last step of trajectory (#12619) 2020-12-08 12:39:40 +01:00
Keqiu Hu f27ceecbf6 [doc] update lint script location (#12670) 2020-12-07 22:26:42 -08:00
SangBin Cho 162f361dab [Logging] Fix log monitor issue (#12588)
* Try fixing issues.

* Verification.
2020-12-07 22:01:18 -08:00
Max Fitton cc2f43c826 [Dashboard][Bugfix] Fix bug in display of worker logs and errors in Dashboard (#12660)
* Fix bug with worker logs/errors not displaying in the dashboard

* Add error endpoint test.

* lint
2020-12-07 21:41:13 -08:00
Max Fitton 34b9c7449b [Dashboard] Fix object store memory display. (#12664) 2020-12-07 21:40:49 -08:00
fangfengbin 93c0eb249c [PlacementGroup]Support acquire and return bundle resource from gcs resource manager (#12349) 2020-12-08 10:29:57 +08:00
SangBin Cho b1f2b142d5 [Core] Ensure global state is connected when exception hook is called from the driver. (#12655) 2020-12-07 18:28:32 -08:00
SangBin Cho 040cf2c13b [Doc] Placement group doc small update (#12594)
* Modify doc that wasn't supposed to be merged.

* Addressed code review.
2020-12-07 13:58:27 -08:00
SangBin Cho 3ee4612696 [Release] Fix cluster.yaml (#12589)
* Fix cluster.yaml

* Updated to use manylinux2014
2020-12-07 13:52:30 -08:00
Sven Mika 340b1e99fc [RLlib] Fix JAX import bug. (#12621) 2020-12-07 11:05:08 -08:00
fangfengbin 7e1422e925 [PlacementGroup]Fix placement group strict spread bug when node dead (#12647)
* [PlacementGroup]Fix strict spread bug when node dead

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-07 21:50:28 +08:00
Sven Mika 99c81c6795 [RLlib] Attention Net prep PR #3. (#12450) 2020-12-07 13:08:17 +01:00
fangfengbin 401d342602 [PlacementGroup]Add PlacementGroup wait python api (#12601) 2020-12-07 13:53:49 +08:00
Philipp Moritz 73a1a232b9 Ray debugger stepping between tasks (#12075) 2020-12-06 21:50:18 -08:00
fangfengbin 260b07cf0c [PlacementGroup]Add PlacementGroup wait java api (#12499)
* add part code

* add part code

* add part code

* add part code

* fix review comments

* fix compile bug

* fix compile bug

* fix review comments

* fix review comments

* fix code style

* add part code

* fix review comments

* fix review comments

* fix code style

* rebase master

* fix bug

* fix lint error

* fix compile bug

* fix newline issue

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-05 16:40:04 +08:00
Kai Fricke 1c0d10f67e [tune] Add xgboost_ray integration (#12572) 2020-12-04 13:59:20 -08:00
Kai Fricke 219c445648 [tune] verbosity refactor second attempt (#12571)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-12-04 13:56:26 -08:00
Xianyang Liu 7cad648370 [SGD] Fixes TorchTrainer scales up (#12563) 2020-12-04 13:55:15 -08:00
Marci f965537ae9 [tune] Callable accepted for register_env (#12618) 2020-12-04 12:21:25 -08:00
SangBin Cho 0138c2dbb4 [Metrics] Remove redundant unit specification. (#12595) 2020-12-04 00:06:21 -08:00
Kai Yang 21fcee28f9 [Java] Simplify Ray.init() by invoking ray start internally (#10762) 2020-12-04 14:33:45 +08:00
Eric Liang 8cebe1e79c [autoscaler] Fix worker capping fifo test in new scheduler (#12512) 2020-12-03 17:21:35 -08:00
Richard Liaw 515f67034a [tune] debug py37 build (#12597) 2020-12-03 13:47:54 -08:00
Richard Liaw 1ce5e0e99f [tune] Fix file descriptor leak by syncer (#12590) 2020-12-03 13:39:04 -08:00
Eric Liang 36e46ed923 Revert "[autoscaler/k8s] Use ray node's HOME in Kubernetes command runner. (#12417)" (#12607)
This reverts commit f669830de6.
2020-12-03 12:57:59 -08:00
Simon Mo 1f7a4806ff [Serve] Fix Flask Request self reference (#12560)
* [Serve] Fix Flask Request self reference

* Working flag

* Fix
2020-12-03 10:45:04 -06:00
Gekho457 f669830de6 [autoscaler/k8s] Use ray node's HOME in Kubernetes command runner. (#12417) 2020-12-03 10:43:16 -06:00
Sven Mika 3f4bc16276 [RLlib] Add a minimal JAX ModelV2 (FCNet) to RLlib. (#12502) 2020-12-03 15:51:30 +01:00
fangfengbin ff34563539 [PlacementGroup]Fix bug that kill workers mistakenly when gcs restarts (#12568) 2020-12-03 17:50:48 +08:00
Richard Liaw 7c58a85fed [tune] fix Tensorboard file descriptor leak (#12425) 2020-12-03 00:06:54 -08:00
Eric Liang 62fbe63f34 Disable flaky test test_delete_objects_multi_node (#12584)
* update

* fix

* update
2020-12-02 19:19:12 -08:00
Edward Oakes 8058c1eb54 [serve] Add option to not start HTTP servers (#11627) 2020-12-02 16:49:34 -06:00
Max Fitton a5c846c83b [Dashboard][Bugfix] Filter dead nodes from Machine View (fixes duplicate node issue) (#12579) 2020-12-02 14:08:14 -08:00
Keqiu Hu 2ec7b7367e [doc] update contributing doc (#12564)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-12-02 12:08:30 -08:00
Kaushik B 7422abddb4 [tune] trim kwargs in shim instantiation functions (#12544) 2020-12-02 12:07:00 -08:00
Richard Liaw da42bf29d0 [tune] horovod release test (#12495) 2020-12-02 12:04:54 -08:00
Stephanie Wang 443339ab19 [core] Move out-of-memory handling into the plasma store and support async object creation (#12186)
* Refactor to extract creation request queue

* timer on oom

* move timer out

* Move evict_if_full and on_store_full into plasma store

* Remove client-side code

* revert

* Distinguish between transient and permanent OOM delays

* update

* Move out create request queue, unit test

* unit test

* Fix max retries

* test

* Do not pin restored objects

* First pass to add polling requests, unit test passes

* worker plasma client retries plasma requests

* cleanup

* Clean up after disconnected clients, check memory leaks

* Support immediate requests in request queue

* Option to try creating immediately

* lint

* Fix build, address comments

* doc

* fixes

* debug travis

* debug

* debug

* debug

* debug

* Revert "debug"

This reverts commit 6bf2f6ee5640e71630c4aecdb7ebf54911ea32db.

Revert "debug"

This reverts commit 73017099c9b06cdaae1217bf0e0f4d23ed68a9e5.

Revert "debug"

This reverts commit 5a155529e28cee9461a598b0cdf7b6a3cc194c93.

Revert "debug"

This reverts commit b50c2101afd45d4cf663daae857bfe1b40387703.

Revert "debug travis"

This reverts commit 012b8721dedf9bca46294ae75eee2815b160368b.

* Skip if new scheduler enabled

* error message

* merge
2020-12-02 13:25:54 -05:00
Ian Rodney 786f839ff3 [Windows] Fix windows build (#12555)
* fix remote watch

* remove const

* unfix remote-watch

* format
2020-12-02 09:37:40 -08:00
Kai Fricke 0a12eba603 Revert "Fix race condition between failure detection and references going out of scope (#12548)" (#12570)
This reverts commit 8801e87a
2020-12-02 10:20:17 -05:00
Richard Liaw a21523c709 [tune/core] serialization debugging utility (#12142)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2020-12-02 00:52:17 -08:00
Kai Fricke 63b85df828 [xgb] update docs (#12549) 2020-12-01 23:17:23 -08:00
Simon Mo e428134137 [Hotfix] Pin llvmlite for windows build (#12559) 2020-12-01 19:43:08 -08:00
Siyuan (Ryans) Zhuang 615f974313 Add context for "test_buffer_alignment" (#12519) 2020-12-01 19:27:14 -08:00
Stephanie Wang 8801e87afd Fix race condition between failure detection and references going out of scope (#12548)
* fix

* lint
2020-12-01 20:52:30 -05:00