6812 Commits

Author SHA1 Message Date
Keqiu Hu ee012532fb [core] Use node manager client pool for GCS service #10398 (#12368)
* raylet client pool

* Fix merging conflict

* Fix documentation typo

* fix linting

* address comments

* fix typo

* remove unintended logging

* address comments

* fix bazel file lint error
2020-12-09 12:44:40 -08:00
architkulkarni 8b9197ea8c [Doc] replace github discussion link with discourse (#12684) 2020-12-09 12:43:45 -08:00
Edward Oakes c9873cdbc3 [Serve] Remove unused assign_request wrapper (#12721) 2020-12-09 12:22:43 -08:00
Alex Wu 0b6e44efb8 [New scheduler] Cluster Resource Scheduler dynamic resources (for placement groups) (#12518)
* prepare implemented

* dynamic resources

* .

* commit

* .

* .

* Still needs to be cleaned up

* Passes basic tests + cleanup

* .

* .

* .

* Apply suggestions from code review

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* fix

* lint

Co-authored-by: Alex <alex@anyscale.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2020-12-09 12:05:31 -08:00
fangfengbin ef9ebbc636 [GCS]GCS based Actor Scheduling support actor colocation (#12707)
* [GCS]GCS based Actor Scheduling support actor colocation

* fix review comment

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-09 11:54:23 -08:00
Sven Mika ea25482f6a WIP. (#12706) 2020-12-09 11:49:21 -08:00
Ian Rodney 19542c5eb0 [docker] Default to ray-ml image (#12703) 2020-12-09 11:49:16 -08:00
architkulkarni 6f3aacd087 [serve] Clarify conda env docs (#12679) 2020-12-09 13:35:48 -06:00
Sven Mika f6241302a8 [RLlib] Fix issue 12678: MultiAgentBatch has no attribute total. (#12704) 2020-12-09 16:41:13 +01:00
fyrestone 3ce9286977 Fix dashboard agent check ppid is raylet pid (#12256)
* Dashboard agent check ppid is raylet pid

* Improve implementation

* Refine code

* Make the RAY_NODE_PID environment required for dashboard agent

Co-authored-by: 刘宝 <po.lb@antfin.com>
2020-12-09 09:12:34 -05:00
Stephanie Wang 840de49161 Fix race condition between failure detection and references going out of scope (#12573)
* fix

* lint

* fix initialization
2020-12-08 23:49:55 -08:00
Sven Mika 28108c905b [RLlib] Tf-eager policy bug fix: Duplicate model call in compute_gradients. (#12682) 2020-12-09 08:03:58 +01:00
Eric Liang cab46b7931 Improve issue templates (#12687)
* update

* Update .github/ISSUE_TEMPLATE/bug_report.md

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-12-08 22:29:03 -08:00
Alex Wu bd7e26b768 [Autoscaler] Temporarily suppress "Removed stale ip mappings" message. (#12689) 2020-12-08 21:55:10 -08:00
Barak Michener dc4b5c7aa3 [ray_client] Passing actors to actors (#12585)
* start building tests around passing handles to handles

Change-Id: Ie8c3de5c8ce789c3ec8d29f0702df80ba598279f

* clean up the switch statements by moving to a method, implement state transfer, extend test

Change-Id: Ie7b6493db3a6c203d3a0b262b8fbacb90e5cdbc5

* passing

Change-Id: Id88dc0a41da1c9d5ba68f754c5b57141aae47beb

* flush out tests

Change-Id: If77c0f586e9e99449d494be4e85f854e4a7a4952

* formatting

Change-Id: I497c07cee70b52453b221ed4393f04f6f560061e

* fix python3.6 and other attributes

Change-Id: I5a2c5231e8a021184d9dfc3e346df7f71fc93257

* address documentation

Change-Id: I049d841ed1f85b7350c17c05da4a4d81d5cb03df

* formatting

Change-Id: I6a2b32a2466ffc9f03fc91ac17901b9c1a49505c

* use the pickled handle as the id bytes for actors

Change-Id: I9ddcb41d614de65d42d6f0382fe0faa7ad2c2ade

* pydoc

Change-Id: I9b32a0f383d5ff5ac052e61929b7ae3e42a89fc5

* format

Change-Id: Iac0010bb990a4025a98139ab88700030b2e9e7f5

* todos

Change-Id: I7b550800cf7499403e8a17b77484bc46f20f0afc

* tests

Change-Id: If8ebf6a335baeb113c1332acc930c41a6b4f5384

* fix lint

Change-Id: I019f41e0ec341d39bbbbd39aa43d9fb5f8b57cf0

* nits

Change-Id: I2e6813d8db34f4ce008326faa095d414c10eee95

* add some tricky, python3.6-troublesome type checking

Change-Id: Ib887fc943a6e7084002bc13dfbe113b69b4d9317
2020-12-08 21:54:55 -08:00
Richard Liaw d534719af6 temporary-fix (#12700)
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2020-12-08 21:48:26 -08:00
Ameer Haj Ali a4dbb271bd [hotfix][autoscaler] Request resources refactor2 (#12661)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* small fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* request_resources -> min workers

* test fixes

* add race condition tests

* Eric

* fixes

* semi final

* semi final

* lint

* lint

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
2020-12-08 18:41:30 -08:00
Philipp Moritz 343b479ae2 [TEST] Fix Ray windows build for debugger (#12671)
* Fix Ray windows build for debugger

* update
2020-12-08 18:12:48 -08:00
Sven Mika e40b14d255 [RLlib] Batch-size for truncate_episode batch_mode should be configurable in agent-steps (rather than env-steps), if needed. (#12420) 2020-12-08 16:41:45 -08:00
Edward Oakes fd4e025da6 [serve] Add docs on configuring cv2 parallelism (#12652) 2020-12-08 16:03:13 -06:00
Stephanie Wang 50f28811ac [new scheduler] Always spill back to a feasible node if the local node is not feasible (#12557)
* fix

lint

* feasible nodes

* Enable test, cleanup

* Revert "fix"

This reverts commit aef81d04c0b4560b758f846e1afdafbdb5552efe.

* unit test

* doc
2020-12-08 13:46:58 -05:00
Sumanth Ratna b7404e7955 [dashboard] Resolve npm vulnerabilities (#12620)
* npm audit fix

* npm dedupe
2020-12-08 10:26:49 -08:00
Kai Fricke df10b84113 [Release] release tests yamls for Tune & GPU (#12496) 2020-12-08 10:15:07 -08:00
Gekho457 f61bc79a87 Dmitri/k8s command runner home try again (#12609) 2020-12-08 11:44:22 -06:00
Keqiu Hu 2a9079aef9 [grpc]'ray memory' fails if there are many objects in scope #8502 (#12673) 2020-12-08 09:36:53 -08:00
Felipe Antunes 4c0f0ce3a9 [RLlib] In OffPolicyEstimators (Offline RL): Include last step of trajectory (#12619) 2020-12-08 12:39:40 +01:00
Keqiu Hu f27ceecbf6 [doc] update lint script location (#12670) 2020-12-07 22:26:42 -08:00
SangBin Cho 162f361dab [Logging] Fix log monitor issue (#12588)
* Try fixing issues.

* Verification.
2020-12-07 22:01:18 -08:00
Max Fitton cc2f43c826 [Dashboard][Bugfix] Fix bug in display of worker logs and errors in Dashboard (#12660)
* Fix bug with worker logs/errors not displaying in the dashboard

* Add error endpoint test.

* lint
2020-12-07 21:41:13 -08:00
Max Fitton 34b9c7449b [Dashboard] Fix object store memory display. (#12664) 2020-12-07 21:40:49 -08:00
fangfengbin 93c0eb249c [PlacementGroup]Support acquire and return bundle resource from gcs resource manager (#12349) 2020-12-08 10:29:57 +08:00
SangBin Cho b1f2b142d5 [Core] Ensure global state is connected when exception hook is called from the driver. (#12655) 2020-12-07 18:28:32 -08:00
SangBin Cho 040cf2c13b [Doc] Placement group doc small update (#12594)
* Modify doc that wasn't supposed to be merged.

* Addressed code review.
2020-12-07 13:58:27 -08:00
SangBin Cho 3ee4612696 [Release] Fix cluster.yaml (#12589)
* Fix cluster.yaml

* Updated to use manylinux2014
2020-12-07 13:52:30 -08:00
Sven Mika 340b1e99fc [RLlib] Fix JAX import bug. (#12621) 2020-12-07 11:05:08 -08:00
fangfengbin 7e1422e925 [PlacementGroup]Fix placement group strict spread bug when node dead (#12647)
* [PlacementGroup]Fix strict spread bug when node dead

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-07 21:50:28 +08:00
Sven Mika 99c81c6795 [RLlib] Attention Net prep PR #3. (#12450) 2020-12-07 13:08:17 +01:00
fangfengbin 401d342602 [PlacementGroup]Add PlacementGroup wait python api (#12601) 2020-12-07 13:53:49 +08:00
Philipp Moritz 73a1a232b9 Ray debugger stepping between tasks (#12075) 2020-12-06 21:50:18 -08:00
fangfengbin 260b07cf0c [PlacementGroup]Add PlacementGroup wait java api (#12499)
* add part code

* add part code

* add part code

* add part code

* fix review comments

* fix compile bug

* fix compile bug

* fix review comments

* fix review comments

* fix code style

* add part code

* fix review comments

* fix review comments

* fix code style

* rebase master

* fix bug

* fix lint error

* fix compile bug

* fix newline issue

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-05 16:40:04 +08:00
Kai Fricke 1c0d10f67e [tune] Add xgboost_ray integration (#12572) 2020-12-04 13:59:20 -08:00
Kai Fricke 219c445648 [tune] verbosity refactor second attempt (#12571)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-12-04 13:56:26 -08:00
Xianyang Liu 7cad648370 [SGD] Fixes TorchTrainer scales up (#12563) 2020-12-04 13:55:15 -08:00
Marci f965537ae9 [tune] Callable accepted for register_env (#12618) 2020-12-04 12:21:25 -08:00
SangBin Cho 0138c2dbb4 [Metrics] Remove redundant unit specification. (#12595) 2020-12-04 00:06:21 -08:00
Kai Yang 21fcee28f9 [Java] Simplify Ray.init() by invoking ray start internally (#10762) 2020-12-04 14:33:45 +08:00
Eric Liang 8cebe1e79c [autoscaler] Fix worker capping fifo test in new scheduler (#12512) 2020-12-03 17:21:35 -08:00
Richard Liaw 515f67034a [tune] debug py37 build (#12597) 2020-12-03 13:47:54 -08:00
Richard Liaw 1ce5e0e99f [tune] Fix file descriptor leak by syncer (#12590) 2020-12-03 13:39:04 -08:00
Eric Liang 36e46ed923 Revert "[autoscaler/k8s] Use ray node's HOME in Kubernetes command runner. (#12417)" (#12607)
This reverts commit f669830de6.
2020-12-03 12:57:59 -08:00