7210 Commits

Author SHA1 Message Date
Ameer Haj Ali d87a82e891 Revert "Revert "[Autoscaler] Monitor refactor for backward compatability. (#13970)" (#14046)" (#14050)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* smallf fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* Revert "Revert "[Autoscaler] Monitor refactor for backward compatability. (#13970)" (#14046)"

This reverts commit 6f9d39fb3e.

* fake news

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
2021-02-10 17:59:08 -08:00
Clark Zinzow c5574a33e4 [dask-on-ray] Add better Dask-on-Ray example, and detail custom shuffle optimization. (#13950)
* Add better Dask-on-Ray example, and detail custom shuffle optimization.

* Misc. updates and feedback.

* Update doc/source/dask-on-ray.rst

Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>

* Set max_branch to infinity in shuffle optimization example.

* Feedback

* Apply suggestions from code review

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* 80 col width

Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-02-10 14:24:09 -08:00
Crissman Loomis 05ab75fbe1 [docs] Add mode to Ray Tune quick start (#14023) 2021-02-10 12:41:45 -08:00
Thomas J. Fan 75fbd48edd [doc] Minor fix to indentation (#14040) 2021-02-10 12:31:47 -08:00
Stephanie Wang fc89984162 Subtract from num bytes in use (#13944) 2021-02-10 12:22:08 -08:00
architkulkarni 6f9d39fb3e Revert "[Autoscaler] Monitor refactor for backward compatability. (#13970)" (#14046)
This reverts commit 7a6f8054d1.
2021-02-10 12:16:52 -08:00
Alex Wu 68e985ddcd [hotfix][docs] RayDP tensorflow != pytorch (#14044) 2021-02-10 11:23:02 -08:00
Kai Fricke 1ef2a6790c [tune] add scalability release tests (#13986)
* Add scalability tests

* Network overhead cluster

* Update xgboost tests

* Document release tests

* Don't raise on failed trial

* Update to multi node yamls

* Update yamls

* Revert xgboost test changes

* Fix import

* Update release/tune_tests/scalability_tests/workloads/test_bookkeeping_overhead.py

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Pass aws credentials (WIP)

* Update durable trainable example

* Update xgboost sweep

* Change xgboost scope, fix durable trainable stop condition

* Fix max depth to limit total test length

* Add cluster information to test descriptions. Update release checklist/process docs

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-02-10 17:16:31 +01:00
Sven Mika 81e7434091 [RLlib] TFPolicy.export_model: Add timestep placeholder to model's signature, if needed. (#13988) 2021-02-10 15:21:46 +01:00
Sven Mika 37c7daa3c0 [RLlib] DDPG: Support simplex action space. (#14011) 2021-02-10 15:10:01 +01:00
fangfengbin 1754359281 [Core]Fix ray.kill doesn't cancel pending actor bug (#14025) 2021-02-10 15:30:21 +08:00
Alex Wu ce80ef5aee [Docs] RayDP Documentation (#14018)
* .

* done?

* Docs

* Docs

* Update raydp.rst

* Update raydp.rst

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-02-09 23:05:18 -08:00
Dmitri Gekhtman 8ca0a32819 HotFix k8s autoscaling (#14024) 2021-02-09 22:34:24 -08:00
Eric Liang 8b7cf7cab9 Add tip on how to disable Ray OOM handler (#14017) 2021-02-09 21:52:22 -08:00
Ameer Haj Ali 7a6f8054d1 [Autoscaler] Monitor refactor for backward compatability. (#13970) 2021-02-09 21:41:50 -08:00
Eric Liang 7f342eb371 Update example shuffle script (#14021) 2021-02-09 20:47:41 -08:00
Clark Zinzow 79c7c181f3 [dask-on-ray] Add multiple return DataFrame shuffle optimization. (#13951) 2021-02-09 15:39:48 -08:00
Kai Yang e0b81796c5 Revert "Revert "[Java] fix test hang occasionally when running FailureTest (#13934)" (#13992)" (#14008) 2021-02-09 12:43:26 -08:00
Simon Mo f51c26bae6 Revert "[Core]Fix ray.kill doesn't cancel pending actor bug (#13254)" (#14013)
This reverts commit 2092b097ea.
2021-02-09 11:36:38 -08:00
Alex Wu 1dcdfe9101 [autoscaler/dashboard] Publish resource usage in units of bytes (#14002) 2021-02-09 10:27:26 -08:00
Crissman Loomis 43083b9653 [docs] optuna variable typo (#14006)
* fix variable name typo

* align
2021-02-09 09:51:29 -08:00
Kai Fricke 3c8b164882 [tune] pass trainable function name when using tune.with_parameters (#14009) 2021-02-09 08:51:14 -08:00
Sven Mika d7301a51f4 [RLlib]: Trajectory View API: Keep env infos (e.g. for postprocessing callbacks), no matter what. (#13555) 2021-02-09 17:05:26 +01:00
fangfengbin 2092b097ea [Core]Fix ray.kill doesn't cancel pending actor bug (#13254) 2021-02-09 10:59:14 +08:00
Simon Mo 914696ac3f Skip placement tests on Windows (#14000) 2021-02-08 18:27:11 -08:00
Dmitri Gekhtman 081f3e5f07 [autoscaler][kubernetes] Ray client setup, example config simplification, example scripts. (#13920) 2021-02-08 20:00:34 -06:00
Ameer Haj Ali 1643bc5c4f Fix autoscaler wrong parameter names (#13966)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* smallf fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* improve code readability

* lint

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
2021-02-08 13:19:33 -08:00
SongGuyang 09242e6d31 random a job id in c++ worker (#13982) 2021-02-08 12:57:25 -08:00
Simon Mo ec94214957 Revert "[Java] fix test hang occasionally when running FailureTest (#13934)" (#13992)
This reverts commit bcf9457abb.
2021-02-08 11:30:30 -08:00
SangBin Cho 0e07b5fa89 [Doc] Update actor resource information (#13909)
* in progress.

* Revert "in progress."

This reverts commit 21a91a47522797210bdc5db9477bd0b02ed9d926.

* done.

* done.
2021-02-08 10:23:57 -08:00
Sven Mika eb0038612f [RLlib] Extend on_learn_on_batch callback to allow for custom metrics to be added. (#13584) 2021-02-08 15:02:19 +01:00
Chace Ashcraft ebeee1d59a [RLlib] Pytorch MAML fix for more than two workers with discrete actions (#13835) 2021-02-08 12:06:02 +01:00
Sven Mika d001af3e59 [RLlib] Allow rllib rollout to run distributed via evaluation workers. (#13718) 2021-02-08 12:05:16 +01:00
Kai Yang bcf9457abb [Java] fix test hang occasionally when running FailureTest (#13934) 2021-02-08 18:21:50 +08:00
Xianyang Liu 918ad84f08 [core] Java worker should respect the user provided node_ip_address (#13732) 2021-02-08 11:59:06 +08:00
Richard Liaw 7231b6b91c [core/client] enable more tests (#13961) 2021-02-07 19:37:52 -08:00
Richard Liaw 3a230fa1a4 [ray_client] close ray connection upon client deactivation (#13919) 2021-02-07 13:11:38 -08:00
Kai Yang 4b4941435d [Java] fix actor restart failure when multi-worker is turned on (#13793) 2021-02-07 21:12:54 +08:00
Devin Petersohn 1412f3c546 [docs] page for using Modin with Ray (#13937)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-02-06 00:28:04 -08:00
Clark Zinzow f070b3c9a9 [dask-on-ray] Fix Dask-on-Ray test: Python 3 dictionary .values() is a view, and is not indexable (#13945) 2021-02-05 21:21:41 -08:00
Simon Mo ea4154df80 [Hotfix] Master compilation error on MacOS. (#13946) 2021-02-05 16:07:45 -08:00
Travis Addair cbd3598970 [tune] Fixed wait_for_gpu to handle str representations of ordinal IDs (#13936)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-02-05 15:41:24 -08:00
Hao Chen e1a5e5bad4 Fix test_actor_restart (#13901) 2021-02-05 14:08:43 -08:00
Simon Mo 4a3dd6858d Buildkite determine-to-run support (#13866) 2021-02-05 12:58:07 -08:00
Amog Kamsetty f44f368eae [Tune] Add try-except to FailureInjectorCallback (#13939) 2021-02-05 11:02:42 -08:00
Eric Liang f782ed59a0 Ray client version check strict eq (#13926) 2021-02-05 00:06:10 -08:00
fyrestone eee624cf5f Revert "Fix passing env on windows (#13253)" (#13828) 2021-02-05 13:03:16 +08:00
fangfengbin 8a5999c12a [GCS]Fix bug that gcs client does not set last_resource_usage_ (#13856) 2021-02-05 11:51:25 +08:00
DK.Pino fb89f9c2c8 [Placement Group] Support named placement group (#13755) 2021-02-05 11:04:51 +08:00
Dmitri Gekhtman 40bad86c7a [hotfix][test][windows] Exclude k8s operator mock test from build. (#13924) 2021-02-04 18:35:10 -08:00