7065 Commits

Author SHA1 Message Date
Alex Wu 1e800ab511 [Docs] RayDP Documentation (#14018)
* .

* done?

* Docs

* Docs

* Update raydp.rst

* Update raydp.rst

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-02-11 19:01:43 +00:00
Alex 1b1a2496ca gpu tests fail 2021-02-06 02:49:42 +00:00
Kai Fricke fc630813cd Update XGBoost release test configs 2021-02-06 01:32:46 +00:00
Alex 29fd4ca5a6 changes to autoscaler yamls 2021-02-06 01:24:56 +00:00
Alex 81f0796841 xgboost cpu small autoscaler yaml 2021-02-05 20:28:48 +00:00
Amog Kamsetty 189f38c22b [Tune] Add try-except to FailureInjectorCallback (#13939) 2021-02-05 19:41:04 +00:00
Alex 5f61ace191 autoscaler yaml for long running distributed 2021-02-05 19:40:02 +00:00
Alex 75886c8e78 Merge branch 'releases/1.2.0' of github.com:ray-project/ray into releases/1.2.0 2021-02-05 02:45:12 +00:00
Alex 40beec569c long running distributed fails 2021-02-05 02:44:45 +00:00
Alex c2a46846f2 long running distributed fails 2021-02-05 02:43:55 +00:00
Alex a0ff0defac scalability tests run 2021-02-05 01:18:50 +00:00
Amog Kamsetty 4c71f76b25 [Release] Fix SGD+Tune long running distributed release test (#13812)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-02-03 15:55:13 -08:00
Alex Wu 34e0dfe934 [Core] Put raylet ip's in resource usage report (#13871)
* .

* done?

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-02-03 15:54:46 -08:00
Alex ceb60965ae rllib regression 2021-01-30 04:32:18 +00:00
Alex 115afee4c3 stress tests done 2021-01-29 21:24:15 +00:00
Alex 9fd198635f stress tests done 2021-01-29 21:24:07 +00:00
Eric Liang b4d87b8fc5 Fix high CPU usage in object manager due to O(n^2) iteration over active pulls list (#13724) 2021-01-28 13:39:13 -08:00
Ian Rodney 5c2aedc7d9 [CLI] Fix Ray Status with ENV Variable set (#13707) 2021-01-28 13:29:44 -08:00
Simon Mo 942d603d7e [Core] Hotfix Windows Compilation Error for ClusterTaskManager (#13754)
* [Core] Hotfix Windows Compilation Error for ClusterTaskManager

* fix
2021-01-28 13:29:09 -08:00
Alex Wu 9a40d7b4ee [Core/Autoscaler] Properly clean up resource backlog from (#13727) 2021-01-28 13:28:37 -08:00
Alex Wu c589de6bc8 Version bump 2021-01-25 19:37:09 -08:00
Alex Wu 840987c7af Scalability Envelope Tests (#13464) 2021-01-25 18:48:31 -08:00
Simon Mo f2867b0609 [CI] Remove object_manager_test (#13703)
https://github.com/ray-project/ray/commit/0998d69968608012ca6cdd1ee166961df1aa0f0b
removed the object_manager_test.
2021-01-25 17:33:41 -08:00
Simon Mo fe8262afd0 Add K8s test to release process (#13694) 2021-01-25 16:53:52 -08:00
Simon Mo 8b8d6b984b [Buildkite] Add all Python tests (#13566) 2021-01-25 16:05:59 -08:00
dependabot[bot] 0d75f37c1f [tune](deps): Bump distributed in /python/requirements (#13643)
Bumps [distributed](https://github.com/dask/distributed) from 2020.12.0 to 2021.1.1.
- [Release notes](https://github.com/dask/distributed/releases)
- [Changelog](https://github.com/dask/distributed/blob/master/docs/release-procedure.md)
- [Commits](https://github.com/dask/distributed/compare/2020.12.0...2021.01.1)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-01-26 00:03:38 +01:00
Amog Kamsetty 9feae90e3b skip test_spill (#13693) 2021-01-25 14:37:07 -08:00
Amog Kamsetty d96a9fa192 Revert "Revert "[dashboard] Fix RAY_RAYLET_PID KeyError on Windows (#12948)" (#13572)" (#13685)
This reverts commit c4a710369b.
2021-01-25 10:35:25 -08:00
Edward Oakes 1c77cc7e23 [docs] Remove API warning from mp.Pool (#13683) 2021-01-25 09:59:46 -08:00
Dmitri Gekhtman 79209110c5 [kubernetes][operator][hotfix] Dictionary fix (#13663) 2021-01-25 10:40:59 -06:00
Lingxuan Zuo f9f2bfa778 [Metric] Fix crashed when register metric view in multithread (#13485)
* Fix crashed when register metric view in multithread

* fix comments

* fix
2021-01-25 20:32:08 +08:00
DK.Pino db2c836587 [Placement Group] Move PlacementGroup public method to interface. (#13629) 2021-01-25 20:14:21 +08:00
Maltimore b4702de1c2 [RLlib] move evaluation to trainer.step() such that the result is properly logged (#12708) 2021-01-25 12:56:00 +01:00
Jan Blumenkamp 964689b280 [RLlib] Fix bug in ModelCatalog when using custom action distribution (#12846)
* return tuple returned from _get_multi_action_distribution when using custom action dict

* Always return dst_class and required_model_output_shape in _get_multi_action_distribution

* pass model config to _get_multi_action_distribution
2021-01-25 12:42:39 +01:00
Sven Mika 9423930bcc [RLlib] MAML: Add cartpole mass test for PyTorch. (#13679) 2021-01-25 12:32:41 +01:00
Kai Yang e9103eeb6d [Java] [Test] Move multi-worker config to ray.conf file (#13583) 2021-01-25 18:07:45 +08:00
Ameer Haj Ali 4dabf017ee Close #12031 (Autoscaler is overriding your resource for same quantity) (#13671) 2021-01-24 16:31:53 -08:00
SangBin Cho edbb2937d3 [Object Spilling] Multi node file spilling V2. (#13542)
* done.

* done.

* Fix a mistake.

* Ready.

* Fix issues.

* fix.

* Finished the first round of code review.

* formatting.

* In progress.

* Formatting.

* Addressed code review.

* Formatting

* Fix tests.

* fix bugs.

* Skip flaky tests for now.
2021-01-23 23:15:32 -08:00
Barak Michener e675e5b75a [ray_client]: Add more retry logic (#13478) 2021-01-23 23:11:39 -08:00
Ameer Haj Ali b7dd7ddb52 deprecate useless fields in the cluster yaml. (#13637)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* small fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* deflake test_joblib

* lint

* placement groups bypass

* remove space

* Eric

* first commit

* lint

* example

* documentation

* hmm

* file path fix

* fix test

* some format issue in docs

* modified docs

* joblib strikes again on windows

* add ability to not start autoscaler/monitor

* a

* remove worker_default

* Remove default pod type from operator

* Remove worker_default_node_type from rewrite_legacy_yaml_to_availble_node_types

* deprecate useless fields

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal>
Co-authored-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
2021-01-23 12:06:51 -08:00
Kai Fricke 17760e1510 [tune] update Optuna integration to 2.4.0 API (#13631)
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2021-01-23 00:32:37 -08:00
Qing Wang 8ef835ff03 Remove idle actor from worker pool. (#13523) 2021-01-23 13:57:30 +08:00
Amog Kamsetty 01d74af89d [horovod] Horovod+Ray Pytorch Lightning Accelerator (#13458) 2021-01-22 16:30:10 -08:00
Amog Kamsetty 25e1b78eed [Dependencies] Move requirements.txt to requirements directory. (#13636) 2021-01-22 16:29:05 -08:00
architkulkarni 0c3d9a3eaa [Metrics] Fix serialization for custom metrics (#13571) 2021-01-22 14:11:59 -06:00
Amog Kamsetty c4a710369b Revert "[dashboard] Fix RAY_RAYLET_PID KeyError on Windows (#12948)" (#13572)
This reverts commit ef6d859e9b.
2021-01-22 14:10:24 -06:00
Dmitri Gekhtman 7fec19dad2 [kubernetes][operator][minutiae] Backwards compatibility of operator (#13623) 2021-01-22 14:07:25 -06:00
Sven Mika d629292d63 [RLlib] Add grad_clip config option to MARWIL and stabilize grad clipping against inf global_norms. (#13634) 2021-01-22 19:36:02 +01:00
architkulkarni da5928304a [Metrics] Cache metrics ports in a file at each node (#13501)
* cache metric ports in a file at each node

* remove old assignment of export port

* lint

* lint

* move e2e test to top of file to avoid shutdown bug
2021-01-22 09:59:20 -08:00
Kai Yang 90f1e408de [Java] Add fetchLocal parameter in Ray.wait() (#13604) 2021-01-22 17:55:00 +08:00