7139 Commits

Author SHA1 Message Date
Alex Wu c8c20ca73c Scalability envelope readme typo 2021-02-02 19:34:15 -08:00
Eric Liang d335ce2aab Move the tune driver into a remote task (#13778) 2021-02-02 18:41:45 -08:00
fangfengbin b4684cf37a Fix bug that total_commands_queued_ is not initialized (#13852) 2021-02-03 10:00:15 +08:00
architkulkarni c8e1f07c52 remove starlette install instruction (#13869) 2021-02-02 14:37:55 -08:00
architkulkarni 32fc649f39 [serve] Add example code for custom status code response (#13868) 2021-02-02 16:30:45 -06:00
Edward Oakes fc956e084a [Hotfix] Lint (#13864) 2021-02-02 12:56:50 -08:00
James 863c1b8282 Add podman support (#13633) 2021-02-02 11:09:43 -08:00
Sven Mika 9ac731558b [RLlib] Unify fcnet initializers for the value output layer (std=1.0 in torch, but 0.01 in tf). (#13733) 2021-02-02 18:42:49 +01:00
Sven Mika 0a0d9183fe [RLlib] Trajectory view API example script (enhancements and tf2 support). (#13786) 2021-02-02 18:42:18 +01:00
Edward Oakes a6138ca31f [serve] Support batches for ImportedBackends (#13843) 2021-02-02 09:44:01 -06:00
Kai Fricke d29fcfb45c [tune] catch SIGINT signal and trigger experiment checkpoint (#13767)
* [tune] catch SIGINT signal and trigger experiment checkpoint

* Apply suggestions from code review

* Fix user guide docs

* Update doc/source/tune/user-guide.rst
2021-02-02 14:52:09 +01:00
Stanislav Chekmenev b9c15a2551 [RLlib] Issue #13761: Fix get action shape (#13764) 2021-02-02 13:13:43 +01:00
Raoul Khouri 714c367b9d [RLlib] Trainer._validate_config idempotency correction (issue 13427) (#13556) 2021-02-02 13:11:57 +01:00
QuantumMecha 0c93bb77cb [RLlib] Update Documentation for Curiosity's support of continuous actions (#13784)
Only (Multi)Discrete action spaces are supported so far according to https://github.com/ray-project/ray/blob/master/rllib/utils/exploration/curiosity.py
2021-02-02 13:10:09 +01:00
Sven Mika 52c94b7ee9 [RLlib] Allow SAC to use custom models as Q- or policy nets and deprecate "state-preprocessor" for image spaces. (#13522) 2021-02-02 13:05:58 +01:00
Eric Liang fa4290090d Add Ray client protocol version (#13846) 2021-02-02 00:19:08 -08:00
Eric Liang 26beb3b67b Revert "Revert "Enable Ray client server by default (#13350)" (#13429)" (#13442)
* Revert "Revert "Enable Ray client server by default (#13350)" (#13429)"

This reverts commit 560299972c.

* fix job id collision with ray client server
2021-02-02 00:17:29 -08:00
Eric Liang 88ab887cc4 Unconditionally retry all RPC errors on client connect (#13845)
* wip

* Update python/ray/util/client/worker.py

Co-authored-by: fangfengbin <869218239a@zju.edu.cn>
2021-02-02 00:10:35 -08:00
Eric Liang d71eeac2d6 remove lru evict docs (#13849) 2021-02-02 00:07:47 -08:00
SangBin Cho 886217c333 [Object Spilling] Skip normal ray.get path when spilling objects. (#13831) 2021-02-01 16:03:34 -08:00
Eric Liang e4d30430c0 Fix naming of ray_spilled_objects directory 2021-02-01 15:46:40 -08:00
Barak Michener 26ba95e96d [python/ray]: add cloudpickle dependency (#13838)
Change-Id: I248a2174c27cacb84a1cf0fd1feaa99535a90b71
2021-02-01 15:27:39 -08:00
Ian Rodney 1ee5d5faff [AWS] Fill-in AMI if not provided (#13808)
* fill in default ami if not provided

* lint fix

* quick test

* Update python/ray/tests/aws/test_autoscaler_aws.py

* Update python/ray/tests/aws/test_autoscaler_aws.py

* fix test

* fix tests

* fix lint

* remove bad test

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
2021-02-01 14:30:48 -08:00
Barak Michener 55566bc797 [ray_client]: Add python version check and test (and some minor fixes along the way) (#13722) 2021-02-01 13:04:38 -08:00
Stephanie Wang 754bee9282 [core][object spilling] Fix bugs in admission control (#13781) 2021-02-01 10:48:21 -08:00
SongGuyang 6e53a71978 bug fix for doc (#13834) 2021-02-01 21:13:43 +08:00
SongGuyang 361e5f0bef support dynamic library loading in C++ worker (#13734) 2021-02-01 19:24:33 +08:00
Tao Wang 1d2ab018b0 Use right reserve size (#13829) 2021-02-01 15:49:34 +08:00
Ameer Haj Ali 9d7b8b58a2 [autoscaler] Remove min workers from multi node type examples (#13814)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* small fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* remove global min_workers from multi-node-type examples

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
2021-01-31 23:29:57 -08:00
SangBin Cho d1ec787d9d [Object Spilling] Turn on by default. (#13745)
* Done.

* in progress.

* in progress.

* fixed tests.

* Fix.
2021-01-31 23:28:37 -08:00
Amog Kamsetty 2ba77ae3a2 [Release] Fix SGD+Tune long running distributed release test (#13812)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-01-31 21:05:50 -08:00
Lingxuan Zuo b5f0aed974 [Log] use default stderr logger if no raylog starting (#13762) 2021-02-01 11:13:06 +08:00
Ameer Haj Ali 660857ffab Fix windows test (#13811) 2021-01-29 21:10:59 -08:00
Dominic Ming 4b60c388ef [Dashboard] fix new dashboard entrance and some table problem (#13790) 2021-01-30 10:42:16 +08:00
Stephanie Wang 30f82329e3 [core] Add debug information for the PullManager and LocalObjectManager (#13782)
* Add debug info

* Formatting.

Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2021-01-29 17:55:46 -08:00
Simon Mo a3796b3ed5 [CI] Add other Travis Linux builds to buildkite (#13769) 2021-01-29 15:48:02 -08:00
Simon Mo 194656731d [CI] Deflake test_basics and skip test_component_failures_3 (#13801) 2021-01-29 15:47:21 -08:00
Simon Mo 50808024eb Revert "[autoscaler] Better validation for min_workers and max_workers (#13779)" (#13807)
This reverts commit 4d6817c683.
2021-01-29 15:43:01 -08:00
Barak Michener 9441f85e1a [client] Hook runtime context (#13750)
Change-Id: I701d21e53900b5f3fb0e23e09f59e8316c7ba623
2021-01-29 12:58:41 -08:00
SangBin Cho c21a79ae6e [Object Spilling] 100GB shuffle release test (#13729) 2021-01-29 12:38:06 -08:00
Ian Rodney 1a9a0024d5 [Wheel] Build Py36 & Py38 in separate deploy (#13797) 2021-01-29 12:28:40 -08:00
Siyuan (Ryans) Zhuang 0b598c0f05 [Serialization] API for deregistering serializers; code & doc cleanup (#13471)
* make methods private, remove confusion brackets and usages

* unregister serializer; fix doc

* Cleanup doc

* rename unregister -> deregister
2021-01-29 10:27:05 -08:00
Eric Liang b20a38febb [autoscaler] Avoid launching GPU nodes when the workload only has CPU tasks. (#13776)
* wip

* avoid gpus

* update

* update
2021-01-29 09:50:28 -08:00
Ameer Haj Ali 4d6817c683 [autoscaler] Better validation for min_workers and max_workers (#13779)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* small fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* deflake test_joblib

* lint

* placement groups bypass

* remove space

* Eric

* first commit

* lint

* example

* documentation

* hmm

* file path fix

* fix test

* some format issue in docs

* modified docs

* joblib strikes again on windows

* add ability to not start autoscaler/monitor

* a

* remove worker_default

* Remove default pod type from operator

* Remove worker_default_node_type from rewrite_legacy_yaml_to_availble_node_types

* deprecate useless fields

* fix error msg

* validate sum min_workers < max_workers

* 1 more edge case test

* lint

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal>
Co-authored-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
2021-01-29 09:41:56 -08:00
Kai Fricke 9a413144b1 [tune] dynamic global checkpointing interval (#13736)
* Add scalability tests

* Move experiment checkpointing into a manager class

* Dynamic global checkpointing

* Actually write checkpoints

* Remove debug message

* Pass `force`

* Pre-review

* Revert scalability commits

* Revert scalability commits

* Apply suggestions from code review
2021-01-29 17:14:46 +01:00
Hao Chen 0f3a3e14aa Only delete local object in CoreWorkerPlasmaStoreProvider::WarmupStore (#13788) 2021-01-29 20:24:09 +08:00
Dominic Ming 752da83bb7 [Dashboard] Add the new dashboard code and prompt users to try it (#11667) 2021-01-29 15:22:26 +08:00
Stephanie Wang 42d501d747 [core] Pin arguments during task execution (#13737)
* tmp

* Pin task args

* unit tests

* update

* test

* Fix
2021-01-28 19:07:10 -08:00
Ian Rodney 813a7ab0e2 [docker] Build Python3.6 & Python3.8 Docker Images (#13548) 2021-01-28 15:24:50 -08:00
Tanja Bayer 0c906a8b93 [Docker] usage of python-version (#13011)
Co-authored-by: Tanja Bayer <tanja.bayer@widas.de>
Co-authored-by: Ian Rodney <ian.rodney@gmail.com>
2021-01-28 14:27:54 -08:00