Commit Graph

3712 Commits

Author SHA1 Message Date
architkulkarni e89bbcbd44 [Serve] Revert "Revert "[Serve] Fix ServeHandle serialization"" and disable failing Windows test (#13771) 2021-02-04 14:50:01 -08:00
Edward Oakes 7af0c999f3 [serve] Built-in support for imported backends (#13867) 2021-02-04 15:09:12 -06:00
Dmitri Gekhtman db59736b1a [autoscaler][kubernetes] Add ability to not copy cluster config to head node when calling create_or_update_head_node. (#13720)
* Add option to skip bootstrapping head node autoscaling config

* don't close remote config before copying

* Type

* Type hints etc.

* test

* Test CR to config conversion

* comment
2021-02-04 10:30:03 -08:00
Richard Liaw 0fc81e2393 [tune] fix gpu check (#13825)
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2021-02-04 01:13:58 -08:00
Eric Liang e79a380a7e Check in shuffle code as experimental (#13899) 2021-02-04 00:24:16 -08:00
Clark Zinzow 243f678ffd Fall back to random port instead of default port for non-primary Redis shards; attempt to cluster Redis shard ports close to each other. (#13847) 2021-02-03 22:00:15 -08:00
Tao Wang 44aa9c173f Rename timeout to period with heartbeat interval (#13872) 2021-02-04 10:37:28 +08:00
Dmitri Gekhtman 1187d1dd3e [autoscaler][kubernetes][operator] Rudimentary error handling, make "MODIFIED" -> update event work. (#13756) 2021-02-03 20:07:11 -06:00
Eric Liang e8fce9f1f3 Check Ray client protocol version (#13886)
* wip

* wip

* fix tests
2021-02-03 16:44:09 -08:00
SangBin Cho cb9fa90203 [Object Spilling] Add consumed bytes to detect thrashing. (#13853) 2021-02-03 14:16:26 -08:00
Barak Michener 77ee2c569f [ray_client] convert things registered for ray into ray_client (#13639) 2021-02-03 13:30:05 -08:00
Alex Wu f14171ced9 [Core] Put raylet ip's in resource usage report (#13871)
* .

* done?

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-02-03 11:28:56 -08:00
Gabriele Oliaro 79310452e7 Enabling the cancellation of non-actor tasks in a worker's queue 2 (#13244)
* wrote code to enable cancellation of queued non-actor tasks

* minor changes

* bug fixes

* added comments

* rev1

* linting

* making ActorSchedulingQueue::CancelTaskIfFound raise a fatal error

* bug fix

* added two unit tests

* linting

* iterating through pending_normal_tasks starting from end

* fixup! iterating through pending_normal_tasks starting from end

* fixup! fixup! iterating through pending_normal_tasks starting from end

* post merge fixes

* added debugging instructions, pulled Accept() out of guarded loop

* removed debugging instructions, linting

* first commit

* lint

* lint

* added hack to avoid race condition in test stress

* moved hack

* fix test cancel

* removed hack (hopefully no longer needed)

* Revert "removed hack (hopefully no longer needed)"

This reverts commit 99d0e7c91539f290700f50aaaed805dcde04a5ee.

* added sleep in mock_worker.cc

* sleep function fixup to work on windows

* sleep in test_fast both for force=true and force=false

* linting

Co-authored-by: Ian <ian.rodney@gmail.com>
2021-02-03 10:20:12 -08:00
Edward Oakes a695c651ee [serve] Small cleanups for BackendState (#13870) 2021-02-03 11:46:25 -06:00
Ameer Haj Ali 2a903b904a [joblib] Log once the context warning argument. (#13865)
Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
2021-02-03 00:23:20 -08:00
Eric Liang d335ce2aab Move the tune driver into a remote task (#13778) 2021-02-02 18:41:45 -08:00
Edward Oakes fc956e084a [Hotfix] Lint (#13864) 2021-02-02 12:56:50 -08:00
James 863c1b8282 Add podman support (#13633) 2021-02-02 11:09:43 -08:00
Edward Oakes a6138ca31f [serve] Support batches for ImportedBackends (#13843) 2021-02-02 09:44:01 -06:00
Kai Fricke d29fcfb45c [tune] catch SIGINT signal and trigger experiment checkpoint (#13767)
* [tune] catch SIGINT signal and trigger experiment checkpoint

* Apply suggestions from code review

* Fix user guide docs

* Update doc/source/tune/user-guide.rst
2021-02-02 14:52:09 +01:00
Eric Liang fa4290090d Add Ray client protocol version (#13846) 2021-02-02 00:19:08 -08:00
Eric Liang 26beb3b67b Revert "Revert "Enable Ray client server by default (#13350)" (#13429)" (#13442)
* Revert "Revert "Enable Ray client server by default (#13350)" (#13429)"

This reverts commit 560299972c.

* fix job id collision with ray client server
2021-02-02 00:17:29 -08:00
Eric Liang 88ab887cc4 Unconditionally retry all RPC errors on client connect (#13845)
* wip

* Update python/ray/util/client/worker.py

Co-authored-by: fangfengbin <869218239a@zju.edu.cn>
2021-02-02 00:10:35 -08:00
SangBin Cho 886217c333 [Object Spilling] Skip normal ray.get path when spilling objects. (#13831) 2021-02-01 16:03:34 -08:00
Eric Liang e4d30430c0 Fix naming of ray_spilled_objects directory 2021-02-01 15:46:40 -08:00
Barak Michener 26ba95e96d [python/ray]: add cloudpickle dependency (#13838)
Change-Id: I248a2174c27cacb84a1cf0fd1feaa99535a90b71
2021-02-01 15:27:39 -08:00
Ian Rodney 1ee5d5faff [AWS] Fill-in AMI if not provided (#13808)
* fill in default ami if not provided

* lint fix

* quick test

* Update python/ray/tests/aws/test_autoscaler_aws.py

* Update python/ray/tests/aws/test_autoscaler_aws.py

* fix test

* fix tests

* fix lint

* remove bad test

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
2021-02-01 14:30:48 -08:00
Barak Michener 55566bc797 [ray_client]: Add python version check and test (and some minor fixes along the way) (#13722) 2021-02-01 13:04:38 -08:00
SongGuyang 361e5f0bef support dynamic library loading in C++ worker (#13734) 2021-02-01 19:24:33 +08:00
Ameer Haj Ali 9d7b8b58a2 [autoscaler] Remove min workers from multi node type examples (#13814)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* smallf fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* remove global min_workers from mult-node-type-examples

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
2021-01-31 23:29:57 -08:00
SangBin Cho d1ec787d9d [Object Spilling] Turn on by default. (#13745)
* Done.

* in progress.

* in progress.

* fixed tests.

* Fix.
2021-01-31 23:28:37 -08:00
Amog Kamsetty 2ba77ae3a2 [Release] Fix SGD+Tune long running distributed release test (#13812)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-01-31 21:05:50 -08:00
Ameer Haj Ali 660857ffab Fix windows test (#13811) 2021-01-29 21:10:59 -08:00
Simon Mo 194656731d [CI] Deflake test_basics and skip test_component_failures_3 (#13801) 2021-01-29 15:47:21 -08:00
Simon Mo 50808024eb Revert "[autoscaler] Better validation for min_workers and max_workers (#13779)" (#13807)
This reverts commit 4d6817c683.
2021-01-29 15:43:01 -08:00
Barak Michener 9441f85e1a [client] Hook runtime context (#13750)
Change-Id: I701d21e53900b5f3fb0e23e09f59e8316c7ba623
2021-01-29 12:58:41 -08:00
Siyuan (Ryans) Zhuang 0b598c0f05 [Serialization] API for deregistering serializers; code & doc cleanup (#13471)
* make methods private, remove confusion brackets and usages

* unregister serializer; fix doc

* Cleanup doc

* rename unregister -> deregister
2021-01-29 10:27:05 -08:00
Eric Liang b20a38febb [autoscaler] Avoid launching GPU nodes when the workload only has CPU tasks. (#13776)
* wip

* avoid gpus

* update

* update
2021-01-29 09:50:28 -08:00
Ameer Haj Ali 4d6817c683 [autoscaler] Better validation for min_workers and max_workers (#13779)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* smallf fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* deflake test_joblib

* lint

* placement groups bypass

* remove space

* Eric

* first ocmmit

* lint

* exmaple

* documentation

* hmm

* file path fix

* fix test

* some format issue in docs

* modified docs

* joblib strikes again on windows

* add ability to not start autoscaler/monitor

* a

* remove worker_default

* Remove default pod type from operator

* Remove worker_default_node_type from rewrite_legacy_yaml_to_availble_node_types

* deprecate useless fields

* fix error msg

* validate sum min_workers < max_workers

* 1 more edge case test

* lint

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal>
Co-authored-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
2021-01-29 09:41:56 -08:00
Kai Fricke 9a413144b1 [tune] dynamic global checkpointing interval (#13736)
* Add scalability tests

* Move experiment checkpointing into a manager class

* Dynamic global checkpointing

* Actually write checkpoints

* Remove debug message

* Pass `force`

* Pre-review

* Revert scalability commits

* Revert scalability commits

* Apply suggestions from code review
2021-01-29 17:14:46 +01:00
Stephanie Wang 42d501d747 [core] Pin arguments during task execution (#13737)
* tmp

* Pin task args

* unit tests

* update

* test

* Fix
2021-01-28 19:07:10 -08:00
Ian Rodney 813a7ab0e2 [docker] Build Python3.6 & Python3.8 Docker Images (#13548) 2021-01-28 15:24:50 -08:00
architkulkarni cb771f263d [Serve] Add ServeHandle metrics (#13640) 2021-01-28 14:40:47 -06:00
Lena Kashtelyan c583113d66 [Ax] Align optimization mode and reported SEM with Ax (#13611)
* [Ax] Align optimization mode and reported SEM with Ax

Ensure that `mode` aligns with the mode set in Ax + report SEM as None rather than as 0.0 to make use of Ax noise inference

* Account for review

* Update ax.py

* Fix lint

* Fix tests, ad additional checks

* Fix tests for python 3.6

Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-01-28 19:01:51 +01:00
Yuri Rocha b01b0f80aa [RLlib] Fix multiple Unity3DEnvs trying to connect to the same custom port (#13519) 2021-01-28 13:28:08 +01:00
architkulkarni cb95ff1e56 [Serve] Add "endpoint registered" message to router log (#13752) 2021-01-27 19:03:15 -08:00
Simon Mo c10abbb1bb Revert "[Serve] Fix ServeHandle serialization (#13695)" (#13753)
This reverts commit 202fbdf38c.
2021-01-27 17:47:42 -08:00
Eric Liang 2e01d5d26e Report failed deserialization of errors in Ray client 2021-01-27 17:37:50 -08:00
Zhe Zhang 0e7343ec19 [docs] Fix MLflow / Tune example in documentation (#13740)
Minor fixes to make it runnable
2021-01-27 17:16:29 -08:00
Dmitri Gekhtman 40234ad631 [autoscaler][AWS] Make sure subnets belong to same VPC as user-specified security groups (#13558)
* initial commit

* Filter subnets by security groups' VPCs

* fix stubs

* wip

* Fix inbound rule logic. Tests WIP.

* wip

* unit test

* example yaml

* Unit test tests for bug being fixed

* Update python/ray/tests/aws/utils/constants.py

Co-authored-by: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com>
2021-01-27 17:00:52 -08:00