447 Commits

Author SHA1 Message Date
Ian Rodney f6cfc44dbd [autoscaler] run setup commands with restart_only=True (#13836) 2021-02-10 20:17:20 -08:00
Dmitri Gekhtman 8ca0a32819 HotFix k8s autoscaling (#14024) 2021-02-09 22:34:24 -08:00
Alex Wu 1dcdfe9101 [autoscaler/dashboard] Publish resource usage in units of bytes (#14002) 2021-02-09 10:27:26 -08:00
Dmitri Gekhtman 081f3e5f07 [autoscaler][kubernetes] Ray client setup, example config simplification, example scripts. (#13920) 2021-02-08 20:00:34 -06:00
Ameer Haj Ali 1643bc5c4f Fix autoscaler wrong parameter names (#13966)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* smallf fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* improve code readability

* lint

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
2021-02-08 13:19:33 -08:00
Dmitri Gekhtman db59736b1a [autoscaler][kubernetes] Add ability to not copy cluster config to head node when calling create_or_update_head_node. (#13720)
* Add option to skip bootstrapping head node autoscaling config

* don't close remote config before copying

* Type

* Type hints etc.

* test

* Test CR to config conversion

* comment
2021-02-04 10:30:03 -08:00
Dmitri Gekhtman 1187d1dd3e [autoscaler][kubernetes][operator] Rudimentary error handling, make "MODIFIED" -> update event work. (#13756) 2021-02-03 20:07:11 -06:00
James 863c1b8282 Add podman support (#13633) 2021-02-02 11:09:43 -08:00
Ian Rodney 1ee5d5faff [AWS] Fill-in AMI if not provided (#13808)
* fill in default ami if not provided

* lint fix

* quick test

* Update python/ray/tests/aws/test_autoscaler_aws.py

* Update python/ray/tests/aws/test_autoscaler_aws.py

* fix test

* fix tests

* fix lint

* remove bad test

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
2021-02-01 14:30:48 -08:00
Ameer Haj Ali 9d7b8b58a2 [autoscaler] Remove min workers from multi node type examples (#13814)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* smallf fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* remove global min_workers from mult-node-type-examples

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
2021-01-31 23:29:57 -08:00
Ameer Haj Ali 660857ffab Fix windows test (#13811) 2021-01-29 21:10:59 -08:00
Simon Mo 50808024eb Revert "[autoscaler] Better validation for min_workers and max_workers (#13779)" (#13807)
This reverts commit 4d6817c683.
2021-01-29 15:43:01 -08:00
Eric Liang b20a38febb [autoscaler] Avoid launching GPU nodes when the workload only has CPU tasks. (#13776)
* wip

* avoid gpus

* update

* update
2021-01-29 09:50:28 -08:00
Ameer Haj Ali 4d6817c683 [autoscaler] Better validation for min_workers and max_workers (#13779)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* smallf fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* deflake test_joblib

* lint

* placement groups bypass

* remove space

* Eric

* first ocmmit

* lint

* exmaple

* documentation

* hmm

* file path fix

* fix test

* some format issue in docs

* modified docs

* joblib strikes again on windows

* add ability to not start autoscaler/monitor

* a

* remove worker_default

* Remove default pod type from operator

* Remove worker_default_node_type from rewrite_legacy_yaml_to_availble_node_types

* deprecate useless fields

* fix error msg

* validate sum min_workers < max_workers

* 1 more edge case test

* lint

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal>
Co-authored-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
2021-01-29 09:41:56 -08:00
Dmitri Gekhtman 40234ad631 [autoscaler][AWS] Make sure subnets belong to same VPC as user-specified security groups (#13558)
* initial commit

* Filter subnets by security groups' VPCs

* fix stubs

* wip

* Fix inbound rule logic. Tests WIP.

* wip

* unit test

* example yaml

* Unit test tests for bug being fixed

* Update python/ray/tests/aws/utils/constants.py

Co-authored-by: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com>

Co-authored-by: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com>
2021-01-27 17:00:52 -08:00
Ameer Haj Ali b7dd7ddb52 deprecate useless fields in the cluster yaml. (#13637)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* smallf fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* deflake test_joblib

* lint

* placement groups bypass

* remove space

* Eric

* first ocmmit

* lint

* exmaple

* documentation

* hmm

* file path fix

* fix test

* some format issue in docs

* modified docs

* joblib strikes again on windows

* add ability to not start autoscaler/monitor

* a

* remove worker_default

* Remove default pod type from operator

* Remove worker_default_node_type from rewrite_legacy_yaml_to_availble_node_types

* deprecate useless fields

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal>
Co-authored-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
2021-01-23 12:06:51 -08:00
Ameer Haj Ali 1fbb752f42 [autoscaler] remove worker_default_node_type that is useless. (#13588) 2021-01-21 17:04:38 -08:00
Nikita Vemuri 4e01a9ec38 [Autoscaler] Ensure ubuntu is owner of docker host mount folder (#13579)
* change ownership to ubuntu if root

* use ssh user in cluster config

* formatting

Co-authored-by: Nikita Vemuri <nikitavemuri@Nikitas-MacBook-Pro.local>
2021-01-21 17:01:55 -08:00
Alex Wu b9ac3878ae [Autoscaler] Display node status tag in autsocaler status (#13561)
* .

* .

* .

* .

* .

* lint

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-01-20 19:20:54 -08:00
dmatch01 fd6882176a Fix for operator role definition to add raycluster/finalizer (#13567) 2021-01-20 13:02:02 -06:00
Dmitri Gekhtman 7b4a97c610 Make AWSNodeProvider.create_node return nodes created (#13498)
* Make AWSNodeProvider.create_node return node config

* return-dict

* Node provider interface create node return type Any

* Type clarification.

* Delete debug code

* Oops reset example-full changes

* Return type specified. GCP create node returns None.

* Article
2021-01-19 12:17:46 -08:00
Eric Liang 8c8af2616e Minimal version of piping autoscaler events to driver logs (#13434) 2021-01-16 10:06:20 -08:00
Dmitri Gekhtman 7e54911093 move message to debug (#13472) 2021-01-16 10:04:41 -08:00
Eric Liang ee6332dbb0 Bump dev branch to 2.0 to avoid endless version bump toil (#13497)
* wip

* fix

* fix
2021-01-15 17:41:17 -08:00
Ian Rodney 0ec9ddabc1 [docker/dashboard] Fix ray dashboard (#12899) 2021-01-15 10:03:01 -08:00
Micah Yong c89ebdd94a [Core][CLI] ray status and ray memory no longer starts a new job (#13391)
* Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init()

* Modify ray status cli so that it doesn't start a new job via ray.init()

* Remove local test file

* Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init()

* Modify ray status cli so that it doesn't start a new job via ray.init()

* Remove local test file

* Make status and error args required in commands.py#debug.status

* Remove unnecessary imports

* Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init()

* Modify ray status cli so that it doesn't start a new job via ray.init()

* Remove local test file

* Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init()

* Modify ray status cli so that it doesn't start a new job via ray.init()

* Remove local test file

* Make status and error args required in commands.py#debug.status

* Remove unnecessary imports

* Job 38482.1 should now pass

* Resolve merge conflict
2021-01-14 10:12:16 -08:00
Eric Liang 602c103eae Make request_resources() use internal kv instead of redis pub sub (#13410) 2021-01-13 17:30:43 -08:00
Ian Rodney 4aef3d6836 [docker] Pull if image is not present (#13136) 2021-01-07 17:17:00 -08:00
Ameer Haj Ali 44483f465c [autoscaler] Make placement groups bypass max launch limit (#13089) 2020-12-29 10:06:11 -08:00
Ian Rodney 7ad56826db [docker] Fix restart behavior with Docker (#12898)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: ijrsvt <ilr@anyscale.com>
2020-12-28 18:56:28 -08:00
Alex Wu 8df94e33e0 [Autoscaler] New output log format (#12772) 2020-12-23 12:02:55 -08:00
Ameer Haj Ali 5e2b850836 [autoscaler] Fixes max_workers bug. (#13008) 2020-12-21 10:30:03 -08:00
Ameer Haj Ali 11f34f72d8 [autoscaler] Do not count head node with min_workers constraint. (#12980) 2020-12-20 14:54:46 -08:00
Dmitri Gekhtman 4832b39066 Suggest mounting into home. Note non-root user. (#12987) 2020-12-19 16:09:24 -08:00
Alex Wu 404161a3ff [Autoscaler/Core] Remove autoscaler spam (#12952) 2020-12-18 18:22:45 -08:00
Gekho457 bff50cfc37 [k8s] Read gpu resources properly (#12942)
* Read gpu resources properly

* Comments and docstrings

* Comment formatting
2020-12-18 01:32:12 -08:00
Gekho457 82f9c7014e [K8s] Retry getting home directory in command runner. (#12925) 2020-12-17 09:41:48 -08:00
Richard Liaw a7caa14d3d [k8s] avoid bad error messages (#12871) 2020-12-15 15:00:02 -08:00
Max Fitton e077bc4206 [Release] Bump master to 1.2.0 for 1.1.0 release (#12856) 2020-12-15 09:40:26 -08:00
Gekho457 5a142d5bd6 Use nightly images in all kubernetes examples. (#12868) 2020-12-14 20:49:41 -08:00
Gekho457 11ce1dc743 Ray cluster CRD and example CR + multi-ray-cluster operator (#12098) 2020-12-14 10:26:01 -06:00
Eric Squires 9f70293700 Remove debug extras from setup.py (#12751) 2020-12-10 16:23:11 -06:00
Kai Yang e3b5deb741 [Multi-tenancy] Delete flag enable_multi_tenancy and remove old code path (#10573) 2020-12-10 19:01:40 +08:00
Ameer Haj Ali 2f8e308444 [autoscaler] LoadMetrics missed logger.debug (#12714) 2020-12-09 17:19:36 -08:00
Ian Rodney 19542c5eb0 [docker] Default to ray-ml image (#12703) 2020-12-09 11:49:16 -08:00
Alex Wu bd7e26b768 [Autoscaler] Temporarily suppress "Removed stale ip mappings" message. (#12689) 2020-12-08 21:55:10 -08:00
Ameer Haj Ali a4dbb271bd [hotfix][autoscaler] Request resources refactor2 (#12661)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* smallf fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* request_resources -> min workers

* test fixes

* add race condition tests

* Eric

* fixes

* semi final

* semi final

* lint

* lint

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
2020-12-08 18:41:30 -08:00
Gekho457 f61bc79a87 Dmitri/k8s command runner home try again (#12609) 2020-12-08 11:44:22 -06:00
Eric Liang 36e46ed923 Revert "[autoscaler/k8s] Use ray node's HOME in Kubernetes command runner. (#12417)" (#12607)
This reverts commit f669830de6.
2020-12-03 12:57:59 -08:00
Gekho457 f669830de6 [autoscaler/k8s] Use ray node's HOME in Kubernetes command runner. (#12417) 2020-12-03 10:43:16 -06:00