Commit Graph

105 Commits

Author SHA1 Message Date
Ameer Haj Ali d87a82e891 Revert "Revert "[Autoscaler] Monitor refactor for backward compatability. (#13970)" (#14046)" (#14050)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* smallf fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* Revert "Revert "[Autoscaler] Monitor refactor for backward compatability. (#13970)" (#14046)"

This reverts commit 6f9d39fb3e.

* fake news

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
2021-02-10 17:59:08 -08:00
architkulkarni 6f9d39fb3e Revert "[Autoscaler] Monitor refactor for backward compatability. (#13970)" (#14046)
This reverts commit 7a6f8054d1.
2021-02-10 12:16:52 -08:00
Ameer Haj Ali 7a6f8054d1 [Autoscaler] Monitor refactor for backward compatability. (#13970) 2021-02-09 21:41:50 -08:00
Eric Liang 8c8af2616e Minimal version of piping autoscaler events to driver logs (#13434) 2021-01-16 10:06:20 -08:00
Eric Liang 602c103eae Make request_resources() use internal kv instead of redis pub sub (#13410) 2021-01-13 17:30:43 -08:00
Alex Wu 8df94e33e0 [Autoscaler] New output log format (#12772) 2020-12-23 12:02:55 -08:00
Gekho457 11ce1dc743 Ray cluster CRD and example CR + multi-ray-cluster operator (#12098) 2020-12-14 10:26:01 -06:00
Tao Wang 295b6e5ce4 Split heartbeat message (#12535)
* first

* xxx

* Split heartbeat message

* only report resource usage when changed

* Fix GetAllResourceUsage

* Fix report resource usage

* Increase default heartbeat interval

* regularize heartbeat interval in test case
2020-12-11 21:19:57 +08:00
SangBin Cho 8223a33bff [Logging] Log rotation on all components (#12101)
* In Progress.

* Done.

* Fix the issue.

* Add wait for condition because logs are not written right away now.

* debug string.

* lint.

* Fix flaky test.

* Fix issues.

* Fix test.

* lint.
2020-11-30 19:03:55 -08:00
Tao Wang b85c6abc3e Rename fields/variables from client id to node id (#12457) 2020-11-30 14:33:36 +08:00
SangBin Cho f56d7c1a76 [Logging] Remove per worker job log file / support worker log rotation (#11927)
* In progress.

* MVP done.

* In Progress.

* Remove unnecessay code.

* Fix some issues.

* Fix test failures.

* Addressed code review + fix object spilling test failure.
2020-11-16 11:29:43 -08:00
Gekho457 ad639f12d8 [autoscaler/k8s] Preliminary k8s operator (#11929) 2020-11-12 11:58:02 -06:00
Ameer Haj Ali 8d74a04a42 [autoscaler] Flag flip for resource_demand_scheduler should take into account queue (#11615) 2020-11-02 12:41:22 -08:00
Eric Liang f9f372c327 [autoscaler] Clean up monitoring loop code (#11677) 2020-10-30 13:48:43 -07:00
Tao Wang 1d5694ddea [GCS]Use direct getting instead of pub-sub to update load metrics in monitor.py (#11339) 2020-10-28 11:23:18 -07:00
Alex Wu 7466ce82df [Autoscaler] Placement group autoscaling (#11243) 2020-10-14 13:11:46 -07:00
Alex Wu 175fc41fbc [Autoscaler] Account for resource backlog size (#11261) 2020-10-12 09:43:48 -07:00
Tao Wang 0dcfa9ed6c Add light heartbeat flag in python and use it in load metrics (#11032) 2020-09-30 11:39:28 -07:00
Eric Liang 609c1b8acd Start moving ray internal files to _private module (#10994) 2020-09-24 22:46:35 -07:00
Eric Liang 6a227ae501 [autoscaler] Split autoscaler interface public private (#10898) 2020-09-18 18:16:23 -07:00
Richard Liaw ed5de89470 FIX: Lint (#10384) 2020-08-27 17:56:39 -07:00
Alex Wu 7dbc1f439c [hotfix] Autoscaler monitor fix unit tests 2020-08-27 14:26:41 -07:00
Alex Wu 6d2af33a01 [Autoscaler] Proper resource demand plumbing (#10329) 2020-08-26 23:36:01 -07:00
SangBin Cho 92664249e8 Partially Use f string (#10218)
* flynt. trial 1.

* Trial 1.

* Addressed code review.
2020-08-20 18:21:16 -07:00
Alex Wu 4b14bf85e4 [Autoscaler] Resource demand vector (hearbeat -> autoscaler plumbing) (#10127) 2020-08-17 13:57:15 -07:00
Tao Wang 44ccca1acb Only update raylet map when autoscaler configured (#9435) 2020-07-27 11:23:06 +08:00
Tao Wang f7ac495a68 [Core] Use map instead of list to represent resources in heartbeat message (#9294) 2020-07-05 10:59:25 +08:00
Eric Liang 0ff24ec8dc Add "ray status" debug tool for autoscaler. (#9091) 2020-06-24 18:22:03 -07:00
mehrdadn f68183d778 Error-checking for a couple of corruption issues (#8059)
* Extra error handling
* Handle connection closed in Redis monitor
Co-authored-by: Mehrdad <noreply@github.com>
2020-06-07 15:43:00 +02:00
Eric Liang a24d117c68 [autoscaler] Refactor code in preparation for multi instance type support (#8632)
* wip refactor

* add util

* wip

* fix

* fix

* remove

* remove extraneous string type for sg
2020-06-03 12:53:55 -07:00
SangBin Cho 7c43991100 [GCS] Monitor.py bug fix (#8725)
* comment.

* Fix bugs.

* Used pubsub message instead.

* Added a ray.actors test
2020-06-02 16:06:36 -07:00
fangfengbin 016337d4eb Heartbeat table uses gcs pub-sub instead of redis accessor (#8655) 2020-05-30 23:17:25 +08:00
mehrdadn ebf060d484 Make more tests run on Windows (#8446)
* Remove worker Wait() call due to SIGCHLD being ignored

* Port _pid_alive to Windows

* Show PID as well as TID in glog

* Update TensorFlow version for Python 3.8 on Windows

* Handle missing Pillow on Windows

* Work around dm-tree PermissionError on Windows

* Fix some lint errors on Windows with Python 3.8

* Simplify torch requirements

* Quiet git clean

* Handle finalizer issues

* Exit with the signal number

* Get rid of wget

* Fix some Windows compatibility issues with tests

Co-authored-by: Mehrdad <noreply@github.com>
2020-05-20 12:25:04 -07:00
Robert Nishihara b011c604d7 Remove ray.tasks() from API. (#7807) 2020-04-01 10:10:40 -05:00
Edward Oakes 7b609ca211 Remove instances of 'raise Exception' (#7523) 2020-03-10 17:51:22 -07:00
Eric Liang 5df801605e Add ray.util package and move libraries from experimental (#7100) 2020-02-18 13:43:19 -08:00
Daniel Edgecumbe e516c50745 [autoscaler]: Kill workers if the monitor raises an exception (#3977)
Co-authored-by: CJosephides <cjosephides@gmail.com>
2020-01-23 14:12:52 -06:00
Sven 60d4d5e1aa Remove future imports (#6724)
* Remove all __future__ imports from RLlib.

* Remove (object) again from tf_run_builder.py::TFRunBuilder.

* Fix 2xLINT warnings.

* Fix broken appo_policy import (must be appo_tf_policy)

* Remove future imports from all other ray files (not just RLlib).

* Remove future imports from all other ray files (not just RLlib).

* Remove future import blocks that contain `unicode_literals` as well.
Revert appo_tf_policy.py to appo_policy.py (belongs to another PR).

* Add two empty lines before Schedule class.

* Put back __future__ imports into determine_tests_to_run.py. Fails otherwise on a py2/print related error.
2020-01-09 00:15:48 -08:00
Robert Nishihara 39a3459886 Remove (object) from class declarations. (#6658) 2020-01-02 17:42:13 -08:00
Edward Oakes fc56872012 Send active object IDs to the raylet (#5803)
* Send active object IDs to the raylet

* comment

* comments

* dedup

* signed int in config

* comments

* Remove object ID from monitor

* Fix test

* re-add check

* fix cast

* check if core worker

* Add comment

* Reservoir sampling

* Fix lint

* Pointer return

* tmp

* Fix merge

* Initialize object ids properly

* Fix lint
2019-10-20 22:05:28 -07:00
Eric Liang 2fdefe19b7 Take into account queue length in autoscaling (#5684) 2019-09-11 11:31:35 -07:00
micafan b3bcf59148 Rename ClientTableData to GcsNodeInfo (#5251) 2019-07-30 11:22:47 +08:00
Daniel Edgecumbe 06fec63c87 [autoscaler] Add a 'request_cores' function for manual autoscaling (#4754) 2019-07-26 17:14:45 -07:00
Richard Liaw 3e0ad11ae0 Add heartbeat test + Fix monitor.py (#5191) 2019-07-16 21:59:48 -07:00
Philipp Moritz c5253cc300 Add job table to state API (#5076) 2019-07-06 00:05:48 -07:00
Qing Wang 62e4b591e3 [ID Refactor] Rename DriverID to JobID (#5004)
* WIP

WIP

WIP

Rename Driver -> Job

Fix complition

Fix

Rename in Java

In py

WIP

Fix

WIP

Fix

Fix test

Fix

Fix C++ linting

Fix

* Update java/runtime/src/main/java/org/ray/runtime/config/RayConfig.java

Co-Authored-By: Stephanie Wang <swang@cs.berkeley.edu>

* Update src/ray/core_worker/core_worker.cc

Co-Authored-By: Stephanie Wang <swang@cs.berkeley.edu>

* Address comments

* Fix

* Fix CI

* Fix cpp linting

* Fix py lint

* FIx

* Address comments and fix

* Address comments

* Address

* Fix import_threading
2019-06-28 00:44:51 +08:00
Daniel Edgecumbe 49c6e81de2 autoscaler/monitor: Kill workers on exception (#4997) 2019-06-26 17:59:12 -07:00
Hao Chen 0131353d42 [gRPC] Migrate gcs data structures to protobuf (#5024) 2019-06-25 14:31:19 -07:00
Yuhong Guo 5eff47b657 [C++] Add hash table to Redis-Module (#4911) 2019-06-07 16:11:37 +08:00
Robert Nishihara 6703519144 Move global state API out of global_state object. (#4857) 2019-05-26 11:27:53 -07:00