97 Commits

Author SHA1 Message Date
SangBin Cho 8223a33bff [Logging] Log rotation on all components (#12101)
* In Progress.

* Done.

* Fix the issue.

* Add wait for condition because logs are not written right away now.

* debug string.

* lint.

* Fix flaky test.

* Fix issues.

* Fix test.

* lint.
2020-11-30 19:03:55 -08:00
Tao Wang b85c6abc3e Rename fields/variables from client id to node id (#12457) 2020-11-30 14:33:36 +08:00
SangBin Cho f56d7c1a76 [Logging] Remove per worker job log file / support worker log rotation (#11927)
* In progress.

* MVP done.

* In Progress.

* Remove unnecessary code.

* Fix some issues.

* Fix test failures.

* Addressed code review + fix object spilling test failure.
2020-11-16 11:29:43 -08:00
Gekho457 ad639f12d8 [autoscaler/k8s] Preliminary k8s operator (#11929) 2020-11-12 11:58:02 -06:00
Ameer Haj Ali 8d74a04a42 [autoscaler] Flag flip for resource_demand_scheduler should take into account queue (#11615) 2020-11-02 12:41:22 -08:00
Eric Liang f9f372c327 [autoscaler] Clean up monitoring loop code (#11677) 2020-10-30 13:48:43 -07:00
Tao Wang 1d5694ddea [GCS]Use direct getting instead of pub-sub to update load metrics in monitor.py (#11339) 2020-10-28 11:23:18 -07:00
Alex Wu 7466ce82df [Autoscaler] Placement group autoscaling (#11243) 2020-10-14 13:11:46 -07:00
Alex Wu 175fc41fbc [Autoscaler] Account for resource backlog size (#11261) 2020-10-12 09:43:48 -07:00
Tao Wang 0dcfa9ed6c Add light heartbeat flag in python and use it in load metrics (#11032) 2020-09-30 11:39:28 -07:00
Eric Liang 609c1b8acd Start moving ray internal files to _private module (#10994) 2020-09-24 22:46:35 -07:00
Eric Liang 6a227ae501 [autoscaler] Split autoscaler interface public private (#10898) 2020-09-18 18:16:23 -07:00
Richard Liaw ed5de89470 FIX: Lint (#10384) 2020-08-27 17:56:39 -07:00
Alex Wu 7dbc1f439c [hotfix] Autoscaler monitor fix unit tests 2020-08-27 14:26:41 -07:00
Alex Wu 6d2af33a01 [Autoscaler] Proper resource demand plumbing (#10329) 2020-08-26 23:36:01 -07:00
SangBin Cho 92664249e8 Partially Use f string (#10218)
* flynt. trial 1.

* Trial 1.

* Addressed code review.
2020-08-20 18:21:16 -07:00
Alex Wu 4b14bf85e4 [Autoscaler] Resource demand vector (heartbeat -> autoscaler plumbing) (#10127) 2020-08-17 13:57:15 -07:00
Tao Wang 44ccca1acb Only update raylet map when autoscaler configured (#9435) 2020-07-27 11:23:06 +08:00
Tao Wang f7ac495a68 [Core] Use map instead of list to represent resources in heartbeat message (#9294) 2020-07-05 10:59:25 +08:00
Eric Liang 0ff24ec8dc Add "ray status" debug tool for autoscaler. (#9091) 2020-06-24 18:22:03 -07:00
mehrdadn f68183d778 Error-checking for a couple of corruption issues (#8059)
* Extra error handling
* Handle connection closed in Redis monitor
Co-authored-by: Mehrdad <noreply@github.com>
2020-06-07 15:43:00 +02:00
Eric Liang a24d117c68 [autoscaler] Refactor code in preparation for multi instance type support (#8632)
* wip refactor

* add util

* wip

* fix

* fix

* remove

* remove extraneous string type for sg
2020-06-03 12:53:55 -07:00
SangBin Cho 7c43991100 [GCS] Monitor.py bug fix (#8725)
* comment.

* Fix bugs.

* Used pubsub message instead.

* Added a ray.actors test
2020-06-02 16:06:36 -07:00
fangfengbin 016337d4eb Heartbeat table uses gcs pub-sub instead of redis accessor (#8655) 2020-05-30 23:17:25 +08:00
mehrdadn ebf060d484 Make more tests run on Windows (#8446)
* Remove worker Wait() call due to SIGCHLD being ignored

* Port _pid_alive to Windows

* Show PID as well as TID in glog

* Update TensorFlow version for Python 3.8 on Windows

* Handle missing Pillow on Windows

* Work around dm-tree PermissionError on Windows

* Fix some lint errors on Windows with Python 3.8

* Simplify torch requirements

* Quiet git clean

* Handle finalizer issues

* Exit with the signal number

* Get rid of wget

* Fix some Windows compatibility issues with tests

Co-authored-by: Mehrdad <noreply@github.com>
2020-05-20 12:25:04 -07:00
Robert Nishihara b011c604d7 Remove ray.tasks() from API. (#7807) 2020-04-01 10:10:40 -05:00
Edward Oakes 7b609ca211 Remove instances of 'raise Exception' (#7523) 2020-03-10 17:51:22 -07:00
Eric Liang 5df801605e Add ray.util package and move libraries from experimental (#7100) 2020-02-18 13:43:19 -08:00
Daniel Edgecumbe e516c50745 [autoscaler]: Kill workers if the monitor raises an exception (#3977)
Co-authored-by: CJosephides <cjosephides@gmail.com>
2020-01-23 14:12:52 -06:00
Sven 60d4d5e1aa Remove future imports (#6724)
* Remove all __future__ imports from RLlib.

* Remove (object) again from tf_run_builder.py::TFRunBuilder.

* Fix 2xLINT warnings.

* Fix broken appo_policy import (must be appo_tf_policy)

* Remove future imports from all other ray files (not just RLlib).

* Remove future imports from all other ray files (not just RLlib).

* Remove future import blocks that contain `unicode_literals` as well.
Revert appo_tf_policy.py to appo_policy.py (belongs to another PR).

* Add two empty lines before Schedule class.

* Put back __future__ imports into determine_tests_to_run.py. Fails otherwise on a py2/print related error.
2020-01-09 00:15:48 -08:00
Robert Nishihara 39a3459886 Remove (object) from class declarations. (#6658) 2020-01-02 17:42:13 -08:00
Edward Oakes fc56872012 Send active object IDs to the raylet (#5803)
* Send active object IDs to the raylet

* comment

* comments

* dedup

* signed int in config

* comments

* Remove object ID from monitor

* Fix test

* re-add check

* fix cast

* check if core worker

* Add comment

* Reservoir sampling

* Fix lint

* Pointer return

* tmp

* Fix merge

* Initialize object ids properly

* Fix lint
2019-10-20 22:05:28 -07:00
Eric Liang 2fdefe19b7 Take into account queue length in autoscaling (#5684) 2019-09-11 11:31:35 -07:00
micafan b3bcf59148 Rename ClientTableData to GcsNodeInfo (#5251) 2019-07-30 11:22:47 +08:00
Daniel Edgecumbe 06fec63c87 [autoscaler] Add a 'request_cores' function for manual autoscaling (#4754) 2019-07-26 17:14:45 -07:00
Richard Liaw 3e0ad11ae0 Add heartbeat test + Fix monitor.py (#5191) 2019-07-16 21:59:48 -07:00
Philipp Moritz c5253cc300 Add job table to state API (#5076) 2019-07-06 00:05:48 -07:00
Qing Wang 62e4b591e3 [ID Refactor] Rename DriverID to JobID (#5004)
* WIP

WIP

WIP

Rename Driver -> Job

Fix compilation

Fix

Rename in Java

In py

WIP

Fix

WIP

Fix

Fix test

Fix

Fix C++ linting

Fix

* Update java/runtime/src/main/java/org/ray/runtime/config/RayConfig.java

Co-Authored-By: Stephanie Wang <swang@cs.berkeley.edu>

* Update src/ray/core_worker/core_worker.cc

Co-Authored-By: Stephanie Wang <swang@cs.berkeley.edu>

* Address comments

* Fix

* Fix CI

* Fix cpp linting

* Fix py lint

* Fix

* Address comments and fix

* Address comments

* Address

* Fix import_threading
2019-06-28 00:44:51 +08:00
Daniel Edgecumbe 49c6e81de2 autoscaler/monitor: Kill workers on exception (#4997) 2019-06-26 17:59:12 -07:00
Hao Chen 0131353d42 [gRPC] Migrate gcs data structures to protobuf (#5024) 2019-06-25 14:31:19 -07:00
Yuhong Guo 5eff47b657 [C++] Add hash table to Redis-Module (#4911) 2019-06-07 16:11:37 +08:00
Robert Nishihara 6703519144 Move global state API out of global_state object. (#4857) 2019-05-26 11:27:53 -07:00
Yuhong Guo 1a39fee9c6 Refactor ID Serial 1: Separate ObjectID and TaskID from UniqueID (#4776)
* Enable BaseId.

* Change TaskID and make python test pass

* Remove unnecessary functions and fix test failure and change TaskID to
16 bytes.

* Java code change draft

* Refine

* Lint

* Update java/api/src/main/java/org/ray/api/id/TaskId.java

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

* Update java/api/src/main/java/org/ray/api/id/BaseId.java

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

* Update java/api/src/main/java/org/ray/api/id/BaseId.java

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

* Update java/api/src/main/java/org/ray/api/id/ObjectId.java

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

* Address comment

* Lint

* Fix SINGLE_PROCESS

* Fix comments

* Refine code

* Refine test

* Resolve conflict
2019-05-22 14:46:30 +08:00
Romil Bhardwaj 0421cba4e8 Autoscaler hotfix for #4555. (#4653) 2019-05-08 14:50:52 -07:00
Si-Yuan dab99d26af Improve code related to node (#4383)
* Make full use of node

implement local node

fix bugs mentioned in comments

* Add more tests

* Use more specific exception handling

* fix, lint

* fix for py2.x
2019-04-09 17:27:54 +08:00
Yuhong Guo c2349cf12d Remove local/global_scheduler from code and doc. (#4549) 2019-04-03 17:05:09 -07:00
Robert Nishihara ef527f84ab Stream logs to driver by default. (#3892)
* Stream logs to driver by default.

* Fix from rebase

* Redirect raylet output independently of worker output.

* Fix.

* Create redis client with services.create_redis_client.

* Suppress Redis connection error at exit.

* Remove thread_safe_client from redis.

* Shutdown driver threads in ray.shutdown().

* Add warning for too many log messages.

* Only stop threads if worker is connected.

* Only stop threads if they exist.

* Remove unnecessary try/excepts.

* Fix

* Only add new logging handler once.

* Increase timeout.

* Fix tempfile test.

* Fix logging in cluster_utils.

* Revert "Increase timeout."

This reverts commit b3846b89040bcd8e583b2e18cb513cb040e71d95.

* Retry longer when connecting to plasma store from node manager and object manager.

* Close pubsub channels to avoid leaking file descriptors.

* Limit log monitor open files to 200.

* Increase plasma connect retries.

* Add comment.
2019-02-07 19:53:50 -08:00
Si-Yuan 9295ab8f60 Various Python code cleanups. (#3837) 2019-02-03 10:16:24 -08:00
Daniel Edgecumbe 315edab085 [autoscaler] Speedups (#3720)
- NodeUpdater gets its IP in parallel now (no longer in __init__)
- We use persistent connections in SSH (temp folder created only for ray; ControlMaster)
- hash_runtime_conf was performing a pointless hexlify step, wasting time on large files
- We use NodeUpdaterThreads and share the NodeProvider; NodeUpdaterProcess is removed
- AWSNodeProvider caches nodes more aggressively
- NodeProvider now has a shim batch terminate_nodes() call; AWSNodeProvider parallelises it; the autoscaler uses it
- AWSNodeProvider batches EC2 update_tags calls
- Logging changes throughout to provide standardised timing information for profiling
- Pulled out a few unnecessary is_running calls (NodeUpdater will loop waiting for SSH anyway)

## Related issue number
Issue #3599
2019-02-01 02:46:32 -08:00
Richard Liaw d128636bab Ray Logging Configuration (#3691)
* fix logging for autoscaler

* module logging

* try this for logging

* yapf

* fix

* Initial logging setup

* memory

* ok

* remove basicconfig

* catch

* remove package logging

* print

* fix

* try_fix

* fix 1

* revert rllib

* logging level

* flake8

* fix

* fix

* Remove vestigial TODO
2019-01-30 21:01:12 -08:00