wassname/ray

mirror of https://github.com/wassname/ray.git synced 2026-07-02 13:57:49 +08:00

Author	SHA1	Message	Date
SangBin Cho	8223a33bff	[Logging] Log rotation on all components (#12101 ) * In Progress. * Done. * Fix the issue. * Add wait for condition because logs are not written right away now. * debug string. * lint. * Fix flaky test. * Fix issues. * Fix test. * lint.	2020-11-30 19:03:55 -08:00
Tao Wang	b85c6abc3e	Rename fields/variables from client id to node id (#12457 )	2020-11-30 14:33:36 +08:00
SangBin Cho	f56d7c1a76	[Logging] Remove per worker job log file / support worker log rotation (#11927 ) * In progress. * MVP done. * In Progress. * Remove unnecessay code. * Fix some issues. * Fix test failures. * Addressed code review + fix object spilling test failure.	2020-11-16 11:29:43 -08:00
Gekho457	ad639f12d8	[autoscaler/k8s] Preliminary k8s operator (#11929 )	2020-11-12 11:58:02 -06:00
Ameer Haj Ali	8d74a04a42	[autoscaler] Flag flip for resource_demand_scheduler should take into account queue (#11615 )	2020-11-02 12:41:22 -08:00
Eric Liang	f9f372c327	[autoscaler] Clean up monitoring loop code (#11677 )	2020-10-30 13:48:43 -07:00
Tao Wang	1d5694ddea	[GCS]Use direct getting instead of pub-sub to update load metrics in monitor.py (#11339 )	2020-10-28 11:23:18 -07:00
Alex Wu	7466ce82df	[Autoscaler] Placement group autoscaling (#11243 )	2020-10-14 13:11:46 -07:00
Alex Wu	175fc41fbc	[Autoscaler] Account for resource backlog size (#11261 )	2020-10-12 09:43:48 -07:00
Tao Wang	0dcfa9ed6c	Add light heartbeat flag in python and use it in load metrics (#11032 )	2020-09-30 11:39:28 -07:00
Eric Liang	609c1b8acd	Start moving ray internal files to _private module (#10994 )	2020-09-24 22:46:35 -07:00
Eric Liang	6a227ae501	[autoscaler] Split autoscaler interface public private (#10898 )	2020-09-18 18:16:23 -07:00
Richard Liaw	ed5de89470	FIX: Lint (#10384 )	2020-08-27 17:56:39 -07:00
Alex Wu	7dbc1f439c	[hotfix] Autoscaler monitor fix unit tests	2020-08-27 14:26:41 -07:00
Alex Wu	6d2af33a01	[Autoscaler] Proper resource demand plumbing (#10329 )	2020-08-26 23:36:01 -07:00
SangBin Cho	92664249e8	Partially Use f string (#10218 ) * flynt. trial 1. * Trial 1. * Addressed code review.	2020-08-20 18:21:16 -07:00
Alex Wu	4b14bf85e4	[Autoscaler] Resource demand vector (hearbeat -> autoscaler plumbing) (#10127 )	2020-08-17 13:57:15 -07:00
Tao Wang	44ccca1acb	Only update raylet map when autoscaler configured (#9435 )	2020-07-27 11:23:06 +08:00
Tao Wang	f7ac495a68	[Core] Use map instead of list to represent resources in heartbeat message (#9294 )	2020-07-05 10:59:25 +08:00
Eric Liang	0ff24ec8dc	Add "ray status" debug tool for autoscaler. (#9091 )	2020-06-24 18:22:03 -07:00
mehrdadn	f68183d778	Error-checking for a couple of corruption issues (#8059 ) * Extra error handling * Handle connection closed in Redis monitor Co-authored-by: Mehrdad <noreply@github.com>	2020-06-07 15:43:00 +02:00
Eric Liang	a24d117c68	[autoscaler] Refactor code in preparation for multi instance type support (#8632 ) * wip refactor * add util * wip * fix * fix * remove * remove extraneous string type for sg	2020-06-03 12:53:55 -07:00
SangBin Cho	7c43991100	[GCS] Monitor.py bug fix (#8725 ) * comment. * Fix bugs. * Used pubsub message instead. * Added a ray.actors test	2020-06-02 16:06:36 -07:00
fangfengbin	016337d4eb	Heartbeat table uses gcs pub-sub instead of redis accessor (#8655 )	2020-05-30 23:17:25 +08:00
mehrdadn	ebf060d484	Make more tests run on Windows (#8446 ) * Remove worker Wait() call due to SIGCHLD being ignored * Port _pid_alive to Windows * Show PID as well as TID in glog * Update TensorFlow version for Python 3.8 on Windows * Handle missing Pillow on Windows * Work around dm-tree PermissionError on Windows * Fix some lint errors on Windows with Python 3.8 * Simplify torch requirements * Quiet git clean * Handle finalizer issues * Exit with the signal number * Get rid of wget * Fix some Windows compatibility issues with tests Co-authored-by: Mehrdad <noreply@github.com>	2020-05-20 12:25:04 -07:00
Robert Nishihara	b011c604d7	Remove ray.tasks() from API. (#7807 )	2020-04-01 10:10:40 -05:00
Edward Oakes	7b609ca211	Remove instances of 'raise Exception' (#7523 )	2020-03-10 17:51:22 -07:00
Eric Liang	5df801605e	Add ray.util package and move libraries from experimental (#7100 )	2020-02-18 13:43:19 -08:00
Daniel Edgecumbe	e516c50745	[autoscaler]: Kill workers if the monitor raises an exception (#3977 ) Co-authored-by: CJosephides <cjosephides@gmail.com>	2020-01-23 14:12:52 -06:00
Sven	60d4d5e1aa	Remove future imports (#6724 ) * Remove all __future__ imports from RLlib. * Remove (object) again from tf_run_builder.py::TFRunBuilder. * Fix 2xLINT warnings. * Fix broken appo_policy import (must be appo_tf_policy) * Remove future imports from all other ray files (not just RLlib). * Remove future imports from all other ray files (not just RLlib). * Remove future import blocks that contain `unicode_literals` as well. Revert appo_tf_policy.py to appo_policy.py (belongs to another PR). * Add two empty lines before Schedule class. * Put back __future__ imports into determine_tests_to_run.py. Fails otherwise on a py2/print related error.	2020-01-09 00:15:48 -08:00
Robert Nishihara	39a3459886	Remove (object) from class declarations. (#6658 )	2020-01-02 17:42:13 -08:00
Edward Oakes	fc56872012	Send active object IDs to the raylet (#5803 ) * Send active object IDs to the raylet * comment * comments * dedup * signed int in config * comments * Remove object ID from monitor * Fix test * re-add check * fix cast * check if core worker * Add comment * Reservoir sampling * Fix lint * Pointer return * tmp * Fix merge * Initialize object ids properly * Fix lint	2019-10-20 22:05:28 -07:00
Eric Liang	2fdefe19b7	Take into account queue length in autoscaling (#5684 )	2019-09-11 11:31:35 -07:00
micafan	b3bcf59148	Rename ClientTableData to GcsNodeInfo (#5251 )	2019-07-30 11:22:47 +08:00
Daniel Edgecumbe	06fec63c87	[autoscaler] Add a 'request_cores' function for manual autoscaling (#4754 )	2019-07-26 17:14:45 -07:00
Richard Liaw	3e0ad11ae0	Add heartbeat test + Fix monitor.py (#5191 )	2019-07-16 21:59:48 -07:00
Philipp Moritz	c5253cc300	Add job table to state API (#5076 )	2019-07-06 00:05:48 -07:00
Qing Wang	62e4b591e3	[ID Refactor] Rename DriverID to JobID (#5004 ) * WIP WIP WIP Rename Driver -> Job Fix complition Fix Rename in Java In py WIP Fix WIP Fix Fix test Fix Fix C++ linting Fix * Update java/runtime/src/main/java/org/ray/runtime/config/RayConfig.java Co-Authored-By: Stephanie Wang <swang@cs.berkeley.edu> * Update src/ray/core_worker/core_worker.cc Co-Authored-By: Stephanie Wang <swang@cs.berkeley.edu> * Address comments * Fix * Fix CI * Fix cpp linting * Fix py lint * FIx * Address comments and fix * Address comments * Address * Fix import_threading	2019-06-28 00:44:51 +08:00
Daniel Edgecumbe	49c6e81de2	autoscaler/monitor: Kill workers on exception (#4997 )	2019-06-26 17:59:12 -07:00
Hao Chen	0131353d42	[gRPC] Migrate gcs data structures to protobuf (#5024 )	2019-06-25 14:31:19 -07:00
Yuhong Guo	5eff47b657	[C++] Add hash table to Redis-Module (#4911 )	2019-06-07 16:11:37 +08:00
Robert Nishihara	6703519144	Move global state API out of global_state object. (#4857 )	2019-05-26 11:27:53 -07:00
Yuhong Guo	1a39fee9c6	Refactor ID Serial 1: Separate ObjectID and TaskID from UniqueID (#4776 ) * Enable BaseId. * Change TaskID and make python test pass * Remove unnecessary functions and fix test failure and change TaskID to 16 bytes. * Java code change draft * Refine * Lint * Update java/api/src/main/java/org/ray/api/id/TaskId.java Co-Authored-By: Hao Chen <chenh1024@gmail.com> * Update java/api/src/main/java/org/ray/api/id/BaseId.java Co-Authored-By: Hao Chen <chenh1024@gmail.com> * Update java/api/src/main/java/org/ray/api/id/BaseId.java Co-Authored-By: Hao Chen <chenh1024@gmail.com> * Update java/api/src/main/java/org/ray/api/id/ObjectId.java Co-Authored-By: Hao Chen <chenh1024@gmail.com> * Address comment * Lint * Fix SINGLE_PROCESS * Fix comments * Refine code * Refine test * Resolve conflict	2019-05-22 14:46:30 +08:00
Romil Bhardwaj	0421cba4e8	Autoscaler hotfix for #4555 . (#4653 )	2019-05-08 14:50:52 -07:00
Si-Yuan	dab99d26af	Improve code related to node (#4383 ) * Make full use of node implement local node fix bugs mentioned in comments * Add more tests * Use more specific exception handling * fix, lint * fix for py2.x	2019-04-09 17:27:54 +08:00
Yuhong Guo	c2349cf12d	Remove local/global_scheduler from code and doc. (#4549 )	2019-04-03 17:05:09 -07:00
Robert Nishihara	ef527f84ab	Stream logs to driver by default. (#3892 ) * Stream logs to driver by default. * Fix from rebase * Redirect raylet output independently of worker output. * Fix. * Create redis client with services.create_redis_client. * Suppress Redis connection error at exit. * Remove thread_safe_client from redis. * Shutdown driver threads in ray.shutdown(). * Add warning for too many log messages. * Only stop threads if worker is connected. * Only stop threads if they exist. * Remove unnecessary try/excepts. * Fix * Only add new logging handler once. * Increase timeout. * Fix tempfile test. * Fix logging in cluster_utils. * Revert "Increase timeout." This reverts commit b3846b89040bcd8e583b2e18cb513cb040e71d95. * Retry longer when connecting to plasma store from node manager and object manager. * Close pubsub channels to avoid leaking file descriptors. * Limit log monitor open files to 200. * Increase plasma connect retries. * Add comment.	2019-02-07 19:53:50 -08:00
Si-Yuan	9295ab8f60	Various Python code cleanups. (#3837 )	2019-02-03 10:16:24 -08:00
Daniel Edgecumbe	315edab085	[autoscaler] Speedups (#3720 ) - NodeUpdater gets its' IP in parallel now (no longer in __init__) - We use persistent connections in SSH (temp folder created only for ray; ControlMaster) - hash_runtime_conf was performing a pointless hexlify step, wasting time on large files - We use NodeUpdaterThreads and share the NodeProvider; NodeUpdaterProcess is removed - AWSNodeProvider caches nodes more aggressively - NodeProvider now has a shim batch terminate_nodes() call; AWSNodeProvider parallelises it; the autoscaler uses it - AWSNodeProvider batches EC2 update_tags calls - Logging changes throughout to provide standardised timing information for profiling - Pulled out a few unnecessary is_running calls (NodeUpdater will loop waiting for SSH anyway) ## Related issue number Issue #3599	2019-02-01 02:46:32 -08:00
Richard Liaw	d128636bab	Ray Logging Configuration (#3691 ) * fix logging for autoscaler * module logging * try this for logging * yapf * fix * Initial logging setup * momery * ok * remove basicconfig * catch * remove package logging * print * fix * try_fix * fix 1 * revert rllib * logging level * flake8 * fix * fix * Remove vestigal TODO	2019-01-30 21:01:12 -08:00


97 Commits