wassname/ray

mirror of https://github.com/wassname/ray.git synced 2026-06-28 10:01:11 +08:00

Author	SHA1	Message	Date
Eric Liang	0ff24ec8dc	Add "ray status" debug tool for autoscaler. (#9091 )	2020-06-24 18:22:03 -07:00
mehrdadn	f68183d778	Error-checking for a couple of corruption issues (#8059 ) * Extra error handling * Handle connection closed in Redis monitor Co-authored-by: Mehrdad <noreply@github.com>	2020-06-07 15:43:00 +02:00
Eric Liang	a24d117c68	[autoscaler] Refactor code in preparation for multi instance type support (#8632 ) * wip refactor * add util * wip * fix * fix * remove * remove extraneous string type for sg	2020-06-03 12:53:55 -07:00
SangBin Cho	7c43991100	[GCS] Monitor.py bug fix (#8725 ) * comment. * Fix bugs. * Used pubsub message instead. * Added a ray.actors test	2020-06-02 16:06:36 -07:00
fangfengbin	016337d4eb	Heartbeat table uses gcs pub-sub instead of redis accessor (#8655 )	2020-05-30 23:17:25 +08:00
mehrdadn	ebf060d484	Make more tests run on Windows (#8446 ) * Remove worker Wait() call due to SIGCHLD being ignored * Port _pid_alive to Windows * Show PID as well as TID in glog * Update TensorFlow version for Python 3.8 on Windows * Handle missing Pillow on Windows * Work around dm-tree PermissionError on Windows * Fix some lint errors on Windows with Python 3.8 * Simplify torch requirements * Quiet git clean * Handle finalizer issues * Exit with the signal number * Get rid of wget * Fix some Windows compatibility issues with tests Co-authored-by: Mehrdad <noreply@github.com>	2020-05-20 12:25:04 -07:00
Robert Nishihara	b011c604d7	Remove ray.tasks() from API. (#7807 )	2020-04-01 10:10:40 -05:00
Edward Oakes	7b609ca211	Remove instances of 'raise Exception' (#7523 )	2020-03-10 17:51:22 -07:00
Eric Liang	5df801605e	Add ray.util package and move libraries from experimental (#7100 )	2020-02-18 13:43:19 -08:00
Daniel Edgecumbe	e516c50745	[autoscaler]: Kill workers if the monitor raises an exception (#3977 ) Co-authored-by: CJosephides <cjosephides@gmail.com>	2020-01-23 14:12:52 -06:00
Sven	60d4d5e1aa	Remove future imports (#6724 ) * Remove all __future__ imports from RLlib. * Remove (object) again from tf_run_builder.py::TFRunBuilder. * Fix 2xLINT warnings. * Fix broken appo_policy import (must be appo_tf_policy) * Remove future imports from all other ray files (not just RLlib). * Remove future imports from all other ray files (not just RLlib). * Remove future import blocks that contain `unicode_literals` as well. Revert appo_tf_policy.py to appo_policy.py (belongs to another PR). * Add two empty lines before Schedule class. * Put back __future__ imports into determine_tests_to_run.py. Fails otherwise on a py2/print related error.	2020-01-09 00:15:48 -08:00
Robert Nishihara	39a3459886	Remove (object) from class declarations. (#6658 )	2020-01-02 17:42:13 -08:00
Edward Oakes	fc56872012	Send active object IDs to the raylet (#5803 ) * Send active object IDs to the raylet * comment * comments * dedup * signed int in config * comments * Remove object ID from monitor * Fix test * re-add check * fix cast * check if core worker * Add comment * Reservoir sampling * Fix lint * Pointer return * tmp * Fix merge * Initialize object ids properly * Fix lint	2019-10-20 22:05:28 -07:00
Eric Liang	2fdefe19b7	Take into account queue length in autoscaling (#5684 )	2019-09-11 11:31:35 -07:00
micafan	b3bcf59148	Rename ClientTableData to GcsNodeInfo (#5251 )	2019-07-30 11:22:47 +08:00
Daniel Edgecumbe	06fec63c87	[autoscaler] Add a 'request_cores' function for manual autoscaling (#4754 )	2019-07-26 17:14:45 -07:00
Richard Liaw	3e0ad11ae0	Add heartbeat test + Fix monitor.py (#5191 )	2019-07-16 21:59:48 -07:00
Philipp Moritz	c5253cc300	Add job table to state API (#5076 )	2019-07-06 00:05:48 -07:00
Qing Wang	62e4b591e3	[ID Refactor] Rename DriverID to JobID (#5004 ) * WIP WIP WIP Rename Driver -> Job Fix complition Fix Rename in Java In py WIP Fix WIP Fix Fix test Fix Fix C++ linting Fix * Update java/runtime/src/main/java/org/ray/runtime/config/RayConfig.java Co-Authored-By: Stephanie Wang <swang@cs.berkeley.edu> * Update src/ray/core_worker/core_worker.cc Co-Authored-By: Stephanie Wang <swang@cs.berkeley.edu> * Address comments * Fix * Fix CI * Fix cpp linting * Fix py lint * FIx * Address comments and fix * Address comments * Address * Fix import_threading	2019-06-28 00:44:51 +08:00
Daniel Edgecumbe	49c6e81de2	autoscaler/monitor: Kill workers on exception (#4997 )	2019-06-26 17:59:12 -07:00
Hao Chen	0131353d42	[gRPC] Migrate gcs data structures to protobuf (#5024 )	2019-06-25 14:31:19 -07:00
Yuhong Guo	5eff47b657	[C++] Add hash table to Redis-Module (#4911 )	2019-06-07 16:11:37 +08:00
Robert Nishihara	6703519144	Move global state API out of global_state object. (#4857 )	2019-05-26 11:27:53 -07:00
Yuhong Guo	1a39fee9c6	Refactor ID Serial 1: Separate ObjectID and TaskID from UniqueID (#4776 ) * Enable BaseId. * Change TaskID and make python test pass * Remove unnecessary functions and fix test failure and change TaskID to 16 bytes. * Java code change draft * Refine * Lint * Update java/api/src/main/java/org/ray/api/id/TaskId.java Co-Authored-By: Hao Chen <chenh1024@gmail.com> * Update java/api/src/main/java/org/ray/api/id/BaseId.java Co-Authored-By: Hao Chen <chenh1024@gmail.com> * Update java/api/src/main/java/org/ray/api/id/BaseId.java Co-Authored-By: Hao Chen <chenh1024@gmail.com> * Update java/api/src/main/java/org/ray/api/id/ObjectId.java Co-Authored-By: Hao Chen <chenh1024@gmail.com> * Address comment * Lint * Fix SINGLE_PROCESS * Fix comments * Refine code * Refine test * Resolve conflict	2019-05-22 14:46:30 +08:00
Romil Bhardwaj	0421cba4e8	Autoscaler hotfix for #4555 . (#4653 )	2019-05-08 14:50:52 -07:00
Si-Yuan	dab99d26af	Improve code related to node (#4383 ) * Make full use of node implement local node fix bugs mentioned in comments * Add more tests * Use more specific exception handling * fix, lint * fix for py2.x	2019-04-09 17:27:54 +08:00
Yuhong Guo	c2349cf12d	Remove local/global_scheduler from code and doc. (#4549 )	2019-04-03 17:05:09 -07:00
Robert Nishihara	ef527f84ab	Stream logs to driver by default. (#3892 ) * Stream logs to driver by default. * Fix from rebase * Redirect raylet output independently of worker output. * Fix. * Create redis client with services.create_redis_client. * Suppress Redis connection error at exit. * Remove thread_safe_client from redis. * Shutdown driver threads in ray.shutdown(). * Add warning for too many log messages. * Only stop threads if worker is connected. * Only stop threads if they exist. * Remove unnecessary try/excepts. * Fix * Only add new logging handler once. * Increase timeout. * Fix tempfile test. * Fix logging in cluster_utils. * Revert "Increase timeout." This reverts commit b3846b89040bcd8e583b2e18cb513cb040e71d95. * Retry longer when connecting to plasma store from node manager and object manager. * Close pubsub channels to avoid leaking file descriptors. * Limit log monitor open files to 200. * Increase plasma connect retries. * Add comment.	2019-02-07 19:53:50 -08:00
Si-Yuan	9295ab8f60	Various Python code cleanups. (#3837 )	2019-02-03 10:16:24 -08:00
Daniel Edgecumbe	315edab085	[autoscaler] Speedups (#3720 ) - NodeUpdater gets its' IP in parallel now (no longer in __init__) - We use persistent connections in SSH (temp folder created only for ray; ControlMaster) - hash_runtime_conf was performing a pointless hexlify step, wasting time on large files - We use NodeUpdaterThreads and share the NodeProvider; NodeUpdaterProcess is removed - AWSNodeProvider caches nodes more aggressively - NodeProvider now has a shim batch terminate_nodes() call; AWSNodeProvider parallelises it; the autoscaler uses it - AWSNodeProvider batches EC2 update_tags calls - Logging changes throughout to provide standardised timing information for profiling - Pulled out a few unnecessary is_running calls (NodeUpdater will loop waiting for SSH anyway) ## Related issue number Issue #3599	2019-02-01 02:46:32 -08:00
Richard Liaw	d128636bab	Ray Logging Configuration (#3691 ) * fix logging for autoscaler * module logging * try this for logging * yapf * fix * Initial logging setup * momery * ok * remove basicconfig * catch * remove package logging * print * fix * try_fix * fix 1 * revert rllib * logging level * flake8 * fix * fix * Remove vestigal TODO	2019-01-30 21:01:12 -08:00
Si-Yuan	48139cf861	Migrate Python C extension to Cython (#3541 )	2019-01-24 09:17:14 -08:00
Si-Yuan	59d861281e	Bug fixing: Redis password should be used when reporting errors. (#3724 )	2019-01-08 21:23:55 -08:00
Robert Nishihara	82863b5251	[autoscaler] Update autoscaler to use heartbeat batches. (#3409 )	2018-11-27 23:46:27 -08:00
Robert Nishihara	1f29a960f4	Update task_table and object_table API. (#3161 ) * Update task_table and object_table API. * Fix	2018-10-31 12:52:50 -07:00
Robert Nishihara	658c14282c	Remove legacy Ray code. (#3121 ) * Remove legacy Ray code. * Fix cmake and simplify monitor. * Fix linting * Updates * Fix * Implement some methods. * Remove more plasma manager references. * Fix * Linting * Fix * Fix * Make sure class IDs are strings. * Some path fixes * Fix * Path fixes and update arrow * Fixes. * linting * Fixes * Java fixes * Some java fixes * TaskLanguage -> Language * Minor * Fix python test and remove unused method signature. * Fix java tests * Fix jenkins tests * Remove commented out code.	2018-10-26 13:36:58 -07:00
Peter Schafhalter	a41bbc10ef	Add password authentication to Redis ports (#2952 ) * Implement Redis authentication * Throw exception for legacy Ray * Add test * Formatting * Fix bugs in CLI * Fix bugs in Raylet * Move default password to constants.h * Use pytest.fixture * Fix bug * Authenticate using formatted strings * Add missing passwords * Add test * Improve authentication of async contexts * Disable Redis authentication for credis * Update test for credis * Fix rebase artifacts * Fix formatting * Add workaround for issue #3045 * Increase timeout for test * Improve C++ readability * Fixes for CLI * Add security docs * Address comments * Address comments * Adress comments * Use ray.get * Fix lint	2018-10-16 22:48:30 -07:00
Peter Schafhalter	5da6e78db1	Add available resources to global state (#2501 )	2018-09-10 15:46:32 -07:00
Robert Nishihara	bd64c940e9	Push error to driver when monitor raises an exception. (#2834 )	2018-09-07 17:42:45 -07:00
Alexey Tumanov	de047daea7	[xray] raylet scheduling mechanism with a simple spillback policy (#2749 ) ## What do these changes do? * distribute load and resource information on a heartbeat * for each raylet, maintain total and available resource capacity as well as measure of current load * this PR introduces a new notion of load, defined as a sum of all resource demand induced by queued ready tasks on the local raylet. This provides a heterogeneity-aware measure of load that supersedes legacy Ray's task count as a proxy for load. * modify the scheduling policy to perform capacity-based, load-aware, optimistically concurrent resource allocation * perform task spillover to the heartbeating node in response to a heartbeat, implementing heterogeneity-aware late-binding/work-stealing.	2018-08-28 00:03:34 -07:00
Yuhong Guo	0b6e08ebee	Separate python logger module-wise (#2703 ) ## What do these changes do? 1. Separate the log related code to logger.py from services.py. 2. Allow users to modify logging formatter in `ray start`. ## Related issue number https://github.com/ray-project/ray/pull/2664	2018-08-26 13:46:14 -07:00
Eric Liang	079c4e482a	ray exec and ray attach commands (#2560 ) ray exec CLUSTER CMD [--screen] [--start] [--stop] ray attach CLUSTER [--start] Example: ray exec sgd.yaml 'source activate tensorflow_p27 && cd ~/ray/python/ray/rllib && ./train.py --run=PPO --env=CartPole-v0' --screen --start --stop This will in one command create a cluster and run the command on it in a screen session. The screen can later be attached to via ray attach. After the command finishes, the cluster workers will be terminated and the head node stopped.	2018-08-15 14:31:50 -07:00
Melih Elibol	8ae82180b4	[xray] Adds a driver table. (#2289 ) This PR adds a driver table for the new GCS, which enables cleanup functionality associated with monitoring driver death. Some testing in `monitor_test.py` is restored, but redis sharding for xray is needed to enable remaining tests.	2018-08-08 23:41:40 -07:00
Robert Nishihara	909d7172b1	Introduce constant for ID_SIZE in python code. (#2517 )	2018-07-31 12:40:53 -07:00
Eric Liang	90a3ea9443	[xray] Fix heartbeat subscription for autoscaler (#2498 )	2018-07-28 13:34:55 -07:00
Zongheng Yang	ba28dddf6f	Make xray object table credis-managed and hence flushable. (#2338 ) * monitor.py: issue flushes to data shard * ResultTableAdd & ObjectTableAdd: add credis-managed versions * Fix return codes * Credis-manage xray object table & associated ray.table_append cmd * Fix incorrect return code from TableAppend_DoWrite() * Revert "ResultTableAdd & ObjectTableAdd: add credis-managed versions" This reverts commit 628c2ea190df4c861dda0c284fab7ca6faa1ea24. * Address comments * Lint: fix indent * Address comment	2018-07-03 17:32:44 -07:00
Robert Nishihara	ff2217251f	[xray] Add error table and push error messages to driver through node manager. (#2256 ) * Fix documentation indentation. * Add error table to GCS and push error messages through node manager. * Add type to error data. * Linting * Fix failure_test bug. * Linting. * Enable one more test. * Attempt to fix doc building. * Restructuring * Fixes * More fixes. * Move current_time_ms function into util.h.	2018-06-20 21:29:28 -07:00
Zongheng Yang	8190ff1fd0	Experimental: enable automatic GCS flushing with configurable policy. (#2266 ) * build_credis.sh: use an up-to-date credis commit. * build_credis.sh: leveldb is updated, so update build cmds for it * WIP: make monitor.py issue flush; switch gcs client to use credis * Experimental: enable automatic GCS flushing with configurable policy. * Fix linux compilation error * Fix leveldb build * Use optimized build for credis * Address comments * Attempt to fix tests	2018-06-20 14:40:57 -07:00
Eric Liang	100d8c207f	[xray] [autoscaler] Fix autoscaler / raylet integration (#2143 )	2018-06-07 15:43:20 -07:00
Robert Nishihara	6172f94c04	Implement Python global state API for xray. (#2125 ) * Implement global state API for xray. * Fix object table. * Fixes for log structure. * Implement cluster_resources. * Add driver task to task table. * Remove python flatbuffers code * Get some global state API tests running. * Python linting. * Fix linting. * Fix mock modules for doc * Copy over flatbuffer bindings. * Fix for tests. * Linting * Fix monitor crash.	2018-05-29 16:25:54 -07:00

78 Commits