wassname/ray - ray - Gitea: Git with a cup of tea

mirror of https://github.com/wassname/ray.git synced 2026-07-02 02:59:52 +08:00

Author	SHA1	Message	Date
Eric Liang	5df801605e	Add ray.util package and move libraries from experimental (#7100 )	2020-02-18 13:43:19 -08:00
Daniel Edgecumbe	e516c50745	[autoscaler]: Kill workers if the monitor raises an exception (#3977 ) Co-authored-by: CJosephides <cjosephides@gmail.com>	2020-01-23 14:12:52 -06:00
Sven	60d4d5e1aa	Remove future imports (#6724 ) * Remove all __future__ imports from RLlib. * Remove (object) again from tf_run_builder.py::TFRunBuilder. * Fix 2xLINT warnings. * Fix broken appo_policy import (must be appo_tf_policy) * Remove future imports from all other ray files (not just RLlib). * Remove future imports from all other ray files (not just RLlib). * Remove future import blocks that contain `unicode_literals` as well. Revert appo_tf_policy.py to appo_policy.py (belongs to another PR). * Add two empty lines before Schedule class. * Put back __future__ imports into determine_tests_to_run.py. Fails otherwise on a py2/print related error.	2020-01-09 00:15:48 -08:00
Robert Nishihara	39a3459886	Remove (object) from class declarations. (#6658 )	2020-01-02 17:42:13 -08:00
Edward Oakes	fc56872012	Send active object IDs to the raylet (#5803 ) * Send active object IDs to the raylet * comment * comments * dedup * signed int in config * comments * Remove object ID from monitor * Fix test * re-add check * fix cast * check if core worker * Add comment * Reservoir sampling * Fix lint * Pointer return * tmp * Fix merge * Initialize object ids properly * Fix lint	2019-10-20 22:05:28 -07:00
Eric Liang	2fdefe19b7	Take into account queue length in autoscaling (#5684 )	2019-09-11 11:31:35 -07:00
micafan	b3bcf59148	Rename ClientTableData to GcsNodeInfo (#5251 )	2019-07-30 11:22:47 +08:00
Daniel Edgecumbe	06fec63c87	[autoscaler] Add a 'request_cores' function for manual autoscaling (#4754 )	2019-07-26 17:14:45 -07:00
Richard Liaw	3e0ad11ae0	Add heartbeat test + Fix monitor.py (#5191 )	2019-07-16 21:59:48 -07:00
Philipp Moritz	c5253cc300	Add job table to state API (#5076 )	2019-07-06 00:05:48 -07:00
Qing Wang	62e4b591e3	[ID Refactor] Rename DriverID to JobID (#5004 ) * WIP WIP WIP Rename Driver -> Job Fix complition Fix Rename in Java In py WIP Fix WIP Fix Fix test Fix Fix C++ linting Fix * Update java/runtime/src/main/java/org/ray/runtime/config/RayConfig.java Co-Authored-By: Stephanie Wang <swang@cs.berkeley.edu> * Update src/ray/core_worker/core_worker.cc Co-Authored-By: Stephanie Wang <swang@cs.berkeley.edu> * Address comments * Fix * Fix CI * Fix cpp linting * Fix py lint * FIx * Address comments and fix * Address comments * Address * Fix import_threading	2019-06-28 00:44:51 +08:00
Daniel Edgecumbe	49c6e81de2	autoscaler/monitor: Kill workers on exception (#4997 )	2019-06-26 17:59:12 -07:00
Hao Chen	0131353d42	[gRPC] Migrate gcs data structures to protobuf (#5024 )	2019-06-25 14:31:19 -07:00
Yuhong Guo	5eff47b657	[C++] Add hash table to Redis-Module (#4911 )	2019-06-07 16:11:37 +08:00
Robert Nishihara	6703519144	Move global state API out of global_state object. (#4857 )	2019-05-26 11:27:53 -07:00
Yuhong Guo	1a39fee9c6	Refactor ID Serial 1: Separate ObjectID and TaskID from UniqueID (#4776 ) * Enable BaseId. * Change TaskID and make python test pass * Remove unnecessary functions and fix test failure and change TaskID to 16 bytes. * Java code change draft * Refine * Lint * Update java/api/src/main/java/org/ray/api/id/TaskId.java Co-Authored-By: Hao Chen <chenh1024@gmail.com> * Update java/api/src/main/java/org/ray/api/id/BaseId.java Co-Authored-By: Hao Chen <chenh1024@gmail.com> * Update java/api/src/main/java/org/ray/api/id/BaseId.java Co-Authored-By: Hao Chen <chenh1024@gmail.com> * Update java/api/src/main/java/org/ray/api/id/ObjectId.java Co-Authored-By: Hao Chen <chenh1024@gmail.com> * Address comment * Lint * Fix SINGLE_PROCESS * Fix comments * Refine code * Refine test * Resolve conflict	2019-05-22 14:46:30 +08:00
Romil Bhardwaj	0421cba4e8	Autoscaler hotfix for #4555 . (#4653 )	2019-05-08 14:50:52 -07:00
Si-Yuan	dab99d26af	Improve code related to node (#4383 ) * Make full use of node implement local node fix bugs mentioned in comments * Add more tests * Use more specific exception handling * fix, lint * fix for py2.x	2019-04-09 17:27:54 +08:00
Yuhong Guo	c2349cf12d	Remove local/global_scheduler from code and doc. (#4549 )	2019-04-03 17:05:09 -07:00
Robert Nishihara	ef527f84ab	Stream logs to driver by default. (#3892 ) * Stream logs to driver by default. * Fix from rebase * Redirect raylet output independently of worker output. * Fix. * Create redis client with services.create_redis_client. * Suppress Redis connection error at exit. * Remove thread_safe_client from redis. * Shutdown driver threads in ray.shutdown(). * Add warning for too many log messages. * Only stop threads if worker is connected. * Only stop threads if they exist. * Remove unnecessary try/excepts. * Fix * Only add new logging handler once. * Increase timeout. * Fix tempfile test. * Fix logging in cluster_utils. * Revert "Increase timeout." This reverts commit b3846b89040bcd8e583b2e18cb513cb040e71d95. * Retry longer when connecting to plasma store from node manager and object manager. * Close pubsub channels to avoid leaking file descriptors. * Limit log monitor open files to 200. * Increase plasma connect retries. * Add comment.	2019-02-07 19:53:50 -08:00
Si-Yuan	9295ab8f60	Various Python code cleanups. (#3837 )	2019-02-03 10:16:24 -08:00
Daniel Edgecumbe	315edab085	[autoscaler] Speedups (#3720 ) - NodeUpdater gets its' IP in parallel now (no longer in __init__) - We use persistent connections in SSH (temp folder created only for ray; ControlMaster) - hash_runtime_conf was performing a pointless hexlify step, wasting time on large files - We use NodeUpdaterThreads and share the NodeProvider; NodeUpdaterProcess is removed - AWSNodeProvider caches nodes more aggressively - NodeProvider now has a shim batch terminate_nodes() call; AWSNodeProvider parallelises it; the autoscaler uses it - AWSNodeProvider batches EC2 update_tags calls - Logging changes throughout to provide standardised timing information for profiling - Pulled out a few unnecessary is_running calls (NodeUpdater will loop waiting for SSH anyway) ## Related issue number Issue #3599	2019-02-01 02:46:32 -08:00
Richard Liaw	d128636bab	Ray Logging Configuration (#3691 ) * fix logging for autoscaler * module logging * try this for logging * yapf * fix * Initial logging setup * momery * ok * remove basicconfig * catch * remove package logging * print * fix * try_fix * fix 1 * revert rllib * logging level * flake8 * fix * fix * Remove vestigal TODO	2019-01-30 21:01:12 -08:00
Si-Yuan	48139cf861	Migrate Python C extension to Cython (#3541 )	2019-01-24 09:17:14 -08:00
Si-Yuan	59d861281e	Bug fixing: Redis password should be used when reporting errors. (#3724 )	2019-01-08 21:23:55 -08:00
Robert Nishihara	82863b5251	[autoscaler] Update autoscaler to use heartbeat batches. (#3409 )	2018-11-27 23:46:27 -08:00
Robert Nishihara	1f29a960f4	Update task_table and object_table API. (#3161 ) * Update task_table and object_table API. * Fix	2018-10-31 12:52:50 -07:00
Robert Nishihara	658c14282c	Remove legacy Ray code. (#3121 ) * Remove legacy Ray code. * Fix cmake and simplify monitor. * Fix linting * Updates * Fix * Implement some methods. * Remove more plasma manager references. * Fix * Linting * Fix * Fix * Make sure class IDs are strings. * Some path fixes * Fix * Path fixes and update arrow * Fixes. * linting * Fixes * Java fixes * Some java fixes * TaskLanguage -> Language * Minor * Fix python test and remove unused method signature. * Fix java tests * Fix jenkins tests * Remove commented out code.	2018-10-26 13:36:58 -07:00
Peter Schafhalter	a41bbc10ef	Add password authentication to Redis ports (#2952 ) * Implement Redis authentication * Throw exception for legacy Ray * Add test * Formatting * Fix bugs in CLI * Fix bugs in Raylet * Move default password to constants.h * Use pytest.fixture * Fix bug * Authenticate using formatted strings * Add missing passwords * Add test * Improve authentication of async contexts * Disable Redis authentication for credis * Update test for credis * Fix rebase artifacts * Fix formatting * Add workaround for issue #3045 * Increase timeout for test * Improve C++ readability * Fixes for CLI * Add security docs * Address comments * Address comments * Adress comments * Use ray.get * Fix lint	2018-10-16 22:48:30 -07:00
Peter Schafhalter	5da6e78db1	Add available resources to global state (#2501 )	2018-09-10 15:46:32 -07:00
Robert Nishihara	bd64c940e9	Push error to driver when monitor raises an exception. (#2834 )	2018-09-07 17:42:45 -07:00
Alexey Tumanov	de047daea7	[xray] raylet scheduling mechanism with a simple spillback policy (#2749 ) ## What do these changes do? * distribute load and resource information on a heartbeat * for each raylet, maintain total and available resource capacity as well as measure of current load * this PR introduces a new notion of load, defined as a sum of all resource demand induced by queued ready tasks on the local raylet. This provides a heterogeneity-aware measure of load that supersedes legacy Ray's task count as a proxy for load. * modify the scheduling policy to perform capacity-based, load-aware, optimistically concurrent resource allocation * perform task spillover to the heartbeating node in response to a heartbeat, implementing heterogeneity-aware late-binding/work-stealing.	2018-08-28 00:03:34 -07:00
Yuhong Guo	0b6e08ebee	Separate python logger module-wise (#2703 ) ## What do these changes do? 1. Separate the log related code to logger.py from services.py. 2. Allow users to modify logging formatter in `ray start`. ## Related issue number https://github.com/ray-project/ray/pull/2664	2018-08-26 13:46:14 -07:00
Eric Liang	079c4e482a	ray exec and ray attach commands (#2560 ) ray exec CLUSTER CMD [--screen] [--start] [--stop] ray attach CLUSTER [--start] Example: ray exec sgd.yaml 'source activate tensorflow_p27 && cd ~/ray/python/ray/rllib && ./train.py --run=PPO --env=CartPole-v0' --screen --start --stop This will in one command create a cluster and run the command on it in a screen session. The screen can later be attached to via ray attach. After the command finishes, the cluster workers will be terminated and the head node stopped.	2018-08-15 14:31:50 -07:00
Melih Elibol	8ae82180b4	[xray] Adds a driver table. (#2289 ) This PR adds a driver table for the new GCS, which enables cleanup functionality associated with monitoring driver death. Some testing in `monitor_test.py` is restored, but redis sharding for xray is needed to enable remaining tests.	2018-08-08 23:41:40 -07:00
Robert Nishihara	909d7172b1	Introduce constant for ID_SIZE in python code. (#2517 )	2018-07-31 12:40:53 -07:00
Eric Liang	90a3ea9443	[xray] Fix heartbeat subscription for autoscaler (#2498 )	2018-07-28 13:34:55 -07:00
Zongheng Yang	ba28dddf6f	Make xray object table credis-managed and hence flushable. (#2338 ) * monitor.py: issue flushes to data shard * ResultTableAdd & ObjectTableAdd: add credis-managed versions * Fix return codes * Credis-manage xray object table & associated ray.table_append cmd * Fix incorrect return code from TableAppend_DoWrite() * Revert "ResultTableAdd & ObjectTableAdd: add credis-managed versions" This reverts commit 628c2ea190df4c861dda0c284fab7ca6faa1ea24. * Address comments * Lint: fix indent * Address comment	2018-07-03 17:32:44 -07:00
Robert Nishihara	ff2217251f	[xray] Add error table and push error messages to driver through node manager. (#2256 ) * Fix documentation indentation. * Add error table to GCS and push error messages through node manager. * Add type to error data. * Linting * Fix failure_test bug. * Linting. * Enable one more test. * Attempt to fix doc building. * Restructuring * Fixes * More fixes. * Move current_time_ms function into util.h.	2018-06-20 21:29:28 -07:00
Zongheng Yang	8190ff1fd0	Experimental: enable automatic GCS flushing with configurable policy. (#2266 ) * build_credis.sh: use an up-to-date credis commit. * build_credis.sh: leveldb is updated, so update build cmds for it * WIP: make monitor.py issue flush; switch gcs client to use credis * Experimental: enable automatic GCS flushing with configurable policy. * Fix linux compilation error * Fix leveldb build * Use optimized build for credis * Address comments * Attempt to fix tests	2018-06-20 14:40:57 -07:00
Eric Liang	100d8c207f	[xray] [autoscaler] Fix autoscaler / raylet integration (#2143 )	2018-06-07 15:43:20 -07:00
Robert Nishihara	6172f94c04	Implement Python global state API for xray. (#2125 ) * Implement global state API for xray. * Fix object table. * Fixes for log structure. * Implement cluster_resources. * Add driver task to task table. * Remove python flatbuffers code * Get some global state API tests running. * Python linting. * Fix linting. * Fix mock modules for doc * Copy over flatbuffer bindings. * Fix for tests. * Linting * Fix monitor crash.	2018-05-29 16:25:54 -07:00
Alok Singh	f795173b51	Use flake8-comprehensions (#1976 ) * Add flake8 to Travis * Add flake8-comprehensions [flake8 plugin](https://github.com/adamchainz/flake8-comprehensions) that checks for useless constructions. * Use generators instead of lists where appropriate A lot of the builtins can take in generators instead of lists. This commit applies `flake8-comprehensions` to find them. * Fix lint error * Fix some string formatting The rest can be fixed in another PR * Fix compound literals syntax This should probably be merged after #1963. * dict() -> {} * Use dict literal syntax dict(...) -> {...} * Rewrite nested dicts * Fix hanging indent * Add missing import * Add missing quote * fmt * Add missing whitespace * rm duplicate pip install This is already installed in another file. * Fix indent * move `merge_dicts` into utils * Bring up to date with `master` * Add automatic syntax upgrade * rm pyupgrade In case users want to still use it on their own, the upgrade-syn.sh script was left in the `.travis` dir.	2018-05-20 16:15:06 -07:00
Alok Singh	9a8f29e571	YAPF, take 3 (#2098 ) * Use pep8 style The original style file is actually just pep8 style, but with everything spelled out. It's easier to use the `based_on_style` feature. Any overrides are clearer that way. * Improve yapf script 1. Do formatting in parallel 2. Lint RLlib 3. Use .style.yapf file * Pull out expressions into variables * Don't format rllib * Don't allow splits in dicts * Apply yapf * Disallow single line if-statements * Use arithmetic comparison * Simplify checking for changed files * Pull out expr into var	2018-05-19 16:07:28 -07:00
Melih Elibol	bea97b425b	Fix python linting (#2076 )	2018-05-16 15:04:31 -07:00
Philipp Moritz	74162d1492	Lint Python files with Yapf (#1872 )	2018-04-11 10:11:35 -07:00
Robert Nishihara	de3cfa223d	Fix monitor.py bottleneck by removing excess Redis queries. (#1786 ) * Fix monitor.py bottleneck by removing excess Redis queries. * Remove unnecessary default value.	2018-03-26 22:30:38 -07:00
Robert Nishihara	96913be939	Treat actor creation like a regular task. (#1668 ) * Treat actor creation like a regular task. * Small cleanups. * Change semantics of actor resource handling. * Bug fix. * Minor linting * Bug fix * Fix jenkins test. * Fix actor tests * Some cleanups * Bug fix * Fix bug. * Remove cached actor tasks when a driver is removed. * Add more info to taskspec in global state API. * Fix cyclic import bug in tune. * Fix * Fix linting. * Fix linting. * Don't schedule any tasks (especially actor creaiton tasks) on local schedulers with 0 CPUs. * Bug fix. * Add test for 0 CPU case * Fix linting * Address comments. * Fix typos and add comment. * Add assertion and fix test.	2018-03-16 11:18:07 -07:00
Alexey Tumanov	f1303291b4	Ray scheduler spillback plumbing + mechanism (#1362 ) * spillback mechanism and plumbing : adding spillback counter + timestamp * linting fix * documentation * Fix argument name.	2018-01-23 20:18:12 -08:00
Eric Liang	b6c42f96be	Auto-scale ray clusters based on GCS load metrics (#1348 ) This adds (experimental) auto-scaling support for Ray clusters based on GCS load metrics. The auto-scaling algorithm is as follows: Based on current (instantaneous) load information, we compute the approximate number of "used workers". This is based on the bottleneck resource, e.g. if 8/8 GPUs are used in a 8-node cluster but all the CPUs are idle, the number of used nodes is still counted as 8. This number can also be fractional. We scale that number by 1 / target_utilization_fraction and round up to determine the target cluster size (subject to the max_workers constraint). The autoscaler control loop takes care of launching new nodes until the target cluster size is met. When a node is idle for more than idle_timeout_minutes, we remove it from the cluster if that would not drop the cluster size below min_workers. Note that we'll need to update the wheel in the example yaml file after this PR is merged.	2017-12-31 14:39:57 -08:00

1 2

70 Commits