wassname/ray - ray

mirror of https://github.com/wassname/ray.git synced 2026-07-04 01:07:30 +08:00

Author	SHA1	Message	Date
Robert Nishihara	0b1608a546	Factor out code for starting new processes and test plasma store in valgrind. (#3824) * Factor out starting Ray processes. * Detect flags through environment variables. * Return ProcessInfo from start_ray_process. * Print valgrind errors at exit. * Test valgrind in travis. * Some valgrind fixes. * Undo raylet monitor change. * Only test plasma store in valgrind.	2019-01-22 14:59:11 -08:00
Yuhong Guo	d2cf8561f2	Refactor code about ray.ObjectID. (#3674) * Refactor code about ray.ObjectID. * remove from_random and use nil_id instead of constructor * remove id() in hash * Lint and fix * Change driver id to ObjectID * Replace binary_to_hex(ObjectID.id()) to ObjectID.hex()	2019-01-13 01:47:29 -08:00
Robert Nishihara	067976ad3d	Push a warning to all users when large number of workers have been started. (#3645) * Push a warning to all users when large number of workers have been started. * Add test. * Fix bug. * Give warning when worker starts instead of when worker registers. * Fix * Fix tests	2019-01-05 13:27:32 -08:00
Robert Nishihara	586a5c9ffa	Limit default redis max memory to 10GB. (#3630) * Limit Redis max memory to 10GB/shard by default. * Update stress tests. * Reorganize * Update * Add minimum cap size for object store and redis. * Small test update.	2019-01-03 13:23:54 -08:00
Yuhong Guo	fb33fa9097	Enable function_descriptor in backend to replace the function_id (#3028)	2018-12-18 18:53:59 -05:00
Hao Chen	e7b51cbd1b	[xray] Implement Actor Reconstruction (#3332) * Implement Actor Reconstruction * fix * fix actor handle __del__ * fix lint * add comment * Remove actorCreationDummyObjectId * address comments * fix * address comments * avoid copy * change log to debug * fix error name	2018-12-13 21:28:58 -08:00
Robert Nishihara	658c14282c	Remove legacy Ray code. (#3121) * Remove legacy Ray code. * Fix cmake and simplify monitor. * Fix linting * Updates * Fix * Implement some methods. * Remove more plasma manager references. * Fix * Linting * Fix * Fix * Make sure class IDs are strings. * Some path fixes * Fix * Path fixes and update arrow * Fixes. * linting * Fixes * Java fixes * Some java fixes * TaskLanguage -> Language * Minor * Fix python test and remove unused method signature. * Fix java tests * Fix jenkins tests * Remove commented out code.	2018-10-26 13:36:58 -07:00
Si-Yuan	cc7e2ecdd5	Change logfile names and also allow plasma store socket to be passed in. (#2862)	2018-10-03 10:03:53 -07:00
Robert Nishihara	bd64c940e9	Push error to driver when monitor raises an exception. (#2834)	2018-09-07 17:42:45 -07:00
Robert Nishihara	0ac855e061	Push errors to all drivers when node is marked dead. (#2808) * Push errors to all drivers when node is marked dead. * Fix	2018-09-02 20:04:58 -07:00
Alexey Tumanov	fdc9688226	[xray] push warning to driver for infeasible tasks (#2784) This PR pushes a warning to the user for infeasible tasks to alert them to the fact that they can't currently be executed. Fixes #2780.	2018-09-01 13:21:27 -07:00
Yuhong Guo	0b6e08ebee	Separate python logger module-wise (#2703) ## What do these changes do? 1. Separate the log related code to logger.py from services.py. 2. Allow users to modify logging formatter in `ray start`. ## Related issue number https://github.com/ray-project/ray/pull/2664	2018-08-26 13:46:14 -07:00
Robert Nishihara	ff2217251f	[xray] Add error table and push error messages to driver through node manager. (#2256) * Fix documentation indentation. * Add error table to GCS and push error messages through node manager. * Add type to error data. * Linting * Fix failure_test bug. * Linting. * Enable one more test. * Attempt to fix doc building. * Restructuring * Fixes * More fixes. * Move current_time_ms function into util.h.	2018-06-20 21:29:28 -07:00
Adam Gleave	30684446a6	Support multiple availability zones in AWS (fix #2177) (#2254) * AWS: support multiple availability zones (fix #2177) * Bugfix: [] rather than () * Test config * Test config tweaks * Remove test config * Formatting fixes * Update YAML config	2018-06-19 20:22:07 -07:00
Robert Nishihara	61139e1509	Enable fractional resources and resource IDs for xray. (#2187) * Implement GPU IDs and fractional resources. * Add documentation and python exceptions. * Fix signed/unsigned comparison. * Fix linting. * Fixes from rebase. * Re-enable tests that use ray.wait. * Don't kill the raylet if an infeasible task is submitted. * Ignore tests that require better load balancing. * Linting * Ignore array test. * Ignore stress test reconstructions tests. * Don't kill node manager if remote node manager disconnects. * Ignore more stress tests. * Naming changes * Remove outdated todo * Small fix * Re-enable test. * Linting * Fix resource bookkeeping for blocked tasks. * Fix linting * Fix Java client. * Ignore test * Ignore put error tests	2018-06-10 15:31:43 -07:00
Robert Nishihara	125fe1c09c	Print warning when defining very large remote function or actor. (#2179) * Print warning when defining very large remote function or actor. * Add weak test. * Check that warnings appear in test. * Make wait_for_errors actually fail in failure_test.py. * Use constants for error types. * Fix	2018-06-09 19:59:15 -07:00
Adam Gleave	6ef3b255ea	Launch nodes in separate threads (#2183) Modifies the autoscaler to run launch_new_nodes in a separate thread, keeping track of the number of pending requests.	2018-06-05 20:19:31 -07:00
Eric Liang	3f1dd29eab	[autoscaler] Remove faulty assert that breaks during downscaling, pull configs from env (#2006) * fixes * coment out test * Update ray_constants.py * Update autoscaler_test.py * Update ray_constants.py * lint * lint	2018-05-15 12:47:11 -07:00
Philipp Moritz	74162d1492	Lint Python files with Yapf (#1872)	2018-04-11 10:11:35 -07:00
Eric Liang	b8811cbe34	[autoscaling] increase connect timeout, boto retries, and check subnet conf (#1422) * some autoscaling config tweaks * Sun Jan 14 13:56:55 PST 2018 * Mon Jan 15 14:21:09 PST 2018 * increase backoff * Mon Jan 15 14:40:47 PST 2018 * check boto version	2018-01-16 16:11:09 -08:00
Eric Liang	b6c42f96be	Auto-scale ray clusters based on GCS load metrics (#1348) This adds (experimental) auto-scaling support for Ray clusters based on GCS load metrics. The auto-scaling algorithm is as follows: Based on current (instantaneous) load information, we compute the approximate number of "used workers". This is based on the bottleneck resource, e.g. if 8/8 GPUs are used in a 8-node cluster but all the CPUs are idle, the number of used nodes is still counted as 8. This number can also be fractional. We scale that number by 1 / target_utilization_fraction and round up to determine the target cluster size (subject to the max_workers constraint). The autoscaler control loop takes care of launching new nodes until the target cluster size is met. When a node is idle for more than idle_timeout_minutes, we remove it from the cluster if that would not drop the cluster size below min_workers. Note that we'll need to update the wheel in the example yaml file after this PR is merged.	2017-12-31 14:39:57 -08:00
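
The auto-scaling rule described in commit b6c42f96be can be sketched roughly as follows. This is a minimal illustration built only from the commit message, not the code from that PR: the function names (approx_used_nodes, target_cluster_size, nodes_to_remove) and the data shapes are hypothetical, and removing the longest-idle nodes first is an assumption. Only target_utilization_fraction, max_workers, min_workers and idle_timeout_minutes come from the message itself.

```python
import math


def approx_used_nodes(usage, num_nodes):
    """Estimate "used" nodes from the bottleneck resource.

    usage maps a resource name to a (used, total) pair (hypothetical shape).
    The result may be fractional.
    """
    fractions = [used / total for used, total in usage.values() if total > 0]
    return num_nodes * max(fractions, default=0.0)


def target_cluster_size(used_nodes, target_utilization_fraction, max_workers):
    """Scale by 1 / target_utilization_fraction, round up, cap at max_workers."""
    ideal = math.ceil(used_nodes / target_utilization_fraction)
    return min(max_workers, ideal)


def nodes_to_remove(idle_minutes, idle_timeout_minutes, min_workers):
    """Pick idle nodes to remove without dropping below min_workers.

    idle_minutes maps a node id to how long it has been idle.
    Assumption: the longest-idle nodes are considered first.
    """
    remaining = len(idle_minutes)
    removed = []
    for node, idle in sorted(idle_minutes.items(), key=lambda kv: -kv[1]):
        if idle > idle_timeout_minutes and remaining > min_workers:
            removed.append(node)
            remaining -= 1
    return removed


# Example from the commit message: 8/8 GPUs busy on 8 nodes, CPUs idle.
used = approx_used_nodes({"GPU": (8, 8), "CPU": (0, 64)}, num_nodes=8)
print(used, target_cluster_size(used, 0.5, max_workers=20))
```

With a target utilization of 0.5, the fully GPU-bound 8-node cluster asks for 16 nodes, unless max_workers is lower.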

21 Commits