wassname/ray

mirror of https://github.com/wassname/ray.git synced 2026-06-30 20:18:33 +08:00

Author	SHA1	Message	Date
SangBin Cho	39088ab6f2	[Stats] Metrics Export User Interface Part 2 (Prometheus Service Discovery) (#9970 ) * In progress. * In Progress. * Finish the working version. * Write a documentation. * Addressed code review. * Fix lint error. * Lint. * Addressed code review. Make test less flaky. * Use a random port for ray start. * Modify doc. * Make write atomic.	2020-08-07 21:59:24 -07:00
Alex Wu	5b96a88cd7	[Core] Gpu type detection (#9695 ) * . * . * . * . * . * . * . * . * Test cases * detection only * . * Done? * . * . * Done * added test case * . * . * . * . * . * . * Update python/ray/ray_constants.py Co-authored-by: Eric Liang <ekhliang@gmail.com> * . * . Co-authored-by: Eric Liang <ekhliang@gmail.com>	2020-08-01 11:43:56 -07:00
fyrestone	4d08ddbf24	[Dashboard] New dashboard skeleton (#9099 )	2020-07-27 11:34:47 +08:00
ChenZhilei	c11855728a	Remove raylet monitor after use GCS service (#9179 )	2020-07-01 20:01:52 +08:00
Max Fitton	ad09aa985c	Make Dashboard Port Configurable (#8999 )	2020-06-19 16:26:22 -05:00
Zhilei Chen	d8a9247448	Remove gcs_service_disabled ci jobs and code (#8854 )	2020-06-19 11:32:27 +08:00
Max Fitton	13231ba63b	Rename redis-port to port and add default (#8406 )	2020-05-18 13:25:34 -05:00
Max Fitton	00325eb2b2	Rename max_reconstructions to max_restarts and use -1 for infinite (#8274 ) Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>	2020-05-14 10:30:29 -05:00
fyrestone	fc6259a656	Cross language serialization for primitive types (#7711 ) * Cross language serialization for Java and Python * Use strict types when Python serializing * Handle recursive objects in Python; Pin msgpack >= 0.6.0, < 1.0.0 * Disable gc for optimizing msgpack loads * Fix merge bug * Java call Python use returnType; Fix ClassLoaderTest * Fix RayMethodsTest * Fix checkstyle * Fix lint * prepare_args raises exception if try to transfer a non-deserializable object to another language * Fix CrossLanguageInvocationTest.java, Python msgpack treat float as double * Minor fixes * Fix compile error on linux * Fix lint in java/BUILD.bazel * Fix test_failure * Fix lint * Class<?> to Class<T>; Refine metadata bytes. * Rename FST to Fst; sort java dependencies * Change Class<?>[] to Optional<Class<?>>; sort requirements in setup.py * Improve CrossLanguageInvocationTest * Refactor MessagePackSerializer.java * Refactor MessagePackSerializer.java; Refine CrossLanguageInvocationTest.java * Remove unnecessary dependencies for Java; Add getReturnType() for RayFunction in Java * Fix bug * Remove custom cross language type support * Replace Serializer.Meta with MutableBoolean * Remove @SuppressWarnings support from checkstyle.xml; Add null test in CrossLanguageInvocationTest.java * Refine MessagePackSerializer.pack * Ray.get support RayObject as input * Improve comments and error info * Remove classLoader argument from serializer * Separate msgpack from pickle5 in Python * Pair<byte[], MutableBoolean> to Pair<byte[], Boolean> * Remove public static <T> T get(RayObject<T> object), use RayObject.get() instead * Refine test * small fixes Co-authored-by: 刘宝 <po.lb@antfin.com> Co-authored-by: Hao Chen <chenh1024@gmail.com>	2020-04-08 21:10:57 +08:00
fangfengbin	e196fcdbaf	Add gcs_service_enabled function to avoid getting environment variable directly (#7742 )	2020-03-26 22:02:53 +08:00
Eric Liang	5a112ab212	Remove object store memory cap (#7654 )	2020-03-19 16:00:30 -07:00
Edward Oakes	d9027acaf2	Deprecate non-direct-call API (#7336 )	2020-02-27 10:37:23 -08:00
Edward Oakes	2ad9bc5684	Move plasma retry logic into plasma store provider (#7328 )	2020-02-26 16:57:02 -08:00
fangfengbin	e7d0ec9531	Enable GCS server when running python unit tests (#7101 ) * Enable GCS server when running python unit tests * restart ci * restart ci * fix code style * restart ci * restart ci * restart ci * restart ci * restart ci * Define RAY_GCS_SERVICE_ENABLED as a constant * fix review comments * fix code style * fix code style * fix code style * fix code style * fix review comments * add gcs service python testcase * fix TESTSUITE name bug	2020-02-24 09:48:40 +08:00
Alex Wu	72c31e3e19	Ray nodes should respect docker limits (#7039 )	2020-02-10 11:08:38 -08:00
fangfengbin	694c0f2867	[Java] Enable GCS server when running java unit tests (#7041 ) * enable gcs service when run java testcase * fix ci bug * fix windows compile bug * fix ci bug * restart ci job * enable java testcase * restart ci job * restart ci job * add debug log * add debug log * restart ci job * add debug log * restart ci * add debug log * fix java testcase bug * restart ci job * restart ci job * restart ci job * restart ci job * restart ci job * restart ci job * restart ci job * restart ci job	2020-02-10 09:39:14 +08:00
Eric Liang	740bd00651	Use 100k for memory limit #7013 )	2020-02-02 22:48:59 -08:00
Sven	60d4d5e1aa	Remove future imports (#6724 ) * Remove all __future__ imports from RLlib. * Remove (object) again from tf_run_builder.py::TFRunBuilder. * Fix 2xLINT warnings. * Fix broken appo_policy import (must be appo_tf_policy) * Remove future imports from all other ray files (not just RLlib). * Remove future imports from all other ray files (not just RLlib). * Remove future import blocks that contain `unicode_literals` as well. Revert appo_tf_policy.py to appo_policy.py (belongs to another PR). * Add two empty lines before Schedule class. * Put back __future__ imports into determine_tests_to_run.py. Fails otherwise on a py2/print related error.	2020-01-09 00:15:48 -08:00
Eric Liang	de22cdb233	Reduce reporter CPU (#6553 ) * wip * remove * Update ray_constants.py	2019-12-19 22:21:30 -08:00
Simon Mo	e530c37b0e	Use localhost and set redis password by default (#6481 )	2019-12-17 19:41:19 -08:00
Eric Liang	be5dd8eb5e	Enable direct calls by default (#6367 ) * wip * add * timeout fix * const ref * comments * fix * fix * Move actor state into actor handle * comments 2 * enable by default * temp reorder * some fixes * add debug code * tmp * fix * wip * remove dbg * fix compile * fix * fix check * remove non direct tests * Increment ref count before resolving value * rename * fix another bug * tmp * tmp * Fix object pinning * build change * lint * ActorManager * tmp * ActorManager * fix test component failures * Remove old code * Remove unused * fix * fix * fix resources * fix advanced * eric's diff * blacklist * blacklist * cleanup * annotate * disable tests for now * remove * fix * fix * clean up verbosity * fix test * fix concurrency test * Update .travis.yml * Update .travis.yml * Update .travis.yml * split up analysis suite * split up trial runner suite * fix detached direct actors * fix * split up advanced tesT * lint * fix core worker test hang * fix bad check fail which breaks test_cluster.py in tune * fix some minor diffs in test_cluster * less workers * make less stressful * split up test * retry flaky tests * remove old test flags * fixes * lint * Update worker_pool.cc * fix race * fix * fix bugs in node failure handling * fix race condition * fix bugs in node failure handling * fix race condition * nits * fix test * disable heartbeatS * disable heartbeatS * fix * fix * use worker id * fix max fail * debug exit * fix merge, and apply [PATCH] fix concurrency test * [patch] fix core worker test hang * remove NotifyActorCreation, and return worker on completion of actor creation task * remove actor diied callback * Update core_worker.cc * lint * use task manager * fix merge * fix deadlock * wip * merge conflits * fix * better sysexit handling * better sysexit handling * better sysexit handling * check id * better debug * task failed msg * task failed msg * retry failed tasks with delay * retry failed tasks with delay * clip deps * fix * fix core worker tests * fix task manager test * fix all tests * cleanup * set to 0 for direct tests * dont check worker id for ownership rpc * dont check worker id for ownership rpc * debug messages * add comment * remove debug statements * nit * check worker id * fix test * owner * fix tests	2019-12-13 13:58:04 -08:00
Eric Liang	304b4f0d3d	Shard unit tests into medium sized files for test stability (#6398 )	2019-12-09 13:15:29 -08:00
Edward Oakes	e4f9b3b7d9	Use process reaper for cleanup (#6253 )	2019-11-26 22:00:08 -06:00
Robert Nishihara	ffb9c0ecae	Fix bug in which remote function redefinition doesn't happen. (#6175 )	2019-11-26 11:19:19 -06:00
Eric Liang	f3f86385d6	Minimal implementation of direct task calls (#6075 )	2019-11-12 11:45:28 -08:00
Adam Gleave	01aee8d970	[autoscaler] Retry creating EC2 instances in new AZ (#6129 )	2019-11-09 19:44:27 -08:00
Philipp Moritz	32b2907457	Update max resource label and give better error message (#5916 )	2019-10-16 22:37:01 -07:00
Si-Yuan	2fb7d7846f	Initial implementation of Cython pickle5 support (#5725 )	2019-10-03 09:20:26 -07:00
Eric Liang	19bbf1eb4d	[rllib] Revert [rllib] Port DDPG to the build_tf_policy pattern (#5626 )	2019-09-04 21:39:22 -07:00
Eric Liang	3e70daba74	Warn on resource deadlock; improve object store error messages (#5555 ) * wip * wip * wip * wip * wip * add impl * second * warn once	2019-08-30 16:45:54 -07:00
Eric Liang	e2e30ca507	Ray, Tune, and RLlib support for memory, object_store_memory options (#5226 )	2019-08-21 23:01:10 -07:00
Richard Liaw	9c00616cdc	Retry and exception for hang on memory store full (#5143 )	2019-07-27 01:20:13 -07:00
Daniel Edgecumbe	06fec63c87	[autoscaler] Add a 'request_cores' function for manual autoscaling (#4754 )	2019-07-26 17:14:45 -07:00
justinwyang	e88e706fcc	Enforce quoting style in Travis. (#4589 )	2019-04-11 14:24:26 -07:00
William Ma	11580fb7dc	Changes where actor resources are assigned (#4323 )	2019-03-24 15:49:36 -07:00
Hao Chen	d03999d01e	Cross-language invocation Part 1: Java calling Python functions and actors (#4166 )	2019-03-21 13:34:21 +08:00
Yuhong Guo	6f46edca51	Skip dead nodes to avoid connection timeout. (#4154 )	2019-03-02 13:11:19 -08:00
Daniel Edgecumbe	2e30f7ba38	Add a web dashboard for monitoring node resource usage (#4066 )	2019-02-21 00:10:04 -08:00
Robert Nishihara	ef527f84ab	Stream logs to driver by default. (#3892 ) * Stream logs to driver by default. * Fix from rebase * Redirect raylet output independently of worker output. * Fix. * Create redis client with services.create_redis_client. * Suppress Redis connection error at exit. * Remove thread_safe_client from redis. * Shutdown driver threads in ray.shutdown(). * Add warning for too many log messages. * Only stop threads if worker is connected. * Only stop threads if they exist. * Remove unnecessary try/excepts. * Fix * Only add new logging handler once. * Increase timeout. * Fix tempfile test. * Fix logging in cluster_utils. * Revert "Increase timeout." This reverts commit b3846b89040bcd8e583b2e18cb513cb040e71d95. * Retry longer when connecting to plasma store from node manager and object manager. * Close pubsub channels to avoid leaking file descriptors. * Limit log monitor open files to 200. * Increase plasma connect retries. * Add comment.	2019-02-07 19:53:50 -08:00
Richard Liaw	d128636bab	Ray Logging Configuration (#3691 ) * fix logging for autoscaler * module logging * try this for logging * yapf * fix * Initial logging setup * momery * ok * remove basicconfig * catch * remove package logging * print * fix * try_fix * fix 1 * revert rllib * logging level * flake8 * fix * fix * Remove vestigal TODO	2019-01-30 21:01:12 -08:00
Robert Nishihara	0b1608a546	Factor out code for starting new processes and test plasma store in valgrind. (#3824 ) * Factor out starting Ray processes. * Detect flags through environment variables. * Return ProcessInfo from start_ray_process. * Print valgrind errors at exit. * Test valgrind in travis. * Some valgrind fixes. * Undo raylet monitor change. * Only test plasma store in valgrind.	2019-01-22 14:59:11 -08:00
Yuhong Guo	d2cf8561f2	Refactor code about ray.ObjectID. (#3674 ) * Refactor code about ray.ObjectID. * remove from_random and use nil_id instead of constructor * remove id() in hash * Lint and fix * Change driver id to ObjectID * Replace binary_to_hex(ObjectID.id()) to ObjectID.hex()	2019-01-13 01:47:29 -08:00
Robert Nishihara	067976ad3d	Push a warning to all users when large number of workers have been started. (#3645 ) * Push a warning to all users when large number of workers have been started. * Add test. * Fix bug. * Give warning when worker starts instead of when worker registers. * Fix * Fix tests	2019-01-05 13:27:32 -08:00
Robert Nishihara	586a5c9ffa	Limit default redis max memory to 10GB. (#3630 ) * Limit Redis max memory to 10GB/shard by default. * Update stress tests. * Reorganize * Update * Add minimum cap size for object store and redis. * Small test update.	2019-01-03 13:23:54 -08:00
Yuhong Guo	fb33fa9097	Enable function_descriptor in backend to replace the function_id (#3028 )	2018-12-18 18:53:59 -05:00
Hao Chen	e7b51cbd1b	[xray] Implement Actor Reconstruction (#3332 ) * Implement Actor Reconstruction * fix * fix actor handle __del__ * fix lint * add comment * Remove actorCreationDummyObjectId * address comments * fix * address comments * avoid copy * change log to debug * fix error name	2018-12-13 21:28:58 -08:00
Robert Nishihara	658c14282c	Remove legacy Ray code. (#3121 ) * Remove legacy Ray code. * Fix cmake and simplify monitor. * Fix linting * Updates * Fix * Implement some methods. * Remove more plasma manager references. * Fix * Linting * Fix * Fix * Make sure class IDs are strings. * Some path fixes * Fix * Path fixes and update arrow * Fixes. * linting * Fixes * Java fixes * Some java fixes * TaskLanguage -> Language * Minor * Fix python test and remove unused method signature. * Fix java tests * Fix jenkins tests * Remove commented out code.	2018-10-26 13:36:58 -07:00
Si-Yuan	cc7e2ecdd5	Change logfile names and also allow plasma store socket to be passed in. (#2862 )	2018-10-03 10:03:53 -07:00
Robert Nishihara	bd64c940e9	Push error to driver when monitor raises an exception. (#2834 )	2018-09-07 17:42:45 -07:00
Robert Nishihara	0ac855e061	Push errors to all drivers when node is marked dead. (#2808 ) * Push errors to all drivers when node is marked dead. * Fix	2018-09-02 20:04:58 -07:00

61 Commits