Commit Graph

61 Commits

Author SHA1 Message Date
SangBin Cho 39088ab6f2 [Stats] Metrics Export User Interface Part 2 (Prometheus Service Discovery) (#9970)
* In progress.

* In Progress.

* Finish the working version.

* Write a documentation.

* Addressed code review.

* Fix lint error.

* Lint.

* Addressed code review. Make test less flaky.

* Use a random port for ray start.

* Modify doc.

* Make write atomic.
2020-08-07 21:59:24 -07:00
Alex Wu 5b96a88cd7 [Core] Gpu type detection (#9695)
* .

* .

* .

* .

* .

* .

* .

* .

* Test cases

* detection only

* .

* Done?

* .

* .

* Done

* added test case

* .

* .

* .

* .

* .

* .

* Update python/ray/ray_constants.py

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* .

* .

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2020-08-01 11:43:56 -07:00
fyrestone 4d08ddbf24 [Dashboard] New dashboard skeleton (#9099) 2020-07-27 11:34:47 +08:00
ChenZhilei c11855728a Remove raylet monitor after use GCS service (#9179) 2020-07-01 20:01:52 +08:00
Max Fitton ad09aa985c Make Dashboard Port Configurable (#8999) 2020-06-19 16:26:22 -05:00
Zhilei Chen d8a9247448 Remove gcs_service_disabled ci jobs and code (#8854) 2020-06-19 11:32:27 +08:00
Max Fitton 13231ba63b Rename redis-port to port and add default (#8406) 2020-05-18 13:25:34 -05:00
Max Fitton 00325eb2b2 Rename max_reconstructions to max_restarts and use -1 for infinite (#8274)
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-05-14 10:30:29 -05:00
fyrestone fc6259a656 Cross language serialization for primitive types (#7711)
* Cross language serialization for Java and Python

* Use strict types when Python serializing

* Handle recursive objects in Python; Pin msgpack >= 0.6.0, < 1.0.0

* Disable gc for optimizing msgpack loads

* Fix merge bug

* Java call Python use returnType; Fix ClassLoaderTest

* Fix RayMethodsTest

* Fix checkstyle

* Fix lint

* prepare_args raises exception if try to transfer a non-deserializable object to another language

* Fix CrossLanguageInvocationTest.java, Python msgpack treat float as double

* Minor fixes

* Fix compile error on linux

* Fix lint in java/BUILD.bazel

* Fix test_failure

* Fix lint

* Class<?> to Class<T>; Refine metadata bytes.

* Rename FST to Fst; sort java dependencies

* Change Class<?>[] to Optional<Class<?>>; sort requirements in setup.py

* Improve CrossLanguageInvocationTest

* Refactor MessagePackSerializer.java

* Refactor MessagePackSerializer.java; Refine CrossLanguageInvocationTest.java

* Remove unnecessary dependencies for Java; Add getReturnType() for RayFunction in Java

* Fix bug

* Remove custom cross language type support

* Replace Serializer.Meta with MutableBoolean

* Remove @SuppressWarnings support from checkstyle.xml; Add null test in CrossLanguageInvocationTest.java

* Refine MessagePackSerializer.pack

* Ray.get support RayObject as input

* Improve comments and error info

* Remove classLoader argument from serializer

* Separate msgpack from pickle5 in Python

* Pair<byte[], MutableBoolean> to Pair<byte[], Boolean>

* Remove public static <T> T get(RayObject<T> object), use RayObject.get() instead

* Refine test

* small fixes

Co-authored-by: 刘宝 <po.lb@antfin.com>
Co-authored-by: Hao Chen <chenh1024@gmail.com>
2020-04-08 21:10:57 +08:00
fangfengbin e196fcdbaf Add gcs_service_enabled function to avoid getting environment variable directly (#7742) 2020-03-26 22:02:53 +08:00
Eric Liang 5a112ab212 Remove object store memory cap (#7654) 2020-03-19 16:00:30 -07:00
Edward Oakes d9027acaf2 Deprecate non-direct-call API (#7336) 2020-02-27 10:37:23 -08:00
Edward Oakes 2ad9bc5684 Move plasma retry logic into plasma store provider (#7328) 2020-02-26 16:57:02 -08:00
fangfengbin e7d0ec9531 Enable GCS server when running python unit tests (#7101)
* Enable GCS server when running python unit tests

* restart ci

* restart ci

* fix code style

* restart ci

* restart ci

* restart ci

* restart ci

* restart ci

* Define RAY_GCS_SERVICE_ENABLED as a constant

* fix review comments

* fix code style

* fix code style

* fix code style

* fix code style

* fix review comments

* add gcs service python testcase

* fix TESTSUITE name bug
2020-02-24 09:48:40 +08:00
Alex Wu 72c31e3e19 Ray nodes should respect docker limits (#7039) 2020-02-10 11:08:38 -08:00
fangfengbin 694c0f2867 [Java] Enable GCS server when running java unit tests (#7041)
* enable gcs service when run java testcase

* fix ci bug

* fix windows compile bug

* fix ci bug

* restart ci job

* enable java testcase

* restart ci job

* restart ci job

* add debug log

* add debug log

* restart ci job

* add debug log

* restart ci

* add debug log

* fix java testcase bug

* restart ci job

* restart ci job

* restart ci job

* restart ci job

* restart ci job

* restart ci job

* restart ci job

* restart ci job
2020-02-10 09:39:14 +08:00
Eric Liang 740bd00651 Use 100k for memory limit #7013) 2020-02-02 22:48:59 -08:00
Sven 60d4d5e1aa Remove future imports (#6724)
* Remove all __future__ imports from RLlib.

* Remove (object) again from tf_run_builder.py::TFRunBuilder.

* Fix 2xLINT warnings.

* Fix broken appo_policy import (must be appo_tf_policy)

* Remove future imports from all other ray files (not just RLlib).

* Remove future imports from all other ray files (not just RLlib).

* Remove future import blocks that contain `unicode_literals` as well.
Revert appo_tf_policy.py to appo_policy.py (belongs to another PR).

* Add two empty lines before Schedule class.

* Put back __future__ imports into determine_tests_to_run.py. Fails otherwise on a py2/print related error.
2020-01-09 00:15:48 -08:00
Eric Liang de22cdb233 Reduce reporter CPU (#6553)
* wip

* remove

* Update ray_constants.py
2019-12-19 22:21:30 -08:00
Simon Mo e530c37b0e Use localhost and set redis password by default (#6481) 2019-12-17 19:41:19 -08:00
Eric Liang be5dd8eb5e Enable direct calls by default (#6367)
* wip

* add

* timeout fix

* const ref

* comments

* fix

* fix

* Move actor state into actor handle

* comments 2

* enable by default

* temp reorder

* some fixes

* add debug code

* tmp

* fix

* wip

* remove dbg

* fix compile

* fix

* fix check

* remove non direct tests

* Increment ref count before resolving value

* rename

* fix another bug

* tmp

* tmp

* Fix object pinning

* build change

* lint

* ActorManager

* tmp

* ActorManager

* fix test component failures

* Remove old code

* Remove unused

* fix

* fix

* fix resources

* fix advanced

* eric's diff

* blacklist

* blacklist

* cleanup

* annotate

* disable tests for now

* remove

* fix

* fix

* clean up verbosity

* fix test

* fix concurrency test

* Update .travis.yml

* Update .travis.yml

* Update .travis.yml

* split up analysis suite

* split up trial runner suite

* fix detached direct actors

* fix

* split up advanced tesT

* lint

* fix core worker test hang

* fix bad check fail which breaks test_cluster.py in tune

* fix some minor diffs in test_cluster

* less workers

* make less stressful

* split up test

* retry flaky tests

* remove old test flags

* fixes

* lint

* Update worker_pool.cc

* fix race

* fix

* fix bugs in node failure handling

* fix race condition

* fix bugs in node failure handling

* fix race condition

* nits

* fix test

* disable heartbeatS

* disable heartbeatS

* fix

* fix

* use worker id

* fix max fail

* debug exit

* fix merge, and apply [PATCH] fix concurrency test

* [patch] fix core worker test hang

* remove NotifyActorCreation, and return worker on completion of actor creation task

* remove actor diied callback

* Update core_worker.cc

* lint

* use task manager

* fix merge

* fix deadlock

* wip

* merge conflits

* fix

* better sysexit handling

* better sysexit handling

* better sysexit handling

* check id

* better debug

* task failed msg

* task failed msg

* retry failed tasks with delay

* retry failed tasks with delay

* clip deps

* fix

* fix core worker tests

* fix task manager test

* fix all tests

* cleanup

* set to 0 for direct tests

* dont check worker id for ownership rpc

* dont check worker id for ownership rpc

* debug messages

* add comment

* remove debug statements

* nit

* check worker id

* fix test

* owner

* fix tests
2019-12-13 13:58:04 -08:00
Eric Liang 304b4f0d3d Shard unit tests into medium sized files for test stability (#6398) 2019-12-09 13:15:29 -08:00
Edward Oakes e4f9b3b7d9 Use process reaper for cleanup (#6253) 2019-11-26 22:00:08 -06:00
Robert Nishihara ffb9c0ecae Fix bug in which remote function redefinition doesn't happen. (#6175) 2019-11-26 11:19:19 -06:00
Eric Liang f3f86385d6 Minimal implementation of direct task calls (#6075) 2019-11-12 11:45:28 -08:00
Adam Gleave 01aee8d970 [autoscaler] Retry creating EC2 instances in new AZ (#6129) 2019-11-09 19:44:27 -08:00
Philipp Moritz 32b2907457 Update max resource label and give better error message (#5916) 2019-10-16 22:37:01 -07:00
Si-Yuan 2fb7d7846f Initial implementation of Cython pickle5 support (#5725) 2019-10-03 09:20:26 -07:00
Eric Liang 19bbf1eb4d [rllib] Revert [rllib] Port DDPG to the build_tf_policy pattern (#5626) 2019-09-04 21:39:22 -07:00
Eric Liang 3e70daba74 Warn on resource deadlock; improve object store error messages (#5555)
* wip

* wip

* wip

* wip

* wip

* add impl

* second

* warn once
2019-08-30 16:45:54 -07:00
Eric Liang e2e30ca507 Ray, Tune, and RLlib support for memory, object_store_memory options (#5226) 2019-08-21 23:01:10 -07:00
Richard Liaw 9c00616cdc Retry and exception for hang on memory store full (#5143) 2019-07-27 01:20:13 -07:00
Daniel Edgecumbe 06fec63c87 [autoscaler] Add a 'request_cores' function for manual autoscaling (#4754) 2019-07-26 17:14:45 -07:00
justinwyang e88e706fcc Enforce quoting style in Travis. (#4589) 2019-04-11 14:24:26 -07:00
William Ma 11580fb7dc Changes where actor resources are assigned (#4323) 2019-03-24 15:49:36 -07:00
Hao Chen d03999d01e Cross-language invocation Part 1: Java calling Python functions and actors (#4166) 2019-03-21 13:34:21 +08:00
Yuhong Guo 6f46edca51 Skip dead nodes to avoid connection timeout. (#4154) 2019-03-02 13:11:19 -08:00
Daniel Edgecumbe 2e30f7ba38 Add a web dashboard for monitoring node resource usage (#4066) 2019-02-21 00:10:04 -08:00
Robert Nishihara ef527f84ab Stream logs to driver by default. (#3892)
* Stream logs to driver by default.

* Fix from rebase

* Redirect raylet output independently of worker output.

* Fix.

* Create redis client with services.create_redis_client.

* Suppress Redis connection error at exit.

* Remove thread_safe_client from redis.

* Shutdown driver threads in ray.shutdown().

* Add warning for too many log messages.

* Only stop threads if worker is connected.

* Only stop threads if they exist.

* Remove unnecessary try/excepts.

* Fix

* Only add new logging handler once.

* Increase timeout.

* Fix tempfile test.

* Fix logging in cluster_utils.

* Revert "Increase timeout."

This reverts commit b3846b89040bcd8e583b2e18cb513cb040e71d95.

* Retry longer when connecting to plasma store from node manager and object manager.

* Close pubsub channels to avoid leaking file descriptors.

* Limit log monitor open files to 200.

* Increase plasma connect retries.

* Add comment.
2019-02-07 19:53:50 -08:00
Richard Liaw d128636bab Ray Logging Configuration (#3691)
* fix logging for autoscaler

* module logging

* try this for logging

* yapf

* fix

* Initial logging setup

* momery

* ok

* remove basicconfig

* catch

* remove package logging

* print

* fix

* try_fix

* fix 1

* revert rllib

* logging level

* flake8

* fix

* fix

* Remove vestigal TODO
2019-01-30 21:01:12 -08:00
Robert Nishihara 0b1608a546 Factor out code for starting new processes and test plasma store in valgrind. (#3824)
* Factor out starting Ray processes.

* Detect flags through environment variables.

* Return ProcessInfo from start_ray_process.

* Print valgrind errors at exit.

* Test valgrind in travis.

* Some valgrind fixes.

* Undo raylet monitor change.

* Only test plasma store in valgrind.
2019-01-22 14:59:11 -08:00
Yuhong Guo d2cf8561f2 Refactor code about ray.ObjectID. (#3674)
* Refactor code about ray.ObjectID.

* remove from_random and use nil_id instead of constructor

* remove id() in hash

* Lint and fix

* Change driver id to ObjectID

* Replace binary_to_hex(ObjectID.id()) to ObjectID.hex()
2019-01-13 01:47:29 -08:00
Robert Nishihara 067976ad3d Push a warning to all users when large number of workers have been started. (#3645)
* Push a warning to all users when large number of workers have been started.

* Add test.

* Fix bug.

* Give warning when worker starts instead of when worker registers.

* Fix

* Fix tests
2019-01-05 13:27:32 -08:00
Robert Nishihara 586a5c9ffa Limit default redis max memory to 10GB. (#3630)
* Limit Redis max memory to 10GB/shard by default.

* Update stress tests.

* Reorganize

* Update

* Add minimum cap size for object store and redis.

* Small test update.
2019-01-03 13:23:54 -08:00
Yuhong Guo fb33fa9097 Enable function_descriptor in backend to replace the function_id (#3028) 2018-12-18 18:53:59 -05:00
Hao Chen e7b51cbd1b [xray] Implement Actor Reconstruction (#3332)
* Implement Actor Reconstruction

* fix

* fix actor handle __del__

* fix lint

* add comment

* Remove actorCreationDummyObjectId

* address comments

* fix

* address comments

* avoid copy

* change log to debug

* fix error name
2018-12-13 21:28:58 -08:00
Robert Nishihara 658c14282c Remove legacy Ray code. (#3121)
* Remove legacy Ray code.

* Fix cmake and simplify monitor.

* Fix linting

* Updates

* Fix

* Implement some methods.

* Remove more plasma manager references.

* Fix

* Linting

* Fix

* Fix

* Make sure class IDs are strings.

* Some path fixes

* Fix

* Path fixes and update arrow

* Fixes.

* linting

* Fixes

* Java fixes

* Some java fixes

* TaskLanguage -> Language

* Minor

* Fix python test and remove unused method signature.

* Fix java tests

* Fix jenkins tests

* Remove commented out code.
2018-10-26 13:36:58 -07:00
Si-Yuan cc7e2ecdd5 Change logfile names and also allow plasma store socket to be passed in. (#2862) 2018-10-03 10:03:53 -07:00
Robert Nishihara bd64c940e9 Push error to driver when monitor raises an exception. (#2834) 2018-09-07 17:42:45 -07:00
Robert Nishihara 0ac855e061 Push errors to all drivers when node is marked dead. (#2808)
* Push errors to all drivers when node is marked dead.

* Fix
2018-09-02 20:04:58 -07:00