Commit Graph

164 Commits

Author SHA1 Message Date
Philipp Moritz f24d96ec4f Revert "Try to enable dashboard (again) (#6069)" (#6159)
This reverts commit 4044af8520.
2019-11-13 12:32:12 -08:00
Stephanie Wang 35d177f459 Use grpc for communication from worker to local raylet (task submission and direct actor args only) (#6118)
* Skeleton for SubmitTask proto

* Pass through node manager port, connect in raylet client

* Switch submit task to grpc

* Check port in use

* doc

* Remove default port, set port randomly from driver

* update

* Fix test

* Fix object manager test
2019-11-11 21:17:25 -08:00
Eric Liang 4044af8520 Try to enable dashboard (again) (#6069)
* Revert "Revert "Enable the Ray dashboard by default (#5976)" (#6068)"

This reverts commit 1a3e97cf23.

* fix tests that assume the dashboard isn't a job

* travis
2019-11-08 10:48:48 -08:00
Eric Liang 1a3e97cf23 Revert "Enable the Ray dashboard by default (#5976)" (#6068)
This reverts commit 6166ef3e09.
2019-11-01 17:08:37 -07:00
Eric Liang 6166ef3e09 Enable the Ray dashboard by default (#5976) 2019-11-01 12:19:01 -07:00
Edward Oakes f8a6ed7832 Spawn processes in background sessions (#6008)
Allows us to properly handle KeyboardInterrupts in interactive python interpreters.
2019-10-25 13:01:35 -07:00
Mitchell Stern 235dec8aa3 [Dashboard] Remove token authentication from dashboard (#5888) 2019-10-21 12:48:48 -07:00
Philipp Moritz d23696de17 Introduce flag to use pickle for serialization (#5805) 2019-10-18 22:29:36 -07:00
Edward Oakes 62bc30c1cf Validate redis address parameters (#5746)
* Validate redis address params

* Fix comment

* Add check
2019-09-23 10:52:34 -05:00
Mitchell Stern 98dcc1d440 [Dashboard] Add initial version of new dashboard (#5730) 2019-09-23 08:50:40 -07:00
Edward Oakes ee5db5b67f Raise error if space in redis password (#5673) 2019-09-11 20:58:39 -07:00
Kai Yang 732336fc4f [Java] Support multiple workers in Java worker process (#5505) 2019-09-07 22:52:05 +08:00
Eric Liang d20696300e Fix autoscaler format string for memory (#5542)
* add format string

* fix cast
2019-08-26 23:25:11 -07:00
Eric Liang e2e30ca507 Ray, Tune, and RLlib support for memory, object_store_memory options (#5226) 2019-08-21 23:01:10 -07:00
Eric Liang df47bdf6c9 Allow address instead of redis_address (#5412)
* addr

* wip

* fix typo

* add to start

* switch to ray address for train

* say address

* disambiguate help

* comments 2
2019-08-10 00:18:41 -07:00
Eric Liang 955154a19d Reduce Ray / RLlib startup messages (#5368) 2019-08-05 13:23:54 -07:00
Qing Wang f2293243cc [ID Refactor] Shorten the length of JobID to 4 bytes (#5110)
* WIP

* Fix

* Add jobid test

* Fix

* Add python part

* Fix

* Fix test

* Remove TODOs

* Fix C++ tests

* Lint

* Fix

* Fix exporting functions in multiple ray.init

* Fix java test

* Fix lint

* Fix linting

* Address comments.

* Fix

* Address and fix linting

* Refine and fix

* Fix

* address

* Address comments.

* Fix linting

* Fix

* Address

* Address comments.

* Address

* Address

* Fix

* Fix

* Fix

* Fix lint

* Fix

* Fix linting

* Address comments.

* Fix linting

* Address comments.

* Fix linting

* address comments.

* Fix
2019-07-11 14:25:16 +08:00
Eric Liang 5aec750107 Add warning/error if object store memory exceeds available memory (#4893)
* exclude

* format

* add warning

* hatch

* reduce mem usage

* reduce object store mem

* set obj mem
2019-07-08 21:37:08 -07:00
Qing Wang e33d0eac68 Add dynamic worker options for worker command. (#4970)
* Add fields for fbs

* WIP

* Fix compilation errors

* Add java part

* Fix

* Fix

* Fix

* Fix lint

* Refine API

* address comments and add test

* Fix

* Address comment.

* Address comments.

* Fix linting

* Refine

* Fix lint

* WIP: address comment.

* Fix java

* Fix py

* Refine

* Fix

* Fix

* Fix linting

* Fix lint

* Address comments

* WIP

* Fix

* Fix

* minor refine

* Fix lint

* Fix raylet test.

* Fix lint

* Update src/ray/raylet/worker_pool.h

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

* Update java/runtime/src/main/java/org/ray/runtime/AbstractRayRuntime.java

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

* Address comments.

* Address comments.

* Fix test.

* Update src/ray/raylet/worker_pool.h

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

* Address comments.

* Address comments.

* Fix

* Fix lint

* Fix lint

* Fix

* Address comments.

* Fix linting
2019-06-23 18:08:33 +08:00
Philipp Moritz 1e2b649580 Use proper session directory for debug_string.txt (#4960) 2019-06-10 23:46:37 -07:00
Si-Yuan 4e0be8b450 Drop duplicated string format (#4897)
This string format is unnecessary. java_worker_options has been appended to the commandline later.
2019-05-30 19:43:27 +08:00
Robert Nishihara 6703519144 Move global state API out of global_state object. (#4857) 2019-05-26 11:27:53 -07:00
Qing Wang 259cdfa0de Fix issue when starting raylet_monitor (#4829) 2019-05-22 11:08:24 +08:00
Qing Wang dcd6d4949c Fix Java worker log dir (#4781) 2019-05-17 16:13:28 +08:00
Qing Wang f39b6747e5 Refactor command line argument parsing with gflags (#4676) 2019-04-24 14:53:07 +08:00
Daniel Edgecumbe 3e1adafbce [autoscaler] Add an aggressive_autoscaling flag (#4285) 2019-04-13 18:44:32 -07:00
Romil Bhardwaj 0f42f87ebc Updating zero capacity resource semantics (#4555) 2019-04-12 16:53:57 -07:00
Si-Yuan dab99d26af Improve code related to node (#4383)
* Make full use of node

implement local node

fix bugs mentioned in comments

* Add more tests

* Use more specific exception handling

* fix, lint

* fix for py2.x
2019-04-09 17:27:54 +08:00
Yuhong Guo c2349cf12d Remove local/global_scheduler from code and doc. (#4549) 2019-04-03 17:05:09 -07:00
Robert Nishihara 8548f12eb2 Give better error when include_webui=1 and webui can't be started. (#4471) 2019-03-26 14:54:32 -07:00
Philipp Moritz 95254b3d71 Remove the old web UI (#4301) 2019-03-07 23:15:11 -08:00
Hao Chen f0465bc68c [Java] Refine tests and fix single-process mode (#4265) 2019-03-07 09:59:13 +08:00
Eric Liang 3896b726dd Dynamically adjust redis memory usage (#4152)
* f

* Update services.py
2019-02-25 16:21:37 -08:00
Daniel Edgecumbe 2e30f7ba38 Add a web dashboard for monitoring node resource usage (#4066) 2019-02-21 00:10:04 -08:00
Yuhong Guo 1f864a02bc Add option of load_code_from_local which is required in cross-language ray call. (#3675) 2019-02-21 12:37:17 +08:00
Wang Qing 7574757391 Fix crash for Java task's task.argument() in state. (#4063) 2019-02-19 12:46:07 +08:00
Si-Yuan 2de31eb489 minor fix (#4040) 2019-02-13 17:22:45 -08:00
Si-Yuan 21472b890a Integrate "tempfile_service" into "ray.node.Node" (#3953) 2019-02-12 17:34:04 -08:00
Wang Qing c523bc04ad Enable redis password in Java worker (#3943)
* Support Java redis password

* Fix

* Refine

* Fix lint.
2019-02-12 13:11:25 +08:00
Robert Nishihara ef527f84ab Stream logs to driver by default. (#3892)
* Stream logs to driver by default.

* Fix from rebase

* Redirect raylet output independently of worker output.

* Fix.

* Create redis client with services.create_redis_client.

* Suppress Redis connection error at exit.

* Remove thread_safe_client from redis.

* Shutdown driver threads in ray.shutdown().

* Add warning for too many log messages.

* Only stop threads if worker is connected.

* Only stop threads if they exist.

* Remove unnecessary try/excepts.

* Fix

* Only add new logging handler once.

* Increase timeout.

* Fix tempfile test.

* Fix logging in cluster_utils.

* Revert "Increase timeout."

This reverts commit b3846b89040bcd8e583b2e18cb513cb040e71d95.

* Retry longer when connecting to plasma store from node manager and object manager.

* Close pubsub channels to avoid leaking file descriptors.

* Limit log monitor open files to 200.

* Increase plasma connect retries.

* Add comment.
2019-02-07 19:53:50 -08:00
William Ma f067223c4a Allow Ray processes to be started inside of gdb and tmux. (#3847) 2019-02-04 15:23:39 -08:00
Wang Qing e1c68a0881 Enable including Java worker for ray start command (#3838) 2019-02-04 16:23:43 +08:00
Si-Yuan 9295ab8f60 Various Python code cleanups. (#3837) 2019-02-03 10:16:24 -08:00
Richard Liaw d128636bab Ray Logging Configuration (#3691)
* fix logging for autoscaler

* module logging

* try this for logging

* yapf

* fix

* Initial logging setup

* memory

* ok

* remove basicconfig

* catch

* remove package logging

* print

* fix

* try_fix

* fix 1

* revert rllib

* logging level

* flake8

* fix

* fix

* Remove vestigial TODO
2019-01-30 21:01:12 -08:00
Robert Nishihara 0b1608a546 Factor out code for starting new processes and test plasma store in valgrind. (#3824)
* Factor out starting Ray processes.

* Detect flags through environment variables.

* Return ProcessInfo from start_ray_process.

* Print valgrind errors at exit.

* Test valgrind in travis.

* Some valgrind fixes.

* Undo raylet monitor change.

* Only test plasma store in valgrind.
2019-01-22 14:59:11 -08:00
Robert Nishihara 8723d6b061 Define a Node class to manage Ray processes. (#3733)
* Implement Node class and move most of services.py into it.

* Wait for nodes as they are added to the cluster.

* Fix Redis authentication bug.

* Fix bug in client table ordering.

* Address comments.

* Kill raylet before plasma store in test.

* Minor
2019-01-11 22:30:38 -08:00
Robert Nishihara 6bbc667f93 Remove unused code path in services.py. (#3722) 2019-01-08 19:57:16 -08:00
Robert Nishihara c9d70f0dda Remove num_local_schedulers argument from ray.worker._init. (#3704)
* Remove num_local_schedulers argument from ray.worker._init.

* Fix

* Fix tests.
2019-01-07 12:44:49 -08:00
Robert Nishihara 586a5c9ffa Limit default redis max memory to 10GB. (#3630)
* Limit Redis max memory to 10GB/shard by default.

* Update stress tests.

* Reorganize

* Update

* Add minimum cap size for object store and redis.

* Small test update.
2019-01-03 13:23:54 -08:00
Robert Nishihara b6bcd18d65 Split profile table among many keys in the GCS. (#3676)
* Divide profile table among many keys in GCS.

* Fix, and remove --collect-profiling-data arg.

* Remove reference in doc.
2019-01-02 21:33:01 -08:00