49 Commits

Author SHA1 Message Date
Eric Liang c46ea2ff4b Click 7.0 changes the naming convention for commands; fix this 2018-11-28 14:59:58 -08:00
Eric Liang 0d56fc10cc Move setproctitle to ray[debug] package (#3415) 2018-11-27 09:50:59 -08:00
Richard Liaw c24d87b4d1 [autoscaler] Submit command (#3312) 2018-11-20 14:03:34 -08:00
Philipp Moritz 8894883153 Force kill web UI in ray stop (#3257) 2018-11-08 00:05:32 -08:00
Richard Liaw 0bab8ed95c Expose internal config parameters for starting Ray (#3246)
## What do these changes do?

This PR exposes a command-line option for passing internal config parameters. This is important for certain tests (e.g., fault-tolerance tests that remove nodes) to run quickly.

Note that this is bad practice and should be replaced with GFLAGS or some equivalent as soon as possible.

#3239 depends on this.

TODO:
 - [x] Add documentation to method arguments before merging.
 - [x] Add test to verify this works?

## Related issue number
2018-11-07 21:46:02 -08:00
Eric Liang 725df3a485 Set the process title in workers and actors (#3219) 2018-11-06 14:59:22 -08:00
Eric Liang 9a0f0db070 Add ray stack tool for debugging (#3213) 2018-11-03 13:13:02 -07:00
Robert Nishihara e495ab5e7c Fix some paths /tmp/raylogs -> /tmp/ray. (#3189) 2018-11-02 12:10:53 -07:00
Robert Nishihara fd854ff090 Allow the node manager port and object manager port to be set through… (#3130)
* Allow the node manager port and object manager port to be set through ray start.

* Linting

* Fix Java test

* Address comments.
2018-10-28 17:28:41 -07:00
Robert Nishihara 658c14282c Remove legacy Ray code. (#3121)
* Remove legacy Ray code.

* Fix cmake and simplify monitor.

* Fix linting

* Updates

* Fix

* Implement some methods.

* Remove more plasma manager references.

* Fix

* Linting

* Fix

* Fix

* Make sure class IDs are strings.

* Some path fixes

* Fix

* Path fixes and update arrow

* Fixes.

* linting

* Fixes

* Java fixes

* Some java fixes

* TaskLanguage -> Language

* Minor

* Fix python test and remove unused method signature.

* Fix java tests

* Fix jenkins tests

* Remove commented out code.
2018-10-26 13:36:58 -07:00
Robert Nishihara 5aa29613db Fix linting errors. (#3127) 2018-10-24 16:30:00 -07:00
Robert Nishihara 9c1826ed69 Use XRay backend by default. (#3020)
* Use XRay backend by default.

* Remove irrelevant valgrind tests.

* Fix

* Move tests around.

* Fix

* Fix test

* Fix test.

* String/unicode fix.

* Fix test

* Fix unicode issue.

* Minor changes

* Fix bug in test_global_state.py.

* Fix test.

* Linting

* Try arrow change and other object manager changes.

* Use newer plasma client API

* Small updates

* Revert plasma client api change.

* Update

* Update arrow and allow SendObjectHeaders to fail.

* Update arrow

* Update python/ray/experimental/state.py

Co-Authored-By: robertnishihara <robertnishihara@gmail.com>

* Address comments.
2018-10-23 12:46:39 -07:00
Peter Schafhalter b82fd157a7 Remove Redis protected mode (#3073)
Follow-up to #2925 and #2952. Removes the Redis protected mode implementation from Ray which was replaced by Redis port authentication.
2018-10-17 22:48:14 -07:00
Peter Schafhalter a41bbc10ef Add password authentication to Redis ports (#2952)
* Implement Redis authentication

* Throw exception for legacy Ray

* Add test

* Formatting

* Fix bugs in CLI

* Fix bugs in Raylet

* Move default password to constants.h

* Use pytest.fixture

* Fix bug

* Authenticate using formatted strings

* Add missing passwords

* Add test

* Improve authentication of async contexts

* Disable Redis authentication for credis

* Update test for credis

* Fix rebase artifacts

* Fix formatting

* Add workaround for issue #3045

* Increase timeout for test

* Improve C++ readability

* Fixes for CLI

* Add security docs

* Address comments

* Address comments

* Address comments

* Use ray.get

* Fix lint
2018-10-16 22:48:30 -07:00
Si-Yuan f2dbd3096c Minor improvements and fixes in Python code. (#3022)
This commit fixes some small defects.
1. Remove a comment that should have been removed in #3003
2. Remove `redis_protected_mode` that is never used in `ray.init()`
3. Fix `object_id_seed` not being passed into `ray._init()`
4. Remove several redundant brackets.
2018-10-03 21:08:20 -07:00
Si-Yuan cc7e2ecdd5 Change logfile names and also allow plasma store socket to be passed in. (#2862) 2018-10-03 10:03:53 -07:00
Eric Liang cf9cd5da9d [ray] Add --new flag for ray attach (#2973)
* new flag

* yapf
2018-09-29 23:04:13 -07:00
Richard Liaw 1c9617bc1c [autoscaler] Add tmux support for attach and exec (#2907)
Adds a tmux flag that can be used to support background execution of experiments. It cannot be used together with screen. This seems to be a useful feature that several users have asked about.
2018-09-26 23:22:45 -07:00
Eric Liang 588c573d41 Ray stop needs to kill plasma_store_server not plasma_store (#2850) 2018-09-09 19:23:09 -07:00
Eric Liang e7db54bdb0 Log at INFO level by default (including in autoscaler). (#2824)
Before this change, the autoscaler `up` and related commands don't print any info messages to the console at all. This was a regression from 0.5. @richardliaw @robertnishihara https://github.com/ray-project/ray/issues/2812
2018-09-06 13:31:19 -07:00
Mitar 3850e3ba64 Added extra logging related arguments to "ray start" (#2664) 2018-08-28 23:00:37 -07:00
Yuhong Guo 0b6e08ebee Separate python logger module-wise (#2703)
## What do these changes do?
1. Separate the log related code to logger.py from services.py.
2. Allow users to modify logging formatter in `ray start`.

## Related issue number
https://github.com/ray-project/ray/pull/2664
2018-08-26 13:46:14 -07:00
Eric Liang aa014af85b [rllib] Fix atari reward calculations, add LR annealing, explained var stat for A2C / impala (#2700)
Changes needed to reproduce Atari plots in IMPALA / A2C: https://github.com/ray-project/rl-experiments
2018-08-23 17:49:10 -07:00
Robert Nishihara 89d4a6df93 Start Redis in protected mode when started via ray.init(). (#2697)
This PR makes it so that when Ray is started via ray.init() (as opposed to via ray start) the Redis servers will be started in "protected mode" (which means that clients can only connect by connecting to localhost).

In practice, we actually connect redis clients by passing in the node IP address (not localhost), so I need to create a redis config file on the fly to allow both localhost and the node's actual IP address (it would have been nice to find a way to do this from the Python redis client, but I couldn't find one).
2018-08-20 14:08:01 -07:00
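The "config file on the fly" step described in commit 89d4a6df93 could look roughly like the sketch below. It is illustrative only: `write_redis_config` is a hypothetical helper name, and the sketch assumes the standard redis.conf directives `bind` and `port`; the exact file contents written by the commit are not shown in this log.

```python
import os
import tempfile


def write_redis_config(node_ip_address, port):
    """Write a temporary redis.conf that accepts connections on both
    localhost and the node's own IP address, and return its path.

    Hypothetical helper for illustration; not the code from the commit.
    """
    lines = [
        # Listen on localhost and on the node's actual IP address only.
        "bind 127.0.0.1 {}".format(node_ip_address),
        "port {}".format(port),
    ]
    fd, path = tempfile.mkstemp(prefix="redis-", suffix=".conf")
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines) + "\n")
    return path
```

The returned path would then be passed to `redis-server` as its first argument when launching the server.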
Eric Liang 9473da69bd [autoscaler] Experimental support for local / on-prem clusters (#2678)
This adds some experimental (undocumented) support for launching Ray on existing nodes. You have to provide the head IP and the list of worker IPs.

There are also a couple of additional utils for rsyncing files and port forwarding.
2018-08-19 12:43:04 -07:00
Eric Liang 079c4e482a ray exec and ray attach commands (#2560)
ray exec CLUSTER CMD [--screen] [--start] [--stop]
ray attach CLUSTER [--start]

Example:
ray exec sgd.yaml 'source activate tensorflow_p27 && cd ~/ray/python/ray/rllib && ./train.py --run=PPO --env=CartPole-v0' --screen --start --stop

This will in one command create a cluster and run the command on it in a screen session. The screen can later be attached to via ray attach. After the command finishes, the cluster workers will be terminated and the head node stopped.
2018-08-15 14:31:50 -07:00
Robert Nishihara 515da7721a Change ray.worker.cleanup -> ray.shutdown and improve API documentation. (#2374)
* Change ray.worker.cleanup -> ray.shutdown and improve API documentation.

* Deprecate ray.worker.cleanup() gracefully.

* Fix linting
2018-07-12 12:00:00 -07:00
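A graceful deprecation like the one named in commit 515da7721a is commonly done with a thin wrapper that warns and then forwards to the new function. The sketch below shows that general pattern only; `shutdown` and `cleanup` here are stand-in functions defined for illustration, not Ray's actual implementation.

```python
import warnings


def shutdown():
    """Stand-in for the new entry point (hypothetical body)."""
    return "shutdown complete"


def cleanup():
    """Deprecated alias kept so existing callers keep working."""
    warnings.warn(
        "cleanup() is deprecated; use shutdown() instead.",
        DeprecationWarning,
        stacklevel=2,  # Attribute the warning to the caller's line.
    )
    return shutdown()
```

Old call sites continue to run unchanged, and the warning tells users which name to migrate to.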
Robert Nishihara b90e551b41 [xray] Implement timeline and profiling API. (#2306)
* Add profile table and store profiling information there.

* Code for dumping timeline.

* Improve color scheme.

* Push timeline events on driver only for raylet.

* Improvements to profiling and timeline visualization

* Some linting

* Small fix.

* Linting

* Propagate node IP address through profiling events.

* Fix test.

* object_id.hex() should return byte string in python 2.

* Include gcs.fbs in node_manager.fbs.

* Remove flatbuffer definition duplication.

* Decode to unicode in Python 3 and bytes in Python 2.

* Minor

* Submit profile events in a batch. Revert some CMake changes.

* Fix

* Workaround test failure.

* Fix linting

* Linting

* Don't return anything from chrome_tracing_dump when filename is provided.

* Remove some redundancy from profile table.

* Linting

* Move TODOs out of docstring.

* Minor
2018-07-04 23:23:48 -07:00
Robert Nishihara 52b0f3734a [xray] Add Travis build for testing xray on Linux. (#2047)
* Run xray tests in travis.

* Comment out TaskTests.testSubmittingManyTasks.

* Comment out failing tests.

* Comment out hanging test.

* Linting

* Comment out failing test.

* Comment out failing test.

* Ignore test_dataframe.py for now.

* Comment out testDriverExitingQuickly.
2018-05-13 21:22:01 -07:00
Alexey Tumanov 1c965fcfeb Raylet task dispatch and throttling worker startup (#1912)
* separate task placement and task dispatch; throttle task dispatch with locally available resources

* keep track of workers being started/in flight and suppress starting extraneous workers

* cleanup comments

* remove early termination in task dispatch to support zero-resource actor tasks

* info -> debug

* add documentation

* linting

* mock the worker pool for testing

* some linting

* kill all workers in flight; clear the worker pool in dtor

* remove fixed todo

* lint
2018-04-18 10:58:11 -07:00
Philipp Moritz 74162d1492 Lint Python files with Yapf (#1872) 2018-04-11 10:11:35 -07:00
Robert Nishihara fbfbb1c079 [xray] Integrate worker.py with raylet. (#1810)
* Integrate worker with raylet.

* Begin allowing worker to attach to cluster.

* Fix linting and documentation.

* Fix linting.

* Comment tests back in.

* Fix type of worker command.

* Remove xray python files and tests.

* Fix from rebase.

* Add test.

* Copy over raylet executable.

* Small cleanup.
2018-04-03 02:38:56 -07:00
Robert Nishihara 4bccabd910 Redirect output of all processes by default. (#1752)
* Redirect output of all processes by default.

* Add separate flag for redirecting worker output.

* Fix tests.
2018-03-20 18:14:54 -07:00
Robert Nishihara 330159d8bd Allow setting redis shard ports through ray start (also object store memory). (#1581)
* Allow passing in --object-store-memory to ray start.

* Allow setting ports for the redis shards.

* Reorder arguments and infer number of shards from ports.

* Move code block into only the head node case.

* Add test.
2018-02-22 11:05:37 -08:00
Richard Liaw e62ad7007d [autoscaler] Improve UX for Autoscaler (#1558) 2018-02-21 22:19:04 -08:00
Richard Liaw 73be235701 Quick Fix for Killing Ray Notebooks (#1563) 2018-02-19 16:10:37 -08:00
Eric Liang b6c42f96be Auto-scale ray clusters based on GCS load metrics (#1348)
This adds (experimental) auto-scaling support for Ray clusters based on GCS load metrics. The auto-scaling algorithm is as follows:

1. Based on current (instantaneous) load information, we compute the approximate number of "used workers". This is based on the bottleneck resource, e.g. if 8/8 GPUs are used in an 8-node cluster but all the CPUs are idle, the number of used nodes is still counted as 8. This number can also be fractional.
2. We scale that number by 1 / target_utilization_fraction and round up to determine the target cluster size (subject to the max_workers constraint). The autoscaler control loop takes care of launching new nodes until the target cluster size is met.
3. When a node is idle for more than idle_timeout_minutes, we remove it from the cluster if that would not drop the cluster size below min_workers.

Note that we'll need to update the wheel in the example yaml file after this PR is merged.
2017-12-31 14:39:57 -08:00
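The scaling rule described in commit b6c42f96be can be sketched as follows. This is a minimal illustration under stated assumptions, not the autoscaler's code: the function names and the dictionary-based resource bookkeeping are hypothetical, and only `target_utilization_fraction`, `max_workers`, `min_workers`, and `idle_timeout_minutes` come from the commit message.

```python
import math


def approx_used_workers(used, total, num_nodes):
    """Estimate "used workers" from the bottleneck resource.

    `used` and `total` map resource names to cluster-wide amounts
    (hypothetical bookkeeping). The result may be fractional.
    """
    fractions = [used.get(name, 0) / total[name]
                 for name in total if total[name] > 0]
    return num_nodes * max(fractions, default=0.0)


def target_cluster_size(used_workers, target_utilization_fraction,
                        max_workers):
    """Scale by 1 / target_utilization_fraction, round up, cap at max."""
    desired = math.ceil(used_workers / target_utilization_fraction)
    return min(max_workers, desired)


def can_remove_idle_node(idle_minutes, idle_timeout_minutes,
                         cluster_size, min_workers):
    """An idle node is removed only if the cluster stays >= min_workers."""
    return (idle_minutes > idle_timeout_minutes
            and cluster_size - 1 >= min_workers)
```

With the example from the message (8 of 8 GPUs busy, all CPUs idle, 8 nodes), `approx_used_workers` gives 8.0, because the GPU is the bottleneck resource.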
Eric Liang f5ea44338e EC2 cluster setup scripts and initial version of auto-scaler (#1311) 2017-12-15 23:56:39 -08:00
Robert Nishihara c21e189371 Allow scheduling with arbitrary user-defined resource labels. (#1236)
* Enable scheduling with custom resource labels.

* Fix.

* Minor fixes and ref counting fix.

* Linting

* Use .data() instead of .c_str().

* Fix linting.

* Fix ResourcesTest.testGPUIDs test by waiting for workers to start up.

* Sleep in test so that all tasks are submitted before any completes.
2017-12-01 11:41:40 -08:00
Robert Nishihara c1496b8111 Check version info in ray start for non-head nodes. (#1264)
* Check version info in ray start for non-head nodes.

* Small fix.

* Fix

* Push error to all drivers when worker has version mismatch.

* Linting

* Linting

* Fix

* Unify methods.

* Fix bug.
2017-11-27 22:03:38 -08:00
Robert Nishihara 0b4961b161 Provide flag for setting redis maxclients. (#1257)
* Add flag for attempting to increase ulimit -n and the redis maxclients.

* Don't bother trying to set ulimit -n.

* Fix linting.

* Add basic test.
2017-11-26 18:25:55 -08:00
Robert Nishihara 11f8f8bd8c Document --num-workers better. (#1201) 2017-11-09 17:02:18 -08:00
Robert Nishihara 3317d38278 Replace hostnames with numerical IP addresses in redis address. (#1177)
* Replace hostnames with numerical IP addresses in redis address.

* Also do conversion for node_ip_address. Add test.

* Simplifications.
2017-11-01 17:13:22 -07:00
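The conversion named in commit 3317d38278 can be approximated with the standard library. The sketch below assumes a `host:port` address string and IPv4 resolution; `address_to_ip` is a hypothetical name, not necessarily the helper added by the commit.

```python
import socket


def address_to_ip(address):
    """Replace the hostname in "host:port" with its numerical IPv4 address.

    Hypothetical helper for illustration. An address that already holds
    a numerical IPv4 address is returned unchanged.
    """
    host, port = address.rsplit(":", 1)
    return "{}:{}".format(socket.gethostbyname(host), port)
```

Resolving once up front means every process that later receives the address sees the same numerical form, regardless of its local name resolution.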
Alexey Tumanov 2d0f439b7b hugepage + plasma directory support plumbing + documentation (#1030)
* hugepage + plasma directory support plumbing + documentation

* Indentation fix.

* huge_pages_enabled --> huge_pages

* One more change
2017-09-30 09:56:52 -07:00
Robert Nishihara b991dc8900 Add flag for ignoring the UI, don't start UI in jenkins tests. (#1021) 2017-09-29 15:22:51 -07:00
Alexey Tumanov fc885bd918 Adding basic support for a user-interpretable resource label (#761)
* adding support for the user-interpretable label(UIR)

* more plumbing for num_uirs further upstream; set to infty when specified on cmd line

* pass default num_uirs for actors; update GlobalStateAPI

* support num_uirs in ray.init()

* local scheduler resource accounting: support num_uirs; prep for vectorized resource accounting

* global scheduler test updated

* Fix bug introduced by rebase.

* Rename UIR -> CustomResource and add test.

* Small changes and use constexpr instead of macros.

* Linting and some renaming.

* Reorder some code.

* Remove cpus_in_use and fix bug.

* Add another test and make a small change.

* Rephrase documentation about feature stability.
2017-08-08 02:53:59 -07:00
Robert Nishihara e0867c8845 Switch Python indentation from 2 spaces to 4 spaces. (#726)
* 4 space indentation for actor.py.

* 4 space indentation for worker.py.

* 4 space indentation for more files.

* 4 space indentation for some test files.

* Check indentation in Travis.

* 4 space indentation for some rl files.

* Fix failure test.

* Fix multi_node_test.

* 4 space indentation for more files.

* 4 space indentation for remaining files.

* Fixes.
2017-07-13 21:53:57 +00:00
Robert Nishihara 2d636d9278 Kill jupyter in ray stop. (#689)
* Kill jupyter in ray stop.

* Terminate jupyter notebook in ray stop.

* Fix linting.
2017-06-21 05:58:34 +00:00
Robert Nishihara 1a682e2807 Enable starting and stopping ray with "ray start" and "ray stop". (#628)
* Install start_ray and stop_ray scripts in setup.py.

* Update documentation.

* Fix docker tests.

* Implement stop_ray script in python.

* Fix linting.
2017-06-02 20:17:48 +00:00