## What do these changes do?
This PR exposes a command-line (CL) option for passing a config parameter. This is important for certain tests (e.g., FT tests that remove nodes) to run quickly.
Note that this is bad practice and should be replaced with GFLAGS or some equivalent as soon as possible.
#3239 depends on this.
TODO:
- [x] Add documentation to method arguments before merging.
- [x] Add test to verify this works?
## Related issue number
This commit fixes some small defects.
1. Remove a comment that should have been removed in #3003
2. Remove `redis_protected_mode` that is never used in `ray.init()`
3. Fix `object_id_seed`, which was not being passed into `ray._init()`
4. Remove several redundant brackets.
Adds a tmux flag that can be used to support background execution of experiments. It cannot be used together with screen. This seems to be a useful feature that has come up with different users.
Before this change, the autoscaler `up` and related commands didn't print any info messages to the console at all. This was a regression from 0.5. @richardliaw @robertnishihara https://github.com/ray-project/ray/issues/2812
## What do these changes do?
1. Move the logging-related code from services.py into logger.py.
2. Allow users to modify the logging formatter in `ray start`.
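A minimal sketch of what a user-configurable formatter could look like, using only the standard `logging` module. The helper name `build_formatter` and the default format string are assumptions for illustration; they are not taken from this PR.

```python
import logging

# Hypothetical default format string; the one used in logger.py may differ.
DEFAULT_LOGGING_FORMAT = "%(asctime)s %(levelname)s %(message)s"


def build_formatter(logging_format=DEFAULT_LOGGING_FORMAT):
    """Return a logging.Formatter built from a user-supplied format string."""
    return logging.Formatter(logging_format)


def build_handler(logging_format=DEFAULT_LOGGING_FORMAT):
    """Return a stream handler that uses the user-supplied format."""
    handler = logging.StreamHandler()
    handler.setFormatter(build_formatter(logging_format))
    return handler


# Format a record by hand to show the effect of a custom format string.
record = logging.LogRecord(
    name="demo", level=logging.INFO, pathname="demo.py", lineno=1,
    msg="hello", args=None, exc_info=None)
formatted = build_formatter("%(levelname)s: %(message)s").format(record)
```

Here a format string passed on the command line would simply be forwarded to `build_handler`, so the formatting logic stays in one module.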
## Related issue number
https://github.com/ray-project/ray/pull/2664
With this PR, when Ray is started via ray.init() (as opposed to via ray start), the Redis servers are started in "protected mode" (which means that clients can only connect via localhost).
In practice, we actually connect redis clients by passing in the node IP address (not localhost), so I need to create a redis config file on the fly to allow both localhost and the node's actual IP address (it would have been nice to find a way to do this from the Python redis client, but I couldn't find one).
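A rough sketch of generating such a config file on the fly. The helper name `write_redis_config` is hypothetical, and the file contents are an assumption: a single Redis `bind` directive listing both localhost and the node's IP address.

```python
import os
import tempfile


def write_redis_config(node_ip_address, directory=None):
    """Write a temporary Redis config file and return its path.

    The single `bind` line lets clients on the same machine connect via
    either 127.0.0.1 or the node's own IP address.
    """
    fd, path = tempfile.mkstemp(prefix="redis_conf", suffix=".conf",
                                dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write("bind 127.0.0.1 {}\n".format(node_ip_address))
    return path


# The resulting path would be passed as the first argument to redis-server.
config_path = write_redis_config("10.0.0.5")
```

The caller is responsible for deleting the temporary file once the Redis server has read it.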
This adds some experimental (undocumented) support for launching Ray on existing nodes. You have to provide the head ip, and the list of worker ips.
A couple of additional utils are also added for rsyncing files and port forwarding.
ray exec CLUSTER CMD [--screen] [--start] [--stop]
ray attach CLUSTER [--start]
Example:
ray exec sgd.yaml 'source activate tensorflow_p27 && cd ~/ray/python/ray/rllib && ./train.py --run=PPO --env=CartPole-v0' --screen --start --stop
In a single command, this creates a cluster and runs the command on it in a screen session. The screen can later be attached to via ray attach. After the command finishes, the cluster workers will be terminated and the head node stopped.
* Add profile table and store profiling information there.
* Code for dumping timeline.
* Improve color scheme.
* Push timeline events on driver only for raylet.
* Improvements to profiling and timeline visualization
* Some linting
* Small fix.
* Linting
* Propagate node IP address through profiling events.
* Fix test.
* object_id.hex() should return byte string in python 2.
* Include gcs.fbs in node_manager.fbs.
* Remove flatbuffer definition duplication.
* Decode to unicode in Python 3 and bytes in Python 2.
* Minor
* Submit profile events in a batch. Revert some CMake changes.
* Fix
* Workaround test failure.
* Fix linting
* Linting
* Don't return anything from chrome_tracing_dump when filename is provided.
* Remove some redundancy from profile table.
* Linting
* Move TODOs out of docstring.
* Minor
* Run xray tests in travis.
* Comment out TaskTests.testSubmittingManyTasks.
* Comment out failing tests.
* Comment out hanging test.
* Linting
* Comment out failing test.
* Comment out failing test.
* Ignore test_dataframe.py for now.
* Comment out testDriverExitingQuickly.
* separate task placement and task dispatch; throttle task dispatch with locally available resources
* keep track of workers being started/in flight and suppress starting extraneous workers
* cleanup comments
* remove early termination in task dispatch to support zero-resource actor tasks
* info -> debug
* add documentation
* linting
* mock the worker pool for testing
* some linting
* kill all workers in flight; clear the worker pool in dtor
* remove fixed todo
* lint
* Integrate worker with raylet.
* Begin allowing worker to attach to cluster.
* Fix linting and documentation.
* Fix linting.
* Comment tests back in.
* Fix type of worker command.
* Remove xray python files and tests.
* Fix from rebase.
* Add test.
* Copy over raylet executable.
* Small cleanup.
* Allow passing in --object-store-memory to ray start.
* Allow setting ports for the redis shards.
* Reorder arguments and infer number of shards from ports.
* Move code block into only the head node case.
* Add test.
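The "infer number of shards from ports" item above can be sketched as follows. This is an illustration only: the helper name `parse_redis_shard_ports` and the comma-separated input format are assumptions, not code from this commit.

```python
def parse_redis_shard_ports(redis_shard_ports, num_redis_shards=None):
    """Parse a comma-separated port string and infer the shard count.

    Returns (num_redis_shards, ports). If no ports are given, the shard
    count is returned unchanged and ports is None.
    """
    if redis_shard_ports is None:
        return num_redis_shards, None
    ports = [int(port) for port in redis_shard_ports.split(",")]
    if num_redis_shards is None:
        # One shard per port when the count is not given explicitly.
        num_redis_shards = len(ports)
    elif num_redis_shards != len(ports):
        raise ValueError(
            "Got {} ports for {} Redis shards.".format(
                len(ports), num_redis_shards))
    return num_redis_shards, ports
```

Raising on a mismatch, rather than silently truncating or padding the port list, keeps a misconfigured start command from bringing up shards on unexpected ports.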
This adds (experimental) auto-scaling support for Ray clusters based on GCS load metrics. The auto-scaling algorithm is as follows:
Based on current (instantaneous) load information, we compute the approximate number of "used workers". This is based on the bottleneck resource, e.g. if 8/8 GPUs are used in an 8-node cluster but all the CPUs are idle, the number of used workers is still counted as 8. This number can also be fractional.
We scale that number by 1 / target_utilization_fraction and round up to determine the target cluster size (subject to the max_workers constraint). The autoscaler control loop takes care of launching new nodes until the target cluster size is reached.
When a node is idle for more than idle_timeout_minutes, we remove it from the cluster if that would not drop the cluster size below min_workers.
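The steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the autoscaler's actual code: the function names and the dict-based inputs are hypothetical, and only `target_utilization_fraction`, `min_workers`, `max_workers`, and `idle_timeout_minutes` come from the description.

```python
import math


def approx_used_workers(num_workers, used, total):
    """Approximate "used workers" from the bottleneck resource.

    `used` and `total` map resource names to cluster-wide amounts.
    """
    max_fraction = 0.0
    for name, capacity in total.items():
        if capacity > 0:
            max_fraction = max(max_fraction, used.get(name, 0.0) / capacity)
    return num_workers * max_fraction  # may be fractional


def target_cluster_size(used_workers, target_utilization_fraction,
                        min_workers, max_workers):
    """Scale by 1 / target_utilization_fraction, round up, then clamp."""
    ideal = int(math.ceil(used_workers / target_utilization_fraction))
    return max(min_workers, min(max_workers, ideal))


def idle_nodes_to_remove(idle_minutes, idle_timeout_minutes, min_workers):
    """Pick idle nodes to remove without dropping below min_workers.

    `idle_minutes` maps node id to minutes idle.
    """
    remaining = len(idle_minutes)
    to_remove = []
    for node_id, minutes in sorted(idle_minutes.items()):
        if minutes > idle_timeout_minutes and remaining > min_workers:
            to_remove.append(node_id)
            remaining -= 1
    return to_remove
```

With the example from the text, 8/8 GPUs in use and idle CPUs on 8 nodes gives 8 used workers; with a target utilization fraction of 0.5 the target size would be 16, capped by `max_workers`.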
Note that we'll need to update the wheel in the example yaml file after this PR is merged.
* Enable scheduling with custom resource labels.
* Fix.
* Minor fixes and ref counting fix.
* Linting
* Use .data() instead of .c_str().
* Fix linting.
* Fix ResourcesTest.testGPUIDs test by waiting for workers to start up.
* Sleep in test so that all tasks are submitted before any completes.
* Check version info in ray start for non-head nodes.
* Small fix.
* Fix
* Push error to all drivers when worker has version mismatch.
* Linting
* Linting
* Fix
* Unify methods.
* Fix bug.
* adding support for the user-interpretable label (UIR)
* more plumbing for num_uirs further upstream; set to infinity when specified on cmd line
* pass default num_uirs for actors; update GlobalStateAPI
* support num_uirs in ray.init()
* local scheduler resource accounting: support num_uirs; prep for vectorized resource accounting
* global scheduler test updated
* Fix bug introduced by rebase.
* Rename UIR -> CustomResource and add test.
* Small changes and use constexpr instead of macros.
* Linting and some renaming.
* Reorder some code.
* Remove cpus_in_use and fix bug.
* Add another test and make a small change.
* Rephrase documentation about feature stability.
* 4 space indentation for actor.py.
* 4 space indentation for worker.py.
* 4 space indentation for more files.
* 4 space indentation for some test files.
* Check indentation in Travis.
* 4 space indentation for some rl files.
* Fix failure test.
* Fix multi_node_test.
* 4 space indentation for more files.
* 4 space indentation for remaining files.
* Fixes.