* Remove all __future__ imports from RLlib.
* Remove (object) again from tf_run_builder.py::TFRunBuilder.
* Fix two LINT warnings.
* Fix broken appo_policy import (must be appo_tf_policy)
* Remove future imports from all other ray files (not just RLlib).
* Remove future import blocks that contain `unicode_literals` as well.
* Revert appo_tf_policy.py to appo_policy.py (belongs to another PR).
* Add two empty lines before Schedule class.
* Put __future__ imports back into determine_tests_to_run.py; the script otherwise fails with a py2/print-related error.
* Stream logs to driver by default.
* Fix from rebase
* Redirect raylet output independently of worker output.
* Fix.
* Create redis client with services.create_redis_client.
* Suppress Redis connection error at exit.
* Remove thread_safe_client from redis.
* Shutdown driver threads in ray.shutdown().
* Add warning for too many log messages.
* Only stop threads if worker is connected.
* Only stop threads if they exist.
* Remove unnecessary try/excepts.
* Fix
* Only add new logging handler once.
* Increase timeout.
* Fix tempfile test.
* Fix logging in cluster_utils.
* Revert "Increase timeout."
This reverts commit b3846b89040bcd8e583b2e18cb513cb040e71d95.
* Retry longer when connecting to plasma store from node manager and object manager.
* Close pubsub channels to avoid leaking file descriptors.
* Limit log monitor open files to 200.
* Increase plasma connect retries.
* Add comment.
* Factor out starting Ray processes.
* Detect flags through environment variables.
* Return ProcessInfo from start_ray_process.
* Print valgrind errors at exit.
* Test valgrind in travis.
* Some valgrind fixes.
* Undo raylet monitor change.
* Only test plasma store in valgrind.
* Refactor code about ray.ObjectID.
* remove from_random and use nil_id instead of constructor
* remove id() in hash
* Lint and fix
* Change driver id to ObjectID
* Replace binary_to_hex(ObjectID.id()) with ObjectID.hex()
* Push a warning to all users when a large number of workers has been started.
* Add test.
* Fix bug.
* Give warning when worker starts instead of when worker registers.
* Fix
* Fix tests
* Limit Redis max memory to 10GB/shard by default.
* Update stress tests.
* Reorganize
* Update
* Add minimum cap size for object store and redis.
* Small test update.
## What do these changes do?
1. Move the log-related code from services.py into logger.py.
2. Allow users to modify the logging formatter in `ray start`.
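
To illustrate item 2, here is a minimal sketch of a logger setup that accepts a user-supplied format string. It uses only the standard `logging` module; `setup_logger` and `DEFAULT_FORMAT` are hypothetical names chosen for this example and are not taken from logger.py or from the `ray start` command line.

```python
import logging

# Hypothetical default; not claimed to be the format string Ray actually uses.
DEFAULT_FORMAT = "%(asctime)s %(levelname)s %(name)s: %(message)s"


def setup_logger(name, level=logging.INFO, fmt=DEFAULT_FORMAT, stream=None):
    """Configure a named logger with a caller-supplied format string.

    Calling this repeatedly reuses the existing handler instead of
    stacking a new one on every call.
    """
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:
        # StreamHandler(None) writes to sys.stderr.
        logger.addHandler(logging.StreamHandler(stream))
    formatter = logging.Formatter(fmt)
    for handler in logger.handlers:
        handler.setFormatter(formatter)
    # Avoid duplicate lines through the root logger.
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = setup_logger("example", fmt="%(levelname)s | %(message)s")
    log.info("formatter is user-configurable")
```

Keeping the formatter a plain string means it could be passed through a command-line option or a config file without further parsing.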
## Related issue number
https://github.com/ray-project/ray/pull/2664
* Fix documentation indentation.
* Add error table to GCS and push error messages through node manager.
* Add type to error data.
* Linting
* Fix failure_test bug.
* Linting.
* Enable one more test.
* Attempt to fix doc building.
* Restructuring
* Fixes
* More fixes.
* Move current_time_ms function into util.h.
* AWS: support multiple availability zones (fix #2177)
* Bugfix: [] rather than ()
* Test config
* Test config tweaks
* Remove test config
* Formatting fixes
* Update YAML config
* Print a warning when defining a very large remote function or actor.
* Add weak test.
* Check that warnings appear in test.
* Make wait_for_errors actually fail in failure_test.py.
* Use constants for error types.
* Fix
* some autoscaling config tweaks
* Sun Jan 14 13:56:55 PST 2018
* Mon Jan 15 14:21:09 PST 2018
* increase backoff
* Mon Jan 15 14:40:47 PST 2018
* check boto version
This adds (experimental) auto-scaling support for Ray clusters based on GCS load metrics. The auto-scaling algorithm is as follows:
1. Based on current (instantaneous) load information, we compute the approximate number of "used workers". This is based on the bottleneck resource, e.g. if 8/8 GPUs are used in an 8-node cluster but all the CPUs are idle, the number of used workers is still counted as 8. This number can also be fractional.
2. We scale that number by 1 / target_utilization_fraction and round up to determine the target cluster size (subject to the max_workers constraint). The autoscaler control loop takes care of launching new nodes until the target cluster size is met.
3. When a node is idle for more than idle_timeout_minutes, we remove it from the cluster if that would not drop the cluster size below min_workers.
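
The algorithm above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the autoscaler's actual code: the function names and the dictionary-based inputs are hypothetical, and visiting idle nodes longest-idle first is a choice made for this example that the description does not specify.

```python
import math


def approx_used_workers(num_workers, used, total):
    """Approximate "used workers" from the bottleneck resource.

    `used` and `total` map a resource name to its cluster-wide amount
    (hypothetical input shape for this sketch).
    """
    fractions = [
        used.get(name, 0.0) / amount
        for name, amount in total.items()
        if amount > 0
    ]
    # The most heavily used resource determines the load.
    bottleneck = max(fractions, default=0.0)
    return bottleneck * num_workers  # may be fractional


def target_cluster_size(used_workers, target_utilization_fraction, max_workers):
    """Scale by 1 / target_utilization_fraction, round up, cap at max_workers."""
    ideal = math.ceil(used_workers / target_utilization_fraction)
    return min(max_workers, ideal)


def nodes_to_remove(idle_minutes, idle_timeout_minutes, min_workers):
    """Pick idle nodes to remove without dropping below min_workers.

    `idle_minutes` maps node id to minutes idle. Nodes are visited
    longest-idle first (an assumption of this sketch).
    """
    remaining = len(idle_minutes)
    removed = []
    for node_id, minutes in sorted(idle_minutes.items(), key=lambda kv: -kv[1]):
        if minutes > idle_timeout_minutes and remaining > min_workers:
            removed.append(node_id)
            remaining -= 1
    return removed


if __name__ == "__main__":
    # 8/8 GPUs busy, all CPUs idle, on an 8-node cluster.
    used = approx_used_workers(8, {"GPU": 8, "CPU": 0}, {"GPU": 8, "CPU": 64})
    print(used, target_cluster_size(used, 0.5, max_workers=12))
```

In the GPU example from the description, the bottleneck fraction is 1.0, so all 8 workers count as used even though the CPUs are idle.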
Note that we'll need to update the wheel in the example yaml file after this PR is merged.