Commit Graph

1951 Commits

Author SHA1 Message Date
Robert Nishihara 6edbbf4fbf Document the release process. (#2760) 2018-08-29 00:06:33 -07:00
Robert Nishihara 132f133214 Limit number of concurrent workers started by hardware concurrency. (#2753)
* Limit number of concurrent workers started by hardware concurrency.

* Check if std::thread::hardware_concurrency() returns 0.

* Pass in max concurrency from Python.

* Fix Java call to startRaylet.

* Fix typo

* Remove unnecessary cast.

* Fix linting.

* Cleanups on Java side.

* Comment back in actor test.

* Require maximum_startup_concurrency to be at least 1.

* Fix linting and test.

* Improve documentation.

* Fix typo.
2018-08-29 14:53:40 +08:00
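The clamping described in this commit can be sketched in Python (the raylet itself is C++). The helper name `resolve_startup_concurrency` is hypothetical, and falling back to 1 when the hardware concurrency is unknown is an assumption. Only the `maximum_startup_concurrency` name, the "at least 1" requirement, and the zero check come from the message above.

```python
import os


def resolve_startup_concurrency(maximum_startup_concurrency,
                                hardware_concurrency=None):
    """Return how many workers may be starting at once (hypothetical helper)."""
    if maximum_startup_concurrency < 1:
        raise ValueError("maximum_startup_concurrency must be at least 1")
    if hardware_concurrency is None:
        # os.cpu_count() may return None, much like
        # std::thread::hardware_concurrency() may return 0.
        hardware_concurrency = os.cpu_count() or 0
    if hardware_concurrency == 0:
        # Assumption: with unknown hardware, start one worker at a time.
        return 1
    return min(maximum_startup_concurrency, hardware_concurrency)
```

Passing `hardware_concurrency` explicitly keeps the function easy to exercise on any machine.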
Mitar 3850e3ba64 Added extra logging related arguments to "ray start" (#2664) 2018-08-28 23:00:37 -07:00
Eric Liang 69d1354016 [rllib] Document ARS & rainbow (#2744)
* wip

* rainbow doc too

* e not used

* fix ppo doc

* clean list

* use same title
2018-08-28 18:13:36 -07:00
Robert Nishihara 6e1de19cc2 Bump version to 0.5.1. (#2755) ray-0.5.1 2018-08-28 16:52:17 -07:00
Robert Nishihara b7722897b4 Deprecate 'driver_mode' argument. (#2758)
* Deprecate 'driver_mode' argument.

* Fix

* Fix
2018-08-28 16:45:49 -07:00
Alexey Tumanov de047daea7 [xray] raylet scheduling mechanism with a simple spillback policy (#2749)
## What do these changes do?
* distribute load and resource information on a heartbeat
* for each raylet, maintain total and available resource capacity as well as measure of current load
* this PR introduces a new notion of load, defined as a sum of all resource demand induced by queued ready tasks on the local raylet. This provides a heterogeneity-aware measure of load that supersedes legacy Ray's task count as a proxy for load.
* modify the scheduling policy to perform *capacity-based*, *load-aware*, *optimistically concurrent* resource allocation
* perform task spillover to the heartbeating node in response to a heartbeat, implementing  heterogeneity-aware late-binding/work-stealing.
2018-08-28 00:03:34 -07:00
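A rough Python sketch of the load measure defined above, plus one possible reading of "spill on heartbeat". All names here are hypothetical, tasks are reduced to plain resource-demand dicts, and the spill rule (forward any queued task that fits the heartbeating node's available capacity) is an assumption, not the raylet's actual policy.

```python
from collections import Counter


def compute_load(ready_queue):
    """Sum the resource demand of every queued ready task."""
    load = Counter()
    for demand in ready_queue:
        load.update(demand)  # adds quantities per resource name
    return dict(load)


def fits(demand, available):
    """True if every requested resource is available in sufficient quantity."""
    return all(available.get(name, 0) >= qty for name, qty in demand.items())


def spill_on_heartbeat(ready_queue, remote_available):
    """Split queued tasks into (spilled, kept) for one heartbeating node.

    Allocation is optimistic: capacity is deducted from a local copy of the
    heartbeat data without coordinating with the remote node.
    """
    remaining = dict(remote_available)
    spilled, kept = [], []
    for demand in ready_queue:
        if fits(demand, remaining):
            for name, qty in demand.items():
                remaining[name] = remaining.get(name, 0) - qty
            spilled.append(demand)
        else:
            kept.append(demand)
    return spilled, kept
```

Because load is a per-resource sum rather than a task count, a queue of GPU-heavy tasks and a queue of CPU-only tasks of the same length report different loads.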
adoda 90ae8f11df The function get_node_ip_address will catch an exception and return … (#2722)
…'127.0.0.1' when the external network is forbidden. Instead, we can get the IP address from the hostname.

https://github.com/ray-project/ray/issues/2721
2018-08-27 22:24:49 -07:00
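The fallback described here might look like the following sketch. The function names and the probe address are assumptions for illustration and are not taken from the Ray source; only the order of fallbacks (external route, then hostname, then '127.0.0.1') follows the message above.

```python
import socket


def probe_external_route():
    """Find the outbound IP by 'connecting' a UDP socket to a public address."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # No packet is sent for a UDP connect; the address is an assumption.
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]
    finally:
        s.close()


def node_ip_address(probe=probe_external_route):
    """Return this node's IP, falling back to the hostname, then loopback."""
    try:
        return probe()
    except OSError:
        pass  # external network is unavailable or forbidden
    try:
        return socket.gethostbyname(socket.gethostname())
    except OSError:
        return "127.0.0.1"
```

The `probe` parameter exists only so the fallback path can be exercised without changing the network setup.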
Wang Qing b4cba9a49f [java] Fix the logic of generating TaskID (#2747)
## What do these changes do?
Because the logic for generating `TaskID` in Java differs from Python's, many tests fail when we change the `Ray Core` code.
In this change, I rewrote the Java logic for generating `TaskID` so that it matches Python's.

In Java, we now call the native method `_generateTaskId()`, which is also used in Python, to generate a `TaskID`. We changed the logic of `computePutId()` too.

## Related issue number
[#2608](https://github.com/ray-project/ray/issues/2608)
2018-08-27 13:11:33 -07:00
Hao Chen f37c260bdb [multi-language part 3] support multiple languages in raylet backend (#2672)
This PR enables multi-language support in the raylet backend.
- `Worker` class now has a `language` label;
- `WorkerPool`:
    - It now maintains one set of states for each language.
    - `PopWorker` function's parameter type is changed to `TaskSpecification`, and it will choose a worker to pop based on both task's language and actor id.
    - `Size` and `StartWorkerProcess` functions now have an extra `language` parameter.
- `RegisterClientRequest` message now has an extra `language` field in raylet mode, which tells the node manager which language the worker is.
2018-08-26 22:06:25 -07:00
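The per-language bookkeeping can be illustrated with a small Python model. The real `WorkerPool` is C++; the method names, the dict-based task specification, and the state layout below are all hypothetical stand-ins.

```python
from collections import defaultdict


class WorkerPool:
    """Toy model: one set of idle-worker states per language."""

    def __init__(self):
        # language -> {"idle": [workers], "idle_actor": {actor_id: worker}}
        self._states = defaultdict(lambda: {"idle": [], "idle_actor": {}})

    def push_worker(self, language, worker, actor_id=None):
        state = self._states[language]
        if actor_id is None:
            state["idle"].append(worker)
        else:
            state["idle_actor"][actor_id] = worker

    def pop_worker(self, task_spec):
        """Choose a worker based on the task's language and actor id."""
        state = self._states[task_spec["language"]]
        actor_id = task_spec.get("actor_id")
        if actor_id is not None:
            return state["idle_actor"].pop(actor_id, None)
        return state["idle"].pop() if state["idle"] else None

    def size(self, language):
        state = self._states[language]
        return len(state["idle"]) + len(state["idle_actor"])
```

A task for one language never receives a worker registered under another, which is the property the extra `language` label provides.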
Yuhong Guo 0b6e08ebee Separate python logger module-wise (#2703)
## What do these changes do?
1. Separate the log related code to logger.py from services.py.
2. Allow users to modify logging formatter in `ray start`.

## Related issue number
https://github.com/ray-project/ray/pull/2664
2018-08-26 13:46:14 -07:00
Wang Qing 26d3c0655c [java] Improve UniqueID code. (#2723) 2018-08-26 12:32:57 -07:00
Hao Chen 4f4bea086a [java] Remove multi-return API (#2724) 2018-08-26 00:04:54 -07:00
Richard Liaw dbba7f2a53 [autoscaler] Cleanup Logging (#2709)
Moves Autoscaler onto Python `logging` module.
2018-08-25 17:08:45 -07:00
Jones Wong 982cde664f [rllib] Add noisy network and distributional Q-learning to implement Rainbow (#2737)
*  add noisy network

*  distributional q-learning in dev

*  add distributional q-learning

*  validated rainbow module

*  add some comments

*  supply some comments

*  remove redundant argument to pass CI test

*  async replay optimizer does NOT need annealing beta

*  ignore rainbow specific arguments for DDPG and Apex

*  formatted by yapf

* Update dqn_policy_graph.py

* Update dqn_policy_graph.py
2018-08-25 14:17:14 -07:00
eugenevinitsky 6201a6d1c7 [rllib] add augmented random search (#2714)
* added ars

* functioning ars with regression test

* added regression tests for ARs

* fixed default config for ARS

* ARS code runs, now time to test

* ARS working and tested, changed std deviation of meanstd filter to initialize to 1

* ARS working and tested, changed std deviation of meanstd filter to initialize to 1

* pep8 fixes

* removed unused linear model

* address comments

* more fixing comments

* post yapf

* fixed support failure

* Update LICENSE

* Update policies.py

* Update test_supported_spaces.py

* Update policies.py

* Update LICENSE

* Update test_supported_spaces.py

* Update policies.py

* Update policies.py

* Update filter.py
2018-08-24 22:20:02 -07:00
Robert Nishihara 5fd44afb8a Add note about huge pages using up memory. (#2733)
* Add note about huge pages using up memory.

* Update doc

* Update
2018-08-24 17:02:54 -07:00
Yuhong Guo 697bfb14db Hotfix for glog PR (#2734) 2018-08-24 16:30:51 -07:00
Michael Tu d16b6f6a32 [tune] Rename 'repeat' to 'num_samples' (#2698)
Deprecates the `repeat` argument and introduces `num_samples`. Also updates docs accordingly.
2018-08-24 15:05:24 -07:00
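A deprecation of this kind is commonly handled with a small shim such as the sketch below. `resolve_num_samples` is a hypothetical helper, the spec is modeled as a plain dict, and both the default of 1 and the rule that `num_samples` wins when both keys are present are assumptions.

```python
import warnings


def resolve_num_samples(spec):
    """Read the sample count from a spec dict, accepting the legacy `repeat` key."""
    if "repeat" in spec:
        warnings.warn(
            "`repeat` is deprecated; use `num_samples` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        # Prefer the new key when both are given.
        return spec.get("num_samples", spec["repeat"])
    return spec.get("num_samples", 1)
```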
Eric Liang bcab5bcd02 fix it (#2735) 2018-08-24 15:01:12 -07:00
Philipp Moritz b4c47a5861 Upgrade arrow to include more detailed flushing message (#2706) 2018-08-24 11:44:04 -07:00
Robert Nishihara e467f546b5 Upgrade version of anaconda. (#2730) 2018-08-23 19:14:39 -07:00
Eric Liang aa014af85b [rllib] Fix atari reward calculations, add LR annealing, explained var stat for A2C / impala (#2700)
Changes needed to reproduce Atari plots in IMPALA / A2C: https://github.com/ray-project/rl-experiments
2018-08-23 17:49:10 -07:00
Stephanie Wang 1b3de31ff1 [xray] Fix bug where driver task ID is assumed to be nil (#2725)
## What do these changes do?

#2362 left a bug where it assumed that the driver task ID was nil. This fixes the bug to check the `SchedulingQueue` for any driver task IDs instead.
2018-08-23 14:44:47 -07:00
Yuhong Guo 344a83f327 Fix build failure of Arrow and Parquet when the folder is empty. (#2720) 2018-08-23 09:44:26 -07:00
Yuhong Guo eec1a3eb89 Support pluggable backend log lib with glog (#2695)
* [WIP] Support different backend log lib

* Refine code, unify level, address comment

* Address comment and change formatter

* Fix linux building failure.

* Fix lint

* Remove log4cplus.

* Add log init to raylet main and add test to travis.

* Address comment and refine.

* Update logging_test.cc
2018-08-23 09:43:38 -07:00
old-bear 4be324efc3 [tune] Support infinity value in report result (#2693)
* + Compatibility fix under py2 on ray.tune

* + Revert changes on master branch

* + Use default JsonEncoder in ray.tune.logger

* + Add UT for infinity support
2018-08-22 13:09:14 -07:00
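For context, Python's standard `json` module already round-trips infinite values when left at its defaults, which is one reason to rely on the default encoder. The snippet below shows only standard-library behavior; the result keys are made up for illustration.

```python
import json
import math

# Hypothetical reported result containing infinite values.
result = {"mean_loss": float("inf"), "episode_reward_min": float("-inf")}

# The default encoder (allow_nan=True) writes Infinity / -Infinity tokens.
encoded = json.dumps(result, sort_keys=True)
print(encoded)  # {"episode_reward_min": -Infinity, "mean_loss": Infinity}

# The default decoder reads them back as floats.
decoded = json.loads(encoded)
print(math.isinf(decoded["mean_loss"]))  # True
```

Note that `Infinity` is not valid strict JSON, so other parsers may reject such output.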
joyyoj 38867eea4e [tune] Cross-Framework Compatibility (#2646)
This commit is a first pass at restructuring the Trial execution logic to support running on multiple frameworks.
2018-08-22 10:55:45 -07:00
Eric Liang fbe6c59f72 [rllib] Misc fixes, A2C (#2679)
A bunch of minor rllib fixes:

* pull in latest baselines atari wrapper changes (and use deepmind wrapper by default)
* move reward clipping to policy evaluator
* add a2c variant of a3c
* reduce vision network fc layer size to 256 units
* switch to 84x84 images
* doc tweaks
* print timesteps in tune status
2018-08-20 15:28:03 -07:00
Yucong He 880ef1bd21 doc fix (#2696) 2018-08-20 14:11:32 -07:00
Robert Nishihara 89d4a6df93 Start Redis in protected mode when started via ray.init(). (#2697)
This PR makes it so that when Ray is started via ray.init() (as opposed to via ray start) the Redis servers will be started in "protected mode" (which means that clients can only connect by connecting to localhost).

In practice, we actually connect redis clients by passing in the node IP address (not localhost), so I need to create a redis config file on the fly to allow both localhost and the node's actual IP address (it would have been nice to find a way to do this from the Python redis client, but I couldn't find one).
2018-08-20 14:08:01 -07:00
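Generating such a config file on the fly could look like the sketch below. The helper name and file naming are hypothetical, and the file content is an assumption: it uses only the Redis `bind` directive to list localhost alongside the node's IP address, and how protected mode itself is switched on is not shown.

```python
import tempfile


def write_redis_bind_config(node_ip_address):
    """Write a temporary Redis config allowing localhost and the node IP."""
    addresses = ["127.0.0.1"]
    if node_ip_address != "127.0.0.1":
        addresses.append(node_ip_address)
    with tempfile.NamedTemporaryFile(
            "w", prefix="redis_conf_", suffix=".conf", delete=False) as f:
        # `bind` takes a space-separated list of interface addresses.
        f.write("bind {}\n".format(" ".join(addresses)))
        return f.name
```

The returned path would then be passed to the Redis server as its configuration file; the caller is responsible for deleting it afterwards.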
Stephanie Wang 8fd5757aaa [xray] Don't process any more messages from dead node managers (#2688) 2018-08-19 21:11:40 -07:00
old-bear 230ac7aa80 [tune] Compatibility fix under py2 on str condition (#2673)
* * Compatibility fix under py2 on ray.tune

* + Fix compatibility

* + Use package six to achieve str compatibility
2018-08-19 20:43:03 -07:00
Eric Liang 9473da69bd [autoscaler] Experimental support for local / on-prem clusters (#2678)
This adds some experimental (undocumented) support for launching Ray on existing nodes. You have to provide the head ip, and the list of worker ips.

There are also a couple of additional utils added for rsyncing files and port forwarding.
2018-08-19 12:43:04 -07:00
Richard Liaw 62d0698097 [tune] Tune Facelift (#2472)
This PR introduces the following changes:

 * Ray Tune -> Tune 
 * [breaking] Creation of `schedulers/`, moving PBT, HyperBand into a submodule
 * [breaking] Search Algorithms now must take in experiment configurations via `add_configurations` rather than through initialization
 * Support `"run": (function | class | str)` with automatic registering of trainable
 * Documentation Changes
2018-08-19 11:00:55 -07:00
Hao Chen 78b6bfb7f9 [Java] Change log dir to /tmp/raylogs (#2677)
Currently, the log directory in Java is a relative path. This PR changes it to `/tmp/raylogs` (with the same format as Python, e.g., `local_scheduler-2018-51-17_17-8-6-05164.err`). It also cleans up some related code.
2018-08-18 23:46:36 -07:00
Eric Liang e56eb354eb [tune] Remove hack to serve pin requests off thread (#2680)
* nopin

* fix
2018-08-18 13:19:52 -07:00
Robert Nishihara aaf5456b3d Add test that tasks sent to actor on dead node raise exceptions. (#2626)
* Add actor failure test.

* Minor change.

* Make test harder.

* Change numbers a bit.

* Skip test for non xray.
2018-08-16 22:48:31 -07:00
Wang Qing 06a58016d8 [multi-language part 2] Change the command line arguments to start raylet (#2670) 2018-08-16 21:59:44 -07:00
Hao Chen a719e089b0 [multi-language part 1] add a 'language' field to task specification (#2639) 2018-08-16 21:26:42 -07:00
Eric Liang 6670880f03 [rllib] Workaround actor creation hang edge case for ape-X (#2661)
* apex hang

* fix

* move pyt to end
2018-08-16 18:03:50 -07:00
Eric Liang 5f430da180 [rllib] Provide internal access to episode state in compute_actions() and allow returning extra batches (#2559)
The goal of this PR is to allow custom policies to perform model-based rollouts. In the multi-agent setting, this requires access to not only policies of other agents, but also their current observations.
Also, you might want to return the model-based trajectories as part of the rollout for efficiency.

* compute_actions() now takes a new keyword arg episodes
* pull out internal episode class into a top-level file
* add function to return extra trajectories from an episode that will be appended to the sample batch
* documentation
2018-08-16 14:37:21 -07:00
Eric Liang 127cf291a3 Delete __init__.py (#2668) 2018-08-16 02:01:21 -07:00
Stephanie Wang e3e0cfce87 [xray] Resubmit tasks that fail to be forwarded (#2645) 2018-08-16 00:12:56 -07:00
Hao Chen dd924a388b silence progress log from 'git clone' and 'pip install' (#2667) 2018-08-15 22:54:35 -07:00
Philipp Moritz 6cb6dd30d1 silence shutdown callback (#2662) 2018-08-15 22:48:00 -07:00
Eric Liang 079c4e482a ray exec and ray attach commands (#2560)
ray exec CLUSTER CMD [--screen] [--start] [--stop]
ray attach CLUSTER [--start]

Example:
ray exec sgd.yaml 'source activate tensorflow_p27 && cd ~/ray/python/ray/rllib && ./train.py --run=PPO --env=CartPole-v0' --screen --start --stop

This will in one command create a cluster and run the command on it in a screen session. The screen can later be attached to via ray attach. After the command finishes, the cluster workers will be terminated and the head node stopped.
2018-08-15 14:31:50 -07:00
Eric Liang 53f9755594 [rllib] Fix support for mixed discrete and continuous action spaces, add to regression test (#2655)
* fix

* lint

* fix
2018-08-15 10:19:41 -07:00
tianyapiaozi 98fed67b45 fix offset by one issue in the local scheduler (#2652) 2018-08-15 10:10:30 -07:00
Hao Chen 3c75e71afc reduce noisy log messages from wget (#2656) 2018-08-15 09:10:28 -07:00