926 Commits

Author SHA1 Message Date
Devin Petersohn 4d2010a852 Ship Modin with Ray. (#3109) 2018-11-29 20:05:24 +01:00
Chunyang Wen fd7e494344 Remove: duplicate feed_dict constructing (#3431) 2018-11-29 10:21:46 -08:00
Kristian Hartikainen 7e319dbf0c Automatically indent tune logger params (#3399) 2018-11-29 00:15:50 -08:00
Eric Liang c46ea2ff4b Click 0.7 changes the naming convention for commands; fix this 2018-11-28 14:59:58 -08:00
Robert Nishihara 82863b5251 [autoscaler] Update autoscaler to use heartbeat batches. (#3409) 2018-11-27 23:46:27 -08:00
Eric Liang f0df97db6f [rllib] example and docs on how to use parametric actions with DQN / PG algorithms (#3384) 2018-11-27 23:35:19 -08:00
Eric Liang 0d56fc10cc Move setproctitle to ray[debug] package (#3415) 2018-11-27 09:50:59 -08:00
Eric Liang e3c088fa1e [rllib] PPO doesn't work with fractional num gpus (#3396)
* frac ppo

* gpu test
2018-11-27 01:14:10 -08:00
Eric Liang aa94d3dd50 [autoscaler] Allow more than 5s from node creation to first heartbeat (#3385) 2018-11-26 17:25:05 -08:00
Robert Nishihara 0f0099fb90 UI changes, fix the task timeline and add the object transfer timeline to UI. (#3397)
* Saving

* Fix cmake and remove object/task search boxes.

* Add comment
2018-11-25 10:16:49 -08:00
Eric Liang b85e7b43f3 [rllib] Refactor the sampler (#3387)
* refactor

* fix test

* add perf test

* Update sampler.py
2018-11-24 18:16:54 -08:00
Robert Nishihara 3856533065 Fix incompatibility with most recent version of Redis. (#3379)
* Fix incompatibility with most recent version of Redis.

* Fix

* Fixes.
2018-11-24 16:36:38 -08:00
Eric Liang 18a8dbfcfb [rllib] Clip DDPG ou-noise to avoid exceeding action bounds (#3386)
Closes #2965
2018-11-24 00:56:50 -08:00
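The clipping idea in the commit above can be illustrated with a minimal sketch. This is not RLlib's DDPG code: the function names, the bounds, and the noise values are hypothetical, and the noise is passed in as a plain list rather than generated by an Ornstein-Uhlenbeck process.

```python
def clip(value, low, high):
    """Clamp value into the closed interval [low, high]."""
    return max(low, min(high, value))


def noisy_action(action, noise, low, high):
    """Add exploration noise, then clip so the result stays within bounds.

    All names here are illustrative assumptions, not RLlib identifiers.
    """
    return [clip(a + n, low, high) for a, n in zip(action, noise)]


# Hypothetical 2-D action with bounds [-1, 1]; the noise values are made up.
print(noisy_action([0.9, -0.95], [0.3, -0.2], -1.0, 1.0))  # [1.0, -1.0]
```

Without the clip, both components would leave the action space (1.2 and -1.15).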
Eric Liang 55fca828ce [rllib] Fix use_lstm option when using custom model with dict space (#3368)
## What do these changes do?

This passes in the right obs space to the lstm model wrapper, so that it doesn't attempt to un-flatten the already processed dict observation.

## Related issue number

Closes https://github.com/ray-project/ray/issues/3367
2018-11-23 22:51:08 -08:00
Eric Liang 8b76bab25c [rllib] docs for td3 (#3381)
* td3 doc

* Update rllib-env.rst
2018-11-22 13:36:47 -08:00
Eric Liang 41b6b50d09 fix py3 (#3382) 2018-11-22 11:43:52 -08:00
GiliR4t1qbit b9ae5edf74 When getting a role/profile, catch only the exception that indicates the role/profile already exists, and allow others to be raised (#3383) 2018-11-22 09:42:58 -08:00
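A minimal sketch of the narrow-exception pattern this commit describes. The exception class and the `create_role` / `ensure_role` helpers are hypothetical stand-ins, not the autoscaler's actual code or a real cloud SDK API.

```python
class AlreadyExistsError(Exception):
    """Hypothetical error raised when the resource already exists."""


def create_role(existing, name):
    """Hypothetical stand-in for a cloud API call that creates a role."""
    if not name:
        raise ValueError("role name must be non-empty")
    if name in existing:
        raise AlreadyExistsError(name)
    existing.add(name)


def ensure_role(existing, name):
    """Create the role, tolerating only the 'already exists' failure."""
    try:
        create_role(existing, name)
    except AlreadyExistsError:
        pass  # Expected on repeat runs; the role is already there.
    # Any other exception (for example ValueError) propagates to the caller.
    return name in existing
```

Catching a bare `Exception` here would also hide real failures such as permission or validation errors, which is the behavior the commit moves away from.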
Jones Wong 24bfe8ab76 Enable Twin Delayed DDPG for RLlib DDPG agent (#3353) 2018-11-21 20:03:20 -08:00
Richard Liaw 784a6399b0 [tune] Node Fault Tolerance (#3238)
This PR introduces single-node fault tolerance for Tune.

## Previous behavior:
 - Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources.

## New behavior:
 - RUNNING trials will be resumed on another node on a best-effort basis (meaning they will run if resources are available).
 - If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued.
 - During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running.


Remaining questions:
 - Should `last_result` be consistent during restore?
Yes, but not for earlier trials (trials that are yet to be checkpointed).

 - Waiting for some PRs to merge first (#3239)

Closes #2851.
2018-11-21 12:38:16 -08:00
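The recovery behavior described in the commit above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: trials are plain dicts, resources are counted as whole "slots" per node, and `recover_trials` is a hypothetical name rather than a Tune function.

```python
def recover_trials(trials, failed_node, free_slots):
    """Reassign trials that were RUNNING on failed_node.

    trials: list of dicts with 'name', 'status', 'node' (hypothetical shape).
    free_slots: dict mapping each surviving node to its free trial slots.
    """
    for trial in trials:
        if trial["status"] != "RUNNING" or trial["node"] != failed_node:
            continue
        # Pick the first surviving node that still has capacity, if any.
        target = next((n for n, free in free_slots.items() if free > 0), None)
        if target is not None:
            free_slots[target] -= 1
            trial["node"] = target  # Resume elsewhere on a best-effort basis.
        else:
            trial["status"] = "PENDING"  # Cluster saturated: queue the trial.
            trial["node"] = None
    return trials
```

A real scheduler would also notify the trial scheduler and search algorithm, as the commit message notes; that step is omitted here.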
Richard Liaw c24d87b4d1 [autoscaler] Submit command (#3312) 2018-11-20 14:03:34 -08:00
Eric Liang abdc3b592e [rllib] Update multi-gpu impala numbers (#3327) 2018-11-19 20:55:27 -08:00
Eric Liang 5972c29d28 [rllib] Set ape-x local exploration to 0, also load explorations before training steps (#3349)
## What do these changes do?

This should fix high explorations being used after restore / for rollouts.

## Related issue number

(dev list issue)
2018-11-19 20:36:25 -08:00
Eric Liang afc48d7b77 Don't setpgid() on actors (#3347) 2018-11-19 17:35:26 -08:00
Eric Liang e4bb5d8d16 Fix logging when ray cluster utils is used 2018-11-18 21:49:27 -08:00
Wenting Shen ab1e0f5c2f support home path and relative path for temp-dir (#3329) 2018-11-16 17:41:10 -08:00
Eric Liang e0bf9d7305 Add debug string to raylet (#3317)
* initial debug string

* format

* wip debug string

* fix compile

* fix

* update

* finished

* to file

* logs dir

* use temp root

* fix

* override
2018-11-15 21:47:50 -08:00
Robert Nishihara d10cb570ab Rename _submit -> _remote. (#3321) 2018-11-15 15:30:18 -08:00
Eric Liang 5723291db6 Raise exception if the node is nearly out of memory (#3323)
* wip

* add

* comment

* escape hatch

* update

* object store too

* .2
2018-11-15 12:55:25 -08:00
Lewis Belcher 5319fd044c Update redis version in setup.py (#3333)
* `redis` has released a new version (https://github.com/andymccurdy/redis-py/releases/tag/3.0.0)
* `ray` is not compatible with this version
* This PR adds the "compatible release" operator for `redis` version 2.10.6.
2018-11-15 10:40:08 -08:00
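The "compatible release" operator mentioned above is written `~=` in PEP 440, and `~=2.10.6` is equivalent to `>=2.10.6, ==2.10.*`. The sketch below re-implements that rule for plain dotted numeric versions only, to show why 3.0.0 is excluded. The helper names are hypothetical, and real tooling handles many more version forms (pre-releases, epochs, and so on).

```python
def parse(version):
    """Split a dotted numeric version like '2.10.6' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_compatible_release(candidate, base):
    """Return True if candidate satisfies '~=base' for numeric versions.

    Assumes base has at least two components, as PEP 440 requires.
    """
    cand, floor = parse(candidate), parse(base)
    prefix = floor[:-1]  # '~=2.10.6' pins the '2.10' prefix.
    return cand >= floor and cand[:len(prefix)] == prefix


print(is_compatible_release("2.10.7", "2.10.6"))  # True
print(is_compatible_release("3.0.0", "2.10.6"))   # False
```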
Eric Liang 706dc1d473 [rllib] Add test for multi-agent support and fix IMPALA multi-agent (#3289)
IMPALA support for multi-agent was broken because IMPALA requires batches to be of a certain length. However, multi-agent envs can create variable-length batches.

Fix this by adding zero-padding as needed (similar to the RNN case).
2018-11-14 14:14:07 -08:00
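A minimal sketch of the zero-padding idea from the commit above, using plain Python lists. It assumes the length requirement is "a multiple of some fixed size", which is an illustrative simplification; `pad_to_multiple` is a hypothetical helper, not an RLlib function.

```python
def pad_to_multiple(batch, multiple, pad_value=0):
    """Right-pad a list so its length is a multiple of `multiple`."""
    remainder = len(batch) % multiple
    if remainder == 0:
        return list(batch)  # Already aligned; return a copy unchanged.
    return list(batch) + [pad_value] * (multiple - remainder)


# A variable-length batch of 5 items padded up to the next multiple of 4.
print(pad_to_multiple([1, 2, 3, 4, 5], 4))  # [1, 2, 3, 4, 5, 0, 0, 0]
```

In practice the padded positions would also need to be masked out of the loss, which this sketch does not show.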
andrewztan 57c7b4238e KL Divergence Metrics (#3300)
* added KL divergence metrics

* fix
2018-11-13 23:12:35 -08:00
Eric Liang 1660c9d627 Kill actor child processes on shutdown (#3297)
* example

* add env

* test pg

* change to test

* add atexit test

* Update rllib-env.rst

* comment

* revert unnecessary file

* fix title when actor is idle

* Update python/ray/actor.py

Co-Authored-By: ericl <ekhliang@gmail.com>
2018-11-13 19:16:42 -08:00
Eric Liang 65c27c70cf [rllib] Clean up agent resource configurations (#3296)
Closes #3284
2018-11-13 18:00:03 -08:00
Philipp Moritz d4fad222e1 Update profiling instructions for raylet (#3311) 2018-11-13 17:48:33 -05:00
Richard Liaw 97f423781b Clean up Ray processes after cluster util exits (#3278) 2018-11-13 13:18:12 -08:00
Richard Liaw c3a2c7ebed [tune] Doc: Autofilled, StatusReporter (#3294)
* autofill and revise doc page for things

* lint

* comments
2018-11-13 13:15:56 -08:00
Eric Liang 6ee7a3b571 [rllib] Raise worker TF intra_op threads to 2, lower driver intra_op threads to 8 (#3299) 2018-11-13 11:41:58 -08:00
Richard Liaw c0423db05c [core] Add Global State Test for multi-node setting (#3239)
* add test for adding node

* multinode test fixes

* First pass at allowing updatable values

* Fix compilation issues

* Add config file parsing

* Full initialization

* Wrote a good test

* configuration parsing and stuff

* docs

* write some tests, make it good

* fixed init

* Add all config options and bring back stress tests.

* Update python/ray/worker.py

* Update python/ray/worker.py

* Fix internalization

* some last changes

* Linting and Java fix

* add docstring

* Fix test, add assertions

* pytest ext

* lint

* lint
2018-11-13 10:35:24 -08:00
Eric Liang d90f365394 [rllib] Add self-supervised loss to model (#3291)
# What do these changes do?

Allow self-supervised losses to be easily defined in custom models. Add this to the reference policy graphs.
2018-11-12 18:55:24 -08:00
Eric Liang bd0dbde149 [rllib] Rename ServingEnv => ExternalEnv (#3302) 2018-11-12 16:31:27 -08:00
Richard Liaw e37891d79d [tune] Fix default handling for timesteps (#3293)
This PR fixes an issue where timesteps_this_iter = 0 would render as "None".

Closes #3057.
2018-11-12 15:52:17 -08:00
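One common way a zero ends up displayed as "None" is a truthiness check that treats 0 as missing. The sketch below shows that pitfall and an explicit `is None` check. Whether this was the actual cause in Tune is an assumption; both function names are hypothetical.

```python
def render_buggy(timesteps):
    """Treats 0 as missing because 0 is falsy in Python."""
    return str(timesteps or None)


def render_fixed(timesteps, default=0):
    """Substitutes the default only when the value is actually None."""
    return str(default if timesteps is None else timesteps)


print(render_buggy(0))  # None
print(render_fixed(0))  # 0
```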
Eric Liang 49e2085d78 [rllib] Don't reset envs when possible (#3290)
* laz

* better errors
2018-11-11 01:45:37 -08:00
Eric Liang 463511f8a6 [tune] Track and warn on low memory (#3298) 2018-11-11 00:29:45 -08:00
Eric Liang 53489d2f85 [sgd] Document and add simple MNIST example (#3236) 2018-11-10 21:52:20 -08:00
Richard Liaw 29c182d449 [tune] Support "None" for upload_dir 2018-11-09 22:02:08 -08:00
Eric Liang a51d618d88 [autoscaler] missing example-full.yaml file in the latest wheel for provider type "local" 2018-11-09 21:25:15 -08:00
Eric Liang 9dd3eedbac [rllib] rollout.py should reduce num workers (#3263)
## What do these changes do?

Don't create an excessive number of workers for rollout.py, and also fix up the env wrapping to be consistent with the internal agent wrapper.

## Related issue number

Closes #3260.
2018-11-09 12:29:16 -08:00
Richard Liaw 22113be04c [tune] Annotated Example Page and showcase Tutorials (#3267)
Adds an example page and link in codebase.

Closes #2728.
2018-11-08 23:45:05 -08:00
Eric Liang 588705b6fa [autoscaler] Add option to allow private ips only (#3270)
* merge

* update

* upd

* Update python/ray/autoscaler/autoscaler.py

Co-Authored-By: ericl <ekhliang@gmail.com>

* Update python/ray/autoscaler/autoscaler.py

Co-Authored-By: ericl <ekhliang@gmail.com>

* Update python/ray/autoscaler/aws/config.py

Co-Authored-By: ericl <ekhliang@gmail.com>

* fix
2018-11-08 17:07:31 -08:00
Philipp Moritz 8894883153 Force kill web UI in ray stop (#3257) 2018-11-08 00:05:32 -08:00