916 Commits

Author SHA1 Message Date
Eric Liang b85e7b43f3 [rllib] Refactor the sampler (#3387)
* refactor

* fix test

* add perf test

* Update sampler.py
2018-11-24 18:16:54 -08:00
Robert Nishihara 3856533065 Fix incompatibility with most recent version of Redis. (#3379)
* Fix incompatibility with most recent version of Redis.

* Fix

* Fixes.
2018-11-24 16:36:38 -08:00
Eric Liang 18a8dbfcfb [rllib] Clip DDPG ou-noise to avoid exceeding action bounds (#3386)
Closes #2965
2018-11-24 00:56:50 -08:00
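A minimal sketch of the idea behind the ou-noise clipping change above: Ornstein-Uhlenbeck noise is added to a deterministic action, and the sum is clipped so that it stays inside the action bounds. The function names, the `theta` and `sigma` values, and the bounds are illustrative assumptions, not RLlib's actual code.

```python
import random


def ou_step(prev_noise, theta=0.15, sigma=0.2, rng=random):
    """One zero-mean Ornstein-Uhlenbeck update (parameters are illustrative)."""
    return prev_noise - theta * prev_noise + sigma * rng.gauss(0.0, 1.0)


def noisy_action(action, noise, low, high):
    """Add exploration noise, then clip the result to [low, high]."""
    return max(low, min(high, action + noise))


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    noise = 0.0
    for _ in range(5):
        noise = ou_step(noise, rng=rng)
        # An action near the upper bound never exceeds it after clipping.
        print(noisy_action(0.95, noise, low=-1.0, high=1.0))
```

Without the clip, an action near a bound plus positive noise could leave the valid range, which is presumably the situation the commit addresses.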
Eric Liang 55fca828ce [rllib] Fix use_lstm option when using custom model with dict space (#3368)
## What do these changes do?

This passes in the right obs space to the lstm model wrapper, so that it doesn't attempt to un-flatten the already processed dict observation.

## Related issue number

Closes https://github.com/ray-project/ray/issues/3367
2018-11-23 22:51:08 -08:00
Eric Liang 8b76bab25c [rllib] docs for td3 (#3381)
* td3 doc

* Update rllib-env.rst
2018-11-22 13:36:47 -08:00
Eric Liang 41b6b50d09 fix py3 (#3382) 2018-11-22 11:43:52 -08:00
GiliR4t1qbit b9ae5edf74 When getting a role/profile, catch only the exception that indicates the role/profile already exists and allow others to be raised (#3383) 2018-11-22 09:42:58 -08:00
Jones Wong 24bfe8ab76 Enable Twin Delayed DDPG for RLlib DDPG agent (#3353) 2018-11-21 20:03:20 -08:00
Richard Liaw 784a6399b0 [tune] Node Fault Tolerance (#3238)
This PR introduces single-node fault tolerance for Tune.

## Previous behavior:
 - Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources.

## New behavior:
 - RUNNING trials will be resumed on another node on a best-effort basis (meaning they will run if resources are available).
 - If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued.
 - During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running.


Remaining questions:
 - Should `last_result` be consistent during restore? Yes, but not for earlier trials (trials that are yet to be checkpointed).

 - Waiting for some PRs to merge first (#3239)

Closes #2851.
2018-11-21 12:38:16 -08:00
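The recovery behavior described in the fault-tolerance commit above can be sketched roughly as follows. This is a simplified model under stated assumptions: trials are plain dicts, `free_slots` is a hypothetical mapping from surviving node name to the number of trials it can still accept, and `recover_trials` is not a Tune function.

```python
def recover_trials(trials, failed_node, free_slots):
    """Return updated copies of `trials` after `failed_node` is lost.

    Each trial is a dict with "id", "status" and "node" keys (hypothetical layout).
    """
    free = dict(free_slots)  # copy so the caller's mapping is not mutated
    updated = []
    for trial in trials:
        trial = dict(trial)
        if trial["status"] == "RUNNING" and trial["node"] == failed_node:
            # Pick any surviving node that still has capacity.
            target = next((n for n, slots in free.items() if slots > 0), None)
            if target is not None:
                free[target] -= 1
                trial["node"] = target       # resumed elsewhere, best effort
            else:
                trial["status"] = "PENDING"  # cluster saturated: queue the trial
                trial["node"] = None
        updated.append(trial)
    return updated
```

For example, with two RUNNING trials on a failed node and one free slot elsewhere, the first trial is moved and the second becomes PENDING.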
Richard Liaw c24d87b4d1 [autoscaler] Submit command (#3312) 2018-11-20 14:03:34 -08:00
Eric Liang abdc3b592e [rllib] Update multi-gpu impala numbers (#3327) 2018-11-19 20:55:27 -08:00
Eric Liang 5972c29d28 [rllib] Set ape-x local exploration to 0, also load explorations before training steps (#3349)
## What do these changes do?

This should fix high exploration values being used after restore and for rollouts.

## Related issue number

(dev list issue)
2018-11-19 20:36:25 -08:00
Eric Liang afc48d7b77 Don't setpgid() on actors (#3347) 2018-11-19 17:35:26 -08:00
Eric Liang e4bb5d8d16 Fix logging when ray cluster utils is used 2018-11-18 21:49:27 -08:00
Wenting Shen ab1e0f5c2f support home path and relative path for temp-dir (#3329) 2018-11-16 17:41:10 -08:00
Eric Liang e0bf9d7305 Add debug string to raylet (#3317)
* initial debug string

* format

* wip debug string

* fix compile

* fix

* update

* finished

* to file

* logs dir

* use temp root

* fix

* override
2018-11-15 21:47:50 -08:00
Robert Nishihara d10cb570ab Rename _submit -> _remote. (#3321) 2018-11-15 15:30:18 -08:00
Eric Liang 5723291db6 Raise exception if the node is nearly out of memory (#3323)
* wip

* add

* comment

* escape hatch

* update

* object store too

* .2
2018-11-15 12:55:25 -08:00
Lewis Belcher 5319fd044c Update redis version in setup.py (#3333)
* `redis` has released a new version (https://github.com/andymccurdy/redis-py/releases/tag/3.0.0)
* `ray` is not compatible with this version
* This PR adds the "compatible release" operator for `redis` version 2.10.6.
2018-11-15 10:40:08 -08:00
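To illustrate what the "compatible release" operator in the commit above means: under PEP 440, `~=2.10.6` is equivalent to `>=2.10.6, ==2.10.*`, so 2.10.x patch releases are accepted while 3.0.0 is rejected. The helper below is a hand-rolled sketch for plain dotted numeric versions only; the function names are made up for illustration, and real tooling should rely on an existing version-parsing library instead.

```python
def parse(version):
    """Split a simple dotted numeric version such as "2.10.6" into ints."""
    return tuple(int(part) for part in version.split("."))


def compatible_release(candidate, spec):
    """Return True if `candidate` satisfies `~=spec` for plain numeric versions."""
    cand, base = parse(candidate), parse(spec)
    if len(base) < 2:
        raise ValueError("~= needs at least two version components")
    prefix = base[:-1]  # every component except the last must match exactly
    return cand >= base and cand[:len(prefix)] == prefix


if __name__ == "__main__":
    for version in ("2.10.5", "2.10.6", "2.10.9", "2.11.0", "3.0.0"):
        print(version, compatible_release(version, "2.10.6"))
```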
Eric Liang 706dc1d473 [rllib] Add test for multi-agent support and fix IMPALA multi-agent (#3289)
IMPALA's multi-agent support was broken because IMPALA requires batches to be of a certain length, whereas multi-agent envs can create variable-length batches.

Fix this by adding zero-padding as needed (similar to the RNN case).
2018-11-14 14:14:07 -08:00
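A small sketch of the zero-padding idea mentioned in the commit above: a variable-length batch is right-padded with zeros until its length meets the requirement. Padding up to a multiple of `chunk_len` is an assumption made for illustration; the commit does not say exactly which length RLlib pads to, and `pad_to_multiple` is a hypothetical helper.

```python
def pad_to_multiple(values, chunk_len, pad_value=0):
    """Right-pad `values` so that its length is a multiple of `chunk_len`."""
    remainder = len(values) % chunk_len
    if remainder == 0:
        return list(values)  # already a valid length, return a copy
    return list(values) + [pad_value] * (chunk_len - remainder)


if __name__ == "__main__":
    print(pad_to_multiple([1, 2, 3], 4))
    print(pad_to_multiple([1, 2, 3, 4, 5], 4))
```

In practice a mask or sequence-length record is usually kept alongside the padded data so that the padding can be ignored downstream.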
andrewztan 57c7b4238e KL Divergence Metrics (#3300)
* added KL divergence metrics

* fix
2018-11-13 23:12:35 -08:00
Eric Liang 1660c9d627 Kill actor child processes on shutdown (#3297)
* example

* add env

* test pg

* change to test

* add atexit test

* Update rllib-env.rst

* comment

* revert unnecessary file

* fix title when actor is idle

* Update python/ray/actor.py

Co-Authored-By: ericl <ekhliang@gmail.com>
2018-11-13 19:16:42 -08:00
Eric Liang 65c27c70cf [rllib] Clean up agent resource configurations (#3296)
Closes #3284
2018-11-13 18:00:03 -08:00
Philipp Moritz d4fad222e1 Update profiling instructions for raylet (#3311) 2018-11-13 17:48:33 -05:00
Richard Liaw 97f423781b Clean up Ray processes after cluster util exits (#3278) 2018-11-13 13:18:12 -08:00
Richard Liaw c3a2c7ebed [tune] Doc: Autofilled, StatusReporter (#3294)
* autofill and revise doc page for things

* lint

* comments
2018-11-13 13:15:56 -08:00
Eric Liang 6ee7a3b571 [rllib] Raise worker TF intra_op threads to 2, lower driver intra_op threads to 8 (#3299) 2018-11-13 11:41:58 -08:00
Richard Liaw c0423db05c [core] Add Global State Test for multi-node setting (#3239)
* add test for adding node

* multinode test fixes

* First pass at allowing updatable values

* Fix compilation issues

* Add config file parsing

* Full initialization

* Wrote a good test

* configuration parsing and stuff

* docs

* write some tests, make it good

* fixed init

* Add all config options and bring back stress tests.

* Update python/ray/worker.py

* Update python/ray/worker.py

* Fix internalization

* some last changes

* Linting and Java fix

* add docstring

* Fix test, add assertions

* pytest ext

* lint

* lint
2018-11-13 10:35:24 -08:00
Eric Liang d90f365394 [rllib] Add self-supervised loss to model (#3291)
# What do these changes do?

Allow self-supervised losses to be easily defined in custom models. Add this to the reference policy graphs.
2018-11-12 18:55:24 -08:00
Eric Liang bd0dbde149 [rllib] Rename ServingEnv => ExternalEnv (#3302) 2018-11-12 16:31:27 -08:00
Richard Liaw e37891d79d [tune] Fix default handling for timesteps (#3293)
This PR fixes an issue where `timesteps_this_iter = 0` was rendered as "None".

Closes #3057.
2018-11-12 15:52:17 -08:00
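One common way a zero ends up displayed as "None" is a truthiness check that treats 0 as missing. Whether that was the actual cause in Tune is an assumption; the two functions below are hypothetical and only contrast the pitfall with an explicit `is None` check.

```python
def render_buggy(timesteps):
    # Truthiness check: 0 is falsy, so a legitimate zero is shown as "None".
    return str(timesteps or None)


def render_fixed(timesteps):
    # Only a genuinely missing value is shown as "None".
    return "None" if timesteps is None else str(timesteps)


if __name__ == "__main__":
    for value in (0, 5, None):
        print(repr(value), render_buggy(value), render_fixed(value))
```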
Eric Liang 49e2085d78 [rllib] Don't reset envs when possible (#3290)
* laz

* better errors
2018-11-11 01:45:37 -08:00
Eric Liang 463511f8a6 [tune] Track and warn on low memory (#3298) 2018-11-11 00:29:45 -08:00
Eric Liang 53489d2f85 [sgd] Document and add simple MNIST example (#3236) 2018-11-10 21:52:20 -08:00
Richard Liaw 29c182d449 [tune] Support "None" for upload_dir 2018-11-09 22:02:08 -08:00
Eric Liang a51d618d88 [autoscaler] missing example-full.yaml file in the latest wheel for provider type "local" 2018-11-09 21:25:15 -08:00
Eric Liang 9dd3eedbac [rllib] rollout.py should reduce num workers (#3263)
## What do these changes do?

Don't create an excessive number of workers for rollout.py, and also fix up the env wrapping to be consistent with the internal agent wrapper.

## Related issue number

Closes #3260.
2018-11-09 12:29:16 -08:00
Richard Liaw 22113be04c [tune] Annotated Example Page and showcase Tutorials (#3267)
Adds an example page and link in codebase.

Closes #2728.
2018-11-08 23:45:05 -08:00
Eric Liang 588705b6fa [autoscaler] Add option to allow private ips only (#3270)
* merge

* update

* upd

* Update python/ray/autoscaler/autoscaler.py

Co-Authored-By: ericl <ekhliang@gmail.com>

* Update python/ray/autoscaler/autoscaler.py

Co-Authored-By: ericl <ekhliang@gmail.com>

* Update python/ray/autoscaler/aws/config.py

Co-Authored-By: ericl <ekhliang@gmail.com>

* fix
2018-11-08 17:07:31 -08:00
Philipp Moritz 8894883153 Force kill web UI in ray stop (#3257) 2018-11-08 00:05:32 -08:00
Eric Liang 9b2794101d [minor] Change chunk already exists to DEBUG, add flags for rllib multi node testing (#3228) 2018-11-08 00:04:20 -08:00
Stephanie Wang d950e92f63 Allow multiple threads to call ray.get and ray.wait (#3244)
* Handle multiple threads calling ray.get

* Multithreaded ray.wait

* Pass in current task ID in java backend

* Add multithreaded actor to tests, add warning messages to worker for multithreaded ray.get

* Fix test

* Some cleanups

* Improve error message

* Add assertion

* Cleanup, throw error in HandleTaskUnblocked if task not actually blocked

* lint

* Fix python worker reset

* Fix references to reconstruct_objects

* Linting

* java lint

* Fix java

* Fix iterator
2018-11-07 22:39:28 -08:00
Richard Liaw 0bab8ed95c Expose internal config parameters for starting Ray (#3246)
## What do these changes do?

This PR exposes a command-line option for passing internal config parameters. This is important for certain tests (i.e., fault-tolerance tests that remove nodes) to run quickly.

Note that this is bad practice and should be replaced with GFLAGS or some equivalent as soon as possible.

#3239 depends on this.

TODO:
 - [x] Add documentation to method arguments before merging.
 - [x] Add test to verify this works?

2018-11-07 21:46:02 -08:00
Eric Liang 43df405d07 [rllib] Add some debug logs during agent setup (#3247) 2018-11-07 14:54:28 -08:00
Richard Liaw cf9e838326 [tune] Raise Error when overstepping (#3235) 2018-11-07 14:27:09 -08:00
Eric Liang 29e3362905 Better errors on process deaths (#3252) 2018-11-07 14:08:16 -08:00
Robert Nishihara 1dd5d92789 Enable timeline visualizations of object transfers. (#3255)
* Plot object transfers.

* Linting
2018-11-07 12:45:59 -08:00
Eric Liang 2e04ffe00c Change dict serialization warning to debug (#3230) 2018-11-06 21:23:07 -08:00
eugenevinitsky 344b4ef0ff [rllib] Fix filter sync for ES and ARS (#2918) 2018-11-06 19:09:34 -08:00
Eric Liang 725df3a485 Set the process title in workers and actors (#3219) 2018-11-06 14:59:22 -08:00