Commit Graph

1510 Commits

Author SHA1 Message Date
Qing Wang 1465a30ea9 Fix releasing CPUs incorrectly when actor creation task blocked. (#5271)
* Fix

* Remove useless log

* Address

* Fix typo

* sleep
2019-07-28 15:46:17 +08:00
Richard Liaw 5ea859dc73 [sgd] hotfix example failure (#5297)
* hotfix

* Update train_example.py
2019-07-27 18:13:22 -07:00
Eric Liang 6f2c5b2819 Revert "[autoscaler] Clean up error messages on setup failure (#5210)" (#5299)
This reverts commit 7fc15dbf7f.
2019-07-27 16:53:47 -07:00
lanlin 341dbf6c45 [tune] support nested dictionaries for CSVLogger (#5295) 2019-07-27 14:44:34 -07:00
Richard Liaw b4823d63c6 [autoscaler] Local YAML readability (#5290) 2019-07-27 12:51:50 -07:00
Eric Liang a62c5f40f6 [rllib] Document ModelV2 and clean up the models/ directory (#5277) 2019-07-27 02:08:16 -07:00
Richard Liaw 9c00616cdc Retry and exception for hang on memory store full (#5143) 2019-07-27 01:20:13 -07:00
Richard Liaw 5e15b36d6e [tune] experiment_analysis split to Analysis (#5115) 2019-07-27 01:10:52 -07:00
Richard Liaw 7e715520e5 [sgd] Example for Training (#5292) 2019-07-27 01:10:25 -07:00
Daniel Edgecumbe 06fec63c87 [autoscaler] Add a 'request_cores' function for manual autoscaling (#4754) 2019-07-26 17:14:45 -07:00
lanlin d9e81da3b8 [tune] configurable maximum length of trial identifier (#5287) 2019-07-26 17:09:54 -07:00
Antoine Galataud 827618254a [rllib] Configure learner queue timeout (#5270)
* configure learner queue timeout

* lint

* use config

* fix method args order, add unit test

* fix wrong param name
2019-07-25 21:18:05 -07:00
Stephanie Wang 3321555975 Increase timeout for ray.wait test (#5273)
* Increase test timeout for ray.wait

* make sure the actor is scheduled
2019-07-25 14:23:46 -07:00
Eric Liang bf9199ad77 [rllib] ModelV2 support for pytorch (#5249) 2019-07-25 11:02:53 -07:00
Joey Jiang 40395acadf [gRPC] Migrate raylet client implementation to grpc (#5120) 2019-07-25 14:48:56 +08:00
Eric Liang 60f59639c1 [rllib] Port DDPG to the build_tf_policy pattern (#5242) 2019-07-24 13:55:55 -07:00
Eric Liang 690b374581 [rllib] Add Keras LSTM example with ModelV2 (#5258) 2019-07-24 13:09:41 -07:00
Eric Liang 5b76238bce Fix two types of eviction hangs (#5225) 2019-07-23 21:20:17 -07:00
Eric Liang 97c43284a6 [rllib] Fix trainer state restore (#5257) 2019-07-23 21:18:58 -07:00
Stephanie Wang 9c651f47bb Add regression test for actor load balancing (#5224)
* Add regression test for actor load balancing

* Increase timeout

* Reduce number of nodes?
2019-07-23 15:11:55 -07:00
Stephanie Wang 15959b0f0d Leave ray.wait calls open until the task or actor exits (#5234)
* Regression test

* Split TaskDependencyManager::SubscribeDependencies into ray.get and ray.wait dependencies
- Some initial implementation

* unit test

* Improve unit tests for TaskDependencyManager

* Implement SubscribeWaitDependencies and UnsubscribeWaitDependencies, unit tests passing

* Add ray.wait python test for drivers that exit early

* Add WorkerID to Worker

* Update test to use two nodes

* Regression test for ray.wait passes

* Extend regression test to include ray.wait from an actor

* Fix ClientID and WorkerIDs

* lint

* lint

* Remove unnecessary ray_get argument

* fix build
2019-07-23 11:55:28 -07:00
Peter Schafhalter fc589050c9 [sgd] Deprecate old distributed SGD implementation (#5160)
* Deprecate old distributed SGD implementation

* Update README
2019-07-22 15:47:10 -07:00
Richard Liaw 7fc15dbf7f [autoscaler] Clean up error messages on setup failure (#5210) 2019-07-22 11:27:51 -07:00
Richard Liaw 53fb876a5f Improved KeyboardInterrupt Exception Handling (#5237) 2019-07-22 02:29:56 -07:00
Eric Liang f9043cc49a [rllib] Remove experimental eager support 2019-07-21 12:27:17 -07:00
Richard Liaw b0c0de49a2 [tune] Fixup exception messages (#5238) 2019-07-20 22:36:27 -07:00
Eric Liang d58b986858 [rllib] MultiCategorical shouldn't return array for kl or entropy (#5215)
* wip

* fix
2019-07-19 12:12:04 -07:00
Jones Wong da7676c925 Removed the implicit sync barrier at the end of each training iteration (#5217)
* removed sync barrier at the end of each training iteration

* formatted

* modify the comment according to current semantics

* lint check

* Update trainer.py
2019-07-18 22:59:52 -07:00
Eric Liang 28e5c5555d [rllib] Move some inline defs to avoid deserialization errors (#5228)
* fix bug

* move metrics too
2019-07-18 21:01:16 -07:00
Jones Wong 0af07bd493 Enable seeding actors for reproducible experiments (#5197)
* enable graph-level worker-specific seed

* lint checked

* revised according to Eric's suggestions

* revised accordingly and added a test case

* formatted

* Update test_reproducibility.py

* Update trainer.py

* Update rollout_worker.py

* Update run_rllib_tests.sh

* Update worker_set.py
2019-07-17 23:31:34 -07:00
Qingqing Mao 63f49f95dd Improve memory check (#5216)
* Improve MemoryMonitor

- Add an env var to control the threshold.
- Use cgroup memory limit and usage for container environment.

* linting

* white space

* add comment
2019-07-17 23:30:02 -07:00
Jones Wong 81d297f87e Remove redundant scaler of l2 reg (#5172)
* remove redundant scaler of l2 reg

* lint formatted

* Update ddpg_policy.py
2019-07-17 15:11:27 -07:00
Jones Wong ae03c42dd6 Fixed inconsistent action placeholder (#5213) 2019-07-17 10:55:14 -07:00
Sam Toyer 214f09d969 [rllib] Make RLLib handle zero-length observation arrays (#5208)
* [rllib] Make _summarize handle zero-len arrays

Fixes #5207

* [rllib] Make aligned_array() handle empty arrays

* [rllib] Conform with old yapf
2019-07-16 22:37:57 -07:00
Richard Liaw 3e0ad11ae0 Add heartbeat test + Fix monitor.py (#5191) 2019-07-16 21:59:48 -07:00
Eric Liang 4fa2a6006c [rllib] Remove nested import (#5204)
* remove nested import

* Update metrics.py
2019-07-16 10:52:56 -07:00
Eric Liang 047f4ccd61 [rllib] Fix rollout.py with tuple action space (#5201)
* fix it

* update doc too

* fix rollout
2019-07-16 10:52:35 -07:00
Edward Oakes e5be5fd46d Remove dependencies from TaskExecutionSpecification (#5166) 2019-07-15 18:15:21 -07:00
Hao Chen ea6aa6409a Reconstruct failed actors without sending tasks. (#5161)
* fast reconstruct dead actors

* add test

* fix typos

* remove debug print

* small fix

* fix typos

* Update test_actor.py
2019-07-15 10:25:09 -07:00
Jones Wong 5b13a7eb90 Keep parameter space noise consistent with action space noise (Fix 5173) (#5193)
* make parameter space noise consistent with action space noise

* modified according to lint check

* indent
2019-07-14 12:20:35 -07:00
Philipp Moritz 322b5166ad Update arrow to include user defined status for plasma (#5156) 2019-07-12 22:51:14 -07:00
Richard Liaw b6509f46b0 Update wheels to 0.8.0dev2 (#5186) 2019-07-12 17:27:03 -07:00
Richard Liaw 1530389822 [tune] Fast Node Recovery (#5053) 2019-07-12 13:47:30 -07:00
Kristian Hartikainen 3456afdea7 [autoscaler] Fix missing body argument in GCP getIamPolicy #5169 2019-07-11 13:03:51 -07:00
Hao Chen fd835d107e Move task to common module and add checks in getter methods (#5147) 2019-07-11 17:07:04 +08:00
Qing Wang f2293243cc [ID Refactor] Shorten the length of JobID to 4 bytes (#5110)
* WIP

* Fix

* Add jobid test

* Fix

* Add python part

* Fix

* Fix tes

* Remove TODOs

* Fix C++ tests

* Lint

* Fix

* Fix exporting functions in multiple ray.init

* Fix java test

* Fix lint

* Fix linting

* Address comments.

* Fix

* Address and fix linting

* Refine and fix

* Fix

* address

* Address comments.

* Fix linting

* Fix

* Address

* Address comments.

* Address

* Address

* Fix

* Fix

* Fix

* Fix lint

* Fix

* Fix linting

* Address comments.

* Fix linting

* Address comments.

* Fix linting

* address comments.

* Fix
2019-07-11 14:25:16 +08:00
Kai Yang 43b6513d19 [GCS] Move node resource info from client table to resource table (#5050) 2019-07-11 13:17:19 +08:00
Richard Liaw 691c9733f9 [tune] Document trainable attributes and enable user-checkpoint… (#4868) 2019-07-10 18:51:11 -07:00
Richard Liaw 0b540ab492 [tune] Test example checkpointing (#4728) 2019-07-10 01:58:26 -07:00
Eric Liang 5ab5017c67 [rllib] Fix impala stress test (#5101)
* add copy

* upgrade to tf 1.14

* update

* reduce count to workaround https://github.com/ray-project/ray/issues/5125

* Update impala.py

* placeholder

* comments

* update
2019-07-09 20:22:30 -07:00