55 Commits

Author SHA1 Message Date
Ujval Misra 98a07fe37e [tune] Asynchronous saves (#6912)
* Support asynchronous saves

* Fix merge issues

* Add test, fix existing tests

* More informative warning

* Lint, remove print statements

* Address comments, add checkpoint.is_resolved fn

* Add more detailed comments
2020-02-09 12:17:45 -08:00
Ujval Misra 1558307ac4 [tune] Prevent MEMORY checkpoints from breaking trial FT (#6691)
* Prevent MEMORY checkpoints from breaking FT

* Add save/pause/resume/restore test

* change checkpoint return value based on status

* Fix test_checkpoint_manager_tests.

* Fix test + checkpoint manager bug

* lint

* Add docstring

* Add docstring to checkpoint_manager constructor

* Change variable name for clarity

* Revert on_checkpoint docstring wording

* Break after success

* nit: more informative warning

* Quarantine test
2020-01-22 23:17:09 -08:00
Sven 60d4d5e1aa Remove future imports (#6724)
* Remove all __future__ imports from RLlib.

* Remove (object) again from tf_run_builder.py::TFRunBuilder.

* Fix 2xLINT warnings.

* Fix broken appo_policy import (must be appo_tf_policy)

* Remove future imports from all other ray files (not just RLlib).

* Remove future import blocks that contain `unicode_literals` as well.
Revert appo_tf_policy.py to appo_policy.py (belongs to another PR).

* Add two empty lines before Schedule class.

* Put back __future__ imports into determine_tests_to_run.py. Fails otherwise on a py2/print related error.
2020-01-09 00:15:48 -08:00
Michał Słapek aaeb3c44a5 [tune] Add _change_working_directory to RayTrialExecutor (#6228) (#6320)
* [tune] Add _switch_working_directory to RayTrialExecutor (#6228)

* Make _switch_working_directory before warn_if_slow

* Rename _switch_working_directory to _change_working_directory
2020-01-07 01:51:04 -08:00
Ujval Misra 20ba7ef647 [tune] Move util to utils package (#6682)
* Move util.py to utils

* Fix import
2020-01-06 18:11:02 -08:00
Ujval Misra 5b40408678 [tune] Remove py2.7-specific code (#6665)
* Remove backwards compatibility py2.7 code.

* Use exist_ok=True in ray

* nit

* nit

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-01-03 01:03:13 -08:00
Ujval Misra ca651af1d7 [tune] Async restores and S3/GCP-capable trial FT (#6376)
* Initial commit for asynchronous save/restore

* Set stage for cloud checkpointable trainable.

* Refactor log_sync and sync_client.

* Add durable trainable impl.

* Support delete in cmd based client

* Fix some tests and such

* Cleanup, comments.

* Use upload_dir instead.

* Revert files belonging to other PR in split.

* Pass upload_dir into trainable init.

* Pickle checkpoint at driver, more robust checkpoint_dir discovery.

* Cleanup trainable helper functions, fix tests.

* Addressed comments.

* Fix bugs from cluster testing, add parameterized cluster tests.

* Add trainable util test

* package_ref

* pbt_address

* Fix bug after running pbt example (_save returning dir).

* get cluster tests running, other bug fixes.

* raise_errors

* Fix deleter bug, add durable trainable example.

* Fix cluster test bugs.

* filelock

* save/restore bug fixes

* .

* Working cluster tests.

* Lint, revert to tracking memory checkpoints.

* Documentation, cleanup

* fixinitialsync

* fix_one_test

* Fix cluster test bug

* nit

* lint

* Revert tune md change

* Fix basename bug for directories.

* lint

* fix_tests

* nit_fix

* Add __init__ file.

* Move to utils package

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-01-02 20:40:53 -08:00
Robert Nishihara 39a3459886 Remove (object) from class declarations. (#6658) 2020-01-02 17:42:13 -08:00
Richard Liaw 93e8c85e72 [tune] Avoid duplication in TrialRunner execution (#6598)
* avoid_duplication

* Update python/ray/tune/ray_trial_executor.py

Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>

Co-authored-by: Kristian Hartikainen <kristian.hartikainen@gmail.com>
2019-12-26 02:13:55 +01:00
visatish e2ba8c1898 [tune] Fixed bug in PBT where initial trial result is empty. (#6351)
* Fixed bug in tune pbt where initial result is empty.

* Updated mock trial executor in test suite.

* Added comment.
2019-12-06 15:30:27 -08:00
Eric Liang 4c6739476b [rllib] Raise an error if GPUs are enabled but not tf.test.is_gpu_available() (#6365) 2019-12-05 10:13:54 -08:00
Ujval Misra fa5d62e8ba [tune] Retry restore on timeout (#6284)
* Retry recovery on timeout

* fix bug, revert some code

* Add test for restore timeouts.

* Fix lint

* Address comments

* Don't timeout restores.
2019-12-02 20:01:47 -08:00
Ujval Misra 2965dc1b72 [tune] Fault tolerance improvements (#5877)
* Precede ray.get with ray.wait.

* Trigger checkpoint deletes locally in Trainable

* Clean-up code.

* Minor changes.

* Track best checkpoint so far again

* Pulled checkpoint GC out of Trainable.

* Added comments, error logging.

* Immediate pull after checkpoint taken; rsync source delete on pull

* Minor doc fixes

* Fix checkpoint manager bug

* Fix bugs, tests, formatting

* Fix bugs, feature flag for force sync.

* Fix test.

* Fix minor bugs: clear proc and less verbose sync_on_checkpoint warnings.

* Fix bug: update IP of last_result.

* Fixed message.

* Added a lot of logging.

* Changes to ray trial executor.

* More bug fixes (logging after failure), better logging.

* Fix Richard's bug and logging

* Add comments.

* try-except

* Fix heapq bug.

* .

* Move handling of no available trials to ray_trial_executor (#1)

* Fix formatting bug, lint.

* Addressed Richard's comments

* Revert tests.

* fix rebase

* Fix trial location reporting.

* Fix test

* Fix lint

* Rebase, use ray.get w/ timeout, lint.

* lint

* fix rebase

* Address Richard's comments
2019-11-18 01:14:41 -08:00
Richard Liaw 91acecc9f9 [tune][minor] gpu warning (#5948)
* gpu

* format

* defaults

* format_and_check

* better registration

* fix

* fix

* trial

* format

* tune
2019-10-19 17:09:48 -07:00
Eric Liang 6843a01a7f Automatically create custom node id resource (#5882)
* node id

* comment

* comments

* fix tests
2019-10-15 21:31:11 -07:00
Richard Liaw 1650f7b174 [tune] Remove TF MNIST example + add TrialRunner hook to execut… (#5868)
* remove test

* add trial runner

* remove restore

* Remove other mnist examples

* tunetest

* revert

* v1

* Revert "v1"

This reverts commit c8bddaf2db7a8270c43c02021cac0e75df15ed20.

* Revert "revert"

This reverts commit b58f56884a0c288d3a6f997d149ab4d496ddd7a3.

* errors

* format
2019-10-13 20:33:56 -07:00
Richard Liaw 52e5c9b22d [tune] CPU-Only Head Node support (#5900)
* trialqueue

* add tests
2019-10-13 20:31:42 -07:00
Ujval Misra 375852af23 [tune] Check node liveness before result fetch (#5844)
* Check if trial's node is alive before trying to fetch result

* Added function for failed trials to trial_executor interface

* Address comments, add test.
2019-10-08 11:41:01 -07:00
Si-Yuan 0292f99e6c Fix DeprecationWarning (#5608) 2019-09-01 15:21:32 -07:00
Eric Liang e2e30ca507 Ray, Tune, and RLlib support for memory, object_store_memory options (#5226) 2019-08-21 23:01:10 -07:00
Richard Liaw cff72d1a54 [minor][tune] update pbt docs (#5420) 2019-08-11 12:39:54 -07:00
Richard Liaw 094ec7adbc [tune] Allow nested values in trial runner (#5346) 2019-08-06 14:36:17 -07:00
jichan3751 bd6dfc994f [sgd] Replaced class Resources in sgd with use_gpu (#5252) 2019-08-01 01:03:10 -07:00
Richard Liaw b0c0de49a2 [tune] Fixup exception messages (#5238) 2019-07-20 22:36:27 -07:00
Richard Liaw 1530389822 [tune] Fast Node Recovery (#5053) 2019-07-12 13:47:30 -07:00
Richard Liaw acee89b1f6 [tune] Auto-init Ray + default SearchAlg (#4815) 2019-05-29 12:09:34 -07:00
Robert Nishihara 6703519144 Move global state API out of global_state object. (#4857) 2019-05-26 11:27:53 -07:00
Adi Zimmerman 36b71d1446 [Tune] Post-Experiment Tools (#4351) 2019-05-04 02:51:26 -04:00
Romil Bhardwaj 0f42f87ebc Updating zero capacity resource semantics (#4555) 2019-04-12 16:53:57 -07:00
justinwyang e88e706fcc Enforce quoting style in Travis. (#4589) 2019-04-11 14:24:26 -07:00
Dušan Josipović 820c71b7d0 [tune/rllib] Add checkpoint eraser (#4490) 2019-04-06 20:01:54 -07:00
Richard Liaw ea5a6f8455 [tune] Simplify API (#4234)
Uses `tune.run` to execute experiments as preferred API.

@noahgolmant

This does not break backwards compat, but will slowly internalize `Experiment`. 

In a separate PR, Tune schedulers should only support 1 running experiment at a time.
2019-03-17 13:03:32 -07:00
Richard Liaw 5e95abe63e [tune] Fix performance issue and fix reuse tests (#4379)
* fix tests

* better name

* reduce warnings

* better resource tracking

* oops

* revertmessage

* fix_executor
2019-03-16 13:52:02 -07:00
Eric Liang 2c1131e8b2 [tune] Add warnings if tune event loop gets clogged (#4353)
* add guards

* comments
2019-03-14 19:44:01 -07:00
Eric Liang d5f4698305 [tune] Avoid scheduler blocking, add reuse_actors optimization (#4218) 2019-03-12 23:49:31 -07:00
Kristian Hartikainen df9beb7123 [tune] Fix trial result fetching (#4219)
* Fix trial results wait in RayTrialExecutor.get_next_available_trial

* Add comment for the results shuffling

* Remove timeout from the wait

* Change random.sample to random.shuffle
2019-03-04 14:26:10 -08:00
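The change from random.sample to random.shuffle in the commit above can be illustrated with a minimal sketch. The `in_flight` list and trial names are hypothetical stand-ins, not the executor's actual data structures. The sketch only shows the difference between the two calls: one allocates a shuffled copy, the other reorders the existing list in place.

```python
import random


def shuffled_copy(items, rng):
    """Return a shuffled copy via random.sample (allocates a new list)."""
    return rng.sample(items, len(items))


def shuffle_in_place(items, rng):
    """Shuffle the same list object via random.shuffle, which returns None."""
    rng.shuffle(items)
    return items


# Hypothetical stand-ins for in-flight trial results.
in_flight = ["trial_a", "trial_b", "trial_c", "trial_d"]
rng = random.Random(0)

copied = shuffled_copy(in_flight, rng)       # new list, original untouched
same_list = shuffle_in_place(in_flight, rng)  # original list reordered
```

Randomizing the order before waiting on results means no single trial is consistently picked first, which is presumably the motivation for the shuffle in the first place.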
Richard Liaw 3483282254 [tune] Local Mode support (#4138) 2019-03-03 14:05:59 -08:00
Richard Liaw bb7c4ce9c4 [tune] Improve error message when Ray crashes (#3795) 2019-02-15 01:04:17 -08:00
Richard Liaw 5db1afef07 [tune] Support Custom Resources (#2979)
Support arbitrary resource declarations in Tune.

Fixes https://github.com/ray-project/ray/issues/2875
2019-02-07 00:29:19 -08:00
Tianming Xu 1302fafc0b [Tune] Add export_formats option to export policy graphs (#3868)
In earlier PRs, PR#3585 and PR#3637, export_policy_model and export_policy_checkpoint were introduced for users to export TensorFlow model and checkpoint.

For Ray Tune users, these APIs are not accessible through YAML configurations.

In this pull request, export_formats option is provided to enable users to choose the desired export format.
2019-01-31 17:07:27 -08:00
Richard Liaw 3918934dfd [tune] Cross-Node Recovery (#3725)
Augments trial restore to also check if the runner is at the same
location. If not, the checkpoint files are pushed onto the new location.
2019-01-15 10:37:28 -08:00
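The location check described in the commit above might look roughly like the following sketch. `Checkpoint`, `node_ip`, and `restore_plan` are hypothetical names chosen for illustration and are not taken from the Tune code base; the sketch only models the decision, not the actual file transfer.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    path: str
    node_ip: str  # hypothetical field: node where the files were written


def restore_plan(checkpoint, runner_ip):
    """Return the ordered steps needed to restore on the runner's node."""
    steps = []
    if checkpoint.node_ip != runner_ip:
        # The runner moved, so the files must be pushed to its node first.
        steps.append(("push", checkpoint.path, runner_ip))
    steps.append(("restore", checkpoint.path))
    return steps


ckpt = Checkpoint(path="/tmp/ckpt_1", node_ip="10.0.0.1")
same_node = restore_plan(ckpt, "10.0.0.1")
other_node = restore_plan(ckpt, "10.0.0.2")
```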
Robert Nishihara d1e21b702e Change timeout from milliseconds to seconds in ray.wait. (#3706)
* Change timeout from milliseconds to seconds in ray.wait.

* Suppress warning.

* Add prominent warning in API documentation.
2019-01-08 21:32:08 -08:00
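For call sites that still hold millisecond values, the unit change in the commit above could be handled with a small conversion helper. This is a sketch under stated assumptions: `timeout_ms_to_seconds` is a hypothetical helper, not part of Ray, and the ray.wait call is shown only in a comment so the block runs without Ray installed.

```python
def timeout_ms_to_seconds(timeout_ms):
    """Convert a legacy millisecond timeout to seconds; None means no timeout."""
    if timeout_ms is None:
        return None
    return timeout_ms / 1000.0


legacy_timeout_ms = 500
timeout_s = timeout_ms_to_seconds(legacy_timeout_ms)

# A migrated call site would then pass seconds, for example:
#   ready, not_ready = ray.wait(object_ids, timeout=timeout_s)
```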
Richard Liaw 8934e37a78 [tune] Change log handling for Tune (#3661)
Also provides a small retry mechanism for a transient error as reported
by #3340.

Closes #3653.
2019-01-06 13:20:10 -08:00
Richard Liaw 960a943503 [tune] Fault Tolerance: handle lost checkpoints by restart (#3657)
Checks that node failure with lost checkpoints does not crash. Also adds test.
2019-01-04 22:05:27 -08:00
Richard Liaw aad3c50e2d [tune] Cluster Fault Tolerance (#3309)
This PR introduces cluster-level fault tolerance for Tune by checkpointing global state. This occurs with relatively high frequency and allows users to easily resume experiments when the cluster crashes.

Note that this PR may affect automated workflows due to auto-prompting, but this is resolvable.
2018-12-29 11:42:25 +08:00
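The idea of checkpointing global state so an experiment can resume after a cluster crash can be sketched as below. This is not the PR's implementation: the JSON format, the `experiment_state.json` file name, and both function names are assumptions made for illustration. The write goes to a temporary file and is then renamed, so a crash mid-write does not leave a partial checkpoint.

```python
import json
import os
import tempfile


def save_experiment_state(state, path):
    """Write to a temp file, then rename, so readers never see a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_experiment_state(path):
    """Return the last saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


# Usage: checkpoint the state, then resume from it as if after a crash.
with tempfile.TemporaryDirectory() as tmp_dir:
    state_file = os.path.join(tmp_dir, "experiment_state.json")  # hypothetical name
    before = load_experiment_state(state_file)
    save_experiment_state({"trials": {"trial_a": "RUNNING"}}, state_file)
    resumed = load_experiment_state(state_file)
```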
Richard Liaw 9d0bd50e78 [tune] Component notification on node failure + Tests (#3414)
Changes include:
 - Notify Components on Requeue
 - Slight refactoring of Node Failure handling
 - Better tests
2018-12-04 14:47:31 -08:00
Richard Liaw 784a6399b0 [tune] Node Fault Tolerance (#3238)
This PR introduces single-node fault tolerance for Tune.

## Previous behavior:
 - Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources.

## New behavior:
 - RUNNING trials will be resumed on another node on a best-effort basis (meaning they will run if resources are available).
 - If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued.
 - During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running.


Remaining questions:
 - Should `last_result` be consistent during restore? Yes, but not for earlier trials (trials that are yet to be checkpointed).

 - Waiting for some PRs to merge first (#3239)

Closes #2851.
2018-11-21 12:38:16 -08:00
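The new behavior described in the commit above can be modeled as a small state transition. This is a simplified sketch, not Tune's code: trials are plain dicts, `free_slots` stands in for real resource accounting, and `handle_node_failure` is a hypothetical name. The returned list represents the trials for which schedulers and search algorithms would be notified.

```python
RUNNING, PENDING = "RUNNING", "PENDING"


def handle_node_failure(trials, failed_node, free_slots):
    """Resume affected trials while capacity remains; queue the rest as PENDING."""
    notified = []
    for trial in trials:
        if trial["status"] != RUNNING or trial["node"] != failed_node:
            continue  # unaffected by this failure
        notified.append(trial["name"])
        trial["node"] = None  # placement on another node is decided later
        if free_slots > 0:
            free_slots -= 1  # best effort: stays RUNNING, to be resumed elsewhere
        else:
            trial["status"] = PENDING  # cluster saturated: back to the queue
    return notified


trials = [
    {"name": "t1", "status": RUNNING, "node": "node-1"},
    {"name": "t2", "status": RUNNING, "node": "node-1"},
    {"name": "t3", "status": RUNNING, "node": "node-2"},
]
notified = handle_node_failure(trials, "node-1", free_slots=1)
```

With one free slot, the first affected trial keeps its RUNNING status and the second is queued as PENDING, while the trial on the healthy node is left alone.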
Eric Liang 65c27c70cf [rllib] Clean up agent resource configurations (#3296)
Closes #3284
2018-11-13 18:00:03 -08:00
Robert Nishihara 658c14282c Remove legacy Ray code. (#3121)
* Remove legacy Ray code.

* Fix cmake and simplify monitor.

* Fix linting

* Updates

* Fix

* Implement some methods.

* Remove more plasma manager references.

* Fix

* Linting

* Fix

* Fix

* Make sure class IDs are strings.

* Some path fixes

* Fix

* Path fixes and update arrow

* Fixes.

* linting

* Fixes

* Java fixes

* Some java fixes

* TaskLanguage -> Language

* Minor

* Fix python test and remove unused method signature.

* Fix java tests

* Fix jenkins tests

* Remove commented out code.
2018-10-26 13:36:58 -07:00
Richard Liaw 0651d3b629 [tune/core] Use Global State API for resources (#3004) 2018-10-04 17:23:17 -07:00