Ujval Misra
98a07fe37e
[tune] Asynchronous saves ( #6912 )
...
* Support asynchronous saves
* Fix merge issues
* Add test, fix existing tests
* More informative warning
* Lint, remove print statements
* Address comments, add checkpoint.is_resolved fn
* Add more detailed comments
2020-02-09 12:17:45 -08:00
Ujval Misra
1558307ac4
[tune] Prevent MEMORY checkpoints from breaking trial FT ( #6691 )
...
* Prevent MEMORY checkpoints from breaking FT
* Add save/pause/resume/restore test
* change checkpoint return value based on status
* Fix test_checkpoint_manager_tests.
* Fix test + checkpoint manager bug
* lint
* Add docstring
* Add docstring to checkpoint_manager constructor
* Change variable name for clarity
* Revert on_checkpoint docstring wording
* Break after success
* nit: more informative warning
* Quarantine test
2020-01-22 23:17:09 -08:00
Sven
60d4d5e1aa
Remove future imports ( #6724 )
...
* Remove all __future__ imports from RLlib.
* Remove (object) again from tf_run_builder.py::TFRunBuilder.
* Fix 2xLINT warnings.
* Fix broken appo_policy import (must be appo_tf_policy)
* Remove future imports from all other ray files (not just RLlib).
* Remove future imports from all other ray files (not just RLlib).
* Remove future import blocks that contain `unicode_literals` as well.
Revert appo_tf_policy.py to appo_policy.py (belongs to another PR).
* Add two empty lines before Schedule class.
* Put back __future__ imports into determine_tests_to_run.py; without them it fails with a py2/print-related error.
2020-01-09 00:15:48 -08:00
Michał Słapek
aaeb3c44a5
[tune] Add _change_working_directory to RayTrialExecutor ( #6228 ) ( #6320 )
...
* [tune] Add _switch_working_directory to RayTrialExecutor (#6228)
* Make _switch_working_directory before warn_if_slow
* Rename _switch_working_directory to _change_working_directory
2020-01-07 01:51:04 -08:00
Ujval Misra
20ba7ef647
[tune] Move util to utils package ( #6682 )
...
* Move util.py to utils
* Fix import
2020-01-06 18:11:02 -08:00
Ujval Misra
5b40408678
[tune] Remove py2.7-specific code ( #6665 )
...
* Remove backwards compatibility py2.7 code.
* Use exist_ok=True in ray
* nit
* nit
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-01-03 01:03:13 -08:00
Ujval Misra
ca651af1d7
[tune] Async restores and S3/GCP-capable trial FT ( #6376 )
...
* Initial commit for asynchronous save/restore
* Set stage for cloud checkpointable trainable.
* Refactor log_sync and sync_client.
* Add durable trainable impl.
* Support delete in cmd based client
* Fix some tests and such
* Cleanup, comments.
* Use upload_dir instead.
* Revert files belonging to other PR in split.
* Pass upload_dir into trainable init.
* Pickle checkpoint at driver, more robust checkpoint_dir discovery.
* Cleanup trainable helper functions, fix tests.
* Addressed comments.
* Fix bugs from cluster testing, add parameterized cluster tests.
* Add trainable util test
* package_ref
* pbt_address
* Fix bug after running pbt example (_save returning dir).
* get cluster tests running, other bug fixes.
* raise_errors
* Fix deleter bug, add durable trainable example.
* Fix cluster test bugs.
* filelock
* save/restore bug fixes
* .
* Working cluster tests.
* Lint, revert to tracking memory checkpoints.
* Documentation, cleanup
* fixinitialsync
* fix_one_test
* Fix cluster test bug
* nit
* lint
* Revert tune md change
* Fix basename bug for directories.
* lint
* fix_tests
* nit_fix
* Add __init__ file.
* Move to utils package
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-01-02 20:40:53 -08:00
Robert Nishihara
39a3459886
Remove (object) from class declarations. ( #6658 )
2020-01-02 17:42:13 -08:00
Richard Liaw
93e8c85e72
[tune] Avoid duplication in TrialRunner execution ( #6598 )
...
* avoid_duplication
* Update python/ray/tune/ray_trial_executor.py
Co-authored-by: Kristian Hartikainen <kristian.hartikainen@gmail.com>
2019-12-26 02:13:55 +01:00
visatish
e2ba8c1898
[tune] Fixed bug in PBT where initial trial result is empty. ( #6351 )
...
* Fixed bug in tune pbt where initial result is empty.
* Updated mock trial executor in test suite.
* Added comment.
2019-12-06 15:30:27 -08:00
Eric Liang
4c6739476b
[rllib] Raise an error if GPUs are enabled but not tf.test.is_gpu_available() ( #6365 )
2019-12-05 10:13:54 -08:00
Ujval Misra
fa5d62e8ba
[tune] Retry restore on timeout ( #6284 )
...
* Retry recovery on timeout
* fix bug, revert some code
* Add test for restore timeouts.
* Fix lint
* Address comments
* Don't time out restores.
2019-12-02 20:01:47 -08:00
Ujval Misra
2965dc1b72
[tune] Fault tolerance improvements ( #5877 )
...
* Precede ray.get with ray.wait.
* Trigger checkpoint deletes locally in Trainable
* Clean-up code.
* Minor changes.
* Track best checkpoint so far again
* Pulled checkpoint GC out of Trainable.
* Added comments, error logging.
* Immediate pull after checkpoint taken; rsync source delete on pull
* Minor doc fixes
* Fix checkpoint manager bug
* Fix bugs, tests, formatting
* Fix bugs, feature flag for force sync.
* Fix test.
* Fix minor bugs: clear proc and less verbose sync_on_checkpoint warnings.
* Fix bug: update IP of last_result.
* Fixed message.
* Added a lot of logging.
* Changes to ray trial executor.
* More bug fixes (logging after failure), better logging.
* Fix Richard's bug and logging
* Add comments.
* try-except
* Fix heapq bug.
* .
* Move handling of no available trials to ray_trial_executor (#1)
* Fix formatting bug, lint.
* Addressed Richard's comments
* Revert tests.
* fix rebase
* Fix trial location reporting.
* Fix test
* Fix lint
* Rebase, use ray.get w/ timeout, lint.
* lint
* fix rebase
* Address Richard's comments
2019-11-18 01:14:41 -08:00
Richard Liaw
91acecc9f9
[tune][minor] gpu warning ( #5948 )
...
* gpu
* format
* defaults
* format_and_check
* better registration
* fix
* fix
* trial
* format
* tune
2019-10-19 17:09:48 -07:00
Eric Liang
6843a01a7f
Automatically create custom node id resource ( #5882 )
...
* node id
* comment
* comments
* fix tests
2019-10-15 21:31:11 -07:00
Richard Liaw
1650f7b174
[tune] Remove TF MNIST example + add TrialRunner hook to execut… ( #5868 )
...
* remove test
* add trial runner
* remove restore
* Remove other mnist examples
* tunetest
* revert
* v1
* Revert "v1"
This reverts commit c8bddaf2db7a8270c43c02021cac0e75df15ed20.
* Revert "revert"
This reverts commit b58f56884a0c288d3a6f997d149ab4d496ddd7a3.
* errors
* format
2019-10-13 20:33:56 -07:00
Richard Liaw
52e5c9b22d
[tune] CPU-Only Head Node support ( #5900 )
...
* trialqueue
* add tests
2019-10-13 20:31:42 -07:00
Ujval Misra
375852af23
[tune] Check node liveness before result fetch ( #5844 )
...
* Check if trial's node is alive before trying to fetch result
* Added function for failed trials to trial_executor interface
* Address comments, add test.
2019-10-08 11:41:01 -07:00
Si-Yuan
0292f99e6c
Fix DeprecationWarning ( #5608 )
2019-09-01 15:21:32 -07:00
Eric Liang
e2e30ca507
Ray, Tune, and RLlib support for memory, object_store_memory options ( #5226 )
2019-08-21 23:01:10 -07:00
Richard Liaw
cff72d1a54
[minor][tune] update pbt docs ( #5420 )
2019-08-11 12:39:54 -07:00
Richard Liaw
094ec7adbc
[tune] Allow nested values in trial runner ( #5346 )
2019-08-06 14:36:17 -07:00
jichan3751
bd6dfc994f
[sgd] Replaced class Resources in sgd with use_gpu ( #5252 )
2019-08-01 01:03:10 -07:00
Richard Liaw
b0c0de49a2
[tune] Fixup exception messages ( #5238 )
2019-07-20 22:36:27 -07:00
Richard Liaw
1530389822
[tune] Fast Node Recovery ( #5053 )
2019-07-12 13:47:30 -07:00
Richard Liaw
acee89b1f6
[tune] Auto-init Ray + default SearchAlg ( #4815 )
2019-05-29 12:09:34 -07:00
Robert Nishihara
6703519144
Move global state API out of global_state object. ( #4857 )
2019-05-26 11:27:53 -07:00
Adi Zimmerman
36b71d1446
[Tune] Post-Experiment Tools ( #4351 )
2019-05-04 02:51:26 -04:00
Romil Bhardwaj
0f42f87ebc
Updating zero capacity resource semantics ( #4555 )
2019-04-12 16:53:57 -07:00
justinwyang
e88e706fcc
Enforce quoting style in Travis. ( #4589 )
2019-04-11 14:24:26 -07:00
Dušan Josipović
820c71b7d0
[tune/rllib] Add checkpoint eraser ( #4490 )
2019-04-06 20:01:54 -07:00
Richard Liaw
ea5a6f8455
[tune] Simplify API ( #4234 )
...
Uses `tune.run` to execute experiments as preferred API.
@noahgolmant
This does not break backwards compat, but will slowly internalize `Experiment`.
In a separate PR, Tune schedulers should only support 1 running experiment at a time.
2019-03-17 13:03:32 -07:00
Richard Liaw
5e95abe63e
[tune] Fix performance issue and fix reuse tests ( #4379 )
...
* fix tests
* better name
* reduce warnings
* better resource tracking
* oops
* revertmessage
* fix_executor
2019-03-16 13:52:02 -07:00
Eric Liang
2c1131e8b2
[tune] Add warnings if tune event loop gets clogged ( #4353 )
...
* add guards
* comments
2019-03-14 19:44:01 -07:00
Eric Liang
d5f4698305
[tune] Avoid scheduler blocking, add reuse_actors optimization ( #4218 )
2019-03-12 23:49:31 -07:00
Kristian Hartikainen
df9beb7123
[tune] Fix trial result fetching ( #4219 )
...
* Fix trial results wait in RayTrialExecutor.get_next_available_trial
* Add comment for the results shuffling
* Remove timeout from the wait
* Change random.sample to random.shuffle
2019-03-04 14:26:10 -08:00
Richard Liaw
3483282254
[tune] Local Mode support ( #4138 )
2019-03-03 14:05:59 -08:00
Richard Liaw
bb7c4ce9c4
[tune] Improve error message when Ray crashes ( #3795 )
2019-02-15 01:04:17 -08:00
Richard Liaw
5db1afef07
[tune] Support Custom Resources ( #2979 )
...
Support arbitrary resource declarations in Tune.
Fixes https://github.com/ray-project/ray/issues/2875
2019-02-07 00:29:19 -08:00
Tianming Xu
1302fafc0b
[Tune] Add export_formats option to export policy graphs ( #3868 )
...
Earlier PRs (#3585 and #3637) introduced export_policy_model and export_policy_checkpoint so that users can export TensorFlow models and checkpoints.
For Ray Tune users, these APIs are not accessible through YAML configurations.
This pull request adds an export_formats option that lets users choose the desired export format.
2019-01-31 17:07:27 -08:00
Richard Liaw
3918934dfd
[tune] Cross-Node Recovery ( #3725 )
...
Augments trial restore to also check if the runner is at the same
location. If not, the checkpoint files are pushed onto the new location.
2019-01-15 10:37:28 -08:00
Robert Nishihara
d1e21b702e
Change timeout from milliseconds to seconds in ray.wait. ( #3706 )
...
* Change timeout from milliseconds to seconds in ray.wait.
* Suppress warning.
* Suppress warning.
* Add prominent warning in API documentation.
2019-01-08 21:32:08 -08:00
Richard Liaw
8934e37a78
[tune] Change log handling for Tune ( #3661 )
...
Also provides a small retry mechanism for a transient error as reported
by #3340.
Closes #3653.
2019-01-06 13:20:10 -08:00
Richard Liaw
960a943503
[tune] Fault Tolerance: handle lost checkpoints by restart ( #3657 )
...
Checks that node failure with lost checkpoints does not crash. Also adds test.
2019-01-04 22:05:27 -08:00
Richard Liaw
aad3c50e2d
[tune] Cluster Fault Tolerance ( #3309 )
...
This PR introduces cluster-level fault tolerance for Tune by checkpointing global state. This occurs with relatively high frequency and allows users to easily resume experiments when the cluster crashes.
Note that this PR may affect automated workflows due to auto-prompting, but this is resolvable.
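
As an illustration of checkpointing global state frequently enough that an experiment can be resumed after a crash, here is a minimal sketch. It is not the code from this PR: the function names, the JSON format, and the file name `experiment_state.json` are all assumptions made for the example. The write goes to a temporary file first and is then renamed, so a crash in the middle of a save leaves the previous checkpoint intact.

```python
import json
import os
import tempfile


def save_runner_state(state, checkpoint_dir, filename="experiment_state.json"):
    """Atomically write `state` (a JSON-serializable dict) and return its path.

    Hypothetical helper; the name and file layout are illustrative only.
    """
    os.makedirs(checkpoint_dir, exist_ok=True)
    # Write to a temp file in the same directory so os.replace stays atomic.
    fd, tmp_path = tempfile.mkstemp(dir=checkpoint_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    final_path = os.path.join(checkpoint_dir, filename)
    os.replace(tmp_path, final_path)
    return final_path


def load_runner_state(checkpoint_dir, filename="experiment_state.json"):
    """Return the last saved state, or None if no checkpoint exists."""
    path = os.path.join(checkpoint_dir, filename)
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```

Because each save replaces the whole file, calling `save_runner_state` after every event-loop step is safe, and resuming reduces to calling `load_runner_state` at startup.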
2018-12-29 11:42:25 +08:00
Richard Liaw
9d0bd50e78
[tune] Component notification on node failure + Tests ( #3414 )
...
Changes include:
- Notify Components on Requeue
- Slight refactoring of Node Failure handling
- Better tests
2018-12-04 14:47:31 -08:00
Richard Liaw
784a6399b0
[tune] Node Fault Tolerance ( #3238 )
...
This PR introduces single-node fault tolerance for Tune.
## Previous behavior:
- Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources.
## New behavior:
- RUNNING trials will be resumed on another node on a best-effort basis (meaning they will run if resources are available).
- If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued.
- During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running.
Remaining questions:
- Should `last_result` be consistent during restore?
Yes, but not for earlier trials (trials that are yet to be checkpointed).
- Waiting for some PRs to merge first (#3239)
Closes #2851.
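
The recovery behavior described above can be sketched roughly as follows. This is a simplified model, not the PR's implementation: `Trial`, `Status`, `recover_trials`, and the CPU-only resource accounting are hypothetical names and assumptions chosen for the example. RUNNING trials on the failed node are moved to a node with enough free CPUs if one exists; otherwise they become PENDING. Every affected trial is recorded so that schedulers and search algorithms could be notified.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Status(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"


@dataclass
class Trial:
    name: str
    cpus: int
    node: Optional[str]
    status: Status = Status.RUNNING


def recover_trials(trials: List[Trial], failed_node: str,
                   free_cpus: Dict[str, int]) -> List[str]:
    """Reassign or requeue RUNNING trials that were on `failed_node`.

    Returns the names of affected trials (a stand-in for notifying
    schedulers and search algorithms).
    """
    notified = []
    for trial in trials:
        if trial.node != failed_node or trial.status is not Status.RUNNING:
            continue
        notified.append(trial.name)
        # Best effort: pick the first surviving node with enough free CPUs.
        target = next(
            (node for node, free in free_cpus.items()
             if node != failed_node and free >= trial.cpus),
            None,
        )
        if target is None:
            # Cluster is saturated: queue the trial instead of restarting it.
            trial.status = Status.PENDING
            trial.node = None
        else:
            free_cpus[target] -= trial.cpus
            trial.node = target
    return notified
```

With two 2-CPU trials on a failed node and only 2 CPUs free elsewhere, the first trial is resumed on the surviving node and the second is queued as PENDING.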
2018-11-21 12:38:16 -08:00
Eric Liang
65c27c70cf
[rllib] Clean up agent resource configurations ( #3296 )
...
Closes #3284
2018-11-13 18:00:03 -08:00
Robert Nishihara
658c14282c
Remove legacy Ray code. ( #3121 )
...
* Remove legacy Ray code.
* Fix cmake and simplify monitor.
* Fix linting
* Updates
* Fix
* Implement some methods.
* Remove more plasma manager references.
* Fix
* Linting
* Fix
* Fix
* Make sure class IDs are strings.
* Some path fixes
* Fix
* Path fixes and update arrow
* Fixes.
* linting
* Fixes
* Java fixes
* Some java fixes
* TaskLanguage -> Language
* Minor
* Fix python test and remove unused method signature.
* Fix java tests
* Fix jenkins tests
* Remove commented out code.
2018-10-26 13:36:58 -07:00
Richard Liaw
0651d3b629
[tune/core] Use Global State API for resources ( #3004 )
2018-10-04 17:23:17 -07:00