Ujval Misra
|
ca651af1d7
|
[tune] Async restores and S3/GCP-capable trial FT (#6376)
* Initial commit for asynchronous save/restore
* Set stage for cloud checkpointable trainable.
* Refactor log_sync and sync_client.
* Add durable trainable impl.
* Support delete in cmd based client
* Fix some tests and such
* Cleanup, comments.
* Use upload_dir instead.
* Revert files belonging to other PR in split.
* Pass upload_dir into trainable init.
* Pickle checkpoint at driver, more robust checkpoint_dir discovery.
* Cleanup trainable helper functions, fix tests.
* Addressed comments.
* Fix bugs from cluster testing, add parameterized cluster tests.
* Add trainable util test
* package_ref
* pbt_address
* Fix bug after running pbt example (_save returning dir).
* get cluster tests running, other bug fixes.
* raise_errors
* Fix deleter bug, add durable trainable example.
* Fix cluster test bugs.
* filelock
* save/restore bug fixes
* .
* Working cluster tests.
* Lint, revert to tracking memory checkpoints.
* Documentation, cleanup
* fixinitialsync
* fix_one_test
* Fix cluster test bug
* nit
* lint
* Revert tune md change
* Fix basename bug for directories.
* lint
* fix_tests
* nit_fix
* Add __init__ file.
* Move to utils package
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
|
2020-01-02 20:40:53 -08:00 |
|