Richard Liaw
1ce5e0e99f
[tune] Fix file descriptor leak by syncer (#12590)
2020-12-03 13:39:04 -08:00
Richard Liaw
48042be8bb
[tune] Avoid dependency on Kubernetes (#12188)
...
* fix-kubernetes
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
* kub
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2020-11-20 13:01:20 -08:00
Kai Fricke
f1ace386db
[tune] detect docker and kubernetes syncers (#12108)
...
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-11-19 12:17:17 -08:00
Kai Fricke
007634fd1b
[tune] logger refactor part 2: Add SyncerCallback (#11748)
...
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-11-03 21:04:40 -08:00
Frank Gu
73fa94731f
[tune] Add HDFS as Cloud Sync Client (#11524)
2020-10-22 14:12:51 -07:00
Richard Liaw
551c597312
[tune] API revamp fix (#10518)
2020-09-05 15:34:53 -07:00
krfricke
8f0f7371a0
[tune] Added Kubernetes syncer and sync client (#10097)
...
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2020-08-16 14:09:28 -07:00
Richard Liaw
7a8b922841
[tune] hotfix log_once (#10069)
2020-08-12 12:40:22 -07:00
Richard Liaw
98df612010
[tune] option to raise on error (#10030)
2020-08-11 09:59:04 -07:00
Amog Kamsetty
9410e5884d
[Tune] Parametrize Cloud Syncing Frequency (#8771)
2020-06-04 18:55:50 -07:00
Sven
60d4d5e1aa
Remove future imports (#6724)
...
* Remove all __future__ imports from RLlib.
* Remove (object) again from tf_run_builder.py::TFRunBuilder.
* Fix 2xLINT warnings.
* Fix broken appo_policy import (must be appo_tf_policy)
* Remove future imports from all other ray files (not just RLlib).
* Remove future import blocks that contain `unicode_literals` as well.
Revert appo_tf_policy.py to appo_policy.py (belongs to another PR).
* Add two empty lines before Schedule class.
* Put back __future__ imports into determine_tests_to_run.py. Fails otherwise on a py2/print related error.
2020-01-09 00:15:48 -08:00
Ujval Misra
5b40408678
[tune] Remove py2.7-specific code (#6665)
...
* Remove backwards compatibility py2.7 code.
* Use exist_ok=True in ray
* nit
* nit
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-01-03 01:03:13 -08:00
Ujval Misra
ca651af1d7
[tune] Async restores and S3/GCP-capable trial FT (#6376)
...
* Initial commit for asynchronous save/restore
* Set stage for cloud checkpointable trainable.
* Refactor log_sync and sync_client.
* Add durable trainable impl.
* Support delete in cmd based client
* Fix some tests and such
* Cleanup, comments.
* Use upload_dir instead.
* Revert files belonging to other PR in split.
* Pass upload_dir into trainable init.
* Pickle checkpoint at driver, more robust checkpoint_dir discovery.
* Cleanup trainable helper functions, fix tests.
* Addressed comments.
* Fix bugs from cluster testing, add parameterized cluster tests.
* Add trainable util test
* package_ref
* pbt_address
* Fix bug after running pbt example (_save returning dir).
* get cluster tests running, other bug fixes.
* raise_errors
* Fix deleter bug, add durable trainable example.
* Fix cluster test bugs.
* filelock
* save/restore bug fixes
* .
* Working cluster tests.
* Lint, revert to tracking memory checkpoints.
* Documentation, cleanup
* fixinitialsync
* fix_one_test
* Fix cluster test bug
* nit
* lint
* Revert tune md change
* Fix basename bug for directories.
* lint
* fix_tests
* nit_fix
* Add __init__ file.
* Move to utils package
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-01-02 20:40:53 -08:00
Robert Nishihara
39a3459886
Remove (object) from class declarations. (#6658)
2020-01-02 17:42:13 -08:00
Ujval Misra
81197e47c7
[tune] Refactor syncer (#6496)
...
* Refactor syncer and log_sync.
* Fix documentation.
* Remove delete from api
* Rename to get_node_syncer
2019-12-17 05:25:16 -08:00
Ujval Misra
2965dc1b72
[tune] Fault tolerance improvements (#5877)
...
* Precede ray.get with ray.wait.
* Trigger checkpoint deletes locally in Trainable
* Clean-up code.
* Minor changes.
* Track best checkpoint so far again
* Pulled checkpoint GC out of Trainable.
* Added comments, error logging.
* Immediate pull after checkpoint taken; rsync source delete on pull
* Minor doc fixes
* Fix checkpoint manager bug
* Fix bugs, tests, formatting
* Fix bugs, feature flag for force sync.
* Fix test.
* Fix minor bugs: clear proc and less verbose sync_on_checkpoint warnings.
* Fix bug: update IP of last_result.
* Fixed message.
* Added a lot of logging.
* Changes to ray trial executor.
* More bug fixes (logging after failure), better logging.
* Fix Richard's bug and logging
* Add comments.
* try-except
* Fix heapq bug.
* .
* Move handling of no available trials to ray_trial_executor (#1)
* Fix formatting bug, lint.
* Addressed Richard's comments
* Revert tests.
* fix rebase
* Fix trial location reporting.
* Fix test
* Fix lint
* Rebase, use ray.get w/ timeout, lint.
* lint
* fix rebase
* Address Richard's comments
2019-11-18 01:14:41 -08:00
Eric Liang
daf38c8723
[tune] Deprecate tune.function (#5601)
...
* remove tune function
* remove examples
* Update tune-usage.rst
2019-08-31 16:00:10 -07:00
Kristian Hartikainen
9e0192bc0b
[tune] Change the log syncing behavior (#4450)
...
* Change the log syncing behavior
* fix up abstractions for syncer
* Finished checkpoint syncing
* Code
* Set of changes to get things running
* Fixes for log syncing
* Fix parts
* Lint and other fixes
* fix some test
* Remove extra parsing functionality
* some test fixes
* Fix up cloud syncing
* Another thing to do
* Fix up tests and local sync
Changes LogSync into a mixin, and adds tests for different
functionalities.
* Fix up tests, start on local migration
* fix distributed migrations
* comments
* formatting
* Better checkpoint directory handling
* fix tests
* fix tests
* fix click
* comments
* formatting comments
* formatting and comments
* sync function deprecations
* syncfunction
* Add documentation for Syncing and Uploading
* nit
* BaseSyncer as base for Mixin in edge case
* more docs
* clean up assertions
* validate
* nit
* Update test_cluster.py
* betterdoc
* Update tune-usage.rst
* cleanup
* nit
2019-07-02 20:46:00 -07:00