18 Commits

Author SHA1 Message Date
Richard Liaw 1ce5e0e99f [tune] Fix file descriptor leak by syncer (#12590) 2020-12-03 13:39:04 -08:00
Richard Liaw 48042be8bb [tune] Avoid dependency on Kubernetes (#12188)
* fix-kubernetes

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

* kub

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2020-11-20 13:01:20 -08:00
Kai Fricke f1ace386db [tune] detect docker and kubernetes syncers (#12108)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-11-19 12:17:17 -08:00
Kai Fricke 007634fd1b [tune] logger refactor part 2: Add SyncerCallback (#11748)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-11-03 21:04:40 -08:00
Frank Gu 73fa94731f [tune] Add HDFS as Cloud Sync Client (#11524) 2020-10-22 14:12:51 -07:00
Richard Liaw 551c597312 [tune] API revamp fix (#10518) 2020-09-05 15:34:53 -07:00
krfricke 8f0f7371a0 [tune] Added Kubernetes syncer and sync client (#10097)
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2020-08-16 14:09:28 -07:00
Richard Liaw 7a8b922841 [tune] hotfix log_once (#10069) 2020-08-12 12:40:22 -07:00
Richard Liaw 98df612010 [tune] option to raise on error (#10030) 2020-08-11 09:59:04 -07:00
Amog Kamsetty 9410e5884d [Tune] Parametrize Cloud Syncing Frequency (#8771) 2020-06-04 18:55:50 -07:00
Sven 60d4d5e1aa Remove future imports (#6724)
* Remove all __future__ imports from RLlib.

* Remove (object) again from tf_run_builder.py::TFRunBuilder.

* Fix 2xLINT warnings.

* Fix broken appo_policy import (must be appo_tf_policy)

* Remove future imports from all other ray files (not just RLlib).

* Remove future import blocks that contain `unicode_literals` as well.
Revert appo_tf_policy.py to appo_policy.py (belongs to another PR).

* Add two empty lines before Schedule class.

* Put back __future__ imports into determine_tests_to_run.py. Fails otherwise on a py2/print related error.
2020-01-09 00:15:48 -08:00
Ujval Misra 5b40408678 [tune] Remove py2.7-specific code (#6665)
* Remove backwards compatibility py2.7 code.

* Use exist_ok=True in ray

* nit

* nit

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-01-03 01:03:13 -08:00
Ujval Misra ca651af1d7 [tune] Async restores and S3/GCP-capable trial FT (#6376)
* Initial commit for asynchronous save/restore

* Set stage for cloud checkpointable trainable.

* Refactor log_sync and sync_client.

* Add durable trainable impl.

* Support delete in cmd based client

* Fix some tests and such

* Cleanup, comments.

* Use upload_dir instead.

* Revert files belonging to other PR in split.

* Pass upload_dir into trainable init.

* Pickle checkpoint at driver, more robust checkpoint_dir discovery.

* Cleanup trainable helper functions, fix tests.

* Addressed comments.

* Fix bugs from cluster testing, add parameterized cluster tests.

* Add trainable util test

* package_ref

* pbt_address

* Fix bug after running pbt example (_save returning dir).

* get cluster tests running, other bug fixes.

* raise_errors

* Fix deleter bug, add durable trainable example.

* Fix cluster test bugs.

* filelock

* save/restore bug fixes

* .

* Working cluster tests.

* Lint, revert to tracking memory checkpoints.

* Documentation, cleanup

* fixinitialsync

* fix_one_test

* Fix cluster test bug

* nit

* lint

* Revert tune md change

* Fix basename bug for directories.

* lint

* fix_tests

* nit_fix

* Add __init__ file.

* Move to utils package

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-01-02 20:40:53 -08:00
Robert Nishihara 39a3459886 Remove (object) from class declarations. (#6658) 2020-01-02 17:42:13 -08:00
Ujval Misra 81197e47c7 [tune] Refactor syncer (#6496)
* Refactor syncer and log_sync.

* Fix documentation.

* Remove delete from api

* Rename to get_node_syncer
2019-12-17 05:25:16 -08:00
Ujval Misra 2965dc1b72 [tune] Fault tolerance improvements (#5877)
* Precede ray.get with ray.wait.

* Trigger checkpoint deletes locally in Trainable

* Clean-up code.

* Minor changes.

* Track best checkpoint so far again

* Pulled checkpoint GC out of Trainable.

* Added comments, error logging.

* Immediate pull after checkpoint taken; rsync source delete on pull

* Minor doc fixes

* Fix checkpoint manager bug

* Fix bugs, tests, formatting

* Fix bugs, feature flag for force sync.

* Fix test.

* Fix minor bugs: clear proc and less verbose sync_on_checkpoint warnings.

* Fix bug: update IP of last_result.

* Fixed message.

* Added a lot of logging.

* Changes to ray trial executor.

* More bug fixes (logging after failure), better logging.

* Fix Richard's bug and logging

* Add comments.

* try-except

* Fix heapq bug.

* .

* Move handling of no available trials to ray_trial_executor (#1)

* Fix formatting bug, lint.

* Addressed Richard's comments

* Revert tests.

* fix rebase

* Fix trial location reporting.

* Fix test

* Fix lint

* Rebase, use ray.get w/ timeout, lint.

* lint

* fix rebase

* Address Richard's comments
2019-11-18 01:14:41 -08:00
Eric Liang daf38c8723 [tune] Deprecate tune.function (#5601)
* remove tune function

* remove examples

* Update tune-usage.rst
2019-08-31 16:00:10 -07:00
Kristian Hartikainen 9e0192bc0b [tune] Change the log syncing behavior (#4450)
* Change the log syncing behavior

* fix up abstractions for syncer

* Finished checkpoint syncing

* Code

* Set of changes to get things running

* Fixes for log syncing

* Fix parts

* Lint and other fixes

* fix some test

* Remove extra parsing functionality

* some test fixes

* Fix up cloud syncing

* Another thing to do

* Fix up tests and local sync

Changes LogSync into a mixin, and adds tests for different
functionalities.

* Fix up tests, start on local migration

* fix distributed migrations

* comments

* formatting

* Better checkpoint directory handling

* fix tests

* fix tests

* fix click

* comments

* formatting comments

* formatting and comments

* sync function deprecations

* syncfunction

* Add documentation for Syncing and Uploading

* nit

* BaseSyncer as base for Mixin in edge case

* more docs

* clean up assertions

* validate

* nit

* Update test_cluster.py

* betterdoc

* Update tune-usage.rst

* cleanup

* nit
2019-07-02 20:46:00 -07:00