Commit Graph

62 Commits

Author SHA1 Message Date
mehrdadn f93bb008bb Change os.uname()[1] and socket.gethostname() to the portable and faster platform.node_ip() (#8839)
Co-authored-by: Mehrdad <noreply@github.com>
2020-06-08 21:29:46 -07:00
Richard Liaw 67c01455fe [tune] tune.track -> tune.report (#8388) 2020-05-16 12:55:08 -07:00
Ujval Misra 708dff6d8f [tune] Stop-gap fix for PBT checkpointing (#7794)
* Fix PBT

* lint

* reset

* rm

* tests

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-04-20 15:10:36 -07:00
Richard Liaw 82b792be33 [tune] IP Check, Flatten Results for TBX (#7705)
* support_flattened

* loggers

* Format logger changes

Co-authored-by: Kristian Hartikainen <kristian.hartikainen@gmail.com>
2020-03-25 09:18:03 +00:00
Sven Mika 1138f2ebed [RLlib] Issue 7046 cannot restore keras model from h5 file. (#7482) 2020-03-23 12:19:30 -07:00
Richard Liaw 81d311031b [tune] Update API Reference Page (#7671)
* widerdocs

* init

* docs

* fix

* moveit

* mix

* better_docs

* remove

* Apply suggestions from code review

Co-Authored-By: Sven Mika <sven@anyscale.io>

Co-authored-by: Sven Mika <sven@anyscale.io>
2020-03-22 16:42:20 -07:00
Richard Liaw ea10cd212c [tune] add accessible trial_info (#7378)
* add accessible trial_info

* trial name and info

* doc

* fix
gp

* Update doc/source/tune-package-ref.rst

* Apply suggestions from code review

* fix

* trial

* fixtest

* testfix
2020-03-17 23:44:18 -07:00
Yuhao Yang 5f36e6eacb [tune] get checkpoints paths for a trial after tuning (#6643) 2020-01-17 10:15:04 -08:00
Sven 60d4d5e1aa Remove future imports (#6724)
* Remove all __future__ imports from RLlib.

* Remove (object) again from tf_run_builder.py::TFRunBuilder.

* Fix 2xLINT warnings.

* Fix broken appo_policy import (must be appo_tf_policy)

* Remove future imports from all other ray files (not just RLlib).

* Remove future imports from all other ray files (not just RLlib).

* Remove future import blocks that contain `unicode_literals` as well.
Revert appo_tf_policy.py to appo_policy.py (belongs to another PR).

* Add two empty lines before Schedule class.

* Put back __future__ imports into determine_tests_to_run.py. Fails otherwise on a py2/print related error.
2020-01-09 00:15:48 -08:00
Ujval Misra 20ba7ef647 [tune] Move util to utils package (#6682)
* Move util.py to utils

* Fix import
2020-01-06 18:11:02 -08:00
Ujval Misra 5b40408678 [tune] Remove py2.7-specific code (#6665)
* Remove backwards compatability py2.7 code.

* Use exists_ok=True in ray

* nit

* nit

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-01-03 01:03:13 -08:00
Ujval Misra ca651af1d7 [tune] Async restores and S3/GCP-capable trial FT (#6376)
* Initial commit for asynchronous save/restore

* Set stage for cloud checkpointable trainable.

* Refactor log_sync and sync_client.

* Add durable trainable impl.

* Support delete in cmd based client

* Fix some tests and such

* Cleanup, comments.

* Use upload_dir instead.

* Revert files belonging to other PR in split.

* Pass upload_dir into trainable init.

* Pickle checkpoint at driver, more robust checkpoint_dir discovery.

* Cleanup trainable helper functions, fix tests.

* Addressed comments.

* Fix bugs from cluster testing, add parameterized cluster tests.

* Add trainable util test

* package_ref

* pbt_address

* Fix bug after running pbt example (_save returning dir).

* get cluster tests running, other bug fixes.

* raise_errors

* Fix deleter bug, add durable trainable example.

* Fix cluster test bugs.

* filelock

* save/restore bug fixes

* .

* Working cluster tests.

* Lint, revert to tracking memory checkpoints.

* Documentation, cleanup

* fixinitialsync

* fix_one_test

* Fix cluster test bug

* nit

* lint

* Revert tune md change

* Fix basename bug for directories.

* lint

* fix_tests

* nit_fix

* Add __init__ file.

* Move to utils package

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-01-02 20:40:53 -08:00
Robert Nishihara 39a3459886 Remove (object) from class declarations. (#6658) 2020-01-02 17:42:13 -08:00
Sven f1b56fa5ee PG unify/cleanup tf vs torch and PG functionality test cases (tf + torch). (#6650)
* Unifying the code for PGTrainer/Policy wrt tf vs torch.
Adding loss function test cases for the PGAgent (confirm equivalence of tf and torch).

* Fix LINT line-len errors.

* Fix LINT errors.

* Fix `tf_pg_policy` imports (formerly: `pg_policy`).

* Rename tf_pg_... into pg_tf_... following <alg>_<framework>_... convention, where ...=policy/loss/agent/trainer.
Retire `PGAgent` class (use PGTrainer instead).

* - Move PG test into agents/pg/tests directory.
- All test cases will be located near the classes that are tested and
  then built into the Bazel/Travis test suite.

* Moved post_process_advantages into pg.py (from pg_tf_policy.py), b/c
the function is not a tf-specific one.

* Fix remaining import errors for agents/pg/...

* Fix circular dependency in pg imports.

* Add pg tests to Jenkins test suite.
2020-01-02 16:08:03 -08:00
Yuhao Yang 3db8faab0d [tune] fix log dir race condition (#6420) 2019-12-10 21:00:19 -08:00
Yuhao Yang ffa043d4b7 [tune] replace self.config (#6313) 2019-11-29 11:09:30 -08:00
Ujval Misra 2965dc1b72 [tune] Fault tolerance improvements (#5877)
* Precede ray.get with ray.wait.

* Trigger checkpoint deletes locally in Trainable

* Clean-up code.

* Minor changes.

* Track best checkpoint so far again

* Pulled checkpoint GC out of Trainable.

* Added comments, error logging.

* Immediate pull after checkpoint taken; rsync source delete on pull

* Minor doc fixes

* Fix checkpoint manager bug

* Fix bugs, tests, formatting

* Fix bugs, feature flag for force sync.

* Fix test.

* Fix minor bugs: clear proc and less verbose sync_on_checkpoint warnings.

* Fix bug: update IP of last_result.

* Fixed message.

* Added a lot of logging.

* Changes to ray trial executor.

* More bug fixes (logging after failure), better logging.

* Fix richards bug and logging

* Add comments.

* try-except

* Fix heapq bug.

* .

* Move handling of no available trials to ray_trial_executor (#1)

* Fix formatting bug, lint.

* Addressed Richard's comments

* Revert tests.

* fix rebase

* Fix trial location reporting.

* Fix test

* Fix lint

* Rebase, use ray.get w/ timeout, lint.

* lint

* fix rebase

* Address richard's comments
2019-11-18 01:14:41 -08:00
David Bignell 3f83b2daa9 [rllib] Rollout extensions (#6065)
* Rollout improvements

* Make info-saving optional, to avoid breaking change.

* Store generating ray version in checkpoint metadata

* Keep the linter happy

* Add small rollout test

* Terse.

* Update test_io.py
2019-11-05 20:34:18 -08:00
Vince Jankovics 7e214fd95e [tune] TensorBoard HParams for TF2.0 (#5678) 2019-09-21 11:06:34 -07:00
Richard Liaw 34f6d2fc5c [tune] Update trainable docs and support hparams (#5558) 2019-09-04 12:44:42 -07:00
Peng Zhenghao 983f3c83d8 [tune] Allow relative local_dir at tune.run() (#4734) 2019-08-10 16:49:34 -07:00
Richard Liaw 1eaa57c98f [tune] Distributed example + walkthrough (#5157) 2019-08-02 09:17:20 -07:00
Richard Liaw b0c0de49a2 [tune] Fixup exception messages (#5238) 2019-07-20 22:36:27 -07:00
Richard Liaw 1530389822 [tune] Fast Node Recovery (#5053) 2019-07-12 13:47:30 -07:00
Richard Liaw 691c9733f9 [tune] Document trainable attributes and enable user-checkpoint… (#4868) 2019-07-10 18:51:11 -07:00
Richard Liaw 0b540ab492 [tune] Test example checkpointing (#4728) 2019-07-10 01:58:26 -07:00
Dušan Josipović e9b88dcbed [wingman -> tune] Add system performance tracking (#4924) 2019-07-06 00:57:35 -07:00
Eric Liang 99eae05cf6 [tune] Disallow setting resources_per_trial when it is already configured (#4880)
* disallow it

* import fix

* fix example

* fix test

* fix tests

* Update mock.py

* fix

* make less convoluted

* fix tests
2019-06-03 06:47:39 +08:00
Dušan Josipović 820c71b7d0 [tune/rllib] Add checkpoint eraser (#4490) 2019-04-06 20:01:54 -07:00
Richard Liaw 828dc08ac8 [tune] Fix tests for Function API for better consistency (#4421) 2019-03-20 22:31:38 -07:00
gehring 7c3274e65b [tune] Make the logging of the function API consistent and predictable (#4011)
## What do these changes do?

This is a re-implementation of the `FunctionRunner` which enforces some synchronicity between the thread running the training function and the thread running the Trainable which logs results. The main purpose is to make logging consistent across APIs in anticipation of a new function API which will be generator based (through `yield` statements). Without these changes, it will be impossible for the (possibly soon to be) deprecated reporter based API to behave the same as the generator based API.

This new implementation provides additional guarantees to prevent results from being dropped. This makes the logging behavior more intuitive and consistent with how results are handled in custom subclasses of Trainable.

New guarantees for the tune function API:

- Every reported result, i.e., `reporter(**kwargs)` calls, is forwarded to the appropriate loggers instead of being dropped if not enough time has elapsed since the last results.
- The wrapped function only runs if the `FunctionRunner` expects a result, i.e., when `FunctionRunner._train()` has been called. This removes the possibility that a result will be generated by the function but never logged.
- The wrapped function is not called until the first `_train()` call. Currently, the wrapped function is started during the setup phase which could result in dropped results if the trial is cancelled between `_setup()` and the first `_train()` call.
- Exceptions raised by the wrapped function won't be propagated until all results are logged to prevent dropped results.
- The thread running the wrapped function is explicitly stopped when the `FunctionRunner` is stopped with `_stop()`.
- If the wrapped function terminates without reporting `done=True`, a duplicate result with `{"done": True}`, is reported to explicitly terminate the trial, and components will be notified with a duplicate of the last reported result, but this duplicate will not be logged.

## Related issue number

Closes #3956.
#3949
#3834
2019-03-18 19:14:26 -07:00
Eric Liang d5f4698305 [tune] Avoid scheduler blocking, add reuse_actors optimization (#4218) 2019-03-12 23:49:31 -07:00
Adi Zimmerman 9551f2a92e [tune] Properly handle closing files in Trainable (#4232)
Fixes #3965.

Using the with keyword/block will close to file immediately after the block ends
2019-03-03 14:23:05 -08:00
Eric Liang 6e46d75554 [tune] Remove slow gzip of checkpoints; ignore jupyter stop errors (#4076)
* fix gzip

* ignore jupyter
2019-02-18 01:30:13 -08:00
Andrew Tan 57dcd3033e [tune] Trial reporter fix (#3951)
Fixes #3949.
2019-02-13 01:03:54 -08:00
Eric Liang 0f81bc9a33 [rllib] on_train_result results do not get logged (#3865) 2019-02-01 20:32:07 -08:00
Tianming Xu 1302fafc0b [Tune] Add export_formats option to export policy graphs (#3868)
In earlier PRs, PR#3585 and PR#3637, export_policy_model and export_policy_checkpoint were introduced for users to export TensorFlow model and checkpoint.

For Ray Tune users, these APIs are not accessible through YAML configurations.

In this pull request, export_formats option is provided to enable users to choose the desired export format.
2019-01-31 17:07:27 -08:00
Richard Liaw 3918934dfd [tune] Cross-Node Recovery (#3725)
Augments trial restore to also check if the runner is at the same
location. If not, the checkpoint files are pushed onto the new location.
2019-01-15 10:37:28 -08:00
Richard Liaw aad3c50e2d [tune] Cluster Fault Tolerance (#3309)
This PR introduces cluster-level fault tolerance for Tune by checkpointing global state. This occurs with relatively high frequency and allows users to easily resume experiments when the cluster crashes.

Note that this PR may affect automated workflows due to auto-prompting, but this is resolvable.
2018-12-29 11:42:25 +08:00
Richard Liaw e37891d79d [tune] Fix default handling for timesteps (#3293)
This PR fixes an issue where previously if timesteps_this_iter = 0,
then it would render as "None".

Closes #3057.
2018-11-12 15:52:17 -08:00
Eric Liang 813f51769f [rllib] Fix rllib rollouts script and add test (#3211)
## What do these changes do?

Clean up the checkpointing to handle the new checkpoint dirs. Add a test for rollout.py

## Related issue number

https://github.com/ray-project/ray/issues/3206
https://github.com/ray-project/ray/issues/3204
2018-11-05 00:33:25 -08:00
Richard Liaw f9b58d7b02 [tune] Tweaks to Trainable and Verbosity (#2889) 2018-10-11 23:42:13 -07:00
Eric Liang 65dcafdc3f [rllib] Refactor save() / restore() code of agents and avoid O(n_workers) save size (#2982) 2018-09-30 01:15:13 -07:00
Richard Liaw f372f48bf3 [tune] Tune onto Logging Module (#2882)
Moves Tune onto logging in Python. Ignores examples and tests.
2018-09-16 12:09:36 -07:00
Kaahan 045861c9b0 [tune] Reset Config for Trainables (#2831)
Adds the ability for trainables to reset their configurations during experiments. These changes in particular add the base functions to the trial_executor and trainable interfaces as well as giving the basic implementation on the PopulationBasedTraining scheduler.

Related issue number: #2741
2018-09-11 08:45:04 -07:00
Eric Liang 611259b2c7 Re-raise actor initialization errors on method invocation (#2843)
If an actor constructor fails, save that error and re-raise it on any subsequent attempts to interact with the actor. Related to https://github.com/ray-project/ray/issues/282 and https://github.com/ray-project/ray/issues/1093.
2018-09-10 10:51:19 -07:00
Eric Liang d81605e9e7 [tune] Add a time/timesteps since last restore metric (#2819)
* rsm

* always log to avoid changing schema for csv writer

* add iter since restore

* update

* criteria warn
2018-09-05 17:45:09 -07:00
Richard Liaw 0347e6418b [tune] Add PyTorch MNIST Example + Misc. Tweaks (#2708) 2018-08-30 16:18:56 -07:00
Richard Liaw 62d0698097 [tune] Tune Facelift (#2472)
This PR introduces the following changes:

 * Ray Tune -> Tune 
 * [breaking] Creation of `schedulers/`, moving PBT, HyperBand into a submodule
 * [breaking] Search Algorithms now must take in experiment configurations via `add_configurations` rather through initialization
 * Support `"run": (function | class | str)` with automatic registering of trainable
 * Documentation Changes
2018-08-19 11:00:55 -07:00
Richard Liaw bb44456f6f [rllib, tune] TrainingResult -> Dict, Removes C408 from flake8 (#2565) 2018-08-07 12:17:44 -07:00