wassname/ray

mirror of https://github.com/wassname/ray.git synced 2026-06-28 16:13:54 +08:00

Author	SHA1	Message	Date
mehrdadn	f93bb008bb	Change os.uname()[1] and socket.gethostname() to the portable and faster platform.node() (#8839 ) Co-authored-by: Mehrdad <noreply@github.com>	2020-06-08 21:29:46 -07:00
Richard Liaw	67c01455fe	[tune] `tune.track` -> `tune.report` (#8388 )	2020-05-16 12:55:08 -07:00
Ujval Misra	708dff6d8f	[tune] Stop-gap fix for PBT checkpointing (#7794 ) * Fix PBT * lint * reset * rm * tests Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2020-04-20 15:10:36 -07:00
Richard Liaw	82b792be33	[tune] IP Check, Flatten Results for TBX (#7705 ) * support_flattened * loggers * Format logger changes Co-authored-by: Kristian Hartikainen <kristian.hartikainen@gmail.com>	2020-03-25 09:18:03 +00:00
Sven Mika	1138f2ebed	[RLlib] Issue 7046 cannot restore keras model from h5 file. (#7482 )	2020-03-23 12:19:30 -07:00
Richard Liaw	81d311031b	[tune] Update API Reference Page (#7671 ) * widerdocs * init * docs * fix * moveit * mix * better_docs * remove * Apply suggestions from code review Co-Authored-By: Sven Mika <sven@anyscale.io> Co-authored-by: Sven Mika <sven@anyscale.io>	2020-03-22 16:42:20 -07:00
Richard Liaw	ea10cd212c	[tune] add accessible trial_info (#7378 ) * add accessible trial_info * trial name and info * doc * fix gp * Update doc/source/tune-package-ref.rst * Apply suggestions from code review * fix * trial * fixtest * testfix	2020-03-17 23:44:18 -07:00
Yuhao Yang	5f36e6eacb	[tune] get checkpoints paths for a trial after tuning (#6643 )	2020-01-17 10:15:04 -08:00
Sven	60d4d5e1aa	Remove future imports (#6724 ) * Remove all __future__ imports from RLlib. * Remove (object) again from tf_run_builder.py::TFRunBuilder. * Fix 2xLINT warnings. * Fix broken appo_policy import (must be appo_tf_policy) * Remove future imports from all other ray files (not just RLlib). * Remove future imports from all other ray files (not just RLlib). * Remove future import blocks that contain `unicode_literals` as well. Revert appo_tf_policy.py to appo_policy.py (belongs to another PR). * Add two empty lines before Schedule class. * Put back __future__ imports into determine_tests_to_run.py. Fails otherwise on a py2/print related error.	2020-01-09 00:15:48 -08:00
Ujval Misra	20ba7ef647	[tune] Move util to utils package (#6682 ) * Move util.py to utils * Fix import	2020-01-06 18:11:02 -08:00
Ujval Misra	5b40408678	[tune] Remove py2.7-specific code (#6665 ) * Remove backwards compatibility py2.7 code. * Use exist_ok=True in ray * nit * nit Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2020-01-03 01:03:13 -08:00
Ujval Misra	ca651af1d7	[tune] Async restores and S3/GCP-capable trial FT (#6376 ) * Initial commit for asynchronous save/restore * Set stage for cloud checkpointable trainable. * Refactor log_sync and sync_client. * Add durable trainable impl. * Support delete in cmd based client * Fix some tests and such * Cleanup, comments. * Use upload_dir instead. * Revert files belonging to other PR in split. * Pass upload_dir into trainable init. * Pickle checkpoint at driver, more robust checkpoint_dir discovery. * Cleanup trainable helper functions, fix tests. * Addressed comments. * Fix bugs from cluster testing, add parameterized cluster tests. * Add trainable util test * package_ref * pbt_address * Fix bug after running pbt example (_save returning dir). * get cluster tests running, other bug fixes. * raise_errors * Fix deleter bug, add durable trainable example. * Fix cluster test bugs. * filelock * save/restore bug fixes * . * Working cluster tests. * Lint, revert to tracking memory checkpoints. * Documentation, cleanup * fixinitialsync * fix_one_test * Fix cluster test bug * nit * lint * Revert tune md change * Fix basename bug for directories. * lint * fix_tests * nit_fix * Add __init__ file. * Move to utils package Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2020-01-02 20:40:53 -08:00
Robert Nishihara	39a3459886	Remove (object) from class declarations. (#6658 )	2020-01-02 17:42:13 -08:00
Sven	f1b56fa5ee	PG unify/cleanup tf vs torch and PG functionality test cases (tf + torch). (#6650 ) * Unifying the code for PGTrainer/Policy wrt tf vs torch. Adding loss function test cases for the PGAgent (confirm equivalence of tf and torch). * Fix LINT line-len errors. * Fix LINT errors. * Fix `tf_pg_policy` imports (formerly: `pg_policy`). * Rename tf_pg_... into pg_tf_... following <alg>_<framework>_... convention, where ...=policy/loss/agent/trainer. Retire `PGAgent` class (use PGTrainer instead). * - Move PG test into agents/pg/tests directory. - All test cases will be located near the classes that are tested and then built into the Bazel/Travis test suite. * Moved post_process_advantages into pg.py (from pg_tf_policy.py), b/c the function is not a tf-specific one. * Fix remaining import errors for agents/pg/... * Fix circular dependency in pg imports. * Add pg tests to Jenkins test suite.	2020-01-02 16:08:03 -08:00
Yuhao Yang	3db8faab0d	[tune] fix log dir race condition (#6420 )	2019-12-10 21:00:19 -08:00
Yuhao Yang	ffa043d4b7	[tune] replace self.config (#6313 )	2019-11-29 11:09:30 -08:00
Ujval Misra	2965dc1b72	[tune] Fault tolerance improvements (#5877 ) * Precede ray.get with ray.wait. * Trigger checkpoint deletes locally in Trainable * Clean-up code. * Minor changes. * Track best checkpoint so far again * Pulled checkpoint GC out of Trainable. * Added comments, error logging. * Immediate pull after checkpoint taken; rsync source delete on pull * Minor doc fixes * Fix checkpoint manager bug * Fix bugs, tests, formatting * Fix bugs, feature flag for force sync. * Fix test. * Fix minor bugs: clear proc and less verbose sync_on_checkpoint warnings. * Fix bug: update IP of last_result. * Fixed message. * Added a lot of logging. * Changes to ray trial executor. * More bug fixes (logging after failure), better logging. * Fix richards bug and logging * Add comments. * try-except * Fix heapq bug. * . * Move handling of no available trials to ray_trial_executor (#1) * Fix formatting bug, lint. * Addressed Richard's comments * Revert tests. * fix rebase * Fix trial location reporting. * Fix test * Fix lint * Rebase, use ray.get w/ timeout, lint. * lint * fix rebase * Address richard's comments	2019-11-18 01:14:41 -08:00
David Bignell	3f83b2daa9	[rllib] Rollout extensions (#6065 ) * Rollout improvements * Make info-saving optional, to avoid breaking change. * Store generating ray version in checkpoint metadata * Keep the linter happy * Add small rollout test * Terse. * Update test_io.py	2019-11-05 20:34:18 -08:00
Vince Jankovics	7e214fd95e	[tune] TensorBoard HParams for TF2.0 (#5678 )	2019-09-21 11:06:34 -07:00
Richard Liaw	34f6d2fc5c	[tune] Update trainable docs and support hparams (#5558 )	2019-09-04 12:44:42 -07:00
Peng Zhenghao	983f3c83d8	[tune] Allow relative local_dir at tune.run() (#4734 )	2019-08-10 16:49:34 -07:00
Richard Liaw	1eaa57c98f	[tune] Distributed example + walkthrough (#5157 )	2019-08-02 09:17:20 -07:00
Richard Liaw	b0c0de49a2	[tune] Fixup exception messages (#5238 )	2019-07-20 22:36:27 -07:00
Richard Liaw	1530389822	[tune] Fast Node Recovery (#5053 )	2019-07-12 13:47:30 -07:00
Richard Liaw	691c9733f9	[tune] Document trainable attributes and enable user-checkpoint… (#4868 )	2019-07-10 18:51:11 -07:00
Richard Liaw	0b540ab492	[tune] Test example checkpointing (#4728 )	2019-07-10 01:58:26 -07:00
Dušan Josipović	e9b88dcbed	[wingman -> tune] Add system performance tracking (#4924 )	2019-07-06 00:57:35 -07:00
Eric Liang	99eae05cf6	[tune] Disallow setting resources_per_trial when it is already configured (#4880 ) * disallow it * import fix * fix example * fix test * fix tests * Update mock.py * fix * make less convoluted * fix tests	2019-06-03 06:47:39 +08:00
Dušan Josipović	820c71b7d0	[tune/rllib] Add checkpoint eraser (#4490 )	2019-04-06 20:01:54 -07:00
Richard Liaw	828dc08ac8	[tune] Fix tests for Function API for better consistency (#4421 )	2019-03-20 22:31:38 -07:00
gehring	7c3274e65b	[tune] Make the logging of the function API consistent and predictable (#4011 ) ## What do these changes do? This is a re-implementation of the `FunctionRunner` which enforces some synchronicity between the thread running the training function and the thread running the Trainable which logs results. The main purpose is to make logging consistent across APIs in anticipation of a new function API which will be generator based (through `yield` statements). Without these changes, it will be impossible for the (possibly soon to be) deprecated reporter based API to behave the same as the generator based API. This new implementation provides additional guarantees to prevent results from being dropped. This makes the logging behavior more intuitive and consistent with how results are handled in custom subclasses of Trainable. New guarantees for the tune function API: - Every reported result, i.e., `reporter(**kwargs)` calls, is forwarded to the appropriate loggers instead of being dropped if not enough time has elapsed since the last results. - The wrapped function only runs if the `FunctionRunner` expects a result, i.e., when `FunctionRunner._train()` has been called. This removes the possibility that a result will be generated by the function but never logged. - The wrapped function is not called until the first `_train()` call. Currently, the wrapped function is started during the setup phase which could result in dropped results if the trial is cancelled between `_setup()` and the first `_train()` call. - Exceptions raised by the wrapped function won't be propagated until all results are logged to prevent dropped results. - The thread running the wrapped function is explicitly stopped when the `FunctionRunner` is stopped with `_stop()`. 
- If the wrapped function terminates without reporting `done=True`, a duplicate result with `{"done": True}` is reported to explicitly terminate the trial; components will be notified with a duplicate of the last reported result, but this duplicate will not be logged. ## Related issue number Closes #3956. #3949 #3834	2019-03-18 19:14:26 -07:00
Eric Liang	d5f4698305	[tune] Avoid scheduler blocking, add reuse_actors optimization (#4218 )	2019-03-12 23:49:31 -07:00
Adi Zimmerman	9551f2a92e	[tune] Properly handle closing files in Trainable (#4232 ) Fixes #3965. Using the with keyword/block will close the file immediately after the block ends	2019-03-03 14:23:05 -08:00
Eric Liang	6e46d75554	[tune] Remove slow gzip of checkpoints; ignore jupyter stop errors (#4076 ) * fix gzip * ignore jupyter	2019-02-18 01:30:13 -08:00
Andrew Tan	57dcd3033e	[tune] Trial reporter fix (#3951 ) Fixes #3949.	2019-02-13 01:03:54 -08:00
Eric Liang	0f81bc9a33	[rllib] on_train_result results do not get logged (#3865 )	2019-02-01 20:32:07 -08:00
Tianming Xu	1302fafc0b	[Tune] Add export_formats option to export policy graphs (#3868 ) In earlier PRs, PR#3585 and PR#3637, export_policy_model and export_policy_checkpoint were introduced for users to export TensorFlow model and checkpoint. For Ray Tune users, these APIs are not accessible through YAML configurations. In this pull request, export_formats option is provided to enable users to choose the desired export format.	2019-01-31 17:07:27 -08:00
Richard Liaw	3918934dfd	[tune] Cross-Node Recovery (#3725 ) Augments trial restore to also check if the runner is at the same location. If not, the checkpoint files are pushed onto the new location.	2019-01-15 10:37:28 -08:00
Richard Liaw	aad3c50e2d	[tune] Cluster Fault Tolerance (#3309 ) This PR introduces cluster-level fault tolerance for Tune by checkpointing global state. This occurs with relatively high frequency and allows users to easily resume experiments when the cluster crashes. Note that this PR may affect automated workflows due to auto-prompting, but this is resolvable.	2018-12-29 11:42:25 +08:00
Richard Liaw	e37891d79d	[tune] Fix default handling for timesteps (#3293 ) This PR fixes an issue where previously if timesteps_this_iter = 0, then it would render as "None". Closes #3057.	2018-11-12 15:52:17 -08:00
Eric Liang	813f51769f	[rllib] Fix rllib rollouts script and add test (#3211 ) ## What do these changes do? Clean up the checkpointing to handle the new checkpoint dirs. Add a test for rollout.py ## Related issue number https://github.com/ray-project/ray/issues/3206 https://github.com/ray-project/ray/issues/3204	2018-11-05 00:33:25 -08:00
Richard Liaw	f9b58d7b02	[tune] Tweaks to Trainable and Verbosity (#2889 )	2018-10-11 23:42:13 -07:00
Eric Liang	65dcafdc3f	[rllib] Refactor save() / restore() code of agents and avoid O(n_workers) save size (#2982 )	2018-09-30 01:15:13 -07:00
Richard Liaw	f372f48bf3	[tune] Tune onto Logging Module (#2882 ) Moves Tune onto logging in Python. Ignores examples and tests.	2018-09-16 12:09:36 -07:00
Kaahan	045861c9b0	[tune] Reset Config for Trainables (#2831 ) Adds the ability for trainables to reset their configurations during experiments. These changes in particular add the base functions to the trial_executor and trainable interfaces as well as giving the basic implementation on the PopulationBasedTraining scheduler. Related issue number: #2741	2018-09-11 08:45:04 -07:00
Eric Liang	611259b2c7	Re-raise actor initialization errors on method invocation (#2843 ) If an actor constructor fails, save that error and re-raise it on any subsequent attempts to interact with the actor. Related to https://github.com/ray-project/ray/issues/282 and https://github.com/ray-project/ray/issues/1093.	2018-09-10 10:51:19 -07:00
Eric Liang	d81605e9e7	[tune] Add a time/timesteps since last restore metric (#2819 ) * rsm * always log to avoid changing schema for csv writer * add iter since restore * update * criteria warn	2018-09-05 17:45:09 -07:00
Richard Liaw	0347e6418b	[tune] Add PyTorch MNIST Example + Misc. Tweaks (#2708 )	2018-08-30 16:18:56 -07:00
Richard Liaw	62d0698097	[tune] Tune Facelift (#2472 ) This PR introduces the following changes: * Ray Tune -> Tune * [breaking] Creation of `schedulers/`, moving PBT, HyperBand into a submodule * [breaking] Search Algorithms now must take in experiment configurations via `add_configurations` rather than through initialization * Support `"run": (function | class | str)` with automatic registering of trainable * Documentation Changes	2018-08-19 11:00:55 -07:00
Richard Liaw	bb44456f6f	[rllib, tune] TrainingResult -> Dict, Removes C408 from flake8 (#2565 )	2018-08-07 12:17:44 -07:00
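
Commit 611259b2c7 above describes saving an actor constructor failure and re-raising it on any later attempt to interact with the actor. The following is a minimal plain-Python sketch of that pattern, not Ray's implementation: it uses no Ray APIs, and the names `DeferredInitProxy`, `Broken`, and `Working` are hypothetical and exist only for illustration.

```python
class DeferredInitProxy:
    """Hypothetical wrapper: stores a constructor error and re-raises it on use."""

    def __init__(self, cls, *args, **kwargs):
        self._instance = None
        self._init_error = None
        try:
            self._instance = cls(*args, **kwargs)
        except Exception as exc:
            # Keep the failure instead of raising at construction time.
            self._init_error = exc

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails, so it is reached
        # for attributes that belong to the wrapped instance.
        if name.startswith("_"):
            raise AttributeError(name)
        if self._init_error is not None:
            # Every later interaction surfaces the original constructor error.
            raise self._init_error
        return getattr(self._instance, name)


class Broken:
    """Hypothetical class whose constructor always fails."""

    def __init__(self):
        raise ValueError("constructor failed")

    def ping(self):
        return "pong"


class Working:
    """Hypothetical class whose constructor succeeds."""

    def ping(self):
        return "pong"


if __name__ == "__main__":
    proxy = DeferredInitProxy(Broken)  # does not raise here
    try:
        proxy.ping()
    except ValueError as exc:
        print("re-raised on use:", exc)
```

The point of the design is that the caller sees the root cause at the first method call rather than an unrelated failure later; the sketch keeps the original exception object so its message is preserved.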


62 Commits