wassname/ray - ray

mirror of https://github.com/wassname/ray.git synced 2026-07-01 08:53:44 +08:00

Author	SHA1	Message	Date
Richard Liaw	784a6399b0	[tune] Node Fault Tolerance (#3238) This PR introduces single-node fault tolerance for Tune. ## Previous behavior: - Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources. ## New behavior: - RUNNING trials will be resumed on another node on a best-effort basis (meaning they will run if resources are available). - If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued. - During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running. Remaining questions: - Should `last_result` be consistent during restore? Yes; but not for earlier trials (trials that are yet to be checkpointed). - Waiting for some PRs to merge first (#3239) Closes #2851.	2018-11-21 12:38:16 -08:00
Eric Liang	65c27c70cf	[rllib] Clean up agent resource configurations (#3296) Closes #3284	2018-11-13 18:00:03 -08:00
Richard Liaw	c3a2c7ebed	[tune] Doc: Autofilled, StatusReporter (#3294) * autofill and revise doc page for things * lint * comments	2018-11-13 13:15:56 -08:00
Richard Liaw	e37891d79d	[tune] Fix default handling for timesteps (#3293) This PR fixes an issue where previously if timesteps_this_iter = 0, then it would render as "None". Closes #3057.	2018-11-12 15:52:17 -08:00
Eric Liang	463511f8a6	[tune] Track and warn on low memory (#3298)	2018-11-11 00:29:45 -08:00
Eric Liang	53489d2f85	[sgd] Document and add simple MNIST example (#3236)	2018-11-10 21:52:20 -08:00
Richard Liaw	29c182d449	[tune] Support "None" for upload_dir	2018-11-09 22:02:08 -08:00
Richard Liaw	22113be04c	[tune] Annotated Example Page and showcase Tutorials (#3267) Adds an example page and link in codebase. Closes #2728.	2018-11-08 23:45:05 -08:00
Richard Liaw	cf9e838326	[tune] Raise Error when overstepping (#3235)	2018-11-07 14:27:09 -08:00
Eric Liang	813f51769f	[rllib] Fix rllib rollouts script and add test (#3211) ## What do these changes do? Clean up the checkpointing to handle the new checkpoint dirs. Add a test for rollout.py ## Related issue number https://github.com/ray-project/ray/issues/3206 https://github.com/ray-project/ray/issues/3204	2018-11-05 00:33:25 -08:00
Richard Liaw	2086a57e61	[tune] Add Fractional GPU example/docs (#3169) * Add example for fractional GPU support * Update tune_mnist_keras.py * Update doc/source/tune-usage.rst	2018-10-31 18:53:16 -07:00
Dennis Chung	9df2e6e6f4	[tune] Modify stop criteria in hyperopt example (#3102) Modify `training_iteration` to `timesteps_total` because only `timesteps_total` is inside the reporter.	2018-10-30 13:26:40 -07:00
Robert Nishihara	658c14282c	Remove legacy Ray code. (#3121) * Remove legacy Ray code. * Fix cmake and simplify monitor. * Fix linting * Updates * Fix * Implement some methods. * Remove more plasma manager references. * Fix * Linting * Fix * Fix * Make sure class IDs are strings. * Some path fixes * Fix * Path fixes and update arrow * Fixes. * linting * Fixes * Java fixes * Some java fixes * TaskLanguage -> Language * Minor * Fix python test and remove unused method signature. * Fix java tests * Fix jenkins tests * Remove commented out code.	2018-10-26 13:36:58 -07:00
Robert Nishihara	5aa29613db	Fix linting errors. (#3127)	2018-10-24 16:30:00 -07:00
Richard Liaw	eff7cb4458	[tune] Fix SearchAlg finishing early (#3081) * Fix trial search alg finishing early * Fix lint * fix lint * nit fix	2018-10-22 12:17:13 -07:00
Praveen Palanisamy	4d8cfc0bf5	[tune] Fix (some more) misleading comments in tune/results.py (#3068) ## What do these changes do? Fix the misleading comments in code for: - `EPISODES_THIS_ITER` - `EPISODES_TOTAL` Had noted it before and planned to fix it along with some other changes but seemed very relevant to stay next to #3058 so sending this now.	2018-10-16 11:07:53 -07:00
Eric Liang	6240ccbc6e	[rllib] Add more warnings when multi-agent envs might not be set up right (#3061)	2018-10-15 13:42:56 -07:00
Marlon	4dc78b735b	[tune] Fix misleading comment (#3058)	2018-10-14 22:25:39 -07:00
Richard Liaw	f9b58d7b02	[tune] Tweaks to Trainable and Verbosity (#2889)	2018-10-11 23:42:13 -07:00
Kristian Hartikainen	2d35a97a76	Bug/log syncer fails with parentheses (#2653) * Update rsync command * Escape rsync locations * Fix the accidental variable move * Update rsync to use -s flag	2018-10-06 00:34:53 -07:00
Richard Liaw	0651d3b629	[tune/core] Use Global State API for resources (#3004)	2018-10-04 17:23:17 -07:00
old-bear	8aa736572b	[tune] Fix hyperband edge case for None entries (#2964)	2018-09-30 09:57:43 -07:00
Eric Liang	65dcafdc3f	[rllib] Refactor save() / restore() code of agents and avoid O(n_workers) save size (#2982)	2018-09-30 01:15:13 -07:00
old-bear	b3f0dcf20b	[tune] Add a raise_on_failed_trial flag in run_experiments (#2961) Adds a flag to control raising TuneError if some trial fails in `run_experiments`.	2018-09-29 11:29:46 -07:00
Eric Liang	3a3782c39f	[rllib] Fix LSTM regression on truncated sequences and add regression test (#2898) * fix * add test * yapf * yapf * fix space * Oops that should be lstm: True * Update cartpole_lstm.py	2018-09-18 15:09:16 -07:00
Richard Liaw	f372f48bf3	[tune] Tune onto Logging Module (#2882) Moves Tune onto logging in Python. Ignores examples and tests.	2018-09-16 12:09:36 -07:00
Richard Liaw	e05baed336	[tune] Better Info String and Tweaks (#2874)	2018-09-15 11:02:13 -07:00
Daniel Ho	d9eeaaf00a	[tune] Fix bug in example where config hyperparameters were ignored (#2860) A fix to an example for tune (`python/ray/tune/examples/pbt_tune_cifar10_with_keras.py`) where the hyperparameters for the optimizer, learning rate and decay, were not being passed into the optimizer. This means that the current optimizer uses default values for the hyperparameters no matter the config.	2018-09-12 09:17:56 -07:00
old-bear	f3c1194be3	[tune] Add AutoML algorithm of GeneticSearcher (#2699) Add new search algorithm (genetic) along with the base framework of the searcher (which performs some basic jobs such as logging, recording and organizing in our project). Note that this is the initial commit. In the following days, we will add example, UT, and other refinements.	2018-09-12 09:17:04 -07:00
Kaahan	045861c9b0	[tune] Reset Config for Trainables (#2831) Adds the ability for trainables to reset their configurations during experiments. These changes in particular add the base functions to the trial_executor and trainable interfaces as well as giving the basic implementation on the PopulationBasedTraining scheduler. Related issue number: #2741	2018-09-11 08:45:04 -07:00
Eric Liang	611259b2c7	Re-raise actor initialization errors on method invocation (#2843) If an actor constructor fails, save that error and re-raise it on any subsequent attempts to interact with the actor. Related to https://github.com/ray-project/ray/issues/282 and https://github.com/ray-project/ray/issues/1093.	2018-09-10 10:51:19 -07:00
Eric Liang	d81605e9e7	[tune] Add a time/timesteps since last restore metric (#2819) * rsm * always log to avoid changing schema for csv writer * add iter since restore * update * criteria warn	2018-09-05 17:45:09 -07:00
Eric Liang	995ac24a2c	[rllib] clarify train batch size for PPO (#2793) It's possible to configure PPO in a way that ends up discarding most of the samples (they are treated as "stragglers"). Add a warning when this happens, and raise an exception if the waste is particularly egregious.	2018-09-05 12:06:13 -07:00
Richard Liaw	72542c9016	[tune] Fix Pausing and Error Propagation (#2815) * add new tests * Try-catch errors from ray get * longer pbt run * Update pbt_example.py * Split trial and result and fix tests	2018-09-04 15:22:11 -07:00
Eric Liang	df4788e501	[rllib/tune] Add test for fractional gpu support in xray mode; add rllib support for fractional gpu (#2768) * frac gpu * doc * Update rllib-training.rst * yapf * remove xray	2018-09-03 11:12:23 -07:00
wangyiguang	3813ae34b3	[tune] Add AutoMLBoard: Monitoring UI (experimental) (#2574)	2018-08-31 00:26:44 -07:00
Richard Liaw	0347e6418b	[tune] Add PyTorch MNIST Example + Misc. Tweaks (#2708)	2018-08-30 16:18:56 -07:00
Praveen Palanisamy	357c0d6156	[tune] Adds option to checkpoint at end of trials (#2754) * Added checkpoint_at_end option. To fix #2740 * Added ability to checkpoint at the end of trials if the option is set to True * checkpoint_at_end option added; Consistent with Experience and Trial runner * checkpoint_at_end option mentioned in the tune usage guide * Moved the redundant checkpoint criteria check out of the if-elif * Added note that checkpoint_at_end is enabled only when checkpoint_freq is not 0 * Added test case for checkpoint_at_end * Made checkpoint_at_end have an effect regardless of checkpoint_freq * Removed comment from the test case * Fixed the indentation * Fixed pep8 E231 * Handled cases when trainable does not have _save implemented * Constrained test case to a particular exp using the MockAgent * Revert "Constrained test case to a particular exp using the MockAgent" This reverts commit e965a9358ec7859b99a3aabb681286d6ba3c3906. * Revert "Handled cases when trainable does not have _save implemented" This reverts commit 0f5382f996ff0cbf3d054742db866c33494d173a. * Simpler test case for checkpoint_at_end * Preserved bools from losing their actual value * Revert "Moved the redundant checkpoint criteria check out of the if-elif" This reverts commit 783005122902240b0ee177e9e206e397356af9c5. * Fix linting error.	2018-08-29 13:14:17 -07:00
Eric Liang	69d1354016	[rllib] Document ARS & rainbow (#2744) * wip * rainbow doc too * e not used * fix ppo doc * clean list * use same title	2018-08-28 18:13:36 -07:00
Michael Tu	d16b6f6a32	[tune] Rename 'repeat' to 'num_samples' (#2698) Deprecates the `repeat` argument and introduces `num_samples`. Also updates docs accordingly.	2018-08-24 15:05:24 -07:00
old-bear	4be324efc3	[tune] Support infinity value in report result (#2693) * + Compatibility fix under py2 on ray.tune * + Revert changes on master branch * + Use default JsonEncoder in ray.tune.logger * + Add UT for infinity support	2018-08-22 13:09:14 -07:00
joyyoj	38867eea4e	[tune] Cross-Framework Compatibility (#2646) This commit is a first pass at restructuring the Trial execution logic to support running on multiple frameworks.	2018-08-22 10:55:45 -07:00
Eric Liang	fbe6c59f72	[rllib] Misc fixes, A2C (#2679) A bunch of minor rllib fixes: pull in latest baselines atari wrapper changes (and use deepmind wrapper by default) move reward clipping to policy evaluator add a2c variant of a3c reduce vision network fc layer size to 256 units switch to 84x84 images doc tweaks print timesteps in tune status	2018-08-20 15:28:03 -07:00
old-bear	230ac7aa80	[tune] Compatibility fix under py2 on str condition (#2673) * * Compatibility fix under py2 on ray.tune * + Fix compatibility * + Use package six to achieve str compatibility	2018-08-19 20:43:03 -07:00
Richard Liaw	62d0698097	[tune] Tune Facelift (#2472) This PR introduces the following changes: * Ray Tune -> Tune * [breaking] Creation of `schedulers/`, moving PBT, HyperBand into a submodule * [breaking] Search Algorithms now must take in experiment configurations via `add_configurations` rather than through initialization * Support `"run": (function | class | str)` with automatic registering of trainable * Documentation Changes	2018-08-19 11:00:55 -07:00
Eric Liang	e56eb354eb	[tune] Remove hack to serve pin requests off thread (#2680) * nopin * fix	2018-08-18 13:19:52 -07:00
Eric Liang	64053278aa	[tune] Support lambda functions in hyperparameters / tune rllib multiagent support (#2568) * update * func * Update registry.py * revert	2018-08-07 16:29:21 -07:00
Richard Liaw	bb44456f6f	[rllib, tune] TrainingResult -> Dict, Removes C408 from flake8 (#2565)	2018-08-07 12:17:44 -07:00
Richard Liaw	914a433e3f	[tune] Split Search from Scheduling (#2452) Introduces SearchAlgorithm concept, separate from schedulers in Tune. Moves HyperOpt under this concept.	2018-08-04 21:27:39 -07:00
Richard Liaw	7edc677304	[rllib] Extra Changes for Usability (#2363)	2018-07-24 20:51:22 -07:00


144 Commits