wassname/ray

mirror of https://github.com/wassname/ray.git synced 2026-07-04 19:28:27 +08:00

Author	SHA1	Message	Date
Amog Kamsetty	07bdf062b9	[Ray SGD] [Hotfix] Worker group hotfix (#11008 )	2020-09-24 12:21:30 -07:00
Amog Kamsetty	52e1495e30	[Ray SGD] TorchTrainable pre 0.8.7 deprecation warning (#10984 ) * torch trainable add pre 0.8.7 backwards compat * raise instead * Update python/ray/util/sgd/torch/torch_trainer.py	2020-09-23 18:19:43 -07:00
Ian Rodney	4c3f09094a	[docs] redis-port -> port (#10937 )	2020-09-23 17:04:13 -07:00
Amog Kamsetty	7dbd0ff824	fix example (#10964 )	2020-09-23 10:33:19 -07:00
Amog Kamsetty	d1d4743702	[Ray SGD] FP16 Hotfix (#10931 )	2020-09-21 13:10:10 -07:00
Amog Kamsetty	d5a7c53908	[Ray SGD] use_local flag + Worker group abstraction (#10539 ) Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2020-09-15 11:58:57 -07:00
Ian Rodney	5bc2ba38fd	[docker] Detect CPUs in container correctly (#10507 ) Co-authored-by: simon-mo <simon.mo@hey.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Alex Wu <itswu.alex@gmail.com>	2020-09-13 23:40:48 -07:00
Barak Michener	c6b1ed7f8f	release process: bump version number to 1.1.0.dev0 everywhere (#10686 )	2020-09-10 16:00:21 -07:00
Amog Kamsetty	415be78cc0	[RaySGD] Simplify Builder Process (#10321 ) Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2020-09-08 15:19:40 -07:00
Richard Liaw	cb438be146	[core] Move log_to_driver back to public (#10422 )	2020-08-29 16:35:14 -07:00
Eric Liang	519354a39a	[api] Initial API deprecations for Ray 1.0 (#10325 )	2020-08-28 15:03:50 -07:00
Yu Shan	5264f888e4	fix iterable dataset (issue 9899) (#9952 )	2020-08-22 19:40:38 -07:00
SangBin Cho	92664249e8	Partially Use f string (#10218 ) * flynt. trial 1. * Trial 1. * Addressed code review.	2020-08-20 18:21:16 -07:00
Richard Liaw	0c3b9ebeef	[tune/sgd] Document func_trainable and add checkpoint context (#9739 ) Co-authored-by: krfricke <krfricke@users.noreply.github.com> Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>	2020-07-30 09:46:37 -07:00
Richard Liaw	f3fdb5c5db	[tune] distributed torch wrapper (#9550 ) * changes * add-working * checkpoint * ccleanu * fix * ok * formatting * ok * tests * some-good-stuff * fix-torch * ddp-torch * torch-test * sessions * add-small-test * fix * remove * gpu-working * update-tests * ok * try-test * formgat * ok * ok	2020-07-26 09:37:22 -07:00
krfricke	ea4797bf38	[RaySGD] revised existing transformer example to work with transformers>=3.0 (#9661 ) Co-authored-by: Kai Fricke <kai@anyscale.com>	2020-07-23 10:58:09 -07:00
mehrdadn	aa8928fac2	Make more tests compatible with Windows (#9303 )	2020-07-15 11:34:33 -05:00
Richard Liaw	d35f0e40d0	[tune] Use public methods for trainable (#9184 )	2020-07-01 11:00:00 -07:00
Xianyang Liu	b449ece2ea	[SGD] Variable worker CPU requirements (#8963 )	2020-06-23 00:43:27 -07:00
Richard Liaw	58efec0f2b	[sgd] simplify cuda visible device setting (#8775 )	2020-06-12 13:53:32 -07:00
Alex Wu	dcf58a43dc	[SGD] Dataset API (#7839 )	2020-06-01 15:48:15 -07:00
Max Fitton	13231ba63b	Rename redis-port to port and add default (#8406 )	2020-05-18 13:25:34 -05:00
Xianyang Liu	eda526c154	[SGD] Support multiple input model (#8246 )	2020-05-02 16:49:09 -07:00
Maksim Smolin	c2acb7ffe2	[SGD] Add imagenet example CI (#8150 )	2020-05-02 16:48:35 -07:00
Xianyang Liu	fbf23eb6ff	[SGD] Fix IterableDataset errors (#8208 )	2020-04-29 10:51:31 -07:00
Neil Lugovoy	8cf598deab	[sgd] Fix GPU Reservations in LocalDistributedRunner (#8157 )	2020-04-27 16:03:33 -07:00
Richard Liaw	fa7eecf48a	[sgd] Avoid parameter "gotcha" for learning rate scheduler (#8107 ) * with-scheduler-creator * none * add_freq * runner * torch	2020-04-21 01:01:04 -07:00
Sven Mika	165a86f1ab	[RLlib] SAC MuJoCo instability issues (tf and torch versions). (#8063 ) SAC (both torch and tf versions) are showing issues (crashes) due to numeric instabilities in the SquashedGaussian distribution (sampling + logp after extreme NN outputs). This PR fixes these. Stable MuJoCo learning (HalfCheetah) has been confirmed on both tf and torch versions. A Distribution stability test (using extreme NN outputs) has been added for SquashedGaussian (can be used for any other type of distribution as well).	2020-04-19 10:20:23 +02:00
Richard Liaw	857e4dba2f	[sgd] HuggingFace GLUE Fine-tuning Example (#7792 ) * Init fp16 * fp16 and schedulers * scheduler linking and fp16 * to fp16 * loss scaling and documentation * more documentation * add tests, refactor config * moredocs * more docs * fix logo, add test mode, add fp16 flag * fix tests * fix scheduler * fix apex * improve safety * fix tests * fix tests * remove pin memory default * rm * fix * Update doc/examples/doc_code/raysgd_torch_signatures.py * fix * migrate changes from other PR * ok thanks * pass * signatures * lint' * Update python/ray/experimental/sgd/pytorch/utils.py * Apply suggestions from code review Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com> * should address most comments * comments * fix this ci * first_pass * add overrides * override * fixing up operators * format * sgd * constants * rm * revert * save * failures * fixes * trainer * run test * operator * code * op * ok done * operator * sgd test fixes * ok * trainer * format * Apply suggestions from code review Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com> * Update doc/source/raysgd/raysgd_pytorch.rst * docstring * dcgan * doc * commits * nit * testing * revert * Start renaming pytorch to torch * Rename PyTorchTrainer to TorchTrainer * Rename PyTorch runners to Torch runners * Finish renaming API * Rename to torch in tests * Finish renaming docs + tests * Run format + fix DeprecationWarning * fix * move tests up * benchmarks * rename * remove some args * better metrics output * fix up the benchmark * benchmark-yaml * horovod-benchmark * benchmarks * Remove benchmark code for cleanups * benchmark-code * nits * benchmark yamls * benchmark yaml * ok * ok * ok * benchmark * nit * finish_bench * makedatacreator * relax * metrics * autosetsampler * profile * movements * OK * smoothen * fix * nitdocs * loss * envflag * comments * nit * format * visible * images * move_images * fix * rernder * rrender * rest * multgpu * fix * nit * finish * extrra * setup * experimental * as_trainable * fix * ok * format * create_torch_pbt * setup_pbt * ok * format * ok * format * docs * ok * Draft head-is-worker * Fix missing concurrency between local and remote workers * Fix tqdm to work with head-is-worker * Cleanup * Implement state_dict and load_state_dict * Reserve resources on the head node for the local worker * Update the development cluster setup * Add spot block reservation to the development yaml * ok * Draft the fault tolerance fix * Small fixes to local-remote concurrency * Cleanup + fix typo * fixes * worker_counts * some formatting and asha * fix * okme * fixactorkill * unify * Revert the cluster mounts * Cut the handler-reporter API * Fix most tests * Rm tqdm_handler.py * Re-add tune test * Automatically force-shutdown on actor errors on shutdown * Formatting * fix_tune_test * Add timeout error verification * Rename tqdm to use_tqdm * fixtests * ok * remove_redundant * deprecated * deactivated * ok_try_this * lint * nice * done * retries * fixes * kill * retry * init_transformer * init * deployit * improve_example * trans * rename * formats * format-to-py37 * time_to_test * more_changes * ok * update_args_and_script * fp16_epoch * huggingface * training stats * distributed * Apply suggestions from code review * transformer Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: Maksim Smolin <maximsmol@gmail.com>	2020-04-17 15:17:30 -07:00
Maksim Smolin	d6f4e5b3e1	[SGD] Imagenet example (basic) (#8020 ) * Checkpoint the image-models example * Update cluster definition * Fix copyright info * Use original args * Checkpoint fixes * Add README * Add some missing features * Format * Get rid of the unused Namespace class * Address comments * Link the imagenet example in docs * Cleanup * Fix lint	2020-04-17 13:33:55 -07:00
Richard Liaw	a9ea139317	[sgd] Make serialization of data creation optional (#8027 ) * pytest * Update python/ray/util/sgd/torch/torch_trainer.py Co-Authored-By: Ujval Misra <misraujval@gmail.com> Co-authored-by: Ujval Misra <misraujval@gmail.com>	2020-04-16 20:27:51 -07:00
Richard Liaw	6545534805	[tune/sgd] DCGAN example self-contained, turn example into modu… (#8012 ) * ok * done * run_benchmarks * should_make_examples_usable	2020-04-16 17:55:27 -07:00
Karthikeyan Singaravelan	f95e18dfeb	[tune/sgd] Import ABC from collections.abc instead of collectio… (#7982 ) * Import ABC from collections.abc instead of collections for Python 3 compatibility. * Fix linter errors.	2020-04-16 15:26:49 -07:00
Robert Nishihara	d985d7537e	Replace all instances of ray.readthedocs.io with ray.io (#7994 )	2020-04-13 16:17:05 -07:00
Richard Liaw	dd63178e91	[sgd] Semantic Segmentation Example (#7825 ) * better_example * test * improve some usability things * submit * fix * making a segmentation example * segmentation_example * segmentation * device * flake * Update python/ray/util/sgd/torch/training_operator.py * uti * finished_example * block * format * locationg * fix * ok * revert * segmentation * lint_and_test * address_comments	2020-04-10 20:35:45 -07:00
David Chan	6521e92a95	[RaySGD] Honor the use_gpu flag (#7942 )	2020-04-08 20:20:09 -07:00
Richard Liaw	f63b4c1110	[sgd] make ddp optional (#7875 ) * loosen * devices * tryitout * fix * fix * fix * easy * test * fix * fix * better visibility * fix	2020-04-06 11:41:36 -07:00
Richard Liaw	24bf6ad607	[raysgd] Improve raysgd examples (#7818 ) * better_example * test * improve some usability things * submit * fix * flake * Update python/ray/util/sgd/torch/training_operator.py * trythis * fix * fix * smoke * fail * fix * fix	2020-04-01 08:58:39 -07:00
Richard Liaw	fbf02fa7f7	[Hotfix] Lint for Documentation (#7817 )	2020-03-30 11:49:05 -07:00
Richard Liaw	18327254b6	[docs] Fix readthedocs rendering (#7810 )	2020-03-30 11:40:08 -07:00
Richard Liaw	86cff17e7e	[tune/raysgd] Tune API for TorchTrainer + Fix State Restoration (#7547 )	2020-03-30 12:58:49 -05:00
Maksim Smolin	7b27ce2b23	[RaySGD] Convert the head worker to a local model (#7746 ) Why are these changes needed? Running a worker on head (locally, not as a Ray actor) allows for easier handling of stateful stuff like logging and for easier debugging.	2020-03-27 20:19:15 -07:00
Maksim Smolin	e95455b7d7	[RaySGD] Add tqdm logging to TorchTrainer (#7588 ) * Update issue templates * Init fp16 * fp16 and schedulers * scheduler linking and fp16 * to fp16 * loss scaling and documentation * more documentation * add tests, refactor config * moredocs * more docs * fix logo, add test mode, add fp16 flag * fix tests * fix scheduler * fix apex * improve safety * fix tests * fix tests * remove pin memory default * rm * fix * Update doc/examples/doc_code/raysgd_torch_signatures.py * fix * migrate changes from other PR * ok thanks * pass * signatures * lint' * Update python/ray/experimental/sgd/pytorch/utils.py * Apply suggestions from code review Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com> * should address most comments * comments * fix this ci * first_pass * add overrides * override * fixing up operators * format * sgd * constants * rm * revert * Checkpoint the basics * End of day checkpoint * Checkpoint log-to-head implementation * Checkpoint * Add actor-based batch log reporting, currently segfaults * Work around progress segfault * Fix some stuff in quicktorch * Make things more customizable * Quality of life fixes * More quality of life * Move tqdm logic to training_operator * Update examples * Fix some minor bugs * Fix merge * Fix small things, add pbar to dcgan * Run format.sh * Fix missing epoch number for batch pbar * Address PR comments * Fix float is not subscriptable * Add train_loss to pbar by default * Isolate tqdm code into a handler system * Format * Remove the batch_logs_reporter from distributed runner as well * Check if the train_loss is avaialbale before using it * Enable tqdm in the dcgan example * Fix a crash in no-handler trainers * Fix * Allow not calling set_reporters for tests Co-authored-by: Philipp Moritz <pcmoritz@gmail.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>	2020-03-24 23:43:56 -07:00
Richard Liaw	b70f31339c	[sgd] Benchmark Fixes (#7553 ) * fix * fix	2020-03-11 13:08:27 -07:00
Richard Liaw	fbac256982	[sgd] Add benchmarks (#7454 ) * Init fp16 * fp16 and schedulers * scheduler linking and fp16 * to fp16 * loss scaling and documentation * more documentation * add tests, refactor config * moredocs * more docs * fix logo, add test mode, add fp16 flag * fix tests * fix scheduler * fix apex * improve safety * fix tests * fix tests * remove pin memory default * rm * fix * Update doc/examples/doc_code/raysgd_torch_signatures.py * fix * migrate changes from other PR * ok thanks * pass * signatures * lint' * Update python/ray/experimental/sgd/pytorch/utils.py * Apply suggestions from code review Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com> * should address most comments * comments * fix this ci * first_pass * add overrides * override * fixing up operators * format * sgd * constants * rm * revert * save * failures * fixes * trainer * run test * operator * code * op * ok done * operator * sgd test fixes * ok * trainer * format * Apply suggestions from code review Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com> * Update doc/source/raysgd/raysgd_pytorch.rst * docstring * dcgan * doc * commits * nit * testing * revert * Start renaming pytorch to torch * Rename PyTorchTrainer to TorchTrainer * Rename PyTorch runners to Torch runners * Finish renaming API * Rename to torch in tests * Finish renaming docs + tests * Run format + fix DeprecationWarning * fix * move tests up * benchmarks * rename * remove some args * better metrics output * fix up the benchmark * benchmark-yaml * horovod-benchmark * benchmarks * Remove benchmark code for cleanups * benchmark-code * nits * benchmark yamls * benchmark yaml * ok * ok * ok * benchmark * nit * finish_bench * makedatacreator * relax * metrics * autosetsampler * profile * movements * OK * smoothen * fix * nitdocs * loss * envflag * comments * nit * format * visible * images * move_images * fix * rernder * rrender * rest * multgpu * fix * nit * finish * extrra * setup * revert Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: Maksim Smolin <maximsmol@gmail.com>	2020-03-11 01:09:08 -07:00
Richard Liaw	6163b21458	[raysgd] Better user errors! (#7546 ) * format * callable * Update python/ray/util/sgd/torch/torch_trainer.py Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com> * Update python/ray/util/sgd/torch/torch_trainer.py Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com> * data * torchtrainer * num_rep Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>	2020-03-10 18:58:19 -07:00
Richard Liaw	d192ef0611	[raysgd] Cleanup User API (#7384 ) * Init fp16 * fp16 and schedulers * scheduler linking and fp16 * to fp16 * loss scaling and documentation * more documentation * add tests, refactor config * moredocs * more docs * fix logo, add test mode, add fp16 flag * fix tests * fix scheduler * fix apex * improve safety * fix tests * fix tests * remove pin memory default * rm * fix * Update doc/examples/doc_code/raysgd_torch_signatures.py * fix * migrate changes from other PR * ok thanks * pass * signatures * lint' * Update python/ray/experimental/sgd/pytorch/utils.py * Apply suggestions from code review Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com> * should address most comments * comments * fix this ci * first_pass * add overrides * override * fixing up operators * format * sgd * constants * rm * revert * save * failures * fixes * trainer * run test * operator * code * op * ok done * operator * sgd test fixes * ok * trainer * format * Apply suggestions from code review Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com> * Update doc/source/raysgd/raysgd_pytorch.rst * docstring * dcgan * doc * commits * nit * testing * revert * Start renaming pytorch to torch * Rename PyTorchTrainer to TorchTrainer * Rename PyTorch runners to Torch runners * Finish renaming API * Rename to torch in tests * Finish renaming docs + tests * Run format + fix DeprecationWarning * fix * move tests up * benchmarks * rename * remove some args * better metrics output * fix up the benchmark * benchmark-yaml * horovod-benchmark * benchmarks * Remove benchmark code for cleanups * makedatacreator * relax * metrics * autosetsampler * profile * movements * OK * smoothen * fix * nitdocs * loss * comments * fix * fix * runner_tests * codes * example * fix_test * fix * tests Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: Maksim Smolin <maximsmol@gmail.com>	2020-03-10 08:41:42 -07:00
Maksim Smolin	3a134c7224	[RaySGD] Rename PyTorch API endpoints to start with Torch (#7425 ) * Start renaming pytorch to torch * Rename PyTorchTrainer to TorchTrainer * Rename PyTorch runners to Torch runners * Finish renaming API * Rename to torch in tests * Finish renaming docs + tests * Run format + fix DeprecationWarning * fix * move tests up * rename Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2020-03-03 16:44:42 -08:00

48 Commits