Eric Liang
4522038259
[iter] Add .transform() function for arbitrary generator transforms ( #8978 )
2020-06-25 11:04:14 -07:00
Xianyang Liu
b449ece2ea
[SGD] Variable worker CPU requirements ( #8963 )
2020-06-23 00:43:27 -07:00
Alex Wu
40c15b1ba0
[ParallelIterator] Fix for_each concurrent test cases/bugs ( #8964 )
...
* Everything works
* Update python/ray/util/iter.py
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
* .
* .
* removed print statements
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2020-06-22 18:26:45 -07:00
Richard Liaw
e2330ffc35
[sgd] Cleanup code from last PR ( #9076 )
2020-06-22 15:17:07 -07:00
Richard Liaw
acdd873481
[docs/sgd] Fix test failure + make slack link large ( #9051 )
2020-06-21 15:55:06 -07:00
Richard Liaw
58efec0f2b
[sgd] simplify cuda visible device setting ( #8775 )
2020-06-12 13:53:32 -07:00
SangBin Cho
731ed8d232
[Core] Fix a detached actor bug when GCS actor management is off. ( #8843 )
2020-06-09 15:46:17 -07:00
SangBin Cho
3388864768
[Core] Clean up detached actors ( #8759 )
2020-06-08 11:22:01 -05:00
Alex Wu
a2ec282033
[Doc] Dataset lint fix ( #8719 )
2020-06-01 19:43:06 -07:00
Alex Wu
dcf58a43dc
[SGD] Dataset API ( #7839 )
2020-06-01 15:48:15 -07:00
Edward Oakes
c64b694560
Update RaySGD test to use ray.kill instead of __ray_kill__ ( #8662 )
2020-05-28 22:38:05 -05:00
Amog Kamsetty
ae2e1f0883
[Parallel Iterators] Batching + Pipelining optimizations ( #7931 )
...
* batching + get_shard pipelining
* duplicate fix
* formatting
* adding performance benchmark
* minor changes
* turn batching off by default
2020-05-26 00:37:57 -07:00
Edward Oakes
860eb6f13a
Update named actor API ( #8559 )
2020-05-24 20:08:03 -05:00
Eric Liang
9a83908c46
[rllib] Deprecate policy optimizers ( #8345 )
2020-05-21 10:16:18 -07:00
Eric Liang
aa7a58e92f
[rllib] Support training intensity for dqn / apex ( #8396 )
2020-05-20 11:22:30 -07:00
Max Fitton
13231ba63b
Rename redis-port to port and add default ( #8406 )
2020-05-18 13:25:34 -05:00
Eric Liang
9d012626e5
[rllib] Distributed exec workflow for impala ( #8321 )
2020-05-11 20:24:43 -07:00
Edward Oakes
2677b71003
Implement named actors using the GCS service ( #8328 )
2020-05-09 08:58:10 -05:00
Eric Liang
9f04a65922
[rllib] Add PPO+DQN two trainer multiagent workflow example ( #8334 )
2020-05-07 23:40:29 -07:00
Alex Wu
04813c2ef5
[Parallel Iterator] Foreach concur ( #8140 )
2020-05-06 10:00:01 -05:00
Eric Liang
ee0eb44a32
Rename async_queue_depth -> num_async ( #8207 )
...
* rename
* lint
2020-05-05 01:38:10 -07:00
Xianyang Liu
eda526c154
[SGD] Support multiple input model ( #8246 )
2020-05-02 16:49:09 -07:00
Maksim Smolin
c2acb7ffe2
[SGD] Add imagenet example CI ( #8150 )
2020-05-02 16:48:35 -07:00
Richard Liaw
35eac2671e
[sgd] Resource limit lift for GPU test ( #8238 )
2020-04-30 00:24:48 -07:00
Xianyang Liu
fbf23eb6ff
[SGD] Fix IterableDataset errors ( #8208 )
2020-04-29 10:51:31 -07:00
Neil Lugovoy
8cf598deab
[sgd] Fix GPU Reservations in LocalDistributedRunner ( #8157 )
2020-04-27 16:03:33 -07:00
Philipp Moritz
d7da25eee1
Use RAY_ADDRESS to connect to an existing Ray cluster if present ( #7977 )
2020-04-27 09:59:37 -07:00
Richard Liaw
fa7eecf48a
[sgd] Avoid parameter "gotcha" for learning rate scheduler ( #8107 )
...
* with-scheduler-creator
* none
* add_freq
* runner
* torch
2020-04-21 01:01:04 -07:00
Sven Mika
165a86f1ab
[RLlib] SAC MuJoCo instability issues (tf and torch versions). ( #8063 )
...
SAC (both torch and tf versions) crashes due to numeric instabilities in the SquashedGaussian distribution (sampling and logp after extreme NN outputs).
This PR fixes these instabilities. Stable MuJoCo learning (HalfCheetah) has been confirmed on both the tf and torch versions. A distribution stability test (using extreme NN outputs) has been added for SquashedGaussian; it can be used for any other type of distribution as well.
2020-04-19 10:20:23 +02:00
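The SAC entry above attributes the crashes to the log-probability of a tanh-squashed Gaussian after extreme network outputs. The sketch below is not the code from that PR; it only illustrates one common way such a term can be stabilized, assuming the action is a = tanh(u) with u drawn from a Normal distribution. All function names here are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_det_tanh_naive(u):
    """log(1 - tanh(u)^2) computed directly; breaks once tanh(u) rounds to +/-1."""
    return math.log(1.0 - math.tanh(u) ** 2)


def log_det_tanh_stable(u):
    """Same quantity via the identity 2 * (log 2 - u - softplus(-2u))."""
    return 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))


def squashed_gaussian_logp(u, mean, std):
    """Log-density of a = tanh(u), where u ~ Normal(mean, std)."""
    normal_logp = (
        -0.5 * ((u - mean) / std) ** 2
        - math.log(std)
        - 0.5 * math.log(2.0 * math.pi)
    )
    # Change of variables: subtract log|da/du| = log(1 - tanh(u)^2).
    return normal_logp - log_det_tanh_stable(u)
```

For moderate inputs the two forms agree. For large |u| the naive form typically fails because tanh(u) rounds to exactly 1 or -1 and the logarithm receives zero, while the rewritten form stays finite.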
Richard Liaw
857e4dba2f
[sgd] HuggingFace GLUE Fine-tuning Example ( #7792 )
...
* Init fp16
* fp16 and schedulers
* scheduler linking and fp16
* to fp16
* loss scaling and documentation
* more documentation
* add tests, refactor config
* moredocs
* more docs
* fix logo, add test mode, add fp16 flag
* fix tests
* fix scheduler
* fix apex
* improve safety
* fix tests
* fix tests
* remove pin memory default
* rm
* fix
* Update doc/examples/doc_code/raysgd_torch_signatures.py
* fix
* migrate changes from other PR
* ok thanks
* pass
* signatures
* lint
* Update python/ray/experimental/sgd/pytorch/utils.py
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* should address most comments
* comments
* fix this ci
* first_pass
* add overrides
* override
* fixing up operators
* format
* sgd
* constants
* rm
* revert
* save
* failures
* fixes
* trainer
* run test
* operator
* code
* op
* ok done
* operator
* sgd test fixes
* ok
* trainer
* format
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* Update doc/source/raysgd/raysgd_pytorch.rst
* docstring
* dcgan
* doc
* commits
* nit
* testing
* revert
* Start renaming pytorch to torch
* Rename PyTorchTrainer to TorchTrainer
* Rename PyTorch runners to Torch runners
* Finish renaming API
* Rename to torch in tests
* Finish renaming docs + tests
* Run format + fix DeprecationWarning
* fix
* move tests up
* benchmarks
* rename
* remove some args
* better metrics output
* fix up the benchmark
* benchmark-yaml
* horovod-benchmark
* benchmarks
* Remove benchmark code for cleanups
* benchmark-code
* nits
* benchmark yamls
* benchmark yaml
* ok
* ok
* ok
* benchmark
* nit
* finish_bench
* makedatacreator
* relax
* metrics
* autosetsampler
* profile
* movements
* OK
* smoothen
* fix
* nitdocs
* loss
* envflag
* comments
* nit
* format
* visible
* images
* move_images
* fix
* rernder
* rrender
* rest
* multgpu
* fix
* nit
* finish
* extrra
* setup
* experimental
* as_trainable
* fix
* ok
* format
* create_torch_pbt
* setup_pbt
* ok
* format
* ok
* format
* docs
* ok
* Draft head-is-worker
* Fix missing concurrency between local and remote workers
* Fix tqdm to work with head-is-worker
* Cleanup
* Implement state_dict and load_state_dict
* Reserve resources on the head node for the local worker
* Update the development cluster setup
* Add spot block reservation to the development yaml
* ok
* Draft the fault tolerance fix
* Small fixes to local-remote concurrency
* Cleanup + fix typo
* fixes
* worker_counts
* some formatting and asha
* fix
* okme
* fixactorkill
* unify
* Revert the cluster mounts
* Cut the handler-reporter API
* Fix most tests
* Rm tqdm_handler.py
* Re-add tune test
* Automatically force-shutdown on actor errors on shutdown
* Formatting
* fix_tune_test
* Add timeout error verification
* Rename tqdm to use_tqdm
* fixtests
* ok
* remove_redundant
* deprecated
* deactivated
* ok_try_this
* lint
* nice
* done
* retries
* fixes
* kill
* retry
* init_transformer
* init
* deployit
* improve_example
* trans
* rename
* formats
* format-to-py37
* time_to_test
* more_changes
* ok
* update_args_and_script
* fp16_epoch
* huggingface
* training stats
* distributed
* Apply suggestions from code review
* transformer
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-04-17 15:17:30 -07:00
Maksim Smolin
d6f4e5b3e1
[SGD] Imagenet example (basic) ( #8020 )
...
* Checkpoint the image-models example
* Update cluster definition
* Fix copyright info
* Use original args
* Checkpoint fixes
* Add README
* Add some missing features
* Format
* Get rid of the unused Namespace class
* Address comments
* Link the imagenet example in docs
* Cleanup
* Fix lint
2020-04-17 13:33:55 -07:00
Richard Liaw
a9ea139317
[sgd] Make serialization of data creation optional ( #8027 )
...
* pytest
* Update python/ray/util/sgd/torch/torch_trainer.py
Co-Authored-By: Ujval Misra <misraujval@gmail.com>
Co-authored-by: Ujval Misra <misraujval@gmail.com>
2020-04-16 20:27:51 -07:00
Richard Liaw
6545534805
[tune/sgd] DCGAN example self-contained, turn example into modu… ( #8012 )
...
* ok
* done
* run_benchmarks
* should_make_examples_usable
2020-04-16 17:55:27 -07:00
Karthikeyan Singaravelan
f95e18dfeb
[tune/sgd] Import ABC from collections.abc instead of collectio… ( #7982 )
...
* Import ABC from collections.abc instead of collections for Python 3 compatibility.
* Fix linter errors.
2020-04-16 15:26:49 -07:00
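The entry above moves abstract base class imports from `collections` to `collections.abc`. The old aliases in `collections` were deprecated in Python 3.3 and removed in Python 3.10. A minimal sketch of the pattern follows; the helper `describe` is hypothetical and is not taken from the Ray code base.

```python
# Import the abstract base classes from collections.abc, not from collections.
from collections.abc import Mapping, Sequence


def describe(value):
    """Classify a value by the container interface it implements."""
    if isinstance(value, Mapping):
        return "mapping"
    # str and bytes are Sequences too, so exclude them explicitly.
    if isinstance(value, Sequence) and not isinstance(value, (str, bytes)):
        return "sequence"
    return "other"
```

For example, `describe({"lr": 0.1})` returns `"mapping"` and `describe([1, 2])` returns `"sequence"`.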
Robert Nishihara
d985d7537e
Replace all instances of ray.readthedocs.io with ray.io ( #7994 )
2020-04-13 16:17:05 -07:00
Richard Liaw
dd63178e91
[sgd] Semantic Segmentation Example ( #7825 )
...
* better_example
* test
* improve some usability things
* submit
* fix
* making a segmentation example
* segmentation_example
* segmentation
* device
* flake
* Update python/ray/util/sgd/torch/training_operator.py
* uti
* finished_example
* block
* format
* locationg
* fix
* ok
* revert
* segmentation
* lint_and_test
* address_comments
2020-04-10 20:35:45 -07:00
marload
e3ffb8ac28
[tune] Refactoring: Deduplicate ( #7918 )
...
* refactoring: Deduplication
* refactoring: Deduplication
* refactoring: Deduplication
* refactoring: Deduplication
* lint fix: Variable naming case
* fix: Remove White Space
* fix_lint
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-04-09 20:19:04 -07:00
Simon Mo
59867dad75
Move Jenkins test to Github action ( #7342 )
2020-04-09 10:27:19 -07:00
David Chan
6521e92a95
[RaySGD] Honor the use_gpu flag ( #7942 )
2020-04-08 20:20:09 -07:00
Richard Liaw
f63b4c1110
[sgd] make ddp optional ( #7875 )
...
* loosen
* devices
* tryitout
* fix
* fix
* fix
* easy
* test
* fix
* fix
* better visibility
* fix
2020-04-06 11:41:36 -07:00
Richard Liaw
24bf6ad607
[raysgd] Improve raysgd examples ( #7818 )
...
* better_example
* test
* improve some usability things
* submit
* fix
* flake
* Update python/ray/util/sgd/torch/training_operator.py
* trythis
* fix
* fix
* smoke
* fail
* fix
* fix
2020-04-01 08:58:39 -07:00
Richard Liaw
fbf02fa7f7
[Hotfix] Lint for Documentation ( #7817 )
2020-03-30 11:49:05 -07:00
Richard Liaw
18327254b6
[docs] Fix readthedocs rendering ( #7810 )
2020-03-30 11:40:08 -07:00
Richard Liaw
86cff17e7e
[tune/raysgd] Tune API for TorchTrainer + Fix State Restoration ( #7547 )
2020-03-30 12:58:49 -05:00
Maksim Smolin
7b27ce2b23
[RaySGD] Convert the head worker to a local model ( #7746 )
...
Why are these changes needed?
Running a worker on head (locally, not as a Ray actor) allows for easier handling of stateful stuff like logging and for easier debugging.
2020-03-27 20:19:15 -07:00
Maksim Smolin
e95455b7d7
[RaySGD] Add tqdm logging to TorchTrainer ( #7588 )
...
* Update issue templates
* Init fp16
* fp16 and schedulers
* scheduler linking and fp16
* to fp16
* loss scaling and documentation
* more documentation
* add tests, refactor config
* moredocs
* more docs
* fix logo, add test mode, add fp16 flag
* fix tests
* fix scheduler
* fix apex
* improve safety
* fix tests
* fix tests
* remove pin memory default
* rm
* fix
* Update doc/examples/doc_code/raysgd_torch_signatures.py
* fix
* migrate changes from other PR
* ok thanks
* pass
* signatures
* lint
* Update python/ray/experimental/sgd/pytorch/utils.py
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* should address most comments
* comments
* fix this ci
* first_pass
* add overrides
* override
* fixing up operators
* format
* sgd
* constants
* rm
* revert
* Checkpoint the basics
* End of day checkpoint
* Checkpoint log-to-head implementation
* Checkpoint
* Add actor-based batch log reporting, currently segfaults
* Work around progress segfault
* Fix some stuff in quicktorch
* Make things more customizable
* Quality of life fixes
* More quality of life
* Move tqdm logic to training_operator
* Update examples
* Fix some minor bugs
* Fix merge
* Fix small things, add pbar to dcgan
* Run format.sh
* Fix missing epoch number for batch pbar
* Address PR comments
* Fix float is not subscriptable
* Add train_loss to pbar by default
* Isolate tqdm code into a handler system
* Format
* Remove the batch_logs_reporter from distributed runner as well
* Check if the train_loss is available before using it
* Enable tqdm in the dcgan example
* Fix a crash in no-handler trainers
* Fix
* Allow not calling set_reporters for tests
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-03-24 23:43:56 -07:00
Eric Liang
288933ec6b
[rllib] Fix shared metrics context in parallel iterators ( #7666 )
...
* debug
* build
* update
* wip
* wpi
* update
* recursive sync
* comment
* stream
* fix
* Update .travis.yml
2020-03-22 14:15:01 -07:00
Eric Liang
797e6cfc2a
[rllib][tune] fix some nans ( #7611 )
2020-03-16 11:19:58 -07:00
Eric Liang
f5d12a958b
[rllib] Port Ape-X to distributed execution API ( #7497 )
2020-03-12 00:54:08 -07:00
Richard Liaw
b70f31339c
[sgd] Benchmark Fixes ( #7553 )
...
* fix
* fix
2020-03-11 13:08:27 -07:00