Edward Oakes
2677b71003
Implement named actors using the GCS service ( #8328 )
2020-05-09 08:58:10 -05:00
Eric Liang
9f04a65922
[rllib] Add PPO+DQN two trainer multiagent workflow example ( #8334 )
2020-05-07 23:40:29 -07:00
Alex Wu
04813c2ef5
[Parallel Iterator] Foreach concur ( #8140 )
2020-05-06 10:00:01 -05:00
Eric Liang
ee0eb44a32
Rename async_queue_depth -> num_async ( #8207 )
...
* rename
* lint
2020-05-05 01:38:10 -07:00
Xianyang Liu
eda526c154
[SGD] Support multiple input model ( #8246 )
2020-05-02 16:49:09 -07:00
Maksim Smolin
c2acb7ffe2
[SGD] Add imagenet example CI ( #8150 )
2020-05-02 16:48:35 -07:00
Richard Liaw
35eac2671e
[sgd] Resource limit lift for GPU test ( #8238 )
2020-04-30 00:24:48 -07:00
Xianyang Liu
fbf23eb6ff
[SGD] Fix IterableDataset errors ( #8208 )
2020-04-29 10:51:31 -07:00
Neil Lugovoy
8cf598deab
[sgd] Fix GPU Reservations in LocalDistributedRunner ( #8157 )
2020-04-27 16:03:33 -07:00
Philipp Moritz
d7da25eee1
Use RAY_ADDRESS to connect to an existing Ray cluster if present ( #7977 )
2020-04-27 09:59:37 -07:00
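The RAY_ADDRESS commit above describes falling back to an environment variable when connecting to an existing cluster. A minimal sketch of that kind of lookup follows. It is not Ray's actual code: the helper name `resolve_address`, its parameters, and the precedence (an explicit argument wins over the environment variable) are assumptions made for illustration.

```python
import os


def resolve_address(explicit_address=None, environ=None):
    """Pick a cluster address (hypothetical helper, not part of Ray).

    Assumed precedence: an explicitly passed address wins; otherwise
    the RAY_ADDRESS environment variable is used if it is set and
    non-empty; otherwise None is returned, meaning "no existing
    cluster was specified".
    """
    env = os.environ if environ is None else environ
    if explicit_address:
        return explicit_address
    # An empty string is treated the same as an unset variable.
    return env.get("RAY_ADDRESS") or None
```

Passing the environment as a parameter keeps the helper easy to test without mutating `os.environ`.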
Richard Liaw
fa7eecf48a
[sgd] Avoid parameter "gotcha" for learning rate scheduler ( #8107 )
...
* with-scheduler-creator
* none
* add_freq
* runner
* torch
2020-04-21 01:01:04 -07:00
Sven Mika
165a86f1ab
[RLlib] SAC MuJoCo instability issues (tf and torch versions). ( #8063 )
...
SAC (both the torch and tf versions) crashes due to numeric instabilities in the SquashedGaussian distribution (sampling and logp after extreme NN outputs).
This PR fixes these instabilities. Stable MuJoCo learning (HalfCheetah) has been confirmed on both the tf and torch versions. A distribution stability test (using extreme NN outputs) has been added for SquashedGaussian; it can be used for any other type of distribution as well.
2020-04-19 10:20:23 +02:00
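To illustrate the kind of instability the SAC commit above refers to, here is a small sketch of the log-probability of a tanh-squashed Gaussian. It shows a generic stabilization (rewriting log(1 - tanh(u)^2) with softplus) and is not necessarily the fix applied in the PR; all function names are hypothetical.

```python
import math


def softplus(x: float) -> float:
    # log(1 + exp(x)), written so that exp() never overflows.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_det_tanh_naive(u: float) -> float:
    # Direct formula: fails for large |u| because tanh(u) rounds to
    # exactly +/-1.0 and the argument of log becomes 0.
    return math.log(1.0 - math.tanh(u) ** 2)


def log_det_tanh_stable(u: float) -> float:
    # Identity: log(1 - tanh(u)^2) = 2 * (log 2 - u - softplus(-2u)).
    return 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))


def squashed_gaussian_logp(u: float, mean: float, log_std: float) -> float:
    """Log-density of a = tanh(u), where u ~ Normal(mean, exp(log_std))."""
    std = math.exp(log_std)
    gauss_logp = (
        -0.5 * ((u - mean) / std) ** 2
        - log_std
        - 0.5 * math.log(2.0 * math.pi)
    )
    # Change of variables: subtract the log-Jacobian of tanh.
    return gauss_logp - log_det_tanh_stable(u)
```

For moderate inputs both versions of the Jacobian term agree. For an extreme pre-squash value such as `u = 20.0`, the naive version raises a `ValueError` in double precision while the stable version stays finite, which is the sort of case a stability test with extreme NN outputs would exercise.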
Richard Liaw
857e4dba2f
[sgd] HuggingFace GLUE Fine-tuning Example ( #7792 )
...
* Init fp16
* fp16 and schedulers
* scheduler linking and fp16
* to fp16
* loss scaling and documentation
* more documentation
* add tests, refactor config
* moredocs
* more docs
* fix logo, add test mode, add fp16 flag
* fix tests
* fix scheduler
* fix apex
* improve safety
* fix tests
* fix tests
* remove pin memory default
* rm
* fix
* Update doc/examples/doc_code/raysgd_torch_signatures.py
* fix
* migrate changes from other PR
* ok thanks
* pass
* signatures
* lint'
* Update python/ray/experimental/sgd/pytorch/utils.py
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* should address most comments
* comments
* fix this ci
* first_pass
* add overrides
* override
* fixing up operators
* format
* sgd
* constants
* rm
* revert
* save
* failures
* fixes
* trainer
* run test
* operator
* code
* op
* ok done
* operator
* sgd test fixes
* ok
* trainer
* format
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* Update doc/source/raysgd/raysgd_pytorch.rst
* docstring
* dcgan
* doc
* commits
* nit
* testing
* revert
* Start renaming pytorch to torch
* Rename PyTorchTrainer to TorchTrainer
* Rename PyTorch runners to Torch runners
* Finish renaming API
* Rename to torch in tests
* Finish renaming docs + tests
* Run format + fix DeprecationWarning
* fix
* move tests up
* benchmarks
* rename
* remove some args
* better metrics output
* fix up the benchmark
* benchmark-yaml
* horovod-benchmark
* benchmarks
* Remove benchmark code for cleanups
* benchmark-code
* nits
* benchmark yamls
* benchmark yaml
* ok
* ok
* ok
* benchmark
* nit
* finish_bench
* makedatacreator
* relax
* metrics
* autosetsampler
* profile
* movements
* OK
* smoothen
* fix
* nitdocs
* loss
* envflag
* comments
* nit
* format
* visible
* images
* move_images
* fix
* rernder
* rrender
* rest
* multgpu
* fix
* nit
* finish
* extrra
* setup
* experimental
* as_trainable
* fix
* ok
* format
* create_torch_pbt
* setup_pbt
* ok
* format
* ok
* format
* docs
* ok
* Draft head-is-worker
* Fix missing concurrency between local and remote workers
* Fix tqdm to work with head-is-worker
* Cleanup
* Implement state_dict and load_state_dict
* Reserve resources on the head node for the local worker
* Update the development cluster setup
* Add spot block reservation to the development yaml
* ok
* Draft the fault tolerance fix
* Small fixes to local-remote concurrency
* Cleanup + fix typo
* fixes
* worker_counts
* some formatting and asha
* fix
* okme
* fixactorkill
* unify
* Revert the cluster mounts
* Cut the handler-reporter API
* Fix most tests
* Rm tqdm_handler.py
* Re-add tune test
* Automatically force-shutdown on actor errors on shutdown
* Formatting
* fix_tune_test
* Add timeout error verification
* Rename tqdm to use_tqdm
* fixtests
* ok
* remove_redundant
* deprecated
* deactivated
* ok_try_this
* lint
* nice
* done
* retries
* fixes
* kill
* retry
* init_transformer
* init
* deployit
* improve_example
* trans
* rename
* formats
* format-to-py37
* time_to_test
* more_changes
* ok
* update_args_and_script
* fp16_epoch
* huggingface
* training stats
* distributed
* Apply suggestions from code review
* transformer
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-04-17 15:17:30 -07:00
Maksim Smolin
d6f4e5b3e1
[SGD] Imagenet example (basic) ( #8020 )
...
* Checkpoint the image-models example
* Update cluster definition
* Fix copyright info
* Use original args
* Checkpoint fixes
* Add README
* Add some missing features
* Format
* Get rid of the unused Namespace class
* Address comments
* Link the imagenet example in docs
* Cleanup
* Fix lint
2020-04-17 13:33:55 -07:00
Richard Liaw
a9ea139317
[sgd] Make serialization of data creation optional ( #8027 )
...
* pytest
* Update python/ray/util/sgd/torch/torch_trainer.py
Co-Authored-By: Ujval Misra <misraujval@gmail.com>
Co-authored-by: Ujval Misra <misraujval@gmail.com>
2020-04-16 20:27:51 -07:00
Richard Liaw
6545534805
[tune/sgd] DCGAN example self-contained, turn example into modu… ( #8012 )
...
* ok
* done
* run_benchmarks
* should_make_examples_usable
2020-04-16 17:55:27 -07:00
Karthikeyan Singaravelan
f95e18dfeb
[tune/sgd] Import ABC from collections.abc instead of collectio… ( #7982 )
...
* Import ABC from collections.abc instead of collections for Python 3 compatibility.
* Fix linter errors.
2020-04-16 15:26:49 -07:00
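The collections.abc commit above concerns where abstract base classes are imported from. Importing them directly from `collections` was deprecated in Python 3.3 and stopped working in Python 3.10. A short sketch of the compatible import style follows; the `describe` helper is hypothetical and only there to show the imported ABCs in use.

```python
# Compatible with all Python 3 versions; `from collections import Mapping`
# no longer works on Python 3.10 and later.
from collections.abc import Mapping, Sequence


def describe(value):
    """Classify a value using the abstract base classes."""
    if isinstance(value, Mapping):
        return "mapping"
    # str and bytes are Sequences too, so exclude them explicitly.
    if isinstance(value, Sequence) and not isinstance(value, (str, bytes)):
        return "sequence"
    return "other"
```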
Robert Nishihara
d985d7537e
Replace all instances of ray.readthedocs.io with ray.io ( #7994 )
2020-04-13 16:17:05 -07:00
Richard Liaw
dd63178e91
[sgd] Semantic Segmentation Example ( #7825 )
...
* better_example
* test
* improve some usability things
* submit
* fix
* making a segmentation example
* segmentation_example
* segmentation
* device
* flake
* Update python/ray/util/sgd/torch/training_operator.py
* uti
* finished_example
* block
* format
* locationg
* fix
* ok
* revert
* segmentation
* lint_and_test
* address_comments
2020-04-10 20:35:45 -07:00
marload
e3ffb8ac28
[tune] Refactoring: Deduplicate ( #7918 )
...
* refactoring: Deduplication
* refactoring: Deduplication
* refactoring: Deduplication
* refactoring: Deduplication
* lint fix: Variable naming case
* fix: Remove White Space
* fix_lint
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-04-09 20:19:04 -07:00
Simon Mo
59867dad75
Move Jenkins test to Github action ( #7342 )
2020-04-09 10:27:19 -07:00
David Chan
6521e92a95
[RaySGD] Honor the use_gpu flag ( #7942 )
2020-04-08 20:20:09 -07:00
Richard Liaw
f63b4c1110
[sgd] make ddp optional ( #7875 )
...
* loosen
* devices
* tryitout
* fix
* fix
* fix
* easy
* test
* fix
* fix
* better visibility
* fix
2020-04-06 11:41:36 -07:00
Richard Liaw
24bf6ad607
[raysgd] Improve raysgd examples ( #7818 )
...
* better_example
* test
* improve some usability things
* submit
* fix
* flake
* Update python/ray/util/sgd/torch/training_operator.py
* trythis
* fix
* fix
* smoke
* fail
* fix
* fix
2020-04-01 08:58:39 -07:00
Richard Liaw
fbf02fa7f7
[Hotfix] Lint for Documentation ( #7817 )
2020-03-30 11:49:05 -07:00
Richard Liaw
18327254b6
[docs] Fix readthedocs rendering ( #7810 )
2020-03-30 11:40:08 -07:00
Richard Liaw
86cff17e7e
[tune/raysgd] Tune API for TorchTrainer + Fix State Restoration ( #7547 )
2020-03-30 12:58:49 -05:00
Maksim Smolin
7b27ce2b23
[RaySGD] Convert the head worker to a local model ( #7746 )
...
Why are these changes needed?
Running a worker on the head node (locally, not as a Ray actor) makes stateful work such as logging easier to handle and simplifies debugging.
2020-03-27 20:19:15 -07:00
Maksim Smolin
e95455b7d7
[RaySGD] Add tqdm logging to TorchTrainer ( #7588 )
...
* Update issue templates
* Init fp16
* fp16 and schedulers
* scheduler linking and fp16
* to fp16
* loss scaling and documentation
* more documentation
* add tests, refactor config
* moredocs
* more docs
* fix logo, add test mode, add fp16 flag
* fix tests
* fix scheduler
* fix apex
* improve safety
* fix tests
* fix tests
* remove pin memory default
* rm
* fix
* Update doc/examples/doc_code/raysgd_torch_signatures.py
* fix
* migrate changes from other PR
* ok thanks
* pass
* signatures
* lint'
* Update python/ray/experimental/sgd/pytorch/utils.py
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* should address most comments
* comments
* fix this ci
* first_pass
* add overrides
* override
* fixing up operators
* format
* sgd
* constants
* rm
* revert
* Checkpoint the basics
* End of day checkpoint
* Checkpoint log-to-head implementation
* Checkpoint
* Add actor-based batch log reporting, currently segfaults
* Work around progress segfault
* Fix some stuff in quicktorch
* Make things more customizable
* Quality of life fixes
* More quality of life
* Move tqdm logic to training_operator
* Update examples
* Fix some minor bugs
* Fix merge
* Fix small things, add pbar to dcgan
* Run format.sh
* Fix missing epoch number for batch pbar
* Address PR comments
* Fix float is not subscriptable
* Add train_loss to pbar by default
* Isolate tqdm code into a handler system
* Format
* Remove the batch_logs_reporter from distributed runner as well
* Check if the train_loss is available before using it
* Enable tqdm in the dcgan example
* Fix a crash in no-handler trainers
* Fix
* Allow not calling set_reporters for tests
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-03-24 23:43:56 -07:00
Eric Liang
288933ec6b
[rllib] Fix shared metrics context in parallel iterators ( #7666 )
...
* debug
* build
* update
* wip
* wpi
* update
* recursive sync
* comment
* stream
* fix
* Update .travis.yml
2020-03-22 14:15:01 -07:00
Eric Liang
797e6cfc2a
[rllib][tune] fix some nans ( #7611 )
2020-03-16 11:19:58 -07:00
Eric Liang
f5d12a958b
[rllib] Port Ape-X to distributed execution API ( #7497 )
2020-03-12 00:54:08 -07:00
Richard Liaw
b70f31339c
[sgd] Benchmark Fixes ( #7553 )
...
* fix
* fix
2020-03-11 13:08:27 -07:00
Richard Liaw
fbac256982
[sgd] Add benchmarks ( #7454 )
...
* Init fp16
* fp16 and schedulers
* scheduler linking and fp16
* to fp16
* loss scaling and documentation
* more documentation
* add tests, refactor config
* moredocs
* more docs
* fix logo, add test mode, add fp16 flag
* fix tests
* fix scheduler
* fix apex
* improve safety
* fix tests
* fix tests
* remove pin memory default
* rm
* fix
* Update doc/examples/doc_code/raysgd_torch_signatures.py
* fix
* migrate changes from other PR
* ok thanks
* pass
* signatures
* lint'
* Update python/ray/experimental/sgd/pytorch/utils.py
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* should address most comments
* comments
* fix this ci
* first_pass
* add overrides
* override
* fixing up operators
* format
* sgd
* constants
* rm
* revert
* save
* failures
* fixes
* trainer
* run test
* operator
* code
* op
* ok done
* operator
* sgd test fixes
* ok
* trainer
* format
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* Update doc/source/raysgd/raysgd_pytorch.rst
* docstring
* dcgan
* doc
* commits
* nit
* testing
* revert
* Start renaming pytorch to torch
* Rename PyTorchTrainer to TorchTrainer
* Rename PyTorch runners to Torch runners
* Finish renaming API
* Rename to torch in tests
* Finish renaming docs + tests
* Run format + fix DeprecationWarning
* fix
* move tests up
* benchmarks
* rename
* remove some args
* better metrics output
* fix up the benchmark
* benchmark-yaml
* horovod-benchmark
* benchmarks
* Remove benchmark code for cleanups
* benchmark-code
* nits
* benchmark yamls
* benchmark yaml
* ok
* ok
* ok
* benchmark
* nit
* finish_bench
* makedatacreator
* relax
* metrics
* autosetsampler
* profile
* movements
* OK
* smoothen
* fix
* nitdocs
* loss
* envflag
* comments
* nit
* format
* visible
* images
* move_images
* fix
* rernder
* rrender
* rest
* multgpu
* fix
* nit
* finish
* extrra
* setup
* revert
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-03-11 01:09:08 -07:00
Richard Liaw
6163b21458
[raysgd] Better user errors! ( #7546 )
...
* format
* callable
* Update python/ray/util/sgd/torch/torch_trainer.py
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* Update python/ray/util/sgd/torch/torch_trainer.py
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* data
* torchtrainer
* num_rep
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-03-10 18:58:19 -07:00
Richard Liaw
d192ef0611
[raysgd] Cleanup User API ( #7384 )
...
* Init fp16
* fp16 and schedulers
* scheduler linking and fp16
* to fp16
* loss scaling and documentation
* more documentation
* add tests, refactor config
* moredocs
* more docs
* fix logo, add test mode, add fp16 flag
* fix tests
* fix scheduler
* fix apex
* improve safety
* fix tests
* fix tests
* remove pin memory default
* rm
* fix
* Update doc/examples/doc_code/raysgd_torch_signatures.py
* fix
* migrate changes from other PR
* ok thanks
* pass
* signatures
* lint'
* Update python/ray/experimental/sgd/pytorch/utils.py
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* should address most comments
* comments
* fix this ci
* first_pass
* add overrides
* override
* fixing up operators
* format
* sgd
* constants
* rm
* revert
* save
* failures
* fixes
* trainer
* run test
* operator
* code
* op
* ok done
* operator
* sgd test fixes
* ok
* trainer
* format
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* Update doc/source/raysgd/raysgd_pytorch.rst
* docstring
* dcgan
* doc
* commits
* nit
* testing
* revert
* Start renaming pytorch to torch
* Rename PyTorchTrainer to TorchTrainer
* Rename PyTorch runners to Torch runners
* Finish renaming API
* Rename to torch in tests
* Finish renaming docs + tests
* Run format + fix DeprecationWarning
* fix
* move tests up
* benchmarks
* rename
* remove some args
* better metrics output
* fix up the benchmark
* benchmark-yaml
* horovod-benchmark
* benchmarks
* Remove benchmark code for cleanups
* makedatacreator
* relax
* metrics
* autosetsampler
* profile
* movements
* OK
* smoothen
* fix
* nitdocs
* loss
* comments
* fix
* fix
* runner_tests
* codes
* example
* fix_test
* fix
* tests
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-03-10 08:41:42 -07:00
Eric Liang
90e23a5c43
[iterators] Add duplicate() call and fix broken test case ( #7510 )
2020-03-09 17:18:52 -07:00
Eric Liang
a644060daa
[rllib] First pass at pipeline implementation of DQN ( #7433 )
...
* wip iters
* add test
* speed up
* update docs
* document it
* support serial sampling
* add test
* spacing
* annotate it
* update
* rename to pipeline
* comment
* iter2 wip
* update
* update
* context test
* update
* fix
* fix
* a3c pipeline
* doc
* update
* move timer
* comment
* add pipeline test
* fix
* clean up
* document
* iter s
* wip dqn
* wip
* wip
* metrics
* metrics rename
* metrics ctx
* wip
* constants
* add todo
* support .union
* wip
* support union
* remove prints
* add todo
* remove auto timer
* fix up
* fix pipeline test
* typing
* fix breakage
* remove bad assert
* wip
* fix multiagent example
* fixapply
* update a3c
* remove a2c pl
* 0 workers
* wip
* wip
* share metrics
* wip
* wip
* doc
* fix weight sync and global var updates
* mode
* fix
* fix
* doc
* fix
2020-03-07 14:47:58 -08:00
Eric Liang
476b5c6196
[Parallel Iterators] Allow for operator chaining after repartition ( #7268 )
...
* bug fix repartition
* change add_transform from private to inner
* formatting
* addressing comments
* formatting
2020-03-04 14:42:52 -08:00
Maksim Smolin
3a134c7224
[RaySGD] Rename PyTorch API endpoints to start with Torch ( #7425 )
...
* Start renaming pytorch to torch
* Rename PyTorchTrainer to TorchTrainer
* Rename PyTorch runners to Torch runners
* Finish renaming API
* Rename to torch in tests
* Finish renaming docs + tests
* Run format + fix DeprecationWarning
* fix
* move tests up
* rename
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-03 16:44:42 -08:00
Edward Oakes
b0bf5450c2
Fix flaky multiprocessing tests ( #7413 )
2020-03-03 15:07:59 -06:00
Edward Oakes
04ec599441
Use ray.kill() in multiprocessing.Pool ( #7409 )
2020-03-03 12:49:13 -06:00
Richard Liaw
48cdca843f
[raysgd] Custom training operator ( #7211 )
2020-03-01 21:22:48 -08:00
Eric Liang
3c6b94f3f5
[rllib] Enable performance metrics reporting for RLlib pipelines, add A3C ( #7299 )
2020-02-28 16:44:17 -08:00
Sven Mika
357232d124
[Core/RLlib] Move log_once from rllib to ray.util. ( #7273 )
...
* Move log_once from rllib to tune.
* Move log_once from rllib to tune.
* LINT.
* Move to ray.util.debug.
2020-02-27 10:40:44 -08:00
Amog Kamsetty
1737a113be
[Parallel Iterators] Repartition functionality ( #7163 )
...
* repartition and tests
* blacklist lib/ files from import checks
* addressing comments and splitting up tests
* code readability
* adding explicit ref for parent iterator
* formatting
2020-02-21 13:20:18 -08:00
Eric Liang
5df801605e
Add ray.util package and move libraries from experimental ( #7100 )
2020-02-18 13:43:19 -08:00
Edward Oakes
dc5a27dac0
Move ray.experimental.multiprocessing to ray.util.multiprocessing ( #7149 )
2020-02-14 16:17:05 -08:00