Commit Graph

48 Commits

Author SHA1 Message Date
Edward Oakes 2677b71003 Implement named actors using the GCS service (#8328) 2020-05-09 08:58:10 -05:00
Eric Liang 9f04a65922 [rllib] Add PPO+DQN two trainer multiagent workflow example (#8334) 2020-05-07 23:40:29 -07:00
Alex Wu 04813c2ef5 [Parallel Iterator] Foreach concur (#8140) 2020-05-06 10:00:01 -05:00
Eric Liang ee0eb44a32 Rename async_queue_depth -> num_async (#8207)
* rename

* lint
2020-05-05 01:38:10 -07:00
Xianyang Liu eda526c154 [SGD] Support multiple input model (#8246) 2020-05-02 16:49:09 -07:00
Maksim Smolin c2acb7ffe2 [SGD] Add imagenet example CI (#8150) 2020-05-02 16:48:35 -07:00
Richard Liaw 35eac2671e [sgd] Resource limit lift for GPU test (#8238) 2020-04-30 00:24:48 -07:00
Xianyang Liu fbf23eb6ff [SGD] Fix IterableDataset errors (#8208) 2020-04-29 10:51:31 -07:00
Neil Lugovoy 8cf598deab [sgd] Fix GPU Reservations in LocalDistributedRunner (#8157) 2020-04-27 16:03:33 -07:00
Philipp Moritz d7da25eee1 Use RAY_ADDRESS to connect to an existing Ray cluster if present (#7977) 2020-04-27 09:59:37 -07:00
Richard Liaw fa7eecf48a [sgd] Avoid parameter "gotcha" for learning rate scheduler (#8107)
* with-scheduler-creator

* none

* add_freq

* runner

* torch
2020-04-21 01:01:04 -07:00
Sven Mika 165a86f1ab [RLlib] SAC MuJoCo instability issues (tf and torch versions). (#8063)
SAC (both torch and tf versions) crashes due to numeric instabilities in the SquashedGaussian distribution (sampling and logp after extreme NN outputs).
This PR fixes them. Stable MuJoCo learning (HalfCheetah) has been confirmed on both the tf and torch versions. A distribution stability test (using extreme NN outputs) has been added for SquashedGaussian; it can be used for any other type of distribution as well.
2020-04-19 10:20:23 +02:00
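
The SquashedGaussian instability described in commit 165a86f1ab above can be illustrated with a small sketch. This is not the code from that PR; it is a minimal stdlib-only illustration of one common way to keep the log-probability of a tanh-squashed Gaussian finite for extreme pre-squash values. The function names and the log_std clamp bounds are assumptions chosen for this example.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def naive_log_one_minus_tanh_sq(u: float) -> float:
    """Direct formula; breaks once tanh(u) rounds to exactly 1.0."""
    return math.log(1.0 - math.tanh(u) ** 2)


def log_one_minus_tanh_sq(u: float) -> float:
    """Stable form: log(1 - tanh(u)^2) = 2 * (log 2 - u - softplus(-2u))."""
    return 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))


def squashed_gaussian_logp(u: float, mean: float, log_std: float) -> float:
    """Log-density of a = tanh(u), where u ~ Normal(mean, exp(log_std))."""
    # Clamp log_std so extreme network outputs cannot overflow exp().
    # The bounds -20 and 2 are illustrative assumptions.
    log_std = min(max(log_std, -20.0), 2.0)
    std = math.exp(log_std)
    normal_logp = (
        -0.5 * ((u - mean) / std) ** 2
        - log_std
        - 0.5 * math.log(2.0 * math.pi)
    )
    # Change of variables: subtract log|d tanh(u) / du|.
    return normal_logp - log_one_minus_tanh_sq(u)
```

For moderate inputs the two correction terms agree. For a large pre-squash value such as u = 20, tanh(u) rounds to 1.0 in double precision, so the direct formula takes the logarithm of zero, while the rewritten form stays finite.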
Richard Liaw 857e4dba2f [sgd] HuggingFace GLUE Fine-tuning Example (#7792)
* Init fp16

* fp16 and schedulers

* scheduler linking and fp16

* to fp16

* loss scaling and documentation

* more documentation

* add tests, refactor config

* moredocs

* more docs

* fix logo, add test mode, add fp16 flag

* fix tests

* fix scheduler

* fix apex

* improve safety

* fix tests

* fix tests

* remove pin memory default

* rm

* fix

* Update doc/examples/doc_code/raysgd_torch_signatures.py

* fix

* migrate changes from other PR

* ok thanks

* pass

* signatures

* lint'

* Update python/ray/experimental/sgd/pytorch/utils.py

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* should address most comments

* comments

* fix this ci

* first_pass

* add overrides

* override

* fixing up operators

* format

* sgd

* constants

* rm

* revert

* save

* failures

* fixes

* trainer

* run test

* operator

* code

* op

* ok done

* operator

* sgd test fixes

* ok

* trainer

* format

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update doc/source/raysgd/raysgd_pytorch.rst

* docstring

* dcgan

* doc

* commits

* nit

* testing

* revert

* Start renaming pytorch to torch

* Rename PyTorchTrainer to TorchTrainer

* Rename PyTorch runners to Torch runners

* Finish renaming API

* Rename to torch in tests

* Finish renaming docs + tests

* Run format + fix DeprecationWarning

* fix

* move tests up

* benchmarks

* rename

* remove some args

* better metrics output

* fix up the benchmark

* benchmark-yaml

* horovod-benchmark

* benchmarks

* Remove benchmark code for cleanups

* benchmark-code

* nits

* benchmark yamls

* benchmark yaml

* ok

* ok

* ok

* benchmark

* nit

* finish_bench

* makedatacreator

* relax

* metrics

* autosetsampler

* profile

* movements

* OK

* smoothen

* fix

* nitdocs

* loss

* envflag

* comments

* nit

* format

* visible

* images

* move_images

* fix

* rernder

* rrender

* rest

* multgpu

* fix

* nit

* finish

* extrra

* setup

* experimental

* as_trainable

* fix

* ok

* format

* create_torch_pbt

* setup_pbt

* ok

* format

* ok

* format

* docs

* ok

* Draft head-is-worker

* Fix missing concurrency between local and remote workers

* Fix tqdm to work with head-is-worker

* Cleanup

* Implement state_dict and load_state_dict

* Reserve resources on the head node for the local worker

* Update the development cluster setup

* Add spot block reservation to the development yaml

* ok

* Draft the fault tolerance fix

* Small fixes to local-remote concurrency

* Cleanup + fix typo

* fixes

* worker_counts

* some formatting and asha

* fix

* okme

* fixactorkill

* unify

* Revert the cluster mounts

* Cut the handler-reporter API

* Fix most tests

* Rm tqdm_handler.py

* Re-add tune test

* Automatically force-shutdown on actor errors on shutdown

* Formatting

* fix_tune_test

* Add timeout error verification

* Rename tqdm to use_tqdm

* fixtests

* ok

* remove_redundant

* deprecated

* deactivated

* ok_try_this

* lint

* nice

* done

* retries

* fixes

* kill

* retry

* init_transformer

* init

* deployit

* improve_example

* trans

* rename

* formats

* format-to-py37

* time_to_test

* more_changes

* ok

* update_args_and_script

* fp16_epoch

* huggingface

* training stats

* distributed

* Apply suggestions from code review

* transformer

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-04-17 15:17:30 -07:00
Maksim Smolin d6f4e5b3e1 [SGD] Imagenet example (basic) (#8020)
* Checkpoint the image-models example

* Update cluster definition

* Fix copyright info

* Use original args

* Checkpoint fixes

* Add README

* Add some missing features

* Format

* Get rid of the unused Namespace class

* Address comments

* Link the imagenet example in docs

* Cleanup

* Fix lint
2020-04-17 13:33:55 -07:00
Richard Liaw a9ea139317 [sgd] Make serialization of data creation optional (#8027)
* pytest

* Update python/ray/util/sgd/torch/torch_trainer.py

Co-Authored-By: Ujval Misra <misraujval@gmail.com>

Co-authored-by: Ujval Misra <misraujval@gmail.com>
2020-04-16 20:27:51 -07:00
Richard Liaw 6545534805 [tune/sgd] DCGAN example self-contained, turn example into modu… (#8012)
* ok

* done

* run_benchmarks

* should_make_examples_usable
2020-04-16 17:55:27 -07:00
Karthikeyan Singaravelan f95e18dfeb [tune/sgd] Import ABC from collections.abc instead of collectio… (#7982)
* Import ABC from collections.abc instead of collections for Python 3 compatibility.

* Fix linter errors.
2020-04-16 15:26:49 -07:00
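
The import change named in commit f95e18dfeb above follows a general Python pattern: the abstract base classes live in collections.abc, and the old aliases in collections were deprecated and later removed in Python 3.10. A minimal sketch of the pattern, using a hypothetical deep_update helper that is not taken from the Ray code base:

```python
# Import the ABC from collections.abc, not from collections.
from collections.abc import Mapping


def deep_update(base: dict, overrides: Mapping) -> dict:
    """Recursively merge `overrides` into `base` (hypothetical helper)."""
    for key, value in overrides.items():
        # Recurse only when both sides hold nested mappings.
        if isinstance(value, Mapping) and isinstance(base.get(key), dict):
            deep_update(base[key], value)
        else:
            base[key] = value
    return base


merged = deep_update({"a": {"b": 1}}, {"a": {"c": 2}, "d": 3})
```

Checking against Mapping rather than dict lets the helper accept any mapping-like overrides object.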
Robert Nishihara d985d7537e Replace all instances of ray.readthedocs.io with ray.io (#7994) 2020-04-13 16:17:05 -07:00
Richard Liaw dd63178e91 [sgd] Semantic Segmentation Example (#7825)
* better_example

* test

* improve some usability things

* submit

* fix

* making a segmentation example

* segmentation_example

* segmentation

* device

* flake

* Update python/ray/util/sgd/torch/training_operator.py

* uti

* finished_example

* block

* format

* locationg

* fix

* ok

* revert

* segmentation

* lint_and_test

* address_comments
2020-04-10 20:35:45 -07:00
marload e3ffb8ac28 [tune] Refactoring: Deduplicate (#7918)
* refactoring: Deduplication

* refactoring: Deduplication

* refactoring: Deduplication

* refactoring: Deduplication

* lint fix: Variable naming case

* fix: Remove White Space

* fix_lint

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-04-09 20:19:04 -07:00
Simon Mo 59867dad75 Move Jenkins test to Github action (#7342) 2020-04-09 10:27:19 -07:00
David Chan 6521e92a95 [RaySGD] Honor the use_gpu flag (#7942) 2020-04-08 20:20:09 -07:00
Richard Liaw f63b4c1110 [sgd] make ddp optional (#7875)
* loosen

* devices

* tryitout

* fix

* fix

* fix

* easy

* test

* fix

* fix

* better visibility

* fix
2020-04-06 11:41:36 -07:00
Richard Liaw 24bf6ad607 [raysgd] Improve raysgd examples (#7818)
* better_example

* test

* improve some usability things

* submit

* fix

* flake

* Update python/ray/util/sgd/torch/training_operator.py

* trythis

* fix

* fix

* smoke

* fail

* fix

* fix
2020-04-01 08:58:39 -07:00
Richard Liaw fbf02fa7f7 [Hotfix] Lint for Documentation (#7817) 2020-03-30 11:49:05 -07:00
Richard Liaw 18327254b6 [docs] Fix readthedocs rendering (#7810) 2020-03-30 11:40:08 -07:00
Richard Liaw 86cff17e7e [tune/raysgd] Tune API for TorchTrainer + Fix State Restoration (#7547) 2020-03-30 12:58:49 -05:00
Maksim Smolin 7b27ce2b23 [RaySGD] Convert the head worker to a local model (#7746)
Why are these changes needed?

Running a worker on the head node (locally, not as a Ray actor) makes stateful concerns such as logging easier to handle and simplifies debugging.
2020-03-27 20:19:15 -07:00
Maksim Smolin e95455b7d7 [RaySGD] Add tqdm logging to TorchTrainer (#7588)
* Update issue templates

* Init fp16

* fp16 and schedulers

* scheduler linking and fp16

* to fp16

* loss scaling and documentation

* more documentation

* add tests, refactor config

* moredocs

* more docs

* fix logo, add test mode, add fp16 flag

* fix tests

* fix scheduler

* fix apex

* improve safety

* fix tests

* fix tests

* remove pin memory default

* rm

* fix

* Update doc/examples/doc_code/raysgd_torch_signatures.py

* fix

* migrate changes from other PR

* ok thanks

* pass

* signatures

* lint'

* Update python/ray/experimental/sgd/pytorch/utils.py

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* should address most comments

* comments

* fix this ci

* first_pass

* add overrides

* override

* fixing up operators

* format

* sgd

* constants

* rm

* revert

* Checkpoint the basics

* End of day checkpoint

* Checkpoint log-to-head implementation

* Checkpoint

* Add actor-based batch log reporting, currently segfaults

* Work around progress segfault

* Fix some stuff in quicktorch

* Make things more customizable

* Quality of life fixes

* More quality of life

* Move tqdm logic to training_operator

* Update examples

* Fix some minor bugs

* Fix merge

* Fix small things, add pbar to dcgan

* Run format.sh

* Fix missing epoch number for batch pbar

* Address PR comments

* Fix float is not subscriptable

* Add train_loss to pbar by default

* Isolate tqdm code into a handler system

* Format

* Remove the batch_logs_reporter from distributed runner as well

* Check if the train_loss is available before using it

* Enable tqdm in the dcgan example

* Fix a crash in no-handler trainers

* Fix

* Allow not calling set_reporters for tests

Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-03-24 23:43:56 -07:00
Eric Liang 288933ec6b [rllib] Fix shared metrics context in parallel iterators (#7666)
* debug

* build

* update

* wip

* wpi

* update

* recursive sync

* comment

* stream

* fix

* Update .travis.yml
2020-03-22 14:15:01 -07:00
Eric Liang 797e6cfc2a [rllib][tune] fix some nans (#7611) 2020-03-16 11:19:58 -07:00
Eric Liang f5d12a958b [rllib] Port Ape-X to distributed execution API (#7497) 2020-03-12 00:54:08 -07:00
Richard Liaw b70f31339c [sgd] Benchmark Fixes (#7553)
* fix

* fix
2020-03-11 13:08:27 -07:00
Richard Liaw fbac256982 [sgd] Add benchmarks (#7454)
* Init fp16

* fp16 and schedulers

* scheduler linking and fp16

* to fp16

* loss scaling and documentation

* more documentation

* add tests, refactor config

* moredocs

* more docs

* fix logo, add test mode, add fp16 flag

* fix tests

* fix scheduler

* fix apex

* improve safety

* fix tests

* fix tests

* remove pin memory default

* rm

* fix

* Update doc/examples/doc_code/raysgd_torch_signatures.py

* fix

* migrate changes from other PR

* ok thanks

* pass

* signatures

* lint'

* Update python/ray/experimental/sgd/pytorch/utils.py

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* should address most comments

* comments

* fix this ci

* first_pass

* add overrides

* override

* fixing up operators

* format

* sgd

* constants

* rm

* revert

* save

* failures

* fixes

* trainer

* run test

* operator

* code

* op

* ok done

* operator

* sgd test fixes

* ok

* trainer

* format

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update doc/source/raysgd/raysgd_pytorch.rst

* docstring

* dcgan

* doc

* commits

* nit

* testing

* revert

* Start renaming pytorch to torch

* Rename PyTorchTrainer to TorchTrainer

* Rename PyTorch runners to Torch runners

* Finish renaming API

* Rename to torch in tests

* Finish renaming docs + tests

* Run format + fix DeprecationWarning

* fix

* move tests up

* benchmarks

* rename

* remove some args

* better metrics output

* fix up the benchmark

* benchmark-yaml

* horovod-benchmark

* benchmarks

* Remove benchmark code for cleanups

* benchmark-code

* nits

* benchmark yamls

* benchmark yaml

* ok

* ok

* ok

* benchmark

* nit

* finish_bench

* makedatacreator

* relax

* metrics

* autosetsampler

* profile

* movements

* OK

* smoothen

* fix

* nitdocs

* loss

* envflag

* comments

* nit

* format

* visible

* images

* move_images

* fix

* rernder

* rrender

* rest

* multgpu

* fix

* nit

* finish

* extrra

* setup

* revert

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-03-11 01:09:08 -07:00
Richard Liaw 6163b21458 [raysgd] Better user errors! (#7546)
* format

* callable

* Update python/ray/util/sgd/torch/torch_trainer.py

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update python/ray/util/sgd/torch/torch_trainer.py

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* data

* torchtrainer

* num_rep

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-03-10 18:58:19 -07:00
Richard Liaw d192ef0611 [raysgd] Cleanup User API (#7384)
* Init fp16

* fp16 and schedulers

* scheduler linking and fp16

* to fp16

* loss scaling and documentation

* more documentation

* add tests, refactor config

* moredocs

* more docs

* fix logo, add test mode, add fp16 flag

* fix tests

* fix scheduler

* fix apex

* improve safety

* fix tests

* fix tests

* remove pin memory default

* rm

* fix

* Update doc/examples/doc_code/raysgd_torch_signatures.py

* fix

* migrate changes from other PR

* ok thanks

* pass

* signatures

* lint'

* Update python/ray/experimental/sgd/pytorch/utils.py

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* should address most comments

* comments

* fix this ci

* first_pass

* add overrides

* override

* fixing up operators

* format

* sgd

* constants

* rm

* revert

* save

* failures

* fixes

* trainer

* run test

* operator

* code

* op

* ok done

* operator

* sgd test fixes

* ok

* trainer

* format

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update doc/source/raysgd/raysgd_pytorch.rst

* docstring

* dcgan

* doc

* commits

* nit

* testing

* revert

* Start renaming pytorch to torch

* Rename PyTorchTrainer to TorchTrainer

* Rename PyTorch runners to Torch runners

* Finish renaming API

* Rename to torch in tests

* Finish renaming docs + tests

* Run format + fix DeprecationWarning

* fix

* move tests up

* benchmarks

* rename

* remove some args

* better metrics output

* fix up the benchmark

* benchmark-yaml

* horovod-benchmark

* benchmarks

* Remove benchmark code for cleanups

* makedatacreator

* relax

* metrics

* autosetsampler

* profile

* movements

* OK

* smoothen

* fix

* nitdocs

* loss

* comments

* fix

* fix

* runner_tests

* codes

* example

* fix_test

* fix

* tests

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-03-10 08:41:42 -07:00
Eric Liang 90e23a5c43 [iterators] Add duplicate() call and fix broken test case (#7510) 2020-03-09 17:18:52 -07:00
Eric Liang a644060daa [rllib] First pass at pipeline implementation of DQN (#7433)
* wip iters

* add test

* speed up

* update docs

* document it

* support serial sampling

* add test

* spacing

* annotate it

* update

* rename to pipeline

* comment

* iter2 wip

* update

* update

* context test

* update

* fix

* fix

* a3c pipeline

* doc

* update

* move timer

* comment

* add pipeline test

* fix

* clean up

* document

* iter s

* wip dqn

* wip

* wip

* metrics

* metrics rename

* metrics ctx

* wip

* constants

* add todo

* support .union

* wip

* support union

* remove prints

* add todo

* remove auto timer

* fix up

* fix pipeline test

* typing

* fix breakage

* remove bad assert

* wip

* fix multiagent example

* fixapply

* update a3c

* remove a2c pl

* 0 workers

* wip

* wip

* share metrics

* wip

* wip

* doc

* fix weight sync and global var updates

* mode

* fix

* fix

* doc

* fix
2020-03-07 14:47:58 -08:00
Eric Liang 476b5c6196 [Parallel Iterators] Allow for operator chaining after repartition (#7268)
* bug fix repartition

* change add_transform from private to inner

* formatting

* addressing comments

* formatting
2020-03-04 14:42:52 -08:00
Maksim Smolin 3a134c7224 [RaySGD] Rename PyTorch API endpoints to start with Torch (#7425)
* Start renaming pytorch to torch

* Rename PyTorchTrainer to TorchTrainer

* Rename PyTorch runners to Torch runners

* Finish renaming API

* Rename to torch in tests

* Finish renaming docs + tests

* Run format + fix DeprecationWarning

* fix

* move tests up

* rename

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-03 16:44:42 -08:00
Edward Oakes b0bf5450c2 Fix flaky multiprocessing tests (#7413) 2020-03-03 15:07:59 -06:00
Edward Oakes 04ec599441 Use ray.kill() in multiprocessing.Pool (#7409) 2020-03-03 12:49:13 -06:00
Richard Liaw 48cdca843f [raysgd] Custom training operator (#7211) 2020-03-01 21:22:48 -08:00
Eric Liang 3c6b94f3f5 [rllib] Enable performance metrics reporting for RLlib pipelines, add A3C (#7299) 2020-02-28 16:44:17 -08:00
Sven Mika 357232d124 [Core/RLlib] Move log_once from rllib to ray.util. (#7273)
* Move log_once from rllib to tune.

* Move log_once from rllib to tune.

* LINT.

* Move to ray.util.debug.
2020-02-27 10:40:44 -08:00
Amog Kamsetty 1737a113be [Parallel Iterators] Repartition functionality (#7163)
* repartition and tests

* blacklist lib/ files from import checks

* addressing comments and splitting up tests

* code readability

* adding explicit ref for parent iterator

* formatting
2020-02-21 13:20:18 -08:00
Eric Liang 5df801605e Add ray.util package and move libraries from experimental (#7100) 2020-02-18 13:43:19 -08:00
Edward Oakes dc5a27dac0 Move ray.experimental.multiprocessing to ray.util.multiprocessing (#7149) 2020-02-14 16:17:05 -08:00