Commit Graph

2285 Commits

Author SHA1 Message Date
Edward Oakes c2788ba299 [serve] Master actor fault tolerance (#8116) 2020-05-01 10:57:55 -07:00
Richard Liaw c2ef9fee74 [sgd] Resource limit lift for GPU test (#8238) 2020-05-01 00:22:28 -07:00
ijrsvt 7054eabf1f Remove logging (#8211) 2020-04-29 14:12:11 -07:00
SangBin Cho c6217e53e3 Updated Version to 0.8.5. 2020-04-28 00:00:08 -07:00
aannadi eb790bf3a3 [Dashboard] Set logdir in Tune Dashboard and TensorBoard Opt-in (#8074) 2020-04-27 20:17:52 -07:00
Richard Liaw be5235d982 [tune] Clarify Intro Tune Documentation (#8201) 2020-04-27 18:01:00 -07:00
ijrsvt a77e5a8cbf [Doc] Fix Docstring for Task Cancellation (#8198) 2020-04-27 17:06:08 -07:00
Neil Lugovoy 8cf598deab [sgd] Fix GPU Reservations in LocalDistributedRunner (#8157) 2020-04-27 16:03:33 -07:00
Robert Nishihara 48250217ac Fix API documentation formatting. (#8197) 2020-04-27 10:48:42 -07:00
Philipp Moritz d7da25eee1 Use RAY_ADDRESS to connect to an existing Ray cluster if present (#7977) 2020-04-27 09:59:37 -07:00
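
The behavior named in the entry above (pick up RAY_ADDRESS when it is set) might look roughly like the following minimal sketch. The helper name `resolve_cluster_address`, its parameters, and the rule that an explicitly passed address takes precedence are assumptions made for illustration; they are not taken from the Ray source.

```python
import os


def resolve_cluster_address(explicit_address=None, environ=None):
    """Hypothetical helper: decide which cluster address to connect to.

    Assumed precedence (not confirmed by the commit log):
    1. an address passed explicitly by the caller,
    2. the RAY_ADDRESS environment variable, if present and non-empty,
    3. None, meaning "start a new local instance".
    """
    env = os.environ if environ is None else environ
    if explicit_address:
        return explicit_address
    # An empty string is treated the same as an unset variable.
    return env.get("RAY_ADDRESS") or None
```

Passing `environ` as a plain dict keeps the helper easy to exercise without modifying the real process environment.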
Richard Liaw 87557a00fa [tune] Refactor search algorithms (#7037)
* start refactoring of search algorithms

* format

* needs tests

* fix

* suggestions

* Fix PBT

* lint

* refactoring

* hyperopt_working

* dragonfly

* hyperopt

* change_half_of_algs

* save

* code-removed

* remove_lots_of_unneccessary

* changes

* formatting

* suggest

* reset

* rm

* tests

* search-change

* exception

* refactor-doc

* search

* py

* moredocs

* Update doc/source/tune-searchalg.rst

* concurrency

* max

* tune

* betterwarning

* bohb

* tests

* test-change

Co-authored-by: ujvl <misraujval@gmail.com>
2020-04-27 08:51:13 -07:00
Richard Liaw 5bc6e32c0a [autoscaler] latest_dlami update (#8178) 2020-04-26 00:25:46 -07:00
ijrsvt 69ff7e3e35 TaskCancellation (#7669)
* Smol comment

* WIP, not passing ray.init

* Fixed small problem

* wip

* Pseudo interrupt things

* Basic prototype operational

* correct proc title

* Mostly done

* Cleanup

* cleaner raylet error

* Cleaning up a few loose ends

* Fixing Race Conds

* Prelim testing

* Fixing comments and adding second_check for kill

* Working_new_impl

* demo_ready

* Fixing my english

* Fixing a few problems

* Small problems

* Cleaning up

* Response to changes

* Fixing error passing

* Merged to master

* fixing lock

* Cleaning up print statements

* Format

* Fixing Unit test build failure

* mock_worker fix

* java_fix

* Canel

* Switching to Cancel

* Responding to Review

* FixFormatting

* Lease cancellation

* FInal comments?

* Moving exist check to CoreWorker

* Fix Actor Transport Test

* Fixing task manager test

* chaning clock repr

* Fix build

* fix white space

* lint fix

* Updating to medium size

* Fixing Java test compilation issue

* lengthen bad timeouts
2020-04-25 16:04:52 -07:00
Richard Liaw 9dd3490c38 [tune] Safer try-catch for TensorboardX (#8174)
Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>
2020-04-25 13:08:37 -07:00
Simon Mo 13c14eac07 [Asyncio] Remove async init legacy code (#8177)
* [Asyncio] Remove async init legacy code

* Fix places that call async_init
2020-04-25 09:32:38 -07:00
Edward Oakes 9dc625318f [serve] Add basic test for specifying the method in a serve call (#8172) 2020-04-24 20:15:27 -05:00
Scott Graham 0dc01d8c1e [autoscaler] Azure versioning (#8168) 2020-04-24 17:03:55 -07:00
Nick Matthews a9d8d16b6b Change memory monitor warning to a logging call (#8137) 2020-04-22 21:29:18 -07:00
yncxcw 51559c08b9 Fix mis-memory counting in memory monitor for container environment (#8113)
Co-authored-by: weich <weich@nvidia.com>
2020-04-22 14:32:35 -07:00
Edward Oakes 0bb918f2b1 Disable eager execution to fix test_tensorflow (#8133) 2020-04-22 15:54:42 -05:00
Edward Oakes f9f41e5a1a [serve] Fix nonblocking serve.init() (#8068) 2020-04-22 11:51:27 -05:00
Max Fitton c486b56c58 Improve Serve API Input Validations (#8124)
* Add additional validation to endpoint and backend creation that ensures there are not duplicates created of either of these. In addition, adds additional validation to split_traffic to make sure both the endpoint and backends exist.

* Fix test to deal with removed serve.link

* Address PR feedback

Co-authored-by: Max Fitton <max@semprehealth.com>
2020-04-21 19:45:29 -07:00
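
The validations described in the entry above (no duplicate endpoints or backends, and traffic splits that only reference existing ones) can be sketched as follows. The `TrafficRegistry` class and its method bodies are hypothetical and exist only to illustrate the checks; this is not Ray Serve's implementation or API.

```python
class TrafficRegistry:
    """Hypothetical in-memory registry illustrating the validation rules."""

    def __init__(self):
        self.endpoints = set()
        self.backends = set()
        self.traffic = {}  # endpoint name -> {backend tag: weight}

    def create_endpoint(self, name):
        # Reject a second endpoint with the same name.
        if name in self.endpoints:
            raise ValueError(f"endpoint {name!r} already exists")
        self.endpoints.add(name)

    def create_backend(self, tag):
        # Reject a second backend with the same tag.
        if tag in self.backends:
            raise ValueError(f"backend {tag!r} already exists")
        self.backends.add(tag)

    def split_traffic(self, endpoint, weights):
        # Both the endpoint and every referenced backend must exist.
        if endpoint not in self.endpoints:
            raise ValueError(f"endpoint {endpoint!r} does not exist")
        missing = sorted(tag for tag in weights if tag not in self.backends)
        if missing:
            raise ValueError(f"unknown backends: {missing}")
        self.traffic[endpoint] = dict(weights)
```

Failing early with a `ValueError` is one possible design; the actual change may report errors differently.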
Simon Mo 95e8ec8c47 [CI] Dashboard+ Tensorboard Lint Hotfix (#8125) 2020-04-21 16:52:58 -07:00
Edward Oakes 505f3a8714 [serve] Remove serve.link(), rename serve.split() -> serve.set_traffic() (#8072) 2020-04-21 14:26:07 -05:00
Richard Liaw 6799fbbd5e [dashboard] Temporarily disable tensorboard (#8121) 2020-04-21 10:40:46 -07:00
mehrdadn 0a54407961 [CI] Factor out more Travis code and update GitHub Actions (#8085) 2020-04-21 09:53:08 -07:00
Richard Liaw fa7eecf48a [sgd] Avoid parameter "gotcha" for learning rate scheduler (#8107)
* with-scheduler-creator

* none

* add_freq

* runner

* torch
2020-04-21 01:01:04 -07:00
Stephanie Wang eefea4e29c [core] Post task submission to IO loop (#8090)
* Post to IO loop

* Unused

* Fix build
2020-04-20 19:13:50 -07:00
Ujval Misra 708dff6d8f [tune] Stop-gap fix for PBT checkpointing (#7794)
* Fix PBT

* lint

* reset

* rm

* tests

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-04-20 15:10:36 -07:00
Edward Oakes 213d3894ca Remove serve.route decorator (#8108) 2020-04-20 16:22:25 -05:00
Stephanie Wang 1323e1753d [core] When reconstruction is enabled, pin objects created by ray.put() (#8021)
* Unit test and pin ray.put objects until they have no more lineage references

* c++ tests

* lint

* Mark ray.put objects as pinned
2020-04-20 13:09:54 -07:00
Richard Liaw 9f3e9e7e9f [tune] Add more intensive tests (#7667)
* make_heavier_tests

* help
2020-04-20 11:14:44 -07:00
Edward Oakes 793e616a2d Fix job table parsing (#8070) 2020-04-20 12:56:43 -05:00
ZhuSenlin 3f28a8a229 [GCS] reply to the owner only after the actor has been successfully created. (#8079)
* reply to the owner only after the actor is successfully created.

* reply immediately if the actor is already created

* fix comment

* add test_actor_creation_task provided by @Stephanie Wang

Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>
2020-04-19 09:53:02 -07:00
Edward Oakes da296bf8c5 [serve] Router fault tolerance (#8008) 2020-04-19 11:04:06 -05:00
Sven Mika 165a86f1ab [RLlib] SAC MuJoCo instability issues (tf and torch versions). (#8063)
SAC (both torch and tf versions) are showing issues (crashes) due to numeric instabilities in the SquashedGaussian distribution (sampling + logp after extreme NN outputs).
This PR fixes these. Stable MuJoCo learning (HalfCheetah) has been confirmed on both tf and torch versions. A Distribution stability test (using extreme NN outputs) has been added for SquashedGaussian (can be used for any other type of distribution as well).
2020-04-19 10:20:23 +02:00
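
The kind of instability mentioned in the entry above (log-probabilities of a tanh-squashed Gaussian after extreme network outputs) can be illustrated with a minimal sketch. It uses a commonly applied stable rewrite of log(1 - tanh(u)^2) and clamps the log standard deviation. The function names and the clamp bounds are assumptions for illustration, and this is not necessarily the fix applied in #8063.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_one_minus_tanh_sq(u):
    """Stable log(1 - tanh(u)**2).

    Uses the identity 1 - tanh(u)^2 = 4 / (e^u + e^-u)^2, which gives
    2 * (log 2 - u - softplus(-2u)) and avoids computing log(0) when
    tanh(u) rounds to exactly +/-1 for large |u|.
    """
    return 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))


def squashed_gaussian_logp(u, mean, log_std, min_log_std=-20.0, max_log_std=2.0):
    """Log-density of a = tanh(u), where u ~ Normal(mean, exp(log_std)).

    The clamp bounds are illustrative assumptions, not values from the PR.
    """
    log_std = min(max(log_std, min_log_std), max_log_std)
    z = (u - mean) / math.exp(log_std)
    gaussian_logp = -0.5 * z * z - log_std - 0.5 * math.log(2.0 * math.pi)
    # Change of variables: subtract log|d tanh(u) / du|.
    return gaussian_logp - log_one_minus_tanh_sq(u)
```

For moderate `u` the stable form agrees with the direct expression; for large `|u|` the direct expression fails because `tanh(u)` rounds to 1.0, while the stable form stays finite.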
Dean Wampler 5d2885c609 Minor Ray API doc refinements (#8060)
* Added small section on installation when using Anaconda. Also fixed an obsolete link to Anaconda.

* Delete more temporary directories when running the doc "make clean".

* Fine-tuning the core Ray API documentation

* Fix doc lines that were too long

Co-authored-by: Dean Wampler <dean@concurrentthought.com>
2020-04-18 15:19:35 -07:00
Richard Liaw 857e4dba2f [sgd] HuggingFace GLUE Fine-tuning Example (#7792)
* Init fp16

* fp16 and schedulers

* scheduler linking and fp16

* to fp16

* loss scaling and documentation

* more documentation

* add tests, refactor config

* moredocs

* more docs

* fix logo, add test mode, add fp16 flag

* fix tests

* fix scheduler

* fix apex

* improve safety

* fix tests

* fix tests

* remove pin memory default

* rm

* fix

* Update doc/examples/doc_code/raysgd_torch_signatures.py

* fix

* migrate changes from other PR

* ok thanks

* pass

* signatures

* lint'

* Update python/ray/experimental/sgd/pytorch/utils.py

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* should address most comments

* comments

* fix this ci

* first_pass

* add overrides

* override

* fixing up operators

* format

* sgd

* constants

* rm

* revert

* save

* failures

* fixes

* trainer

* run test

* operator

* code

* op

* ok done

* operator

* sgd test fixes

* ok

* trainer

* format

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update doc/source/raysgd/raysgd_pytorch.rst

* docstring

* dcgan

* doc

* commits

* nit

* testing

* revert

* Start renaming pytorch to torch

* Rename PyTorchTrainer to TorchTrainer

* Rename PyTorch runners to Torch runners

* Finish renaming API

* Rename to torch in tests

* Finish renaming docs + tests

* Run format + fix DeprecationWarning

* fix

* move tests up

* benchmarks

* rename

* remove some args

* better metrics output

* fix up the benchmark

* benchmark-yaml

* horovod-benchmark

* benchmarks

* Remove benchmark code for cleanups

* benchmark-code

* nits

* benchmark yamls

* benchmark yaml

* ok

* ok

* ok

* benchmark

* nit

* finish_bench

* makedatacreator

* relax

* metrics

* autosetsampler

* profile

* movements

* OK

* smoothen

* fix

* nitdocs

* loss

* envflag

* comments

* nit

* format

* visible

* images

* move_images

* fix

* rernder

* rrender

* rest

* multgpu

* fix

* nit

* finish

* extrra

* setup

* experimental

* as_trainable

* fix

* ok

* format

* create_torch_pbt

* setup_pbt

* ok

* format

* ok

* format

* docs

* ok

* Draft head-is-worker

* Fix missing concurrency between local and remote workers

* Fix tqdm to work with head-is-worker

* Cleanup

* Implement state_dict and load_state_dict

* Reserve resources on the head node for the local worker

* Update the development cluster setup

* Add spot block reservation to the development yaml

* ok

* Draft the fault tolerance fix

* Small fixes to local-remote concurrency

* Cleanup + fix typo

* fixes

* worker_counts

* some formatting and asha

* fix

* okme

* fixactorkill

* unify

* Revert the cluster mounts

* Cut the handler-reporter API

* Fix most tests

* Rm tqdm_handler.py

* Re-add tune test

* Automatically force-shutdown on actor errors on shutdown

* Formatting

* fix_tune_test

* Add timeout error verification

* Rename tqdm to use_tqdm

* fixtests

* ok

* remove_redundant

* deprecated

* deactivated

* ok_try_this

* lint

* nice

* done

* retries

* fixes

* kill

* retry

* init_transformer

* init

* deployit

* improve_example

* trans

* rename

* formats

* format-to-py37

* time_to_test

* more_changes

* ok

* update_args_and_script

* fp16_epoch

* huggingface

* training stats

* distributed

* Apply suggestions from code review

* transformer

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-04-17 15:17:30 -07:00
Maksim Smolin d6f4e5b3e1 [SGD] Imagenet example (basic) (#8020)
* Checkpoint the image-models example

* Update cluster definition

* Fix copyright info

* Use original args

* Checkpoint fixes

* Add README

* Add some missing features

* Format

* Get rid of the unused Namespace class

* Address comments

* Link the imagenet example in docs

* Cleanup

* Fix lint
2020-04-17 13:33:55 -07:00
Edward Oakes 90ef585fd5 Revert "Add ability to specify worker and driver ports (#7833)" (#8069)
This reverts commit 9f751ff8c4.
2020-04-17 12:32:22 -05:00
Richard Liaw a9ea139317 [sgd] Make serialization of data creation optional (#8027)
* pytest

* Update python/ray/util/sgd/torch/torch_trainer.py

Co-Authored-By: Ujval Misra <misraujval@gmail.com>

Co-authored-by: Ujval Misra <misraujval@gmail.com>
2020-04-16 20:27:51 -07:00
Richard Liaw de1787e5e5 [tune] Check actor start -> test_cluster (#8056)
* test

* info

* ok

* hard_stop

* codefix
2020-04-16 20:00:45 -07:00
Mitchell Stern d0c6f013c3 Fix command config portion of project schema (#8057) 2020-04-16 18:08:17 -07:00
Richard Liaw 6545534805 [tune/sgd] DCGAN example self-contained, turn example into modu… (#8012)
* ok

* done

* run_benchmarks

* should_make_examples_usable
2020-04-16 17:55:27 -07:00
Karthikeyan Singaravelan f95e18dfeb [tune/sgd] Import ABC from collections.abc instead of collectio… (#7982)
* Import ABC from collections.abc instead of collections for Python 3 compatibility.

* Fix linter errors.
2020-04-16 15:26:49 -07:00
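
The import change described in the entry above can be illustrated with a minimal sketch. On Python 3 the abstract container classes live in `collections.abc`; the aliases in `collections` were deprecated and later removed in Python 3.10. The `describe` helper below is hypothetical and is not taken from the Ray source.

```python
# Import abstract base classes from collections.abc, not collections.
from collections.abc import Mapping, Sequence


def describe(value):
    """Hypothetical helper: classify a value by its abstract container type."""
    if isinstance(value, Mapping):
        return "mapping"
    # Strings are sequences too, so exclude them explicitly.
    if isinstance(value, Sequence) and not isinstance(value, str):
        return "sequence"
    return "scalar"
```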
Edward Oakes 9f751ff8c4 Add ability to specify worker and driver ports (#7833) 2020-04-16 13:49:25 -05:00
Richard Liaw 4d8bf5635d [hotfix] Lint formatting for new Tune optimizer ZOOpt (#8040)
* formatting

* removedill

* lint
2020-04-16 09:24:30 -07:00
Clark Zinzow d4cae5f632 [Core] Added ability to specify different IP addresses for a core worker and its raylet. (#7985) 2020-04-16 10:32:24 -05:00
Servon 5c274fe631 [Tune] Add ZOOpt search algorithm (#7960)
* add zoopt

* add zoopt search algo

* add zoopt

* fix zoopt

* add zoopt requirements

* fix zoopt

* remove generated guides

* Apply suggestions from code review

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-04-15 21:13:29 -07:00
Simon Mo 7455610d5a Serve Doc: Quickstart (#7940) 2020-04-15 12:25:37 -07:00