Commit Graph

3565 Commits

Author SHA1 Message Date
Ian Rodney 2e70743077 [Serve] Backend state unit tests (#13319) 2021-01-12 14:54:04 -06:00
architkulkarni e560933f9c [Serve] Add dependency management support for driver not running in a conda env (#13269) 2021-01-12 09:57:15 -08:00
Kai Fricke 518427627b [tune] buffer trainable results (#13236)
* Working prototype

* Pass buffer length, fix tests

* Don't buffer per default

* Dispatch and process save in one go, added tests

* Fix tests

* Pass adaptive seconds to train_buffered, stop result processing after STOP decision

* Fix tests, add release test

* Update tests

* Added detailed logs for slow operations

* Update python/ray/tune/trial_runner.py

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Apply suggestions from code review

* Revert tests and go back to old tuning loop

* nit

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-01-12 18:52:47 +01:00
Eric Liang 470fda190a Forgot overwrite parameter in Ray client internal kv 2021-01-11 17:50:06 -08:00
Amog Kamsetty 0452a3a435 [Tune] Rename MLFlow to MLflow (#13301) 2021-01-11 17:36:55 -08:00
Eric Liang de5bc24c60 Implement internal kv in ray client (#13344)
* kv internal

* fix
2021-01-11 14:54:52 -08:00
Eric Liang fbb9795374 [client] Report number of currently active clients on connect (#13326)
* wip

* update

* update

* reset worker

* fix conn

* fix

* disable pycodestyle
2021-01-11 14:53:12 -08:00
architkulkarni c43fa12e73 [Serve] Support Starlette streaming response (#13328) 2021-01-11 13:27:44 -08:00
Edward Oakes 62e1ad3973 [serve] Cleanup backend state, move checkpointing and async goal logic inside (#13298) 2021-01-11 11:45:43 -06:00
Edward Oakes 93006c2ba5 Use wait_for_condition to reduce flakiness in test_queue.py::test_custom_resources (#13210) 2021-01-10 19:32:59 -06:00
Barak Michener 6f0083ed10 add the method annotation and a comment explaining what's happening (#13306)
Change-Id: I848cc2f0beaed95340d9de7cca19a50c78d9da9a
2021-01-10 15:54:10 -08:00
Akash Patel 94a873fc4d remove empty extras streaming deps (#12933) 2021-01-10 12:09:27 -08:00
Kai Fricke d4b0a9fadf [tune] convert search spaces: parse spec before flattening (#12785)
* Parse spec before flattening

* flatten after parse

* Test for ValueError if grid search is passed to search algorithms
2021-01-09 18:21:49 +01:00
Edward Oakes d434ba6518 [serve] Clean up EndpointState interface, move checkpointing inside of EndpointState (#13215) 2021-01-08 22:36:19 -06:00
Philipp Moritz c5ae30d1d4 Do not give an error if both RAY_ADDRESS and address is specified on initialization (#13305)
* Finalize handling of RAY_ADDRESS

* lint
2021-01-08 18:31:32 -08:00
Barak Michener eb6f403b97 [ray_client]: first draft of documentation (#13216) 2021-01-08 15:38:36 -08:00
Ian Rodney f916549602 [Cancellation] Make Test Cancel Easier to Debug (#13243)
* first commit

* lint-fix
2021-01-08 14:52:43 -08:00
Alex Wu 6ca4fb1054 [Pull manager] Only pull once per retry period (#13245)
* .

* docs

* cleanup

* .

* .

* .

* .

Co-authored-by: Alex <alex@anyscale.com>
2021-01-08 14:51:11 -08:00
Edward Oakes 66daed99f5 Remove top-level ray.connect() and ray.disconnect() APIs (#13273) 2021-01-08 15:26:20 -06:00
dependabot[bot] 300a22d8f7 [tune](deps): Bump gluoncv from 0.9.0 to 0.9.1 in /python/requirements (#13287) 2021-01-08 11:42:58 -08:00
dependabot[bot] 3569b78237 [tune](deps): Bump mlflow from 1.13.0 to 1.13.1 in /python/requirements (#13286) 2021-01-08 11:42:18 -08:00
Philipp Moritz a247c71e2e [ray_client] Add metadata to gRPC requests (#13167) 2021-01-07 23:58:15 -08:00
Hao Chen 77cd0d5a21 Fix a crash problem caused by GetActorHandle in ActorManager (#13164) 2021-01-08 12:11:08 +08:00
Tao Wang ab2229dcb7 [GCS] Remove old lightweight resource usage report code path (#13192) 2021-01-08 10:30:00 +08:00
Ian Rodney 4aef3d6836 [docker] Pull if image is not present (#13136) 2021-01-07 17:17:00 -08:00
Amog Kamsetty 43f70faa25 [Tune] Pin Tune Dependencies (#13027)
Co-authored-by: Ian <ian.rodney@gmail.com>
2021-01-07 14:03:06 -08:00
Amog Kamsetty f68922d043 [Tune] Improve error message for Session Detection (#13255)
* Improve error message

* log once
2021-01-07 22:40:44 +01:00
Philipp Moritz 9872fc1801 Start ray client server with 'ray start' (#13217) 2021-01-06 21:04:14 -08:00
Siyuan (Ryans) Zhuang dde49b8d48 [Serialization] Fix cloudpickle (#13242) 2021-01-06 17:21:17 -08:00
Kai Fricke 97211a6170 [Tune] Fix tune serve integration example (#13233) 2021-01-06 17:02:04 +01:00
SangBin Cho 32dc5676b4 [Metrics] Record per node and raylet cpu / mem usage (#12982)
* Record per node and raylet cpu / mem usage

* Add comments.

* Addressed code review.
2021-01-05 21:57:21 -08:00
Ian Rodney 92963800f6 [tests] Fix Autoscaler Test failure on Windows (#13211)
* skip create_or_update tests

* Update python/ray/tests/test_autoscaler.py

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
2021-01-05 16:48:32 -08:00
Simon Mo 39813ff6b0 [Serve] HTTPOptions for deployment modes (#13142) 2021-01-05 16:41:52 -08:00
Amog Kamsetty bd19ed31e7 [Tune] Fix PBT Transformers Example (#13174) 2021-01-05 16:31:11 -08:00
Hao Zhang 7e52351ae5 [Collective] Some necessary abstraction of collective calls before introducing stream management (#13162) 2021-01-05 16:20:12 -08:00
Edward Oakes dc101fd087 [serve] Move controller state into separate files (#13204) 2021-01-05 14:37:16 -06:00
Edward Oakes d738610dc9 Disable atexit test on windows (#13207) 2021-01-05 14:33:51 -06:00
Kai Fricke 96c2d3d2b5 [tune] better signature check for tune.sample_from (#13171)
* [tune] better signature check for `tune.sample_from`

* Update python/ray/tune/sample.py

Co-authored-by: Sumanth Ratna <sumanthratna@gmail.com>

Co-authored-by: Sumanth Ratna <sumanthratna@gmail.com>
2021-01-05 08:04:18 -08:00
Edward Oakes e8162f1b1f [serve] Merge ActorReconciler and BackendState (#13139) 2021-01-05 09:56:22 -06:00
Hao Zhang 4150970226 [Collective][PR 2/6] Driver program declarative interfaces (#12874)
* scaffold of the code

* some scratch and options change

* NCCL mostly done, supporting API#1

* interface 2.1 2.2 scratch

* put code into ray and fix some importing issues

* add an additional Rendezvous class to safely meet at named actor

* fix some small bugs in nccl_util

* some small fix

* scaffold of the code

* some scratch and options change

* NCCL mostly done, supporting API#1

* interface 2.1 2.2 scratch

* put code into ray and fix some importing issues

* add an additional Rendezvous class to safely meet at named actor

* fix some small bugs in nccl_util

* some small fix

* add a Backend class to make Backend string more robust

* add several useful APIs

* add some tests

* added allreduce test

* fix typos

* fix several bugs found via unittests

* fix and update torch test

* changed back actor

* rearrange a bit before importing distributed test

* add distributed test

* remove scratch code

* auto-linting

* linting 2

* linting 2

* linting 3

* linting 4

* linting 5

* linting 6

* 2.1 2.2

* fix small bugs

* minor updates

* linting again

* auto linting

* linting 2

* final linting

* Update python/ray/util/collective_utils.py

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Update python/ray/util/collective_utils.py

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Update python/ray/util/collective_utils.py

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* added actor test

* lint

* remove local sh

* address most of richard's comments

* minor update

* remove the actor.option() interface to avoid changes in ray core

* minor updates

Co-authored-by: YLJALDC <dal177@ucsd.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-01-04 20:57:37 -08:00
Tao Wang c617291b27 [build]Update description and add some keywords (#13163) 2021-01-05 11:34:03 +08:00
Barak Michener 9643e44af6 [ray_client]: Move from experimental to util (#13176)
Change-Id: I9f054881f0429092d265cd6944d89804cce9d946
2021-01-04 17:51:56 -08:00
Eric Liang dfb326d4b5 Surface object store spilling statistics in ray memory (#13124) 2021-01-04 17:35:39 -08:00
Amog Kamsetty e181515dff [SGD] Fix Docstring for as_trainable (#13173) 2021-01-04 17:21:24 -08:00
Amog Kamsetty 15e86581bd [XGboost] Update Documentation (#13017)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-01-04 17:21:04 -08:00
Raed Shabbir d632b0f0f7 [Serve] Bug in Serve node memory-related resources calculation #11198 (#13061) 2021-01-04 11:04:59 -08:00
Clark Zinzow c2bff64699 [Core] Locality-aware leasing: Milestone 1 - Owned refs, pinned location (#12817)
* Locality-aware leasing for owned refs (pinned locations).

* LessorPicker --> LeasePolicy.

* Consolidate GetBestNodeIdForTask and GetBestNodeIdForObjects.

* Update comments.

* Turn on locality-aware leasing feature flag by default.

* Move local fallback logic to LeasePolicy, move feature flag check to CoreWorker constructor, add local-only lease policy.

* Add lease policy consulting assertions to the direct task submitter tests.

* Add lease policy tests.

* LocalityLeasePolicy --> LocalityAwareLeasePolicy.

* Add missing const declarations.

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* Add RAY_CHECK for raylet address nullptr when creating lease client.

* Make the fact that LocalLeasePolicy always returns the local node more explicit.

* Flatten GetLocalityData conditionals to make it more readable.

* Add ReferenceCounter::GetLocalityData() unit test.

* Add data-intensive microbenchmarks for single-node perf testing.

* Add data-intensive microbenchmarks for simulated cluster perf testing.

* Remove redundant comment.

* Remove data-intensive benchmarks.

* Add locality-aware leasing Python test.

* Formatting changes in ray_perf.py.

Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2021-01-04 09:49:08 -08:00
Dmitri Gekhtman 31453621ef [kubernetes][docs][minor] Kubernetes version warning (#13161) 2021-01-04 10:29:17 -06:00
architkulkarni a95275bdd9 [Serve] [Doc] Add existing web server integration ServeHandle tutorial (#13127) 2021-01-04 10:28:34 -06:00
Simon Mo fece8db70d [Serve] Use a small object to track requests (#13125) 2020-12-31 11:43:03 -08:00