6968 Commits

Author SHA1 Message Date
Kai Fricke dc42abb2f5 [tune] placement group support (#13370) 2021-01-18 11:58:57 -08:00
Sven Mika 1f00f834ac [RLlib] Solve PyTorch/TF-eager A3C async race condition between calling model and its value function. (#13467) 2021-01-18 10:29:03 -08:00
Tao Wang 516eb77080 [GCS] Remove task info publish as nowhere uses it (#13509)
* Remove task info publish as nowhere uses it

* simplify right publish channel
2021-01-18 01:15:03 -08:00
Simon Mo 1e2adb335e [CI] Buildkite PR Environment for Simple Tests (#13130) 2021-01-18 00:44:24 -08:00
Tao Wang 3a0710130c [GCS]Only publish changed field when node dead (#13364)
* Only update changed field when node dead

* node_id missed
2021-01-17 21:28:35 -08:00
ZhuSenlin a4ebdbd7da Refactor node manager to eliminate new_scheduler_enabled_ (#12936) 2021-01-18 00:15:35 +08:00
ZhuSenlin 2cd51ce608 sync write internal config in gcs (#13197) 2021-01-17 12:00:01 +08:00
Eric Liang 8c8af2616e Minimal version of piping autoscaler events to driver logs (#13434) 2021-01-16 10:06:20 -08:00
Dmitri Gekhtman 7e54911093 move message to debug (#13472) 2021-01-16 10:04:41 -08:00
Richard Liaw 86387504ee [tune] fix small docs typo (#13355)
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2021-01-16 00:49:17 -08:00
Amog Kamsetty 1d3941e41a [Tests] Skip failing windows tests (#13495)
* skip failing windows tests

* skip more

* remove

* updates
2021-01-15 20:51:33 -08:00
SangBin Cho 1179db1fc2 Remove an unnecessary file (#13499) 2021-01-15 18:29:12 -08:00
Eric Liang ee6332dbb0 Bump dev branch to 2.0 to avoid endless version bump toil (#13497)
* wip

* fix

* fix
2021-01-15 17:41:17 -08:00
Barak Michener 68e3a0e0e1 [ray_client]: fix wrong reference in server_pickler (#13474)
Change-Id: Ie3d219541b1875e986e72e3ae73ece145c715acf
2021-01-15 15:49:38 -08:00
SangBin Cho d09df55b14 Update ID specification doc (#13356) 2021-01-15 15:15:51 -08:00
Eric Liang 4aeb0ea550 Return version info from Ray client connect, to allow for discovering version mismatches 2021-01-15 14:27:26 -08:00
Simon Mo 7a0597d03f [CI] Fix Windows Bazel Upload (#13436) 2021-01-15 13:27:11 -08:00
Ian Rodney 0ec9ddabc1 [docker/dashboard] Fix ray dashboard (#12899) 2021-01-15 10:03:01 -08:00
Simon Mo dac8b3d58a [CI] Enable Dashboard tests for master (#13425) 2021-01-15 09:43:34 -08:00
SangBin Cho f6d9996874 [Object Spilling] Dedup restore objects (#13470)
* done.

* Addressed code review.
2021-01-14 23:51:11 -08:00
fangfengbin ce1b208e41 [GCS]Remove unused class variable (#13454) 2021-01-15 14:48:18 +08:00
Barak Michener 84e110a949 [ray_client]: Support runtime_context as metadata (#13428) 2021-01-14 14:37:00 -08:00
Clark Zinzow 9a658b568f [Core] Ownership-based Object Directory: Consolidate location table and reference table. (#13220)
* Added owned object reference before Plasma put on Create() + Seal() path.

* Consolidated location table and reference table in reference counter.

* Restore type in definition.

* Clean up owned reference on failed Seal().

* Added RemoveOwnedObject test for reference counter.

* Guard against ref going out of scope before location RPCs.

* Add 'owner must have ref in scope' precondition to documentation for object location methods.

* Move to separate Create() + Seal() methods for existing objects.

* Clearer distinction between Create() and Seal() methods.

* Make it clear that references will normally be cleaned up by reference counting.
2021-01-14 13:48:10 -08:00
Siyuan (Ryans) Zhuang d1e9887be2 [Serialization] New custom serialization API (#13291)
* new serialization API with doc & test

* add more notes

* refine notes

* doc
2021-01-14 13:15:31 -08:00
Amog Kamsetty 07e97fe4c2 [xgb] re-enable xgboost_ray tests (#13416)
* re-enable

* fix

* update xgb_ray version
2021-01-14 22:14:44 +01:00
Edward Oakes 7ba87b8abe Fix getting runtime context dict in driver (#13417) 2021-01-14 14:41:53 -06:00
Ian Rodney 411e37ce3f [serve] Properly obey SERVE_LOG_DEBUG=0 (#13460) 2021-01-14 12:24:22 -08:00
Simon Mo 16e8c4a69f [Release] Fix Serve release test (#13303)
The Docker image we were using now uses `ray` users so we have to call
sudo.
2021-01-14 12:23:53 -08:00
Simon Mo 321bbe1ffb [Dashboard] Fix GPU resource rendering issue (#13388) 2021-01-14 12:23:21 -08:00
PENG Zhenghao e63da54931 [docs] Add more guideline on using ray in slurm cluster (#12819)
Co-authored-by: Sumanth Ratna <sumanthratna@gmail.com>
Co-authored-by: PENG Zhenghao <pengzh@ie.cuhk.edu.hk>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-01-14 12:17:53 -08:00
Sven Mika d98235cc84 [RLlib] Deflake 2x remote & local inference tests (external env). (#13459) 2021-01-14 20:44:26 +01:00
Micah Yong c89ebdd94a [Core][CLI] ray status and ray memory no longer starts a new job (#13391)
* Access memory info in ray memory via GlobalStateAccessor rather than calling ray.init()

* Modify ray status cli so that it doesn't start a new job via ray.init()

* Remove local test file

* Make status and error args required in commands.py#debug.status

* Remove unnecessary imports

* Job 38482.1 should now pass

* Resolve merge conflict
2021-01-14 10:12:16 -08:00
Dmitri Gekhtman 2d772a5a6d [kubernetes][minor] Operator garbage collection fix (#13392) 2021-01-14 10:40:15 -06:00
Barak Michener 9c6d892eec [ray_client]: fix exceptions raised while executing on the server on behalf of the client (#13424) 2021-01-14 10:38:01 -06:00
Ameer Haj Ali 2f7ba25efb [joblib] joblib strikes again but this time on windows (#13212) 2021-01-14 10:36:52 -06:00
fangfengbin 4a6c53da46 [Core]Fix raylet scheduling bug (#13452)
* [Core]Fix raylet scheduling bug

* fix lint error

* fix lint error

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2021-01-14 14:50:32 +01:00
Sven Mika 56878221ed [RLlib] Redo: Make TFModelV2 fully modular like TorchModelV2 (soft-deprecate register_variables, unify var names wrt torch). (#13363) 2021-01-14 14:44:33 +01:00
fangfengbin 33b092de28 [GCS]Add gcs resource scheduler (#13072) 2021-01-14 20:05:55 +08:00
Kai Fricke b296642646 Fix linter error (#13451) 2021-01-14 10:28:44 +01:00
Amog Kamsetty 560299972c Revert "Enable Ray client server by default (#13350)" (#13429)
This reverts commit 912d0cbbf9.
2021-01-13 21:28:54 -08:00
fyrestone 8697d67791 Fix raylet::MockWorker::GetProcess crashes (#13440)
Co-authored-by: 刘宝 <po.lb@antfin.com>
2021-01-14 12:19:21 +08:00
dHannasch ad015cb7df Split out the part of get_node_ip_address for which the docstring is correct (#12796) 2021-01-14 11:32:56 +08:00
Amog Kamsetty 3f42e6bafe [Tune] Pin Transitive Dependencies (#13358) 2021-01-13 19:10:21 -08:00
Tao Wang 062b7efc93 Remove unused handler methods (#13394) 2021-01-14 10:51:31 +08:00
Eric Liang 602c103eae Make request_resources() use internal kv instead of redis pub sub (#13410) 2021-01-13 17:30:43 -08:00
Edward Oakes 9ef48b16b6 [serve] Pull out goal management logic into AsyncGoalManager class (#13341) 2021-01-13 18:35:25 -06:00
Edward Oakes c6fc7124d1 [tune] Fix f-string in error message (#13423) 2021-01-13 18:34:21 -06:00
Simon Mo b257cb7d98 Add bazel logs upload to GHA (#13251) 2021-01-13 15:17:11 -08:00
Simon Mo 15501a4151 Fix Serve release test (#13385) 2021-01-13 15:06:23 -08:00
Dmitri Gekhtman 1968b2f9d8 [autoscaler/k8s] [CI] Kubernetes test ray up, exec, down (#12514) 2021-01-13 15:03:56 -08:00