1586 Commits

Author SHA1 Message Date
fangfengbin 8fb926565c [Placement Group]Placement Group supports gcs failover (Part1) (#11933) 2020-11-16 14:42:56 +08:00
Gabriele Oliaro 4744ed01f7 Queueing non-actor tasks at the workers (#11051)
* separated adding tasks to queue and executing them (worker side)

* linting

* first review

* second rev

* rev3, all tests passing locally

* linting

* rev4

* linting

* finished rev4, all tests passing locally (mac)

* rev4, all tests passing locally

* linting

* rev5

* bug fix

* hopefully fixed build

* nvm

* ptr cast

* linting

* no special treatment for actor creation tasks
2020-11-12 12:44:13 -05:00
Tao Wang 3fbd8be851 [Placement Group]Do not really subtract resources, just count (#11894)
* [Placement Group]Do not really subtract resources, just count

* add todo
2020-11-12 00:01:19 -08:00
SangBin Cho f80d812799 [Object Spilling] Introduce SpillWorker & RestoreWorker Pool to avoid IO worker deadlock. (#11885) 2020-11-11 18:20:14 -08:00
Tao Wang 92286660e4 [Core] Lazy create node manager clients, and destroy them (#11928) 2020-11-11 08:51:40 -08:00
Siyuan (Ryans) Zhuang b8dda0e3d0 [Serialization] Fix buffer alignment issues (#11888)
* fix buffer alignment issues

* remove unused fields

* aligned memory allocation

* windows compat

* license. fix compiler warnings

* fix compilation error

* reinterpret_cast
2020-11-10 23:44:16 -08:00
dHannasch 29cb32539e [Core] If failed to connect to redis, try to say why. (#11916) 2020-11-10 18:22:10 -08:00
fangfengbin 433e4f32da [GCS]Reduce get operations of worker table (#11599)
* [GCS]Reduce get operations of worker table

* fix ut bug

* fix ut bug

* fix review comment

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-11-10 18:11:25 -08:00
Eric Liang 46f3652102 Remove repeat push timeout from object manager (#11874) 2020-11-10 16:26:53 -08:00
fangfengbin 543f7809a6 [GCS]Add gcs dump log(Part1) (#11727)
* add part code

* fix compile bug

* Fix bug

* Add part code

* fix review comment

* fix review comment

* fix lint error

* fix review comment

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-11-10 14:10:03 +08:00
Eric Liang ee2da0cf45 [Core] PushManager for reliable broadcast (#11869) 2020-11-09 18:01:47 -08:00
Kai Yang 904f48ebd9 [Core] Multi-tenancy: Pass job ID from Raylet to worker via env variable (#11829)
* Pass job ID from Raylet to worker via env variable

* fix

* fix

* fix

* lint

* fix

* fix test_object_spilling

* address comments

* lint

* fix
2020-11-09 11:02:15 -08:00
Tao Wang 77e3163630 [GCS]Only pass node id to node failure detector (#11886)
* [GCS]Only pass node id to node failure detector

* rename
2020-11-09 10:52:33 -08:00
fangfengbin 407a212816 [GCS]Fix TestActorTableResubscribe bug (#11830)
* fix compile bug

* [GCS]Fix TestActorTableResubscribe bug

* rm unused code

* fix lint error

* fix review comment

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-11-08 23:50:05 -08:00
Stephanie Wang 61e41257e7 [Object spilling] Queue failed object creation requests until objects have been spilled (#11796)
* Queue creation requests

* Cleanup disconnected clients

* Remove unused

* todo

* FIFO order for create requests, remove warmup for IO workers

* test and lint

* disable test

* lint

* Skip on windows
2020-11-06 18:22:19 -05:00
SangBin Cho e0ecf5d79d Revert "[GCS]Open light heartbeat by default (#11689)" (#11861)
This reverts commit 612ddb2dd1.
2020-11-06 14:34:59 -08:00
Barak Michener 27c810a97e Basic protos for ray client (#11762) 2020-11-05 16:23:54 -08:00
Eric Liang f86c4f992c Fix RAY_ENABLE_NEW_SCHEDULER=1 pytest test_advanced_2.py::test_zero_cpus_actor (#11817) 2020-11-05 16:02:04 -08:00
SangBin Cho 3cd1d7f44a [Metrics] Implement basic metrics changes (#11769)
* Implement basic metrics changes

* Addressed code review.

* Fix build issue.

* Fix build issue.
2020-11-05 11:07:05 -08:00
Tao Wang 612ddb2dd1 [GCS]Open light heartbeat by default (#11689) 2020-11-05 12:11:00 +08:00
DK.Pino 50110b934c [Placement Group]Enhance create placement group java api (#11702)
* enhance create pg java api

* add state for PlacementGroup

* fix comment

* move default pg

* make default pg name private

* add bundle size and bundle resource size check when placement group create
2020-11-05 09:59:36 +08:00
Stephanie Wang 952b71dc94 Fix windows build (#11786) 2020-11-03 12:38:45 -05:00
Stephanie Wang 0ba777af99 [Object spilling] Add policy to automatically spill objects on OutOfMemory (#11673) 2020-11-02 12:42:02 -08:00
Ameer Haj Ali 8d74a04a42 [autoscaler] Flag flip for resource_demand_scheduler should take into account queue (#11615) 2020-11-02 12:41:22 -08:00
fangfengbin 4a7d0e059d [GCS]Optimize subscription perf (#11669)
* [GCS]Optimize subscription perf

* fix review comment

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-11-02 09:46:04 -08:00
Eric Liang 48dee789b3 Add random actor placement; fix cancellation callback; update test skips (#11684) 2020-10-30 18:36:35 -07:00
DK.Pino b10871a1f5 [Core]Fix get worker table bug (#11516)
* fix get_worker_table bug

* fix lint

* fix comment

* remove actor table

* fix comment

* fix get alive worker

* remove unused python import
2020-10-30 14:48:29 -07:00
SangBin Cho 6e2a1eac36 [Placement Group] Placement group automatic cleanup. (#11546)
* In progress. Done with all placement group manager code.

* It is working with job.

* Finished detached actor implementation.

* Fix minor issue.

* In progress.

* Addressed code review.

* Addressed code review.

* Addressed code review.

* Fix a build error.
2020-10-30 10:55:43 -07:00
Alex Wu e022d12dc3 [New scheduler] Deflake test heartbeat (#11586)
* deflaked

* lint

* .

* Update cluster_task_manager_test.cc

Co-authored-by: Alex Wu <alex@anyscale.com>
2020-10-29 23:10:19 -07:00
architkulkarni 4175569d96 [Core] Add option to override environment variables for tasks and actors (#11619) 2020-10-29 14:22:44 -05:00
Simon Mo e82ff08b0c Fix asyncio plasma integration in cluster mode (#11665) 2020-10-29 11:53:10 -07:00
Lingxuan Zuo 0b7a3d9e02 [Log] new spdlog tool for ray (#10967)
* spdlog support

* fatal abort for spdlog

* print all logs in stderr if no logger given

* fix log test

* install signal handler for spdlog by reusing glog lib

* fix lint

* Avoid duplicated dump

* log rotation and fmt comments

* fix
2020-10-29 11:37:13 -07:00
Tao Wang 1d5694ddea [GCS]Use direct getting instead of pub-sub to update load metrics in monitor.py (#11339) 2020-10-28 11:23:18 -07:00
Eric Liang c933477915 [new scheduler] Pass test_basic and add CI builds with flag on (#11635) 2020-10-28 11:02:43 -07:00
Stephanie Wang 427b5af0ae [Object spilling] Refactor raylet to add a local object manager class (#11647)
* Fix pytest...

* Release objects that have been spilled

* GCS object table interface refactor

* Add spilled URL to object location info

* refactor to include spilled URL in notifications

* improve tests

* Add spilled URL to object directory results

* Remove force restore call

* Merge spilled URL and location

* fix

* tmp

* refactor

* unit test skeleton

* unit testing

* unit test fixes

* cleanup

* cleanup

* update

* Separate pinning from waiting for object free, fixes pytest

* Update src/ray/raylet/local_object_manager.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

Co-authored-by: Tyler Westenbroek <westenbroekt@berkeley.edu>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2020-10-28 10:38:42 -04:00
fyrestone 05ad4c7499 [Dashboard] Optimize dashboard datacenter (#11391)
* Optimize dashboard datacenter

* Fix tests

* Fix tests

* Fix

* Fix CI

* python/build-wheel-macos.sh

Co-authored-by: 刘宝 <po.lb@antfin.com>
Co-authored-by: Max Fitton <maxfitton@anyscale.com>
2020-10-27 23:49:31 -07:00
fangfengbin 55a090fb16 [GCS]Optimize gcs client nodes get function (#11424)
* [GCS]Optimize gcs client nodes get function

* fix review comment

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-27 21:13:19 -07:00
Tao Wang 273a712786 [GCS]Decouple node failure detector with resource related operations (#11465) 2020-10-27 15:52:42 -07:00
fangfengbin ebe9a8865c [GCS]Fix a bug that creates invalid connection (#11590)
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-27 10:08:06 -07:00
Ian Rodney 2da6ad2176 [core] Better error message for named actor not found (#11604) 2020-10-26 09:46:02 -07:00
Tao Wang 0fbee4da0c [GCS] Remove unused ReportBatchHeartbeat/SubscribeHeartbeat (#11567)
* Remove unused message ReportBatchHeartbeat

* add up
2020-10-25 21:06:28 -07:00
Eric Liang d3ee83205b Remove crashing assert in actor creation for old scheduler (#11577)
* remove assert

* warn log
2020-10-24 00:05:26 -07:00
DK.Pino 9f804ade5f [Placement Group]Add get all placement group api (#11460)
* add get all interface for placement group

* add get all interface for placement group

* make it work

* fix lint

* fix lint

* fix comment

* add cpp test

* fix python lint
2020-10-23 11:46:48 -07:00
Alex Wu e02f4c0157 [New scheduler] queue by shape (#11381) 2020-10-21 15:56:06 -07:00
Edward Oakes 5d7f271e7d Add --worker-port-list option to ray start (#11481) 2020-10-21 14:46:45 -05:00
Tao Wang da2d3fbcfc Remove unused field in heartbeat message (#11459) 2020-10-21 10:49:16 -07:00
Kai Yang 078a22d676 [Core] Allow creating tasks/actors in a detached actor when driver has exited (#11493)
* Allow creating tasks/actors in a detached actor when driver has exited

* lint

* Address comment
2020-10-21 10:45:29 -07:00
Xuxue1 7200ddb72d Fix code_search_path failed in java (#11406)
Co-authored-by: xujiqiang eigen <xujiqiang@hpc1.ipa.aidigger.com>
2020-10-21 18:10:48 +08:00
fangfengbin a075e37695 [GCS]Fix TestActorTableResubscribe bug (#11463)
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-20 22:32:41 -07:00
Lingxuan Zuo aed739fbf4 [Log] Ignore callstacktrace test for windows (#11413) 2020-10-20 15:21:29 +08:00