1411 Commits

Author SHA1 Message Date
SangBin Cho 053188dfbe [Placement Group] Support Placement Group state table. (#10090)
* Done.

* Addressed code review.

* Linting.

* Fix lint.

* Fix lint.

* Fix a test.

* Lint.

* Add a lint sleep to test.

* Fix the lint issue.

* Fixed doc build error.
2020-08-17 09:24:50 -07:00
fangfengbin edd783bc32 [Placement Group]Add soft pack strategy (#10099) 2020-08-17 12:01:34 +08:00
Tao Wang fba5906ce3 [GCS] Re-report heartbeat when gcs server restarts (#10040)
* Retry to send failed heartbeat when light heartbeat enabled

* Re-report heartbeat when gcs server restarts

* remove is_pubsub_server_restarted

* add lock per comment

* minor change, name related
2020-08-14 17:37:20 -07:00
Siyuan (Ryans) Zhuang 17ca1d8ff4 [Core] Object spilling prototype (#9818) 2020-08-14 15:39:10 -07:00
Robert Nishihara 36e626e95d Revert "[Dashboard] Start the new dashboard (#9860)" (#10116)
This reverts commit 739933e5b8.
2020-08-14 14:06:57 -07:00
fangfengbin 3a6fa7d622 [Placement Group]Optimize placement group strict pack strategy (#9924)
* add part code

* add code

* add part code

* rm used import

* add part code

* add part code

* add part code

* add part code

* add part code

* add part code

* fix review comment

* add testcase

* use ResourceSet

* fix review comment

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-08-13 23:58:52 -07:00
Simon Mo 01f38bc5d1 CoreWorker correctly push metrics to agent (#10031) 2020-08-13 16:44:53 -07:00
Ícaro Aragão b77d6bf87d [GCS] Improve fallback for getting local valid IP for GCS server (#10004) 2020-08-13 16:29:47 -05:00
SangBin Cho 86b1db3f11 [Stats] Make metrics report time configurable (#10036)
* Done.

* Lint.

* Address code review.

* Address code review.

* Remove wrong commit.

* Fix a test error.
2020-08-13 00:30:24 -07:00
fyrestone 739933e5b8 [Dashboard] Start the new dashboard (#9860) 2020-08-13 11:01:46 +08:00
fangfengbin 701e26e0af [GCS]Add node realtime resource view (#10043) 2020-08-12 10:52:17 +08:00
Zhuohan Li a6fed4820e [Core] Preliminary implementation of ownership-based object directory (#9735) 2020-08-11 15:04:13 -07:00
SangBin Cho 946ae74817 [GCS Actor Management] Race condition around creating -> created phase. (#10035)
* Fix the issue.

* Address a code review.
2020-08-11 12:31:27 -07:00
Basasuya 0400a88bf1 [EVENT] Basic Function and Definition (#9657) 2020-08-11 17:36:07 +08:00
Kai Yang 3bc17fa62a [Core] Multi-tenancy: Pass env variables from job config to worker processes (#10022) 2020-08-10 14:31:37 -07:00
Alex Wu 2ebf76c7a3 [New Scheduler] Additional unit tests (#9990) 2020-08-10 11:44:06 -07:00
SangBin Cho eb6b10221e Increase the num of trials to reduce the probability of failing sample_test (#10007) 2020-08-10 10:05:33 -07:00
Kai Yang 37821f0b4c Support unlimited JVM options (#9910) 2020-08-10 16:08:33 +08:00
fangfengbin 26b36a1982 Optimize node register&worker failure log (#9833) 2020-08-10 11:41:45 +08:00
fangfengbin a2bfdcbf24 [Placement Group]Trigger placement group scheduling when a new node is added (#9905) 2020-08-10 10:56:17 +08:00
Barak Michener 8e76796fd0 ci: Redo format.sh --all script & backfill lint fixes (#9956) 2020-08-07 16:49:49 -07:00
Barak Michener 1d01c668f0 rpc: Core Worker client pool (#9934) 2020-08-07 16:34:29 -07:00
Tao Wang 8bea875673 [TEST]Check if port is free before start up redis (#9974)
* [TEST]Check if port is free before start up redis

* per comment
2020-08-07 10:15:12 -07:00
SangBin Cho 44826878ff [Core] Remove Legacy Raylet Code (#9936)
* Remove a flag and some methods in node manager including HandleDisconnectedActor, ResubmitTask, and HandleTaskReconstruction

* Make actor creator always required + remove raylet transport

* Remove actor reporter + remove FinishAssignedActorCreationTask

* Remove actor tasks.

* Remove finishactortask and switched it to finishactorcreation task

* Remove reconstruction policy.

* Remove lineage cache.

* Formatting.

* Remove actor frontier code.

* Removed build error.

* Revert "Remove reconstruction policy."

This reverts commit 9d25c9bced4da5fbcac5d484d51013345f16513b.

* Recover HandleReconstruction to mark expired objects as failed.
2020-08-06 16:37:50 -07:00
SangBin Cho ec2f1a225e [Stats] Metrics Export User Interface Part 1 (#9913)
* Metrics export port expose done.

* Support exposing metrics port + metrics agent service discovery through ray.nodes()

* Formatting.

* Added a doc.

* Linting.

* Change the location of metrics agent port.

* Addressed code review.

* Addressed code review.
2020-08-06 16:16:29 -07:00
Eric Liang 7d4f204aa8 [Placement Group] Allow scheduling a task on any bundle (-1, default) (#9885)
* wip

* wip

* fix tests

* wip

* wip

* wip

* wip

* wip

* add test

* update

* update

* remove debug

* comments
2020-08-06 00:05:21 -07:00
Tao Wang 1760586628 [GCS]Use an asynchronous PING to avoid blocking other operations (#9871)
* Use separate redis client to avoid its sync command blocking other operations

* use redis_failure_detector_client_

* use async command to ping redis

* format log
2020-08-05 19:10:53 -07:00
SangBin Cho 68899e2f8e [GCS Actor Management] Fix race condition for DEPENDENCIES_UNREADY states. (#9883)
* Fix issues.

* Address code review.

* Addressed code review 2.

* Fix formatting.

* Addressed code review 3.

* Addressed code review.
2020-08-05 12:22:12 -07:00
SangBin Cho 685182923c [Core] Fix detached actor local mode when gcs actor management is on. (#9839)
* Fix local mode detached actor.

* Revert changes.
2020-08-05 09:04:24 -07:00
kisuke95 ddc1e483fb Fix actor table Delete bug (#9499) 2020-08-05 18:05:51 +08:00
kisuke95 80d2544f6b Fix vector<bool> for loop (#9907) 2020-08-05 17:49:37 +08:00
fangfengbin 193d11ab8b Optimize placement group log (#9891) 2020-08-05 14:41:32 +08:00
chaokunyang 3323ad9d59 [HOTFIX] Fix master build with missing placement group argument (#9868)
* fix common task submit default placement group

* fix java_function
2020-08-04 11:19:15 -07:00
Barak Michener c16e1b9524 src/ray/protobuf: Break proto rules into a proper BUILD file (#9792) 2020-08-04 11:12:45 -07:00
Kai Yang 27cd323ce1 [Core] Multi-tenancy: Job isolation & implement per job config (except for env variables) (#9500) 2020-08-04 15:51:29 +08:00
kisuke95 28b1f7710c [Core] Error info pubsub (Remove ray.errors API) (#9665) 2020-08-04 14:04:29 +08:00
fangfengbin 8c3fc1db76 Optimize actor creation log (#9781) 2020-08-04 10:29:30 +08:00
Zhijun Fu 4f2e4f31dd async grpc calls should always return void (#9533) 2020-08-03 12:44:02 -07:00
Stephanie Wang 37a9c5783c [core] Report resource load by shape (#9806)
* Report and aggregate resource load by shape

* python test

* python test

* x

* update
2020-07-31 16:57:30 -07:00
Eric Liang b73080c85f Allow tasks to be used with placement groups (#9738) 2020-07-31 10:51:37 -07:00
fangfengbin 3900643948 Add actor states definitions & transition diagram doc (#9754) 2020-07-31 15:35:25 +08:00
Kai Yang 02fd950252 [Java] Local and distributed ref counting in Java (#9371) 2020-07-31 11:49:31 +08:00
Eric Liang 73df3f7bd2 Clean up formatting of placement group resources (#9740) 2020-07-30 15:52:32 -07:00
SangBin Cho e6d1e3afe2 Use pass by reference for const auto in for loop. (#9811) 2020-07-30 12:34:24 -05:00
Kai Yang 9be5a2f0fc Fix GCS related tests (#9783) 2020-07-30 11:46:36 +08:00
SangBin Cho 826f14c824 [Stats] Fix harvestor threads + Fix flaky stats shutdown. (#9745) 2020-07-29 18:57:59 -05:00
mehrdadn 07022f3f11 Fix src/ray/core_worker/common.h deleted constructor (#9785)
Co-authored-by: Mehrdad <noreply@github.com>
2020-07-29 15:49:02 -07:00
Alex Wu 72297dc46f [Core] Socket creation race condition bug fixes (#9764)
* fix issues

* hot fixes

* test

* test

* Always info log
2020-07-29 13:17:46 -07:00
SangBin Cho d1b37ca7e4 [GCS Actor Management] Fix flaky test_dead_actors. (#9715)
* Fix.

* Add logs.

* Add a unit test.
2020-07-29 10:54:18 -07:00
Tao Wang 2babad9906 [GCS]Use a separate thread in node failure detector to handle heartbeat (#9416)
* use a sole thread to handle heartbeat

* separate signal thread

* use work to avoid exiting when task is underway

* protect shared data structure to avoid deadlock

* add comments

* decrease io service num

* minor changes

* fix test

* per stephanie's comments

* use single io service instead of 1-size io service pool

* typo
2020-07-29 09:58:58 -07:00