Commit Graph

1420 Commits

Author SHA1 Message Date
Ian Rodney 32ed1a18b7 [hotfix] Fix lint in master (#10254) 2020-08-21 20:53:05 -07:00
Alex Wu 136c8ff19e [NewScheduler] Pass test_basic.py (#10059)
* .

* .

* Cleanup

* .

* whoops

* Update src/ray/raylet/scheduling/cluster_task_manager.h

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/raylet/scheduling/cluster_task_manager.h

Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>

* CR

* .

* .

* done

* .

* Unit tests

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
2020-08-21 15:00:08 -07:00
Barak Michener f03caa4532 rpc: Follow-up by sharing the core worker client pool within the core worker. (#10206)
* Share CoreWorkerClientPool

* Format
2020-08-21 11:01:22 -07:00
Stephanie Wang 85e57a7a98 [Object spilling] Look up the location of the primary raylet from the owner's metadata (#10197)
* Get the primary copy from the owner, python test, some node manager fixes

* fixes and todo

* update

* lint

* fix build
2020-08-20 14:46:59 -07:00
fangfengbin a462ae2747 [Placement Group] Add strict spread strategy (#10174)
* support STRICT_SPREAD strategy

* fix review comments

* rebase master

* fix lint error

* fix lint error

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-08-20 10:18:58 -07:00
SangBin Cho 224933b5e4 [Placement Group] Remove API part 2 (#10215)
* Initial progress done.

* Fix mistake.

* Addressed code review.

* Fix cpp build issue.

* Addressed code review.
2020-08-20 09:50:13 -07:00
fangfengbin 9734dbca3e [Placement Group] Reschedule bundles when the node of bundles is dead (#10021) 2020-08-19 13:24:42 -07:00
SangBin Cho 263df6163c [Placement Group] Placement group remove api part 1 (#10063)
* Added basic rpc calls.

* fix issues.

* Fix the gcs server not getting request issue.

* In Progress.

* Basic logic done. Tests are required.

* In progress.

* In progress in refactoring context.

* Revert "In progress in refactoring context."

This reverts commit 38236256cf1306c60dd203e75d45ceb4509c8106.

* Working now.

* Python test works.

* Lint.

* Addressed code review.

* Addressed code review.

* Lint.

* Added unit tests.

* Done, but one of the unit tests fails

* Addressed code review.

* Addressed the last code review.

* Fix the wrong test case.
2020-08-18 12:44:00 -07:00
Simon Mo bedc2c24c8 Export Metrics in OpenCensus Protobuf Format (#10080) 2020-08-18 11:32:42 -07:00
SangBin Cho 053188dfbe [Placement Group] Support Placement Group state table. (#10090)
* Done.

* Addressed code review.

* Linting.

* Fix lint.

* Fix lint.

* Fix a test.

* Lint.

* Add a lint sleep to test.

* Fix the lint issue.

* Fixed doc build error.
2020-08-17 09:24:50 -07:00
fangfengbin edd783bc32 [Placement Group] Add soft pack strategy (#10099) 2020-08-17 12:01:34 +08:00
Tao Wang fba5906ce3 [GCS] Re-report heartbeat when gcs server restarts (#10040)
* Retry sending failed heartbeat when light heartbeat is enabled

* Re-report heartbeat when gcs server restarts

* remove is_pubsub_server_restarted

* add lock per comment

* minor change, name related
2020-08-14 17:37:20 -07:00
Siyuan (Ryans) Zhuang 17ca1d8ff4 [Core] Object spilling prototype (#9818) 2020-08-14 15:39:10 -07:00
Robert Nishihara 36e626e95d Revert "[Dashboard] Start the new dashboard (#9860)" (#10116)
This reverts commit 739933e5b8.
2020-08-14 14:06:57 -07:00
fangfengbin 3a6fa7d622 [Placement Group] Optimize placement group strict pack strategy (#9924)
* add part code

* add code

* add part code

* rm used import

* add part code

* add part code

* add part code

* add part code

* add part code

* add part code

* fix review comment

* add testcase

* use ResourceSet

* fix review comment

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-08-13 23:58:52 -07:00
Simon Mo 01f38bc5d1 CoreWorker correctly push metrics to agent (#10031) 2020-08-13 16:44:53 -07:00
Ícaro Aragão b77d6bf87d [GCS] Improve fallback for getting local valid IP for GCS server (#10004) 2020-08-13 16:29:47 -05:00
SangBin Cho 86b1db3f11 [Stats] Make metrics report time configurable (#10036)
* Done.

* Lint.

* Address code review.

* Address code review.

* Remove wrong commit.

* Fix a test error.
2020-08-13 00:30:24 -07:00
fyrestone 739933e5b8 [Dashboard] Start the new dashboard (#9860) 2020-08-13 11:01:46 +08:00
fangfengbin 701e26e0af [GCS] Add node realtime resource view (#10043) 2020-08-12 10:52:17 +08:00
Zhuohan Li a6fed4820e [Core] Preliminary implementation of ownership-based object directory (#9735) 2020-08-11 15:04:13 -07:00
SangBin Cho 946ae74817 [GCS Actor Management] Race condition around creating -> created phase. (#10035)
* Fix the issue.

* Address a code review.
2020-08-11 12:31:27 -07:00
Basasuya 0400a88bf1 [EVENT] Basic Function and Definition (#9657) 2020-08-11 17:36:07 +08:00
Kai Yang 3bc17fa62a [Core] Multi-tenancy: Pass env variables from job config to worker processes (#10022) 2020-08-10 14:31:37 -07:00
Alex Wu 2ebf76c7a3 [New Scheduler] Additional unit tests (#9990) 2020-08-10 11:44:06 -07:00
SangBin Cho eb6b10221e Increase the num of trials to reduce the probability of failing sample_test (#10007) 2020-08-10 10:05:33 -07:00
Kai Yang 37821f0b4c Support unlimited JVM options (#9910) 2020-08-10 16:08:33 +08:00
fangfengbin 26b36a1982 Optimize node register&worker failure log (#9833) 2020-08-10 11:41:45 +08:00
fangfengbin a2bfdcbf24 [Placement Group] Trigger placement group scheduling when a new node is added (#9905) 2020-08-10 10:56:17 +08:00
Barak Michener 8e76796fd0 ci: Redo format.sh --all script & backfill lint fixes (#9956) 2020-08-07 16:49:49 -07:00
Barak Michener 1d01c668f0 rpc: Core Worker client pool (#9934) 2020-08-07 16:34:29 -07:00
Tao Wang 8bea875673 [TEST] Check if port is free before starting up redis (#9974)
* [TEST] Check if port is free before starting up redis

* per comment
2020-08-07 10:15:12 -07:00
SangBin Cho 44826878ff [Core] Remove Legacy Raylet Code (#9936)
* Remove a flag and some methods in node manager including HandleDisconnectedActor, ResubmitTask, and HandleTaskReconstruction

* Make actor creator always required + remove raylet transport

* Remove actor reporter + remove FinishAssignedActorCreationTask

* Remove actor tasks.

* Remove finishactortask and switched it to finishactorcreation task

* Remove reconstruction policy.

* Remove lineage cache.

* Formatting.

* Remove actor frontier code.

* Removed build error.

* Revert "Remove reconstruction policy."

This reverts commit 9d25c9bced4da5fbcac5d484d51013345f16513b.

* Recover HandleReconstruction to mark expired objects as failed.
2020-08-06 16:37:50 -07:00
SangBin Cho ec2f1a225e [Stats] Metrics Export User Interface Part 1 (#9913)
* Metrics export port expose done.

* Support exposing metrics port + metrics agent service discovery through ray.nodes()

* Formatting.

* Added a doc.

* Linting.

* Change the location of metrics agent port.

* Addressed code review.

* Addressed code review.
2020-08-06 16:16:29 -07:00
Eric Liang 7d4f204aa8 [Placement Group] Allow scheduling a task on any bundle (-1, default) (#9885)
* wip

* wip

* fix tests

* wip

* wip

* wip

* wip

* wip

* add test

* update

* update

* remove debug

* comments
2020-08-06 00:05:21 -07:00
Tao Wang 1760586628 [GCS] Use an asynchronous PING to avoid blocking other operations (#9871)
* Use separate redis client to avoid its sync command blocking other operations

* use redis_failure_detector_client_

* use async command to ping redis

* format log
2020-08-05 19:10:53 -07:00
SangBin Cho 68899e2f8e [GCS Actor Management] Fix race condition for DEPENDENCIES_UNREADY states. (#9883)
* Fix issues.

* Address code review.

* Addressed code review 2.

* Fix formatting.

* Addressed code review 3.

* Addressed code review.
2020-08-05 12:22:12 -07:00
SangBin Cho 685182923c [Core] Fix detached actor local mode when gcs actor management is on. (#9839)
* Fix local mode detached actor.

* Revert changes.
2020-08-05 09:04:24 -07:00
kisuke95 ddc1e483fb Fix actor table Delete bug (#9499) 2020-08-05 18:05:51 +08:00
kisuke95 80d2544f6b Fix vector<bool> for loop (#9907) 2020-08-05 17:49:37 +08:00
fangfengbin 193d11ab8b Optimize placement group log (#9891) 2020-08-05 14:41:32 +08:00
chaokunyang 3323ad9d59 [HOTFIX] Fix master build with missing placement group argument (#9868)
* fix common task submit default placement group

* fix java_function
2020-08-04 11:19:15 -07:00
Barak Michener c16e1b9524 src/ray/protobuf: Break proto rules into a proper BUILD file (#9792) 2020-08-04 11:12:45 -07:00
Kai Yang 27cd323ce1 [Core] Multi-tenancy: Job isolation & implement per job config (except for env variables) (#9500) 2020-08-04 15:51:29 +08:00
kisuke95 28b1f7710c [Core] Error info pubsub (Remove ray.errors API) (#9665) 2020-08-04 14:04:29 +08:00
fangfengbin 8c3fc1db76 Optimize actor creation log (#9781) 2020-08-04 10:29:30 +08:00
Zhijun Fu 4f2e4f31dd async grpc calls should always return void (#9533) 2020-08-03 12:44:02 -07:00
Stephanie Wang 37a9c5783c [core] Report resource load by shape (#9806)
* Report and aggregate resource load by shape

* python test

* python test

* x

* update
2020-07-31 16:57:30 -07:00
Eric Liang b73080c85f Allow tasks to be used with placement groups (#9738) 2020-07-31 10:51:37 -07:00
fangfengbin 3900643948 Add actor states definitions & transition diagram doc (#9754) 2020-07-31 15:35:25 +08:00