165 Commits

Author SHA1 Message Date
Stephanie Wang 85e57a7a98 [Object spilling] Look up the location of the primary raylet from the owner's metadata (#10197)
* Get the primary copy from the owner, python test, some node manager fixes

* fixes and todo

* update

* lint

* fix build
2020-08-20 14:46:59 -07:00
fangfengbin a462ae2747 [Placement Group] Add strict spread strategy (#10174)
* support STRICT_SPREAD strategy

* fix review comments

* rebase master

* fix lint error

* fix lint error

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-08-20 10:18:58 -07:00
SangBin Cho 263df6163c [Placement Group] Placement group remove api part 1 (#10063)
* Added basic rpc calls.

* fix issues.

* Fix the gcs server not getting request issue.

* In Progress.

* Basic logic done. Tests are required.

* In progress.

* In progress in refactoring context.

* Revert "In progress in refactoring context."

This reverts commit 38236256cf1306c60dd203e75d45ceb4509c8106.

* Working now.

* Python test works.

* Lint.

* Addressed code review.

* Addressed code review.

* Lint.

* Added unit tests.

* Done, but one of unit tests fail

* Addressed code review.

* Addressed the last code review.

* Fix the wrong test case.
2020-08-18 12:44:00 -07:00
SangBin Cho 053188dfbe [Placement Group] Support Placement Group state table. (#10090)
* Done.

* Addressed code review.

* Linting.

* Fix lint.

* Fix lint.

* Fix a test.

* Lint.

* Add a lint sleep to test.

* Fix the lint issue.

* Fixed doc build error.
2020-08-17 09:24:50 -07:00
fangfengbin edd783bc32 [Placement Group] Add soft pack strategy (#10099) 2020-08-17 12:01:34 +08:00
Siyuan (Ryans) Zhuang 17ca1d8ff4 [Core] Object spilling prototype (#9818) 2020-08-14 15:39:10 -07:00
Simon Mo 01f38bc5d1 CoreWorker correctly push metrics to agent (#10031) 2020-08-13 16:44:53 -07:00
SangBin Cho 86b1db3f11 [Stats] Make metrics report time configurable (#10036)
* Done.

* Lint.

* Address code review.

* Address code review.

* Remove wrong commit.

* Fix a test error.
2020-08-13 00:30:24 -07:00
Zhuohan Li a6fed4820e [Core] Preliminary implementation of ownership-based object directory (#9735) 2020-08-11 15:04:13 -07:00
Barak Michener 8e76796fd0 ci: Redo format.sh --all script & backfill lint fixes (#9956) 2020-08-07 16:49:49 -07:00
SangBin Cho 44826878ff [Core] Remove Legacy Raylet Code (#9936)
* Remove a flag and some methods in node manager including HandleDisconnectedActor, ResubmitTask, and HandleTaskReconstruction

* Make actor creator always required + remove raylet transport

* Remove actor reporter + remove FinishAssignedActorCreationTask

* Remove actor tasks.

* Remove finishactortask and switched it to finishactorcreation task

* Remove reconstruction policy.

* Remove lineage cache.

* Formatting.

* Remove actor frontier code.

* Removed build error.

* Revert "Remove reconstruction policy."

This reverts commit 9d25c9bced4da5fbcac5d484d51013345f16513b.

* Recover HandleReconstruction to mark expired objects as failed.
2020-08-06 16:37:50 -07:00
SangBin Cho ec2f1a225e [Stats] Metrics Export User Interface Part 1 (#9913)
* Metrics export port expose done.

* Support exposing metrics port + metrics agent service discovery through ray.nodes()

* Formatting.

* Added a doc.

* Linting.

* Change the location of metrics agent port.

* Addressed code review.

* Addressed code review.
2020-08-06 16:16:29 -07:00
Kai Yang 27cd323ce1 [Core] Multi-tenancy: Job isolation & implement per job config (except for env variables) (#9500) 2020-08-04 15:51:29 +08:00
Eric Liang b73080c85f Allow tasks to be used with placement groups (#9738) 2020-07-31 10:51:37 -07:00
SangBin Cho 7e3ba289dc [Stats] Basic Metrics Infrastructure (Metrics Agent + Prometheus Exporter) (#9607) 2020-07-28 10:28:01 -07:00
Alisa 51e12ee97c Python api of placement group (#9243) 2020-07-27 14:57:05 -07:00
Lingxuan Zuo 9c4cf0f961 fix tag key typo (#9606) 2020-07-21 19:50:54 +08:00
mehrdadn 2554a1a997 Bazel fixes (#9519) 2020-07-19 12:53:08 -07:00
Lingxuan Zuo ce3f542739 [Metric] new cython interface for python worker metric (#9469) 2020-07-19 10:43:21 +08:00
Gabriele Oliaro 026c009086 Pipelining task submission to workers (#9363)
* first step of pipelining

* pipelining tests & default configs
- added pipelining unit tests in direct_task_transport_test.cc
- added an entry in ray_config_def.h, ray_config.pxi, and ray_config.pxd to configure the parameter controlling the maximum number of tasks that can be in flight to each worker
- consolidated worker_to_lease_client_ and worker_to_lease_client_ hash maps in direct_task_transport.h into a single one called worker_to_lease_entry_

* post-review revisions

* linting, following naming/style convention

* linting
2020-07-17 10:45:13 -07:00
SangBin Cho 2f674728a6 [GCS Actor Management] Gcs actor management broken detached actor (#9473) 2020-07-16 15:41:18 +08:00
kisuke95 5e2571e214 release gil in global state accessor (#9357) 2020-07-16 11:21:10 +08:00
Hao Chen d49dadf891 Change Python's ObjectID to ObjectRef (#9353) 2020-07-10 17:49:04 +08:00
Zhuohan Li 8a76f4cbb5 [Core] put small objects in memory store (#8972)
* remove the put in memory store

* put small objects directly in memory store

* cast data type

* fix another place that uses Put to spill to plasma store

* fix multiple tests related to memory limits

* partially fix test_metrics

* remove not functioning codes

* fix core_worker_test

* refactor put to plasma codes

* add a flag for the new feature

* add flag to more places

* do a warmup round for the plasma store

* lint

* lint again

* fix warmup store

* Update _raylet.pyx

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2020-07-09 15:39:40 -07:00
yncxcw 4ba4110dec [Core] Make worker_register_timeout_seconds configurable (#9221) 2020-07-07 18:27:15 -05:00
SangBin Cho 8f19f1eafb [Core] Actor handle refactoring (#8895)
* Marking needed changes.

* Resolve basic dependencies.

* In progress.

* linting.

* In progress 2.

* Linting.

* Refactor done. Cleanup needed.

* Linting.

* Recover kill actor in core worker because it is used inside raylet

* Cleanup.

* Use unique pointer instead. Unit tests are broken now.

* Fix the upstream change.

* Addressed code review 1.

* Lint.

* Addressed code review 2.

* Fix weird github history.

* Lint.

* Linting using clang 7.0.

* Use a better check message.

* Revert cpp stuff.

* Fix weird linting errors.

* Manually fix all lint issues.

* Update a newline.

* Refactor some interface.

* Addressed all code review.

* Addressed code review
2020-07-07 11:11:41 -07:00
Ian Rodney a1e14380ce [core] Switch Async Callback to C++ [WIP] (#9228)
Co-authored-by: simon-mo <simon.mo@hey.com>
2020-07-07 09:47:25 -07:00
Stephanie Wang b42d6a1ddc [core] Refactor task arguments and attach owner address (#9152)
* Add intended worker ID to GetObjectStatus, tests

* Remove TaskID owner_id

* lint

* Add owner address to task args

* Make TaskArg a virtual class, remove multi args

* Set owner address for task args

* merge

* Fix tests

* Fix

* build

* update

* build

* java

* Move code

* build

* Revert "Fix Google log directory again (#9063)"

This reverts commit 275da2e400.

* Fix free

* x

* build

* Fix java

* Revert "Revert "Fix Google log directory again (#9063)""

This reverts commit 4a326fcb148ca09a35bc7de11d89df10edbb56e7.

* lint
2020-07-06 21:25:14 -07:00
ChenZhilei 6f3d993681 GCS server use worker table to handle RegisterWorker instead of redis accessor (#9168) 2020-07-06 10:37:25 +08:00
Stephanie Wang 490cddc250 [core] Refactor distributed ref counting to remove owner task ID (#9049)
* Add intended worker ID to GetObjectStatus, tests

* Remove TaskID owner_id

* lint

* Update message

* lint

* Fix build
2020-06-25 17:55:03 -07:00
Simon Mo b6d425526d Move actor task submission to io service (#9093) 2020-06-23 10:07:33 -07:00
Zhilei Chen d8a9247448 Remove gcs_service_disabled ci jobs and code (#8854) 2020-06-19 11:32:27 +08:00
Lingxuan Zuo 4cbbc15ca7 [GCS] Global state accessor from node resource table (#8658) 2020-06-02 14:01:00 +08:00
Alec Brickner 207ab44129 Raise major version limit for msgpack (#8466) 2020-06-01 20:00:36 -07:00
fangfengbin 35eeec5647 Add C++ global state for actor table (#8501)
* add global state actors

* fix code style

* fix GcsActorManagerTest bug

* rebase master

* add jni code

* add get checkpoint id code

* add debug code

* add debug code

* change log level

* fix compile bug

* return null in jni

* fix crash bug

* change import seq

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
Co-authored-by: Hao Chen <chenh1024@gmail.com>
2020-05-29 21:10:42 +08:00
Hao Chen 08fee00bc8 Increase raylet client connect timeout to fix test_debug_tools (#8605) 2020-05-28 20:57:30 +08:00
Lingxuan Zuo e594524ed3 [GCS] global state query node info table from GCS. (#8498) 2020-05-28 16:39:13 +08:00
fyrestone f39760a4d3 Use uuid4() for actor creation function id hash (#8589) 2020-05-26 15:20:03 +08:00
fangfengbin 765d470c40 Add gcs object manager (#8298) 2020-05-25 17:21:35 +08:00
Tao Wang 92c2e41dfd [GCS] profile info getting implementation based gcs service (#8536) 2020-05-24 22:23:01 +08:00
Kai Yang 2e5e789294 Allow enabling logging in core worker with empty log_dir (#8529) 2020-05-22 18:02:37 +08:00
fangfengbin 9347a5d10c Add global state accessor of jobs (#8401) 2020-05-18 20:32:05 +08:00
Edward Oakes 16f48078d9 Remove use of ObjectID transport flag (#7699) 2020-05-17 11:29:49 -05:00
Stephanie Wang bd169749e0 Option to retry failed actor tasks (#8330)
* Python

* Consolidate state in the direct actor transport, set the caller starts at

* todo

* Remove unused

* Update and unit tests

* Doc

* Remove unused

* doc

* Remove debug

* Update src/ray/core_worker/transport/direct_actor_transport.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Update src/ray/core_worker/transport/direct_actor_transport.cc

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* lint and fix build

* Update

* Fix build

* Fix tests

* Unit test for max_task_retries=0

* Fix java?

* Fix bad test

* Cross language fix

* fix java

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2020-05-15 20:15:15 -07:00
Max Fitton 00325eb2b2 Rename max_reconstructions to max_restarts and use -1 for infinite (#8274)
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-05-14 10:30:29 -05:00
Edward Oakes 2677b71003 Implement named actors using the GCS service (#8328) 2020-05-09 08:58:10 -05:00
SangBin Cho e631827a9f [Core] Show_webui segfault fix. (#8323) 2020-05-06 11:45:07 -05:00
Edward Oakes ebdccde030 Fetch internal config from raylet (#8195) 2020-04-28 13:12:11 -05:00
ijrsvt 69ff7e3e35 TaskCancellation (#7669)
* Smol comment

* WIP, not passing ray.init

* Fixed small problem

* wip

* Pseudo interrupt things

* Basic prototype operational

* correct proc title

* Mostly done

* Cleanup

* cleaner raylet error

* Cleaning up a few loose ends

* Fixing Race Conds

* Prelim testing

* Fixing comments and adding second_check for kill

* Working_new_impl

* demo_ready

* Fixing my english

* Fixing a few problems

* Small problems

* Cleaning up

* Response to changes

* Fixing error passing

* Merged to master

* fixing lock

* Cleaning up print statements

* Format

* Fixing Unit test build failure

* mock_worker fix

* java_fix

* Cancel

* Switching to Cancel

* Responding to Review

* FixFormatting

* Lease cancellation

* Final comments?

* Moving exist check to CoreWorker

* Fix Actor Transport Test

* Fixing task manager test

* changing clock repr

* Fix build

* fix white space

* lint fix

* Updating to medium size

* Fixing Java test compilation issue

* lengthen bad timeouts
2020-04-25 16:04:52 -07:00
Stephanie Wang eefea4e29c [core] Post task submission to IO loop (#8090)
* Post to IO loop

* Unused

* Fix build
2020-04-20 19:13:50 -07:00