Commit Graph

82 Commits

Author SHA1 Message Date
Stephanie Wang ada58abcd9 [Object spilling] Update object directory and reload spilled objects automatically (#11021)
* Fix pytest...

* Release objects that have been spilled

* GCS object table interface refactor

* Add spilled URL to object location info

* refactor to include spilled URL in notifications

* improve tests

* Add spilled URL to object directory results

* Remove force restore call

* Merge spilled URL and location

* fix

* CI

* build

* osx

* Fix multitenancy issues

* Skip windows tests
2020-10-02 15:52:42 -07:00
SangBin Cho 1e39c40370 [Placement Group] Capture child tasks by default. (#11025)
* In progress.

* Finished up.

* Improve comment.

* Addressed code review.

* Fix test failure.

* Fix ci failures.

* Fix CI issues.
2020-09-27 19:33:00 -07:00
DK.Pino db7097fb1f [Refactor] Rename ClientId to NodeId (#10992)
* rename ClientId to NodeId

* format lint

* format lint

* fix conflicts

* rename new ClientId to NodeId

* update lint

* make same version of clang-format with travis ci
2020-09-27 10:24:21 -07:00
SangBin Cho 5e6b887f2d [Placement Group] Capture Child Task Part 1 (#10968)
* In progress.

* In progress.

* Done.

* Addressed code review.

* Increase timeout to make a test less flaky.

* Addressed code review.

* Addressed code review.
2020-09-24 09:02:03 -07:00
fyrestone 50784e2496 [Dashboard] Dashboard node grouping (#10528)
* Add RAY_NODE_ID environment var to agent

* Node related data use node id as key

* ray.init() return node id; Pass test_reporter.py

* Fix lint & CI

* Fix comments

* Minor fixes

* Fix CI

* Add const to ClientID in AgentManager::Options

* Use fstring

* Add comments

* Fix lint

* Add test_multi_nodes_info

Co-authored-by: 刘宝 <po.lb@antfin.com>
2020-09-16 10:17:29 -07:00
Clark Zinzow 0c0b0d0a73 [Core] Added support for submission-time task names. (#10449)
* Added support for submission-time task names.

* Suggestions from code review: add missing consts

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* Add num_returns arg to actor method options docstring example.

* Add process name line and proctitle assertion to submission-time task name section of advanced docs.

* Add submission-time task name --> proctitle test for Python worker.

* Added Python actor options tests for num_returns and name.

* Added Java test for submission-time task names.

* Add dashboard image to task name docs section.

* Move to fstrings.

Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2020-09-03 11:45:24 -07:00
Stephanie Wang 85e57a7a98 [Object spilling] Look up the location of the primary raylet from the owner's metadata (#10197)
* Get the primary copy from the owner, python test, some node manager fixes

* fixes and todo

* update

* lint

* fix build
2020-08-20 14:46:59 -07:00
SangBin Cho 263df6163c [Placement Group] Placement group remove api part 1 (#10063)
* Added basic rpc calls.

* fix issues.

* Fix the gcs server not getting request issue.

* In Progress.

* Basic logic done. Tests are required.

* In progress.

* In progress in refactoring context.

* Revert "In progress in refactoring context."

This reverts commit 38236256cf1306c60dd203e75d45ceb4509c8106.

* Working now.

* Python test works.

* Lint.

* Addressed code review.

* Addressed code review.

* Lint.

* Added unit tests.

* Done, but one of unit tests fail

* Addressed code review.

* Addressed the last code review.

* Fix the wrong test case.
2020-08-18 12:44:00 -07:00
Siyuan (Ryans) Zhuang 17ca1d8ff4 [Core] Object spilling prototype (#9818) 2020-08-14 15:39:10 -07:00
Zhuohan Li a6fed4820e [Core] Preliminary implementation of ownership-based object directory (#9735) 2020-08-11 15:04:13 -07:00
SangBin Cho ec2f1a225e [Stats] Metrics Export User Interface Part 1 (#9913)
* Metrics export port expose done.

* Support exposing metrics port + metrics agent service discovery through ray.nodes()

* Formatting.

* Added a doc.

* Linting.

* Change the location of metrics agent port.

* Addressed code review.

* Addressed code review.
2020-08-06 16:16:29 -07:00
Kai Yang 27cd323ce1 [Core] Multi-tenancy: Job isolation & implement per job config (except for env variables) (#9500) 2020-08-04 15:51:29 +08:00
Eric Liang b73080c85f Allow tasks to be used with placement groups (#9738) 2020-07-31 10:51:37 -07:00
Alisa 51e12ee97c Python api of placement group (#9243) 2020-07-27 14:57:05 -07:00
SangBin Cho 2f674728a6 [GCS Actor Management] Gcs actor management broken detached actor (#9473) 2020-07-16 15:41:18 +08:00
Zhuohan Li 8a76f4cbb5 [Core] put small objects in memory store (#8972)
* remove the put in memory store

* put small objects directly in memory store

* cast data type

* fix another place that uses Put to spill to plasma store

* fix multiple tests related to memory limits

* partially fix test_metrics

* remove non-functioning code

* fix core_worker_test

* refactor put to plasma codes

* add a flag for the new feature

* add flag to more places

* do a warmup round for the plasma store

* lint

* lint again

* fix warmup store

* Update _raylet.pyx

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2020-07-09 15:39:40 -07:00
SangBin Cho 8f19f1eafb [Core] Actor handle refactoring (#8895)
* Marking needed changes.

* Resolve basic dependencies.

* In progress.

* linting.

* In progress 2.

* Linting.

* Refactor done. Cleanup needed.

* Linting.

* Recover kill actor in core worker because it is used inside raylet

* Cleanup.

* Use unique pointer instead. Unit tests are broken now.

* Fix the upstream change.

* Addressed code review 1.

* Lint.

* Addressed code review 2.

* Fix weird github history.

* Lint.

* Linting using clang 7.0.

* Use a better check message.

* Revert cpp stuff.

* Fix weird linting errors.

* Manually fix all lint issues.

* Update a newline.

* Refactor some interface.

* Addressed all code review.

* Addressed code review
2020-07-07 11:11:41 -07:00
Ian Rodney a1e14380ce [core] Switch Async Callback to C++ [WIP] (#9228)
Co-authored-by: simon-mo <simon.mo@hey.com>
2020-07-07 09:47:25 -07:00
Stephanie Wang b42d6a1ddc [core] Refactor task arguments and attach owner address (#9152)
* Add intended worker ID to GetObjectStatus, tests

* Remove TaskID owner_id

* lint

* Add owner address to task args

* Make TaskArg a virtual class, remove multi args

* Set owner address for task args

* merge

* Fix tests

* Fix

* build

* update

* build

* java

* Move code

* build

* Revert "Fix Google log directory again (#9063)"

This reverts commit 275da2e400.

* Fix free

* x

* build

* Fix java

* Revert "Revert "Fix Google log directory again (#9063)""

This reverts commit 4a326fcb148ca09a35bc7de11d89df10edbb56e7.

* lint
2020-07-06 21:25:14 -07:00
Stephanie Wang 490cddc250 [core] Refactor distributed ref counting to remove owner task ID (#9049)
* Add intended worker ID to GetObjectStatus, tests

* Remove TaskID owner_id

* lint

* Update message

* lint

* Fix build
2020-06-25 17:55:03 -07:00
Simon Mo b6d425526d Move actor task submission to io service (#9093) 2020-06-23 10:07:33 -07:00
Kai Yang 2e5e789294 Allow enabling logging in core worker with empty log_dir (#8529) 2020-05-22 18:02:37 +08:00
Edward Oakes 16f48078d9 Remove use of ObjectID transport flag (#7699) 2020-05-17 11:29:49 -05:00
Max Fitton 00325eb2b2 Rename max_reconstructions to max_restarts and use -1 for infinite (#8274)
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-05-14 10:30:29 -05:00
Edward Oakes 2677b71003 Implement named actors using the GCS service (#8328) 2020-05-09 08:58:10 -05:00
SangBin Cho e631827a9f [Core] Show_webui segfault fix. (#8323) 2020-05-06 11:45:07 -05:00
ijrsvt 69ff7e3e35 TaskCancellation (#7669)
* Smol comment

* WIP, not passing ray.init

* Fixed small problem

* wip

* Pseudo interrupt things

* Basic prototype operational

* correct proc title

* Mostly done

* Cleanup

* cleaner raylet error

* Cleaning up a few loose ends

* Fixing Race Conds

* Prelim testing

* Fixing comments and adding second_check for kill

* Working_new_impl

* demo_ready

* Fixing my english

* Fixing a few problems

* Small problems

* Cleaning up

* Response to changes

* Fixing error passing

* Merged to master

* fixing lock

* Cleaning up print statements

* Format

* Fixing Unit test build failure

* mock_worker fix

* java_fix

* Cancel

* Switching to Cancel

* Responding to Review

* FixFormatting

* Lease cancellation

* Final comments?

* Moving exist check to CoreWorker

* Fix Actor Transport Test

* Fixing task manager test

* changing clock repr

* Fix build

* fix white space

* lint fix

* Updating to medium size

* Fixing Java test compilation issue

* lengthen bad timeouts
2020-04-25 16:04:52 -07:00
Stephanie Wang eefea4e29c [core] Post task submission to IO loop (#8090)
* Post to IO loop

* Unused

* Fix build
2020-04-20 19:13:50 -07:00
Clark Zinzow d4cae5f632 [Core] Added ability to specify different IP addresses for a core worker and its raylet. (#7985) 2020-04-16 10:32:24 -05:00
Kai Yang 48b48cc8c2 Support multiple core workers in one process (#7623) 2020-04-07 11:01:47 +08:00
ijrsvt 9bfc2c4b54 Moving Local Mode to C++ (#7670) 2020-04-01 15:50:57 -05:00
Robert Nishihara b011c604d7 Remove ray.tasks() from API. (#7807) 2020-04-01 10:10:40 -05:00
Eric Liang 745b9d643d First pass at ray memory command for memory debugging (#7589) 2020-03-17 20:45:07 -07:00
Kai Yang d6e8f47065 Add a flag to disable reconstruction for a killed actor (#7346) 2020-03-13 19:10:21 +08:00
Stephanie Wang fdb528514b [core] Ref counting for actor handles (#7434)
* tmp

* Move Exit handler into CoreWorker, exit once owner's ref count goes to 0

* fix build

* Remove __ray_terminate__ and add test case for distributed ref counting

* lint

* Remove unused

* Fixes for detached actor, duplicate actor handles

* Remove unused

* Remove creation return ID

* Remove ObjectIDs from python, set references in CoreWorker

* Fix crash

* Fix memory crash

* Fix tests

* fix

* fixes

* fix tests

* fix java build

* fix build

* fix

* check status

* check status
2020-03-10 17:45:07 -07:00
ijrsvt fb76092d75 Re-route asyncio plasma code path through raylet instead of direct plasma connection (#7234) 2020-03-03 15:43:46 -05:00
Eric Liang b310661338 Add internal_api.global_gc() method, which triggers gc.collect() on all workers (#7327) 2020-02-26 14:09:29 -08:00
Edward Oakes 44b4394afa Remove unused AddContainedObjectIDs (#7323) 2020-02-25 16:42:20 -08:00
Simon Mo b804d40c04 Stop vendoring pyarrow (#7233) 2020-02-19 19:01:26 -08:00
Simon Mo 7bef7031c2 Revert "Revert "Revert "Removing Pyarrow dependency (#7146)" (#7209) (#7214)" (#7232) 2020-02-19 13:35:29 -08:00
Simon Mo e8941b1b79 Revert "Revert "Removing Pyarrow dependency (#7146)" (#7209) (#7214) 2020-02-19 10:08:52 -08:00
Stephanie Wang f76ce836b2 Distributed ref counting for serialized ObjectIDs (#6945)
* Skeleton plus a unit test for simple borrower case

* First unit test passes - forward an ID and task returns with 1 submitted task pending on the inner ID

* Invariant for contained_in

* Unit test passes for testing task return without creating a borrower

* Wrap ref count functionality in test case

* Fix bad delete

* Unit test and fix for borrowers creating more borrowers

* Unit test and fix for simple borrowing, but owner sends call after borrower's ref count goes to 0

* Refactor:
- keep a sentinel ref count for task argument IDs
- keep contained_in_borrowed in addition to contained_in_owned

* Unit test for nested IDs passes

* Refactor so that an object ID can only be contained in 1 borrowed ID at a time

* Add check

* Fix

* Unit test (passes) to test nesting object IDs but no borrowers created

* Unit test for nested objects from different owners passes, refactor to unset contained_in when popping refs

* Unit tests for borrowers receiving an ObjectID from multiple sources,
skip adding ownership info if we already have it to handle duplicate
refs

* Unit test for returning object ID passes

* More unit tests for returning object IDs pass

* Add serialized ID tests

* fix serialization issue

* remove swap

* It builds!

* debugging and some fixes:
- register handler for WaitForRefRemoved
- don't create a python reference for arg IDs
- pass in client factory into ReferenceCounter
- fix bad decrement in PopBorrowerRefs

* Fix accounting for serialized IDs:
- don't decrement for IDs on dependency resolution, wait until task finished
- add object IDs that were inlined when building the arguments to the task spec, pin these on the task executor until task finishes

* mu_ -> mutex_

* lint

* fix build

* clear outer_object_id

* add direct call type check

* Fix test for direct call IDs and return IDs for actor calls

* Fix CoreWorkerClient.Addr()

* Remove unneeded lock

* Remove unnecessary ObjectID refs

* Fix worker holding serialized refs test

* Fix hex IDs

* fix

* fix tests

* fix tests

* refactor and cleanups

* lint

* Put inlined Ids in task args and some cleanup

* Add back gc.collect() line for test case

* Refactor and fixes:
- store inlined IDs in RayObject
- allow storing objects with inlined IDs in memory store
- pin objects that were promoted to plasma

* oops

* make sure worker ID is set in address, pass in rpc::Address to CoreWorkerClient

* todos

* cleanups and test builds

* Fix tests

* Add feature flag

* cleanups

* address comments and some cleanups

* cleanup

* fix recursive test

* Comments for tests

* Turn off ref counting by default

* Skip tests

* Fix some bugs for test_array.py, java build

* Don't include nested objects in the ref count when the feature flag is off

* C++ feature flag does not work...

* Remove

* Turn on python tests and add a warning when plasma objects are evicted before being pinned

* Fix build and remove irrelevant test

* Fix for java

* Revert "Fix build and remove irrelevant test"

This reverts commit 056cca9b263ed05b0f9ab2250907338edcbca2d5.

* Fix ray.internal.free

* Fixes and skip some flaky tests

* fix java build

* fix windows build

* Add IDs contained in owned objects

* Update src/ray/protobuf/core_worker.proto

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/core_worker/reference_count.cc

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/protobuf/core_worker.proto

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/protobuf/core_worker.proto

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/core_worker/reference_count.h

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/core_worker/reference_count.h

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/core_worker/reference_count.cc

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* update

* Try to fix ::test_direct_call_serialized_id_eviction

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-02-18 18:21:34 -08:00
Eric Liang 0aa9373d62 Revert "Removing Pyarrow dependency (#7146)" (#7209)
This reverts commit 2116fd3bca.
2020-02-18 14:12:06 -08:00
ijrsvt 2116fd3bca Removing Pyarrow dependency (#7146) 2020-02-17 18:00:13 -08:00
fyrestone a6b8bd47b0 [xlang] Cross language serialize ActorHandle (#7134) 2020-02-17 20:44:56 +08:00
Qing Wang f3703bafa3 [Java] Support concurrent actor calls API. (#7022)
* WIP

Temp change

Attach native thread to jvm

* Fix run mode

* Address comments.
2020-02-14 13:02:39 +08:00
Edward Oakes 844f607c93 Collect contained ObjectIDs during deserialization (#7029) 2020-02-03 22:49:14 -08:00
Edward Oakes 984490d2be Collect object IDs during serialization (#6946) 2020-02-03 18:38:11 -08:00
Edward Oakes 92525f35d1 Remove raylet client from Python worker (#6018) 2020-01-31 18:23:01 -08:00
Simon Mo 396d7fafc8 UI improvement for asyncio (#6905) 2020-01-27 12:45:51 -08:00