1438 Commits

Author SHA1 Message Date
Lixin Wei eb66db3199 [Build] bug fixed for logging (#10364) 2020-08-28 09:17:08 -07:00
SangBin Cho d206fbbc99 [Placement group] Scheduler map refactoring part 1. (#10381)
* In Progress

* done.

* Address code review.
2020-08-28 00:57:09 -07:00
SongGuyang cb70864c04 [cpp worker] support cluster mode and object Put/Get works (#9682) 2020-08-28 13:53:36 +08:00
SangBin Cho 17f465d5c1 [Core] Improve raylet failure error msg (#10345)
* Improve error message.

* Lint.

* Addressed code review.
2020-08-27 12:53:18 -07:00
Clark Zinzow 0178d6318e [Core] Expand job ID to 4 bytes by removing object flag bytes. (#10187) 2020-08-27 14:08:17 -05:00
Stephanie Wang f75dfd60a3 [api] API deprecations and cleanups for 1.0 (internal_config and Checkpointable actor) (#10333)
* remove

* internal config updates, remove Checkpointable

* Lower object timeout default

* remove json

* Fix flaky test

* Fix unit test
2020-08-27 10:19:53 -07:00
Edward Oakes 60665fc936 Clean up task dependency and scheduler metrics (#10340) 2020-08-26 22:56:03 -05:00
Lixin Wei 4b856fa416 [Core]Async updating issue fixed for actor's num_restart (#10176)
* bug fixed for num_restart updating

* add log

* log updated

* lint

* fixed

* Update src/ray/gcs/gcs_server/gcs_actor_manager.cc

Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>

* bug fixed

* bug fixed

* test passed

Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
2020-08-26 11:49:26 -07:00
Edward Oakes c35ad8237d [metrics] Clean up object manager stats (#10316) 2020-08-26 13:43:06 -05:00
Edward Oakes 916a19363f Clean up actor metrics (#10317) 2020-08-26 10:21:15 -05:00
Edward Oakes cbd9632f3a Fix wait timeout logic (#10199) 2020-08-25 22:41:39 -05:00
fyrestone 08adbb371f Cross language exception (#10023) 2020-08-26 10:46:05 +08:00
Edward Oakes 1e99b814f0 Remove unused scheduler states (#10318)
* remove unused state

* remove unused states
2020-08-25 18:56:21 -07:00
Stephanie Wang d4537ac1ce [core] Try to schedule tasks locally before spilling over to remote nodes (#10302)
* Regression test

* Spillback

* Remove check for actor tasks
2020-08-25 15:01:59 -07:00
kisuke95 24a7a8a04d [Streaming] Build fix (#10233) 2020-08-25 11:37:21 -07:00
fyrestone 05c103af94 [Dashboard] Start the new dashboard (#10131)
* Use new dashboard if environment var RAY_USE_NEW_DASHBOARD exists; new dashboard startup

* Make fake client/build/static directory for dashboard

* Add test_dashboard.py for new dashboard

* Travis CI enable new dashboard test

* Update new dashboard

* Agent manager service

* Add agent manager

* Register agent to agent manager

* Add a new line to the end of agent_manager.cc

* Fix merge; Fix lint

* Update dashboard/agent.py

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* Update dashboard/head.py

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* Fix bug

* Add tests for dashboard

* Fix

* Remove const from Process::Kill() & Fix bugs

* Revert error check of execute_after

* Raise exception from DashboardAgent.run

* Add more tests.

* Fix compile on Linux

* Use dict comprehension instead of dict(generator)

* Fix lint

* Fix windows compile

* Fix lint

* Test Windows CI

* Revert "Test Windows CI"

This reverts commit 945e01051ec95cff5fcc1c0bc37045b46e7ad9a6.

* Fix ParseWindowsCommandLine bug

* Update src/ray/util/util.cc

Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>

Co-authored-by: 刘宝 <po.lb@antfin.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>
2020-08-24 13:24:23 -07:00
Kai Yang 07f6cb17e4 [Core] Multi-tenancy: Refine worker env variable passing (#10191)
* Resolve issues with environment variable handling

* fix

* fix warning

* lint

Co-authored-by: Mehrdad <noreply@github.com>
2020-08-24 09:04:22 -07:00
fangfengbin b61a79efd7 [Placement Group]Fix SigSegv bug (#10262)
* fix SigSegv bug

* fix review comments

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-08-23 11:33:40 -07:00
Ian Rodney 32ed1a18b7 [hotfix] Fix lint in master (#10254) 2020-08-21 20:53:05 -07:00
Alex Wu 136c8ff19e [NewScheduler] Pass test_basic.py (#10059)
* .

* .

* Cleanup

* .

* whoops

* Update src/ray/raylet/scheduling/cluster_task_manager.h

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/raylet/scheduling/cluster_task_manager.h

Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>

* CR

* .

* .

* done

* .

* Unit tests

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
2020-08-21 15:00:08 -07:00
Barak Michener f03caa4532 rpc: Follow-up by sharing the core worker client pool within the core worker. (#10206)
* Share CoreWorkerClientPool

* Format
2020-08-21 11:01:22 -07:00
Stephanie Wang 85e57a7a98 [Object spilling] Look up the location of the primary raylet from the owner's metadata (#10197)
* Get the primary copy from the owner, python test, some node manager fixes

* fixes and todo

* update

* lint

* fix build
2020-08-20 14:46:59 -07:00
fangfengbin a462ae2747 [Placement Group]Add strict spread strategy (#10174)
* support STRICT_SPREAD strategy

* fix review comments

* rebase master

* fix lint error

* fix lint error

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-08-20 10:18:58 -07:00
SangBin Cho 224933b5e4 [Placement Group] Remove API part 2 (#10215)
* Initial progress done.

* Fix mistake.

* Addressed code review.

* Fix cpp build issue.

* Addressed code review.
2020-08-20 09:50:13 -07:00
fangfengbin 9734dbca3e [Placement Group]Reschedule bundles when the node of bundles is dead (#10021) 2020-08-19 13:24:42 -07:00
SangBin Cho 263df6163c [Placement Group] Placement group remove api part 1 (#10063)
* Added basic rpc calls.

* fix issues.

* Fix the gcs server not getting request issue.

* In Progress.

* Basic logic done. Tests are required.

* In progress.

* In progress in refactoring context.

* Revert "In progress in refactoring context."

This reverts commit 38236256cf1306c60dd203e75d45ceb4509c8106.

* Working now.

* Python test works.

* Lint.

* Addressed code review.

* Addressed code review.

* Lint.

* Added unit tests.

* Done, but one of unit tests fail

* Addressed code review.

* Addressed the last code review.

* Fix the wrong test case.
2020-08-18 12:44:00 -07:00
Simon Mo bedc2c24c8 Export Metrics in OpenCensus Protobuf Format (#10080) 2020-08-18 11:32:42 -07:00
SangBin Cho 053188dfbe [Placement Group] Support Placement Group state table. (#10090)
* Done.

* Addressed code review.

* Linting.

* Fix lint.

* Fix lint.

* Fix a test.

* Lint.

* Add a lint sleep to test.

* Fix the lint issue.

* Fixed doc build error.
2020-08-17 09:24:50 -07:00
fangfengbin edd783bc32 [Placement Group]Add soft pack strategy (#10099) 2020-08-17 12:01:34 +08:00
Tao Wang fba5906ce3 [GCS] Re-report heartbeat when gcs server restarts (#10040)
* Retry to send failed heartbeat when light heartbeat enabled

* Re-report heartbeat when gcs server restarts

* remove is_pubsub_server_restarted

* add lock per comment

* minor change, name related
2020-08-14 17:37:20 -07:00
Siyuan (Ryans) Zhuang 17ca1d8ff4 [Core] Object spilling prototype (#9818) 2020-08-14 15:39:10 -07:00
Robert Nishihara 36e626e95d Revert "[Dashboard] Start the new dashboard (#9860)" (#10116)
This reverts commit 739933e5b8.
2020-08-14 14:06:57 -07:00
fangfengbin 3a6fa7d622 [Placement Group]Optimize placement group strict pack strategy (#9924)
* add part code

* add code

* add part code

* rm used import

* add part code

* add part code

* add part code

* add part code

* add part code

* add part code

* fix review comment

* add testcase

* use ResourceSet

* fix review comment

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-08-13 23:58:52 -07:00
Simon Mo 01f38bc5d1 CoreWorker correctly push metrics to agent (#10031) 2020-08-13 16:44:53 -07:00
Ícaro Aragão b77d6bf87d [GCS] Improve fallback for getting local valid IP for GCS server (#10004) 2020-08-13 16:29:47 -05:00
SangBin Cho 86b1db3f11 [Stats] Make metrics report time configurable (#10036)
* Done.

* Lint.

* Address code review.

* Address code review.

* Remove wrong commit.

* Fix a test error.
2020-08-13 00:30:24 -07:00
fyrestone 739933e5b8 [Dashboard] Start the new dashboard (#9860) 2020-08-13 11:01:46 +08:00
fangfengbin 701e26e0af [GCS]Add node realtime resource view (#10043) 2020-08-12 10:52:17 +08:00
Zhuohan Li a6fed4820e [Core] Preliminary implementation of ownership-based object directory (#9735) 2020-08-11 15:04:13 -07:00
SangBin Cho 946ae74817 [GCS Actor Management] Race condition around creating -> created phase. (#10035)
* Fix the issue.

* Address a code review.
2020-08-11 12:31:27 -07:00
Basasuya 0400a88bf1 [EVENT] Basic Function and Definition (#9657) 2020-08-11 17:36:07 +08:00
Kai Yang 3bc17fa62a [Core] Multi-tenancy: Pass env variables from job config to worker processes (#10022) 2020-08-10 14:31:37 -07:00
Alex Wu 2ebf76c7a3 [New Scheduler] Additional unit tests (#9990) 2020-08-10 11:44:06 -07:00
SangBin Cho eb6b10221e Increase the num of trials to reduce the probability of failing sample_test (#10007) 2020-08-10 10:05:33 -07:00
Kai Yang 37821f0b4c Support unlimited JVM options (#9910) 2020-08-10 16:08:33 +08:00
fangfengbin 26b36a1982 Optimize node register&worker failure log (#9833) 2020-08-10 11:41:45 +08:00
fangfengbin a2bfdcbf24 [Placement Group]Trigger placement group scheduling when a new node is added (#9905) 2020-08-10 10:56:17 +08:00
Barak Michener 8e76796fd0 ci: Redo format.sh --all script & backfill lint fixes (#9956) 2020-08-07 16:49:49 -07:00
Barak Michener 1d01c668f0 rpc: Core Worker client pool (#9934) 2020-08-07 16:34:29 -07:00
Tao Wang 8bea875673 [TEST]Check if port is free before start up redis (#9974)
* [TEST]Check if port is free before start up redis

* per comment
2020-08-07 10:15:12 -07:00