387 Commits

Author SHA1 Message Date
Eric Liang 69c5a2bc3c Warn if OMP_NUM_THREADS is set (#6729) 2020-01-08 14:59:07 -08:00
Robert Nishihara 5e43b25e8c Document fault tolerance behavior. (#6698) 2020-01-06 22:34:06 -08:00
Edward Oakes 2a4d2c6e9e Basic reference counting & pinning (#6554) 2020-01-06 17:30:26 -06:00
Robert Nishihara 92e44a5dc8 Deprecate redis_address argument in favor of address. (#6654) 2020-01-02 20:18:34 -08:00
Robert Nishihara 39a3459886 Remove (object) from class declarations. (#6658) 2020-01-02 17:42:13 -08:00
Robert Nishihara 480206eef8 Remove some Python 2 compatibility code. (#6624) 2019-12-31 17:14:58 -08:00
Eric Liang e2bc489a18 Port webui nits from original pr that enables it (#6628)
* backport changes

* Update test_webui.py
2019-12-29 19:19:43 -08:00
Robert Nishihara 8724e5ffd5 Start WebUI by default. (#6493) 2019-12-27 13:49:07 -08:00
Edward Oakes 6b1a57542e Add actor.__ray_kill__() to terminate actors immediately (#6523) 2019-12-23 23:12:57 -06:00
Yunzhi Zhang bac6f3b61e [Dashboard] Collecting worker stats in node manager and implement webui display in the backend (#6574) 2019-12-22 17:50:23 -08:00
Simon Mo 26ec500ef9 Implement async get for direct actor call (#6339) 2019-12-18 11:50:21 -08:00
Simon Mo e530c37b0e Use localhost and set redis password by default (#6481) 2019-12-17 19:41:19 -08:00
Edward Oakes e2b7459bfc Fix worker exit cleanup (#6450)
* working but ugly

* comments

* proper but hanging in grpc server destructor

* grpc server shutdown deadline

* fix disconnect

* lint

* shutdown_only in test

* replace shutdown
2019-12-13 16:52:50 -08:00
Edward Oakes 82f7dbc7a7 Increase TaskID size by 2 bytes, taken from JobID (#6425)
* Increase TaskID size by 2 bytes, taken from JobID

* comments

* check max job id

* fix doc

* fix local mode
2019-12-11 10:45:14 -08:00
Edward Oakes 044527adb8 Remove ref counting dependencies on ray.get() (#6412)
* Remove ref counting dependencies on Get()

* comment

* don't send IDs when disabled

* pass through internal config

* fix

* allow reinit

* remove flag
2019-12-10 18:11:34 -08:00
Stephanie Wang da41180dc0 [direct task] Retry tasks on failure and turn on RAY_FORCE_DIRECT for test_multinode_failures.py (#6306)
* multinode failures direct

* Add number of retries allowed for tasks

* Retry tasks

* Add failing test for object reconstruction

* Handle return status and debug

* update

* Retry task unit test

* update

* update

* todo

* Fix max_retries decorator, fix test

* Fix test that flaked

* lint

* comments
2019-12-02 10:20:57 -08:00
Edward Oakes e4f9b3b7d9 Use process reaper for cleanup (#6253) 2019-11-26 22:00:08 -06:00
Simon Mo 1ca8c427e3 Consistent Name for Process Title (#6276)
* Consistent naming for setproctitle

* Address comments

* Add debug/verbose mode

* Fix test
2019-11-26 11:56:28 -08:00
Philipp Moritz 33c768ebe4 Fix worker signal.SIGTERM handler being installed from outside the main thread (#6176) 2019-11-20 11:14:28 -08:00
Ujval Misra 2965dc1b72 [tune] Fault tolerance improvements (#5877)
* Precede ray.get with ray.wait.

* Trigger checkpoint deletes locally in Trainable

* Clean-up code.

* Minor changes.

* Track best checkpoint so far again

* Pulled checkpoint GC out of Trainable.

* Added comments, error logging.

* Immediate pull after checkpoint taken; rsync source delete on pull

* Minor doc fixes

* Fix checkpoint manager bug

* Fix bugs, tests, formatting

* Fix bugs, feature flag for force sync.

* Fix test.

* Fix minor bugs: clear proc and less verbose sync_on_checkpoint warnings.

* Fix bug: update IP of last_result.

* Fixed message.

* Added a lot of logging.

* Changes to ray trial executor.

* More bug fixes (logging after failure), better logging.

* Fix richards bug and logging

* Add comments.

* try-except

* Fix heapq bug.

* .

* Move handling of no available trials to ray_trial_executor (#1)

* Fix formatting bug, lint.

* Addressed Richard's comments

* Revert tests.

* fix rebase

* Fix trial location reporting.

* Fix test

* Fix lint

* Rebase, use ray.get w/ timeout, lint.

* lint

* fix rebase

* Address richard's comments
2019-11-18 01:14:41 -08:00
Ujval Misra e3e3ad4b25 Add timeout param to ray.get (#6107) 2019-11-14 00:50:04 -08:00
Philipp Moritz f24d96ec4f Revert "Try to enable dashboard (again) (#6069)" (#6159)
This reverts commit 4044af8520.
2019-11-13 12:32:12 -08:00
Stephanie Wang 35d177f459 Use grpc for communication from worker to local raylet (task submission and direct actor args only) (#6118)
* Skeleton for SubmitTask proto

* Pass through node manager port, connect in raylet client

* Switch submit task to grpc

* Check port in use

* doc

* Remove default port, set port randomly from driver

* update

* Fix test

* Fix object manager test
2019-11-11 21:17:25 -08:00
Philipp Moritz decaa65cd6 Use pickle by default for serialization (#5978) 2019-11-10 18:12:18 -08:00
Eric Liang 4044af8520 Try to enable dashboard (again) (#6069)
* Revert "Revert "Enable the Ray dashboard by default (#5976)" (#6068)"

This reverts commit 1a3e97cf23.

* fix tests that assume the dashboard isn't a job

* travis
2019-11-08 10:48:48 -08:00
Eric Liang 4a28306186 Allow large returns from direct actor calls (#6088) 2019-11-07 21:28:55 -08:00
Edward Oakes 043d1f4094 Return RayObjects to core worker (#6052) 2019-11-04 20:27:57 -08:00
Eric Liang 1a3e97cf23 Revert "Enable the Ray dashboard by default (#5976)" (#6068)
This reverts commit 6166ef3e09.
2019-11-01 17:08:37 -07:00
Eric Liang fb34928a2a [minor] Perf optimizations for direct actor task submission (#6044)
* merge optimizations

* fix

* fix memory err

* optimize

* fix tests

* fix serialization of method handles

* document weakref

* fix check

* bazel format

* disable on 2
2019-11-01 14:41:14 -07:00
Eric Liang 6166ef3e09 Enable the Ray dashboard by default (#5976) 2019-11-01 12:19:01 -07:00
Edward Oakes e9e78871b9 Remove unused function definition caching (#6042) 2019-10-30 16:41:18 -07:00
Eric Liang b89cac976a Basic direct actor call support in Python (#5991) 2019-10-28 22:09:04 -07:00
Eric Liang a5523466a2 Enable memstore by default (#6003) 2019-10-25 21:59:12 -07:00
Edward Oakes 1ce521a7f3 Remove task context from python worker (#5987)
Removes duplicated state between the Python and C++ workers. Also cleans up the serialization code paths a bit.
2019-10-25 07:38:33 -07:00
Edward Oakes 6f27d881bd Fix core worker shutdown errors (#6004) 2019-10-24 22:29:05 -07:00
Edward Oakes 02931e08f3 [core worker] Python core worker task execution (#5783)
Executes tasks via the event loop in the C++ core worker. Also properly handles signals (including KeyboardInterrupt), so Ctrl-C in a Python interactive shell now works when connecting to an existing cluster.
2019-10-22 20:15:59 -07:00
Siyuan (Ryans) Zhuang 95241f6686 Fix the incorrect serialization behavior with pickle (#5960) 2019-10-22 18:08:36 -07:00
Mitchell Stern 235dec8aa3 [Dashboard] Remove token authentication from dashboard (#5888) 2019-10-21 12:48:48 -07:00
Richard Liaw 26a724c5e6 [core] Support kwargs and positionals in Ray remote calls (#5606) 2019-10-20 22:40:54 -07:00
Richard Liaw 74852c80cb [docs] Improve more serialization Errors (#5658) 2019-10-20 14:06:00 -07:00
Philipp Moritz d23696de17 Introduce flag to use pickle for serialization (#5805) 2019-10-18 22:29:36 -07:00
Stephanie Wang 3ac8592dcf Remove actor handle IDs (#5889)
* Remove actor handle ID from main ActorHandle constructor

* Set the actor caller ID when calling submit task instead of in the actor handle

* Remove ActorHandle::Fork, remove actor handle ID from protobuf

* Make inner actor handle const, remove new_actor_handles

* Move caller ID into the common task spec, start refactoring raylet

* Some fixes for forking actor handles

* Store ActorHandle state in CoreWorker, only expose actor ID to Python

* Remove some unused fields

* lint

* doc

* fix merge

* Remove ActorHandleID from python/cpp

* doc

* Fix core worker test

* Move actor table subscription to CoreWorker, reset actor handles on actor failure

* lint

* Remove GCS client from direct actor

* fix tests

* Fix

* Fix tests for raylet codepath

* Fix local mode

* Fix multithreaded test

* Fix AsyncSubscribe issue...

* doc

* fix serve

* Revert bazel
2019-10-17 12:36:34 -04:00
Edward Oakes 08e4e3a153 [core worker] Submit Python actor tasks through core worker (#5750)
* Submit actor tasks through core worker

* Fix java

* add comment

* Remove task builder

* Check negative

* Increase -> Increment

* pass by reference

* fix signal

* Clean up c++ actor handle

* more cleanup

* Clean up headers

* Fix unique_ptr construction

* Fix java

* Move profiling to c++

* dedup

* fix error

* comments

* fix java

* Fix tests

* wait for actor to exit

* Start after constructor

* ignore java build

* fix comment

* always init logging

* Fix logging

* fix logging issue

* shared_ptr for profiler

* DEBUG -> WARNING

* fix killed_ init

* Fix flaky checkpointing tests

* -v flag for tune tests

* Fix checkpoint test logic

* Fix exception matching

* timeout exception

* Fix test exception info

* Fix import

* fix build

* Fix test

* shared_ptr
2019-10-07 15:42:19 -07:00
Si-Yuan 3a42780cb8 Improved Pickle5 pickling (#5841)
* object copy optimization

* see if we can reuse the Arrow parallel_memcopy

* remove unused function

* restore the original code, since later experiments show that it has little impact on performance.

* lint
2019-10-03 15:14:32 -07:00
Si-Yuan 2fb7d7846f Initial implementation of Cython pickle5 support (#5725) 2019-10-03 09:20:26 -07:00
Edward Oakes 963bbe8bbd Move profiling to c++ (#5771)
* Move profiling to c++

* comments

* Fix tests

* Start after constructor

* fix comment

* always init logging

* Fix logging

* fix logging issue

* shared_ptr for profiler

* DEBUG -> WARNING

* fix killed_ init

* Fix flaky checkpointing tests

* Fix checkpoint test logic

* Fix exception matching

* timeout exception

* Fix import

* fix build

* use boost::asio

* fix double const

* Properly reset async_wait

* remove SIGINT

* Change error message

* increase timeout

* small nits

* Don't trap on SIGINT

* -v for tune

* Fix test
2019-10-01 10:06:25 -07:00
Eric Liang 81ee887f91 Preserve the original exception type when converting to RayTaskError (#5799) 2019-09-28 17:03:15 -07:00
Philipp Moritz 01d6362472 Serialize StringIO with pickle (#5781) 2019-09-26 12:55:14 -07:00
Edward Oakes 61e5d674be Push driver task in core worker (#5752) 2019-09-23 10:53:55 -05:00
Edward Oakes 62bc30c1cf Validate redis address parameters (#5746)
* Validate redis address params

* Fix comment

* Add check
2019-09-23 10:52:34 -05:00