Eric Liang
69c5a2bc3c
Warn if OMP_NUM_THREADS is set ( #6729 )
2020-01-08 14:59:07 -08:00
Robert Nishihara
5e43b25e8c
Document fault tolerance behavior. ( #6698 )
2020-01-06 22:34:06 -08:00
Edward Oakes
2a4d2c6e9e
Basic reference counting & pinning ( #6554 )
2020-01-06 17:30:26 -06:00
Robert Nishihara
92e44a5dc8
Deprecate redis_address argument in favor of address. ( #6654 )
2020-01-02 20:18:34 -08:00
Robert Nishihara
39a3459886
Remove (object) from class declarations. ( #6658 )
2020-01-02 17:42:13 -08:00
Robert Nishihara
480206eef8
Remove some Python 2 compatibility code. ( #6624 )
2019-12-31 17:14:58 -08:00
Eric Liang
e2bc489a18
Port webui nits from original pr that enables it ( #6628 )
* backport changes
* Update test_webui.py
2019-12-29 19:19:43 -08:00
Robert Nishihara
8724e5ffd5
Start WebUI by default. ( #6493 )
2019-12-27 13:49:07 -08:00
Edward Oakes
6b1a57542e
Add actor.__ray_kill__() to terminate actors immediately ( #6523 )
2019-12-23 23:12:57 -06:00
Yunzhi Zhang
bac6f3b61e
[Dashboard] Collecting worker stats in node manager and implement webui display in the backend ( #6574 )
2019-12-22 17:50:23 -08:00
Simon Mo
26ec500ef9
Implement async get for direct actor call ( #6339 )
2019-12-18 11:50:21 -08:00
Simon Mo
e530c37b0e
Use localhost and set redis password by default ( #6481 )
2019-12-17 19:41:19 -08:00
Edward Oakes
e2b7459bfc
Fix worker exit cleanup ( #6450 )
* working but ugly
* comments
* proper but hanging in grpc server destructor
* grpc server shutdown deadline
* fix disconnect
* lint
* shutdown_only in test
* replace shutdown
2019-12-13 16:52:50 -08:00
Edward Oakes
82f7dbc7a7
Increase TaskID size by 2 bytes, taken from JobID ( #6425 )
* Increase TaskID size by 2 bytes, taken from JobID
* comments
* check max job id
* fix doc
* fix local mode
2019-12-11 10:45:14 -08:00
Edward Oakes
044527adb8
Remove ref counting dependencies on ray.get() ( #6412 )
* Remove ref counting dependencies on Get()
* comment
* don't send IDs when disabled
* pass through internal config
* fix
* allow reinit
* remove flag
2019-12-10 18:11:34 -08:00
Stephanie Wang
da41180dc0
[direct task] Retry tasks on failure and turn on RAY_FORCE_DIRECT for test_multinode_failures.py ( #6306 )
* multinode failures direct
* Add number of retries allowed for tasks
* Retry tasks
* Add failing test for object reconstruction
* Handle return status and debug
* update
* Retry task unit test
* update
* update
* todo
* Fix max_retries decorator, fix test
* Fix test that flaked
* lint
* comments
2019-12-02 10:20:57 -08:00
Edward Oakes
e4f9b3b7d9
Use process reaper for cleanup ( #6253 )
2019-11-26 22:00:08 -06:00
Simon Mo
1ca8c427e3
Consistent Name for Process Title ( #6276 )
* Consistent naming for setproctitle
* Address comments
* Add debug/verbose mode
* Fix test
2019-11-26 11:56:28 -08:00
Philipp Moritz
33c768ebe4
Fix worker signal.SIGTERM handler being installed from outside the main thread ( #6176 )
2019-11-20 11:14:28 -08:00
Ujval Misra
2965dc1b72
[tune] Fault tolerance improvements ( #5877 )
* Precede ray.get with ray.wait.
* Trigger checkpoint deletes locally in Trainable
* Clean-up code.
* Minor changes.
* Track best checkpoint so far again
* Pulled checkpoint GC out of Trainable.
* Added comments, error logging.
* Immediate pull after checkpoint taken; rsync source delete on pull
* Minor doc fixes
* Fix checkpoint manager bug
* Fix bugs, tests, formatting
* Fix bugs, feature flag for force sync.
* Fix test.
* Fix minor bugs: clear proc and less verbose sync_on_checkpoint warnings.
* Fix bug: update IP of last_result.
* Fixed message.
* Added a lot of logging.
* Changes to ray trial executor.
* More bug fixes (logging after failure), better logging.
* Fix richards bug and logging
* Add comments.
* try-except
* Fix heapq bug.
* .
* Move handling of no available trials to ray_trial_executor (#1)
* Fix formatting bug, lint.
* Addressed Richard's comments
* Revert tests.
* fix rebase
* Fix trial location reporting.
* Fix test
* Fix lint
* Rebase, use ray.get w/ timeout, lint.
* lint
* fix rebase
* Address richard's comments
2019-11-18 01:14:41 -08:00
Ujval Misra
e3e3ad4b25
Add timeout param to ray.get ( #6107 )
2019-11-14 00:50:04 -08:00
Philipp Moritz
f24d96ec4f
Revert "Try to enable dashboard (again) ( #6069 )" ( #6159 )
This reverts commit 4044af8520.
2019-11-13 12:32:12 -08:00
Stephanie Wang
35d177f459
Use grpc for communication from worker to local raylet (task submission and direct actor args only) ( #6118 )
* Skeleton for SubmitTask proto
* Pass through node manager port, connect in raylet client
* Switch submit task to grpc
* Check port in use
* doc
* Remove default port, set port randomly from driver
* update
* Fix test
* Fix object manager test
2019-11-11 21:17:25 -08:00
Philipp Moritz
decaa65cd6
Use pickle by default for serialization ( #5978 )
2019-11-10 18:12:18 -08:00
Eric Liang
4044af8520
Try to enable dashboard (again) ( #6069 )
* Revert "Revert "Enable the Ray dashboard by default (#5976)" (#6068)"
This reverts commit 1a3e97cf23.
* fix tests that assume the dashboard isn't a job
* travis
2019-11-08 10:48:48 -08:00
Eric Liang
4a28306186
Allow large returns from direct actor calls ( #6088 )
2019-11-07 21:28:55 -08:00
Edward Oakes
043d1f4094
Return RayObjects to core worker ( #6052 )
2019-11-04 20:27:57 -08:00
Eric Liang
1a3e97cf23
Revert "Enable the Ray dashboard by default ( #5976 )" ( #6068 )
This reverts commit 6166ef3e09.
2019-11-01 17:08:37 -07:00
Eric Liang
fb34928a2a
[minor] Perf optimizations for direct actor task submission ( #6044 )
* merge optimizations
* fix
* fix memory err
* optimize
* fix tests
* fix serialization of method handles
* document weakref
* fix check
* bazel format
* disable on 2
2019-11-01 14:41:14 -07:00
Eric Liang
6166ef3e09
Enable the Ray dashboard by default ( #5976 )
2019-11-01 12:19:01 -07:00
Edward Oakes
e9e78871b9
Remove unused function definition caching ( #6042 )
2019-10-30 16:41:18 -07:00
Eric Liang
b89cac976a
Basic direct actor call support in Python ( #5991 )
2019-10-28 22:09:04 -07:00
Eric Liang
a5523466a2
Enable memstore by default ( #6003 )
2019-10-25 21:59:12 -07:00
Edward Oakes
1ce521a7f3
Remove task context from python worker ( #5987 )
Removes duplicated state between the python and C++ workers. Also cleans up the serialization codepaths a bit.
2019-10-25 07:38:33 -07:00
Edward Oakes
6f27d881bd
Fix core worker shutdown errors ( #6004 )
2019-10-24 22:29:05 -07:00
Edward Oakes
02931e08f3
[core worker] Python core worker task execution ( #5783 )
Executes tasks via the event loop in the C++ core worker. Also properly handles signals (including KeyboardInterrupt), so ctrl-C in a python interactive shell works now (if connecting to an existing cluster).
2019-10-22 20:15:59 -07:00
Siyuan (Ryans) Zhuang
95241f6686
Fix the incorrect serialization behavior with pickle ( #5960 )
2019-10-22 18:08:36 -07:00
Mitchell Stern
235dec8aa3
[Dashboard] Remove token authentication from dashboard ( #5888 )
2019-10-21 12:48:48 -07:00
Richard Liaw
26a724c5e6
[core] Support kwargs and positionals in Ray remote calls ( #5606 )
2019-10-20 22:40:54 -07:00
Richard Liaw
74852c80cb
[docs] Improve more serialization Errors ( #5658 )
2019-10-20 14:06:00 -07:00
Philipp Moritz
d23696de17
Introduce flag to use pickle for serialization ( #5805 )
2019-10-18 22:29:36 -07:00
Stephanie Wang
3ac8592dcf
Remove actor handle IDs ( #5889 )
* Remove actor handle ID from main ActorHandle constructor
* Set the actor caller ID when calling submit task instead of in the actor handle
* Remove ActorHandle::Fork, remove actor handle ID from protobuf
* Make inner actor handle const, remove new_actor_handles
* Move caller ID into the common task spec, start refactoring raylet
* Some fixes for forking actor handles
* Store ActorHandle state in CoreWorker, only expose actor ID to Python
* Remove some unused fields
* lint
* doc
* fix merge
* Remove ActorHandleID from python/cpp
* doc
* Fix core worker test
* Move actor table subscription to CoreWorker, reset actor handles on actor failure
* lint
* Remove GCS client from direct actor
* fix tests
* Fix
* Fix tests for raylet codepath
* Fix local mode
* Fix multithreaded test
* Fix AsyncSubscribe issue...
* doc
* fix serve
* Revert bazel
2019-10-17 12:36:34 -04:00
Edward Oakes
08e4e3a153
[core worker] Submit Python actor tasks through core worker ( #5750 )
* Submit actor tasks through core worker
* Fix java
* add comment
* Remove task builder
* Check negative
* Increase -> Increment
* pass by reference
* fix signal
* Clean up c++ actor handle
* more cleanup
* Clean up headers
* Fix unique_ptr construction
* Fix java
* Move profiling to c++
* dedup
* fix error
* comments
* fix java
* Fix tests
* wait for actor to exit
* Start after constructor
* ignore java build
* fix comment
* always init logging
* Fix logging
* fix logging issue
* shared_ptr for profiler
* DEBUG -> WARNING
* fix killed_ init
* Fix flaky checkpointing tests
* -v flag for tune tests
* Fix checkpoint test logic
* Fix exception matching
* timeout exception
* Fix test exception info
* Fix import
* fix build
* Fix test
* shared_ptr
2019-10-07 15:42:19 -07:00
Si-Yuan
3a42780cb8
Improved Pickle5 pickling ( #5841 )
* object copy optimization
* see if we can reuse the Arrow parallel_memcopy
* remove unused function
* restore the original code, since later experiments show that it has little impact on performance.
* lint
2019-10-03 15:14:32 -07:00
Si-Yuan
2fb7d7846f
Initial implementation of Cython pickle5 support ( #5725 )
2019-10-03 09:20:26 -07:00
Edward Oakes
963bbe8bbd
Move profiling to c++ ( #5771 )
* Move profiling to c++
* comments
* Fix tests
* Start after constructor
* fix comment
* always init logging
* Fix logging
* fix logging issue
* shared_ptr for profiler
* DEBUG -> WARNING
* fix killed_ init
* Fix flaky checkpointing tests
* Fix checkpoint test logic
* Fix exception matching
* timeout exception
* Fix import
* fix build
* use boost::asio
* fix double const
* Properly reset async_wait
* remove SIGINT
* Change error message
* increase timeout
* small nits
* Don't trap on SIGINT
* -v for tune
* Fix test
2019-10-01 10:06:25 -07:00
Eric Liang
81ee887f91
Preserve the original exception type when converting to RayTaskError ( #5799 )
2019-09-28 17:03:15 -07:00
Philipp Moritz
01d6362472
Serialize StringIO with pickle ( #5781 )
2019-09-26 12:55:14 -07:00
Edward Oakes
61e5d674be
Push driver task in core worker ( #5752 )
2019-09-23 10:53:55 -05:00
Edward Oakes
62bc30c1cf
Validate redis address parameters ( #5746 )
* Validate redis address params
* Fix comment
* Add check
2019-09-23 10:52:34 -05:00