Commit Graph

43 Commits

Author SHA1 Message Date
Philipp Moritz cfc7e2c5a9 Fix modin test (#4069) 2019-02-18 12:17:36 -08:00
William Ma e1a479b137 Add teardown_module to test_queue.py (#4012) 2019-02-12 22:43:09 -08:00
Stephanie Wang ad9f1721d1 Fix object_manager_test.py::object_transfer_retry test (#3863) 2019-01-27 13:55:38 -08:00
Robert Nishihara 8723d6b061 Define a Node class to manage Ray processes. (#3733)
* Implement Node class and move most of services.py into it.

* Wait for nodes as they are added to the cluster.

* Fix Redis authentication bug.

* Fix bug in client table ordering.

* Address comments.

* Kill raylet before plasma store in test.

* Minor
2019-01-11 22:30:38 -08:00
Robert Nishihara 5e76d52868 Improve cluster.wait_for_nodes() API. (#3712)
* Separate out functionality for querying client table and improve cluster.wait_for_nodes() API.

* Linting

* Add back logging statements.

* info -> debug
2019-01-07 21:26:58 -08:00
Robert Nishihara c9d70f0dda Remove num_local_schedulers argument from ray.worker._init. (#3704)
* Remove num_local_schedulers argument from ray.worker._init.

* Fix

* Fix tests.
2019-01-07 12:44:49 -08:00
Yuhong Guo c9b8ecca51 Add RayParams to refactor the parameters used by ray python. (#3558) 2018-12-29 22:04:27 +08:00
Richard Liaw aad3c50e2d [tune] Cluster Fault Tolerance (#3309)
This PR introduces cluster-level fault tolerance for Tune by checkpointing global state. This occurs with relatively high frequency and allows users to easily resume experiments when the cluster crashes.

Note that this PR may affect automated workflows due to auto-prompting, but this is resolvable.
2018-12-29 11:42:25 +08:00
Robert Nishihara 417c7f2d6f Update arrow and remove plasma_manager references. (#3545) 2018-12-15 23:36:02 -08:00
Hao Chen e7b51cbd1b [xray] Implement Actor Reconstruction (#3332)
* Implement Actor Reconstruction

* fix

* fix actor handle __del__

* fix lint

* add comment

* Remove actorCreationDummyObjectId

* address comments

* fix

* address comments

* avoid copy

* change log to debug

* fix error name
2018-12-13 21:28:58 -08:00
Devin Petersohn 4d2010a852 Ship Modin with Ray. (#3109) 2018-11-29 20:05:24 +01:00
Richard Liaw 784a6399b0 [tune] Node Fault Tolerance (#3238)
This PR introduces single-node fault tolerance for Tune.

## Previous behavior:
 - Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources.

## New behavior:
 - RUNNING trials will be resumed on another node on a best effort basis (meaning they will run if resources available). 
 - If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued.
 - During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running.


Remaining questions:
 -  Should `last_result` be consistent during restore?
Yes; but not for earlier trials (trials that are yet to be checkpointed).

 - Waiting for some PRs to merge first (#3239)

Closes #2851.
2018-11-21 12:38:16 -08:00
Eric Liang e4bb5d8d16 Fix logging when ray cluster utils is used 2018-11-18 21:49:27 -08:00
Richard Liaw 97f423781b Clean up Ray processes after cluster util exits (#3278) 2018-11-13 13:18:12 -08:00
Richard Liaw c0423db05c [core] Add Global State Test for multi-node setting (#3239)
* add test for adding node

* multinode test fixes

* First pass at allowing updatable values

* Fix compilation issues

* Add config file parsing

* Full initialization

* Wrote a good test

* configuration parsing and stuff

* docs

* write some tests, make it good

* fixed init

* Add all config options and bring back stress tests.

* Update python/ray/worker.py

* Update python/ray/worker.py

* Fix internalization

* some last changes

* Linting and Java fix

* add docstring

* Fix test, add assertions

* pytest ext

* lint

* lint
2018-11-13 10:35:24 -08:00
Richard Liaw 0bab8ed95c Expose internal config parameters for starting Ray (#3246)
## What do these changes do?

This PR exposes the CL option for using a config parameter. This is important for certain tests (i.e., FT tests that removing nodes) to run quickly.

Note that this is bad practice and should be replaced with GFLAGS or some equivalent as soon as possible.

#3239 depends on this.

TODO:
 - [x] Add documentation to method arguments before merging.
 - [x] Add test to verify this works?

## Related issue number
2018-11-07 21:46:02 -08:00
Peter Schafhalter f3efcd2342 Fix password authentication in worker (#3124) 2018-11-06 13:40:03 -08:00
Robert Nishihara 658c14282c Remove legacy Ray code. (#3121)
* Remove legacy Ray code.

* Fix cmake and simplify monitor.

* Fix linting

* Updates

* Fix

* Implement some methods.

* Remove more plasma manager references.

* Fix

* Linting

* Fix

* Fix

* Make sure class IDs are strings.

* Some path fixes

* Fix

* Path fixes and update arrow

* Fixes.

* linting

* Fixes

* Java fixes

* Some java fixes

* TaskLanguage -> Language

* Minor

* Fix python test and remove unused method signature.

* Fix java tests

* Fix jenkins tests

* Remove commented out code.
2018-10-26 13:36:58 -07:00
Robert Nishihara 9c1826ed69 Use XRay backend by default. (#3020)
* Use XRay backend by default.

* Remove irrelevant valgrind tests.

* Fix

* Move tests around.

* Fix

* Fix test

* Fix test.

* String/unicode fix.

* Fix test

* Fix unicode issue.

* Minor changes

* Fix bug in test_global_state.py.

* Fix test.

* Linting

* Try arrow change and other object manager changes.

* Use newer plasma client API

* Small updates

* Revert plasma client api change.

* Update

* Update arrow and allow SendObjectHeaders to fail.

* Update arrow

* Update python/ray/experimental/state.py

Co-Authored-By: robertnishihara <robertnishihara@gmail.com>

* Address comments.
2018-10-23 12:46:39 -07:00
Richard Liaw 40c4148d4f Cluster Utilities for Fault Tolerance Tests (#3008) 2018-10-20 22:56:29 -07:00
Peter Schafhalter a41bbc10ef Add password authentication to Redis ports (#2952)
* Implement Redis authentication

* Throw exception for legacy Ray

* Add test

* Formatting

* Fix bugs in CLI

* Fix bugs in Raylet

* Move default password to constants.h

* Use pytest.fixture

* Fix bug

* Authenticate using formatted strings

* Add missing passwords

* Add test

* Improve authentication of async contexts

* Disable Redis authentication for credis

* Update test for credis

* Fix rebase artifacts

* Fix formatting

* Add workaround for issue #3045

* Increase timeout for test

* Improve C++ readability

* Fixes for CLI

* Add security docs

* Address comments

* Address comments

* Adress comments

* Use ray.get

* Fix lint
2018-10-16 22:48:30 -07:00
Robert Nishihara 3ce8eb2d4c Test dying_worker_get and dying_worker_wait for xray. (#2997)
This tests the case in which a worker is blocked in a call to ray.get or ray.wait, and then the worker dies. Then later, the object that the worker was waiting for becomes available. We need to make sure not to try to send a message to the dead worker and then die. Related to #2790.
2018-10-02 00:08:47 -07:00
Peter Schafhalter fcdca6de18 Fix test for available resources (#2914) 2018-09-25 23:07:23 -07:00
Peter Schafhalter 5da6e78db1 Add available resources to global state (#2501) 2018-09-10 15:46:32 -07:00
Yuhong Guo 0b6e08ebee Separate python logger module-wise (#2703)
## What do these changes do?
1. Separate the log related code to logger.py from services.py.
2. Allow users to modify logging formatter in `ray start`.

## Related issue number
https://github.com/ray-project/ray/pull/2664
2018-08-26 13:46:14 -07:00
Yuhong Guo d2ebe4d9a3 Fix frequent failure of Jenkins CI. (#2490) 2018-08-02 10:28:28 -07:00
Philipp Moritz 696a229ece Fix text verbosity in python 2.7 by running tests with pytest (#2470) 2018-07-30 11:04:06 -07:00
Peter Schafhalter f5c46c7765 Add queue data structures (#2261) 2018-07-16 16:26:20 -07:00
Robert Nishihara 515da7721a Change ray.worker.cleanup -> ray.shutdown and improve API documentation. (#2374)
* Change ray.worker.cleanup -> ray.shutdown and improve API documentation.

* Deprecate ray.worker.cleanup() gracefully.

* Fix linting
2018-07-12 12:00:00 -07:00
Alok Singh f795173b51 Use flake8-comprehensions (#1976)
* Add flake8 to Travis

* Add flake8-comprehensions

[flake8 plugin](https://github.com/adamchainz/flake8-comprehensions) that
checks for useless constructions.

* Use generators instead of lists where appropriate

A lot of the builtins can take in generators instead of lists.

This commit applies `flake8-comprehensions` to find them.

* Fix lint error

* Fix some string formatting

The rest can be fixed in another PR

* Fix compound literals syntax

This should probably be merged after #1963.

* dict() -> {}

* Use dict literal syntax

dict(...) -> {...}

* Rewrite nested dicts

* Fix hanging indent

* Add missing import

* Add missing quote

* fmt

* Add missing whitespace

* rm duplicate pip install

This is already installed in another file.

* Fix indent

* move `merge_dicts` into utils

* Bring up to date with `master`

* Add automatic syntax upgrade

* rm pyupgrade

In case users want to still use it on their own, the upgrade-syn.sh script was
left in the `.travis` dir.
2018-05-20 16:15:06 -07:00
Adam Gleave 470887c2ad Support calling positional arguments by keyword (fix #998) (#2081) 2018-05-17 16:10:26 -07:00
Philipp Moritz 74162d1492 Lint Python files with Yapf (#1872) 2018-04-11 10:11:35 -07:00
Philipp Moritz a9acfab3a6 Start chain replicated GCS with Ray (#1538) 2018-03-07 10:18:58 -08:00
Richard Liaw 797f4fcbf3 Fixing Lint after flake upgrade (#1162)
* Fixing Lint after flake upgrade

* more lint fixes
2017-10-26 21:02:07 -05:00
Stephanie Wang aebe9f9374 Fix actor garbage collection by breaking cyclic references (#1064)
* Fix bug in wait_for_pid_to_exit, add test for actor deletion.

* Fix actor garbage collection by breaking cyclic references

* Add test for calling actor method immediately after actor creation.

* Fix bug, must dispatch tasks when workers are killed.

* Fix python test

* Fix cyclic reference problem by creating ActorMethod objects on the fly.

* Try simply increasing the time allowed for many_drivers_test.py.
2017-10-05 00:55:33 -07:00
Robert Nishihara cf41964816 Fix some bugs causing Travis test failures. (#839)
* Fix bug in which worker has no actor_class attribute.

* Remove case where we check if processes are defunct.
2017-08-16 01:18:32 -07:00
Robert Nishihara e0867c8845 Switch Python indentation from 2 spaces to 4 spaces. (#726)
* 4 space indentation for actor.py.

* 4 space indentation for worker.py.

* 4 space indentation for more files.

* 4 space indentation for some test files.

* Check indentation in Travis.

* 4 space indentation for some rl files.

* Fix failure test.

* Fix multi_node_test.

* 4 space indentation for more files.

* 4 space indentation for remaining files.

* Fixes.
2017-07-13 21:53:57 +00:00
Philipp Moritz 54925996ca Allow remote functions to specify max executions and kill worker once limit is reached. (#660)
* implement restarting workers after certain number of task executions

* Clean up python code.

* Don't start new worker when an actor disconnects.

* Move wait_for_pid_to_exit to test_utils.py.

* Add test.

* Fix linting errors.

* Fix linting.

* Fix typo.
2017-06-13 00:34:58 -07:00
Robert Nishihara ec2534422b Remove register_class from API. (#550)
* Perform ray.register_class under the hood.

* Fix bug.

* Release worker lock when waiting for imports to arrive in get.

* Remove calls to register_class from examples and tests.

* Clear serialization state between tests.

* Fix bug and add test for multiple custom classes with same name.

* Fix failure test.

* Fix linting and cleanups to python code.

* Fixes to documentation.

* Implement recursion depth for recursively registering classes.

* Fix linting.

* Push warning to user if waiting for class for too long.

* Fix typos.

* Don't export FunctionToRun if pickling the function fails.

* Don't broadcast class definition when pickling class.
2017-05-16 18:38:52 -07:00
Robert Nishihara 1627f89945 Fix problem in which actors and workers running tasks are not killed by driver exit. (#490)
* Augment test to verify that relevant workers and actors are killed during driver cleanup.

* Fix bug in which we were only killing one worker when a driver exited.

* Fix remove driver test.

* Fix and augment test.
2017-04-26 15:13:39 -07:00
Robert Nishihara 0ac125e9b2 Clean up when a driver disconnects. (#462)
* Clean up state when drivers exit.

* Remove unnecessary field in ActorMapEntry struct.

* Have monitor release GPU resources in Redis when driver exits.

* Enable multiple drivers in multi-node tests and test driver cleanup.

* Make redis GPU allocation a redis transaction and small cleanups.

* Fix multi-node test.

* Small cleanups.

* Make global scheduler take node_ip_address so it appears in the right place in the client table.

* Cleanups.

* Fix linting and cleanups in local scheduler.

* Fix removed_driver_test.

* Fix bug related to vector -> list.

* Fix linting.

* Cleanup.

* Fix multi node tests.

* Fix jenkins tests.

* Add another multi node test with many drivers.

* Fix linting.

* Make the actor creation notification a flatbuffer message.

* Revert "Make the actor creation notification a flatbuffer message."

This reverts commit af99099c8084dbf9177fb4e34c0c9b1a12c78f39.

* Add comment explaining flatbuffer problems.
2017-04-24 18:10:21 -07:00
Robert Nishihara ba02fc0eb0 Run flake8 in Travis and make code PEP8 compliant. (#387) 2017-03-21 12:57:54 -07:00
Philipp Moritz a708e36225 Switch build system to use CMake completely. (#200)
* switch to CMake completely

...

* cleanup

* Run C tests, update installation instructions.
2017-01-17 16:56:40 -08:00