56 Commits

Author SHA1 Message Date
Robert Nishihara 85b373a4be Suppress warning in start_ray.sh about leaving child processes running when parent exits. (#429) 2017-04-05 23:54:22 -07:00
Robert Nishihara ba02fc0eb0 Run flake8 in Travis and make code PEP8 compliant. (#387) 2017-03-21 12:57:54 -07:00
Stephanie Wang 12c9618c0c Plasma and worker node failure. (#373)
* Failing test case

* Local scheduler exits cleanly after plasma store dies

* Tolerate one plasma store failure

* Tolerate plasma store failures on all nodes except head node

* Plasma manager heartbeats

* Component failure tests

* Don't run the helper for Python testing

* Fix C test

* Fix hanging plasma transfer test

* Fix python3

* Consolidate ClientConnection code

* Fix valgrind test

* fix c test

* We can restart worker nodes!

* Fix flatbuffers bug

* Address comments

* Only register actual workers with the local scheduler

* Fix bug

* Fix segfaults

* Add test case that tests for driver liveness, fix local scheduler bug

* Clean up after tests

* Allocate retry info on the stack

* Send SIGKILL before waiting

* Relax unit test conditions

* Driver liveness test case and documentation
2017-03-17 17:03:58 -07:00
Robert Nishihara f1d4dda8cb Put all log files in redis and visualize them in UI. (#350)
* Start process for monitoring log files and push changes to redis.

* Display log files in UI.

* Bug fix for recent tasks.

* Use flatbuffers to parse local scheduler heartbeats.
2017-03-16 15:27:00 -07:00
Robert Nishihara 53dffe0bf2 Use flatbuffers for some messages from Redis. (#341)
* Compile the Ray redis module with C++.

* Redo parsing of object table notifications with flatbuffers.

* Update redis module python tests.

* Redo parsing of task table notifications with flatbuffers.

* Fix linting.

* Redo parsing of db client notifications with flatbuffers.

* Redo publishing of local scheduler heartbeats with flatbuffers.

* Fix linting.

* Remove usage of fixed-width formatting of scheduling state in channel name.

* Reply with flatbuffer object to task table queries, also simplify redis string to flatbuffer string conversion.

* Fix linting and tests.

* fix

* cleanup

* simplify logic in ReplyWithTask
2017-03-10 18:35:25 -08:00
Stephanie Wang 41b8675d04 Availability after local scheduler failure (#329)
* Clean up plasma subscribers on EPIPE

First pass at a monitoring script - monitor can detect local scheduler death

Clean up task table upon local scheduler death in monitoring script

Don't schedule to dead local schedulers in global scheduler

Have global scheduler update the db clients table, monitor script cleans up state

Documentation

Monitor script should scan tables before beginning to read from subscription channel

Fix for python3

Redirect monitor output to redis logs, fix hanging in multinode tests

* Publish auxiliary addresses as part of db_client deletion notifications

* Fix test case?

* Small changes.

* Use SCAN instead of KEYS

* Address comments

* Address more comments

* Free redis module strings
2017-03-02 19:51:20 -08:00
Robert Nishihara 1ae7e7d29e Rename photon -> local scheduler. (#322) 2017-02-27 12:24:07 -08:00
Robert Nishihara 072eadd57f Pipe num_cpus and num_gpus through from start_ray.py. (#275)
* Pipe num_cpus and num_gpus through from start_ray.py.

* Improve load balancing tests.

* Fix bug.

* Factor out some testing code.
2017-02-13 17:43:23 -08:00
Robert Nishihara 3934d5f6eb Remove old files and remove old documentation for copying files around cluster. (#274) 2017-02-13 11:20:04 -08:00
Robert Nishihara cb7f6ca9b5 Attempt to start web UI when starting Ray. (#269)
* Attempt to start web UI when starting Ray.

* Add instructions for using web UI to cluster documentation.

* Don't check if port 8080 is open.

* Remove print statement.
2017-02-12 15:17:58 -08:00
Robert Nishihara f6ce9dfa6c Allow start_ray.sh to take an object manager port. (#272)
* Allow start_ray.sh to take an object manager port.

* Fix typo and add test.

* Small cleanups.
2017-02-12 12:39:32 -08:00
Johann Schleier-Smith 6ad2b5d87a Add Redis port option to startup script (#232)
* specify redis address when starting head

* cleanup

* update starting cluster documentation

* Whitespace.

* Address Philipp's comments.

* Change redis_host -> redis_ip_address.
2017-01-31 00:28:00 -08:00
Richard Liaw 4575cd88b2 Improve error messages when nodes can't communicate with each other. (#223)
* Good error messages when nodes can't communicate with each other

* Print more information when starting the head node.

* Change retries back to 5.
2017-01-22 14:53:15 -08:00
Robert Nishihara 9bb8162621 Improvements to documentation and error messages. (#221) 2017-01-19 20:27:46 -08:00
Robert Nishihara 84296c8905 Documentation for using Ray on a cluster. (#165) 2016-12-30 00:29:03 -08:00
Robert Nishihara 241c955707 Determine node IP address programmatically. (#151)
* Determine node IP address programmatically.

* Factor out methods for getting node IP addresses.

* Address comments.
2016-12-23 15:31:40 -08:00
Robert Nishihara 92010ca5b5 Check that we can connect to Redis and that there aren't existing redis clients on the same node in start_ray.py (#148) 2016-12-22 21:54:19 -08:00
Robert Nishihara 6cd02d71f8 Fixes and cleanups for the multinode setting. (#143)
* Add function for driver to get address info from Redis.

* Use Redis address instead of Redis port.

* Configure Redis to run in unprotected mode.

* Add method for starting Ray processes on non-head node.

* Pass in correct node ip address to start_plasma_manager.

* Script for starting Ray processes.

* Handle the case where an object already exists in the store. Maybe this should also compare the object hashes.

* Have driver get info from Redis when start_ray_local=False.

* Fix.

* Script for killing ray processes.

* Catch some errors when the main_loop in a worker throws an exception.

* Allow redirecting stdout and stderr to /dev/null.

* Wrap start_ray.py in a shell script.

* More helpful error messages.

* Fixes.

* Wait for redis server to start up before configuring it.

* Allow seeding of deterministic object ID generation.

* Small change.
2016-12-21 18:53:12 -08:00
Robert Nishihara ddba1df802 Start working toward Python3 compatibility. (#117) 2016-12-11 12:25:31 -08:00
Robert Nishihara 072f442c1f Update worker.py and services.py to use plasma and the local scheduler. (#19)
* Update worker code and services code to use plasma and the local scheduler.

* Cleanups.

* Fix bug in which threads were started before the worker mode was set. This caused remote functions to be defined on workers before the worker knew it was in WORKER_MODE.

* Fix bug in install-dependencies.sh.

* Lengthen timeout in failure_test.py.

* Cleanups.

* Cleanup services.start_ray_local.

* Clean up random name generation.

* Cleanups.
2016-11-02 00:39:35 -07:00
Robert Nishihara 6ed641177d Remove unnecessary files. (#4) 2016-10-26 23:24:40 -07:00
Robert Nishihara 91f16a3df0 Migrate repositories to ray-project. (#438)
* Migrate repositories to ray-project.

* Update numbuf to the migrated version.
2016-09-17 00:52:05 -07:00
Robert Nishihara e06311d415 Automatically add relevant directories to Python paths of workers (#380)
* Make ray.init set python paths of workers.

* Decouple starting cluster from copying user source code

* also add current directory to path

* Add comments about deallocation.

* Add test for new code path.
2016-08-16 14:53:55 -07:00
Robert Nishihara 13df8302e6 enable running example apps in cluster mode (#357) 2016-08-08 16:01:13 -07:00
Robert Nishihara a6452aca47 Command for installing example applications dependencies on cluster (#353) 2016-08-05 14:54:32 -07:00
Robert Nishihara 1454c26693 fix bug with home directory on cluster (#352) 2016-08-05 11:49:11 -07:00
Robert Nishihara ac363bf451 Let worker get worker address and object store address from scheduler (#350) 2016-08-04 17:47:08 -07:00
Johann Schleier-Smith 3ee0fd8f34 Update cluster guide (#347)
* clarify cluster setup instructions

* update multinode documentation, update cluster script, fix minor bug in worker.py

* clarify cluster documentation and fix update_user_code
2016-08-04 09:14:20 -07:00
Robert Nishihara 2040372084 unify starting local cluster with attaching to existing cluster (#327) 2016-07-31 19:26:35 -07:00
Robert Nishihara bcd0e3781f remove example functions and remove imports from shell (#314) 2016-07-29 12:42:44 -07:00
Philipp Moritz b5215f1e6a make it possible to use directory as user source directory that doesn't contain worker.py (#297) 2016-07-26 18:39:06 -07:00
Robert Nishihara aa2f618ab7 add directory containing script to python path of workers (#296) 2016-07-26 16:18:39 -07:00
Robert Nishihara 3bae6f136b export remote functions and reusable variables that were defined before connect was called (#292) 2016-07-26 11:40:09 -07:00
Robert Nishihara 8465df1146 script for launching nodes on ec2 (#270)
* original spark-ec2 script

* modifying spark-ec2 for ray
2016-07-16 15:14:14 -07:00
mehrdadn 0f1d7c5835 Run IPython shell without embedding (#269) 2016-07-16 14:42:58 -07:00
Robert Nishihara 80526f7777 add documentation and refactor cluster.py (#238) 2016-07-12 23:54:18 -07:00
Robert Nishihara 8952ff8cf9 allow cluster script to update worker code on nodes (#243) 2016-07-11 17:58:16 -07:00
Robert Nishihara e1a74eadbe remove installation of dependencies from setup script (#239) 2016-07-08 20:03:21 -07:00
Robert Nishihara 5dd411546d clean up imports (#230) 2016-07-08 12:46:47 -07:00
Robert Nishihara 875b20e397 only run cleanup if we've started ray in local mode and actually started the processes (#228) 2016-07-08 00:14:26 -07:00
Robert Nishihara 8e6b7929d6 make services.cleanup happen automatically (#224) 2016-07-07 14:05:25 -07:00
Robert Nishihara 5873831c21 basic tutorials (#204) 2016-07-06 13:51:32 -07:00
Robert Nishihara 0947024ad9 fix bug for functions with no return values and with one return value (#211) 2016-07-05 15:57:05 -07:00
Robert Nishihara 529e86ce64 add example functions to default worker (#210) 2016-07-05 14:39:42 -07:00
Robert Nishihara 0ffe657e27 enable restarting workers in singlenode case, plus cleanups to cluster.py (#190) 2016-07-01 14:10:51 -07:00
Robert Nishihara 7611fbce4d fixes to shell.py (#195) 2016-06-30 22:57:29 -07:00
Robert Nishihara ad35da08f3 fix (#188) 2016-06-30 13:26:06 -07:00
Philipp Moritz 8d70dd15df Fix imports for default_worker and set SHELL_MODE for shell 2016-06-27 17:23:01 -07:00
Robert Nishihara 731280fd75 Merge pull request #177 from amplab/newshell
Implement launching cluster with shell
2016-06-27 16:38:01 -07:00
Philipp Moritz a0df13b14f Implement launching cluster with shell 2016-06-27 16:33:12 -07:00