143 Commits

Author SHA1 Message Date
Philipp Moritz 762bdf646e [xray] Put GCS data into the redis data shard (#2298) 2018-06-30 15:42:10 -10:00
Hao Chen 20c0ecb522 Reuse code of checking large pickles (#2291) 2018-06-28 16:51:23 -10:00
Richard Liaw e657497225 [xray] Fix tune tests (#2305)
* fix xray tests

* yapf

* unleash tests
2018-06-26 23:56:23 -07:00
Robert Nishihara ff2217251f [xray] Add error table and push error messages to driver through node manager. (#2256)
* Fix documentation indentation.

* Add error table to GCS and push error messages through node manager.

* Add type to error data.

* Linting

* Fix failure_test bug.

* Linting.

* Enable one more test.

* Attempt to fix doc building.

* Restructuring

* Fixes

* More fixes.

* Move current_time_ms function into util.h.
2018-06-20 21:29:28 -07:00
Eric Liang 30f7c08ca7 [rllib] Remove need to pass around registry (#2250)
* remove registry

* fix

* too many _

* fix

* cloudpickle

* Update registry.py

* yapf

* fix test

* fix kv check
2018-06-19 22:47:00 -07:00
Robert Nishihara 61139e1509 Enable fractional resources and resource IDs for xray. (#2187)
* Implement GPU IDs and fractional resources.

* Add documentation and python exceptions.

* Fix signed/unsigned comparison.

* Fix linting.

* Fixes from rebase.

* Re-enable tests that use ray.wait.

* Don't kill the raylet if an infeasible task is submitted.

* Ignore tests that require better load balancing.

* Linting

* Ignore array test.

* Ignore stress test reconstructions tests.

* Don't kill node manager if remote node manager disconnects.

* Ignore more stress tests.

* Naming changes

* Remove outdated todo

* Small fix

* Re-enable test.

* Linting

* Fix resource bookkeeping for blocked tasks.

* Fix linting

* Fix Java client.

* Ignore test

* Ignore put error tests
2018-06-10 15:31:43 -07:00
Philipp Moritz 4ec5bea03b [xray] Implement fetch (#2195) 2018-06-09 23:36:27 -07:00
Robert Nishihara 125fe1c09c Print warning when defining very large remote function or actor. (#2179)
* Print warning when defining very large remote function or actor.

* Add weak test.

* Check that warnings appear in test.

* Make wait_for_errors actually fail in failure_test.py.

* Use constants for error types.

* Fix
2018-06-09 19:59:15 -07:00
Melih Elibol 7246ff80a4 [xray] Implements ray.wait (#2162)
Implements ray.wait for xray. Fixes #1128.
2018-06-06 16:56:44 -07:00
Robert Nishihara 6172f94c04 Implement Python global state API for xray. (#2125)
* Implement global state API for xray.

* Fix object table.

* Fixes for log structure.

* Implement cluster_resources.

* Add driver task to task table.

* Remove python flatbuffers code

* Get some global state API tests running.

* Python linting.

* Fix linting.

* Fix mock modules for doc

* Copy over flatbuffer bindings.

* Fix for tests.

* Linting

* Fix monitor crash.
2018-05-29 16:25:54 -07:00
Robert Nishihara 99ae74e1d2 Improve error message printing and suppression. (#2104) 2018-05-20 12:13:14 -07:00
Alok Singh 9a8f29e571 YAPF, take 3 (#2098)
* Use pep8 style

The original style file is actually just pep8 style, but with everything
spelled out. It's easier to use the `based_on_style` feature. Any overrides are
clearer that way.

* Improve yapf script

1. Do formatting in parallel
2. Lint RLlib
3. Use .style.yapf file

* Pull out expressions into variables

* Don't format rllib

* Don't allow splits in dicts

* Apply yapf

* Disallow single line if-statements

* Use arithmetic comparison

* Simplify checking for changed files

* Pull out expr into var
2018-05-19 16:07:28 -07:00
Melih Elibol bea97b425b Fix python linting (#2076) 2018-05-16 15:04:31 -07:00
Robert Nishihara 8fbb88485b Create RemoteFunction class, remove FunctionProperties, simplify worker Python code. (#2052)
* Cleaning up worker and actor code. Create remote function class. Remove FunctionProperties object.

* Remove register_actor_signatures function.

* Small cleanups.

* Fix linting.

* Support @ray.method syntax for actor methods.

* Fix pickling bug.

* Fix linting.

* Shorten testBlockingTasks.

* Small fixes.

* Call get_global_worker().
2018-05-14 14:35:23 -07:00
Robert Nishihara 52b0f3734a [xray] Add Travis build for testing xray on Linux. (#2047)
* Run xray tests in travis.

* Comment out TaskTests.testSubmittingManyTasks.

* Comment out failing tests.

* Comment out hanging test.

* Linting

* Comment out failing test.

* Comment out failing test.

* Ignore test_dataframe.py for now.

* Comment out testDriverExitingQuickly.
2018-05-13 21:22:01 -07:00
Robert Nishihara 77c8aa7627 Make ActorHandles pickleable, also make proper ActorHandle and ActorC… (#2007)
* Make ActorHandles pickleable, also make proper ActorHandle and ActorClass classes.

* Fix bug.

* Fix actor test bug.

* Update __ray_terminate__ usage.

* Fix most linting, add documentation, and small cleanups.

* Handle forking and pickling differently for actor handles. Fix linting.

* Fixes for named actors via pickling.

* Generate actor handle IDs deterministically in the pickling case.
2018-05-08 19:19:07 -07:00
Alok Singh cdf94c18a4 Clean up syntax for supported Python versions. (#1963)
* Use set/dict literal syntax

Ran code through [pyupgrade](https://github.com/asottile/pyupgrade). This is
supported in every Python version 2.7+.

* Drop unnecessary string format specification

No need to specify 0,1.. if parameters are passed in order.

* Revert "Drop unnecessary string format specification"

This reverts commit efa5ec85d30ff69f34e5ed93e31343fea7647bcb.

* Undo changes to cloudpickle

Drop use of set literal until cloudpickle uses it.

* Reformat code with YAPF

We need to set up a git pre-push hook to automatically run this stuff.
2018-05-03 07:45:11 -07:00
Stephanie Wang aa07f1ce4e [xray] Workers blocked in a ray.get release their resources (#1920)
* [xray] Throttle task dispatch by required resources
* Pass in number of initial workers into raylet command
* Workers blocked in a ray.get release resources
2018-04-18 20:59:58 -07:00
Philipp Moritz 74162d1492 Lint Python files with Yapf (#1872) 2018-04-11 10:11:35 -07:00
Philipp Moritz 834e594709 [XRay] Register object store and raylet with the GCS (#1860) 2018-04-09 18:56:33 -07:00
Robert Nishihara 256389dc59 Use new task spec for computing IDs in raylet code path. (#1830)
* Use new task spec for computing IDs in raylet code path.

* Fix linting.

* Fixes

* Fix test.
2018-04-08 13:31:55 -07:00
Robert Nishihara fbfbb1c079 [xray] Integrate worker.py with raylet. (#1810)
* Integrate worker with raylet.

* Begin allowing worker to attach to cluster.

* Fix linting and documentation.

* Fix linting.

* Comment tests back in.

* Fix type of worker command.

* Remove xray python files and tests.

* Fix from rebase.

* Add test.

* Copy over raylet executable.

* Small cleanup.
2018-04-03 02:38:56 -07:00
Robert Nishihara 0fc989c6c1 Don't use 127.0.0.1 for local ip address. (#1596)
* Don't use 127.0.0.1 for ip address.

* Update test
2018-04-02 00:34:20 -07:00
Robert Nishihara 1ab0d0ea69 Acquire worker lock when importing actor. (#1783) 2018-03-26 18:31:26 -07:00
Robert Nishihara c6ad71fc9d Fix bug when connecting another driver in local case. (#1760)
* Allow connecting another driver when using ip address 127.0.0.1.

* Add test.
2018-03-21 11:49:53 -07:00
Robert Nishihara 4bccabd910 Redirect output of all processes by default. (#1752)
* Redirect output of all processes by default.

* Add separate flag for redirecting worker output.

* Fix tests.
2018-03-20 18:14:54 -07:00
Robert Nishihara 4658d0a180 Print error when actor takes too long to start, and refactor error me… (#1747)
* Print error when actor takes too long to start, and refactor error message pushing.

* Print warning every ten seconds.

* Fix linting and tests.

* Fix tests.
2018-03-19 20:24:35 -07:00
Robert Nishihara d78de0d41f Provide experimental API for changing number of return values and res… (#1735)
* Provide experimental API for changing number of return values and resource requirements at task submission time.

* Remove code duplication and add tests.
2018-03-19 15:32:23 -07:00
Robert Nishihara 96913be939 Treat actor creation like a regular task. (#1668)
* Treat actor creation like a regular task.

* Small cleanups.

* Change semantics of actor resource handling.

* Bug fix.

* Minor linting

* Bug fix

* Fix jenkins test.

* Fix actor tests

* Some cleanups

* Bug fix

* Fix bug.

* Remove cached actor tasks when a driver is removed.

* Add more info to taskspec in global state API.

* Fix cyclic import bug in tune.

* Fix

* Fix linting.

* Fix linting.

* Don't schedule any tasks (especially actor creation tasks) on local schedulers with 0 CPUs.

* Bug fix.

* Add test for 0 CPU case

* Fix linting

* Address comments.

* Fix typos and add comment.

* Add assertion and fix test.
2018-03-16 11:18:07 -07:00
Robert Nishihara f4b1881fec Update arrow to use updated pandas serializer. (#1582) 2018-02-22 11:10:52 -08:00
Eric Liang 7e998db656 [rllib] Reduce concat memory usage, allow object store memory to be specified in init (#1529)
* c

* stop agents

* comment

* Sat Feb 10 02:33:30 PST 2018

* Sat Feb 10 02:33:39 PST 2018

* Update sample_batch.py

* Sun Feb 11 14:38:55 PST 2018

* add ppo config warn
2018-02-11 19:14:51 -08:00
Robert Nishihara 89db7841d2 Update arrow version. (#1512) 2018-02-07 23:05:16 -08:00
Stephanie Wang ff8e7f8259 Actor checkpointing for distributed actor handles (#1498)
* Expose calls to get and set the actor frontier

* Remove fields used for old checkpointing prototype, change actor_checkpoint_failed -> succeeded

* Prototype for actor checkpointing

* Filter out duplicate tasks on the local scheduler

* Clean up some of the Python checkpointing code

* More cleanups

* Documentation

* cleanup and fix unit test

* Allow remote checkpoint calls through actor handle

* Check whether object is local before reconstructing

* Enable checkpointing for distributed actor handles, refactor tests

* Fix local scheduler tests

* lint

* Address comments

* lint

* Skip tests that fail on new GCS

* style

* Don't put same object twice when setting the actor frontier

* Address Philipp's comments, cleaner fbs naming
2018-02-07 11:19:32 -08:00
Eric Liang 4ec51a4660 [rllib] Occasional Thread Error from RLlib (#1441)
* fix

* Revert "fix"

This reverts commit 808f7d7688a837e5ce4cc4209ca28390bc29f1d8.

* Drivers should ignore imports from other drivers.
2018-02-06 20:30:11 -08:00
Robert Nishihara ed77a4c415 Make ray.get_gpu_ids() respect existing CUDA_VISIBLE_DEVICES. (#1499)
* Make ray.get_gpu_ids() respect existing CUDA_VISIBLE_DEVICES.

* Comment out failing GPUID check.

* Add import.

* Fix test.

* Remove test.

* Factor out environment variable setting/getting into utils.
2018-02-01 21:29:14 -08:00
Stephanie Wang 668737f383 Replace actor dummy objects with mock calls to the local scheduler (#1467)
* Replace putting the dummy object with a call to the local scheduler

* Mark dummy objects as locally available
2018-01-26 14:18:45 -08:00
Robert Nishihara ab5d4a6010 Bring cloudpickle inside the repository. (#1445)
* Bring cloudpickle version 0.5.2 inside the repo.

* Use internal copy of cloudpickle everywhere.

* Fix linting.

* Import ordering.

* Change __init__.py.

* Set pickler in serialization context.

* Don't check ray location.
2018-01-25 11:36:37 -08:00
Alexey Tumanov f1303291b4 Ray scheduler spillback plumbing + mechanism (#1362)
* spillback mechanism and plumbing : adding spillback counter + timestamp

* linting fix

* documentation

* Fix argument name.
2018-01-23 20:18:12 -08:00
Robert Nishihara e970e24ea5 Update arrow, and pass memcopy_threads into put. (#1374) 2017-12-31 13:32:06 -08:00
Stephanie Wang 12fdb3f53a Convert actor dummy objects to task execution edges. (#1281)
* Define execution dependencies flatbuffer and add to Redis commands

* Convert TaskSpec to TaskExecutionSpec

* Add execution dependencies to Python bindings

* Submitting actor tasks uses execution dependency API instead of dummy argument

* Fix dependency getters and some cleanup for fetching missing dependencies

* C++ convention

* Make TaskExecutionSpec a C++ class

* Convert local scheduler to use TaskExecutionSpec class

* Convert some pointers to references

* Finish conversion to TaskExecutionSpec class

* fix

* Fix

* Fix memory errors?

* Cast flatbuffers GetSize to size_t

* Fixes

* add more retries in global scheduler unit test

* fix linting and cast fbb.GetSize to size_t

* Style and doc

* Fix linting and simplify from_flatbuf.
2017-12-14 20:47:54 -08:00
Robert Nishihara 96c46d35ff Tell Ray how to serialize FunctionSignature objects. (#1308) 2017-12-10 22:40:28 -08:00
Eric Liang 7009538321 Autodetect the number of GPUs when starting Ray. (#1293)
* autodetect

* Wed Dec  6 12:46:52 PST 2017

* Wed Dec  6 12:47:54 PST 2017

* Move GPU autodetection into services.py.

* Fix capitalization of Nvidia.

* Update documentation.
2017-12-09 15:30:16 -08:00
John Schulman 2606001a36 allow users to disable the webui (#1306)
* allow users to disable the webui

* Remove trailing whitespace.
2017-12-09 00:35:55 -08:00
Robert Nishihara c21e189371 Allow scheduling with arbitrary user-defined resource labels. (#1236)
* Enable scheduling with custom resource labels.

* Fix.

* Minor fixes and ref counting fix.

* Linting

* Use .data() instead of .c_str().

* Fix linting.

* Fix ResourcesTest.testGPUIDs test by waiting for workers to start up.

* Sleep in test so that all tasks are submitted before any completes.
2017-12-01 11:41:40 -08:00
Eric Liang 37831ae0c3 Add a nicer warning message when you pass the wrong thing to ray.wait() (#1239)
* add warnings

* fix python mode

* Small changes and add tests.

* Fix test failure.
2017-11-27 22:57:33 -08:00
Robert Nishihara c1496b8111 Check version info in ray start for non-head nodes. (#1264)
* Check version info in ray start for non-head nodes.

* Small fix.

* Fix

* Push error to all drivers when worker has version mismatch.

* Linting

* Linting

* Fix

* Unify methods.

* Fix bug.
2017-11-27 22:03:38 -08:00
Robert Nishihara f7c4f41df8 Change Python Redis client psubscribe -> subscribe. (#1261) 2017-11-26 23:29:37 -08:00
Robert Nishihara 2865128df0 Remove counter from run_function_on_all_workers. Also remove utilitie… (#1260)
* Remove counter from run_function_on_all_workers. Also remove utilities for copying directories across machines.

* Fix linting.
2017-11-26 18:29:10 -08:00
Robert Nishihara 0b4961b161 Provide flag for setting redis maxclients. (#1257)
* Add flag for attempting to increase ulimit -n and the redis maxclients.

* Don't bother trying to set ulimit -n.

* Fix linting.

* Add basic test.
2017-11-26 18:25:55 -08:00
Robert Nishihara e583d5a421 Give warnings for unimplemented Python mode methods. (#1256) 2017-11-26 13:11:12 -08:00