80 Commits

Author SHA1 Message Date
Melih Elibol 8ae82180b4 [xray] Adds a driver table. (#2289)
This PR adds a driver table for the new GCS, which enables cleanup functionality associated with monitoring driver death.

Some testing in `monitor_test.py` is restored, but redis sharding for xray is needed to enable remaining tests.
2018-08-08 23:41:40 -07:00
Robert Nishihara 909d7172b1 Introduce constant for ID_SIZE in python code. (#2517) 2018-07-31 12:40:53 -07:00
Peter Schafhalter 400a3e5705 Add queue size and __len__ methods (#2432) 2018-07-19 17:04:42 -07:00
Peter Schafhalter f5c46c7765 Add queue data structures (#2261) 2018-07-16 16:26:20 -07:00
Robert Nishihara e3534c46df [xray] Re-enable some stress tests and convert stress_tests to pytest. (#2285)
* Fix one of the stress tests, fix ray.global_state.client_table when called early on.

* Re-enable testWait.

* Convert stress_tests.py to pytest.

* Fix
2018-07-06 23:21:00 -07:00
Robert Nishihara b90e551b41 [xray] Implement timeline and profiling API. (#2306)
* Add profile table and store profiling information there.

* Code for dumping timeline.

* Improve color scheme.

* Push timeline events on driver only for raylet.

* Improvements to profiling and timeline visualization

* Some linting

* Small fix.

* Linting

* Propagate node IP address through profiling events.

* Fix test.

* object_id.hex() should return byte string in python 2.

* Include gcs.fbs in node_manager.fbs.

* Remove flatbuffer definition duplication.

* Decode to unicode in Python 3 and bytes in Python 2.

* Minor

* Submit profile events in a batch. Revert some CMake changes.

* Fix

* Workaround test failure.

* Fix linting

* Linting

* Don't return anything from chrome_tracing_dump when filename is provided.

* Remove some redundancy from profile table.

* Linting

* Move TODOs out of docstring.

* Minor
2018-07-04 23:23:48 -07:00
Eric Liang 8aa56c12e6 [rllib] Document "v2" APIs (#2316)
* re

* wip

* wip

* a3c working

* torch support

* pg works

* lint

* rm v2

* consumer id

* clean up pg

* clean up more

* fix python 2.7

* tf session management

* docs

* dqn wip

* fix compile

* dqn

* apex runs

* up

* imports

* ddpg

* quotes

* fix tests

* fix last r

* fix tests

* lint

* pass checkpoint restore

* kwar

* nits

* policy graph

* fix yapf

* com

* class

* pyt

* vectorization

* update

* test cpe

* unit test

* fix ddpg2

* changes

* wip

* args

* faster test

* common

* fix

* add alg option

* batch mode and policy serving

* multi serving test

* todo

* wip

* serving test

* doc async env

* num envs

* comments

* thread

* remove init hook

* update

* fix ppo

* comments1

* fix

* updates

* add jenkins tests

* fix

* fix pytorch

* fix

* fixes

* fix a3c policy

* fix squeeze

* fix trunc on apex

* fix squeezing for real

* update

* remove horizon test for now

* multiagent wip

* update

* fix race condition

* fix ma

* t

* doc

* st

* wip

* example

* wip

* working

* cartpole

* wip

* batch wip

* fix bug

* make other_batches None default

* working

* debug

* nit

* warn

* comments

* fix ppo

* fix obs filter

* update

* wip

* tf

* update

* fix

* cleanup

* cleanup

* spacing

* model

* fix

* dqn

* fix ddpg

* doc

* keep names

* update

* fix

* com

* docs

* clarify model outputs

* Update torch_policy_graph.py

* fix obs filter

* pass thru worker index

* fix

* rename

* vlad torch comments

* fix log action

* debug name

* fix lstm

* remove unused ddpg net

* remove conv net

* revert lstm

* wip

* wip

* cast

* wip

* works

* fix a3c

* works

* lstm util test

* doc

* clean up

* update

* fix lstm check

* move to end

* fix sphinx

* fix cmd

* remove bad doc

* envs

* vec

* doc prep

* models

* rl

* alg

* up

* clarify

* copy

* async sa

* fix

* comments

* fix a3c conf

* tune lstm

* fix reshape

* fix

* back to 16

* tuned a3c update

* update

* tuned

* optional

* merge

* wip

* fix up

* move pg class

* rename env

* wip

* update

* tip

* alg

* readme

* fix catalog

* readme

* doc

* context

* remove prep

* comma

* add env

* link to paper

* paper

* update

* rnn

* update

* wip

* clean up ev creation

* fix

* fix

* fix

* fix lint

* up

* no comma

* ma

* Update run_multi_node_tests.sh

* fix

* sphinx is stupid

* sphinx is stupid

* clarify torch graph

* no horizon

* fix config

* sb

* Update test_optimizers.py
2018-07-01 00:05:08 -07:00
Philipp Moritz 762bdf646e [xray] Put GCS data into the redis data shard (#2298) 2018-06-30 15:42:10 -10:00
Eric Liang 737f3e3cf2 [tune] Fix registering trainable twice (#2293)
* register twice

* isolate

* Update registry.py

* Update registry.py
2018-06-27 16:29:39 -07:00
Robert Nishihara ff2217251f [xray] Add error table and push error messages to driver through node manager. (#2256)
* Fix documentation indentation.

* Add error table to GCS and push error messages through node manager.

* Add type to error data.

* Linting

* Fix failure_test bug.

* Linting.

* Enable one more test.

* Attempt to fix doc building.

* Restructuring

* Fixes

* More fixes.

* Move current_time_ms function into util.h.
2018-06-20 21:29:28 -07:00
Zongheng Yang 8190ff1fd0 Experimental: enable automatic GCS flushing with configurable policy. (#2266)
* build_credis.sh: use an up-to-date credis commit.

* build_credis.sh: leveldb is updated, so update build cmds for it

* WIP: make monitor.py issue flush; switch gcs client to use credis

* Experimental: enable automatic GCS flushing with configurable policy.

* Fix linux compilation error

* Fix leveldb build

* Use optimized build for credis

* Address comments

* Attempt to fix tests
2018-06-20 14:40:57 -07:00
Eric Liang 30f7c08ca7 [rllib] Remove need to pass around registry (#2250)
* remove registry

* fix

* too many _

* fix

* cloudpickle

* Update registry.py

* yapf

* fix test

* fix kv check
2018-06-19 22:47:00 -07:00
Binglin Chang 19d6ca0670 Support constructing TensorFlowVariables from multiple tf operations (#2182) 2018-06-02 18:13:52 -07:00
Kunal Gosar 317d0da7d8 Add experimental API for ray.get and ray.wait with additional argument types (#2071) 2018-06-01 16:42:27 -07:00
Robert Nishihara 6172f94c04 Implement Python global state API for xray. (#2125)
* Implement global state API for xray.

* Fix object table.

* Fixes for log structure.

* Implement cluster_resources.

* Add driver task to task table.

* Remove python flatbuffers code

* Get some global state API tests running.

* Python linting.

* Fix linting.

* Fix mock modules for doc

* Copy over flatbuffer bindings.

* Fix for tests.

* Linting

* Fix monitor crash.
2018-05-29 16:25:54 -07:00
Yucong He 3509a33cf3 Prototype named actors. (#2129) 2018-05-24 00:32:12 -07:00
Alok Singh f795173b51 Use flake8-comprehensions (#1976)
* Add flake8 to Travis

* Add flake8-comprehensions

[flake8 plugin](https://github.com/adamchainz/flake8-comprehensions) that
checks for useless constructions.

* Use generators instead of lists where appropriate

A lot of the builtins can take in generators instead of lists.

This commit applies `flake8-comprehensions` to find them.

* Fix lint error

* Fix some string formatting

The rest can be fixed in another PR

* Fix compound literals syntax

This should probably be merged after #1963.

* dict() -> {}

* Use dict literal syntax

dict(...) -> {...}

* Rewrite nested dicts

* Fix hanging indent

* Add missing import

* Add missing quote

* fmt

* Add missing whitespace

* rm duplicate pip install

This is already installed in another file.

* Fix indent

* move `merge_dicts` into utils

* Bring up to date with `master`

* Add automatic syntax upgrade

* rm pyupgrade

In case users want to still use it on their own, the upgrade-syn.sh script was
left in the `.travis` dir.
2018-05-20 16:15:06 -07:00
Alok Singh 9a8f29e571 YAPF, take 3 (#2098)
* Use pep8 style

The original style file is actually just pep8 style, but with everything
spelled out. It's easier to use the `based_on_style` feature. Any overrides are
clearer that way.

* Improve yapf script

1. Do formatting in parallel
2. Lint RLlib
3. Use .style.yapf file

* Pull out expressions into variables

* Don't format rllib

* Don't allow splits in dicts

* Apply yapf

* Disallow single line if-statements

* Use arithmetic comparison

* Simplify checking for changed files

* Pull out expr into var
2018-05-19 16:07:28 -07:00
Robert Nishihara 78e4b021ab Functions for flushing done tasks and evicted objects. (#2033) 2018-05-18 01:59:58 -07:00
Melih Elibol bea97b425b Fix python linting (#2076) 2018-05-16 15:04:31 -07:00
Robert Nishihara 8fbb88485b Create RemoteFunction class, remove FunctionProperties, simplify worker Python code. (#2052)
* Cleaning up worker and actor code. Create remote function class. Remove FunctionProperties object.

* Remove register_actor_signatures function.

* Small cleanups.

* Fix linting.

* Support @ray.method syntax for actor methods.

* Fix pickling bug.

* Fix linting.

* Shorten testBlockingTasks.

* Small fixes.

* Call get_global_worker().
2018-05-14 14:35:23 -07:00
Alok Singh cdf94c18a4 Clean up syntax for supported Python versions. (#1963)
* Use set/dict literal syntax

Ran code through [pyupgrade](https://github.com/asottile/pyupgrade). This is
supported in every Python version 2.7+.

* Drop unnecessary string format specification

No need to specify 0,1.. if parameters are passed in order.

* Revert "Drop unnecessary string format specification"

This reverts commit efa5ec85d30ff69f34e5ed93e31343fea7647bcb.

* Undo changes to cloudpickle

Drop use of set literal until cloudpickle uses it.

* Reformat code with YAPF

We need to set up a git pre-push hook to automatically run this stuff.
2018-05-03 07:45:11 -07:00
Robert Nishihara 7792032ee3 Fix UI issue for non-json-serializable task arguments. (#1892)
* Fix UI issue for non-json-serializable task arguments.

* Simplify approach.
2018-04-15 13:54:42 -07:00
Philipp Moritz 74162d1492 Lint Python files with Yapf (#1872) 2018-04-11 10:11:35 -07:00
Robert Nishihara 7c9e291b4b In the UI, display task breakdowns by default. (#1857) 2018-04-09 13:24:38 -07:00
Robert Nishihara 5bde5e75e7 Implement unsafe method for flushing entire object table and task table. (#1824)
* Implement unsafe method for flushing entire object table and task table.

* Add test.

* Fix test.
2018-04-04 18:29:24 -07:00
Robert Nishihara 8d52fe931b Add experimental feature for flushing event logs and logfiles. (#1659)
* Add experimental feature for flushing event logs and logfiles.

* Add documentation.
2018-03-27 11:57:52 -07:00
Robert Nishihara 2922e1c388 Add API for getting total cluster resources. (#1736)
* Add API for getting total cluster resources.

* Add test.
2018-03-20 15:57:00 -07:00
Robert Nishihara 96913be939 Treat actor creation like a regular task. (#1668)
* Treat actor creation like a regular task.

* Small cleanups.

* Change semantics of actor resource handling.

* Bug fix.

* Minor linting

* Bug fix

* Fix jenkins test.

* Fix actor tests

* Some cleanups

* Bug fix

* Fix bug.

* Remove cached actor tasks when a driver is removed.

* Add more info to taskspec in global state API.

* Fix cyclic import bug in tune.

* Fix

* Fix linting.

* Fix linting.

* Don't schedule any tasks (especially actor creation tasks) on local schedulers with 0 CPUs.

* Bug fix.

* Add test for 0 CPU case

* Fix linting

* Address comments.

* Fix typos and add comment.

* Add assertion and fix test.
2018-03-16 11:18:07 -07:00
William Paul f2b6a7b58d Polished TensorFlowVariables code and documentation (#566) 2018-02-12 15:38:58 -08:00
Alexey Tumanov f1303291b4 Ray scheduler spillback plumbing + mechanism (#1362)
* spillback mechanism and plumbing : adding spillback counter + timestamp

* linting fix

* documentation

* Fix argument name.
2018-01-23 20:18:12 -08:00
Eric Liang a2b190e65b Fix occasional task timeline failure to get task ids (#1442) 2018-01-21 12:04:44 -08:00
Melih Elibol 24b93b1123 fixes default type for product of empty shape. (#1341) 2017-12-18 17:41:44 -08:00
Stephanie Wang 12fdb3f53a Convert actor dummy objects to task execution edges. (#1281)
* Define execution dependencies flatbuffer and add to Redis commands

* Convert TaskSpec to TaskExecutionSpec

* Add execution dependencies to Python bindings

* Submitting actor tasks uses execution dependency API instead of dummy argument

* Fix dependency getters and some cleanup for fetching missing dependencies

* C++ convention

* Make TaskExecutionSpec a C++ class

* Convert local scheduler to use TaskExecutionSpec class

* Convert some pointers to references

* Finish conversion to TaskExecutionSpec class

* fix

* Fix

* Fix memory errors?

* Cast flatbuffers GetSize to size_t

* Fixes

* add more retries in global scheduler unit test

* fix linting and cast fbb.GetSize to size_t

* Style and doc

* Fix linting and simplify from_flatbuf.
2017-12-14 20:47:54 -08:00
Robert Nishihara c21e189371 Allow scheduling with arbitrary user-defined resource labels. (#1236)
* Enable scheduling with custom resource labels.

* Fix.

* Minor fixes and ref counting fix.

* Linting

* Use .data() instead of .c_str().

* Fix linting.

* Fix ResourcesTest.testGPUIDs test by waiting for workers to start up.

* Sleep in test so that all tasks are submitted before any completes.
2017-12-01 11:41:40 -08:00
Robert Nishihara 2865128df0 Remove counter from run_function_on_all_workers. Also remove utilitie… (#1260)
* Remove counter from run_function_on_all_workers. Also remove utilities for copying directories across machines.

* Fix linting.
2017-11-26 18:29:10 -08:00
Philipp Moritz e798a652bc Change TaskSpec to allow multiple object IDs per argument. (#1204)
* Implement object ID bags

* linting

* fix tests

* fix linting

* fix comments
2017-11-10 16:33:34 -08:00
Stephanie Wang 07f0532b9b Local scheduler filters out dead clients during reconstruction (#1182)
* Object table lookup returns vector of DBClientID instead of address strings

* Add node IP address to DBClient notification

* DB client cache stores entire DB client, convert addresses to std::string

* get cached db client returns the client

* Expose a call to initialize the redis cache

* Local scheduler filters out dead clients during reconstruction

* Remove node ip address from dbclient, use aux_address for plasma managers

* Get entire db client entry when not found in cache

* Fix common tests

* Fix address in tests

* Push error to driver if driver task did the put

* Address Robert's comments and cleanup

* Remove unused Redis command

* Fix db test
2017-11-10 11:29:24 -08:00
Zongheng Yang 5a50e80b63 Make Monitor remove dead Redis entries from exiting drivers. (#994)
* WIP: removing OL, OI, TT on client exit; no saving yet.

* ray_redis_module.cc: update header comment.

* Cleanup: just the removal.

* Reformat via yapf: use pep8 style instead of google.

* Checkpoint addressing comments (partially)

* Add 'b' marker before strings (py3 compat)

* Add MonitorTest.

* Use `isort` to sort imports.

* Remove some loggings

* Fix flake8 noqa marker runtest.py

* Try to separate tests out to monitor_test.py

* Rework cleanup algorithm: correct logic

* Extend tests to cover multi-shard cases

* Add some small comments and formatting changes.
2017-09-26 00:11:38 -07:00
Eric Liang d8aa826e63 [webui] Scalability fixes for the task timeline and visualizations (#935)
* fixes

* comments

* fix test

* Update ui.py

* upd

* Fix linting.
2017-09-10 15:47:44 -07:00
Robert Nishihara f3c1248d98 Clone catapult and generate html files during installation. (#956)
* Clone catapult and generate static html during setup.

* Include UI files in installation.

* Fix directory to clone catapult to and fix linting.

* Use absolute path.

* Make sure we find a sufficiently new version of python2 when building wheels.

* Copy the trace_viewer_full.html file to the local directory if it is not present.

* Make sure wheels fail to build if UI is not included.
2017-09-10 13:41:16 -07:00
Eric Liang 953878364e [webui] Print out timeline link for full-screen trace viewing (#936)
* up

* update
2017-09-06 01:41:21 -07:00
Eric Liang a2814567e1 [webui] Quick fix to timeline on task failure (#930)
* foo

* update

* Move _add_missing_timestamps to task_profiles function.
2017-09-04 22:58:19 -07:00
Eric Liang 63d8d11714 [webui] Checkboxes should go to the left of their labels (#932) 2017-09-04 17:05:13 -07:00
Robert Nishihara 8ed03b1cf0 Make task timeline work with ipywidgets==7.0.0, change slider default values. (#925)
* Make task timeline work with ipywidgets==7.0.0.

* Change initial UI slider values from 70-100 to 0-100.
2017-09-03 23:15:46 -07:00
Wapaul1 4db45c9c54 Improved layout of controls for Web UI (#876)
* Improved layout of controls

* Added explicit labels and some comments

* Fix linting errors
2017-08-28 14:43:34 -07:00
Robert Nishihara d43a435c68 Don't redirect worker output to log files if redirect_output=False. (#873)
* Don't redirect worker output to log files if redirect_output=False.

* Fix, handle case where RedirectOutput key is not in Redis.
2017-08-27 14:27:44 -07:00
Robert Nishihara ca53e9ae7b Fix bugs in task timeline visualization. (#836)
* Fix bugs in task timeline visualization.

* Some cleanups.

* Remove print statements.
2017-08-13 23:39:37 -07:00
alanamarzoev bfe473fa8c Embedded task trace with object dependencies. (#818)
* Embedded timeline

* Yeah

* Fixed arrows not showing up.

* Fixed arrows not showing up, and added check boxes for the kinds of dependencies that should be included in the trace.

* first

* Fixes

* Fixed typo in comments, added more comments. fixed linting.

* Added more comments.

* Formatting.

* fixes

* Fixed state.py linting.

* Fixed ui.py linting errors.

* Fixed linting errors.

* Renamed task dependencies and included instructions for viewing arrows.

* Fixed according to PR comments.

* Fixed bug.

* Undid changes to metadata blocks.

* Fixes according to comments.

* Fixed linting.

* Fixed linting.

* NOQA keyword added to link line.
2017-08-09 23:00:14 -07:00
Alexey Tumanov fc885bd918 Adding basic support for a user-interpretable resource label (#761)
* adding support for the user-interpretable label(UIR)

* more plumbing for num_uirs further upstream; set to infty when specified on cmd line

* pass default num_uirs for actors; update GlobalStateAPI

* support num_uirs in ray.init()

* local scheduler resource accounting: support num_uirs; prep for vectorized resource accounting

* global scheduler test updated

* Fix bug introduced by rebase.

* Rename UIR -> CustomResource and add test.

* Small changes and use constexpr instead of macros.

* Linting and some renaming.

* Reorder some code.

* Remove cpus_in_use and fix bug.

* Add another test and make a small change.

* Rephrase documentation about feature stability.
2017-08-08 02:53:59 -07:00