* Compile global scheduler with -Werror -Wall.
* Compile plasma manager with -Werror -Wall.
* Compile local scheduler with -Werror -Wall.
* Compile common code with -Werror -Wall.
* Signed/unsigned comparisons.
* More signed/unsigned fixes.
* More signed/unsigned fixes and added extern keyword.
* Fix linting.
* Don't check strict-aliasing because Python.h doesn't pass.
* Worker reports error in previous task, actor task counter is incremented after task is successful
* Refactor actor task execution
- Return new task counter in GetTaskRequest
- Update worker state for actor tasks inside of the actor method
executor
* Manually invoked checkpoint method
* Scheduling for actor checkpoint methods
* Fix python bugs in checkpointing
* Return task success from worker to local scheduler instead of actor counter
* Kill local schedulers halfway through actor execution instead of waiting for all tasks to execute once
* Remove redundant actor tasks during dispatch, reconstruct missing dependencies for actor tasks
* Make executor for temporary actor methods
* doc
* Set default argument for whether the previous task was a success
* Refactor actor method call
* Simplify checkpoint task submission
* lint
* fix philipp's comments
* Add missing line
* Make actor reconstruction tests run faster
* Unimportant whitespace.
* Unimportant whitespace.
* Update checkpoint method signature
* Documentation and handle exceptions during checkpoint save/resume
* Rename get_task message field to actor_checkpoint_failed
* Fix bug.
* Remove debugging check, redirect test output
* Replace a local scheduler ut_array with a std::vector.
* Replace vector of sizes in local scheduler with std::pair.
* Remove utarray include.
* Replace utarray with std::vector for reading local scheduler input messages.
* Remove more UT data structures.
* Remove UT includes.
* Fix linting.
* Include stdlib.h to find size_t.
* Remove includes of stdbool.h.
* Replace std::pair with TaskQueueEntry.
* Fix redis tests.
* Reinstate tests.
* copy task specifications put into the actor task cache so it won't get overwritten when the scheduler receives the next task
* cleanup
* cleanup and fix
* linting
* fix jenkins test
* fix linting
* Clean up state when drivers exit.
* Remove unnecessary field in ActorMapEntry struct.
* Have monitor release GPU resources in Redis when driver exits.
* Enable multiple drivers in multi-node tests and test driver cleanup.
* Make redis GPU allocation a redis transaction and small cleanups.
* Fix multi-node test.
* Small cleanups.
* Make global scheduler take node_ip_address so it appears in the right place in the client table.
* Cleanups.
* Fix linting and cleanups in local scheduler.
* Fix removed_driver_test.
* Fix bug related to vector -> list.
* Fix linting.
* Cleanup.
* Fix multi node tests.
* Fix jenkins tests.
* Add another multi node test with many drivers.
* Fix linting.
* Make the actor creation notification a flatbuffer message.
* Revert "Make the actor creation notification a flatbuffer message."
This reverts commit af99099c8084dbf9177fb4e34c0c9b1a12c78f39.
* Add comment explaining flatbuffer problems.
* Switch to using C++ lists for task queues
* Init and free methods for TaskQueueEntry
* Switch from utarray to c++ vector for TaskQueueEntry
* Get rid of some pointers
* Back to O(1) deletion from waiting_task_queue
* Fix comments
* Cut code
* Non const iterators
* Fix Alexey's comments
* Availability after a killed worker
* Workers exit cleanly
* Memory cleanup in photon C tests
* Worker failure in multinode
* Consolidate worker cleanup handlers
* Update the result table before handling a task submission
* KILL_WORKER_TIMEOUT -> KILL_WORKER_TIMEOUT_MILLISECONDS
* Log a warning instead of crashing if no result table entry found
* Implement actor field for tasks
* Implement actor management in local scheduler.
* initial python frontend for actors
* import actors on worker
* IPython code completion and tests
* prepare creating actors through local schedulers
* add actor id to PyTask
* submit actor calls to local scheduler
* starting to integrate
* simple fix
* Fixes from rebasing.
* more work on python actors
* Improve local scheduler actor handlers.
* Pass actor ID to local scheduler when connecting a client.
* first working version of actors
* fixing actors
* fix creating two copies of the same actor
* fix actors
* remove sleep
* get rid of export synchronization
* update
* insert actor methods into the queue in the right order
* remove print statements
* make it compile again after rebase
* Minor updates.
* fix python actor ids
* Pass actor_id to start_worker.
* add test
* Minor changes.
* Update actor tests.
* Temporary plan for import counter.
* Temporarily fix import counters.
* Fix some tests.
* Fixes.
* Make actor creation non-blocking.
* Fix test?
* Fix actors on Python 2.
* fix rare case.
* Fix python 2 test.
* More tests.
* Small fixes.
* Linting.
* Revert tensorflow version to 0.12.0 temporarily.
* Small fix.
* Enhance inheritance test.
* attribute-based heterogeneity-awareness in global scheduler and photon
* minor post-rebase fix
* photon: enforce dynamic capacity constraint on task dispatch
* globalsched: cap the number of times we try to schedule a task in round robin
* propagating ability to specify resource capacity to ray.init
* adding resources to remote function export and fetch/register
* globalsched: remove unused functions; update cached photon resource capacity (until next photon heartbeat)
* Add some integration tests.
* globalsched: cleanup + factor out constraint checking
* lots of style
* task_spec_required_resource: global refactor
* clang format
* clang format + comment update in photon
* clang format photon comment
* valgrind
* reduce verbosity for Travis
* Add test for scheduler load balancing.
* addressing comments
* refactoring global scheduler algorithm
* Minor cleanups.
* Linting.
* Fix array_test.py and linting.
* valgrind fix for photon tests
* Attempt to fix stress tests.
* fix hashmap free
* fix hashmap free comment
* memset photon resource vectors to 0 in case they get used before the first heartbeat
* More whitespace changes.
* Undo whitespace error I introduced.
* Refactor local scheduler to remove worker indices.
* Change scheduling state enum to int in all function signatures.
* Bug fix, don't use pointers into a resizable array.
* Remove total_num_workers.
* Fix tests.
* First pass at reconstruction in the worker
Modify reconstruction stress testing to start Plasma service before rest of Ray cluster
TODO about reconstructing ray.puts
Fix ray.put error for double creates
Distinguish between empty entry and no entry in object table
Fix test case
Fix Python test
Fix tests
* Only call reconstruct on objects we have not yet received
* Address review comments
* Fix reconstruction for Python3
* remove unused code
* Address Robert's comments, stress tests are crashing
* Test and update the task's scheduling state to suppress duplicate
reconstruction requests.
* Split result table into two lookups, one for task ID and the other as a
test-and-set for the task state
* Fix object table tests
* Fix redis module result_table_lookup test case
* Multinode reconstruction tests
* Fix python3 test case
* rename
* Use new start_redis
* Remove unused code
* lint
* indent
* Address Robert's comments
* Use start_redis from ray.services in state table tests
* Remove unnecessary memset
* Split local scheduler task queue into waiting and dispatch queue
* Fix memory leak
* Add a new task scheduling status for when a task has been queued locally
* Fix global scheduler test case and add task status doc
* Documentation
* Address Philipp's comments
* Move tasks back to the waiting queue if their dependencies become unavailable
* Update existing task table entries instead of overwriting
* Switch to using redis modules for task table.
* Switch to using redis modules for the task table.
* Fix some tests.
* Fix naming and remove code duplication.
* Remove duplication in redis modules and add more cleanups.
* Address comments.
* Initial scheduler commit
* global scheduler
* add global scheduler
* Implement global scheduler skeleton.
* Formatting.
* Allow local scheduler to be started without a connection to redis so that we can test it without a global scheduler.
* Fail if there are no local schedulers when the global scheduler receives a task.
* Initialize uninitialized value and formatting fix.
* Generalize local scheduler table to db client table.
* Remove code duplication in local scheduler and add flag for whether a task came from the global scheduler or not.
* Queue task specs in the local scheduler instead of tasks.
* Simple global scheduler tests, including valgrind.
* Factor out functions for starting processes.
* Fixes.
* Merge task table and task log
* Fix test in db tests
* Address Robert's comments and some better error checking
* Add a LOG_FATAL that exits the program
* Put infrastructure in place to compute task IDs and object IDs.
* Fix version number for common library.
* Compute task IDs and object IDs deterministically.
* Address Stephanie's comments.
* Update task documentation.
* Fix formatting.
* Add more tests and checks.
* Fix formatting.
* Enable DCHECKs and change CHECKs to DCHECKs.
* Ion and Philipp's table retries
* Refactor the retry struct:
- Rename it from retry_struct to retry_info
- Retry information contains the failure callback, not the retry callback
- All functions take in retry information as an arg instead of its expanded fields
* Rename cb -> callback
* Remove prints
* Fix compiler warnings
* Change some CHECKs to greatest ASSERTs
* Key outstanding callbacks hash table with timer ID instead of callback data pointer
* Use the new retry API for table commands
* Memory cleanup in plasma unit tests
* fix Robert's comments
* add valgrind for common