2378 Commits

Author SHA1 Message Date
Wang Qing fa2bfa6d76 Fix some small code quality issues. (#3719) 2019-01-11 15:24:49 +08:00
Stephanie Wang cc5ecd71c5 [autoscaler] Add kill and get IP commands to CLI for testing (#3731)
## What do these changes do?

Adds 2 commands to the CLI that take in an autoscaler config:
1. Kill a random ray node in the cluster.
2. Get all the worker node IP addresses.

These commands are both for testing and are not recommended for normal use.

## Related issue number
Closes #3685.
2019-01-10 22:06:57 -08:00
Richard Liaw 574f0b73bc [tune] Fix Trial Serialization (#3743) 2019-01-10 19:26:10 -08:00
Hao Chen 597abb24ea Refine multi-threading support (#3672)
* [Python] refine multi-threading support

fix

* [java] refine multithreading code

fix java

* format
2019-01-10 13:58:11 -08:00
Eric Liang 71243203a4 [rllib] Fix KeyError: 'kl' in multiagent ppo training 2019-01-09 19:33:07 -08:00
Hao Chen 6fc3fc4120 Cap task lease timeout (#3707) 2019-01-09 17:19:48 -08:00
Richard Liaw edb7aaf7c7 [tune] Better Serialization for Server (#3708)
* Add cloudpickle for serialization

* Fix tests
2019-01-09 11:55:32 -08:00
Stephanie Wang 04f31db54d Actor dummy object garbage collection (#3593)
* Convert UniqueID::nil() to a constructor

* Cleanup actor handle pickling code

* Add new actor handles to the task spec

* Pass in new actor handles

* Add new handles to the actor registration

* Regression test for actor handle forking and GC

* lint and doc

* Handle pickled actor handles in the backend and some refactoring

* Add regression test for dummy object GC and pickled actor handles

* Check for duplicate actor tasks on submission

* Regression test for forking twice, fix failed named actor leak

* Fix bug for forking twice

* lint

* Revert "Fix bug for forking twice"

This reverts commit 3da85e59d401e53606c2e37ffbebcc8653ff27ac.

* Add new actor handles when task is assigned, not finished

* Remove comment

* remove UniqueID()

* Updates

* update

* fix

* fix java

* fixes

* fix
2019-01-09 10:37:11 -08:00
Wenting Shen 3027dde303 Fix some storage problems of RayLog (#3595)
1. Fix the problem of duplicated stored logs.
2. Save logs whose level is at or above `severity_threshold`, not only those exactly at `severity_threshold`.
3. Fix a `log_dir` bug: storing logs in a wrong path.
2019-01-09 13:54:21 +08:00
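
The threshold fix in 3027dde303 (item 2) can be illustrated with a minimal Python sketch. The `Severity` enum, its numeric values, and `should_store` are hypothetical names chosen for illustration; they are not RayLog's actual C++ definitions.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Illustrative levels; the numeric values are assumptions.
    DEBUG = 0
    INFO = 1
    WARNING = 2
    ERROR = 3
    FATAL = 4


def should_store(level, severity_threshold):
    """Store every record at or above the threshold, not only exact matches."""
    return level >= severity_threshold
```

With a threshold of `WARNING`, this keeps `WARNING`, `ERROR`, and `FATAL` records and drops `INFO` and `DEBUG`.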
Robert Nishihara d1e21b702e Change timeout from milliseconds to seconds in ray.wait. (#3706)
* Change timeout from milliseconds to seconds in ray.wait.

* Suppress warning.

* Suppress warning.

* Add prominent warning in API documentation.
2019-01-08 21:32:08 -08:00
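
After d1e21b702e, call sites that passed a millisecond timeout to `ray.wait` need to pass seconds instead. A minimal migration sketch follows; `timeout_ms_to_seconds` is a hypothetical helper, not part of Ray.

```python
def timeout_ms_to_seconds(timeout_ms):
    """Convert a legacy millisecond timeout to seconds.

    Hypothetical helper, not part of Ray. None (no timeout) passes through.
    """
    if timeout_ms is None:
        return None
    if timeout_ms < 0:
        raise ValueError("timeout must be non-negative")
    return timeout_ms / 1000.0


# A call site that used to pass 500 (milliseconds) would now pass 0.5 (seconds):
# ray.wait(object_ids, num_returns=1, timeout=timeout_ms_to_seconds(500))
```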
Si-Yuan 59d861281e Bug fixing: Redis password should be used when reporting errors. (#3724) 2019-01-08 21:23:55 -08:00
Robert Nishihara 6bbc667f93 Remove unused code path in services.py. (#3722) 2019-01-08 19:57:16 -08:00
Peter Schafhalter 5945b92fd3 [sgd] Add checkpointing (#3638) 2019-01-08 15:29:30 -08:00
Robert Nishihara 5e76d52868 Improve cluster.wait_for_nodes() API. (#3712)
* Separate out functionality for querying client table and improve cluster.wait_for_nodes() API.

* Linting

* Add back logging statements.

* info -> debug
2019-01-07 21:26:58 -08:00
Richard Liaw 33319502b6 [tune] Add a callable check for converting to trainable (#3711) 2019-01-07 16:18:29 -08:00
Robert Nishihara 5dadac148c Remove unused file. (#3695) 2019-01-07 12:45:48 -08:00
Robert Nishihara c9d70f0dda Remove num_local_schedulers argument from ray.worker._init. (#3704)
* Remove num_local_schedulers argument from ray.worker._init.

* Fix

* Fix tests.
2019-01-07 12:44:49 -08:00
Eric Liang e78562b2e8 [rllib] Misc fixes: set lr for PG, better error message for LSTM/PPO, fix multi-agent/APEX (#3697)
* fix

* update test

* better error

* compute

* eps fix

* add get_policy() api

* Update agent.py

* better err msg

* fix

* pass in rew
2019-01-06 19:37:35 -08:00
Hao Chen df0733cafb Skip test_multiple_recursive (#3683)
This test often hangs or fails in CI. Skip it for now to unblock other PRs.
2019-01-06 13:24:29 -08:00
Richard Liaw 8934e37a78 [tune] Change log handling for Tune (#3661)
Also provides a small retry mechanism for a transient error as reported
by #3340.

Closes #3653.
2019-01-06 13:20:10 -08:00
mattearllongshot 681e8cd3fd [autoscaler] Add an initial_workers option (#3530)
## What do these changes do?

This option goes along with `min_workers` and `max_workers`. When the
cluster is first brought up (or when it is refreshed with a subsequent
`ray up`), this number of nodes will be started.

It's a workaround for scaling issues (see related issues) where it
can take a long time (or forever, in the case where the head node has
`--num-cpus 0`) to scale up a cluster in response to increasing demand.

## Related issue number

Workaround for https://github.com/ray-project/ray/issues/3339 and https://github.com/ray-project/ray/issues/2106
2019-01-05 17:58:42 -08:00
Robert Nishihara 067976ad3d Push a warning to all users when large number of workers have been started. (#3645)
* Push a warning to all users when large number of workers have been started.

* Add test.

* Fix bug.

* Give warning when worker starts instead of when worker registers.

* Fix

* Fix tests
2019-01-05 13:27:32 -08:00
Wang Qing 692fdc6bc3 [Java] Allow actor handle to be serialized without forking (#3686) 2019-01-06 00:29:08 +08:00
Eric Liang 03fe760616 [rllib] Model self loss isn't included in all algorithms (#3679) 2019-01-04 22:30:35 -08:00
Richard Liaw 960a943503 [tune] Fault Tolerance: handle lost checkpoints by restart (#3657)
Checks that node failure with lost checkpoints does not crash. Also adds test.
2019-01-04 22:05:27 -08:00
Eric Liang 7db1f3be2a [tune] resume=False by default but print a tip to set resume="prompt" + jenkins fix (#3681) 2019-01-04 17:23:19 -08:00
Kristian Hartikainen 747b117929 [tune] Tweak/allow nested pbt mutations (#3455)
* Fix warning text in pbt logger

* Allow nested mutations in pbt by recursing explore function

* Add test for nested pbt mutation

* Update pbt explore to only call custom explore on top level

* fix test
2019-01-04 13:51:11 -08:00
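
The idea in 747b117929 of recursing the explore function over nested mutations might look roughly like the sketch below. It is a simplified, hypothetical version: the real Tune function also perturbs values and applies a resample probability, which is omitted here, and the `rng` parameter is an assumption added for testability.

```python
import copy
import random


def explore(config, mutations, rng=random):
    """Return a mutated copy of config, recursing into nested dicts.

    Simplified, hypothetical sketch: a list mutation picks a value from
    the list, and a callable mutation resamples by calling it.
    """
    new_config = copy.deepcopy(config)
    for key, spec in mutations.items():
        if isinstance(spec, dict):
            # Nested mutation spec: recurse into the matching sub-config.
            new_config[key] = explore(config[key], spec, rng)
        elif isinstance(spec, list):
            new_config[key] = rng.choice(spec)
        else:
            new_config[key] = spec()
    return new_config
```

Keys that have no entry in `mutations` are copied unchanged, and the input config is never modified in place.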
Robert Nishihara cd80891ddb Try to figure out the memory limit in a docker container. (#3605)
* Try to figure out the memory limit in a docker container.

* Update comment

* Fix

* Fix
2019-01-03 23:07:24 -08:00
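
One way to approach what cd80891ddb describes is to read the cgroup memory limit and take the smaller of that and the host memory. The sketch below assumes the cgroup v1 path; the function name and parameters are hypothetical and the commit's actual implementation may differ.

```python
# cgroup v1 location of the container memory limit (assumed; cgroup v2 differs).
CGROUP_LIMIT_FILE = "/sys/fs/cgroup/memory/memory.limit_in_bytes"


def effective_memory_bytes(system_memory_bytes, limit_file=CGROUP_LIMIT_FILE):
    """Return the smaller of the host memory and the cgroup limit, if readable."""
    try:
        with open(limit_file) as f:
            cgroup_limit = int(f.read().strip())
    except (OSError, ValueError):
        # Not in a container, or the file is missing or unparseable.
        return system_memory_bytes
    return min(system_memory_bytes, cgroup_limit)
```

Outside a container the limit file is typically absent or holds a very large value, so the host memory figure is returned unchanged.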
Robert Nishihara 586a5c9ffa Limit default redis max memory to 10GB. (#3630)
* Limit Redis max memory to 10GB/shard by default.

* Update stress tests.

* Reorganize

* Update

* Add minimum cap size for object store and redis.

* Small test update.
2019-01-03 13:23:54 -08:00
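
The default cap and minimum size described in 586a5c9ffa can be sketched as a small clamping function. Only the 10GB per-shard figure comes from the commit message; the function name, the minimum value, and the decimal reading of "GB" are assumptions made for illustration.

```python
GB = 10**9  # assumes decimal gigabytes
DEFAULT_CAP_BYTES = 10 * GB  # per-shard default cap, from the commit message
MIN_CAP_BYTES = 10**7  # hypothetical minimum; the real value is not stated


def redis_max_memory(available_bytes, requested=None):
    """Pick a per-shard Redis memory limit.

    An explicit request is honored if it meets the minimum; otherwise the
    default is the available memory clamped to [MIN_CAP_BYTES, DEFAULT_CAP_BYTES].
    """
    if requested is not None:
        if requested < MIN_CAP_BYTES:
            raise ValueError("requested Redis memory is below the minimum cap")
        return requested
    return max(MIN_CAP_BYTES, min(available_bytes, DEFAULT_CAP_BYTES))
```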
Yuhong Guo 4b23a34c93 Fix multi-thread problem of function manager and Jenkins test (#3648) 2019-01-03 17:05:13 +08:00
Yuhong Guo ad2287ebe9 Fix new boost libs failure in cache-lib mode and add test to cover collect_dependent_libs.sh (#3627)
* Fix building breaks and add lib collection to Travis.

* Fix arrow build

* Fix version mismatch problem
2019-01-02 23:51:11 -08:00
Eric Liang ca864faece [rllib] Documentation for I/O API and multi-agent support / cleanup (#3650) 2019-01-03 15:15:36 +08:00
opherlieber 2177e2f410 [rllib] Agent: Allow unknown subkeys for custom_resources_per_worker (#3639)
* RLLib Agent: Allow unknown subkeys for custom_resources_per_worker

* Update agent.py
2019-01-03 14:19:59 +08:00
Eric Liang 47d36d7bd6 [rllib] Refactor pytorch custom model support (#3634) 2019-01-03 13:48:33 +08:00
Robert Nishihara b6bcd18d65 Split profile table among many keys in the GCS. (#3676)
* Divide profile table among many keys in GCS.

* Fix, and remove --collect-profiling-data arg.

* Remove reference in doc.
2019-01-02 21:33:01 -08:00
Yuhong Guo 93e9d2b82c Improve backend log: env variable setting and format refine. (#3662)
* Improve backend logging

* Address comment

* Fix Raul's comment
2019-01-01 21:45:29 -08:00
Eric Liang b8a9e3f106 [rllib] Remove uses of sgd_stepsize => lr (#3667)
* lr

* Update example-evolution-strategies.rst
2019-01-01 12:01:27 +08:00
Si-Yuan 93d54110f8 Prevent overriding faulthandler settings (#3668)
This change ensures that Ray sets up fault handlers only if they have not already been enabled by other applications. Otherwise, some applications could face strange issues when using Ray, and some unit tests using XML runners would fail.
2018-12-31 16:36:26 -08:00
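
The guard described in 93d54110f8 can be expressed with the standard library's `faulthandler` module. This is a minimal sketch of the pattern, not the code from the commit; the function name and the `file` parameter are hypothetical.

```python
import faulthandler
import sys


def enable_faulthandler_if_unset(file=None):
    """Enable faulthandler only if no one else has; return True if we enabled it."""
    if faulthandler.is_enabled():
        # Another application already configured it; leave its settings alone.
        return False
    faulthandler.enable(file=file if file is not None else sys.stderr)
    return True
```

Checking `faulthandler.is_enabled()` first means an application's own choice of output file and thread settings is not overwritten.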
Yuhong Guo c9b8ecca51 Add RayParams to refactor the parameters used by ray python. (#3558) 2018-12-29 22:04:27 +08:00
Devin Petersohn eb1e5fa2cf Fixing Python2 compatibility issues. Adding inline docs (#3656) 2018-12-28 22:53:28 -08:00
Richard Liaw aad3c50e2d [tune] Cluster Fault Tolerance (#3309)
This PR introduces cluster-level fault tolerance for Tune by checkpointing global state. This occurs with relatively high frequency and allows users to easily resume experiments when the cluster crashes.

Note that this PR may affect automated workflows due to auto-prompting, but this is resolvable.
2018-12-29 11:42:25 +08:00
Zhijun Fu 382b138fc7 fix code issues in object manager that are reported by scanning tool (#3649)
Fix some code issues found by a code scanning tool:

**1. Macro compares unsigned to 0(NO_EFFECT)**

CWE570: An unsigned value can never be less than 0
This greater-than-or-equal-to-zero comparison of an unsigned value is always true. "this->create_buffer_state_[object_id].num_seals_remaining >= 0UL".

~/ray/src/ray/object_manager/object_buffer_pool.cc: ray::ObjectBufferPool::SealChunk(const ray::UniqueID &, unsigned long)

**2. Inferred misuse of enum(MIXED_ENUMS)**

CWE398: An integer expression which was inferred to have an enum type is mixed with a different enum type
This case, "static_cast(ray::object_manager::protocol::MessageType::PushRequest)", implies the effective type of "message_type" is "ray::object_manager::protocol::MessageType".

~/ray/src/ray/object_manager/object_manager.cc: ray::ObjectManager::ProcessClientMessage(std::shared_ptr> &, long, const unsigned char *)
2018-12-28 14:38:59 -08:00
Zhijun Fu 3df1e1c471 Add missing lock in FreeObjects of object buffer pool (#3647)
The object manager uses multiple threads to transfer objects between nodes, so the plasma client used in `object_buffer_pool_` needs to be protected by a lock. We have seen crashes caused by a missing lock in the `FreeObjects()` interface; this PR fixes that issue.
2018-12-28 11:47:31 -08:00
Wang Qing c59b506c6e [Java] Support calling Ray APIs from multiple threads (#3646) 2018-12-28 17:44:31 +08:00
Hao Chen 0b682d043e Fix memory leak in PyRayletClient (#3640)
1) When using `PyObject_GetIter`, the caller must call `Py_DECREF` on the returned iterator to avoid a memory leak. With `PyList_GetItem`, which returns a borrowed reference, `Py_DECREF` isn't needed.
2) The `Py_BuildValue` call in `wait` doesn't need to increment the ref count.
2018-12-27 17:39:02 -08:00
Hao Chen 62af2f25be Fix test_multiple_actor_reconstruction failure (#3641)
* Fix test_multiple_actor_reconstruction failure

* add comment
2018-12-27 13:57:52 -08:00
Richard Liaw ac792d70c8 [rllib] Add starcraft multiagent env as example (#3542) 2018-12-27 10:00:32 +08:00
Tianming Xu b4f61dfd50 [rllib] Export policy model checkpoint (#3637)
* Export policy model checkpoint

* update comment
2018-12-27 08:43:06 +09:00
Richard Liaw 6e2d7a9ba1 [tune] Support Configuration Merging (#3584)
* merge configs

* deep merge

* lint

* add resolve

* test
2018-12-26 20:07:11 +09:00
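
The "deep merge" step listed in 6e2d7a9ba1 refers to merging nested configuration dicts key by key rather than replacing whole sub-dicts. A generic sketch of that technique follows; `deep_merge` is a hypothetical name and is not necessarily how Tune implements it.

```python
import copy


def deep_merge(base, override):
    """Return a new dict with override merged into base, recursing into nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge their keys instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

For example, merging `{"a": {"y": 20}}` into `{"a": {"x": 1, "y": 2}}` keeps `x` and updates only `y`.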
Stan Wang 4ce3818be5 Average aggregated gradients before put in plasma store (#3631) 2018-12-26 20:03:11 +09:00
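
Averaging aggregated gradients, as the title of 4ce3818be5 describes, amounts to an element-wise mean over the per-worker gradients. The sketch below uses plain Python lists and a hypothetical function name; the commit itself presumably operates on arrays before storing the result in the plasma store.

```python
def average_gradients(gradients):
    """Element-wise mean of equally sized gradient lists, one list per worker."""
    if not gradients:
        raise ValueError("need at least one gradient")
    n = len(gradients)
    return [sum(values) / n for values in zip(*gradients)]
```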