diff --git a/release/RELEASE_CHECKLIST.md b/release/RELEASE_CHECKLIST.md
index 9ab85f30b..da2d9145a 100644
--- a/release/RELEASE_CHECKLIST.md
+++ b/release/RELEASE_CHECKLIST.md
@@ -60,6 +60,20 @@ This checklist is meant to be used in conjunction with the RELEASE_PROCESS.rst d
 - [ ] K8s Test
     - [ ] K8s cluster launcher test
     - [ ] K8s operator test
+- [ ] Data processing tests
+    - [ ] streaming_shuffle
+- [x] Tune tests
+    - [x] ignore for now
+- [ ] XGBoost Tests
+    - [ ] distributed_api_test
+    - [ ] train_small
+    - [ ] train_moderate
+    - [ ] train_gpu
+    - [ ] tune_small
+    - [ ] tune_4x32
+    - [ ] tune_32x4
+    - [ ] ft_small_non_elastic (flaky!)
+    - [ ] ft_small_elastic (flaky!)
 
 ## Final Steps
 - [ ] Wheels uploaded to Test PyPI
diff --git a/release/RELEASE_PROCESS.rst b/release/RELEASE_PROCESS.rst
index 80afb3589..f1decb4b6 100644
--- a/release/RELEASE_PROCESS.rst
+++ b/release/RELEASE_PROCESS.rst
@@ -144,11 +144,11 @@ is generally the easiest way to run release tests.
 
    Run the ``ci/asan_tests`` with the commit. This will enable ASAN build and run the whole Python tests to detect memory leaks.
 
-6. **K8s operator tests**
+7. **K8s operator tests**
 
    Run the ``python/ray/tests/test_k8s_*`` to make sure K8s cluster launcher and operator works. Make sure the docker image is the released version.
 
-6. **Data processing tests**
+8. **Data processing tests**
 
    .. code-block:: bash
 
@@ -162,7 +162,26 @@ is generally the easiest way to run release tests.
 
    **IMPORTANT** Check if the workload scripts has terminated. If so, please record the result (both read/write bandwidth and the shuffle result) to the ``release_logs/data_processing_tests/[test_name]``.
    Both shuffling runtime and read/write bandwidth shouldn't be decreasing more than 15% compared to the previous release.
-
+
+9. **Ray Tune release tests**
+
+   General Ray Tune functionality is implicitly tested via RLLib and XGBoost release tests.
+   We are in the process of introducing scalability envelopes for Ray Tune.
+   This is an ongoing effort and will only be introduced in the next release.
+   For now, **you can ignore the tune_tests directory**.
+
+10. **XGBoost release tests**
+
+   .. code-block:: bash
+
+      xgboost_tests/README.rst
+
+   Follow the instructions to kick off the tests and check the status of the workloads.
+   The XGBoost release tests use assertions or fail with exceptions and thus
+   should automatically tell you if they failed or not.
+   Only in the case of the fault tolerance tests you might want
+   to check the logs. See the readme for more information.
+
 Identify and Resolve Release Blockers
 -------------------------------------
 
diff --git a/release/xgboost_tests/README.rst b/release/xgboost_tests/README.rst
new file mode 100644
index 000000000..303b09ef9
--- /dev/null
+++ b/release/xgboost_tests/README.rst
@@ -0,0 +1,32 @@
+XGBoost on Ray tests
+====================
+
+This directory contains various XGBoost on Ray release tests.
+
+You should run these tests with the `releaser `_ tool.
+
+Overview
+--------
+There are four kinds of tests:
+
+1. ``distributed_api_test`` - checks general API functionality and should finish very quickly (< 1 minute)
+2. ``train_*`` - checks single trial training on different setups.
+3. ``tune_*`` - checks multi trial training via Ray Tune.
+4. ``ft_*`` - checks fault tolerance. **These tests are currently flaky**
+
+Generally the releaser tool will run all tests in parallel, but if you do
+it sequentially, be sure to do it in the order above. If ``train_*`` fails,
+``tune_*`` will fail, too.
+
+Flaky fault tolerance tests
+---------------------------
+The fault tolerance tests are currently flaky. In some runs, more nodes die
+than expected, causing the test to fail. In other cases, the re-scheduled
+actors become available too soon after crashing, causing the assertions to
+fail. Please consider re-running the test a couple of times or contact the
+test owner with outputs from the tests for further questions.
+
+Acceptance criteria
+-------------------
+These tests are considered passing when they throw no error at the end of
+the output log.
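
Reviewer note: the acceptance criterion in the new README ("no error at the end of the output log") could be checked mechanically. The following is only a minimal sketch and not part of this patch. The helper names (`tail_has_error`, `run_passed`), the marker strings, and the 50-line tail window are all hypothetical; the README does not define what counts as an error or how much of the log counts as "the end".

```python
from typing import Iterable, Sequence

# Hypothetical markers: the README does not specify what an "error" looks like.
# A Python traceback header and AssertionError are assumed here because the
# tests are described as using assertions or failing with exceptions.
ERROR_MARKERS = ("Traceback (most recent call last)", "AssertionError")


def tail_has_error(
    log_lines: Iterable[str],
    tail: int = 50,
    markers: Sequence[str] = ERROR_MARKERS,
) -> bool:
    """Return True if any of the last `tail` log lines contains a marker."""
    lines = list(log_lines)
    # Guard against tail <= 0, because lines[-0:] would return the whole list.
    recent = lines[-tail:] if tail > 0 else []
    return any(marker in line for line in recent for marker in markers)


def run_passed(log_lines: Iterable[str], tail: int = 50) -> bool:
    """Apply the README criterion: pass when the log tail shows no error."""
    return not tail_has_error(log_lines, tail=tail)
```

For the flaky `ft_*` tests, a failing result from a check like this would still call for reading the log by hand, as the README recommends.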