diff --git a/doc/source/index.rst b/doc/source/index.rst index 14865b002..5f0e0c215 100644 --- a/doc/source/index.rst +++ b/doc/source/index.rst @@ -132,6 +132,7 @@ Ray comes with libraries that accelerate deep learning and reinforcement learnin :caption: Help troubleshooting.rst + user-profiling.rst development.rst profiling.rst contact.rst diff --git a/doc/source/profiling.rst b/doc/source/profiling.rst index 6f76c7eb1..59d12d635 100644 --- a/doc/source/profiling.rst +++ b/doc/source/profiling.rst @@ -1,5 +1,5 @@ -Profiling Ray -============= +Profiling for Ray Developers +============================ This document details, for Ray developers, how to use ``pprof`` to profile Ray binaries. diff --git a/doc/source/user-profiling-timeline.gif b/doc/source/user-profiling-timeline.gif new file mode 100644 index 000000000..455f1a9af Binary files /dev/null and b/doc/source/user-profiling-timeline.gif differ diff --git a/doc/source/user-profiling.rst b/doc/source/user-profiling.rst new file mode 100644 index 000000000..e7c18dd5e --- /dev/null +++ b/doc/source/user-profiling.rst @@ -0,0 +1,649 @@ +Profiling for Ray Users +======================= + +This document is intended for users of Ray who want to know how to evaluate +the performance of their code while running on Ray. Profiling the +performance of your code can be very helpful to determine performance +bottlenecks or to find out where your code may not be parallelized properly. +If you are interested in pinpointing why your Ray application may not be +achieving the expected speedup, read on! + + +A Basic Example to Profile +-------------------------- + +Let's try to profile a simple example, and compare how different ways to +write a simple loop can affect performance. + +As a proxy for a computationally intensive and possibly slower function, +let's define our remote function to just sleep for 0.5 seconds: + +.. code-block:: python + + import ray + import time + + # Our time-consuming remote function + @ray.remote + def func(): + time.sleep(0.5) + +In our example setup, we wish to call our remote function ``func()`` five +times, and store the result of each call into a list. To compare the +performance of different ways of looping our calls to our remote function, +we can define each loop version as a separate function on the driver script. + +For the first version **ex1**, each iteration of the loop calls the remote +function, then calls ``ray.get`` in an attempt to store the current result +into the list, as follows: + +.. code-block:: python + + # This loop is suboptimal in Ray, and should only be used for the sake of this example + def ex1(): + list1 = [] + for i in range(5): + list1.append(ray.get(func.remote())) + +For the second version **ex2**, each iteration of the loop calls the remote +function, and stores it into the list **without** calling ``ray.get`` each time. +``ray.get`` is used after the loop has finished, in preparation for processing +``func()``'s results: + +.. code-block:: python + + # This loop is more proper in Ray + def ex2(): + list2 = [] + for i in range(5): + list2.append(func.remote()) + ray.get(list2) + +Finally, for an example that's not so parallelizable, let's create a +third version **ex3** where the driver has to call a local +function in between each call to the remote function ``func()``: + +.. code-block:: python + + # A local function executed on the driver, not on Ray + def other_func(): + time.sleep(0.3) + + def ex3(): + list3 = [] + for i in range(5): + other_func() + list3.append(func.remote()) + ray.get(list3) + + +Timing Performance Using Python's Timestamps +-------------------------------------------- + +One way to sanity-check the performance of the three loops is simply to +time how long it takes to complete each loop version. We can do this using +python's built-in ``time`` `module`_. + +.. _`module`: https://docs.python.org/3/library/time.html + +The ``time`` module contains a useful ``time()`` function that returns the +current timestamp in unix time whenever it's called. We can create a generic +function wrapper to call ``time()`` right before and right after each loop +function to print out how long each loop takes overall: + +.. code-block:: python + + # This is a generic wrapper for any driver function you want to time + def time_this(f): + def timed_wrapper(*args, **kw): + start_time = time.time() + result = f(*args, **kw) + end_time = time.time() + + # Time taken = end_time - start_time + print('| func:%r args:[%r, %r] took: %2.4f seconds |' % \ + (f.__name__, args, kw, end_time - start_time)) + return result + return timed_wrapper + +To always print out how long the loop takes to run each time the loop +function ``ex1()`` is called, we can evoke our ``time_this`` wrapper with +a function decorator. This can similarly be done to functions ``ex2()`` +and ``ex3()``: + +.. code-block:: python + + @time_this # Added decorator + def ex1(): + list1 = [] + for i in range(5): + list1.append(ray.get(func.remote())) + + def main(): + ray.init() + ex1() + ex2() + ex3() + + if __name__ == "__main__": + main() + +Then, running the three timed loops should yield output similar to this: + +.. code-block:: bash + + | func:'ex1' args:[(), {}] took: 2.5083 seconds | + | func:'ex2' args:[(), {}] took: 1.0032 seconds | + | func:'ex3' args:[(), {}] took: 2.0039 seconds | + +Let's interpret these results. + +Here, ``ex1()`` took substantially more time than ``ex2()``, where +their only difference is that ``ex1()`` calls ``ray.get`` on the remote +function before adding it to the list, while ``ex2()`` waits to fetch the +entire list with ``ray.get`` at once. + +.. code-block:: python + + @ray.remote + def func(): # A single call takes 0.5 seconds + time.sleep(0.5) + + def ex1(): # Took Ray 2.5 seconds + list1 = [] + for i in range(5): + list1.append(ray.get(func.remote())) + + def ex2(): # Took Ray 1 second + list2 = [] + for i in range(5): + list2.append(func.remote()) + ray.get(list2) + +Notice how ``ex1()`` took 2.5 seconds, exactly five times 0.5 seconds, or +the time it would take to wait for our remote function five times in a row. + +By calling ``ray.get`` after each call to the remote function, ``ex1()`` +removes all ability to parallelize work, by forcing the driver to wait for +each ``func()``'s result in succession. We are not taking advantage of Ray +parallelization here! + +Meanwhile, ``ex2()`` takes about 1 second, much faster than it would normally +take to call ``func()`` five times iteratively. Ray is running each call to +``func()`` in parallel, saving us time. + +``ex1()`` is actually a common user mistake in Ray. ``ray.get`` is not +necessary to do before adding the result of ``func()`` to the list. Instead, +the driver should send out all parallelizable calls to the remote function +to Ray before waiting to receive their results with ``ray.get``. ``ex1()``'s +suboptimal behavior can be noticed just using this simple timing test. + +Realistically, however, many applications are not as highly parallelizable +as ``ex2()``, and the application includes sections where the code must run in +serial. ``ex3()`` is such an example, where the local function ``other_func()`` +must run first before each call to ``func()`` can be submitted to Ray. + +.. code-block:: python + + # A local function that must run in serial + def other_func(): + time.sleep(0.3) + + def ex3(): # Took Ray 2 seconds, vs. ex1 taking 2.5 seconds + list3 = [] + for i in range(5): + other_func() + list2.append(func.remote()) + ray.get(list3) + +What results is that while ``ex3()`` still gained 0.5 seconds of speedup +compared to the completely serialized ``ex1()`` version, this speedup is +still nowhere near the ideal speedup of ``ex2()``. + +The dramatic speedup of ``ex2()`` is possible because ``ex2()`` is +theoretically completely parallelizable: if we were given 5 CPUs, all 5 calls +to ``func()`` can be run in parallel. What is happening with ``ex3()``, +however, is that each parallelized call to ``func()`` is staggered by a wait +of 0.3 seconds for the local ``other_func()`` to finish. + +``ex3()`` is thus a manifestation of `Amdahls Law`_: the fastest theoretically +possible execution time from parallelizing an application is limited to be +no better than the time it takes to run all serial parts in serial. + +.. _`Amdahls Law`: https://en.wikipedia.org/wiki/Amdahl%27s_law + +Due to Amdahl's Law, ``ex3()`` must take at least 1.5 +seconds -- the time it takes for 5 serial calls to ``other_func()`` to finish! +After an additional 0.5 seconds to execute func and get the result, the +computation is done. + + +Profiling Using An External Profiler (Line Profiler) +---------------------------------------------------- + +One way to profile the performance of our code using Ray is to use a third-party +profiler such as `Line_profiler`_. Line_profiler is a useful line-by-line +profiler for pure Python applications that formats its output side-by-side with +the profiled code itself. + +Alternatively, another third-party profiler (not covered in this documentation) +that you could use is `Pyflame`_, which can generate profiling graphs. + +.. _`Line_profiler`: https://github.com/rkern/line_profiler +.. _`Pyflame`: https://github.com/uber/pyflame + +First install ``line_profiler`` with pip: + +.. code-block:: bash + + pip install line_profiler + +``line_profiler`` requires each section of driver code that you want to profile as +its own independent function. Conveniently, we have already done so by defining +each loop version as its own function. To tell ``line_profiler`` which functions +to profile, just add the ``@profile`` decorator to ``ex1()``, ``ex2()`` and +``ex3()``. Note that you do not need to import ``line_profiler`` into your Ray +application: + +.. code-block:: python + + @profile # Added decorator + def ex1(): + list1 = [] + for i in range(5): + list1.append(ray.get(func.remote())) + + def main(): + ray.init() + ex1() + ex2() + ex3() + + if __name__ == "__main__": + main() + +Then, when we want to execute our Python script from the command line, instead +of ``python your_script_here.py``, we use the following shell command to run the +script with ``line_profiler`` enabled: + +.. code-block:: bash + + kernprof -l your_script_here.py + +This command runs your script and prints only your script's output as usual. +``Line_profiler`` instead outputs its profiling results to a corresponding +binary file called ``your_script_here.py.lprof``. + +To read ``line_profiler``'s results to terminal, use this shell command: + +.. code-block:: bash + + python -m line_profiler your_script_here.py.lprof + +In our loop example, this command outputs results for ``ex1()`` as follows. +Note that execution time is given in units of 1e-06 seconds: + +.. code-block:: bash + + Timer unit: 1e-06 s + + Total time: 2.50883 s + File: your_script_here.py + Function: ex1 at line 28 + + Line # Hits Time Per Hit % Time Line Contents + ============================================================== + 29 @profile + 30 def ex1(): + 31 1 3.0 3.0 0.0 list1 = [] + 32 6 18.0 3.0 0.0 for i in range(5): + 33 5 2508805.0 501761.0 100.0 list1.append(ray.get(func.remote())) + + +Notice that each hit to ``list1.append(ray.get(func.remote()))`` at line 33 +takes the full 0.5 seconds waiting for ``func()`` to finish. Meanwhile, in +``ex2()`` below, each call of ``func.remote()`` at line 40 only takes 0.127 ms, +and the majority of the time (about 1 second) is spent on waiting for ``ray.get()`` +at the end: + + +.. code-block:: bash + + Total time: 1.00357 s + File: your_script_here.py + Function: ex2 at line 35 + + Line # Hits Time Per Hit % Time Line Contents + ============================================================== + 36 @profile + 37 def ex2(): + 38 1 2.0 2.0 0.0 list2 = [] + 39 6 13.0 2.2 0.0 for i in range(5): + 40 5 637.0 127.4 0.1 list2.append(func.remote()) + 41 1 1002919.0 1002919.0 99.9 ray.get(list2) + + +And finally, ``line_profiler``'s output for ``ex3()``. Each call to +``func.remote()`` at line 50 still take magnitudes faster than 0.5 seconds, +showing that Ray is successfully parallelizing the remote calls. However, each +call to the local function ``other_func()`` takes the full 0.3 seconds, +totalling up to the guaranteed minimum application execution time of 1.5 +seconds: + +.. code-block:: bash + + Total time: 2.00446 s + File: basic_kernprof.py + Function: ex3 at line 44 + + Line # Hits Time Per Hit % Time Line Contents + ============================================================== + 44 @profile + 45 #@time_this + 46 def ex3(): + 47 1 2.0 2.0 0.0 list3 = [] + 48 6 13.0 2.2 0.0 for i in range(5): + 49 5 1501934.0 300386.8 74.9 other_func() + 50 5 917.0 183.4 0.0 list3.append(func.remote()) + 51 1 501589.0 501589.0 25.0 ray.get(list3) + + +Profiling Using Python's CProfile +--------------------------------- + +A second way to profile the performance of your Ray application is to +use Python's native cProfile `profiling module`_. Rather than tracking +line-by-line of your application code, cProfile can give the total runtime +of each loop function, as well as list the number of calls made and +execution time of all function calls made within the profiled code. + +.. _`profiling module`: https://docs.python.org/3/library/profile.html#module-cProfile + +Unlike ``line_profiler`` above, this detailed list of profiled function calls +**includes** internal function calls and function calls made within Ray! + +However, similar to ``line_profiler``, cProfile can be enabled with minimal +changes to your application code (given that each section of the code you want +to profile is defined as its own function). To use cProfile, add an import +statement, then replace calls to the loop functions as follows: + +.. code-block:: python + + import cProfile # Added import statement + + def ex1(): + list1 = [] + for i in range(5): + list1.append(ray.get(func.remote())) + + def main(): + ray.init() + cProfile.run('ex1()') # Modified call to ex1 + cProfile.run('ex2()') + cProfile.run('ex3()') + + if __name__ == "__main__": + main() + +Now, when executing your Python script, a cProfile list of profiled function +calls will be outputted to terminal for each call made to ``cProfile.run()``. +At the very top of cProfile's output gives the total execution time for +``'ex1()'``: + +.. code-block:: bash + + 601 function calls (595 primitive calls) in 2.509 seconds + +Following is a snippet of profiled function calls for ``'ex1()'``. Most of +these calls are quick and take around 0.000 seconds, so the functions of +interest are the ones with non-zero execution times: + +.. code-block:: bash + + ncalls tottime percall cumtime percall filename:lineno(function) + ... + 1 0.000 0.000 2.509 2.509 your_script_here.py:31(ex1) + 5 0.000 0.000 0.001 0.000 remote_function.py:103(remote) + 5 0.000 0.000 0.001 0.000 remote_function.py:107(_submit) + ... + 10 0.000 0.000 0.000 0.000 worker.py:2459(__init__) + 5 0.000 0.000 2.508 0.502 worker.py:2535(get) + 5 0.000 0.000 0.000 0.000 worker.py:2695(get_global_worker) + 10 0.000 0.000 2.507 0.251 worker.py:374(retrieve_and_deserialize) + 5 0.000 0.000 2.508 0.502 worker.py:424(get_object) + 5 0.000 0.000 0.000 0.000 worker.py:514(submit_task) + ... + +The 5 separate calls to Ray's ``get``, taking the full 0.502 seconds each call, +can be noticed at ``worker.py:2535(get)``. Meanwhile, the act of calling the +remote function itself at ``remote_function.py:103(remote)`` only takes 0.001 +seconds over 5 calls, and thus is not the source of the slow performance of +``ex1()``. + + +Profiling Ray Actors with cProfile +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Considering that the detailed output of cProfile can be quite different depending +on what Ray functionalities we use, let us see what cProfile's output might look +like if our example involved Actors (for an introduction to Ray actors, see our +`Actor documentation here`_). + +.. _`Actor documentation here`: http://ray.readthedocs.io/en/latest/actors.html + +Now, instead of looping over five calls to a remote function like in ``ex1``, +let's create a new example and loop over five calls to a remote function +**inside an actor**. Our actor's remote function again just sleeps for 0.5 +seconds: + +.. code-block:: python + + # Our actor + @ray.remote + class Sleeper(object): + def __init__(self): + self.sleepValue = 0.5 + + # Equivalent to func(), but defined within an actor + def actor_func(self): + time.sleep(self.sleepValue) + +Recalling the suboptimality of ``ex1``, let's first see what happens if we +attempt to perform all five ``actor_func()`` calls within a single actor: + +.. code-block:: python + + def ex4(): + # This is suboptimal in Ray, and should only be used for the sake of this example + actor_example = Sleeper.remote() + + five_results = [] + for i in range(5): + five_results.append(actor_example.actor_func.remote()) + + # Wait until the end to call ray.get() + ray.get(five_results) + +We enable cProfile on this example as follows: + +.. code-block:: python + + def main(): + ray.init() + cProfile.run('ex4()') + + if __name__ == "__main__": + main() + +Running our new Actor example, cProfile's abbreviated output is as follows: + +.. code-block:: bash + + 12519 function calls (11956 primitive calls) in 2.525 seconds + + ncalls tottime percall cumtime percall filename:lineno(function) + ... + 1 0.000 0.000 0.015 0.015 actor.py:546(remote) + 1 0.000 0.000 0.015 0.015 actor.py:560(_submit) + 1 0.000 0.000 0.000 0.000 actor.py:697(__init__) + ... + 1 0.000 0.000 2.525 2.525 your_script_here.py:63(ex4) + ... + 9 0.000 0.000 0.000 0.000 worker.py:2459(__init__) + 1 0.000 0.000 2.509 2.509 worker.py:2535(get) + 9 0.000 0.000 0.000 0.000 worker.py:2695(get_global_worker) + 4 0.000 0.000 2.508 0.627 worker.py:374(retrieve_and_deserialize) + 1 0.000 0.000 2.509 2.509 worker.py:424(get_object) + 8 0.000 0.000 0.001 0.000 worker.py:514(submit_task) + ... + +It turns out that the entire example still took 2.5 seconds to execute, or the +time for five calls to ``actor_func()`` to run in serial. We remember in ``ex1`` +that this behavior was because we did not wait until after submitting all five +remote function tasks to call ``ray.get()``, but we can verify on cProfile's +output line ``worker.py:2535(get)`` that ``ray.get()`` was only called once at +the end, for 2.509 seconds. What happened? + +It turns out Ray cannot parallelize this example, because we have only +initialized a single ``Sleeper`` actor. Because each actor is a single, +stateful worker, our entire code is submitted and ran on a single worker the +whole time. + +To better parallelize the actors in ``ex4``, we can take advantage +that each call to ``actor_func()`` is independent, and instead +create five ``Sleeper`` actors. That way, we are creating five workers +that can run in parallel, instead of creating a single worker that +can only handle one call to ``actor_func()`` at a time. + +.. code-block:: python + + def ex4(): + # Modified to create five separate Sleepers + five_actors = [Sleeper.remote() for i in range(5)] + + # Each call to actor_func now goes to a different Sleeper + five_results = [] + for actor_example in five_actors: + five_results.append(actor_example.actor_func.remote()) + + ray.get(five_results) + +Our example in total now takes only 1.5 seconds to run: + +.. code-block:: bash + + 1378 function calls (1363 primitive calls) in 1.567 seconds + + ncalls tottime percall cumtime percall filename:lineno(function) + ... + 5 0.000 0.000 0.002 0.000 actor.py:546(remote) + 5 0.000 0.000 0.002 0.000 actor.py:560(_submit) + 5 0.000 0.000 0.000 0.000 actor.py:697(__init__) + ... + 1 0.000 0.000 1.566 1.566 your_script_here.py:71(ex4) + ... + 21 0.000 0.000 0.000 0.000 worker.py:2459(__init__) + 1 0.000 0.000 1.564 1.564 worker.py:2535(get) + 25 0.000 0.000 0.000 0.000 worker.py:2695(get_global_worker) + 3 0.000 0.000 1.564 0.521 worker.py:374(retrieve_and_deserialize) + 1 0.000 0.000 1.564 1.564 worker.py:424(get_object) + 20 0.001 0.000 0.001 0.000 worker.py:514(submit_task) + ... + + +Visualizing Tasks in the Ray Timeline +------------------------------------- +Profiling the performance of your Ray application doesn't need to be +an eye-straining endeavor of interpreting numbers among hundreds of +lines of text. Ray comes with its own visual web UI to visualize the +parallelization (or lack thereof) of user tasks submitted to Ray! + +This method does have its own limitations, however. The Ray Timeline +can only show timing info about Ray tasks, and not timing for normal +Python functions. This can be an issue especially for debugging slow +Python code that is running on the driver, and not running as a task on +one of the workers. The other profiling techniques above are options that +do cover profiling normal Python functions. + +Currently, whenever initializing Ray, a URL is generated and printed +in the terminal. This URL can be used to view Ray's web UI as a Jupyter +notebook: + +.. code-block:: bash + + ~$: python your_script_here.py + + Process STDOUT and STDERR is being redirected to /tmp/raylogs/. + Waiting for redis server at 127.0.0.1:61150 to respond... + Waiting for redis server at 127.0.0.1:21607 to respond... + Starting local scheduler with the following resources: {'CPU': 4, 'GPU': 0}. + + ====================================================================== + View the web UI at http://localhost:8897/notebooks/ray_ui84907.ipynb?token=025e8ab295270a57fac209204b37349fdf34e037671a13ff + ====================================================================== + +Ray's web UI attempts to run on localhost at port 8888, and if it fails +it tries successive ports until it finds an open port. In this above +example, it has opened on port 8897. + +Because this web UI is only available as long as your Ray application +is currently running, you may need to add a user prompt to prevent +your Ray application from exiting once it has finished executing, +such as below. You can then browse the web UI for as long as you like: + +.. code-block:: python + + def main(): + ray.init() + ex1() + ex2() + ex3() + + # Require user input confirmation before exiting + hang = input('Examples finished executing. Press enter to exit:') + + if __name__ == "__main__": + main() + +Now, when executing your python script, you can access the Ray timeline +by copying the web UI URL into your web browser on the Ray machine. To +load the web UI in the jupyter notebook, select **Kernel -> Restart and +Run All** in the jupyter menu. + +The Ray timeline can be viewed in the fourth cell of the UI notebook by +using the task filter options, then clicking on the **View task timeline** +button. + +For example, here are the results of executing ``ex1()``, ``ex2()``, and +``ex3()`` visualized in the Ray timeline. Each red block is a call to one +of our user-defined remote functions, namely ``func()``, which sleeps for +0.5 seconds: + +.. image:: user-profiling-timeline.gif + +(highlighted color boxes for ``ex1()``, ``ex2()``, and ``ex3()`` added for +the sake of this example) + +Note how ``ex1()`` executes all five calls to ``func()`` in serial, +while ``ex2()`` and ``ex3()`` are able to parallelize their remote +function calls. + +Because we have 4 CPUs available on our machine, we can only able to +execute up to 4 remote functions in parallel. So, the fifth call to the +remote function in ``ex2()`` must wait until the first batch of ``func()`` +calls is finished. + +In ``ex3()``, because of the serial dependency on ``other_func()``, we +aren't even able to use all 4 of our cores to parallelize calls to ``func()``. +The time gaps between the ``func()`` blocks are a result of staggering the +calls to ``func()`` in between waiting 0.3 seconds for ``other_func()``. + +Also, notice that due to the aforementioned limitation of the Ray timeline, +``other_func()``, as a driver function and not a Ray task, is never +visualized on the Ray timeline. + +**For more on Ray's Web UI,** such as how to access the UI on a remote +node over ssh, or for troubleshooting installation, please see our +`Web UI documentation section`_. + +.. _`Web UI documentation section`: http://ray.readthedocs.io/en/latest/webui.html