diff --git a/doc/source/ray-dashboard.rst b/doc/source/ray-dashboard.rst
index 53e34aed8..7daffaf02 100644
--- a/doc/source/ray-dashboard.rst
+++ b/doc/source/ray-dashboard.rst
@@ -1,13 +1,14 @@
 Ray Dashboard
 =============
-Ray's built-in dashboard provides metrics, charts, and other features that help Ray users to understand Ray clusters and libraries.
+Ray's built-in dashboard provides metrics, charts, and other features that help
+Ray users to understand Ray clusters and libraries.
 
 Through the dashboard, you can
 
-- Understand worker and machine behaviors by providing resource usages and status of them.
-- Visualize the actor relationship and stats.
-- Kill actors and profile your Ray job.
-- See Tune jobs and trials information in a glance.
+- View cluster metrics.
+- Visualize the actor relationships and statistics.
+- Kill actors and profile your Ray jobs.
+- See Tune jobs and trial information.
 - Detect cluster anomalies and debug them.
 
 .. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/dashboard/Dashboard-overview.png
@@ -15,8 +16,8 @@ Through the dashboard, you can
 
 Getting Started
 ---------------
-you can access the dashboard through its default URL, **localhost:8265**.
-(Note that port number is increasing if the default port is not available).
+You can access the dashboard through its default URL, **localhost:8265**.
+(Note that the port number increases if the default port is not available).
 
 The URL is printed when ``ray.init()`` is called.
 
@@ -24,7 +25,8 @@ The URL is printed when ``ray.init()`` is called.
 
   INFO services.py:1093 -- View the Ray dashboard at localhost:8265
 
-The dashboard is also available when using the autoscaler. Read about how to `use the dashboard with the autoscaler `_.
+The dashboard is also available when using the autoscaler. Read about how to
+`use the dashboard with the autoscaler `_.
 
 Views
 -----
@@ -34,10 +36,10 @@ Views
 
 Machine View
 ~~~~~~~~~~~~
-The Machine View shows you:
+The machine view shows you:
 
-- System resources usage for each machine and worker. E.G. RAM, CPU, disk, and network usage information.
-- Logs and error messages at each machine and worker.
+- System resource usage for each machine and worker such as RAM, CPU, disk, and network usage information.
+- Logs and error messages for each machine and worker.
 - Actors or tasks assigned to each worker process.
 
 .. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/dashboard/Machine-view-basic.png
@@ -45,18 +47,18 @@ The Machine View shows you:
 
 Logical View
 ~~~~~~~~~~~~
-The Logical View shows you:
+The logical view shows you:
 
-- Created and killed actors
-- Actor stats such as actor status, number of executed tasks, pending tasks, and memory usage.
-- Actor hierarchy and dependency.
+- Created and killed actors.
+- Actor statistics such as actor status, number of executed tasks, pending tasks, and memory usage.
+- Actor hierarchy.
 
 .. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/dashboard/Logical-view-basic.png
     :align: center
 
 Ray Config
 ~~~~~~~~~~
-The Ray Config tab shows you the current autoscaler configuration.
+The ray config tab shows you the current autoscaler configuration.
 
 .. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/dashboard/Ray-config-basic.png
     :align: center
@@ -65,14 +67,14 @@ Tune
 ~~~~
 The Tune tab shows you:
 
-- Tune jobs and their status.
-- Hyperparameters each job is experimenting on.
+- Tune jobs and their statuses.
+- Hyperparameters for each job.
 
 .. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/dashboard/Tune-basic.png
     :align: center
 
-Advanced Usages
----------------
+Advanced Usage
+--------------
 
 Killing Actors
 ~~~~~~~~~~~~~~
@@ -85,8 +87,11 @@ Debugging a Blocked Actor
 ~~~~~~~~~~~~~~~~~~~~~~~~~
 You can find hanging actors through the Logical View tab.
 
-If creating an actor requires resources (e.g., CPUs, GPUs, or other custom resources) that are not currently available, the actor cannot be created until those resources are added to the cluster or become available.
-This can cause an application to hang. To alert you to this issue, infeasible tasks are shown in red in the dashboard, and pending tasks are shown in yellow.
+If creating an actor requires resources (e.g., CPUs, GPUs, or other custom resources)
+that are not currently available, the actor cannot be created until those resources are
+added to the cluster or become available. This can cause an application to hang. To alert
+you to this issue, infeasible tasks are shown in red in the dashboard, and pending tasks
+are shown in yellow.
 
 Below is an example.
 
@@ -113,22 +118,28 @@ Below is an example.
 .. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/dashboard/dashboard-pending-infeasible-actors.png
     :align: center
 
-This cluster has two GPUs, and so it only has room to create two copies of Actor1. As a result, the rest of Actor1 will be pending.
+This cluster has two GPUs, and so it only has room to create two copies of ``Actor1``.
+As a result, the rest of ``Actor1`` will be pending.
 
-You can also see it is infeasible to create Actor2 because it requires 4 GPUs which is bigger than the total gpus available in this cluster (2 GPUs).
+You can also see it is infeasible to create ``Actor2`` because it requires 4 GPUs which
+is bigger than the total gpus available in this cluster (2 GPUs).
 
 Inspect Memory Usage
 ~~~~~~~~~~~~~~~~~~~~
-You can detect local memory anomalies through the Logical View tab. If NumObjectIdsInScope, NumLocalObjects, or UsedLocalObjectMemory keeps growing without bound, it can lead to OOM errors or eviction of objectIDs that your program still wants to use.
+You can detect local memory anomalies through the Logical View tab. If NumObjectIdsInScope,
+NumLocalObjects, or UsedLocalObjectMemory keeps growing without bound, it can lead to out
+of memory errors or eviction of objectIDs that your program still wants to use.
 
 Profiling (Experimental)
 ~~~~~~~~~~~~~~~~~~~~~~~~
-Use profiling features when you want to find bottleneck of your Ray applications.
+Use profiling features when you want to find bottlenecks in your Ray applications.
 
 .. image:: https://raw.githubusercontent.com/ray-project/images/master/docs/dashboard/dashboard-profiling-buttons.png
     :align: center
 
-Clicking one of the profiling buttons on the dashboard launches py-spy, which will profile your actor process for the given duration. Once the profiling has been done, you can click the "profiling result" button to visualize the profiling information as a flamegraph.
+Clicking one of the profiling buttons on the dashboard launches py-spy, which will profile
+your actor process for the given duration. Once the profiling has been done, you can click the "profiling result" button to visualize the profiling information as a flamegraph.
+
 This visualization can help reveal computational bottlenecks.
 
 .. note::
@@ -147,12 +158,14 @@ References
 
 Machine View
 ~~~~~~~~~~~~
-**Machine/Worker Hierarchy**: The dashboard visualizes hierarchical relationship of workers (processes) and machines (nodes). Each host consists of many workers, and you can see them by clicking a + button.
+**Machine/Worker Hierarchy**: The dashboard visualizes hierarchical relationship of
+workers (processes) and machines (nodes). Each host consists of many workers, and
+you can see them by clicking the + button.
 
 .. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/dashboard/Machine-view-reference-1.png
     :align: center
 
-You can hide it again by clicking a - button.
+You can hide it again by clicking the - button.
 
 .. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/dashboard/Machine-view-reference-2.png
     :align: center
@@ -168,7 +181,7 @@ For example, when a Ray cluster is configured with 4 cores, ``ray.init(num_cpus=
 .. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/dashboard/resource-allocation-row-configured-1.png
     :align: center
 
-When you spawn a new actor that uses 1 cpu, you can see this will be (CPU: 1/4).
+When you spawn a new actor that uses 1 CPU, you can see this will be (CPU: 1/4).
 
 Below is an example.
 
@@ -189,7 +202,8 @@ Below is an example.
 
 **Host**: If it is a node, it shows host information. If it is a worker, it shows a pid.
 
-**Workers**: If it is a node, it shows a number of workers and virtual cores. Note that number of workers can exceed number of cores.
+**Workers**: If it is a node, it shows a number of workers and virtual cores.
+Note that number of workers can exceed number of cores.
 
 **Uptime**: Uptime of each worker and process.
 
@@ -222,23 +236,32 @@ Logical View (Experimental)
 
 **Excuted**: A number of executed tasks for this actor.
 
-**NumObjectIdsInScope**: Number of object IDs in scope for this actor. Object IDs in scope will not be evicted unless object stores are full.
+**NumObjectIdsInScope**: Number of object IDs in scope for this actor. object IDs
+in scope will not be evicted unless object stores are full.
 
-**NumLocalObjects**: Number of objectIDs that are in this actor's local memory. Only big objects (>100KB) are residing in plasma object stores, and other small objects are staying in local memory.
+**NumLocalObjects**: Number of object IDs that are in this actor's local memory.
+Only big objects (>100KB) are residing in plasma object stores, and other small
+objects are staying in local memory.
 
 **UsedLocalObjectMemory**: Used memory used by local objects.
 
 **kill actor**: A button to kill an actor in a cluster. It is corresponding to ``ray.kill``.
 
-**profile for**: A button to run profiling. We currently support 10s, 30s and 60s profiling. It requires passwordless ``sudo``.
+**profile for**: A button to run profiling. We currently support profiling for 10s,
+30s and 60s. It requires passwordless ``sudo``.
 
-**Infeasible Actor Creation**: Actor creation is infeasible when an actor requires more resources than a Ray cluster can provide. This is depicted as a red colored actor.
+**Infeasible Actor Creation**: Actor creation is infeasible when an actor
+requires more resources than a Ray cluster can provide. This is depicted
+as a red colored actor.
 
-**Pending Actor Creation**: Actor creation is pending when there is no available resource for this actor because it is already taken by other tasks / actors. This is depicted as a yellow colored actor.
+**Pending Actor Creation**: Actor creation is pending when there are no
+available resources for this actor because they are already taken by other
+tasks and actors. This is depicted as a yellow colored actor.
 
 **Actor Hierarchy**: The logical view renders actor information in a tree format.
 
-To illustrate this, in the code block below, the ``Parent`` actor creates two ``Child`` actors and each ``Child`` actor creates one ``GrandChild`` actor.
+To illustrate this, in the code block below, the ``Parent`` actor creates
+two ``Child`` actors and each ``Child`` actor creates one ``GrandChild`` actor.
 This relationship is visible in the dashboard *Logical View* tab.
 
 .. code-block:: python
@@ -270,7 +293,8 @@ You can see that the dashboard shows the parent/child relationship as expected.
 
 Ray Config
 ~~~~~~~~~~~~
-Configuration defined at ``cluster.yaml`` for the autoscaler mode. See `Cluster.yaml reference `_ for more details.
+If you are using the autoscaler, this Configuration defined at ``cluster.yaml`` is shown.
+See `Cluster.yaml reference `_ for more details.
 
 Tune (Experimental)
 ~~~~~~~~~~~~~~~~~~~
@@ -282,4 +306,5 @@ Tune (Experimental)
 
 **Start Time**: Start time of each trial.
 
-**Hyperparameters**: There are many hyperparameter users specify. All of values will be visible at the dashboard.
+**Hyperparameters**: There are many hyperparameter users specify. All of values will
+be visible at the dashboard.