Tuesday, October 6, 2020

Using Advanced Telemetry to Correlate GPU and Network Performance Issues

The image above was captured from the recent talk Using Advanced Telemetry to Correlate GPU and Network Performance Issues [A21870], presented at the NVIDIA GTC conference. The talk includes a demonstration of monitoring a high-performance GPU compute cluster in real time. The real-time dashboard provides an up-to-the-second view of key performance metrics for the cluster.

This diagram shows the elements of the GPU compute cluster that was demonstrated. Cumulus Linux running on the switches reduces operational complexity by allowing you to run the same Linux operating system on the network devices as on the compute servers. sFlow telemetry is generated by the open source Host sFlow agent that runs on the servers and the switches, using standard Linux APIs to enable instrumentation and gather measurements. On switches, the measurements are offloaded to the ASIC to provide line-rate monitoring.

Telemetry from all the switches and servers in the cluster is streamed to an sFlow-RT analyzer, which builds a real-time view of performance that can be used to drive operational dashboards and automation.
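
As a rough illustration of how a dashboard or automation script might pull values from the analyzer, the sketch below builds a query against the sFlow-RT REST API. It assumes an sFlow-RT instance listening on its default port 8008 and the /metric/{agents}/{metrics}/json endpoint. The helper names are hypothetical, and load_one is used only as an example of a standard host metric. The GPU metric names exported in the demonstration are not given in this article and should be looked up on the running analyzer.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def metric_url(base, metrics, agents="ALL"):
    """Build a metric query URL (hypothetical helper, not part of sFlow-RT).

    Assumes the /metric/{agents}/{metrics}/json endpoint; metric names may
    carry an aggregation prefix such as "max:" or "sum:".
    """
    names = ",".join(metrics)
    return (
        f"{base.rstrip('/')}/metric/"
        f"{quote(agents, safe=',')}/{quote(names, safe=',:')}/json"
    )


def fetch_metrics(base, metrics, agents="ALL", timeout=5.0):
    """Fetch and decode the JSON response for the requested metrics."""
    with urlopen(metric_url(base, metrics, agents), timeout=timeout) as resp:
        return json.load(resp)
```

For example, metric_url("http://localhost:8008", ["max:load_one"]) yields a URL that can be polled periodically to trend the highest one-minute load average reported across the cluster.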

The Real-time GPU and network telemetry dashboard combines measurements from all the devices to provide a view of cluster performance. Each of the three charts demonstrates a different type of measurement in the sFlow telemetry stream:
  1. GPU Utilization is based on sFlow's counter push mechanism, exporting NVIDIA Management Library (NVML) counters. This chart trends buffer, memory, and execution utilization of the GPUs in the cluster.
  2. Network Traffic is based on sFlow's random packet sampling mechanism, supported by the Linux kernel on servers, and offloaded to the Mellanox ASIC on the switches. This chart trends the top network flows crossing the network.
  3. Network Drops is based on sFlow's recently added dropped packet notification mechanism (see Using sFlow to monitor dropped packets). This chart trends dropped packet source and destination addresses and the reason the packet was dropped.
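
To make the packet sampling mechanism behind the Network Traffic chart more concrete, here is a minimal sketch of how top flows can be estimated from 1-in-N random samples: each sampled packet stands in for roughly N packets, so its size is scaled by the sampling rate before totals are ranked. The PacketSample type and top_flows function are hypothetical names for illustration and are not how sFlow-RT itself is implemented.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class PacketSample(NamedTuple):
    """One randomly sampled packet (hypothetical, simplified record)."""
    src: str            # source IP address
    dst: str            # destination IP address
    frame_bytes: int    # size of the sampled frame in bytes
    sampling_rate: int  # N, for 1-in-N random sampling


def top_flows(samples: Iterable[PacketSample], n: int = 5) -> List[Tuple[Tuple[str, str], int]]:
    """Estimate bytes per (src, dst) pair and return the n largest."""
    totals: Counter = Counter()
    for s in samples:
        # Scale each sample by the sampling rate to estimate total bytes.
        totals[(s.src, s.dst)] += s.frame_bytes * s.sampling_rate
    return totals.most_common(n)
```

The estimate is statistical, so its accuracy improves as more samples are collected for a flow. That is why sampling works well for identifying the largest flows.
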
The cluster is running a video transcoding workload in which video is streamed across the network to a GPU where it is transcoded, and the result returned. A normal transcoding task is shown on the left, where the charts show an increase in GPU and network activity and zero dropped packets. A failed transcoding task is shown in the middle. Here the GPU activity is low, there is no network activity, and there is a sequence of packets dropped by an access control list (ACL). Removing the ACL fixes the problem, which is confirmed by the new data shown on the right of the trend charts.

The sFlow data model integrates the three telemetry streams: counters, packet samples, and drop notifications. Each type of data is useful on its own, but together they provide the system-wide observability needed to drive automation.
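
The kind of reasoning shown in the transcoding example can be sketched as a simple rule over one time window that combines all three streams. Everything here is an illustrative assumption: the WindowSummary type, the diagnose function, the 10% idle threshold, and the "acl" reason string are hypothetical and are not taken from the talk or from sFlow-RT.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class WindowSummary:
    """Per-window summary of the three streams (hypothetical structure)."""
    gpu_utilization: float  # percent, from counter samples
    flow_bytes: int         # estimated bytes, from packet samples
    drops_by_reason: Dict[str, int] = field(default_factory=dict)  # from drop notifications


def diagnose(w: WindowSummary, gpu_idle_below: float = 10.0) -> str:
    """Classify a window using counters, flows, and drops together."""
    total_drops = sum(w.drops_by_reason.values())
    if total_drops == 0:
        if w.gpu_utilization >= gpu_idle_below and w.flow_bytes > 0:
            return "healthy: GPU and network active, no drops"
        return "idle: no drops"
    # Report the most frequent drop reason seen in the window.
    reason = max(w.drops_by_reason, key=w.drops_by_reason.get)
    if w.gpu_utilization < gpu_idle_below and w.flow_bytes == 0:
        return f"blocked: traffic is being dropped ({reason})"
    return f"degraded: drops ({reason})"
```

A window resembling the failed task (low GPU utilization, no flow bytes, drops attributed to an ACL) is classified as blocked, while the normal task is classified as healthy. No single stream would be enough to distinguish the two cases from an idle cluster.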
