For example, the Heatmap above comes from a large high-performance compute cluster running a mixture of tasks. Traffic is concentrated along the diagonal, indicating that the job scheduler is packing related tasks into racks so that most traffic is confined within each rack.
Note: Live Dashboards links to a number of dashboards showing live traffic, including the Heatmap above.
The next Heatmap shows a very different traffic pattern: RoCEv2 traffic generated by GPUs performing an NCCL AllReduce/AllGather collective operation using a ring algorithm. During the collective operation, each GPU sends data to its immediate neighbor (modulo the number of GPUs) in a logical ring, resulting in two nearly continuous lines on either side of the diagonal: one for forward traffic, and the other for the return traffic associated with each flow.

The final example comes from a large data center hosting a mix of front-end workloads. Unlike the backend networks, this network combines internal (East/West) traffic with external (North/South) traffic flows. The internal traffic flows are contained in the central grid; the surrounding borders display external traffic. The full range of IP addresses (0.0.0.0 - 255.255.255.255) is displayed on the heatmap using a piecewise linear scaling function: a start and end address identifies internal traffic, which maps to values in the central grid, while addresses outside this range are scaled to fit in the border insets.

Representing the traffic matrix in the form of a heat map scales well to very large networks and provides real-time insight into shifting traffic patterns as workloads change. The industry-standard sFlow instrumentation in data center switches used to construct the traffic matrix also scales to the large number of switches and 400/800G port speeds found in AI/ML backend networks.
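The off-diagonal lines produced by the ring algorithm follow directly from its neighbor rule. A minimal sketch (the function name and dictionary representation are illustrative, not part of NCCL's API):

```python
def ring_neighbors(num_gpus):
    """For each GPU rank, return the rank it sends to in a logical ring.

    Each GPU sends data to its immediate neighbor, modulo the number of
    GPUs, so the traffic matrix shows a line just off the diagonal, with
    the wrap-around flow (last rank back to rank 0) in a corner. The
    return traffic for each flow produces the matching line on the other
    side of the diagonal.
    """
    return {rank: (rank + 1) % num_gpus for rank in range(num_gpus)}

print(ring_neighbors(4))  # {0: 1, 1: 2, 2: 3, 3: 0}
```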

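The piecewise linear scaling described above can be sketched as follows. This is a simplified illustration, not the actual implementation: the parameter names and the choice of inset widths (10% of each axis for the borders) are assumptions.

```python
import ipaddress

def scale_ip(addr, start, end, inner_lo=0.1, inner_hi=0.9):
    """Map an IPv4 address to a [0, 1] heatmap axis coordinate.

    Addresses between start and end (internal traffic) map linearly onto
    the central band [inner_lo, inner_hi]; addresses below or above the
    range are compressed into the lower or upper border inset. Inset
    widths are illustrative assumptions.
    """
    a = int(ipaddress.IPv4Address(addr))
    s = int(ipaddress.IPv4Address(start))
    e = int(ipaddress.IPv4Address(end))
    top = int(ipaddress.IPv4Address('255.255.255.255'))
    if a < s:
        # lower border inset: compress [0, start) into [0, inner_lo)
        return inner_lo * a / s
    if a > e:
        # upper border inset: compress (end, top] into (inner_hi, 1]
        return inner_hi + (1.0 - inner_hi) * (a - e) / (top - e)
    # central grid: map [start, end] linearly onto [inner_lo, inner_hi]
    return inner_lo + (inner_hi - inner_lo) * (a - s) / (e - s)
```

For example, with an internal range of 10.0.0.0 - 10.255.255.255, the address 10.0.0.0 maps to the lower edge of the central grid (0.1) and 8.8.8.8 falls in the lower border inset.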