This article goes beyond simulation to demonstrate the AI Metrics dashboard by comparing live traffic seen in two production AI clusters.
Cluster 1
This cluster consists of 250 GPUs connected via 100G ports to a single large switch. The results are largely consistent with the simulation from the original article. There is no Core Link Traffic because the cluster consists of a single switch. The Discards chart shows a burst of Out (egress) discards, and the Drop Reasons chart gives the reason as ingress_vlan_filter. The Total Traffic, Operations, Edge Link Traffic, and RDMA Operations charts all show a transient drop in throughput coincident with the discard spike. Further details of the dropped packets, such as source / destination address, operation, ingress / egress port, and QP pair, can be extracted from the sFlow Dropped Packet Notifications that populate the Drop Reasons chart, for example, using the browse-drops application packaged with the sflow/ai-metrics Docker image.
The Period chart indicates that the workload is periodic with a compute / exchange cycle of approximately 0.9 seconds.
A real-time trend of the cluster network traffic, polled every 100 ms, clearly shows the cyclic nature of the traffic indicated by the Period chart and confirms the reported 0.9 second period.
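To make the idea of confirming a period from polled traffic concrete, here is a minimal sketch that estimates a dominant period from evenly spaced samples using autocorrelation. It is an illustration only and is not how the AI Metrics dashboard computes its Period chart; the function name `estimate_period`, the `min_strength` threshold, and the synthetic input are all assumptions introduced for this example.

```python
def estimate_period(samples, interval_s, min_strength=0.5):
    """Estimate the dominant period (seconds) of evenly spaced samples.

    Returns None when no lag correlates strongly enough with the signal,
    i.e. when the traffic shows no clear repeating pattern.
    The min_strength threshold is an arbitrary choice for this sketch.
    """
    n = len(samples)
    mean = sum(samples) / n
    centered = [x - mean for x in samples]
    energy = sum(x * x for x in centered)
    if energy == 0:
        return None  # constant signal: nothing to lock onto

    # Normalized (biased) autocorrelation for lags 0 .. n/2 - 1.
    acf = [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / energy
        for lag in range(n // 2)
    ]

    # Skip the initial lobe around lag 0 by waiting for the first negative value.
    start = next((lag for lag, v in enumerate(acf) if v < 0), None)
    if start is None:
        return None

    # The strongest remaining peak is taken as the period.
    best_lag = max(range(start, len(acf)), key=lambda lag: acf[lag])
    if acf[best_lag] < min_strength:
        return None
    return best_lag * interval_s


# Synthetic example: 30 seconds of 100 ms samples with a 0.9 second cycle
# (3 samples of "exchange" traffic followed by 6 samples of "compute").
synthetic = [1.0 if i % 9 < 3 else 0.0 for i in range(300)]
print(estimate_period(synthetic, 0.1))
```

Feeding the function irregular traffic, where no lag exceeds the threshold, yields `None`, which mirrors a period display that cannot lock onto a repeating pattern.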
Cluster 2
This cluster consists of two 400G fixed configuration switches connected to 40 GPUs. In this case the traffic is much less regular than in the first example. RDMA operation sizes range from 500MB to over 3.5GB per transfer (in the previous example, all transfers were a consistent 7K bytes). The mix of RoCEv2 InfiniBand operations is also different, comprising RDMA_READ, RESYNC and ACK operations with a mixture of RD (Reliable Datagram) and RC (Reliable Connection) transports. In contrast, the previous example consisted only of RDMA_WRITE and ACK operations using RD transport.
An interesting point to note is the spike in the Discards chart coinciding with a burst in RC:RDMA_READ traffic. The network operating system running on these switches doesn't currently support sFlow Dropped Packet Notifications, so the Drop Reasons chart doesn't provide further detail (Note: the switch ASICs do have the required instrumentation, so a firmware update would be able to add the sFlow Dropped Packet Notifications feature).
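For readers unfamiliar with labels such as RC:RDMA_READ, the sketch below shows how a transport and operation can be derived from the 8-bit opcode in the InfiniBand Base Transport Header carried by RoCEv2 packets: the upper 3 bits select the transport and the lower 5 bits select the operation. The function names and the coarse `TRANSPORT:OPERATION` label format are assumptions modeled on the labels quoted in this article, not the dashboard's actual implementation, and the opcode values should be checked against the InfiniBand Architecture Specification before being relied on.

```python
# Transport type encoded in the upper 3 bits of the BTH opcode.
TRANSPORTS = {0b000: "RC", 0b001: "UC", 0b010: "RD", 0b011: "UD"}


def operation_name(op):
    """Map the lower 5 bits of the opcode to a coarse operation name.

    First/Middle/Last/Only variants are folded into one name; this
    grouping is an assumption made for readability in this sketch.
    """
    if 0x00 <= op <= 0x05:
        return "SEND"
    if 0x06 <= op <= 0x0B:
        return "RDMA_WRITE"
    if op == 0x0C:
        return "RDMA_READ"  # read request
    if 0x0D <= op <= 0x10:
        return "RDMA_READ_RESPONSE"
    named = {
        0x11: "ACK",
        0x12: "ATOMIC_ACK",
        0x13: "CMP_SWAP",
        0x14: "FETCH_ADD",
        0x15: "RESYNC",  # defined for the RD transport
    }
    return named.get(op, f"OP_0x{op:02X}")


def decode_bth_opcode(opcode):
    """Return a 'TRANSPORT:OPERATION' label for an 8-bit BTH opcode."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("BTH opcode must fit in 8 bits")
    transport = TRANSPORTS.get(opcode >> 5, f"T{opcode >> 5}")
    return f"{transport}:{operation_name(opcode & 0x1F)}"


# Example: opcodes matching the operations mentioned for the two clusters.
for code in (0x0C, 0x11, 0x4A, 0x55):
    print(hex(code), decode_bth_opcode(code))
```

Unknown transports and operations fall through to placeholder labels rather than raising, so unfamiliar opcodes remain visible instead of being silently dropped.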
In this example, the Period chart shows missing and irregular data.
In this case the real-time trend of network traffic shows no periodic structure, so the Period chart in the AI Metrics dashboard is unable to lock onto a repeating pattern.
Take a look at your own AI cluster network activity
AI Metrics gives step-by-step instructions to run the application in a production environment and integrate the metrics with back-end Prometheus / Grafana dashboards. The solution uses industry-standard sFlow instrumentation built into data center switches and can be deployed without any changes to the servers in the cluster.