Monday, October 20, 2025

AI / ML network performance metrics at scale

The charts above show information from a GPU cluster running an AI / ML training workload. The 244 nodes in the cluster are connected by 100G links to a single large switch. Industry-standard sFlow telemetry from the switch is shown in the two trend charts generated by the sFlow-RT real-time analytics engine. The charts are updated every 100 ms.
  • Per Link Telemetry shows RoCEv2 traffic on 5 randomly selected links from the cluster. Each trend is computed based on sFlow random packet samples collected on the link. The packet header in each sample is decoded and the metric is computed for packets identified as RoCEv2.
  • Combined Fabric-Wide Telemetry combines the signals from all the links to create a fabric-wide metric. The signals are highly correlated since the AI training compute / exchange cycle is synchronized across all compute nodes in the cluster. Combining data from all the links produces constructive interference: the shared traffic pattern reinforces itself while the uncorrelated noise in each individual signal tends to cancel, clearly showing the traffic pattern for the cluster.
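The two steps described in the bullets above can be sketched roughly as follows. This is a minimal simulation, not sFlow-RT's implementation: the sampling rate of 1-in-1000, the on/off traffic pattern, the Gaussian noise model, and the helper names (link_bps, simulate_link, combine, correlation) are all assumptions made for illustration. Only the 100 ms interval and the 244-link count come from the article.

```python
import random

SAMPLING_RATE = 1000  # assumed 1-in-N packet sampling rate (not stated in the article)
INTERVAL_S = 0.1      # chart update interval from the article (100 ms)


def link_bps(sampled_frame_bytes, sampling_rate=SAMPLING_RATE, interval_s=INTERVAL_S):
    """Scale the frame lengths of sampled packets up to an estimated bits/second."""
    return sum(sampled_frame_bytes) * sampling_rate * 8 / interval_s


def simulate_link(pattern, rng, noise=0.5):
    """Hypothetical link: the shared pattern plus independent per-link noise."""
    return [max(0.0, p + rng.gauss(0.0, noise)) for p in pattern]


def combine(links):
    """Sum the per-interval values across all links into one fabric-wide series."""
    return [sum(values) for values in zip(*links)]


def correlation(xs, ys):
    """Pearson correlation between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    rng = random.Random(1)
    pattern = [1.0, 1.0, 0.0, 0.0, 0.0] * 20  # assumed compute / exchange cycle
    links = [simulate_link(pattern, rng) for _ in range(244)]
    fabric = combine(links)
    print("single link vs pattern:", correlation(links[0], pattern))
    print("fabric-wide vs pattern:", correlation(fabric, pattern))
```

In this toy model the fabric-wide series tracks the underlying pattern far more closely than any single link does, which is the effect the combined chart illustrates.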
This is a relatively small cluster. For larger clusters, the effect is even more pronounced, resulting in extremely sharp cluster-wide metrics. The sFlow instrumentation embedded as a standard feature of data center switch hardware from all leading vendors (Arista, Cisco, Dell, Juniper, NVIDIA, etc.) provides a cost-effective solution for even the largest AI / ML fabrics.
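One way to see why larger clusters give sharper metrics is a simple signal-plus-noise model. The model is an assumption for illustration, not something stated in the article: each of the N links carries the same synchronized signal s(t) plus zero-mean noise n_i(t) with variance σ² that is independent across links.

```latex
\begin{align*}
x_i(t) &= s(t) + n_i(t), \qquad \mathbb{E}[n_i(t)] = 0, \quad \operatorname{Var}[n_i(t)] = \sigma^2 \\
\sum_{i=1}^{N} x_i(t) &= N\,s(t) + \sum_{i=1}^{N} n_i(t), \qquad \operatorname{Var}\!\Big[\sum_{i=1}^{N} n_i(t)\Big] = N\sigma^2 \\
\frac{N\,|s(t)|}{\sigma\sqrt{N}} &= \sqrt{N}\,\frac{|s(t)|}{\sigma}
\end{align*}
```

Under these assumptions the signal grows in proportion to N while the noise grows only as √N, so the signal-to-noise ratio of the combined metric improves by a factor of √N, roughly 15.6 for N = 244. Real link traffic is neither identical across links nor perfectly independent in its noise, so this should be read as a rough guide rather than a precise prediction.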
The Grafana AI Metrics dashboard shown above tracks performance metrics for AI/ML RoCEv2 network traffic, for example, large-scale CUDA compute tasks using NVIDIA Collective Communications Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.

The metrics include:

  • Total Traffic: Total traffic entering fabric
  • Operations: Total RoCEv2 operations broken out by type
  • Core Link Traffic: Histogram of load on fabric links
  • Edge Link Traffic: Histogram of load on access ports
  • RDMA Operations: Total RDMA operations
  • RDMA Bytes: Average RDMA operation size
  • Credits: Average number of credits in RoCEv2 acknowledgements
  • Period: Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
  • Congestion: Total ECN / CNP congestion messages
  • Errors: Total ingress / egress errors
  • Discards: Total ingress / egress discards
  • Drop Reasons: Packet drop reasons
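The article does not describe how the Period metric is detected. As a generic illustration only, a periodic compute / exchange cycle can be estimated from a traffic time series with autocorrelation. The function name detect_period and the on/off test pattern below are hypothetical; this is a sketch of one common technique, not the dashboard's method.

```python
def detect_period(series, interval_s):
    """Return the lag (in seconds) with the strongest positive autocorrelation.

    Returns None when the series is flat or no positive peak is found.
    """
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    energy = sum(c * c for c in centered)
    if energy == 0:
        return None  # flat series, no periodicity to detect

    best_lag, best_score = None, 0.0
    for lag in range(1, n // 2):
        # Biased estimate: longer lags use fewer terms, so multiples of the
        # true period score slightly lower than the period itself.
        score = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / energy
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag * interval_s if best_lag else None


if __name__ == "__main__":
    # Assumed pattern: 2 busy intervals then 3 idle, sampled every 100 ms.
    traffic = [1.0, 1.0, 0.0, 0.0, 0.0] * 20
    print("estimated period (s):", detect_period(traffic, 0.1))
```

A production implementation would likely also need to cope with noise, drift in the cycle length, and windows that hold only a few cycles.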
Enable sFlow in your AI / ML networks for detailed visibility into network performance. Instructions are provided in the articles: AI Metrics, AI Metrics with Prometheus and Grafana, and AI Metrics with Grafana Cloud.