Tuesday, April 1, 2025

Comparing AI / ML activity from two production networks

AI Metrics describes how to deploy the open source ai-metrics application. The application provides performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter. The screen capture from the article (above) shows results from a simulated 48,000 GPU cluster.

This article goes beyond simulation to demonstrate the AI Metrics dashboard by comparing live traffic seen in two production AI clusters.

Cluster 1

This cluster consists of 250 GPUs connected via 100G ports to a single large switch. The results are broadly consistent with the simulation from the original article. In this case there is no Core Link Traffic because the cluster consists of a single switch. The Discards chart shows a burst of Out (egress) discards and the Drop Reasons chart gives the reason as ingress_vlan_filter. The Total Traffic, Operations, Edge Link Traffic, and RDMA Operations charts all show a transient drop in throughput coincident with the discard spike. Further details of the dropped packets, such as source/destination address, operation, ingress/egress port, queue pair (QP), etc., can be extracted from the sFlow Dropped Packet Notifications that populate the Drop Reasons chart, for example, using the browse-drops application packaged with the sflow/ai-metrics Docker image.
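
Following the usual sFlow-RT application URL layout, with the default deployment described later in this article the browse-drops application would be reached at a URL along these lines (path assumed, check the collector's home page for the exact link):
http://localhost:8008/app/browse-drops/html/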

The Period chart indicates that the workload is periodic with a compute / exchange cycle of approximately 0.9 seconds.

A real-time trend of the cluster network traffic, polled every 100ms, clearly shows the cyclic nature of the traffic and confirms the 0.9 second period reported by the Period chart.
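
A similar high-frequency trend can be generated by polling the sFlow-RT REST API underlying the ai-metrics application. The loop below is a minimal sketch, assuming the default port 8008 mapping from the AI Metrics article below and using the standard sum:ifinoctets counter metric; substitute the host and metric for your deployment:
while :; do curl -s 'http://localhost:8008/metric/ALL/sum:ifinoctets/json'; echo; sleep 0.1; done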

Cluster 2

This cluster consists of two 400G fixed configuration switches connected to 40 GPUs. In this case the traffic is much less regular than in the first example. RDMA operation sizes vary from 500MB to over 3.5GB per transfer (in the previous example, all transfers were a consistent 7K bytes). The mix of RoCEv2 InfiniBand operations is also different, comprising RDMA_READ, RESYNC, and ACK operations with a mixture of RD (Reliable Datagram) and RC (Reliable Connection) transports. In contrast, the previous example consisted only of RDMA_WRITE and ACK operations using RD transport.

An interesting point to note is the spike in the Discards chart coinciding with a burst in RC:RDMA_READ traffic. In this case, the network operating system running on the switches doesn't currently support sFlow Dropped Packet Notifications, so the Drop Reasons chart doesn't provide further detail. (Note: the switch ASICs do have the required instrumentation, so a firmware update could add the sFlow Dropped Packet Notifications feature.)

In this example, the Period chart shows missing and irregular data.

In this case the real-time trend of network traffic shows no periodic structure, so the Period chart in the AI Metrics dashboard is unable to lock onto a repeating pattern.

Take a look at your own AI cluster network activity

AI Metrics gives step-by-step instructions to run the application in a production environment and integrate the metrics with back end Prometheus / Grafana dashboards. The solution utilizes industry standard sFlow instrumentation built into data center switches and can be deployed without any changes to the servers in the cluster.

Tuesday, March 4, 2025

Capture to pcap file using sflowtool


Replay pcap files using sflowtool describes how to capture sFlow datagrams using tcpdump and replay them in real time using sflowtool. However, using tcpdump for the capture has the downside of requiring root privileges. A recent update now makes it possible to use sflowtool itself to capture sFlow datagrams in tcpdump pcap format without the need for root access.
docker run --rm -p 6343:6343/udp sflow/sflowtool -M > sflow.pcap
Either compile the latest version of sflowtool or, as shown above, use Docker to run the pre-built sflow/sflowtool image. The -M option writes the received UDP datagrams to standard output in pcap format. In either case, type Control-C to end the capture.
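The resulting sflow.pcap file can later be replayed and analyzed using the steps in Replay pcap files using sflowtool (see below), for example:
docker run --rm -it -v $PWD/sflow.pcap:/sflow.pcap sflow/sflowtool \
  -r /sflow.pcap -P 1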

Saturday, February 1, 2025

AI Metrics

AI Metrics is available on GitHub. The application provides performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.

The dashboard shown above is from a simulated network of 1,000 switches, each with 48 access ports connected to a host. Activity occurs in a 256ms on / off cycle to emulate an AI learning run. The metrics include:

  • Total Traffic: Total traffic entering fabric
  • Operations: Total RoCEv2 operations broken out by type
  • Core Link Traffic: Histogram of load on fabric links
  • Edge Link Traffic: Histogram of load on access ports
  • RDMA Operations: Total RDMA operations
  • RDMA Bytes: Average RDMA operation size
  • Credits: Average number of credits in RoCEv2 acknowledgements
  • Period: Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
  • Congestion: Total ECN / CNP congestion messages
  • Errors: Total ingress / egress errors
  • Discards: Total ingress / egress discards
  • Drop Reasons: Packet drop reasons

Note: Clicking on peaks in the charts shows values at that time.

This article gives step-by-step instructions to run the AI Metrics application in a production environment and integrate the metrics with back end Prometheus / Grafana dashboards. Please try AI Metrics out and share your comments so that the set of metrics can be refined and extended to address operational requirements.

docker run -p 8008:8008 -p 6343:6343/udp sflow/ai-metrics
Use Docker to run the pre-built sflow/ai-metrics image and access the web interface on port 8008.
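The ai-metrics application is packaged with the sFlow-RT analytics engine (see Getting Started below), so a quick way to confirm the collector is up is to query its REST API, assuming the default port mapping above:
curl http://localhost:8008/version
The command should return the sFlow-RT version number.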
Enable sFlow on all switches in the cluster (leaf and spine) using the recommended settings. Enable sFlow dropped packet notifications to populate the Drop Reasons metric, see Dropped packet notifications with Arista Networks and NVIDIA Cumulus Linux 5.11 for AI / ML for examples.
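As an illustration, a minimal Arista EOS sFlow configuration takes the following form; the collector address 10.0.0.50 is a placeholder, and the sampling rate and polling interval should be taken from the recommended settings for your link speeds:
sflow sample 50000
sflow polling-interval 20
sflow destination 10.0.0.50
sflow run
Commands for enabling dropped packet notifications are vendor specific, see the articles referenced above.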

Note: Tuning Performance describes how to optimize settings for very large clusters.

Industry standard sFlow telemetry is uniquely suited to monitoring AI workloads. The sFlow agents leverage instrumentation built into switch ASICs to stream randomly sampled packet headers and metadata in real time. Sampling provides a scalable method of monitoring the large numbers of 400G/800G links found in AI fabrics. Export of packet headers allows the sFlow collector to decode the InfiniBand Base Transport headers to extract operations and RDMA metrics. The Dropped Packet extension uses Mirror-on-Drop (MoD) / What Just Happened (WJH) capabilities in the ASIC to include the packet header, location, and reason for every dropped packet in the fabric.

Talk to your switch vendor about their plans to support the Transit delay and queueing extension. This extension provides visibility into queue depth and switch transit delay using instrumentation built into the ASIC.

A network topology is required to generate the analytics, see Topology for a description of the JSON file and instructions for generating topologies from Graphviz DOT format, NVIDIA NetQ, Arista eAPI, and NetBox.
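As a minimal sketch, the topology is a JSON document listing inter-switch links; the example below uses hypothetical switch and port names and posts the topology to the collector using the sFlow-RT REST API (default port mapping assumed):
curl -X PUT -H 'Content-Type: application/json' --data-binary @- \
  http://localhost:8008/topology/json <<'EOF'
{
  "links": {
    "link1": {"node1":"leaf1","port1":"ethernet49","node2":"spine1","port2":"ethernet1"},
    "link2": {"node1":"leaf2","port1":"ethernet49","node2":"spine1","port2":"ethernet2"}
  }
}
EOF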
Use the Topology Status dashboard to verify that the topology is consistent with the sFlow telemetry and fully monitored. The Locate tab can be used to locate network addresses and identify the access switch ports where they are connected.

Note: If any gauges indicate an error, click on the gauge to get specific details.

Congratulations! The configuration is now complete and you should see charts like those at the top of this article in the AI Metrics application's Traffic tab.

The AI Metrics application exports the metrics shown above in Prometheus scrape format, see the Help tab for details. The Docker image also includes the Prometheus application that allows flow metrics to be created and extracted, see Flow metrics with Prometheus and Grafana.
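For example, a Prometheus scrape job for the application might look like the following; the metrics_path shown is a placeholder, substitute the exact path listed on the Help tab:
scrape_configs:
  - job_name: 'ai-metrics'
    metrics_path: '/path/listed/on/help/tab'   # placeholder, see the application's Help tab
    static_configs:
      - targets: ['localhost:8008']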

Getting Started provides an introduction to sFlow-RT, describes how to browse metrics and traffic flows using tools included in the Docker image, and links to information on creating applications using sFlow-RT APIs.

Thursday, January 30, 2025

Replay pcap files using sflowtool


It can be very useful to capture sFlow telemetry from production networks so that it can be replayed later to perform off-line analysis, or to develop or evaluate sFlow collection tools.
sudo tcpdump -i any -s 0 -w sflow.pcap udp port 6343
Run the command above on the system you are using to collect sFlow data (if you aren't yet collecting sFlow, see Agents for suggested configuration settings). Type Control-C to end the capture after 5 to 10 minutes.  Copy the resulting sflow.pcap file to your laptop.
docker run --rm -it -v $PWD/sflow.pcap:/sflow.pcap sflow/sflowtool \
  -r /sflow.pcap -P 1
Either compile the latest version of sflowtool or, as shown above, use Docker to run the pre-built sflow/sflowtool image. The -P (Playback) option replays the trace in real-time and displays the contents of each sFlow message. Running sflowtool using Docker provides additional examples, including converting the sFlow messages into JSON format for processing by a Python script. 
docker run --rm -it -v $PWD/sflow.pcap:/sflow.pcap sflow/sflowtool \
  -r /sflow.pcap -f 192.168.4.198/6343 -P 1
The -f (forwarding) option takes an IP address and UDP port number as arguments, in this case the laptop's address, 192.168.4.198, and the standard sFlow port, 6343. Use this option to send the sFlow stream to sFlow analytics software.
For example, Deploy real-time network dashboards using Docker compose describes how to quickly stand up an sFlow-RT, Prometheus, and Grafana analytics stack.