
Monday, February 23, 2026

Real-time visualization of AI / ML traffic matrix

Heatmap is available on GitHub. The application provides a real-time traffic matrix visualization of end-to-end traffic flowing across an Ethernet fabric. Each axis represents an ordered list of network addresses. The x-axis is a flow source and the y-axis is a flow destination.

For example, the Heatmap above comes from a large high performance compute cluster running a mixture of tasks. Traffic is concentrated along the diagonal, indicating that the job scheduler is packing related tasks in racks so that most traffic is confined to the rack.

Note: Live Dashboards links to a number of dashboards showing live traffic, including the Heatmap above.

The next Heatmap shows a very different traffic pattern. In this case, the traffic is RoCEv2 traffic generated by GPUs performing an NCCL AllReduce/AllGather collective operation using a ring algorithm. During the collective operation, each GPU sends data to its immediate neighbor (modulo the number of GPUs) in a logical ring, resulting in two nearly continuous lines on either side of the diagonal: one for forward traffic, and the other for return traffic associated with each flow.
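
The two lines beside the diagonal follow directly from the ring exchange pattern. As a minimal illustration (not code from the Heatmap application), the following Python sketch builds the idealized traffic matrix for a ring of GPUs; the 5% return-traffic ratio is an arbitrary assumption:

import numpy as np

def ring_traffic_matrix(n_gpus, bytes_per_step=1.0):
    """Idealized traffic matrix for a ring collective: GPU i sends to
    GPU (i+1) mod n, producing a line above the diagonal, while the
    return traffic for each flow produces a line below it."""
    m = np.zeros((n_gpus, n_gpus))
    for i in range(n_gpus):
        fwd = (i + 1) % n_gpus
        m[i, fwd] += bytes_per_step         # forward data
        m[fwd, i] += 0.05 * bytes_per_step  # return / ACK traffic (assumed ratio)
    return m

print(ring_traffic_matrix(8))
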
The final example comes from a large data center hosting a mix of front end workloads. Unlike the backend networks, this network combines internal (East/West) traffic with external (North/South) traffic flows. The internal traffic flows are contained in the central grid. The surrounding borders display external traffic.
The full range of IP addresses (0.0.0.0 - 255.255.255.255) is displayed on the heatmap using a piecewise linear scaling function. A start and end address identify the internal range, which maps to values in the central grid; addresses outside this range are scaled to fit in the border insets.
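
The article doesn't reproduce the exact scaling function, but a minimal sketch of such a piecewise linear mapping, assuming a configurable internal range and a fixed border fraction, could look like this:

import ipaddress

def scale(addr, internal_start, internal_end, border=0.1):
    """Map an IPv4 address to a [0,1] heatmap coordinate: the internal
    range fills the central grid and out-of-range addresses are
    compressed into the border insets."""
    a = int(ipaddress.IPv4Address(addr))
    lo = int(ipaddress.IPv4Address(internal_start))
    hi = int(ipaddress.IPv4Address(internal_end))
    if a < lo:  # below the internal range: left / top border inset
        return border * a / lo
    if a > hi:  # above the internal range: right / bottom border inset
        return 1 - border + border * (a - hi) / (2**32 - 1 - hi)
    return border + (1 - 2 * border) * (a - lo) / (hi - lo)  # central grid

print(scale('10.0.128.9', '10.0.0.0', '10.0.255.255'))
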

Representing the traffic matrix in the form of a heat map scales well to very large networks and provides real-time insight into shifting traffic patterns as workloads change. The industry standard sFlow instrumentation in data center switches used to construct the traffic matrix also scales to the large number of switches and 400/800G port speeds found in AI/ML backend networks.

Monday, October 20, 2025

AI / ML network performance metrics at scale

The charts above show information from a GPU cluster running an AI / ML training workload. The 244 nodes in the cluster are connected by 100G links to a single large switch. Industry standard sFlow telemetry from the switch is shown in the two trend charts generated by the sFlow-RT real-time analytics engine. The charts are updated every 100ms.
  • Per Link Telemetry shows RoCEv2 traffic on 5 randomly selected links from the cluster. Each trend is computed based on sFlow random packet samples collected on the link. The packet header in each sample is decoded and the metric is computed for packets identified as RoCEv2.
  • Combined Fabric-Wide Telemetry combines the signals from all the links to create a fabric wide metric. The signals are highly correlated since the AI training compute / exchange cycle is synchronized across all compute nodes in the cluster. Constructive interference from combining data from all the links removes the noise in each individual signal and clearly shows the traffic pattern for the cluster.
This is a relatively small cluster. For larger clusters, the effect is even more pronounced, resulting in extremely sharp cluster-wide metrics. The sFlow instrumentation embedded as a standard feature of data center switch hardware from all leading vendors (Arista, Cisco, Dell, Juniper, NVIDIA, etc.) provides a cost effective solution for even the largest AI / ML fabrics.
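A small numpy sketch (illustrative only, with made-up noise and cycle parameters) shows why summing correlated per-link samples sharpens the fabric-wide signal: the shared cycle grows linearly with the number of links while the independent sampling noise only grows as its square root.

import numpy as np

rng = np.random.default_rng(1)
t = np.arange(0, 10, 0.1)                  # 100ms bins over 10 seconds
cycle = (np.sin(2 * np.pi * t / 0.5) > 0)  # shared 0.5s compute / exchange cycle

# each link sees the same cycle plus heavy, independent sampling noise
links = [cycle + rng.normal(0, 2, t.size) for _ in range(244)]
fabric = np.sum(links, axis=0)

# signal-to-noise ratio improves by roughly sqrt(244) when combined
print('per-link SNR:', cycle.std() / 2)
print('fabric SNR  :', 244 * cycle.std() / (2 * np.sqrt(244)))
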
The Grafana AI Metrics dashboard shown above tracks performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.

The metrics include:

  • Total Traffic: Total traffic entering fabric
  • Operations: Total RoCEv2 operations broken out by type
  • Core Link Traffic: Histogram of load on fabric links
  • Edge Link Traffic: Histogram of load on access ports
  • RDMA Operations: Total RDMA operations
  • RDMA Bytes: Average RDMA operation size
  • Credits: Average number of credits in RoCEv2 acknowledgements
  • Period: Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
  • Congestion: Total ECN / CNP congestion messages
  • Errors: Total ingress / egress errors
  • Discards: Total ingress / egress discards
  • Drop Reasons: Packet drop reasons
Enable sFlow in your AI / ML networks for detailed visibility into network performance. Instructions are provided in the articles: AI Metrics, AI Metrics with Prometheus and Grafana, and AI Metrics with Grafana Cloud.

Wednesday, September 10, 2025

Packet trimming

The latest version of the AI Metrics dashboard uses industry standard sFlow telemetry from network switches to monitor the number of trimmed packets as a congestion metric.

Ultra Ethernet Specification Update describes how the Ultra Ethernet Transport (UET) Protocol has the ability to leverage optional “packet trimming” in network switches, which allows packets to be truncated rather than dropped in the fabric during congestion events. As packet spraying causes reordering, it becomes more complicated to detect loss. Packet trimming gives the receiver and the sender an early explicit indication of congestion, allowing immediate loss recovery in spite of reordering, and is a critical feature in the low-RTT environments where UET is designed to operate.

cumulus@switch:~$ nv set system forwarding packet-trim profile packet-trim-default
cumulus@switch:~$ nv config apply

NVIDIA Cumulus Linux release 5.14 for NVIDIA Spectrum Ethernet switches includes support for packet trimming. The commands above enable packet trimming, set the DSCP remark value to 11, set the truncation size to 256 bytes, set the switch priority to 4, and make all ports on the switch eligible for trimming on traffic classes 1, 2, and 3. NVIDIA BlueField host adapters respond to trimmed packets to ensure fast congestion recovery.
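
Since trimmed packets carry the DSCP remark value (11 in this profile), one way to track them is to define an sFlow-RT flow that matches on the remark. The sketch below is an assumption on my part, not a feature of the dashboard: it uses sFlow-RT's REST flow API and assumes the ipdscp packet attribute corresponds to the trim remark value:

import requests

RT = 'http://localhost:8008'  # assumed sFlow-RT address

# count sampled packets carrying the packet-trim DSCP remark (11),
# broken out by source and destination address
requests.put(RT + '/flow/trimmed/json', json={
    'keys': 'ipsource,ipdestination',
    'value': 'frames',
    'filter': 'ipdscp=11'
})

# list the most active trimmed-packet flows
print(requests.get(RT + '/activeflows/ALL/trimmed/json').json())
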

Instructions for deploying the Grafana AI Metrics dashboard are provided in the articles: AI Metrics, AI Metrics with Prometheus and Grafana, and AI Metrics with Grafana Cloud.

Thursday, July 10, 2025

AI Metrics with InfluxDB Cloud

The InfluxDB AI Metrics dashboard shown above tracks performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.

The metrics include:

  • Total Traffic: Total traffic entering fabric
  • Operations: Total RoCEv2 operations broken out by type
  • Core Link Traffic: Histogram of load on fabric links
  • Edge Link Traffic: Histogram of load on access ports
  • RDMA Operations: Total RDMA operations
  • RDMA Bytes: Average RDMA operation size
  • Credits: Average number of credits in RoCEv2 acknowledgements
  • Period: Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
  • Congestion: Total ECN / CNP congestion messages
  • Errors: Total ingress / egress errors
  • Discards: Total ingress / egress discards
  • Drop Reasons: Packet drop reasons
This article shows how to integrate with InfluxDB Cloud instead of running the services locally.

Note: InfluxDB Cloud has a free service tier that can be used to test this example.

Save the following compose.yml file on a system running Docker.

configs:
  config.telegraf:
    content: |
      [agent]
        interval = '15s'
        round_interval = true
        omit_hostname = true
      [[outputs.influxdb_v2]]
        urls = ['https://<INFLUXDB_CLOUD_INSTANCE>.cloud2.influxdata.com']
        token = '<INFLUXDB_CLOUD_TOKEN>'
        organization = '<INFLUXDB_CLOUD_USER>'
        bucket = 'sflow'
      [[inputs.prometheus]]
        urls = ['http://sflow-rt:8008/app/ai-metrics/scripts/metrics.js/prometheus/txt']
        metric_version = 1

networks:

  monitoring:
    driver: bridge

services:

  sflow-rt:
    image: sflow/ai-metrics
    container_name: sflow-rt
    restart: unless-stopped
    ports:
      - '6343:6343/udp'
      - '8008:8008'
    networks:
      - monitoring

  telegraf:
    image: telegraf:alpine
    container_name: telegraf
    restart: unless-stopped
    configs:
      - source: config.telegraf
        target: /etc/telegraf/telegraf.conf
    depends_on:
      - sflow-rt
    networks:
      - monitoring
Use the Load Data menu to create an sflow bucket, create an API TOKEN to upload data, and find the TELEGRAF INFLUXDB OUTPUT PLUGIN settings. Navigate to the Dashboards menu and create a new dashboard by importing ai_metrics.json.
Edit the highlighted outputs.influxdb_v2 Telegraf settings (INFLUXDB_CLOUD_INSTANCE, INFLUXDB_CLOUD_TOKEN and INFLUXDB_CLOUD_USER) to match those provided by InfluxDB Cloud.
docker compose up -d
Run the command above to start streaming metrics to InfluxDB Cloud.
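
Before looking for data in InfluxDB, it can be worth confirming that sFlow-RT is serving metrics at the URL Telegraf scrapes. A quick check, assuming the compose file above is running locally:

import requests

url = 'http://localhost:8008/app/ai-metrics/scripts/metrics.js/prometheus/txt'
r = requests.get(url)
r.raise_for_status()

# print the exported metrics, skipping Prometheus comment lines
for line in r.text.splitlines():
    if line and not line.startswith('#'):
        print(line)
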
Enable sFlow on all switches in the cluster (leaf and spine) using the recommended settings. Enable sFlow dropped packet notifications to populate the drop reasons metric, see Dropped packet notifications with Arista Networks, NVIDIA Cumulus Linux 5.11 for AI / ML, and Dropped packet notifications with Cisco 8000 Series Routers for examples.

Note: Tuning Performance describes how to optimize settings for very large clusters.

Industry standard sFlow telemetry is uniquely suited to monitoring AI workloads. The sFlow agents leverage instrumentation built into switch ASICs to stream randomly sampled packet headers and metadata in real-time. Sampling provides a scalable method of monitoring the large numbers of 400G/800G links found in AI fabrics. Export of packet headers allows the sFlow collector to decode the InfiniBand Base Transport headers to extract operations and RDMA metrics. The Dropped Packet extension uses Mirror-on-Drop (MoD) / What Just Happened (WJH) capabilities in the ASIC to include packet header, location, and reason for EVERY dropped packet in the fabric.

Talk to your switch vendor about their plans to support the Transit delay and queueing extension. This extension provides visibility into queue depth and switch transit delay using instrumentation built into the ASIC.

A network topology is required to generate the analytics, see Topology for a description of the JSON file and instructions for generating topologies from Graphviz DOT format, NVIDIA NetQ, Arista eAPI, and NetBox.
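
Once a topology file has been generated, it can be pushed to sFlow-RT's REST API. A minimal sketch, assuming a local sFlow-RT instance; the single-link topology below is illustrative only (see the Topology article for the full schema):

import requests

# illustrative single-link topology: leaf1 eth1 <-> spine1 eth1
topology = {
    'links': {
        'link1': {
            'node1': 'leaf1', 'port1': 'eth1',
            'node2': 'spine1', 'port2': 'eth1'
        }
    }
}

r = requests.put('http://localhost:8008/topology/json', json=topology)
r.raise_for_status()
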
Use the Topology Status dashboard to verify that the topology is consistent with the sFlow telemetry and fully monitored. The Locate tab can be used to locate network addresses to access switch ports.

Note: If any gauges indicate an error, click on the gauge to get specific details.

Congratulations! The configuration is now complete and you should see the charts above in the AI Metrics application Traffic tab. In addition, the AI Metrics dashboard at the top of this page should start to populate with data.

Getting Started provides an introduction to sFlow-RT, describes how to browse metrics and traffic flows using tools included in the Docker image, and links to information on creating applications using sFlow-RT APIs.

Saturday, June 14, 2025

AI network performance monitoring using containerlab

AI Metrics is available on GitHub. The application provides performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.

The screen capture is from a containerlab topology that emulates an AI compute cluster connected by a leaf and spine network. The metrics include:

  • Total Traffic: Total traffic entering fabric
  • Operations: Total RoCEv2 operations broken out by type
  • Core Link Traffic: Histogram of load on fabric links
  • Edge Link Traffic: Histogram of load on access ports
  • RDMA Operations: Total RDMA operations
  • RDMA Bytes: Average RDMA operation size
  • Credits: Average number of credits in RoCEv2 acknowledgements
  • Period: Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
  • Congestion: Total ECN / CNP congestion messages
  • Errors: Total ingress / egress errors
  • Discards: Total ingress / egress discards
  • Drop Reasons: Packet drop reasons

Note: Clicking on peaks in the charts shows values at that time.

This article gives step-by-step instructions to run the demonstration.

git clone https://github.com/sflow-rt/containerlab.git
cd containerlab
./run-clab
Run the above commands to download the sflow-rt/containerlab GitHub project and run Containerlab on a system with Docker installed. Docker Desktop is a convenient way to run the labs on a laptop.
containerlab deploy -t rocev2.yml

Start the 3 stage leaf and spine emulation.

The initial launch may take a couple of minutes as the container images are downloaded for the first time. Once the images are downloaded, the topology deploys in a few seconds.
./topo.py clab-rocev2
Run the command above to send the topology to the AI Metrics application and connect to http://localhost:8008/app/ai-metrics/html/ to access the dashboard shown at the top of this article.
docker exec -it clab-rocev2-h1 hping3 exec rocev2.tcl 172.16.2.2 10000 500 100
Run the command above to simulate RoCEv2 traffic between h1 and h2 during a training run. The run consists of 100 cycles; each cycle involves an exchange of 10000 packets followed by a 500 millisecond pause (simulating GPU compute time). You should immediately see charts updating in the dashboard.
Clicking on the wave symbol in the top right of the Period chart pulls up a fast-updating Periodicity chart that shows how sub-second variability in the traffic due to the exchange / compute cycles is used to compute the period shown in the dashboard, in this case a period of just under 0.7 seconds.
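
The article doesn't describe the period detection algorithm itself; a plausible autocorrelation-based sketch, assuming a traffic series in 100ms bins, might look like the following (the 0.2 peak threshold is an arbitrary choice):

import numpy as np

def detect_period(series, bin_seconds=0.1):
    """Estimate the dominant period of a traffic series from the first
    significant peak of its autocorrelation after lag zero."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    ac = np.correlate(x, x, mode='full')[x.size - 1:]
    ac /= ac[0]  # normalize so lag 0 == 1
    for lag in range(1, ac.size - 1):
        if ac[lag] > ac[lag - 1] and ac[lag] > ac[lag + 1] and ac[lag] > 0.2:
            return lag * bin_seconds
    return None

# synthetic 0.7s on / off exchange pattern sampled every 100ms
t = np.arange(0, 30, 0.1)
traffic = (np.sin(2 * np.pi * t / 0.7) > 0).astype(float)
print(detect_period(traffic))  # ~0.7 seconds
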
Connect to the included sflow-rt/trace-flow application, http://localhost:8008/app/trace-flow/html/
suffix:stack:.:1=ibbt
Enter the filter above and press Submit. The instant a RoCEv2 flow is generated, its path should be shown. See Defining Flows for information on traffic filters.
docker exec -it clab-rocev2-leaf1 vtysh
The routers in this topology run the open source FRRouting daemon used in data center switches with NVIDIA Cumulus Linux or SONiC network operating systems. Type the command above to access the CLI on the leaf1 switch.
leaf1# show running-config
For example, type the above command in the leaf1 CLI to see the running configuration.
Building configuration...

Current configuration:
!
frr version 10.2.3_git
frr defaults datacenter
hostname leaf1
log stdout
!
interface eth3
 ip address 172.16.1.1/24
 ipv6 address 2001:172:16:1::1/64
exit
!
router bgp 65001
 bgp bestpath as-path multipath-relax
 bgp bestpath compare-routerid
 neighbor fabric peer-group
 neighbor fabric remote-as external
 neighbor fabric description Internal Fabric Network
 neighbor fabric capability extended-nexthop
 neighbor eth1 interface peer-group fabric
 neighbor eth2 interface peer-group fabric
 !
 address-family ipv4 unicast
  redistribute connected route-map HOST_ROUTES
 exit-address-family
 !
 address-family ipv6 unicast
  redistribute connected route-map HOST_ROUTES
  neighbor fabric activate
 exit-address-family
exit
!
route-map HOST_ROUTES permit 10
 match interface eth3
exit
!
ip nht resolve-via-default
!
end

The switch is using BGP to establish equal cost multi-path (ECMP) routes across the fabric - this is very similar to a configuration you would expect to find in a production network. Containerlab provides a great environment to experiment with topologies and configuration options before putting them in production.

The results from this containerlab project are very similar to those observed in production networks, see Comparing AI / ML activity from two production networks. Follow the instructions in AI Metrics with Prometheus and Grafana to deploy this solution to monitor production GPU cluster traffic.

Monday, June 9, 2025

AI Metrics with Grafana Cloud

The Grafana AI Metrics dashboard shown above tracks performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.

The metrics include:

  • Total Traffic: Total traffic entering fabric
  • Operations: Total RoCEv2 operations broken out by type
  • Core Link Traffic: Histogram of load on fabric links
  • Edge Link Traffic: Histogram of load on access ports
  • RDMA Operations: Total RDMA operations
  • RDMA Bytes: Average RDMA operation size
  • Credits: Average number of credits in RoCEv2 acknowledgements
  • Period: Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
  • Congestion: Total ECN / CNP congestion messages
  • Errors: Total ingress / egress errors
  • Discards: Total ingress / egress discards
  • Drop Reasons: Packet drop reasons
AI Metrics with Prometheus and Grafana describes how to stand up an analytics stack with Prometheus and Grafana to track performance metrics for an AI/ML GPU cluster. This article shows how to integrate with Prometheus and Grafana hosted in the cloud, Grafana Cloud, instead of running the services locally.

Note: Grafana Cloud has a free service tier that can be used to test this example.

Save the following compose.yml file on a system running Docker.

configs:
  config.alloy:
    content: |
      prometheus.scrape "prometheus" {
        targets = [{
          __address__ = "sflow-rt:8008",
        }]
        forward_to   = [prometheus.remote_write.grafanacloud.receiver]
        metrics_path = "/app/ai-metrics/scripts/metrics.js/prometheus/txt"
        scrape_interval = "10s"
      }

      prometheus.remote_write "grafanacloud" {
        endpoint {
          url  = "https://<Your Grafana Cloud Prometheus Instance>/api/prom/push"
          basic_auth {
            username = "<Your Grafana.com User ID>"
            password = "<Your Grafana.com API Token>"
          }
        }
      }

networks:

  monitoring:
    driver: bridge

services:

  sflow-rt:
    image: sflow/ai-metrics
    container_name: sflow-rt
    restart: unless-stopped
    ports:
      - '6343:6343/udp'
      - '8008:8008'
    networks:
      - monitoring

  alloy:
    image: grafana/alloy
    container_name: alloy
    restart: unless-stopped
    configs:
      - source: config.alloy
        target: /etc/alloy/config.alloy
    depends_on:
      - sflow-rt
    networks:
      - monitoring
Find the settings needed to upload metrics to Prometheus by clicking on the Send Metrics button in your Grafana Cloud account.
Edit the highlighted prometheus.remote_write endpoint settings (url, username and password) to match those provided by Grafana Cloud.
docker compose up -d
Run the command above to start streaming metrics to Grafana Cloud. Click on the Grafana Launch button to access Grafana and add the AI Metrics dashboard (ID: 23255).
Enable sFlow on all switches in the cluster (leaf and spine) using the recommended settings. Enable sFlow dropped packet notifications to populate the drop reasons metric, see Dropped packet notifications with Arista Networks, NVIDIA Cumulus Linux 5.11 for AI / ML, and Dropped packet notifications with Cisco 8000 Series Routers for examples.

Note: Tuning Performance describes how to optimize settings for very large clusters.

Industry standard sFlow telemetry is uniquely suited to monitoring AI workloads. The sFlow agents leverage instrumentation built into switch ASICs to stream randomly sampled packet headers and metadata in real-time. Sampling provides a scalable method of monitoring the large numbers of 400G/800G links found in AI fabrics. Export of packet headers allows the sFlow collector to decode the InfiniBand Base Transport headers to extract operations and RDMA metrics. The Dropped Packet extension uses Mirror-on-Drop (MoD) / What Just Happened (WJH) capabilities in the ASIC to include packet header, location, and reason for EVERY dropped packet in the fabric.

Talk to your switch vendor about their plans to support the Transit delay and queueing extension. This extension provides visibility into queue depth and switch transit delay using instrumentation built into the ASIC.

A network topology is required to generate the analytics, see Topology for a description of the JSON file and instructions for generating topologies from Graphviz DOT format, NVIDIA NetQ, Arista eAPI, and NetBox.
Use the Topology Status dashboard to verify that the topology is consistent with the sFlow telemetry and fully monitored. The Locate tab can be used to locate network addresses to access switch ports.

Note: If any gauges indicate an error, click on the gauge to get specific details.

Congratulations! The configuration is now complete and you should see the charts above in the AI Metrics application Traffic tab. In addition, the AI Metrics Grafana dashboard at the top of this page should start to populate with data.

Flow metrics with Prometheus and Grafana and Dropped packet metrics with Prometheus and Grafana describe how to define additional flow-based metrics to incorporate in Grafana dashboards.

Getting Started provides an introduction to sFlow-RT, describes how to browse metrics and traffic flows using tools included in the Docker image, and links to information on creating applications using sFlow-RT APIs.

Monday, April 21, 2025

AI Metrics with Prometheus and Grafana

The Grafana AI Metrics dashboard shown above tracks performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.

The metrics include:

  • Total Traffic: Total traffic entering fabric
  • Operations: Total RoCEv2 operations broken out by type
  • Core Link Traffic: Histogram of load on fabric links
  • Edge Link Traffic: Histogram of load on access ports
  • RDMA Operations: Total RDMA operations
  • RDMA Bytes: Average RDMA operation size
  • Credits: Average number of credits in RoCEv2 acknowledgements
  • Period: Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
  • Congestion: Total ECN / CNP congestion messages
  • Errors: Total ingress / egress errors
  • Discards: Total ingress / egress discards
  • Drop Reasons: Packet drop reasons

This article gives step-by-step instructions to set up the dashboard in a production environment.

git clone https://github.com/sflow-rt/prometheus-grafana.git
sed -i -e 's/prometheus/ai-metrics/g' prometheus-grafana/env_vars
./prometheus-grafana/start.sh

The easiest way to get started is to use Docker, see Deploy real-time network dashboards using Docker compose, and deploy the sflow/ai-metrics image bundling the AI Metrics application to generate metrics.

scrape_configs:
  - job_name: 'sflow-rt-ai-metrics'
    metrics_path: /app/ai-metrics/scripts/metrics.js/prometheus/txt
    scheme: http
    static_configs:
      - targets: [ 'sflow-rt:8008' ]
Follow the directions in Deploy real-time network dashboards using Docker compose to add the above Prometheus scrape task to retrieve the metrics and add the Grafana AI Metrics dashboard (ID: 23255).
Enable sFlow on all switches in the cluster (leaf and spine) using the recommended settings. Enable sFlow dropped packet notifications to populate the drop reasons metric, see Dropped packet notifications with Arista Networks, NVIDIA Cumulus Linux 5.11 for AI / ML, and Dropped packet notifications with Cisco 8000 Series Routers for examples.

Note: Tuning Performance describes how to optimize settings for very large clusters.

Industry standard sFlow telemetry is uniquely suited to monitoring AI workloads. The sFlow agents leverage instrumentation built into switch ASICs to stream randomly sampled packet headers and metadata in real-time. Sampling provides a scalable method of monitoring the large numbers of 400G/800G links found in AI fabrics. Export of packet headers allows the sFlow collector to decode the InfiniBand Base Transport headers to extract operations and RDMA metrics. The Dropped Packet extension uses Mirror-on-Drop (MoD) / What Just Happened (WJH) capabilities in the ASIC to include packet header, location, and reason for EVERY dropped packet in the fabric.

Talk to your switch vendor about their plans to support the Transit delay and queueing extension. This extension provides visibility into queue depth and switch transit delay using instrumentation built into the ASIC.

A network topology is required to generate the analytics, see Topology for a description of the JSON file and instructions for generating topologies from Graphviz DOT format, NVIDIA NetQ, Arista eAPI, and NetBox.
Use the Topology Status dashboard to verify that the topology is consistent with the sFlow telemetry and fully monitored. The Locate tab can be used to locate network addresses to access switch ports.

Note: If any gauges indicate an error, click on the gauge to get specific details.

Congratulations! The configuration is now complete and you should see the charts above in the AI Metrics application Traffic tab. In addition, the AI Metrics Grafana dashboard at the top of this page should start to populate with data.

Flow metrics with Prometheus and Grafana and Dropped packet metrics with Prometheus and Grafana describe how to define additional flow-based metrics to incorporate in Grafana dashboards.

Getting Started provides an introduction to sFlow-RT, describes how to browse metrics and traffic flows using tools included in the Docker image, and links to information on creating applications using sFlow-RT APIs.

Tuesday, April 1, 2025

Comparing AI / ML activity from two production networks

AI Metrics describes how to deploy the open source ai-metrics application. The application provides performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter. The screen capture from the article (above) shows results from a simulated 48,000 GPU cluster.

This article goes beyond simulation to demonstrate the AI Metrics dashboard by comparing live traffic seen in two production AI clusters.

Cluster 1

This cluster consists of 250 GPUs connected via 100G ports to a single large switch. The results are pretty consistent with the simulation from the original article. In this case there is no Core Link Traffic because the cluster consists of a single switch. The Discards chart shows a burst of Out (egress) discards and the Drop Reasons chart gives the reason as ingress_vlan_filter. The Total Traffic, Operations, Edge Link Traffic, and RDMA Operations charts all show a transient drop in throughput coincident with the discard spike. Further details of the dropped packets, such as source/destination address, operation, ingress / egress port, QP pair, etc. can be extracted from the sFlow Dropped Packet Notifications that are populating the Drop Reasons chart, for example, using the browse-drops application packaged with the sflow/ai-metrics Docker image.

The Period chart indicates that the workload is periodic with a compute / exchange cycle of approximately 0.9 seconds.

A real-time trend of the cluster network traffic polled every 100ms clearly shows the cyclic nature of the traffic shown by the Period chart and confirms the reported 0.9 second period.
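
A similar high-frequency trend can be reproduced against any sFlow-RT instance by defining a flow and polling it. A sketch, reusing the ibbt filter from the containerlab article above; the flow name, localhost address, and aggMode parameter are my assumptions:

import time
import requests

RT = 'http://localhost:8008'  # assumed sFlow-RT address

# define a flow tracking RoCEv2 byte rate across the fabric
requests.put(RT + '/flow/rocev2/json', json={
    'keys': 'ipsource,ipdestination',
    'value': 'bytes',
    'filter': 'suffix:stack:.:1=ibbt'
})

# poll every 100ms and print the total RoCEv2 traffic in Gbit/s
while True:
    flows = requests.get(RT + '/activeflows/ALL/rocev2/json',
                         params={'aggMode': 'sum'}).json()
    total_bps = 8 * sum(f['value'] for f in flows)
    print(f'{total_bps / 1e9:.2f} Gbps')
    time.sleep(0.1)
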

Cluster 2

This cluster consists of two 400G fixed configuration switches connected to 40 GPUs. In this case the traffic is much less regular than the first example. RDMA operation sizes vary from 500MB to over 3.5GB (in the previous example, all transfers were a consistent 7K bytes). The mix of RoCEv2 InfiniBand operations is also different, comprising RDMA_READ, RESYNC and ACK operations with a mixture of RD (Reliable Datagram) and RC (Reliable Connection) transports. In contrast, the previous example consisted only of RDMA_WRITE and ACK operations using RD transport.

An interesting point to note is the spike in the Discards chart coinciding with a burst in RC:RDMA_READ traffic. In this case, the network operating system running on the switches doesn't currently support sFlow Dropped Packet Notifications so the Drop Reasons chart doesn't provide further detail (Note: In this case the switch ASICs do have the required instrumentation, so a firmware update would be able to add the sFlow Dropped Packet Notifications feature).

In this example, the Period chart shows missing and irregular data.

In this case the real-time trend of network traffic shows no periodic structure, so the Period chart in the AI Metrics dashboard is unable to lock onto a repeating pattern.

Take a look at your own AI cluster network activity

AI Metrics gives step-by-step instructions to run the application in a production environment and integrate the metrics with back end Prometheus / Grafana dashboards. The solution utilizes industry standard sFlow instrumentation built into data center switches and can be deployed without any changes to the servers in the cluster.

Saturday, February 1, 2025

AI Metrics

AI Metrics is available on GitHub. The application provides performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.

The dashboard shown above is from a simulated network of 1,000 switches, each with 48 access ports connected to hosts. Activity occurs in a 256ms on / off cycle to emulate an AI learning run. The metrics include:

  • Total Traffic: Total traffic entering fabric
  • Operations: Total RoCEv2 operations broken out by type
  • Core Link Traffic: Histogram of load on fabric links
  • Edge Link Traffic: Histogram of load on access ports
  • RDMA Operations: Total RDMA operations
  • RDMA Bytes: Average RDMA operation size
  • Credits: Average number of credits in RoCEv2 acknowledgements
  • Period: Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
  • Congestion: Total ECN / CNP congestion messages
  • Errors: Total ingress / egress errors
  • Discards: Total ingress / egress discards
  • Drop Reasons: Packet drop reasons

Note: Clicking on peaks in the charts shows values at that time.

This article gives step-by-step instructions to run the AI Metrics application in a production environment and integrate the metrics with back end Prometheus / Grafana dashboards. Please try AI Metrics out and share your comments so that the set of metrics can be refined and extended to address operational requirements.

docker run -p 8008:8008 -p 6343:6343/udp sflow/ai-metrics
Use Docker to run the pre-built sflow/ai-metrics image and access the web interface on port 8008.
Enable sFlow on all switches in the cluster (leaf and spine) using the recommended settings. Enable sFlow dropped packet notifications to populate the drop reasons metric, see Dropped packet notifications with Arista Networks and NVIDIA Cumulus Linux 5.11 for AI / ML for examples.

Note: Tuning Performance describes how to optimize settings for very large clusters.

Industry standard sFlow telemetry is uniquely suited to monitoring AI workloads. The sFlow agents leverage instrumentation built into switch ASICs to stream randomly sampled packet headers and metadata in real-time. Sampling provides a scalable method of monitoring the large numbers of 400G/800G links found in AI fabrics. Export of packet headers allows the sFlow collector to decode the InfiniBand Base Transport headers to extract operations and RDMA metrics. The Dropped Packet extension uses Mirror-on-Drop (MoD) / What Just Happened (WJH) capabilities in the ASIC to include packet header, location, and reason for EVERY dropped packet in the fabric.

Talk to your switch vendor about their plans to support the Transit delay and queueing extension. This extension provides visibility into queue depth and switch transit delay using instrumentation built into the ASIC.

A network topology is required to generate the analytics, see Topology for a description of the JSON file and instructions for generating topologies from Graphviz DOT format, NVIDIA NetQ, Arista eAPI, and NetBox.
Use the Topology Status dashboard to verify that the topology is consistent with the sFlow telemetry and fully monitored. The Locate tab can be used to locate network addresses to access switch ports.

Note: If any gauges indicate an error, click on the gauge to get specific details.

Congratulations! The configuration is now complete and you should see the charts at the top of the article in the AI Metrics application Traffic tab.

The AI Metrics application exports the metrics shown in Prometheus scrape format, see the Help tab for details. The Docker image also includes the Prometheus application that allows flow metrics to be created and extracted, see Flow metrics with Prometheus and Grafana.

Getting Started provides an introduction to sFlow-RT, describes how to browse metrics and traffic flows using tools included in the Docker image, and links to information on creating applications using sFlow-RT APIs.