The metrics include:
- Total Traffic: Total traffic entering fabric
- Operations: Total RoCEv2 operations broken out by type
- Core Link Traffic: Histogram of load on fabric links
- Edge Link Traffic: Histogram of load on access ports
- RDMA Operations: Total RDMA operations
- RDMA Bytes: Average RDMA operation size
- Credits: Average number of credits in RoCEv2 acknowledgements
- Period: Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
- Congestion: Total ECN / CNP congestion messages
- Errors: Total ingress / egress errors
- Discards: Total ingress / egress discards
- Drop Reasons: Packet drop reasons
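The Period metric in the list above reports how often the fabric alternates between compute and exchange phases. The article does not say how the period is detected, so the following is only an illustrative sketch of one common approach, autocorrelation of a traffic time series. The function name `estimate_period` and its parameters are hypothetical and are not part of sFlow-RT or the AI Metrics application.

```python
def estimate_period(samples, interval):
    """Estimate the dominant period (seconds) of a traffic time series.

    samples  -- traffic measurements taken at a fixed spacing
    interval -- spacing between measurements in seconds (assumed constant)
    Returns None when no periodic component is found.
    """
    n = len(samples)
    if n < 4:
        return None
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    energy = sum(c * c for c in centered)
    if energy == 0:
        return None  # constant signal, nothing to detect

    def autocorr(lag):
        total = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        return total / energy

    scores = [autocorr(lag) for lag in range(n // 2 + 1)]

    # Skip the trivial peak around lag 0 by advancing to the first
    # lag where the autocorrelation is no longer positive.
    lag = 1
    while lag < len(scores) and scores[lag] > 0:
        lag += 1
    if lag >= len(scores):
        return None

    # The strongest remaining peak is taken as the period.
    best = max(range(lag, len(scores)), key=lambda k: scores[k])
    if scores[best] <= 0:
        return None
    return best * interval
```

For example, a series sampled every 0.05 seconds that repeats every 10 samples yields an estimate of about 0.5 seconds, comparable to the period reported for this fabric.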
Note: Grafana Cloud has a free service tier that can be used to test this example.
Save the following compose.yml file on a system running Docker.

    configs:
      config.alloy:
        content: |
          prometheus.scrape "prometheus" {
            targets = [{
              __address__ = "sflow-rt:8008",
            }]
            forward_to = [prometheus.remote_write.grafanacloud.receiver]
            metrics_path = "/app/ai-metrics/scripts/metrics.js/prometheus/txt"
            scrape_interval = "10s"
          }
          prometheus.remote_write "grafanacloud" {
            endpoint {
              url = "https://<Your Grafana Cloud Prometheus Instance>/api/prom/push"
              basic_auth {
                username = "<Your Grafana.com User ID>"
                password = "<Your Grafana.com API Token>"
              }
            }
          }

    networks:
      monitoring:
        driver: bridge

    services:
      sflow-rt:
        image: sflow/ai-metrics
        container_name: sflow-rt
        restart: unless-stopped
        ports:
          - '6343:6343/udp'
          - '8008:8008'
        networks:
          - monitoring
      alloy:
        image: grafana/alloy
        container_name: alloy
        restart: unless-stopped
        configs:
          - source: config.alloy
            target: /etc/alloy/config.alloy
        depends_on:
          - sflow-rt
        networks:
          - monitoring

Find the settings needed to upload metrics to Prometheus by clicking on the Send Metrics button in your Grafana Cloud account. Edit the highlighted prometheus.remote_write endpoint settings (url, username and password) to match those provided by Grafana Cloud.

    docker compose up -d

Run the command above to start streaming metrics to Grafana Cloud. Click on the Grafana Launch button to access Grafana and add the AI Metrics dashboard (ID: 23255). Enable sFlow on all switches in the cluster (leaf and spine) using the recommended settings. Enable sFlow dropped packet notifications to populate the drop reasons metric; see Dropped packet notifications with Arista Networks, NVIDIA Cumulus Linux 5.11 for AI / ML, and Dropped packet notifications with Cisco 8000 Series Routers for examples.
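Before looking for data in Grafana Cloud, it can help to confirm that the sflow-rt container is exporting metrics on the path that Alloy scrapes. The following is a minimal sketch, assuming the script runs on the Docker host so that the published port 8008 is reachable as localhost; the function names are hypothetical, and the parser handles only the basic Prometheus text exposition lines.

```python
import urllib.request

# Assumption: run on the Docker host. Port 8008 is published by the compose
# file, and the path matches metrics_path in the Alloy scrape configuration.
METRICS_URL = (
    "http://localhost:8008/app/ai-metrics/scripts/metrics.js/prometheus/txt"
)


def parse_prometheus_text(text):
    """Parse Prometheus text exposition lines into {series: value}.

    Comment lines (# HELP / # TYPE) and blank lines are skipped. The series
    key keeps any label set, e.g. 'name{label="x"}'.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Label values may contain spaces, so split after the last brace.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            parts = line.split()
            series, rest = parts[0], parts[1:]
        if not rest:
            continue
        try:
            samples[series] = float(rest[0])
        except ValueError:
            continue  # ignore lines whose value is not numeric
    return samples


def fetch_metrics(url=METRICS_URL, timeout=5):
    """Fetch the metrics page and return the parsed samples."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))
```

Calling `fetch_metrics()` once the containers are running should return a non-empty dictionary. An empty result or a connection error suggests checking the sflow-rt container before troubleshooting the Grafana Cloud credentials.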
Note: Tuning Performance describes how to optimize settings for very large clusters.
Industry standard sFlow telemetry is uniquely suited to monitoring AI workloads. The sFlow agents leverage instrumentation built into switch ASICs to stream randomly sampled packet headers and metadata in real time. Sampling provides a scalable method of monitoring the large numbers of 400G/800G links found in AI fabrics. Export of packet headers allows the sFlow collector to decode the InfiniBand Base Transport headers to extract operations and RDMA metrics. The Dropped Packet extension uses Mirror-on-Drop (MoD) / What Just Happened (WJH) capabilities in the ASIC to include packet header, location, and reason for EVERY dropped packet in the fabric.
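To illustrate why random sampling scales, the sketch below shows how a collector can scale sampled packets up to an estimated total, together with the error bound commonly quoted for sFlow packet sampling (percentage error at 95% confidence of at most 196 * sqrt(1/c), where c is the number of samples in the class of interest). The function names and the 1-in-50,000 sampling rate in the usage note are illustrative assumptions, not recommended settings.

```python
import math


def estimate_total(samples, sampling_rate):
    """Scale a count of sampled packets to an estimated packet total.

    sampling_rate is N for 1-in-N sampling.
    """
    return samples * sampling_rate


def percent_error_bound(samples):
    """Approximate upper bound on percentage error at 95% confidence.

    The bound depends only on the number of samples observed, not on
    the link speed or the total number of packets.
    """
    if samples <= 0:
        raise ValueError("at least one sample is required")
    return 196.0 * math.sqrt(1.0 / samples)
```

With a hypothetical rate of 1-in-50,000, `estimate_total(250, 50000)` gives 12,500,000 packets, and `percent_error_bound(10000)` shows that 10,000 samples are enough to bring the error bound under 2%. Because accuracy depends on the sample count rather than link speed, the same approach works on 400G/800G links.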
Talk to your switch vendor about their plans to support the Transit delay and queueing extension. This extension provides visibility into queue depth and switch transit delay using instrumentation built into the ASIC.
A network topology is required to generate the analytics; see Topology for a description of the JSON file and instructions for generating topologies from Graphviz DOT format, NVIDIA NetQ, Arista eAPI, and NetBox. Use the Topology Status dashboard to verify that the topology is consistent with the sFlow telemetry and fully monitored. The Locate tab can be used to locate network addresses to access switch ports.
Note: If any gauges indicate an error, click on the gauge to get specific details.
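As a rough illustration of what a topology file contains, the sketch below builds a links object from a list of cables and rejects any port that appears twice. It assumes a layout in which each link names its two ends as node1/port1 and node2/port2; the Topology documentation is the authority on the exact schema, so verify the field names there before use. The switch and port names are hypothetical.

```python
import json


def build_topology(cables):
    """Build a topology dictionary from (node1, port1, node2, port2) tuples.

    Assumed layout: {"links": {"link1": {"node1": ..., "port1": ...,
    "node2": ..., "port2": ...}}}. Raises ValueError if a switch port is
    used by more than one link, which usually indicates a data entry error.
    """
    links = {}
    used_ports = set()
    for index, (node1, port1, node2, port2) in enumerate(cables, start=1):
        for end in ((node1, port1), (node2, port2)):
            if end in used_ports:
                raise ValueError(
                    f"port {end[0]}:{end[1]} appears in more than one link"
                )
            used_ports.add(end)
        links[f"link{index}"] = {
            "node1": node1,
            "port1": port1,
            "node2": node2,
            "port2": port2,
        }
    return {"links": links}


def topology_json(cables):
    """Serialize the topology as indented JSON text."""
    return json.dumps(build_topology(cables), indent=2)
```

For a two-leaf, one-spine fabric, `topology_json([("leaf1", "swp1", "spine1", "swp1"), ("leaf2", "swp1", "spine1", "swp2")])` produces a document with two links. The Topology Status dashboard remains the way to confirm that the result matches the sFlow telemetry.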
Congratulations! The configuration is now complete and you should see the charts above in the AI Metrics application Traffic tab. In addition, the AI Metrics Grafana dashboard at the top of this page should start to populate with data.
Flow metrics with Prometheus and Grafana and Dropped packet metrics with Prometheus and Grafana describe how to define additional flow-based metrics to incorporate in Grafana dashboards.
Getting Started provides an introduction to sFlow-RT, describes how to browse metrics and traffic flows using tools included in the Docker image, and links to information on creating applications using sFlow-RT APIs.