The InfluxDB AI Metrics dashboard shown above tracks performance metrics for
AI/ML RoCEv2 network traffic, for example, large-scale CUDA compute tasks using
NVIDIA Collective Communication Library (NCCL) operations for inter-GPU
communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.
The metrics include:
- Total Traffic: Total traffic entering fabric
- Operations: Total RoCEv2 operations broken out by type
- Core Link Traffic: Histogram of load on fabric links
- Edge Link Traffic: Histogram of load on access ports
- RDMA Operations: Total RDMA operations
- RDMA Bytes: Average RDMA operation size
- Credits: Average number of credits in RoCEv2 acknowledgements
- Period: Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
- Congestion: Total ECN / CNP congestion messages
- Errors: Total ingress / egress errors
- Discards: Total ingress / egress discards
- Drop Reasons: Packet drop reasons
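The AI Metrics application exports these values in Prometheus exposition format (the Telegraf input configured below scrapes them). As a rough illustration of what a scrape returns, the following sketch parses a couple of sample lines; the metric names shown are hypothetical stand-ins, not the exact names exported by the application.

```python
# Minimal parser for simple, label-free Prometheus exposition-format samples.
# NOTE: the metric names below are illustrative; see the ai-metrics
# application's /prometheus/txt endpoint for the real names.

def parse_prometheus(text):
    """Return {metric_name: value} for label-free metric samples."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(' ')
        metrics[name] = float(value)
    return metrics

sample = """
# HELP total_traffic Total traffic entering fabric
# TYPE total_traffic gauge
total_traffic 123456789
rdma_operations 42
"""

print(parse_prometheus(sample))
```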
This article shows how to integrate with InfluxDB Cloud instead of running the services locally.
Note: InfluxDB Cloud has a free service tier that can be used to test this example.
Save the following compose.yml file on a system running Docker.
configs:
  config.telegraf:
    content: |
      [agent]
        interval = '15s'
        round_interval = true
        omit_hostname = true
      [[outputs.influxdb_v2]]
        urls = ['https://<INFLUXDB_CLOUD_INSTANCE>.cloud2.influxdata.com']
        token = '<INFLUXDB_CLOUD_TOKEN>'
        organization = '<INFLUXDB_CLOUD_USER>'
        bucket = 'sflow'
      [[inputs.prometheus]]
        urls = ['http://sflow-rt:8008/app/ai-metrics/scripts/metrics.js/prometheus/txt']
        metric_version = 1
networks:
  monitoring:
    driver: bridge
services:
  sflow-rt:
    image: sflow/ai-metrics
    container_name: sflow-rt
    restart: unless-stopped
    ports:
      - '6343:6343/udp'
      - '8008:8008'
    networks:
      - monitoring
  telegraf:
    image: telegraf:alpine
    container_name: telegraf
    restart: unless-stopped
    configs:
      - source: config.telegraf
        target: /etc/telegraf/telegraf.conf
    depends_on:
      - sflow-rt
    networks:
      - monitoring
Use the Load Data menu to create an sflow bucket, create an API TOKEN to upload data, and find the TELEGRAF INFLUXDB OUTPUT PLUGIN settings.
Navigate to the Dashboards menu and create a new dashboard by importing ai_metrics.json.
Edit the highlighted outputs.influxdb_v2 Telegraf settings (INFLUXDB_CLOUD_INSTANCE, INFLUXDB_CLOUD_TOKEN, and INFLUXDB_CLOUD_USER) to match those provided by InfluxDB Cloud.
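If this step is automated, the substitution is a simple string replacement; a sketch, where the example values are hypothetical stand-ins for whatever InfluxDB Cloud provides:

```python
def fill_placeholders(text, settings):
    """Replace <PLACEHOLDER> tokens in a config file with real values."""
    for placeholder, value in settings.items():
        text = text.replace(placeholder, value)
    return text

# Example values are placeholders, not real credentials.
settings = {
    '<INFLUXDB_CLOUD_INSTANCE>': 'us-east-1-1',
    '<INFLUXDB_CLOUD_TOKEN>': 'example-token',
    '<INFLUXDB_CLOUD_USER>': 'example-org',
}

line = "token = '<INFLUXDB_CLOUD_TOKEN>'"
print(fill_placeholders(line, settings))
```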
docker compose up -d
Run the command above to start streaming metrics to InfluxDB Cloud.
Enable sFlow on all switches in the cluster (leaf and spine) using the
recommended settings. Enable
sFlow dropped packet notifications
to populate the drop reasons metric, see
Dropped packet notifications with Arista Networks,
NVIDIA Cumulus Linux 5.11 for AI / ML and
Dropped packet notifications with Cisco 8000 Series Routers
for examples.
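On switches running a Linux-based NOS such as Cumulus Linux, sFlow is typically configured through the Host sFlow agent (hsflowd). A minimal illustrative /etc/hsflowd.conf might look like the following; the collector address, sampling rate, and polling interval here are placeholder values, so use the recommended settings and vendor guides linked above for production deployments:

```
sflow {
  collector { ip = 10.0.0.50 udpport = 6343 }
  sampling = 50000
  polling = 20
}
```

Dropped-packet notification settings vary by platform; the vendor-specific articles linked above cover enabling them.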
Note:
Tuning Performance describes how
to optimize settings for very large clusters.
Industry standard sFlow telemetry is uniquely
suited to monitoring AI workloads. The sFlow agents leverage instrumentation
built into switch ASICs to stream randomly sampled packet headers and metadata
in real-time. Sampling provides a scalable method of monitoring the large
numbers of 400G/800G links found in AI fabrics. Export of packet headers
allows the sFlow collector to decode the
InfiniBand Base Transport
headers to extract operations and RDMA metrics. The
Dropped Packet extension uses
Mirror-on-Drop (MoD) / What Just Happened (WJH) capabilities in the ASIC to
include the packet header, location, and reason for every dropped packet in
the fabric.
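Because the telemetry is randomly sampled, the collector scales each sample by the sampling rate to estimate total traffic, and the relative error of that estimate shrinks with the number of samples collected (approximately 196 × √(1/n) percent at 95% confidence, per the standard packet-sampling analysis). A quick sketch, assuming a hypothetical 1-in-50,000 sampling rate:

```python
import math

def estimate_packets(samples, sampling_rate):
    """Scale a sampled packet count up to an estimate of total packets."""
    return samples * sampling_rate

def sampling_error_pct(samples):
    """Relative error (percent, ~95% confidence) of a sampled estimate."""
    return 196.0 * math.sqrt(1.0 / samples)

# e.g. 10,000 samples collected at a 1-in-50,000 sampling rate
n = 10000
print(estimate_packets(n, 50000))       # estimated total packets
print(round(sampling_error_pct(n), 2))  # estimate error in percent
```

The key property is that the error depends only on the number of samples, not on link speed, which is why sampling scales to fabrics full of 400G/800G links.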
Talk to your switch vendor about their plans to support the
Transit delay and queueing extension. This extension provides visibility into queue depth and
switch transit delay using instrumentation built into the ASIC.
A network topology is required to generate the analytics, see
Topology for a description of
the JSON file and instructions for generating topologies from Graphviz DOT
format, NVIDIA NetQ, Arista eAPI, and NetBox.
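The linked Topology page documents the JSON schema; since Graphviz DOT is one of the accepted inputs, here is a sketch that emits a DOT description of a small two-spine, four-leaf fabric. The node names are illustrative, and a real topology needs interface names on each link, so treat this as a starting point and consult the Topology documentation for the details.

```python
def leaf_spine_dot(num_spines=2, num_leaves=4):
    """Emit a Graphviz DOT graph for a fully meshed leaf/spine fabric."""
    lines = ['graph fabric {']
    for s in range(1, num_spines + 1):
        for l in range(1, num_leaves + 1):
            # every leaf connects to every spine
            lines.append(f'  spine{s} -- leaf{l};')
    lines.append('}')
    return '\n'.join(lines)

print(leaf_spine_dot())
```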
Use the
Topology Status dashboard to verify that the topology is
consistent with the sFlow telemetry and fully monitored. The
Locate tab
can be used to map network addresses to the access switch ports where they are attached.
Note: If any gauges indicate an error, click on the gauge to
get specific details.
Congratulations! The configuration is now complete and you should see charts
in the AI Metrics application's Traffic tab. In addition, the AI
Metrics dashboard at the top of this page should start to populate with data.
Getting Started provides an
introduction to sFlow-RT, describes how to browse metrics and traffic flows
using tools included in the Docker image, and links to information on creating
applications using sFlow-RT APIs.