Tuesday, April 7, 2026

SONiC developments for visibility into AI/ML networks in 2026

SONiC sFlow High Level Design (HLD) v1.4 was recently published. This is the latest in a series of revisions bringing support for sFlow extensions that enhance network visibility for AI / ML traffic flows.

v1.3 Egress sFlow support

RoCEv2 / Ultra Ethernet host adapters bypass the Linux kernel and transfer data directly to GPU memory, rendering traditional host-based network monitoring tools (tcpdump, Wireshark, eBPF, etc.) ineffective. Ingress/egress packet sampling on the top-of-rack switch offloads monitoring from the host to the switch, providing visibility into host traffic.

In addition, some measurements may only be possible for egress sampled packets. For example, the v1.3 HLD describes how SONiC SAI drivers can support the sFlow Delay and Transit Structures extension:

Depending on platform capabilities, SAI driver may report additional attributes defined in https://github.com/torvalds/linux/blob/master/include/uapi/linux/psample.h. For example, PSAMPLE_ATTR_OUT_TC (egress queue), PSAMPLE_ATTR_OUT_TC_OCC (egress queue depth), and PSAMPLE_ATTR_LATENCY (transit delay) populate the sFlow Transit Delay Structures (https://sflow.org/sflow_transit.txt).
Typically this data is only known when packets egress the switch and may only be available for egress sampled packets.
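
The mapping from psample attributes to sFlow structures can be sketched as follows. The attribute names come from psample.h, but the function and field names below are illustrative assumptions, not the SONiC SAI driver's actual implementation:

```python
# Illustrative sketch: map Linux psample attributes from an egress
# sampled packet to fields in the sFlow Transit Delay Structures
# (https://sflow.org/sflow_transit.txt). Function and field names
# here are assumptions for illustration only.

def transit_structures(psample_attrs):
    """Build sFlow transit delay / egress queue records from psample metadata."""
    records = {}
    if 'PSAMPLE_ATTR_LATENCY' in psample_attrs:
        # transit delay through the switch, in nanoseconds
        records['extended_transit'] = {
            'delay': psample_attrs['PSAMPLE_ATTR_LATENCY']
        }
    if 'PSAMPLE_ATTR_OUT_TC' in psample_attrs:
        records['extended_egress_queue'] = {
            'queue': psample_attrs['PSAMPLE_ATTR_OUT_TC'],
            # egress queue depth (occupancy) when the packet was sampled
            'depth': psample_attrs.get('PSAMPLE_ATTR_OUT_TC_OCC', 0)
        }
    return records

sample = {'PSAMPLE_ATTR_LATENCY': 12500,
          'PSAMPLE_ATTR_OUT_TC': 3,
          'PSAMPLE_ATTR_OUT_TC_OCC': 524288}
print(transit_structures(sample))
```

Since the latency and queue occupancy are only known once the packet has been scheduled for transmission, these records can only accompany egress samples.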

Transit delay and queuing describes the measurements and provides an example. The sFlow transit delay and queue depth extension adds additional metadata to each packet sample. The combination of delay/queue measurement and packet header makes it clear where queues are filling, why the queues are filling, and who is sending the traffic during microbursts.

v1.4 Dropped packet notification (Mirror-on-Drop) support

RoCEv2 / Ultra Ethernet performance is severely impacted by packet loss, so visibility into lost packets is essential for real-time detection and remediation of packet loss events.

The sFlow dropped packet notification feature immediately reports every dropped packet. Drop events are not sampled; instead, a rate limit ensures that bursts of dropped packets don't flood the monitoring system. Each dropped packet notification identifies the switch and port where the packet was dropped and the reason it was dropped, and the header of the dropped packet reveals who was affected and what they were trying to do.

Dropped packet notifications and packet sampling in sFlow are a powerful combination. Packet sampling provides detailed visibility into the traffic successfully flowing through the network and the drop notifications provide details on the failures. Correlating the two measurements allows you to see the traffic filling buffers at the time packets are dropped. This detail is needed to tune network settings (like ECN/PFC buffer utilization thresholds, RoCEv2 credits, etc.) to avoid packet loss.

Configuration

Enabling sFlow on SONiC for monitoring production RoCEv2 / Ultra Ethernet traffic requires a small number of commands to monitor all the ports on the switch. The same configuration is applied to every switch in the fabric for comprehensive visibility.
sflow collector add sflow-rt 192.0.2.129 --vrf mgmt
sflow sample-direction both
sflow drop-monitor-limit 50
sflow enable
In this case sample-direction both ensures that both ingress and egress packets are sampled for complete visibility. Setting drop-monitor-limit enables dropped packet notifications and sets the rate limit, in packets per second, for dropped packet notification messages. Finally, sending sFlow over the out-of-band management network to the collector ensures that monitoring traffic cannot interfere with production traffic and adversely affect RoCEv2 performance.
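
As a rough sizing check (not part of the HLD), the expected sample load on the collector can be estimated from link speed, utilization, average packet size, and the configured sampling rate. The numbers below are illustrative assumptions, not recommended settings:

```python
# Back-of-the-envelope estimate of sFlow packet sample load.
# All inputs are illustrative assumptions, not recommended settings.

def samples_per_second(link_gbps, utilization, avg_pkt_bytes, sampling_rate):
    """Expected packet samples/second from one port in one direction."""
    pps = (link_gbps * 1e9 * utilization) / (avg_pkt_bytes * 8)
    return pps / sampling_rate

# 400G port at 50% utilization, 4KB RoCEv2 packets, 1-in-40000 sampling
print(round(samples_per_second(400, 0.5, 4096, 40000), 1))
```

With these assumptions, each 400G port contributes roughly 150 samples per second, which can be used to size the collector and choose a sampling rate that scales to the whole fabric.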

Availability

SONiC sFlow version 1.3 features are available in current SONiC releases, and dropped packet notifications should be available as part of the upcoming SONiC 202605 Release.
AI Metrics with Prometheus and Grafana provides step by step instructions for monitoring a backend AI/ML network.

To see an example, the SDSC Expanse cluster live AI/ML metrics dashboard can be accessed by clicking on the dashboard link. The San Diego Supercomputer Center (SDSC) Expanse cluster in the example has the following specifications: 5 Pflop/s peak; 93,184 CPU cores; 208 NVIDIA GPUs; 220 TB total DRAM; 810 TB total NVMe.

Tuesday, March 10, 2026

Monitoring RoCEv2 with sFlow

The talk Seeing Through the RDMA Fog: Monitoring RoCEv2 with sFlow at the recent North American Network Operators' Group (NANOG) conference describes how leveraging industry standard sFlow telemetry from data center switches provides visibility into RDMA activity in AI / ML networks.

Note: Slides are available from the talk link.

The SDSC Expanse cluster live AI/ML metrics dashboard described in the talk can be accessed by clicking on the dashboard link. The San Diego Supercomputer Center (SDSC) Expanse cluster specifications: 5 Pflop/s peak; 93,184 CPU cores; 208 NVIDIA GPUs; 220 TB total DRAM; 810 TB total NVMe.

Note: AI Metrics with Prometheus and Grafana shows how to set up the monitoring stack.

More recently, Expanse heatmap provides a publicly accessible real-time visualization of live traffic flowing between nodes in the Expanse cluster; see Real-time visualization of AI / ML traffic matrix for more information.

Monday, February 23, 2026

Real-time visualization of AI / ML traffic matrix

Heatmap is available on GitHub. The application provides a real-time traffic matrix visualization of end-to-end traffic flowing across an Ethernet fabric. Each axis represents an ordered list of network addresses. The x-axis is a flow source and the y-axis is a flow destination.

For example, the Heatmap above comes from a large high performance compute cluster running a mixture of tasks. Traffic is concentrated along the diagonal, indicating that the job scheduler is packing related tasks in racks so that most traffic is confined to the rack.

Note: Live Dashboards links to a number of dashboards showing live traffic, including the Heatmap above.

The next Heatmap shows a very different traffic pattern: in this case, RoCEv2 traffic generated by GPUs performing a NCCL AllReduce/AllGather collective operation using a ring algorithm. During the collective operation, each GPU sends data to its immediate neighbor (modulo the number of GPUs) in a logical ring, resulting in two nearly continuous lines on either side of the diagonal: one for forward traffic, and the other for return traffic associated with each flow.
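
The ring pattern can be reproduced in a few lines. This sketch of the logical ring (an illustration, not NCCL's actual code) shows why the traffic matrix contains the two off-diagonal lines:

```python
# Each GPU rank sends data to its next neighbor modulo the ring size,
# producing the line just above the diagonal; return traffic to the
# previous neighbor produces the line just below it.
def ring_flows(n):
    flows = []
    for rank in range(n):
        flows.append((rank, (rank + 1) % n))  # forward data
        flows.append((rank, (rank - 1) % n))  # return traffic
    return flows

print(ring_flows(4))
```

Plotting these (source, destination) pairs on the heatmap axes yields the two lines, with the modulo wrap-around accounting for the points in the far corners.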

The final example comes from a large data center hosting a mix of front-end workloads. Unlike the backend networks, this network combines internal (East/West) traffic with external (North/South) traffic flows. The internal traffic flows are contained in the central grid; the surrounding borders display external traffic.

The full range of IP addresses (0.0.0.0 - 255.255.255.255) is displayed on the heatmap using a piecewise linear scaling function. A start and end address identify the internal range and map to positions in the central grid; addresses outside this range are scaled to fit in the border insets.
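
A minimal sketch of such a piecewise linear mapping (the grid size, border width, and internal address range below are illustrative assumptions, not Heatmap's actual parameters):

```python
import ipaddress

SIZE, BORDER = 256, 16            # heatmap pixels and border inset width
LO = int(ipaddress.ip_address('10.0.0.0'))        # start of internal range
HI = int(ipaddress.ip_address('10.255.255.255'))  # end of internal range

def scale(addr):
    """Map an IPv4 address to a heatmap coordinate: internal addresses
    fill the central grid, everything else is squeezed into the borders."""
    a = int(ipaddress.ip_address(addr))
    grid = SIZE - 2 * BORDER
    if a < LO:   # below internal range -> leading border inset
        return int(a / LO * BORDER)
    if a > HI:   # above internal range -> trailing border inset
        return SIZE - BORDER + int((a - HI - 1) / (2**32 - HI - 1) * BORDER)
    return BORDER + int((a - LO) / (HI - LO + 1) * grid)

print(scale('10.128.0.0'), scale('8.8.8.8'), scale('192.0.2.1'))
```

Internal addresses get most of the resolution, while the entire external address space is compressed into the narrow border insets on each axis.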

Representing the traffic matrix in the form of a heat map scales well to very large networks and provides real-time insight into shifting traffic patterns as workloads change. The industry standard sFlow instrumentation in data center switches used to construct the traffic matrix also scales to the large number of switches and 400/800G port speeds found in AI/ML backend networks.

Tuesday, January 13, 2026

Exporting events to Loki

Grafana Loki is an open source log aggregation system inspired by Prometheus. While it is possible to use Loki with Grafana Alloy, a simpler approach is to send logs directly using the Loki HTTP API.

The following example modifies the ddos-protect application to use sFlow-RT's httpAsync() function to send events to Loki's HTTP API.

var lokiPort = getSystemProperty("ddos_protect.loki.port") || '3100';
var lokiPush = getSystemProperty("ddos_protect.loki.push") || '/loki/api/v1/push';
var lokiHost = getSystemProperty("ddos_protect.loki.host");

function sendEvent(action,attack,target,group,protocol) {
  if(lokiHost) {
    var url = 'http://'+lokiHost+':'+lokiPort+lokiPush;
    var lokiEvent = {
      streams: [
        {
          stream: {
            service_name: 'ddos-protect'
          },
          values: [[
            Date.now()+'000000',
            action+" "+attack+" "+target+" "+group+" "+protocol,
            {
              detected_level: action == 'release' ? 'INFO' : 'WARN',
              action: action,
              attack: attack,
              ip: target,
              group: group,
              protocol: protocol
            }
          ]]
        }
      ]
    };
    httpAsync({
      url: url,
      headers: {'Content-Type':'application/json'},
      operation: 'POST',
      body: JSON.stringify(lokiEvent),
      success: (response) => { 
        if (200 != response.status) {
          logWarning("DDoS Loki status " + response.status);
        }
      },
      error: (error) => {
        logWarning("DDoS Loki error " + error);
      }
    });
  }

  if(syslogHosts.length === 0) return;

  var msg = {app:'ddos-protect',action:action,attack:attack,ip:target,group:group,protocol:protocol};
  syslogHosts.forEach(function(host) {
    try {
      syslog(host,syslogPort,syslogFacility,syslogSeverity,msg);
    } catch(e) {
      logWarning('DDoS cannot send syslog to ' + host);
    }
  });
}
The highlighted code extends the existing scripts/ddos.js script to add Loki support.
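
Note that the Loki push API expects timestamps as nanoseconds since the epoch, which is why the script appends six zeros to the millisecond value returned by Date.now(). A minimal Python sketch of an equivalent payload (the labels and message are illustrative):

```python
import json
import time

def loki_payload(message, labels, metadata):
    """Build a Loki /loki/api/v1/push request body. Loki requires the
    timestamp as a nanosecond string, so the millisecond clock is
    converted by appending six zeros."""
    ts_ns = str(int(time.time() * 1000)) + '000000'
    return json.dumps({
        'streams': [{
            'stream': labels,
            'values': [[ts_ns, message, metadata]]
        }]
    })

body = loki_payload('block flood 192.0.2.1 external udp',
                    {'service_name': 'ddos-protect'},
                    {'detected_level': 'WARN', 'action': 'block'})
print(body)
```

The stream labels become Loki index labels, while the third element of each value entry carries structured metadata that can be filtered in LogQL without increasing index cardinality.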
Add a panel to integrate the Loki log into the Grafana sFlow-RT DDoS Protect dashboard as shown at the top of this page.

DDoS protection quickstart guide describes how to set up a DDoS mitigation solution using sFlow-RT.

Saturday, November 15, 2025

SC25: SDSC Expanse cluster live AI/ML metrics

The SDSC Expanse cluster live AI/ML metrics dashboard is a joint InMon / San Diego Supercomputer Center (SDSC) demonstration at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC25) conference being held this week in St. Louis, November 16-21. Click on the dashboard link during the show to see live traffic.

By default, the dashboard shows the Last 24 Hours of traffic. Explore the data: select Last 30 Days for a long term view, select Last 5 Minutes for an up to the second view, click on items in a chart legend to show selected metrics, and drag to select an interval to zoom in.

The Expanse cluster at the San Diego Supercomputer Center is a batch-oriented science computing gateway serving thousands of users and a wide range of research projects, see Google News for examples.

The SDSC Expanse cluster live AI/ML metrics dashboard displays real-time metrics for workloads running on the cluster:

    • Total Traffic: Total traffic entering fabric
    • Cluster Services: Traffic associated with Lustre, Ceph and NFS storage, and Slurm workload management
    • Core Link Traffic: Histogram of load on fabric links
    • Edge Link Traffic: Histogram of load on access ports
    • RDMA Operations: Total RDMA operations
    • RDMA Avg. Bytes per Operation: Average RDMA operation size
    • InfiniBand Operations: Total RoCEv2 InfiniBand operations broken out by type
    • Compute / Exchange Interval: Detected period of compute / exchange activity on fabric
    • Congestion Notification Messages: Total ECN / CNP congestion messages
    • InfiniBand Ack. Credits: Average number of credits in RoCEv2 InfiniBand acknowledgements
    • Packet Discards: Total ingress / egress discards
    • Packet Errors: Total ingress / egress errors

AI Metrics with Prometheus and Grafana describes how to quickly set up the monitoring stack for your own AI / ML network using industry standard telemetry from leading switch vendors (Arista, Cisco, Dell, Edge-Core, Juniper, HPE, NVIDIA, SONiC etc.).

Monday, November 3, 2025

Ultra Ethernet Transport

The Ultra Ethernet Consortium has a mission to Deliver an Ethernet based open, interoperable, high performance, full-communications stack architecture to meet the growing network demands of AI & HPC at scale. The recently released UE-Specification-1.0.1 includes an Ultra Ethernet Transport (UET) protocol with similar functionality to RDMA over Converged Ethernet (RoCEv2).

The sFlow instrumentation embedded as a standard feature of data center switch hardware from all leading vendors (Arista, Cisco, Dell, Juniper, NVIDIA, etc.) provides a cost effective solution for gaining visibility into UET traffic in large production AI / ML fabrics. 

docker run -p 8008:8008 -p 6343:6343/udp sflow/prometheus
The easiest way to get started is to use the pre-built sflow/prometheus Docker image to analyze the sFlow telemetry. The chart at the top of this page shows an up to the second view of UET operations using the included Flow Browser application, see Defining Flows for a list of available UET attributes. Getting Started describes how to set up the sFlow monitoring system.

Flow metrics with Prometheus and Grafana describes how to collect custom network traffic flow metrics using the Prometheus time series database and include the metrics in Grafana dashboards. Use the Flow Browser to explore UET flow metrics and then configure a Prometheus scrape task to collect useful operational metrics.
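
For example, a Prometheus scrape job along the following lines collects a flow metric on each scrape. The flow name, keys, and value here are placeholders, and the export path is an assumption based on the pattern used by the sFlow-RT prometheus application in the linked article:

```yaml
scrape_configs:
  - job_name: 'sflow-rt-flows'
    metrics_path: /app/prometheus/scripts/export.js/flows/ALL/txt
    params:
      metric: ['uet_bytes']              # placeholder metric name
      key: ['ipsource,ipdestination']    # placeholder flow keys
      value: ['bytes']
    static_configs:
      - targets: ['sflow-rt:8008']
```

Each scrape defines (or refreshes) the flow on sFlow-RT and returns the current values as Prometheus metrics, labeled by the flow keys.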

Thursday, October 30, 2025

Vector Packet Processor (VPP) dropped packet notifications


Vector Packet Processor (VPP) release 25.10 extends the sFlow implementation to include support for dropped packet notifications, providing detailed, low overhead, visibility into traffic flowing through a VPP router, see Vector Packet Processor (VPP) for performance information.
sflow sampling-rate 10000
sflow polling-interval 20
sflow header-bytes 128
sflow direction both
sflow drop-monitoring enable
sflow enable GigabitEthernet0/8/0
sflow enable GigabitEthernet0/9/0
sflow enable GigabitEthernet0/a/0
The above VPP configuration commands enable sFlow monitoring of the VPP dataplane, randomly sampling packets, periodically polling counters, and capturing dropped packets and reason codes. The measurements are sent via Linux netlink messages to an instance of the open source Host sFlow agent (hsflowd), which combines the measurements and streams standard sFlow telemetry to a remote collector.
sflow {
  collector { ip=192.0.2.1 udpport=6343 }
  psample { group=1 egress=on }
  dropmon { start=on limit=50 }
  vpp { }
}
The /etc/hsflowd.conf file above enables the modules needed to receive netlink messages from VPP and send the resulting sFlow telemetry to a collector at 192.0.2.1. See vpp-sflow for detailed instructions.
docker run -p 6343:6343/udp sflow/sflowtool
Run sflowtool on the sFlow collector host to verify that the data is being received and to see the information available in the sFlow telemetry stream. The pre-built sflow/sflowtool Docker image is the fastest way to run sflowtool.
Docker also makes it easy to try out other sFlow analytics tools. For example, Deploy real-time network dashboards using Docker compose describes how to quickly set up a measurement stack consisting of the sFlow-RT real-time analytics engine, Prometheus time series database, and Grafana dashboard builder.

The sFlow VPP module delivers the same industry standard network performance monitoring available in switches and routers from leading vendors, including Arista, Cisco, Dell, Juniper, NVIDIA, VyOS, etc. to provide comprehensive, network-wide, visibility.