Tuesday, May 19, 2026

Fixing Ghost Drops: How eBPF Rescued IPv6 Telemetry


A customer complains that they aren't getting IPFIX flow data from a router.

Use socat to check that IPFIX is being received (IANA assigned port for IPFIX is 4739):

socat -b 0 -dd -u UDP6-RECV:4739 - 2>&1
Output demonstrates that at least some IPFIX messages can be received when listening on port 4739.
2026/05/15 22:46:32 socat[108419] N using stdout for writing
2026/05/15 22:46:32 socat[108419] N starting data transfer loop with FDs [5,5] and [1,1]
2026/05/15 22:46:33 socat[108419] N received packet with 0 bytes from AF=10 [fec0:0000:0000:0000:0001:000c:2744:69f1]:50978
2026/05/15 22:46:33 socat[108419] N received packet with 0 bytes from AF=10 [fec0:0000:0000:0000:0001:000c:2744:69f1]:50978
Use tcpdump to check for IPFIX packets. tcpdump captures packets before they reach the host network stack, so it shows packets that are later dropped by the network stack or the host firewall:
tcpdump -i enp0s3 -n udp port 4739
The output shows that IPFIX datagrams are being received from a second source, fec0::1:c:2744:69f0, but they aren't showing up in the socat output, so the Linux kernel must be dropping them for some reason.
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on enp0s3, link-type EN10MB (Ethernet), snapshot length 262144 bytes
21:09:57.217821 IP6 fec0::1:c:2744:69f1.50978 > fec0::27ff:fe8d:4f0b.ipfix: UDP, length 1432
21:09:57.217921 IP6 fec0::1:c:2744:69f0.50978 > fec0::27ff:fe8d:4f0b.ipfix: UDP, length 1428
A check of the host firewall and reverse path filtering settings doesn't explain the drops, so take a more detailed look using tcpdump with the -v (verbose) option:
tcpdump -i enp0s3 -nv udp port 4739
This time we see that packets from fec0::1:c:2744:69f0 have a bad UDP checksum.
dropped privs to tcpdump
tcpdump: listening on enp0s3, link-type EN10MB (Ethernet), snapshot length 262144 bytes
22:07:36.443823 IP6 (flowlabel 0x1991c, hlim 64, next-header UDP (17) payload length: 1230) fec0::1:c:2744:69f1.50978 > fec0::27ff:fe8d:4f0b.ipfix: [udp sum ok] UDP, length 1222
22:07:37.356528 IP6 (flowlabel 0x1991c, hlim 64, next-header UDP (17) payload length: 1008) fec0::1:c:2744:69f0.50978 > fec0::27ff:fe8d:4f0b.ipfix: [bad udp cksum 0xfc60 -> 0xfb60!] UDP, length 1000
Verify by looking at the UDP error counter:
nstat -az Udp6InErrors
Results show an increasing error counter, confirming that the packets are being dropped by the Linux kernel.
#kernel
Udp6InErrors                    5106               0.0
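The counter check can also be scripted. The sketch below assumes the counters are read from /proc/net/snmp6 (where Linux exposes its IPv6 SNMP counters) and that the script runs under Node.js. The function names readCounter and counterDelta are illustrative and are not part of nstat or any other tool mentioned here.

```javascript
// Return the value of a named counter from text in /proc/net/snmp6 format
// (one "Name value" pair per line), or null if the counter is missing.
function readCounter(text, name) {
  var lines = text.split('\n');
  for (var i = 0; i < lines.length; i++) {
    var fields = lines[i].trim().split(/\s+/);
    if (fields.length >= 2 && fields[0] === name) {
      return Number(fields[1]);
    }
  }
  return null;
}

// Difference between two snapshots; a positive value means new errors.
function counterDelta(before, after, name) {
  var a = readCounter(before, name);
  var b = readCounter(after, name);
  return a === null || b === null ? null : b - a;
}

// Hypothetical usage on the collector (Node.js assumed):
//   var fs = require('fs');
//   var before = fs.readFileSync('/proc/net/snmp6', 'utf8');
//   ... wait a few seconds while IPFIX is arriving ...
//   var after = fs.readFileSync('/proc/net/snmp6', 'utf8');
//   console.log(counterDelta(before, after, 'Udp6InErrors'));
```

A delta that keeps growing while tcpdump shows datagrams arriving points to drops inside the kernel rather than in the network.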
Unfortunately, simple fixes that might work for IPv4 (disabling/ignoring the UDP checksum) are not available for IPv6; see RFC 8200, Internet Protocol, Version 6 (IPv6) Specification:
Unlike IPv4, the default behavior when UDP packets are originated by an IPv6 node is that the UDP checksum is not optional. That is, whenever originating a UDP packet, an IPv6 node must compute a UDP checksum over the packet and the pseudo-header, and, if that computation yields a result of zero, it must be changed to hex FFFF for placement in the UDP header. IPv6 receivers must discard UDP packets containing a zero checksum and should log the error.
Ideally, the router would send IPFIX with a correctly computed UDP checksum. On closer examination, it appears that some IPFIX messages from the router do have a correct UDP checksum while others do not. The messages with the correct checksum originate from the router's management CPU, while the messages with an incorrect checksum are generated directly in hardware by the routing chip.

This isn't too surprising - if this were IPv4 export, the hardware could set the UDP checksum to zero (since it is optional) and there would be no issue. However, with IPv6 the mandatory checksum must be computed - a more involved calculation that is prone to implementation errors.

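Computation of the UDP checksum involves creating an IPv6 pseudo header (shown above) containing the source address, destination address, upper-layer packet length, and next header value, and then calculating the checksum over the IPv6 pseudo header, the UDP header, and the UDP payload.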
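To make the arithmetic concrete, here is a minimal JavaScript sketch of the RFC 8200 calculation. It is not the code from fix_checksum.c; the function names (pseudoHeader, udp6Checksum, udp6Verify) are illustrative, and addresses and packets are assumed to be passed as arrays of byte values.

```javascript
// Add a byte array to a running sum as big-endian 16-bit words,
// padding an odd trailing byte with zero.
function sum16(bytes, sum) {
  for (var i = 0; i < bytes.length; i += 2) {
    var hi = bytes[i];
    var lo = i + 1 < bytes.length ? bytes[i + 1] : 0;
    sum += (hi << 8) | lo;
  }
  return sum;
}

// Fold carries back into the low 16 bits (one's complement addition).
function fold(sum) {
  while (sum > 0xffff) {
    sum = (sum & 0xffff) + Math.floor(sum / 0x10000);
  }
  return sum;
}

// IPv6 pseudo header: source, destination, 32-bit upper-layer length,
// three zero bytes, and next header = 17 (UDP).
function pseudoHeader(src, dst, len) {
  return Array.from(src).concat(
    Array.from(dst),
    [(len >>> 24) & 0xff, (len >>> 16) & 0xff, (len >>> 8) & 0xff, len & 0xff],
    [0, 0, 0, 17]
  );
}

// udp holds the UDP header followed by the payload. The checksum field
// (bytes 6 and 7) is treated as zero while computing.
function udp6Checksum(src, dst, udp) {
  var data = Array.from(udp);
  data[6] = 0;
  data[7] = 0;
  var sum = fold(sum16(data, sum16(pseudoHeader(src, dst, data.length), 0)));
  var csum = (~sum) & 0xffff;
  return csum === 0 ? 0xffff : csum; // zero is not allowed for IPv6
}

// Receiver-side check: reject a zero checksum, otherwise the folded sum
// over pseudo header and datagram must equal 0xffff.
function udp6Verify(src, dst, udp) {
  var data = Array.from(udp);
  if (data[6] === 0 && data[7] === 0) return false;
  return fold(sum16(data, sum16(pseudoHeader(src, dst, data.length), 0))) === 0xffff;
}
```

Because the sum covers the addresses in the pseudo header as well as every payload byte, a hardware exporter has to touch the whole datagram to get it right. This is the work a workaround must redo before the packet reaches the kernel's UDP layer.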

Ideally, the router vendor will fix the issue, but it may not be possible if there are hardware limitations, or it may take time if a fix isn't seen as a priority, so a workaround is needed.

This is where eBPF comes to the rescue! The GitHub fix-udp6-checksum project uses an eBPF program, fix_checksum.c, to compute and rewrite the UDP checksum before the IPFIX packet is handed to the Linux network stack.

Download and compile the fix_checksum.c program on a system running Docker:

git clone https://github.com/inmoncorp/fix-udp6-checksum.git
cd fix-udp6-checksum
./build.sh

Copy the resulting fix_checksum.o file to the IPFIX collector.

Install the eBPF program on interface enp0s3:

sudo tc qdisc add dev enp0s3 clsact
sudo tc filter add dev enp0s3 ingress bpf da obj fix_checksum.o sec tc/ingress

Check to see that the filter has been installed:

sudo tc -s filter show dev enp0s3 ingress
The output shows that the filter is installed and that the eBPF Just-In-Time (JIT) compiler has run for maximum performance.
filter protocol all pref 49151 bpf chain 0 
filter protocol all pref 49151 bpf chain 0 handle 0x1 fix_checksum.o:[tc/ingress] direct-action not_in_hw id 230 name fix_ipfix_check tag c6a96524b9f80adb jited
Finally, use socat to verify that the missing IPFIX data is now being received:
2026/05/18 14:06:02 socat[945320] N using stdout for writing
2026/05/18 14:06:02 socat[945320] N starting data transfer loop with FDs [5,5] and [1,1]
2026/05/18 14:06:03 socat[945320] N received packet with 0 bytes from AF=10 [fec0:0000:0000:0000:0001:000c:2744:69f1]:50978
2026/05/18 14:06:03 socat[945320] N received packet with 0 bytes from AF=10 [fec0:0000:0000:0000:0001:000c:2744:69f0]:50978
eBPF is a game changer, providing the ability to change how Linux processes packets without having to build a new kernel or even reboot a production system. What would otherwise have been a major issue is transformed into a relatively straightforward fix and a happy customer.

Tuesday, April 14, 2026

Four public live production flow analytics dashboards

The following publicly accessible dashboards show live data from operational networks, including: an AI/ML RoCEv2 fabric, a world-wide Kubernetes cluster, and an Internet Exchange Provider (IXP). Click on the [ LIVE DASHBOARD ] link under each screen capture to access the live dashboard.

San Diego Supercomputer Center Expanse Cluster AI/ML dashboard using ai-metrics application. See AI Metrics with Prometheus and Grafana for detailed, step-by-step, instructions for setting up monitoring and dashboard.

San Diego Supercomputer Center Expanse Cluster AI/ML traffic matrix using heatmap application. See Real-time visualization of AI / ML traffic matrix for an explanation of the chart with examples.

National Research Platform Nautilus Cluster GPU, CPU, and network resources in world-wide Kubernetes cluster using sunburst application. See Real-time Kubernetes cluster monitoring example for more details and step-by-step instructions for deploying monitoring.

San Francisco Metropolitan Internet Exchange overall traffic dashboard using ixp-metrics application. See Internet eXchange Provider (IXP) Metrics for detailed, step-by-step, instructions for setting up overall exchange traffic and per member peering traffic dashboards.

Live Dashboards maintains a current list of publicly accessible dashboards. If you have a dashboard to share, would like help learning to build your own dashboards, or have a general interest in real-time flow analytics (DDoS mitigation, traffic engineering, etc.), then you are welcome to connect with the community of users and developers.

The Getting Started guide provides step by step instructions for setting up real-time traffic analytics. Even if you don't have immediate access to a network, Real-time network and system metrics as a service describes how to replay captured sFlow data to explore the capabilities of the software on your laptop. Alternatively, sflow-rt/containerlab includes projects that emulate leaf and spine networks, EVPN, and DDoS mitigation, that can be run on a laptop using Docker Desktop.

Tuesday, April 7, 2026

SONiC developments for visibility into AI/ML networks in 2026

SONiC sFlow High Level Design (HLD) v1.4 was recently published. This is the latest in a series of revisions bringing support for sFlow extensions that enhance network visibility for AI / ML traffic flows.

v1.3 Egress sFlow support

RoCEv2 / Ultra Ethernet host adapters bypass the Linux kernel and transfer data directly to GPU memory, rendering traditional host-based network monitoring tools ineffective (tcpdump, Wireshark, eBPF etc.). Ingress/egress packet sampling on the top of rack switch offloads monitoring from the host to the switch to provide visibility into host traffic.

In addition, some measurements may only be possible for egress sampled packets. For example, the v1.3 HLD describes how SONiC SAI drivers can support the sFlow Delay and Transit Structures extension:

Depending on platform capabilities, SAI driver may report additional attributes defined in https://github.com/torvalds/linux/blob/master/include/uapi/linux/psample.h. For example, PSAMPLE_ATTR_OUT_TC (egress queue), PSAMPLE_ATTR_OUT_TC_OCC (egress queue depth), and PSAMPLE_ATTR_LATENCY (transit delay) populate the sFlow Transit Delay Structures (https://sflow.org/sflow_transit.txt).
Typically this data is only known when packets egress the switch and may only be available for egress sampled packets.

Transit delay and queuing describes the measurements and provides an example. The sFlow transit delay and queue depth extension adds additional metadata to each packet sample. The combination of delay/queue measurement and packet header makes it clear where queues are filling, why the queues are filling, and who is sending the traffic during microbursts.

v1.4 Dropped packet notification (Mirror-on-Drop) support

RoCEv2 / Ultra Ethernet performance is severely impacted by packet loss, so visibility into lost packets is essential for real-time detection and remediation of packet loss events.

The sFlow dropped packet notification feature immediately reports every dropped packet. Drop events are not sampled, but instead a rate limit is used to ensure that bursts of dropped packets don’t flood the monitoring system. Each dropped packet notification shows the switch and port where the packet was dropped and the reason why it was dropped, and the header of the dropped packet reveals who was affected and what they were trying to do.
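The difference between sampling and rate limiting can be illustrated with a small sketch. The code below is not how any switch or sFlow agent implements the limit; it simply assumes a fixed one-second window and a hypothetical createDropLimiter helper to show that the first drops in each interval are always reported and only the excess in a burst is suppressed.

```javascript
// Allow up to limitPerSecond notifications in each one-second window
// and count the rest as suppressed. Times are in milliseconds.
function createDropLimiter(limitPerSecond) {
  var windowStart = null;
  var sent = 0;
  var suppressed = 0;
  return {
    // Returns true if a notification for this drop should be exported.
    offer: function(nowMs) {
      if (windowStart === null || nowMs - windowStart >= 1000) {
        windowStart = nowMs; // start a new window
        sent = 0;
      }
      if (sent < limitPerSecond) {
        sent++;
        return true;
      }
      suppressed++;
      return false;
    },
    suppressed: function() { return suppressed; }
  };
}

// A burst of 120 drops arriving at the same instant with a limit of 50.
var limiter = createDropLimiter(50);
var exported = 0;
for (var i = 0; i < 120; i++) {
  if (limiter.offer(0)) exported++;
}
console.log(exported + ' exported, ' + limiter.suppressed() + ' suppressed');
```

An isolated drop is always exported under this scheme, whereas packet sampling would report it only with low probability.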

Dropped packet notifications and packet sampling in sFlow are a powerful combination. Packet sampling provides detailed visibility into the traffic successfully flowing through the network and the drop notifications provide details on the failures. Correlating the two measurements allows you to see the traffic filling buffers at the time packets are dropped. This detail is needed to tune network settings (like ECN/PFC buffer utilization thresholds, RoCEv2 credits, etc.) to avoid packet loss.

Configuration

Enabling sFlow on SONiC for monitoring production RoCEv2 / Ultra Ethernet traffic requires a small number of commands to monitor all the ports on the switch. The same configuration is applied to every switch in the fabric for comprehensive visibility.
sflow collector add sflow-rt 192.0.2.129 --vrf mgmt
sflow sample-direction both
sflow drop-monitor-limit 50
sflow enable
In this case, sample-direction both ensures that both ingress and egress packets are sampled for complete visibility. Setting drop-monitor-limit enables dropped packet notifications and sets the rate limit, in packets per second, for dropped packet notification messages. Finally, sending sFlow over the out-of-band management network to the collector ensures that monitoring traffic cannot interfere with production traffic and adversely affect RoCEv2 performance.

Availability

SONiC sFlow version 1.3 features are available in current SONiC releases, and dropped packet notifications should be available as part of the upcoming SONiC 202605 Release.
AI Metrics with Prometheus and Grafana provides step by step instructions for monitoring a backend AI/ML network.

To see an example, the SDSC Expanse cluster live AI/ML metrics dashboard can be accessed by clicking on the dashboard link. The San Diego Supercomputer Center (SDSC) Expanse cluster in the example has the following specifications: 5 Pflop/s peak; 93,184 CPU cores; 208 NVIDIA GPUs; 220 TB total DRAM; 810 TB total NVMe.

Tuesday, March 10, 2026

Monitoring RoCEv2 with sFlow

The talk Seeing Through the RDMA Fog: Monitoring RoCEv2 with sFlow at the recent North American Network Operator's Group (NANOG) conference describes how leveraging industry standard sFlow telemetry from data center switches provides visibility into RDMA activity in AI / ML networks.

Note: Slides are available from the talk link.

The SDSC Expanse cluster live AI/ML metrics dashboard described in the talk can be accessed by clicking on the dashboard link. The San Diego Supercomputer Center (SDSC) Expanse cluster specifications: 5 Pflop/s peak; 93,184 CPU cores; 208 NVIDIA GPUs; 220 TB total DRAM; 810 TB total NVMe.

Note: AI Metrics with Prometheus and Grafana shows how to set up the monitoring stack.

More recently, Expanse heatmap provides a publicly accessible real-time visualization of live traffic flowing between nodes in the Expanse cluster; see Real-time visualization of AI / ML traffic matrix for more information.

Monday, February 23, 2026

Real-time visualization of AI / ML traffic matrix

Heatmap is available on GitHub. The application provides a real-time traffic matrix visualization of end-to-end traffic flowing across an Ethernet fabric. Each axis represents an ordered list of network addresses. The x-axis is a flow source and the y-axis is a flow destination.

For example, the Heatmap above comes from a large high performance compute cluster running a mixture of tasks. Traffic is concentrated along the diagonal, indicating that the job scheduler is packing related tasks in racks so that most traffic is confined to the rack.
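The link between rack-local traffic and a bright diagonal can be sketched in a few lines. The code below is not taken from the Heatmap application. The flow record format ({src, dst, bytes}), the function names, and the assumption that hosts are ordered by rack with a fixed number of hosts per rack are all illustrative.

```javascript
// addresses: ordered list of host addresses (same order on both axes).
// flows: array of {src, dst, bytes} records.
// Returns matrix[y][x] with x = source index and y = destination index.
function trafficMatrix(addresses, flows) {
  var index = {};
  addresses.forEach(function(addr, i) { index[addr] = i; });
  var matrix = addresses.map(function() {
    return addresses.map(function() { return 0; });
  });
  flows.forEach(function(flow) {
    var x = index[flow.src];
    var y = index[flow.dst];
    if (x !== undefined && y !== undefined) {
      matrix[y][x] += flow.bytes;
    }
  });
  return matrix;
}

// Fraction of bytes exchanged between hosts in the same rack, assuming
// hosts are ordered by rack and every rack holds rackSize hosts.
function rackLocalShare(matrix, rackSize) {
  var local = 0;
  var total = 0;
  matrix.forEach(function(row, y) {
    row.forEach(function(bytes, x) {
      total += bytes;
      if (Math.floor(x / rackSize) === Math.floor(y / rackSize)) {
        local += bytes;
      }
    });
  });
  return total === 0 ? 0 : local / total;
}
```

Cells where source and destination fall in the same rack form square blocks along the diagonal, so a high rack-local share shows up as traffic concentrated there.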

Note: Live Dashboards links to a number of dashboards showing live traffic, including the Heatmap above.

The next Heatmap shows a very different traffic pattern. In this case, the RoCEv2 traffic is generated by GPUs performing an NCCL AllReduce/AllGather collective operation using a ring algorithm. During the collective operation, each GPU sends data to its immediate neighbor (modulo the number of GPUs) in a logical ring, resulting in two nearly continuous lines on either side of the diagonal: one for forward traffic, and the other for return traffic associated with each flow.
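The ring pattern is easy to reproduce. The following sketch assumes GPUs are numbered in ring order along both axes and marks each cell that carries traffic. The function names are illustrative, and the vertical orientation of the real Heatmap may differ from this text rendering.

```javascript
// Build an n x n matrix for a ring of n GPUs, indexed as matrix[y][x]
// with x = flow source and y = flow destination.
function ringMatrix(n) {
  var matrix = [];
  for (var y = 0; y < n; y++) {
    var row = [];
    for (var x = 0; x < n; x++) {
      row.push(0);
    }
    matrix.push(row);
  }
  for (var gpu = 0; gpu < n; gpu++) {
    var next = (gpu + 1) % n;  // immediate neighbor, wrapping at the end
    matrix[next][gpu] = 1;     // forward traffic: gpu -> next
    matrix[gpu][next] = 1;     // return traffic: next -> gpu
  }
  return matrix;
}

// Render the matrix as text, '#' for traffic and '.' for none.
function render(matrix) {
  return matrix.map(function(row) {
    return row.map(function(v) { return v ? '#' : '.'; }).join(' ');
  }).join('\n');
}

console.log(render(ringMatrix(8)));
```

The two lines sit one cell either side of the diagonal, and the wrap-around link between the last and first GPU lands in two opposite corners of the grid.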
The final example comes from a large data center hosting a mix of front end workloads. Unlike the backend networks, this network combines internal (East/West) traffic with external (North/South) traffic flows. The internal traffic flows are contained in the central grid. The surrounding borders display external traffic.
The full range of IP addresses (0.0.0.0 - 255.255.255.255) is displayed on the heatmap using a piecewise linear scaling function. A start and end address identify the internal range, which maps to the central grid, and addresses outside this range are scaled to fit in the border insets.
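One way to picture a piecewise linear scale of this kind is shown below. This is a sketch under stated assumptions and not the Heatmap application's code: the border parameter (the fraction of the axis given to each inset) and the function names are hypothetical. Positions are returned in the range 0 to 1.

```javascript
var MAX_IPV4 = 4294967295; // 255.255.255.255

// Convert dotted-quad notation to an unsigned integer.
function ipToInt(ip) {
  return ip.split('.').reduce(function(acc, octet) {
    return acc * 256 + parseInt(octet, 10);
  }, 0);
}

// Map an address to an axis position in [0, 1]:
//   below startIp  -> lower border  [0, border)
//   internal range -> central grid  [border, 1 - border]
//   above endIp    -> upper border  (1 - border, 1]
function scaleAddress(ip, startIp, endIp, border) {
  var a = ipToInt(ip);
  var s = ipToInt(startIp);
  var e = ipToInt(endIp);
  if (a < s) {
    return border * (a / s);
  }
  if (a > e) {
    return (1 - border) + border * ((a - e) / (MAX_IPV4 - e));
  }
  if (e === s) {
    return 0.5; // single internal address: place it in the middle
  }
  return border + (1 - 2 * border) * ((a - s) / (e - s));
}

// Example: 10.0.0.0/8 is internal, each border takes 1/8 of the axis.
console.log(scaleAddress('10.128.0.0', '10.0.0.0', '10.255.255.255', 0.125));
console.log(scaleAddress('192.0.2.1', '10.0.0.0', '10.255.255.255', 0.125));
```

Each of the three segments is linear, so address order is preserved across the whole axis while the internal range gets most of the resolution.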

Representing the traffic matrix in the form of a heat map scales well to very large networks and provides real-time insight into shifting traffic patterns as workloads change. The industry standard sFlow instrumentation in data center switches used to construct the traffic matrix also scales to the large number of switches and 400/800G port speeds found in AI/ML backend networks.

Tuesday, January 13, 2026

Exporting events to Loki

Grafana Loki is an open source log aggregation system inspired by Prometheus. While it is possible to use Loki with Grafana Alloy, a simpler approach is to send logs directly using the Loki HTTP API.

The following example modifies the ddos-protect application to use sFlow-RT's httpAsync() function to send events to Loki's HTTP API.

var lokiPort = getSystemProperty("ddos_protect.loki.port") || '3100';
var lokiPush = getSystemProperty("ddos_protect.loki.push") || '/loki/api/v1/push';
var lokiHost = getSystemProperty("ddos_protect.loki.host");

function sendEvent(action,attack,target,group,protocol) {
  if(lokiHost) {
    var url = 'http://'+lokiHost+':'+lokiPort+lokiPush;
    var lokiEvent = {
      streams: [
        {
          stream: {
            service_name: 'ddos-protect'
          },
          values: [[
            Date.now()+'000000',
            action+" "+attack+" "+target+" "+group+" "+protocol,
            {
              detected_level: action == 'release' ? 'INFO' : 'WARN',
              action: action,
              attack: attack,
              ip: target,
              group: group,
              protocol: protocol
            }
          ]]
        }
      ]
    };
    httpAsync({
      url: url,
      headers: {'Content-Type':'application/json'},
      operation: 'POST',
      body: JSON.stringify(lokiEvent),
      success: (response) => { 
        if (200 != response.status) {
          logWarning("DDoS Loki status " + response.status);
        }
      },
      error: (error) => {
        logWarning("DDoS Loki error " + error);
      }
    });
  }

  if(syslogHosts.length === 0) return;

  var msg = {app:'ddos-protect',action:action,attack:attack,ip:target,group:group,protocol:protocol};
  syslogHosts.forEach(function(host) {
    try {
      syslog(host,syslogPort,syslogFacility,syslogSeverity,msg);
    } catch(e) {
      logWarning('DDoS cannot send syslog to ' + host);
    }
  });
}
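The Loki-related code (the lokiPort, lokiPush, and lokiHost settings and the if(lokiHost) block at the start of sendEvent) extends the existing scripts/ddos.js script to add Loki support.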
Add a panel to integrate the Loki log into the Grafana sFlow-RT DDoS Protect dashboard as shown at the top of this page.

DDoS protection quickstart guide describes how to set up a DDoS mitigation solution using sFlow-RT.

Saturday, November 15, 2025

SC25: SDSC Expanse cluster live AI/ML metrics

The SDSC Expanse cluster live AI/ML metrics dashboard is a joint InMon / San Diego Supercomputer Center (SDSC) demonstration at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC25) conference being held this week in St. Louis, November 16-21. Click on the dashboard link during the show to see live traffic.

By default, the dashboard shows the Last 24 Hours of traffic. Explore the data: select Last 30 Days to get a long term view, select Last 5 Minutes to get an up to the second view, click on items in a chart legend to show selected metric, drag to select an interval and zoom in.

The Expanse cluster at the San Diego Supercomputer Center is a batch-oriented science computing gateway serving thousands of users and a wide range of research projects, see Google News for examples.

The SDSC Expanse cluster live AI/ML metrics dashboard displays real-time metrics for workloads running on the cluster:

    • Total Traffic Total traffic entering fabric
    • Cluster Services Traffic associated with Lustre, Ceph and NFS storage, and Slurm workload management
    • Core Link Traffic Histogram of load on fabric links
    • Edge Link Traffic Histogram of load on access ports
    • RDMA Operations Total RDMA operations
    • RDMA Avg. Bytes per Operation Average RDMA operation size
    • Infiniband Operations Total RoCEv2 Infiniband operations broken out by type
    • Compute / Exchange Interval Detected period of compute / exchange activity on fabric
    • Congestion Notification Messages Total ECN / CNP congestion messages
    • Infiniband Ack. Credits Average number of credits in RoCEv2 Infiniband acknowledgements
    • Packet Discards Total ingress / egress discards
    • Packet Errors Total ingress / egress errors

AI Metrics with Prometheus and Grafana describes how to quickly set up the monitoring stack for your own AI / ML network using industry standard telemetry from leading switch vendors (Arista, Cisco, Dell, Edge-Core, Juniper, HPE, NVIDIA, SONiC etc.).