Monday, July 28, 2025

Linux packet sampling using eBPF

Linux 6.11+ kernels provide TCX attachment points for eBPF programs to efficiently examine packets as they ingress and egress the host. The latest version of the open source Host sFlow agent includes support for TCX packet sampling, streaming industry standard sFlow telemetry to a central collector for network-wide visibility. For example, Deploy real-time network dashboards using Docker compose describes how to quickly set up a Prometheus database and use Grafana to build network dashboards.

static __always_inline void sample_packet(struct __sk_buff *skb, __u8 direction) {
    __u32 key = skb->ifindex;
    __u32 *rate = bpf_map_lookup_elem(&sampling, &key);
    if (!rate || (*rate > 0 && bpf_get_prandom_u32() % *rate != 0))
        return;

    struct packet_event_t pkt = {};
    pkt.timestamp = bpf_ktime_get_ns();
    pkt.ifindex = skb->ifindex;
    pkt.sampling_rate = *rate;
    pkt.ingress_ifindex = skb->ingress_ifindex;
    pkt.routed_ifindex = direction ? 0 : get_route(skb);
    pkt.pkt_len = skb->len;
    pkt.direction = direction;

    __u32 hdr_len = skb->len < MAX_PKT_HDR_LEN ? skb->len : MAX_PKT_HDR_LEN;
    if (hdr_len > 0 && bpf_skb_load_bytes(skb, 0, pkt.hdr, hdr_len) < 0)
        return;
    bpf_perf_event_output(skb, &events, BPF_F_CURRENT_CPU, &pkt, sizeof(pkt));
}

SEC("tcx/ingress")
int tcx_ingress(struct __sk_buff *skb) {
    sample_packet(skb, 0);

    return TCX_NEXT;
}

SEC("tcx/egress")
int tcx_egress(struct __sk_buff *skb) {
    sample_packet(skb, 1);

    return TCX_NEXT;
}

The sample.bpf.c file is compiled into eBPF code that the Host sFlow mod_epcap.c module uses to tap packets on selected interfaces. The highlighted code uses the bpf_get_prandom_u32() function to randomly select packets using the configured sampling rate for the interface. Once a packet is selected to be sampled, the packet header and selected metadata are captured and sent to the Host sFlow agent as a performance event via the bpf_perf_event_output() call. The ability to perform the sampling action in the kernel dramatically reduces the overhead associated with network traffic monitoring, since only the small fraction of sampled packets need to be transferred to the user space Host sFlow agent.

static __always_inline __u32 get_route(struct __sk_buff *skb) {
    __u32 key = 0;
    __u32 *routing_enabled = bpf_map_lookup_elem(&routing, &key);
    if (!routing_enabled || !*routing_enabled)
        return 0;

    if (skb->pkt_type != PACKET_HOST)
        return 0;

    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return 0;
    __u32 proto = bpf_ntohs(eth->h_proto);
    if (proto == ETH_P_IP) {
        struct iphdr *ip = data + sizeof(*eth);
        if ((void *)(ip + 1) > data_end)
            return 0;
        struct bpf_fib_lookup fib = {0};
        fib.family      = AF_INET;
        fib.ipv4_src    = ip->saddr;
        fib.ipv4_dst    = ip->daddr;
        fib.tos         = ip->tos;
        fib.l4_protocol = ip->protocol;
        fib.sport       = 0;
        fib.dport       = 0;
        fib.tot_len     = bpf_ntohs(ip->tot_len);
        fib.ifindex     = skb->ifindex;
        long rc = bpf_fib_lookup(skb, &fib, sizeof(fib), 0);
        if (rc != BPF_FIB_LKUP_RET_SUCCESS)
            return 0;
        return fib.ifindex;
    } else if (proto == ETH_P_IPV6) {
        struct ipv6hdr *ipv6 = data + sizeof(*eth);
        if ((void *)(ipv6 + 1) > data_end)
            return 0;
        struct bpf_fib_lookup fib = {0};
        fib.family      = AF_INET6;
        __builtin_memcpy(fib.ipv6_src, &ipv6->saddr, sizeof(ipv6->saddr));
        __builtin_memcpy(fib.ipv6_dst, &ipv6->daddr, sizeof(ipv6->daddr));
        fib.flowinfo    = *(__be32 *)ipv6 & bpf_htonl(0x0FFFFFFF);
        fib.l4_protocol = ipv6->nexthdr;
        fib.sport       = 0;
        fib.dport       = 0;
        fib.tot_len     = bpf_ntohs(ipv6->payload_len);
        fib.ifindex     = skb->ifindex;
        long rc = bpf_fib_lookup(skb, &fib, sizeof(fib), 0);
        if (rc != BPF_FIB_LKUP_RET_SUCCESS)
            return 0;
        return fib.ifindex;
    }
    return 0;
}

If the Linux host has been configured as a router, then bpf_fib_lookup() is used to determine the forwarding decision (egress port) for sampled ingress packets.

Note: The mod_pcap.c module works on older Linux kernels and uses traditional BPF to perform random packet sampling. The main advantage of the mod_epcap module is its ability to add additional metadata to each sampled packet.

Friday, July 11, 2025

Tracing network packets with eBPF and pwru

pwru (packet, where are you?) is an open source tool from Cilium that uses eBPF instrumentation in recent Linux kernels to trace network packets through the kernel.

In this article we will use Multipass to create a virtual machine to experiment with pwru. Multipass is a command line tool for running Ubuntu virtual machines on Mac or Windows. Multipass uses the native virtualization capabilities of the host operating system to simplify the creation of virtual machines.

multipass launch --name=ebpf noble
multipass exec ebpf -- sudo apt update
multipass exec ebpf -- sudo apt -y install git clang llvm make libbpf-dev flex bison golang
multipass exec ebpf -- git clone https://github.com/cilium/pwru.git
multipass exec ebpf --working-directory pwru -- make
multipass exec ebpf -- sudo ./pwru/pwru -h
Run the commands above to create the virtual machine and build pwru from sources.
multipass exec ebpf -- sudo ./pwru/pwru port https
Run pwru to trace https traffic on the virtual machine.
multipass exec ebpf -- curl https://sflow-rt.com
In a second window, run the above command to generate an https request from the virtual machine.
SKB                CPU PROCESS          NETNS      MARK/x        IFACE       PROTO  MTU   LEN   TUPLE FUNC
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0               0         0x0000 1500  60    192.168.66.3:47460->54.190.130.38:443(tcp) ip_local_out
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0               0         0x0000 1500  60    192.168.66.3:47460->54.190.130.38:443(tcp) __ip_local_out
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0               0         0x0800 1500  60    192.168.66.3:47460->54.190.130.38:443(tcp) ip_output
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  60    192.168.66.3:47460->54.190.130.38:443(tcp) nf_hook_slow
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  60    192.168.66.3:47460->54.190.130.38:443(tcp) apparmor_ip_postroute
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  60    192.168.66.3:47460->54.190.130.38:443(tcp) ip_finish_output
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  60    192.168.66.3:47460->54.190.130.38:443(tcp) __ip_finish_output
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  60    192.168.66.3:47460->54.190.130.38:443(tcp) ip_finish_output2
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  60    192.168.66.3:47460->54.190.130.38:443(tcp) neigh_resolve_output
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  60    192.168.66.3:47460->54.190.130.38:443(tcp) eth_header
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  60    192.168.66.3:47460->54.190.130.38:443(tcp) skb_push
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  74    192.168.66.3:47460->54.190.130.38:443(tcp) __dev_queue_xmit
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  74    192.168.66.3:47460->54.190.130.38:443(tcp) qdisc_pkt_len_init
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  74    192.168.66.3:47460->54.190.130.38:443(tcp) netdev_core_pick_tx
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  74    192.168.66.3:47460->54.190.130.38:443(tcp) sch_direct_xmit
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  74    192.168.66.3:47460->54.190.130.38:443(tcp) validate_xmit_skb_list
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  74    192.168.66.3:47460->54.190.130.38:443(tcp) validate_xmit_skb
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  74    192.168.66.3:47460->54.190.130.38:443(tcp) netif_skb_features
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  74    192.168.66.3:47460->54.190.130.38:443(tcp) passthru_features_check
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  74    192.168.66.3:47460->54.190.130.38:443(tcp) skb_network_protocol
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  74    192.168.66.3:47460->54.190.130.38:443(tcp) skb_csum_hwoffload_help
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  74    192.168.66.3:47460->54.190.130.38:443(tcp) skb_checksum_help
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  74    192.168.66.3:47460->54.190.130.38:443(tcp) skb_ensure_writable
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  74    192.168.66.3:47460->54.190.130.38:443(tcp) validate_xmit_xfrm
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  74    192.168.66.3:47460->54.190.130.38:443(tcp) dev_hard_start_xmit
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  74    192.168.66.3:47460->54.190.130.38:443(tcp) start_xmit
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  74    192.168.66.3:47460->54.190.130.38:443(tcp) skb_clone_tx_timestamp
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  74    192.168.66.3:47460->54.190.130.38:443(tcp) xmit_skb
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  86    192.168.66.3:47460->54.190.130.38:443(tcp) skb_to_sgvec
0xffff9fc40335a0e8 0   ~r/bin/curl:8966 4026531840 0             ens3:2      0x0800 1500  86    192.168.66.3:47460->54.190.130.38:443(tcp) __skb_to_sgvec
The pwru output provides a detailed trace of each packet through the Linux stack, identifying the application, namespace, interface, and kernel functions traversed by the packet. Type Ctrl+C to stop the trace.
multipass exec ebpf -- sudo ufw default allow
multipass exec ebpf -- sudo ufw deny out to any port https
multipass exec ebpf -- sudo ufw enable
Now create a firewall rule to block outgoing https traffic and repeat the test.
SKB                CPU PROCESS          NETNS      MARK/x        IFACE       PROTO  MTU   LEN   TUPLE FUNC
0xffff970bc5f568e8 0   ~r/bin/curl:7138 4026531840 0               0         0x0000 1500  60    192.168.67.5:38932->54.190.130.38:443(tcp) ip_local_out
0xffff970bc5f568e8 0   ~r/bin/curl:7138 4026531840 0               0         0x0000 1500  60    192.168.67.5:38932->54.190.130.38:443(tcp) __ip_local_out
0xffff970bc5f568e8 0   ~r/bin/curl:7138 4026531840 0               0         0x0800 1500  60    192.168.67.5:38932->54.190.130.38:443(tcp) nf_hook_slow
0xffff970bc5f568e8 0   ~r/bin/curl:7138 4026531840 0               0         0x0800 1500  60    192.168.67.5:38932->54.190.130.38:443(tcp) kfree_skb_reason(SKB_DROP_REASON_NETFILTER_DROP)
0xffff970bc5f568e8 0   ~r/bin/curl:7138 4026531840 0               0         0x0800 1500  60    192.168.67.5:38932->54.190.130.38:443(tcp) skb_release_head_state
0xffff970bc5f568e8 0   ~r/bin/curl:7138 4026531840 0               0         0x0800 0     60    192.168.67.5:38932->54.190.130.38:443(tcp) tcp_wfree
0xffff970bc5f568e8 0   ~r/bin/curl:7138 4026531840 0               0         0x0800 0     60    192.168.67.5:38932->54.190.130.38:443(tcp) skb_release_data
0xffff970bc5f568e8 0   ~r/bin/curl:7138 4026531840 0               0         0x0800 0     60    192.168.67.5:38932->54.190.130.38:443(tcp) kfree_skbmem
This time, the curl command hangs and the pwru trace shows that packets are being dropped in the Linux firewall.
multipass exec ebpf -- sudo ufw disable
Disable the firewall.

There are additional multipass commands available to manage the virtual machine.

multipass shell ebpf
Connect to the virtual machine and access a command shell.
multipass stop ebpf
Stop the virtual machine.
multipass start ebpf
Start the virtual machine.
multipass delete ebpf
multipass purge
Delete the virtual machine.

Thursday, July 10, 2025

AI Metrics with InfluxDB Cloud

The InfluxDB AI Metrics dashboard shown above tracks performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.

The metrics include:

  • Total Traffic Total traffic entering fabric
  • Operations Total RoCEv2 operations broken out by type
  • Core Link Traffic Histogram of load on fabric links
  • Edge Link Traffic Histogram of load on access ports
  • RDMA Operations Total RDMA operations
  • RDMA Bytes Average RDMA operation size
  • Credits Average number of credits in RoCEv2 acknowledgements
  • Period Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
  • Congestion Total ECN / CNP congestion messages
  • Errors Total ingress / egress errors
  • Discards Total ingress / egress discards
  • Drop Reasons Packet drop reasons
This article shows how to integrate with InfluxDB Cloud instead of running the services locally.

Note: InfluxDB Cloud has a free service tier that can be used to test this example.

Save the following compose.yml file on a system running Docker.

configs:
  config.telegraf:
    content: |
      [agent]
        interval = '15s'
        round_interval = true
        omit_hostname = true
      [[outputs.influxdb_v2]]
        urls = ['https://<INFLUXDB_CLOUD_INSTANCE>.cloud2.influxdata.com']
        token = '<INFLUXDB_CLOUD_TOKEN>'
        organization = '<INFLUXDB_CLOUD_USER>'
        bucket = 'sflow'
      [[inputs.prometheus]]
        urls = ['http://sflow-rt:8008/app/ai-metrics/scripts/metrics.js/prometheus/txt']
        metric_version = 1

networks:

  monitoring:
    driver: bridge

services:

  sflow-rt:
    image: sflow/ai-metrics
    container_name: sflow-rt
    restart: unless-stopped
    ports:
      - '6343:6343/udp'
      - '8008:8008'
    networks:
      - monitoring

  telegraf:
    image: telegraf:alpine
    container_name: telegraf
    restart: unless-stopped
    configs:
      - source: config.telegraf
        target: /etc/telegraf/telegraf.conf
    depends_on:
      - sflow-rt
    networks:
      - monitoring
Use the Load Data menu to create an sflow bucket, create an API TOKEN to upload data, and find the TELEGRAF INFLUXDB OUTPUT PLUGIN settings. Navigate to the Dashboards menu and create a new dashboard by importing ai_metrics.json.
Edit the highlighted outputs.influxdb_v2 Telegraf settings (INFLUXDB_CLOUD_INSTANCE, INFLUXDB_CLOUD_TOKEN, and INFLUXDB_CLOUD_USER) to match those provided by InfluxDB Cloud.
docker compose up -d
Run the command above to start streaming metrics to InfluxDB Cloud.
Enable sFlow on all switches in the cluster (leaf and spine) using the recommended settings. Enable sFlow dropped packet notifications to populate the drop reasons metric, see Dropped packet notifications with Arista Networks, NVIDIA Cumulus Linux 5.11 for AI / ML, and Dropped packet notifications with Cisco 8000 Series Routers for examples.

Note: Tuning Performance describes how to optimize settings for very large clusters.

Industry standard sFlow telemetry is uniquely suited to monitoring AI workloads. The sFlow agents leverage instrumentation built into switch ASICs to stream randomly sampled packet headers and metadata in real time. Sampling provides a scalable method of monitoring the large numbers of 400G/800G links found in AI fabrics. Export of packet headers allows the sFlow collector to decode the InfiniBand Base Transport headers to extract operations and RDMA metrics. The Dropped Packet extension uses Mirror-on-Drop (MoD) / What Just Happened (WJH) capabilities in the ASIC to include the packet header, location, and reason for EVERY dropped packet in the fabric.

Talk to your switch vendor about their plans to support the Transit delay and queueing extension. This extension provides visibility into queue depth and switch transit delay using instrumentation built into the ASIC.

A network topology is required to generate the analytics, see Topology for a description of the JSON file and instructions for generating topologies from Graphviz DOT format, NVIDIA NetQ, Arista eAPI, and NetBox.
Use the Topology Status dashboard to verify that the topology is consistent with the sFlow telemetry and fully monitored. The Locate tab can be used to locate network addresses to access switch ports.

Note: If any gauges indicate an error, click on the gauge to get specific details.

Congratulations! The configuration is now complete and you should see charts in the AI Metrics application Traffic tab. In addition, the AI Metrics dashboard at the top of this page should start to populate with data.

Getting Started provides an introduction to sFlow-RT, describes how to browse metrics and traffic flows using tools included in the Docker image, and links to information on creating applications using sFlow-RT APIs.

Saturday, June 14, 2025

AI network performance monitoring using containerlab

AI Metrics is available on GitHub. The application provides performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.

The screen capture is from a containerlab topology that emulates an AI compute cluster connected by a leaf and spine network. The metrics include:

  • Total Traffic Total traffic entering fabric
  • Operations Total RoCEv2 operations broken out by type
  • Core Link Traffic Histogram of load on fabric links
  • Edge Link Traffic Histogram of load on access ports
  • RDMA Operations Total RDMA operations
  • RDMA Bytes Average RDMA operation size
  • Credits Average number of credits in RoCEv2 acknowledgements
  • Period Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
  • Congestion Total ECN / CNP congestion messages
  • Errors Total ingress / egress errors
  • Discards Total ingress / egress discards
  • Drop Reasons Packet drop reasons

Note: Clicking on peaks in the charts shows values at that time.

This article gives step-by-step instructions to run the demonstration.

git clone https://github.com/sflow-rt/containerlab.git
cd containerlab
./run-clab
Run the above commands to download the sflow-rt/containerlab GitHub project and run Containerlab on a system with Docker installed. Docker Desktop is a convenient way to run the labs on a laptop.
containerlab deploy -t rocev2.yml

Start the 3 stage leaf and spine emulation.

The initial launch may take a couple of minutes as the container images are downloaded for the first time. Once the images are downloaded, the topology deploys in a few seconds.
./topo.py clab-rocev2
Run the command above to send the topology to the AI Metrics application and connect to http://localhost:8008/app/ai-metrics/html/ to access the dashboard shown at the top of this article.
docker exec -it clab-rocev2-h1 hping3 exec rocev2.tcl 172.16.2.2 10000 500 100
Run the command above to simulate RoCEv2 traffic between h1 and h2 during a training run. The run consists of 100 cycles; each cycle involves an exchange of 10000 packets followed by a 500 millisecond pause (simulating GPU compute time). You should immediately see charts updating in the dashboard.
Clicking on the wave symbol in the top right of the Period chart pulls up a fast-updating Periodicity chart that shows how sub-second variability in the traffic due to the exchange / compute cycles is used to compute the period shown in the dashboard, in this case a period of just under 0.7 seconds.
Connect to the included sflow-rt/trace-flow application, http://localhost:8008/app/trace-flow/html/
suffix:stack:.:1=ibbt
Enter the filter above and press Submit. The instant a RoCEv2 flow is generated, its path should be shown. See Defining Flows for information on traffic filters.
docker exec -it clab-rocev2-leaf1 vtysh
The routers in this topology run the open source FRRouting daemon used in data center switches with NVIDIA Cumulus Linux or SONiC network operating systems. Type the command above to access the CLI on the leaf1 switch.
leaf1# show running-config
For example, type the above command in the leaf1 CLI to see the running configuration.
Building configuration...

Current configuration:
!
frr version 10.2.3_git
frr defaults datacenter
hostname leaf1
log stdout
!
interface eth3
 ip address 172.16.1.1/24
 ipv6 address 2001:172:16:1::1/64
exit
!
router bgp 65001
 bgp bestpath as-path multipath-relax
 bgp bestpath compare-routerid
 neighbor fabric peer-group
 neighbor fabric remote-as external
 neighbor fabric description Internal Fabric Network
 neighbor fabric capability extended-nexthop
 neighbor eth1 interface peer-group fabric
 neighbor eth2 interface peer-group fabric
 !
 address-family ipv4 unicast
  redistribute connected route-map HOST_ROUTES
 exit-address-family
 !
 address-family ipv6 unicast
  redistribute connected route-map HOST_ROUTES
  neighbor fabric activate
 exit-address-family
exit
!
route-map HOST_ROUTES permit 10
 match interface eth3
exit
!
ip nht resolve-via-default
!
end

The switch is using BGP to establish equal cost multi-path (ECMP) routes across the fabric - this is very similar to a configuration you would expect to find in a production network. Containerlab provides a great environment to experiment with topologies and configuration options before putting them in production.

The results from this containerlab project are very similar to those observed in production networks, see Comparing AI / ML activity from two production networks. Follow the instructions in AI Metrics with Prometheus and Grafana to deploy this solution to monitor production GPU cluster traffic.

Monday, June 9, 2025

AI Metrics with Grafana Cloud

The Grafana AI Metrics dashboard shown above tracks performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.

The metrics include:

  • Total Traffic Total traffic entering fabric
  • Operations Total RoCEv2 operations broken out by type
  • Core Link Traffic Histogram of load on fabric links
  • Edge Link Traffic Histogram of load on access ports
  • RDMA Operations Total RDMA operations
  • RDMA Bytes Average RDMA operation size
  • Credits Average number of credits in RoCEv2 acknowledgements
  • Period Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
  • Congestion Total ECN / CNP congestion messages
  • Errors Total ingress / egress errors
  • Discards Total ingress / egress discards
  • Drop Reasons Packet drop reasons
AI Metrics with Prometheus and Grafana describes how to stand up an analytics stack with Prometheus and Grafana to track performance metrics for an AI/ML GPU cluster. This article shows how to integrate with Prometheus and Grafana hosted in the cloud, Grafana Cloud, instead of running the services locally.

Note: Grafana Cloud has a free service tier that can be used to test this example.

Save the following compose.yml file on a system running Docker.

configs:
  config.alloy:
    content: |
      prometheus.scrape "prometheus" {
        targets = [{
          __address__ = "sflow-rt:8008",
        }]
        forward_to   = [prometheus.remote_write.grafanacloud.receiver]
        metrics_path = "/app/ai-metrics/scripts/metrics.js/prometheus/txt"
        scrape_interval = "10s"
      }

      prometheus.remote_write "grafanacloud" {
        endpoint {
          url  = "https://<Your Grafana Cloud Prometheus Instance>/api/prom/push"
          basic_auth {
            username = "<Your Grafana.com User ID>"
            password = "<Your Grafana.com API Token>"
          }
        }
      }

networks:

  monitoring:
    driver: bridge

services:

  sflow-rt:
    image: sflow/ai-metrics
    container_name: sflow-rt
    restart: unless-stopped
    ports:
      - '6343:6343/udp'
      - '8008:8008'
    networks:
      - monitoring

  alloy:
    image: grafana/alloy
    container_name: alloy
    restart: unless-stopped
    configs:
      - source: config.alloy
        target: /etc/alloy/config.alloy
    depends_on:
      - sflow-rt
    networks:
      - monitoring
Find the settings needed to upload metrics to Prometheus by clicking on the Send Metrics button in your Grafana Cloud account.
Edit the highlighted prometheus.remote_write endpoint settings (url, username and password) to match those provided by Grafana Cloud.
docker compose up -d
Run the command above to start streaming metrics to Grafana Cloud. Click on the Grafana Launch button to access Grafana and add the AI Metrics dashboard (ID: 23255).
Enable sFlow on all switches in the cluster (leaf and spine) using the recommended settings. Enable sFlow dropped packet notifications to populate the drop reasons metric, see Dropped packet notifications with Arista Networks, NVIDIA Cumulus Linux 5.11 for AI / ML, and Dropped packet notifications with Cisco 8000 Series Routers for examples.

Note: Tuning Performance describes how to optimize settings for very large clusters.

Industry standard sFlow telemetry is uniquely suited to monitoring AI workloads. The sFlow agents leverage instrumentation built into switch ASICs to stream randomly sampled packet headers and metadata in real time. Sampling provides a scalable method of monitoring the large numbers of 400G/800G links found in AI fabrics. Export of packet headers allows the sFlow collector to decode the InfiniBand Base Transport headers to extract operations and RDMA metrics. The Dropped Packet extension uses Mirror-on-Drop (MoD) / What Just Happened (WJH) capabilities in the ASIC to include the packet header, location, and reason for EVERY dropped packet in the fabric.

Talk to your switch vendor about their plans to support the Transit delay and queueing extension. This extension provides visibility into queue depth and switch transit delay using instrumentation built into the ASIC.

A network topology is required to generate the analytics, see Topology for a description of the JSON file and instructions for generating topologies from Graphviz DOT format, NVIDIA NetQ, Arista eAPI, and NetBox.
Use the Topology Status dashboard to verify that the topology is consistent with the sFlow telemetry and fully monitored. The Locate tab can be used to locate network addresses to access switch ports.

Note: If any gauges indicate an error, click on the gauge to get specific details.

Congratulations! The configuration is now complete and you should see charts in the AI Metrics application Traffic tab. In addition, the AI Metrics Grafana dashboard at the top of this page should start to populate with data.

Flow metrics with Prometheus and Grafana and Dropped packet metrics with Prometheus and Grafana describe how to define additional flow-based metrics to incorporate in Grafana dashboards.

Getting Started provides an introduction to sFlow-RT, describes how to browse metrics and traffic flows using tools included in the Docker image, and links to information on creating applications using sFlow-RT APIs.

Monday, May 5, 2025

Multi-vendor support for dropped packet notifications


The sFlow Dropped Packet Notification Structures extension was published in October 2020. Extending sFlow to provide visibility into dropped packets offers significant benefits for network troubleshooting, providing real-time network wide visibility into the specific packets that were dropped as well as the reason the packet was dropped. This visibility instantly reveals the root cause of drops and the impacted connections. Packet discard records complement sFlow's existing counter polling and packet sampling mechanisms and share a common data model so that all three sources of data can be correlated, for example, packet sampling reveals the top consumers of bandwidth on a link, helping to get to the root cause of congestion related packet drops reported for the link.

Today the following network operating systems include support for the drop notification extension in their sFlow agent implementations:

Two additional sFlow dropped packet notification implementations are in the pipeline and should be available later this year:

If your network vendor is on the list, follow the instructions in the linked articles to try out drop monitoring, if not, ask about your vendor's plans to implement the sFlow drop notification extension.

Monday, April 21, 2025

AI Metrics with Prometheus and Grafana

The Grafana AI Metrics dashboard shown above tracks performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.

The metrics include:

  • Total Traffic Total traffic entering fabric
  • Operations Total RoCEv2 operations broken out by type
  • Core Link Traffic Histogram of load on fabric links
  • Edge Link Traffic Histogram of load on access ports
  • RDMA Operations Total RDMA operations
  • RDMA Bytes Average RDMA operation size
  • Credits Average number of credits in RoCEv2 acknowledgements
  • Period Detected period of compute / exchange activity on fabric (in this case just over 0.5 seconds)
  • Congestion Total ECN / CNP congestion messages
  • Errors Total ingress / egress errors
  • Discards Total ingress / egress discards
  • Drop Reasons Packet drop reasons

This article gives step-by-step instructions to set up the dashboard in a production environment.

git clone https://github.com/sflow-rt/prometheus-grafana.git
cd prometheus-grafana
env RT_IMAGE=ai-metrics ./start.sh

The easiest way to get started is to use Docker, see Deploy real-time network dashboards using Docker compose, and deploy the sflow/ai-metrics image bundling the AI Metrics application to generate metrics.

scrape_configs:
  - job_name: 'sflow-rt-ai-metrics'
    metrics_path: /app/ai-metrics/scripts/metrics.js/prometheus/txt
    scheme: http
    static_configs:
      - targets: [ 'sflow-rt:8008' ]
Follow the directions in Deploy real-time network dashboards using Docker compose to add the above Prometheus scrape task to retrieve the metrics and add the Grafana AI Metrics dashboard (ID: 23255).
Enable sFlow on all switches in the cluster (leaf and spine) using the recommended settings. Enable sFlow dropped packet notifications to populate the drop reasons metric, see Dropped packet notifications with Arista Networks, NVIDIA Cumulus Linux 5.11 for AI / ML, and Dropped packet notifications with Cisco 8000 Series Routers for examples.

Note: Tuning Performance describes how to optimize settings for very large clusters.

Industry standard sFlow telemetry is uniquely suited to monitoring AI workloads. The sFlow agents leverage instrumentation built into switch ASICs to stream randomly sampled packet headers and metadata in real time. Sampling provides a scalable method of monitoring the large numbers of 400G/800G links found in AI fabrics. Export of packet headers allows the sFlow collector to decode the InfiniBand Base Transport headers to extract operations and RDMA metrics. The Dropped Packet extension uses Mirror-on-Drop (MoD) / What Just Happened (WJH) capabilities in the ASIC to include the packet header, location, and reason for EVERY dropped packet in the fabric.

Talk to your switch vendor about their plans to support the Transit delay and queueing extension. This extension provides visibility into queue depth and switch transit delay using instrumentation built into the ASIC.

A network topology is required to generate the analytics, see Topology for a description of the JSON file and instructions for generating topologies from Graphviz DOT format, NVIDIA NetQ, Arista eAPI, and NetBox.
Use the Topology Status dashboard to verify that the topology is consistent with the sFlow telemetry and fully monitored. The Locate tab can be used to locate network addresses to access switch ports.

Note: If any gauges indicate an error, click on the gauge to get specific details.

Congratulations! The configuration is now complete and you should see charts in the AI Metrics application Traffic tab. In addition, the AI Metrics Grafana dashboard at the top of this page should start to populate with data.

Flow metrics with Prometheus and Grafana and Dropped packet metrics with Prometheus and Grafana describe how to define additional flow-based metrics to incorporate in Grafana dashboards.

Getting Started provides an introduction to sFlow-RT, describes how to browse metrics and traffic flows using tools included in the Docker image, and links to information on creating applications using sFlow-RT APIs.