Thursday, October 30, 2025

Vector Packet Processor (VPP) dropped packet notifications


Vector Packet Processor (VPP) release 25.10 extends the sFlow implementation to include support for dropped packet notifications, providing detailed, low-overhead visibility into traffic flowing through a VPP router. See Vector Packet Processor (VPP) for performance information.
sflow sampling-rate 10000
sflow polling-interval 20
sflow header-bytes 128
sflow direction both
sflow drop-monitoring enable
sflow enable GigabitEthernet0/8/0
sflow enable GigabitEthernet0/9/0
sflow enable GigabitEthernet0/a/0
The VPP configuration commands above enable sFlow monitoring of the VPP dataplane: randomly sampling packets, periodically polling interface counters, and capturing dropped packets along with their reason codes. The measurements are sent as Linux netlink messages to an instance of the open source Host sFlow agent (hsflowd), which combines them and streams standard sFlow telemetry to a remote collector.
sflow {
  collector { ip=192.0.2.1 udpport=6343 }
  psample { group=1 egress=on }
  dropmon { start=on limit=50 }
  vpp { }
}
The /etc/hsflowd.conf file above enables the modules needed to receive netlink messages from VPP and send the resulting sFlow telemetry to a collector at 192.0.2.1. See vpp-sflow for detailed instructions.
docker run -p 6343:6343/udp sflow/sflowtool
Run sflowtool on the sFlow collector host to verify that the data is being received and to see the information available in the sFlow telemetry stream. The pre-built sflow/sflowtool Docker image is the fastest way to run sflowtool.
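If the full decode is too verbose, sflowtool's -l option prints each sample as a single CSV line, which is easier to scan and script against (the published sflow/sflowtool image passes command line arguments through to sflowtool):
docker run --rm -p 6343:6343/udp sflow/sflowtool -l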
Docker also makes it easy to try out other sFlow analytics tools. For example, Deploy real-time network dashboards using Docker compose describes how to quickly set up a measurement stack consisting of the sFlow-RT real-time analytics engine, the Prometheus time series database, and the Grafana dashboard builder.
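As a rough sketch of what that article sets up (image names and ports match the published examples; volumes and the Prometheus scrape configuration are omitted for brevity), a docker-compose.yml along these lines starts all three services:
services:
  sflow-rt:
    image: sflow/prometheus   # sFlow-RT packaged with its Prometheus export application
    ports:
      - "6343:6343/udp"       # sFlow telemetry from agents
      - "8008:8008"           # REST API and web UI
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
Prometheus still needs a scrape job pointing at the sFlow-RT REST API, and Grafana a Prometheus data source; the article walks through both steps.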

The sFlow VPP module delivers the same industry standard network performance monitoring available in switches and routers from leading vendors (Arista, Cisco, Dell, Juniper, NVIDIA, VyOS, etc.), providing comprehensive, network-wide visibility.

Monday, October 20, 2025

AI / ML network performance metrics at scale

The charts above show information from a GPU cluster running an AI / ML training workload. The 244 nodes in the cluster are connected by 100G links to a single large switch. Industry standard sFlow telemetry from the switch is shown in the two trend charts generated by the sFlow-RT real-time analytics engine. The charts are updated every 100ms.
  • Per Link Telemetry shows RoCEv2 traffic on 5 randomly selected links from the cluster. Each trend is computed from the sFlow random packet samples collected on the link: the packet header in each sample is decoded and the metric is computed for packets identified as RoCEv2 (see the flow definition sketch after this list).
  • Combined Fabric-Wide Telemetry combines the signals from all the links to create a fabric-wide metric. The signals are highly correlated because the AI training compute / exchange cycle is synchronized across all compute nodes in the cluster. Constructive interference from combining the data from all the links removes the noise in each individual signal and clearly shows the traffic pattern for the cluster.
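To give a concrete sense of how such a metric is defined, the following is a minimal sketch using sFlow-RT's REST API (the packaged AI Metrics application installs its own, more detailed, flow definitions). RoCEv2 traffic is identified by its well-known UDP destination port, 4791:
curl -X PUT -H 'Content-Type: application/json' \
  -d '{"keys":"ipsource,ipdestination","value":"bytes","filter":"udpdestinationport=4791"}' \
  http://localhost:8008/flow/rocev2/json
Once the flow is defined, the current largest flows can be read back from http://localhost:8008/activeflows/ALL/rocev2/json.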
This is a relatively small cluster. For larger clusters, the effect is even more pronounced, resulting in extremely sharp cluster-wide metrics. The sFlow instrumentation embedded as a standard feature of data center switch hardware from all leading vendors (Arista, Cisco, Dell, Juniper, NVIDIA, etc.) provides a cost-effective solution for even the largest AI / ML fabrics.
The Grafana AI Metrics dashboard shown above tracks performance metrics for AI / ML RoCEv2 network traffic, for example, large-scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communication: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.

The metrics include:

  • Total Traffic: total traffic entering the fabric
  • Operations: total RoCEv2 operations, broken out by type
  • Core Link Traffic: histogram of load on fabric links
  • Edge Link Traffic: histogram of load on access ports
  • RDMA Operations: total RDMA operations
  • RDMA Bytes: average RDMA operation size
  • Credits: average number of credits in RoCEv2 acknowledgements
  • Period: detected period of compute / exchange activity on the fabric (in this case just over 0.5 seconds)
  • Congestion: total ECN / CNP congestion messages
  • Errors: total ingress / egress errors
  • Discards: total ingress / egress discards
  • Drop Reasons: packet drop reasons
Enable sFlow in your AI / ML networks for detailed visibility into network performance. Instructions are provided in the articles AI Metrics, AI Metrics with Prometheus and Grafana, and AI Metrics with Grafana Cloud.
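For reference, the Prometheus side of those setups is a scrape job that pulls metrics from the sFlow-RT REST API. A sketch of the scrape configuration (the sflow-rt target name assumes sFlow-RT is reachable under that name on port 8008, as in the Docker compose example earlier; adjust to match your deployment):
scrape_configs:
  - job_name: 'sflow-rt-metrics'
    metrics_path: /prometheus/metrics/ALL/ALL/txt   # export all metrics from all agents
    static_configs:
      - targets: ['sflow-rt:8008']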