Friday, July 31, 2020

Using sFlow to monitor dropped packets

Visibility into dropped packets describes instrumentation, recently added to the Linux kernel, that provides visibility into packets dropped by the kernel data path on a host, or dropped by a switch ASIC when packets are forwarded in hardware. This article describes integration of drop monitoring in the open source Host sFlow agent and inclusion of drop reporting as part of industry standard sFlow telemetry.

Extending sFlow to provide visibility into dropped packets offers significant benefits for network troubleshooting, providing real-time network-wide visibility into the specific packets that were dropped as well as the reason each packet was dropped. This visibility instantly reveals the root cause of drops and the impacted connections.

Packet discard monitoring complements sFlow's existing counter polling and packet sampling mechanisms and shares a common data model so that all three sources of data can be correlated.  For example, if packets are being discarded because of buffer exhaustion, the discard records don't necessarily tell the whole story. The discarded packets may represent mice flows that are victims of an elephant flow. Packet samples will reveal the traffic that isn't being dropped and provide a more complete picture. Counter data adds additional information such as CPU load, interface speed, link utilization, packet and discard rates that further completes the picture.

The following steps build the Host sFlow agent with drop monitoring on an Ubuntu 20 system (a Linux 5.4 kernel or newer is required):
git clone
cd host-sflow
sudo make install
sudo make schedule
Next, edit the /etc/hsflowd.conf file, in this example, directing sFlow to be sent to a collector at, enabling packet sampling on host adapter enp0s3, and enabling drop monitoring:
sflow {
  collector { ip= }
  pcap { dev = enp0s3 }
  dropmon { group = 1 start = on }
}
Start the agent:
sudo systemctl enable hsflowd
sudo systemctl start hsflowd
Build the latest version of sflowtool on the collector host (
git clone
cd sflowtool
sudo make install
Now run sflowtool to receive and decode the sFlow telemetry stream:
The following example shows the output for a discarded TCP packet:
startSample ----------------------
sampleType_tag 0:5
sampleType DISCARD
sampleSequenceNo 20
sourceId 0:1
dropEvents 0
inputPort 1
outputPort 0
discardCode 289
discardReason unknown_l4
discarded_flowBlock_tag 0:1
discarded_flowSampleType HEADER
discarded_headerProtocol 1
discarded_sampledPacketSize 54
discarded_strippedBytes 0
discarded_headerLen 54
discarded_headerBytes 00-00-00-00-00-00-00-00-00-00-00-00-08-00-45-00-00-28-00-00-40-00-40-06-3C-CE-7F-00-00-01-7F-00-00-01-04-05-04-39-00-00-00-00-14-51-E2-8A-50-14-00-00-B2-B4-00-00
discarded_dstMAC 000000000000
discarded_srcMAC 000000000000
discarded_IPSize 40
discarded_ip.tot_len 40
discarded_IPProtocol 6
discarded_IPTOS 0
discarded_IPTTL 64
discarded_IPID 0
discarded_TCPSrcPort 1029
discarded_TCPDstPort 1081
discarded_TCPFlags 20
endSample   ----------------------
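The line-oriented output above lends itself to simple post-processing. The following Python sketch is illustrative only: the helper names parse_sample and decode_header are made up for this example, and the decoder assumes an untagged Ethernet frame carrying IPv4. It turns a sample into a dictionary and decodes a few fields from discarded_headerBytes so they can be compared with the values sflowtool prints.

```python
import struct

# A few lines copied from the sflowtool output shown above.
SAMPLE = """\
startSample ----------------------
sampleType DISCARD
discardCode 289
discardReason unknown_l4
discarded_headerBytes 00-00-00-00-00-00-00-00-00-00-00-00-08-00-45-00-00-28-00-00-40-00-40-06-3C-CE-7F-00-00-01-7F-00-00-01-04-05-04-39-00-00-00-00-14-51-E2-8A-50-14-00-00-B2-B4-00-00
endSample   ----------------------
"""

def parse_sample(lines):
    """Turn the 'key value' lines of one sample into a dict (hypothetical helper)."""
    record = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("startSample", "endSample")):
            continue
        key, _, value = line.partition(" ")
        record[key] = value.strip()
    return record

def decode_header(header_bytes):
    """Decode selected fields, assuming untagged Ethernet + IPv4 (+ TCP)."""
    raw = bytes.fromhex(header_bytes.replace("-", ""))
    (ethertype,) = struct.unpack_from("!H", raw, 12)
    fields = {"ethertype": ethertype}
    if ethertype != 0x0800:          # not IPv4, nothing more to decode here
        return fields
    ip = 14                          # IPv4 header follows the 14 byte Ethernet header
    ihl = (raw[ip] & 0x0F) * 4       # IPv4 header length in bytes
    fields["ttl"] = raw[ip + 8]
    fields["protocol"] = raw[ip + 9]
    if fields["protocol"] == 6:      # TCP
        tcp = ip + ihl
        sport, dport = struct.unpack_from("!HH", raw, tcp)
        fields.update(src_port=sport, dst_port=dport, tcp_flags=raw[tcp + 13])
    return fields

record = parse_sample(SAMPLE.splitlines())
print(record["discardReason"], decode_header(record["discarded_headerBytes"]))
```

The decoded TTL, protocol, ports, and flags should agree with the discarded_IPTTL, discarded_IPProtocol, discarded_TCPSrcPort, discarded_TCPDstPort, and discarded_TCPFlags lines in the listing.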
The sflowtool -T option converts the discarded packet records into PCAP format so that they can be decoded by packet analysis tools such as Wireshark and tcpdump:
sflowtool -T | tshark -r -
   12  22.000000 → TCP 78 65527 → 80 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=64 TSval=1324841769 TSecr=0 SACK_PERM=1
The article sFlow to JSON uses Python examples to demonstrate how sflowtool's ability to convert sFlow records into JSON can be used for further analysis.

Thursday, July 16, 2020

Visibility into dropped packets

Dropped packets have a profound impact on network performance and availability. Packet discards due to congestion can significantly impact application performance. Dropped packets due to black hole routes, expired TTLs, MTU mismatches, etc. can result in insidious connection failures that are time consuming and difficult to diagnose.

Devlink Trap describes recent changes to the Linux drop monitor service that provide visibility into packets dropped by switch ASIC hardware. When a packet is dropped by the ASIC, an event is generated that includes the header of the dropped packet and the reason why it was dropped. A hardware policer is used to limit the number of events generated by the ASIC to a rate that can be handled by the Linux kernel. The events are delivered to userspace applications using the Linux netlink service.
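The policer itself lives in the ASIC, so the following is only a conceptual sketch of rate limiting using a generic token bucket, written in Python. The class name and the rate and burst values are hypothetical and are not taken from any driver or from the Devlink Trap article.

```python
class TokenBucket:
    """Generic token bucket: admits at most `rate` events/second after a burst."""

    def __init__(self, rate, burst):
        self.rate = rate        # tokens added per second
        self.burst = burst      # maximum number of stored tokens
        self.tokens = burst     # start with a full bucket
        self.last = 0.0         # time of the previous event, in seconds

    def allow(self, now):
        """Return True if an event arriving at time `now` may be delivered."""
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# 100 drop events arriving at the same instant: only the burst is delivered.
bucket = TokenBucket(rate=10, burst=5)
delivered = sum(bucket.allow(0.0) for _ in range(100))
print(delivered, "of 100 events delivered")
```

Events that exceed the policer rate are not reported, so the reported events should be read as a sample of the drops rather than a complete log.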

Running the dropwatch command line tool on an Ubuntu 20 system demonstrates the instrumentation:
pp@ubuntu20:~$ sudo dropwatch
Initializing null lookup method
dropwatch> set alertmode packet
Setting alert mode
Alert mode successfully set
dropwatch> start
Enabling monitoring...
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
drop at: __udp4_lib_rcv+0xae5/0xbb0 (0xffffffffb05ead95)
origin: software
input port ifindex: 2
timestamp: Wed Jul 15 23:57:36 2020 223253465 nsec
protocol: 0x800
length: 128
original length: 243

drop at: __netif_receive_skb_core+0x14f/0xc90 (0xffffffffb05351af)
origin: software
input port ifindex: 2
timestamp: Wed Jul 15 23:57:36 2020 530212717 nsec
protocol: 0x4
length: 52
original length: 52
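When many drops are reported it can help to summarize them by the kernel function that discarded the packet. The following Python sketch assumes the text format shown above, with a "drop at:" line for each event. The function name count_drops is made up for this example.

```python
import re
from collections import Counter

# Matches lines such as: drop at: __udp4_lib_rcv+0xae5/0xbb0 (0xffffffffb05ead95)
DROP_RE = re.compile(r"^drop at: (?P<symbol>[^+\s]+)\+")

def count_drops(lines):
    """Count drop events per kernel symbol in dropwatch text output."""
    counts = Counter()
    for line in lines:
        match = DROP_RE.match(line.strip())
        if match:
            counts[match.group("symbol")] += 1
    return counts

OUTPUT = """\
drop at: __udp4_lib_rcv+0xae5/0xbb0 (0xffffffffb05ead95)
origin: software
drop at: __netif_receive_skb_core+0x14f/0xc90 (0xffffffffb05351af)
origin: software
"""

for symbol, n in count_drops(OUTPUT.splitlines()).most_common():
    print(n, symbol)
```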
The dropwatch package also includes the dwdump command, allowing the dropped packets to be analyzed using packet decoding tools such as tshark:
pp@ubuntu20:~$ sudo dwdump | tshark -r -
   12   0.777386 BelkinIn_a8:8a:7d → Spanning-tree-(for-bridges)_00 STP 196 Conf. Root = 61440/4095/24:f5:a2:a8:8a:7b  Cost = 0  Port = 0x8003
   13   2.825301 BelkinIn_a8:8a:7d → Spanning-tree-(for-bridges)_00 STP 196 Conf. Root = 61440/4095/24:f5:a2:a8:8a:7b  Cost = 0  Port = 0x8003
Linux is the base for many open source and commercial network operating systems used today (e.g. SONiC, DENT, Cumulus, EOS, NX-OS, and IOS-XR). A standard interface for monitoring packet drops is an exciting development, allowing Linux based network monitoring agents to collect, export, and analyze drop events to rapidly identify the root cause of dropped packets.

Monday, June 8, 2020

Large flow marking using BGP Flowspec

Elephant Detection in Virtual Switches & Mitigation in Hardware discusses a VMware and Cumulus demonstration, Elephants and Mice, in which the virtual switch on a host detects and marks large "Elephant" flows and the hardware switch enforces priority queueing to prevent Elephant flows from adversely affecting latency of small "Mice" flows.

SDN and WAN optimization describes a presentation by Amin Vahdat describing Google's SDN based wide area network traffic engineering solution in which traffic prioritization allows Google to reduce costs by fully utilizing WAN bandwidth.

Deconstructing Datacenter Packet Transport describes how priority marking of packets associated with large flows can improve completion times for flows crossing the data center fabric. Simulation results presented in the paper show that prioritization of short flows over large flows can significantly improve throughput (reducing flow completion times by a factor of 5 or more at high loads).

This article demonstrates a self-contained real-time Elephant flow marking solution that leverages the real-time visibility and control features available using commodity switch hardware.

The diagram shows the elements of the solution. An instance of the sFlow-RT real-time analytics engine receives streaming sFlow telemetry from a pair of edge routers. A mix of many small flows and a few large flows arrives at the left router. All flows have the default Best Effort (be) Differentiated Services Code Point (DSCP) 0 marking (indicated in blue). As soon as a large flow is detected, a BGP Flowspec rule is pushed to the router, remarking the flow as Lower Effort (le) DSCP 1 (see RFC 8622: A Lower-Effort Per-Hop Behavior (LE PHB) for Differentiated Services). The large flow is continuously monitored and the Flowspec rule is withdrawn when the flow ends.

The less than best effort class ensures that the large flow doesn't compete for bandwidth and buffer resources with the small flows, ensuring faster completion times and lower latency for time sensitive traffic while minimally impacting throughput of the large flow.
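When checking the marking in a packet capture, bear in mind that the DSCP occupies the upper six bits of the IPv4 TOS (or IPv6 Traffic Class) byte, so DSCP 1 appears as a TOS value of 0x04. The following Python helpers are a small illustration; the function names are made up for this example.

```python
def dscp_to_tos(dscp):
    """Return the TOS / Traffic Class byte for a DSCP value (ECN bits zero)."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP is a 6 bit value")
    return dscp << 2

def tos_to_dscp(tos):
    """Extract the DSCP from a TOS / Traffic Class byte, ignoring ECN bits."""
    return (tos & 0xFF) >> 2

print(hex(dscp_to_tos(1)))   # Lower Effort (le)
print(hex(dscp_to_tos(0)))   # Best Effort (be)
```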

The following partial configuration enables sFlow and BGP Flowspec on an Arista EOS device (EOS 4.22 or later):
service routing protocols model multi-agent
sflow sample 16384
sflow polling-interval 30
sflow destination
sflow run
interface Ethernet1
   flow-spec ipv4 ipv6
interface Management1
   ip address
ip routing
ipv6 unicast-routing
router bgp 64497
   neighbor remote-as 65070
   neighbor transport remote-port 1179
   neighbor allowas-in 3
   neighbor send-community extended
   neighbor maximum-routes 12000 
   address-family flow-spec ipv4
      neighbor activate
   address-family flow-spec ipv6
      neighbor activate
   address-family ipv4
      neighbor activate
   address-family ipv6
      neighbor activate
The following sFlow-RT mark.js script implements the flow marking controller:
var routers = [
var my_as = '65030';
var my_id = '';
var flow_t = 2;
var threshold_val = 10000000/8;
var threshold_t = 10;
var dscp_name = 'le';
var dscp_val = 1;
var enable_v6 = false;
var max_controls = 1000;

var controls = {};

var bgp_opts = {ipv6:enable_v6,flowspec:true,flowspec6:enable_v6};

function bgpClose(router) {
  var key, ctl;
  for(key in controls) {
    ctl = controls[key];
    if(ctl.router != router) continue;

    ctl.success = false;
function bgpOpen(router) {
  var key, ctl;
  for(key in controls) {
    ctl = controls[key];
    if(ctl.router != router) continue;

    ctl.success = bgpAddFlow(ctl.router, ctl.flowspec);
var agentToRouter = {};
var controlCount = {};
routers.forEach(function(rec) {
  agentToRouter[rec.agent] = rec.router || rec.agent;
  controlCount[rec.router] = 0;

setFlow('mark_tcp', {
  keys: 'ipsource,ipdestination,tcpsourceport,tcpdestinationport',
setFlow('mark_tcp6', {
  keys: 'ip6source,ip6destination,tcpsourceport,tcpdestinationport',

setThreshold('mark_tcp', {
setThreshold('mark_tcp6', {

setEventHandler(function(evt) {
  var router = agentToRouter[evt.agent];
  if(!router) {

  var key = router + '-' + evt.flowKey;
  if(controls[key]) {

  if(controlCount[router] >= max_controls) {

  var [saddr,daddr,sport,dport] = evt.flowKey.split(',');
  var ctl = {
    flowspec: {
      match: {
      then: {
  switch(evt.eventID) {
    case 'mark_tcp':
      ctl.flowspec.match.version = '4';
      ctl.flowspec.match.protocol = '=6';
    case 'mark_tcp6':
      ctl.flowspec.match.version = '6';
      ctl.flowspec.match.protocol = '=6';
  if(!enable_v6 && '6' == ctl.flowspec.match.version) {

  ctl.success = bgpAddFlow(ctl.router, ctl.flowspec);
  controls[ctl.key] = ctl;
  logInfo('mark add '+router+' '+evt.flowKey);

setIntervalHandler(function(now) {
  var key, ctl, evt, triggered;
  for(key in controls) {
    ctl = controls[key];
    evt = ctl.event;
    if(thresholdTriggered(evt.thresholdID, evt.agent,
                          evt.flowKey)) continue;

    if(ctl.success) bgpRemoveFlow(ctl.router,ctl.flowspec);
    delete controls[key];
    logInfo('mark remove '+ctl.router+' '+ctl.event.flowKey);
Some notes on the script:
  1. The routers array contains the set of BGP routers that are to be controlled. The router attribute specifies the IP address that will initiate the BGP connection and the agent attribute specifies the sFlow agent address of the router. 
  2. TCP connections exceeding the threshold_val of 10Mbit/s will be marked.
  3. The max_controls value of 1000 caps the number of Flowspec rules that can be installed in each router in order to avoid exceeding the capabilities of the hardware.
  4. The setFlow() function, see Defining Flows, tracks ingress TCP flows that haven't been marked as LE.
  5. The setThreshold() function defines a threshold to identify large unmarked flows.
  6. The setEventHandler() function triggers the marking action in response to a threshold event.
  7. The setIntervalHandler() function runs every second, finding large flows that have finished and removing their controls.
  8. See Writing Applications for more information.
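The control lifecycle described in the notes can be modelled outside sFlow-RT. The Python sketch below is a toy model and not the sFlow-RT API: MarkController, on_threshold, and on_interval are hypothetical names, the still_large callback stands in for thresholdTriggered(), and the router name r1 is a placeholder. It shows a rule being added on a threshold event, capped per router by max_controls, and removed once the flow is no longer large.

```python
# threshold_val in the script: 10 Mbit/s expressed in bytes per second.
THRESHOLD_BYTES_PER_SEC = 10_000_000 / 8

class MarkController:
    """Toy model of the add/remove logic (hypothetical, for illustration)."""

    def __init__(self, max_controls=1000):
        self.max_controls = max_controls
        self.controls = {}   # control key -> router
        self.count = {}      # router -> number of installed rules

    def on_threshold(self, router, flow_key):
        """Handle a threshold event; return True if a rule was installed."""
        key = router + "-" + flow_key
        if key in self.controls:
            return False                     # flow is already marked
        if self.count.get(router, 0) >= self.max_controls:
            return False                     # protect the router's rule table
        self.controls[key] = router
        self.count[router] = self.count.get(router, 0) + 1
        return True

    def on_interval(self, still_large):
        """Remove rules for flows that still_large(key) reports as finished."""
        removed = []
        for key in list(self.controls):
            if still_large(key):
                continue
            router = self.controls.pop(key)
            self.count[router] -= 1
            removed.append(key)
        return removed

ctl = MarkController(max_controls=1)
print(ctl.on_threshold("r1", "a,b,1,2"))     # rule installed
print(ctl.on_threshold("r1", "c,d,3,4"))     # rejected, router r1 is at its cap
print(ctl.on_interval(lambda key: False))    # all flows finished, rules removed
```

Keeping the per-router count alongside the controls makes the cap check cheap, at the cost of having to update both structures together.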
The easiest way to run the script is to use Docker with the pre-built sflow/ddos-protect image. Running the following command on host launches the controller:
docker run --net=host \
-v $PWD/mark.js:/sflow-rt/mark.js \
sflow/ddos-protect -Dscript.file=mark.js
Use iperf to generate a large flow to test the controller.
localhost#sh bgp flow-spec ipv4
BGP Flow Specification rules for VRF default
Router identifier, local AS number 65096
Rule status codes: # - not installed, M - received from multiple peers

   Matching Rule                                                Actions;;DP:=5001;SP:=39208;          Mark DSCP: 0x1
Command line output from the edge router confirms that the large flow has been detected and is being marked.

Note: Real-time DDoS mitigation using BGP RTBH and FlowSpec describes how DDoS attacks can be automatically mitigated in real-time using a control scheme very similar to the one described in this article. The Docker image used above includes the DDoS mitigation controller.

Tuesday, May 12, 2020

Real-time network and system metrics as a service

The sFlow-RT real-time analytics engine receives industry standard sFlow telemetry as a continuous stream from network and host devices and converts the raw data into useful measurements that can be queried through a REST API. A single sFlow-RT instance can monitor the entire data center, providing a comprehensive view of performance, not just of the individual components, but of the data center as a whole.

This article is an interactive tutorial intended to familiarize the reader with the REST API. The examples can be run on a laptop using recorded data so that access to a live network is not required.

The data was captured from the leaf and spine test network shown above (described in Fabric View).
curl -O
First, download the captured sFlow data.

You will need to have a system with Java or Docker to run the sFlow-RT software.
curl -O
tar -xzf sflow-rt.tar.gz
./sflow-rt/ sflow-rt browse-metrics
./sflow-rt/ sflow-rt browse-flows
./sflow-rt/ sflow-rt prometheus
./sflow-rt/ -Dsflow.file=$PWD/ecmp.pcap
The above commands download and run sFlow-RT, with browse-metrics, browse-flows, and prometheus applications on a system with Java 1.8+ installed.
docker run --rm -v $PWD/ecmp.pcap:/sflow-rt/ecmp.pcap \
-p 8008:8008 --name sflow-rt sflow/prometheus -Dsflow.file=ecmp.pcap
Alternatively, the above command runs sFlow-RT and applications using Docker.
The REST API is documented using OpenAPI. Use a web browser to access the REST API explorer at http://localhost:8008/api/index.html.

Each of the examples can be run in a terminal window using curl, or you can simply click on the link to see the results in your web browser.
Measurements are represented by sFlow-RT in the form of a logical table. Each agent is a device on the network and is uniquely identified by an IP address. Each agent may have one or more datasources that represent a logical source of measurements. For example, a network switch will have a data source for each port.
curl http://localhost:8008/agents/json
List the sFlow agents.
curl http://localhost:8008/metrics/json
List the names of the metrics being exported by the agents. The available metrics depend on the types of agent streaming data to sFlow-RT. For a list of supported sFlow metrics, see Metrics.
curl http://localhost:8008/metric/ALL/max:ifinutilization,max:ifoututilization/json
Find the switch ports with the highest input and output utilization.

The metric query walks the table and returns a value that summarizes each metric in a comma separated list. The following summary statistics are supported:
  • max: Maximum value
  • min: Smallest value
  • sum: Total value
  • avg: Average value
  • var: Variance
  • sdev: Standard deviation
  • med: Median value
  • q1: First quartile
  • q2: Second quartile (same as med:)
  • q3: Third quartile
  • iqr: Inter-quartile range (i.e. q3 - q1)
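The same metric queries can be issued from a script. The following Python sketch builds the query URL from (statistic, metric) pairs using the prefixes listed above. The helper names metric_url and query are made up for this example, and query() only works when sFlow-RT is running at the base URL used in this tutorial.

```python
import json
from urllib.request import urlopen

# Summary statistic prefixes listed in the tutorial.
SUMMARY_STATS = {"max", "min", "sum", "avg", "var", "sdev",
                 "med", "q1", "q2", "q3", "iqr"}

def metric_url(metrics, agents="ALL", base="http://localhost:8008"):
    """Build a metric query URL from a list of (statistic, metric name) pairs."""
    parts = []
    for stat, name in metrics:
        if stat not in SUMMARY_STATS:
            raise ValueError("unknown summary statistic: " + stat)
        parts.append(f"{stat}:{name}")
    return f"{base}/metric/{agents}/{','.join(parts)}/json"

def query(url):
    """Fetch and decode a JSON result (requires a running sFlow-RT instance)."""
    with urlopen(url) as response:
        return json.load(response)

url = metric_url([("max", "ifinutilization"), ("max", "ifoututilization")])
print(url)
# result = query(url)   # uncomment when sFlow-RT is listening on localhost:8008
```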
The browse-metrics application makes use of the metric REST API and can be used to query and trend metrics.
Click on the link below to plot a graph of the switch port with the highest input utilization (screen capture shown above):
The following examples show how to retrieve metric values without summarization.
curl http://localhost:8008/table/ALL/ifinutilization,ifoututilization/json
Get a table of input and output utilization for every switch port. The table query doesn't summarize metrics. Instead, the query returns rows from the logical table that include the metrics specified in the query.
curl http://localhost:8008/dump/ALL/ALL/json
Dump all metric values for all agents. The dump query is similar to the table query, but instead of walking the table row by row, individual metrics are traversed in their internal order.
curl http://localhost:8008/prometheus/metrics/ALL/ALL/txt
Dump all metric values in Prometheus Exporter format. For example, the Grafana sFlow-RT Network Interfaces dashboard makes use of the query to populate the Prometheus time series database.

There are two types of measurement carried by sFlow: periodically exported counters and randomly sampled packets. So far the examples have been querying metrics derived from the counters.
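Because packets are randomly sampled, totals are estimated by scaling the number of samples by the sampling rate, and the accuracy of the estimate depends on how many samples were seen. The following Python sketch shows the arithmetic. The function names and the 1-in-16384 sampling rate are example values only, and the error bound is the commonly quoted 95% confidence rule of thumb for packet sampling, not something computed by sFlow-RT in this form.

```python
import math

def estimate_frames(samples, sampling_rate):
    """Estimate the total number of frames from a count of sampled packets."""
    return samples * sampling_rate

def percent_error(samples):
    """Approximate upper bound on relative error (%) at 95% confidence."""
    if samples <= 0:
        raise ValueError("need at least one sample")
    return 196 * math.sqrt(1 / samples)

samples = 100
print(estimate_frames(samples, 16384), "frames, within about",
      round(percent_error(samples), 1), "%")
```

Large flows generate many samples and are therefore measured accurately, which is why sampling works well for finding the top flows.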
curl http://localhost:8008/flowkeys/json
Get the list of attributes that are being extracted from sampled packet headers. The available attributes depend on the type of traffic flowing in the network. For a list of supported packet attributes, see Defining Flows.
curl -H "Content-Type:application/json" -X PUT \
--data '{"keys":"ipsource,ipdestination","value":"bytes"}' \
Define a new "flow" metric called srcdst that calculates the bytes per second between each pair of communicating IP addresses on the network.
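Building the request body with a JSON library avoids quoting mistakes in hand-written JSON. The following Python sketch constructs the equivalent PUT request. The helper name flow_request is made up, and the /flow/{name}/json path is an assumption that should be checked against the REST API explorer before use.

```python
import json
from urllib.request import Request

def flow_request(name, keys, value, base="http://localhost:8008"):
    """Build (but do not send) a PUT request defining a flow metric."""
    body = json.dumps({"keys": ",".join(keys), "value": value}).encode("utf-8")
    return Request(
        f"{base}/flow/{name}/json",   # assumed endpoint path, verify locally
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

request = flow_request("srcdst", ["ipsource", "ipdestination"], "bytes")
print(request.get_method(), request.full_url, request.data.decode())
# To send it: urllib.request.urlopen(request), with sFlow-RT running.
```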
curl http://localhost:8008/metric/ALL/max:srcdst/json
Find the maximum value of the newly defined srcdst flow metric, i.e. the switch port on the network observing the highest bandwidth flow of packets.
 "agent": "",
 "metricName": "max:srcdst",
 "topKeys": [
   "lastUpdate": 1274,
   "value": 3.392739739066506E8,
   "key": ","
   "lastUpdate": 2352,
   "value": 2.155296894816872E8,
   "key": ","
 "metricN": 10,
 "lastUpdate": 1275,
 "lastUpdateMax": 2031,
 "metricValue": 3.392739739066506E8,
 "dataSource": "4",
 "lastUpdateMin": 1267
In addition to providing a metric value, the result also includes topKeys, showing the top flows seen at the switch port.
Click on the link below to trend the srcdst metric (screen capture shown above):
There are additional queries specific to flow metrics.
curl http://localhost:8008/activeflows/ALL/srcdst/json
Find the largest flows gathered from all the interfaces in the network.
  "flowN": 7,
  "agent": "",
  "value": 5.537867023642346E8,
  "dataSource": "4",
  "key": ","
  "flowN": 6,
  "agent": "",
  "value": 5.1034569007443213E8,
  "dataSource": "38",
  "key": ","
  "flowN": 6,
  "agent": "",
  "value": 1469003.6788768284,
  "dataSource": "4",
  "key": ","
  "flowN": 7,
  "agent": "",
  "value": 1306006.2405022713,
  "dataSource": "37",
  "key": ","
Each flow returned identifies the number of locations it was observed and the port with the maximum value. For example, the largest flow from to was seen by 7 data sources and its maximum value 5.5e8 was observed by data source 4 on agent
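The srcdst values are in bytes per second, so it can be convenient to convert them to bits per second when comparing them with link speeds. A minimal Python helper (the name format_rate is made up for this example) applied to two of the values shown above:

```python
def format_rate(bytes_per_second):
    """Format a bytes/second value as a human readable bit rate."""
    bits = bytes_per_second * 8
    for unit, scale in (("Gbit/s", 1e9), ("Mbit/s", 1e6), ("kbit/s", 1e3)):
        if bits >= scale:
            return f"{bits / scale:.2f} {unit}"
    return f"{bits:.0f} bit/s"

print(format_rate(5.537867023642346E8))   # largest flow in the result above
print(format_rate(1469003.6788768284))    # third flow in the result above
```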
Click on the link below to plot a graph of the top flows using the browse-flows application (screen capture shown above):
Note how quickly the graph changes as it tracks new elephant flows in real time.

See RESTflow for a more detailed discussion of sFlow-RT's flow REST API.

This tutorial has just scratched the surface of the capabilities of sFlow-RT's analytics engine. The Writing Applications tutorial provides further examples and a discussion of how to build applications using Python and JavaScript. See Real-time DDoS mitigation using BGP RTBH and FlowSpec, Fabric View, and Flow metrics with Prometheus and Grafana for examples of sFlow-RT applications.

Seeing your own data is more interesting than a canned demonstration. Network Equipment lists devices that support sFlow. Ubuntu 18.04 and CentOS 8 describe how to install the open source Host sFlow agent on popular Linux distributions, extending visibility into compute and cloud infrastructure. The Host sFlow agent is also available as a Docker image for easy deployment with container orchestration systems, see Host, Docker, Swarm and Kubernetes monitoring.

Even if you don't have access to a production environment, the Docker testbed and Kubernetes testbed examples show how to build a virtual testbed using Docker Desktop. Alternatively, Mininet flow analytics and Mininet dashboard provide starting points if you want to experiment with software defined networking (SDN).

Finally, join the sFlow-RT community to ask questions and share solutions and operational experience.

Tuesday, May 5, 2020

NVIDIA, Mellanox, and Cumulus

Recent press releases, Riding a Cloud: NVIDIA Acquires Network-Software Trailblazer Cumulus and NVIDIA Completes Acquisition of Mellanox, Creating Major Force Driving Next-Gen Data Centers, describe NVIDIA's moves to provide high speed data center networks to connect compute clusters that use their GPUs to accelerate big data workloads, including: deep learning, climate modeling, animation, data visualization, physics, molecular dynamics, etc.

Real-time visibility into compute, network, and GPU infrastructure is required to manage and optimize the unified infrastructure. This article explores how the industry standard sFlow technology supported by all three vendors can deliver comprehensive visibility.

Cumulus Linux simplifies operations, providing the same operating system, Linux, that runs on the servers. Cumulus Networks and Mellanox have a long history of working with the Linux community to integrate support for switches. The latest Linux kernels now include native support for network ASICs, seamlessly integrating with standard Linux routing (FRR, Quagga, Bird, etc), configuration (Puppet, Chef, Ansible, etc) and monitoring (collectd, netstat, top, etc) tools.

Linux 4.11 kernel extends packet sampling support describes enhancements to the Linux kernel to support industry standard sFlow instrumentation in network ASICs. Cumulus Linux and Mellanox both support the new Linux APIs. Cumulus Linux uses the open source Host sFlow agent to stream telemetry gathered from the hardware, Linux operating system, and applications to a remote collector.

Ubuntu 18.04 and CentOS 8 describe how to install the Host sFlow agent on popular host Linux distributions. The Host sFlow agent is also available as a Docker image for easy deployment with container orchestration systems, see Host, Docker, Swarm and Kubernetes monitoring. Extending network visibility to the host allows network traffic to be associated with applications running on the host as well as providing details about the resources consumed by the applications and the network quality of service being delivered to the applications.

The Host sFlow agent also supports the sFlow NVML GPU Structures extension to export key metrics from NVIDIA GPUs using the NVIDIA Management Library (NVML), see GPU performance monitoring.

Enabling sFlow across the network, compute, and GPU stack provides a real-time, data center wide, view of performance. The sFlow-RT real-time analytics engine offers a convenient method of integrating sFlow analytics with popular orchestration, DevOps and SDN tools, examples include: Cumulus Networks, sFlow and data center automation; Flow metrics with Prometheus and Grafana; ECMP visibility with Cumulus Linux; Fabric View; and Troubleshooting connectivity problems in leaf and spine fabrics.

Friday, April 24, 2020

Monitoring DDoS mitigation

Real-time DDoS mitigation using BGP RTBH and FlowSpec and Pushing BGP Flowspec rules to multiple routers describe how to deploy the ddos-protect application. This article focuses on how to monitor DDoS activity and control actions.

The diagram shows the elements of the solution. Routers stream standard sFlow telemetry to an instance of the sFlow-RT real-time analytics engine running the ddos-protect application. The instant a DDoS attack is detected, RTBH and/or Flowspec actions are pushed via BGP to the routers to mitigate the attack. Key metrics are published using the Prometheus exporter format over HTTP and events are sent using the standard syslog protocol.
The sFlow-RT DDoS Protect dashboard, shown above, makes use of the Prometheus time series database and the Grafana metrics visualization tool to track DDoS attack mitigation actions.
The sFlow-RT Countries and Networks dashboard, shown above, breaks down traffic by origin network and country to provide an indication of the source of attacks. Flow metrics with Prometheus and Grafana describes how to build further dashboards that provide additional insight into network traffic.
In this example, syslog events are directed to an Elasticsearch, Logstash, and Kibana (ELK) stack where they are archived, queried, and analyzed. Grafana can be used to query Elasticsearch to incorporate event data in dashboards. The Grafana dashboard example above trends DDoS events and displays key information in a table below.

The tools demonstrated in this article are not the only ones that can be used. If you already have monitoring for your infrastructure then it makes sense to leverage the existing tools rather than stand up a new monitoring system. Syslog is a standard that is widely supported by on-site (e.g. Splunk) and cloud based (e.g. Solarwinds Loggly) SIEM tools. Similarly, the Prometheus metrics export protocol is widely supported (e.g. InfluxDB).

Wednesday, April 15, 2020

Pushing BGP Flowspec rules to multiple routers

Real-time DDoS mitigation using BGP RTBH and Flowspec describes the open source DDoS Protect application. The software runs on the sFlow-RT real-time analytics engine, which receives industry standard sFlow telemetry from routers and pushes controls using BGP. A recent enhancement to the application pushes controls to multiple routers in order to protect networks with redundant edge routers.
Configuring multiple BGP connections is simple: the ddos_protect.router configuration option has been extended to accept a comma separated list of IP addresses for the routers that will be connecting to the controller.
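It can be worth validating the list before it is deployed. The following Python sketch only illustrates the comma separated format and is not how the application parses the option. The function name parse_router_list is made up, and the addresses come from the 192.0.2.0/24 documentation range.

```python
import ipaddress

def parse_router_list(value):
    """Split a comma separated list of router addresses and validate each one."""
    routers = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue                      # tolerate stray commas
        # ip_address() raises ValueError for anything that is not an IP address
        routers.append(str(ipaddress.ip_address(item)))
    return routers

print(parse_router_list("192.0.2.1, 192.0.2.2"))
```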
Alternatively, a BGP Flowspec/RTBH reflector can be used to propagate the controls. Flowspec is a recent addition to open source BGP software, FRR and Bird, and it should be possible to use this software to reflect Flowspec controls. A reflector can be a useful place to implement policies that direct controls to specific enforcement devices.

Support for multiple BGP connections in the DDoS Protect application reduces the complexity of simple deployments by removing the requirement for a reflector. Controls are pushed to all devices, but differentiated policies can still be implemented by configuring each device's response to controls.