Sunday, May 22, 2016


Open Network Switch Layer (OpenNSL) is an openly available library of network switch APIs for programming platforms based on Broadcom network switch silicon. These open APIs enable development of networking application software for platforms built on the Broadcom network switch architecture.

The recent inclusion of the APIs needed to enable sFlow instrumentation in Broadcom hardware allows open source network operating systems such as OpenSwitch and Open Network Linux to implement the sFlow telemetry standard.

Saturday, May 21, 2016

Mininet dashboard

Mininet Dashboard has been released on GitHub. Follow the steps in Mininet flow analytics to install sFlow-RT and configure sFlow instrumentation in Mininet.

The following steps install the dashboard and start sFlow-RT:
cd sflow-rt
./get-app sflow-rt mininet-dashboard
The dashboard web interface shown in the screen shot should now be accessible. Run a test to see data in the dashboard. The following test created the results shown:
sudo mn --custom extras/ --link tc,bw=10 --topo tree,depth=2,fanout=2 --test iperf
The dashboard has three time series charts that update every second and show five minutes worth of data. From top to bottom, the charts are:
  1. Top Flows - Click on a peak in the chart to see the flows that were active at that time.
  2. Top Ports - Click on a peak in the chart to see the ingress ports that were active at that time.
  3. Topology Diameter - The diameter of the topology.
The dashboard application is easily modified to add additional metrics, generate events, or implement controls. For example, adding the following code to the end of the sflow-rt/app/mininet-dashboard/scripts/metrics.js file implements equivalent functionality to the large flow detection Python script described in Mininet flow analytics:


setEventHandler(function(evt) {
Restart sFlow-RT and repeat the iperf test and the following events should be logged:
$ ./ 
2016-05-21T18:00:03-0700 INFO: Listening, sFlow port 6343
2016-05-21T18:00:03-0700 INFO: Starting the Jetty [HTTP/1.1] server on port 8008
2016-05-21T18:00:03-0700 INFO: Starting application
2016-05-21T18:00:03-0700 INFO: Listening, http://localhost:8008
2016-05-21T18:00:03-0700 INFO: app/mininet-dashboard/scripts/metrics.js started
2016-05-21T18:00:12-0700 INFO: s1-s2,,
2016-05-21T18:00:12-0700 INFO: s1-s3,,
See Writing Applications for more information.

Thursday, May 19, 2016

Mininet flow analytics

Mininet is free software that creates a realistic virtual network, running real kernel, switch and application code, on a single machine (VM, cloud or native), in seconds. Mininet is useful for development, teaching, and research. Mininet is also a great way to develop, share, and experiment with OpenFlow and Software-Defined Networking systems.

This article shows how standard sFlow instrumentation built into Mininet can be combined with sFlow-RT analytics software to provide real-time traffic visibility for Mininet networks. Augmenting Mininet with sFlow telemetry realistically emulates the instrumentation built into most vendors' switch hardware, provides visibility into Mininet experiments, and opens up new areas of research (e.g. SDN and large flows).

The following papers are a small selection of projects using sFlow-RT:
In order to make it easier to get started, the latest release of sFlow-RT includes a Mininet helper script that automates sFlow configuration. The following example shows how to use the script and build a simple application in Python.

Install sFlow-RT on the Mininet host:
tar -xvzf sflow-rt.tar.gz
cd sflow-rt
In a second terminal, add the --custom argument to the Mininet command line. For example, the following command builds a depth 2 tree topology with link bandwidths of 10Mbit/s:
cd sflow-rt
sudo mn --custom extras/ --link tc,bw=10 --topo tree,depth=2,fanout=2
The script extends Mininet, automatically enabling sFlow on each of the switches in the topology, and posting a JSON representation of the Mininet topology using sFlow-RT's REST API.
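A JSON representation of a topology can be sketched as follows. This is only an illustration: the `build_topology` function, the port names, and the layout (a `links` dictionary with `node1`/`port1`/`node2`/`port2` fields) are assumptions, and the helper script may build and post its data differently.

```python
import json


def build_topology(links):
    """Return a topology dictionary for a list of (node1, port1, node2, port2) links.

    The layout (a 'links' dictionary with node1/port1/node2/port2 fields)
    is an assumption for illustration; link names join the two node names.
    """
    topology = {'links': {}}
    for node1, port1, node2, port2 in links:
        topology['links'][node1 + '-' + node2] = {
            'node1': node1, 'port1': port1,
            'node2': node2, 'port2': port2,
        }
    return topology


if __name__ == "__main__":
    # Hypothetical port names for two links of a depth 2, fanout 2 tree
    example = [('s1', 's1-eth1', 's2', 's2-eth3'),
               ('s1', 's1-eth2', 's3', 's3-eth3')]
    print(json.dumps(build_topology(example), indent=2))
```

Keying each link by the names of its two endpoints keeps the link identifier readable when it later appears in flow keys and events.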

Traffic engineering of large "Elephant" flows is an active area of research. The following Python script demonstrates how Elephant flows can be detected using sFlow-RT REST API calls:
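#!/usr/bin/env python
import requests
import json

# base URL of the sFlow-RT REST API (set to your sFlow-RT address)
rt = ''

# flow definition: bytes/second keyed by link, source and destination address
flow = {'keys':'link:inputifindex,ipsource,ipdestination','value':'bytes'}
# register the flow as 'pair' (assumed call; the name matches the threshold metric)
requests.put(rt+'/flow/pair/json',data=json.dumps(flow))

# threshold: 1Mbit/s expressed in bytes/second
threshold = {'metric':'pair','value':1000000/8,'byFlow':True,'timeout':1}
# register the threshold as 'elephant' (assumed call; the name matches thresholdID below)
requests.put(rt+'/threshold/elephant/json',data=json.dumps(threshold))

eventurl = rt+'/events/json?thresholdID=elephant&maxEvents=10&timeout=60'
eventID = -1
while True:
  # long poll: blocks until new events arrive or the timeout expires
  r = requests.get(eventurl + "&eventID=" + str(eventID))
  if r.status_code != 200: break
  events = r.json()
  if len(events) == 0: continue

  # resume from the first event in the response on the next poll
  eventID = events[0]["eventID"]
  for e in events:
    print(e['flowKey'])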
Some notes on the script:
  • The link:inputifindex function in the flow definition identifies the link in the topology associated with the ingress port on the Mininet switch, see Defining Flows
  • The script defines an Elephant flow as a flow that consumes 10% of the link bandwidth. In this example Mininet was configured with a link bandwidth of 10Mbit/s, so an Elephant is a flow that exceeds 1Mbit/s. Since the specified flow measures traffic in bytes/second, the threshold needs to be converted to bytes/second (dividing by 8).
  • The sFlow-RT REST API uses long-polling as a method of asynchronously pushing events to the client. The events HTTP request blocks until there are new events or a timeout occurs. The client immediately reconnects after receiving a response to wait for further events. 
See Writing Applications for additional information.
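The threshold arithmetic in the notes above can be checked with a short sketch. The function name `elephant_threshold` and its parameters are illustrative assumptions and are not part of sFlow-RT.

```python
def elephant_threshold(link_bits_per_sec, percent=10):
    """Return the rate in bytes/second above which a flow counts as an Elephant.

    Hypothetical helper: takes a percentage of the link speed and converts
    it from bits/second to bytes/second, since the flow measures 'bytes'.
    """
    return link_bits_per_sec * percent / 100 / 8


if __name__ == "__main__":
    # 10Mbit/s Mininet links, Elephant defined as 10% of link bandwidth
    print(elephant_threshold(10000000))
```

For 10Mbit/s links this gives the same value as the `1000000/8` expression used in the script's threshold definition.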

Start the script:
$ ./
Run an iperf test using the Mininet CLI:
mininet> iperf h1 h3
*** Iperf: testing TCP bandwidth between h1 and h3
*** Results: ['9.06 Mbits/sec', '9.98 Mbits/sec']
The following results should appear as soon as the flow is detected:
$ ./ 
The output identifies the links carrying the flow between h1 and h3 and shows the IP addresses of the hosts.
The sFlow-RT web interface provides basic charting capabilities. The chart above shows a sequence of iperf tests.

The Python script can easily be modified to address a number of interesting use cases. Instead of simply printing events, a REST call can be made to an OpenFlow controller (POX, OpenDaylight, Floodlight, ONOS, etc) to apply controls to mark, mirror, load balance, rate limit, or block flows.
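As a rough sketch of that idea, the flow key from an event can be split into its fields before building a request for a controller. The sketch assumes the key fields are comma separated in the order given in the flow definition. The `build_control` function, the `action` field, and the example addresses are hypothetical; each controller defines its own REST schema.

```python
def parse_flow_key(flow_key):
    """Split an event's flowKey into link, source IP and destination IP.

    Assumes the fields are comma separated and appear in the order used
    in the flow definition: link:inputifindex,ipsource,ipdestination.
    """
    link, src, dst = flow_key.split(',')
    return {'link': link, 'ipsource': src, 'ipdestination': dst}


def build_control(flow_key, action='block'):
    """Build a hypothetical control request for a detected Elephant flow.

    The 'action' field and dictionary layout are illustrative only.
    """
    control = parse_flow_key(flow_key)
    control['action'] = action
    return control


if __name__ == "__main__":
    # Example addresses, not output taken from the article
    print(build_control('s1-s2,10.0.0.1,10.0.0.3', action='mark'))
```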

Wednesday, May 18, 2016

Identifying bad ECMP paths

In the talk Move Fast, Unbreak Things! at the recent DevOps Networking Forum, Petr Lapukhov described how Facebook has tackled the problem of detecting packet loss in Equal Cost Multi-Path (ECMP) networks. At Facebook's scale, there are many parallel paths and actively probing all the paths generates a lot of data. The active tests generate over 1 Terabit/second of measurement data per Facebook data center, and a Hadoop cluster with hundreds of compute nodes is required per data center to process the data.

Processing active test data can detect that packets are being lost within approximately 20 seconds, but doesn't provide the precise location where packets are dropped. A custom multi-path traceroute tool (fbtracert) is used to follow up and narrow down the location of the packet loss.

While described as measuring packet loss, the test system is really measuring path loss. For example, if there are 64 ECMP paths in a pod, then the loss of one path would result in a packet loss of approximately 1 in 64 packets in traffic flows that cross the ECMP group.
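The relationship between path loss and observed packet loss can be written out as a small sketch. It assumes traffic is spread evenly across the ECMP paths and that a failed path drops everything sent over it; the function name is illustrative.

```python
from fractions import Fraction


def expected_loss(total_paths, failed_paths=1):
    """Fraction of packets lost when failed_paths out of total_paths
    equally loaded ECMP paths drop all of their traffic."""
    if total_paths <= 0 or not 0 <= failed_paths <= total_paths:
        raise ValueError('need 0 <= failed_paths <= total_paths and total_paths > 0')
    return Fraction(failed_paths, total_paths)


if __name__ == "__main__":
    loss = expected_loss(64)
    # One failed path out of 64 loses roughly 1.6% of packets
    print(loss, float(loss))
```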

Black hole detection describes an alternative approach. Industry standard sFlow instrumentation embedded within most vendors' switch hardware provides visibility into the paths that packets take across the network - see Packet paths. In some ways the sFlow telemetry is very similar to the traceroute tests: each measurement identifies the specific location where a packet was seen.

The passive sFlow monitoring approach has significant benefits:
  1. Eliminates active test traffic since production traffic exercises network paths.
  2. Eliminates traffic generators and test targets required to perform the active tests.
  3. Simplifies analysis since sFlow measurements provide a direct indication of anomaly location.
  4. Reduces operational complexity and associated costs.
Enabling sFlow throughout the network continuously monitors all paths and can rapidly detect routing anomalies. In addition, sFlow is a general purpose solution that delivers the visibility needed to manage leaf-spine networks and the distributed applications that they support. The following examples illustrate the breadth of solutions supported by sFlow analytics:

Tuesday, May 17, 2016

Black hole detection

The Broadcom white paper, Black Hole Detection by BroadView™ Instrumentation Software, describes the challenge of detecting and isolating packet loss caused by inconsistent routing in leaf-spine fabrics. The diagram from the paper provides an example: packets from host H11 to H22 are being forwarded by ToR1 via Spine1 to ToR2 even though the route to H22 has been withdrawn from ToR2. Since ToR2 doesn't have a route to the host, it sends the packet back up to Spine2, which sends the packet back to ToR2, causing the packet to bounce back and forth until the IP time to live (TTL) expires.

The white paper discusses how Broadcom ASICs can be programmed to detect blackholes based on packet paths, i.e. packets arriving at a ToR switch from a Spine switch should never be forwarded to another Spine switch.
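The path rule can be expressed as a simple predicate. The sketch below is only an illustration of the rule, not of how an ASIC is programmed; the `roles` dictionary and function name are assumptions.

```python
def is_black_hole_hop(roles, previous_node, current_node, next_node):
    """Apply the rule: a ToR switch that receives a packet from a Spine
    switch should never forward it to a Spine switch.

    roles maps a node name to 'tor' or 'spine' (a hypothetical structure);
    nodes missing from the map, such as hosts, never match.
    """
    return (roles.get(current_node) == 'tor'
            and roles.get(previous_node) == 'spine'
            and roles.get(next_node) == 'spine')


ROLES = {'ToR1': 'tor', 'ToR2': 'tor', 'Spine1': 'spine', 'Spine2': 'spine'}

if __name__ == "__main__":
    # Spine1 -> ToR2 -> Spine2 is the looping hop in the example above
    print(is_black_hole_hop(ROLES, 'Spine1', 'ToR2', 'Spine2'))
    # H11 -> ToR1 -> Spine1 is normal forwarding
    print(is_black_hole_hop(ROLES, 'H11', 'ToR1', 'Spine1'))
```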

This article will discuss how the industry standard sFlow instrumentation (also included in Broadcom based switches) can be used to provide fabric wide detection of black holes.

The diagram shows a simple test network built using Cumulus VX virtual machines to emulate a four switch leaf-spine fabric like the one described in the Broadcom white paper (this network is described in Open Virtual Network (OVN) and Network virtualization visibility demo). The emulation of the control plane is extremely accurate since the same Cumulus Linux distribution that runs on physical switches is running in the Cumulus VX virtual machine. In this case BGP is being used as the routing protocol (see BGP configuration made simple with Cumulus Linux).

The same open source Host sFlow agent is running on the Linux servers and switches, streaming real-time telemetry over the out of band management network to sFlow analysis software running on the management server.
Fabric View is an open source application, running on the sFlow-RT real-time analytics engine, designed to monitor the performance of leaf-spine fabrics. The sFlow-RT Download page has instructions for downloading and installing sFlow-RT and Fabric View.

The Fabric View application needs two pieces of configuration information: the network topology and the address allocation.

Topology discovery with Cumulus Linux describes how to extract the topology in a Cumulus Linux network, yielding the following topology.json file:
{
  "links": {
    "leaf2-spine2": {
      "node1": "leaf2", "port1": "swp2",
      "node2": "spine2", "port2": "swp2"
    },
    "leaf1-spine1": {
      "node1": "leaf1", "port1": "swp1",
      "node2": "spine1", "port2": "swp1"
    },
    "leaf1-spine2": {
      "node1": "leaf1", "port1": "swp2",
      "node2": "spine2", "port2": "swp1"
    },
    "leaf2-spine1": {
      "node1": "leaf2", "port1": "swp1",
      "node2": "spine1", "port2": "swp2"
    }
  }
}
And the following groups.json file lists the /24 address blocks allocated to hosts connected to each leaf switch:
Defining Flows describes how sFlow-RT can be programmed to perform flow analytics. The following JavaScript file implements the blackhole detection and can be installed in the sflow-rt/app/fabric-view/scripts/ directory:
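// track flows that are sent back to spine
var pathfilt = 'node:inputifindex~leaf.*';
pathfilt += '&link:inputifindex!=null';
pathfilt += '&link:outputifindex!=null';
// setFlow() call assumed; flow name taken from the case label in the handler
setFlow('fv-blackhole-path',
 {keys:'group:ipdestination:fv',value:'frames', filter:pathfilt,
  log:true, flowStart:true}
);

// track locally originating flows that have TTL indicating non shortest path
var diam = 2;
var ttlfilt = 'range:ipttl:0:'+(64-diam-2)+'=true';
ttlfilt += '&group:ipsource:fv!=external';
// setFlow() call assumed; flow name taken from the case label in the handler
setFlow('fv-blackhole-ttl',
  {keys:'group:ipdestination:fv,ipttl', value:'frames', filter:ttlfilt,
   log:true, flowStart:true}
);

setFlowHandler(function(rec) {
  var parts, msg = {'type':'blackhole'};
  // rec.name is assumed to carry the name of the matching flow definition
  switch(rec.name) {
  case 'fv-blackhole-path':
     msg.rack = rec.flowKeys;
     break;
  case 'fv-blackhole-ttl':
     var [rack,ttl] = rec.flowKeys.split(',');
     msg.rack = rack;
     msg.ttl = ttl;
     break;
  }
  var port = topologyInterfaceToPort(rec.agent,rec.dataSource);
  if(port && port.node) msg.node = port.node;
  // log a warning message (assumed call, matching the WARNING lines in the output)
  logWarning(JSON.stringify(msg));
},['fv-blackhole-path','fv-blackhole-ttl']);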
Some notes on the script:
  • The fv-blackhole-path flow definition has a filter that matches packets that arrive on an inter-switch link and are sent back on another link (the rule described in the Broadcom paper).
  • The fv-blackhole-ttl flow definition relies on the fact that the servers are running Linux, which uses an initial TTL of 64. Since it only takes 3 routing hops to traverse the leaf-spine fabric, any TTL values of 60 or smaller are an indication of a potential routing loop and black hole.
  • Flow records are generated as soon as a match is found and the setFlowHandler() function is used to process the records, in this case logging warning messages.
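The TTL bound used in the filter can be reproduced with a short sketch. It mirrors the `64-diam-2` expression from the script, and the function names are illustrative.

```python
INITIAL_TTL = 64  # initial TTL used by the Linux servers, as noted above


def max_suspicious_ttl(diameter, initial_ttl=INITIAL_TTL):
    """Largest TTL treated as suspicious, mirroring (64-diam-2) in the filter."""
    return initial_ttl - diameter - 2


def is_suspicious(ttl, diameter=2):
    """True when a packet's TTL falls in the range 0..max_suspicious_ttl."""
    return 0 <= ttl <= max_suspicious_ttl(diameter)


if __name__ == "__main__":
    print(max_suspicious_ttl(2))  # upper bound used in the range filter
    print(is_suspicious(61), is_suspicious(60))
```

With a diameter of 2 the bound is 60, which matches the "60 or smaller" rule in the notes.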
Under normal operation no warnings are generated. Adding static routes to leaf2 and spine1 to create a loop that black holes packets results in the following output:
$ ./ 
2016-05-17T19:47:37-0700 INFO: Listening, sFlow port 9343
2016-05-17T19:47:38-0700 INFO: Starting the Jetty [HTTP/1.1] server on port 8008
2016-05-17T19:47:38-0700 INFO: Starting application
2016-05-17T19:47:38-0700 INFO: Listening, http://localhost:8008
2016-05-17T19:47:38-0700 INFO: app/fabric-view/scripts/fabric-view-stats.js started
2016-05-17T19:47:38-0700 INFO: app/fabric-view/scripts/blackhole.js started
2016-05-17T19:47:38-0700 INFO: app/fabric-view/scripts/fabric-view.js started
2016-05-17T19:47:38-0700 INFO: app/fabric-view/scripts/fabric-view-elephants.js started
2016-05-17T19:47:38-0700 INFO: app/fabric-view/scripts/fabric-view-usr.js started
2016-05-17T20:50:33-0700 WARNING: {"type":"blackhole","rack":"rack2","ttl":"13","node":"leaf2"}
2016-05-17T20:50:48-0700 WARNING: {"type":"blackhole","rack":"rack2","node":"leaf2"}
Exporting events using syslog describes how to send events to SIEM tools like Logstash or Splunk so that they can be queried. The script can also be extended to perform further analysis, or to automatically apply remediation controls.
This example demonstrates the versatility of the sFlow architecture: shifting flow analytics from devices to external software makes it easy to deploy new capabilities. The real-time networking, server, and application analytics provided by sFlow-RT deliver actionable data through APIs and can easily be integrated with a wide variety of on-site and cloud, orchestration, DevOps and Software Defined Networking (SDN) tools.

Friday, May 6, 2016

sFlow to IPFIX/NetFlow

RESTflow explains how the sFlow architecture shifts the flow cache from devices to external software and describes how the sFlow-RT REST API can be used to program and query flow caches. Exporting events using syslog describes how flow records can be exported using the syslog protocol to Security Information and Event Management (SIEM) tools such as Logstash and Splunk. This article demonstrates how sFlow-RT can be used to define and export the flows using the IP Flow Information eXport (IPFIX) protocol (the IETF standard based on NetFlow version 9).

For example, the following command defines a cache that will maintain flow records for TCP flows on the network, capturing IP source and destination addresses, source and destination port numbers and the bytes transferred and sending flow records to address
curl -H "Content-Type:application/json" -X PUT --data \
'{"keys":"ipsource,ipdestination,tcpsourceport,tcpdestinationport", \
"value":"bytes", "ipfixCollectors":[""]}' \
Running Wireshark's tshark command line utility on the collector host verifies that flows are being received:
# tshark -i eth0 -V udp port 4739
Running as user "root" and group "root". This could be dangerous.
Capturing on lo
Frame 1 (134 bytes on wire, 134 bytes captured)
    Arrival Time: Aug 24, 2013 10:44:06.096082000
    [Time delta from previous captured frame: 0.000000000 seconds]
    [Time delta from previous displayed frame: 0.000000000 seconds]
    [Time since reference or first frame: 0.000000000 seconds]
    Frame Number: 1
    Frame Length: 134 bytes
    Capture Length: 134 bytes
    [Frame is marked: False]
    [Protocols in frame: eth:ip:udp:cflow]
Ethernet II, Src: 00:00:00_00:00:00 (00:00:00:00:00:00), Dst: 00:00:00_00:00:00 (00:00:00:00:00:00)
    Destination: 00:00:00_00:00:00 (00:00:00:00:00:00)
        Address: 00:00:00_00:00:00 (00:00:00:00:00:00)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
    Source: 00:00:00_00:00:00 (00:00:00:00:00:00)
        Address: 00:00:00_00:00:00 (00:00:00:00:00:00)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
    Type: IP (0x0800)
Internet Protocol, Src: (, Dst: (
    Version: 4
    Header length: 20 bytes
    Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00)
        0000 00.. = Differentiated Services Codepoint: Default (0x00)
        .... ..0. = ECN-Capable Transport (ECT): 0
        .... ...0 = ECN-CE: 0
    Total Length: 120
    Identification: 0x0000 (0)
    Flags: 0x02 (Don't Fragment)
        0.. = Reserved bit: Not Set
        .1. = Don't fragment: Set
        ..0 = More fragments: Not Set
    Fragment offset: 0
    Time to live: 64
    Protocol: UDP (0x11)
    Header checksum: 0x2532 [correct]
        [Good: True]
        [Bad : False]
    Source: (
    Destination: (
User Datagram Protocol, Src Port: 56109 (56109), Dst Port: ipfix (4739)
    Source port: 56109 (56109)
    Destination port: ipfix (4739)
    Length: 100
    Checksum: 0x15b9 [validation disabled]
        [Good Checksum: False]
        [Bad Checksum: False]
Cisco NetFlow/IPFIX
    Version: 10
    Length: 92
    Timestamp: Aug 24, 2013 10:44:06.000000000
        ExportTime: 1377366246
    FlowSequence: 74
    Observation Domain Id: 0
    Set 1
        Template FlowSet: 2
        FlowSet Length: 40
        Template (Id = 258, Count = 8)
            Template Id: 258
            Field Count: 8
            Field (1/8)
                .000 0000 1000 0010 = Type: exporterIPv4Address (130)
                Length: 4
            Field (2/8)
                .000 0000 1001 0110 = Type: flowStartSeconds (150)
                Length: 4
            Field (3/8)
                .000 0000 1001 0111 = Type: flowEndSeconds (151)
                Length: 4
            Field (4/8)
                .000 0000 0000 1000 = Type: IP_SRC_ADDR (8)
                Length: 4
            Field (5/8)
                .000 0000 0000 1100 = Type: IP_DST_ADDR (12)
                Length: 4
            Field (6/8)
                .000 0000 1011 0110 = Type: TCP_SRC_PORT (182)
                Length: 2
            Field (7/8)
                .000 0000 1011 0111 = Type: TCP_DST_PORT (183)
                Length: 2
            Field (8/8)
                .000 0000 0101 0101 = Type: BYTES_TOTAL (85)
                Length: 8
    Set 2
        DataRecord (Template Id): 258
        DataRecord Length: 36
        Flow 1
            ExporterAddr: (
            [Duration: 65.000000000 seconds]
                StartTime: Aug 24, 2013 10:43:01.000000000
                EndTime: Aug 24, 2013 10:44:06.000000000
            SrcAddr: (
            DstAddr: (
            SrcPort: 48859
            DstPort: 443
            Octets: 228045
The output demonstrates how the flow cache definition is exported as an IPFIX Template and the individual flow records are exported as one or more Flow entries within a DataRecord.
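The lengths reported by tshark can be cross-checked against the template. The sketch below takes the field lengths from the Template shown above and uses the standard IPFIX header sizes (16 byte message header, 4 byte set header, 4 byte template record header, 4 byte field specifier without an enterprise number). The constant and function names are illustrative.

```python
# (name, length in bytes) for each field in Template 258 from the capture
FIELDS = [
    ('exporterIPv4Address', 4),
    ('flowStartSeconds', 4),
    ('flowEndSeconds', 4),
    ('IP_SRC_ADDR', 4),
    ('IP_DST_ADDR', 4),
    ('TCP_SRC_PORT', 2),
    ('TCP_DST_PORT', 2),
    ('BYTES_TOTAL', 8),
]

MESSAGE_HEADER = 16   # version, length, export time, sequence, domain id
SET_HEADER = 4        # set id and set length
TEMPLATE_HEADER = 4   # template id and field count
FIELD_SPECIFIER = 4   # element id and field length


def template_set_length(fields):
    """Bytes needed to describe the fields in a template set."""
    return SET_HEADER + TEMPLATE_HEADER + FIELD_SPECIFIER * len(fields)


def data_set_length(fields, records=1):
    """Bytes needed for a data set carrying the given number of flow records."""
    return SET_HEADER + records * sum(length for _, length in fields)


def message_length(fields, records=1):
    """Total message size with one template set and one data set."""
    return MESSAGE_HEADER + template_set_length(fields) + data_set_length(fields, records)


if __name__ == "__main__":
    print(template_set_length(FIELDS), data_set_length(FIELDS), message_length(FIELDS))
```

For a single flow record these work out to 40, 36 and 92 bytes, matching the FlowSet Length, DataRecord Length and message Length fields in the capture.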

What might not be apparent is that the single configuration command to sFlow-RT enabled network wide monitoring of TCP connections, even in a network containing hundreds of physical switches, thousands of virtual switches, different switch models, multiple vendors etc. In contrast, if devices maintain their own flow caches then each switch needs to be re-configured whenever monitoring requirements change - typically a time consuming and complex manual process, see Software defined analytics.
While IPFIX provides a useful method of exporting IP flow records to legacy monitoring solutions, logging flow records is only a small subset of the applications for sFlow analytics. The real-time networking, server, and application analytics provided by sFlow-RT deliver actionable data through APIs and can easily be integrated with a wide variety of on-site and cloud, orchestration, DevOps and Software Defined Networking (SDN) tools.

Thursday, May 5, 2016

Berkeley Packet Filter (BPF)

Linux bridge, macvlan, ipvlan, adapters discusses how industry standard sFlow technology, widely supported by data center switch vendors, has been extended to provide network visibility into the Linux data plane. This article explores how sFlow's lightweight packet sampling mechanism has been implemented on Linux network adapters.

Linux Socket Filtering aka Berkeley Packet Filter (BPF) describes the recently added prandom_u32() function that allows packets to be randomly sampled in the Linux kernel for efficient monitoring of production traffic.
Background: Enhancing Network Intrusion Detection With Integrated Sampling and Filtering, Jose M. Gonzalez and Vern Paxson, International Computer Science Institute Berkeley, discusses the motivation for adding random sampling to BPF, and the email thread [PATCH] filter: added BPF random opcode describes the Linux implementation and includes an interesting discussion of the motivation for the patch.
The following code shows how the open source Host sFlow agent implements random 1-in-256 packet sampling as a BPF program:
ld rand
mod #256
jneq #1, drop
ret #-1
drop: ret #0
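The logic of the filter can be mimicked in user space to see the effect of the sampling rate. This is a simulation sketch only: Python's random module stands in for the kernel's prandom_u32(), and the function names are illustrative.

```python
import random


def sample(rng, rate=256):
    """Mimic the filter: load a random 32 bit value, take it modulo the
    sampling rate, and keep the packet only when the result equals 1."""
    return rng.getrandbits(32) % rate == 1


def count_samples(packets, rate=256, seed=0):
    """Count how many of the given number of packets would be sampled."""
    rng = random.Random(seed)
    return sum(sample(rng, rate) for _ in range(packets))


if __name__ == "__main__":
    packets = 256000
    # On average about packets/256 packets are kept
    print(count_samples(packets), 'of', packets, 'packets sampled')
```

Comparing against any single fixed remainder gives the same 1-in-256 probability; the value 1 simply follows the filter above.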
A JIT for packet filters discusses the Linux Just In Time (JIT) compiler for BPF programs, delivering native machine code performance for compiled filters.

Minimizing cost of visibility describes why low overhead monitoring is an essential component for increasing efficiency in cloud infrastructure. The combination of BPF packet sampling with standard sFlow export provides a low overhead method of delivering real-time network visibility into large scale cloud infrastructure.