Tuesday, July 21, 2015

White box Internet router PoC

SDN router using merchant silicon top of rack switch describes how the performance of a software Internet router could be accelerated using the hardware routing capabilities of a commodity switch. This article describes a proof of concept demonstration using Linux virtual machines and a bare metal switch running Cumulus Linux.
The diagram shows the demo setup, providing inter-domain routing between Peer 1 and Peer 2. The Peers are directly connected to the Hardware Switch and ingress packets are routed by the default (0.0.0.0/0) route to the Software Router. The Software Router learns the full set of routes from the Peers using BGP and forwards the packet to the correct next hop router. The packet is then switched to the selected peer router via bridge br_xen.

The following traceroute run on Peer 1 shows the set of router hops from 192.168.250.1 to 192.168.251.1:
[root@peer1 ~]# traceroute -s 192.168.250.1 192.168.251.1
traceroute to 192.168.251.1 (192.168.251.1), 30 hops max, 40 byte packets
 1  192.168.152.2 (192.168.152.2)  3.090 ms  3.014 ms  2.927 ms
 2  192.168.150.3 (192.168.150.3)  3.377 ms  3.161 ms  3.099 ms
 3  192.168.251.1 (192.168.251.1)  6.440 ms  6.228 ms  3.217 ms
Ensuring that packets are first forwarded by the default route on the Hardware Switch is the key to accelerating forwarding decisions for the Software Router. Any more specific routes added to the Hardware Switch override the default route; matching packets bypass the Software Router and are forwarded directly in hardware.
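The override behavior follows from longest-prefix-match forwarding: a /24 pushed by the controller is more specific than the 0.0.0.0/0 default, so it wins the lookup. The following JavaScript sketch illustrates the selection logic (the table contents mirror this demo's addresses; the code is illustrative, not how the switch ASIC implements it):

```javascript
// Longest-prefix-match sketch: the most specific matching route wins,
// so a /24 pushed to the Hardware Switch overrides the /0 default.
function maskBits(len) {
  return len === 0 ? 0 : (~0 << (32 - len)) >>> 0;
}
function ipToInt(ip) {
  return ip.split('.').reduce((a, o) => ((a << 8) >>> 0) + Number(o), 0) >>> 0;
}
function lookup(table, dst) {
  const d = ipToInt(dst);
  let best = null;
  for (const r of table) {
    const [net, len] = r.prefix.split('/');
    if (((d & maskBits(+len)) >>> 0) === ipToInt(net) && (!best || +len > best.len)) {
      best = { len: +len, nexthop: r.nexthop };
    }
  }
  return best && best.nexthop;
}

const table = [
  { prefix: '0.0.0.0/0', nexthop: '192.168.152.1' },       // default -> Software Router
  { prefix: '192.168.251.0/24', nexthop: '192.168.151.1' } // pushed by the controller
];
console.log(lookup(table, '192.168.251.1')); // 192.168.151.1 (bypasses Software Router)
console.log(lookup(table, '203.0.113.9'));   // 192.168.152.1 (falls through to default)
```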

In this test bench, routing is performed using Quagga instances running on Peer 1, Peer 2, Hardware Switch and Software Router.

Peer 1

router bgp 150
 bgp router-id 192.168.150.1
 network 192.168.250.0/24
 neighbor 192.168.152.3 remote-as 152
 neighbor 192.168.152.3 ebgp-multihop 2

Peer 2

router bgp 151
 bgp router-id 192.168.151.1
 network 192.168.251.0/24
 neighbor 192.168.152.3 remote-as 152
 neighbor 192.168.152.3 ebgp-multihop 2

Software Router

interface lo
  ip address 192.168.152.3/32

router bgp 152
 bgp router-id 192.168.152.3
 neighbor 192.168.150.1 remote-as 150
 neighbor 192.168.150.1 update-source 192.168.152.3
 neighbor 192.168.150.1 passive
 neighbor 192.168.151.1 remote-as 151
 neighbor 192.168.151.1 update-source 192.168.152.3
 neighbor 192.168.151.1 passive
 neighbor 10.0.0.162 remote-as 152
 neighbor 10.0.0.162 port 1179
 neighbor 10.0.0.162 timers connect 30
 neighbor 10.0.0.162 route-reflector-client

Hardware Switch

router bgp 65000
 bgp router-id 0.0.0.1
 neighbor 10.0.0.162 remote-as 65000
 neighbor 10.0.0.162 port 1179
 neighbor 10.0.0.162 timers connect 30
In addition, the following lines in /etc/network/interfaces configure the bridge:
auto br_xen
iface br_xen
  bridge-ports swp1 swp2 swp3
  address 192.168.150.2/24
  address 192.168.151.2/24
  address 192.168.152.2/24
Cumulus Networks, sFlow and data center automation describes how to configure sFlow monitoring on Cumulus Linux switches. The switch is configured to send sFlow to 10.0.0.162 (the host running the SDN controller).
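On Cumulus Linux, sFlow export is handled by the Host sFlow daemon (hsflowd). A configuration along the following lines would match this setup; the keys shown are a hedged sketch, since the exact file location and option names depend on the hsflowd version shipped with the switch:

```
sflow {
  DNSSD = off
  polling = 20
  sampling = 8192
  collector {
    ip = 10.0.0.162
    udpport = 6343
  }
}
```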

SDN Routing Application


SDN router using merchant silicon top of rack switch describes how to install sFlow-RT and provides an application for pushing active routes to an accelerator. The application has been modified for this setup and is running on host 10.0.0.162:
bgpAddNeighbor('10.0.0.152',152);
bgpAddNeighbor('10.0.0.233',65000);
bgpAddSource('10.0.0.233','10.0.0.152',10);

var installed = {};
setIntervalHandler(function() {
  let now = Date.now();
  let top = bgpTopPrefixes('10.0.0.152',20000,1);
  if(!top || !top.hasOwnProperty('topPrefixes')) return;

  let tgt = bgpTopPrefixes('10.0.0.233',0);
  if(!tgt || 'established' != tgt.state) return;

  for(let i = 0; i < top.topPrefixes.length; i++) {
     let entry = top.topPrefixes[i];
     if(bgpAddRoute('10.0.0.233',entry)) {
       installed[entry.prefix] = now; 
     }
  }
  for(let prefix in installed) {
     let time = installed[prefix];
     if(time === now) continue;
     if(bgpRemoveRoute('10.0.0.233',prefix)) {
        delete installed[prefix];
     } 
  }
}, 1);
Start the application:
$ ./start.sh 
2015-07-21T19:36:52-0700 INFO: Listening, BGP port 1179
2015-07-21T19:36:52-0700 INFO: Listening, sFlow port 6343
2015-07-21T19:36:53-0700 INFO: Starting the Jetty [HTTP/1.1] server on port 8008
2015-07-21T19:36:53-0700 INFO: Starting com.sflow.rt.rest.SFlowApplication application
2015-07-21T19:36:53-0700 INFO: Listening, http://localhost:8008
2015-07-21T19:36:53-0700 INFO: bgp.js started
2015-07-21T19:36:57-0700 INFO: BGP open /10.0.0.152:50010
2015-07-21T19:37:23-0700 INFO: BGP open /10.0.0.233:55097
Next, examine the routing table on the Hardware Switch:
cumulus@cumulus$ ip route
default via 192.168.152.1 dev br_xen 
10.0.0.0/24 dev eth0  proto kernel  scope link  src 10.0.0.233 
192.168.150.0/24 dev br_xen  proto kernel  scope link  src 192.168.150.2 
192.168.151.0/24 dev br_xen  proto kernel  scope link  src 192.168.151.2 
192.168.152.0/24 dev br_xen  proto kernel  scope link  src 192.168.152.2
This is the default set of routes configured to pass traffic to and from the Software Router.

To generate traffic using iperf, run the following command on Peer 2:
iperf -s -B 192.168.251.1
And generate traffic with the following command on Peer 1:
iperf -c 192.168.251.1 -B 192.168.250.1
Now check the routing table on the Hardware Switch again:
cumulus@cumulus$ ip route
default via 192.168.152.1 dev br_xen 
10.0.0.0/24 dev eth0  proto kernel  scope link  src 10.0.0.233  
192.168.150.0/24 dev br_xen  proto kernel  scope link  src 192.168.150.2 
192.168.151.0/24 dev br_xen  proto kernel  scope link  src 192.168.151.2 
192.168.152.0/24 dev br_xen  proto kernel  scope link  src 192.168.152.2 
192.168.250.0/24 via 192.168.150.1 dev br_xen  proto zebra  metric 20 
192.168.251.0/24 via 192.168.151.1 dev br_xen  proto zebra  metric 20
Note the two hardware routes that have been added by the SDN controller. The route override can be verified by repeating the traceroute test:
[root@peer1 ~]# traceroute -s 192.168.250.1 192.168.251.1
traceroute to 192.168.251.1 (192.168.251.1), 30 hops max, 40 byte packets
 1  192.168.150.2 (192.168.150.2)  3.260 ms  3.151 ms  3.014 ms
 2  192.168.251.1 (192.168.251.1)  4.418 ms  4.351 ms  4.260 ms
Comparing with the original traceroute, notice that packets bypass the Software Router interface (192.168.150.3) and are forwarded entirely in hardware.

The traffic analytics driving the forwarding decisions can be viewed through the sFlow-RT REST API:
$ curl http://10.0.0.162:8008/bgp/topprefixes/10.0.0.152/json
{
 "as": 152,
 "direction": "destination",
 "id": "192.168.152.3",
 "learnedPrefixesAdded": 2,
 "learnedPrefixesRemoved": 0,
 "nPrefixes": 2,
 "pushedPrefixesAdded": 0,
 "pushedPrefixesRemoved": 0,
 "startTime": 1437535255553,
 "state": "established",
 "topPrefixes": [
  {
   "aspath": "150",
   "localpref": 100,
   "med": 0,
   "nexthop": "192.168.150.1",
   "origin": "IGP",
   "prefix": "192.168.250.0/24",
   "value": 1.4462334178258518E7
  },
  {
   "aspath": "151",
   "localpref": 100,
   "med": 0,
   "nexthop": "192.168.151.1",
   "origin": "IGP",
   "prefix": "192.168.251.0/24",
   "value": 391390.33359066787
  }
 ],
 "valuePercentCoverage": 100,
 "valueTopPrefixes": 1.4853724511849185E7,
 "valueTotal": 1.4853724511849185E7
}
The SDN application automatically removes routes from the hardware once they become idle or are withdrawn, and evicts less active routes to make room if the hardware routing table exceeds the set limit of 20,000 routes. This switch has a maximum capacity of 32,768 routes, and standard sFlow analytics can be used to monitor hardware table utilization - see Broadcom ASIC table utilization metrics, DevOps, and SDN.
The test setup was assembled quickly from the limited hardware at hand; a production deployment would improve on it (for example, using smaller CIDRs and separate VLANs to isolate peer traffic).
This proof of concept demonstrates that it is possible to use SDN analytics and control to combine standard sFlow and BGP capabilities of commodity hardware and deliver Terabit routing capacity with just a few thousand dollars of hardware.

Tuesday, July 14, 2015

SDN router using merchant silicon top of rack switch

The talk from David Barroso describes how Spotify optimizes hardware routing on a commodity switch by using sFlow analytics to identify the routes carrying the most traffic.  The full Internet routing table contains nearly 600,000 entries, too many for commodity switch hardware to handle. However, not all entries are active all the time. The Spotify solution uses traffic analytics to track the 30,000 most active routes (5% of the full routing table) and push them into hardware. Based on Spotify's experience, offloading the active 30,000 routes to the switch provides hardware routing for 99% of their traffic.

David is interviewed by Ivan Pepelnjak in SDN ROUTER @ SPOTIFY ON SOFTWARE GONE WILD. The SDN Internet Router (SIR) source code and documentation is available on GitHub.
The diagram from David's talk shows the overall architecture of the solution. Initially the Internet Router (commodity switch hardware) uses a default route to direct outbound traffic to a Transit Provider (capable of handling all the outbound traffic). The BGP Controller learns routes via BGP and observes traffic using the standard sFlow measurement technology embedded with most commodity switch silicon.
After a period (1 hour) the BGP Controller identifies the most active 30,000 prefixes and configures the Internet Router to install these routes in the hardware so that traffic takes the best routes to each peer. Each subsequent period provides new measurements and the controller adjusts the active set of routes accordingly.
The internals of the BGP Controller are shown in this diagram. BGP and sFlow data are received by the pmacct traffic accounting software, which then writes out files containing traffic by prefix. The bgpc.py script calculates the TopN prefixes and installs them in the Internet Router. 
In this example, the Bird routing daemon is running on the Internet Router and the TopN prefixes are written into a filter file that restricts prefixes that can be installed in the hardware. 
The SIR router demonstrates that building an SDN controller that leverages standard measurement and control capabilities of commodity hardware has the potential to disrupt the router market by replacing expensive custom routers with inexpensive commodity switches based on merchant silicon. However, the relatively slow feedback loop (updating measurements every hour) limits SIR to access routers with relatively stable traffic patterns.

The rest of this article discusses how a fast feedback loop can be built combining real-time sFlow analytics with a BGP control plane. A fast feedback loop significantly reduces the number of hardware cache misses and increases the scaleability of the solution, allowing a broader range of use cases to be addressed.

This diagram differs from the SIR router by re-casting the role of the hardware Switch as an accelerator that handles forwarding for a subset of prefixes in order to reduce the traffic forwarded by a Router implementing the full Internet routing table. Applications for this approach include taking an existing router and boosting its throughput (e.g. boosting a 1 Gigabit router to 100 Gigabit), or, more disruptively, replacing an expensive hardware router with a commodity Linux server.
Route caching is not a new idea. The paper, Revisiting Route Caching: The World Should Be Flat, reviews the history of route caching and discusses its application to contemporary workloads and requirements.
The throughput increase is determined by the cache hit rate that can be achieved with the limited number of routing entries supported by the switch hardware. For example, if the hardware achieves a 90% cache hit rate, then only 10% of the traffic is handled by the router and the throughput is boosted by a factor of 10.
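The relationship between hit rate and throughput boost follows directly: if a fraction h of traffic hits the hardware cache, the software router only carries the remaining (1 - h), so throughput scales by 1/(1 - h). A quick sanity check:

```javascript
// Throughput boost from offloading a fraction `hit` of traffic to hardware:
// the software router only has to carry the cache misses (1 - hit).
function boostFactor(hit) {
  return 1 / (1 - hit);
}

console.log(boostFactor(0.90)); // ~10x: only 10% of traffic reaches the software router
console.log(boostFactor(0.99)); // ~100x
```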

A fast control loop is critical to increasing the cache hit rate, rapidly detecting traffic to new destination prefixes and installing hardware forwarding entries that minimize traffic through the router.

The sFlow-RT analytics software already provides real-time (sub-second) traffic analytics and recently added experimental BGP support allows sFlow-RT to act as a route reflector client, learning the full set of prefixes so that it can track traffic rates by prefix.

The following steps are required to try out the software.

First download sFlow-RT.
wget http://www.inmon.com/products/sFlow-RT/sflow-rt.tar.gz
tar -xvzf sflow-rt.tar.gz
cd sflow-rt
Next configure sFlow-RT to listen for BGP connections. In this case, add the following entries to the start.sh file to enable BGP, listening on port 1179 rather than the well known BGP port 179 so that sFlow-RT does not need to run with root privileges:
-Dbgp.start=yes -Dbgp.port=1179
Edit the init.js file and use the bgpAddNeighbor function to peer with the Router (10.0.0.254), where NNNN is the local autonomous system (AS) number, and the bgpAddSource function to combine sFlow data from the Switch (10.0.0.253) with the routing table, tracking bytes/second and using a 10 second moving average:
bgpAddNeighbor('10.0.0.254',NNNN);
bgpAddSource('10.0.0.253','10.0.0.254',10,'bytes');
Configure the Switch to send sFlow to sFlow-RT (see Switch configurations).

Configure the Router as a route reflector, connecting to sFlow-RT (10.0.0.252) and exporting the full routing table. For example, using Quagga as the routing daemon:
router bgp NNNN
 bgp router-id 10.0.0.254
 neighbor 10.0.0.252 remote-as NNNN
 neighbor 10.0.0.252 port 1179
 neighbor 10.0.0.252 route-reflector-client
Start sFlow-RT:
./start.sh
The following cURL command accesses the sFlow-RT REST API to query the TopN prefixes:
curl "http://10.0.0.162:8008/bgp/topprefixes/10.0.0.254/json?direction=destination&maxPrefixes=5&minValue=1"
{
 "as": NNNN,
 "direction": "destination",
 "id": "N.N.N.N",
 "learnedPrefixesAdded": 568313,
 "learnedPrefixesRemoved": 9,
 "nPrefixes": 567963,
 "pushedPrefixesAdded": 0,
 "pushedPrefixesRemoved": 0,
 "startTime": 1436830843625,
 "state": "established",
 "topPrefixes": [
  {
   "aspath": "NNNN",
   "localpref": 888,
   "nexthop": "N.N.N.N",
   "origin": "IGP",
   "prefix": "0.0.0.0/0",
   "value": 680740.5504781345
  },
  {
   "aspath": "NNNN-NNNN",
   "localpref": 100,
   "nexthop": "N.N.N.N",
   "origin": "IGP",
   "prefix": "N.N.0.0/14",
   "value": 58996.251739893225
  },
  {
   "aspath": "NNNN-NNNNN",
   "localpref": 130,
   "nexthop": "N.N.N.N",
   "origin": "IGP",
   "prefix": "N.N.0.0/13",
   "value": 7966.802831354894
  },
  {
   "localpref": 100,
   "med": 2,
   "nexthop": "N.N.N.N",
   "origin": "IGP",
   "prefix": "N.N.N.0/18",
   "value": 3059.8853014045844
  },
  {
   "aspath": "NNNN",
   "localpref": 1010,
   "med": 0,
   "nexthop": "N.N.N.N",
   "origin": "IGP",
   "prefix": "N.N.N.0/24",
   "value": 1635.0250535959976
  }
 ],
 "valuePercentCoverage": 99.67670497397555,
 "valueTopPrefixes": 752398.5154043833,
 "valueTotal": 754838.871931838
}

In addition to returning the top prefixes, the query returns information about the amount of traffic covered by these prefixes. In this case, the valuePercentCoverage of 99.67 indicates that 99.67 percent of the traffic is covered by the top 5 prefixes.
Try running this query on your own network to find out how many prefixes are required to cover 90%, 95%, and 99% of the traffic. If you have results you can share, please post them as comments to this article.
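One way to answer that question is to sort the per-prefix traffic values from a topprefixes query and count how many of the busiest prefixes are needed to reach a coverage target. A small sketch (the sample values are hypothetical, loosely modeled on the query output above):

```javascript
// Given per-prefix traffic rates (e.g. the "value" fields from a
// /bgp/topprefixes/.../json result), count how many of the busiest
// prefixes are needed to cover a target fraction of total traffic.
function prefixesForCoverage(values, target) {
  const sorted = [...values].sort((a, b) => b - a);
  const total = sorted.reduce((s, v) => s + v, 0);
  let sum = 0;
  for (let i = 0; i < sorted.length; i++) {
    sum += sorted[i];
    if (sum / total >= target) return i + 1;
  }
  return sorted.length;
}

// Hypothetical traffic values (bytes/second) for five prefixes:
const values = [680740, 58996, 7966, 3059, 1635];
console.log(prefixesForCoverage(values, 0.90)); // 1
console.log(prefixesForCoverage(values, 0.95)); // 2
console.log(prefixesForCoverage(values, 0.99)); // 3
```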
Obtaining the TopN prefixes is only part of the SDN routing application. An efficient method of installing the TopN prefixes in the switch hardware is also required. The SIR router uses a configuration file, but this approach doesn't work well for rapidly modifying large tables. In addition, configuration files vary between routers, limiting the portability of the controller.

In addition to listening for routes using BGP, sFlow-RT can also act as a BGP speaker. The following init.js script implements a basic hardware route cache:

bgpAddNeighbor('10.0.0.254',65000);
bgpAddSource('10.0.0.253','10.0.0.254',10,'bytes');
bgpAddNeighbor('10.0.0.253',65000);

var installed = {};
setIntervalHandler(function() {
  let now = Date.now();
  let top = bgpTopPrefixes('10.0.0.254',100,1,'destination');
  if(!top || !top.hasOwnProperty('topPrefixes')) return;

  let tgt = bgpTopPrefixes('10.0.0.253',0);
  if(!tgt || 'established' != tgt.state) return;

  for(let i = 0; i < top.topPrefixes.length; i++) {
     let entry = top.topPrefixes[i];
     if(bgpAddRoute('10.0.0.253',entry)) {
       installed[entry.prefix] = now; 
     }
  }
  for(let prefix in installed) {
     let time = installed[prefix];
     if(time === now) continue;
     if(bgpRemoveRoute('10.0.0.253',prefix)) {
        delete installed[prefix];
     } 
  }
}, 5);
Some notes on the script:
  1. setIntervalHandler registers a function that is called every 5 seconds
  2. The interval handler queries for the top 100 destination prefixes
  3. Active prefixes are pushed to the switch using bgpAddRoute
  4. Inactive prefixes are withdrawn using bgpRemoveRoute
  5. The bgpAddRoute/bgpRemoveRoute functions are BGP session state aware and will only forward changes
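The install/expire logic in the interval handler can be exercised outside sFlow-RT by stubbing out the push and withdraw calls. This standalone sketch (the stubs and prefixes are illustrative, not the sFlow-RT API itself) shows how a prefix absent from the latest top-N query is withdrawn on the next pass:

```javascript
// Standalone sketch of the route cache expiry logic; in sFlow-RT the
// stubbed calls would be bgpAddRoute() and bgpRemoveRoute().
const hardwareTable = new Set();
function stubAddRoute(entry) { hardwareTable.add(entry.prefix); return true; }
function stubRemoveRoute(prefix) { hardwareTable.delete(prefix); return true; }

const installed = {};
function updateCache(topPrefixes, now) {
  // Install (or refresh) every currently active prefix.
  for (const entry of topPrefixes) {
    if (stubAddRoute(entry)) installed[entry.prefix] = now;
  }
  // Withdraw any prefix not refreshed in this pass.
  for (const prefix in installed) {
    if (installed[prefix] === now) continue;  // still active
    if (stubRemoveRoute(prefix)) delete installed[prefix];
  }
}

// Pass 1: two active prefixes are installed in hardware.
updateCache([{prefix: '192.168.250.0/24'}, {prefix: '192.168.251.0/24'}], 1);
// Pass 2: only one prefix is still active; the idle route is withdrawn.
updateCache([{prefix: '192.168.250.0/24'}], 2);
console.log([...hardwareTable]); // [ '192.168.250.0/24' ]
```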

The initial BGP functionality is fairly limited (no IPv6, no communities, ..) and experimental, please report any bugs here, or on the sFlow-RT group.

Try out the software and provide feedback. This example was only one use case for combining sFlow and BGP in an SDN controller. Other use cases include inbound / outbound traffic engineering, DDoS mitigation, multi-path load balancing, etc. Finally, the combination of commodity hardware with mature, widely deployed BGP and sFlow protocols is a pragmatic approach to SDN that allows solutions to be developed rapidly and deployed widely in production environments.

Thursday, June 25, 2015

WAN optimization using real-time traffic analytics

TATA Consultancy Services white paper, Actionable Intelligence in the SDN Ecosystem: Optimizing Network Traffic through FRSA, demonstrates how real-time traffic analytics and SDN can be combined to perform real-time traffic engineering of large flows across a WAN infrastructure.
The architecture being demonstrated is shown in the diagram (this diagram has been corrected - the diagram in the white paper incorrectly states that sFlow-RT analytics software uses a REST API to poll the nodes in the topology. In fact, the nodes stream telemetry using the widely supported, industry standard, sFlow protocol, providing real-time visibility and scaleability that would be difficult to achieve using polling - see Push vs Pull).

The load balancing application receives real-time notifications of large flows from the sFlow-RT analytics software and programs the SDN Controller (in this case OpenDaylight) to push forwarding rules to the switches to direct the large flows across a specific path. Flow Aware Real-time SDN Analytics (FRSA) provides an overview of the basic ideas behind large flow traffic engineering that inspired this use case.

While OpenDaylight is used in this example, an interesting alternative for this use case would be the ONOS SDN controller running the Segment Routing application. ONOS is specifically designed with carriers in mind and segment routing is a natural fit for the traffic engineering task described in this white paper.
Leaf and spine traffic engineering using segment routing describes a demonstration combining real-time analytics and SDN control in a data center context. The demonstration was part of the recent 2015 Open Networking Summit (ONS) conference Showcase and presented in the talk, CORD: FABRIC An Open-Source Leaf-Spine L3 Clos Fabric, by Saurav Das.

Sunday, June 21, 2015

Optimizing software defined data center

The recent Fortune magazine article, Software-defined data center market to hit $77.18 billion by 2020, starts with the quote "Data centers are no longer just about all the hardware gear you can stitch together for better operations. There’s a lot of software involved to squeeze more performance out of your hardware, and all that software is expected to contribute to a burgeoning new market dubbed the software-defined data center."

The recent ONS2015 Keynote from Google's Amin Vahdat describes how Google builds large scale software defined data centers. The presentation is well worth watching in its entirety since Google has a long history of advancing distributed computing with technologies that have later become mainstream.
There are a number of points in the presentation that relate to the role of networking in the performance of cloud applications. Amin states, "Networking is at this inflection point and what computing means is going to be largely determined by our ability to build great networks over the coming years. In this world data center networking in particular is a key differentiator."

This slide shows the large pools of storage and compute connected by the data center network that are used to deliver data center services. Amin states that the dominant costs are compute and storage and that the network can be relatively inexpensive.
In Overall Data Center Costs James Hamilton breaks down the monthly costs of running a data center and puts the cost of network equipment at 8% of the overall cost.
However, Amin goes on to explain why networking has a disproportionate role in the overall value delivered by the data center.
The key to an efficient data center is balance. If a resource is scarce, then other resources are left idle and this increases costs and limits the overall value of the data center. Amin goes on to state, "Typically the resource that is most scarce is the network."
The need to build large scale high-performance networks has driven Google to build networks with the following properties:
  • Leaf and Spine (Clos) topology
  • Merchant silicon based switches (white box / brite box / bare metal)
  • Centralized control (SDN)
The components and topology of the network are shown in the following slide.
Here again Google is leading the overall network market transition to inexpensive leaf and spine networks built using commodity hardware.

Google is not alone in leading this trend. Facebook has generated significant support for the Open Compute Project (OCP), which publishes open source designs for data center equipment, including merchant silicon based leaf and spine switches. A key OCP project is the Open Network Install Environment (ONIE), which allows third party software to be installed on the network equipment. ONIE separates hardware from software and has spawned a number of innovative networking software companies, including: Cumulus Networks, Big Switch Networks, Pica8, Pluribus Networks. Open network hardware and the related ecosystem of software is entering the mainstream as leading vendors such as Dell and HP deliver open networking hardware, software and support to enterprise customers.
The ONS2015 keynote from AT&T's John Donovan, describes the economic drivers for AT&T's transition to open networking and compute architectures.
John discusses the rapid move from legacy TDM (Time Division Multiplexing) technologies to commodity Ethernet, explaining that "video now makes up the majority of traffic on our network." This is a fundamental shift for AT&T and John states that "We plan to virtualize and control more than 75% of our network using cloud infrastructure and a software defined architecture."

John mentions the CORD (Central Office Re-architected as a Datacenter) project which proposes an architecture very similar to Google's, consisting of a leaf and spine network built using open merchant silicon based hardware connecting commodity servers and storage. A prototype of the CORD leaf and spine network was shown as part of the ONS2015 Solutions Showcase.
ONS2015 Solutions Showcase: Open-source spine-leaf Fabric
Leaf and spine traffic engineering using segment routing and SDN describes a live demonstration presented in ONS2015 Solutions Showcase. The demonstration shows how centralized analytics and control can be used to optimize the performance of commodity leaf and spine networks handling the large "Elephant" flows that typically comprise most traffic on the network (for example, video streams - see SDN and large flows for a general discussion).

Getting back to the Fortune article, it is clear that the move to open commodity network, server and storage hardware shifts value from hardware to the software solutions that optimize performance. The network in particular is a critical resource that constrains overall performance and network optimization solutions can provide disproportionate benefits by eliminating bottlenecks that constrain compute and storage and limit the value delivered by the data center.

Friday, June 12, 2015

Leaf and spine traffic engineering using segment routing and SDN


The short 3 minute video is a live demonstration showing how software defined networking (SDN) can be used to orchestrate the measurement and control capabilities of commodity data center switches to automatically load balance traffic on a 4 leaf, 4 spine, 10 Gigabit leaf and spine network.
The diagram shows the physical layout of the demonstration rack. The four logical racks with their servers and leaf switches are combined in a single physical rack, along with the spine switches and SDN controllers. All the links in the data plane are 10G, and sFlow has been enabled on every switch and link with the following settings: a packet sampling rate of 1-in-8192 and a counter polling interval of 20 seconds. The switches have been configured to send the sFlow data to sFlow-RT analytics software running on Controller 1.

The switches are also configured to enable OpenFlow 1.3 and connect to multiple controllers in the redundant ONOS SDN controller cluster running on Controller 1 and Controller 2.
The charts from The Nature of Datacenter Traffic: Measurements & Analysis show data center traffic measurements published by Microsoft. Most traffic flows are short duration. However, combined they consume less bandwidth than a much smaller number of large flows with durations ranging from 10 seconds to 100 seconds. The large number of small flows are often referred to as "Mice" and the small number of large flows as "Elephants."

This demonstration focuses on the Elephant flows since they consume most of the bandwidth. The iperf load generator is used to generate two streams of back to back 10Gbyte transfers that should take around 8 seconds to complete over the 10Gbit/s leaf and spine network.
while true; do iperf -B 10.200.3.32 -c 10.200.3.42 -n 10000M; done
while true; do iperf -B 10.200.3.33 -c 10.200.3.43 -n 10000M; done
These two independent streams of connections from switch 103 to 104 drive the demo.
The HTML 5 dashboard queries sFlow-RT's REST API to extract and display real-time flow information.

The dashboard shows a topological view of the leaf and spine network in the top left corner. Highlighted "busy" links have a utilization of over 70% (i.e. 7Gbit/s). The topology shows flows taking independent paths from 103 to 104 (via spines 105 and 106). The links are highlighted in blue to indicate that the utilization on each link is driven by a single large flow. The chart immediately under the topology trends the number of busy links. The most recent point, to the far right of the chart, has a value of 4 and is colored blue, recording that 4 blue links are shown in the topology.

The bottom chart trends the total traffic entering the network broken out by flow. The current throughput is just under 20Gbit/s and is comprised of two roughly equal flows.

The ONOS controller configures the switches to forward packets using Equal Cost Multi-Path (ECMP) routing. There are four equal cost (hop count) paths from leaf switch 103 to leaf switch 104 (via spine switches 105, 106, 107 and 108). The switch hardware selects between paths based on a hash function calculated over selected fields in the packets (e.g. source and destination IP addresses + source and destination TCP ports):
index = hash(packet fields) % group.size
selected_physical_port = group[index]
Hash based load balancing works well for large numbers of Mice flows, but is less suitable for the Elephant flows. The hash function may assign multiple Elephant flows to the same path resulting in congestion and poor network performance.
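The collision risk can be seen with a toy version of the hash-based selection above (a simple deterministic string hash standing in for the hardware hash function, and made-up flow keys; real switch hashes differ):

```javascript
// Toy ECMP path selection: hash a flow key onto one of four equal-cost
// spine paths. With a small number of Elephant flows and a fixed hash,
// two flows can land on the same spine and collide.
const spines = [105, 106, 107, 108];

function flowHash(key) {  // simple deterministic string hash
  let h = 0;
  for (const c of key) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return h;
}
function selectSpine(flowKey) {
  return spines[flowHash(flowKey) % spines.length];
}

// By the pigeonhole principle, five distinct flows over four paths
// guarantee at least one collision.
const paths = ['f1', 'f2', 'f3', 'f4', 'f5'].map(selectSpine);
console.log(paths);
console.log(new Set(paths).size < paths.length); // true: some spine repeats
```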
This screen shot shows the effect of a collision between flows. Both flows have been assigned the same path via spine switch 105. The analytics software has determined that there are multiple large flows on the pair of busy links and indicates this by coloring the highlighted links yellow. The most recent point, to the far right of the upper trend chart, has a value of 2 and is colored yellow, recording that 2 yellow links are shown in the topology.

Notice that the bottom chart shows that the total throughput has dropped to 10Gbit/s and that each of the flows is limited to 5Gbit/s - halving the throughput and doubling the time taken to complete the data transfer.

The dashboard demonstrates that the sFlow-RT analytics engine has all the information needed to characterize the problem - identifying busy links and the large flows. What is needed is a way to take action to direct one of the flows on a different path across the network.

This is where the segment routing functionality of the ONOS SDN controller comes into its own. The controller implements Segment Routing in Networking (SPRING)  as the method of ECMP forwarding and provides a simple REST API for specifying paths across the network and assigning traffic to those paths.

In this example, the traffic is colliding because both flows are following a path running through spine switch 105. Paths from leaf 103 to 104 via spines 106, 107 or 108 have available bandwidth.

The following REST operation instructs the segment routing module to build a path from 103 via 106 to 104:
curl -H "Content-Type: application/json" -X POST http://localhost:8181/onos/segmentrouting/tunnel -d '{"tunnel_id":"t1", "label_path":[103,106,104]}'
Once the tunnel has been defined, the following REST operation assigns one of the colliding flows to the new path:
curl -H "Content-Type: application/json" -X POST http://localhost:8181/onos/segmentrouting/policy -d '{"policy_id":"p1", "priority":1000, "src_ip":"10.200.3.33/32", "dst_ip":"10.200.4.43/32", "proto_type":"TCP", "src_tp_port":53163, "dst_tp_port":5001, "policy_type":"TUNNEL_FLOW", "tunnel_id":"t1"}'
However, manually implementing these controls isn't feasible since there is a constant stream of flows that would require policy changes every few seconds.
The final screen shot shows the result of enabling the Flow Accelerator application on sFlow-RT. Flow Accelerator watches for collisions and automatically applies and removes segment routing policies as required to separate Elephant flows. In this case, the table on the top right of the dashboard shows that a single policy has been installed, sending one of the flows via spine 107.

The controller has been running for about half the interval shown in the two trend charts (approximately two and a half minutes). To the left you can see frequent long collisions and consequent dips in throughput. To the right you can see that more of the links are kept busy and flows experience consistent throughput.
Traffic analytics are a critical component of this demonstration. Why does this demonstration use sFlow? Could NetFlow/JFlow/IPFIX/OpenFlow etc. be used instead? The above diagram illustrates the basic architectural difference between sFlow and other common flow monitoring technologies. For this use case the key difference is that with sFlow real-time data from the entire network is available in a central location (the sFlow-RT analytics software), allowing the traffic engineering application to make timely load balancing decisions based on complete information. Rapidly detecting large flows, sFlow vs. NetFlow/IPFIX presents experimental data to demonstrate the difference in responsiveness between sFlow and the other flow monitoring technologies. OK, but what about using hardware packet counters periodically pushed via sFlow, or polled using SNMP or OpenFlow? Here again, measurement delay limits the usefulness of the counter information for SDN applications, see Measurement delay, counters vs. packet samples. Fortunately, the requirement for sFlow is not limiting since support for standard sFlow measurement is built into most vendor and white box hardware - see Drivers for growth.

Finally, the technologies presented in this demonstration have broad applicability beyond the leaf and spine use case. Elephant flows dominate data center, campus, wide area, and wireless networks (see SDN and large flows). In addition, segment routing is applicable to wide area networks as was demonstrated by an early version of the ONOS controller (Prototype & Demo Videos). The demonstration illustrates that the integration of real-time sFlow analytics in SDN solutions enables fundamentally new use cases that drive SDN to a new level - optimizing networks rather than simply provisioning them.

Monday, May 18, 2015

Analytics and SDN

Recent presentations from AT&T and Google describe SDN/NFV architectures that incorporate measurement based feedback in order to improve performance and reliability.

The first slide is from a presentation by AT&T's Margaret Chiosi; SDN+NFV Next Steps in the Journey, NFV World Congress 2015. The future architecture envisions generic (white box) hardware providing a stream of analytics that is compared against policies and used to drive actions to assure service levels.


The second slide is from the presentation by Google's Bikash Koley at the Silicon Valley Software Defined Networking Group Meetup. In this architecture, "network state changes observed by analyzing comprehensive time-series data stream." Telemetry is used to verify that the network is behaving as intended, identifying policy violations so that the management and control planes can apply corrective actions. Again, the software defined network is built from commodity white box switches.

Support for standard sFlow measurements is almost universally available in commodity switch hardware. sFlow agents embedded within network devices continuously stream measurements to the SDN controller, supplying the analytics component with the comprehensive, scalable, real-time visibility needed for effective control.
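To make the analytics feed concrete, the following sketch builds the JSON definitions that sFlow-RT's REST API accepts for programming a flow (PUT /flow/{name}/json) and a threshold (PUT /threshold/{name}/json); crossing the threshold generates an event that a controller can retrieve from /events/json. The endpoint URL, flow name, and the 100 Mb/s threshold value are assumptions for illustration.

```python
import json
from urllib.request import Request, urlopen

SFLOW_RT = "http://localhost:8008"  # assumed sFlow-RT REST endpoint

def flow_definition(keys, value="bytes", t=2):
    """Build an sFlow-RT flow definition (PUT /flow/{name}/json)."""
    return {"keys": keys, "value": value, "t": t}

def threshold_definition(metric, value):
    """Build a threshold (PUT /threshold/{name}/json); crossing it
    generates an event retrievable from /events/json."""
    return {"metric": metric, "value": value}

def put(path, obj):
    """Push a definition to sFlow-RT (requires a running instance)."""
    req = Request(SFLOW_RT + path, data=json.dumps(obj).encode(),
                  headers={"Content-Type": "application/json"}, method="PUT")
    return urlopen(req).status

# Track byte rates per address pair, alert when a flow exceeds ~100 Mb/s
tcp = flow_definition("ipsource,ipdestination")
alert = threshold_definition("tcp", 12500000)  # 12.5 MB/s ≈ 100 Mb/s
# put("/flow/tcp/json", tcp); put("/threshold/elephant/json", alert)
```

A controller polling /events/json would see an event shortly after any flow crosses the threshold, which is the feedback loop the AT&T and Google architectures above depend on.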

SDN fabric controller for commodity data center switches describes the measurement and control capabilities available in commodity switch hardware. In addition, a number of use cases described on this blog demonstrate the benefits of incorporating traffic analytics in SDN solutions.

While the incorporation of telemetry / analytics in SDN architectures is recent, the sFlow measurement standard is a proven technology that has been incorporated in switch ASICs for over a decade. Incorporating sFlow in SDN solution stacks leverages the capabilities of commodity switches to provide immediate visibility into operational networks without the complexity and cost of adding probes or being locked in to vendor-specific hardware.

Wednesday, April 1, 2015

Big Tap sFlow: Enabling Pervasive Flow-level Visibility


Today's Big Switch Networks webinar, Big Tap sFlow: Enabling Pervasive Flow-level Visibility, describes how Big Switch uses software defined networking (SDN) to control commodity switches and deliver network visibility. The webinar presents a live demonstration showing how real-time sFlow analytics is used to automatically drive SDN actions to provide a "smarter way to find a needle in a haystack."

The video presentation covers the following topics:

  • 0:00 Introduction to Big Tap
  • 7:00 sFlow generation and use cases
  • 12:30 Demonstration of real-time tap triggering based on sFlow

The webinar describes how the network wide monitoring provided by industry standard sFlow instrumentation complements the Big Tap SDN controller's ability to capture and direct selected packet streams to visibility tools.

The above slide from the webinar draws an analogy for the role that sFlow plays in targeting the capture network to that of a finderscope, the small, wide-angle telescope used to provide an overview of the sky and guide the telescope to its target. Support for the sFlow measurement standard is built into commodity switch hardware and is enabled on all ports in the capture network to provide a wide angle view of all traffic in the data center. Once suspicious activity is detected, targeted captures can be automatically triggered using Big Tap's REST API.
Blacklists are an important way in which the Internet community protects itself by identifying bad actors. Incorporating blacklists in traffic monitoring can be a useful way to find hosts on a network that have been compromised. If a host interacts with addresses known to be part of a botnet for example, then it raises the concern that the host has been compromised and is itself a member of the botnet.

Blacklists can be very large; the largest exceed a million addresses. Switches don't have the resources to match traffic against such large lists. However, sFlow shifts analysis from the switches to external software, which can easily handle the task of matching traffic against large lists. The live demonstration uses InMon's sFlow-RT real-time analytics software to match sFlow data against a large blacklist. When a match is detected, the Big Tap controller is programmed via a REST API call to capture all the packets from the suspected hosts and stream them to Wireshark for further investigation.
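The matching step itself is cheap once it is moved off the switch. The following sketch shows the idea: sampled packet headers exported via sFlow are checked against an in-memory set, which is a constant-time lookup even for a million-entry list. The addresses and the sample record layout are illustrative, not the demonstration's actual data.

```python
# Sketch of blacklist matching shifted to external software: the switch
# only exports sampled packet headers; matching against a large list is
# a constant-time set lookup here. Addresses below are illustrative.

blacklist = {"203.0.113.7", "198.51.100.23"}  # could hold millions of entries

def check_sample(sample, blacklist):
    """If a sampled flow touches a blacklisted address, return the
    local host to investigate; otherwise return None."""
    src, dst = sample["ipsource"], sample["ipdestination"]
    if dst in blacklist:
        return src   # local host talking to a known bad actor
    if src in blacklist:
        return dst
    return None

suspect = check_sample({"ipsource": "10.0.0.5",
                        "ipdestination": "203.0.113.7"}, blacklist)
print(suspect)  # → 10.0.0.5
```

On a match, the suspect address would be passed to the Big Tap controller's REST API to trigger a targeted capture of that host's traffic.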