Tuesday, July 14, 2015

SDN router using merchant silicon top of rack switch

The talk from David Barroso describes how Spotify optimizes hardware routing on a commodity switch by using sFlow analytics to identify the routes carrying the most traffic.  The full Internet routing table contains nearly 600,000 entries, too many for commodity switch hardware to handle. However, not all entries are active all the time. The Spotify solution uses traffic analytics to track the 30,000 most active routes (representing 6% of the full routing table) and push them into hardware. Based on Spotify's experience, offloading the active 30,000 routes to the switch provides hardware routing for 99% of their traffic.

David is interviewed by Ivan Pepelnjak in SDN Router @ Spotify on Software Gone Wild. The SDN Internet Router (SIR) source code and documentation are available on GitHub.
The diagram from David's talk shows the overall architecture of the solution. Initially the Internet Router (commodity switch hardware) uses a default route to direct outbound traffic to a Transit Provider (capable of handling all the outbound traffic). The BGP Controller learns routes via BGP and observes traffic using the standard sFlow measurement technology embedded with most commodity switch silicon.
After a period (1 hour) the BGP Controller identifies the most active 30,000 prefixes and configures the Internet Router to install these routes in the hardware so that traffic takes the best routes to each peer. Each subsequent period provides new measurements and the controller adjusts the active set of routes accordingly.
The internals of the BGP Controller are shown in this diagram. BGP and sFlow data are received by the pmacct traffic accounting software, which then writes out files containing traffic by prefix. The bgpc.py script calculates the TopN prefixes and installs them in the Internet Router. 
In this example, the Bird routing daemon is running on the Internet Router and the TopN prefixes are written into a filter file that restricts prefixes that can be installed in the hardware. 
The SIR router demonstrates that building an SDN controller that leverages standard measurement and control capabilities of commodity hardware has the potential to disrupt the router market by replacing expensive custom routers with inexpensive commodity switches based on merchant silicon. However, the relatively slow feedback loop (updating measurements every hour) limits SIR to access routers with relatively stable traffic patterns.

The rest of this article discusses how a fast feedback loop can be built by combining real-time sFlow analytics with a BGP control plane. A fast feedback loop significantly reduces the number of hardware cache misses and increases the scalability of the solution, allowing a broader range of use cases to be addressed.

This diagram differs from the SIR router, re-casting the role of the hardware Switch as an accelerator that handles forwarding for a subset of prefixes in order to reduce the traffic forwarded by a Router implementing the full Internet routing table. Applications for this approach include taking an existing router and boosting its throughput (e.g. boosting a 1 Gigabit router to 100 Gigabit), or, more disruptively, replacing an expensive hardware router with a commodity Linux server.
Route caching is not a new idea. The paper, Revisiting Route Caching: The World Should Be Flat, reviews the history of route caching and examines its application to contemporary workloads and requirements.
The throughput increase is determined by the cache hit rate that can be achieved with the limited number of routing entries supported by the switch hardware. For example, if the hardware achieves a 90% cache hit rate, then only 10% of the traffic is handled by the router and the throughput is boosted by a factor of 10.
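The arithmetic above can be sketched in a few lines. This is only an illustration: the function name throughputBoost is hypothetical, and it assumes the router is the sole bottleneck and that every cache miss is forwarded by the router.

```javascript
// Throughput multiplier for a given hardware cache hit rate.
// Assumption: the router only has to forward the cache misses,
// so the boost is 1 / (fraction of traffic that misses).
function throughputBoost(hitRate) {
  if (hitRate < 0 || hitRate >= 1) {
    throw new RangeError('hitRate must be in the range [0, 1)');
  }
  return 1 / (1 - hitRate);
}

// Compare a few hit rates; results are rounded for display.
for (const hitRate of [0.9, 0.95, 0.99]) {
  console.log(`hit rate ${hitRate}: boost ~${throughputBoost(hitRate).toFixed(1)}x`);
}
```

The boost grows quickly as the hit rate approaches 100%, which is why small improvements in cache hit rate matter so much.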

A fast control loop is critical to increasing the cache hit rate, rapidly detecting traffic to new destination prefixes and installing hardware forwarding entries that minimize traffic through the router.

The sFlow-RT analytics software already provides real-time (sub-second) traffic analytics and recently added experimental BGP support allows sFlow-RT to act as a route reflector client, learning the full set of prefixes so that it can track traffic rates by prefix.
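Tracking traffic by prefix means mapping each sampled destination address to the most specific learned prefix that contains it. The following is a minimal sketch of that idea, not sFlow-RT's implementation: all function names and the sample data are hypothetical, it handles IPv4 only, it uses a linear scan where a real implementation would likely use a trie, and it ignores scaling by the sFlow sampling rate.

```javascript
// Convert dotted-quad IPv4 text to an unsigned 32-bit integer.
function ipToInt(ip) {
  return ip.split('.').reduce((acc, octet) => ((acc << 8) | Number(octet)) >>> 0, 0);
}

// Parse "a.b.c.d/len" into a network address and mask.
function parsePrefix(cidr) {
  const [addr, len] = cidr.split('/');
  const bits = Number(len);
  // A shift by 32 is a no-op in JavaScript, so /0 is handled explicitly.
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return { cidr, bits, mask, network: (ipToInt(addr) & mask) >>> 0 };
}

// Return the most specific prefix containing ip, or null if none matches.
function longestMatch(prefixes, ip) {
  const addr = ipToInt(ip);
  let best = null;
  for (const p of prefixes) {
    const matches = ((addr & p.mask) >>> 0) === p.network;
    if (matches && (best === null || p.bits > best.bits)) best = p;
  }
  return best ? best.cidr : null;
}

// Accumulate sampled bytes against the matching prefix.
function bytesByPrefix(prefixes, samples) {
  const totals = new Map();
  for (const { dst, bytes } of samples) {
    const key = longestMatch(prefixes, dst);
    if (key !== null) totals.set(key, (totals.get(key) || 0) + bytes);
  }
  return totals;
}

// Hypothetical routing table and packet samples.
const table = ['10.0.0.0/8', '10.1.0.0/16', '192.0.2.0/24'].map(parsePrefix);
const samples = [
  { dst: '10.1.2.3', bytes: 1500 },
  { dst: '10.9.9.9', bytes: 500 },
  { dst: '10.1.0.7', bytes: 100 },
  { dst: '198.51.100.1', bytes: 64 }, // no matching prefix, ignored
];
console.log(bytesByPrefix(table, samples));
```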

The following steps are required to try out the software.

First download sFlow-RT.
wget http://www.inmon.com/products/sFlow-RT/sflow-rt.tar.gz
tar -xvzf sflow-rt.tar.gz
cd sflow-rt
Next, configure sFlow-RT to listen for BGP connections. In this case, add the following entries to the start.sh file to enable BGP, listening on port 1179 rather than the well-known BGP port 179 so that sFlow-RT does not need to run with root privileges:
-Dbgp.start=yes -Dbgp.port=1179
Edit the init.js file and use the bgpAddNeighbor function to peer with the Router, where NNNN is the local autonomous system (AS) number, and the bgpAddSource function to combine sFlow data from the Switch with the routing table, tracking bytes/second and using a 10 second moving average.
Configure the Switch to send sFlow to sFlow-RT (see Switch configurations).

Configure the Router as a route reflector, connecting to sFlow-RT and exporting the full routing table. For example, using Quagga as the routing daemon:
router bgp NNNN
 bgp router-id
 neighbor remote-as NNNN
 neighbor port 1179
 neighbor route-reflector-client
Start sFlow-RT by running the start.sh script.
The following cURL command accesses the sFlow-RT REST API to query the TopN prefixes:
curl ""
{
 "as": NNNN,
 "direction": "destination",
 "id": "N.N.N.N",
 "learnedPrefixesAdded": 568313,
 "learnedPrefixesRemoved": 9,
 "nPrefixes": 567963,
 "pushedPrefixesAdded": 0,
 "pushedPrefixesRemoved": 0,
 "startTime": 1436830843625,
 "state": "established",
 "topPrefixes": [
  {
   "aspath": "NNNN",
   "localpref": 888,
   "nexthop": "N.N.N.N",
   "origin": "IGP",
   "prefix": "",
   "value": 680740.5504781345
  },
  {
   "aspath": "NNNN-NNNN",
   "localpref": 100,
   "nexthop": "N.N.N.N",
   "origin": "IGP",
   "prefix": "N.N.0.0/14",
   "value": 58996.251739893225
  },
  {
   "aspath": "NNNN-NNNNN",
   "localpref": 130,
   "nexthop": "N.N.N.N",
   "origin": "IGP",
   "prefix": "N.N.0.0/13",
   "value": 7966.802831354894
  },
  {
   "localpref": 100,
   "med": 2,
   "nexthop": "N.N.N.N",
   "origin": "IGP",
   "prefix": "N.N.N.0/18",
   "value": 3059.8853014045844
  },
  {
   "aspath": "NNNN",
   "localpref": 1010,
   "med": 0,
   "nexthop": "N.N.N.N",
   "origin": "IGP",
   "prefix": "N.N.N.0/24",
   "value": 1635.0250535959976
  }
 ],
 "valuePercentCoverage": 99.67670497397555,
 "valueTopPrefixes": 752398.5154043833,
 "valueTotal": 754838.871931838
}

In addition to returning the top prefixes, the query returns information about the amount of traffic covered by these prefixes. In this case, the valuePercentCoverage of 99.68 indicates that 99.68 percent of the traffic is covered by the top 5 prefixes.
Try running this query on your own network to find out how many prefixes are required to cover 90%, 95%, and 99% of the traffic. If you have results you can share, please post them as comments to this article.
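Given per-prefix traffic rates, the number of prefixes needed for a target coverage can be computed as in the following sketch. The function name prefixesForCoverage and the sample rates are hypothetical and are not taken from a real network; in practice the values would come from the "value" fields returned by the query.

```javascript
// Smallest number of top prefixes whose combined traffic reaches
// `percent` of the total. `values` holds one traffic rate per prefix.
function prefixesForCoverage(values, percent) {
  const sorted = [...values].sort((a, b) => b - a); // busiest first
  const total = sorted.reduce((sum, v) => sum + v, 0);
  if (total === 0) return 0;
  let running = 0;
  for (let i = 0; i < sorted.length; i++) {
    running += sorted[i];
    // Cross-multiply to avoid dividing and comparing fractions.
    if (running * 100 >= percent * total) return i + 1;
  }
  return sorted.length;
}

// Hypothetical bytes/second per prefix.
const rates = [500, 250, 120, 60, 30, 20, 10, 5, 3, 2];
for (const percent of [90, 95, 99]) {
  console.log(`${percent}% coverage needs ${prefixesForCoverage(rates, percent)} prefixes`);
}
```

With a heavily skewed distribution like this one, a small number of prefixes covers most of the traffic, which is the property the hardware route cache depends on.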
Obtaining the TopN prefixes is only part of the SDN routing application. An efficient method of installing the TopN prefixes in the switch hardware is also required. The SIR router uses a configuration file, but this approach doesn't work well for rapidly modifying large tables. In addition, configuration files vary between routers, limiting the portability of the controller.

In addition to listening for routes using BGP, sFlow-RT can also act as a BGP speaker. The following init.js script implements a basic hardware route cache:


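// Prefixes currently pushed to the switch, mapped to the time last seen active
var installed = {};
setIntervalHandler(function() {
  let now = Date.now();

  // Query the current top 100 destination prefixes
  let top = bgpTopPrefixes('',100,1,'destination');
  if(!top || !top.hasOwnProperty('topPrefixes')) return;

  // Only push routes while the BGP session to the switch is established
  let tgt = bgpTopPrefixes('',0);
  if(!tgt || 'established' != tgt.state) return;

  // Install (or refresh) every active prefix
  for(let i = 0; i < top.topPrefixes.length; i++) {
     let entry = top.topPrefixes[i];
     if(bgpAddRoute('',entry)) {
       installed[entry.prefix] = now;
     }
  }

  // Withdraw prefixes that were not refreshed in this interval
  for(let prefix in installed) {
     let time = installed[prefix];
     if(time === now) continue;
     if(bgpRemoveRoute('',prefix)) {
        delete installed[prefix];
     }
  }
}, 5);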
Some notes on the script:
  1. setIntervalHandler registers a function that is called every 5 seconds
  2. The interval handler queries for the top 100 destination prefixes
  3. Active prefixes are pushed to the switch using bgpAddRoute
  4. Inactive prefixes are withdrawn using bgpRemoveRoute
  5. The bgpAddRoute/bgpRemoveRoute functions are BGP session state aware and will only forward changes
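The add/withdraw logic described in these notes can be exercised outside sFlow-RT with stub functions. In this sketch, addRoute and removeRoute are hypothetical stand-ins for bgpAddRoute and bgpRemoveRoute that simply record each call, and the prefixes are documentation addresses chosen for illustration.

```javascript
// Hypothetical stand-ins for bgpAddRoute/bgpRemoveRoute: record the call
// and report success. The real functions talk to the BGP session.
const calls = [];
const addRoute = (entry) => { calls.push(`add ${entry.prefix}`); return true; };
const removeRoute = (prefix) => { calls.push(`remove ${prefix}`); return true; };

// Prefix -> timestamp of the interval in which it was last active.
const installed = {};

// One pass of the cache update: refresh active prefixes, withdraw the rest.
function updateCache(topPrefixes, now) {
  for (const entry of topPrefixes) {
    if (addRoute(entry)) installed[entry.prefix] = now;
  }
  for (const prefix of Object.keys(installed)) {
    if (installed[prefix] === now) continue; // still active this interval
    if (removeRoute(prefix)) delete installed[prefix];
  }
}

// Two consecutive intervals with an overlapping set of active prefixes.
updateCache([{ prefix: '192.0.2.0/24' }, { prefix: '198.51.100.0/24' }], 1);
updateCache([{ prefix: '198.51.100.0/24' }, { prefix: '203.0.113.0/24' }], 2);
console.log(calls);
console.log(Object.keys(installed));
```

The stub re-adds a prefix that is already installed on every pass. According to the notes above, the real functions track session state and only forward changes, so repeated adds of an unchanged route should not generate extra BGP updates.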

The initial BGP functionality is experimental and fairly limited (no IPv6, no communities, etc.). Please report any bugs here or on the sFlow-RT group.

Try out the software and provide feedback. This example was only one use case for combining sFlow and BGP in an SDN controller. Other use cases include inbound / outbound traffic engineering, DDoS mitigation, multi-path load balancing, etc. Finally, the combination of commodity hardware with mature, widely deployed BGP and sFlow protocols is a pragmatic approach to SDN that allows solutions to be developed rapidly and deployed widely in production environments.
