Tuesday, July 21, 2015

White box Internet router PoC

SDN router using merchant silicon top of rack switch describes how the performance of a software Internet router could be accelerated using the hardware routing capabilities of a commodity switch. This article describes a proof of concept demonstration using Linux virtual machines and a bare metal switch running Cumulus Linux.
The diagram shows the demo setup, providing inter-domain routing between Peer 1 and Peer 2. The Peers are directly connected to the Hardware Switch and ingress packets are routed by the default (0.0.0.0/0) route to the Software Router. The Software Router learns the full set of routes from the Peers using BGP and forwards the packet to the correct next hop router. The packet is then switched to the selected peer router via bridge br_xen.
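The demo relies on the Hardware Switch having a default route that points at the Software Router (192.168.152.1 is the Software Router's address on br_xen, as shown in the routing tables later in this article). How the route is installed isn't shown in the demo; a static route is one simple option, for example:
cumulus@cumulus$ sudo ip route add default via 192.168.152.1 dev br_xen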

The following traceroute, run on Peer 1, shows the set of router hops from 192.168.250.1 to 192.168.251.1:
[root@peer1 ~]# traceroute -s 192.168.250.1 192.168.251.1
traceroute to 192.168.251.1 (192.168.251.1), 30 hops max, 40 byte packets
 1  192.168.152.2 (192.168.152.2)  3.090 ms  3.014 ms  2.927 ms
 2  192.168.150.3 (192.168.150.3)  3.377 ms  3.161 ms  3.099 ms
 3  192.168.251.1 (192.168.251.1)  6.440 ms  6.228 ms  3.217 ms
Ensuring that packets are first forwarded by the default route on the Hardware Switch is the key to accelerating forwarding decisions for the Software Router. Any more specific routes added to the Hardware Switch override the default route (longest prefix match), so matching packets bypass the Software Router and are forwarded directly by the Hardware Switch. For example, a packet destined for 192.168.251.5 follows the default route until 192.168.251.0/24 is installed in hardware, at which point the more specific /24 route wins.
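To make the override concrete, here is a small JavaScript sketch of longest-prefix-match selection (illustrative only; the real lookup is performed by the switch ASIC, not by this code):
// convert a dotted-quad IPv4 address to an integer
function ipToInt(ip) {
  return ip.split('.').reduce(function(acc, o) { return (acc * 256) + parseInt(o, 10); }, 0);
}
// does dst fall within prefix/bits?
function matches(dst, prefix, bits) {
  var mask = bits === 0 ? 0 : (0xFFFFFFFF << (32 - bits)) >>> 0;
  return ((ipToInt(dst) & mask) >>> 0) === ((ipToInt(prefix) & mask) >>> 0);
}
// return the most specific matching route
function lookup(dst, routes) {
  var best = null;
  for (var i = 0; i < routes.length; i++) {
    var r = routes[i];
    if (matches(dst, r.prefix, r.bits) && (!best || r.bits > best.bits)) best = r;
  }
  return best;
}
// the default route sends traffic to the Software Router until a more
// specific route is pushed into the hardware table
var routes = [ {prefix:'0.0.0.0', bits:0, nexthop:'192.168.152.1'} ];
console.log(lookup('192.168.251.5', routes).nexthop); // 192.168.152.1 (Software Router)
routes.push({prefix:'192.168.251.0', bits:24, nexthop:'192.168.151.1'});
console.log(lookup('192.168.251.5', routes).nexthop); // 192.168.151.1 (hardware route wins)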

In this test bench, routing is performed by Quagga instances running on Peer 1, Peer 2, the Hardware Switch, and the Software Router. The peers connect to the Software Router's loopback address (192.168.152.3), which is two hops away via the Hardware Switch's default route, hence the ebgp-multihop 2 setting in the configurations below.

Peer 1

router bgp 150
 bgp router-id 192.168.150.1
 network 192.168.250.0/24
 neighbor 192.168.152.3 remote-as 152
 neighbor 192.168.152.3 ebgp-multihop 2

Peer 2

router bgp 151
 bgp router-id 192.168.151.1
 network 192.168.251.0/24
 neighbor 192.168.152.3 remote-as 152
 neighbor 192.168.152.3 ebgp-multihop 2

Software Router

interface lo
  ip address 192.168.152.3/32

router bgp 152
 bgp router-id 192.168.152.3
 ! passive eBGP sessions with the peers, sourced from the loopback address
 neighbor 192.168.150.1 remote-as 150
 neighbor 192.168.150.1 update-source 192.168.152.3
 neighbor 192.168.150.1 passive
 neighbor 192.168.151.1 remote-as 151
 neighbor 192.168.151.1 update-source 192.168.152.3
 neighbor 192.168.151.1 passive
 ! iBGP session exporting the full set of learned routes to sFlow-RT (listening on port 1179)
 neighbor 10.0.0.162 remote-as 152
 neighbor 10.0.0.162 port 1179
 neighbor 10.0.0.162 timers connect 30
 neighbor 10.0.0.162 route-reflector-client

Hardware Switch

router bgp 65000
 bgp router-id 0.0.0.1
 neighbor 10.0.0.162 remote-as 65000
 neighbor 10.0.0.162 port 1179
 neighbor 10.0.0.162 timers connect 30
In addition, the following lines in /etc/network/interfaces configure the bridge:
auto br_xen
iface br_xen
  bridge-ports swp1 swp2 swp3
  address 192.168.150.2/24
  address 192.168.151.2/24
  address 192.168.152.2/24
Cumulus Networks, sFlow and data center automation describes how to configure sFlow monitoring on Cumulus Linux switches. The switch is configured to send sFlow to 10.0.0.162 (the host running the SDN controller).
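For reference, a minimal /etc/hsflowd.conf along these lines might look like the following sketch (the sampling and polling values are illustrative assumptions; see the linked article for recommended settings):
sflow {
  DNSSD = off
  polling = 20
  sampling = 400
  collector {
    ip = 10.0.0.162
    udpport = 6343
  }
}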

SDN Routing Application


SDN router using merchant silicon top of rack switch describes how to install sFlow-RT and provides an application for pushing active routes to an accelerator. The application has been modified for this setup and is running on host 10.0.0.162:
bgpAddNeighbor('10.0.0.152',152);           // BGP session with the Software Router (AS 152)
bgpAddNeighbor('10.0.0.233',65000);         // BGP session with the Hardware Switch (AS 65000)
bgpAddSource('10.0.0.233','10.0.0.152',10); // merge sFlow from the switch with routes from the router, 10 second moving average

var installed = {};
setIntervalHandler(function() {
  let now = Date.now();
  // top 20,000 destination prefixes learned from the Software Router
  let top = bgpTopPrefixes('10.0.0.152',20000,1);
  if(!top || !top.hasOwnProperty('topPrefixes')) return;

  // only push routes if the session to the Hardware Switch is established
  let tgt = bgpTopPrefixes('10.0.0.233',0);
  if(!tgt || 'established' != tgt.state) return;

  // push active prefixes to the Hardware Switch, recording when each was last active
  for(let i = 0; i < top.topPrefixes.length; i++) {
     let entry = top.topPrefixes[i];
     if(bgpAddRoute('10.0.0.233',entry)) {
       installed[entry.prefix] = now; 
     }
  }
  // withdraw prefixes that have dropped out of the active set
  for(let prefix in installed) {
     let time = installed[prefix];
     if(time === now) continue;
     if(bgpRemoveRoute('10.0.0.233',prefix)) {
        delete installed[prefix];
     } 
  }
}, 1);
Start the application:
$ ./start.sh 
2015-07-21T19:36:52-0700 INFO: Listening, BGP port 1179
2015-07-21T19:36:52-0700 INFO: Listening, sFlow port 6343
2015-07-21T19:36:53-0700 INFO: Starting the Jetty [HTTP/1.1] server on port 8008
2015-07-21T19:36:53-0700 INFO: Starting com.sflow.rt.rest.SFlowApplication application
2015-07-21T19:36:53-0700 INFO: Listening, http://localhost:8008
2015-07-21T19:36:53-0700 INFO: bgp.js started
2015-07-21T19:36:57-0700 INFO: BGP open /10.0.0.152:50010
2015-07-21T19:37:23-0700 INFO: BGP open /10.0.0.233:55097
Next, examine the routing table on the Hardware Switch:
cumulus@cumulus$ ip route
default via 192.168.152.1 dev br_xen 
10.0.0.0/24 dev eth0  proto kernel  scope link  src 10.0.0.233 
192.168.150.0/24 dev br_xen  proto kernel  scope link  src 192.168.150.2 
192.168.151.0/24 dev br_xen  proto kernel  scope link  src 192.168.151.2 
192.168.152.0/24 dev br_xen  proto kernel  scope link  src 192.168.152.2
This is the default set of routes configured to pass traffic to and from the Software Router.

To generate traffic using iperf, run the following command on Peer 2:
iperf -s -B 192.168.251.1
And generate traffic with the following command on Peer 1:
iperf -c 192.168.251.1 -B 192.168.250.1
Within a few seconds the interval handler detects the active prefixes and pushes the corresponding routes. Now check the routing table on the Hardware Switch again:
cumulus@cumulus$ ip route
default via 192.168.152.1 dev br_xen 
10.0.0.0/24 dev eth0  proto kernel  scope link  src 10.0.0.233  
192.168.150.0/24 dev br_xen  proto kernel  scope link  src 192.168.150.2 
192.168.151.0/24 dev br_xen  proto kernel  scope link  src 192.168.151.2 
192.168.152.0/24 dev br_xen  proto kernel  scope link  src 192.168.152.2 
192.168.250.0/24 via 192.168.150.1 dev br_xen  proto zebra  metric 20 
192.168.251.0/24 via 192.168.151.1 dev br_xen  proto zebra  metric 20
Note the two hardware routes (proto zebra) that have been added by the SDN controller. The route override can be verified by repeating the traceroute test:
[root@peer1 ~]# traceroute -s 192.168.250.1 192.168.251.1
traceroute to 192.168.251.1 (192.168.251.1), 30 hops max, 40 byte packets
 1  192.168.150.2 (192.168.150.2)  3.260 ms  3.151 ms  3.014 ms
 2  192.168.251.1 (192.168.251.1)  4.418 ms  4.351 ms  4.260 ms
Comparing with the original traceroute, notice that packets bypass the Software Router interface (192.168.150.3) and are forwarded entirely in hardware.

The traffic analytics driving the forwarding decisions can be viewed through the sFlow-RT REST API:
$ curl http://10.0.0.162:8008/bgp/topprefixes/10.0.0.152/json
{
 "as": 152,
 "direction": "destination",
 "id": "192.168.152.3",
 "learnedPrefixesAdded": 2,
 "learnedPrefixesRemoved": 0,
 "nPrefixes": 2,
 "pushedPrefixesAdded": 0,
 "pushedPrefixesRemoved": 0,
 "startTime": 1437535255553,
 "state": "established",
 "topPrefixes": [
  {
   "aspath": "150",
   "localpref": 100,
   "med": 0,
   "nexthop": "192.168.150.1",
   "origin": "IGP",
   "prefix": "192.168.250.0/24",
   "value": 1.4462334178258518E7
  },
  {
   "aspath": "151",
   "localpref": 100,
   "med": 0,
   "nexthop": "192.168.151.1",
   "origin": "IGP",
   "prefix": "192.168.251.0/24",
   "value": 391390.33359066787
  }
 ],
 "valuePercentCoverage": 100,
 "valueTopPrefixes": 1.4853724511849185E7,
 "valueTotal": 1.4853724511849185E7
}
The SDN application automatically removes routes from the hardware once they become idle, when they are withdrawn by BGP, or to make room for more active routes if the hardware routing table exceeds the configured limit of 20,000 routes. This switch has a maximum capacity of 32,768 routes, and standard sFlow analytics can be used to monitor hardware table utilization, see Broadcom ASIC table utilization metrics, DevOps, and SDN.
The test setup was assembled to quickly prove the concept using the limited hardware at hand; a production deployment would be engineered more carefully (for example, using smaller subnets and VLANs to separate peer traffic).
This proof of concept demonstrates that it is possible to use SDN analytics and control to combine the standard sFlow and BGP capabilities of commodity hardware and deliver terabit routing capacity with just a few thousand dollars of hardware.

Tuesday, July 14, 2015

SDN router using merchant silicon top of rack switch

The talk from David Barroso describes how Spotify optimizes hardware routing on a commodity switch by using sFlow analytics to identify the routes carrying the most traffic. The full Internet routing table contains nearly 600,000 entries, too many for commodity switch hardware to handle. However, not all entries are active all the time. The Spotify solution uses traffic analytics to track the 30,000 most active routes (5% of the full routing table) and push them into hardware. Based on Spotify's experience, offloading the 30,000 most active routes to the switch provides hardware routing for 99% of their traffic.

David is interviewed by Ivan Pepelnjak in SDN Router @ Spotify on Software Gone Wild. The SDN Internet Router (SIR) source code and documentation are available on GitHub.
The diagram from David's talk shows the overall architecture of the solution. Initially, the Internet Router (commodity switch hardware) uses a default route to direct outbound traffic to a Transit Provider (capable of handling all the outbound traffic). The BGP Controller learns routes via BGP and observes traffic using the standard sFlow measurement technology embedded in most commodity switch silicon.
After a period (1 hour) the BGP Controller identifies the most active 30,000 prefixes and configures the Internet Router to install these routes in the hardware so that traffic takes the best routes to each peer. Each subsequent period provides new measurements and the controller adjusts the active set of routes accordingly.
The internals of the BGP Controller are shown in this diagram. BGP and sFlow data are received by the pmacct traffic accounting software, which then writes out files containing traffic by prefix. The bgpc.py script calculates the TopN prefixes and installs them in the Internet Router. 
In this example, the Bird routing daemon is running on the Internet Router and the TopN prefixes are written into a filter file that restricts prefixes that can be installed in the hardware. 
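For illustration, a Bird prefix filter of this kind might look like the following sketch (the filter name and prefixes are placeholders, not taken from SIR; the actual file format is documented in the SIR repository):
filter sir_active_prefixes
{
  # accept only the TopN prefixes computed by the controller
  if net ~ [ 192.0.2.0/24, 198.51.100.0/22, 203.0.113.0/24 ] then accept;
  reject;
}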
The SIR router demonstrates that building an SDN controller that leverages the standard measurement and control capabilities of commodity hardware has the potential to disrupt the router market by replacing expensive custom routers with inexpensive commodity switches based on merchant silicon. However, the slow feedback loop (measurements update every hour) limits SIR to access routers with relatively stable traffic patterns.

The rest of this article discusses how a fast feedback loop can be built by combining real-time sFlow analytics with a BGP control plane. A fast feedback loop significantly reduces the number of hardware cache misses and increases the scalability of the solution, allowing a broader range of use cases to be addressed.

This diagram differs from the SIR router in re-casting the role of the hardware switch as an accelerator that handles forwarding for a subset of prefixes in order to reduce the traffic forwarded by a Router implementing the full Internet routing table. Applications for this approach include taking an existing router and boosting its throughput (e.g. boosting a 1 Gigabit router to 100 Gigabit), or, more disruptively, replacing an expensive hardware router with a commodity Linux server.
Route caching is not a new idea. The paper Revisiting Route Caching: The World Should Be Flat reviews the history of route caching and examines its application to contemporary workloads and requirements.
The throughput increase is determined by the cache hit rate that can be achieved with the limited number of routing entries supported by the switch hardware; in general, the boost factor is 1/(1 − hit rate). For example, if the hardware achieves a 90% cache hit rate, then only 10% of the traffic is handled by the router and throughput is boosted by a factor of 10.

A fast control loop is critical to increasing the cache hit rate, rapidly detecting traffic to new destination prefixes and installing hardware forwarding entries that minimize traffic through the router.

The sFlow-RT analytics software already provides real-time (sub-second) traffic analytics, and recently added experimental BGP support allows sFlow-RT to act as a route reflector client, learning the full set of prefixes so that it can track traffic rates by prefix.

The following steps are required to try out the software.

First, download sFlow-RT:
wget http://www.inmon.com/products/sFlow-RT/sflow-rt.tar.gz
tar -xvzf sflow-rt.tar.gz
cd sflow-rt
Next, configure sFlow-RT to listen for BGP connections. In this case, add the following entries to the start.sh file to enable BGP, listening on port 1179 rather than the well-known BGP port 179 so that sFlow-RT does not need to run with root privileges:
-Dbgp.start=yes -Dbgp.port=1179
Edit the init.js file and use the bgpAddNeighbor function to peer with the Router (10.0.0.254), where NNNN is the local autonomous system (AS) number, and the bgpAddSource function to combine sFlow data from the Switch (10.0.0.253) with the routing table, tracking bytes per second with a 10 second moving average:
bgpAddNeighbor('10.0.0.254',NNNN);
bgpAddSource('10.0.0.253','10.0.0.254',10,'bytes');
Configure the Switch to send sFlow to sFlow-RT (see Switch configurations).

Configure the Router as a route reflector, connecting to sFlow-RT (10.0.0.252) and exporting the full routing table. For example, using Quagga as the routing daemon:
router bgp NNNN
 bgp router-id 10.0.0.254
 neighbor 10.0.0.252 remote-as NNNN
 neighbor 10.0.0.252 port 1179
 neighbor 10.0.0.252 route-reflector-client
Start sFlow-RT:
./start.sh
The following cURL command accesses the sFlow-RT REST API to query the TopN prefixes:
curl "http://10.0.0.162:8008/bgp/topprefixes/10.0.0.30/json?direction=destination&maxPrefixes=5&minValue=1"
{
 "as": NNNN,
 "direction": "destination",
 "id": "N.N.N.N",
 "learnedPrefixesAdded": 568313,
 "learnedPrefixesRemoved": 9,
 "nPrefixes": 567963,
 "pushedPrefixesAdded": 0,
 "pushedPrefixesRemoved": 0,
 "startTime": 1436830843625,
 "state": "established",
 "topPrefixes": [
  {
   "aspath": "NNNN",
   "localpref": 888,
   "nexthop": "N.N.N.N",
   "origin": "IGP",
   "prefix": "0.0.0.0/0",
   "value": 680740.5504781345
  },
  {
   "aspath": "NNNN-NNNN",
   "localpref": 100,
   "nexthop": "N.N.N.N",
   "origin": "IGP",
   "prefix": "N.N.0.0/14",
   "value": 58996.251739893225
  },
  {
   "aspath": "NNNN-NNNNN",
   "localpref": 130,
   "nexthop": "N.N.N.N",
   "origin": "IGP",
   "prefix": "N.N.0.0/13",
   "value": 7966.802831354894
  },
  {
   "localpref": 100,
   "med": 2,
   "nexthop": "N.N.N.N",
   "origin": "IGP",
   "prefix": "N.N.N.0/18",
   "value": 3059.8853014045844
  },
  {
   "aspath": "NNNN",
   "localpref": 1010,
   "med": 0,
   "nexthop": "N.N.N.N",
   "origin": "IGP",
   "prefix": "N.N.N.0/24",
   "value": 1635.0250535959976
  }
 ],
 "valuePercentCoverage": 99.67670497397555,
 "valueTopPrefixes": 752398.5154043833,
 "valueTotal": 754838.871931838
}

In addition to returning the top prefixes, the query returns information about the amount of traffic covered by these prefixes. In this case, the valuePercentCoverage of 99.67 indicates that 99.67 percent of the traffic is covered by the top 5 prefixes.
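One way to find how many prefixes are needed for a given level of coverage is to sweep the maxPrefixes parameter and read valuePercentCoverage from each result, for example (addresses as in the query above):
curl "http://10.0.0.162:8008/bgp/topprefixes/10.0.0.30/json?direction=destination&maxPrefixes=100"
curl "http://10.0.0.162:8008/bgp/topprefixes/10.0.0.30/json?direction=destination&maxPrefixes=1000"
curl "http://10.0.0.162:8008/bgp/topprefixes/10.0.0.30/json?direction=destination&maxPrefixes=10000"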
Try running this query on your own network to find out how many prefixes are required to cover 90%, 95%, and 99% of the traffic. If you have results you can share, please post them as comments to this article.
Obtaining the TopN prefixes is only part of the SDN routing application. An efficient method of installing the TopN prefixes in the switch hardware is also required. The SIR router uses a configuration file, but this approach doesn't work well for rapidly modifying large tables. In addition, configuration files vary between routers, limiting the portability of the controller.

In addition to listening for routes using BGP, sFlow-RT can also act as a BGP speaker. The following init.js script implements a basic hardware route cache:

bgpAddNeighbor('10.0.0.254',65000);
bgpAddSource('10.0.0.253','10.0.0.254',10,'bytes');
bgpAddNeighbor('10.0.0.253',65000);

var installed = {};
setIntervalHandler(function() {
  let now = Date.now();
  let top = bgpTopPrefixes('10.0.0.254',100,1,'destination');
  if(!top || !top.hasOwnProperty('topPrefixes')) return;

  let tgt = bgpTopPrefixes('10.0.0.253',0);
  if(!tgt || 'established' != tgt.state) return;

  for(let i = 0; i < top.topPrefixes.length; i++) {
     let entry = top.topPrefixes[i];
     if(bgpAddRoute('10.0.0.253',entry)) {
       installed[entry.prefix] = now; 
     }
  }
  for(let prefix in installed) {
     let time = installed[prefix];
     if(time === now) continue;
     if(bgpRemoveRoute('10.0.0.253',prefix)) {
        delete installed[prefix];
     } 
  }
}, 5);
Some notes on the script:
  1. setIntervalHandler registers a function that is called every 5 seconds
  2. The interval handler queries for the top 100 destination prefixes
  3. Active prefixes are pushed to the switch using bgpAddRoute
  4. Inactive prefixes are withdrawn using bgpRemoveRoute
  5. The bgpAddRoute/bgpRemoveRoute functions are BGP session state aware and will only forward changes; the effect can be verified with the REST query shown below
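Progress can be checked through the REST API described earlier; the pushedPrefixesAdded and pushedPrefixesRemoved counters in the switch-facing session reflect the routes the script has installed and withdrawn (addresses as in the script above):
curl "http://10.0.0.252:8008/bgp/topprefixes/10.0.0.253/json"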

The initial BGP functionality is fairly limited (no IPv6, no communities, ...) and experimental; please report any bugs here, or on the sFlow-RT group.

Try out the software and provide feedback. This example is only one use case for combining sFlow and BGP in an SDN controller; others include inbound/outbound traffic engineering, DDoS mitigation, multi-path load balancing, etc. Finally, the combination of commodity hardware with the mature, widely deployed BGP and sFlow protocols is a pragmatic approach to SDN that allows solutions to be developed rapidly and deployed widely in production environments.