Tuesday, July 21, 2015

White box Internet router PoC

SDN router using merchant silicon top of rack switch describes how the performance of a software Internet router could be accelerated using the hardware routing capabilities of a commodity switch. This article describes a proof of concept demonstration using Linux virtual machines and a bare metal switch running Cumulus Linux.
The diagram shows the demo setup, providing inter-domain routing between Peer 1 and Peer 2. The Peers are directly connected to the Hardware Switch, where ingress packets match the default (0.0.0.0/0) route and are forwarded to the Software Router. The Software Router learns the full set of routes from the Peers using BGP and forwards each packet to the correct next hop router. The packet is then switched to the selected peer router via bridge br_xen.
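On the Hardware Switch, the default route can be a simple static kernel route pointing at the Software Router's address on the bridge network (192.168.152.1, as seen in the route tables later in this article). A minimal sketch; the PoC may have configured this differently:
cumulus@cumulus$ sudo ip route add default via 192.168.152.1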

The following traceroute, run on Peer 1, shows the router hops from 192.168.250.1 to 192.168.251.1:
[root@peer1 ~]# traceroute -s 192.168.250.1 192.168.251.1
traceroute to 192.168.251.1 (192.168.251.1), 30 hops max, 40 byte packets
 1  192.168.152.2 (192.168.152.2)  3.090 ms  3.014 ms  2.927 ms
 2  192.168.150.3 (192.168.150.3)  3.377 ms  3.161 ms  3.099 ms
 3  192.168.251.1 (192.168.251.1)  6.440 ms  6.228 ms  3.217 ms
Ensuring that packets are forwarded by the default route on the Hardware Switch is the key to accelerating the Software Router's forwarding decisions. Because of longest prefix matching, any more specific route added to the Hardware Switch overrides the default route, so matching packets bypass the Software Router and are forwarded directly by the Hardware Switch.
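The longest prefix match behavior can be checked from the switch's Linux shell; the output below is illustrative:
cumulus@cumulus$ ip route get 192.168.251.1
192.168.251.1 via 192.168.152.1 dev br_xen  src 192.168.152.2
Once a more specific route for 192.168.251.0/24 is installed (as shown later), the same query resolves via 192.168.151.1 instead of the default route.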

In this test bench, routing is performed by Quagga instances running on Peer 1, Peer 2, the Hardware Switch, and the Software Router.

Peer 1

router bgp 150
 bgp router-id 192.168.150.1
 network 192.168.250.0/24
 neighbor 192.168.152.3 remote-as 152
 neighbor 192.168.152.3 ebgp-multihop 2

Peer 2

router bgp 151
 bgp router-id 192.168.151.1
 network 192.168.251.0/24
 neighbor 192.168.152.3 remote-as 152
 neighbor 192.168.152.3 ebgp-multihop 2
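
Note that the peers address the Software Router by its loopback (192.168.152.3), which is not directly connected, hence the ebgp-multihop 2 setting and a static route on each peer. A plausible sketch, assuming the loopback is reached through the Hardware Switch's bridge address:
ip route add 192.168.152.3/32 via 192.168.152.2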

Software Router

! loopback provides a stable address for the BGP sessions
interface lo
  ip address 192.168.152.3/32

router bgp 152
 bgp router-id 192.168.152.3
 ! eBGP sessions with the peer routers
 neighbor 192.168.150.1 remote-as 150
 neighbor 192.168.150.1 update-source 192.168.152.3
 neighbor 192.168.150.1 passive
 neighbor 192.168.151.1 remote-as 151
 neighbor 192.168.151.1 update-source 192.168.152.3
 neighbor 192.168.151.1 passive
 ! iBGP session reflecting the full set of routes to sFlow-RT
 neighbor 10.0.0.162 remote-as 152
 neighbor 10.0.0.162 port 1179
 neighbor 10.0.0.162 timers connect 30
 neighbor 10.0.0.162 route-reflector-client

Hardware Switch

router bgp 65000
 bgp router-id 0.0.0.1
 ! session with sFlow-RT (listening on port 1179), used by the SDN controller to push routes
 neighbor 10.0.0.162 remote-as 65000
 neighbor 10.0.0.162 port 1179
 neighbor 10.0.0.162 timers connect 30
In addition, the following lines in /etc/network/interfaces configure the bridge:
auto br_xen
iface br_xen
  bridge-ports swp1 swp2 swp3
  address 192.168.150.2/24
  address 192.168.151.2/24
  address 192.168.152.2/24
Cumulus Networks, sFlow and data center automation describes how to configure sFlow monitoring on Cumulus Linux switches. The switch is configured to send sFlow to 10.0.0.162 (the host running the SDN controller).
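For reference, a minimal /etc/hsflowd.conf along these lines would direct sFlow to the controller (the sampling and polling settings are illustrative):
sflow {
  DNSSD = off
  polling = 20
  sampling = 400
  collector {
    ip = 10.0.0.162
    udpport = 6343
  }
}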

SDN Routing Application

SDN router using merchant silicon top of rack switch describes how to install sFlow-RT and provides an application for pushing active routes to an accelerator. The application has been modified for this setup and is running on host 10.0.0.162:
bgpAddNeighbor('10.0.0.152',152);           // BGP session with the Software Router
bgpAddNeighbor('10.0.0.233',65000);         // BGP session with the Hardware Switch
bgpAddSource('10.0.0.233','10.0.0.152',10); // pair sFlow from the switch with routes from the router

var installed = {};
setIntervalHandler(function() {
  var now = Date.now();

  // up to 20,000 of the most active prefixes known to the Software Router
  var top = bgpTopPrefixes('10.0.0.152',20000,1);
  if(!top || !top.hasOwnProperty('topPrefixes')) return;

  // only push routes while the session with the Hardware Switch is established
  var tgt = bgpTopPrefixes('10.0.0.233',0);
  if(!tgt || 'established' != tgt.state) return;

  // install (or refresh) hardware routes for the active prefixes
  for(var i = 0; i < top.topPrefixes.length; i++) {
     var entry = top.topPrefixes[i];
     if(bgpAddRoute('10.0.0.233',entry)) {
       installed[entry.prefix] = now;
     }
  }

  // withdraw hardware routes that are no longer in the active set
  for(var prefix in installed) {
     var time = installed[prefix];
     if(time === now) continue;
     if(bgpRemoveRoute('10.0.0.233',prefix)) {
        delete installed[prefix];
     }
  }
}, 1); // run every second
Start the application:
$ ./start.sh 
2015-07-21T19:36:52-0700 INFO: Listening, BGP port 1179
2015-07-21T19:36:52-0700 INFO: Listening, sFlow port 6343
2015-07-21T19:36:53-0700 INFO: Starting the Jetty [HTTP/1.1] server on port 8008
2015-07-21T19:36:53-0700 INFO: Starting com.sflow.rt.rest.SFlowApplication application
2015-07-21T19:36:53-0700 INFO: Listening, http://localhost:8008
2015-07-21T19:36:53-0700 INFO: bgp.js started
2015-07-21T19:36:57-0700 INFO: BGP open /10.0.0.152:50010
2015-07-21T19:37:23-0700 INFO: BGP open /10.0.0.233:55097
Next, examine the routing table on the Hardware Switch:
cumulus@cumulus$ ip route
default via 192.168.152.1 dev br_xen 
10.0.0.0/24 dev eth0  proto kernel  scope link  src 10.0.0.233 
192.168.150.0/24 dev br_xen  proto kernel  scope link  src 192.168.150.2 
192.168.151.0/24 dev br_xen  proto kernel  scope link  src 192.168.151.2 
192.168.152.0/24 dev br_xen  proto kernel  scope link  src 192.168.152.2
This is the default set of routes, configured to pass traffic to and from the Software Router.

To generate traffic using iperf, first start a server on Peer 2:
iperf -s -B 192.168.251.1
Then run a client on Peer 1:
iperf -c 192.168.251.1 -B 192.168.250.1
Now check the routing table on the Hardware Switch again:
cumulus@cumulus$ ip route
default via 192.168.152.1 dev br_xen 
10.0.0.0/24 dev eth0  proto kernel  scope link  src 10.0.0.233  
192.168.150.0/24 dev br_xen  proto kernel  scope link  src 192.168.150.2 
192.168.151.0/24 dev br_xen  proto kernel  scope link  src 192.168.151.2 
192.168.152.0/24 dev br_xen  proto kernel  scope link  src 192.168.152.2 
192.168.250.0/24 via 192.168.150.1 dev br_xen  proto zebra  metric 20 
192.168.251.0/24 via 192.168.151.1 dev br_xen  proto zebra  metric 20
Note the two hardware routes that have been added by the SDN controller. The route override can be verified by repeating the traceroute test:
[root@peer1 ~]# traceroute -s 192.168.250.1 192.168.251.1
traceroute to 192.168.251.1 (192.168.251.1), 30 hops max, 40 byte packets
 1  192.168.150.2 (192.168.150.2)  3.260 ms  3.151 ms  3.014 ms
 2  192.168.251.1 (192.168.251.1)  4.418 ms  4.351 ms  4.260 ms
Compared with the original traceroute, notice that packets now bypass the Software Router (192.168.150.3) and are forwarded entirely in hardware.

The traffic analytics driving the forwarding decisions can be viewed through the sFlow-RT REST API:
$ curl http://10.0.0.162:8008/bgp/topprefixes/10.0.0.152/json
{
 "as": 152,
 "direction": "destination",
 "id": "192.168.152.3",
 "learnedPrefixesAdded": 2,
 "learnedPrefixesRemoved": 0,
 "nPrefixes": 2,
 "pushedPrefixesAdded": 0,
 "pushedPrefixesRemoved": 0,
 "startTime": 1437535255553,
 "state": "established",
 "topPrefixes": [
  {
   "aspath": "150",
   "localpref": 100,
   "med": 0,
   "nexthop": "192.168.150.1",
   "origin": "IGP",
   "prefix": "192.168.250.0/24",
   "value": 1.4462334178258518E7
  },
  {
   "aspath": "151",
   "localpref": 100,
   "med": 0,
   "nexthop": "192.168.151.1",
   "origin": "IGP",
   "prefix": "192.168.251.0/24",
   "value": 391390.33359066787
  }
 ],
 "valuePercentCoverage": 100,
 "valueTopPrefixes": 1.4853724511849185E7,
 "valueTotal": 1.4853724511849185E7
}
The SDN application automatically removes routes from the hardware when they are withdrawn, when they become idle, or to make room for more active routes once the set limit of 20,000 routes is reached. This switch has a maximum capacity of 32,768 routes, and standard sFlow analytics can be used to monitor hardware table utilization - see Broadcom ASIC table utilization metrics, DevOps, and SDN.
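For example, the table metrics exported by the switch can be queried through sFlow-RT's REST API (the metric name shown is illustrative):
$ curl http://10.0.0.162:8008/metric/10.0.0.233/bcm_ipv4_entries/json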
The test setup was put together quickly, using the limited hardware at hand, to prove the concept; it could be improved for production deployment (for example, using smaller CIDRs and VLANs to separate peer traffic).
This proof of concept demonstrates that SDN analytics and control can combine the standard sFlow and BGP capabilities of commodity hardware to deliver terabit routing capacity with just a few thousand dollars of hardware.

4 comments:

  1. The Internet now has 561,105 prefixes, and it looks like you can only have about 20,000 on the switch hardware right now.

    There is a serious denial of service risk from someone hitting the switch with a large number of packets that cause cache misses. And if this is really an "Internet" router application, you know how many nefarious people are on the planet with time on their hands....

    I don't know if typical Ethernet switch merchant silicon will ever have a business case to handle 500,000+ flows.

    Reply: The DDoS risk is important to consider, and it can be mitigated. During a "cache miss attack" most normal traffic will still hit the hardware cache and be unaffected. The switch configuration should include ACLs to ensure that control plane traffic (BGP) from the peers is always allowed, and should rate limit data plane traffic on the link to the router so that the software router is never overwhelmed. These two controls would limit the effect of the attack to an increase in packet loss for the small fraction of traffic not handled by the cache.

      The SDN controller can further mitigate the attack using analytics to identify the attack signature and installing hardware ACLs to drop, rate limit, or reduce priority of the attack traffic. For an example, see DDoS mitigation with Cumulus Linux.
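
      As a sketch, a Cumulus Linux ACL policy along these lines could implement both controls (the file name, port assignments, and rate values are illustrative, assuming swp3 faces the software router):
      # /etc/cumulus/acl/policy.d/60-protect.rules (hypothetical)
      [iptables]
      # always allow BGP from the peers
      -A FORWARD --in-interface swp1 -p tcp --dport 179 -j ACCEPT
      -A FORWARD --in-interface swp2 -p tcp --dport 179 -j ACCEPT
      # police cache miss traffic on the port facing the software router
      -A FORWARD --out-interface swp3 -j POLICE --set-mode pkt --set-rate 10000 --set-burst 100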

  2. Hi Peter,

    I have been following your recent blog posts and testing some of your scenarios. I have been contemplating this approach for years. Needless to say, I find all this very exciting!

    What if we removed the software router from the above scenario and let the switch do all the BGP peering? We could then use "selective route download" to hold all routes in the RIB but only install the default routes from the transit providers, plus the routes received from sFlow-RT, in the FIB - much like David Barroso's SIR example.

    That would remove the requirement for peers to set up a static route to the software router, as well as the ebgp-multihop requirement. It removes the cache miss attack surface as well.

    A bare-metal switch such as the Quanta T3048-LY8, with an Intel Rangeley CPU and 4 GB DDR3/ECC RAM, should be able to handle this with a decent software suite.

    Reply: How would packets that don't have exact hardware routes be forwarded? The SIR architecture works for an access router where you have a transit provider that will handle your default route and the hardware is used to forward selected prefixes to peers. In this PoC there is no transit provider; all packets must be correctly forwarded. The miss traffic is sent to the adjacent software router for forwarding. With a decent CPU on the switch you could merge both functions, passing hardware misses up to the CPU for forwarding. You could also install sFlow-RT on the switch for a self-contained (but logically equivalent) solution.

      I would be interested in any thoughts about how one might streamline the configuration. The PoC setup is awkward and not as transparent to the peers as it should be.

      If you wanted to implement the logical equivalent of SIR, using sFlow-RT to manage routing to transit providers, it would be easy to do. Just install BIRD and sFlow-RT on the switch and have sFlow-RT write the BIRD prefix filter configuration file.
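
      For example, sFlow-RT could periodically rewrite a BIRD filter fragment like this (the filter name and prefixes are illustrative) and reload BIRD:
      # generated by sFlow-RT (hypothetical)
      filter active_prefixes {
        if net ~ [ 192.168.250.0/24, 192.168.251.0/24 ] then accept;
        reject;
      }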
