The following traceroute, run on Peer 1, shows the set of router hops from 192.168.250.1 to 192.168.251.1:
[root@peer1 ~]# traceroute -s 192.168.250.1 192.168.251.1
traceroute to 192.168.251.1 (192.168.251.1), 30 hops max, 40 byte packets
 1  192.168.152.2 (192.168.152.2)  3.090 ms  3.014 ms  2.927 ms
 2  192.168.150.3 (192.168.150.3)  3.377 ms  3.161 ms  3.099 ms
 3  192.168.251.1 (192.168.251.1)  6.440 ms  6.228 ms  3.217 ms
Ensuring that packets are first forwarded by the default route on the Hardware Switch is the key to accelerating forwarding decisions for the Software Router. Any specific routes added to the Hardware Switch override the default route, so matching packets bypass the Software Router and are forwarded by the Hardware Switch.
In this test bench, routing is performed by Quagga instances running on Peer 1, Peer 2, the Hardware Switch, and the Software Router.
Peer 1
router bgp 150
 bgp router-id 192.168.150.1
 network 192.168.250.0/24
 neighbor 192.168.152.3 remote-as 152
 neighbor 192.168.152.3 ebgp-multihop 2
Peer 2
router bgp 151
 bgp router-id 192.168.151.1
 network 192.168.251.0/24
 neighbor 192.168.152.3 remote-as 152
 neighbor 192.168.152.3 ebgp-multihop 2
Software Router
interface lo
 ip address 192.168.152.3/32
router bgp 152
 bgp router-id 192.168.152.3
 neighbor 192.168.150.1 remote-as 150
 neighbor 192.168.150.1 update-source 192.168.152.3
 neighbor 192.168.150.1 passive
 neighbor 192.168.151.1 remote-as 151
 neighbor 192.168.151.1 update-source 192.168.152.3
 neighbor 192.168.151.1 passive
 neighbor 10.0.0.162 remote-as 152
 neighbor 10.0.0.162 port 1179
 neighbor 10.0.0.162 timers connect 30
 neighbor 10.0.0.162 route-reflector-client
Hardware Switch
router bgp 65000
 bgp router-id 0.0.0.1
 neighbor 10.0.0.162 remote-as 65000
 neighbor 10.0.0.162 port 1179
 neighbor 10.0.0.162 timers connect 30
In addition, the following lines in /etc/network/interfaces configure the bridge:
auto br_xen
iface br_xen
 bridge-ports swp1 swp2 swp3
 address 192.168.150.2/24
 address 192.168.151.2/24
 address 192.168.152.2/24
Cumulus Networks, sFlow and data center automation describes how to configure sFlow monitoring on Cumulus Linux switches. The switch is configured to send sFlow to 10.0.0.162 (the host running the SDN controller).
SDN Routing Application
SDN router using merchant silicon top of rack switch describes how to install sFlow-RT and provides an application for pushing active routes to an accelerator. The application has been modified for this setup and is running on host 10.0.0.162:
// Peer with the Software Router (AS 152) and the Hardware Switch (AS 65000)
bgpAddNeighbor('10.0.0.152',152);
bgpAddNeighbor('10.0.0.233',65000);

// Associate sFlow from the Hardware Switch with routes learned from the Software Router
bgpAddSource('10.0.0.233','10.0.0.152',10);

// Track the routes that have been pushed to the Hardware Switch
var installed = {};

setIntervalHandler(function() {
  let now = Date.now();

  // Get the most active prefixes (up to the 20,000 route hardware limit)
  let top = bgpTopPrefixes('10.0.0.152',20000,1);
  if(!top || !top.hasOwnProperty('topPrefixes')) return;

  // Make sure the BGP session to the Hardware Switch is established
  let tgt = bgpTopPrefixes('10.0.0.233',0);
  if(!tgt || 'established' != tgt.state) return;

  // Push the active routes to the Hardware Switch
  for(let i = 0; i < top.topPrefixes.length; i++) {
    let entry = top.topPrefixes[i];
    if(bgpAddRoute('10.0.0.233',entry)) {
      installed[entry.prefix] = now;
    }
  }

  // Withdraw routes that are no longer in the active set
  for(let prefix in installed) {
    let time = installed[prefix];
    if(time === now) continue;
    if(bgpRemoveRoute('10.0.0.233',prefix)) {
      delete installed[prefix];
    }
  }
}, 1);
Start the application:
$ ./start.sh
2015-07-21T19:36:52-0700 INFO: Listening, BGP port 1179
2015-07-21T19:36:52-0700 INFO: Listening, sFlow port 6343
2015-07-21T19:36:53-0700 INFO: Starting the Jetty [HTTP/1.1] server on port 8008
2015-07-21T19:36:53-0700 INFO: Starting com.sflow.rt.rest.SFlowApplication application
2015-07-21T19:36:53-0700 INFO: Listening, http://localhost:8008
2015-07-21T19:36:53-0700 INFO: bgp.js started
2015-07-21T19:36:57-0700 INFO: BGP open /10.0.0.152:50010
2015-07-21T19:37:23-0700 INFO: BGP open /10.0.0.233:55097
Next, examine the routing table on the Hardware Switch:
cumulus@cumulus$ ip route
default via 192.168.152.1 dev br_xen
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.233
192.168.150.0/24 dev br_xen proto kernel scope link src 192.168.150.2
192.168.151.0/24 dev br_xen proto kernel scope link src 192.168.151.2
192.168.152.0/24 dev br_xen proto kernel scope link src 192.168.152.2
This is the default set of routes configured to pass traffic to and from the Software Router.
To generate traffic using iperf, run the following command on Peer 2:
iperf -s -B 192.168.251.1
Then generate traffic with the following command on Peer 1:
iperf -c 192.168.251.1 -B 192.168.250.1
Now check the routing table on the Hardware Switch again:
cumulus@cumulus$ ip route
default via 192.168.152.1 dev br_xen
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.233
192.168.150.0/24 dev br_xen proto kernel scope link src 192.168.150.2
192.168.151.0/24 dev br_xen proto kernel scope link src 192.168.151.2
192.168.152.0/24 dev br_xen proto kernel scope link src 192.168.152.2
192.168.250.0/24 via 192.168.150.1 dev br_xen proto zebra metric 20
192.168.251.0/24 via 192.168.151.1 dev br_xen proto zebra metric 20
Note the two hardware routes that have been added by the SDN controller. The route override can be verified by repeating the traceroute test:
[root@peer1 ~]# traceroute -s 192.168.250.1 192.168.251.1
traceroute to 192.168.251.1 (192.168.251.1), 30 hops max, 40 byte packets
 1  192.168.150.2 (192.168.150.2)  3.260 ms  3.151 ms  3.014 ms
 2  192.168.251.1 (192.168.251.1)  4.418 ms  4.351 ms  4.260 ms
Comparing with the original traceroute, notice that packets bypass the Software Router interface (192.168.150.3) and are forwarded entirely in hardware.
The traffic analytics driving the forwarding decisions can be viewed through the sFlow-RT REST API:
$ curl http://10.0.0.162:8008/bgp/topprefixes/10.0.0.152/json
{
 "as": 152,
 "direction": "destination",
 "id": "192.168.152.3",
 "learnedPrefixesAdded": 2,
 "learnedPrefixesRemoved": 0,
 "nPrefixes": 2,
 "pushedPrefixesAdded": 0,
 "pushedPrefixesRemoved": 0,
 "startTime": 1437535255553,
 "state": "established",
 "topPrefixes": [
  {
   "aspath": "150",
   "localpref": 100,
   "med": 0,
   "nexthop": "192.168.150.1",
   "origin": "IGP",
   "prefix": "192.168.250.0/24",
   "value": 1.4462334178258518E7
  },
  {
   "aspath": "151",
   "localpref": 100,
   "med": 0,
   "nexthop": "192.168.151.1",
   "origin": "IGP",
   "prefix": "192.168.251.0/24",
   "value": 391390.33359066787
  }
 ],
 "valuePercentCoverage": 100,
 "valueTopPrefixes": 1.4853724511849185E7,
 "valueTotal": 1.4853724511849185E7
}
The SDN application automatically removes routes from the hardware when they become idle or are withdrawn, or to make room for more active routes when the number of installed routes exceeds the configured limit of 20,000. This switch has a maximum capacity of 32,768 routes, and standard sFlow analytics can be used to monitor hardware table utilization - see Broadcom ASIC table utilization metrics, DevOps, and SDN.
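The same sFlow feed could also be used to keep an eye on how full the hardware tables are getting. The following sketch (not part of the PoC application) shows one way this might be scripted with sFlow-RT's metric() function; the metric name 'bcm_ipv4_route_table_utilization' and the 80% alert level are assumptions for illustration - check the Broadcom ASIC table utilization metrics article for the names exported by a particular switch.
// Hypothetical sketch: warn when the hardware route table is close to full.
// The metric name and threshold below are assumptions, not values from this PoC.
var switchAgent = '10.0.0.233';  // sFlow agent address of the Hardware Switch

setIntervalHandler(function() {
  // metric() returns an array of metric records for the named agent
  var res = metric(switchAgent, 'bcm_ipv4_route_table_utilization');
  if(!res || res.length === 0 || res[0].metricValue === undefined) return;
  var pct = res[0].metricValue;
  if(pct > 80) logWarning('hardware route table ' + pct + '% full on ' + switchAgent);
}, 60);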
The test setup was configured to quickly test the concept using the limited hardware at hand and can be improved for a production deployment (for example, by using smaller CIDRs and VLANs to separate peer traffic). This proof of concept demonstrates that it is possible to use SDN analytics and control to combine the standard sFlow and BGP capabilities of commodity hardware and deliver Terabit routing capacity with just a few thousand dollars of hardware.
The Internet now has 561,105 prefixes, and it looks like you can only have about 20,000 on the switch hardware right now.
There is a serious denial of service risk from someone hitting the switch with a large number of packets that cause cache misses. And if this is really an "Internet" router application, you know how many nefarious people are on the planet with time on their hands...
I don't know if typical Ethernet switch merchant silicon will ever have a business case to handle 500,000+ flows.
The DDoS risk is important to consider and it can be mitigated. During a "cache miss attack" most normal traffic will still hit the hardware cache and be unaffected. The switch configuration should include ACLs to ensure that control plane traffic (BGP) from peers is allowed, and should rate limit data plane traffic on the link to the router so that the software router is never overwhelmed. These two controls would limit the effect of the attack to an increase in packet loss for the small fraction of traffic not handled by the cache.
The SDN controller can further mitigate the attack by using analytics to identify the attack signature and installing hardware ACLs to drop, rate limit, or reduce the priority of the attack traffic. For an example, see DDoS mitigation with Cumulus Linux.
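A rough sketch of the analytics half of that mitigation, written against the same sFlow-RT script API used by the PoC application, might look like the following. The flow definition, the 10,000 frames per second trigger level, and the logging-only event handler are illustrative assumptions; the DDoS mitigation with Cumulus Linux article shows a complete implementation that pushes hardware ACLs.
// Hypothetical sketch: detect destinations receiving unusually high packet rates.
// A production version would filter to traffic taking the default (miss) route
// and would respond by installing an ACL rather than just logging.
setFlow('uncached', {
  keys: 'ipdestination',   // group traffic by destination address
  value: 'frames'
});

setThreshold('attack', {
  metric: 'uncached',
  value: 10000,            // assumed trigger level, frames per second
  byFlow: true,
  timeout: 10
});

setEventHandler(function(evt) {
  logInfo('possible cache miss attack, target=' + evt.flowKey + ' agent=' + evt.agent);
}, ['attack']);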
Hi Peter,
I have been following your latest blog posts and have been testing some of your scenarios. I have been contemplating this approach for years. Needless to say, I find all this very exciting!
What if we removed the software router from the above scenario and let the switch do all the BGP peering? We could then use "selective route download" to hold all routes in the RIB but install only the default routes from transit providers and the routes received from sFlow-RT into the FIB, much like David Barroso's SIR example.
That would remove the requirement for peers to set up a static route to the software router, as well as the ebgp-multihop requirement. It also removes the cache miss attack surface.
A bare-metal switch such as the Quanta T3048-LY8, with an Intel Rangeley CPU and 4 GB of DDR3/ECC RAM, should be able to handle this with a decent software suite.
How would packets that don't have exact hardware routes be forwarded? The SIR architecture works for an access router, where a transit provider handles your default route and the hardware is used to forward selected prefixes to peers. In this PoC there is no transit provider, so all packets must be correctly forwarded; the miss traffic is sent to the adjacent software router for forwarding. With a decent CPU on the switch you could merge both functions, passing hardware misses up to the CPU for forwarding. You could also install sFlow-RT on the switch for a self-contained (but logically equivalent) solution.
I would be interested in any thoughts about how one might streamline the configuration. The PoC setup is awkward and not as transparent to the peers as it should be.
If you wanted to implement the logical equivalent of SIR, using sFlow-RT to manage routing to transit providers, it would be easy to do. Just install BIRD and sFlow-RT on the switch and have sFlow-RT write the BIRD prefix filter configuration file.
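The following sketch suggests how that might look using the same sFlow-RT script API as the application above. The peer address and filter name are assumptions for illustration, and actually writing the file out and telling BIRD to reconfigure (for example with birdc configure) is left out.
// Hypothetical sketch: build a BIRD-style prefix set from the most active prefixes.
var peer = '127.0.0.1';  // assumed address of the local BIRD session feeding routes to sFlow-RT

setIntervalHandler(function() {
  var top = bgpTopPrefixes(peer, 20000, 1);
  if(!top || !top.hasOwnProperty('topPrefixes') || top.topPrefixes.length === 0) return;

  // Collect the active prefixes into a BIRD prefix set definition
  var entries = [];
  for(var i = 0; i < top.topPrefixes.length; i++) {
    entries.push('  ' + top.topPrefixes[i].prefix);
  }
  var filter = 'define active_prefixes = [\n' + entries.join(',\n') + '\n];';

  // Writing "filter" into the BIRD configuration and triggering a reload is
  // deployment specific and not shown here.
  logInfo('prefix filter rebuilt, ' + entries.length + ' prefixes, ' + filter.length + ' bytes');
}, 10);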