Monday, March 23, 2015

OpenNetworking.tv interview


The OpenNetworking.tv interview includes a wide-ranging discussion of current trends in software defined networking (SDN), including: merchant silicon, analytics, probes, scalability, Open vSwitch, network virtualization, VXLAN, network function virtualization (NFV), the Open Compute Project, white box / bare metal switches, leaf and spine topologies, large "Elephant" flow marking and steering, Cumulus Linux, Big Switch, orchestration, Puppet, and Chef.

The interview and full transcript are available on SDxCentral: sFlow Creator Peter Phaal On Taming The Wilds Of SDN & Virtual Networking


Friday, March 13, 2015

ECMP visibility with Cumulus Linux

Demo: Implementing the Big Data Design Guide in the Cumulus Workbench  is a great demonstration of the power of zero touch provisioning and automation. When the switches and servers boot they automatically pick up their operating systems and configurations for the complex Equal Cost Multi-Path (ECMP) routed network shown in the diagram.

Topology discovery with Cumulus Linux looks at an alternative Multi-Chassis Link Aggregation (MLAG) configuration and shows how to extract the network topology and monitor traffic using sFlow and Fabric View.

The paper Hedera: Dynamic Flow Scheduling for Data Center Networks describes the impact of colliding flows on effective ECMP cross sectional bandwidth. The paper gives an example demonstrating that effective cross sectional bandwidth can be reduced by between 20% and 60%, depending on the number of simultaneous flows per host.

This article uses the workbench to demonstrate the effect of large "Elephant" flow collisions on network throughput. The following script running on each of the servers uses the iperf tool to generate pairs of overlapping Elephant flows:
cumulus@server1:~$ while true; do iperf -c 10.4.2.2 -t 20; sleep 20; done
------------------------------------------------------------
Client connecting to 10.4.2.2, TCP port 5001
TCP window size: 1.06 MByte (default)
------------------------------------------------------------
[  3] local 10.4.1.2 port 57234 connected with 10.4.2.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-20.0 sec  21.9 GBytes  9.41 Gbits/sec
------------------------------------------------------------
Client connecting to 10.4.2.2, TCP port 5001
TCP window size: 1.06 MByte (default)
------------------------------------------------------------
[  3] local 10.4.1.2 port 57240 connected with 10.4.2.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-20.0 sec  10.1 GBytes  4.34 Gbits/sec
------------------------------------------------------------
Client connecting to 10.4.2.2, TCP port 5001
TCP window size: 1.06 MByte (default)
------------------------------------------------------------
[  3] local 10.4.1.2 port 57241 connected with 10.4.2.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-20.0 sec  21.9 GBytes  9.41 Gbits/sec
------------------------------------------------------------
The first iperf test achieves a TCP throughput of 9.41 Gbits/sec (the maximum achievable on the 10Gbit/s network in the workbench). However, the second test only achieves a throughput of 4.34 Gbits/sec. How can this result be explained?
The Top Flows table above confirms that two simultaneous elephant flows are being tracked by Fabric View.
The Traffic charts update every second and give a fine grained view of the traffic flows over time. The charts clearly show how the iperf flows vary in throughput, with the low throughput runs achieving approximately 50% of the network capacity (consistent with the 20% to 60% reduction reported in the Hedera paper).
The Performance charts show what is happening. Packets take two hops as they are routed from leaf1 to leaf2 (via spine1 or spine2). Each iperf connection is able to fully utilize the two links along its path to achieve line rate throughput. Comparing the Total Traffic and Busy Spine Links charts shows that the peak total throughput of approximately 20Gbits/sec corresponds to intervals when 4 spine links are busy. The throughput is halved during intervals when the routes overlap and the flows share 1 or 2 links (shown in gold as Collisions on the Busy Spine Links chart).
Readers might be surprised by the frequency of collisions given the number of links in the network. Packets take two hops to go from leaf1 to leaf2, routed via spine1 or spine2. In addition, the links between switches are paired, so there are 8 possible two hop paths from leaf1 to leaf2. The explanation involves looking at the conditional probability that the second flow will overlap with the first. Suppose the first flow is routed to spine1 via port swp1s0 and that spine1 routes the flow to leaf2 via port swp51. If the second flow is routed via any of the 4 paths through spine2, there is no collision. However, if it is routed via spine1, there is only 1 path that avoids a collision (leaf1 port swp1s1 to spine1, then spine1 port swp52 to leaf2). This means that there is a 5/8 chance of avoiding a collision, or a 3/8 (37.5%) chance that the two flows will collide. The probability of flow collisions remains surprisingly high even on very large networks with many spine switches and paths (see Birthday Paradox). 
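The 3/8 figure is easy to check by brute force. The following short Python script (a sketch added here for illustration, not part of the original workbench) enumerates the 8 two-hop paths between leaf1 and leaf2 and counts the ordered pairs of independently chosen paths that share at least one link; it assumes ECMP hashing places each flow on a path uniformly at random:
#!/usr/bin/env python
# Enumerate the 8 equal cost two-hop paths from leaf1 to leaf2
# (2 spines x 2 uplinks x 2 downlinks) and count how often two
# independently placed flows share at least one link.
from itertools import product
from fractions import Fraction

spines = ['spine1', 'spine2']
uplinks = [0, 1]    # paired links leaf1 -> spine
downlinks = [0, 1]  # paired links spine -> leaf2

# a path is identified by the pair of links it uses
paths = [(('leaf1', spine, up), (spine, 'leaf2', down))
         for spine, up, down in product(spines, uplinks, downlinks)]

collisions = sum(1 for p1, p2 in product(paths, paths) if set(p1) & set(p2))
print('P(collision) = %s' % Fraction(collisions, len(paths) ** 2))  # 3/8
Changing the spines, uplinks, and downlinks lists models larger fabrics, and with more than two simultaneous flows the chance that at least one pair collides grows quickly, which is the Birthday Paradox effect mentioned above.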
Also note the Discards trend in the Congestion and Errors section. Comparing the rate of discards with Collisions in the Busy Spine Links chart shows that discards don't occur unless there are Elephant flow collisions on the busy links.
The Discard trend lags the Collision trend because discards are reported using sFlow counters while the Collision metric is based on packet samples - see Measurement delay, counters vs. packet samples.
This example demonstrates the visibility into leaf and spine fabric performance achievable using standard sFlow instrumentation built into commodity switch hardware. If you have a leaf and spine network, request a free evaluation of Fabric View to better understand your network's performance.
This small four switch leaf and spine network is composed of 12 x 10 Gbits/sec links, which would require 24 x 10 Gbits/sec taps, with associated probes and collectors, to fully monitor using the traditional tools developed for legacy data center networks. The cost and complexity of tapping leaf and spine topologies is prohibitive. However, leaf and spine switches typically include hardware support for the sFlow measurement standard, embedding line rate visibility into every switch port and providing network wide coverage at no extra cost. In this example, the Fabric View analytics software runs on a commodity physical or virtual server, consuming 1% CPU and 200 MBytes of RAM.
Real-time analytics for leaf and spine networks is a core enabling technology for software defined networking (SDN) control mechanisms that can automatically adapt the network to rapidly changing flow patterns and dramatically improve performance.
For example, REST API for Cumulus Linux ACLs describes how an SDN controller can remotely control switches. Use cases discussed on this blog include Elephant flow marking, Elephant flow steering, and DDoS mitigation.
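As a rough illustration of the pattern (not the controller logic itself), the following sketch uses Python and the requests library to push an ACL to a switch running the acl_server from that article; the /acl/<name> endpoint and the JSON list of iptables rules follow the examples given there, and the DROP rule shown is purely hypothetical:
#!/usr/bin/env python
# Hypothetical example: install, and later remove, a DDoS mitigation ACL
# on leaf1 via the acl_server REST API. The endpoint and payload shape
# follow the REST API for Cumulus Linux ACLs article; the rule is a
# placeholder that a controller would normally generate from sFlow data.
import requests

acl = [
  '[iptables]',
  '-A FORWARD --in-interface swp+ -d 10.4.2.2 -p udp --sport 53 -j DROP'
]

r = requests.put('http://leaf1:8080/acl/ddos1', json=acl)
print(r.status_code)

# remove the control when the attack subsides
# requests.delete('http://leaf1:8080/acl/ddos1')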

Finally, Cumulus Linux runs on open switch hardware from Agema, Dell, Edge-Core, Penguin Computing, and Quanta. In addition, Hewlett-Packard recently announced that it will soon be selling a new line of open network switches built by Accton Technologies that support Cumulus Linux. The increasing availability of low cost open networking hardware running Linux creates a platform for open source and commercial software developers to quickly build and deploy innovative solutions.

Wednesday, March 11, 2015

Topology discovery with Cumulus Linux

Demo: Implementing the OpenStack Design Guide in the Cumulus Workbench is a great demonstration of the power of zero touch provisioning and automation. When the switches and servers boot they automatically pick up their operating systems and configurations for the complex network shown in the diagram.
REST API for Cumulus Linux ACLs describes a REST server for remotely controlling ACLs on Cumulus Linux. This article discusses recently added topology discovery methods that allow an SDN controller to learn the topology and apply targeted controls (e.g. Large "Elephant" flow marking, Large flow steering, DDoS mitigation, etc.).

Prescriptive Topology Manager

Complex Topology and Wiring Validation in Data Centers describes how Cumulus Networks' prescriptive topology manager (PTM) provides a simple method of verifying and enforcing correct wiring topologies.

The following REST call converts the topology from PTM's dot notation and returns a JSON representation:
cumulus@wbench:~$ curl http://leaf1:8080/ptm
Returns the result:
{
 "links": {
  "L1": {
   "node1": "leaf1", 
   "node2": "spine1", 
   "port1": "swp1s0", 
   "port2": "swp49"
  },
  ...
 }
}
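PTM reads its topology from a graphviz dot file (typically /etc/ptm.d/topology.dot) containing one edge per cable, e.g. "leaf1":"swp1s0" -- "spine1":"swp49". The following sketch shows the kind of conversion the /ptm call performs; the file path and regular expression are assumptions for illustration and are not taken from the acl_server code:
#!/usr/bin/env python
# Sketch: convert PTM's dot topology into the JSON "links" structure
# returned by the /ptm REST call above.
import json, re

# match edges of the form "node1":"port1" -- "node2":"port2"
edge = re.compile(r'"([^"]+)":"([^"]+)"\s*--\s*"([^"]+)":"([^"]+)"')

links = {}
with open('/etc/ptm.d/topology.dot') as f:
  for i, m in enumerate(edge.finditer(f.read())):
    node1, port1, node2, port2 = m.groups()
    links['L%d' % (i + 1)] = {'node1': node1, 'port1': port1,
                              'node2': node2, 'port2': port2}

print(json.dumps({'links': links}, sort_keys=True, indent=1))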

LLDP

Prescriptive Topology Manager is preferred since it ensures that the discovered topology is correct. However, PTM builds on the basic Link Layer Discovery Protocol (LLDP), which provides an alternative method of topology discovery.

The following REST call returns the hostname:
cumulus@wbench:~$ curl http://leaf1:8080/hostname
Returns result:
"leaf1"
The following REST call returns LLDP neighbor information:
cumulus@wbench:~$ curl http://leaf1:8080/lldp/neighbors

Returns result:
{
   "lldp": [
     {
       "interface": [
         {
           "name": "eth0",
           "via": "LLDP",
           "chassis": [
             {
               "id": [
                 {
                   "type": "mac",
                   "value": "6c:64:1a:00:2e:7f"
                 }
               ],
               "name": [
                 {
                   "value": "colo-tor-3"
                 }
               ]
             }
           ],
           "port": [
             {
               "id": [
                 {
                   "type": "ifname",
                   "value": "swp10"
                 }
               ],
               "descr": [
                 {
                   "value": "swp10"
                 }
               ]
             }
           ]
         },
         ...
     }
   ]
 }
The following REST call returns LLDP configuration information:
cumulus@wbench:~$ curl http://leaf1:8080/lldp/configuration
Returns result:
{
   "configuration": [
     {
       "config": [
         {
           "tx-delay": [
             {
               "value": "30"
             }
           ],
           ...
         }
       ]
     }
   ]
 }

Topology discovery with LLDP

The script lldp.py extracts LLDP data from all the switches in the network and compiles a topology:
#!/usr/bin/env python

import sys, re, fileinput, json, requests

switch_list = ['leaf1','leaf2','spine1','spine2']

l = 0
linkdb = {}
links = {}
for switch_name in switch_list:
  # verify that lldp configuration exports hostname,ifname information
  r = requests.get("http://%s:8080/lldp/configuration" % (switch_name));
  if r.status_code != 200: continue
  config = r.json()
  lldp_hostname = config['configuration'][0]['config'][0]['hostname'][0]['value']
  if lldp_hostname != '(none)': continue
  lldp_porttype = config['configuration'][0]['config'][0]['lldp_portid-type'][0]['value']
  if lldp_porttype != 'ifname': continue
  # local hostname 
  r = requests.get("http://%s:8080/hostname" % (switch_name));
  if r.status_code != 200: continue
  host = r.json()
  # get neighbors
  r = requests.get("http://%s:8080/lldp/neighbors" % (switch_name));
  if r.status_code != 200: continue
  neighbors = r.json()
  interfaces = neighbors['lldp'][0]['interface']
  for i in interfaces:
    # local port name
    port = i['name']
    # neighboring hostname
    nhost = i['chassis'][0]['name'][0]['value']
    # neighboring port name
    nport = i['port'][0]['descr'][0]['value']
    if not host or not port or not nhost or not nport: continue
    if host < nhost:
      link = {'node1':host,'port1':port,'node2':nhost,'port2':nport}
    else:
      link = {'node1':nhost,'port1':nport,'node2':host,'port2':port}
    keystr = "%s %s -- %s %s" % (link['node1'],link['port1'],link['node2'],link['port2'])
    if keystr in linkdb:
       # check consistency
       prev = linkdb[keystr]
       if (link['node1'] != prev['node1'] 
           or link['port1'] != prev['port1']
           or link['node2'] != prev['node2']
           or link['port2'] != prev['port2']): raise Exception('Mismatched LLDP', keystr)
    else:
       linkdb[keystr] = link
       linkname = 'L%d' % (l)
       links[linkname] = link
       l += 1

top = {'links':links}               
print json.dumps(top,sort_keys=True, indent=1)
Returns result:
cumulus@wbench:~$ ./lldp.py 
{
 "links": {
  "L0": {
   "node1": "colo-tor-3", 
   "node2": "leaf1", 
   "port1": "swp10", 
   "port2": "eth0"
  }, 
  ...
 }
}
The lldp.py script and the latest version of acl_server can be found on GitHub: https://github.com/pphaal/acl_server/

Demonstration

Fabric visibility with Cumulus Linux demonstrates the visibility into network performance provided by Cumulus Linux's support for the sFlow standard (see Cumulus Networks, sFlow and data center automation). The screen shot shows 10Gbit/s Elephant flows traversing the network shown at the top of this article. The flows between server1 and server2 were generated using iperf tests running in a continuous loop.

The acl_server and sFlow agents are installed on the leaf1, leaf2, spine1, and spine2 switches. By default, the sFlow agents automatically pick up their settings using DNS Service Discovery (DNS-SD). Adding the following entry in the wbench DNS server zone file, /etc/bind/zones/lab.local.zone, enables sFlow on the switches and directs measurements to the wbench host:
_sflow._udp     30      SRV     0 0 6343 wbench
Note: For more information on running sFlow in the Cumulus workbench, see Demo: Monitoring Traffic on Cumulus Switches with sFlow. Also note that this workbench setup demonstrates the visibility into Link Aggregation (LAG) provided by sFlow (see Link aggregation).
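A quick way to verify that the record is being served is to query it directly. The sketch below uses the dnspython package (an assumption; it is not part of the workbench setup) to look up the SRV record added above:
#!/usr/bin/env python
# Check the sFlow DNS-SD record served by the wbench DNS server.
# Requires the dnspython package; the lab.local zone is the one
# configured in /etc/bind/zones/lab.local.zone above.
import dns.resolver

for srv in dns.resolver.resolve('_sflow._udp.lab.local', 'SRV'):
  # expect priority 0, weight 0, port 6343, target wbench.lab.local.
  print('%s %d' % (srv.target, srv.port))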

Fabric View is installed on wbench and is configured with the network topology obtained from acl_server. The web interface is accessed through the workbench reverse proxy, but access is also possible using a VPN (see Setting up OpenVPN on the Cumulus Workbench).
This workbench example automatically provisions an OpenStack cluster on the two servers along with the network to connect them. In much the same way OpenStack provides access to virtual resources, Cumulus' Remote Lab leverages the automation capabilities of open hardware to provide multi-tenant access to physical servers and networks.
Finally, Cumulus Linux runs on open switch hardware from Agema, Dell, Edge-Core, Penguin Computing, and Quanta. In addition, Hewlett-Packard recently announced that it will soon be selling a new line of open network switches built by Accton Technologies that support Cumulus Linux. This article demonstrates the flexibility that open networking offers to developers and network administrators. If you are curious, it's very easy to give Cumulus Linux a try.