Thursday, June 26, 2014

Docker performance monitoring

IT’S HERE: DOCKER 1.0 announces the first production release of the Docker Linux container platform. Docker is seeing explosive growth and has already been embraced by IBM, Red Hat and Rackspace. Today the open source Host sFlow project released support for Docker, exporting standard sFlow performance metrics for Linux containers and bringing Linux containers into the broader sFlow ecosystem.
Visibility and the software defined data center
Host sFlow Docker support simplifies data center performance management by unifying monitoring of Linux containers with monitoring of virtual machines (Hyper-V, KVM/libvirt, Xen/XCP/XenServer), virtual switches (Open vSwitch, Hyper-V Virtual Switch, IBM Distributed Virtual Switch, HP FlexFabric Virtual Switch), servers (Linux, Windows, Solaris, AIX, FreeBSD), and physical networks (over 40 vendors, including: A10, Alcatel-Lucent, Arista, Brocade, Cisco, Cumulus, Extreme, F5, Hewlett-Packard, Hitachi, Huawei, IBM, Juniper, Mellanox, NEC, ZTE). In addition, standardizing metrics allows measurements to be shared among different tools, further reducing operational complexity.


The talk provides additional background on the sFlow standard and case studies. The remainder of this article describes how to use Host sFlow to monitor a Docker server pool.

First, download, compile and install the Host sFlow agent on a Docker host (Note: The agent needs to be built from sources since Docker support is currently in the development branch):
svn checkout http://svn.code.sf.net/p/host-sflow/code/trunk host-sflow-code
cd host-sflow-code
make DOCKER=yes
make install
make schedule
service hsflowd start
Next, if SELinux is enabled, run the following commands to allow Host sFlow to retrieve network stats (or disable SELinux):
audit2allow -a -M hsflowd
semodule -i hsflowd.pp
See Installing Host sFlow on a Linux server for additional information on configuring the agent.


The slide presentation describes how Docker can be used with Open vSwitch to create virtual networks connecting containers. In addition to providing advanced SDN capabilities, Open vSwitch includes sFlow instrumentation, providing detailed visibility into network traffic between containers and to the outside network.

The Host sFlow agent makes it easy to enable sFlow on Open vSwitch. Simply enable the sflowovsd daemon and the Host sFlow configuration settings will be automatically applied to the Open vSwitch.
service sflowovsd start
There are a number of tools that consume and report on sFlow data, and these should be able to report on Docker without modification since the metrics being reported are the same standard set reported for virtual machines; several examples are described elsewhere on this blog.
Looking at the big picture, the comprehensive visibility of sFlow combined with the agility of SDN and Docker lays the foundation for optimized workload placement, resource allocation, and scaling by the orchestration system, maximizing the utility of the physical network, storage and compute infrastructure.

Tuesday, June 24, 2014

Microsoft Office 365 outage

6/24/2014 Information Week - Microsoft Exchange Online Suffers Service Outage, "Service disruptions with Microsoft's Exchange Online left many companies with no email on Tuesday."

The following entry on the Office 365 community forum describes the incident:
====================================

Closure Summary: On Tuesday, June 24, 2014, at approximately 1:11 PM UTC, engineers received reports of an issue in which some customers were unable to access the Exchange Online service. Investigation determined that a portion of the networking infrastructure entered into a degraded state. Engineers made configuration changes on the affected capacity to remediate end-user impact. The issue was successfully fixed on Tuesday, June 24, 2014, at 9:50 PM UTC.

Customer Impact: Affected customers were unable to access the Exchange Online service.

Incident Start Time: Tuesday, June 24, 2014, at 1:11 PM UTC

Incident End Time: Tuesday, June 24, 2014, at 9:50 PM UTC

=====================================
The closure summary shows that operators took 8 hours and 39 minutes to manually diagnose and remediate the problem with degraded networking infrastructure. The network related outage described in this example is not an isolated incident; other incidents described on this blog include: Packet loss, Amazon EC2 outage, Gmail outage, Delay vs utilization for adaptive control, and Multi-tenant performance isolation.

The incidents demonstrate two important points:
  1. Cloud services are critically dependent on the physical network
  2. Manually diagnosing problems in large scale networks is a time consuming process that results in extended service outages.
The article, SDN fabric controller for commodity data center switches, describes how the performance and resilience of the physical core can be enhanced through automation. The SDN fabric controller leverages the measurement and control capabilities of commodity switches to rapidly detect and adapt to changing traffic, reducing response times from hours to seconds.

Monday, June 9, 2014

RESTful control of Cumulus Linux ACLs

Figure 1: Elephants and Mice
Elephant Detection in Virtual Switches & Mitigation in Hardware discusses a VMware and Cumulus demonstration, Elephants and Mice, in which the virtual switch on a host detects and marks large "Elephant" flows and the hardware switch enforces priority queueing to prevent Elephant flows from adversely affecting latency of small "Mice" flows.

This article demonstrates a self-contained, real-time Elephant flow marking solution that leverages the visibility and control features of Cumulus Linux.

SDN fabric controller for commodity data center switches provides some background on the capabilities of the commodity switch hardware used to run Cumulus Linux. The article describes how the measurement and control capabilities of the hardware can be used to maximize data center fabric performance.
Exposing the ACL configuration files through a RESTful API offers a straightforward method of remotely creating, reading, updating, deleting and listing ACLs.

For example, the following command creates a filter called ddos1 to drop a DNS amplification attack:
curl -H "Content-Type:application/json" -X PUT --data \
'["[iptables]",\
"-A FORWARD --in-interface swp+ -d 10.10.100.10 -p udp --sport 53 -j DROP"]' \
http://10.0.0.233:8080/acl/ddos1
The filter can be retrieved:
curl http://10.0.0.233:8080/acl/ddos1
The following command lists the filter names:
curl http://10.0.0.233:8080/acl/
The filter can be deleted:
curl -X DELETE http://10.0.0.233:8080/acl/ddos1
Finally, all filters can be deleted:
curl -X DELETE http://10.0.0.233:8080/acl/
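The same operations are easy to drive from a script. The following minimal Python sketch exercises the API using only the standard library; the switch address (10.0.0.233) and filter name (ddos1) are the ones used in the curl examples above:
#!/usr/bin/env python
# Minimal client sketch for the ACL REST service shown below
import json
import urllib2

base = 'http://10.0.0.233:8080/acl/'

def put_acl(name, lines):
  # create or replace a named ACL
  req = urllib2.Request(base + name, json.dumps(lines),
                        {'Content-Type': 'application/json'})
  req.get_method = lambda: 'PUT'
  urllib2.urlopen(req).close()

def get_acl(name=''):
  # retrieve a named ACL, or the list of ACL names if name is empty
  return json.loads(urllib2.urlopen(base + name).read())

def delete_acl(name=''):
  # delete a named ACL, or all REST managed ACLs if name is empty
  req = urllib2.Request(base + name)
  req.get_method = lambda: 'DELETE'
  urllib2.urlopen(req).close()

if __name__ == '__main__':
  put_acl('ddos1', ['[iptables]',
    '-A FORWARD --in-interface swp+ -d 10.10.100.10 -p udp --sport 53 -j DROP'])
  print get_acl('ddos1')  # rules in the filter
  print get_acl()         # names of installed filters
  delete_acl('ddos1')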
Running the following Python script on the Cumulus switches provides a simple proof of concept implementation of the REST API:
#!/usr/bin/env python

from BaseHTTPServer import BaseHTTPRequestHandler,HTTPServer
from os import listdir,remove
from os.path import isfile
from json import dumps,loads
from subprocess import Popen,STDOUT,PIPE
import re

# Request handler exposing the Cumulus ACL policy files as REST resources under /acl/
class ACLRequestHandler(BaseHTTPRequestHandler):
  uripat = re.compile('^/acl/([a-z0-9]+)$')
  dir = '/etc/cumulus/acl/policy.d/'
  priority = '50'
  prefix = 'rest-'
  suffix = '.rules'
  filepat = re.compile('^'+priority+prefix+'([a-z0-9]+)\\'+suffix+'$')

  def commit(self):
    # apply the ACL policy files to the switch hardware
    Popen(["cl-acltool","-i"],stderr=STDOUT,stdout=PIPE).communicate()[0]

  def aclfile(self,name):
    return self.dir+self.priority+self.prefix+name+self.suffix

  def wheaders(self,status):
    self.send_response(status)
    self.send_header('Content-Type','application/json')
    self.end_headers() 

  # PUT /acl/<name> creates or replaces an ACL
  def do_PUT(self):
    m = self.uripat.match(self.path)
    if None != m:
       name = m.group(1)
       len = int(self.headers.getheader('content-length'))
       data = self.rfile.read(len)
       lines = loads(data)
       fn = self.aclfile(name)
       f = open(fn,'w')
       f.write('\n'.join(lines) + '\n')
       f.close()
       self.commit()
       self.wheaders(201)
    else:
       self.wheaders(404)
 
  # DELETE /acl/<name> removes an ACL, DELETE /acl/ removes all REST managed ACLs
  def do_DELETE(self):
    m = self.uripat.match(self.path)
    if None != m:
       name = m.group(1)
       fn = self.aclfile(name)
       if isfile(fn):
          remove(fn)
          self.commit()
       self.wheaders(204)
    elif '/acl/' == self.path:
       for file in listdir(self.dir):
         m = self.filepat.match(file)
         if None != m:
           remove(self.dir+file)
       self.commit()
       self.wheaders(204)
    else:
       self.wheaders(404)

  # GET /acl/<name> returns an ACL, GET /acl/ lists the installed ACL names
  def do_GET(self):
    m = self.uripat.match(self.path)
    if None != m:
       name = m.group(1)
       fn = self.aclfile(name)
       if isfile(fn):
         result = [];
         with open(fn) as f:
           for line in f:
              result.append(line.rstrip('\n'))
         self.wheaders(200)
         self.wfile.write(dumps(result))
       else:
         self.wheaders(404)
    elif '/acl/' == self.path:
       result = []
       for file in listdir(self.dir):
         m = self.filepat.match(file)
         if None != m:
           name = m.group(1)
           result.append(name)
       self.wheaders(200)
       self.wfile.write(dumps(result))
    else:
       self.wheaders(404)

if __name__ == '__main__':
  server = HTTPServer(('',8080), ACLRequestHandler) 
  server.serve_forever()
Some notes on building a production ready solution:
  1. Add authentication (see the sketch after this list)
  2. Add error handling
  3. Run the script as a daemon
  4. Scalability could be improved by asynchronously committing rules in batches
  5. Latency could be improved through the use of persistent connections (SPDY, WebSocket)
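For example, note 1 could be addressed by adding an HTTP basic authentication check that each handler calls before processing a request. This is only a sketch and the credentials are placeholders:
# Sketch of an HTTP basic authentication check for the request handlers
# (the username and password are placeholders)
from base64 import b64encode

AUTH = 'Basic ' + b64encode('admin:secret')

def authorized(handler):
  # returns True if the request carries the expected credentials,
  # otherwise sends a 401 challenge and returns False
  if handler.headers.getheader('Authorization') == AUTH:
    return True
  handler.send_response(401)
  handler.send_header('WWW-Authenticate', 'Basic realm="acl"')
  handler.end_headers()
  return False
Each do_PUT, do_DELETE and do_GET method would then begin with if not authorized(self): return.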
The following sFlow-RT controller application implements large flow marking using sFlow measurements from the switch and control of ACLs using the REST API:
include('extras/json2.js');

// Define large flow as greater than 100Mbits/sec for 1 second or longer
var bytes_per_second = 100000000/8;
var duration_seconds = 1;

var id = 0;
var controls = {};

setFlow('tcp',
 {keys:'ipsource,ipdestination,tcpsourceport,tcpdestinationport',
  value:'bytes', filter:'direction=ingress', t:duration_seconds}
);

setThreshold('elephant',
 {metric:'tcp', value:bytes_per_second, byFlow:true, timeout:4,
  filter:{ifspeed:[1000000000]}}
);

setEventHandler(function(evt) {
 if(controls[evt.flowKey]) return;

 var rulename = 'mark' + id++;
 var keys = evt.flowKey.split(',');
 var acl = [
'[iptables]',
'# mark Elephant',
'-t mangle -A FORWARD --in-interface swp+ -s ' + keys[0] + ' -d ' + keys[1] 
+ ' -p tcp --sport ' + keys[2] + ' --dport ' + keys[3]
+ ' -j SETQOS --set-dscp 10 --set-cos 5'
 ];
 http('http://'+evt.agent+':8080/acl/'+rulename,
      'put','application/json',JSON.stringify(acl));
 controls[evt.flowKey] = {
   agent:evt.agent,
   dataSource:evt.dataSource,
   rulename:rulename,
   time: (new Date()).getTime()
 };
},['elephant']);

setIntervalHandler(function() {
  for(var flowKey in controls) {
    var ctx = controls[flowKey];
    var val = flowValue(ctx.agent,ctx.dataSource + '.tcp',flowKey);
    if(val < 100) {
      http('http://'+ctx.agent+':8080/acl/'+ctx.rulename,'delete');
      delete controls[flowKey]; 
    }
  }
},5);
The following command line argument loads the script:
-Dscript.file=clmark.js
Some notes on the script:
  1. The 100Mbit/s threshold for large flows was selected because it represents 10% of the bandwidth of the 1 Gigabit access ports on the network
  2. The setFlow filter specifies ingress flows since the goal is to mark flows as they enter the network
  3. The setThreshold filter specifies that thresholds are only applied to 1 Gigabit access ports
  4. The event handler function triggers when new Elephant flows are detected, creating and installing an ACL to mark packets in the flow with a DSCP value of 10 and a CoS value of 5
  5. The interval handler function runs every 5 seconds and removes ACLs for flows that have completed
The iperf tool can be used to generate a sequence of large flows to test the controller:
while true; do iperf -c 10.100.10.152 -i 20 -t 20; sleep 20; done
The following screen capture shows a basic test setup and results:
The screen capture shows a mixture of small "mice" flows and large "elephant" flows generated by a server connected to an edge switch (in this case a Penguin Computing Arctica switch running Cumulus Linux). The graph at the bottom right shows the mixture of unmarked large and small flows arriving at the switch. The sFlow-RT controller receives a stream of sFlow measurements from the switch and detects each elephant flow in real-time, immediately installing an ACL that matches the flow and instructs the switch to mark it by setting the DSCP value. The traffic upstream of the switch is shown in the top right chart, where it can be clearly seen that each elephant flow has been identified and marked, while the mice have been left unmarked.

Thursday, June 5, 2014

Cumulus Networks, sFlow and data center automation

Cumulus Networks and InMon Corp have ported the open source Host sFlow agent to the upcoming Cumulus Linux 2.1 release. The Host sFlow agent already supports Linux, Windows, FreeBSD, Solaris, and AIX operating systems and KVM, Xen, XCP, XenServer, and Hyper-V hypervisors, delivering a standard set of performance metrics from switches, servers, hypervisors, virtual switches, and virtual machines - see Visibility and the software defined data center

The Cumulus Linux platform makes it possible to run the same open source agent on switches, servers, and hypervisors - providing unified end-to-end visibility across the data center. The open networking model that Cumulus is pioneering offers exciting opportunities. Cumulus Linux allows popular open source server orchestration tools to also manage the network, and the combination of real-time, data center wide analytics with orchestration make it possible to create self-optimizing data centers.

Install and configure Host sFlow agent

The following command installs the Host sFlow agent on a Cumulus Linux switch:
sudo apt-get install hsflowd
Note: Network managers may find this command odd since it is usually not possible to install third party software on switch hardware. However, what is even more radical is that Cumulus Linux allows users to download source code and compile it on their switch. Instead of being dependent on the switch vendor to fix a bug or add a feature, users are free to change the source code and contribute the changes back to the community.

The sFlow agent requires very little configuration, automatically monitoring all switch ports using the following default settings:

Link Speed    Sampling Rate    Polling Interval
1 Gbit/s      1-in-1,000       30 seconds
10 Gbit/s     1-in-10,000      30 seconds
40 Gbit/s     1-in-40,000      30 seconds
100 Gbit/s    1-in-100,000     30 seconds

Note: The default settings ensure that large flows (defined as consuming 10% of link bandwidth) are detected within approximately 1 second - see Large flow detection
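A rough back-of-the-envelope check of the 1 second claim, assuming full size 1500 byte packets:
# Expected sFlow samples per second from a flow consuming 10% of a 10G link
link_bps      = 10e9            # 10 Gbit/s link
flow_bps      = 0.1 * link_bps  # large flow defined as 10% of link bandwidth
packet_bytes  = 1500            # assumed average packet size
sampling_rate = 10000           # default 1-in-10,000 for 10 Gbit/s links

packets_per_second = flow_bps / (packet_bytes * 8)
samples_per_second = packets_per_second / sampling_rate
print 'samples/second from the flow: %.1f' % samples_per_second  # roughly 8
Around 8 samples per second is easily enough for an analyzer to recognize the flow within approximately a second.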

Once the Host sFlow agent is installed, there are two alternative configuration mechanisms that can be used to tell the agent where to send the measurements:

1. DNS Service Discovery (DNS-SD)

This is the default configuration mechanism for Host sFlow agents. DNS-SD uses a special type of DNS record (the SRV record) to allow hosts to automatically discover servers. For example, adding the following line to the site DNS zone file will enable sFlow on all the agents and direct the sFlow measurements to an sFlow analyzer (10.0.0.1):
_sflow._udp 300 SRV 0 0 6343 10.0.0.1
No Host sFlow agent specific configuration is required; each switch or host will automatically pick up the settings when the Host sFlow agent is installed, when the device is restarted, or if settings on the DNS server are changed.

Default sampling rates and polling interval can be overridden by adding a TXT record to the zone file. For example, the following TXT record reduces the sampling rate on 10G links to 1-in-2000 and the polling interval to 20 seconds:
_sflow._udp 300 TXT (
"txtvers=1"
"sampling.10G=2000"
"polling=20"
)
Note: Currently defined TXT options are described on sFlow.org.
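If the records are in place they can be checked from any host. The following sketch uses the dnspython package (example.com stands in for the site's zone):
# Query the sFlow DNS-SD records (requires the dnspython package)
import dns.resolver

for rr in dns.resolver.query('_sflow._udp.example.com', 'SRV'):
  print 'collector %s port %d' % (rr.target, rr.port)

for rr in dns.resolver.query('_sflow._udp.example.com', 'TXT'):
  print ' '.join(rr.strings)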

The article DNS-SD describes how DNS service discovery allows sFlow agents to automatically discover their configuration settings. The slides DNS Service Discovery from a talk at the SF Bay Area Large Scale Production Engineering Meetup provide additional background.

2. Configuration File

The Host sFlow agent is configured by editing the /etc/hsflowd.conf file. For example, the following configuration disables DNS-SD, instructs the agent to send sFlow to 10.0.0.1, reduces the sampling rate on 10G links to 1-in-2000 and the polling interval to 20 seconds:
sflow {
  DNSSD = off

  polling = 20
  sampling.10G = 2000
  collector {
    ip = 10.0.0.1
  }
}
The Host sFlow agent must be restarted for configuration changes to take effect:
sudo /etc/init.d/hsflowd restart
All hosts and switches can share the same settings and it is straightforward to use orchestration tools such as Puppet, Chef, etc. to manage the sFlow settings.

Collecting and analyzing sFlow

Figure 1: Visibility and the software defined data center
Figure 1 shows the general architecture of sFlow monitoring. Standard sFlow agents embedded within the elements of the infrastructure stream essential performance metrics to management tools, ensuring that every resource in a dynamic cloud infrastructure is immediately detected and continuously monitored.

  • Applications -  e.g. Apache, NGINX, Tomcat, Memcache, HAProxy, F5, A10 ...
  • Virtual Servers - e.g. Xen, Hyper-V, KVM ...
  • Virtual Network - e.g. Open vSwitch, Hyper-V extensible vSwitch
  • Servers - e.g. BSD, Linux, Solaris and Windows
  • Network - over 40 switch vendors, see Drivers for growth

The sFlow data from a Cumulus switch contains standard Linux performance statistics in addition to the interface counters and packet samples that you would typically get from a networking device.

Note: Enhanced visibility into host performance is important on open switch platforms since they may be running a number of user installed services that can stress the limited CPU, memory and IO resources.

For example, the following sflowtool output shows the raw data contained in an sFlow datagram from a switch running Cumulus Linux:
startDatagram =================================
datagramSourceIP 10.0.0.160
datagramSize 1332
unixSecondsUTC 1402004767
datagramVersion 5
agentSubId 100000
agent 10.0.0.233
packetSequenceNo 340132
sysUpTime 17479000
samplesInPacket 7
startSample ----------------------
sampleType_tag 0:2
sampleType COUNTERSSAMPLE
sampleSequenceNo 876
sourceId 2:1
counterBlock_tag 0:2001
adaptor_0_ifIndex 2
adaptor_0_MACs 1
adaptor_0_MAC_0 6c641a000459
counterBlock_tag 0:2005
disk_total 0
disk_free 0
disk_partition_max_used 0.00
disk_reads 980
disk_bytes_read 4014080
disk_read_time 1501
disk_writes 0
disk_bytes_written 0
disk_write_time 0
counterBlock_tag 0:2004
mem_total 2056589312
mem_free 1100533760
mem_shared 0
mem_buffers 33464320
mem_cached 807546880
swap_total 0
swap_free 0
page_in 35947
page_out 0
swap_in 0
swap_out 0
counterBlock_tag 0:2003
cpu_load_one 0.390
cpu_load_five 0.440
cpu_load_fifteen 0.430
cpu_proc_run 1
cpu_proc_total 95
cpu_num 2
cpu_speed 0
cpu_uptime 770774
cpu_user 160600160
cpu_nice 192970
cpu_system 77855100
cpu_idle 1302586110
cpu_wio 4650
cpuintr 0
cpu_sintr 308370
cpuinterrupts 1851322098
cpu_contexts 800650455
counterBlock_tag 0:2006
nio_bytes_in 405248572711
nio_pkts_in 394079084
nio_errs_in 0
nio_drops_in 0
nio_bytes_out 406139719695
nio_pkts_out 394667262
nio_errs_out 0
nio_drops_out 0
counterBlock_tag 0:2000
hostname cumulus
UUID fd-01-78-45-93-93-42-03-a0-5a-a3-d7-42-ac-3c-de
machine_type 7
os_name 2
os_release 3.2.46-1+deb7u1+cl2+1
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:2
sampleType COUNTERSSAMPLE
sampleSequenceNo 876
sourceId 0:44
counterBlock_tag 0:1005
ifName swp42
counterBlock_tag 0:1
ifIndex 44
networkType 6
ifSpeed 0
ifDirection 2
ifStatus 0
ifInOctets 0
ifInUcastPkts 0
ifInMulticastPkts 0
ifInBroadcastPkts 0
ifInDiscards 0
ifInErrors 0
ifInUnknownProtos 4294967295
ifOutOctets 0
ifOutUcastPkts 0
ifOutMulticastPkts 0
ifOutBroadcastPkts 0
ifOutDiscards 0
ifOutErrors 0
ifPromiscuousMode 0
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 1022129
sourceId 0:7
meanSkipCount 128
samplePool 130832512
dropEvents 0
inputPort 7
outputPort 10
flowBlock_tag 0:1
flowSampleType HEADER
headerProtocol 1
sampledPacketSize 1518
strippedBytes 4
headerLen 128
headerBytes 6C-64-1A-00-04-5E-E8-E7-32-77-E2-B5-08-00-45-00-05-DC-63-06-40-00-40-06-9E-21-0A-64-0A-97-0A-64-14-96-9A-6D-13-89-4A-0C-4A-42-EA-3C-14-B5-80-10-00-2E-AB-45-00-00-01-01-08-0A-5D-B2-EB-A5-15-ED-48-B7-34-35-36-37-38-39-30-31-32-33-34-35-36-37-38-39-30-31-32-33-34-35-36-37-38-39-30-31-32-33-34-35-36-37-38-39-30-31-32-33-34-35-36-37-38-39-30-31-32-33-34-35-36-37-38-39-30-31-32-33-34-35
dstMAC 6c641a00045e
srcMAC e8e73277e2b5
IPSize 1500
ip.tot_len 1500
srcIP 10.100.10.151
dstIP 10.100.20.150
IPProtocol 6
IPTOS 0
IPTTL 64
TCPSrcPort 39533
TCPDstPort 5001
TCPFlags 16
endSample   ----------------------
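The text output is easy to post-process. For example, the following sketch pipes sflowtool and prints the 1 minute load average reported in each counter sample (it assumes sflowtool is installed on the collector and relies on the key value format shown above):
#!/usr/bin/env python
# Print the 1 minute load average reported by each sFlow agent
from subprocess import Popen, PIPE

proc = Popen(['sflowtool'], stdout=PIPE)
agent = None
for line in iter(proc.stdout.readline, ''):
  key, _, value = line.rstrip('\n').partition(' ')
  if key == 'agent':
    agent = value
  elif key == 'cpu_load_one':
    print '%s cpu_load_one=%s' % (agent, value)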
While sflowtool is extremely useful, there are many other open source and commercial sFlow analysis tools available.
Note: The sFlow Collectors list on sFlow.org contains a number of additional tools.

There is a great deal of variety among sFlow collectors - many focus on the network, others have a compute infrastructure focus, and yet others report on application performance. The shared sFlow measurement infrastructure delivers value in each of these areas. However, as network, storage, host and application resources are brought together and automated to create cloud data centers, a new set of sFlow analytics tools is emerging to deliver the integrated real-time visibility required to drive automation and optimize performance and efficiency across the data center.
While network administrators are likely to be familiar with sFlow, application development and operations teams may be unfamiliar with the technology. The 2012 O'Reilly Velocity conference talk provides an introduction to sFlow aimed at the DevOps community.
Cumulus Linux presents the switch as a server with a large number of network adapters, an abstraction that will be instantly familiar to anyone with server management experience. For example, displaying interface information on Cumulus Linux uses the standard Linux command:
ifconfig swp2
On the other hand, network administrators experienced with switch CLIs may find that Linux commands take a little time to get used to - the above command is roughly equivalent to:
show interfaces fastEthernet 6/1
However, the basic concepts of networking don't change and these skills are essential to designing, automating, operating and troubleshooting data center networks. Open networking platforms such as Cumulus Linux are an important piece of the automation puzzle, taking networking out of its silo and allowing a combined NetDevOps team to manage network, server, and application resources using proven monitoring and orchestration tools such as Ganglia, Graphite, Nagios, CFEngine, Puppet, Chef, Ansible, and Salt.

Saturday, May 31, 2014

SDN fabric controller for commodity data center switches

Figure 1: Rise of merchant silicon
Figure 1 illustrates the rapid transition to merchant silicon among leading data center network vendors, including: Alcatel-Lucent, Arista, Cisco, Cumulus, Dell, Extreme, Juniper, Hewlett-Packard, and IBM.

This article will examine some of the factors leading to commoditization of network hardware and the role that software defined networking (SDN) plays in coordinating hardware resources to deliver increased network efficiency.
Figure 2: Fabric: A Retrospective on Evolving SDN
The article, Fabric: A Retrospective on Evolving SDN by Martin Casado, Teemu Koponen, Scott Shenker, and Amin Tootoonchian, makes the case for a two tier SDN architecture; comprising a smart edge and an efficient core.
Table 1: Edge vs Fabric Functionality
Virtualization and advances in the networking capability of x86 based servers are drivers behind this separation. Virtual machines are connected to each other and to the physical network using a software virtual switch. The software switch provides the flexibility to quickly develop and deploy advanced features like network virtualization, tenant isolation, distributed firewalls, etc. Network function virtualization (NFV) is moving functions such as firewalls, load balancing, and routing from dedicated appliances to virtual machines, or embedding them within the virtual switches. The increased importance of network centric software has driven dramatic improvements in the performance of commodity x86 based servers, reducing the need for complex hardware functions in network devices.

As complex functions shift to software running on servers at the network edge, the role of the core physical network is simplified. Merchant silicon provides a cost effective way of delivering the high performance forwarding capabilities needed to interconnect servers and Figure 1 shows how Broadcom based switches are now dominating the market.

The Broadcom white paper, Engineered Elephant Flows for Boosting Application Performance in Large-Scale CLOS Networks, describes the challenge posed by large "Elephant" flows and the opportunity to use software defined networking to orchestrate hardware resources and improve network efficiency.
Figure 3: Feedback controller
Figure 3 shows the elements of an SDN feedback controller. Network measurements are analyzed to identify network hot spots, available resources, and large flows. The controller then plans a response and deploys controls in order to allocate resources where they are needed and reduce contention. The control system operates as a continuous loop: the effect of the changes is observed by the measurement system and further changes are made as needed.

Implementing the controller requires an understanding of the measurement and control capabilities of the Broadcom ASICs.

Control Protocol

Figure 4: Programming Pipeline for ECMP
The Broadcom white paper focuses on the ASIC architecture and control mechanisms and includes the functional diagram shown in Figure 4. The paper describes two distinct configuration tasks:
  1. Programming the Routing Flow Table and ECMP Select Groups to perform equal cost multi-path forwarding of the majority of flows.
  2. Programming the ACL Policy Flow Table to selectively override forwarding decisions for the relatively small number of Elephant flows responsible for the bulk of the traffic on the network.
Managing the Routing and ECMP Group tables is well understood and there are a variety of solutions available that can be used to configure ECMP forwarding:
  1. CLI — Use switch CLI to configure distributed routing agents running on each switch (e.g. OSPF, BGP, etc.)
  2. Configuration Protocol — Similar to 1, but programmatic configuration protocols such as NETCONF or JSON-RPC replace the CLI.
  3. Server orchestration — Open Linux based switch platforms allow server management agents to be installed on the switches to manage configuration. For example, Cumulus Linux supports Puppet, Chef, CFEngine, etc.
  4. OpenFlow — The white paper describes using the Ryu controller to calculate routes and update the forwarding and group tables using OpenFlow 1.3+ to communicate with Indigo OpenFlow agents on the switches.  
The end result is very similar whatever method is chosen to populate the Routing and ECMP Group tables - the hardware forwards packets across multiple paths based on a hash function calculated over selected fields in the packets (e.g. source and destination IP addresses + source and destination TCP ports), e.g.
index = hash(packet fields) % group.size
selected_physical_port = group[index]
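The following toy sketch illustrates the idea, using CRC32 in place of the ASIC's hash function (the member port names are hypothetical):
# Toy illustration of hash based ECMP port selection (not the ASIC's actual hash)
import zlib

group = ['swp1', 'swp2', 'swp3', 'swp4']  # equal cost member ports

def select_port(src_ip, dst_ip, src_port, dst_port):
  key = '%s|%s|%d|%d' % (src_ip, dst_ip, src_port, dst_port)
  return group[zlib.crc32(key) % len(group)]

# every packet in a flow hashes to the same port, preserving packet order,
# but two elephant flows can easily be assigned to the same member port
print select_port('10.100.10.151', '10.100.20.150', 39533, 5001)
print select_port('10.100.10.152', '10.100.20.150', 49240, 5001)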
Hash based load balancing works well for the large numbers of small flows "Mice" on the network, but is less suitable for the long lived large "Elephant" flows. The hash function may assign multiple Elephant flows to the same physical port (even if other ports in the group are idle), resulting in congestion and poor network performance.
Figure 5: Long vs Short flows (from The Nature of Datacenter Traffic: Measurements & Analysis)
The traffic engineering controller uses the ACL Policy Flow Table to manage Elephant flows, ensuring that they don't interfere with latency sensitive Mice and are evenly distributed across the available paths - see Marking large flows and ECMP load balancing.
Figure 6: Hybrid Programmable Forwarding Plane, David Ward, ONF Summit, 2011
Integrated hybrid OpenFlow 1.0 is an effective mechanism for exposing the ACL Policy Flow Table to an external controller:
  • Simple, no change to normal forwarding behavior, can be combined with any of the mechanisms used to manage the Routing and ECMP Group tables listed above.
  • Efficient, the Routing and ECMP Group tables efficiently handle most flows. OpenFlow is used to control the ACL Policy Flow Table and selectively override forwarding of specific flows (block, mark, steer, rate-limit), maximizing the effectiveness of the limited number of entries available.
  • Scalable, most flows are handled by the existing control plane, OpenFlow is only used when the controller wants to make an exception.
  • Robust, if the controller fails the network keeps forwarding.
The control protocol is only half the story. An effective measurement protocol is needed to rapidly identify network hot spots, available resources, and large flows so that the controller can identify which flows need to be managed and where to apply the controls.

Measurement Protocol

The Broadcom white paper is limited in its discussion of measurement, but it does list four ways of detecting large flows:
  1. A priori
  2. Monitor end host socket buffers
  3. Maintain per flow statistics in network
  4. sFlow
The first two methods involve signaling the arrival of large flows to the network from the hosts. Both methods have practical difficulties in that they require that every application and / or host implement the measurements and communicate them to the fabric controller - a difficult challenge in a heterogeneous environment. However, the more fundamental problem is that while both methods can usefully identify the arrival of large flows, they don't provide sufficient information for the fabric controller to take action since it also needs to know the load on all the links in the fabric.

The requirement for end to end visibility can only be met if the instrumentation is built into the network devices, which leads to options 3 and 4. Option 3 would require an entry in the ACL table for each flow and the Broadcom paper points out that this approach does not scale.

The solution to the measurement challenge is option 4. Support for the multi-vendor sFlow protocol is included in Broadcom ASICs, is completely independent of the forwarding tables, and can be enabled on all ports and all switches to provide the end to end visibility needed for effective control.
Figure 7: Custom vs. merchant silicon traffic measurement
Figure 7 compares traffic measurement on legacy custom ASIC based switches with standard sFlow measurements supported by merchant silicon vendors. The custom ASIC based switch, shown on top, performs many of the traffic flow analysis functions in hardware. In contrast, merchant silicon based switches shift flow analysis to external software, implementing only the essential measurement functions required for wire speed performance in silicon.

Figure 7 lists a number of benefits that result from moving flow analysis from the custom ASIC to external software, but in the context of large flow traffic engineering the real-time detection of flows made possible by an external flow cache is essential if the traffic engineering controller is to be effective - see Rapidly detecting large flows, sFlow vs. NetFlow/IPFIX
Figure 8: sFlow-RT feedback controller
Figure 8 shows a fully instantiated SDN feedback controller. The sFlow-RT controller leverages the sFlow and OpenFlow standards to optimize the performance of fabrics built using commodity switches; a number of practical applications for the sFlow-RT controller have already been demonstrated on this blog.
While the industry at large appears to be moving to the Edge / Fabric architecture shown in Figure 2, Cisco's Application Centric Infrastructure (ACI) is an anomaly. ACI is a tightly integrated proprietary solution; the Cisco Application Policy Infrastructure Controller (APIC) uses the Cisco OpFlex protocol to manage Cisco Nexus 9000 switches and Cisco AVS virtual switches. For example, the Cisco Nexus 9000 switches are based on Broadcom silicon and provide an interoperable NX-OS mode. However, line cards that include an Application Leaf Engine (ALE) ASIC along with the Broadcom ASIC are required to support ACI mode. The ALE provides visibility and control features for large flow load balancing and prioritization - both of which can be achieved using standard protocols to manage the capabilities of the Broadcom ASIC.

It will be interesting to see whether ACI is able to compete with modular, low cost, solutions based on open standards and commodity hardware. Cisco has offered its customers a choice and given the compelling value of open platforms I expect many will choose not to be locked into the proprietary ACI solution and will favor NX-OS mode on the Nexus 9000 series, pushing Cisco to provide the full set of open APIs currently available on the Nexus 3000 series (sFlow, OpenFlow, Puppet, Python etc.).
Figure 9: Move communicating virtual machines together to reduce network traffic (from NUMA)
Finally, SDN is only one piece of a larger effort to orchestrate network, compute and storage resources to create a software defined data center (SDDC). For example, Figure 9 shows how network analytics from the fabric controller can be used to move virtual machines (e.g. by integrating with OpenStack APIs) to reduce application response times and network traffic. More broadly, feedback control allows efficient matching of resources to workloads and can dramatically increase the efficiency of the data center - see Workload placement.

Tuesday, May 13, 2014

Load balancing large flows on multi-path networks

Figure 1: Active control of large flows in a multi-path topology
Figure 1 shows initial results from the Mininet integrated hybrid OpenFlow testbed demonstrating that active steering of large flows using a performance aware SDN controller significantly improves network throughput of multi-path network topologies.
Figure 2: Two path topology
The graph in Figure 1 summarizes results from topologies with 2, 3 and 4 equal cost paths. For example, the Mininet topology in Figure 2 has two equal cost paths of 10Mbit/s (shown in blue and red). The iperf traffic generator was used to create a continuous stream of 20 second flows from h1 to h3 and from h2 to h4. If traffic were perfectly balanced, each flow would achieve 10Mbit/s throughput. However, Figure 1 shows that the throughput obtained using hash based ECMP load balancing is approximately 6.8Mbit/s. Interestingly, the average link throughput decreases as additional paths are added, dropping to approximately 6.2Mbit/s with four equal cost paths (see the blue bars in Figure 1).

To ensure that packets in a flow arrive in order at their destination, switch s3 computes a hash function over selected fields in the packets (e.g. source and destination IP addresses + source and destination TCP ports) and picks a link based on the value of the hash, e.g.
index = hash(packet fields) % linkgroup.size
selected_link = linkgroup[index]
The drop in throughput occurs when two or more large flows are assigned to the same link by the hash function and must compete for bandwidth.
Figure 3: Performance optimizing hybrid OpenFlow controller
Performance optimizing hybrid OpenFlow controller describes how the sFlow and OpenFlow standards can be combined to provide analytics driven feedback control to automatically adapt resources to changing demand. In this example, the controller has been programmed to detect large flows arriving on busy links and steer them to a less congested alternative path. The results shown in Figure 1 demonstrate that actively steering the large flows increases average link throughput by between 17% and 20% (see the red bars).
These results were obtained using a very simple initial control scheme and there is plenty of scope for further improvement; the results from these experiments suggest that a 50-60% increase in throughput over hash based ECMP load balancing is theoretically possible.
This solution easily scales to 10G data center fabrics. Support for the sFlow standard is included in most vendors' switches (Alcatel-Lucent, Arista, Brocade, Cisco, Dell, Extreme, HP, Huawei, IBM, Juniper, Mellanox, ZTE, etc.), providing data center wide visibility - see Drivers for growth. Combined with the increasing maturity of and vendor support for the OpenFlow standard, this provides the real-time control of packet forwarding needed to adapt the network to changing traffic. Finally, flow steering is one of a number of techniques that combine to amplify performance gains delivered by the controller; other techniques include: large flow marking, DDoS mitigation, and workload placement.

Wednesday, April 23, 2014

Mininet integrated hybrid OpenFlow testbed

Figure 1: Hybrid Programmable Forwarding Planes
Integrated hybrid OpenFlow combines OpenFlow and existing distributed routing protocols to deliver robust software defined networking (SDN) solutions. Performance optimizing hybrid OpenFlow controller describes how the sFlow and OpenFlow standards combine to deliver visibility and control to address challenges including: DDoS mitigation, ECMP load balancing, LAG load balancing, and large flow marking.

A number of vendors support sFlow and integrated hybrid OpenFlow today, examples described on this blog include: Alcatel-Lucent, Brocade, and Hewlett-Packard. However, building a physical testbed is expensive and time consuming. This article describes how to build an sFlow and hybrid OpenFlow testbed using free Mininet network emulation software. The testbed emulates ECMP leaf and spine data center fabrics and provides a platform for experimenting with analytics driven feedback control using the sFlow-RT hybrid OpenFlow controller.

First, build an Ubuntu 13.04 / 13.10 virtual machine, then follow the instructions for installing Mininet - Option 3: Installation from Packages.

Next, install an Apache web server:
sudo apt-get install apache2
Install the sFlow-RT integrated hybrid OpenFlow controller, either on the Mininet virtual machine, or on a different system (Java 1.6+ is required to run sFlow-RT):
wget http://www.inmon.com/products/sFlow-RT/sflow-rt.tar.gz
tar -xvzf sflow-rt.tar.gz
Copy the leafandspine.py script from the sflow-rt/extras directory to the Mininet virtual machine.

The following options are available:
./leafandspine.py --help
Usage: leafandspine.py [options]

Options:
  -h, --help            show this help message and exit
  --spine=SPINE         number of spine switches, default=2
  --leaf=LEAF           number of leaf switches, default=2
  --fanout=FANOUT       number of hosts per leaf switch, default=2
  --collector=COLLECTOR
                        IP address of sFlow collector, default=127.0.0.1
  --controller=CONTROLLER
                        IP address of controller, default=127.0.0.1
  --topofile=TOPOFILE   file used to write out topology, default topology.txt
Figure 2 shows a simple leaf and spine topology consisting of four hosts and four switches:
Figure 2: Simple leaf and spine topology
The following command builds the topology and specifies a remote host (10.0.0.162) running sFlow-RT as the hybrid OpenFlow controller and sFlow collector:
sudo ./leafandspine.py --collector 10.0.0.162 --controller 10.0.0.162 --topofile /var/www/topology.json
Note: All the links are configured to 10Mbit/s and the sFlow sampling rate is set to 1-in-10. These settings are equivalent to a 10Gbit/s network with a 1-in-10,000 sampling rate - see Large flow detection.

The network topology is written to /var/www/topology.json, making it accessible through HTTP. For example, the following command retrieves the topology from the Mininet VM (10.0.0.61):
curl http://10.0.0.61/topology.json
{"nodes": {"s3": {"ports": {"s3-eth4": {"ifindex": "392", "name": "s3-eth4"}, "s3-eth3": {"ifindex": "390", "name": "s3-eth3"}, "s3-eth2": {"ifindex": "402", "name": "s3-eth2"}, "s3-eth1": {"ifindex": "398", "name": "s3-eth1"}}, "tag": "edge", "name": "s3", "agent": "10.0.0.61", "dpid": "0000000000000003"}, "s2": {"ports": {"s2-eth1": {"ifindex": "403", "name": "s2-eth1"}, "s2-eth2": {"ifindex": "405", "name": "s2-eth2"}}, "name": "s2", "agent": "10.0.0.61", "dpid": "0000000000000002"}, "s1": {"ports": {"s1-eth1": {"ifindex": "399", "name": "s1-eth1"}, "s1-eth2": {"ifindex": "401", "name": "s1-eth2"}}, "name": "s1", "agent": "10.0.0.61", "dpid": "0000000000000001"}, "s4": {"ports": {"s4-eth2": {"ifindex": "404", "name": "s4-eth2"}, "s4-eth3": {"ifindex": "394", "name": "s4-eth3"}, "s4-eth1": {"ifindex": "400", "name": "s4-eth1"}, "s4-eth4": {"ifindex": "396", "name": "s4-eth4"}}, "tag": "edge", "name": "s4", "agent": "10.0.0.61", "dpid": "0000000000000004"}}, "links": {"s2-eth1": {"ifindex1": "403", "ifindex2": "402", "node1": "s2", "node2": "s3", "port2": "s3-eth2", "port1": "s2-eth1"}, "s2-eth2": {"ifindex1": "405", "ifindex2": "404", "node1": "s2", "node2": "s4", "port2": "s4-eth2", "port1": "s2-eth2"}, "s1-eth1": {"ifindex1": "399", "ifindex2": "398", "node1": "s1", "node2": "s3", "port2": "s3-eth1", "port1": "s1-eth1"}, "s1-eth2": {"ifindex1": "401", "ifindex2": "400", "node1": "s1", "node2": "s4", "port2": "s4-eth1", "port1": "s1-eth2"}}}
Don't start sFlow-RT yet; it should only be started after Mininet has finished building the topology.

Verify connectivity before starting sFlow-RT:
mininet> pingall
*** Ping: testing ping reachability
h1 -> h2 h3 h4 
h2 -> h1 h3 h4 
h3 -> h1 h2 h4 
h4 -> h1 h2 h3 
*** Results: 0% dropped (12/12 received)
This test demonstrates that the Mininet topology has been constructed with a set of default forwarding rules that provide connectivity without the need for an OpenFlow controller - emulating the behavior of a network of integrated hybrid OpenFlow switches.

The following sFlow-RT script ecmp.js demonstrates ECMP load balancing in the emulated network:
include('extras/json2.js');

// Define large flow as greater than 1Mbits/sec for 1 second or longer
var bytes_per_second = 1000000/8;
var duration_seconds = 1;

var top = JSON.parse(http("http://10.0.0.61/topology.json"));
setTopology(top);

setFlow('tcp',
 {keys:'ipsource,ipdestination,tcpsourceport,tcpdestinationport',
  value:'bytes', t:duration_seconds}
);

setThreshold('elephant',
 {metric:'tcp', value:bytes_per_second, byFlow:true, timeout:2}
);

setEventHandler(function(evt) {
 var rec = topologyInterfaceToLink(evt.agent,evt.dataSource);
 if(!rec || !rec.link) return;
 var link = topologyLink(rec.link);
 logInfo(link.node1 + "-" + link.node2 + " " + evt.flowKey);
},['elephant']);
Modify the sFlow-RT start.sh script to include the following arguments:
RT_OPTS="-Dopenflow.controller.start=yes -Dopenflow.controller.flushRules=no"
SCRIPTS="-Dscript.file=ecmp.js"
Some notes on the script:
  1. The topology is retrieved by making an HTTP request to the Mininet VM (10.0.0.61)
  2. The 1Mbits/s threshold for large flows was selected because it represents 10% of the bandwidth of the 10Mbits/s links in the emulated network
  3. The event handler prints the link the flow traversed - identifying the link by the pair of switches it connects
Start sFlow-RT:
./start.sh
Now generate some large flows between h1 and h3 using the Mininet iperf command:
mininet> iperf h1 h3
*** Iperf: testing TCP bandwidth between h1 and h3
*** Results: ['9.58 Mbits/sec', '10.8 Mbits/sec']
mininet> iperf h1 h3
*** Iperf: testing TCP bandwidth between h1 and h3
*** Results: ['9.58 Mbits/sec', '10.8 Mbits/sec']
mininet> iperf h1 h3
*** Iperf: testing TCP bandwidth between h1 and h3
*** Results: ['9.59 Mbits/sec', '10.3 Mbits/sec']
The following results were logged by sFlow-RT:
2014-04-21T19:00:36-0700 INFO: ecmp.js started
2014-04-21T19:01:16-0700 INFO: s1-s3 10.0.0.1,10.0.1.1,49240,5001
2014-04-21T19:01:16-0700 INFO: s1-s4 10.0.0.1,10.0.1.1,49240,5001
2014-04-21T20:53:19-0700 INFO: s2-s4 10.0.0.1,10.0.1.1,49242,5001
2014-04-21T20:53:19-0700 INFO: s2-s3 10.0.0.1,10.0.1.1,49242,5001
2014-04-21T20:53:29-0700 INFO: s1-s3 10.0.0.1,10.0.1.1,49244,5001
2014-04-21T20:53:29-0700 INFO: s1-s4 10.0.0.1,10.0.1.1,49244,5001
The results demonstrate that the emulated leaf and spine network is performing equal cost multi-path (ECMP) forwarding - different flows between the same pair of hosts take different paths across the fabric (the highlighted lines correspond to the paths shown in Figure 2).
Open vSwitch in Mininet is the key to this emulation, providing sFlow and multi-path forwarding support.
The following script implements the large flow marking example described in Performance optimizing hybrid OpenFlow controller:
include('extras/json2.js');

// Define large flow as greater than 1Mbits/sec for 1 second or longer
var bytes_per_second = 1000000/8;
var duration_seconds = 1;

var idx = 0;

var top = JSON.parse(http("http://10.0.0.61/topology.json"));
setTopology(top);

setFlow('tcp',
 {keys:'ipsource,ipdestination,tcpsourceport,tcpdestinationport',
  value:'bytes', t:duration_seconds}
);

setThreshold('elephant',
 {metric:'tcp', value:bytes_per_second, byFlow:true, timeout:4}
);

setEventHandler(function(evt) {
 var agent = evt.agent;
 var ds = evt.dataSource;
 if(topologyInterfaceToLink(agent,ds)) return;

 var ports = ofInterfaceToPort(agent,ds);
 if(ports && ports.length == 1) {
  var dpid = ports[0].dpid;
  var id = "mark" + idx++;
  var k = evt.flowKey.split(',');
  var rule= {
    priority:1000, idleTimeout:2,
    match:{dl_type:2048, nw_proto:6, nw_src:k[0], nw_dst:k[1],
           tp_src:k[2], tp_dst:k[3]},
    actions:["set_nw_tos=128","output=normal"]
  };
  setOfRule(dpid,id,rule);
 }
},['elephant']);

setFlow('tos0',{value:'bytes',filter:'iptos=00000000',t:1});
setFlow('tos128',{value:'bytes',filter:'iptos=10000000',t:1});
Some notes on the script:
  1. The topologyInterfaceToLink() function looks up link information based on agent and interface. The event handler uses this function to exclude inter-switch links, applying controls to ingress ports only.
  2. The OpenFlow rule priority for rules created by controller scripts must be greater than 500 to override the default rules created by leafandspine.py
  3. The tos0 and tos128 flow definitions have been added so that the re-marking can be seen.
Restart sFlow-RT with the new script and use a web browser to view the default tos0 and the re-marked tos128 traffic.
Figure 3: Marking large flows
Use iperf to generate traffic between h1 and h3 (the traffic needs to cross more than one switch so it can be observed before and after marking). The screen capture in figure 3 demonstrates that the controller immediately detects and marks large flows.
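Instead of (or in addition to) the browser, the tos0 and tos128 flows can be polled programmatically. The following sketch is one possibility; the /activeflows REST path and default HTTP port 8008 are assumptions based on sFlow-RT's documented REST API, and the sFlow-RT address (10.0.0.162) is the one used earlier:
# Poll sFlow-RT for the unmarked (tos0) and re-marked (tos128) traffic rates
import json
import time
import urllib2

rt = 'http://10.0.0.162:8008'
while True:
  for name in ['tos0', 'tos128']:
    url = rt + '/activeflows/ALL/' + name + '/json'
    flows = json.loads(urllib2.urlopen(url).read())
    total = sum(f['value'] for f in flows)  # each entry is expected to carry a value field
    print '%s %.0f bytes/sec' % (name, total)
  time.sleep(5)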