sFlow: June 2013

Wednesday, June 26, 2013

Marking large flows

Figure 1: Conceptual view of flow scheduling over a datacenter fabric

Deconstructing Datacenter Packet Transport describes how priority marking of packets associated with large flows can improve completion times for flows crossing the data center fabric.

Figure 2: Overall average flow completion time. The results are normalized with respect to the best possible completion time for each flow size.

Figure 2 shows simulation results from the paper, which indicate that prioritization of short flows over large flows can significantly improve throughput (reducing flow completion times by a factor of 5 or more at high loads).

While the scheme described in the paper proposes changes to the end host behavior, there are practical difficulties in modifying the behavior of all the end hosts. In addition, the paper identifies a complication "whenever a packet traverses multiple hops only to be dropped at a downstream link."

Figure 3: Elements of a combined SDN load balancing and priority marking controller

An interesting alternative can be constructed using a software defined networking (SDN) control application to automatically detect each large flow and dynamically configure the ingress top of rack switch (or virtual switch) to mark its packets. Figure 3 shows the elements of the solution: sFlow measurements from the switches are sent to the sFlow-RT real-time analytics engine which rapidly detects large flows and notifies the Priority Marking application. The Priority Marking application instructs the OpenFlow Controller to mark the packets matching the flow keys and the controller uses OpenFlow to instruct the top of rack switch to mark the packets as they enter the network.

The Load Balancer also listens for large flow events and uses sFlow-RT to monitor the utilization of all the links in the fabric. The Load Balancer determines if the flow needs to be re-routed and sends instructions to the OpenFlow Controller, which uses OpenFlow to instruct the switches to change the forwarding path for the flow, see ECMP load balancing.

Using SDN to manage large flows scales well because a relatively small number of large flows are responsible for the bulk of the data transferred over the network, see SDN and large flows. Combining prioritization and load balancing of large flows offers the promise of much greater improvement than either technique is able to offer on its own. For example, Google's use of SDN to improve the efficiency of their wide area network combines control of forwarding with prioritization of traffic, see SDN and WAN optimization.

Tuesday, June 25, 2013

Large flow detection script

Large flow detection describes how sFlow monitoring scales to rapidly detect large flows (flows consuming more than 10% of a link's bandwidth). The chart displays the test pattern developed in the article and shows visually how sFlow can be used to detect and track large flows.

This article develops a script for automatically detecting large flows that can form the basis of a load balancing performance aware software defined networking (SDN) application. The script is based on the node.js DDoS mitigation example presented in Controlling large flows with OpenFlow. Here the script is adapted to poll sFlow-RT every second for link loads as well as to receive large flow events.

var fs = require('fs');
var http = require('http');

var rt = { hostname: 'localhost', port: 8008 };
var flowkeys = 'ipsource,ipdestination,udpsourceport,udpdestinationport';
var threshold = 125000; // 1 Mbit/s   = 10% of link bandwidth

var links = {};

// mininet mapping between sFlow ifIndex numbers and switch/port names
var ifindexToPort = {};
var path = '/sys/devices/virtual/net/';
var devs = fs.readdirSync(path);
for(var i = 0; i < devs.length; i++) {
 var dev = devs[i];
 var parts = dev.match(/(.*)-(.*)/);
 if(!parts) continue;

 var ifindex = fs.readFileSync(path + dev + '/ifindex');
 ifindex = parseInt(ifindex).toString();
 var port = {'switch':parts[1],'port':dev};
 ifindexToPort[ifindex] = port;
}

function extend(destination, source) {
 for (var property in source) {
  if (source.hasOwnProperty(property)) {
   destination[property] = source[property];
  }
 }
 return destination;
}

function jsonGet(target,path,callback) {
 var options = extend({method:'GET',path:path},target);
 var req = http.request(options,function(resp) {
  var chunks = [];
  resp.on('data', function(chunk) { chunks.push(chunk); });
  resp.on('end', function() { callback(JSON.parse(chunks.join(''))); });
 });
 req.end();
};

function jsonPut(target,path,value,callback) {
 var options = extend({method:'PUT',headers:{'content-type':'application/json'},path:path},target);
 var req = http.request(options,function(resp) {
  var chunks = [];
  resp.on('data', function(chunk) { chunks.push(chunk); });
  resp.on('end', function() { callback(chunks.join('')); });
 });
 req.write(JSON.stringify(value));
 req.end();
};

function getLinkRecord(agent,ifindex) {
 var linkkey = agent + ">" + ifindex;
 var rec = links[linkkey];
 if(!rec) {
  rec = {agent:agent, ifindex:ifindex, port:ifindexToPort[ifindex]};
  links[linkkey] = rec;
 }
 return rec;
}

function updateLinkLoads(metrics) {  
 for(var i = 0; i < metrics.length; i++) {
  var metric = metrics[i];
  var rec = getLinkRecord(metric.agent,metric.dsIndex);
  rec.total = metric.metricValue;
 }
}

function largeFlow(link,flowKey,now,dt) {
 console.log(now + " " + " " + dt + " " + link.port.port + " " + flowKey);
}

function getEvents(id) {
 jsonGet(rt,'/events/json?maxEvents=10&timeout=60&eventID='+ id,
  function(events) {
   var nextID = id;
   if(events.length > 0) {
    nextID = events[0].eventID;
    events.reverse();
    var now = (new Date()).getTime();
    for(var i = 0; i < events.length; i++) {
     var evt = events[i];
     var dt = now - evt.timestamp;
     if('detail' == evt.thresholdID
        && Math.abs(dt) < 5000) {
      var flowKey = evt.flowKey;
      var rec = getLinkRecord(evt.agent,evt.dataSource);
      largeFlow(rec,flowKey,now,dt);
     }
    }
   }
   getEvents(nextID);
  }
 );
}

function startMonitor() {
 getEvents(-1);
 setInterval(function() {
  jsonGet(rt,'/dump/ALL/total/json', updateLinkLoads)
 }, 1000);
}

function setTotalFlows() {
 jsonPut(rt,'/flow/total/json',
  {value:'bytes',filter:'outputifindex!=discard&direction=ingress', t:2},
  function() { setDetailedFlows(); }
 );
}

function setDetailedFlows() {
 jsonPut(rt,'/flow/detail/json',
  {
   keys:flowkeys,
   value:'bytes',filter:'outputifindex!=discard&direction=ingress',
   n:10,
   t:2
  },
  function() { setThreshold(); }
 );
}

function setThreshold() {
 jsonPut(rt,'/threshold/detail/json',
  {
   metric:'detail',
   value: threshold,
   byFlow: true
  },
  function() { startMonitor(); }
 );
}

function initialize() {
 setTotalFlows();
}

initialize();

Some notes on the script:

The script tracks total bytes per second on each link using packet sampling since this gives a low latency measurement of link utilization consistent with the flow measurements, see Measurement delay, counters vs. packet samples. The link load values are stored in the links hashmap so they are available when deciding if and how to re-route large flows.
The flow keys used in this example are ipsource,ipdestination,udpsourceport,udpdestinationport in order to detect the flows generated by iperf. However, any keys can be used and multiple flow definitions could be monitored concurrently.
The threshold is set so that any flow that consumes 10% or more of link bandwidth is reported.
The function largeFlow() simply prints out the large flows. However, given a network topology and the link utilization data, this function could be used to implement flow steering controls.
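
As a rough illustration of the last note, the following Python sketch shows one way a steering decision could use the link loads: it picks the candidate path whose most heavily loaded link has the most spare capacity. The port names for the path through s4, the load values and the function names are assumptions made up for this example and are not part of the script above.

```python
# Hypothetical sketch: choose the least loaded path for a newly detected large flow.
# Topology, load values and names are illustrative assumptions.

LINK_CAPACITY = 1250000  # bytes/s, i.e. 10 Mbit/s as in the test bed

def path_headroom(path, loads):
    """Return the spare capacity (bytes/s) of the busiest link on a path."""
    return min(LINK_CAPACITY - loads.get(link, 0) for link in path)

def choose_path(paths, loads):
    """Return the candidate path whose bottleneck link has the most headroom."""
    return max(paths, key=lambda p: path_headroom(p, loads))

# Two candidate paths between the leaf switches, named by ingress port.
paths = [
    ("s3-eth1", "s2-eth3"),  # via spine s3 (ports seen in the output below)
    ("s4-eth1", "s2-eth4"),  # via spine s4 (assumed port names)
]
# Link loads in bytes/s, as would be kept in the links hashmap.
loads = {"s3-eth1": 900000, "s2-eth3": 900000,
         "s4-eth1": 100000, "s2-eth4": 100000}

best = choose_path(paths, loads)
print(best)
```

A real controller would also need the topology to enumerate candidate paths and would have to account for the bandwidth of the flow being moved.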

The stepped test pattern described in Large flow detection was used to evaluate the responsiveness of the detection script. The test pattern consists of 20 second constant rate traffic flows ranging from 1Mbit/s to 10Mbit/s, representing 10% to 100% of link bandwidth. The test pattern script was modified to print out the time when each flow is started so that they could be compared with times reported by the flow detection script:

# bash sweep2.bash 
1372176102617
1372176132691
1372176162765
1372176192831
1372176222896
1372176252949
1372176283003
1372176313065
1372176343138
1372176373194

The results from the flow detection script are as follows:

$ nodejs lf.js
1372176107535  2 s3-eth1 10.0.0.1,10.0.0.3,57576,5001
1372176108583  5 s2-eth3 10.0.0.1,10.0.0.3,57576,5001
1372176109488  4 s1-eth1 10.0.0.1,10.0.0.3,57576,5001
1372176134246  4 s3-eth1 10.0.0.1,10.0.0.3,48214,5001
1372176134852  4 s2-eth3 10.0.0.1,10.0.0.3,48214,5001
1372176134974  4 s1-eth1 10.0.0.1,10.0.0.3,48214,5001
1372176163792  4 s2-eth3 10.0.0.1,10.0.0.3,39956,5001
1372176164018  5 s1-eth1 10.0.0.1,10.0.0.3,39956,5001
1372176164369  4 s3-eth1 10.0.0.1,10.0.0.3,39956,5001
1372176193584  2 s3-eth1 10.0.0.1,10.0.0.3,38970,5001
1372176193631  6 s2-eth3 10.0.0.1,10.0.0.3,38970,5001
1372176193980  5 s1-eth1 10.0.0.1,10.0.0.3,38970,5001
1372176223492  4 s3-eth1 10.0.0.1,10.0.0.3,44499,5001
1372176223517  3 s2-eth3 10.0.0.1,10.0.0.3,44499,5001
1372176223595  5 s1-eth1 10.0.0.1,10.0.0.3,44499,5001
1372176253274  4 s2-eth3 10.0.0.1,10.0.0.3,39900,5001
1372176253437  3 s3-eth1 10.0.0.1,10.0.0.3,39900,5001
1372176253557  5 s1-eth1 10.0.0.1,10.0.0.3,39900,5001
1372176283485  4 s2-eth3 10.0.0.1,10.0.0.3,55620,5001
1372176283496  9 s1-eth1 10.0.0.1,10.0.0.3,55620,5001
1372176283515  2 s3-eth1 10.0.0.1,10.0.0.3,55620,5001
1372176313469  3 s3-eth1 10.0.0.1,10.0.0.3,58151,5001
1372176313509  2 s1-eth1 10.0.0.1,10.0.0.3,58151,5001
1372176313556  3 s2-eth3 10.0.0.1,10.0.0.3,58151,5001
1372176343398  6 s1-eth1 10.0.0.1,10.0.0.3,41406,5001
1372176343490  4 s2-eth3 10.0.0.1,10.0.0.3,41406,5001
1372176343589  3 s3-eth1 10.0.0.1,10.0.0.3,41406,5001
1372176373525  5 s2-eth3 10.0.0.1,10.0.0.3,44200,5001
1372176373542  5 s3-eth1 10.0.0.1,10.0.0.3,44200,5001
1372176373650  6 s1-eth1 10.0.0.1,10.0.0.3,44200,5001

Note that each flow is detected on ingress by all of the links it traverses as it crosses the network (the blue path shown in the figure below).

Detecting each flow on all the links it traverses gives the controller end to end visibility and allows it to choose globally optimal paths. In this experiment, having multiple independent agents report on the flows provides useful data on the spread of detection times for each flow.

The following table summarizes Detection Time vs. Flow Size from this experiment:

Flow Size (% of link bandwidth)	Detection Time
10%	4.918 - 6.871 seconds
20%	1.555 - 2.283 seconds
30%	1.027 - 1.604 seconds
40%	0.753 - 1.149 seconds
50%	0.596 - 0.699 seconds
60%	0.325 - 0.608 seconds
70%	0.482 - 0.512 seconds
80%	0.404 - 0.491 seconds
90%	0.260 - 0.451 seconds
100%	0.331 - 0.456 seconds

The 10% flow is relatively slow to detect because the threshold was set at 10% and so these flows are on the margin. If a lower threshold had been set, they would have been detected more quickly. For flow sizes larger than 10%, the detection times are approximately 1 to 2 seconds for flows in the range of 20% - 40% of bandwidth, and detection times for larger flows are consistently sub-second.
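
The detection times in the table can be reproduced from the two listings above by subtracting each flow's start time from the times at which it was reported. The following Python sketch does this for the first two flows only; the function name and the rule of attributing each report to the most recently started flow are assumptions made for the example.

```python
import bisect

# Flow start times (ms since epoch) printed by the test pattern script.
starts = [1372176102617, 1372176132691]

# Times (ms since epoch) at which the detection script reported those flows.
detections = [1372176107535, 1372176108583, 1372176109488,
              1372176134246, 1372176134852, 1372176134974]

def detection_ranges(starts, detections):
    """Map each flow start time to its (min, max) detection delay in seconds."""
    delays = {s: [] for s in starts}
    for t in detections:
        # Attribute the report to the most recent flow started before time t.
        i = bisect.bisect_right(starts, t) - 1
        if i >= 0:
            delays[starts[i]].append((t - starts[i]) / 1000.0)
    return {s: (min(d), max(d)) for s, d in delays.items() if d}

ranges = detection_ranges(starts, detections)
for s in starts:
    low, high = ranges[s]
    print("%d: %.3f - %.3f seconds" % (s, low, high))
```

For these two flows the sketch yields 4.918 - 6.871 seconds and 1.555 - 2.283 seconds, matching the 10% and 20% rows of the table.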

The detection times shown in the table are achievable with the following sampling rates, see Large flow detection:

Link Speed	Large Flow	Sampling Rate	Polling Interval
10 Mbit/s	>= 1 Mbit/s	1-in-10	20 seconds
100 Mbit/s	>= 10 Mbit/s	1-in-100	20 seconds
1 Gbit/s	>= 100 Mbit/s	1-in-1,000	20 seconds
10 Gbit/s	>= 1 Gbit/s	1-in-10,000	20 seconds
40 Gbit/s	>= 4 Gbit/s	1-in-40,000	20 seconds
100 Gbit/s	>= 10 Gbit/s	1-in-100,000	20 seconds

These sampling rates allow a central controller to monitor very large scale switch fabrics. In addition, multiple control functions can be applied in parallel based on the sFlow data feed, see Software defined analytics. For example, load balancing, denial of service mitigation and capture of suspicious traffic can all be implemented as SDN applications.

Monday, June 17, 2013

Large Scale Production Engineering talk

A recent Large Scale Production Engineering (LSPE) meeting included a number of talks looking at different aspects of software defined networking (SDN):

"Introduction to SDN/Openflow"
Xin Huang - Sr. Research Scientist/Technologist, Cyan, Inc.

"SDN/OpenFlow: Challenges and Opportunities"
Sriram Natarajan - Sr. Researcher, NTT Innovation Institute, Inc.

Lightning talk: "#include <abstractions.h>"
Nils Swart - Director, Plexxi

"Performance Aware SDN"
Peter Phaal - sFlow standard co-author, sFlow.org

The video from the event contains Q&A from the first talk and the remaining three talks.

The Performance Aware SDN talk starts at time marker 49:20 and describes the role of sFlow instrumentation in building performance optimizing SDN controllers that combine networking, server and application visibility and bring networking under the control of the DevOps team.

0:49:20 Introduction / Why should DevOps team care about SDN?
0:52:40 Controllability, Observability and Delay
0:56:00 Role of sFlow in making data center observable
1:02:00 SDN performance optimizing controllers using OpenFlow and sFlow
1:04:35 DDoS example
1:07:33 ECMP/LAG multi-path traffic distribution
1:10:45 Memcached hot keys
1:12:52 Next steps
1:13:50 Q&A

Slides from the talk are available on Slideshare.

Friday, June 14, 2013

Multi-tenant performance isolation

This incident report from an OpenStack based cloud data center illustrates how performance problems can propagate and affect multiple tenants within the data center. This article will examine the incident and describe how performance aware software defined networking can be used to improve performance isolation in multi-tenant environments.

The incident report describes an external distributed denial of service (DDoS) attack that was launched some time before 9:30. The effects of the attack started to be detected by the measurement system at 9:30 and it took until 10:00 to fully identify the attack and start planning a response. The plan to null route the traffic was implemented at 10:09 and the incident was fully resolved at 10:29.

The article SDN and delay discusses the components of delay in a feedback control loop and includes the above timeline. Applying the timeline to the DDoS incident identifies the following components of response delay:

Measurement delay, 30 minutes
Planning delay, 9 minutes
Configuration delay, not broken out, included in planning delay
Response delay, < 20 minutes
Loop delay, 59 minutes
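
The delay components follow directly from the times in the incident report. A minimal Python sketch of the arithmetic, assuming the four timestamps quoted above and using variable names chosen for this example:

```python
from datetime import datetime

def minutes_between(start, end):
    """Return the whole minutes between two H:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# Times taken from the incident report.
detected, identified, configured, resolved = "9:30", "10:00", "10:09", "10:29"

measurement = minutes_between(detected, identified)   # detection to identification
planning = minutes_between(identified, configured)    # includes configuration delay
response = minutes_between(configured, resolved)      # upper bound on response delay
loop = minutes_between(detected, resolved)            # total loop delay

print(measurement, planning, response, loop)
```

The loop delay is the sum of the other three components, which is why shortening the measurement delay has the largest effect in this incident.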

Threats to performance aren't just external. The following related incident report shows an internal host (likely a compromised host that was part of the initial DDoS attack) was responsible for disrupting service for other tenants within the data center.

In this case the time to resolve the problem was faster at 11 minutes (however, if this host was part of the original DDoS attack then total response time to detect and isolate this host was 2 hours 10 minutes).

While automation is an important part of OpenStack (and other cloud orchestration systems), current architectures don't include the feedback mechanisms and coordinated controls needed for effective multi-tenant performance isolation, see Network virtualization, management silos, and missed opportunities.

The key to building responsive performance optimizing controllers is a pervasive, scalable, real-time monitoring system. The sFlow instrumentation embedded within the physical and virtual switches (in this case Open vSwitch), load balancers and hypervisors enables real-time monitoring of the entire cloud data center.

The next step is to integrate the real-time analytics into the orchestration system. The article performance aware software defined networking describes the basic elements of a performance optimizing controller.

The article DDoS describes a fully automated system for DDoS mitigation with a loop delay of around 10 seconds, i.e. it is able to detect, characterize, null route and eliminate an attack within 10 seconds (over 300 times faster than the manual process). The controller is fast enough to prevent the attack from fully developing, cutting the peak traffic by a factor of 4.

Even faster responses are possible using software defined networking (SDN): the article Controlling large flows with OpenFlow describes an experimental controller that can mitigate a denial of service attack in 2 seconds.

Denial of service mitigation is just one example of multi-tenant performance isolation. There are many types of application that tenants run within their cloud deployments that stress the infrastructure. The articles Multi-tenant traffic in virtualized network environments, Pragmatic software defined networking and Resource allocation look at some of the architectural issues involved in managing cloud performance.

Tuesday, June 11, 2013

F5 BIG-IP LTM and TMOS 11.4.0

The latest TMOS 11.4.0 release for F5's BIG-IP Local Traffic Manager (LTM) includes comprehensive L2-7 support for sFlow, from packet sampling and interface counters to application response times, URLs and status codes (see Monitoring BIG-IP System Traffic with sFlow).

Load balancers are used to virtualize scale out service pools: clients connect to a virtual IP address and service port associated with the load balancer which selects a member of the server pool to handle the request. This architecture provides operational flexibility, allowing servers to be added and removed from the pool as demand changes.

The load balancer is uniquely positioned to provide information on the overall performance of the entire service pool and link the performance seen by clients with the behavior of individual servers in the pool. The advantage of using sFlow to monitor performance is the scalability it offers when request rates are high and conventional logging solutions generate too much data or impose excessive overhead. In addition, monitoring HTTP services using sFlow is part of an integrated monitoring system that spans the data center, providing real-time visibility into application, server and network performance.

Once configured, BIG-IP will stream measurements to a central sFlow Analyzer. Download, compile and install sflowtool on the system you are using to receive sFlow to see the raw data and verify that the measurements are being received.

Running sflowtool will display output of the form:

startDatagram =================
datagramSourceIP 10.0.0.153
datagramSize 564
unixSecondsUTC 1370017719
datagramVersion 5
agentSubId 3
agent 10.0.0.153
packetSequenceNo 16
sysUpTime 1557816000
samplesInPacket 2
startSample ----------------------
sampleType_tag 0:2
sampleType COUNTERSSAMPLE
sampleSequenceNo 1
sourceId 3:2
counterBlock_tag 0:2201
http_method_option_count 0
http_method_get_count 71
http_method_head_count 0
http_method_post_count 0
http_method_put_count 0
http_method_delete_count 0
http_method_trace_count 0
http_methd_connect_count 0
http_method_other_count 2
http_status_1XX_count 0
http_status_2XX_count 26
http_status_3XX_count 24
http_status_4XX_count 23
http_status_5XX_count 0
http_status_other_count 0
endSample   -------------------
startSample -------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 1
sourceId 3:2
meanSkipCount 1
samplePool 1
dropEvents 0
inputPort 352
outputPort 1073741823
flowBlock_tag 0:2102
extendedType proxy_socket4
proxy_socket4_ip_protocol 6
proxy_socket4_local_ip 10.0.0.153
proxy_socket4_remote_ip 10.0.0.150
proxy_socket4_local_port 40451
proxy_socket4_remote_port 80
flowBlock_tag 0:2100
extendedType socket4
socket4_ip_protocol 6
socket4_local_ip 10.0.0.153
socket4_remote_ip 10.0.0.70
socket4_local_port 80
socket4_remote_port 40451
flowBlock_tag 0:2206
flowSampleType http
http_method 2
http_protocol 1001
http_uri /index.html
http_host 10.10.10.250
http_referrer http://asdfasdfasdf.asdf
http_useragent curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2
http_authuser Aladdin
http_mimetype text/html; charset=UTF-8
http_request_bytes 340
http_bytes 8778
http_duration_uS 1930
http_status 200
endSample   ----------------------
endDatagram ======================

There are two types of sFlow record shown: COUNTERSSAMPLE and FLOWSAMPLE. The counters are useful for trending overall performance using tools like Ganglia and Graphite. Using sflowtool to output combined logfile format makes the data available to most logfile analyzers.
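
Because the default sflowtool output is a stream of key/value lines, it is straightforward to post-process. The following Python sketch groups the lines between startSample and endSample into one dictionary per sample and separates counter records from flow records. The function name is an assumption for this example, and the sample text is an abbreviated copy of the output above. Keys that repeat within a sample (such as flowBlock_tag) keep only their last value in this simplified version.

```python
def parse_samples(lines):
    """Group sflowtool 'key value' lines into one dict per sample."""
    samples, current = [], None
    for line in lines:
        parts = line.strip().split(None, 1)
        if not parts:
            continue
        key = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        if key == "startSample":
            current = {}
        elif key == "endSample":
            if current is not None:
                samples.append(current)
            current = None
        elif current is not None:
            # Lines outside a sample (datagram header fields) are ignored.
            current[key] = value
    return samples

text = """startSample ----------------------
sampleType COUNTERSSAMPLE
http_method_get_count 71
http_status_2XX_count 26
endSample   -------------------
startSample -------------------
sampleType FLOWSAMPLE
http_uri /index.html
http_duration_uS 1930
http_status 200
endSample   ----------------------"""

samples = parse_samples(text.splitlines())
counters = [s for s in samples if s["sampleType"] == "COUNTERSSAMPLE"]
flows = [s for s in samples if s["sampleType"] == "FLOWSAMPLE"]
print(counters[0]["http_method_get_count"], flows[0]["http_uri"])
```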

Note: The highlighted IP addresses in the FLOWSAMPLE correspond to addresses in the diagram and illustrate how request records from the proxy link clients to the back end servers.

A native sFlow analyzer like sFlowTrend can combine the counters, flows and host performance metrics to provide an integrated view of performance.

Installing sFlow agents on the backend web servers further extends visibility: implementations are available for Apache, NGINX, Tomcat and node.js. Application logic running on the servers can also be instrumented with sFlow, see Scripting languages. Back end Memcache, Java and virtualization pools can also be instrumented with sFlow. sFlow agents embedded in physical and virtual switches provide end-to-end visibility across the network.

Comprehensive visibility in multi-tiered environments allows the powerful control capabilities of the load balancers to be used to greatest effect: regulating traffic between tiers, protecting overloaded backend systems, defending against denial of service attacks, moving resources from over provisioned pools to under provisioned pools.

Saturday, June 8, 2013

Bay Area Network Virtualization talk

This talk from a recent Bay Area Network Virtualization Meetup describes how sFlow and OpenFlow can be combined to develop performance aware software defined networking (SDN) applications.

Some of the topics covered include:

0:00:00 Why monitor performance?
0:07:34 What is sFlow?
0:16:55 Software Defined Networking
0:22:40 Examples
- 0:23:35 DDoS mitigation
- 0:26:10 Load balancing large flows
- 0:30:32 Optimizing virtual networks
- 0:36:37 Packet brokers
0:39:05 Q&A
1:10:15 Hands on workshop
1:10:40 Testbed setup
1:13:30 sFlow-RT REST API Commands
1:38:30 Q&A

The slides from the talk are available on Slideshare.

Friday, June 7, 2013

Large flow detection

The familiar television test pattern is used to measure display resolution, linearity and calibration. Since fast and accurate detection of large flows is a pre-requisite for developing load balancing SDN controllers, this article will develop a large flow test pattern and use it to examine the speed and accuracy of large flow detection based on the sFlow standard.

Step Response from Wikipedia

Step or square wave signals are widely used in electrical and control engineering to monitor the responsiveness of a system. In this case we are interested in detecting large flows, defined as a flow consuming at least 10% of a link's bandwidth, see SDN and large flows.

The article, Flow collisions, describes a Mininet 2.0 test bed that realistically emulates network performance. In the test bed, link speeds are scaled down to 10Mbit/s so that they can be accurately emulated in software. Therefore, a large flow in the test bed is any flow of 1Mbit/s or greater. The following script uses iperf to generate a test pattern, consisting of 20 second constant rate traffic flows ranging from 1Mbit/s to 10Mbit/s:

iperf -c 10.0.0.3 -t 20 -u -b 1M
sleep 10
iperf -c 10.0.0.3 -t 20 -u -b 2M
sleep 10
iperf -c 10.0.0.3 -t 20 -u -b 3M
sleep 10
iperf -c 10.0.0.3 -t 20 -u -b 4M
sleep 10
iperf -c 10.0.0.3 -t 20 -u -b 5M
sleep 10
iperf -c 10.0.0.3 -t 20 -u -b 6M
sleep 10
iperf -c 10.0.0.3 -t 20 -u -b 7M
sleep 10
iperf -c 10.0.0.3 -t 20 -u -b 8M
sleep 10
iperf -c 10.0.0.3 -t 20 -u -b 9M
sleep 10
iperf -c 10.0.0.3 -t 20 -u -b 10M
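
The same stepped pattern can be generated in a loop. The following Python sketch builds the iperf command lines and the pauses between them, and includes a runner that prints the start time of each flow in milliseconds since the epoch so that it can be compared with detection times. The function names and default arguments are assumptions for this example. The runner is defined but not invoked here, since it requires iperf and a reachable server.

```python
import subprocess
import time

def test_pattern(target="10.0.0.3", duration=20, gap=10, steps=10):
    """Yield (iperf command, pause in seconds) pairs for the stepped pattern."""
    for mbps in range(1, steps + 1):
        cmd = ["iperf", "-c", target, "-t", str(duration),
               "-u", "-b", "%dM" % mbps]
        # No pause is needed after the final flow.
        pause = gap if mbps < steps else 0
        yield cmd, pause

def run(pattern):
    """Run each flow in turn, printing its start time (ms since epoch)."""
    for cmd, pause in pattern:
        print(int(time.time() * 1000))
        subprocess.call(cmd)
        time.sleep(pause)

commands = list(test_pattern())
print(" ".join(commands[0][0]))
```

Calling run(test_pattern()) on the sending host would execute the full pattern, starting a new flow every 30 seconds.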

The following command configures sFlow on the virtual switch with a 1-in-10 sampling probability and a 1 second counter export interval:

ovs-vsctl -- --id=@sflow create sflow agent=eth0 target=127.0.0.1 \
sampling=10 polling=1 -- \
-- set bridge s1 sflow=@sflow \
-- set bridge s2 sflow=@sflow \
-- set bridge s3 sflow=@sflow \
-- set bridge s4 sflow=@sflow

The following sFlow-RT chart shows a second by second view of the test pattern flows constructed from the real-time sFlow data exported by the virtual switch:

The chart clearly shows the test pattern, a sequence of 10 flows starting at 1Mbit/s. Each large flow is detected within a second or two: the minimum size large flow (1Mbit/s) takes the longest to determine as a large flow (i.e. cross the 1Mbit/s line) and larger flows take progressively less time to classify (the largest flow is determined to be large in under a second). The chart displays not just the volume of each flow, but also identifies the source and destination MAC addresses, IP addresses, and UDP ports - the detailed information needed to configure control actions to steer the large flows, see Load balancing LAG/ECMP groups and ECMP load balancing.

The results can be further validated using output from iperf. The iperf tool consists of a traffic source (client) and a target (server). The following reports from the server confirm the flow volumes, IP addresses and port numbers:

iperf -su
------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  3] local 10.0.0.3 port 5001 connected with 10.0.0.1 port 37645
[ ID] Interval       Transfer     Bandwidth        Jitter   Lost/Total Datagrams
[  3]  0.0-20.0 sec  2.39 MBytes  1.00 Mbits/sec   0.033 ms    0/ 1702 (0%)
[  4] local 10.0.0.3 port 5001 connected with 10.0.0.1 port 43101
[  4]  0.0-20.0 sec  4.77 MBytes  2.00 Mbits/sec   0.047 ms    0/ 3402 (0%)
[  4]  0.0-20.0 sec  1 datagrams received out-of-order
[  3] local 10.0.0.3 port 5001 connected with 10.0.0.1 port 49970
[  3]  0.0-20.0 sec  7.15 MBytes  3.00 Mbits/sec   0.023 ms    0/ 5102 (0%)
[  3]  0.0-20.0 sec  1 datagrams received out-of-order
[  4] local 10.0.0.3 port 5001 connected with 10.0.0.1 port 46495
[  4]  0.0-20.0 sec  9.54 MBytes  4.00 Mbits/sec   0.033 ms    0/ 6804 (0%)
[  3] local 10.0.0.3 port 5001 connected with 10.0.0.1 port 34667
[  3]  0.0-20.0 sec  11.9 MBytes  5.00 Mbits/sec   0.050 ms    1/ 8504 (0.012%)
[  3]  0.0-20.0 sec  1 datagrams received out-of-order
[  4] local 10.0.0.3 port 5001 connected with 10.0.0.1 port 47284
[  4]  0.0-20.0 sec  14.3 MBytes  6.00 Mbits/sec   0.050 ms    0/10205 (0%)
[  3] local 10.0.0.3 port 5001 connected with 10.0.0.1 port 55425
[  3]  0.0-20.0 sec  16.7 MBytes  7.00 Mbits/sec   0.028 ms    1/11905 (0.0084%)
[  3]  0.0-20.0 sec  1 datagrams received out-of-order
[  4] local 10.0.0.3 port 5001 connected with 10.0.0.1 port 59881
[  4]  0.0-20.0 sec  19.1 MBytes  8.00 Mbits/sec   0.029 ms    2/13605 (0.015%)
[  4]  0.0-20.0 sec  2 datagrams received out-of-order
[  3] local 10.0.0.3 port 5001 connected with 10.0.0.1 port 44822
[  3]  0.0-20.0 sec  21.5 MBytes  9.00 Mbits/sec   0.037 ms    0/15314 (0%)
[  3]  0.0-20.0 sec  1 datagrams received out-of-order
[  4] local 10.0.0.3 port 5001 connected with 10.0.0.1 port 48150
[  4]  0.0-20.1 sec  23.3 MBytes  9.73 Mbits/sec   0.415 ms    1/16631 (0.006%)
[  4]  0.0-20.1 sec  2 datagrams received out-of-order

A further validation of the results is possible using interface counters exported by sFlow (which were configured to export at 1 second intervals):

The chart shows that the flow measurements (based on packet samples) correspond closely to the measurements based on the periodic interface counter exports (which report the 100% accurate interface counters maintained by the switch ports).

Note: Normally one would not use 1 second counter export with sFlow, the default interval is 30 seconds and values in the range 15 - 30 seconds typically satisfy most requirements, see Measurement delay, counters vs. packet samples.

Link Speed	Large Flow	Sampling Rate	Polling Interval
10 Mbit/s	>= 1 Mbit/s	1-in-10	20 seconds
100 Mbit/s	>= 10 Mbit/s	1-in-100	20 seconds
1 Gbit/s	>= 100 Mbit/s	1-in-1,000	20 seconds
10 Gbit/s	>= 1 Gbit/s	1-in-10,000	20 seconds
40 Gbit/s	>= 4 Gbit/s	1-in-40,000	20 seconds
100 Gbit/s	>= 10 Gbit/s	1-in-100,000	20 seconds

The results scale to higher link speeds using the settings from the table above. Configuring sampling rates from this table ensures that large flows (defined as flows consuming at least 10% of link bandwidth) are quickly detected and tracked.
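
The rows of the table follow a simple linear rule: the large flow threshold is 10% of the link speed, and the sampling rate is 1-in-N where N is the link speed expressed in Mbit/s. A short Python sketch of that rule, with a function name chosen for this example:

```python
def sflow_settings(link_speed_bps):
    """Return (large flow threshold in bit/s, N for a 1-in-N sampling rate)."""
    large_flow = link_speed_bps // 10       # 10% of link bandwidth
    sampling = link_speed_bps // 1000000    # 1-in-10 at 10 Mbit/s, scaled linearly
    return large_flow, sampling

# Link speeds from the table, in bit/s.
for speed in (10**7, 10**8, 10**9, 10**10, 4 * 10**10, 10**11):
    threshold, n = sflow_settings(speed)
    print("%d bit/s link: large flow >= %d bit/s, sampling 1-in-%d"
          % (speed, threshold, n))
```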

Note: Readers may be wondering if other approaches to large flow detection such as OpenFlow metering, NetFlow, or IPFIX might be suitable for SDN control. These technologies operate by maintaining a flow table within the switch which can be polled, periodically exported, or exported when the flow ends. In all cases the measurements are delayed, limiting the value of the measurements for SDN control applications like load balancing, see Rapidly detecting large flows, sFlow vs. NetFlow/IPFIX. The large flow test pattern described in this article can be used to test the fidelity of large flow detection systems and compare their performance.

Looking at the table, the definition of a large flow on a 100 Gbit/s link is any flow greater than or equal to 10Gbit/s. This may seem like a large number. However, as link speeds increase, applications are being developed to fully utilize their capacity.

From Monitoring at 100 Gigabits/s

The chart from Monitoring at 100 Gigabits/s shows a link carrying three large flows, each around 40 Gigabits/s. In addition, a flow doesn't have to correspond to an individual UDP/TCP connection. An Internet exchange might define flows as traffic between pairs of MAC addresses, or an Internet Service Provider (ISP) might define flows based on destination BGP AS numbers. The software defined analytics architecture supported by sFlow allows flows to be flexibly defined to suit each environment.

Tailoring flow definitions to minimize the number of flows that need to be managed reduces complexity and churn in the controller, and makes most efficient use of the hardware flow steering capabilities of network switches which currently support a limited number of general match forwarding rules (see OpenFlow Switching Performance: Not All TCAM Is Created Equal). Using our definition of large flows (>=10% of link bandwidth), a 48 port switch would require a maximum of 480 general match rules in order to steer all large flows, which is well within the capabilities of current hardware, while leaving small flows to the normal forwarding logic in the switch, see Pragmatic software defined networking.
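
The 480 rule figure is a worst case bound: a link can carry at most ten concurrent flows that each consume at least 10% of its bandwidth, so a 48 port switch needs at most 48 x 10 rules. A small Python sketch of the calculation, with a function name chosen for this example:

```python
def max_large_flow_rules(ports, large_flow_percent=10):
    """Upper bound on match rules needed to steer every large flow on a switch."""
    flows_per_port = 100 // large_flow_percent  # at most this many large flows fit on a link
    return ports * flows_per_port

rules = max_large_flow_rules(48)
print(rules)
```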

Monday, June 3, 2013

Flow collisions

Controlling large flows with OpenFlow describes how to build a performance aware software defined networking testbed using Mininet. This article describes how to use Mininet to create a minimal multi-path topology to experiment with load balancing of large flows.

Mininet 2.0 (a.k.a. Mininet HiFi) includes support for link bandwidth limits, making it an attractive platform to explore network performance. Mininet 2.0 has been used to reproduce results from a number of significant research papers, see Reproducible Network Experiments Using Container-Based Emulation.

The diagram shows the basic leaf and spine topology that we will be constructing. The two leaf switches (s1 and s2) are connected by two spine switches (s3 and s4). The two paths connecting each top of rack switch are shown in red and blue in the diagram.

Note: Scaling of network characteristics is important when building performance models in Mininet. While production networks may have 10Gbit/s links, it is not realistic to expect the software emulation to faithfully model high speed links. Scaling the speeds down makes it possible to emulate the links in software, while still preserving the basic characteristics of the network. In this case, we will scale down by a factor of 1000, using 10Mbit/s links instead of 10Gbit/s links. The settings for the monitoring system need to be similarly scaled. The article SDN and large flows recommends a 1-in-10,000 sampling rate to detect large flows on 10Gbit/s links, so a sampling rate of 1-in-10 will give the same response time on the 10Mbit/s links in the emulation.

The following Python script builds the network and enables sFlow on the switches:

#!/usr/bin/env python

from mininet.net  import Mininet
from mininet.node import RemoteController
from mininet.link import TCLink
from mininet.cli  import CLI
from mininet.util import quietRun

c = RemoteController('c',ip='127.0.0.1')
net = Mininet(link=TCLink)

# Add hosts and switches
leftHost1  = net.addHost('h1',ip='10.0.0.1',mac='00:04:00:00:00:01')
leftHost2  = net.addHost('h2',ip='10.0.0.2',mac='00:04:00:00:00:02')
rightHost1 = net.addHost('h3',ip='10.0.0.3',mac='00:04:00:00:00:03')
rightHost2 = net.addHost('h4',ip='10.0.0.4',mac='00:04:00:00:00:04')

leftSwitch     = net.addSwitch('s1')
rightSwitch    = net.addSwitch('s2')
leftTopSwitch  = net.addSwitch('s3')
rightTopSwitch = net.addSwitch('s4')

# Add links
# set link speeds to 10Mbit/s
linkopts = dict(bw=10)
net.addLink(leftHost1,  leftSwitch,    **linkopts )
net.addLink(leftHost2,  leftSwitch,    **linkopts )
net.addLink(rightHost1, rightSwitch,   **linkopts )
net.addLink(rightHost2, rightSwitch,   **linkopts )
net.addLink(leftSwitch, leftTopSwitch, **linkopts )
net.addLink(leftSwitch, rightTopSwitch,**linkopts )
net.addLink(rightSwitch,leftTopSwitch, **linkopts )
net.addLink(rightSwitch,rightTopSwitch,**linkopts )

# Start
net.controllers = [ c ]
net.build()
net.start()

# Enable sFlow
quietRun('ovs-vsctl -- --id=@sflow create sflow agent=eth0 target=127.0.0.1 sampling=10 polling=20 -- set bridge s1 sflow=@sflow -- set bridge s2 sflow=@sflow -- set bridge s3 sflow=@sflow -- set bridge s4 sflow=@sflow')

# CLI
CLI( net )

# Clean up
net.stop()

First start Floodlight and sFlow-RT and then run the script to build the network. The topology can then be viewed by accessing the Floodlight user interface (http://xx.xx.xx.xx:8080/ui/index.html) and clicking on the Topology tab.

Next, open xterm windows on the four hosts: h1, h2, h3 and h4. The following commands, typed into each window, generate a sequence of iperf tests from h1 to h2 and a separate set of tests between h3 and h4.

h2:

iperf -s

h4:

iperf -s

h1:

while true; do iperf -c 10.0.0.2 -i 60 -t 60; sleep 20; done

h3:

while true; do iperf -c 10.0.0.4 -i 60 -t 60; sleep 30; done

Run the following commands to install flow metrics in sFlow-RT to track traffic from each host:

curl -H "Content-Type:application/json" -X PUT --data "{value:'bytes',filter:'ipsource=10.0.0.1'}" http://localhost:8008/flow/h1/json
curl -H "Content-Type:application/json" -X PUT --data "{value:'bytes',filter:'ipsource=10.0.0.2'}" http://localhost:8008/flow/h2/json
curl -H "Content-Type:application/json" -X PUT --data "{value:'bytes',filter:'ipsource=10.0.0.3'}" http://localhost:8008/flow/h3/json
curl -H "Content-Type:application/json" -X PUT --data "{value:'bytes',filter:'ipsource=10.0.0.4'}" http://localhost:8008/flow/h4/json
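The four commands differ only in the host name and address, so they can also be generated in a loop. The following Python 3 sketch builds equivalent PUT requests with the standard library. It assumes sFlow-RT accepts strictly formatted JSON as well as the relaxed form used in the curl commands, and it only constructs the requests; the line that would send them is left commented out.

```python
import json
import urllib.request

SFLOW_RT = 'http://localhost:8008'   # same address as the curl commands

def flow_request(name, ip):
    """Build a PUT request defining a bytes flow filtered on source address."""
    body = json.dumps({'value': 'bytes', 'filter': 'ipsource=' + ip})
    return urllib.request.Request(
        SFLOW_RT + '/flow/' + name + '/json',
        data=body.encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method='PUT')

hosts = {'h1': '10.0.0.1', 'h2': '10.0.0.2', 'h3': '10.0.0.3', 'h4': '10.0.0.4'}
requests = [flow_request(name, ip) for name, ip in sorted(hosts.items())]

for req in requests:
    print(req.get_method(), req.full_url, req.data.decode('utf-8'))
    # urllib.request.urlopen(req)   # uncomment to send to a running sFlow-RT
```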

The following sFlow-RT page shows the traffic for hosts h1 and h3:

The chart shows that each flow reaches a peak of around 1.2Mbytes/s (10Mbits/s), demonstrating that Mininet is emulating the 10Mbit/s links in the configuration. The chart also shows that there is no interaction between the flows, which is expected since shortest path flows between h1 and h2 are restricted to s1 and flows between h3 and h4 are restricted to s2.

Note: If the emulator can't keep up with the number or speed of links, you might see spurious interactions between the traffic streams. It is a good idea to run some tests like the one above to verify the emulation before moving on to more complex scenarios.

The next experiment generates flows across the spine switches.

h3:

iperf -s

h4:

iperf -s

h1:

while true; do iperf -c 10.0.0.3 -i 60 -t 60; sleep 20; done

h2:

while true; do iperf -c 10.0.0.4 -i 60 -t 60; sleep 30; done

The following sFlow-RT chart shows the traffic for hosts h1 and h2.

This chart clearly displays the effect of flow collisions on performance. When flows collide, each flow only achieves half the throughput.
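The two client loops have different cycle lengths (60 seconds of traffic followed by a 20 or 30 second pause), so the transfers drift in and out of overlap. The Python sketch below is an idealized model of that schedule. It assumes both flows share a single 10Mbit/s path, split it equally when they overlap, and ignores iperf startup time, so it illustrates the pattern rather than reproducing the measured chart.

```python
ON_SECONDS = 60     # iperf -t 60
LINK_MBPS = 10      # capacity of the shared path in the emulation

def active(t, sleep_seconds):
    """True if a loop of 60s transfers separated by sleep_seconds is sending at time t."""
    return t % (ON_SECONDS + sleep_seconds) < ON_SECONDS

def throughput(t):
    """Idealized (h1, h2) rates in Mbit/s at second t."""
    h1, h2 = active(t, 20), active(t, 30)
    # Assumption: colliding flows split the link equally.
    share = LINK_MBPS / 2 if (h1 and h2) else LINK_MBPS
    return (share if h1 else 0, share if h2 else 0)

cycle = 720  # least common multiple of the 80s and 90s loop periods
collision_seconds = sum(1 for t in range(cycle) if throughput(t) == (5, 5))
print(collision_seconds, 'of', cycle, 'seconds spent in collision')
```

In this model the flows collide for half of each 720 second cycle.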

While the topology has full cross sectional bandwidth between all pairs of hosts, Floodlight's shortest path forwarding algorithm places all flows between a pair of switches on the same path, resulting in collisions. However, even if ECMP routing were used, the chance of hash collisions between simultaneous flows in this two path network would be 50%.
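The 50% figure follows from two flows each being assigned to one of two paths. The sketch below generalizes the calculation, assuming each flow is hashed independently and uniformly across the equal cost paths (real hash functions and traffic mixes will only approximate this). It requires Python 3.8 or later for math.perm.

```python
from math import perm

def ecmp_collision_probability(flows, paths):
    """Probability that at least two flows hash to the same path."""
    if flows > paths:
        return 1.0   # more flows than paths guarantees a collision
    # perm(paths, flows) counts the assignments in which every flow gets its own path.
    return 1 - perm(paths, flows) / paths ** flows

print(ecmp_collision_probability(2, 2))   # 0.5, the two path network in this article
print(ecmp_collision_probability(2, 4))   # 0.25 with four equal cost paths
```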

Note: The paper Hedera: Dynamic Flow Scheduling for Data Center Networks describes the effect of collisions on large scale ECMP fabrics. The results of the paper have been reproduced using Mininet, see Reproducible Network Experiments Using Container-Based Emulation.

The IETF Operations and Management Area Working Group (OPSAWG) recently adopted Mechanisms for Optimal LAG/ECMP Component Link Utilization in Networks as a working group draft. The draft mentions sFlow as a method for detecting large flows and the charts in this article demonstrate that sFlow's low latency traffic measurements provide clear signals that can be used to quickly detect collisions, allowing an SDN controller to re-route the colliding flows.
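As a rough illustration of how a controller might act on these signals, the following Python sketch flags links that carry two or more large flows. The data structures, flow names and rates are hypothetical and do not come from sFlow-RT, Floodlight or the draft. The 10% threshold is the large flow definition used elsewhere on this blog.

```python
from collections import defaultdict

LINK_BPS = 10_000_000        # 10Mbit/s links, as in the emulation
LARGE_FLOW_FRACTION = 0.10   # large flow: >=10% of link bandwidth

def find_collisions(flows, link_bps=LINK_BPS, fraction=LARGE_FLOW_FRACTION):
    """Return {link: [flow names]} for links carrying two or more large flows."""
    by_link = defaultdict(list)
    for name, (rate_bps, path) in sorted(flows.items()):
        if rate_bps >= fraction * link_bps:
            for link in path:
                by_link[link].append(name)
    return {link: names for link, names in by_link.items() if len(names) > 1}

# Hypothetical measurements: both flows were placed on the path through s3.
flows = {
    'h1->h3': (5_000_000, [('s1', 's3'), ('s3', 's2')]),
    'h2->h4': (5_000_000, [('s1', 's3'), ('s3', 's2')]),
}
collisions = find_collisions(flows)
print(collisions)
```

A load balancer could respond by moving one of the listed flows to the unused path through s4.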

Note: The draft mentions NetFlow as a possible measurement technology. However, the article Rapidly detecting large flows, sFlow vs. NetFlow/IPFIX demonstrates that flow monitoring delays measurements, limiting its value for SDN control applications. It should also be noted that OpenFlow's metering mechanism shares the same architectural limitations as NetFlow/IPFIX. In addition, using polling mechanisms to retrieve metrics is slower and less scalable as a method for detecting large flows, see Measurement delay, counters vs. packet samples.

There are a number of additional factors that make sFlow attractive as the measurement component in a load balancing solution. The sFlow standard is widely supported by switch vendors and the measurements scale to 100Gbit/s and beyond (detecting large flows on a 100Gbit/s link with the same responsiveness shown in this testbed requires a sampling rate of only 1-in-100,000). In addition, sFlow monitoring scales to large numbers of devices (a single instance of sFlow-RT can easily monitor tens of thousands of switch ports), providing the measurements needed to load balance large ECMP fabrics. Finally, every sFlow capable device provides the full set of flow identification attributes described in the draft, including: source MAC address, destination MAC address, VLAN ID, IP Protocol, IP source address, IP destination address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP destination port and tunnels (GRE, VxLAN, NVGRE).

Load balancing large flows is only one of a number of promising applications for performance aware software defined networking. Other applications include traffic engineering, denial of service (DoS) mitigation and packet brokers.