Tuesday, December 24, 2013

Workload placement

Messy and organized closets are an every day example of the efficiency that can be gained by packing items together in a systematic way: randomly throwing items in a closet makes poor use of the space; keeping the closet organized increases the available space.

A recent CBS 60 Minutes Amazon segment describes the ultimate closet - an Amazon order fulfillment warehouse. Each vast warehouse looks like a chaotic jumble - something out of Raiders of the Lost Ark.
Even when you get up close to an individual shelf, there still appears to be no organizing principle. Interviewer Charlie Rose comments, "The products are then placed by stackers in what seems to outsiders as a haphazard way… a book on Buddhism and Zen resting next to Mrs. Potato Head…"
Amazon's Dave Clark explains, "Can those two things, you look at how these items fit in the bin. They’re optimized for utilizing the available space. And we have computers and algorithmic work that tells people the areas of the building that have the most space to put product in that’s coming in at that time. Amazon has become so efficient with its stacking, it can now store twice as many goods in its centers as it did five years ago."

The 60 Minutes piece goes on to discuss Amazon Web Services (AWS). There are interesting parallels between managing a cloud data center and managing a warehouse (both of which Amazon does extremely well). There is a fixed amount of physical compute, storage and bandwidth resources in the data center, but instead of having to find shelf space to store physical goods, the data center manager needs to find a server with enough spare capacity to run each new virtual machine.

Just as a physical object has a size, shape and weight that constrain where it can be placed, virtual machines have characteristics such as number of virtual CPUs, memory, storage and network bandwidth that determine how many virtual machines can be placed on each physical server (see Amazon EC2 Instances). For example, an Amazon m1.small instance provides 1 virtual CPU, 1.7 GiB RAM, and 160 GB storage. A simplistic packing scheme would allow 6 small instances to be hosted on a physical server with 8 CPU cores, 32 GiB RAM, and 1 TB disk. This allocation scheme is limited by the amount of disk space and leaves CPU cores and RAM unused.

While the analogy between a data center and a warehouse is interesting, there are distinct differences between computational workloads and physical goods that are important to consider. One of the motivating factors driving the move to virtualization was the realization that most physical servers were poorly utilized. Moving to virtual machines allowed multiple workloads to be combined and run on a single physical server, increasing utilization and reducing costs. Continuing the EC2 example, if measurement revealed that the m1.small instances where only using 80GB of storage, additional instances could be placed on the server by over subscribing the storage.
The Wired article, Return of the Borg: How Twitter Rebuilt Google’s Secret Weapon, describes Google's internally developed workload packing software and the strategic value it has for Google's business.
Amazon has been able to double the capacity of their physical warehouses by using bar code tracking and computer orchestration algorithms. Assuming analytics driven workload placement in data centers can drive a similar increase workload density, what impact would that have for a cloud hosting provider?

Suppose a data center is operating with a gross margin of 20%. Leveraging the sFlow standard for measurement doesn't add to costs since the capability is embedded in most vendor's data center switches, and open source sFlow agents can easily be deployed on hypervisors using orchestration tools. Real-time analytics software is required to turn the raw measurements into actionable data, however, the cost of this software is a negligible part of the overall cost of running the data center. On the other hand, doubling the number of virtual machines that can be hosted in the data center (and assuming that there is sufficient demand to fill this additional capacity) doubles the top line revenue and triples the gross margin to 60%.

One can argue about the assumptions in the example, but playing around with different assumptions and models, it is clear that workload placement has great potential for increasing the efficiency and profitability of cloud data centers. Where the puck is going: analytics describes the vital role for analytics in SDN orchestration stacks, including: VMware (NSX), Cisco, Open Daylight, etc. The article predicts that there will be increase merger and acquisition activity in 2014 as orchestration vendors compete by integrating analytics into their platforms.

Finally, while analytics offers attractive opportunities, a lack of visibility and poorly placed workloads carries significant risks. In SDN market predictions for New Year: NFV, OpenFlow, Open vSwitch boom, Eric Hanselman of 451 Research poses the question, "Will data center overlays hit a wall in 2014?"  He then goes on to state, "There is a point at which the overlay is going to be constrained by the mechanics of the network underneath... Data center operators will want the ability to do dynamic configuration and traffic management on the physical network and tie that management and control into application-layer orchestration."

Saturday, December 14, 2013

Blacklists

Blacklists are an important way in which the Internet community protects itself by identifying bad actors. However, before using a blacklist, it is important to understand how it is compiled and maintained in order to properly use the list and interpret the significance of a match.

Incorporating blacklists in traffic monitoring can be a useful way to find hosts on a network that have been compromised. If a host interacts with addresses known to be part of a botnet for example, then it raises the concern that the host has been compromised and is itself a member of the botnet.

This article provides an example that demonstrates how the standard sFlow instrumentation build into most vendors switches can be used match traffic against a large blacklist. Black lists can be very large, the list used in this example contains approximately 16,000 domain names and nearly 300,000 CIDRs. Most switches don't have the resources to match traffic against such large lists. However, the article RESTflow describes how sFlow shifts analysis from the switches to external software which can easily handle to task of matching traffic against large lists. This article uses sFlow-RT to perform the black list matching.
Figure 1: Components of sFlow-RT
The following sFlow-RT script (phish.js) makes use of the PhishTank blacklist to identify hosts that may have been compromised by phishing attacks:
include('extras/json2.js');

var server = '10.0.0.1';
var port = 514;
var facility = 16; // local0
var severity = 5;  // notice

var domains = {};
function updatePhish() {
  var phish = JSON.parse(http("http://data.phishtank.com/data/online-valid.json"));
  domains = {};
  var dlist = [];
  var groups = {};
  for(var i = 0; i < phish.length; i++) {
    var entry = phish[i];
    var target = entry.target;
    var id = entry.phish_id;
    var url = entry.url;
    var dnsqname = url.match(/:\/\/(.[^/]+)/)[1] + '.';
    if(!domains[dnsqname]) {
      domains[dnsqname] = id;
      dlist.push(dnsqname);
    }
    var details = entry.details;
    var cidrlist = [];
    for(var j = 0; j < details.length; j++) {
      var ip = details[j].ip_address;
      var cidr = details[j].cidr_block;
      if(cidr) cidrlist.push(cidr);
    }
    if(cidrlist.length > 0) groups["phish." + id] = cidrlist;
  }

  // add in local groups
  groups.other = ['0.0.0.0/0','::/0'];
  groups.private = ['10.0.0.0/8','172.16.0.0/12','192.168.0.0/16','FC00::/7'];
  groups.multicast = ['224.0.0.0/4'];
  setGroups(groups);

  setFlow('phishydns',
    {
      keys:'ipsource,ipdestination,dnsqname,dnsqr',
      value:'frames',
      filter:'dnsqname="'+ dlist + '"',
      log:true,
      flowStart:true
    }
  );
}

setFlowHandler(function(rec) {
  var keys = rec.flowKeys.split(',');
  var msg = {type:'phishing'};
  switch(rec.name) {
  case 'phishysrc':
     msg.victim=keys[0];
     msg.match='cidr';
     msg.phish_id = keys[1].split('.')[1];
     break;
  case 'phishydst':
     msg.victim=keys[0];
     msg.match='cidr';
     msg.phish_id = keys[1].split('.')[1];
     break;
  case 'phishydns':
     var id = domains[keys[2]];
     msg.victim = keys[3] == 'false' ? keys[0] : keys[1];
     msg.match = 'dns';
     msg.phish_id = domains[keys[2]];
     break;
  }
  syslog(server,port,facility,severity,msg);
},['phishysrc','phishydst','phishydns']);


updatePhish();

// update threat database every 24 hours
setIntervalHandler(function() {
  try { updatePhish(); } catch(e) {}
},60*60*24);

setFlow('phishysrc',
  {
    keys:'ipsource,destinationgroup',
    value:'frames',
    filter:'destinationgroup~^phish.*',
    log:true,
    flowStart:true
  }
);

setFlow('phishydest',
  {
    keys:'ipdestination,sourcegroup',
    value:'frames',
    filter:'sourcegroup~^phish.*',
    log:true,
    flowStart:true
  }
);
The following command line arguments should be added to sFlow-RT's start.sh in order to load the script on startup and allocate enough memory to allow the blacklists to be loaded:
-Xmx2000m -Dscript.file=phish.js
A few notes about the script:
  1. The script uses sFlow-RT's setGroups() function to efficiently classify and group IP addresses based on CIDR lists.
  2. The large number of DNS names used in the DNS filter is efficiently compiled and does not impact performance.
  3. The script makes an HTTP call to retrieve updated signatures every 24 hours. If more frequent updates are required then a developer key should be obtained, see Developer Information.
  4. Matches are exported using syslog(), see Exporting events using syslog. The script could easily be modified to post events into other systems, or take control actions, by using the http() function to interact with RESTful APIs.
Network virtualization poses interesting monitoring challenges since compromised hosts may be virtual machines and their traffic may be carried over tunnels (VxLAN, GRE, NVGRE etc.) across the physical network. Fortunately, sFlow monitoring intrinsically provides good visibility into tunnels (see Tunnels) and the sFlow-RT script could easily be modified to examine flows within the tunnels (see Down the rabbit hole) and report inner IP addresses and virtual network identifiers (VNI) for compromised hosts. In addition most virtual switches also support sFlow monitoring, providing direct visibility into inter virtual machine traffic.

Blacklist matching is only one use case for sFlow monitoring - many others have been described on this blog. The ability to pervasively monitor high speed networks at scale and deliver continuous real-time visibility is transformative, allowing many otherwise difficult or impossible tasks to be accomplished with relative ease.

Saturday, December 7, 2013

ovs-ofctl

The ovs-ofctl command line tool that ships with Open vSwitch provides a very convenient way to interact with OpenFlow forwarding rules, not just with Open vSwitch, but with any switch that can be configured to accept passive connections from an OpenFlow controller.

This article looks takes the example in Integrated hybrid OpenFlow and repeats it without an OpenFlow controller, using ovs-ofctl instead.

First start Mininet without a controller and configure the switch to listen for OpenFlow commands:
sudo mn --topo single,3 --controller none --listenport 6633
Next use enable normal forwarding in the switch:
ovs-ofctl add-flow tcp:127.0.0.1 priority=10,action=normal
The following command blocks traffic from host 1 (10.0.0.1):
ovs-ofctl add-flow tcp:127.0.0.1 priority=11,dl_type=0x0800,nw_src=10.0.0.1,action=drop
The following command removes the block:
ovs-ofctl --strict del-flows tcp:127.0.0.1 priority=11,dl_type=0x0800,nw_src=10.0.0.1
Finally, modify the controller script with the following block() and allow() functions:
function addFlow(spec) {
  runCmd(['ovs-ofctl','add-flow','tcp:127.0.0.1',spec.join(',')]);
}

function removeFlow(spec) {
  runCmd(['ovs-ofctl','--strict','del-flows','tcp:127.0.0.1',spec.join(',')]);
}

function block(address) {
  if(!controls[address]) {
     addFlow(['priority=11','dl_type=0x0800','nw_src=' + address,'action=drop']);
     controls[address] = { action:'block', time: (new Date()).getTime() };
  }
}

function allow(address) {
  if(controls[address]) {
     removeFlow(['priority=11','dl_type=0x0800','nw_src=' + address]);
     delete controls[address];
  }
}
Moving from Mininet to a production setting is simply a matter of modifying the script to connect to the remote switch, configuring the switch to listen for OpenFlow commands, and configuring the switch to send sFlow data to sFlow-RT.

DDoS mitigation is only one use case for large flow control, others described on this blog include: ECMP / LAG load balancing, traffic marking and packet capture. This script can be modified to address these different use cases. The Mininet test bed provides a useful way to test hybrid OpenFlow control schemes before moving them into production using physical switches that support integrated hybrid OpenFlow and sFlow.

Tuesday, December 3, 2013

Integrated hybrid OpenFlow

Figure 1: Hybrid Programmable Forwarding Planes
Figure 1 shows two models for hybrid OpenFlow deployment, allowing OpenFlow to be used in conjunction with existing routing protocols. The Ships-in-the-Night model divides the switch into two, allocating selected ports to external OpenFlow control and the remaining ports are left to the internal control plane. It is not clear how useful this model is, other than for experimentation.

The Integrated hybrid model is much more interesting since it can be used to combine the best attributes of OpenFlow and existing distributed routing protocols to deliver robust solutions. The OpenFlow 1.3.1 specification includes supports for the integrated hybrid model by defining the NORMAL action:
Optional: NORMAL: Represents the traditional non-OpenFlow pipeline of the switch (see 5.1). Can be used only as an output port and processes the packet using the normal pipeline. If the switch cannot forward packets from the OpenFlow pipeline to the normal pipeline, it must indicate that it does not support this action.
Hybrid solutions leverage the full capabilities of vendor and merchant silicon which efficiently support distributed forwarding protocols. In addition, most switch and merchant silicon vendors embed support for the sFlow standard, allowing the fabric controller to rapidly detect large flows and apply OpenFlow forwarding rules to control these flows.

Existing switching silicon is often criticized for the limited size of the hardware forwarding tables, supporting too few general match OpenFlow forwarding rules to be useful in production settings. However, consider that SDN and large flows defines a large flow as a flow that consumes 10% of a link's bandwidth. Using this definition, a 48 port switch would require a maximum of 480 general match rules in order to steer all large flows, well within the capabilities of current hardware (see OpenFlow Switching Performance: Not All TCAM Is Created Equal).

This article will use the Mininet testbed described in Controlling large flows with OpenFlow to experiment with using integrated hybrid forwarding to selectively control large flows, leaving the remaining flows to the switch's NORMAL forwarding pipeline.
Figure 2: MiniNet as an SDN test platform
The following command uses Mininet to emulate a simple topology with one switch and three hosts:
$ sudo mn --topo single,3 --controller=remote,ip=127.0.0.1
The next command enables sFlow on the switch:
sudo ovs-vsctl -- --id=@sflow create sflow agent=eth0  target=\"127.0.0.1:6343\" sampling=10 polling=20 -- -- set bridge s1 sflow=@sflow
Floodlight's Static Flow Pusher API will be used to insert OpenFlow rules in the switch. The default Floodlight configuration implements packet forwarding, disabling the forwarding module requires configuration changes:
  1. Copy the default properties file target/bin/floodlightdefault.properties to static.properties
  2. Edit the file to remove the line net.floodlightcontroller.forwarding.Forwarding,\
  3. Copy the floodlight.sh script to floodlight_static.sh
  4. Modify the last line of the script to invoke the properties, java ${JVM_OPTS} -Dlogback.configurationFile=${FL_LOGBACK} -jar ${FL_JAR} -cf static.properties
Update 22 December, 2013 Thanks to Jason Parraga, the following modules are the minimum set needed to support the Static Flow Pusher functionality in the Floodlight properties file:
floodlight.modules=\
net.floodlightcontroller.counter.CounterStore,\
net.floodlightcontroller.storage.memory.MemoryStorageSource,\
net.floodlightcontroller.core.internal.FloodlightProvider,\
net.floodlightcontroller.staticflowentry.StaticFlowEntryPusher,\
net.floodlightcontroller.perfmon.PktInProcessingTime,\
net.floodlightcontroller.ui.web.StaticWebRoutable
Start Floodlight with the forwarding module disabled:
cd floodlight
$ ./floodlight_static.sh
The following sFlow-RT script is based on the DDoS script described in Embedded SDN applications:
include('extras/json2.js');

var flowkeys = 'ipsource';
var value = 'frames';
var filter = 'outputifindex!=discard&direction=ingress&sourcegroup=external';
var threshold = 1000;
var groups = {'external':['0.0.0.0/0'],'internal':['10.0.0.2/32']};

var metricName = 'ddos';
var controls = {};
var enabled = true;
var blockSeconds = 20;

var flowpusher = 'http://localhost:8080/wm/staticflowentrypusher/json';

function clearOpenFlow() {
  http('http://localhost:8080/wm/staticflowentrypusher/clear/all/json');
}

function setOpenFlow(spec) {
  http(flowpusher, 'post','application/json',JSON.stringify(spec));
}

function deleteOpenFlow(spec) {
  http(flowpusher, 'delete','application/json',JSON.stringify(spec));
}

function block(address) {
  if(!controls[address]) {
     setOpenFlow({name:'block-' + address, switch:'00:00:00:00:00:01',
                  cookie:'0', priority:'11', active: true,
                  'ether-type':'0x0800', 'src-ip': address, actions:""});
     controls[address] = { action:'block', time: (new Date()).getTime() };
  }
}

function allow(address) {
  if(controls[address]) {
     deleteOpenFlow({name:'block-' + address});
     delete controls[address];
  }
}

setEventHandler(function(evt) {
  if(!enabled) return;

  var addr = evt.flowKey;
  block(addr);  
},[metricName]);

setIntervalHandler(function() {
  // remove stale controls
  var stale = [];
  var now = (new Date()).getTime();
  var threshMs = 1000 * blockSeconds;
  for(var addr in controls) {
    if((now - controls[addr].time) > threshMs) stale.push(addr);
  }
  for(var i = 0; i < stale.length; i++) allow(stale[i]);
},10);

setHttpHandler(function(request) {
  var result = {};
  try {
    var action = '' + request.query.action;
    switch(action) {
    case 'block':
       var address = request.query.address[0];
       if(address) block(address);
        break;
    case 'allow':
       var address = request.query.address[0];
       if(address) allow(address);
       break;
    case 'enable':
      enabled = true;
      break;
    case 'disable':
      enabled = false;
      break;
    }
  }
  catch(e) { result.error = e.message }
  result.controls = controls;
  result.enabled = enabled;
  return JSON.stringify(result);
});

setGroups(groups);
setFlow(metricName,{keys:flowkeys,value:value,filter:filter});
setThreshold(metricName,{metric:metricName,value:threshold,byFlow:true,timeout:5});

clearOpenFlow();
setOpenFlow({name:'normal',switch:"00:00:00:00:00:01",cookie:"0",
             priority:"10",active:true,actions:"output=normal"});
The following command line argument loads the script on startup:
-Dscript.file=normal.js
Some notes on the script:
  1. The intervalHandler() function is used to automatically release controls after 20 seconds
  2. The clearOpenFlow() function is used to remove any existing flow entries at startup
  3. The last line in the script defined the NORMAL forwarding action for all packets on the switch using a priority of 10
  4. Blocking rules are added for specific addresses using a higher priority of 11
Open a web browser to view a trend of traffic and then perform the following steps:
  1. disable the controller
  2. perform a simulated DoS attack (using a flood ping)
  3. enable the controller
  4. simulate a second DoS attack

Figure 3: DDoS attack traffic with and without controller
Figure 3 shows the results of the demonstration. When the controller is disabled, the attack traffic exceeds 6,000 packets per second and persists until the attacker stops sending. When the controller is enabled, traffic is stopped the instant it hits the 1,000 packet per second threshold in the application. The control is removed 20 seconds later and re-triggers if the attacker is still sending traffic.

DDoS mitigation is only one use case large flow control, others described on this blog include: ECMP / LAG load balancing, traffic marking and packet capture. This script can be modified to address these different use cases. The Mininet test bed provides a useful way to test hybrid OpenFlow control schemes before moving them into production using physical switches that support integrated hybrid OpenFlow.

Sunday, November 24, 2013

Exporting events using syslog

Figure 1: ICMP unreachable
ICMP unreachable described how standard sFlow monitoring built into switches can be used to detect scanning activity on the network. This article shows how sFlow-RT's embedded scripting API can be used to notify Security Information and Event Management (SIEM) tools when unreachable messages are observed.
Figure 2: Components of sFlow-RT
The following sFlow-RT JavaScript application (syslog.js) defines a flow to track ICMP port unreachable messages and generate syslog events that are sent to the SIEM tool running on server 10.0.0.152 and listening for UDP syslog events on the default syslog port (514):
var server = '10.0.0.152';
var port = 514;
var facility = 16; // local0
var severity = 5;  // notice

var flowkeys = ['ipsource','ipdestination','icmpunreachableport'];

setFlow('uport', {
  keys: flowkeys,
  value:'frames',
  log:true,
  flowStart:true
});

setFlowHandler(function(rec) {
  var keys = rec.flowKeys.split(',');
  var msg = {};
  for(var i = 0; i < flowkeys.length; i++) msg[flowkeys[i]] = keys[i];
  
  syslog(server,port,facility,severity,msg);
},['uport']);
The following command line argument loads the script on startup:
-Dscript.file=syslog.js
The following screen capture shows the events collected by the Splunk SIEM tool:
While Splunk was used in this example, there are a wide variety of open source and commercial tools that can be used to collect and analyze syslog events. For example, the following screen capture shows events in the open source Logstash tool:
Splunk, Logstash and other SIEM tools don't natively understand sFlow records and require a tool like sFlow-RT to extract information and convert it into a text format that can be processed. Using sFlow-RT to selectively forward high value data reduces the load on the SIEM system and in the case of commercial software like Splunk significantly lowers the expense of monitoring since licensing costs are typically based on the volume of data collected and indexed.

ICMP unreachable messages are only one example of the kinds of events that can be generated from sFlow data. The sFlow standard provides a scaleable method of monitoring all the network, server and application resources in the data center, see Visibility and the software defined data center.
Figure 3: Visibility and the software defined data center
For example, Cluster performance metrics describes how sFlow-RT can be used to summarize performance metrics, and periodic polling, or setting thresholds on metrics is another source of events for the SIEM system. A hybrid approach that splits the metrics stream so that exceptions are sent to the SIEM system and periodic summaries are sent to a time series database (e.g. Metric export to Graphite) leverages the strengths of the different tools.

Finally, log export is only one of many applications for sFlow data, some of which have been described on this blog. The data center wide visibility provided by sFlow-RT supports orchestration tools and allows them to automatically optimize the allocation of compute, storage and application resources and the placement of loads on these resources.

Saturday, November 23, 2013

Metric export to Graphite

Figure 1: Cluster performance metrics
Cluster performance metrics describes how sFlow-RT can be used to calculate summary metrics for cluster performance. The article includes a Python script that polls sFlow-RT's REST API and then sends metrics to to Graphite. In this article sFlow-RT's internal scripting API will be used to send metrics directly to Graphite.
Figure 2: Components of sFlow-RT
The following script (graphite.js) re-implements the Python example (generating a sum of the load_one metric for a cluster of Linux machines) in JavaScript using sFlow-RT built-in functions for retrieving metrics and sending them to Graphite:
// author: Peter
// version: 1.0
// date: 11/23/2013
// description: Log metrics to Graphite

include('extras/json2.js');

var graphiteServer = "10.0.0.151";
var graphitePort = null;

var errors = 0;
var sent = 0;
var lastError;

setIntervalHandler(function() {
  var names = ['sum:load_one'];
  var prefix = 'linux.';
  var vals = metric('ALL',names,{os_name:['linux']});
  var metrics = {};
  for(var i = 0; i < names.length; i++) {
    metrics[prefix + names[i]] = vals[i].metricValue;
  }
  try { 
    graphite(graphiteServer,graphitePort,metrics);
    sent++;
  } catch(e) {
    errors++;
    lastError = e.message;
  }
} , 15);

setHttpHandler(function() {
  var message = { 'errors':errors,'sent':sent };
  if(lastError) message.lastError = lastError;
  return JSON.stringify(message);
});
The interval handler function runs every 15 seconds and retrieves the set of metrics in the names array (in this case just one metrics, but multiple metrics could be retrieved). The names are then converted into a Graphite friendly form (prefixing each metric with the token linux. so that they can be easily grouped) and then sent to the Graphite collector running on 10.0.0.151 using the default TCP port 2003. The script also keeps track of any errors and makes them available through the URL /script/graphite.js/json

The following command line argument loads the script on startup:
-Dscript.file=graphite.js
The following Graphite screen capture below shows a trend of the metric:
There are a virtually infinite number of core and derived metrics that can be collected by sFlow-RT using standard sFlow instrumentation embedded in switches, servers and applications throughout the data center. For example Packet loss describes the importance of collecting network packet loss metrics and including them in performance dashboards.
Figure 2: Visibility and the software defined data center
While having access to all these metrics is extremely useful, not all of them need to be stored in Graphite. Using sFlow-RT to calculate and selectively export high value metrics reduces pressure on the time series database, while still allowing any of the remaining metrics to be polled using the REST API when needed.

Finally, metrics export is only one of many applications for sFlow data, some of which have been described on this blog. The data center wide visibility provided by sFlow-RT supports orchestration tools and allows them to automatically optimize the allocation of compute, storage and application resources and the placement of loads on these resources.

Thursday, November 14, 2013

SC13 large flow demo

For the duration of the SC13 conference, Denver will host of one of the most powerful and advanced networks in the world - SCinet. Created each year for the conference, SCinet brings to life a very high capacity network that supports the revolutionary applications and experiments that are a hallmark of the SC conference. SCinet will link the Colorado Convention Center to research and commercial networks around the world. In doing so, SCinet serves as the platform for exhibitors to demonstrate the advanced computing resources of their home institutions and elsewhere by supporting a wide variety of bandwidth-driven applications including supercomputing and cloud computing. - SCinet

The screen shot is from a live demonstration of network-wide large flow detection and tracking using standard sFlow instrumentation build into switches in the SCInet network. Currently multiple vendor's switches, 1,223 ports, with speeds up to 100Gbit/s, are sending sFlow data.
Note: The network is currently being set up, traffic levels will build up and reach a peak next week during the SC13 show (Nov. 17-22). Visit the demonstration site next week to see live traffic on one of the worlds busiest networks: http://inmon.sc13.org/dash/
The sFlow-RT real-time analytics engine is receiving the sFlow and centrally tracking large flows. The HTML5 web pages poll the analytics engine every half second for the largest 100 flows in order to update the charts, which represent large flows as follows:
  • Dot an IP address
  • Circle a logical grouping of IP addresses
  • Line width represents bandwidth consumed by flow
  • Line color identifies traffic type
Real-time detection and tracking of large flows has many applications in software defined networking (SDN), including: DDoS mitigation, large flow load balancing, and multi-tenant performance isolation. For more information, see Performance Aware SDN

Sunday, November 10, 2013

UDP packet replication using Open vSwitch

UDP protocols such as sFlow, syslog, NetFlow, IPFIX and SNMP traps, have many advantages for large scale network and system monitoring, see Push vs Pull.  In a typical deployment each managed element is configured to send UDP packets to a designated collector (specified by an IP address and port). For example, in a simple sFlow monitoring system all the switches might be configured to send sFlow data to UDP port 6343 on the host running the sFlow analysis application. Complex deployments may require multiple analysis applications, for example: a first application providing analytics for software defined networking, a second focused on host performance, a third addressing packet capture and security, and the fourth looking at application performance. In addition, a second copy of each application may be required for redundancy. The challenge is getting copies of the data to all the application instances in an efficient manner.

There are a number of approaches to replicating UDP data, each with limitations:
  1. IP Multicast - if the data is sent to an IP multicast address then each application could subscribe to the multicast channel receive a copy of the data. This sounds great in theory, but in practice configuring and maintaining IP multicast connectivity can be a challenge. In addition, all the agents and collectors would need to support IP multicast. In addition, IP multicast also doesn't address the situation where you have multiple applications running on single host and so each application has to be receive the UDP data on a different port.
  2. Replicate at source - each agent could be configured to send a copy of the data to each application. Replicating at source is a configuration challenge (all agents need to be reconfigured if you add an additional application). This approach is also wasteful of bandwidth - multiple copies of the same data are send across the network.
  3. Replicate at destination - a UDP replicator, or "samplicator" application receives the stream of UDP messages, copies them and resends them to each of the applications. This functionality may be deployed as a stand alone application, or be an integrated function within an analysis application. The replicator application is a single point of failure - if it is shut down none of the applications receive data. The replicator adds delay to the measurements and at high data rates can significantly increase UDP loss rate as the datagrams are received, sent, and received again. 
This article will examine a fourth option, using software defined networking (SDN) techniques to replicate and distribute data within the network. The Open vSwitch is implemented in the Linux kernel and includes OpenFlow and network virtualization features that will be used to build the replication network.

First, you will need a server (or virtual machine) running a recent version of Linux. Next download and install Open vSwitch.

Next, configure the Open vSwitch to handle networking for the server:
ovs-vsctl add-br br0
ovs-vsctl add-port br0 eth0
ifconfig eth0 0
ifconfig br0 10.0.0.1/24
Now configure the UDP agents to send their data to 10.0.0.1. You should be able to run a collector application for each service port (e.g. sFlow 6343, syslog 514, etc.).

The first case to consider is replicating the datagrams to a second port on the server (sending packets to App 1 and App 2 in the diagram). First, use the ovs-vsctl command to list the OpenFlow port numbers on the virtual switch:
% ovs-vsctl --format json --columns name,ofport list Interface
{"data":[["eth0",1],["br1",65534]],"headings":["name","ofport"]}
We are interested in replicating packets received on eth0 and output shows that the corresponding OpenFlow port is 1.

The Open vSwitch provides a command line utility ovs-ofctl that uses the OpenFlow protocol to configure forwarding rules in the vSwitch. The following OpenFlow rule will replicate sFlow datagrams:
in_port=1 dl_type=0x0800 nw_proto=17 tp_dst=6343 actions=LOCAL,mod_tp_dst:7343,normal
The match part of the rule looks for packets received on port 1 (in_port=1), where the Ethernet type is IPv4 (dl_type=0x0800), the IP protocol is UDP (nw_protocol=17), and the destination UDP port is 6343 (tp_dst=6343). The actions section of the rule is the key to building the replication function. The LOCAL action delivers the original packet as intended. The destination port is then changed to 7343 (mod_tp_dst:7343) and the modified packet is sent through the normal processing path to be delivered to the application.

Save this rule to a file, say replicate.txt, and then use ovs-ofctl to apply the rule to br0:
ovs-ofctl add-flows br0 replicate.txt
At this point a second sFlow analysis application listening for sFlow datagrams on port 7343 should start receiving data - sflowtool is a convenient way to verify that the packets are being received:
sflowtool -p 7343
The second case to consider is replicating the datagrams to a remote host (sending packets to App 3 in the diagram).
in_port=1 dl_type=0x0800 nw_proto=17 tp_dst=6343 actions=LOCAL,mod_tp_dst:7343,normal,mod_nw_src:10.0.0.1,mod_nw_dst:10.0.0.2,normal
The extended rule includes additional actions that modify the source address of the packets (mod_nw_src:10.0.0.1) and the destination IP address (mod_nw_dst:10.0.0.1) and sends the packet through the normal processing path. Since we are relying on the routing functionality in the Linux stack to deliver the packet, make sure that routing in enabled - see How to Enable IP Forwarding in Linux.
Unicast reverse path filtering (uRPF) is mechanism that routers use to drop spoofed packets (i.e. packets where the source address doesn't belong to the subnet on the access port the packet was received on). uRPF should be enabled wherever practical because spoofing is used in a variety of security and denial of service attacks, e.g. DNS amplification attacks. By modifying the IP source address to be the address of the forwarding host (10.0.0.1) rather than the original source IP address the OpenFlow rule ensures that the packet will pass through uRPF filters, both on the host and on the access router. Rewriting the sFlow source address does not cause any problems because the sFlow protocol identifies the original source of the data within its payload and doesn't rely on the IP source address. However, other UDP protocols (for example, NetFlow/IPFIX) rely on the IP source address to identify the source of the data. In this case, removing the mod_nw_src action will leave the IP source address unchanged, but the packet may well be dropped by uRPF filters. Newer Linux distributions implement strict uRPF by default, however it can be disabled if necessary, see Reverse Path Filtering.
This article has only scratched the surface of capabilities of the Open vSwitch. In situations where passing the raw packets across the network isn't feasible the Open vSwitch can be configured to send the packets over a tunnel (sending packets to App 4 in the diagram). Tunnels, in conjunction with OpenFlow, can be used to create a virtual UDP distribution overlay network with its own addressing scheme and topology - Open vSwitch is used by a number of network virtualization vendors (e.g. VMware NSX). In addition, more complex filters can also be implemented, forwarding datagrams based on source subnet to different collectors etc.

The replication functions don't need to be performed in software in the virtual switch. OpenFlow rules can be pushed to OpenFlow capable hardware switches which can perform the replication, or source based forwarding functions at wire speed. A full blown controller based solution isn't necessarily required, the ovs-ofctl command can be used to push OpenFlow rules to physical switches.

More generally, building flexible UDP datagram distribution and replication networks is an interesting use case for software defined networking. The power of software defined networking is that you can adapt the network behavior to suit the needs of the application - in this case overcoming the limitations of existing UDP distribution solutions by modifying the behavior of the network.

Sunday, November 3, 2013

ICMP unreachable

Figure 1: ICMP port unreachable
Figure 1 provides an example that demonstrates how Internet Control Message Protocol (ICMP) destination port unreachable messages are generated. In the example, host h1 sends a UDP packet to port 30000 on host h4. The packet message transits switches s1 and s2 on its path to h4. In this case, h4 is not running a service that listens for UDP packets on port 30000, so host h4 sends an ICMP destination port unreachable message (ICMP type 3, code 3) back to host h1 to inform it that that the port cannot be reached. ICMP unreachable messages include the header of the original packet within their payload so that the sender can examine the header fields and determine the source of the error.

ICMP unreachable messages provide a clear indication of configuration errors and should be rare in a well configured network. Typically, the ICMP unreachable messages that are seen result from scanning and network reconnaissance:
  • Scanning a host for open ports will generate ICMP port / protocol unreachable messages
  • Scanning for hosts will generate ICMP host / network unreachable messages
The sources of scanning activity can identify compromised hosts on the network and gives information about potential security challenges to the network. From the example, UDP port 30000 is known to be associated with trojan activity and so any requests to connect to this port from host h1 suggest that h1 may be compromised. It also make sense to follow up to see if any hosts are responding to requests to UDP port 30000.

The challenge in monitoring ICMP messages is that there is no single location that can see all the messages - they take a direct path between sender and receiver. Installing monitoring agents on all the hosts poses practical challenges in a heterogeneous environment, and agent based monitoring may be circumvented since trojans often disable security monitoring software when they infect a host.

Support for the sFlow standard in switches provides an independent method of profiling host behavior. The sFlow standard is widely supported by switch vendors and has the scaleability to deliver real-time, network wide, monitoring of host traffic. The switches export packet headers, allowing the central monitoring software to perform deep packet inspection and extract details from the ICMP protocol.

DNS amplification attacks describes how the sFlow-RT analyzer can be used to monitor DNS activity. The SMURF attack uses spoofed ICMP messages as a method of DDoS amplification and similar techniques to those described in the DNS article can be used to detect and mitigate these attacks.

The following example illustrates how sFlow can be used to monitor ICMP unreachable activity; a single instance of sFlow-RT is monitoring 7500 switch ports in a data center network.

The following ICMP attributes are extracted from packet samples and can be used in flow definitions or as filters:

DescriptionNameExample
Message type, e.g. Destination Unreachable (3)icmptype3
Message code, e.g. Protocol Unreachable (2)icmpcode2
IP address in network unreachable responseicmpunreachablenet10.0.0.1
Host in host unreachable responseicmpunreachablehost10.0.0.1
Protocol in protocol unreachable response icmpunreachableprotocol41
Port in port unreachable responseicmpunreachableportudp_30000

The following flow definitions were created using sFlow-RT's embedded scripting API:
setFlow('unets',{keys:'icmpunreachablenet',value:'frames',t:20});
setFlow('uhosts',{keys:'icmpunreachablehost',value:'frames',t:20});
setFlow('uprotos',{keys:'icmpunreachableprotocol',value:'frames',t:20});
setFlow('uports',{keys:'icmpunreachableport',value:'frames',t:20});
Alternatively, the flow definitions can be specified by making calls to the REST API using cURL:
curl -H "Content-Type:application/json" -X PUT --data "{keys:'icmpunreachableport', value:'frames', t:20}" http://localhost:8008/flow/uports/json
Using the script API has a number of advantages: it ensures that flow definitions are automatically reinstated on a system restart, makes it easy to generate trend charts (for example the graphite() function sends metrics to Graphite for integration in performance dashboards) and to automate the response when ICMP anomalies are detected (for example, using the syslog() function to send an alert or http() to access a REST API on a device or SDN controller to block the traffic).
The table above (http://localhost:8008/activeflows/ALL/uports/html?maxFlows=20&aggMode=sum) shows a continuously updating, real-time, view of the top ICMP unreachable ports - a bit like the Linux top command, but applied to the active flows. The table shows that the most frequently reported unreachable port is UDP port 30000.

There are a number of more detailed flow definitions that can be created:
  • To identify hosts generating scan packets, include ipdestination in the flow definition
  • To identify targets of Smurf attacks, include ipdestination and filter to exclude local addresses
  • To identify target country, include destinationcountry and filter to exclude local addresses
Note: Examples of these detailed flows have been omitted to preserve the anonymity.
Figure 2: Performance aware software defined networking
Incorporating sFlow analytics in a performance aware software defined networking solution offers the opportunity to automate a response. The following script monitors for ICMP unreachable messages and generates syslog events when an unreachable message is detected:
setFlowHandler(function(rec) {
  var name = rec.name;
  var keys = rec.flowKeys.split(',');
  var msg = {type:name,host:keys[0],target:keys[1]};
  syslog('10.0.0.1',null,null,null,msg); 
},['unets','uhosts','uprotos','uports']);

setFlow('unets',{keys:'ipdestination,icmpunreachablenet', value:'frames', t:20, log:true, flowStart:true});
setFlow('uhosts',{keys:'ipdestination,icmpunreachablehost', value:'frames', t:20, log:true, flowStart:true});
setFlow('uprotos',{keys:'ipdestination,icmpunreachableprotocol', value:'frames', t:20, log:true, flowStart:true});
setFlow('uports',{keys:'ipdestination,icmpunreachableport', value:'frames', t:20, log:true, flowStart:true});
While this example focused on a data center hosting servers, a similar approach could be used to monitor campus networks, detecting hosts that are scanning or participating in DDoS attacks. In this case, the SDN controller would respond by isolating the compromised hosts from the rest of the network.

Wednesday, October 16, 2013

DNS amplification attacks

Figure 1: DNS Amplification Variation Used in Recent DDoS Attacks (Update)
DNS Amplification Variation Used in Recent DDoS Attacks (Update) describes how public DNS servers can be used to amplify the effect of Distributed Denial of Service (DDoS) attacks - resulting in some of the largest and most disruptive attacks reported to date.
Figure 2: The DDoS That Knocked Spamhaus Offline (And How We Mitigated It)
The DDoS That Knocked Spamhaus Offline (And How We Mitigated It) describes a large 75Gbps attack using DNS amplification. An even larger 300Gbps attack causes wide scale disruption, see What you need to know about the world’s biggest DDoS attack.

DDoS describes how the sFlow monitoring standard can be used to rapidly detect and mitigate DDoS attacks on the target (victim) network. This article will examine how data centers that may be inadvertently hosting open DNS servers can use sFlow to identify servers participating in amplification attacks.

A hosting service provider has very little control over the services running on the physical and virtual servers running in the data center, and while one might hope that the customers carefully configure and monitor their DNS servers, the reality is that there are many openly accessible DNS servers. Using the network switches to monitor DNS operations is an attractive option, offering an agentless method of detecting and monitoring DNS servers wherever they are in the data center.

The sFlow standard is well suited to this task:
  1. sFlow is widely supported in physical and virtual switches
  2. sFlow is embedded within switch hardware and can be enabled in high traffic production networks without impacting performance.
  3. sFlow is scaleable, a single software analyzer can monitor hundreds of switches and tens of thousands of switch ports to deliver network wide visibility
  4. sFlow exports packet headers, allowing sFlow analysis software to perform deep packet inspection and report on DNS operations.
In the following example, a single instance of sFlow-RT is monitoring 7500 switch ports in a data center network.

The following DNS attributes are extracted from the sFlow packet samples and can be included in flow definitions, or as filters:

DescriptionNameExample
request=false, response=truednsqrfalse
op codednsopcode0
authoritative answerdnsaafalse
truncateddnstcfalse
recursion desireddnsrdfalse
recursion availablednsratrue
reserveddnsz0
response codednsrcode0
number of entries in questiondnsqdcount1
number of entries in answerdnsancount0
number of entries in name server sectiondnsnscount0
number of entries in resources sectiondnsarcount0
domain name in querydnsqnameyahoo.com.
query type codednsqtype15
query type namednsqtypenameMX(15)
query classdnsqclass1

The following flow definitions were created using sFlow-RT's embedded scripting API:
// track DNS query types
setFlow('dnsqueries',
  {keys:'dnsqtypename',value:'frames',filter:'dnsqr=false',t:20});

// track DNS query domains for ANY(15) queries
setFlow('dnsqany',
  {keys:'dnsqname', value:'frames', filter:'dnsqr=false&dnsqtype=255',t:20});

// track total DNS request rate for ANY(15) queries
setFlow('dnsqanytot',
  {value:'frames', filter:'dnsqr=false&dnsqtype=255',t:20});
Alternatively, the flow definitions can be specified by making calls to the REST API using cURL:
curl -H "Content-Type:application/json" -X PUT --data "{keys:'dnsqtypename', value:'frames', filter:'dnsqr=false',t:20}" http://localhost:8008/flow/dnsqueries/json
Using the script API has a number of advantages: it ensures that flow definitions are automatically reinstated on a system restart, makes it easy to generate trend charts (for example the graphite() function sends metrics to Graphite for integration in performance dashboards) and to automate the response when DNS anomalies are detected (for example, using the syslog() function to send an alert or http() to access a REST API on a device or SDN controller to block the traffic)
The table above (http://localhost:8008/activeflows/ALL/dnsqueries/html?aggMode=sum) shows a continuously updating, real-time, view of the top DNS queries - a bit like the Linux top command, but applied to the active flows. The table shows that the fourth most frequent query is the ANY(255) query type.

The ANY(15) query is often used for DNS amplification attacks since it asks the name server for all the records within the domain, resulting in a large response that amplifies the traffic in the attack.
The above chart (http://localhost:8008/activeflows/ALL/dnsqany/html?aggMode=sum) looks at the domain names being queried by the ANY(15) queries. Domain names in the list are known to be associated with DNS amplification attacks, see DNS Amplification Attacks Observer, so it appears that there are open DNS servers being used to amplify DNS attacks in this data center.

The trend chart above (http://localhost:8008/metric/ALL/sum:dnsqanytot/html) looks at the overall level of ANY(15) requests in the data center. The trend is increasing as a new DNS amplification attack is launched.

There are a number of more detailed flow definitions that can be created:
  1. identify the open name servers: include ipdestination in the flow definition
  2. identify target of the attack: include ipsource in the flow definition 
  3. identify target country: include sourcecountry in the flow definition
  4. identify compromised hosts: include macsource in the flow definition
Note: Examples of these detailed flows have been omitted to preserve the anonymity.
Figure 3: Performance aware software defined networking
Incorporating sFlow analytics in a performance aware software defined networking solution offers the opportunity to automate a response. By itself, OpenFlow is not DNS aware, however, combining the detection capabilities of sFlow with OpenFlow rules to selectively steer traffic based on IP source, destination, protocol and port allows attacks to be blocked, or for a DNS proxy to be inserted in the packet path to selectively drop requests.

While this example focused on a data center hosting DNS servers, a similar approach could be used to monitor campus networks. Detecting hosts that are spoofing their source addresses and generating suspect DNS requests is a useful signature for identifying compromised hosts. In this case, the SDN controller would respond by isolating the compromised system from the rest of the network.

DNS amplification attacks are a serious problem that is difficult to address because the attacker is two steps removed from their victims (hidden behind compromised hosts and open DNS servers). DNS amplification attacks have limited impact on the intermediate networks and may go unnoticed, even though the combined effect of all the traffic arriving at the target network can be devastating. Software defined networking offers the promise of intelligent networks that can automatically respond the changing traffic conditions and security threats, providing a way to share and automate best practices and reducing operating costs to the point where intermediate networks can play a larger role in reducing the impact of these attacks.  

Tuesday, October 1, 2013

Embedding SDN applications

Figure 1: Performance aware software defined networking
Performance aware software defined networking describes a general architecture for integrating real-time analytics in software defined networking (SDN) stacks to support applications such as load balancing, DDoS mitigation, traffic marking, and multi-tenant performance isolation.

Examples on this blog have used Python or node.js scripts to create demonstrations. However, while external scripts are a quick way to build prototypes, moving from prototype to production quality implementations can be a challenge.

Much of the complexity in developing external control applications involves sharing and distributing state with the analytics engine and OpenFlow controller. This complexity can be greatly reduced if the application can be embedded in the analytics software, or in the OpenFlow controller. Deciding whether to embed an application in the analytics engine, or in the controller, should be based on how tightly coupled the application is to each of these services. In the case of performance management applications, most of the interaction is with the analytics engine and so it makes most sense to embed application logic within the analytics engine.

The following example demonstrates the benefits of embedding by taking the DDoS mitigation script described in Frenetic, Pyretic and Resonance and re-implementing it as an embedded application using the recently released sFlow-RT analytics engine scripting API.

The following script implements the DDoS mitigation application:
// author: Peter
// version: 1.0
// date: 9/30/2013
// description: DDoS controller script

include('extras/json2.js');

var flowkeys = 'ipsource';
var value = 'frames';
var filter = 'outputifindex!=discard&direction=ingress&sourcegroup=external';
var threshold = 1000; // 1000 packets per second
var groups = {'external':['0.0.0.0/0'],'internal':['10.0.0.2/32']};

var metricName = 'ddos';
var controls = {};
var enabled = true;

function sendy(address,type,state) {
  var result = runCmd(['../pyretic/pyretic/pyresonance/sendy_json.py',
                       '-i',address,'-e',type,'-V',state]);
}

function block(address) {
  if(!controls[address]) {
     sendy(address,'auth','clear');
     controls[address] = 'blocked';
  }
}
function allow(address) {
  if(controls[address]) {
     sendy(address,'auth','authenticated');
     delete controls[address];
  }
}

setGroups(groups);
setFlow(metricName, {keys:flowkeys, value:value, filter:filter, n:10, t:2});
setThreshold(metricName,{metric:metricName,value:threshold,byFlow:true});

setEventHandler(function(evt) {
  if(!enabled) return;

  var address = evt.flowKey;
  block(address);
},[metricName]);

setHttpHandler(function(request) {
  var result = {};
  try {
    var action = '' + request.query.action;
    switch(action) {
    case 'block':
       var address = request.query.address[0];
       if(address) block(address);
        break;
    case 'allow':
       var address = request.query.address[0];
       if(address) allow(address);
       break;
    case 'enable':
      enabled = true;
      break;
    case 'disable':
      enabled = false;
      break;
    }
  }
  catch(e) { result.error = e.message }
  result.controls = controls;
  result.enabled = enabled;
  return JSON.stringify(result);
});
The following command line argument loads the script on startup:
-Dscript.file=ddos.js
In addition to providing the functionality of the original script, the embedded script also includes an HTTP interface for remotely monitoring and controlling the application. For example, manually blocking or allowing an address is accomplished with the following commands:
$ curl "http://10.0.0.54:8008/script/ddos.js/json?action=block&address=10.0.0.1"
{"controls":{"10.0.0.1":"blocked"},"enabled":true}
$ curl "http://10.0.0.54:8008/script/ddos.js/json?action=allow&address=10.0.0.1"
{"controls":{},"enabled":true}
Enabling and disabling the controller is also possible:
$ curl http://10.0.0.54:8008/script/ddos.js/json?action=enable
{"controls":{},"enabled":true}
$ curl http://10.0.0.54:8008/script/ddos.js/json?action=disable
{"controls":{},"enabled":false}
The following example uses the test bed described in Frenetic, Pyretic and Resonance to demonstrate the DDoS controller. Open a web browser to view a trend of traffic and then performing the following steps:
  1. disable the controller
  2. perform a simulated DoS attack (using a flood ping)
  3. enable the controller
  4. simulate a second DoS attack.
Figure 2: DDoS attack traffic with and without controller
Figure 2 shows the results of the demonstration. When the controller is disabled, the attack traffic reaches 6,000 packets per second and persists until the attacker stops sending. When the controller is enabled, traffic is stopped the instant it hits the 1,000 packet per second threshold in the application.
Figure 3: RESTful control of switches
While the previous example demonstrated integration with an OpenFlow controller, controller-less deployments are also possible, see RESTful control of switches. sFlow-RT scripts can access REST-APIs or use TCL/expect to access switch CLIs - using the http() and runCmd() functions to reconfigure routing policy and access controls in order to block or redirect attacks - just modify the block() and release() functions in the example script. The case study described in DDoS reconfigures router BGP settings and ACLs to stop attacks. In addition, monitoring can also be added to the script - for example, using the syslog() function to send notifications to an Security Information and Event Management (SIEM).

While important classes of SDN application make sense integrated within the SDN controller (e.g. network virtualization, virtual firewalls, and routers etc.), the use cases described in this article demonstrate that integrating performance aware SDN applications (e.g.  load balancingDDoS mitigationtraffic markingmulti-tenant performance isolation, etc.) within the analytics platform makes architectural sense.

Sunday, September 22, 2013

Wile E. Coyote

One of the classic moments in a Road Runner cartoon is Wile E. Coyote pursuing the Road Runner into a cloud of dust. Wile E. Coyote starts to suspect that there is something wrong, but remains suspended until the moment of realization that he is no longer on the road, but is instead suspended in mid-air over a chasm.

In the cartoon, the dust cloud allows Wile E. Coyote to temporarily defy the laws of physics by hiding the underlying physical topography. The Road Runner is under no such illusion - by leading the Road Runner is able to see the road ahead and stay on firm ground.

Example of an SDN solution with tunnels
Current network virtualization architectures are built on a similar cartoon reality - hiding the network under a cloud (using an overlay network of tunnels) and asserting that applications will somehow be insulated from the physical network topology and communication devices.

The network virtualization software used to establish and manage the overlay are a form of distributed computing system that delivers network connectivity as a service. Vendors of network virtualization software that assert that their solution is "independent of underlying hardware" are making flawed assumptions about networking that are common to distributing computing systems and are collectively known as the Fallacies of Distributed Computing:
  1. The network is reliable
  2. Latency is zero
  3. Bandwidth is infinite
  4. The network is secure
  5. Topology doesn't change
  6. There is one administrator
  7. Transport cost is zero
  8. The network is homogeneous
This article isn't intended to dismiss the value of the network virtualization abstraction. Virtualizing networking greatly increases operational flexibility. In addition, the move of complex functionality from the network core to edge hardware and virtual switches simplifies configuration and deployment of network functions (e.g. load balancing, firewalls, routing etc.). However, in order to realize the virtual network abstraction the orchestration system needs to be aware of the physical resources on which the service depends. The limitations of ignoring physical networking are demonstrated in the article, Multi-tenant performance isolation, which provides a real-life example of the type of service failure that impacts the entire data center and is difficult to address with current network virtualization architectures.

To be effective, virtualization architectures needs to be less like Wile E. Coyote, blindly running into trouble, and more like the Road Runner, fully aware of road ahead, safely navigating around obstacles and using resources to maximum advantage. In much the same way the hypervisor takes responsibility for managing limited physical resources like memory, CPU cycles and I/O bandwidth in order to deliver compute virtualization; the network virtualization system needs to be aware of the physical networking resources in order to integrate them into the virtualization stack. The article, NUMA, draws the parallel between how operating systems optimize performance by being aware of the location of resources and how cloud orchestration systems need to be similarly location aware.

One of the main reasons for the popularity of current overlay approaches to network virtualization has nothing to do with technology. The organizational silos that separate networking, compute and application operational teams in most enterprises make it difficult to deploy integrated solutions. Given the organizational challenges, it is easy to see the appeal to vendors creating overlay based products that bypasses the network silo and deliver operational flexibility to the virtualization team - see Network virtualization, management silos and missed opportunities. However, as network virtualization reaches the mainstream and software defined networking matures, expect to see enterprises integrate their functional teams and the emergence of network virtualization solutions that address current limitations. Multi-tenant traffic in virtualized network environments, examine the architectural problems with current cloud architectures and describe the benefits of taking a holistic, visibility driven, approach to coordinating network, compute, storage and application resources.

Thursday, September 12, 2013

Packet loss

The timeline describes an outage on Sunday, August 25th in Amazon's Elastic Block Store (EBS) service that affected a number of companies, including: Instagram, Vine, Airbnb and Flipboard - see Business Week: Another Amazon Outage Exposes the Cloud's Dark Lining.

The root cause of the outage is described as partial failure with a networking device that resulted in excessive packet loss. Network monitoring can detect many problems before they become serious and reduce the time to identify the root cause of problems when they do occur. However, traditional network monitoring technologies such as SNMP polling for interface counters don't easily scale, leaving operations teams with limited, delayed, information on critical network performance indicators such as packet loss. Fortunately, the challenge of large scale interface monitoring is addressed by the sFlow standard's counter push mechanism in which each device is responsible for periodically sending its own counters to a central collector, see Push vs Pull.
The sFlow counter export mechanism extends beyond the network to include server and application performance metrics and the article, Cluster performance metrics, shows how queries to the sFlow-RT analyzer's REST API are used to feed performance metrics from server clusters to operations tools like Graphite.

The following query builds on the example to show how real-time information summarizing network wide packet loss and error rates can easily be gathered, even if there are thousands of switches and tens of thousands of links.
$ curl http://10.0.0.162:8008/metric/ALL/sum:ifinerrors,sum:ifindiscards,sum:ifouterrors,sum:ifoutdiscards,max:ifinerrors,max:ifindiscards,max:ifouterrors,max:ifoutdiscards/json
[
 {
  "lastUpdateMax": 56187,
  "lastUpdateMin": 487,
  "metricN": 119,
  "metricName": "sum:ifinerrors",
  "metricValue": 0.05025125628140704
 },
 {
  "lastUpdateMax": 56187,
  "lastUpdateMin": 487,
  "metricN": 119,
  "metricName": "sum:ifindiscards",
  "metricValue": 0.05025125628140704
 },
 {
  "lastUpdateMax": 56187,
  "lastUpdateMin": 487,
  "metricN": 119,
  "metricName": "sum:ifouterrors",
  "metricValue": 2.1955918328651283
 },
 {
  "lastUpdateMax": 56187,
  "lastUpdateMin": 487,
  "metricN": 119,
  "metricName": "sum:ifoutdiscards",
  "metricValue": 2.1955918328651283
 },
 {
  "agent": "10.0.0.30",
  "dataSource": "4",
  "lastUpdate": 487,
  "lastUpdateMax": 27487,
  "lastUpdateMin": 487,
  "metricN": 119,
  "metricName": "max:ifinerrors",
  "metricValue": 0.05025125628140704
 },
 {
  "agent": "10.0.0.30",
  "dataSource": "4",
  "lastUpdate": 487,
  "lastUpdateMax": 27487,
  "lastUpdateMin": 487,
  "metricN": 119,
  "metricName": "max:ifindiscards",
  "metricValue": 0.05025125628140704
 },
 {
  "agent": "10.0.0.28",
  "dataSource": "8",
  "lastUpdate": 7288,
  "lastUpdateMax": 27488,
  "lastUpdateMin": 7288,
  "metricN": 119,
  "metricName": "max:ifouterrors",
  "metricValue": 0.7425742574257426
 },
 {
  "agent": "10.0.0.28",
  "dataSource": "8",
  "lastUpdate": 7288,
  "lastUpdateMax": 27488,
  "lastUpdateMin": 7288,
  "metricN": 119,
  "metricName": "max:ifoutdiscards",
  "metricValue": 0.7425742574257426
 }
]
The query result identifies links with the highest number of packet errors and packet discards (links are identified by the agent and dataSource fields for each of the max: ifindiscards, ifoutdiscards, ifinerrors, and ifouterrors metrics). Also of interest are the sum: metrics, measuring total packet loss and error rates summed across all the links in the network.

The sFlow counter push mechanism is also useful for tracking availability: the periodic export of counters by each device acts as a "heart beat" - if messages stop arriving from a device then there is a problem with the device or network that needs to be investigated. The metricN field in the query result indicates the number of data sources that contributed to the summary metrics and the lastUpdateMax and lastUpdateMin values indicate how long ago (in milliseconds) the most recent and oldest updates are within the group. A value of lastUpdateMax value exceeding a multiple of the configured sFlow export period (typically 30 seconds), or a decreasing value of metricN, indicates a problem - possibly the type of "grey" partial failure in the Amazon example.
The chart shows the metrics plotted on a chart. There is clearly an issue with a large number of discarded packets. The trend line for the worst interface tracks the total across all interfaces, indicating that the problem is localized to a specific link, rather than a general congestion related problem involving multiple links.

The network related outage described in this example is not an isolated incident; other incidents described on this blog include: Amazon EC2 outageGmail outageDelay vs utilization for adaptive control, and Multi-tenant performance isolation. The article, Visibility and the software defined data center, describes how the sFlow standard delivers the comprehensive visibility needed to automatically manage data center resources, increasing efficiency, and reducing downtime due to network related outages.