Saturday, May 31, 2014

SDN fabric controller for commodity data center switches

Figure 1: Rise of merchant silicon
Figure 1 illustrates the rapid transition to merchant silicon among leading data center network vendors, including: Alcatel-Lucent, Arista, Cisco, Cumulus, Dell, Extreme, Juniper, Hewlett-Packard, and IBM.

This article will examine some of the factors leading to commoditization of network hardware and the role that software defined networking (SDN) plays in coordinating hardware resources to deliver increased network efficiency.
Figure 2: Fabric: A Retrospective on Evolving SDN
The article, Fabric: A Retrospective on Evolving SDN by Martin Casado, Teemu Koponen, Scott Shenker, and Amin Tootoonchian, makes the case for a two tier SDN architecture comprising a smart edge and an efficient core.
Table 1: Edge vs Fabric Functionality
Virtualization and advances in the networking capability of x86 based servers are drivers behind this separation. Virtual machines are connected to each other and to the physical network using a software virtual switch. The software switch provides the flexibility to quickly develop and deploy advanced features like network virtualization, tenant isolation, and distributed firewalls. Network function virtualization (NFV) is moving functions such as firewalling, load balancing, and routing from dedicated appliances to virtual machines, or embedding them within the virtual switches. The increased importance of network centric software has driven dramatic improvements in the performance of commodity x86 based servers, reducing the need for complex hardware functions in network devices.

As complex functions shift to software running on servers at the network edge, the role of the core physical network is simplified. Merchant silicon provides a cost effective way of delivering the high performance forwarding capabilities needed to interconnect servers and Figure 1 shows how Broadcom based switches are now dominating the market.

The Broadcom white paper, Engineered Elephant Flows for Boosting Application Performance in Large-Scale CLOS Networks, describes the challenge posed by large "Elephant" flows and the opportunity to use software defined networking to orchestrate hardware resources and improve network efficiency.
Figure 3: Feedback controller
Figure 3 shows the elements of an SDN feedback controller. Network measurements are analyzed to identify network hot spots, available resources, and large flows. The controller then plans a response and deploys controls in order to allocate resources where they are needed and reduce contention. The control system operates as a continuous loop: the effects of the changes are observed by the measurement system and further changes are made as needed.
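The loop can be summarized in a few lines of code. The following sketch is purely illustrative; the four stage functions are hypothetical stubs standing in for the measurement, analytics, planning and control steps described above, not sFlow-RT APIs.
// Illustrative skeleton of the feedback control loop (hypothetical stubs, not sFlow-RT APIs)
function measure()       { return { flows: [], linkUtilization: {} }; }    // collect sFlow samples and counters
function analyze(m)      { return { largeFlows: m.flows, hotLinks: [] }; } // find hot spots, idle paths, large flows
function plan(state)     { return []; }                                    // decide which flows to mark, steer or block
function deploy(actions) { /* push OpenFlow rules to the switches */ }

// Run continuously: the effect of each change is observed by the next
// measurement cycle and further changes are made as needed.
setInterval(function() {
  deploy(plan(analyze(measure())));
}, 1000);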

Implementing the controller requires an understanding of the measurement and control capabilities of the Broadcom ASICs.

Control Protocol

Figure 4: Programming Pipeline for ECMP
The Broadcom white paper focuses on the ASIC architecture and control mechanisms and includes the functional diagram shown in Figure 4. The paper describes two distinct configuration tasks:
  1. Programming the Routing Flow Table and ECMP Select Groups to perform equal cost multi-path forwarding of the majority of flows.
  2. Programming the ACL Policy Flow Table to selectively override forwarding decisions for the relatively small number of Elephant flows responsible for the bulk of the traffic on the network.
Managing the Routing and ECMP Group tables is well understood and there are a variety of solutions available that can be used to configure ECMP forwarding:
  1. CLI — Use switch CLI to configure distributed routing agents running on each switch (e.g. OSPF, BGP, etc.)
  2. Configuration Protocol — Similar to 1, but programmatic configuration protocols such as NETCONF or JSON RPC replace the CLI.
  3. Server orchestration — Open Linux based switch platforms allow server management agents to be installed on the switches to manage configuration. For example, Cumulus Linux supports Puppet, Chef, CFEngine, etc.
  4. OpenFlow — The white paper describes using the Ryu controller to calculate routes and update the forwarding and group tables using OpenFlow 1.3+ to communicate with Indigo OpenFlow agents on the switches.  
The end result is very similar whichever method is chosen to populate the Routing and ECMP Group tables - the hardware forwards packets across multiple paths based on a hash function calculated over selected fields in the packets (e.g. source and destination IP addresses + source and destination TCP ports):
index = hash(packet fields) % group.size
selected_physical_port = group[index]
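To make the selection concrete, the following sketch shows the same calculation in JavaScript; the rolling hash is a simple stand-in for the proprietary hash implemented by the switch ASIC, and the packet fields and port names are made up for illustration.
// Toy illustration of ECMP member selection. The rolling hash below is a
// simple stand-in for the hash implemented in the switch ASIC.
function hashFields(fields) {
  var s = fields.join(','), h = 0;
  for (var i = 0; i < s.length; i++) {
    h = ((h * 31) + s.charCodeAt(i)) >>> 0;  // 32-bit rolling hash
  }
  return h;
}

function selectPort(group, pkt) {
  var fields = [pkt.ipsource, pkt.ipdestination, pkt.tcpsourceport, pkt.tcpdestinationport];
  return group[hashFields(fields) % group.length];
}

// Every packet in a flow carries the same fields, so the flow sticks to one port
var group = ['eth1', 'eth2', 'eth3', 'eth4'];
console.log(selectPort(group, {ipsource:'10.0.0.1', ipdestination:'10.0.1.1',
                               tcpsourceport:49240, tcpdestinationport:5001}));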
Hash based load balancing works well for the large number of small "Mice" flows on the network, but is less suitable for long lived, large "Elephant" flows. The hash function may assign multiple Elephant flows to the same physical port (even if other ports in the group are idle), resulting in congestion and poor network performance.
Figure 5: Long vs Short flows (from The Nature of Datacenter Traffic: Measurements & Analysis)
The traffic engineering controller uses the ACL Policy Flow Table to manage Elephant flows, ensuring that they don't interfere with latency sensitive Mice and are evenly distributed across the available paths - see Marking large flows and ECMP load balancing.
Figure 6: Hybrid Programmable Forwarding Plane, David Ward, ONF Summit, 2011
Integrated hybrid OpenFlow 1.0 is an effective mechanism for exposing the ACL Policy Flow Table to an external controller:
  • Simple, no change to normal forwarding behavior, can be combined with any of the mechanisms used to manage the Routing and ECMP Group tables listed above.
  • Efficient, the Routing and ECMP Group tables handle most flows, while OpenFlow is used to control the ACL Policy Flow Table and selectively override forwarding of specific flows (block, mark, steer, rate-limit), maximizing the effectiveness of the limited number of entries available - see the sketch after this list.
  • Scalable, most flows are handled by the existing control plane and OpenFlow is only used when the controller wants to make an exception.
  • Robust, if the controller fails the network keeps forwarding.
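To make the exception model concrete, the sketch below shows the kind of rule a controller might push to the ACL Policy Flow Table, expressed in the rule format used by the sFlow-RT examples that appear later on this page; the flow being matched is made up for illustration and only actions that appear in those examples (marking and dropping) are shown.
// Hypothetical selective override, in the rule format used by the sFlow-RT
// setOfRule() examples later on this page. The flow being matched
// (10.0.0.1 -> 10.0.1.1, TCP 49240 -> 5001) is made up for illustration.
var markRule = {
  priority: 1000, idleTimeout: 2,
  match: {eth_type: 2048, ip_proto: 6,
          ip_src: '10.0.0.1', ip_dst: '10.0.1.1',
          tcp_src: 49240, tcp_dst: 5001},
  actions: ["set_ip_dscp=32", "output=normal"]  // mark the flow, then forward normally
};
// An empty action list drops the flow instead (see the DDoS mitigation example further down this page):
//   actions: []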
The control protocol is only half the story. An effective measurement protocol is needed to rapidly identify network hot spots, available resources, and large flows so that the controller can identify which flows need to be managed and where to apply the controls.

Measurement Protocol

The Broadcom white paper is limited in its discussion of measurement, but it does list four ways of detecting large flows:
  1. A priori
  2. Monitor end host socket buffers
  3. Maintain per flow statistics in network
  4. sFlow
The first two methods involve signaling the arrival of large flows to the network from the hosts. Both methods have practical difficulties in that they require that every application and / or host implement the measurements and communicate them to the fabric controller - a difficult challenge in a heterogeneous environment. However, the more fundamental problem is that while both methods can usefully identify the arrival of large flows, they don't provide sufficient information for the fabric controller to take action since it also needs to know the load on all the links in the fabric.

The requirement for end to end visibility can only be met if the instrumentation is built into the network devices, which leads to options 3 and 4. Option 3 would require an entry in the ACL table for each flow and the Broadcom paper points out that this approach does not scale.

The solution to the measurement challenge is option 4. Support for the multi-vendor sFlow protocol is built into the Broadcom ASICs, is completely independent of the forwarding tables, and can be enabled on all ports and all switches to provide the end to end visibility needed for effective control.
Figure 7: Custom vs. merchant silicon traffic measurement
Figure 7 compares traffic measurement on legacy custom ASIC based switches with standard sFlow measurements supported by merchant silicon vendors. The custom ASIC based switch, shown on top, performs many of the traffic flow analysis functions in hardware. In contrast, merchant silicon based switches shift flow analysis to external software, implementing only the essential measurement functions required for wire speed performance in silicon.

Figure 7 lists a number of benefits that result from moving flow analysis from the custom ASIC to external software, but in the context of large flow traffic engineering the real-time detection of flows made possible by an external flow cache is essential if the traffic engineering controller is to be effective - see Rapidly detecting large flows, sFlow vs. NetFlow/IPFIX.
Figure 8: sFlow-RT feedback controller
Figure 8 shows a fully instantiated SDN feedback controller. The sFlow-RT controller leverages the sFlow and OpenFlow standards to optimize the performance of fabrics built using commodity switches. Practical applications of the sFlow-RT controller have already been demonstrated, including load balancing of large flows on multi-path networks, large flow marking, and DDoS mitigation (see the articles that follow).
While the industry at large appears to be moving to the Edge / Fabric architecture shown in Figure 2, Cisco's Application Centric Infrastructure (ACI) is an anomaly. ACI is a tightly integrated proprietary solution; the Cisco Application Policy Infrastructure Controller (APIC) uses the Cisco OpFlex protocol to manage Cisco Nexus 9000 switches and Cisco AVS virtual switches. The Cisco Nexus 9000 switches are based on Broadcom silicon and provide an interoperable NX-OS mode; however, line cards that include an Application Leaf Engine (ALE) ASIC alongside the Broadcom ASIC are required to support ACI mode. The ALE provides visibility and control features for large flow load balancing and prioritization - both of which can be achieved using standard protocols to manage the capabilities of the Broadcom ASIC.

It will be interesting to see whether ACI is able to compete with modular, low cost, solutions based on open standards and commodity hardware. Cisco has offered its customers a choice and given the compelling value of open platforms I expect many will choose not to be locked into the proprietary ACI solution and will favor NX-OS mode on the Nexus 9000 series, pushing Cisco to provide the full set of open APIs currently available on the Nexus 3000 series (sFlow, OpenFlow, Puppet, Python etc.).
Figure 9: Move communicating virtual machines together to reduce network traffic (from NUMA)
Finally, SDN is only one piece of a larger effort to orchestrate network, compute and storage resources to create a software defined data center (SDDC). For example, Figure 9 shows how network analytics from the fabric controller can be used to move virtual machines (e.g. by integrating with OpenStack APIs) to reduce application response times and network traffic. More broadly, feedback control allows efficient matching of resources to workloads and can dramatically increase the efficiency of the data center - see Workload placement.

Tuesday, May 13, 2014

Load balancing large flows on multi-path networks

Figure 1: Active control of large flows in a multi-path topology
Figure 1 shows initial results from the Mininet integrated hybrid OpenFlow testbed demonstrating that active steering of large flows using a performance aware SDN controller significantly improves network throughput of multi-path network topologies.
Figure 2: Two path topology
The graph in Figure 1 summarizes results from topologies with 2, 3 and 4 equal cost paths. For example, the Mininet topology in Figure 2 has two equal cost paths of 10Mbit/s (shown in blue and red). The iperf traffic generator was used to create a continuous stream of 20 second flows from h1 to h3 and from h2 to h4. If traffic were perfectly balanced, each flow would achieve 10Mbit/s throughput. However, Figure 1 shows that the throughput obtained using hash based ECMP load balancing is approximately 6.8Mbit/s. Interestingly, the average link throughput decreases as additional paths are added, dropping to approximately 6.2Mbit/s with four equal cost paths (see the blue bars in Figure 1).

To ensure that packets in a flow arrive in order at their destination, switch s3 computes a hash function over selected fields in the packets (e.g. source and destination IP addresses + source and destination TCP ports) and picks a link based on the value of the hash, e.g.
index = hash(packet fields) % linkgroup.size
selected_link = linkgroup[index]
The drop in throughput occurs when two or more large flows are assigned to the same link by the hash function and must compete for bandwidth.
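A simple model shows why: with two flows hashed independently onto two 10Mbit/s links there is a 50% chance they collide, so the expected per-flow throughput is at best 0.5 × 10 + 0.5 × 5 = 7.5Mbit/s, and collisions become more likely as paths and flows are added. The following sketch is a simplified Monte Carlo illustration of this effect, assuming one flow per path and fair sharing of any link that carries multiple flows; it ignores TCP dynamics, which push the measured numbers in Figure 1 lower still.
// Simplified model of hash based ECMP: n flows are hashed independently onto
// n equal cost 10Mbit/s links and flows sharing a link split its bandwidth
// evenly. TCP dynamics are ignored; this only illustrates the trend in Figure 1.
function averageThroughput(paths, trials) {
  var capacity = 10, total = 0;
  for (var t = 0; t < trials; t++) {
    var load = [], choice = [];
    for (var i = 0; i < paths; i++) load.push(0);
    for (var f = 0; f < paths; f++) {                // one flow per path
      var link = Math.floor(Math.random() * paths);  // random hash assignment
      load[link]++;
      choice.push(link);
    }
    for (var j = 0; j < paths; j++) total += capacity / load[choice[j]];  // fair share of the chosen link
  }
  return total / (trials * paths);
}

[2, 3, 4].forEach(function(n) {
  console.log(n + ' paths: ~' + averageThroughput(n, 100000).toFixed(1) + ' Mbit/s');
});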
Figure 3: Performance optimizing hybrid OpenFlow controller
Performance optimizing hybrid OpenFlow controller describes how the sFlow and OpenFlow standards can be combined to provide analytics driven feedback control to automatically adapt resources to changing demand. In this example, the controller has been programmed to detect large flows arriving on busy links and steer them to a less congested alternative path. The results shown in Figure 1 demonstrate that actively steering the large flows increases average link throughput by between 17% and 20% (see the red bars).
These results were obtained using a very simple initial control scheme and there is plenty of scope for further improvement, since a 50-60% increase in throughput over hash based ECMP load balancing is theoretically possible based on the results from these experiments.
This solution easily scales to 10G data center fabrics. Support for the sFlow standard is included in most vendors' switches (Alcatel-Lucent, Arista, Brocade, Cisco, Dell, Extreme, HP, Huawei, IBM, Juniper, Mellanox, ZTE, etc.), providing data center wide visibility - see Drivers for growth. Combined with the increasing maturity of the OpenFlow standard and growing vendor support for it, this provides the real-time control of packet forwarding needed to adapt the network to changing traffic. Finally, flow steering is one of a number of techniques that combine to amplify the performance gains delivered by the controller; other techniques include large flow marking, DDoS mitigation, and workload placement.

Wednesday, April 23, 2014

Mininet integrated hybrid OpenFlow testbed

Figure 1: Hybrid Programmable Forwarding Planes
Integrated hybrid OpenFlow combines OpenFlow and existing distributed routing protocols to deliver robust software defined networking (SDN) solutions. Performance optimizing hybrid OpenFlow controller describes how the sFlow and OpenFlow standards combine to deliver visibility and control to address challenges including: DDoS mitigation, ECMP load balancing, LAG load balancing, and large flow marking.

A number of vendors support sFlow and integrated hybrid OpenFlow today, examples described on this blog include: Alcatel-Lucent, Brocade, and Hewlett-Packard. However, building a physical testbed is expensive and time consuming. This article describes how to build an sFlow and hybrid OpenFlow testbed using free Mininet network emulation software. The testbed emulates ECMP leaf and spine data center fabrics and provides a platform for experimenting with analytics driven feedback control using the sFlow-RT hybrid OpenFlow controller.

First, build an Ubuntu 13.04 / 13.10 virtual machine, then follow the instructions for installing Mininet - Option 3: Installation from Packages.

Next, install an Apache web server:
sudo apt-get install apache2
Install the sFlow-RT integrated hybrid OpenFlow controller, either on the Mininet virtual machine, or on a different system (Java 1.6+ is required to run sFlow-RT):
wget http://www.inmon.com/products/sFlow-RT/sflow-rt.tar.gz
tar -xvzf sflow-rt.tar.gz
Copy the leafandspine.py script from the sflow-rt/extras directory to the Mininet virtual machine.

The following options are available:
./leafandspine.py --help
Usage: leafandspine.py [options]

Options:
  -h, --help            show this help message and exit
  --spine=SPINE         number of spine switches, default=2
  --leaf=LEAF           number of leaf switches, default=2
  --fanout=FANOUT       number of hosts per leaf switch, default=2
  --collector=COLLECTOR
                        IP address of sFlow collector, default=127.0.0.1
  --controller=CONTROLLER
                        IP address of controller, default=127.0.0.1
  --topofile=TOPOFILE   file used to write out topology, default topology.txt
Figure 2 shows a simple leaf and spine topology consisting of four hosts and four switches:
Figure 2: Simple leaf and spine topology
The following command builds the topology and specifies a remote host (10.0.0.162) running sFlow-RT as the hybrid OpenFlow controller and sFlow collector:
sudo ./leafandspine.py --collector 10.0.0.162 --controller 10.0.0.162 --topofile /var/www/topology.json
Note: All the links are configured to 10Mbit/s and the sFlow sampling rate is set to 1-in-10. These settings are equivalent to a 10Gbit/s network with a 1-in-10,000 sampling rate - see Large flow detection.
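The scaling is proportional to link speed: 1-in-10 on a 10Mbit/s link and 1-in-10,000 on a 10Gbit/s link both work out to roughly one sample per megabit per second of link capacity. The following rule-of-thumb helper is an assumption based only on the numbers in the note above, not an sFlow-RT API:
// Rule of thumb implied by the note above: sampling rate scales with link speed,
// so 1-in-10 at 10Mbit/s is equivalent to 1-in-10,000 at 10Gbit/s.
function equivalentSamplingRate(linkSpeedBitsPerSecond) {
  return linkSpeedBitsPerSecond / 1000000;  // 1-in-N, where N = link speed in Mbit/s
}
console.log(equivalentSamplingRate(10 * 1000000));     // 10     (10Mbit/s link)
console.log(equivalentSamplingRate(10 * 1000000000));  // 10000  (10Gbit/s link)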

The network topology is written to /var/www/topology.json, making it accessible through HTTP. For example, the following command retrieves the topology from the Mininet VM (10.0.0.61):
curl http://10.0.0.61/topology.json
{"nodes": {"s3": {"ports": {"s3-eth4": {"ifindex": "392", "name": "s3-eth4"}, "s3-eth3": {"ifindex": "390", "name": "s3-eth3"}, "s3-eth2": {"ifindex": "402", "name": "s3-eth2"}, "s3-eth1": {"ifindex": "398", "name": "s3-eth1"}}, "tag": "edge", "name": "s3", "agent": "10.0.0.61", "dpid": "0000000000000003"}, "s2": {"ports": {"s2-eth1": {"ifindex": "403", "name": "s2-eth1"}, "s2-eth2": {"ifindex": "405", "name": "s2-eth2"}}, "name": "s2", "agent": "10.0.0.61", "dpid": "0000000000000002"}, "s1": {"ports": {"s1-eth1": {"ifindex": "399", "name": "s1-eth1"}, "s1-eth2": {"ifindex": "401", "name": "s1-eth2"}}, "name": "s1", "agent": "10.0.0.61", "dpid": "0000000000000001"}, "s4": {"ports": {"s4-eth2": {"ifindex": "404", "name": "s4-eth2"}, "s4-eth3": {"ifindex": "394", "name": "s4-eth3"}, "s4-eth1": {"ifindex": "400", "name": "s4-eth1"}, "s4-eth4": {"ifindex": "396", "name": "s4-eth4"}}, "tag": "edge", "name": "s4", "agent": "10.0.0.61", "dpid": "0000000000000004"}}, "links": {"s2-eth1": {"ifindex1": "403", "ifindex2": "402", "node1": "s2", "node2": "s3", "port2": "s3-eth2", "port1": "s2-eth1"}, "s2-eth2": {"ifindex1": "405", "ifindex2": "404", "node1": "s2", "node2": "s4", "port2": "s4-eth2", "port1": "s2-eth2"}, "s1-eth1": {"ifindex1": "399", "ifindex2": "398", "node1": "s1", "node2": "s3", "port2": "s3-eth1", "port1": "s1-eth1"}, "s1-eth2": {"ifindex1": "401", "ifindex2": "400", "node1": "s1", "node2": "s4", "port2": "s4-eth1", "port1": "s1-eth2"}}}
Don't start sFlow-RT yet; it should only be started after Mininet has finished building the topology.

Verify connectivity before starting sFlow-RT:
mininet> pingall
*** Ping: testing ping reachability
h1 -> h2 h3 h4 
h2 -> h1 h3 h4 
h3 -> h1 h2 h4 
h4 -> h1 h2 h3 
*** Results: 0% dropped (12/12 received)
This test demonstrates that the Mininet topology has been constructed with a set of default forwarding rules that provide connectivity without the need for an OpenFlow controller - emulating the behavior of a network of integrated hybrid OpenFlow switches.

The following sFlow-RT script ecmp.js demonstrates ECMP load balancing in the emulated network:
// Define large flow as greater than 1Mbits/sec for 1 second or longer
var bytes_per_second = 1000000/8;
var duration_seconds = 1;

var top = JSON.parse(http("http://10.0.0.61/topology.json"));
setTopology(top);

setFlow('tcp',
 {keys:'ipsource,ipdestination,tcpsourceport,tcpdestinationport',
  value:'bytes', t:duration_seconds}
);

setThreshold('elephant',
 {metric:'tcp', value:bytes_per_second, byFlow:true, timeout:2}
);

setEventHandler(function(evt) {
 var rec = topologyInterfaceToLink(evt.agent,evt.dataSource);
 if(!rec || !rec.linkname) return;
 var link = topologyLink(rec.linkname);
 logInfo(link.node1 + "-" + link.node2 + " " + evt.flowKey);
},['elephant']);
Modify the sFlow-RT start.sh script to include the following arguments:
RT_OPTS="-Dopenflow.start=yes -Dopenflow.flushRules=no"
SCRIPTS="-Dscript.file=ecmp.js"
Some notes on the script:
  1. The topology is retrieved by making an HTTP request to the Mininet VM (10.0.0.61)
  2. The 1Mbits/s threshold for large flows was selected because it represents 10% of the bandwidth of the 10Mbits/s links in the emulated network
  3. The event handler prints the link the flow traversed - identifying the link by the pair of switches it connects
Start sFlow-RT:
./start.sh
Now generate some large flows between h1 and h3 using the Mininet iperf command:
mininet> iperf h1 h3
*** Iperf: testing TCP bandwidth between h1 and h3
*** Results: ['9.58 Mbits/sec', '10.8 Mbits/sec']
mininet> iperf h1 h3
*** Iperf: testing TCP bandwidth between h1 and h3
*** Results: ['9.58 Mbits/sec', '10.8 Mbits/sec']
mininet> iperf h1 h3
*** Iperf: testing TCP bandwidth between h1 and h3
*** Results: ['9.59 Mbits/sec', '10.3 Mbits/sec']
The following results were logged by sFlow-RT:
2014-04-21T19:00:36-0700 INFO: ecmp.js started
2014-04-21T19:01:16-0700 INFO: s1-s3 10.0.0.1,10.0.1.1,49240,5001
2014-04-21T19:01:16-0700 INFO: s1-s4 10.0.0.1,10.0.1.1,49240,5001
2014-04-21T20:53:19-0700 INFO: s2-s4 10.0.0.1,10.0.1.1,49242,5001
2014-04-21T20:53:19-0700 INFO: s2-s3 10.0.0.1,10.0.1.1,49242,5001
2014-04-21T20:53:29-0700 INFO: s1-s3 10.0.0.1,10.0.1.1,49244,5001
2014-04-21T20:53:29-0700 INFO: s1-s4 10.0.0.1,10.0.1.1,49244,5001
The results demonstrate that the emulated leaf and spine network is performing equal cost multi-path (ECMP) forwarding - different flows between the same pair of hosts take different paths across the fabric (the highlighted lines correspond to the paths shown in Figure 2).
Open vSwitch in Mininet is the key to this emulation, providing sFlow and multi-path forwarding support.
The following script implements the large flow marking example described in Performance optimizing hybrid OpenFlow controller:
include('extras/leafandspine-hybrid.js');

// Define large flow as greater than 1Mbits/sec for 1 second or longer
var bytes_per_second = 1000000/8;
var duration_seconds = 1;

var idx = 0;

var top = JSON.parse(http("http://10.0.0.61/topology.json"));
setTopology(top);

setFlow('tcp',
 {keys:'ipsource,ipdestination,tcpsourceport,tcpdestinationport',
  value:'bytes', t:duration_seconds}
);

setThreshold('elephant',
 {metric:'tcp', value:bytes_per_second, byFlow:true, timeout:4}
);

setEventHandler(function(evt) {
 var agent = evt.agent;
 var ds = evt.dataSource;
 if(topologyInterfaceToLink(agent,ds)) return;

 var port = ofInterfaceToPort(agent,ds);
 if(port) {
  var dpid = port.dpid;
  var id = "mark" + idx++;
  var k = evt.flowKey.split(',');
  var rule= {
    priority:1000, idleTimeout:2,
    match:{eth_type:2048, ip_proto:6, ip_src:k[0], ip_dst:k[1],
           tcp_src:k[2], tcp_dst:k[3]},
    actions:["set_ip_dscp=32","output=normal"]
  };
  setOfRule(dpid,id,rule);
 }
},['elephant']);

setFlow('tos0',{value:'bytes',filter:'ipdscp=0',t:1});
setFlow('tos128',{value:'bytes',filter:'ipdscp=32',t:1});
Some notes on the script:
  1. The topologyInterfaceToLink() function looks up link information based on agent and interface. The event handler uses this function to exclude inter-switch links, applying controls to ingress ports only.
  2. The OpenFlow rule priority for rules created by controller scripts must be greater than 500 to override the default rules created by leafandspine.py
  3. The tos0 and tos128 flow definitions have been added so that the re-marking can be seen.
Restart sFlow-RT with the new script and use a web browser to view the default tos0 and the re-marked tos128 traffic.
Figure 3: Marking large flows
Use iperf to generate traffic between h1 and h3 (the traffic needs to cross more than one switch so it can be observed before and after marking). The screen capture in Figure 3 demonstrates that the controller immediately detects and marks large flows.

Saturday, April 19, 2014

Configuring Mellanox switches

The following commands configure a Mellanox switch (10.0.0.252) to sample packets at 1-in-10000, poll counters every 30 seconds and send sFlow to an analyzer (10.0.0.50) using the default sFlow port 6343:
sflow enable
sflow agent-ip 10.0.0.252
sflow collector-ip 10.0.0.50
sflow sampling-rate 10000
sflow counter-poll-interval 30
For each interface:
interface ethernet 1/1 sflow enable
A previous posting discussed the selection of sampling rates. Additional information can be found on the Mellanox web site.

See Trying out sFlow for suggestions on getting started with sFlow monitoring and reporting.

Sunday, April 6, 2014

DDoS mitigation hybrid OpenFlow controller

Performance optimizing hybrid OpenFlow controller describes the growing split in the SDN controller market between edge controllers using virtual switches to deliver network virtualization (e.g. VMware NSX, Nuage Networks, Juniper Contrail, etc.) and fabric controllers that optimize performance of the physical network. The article provides an example using InMon's sFlow-RT controller to detect and mark large "elephant" flows so that they don't interfere with latency sensitive small "mice" flows.

This article describes an additional example, using the sFlow-RT controller to implement the ONS 2014 SDN Idol winning distributed denial of service (DDoS) mitigation solution - Real-time SDN Analytics for DDoS mitigation.
Figure 1: ISP/IX Market Segment
Figure 1 shows how service providers are ideally positioned to mitigate large flood attacks directed at their customers. The mitigation solution involves an SDN controller that rapidly detects and filters out attack traffic and protects the customer's Internet access.
Figure 2: Novel DDoS Mitigation solution using Real-time SDN Analytics
Figure 2 shows the elements of the control system in the SDN Idol demonstration. The addition of an embedded OpenFlow controller in sFlow-RT allows the entire DDoS mitigation system to be collapsed into the following sFlow-RT JavaScript application:
// Define large flow as greater than 100Mbits/sec for 1 second or longer
var bytes_per_second = 100000000/8;
var duration_seconds = 1;

var idx = 0;

setFlow('udp_target',
 {keys:'ipdestination,udpsourceport',
  value:'bytes', filter:'direction=egress', t:duration_seconds}
);

setThreshold('attack',
 {metric:'udp_target', value:bytes_per_second, byFlow:true, timeout:2, 
  filter:{ifspeed:[1000000000]}}
);

setEventHandler(function(evt) {
 var agent = evt.agent;
 var ports = ofInterfaceToPort(agent);
 if(ports && ports.length == 1) {
  var dpid = ports[0].dpid;
  var id = "drop" + idx++;
  var k = evt.flowKey.split(',');
  var rule= {
   priority:500, idleTimeout:20, hardTimeout:3600,
   match:{dl_type:2048, nw_proto:17, nw_dst:k[0], tp_src:k[1]},
   actions:[]
  };
  setOfRule(dpid,id,rule);
 }
},['attack']);
The following command line arguments load the script and enable OpenFlow on startup:
-Dscript.file=ddos.js -Dopenflow.start=yes
Some notes on the script:
  1. The 100Mbits/s threshold for large flows was selected because it represents 10% of the bandwidth of the 1Gigabit access ports on the network
  2. The setFlow filter specifies egress flows since the goal is to filter flows as they converge on customer facing egress ports.
  3. The setThreshold filter specifies that thresholds are only applied to 1Gigabit access ports
  4. The OpenFlow rule generated in setEventHandler matches the destination address and source port associated with the DDoS attack and includes an idleTimeout of 20 seconds and a hardTimeout of 3600 seconds. This means that OpenFlow rules are automatically removed by the switch when the flow becomes idle without any further intervention from the controller. If the attack is still in progress when the hardTimeout expires and the rule is removed, the attack will immediately be detected by the controller and a new rule will be installed.
The nping tool can be used to simulate DDoS attacks to test the application. The following script simulates a series of DNS reflection attacks:
while true; do nping --udp --source-port 53 --data-length 1400 --rate 2000 --count 700000 --no-capture --quiet 10.100.10.151; sleep 40; done
The following screen capture shows a basic test setup and results:
The chart at the top right of the screen capture shows attack traffic mixed with normal traffic arriving at the edge switch. The switch sends a continuous stream of measurements to the sFlow-RT controller running the DDoS mitigation application. When an attack is detected, an OpenFlow rule is pushed to the switch to block the traffic. The chart at the bottom right trends traffic on the protected customer link, showing that normal traffic is left untouched, but attack traffic is immediately detected and removed from the link.
Note: While this demonstration only used a single switch, the solution easily scales to hundreds of switches and thousands of edge ports.
This example, along with the large flow marking example, demonstrates that basing the sFlow-RT fabric controller on widely supported sFlow and OpenFlow standards and including an open, standards based, programming environment (JavaScript / ECMAScript) makes sFlow-RT an ideal platform for rapidly developing and deploying traffic engineering SDN applications in existing networks.

Thursday, April 3, 2014

Cisco, ACI, OpFlex and OpenDaylight

Cisco's April 2nd, 2014 announcement - Cisco and Industry Leaders Will Deliver Open, Multi-Vendor, Standards-Based Networks for Application Centric Infrastructure with OpFlex Protocol - has drawn mixed reviews from industry commentators.

In, Cisco Submits Its (Very Different) SDN to IETF & OpenDaylight, SDNCentral editor Craig Matsumoto comments, "You know how, early on, people were all worried Cisco would 'take over' OpenDaylight? This is pretty much what they were talking about. It’s not a 'takeover,' literally, but OpFlex and the group policy concept steer OpenDaylight into a new direction that it otherwise wouldn’t have, one that Cisco happens to already have taken."

CIMI Corp. President, Tom Nolle, remarks "We’re all in business to make money, and if Cisco takes a position in a key market like SDN that seems to favor…well…doing nothing much different, you have to assume they have good reason to believe that their approach will resonate with buyers." - Cisco’s OpFlex: We Have Sound AND Fury

This article will look at some of the architectural issues raised by Cisco's announcement, drawing on the press release and Cisco's published proposal documents.
The diagram at the top of this article illustrates the architecture of Cisco's OpenDaylight proposal. The crack in the diagram was added to show the split between Cisco's proposed additions and existing OpenDaylight components. It is clear that Cisco has simply bolted a new controller to the side of the existing OpenDaylight controller: the ACI controller on the left has a native Southbound API (OpFlex) and treats the existing OpenDaylight controller as a Southbound plug-in (the arrow that connects the Affinity Decomposer module to the existing Affinity Service module). The existing OpenDaylight controller is marginalized by relegating its role to managing Traditional Network Elements, implying that next generation SDN revolves around devices that support the OpFlex protocol exclusively.

What is the function of Cisco's new controller? The press release states, ACI is the first data center and cloud solution to offer full visibility and integrated management of both physical and virtual networked IT resources, accelerating application deployment through a dynamic, application-aware network policy model. However, if you look a little deeper - Cisco Application Policy Infrastructure Controller Data Center Policy Model - the underlying architecture of ACI is based on promise theory.

Promise theory underpins many data center orchestration tools, including: CFEngine, Puppet, Chef, Ansible, and Salt. These automation tools are an important part of the DevOps toolkit - providing a way to rapidly reconfigure resources and roll out new services. Does it make sense to create a new controller and protocol just to manage network equipment?
The DevOps movement has revolutionized the data center by breaking down silos, merging application development and IT operations to increase the speed and agility of service creation and delivery.
An alternative to creating a new, network only, orchestration system is to open up network equipment to the orchestration tools that DevOps teams already use. The article, Dell, Cumulus, Open Source, Open Standards, and Unified Management, discusses the trend toward open, Linux-based, switch platforms. An important benefit of this move to open networking platforms is that the same tools that are today used to manage Linux servers can also be used to manage the configuration of the network - for example, Cumulus Architecture currently lists Puppet, Chef and CFEngine as options for network automation. Eliminating the need to deploy and coordinate separate network and system orchestration tools significantly reduces operational complexity and increases agility; breaking down the network silo to facilitate the creation of a NetDevOps team.
While it might be argued that Cisco's ACI/OpFlex is better at configuring network devices than existing DevOps tools, the fierce competition and rapid pace of innovation in the DevOps space is likely to outpace Cisco's efforts to standardize the OpFlex protocol in the IETF.
Finally, it is not clear how serious Cisco is about its ACI architecture. Cisco Nexus 3000 series switches are based on standard merchant silicon hardware and support open, multi-vendor, standards and APIs, including: sFlow, OpenFlow, Linux Containers, XML, JSON, Puppet, Chef, Python, and OpenStack. Nexus 9000 series switches, the focus of Cisco's ACI strategy, include custom Cisco hardware to support ACI but also contain merchant silicon, allowing the switches to be run in either ACI or NX-OS mode. The value of open platforms is compelling and I expect Cisco's customers will favor NX-OS mode on the Nexus 9000 series and push Cisco to provide feature parity with the Nexus 3000 series.

Tuesday, March 25, 2014

Integrated hybrid OpenFlow control of HP switches

Performance optimizing hybrid OpenFlow controller describes InMon's sFlow-RT controller. The controller makes use of the sFlow and OpenFlow standards and is optimized for real-time traffic engineering applications that manage large traffic flows, including: DDoS mitigation, ECMP load balancing, LAG load balancing, and large flow marking.

The previous article provided an example of large flow marking using an Alcatel-Lucent OmniSwitch 6900 switch. This article discusses how to replicate the example using HP Networking switches.

At present, the following HP switch models are listed as having OpenFlow support:
  • FlexFabric 12900 Switch Series
  • 12500 Switch Series
  • FlexFabric 11900 Switch Series
  • 8200 zl Switch Series
  • HP FlexFabric 5930 Switch Series
  • 5920 Switch Series
  • 5900 Switch Series
  • 5400 zl Switch Series
  • 3800 Switch Series
  • HP 3500 and 3500 yl Switch Series
  • 2920 Switch Series 
Note: All of the above HP switches (and many others) support the sFlow standard - see sFlow Products: Network Equipment @ sFlow.org.

HP's OpenFlow implementation supports integrated hybrid mode - provided the OpenFlow controller pushes a default low priority OpenFlow rule that matches all packets and applies the NORMAL action (i.e. instructs the switch to apply default switching / routing forwarding to the packets).
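Expressed in the rule format used by the script below, the catch-all rule would look something like the following sketch. This is conceptual; as described further down, sFlow-RT installs an equivalent rule automatically when the openflow.addNormal option is set, so it does not need to be created by hand.
// Conceptual sketch of the low priority catch-all rule needed for integrated
// hybrid operation, in the rule format used by the sFlow-RT scripts in this
// article. With -Dopenflow.addNormal=yes, sFlow-RT installs an equivalent
// rule automatically when the switch connects.
var normalRule = {
  priority: 0,               // lowest priority, matched only when no other rule matches
  match: {},                 // wildcard: matches all packets
  actions: ["output=normal"] // hand the packet back to normal switching / routing
};
// setOfRule(dpid, "default-normal", normalRule);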

In this example, an HP 5400 zl switch is used to run a slightly modified version of the sFlow-RT controller JavaScript application described in Performance optimizing hybrid OpenFlow controller:
// Define large flow as greater than 100Mbits/sec for 0.2 seconds or longer
var bytes_per_second = 100000000/8;
var duration_seconds = 0.2;

var idx = 0;

setFlow('tcp',
 {keys:'ipsource,ipdestination,tcpsourceport,tcpdestinationport',
  value:'bytes', filter:'direction=ingress', t:duration_seconds}
);

setThreshold('elephant',
 {metric:'tcp', value:bytes_per_second, byFlow:true, timeout:2, 
  filter:{ifspeed:[1000000000]}}
);

setEventHandler(function(evt) {
 var agent = evt.agent;
 var ports = ofInterfaceToPort(agent);
 if(ports && ports.length == 1) {
  var dpid = ports[0].dpid;
  var id = "mark" + idx++;
  var k = evt.flowKey.split(',');
  var rule= {
   priority:500, idleTimeout:20,
   match:{dl_type:2048, nw_proto:6, nw_src:k[0], nw_dst:k[1],
          tp_src:k[2], tp_dst:k[3]},
   actions:["set_nw_tos=128","output=normal"]
  };
  setOfRule(dpid,id,rule);
 }
},['elephant']);
The idleTimeout was increased from 2 to 20 seconds since the switch has a default Probe Interval of 10 seconds (the interval between OpenFlow counter updates). If the OpenFlow rule idleTimeout is set shorter than the Probe Interval the switch will remove the OpenFlow rule before the flow ends.
Mar. 27, 2014 Update: The HP Switch Software OpenFlow Administrator's Guide K/KA/WB 15.14, Appendix B Implementation Notes, describes the effect of the probe interval on idle timeouts and explains how to change the default (using openflow hardware statistics refresh rate), but warns that shorter refresh rates will increase CPU load on the switch.
The following command line arguments load the script and enable OpenFlow on startup:
-Dscript.file=ofmark.js \
-Dopenflow.start=yes \
-Dopenflow.addNormal=yes
The additional -Dopenflow.addNormal=yes argument instructs the sFlow-RT controller to install the wildcard OpenFlow NORMAL rule automatically when the switch connects.

The screen capture at the top of the page shows a mixture of small "mice" flows and large "elephant" flows generated by a server connected to the HP 5406 zl switch. The graph at the bottom right shows the mixture of unmarked traffic being sent to the switch. The sFlow-RT controller receives a stream of sFlow measurements from the switch and detects each elephant flow in real-time, immediately installing an OpenFlow rule that matches the flow and instructs the switch to mark it by setting the IP type of service bits. The traffic upstream of the switch is shown in the top right chart and it can be clearly seen that each elephant flow has been identified and marked, while the mice have been left unmarked.
Note: While this demonstration only used a single switch, the solution easily scales to hundreds of switches and thousands of edge ports.
The results from the HP switch are identical to those obtained with the Alcatel-Lucent switch, demonstrating the multi-vendor interoperability provided by the sFlow and OpenFlow standards. In addition, sFlow-RT's support for an open, standards based, programming environment (JavaScript / ECMAScript) makes it an ideal platform for rapidly developing and deploying traffic engineering SDN applications in existing networks.