Thursday, April 25, 2013

SDN packet broker

Rich Groves described Microsoft's DEMon (Distributed Ethernet Monitoring) network at the recent Open Networking Summit. Rich described the challenges of capturing packets from Microsoft's cloud network, consisting of "thousands and thousands of 10Gig connections." The DEMon capture network was developed because existing solutions were too expensive and didn't scale.

The DEMon capture network makes use of inexpensive merchant silicon based switches as "programmable frame processors." The switches support OpenFlow, allowing an external controller to configure their filtering and forwarding behavior in order to capture and deliver packets to security, compliance, and performance monitoring applications.
Note: Lippis Report 207: The Killer SDN Applications: Network Virtualization and Visualization, describes network visualization as "an SDN killer application" and examines the impact this technology is having on incumbent vendors such as Gigamon, Ixia/Anue, VSS, cPacket, and NetScout.
Rich's talk has already been covered in the news, Microsoft uses OpenFlow SDN for network monitoring and analysis. However, this article focuses on the Proactively Finding Interesting Traffic use case described near the end of the talk that hasn't seen much coverage. The use case is an interesting example of the combined use of OpenFlow and sFlow to dramatically increase the capabilities of the monitoring system.
In the DEMon architecture, a layer of filter switches terminates all the monitor ports. By default the filter switches are configured to drop all packets, since the terabits per second of traffic arriving at the capture network would overwhelm the analysis tools.
Rich commented that, "It just turns out that when you get these merchant silicon switches, you get sFlow for free." Switches supporting the sFlow standard implement packet sampling in silicon, providing scalable, wire-speed monitoring that can be enabled on all the filter switches. Packet sampling works like a shrink ray, scaling down the size of the monitoring task so that a single sFlow collector can provide an overview of data center wide traffic.
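The scale-down is easy to quantify: with 1-in-N packet sampling, each exported sample represents roughly N packets on the wire, so the collector can estimate total traffic by multiplying back up. The sketch below works through the arithmetic; the sampling rate and sample stream are illustrative assumptions, not figures from the talk:

```python
# Estimate traffic from an sFlow sample stream. With 1-in-N packet
# sampling, each sample represents roughly N packets on the wire.
# The numbers below are illustrative, not from the DEMon talk.

sampling_rate = 10000        # 1-in-10000 packet sampling
samples_per_sec = 500        # samples arriving at the collector

# scaled-up estimate of traffic on the monitored links
estimated_pps = samples_per_sec * sampling_rate
print(estimated_pps)         # 5000000 packets/sec summarized by one collector
```

The collector handles 500 records per second while summarizing an estimated five million packets per second, which is why sampling lets a single analyzer keep an eye on the entire data center.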
Rich mentions using the sFlow-RT collector to define interesting flows and thresholds and trigger automatic packet captures using simple Python scripts, yielding "meaningful captures" as part of "a smarter way to find a needle in a haystack."
An analogy for the role that sFlow plays in steering the capture network is that of a finderscope, the small, wide-angle telescope used to provide an overview of the sky and guide the telescope to its target.

How easy is it to build a similar system within your own network? Although not mentioned by name, one can speculate based on the symbol on the slides that the controller is Big Tap from Big Switch Networks. Big Tap provides a high level RESTful API that can be used to direct packet capture. In addition, although the switch vendor wasn't named, there are a number of vendors selling top-of-rack switches based on the same merchant silicon. Virtually all of these switches support the sFlow standard and many also support OpenFlow, offering a range of choices.

While the DEMon capture network uses a separate filtering layer, the most cost-effective solution would be to use the top-of-rack switches as the filtering layer in the architecture. The simplest solution would be a hybrid OpenFlow deployment in which OpenFlow rules are used to selectively capture specific flows without affecting the normal delivery of the packets.
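As a sketch of what such a hybrid rule might look like, the structure below pairs an OpenFlow 1.0-style match with two actions: NORMAL forwarding, so delivery is unaffected, and an output to a port feeding the capture network. The field names follow common OpenFlow conventions, but the port number and the choice of flow are illustrative assumptions, not rules from the talk:

```python
# Illustrative hybrid OpenFlow rule: matched packets are forwarded
# normally AND copied to a capture port. Field names follow OpenFlow
# 1.0 conventions; port 48 and the SSH match are assumptions.

capture_rule = {
    'match': {
        'dl_type': 0x0800,   # IPv4
        'nw_proto': 6,       # TCP
        'tp_dst': 22         # SSH
    },
    'actions': [
        'NORMAL',            # normal forwarding is unaffected
        'output:48'          # copy to the port feeding the capture network
    ]
}
print(capture_rule['actions'])
```

Because the NORMAL action keeps the switch's regular forwarding pipeline in play, the capture rule can be added and removed on demand without disrupting production traffic.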
The diagram shows the elements of a performance aware software defined networking solution like the one described. The combination of sFlow monitoring and SDN controlled packet capture yields significant benefits:
  1. Offload: The capture network is a limited resource, both in terms of bandwidth and in the number of flows that can be simultaneously captured. Offloading as many tasks as possible to the sFlow analyzer frees up resources in the capture network, allowing them to be applied where they add the most value. A good sFlow analyzer delivers data center wide visibility that can address many traffic accounting, capacity planning and traffic engineering use cases. In addition, many packet analysis tools can accept sFlow data directly, for example the Wireshark shown on the DEMon slides, further reducing the cases where a full capture is required.
  2. Context: Data center wide monitoring using sFlow provides context for triggering packet capture. For example, sFlow monitoring might show an unusual packet size distribution for traffic to a particular service. Queries to the sFlow analyzer can identify the set of switches and ports involved in providing the service and identify a set of attributes that can be used to selectively capture the traffic (the capture filter needs to be expressed in terms of packet attributes that are understood by the OpenFlow protocol).
  3. DDoS: Certain classes of event, such as DDoS flood attacks, may be too large for the capture network to handle. An sFlow driven SDN controller can be used to mitigate DDoS flood attacks (see DDoS and OpenFlow 1.0 Actual Use-Case: RTBH of DDoS Traffic While Keeping the Target Online), freeing the capture network to focus on identifying more serious application layer attacks, see Gartner Says 25 Percent of Distributed Denial of Services Attacks in 2013 Will Be Application-Based.
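The parenthetical in point 2 is worth making concrete: if the flow keys chosen in the sFlow analyzer map directly onto OpenFlow match fields, a detected flow can be turned into a capture rule mechanically. The sketch below shows one such mapping; the flow definition follows the sFlow-RT conventions used later in this article, and the service address 10.0.0.1 is a hypothetical example:

```python
# Choose sFlow-RT flow keys that map directly onto OpenFlow 1.0 match
# fields, so a detected flow can be translated into a capture rule.
# The filter address 10.0.0.1 is a hypothetical service IP.

flow = {
    'keys': 'macsource,macdestination,ipsource,ipdestination,tcpdestinationport',
    'value': 'frames',
    'filter': 'ipdestination=10.0.0.1'
}

# sFlow-RT key name -> OpenFlow 1.0 match field
openflow_field = {
    'macsource': 'dl_src',
    'macdestination': 'dl_dst',
    'ipsource': 'nw_src',
    'ipdestination': 'nw_dst',
    'tcpdestinationport': 'tp_dst'
}

print([openflow_field[k] for k in flow['keys'].split(',')])
```

Keys such as application-layer attributes would be visible to sFlow but could not be expressed as an OpenFlow match, so restricting flow definitions to this mappable subset keeps every alert actionable by the capture controller.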
To illustrate the benefits of a combined solution, consider the challenge of monitoring tunneled traffic. Tunneled or encapsulated traffic is present in a wide variety of contexts, including: IPv6 migration (Teredo, 6-in-4, 4-in-6 etc.), network virtualization and layer-2 encapsulations (MPLS, Q-in-Q, TRILL, 802.1aq etc.). Capturing tunneled traffic is challenging for an OpenFlow based system since the switch hardware only examines the outer headers and cannot see traffic inside a tunnel. How can captures be triggered based on activity within tunnels?

The following Python script is based on the examples described in performance aware software defined networking and Down the rabbit hole. The script configures sFlow-RT to look for GRE tunnels containing traffic to TCP port 22, generating an alert that can be used to trigger a packet capture:
import requests
import json

groups = {'external':[''],'internal':['']}
flows = {'keys':'ipsource,ipdestination','value':'frames','filter':'stack=eth.ip.gre.ip.tcp&tcpdestinationport=22'}
threshold = {'metric':'watch','value':0}

rt = 'http://localhost:8008'

r = requests.put(rt + '/group/json',data=json.dumps(groups))
r = requests.put(rt + '/flow/watch/json',data=json.dumps(flows))
r = requests.put(rt + '/threshold/watch/json',data=json.dumps(threshold))

eventurl = rt + '/events/json?maxEvents=10&timeout=60'
eventID = -1
while True:
  r = requests.get(eventurl + "&eventID=" + str(eventID))
  if r.status_code != 200: break
  events = r.json()
  if len(events) == 0: continue
  eventID = events[0]["eventID"]
  for e in events:
    thresholdID = e['thresholdID']
    if "watch" == thresholdID:
      r = requests.get(rt + '/metric/' + e['agent'] + '/' + e['dataSource'] + '.' + e['metric'] + '/json')
      metrics = r.json()
      if len(metrics) > 0:
        evtMetric = metrics[0]
        evtKeys = evtMetric.get('topKeys',None)
        if(evtKeys and len(evtKeys) > 0):
          topKey = evtKeys[0]
          key = topKey.get('key', None)
          value = topKey.get('value',None)
          keys = key.split(",")
          match = [{
            "ether-type": "2048",      # IPv4
            "ip-proto": "47",          # GRE
            "src-ip": str(keys[0]),
            "src-ip-mask": "",
            "dst-ip": str(keys[1]),
            "dst-ip-mask": "",
            "src-tp-port": None,
            "dst-tp-port": None,
            "any-traffic": None
          }]
          print(match)
Running the script generates the following output as soon as a packet is detected:
$ python 
[{'src-tp-port': None, 'src-ip-mask': '', 'ip-proto': '47', 'ether-type': '2048', 'any-traffic': None, 'dst-ip': '', 'dst-ip-mask': '', 'src-ip': '', 'dst-tp-port': None}]
In a real deployment, the script would make a REST call to automatically trigger a packet capture, rather than simply printing out details of the tunnel. Triggering packet captures is just one example of an SDN application making use of real-time sFlow analytics, other examples include load balancing large flows and DDoS mitigation.
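As a sketch of what that REST call might look like, the helper below builds a JSON capture-session request from the match list printed by the script. The endpoint URL and session format are hypothetical assumptions, since the capture controller's actual API isn't described in the talk:

```python
import json

# Hypothetical capture-controller endpoint and session format; a real
# deployment would substitute the REST API of whatever controller
# manages the filter switches, e.g.:
#   requests.post(capture_api, data=capture_request(match),
#                 headers={'Content-Type': 'application/json'})
capture_api = 'http://capture-controller:8080/api/capture'  # assumption

def capture_request(match, duration=60):
    """Build the JSON body for a timed capture session from a match
    list like the one printed by the script above."""
    return json.dumps({'match': match, 'duration': duration})
```

Wiring this call into the event loop closes the feedback path: sFlow spots the interesting traffic, and the capture network records it without any operator intervention.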