Thursday, April 25, 2013

SDN packet broker

Rich Groves described Microsoft's DEMon (Distributed Ethernet Monitoring) network at the recent Open Networking Summit, outlining the challenges of capturing packets from Microsoft's cloud network, which consists of "thousands and thousands of 10Gig connections." The DEMon capture network was developed because existing solutions were too expensive and didn't scale.

The DEMon capture network makes use of inexpensive merchant silicon based switches as "programmable frame processors." The switches support OpenFlow, allowing an external controller to configure their filtering and forwarding behavior in order to capture and deliver packets to security, compliance, and performance monitoring applications.
Note: Lippis Report 207: The Killer SDN Applications: Network Virtualization and Visualization, describes network visualization as "an SDN killer application" and examines the impact this technology is having on incumbent vendors such as Gigamon, Ixia/Anue, VSS, cPacket, and NetScout.
Rich's talk has already been covered in the news, Microsoft uses OpenFlow SDN for network monitoring and analysis. However, this article focuses on the Proactively Finding Interesting Traffic use case described near the end of the talk, which hasn't seen much coverage. The use case is an interesting example of the combined use of OpenFlow and sFlow to dramatically increase the capabilities of the monitoring system.
In the DEMon architecture, a layer of filter switches terminates all the monitor ports. By default the filter switches are configured to drop all packets, since the terabits per second of traffic arriving at the capture network would overwhelm the analysis tools.
Rich commented that, "It just turns out that when you get these merchant silicon switches, you get sFlow for free." Switches supporting the sFlow standard implement packet sampling in silicon, providing scalable, wire-speed monitoring that can be enabled on all the filter switches. Packet sampling works like a shrink ray, scaling down the size of the monitoring task so that a single sFlow collector can provide an overview of data center wide traffic.
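To get a feel for the scale reduction (the numbers below are illustrative assumptions, not figures from the talk), the following back-of-the-envelope calculation estimates how packet sampling shrinks the monitoring load presented to the collector:
# Back-of-the-envelope estimate of the "shrink ray" effect of packet
# sampling. All numbers are illustrative assumptions.
ports = 10000              # monitored 10GbE ports
link_speed_bps = 10e9      # bits per second per port
utilization = 0.3          # assumed average link utilization
avg_packet_bytes = 800     # assumed average packet size
sampling_rate = 10000      # 1-in-N sampling configured on the switches

packets_per_second = ports * link_speed_bps * utilization / (avg_packet_bytes * 8)
samples_per_second = packets_per_second / sampling_rate

# the monitoring load is reduced by a factor equal to the sampling rate
print "packets/second across monitored ports: %.3g" % packets_per_second
print "sFlow samples/second sent to collector: %.3g" % samples_per_second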
Rich mentions using the sFlow-RT collector to define interesting flows and thresholds and trigger automatic packet captures using simple Python scripts, yielding "meaningful captures" as part of "a smarter way to find a needle in a haystack."
Finderscope
An analogy for the role that sFlow plays in steering the capture network is that of a finderscope, the small, wide-angle telescope used to provide an overview of the sky and guide the telescope to its target.

How easy is it to build a similar system within your own network? Although not mentioned by name, one can speculate based on the symbol on the slides that the controller is Big Tap from Big Switch Networks. Big Tap provides a high level RESTful API that can be used to direct packet capture. In addition, although the switch vendor wasn't named, there are a number of vendors selling top of rack switches based on the same merchant silicon. Virtually all of these switches support the sFlow standard and many also support OpenFlow, offering a range of choices.

While the DEMon capture network uses a separate filtering layer, the most cost-effective solution would be to use the top of rack switches as the filtering layer in the architecture. The simplest approach would be a hybrid OpenFlow deployment in which OpenFlow rules are used to selectively capture specific flows without affecting the normal delivery of the packets.
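To make the hybrid idea concrete, the sketch below shows what such a capture rule might look like. The field and action names are purely illustrative (not a specific controller's API); the key point is that the rule combines normal forwarding (the OpenFlow NORMAL output) with a second output action that copies matching packets to a monitor port.
# Illustrative hybrid OpenFlow capture rule - field and action names are
# for explanation only, not a specific controller's API.
capture_rule = {
  'match': {
    'ether-type': '2048',      # IPv4
    'ip-proto': '6',           # TCP
    'dst-ip': '10.0.0.151',    # hypothetical service address
    'dst-tp-port': '22'
  },
  'actions': [
    'output:NORMAL',           # packet continues to be forwarded normally
    'output:48'                # copy of the packet sent to monitor port 48
  ]
}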
The diagram shows the elements of a performance aware software defined networking solution like the one described. The combination of sFlow monitoring and SDN controlled packet capture yields significant benefits:
  1. Offload The capture network is a limited resource, both in terms of bandwidth and in the number of flows that can be simultaneously captured. Offloading as many tasks as possible to the sFlow analyzer frees up resources in the capture network, allowing them to be applied where they add the most value. A good sFlow analyzer delivers data center wide visibility that can address many traffic accounting, capacity planning and traffic engineering use cases. In addition, many packet analysis tools can accept sFlow data directly, for example Wireshark, shown on the DEMon slides, further reducing the cases where a full capture is required.
  2. Context Data center wide monitoring using sFlow provides context for triggering packet capture. For example, sFlow monitoring might show an unusual packet size distribution for traffic to a particular service. Queries to the sFlow analyzer can identify the set of switches and ports involved in providing the service and identify a set of attributes that can be used to selectively capture the traffic (the capture filter needs to be expressed in terms of packet attributes that are understood by the OpenFlow protocol). A small example of this kind of query is sketched after this list.
  3. DDoS Certain classes of event such as DDoS flood attacks may be too large for the capture network to handle. An sFlow driven SDN controller can be used to mitigate DDoS flood attacks (see DDoS and OpenFlow 1.0 Actual Use-Case: RTBH of DDoS Traffic While Keeping the Target Online), freeing the capture network to focus on identifying more serious application layer attacks, see Gartner Says 25 Percent of Distributed Denial of Services Attacks in 2013 Will Be Application-Based.
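As a small example of the Context point above, the sketch below asks the sFlow analyzer which switches and ports are currently carrying traffic to a service, so that capture rules only need to be installed in those locations. It is a minimal sketch assuming sFlow-RT as the analyzer; the service address is hypothetical and the /activeflows query should be checked against the documentation for the sFlow-RT version in use.
import requests
import json

rt = 'http://localhost:8008'

# track traffic destined to the service of interest (address is hypothetical)
flow = {'keys':'ipsource,ipdestination','value':'frames',
        'filter':'ipdestination=10.0.0.151'}
requests.put(rt + '/flow/service/json', data=json.dumps(flow))

# ask the analyzer which agents (switches) and data sources (ports) are
# currently seeing the traffic - these are the candidate capture points
r = requests.get(rt + '/activeflows/ALL/service/json')
for f in r.json():
  print f['agent'], f['dataSource'], f['key'], f['value']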
To illustrate the benefits of a combined solution, consider the challenge of monitoring tunneled traffic. Tunneled or encapsulated traffic is present in a wide variety of contexts, including: IPv6 migration (Teredo, 6-in-4, 4-in-6 etc.), network virtualization and layer-2 encapsulations (MPLS, Q-in-Q, TRILL, 802.1aq etc.). Capturing tunneled traffic is challenging for an OpenFlow based system since the switch hardware only examines the outer headers and cannot see traffic inside a tunnel. How can captures be triggered based on activity within tunnels?

The following Python script is based on the examples described in performance aware software defined networking and Down the rabbit hole. The script configures sFlow-RT to look for GRE tunnels containing traffic to TCP port 22, generating an alert that can be used to trigger a packet capture:
import requests
import json

# address groups used to classify traffic as internal or external
groups = {'external':['0.0.0.0/0'],'internal':['10.0.0.0/8']}
# flow definition matching TCP port 22 traffic carried inside GRE tunnels
flows = {'keys':'ipsource,ipdestination','value':'frames','filter':'stack=eth.ip.gre.ip.tcp&tcpdestinationport=22'}
# a threshold of zero generates an event as soon as any matching packet is seen
threshold = {'metric':'watch','value':0}

rt = 'http://localhost:8008'

# program sFlow-RT with the address groups, flow definition and threshold
r = requests.put(rt + '/group/json',data=json.dumps(groups))
r = requests.put(rt + '/flow/watch/json',data=json.dumps(flows))
r = requests.put(rt + '/threshold/watch/json',data=json.dumps(threshold))

# long poll the sFlow-RT event stream for threshold events
eventurl = rt + '/events/json?maxEvents=10&timeout=60'
eventID = -1
while True:
  r = requests.get(eventurl + "&eventID=" + str(eventID))
  if r.status_code != 200: break
  events = r.json()
  if len(events) == 0: continue
  eventID = events[0]["eventID"]
  events.reverse()
  for e in events:
    thresholdID = e['thresholdID']
    if "watch" == thresholdID:
      # query the flow metric on the agent that triggered the event to find
      # the top contributing flow keys
      r = requests.get(rt + '/metric/' + e['agent'] + '/' + e['dataSource'] + '.' + e['metric'] + '/json')
      metrics = r.json()
      if len(metrics) > 0:
        evtMetric = metrics[0]
        evtKeys = evtMetric.get('topKeys',None)
        if(evtKeys and len(evtKeys) > 0):
          topKey = evtKeys[0]
          key = topKey.get('key', None)
          value = topKey.get('value',None)
          keys = key.split(",")
          # match describing the GRE tunnel endpoints, suitable for passing
          # to an SDN controller to set up a targeted packet capture
          match = [
           {
             "ether-type":"2048",
             "ip-proto":"47",
             "src-ip": str(keys[0]),
             "src-ip-mask":"255.255.255.255",
             "dst-ip": str(keys[1]),
             "dst-ip-mask":"255.255.255.255",
             "src-tp-port":None,
             "dst-tp-port":None,
             "any-traffic":None
           }
          ]
          print match
Running the script generates the following output as soon as a matching packet is detected:
$ python gre.py 
[{'src-tp-port': None, 'src-ip-mask': '255.255.255.255', 'ip-proto': '47', 'ether-type': '2048', 'any-traffic': None, 'dst-ip': '10.0.0.151', 'dst-ip-mask': '255.255.255.255', 'src-ip': '10.0.0.152', 'dst-tp-port': None}]
In a real deployment, the script would make a REST call to automatically trigger a packet capture, rather than simply printing out details of the tunnel. Triggering packet captures is just one example of an SDN application making use of real-time sFlow analytics, other examples include load balancing large flows and DDoS mitigation.
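A minimal sketch of what that call might look like is shown below. The controller URL, policy name, and payload structure are hypothetical; they would need to be replaced with the RESTful API of the packet broker controller that manages the filter switches (for example, Big Tap).
import requests
import json

# hypothetical controller endpoint for installing a capture policy
controller = 'https://10.0.0.1:8443/api/v1/capture-policy/gre-ssh'

def trigger_capture(match):
  # install the match built by the monitoring script as a capture policy,
  # directing matching packets to the analysis tools
  policy = {'match': match, 'action': 'capture', 'priority': 100}
  r = requests.put(controller, data=json.dumps(policy),
                   headers={'content-type': 'application/json'},
                   verify=False)
  return r.status_code

# in the script above, replace "print match" with:
# trigger_capture(match)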

Monday, April 22, 2013

Multi-tenant traffic in virtualized network environments

Figure 1: Network virtualization (credit Brad Hedlund)
Network Virtualization: a next generation modular platform for the data center virtual network describes the basic concepts of network virtualization. Figure 1 shows the architectural elements of the solution, which involves creating tunnels to encapsulate traffic between hypervisors. Tunneling allows the controller to create virtual networks between virtual machines that are independent of the underlying physical network (Any Network in the diagram).
Figure 2: Physical and virtual packet paths
Figure 2 shows a virtual network on the upper layer and maps its paths onto a physical network below. The network virtualization architecture is not aware of the topology of the underlying physical network, so the physical locations of virtual machines and the resulting packet paths are unlikely to bear any relationship to their logical relationships, producing an inefficient "spaghetti" of traffic flows. To a network manager observing traffic on the physical network, whether between hypervisors, between top of rack switches, or from virtual machine to virtual machine, the traffic will appear to have very little structure.
Figure 3: Apparent virtual network traffic matrix
Figure 3 shows a traffic matrix in which the probability of any virtual machine talking to any other virtual machine is uniform. A network designed to carry this flat traffic matrix must itself be topologically flat, i.e. provide equal bandwidth between all hosts.
Figure 4: Relative cost of different topologies (from Flyways To De-Congest Data Networks)
Figure 4 shows that eliminating over-subscription to create a flat network is expensive, ranging from 2 to 5 times the cost of a conventional network design. Applying this same strategy to the road system would be the equivalent of connecting every town and city with an 8-lane freeway, no matter how small or remote the town. In practice, traffic studies guide development and roads are built where they are needed to satisfy demand. A similar, measurement-based, approach can be applied to network design.

In fact, the traffic matrix isn't random, it just appears random because the virtual machines have been randomly scattered around the data center by the network virtualization layer. Consider an important use case for network virtualization - multi-tenant isolation. Virtual networks are created for each tenant and configured to isolate and protect tenants from each other in the public cloud. Virtual machines assigned to each tenant are free to communicate among themselves, but are prevented from communicating with other tenants in the data center.
Figure 5: Traffic matrix within and between tenants
Figure 5 shows the apparently random traffic matrix shown in Figure 3, but this time the virtual machines have been grouped by tenant and the tenants have been sorted from largest to smallest. The resulting traffic matrix has some interesting features:
  1. The largest tenant occupies a small fraction of the total area in the traffic matrix.
  2. Tenant size rapidly decreases with most tenants being much smaller than the largest few.
  3. The traffic matrix is extremely sparse.
Even this picture is misleading, because if you drill down to look at a single tenant, their traffic matrix is likely to be equally sparse.
Figure 6: Traffic from large map / reduce cluster
Figure 6 shows the traffic matrix for a common large scale workload that a tenant might run in the cloud - map / reduce (Hadoop) - and the paper, Traffic Patterns and Affinities, discusses the sparseness and structure of this traffic matrix in some detail.
Note: There is a striking similarity between the traffic matrices in figures 5 and 6. The reason for the strong diagonal in the Hadoop traffic matrix is that the Hadoop scheduler is topologically aware, assigning compute tasks to nodes that are close to the storage they are going to operate on, and orchestrating storage replication in order to minimise non-local transfers. However, when this workload is run over a virtualized network, the virtual machines are scattered, turning this highly localized and efficient traffic pattern into randomly distributed traffic.
Apart from Hadoop, how else might a large tenant use the network? It's worth focusing on large tenants since their workloads are likely to be the hardest to accommodate. Netflix is one of the largest and most sophisticated tenants in the Amazon Elastic Compute Cloud (EC2), and the presentation, Dynamically Scaling Netflix in the Cloud, provides some interesting insights into their use of cloud resources.
Figure 7: Netflix elastic load balancing pools
Figure 7 shows how Netflix distributes copies of its service across availability zones. Each service instance, A, B or C, is implemented by a scale out pool of virtual machines (EC2 instances). Note also the communication patterns between service pools, resulting in a sparse, structured traffic matrix.
Figure 8: Elastic load balancing
Figure 8 shows how each service within an availability zone is dynamically scaled based on measured demand. As demand increases, additional virtual machines are added to the pool. When demand decreases, virtual machines are released from the pool.
Figure 9: Variation in number of Netflix instances over a 24 hour period
Figure 9 shows how the number of virtual machines in each pool varies over the course of a day as a result of the elastic load balancing. Looking at the graph, one can see that a significant fraction of the virtual machines in each service pool is recycled each day.

Elastic load balancing is a service provided by the underlying infrastructure, so the service provider is aware of the pools and the members within each pool. Since it's in the nature of a load balancing pool that each instance has a similar traffic pattern to its peers, observing the communication patterns of active pool members would allow a topology aware orchestration controller to select poorly placed VMs when making a removal decision and to add new VMs in locations that are close to their peers.
Note: Netflix maintains a base number of reserved instances (reserved instances are the least expensive option, provided you can keep them busy) and uses this "free" capacity for analytics tasks (Hadoop) during off peak periods. Exposing basic locality information to tenants would allow them to better configure topology aware workloads like Hadoop, delivering improved performance and reducing traffic on the shared physical network.
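As a rough sketch of the kind of calculation a topology aware orchestration controller could perform (the pool membership and flow records below are illustrative; in practice they would come from the elastic load balancer and an sFlow analyzer), aggregating observed VM-to-VM traffic into a pool-to-pool matrix identifies which pools exchange the most traffic and are therefore the strongest candidates for topologically close placement:
from collections import defaultdict

# illustrative inputs: pool membership from the load balancer and
# VM-to-VM byte counts reported by an sFlow analyzer
pool_of_vm = {'vm1':'A', 'vm2':'A', 'vm3':'B', 'vm4':'C'}
flows = [('vm1','vm3',1200000), ('vm2','vm3',800000), ('vm3','vm4',50000)]

# aggregate VM-to-VM flows into a pool-to-pool traffic matrix
matrix = defaultdict(int)
for src, dst, byte_count in flows:
  src_pool, dst_pool = pool_of_vm.get(src), pool_of_vm.get(dst)
  if src_pool and dst_pool:
    matrix[(src_pool, dst_pool)] += byte_count

# pools exchanging the most traffic should be placed close together
for (src_pool, dst_pool), byte_count in sorted(matrix.items(), key=lambda kv: -kv[1]):
  print src_pool, '->', dst_pool, byte_count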
Multi-tenancy is just one application of network virtualization. However, the general concept of creating multiple virtual networks implies constraints on communication patterns, and a location aware virtual network controller will be able to reduce network loads, improve application performance, and increase scalability by placing nodes that communicate together topologically close to each other.

There are challenges dealing with large tenants since they may have large groups of machines that need to be provided with high bandwidth communication.  Helios: A Hybrid Electrical/Optical Switch Architecture for Modular Data Centers describes some of the limitations of fixed configuration networks and describes how optical networking can be used to flexibly allocate bandwidth where it is needed.
Figure 10: Demonstrating AWESOME in the Pursuit of the Optical Data Center
Figure 10, from the article Demonstrating AWESOME in the Pursuit of the Optical Data Center, shows a joint Plexxi and Calient solution that orchestrates connectivity based on what Plexxi terms network affinities. This technology can be used to "rewire" the network to create tailored pods that efficiently accommodate large tenants. The paper, PAST: Scalable Ethernet for Data Centers, describes how software defined networking can be used to exploit the capabilities of merchant silicon to deliver bandwidth where it is needed.

However flexible the network, coordinated management of storage, virtual machine and networking resources is required to fully realize the flexibility and efficiency promised by cloud data centers. The paper, Joint VM Placement and Routing for Data Center Traffic Engineering, shows that jointly optimizing network and server resources can yield significant benefits.
Note: Vint Cerf recently revealed that Google has re-engineered its data center networks to use OpenFlow based software defined networks, possibly bringing networking under the coordinated control of their data center resource management system. In addition, one of the authors of the Helios paper, Amin Vahdat, is a distinguished engineer at Google and has described Google's use of optical networking and OpenFlow in the context of WAN traffic engineering; it would be surprising if Google weren't applying similar techniques within their data centers.
Comprehensive measurement is an essential, but often overlooked, component of an adaptive architecture. The controller cannot optimally place workloads if the traffic matrix, link utilizations, and server loads are not known. The widely supported sFlow standard addresses the requirement for pervasive visibility by embedding instrumentation within physical and virtual switches, and in the servers and applications making use of the network to provide the integrated view of performance needed for unified control.

Finally, there are significant challenges to realizing revolutionary improvements in data center flexibility and scalability, many of which aren't technical. Network virtualization, management silos and missed opportunities discusses how inflexible human organizational structures are being reflected in the data center architectures proposed by industry consortia. The article talks about OpenStack, but the recently formed OpenDaylight consortium seems to have similar issues, freezing in place existing architectures that offer incremental benefits, rather than providing the flexibility needed for radical innovation and improvement.

Saturday, April 20, 2013

Merchant silicon competition

Figure 1: Open Network Platform Switch Reference Design (see Intel Product Brief)
Rose Schooler's keynote at the recent Open Networking Summit described Intel's new reference switch platform. Intel merchant silicon addresses network virtualization and SDN use cases through support for open standards, including: OpenFlow, NVGRE, VxLAN and sFlow.
Figure 2: Top of rack using Broadcom merchant silicon (from Merchant silicon)
Intel appears to be targeting Broadcom's position as the leading merchant silicon provider in the data center switch market. Just as competition between Intel, AMD and ARM has spurred innovation, increased choice, and driven down CPU prices, competition between merchant silicon vendors promises similar benefits.

In the compute space, the freedom to choose operating systems (Windows, Linux, Solaris etc.) increases competition among hardware vendors and between operating system vendors. Choices in switch operating systems are starting to appear (PicOS and the open source Switch Light project), opening the door to disruptive change in the networking market that is likely to mirror the transition from proprietary minicomputers to commodity x86 servers that occurred in the 1980s.

Monday, April 1, 2013

Velocity Conference talk


This talk from the O'Reilly 2012 Velocity Conference presents a case study describing how Tagged.com uses sFlow to monitor their application infrastructure.

Some of the topics covered include:
  • Introduction to sFlow
  • Using Ganglia as an sFlow collector
  • Monitoring Apache web tier
  • Monitoring Memcache clusters
  • Monitoring Java applications
  • Using sflowtool to develop custom tools
The recently published O'Reilly book, Monitoring with Ganglia, includes the Tagged case study and a chapter on configuring Ganglia to collect sFlow metrics.