Sunday, March 31, 2013

Pragmatic software defined networking

Figure 1: Fabric: A Retrospective on Evolving SDN
The article, Fabric: A Retrospective on Evolving SDN, makes the case for a two-tier software defined networking (SDN) architecture comprising a smart edge and an efficient core.

Smart edge

Figure 2: Network virtualization (credit Brad Hedlund)
Network Virtualization: a next generation modular platform for the data center virtual network describes the elements of the Nicira/VMware network virtualization platform, which represents the current state of the art in smart edges. Figure 2 shows the architectural elements of the solution, all of which are implemented at the network edge. The purple box, labelled Any Network, maps to the Fabric Elements core shown in Figure 1.

Through a process of network function virtualization (NFV), layer 3-7 components such as routers, load balancers and firewalls are abstracted as services that can be implemented in virtual machines, or as physical devices linked together by a virtual topology connected by tunnels across the physical network.

The Open vSwitch (OVS) is a critical component in this architecture, providing a number of key features:
  1. Flexibility - the virtual switch is implemented in software, allowing a rapidly evolving set of advanced features to be developed that would be difficult, time consuming, and expensive to replicate in hardware.
  2. Configuration - OVSDB configuration protocol allows the controller to coordinate configurations among the virtual switches.
  3. OpenFlow - allows centralized control of the complex forwarding policies needed to create virtual networks.
  4. Monitoring - including NetFlow and sFlow to provide visibility into traffic flows.
Figure 2 claims that any network can be used to carry traffic between the edge switches. While this is technically true, it is clear from the diagram that East-West traffic (traffic between vSwitches and between top of rack switches) dominates. The shared physical network must meet the bandwidth and latency requirements of the overlaid virtual network in order for network virtualization to be a viable solution.

Efficient core

The architecture in figure 1 simplifies the core by shifting complex classification tasks to the network edge. The core fabric is left with the task of efficiently managing physical resources in order to deliver low latency, high bandwidth connectivity between edge switches.

Current fabrics make use of distributed routing protocols such as Transparent Interconnection of Lots of Links (TRILL), Link Aggregation (LAG/MLAG), Multiprotocol Label Switching (MPLS) and Equal Cost Multi-Path Routing (ECMP) to control forwarding paths. In order to deliver improved efficiency, the feature set required from the SDN/OpenFlow control plane needs to address the requirements of traffic engineering.

Centralized configuration mechanisms (e.g. NETCONF) are useful for provisioning the fabrics, and the distributed control planes provide the robust, high performance switching needed at the core. However, there are important classes of traffic that are poorly handled by the distributed control plane and would be more efficiently routed by a central SDN controller (see SDN and large flows). A hybrid approach combining the best elements of existing hardware, control planes and OpenFlow offers a solution.
Figure 3: Hybrid Programmable Forwarding Planes
Figure 3 shows two models for hybrid OpenFlow deployment, allowing OpenFlow to be used in conjunction with existing routing protocols. The Ships-in-the-Night model divides the switch into two, allocating selected ports to external OpenFlow control and leaving the remaining ports to the internal control plane. It is not clear how useful this model is, other than for experimentation. For production use cases (e.g. the top of rack (ToR) case shown in Figure 2, where the switch is used to virtualize network services for a rack of physical servers) a pure OpenFlow switch is much simpler and likely to provide a more robust solution.

The Integrated hybrid model is much more interesting since it can be used to combine the best attributes of OpenFlow and existing distributed routing protocols to deliver robust solutions. The OpenFlow 1.3.1 specification includes support for the integrated hybrid model by defining the NORMAL action:
Optional: NORMAL: Represents the traditional non-OpenFlow pipeline of the switch (see 5.1). Can be used only as an output port and processes the packet using the normal pipeline. If the switch cannot forward packets from the OpenFlow pipeline to the normal pipeline, it must indicate that it does not support this action.
Hybrid solutions leverage the full capabilities of vendor and merchant silicon which efficiently support distributed forwarding protocols. In addition, most switch and merchant silicon vendors embed support for the sFlow standard, allowing the fabric controller to rapidly detect large flows and apply OpenFlow forwarding rules to steer the flows and optimize performance. The articles Load balancing LAG/ECMP groups and ECMP load balancing describe hybrid control strategies for increasing the performance of switch fabrics.
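The integrated hybrid behaviour described above can be illustrated with a toy model. The following is a minimal sketch, not a real OpenFlow implementation: the `Rule` and `HybridTable` classes, the match field names and the port name `port-7` are all hypothetical. It assumes a lowest-priority wildcard rule that sends traffic to the NORMAL pipeline, with higher-priority rules added by the controller to steer selected large flows.

```python
from dataclasses import dataclass, field

NORMAL = "NORMAL"  # stands in for the OpenFlow reserved port of the same name


@dataclass
class Rule:
    priority: int
    match: dict   # hypothetical field names, e.g. {"ip_dst": "10.0.1.1"}
    action: str   # an output port name, or NORMAL


@dataclass
class HybridTable:
    # A single priority-0 wildcard rule hands every packet to the
    # switch's own (non-OpenFlow) forwarding pipeline.
    rules: list = field(default_factory=lambda: [Rule(0, {}, NORMAL)])

    def add(self, rule: Rule) -> None:
        self.rules.append(rule)

    def clear_controller_rules(self) -> None:
        """Model a controller failure: only the NORMAL default remains."""
        self.rules = [r for r in self.rules if r.priority == 0]

    def lookup(self, packet: dict) -> str:
        # Return the action of the highest-priority rule whose match
        # fields all equal the corresponding packet fields.
        for rule in sorted(self.rules, key=lambda r: r.priority, reverse=True):
            if all(packet.get(k) == v for k, v in rule.match.items()):
                return rule.action
        return NORMAL


table = HybridTable()
large_flow = {"ip_src": "10.0.0.1", "ip_dst": "10.0.1.1", "tcp_dst": 5001}
table.add(Rule(100, {"ip_src": "10.0.0.1", "ip_dst": "10.0.1.1"}, "port-7"))

print(table.lookup(large_flow))                                    # port-7
print(table.lookup({"ip_src": "10.0.0.9", "ip_dst": "10.0.1.1"}))  # NORMAL
table.clear_controller_rules()
print(table.lookup(large_flow))                                    # NORMAL
```

In this model only the steered flow ever needs an explicit rule, and removing the controller's rules returns all traffic to the embedded forwarding path.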

Existing switching silicon is often criticized for the limited size of the hardware forwarding tables, supporting too few general match OpenFlow forwarding rules to be useful in production settings. However, consider that SDN and large flows defines a large flow as a flow that consumes 10% of a link's bandwidth. Using this definition, a 48 port switch would require a maximum of 480 general match rules in order to steer all large flows, well within the capabilities of current hardware (see OpenFlow Switching Performance: Not All TCAM Is Created Equal).
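The arithmetic behind the 480 rule figure can be checked with a short sketch. It assumes one general match rule per large flow and counts each port as one link; the function names and the example flow rates are illustrative only.

```python
def max_steering_rules(ports: int, threshold_percent: int = 10) -> int:
    # A link can carry at most 100 / threshold_percent flows that each
    # consume threshold_percent of its bandwidth.
    flows_per_link = 100 // threshold_percent
    return ports * flows_per_link


def large_flows(flow_bps: dict, link_bps: float, threshold_percent: int = 10) -> list:
    # A flow counts as large when it uses at least the threshold share
    # of the link's bandwidth.
    limit = link_bps * threshold_percent / 100
    return sorted(name for name, bps in flow_bps.items() if bps >= limit)


print(max_steering_rules(48))  # 480

# Hypothetical flows on a 10 Gbit/s link; the threshold is 1 Gbit/s.
flows = {"flow-a": 1.5e9, "flow-b": 2e8, "flow-c": 1e9}
print(large_flows(flows, link_bps=10e9))  # ['flow-a', 'flow-c']
```

The bound is a worst case: it is only reached if every port is completely filled with flows that sit exactly at the threshold.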
Figure 4: Mad Max supercharger switch
Among the advantages of a hybrid solution is that dependency on the central fabric controller is limited - if the controller fails, the switches fall back to embedded forwarding and network connectivity is maintained. The hybrid SDN controller can be viewed as a supercharger that boosts the performance of existing networks by finding global optimizations that are inaccessible to the distributed control planes - hence the gratuitous picture from Mad Max in figure 4 - for the uninitiated the red switch activates a supercharger that dramatically improves the car's performance.

In data centers, the network is a small part of overall costs and is often seen as unimportant. However, network bottlenecks idle expensive, power hungry servers and reduce overall data center performance and throughput. Improving the performance of the network increases throughput of servers and delivers increased ROI.

Network performance problems are insidious costs for most organisations because of the split between networking and compute teams: network managers don't see the impact of network congestion on server throughput, and application development and operations teams (DevOps) don't have visibility into how application performance is being constrained by the network. The article Network virtualization, management silos and missed opportunities discusses how these organizational problems risk being transferred into current cloud orchestration frameworks. What is needed is a framework for coordinating between layers in order to achieve optimal performance.

Coordinated control

Figure 5: Virtual and physical packet paths
Figure 5 shows a virtual network on the upper layer and maps the paths onto a physical network below. The network virtualization architecture is not aware of the topology of the underlying physical network and so the physical location of virtual machines and resulting packet paths are unlikely to bear any relationship to their logical relationships, resulting in an inefficient "spaghetti" of traffic flows.

Note: A popular term for this type of inefficient traffic path is a hairpin or a traffic trombone; however, these terms imply a singular mistake, rather than a systematic problem resulting in a marching band of trombones. The term spaghetti routing has been around a long time and conjures the image of a chaotic tangle that is more appropriate to the traffic patterns that result from the willful lack of locality awareness in current network virtualization frameworks.

In practice there is considerable structure to network traffic that can be exploited by the controller:
  1. Traffic within the group of virtual machines belonging to a tenant is much greater than traffic between different tenants.
  2. Traffic between hosts within scale-out clusters and between clusters is highly structured.
Note: Plexxi refers to the structured communication patterns as affinities, see Traffic Patterns and Affinities.
Figure 6: Cloud operating system
System boundary describes how extending the span of control to include network, server and compute resources provides new opportunities for increasing efficiency. Figure 6 extends the controller hierarchy from Figure 1 to include the compute controller responsible for virtual machine placement and adds an overarching cloud operating system with APIs connecting to the compute, edge and fabric controllers. This architecture allows for coordination of resources between the compute, edge and core subsystems. For example, the cloud operating system can use topology information learned from the Fabric controller to direct the Compute controller to move virtual machines in order to disentangle the spaghetti of traffic flows. Proactively, the cloud operating system can take into account information about the locations and communication patterns of a tenant's existing virtual machines and find the optimal location when asked to create a new virtual machine.
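One way to picture the proactive placement decision is as a cost minimization. The sketch below is an assumption about how such a decision might be scored, not a description of any existing cloud operating system: it weights the expected traffic to each of the tenant's existing virtual machines by the number of links crossed in an assumed two-tier topology. The host names, rack map and traffic figures are made up for illustration.

```python
def placement_cost(host, peers, hops):
    """Traffic-weighted distance from a candidate host to a tenant's VMs.

    peers: list of (peer_host, expected_traffic) pairs
    hops:  function (host_a, host_b) -> number of links crossed
    """
    return sum(traffic * hops(host, peer_host) for peer_host, traffic in peers)


def best_host(candidates, peers, hops):
    # min() returns the first of equally good candidates, so the order
    # of `candidates` acts as the tie-breaker.
    return min(candidates, key=lambda h: placement_cost(h, peers, hops))


# Hypothetical two-tier topology: hosts attach to a top of rack switch,
# and racks are joined through the core.
RACK = {"h1": "r1", "h2": "r1", "h3": "r2", "h4": "r2"}


def two_tier_hops(a, b):
    if a == b:
        return 0
    return 2 if RACK[a] == RACK[b] else 4


# The tenant's existing VMs and the traffic expected to each of them.
peers = [("h3", 80), ("h4", 15), ("h1", 5)]
print(best_host(["h1", "h2", "h3", "h4"], peers, two_tier_hops))  # h3
```

A real scheduler would also have to filter candidates by free CPU, memory and storage; here `candidates` is assumed to contain only hosts with spare capacity.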

Note: Plexxi puts an interesting spin on topology optimization, using optical wavelength division multiplexing (WDM) in their top of rack switches to dynamically create topologies matched to traffic affinities, see Affinity Networking for Data Centers and Clouds.

A comprehensive measurement system is an essential component of an efficient cloud architecture, providing feedback to the controller so that it can optimize resource allocation. The sFlow standard addresses the requirements for pervasive visibility by embedding instrumentation within physical and virtual networking and in the servers and applications making use of the network. The main difference between the architecture shown in Figure 6 and current cloud orchestration architectures like OpenStack is the inclusion of feedback paths, the upward arrows, that allow the lower layer controllers and the cloud operating system to be performance and location aware when making load placement decisions, see Network virtualization, management silos and missed opportunities.

Friday, March 22, 2013

Network virtualization, management silos and missed opportunities

Conway's law states that "organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations"
Figure 1: Management silos
Management silos described how the organization of operations teams into functional silos (network, storage and server groups) creates an inflexible management structure that makes it hard to deal with highly dynamic cloud architectures.
Figure 2: OpenStack Quantum Intro
When you look at the OpenStack architecture shown in Figure 2, it bears a strong resemblance to existing organizational silos. Is this really the best way to architect next generation cloud systems or is it simply a demonstration of Conway's law in action?

The OpenStack compute scheduler documentation describes the factors that can be included in deciding which compute node to use when starting a virtual machine. Notable by their absence is any mention of storage or network location. In contrast, the Hadoop scheduler is both storage and network topology aware, allowing it to place compute tasks close to storage and replicate data within racks for increased performance and across racks for availability. A previous article, System boundary, discussed the importance of including all the tightly coupled network, storage, and compute resources within an integrated control system and NUMA discussed the importance of location awareness for optimal performance.
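To make the idea of a location aware scheduler concrete, the following sketch ranks free compute nodes by how close they are to the data a task needs. It is a simplified illustration in the spirit of the Hadoop behaviour described above, not Hadoop's or OpenStack's actual scheduling code; the node names, rack map and function names are assumptions.

```python
def locality(node, replica_nodes, rack_of):
    """0 = data on this node, 1 = data in this node's rack, 2 = data elsewhere."""
    if node in replica_nodes:
        return 0
    if rack_of[node] in {rack_of[r] for r in replica_nodes}:
        return 1
    return 2


def pick_node(free_nodes, replica_nodes, rack_of):
    # Prefer the free node with the best (lowest) locality level.
    return min(free_nodes, key=lambda n: locality(n, replica_nodes, rack_of))


# Hypothetical cluster: four nodes in three racks, with the task's data
# replicated on n1 and n3.
rack_of = {"n1": "rA", "n2": "rA", "n3": "rB", "n4": "rC"}
replicas = {"n1", "n3"}

print(pick_node(["n4", "n2"], replicas, rack_of))        # n2
print(pick_node(["n4", "n2", "n3"], replicas, rack_of))  # n3
```

A scheduler can only make this kind of choice if storage and network location are exposed to it, which is the information the article argues is missing.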

Note: OpenStack was selected as a representative example to demonstrate architectural features that are common to many cloud stacks. This article shouldn't be seen as a specific criticism of OpenStack, but as a general discussion of cloud architectures.
Figure 3: OpenStack Quantum Intro
OpenStack is still in active development, so one might hope that future schedulers will be enhanced to be more location aware as the network service matures. However, looking at Figure 3, it appears that this will not be possible since the APIs being developed to access the network service do not expose network topology or performance information to the scheduler.

Figure 4: VMware NSX Network Virtualization
Figure 4 shows how the situation becomes even worse as additional layers are added, further removing the scheduler from the information it needs to be location aware. In fact, the lack of location awareness is touted as an advantage, providing "A complete and feature rich virtual network can be defined at liberty from any constraints in physical switching infrastructure features, topologies or resources."

Each orchestration layer kicks the problem of network resource management down to lower layers, until you are left selecting from a range of vendor specific fabrics which also hide the network topology and present the abstraction of a single switch.
Figure 5: Juniper QFabric Architecture
On a slightly different tack, consider whether the organizational divisions in cloud orchestration systems are being justified based on one or more Fallacies of Distributed Computing:
  1. The network is reliable
  2. Latency is zero
  3. Bandwidth is infinite
  4. The network is secure
  5. Topology doesn't change
  6. There is one administrator
  7. Transport cost is zero
  8. The network is homogeneous
A corollary to Conway's law is that flexible organizations are willing and able to reorganize to produce optimal designs. The DevOps movement is breaking down the silos between application development and operations teams in order to improve the agility and reliability of cloud based applications. The standards for cloud computing are just starting to emerge and it would be tragic if the opportunity to deliver agile, robust, efficient and scalable cloud systems is lost because of an inability to create the flexible, cross disciplinary design groups needed to re-imagine the relationship between networking, storage and computing and produce new architectures.

It is easy to be complacent based on the buzz around cloud computing, software defined networking and the software defined data center. However, if these architectures don't deliver on their promise, there is competition waiting in the wings - see Return of the Borg: How Twitter Rebuilt Google’s Secret Weapon. The difference is that these alternative architectures are being developed by flexible organizations that are prepared to consider all aspects of their stack in order to make disruptive improvements.

The unified visibility across all network, server, storage and application resources provided by the multi-vendor sFlow standard offers a solution. Piercing through the layers of abstraction and architectural silos delivers the comprehensive real-time analytics and location awareness for efficient scheduling.

Thursday, March 14, 2013


A new class of damaging DDoS attacks was launched against U.S. banks in the second half of 2012, sometimes adding up to 70 Gbps of noisy network traffic blasting at the banks through their Internet pipes. Until this recent spate of attacks, most network-level DDoS attacks consumed only five Gbps of bandwidth, but more recent levels made it impossible for bank customers and others using the same pipes to get to their websites. - Gartner
Figure 1: Components of a DDoS attack (credit Wikipedia)
Figure 1 shows the components of a Distributed Denial of Service (DDoS) attack. The attacker uses a command and control network to instruct large numbers of compromised systems to send traffic to a designated target with the aim of overwhelming the target infrastructure and denying access to legitimate users.

This article will show how the standard sFlow monitoring built in to most vendors' network equipment can be used to rapidly detect DDoS attacks and drive automated controls to mitigate their effect. This case study is based on a data center network consisting of approximately 500 switches and 30,000 switch ports and the charts show production traffic. This network was used as a testbed for developing the sFlow-RT analytics engine and the resulting solution is now used in production.
Figure 2: Uncontrolled DDoS attack
Figure 2 shows a typical DDoS attack, consisting of sustained traffic levels of over 5M packets per second (30 Gigabits per second) that last for many hours. The attacks are intended to saturate the links to the data center and deny access to the servers hosted there.

Note: This chart is from an early sFlow-RT prototype and the drop outs are spurious.

Figure 3: Performance aware software defined networking
Performance aware software defined networking describes the basic elements of the DDoS mitigation system. The sFlow measurements from all the switches are sent to the sFlow-RT analytics engine which provides real-time notification of denial of service attacks and information about the attackers and targets to the DDoS protection application (a variant of the Python script shown in the article). The DDoS protection application issues commands to the controller which communicates with the switches to eliminate the DDoS traffic. In this specific example, the controller doesn't actually use OpenFlow to communicate with the switches - instead scripts automatically log in to the switch CLI to issue configuration commands that cause upstream routers to drop the traffic (see null route).
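The detection step can be sketched without reference to any particular analytics engine. The code below is an illustrative stand-in, not the sFlow-RT API or the script from the referenced article: it assumes the input is a list of destination addresses taken from 1-in-N sampled packets, scales the sample counts by the sampling rate to estimate packets per second, and emits hypothetical action tuples for a controller to translate into device configuration. The threshold and addresses are made up.

```python
from collections import Counter


def targets_under_attack(sampled_dst_ips, sampling_rate, interval_s, threshold_pps):
    """Estimate packets/s per destination from 1-in-N packet samples.

    Each sample represents roughly `sampling_rate` packets on the wire.
    """
    counts = Counter(sampled_dst_ips)
    estimates = {ip: n * sampling_rate / interval_s for ip, n in counts.items()}
    return {ip: pps for ip, pps in estimates.items() if pps >= threshold_pps}


def mitigation_actions(targets):
    # Hypothetical action tuples; turning them into router or switch
    # configuration is left to the controller.
    return [("null-route", ip) for ip in sorted(targets)]


# One second of samples at an assumed 1-in-10,000 sampling rate.
samples = ["192.0.2.10"] * 300 + ["192.0.2.20"] * 5
attacked = targets_under_attack(samples, sampling_rate=10_000,
                                interval_s=1, threshold_pps=1_000_000)
print(attacked)                      # {'192.0.2.10': 3000000.0}
print(mitigation_actions(attacked))  # [('null-route', '192.0.2.10')]
```

Shortening `interval_s` reduces the measurement delay at the cost of fewer samples per estimate, which is the trade-off a real detector has to tune.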
Figure 4: Five DDoS attacks within three minutes
Figure 4 shows results from an early prototype controller. The chart is interesting because it shows five separate DDoS attacks occurring within the span of three minutes. Each attack is being stopped in under 30 seconds - this is fast enough that the attacks don't fully evolve, peaking at 3 million packets per second, rather than the typical 5+ million packets per second.

Note: It takes the attacker some time to fully mobilize their network of compromised hosts - if the defense actions can be deployed faster than the attacker can deploy their resources then the effect of the attack is largely eliminated.
Figure 5: Elements of controller delay
Figure 5, from SDN and delay, describes the components of response time in the control loop. Further tuning to reduce the measurement delay and configuration delay significantly improved effectiveness of the controller.
Figure 6: Mitigating DDoS attack using fast controller
Figure 6 shows the performance of the improved controller: the response time to detect an attack and implement a control is around 4 seconds, the peak traffic is cut by two thirds, and all the traffic is eliminated in approximately 10 seconds.

This denial of service mitigation example demonstrates sFlow's unique suitability for control applications. More broadly, sFlow provides the comprehensive measurements needed to drive a variety of resource allocation and load balancing applications, including: SDN and large flows, ECMP load balancing, Load balancing LAG/ECMP groups, and cloud orchestration.

In future, expect to see sFlow-based performance awareness incorporated in a wide range of orchestration platforms, leveraging existing infrastructure to increase performance, reduce costs and ensure quality of service - ask vendors about their plans.

Tuesday, March 12, 2013

ECMP load balancing

Figure 1: Examples of ECMP collisions resulting in reduced bisection bandwidth
(from Hedera: Dynamic Flow Scheduling for Data Center Networks)
The paper Hedera: Dynamic Flow Scheduling for Data Center Networks describes the impact of colliding flows on effective ECMP cross sectional bandwidth. The paper gives an example which demonstrates that effective cross sectional bandwidth can be reduced by between 20% and 60%, depending on the number of simultaneous flows per host.

Figure 1 illustrates the two types of collision that can occur: local collisions when large flows converge on an uplink as they are forwarded from source aggregation switches to the core, and downstream collisions when large flows converge on a downlink from the core switches down to the target aggregation switch. Optimizing forwarding paths for large flows is an interesting challenge that requires end-to-end visibility across the fabric. An aggregation switch could use local visibility to avoid collisions on the uplinks by selecting a different core switch (e.g. Agg 0 can choose to forward the colliding flow through Core 1). However, there is only one downlink from a core switch to each aggregation switch and so avoiding downstream collisions is not a local decision (e.g. the collision on the downlink from Core 2 can be avoided if Agg 2 sends the flow via Core 3).
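The need for a global view can be illustrated with a small greedy placement sketch. This is not the scheduler from the Hedera paper and not a production algorithm: it assumes a two-tier fabric in which every aggregation switch has one link to every core switch, and it chooses a core for each large flow by looking at both the uplink from the source aggregation switch and the downlink to the destination aggregation switch. The switch names and flow rates are illustrative.

```python
from collections import defaultdict


def place_large_flows(flows, cores):
    """Greedily assign each large flow to a core switch.

    flows: list of (src_agg, dst_agg, rate) tuples
    cores: list of core switch names
    Returns (list of (src, dst, rate, core), dict of link loads).
    """
    load = defaultdict(float)
    placed = []
    # Place the biggest flows first; sorted() is stable, so equal rates
    # keep their input order.
    for src, dst, rate in sorted(flows, key=lambda f: f[2], reverse=True):
        # Score a core by the busier of the two links the flow would use:
        # the uplink src -> core and the downlink core -> dst.
        best = min(cores, key=lambda c: max(load[("up", src, c)],
                                            load[("down", c, dst)]))
        load[("up", src, best)] += rate
        load[("down", best, dst)] += rate
        placed.append((src, dst, rate, best))
    return placed, dict(load)


flows = [("agg0", "agg1", 1.0), ("agg0", "agg2", 1.0), ("agg3", "agg2", 1.0)]
placed, load = place_large_flows(flows, ["core0", "core1"])
print([core for _, _, _, core in placed])  # ['core0', 'core1', 'core0']
print(max(load.values()))                  # 1.0
```

In this example the second flow avoids a local collision on the agg0 uplink and the third avoids a downstream collision on the link into agg2; the latter choice depends on load that the source aggregation switch cannot observe locally.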
Figure 2: Performance aware software defined networking
Figure 2 shows the architecture described in Performance aware software defined networking, Load balancing LAG/ECMP groups, and SDN and large flows. This architecture uses the standard sFlow monitoring embedded within most vendors' switches to continuously monitor all the links in the fabric. The sFlow-RT analytics engine rapidly detects large flows, providing end-to-end visibility to the SDN load balancing application. The load balancer communicates with an OpenFlow controller, or a vendor supplied fabric controller REST API, to implement globally optimal forwarding decisions that avoid collisions and significantly increase the effective bandwidth of the fabric.