sFlow: SDN fabric controller for commodity data center switches

Figure 1: Rise of merchant silicon

Figure 1 illustrates the rapid transition to merchant silicon among leading data center network vendors, including: Alcatel-Lucent, Arista, Cisco, Cumulus, Dell, Extreme, Juniper, Hewlett-Packard, and IBM.

This article will examine some of the factors leading to commoditization of network hardware and the role that software defined networking (SDN) plays in coordinating hardware resources to deliver increased network efficiency.

Figure 2: Fabric: A Retrospective on Evolving SDN

The article, Fabric: A Retrospective on Evolving SDN by Martin Casado, Teemu Koponen, Scott Shenker, and Amin Tootoonchian, makes the case for a two tier SDN architecture comprising a smart edge and an efficient core.

Table 1: Edge vs Fabric Functionality

Virtualization and advances in the networking capability of x86 based servers are drivers behind this separation. Virtual machines are connected to each other and to the physical network using a software virtual switch. The software switch provides the flexibility to quickly develop and deploy advanced features like network virtualization, tenant isolation, distributed firewalls, etc. Network function virtualization (NFV) is moving firewall, load balancing, routing, etc. functions from dedicated appliances to virtual machines or embedding them within the virtual switches. The increased importance of network centric software has driven dramatic improvements in the performance of commodity x86 based servers, reducing the need for complex hardware functions in network devices.

As complex functions shift to software running on servers at the network edge, the role of the core physical network is simplified. Merchant silicon provides a cost effective way of delivering the high performance forwarding capabilities needed to interconnect servers, and Figure 1 shows how Broadcom based switches are now dominating the market.

The Broadcom white paper, Engineered Elephant Flows for Boosting Application Performance in Large-Scale CLOS Networks, describes the challenge posed by large "Elephant" flows and the opportunity to use software defined networking to orchestrate hardware resources and improve network efficiency.

Figure 3: Feedback controller

Figure 3 shows the elements of an SDN feedback controller. Network measurements are analyzed to identify network hot spots, available resources, and large flows. The controller then plans a response and deploys controls in order to allocate resources where they are needed and reduce contention. The control system operates as a continuous loop. The effect of the changes is observed by the measurement system and further changes are made as needed.
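One pass of such a loop can be sketched in a few lines of Python. This is only an illustration of the measure, plan, control cycle described above, not the logic of any particular controller: the `LargeFlow` type, the `plan_moves` function, the 0.8 hot spot threshold, and the assumption that every link is an equally valid path for every flow are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class LargeFlow:
    flow_id: str
    link: str    # link currently carrying the flow
    load: float  # fraction of link capacity the flow consumes

def plan_moves(utilization, large_flows, hot=0.8):
    """Plan one pass of the loop: steer large flows off hot links.

    utilization maps link name -> measured utilization (0.0 to 1.0).
    Returns (moves, predicted) where moves maps flow_id -> new link and
    predicted is the utilization expected after the moves are deployed.
    """
    predicted = dict(utilization)
    moves = {}
    # Consider the largest flows first, since they relieve the most load.
    for flow in sorted(large_flows, key=lambda f: f.load, reverse=True):
        if predicted[flow.link] < hot:
            continue  # the link is not (or is no longer) a hot spot
        target = min(predicted, key=predicted.get)  # least loaded link
        # Only move if the target ends up less loaded than the source is now.
        if target != flow.link and predicted[target] + flow.load < predicted[flow.link]:
            predicted[flow.link] -= flow.load
            predicted[target] += flow.load
            moves[flow.flow_id] = target
    return moves, predicted

utilization = {"up1": 0.9, "up2": 0.1}
flows = [LargeFlow("a", "up1", 0.4), LargeFlow("b", "up1", 0.4)]
moves, predicted = plan_moves(utilization, flows)
print(moves, predicted)
```

In a running controller the predicted values would be discarded and replaced by the next round of measurements, which is what closes the loop.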

Implementing the controller requires an understanding of the measurement and control capabilities of the Broadcom ASICs.

Control Protocol

Figure 4: Programming Pipeline for ECMP

The Broadcom white paper focuses on the ASIC architecture and control mechanisms and includes the functional diagram shown in Figure 4. The paper describes two distinct configuration tasks:

Programming the Routing Flow Table and ECMP Select Groups to perform equal cost multi-path forwarding of the majority of flows.
Programming the ACL Policy Flow Table to selectively override forwarding decisions for the relatively small number of Elephant flows responsible for the bulk of the traffic on the network.

Managing the Routing and ECMP Group tables is well understood and there are a variety of solutions available that can be used to configure ECMP forwarding:

1. CLI: Use the switch CLI to configure distributed routing agents running on each switch (e.g. OSPF, BGP, etc.)
2. Configuration Protocol: Similar to 1, but programmatic configuration protocols such as NETCONF or JSON RPC replace the CLI.
3. Server orchestration: Open Linux based switch platforms allow server management agents to be installed on the switches to manage configuration. For example, Cumulus Linux supports Puppet, Chef, CFEngine, etc.
4. OpenFlow: The white paper describes using the Ryu controller to calculate routes and update the forwarding and group tables using OpenFlow 1.3+ to communicate with Indigo OpenFlow agents on the switches.

The end result is very similar whichever method is chosen to populate the Routing and ECMP Group tables: the hardware forwards packets across multiple paths based on a hash function calculated over selected fields in the packets (e.g. source and destination IP addresses plus source and destination TCP ports):

index = hash(packet fields) % group.size
selected_physical_port = group[index]

Hash based load balancing works well for the large numbers of small flows "Mice" on the network, but is less suitable for the long lived large "Elephant" flows. The hash function may assign multiple Elephant flows to the same physical port (even if other ports in the group are idle), resulting in congestion and poor network performance.
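The collision problem is easy to reproduce with a small Python sketch of the selection logic above. The hash used by a real ASIC is vendor specific; CRC32 is used here purely as a stand-in, and the function names, port names, and flow tuples are made up for the illustration.

```python
import zlib
from collections import Counter

def ecmp_select(flow, group):
    """Pick an egress port by hashing the flow's header fields.

    flow is a tuple such as (src_ip, dst_ip, protocol, src_port, dst_port).
    CRC32 is an assumption standing in for the hardware hash function.
    """
    key = "|".join(str(field) for field in flow).encode()
    index = zlib.crc32(key) % len(group)
    return group[index]

def port_load(flows, group):
    """Count how many flows the hash places on each port in the group."""
    return Counter(ecmp_select(flow, group) for flow in flows)

group = ["p1", "p2", "p3", "p4"]
# Five hypothetical Elephant flows sharing a four port group.
elephants = [("10.0.0.%d" % i, "10.0.1.1", 6, 40000 + i, 5001) for i in range(5)]
print(port_load(elephants, group))
```

The selection depends only on the header fields, never on the size of the flow or the current load on the port, so nothing prevents several Elephant flows from landing on the same port. With five flows and four ports at least one port must carry two of them.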

Figure 5: Long vs Short flows (from The Nature of Datacenter Traffic: Measurements & Analysis)

The traffic engineering controller uses the ACL Policy Flow Table to manage Elephant flows, ensuring that they don't interfere with latency sensitive Mice and are evenly distributed across the available paths - see Marking large flows and ECMP load balancing.

Figure 6: Hybrid Programmable Forwarding Plane, David Ward, ONF Summit, 2011

Integrated hybrid OpenFlow 1.0 is an effective mechanism for exposing the ACL Policy Flow Table to an external controller:

Simple, no change to normal forwarding behavior, can be combined with any of the mechanisms used to manage the Routing and ECMP Group tables listed above.
Efficient, Routing and ECMP Group tables efficiently handle most flows. OpenFlow used to control ACL Policy Flow Table and selectively override forwarding of specific flows (block, mark, steer, rate-limit), maximizing effectiveness of limited number of entries available.
Scalable, most flows handled by existing control plane, OpenFlow only used when controller wants to make an exception.
Robust, if controller fails network keeps forwarding.
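The properties listed above can be illustrated with a toy model of the lookup order. This is a sketch only: the `HybridSwitch` class and its methods are hypothetical, CRC32 stands in for the hardware hash, and a real switch matches ACL entries on masked header fields rather than exact tuples.

```python
import zlib

class HybridSwitch:
    """Toy model of integrated hybrid forwarding (all names are hypothetical)."""

    def __init__(self, ports, acl_capacity):
        self.ports = list(ports)          # members of the ECMP group
        self.acl_capacity = acl_capacity  # ACL entries are a scarce resource
        self.acl = {}                     # flow -> port overrides set by the controller

    def install_override(self, flow, port):
        """Controller steers one flow; refuse if the ACL table is full."""
        if flow not in self.acl and len(self.acl) >= self.acl_capacity:
            return False
        self.acl[flow] = port
        return True

    def forward(self, flow):
        """An ACL override wins; otherwise fall back to hash based ECMP."""
        if flow in self.acl:
            return self.acl[flow]
        key = "|".join(map(str, flow)).encode()
        return self.ports[zlib.crc32(key) % len(self.ports)]

switch = HybridSwitch(["p1", "p2", "p3", "p4"], acl_capacity=1)
elephant = ("10.0.0.1", "10.0.1.1", 6, 40000, 5001)
print(switch.forward(elephant))        # normal ECMP choice
switch.install_override(elephant, "p3")
print(switch.forward(elephant))        # steered by the controller
switch.acl.clear()                     # controller and its entries go away
print(switch.forward(elephant))        # normal forwarding resumes
```

Because the override table is consulted first and is otherwise empty, removing the controller simply returns every flow to the normal forwarding path.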

The control protocol is only half the story. An effective measurement protocol is needed to rapidly identify network hot spots, available resources, and large flows so that the controller can identify which flows need to be managed and where to apply the controls.

Measurement Protocol

The Broadcom white paper is limited in its discussion of measurement, but it does list four ways of detecting large flows:

1. A priori
2. Monitor end host socket buffers
3. Maintain per flow statistics in network
4. sFlow

The first two methods involve signaling the arrival of large flows to the network from the hosts. Both methods have practical difficulties in that they require that every application and / or host implement the measurements and communicate them to the fabric controller - a difficult challenge in a heterogeneous environment. However, the more fundamental problem is that while both methods can usefully identify the arrival of large flows, they don't provide sufficient information for the fabric controller to take action since it also needs to know the load on all the links in the fabric.

The requirement for end to end visibility can only be met if the instrumentation is built into the network devices, which leads to options 3 and 4. Option 3 would require an entry in the ACL table for each flow and the Broadcom paper points out that this approach does not scale.

The solution to the measurement challenge is option 4. Support for the multi-vendor sFlow protocol is included in Broadcom ASICs, is completely independent of the forwarding tables, and can be enabled on all ports and all switches to provide the end to end visibility needed for effective control.

Figure 7: Custom vs. merchant silicon traffic measurement

Figure 7 compares traffic measurement on legacy custom ASIC based switches with standard sFlow measurements supported by merchant silicon vendors. The custom ASIC based switch, shown on top, performs many of the traffic flow analysis functions in hardware. In contrast, merchant silicon based switches shift flow analysis to external software, implementing only the essential measurement functions required for wire speed performance in silicon.

Figure 7 lists a number of benefits that result from moving flow analysis from the custom ASIC to external software, but in the context of large flow traffic engineering, the real-time detection of flows made possible by an external flow cache is essential if the traffic engineering controller is to be effective - see Rapidly detecting large flows, sFlow vs. NetFlow/IPFIX.
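As a rough illustration of what the external software does with packet samples, the following Python sketch scales sampled frame lengths by the sampling rate to estimate per flow bandwidth and flags flows that exceed a fraction of link speed. The function name, the input format, and the 10% threshold are assumptions made for the example and are not taken from the white paper or from any sFlow analyzer.

```python
from collections import defaultdict

def detect_large_flows(samples, sampling_rate, interval_s, link_bps, threshold=0.1):
    """Estimate flow rates from 1-in-N packet samples and return the large ones.

    samples is an iterable of (flow_key, frame_length_bytes) pairs collected
    during interval_s seconds. Each sampled packet is taken to represent
    sampling_rate packets of the same size, so bytes are scaled up by N.
    Returns {flow_key: estimated bits per second} for flows whose estimated
    rate is at least threshold * link_bps.
    """
    estimated_bytes = defaultdict(int)
    for flow_key, frame_length in samples:
        estimated_bytes[flow_key] += frame_length * sampling_rate

    limit_bps = threshold * link_bps
    large = {}
    for flow_key, byte_count in estimated_bytes.items():
        rate_bps = byte_count * 8 / interval_s
        if rate_bps >= limit_bps:
            large[flow_key] = rate_bps
    return large

# 100 samples of 1500 byte frames from one flow, 2 samples of 64 byte frames
# from another, at 1-in-1000 sampling over one second on a 10 Gbit/s link.
samples = [("elephant", 1500)] * 100 + [("mouse", 64)] * 2
print(detect_large_flows(samples, sampling_rate=1000, interval_s=1.0, link_bps=10e9))
```

The estimate is statistical, so its accuracy improves with the number of samples seen for a flow; large flows generate many samples quickly, which is why they can be detected rapidly.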

Figure 8: sFlow-RT feedback controller

Figure 8 shows a fully instantiated SDN feedback controller. The sFlow-RT controller leverages the sFlow and OpenFlow standards to optimize the performance of fabrics built using commodity switches. The following practical applications for the sFlow-RT controller have already been demonstrated:

Alcatel-Lucent Enterprise: ALUE Demonstrates Practical SDN Use Cases, Joins sFlow.org
Brocade: Brocade Crowned Winner of SDN Idol 2014 at Open Networking Summit 2014
InMon: Integrated hybrid OpenFlow control of HP switches

While the industry at large appears to be moving to the Edge / Fabric architecture shown in Figure 2, Cisco's Application Centric Infrastructure (ACI) is an anomaly. ACI is a tightly integrated proprietary solution; the Cisco Application Policy Infrastructure Controller (APIC) uses the Cisco OpFlex protocol to manage Cisco Nexus 9000 switches and Cisco AVS virtual switches. For example, the Cisco Nexus 9000 switches are based on Broadcom silicon and provide an interoperable NX-OS mode. However, line cards that include an Application Leaf Engine (ALE) ASIC along with the Broadcom ASIC are required to support ACI mode. The ALE provides visibility and control features for large flow load balancing and prioritization - both of which can be achieved using standard protocols to manage the capabilities of the Broadcom ASIC.

It will be interesting to see whether ACI is able to compete with modular, low cost solutions based on open standards and commodity hardware. Cisco has offered its customers a choice and, given the compelling value of open platforms, I expect many will choose not to be locked into the proprietary ACI solution and will favor NX-OS mode on the Nexus 9000 series, pushing Cisco to provide the full set of open APIs currently available on the Nexus 3000 series (sFlow, OpenFlow, Puppet, Python etc.).

Figure 9: Move communicating virtual machines together to reduce network traffic (from NUMA)

Finally, SDN is only one piece of a larger effort to orchestrate network, compute and storage resources to create a software defined data center (SDDC). For example, Figure 9 shows how network analytics from the fabric controller can be used to move virtual machines (e.g. by integrating with OpenStack APIs) to reduce application response times and network traffic. More broadly, feedback control allows efficient matching of resources to workloads and can dramatically increase the efficiency of the data center - see Workload placement.
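The idea behind Figure 9 can be sketched as a simple search over candidate moves. This is a minimal illustration assuming the controller can supply a traffic matrix between virtual machines; the function names are hypothetical, host capacity and migration cost are ignored, and no OpenStack API calls are shown.

```python
def network_traffic(placement, traffic):
    """Total rate of traffic between VMs placed on different hosts.

    placement maps VM name -> host name; traffic maps (vm_a, vm_b) -> rate.
    Traffic between VMs on the same host never touches the physical network.
    """
    return sum(rate for (a, b), rate in traffic.items()
               if placement[a] != placement[b])

def best_move(placement, traffic, hosts):
    """Find the single VM move that most reduces traffic on the network.

    Returns (resulting_traffic, vm, host), or (current_traffic, None, None)
    if no single move is an improvement.
    """
    best = (network_traffic(placement, traffic), None, None)
    for vm in sorted(placement):
        for host in hosts:
            if host == placement[vm]:
                continue
            trial = {**placement, vm: host}  # placement with one VM moved
            cost = network_traffic(trial, traffic)
            if cost < best[0]:
                best = (cost, vm, host)
    return best

placement = {"web": "h1", "db": "h2", "cache": "h2"}
traffic = {("web", "db"): 100, ("db", "cache"): 10}
print(best_move(placement, traffic, ["h1", "h2"]))
```

A production placement engine would also have to weigh CPU, memory, and the cost of the migration itself, which is why this kind of decision belongs to the broader orchestration layer rather than to the network controller alone.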

Saturday, May 31, 2014
