Friday, March 13, 2015

ECMP visibility with Cumulus Linux

Demo: Implementing the Big Data Design Guide in the Cumulus Workbench  is a great demonstration of the power of zero touch provisioning and automation. When the switches and servers boot they automatically pick up their operating systems and configurations for the complex Equal Cost Multi-Path (ECMP) routed network shown in the diagram.

Topology discovery with Cumulus Linux looked at an alternative Multi-Chassis Link Aggregation (MLAG) configuration and shows how to extract the configuration and monitor traffic on the network using sFlow and Fabric View.

The paper Hedera: Dynamic Flow Scheduling for Data Center Networks describes the impact of colliding flows on effective ECMP cross-sectional bandwidth. The paper gives an example demonstrating that effective cross-sectional bandwidth can be reduced by 20% to 60%, depending on the number of simultaneous flows per host.

This article uses the workbench to demonstrate the effect of large "Elephant" flow collisions on network throughput. The following script running on each of the servers uses the iperf tool to generate pairs of overlapping Elephant flows:
cumulus@server1:~$ while true; do iperf -c 10.4.2.2 -t 20; sleep 20; done
------------------------------------------------------------
Client connecting to 10.4.2.2, TCP port 5001
TCP window size: 1.06 MByte (default)
------------------------------------------------------------
[  3] local 10.4.1.2 port 57234 connected with 10.4.2.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-20.0 sec  21.9 GBytes  9.41 Gbits/sec
------------------------------------------------------------
Client connecting to 10.4.2.2, TCP port 5001
TCP window size: 1.06 MByte (default)
------------------------------------------------------------
[  3] local 10.4.1.2 port 57240 connected with 10.4.2.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-20.0 sec  10.1 GBytes  4.34 Gbits/sec
------------------------------------------------------------
Client connecting to 10.4.2.2, TCP port 5001
TCP window size: 1.06 MByte (default)
------------------------------------------------------------
[  3] local 10.4.1.2 port 57241 connected with 10.4.2.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-20.0 sec  21.9 GBytes  9.41 Gbits/sec
------------------------------------------------------------
The first iperf test achieves a TCP throughput of 9.41 Gbits/sec (the maximum achievable on the 10Gbit/s network in the workbench). However, the second test only achieves a throughput of 4.34 Gbits/sec. How can this result be explained?
The Top Flows table above confirms that two simultaneous Elephant flows are being tracked by Fabric View.
The Traffic charts update every second and give a fine grained view of the traffic flows over time. The charts clearly show how the iperf flows vary in throughput, with the low throughput runs achieving approximately 50% of the network capacity (consistent with the 20% to 60% reduction reported in the Hedera paper).
The Performance charts show what is happening. Packets take two hops as they are routed from leaf1 to leaf2 (via spine1 or spine2). Each iperf connection is able to fully utilize the two links along its path to achieve line rate throughput. Comparing the Total Traffic and Busy Spine Links charts shows that the peak total throughput of approximately 20 Gbits/sec corresponds to intervals when 4 spine links are busy. Throughput is halved during intervals when the routes overlap and share 1 or 2 links (shown in gold as Collisions on the Busy Spine Links chart).
Readers might be surprised by the frequency of collisions given the number of links in the network. Packets take two hops to go from leaf1 to leaf2, routed via spine1 or spine2. In addition, the links between switches are paired, so there are 8 possible two hop paths from leaf1 to leaf2. The explanation involves the conditional probability that the second flow will overlap with the first. Suppose the first flow is routed to spine1 via port swp1s0 and that spine1 routes the flow to leaf2 via port swp51. If the second flow is routed via any of the 4 paths through spine2, there is no collision. However, if it is routed via spine1, there is only 1 path that avoids a collision (leaf1 port swp1s1 to spine1 port swp52). This means that there is a 5/8 chance of avoiding a collision, or a 3/8 (37.5%) chance that the two flows will collide. The probability of flow collisions remains surprisingly high even on very large networks with many spine switches and paths (see Birthday Paradox).
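This reasoning can be checked with a short enumeration. The following Python sketch counts the two hop paths that collide with the first flow; the spine2 uplink port names (swp1s2 and swp1s3) are assumed for illustration:
from itertools import product

# Enumerate the 8 possible two hop paths from leaf1 to leaf2. Port names
# for the spine1 uplinks and downlinks follow the example above; the
# spine2 uplink ports (swp1s2, swp1s3) are assumed for illustration.
uplinks = {'spine1': ['swp1s0', 'swp1s1'], 'spine2': ['swp1s2', 'swp1s3']}
downlinks = ['swp51', 'swp52']  # spine ports facing leaf2

paths = [(up, spine, down)
         for spine in uplinks
         for up, down in product(uplinks[spine], downlinks)]

first = ('swp1s0', 'spine1', 'swp51')  # path taken by the first flow

# The second flow collides if it traverses the same spine and shares the
# uplink or the downlink with the first flow.
collisions = [p for p in paths
              if p[1] == first[1] and (p[0] == first[0] or p[2] == first[2])]

print('paths=%d collisions=%d probability=%.1f%%'
      % (len(paths), len(collisions), 100.0 * len(collisions) / len(paths)))
# paths=8 collisions=3 probability=37.5%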
Also note the Discards trend in the Congestion and Errors section. Comparing the rate of discards with Collisions in the Busy Spine Links chart shows that discards don't occur unless there are Elephant flow collisions on the busy links.
The Discards trend lags the Collisions trend because discards are reported using sFlow counters while the Collisions metric is based on packet samples - see Measurement delay, counters vs. packet samples.
This example demonstrates the visibility into leaf and spine fabric performance achievable using standard sFlow instrumentation built into commodity switch hardware. If you have a leaf and spine network, request a free evaluation of Fabric View to better understand your network's performance.
This small four switch leaf and spine network is composed of 12 x 10 Gbits/sec links, which would require 24 x 10 Gbits/sec taps, with associated probes and collectors, to fully monitor using the traditional tools applied to legacy data center networks. The cost and complexity of tapping leaf and spine topologies is prohibitive. However, leaf and spine switches typically include hardware support for the sFlow measurement standard, embedding line rate visibility into every switch port for network wide coverage at no extra cost. In this example, the Fabric View analytics software is running on a commodity physical or virtual server consuming 1% CPU and 200 MBytes RAM.
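Fabric View is built on the sFlow-RT analytics engine, so the flow data behind the charts is also available programmatically. The following minimal Python sketch assumes an sFlow-RT instance reachable on its default REST port (8008) and uses an illustrative flow name (tcpbytes) to define a TCP flow metric and poll for the largest active flows across the fabric:
#!/usr/bin/env python
# Minimal sketch: query the analytics engine for the largest active flows.
# Assumes an sFlow-RT instance on its default REST port (8008); the flow
# name "tcpbytes" is illustrative.
import requests

rt = 'http://localhost:8008'

# Define a flow metric keyed on the TCP 4-tuple, measured in bytes/sec
requests.put(rt + '/flow/tcpbytes/json', json={
    'keys': 'ipsource,ipdestination,tcpsourceport,tcpdestinationport',
    'value': 'bytes'
})

# Poll for the top active flows seen across all switches in the fabric
r = requests.get(rt + '/activeflows/ALL/tcpbytes/json',
                 params={'maxFlows': 5, 'minValue': 0})
for f in r.json():
    print('%s %.0f bytes/sec' % (f['key'], f['value']))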
Real-time analytics for leaf and spine networks is a core enabling technology for software defined networking (SDN) control mechanisms that can automatically adapt the network to rapidly changing flow patterns and dramatically improve performance.
For example, REST API for Cumulus Linux ACLs describes how an SDN controller can remotely control switches. Use cases discussed on this blog include Elephant flow marking, Elephant flow steering, and DDoS mitigation.
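As a rough illustration of such a control loop, the following Python sketch watches sFlow-RT for a large UDP flow and pushes a drop ACL to the switch; the acl_server endpoint and rule format are assumptions based on the REST API for Cumulus Linux ACLs article and should be adapted to the actual deployment:
#!/usr/bin/env python
# Minimal sketch of a DDoS mitigation control loop: watch sFlow-RT for a
# large UDP flow and push a drop ACL to the switch. The acl_server
# endpoint (port 8080, /acl/<name>) and the JSON list of iptables-style
# rules are illustrative assumptions.
import requests

rt = 'http://localhost:8008'      # sFlow-RT analytics engine
switch = 'http://leaf1:8080'      # acl_server running on the switch (assumed)

# Track UDP flows by source address, destination address and source port
requests.put(rt + '/flow/udp_reflection/json', json={
    'keys': 'ipsource,ipdestination,udpsourceport',
    'value': 'frames'
})
# Generate an event when a flow exceeds 100,000 packets per second
requests.put(rt + '/threshold/udp_reflection/json', json={
    'metric': 'udp_reflection', 'value': 100000
})

eventID = -1
while True:
    r = requests.get(rt + '/events/json',
                     params={'eventID': eventID, 'maxEvents': 10, 'timeout': 60})
    events = r.json()
    if not events:
        continue
    eventID = events[0]['eventID']
    for e in events:
        if e['thresholdID'] != 'udp_reflection':
            continue
        src, dst, port = e['flowKey'].split(',')
        rules = ['[iptables]',
                 '-A FORWARD --in-interface swp+ -s %s -d %s -p udp '
                 '--sport %s -j DROP' % (src, dst, port)]
        requests.put(switch + '/acl/ddos%s' % port, json=rules)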

Finally, Cumulus Linux runs on open switch hardware from Agema, Dell, Edge-Core, Penguin Computing, and Quanta. In addition, Hewlett-Packard recently announced that it will soon be selling a new line of open network switches built by Accton Technologies that support Cumulus Linux. The increasing availability of low cost open networking hardware running Linux creates a platform for open source and commercial software developers to quickly build and deploy innovative solutions.
