Saturday, January 19, 2013

Load balancing LAG/ECMP groups

Figure 1: Hash collision on a link aggregation group
The Internet Draft, draft-krishnan-opsawg-large-flow-load-balancing, is a good example of the type of problem that can be addressed using performance aware software defined networking (SDN). The Internet Draft describes the need for real-time analytics to drive load balancing of long lived flows in LAG/ECMP groups.

The draft describes the challenge of managing long lived connections in the context of service provider backbones, but similar problems occur in the data center where long lived storage connections (iSCSI/FCoE) and network virtualization tunnels (VxLAN, NVGRE, STT, GRE etc) are responsible for a significant fraction of data center traffic.

The challenge posed by long lived flows is best illustrated by a specific example. The article, Link aggregation, provides a basic overview of LAG/MLAG topologies. Figure 1 shows a detailed view of an aggregation group consisting of 4 links connecting Switch A and Switch C and is used to illustrate the problem posed by long lived flows.

To ensure that packets in a flow arrive in order at their destination, Switch C computes a hash function over selected fields in the packets (e.g. L2: source and destination MAC address, L3: source and destination IP address, or L4: source and destination IP address + source and destination TCP/UDP ports) and picks a link based on the value of the hash, e.g.
index = hash(packet fields) % linkgroup.size
selected_link = linkgroup[index]
Hash based load balancing is easily implemented in hardware and works reasonably well, as long as traffic consists of large numbers of short duration flows, since the hash function randomly distributes flows across the members of the group. However, consider the problem posed by the two long lived high traffic flows, shown in the diagram as Packet Flow 1 and Packet Flow 2. There is a 1 in 4 chance that the two flows will be assigned the same group member and they will share this link for their lifetime.
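The selection logic and the 1 in 4 figure can be checked with a short Python sketch. This is only an illustration: CRC32 stands in for the switch's hash function (real hardware hashes are vendor specific), and the flow key and the function names select_link and collision_probability are hypothetical.

```python
import zlib
from itertools import product


def select_link(flow_key, group_size):
    """Map a flow's header fields to a group member index (0-based)."""
    # Assumption: CRC32 stands in for the hardware hash, which is vendor specific.
    data = "|".join(str(field) for field in flow_key).encode()
    return zlib.crc32(data) % group_size


def collision_probability(group_size):
    """Chance that two independently hashed flows land on the same member."""
    # Enumerate every equally likely (member for flow 1, member for flow 2) pair.
    outcomes = list(product(range(group_size), repeat=2))
    shared = [pair for pair in outcomes if pair[0] == pair[1]]
    return len(shared) / len(outcomes)


# Hypothetical L4 flow key: (src IP, dst IP, protocol, src port, dst port)
flow = ("10.0.0.1", "10.0.1.1", 6, 49152, 3260)
print("flow assigned to member", select_link(flow, 4))
print("collision probability for 4 links:", collision_probability(4))
```

Because every packet in a flow carries the same header fields, select_link always returns the same member for that flow, which is what keeps packets in order and also what pins two colliding flows to the same link.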
Figure 2: from article Link aggregation
If you consider the network topology in Figure 2, there may be many aggregation groups in the shared path between the two flows, with a chance of collision on each hop. In addition, when you consider the large number of long lived storage and tunneled flows in the data center, the probability that busy flows will collide on at least one aggregation group is high.
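As a rough illustration of how the per-hop chances compound, the sketch below computes the probability that two flows collide on at least one aggregation group along a shared path. It assumes the hash decision at each hop is independent, which may not hold in practice if switches use correlated hash functions; the function name and the three-hop example are hypothetical.

```python
from fractions import Fraction


def path_collision_probability(group_sizes):
    """Probability that two flows share a link on at least one hop.

    group_sizes lists the number of member links in each aggregation
    group on the shared path. Assumes independent hashing at every hop.
    """
    no_collision = Fraction(1)
    for n in group_sizes:
        # The flows avoid each other on this hop with probability (n - 1) / n.
        no_collision *= Fraction(n - 1, n)
    return 1 - no_collision


# Hypothetical path crossing three 4-link aggregation groups
print(path_collision_probability([4, 4, 4]))
```

Under these assumptions a single 4-link group gives 1 in 4, while three such groups in series give 37 in 64, so a collision somewhere on the path is more likely than not.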

Collisions between high traffic flows can result in chronic performance problems and poorly balanced load across the link group. The link carrying colliding flows may become overloaded and experience packet loss and delay, while other links in the group may be lightly loaded with plenty of spare capacity. The challenge is identifying links with flow collisions and changing the path selection used by the switches in order to use spare capacity in the link group.
Figure 3: Elements of an SDN stack to load balance aggregation groups
Figure 3 shows how performance aware SDN can be used to load balance long lived connections and increase performance across the data center. A multi-path SDN load balancing system would consist of the following elements:
  1. Measurement - The sFlow standard provides multi-vendor, scalable, low latency monitoring of the entire network infrastructure.
  2. Analytics - The sFlow-RT real-time analytics engine receives the sFlow measurements and rapidly identifies large flows. In addition, the analytics engine provides the detailed visibility into link aggregation topology and health needed to choose alternate paths.
  3. SDN application - The SDN application implements a load balancing algorithm, immediately responding to large flows with commands to the OpenFlow controller.
  4. Controller - The OpenFlow controller translates high level instructions to re-route flows into low level OpenFlow commands.
  5. Configuration - The OpenFlow protocol provides a fast, programmatic means for the controller to reconfigure forwarding in the network devices.
Figure 4: Load balance link aggregation group by moving large flow
Figure 4 shows the result of applying dynamic load balancing to the aggregation group. The controller detected that the link connecting Switch A, port 2 to Switch C, port 1 was heavily utilized and identified the two colliding flows responsible for the traffic. The controller selected the alternate link connecting Switch A, port 2 to Switch C, port 3 as being underutilized and used OpenFlow to reconfigure the forwarding tables in Switch C to direct Packet Flow 2 to this alternate path. The result is that traffic is evenly spread over the link group members, increasing effective capacity and improving performance by lowering packet loss and delay across the group.
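The decision made in Figure 4 can be sketched as a simple greedy step: find the busiest group member, and if it carries a large flow that would fit better on the least loaded member, move it. This is a minimal sketch rather than the algorithm used by any particular controller. The Flow class, the rebalance_once function, the 100 Mbps large flow threshold and the traffic rates are all hypothetical, and updating flow.link stands in for the OpenFlow commands a real controller would send.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    name: str
    rate_mbps: float
    link: int  # index of the group member currently carrying the flow


def rebalance_once(flows, group_size, large_flow_mbps=100.0):
    """Move one large flow from the busiest member to the least loaded one.

    Returns (flow name, old link, new link), or None if no move would help.
    """
    load = [0.0] * group_size
    for f in flows:
        load[f.link] += f.rate_mbps
    busiest = max(range(group_size), key=load.__getitem__)
    idlest = min(range(group_size), key=load.__getitem__)
    # Only consider large flows, and only moves that leave the target link
    # less loaded than the source link is now (this avoids oscillation).
    candidates = [
        f for f in flows
        if f.link == busiest
        and f.rate_mbps >= large_flow_mbps
        and load[idlest] + f.rate_mbps < load[busiest]
    ]
    if not candidates:
        return None
    flow = max(candidates, key=lambda f: f.rate_mbps)
    flow.link = idlest  # stand-in for an OpenFlow rule installed by the controller
    return (flow.name, busiest, idlest)


# Hypothetical traffic: two large flows collide on member 0 of a 4-link group
flows = [
    Flow("Packet Flow 1", 400.0, 0),
    Flow("Packet Flow 2", 450.0, 0),
    Flow("background a", 20.0, 1),
    Flow("background b", 30.0, 2),
]
print(rebalance_once(flows, 4))
```

In a running system this step would be repeated as new measurements arrive, since flow rates and collisions change over time.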

Load balancing is a continuous process: traffic carried by each of the long lived flows is continually changing, and different flows will collide at different times and in different places. To be effective, the control system needs to have pervasive visibility into traffic and control of switch forwarding. Selecting switches that support both the OpenFlow and sFlow standards creates a solid foundation for deploying performance aware software defined networking solutions like the one described in this article.

Load balancing isn't just an issue for link aggregation (LAG/MLAG) topologies; the same issues occur with equal cost multi-path routing (ECMP) and WAN traffic optimization. Other applications for performance aware SDN include denial of service mitigation, multi-tenant performance isolation and workload placement.


  1. "There is a 1 in 4 chance that the two flows will be assigned the same group member and they will share this link for their lifetime."

    The two flows getting hashed to LAG members are two independent events and, hence, the probability that the two will be assigned the same group member is 1/16 (1/4 * 1/4) and not 1 in 4.

    1. There are 16 combinations of ports that could be selected for the two flows: (1,1), (1,2), (1,3), (1,4), (2,1), (2,2), (2,3), (2,4), (3,1), (3,2), (3,3), (3,4), (4,1), (4,2), (4,3) and (4,4), each of which is equally likely. Four of the combinations result in a link being shared (1,1), (2,2), (3,3) and (4,4) - so the chance that two flows will be assigned to the same member is 1 in 4.

      You could also look at this as a conditional probability problem. Assume the first flow was assigned to a particular port in the group of size N - the chance that a subsequent flow will be assigned to the same port is 1 in N - in this case 1 in 4.

  2. "The controller selected the alternate link connecting Switch A, port 2 to Switch C, port 3 as being underutilized and used OpenFlow to reconfigure the forwarding tables in Switch C to direct Packet Flow 2 to this alternate path"

    Shouldn't this read "Switch A, port 4 to Switch C, port 3"?