Monday, October 1, 2012

Link aggregation

Figure 1: Link Aggregation Groups
The title of the recently finalized sFlow LAG Counters Structure specification may not sound like much, but it is an exciting development for managing data center networks. To understand why, it is worth looking at how Link Aggregation Groups (LAGs) are deployed.

Note: Link aggregation goes by many different names, which causes much confusion: Port Grouping, Port Trunking, Link Bundling, NIC/Link Bonding, NIC/Link Teaming, etc. These are all examples of link aggregation, and the discussion in this article applies to each of them.

Figure 1 shows a number of common uses for link aggregation. Switches A, B, C and D are interconnected by LAGs, each of which is made up of four individual links. In this case the LAGs are used to provide greater bandwidth between switches at the network core.

A LAG generally doesn't provide the same performance characteristics as a single link with equivalent capacity. In this example, suppose that the LAGs are 4 x 10 Gigabit Ethernet. The LAG needs to ensure in-order delivery of packets since many network protocols perform badly when packets arrive out of order (e.g. TCP). Packet header fields are examined and used to assign all packets that are part of a connection to the same link within the aggregation group. The result is that the maximum bandwidth available to any single connection is 10 Gigabits per second, not 40 Gigabits per second. The LAG can carry 40 Gigabits per second, but the traffic must be a mixture of connections.
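The link selection described above can be sketched as a hash over header fields. This is only an illustration: the function name `select_member_link`, the choice of a 5-tuple key, and the use of CRC32 are assumptions made for the example, and real switches use vendor-specific hash functions and field selections.

```python
import zlib

def select_member_link(src_ip, dst_ip, src_port, dst_port, protocol, num_links):
    """Map a connection's header fields to a member link index.

    Hypothetical hash: CRC32 over the 5-tuple, modulo the number of links.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{protocol}".encode()
    return zlib.crc32(key) % num_links

# Every packet of one connection hashes to the same link, preserving order.
flow = ("10.0.0.1", "10.0.0.2", 40000, 80, 6)
link = select_member_link(*flow, num_links=4)
assert all(select_member_link(*flow, num_links=4) == link for _ in range(1000))

# Consequence for a 4 x 10 Gigabit Ethernet LAG:
LINK_SPEED_GBPS = 10
max_single_connection_gbps = LINK_SPEED_GBPS   # a connection stays on one link
max_aggregate_gbps = 4 * LINK_SPEED_GBPS       # needs a mixture of connections
```

Because the hash is deterministic, a single large connection can never spread across members, which is why the per-connection ceiling is the speed of one link.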

The alternative of a single 40G Ethernet link allows a single connection to use the full bandwidth of the link and transfer data at 40 Gigabits per second. However, the LAG is potentially more resilient, since a link failure will simply reduce the LAG capacity by 25% and the two switches will still have connectivity. On the other hand the LAG involves four times as many links and so there is an increased likelihood of link failures.
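The trade-off between resilience and exposure can be made concrete with a little arithmetic. The sketch below assumes that links fail independently with the same probability; both that assumption and the 1% figure are purely illustrative, not measured values.

```python
def lag_failure_stats(p_link_fail, num_links):
    """Return (probability that at least one member fails,
    fraction of LAG capacity lost when one member fails).

    Assumes independent, identically likely link failures.
    """
    p_any = 1 - (1 - p_link_fail) ** num_links
    capacity_loss = 1 / num_links
    return p_any, capacity_loss

# Hypothetical per-link failure probability of 1% over some period.
p_any, capacity_loss = lag_failure_stats(0.01, 4)
print(f"P(at least one member fails) = {p_any:.4f}")
print(f"Capacity lost per failed member = {capacity_loss:.0%}")
```

Under these assumptions the four-link LAG is roughly four times as likely to see some failure as a single link, but each failure costs only a quarter of the capacity instead of all connectivity.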

Servers are often connected to two separate switches to ensure that if one switch fails, the server has backup connectivity through the second switch. In this example, servers A and B are connected to switches C and D. A limitation of this approach is that the backup link is idle and the bandwidth isn't available to the server.

A Multi-chassis Link Aggregation Group (MLAG) allows the server to actively use both links, treating them as a single, high capacity LAG. The "multi-chassis" part of the name refers to what happens at the other end of the link. The two switches C and D communicate with each other in order to handle the two links as if they were arriving at a single switch as part of a conventional LAG, ensuring in-order delivery of packets etc.

There is no standard for logically combining the switches to support MLAGs - each vendor has their own approach (e.g. Hewlett-Packard Intelligent Resilient Framework (IRF), Cisco Virtual Switching System (VSS), Cisco Virtual PortChannel (vPC), Arista MLAG domains, Dell/Force10 VirtualScale (VS) etc.). As far as the servers are concerned, though, the network adapters are combined (or bonded) to form a simple LAG that provides the benefit of increased bandwidth and redundancy. A potential drawback of actively using both adapters is an increased vulnerability to failures, since bandwidth will drop by 50% during a failure, potentially triggering congestion related service problems.

MLAGs aren't restricted to the server access layer. Looking at Figure 1, if switches A and B share control information and switches C and D share control information, it is possible to aggregate links into two groups of 8, or even a single group of 16. One of the benefits of aggregating core links is that the topology can become logically "loop free", ensuring fast convergence in the event of a link failure and relegating spanning tree to provide protection against configuration errors.

Based on the discussion, it should be clear that managing the performance of LAGs requires visibility into network traffic patterns and paths through the LAGs and member links, visibility into link utilizations and the balance between group members, and visibility into the health of each link.
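As a rough illustration of the utilization and balance measurements, the sketch below derives per-link utilization from two successive octet counter readings and summarizes how evenly load is spread across the group. The function names, the 20 second interval and the counter deltas are hypothetical, and counter wrap is ignored for brevity.

```python
def link_utilization(octets_prev, octets_now, interval_s, if_speed_bps):
    """Fraction of link capacity used between two counter readings."""
    bits = (octets_now - octets_prev) * 8
    return bits / (interval_s * if_speed_bps)

def lag_balance(utilizations):
    """Return (mean utilization, max/mean ratio).

    A ratio of 1.0 means the load is perfectly even across members.
    """
    mean = sum(utilizations) / len(utilizations)
    ratio = max(utilizations) / mean if mean > 0 else 1.0
    return mean, ratio

# Hypothetical octet deltas for four 10 Gigabit members over 20 seconds.
IF_SPEED_BPS = 10_000_000_000
INTERVAL_S = 20
deltas = [5_000_000_000, 2_500_000_000, 2_500_000_000, 2_500_000_000]

utilizations = [link_utilization(0, d, INTERVAL_S, IF_SPEED_BPS) for d in deltas]
mean_util, imbalance = lag_balance(utilizations)
print(utilizations, mean_util, imbalance)
```

A max/mean ratio well above 1.0 suggests that a few large connections are dominating one member, which the group average alone would hide.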

The LAG extension to the sFlow standard builds on the detailed visibility that sFlow already provides into switched network traffic to provide additional detail about LAG topology and health. The IEEE 802.3 LAG MIB defines the set of objects describing elements of the LAG and counters that can be used to monitor LAG health. The sFlow LAG extension simply maps values defined in the MIB into an sFlow counter structure that is exported using sFlow's scalable "push" mechanism, allowing large scale monitoring of LAG based network architectures.

The new measurements are best understood by examining a single aggregation group.
Figure 2: Detail of a Link Aggregation Group
Figure 2 provides a detailed view of the LAG connecting switches A and C. Ethernet cables connect ports 2, 4, 6 and 8 on Switch A to ports 1, 3, 5 and 7 respectively on Switch C. The two switches communicate with each other using the Link Aggregation Control Protocol (LACP) in order to check the health of each link and negotiate settings to establish and maintain the LAG.

LACP associates a System ID with each switch. The System ID is simply a vendor assigned MAC address that is unique to each switch. In this example, Switch A has the System ID 000000000010 and Switch C has the System ID 000000000012.

Each switch assigns an Aggregation ID, or logical port number, to the group of physical ports. Switch A identifies the LAG as port 501 and Switch C identifies the LAG as port 512.

The following sflowtool output shows what an interface counter sample exported by Switch A, reporting on physical port 2, would look like:
startSample ----------------------
sampleType_tag 0:2
sampleType COUNTERSSAMPLE
sampleSequenceNo 110521
sourceId 0:2
counterBlock_tag 0:1
ifIndex 2
networkType 6
ifSpeed 100000000
ifDirection 1
ifStatus 3
ifInOctets 35293750622
ifInUcastPkts 241166136
ifInMulticastPkts 831459
ifInBroadcastPkts 11589475
ifInDiscards 0
ifInErrors 0
ifInUnknownProtos 0
ifOutOctets 184200359626
ifOutUcastPkts 375811771
ifOutMulticastPkts 1991731
ifOutBroadcastPkts 5001804
ifOutDiscards 63606
ifOutErrors 0
ifPromiscuousMode 1
counterBlock_tag 0:2
dot3StatsAlignmentErrors 1
dot3StatsFCSErrors 0
dot3StatsSingleCollisionFrames 0
dot3StatsMultipleCollisionFrames 0
dot3StatsSQETestErrors 0
dot3StatsDeferredTransmissions 0
dot3StatsLateCollisions 0
dot3StatsExcessiveCollisions 0
dot3StatsInternalMacTransmitErrors 0
dot3StatsCarrierSenseErrors 0
dot3StatsFrameTooLongs 0
dot3StatsInternalMacReceiveErrors 0
dot3StatsSymbolErrors 0
counterBlock_tag 0:7
actorSystemID 000000000010
partnerSystemID 000000000012
attachedAggID 501
actorAdminPortState 5
actorOperPortState 61
partnerAdminPortState 5
partnerOperPortState 61
LACPDUsRx 11
markerPDUsRx 0
markerResponsePDUsRx 0
unknownRx 0
illegalRx 0
LACPDUsTx 19
markerPDUsTx 0
markerResponsePDUsTx 0
endSample   ----------------------
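Output in this form is easy to process with a script. The sketch below assumes the one "name value" pair per line layout shown above and uses a shortened copy of the sample; the function name `parse_counter_sample` is hypothetical and is not part of sflowtool.

```python
def parse_counter_sample(text):
    """Parse 'name value' lines from a counter sample into a dict.

    counterBlock_tag repeats once per block, so its values are kept in a list.
    """
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or parts[0] in ("startSample", "endSample"):
            continue
        key, value = parts
        if key == "counterBlock_tag":
            fields.setdefault(key, []).append(value)
        else:
            fields[key] = value
    return fields

# Shortened copy of the sample above.
sample = """\
startSample ----------------------
ifIndex 2
counterBlock_tag 0:7
actorSystemID 000000000010
partnerSystemID 000000000012
attachedAggID 501
actorOperPortState 61
partnerOperPortState 61
endSample   ----------------------
"""

fields = parse_counter_sample(sample)
# Physical port 2 is a member of the LAG that Switch A calls port 501.
lag_member = (fields["attachedAggID"], fields["ifIndex"])
print(lag_member, fields["actorSystemID"], fields["partnerSystemID"])
```

Grouping samples by attachedAggID in this way recovers the membership of each LAG, and the actor and partner System IDs identify the switches at either end.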
The LAG MIB should be consulted for detailed descriptions of the fields, for example, refer to the following LacpState definition from the MIB to understand the operational port state values:
LacpState ::= TEXTUAL-CONVENTION
    STATUS      current
    DESCRIPTION
        "The Actor and Partner State values from the LACPDU."
    SYNTAX      BITS {
  lacpActivity(0),
  lacpTimeout(1),
  aggregation(2),
  synchronization(3),
  collecting(4),
  distributing(5),
  defaulted(6),
  expired(7)
                }
In the sflowtool output the actor (local) and partner (remote) operational state associated with the LAG member is 61, which is 111101 in binary. Reading from the least significant bit (bit 0), this value indicates that the lacpActivity(0), aggregation(2), synchronization(3), collecting(4) and distributing(5) bits are set, while lacpTimeout(1), defaulted(6) and expired(7) are clear - i.e. the link is healthy.
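The decoding above is mechanical and can be sketched in a few lines. The bit numbering follows the interpretation used in this article (bit 0 is the least significant bit); the function names and the rule used by `is_member_healthy` are assumptions made for illustration, not definitions taken from the MIB.

```python
# Bit names in order, taken from the LacpState definition above.
LACP_STATE_BITS = [
    "lacpActivity",
    "lacpTimeout",
    "aggregation",
    "synchronization",
    "collecting",
    "distributing",
    "defaulted",
    "expired",
]

def decode_lacp_state(value):
    """Return the names of the bits set in a port state value."""
    return [name for bit, name in enumerate(LACP_STATE_BITS) if value & (1 << bit)]

def is_member_healthy(value):
    """Hypothetical health rule: the member is aggregating, in sync,
    collecting and distributing, and is neither defaulted nor expired."""
    flags = set(decode_lacp_state(value))
    required = {"aggregation", "synchronization", "collecting", "distributing"}
    return required <= flags and not flags & {"defaulted", "expired"}

print(decode_lacp_state(61))
print(is_member_healthy(61))
```

Applied to both actorOperPortState and partnerOperPortState for every member port, a check of this kind flags links where one end has stopped collecting or distributing traffic.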

While this article discussed the low level details of LAG monitoring, performance management tools should automate this analysis and allow the health and performance of all the LAGs to be tracked. In addition, sFlow integrates LAG monitoring with measurements of traffic flows, server activity and application response times to provide comprehensive visibility into data center performance. The Data center convergence, visibility and control presentation describes the critical role that measurement plays in managing costs and optimizing performance.

Today, almost every switch vendor offers products that implement the sFlow standard. If you make use of link aggregation, ask your switch vendor to add support for the LAG extension. Implementing the sFlow LAG extension is straightforward if they already support the IEEE LAG MIB.
