Friday, October 19, 2012

Using Ganglia to monitor GPU performance


The Ganglia charts show GPU health and performance metrics collected using sFlow (see GPU performance monitoring). The combination of Ganglia and sFlow provides a highly scalable solution for monitoring the performance of large GPU-based compute clusters, eliminating the need to poll for GPU metrics. Instead, all the host and GPU metrics are efficiently pushed directly to the central Ganglia collector.

The screen capture shows the new GPU metrics, including:
  • Processes
  • GPU Utilization
  • Memory R/W Utilization
  • ECC Errors
  • Power
  • Temperature
The article, Ganglia 3.2 released, describes the basic steps needed to configure Ganglia as an sFlow collector. Once configured, Ganglia will automatically discover and track new servers as they are added to the network.

Note: Support for the GPU metrics is currently only available in Ganglia if you compile gmond from the latest development sources.

Tuesday, October 9, 2012

sFlowTrend adds web server monitoring

This chart was generated using the free sFlowTrend application to monitor an Apache web server using the sFlow standard. The chart shows a real-time, minute-by-minute view of Top URIs by Operations/s for a busy web server. What is interesting about the chart is the sudden drop-off in total operations per second over the last few minutes.
The drop in throughput can be verified by examining the standard HTTP performance counters that are exported using sFlow's efficient push mechanism. The Counters chart above shows the same drop in throughput.

There are a couple of possible explanations that come to mind. The first is that the size of pages has increased, possibly because large images were added.
The Top URI extensions by Bytes/s chart shown above makes it clear that the proportion of image data hasn't changed and that the overall data rate has fallen, so the drop in throughput doesn't appear to be a bandwidth problem.
Another possibility is that there has been an increase in server latency. The Top URIs by Duration chart above shows a recent increase in the latency of the http://10.0.0.150/login.php page.

At this point the problem can probably be resolved by talking with the application team to see if they have made any recent changes to the login page. However, there is additional information available that might help further diagnose the problem.
Host sFlow agents installed on the servers provide a scalable way of monitoring performance. The CPU utilization chart above shows a drop in CPU load on the web server that coincides with the reduced web throughput. It appears that the performance problem isn't related to web server CPU, but is likely the result of requests to a slow backend system.

Note: If it had been a CPU related issue, we might have expected the latency to increase for all URIs, not just the login.php page.

Network visibility is a critical component of application performance monitoring. In this case, network traffic data can help by identifying the backend systems that the web server depends on. Fortunately, most switch vendors support the sFlow standard and the traffic data is readily accessible in sFlowTrend.
The Top servers chart above shows the top services and servers by Frames/s. The drop in traffic to the web server, 10.0.0.150, is readily apparent, as is a drop in traffic to the Memcached server, 10.0.0.151 (TCP:11211). The Memcached server is used to cache the results of database queries in order to improve site performance and scalability. However, the performance problem doesn't seem to be directly related to Memcached, since the amount of Memcache traffic has dropped in proportion to the HTTP traffic (an increase in Memcache traffic might have indicated that the Memcached server was overloaded).
A final piece of information available through sFlow is the link utilization trend, which confirms that the drop in performance isn't due to a lack of network capacity.

At this point we have a pretty thorough understanding of the impact of the problem on application, server and network resources. Talking to the developers reveals a recent update to the login.php script that introduced a bug preventing information from being cached properly. The resulting increase in database load caused the login page to load slowly and reduced overall site throughput. Fixing the bug returned site performance to normal levels.

Note: This example is a recreation of a typical performance problem using real servers and switches generating sFlow data. However, the load is artificially generated using Apache JMeter since actual production data can't be shown.

Trying out sFlow monitoring on your own site is easy. The sFlowTrend application is a free download. There are open source sFlow modules available for popular web servers, including Apache, NGINX, Tomcat and node.js. The open source Host sFlow agent runs on most operating systems, and enabling sFlow on switches is straightforward (see sFlow.org for a list of switches supporting the sFlow standard). The article, Choosing an sFlow analyzer, provides additional information for large scale deployments.

Saturday, October 6, 2012

Thread pools

Figure 1: Thread pool
The thread pool pattern, illustrated in figure 1, is common to many parallel processing applications. A number of worker threads, organized in a thread pool, take tasks from a task queue. Once a thread has completed a task, it waits for a new task to appear on the task queue. Keeping track of the number of active threads in the pool is essential. If tasks wait in the queue because there aren't enough workers then requests will be delayed, or possibly dropped if the queue fills up.

The recently finalized sFlow Application Structures specification defines a standard set of metrics for reporting on thread pools (a minimal sketch tracking these counters follows the list):
  • Active Threads: The number of threads in the thread pool that are actively processing a request.
  • Idle Threads: The number of threads in the thread pool that are waiting for a request.
  • Maximum Threads: The maximum number of threads that can exist in the thread pool.
  • Delayed Tasks: The number of tasks that could not be served immediately, but spent time in the task queue.
  • Dropped Tasks: The number of tasks that were dropped because the task queue was full.
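
To make the pattern and the counters concrete, here is a minimal Python sketch of a worker pool that tracks values analogous to the five metrics above. This is not the sFlow structure definition itself; the class and attribute names are purely illustrative.

import queue
import threading
import time

class InstrumentedThreadPool:
    """Minimal thread pool tracking counters analogous to the sFlow
    thread pool metrics. Names are illustrative only."""

    def __init__(self, max_threads=4, queue_size=16):
        self.tasks = queue.Queue(maxsize=queue_size)
        self.lock = threading.Lock()
        self.max_threads = max_threads       # Maximum Threads
        self.active_threads = 0              # Active Threads
        self.delayed_tasks = 0               # Delayed Tasks
        self.dropped_tasks = 0               # Dropped Tasks
        # All workers are pre-started, so Idle Threads = max - active here
        for _ in range(max_threads):
            threading.Thread(target=self._worker, daemon=True).start()

    @property
    def idle_threads(self):                  # Idle Threads
        return self.max_threads - self.active_threads

    def submit(self, task):
        # A task is "delayed" if it can't be picked up immediately
        # (approximate check; good enough for a sketch)
        delayed = self.idle_threads == 0 or not self.tasks.empty()
        try:
            self.tasks.put_nowait(task)
            if delayed:
                with self.lock:
                    self.delayed_tasks += 1
        except queue.Full:
            with self.lock:
                self.dropped_tasks += 1      # queue was full, task dropped

    def _worker(self):
        while True:
            task = self.tasks.get()          # wait for the next task
            with self.lock:
                self.active_threads += 1
            try:
                task()
            finally:
                with self.lock:
                    self.active_threads -= 1
                self.tasks.task_done()

# Example: submit more work than a two-thread pool can absorb at once
pool = InstrumentedThreadPool(max_threads=2, queue_size=4)
for _ in range(10):
    pool.submit(lambda: time.sleep(0.1))
time.sleep(0.3)
print(pool.active_threads, pool.idle_threads,
      pool.delayed_tasks, pool.dropped_tasks)
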
The Apache web server uses a thread pool and is a useful demonstration of the value of the sFlow thread pool metrics. The Apache thread pool can be accessed using mod_status, which makes the thread pool visible as a web page. The following screen capture shows the server-status page generated by mod_status:
The grid of characters is used to visualize the state of the pool (referred to as the "scoreboard"). Each cell in the grid represents a slot for a thread, and the size of the grid shows the maximum number of threads that are permitted in the pool. The summary line above the grid states that 6 requests are currently being processed and that there are 69 idle workers (i.e. there are six "W" characters and sixty-nine "_" characters in the grid).

While the server-status page isn't designed to be machine readable, the information is critical and there are numerous performance monitoring tools that make HTTP requests and extract the worker pool statistics from the text. A much more efficient way to retrieve the information is to use the Apache sFlow module, which in addition to reporting the thread pool statistics will export HTTP counters, URLs, response times, status codes, etc.
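
For comparison, the following rough Python sketch shows what the polling approach looks like: fetch the server-status page and count scoreboard characters. The URL and the assumption that the scoreboard grid appears inside a <pre> block are illustrative and may vary between Apache versions; this kind of scraping is exactly the overhead the sFlow module's push model avoids.

import re
import urllib.request

# Hypothetical URL; assumes mod_status is enabled at /server-status
STATUS_URL = "http://10.0.0.150/server-status"

html = urllib.request.urlopen(STATUS_URL).read().decode()
# Assume the scoreboard grid is rendered inside the first <pre> block
scoreboard = re.search(r"<pre>(.*?)</pre>", html, re.S).group(1)
scoreboard = re.sub(r"\s", "", scoreboard)          # drop line breaks
busy = sum(1 for c in scoreboard if c not in "_.")  # 'W', 'R', 'K', ...
idle = scoreboard.count("_")                        # waiting for a connection
print("busy workers:", busy, "idle workers:", idle)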

The article, Using Ganglia to monitor web farms, describes how to use the open source Ganglia performance monitoring software to collect and report on web server clusters using sFlow. Ganglia now includes support for the sFlow thread pool metrics.
Figure 2: Ganglia chart showing active threads from an Apache web server
Figure 2 trends the number of active workers in the pool. If the number of active workers approaches the maximum allowed, then additional servers may need to be added to the cluster. An increase in active threads could also indicate a performance problem with backend systems (a slow database holding up worker threads) or may be the result of a Denial of Service (DoS) attack (e.g. Slowloris).

Monitoring thread pools using sFlow is very useful, but only scratches the surface of what is possible. The sFlow standard is widely supported by network equipment vendors and can be combined with sFlow metrics from hosts, services and applications to provide a comprehensive view of data center performance.

Monday, October 1, 2012

Link aggregation

Figure 1: Link Aggregation Groups
The title of the recently finalized sFlow LAG Counters Structure specification may not sound like much, but it is an exciting development for managing data center networks. To understand why, it is worth looking at how Link Aggregation Groups (LAGs) are deployed.

Note: There is much confusion caused by the many different names that can be used to describe link aggregation, including Port Grouping, Port Trunking, Link Bundling, NIC/Link Bonding, NIC/Link Teaming etc. These are all examples of link aggregation, and the discussion in this article applies to all of them.

Figure 1 shows a number of common uses for link aggregation. Switches A, B, C and D are interconnected by LAGs, each of which is made up of four individual links. In this case the LAGs are used to provide greater bandwidth between switches at the network core.

A LAG generally doesn't provide the same performance characteristics as a single link with equivalent capacity. In this example, suppose that the LAGs are 4 x 10 Gigabit Ethernet. The LAG needs to ensure in-order delivery of packets since many network protocols perform badly when packets arrive out of order (e.g. TCP). Packet header fields are examined and used to assign all packets that are part of a connection to the same link within the aggregation group. The result is that the maximum bandwidth available to any single connection is 10 Gigabits per second, not 40 Gigabits per second. The LAG can carry 40 Gigabits per second, but the traffic must be a mixture of connections.
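
As an illustration, the following Python sketch mimics hash-based member selection over the TCP/IP 5-tuple. Real switches use vendor-specific hardware hash functions, so the hash and port numbers here are assumptions, but the key property is the same: every packet of a given connection maps to the same member link.

import zlib

LAG_MEMBERS = [2, 4, 6, 8]   # physical ports in the aggregation group

def member_link(src_ip, dst_ip, src_port, dst_port, proto="tcp"):
    """Pick a LAG member for a packet. Every packet of a connection
    hashes to the same value, so the whole connection is pinned to
    one 10 Gigabit link and can never exceed 10Gbit/s."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return LAG_MEMBERS[zlib.crc32(key) % len(LAG_MEMBERS)]

# Two different connections may land on different links...
print(member_link("10.0.0.1", "10.0.0.2", 40000, 80))
print(member_link("10.0.0.1", "10.0.0.2", 40001, 80))
# ...but all packets of one connection always use the same link.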

The alternative of a single 40G Ethernet link allows a single connection to use the full bandwidth of the link and transfer data at 40 Gigabits per second. However, the LAG is potentially more resilient, since a link failure will simply reduce the LAG capacity by 25% and the two switches will still have connectivity. On the other hand the LAG involves four times as many links and so there is an increased likelihood of link failures.

Servers are often connected to two separate switches to ensure that if one switch fails, the server has backup connectivity through the second switch. In this example, servers A and B are connected to switches C and D. A limitation of this approach is that the backup link is idle and the bandwidth isn't available to the server.

A Multi-chassis Link Aggregation Group (MLAG) allows the server to actively use both links, treating them as a single, high capacity LAG. The "multi-chassis" part of the name refers to what happens at the other end of the link. The two switches C and D communicate with each other in order to handle the two links as if they were arriving at a single switch as part of a conventional LAG, ensuring in-order delivery of packets etc.

There is no standard for logically combining the switches to support MLAGs - each vendor has its own approach (e.g. Hewlett-Packard Intelligent Resilient Framework (IRF), Cisco Virtual Switching System (VSS), Cisco Virtual PortChannel (vPC), Arista MLAG domains, Dell/Force10 VirtualScale (VS) etc.). However, as far as the servers are concerned, the network adapters are combined (or bonded) to form a simple LAG that provides the benefit of increased bandwidth and redundancy. A potential drawback of actively using both adapters is an increased vulnerability to failures, since bandwidth will drop by 50% during a failure, potentially triggering congestion related service problems.

MLAGs aren't restricted to the server access layer. Looking at Figure 1, if switches A and B share control information and switches C and D share control information, it is possible to aggregate the links into two groups of 8, or even a single group of 16. One of the benefits of aggregating core links is that the topology can become logically "loop free", ensuring fast convergence in the event of a link failure and relegating spanning tree to providing protection against configuration errors.

Based on the discussion, it should be clear that managing the performance of LAGs requires visibility into network traffic patterns and paths through the LAGs and member links, visibility into link utilizations and the balance between group members, and visibility into the health of each link.

The LAG extension to the sFlow standard builds on the detailed visibility that sFlow already provides into switched network traffic, adding detail about LAG topology and health. The IEEE 802.3 LAG MIB defines the set of objects describing elements of the LAG and counters that can be used to monitor LAG health. The sFlow LAG extension simply maps values defined in the MIB into an sFlow counter structure that is exported using sFlow's scalable "push" mechanism, allowing large scale monitoring of LAG based network architectures.

The new measurements are best understood by examining a single aggregation group.
Figure 2: Detail of a Link Aggregation Group
Figure 2 provides a detailed view of the LAG connecting switches A and C. Ethernet cables connect ports 2, 4, 6 and 8 on Switch A to ports 1, 3, 5 and 7 respectively on Switch C. The two switches communicate with each other using the Link Aggregation Control Protocol (LACP) in order to check the health of each link and negotiate settings to establish and maintain the LAG.

LACP associates a System ID with each switch. The system ID is simply a vendor assigned MAC address that is unique to each switch. In this example, Switch A has the System ID 000000000010 and Switch C has the System ID 000000000012.

Each switch assigns an Aggregation ID, or logical port number, to the group of physical ports. Switch A identifies the LAG as port 501 and Switch C identifies the LAG as port 512.

The following sflowtool output shows what an interface counter sample exported by Switch A, reporting on physical port 2, would look like:
startSample ----------------------
sampleType_tag 0:2
sampleType COUNTERSSAMPLE
sampleSequenceNo 110521
sourceId 0:2
counterBlock_tag 0:1
ifIndex 2
networkType 6
ifSpeed 100000000
ifDirection 1
ifStatus 3
ifInOctets 35293750622
ifInUcastPkts 241166136
ifInMulticastPkts 831459
ifInBroadcastPkts 11589475
ifInDiscards 0
ifInErrors 0
ifInUnknownProtos 0
ifOutOctets 184200359626
ifOutUcastPkts 375811771
ifOutMulticastPkts 1991731
ifOutBroadcastPkts 5001804
ifOutDiscards 63606
ifOutErrors 0
ifPromiscuousMode 1
counterBlock_tag 0:2
dot3StatsAlignmentErrors 1
dot3StatsFCSErrors 0
dot3StatsSingleCollisionFrames 0
dot3StatsMultipleCollisionFrames 0
dot3StatsSQETestErrors 0
dot3StatsDeferredTransmissions 0
dot3StatsLateCollisions 0
dot3StatsExcessiveCollisions 0
dot3StatsInternalMacTransmitErrors 0
dot3StatsCarrierSenseErrors 0
dot3StatsFrameTooLongs 0
dot3StatsInternalMacReceiveErrors 0
dot3StatsSymbolErrors 0
counterBlock_tag 0:7
actorSystemID 000000000010
partnerSystemID 000000000012
attachedAggID 501
actorAdminPortState 5
actorOperPortState 61
partnerAdminPortState 5
partnerOperPortState 61
LACPDUsRx 11
markerPDUsRx 0
markerResponsePDUsRx 0
unknownRx 0
illegalRx 0
LACPDUsTx 19
markerPDUsTx 0
markerResponsePDUsTx 0
endSample   ----------------------
The LAG MIB should be consulted for detailed descriptions of the fields. For example, refer to the following LacpState definition from the MIB to understand the operational port state values:
LacpState ::= TEXTUAL-CONVENTION
    STATUS      current
    DESCRIPTION
        "The Actor and Partner State values from the LACPDU."
    SYNTAX      BITS {
                    lacpActivity(0),
                    lacpTimeout(1),
                    aggregation(2),
                    synchronization(3),
                    collecting(4),
                    distributing(5),
                    defaulted(6),
                    expired(7)
                }
In the sflowtool output the actor (local) and partner (remote) operational state associated with the LAG member is 61, which is 111101 in binary. This value indicates that the lacpActivity(0), aggregation(2), synchronization(3), collecting(4) and distributing(5) bits are set - i.e. the link is healthy.
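
The decoding can be automated. The following small Python sketch (bit names taken from the LacpState definition above, with lacpActivity treated as the least significant bit, as in this article) expands a port state value into the list of set bits:

LACP_STATE_BITS = [
    "lacpActivity",     # bit 0
    "lacpTimeout",      # bit 1
    "aggregation",      # bit 2
    "synchronization",  # bit 3
    "collecting",       # bit 4
    "distributing",     # bit 5
    "defaulted",        # bit 6
    "expired",          # bit 7
]

def decode_lacp_state(value):
    """Return the names of the LacpState bits set in an
    actor/partner port state value (e.g. from sflowtool output)."""
    return [name for i, name in enumerate(LACP_STATE_BITS) if value & (1 << i)]

print(decode_lacp_state(61))
# ['lacpActivity', 'aggregation', 'synchronization', 'collecting', 'distributing']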

While this article discussed the low level details of LAG monitoring, performance management tools should automate this analysis and allow the health and performance of all the LAGs to be tracked. In addition, sFlow integrates LAG monitoring with measurements of traffic flows, server activity and application response times to provide comprehensive visibility into data center performance. The Data center convergence, visibility and control presentation describes the critical role that measurement plays in managing costs and optimizing performance.

Today, almost every switch vendor offers products that implement the sFlow standard. If you make use of link aggregation, ask your switch vendor to add support for the LAG extension. Implementing the sFlow LAG extension is straightforward if they already support the IEEE LAG MIB.