Sunday, May 22, 2011

Delay and stability

Stability and control used the recent Amazon EC2 outage as an example to demonstrate the critical role that measurement plays in automating data center operations. This article looks at one important aspect of the automation system: the effect that delay has on stability.

The diagram, from Electric Drives - Motor Controllers and Control Systems, illustrates how short, moderate and long delays in responding to disturbances affect the stability of the response. A short delay between a disturbance and corrective action ensures that the response is appropriate and results in a stable automation system. Introducing delay results in responses that are out of sync, since the system state may have changed during the delay. As delay increases, the responses become increasingly out of step with the system, resulting in instability and failure. The effect of delay on stability is a general phenomenon that applies equally to all types of automation system, including network and system management systems.
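The effect can be reproduced with a small simulation. The sketch below is an illustration only and is not taken from the referenced diagram: it assumes a simple proportional controller that nudges the system toward a reference value using a measurement that is `delay` steps old. The function names, the gain of 0.5 and the delay values are all assumptions chosen for the example.

```python
def simulate(gain, delay, steps=200, reference=1.0):
    """Proportional controller acting on a measurement that is `delay` steps old."""
    history = [0.0] * (delay + 1)          # system starts at 0, away from the reference
    for _ in range(steps):
        observed = history[-1 - delay]     # stale measurement seen by the controller
        correction = gain * (reference - observed)
        history.append(history[-1] + correction)
    return history


def peak_error(history, reference=1.0, tail=50):
    """Largest distance from the reference over the last `tail` steps."""
    return max(abs(reference - value) for value in history[-tail:])


if __name__ == "__main__":
    for delay in (0, 2, 6):
        error = peak_error(simulate(0.5, delay))
        print(f"delay={delay}: peak error near the end = {error:.3g}")
```

With these assumed settings the undelayed loop converges smoothly, a delay of two steps overshoots before settling, and a delay of six steps produces oscillations that keep growing, even though the controller itself is unchanged.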

There are a number of steps where delay is introduced into network and system management:
  1. Measurement delay. How quickly can changes be detected? Measurement systems often introduce significant delays:
    • Polling. Polling systems periodically retrieve performance metrics from each of the systems being managed. Polling intervals can be short for small numbers of systems, but as the number of systems increases, the work involved in polling increases and each system is polled less frequently. Assigning additional servers to monitoring increases polling capacity, but creates an extra layer of management that adds delay to the monitoring system. The end result is that measurement delay increases as the number of systems gets larger, reducing the stability of the management system.
    • Flows. Network monitoring technologies such as NetFlow, jFlow, NetStream and IPFIX report on traffic flows seen by a network device. With flow monitoring, the network device keeps track of connections flowing through it and exports measurements when each connection ends. Delay is intrinsic to this type of measurement, since the delay is proportional to the connection duration. While timeouts can be used to limit the measurement delay, reducing delay increases the computational load on the devices and the amount of measurement traffic each device generates.
  2. Planning delay. How long does it take the controller to react to a new measurement? Many aspects of data center management are currently left to human operators, resulting in significant delays. Increased automation reduces delays and improves stability.
  3. Configuration delay. How long does it take for the control to be implemented? Implementing a configuration change takes time. The controller needs to connect to the systems and communicate configuration changes. If these steps are carried out manually then they introduce significant delay. Automated configuration management significantly improves the speed and accuracy of configuration changes. For example, the OpenFlow protocol allows a controller to rapidly reconfigure networking devices in response to changing network demands.
  4. Response delay. How long does it take for the configuration change to take effect? There may be additional delay as the change takes effect; for example, moving a virtual machine takes an appreciable amount of time.
These sources of delay are cumulative: delay introduced at any stage affects the overall stability of the control policy.
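A back-of-the-envelope model makes the arithmetic behind these delays concrete. The sketch below is a simplification under stated assumptions: each poller is assumed to query systems one at a time, a flow record is assumed to be exported either when the connection ends or when an active timeout fires, and the four stage delays are simply added. The function names and the example figures are hypothetical.

```python
def polling_interval(num_systems, seconds_per_poll, pollers=1):
    """Shortest interval at which every system can be polled once.

    Assumes each poller queries its share of the systems sequentially.
    """
    return num_systems * seconds_per_poll / pollers


def flow_export_delay(connection_seconds, active_timeout=None):
    """Worst-case wait before a flow record describing a connection is exported."""
    if active_timeout is None:
        return connection_seconds           # exported only when the connection ends
    return min(connection_seconds, active_timeout)


def control_loop_delay(measurement, planning, configuration, response):
    """The stage delays accumulate, so the loop delay is their sum."""
    return measurement + planning + configuration + response


if __name__ == "__main__":
    # Hypothetical figures: 0.05 s per poll, one poller.
    for systems in (100, 1000, 10000):
        print(systems, "systems ->", polling_interval(systems, 0.05), "s between polls")
    # A 10 minute connection, with and without a 60 s active timeout.
    print(flow_export_delay(600), flow_export_delay(600, active_timeout=60))
    print(control_loop_delay(measurement=60, planning=5, configuration=2, response=30))
```

Doubling the number of pollers halves the polling interval in this model, but it does not capture the extra management layer the text describes, which adds its own delay.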

Measurement is the foundation for automation; without measurement there is no information to react to and performance problems go undetected until they result in serious failures. To be effective for automation, the measurement system must have the scalability to report on all critical network, server and application performance metrics (see concept of observability in Stability and control) while ensuring that measurement delays are low.

The sFlow standard is well suited to automation, providing a unified measurement system that includes network, system and application performance. Devices implementing sFlow do not maintain state; instead, measurements are sent immediately to a central collector, resulting in extremely low measurement delay and the scalability needed to centrally monitor all resources in the data center. For additional information, the Data center convergence, visibility and control presentation describes the critical role that measurement plays in managing costs and optimizing performance.

Wednesday, May 18, 2011

Stability and control

The article, Amazon EC2 outage, describes some of the factors that led to the recent failure of Amazon's cloud computing service. The outage gained considerable attention in the press and was widely reported as being caused by a mistake in configuring a router, abruptly reducing network capacity and leading to a cascade of failures. This explanation oversimplifies the problem, suggesting that all that is needed to avoid similar failures in future is to automate configuration management.

The article, Control, describes in general terms how concepts from control theory can be applied to analyze network and system behavior. In this article the Amazon outage is used to demonstrate the practical application of control theory concepts.

The first step in the analysis is describing the structure of the system and controller. The following diagram is a generalized representation of a typical control loop:

(credit Wikipedia)

In the case of the Amazon cloud service, the System consists of the services, servers and the network connecting them. Each service running on the cloud, for example the Elastic Block Store (EBS), makes measurements so that it can detect and react to problems. When the response time exceeds a reference value (threshold), a control action is taken to try and correct the problem, for example marking a storage block as down and adding a new block to maintain redundancy. This type of control strategy is referred to as a feedback control system: measurements provide feedback, allowing the controller to react to changes, adjusting system settings in order to maintain desired performance levels.

When analyzing a feedback loop, it is important to understand how it will react to changes in system capacity and demand. The step response is used to describe how a system reacts to an abrupt change, for example the abrupt reduction in network capacity that triggered the Amazon outage.

(credit Wikipedia)

A well behaved feedback loop will quickly adapt to the change, reaching a new equilibrium within a well defined settling time. However, while feedback can act to stabilize the system and improve performance, poorly designed feedback can cause instability, driving the system into wild oscillations and failure.
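Settling time can be measured directly from a recorded step response. The sketch below is one possible definition, not a standard library routine: it assumes the response is a list of evenly spaced samples and treats the loop as settled once every remaining sample stays within a tolerance band (2% by default, an assumed value) around the new target.

```python
def settling_time(response, target, tolerance=0.02):
    """Index of the first sample after which the response stays within the band.

    Returns None if the response never settles within the recorded samples.
    """
    band = tolerance * abs(target)
    settled_at = None
    for index, value in enumerate(response):
        if abs(value - target) <= band:
            if settled_at is None:
                settled_at = index          # candidate start of the settled region
        else:
            settled_at = None               # left the band, so start over
    return settled_at


if __name__ == "__main__":
    stable = [1 - 0.5 ** n for n in range(20)]        # error halves every step
    unstable = [1 + (-1.2) ** n for n in range(20)]   # growing oscillation
    print(settling_time(stable, 1.0))
    print(settling_time(unstable, 1.0))
```

A stable loop yields a finite settling index, while an oscillating, diverging response never stays inside the band and yields None.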

In the case of the Amazon failure, the controller was stable when handling small perturbations, but a large, abrupt change in network capacity triggered an unstable response, causing the large-scale failure. Actions that were appropriate for quickly correcting problems with a disk failure or localized connectivity failure were inappropriately applied to a problem of network congestion. The control actions placed additional demand on an already congested network, further increasing congestion and amplifying the problem. This is a classic unstable control behavior, shown in the bottom right chart in the grid above.
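The amplification described above can be illustrated with a toy model. This is not a model of Amazon EBS internals: it assumes that keepalive timeouts grow in proportion to network overload, that every timeout is treated as a failed storage block, and that re-replication traffic never completes. The function name, the load figures and the factor of 0.5 are assumptions made for the sketch.

```python
def simulate_failover(capacity, base_load=80.0, steps=10, replicate=True):
    """Return the offered network load at each step for a given link capacity."""
    replication_load = 0.0
    loads = []
    for _ in range(steps):
        load = base_load + replication_load
        loads.append(load)
        overload = max(0.0, load - capacity)
        # Assumption: timeouts scale with overload, and each timeout triggers
        # re-replication that adds more traffic to the same network.
        if replicate:
            replication_load += 0.5 * overload
    return loads


if __name__ == "__main__":
    print("ample capacity:  ", simulate_failover(capacity=100.0))
    print("reduced capacity:", simulate_failover(capacity=60.0))
    print("no replication:  ", simulate_failover(capacity=60.0, replicate=False))
```

With ample capacity the controller never acts and the load stays flat. Once capacity drops below the base load, each round of corrective action increases the overload that triggered it, so the load grows without bound in this model.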

The Control article identified a number of concepts from control theory that can be usefully applied to understanding this problem and developing a solution:
  • Stability. The response of the Amazon EBS service to the network configuration error was clearly unstable. The concepts of Observability and Controllability help explain the unstable response.
  • Observability. For a system to be observable, the state of each critical resource must be measurable. Many distributed applications, including Amazon EBS, use some form of keepalive measurement to test the availability of resources. However, a keepalive tests the path between two resources and cannot distinguish between a failed resource and overloaded server or network resources along the path. This inability to separate states means that the system isn't fully observable and any control actions will be based on partial information.
  • Controllability. The Amazon EBS instability demonstrates the relationship between observability and controllability. The controller interpreted keepalive failures as an indication that a storage block had failed and it acted to correct the problem by replicating data, attempting to quickly replace the failed storage while increasing the network load. However, an alternative interpretation of the keepalive failures is that they indicate network congestion. In this case the corrective action would be to reduce the amount of storage activity, reducing the load on the network and alleviating the congestion. Unfortunately the controller doesn't have enough information to distinguish between these two cases and has to choose, risking an incorrect, unstable response. In general, there is a duality relationship between observability and controllability: the system must be observable in order to be controllable. Observability is necessary, but not sufficient to ensure controllability. The controller also needs to be able to alter the behavior of applications, servers and network elements in response to observed changes.
  • Robustness. Changes in demand and capacity, and uncertainty in predicting the behavior of system elements, affect the response to control actions. A robust controller will exhibit stable behavior across a wide range of conditions. The sensitivity of the Amazon EBS service to a change in network capacity demonstrated a lack of robustness in the control scheme.
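The observability gap in the list above can be shown in a few lines. In this sketch a keepalive result on its own leaves the cause ambiguous, while a second, independent measurement of the path separates the two cases. The link utilization input, the 0.9 threshold and the action strings are hypothetical and stand in for whatever measurements and controls a real system exposes.

```python
def diagnose(keepalive_ok, link_utilization=None, congestion_threshold=0.9):
    """Classify the likely state behind a keepalive result."""
    if keepalive_ok:
        return "healthy"
    if link_utilization is None:
        return "ambiguous"      # failure and congestion look identical
    if link_utilization >= congestion_threshold:
        return "congestion"
    return "failure"


# Hypothetical corrective actions; note that the two causes call for opposite responses.
ACTIONS = {
    "healthy": "no action",
    "ambiguous": "guess, risking an unstable response",
    "congestion": "reduce storage traffic",
    "failure": "re-replicate data",
}


if __name__ == "__main__":
    for utilization in (None, 0.97, 0.20):
        state = diagnose(False, utilization)
        print(utilization, "->", state, "->", ACTIONS[state])
```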
Control engineering involves designing measurement and control strategies that ensure robust, stable and efficient operation of the systems being managed. A number of control engineering techniques are worth considering when designing distributed services:
  • Measurement. Comprehensive, real-time measurement of all network, server and application performance metrics is critical to making the cloud system observable and controllable. The sFlow standard offers the timely, accurate measurements of all network, server and application resources needed for effective control.
  • Comprehensive. Defining the system boundary to include all the closely coupled components is essential for successful control. With convergence, network, system and application behavior becomes tightly coupled, requiring an integrated approach to management that crosses traditional administrative boundaries.
  • Controls. Control mechanisms allow a controller to alter the behavior of network, server and application elements in response to observed changes. For example, the emerging OpenFlow standard allows a controller to alter the behavior of the network, adapting it to changing application demands. Session-based admission control can be used to regulate demand in order to prevent overload of critical resources. For example, Ticketmaster uses admission control to manage huge spikes in demand when tickets for popular events become available. Virtualization provides powerful control features, allowing virtual servers to be started, stopped, replicated and moved in response to changing conditions.
  • Fail-safe. All systems will fail. Systems should be designed to detect when they are no longer operating within safe operating limits and drop into a safe state that minimises the impact of the failure and allows the problem to be diagnosed. For example, the sharp increase in failover activity in Amazon EBS could have triggered a fail-safe mode in which no further replication was allowed. This would have minimised the impact of the failure and provided the time needed for operators to diagnose and fix the problem. Computer operating systems typically have a safe mode; what is needed is an equivalent safe mode for cloud systems.
  • Independence. The measurement and control functions need to operate independently of the components being managed. Ensuring that measurement and control traffic either uses an out of band network, or has priority when transmitted in-band, ensures that the measurements needed to diagnose problems and the controls needed to correct them are always available. The Amazon EBS failure was exacerbated because the control plane was compromised as the network became congested.
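The fail-safe idea in the list above can be sketched as a guard placed in front of the failover logic. This is a hypothetical design, not a description of any Amazon mechanism: it assumes failovers are counted in a sliding time window, and that exceeding an assumed limit latches the system into a safe mode that only an operator can clear. The class name and parameters are invented for the example.

```python
from collections import deque


class ReplicationGuard:
    """Blocks further failovers when too many occur within a time window."""

    def __init__(self, max_failovers, window_seconds):
        self.max_failovers = max_failovers
        self.window_seconds = window_seconds
        self.recent = deque()       # timestamps of recently allowed failovers
        self.safe_mode = False

    def allow_failover(self, now):
        """Return True if a failover at time `now` (in seconds) may proceed."""
        if self.safe_mode:
            return False
        # Forget failovers that have aged out of the window.
        while self.recent and now - self.recent[0] > self.window_seconds:
            self.recent.popleft()
        if len(self.recent) >= self.max_failovers:
            self.safe_mode = True   # latch until an operator calls reset()
            return False
        self.recent.append(now)
        return True

    def reset(self):
        """Operator action: leave safe mode once the problem is diagnosed."""
        self.recent.clear()
        self.safe_mode = False


if __name__ == "__main__":
    guard = ReplicationGuard(max_failovers=3, window_seconds=60)
    print([guard.allow_failover(t) for t in (0, 1, 2, 3, 4)])
    print("safe mode:", guard.safe_mode)
```

Isolated failures spread over time pass through untouched, while a burst trips the guard and stops replication from adding load until someone has looked at the problem.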
Finally, the unique scalability of sFlow dramatically simplifies management by providing a single, centralized view of performance across all resources in the data center. Measurement eliminates uncertainty and reduces the complexity of managing large systems. An effective monitoring system is the foundation for automation: reducing costs, improving efficiency and optimizing performance in the data center. In future, expect to see sFlow monitoring tightly integrated in data center orchestration tools, fully exploiting the flexibility of virtualization and convergence to automatically adjust to changing workloads.  For additional information, the Data center convergence, visibility and control presentation describes the critical role that measurement plays in managing costs and optimizing performance.

Tuesday, May 10, 2011

OpenFlow and sFlow

OpenFlow is gaining considerable attention as the technology moves from research labs into mainstream products. The recently formed Open Networking Foundation's first priority is to "develop and use the OpenFlow protocol" and it is well placed to have an impact on the networking industry with major network operators and manufacturers as members: Deutsche Telekom, Facebook, Google, Microsoft, Verizon, Yahoo!, Big Switch Networks, Broadcom, Brocade, Ciena, Cisco, Citrix, Comcast, Dell, Ericsson, Extreme Networks, Force10 Networks, HP, IBM, IP Infusion, Juniper Networks, Marvell, Metaswitch Networks, NEC, Netgear, Netronome, Nicira Networks, Nokia Siemens Networks, NTT, Plexxi Inc., Riverbed Technology, Vello Systems and VMware.

Many of the same network equipment manufacturers are also members of the industry consortium and the sFlow standard is widely supported in Ethernet switches, including many of the recently announced OpenFlow switches.

The diagram shows how sFlow and OpenFlow provide complementary functions that together offer exciting opportunities for delivering breakthrough data center and cloud networking performance. The OpenFlow protocol allows controller software running on a server to configure the hardware forwarding tables in a network of switches. The sFlow standard specifies instrumentation in the forwarding table hardware that provides real-time, network-wide visibility into traffic flowing across the network. In addition, sFlow also provides real-time visibility into the performance of servers. Combined, sFlow and OpenFlow can be used to construct feedback control systems that optimize performance, automatically adapting the network to meet changing demands.
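The decision step of such a feedback loop can be sketched independently of either protocol. In the example below, per-flow rates are assumed to have been estimated from sFlow measurements, and the returned list is what a controller would then act on through OpenFlow. Neither protocol's API is shown, and the function name, the flow labels and the 80% utilization target are hypothetical.

```python
def select_flows_to_move(flow_rates, link_capacity, target_utilization=0.8):
    """Pick the largest flows to reroute until the link is back under its target.

    flow_rates maps a flow identifier to its measured rate, in the same
    units as link_capacity.
    """
    budget = link_capacity * target_utilization
    load = sum(flow_rates.values())
    to_move = []
    largest_first = sorted(flow_rates.items(), key=lambda item: item[1], reverse=True)
    for flow, rate in largest_first:
        if load <= budget:
            break
        to_move.append(flow)
        load -= rate
    return to_move


if __name__ == "__main__":
    # Hypothetical measurements in Mbit/s on a 1000 Mbit/s link.
    rates = {"backup": 600, "video": 300, "web": 50}
    print(select_flows_to_move(rates, link_capacity=1000))
```

Moving the largest flows first keeps the number of forwarding table changes small, which matters because each change adds configuration delay to the loop.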

The paper, DevoFlow: Cost-Effective Flow Management for High Performance Enterprise Networks, Ninth ACM Workshop on Hot Topics in Networks (HotNets-IX), describes research at HP Labs to create exactly this type of hybrid sFlow/OpenFlow controller, combining the strengths of sFlow as a measurement technology with the strengths of OpenFlow as a control technology in order to create a scalable traffic control system.

Software controlled networking is evolving rapidly and much of the innovation in this space is being driven by smaller companies: Big Switch Networks and Nicira Networks, both developing OpenFlow controllers, and InMon Corp., a leading provider of sFlow analysis software that already includes basic traffic control capabilities.

Sunday, May 8, 2011

NetFlow lite

Netflow-lite is a recently released packet sampling technology for the Cisco Catalyst 4948E Ethernet switch. The technology is described in Configuring Netflow-lite and at first glance Netflow-lite appears to be random 1 in N packet sampling, exporting sampled packets as IPFIX or Netflow V9 records.

However, looking more closely, the following description raises some questions:

The packet sampling mechanism tries to achieve random 1-in-N sampling. The accuracy of the algorithm is dependent on the size of the packets arriving at a given interface. To tune the relative accuracy of the algorithm, use the average-packet-size parameter. The whole system supports a maximum of 200 monitors.

The system automatically determines the average packet size at an interface based on observation of input traffic and uses that value in rate DBL sampling.

The acronym DBL refers to Cisco's Dynamic Buffer Limiting, a technology for congestion avoidance (see Quality of Service on Cisco Catalyst 4500 Series Supervisor Engines, page 16, for a description). It is clear that the 4948E doesn't have hardware support for 1 in N packet sampling. Instead, an estimate of average packet size is required to allow hardware designed for monitoring data rates to be used to select packets to sample.

This mechanism raises red flags regarding Netflow-lite accuracy, particularly in a data center top of rack setting where 4948E switches are intended to be deployed:
  1. Average packet size. Average packet sizes are highly variable. For example, network storage traffic generates large data packets (as large as 9000 byte jumbo frames) in one direction and small acknowledgement packets in the other (as small as 64 bytes).  Further complicating matters, traffic patterns change quickly, for example changing from storage read to write operations. Thus, the estimate of average packet size from one interval is likely to be a poor estimate of the packet size in the next interval, resulting in large errors in the Netflow-lite measurements. Finally, each class of traffic has its own characteristic packet size distribution. Relying on a single average packet size estimate to control sampling is likely to bias results for classes of traffic that have smaller or larger average packet sizes.
  2. Time. The rate based mechanism of DBL introduces time into the sampling process (since rates are a measure of traffic over some time interval). Time-based sampling methods yield inaccurate results for many typical network traffic patterns.
Before relying on Netflow-lite measurements, it is prudent to run tests to verify accuracy in a production setting. Testing accuracy using a traffic generator is likely to mask errors that will be apparent when monitoring the irregular traffic patterns seen in a production setting.
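The packet size concern can be explored with a small simulation. The sketch below does not implement Cisco's DBL-based mechanism, whose internals are not described in the quoted documentation. Instead it compares random 1-in-N packet sampling with a simplified stand-in: a sampler driven by a byte counter and a single average packet size. The traffic mix, function names and seed are assumptions for illustration.

```python
import random


def make_traffic(rng, count_each=10000):
    """Equal numbers of 9000 byte data packets and 64 byte acknowledgements."""
    packets = [9000] * count_each + [64] * count_each
    rng.shuffle(packets)
    return packets


def random_one_in_n(packets, n, rng):
    """Select each packet independently with probability 1/n."""
    return [size for size in packets if rng.random() < 1.0 / n]


def byte_driven(packets, n, average_size):
    """Select the packet that pushes a byte counter past n * average_size."""
    threshold = n * average_size
    counter, samples = 0, []
    for size in packets:
        counter += size
        if counter >= threshold:
            samples.append(size)
            counter -= threshold
    return samples


def estimate_small(samples, n):
    """Scale the sampled 64 byte packets up by n to estimate their true count."""
    return n * sum(1 for size in samples if size == 64)


if __name__ == "__main__":
    rng = random.Random(1)
    packets = make_traffic(rng)
    average = sum(packets) / len(packets)
    print("actual small packets:", packets.count(64))
    print("random 1-in-100:     ", estimate_small(random_one_in_n(packets, 100, rng), 100))
    print("byte-driven:         ", estimate_small(byte_driven(packets, 100, average), 100))
```

In this simplified model the byte-driven sampler almost always selects a large packet, because large packets are far more likely to be the one that crosses the byte threshold, so the small acknowledgement packets are heavily undercounted even when the average packet size is known exactly.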

Netflow-lite exposes an important difference between IPFIX/NetFlow and sFlow. While IPFIX/NetFlow specify the format of data transmitted from a measurement device to a collector, they do not specify how the measurements should be made. The result is that seemingly identical data from different vendors' switches (or even different models of switch from a single vendor) can represent measurements made using very different methodologies. These differences make it very difficult to rely on the measurements or compare data between different devices. In contrast, sFlow standardizes how traffic is measured, ensuring that every device supporting the sFlow standard performs sampling in a standard way, yielding accurate and consistent results.

The article, Complexity kills, describes how sFlow's standard measurements simplify large scale monitoring of data center traffic. Accurate traffic measurements are increasingly important as convergence and virtualization place greater demands on network capacity. For additional information, the Data center convergence, visibility and control presentation describes the critical role that measurement plays in managing costs and optimizing performance.

Wednesday, May 4, 2011


The tomcat-sflow-valve project is an open source implementation of sFlow monitoring for Apache Tomcat, an open source web server implementing the Java Servlet and JavaServer Pages (JSP) specifications.

The advantage of using sFlow is the scalability it offers for monitoring the performance of large web server clusters or load balancers where request rates are high and conventional logging solutions generate too much data or impose excessive overhead. Real-time monitoring of HTTP provides essential visibility into the performance of large-scale, complex, multi-layer services constructed using Representational State Transfer (REST) architectures. In addition, monitoring HTTP services using sFlow is part of an integrated performance monitoring solution that provides real-time visibility into applications, servers and switches (see sFlow Host Structures).

The tomcat-sflow-valve software is designed to integrate with the Host sFlow agent to provide a complete picture of server performance. Download, install and configure Host sFlow before proceeding to install the tomcat-sflow-valve - see Installing Host sFlow on a Linux Server. There are a number of options for analyzing cluster performance using Host sFlow, including Ganglia and sFlowTrend.

Next, download the tomcat-sflow-valve software. The following steps install the SFlowValve in a Tomcat 7 server.

tar -xvzf tomcat-sflow-valve-0.5.1.tar.gz
cp sflowvalve.jar $TOMCAT_HOME/lib/

Edit the $TOMCAT_HOME/conf/server.xml file and insert the following Valve statement in the Host section of the Tomcat configuration file:

  <Valve className="com.sflow.catalina.SFlowValve" />

Restart Tomcat:

/sbin/service tomcat restart

Once installed, the sFlow Valve will stream measurements to a central sFlow Analyzer. Currently the only software that can decode HTTP sFlow is sflowtool. Download, compile and install the latest sflowtool sources on the system you are using to receive sFlow from the servers in the Tomcat cluster.

Running sflowtool will display output of the form:

[pp@test]$ /usr/local/bin/sflowtool
startDatagram =================================
datagramSize 116
unixSecondsUTC 1294273499
datagramVersion 5
agentSubId 6486
packetSequenceNo 6
sysUpTime 44000
samplesInPacket 1
startSample ----------------------
sampleType_tag 0:2
sampleSequenceNo 6
sourceId 3:65537
counterBlock_tag 0:2201
http_method_option_count 0
http_method_get_count 247
http_method_head_count 0
http_method_post_count 2
http_method_put_count 0
http_method_delete_count 0
http_method_trace_count 0
http_methd_connect_count 0
http_method_other_count 0
http_status_1XX_count 0
http_status_2XX_count 214
http_status_3XX_count 35
http_status_4XX_count 0
http_status_5XX_count 0
http_status_other_count 0
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleSequenceNo 3434
sourceId 3:65537
meanSkipCount 2
samplePool 7082
dropEvents 0
inputPort 0
outputPort 1073741823
flowBlock_tag 0:2100
extendedType socket4
socket4_ip_protocol 6
socket4_local_port 80
socket4_remote_port 61401
flowBlock_tag 0:2201
flowSampleType http
http_method 2
http_protocol 1001
http_uri /favicon.ico
http_useragent Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-us) AppleW
http_bytes 284
http_duration_uS 335
http_status 404
endSample   ----------------------
endDatagram   =================================
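Output in this form is straightforward to post-process. The sketch below is a minimal parser written against the sample output shown above rather than against any sflowtool specification: it assumes each line is a key followed by a space and a value, and that samples are delimited by startSample and endSample lines. The function names are hypothetical.

```python
def parse_samples(lines):
    """Group sflowtool's "key value" lines into one dict per sample."""
    samples, current = [], None
    for raw in lines:
        line = raw.strip()
        if line.startswith("startSample"):
            current = {}
        elif line.startswith("endSample"):
            if current is not None:
                samples.append(current)
            current = None
        elif current is not None and line:
            key, _, value = line.partition(" ")
            # Repeated keys (for example flowBlock_tag) keep the last value seen.
            current[key] = value.strip()
    return samples


def http_samples(samples):
    """Keep only the samples that carry an HTTP request record."""
    return [sample for sample in samples if "http_uri" in sample]


if __name__ == "__main__":
    import sys
    # Reads sflowtool output from standard input.
    for sample in http_samples(parse_samples(sys.stdin)):
        print(sample.get("http_status"), sample.get("http_duration_uS"), sample["http_uri"])
```

Piping sflowtool into a script containing this code would print the status, duration and URI of each sampled request. Lines outside a sample, such as the datagram header fields, are ignored.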

The -H option causes sflowtool to output the HTTP request samples using the combined log format:

[pp@test]$ /usr/local/bin/sflowtool -H
- - [05/Jan/2011:22:39:50 -0800] "GET /membase.php HTTP/1.1" 200 3494 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-us) AppleW"
- - [05/Jan/2011:22:39:50 -0800] "GET /favicon.ico HTTP/1.1" 404 284 "" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-us) AppleW"

Converting sFlow to combined logfile format allows existing log analyzers to be used to analyze the sFlow data. For example, the following commands use sflowtool and webalizer to create reports:

# rotatelogs requires a rotation time or size; 86400 seconds (daily) is an assumed value
/usr/local/bin/sflowtool -H | rotatelogs log/http_log 86400 &
webalizer -o report log/*

The resulting webalizer report shows top URLs:

Finally, the real potential of HTTP sFlow is as part of a broader performance management system providing real-time visibility into applications, servers, storage and networking across the entire data center.