Sunday, May 22, 2011

Delay and stability

Stability and control used the recent Amazon EC2 outage as an example to demonstrate the critical role that measurement plays in automating data center operations. This article looks at one important aspect of an automation system: the effect that delay has on stability.

The diagram, from Electric Drives - Motor Controllers and Control Systems, illustrates how short, moderate and long delays in responding to disturbances affect the stability of the response. A short delay between a disturbance and corrective action ensures that the response is appropriate and results in a stable automation system. Introducing delay results in responses that are out of sync, since the system state may have changed during the delay. As delay increases, the responses become increasingly out of step with the system, resulting in instability and failure. The effect of delay on stability is a general phenomenon that applies equally to all types of automation system, including network and system management systems.
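
The effect can be reproduced with a toy feedback loop. The sketch below is not taken from the referenced diagram. It assumes a simple discrete-time proportional controller that, at each step, corrects a fraction of the error it observed `delay` steps earlier. The function names, the gain of 0.5 and the delays of 0, 2 and 6 steps are all illustrative choices.

```python
def simulate(delay, gain=0.5, steps=200):
    """Toy control loop (an assumed model, not a real system).

    A disturbance displaces the system to 1.0. At every step the
    controller subtracts gain * (the value it saw `delay` steps ago).
    """
    x = [1.0] * (delay + 1)          # history visible to the controller
    for _ in range(steps):
        observed = x[-1 - delay]     # stale measurement
        x.append(x[-1] - gain * observed)
    return x


def peak_tail(x, n=20):
    """Largest absolute error over the last n steps."""
    return max(abs(v) for v in x[-n:])


if __name__ == "__main__":
    for delay in (0, 2, 6):          # short, moderate, long (assumed values)
        print(f"delay={delay}: residual error {peak_tail(simulate(delay)):.3g}")
```

With these assumed values, the zero-delay response decays smoothly, the two-step delay oscillates but settles, and the six-step delay oscillates with growing amplitude, even though the controller and gain are identical in all three runs.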

There are a number of steps where delay is introduced into network and system management:
  1. Measurement delay: How quickly can changes be detected? Measurement systems often introduce significant delays:
    • Polling: Polling systems periodically retrieve performance metrics from each of the systems being managed. Polling intervals can be short for small numbers of systems, but as the number of systems increases, the work involved in polling increases and each system is polled less frequently. Assigning additional servers to monitoring increases polling capacity, but creates an extra layer of management that adds delay to the monitoring system. The end result is that measurement delay increases as the number of systems gets larger, reducing the stability of the management system.
    • Flows: Network monitoring technologies such as NetFlow, jFlow, NetStream and IPFIX report on traffic flows seen by a network device. With flow monitoring, the network device keeps track of connections flowing through it and exports measurements when each connection ends. Delay is intrinsic to this type of measurement and is proportional to the connection duration. While timeouts can be used to limit the measurement delay, reducing delay increases the computational load on the devices and the amount of measurement traffic each device generates.
  2. Planning delay: How long does it take the controller to react to a new measurement? Many aspects of data center management are currently left to human operators, resulting in significant delays. Increased automation reduces delays and improves stability.
  3. Configuration delay: How long does it take for the control to be implemented? Implementing a configuration change takes time. The controller needs to connect to the systems and communicate configuration changes. If these steps are carried out manually, they introduce significant delay. Automated configuration management significantly improves the speed and accuracy of configuration changes. For example, the OpenFlow protocol allows a controller to rapidly reconfigure networking devices in response to changing network demands.
  4. Response delay: How long does it take for the configuration change to take effect? There may be additional delay as the change takes effect; for example, moving a virtual machine takes an appreciable amount of time.
These sources of delay are cumulative: delay introduced at any stage affects the overall stability of the control policy.
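
A rough delay budget makes the cumulative effect concrete. The sketch below is a back-of-the-envelope model, not a description of any particular product. It assumes a poller with a fixed capacity that visits every system in turn, and a flow exporter whose reporting delay is capped by a timeout. All function names and numbers are hypothetical.

```python
def polling_interval(num_systems, polls_per_second):
    """Seconds between successive polls of the same system, assuming
    a fixed polling capacity shared evenly across all systems."""
    return num_systems / polls_per_second


def flow_export_delay(connection_seconds, timeout_seconds):
    """Worst-case wait before a flow record is exported: the record is
    sent when the connection ends or the timeout fires, whichever
    comes first (a simplification of real exporter behaviour)."""
    return min(connection_seconds, timeout_seconds)


def total_delay(measurement, planning, configuration, response):
    """The four stages run in sequence, so their delays add."""
    return measurement + planning + configuration + response


if __name__ == "__main__":
    # Hypothetical numbers: polling capacity of 50 systems per second.
    for n in (1_000, 10_000):
        print(n, "systems ->", polling_interval(n, 50), "s between polls")
    # A 5 minute connection with a 60 s timeout is reported after 60 s.
    print("flow delay:", flow_export_delay(300, 60), "s")
    print("end to end:", total_delay(20, 5, 2, 30), "s")
```

In this model, a tenfold increase in the number of systems produces a tenfold increase in polling delay unless capacity grows with it, and shortening the flow timeout lowers delay only at the price of more frequent exports.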

Measurement is the foundation for automation; without measurement there is no information to react to, and performance problems go undetected until they result in serious failures. To be effective for automation, the measurement system must have the scalability to report on all critical network, server and application performance metrics (see the concept of observability in Stability and control) while ensuring that measurement delays are low.

The sFlow standard is well suited to automation, providing a unified measurement system that includes network, system and application performance. Devices implementing sFlow do not maintain state; instead, measurements are sent immediately to a central collector, resulting in extremely low measurement delay and the scalability needed to centrally monitor all resources in the data center. For additional information, the Data center convergence, visibility and control presentation describes the critical role that measurement plays in managing costs and optimizing performance.
