The article, Amazon EC2 outage, describes some of the factors that led to the recent failure of Amazon's cloud computing service. The outage gained considerable attention in the press and was widely reported as being caused by a mistake in configuring a router, abruptly reducing network capacity and leading to a cascade of failures. This explanation oversimplifies the problem, suggesting that all that is needed to avoid similar failures in the future is to automate configuration management.
The article, Control, describes in general terms how concepts from control theory can be applied to analyze network and system behavior. In this article the Amazon outage is used to demonstrate the practical application of control theory concepts.
The first step in the analysis is describing the structure of the system and controller. The following diagram is a generalized representation of a typical control loop:
In the case of the Amazon cloud service, the System consists of the services, servers and the network connecting them. Each service running on the cloud, for example the Elastic Block Store (EBS), makes measurements so that it can detect and react to problems. When the response time exceeds a reference value (threshold), a control action is taken to try to correct the problem, for example marking a storage block as down and adding a new block to maintain redundancy. This type of control strategy is referred to as a feedback control system: measurements provide feedback, allowing the controller to react to changes by adjusting system settings in order to maintain desired performance levels.
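The measure, compare and act cycle described above can be sketched in a few lines of Python. This is only an illustration: the `Replica` class, the `control_step` function, the millisecond threshold and the list of spares are hypothetical names chosen for the example and do not describe how Amazon EBS is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    response_ms: float  # latest measured response time (hypothetical metric)
    up: bool = True


def control_step(replicas, threshold_ms, spares):
    """One pass of the feedback loop: measure, compare to the reference, act."""
    actions = []
    for r in replicas:
        # Compare the measurement with the reference value (threshold).
        if r.up and r.response_ms > threshold_ms:
            # Control action: mark the slow replica down ...
            r.up = False
            actions.append(f"mark {r.name} down")
            # ... and restore redundancy if a spare is available.
            if spares:
                actions.append(f"re-mirror to {spares.pop(0)}")
    return actions


replicas = [Replica("a", 5.0), Replica("b", 250.0)]
print(control_step(replicas, threshold_ms=100.0, spares=["s1"]))
```

Note that the controller in this sketch sees only a response time. It has no way of knowing why the response was slow, which is the weakness explored in the rest of this article.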
When analyzing a feedback loop, it is important to understand how it will react to changes in system capacity and demand. The step response is used to describe how a system reacts to an abrupt change, for example the abrupt reduction in network capacity that triggered the Amazon outage.
A well-behaved feedback loop will quickly adapt to the change, reaching a new equilibrium within a well-defined settling time. However, while feedback can act to stabilize the system and improve performance, poorly designed feedback can cause instability, driving the system into wild oscillations and failure.
(credit Intro to Control Theory)
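The difference between a stable and an unstable step response can be reproduced with a toy model. The sketch below assumes a hypothetical first-order loop in which the controller corrects a fraction `gain` of the remaining error at each step; the function names and the 2% settling band are choices made for this example, not part of the original analysis.

```python
def step_response(gain, steps=20, target=1.0):
    """Response of a simple discrete feedback loop to a unit step."""
    x = 0.0
    trace = [x]
    for _ in range(steps):
        # Feedback: move towards the target in proportion to the error.
        x += gain * (target - x)
        trace.append(x)
    return trace


def settling_time(trace, target=1.0, band=0.02):
    """First step after which the trace stays within the band, else None."""
    for k in range(len(trace)):
        if all(abs(v - target) <= band * abs(target) for v in trace[k:]):
            return k
    return None


print(settling_time(step_response(0.5)))  # moderate gain: settles
print(settling_time(step_response(2.5)))  # excessive gain: never settles
```

With a gain of 0.5 the error halves at every step and the loop settles after 6 steps. With a gain of 2.5 each correction overshoots by more than the previous error, so the output oscillates with growing amplitude and `settling_time` returns `None`.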
In the case of the Amazon failure, the controller was stable when handling small perturbations, but a large, abrupt change in network capacity triggered an unstable response, causing the large-scale failure. Actions that were appropriate for quickly correcting problems with a disk failure or localized connectivity failure were inappropriately applied to a problem of network congestion. The control actions placed additional demand on an already congested network, further increasing congestion and amplifying the problem. This is a classic unstable control behavior, shown in the bottom right chart in the grid above.
The Control article identified a number of concepts from control theory that can be usefully applied to understanding this problem and developing a solution:
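The amplification described above can be illustrated with a deliberately simple model. The numbers and the rule that each unit of dropped traffic triggers `remirror_cost` units of recovery traffic are assumptions made for this sketch; they are not measurements from the Amazon outage.

```python
def simulate(capacity, base_load=60.0, remirror_cost=2.0, steps=8):
    """Return the offered load at each step of a toy re-mirroring loop."""
    load = base_load
    history = [load]
    for _ in range(steps):
        # Traffic that the network cannot carry at the current capacity.
        overload = max(0.0, load - capacity)
        # Assumed rule: dropped traffic looks like failed storage, and each
        # unit of it triggers remirror_cost units of recovery traffic.
        load = base_load + remirror_cost * overload
        history.append(load)
    return history


print(simulate(capacity=100.0))  # capacity stays above demand: load is flat
print(simulate(capacity=50.0))   # capacity drops below demand: runaway load
```

While capacity remains above the base load, the controller never acts and the load stays at 60. Once capacity falls to 50, the recovery traffic adds to the overload that caused it, and the offered load grows at every step (60, 80, 120, 200 and so on), which is the positive feedback that turns a capacity reduction into a large-scale failure.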
- Stability: The response of the Amazon EBS service to the network configuration error was clearly unstable. The concepts of Observability and Controllability help explain the unstable response.
- Observability: For a system to be observable, the state of each critical resource must be measurable. Many distributed applications, including Amazon EBS, use some form of keepalive measurement to test the availability of resources. However, a keepalive tests the path between two resources and cannot distinguish between a failed resource and overloaded server or network resources along the path. This inability to separate states means that the system isn't fully observable and any control actions will be based on partial information.
- Controllability: The Amazon EBS instability demonstrates the relationship between observability and controllability. The controller interpreted keepalive failures as an indication that a storage block had failed and it acted to correct the problem by replicating data, attempting to quickly replace the failed storage while increasing the network load. However, an alternative interpretation of the keepalive failures is that they indicate network congestion. In this case the corrective action would be to reduce the amount of storage activity, reducing the load on the network and alleviating the congestion. Unfortunately the controller doesn't have enough information to distinguish between these two cases and has to choose, risking an incorrect, unstable response. In general, there is a duality relationship between observability and controllability: the system must be observable in order to be controllable. Observability is necessary, but not sufficient, to ensure controllability. The controller also needs to be able to alter the behavior of applications, servers and network elements in response to observed changes.
- Robustness: Changes in demand and capacity, and uncertainty in predicting the behavior of system elements, affect the response to control actions. A robust controller will exhibit stable behavior across a wide range of conditions. The sensitivity of the Amazon EBS service to a change in network capacity demonstrated a lack of robustness in the control scheme.
- Measurement: Comprehensive, real-time measurement of all network, server and application performance metrics is critical to making the cloud system observable and controllable. The sFlow standard offers the timely, accurate measurements of all network, server and application resources needed for effective control.
- Comprehensive: Defining the system boundary to include all the closely coupled components is essential for successful control. With convergence, network, system and application behavior becomes tightly coupled, requiring an integrated approach to management that crosses traditional administrative boundaries.
- Controls: Control mechanisms allow a controller to alter the behavior of network, server and application elements in response to observed changes. For example, the emerging OpenFlow standard allows a controller to alter the behavior of the network, adapting it to changing application demands. Session-based admission control can be used to regulate demand in order to prevent overload of critical resources. For example, Ticketmaster uses admission control to manage huge spikes in demand when tickets for popular events become available. Virtualization provides powerful control features, allowing virtual servers to be started, stopped, replicated and moved in response to changing conditions.
- Fail-safe: All systems will fail. Systems should be designed to detect when they are no longer operating within safe operating limits and drop into a safe state that minimises the impact of the failure and allows the problem to be diagnosed. For example, the sharp increase in failover activity in Amazon EBS could have triggered a fail-safe mode in which no further replication was allowed. This would have minimised the impact of the failure and provided the time needed for operators to diagnose and fix the problem. Computer operating systems typically have a safe mode; what is needed is an equivalent safe mode for cloud systems.
- Independence: The measurement and control functions need to operate independently of the components being managed. Ensuring that measurement and control traffic either uses an out-of-band network, or has priority when transmitted in-band, ensures that the measurements needed to diagnose problems and the controls needed to correct them are always available. The Amazon EBS failure was exacerbated because the control plane was compromised as the network became congested.
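The fail-safe idea in the list above, a sharp increase in failover activity tripping a mode in which no further replication is allowed, can be sketched as a simple rate guard. The `FailSafeGuard` class, its limits and the operator `reset` method are hypothetical and exist only to illustrate the concept; a real system would need to choose its thresholds from measured behavior.

```python
from collections import deque


class FailSafeGuard:
    """Trips into safe mode when failovers arrive faster than a set limit."""

    def __init__(self, max_failovers, window_s):
        self.max_failovers = max_failovers  # failovers allowed per window
        self.window_s = window_s            # sliding window length, seconds
        self.events = deque()               # timestamps of recent failovers
        self.safe_mode = False

    def allow_failover(self, now):
        """Return True if a failover at time `now` (seconds) may proceed."""
        if self.safe_mode:
            return False
        # Forget failovers that have aged out of the sliding window.
        while self.events and now - self.events[0] >= self.window_s:
            self.events.popleft()
        if len(self.events) >= self.max_failovers:
            # Outside safe operating limits: latch until an operator resets.
            self.safe_mode = True
            return False
        self.events.append(now)
        return True

    def reset(self):
        """Operator action once the problem has been diagnosed and fixed."""
        self.events.clear()
        self.safe_mode = False


guard = FailSafeGuard(max_failovers=3, window_s=10.0)
print([guard.allow_failover(t) for t in (0, 1, 2, 3, 50)])
```

In this example the first three failovers are allowed, the fourth arrives within the same 10 second window and trips safe mode, and the request at 50 seconds is still refused because the guard stays latched until `reset` is called. Latching is a design choice made for the sketch: it keeps the controller from resuming replication before an operator has had a chance to look at the problem.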