Friday, April 29, 2011
Amazon cloud glitch knocks out popular websites - Server outage hits sites Reddit, Quora and Foursquare hard, Computerworld, Thursday April 21st, 2011. The article describes the impact of a major three-day outage in one of Amazon's Elastic Compute Cloud (EC2) data centers on prominent social networking companies and their services.
The screenshot from the Amazon Service Health Dashboard shows the extent and duration of the failure, and the following note, which appeared on the dashboard page, gives an initial description of the failure:
8:54 AM PDT We’d like to provide additional color on what we’re working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it’s difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We’re starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.
A detailed postmortem, Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region, was published on April 29.
The failure in this case resulted from an interaction between the network and the Elastic Block Store (EBS) storage service. An initial network configuration error caused a loss of network capacity, which in turn caused the storage service to start re-mirroring volumes. The re-mirroring traffic further overloaded the network and disrupted the storage service even more. The resulting network brownout affected not just the storage and compute services, but also the control functions needed to recover.
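The feedback loop described above can be illustrated with a toy model. The following Python sketch is not taken from Amazon's postmortem: the function name, the parameters (capacity, base_load, remirror_gain) and all of the numbers are hypothetical, chosen only to show how re-mirroring traffic that is triggered by overload can amplify the overload that caused it.

```python
def simulate_remirror_storm(capacity, base_load, remirror_gain, steps):
    """Toy model of a re-mirroring feedback loop (all units are arbitrary).

    capacity      -- hypothetical network capacity available to storage traffic
    base_load     -- hypothetical steady-state storage traffic
    remirror_gain -- hypothetical re-mirroring traffic generated in the next
                     step per unit of overload in the current step
    Returns a list of (offered_load, overload) tuples, one per step.
    """
    remirror = 0.0
    history = []
    for _ in range(steps):
        offered = base_load + remirror
        # Traffic that does not fit is treated as lost connectivity to replicas.
        overload = max(0.0, offered - capacity)
        history.append((offered, overload))
        # Volumes that lost their replicas try to re-mirror in the next step.
        remirror = remirror_gain * overload
    return history


if __name__ == "__main__":
    # Healthy network: the base load fits, so no re-mirroring is triggered.
    healthy = simulate_remirror_storm(capacity=100, base_load=80,
                                      remirror_gain=1.5, steps=4)
    # Assumed configuration error: capacity drops below the base load.
    degraded = simulate_remirror_storm(capacity=60, base_load=80,
                                       remirror_gain=1.5, steps=4)
    for label, run in (("healthy", healthy), ("degraded", degraded)):
        print(label, [round(overload, 1) for _, overload in run])
```

In this model a gain above 1 makes the overload grow at every step once capacity falls below the base load, while a gain below 1 lets it settle at a finite level. Real systems are far more complex, but the sketch shows why re-mirroring traffic needs to be monitored and rate limited.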
The Amazon failure demonstrates the tight coupling between network, storage and servers in converged cloud environments. Networked storage in particular dramatically increases network loads and must be closely managed in order to avoid congestion.
The sFlow standard provides scalable monitoring of all the application, storage, server and network elements in the data center, both physical and virtual. Implementing an sFlow monitoring solution helps break down management silos, ensuring the coordination of resources needed to manage converged infrastructures, optimize performance and avoid service failures.