The Gmail service outage on Thursday, September 24th was widely reported (see Google Gmail Users Hit With Another Service Disruption, The Wall Street Journal).
On Friday, September 25th, Google published an incident report, Google Apps Incident Report, that describes some of the factors leading to the failure. The report makes interesting reading: it concludes that the root cause was high load on the Contacts service, and that this load resulted from a combination of the following:
- A network issue in a data center, which caused additional load on the Contacts service
- A very high utilization of the Contacts service
- An update to Gmail that inadvertently increased the load on the Contacts service
This incident demonstrates the complex dependencies between the networking and computing components in a cloud computing environment. Data center-wide visibility helps avoid this type of collapse by revealing dependencies and identifying capacity problems early enough for proactive action to be taken. When a service failure does happen, visibility is critical for quickly identifying the problem and targeting the controls needed to mitigate the failure.