The following entry on the Microsoft 365 community forum describes the incident:
====================================
Closure Summary: On Tuesday, June 24, 2014, at approximately 1:11 PM UTC, engineers received reports of an issue in which some customers were unable to access the Exchange Online service. Investigation determined that a portion of the networking infrastructure entered into a degraded state. Engineers made configuration changes on the affected capacity to remediate end-user impact. The issue was successfully fixed on Tuesday, June 24, 2014, at 9:50 PM UTC.
Customer Impact: Affected customers were unable to access the Exchange Online service.
Incident Start Time: Tuesday, June 24, 2014, at 1:11 PM UTC
Incident End Time: Tuesday, June 24, 2014, at 9:50 PM UTC
=====================================

The closure summary shows that operators took 8 hours 39 minutes to manually diagnose and remediate the problem with the degraded networking infrastructure.

The network-related outage described in this example is not an isolated incident; other incidents described on this blog include: Packet loss, Amazon EC2 outage, Gmail outage, Delay vs utilization for adaptive control, and Multi-tenant performance isolation.
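The 8 hour 39 minute figure follows directly from the incident start and end times quoted in the closure summary. As a minimal sketch (the variable names are illustrative, not taken from the forum entry), the duration can be checked in Python:

```python
from datetime import datetime, timezone

# Start and end times as stated in the closure summary (both UTC).
start = datetime(2014, 6, 24, 13, 11, tzinfo=timezone.utc)  # 1:11 PM UTC
end = datetime(2014, 6, 24, 21, 50, tzinfo=timezone.utc)    # 9:50 PM UTC

# Split the elapsed time into whole hours and remaining minutes.
outage = end - start
hours, remainder = divmod(int(outage.total_seconds()), 3600)
minutes = remainder // 60

print(f"{hours} hours {minutes} minutes")  # prints "8 hours 39 minutes"
```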
The incidents demonstrate two important points:
- Cloud services are critically dependent on the physical network.
- Manually diagnosing problems in large-scale networks is a time-consuming process that results in extended service outages.