The root cause of the outage is described as a partial failure of a networking device that resulted in excessive packet loss. Network monitoring can detect many problems before they become serious and can reduce the time needed to identify the root cause when problems do occur. However, traditional network monitoring technologies such as SNMP polling of interface counters don't scale easily, leaving operations teams with limited, delayed information on critical network performance indicators such as packet loss. Fortunately, the challenge of large-scale interface monitoring is addressed by the sFlow standard's counter push mechanism, in which each device is responsible for periodically sending its own counters to a central collector (see Push vs Pull).
The sFlow counter export mechanism extends beyond the network to include server and application performance metrics. The article Cluster performance metrics shows how queries to the sFlow-RT analyzer's REST API can be used to feed performance metrics from server clusters into operations tools such as Graphite.
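For example, a small Python script along the following lines could periodically pull a summary metric from sFlow-RT and feed it to Graphite's plaintext (carbon) interface. This is a minimal sketch: the sFlow-RT address is the one used in the query below, but the Graphite host, port, and metric naming are illustrative assumptions.

#!/usr/bin/env python
# Sketch: poll a summary metric from sFlow-RT's REST API and push it to
# Graphite's plaintext protocol. The Graphite endpoint and metric path
# prefix are placeholders, not values from the article.
import json, socket, time, urllib.request

SFLOW_RT = 'http://10.0.0.162:8008'
GRAPHITE = ('graphite.example.com', 2003)   # assumed carbon plaintext listener

def poll_and_send():
    url = SFLOW_RT + '/metric/ALL/sum:ifindiscards/json'
    with urllib.request.urlopen(url) as resp:
        metrics = json.load(resp)
    now = int(time.time())
    lines = ''
    for m in metrics:
        # Graphite plaintext format: "<path> <value> <timestamp>\n"
        path = 'network.' + m['metricName'].replace(':', '.')
        lines += '%s %f %d\n' % (path, m.get('metricValue', 0), now)
    sock = socket.create_connection(GRAPHITE)
    sock.sendall(lines.encode())
    sock.close()

if __name__ == '__main__':
    while True:
        poll_and_send()
        time.sleep(30)   # match the typical sFlow counter export interval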
The following query builds on that example to show how real-time information summarizing network-wide packet loss and error rates can easily be gathered, even when there are thousands of switches and tens of thousands of links.
$ curl http://10.0.0.162:8008/metric/ALL/sum:ifinerrors,sum:ifindiscards,sum:ifouterrors,sum:ifoutdiscards,max:ifinerrors,max:ifindiscards,max:ifouterrors,max:ifoutdiscards/json
[
 {
  "lastUpdateMax": 56187,
  "lastUpdateMin": 487,
  "metricN": 119,
  "metricName": "sum:ifinerrors",
  "metricValue": 0.05025125628140704
 },
 {
  "lastUpdateMax": 56187,
  "lastUpdateMin": 487,
  "metricN": 119,
  "metricName": "sum:ifindiscards",
  "metricValue": 0.05025125628140704
 },
 {
  "lastUpdateMax": 56187,
  "lastUpdateMin": 487,
  "metricN": 119,
  "metricName": "sum:ifouterrors",
  "metricValue": 2.1955918328651283
 },
 {
  "lastUpdateMax": 56187,
  "lastUpdateMin": 487,
  "metricN": 119,
  "metricName": "sum:ifoutdiscards",
  "metricValue": 2.1955918328651283
 },
 {
  "agent": "10.0.0.30",
  "dataSource": "4",
  "lastUpdate": 487,
  "lastUpdateMax": 27487,
  "lastUpdateMin": 487,
  "metricN": 119,
  "metricName": "max:ifinerrors",
  "metricValue": 0.05025125628140704
 },
 {
  "agent": "10.0.0.30",
  "dataSource": "4",
  "lastUpdate": 487,
  "lastUpdateMax": 27487,
  "lastUpdateMin": 487,
  "metricN": 119,
  "metricName": "max:ifindiscards",
  "metricValue": 0.05025125628140704
 },
 {
  "agent": "10.0.0.28",
  "dataSource": "8",
  "lastUpdate": 7288,
  "lastUpdateMax": 27488,
  "lastUpdateMin": 7288,
  "metricN": 119,
  "metricName": "max:ifouterrors",
  "metricValue": 0.7425742574257426
 },
 {
  "agent": "10.0.0.28",
  "dataSource": "8",
  "lastUpdate": 7288,
  "lastUpdateMax": 27488,
  "lastUpdateMin": 7288,
  "metricN": 119,
  "metricName": "max:ifoutdiscards",
  "metricValue": 0.7425742574257426
 }
]

The query result identifies the links with the highest packet error and packet discard rates (links are identified by the agent and dataSource fields of each max: metric - ifinerrors, ifindiscards, ifouterrors, and ifoutdiscards). Also of interest are the sum: metrics, which measure total packet loss and error rates summed across all the links in the network.
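The same query can be issued programmatically. The following Python sketch reuses the sFlow-RT address and /metric REST path from the example above to report which switch port currently has the highest output discard rate; it is an illustration only, not a complete monitoring script.

#!/usr/bin/env python
# Sketch: query sFlow-RT for the interface with the highest output discard
# rate and report which link it is. Links are identified by the agent and
# dataSource fields (dataSource typically corresponds to the interface's
# SNMP ifIndex).
import json, urllib.request

url = 'http://10.0.0.162:8008/metric/ALL/max:ifoutdiscards/json'
with urllib.request.urlopen(url) as resp:
    result = json.load(resp)

worst = result[0]
if worst.get('metricValue', 0) > 0:
    print('worst link: agent %s dataSource %s, %.2f discards/second' %
          (worst['agent'], worst['dataSource'], worst['metricValue']))
else:
    print('no output discards reported')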
The sFlow counter push mechanism is also useful for tracking availability: the periodic export of counters by each device acts as a "heartbeat" - if messages stop arriving from a device then there is a problem with the device or the network that needs to be investigated. The metricN field in the query result indicates the number of data sources that contributed to the summary metrics, and the lastUpdateMax and lastUpdateMin values indicate how long ago (in milliseconds) the oldest and most recent updates within the group were received. A lastUpdateMax value exceeding a multiple of the configured sFlow export period (typically 30 seconds), or a decreasing value of metricN, indicates a problem - possibly the type of "grey" partial failure in the Amazon example.
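A simple availability check along these lines - a sketch only, using the same sFlow-RT address and endpoint as above, with arbitrary thresholds - could poll the summary metric and flag stale or missing data sources:

#!/usr/bin/env python
# Sketch of a "heartbeat" check: if the oldest update in the summary group
# (lastUpdateMax, in milliseconds) exceeds a multiple of the counter export
# period, or the number of contributing data sources (metricN) drops, report
# a possible device or network problem. The staleness multiple and polling
# interval are illustrative assumptions.
import json, time, urllib.request

SFLOW_RT = 'http://10.0.0.162:8008'
EXPORT_PERIOD_MS = 30000            # typical sFlow counter export interval
STALE_MS = 3 * EXPORT_PERIOD_MS     # alert if oldest update is > 3 periods old

expected = None   # baseline number of data sources, learned on the first poll
while True:
    url = SFLOW_RT + '/metric/ALL/sum:ifindiscards/json'
    with urllib.request.urlopen(url) as resp:
        summary = json.load(resp)[0]
    if summary['lastUpdateMax'] > STALE_MS:
        print('stale counters: oldest update %dms ago' % summary['lastUpdateMax'])
    if expected is None:
        expected = summary['metricN']
    elif summary['metricN'] < expected:
        print('data sources dropped from %d to %d' % (expected, summary['metricN']))
    time.sleep(EXPORT_PERIOD_MS / 1000)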
The chart shows these metrics plotted over time. There is clearly an issue with a large number of discarded packets. The trend line for the worst interface tracks the total across all interfaces, indicating that the problem is localized to a specific link rather than a general congestion-related problem involving multiple links.
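This comparison can also be automated. The sketch below (again using the address and /metric endpoint from the query above, with an arbitrary 90% threshold) tests whether a single link accounts for most of the total discard rate:

#!/usr/bin/env python
# Sketch: compare the worst single link's discard rate (max:) with the
# network-wide total (sum:). If one link accounts for most of the total,
# the problem is likely localized rather than general congestion. The 0.9
# ratio threshold is an illustrative assumption.
import json, urllib.request

url = 'http://10.0.0.162:8008/metric/ALL/sum:ifoutdiscards,max:ifoutdiscards/json'
with urllib.request.urlopen(url) as resp:
    total, worst = json.load(resp)   # results are returned in query order

if total['metricValue'] > 0:
    ratio = worst['metricValue'] / total['metricValue']
    if ratio > 0.9:
        print('discards localized to agent %s dataSource %s (%.0f%% of total)' %
              (worst['agent'], worst['dataSource'], 100 * ratio))
    else:
        print('discards spread across multiple links')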
The network-related outage described in this example is not an isolated incident; other incidents described on this blog include Amazon EC2 outage, Gmail outage, Delay vs utilization for adaptive control, and Multi-tenant performance isolation. The article Visibility and the software defined data center describes how the sFlow standard delivers the comprehensive visibility needed to automatically manage data center resources, increasing efficiency and reducing downtime due to network-related outages.