Friday, June 24, 2011

Five W's

The Five W's are the set of questions that a news report must answer in order to be considered complete:
  • What happened (what is the story)?
  • Who is it about?
  • When did it take place?
  • Where did it take place?
  • Why did it happen?
  • How did it happen?
These questions provide a good framework for solving performance management problems. The following example demonstrates how the network-wide visibility provided by the sFlow standard makes it easy to quickly answer each question in order to detect, diagnose and eliminate performance problems.

Note: The free sFlowTrend tool is used to demonstrate problem solving using sFlow, but there are many other tools to choose from.

What?

What happened? Threshold violations on interface counters provide notification that there is a problem. sFlow provides an extremely efficient method of collecting interface counters from every interface in the network, allowing performance problems to be detected promptly.
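To make the mechanism concrete, here is a minimal sketch of how a collector might turn periodically exported interface counters into threshold alerts. The counter field mirrors the sFlow generic interface counter ifInUcastPkts, but the data structures, the example values and the 10,000 packets-per-second threshold are illustrative assumptions, not sFlowTrend's implementation.

UNICAST_PKTS_PER_SEC_THRESHOLD = 10_000  # assumed alert level

# previous counter sample per (switch, interface): (timestamp, ifInUcastPkts)
last_sample = {}

def check_unicast_rate(switch, ifindex, timestamp, if_in_ucast_pkts):
    """Return a violation record if the unicast packet rate exceeds the threshold."""
    key = (switch, ifindex)
    prev = last_sample.get(key)
    last_sample[key] = (timestamp, if_in_ucast_pkts)
    if prev is None:
        return None                      # need two samples to compute a rate
    prev_time, prev_pkts = prev
    interval = timestamp - prev_time
    if interval <= 0 or if_in_ucast_pkts < prev_pkts:
        return None                      # counter wrap or agent restart: skip this interval
    rate = (if_in_ucast_pkts - prev_pkts) / interval
    if rate > UNICAST_PKTS_PER_SEC_THRESHOLD:
        return {"switch": switch, "ifindex": ifindex, "unicast_pkts_per_sec": rate}
    return None

# Example: two samples 30 seconds apart on switch 10.0.0.244, interface 13
check_unicast_rate("10.0.0.244", 13, 0, 1_000_000)
print(check_unicast_rate("10.0.0.244", 13, 30, 1_600_000))
# -> {'switch': '10.0.0.244', 'ifindex': 13, 'unicast_pkts_per_sec': 20000.0}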


This screen capture of the sFlowTrend dashboard shows that a problem with excessive unicast packets has been detected. There are many devices and interfaces in this network, so the next question is: who reported the problem? Clicking on the bar provides the following answer.

Who?

Who is reporting the problem? The following table sorts the switches to show which ones are seeing excessive unicast traffic. Comparing switches provides a baseline, making it easy to see whether the problem is widespread or localized to specific devices.


Note: Many monitoring systems are hierarchical: counters are polled locally and notifications of threshold violations are sent to the central management system. The problem with this approach is that the underlying data needed to put the event into context is lost. The sFlow architecture centralizes monitoring - performance counters from all devices are centrally collected and threshold calculations are performed by the collector. sFlow makes it simple to drill down and compare the statistics underlying any notification, making it much easier to troubleshoot problems.

Drilling down further, the following table sorts individual interfaces to show which one is seeing the excessive traffic.
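Because every switch's counters are already at the collector, rankings like these can be computed with a simple aggregation. A minimal sketch, using illustrative per-interface rates of the kind produced by the counter-processing example above:

from collections import defaultdict

rates = [  # (switch, ifIndex, unicast packets per second) - illustrative values
    ("10.0.0.244", 13, 20500.0),
    ("10.0.0.244", 7,    950.0),
    ("10.0.0.251", 2,   1200.0),
    ("10.0.0.252", 1,    400.0),
]

# Rank switches by the busiest interface they report
per_switch = defaultdict(float)
for switch, ifindex, rate in rates:
    per_switch[switch] = max(per_switch[switch], rate)

for switch, peak in sorted(per_switch.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{switch:12} peak unicast rate {peak:10.0f} pkt/s")

# Drill down: rank interfaces on the top switch
top_switch = max(per_switch, key=per_switch.get)
for switch, ifindex, rate in sorted(
        (r for r in rates if r[0] == top_switch), key=lambda r: r[2], reverse=True):
    print(f"  {switch} ifIndex {ifindex:3} {rate:10.0f} pkt/s")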


Now that we know where the problem is, the next question is when did it start?

When?

Again, because sFlow centralizes all the critical performance data, follow-up is straightforward. Counter trends on any link can be displayed; the following chart was obtained by drilling down on the interface highlighted in the screen above.


This chart shows that a 2-minute spike in traffic occurred around 10 minutes ago. Link utilization has since returned to normal levels, so there is no need for immediate action. However, it is worth identifying why the spike occurred and assessing whether it is likely to occur again.
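For reference, a utilization trend like the one in the chart can be derived directly from the cumulative octet counters that sFlow exports. The sketch below uses made-up counter samples and an assumed 1 Gbit/s link speed; it is not how sFlowTrend itself computes the chart.

IF_SPEED_BPS = 1_000_000_000  # assumed link speed (ifSpeed), 1 Gbit/s

# (timestamp in seconds, cumulative ifInOctets) - illustrative counter samples
samples = [(0, 0), (30, 75_000_000), (60, 3_300_000_000), (90, 3_375_000_000)]

trend = []
for (t0, o0), (t1, o1) in zip(samples, samples[1:]):
    bits = (o1 - o0) * 8
    utilization = 100.0 * bits / (IF_SPEED_BPS * (t1 - t0))
    trend.append((t1, utilization))

for t, u in trend:
    print(f"t={t:3}s  in-utilization {u:5.1f}%")
# The 30-60s interval stands out as a spike against the other intervals.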

Why? How?

Interface counters are only one type of data exported by sFlow. sFlow agents also export real-time traffic information. The two types of data complement one another: counters allow performance anomalies to be detected quickly, while traffic information provides the detail needed to identify the root cause of the problem.

Flipping the chart from Utilization to Top Connections tells sFlowTrend to use the sFlow traffic measurements to break the traffic down into the top connections responsible for it.
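The sketch below illustrates the general idea behind such a top-connections view: each sFlow packet sample is scaled up by its sampling rate and accumulated per connection. The sample records and hostnames are illustrative assumptions, and a real analyzer handles multiple agents, changing sampling rates and richer flow keys.

from collections import Counter

# Each record: (src, src port, dst, dst port, sampled frame length, sampling rate 1-in-N)
packet_samples = [
    ("ganglia.sf.inmon.com", 51034, "fedora.mirrors.tds.net", 80, 1500, 512),
    ("ganglia.sf.inmon.com", 51035, "fedora.mirrors.tds.net", 80, 1500, 512),
    ("ganglia.sf.inmon.com", 42310, "mail.example.com",       25,   90, 512),
]

bytes_by_connection = Counter()
for src, sport, dst, dport, frame_len, sampling_rate in packet_samples:
    # Scale each sampled frame by its sampling rate to estimate total bytes
    bytes_by_connection[(src, sport, dst, dport)] += frame_len * sampling_rate

for (src, sport, dst, dport), est_bytes in bytes_by_connection.most_common(5):
    print(f"{src}:{sport} -> {dst}:{dport}  ~{est_bytes} bytes")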


The chart shows why the traffic spiked. Multiple TCP connections to port 80 (web) from ganglia.sf.inmon.com were responsible for the spike in traffic. The top two connections are to fedora.mirrors.tds.net, providing a clue as to the type of traffic.

How did the spike happen? It seems likely that a system update was run on the ganglia.sf.inmon.com server. Given the timing of the traffic, this looks like an unscheduled update. It would be a good idea to talk to the server's system administrator and suggest scheduling updates at off-peak times so that they don't interfere with peak business-hours traffic.

Where?

What if the spike were ongoing and we couldn't contact the server's system administrator to shut down the update? In that case it is very important to be able to locate the server on the network in order to take action.

When an sFlow agent reports on network traffic, it also includes information on the packet path across the device. Combining data from all the switches allows an sFlow analyzer to discover and track network topology and the location of each host on the network.
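A minimal sketch of the idea: each packet sample identifies the reporting switch, the ingress port and the packet's source address, so the collector can remember where each host's traffic enters the network. The uplink_ports set and the sample values below are assumptions for illustration; real topology discovery is more involved.

locations = {}      # host address -> (switch, port) where its traffic enters the network
uplink_ports = {("10.0.0.244", "A24")}   # assumed inter-switch links to ignore

def learn_location(switch, input_port, src_addr):
    """Record the ingress switch/port for a source address, skipping uplinks."""
    if (switch, input_port) in uplink_ports:
        return                           # traffic arriving over an uplink is not locally attached
    locations[src_addr] = (switch, input_port)

# Samples of traffic sourced by ganglia.sf.inmon.com:
learn_location("10.0.0.244", "A13", "ganglia.sf.inmon.com")   # edge port: recorded
learn_location("10.0.0.244", "A24", "ganglia.sf.inmon.com")   # uplink: ignored

print(locations["ganglia.sf.inmon.com"])   # -> ('10.0.0.244', 'A13')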

Clicking on an address in sFlowTrend provides the current location.


In this case, the host ganglia.sf.inmon.com is located on port A13 on switch 10.0.0.244. Knowing where the server is located allows the network administrator to log into the switch and take corrective action, blocking or rate limiting the traffic.

The article Network edge describes how the process of detecting traffic problems and applying controls to the switches can be fully automated. Automation is particularly important in large-scale environments where manual intervention is labor intensive and slow.
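As a rough illustration of what such automation might look like, the sketch below wires the earlier detection and location steps to a control action. The apply_rate_limit() function is a hypothetical placeholder; in practice the control would be applied through whatever interface the switch exposes (CLI, SNMP, OpenFlow, etc.).

def apply_rate_limit(switch, port, host, kbps):
    """Hypothetical control action; here it just logs the decision."""
    print(f"rate-limit {host} on {switch} port {port} to {kbps} kb/s")

def handle_violation(violation, top_talker, locations):
    """Given a threshold violation and the host responsible for it, act on the host's edge port."""
    location = locations.get(top_talker)
    if location is None:
        print(f"location of {top_talker} unknown; escalating {violation} to an operator")
        return
    switch, port = location
    apply_rate_limit(switch, port, top_talker, kbps=1_000)

# Example, reusing values from the earlier sketches
handle_violation(
    {"switch": "10.0.0.244", "ifindex": 13, "unicast_pkts_per_sec": 20500.0},
    "ganglia.sf.inmon.com",
    {"ganglia.sf.inmon.com": ("10.0.0.244", "A13")},
)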

Finally, the sFlow standard is widely supported by network equipment vendors, providing simple, scalable, end-to-end monitoring of wired and wireless networking as well as servers, virtual machines and applications running on the network. Comprehensive, integrated visibility is the key to simplifying management and controlling costs in networked environments.

