Friday, June 24, 2011

Five W's

The Five W's (often supplemented by How) are the set of questions that a news report must answer in order to be considered complete:
  • What happened (what is the story)?
  • Who is it about?
  • When did it take place?
  • Where did it take place?
  • Why did it happen?
  • How did it happen?
These questions provide a good framework for solving performance management problems. The following example demonstrates how the network-wide visibility provided by the sFlow standard makes it easy to quickly answer each question in order to detect, diagnose and eliminate performance problems.

Note: The free sFlowTrend tool is used to demonstrate problem solving using sFlow, but there are many other tools to choose from.


What happened? Threshold violations on interface counters provide the notification that there is a problem. sFlow provides an extremely efficient method of collecting interface counters from every interface in the network, allowing prompt detection of performance problems.
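The counter-based threshold check can be sketched as follows. The record layout and the 50,000 packets/second limit are illustrative assumptions, not sFlowTrend's actual logic:

```python
# Illustrative sketch: flag interfaces whose unicast packet rate
# exceeds a configured limit. Field layout and limit are assumptions.
def check_thresholds(interfaces, pps_limit=50_000):
    """Return the (switch, interface) pairs whose unicast packet
    rate exceeds the configured limit."""
    return [(switch, ifname)
            for switch, ifname, pps in interfaces
            if pps > pps_limit]

counters = [
    ("switch1", "A13", 80_000),  # busy interface
    ("switch2", "B1", 1_200),    # normal load
]
alerts = check_thresholds(counters)
```

Because the collector holds counters for every interface, a check like this runs centrally rather than on each device.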

This screen capture of the sFlowTrend dashboard shows that a problem with excessive unicast packets has been detected. There are many devices and interfaces in this network, so the next question is: who reported the problem? Clicking on the bar provides the following answer.


Who is reporting the problem? The following table sorts the switches to show which are seeing excessive unicast traffic. Comparing switches provides a baseline, making it easy to see whether the problem is widespread or localized to specific devices.

Note: Many monitoring systems are hierarchical: counters are polled locally and notifications of threshold violations are sent to the central management system. The problem with this approach is that the underlying data needed to put the event into context is lost. The sFlow architecture centralizes monitoring - performance counters from all devices are collected centrally and threshold calculations are performed by the collector. sFlow makes it simple to drill down and compare the statistics underlying any notification, making it much easier to troubleshoot problems.

Drilling down further, the following table shows individual interfaces sorted to show which interface is seeing excessive traffic. 

Now that we know where the problem is, the next question is when did it start?


Again, because sFlow centralizes all the critical performance data, follow-up is straightforward. Counter trends on any link can be displayed; the following chart was obtained by drilling down on the interface highlighted in the screen above.

This chart shows that a two-minute spike in traffic occurred around 10 minutes ago. Link utilization has since returned to normal levels, so there is no need for immediate action. However, it is worth identifying why the spike occurred and assessing whether it is likely to happen again.
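The utilization trend behind a chart like this is derived from successive octet counter readings. A minimal sketch, assuming 64-bit counters and a known interface speed:

```python
def utilization_percent(prev_octets, curr_octets, interval_s, if_speed_bps):
    """Percent utilization between two successive octet counter readings,
    allowing for 64-bit counter wrap."""
    delta_bytes = (curr_octets - prev_octets) % (1 << 64)
    return 100.0 * (delta_bytes * 8) / (interval_s * if_speed_bps)

# 375 MB transferred in a 30 second polling interval on a 1 Gbit/s link:
util = utilization_percent(0, 375_000_000, 30, 1_000_000_000)  # 10%
```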

Why? How?

Interface counters are only one type of data exported by sFlow; sFlow agents also export real-time traffic information. The two types of data complement one another: counters allow performance anomalies to be quickly detected, while traffic information provides the detail needed to identify the root cause of a problem.
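The relationship between sampled packets and total traffic is simple to sketch: with 1-in-N sampling, each sample represents roughly N packets. The function name here is illustrative, not part of any sFlow tool:

```python
def estimate_totals(num_samples, sampled_bytes, sampling_rate):
    """Scale 1-in-N packet samples up to an estimate of total
    frames and octets on the link."""
    return num_samples * sampling_rate, sampled_bytes * sampling_rate

# 100 samples totalling 150 KB, taken at 1-in-512:
frames, octets = estimate_totals(100, 150_000, 512)
```

The estimates are statistical, but with enough samples the relative error is small, which is what makes sampling practical at high line rates.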

By flipping the chart from Utilization to Top Connections, sFlowTrend uses sFlow's traffic measurements to break the traffic down into the top connections responsible for it.

The chart shows why the traffic spiked: multiple TCP connections to port 80 (web) were responsible. The top two connections provide a clue as to the type of traffic.

How did the spike happen? It seems likely that a system update was run on the server. Given the timing of the traffic, this looks like an unscheduled update. It would be a good idea to talk to the server's system administrator and suggest scheduling updates at off-peak times so that they don't interfere with peak business-hour traffic.


What if the spike was ongoing and we couldn't contact the system administrator of the server to shut down the update? In this case it is very important to be able to locate the server on the network in order to take action.

When an sFlow agent reports on network traffic it also includes information on the packet path across the device. Combining data from all the switches allows an sFlow analyzer to discover and track network topology and the location of each host on the network.
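Host location can be sketched as a search of the forwarding state learned from the packet-path data: the edge port is one that saw the host's MAC address and is not an inter-switch link. The data structures below are illustrative assumptions, not the sFlowTrend implementation:

```python
def locate_host(mac, fdb, inter_switch_ports):
    """Return the (switch, port) where a host attaches: a port that
    learned the MAC and is not an uplink to another switch."""
    for (switch, port), macs in fdb.items():
        if mac in macs and (switch, port) not in inter_switch_ports:
            return switch, port
    return None

# Forwarding entries learned from packet-path data (illustrative):
fdb = {
    ("switch1", "A13"): {"00:1b:21:0a:0b:0c"},  # edge port
    ("switch2", "C1"):  {"00:1b:21:0a:0b:0c"},  # seen via uplink
}
location = locate_host("00:1b:21:0a:0b:0c", fdb,
                       inter_switch_ports={("switch2", "C1")})
```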

Clicking on an address in sFlowTrend provides the current location.

In this case, the host is located on port A13. Knowing where the server is located allows the network administrator to log into the switch and take corrective action, blocking or rate limiting the traffic.

The article, Network edge, described how the process of detecting traffic problems and applying controls to the switches can be fully automated. Automation is particularly important in large scale environments where manual intervention is labor intensive and slow.

Finally, the sFlow standard is widely supported by network equipment vendors, providing simple, scalable, end-to-end monitoring of wired and wireless networking as well as servers, virtual machines and applications running on the network. Comprehensive, integrated visibility is the key to simplifying management and controlling costs in networked environments.

Thursday, June 16, 2011

Standard metrics

This presentation describes the role that standard server performance metrics play in increasing the scalability and reducing the operational complexity of performance monitoring in large data centers.

The presentation uses popular performance monitoring tools: Nagios, Ganglia, Collectd, Cacti and Munin to demonstrate the complexity of managing each application's agents on multiple platforms and servers. The tools are then used to demonstrate that a core set of metrics is widely recognized and broadly supported. The presentation goes on to show how an agent exporting these standard metrics allows performance monitoring tools to share data, eliminating the need for wasteful duplication. Finally, by including server performance metrics in the sFlow standard, server performance monitoring becomes part of an integrated solution that includes networking, servers and applications.

Tuesday, June 14, 2011

Configuring LG-ERICSSON switches

The following commands configure an LG-ERICSSON switch to sample packets at 1-in-512, poll counters every 30 seconds and send sFlow to an analyzer over UDP using the default sFlow port (6343):

sflow receiver 1 6343

For each interface:

sflow flow-sampling 512 1
sflow counter-sampling 30 1

A previous posting discussed the selection of sampling rates. Additional information can be found on the LG-ERICSSON web site.
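One way to reason about sampling rate selection is to aim for a manageable number of samples per second at line rate. The target of 100 samples/second and the 500-byte average packet size below are illustrative assumptions, not figures from the posting referenced above:

```python
def suggest_sampling_rate(link_bps, target_samples_per_sec=100,
                          avg_pkt_bytes=500):
    """Pick a 1-in-N rate yielding roughly target_samples_per_sec at
    line rate, rounded down to a power of two (a form commonly
    preferred by switch hardware)."""
    pps_at_line_rate = link_bps / 8 / avg_pkt_bytes
    n = max(1, int(pps_at_line_rate / target_samples_per_sec))
    return 1 << (n.bit_length() - 1)

rate = suggest_sampling_rate(1_000_000_000)  # 1 Gbit/s link
```

Faster links call for higher values of N: the sample stream stays roughly constant while the traffic it describes grows.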

See Trying out sFlow for suggestions on getting started with sFlow monitoring and reporting.

Monday, June 13, 2011

Configuring NEC switches

The following commands configure an NEC switch to sample packets at 1-in-2048, poll counters every 30 seconds and send sFlow to an analyzer over UDP using the default sFlow port (6343):

sflow source
sflow destination 6343
sflow sample 2048
sflow polling-interval 30

For each interface:

sflow forward ingress

A previous posting discussed the selection of sampling rates. Additional information can be found on the NEC web site.

See Trying out sFlow for suggestions on getting started with sFlow monitoring and reporting.

Monday, June 6, 2011

Hardware support for Open vSwitch

How to Port Open vSwitch to New Software or Hardware describes the steps needed to port Open vSwitch to different hardware and software platforms. Porting Open vSwitch is relatively straightforward since the bulk of the code resides in user space, with a minimal set of performance-critical functions implemented in the kernel (or in hardware).

Porting the Open vSwitch sFlow function to hardware switch platforms is straightforward since merchant switch silicon typically contains the hardware counters and packet sampling capabilities needed to implement sFlow.

Implementing the following functions in the Open vSwitch datapath provider API (see lib/dpif-provider.h) allows the sampling hardware to be configured:

/* Retrieves 'dpif''s sFlow sampling probability into '*probability'.
 * Return value is 0 or a positive errno value.  EOPNOTSUPP indicates that
 * the datapath does not support sFlow, as does a null pointer.
 * '*probability' is expressed as the number of packets out of UINT_MAX to
 * sample, e.g. probability/UINT_MAX is the probability of sampling a given
 * packet. */
 int (*get_sflow_probability)(const struct dpif *dpif,
                               uint32_t *probability);
/* Sets 'dpif''s sFlow sampling probability to 'probability'.  Return value
 * is 0 or a positive errno value.  EOPNOTSUPP indicates that the datapath
 * does not support sFlow, as does a null pointer.
 * 'probability' is expressed as the number of packets out of UINT_MAX to
 * sample, e.g. probability/UINT_MAX is the probability of sampling a given
 * packet. */
 int (*set_sflow_probability)(struct dpif *dpif, uint32_t probability);
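A helper for converting between the familiar 1-in-N sampling rate and the out-of-UINT_MAX probability encoding used by this API might look like the following sketch (not part of Open vSwitch):

```python
UINT_MAX = 2**32 - 1

def rate_to_probability(n):
    """Convert a 1-in-N sampling rate to the out-of-UINT_MAX encoding
    used by the dpif sFlow probability functions."""
    return UINT_MAX // n

def probability_to_rate(probability):
    """Approximate inverse: recover N from the encoded probability
    (0 means sampling disabled)."""
    return round(UINT_MAX / probability) if probability else 0

prob = rate_to_probability(512)
```

The encoding loses a little precision to integer truncation, but round-trips cleanly for practical sampling rates.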

Packet samples are passed from the hardware up to user space as part of the datapath function (see datapath/datapath.h):

/**
 * struct dp_upcall - metadata to include with a packet to send to userspace
 * @cmd: One of %ODP_PACKET_CMD_*.
 * @key: Becomes %ODP_PACKET_ATTR_KEY.  Must be nonnull.
 * @userdata: Becomes %ODP_PACKET_ATTR_USERDATA if nonzero.
 * @sample_pool: Becomes %ODP_PACKET_ATTR_SAMPLE_POOL if nonzero.
 * @actions: Becomes %ODP_PACKET_ATTR_ACTIONS if nonnull.
 * @actions_len: Number of bytes in @actions.
 */
 struct dp_upcall_info {
         u8 cmd;
         const struct sw_flow_key *key;
         u64 userdata;
         u32 sample_pool;
         const struct nlattr *actions;
         u32 actions_len;
 };

The user space sFlow agent periodically retrieves standard interface counters using the netdev provider interface (see lib/netdev-provider.h):

/* Retrieves current device stats for 'netdev' into 'stats'.
 * If a network device supports some statistics but not others, it should
 * set the values of the unsupported statistics to all-1-bits
 * (UINT64_MAX). */
 int (*get_stats)(const struct netdev *netdev, struct netdev_stats *);

The article, OpenFlow and sFlow, describes how the two technologies combine to provide the visibility and control needed to manage network performance. In the server virtualization space, support for Open vSwitch in network adapters offers a way to deliver hardware acceleration while still providing the visibility and control of OpenFlow and sFlow.

Wednesday, June 1, 2011

Resource allocation

The paper, PRESS: PRedictive Elastic ReSource Scaling for cloud systems, describes a resource management scheme that uses continuous measurements of virtual machine resource usage to predict future requirements and then automatically adjusts the resources allocated to each virtual machine in order to satisfy predicted demand and meet service level objectives (SLOs).

The specific example described in the paper involves collecting virtual machine CPU usage measurements every minute from a cluster of Xen servers. Xen attracts innovation because the open source platform makes it easy for researchers to make modifications. In addition, Xen is the dominant cloud platform and so any improvements can quickly be adopted and have a large impact.
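As a toy illustration of measurement-based control (PRESS itself uses signature detection and Markov-chain state prediction, which this sketch does not attempt), a predictor might size the next allocation from recent demand plus headroom:

```python
def predict_allocation(cpu_history, window=5, headroom=0.1):
    """Naive predictor: allocate the recent peak CPU demand plus a
    safety margin, capped at the full capacity of one CPU (1.0).
    A stand-in for PRESS's signature/Markov prediction."""
    recent_peak = max(cpu_history[-window:])
    return min(1.0, recent_peak * (1 + headroom))

# One-minute CPU utilization samples for a virtual machine:
allocation = predict_allocation([0.20, 0.30, 0.50, 0.40, 0.60, 0.55])
```

Whatever the prediction scheme, its quality is bounded by the measurements feeding it, which is where scalable monitoring comes in.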

Support for the sFlow standard in the Xen Cloud Platform (XCP) provides the cloud-scale monitoring needed to support measurement based control schemes like PRESS in production environments. For a broader perspective, the Data center convergence, visibility and control presentation describes the critical role that measurement plays in managing costs and optimizing performance.