Wednesday, September 1, 2010

Cloud-scale performance monitoring


A cloud data center consists of large numbers of physical servers, each running a hypervisor with one or more virtual switches connecting the physical network to virtual machines (see Anatomy of an open source cloud). An integrated approach to network and system management is required in order to manage performance in this environment (see Management silos).

The combination of sFlow in Open vSwitch and Host sFlow provides a lightweight, scalable performance monitoring solution for open source virtualization (Xen, XenServer and KVM) and cloud platforms (OpenStack and Xen Cloud Platform). However, coordinating and managing the configuration of sFlow monitoring across the large number of virtual switches and virtual machines in a cloud data center is a complex task that needs to be addressed as part of a scalable monitoring solution.

In order to address this challenge, the latest version of the Host sFlow agent adds the ability to automatically configure sFlow on the Open vSwitch. The Host sFlow agent already includes a highly scalable configuration mechanism (see DNS-SD) and the integration with the Open vSwitch extends this mechanism to configure network and system performance monitoring throughout the cloud.
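For a sense of the per-switch step that is being automated, enabling sFlow on a single Open vSwitch bridge by hand is typically one ovs-vsctl invocation. The Python sketch below only builds that argument list. The bridge name, agent interface, collector address and rate values are placeholder assumptions, and the sketch is not a description of how the Host sFlow agent itself applies the settings.

```python
def ovs_sflow_command(bridge, agent_if, collector, sampling=512, polling=30, header=128):
    """Build an ovs-vsctl argument list that attaches an sFlow record to a bridge.

    All argument values are supplied by the caller; nothing here is read
    from Host sFlow or DNS-SD.
    """
    return [
        "ovs-vsctl",
        "--", "--id=@sflow", "create", "sflow",
        f"agent={agent_if}",
        # The quotes are part of the value so the host:port string is kept intact.
        f'target="{collector}"',
        f"header={header}",
        f"sampling={sampling}",
        f"polling={polling}",
        "--", "set", "bridge", bridge, "sflow=@sflow",
    ]
```

Repeating this on every hypervisor, and keeping the values consistent, is the coordination problem that the automatic configuration is meant to remove.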

Referring back to the diagram, the sFlow standard provides a performance monitoring solution that spans all the elements of the cloud. Most physical networking devices already include sFlow support (see Multi-vendor support), providing visibility into network and storage activity. Open vSwitch and Host sFlow extend visibility to include virtual networking and physical/virtual server performance respectively. Thus, sFlow provides the integrated end-to-end view of performance needed to manage resources throughout the cloud (see sFlow Host Structures).

Wednesday, August 25, 2010

Wireless



Two recent articles, "Higher Learning, Higher Speed: Campuses Graduate to 802.11n" and "Beyond 802.11n: Enterprise WLAN Trends For 2010", describe some of the trends driving the adoption of 802.11n wireless networking in higher education and enterprise campuses.

Moving to a wireless access network offers many benefits, including flexibility, mobility, energy savings and reduced cabling. However, managing performance in a wireless environment is challenging since wireless bandwidth is a shared, limited resource that can easily become congested. Bandwidth management is further complicated by the rapidly changing traffic patterns as users move from one part of the network to another.

The sFlow standard is currently supported by most switch vendors and is widely used to provide network-wide visibility. However, sFlow is not limited to monitoring switches. Deploying sFlow capable wireless access points extends monitoring into the wireless network, delivering the visibility needed for effective bandwidth management. The diagram at the top of the page shows how sFlow is used to centrally monitor the performance of the entire wireless infrastructure. When combined with sFlow from switches, sFlow delivers end-to-end visibility into the performance of the entire wired and wireless network.

The first task in managing wireless performance is rapidly identifying areas of the network experiencing performance problems. Each wireless access point uses sFlow's scalable "counter push" mechanism to export utilization and error statistics, allowing a central sFlow analyzer to rapidly pinpoint overloaded access points (see Link utilization).
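As a rough illustration of what an analyzer does with pushed counters, the Python sketch below turns two successive octet counter readings into a utilization figure. The function name and parameters are hypothetical, and the counter width is an assumption that should be set to match the counter being read.

```python
def link_utilization(prev_octets, curr_octets, interval_s, if_speed_bps, counter_bits=64):
    """Return utilization as a fraction of link speed between two counter samples."""
    if interval_s <= 0 or if_speed_bps <= 0:
        raise ValueError("interval and speed must be positive")
    # Counters wrap at 2**counter_bits; modular subtraction tolerates a single wrap.
    delta_octets = (curr_octets - prev_octets) % (1 << counter_bits)
    return (delta_octets * 8) / (interval_s * if_speed_bps)
```

Sorting access points by this value is one simple way to produce a "most utilized" list from counter data alone.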

(chart created using sFlowTrend)

The trend chart above shows a sharp increase in transmission failure and retry counts, indicating a severe performance problem. The next step is to find the source of this congestion. Each wireless access point uses sFlow's packet sampling mechanism to export packet headers, allowing the sFlow analyzer to identify sources of traffic.
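The scaling step behind a "top connections" chart can be sketched in a few lines of Python. This is a simplified illustration and not sFlowTrend's implementation: the tuple layout of each sample is an assumption, and every sample is simply treated as representing `sampling_rate` packets of the same size.

```python
from collections import Counter

def top_connections(samples, sampling_rate, interval_s, n=5):
    """Estimate bits/s per (src, dst, protocol) from sampled packets.

    `samples` is an iterable of (src, dst, protocol, frame_bytes) tuples
    collected during `interval_s` seconds.
    """
    totals = Counter()
    for src, dst, proto, frame_bytes in samples:
        # Each sample stands in for roughly `sampling_rate` packets of this size.
        totals[(src, dst, proto)] += frame_bytes * sampling_rate
    return [(key, est_bytes * 8 / interval_s) for key, est_bytes in totals.most_common(n)]
```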

(chart created using sFlowTrend)

The chart above shows the top connections and protocols making use of the wireless access point. Looking at the top connections chart, it is clear that the increased load (shown in red) is due to an afpovertcp connection between dhcp0 and dhcp6. The traffic associated with this connection peaks at nearly 30 Mbit/s, resulting in poor performance. Using the information from the chart to install a rate limit in the wireless access point provides a short term fix, restoring network service.

Further investigation reveals that a laptop is using the wireless network to back up its entire hard drive. A longer term solution uses traffic measurements to develop traffic shaping policies that balance the requirements of different traffic classes. In this case, creating a low priority class for backup traffic helps prevent future quality of service problems.

In this example, sFlow was used to manually identify and manage traffic. However, the real-time, network-wide visibility that sFlow provides makes it possible to automate performance management, ensuring fair access to all network users (see Network edge).

Monday, August 9, 2010

Host sFlow 1.0 Released

(image from Host sFlow)

The Host sFlow project has released an open source agent implementing the sFlow Host Structures specification. The current stable release (version 1.0) supports Linux, Windows, XenServer and Xen. The project is working to add support for additional platforms, including AIX, HP-UX, OS X, FreeBSD, OpenBSD, NetBSD, VMware, KVM and Hyper-V.

The Host sFlow agent provides a highly scalable solution for monitoring clusters of servers (see Top servers and Cluster performance). The combination of sFlow in top of rack (ToR) or end of row (EoR) switches and Host sFlow agents installed on servers delivers visibility into the performance of large scale data center workloads (see Hybrid server monitoring).

The combination of Open vSwitch and the Host sFlow agent provides a lightweight, scalable performance monitoring solution for open source virtualization (Xen, XenServer and KVM) and cloud platforms (OpenStack and Xen Cloud Platform).

The sFlow Host Structures specification has only recently been finalized. When looking for an sFlow analyzer (see Choosing an sFlow analyzer), ask if the vendor supports all optional and extended sFlow fields (including packet headers, interface counters and host structures). If possible, arrange for an evaluation and test the solution in a large scale trial.

Tuesday, August 3, 2010

sFlow Host Structures




The completed sFlow Host Structures specification has been published by sFlow.org, extending the sFlow standard to include physical and virtual server performance metrics. The specification describes a coherent framework that builds on the sFlow metrics exported by most switch vendors, linking network, server and application performance monitoring to provide an integrated picture of performance.

The diagram above shows how the packet header information exported by network devices is used to link network performance with performance metrics collected from servers and applications. The packet header contains MAC addresses corresponding to physical and virtual server network adapter cards as well as TCP/UDP socket information identifying individual application instances. Collecting sFlow data from the network devices provides an sFlow analyzer with a real-time map of the physical and logical relationships between entities on the network (see Packet paths and Application mapping).
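To make the linking fields concrete, the Python sketch below pulls the MAC addresses and the TCP/UDP socket tuple out of a sampled header. It is a minimal illustration that assumes an untagged Ethernet frame carrying IPv4; VLAN tags, IPv6 and other encapsulations are deliberately not handled, and the dictionary keys are hypothetical names.

```python
import struct

def parse_header(frame):
    """Extract MAC addresses and the TCP/UDP socket tuple from an Ethernet/IPv4 frame."""
    if len(frame) < 14:
        return None  # too short to hold an Ethernet header
    dst_mac, src_mac = frame[0:6].hex(":"), frame[6:12].hex(":")
    (ethertype,) = struct.unpack("!H", frame[12:14])
    result = {"src_mac": src_mac, "dst_mac": dst_mac, "ethertype": ethertype}
    if ethertype != 0x0800 or len(frame) < 34:
        return result  # not IPv4, or too short to hold a basic IPv4 header
    ihl = (frame[14] & 0x0F) * 4   # IPv4 header length in bytes
    proto = frame[23]              # 6 = TCP, 17 = UDP
    result["protocol"] = proto
    result["src_ip"] = ".".join(str(b) for b in frame[26:30])
    result["dst_ip"] = ".".join(str(b) for b in frame[30:34])
    l4 = 14 + ihl                  # start of the transport header
    if proto in (6, 17) and len(frame) >= l4 + 4:
        result["src_port"], result["dst_port"] = struct.unpack("!HH", frame[l4:l4 + 4])
    return result
```

The MAC addresses identify the server or virtual machine adapter, while the address and port pair identifies the application instance on that server.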

A server exporting sFlow performance metrics includes an additional structure containing the MAC addresses associated with each of its network adapters. The inclusion of the MAC addresses provides a common key linking server performance metrics (CPU, memory, I/O, etc.) to network performance measurements (network flows, link utilizations, etc.), providing a complete picture of the server's performance (see Hybrid server monitoring and UUID).
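The join on MAC address can be sketched as a simple lookup. The Python below is an illustration only: the record layouts (`hostname`, `macs`, `cpu_pct`, `src_mac`, `dst_mac`) are hypothetical and are not the field names used in the sFlow Host Structures specification.

```python
def link_by_mac(host_records, flow_records):
    """Attach network flows to the host whose adapter MAC appears in the flow."""
    # Index every adapter MAC so each flow can be matched with a dictionary lookup.
    mac_to_host = {}
    for host in host_records:
        for mac in host["macs"]:
            mac_to_host[mac.lower()] = host["hostname"]

    linked = {
        host["hostname"]: {"cpu_pct": host["cpu_pct"], "flows": []}
        for host in host_records
    }
    for flow in flow_records:
        for mac in (flow["src_mac"], flow["dst_mac"]):
            owner = mac_to_host.get(mac.lower())
            if owner is not None:
                linked[owner]["flows"].append(flow)
    return linked
```

A flow between two monitored servers is attached to both of them, which is what allows the same conversation to be viewed from either end.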

The sFlow Host Structures specification builds on the scalable "counter push" mechanism that is used by network devices to export standard interface counters (see Link utilization). Most operating systems already maintain performance counters to track CPU, memory and I/O performance. The sFlow Host Structures specification leverages work done by the Ganglia project to define a common set of metrics across different operating systems, including Windows, Linux (Fedora/RedHat/CentOS, Debian, Gentoo, SuSE/OpenSuSE), Solaris, FreeBSD, NetBSD, OpenBSD, DragonflyBSD and AIX. The extension of sFlow to include server performance metrics integrates network and system monitoring to deliver a data center wide view of performance (see Top servers and Cluster performance).

For virtual machine performance metrics, the sFlow Host Structures specification draws on definitions from the libvirt project, which has defined a standard set of metrics that can be collected from a wide variety of virtualization platforms, including Xen, QEMU, KVM, LXC, OpenVZ, User Mode Linux, VirtualBox, VMware ESX and GSX. Again, the MAC addresses associated with each virtual machine are exported along with its performance metrics so that the virtual machine's performance can be linked to its network activity.

The sFlow Host Structures document also describes the extension of sFlow's sampling mechanism to include application transaction sampling. Examples of application level transactions include: HTTP requests to a web server, NFS/CIFS requests to a file server, memcached requests and operations performed by a Hadoop cluster. An application sFlow agent samples completed transactions, capturing information about each completed request, including: size, duration, type, URL, file name etc. Each application transaction sample is linked to the network through the inclusion of TCP/UDP socket information which can be matched to packet header information from network devices.
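A minimal Python sketch of transaction sampling is shown below, assuming a simple random 1-in-N selection made as each request completes. The class, method and field names are hypothetical and do not reproduce the structures defined in the specification; the point is that each record carries the socket information needed to match it against packet headers seen in the network.

```python
import random

class TransactionSampler:
    """Select on average 1 in `rate` completed transactions, chosen at random."""

    def __init__(self, rate, seed=None):
        if rate < 1:
            raise ValueError("rate must be >= 1")
        self.rate = rate
        self.rng = random.Random(seed)
        self.sample_pool = 0   # transactions seen so far, used to scale samples back up
        self.samples = []

    def completed(self, socket_tuple, url, size_bytes, duration_us):
        """Call once per completed request; the request is recorded only if sampled."""
        self.sample_pool += 1
        if self.rng.random() < 1.0 / self.rate:
            self.samples.append({
                "socket": socket_tuple,   # (src_ip, src_port, dst_ip, dst_port)
                "url": url,
                "bytes": size_bytes,
                "duration_us": duration_us,
                "sample_pool": self.sample_pool,
            })
            return True
        return False
```

Because selection is random and the total count is carried with each sample, an analyzer can estimate request rates and response time distributions without seeing every transaction.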

What clearly distinguishes sFlow from other monitoring technologies is the integrated, end-to-end view of performance that it offers. Integration greatly increases the value of information by making it actionable. For example, identifying that an application is running slowly isn't enough to solve the performance problem. However, if you also know that the server hosting the application is seeing poor disk performance, can link the disk performance to a slow NFS server, can identify the other clients of the NFS server and finally determine that all the requests are competing for access to a single file, then you are in a position to take action. It is this ability to link data together, combined with the scalability to monitor every resource in the data center, that makes sFlow revolutionary.

Tuesday, July 20, 2010

OpenStack


The recently launched OpenStack project aims to provide an open source stack for cloud computing service providers. The project's backers include NASA and Rackspace, along with Citrix, Dell, NTT Data, Peer1, Intel, AMD and a number of other companies.

The OpenStack project is focused on the tools needed to manage and deploy cloud services on a large scale. By creating an ecosystem of service providers sharing common standards and open source tools, the project aims to increase acceptance of cloud computing by eliminating the threat of vendor or service provider lock-in. The project is hypervisor agnostic, targeting KVM, Xen and XenServer with the initial release.

A previous blog entry described the Xen Cloud Platform. The OpenStack and Xen Cloud Platform projects are largely complementary and, since both projects share a number of major contributors, the efforts should be well coordinated. For example, a critical part of any cloud computing architecture is the virtualization and isolation of networking among tenants in the cloud. The Xen Cloud Platform has already adopted the Open vSwitch since it provides the open, standards-based visibility and control needed to manage cloud networking. Based on comments from Citrix (a participant in both projects), it appears that the OpenStack project will also incorporate the Open vSwitch as part of its networking stack.

The Open vSwitch supports the sFlow standard, extending the network visibility provided by most switch vendors into the virtualization layer. The sFlow standard is uniquely placed to become the standard of choice for cloud performance monitoring. The scalability of sFlow monitoring allows all the physical and virtual switches in a large cloud data center to be centrally monitored, providing the visibility needed to manage performance and account for network usage.

The recent extension of sFlow into server monitoring (see server) delivers the "single pane of glass" visibility into the network, storage and system resources that cloud service providers need to optimize service, reduce costs and charge for metered services.

Wednesday, July 14, 2010

Configuring Allied Telesis switches

The recent AlliedWare Plus 2.1.1 release adds sFlow support to Allied Telesis switches.

The following commands configure an Allied Telesis switch to sample packets at 1-in-512, poll counters every 30 seconds and send sFlow to an analyzer (10.0.0.50) over UDP using the default sFlow port (6343):

awplus> enable
awplus# configure terminal
awplus(config)# sflow collector ip 10.0.0.50 port 6343
awplus(config)# interface port1.0.1-port1.0.24
awplus(config-if)# sflow sampling-rate 512
awplus(config-if)# sflow polling-interval 30
awplus(config-if)# set sflow collector ip 10.0.0.50
awplus(config-if)# exit
awplus(config)# sflow enable

A previous posting discussed the selection of sampling rates. Additional information can be found on the Allied Telesis web site.
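When weighing a sampling rate, a commonly quoted rule of thumb for packet sampling is that the percentage error of an estimate is bounded, with 95% confidence, by 196 × sqrt(1/c), where c is the number of samples seen for the traffic class of interest. The Python sketch below simply evaluates that approximation; the function name is hypothetical.

```python
import math

def max_percent_error(samples_in_class):
    """Approximate 95% confidence bound on the relative error of a sampled estimate."""
    if samples_in_class <= 0:
        raise ValueError("need at least one sample")
    return 196.0 * math.sqrt(1.0 / samples_in_class)
```

For example, a class with 100 samples is accurate to within about 20%, while 10,000 samples bring the bound down to about 2%. Accuracy depends on the number of samples collected, so a busy link can use a less aggressive sampling rate and still produce accurate results.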

See Trying out sFlow for suggestions on getting started with sFlow monitoring and reporting.
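A quick way to confirm that a newly configured switch is sending data is to listen on the sFlow port and decode the datagram header. The Python sketch below assumes sFlow version 5 datagrams and decodes only the fixed header fields; the function names and returned dictionary keys are hypothetical, and the samples themselves are not decoded.

```python
import ipaddress
import socket
import struct

def parse_sflow_header(datagram):
    """Decode the fixed header of an sFlow version 5 datagram (XDR, big-endian)."""
    version, addr_type = struct.unpack_from("!II", datagram, 0)
    if version != 5:
        raise ValueError(f"unsupported sFlow version {version}")
    addr_len = {1: 4, 2: 16}.get(addr_type)   # 1 = IPv4 agent, 2 = IPv6 agent
    if addr_len is None:
        raise ValueError(f"unknown agent address type {addr_type}")
    agent = ipaddress.ip_address(datagram[8:8 + addr_len])
    sub_agent, sequence, uptime_ms, num_samples = struct.unpack_from(
        "!IIII", datagram, 8 + addr_len)
    return {"agent": agent, "sub_agent": sub_agent, "sequence": sequence,
            "uptime_ms": uptime_ms, "num_samples": num_samples}

def listen(port=6343):
    """Print one line per received datagram; runs until interrupted."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    while True:
        data, (sender, _) = sock.recvfrom(65535)
        print(sender, parse_sflow_header(data))
```

Calling listen() on the analyzer host (10.0.0.50 in the configuration above) should start printing a line per datagram once sFlow is enabled on the switch. A full analyzer is still needed to interpret the samples.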

Monday, July 5, 2010

RMON (4 groups)

Diagram highlighting 4 RMON groups supported by most switch vendors

If you look carefully at the data sheets for almost any managed switch you are likely to see RMON mentioned as a network management feature with one of the following qualifiers: mini-RMON, RMON (4 groups), RMON (groups 1, 2, 3 and 9) or RMON (statistics, history, alarm and event). The RMON feature is almost never used. It's a bit like the human appendix, a remnant left behind by evolution.

The RMON standard was developed by the IETF during the early 1990s to provide an SNMP interface to probes used for remotely monitoring Ethernet and Token Ring LANs. At the time, LANs consisted of coax cables that were shared by a number of hosts. Repeaters were used to connect the cables and extend the network. In this environment, a single RMON probe would see all the traffic on the shared network, providing complete network visibility.

In the mid 1990s, demand for bandwidth increased and switches started to become popular. However, while segmenting the network using switches helped improve performance, segmentation dramatically increased the number of probes needed to monitor the network since a probe was required for each segment. Many customers depended on the visibility that RMON probes provided and switch vendors felt pressure to provide embedded RMON functionality.

The RMON standard defines 20 different types (groups) of measurement, including: traffic matrices, top talkers, top protocols, trending etc. Implementing all these features on a switch is difficult, requiring a significant area on the switch ASIC, resources that the switch vendors wanted to allocate to more advanced switching features like QoS, rate limiting, VLANs etc.

Four RMON groups were identified that were easy to implement, requiring minimal ASIC resources. Since the RMON standard allowed a vendor to claim RMON compliance by implementing any of the RMON groups, many switch vendors decided to implement the four RMON groups in order to be able to market their products as supporting the RMON standard. The proliferation of devices, many with very limited capabilities, all claiming RMON compliance undermined the value of the RMON standard and it has fallen out of favor as a network monitoring technology.

Today, even though hardly anyone uses the four RMON groups, they are part of the design of most switch ASICs, and leaving the feature in is easier than going to the trouble of redesigning the chip to remove it.

In 2001, the sFlow standard was developed to address the need to monitor network traffic in switched LAN environments. The sFlow standard describes a minimum set of functions (packet sampling and counter polling) that are easily implemented in a switch ASIC. Requiring that all sFlow compliant switches implement these features ensures that every sFlow compliant switch delivers the full range of features needed for network visibility.

Diagram highlighting RMON functional areas addressed by sFlow

The sFlow monitoring architecture provides the full range of traffic monitoring functions by shifting complexity from the switches to a central sFlow analyzer (see Choosing an sFlow analyzer). The architecture has proven successful and today most switch vendors embed sFlow monitoring.