Wednesday, January 11, 2012

Graphite


The Graphite realtime charting software provides a flexible way to store and plot time series data. Graphite was originally developed by Orbitz.com, but is now an open source project that is growing in popularity among system administrators. However, while Graphite is very good at recording metrics and building dashboards, it provides no native monitoring capabilities, relying instead on user-provided agents to make measurements.

The open source Host sFlow agent is a natural complement to Graphite, providing a lightweight, portable agent that exports standard metrics from a wide variety of systems. Deploying Host sFlow agents with a Graphite collector offers a complete, highly scalable monitoring solution.



The diagram shows the essential components of the solution. The Host sFlow agents continuously send metrics to the Graphite collector, where the sflow2graphite script converts the binary sFlow data into Graphite's text-based messages and submits them to the Graphite server.
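
For reference, Graphite's Carbon daemon accepts plaintext messages of the form "<metric path> <value> <timestamp>" on TCP port 2003. The following sketch sends a single hand-crafted data point; the server name and metric path are illustrative, not the names sflow2graphite generates:

# send one data point to Carbon's plaintext listener (TCP port 2003);
# "web1.load_one" is an illustrative metric path
echo "web1.load_one 0.12 $(date +%s)" | nc graphite.example.com 2003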

First install sflowtool on the Graphite server. The sflow2graphite script requires sflowtool to receive binary sFlow datagrams and convert them into a text-based representation. Next, download the sflow2graphite.pl script from http://sflow2graphite.googlecode.com.
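
As a sketch, sflowtool builds from source using the usual autotools steps (the version number below is illustrative; substitute the release you downloaded):

# build and install sflowtool from a source tarball (version is illustrative)
tar -xzf sflowtool-3.22.tar.gz
cd sflowtool-3.22
./configure && make
sudo make install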

Then run the following command on the Graphite server:

./sflow2graphite.pl

Now, configure the Host sFlow agents to send sFlow to the Graphite server. Within minutes, metrics from the hosts should start appearing in Graphite.
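
For example, a static Host sFlow configuration (DNS-SD disabled) might look like the sketch below; the collector address stands in for your Graphite server, and the polling and sampling values are typical rather than required:

# /etc/hsflowd.conf sketch: export metrics to the Graphite server (address is illustrative)
sflow {
  DNSSD = off
  polling = 20
  sampling = 400
  collector {
    ip = 10.0.0.1
    udpport = 6343
  }
}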

Note: If you are already collecting sFlow data with another tool, like Ganglia, then consider forwarding sFlow from the existing collector. Alternatively, you can extract the data from the existing collector and feed it into Graphite; in the case of Ganglia there are a number of solutions available.

It is worth noting that performance metrics may be gauges or counters. In the case of a gauge (like a system load average), Graphite will automatically display a useful chart. In the case of a counter metric (like total disk bytes read), Graphite will display the increasing total by default.
The screen capture above shows how you can apply a derivative function to the metric to convert a counter into a rate. The following chart trends disk bytes read/written as a bytes/second rate:
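
Such a chart can be requested from Graphite's render API by wrapping the counter metric in rate functions; the sketch below assumes a hypothetical metric path and uses Graphite's nonNegativeDerivative() and scaleToSeconds() functions:

# render a bytes/second chart from a byte-counter metric (metric path is hypothetical);
# nonNegativeDerivative() handles counter resets, scaleToSeconds() normalizes to per-second
curl -o disk_read.png \
  "http://graphite.example.com/render?target=scaleToSeconds(nonNegativeDerivative(web1.disk_bytes_read),1)&from=-1h"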

Finally, Graphite uses a hierarchical naming convention for metrics. If the default metric names generated by the sflow2graphite script don't fit your naming hierarchy, simply edit the table of names in the script to apply your own.

Tuesday, January 10, 2012

Forwarding using sflowtool


The diagrams show two different configurations for sFlow monitoring:
  1. Without Forwarding: Each sFlow agent is configured to send sFlow to each of the analysis applications. This configuration is appropriate when a small number of applications is used to continuously monitor performance. However, the overhead on the network and agents increases as analyzers are added, and it is often not possible to add more since many embedded sFlow agents have limited resources and only support a small number of sFlow streams. In addition, reconfiguring every agent to add or remove an analysis application can be a significant task, since agents may reside in Ethernet switches, routers, servers, hypervisors and applications on many different platforms from a variety of vendors.
  2. With Forwarding: All of the agents are configured to send sFlow to a forwarding module, which resends the data to the analysis applications. Analyzers can then be added and removed simply by reconfiguring the forwarder, without any changes to the agent configurations.
There are many variations between these two extremes. Typically there will be one or two analyzers used for continuous monitoring, and additional tools, like Wireshark, might be deployed for troubleshooting when the continuous monitoring tools detect anomalies. The sFlow analyzer may include a built-in forwarding capability; however, if built-in forwarding is not available, there are alternatives.

A previous posting introduced the sflowtool command line utility. The following examples demonstrate how sflowtool can be used to replicate and forward sFlow streams.

The following command configures sflowtool to listen for sFlow on the well-known port (UDP port 6343) and forward the sFlow to two analyzers: the first running on remote machine 10.0.0.111 (on the default sFlow port) and the second listening on port 7343 on the local host.

sflowtool -f 10.0.0.111/6343 -f localhost/7343

If an sFlow analyzer is already running on the server then it will already be bound to the sFlow port and the above command will fail. However, you can still forward the sFlow using the tcpdump command to capture the sFlow datagrams and sflowtool to forward them:

tcpdump -p -s 0 -w - udp port 6343 \
| sflowtool -r - -f 10.0.0.111/6343 -f localhost/7343

It is also possible to filter the sFlow data to pick out a particular agent. The following command selectively forwards sFlow coming from IP address 10.0.0.237:

tcpdump -p -s 0 -w - src host 10.0.0.237 and udp port 6343 \
| sflowtool -r - -f 10.0.0.111/6343 -f localhost/7343

This technique can also be used to analyze the data locally rather than forwarding it. For example, suppose that Ganglia is being used to monitor the performance of a web farm. While Ganglia might show a large spike in HTTP requests, analysis using sflowtool offers additional details:

tcpdump -p -s 0 -w - udp port 6343 | sflowtool -r - -H
10.0.0.70 - - [03/Jan/2012:14:44:29 -0800] "GET http://www.google.com/ HTTP/1.1" 200 21605 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7"

The information about URLs, user agents, response times, status codes and bytes provides the additional detail needed to diagnose the performance problem: for example, identifying overloaded web servers, top URLs and the sources of the increased load.

Note: See the sflowtool article for more examples of analyzing sFlow data using sflowtool.

Saturday, January 7, 2012

Host sFlow distributed agent


The sFlow standard uses a distributed push model in which each host autonomously sends a continuous stream of performance metrics to a central sFlow collector. The push model is highly scalable and is particularly suited to cloud environments.

The distributed architecture extends to the sFlow agents within a host. The following diagram shows how the sFlow agents on each host coordinate to provide a comprehensive and consistent view of performance.


Typically, all hosts in the data center will share identical configuration settings. The Host sFlow daemon, hsflowd, provides a single point of configuration. By default hsflowd uses DNS Service Discovery (DNS-SD) to learn configuration settings. Alternatively, settings can be distributed using orchestration software such as Puppet or Chef to update each host's hsflowd configuration file, /etc/hsflowd.conf, and restart the daemon.
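
For example, the DNS-SD settings can be published as SRV and TXT records for the _sflow._udp service; the zone file fragment below is a sketch, with an assumed collector name and typical polling and sampling values:

; DNS-SD records for Host sFlow (collector name and values are illustrative)
_sflow._udp   SRV   0 0 6343  collector.example.com.
_sflow._udp   TXT   "txtvers=1" "polling=20" "sampling=400"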

The hsflowd daemon writes configuration information that it receives via DNS-SD or through its configuration file to the /etc/hsflowd.auto file. Other sFlow agents running on the host automatically detect changes to the /etc/hsflowd.auto file and apply the configuration settings.
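
The exact contents of /etc/hsflowd.auto vary with the hsflowd version, but the file is a simple set of key=value settings bracketed by revision markers that let the sub-agents detect complete updates; the sketch below is illustrative only:

# illustrative /etc/hsflowd.auto contents (keys and values are examples)
rev_start=1
sampling=400
polling=20
collector=10.0.0.1 6343
rev_end=1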

The sFlow protocol allows each agent to operate autonomously and send an independent stream of metrics to the sFlow collector. Distributing monitoring among the agents eliminates dependencies and synchronization challenges that would increase the complexity of the agents.

Each of the sFlow agents is responsible for exporting a different set of metrics:
  • hsflowd The Host sFlow daemon exports CPU, memory, disk and network IO performance metrics and can also export per virtual machine statistics when run on a Xen, XenServer or KVM hypervisor. In addition, traffic monitoring using iptables/LOG is supported on Linux platforms.
  • Open vSwitch The Open vSwitch is "the default switch in XenServer 6.0, the Xen Cloud Platform and also supports Xen, KVM, Proxmox VE and VirtualBox. It has also been integrated into many virtual management systems including OpenStack, openQRM, and OpenNebula." Enabling the built-in sFlow monitoring on the virtual switch offers the same visibility as sFlow on physical switches and provides a unified end-to-end view of network performance across the physical and virtual infrastructure. The sflowovsd daemon ships with Host sFlow and is used to configure sFlow monitoring on the Open vSwitch using the ovs-vsctl command (see the example following this list). Similar integrated sFlow support has also been demonstrated for the Microsoft extensible virtual switch that is part of the upcoming Windows 8 version of Hyper-V.
  • java The Java sFlow agent exports performance statistics about java threads, heap/non-heap memory, garbage collection, compilation and class loading.
  • httpd An sFlow agent embedded in the web server exports key performance metrics along with detailed transaction data that can be used to monitor top URLs, top Referers, top clients, response times etc. Currently, there are implementations of sFlow for the Apache, NGINX, Tomcat and node.js web servers, see http://host-sflow.sourceforge.net/relatedlinks.php.
  • Memcached An sFlow agent embedded in the Memcache server exports performance metrics along with detailed data on Memcache operations that can be used to monitor hot keys, missed keys, top clients etc. Currently, there is an implementation of sFlow for Memcached, see http://host-sflow.sourceforge.net/relatedlinks.php.
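
As referenced in the Open vSwitch item above, sFlow can also be enabled on a virtual switch manually with ovs-vsctl; in the sketch below the bridge name, agent interface and collector address are assumptions (sflowovsd applies equivalent settings automatically):

# enable sFlow on Open vSwitch bridge br0 (names and addresses are illustrative)
ovs-vsctl -- --id=@s create sflow agent=eth0 target=\"10.0.0.1:6343\" \
  sampling=400 polling=20 -- set bridge br0 sflow=@s
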
Together, the agents running on each host, along with the sFlow agents embedded within the network infrastructure, form an integrated monitoring system that provides a unified view of network, system, storage and application performance and can scale to hundreds of thousands of servers.

Sunday, January 1, 2012

Using Ganglia to monitor virtual machine pools


The Ganglia charts show virtual machine performance metrics collected using sFlow. Enabling sFlow monitoring on each node in a virtual machine pool provides a highly scalable solution for monitoring performance. Embedded sFlow monitoring in the hypervisors simplifies deployments by eliminating the need to poll for metrics. Instead, virtual machine metrics are pushed directly from each node to the central Ganglia collector. Currently sFlow agents are available for XCP (Xen Cloud Platform), Citrix XenServer and KVM/libvirt virtualization platforms, see http://host-sflow.sourceforge.net/. In addition, an sFlow agent has been demonstrated for the upcoming Windows 8 version of Hyper-V.

The article, Ganglia 3.2 released, describes the basic steps needed to configure Ganglia as an sFlow collector. Once configured, Ganglia will automatically discover and track new servers and virtual machines as they are added to the network.
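
A minimal sketch of the corresponding gmond.conf fragment, assuming the collector listens for sFlow on the standard UDP port 6343 (see the Ganglia 3.2 article for the complete set of options):

# gmond.conf: listen for sFlow datagrams on the standard sFlow port
udp_recv_channel {
  port = 6343
}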

Note: To try out Ganglia's sFlow/virtual machine reporting, you will need to download Ganglia 3.3.

By default, Ganglia will automatically start displaying the virtual machine metrics. However, there is an optional configuration setting available in the gmond.conf file that can be used to modify how Ganglia handles the sFlow virtual machine metrics.

sflow{
  accept_vm_metrics = yes
}

Setting the accept_vm_metrics flag to no will cause Ganglia to ignore sFlow virtualization metrics.

Ganglia and sFlow offer a comprehensive view of the performance of virtual machine pools, providing not just virtualization related metrics, but also the server CPU, memory, disk and network IO performance metrics needed to fully characterize pool performance.

Note: Visibility into network performance is an essential part of managing a virtual machine pool since virtual machines rely on the network for storage, local communication and Internet access. Network performance also affects other critical pool operations like virtual machine migration and backup. Support for the sFlow standard by most switch vendors delivers the necessary end to end network visibility and further simplifies management by including the network in an integrated monitoring solution.