Wednesday, January 11, 2012

Graphite


The Graphite realtime charting software provides a flexible way to store and plot time series data. Graphite was originally developed by Orbitz.com, but is now an open source project that is growing in popularity among system administrators. However, while Graphite is very good at recording metrics and building dashboards, it provides no native monitoring capabilities, relying instead on user-provided agents to make measurements.

The open source Host sFlow agent is an exact complement to Graphite, providing a lightweight, portable agent that exports standard metrics from a wide variety of systems. Deploying Host sFlow agents with a Graphite collector offers a complete, highly scalable monitoring solution.



The diagram shows the essential components of the solution. The Host sFlow agents continuously send metrics to the Graphite collector. The sflow2graphite script converts the binary sFlow data into Graphite's text-based messages and submits them to the Graphite server.

First, install sflowtool on the Graphite server. The sflow2graphite script requires sflowtool to receive binary sFlow datagrams and convert them into a text-based representation. Next, download the sflow2graphite.pl script from http://sflow2graphite.googlecode.com.

Next run the following command on the Graphite server:

./sflow2graphite.pl

Now, configure the Host sFlow agents to send sFlow to the Graphite server. Within minutes, metrics from the hosts should start appearing in Graphite.
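The conversion step can be sketched in Python. This is not the sflow2graphite.pl script itself, only a minimal illustration under a few assumptions: that sflowtool's default output is a stream of "key value" lines with an agent line identifying the sender, that Graphite is accepting its plaintext "path value timestamp" messages on TCP port 2003, and that the field names and the METRIC_NAMES table below (both illustrative) match your data.

```python
import socket

# Hypothetical table mapping sflowtool field names to Graphite metric names.
# The field names on the left are assumptions; check them against the
# output of sflowtool on your own system.
METRIC_NAMES = {
    "cpu_load_one": "load.load_one",
    "disk_bytes_read": "disk.bytes_read",
}


def to_graphite(lines, now, prefix="sflow"):
    """Turn sflowtool 'key value' lines into Graphite plaintext messages."""
    agent = None
    messages = []
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0], parts[1].strip()
        if key == "agent":
            # Dots separate levels in a Graphite path, so flatten the address.
            agent = value.replace(".", "_")
        elif key in METRIC_NAMES and agent is not None:
            path = "%s.%s.%s" % (prefix, agent, METRIC_NAMES[key])
            messages.append("%s %s %d" % (path, value, now))
    return messages


def send_to_graphite(messages, host="localhost", port=2003):
    """Submit messages to Graphite, one newline-terminated line each."""
    payload = "".join(m + "\n" for m in messages).encode("ascii")
    with socket.create_connection((host, port)) as conn:
        conn.sendall(payload)
```

In use, the lines would come from the standard output of a running sflowtool process, and the timestamp would be the current time in seconds.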

Note: If you are already collecting sFlow data with another tool, like Ganglia, then you may want to consider forwarding sFlow from the existing collector. Alternatively, you may want to extract the data from the existing collector and feed it into Graphite - in the case of Ganglia there are a number of solutions available.

It is worth noting that performance metrics may be gauges or counters. In the case of a gauge (like a system load average), Graphite will automatically display a useful chart. In the case of a counter metric (like total disk bytes read), Graphite will display the increasing total by default.
The screen capture above shows how you can apply a derivative function to the metric to convert a counter into a rate. The following chart trends disk bytes read/written as a bytes/second rate:
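The arithmetic behind converting a counter into a rate can be sketched as follows. This is an illustration of the idea rather than Graphite's own implementation; the function name and the choice to skip intervals where the counter goes backwards (for example after an agent restart) are assumptions made for the example.

```python
def counter_to_rate(samples):
    """Convert (timestamp_seconds, counter_value) samples to per-second rates.

    Returns a list of (timestamp, rate) pairs, one per usable interval.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        dv = v1 - v0
        if dt <= 0 or dv < 0:
            # Out-of-order timestamps or a counter reset: no meaningful rate.
            continue
        rates.append((t1, dv / dt))
    return rates
```

For instance, a disk bytes read counter that moves from 1000 to 5000 over 20 seconds corresponds to a rate of 200 bytes/second for that interval.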

Finally, Graphite uses a hierarchical naming convention for metrics. If the default metric names generated by the sflow2graphite script don't fit your naming hierarchy, simply edit the table of names in the sflow2graphite script to apply your own names.

Tuesday, January 10, 2012

Forwarding using sflowtool


The diagrams show two different configurations for sFlow monitoring:
  1. Without Forwarding: Each sFlow agent is configured to send sFlow to each of the analysis applications. This configuration is appropriate when a small number of applications is being used to continuously monitor performance. However, the overhead on the network and agents increases as additional analyzers are added. Often it is not possible to increase the number of analyzers since many embedded sFlow agents have limited resources and only support a small number of sFlow streams. In addition, the complexity of configuring each agent to add or remove an analysis application can be significant since agents may reside in Ethernet switches, routers, servers, hypervisors and applications on many different platforms from a variety of vendors.
  2. With Forwarding: In this case all the agents are configured to send sFlow to a forwarding module which resends the data to the analysis applications. Analyzers can then be added and removed simply by reconfiguring the forwarder, without any changes to the agent configurations.
There are many variations between these two extremes. Typically there will be one or two analyzers used for continuous monitoring, and additional tools, like Wireshark, might be deployed for troubleshooting when the continuous monitoring tools detect anomalies. The sFlow analyzer may include a built-in forwarding capability; however, if built-in forwarding is not available, there are alternatives.

A previous posting introduced the sflowtool command line utility. The following examples demonstrate how sflowtool can be used to replicate and forward sFlow streams.

The following command configures sflowtool to listen for sFlow on the well known port (UDP port 6343) and forward the sFlow to two analyzers: the first running on remote machine 10.0.0.111 and the second listening on port 7343 on the local host.

sflowtool -f 10.0.0.111 -f localhost/7343
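To make the replication step concrete, here is a minimal Python sketch of a UDP forwarder. It is not how sflowtool is implemented; the function names are hypothetical, and the "host/port" target syntax is simply borrowed from the command above, with the well known sFlow port assumed when no port is given.

```python
import socket

SFLOW_PORT = 6343  # well known sFlow UDP port


def parse_target(spec):
    """Parse 'host' or 'host/port' into a (host, port) tuple."""
    host, _, port = spec.partition("/")
    return (host, int(port) if port else SFLOW_PORT)


def replicate(datagram, targets, sock):
    """Send one received datagram, unmodified, to every target."""
    for target in targets:
        sock.sendto(datagram, target)


def forward(specs, listen_port=SFLOW_PORT):
    """Listen for sFlow datagrams and resend each one to all targets."""
    targets = [parse_target(spec) for spec in specs]
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", listen_port))
    while True:
        datagram, _sender = sock.recvfrom(65535)
        replicate(datagram, targets, sock)
```

Calling forward(["10.0.0.111", "localhost/7343"]) would mirror the command above. A forwarder like this changes the source address of the UDP packets, but the sFlow datagram carries the agent address in its payload, so analyzers can still tell the agents apart.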

If an sFlow analyzer is already running on the server then it will already be bound to the sFlow port and the above command will fail. However, you can still forward the sFlow using the tcpdump command to capture the sFlow datagrams and sflowtool to forward them:

tcpdump -p -s 0 -w - udp port 6343 \
| sflowtool -r - -f 10.0.0.111 -f localhost/7343

It is also possible to filter the sFlow data to pick out a particular agent. The following command selectively forwards sFlow coming from IP address 10.0.0.237:

tcpdump -p -s 0 -w - src host 10.0.0.237 and udp port 6343 \
| sflowtool -r - -f 10.0.0.111 -f localhost/7343

Rather than forwarding the sFlow, this technique can also be used to locally analyze the data. For example, suppose that Ganglia is being used to monitor the performance of a web farm. While Ganglia might show a large spike in HTTP requests, analysis using sflowtool offers additional details:

tcpdump -p -s 0 -w - udp port 6343 | sflowtool -r - -H
10.0.0.70 - - [03/Jan/2012:14:44:29 -0800] "GET http://www.google.com/ HTTP/1.1" 200 21605 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7"

The information about URLs, user agents, response times, status codes and bytes provides the additional detail needed to diagnose the performance problem, for example by identifying overloaded web servers, top URLs and the sources of the increased load.
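As a rough sketch of that kind of analysis, the following Python reads log lines in the format shown above and counts the most frequent URLs, clients and status codes. The regular expression is written against the sample line in this article and is an assumption about the general shape of the output; the function name is hypothetical.

```python
import re
from collections import Counter

# Pattern modelled on the sample log line shown above.
LOG_LINE = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def summarize(lines, top=3):
    """Return the most common URLs, clients and status codes."""
    urls, clients, statuses = Counter(), Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # ignore anything that is not a request line
        urls[match.group("url")] += 1
        clients[match.group("client")] += 1
        statuses[match.group("status")] += 1
    return {
        "urls": urls.most_common(top),
        "clients": clients.most_common(top),
        "statuses": statuses.most_common(top),
    }
```

The lines could be supplied by piping the output of the command above into a small script that passes its standard input to summarize.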

Note: See the sflowtool article for more examples of analyzing sFlow data using sflowtool.

Saturday, January 7, 2012

Host sFlow distributed agent


The sFlow standard uses a distributed push model in which each host autonomously sends a continuous stream of performance metrics to a central sFlow collector. The push model is highly scalable and is particularly suited to cloud environments.

The distributed architecture extends to the sFlow agents within a host. The following diagram shows how sFlow agents coordinate on each host to provide a comprehensive and consistent view of performance.


Typically, all hosts in the data center will share identical configuration settings. The Host sFlow daemon, hsflowd, provides a single point of configuration. By default hsflowd uses DNS Service Discovery (DNS-SD) to learn configuration settings. Alternatively, settings can be distributed using orchestration software such as Puppet or Chef to update each host's hsflowd configuration file, /etc/hsflowd.conf, and restart the daemon.

The hsflowd daemon writes configuration information that it receives via DNS-SD or through its configuration file to the /etc/hsflowd.auto file. Other sFlow agents running on the host automatically detect changes to the /etc/hsflowd.auto file and apply the configuration settings.
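A co-resident agent's side of this arrangement can be sketched in Python. The file layout assumed here, "key=value" lines bracketed by matching rev_start and rev_end revision markers, with collector lines of the form "collector=address port", is an assumption to verify against the /etc/hsflowd.auto file on your own hosts; the function name is hypothetical.

```python
DEFAULT_SFLOW_PORT = 6343


def parse_auto_config(text):
    """Parse the contents of an hsflowd.auto style file.

    Returns a dict of settings, or None if the file looks incomplete
    (revision markers missing or not matching).
    """
    settings = {}
    collectors = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key == "collector":
            parts = value.split()
            if parts:
                port = int(parts[1]) if len(parts) > 1 else DEFAULT_SFLOW_PORT
                collectors.append((parts[0], port))
        else:
            settings[key] = value
    rev = settings.get("rev_start")
    if rev is None or rev != settings.get("rev_end"):
        # The file may have been read while it was being rewritten.
        return None
    settings["collectors"] = collectors
    return settings
```

An agent could re-read the file whenever its modification time changes and apply the new settings only when the function returns a result.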

The sFlow protocol allows each agent to operate autonomously and send an independent stream of metrics to the sFlow collector. Distributing monitoring among the agents eliminates dependencies and synchronization challenges that would increase the complexity of the agents.

Each of the sFlow agents is responsible for exporting a different set of metrics:
  • hsflowd: The Host sFlow daemon exports CPU, memory, disk and network IO performance metrics and can also export per virtual machine statistics when run on a Xen, XenServer or KVM hypervisor. In addition, traffic monitoring using iptables/LOG is supported on Linux platforms.
  • Open vSwitch: The Open vSwitch is "the default switch in XenServer 6.0, the Xen Cloud Platform and also supports Xen, KVM, Proxmox VE and VirtualBox. It has also been integrated into many virtual management systems including OpenStack, openQRM, and OpenNebula." Enabling the built-in sFlow monitoring on the virtual switch offers the same visibility as sFlow on physical switches and provides a unified end-to-end view of network performance across the physical and virtual infrastructure. The sflowovsd daemon ships with Host sFlow and is used to configure sFlow monitoring on the Open vSwitch using the ovs-vsctl command. Similar integrated sFlow support has also been demonstrated for the Microsoft extensible virtual switch that is part of the upcoming Windows 8 version of Hyper-V.
  • java: The Java sFlow agent exports performance statistics about Java threads, heap/non-heap memory, garbage collection, compilation and class loading.
  • httpd: An sFlow agent embedded in the web server exports key performance metrics along with detailed transaction data that can be used to monitor top URLs, top Referers, top clients, response times etc. Currently, there are implementations of sFlow for Apache, NGINX, Tomcat and node.js web servers, see http://host-sflow.sourceforge.net/relatedlinks.php.
  • Memcached: An sFlow agent embedded in the Memcache server exports performance metrics along with detailed data on Memcache operations that can be used to monitor hot keys, missed keys, top clients etc. Currently, there is an implementation of sFlow for Memcached, see http://host-sflow.sourceforge.net/relatedlinks.php.
Together, the agents running on each host, along with sFlow agents embedded within the network infrastructure, form an integrated monitoring system that provides a unified view of network, system, storage and application performance and can scale to hundreds of thousands of servers.

Sunday, January 1, 2012

Using Ganglia to monitor virtual machine pools


The Ganglia charts show virtual machine performance metrics collected using sFlow. Enabling sFlow monitoring on each node in a virtual machine pool provides a highly scalable solution for monitoring performance. Embedded sFlow monitoring in the hypervisors simplifies deployments by eliminating the need to poll for metrics. Instead, virtual machine metrics are pushed directly from each node to the central Ganglia collector. Currently sFlow agents are available for XCP (Xen Cloud Platform), Citrix XenServer and KVM/libvirt virtualization platforms, see http://host-sflow.sourceforge.net/. In addition, an sFlow agent has been demonstrated for the upcoming Windows 8 version of Hyper-V.

The article, Ganglia 3.2 released, describes the basic steps needed to configure Ganglia as an sFlow collector. Once configured, Ganglia will automatically discover and track new servers and virtual machines as they are added to the network.

Note: To try out Ganglia's sFlow/virtual machine reporting, you will need to download and compile Ganglia from sources since the feature is currently in the development branch (see http://sourceforge.net/projects/ganglia/develop).

By default, Ganglia will automatically start displaying the virtual machine metrics. However, there is an optional configuration setting available in the gmond.conf file that can be used to modify how Ganglia handles the sFlow virtual machine metrics.

sflow{
  accept_vm_metrics = yes
}

Setting the accept_vm_metrics flag to no will cause Ganglia to ignore sFlow virtualization metrics.

Ganglia and sFlow offer a comprehensive view of the performance of virtual machine pools, providing not just virtualization related metrics, but also the server CPU, memory, disk and network IO performance metrics needed to fully characterize pool performance.

Note: Visibility into network performance is an essential part of managing a virtual machine pool since virtual machines rely on the network for storage, local communication and Internet access. Network performance also affects other critical pool operations like virtual machine migration and backup. Support for the sFlow standard by most switch vendors delivers the necessary end-to-end network visibility and further simplifies management by including the network in an integrated monitoring solution.

Friday, December 30, 2011

Using Ganglia to monitor Memcache clusters


The Ganglia charts show Memcache performance metrics collected using sFlow. Enabling sFlow monitoring in Memcache servers provides a highly scalable solution for monitoring the performance of large Memcache clusters. Embedded sFlow monitoring simplifies deployments by eliminating the need to poll for metrics. Instead, metrics are pushed directly from each Memcache server to the central Ganglia collector. Currently, there is an implementation of sFlow for Memcached, see http://host-sflow.sourceforge.net/relatedlinks.php.

The article, Ganglia 3.2 released, describes the basic steps needed to configure Ganglia as an sFlow collector. Once configured, Ganglia will automatically discover and track new Memcache servers as they are added to the network.

Note: To try out Ganglia's sFlow/Memcache reporting, you will need to download and compile Ganglia from sources since the feature is currently in the development branch (see http://sourceforge.net/projects/ganglia/develop).

By default, Ganglia will automatically start displaying the Memcache metrics. However, there are two optional configuration settings available in the gmond.conf file that can be used to modify how Ganglia handles the sFlow Memcache metrics.

sflow{
  accept_memcache_metrics = yes
  multiple_memcache_instances = no
}

Setting the accept_memcache_metrics flag to no will cause Ganglia to ignore sFlow Memcache metrics.

The multiple_memcache_instances setting must be set to yes in cases where there are multiple Memcache instances running on each server in the cluster. Each Memcache instance will be identified by the server port included in the title of the charts. For example, the following chart is reporting on the Memcache server listening on port 11211 on host ganglia:


Ganglia and sFlow offer a comprehensive view of the performance of a cluster of Memcache servers, providing not just Memcache related metrics, but also the server CPU, memory, disk and network IO performance metrics needed to fully characterize cluster performance.

Note: A Memcache sFlow agent does more than simply export performance counters; it also exports detailed data on Memcache operations that can be used to monitor hot keys, missed keys, top clients etc. The operation data complements the counter data displayed in Ganglia, helping to identify the root cause of problems. For example, in one case Ganglia showed that the Memcache miss rate was high, and an examination of the transactions identified a mistyped key in the application code as the root cause. In addition, Memcache performance is critically dependent on network latency and packet loss - here again, sFlow provides the necessary visibility since most switch vendors already include support for the sFlow standard.
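The hot key and missed key analysis mentioned above can be illustrated with a short Python sketch. The record layout, a (command, key, status) tuple per sampled operation with "NOT_FOUND" marking a miss, is hypothetical and chosen only for the example; it is not the format exported by any particular agent or tool.

```python
from collections import Counter


def key_report(operations, top=3):
    """Summarize sampled GET operations into hot keys and missed keys.

    operations: iterable of (command, key, status) tuples (hypothetical layout).
    """
    requested = Counter()
    missed = Counter()
    for command, key, status in operations:
        if command != "GET":
            continue
        requested[key] += 1
        if status == "NOT_FOUND":
            missed[key] += 1
    return {
        "hot_keys": requested.most_common(top),
        "missed_keys": missed.most_common(top),
    }
```

A key that appears near the top of both lists, such as a misspelled key that is requested often but never found, is the kind of problem the counter data alone would not reveal.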

Thursday, December 29, 2011

Using Ganglia to monitor Java virtual machines


The Ganglia charts show the standard sFlow Java virtual machine metrics. The combination of Ganglia and sFlow provides a highly scalable solution for monitoring the performance of clustered Java application servers. The sFlow Java agent for stand-alone Java services, or Tomcat sFlow for web-based servlets, simplifies deployments by eliminating the need to poll for metrics using a Java JMX client. Instead, metrics are pushed directly from each Java virtual machine to the central Ganglia collector.

Note: The Tomcat sFlow agent also allows Ganglia to report HTTP performance metrics.

The article, Ganglia 3.2 released, describes the basic steps needed to configure Ganglia as an sFlow collector. Once configured, Ganglia will automatically discover and track new servers as they are added to the network. The articles, Java virtual machine and Tomcat, describe the steps needed to instrument existing Java applications and Apache Tomcat servlet engines respectively. In both cases the sFlow agent is included when starting the Java virtual machine and requires minimal configuration and no change to the application code.

Note: To try out Ganglia's sFlow/Java reporting, you will need to download and compile Ganglia from sources since the feature is currently in the development branch (see http://sourceforge.net/projects/ganglia/develop).

By default, Ganglia will automatically start displaying the Java virtual machine metrics. However, there are two optional configuration settings available in the gmond.conf file that can be used to modify how Ganglia handles the sFlow Java metrics.

sflow{
  accept_jvm_metrics = yes
  multiple_jvm_instances = no
}

Setting the accept_jvm_metrics flag to no will cause Ganglia to ignore Java virtual machine metrics.

The multiple_jvm_instances setting must be set to yes in cases where there are multiple Java virtual machine instances running on each server in the cluster. Charts associated with each Java virtual machine instance will be identified by a unique "hostname" included in the title of its charts. For example, the following chart is identified as being associated with the apache-tomcat Java virtual machine on host xenvm4.sf.inmon.com:


Ganglia and sFlow offer a comprehensive view of the performance of a cluster of Java servers, providing not just Java related metrics, but also the server CPU, memory, disk and network IO performance metrics needed to fully characterize cluster performance.

Wednesday, December 28, 2011

Using Ganglia to monitor web farms


The Ganglia charts show HTTP performance metrics collected using sFlow. Enabling sFlow monitoring in web servers provides a highly scalable solution for monitoring the performance of large web farms. Embedded sFlow monitoring simplifies deployments by eliminating the need to poll for metrics or tail log files. Instead, metrics are pushed directly from each web server to the central Ganglia collector. Currently, there are implementations of sFlow for Apache, NGINX, Tomcat and node.js web servers, see http://host-sflow.sourceforge.net/relatedlinks.php.

The article, Ganglia 3.2 released, describes the basic steps needed to configure Ganglia as an sFlow collector. Once configured, Ganglia will automatically discover and track new web servers as they are added to the network.

Note: To try out Ganglia's sFlow/HTTP reporting, you will need to download and compile Ganglia from sources since the feature is currently in the development branch (see http://sourceforge.net/projects/ganglia/develop).

By default, Ganglia will automatically start displaying the HTTP metrics. However, there are two optional configuration settings available in the gmond.conf file that can be used to modify how Ganglia handles the sFlow HTTP metrics.

sflow{
  accept_http_metrics = yes
  multiple_http_instances = no
}

Setting the accept_http_metrics flag to no will cause Ganglia to ignore sFlow HTTP metrics.

The multiple_http_instances setting must be set to yes in cases where there are multiple HTTP instances running on each server in the cluster. Charts associated with each HTTP instance are identified by the server port included in the title of its charts. For example, the following chart is reporting on the web server listening on port 8080 on host xenvm4.sf.inmon.com:


Ganglia and sFlow provide a comprehensive view of the performance of a cluster of web servers, providing not just HTTP related metrics, but also the server CPU, memory, disk and network IO performance metrics needed to fully characterize cluster performance.

Note: An HTTP sFlow agent does more than simply export performance counters; it also exports detailed transaction data that can be used to monitor top URLs, top Referers, top clients, response times etc. The transaction data complements the counter data displayed in Ganglia, helping to identify the root cause of problems. For example, in one case Ganglia showed a sudden increase in HTTP requests, and an examination of the transactions demonstrated that the increase was a denial of service attack, identifying the targeted URL and the list of attacker IP addresses.