Monday, July 11, 2011

Ganglia and cloud performance

The Ganglia 3.2 release includes support for collecting and displaying server performance metrics sent using the sFlow standard. Ganglia has traditionally focused on monitoring clusters and grids; however, its scalability and automatic discovery capabilities also make it well suited to monitoring pools of virtual machines.

Visibility in the cloud discusses the different challenges of managing virtual machines hosted within a public cloud and of managing the cloud infrastructure itself. The article Rackspace cloudservers shows how Ganglia and sFlow can be used to monitor the performance of virtual machines hosted in a public cloud. This article examines how Ganglia and sFlow can be used by service providers and private cloud operators to monitor the performance of the cloud infrastructure.

Currently, sFlow agents are available for the XCP (Xen Cloud Platform), Citrix XenServer and KVM/libvirt virtualization platforms. When monitoring a hypervisor using sFlow, Ganglia will display the following hypervisor-specific metrics in addition to the familiar CPU, memory, disk and network statistics:


The Domain Count chart trends the number of virtual machines running on the server. The Hypervisor Free Memory chart shows how much free memory is available to run additional virtual machines.

sFlow also reports basic CPU, memory, disk I/O and network I/O statistics for every virtual machine running on the hypervisor, without the need to install agents on the virtual machines. However, these additional statistics are currently discarded by default: the Ganglia user interface expects every server to report a common set of metrics, and the limited data available from virtual machines would result in missing charts. In addition, sFlow uniquely identifies virtual machines by their UUID (Universally Unique Identifier), whereas Ganglia currently expects hosts to be identified by IP addresses and hostnames (which may not be known for virtual machines).

Ganglia 3.2 provides an experimental override, allowing the additional per virtual machine performance metrics to be collected. The following entries in the Ganglia gmond configuration file (/etc/gmond.conf) configure sFlow monitoring and enable the additional per virtual machine metrics:

globals {
/* Listen, but don't send metrics */
  mute = yes
  deaf = no
  ...
}

/* sFlow channel */
udp_recv_channel {
  port = 6343
}

/* Enable virtual machine statistics */
sflow {
  accept_vm_metrics = yes
}
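
Note: gmond needs to be restarted for the configuration change to take effect. Assuming the distribution's standard init script is used (an assumption about the platform; adjust to suit), something like:

service gmond restart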

Once enabled, each virtual machine will appear as a member of the cluster. Selecting a virtual machine displays its metrics:


Note: The virtual machine metrics reported by sFlow are consistent with libvirt.

Ganglia has great potential for monitoring virtual machine pools. The experimental support for virtual machine monitoring in Ganglia 3.2 provides a starting point, laying the foundation for further development.

Thursday, July 7, 2011

Ganglia 3.2 released

The open source Ganglia Monitoring System is widely used to monitor high-performance computing systems such as clusters and grids. The latest Ganglia 3.2 release includes native support for the sFlow standard. This article describes some of the benefits of using sFlow for cluster monitoring and shows how to configure Ganglia as an sFlow analyzer.



The diagram shows the elements of the solution. Each server sends sFlow to the Ganglia gmond process which builds an in-memory database containing the server statistics. The Ganglia gmetad process periodically queries the gmond database and updates trend charts that are made available through a web interface. The sFlow server performance data seamlessly integrates with Ganglia since the standard sFlow server metrics are based on Ganglia's core set of metrics.

Note: The metrics for all the servers in the cluster can be retrieved as an XML document by connecting to gmond (the default port is 8649). This API (used by gmetad) provides a simple way for performance monitoring tools that rely on a polling model to retrieve the sFlow metrics. For example, Vladimir Vuksan's article, Use your trending data for alerting, describes how Nagios can use this API to retrieve metrics.
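
For a quick look at the XML, the document can be dumped from the command line by connecting to the gmond TCP port; a minimal example, assuming gmond is running on the local host with the default port:

nc localhost 8649

gmond writes the current metrics for the cluster to the connection and then closes it, so the output can be piped straight into other scripts or tools.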

The sFlow solution is very similar to using Ganglia in a unicast configuration where gmond agents, installed on each server in the cluster, are configured to periodically send metrics to a central gmond instance that builds the database of cluster performance. The sFlow solution simply replaces the gmond agents in the cluster with sFlow agents.

Host sFlow is a free, open source sFlow agent implementation. The Host sFlow agent reports on the performance of physical and virtual servers and currently supports Linux, FreeBSD and Windows servers as well as the Citrix XenServer, XCP (Xen Cloud Platform) and KVM/libvirt virtualization platforms.
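
For completeness, pointing Host sFlow agents at the Ganglia collector is a matter of editing the Host sFlow configuration file, typically /etc/hsflowd.conf. The following is a minimal sketch; the collector address and the polling and sampling values are illustrative assumptions rather than recommendations, and the exact syntax may vary between Host sFlow releases:

sflow {
  DNSSD = off        # static configuration rather than DNS-SD discovery
  polling = 20       # export counters every 20 seconds
  sampling = 512     # packet sampling rate (only used if traffic monitoring is enabled)
  collector {
    ip = 10.0.0.22   # address of the server running gmond
    udpport = 6343   # standard sFlow port
  }
}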

Why use Host sFlow instead of gmond to monitor servers?

  • Lightweight: Eliminating collector functionality reduces the size and complexity of the Host sFlow agent. The reduced overhead is particularly important when monitoring resource-constrained environments like hypervisors.
  • Portable: The Host sFlow agent has minimal software dependencies and is easily ported to different platforms, including a native Windows implementation.
  • Efficient: The sFlow protocol efficiently packs all the standard Ganglia metrics in a single UDP datagram. Gmond requires over 32 datagrams to send the same information.
  • Standard: The standard metrics exported by Host sFlow agents allow performance monitoring tools to share data, eliminating the need for wasteful duplication of agents.
  • Metrics: The standard set of sFlow metrics includes the core Ganglia metrics as well as additional disk I/O, swap, interrupt and virtual machine statistics.

One of the strengths of Ganglia is the ability to easily add new metrics. While the Host sFlow agent doesn't support the addition of custom metrics, the Ganglia gmetric command line tool provides a simple way to add them. For example, the following command exports the number of users currently logged into a system:

/usr/bin/gmetric --name Current_Users --value `who |wc -l` --type int32 --unit current_users

Running the command periodically using crontab allows Ganglia to track the metric. For embedded applications, the embeddedgmetric project provides C/C++, Python, PHP, Perl and Java libraries for sending Ganglia metrics.
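
For example, a crontab entry along the following lines (the one-minute interval is simply an illustration) updates the metric every minute:

* * * * * /usr/bin/gmetric --name Current_Users --value `who |wc -l` --type int32 --unit current_users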

The following entries in the gmond configuration file (/etc/gmond.conf) configure gmond to run in collector-only mode, listening for sFlow data on UDP port 6343 (the standard sFlow port):

globals {
/* Listen, but don't send metrics */
  mute = yes
  deaf = no
  ...
}

/* channel to receive gmetric messages */
udp_recv_channel {
  port = 8649
}

/* channel to receive sFlow */
/* 6343 is the default sFlow port; an explicit sflow   */
/* section is only needed to override the default port */
udp_recv_channel {
  port = 6343
}

/* channel to service requests for XML data from gmetad */
tcp_accept_channel {
  port = 8649
}

Note: Delete all modules, collection_group and include sections from the gmond configuration file in this example and the following examples, since gmond is being used simply to collect sFlow metrics and doesn't need to load modules to generate metrics.

In the Ganglia architecture, each cluster is monitored by a separate gmond process. If more than one cluster is to be monitored, it is possible to run multiple gmond processes on a single server, each with its own configuration file. For example, if Host sFlow agents on the first cluster are sending to port 6343, then Host sFlow agents on the second cluster should be configured to send to a different port, say 6344. The following settings configure the second gmond to listen on the non-standard port:

globals {
/* Listen, but don't send metrics */
  mute = yes
  deaf = no
  ...
}

/* channel to receive gmetric messages */
udp_recv_channel {
  port = 8650
}

/* channel to receive sFlow */
udp_recv_channel {
  port = 6344
}

/* Change sFlow channel to non-standard port 6344 */
sflow {
  udp_port = 6344
}

/* channel to service requests for XML data from gmetad */
tcp_accept_channel {
  port = 8650
}
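
Each gmond instance also needs a corresponding data_source entry in the gmetad configuration file (/etc/gmetad.conf) so that gmetad polls both collectors. A sketch for the two-gmond-on-one-server case described above, with placeholder cluster names:

data_source "cluster 1" localhost:8649
data_source "cluster 2" localhost:8650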

Note: The non-standard port setting is only required if both gmond processes are running on a single server. If each cluster is monitored by a separate server then the Host sFlow agents on each cluster simply need to be configured to send to the collector for their cluster.

Another alternative is to assign multiple IP addresses to the server, one per cluster. In this case, Host sFlow agents in the first cluster will be configured to send to one address and agents in the second cluster to a different address. The following settings show how gmond can be configured to listen for sFlow on a specific IP address (e.g. 10.0.0.22):

globals {
/* Listen, but don't send metrics */
  mute = yes
  deaf = no
  ...
}

/* channel to receive sFlow */
udp_recv_channel {
  port = 6343
  bind = 10.0.0.22
}

The integration of network, system and application monitoring (see sFlow Host Structures) makes sFlow ideally suited for converged infrastructure monitoring. Using a single multi-vendor standard for both network and system performance monitoring reduces complexity and provides the integrated view of performance needed for effective management (see Management silos).