The open source Ganglia Monitoring System is widely used to monitor high-performance computing systems such as clusters and Grids. The latest Ganglia 3.2 release includes native support for the sFlow standard. This article describes some of the benefits of using sFlow for cluster monitoring and shows how to configure Ganglia as an sFlow analyzer.
The diagram shows the elements of the solution. Each server sends sFlow to the Ganglia gmond process, which builds an in-memory database containing the server statistics. The Ganglia gmetad process periodically queries the gmond database and updates trend charts that are made available through a web interface. The sFlow server performance data integrates seamlessly with Ganglia because the standard sFlow server metrics are based on Ganglia's core set of metrics.
Note: The metrics for all the servers in the cluster can be retrieved as an XML document by connecting to gmond (the default port is 8649). This API (used by gmetad) provides a simple way for performance monitoring tools that rely on a polling model to retrieve the sFlow metrics. For example, Vladimir Vuksan's article "Use your trending data for alerting" describes how Nagios can use this API to retrieve metrics.
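The polling model described in the note can be sketched in a few lines of Python. This is a minimal illustration, not the method used by gmetad or Nagios: the function names fetch_gmond_xml and parse_metrics are hypothetical, and the sketch assumes that gmond writes the whole XML document and then closes the connection, and that the document nests METRIC elements (with NAME and VAL attributes) inside HOST elements.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host="localhost", port=8649, timeout=5.0):
    """Read the XML dump from gmond's TCP port until the peer closes.

    Assumption: gmond sends the complete document and then closes the
    connection, so reading to end-of-stream yields the whole dump.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    # Return raw bytes so the XML parser can honor the declared encoding.
    return b"".join(chunks)


def parse_metrics(xml_data):
    """Return {host_name: {metric_name: value_string}} from a gmond dump."""
    root = ET.fromstring(xml_data)
    metrics = {}
    for host in root.iter("HOST"):
        metrics[host.get("NAME")] = {
            metric.get("NAME"): metric.get("VAL")
            for metric in host.iter("METRIC")
        }
    return metrics
```

Calling parse_metrics(fetch_gmond_xml("collector.example.com")) (the host name is a placeholder) would return one dictionary of metric values per server, which a polling tool could then compare against alert thresholds.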
The sFlow solution is very similar to using Ganglia in a unicast configuration where gmond agents, installed on each server in the cluster, are configured to periodically send metrics to a central gmond instance that builds the database of cluster performance. The sFlow solution simply replaces the gmond agents in the cluster with sFlow agents.
Host sFlow is a free, open source sFlow agent implementation. The Host sFlow agent reports on the performance of physical and virtual servers and currently supports Linux, FreeBSD and Windows servers as well as the Citrix XenServer, XCP (Xen Cloud Platform) and KVM/libvirt virtualization platforms.
Why use Host sFlow instead of gmond to monitor servers?
- Lightweight: Eliminating collector functionality reduces the size and complexity of the Host sFlow agent. The reduced overhead is particularly important when monitoring resource-constrained environments like hypervisors.
- Portable: The Host sFlow agent has minimal software dependencies and is easily ported to different platforms, including a native Windows implementation.
- Efficient: The sFlow protocol efficiently packs all the standard Ganglia metrics in a single UDP datagram. Gmond requires over 32 datagrams to send the same information.
- Standard: The standard metrics exported by Host sFlow agents allow performance monitoring tools to share data, eliminating the need for wasteful duplication of agents.
- Metrics: The standard set of sFlow metrics includes the core Ganglia metrics as well as additional disk I/O, swap, interrupt and virtual machine statistics.
One of the strengths of Ganglia is the ability to easily add new metrics. While the Host sFlow agent doesn't support the addition of custom metrics, the Ganglia gmetric command line tool provides a simple way to add custom metrics. For example, the following command exports the number of users currently logged into a system:
/usr/bin/gmetric --name Current_Users --value `who |wc -l` --type int32 --unit current_users
Running the command periodically using crontab allows Ganglia to track the metric. For embedded applications, the embeddedgmetric project provides C/C++, Python, PHP, Perl and Java libraries for sending Ganglia metrics.
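Where a shell one-liner becomes awkward to maintain, the same gmetric call could be wrapped in a small script and scheduled from cron. The following is a sketch only: the helper names are hypothetical, the flags are copied from the command shown above rather than from the gmetric documentation, and the path /usr/bin/gmetric is the one used in that command.

```python
import subprocess


def gmetric_command(name, value, metric_type="int32", unit="",
                    gmetric="/usr/bin/gmetric"):
    """Build the gmetric argument list using the flags from the article."""
    return [gmetric, "--name", name, "--value", str(value),
            "--type", metric_type, "--unit", unit]


def count_logged_in_users():
    """Count the lines printed by `who`, equivalent to `who | wc -l`."""
    result = subprocess.run(["who"], capture_output=True, text=True, check=True)
    return len(result.stdout.splitlines())


def publish_current_users():
    """Publish the current user count once; raises if gmetric fails."""
    command = gmetric_command("Current_Users", count_logged_in_users(),
                              unit="current_users")
    subprocess.run(command, check=True)
```

A cron job would call publish_current_users() at the desired interval. Passing the arguments as a list avoids shell quoting problems if a metric name or unit ever contains spaces.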
The following entries in the gmond configuration file (/etc/gmond.conf) configure gmond to run in collector-only mode, listening for sFlow data on UDP port 6343 (the standard sFlow port):
globals {
/* Listen, but don't send metrics */
mute = yes
deaf = no
...
}
/* channel to receive gmetric messages */
udp_recv_channel {
port = 8649
}
/* channel to receive sFlow */
/* 6343 is the default sFlow port, an explicit sFlow */
/* configuration section is needed to override default */
udp_recv_channel {
port = 6343
}
/* channel to service requests for XML data from gmetad */
tcp_accept_channel {
port = 8649
}
Note: Delete all modules, collection_group and include sections from the gmond configuration file in this example and in the following examples, since gmond is being used simply to collect sFlow metrics and doesn't need to load modules to generate metrics.
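Before starting gmond, it can be useful to confirm that Host sFlow agents are actually sending datagrams to the collector port. The sketch below is a diagnostic aid and not part of Ganglia: it assumes the sFlow version 5 datagram layout (version, agent address type, agent address, sub-agent ID, sequence number, uptime, sample count, all big-endian), and the function names are hypothetical. It must run while gmond is stopped, since both would otherwise compete for the same UDP port.

```python
import ipaddress
import socket
import struct


def parse_sflow_header(datagram):
    """Decode the fixed header of an sFlow version 5 datagram."""
    version, addr_type = struct.unpack_from("!II", datagram, 0)
    if version != 5:
        raise ValueError("unsupported sFlow version: %d" % version)
    # Address type 1 is IPv4 (4 bytes), type 2 is IPv6 (16 bytes).
    addr_len = {1: 4, 2: 16}.get(addr_type)
    if addr_len is None:
        raise ValueError("unknown agent address type: %d" % addr_type)
    agent = ipaddress.ip_address(datagram[8:8 + addr_len])
    sub_agent, sequence, uptime_ms, num_samples = struct.unpack_from(
        "!IIII", datagram, 8 + addr_len)
    return {
        "agent": str(agent),
        "sub_agent": sub_agent,
        "sequence": sequence,
        "uptime_ms": uptime_ms,
        "num_samples": num_samples,
    }


def listen_once(bind_addr="0.0.0.0", port=6343, timeout=30.0):
    """Wait for one datagram on the sFlow port and return (sender, header)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.bind((bind_addr, port))
        datagram, sender = sock.recvfrom(65535)
    return sender, parse_sflow_header(datagram)
```

If listen_once() times out, the agents are probably sending to a different address or port, or a firewall is dropping the traffic.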
In the Ganglia architecture, each cluster is monitored by a separate gmond process. If more than one cluster is to be monitored, then it is possible to run multiple gmond processes on a single server, each with its own configuration file. For example, if Host sFlow agents on the first cluster are sending to port 6343, then Host sFlow agents on the second cluster should be configured to send to a different port, say 6344. The following settings will configure the second gmond to listen on the non-standard port.
globals {
/* Listen, but don't send metrics */
mute = yes
deaf = no
...
}
/* channel to receive gmetric messages */
udp_recv_channel {
port = 8650
}
/* channel to receive sFlow */
udp_recv_channel {
port = 6344
}
/* Change sFlow channel to non-standard port 6344 */
sflow {
udp_port = 6344
}
/* channel to service requests for XML data from gmetad */
tcp_accept_channel {
port = 8650
}
Note: The non-standard port setting is only required if both gmond processes are running on a single server. If each cluster is monitored by a separate server then the Host sFlow agents on each cluster simply need to be configured to send to the collector for their cluster.
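When several clusters share one collector, the per-cluster configuration files differ only in their port numbers, so they can be generated from a template. The following sketch renders only the settings shown in the examples above; the function name render_collector_conf is hypothetical, and the other globals settings that the examples elide with "..." are deliberately left out and would still need to be added by hand. Following the comment in the first example, the sflow section is emitted only when the port differs from the default 6343.

```python
SFLOW_DEFAULT_PORT = 6343


def render_collector_conf(gmond_port, sflow_port=SFLOW_DEFAULT_PORT):
    """Render a collector-only gmond configuration for one cluster."""
    parts = [
        "globals {\n"
        "  /* Listen, but don't send metrics */\n"
        "  mute = yes\n"
        "  deaf = no\n"
        "}",
        "/* channel to receive gmetric messages */\n"
        "udp_recv_channel {\n"
        "  port = %d\n"
        "}" % gmond_port,
        "/* channel to receive sFlow */\n"
        "udp_recv_channel {\n"
        "  port = %d\n"
        "}" % sflow_port,
    ]
    if sflow_port != SFLOW_DEFAULT_PORT:
        # An explicit sflow section is only needed to override the default.
        parts.append(
            "/* Change sFlow channel to non-standard port */\n"
            "sflow {\n"
            "  udp_port = %d\n"
            "}" % sflow_port)
    parts.append(
        "/* channel to service requests for XML data from gmetad */\n"
        "tcp_accept_channel {\n"
        "  port = %d\n"
        "}" % gmond_port)
    return "\n".join(parts) + "\n"
```

For the two clusters in the example, render_collector_conf(8649) and render_collector_conf(8650, 6344) would produce the bodies of the first and second configuration files respectively.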
Another alternative is to assign multiple IP addresses to the server, one per cluster. In this case Host sFlow agents in the first cluster will be configured to send to one address and agents in the second cluster to a different address. The following settings show how gmond can be configured to listen for sFlow on a specific IP address (e.g. 10.0.0.22):
globals {
/* Listen, but don't send metrics */
mute = yes
deaf = no
...
}
/* channel to receive sFlow */
udp_recv_channel {
port = 6343
bind = 10.0.0.22
}
The integration of network, system and application monitoring (see sFlow Host Structures) makes sFlow ideally suited for converged infrastructure monitoring. Using a single multi-vendor standard for both network and system performance monitoring reduces complexity and provides the integrated view of performance needed for effective management (see Management silos).