Thursday, April 19, 2012

Hadoop


Hadoop is open source software for storing and processing large data sets using a cluster of commodity servers. This article shows how sFlow agents can be installed to monitor the performance of a Hadoop cluster.

Note: This example is based on the commercial Hadoop distribution from Cloudera. The virtual machine downloads from Cloudera are a convenient way to experiment with Hadoop, providing a fully functional Hadoop system on a single virtual machine.

First, install Host sFlow on each node in the Hadoop cluster, see Installing Host sFlow on a Linux server. The Host sFlow agents export standard CPU, memory, disk and network statistics from the servers. In this case, the Host sFlow agent was installed on the Cloudera virtual machine.
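As a minimal sketch, a static /etc/hsflowd.conf pointing the agents at an sFlow analyzer might look like the following; the collector address 10.0.0.50 is an assumption for illustration, and DNS-SD can be used instead to distribute settings automatically:
# /etc/hsflowd.conf - minimal static configuration (collector address is an example)
sflow {
  DNSSD = off       # disable DNS-SD and use the static settings below
  polling = 20      # export counters every 20 seconds
  sampling = 512    # packet sampling rate (1-in-512)
  collector {
    ip = 10.0.0.50  # sFlow analyzer address - replace with your collector
    udpport = 6343  # standard sFlow port
  }
}
The same file can be copied to every node in the cluster, since the settings are not host specific.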

Since Hadoop is implemented in Java, the next step involves installing and configuring the sFlow Java agent, see Java virtual machine. In this case, the Hadoop Java virtual machines are configured in the /etc/hadoop/conf/hadoop-env.sh file. The following fragment from hadoop-env.sh highlights the additional Java command line options needed to enable sFlow monitoring of each Hadoop daemon:
# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-javaagent:/etc/hadoop/conf/sflowagent.jar -Dsflow.hostname=hadoop.namenode -Dsflow.dsindex=50070 -Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-javaagent:/etc/hadoop/conf/sflowagent.jar -Dsflow.hostname=hadoop.secondarynamenode -Dsflow.dsindex=50090 -Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-javaagent:/etc/hadoop/conf/sflowagent.jar -Dsflow.hostname=hadoop.datanode -Dsflow.dsindex=50075 -Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS"
export HADOOP_BALANCER_OPTS="-javaagent:/etc/hadoop/conf/sflowagent.jar -Dsflow.hostname=hadoop.balancer -Dsflow.dsindex=50060 -Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS"
export HADOOP_JOBTRACKER_OPTS="-javaagent:/etc/hadoop/conf/sflowagent.jar -Dsflow.hostname=hadoop.jobtracker -Dsflow.dsindex=50030 -Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS"
Note: The sflowagent.jar file was placed in the /etc/hadoop/conf directory. Each daemon was given a descriptive sflow.hostname value corresponding to the daemon name. In addition, a unique sflow.dsindex value must be assigned to each daemon. The index values only need to be unique within a single server, so all the servers in the cluster can share the same configuration settings. In this case the port that each daemon listens on is used as the index.

The Host sFlow and Java sFlow agents share configuration (see Host sFlow distributed agent), simplifying the task of configuring sFlow monitoring within each server and across the cluster.
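A quick way to verify that metrics are being exported is to run sflowtool on the machine receiving the sFlow datagrams; the port number below assumes the standard sFlow port:
# decode incoming sFlow datagrams on the standard port (6343 assumed)
sflowtool -p 6343
Each Hadoop daemon should appear as a separate data source, identified by the sflow.dsindex values configured above.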

Note: If you are using Ganglia to monitor the performance of the cluster, you will need to set multiple_jvm_instances = yes, since there is more than one Java daemon running on each server, see Using Ganglia to monitor Java virtual machines. A sketch of the relevant settings follows.
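The fragment below assumes gmond is acting as the sFlow collector on the default port; the accept_jvm_metrics setting is included on the assumption that JVM metrics are wanted, adjust for your deployment:
# gmond.conf fragment - sflow settings (a sketch, not a complete configuration)
sflow {
  udp_port = 6343               # listen for sFlow on the standard port
  accept_jvm_metrics = yes      # collect Java virtual machine metrics
  multiple_jvm_instances = yes  # more than one JVM per server (the Hadoop daemons)
}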

Finally, Hadoop is sensitive to network congestion and generates large traffic volumes. Fortunately, most switch vendors support the sFlow standard. Combining sFlow from the Hadoop servers with sFlow from the switches offers comprehensive visibility into the performance of a Hadoop cluster.