Saturday, May 12, 2012

Scripting languages

The Host sFlow distributed agent article describes how sFlow agents, embedded in applications such as Apache, NGINX, Tomcat, Memcached and Java, coordinate with the Host sFlow daemon (hsflowd) in order to monitor server and application performance.

Implementing sFlow monitoring natively in widely used applications is worth the effort, since the result is a highly scalable, easily deployed solution with minimal impact on performance. However, this approach is overkill for monitoring application logic implemented in scripting languages such as PHP, Python, Ruby or Perl.

Recently, a simple JSON API was added to hsflowd that makes it easy to integrate performance monitoring into scripting languages. The API is an implementation of the sFlow Application Structures draft, which defines a generalized set of application layer performance metrics.

Note: Even if you instrument your scripted application logic, you should still enable sFlow in the web server, since the HTTP information it exports complements the application metrics and provides a more complete picture of performance; see HTTP.

Currently the API is only implemented in the Linux trunk. To try it out, you will need to check out the trunk and build hsflowd from sources:

svn co host-sflow
cd host-sflow
make install
make schedule

Note: See Installing Host sFlow on a Linux server for additional configuration information. The JSON API will be included in the upcoming Host sFlow 1.21 release.

Uncommenting the following entry in the /etc/hsflowd.conf file opens a UDP port to receive JSON encoded metrics:

jsonPort = 36343

Note: It is recommended that you keep the default port, 36343, since changing the port number requires a corresponding change in every script that sends JSON messages. The jsonPort value is written into the /etc/ file so that it is available to client scripts, along with other sFlow settings; see Host sFlow distributed agent. The hsflowd daemon only accepts messages generated on the same host, which shouldn't be a problem, since hsflowd should be installed on every host to report host performance metrics.
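Rather than hard-coding 36343, a PHP script could read the port from that generated file. The following is a minimal sketch, not part of hsflowd: the function names are hypothetical, the caller supplies the file path, and it assumes the file holds one key=value setting per line (check the file on your system). It falls back to the default port when the entry is missing or the file can't be read.

```php
<?php
// Hypothetical helper: find jsonPort in hsflowd's generated settings text.
// ASSUMPTION: one "key=value" setting per line; falls back to 36343.
function hsflowd_json_port_from_text($text, $default = 36343) {
  foreach (preg_split('/\r?\n/', $text) as $line) {
    $line = trim($line);
    if ($line === '' || $line[0] === '#') { continue; }   // skip blanks/comments
    $parts = explode('=', $line, 2);
    if (count($parts) === 2 && trim($parts[0]) === 'jsonPort') {
      $port = (int) trim($parts[1]);
      if ($port > 0 && $port < 65536) { return $port; }  // ignore bogus values
    }
  }
  return $default;
}

// Read the file (path supplied by the caller) and parse it;
// an unreadable or missing file also yields the default port.
function hsflowd_json_port($path, $default = 36343) {
  $text = @file_get_contents($path);
  return ($text === false) ? $default : hsflowd_json_port_from_text($text, $default);
}
```

The returned port can then be passed to fsockopen in place of the literal 36343.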

The following is an example of the type of JSON message that a script can send to describe the outcome of a transaction:


Note: The names of the structures and attributes in the JSON message mirror the structures defined in sFlow Application Structures.

Many of the structures and attributes are optional; a message can be as simple as:


Constructing and sending JSON messages as UDP datagrams to hsflowd is straightforward. For example, the following PHP app_operation function formats and sends the previous JSON message.

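function app_operation($app_name,
                       $op_name,
                       $sampling_rate=1) {
  // transaction sampling: on average, only 1-in-$sampling_rate calls report
  if($sampling_rate > 1) {
    if(mt_rand(1,$sampling_rate) != 1) { return; }
  }

  try {
     $sock = fsockopen("udp://localhost",36343,$errno,$errstr);
     if(! $sock) { return; }
     // minimal message; field names below "flow_sample" are assumed to
     // follow the sFlow Application Structures naming
     fwrite($sock, json_encode(array("flow_sample" => array(
       "app_name" => $app_name,
       "sampling_rate" => $sampling_rate,
       "app_operation" => array("operation" => $op_name)))));
     fclose($sock);
  } catch(Exception $e) {}
}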

Note: This function supports transaction sampling, i.e. if you set a sampling_rate of 10 then there is a 1-in-10 chance that the operation will actually generate a measurement. Sampling allows you to reduce the measurement overhead in high transaction rate environments and still generate useful results. Set a sampling_rate that reduces the impact of monitoring on your application to acceptable levels, although you probably don't need to sample at all unless you are handling hundreds of operations per second. Choose a fixed sampling_rate, and make it the lowest setting that still protects application performance: as you will see later, a low sampling_rate allows hsflowd to maintain more accurate counters and gives it a greater range of sampling rates when it exports the data.
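To put rough numbers on the accuracy trade-off, the following sketch uses the standard binomial approximation for random 1-in-N sampling; this is a textbook estimate, not something hsflowd reports, and the function name is hypothetical. Each of ops operations is kept with probability 1/rate and counted as rate, so the estimated count has variance ops * (rate - 1), and 1.96 standard deviations cover roughly 95% of outcomes.

```php
<?php
// Approximate 95% relative error (percent) of a counter estimated from
// 1-in-$rate random sampling of $ops operations (binomial approximation).
// Variance of the scaled estimate is $ops * ($rate - 1); with no sampling
// ($rate == 1) the count is exact and the error is zero.
function sampled_count_error_pct($ops, $rate) {
  if ($ops <= 0) { return 0.0; }
  return 196.0 * sqrt(($rate - 1) / $ops);
}
```

For 10,000 operations, sampling 1-in-101 gives a bound of about 19.6%, while 1-in-11 gives about 6.2%, which is why the lowest workable sampling_rate is preferable.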

Including the app_operation function in your PHP application library makes instrumenting PHP application logic as simple as including a single line of code in a PHP rendered web page:

<?php app_operation("myapp","user.friend"); ?>

When hsflowd receives a JSON message, it increments per-application performance counters (scaled by the sampling rate if needed). The counters are periodically exported along with the other sFlow metrics that hsflowd exports (CPU, memory, disk and network I/O).
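The scaling step can be illustrated with a small PHP class. This is only a sketch of the arithmetic, not hsflowd's implementation (hsflowd itself is written in C): the class name is hypothetical, and the status keys simply reuse the field names that sflowtool prints. A message sampled 1-in-10 stands for ten operations, so it adds 10 to the counter.

```php
<?php
// Hypothetical sketch of per-application counters scaled by sampling rate.
// Each received measurement represents $sampling_rate operations.
class AppCounters {
  private $counters = array();   // app name => (status key => count)

  public function record($app, $status, $sampling_rate = 1) {
    $rate = max(1, (int) $sampling_rate);      // unsampled messages count once
    if (!isset($this->counters[$app][$status])) {
      $this->counters[$app][$status] = 0;      // nested arrays auto-created
    }
    $this->counters[$app][$status] += $rate;
  }

  public function get($app, $status) {
    return isset($this->counters[$app][$status]) ? $this->counters[$app][$status] : 0;
  }
}
```

Two messages sampled 1-in-10 and one unsampled timeout would leave status_OK at 20 and errors_TIMEOUT at 1.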

The following output from sflowtool shows the contents of an sFlow datagram containing application counters:

startDatagram =================================
datagramSize 112
unixSecondsUTC 1336846918
datagramVersion 5
agentSubId 100000
packetSequenceNo 2670
sysUpTime 77514000
samplesInPacket 1
startSample ----------------------
sampleType_tag 0:2
sampleSequenceNo 4
sourceId 3:150002
counterBlock_tag 0:2202
application myapp
status_OK 23
errors_OTHER 0
errors_TIMEOUT 0
errors_BAD_REQUEST 0
errors_FORBIDDEN 0
errors_TOO_LARGE 0
errors_NOT_FOUND 0
endSample   ----------------------
endDatagram   =================================

The application counters can be sent to tools like Ganglia and Graphite in order to trend the performance of individual application instances, or of whole clusters of applications.
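As an illustration, the following hypothetical PHP function turns sflowtool's text output for counter samples (sampleType_tag 0:2) into lines in Graphite's plaintext format, "path value timestamp". It assumes the line layout shown above, and the apps.application.counter naming scheme is an arbitrary choice for this sketch.

```php
<?php
// Hypothetical converter from sflowtool text output to Graphite plaintext
// lines ("<path> <value> <timestamp>"). Only counter samples
// (sampleType_tag 0:2) are converted; numeric fields that follow the
// "application" line are treated as that application's counters.
function sflowtool_to_graphite($text, $prefix = 'apps') {
  $out = array();
  $ts = time();           // fallback if no unixSecondsUTC line is present
  $in_counters = false;   // inside a counter sample?
  $app = null;            // application named in the current sample
  foreach (preg_split('/\r?\n/', $text) as $line) {
    $parts = preg_split('/\s+/', trim($line), 2);
    if (count($parts) !== 2) { continue; }
    list($key, $value) = $parts;
    if ($key === 'unixSecondsUTC') {
      $ts = (int) $value;
    } elseif ($key === 'sampleType_tag') {
      $in_counters = ($value === '0:2');
    } elseif ($key === 'endSample') {
      $in_counters = false;
      $app = null;
    } elseif ($in_counters && $key === 'application') {
      $app = $value;
    } elseif ($in_counters && $app !== null && preg_match('/^\d+$/', $value)) {
      $out[] = "$prefix.$app.$key $value $ts";
    }
  }
  return $out;
}
```

Each returned line could then be written to Carbon's plaintext listener, which accepts TCP connections on port 2003 by default.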

The transactions are also sampled by hsflowd, based on the sampling setting in hsflowd.conf or DNS-SD. Specific sampling rates can be set by application name. For example, to override the default sampling rate of 400 and apply a sampling rate of 1-in-100 to myapp, use the following setting:

Note: If the transactions were sampled before being sent to hsflowd, then they will be sub-sampled to achieve the target sampling rate. For example, if the script used a sampling rate of 1-in-10, then hsflowd would apply a 1-in-10 sampling operation in order to achieve the desired 1-in-100 sampling rate.
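The arithmetic of the two sampling stages can be sketched as follows. The function names are hypothetical, and rounding down when the target rate is not an exact multiple of the script's rate is an assumption of this sketch; hsflowd may handle that case differently.

```php
<?php
// Hypothetical sketch of sub-sampling: if the script already sampled
// 1-in-$script_rate, a second 1-in-$sub_rate stage yields an overall rate
// of $script_rate * $sub_rate. ASSUMPTION: non-multiples are rounded down.
function subsample_rate($script_rate, $target_rate) {
  $script_rate = max(1, (int) $script_rate);
  return max(1, intdiv((int) $target_rate, $script_rate));
}

// Decide whether an already-sampled transaction survives the second stage.
function keep_subsample($sub_rate) {
  return $sub_rate <= 1 || mt_rand(1, $sub_rate) === 1;
}
```

With these definitions, subsample_rate(10, 100) returns 10, matching the example in the note, and a script that already samples more coarsely than the target (say 1-in-200 against 1-in-100) gets no further sampling.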

The following output from sflowtool shows the contents of an sFlow datagram containing an application transaction sample:

startDatagram =================================
datagramSize 136
unixSecondsUTC 1336846925
datagramVersion 5
agentSubId 100000
packetSequenceNo 2671
sysUpTime 77521000
samplesInPacket 1
startSample ----------------------
sampleType_tag 0:1
sampleSequenceNo 1
sourceId 3:150002
meanSkipCount 10
samplePool 10
dropEvents 0
inputPort 0
outputPort 1073741823
flowBlock_tag 0:2202
flowSampleType applicationOperation
application myapp
operation user.friend
request_bytes 0
response_bytes 0
status SUCCESS
duration_uS 0
endSample   ----------------------
endDatagram   =================================

The transaction samples provide details that complement the counter samples. For example, if you were to see a rise in the errors_TIMEOUT rate, you could look at the transaction samples and determine the operations associated with the timeouts.
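One quick way to find the operations behind a rise in timeouts is to tally transaction samples by operation for a given status. The sketch below parses sflowtool's text output in the layout shown above; the function name is hypothetical, and the TIMEOUT status string is an assumption (only SUCCESS appears in the output above). It counts samples, so multiply by the sampling rate (meanSkipCount) to estimate total operations.

```php
<?php
// Hypothetical helper: count transaction samples per operation whose
// "status" field equals $status, from sflowtool text output.
function operations_by_status($text, $status) {
  $counts = array();
  $sample = null;                    // fields of the current sample
  foreach (preg_split('/\r?\n/', $text) as $line) {
    $parts = preg_split('/\s+/', trim($line), 2);
    if (count($parts) !== 2) { continue; }
    list($key, $value) = $parts;
    if ($key === 'startSample') {
      $sample = array();
    } elseif ($key === 'endSample') {
      // counter samples lack "operation"/"status" keys and are skipped
      if ($sample !== null && isset($sample['operation'], $sample['status'])
          && $sample['status'] === $status) {
        $op = $sample['operation'];
        $counts[$op] = (isset($counts[$op]) ? $counts[$op] : 0) + 1;
      }
      $sample = null;
    } elseif ($sample !== null) {
      $sample[$key] = $value;
    }
  }
  arsort($counts);                   // most frequent operation first
  return $counts;
}
```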

Note: Anyone familiar with Etsy's StatsD tool will see a similarity in the way sFlow monitoring is embedded in scripts; see Measure Anything, Measure Everything. The main difference is that sFlow application measurements contain additional structure that allows them to be part of a large-scale monitoring system linking network switches, hosts and applications together. In addition, sFlow's inclusion of sampled transaction records allows metrics to be broken out into fine detail, making it possible to see how application instances interact and to get to the root cause of performance problems.

Finally, the application metrics extension to the sFlow standard and the implementation in hsflowd are still in the early stages. Please try them out and provide feedback. Any comments or suggestions regarding the sFlow metrics should be directed to the mailing list and comments or questions relating to the scripting API should be directed to the host-sflow mailing list.
