Monday, October 10, 2016

Collecting Docker Swarm service metrics

This article demonstrates how to monitor dynamic Docker Swarm deployments and track service performance metrics using existing on-premises and cloud monitoring tools such as Ganglia, Graphite, InfluxDB, Grafana, SignalFX, and Librato.

In this example, Docker Swarm is used to deploy a simple web service on a four-node cluster:
docker service create --replicas 2 -p 80:80 --name apache httpd:2.4
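The standard Docker CLI can be used to confirm the deployment before starting the test, for example:
docker service ls
docker service ps apache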
Next, the following script tests the agility of monitoring systems by constantly changing the number of replicas in the service:
#!/bin/bash
# Scale the apache service to a random size between 1 and 20 replicas every 30 seconds
while true
do
  docker service scale apache=$(( ( RANDOM % 20 ) + 1 ))
  sleep 30
done
The above test is easy to set up and is a quick way to stress-test monitoring systems, revealing accuracy and performance problems when they are confronted with container workloads.

Many approaches to gathering and recording metrics were developed for static environments and have a great deal of difficulty tracking rapidly changing container-based service pools without missing information, leaking resources, and slowing down. For example, each new container in Docker Swarm has a unique name, e.g. apache.16.17w67u9157wlri7trd854x6q0. Monitoring solutions that record container names, or, even worse, index data by container name, will suffer from bloated databases and the resulting slow queries.
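
As a hypothetical illustration (the metric values are invented), compare an InfluxDB line protocol point tagged with the ephemeral container name against one tagged with the stable service name:
vir_bytes_in,container=apache.16.17w67u9157wlri7trd854x6q0 value=52000
vir_bytes_in,service=apache value=52000
The first form creates a new time series for every replica that is ever scheduled, while the second keeps the series count bounded by the number of services; the swarmmetrics.js script later in this article follows the second pattern.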

The solution is to insert a stream processing analytics stage in the metrics pipeline that delivers a consistent set of service level metrics to existing tools.
The asynchronous metrics export method implemented in the open source Host sFlow agent is part of the solution, sending a real-time telemetry stream to a centralized sFlow collector, which can then deliver a comprehensive view of all services deployed on the Docker Swarm cluster.
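
For reference, here is a minimal sketch of an /etc/hsflowd.conf for a Host sFlow 2.x agent running on each Swarm node; the collector address 10.0.0.162 is an assumption and should be replaced with the address of the host receiving the sFlow telemetry:
sflow {
  # send sFlow telemetry to the collector on UDP port 6343
  collector { ip=10.0.0.162 udpport=6343 }
  # export per-container metrics from the local Docker daemon
  docker { }
}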

The sFlow-RT real-time analytics engine completes the solution by converting the detailed per instance metrics into service level statistics which are in turn streamed to a time series database where they drive operational dashboards.

For example, the following swarmmetrics.js script computes cluster and service level metrics and exports them to InfluxDB:
// Docker Swarm manager REST API endpoint and TLS client certificates
var docker = "https://10.0.0.134:2376/services";
var certs = '/tls/';

// InfluxDB line protocol write endpoint
var influxdb = "http://10.0.0.50:8086/write?db=docker";

// Cluster-wide metrics, summarized across all Swarm nodes
var clustermetrics = [
  'avg:load_one',
  'max:cpu_steal',
  'sum:node_domains'
];

// Per-service metrics, summarized across a service's container instances
var servicemetrics = [
  'avg:vir_cpu_utilization',
  'avg:vir_bytes_in',
  'avg:vir_bytes_out'
];

// POST an array of InfluxDB line protocol strings
function sendToInfluxDB(msg) {
  if(!msg || !msg.length) return;

  var req = {
    url:influxdb,
    operation:'POST',
    headers:{"Content-Type":"text/plain"},
    body:msg.join('\n')
  };
  req.error = function(e) {
    logWarning('InfluxDB POST failed, error=' + e);
  };
  try { httpAsync(req); }
  catch(e) {
    logWarning('bad request ' + req.url + ' ' + e);
  }
}

// Compute cluster-level summary statistics and export them
function clusterMetrics(nservices) {
  var vals = metric(
    'ALL', clustermetrics,
    {'node_domains':['*'],'host_name':['vx*host*']}
  );
  var msg = [];
  msg.push('swarm.services value='+nservices);
  msg.push('nodes value='+(vals[0].metricN || 0));
  for(var i = 0; i < vals.length; i++) {
    var val = vals[i];
    msg.push(val.metricName+' value='+(val.metricValue || 0));
  }
  sendToInfluxDB(msg);
}

// Compute summary statistics for the named service's running instances and export them
function serviceMetrics(name, replicas) {
  var vals = metric(
    'ALL', servicemetrics,
    {'vir_host_name':[name+'\\.*'],'vir_cpu_state':['running']}
  );
  var msg = [];
  msg.push('replicas_configured,service='+name+' value='+replicas);
  msg.push('replicas_measured,service='+name+' value='+(vals[0].metricN || 0));
  for(var i = 0; i < vals.length; i++) {
    var val = vals[i];
    msg.push(val.metricName+',service='+name+' value='+(val.metricValue || 0));
  }
  sendToInfluxDB(msg);
}

// Poll the Docker REST API every 10 seconds and export cluster and service metrics
setIntervalHandler(function() {
  var i, services, service, spec, name, replicas, res;
  try { services = JSON.parse(http2({url:docker, certs:certs}).body); }
  catch(e) { logWarning("cannot get " + docker + " error=" + e); }
  if(!services || !services.length) return;

  clusterMetrics(services.length);

  for(i = 0; i < services.length; i++) {
    service = services[i];
    if(!service) continue;
    spec = service["Spec"];
    if(!spec) continue;
    name = spec["Name"];
    if(!name) continue;
 
    // Only replicated services carry a Replicas count; skip other modes (e.g. global services)
    if(!spec["Mode"] || !spec["Mode"]["Replicated"]) continue;
    replicas = spec["Mode"]["Replicated"]["Replicas"];
    serviceMetrics(name, replicas);
  }
},10);
Some notes on the script:
  1. Only a few representative metrics are monitored; many more are available, see Metrics.
  2. The setIntervalHandler function runs every 10 seconds. The function queries the Docker REST API for the current list of services and then calculates summary statistics for each service. The summary statistics are then pushed to InfluxDB via a REST API call.
  3. Cluster performance metrics describes the set of summary statistics that can be calculated.
  4. Writing Applications provides additional information on sFlow-RT scripting and REST APIs; a sample REST query is sketched below.
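For example, the summary statistics computed by the script can also be retrieved ad hoc from the sFlow-RT REST API. The query below is a sketch, assuming sFlow-RT is listening on its default port 8008 and that metric filters are passed as query parameters:
curl "http://localhost:8008/metric/ALL/avg:vir_cpu_utilization/json?vir_host_name=apache.*&vir_cpu_state=running"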
Start gathering metrics:
docker run -v `pwd`/tls:/tls -v `pwd`/swarmmetrics.js:/sflow-rt/swarmmetrics.js \
-e "RTPROP=-Dscript.file=swarmmetrics.js" \
-p 8008:8008 -p 6343:6343/udp sflow/sflow-rt
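Before building dashboards, it can be worth confirming that points are arriving in InfluxDB; the following query is a sketch against the InfluxDB 1.x HTTP API, using the database and address configured in the script:
curl -G "http://10.0.0.50:8086/query?db=docker" \
  --data-urlencode "q=SELECT * FROM replicas_measured WHERE service='apache' ORDER BY time DESC LIMIT 5"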
The results are shown in the Grafana dashboard at the top of this article. The charts show 30 minutes of data. The top Replicas by Service chart compares the number of replicas configured for each service with the number of container instances that the monitoring system is tracking. The chart demonstrates that the monitoring system is accurately tracking the rapidly changing service pool and is able to deliver reliable metrics. The middle Network IO by Service chart shows a brief spike in network activity whenever the number of instances in the apache service is increased. Finally, the bottom Cluster Size chart confirms that all four nodes in the Swarm cluster are being monitored.

This solution is extremely scalable. For example, increasing the size of the cluster from 4 to 1,000 nodes increases the amount of raw data that sFlow-RT needs to process to accurately calculate service metrics, but has no effect on the amount of data sent to the time series database, so there is no increase in storage requirements or query response time.
Pre-processing the stream of raw data reduces the cost of the monitoring solution, whether in terms of the resources required by an on-premises monitoring solution or the direct costs of cloud-based solutions that charge per data point per minute per month. In this case the raw telemetry stream contains hundreds of thousands of potential data points per minute per host; filtering and summarizing the data reduces monitoring costs by many orders of magnitude.
This example can easily be modified to send data to any on-premises or cloud-based backend; examples in this blog include SignalFX, Librato, Graphite and Ganglia. In addition, Docker 1.12 swarm mode elastic load balancing describes how the same architecture can be used to dynamically resize service pools to meet changing demand.
