Monday, January 23, 2017

Telegraf, InfluxDB, Chronograf, and Kapacitor

The InfluxData TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor) provides a full set of integrated metrics tools: an agent to export metrics (Telegraf), a time series database to collect and store the metrics (InfluxDB), a dashboard to display metrics (Chronograf), and a data processing engine (Kapacitor). Each tool is open source, and the tools can be used together or separately.
This article shows how industry-standard sFlow agents embedded within the data center infrastructure can provide Telegraf metrics to InfluxDB. The solution uses sFlow-RT as a proxy to convert sFlow metrics into their equivalent Telegraf form so that they are immediately visible through the default Chronograf dashboards. Using a proxy to feed metrics into Ganglia described a similar approach for sending metrics to Ganglia.

The following telegraf.js script instructs sFlow-RT to periodically export host metrics to InfluxDB:
// InfluxDB write endpoint; change the address and database name to match your installation
var influxdb = "http://10.0.0.56:8086/write?db=telegraf";

function sendToInfluxDB(msg) {
  if(!msg || !msg.length) return;
  
  var req = {
    url:influxdb,
    operation:'POST',
    headers:{"Content-Type":"text/plain"},
    body:msg.join('\n')
  };
  req.error = function(e) {
    logWarning('InfluxDB POST failed, error=' + e);
  }
  try { httpAsync(req); }
  catch(e) {
    logWarning('bad request ' + req.url + ' ' + e);
  }
}

var metric_names = [
  'host_name',
  'load_one',
  'load_five',
  'load_fifteen',
  'cpu_num',
  'uptime',
  'cpu_user',
  'cpu_system',
  'cpu_idle',
  'cpu_nice',
  'cpu_wio',
  'cpu_intr',
  'cpu_sintr',
  'cpu_steal',
  'cpu_guest',
  'cpu_guest_nice'
];

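// Map from metric name to its column index in a table row (built on first use)
var ntoi;

// Return the value of the named metric from a row returned by table()
function mVal(row,name) {
  if(!ntoi) {
    ntoi = {};
    for(var i = 0; i < metric_names.length; i++) {
      ntoi[metric_names[i]] = i;
    }
  }
  return row[ntoi[name]].metricValue;
}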

setIntervalHandler(function() {
  var i,r,msg = [];
  var vals = table('ALL',metric_names);
  for(i = 0; i < vals.length; i++) {
    r = vals[i];

    // Telegraf System plugin metrics
    msg.push('system,host='
      +mVal(r,'host_name')
      +' load1='+mVal(r,'load_one')
      +',load5='+mVal(r,'load_five')
      +',load15='+mVal(r,'load_fifteen')
      +',n_cpus='+mVal(r,'cpu_num')+'i');
    msg.push('system,host='
      +mVal(r,'host_name')
      +' uptime='+mVal(r,'uptime')+'i');

    // Telegraf CPU plugin metrics
    msg.push('cpu,cpu=cpu-total,host='
      +mVal(r,'host_name')
      +' usage_user='+(mVal(r,'cpu_user')||0)
      +',usage_system='+(mVal(r,'cpu_system')||0)
      +',usage_idle='+(mVal(r,'cpu_idle')||0)
      +',usage_nice='+(mVal(r,'cpu_nice')||0)
      +',usage_iowait='+(mVal(r,'cpu_wio')||0)
      +',usage_irq='+(mVal(r,'cpu_intr')||0)
      +',usage_softirq='+(mVal(r,'cpu_sintr')||0)
      +',usage_steal='+(mVal(r,'cpu_steal')||0)
      +',usage_guest='+(mVal(r,'cpu_guest')||0)
      +',usage_guest_nice='+(mVal(r,'cpu_guest_nice')||0));
  }
  sendToInfluxDB(msg);
},15);
Some notes on the script:
  1. The sendToInfluxDB() function uses the InfluxDB HTTP API (see Writing data using the HTTP API) to POST metrics to InfluxDB.
  2. The setIntervalHandler function retrieves a table of metrics from sFlow-RT every 15 seconds and formats them to use the same names and tags as Telegraf.
  3. The script implements Telegraf System and CPU plugin functionality.
  4. Additional metrics can easily be added to proxy additional Telegraf plugins.
  5. Writing applications provides an overview of the sFlow-RT APIs.
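As a rough illustration of note 4, the sketch below shows one way a further Telegraf plugin (here the memory plugin) might be proxied. It is not part of the original script. The helper names escapeTag(), toLine() and memLine() are hypothetical, and the mapping from the sFlow metric names mem_total, mem_free, mem_cached and mem_buffers to the Telegraf mem fields total, free, cached and buffered is an assumption that should be checked against the metrics your agents actually report. The helper also escapes tag values and drops missing fields so that a host with an incomplete set of metrics does not produce a malformed line.

```javascript
// Escape the characters that are special in line protocol tag keys and values
// (comma, equals sign and space).
function escapeTag(s) {
  return String(s).replace(/([,= ])/g, '\\$1');
}

// Build one line protocol record. Fields whose value is not a finite number
// are omitted; field names listed in intFields get the integer suffix 'i'.
// Returns null when no usable field remains.
function toLine(measurement, tags, fields, intFields) {
  var t = Object.keys(tags).sort().map(function(k) {
    return escapeTag(k) + '=' + escapeTag(tags[k]);
  });
  var f = [];
  Object.keys(fields).forEach(function(k) {
    var v = fields[k];
    if (typeof v !== 'number' || !isFinite(v)) return;
    var isInt = intFields && intFields.indexOf(k) >= 0;
    f.push(k + '=' + (isInt ? Math.round(v) + 'i' : v));
  });
  if (f.length === 0) return null;
  return [measurement].concat(t).join(',') + ' ' + f.join(',');
}

// Hypothetical mapping from sFlow memory metrics (assumed names) to
// Telegraf mem plugin fields.
function memLine(host, m) {
  return toLine('mem', {host: host}, {
    total: m.mem_total,
    free: m.mem_free,
    cached: m.mem_cached,
    buffered: m.mem_buffers
  }, ['total', 'free', 'cached', 'buffered']);
}
```

To use something like this in telegraf.js, the extra metric names would be appended to metric_names, and the interval handler would push the result of memLine() onto msg whenever it is not null.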
Start gathering metrics:
docker run -v `pwd`/telegraf.js:/sflow-rt/telegraf.js \
-e "RTPROP=-Dscript.file=telegraf.js" \
-p 8008:8008 -p 6343:6343/udp sflow/sflow-rt
Accessing the Chronograf home page brings up a table of hosts with their status and CPU load:
Clicking on the leaf1 host displays a dashboard trending key performance metrics:
Pre-processing the metrics using sFlow-RT's real-time streaming analytics engine can greatly increase scalability by selectively exporting metrics and calculating higher level summary statistics in order to reduce the amount of data logged to the time series database. The analytics pipeline can also augment the metrics with additional metadata.
For example, Collecting Docker Swarm service metrics demonstrates how sFlow-RT can monitor dynamic service pools running under Docker Swarm and write summary statistics to InfluxDB. In this case Grafana was used to build the metrics dashboard instead of Chronograf.
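The idea of exporting summary statistics rather than per-host values can be sketched as follows. This is a minimal illustration and not the method used in the Docker Swarm article. It assumes rows shaped like those returned by table() in the script above (an array of rows, each an array of objects with a metricValue property). The function names, the cluster_load measurement and the cluster tag are hypothetical.

```javascript
// Summarize the metric in column `index` across all rows, skipping rows
// where the value is missing or not a finite number.
function summarize(rows, index) {
  var n = 0, sum = 0, max = -Infinity;
  for (var i = 0; i < rows.length; i++) {
    var v = rows[i][index] && rows[i][index].metricValue;
    if (typeof v !== 'number' || !isFinite(v)) continue;
    n++;
    sum += v;
    if (v > max) max = v;
  }
  return n ? {count: n, avg: sum / n, max: max} : null;
}

// Format a single line protocol record for the whole group of hosts
// (hypothetical measurement and tag names). The cluster name is assumed
// to contain no commas, spaces or equals signs.
function clusterLoadLine(rows, cluster) {
  var s = summarize(rows, 0);
  if (!s) return null;
  return 'cluster_load,cluster=' + cluster
    + ' hosts=' + s.count + 'i'
    + ',load1_avg=' + s.avg
    + ',load1_max=' + s.max;
}
```

With this approach a single record is written per interval regardless of the number of hosts, which is the kind of data reduction described above; the trade-off is that per-host detail is no longer available in the database.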

The open source Host sFlow agent exports an extensive range of standard sFlow metrics and has been ported to a wide range of platforms. Standard metrics describes how standardization helps reduce operational complexity. The overlap between standard sFlow metrics and Telegraf base plugin metrics makes the task of proxying straightforward.
The Host sFlow agent (and sFlow agents embedded in network switches and routers) goes beyond simple metrics export to provide detailed visibility into network traffic. Articles on this blog demonstrate how sFlow-RT analytics software can be configured to generate detailed traffic flow metrics that can be streamed into InfluxDB, logged (e.g. Exporting events using syslog), or used to trigger control actions (e.g. DDoS mitigation, Docker 1.12 swarm mode elastic load balancing).