Tuesday, October 18, 2016

Network performance monitoring

Today, network performance monitoring typically relies on probe devices to perform active tests and/or observe network traffic to infer performance. This article demonstrates that hosts already track network performance, and that exporting host-based network performance information provides an attractive alternative to complex and expensive in-network approaches.
# tcpdump -ni eth0 tcp
11:29:28.949783 IP 10.0.0.162.ssh > 10.0.0.70.56174: Flags [P.], seq 1424968:1425312, ack 1081, win 218, options [nop,nop,TS val 2823262261 ecr 2337599335], length 344
11:29:28.950393 IP 10.0.0.70.56174 > 10.0.0.162.ssh: Flags [.], ack 1425312, win 4085, options [nop,nop,TS val 2337599335 ecr 2823262261], length 0
The host TCP/IP stack continuously measures round trip time and estimates available bandwidth for each active connection as part of its normal operation. The tcpdump output shown above includes the timestamp information (TS val / ecr) exchanged in TCP packets to provide the accurate round trip time measurements needed for reliable high speed data transfer.
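
These measurements are exposed by the Linux kernel through the TCP_INFO socket option, the same state reported by the tcp_diag module discussed below. The following minimal Python sketch reads the smoothed RTT for a connected socket; the peer address is hypothetical and the unpacked offsets correspond to the leading fields of struct tcp_info on typical Linux kernels, so treat it as an illustration rather than a portable parser:
import socket, struct

TCP_INFO = getattr(socket, 'TCP_INFO', 11)  # option value 11 on Linux

# Hypothetical peer; any established TCP socket will do
s = socket.create_connection(('10.0.0.70', 22))
buf = s.getsockopt(socket.IPPROTO_TCP, TCP_INFO, 104)

# Leading portion of struct tcp_info: 8 single-byte fields followed by
# 32-bit counters (rto, ato, snd_mss, rcv_mss, unacked, sacked, lost,
# retrans, fackets, last_* timers, pmtu, rcv_ssthresh, rtt, rttvar, ...)
# Offsets may vary between kernel versions.
fields = struct.unpack('8B24I', buf)
rtt_us, rttvar_us = fields[23], fields[24]
print('smoothed rtt %d uS, variance %d uS' % (rtt_us, rttvar_us))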

The open source Host sFlow agent already makes use of Berkeley Packet Filter (BPF) capability on Linux to efficiently sample packets and provide visibility into traffic flows. Adding support for the tcp_diag kernel module allows the detailed performance metrics maintained in the Linux TCP stack to be attached to each sampled TCP packet.
enum packet_direction {
  unknown  = 0,
  received = 1,
  sent     = 2
}

/* TCP connection state */
/* Based on Linux struct tcp_info */
/* opaque = flow_data; enterprise=0; format=2209 */
struct extended_tcp_info {
  packet_direction dir;     /* Sampled packet direction */
  unsigned int snd_mss;     /* Cached effective mss, not including SACKS */
  unsigned int rcv_mss;     /* Max. recv. segment size */
  unsigned int unacked;     /* Packets which are "in flight" */
  unsigned int lost;        /* Lost packets */
  unsigned int retrans;     /* Retransmitted packets */
  unsigned int pmtu;        /* Last pmtu seen by socket */
  unsigned int rtt;         /* smoothed RTT (microseconds) */
  unsigned int rttvar;      /* RTT variance (microseconds) */
  unsigned int snd_cwnd;    /* Sending congestion window */
  unsigned int reordering;  /* Reordering */
  unsigned int min_rtt;     /* Minimum RTT (microseconds) */
}
The sFlow telemetry protocol is extensible, and the above structure was added to transport network performance metrics along with the sampled TCP packet.
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 153026
sourceId 0:2
meanSkipCount 10
samplePool 1530260
dropEvents 0
inputPort 1073741823
outputPort 2
flowBlock_tag 0:2209
tcpinfo_direction sent
tcpinfo_send_mss 1448
tcpinfo_receive_mss 536
tcpinfo_unacked_pkts 0
tcpinfo_lost_pkts 0
tcpinfo_retrans_pkts 0
tcpinfo_path_mtu 1500
tcpinfo_rtt_uS 773
tcpinfo_rtt_uS_var 137
tcpinfo_send_congestion_win 10
tcpinfo_reordering 3
tcpinfo_rtt_uS_min 0
flowBlock_tag 0:1
flowSampleType HEADER
headerProtocol 1
sampledPacketSize 84
strippedBytes 4
headerLen 66
headerBytes 08-00-27-09-5C-F7-08-00-27-B8-32-6D-08-00-45-C0-00-34-60-79-40-00-01-06-03-7E-0A-00-00-88-0A-00-00-86-84-47-00-B3-50-6C-E7-E7-D8-49-29-17-80-10-00-ED-15-34-00-00-01-01-08-0A-18-09-85-3A-23-8C-C6-61
dstMAC 080027095cf7
srcMAC 080027b8326d
IPSize 66
ip.tot_len 52
srcIP 10.0.0.136
dstIP 10.0.0.134
IPProtocol 6
IPTOS 192
IPTTL 1
IPID 31072
TCPSrcPort 33863
TCPDstPort 179
TCPFlags 16
endSample   ----------------------
The sflowtool output shown above provides an example; the tcp_info values appear under the flowBlock_tag 0:2209 block.
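
Since sflowtool renders each sample as simple attribute value pairs, downstream scripts need very little code to consume the merged packet and TCP performance data. The following minimal Python sketch, which assumes sflowtool is installed and receiving the sFlow feed, prints the smoothed RTT reported with each sampled TCP packet:
#!/usr/bin/env python
# Sketch: run sflowtool and print the RTT attached to each sampled
# TCP packet. Assumes sflowtool is installed and sFlow is being received.
import subprocess

p = subprocess.Popen(['sflowtool'], stdout=subprocess.PIPE,
                     universal_newlines=True)
sample = {}
for line in p.stdout:
    parts = line.split()
    if not parts:
        continue
    if parts[0] == 'startSample':
        sample = {}
    elif parts[0] == 'endSample':
        if 'tcpinfo_rtt_uS' in sample:
            print('%s > %s rtt %s uS' % (sample.get('srcIP'),
                                         sample.get('dstIP'),
                                         sample.get('tcpinfo_rtt_uS')))
    elif len(parts) > 1:
        sample[parts[0]] = parts[1]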

Combining performance data and packet headers delivers a telemetry stream that is far more useful than either measurement on its own. There are hundreds of attributes and billions of values that can be decoded from the packet header, resulting in a virtually unlimited number of permutations that can be combined with the network performance data.

For example, the chart at the top of this article uses sFlow-RT real-time analytics software to combine telemetry from multiple hosts and generate an up-to-the-second view of network performance, plotting round trip time by country.
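
A flow definition driving such a chart would follow the same REST pattern used in the dnspair.py script in the Real-time domain name lookups article below. The sketch here is illustrative only: the country:ipsource key and the tcpinfo-based value are assumptions about the available sFlow-RT key and value functions, so consult the sFlow-RT documentation for the exact names:
#!/usr/bin/env python
# Hypothetical sketch: define a flow keyed by country so that round trip
# time can be charted per country. The key and value function names are
# assumptions; check the sFlow-RT documentation for the exact names.
import requests, json

flow = {'keys':'country:ipsource',
        'value':'avg:tcpinfo_rtt_uS',
        'log':True}
requests.put('http://localhost:8008/flow/rtt_by_country/json',
             data=json.dumps(flow))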

This solution leverages the TCP/IP stack to turn every host and its clients (desktops, laptops, tablets, smartphones, IoT devices, etc.) into a network performance monitoring probe - continuously streaming telemetry gathered from normal network activity.

A host-based approach to network performance monitoring is well suited to public cloud deployments, where lack of access to the physical network resources challenges in-network approaches to monitoring.
More generally, each network, host and application entity maintains state as part of its normal operation (for example, the TCP metrics in the host). However, the information is incomplete and of limited value when it is stranded within each device. The sFlow standard specifies a unified data model and efficient transport that allows each element to stream measurements and related meta-data to analytics software where the information is combined to provide a comprehensive view of performance.

Thursday, October 13, 2016

Real-time domain name lookups

A reverse DNS request returns the domain name associated with an IP address, for example the name google-public-dns-a.google.com for IP address 8.8.8.8. This article demonstrates how the sFlow-RT engine incorporates domain name lookups in real-time flow analytics.

First, the dns.servers System Property is used to specify one or more DNS servers to handle the reverse lookup requests. For example, the following command uses Docker to run sFlow-RT with DNS lookups directed to server 10.0.0.1:
docker run -e "RTPROP=-Ddns.servers=10.0.0.1" \
-p 8008:8008 -p 6343:6343/udp -d sflow/sflow-rt
The following Python script dnspair.py uses the sFlow-RT REST API to define a flow and log the resulting flow records:
#!/usr/bin/env python
import requests
import json

flow = {'keys':'dns:ipsource,dns:ipdestination',
 'value':'bytes','activeTimeout':10,'log':True}
requests.put('http://localhost:8008/flow/dnspair/json',data=json.dumps(flow))
flowurl = 'http://localhost:8008/flows/json?name=dnspair&maxFlows=10&timeout=60'
flowID = -1
while True:
  r = requests.get(flowurl + "&flowID=" + str(flowID))
  if r.status_code != 200: break
  flows = r.json()
  if len(flows) == 0: continue

  flowID = flows[0]["flowID"]
  flows.reverse()
  for f in flows:
    print(json.dumps(f,indent=1))
Running the script generates the following output:
$ ./dnspair.py
{
 "value": 233370.92322668363, 
 "end": 1476234478177, 
 "name": "dnspair", 
 "flowID": 1523, 
 "agent": "10.0.0.20", 
 "start": 1476234466195, 
 "dataSource": "10", 
 "flowKeys": "xenvm11.sf.inmon.com.,dhcp20.sf.inmon.com."
}
{
 "value": 39692.88754760739, 
 "end": 1476234478177, 
 "name": "dnspair", 
 "flowID": 1524, 
 "agent": "10.0.0.20", 
 "start": 1476234466195, 
 "dataSource": "10", 
 "flowKeys": "xenvm11.sf.inmon.com.,switch.sf.inmon.com."
}
The token dns:ipsource in the flow definition is an example of a Key Function. Functions can be combined to define flow keys or used in filters. For example:
or:[dns:ipsource]:ipsource
Returns the DNS name if available; otherwise the original IP address is returned
suffix:[dns:ipsource]:.:3
Returns the last 2 parts of the DNS name, e.g. xenvm11.sf.inmon.com. becomes inmon.com.
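
For example, the dnspair.py script shown earlier can be made to fall back to raw IP addresses when no reverse DNS entry exists by wrapping each key in the or: function. A small sketch of the modified flow definition is shown below; the ipdestination form is the obvious counterpart of the documented ipsource example:
#!/usr/bin/env python
# Sketch: report the DNS name when available, otherwise the IP address,
# by combining the or: and dns: key functions described above.
import requests, json

flow = {'keys':'or:[dns:ipsource]:ipsource,or:[dns:ipdestination]:ipdestination',
        'value':'bytes','activeTimeout':10,'log':True}
requests.put('http://localhost:8008/flow/dnspair/json',data=json.dumps(flow))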

DNS results are cached by the dns: function in order to provide real-time lookups and reduce the load on the backend name server(s). Cache size and timeout settings are tunable using System Properties.

Monday, October 10, 2016

Collecting Docker Swarm service metrics

This article demonstrates how to address the challenge of monitoring dynamic Docker Swarm deployments and track service performance metrics using existing on-premises and cloud monitoring tools like Ganglia, Graphite, InfluxDB, Grafana, SignalFX, Librato, etc.

In this example, Docker Swarm is used to deploy a simple web service on a four node cluster:
docker service create --replicas 2 -p 80:80 --name apache httpd:2.4
Next, the following script tests the agility of monitoring systems by constantly changing the number of replicas in the service:
#!/bin/bash
while true
do
  docker service scale apache=$(( ( RANDOM % 20 )  + 1 ))
  sleep 30
done
The above test is easy to set up and is a quick way to stress test monitoring systems and reveal accuracy and performance problems when they are confronted with container workloads.

Many approaches to gathering and recording metrics were developed for static environments and have a great deal of difficulty tracking rapidly changing container-based service pools without missing information, leaking resources, and slowing down. For example, each new container in Docker Swarm has a unique name, e.g. apache.16.17w67u9157wlri7trd854x6q0. Monitoring solutions that record container names, or even worse, index data by container name, will suffer from bloated databases and resulting slow queries.

The solution is to insert a stream processing analytics stage in the metrics pipeline that delivers a consistent set of service level metrics to existing tools.
The asynchronous metrics export method implemented in the open source Host sFlow agent is part of the solution, sending a real-time telemetry stream to a centralized sFlow collector which is then able to deliver a comprehensive view of all services deployed on the Docker Swarm cluster.

The sFlow-RT real-time analytics engine completes the solution by converting the detailed per instance metrics into service level statistics which are in turn streamed to a time series database where they drive operational dashboards.

For example, the following swarmmetrics.js script computes cluster and service level metrics and exports them to InfluxDB:
var docker = "https://10.0.0.134:2376/services";
var certs = '/tls/';

var influxdb = "http://10.0.0.50:8086/write?db=docker"

var clustermetrics = [
  'avg:load_one',
  'max:cpu_steal',
  'sum:node_domains'
];

var servicemetrics = [
  'avg:vir_cpu_utilization',
  'avg:vir_bytes_in',
  'avg:vir_bytes_out'
];

function sendToInfluxDB(msg) {
  if(!msg || !msg.length) return;

  var req = {
    url:influxdb,
    operation:'POST',
    headers:{"Content-Type":"text/plain"},
    body:msg.join('\n')
  };
  req.error = function(e) {
    logWarning('InfluxDB POST failed, error=' + e);
  }
  try { httpAsync(req); }
  catch(e) {
    logWarning('bad request ' + req.url + ' ' + e);
  }
}

function clusterMetrics(nservices) {
  var vals = metric(
    'ALL', clustermetrics,
    {'node_domains':['*'],'host_name':['vx*host*']}
  );
  var msg = [];
  msg.push('swarm.services value='+nservices);
  msg.push('nodes value='+(vals[0].metricN || 0));
  for(var i = 0; i < vals.length; i++) {
    var val = vals[i];
    msg.push(val.metricName+' value='+ (val.metricValue || 0));
  } 
  sendToInfluxDB(msg);
}

function serviceMetrics(name, replicas) {
  var vals = metric(
    'ALL', servicemetrics,
    {'vir_host_name':[name+'\\.*'],'vir_cpu_state':['running']}
  );
  var msg = [];
  msg.push('replicas_configured,service='+name+' value='+replicas);
  msg.push('replicas_measured,service='+name+' value='+(vals[0].metricN || 0));
  for(var i = 0; i < vals.length; i++) {
    var val = vals[i];
    msg.push(val.metricName+',service='+name+' value='+(val.metricValue || 0));
  }
  sendToInfluxDB(msg);
}

setIntervalHandler(function() {
  var i, services, service, spec, name, replicas, res;
  try { services = JSON.parse(http2({url:docker, certs:certs}).body); }
  catch(e) { logWarning("cannot get " + docker + " error=" + e); }
  if(!services || !services.length) return;

  clusterMetrics(services.length);

  for(i = 0; i < services.length; i++) {
    service = services[i];
    if(!service) continue;
    spec = service["Spec"];
    if(!spec) continue;
    name = spec["Name"];
    if(!name) continue;
 
    replicas = spec["Mode"]["Replicated"]["Replicas"];
    serviceMetrics(name, replicas);
  }
},10);
Some notes on the script:
  1. Only a few representative metrics are being monitored; many more are available, see Metrics.
  2. The setIntervalHandler function is run every 10 seconds. The function queries the Docker REST API for the current list of services and then calculates summary statistics for each service. The summary statistics are then pushed to InfluxDB via a REST API call.
  3. Cluster performance metrics describes the set of summary statistics that can be calculated.
  4. Writing Applications provides additional information on sFlow-RT scripting and REST APIs.
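
The summary statistics that the script pushes to InfluxDB can also be pulled on demand from sFlow-RT's metric REST API, which is a convenient way to sanity check values while developing the dashboard. A minimal sketch, assuming sFlow-RT is listening on its default port:
#!/usr/bin/env python
# Sketch: query a cluster-wide summary statistic directly from sFlow-RT
# as a cross-check on the values pushed to InfluxDB by swarmmetrics.js.
import requests

r = requests.get('http://localhost:8008/metric/ALL/avg:vir_cpu_utilization/json')
print(r.json())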
Start gathering metrics:
docker run -v `pwd`/tls:/tls -v `pwd`/swarmmetrics.js:/sflow-rt/swarmmetrics.js \
-e "RTPROP=-Dscript.file=swarmmetrics.js" \
-p 8008:8008 -p 6343:6343/udp sflow/sflow-rt
The results are shown in the Grafana dashboard at the top of this article. The charts show 30 minutes of data. The top Replicas by Service chart compares the number of replicas configured for each service with the number of container instances that the monitoring system is tracking. The chart demonstrates that the monitoring system is accurately tracking the rapidly changing service pool and is able to deliver reliable metrics. The middle Network IO by Service chart shows a brief spike in network activity whenever the number of instances in the apache service is increased. Finally, the bottom Cluster Size chart confirms that all four nodes in the Swarm cluster are being monitored.

This solution is extremely scalable. For example, increasing the size of the cluster from 4 to 1,000 nodes increases the amount of raw data that sFlow-RT needs to process to accurately calculate service metrics, but has no effect on the amount of data sent to the time series database, so there is no increase in storage requirements or query response time.
Pre-processing the stream of raw data reduces the cost of the monitoring solution, either in terms of the resources required by an on-premises monitoring solution, or the direct costs of cloud-based solutions that charge per data point per minute per month. In this case the raw telemetry stream contains hundreds of thousands of potential data points per minute per host; filtering and summarizing the data reduces monitoring costs by many orders of magnitude.
This example can easily be modified to send data into any on-premises or cloud-based backend; examples in this blog include SignalFX, Librato, Graphite and Ganglia. In addition, Docker 1.12 swarm mode elastic load balancing describes how the same architecture can be used to dynamically resize service pools to meet changing demand.