Wednesday, December 16, 2015

Environmental metrics with Cumulus Linux

Custom metrics with Cumulus Linux describes how to extend the set of metrics exported by the sFlow agent, using the export of BGP metrics as an example. This article demonstrates how environmental metrics (power supplies, temperatures, fan speeds, etc.) can be exported.

The smonctl command can be used to dump sensor data as JSON formatted text:
cumulus@cumulus$ smonctl -j
[
    {
        "pwm_path": "/sys/devices/soc.0/ffe03100.i2c/i2c-1/1-004d", 
        "all_ok": "1", 
        "driver_hwmon": [
            "fan1"
        ], 
        "min": 2500, 
        "cpld_path": "/sys/devices/ffe05000.localbus/ffb00000.CPLD", 
        "state": "OK", 
        "prev_state": "OK", 
        "msg": null, 
        "input": 8998, 
        "type": "fan", 
        "pwm1": 121, 
        "description": "Fan1", 
        "max": 29000, 
        "start_time": 1450228330, 
        "var": 15, 
        "pwm1_enable": 0, 
        "prev_msg": null, 
        "log_time": 1450228330, 
        "present": "1", 
        "target": 0, 
        "name": "Fan1", 
        "fault": "0", 
        "pwm_hwmon": [
            "pwm1"
        ], 
        "driver_path": "/sys/devices/soc.0/ffe03100.i2c/i2c-1/1-004d", 
        "div": "4", 
        "cpld_hwmon": [
            "fan1"
        ]
    },
    ... 
The following Python script, smon_sflow.py, invokes the command, parses the output, and posts a set of custom sFlow metrics:
#!/usr/bin/env python
import json
import socket
from subprocess import check_output

# run smonctl and parse its JSON sensor dump
res = check_output(["/usr/sbin/smonctl","-j"])
smon = json.loads(res)
fan_maxpc = 0
fan_down = 0
fan_up = 0
psu_down = 0
psu_up = 0
temp_maxpc = 0
temp_up = 0
temp_down = 0
# summarize fan, power supply and temperature sensor state
for s in smon:
  type = s["type"]
  if(type == "fan"):
    if "OK" == s["state"]:
      fan_maxpc = max(fan_maxpc, 100 * s["input"]/s["max"])
      fan_up = fan_up + 1
    else:
      fan_down = fan_down + 1
  elif(type == "power"):
    if "OK" == s["state"]:
      psu_up = psu_up + 1
    else:
      psu_down = psu_down + 1
  elif(type == "temp"):
    if "OK" == s["state"]:
      temp_maxpc = max(temp_maxpc, 100 * s["input"]/s["max"])
      temp_up = temp_up + 1
    else:
      temp_down = temp_down + 1

metrics = {
  "datasource":"smon",
  "fans-max-pc" : {"type":"gauge32", "value":int(fan_maxpc)},
  "fans-up-pc"  : {"type":"gauge32", "value":int(100 * fan_up / (fan_down + fan_up))},
  "psu-up-pc"   : {"type":"gauge32", "value":int(100 * psu_up / (psu_down + psu_up))},
  "temp-max-pc" : {"type":"gauge32", "value":int(temp_maxpc)},
  "temp-up-pc"  : {"type":"gauge32", "value":int(100.0 * temp_up / (temp_down + temp_up))}
}
msg = {"rtmetric":metrics}
# send the rtmetric message to the local Host sFlow agent's JSON port
sock = socket.socket(socket.AF_INET,socket.SOCK_DGRAM)
sock.sendto(json.dumps(msg),("127.0.0.1",36343))
Note: Make sure the following line is uncommented in the /etc/hsflowd.conf file in order to receive custom metrics. If the file is modified, restart hsflowd for the changes to take effect.
  jsonPort = 36343
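For reference, a minimal /etc/hsflowd.conf with the JSON port enabled might look like the following sketch (the collector address and the sampling and polling settings are illustrative; substitute the values for your own deployment):
sflow {
  DNSSD = off
  polling = 20
  sampling = 512
  collector {
    ip = 10.0.0.86
    udpport = 6343
  }
  jsonPort = 36343
}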
Adding the following cron entry runs the script every minute:
* * * * * /home/cumulus/smon_sflow.py > /dev/null 2>&1
This example requires Host sFlow version 1.28.3 or later. This is newer than the version of Host sFlow that currently ships with Cumulus Linux 2.5.5. However, Cumulus Linux is an open platform, so the software can be compiled from sources in just the same way you would on a server:
sudo sh -c 'echo "deb http://ftp.us.debian.org/debian wheezy main contrib" > /etc/apt/sources.list.d/deb.list'
sudo apt-get update
sudo apt-get install gcc make libc-dev
wget https://github.com/sflow/host-sflow/archive/v1.28.3.tar.gz
tar -xvzf v1.28.3.tar.gz
cd host-sflow-1.28.3
make CUMULUS=yes
make deb CUMULUS=yes
sudo dpkg -i hsflowd_1.28.3-1_ppc.deb
The sFlow-RT chart at the top of this page shows a trend of the environmental metrics. Each metric has been constructed as a percentage, so they can all be combined on a single chart.

While custom metrics are useful, they don't capture the semantics of the data and will vary in form and content. In the case of environmental metrics, a standard set of metrics would add significant value since many different types of device include environmental sensors and a common set of measurements from all networked devices would provide a comprehensive view of power, temperature, humidity, and cooling. Anyone interested in developing a standard sFlow export for environmental metrics can contribute ideas on the sFlow.org mailing list.

Saturday, December 12, 2015

Custom events

Measuring Page Load Speed with Navigation Timing describes the standard instrumentation built into web browsers. This article will use navigation timing as an example to demonstrate how custom sFlow events augment standard sFlow instrumentation embedded in network devices, load balancers, hosts and web servers.

The following jQuery script can be embedded in a web page to provide timing information:
$(window).load(function(){
 var samplingRate = 10;
 // randomly discard page hits so that, on average, 1-in-samplingRate generates a record
 if(samplingRate !== 1 && Math.random() > (1/samplingRate)) return;

 setTimeout(function(){
   if(window.performance) {
     var t = window.performance.timing;
     var msg = {
       sampling_rate : samplingRate,
       t_url         : {type:"string",value:window.location.href},
       t_useragent   : {type:"string",value:navigator.userAgent},
       t_loadtime    : {type:"int32",value:t.loadEventEnd-t.navigationStart},
       t_connecttime : {type:"int32",value:t.responseEnd-t.requestStart} 
     };
     $.ajax({
       url:"/navtiming.php",
       method:"PUT",
       contentType:"application/json",
       data:JSON.stringify(msg) 
     });
    }
  }, 0);
});
The script supports random sampling. In this case a samplingRate of 10 means that, on average, 1-in-10 page hits will generate a measurement record. Measurement records are sent back to the server where the navtiming.php script acts as a gateway, augmenting the measurements and sending them as custom sFlow events.
<?php
$rawInput = file_get_contents("php://input");
$rec = json_decode($rawInput);
// tag the record with a datasource name and add the client's IP address
$rec->datasource = "navtime";
$rec->t_ip = array("type" => "ip", "value" => $_SERVER['REMOTE_ADDR']);

$msg=array("rtflow"=>$rec);
$sock = fsockopen("udp://localhost",36343,$errno,$errstr);
if(! $sock) { return; }
fwrite($sock, json_encode($msg));
fclose($sock);
?>
In this case the remote IP address associated with the client browser is added to the measurement before it is formatted as a JSON rtflow message and sent to the Host sFlow agent (hsflowd) running on the web server host. The Host sFlow agent encodes the data as an sFlow structure and sends it to the sFlow collector as part of the telemetry stream.
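The rtflow format can also be exercised directly, which is useful for verifying the hsflowd configuration before wiring up the web page. The following Python sketch (the field values are copied from the example output below and are purely illustrative) sends a single test record to the Host sFlow agent's JSON port:
#!/usr/bin/env python
import json
import socket

# build an rtflow record mirroring the fields sent by navtiming.php
flow = {
  "datasource"    : "navtime",
  "sampling_rate" : 10,
  "t_url"         : {"type":"string", "value":"http://10.0.0.84/index.html"},
  "t_useragent"   : {"type":"string", "value":"test-agent"},
  "t_loadtime"    : {"type":"int32",  "value":115},
  "t_connecttime" : {"type":"int32",  "value":27},
  "t_ip"          : {"type":"ip",     "value":"10.1.1.63"}
}
msg = {"rtflow":flow}

# send the JSON encoded message to the local Host sFlow agent
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(json.dumps(msg).encode('utf-8'), ("127.0.0.1", 36343))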

The following sflowtool output verifies that the metrics are being received at the sFlow Analyzer:
startSample ----------------------
sampleType_tag 4300:1003
sampleType RTFLOW
rtflow_datasource_name navtime
rtflow_sampling_rate 1
rtflow_sample_pool 0
rtflow t_url = (string) "http://10.0.0.84/index.html"
rtflow t_useragent = (string) "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"
rtflow t_loadtime = (int32) 115
rtflow t_connecttime = (int32) 27
rtflow t_ip = (ip) 10.1.1.63
endSample   ----------------------
A more interesting way to consume the data is to use sFlow-RT. For example, the following REST API call programs the sFlow-RT analytics pipeline to create a metric that tracks average t_loadtime by URL:
curl -H "Content-Type:application/json" -X PUT --data '{keys:"t_url",value:"avg:t_loadtime",t:15}' http://localhost:8008/flow/urlloadtime/json
The following query can be used to retrieve the resulting metric value:
curl http://localhost:8008/metric/ALL/urlloadtime/json
[{
 "agent": "10.0.0.84",
 "dataSource": "navtime",
 "lastUpdate": 1807,
 "lastUpdateMax": 1807,
 "lastUpdateMin": 1807,
 "metricN": 1,
 "metricName": "urlloadtime",
 "metricValue": 11.8125,
 "topKeys": [{
  "key": "http://10.0.0.84/index.html",
  "lastUpdate": 1807,
  "value": 11.8125
 }]
}]
RESTflow describes the sFlow-RT REST API used to create flow definitions and access flow based metrics and Defining Flows provides reference material.
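The same pair of REST calls can also be scripted. The following Python sketch (assuming the requests library is installed and sFlow-RT is listening on localhost:8008) programs the urlloadtime flow definition and then retrieves the metric value:
#!/usr/bin/env python
import requests

rt = "http://localhost:8008"

# program the flow definition: average t_loadtime keyed by URL
flow = {"keys":"t_url", "value":"avg:t_loadtime", "t":15}
requests.put(rt + "/flow/urlloadtime/json", json=flow)

# retrieve the resulting metric
for m in requests.get(rt + "/metric/ALL/urlloadtime/json").json():
    print("%s %s = %s" % (m.get("agent"), m["metricName"], m["metricValue"]))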

Installing mod-sflow provides a rich set of transaction and counter metrics from the Apache web server, including information on worker threads, see Thread pools.

Telemetry from Apache, Host sFlow and the custom events are all combined at the sFlow analyzer. For example, the following query pulls together the load average on the server, with Apache thread pool utilization and URL load times:
curl http://localhost:8008/metric/10.0.0.84/load_one,workers_utilization,urlloadtime/json
[
 {
  "agent": "10.0.0.84",
  "dataSource": "2.1",
  "lastUpdate": 23871,
  "lastUpdateMax": 23871,
  "lastUpdateMin": 23871,
  "metricN": 1,
  "metricName": "load_one",
  "metricValue": 0
 },
 {
  "agent": "10.0.0.84",
  "dataSource": "3.80",
  "lastUpdate": 4123,
  "lastUpdateMax": 4123,
  "lastUpdateMin": 4123,
  "metricN": 1,
  "metricName": "workers_utilization",
  "metricValue": 0.390625
 },
 {
  "agent": "10.0.0.84",
  "dataSource": "navtime",
  "lastUpdate": 8821,
  "lastUpdateMax": 8821,
  "lastUpdateMin": 8821,
  "metricN": 1,
  "metricName": "urlloadtime",
  "metricValue": 91.81072992715491,
  "topKeys": [{
   "key": "http://10.0.0.84/index.html",
   "lastUpdate": 8821,
   "value": 91.81072992715491
  }]
 }
]
The article, Cluster performance metrics, describes the metric API in more detail. Additional sFlow-RT APIs can be used to send data to a variety of DevOps tools, including Ganglia, Graphite, InfluxDB and Grafana, Logstash, Splunk, and cloud analytics services.
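As a simple illustration, the following Python sketch (the Graphite host, metric path, and polling interval are assumptions, and the requests library is assumed to be installed) polls the urlloadtime metric from sFlow-RT and forwards it to Graphite using the plaintext protocol on port 2003:
#!/usr/bin/env python
import socket
import time
import requests

RT = "http://localhost:8008/metric/ALL/urlloadtime/json"
GRAPHITE = ("graphite.example.com", 2003)   # hypothetical Graphite server

while True:
    now = int(time.time())
    lines = ""
    for m in requests.get(RT).json():
        # one plaintext record per entry: <path> <value> <timestamp>
        path = "sflow.%s.urlloadtime" % m["agent"].replace(".", "_")
        lines += "%s %s %d\n" % (path, m["metricValue"], now)
    if lines:
        sock = socket.socket()
        sock.connect(GRAPHITE)
        sock.sendall(lines.encode('utf-8'))
        sock.close()
    time.sleep(15)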

Finally, standard sFlow instrumentation is also widely implemented by physical and virtual network devices. Combining data from all these sources provides a comprehensive real-time view of applications and the compute, storage and networking resources that the applications depend on.

Thursday, December 10, 2015

Custom metrics with Cumulus Linux

Cumulus Networks, sFlow and data center automation describes how Cumulus Linux is monitored using the open source Host sFlow agent that supports Linux, Windows, FreeBSD, Solaris, and AIX operating systems and KVM, Xen, XCP, XenServer, and Hyper-V hypervisors, delivering a standard set of performance metrics from switches, servers, hypervisors, virtual switches, and virtual machines.

Host sFlow version 1.28.3 adds support for Custom Metrics. This article demonstrates how the extensive set of standard sFlow measurements can be augmented using custom metrics.

Recent releases of Cumulus Linux simplify the task of extracting metrics by making machine readable JSON a supported output format in command line tools. For example, the cl-bgp tool can be used to dump BGP summary statistics:
cumulus@leaf1$ sudo cl-bgp summary show json
{ "router-id": "192.168.0.80", "as": 65080, "table-version": 5, "rib-count": 9, "rib-memory": 1080, "peer-count": 2, "peer-memory": 34240, "peer-group-count": 1, "peer-group-memory": 56, "peers": { "swp1": { "remote-as": 65082, "version": 4, "msgrcvd": 52082, "msgsent": 52084, "table-version": 0, "outq": 0, "inq": 0, "uptime": "05w1d04h", "prefix-received-count": 2, "prefix-advertised-count": 5, "state": "Established", "id-type": "interface" }, "swp2": { "remote-as": 65083, "version": 4, "msgrcvd": 52082, "msgsent": 52083, "table-version": 0, "outq": 0, "inq": 0, "uptime": "05w1d04h", "prefix-received-count": 2, "prefix-advertised-count": 5, "state": "Established", "id-type": "interface" } }, "total-peers": 2, "dynamic-peers": 0 }
The following Python script, bgp_sflow.py, invokes the command, parses the output, and posts a set of custom sFlow metrics:
#!/usr/bin/env python
import json
import socket
from subprocess import check_output

# run cl-bgp and parse its JSON summary
res = check_output(["/usr/bin/cl-bgp","summary","show","json"])
bgp = json.loads(res)
metrics = {
  "datasource":"bgp",
  "bgp-router-id"    : {"type":"string", "value":bgp["router-id"]},
  "bgp-as"           : {"type":"string", "value":str(bgp["as"])},
  "bgp-total-peers"  : {"type":"gauge32", "value":bgp["total-peers"]},
  "bgp-peer-count"   : {"type":"gauge32", "value":bgp["peer-count"]},
  "bgp-dynamic-peers": {"type":"gauge32", "value":bgp["dynamic-peers"]},
  "bgp-rib-memory"   : {"type":"gauge32", "value":bgp["rib-memory"]},
  "bgp-rib-count"    : {"type":"gauge32", "value":bgp["rib-count"]},
  "bgp-peer-memory"  : {"type":"gauge32", "value":bgp["peer-memory"]},
  "bgp-msgsent"      : {"type":"counter32", "value":sum(bgp["peers"][c]["msgsent"] for c in bgp["peers"])},
  "bgp-msgrcvd"      : {"type":"counter32", "value":sum(bgp["peers"][c]["msgrcvd"] for c in bgp["peers"])}
}
msg = {"rtmetric":metrics}
# send the rtmetric message to the local Host sFlow agent's JSON port
sock = socket.socket(socket.AF_INET,socket.SOCK_DGRAM)
sock.sendto(json.dumps(msg),("127.0.0.1",36343))
Adding the following cron entry runs the script every minute:
* * * * * /home/cumulus/bgp_sflow.py > /dev/null 2>&1
The new metrics will now arrive at the sFlow collector. The following sflowtool output verifies that the metrics are being received:
startSample ----------------------
sampleType_tag 4300:1002
sampleType RTMETRIC
rtmetric_datasource_name bgp
rtmetric bgp-as = (string) "65080"
rtmetric bgp-rib-count = (gauge32) 9
rtmetric bgp-dynamic-peers = (gauge32) 0
rtmetric bgp-rib-memory = (gauge32) 1080
rtmetric bgp-peer-count = (gauge32) 2
rtmetric bgp-router-id = (string) "192.168.0.80"
rtmetric bgp-total-peers = (gauge32) 2
rtmetric bgp-msgrcvd = (counter32) 104648
rtmetric bgp-msgsent = (counter32) 104651
rtmetric bgp-peer-memory = (gauge32) 34240
endSample   ----------------------
A more interesting way to consume this data is using sFlow-RT. The diagram above shows a leaf and spine network built using Cumulus VX virtual machines that was used for a Network virtualization visibility demo. Installing the bgp_sflow.py script on each switch adds centralized visibility into fabric wide BGP statistics.

For example, the following sFlow-RT REST API command returns the total BGP messages sent and received, summed across all switches (sFlow-RT converts counter metrics into rates, so the values are reported in messages per second):
$ curl http://10.0.0.86:8008/metric/ALL/sum:bgp-msgrcvd,sum:bgp-msgsent/json
[
 {
  "lastUpdateMax": 20498,
  "lastUpdateMin": 20359,
  "metricN": 4,
  "metricName": "sum:bgp-msgrcvd",
  "metricValue": 0.10000302901465385
 },
 {
  "lastUpdateMax": 20498,
  "lastUpdateMin": 20359,
  "metricN": 4,
  "metricName": "sum:bgp-msgsent",
  "metricValue": 0.10000302901465385
 }
]
The custom metrics are fully integrated with all the other sFlow metrics, for example, the following query returns the host_name, bgp-as and load_one metrics associated with bgp-router-id 192.168.0.80:
$ curl http://10.0.0.86:8008/metric/ALL/host_name,bgp-as,load_one/json?bgp-router-id=192.168.0.80
[
 {
  "agent": "10.0.0.80",
  "lastUpdate": 12194,
  "lastUpdateMax": 12194,
  "lastUpdateMin": 12194,
  "metricN": 1,
  "metricName": "host_name",
  "metricValue": "leaf1"
 },
 {
  "agent": "10.0.0.80",
  "dataSource": "bgp",
  "lastUpdate": 22232,
  "lastUpdateMax": 22232,
  "lastUpdateMin": 22232,
  "metricN": 1,
  "metricName": "bgp-as",
  "metricValue": "65080"
 },
 {
  "agent": "10.0.0.80",
  "lastUpdate": 12194,
  "lastUpdateMax": 12194,
  "lastUpdateMin": 12194,
  "metricN": 1,
  "metricName": "load_one",
  "metricValue": 0
 }
]
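The filtered query can also be scripted to build a simple fabric-wide inventory. The following Python sketch (the router IDs other than 192.168.0.80 are hypothetical, and the requests library is assumed to be installed) looks up the host name, AS number and load average for each router ID:
#!/usr/bin/env python
import requests

RT = "http://10.0.0.86:8008"
# hypothetical list of BGP router IDs to look up
router_ids = ["192.168.0.80", "192.168.0.81", "192.168.0.82", "192.168.0.83"]

for rid in router_ids:
    url = RT + "/metric/ALL/host_name,bgp-as,load_one/json"
    res = requests.get(url, params={"bgp-router-id": rid}).json()
    vals = dict((m["metricName"], m["metricValue"]) for m in res)
    print("%s %s AS%s load_one=%s" % (rid, vals.get("host_name"),
                                      vals.get("bgp-as"), vals.get("load_one")))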
The article, Cluster performance metrics, describes the metric API in more detail. Additional sFlow-RT APIs can be used to send data to a variety of DevOps tools, including Ganglia, Graphite, InfluxDB and Grafana, Logstash, Splunk, and cloud analytics services.

Software, documentation, applications, and community support are available on sFlow-RT.com. For example, the sFlow-RT Fabric View application shown in the screen capture calculates and displays fabric wide traffic analytics.

Tuesday, December 8, 2015

Using a proxy to feed metrics into Ganglia

The GitHub gmond-proxy project demonstrates how a simple proxy can be used to map metrics retrieved through a REST API into Ganglia's gmond TCP protocol.
The diagram shows the elements of the Ganglia monitoring system. The Ganglia server runs the gmetad daemon, which polls gmond instances for data and stores time series data. Trend charts are presented through the web interface. The transparent gmond-proxy replaces a native gmond daemon and delivers metrics in response to gmetad's polling requests.

The following commands install the proxy on the sFlow collector - an Ubuntu 14.04 system that is already running sFlow-RT:
wget https://raw.githubusercontent.com/sflow-rt/gmond-proxy/master/gmond_proxy.py
sudo mv gmond_proxy.py /etc/init.d/
sudo chown root:root /etc/init.d/gmond_proxy.py
sudo chmod 755 /etc/init.d/gmond_proxy.py
sudo service gmond_proxy.py start
sudo update-rc.d gmond_proxy.py defaults
The following commands install Ganglia's gmetad collector and web user interface on the Ganglia server - an Ubuntu 14.04 system:
sudo apt-get install gmetad
sudo apt-get install ganglia-webfrontend
cp /etc/ganglia-webfrontend/apache.conf /etc/apache2/sites-enabled
Next edit the /etc/ganglia/gmetad.conf file and configure the proxy as a data source:
data_source "my cluster" sflow-rt
Restart the Apache and gmetad daemons:
sudo service gmetad restart
sudo service apache2 restart
The Ganglia web user interface, shown in the screen capture, is now available at http://server/ganglia/

Ganglia natively supports sFlow, so what are some of the benefits of using the proxy? Firstly, the proxy allows metrics to be filtered, reducing the amount of data logged and increasing the scalability of the Ganglia collector. Secondly, sFlow-RT generates traffic flow metrics, making them available to Ganglia. Finally, Ganglia is typically used in conjunction with additional monitoring tools that can all be driven using the analytics stream generated by sFlow-RT.

The diagram above shows how the sFlow-RT analytics engine is used to deliver metrics and events to cloud based and on-site DevOps tools, see: Cloud analytics, InfluxDB and Grafana, Metric export to Graphite, and Exporting events using syslog. There are important scalability and cost advantages to placing the sFlow-RT analytics engine in front of metrics collection applications as shown in the diagram. For example, in large scale cloud environments the metrics for each member of a dynamic pool are not necessarily worth trending since virtual machines are frequently added and removed. Instead, sFlow-RT can be configured to track all the members of the pool, calculate summary statistics for the pool, and log only the summaries. This pre-processing can significantly reduce storage requirements, reduce costs, and increase query performance.
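For example, the following Python sketch (the metric name, polling pattern, and the assumption that the requests library is installed are all illustrative) uses the sum: prefix from the metric API to compute a pool-wide average load, which could then be logged or exposed through the proxy in place of per-member values:
#!/usr/bin/env python
import requests

RT = "http://localhost:8008"

# fetch the 1 minute load average summed across every agent reporting to sFlow-RT
res = requests.get(RT + "/metric/ALL/sum:load_one/json").json()
if res and res[0]["metricN"]:
    members = res[0]["metricN"]       # number of agents contributing to the sum
    total = res[0]["metricValue"]     # summed load_one across the pool
    print("pool members=%d avg load_one=%.2f" % (members, total / members))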

Monday, December 7, 2015

Broadcom BroadView Instrumentation

The diagram above, from the BroadView™ 2.0 Instrumentation Ecosystem presentation, illustrates how instrumentation built into the network Data Plane (the Broadcom Trident/Tomahawk ASICs used in most data center switches) provides visibility to Software Defined Networking (SDN) controllers so that they can optimize network performance.
The sFlow measurement standard provides open, scaleable, multi-vendor, streaming telemetry that supports SDN applications. Broadcom has been augmenting the rich set of counter and flow measurements in the base sFlow standard with additional metrics. For example, Broadcom ASIC table utilization metrics, DevOps, and SDN describes metrics that were added to track ASIC table resource consumption.

The highlighted Buffer congestion state / statistics capability in the slide refers to the BroadView Buffer Statistics Tracking (BST) instrumentation. The Memory Management Unit (MMU) is on-chip logic that manages how the on-chip packet buffers are organized.  BST is a feature that enables tracking the usage of these buffers. It includes snapshot views of peak utilization of the on-chip buffer memory across queues, ports, priority group, service pools and the entire chip.
The above chart from the Broadcom technical brief, Building an Open Source Data Center Monitoring Tool Using Broadcom BroadView™ Instrumentation Software, shows buffer utilization trended over an hour.

While the trend chart is useful, the value of BST instrumentation is fully realized when the data is integrated into the sFlow telemetry stream, allowing buffer utilizations to be correlated with traffic flows consuming the buffers. Broadcom's recently published sFlow extension, sFlow Broadcom Peak Buffer Utilization Structures, standardizes the export of the buffer metrics, ensures multi-vendor interoperability, and provides the comprehensive, actionable telemetry from the network required by SDN applications.

Ask switch vendors about their plans to support the extension in their sFlow implementations. The enhanced visibility into buffer utilization addresses a number of important use cases:
  • Fabric-wide visibility into peak buffer utilization
  • Derive worst case end to end latency
  • Pro-actively track microbursts and identify hot spots before packets are lost
  • Correlate with traffic flows and link utilizations
  • Improve performance through QoS marking, load spreading, and workload placement

Wednesday, December 2, 2015

DDoS Blackhole

DDoS Blackhole has been released on GitHub, https://github.com/sflow-rt/ddos-blackhole. The application detects Distributed Denial of Service (DDoS) flood attacks in real-time and can automatically install a null / blackhole route to drop the attack traffic and maintain Internet connectivity. See DDoS for additional background.

The screen capture above shows a simulated DNS amplification attack. The Top Targets chart is a real-time view of external traffic to on-site IP addresses. The red line indicates the threshold that has been set at 10,000 packets per second and it is clear that traffic to address 192.168.151.4 exceeds the threshold. The Top Protocols chart below shows that the increase in traffic is predominantly DNS. The Controls chart shows that a control was added the instant the traffic crossed the threshold.
The Controls tab shows a table of the currently active controls. In this case, the controller is running in Manual mode and is listed with a pending status as it awaits manual confirmation (which is why the attack traffic persists in the Charts page). Clicking on the entry brings up a form that can be used to apply the control.
The chart above from the DDoS article shows an actual attack where the controller automatically dropped the attack traffic.
The basic settings are straightforward, allowing the threshold, duration, mode of operation and protected address ranges to be set.

Controls are added and removed by calling an external TCL/Expect script which logs into the site router and applies the following CLI command to drop traffic to the targeted address:
ip route target_ip/32 null0 name "DOS ATTACK"
The script can easily be modified or replaced to apply different controls or to work with different vendor CLIs.
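For example, the following Python sketch illustrates one possible replacement control script using the paramiko SSH library (the router address, credentials, and the assumption that the target address is passed as the script's first argument are all hypothetical; check the application documentation for the exact script interface):
#!/usr/bin/env python
import sys
import paramiko

ROUTER = "10.0.0.1"            # hypothetical router management address
USER, PASSWORD = "admin", "secret"

target = sys.argv[1]           # assumes the target IP is the first argument

# log into the router and apply the null route shown above
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(ROUTER, username=USER, password=PASSWORD)
client.exec_command('ip route %s/32 null0 name "DOS ATTACK"' % target)
client.close()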

Additional instructions are available under the Help tab. Instructions for downloading and installing the DDoS Blackhole application are available on sFlow-RT.com.

The software will work on any site with sFlow capable switches, even if the router itself doesn't support sFlow. Running the application in Manual mode is a completely safe way to become familiar with the software features and get an understanding of normal traffic levels. Download the software and give it a try.