Thursday, December 10, 2015

Custom metrics with Cumulus Linux

Cumulus Networks, sFlow and data center automation describes how Cumulus Linux is monitored using the open source Host sFlow agent. The agent supports the Linux, Windows, FreeBSD, Solaris, and AIX operating systems and the KVM, Xen, XCP, XenServer, and Hyper-V hypervisors, and delivers a standard set of performance metrics from switches, servers, hypervisors, virtual switches, and virtual machines.

Host sFlow version 1.28.3 adds support for Custom Metrics. This article demonstrates how the extensive set of standard sFlow measurements can be augmented using custom metrics.

Recent releases of Cumulus Linux simplify the task of collecting custom metrics by making machine-readable JSON a supported output format in command-line tools. For example, the cl-bgp tool can be used to dump BGP summary statistics:
cumulus@leaf1$ sudo cl-bgp summary show json
{ "router-id": "192.168.0.80", "as": 65080, "table-version": 5, "rib-count": 9, "rib-memory": 1080, "peer-count": 2, "peer-memory": 34240, "peer-group-count": 1, "peer-group-memory": 56, "peers": { "swp1": { "remote-as": 65082, "version": 4, "msgrcvd": 52082, "msgsent": 52084, "table-version": 0, "outq": 0, "inq": 0, "uptime": "05w1d04h", "prefix-received-count": 2, "prefix-advertised-count": 5, "state": "Established", "id-type": "interface" }, "swp2": { "remote-as": 65083, "version": 4, "msgrcvd": 52082, "msgsent": 52083, "table-version": 0, "outq": 0, "inq": 0, "uptime": "05w1d04h", "prefix-received-count": 2, "prefix-advertised-count": 5, "state": "Established", "id-type": "interface" } }, "total-peers": 2, "dynamic-peers": 0 }
The following Python script, bgp_sflow.py, invokes the command, parses the output, and posts a set of custom sFlow metrics:
#!/usr/bin/env python
import json
import socket
from subprocess import check_output

# Collect BGP summary statistics as JSON from the cl-bgp command line tool
res = check_output(["/usr/bin/cl-bgp","summary","show","json"])
bgp = json.loads(res)

# Build the set of custom metrics to export via the Host sFlow agent
metrics = {
  "datasource":"bgp",
  "bgp-router-id"    : {"type":"string", "value":bgp["router-id"]},
  "bgp-as"           : {"type":"string", "value":str(bgp["as"])},
  "bgp-total-peers"  : {"type":"gauge32", "value":bgp["total-peers"]},
  "bgp-peer-count"   : {"type":"gauge32", "value":bgp["peer-count"]},
  "bgp-dynamic-peers": {"type":"gauge32", "value":bgp["dynamic-peers"]},
  "bgp-rib-memory"   : {"type":"gauge32", "value":bgp["rib-memory"]},
  "bgp-rib-count"    : {"type":"gauge32", "value":bgp["rib-count"]},
  "bgp-peer-memory"  : {"type":"gauge32", "value":bgp["peer-memory"]},
  "bgp-msgsent"      : {"type":"counter32", "value":sum(bgp["peers"][c]["msgsent"] for c in bgp["peers"])},
  "bgp-msgrcvd"      : {"type":"counter32", "value":sum(bgp["peers"][c]["msgrcvd"] for c in bgp["peers"])}
}

# Post the metrics to the local Host sFlow agent's JSON API on UDP port 36343
# (encoding the message keeps the script working under both Python 2 and 3)
msg = {"rtmetric":metrics}
sock = socket.socket(socket.AF_INET,socket.SOCK_DGRAM)
sock.sendto(json.dumps(msg).encode(),("127.0.0.1",36343))
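Before scheduling the script, it is worth confirming that the local Host sFlow agent is accepting JSON on UDP port 36343 (depending on the hsflowd configuration, the JSON API may need to be enabled in /etc/hsflowd.conf). The following minimal sketch sends a single hypothetical test gauge; the "test" datasource and metric name are arbitrary examples:
#!/usr/bin/env python
# Minimal check: send one hypothetical test gauge to the local Host sFlow
# JSON API on UDP port 36343 (the same port used by bgp_sflow.py above).
# The "test" datasource and "test-gauge" metric name are arbitrary examples.
import json
import socket

msg = {"rtmetric": {"datasource": "test", "test-gauge": {"type": "gauge32", "value": 1}}}
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(json.dumps(msg).encode(), ("127.0.0.1", 36343))
If the agent is receiving the message, the test-gauge rtmetric should show up in sflowtool output in the same way as the BGP metrics shown below.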
Adding the following cron entry runs the script every minute:
* * * * * /home/cumulus/bgp_sflow.py > /dev/null 2>&1
The new metrics will now arrive at the sFlow collector. The following sflowtool output verifies that the metrics are being received:
startSample ----------------------
sampleType_tag 4300:1002
sampleType RTMETRIC
rtmetric_datasource_name bgp
rtmetric bgp-as = (string) "65080"
rtmetric bgp-rib-count = (gauge32) 9
rtmetric bgp-dynamic-peers = (gauge32) 0
rtmetric bgp-rib-memory = (gauge32) 1080
rtmetric bgp-peer-count = (gauge32) 2
rtmetric bgp-router-id = (string) "192.168.0.80"
rtmetric bgp-total-peers = (gauge32) 2
rtmetric bgp-msgrcvd = (counter32) 104648
rtmetric bgp-msgsent = (counter32) 104651
rtmetric bgp-peer-memory = (gauge32) 34240
endSample   ----------------------
A more interesting way to consume this data is with sFlow-RT. The diagram above shows a leaf and spine network built using Cumulus VX virtual machines that was used for a Network virtualization visibility demo. Installing the bgp_sflow.py script on each switch adds centralized visibility into fabric-wide BGP statistics.

For example, the following sFlow-RT REST API command returns the total BGP messages sent and received, summed across all switches:
$ curl http://10.0.0.86:8008/metric/ALL/sum:bgp-msgrcvd,sum:bgp-msgsent/json
[
 {
  "lastUpdateMax": 20498,
  "lastUpdateMin": 20359,
  "metricN": 4,
  "metricName": "sum:bgp-msgrcvd",
  "metricValue": 0.10000302901465385
 },
 {
  "lastUpdateMax": 20498,
  "lastUpdateMin": 20359,
  "metricN": 4,
  "metricName": "sum:bgp-msgsent",
  "metricValue": 0.10000302901465385
 }
]
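The same query is easy to script. The sketch below, assuming the sFlow-RT instance is reachable at 10.0.0.86:8008 as in the curl example and that the Python requests library is installed, polls the metric API and prints each summed value along with the number of agents contributing to it:
#!/usr/bin/env python
# Sketch: poll the sFlow-RT metric API used in the curl example above.
# Assumes sFlow-RT at 10.0.0.86:8008 and the requests library.
import requests

url = "http://10.0.0.86:8008/metric/ALL/sum:bgp-msgrcvd,sum:bgp-msgsent/json"
for metric in requests.get(url).json():
    print("%s = %s (%s agents)" % (metric["metricName"], metric["metricValue"], metric["metricN"]))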
The custom metrics are fully integrated with all the other sFlow metrics. For example, the following query returns the host_name, bgp-as, and load_one metrics associated with bgp-router-id 192.168.0.80:
$ curl http://10.0.0.86:8008/metric/ALL/host_name,bgp-as,load_one/json?bgp-router-id=192.168.0.80
[
 {
  "agent": "10.0.0.80",
  "lastUpdate": 12194,
  "lastUpdateMax": 12194,
  "lastUpdateMin": 12194,
  "metricN": 1,
  "metricName": "host_name",
  "metricValue": "leaf1"
 },
 {
  "agent": "10.0.0.80",
  "dataSource": "bgp",
  "lastUpdate": 22232,
  "lastUpdateMax": 22232,
  "lastUpdateMin": 22232,
  "metricN": 1,
  "metricName": "bgp-as",
  "metricValue": "65080"
 },
 {
  "agent": "10.0.0.80",
  "lastUpdate": 12194,
  "lastUpdateMax": 12194,
  "lastUpdateMin": 12194,
  "metricN": 1,
  "metricName": "load_one",
  "metricValue": 0
 }
]
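The same filtered query can be issued programmatically; the sketch below (again assuming sFlow-RT at 10.0.0.86:8008 and the requests library) passes the bgp-router-id filter as a query parameter and collapses the response into a name-to-value dictionary:
#!/usr/bin/env python
# Sketch: look up metrics for a specific BGP router-id and build a simple
# {metricName: metricValue} dictionary from the JSON response.
# Assumes sFlow-RT at 10.0.0.86:8008 and the requests library.
import requests

url = "http://10.0.0.86:8008/metric/ALL/host_name,bgp-as,load_one/json"
resp = requests.get(url, params={"bgp-router-id": "192.168.0.80"})
values = {m["metricName"]: m["metricValue"] for m in resp.json()}
print(values)  # e.g. {'host_name': 'leaf1', 'bgp-as': '65080', 'load_one': 0}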
The article Cluster performance metrics describes the metric API in more detail. Additional sFlow-RT APIs can be used to send data to a variety of DevOps tools, including Ganglia, Graphite, InfluxDB and Grafana, Logstash, Splunk, and cloud analytics services.

Software, documentation, applications, and community support are available on sFlow-RT.com. For example, the sFlow-RT Fabric View application shown in the screen capture calculates and displays fabric-wide traffic analytics.

2 comments:

  1. Just a note here: the current install of hsflowd on Cumulus is 1.27.3-1, so 1.28.3 will need to be compiled.

    Reply: Thanks for mentioning the need to upgrade. One of the nice things about Cumulus Linux being an open Linux platform is that you can recompile from sources on the switch.

      The easiest thing to do is build a .deb package on one switch and install the package on the remaining switches.
