While many tools can consume sFlow metrics directly, not all of them scale to large clusters or can calculate summary statistics that characterize overall cluster performance (rather than simply reporting on each member of the cluster).
This article describes how the sFlow-RT analytics engine is used to collect sFlow from large clusters and report on overall cluster health. The diagram illustrates how sFlow-RT is deployed, consuming sFlow measurements from the servers and making summary statistics available through its REST API so that they can be used to populate a performance dashboard like Graphite.
The following summary statistics are supported by sFlow-RT:
- max: Largest value
- min: Smallest value
- sum: Total value
- avg: Average value
- var: Variance
- sdev: Standard deviation
- med: Median value
- q1: First quartile
- q2: Second quartile (same as med)
- q3: Third quartile
- iqr: Inter-quartile range (i.e. q3 - q1)
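To make these definitions concrete, the following sketch (plain Python, not part of sFlow-RT; the exact quantile method sFlow-RT uses may differ slightly) computes the same statistics over a hypothetical set of per-host load_one values:

import statistics

# load_one values reported by a hypothetical five-host cluster
values = [0.05, 0.08, 0.10, 0.12, 0.90]

q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
med = statistics.median(values)
print("max:", max(values))
print("min:", min(values))
print("sum:", sum(values))
print("avg:", statistics.mean(values))
print("var:", statistics.variance(values))
print("sdev:", statistics.stdev(values))
print("med:", med, " q1:", q1, " q2:", med, " q3:", q3)
print("iqr:", q3 - q1)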
Summary statistics are queried using a URL of the following form:

http://server:8008/metric/agents/metrics/json?filter

Where:
- server: The host running sFlow-RT.
- agents: A semicolon-separated list of host addresses or names, or ALL to include all hosts.
- metrics: A comma-separated list of metrics to retrieve, each prefixed with one of the summary statistics above (e.g. avg:load_one).
- filter: An optional filter to further restrict the hosts included in the query.
For example, the query

http://localhost:8008/metric/web1;web2;web3/avg:load_one,avg:http_method_get/json

produces the results:

[
 {
  "metricN": 3,
  "metricName": "avg:load_one",
  "metricValue": 0.10105350773249354,
  "updateTime": 1360040842222
 },
 {
  "metricN": 3,
  "metricName": "avg:http_method_get",
  "metricValue": 54.015954359255026,
  "updateTime": 1360040842721
 }
]

The following query uses a filter to select servers whose hostname starts with the prefix "mem":
http://localhost:8008/metric/ALL/med:bytes_out,iqr:bytes_out/json?host_name=mem*
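A minimal sketch of issuing this filtered query programmatically with Python's requests library (assuming sFlow-RT is listening on localhost; the "mem" hostnames are hypothetical):

import requests

# Median and inter-quartile range of bytes_out across hosts whose
# hostname starts with "mem" (e.g. a memcached tier)
url = 'http://localhost:8008/metric/ALL/med:bytes_out,iqr:bytes_out/json?host_name=mem*'
r = requests.get(url)
r.raise_for_status()
for v in r.json():
    print(v["metricName"], v["metricValue"])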
The following Python script polls sFlow-RT for cluster statistics every 60 seconds and posts the results to a Graphite collector (10.0.0.151):

import requests
import socket
import time

# Open a TCP connection to Graphite's plaintext protocol listener
sock = socket.socket()
sock.connect(("10.0.0.151", 2003))

url = 'http://localhost:8008/metric/ALL/sum:load_one/json'
while True:
    r = requests.get(url)
    if r.status_code != 200:
        break
    for v in r.json():
        mname = v["metricName"]
        mvalue = v["metricValue"]
        mtime = v["updateTime"] / 1000  # sFlow-RT reports milliseconds; Graphite expects seconds
        message = 'clusterB.%s %f %i\n' % (mname, mvalue, mtime)
        sock.sendall(message.encode())
    time.sleep(60)

Finally, sFlow (and sFlow-RT) is not limited to monitoring server metrics. The switches connecting the servers in the clusters can also be monitored (the sFlow standard is supported by most switch vendors). The network can quickly become a bottleneck as cluster size increases, so it is important to track metrics such as link utilization and packet discards that can result in severe performance degradation. In addition, sFlow instrumentation is available for Apache, NGINX, Java, Memcached and custom applications, providing details such as URLs, response times and status codes, and tying application, server and network performance together to provide a comprehensive view of performance.
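As a sketch of what a network-oriented query might look like, the following uses the same REST API to report peak link utilization across all monitored agents; the metric names ifinutilization and ifoututilization are assumptions based on sFlow-RT's standard interface counter metrics and should be checked against your installation:

import requests

# Highest inbound and outbound interface utilization seen across all
# sFlow agents. Metric names are assumptions; verify against the
# metrics your sFlow-RT instance exposes.
url = 'http://localhost:8008/metric/ALL/max:ifinutilization,max:ifoututilization/json'
for v in requests.get(url).json():
    print('%s = %s' % (v["metricName"], v["metricValue"]))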