Monday, February 4, 2013

Cluster performance metrics

The diagram shows three clusters of servers in a data center (e.g. web, Memcache, Hadoop, or application clusters). Each cluster may consist of hundreds of individual servers. Host sFlow agents on the physical or virtual hosts efficiently export standard metrics that can be processed by a wide range of performance analysis tools, including Graphite, Ganglia, sFlowTrend, and log analyzers.

While many tools are capable of consuming sFlow metrics directly, not all tools have the scalability to handle large clusters, or the ability to calculate summary statistics characterizing cluster performance (rather than simply reporting on each member of the cluster).

This article describes how the sFlow-RT analytics engine is used to collect sFlow from large clusters and report on overall cluster health. The diagram illustrates how sFlow-RT is deployed, consuming sFlow measurements from the servers and making summary statistics available through its REST API so that they can be used to populate a performance dashboard like Graphite.

The following summary statistics are supported by sFlow-RT:
  • max: Largest value
  • min: Smallest value
  • sum: Total value
  • avg: Average value
  • var: Variance
  • sdev: Standard deviation
  • med: Median value
  • q1: First quartile
  • q2: Second quartile (same as med:)
  • q3: Third quartile
  • iqr: Inter-quartile range (i.e. q3 - q1)
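To make the definitions above concrete, the following sketch computes the same statistics from a list of per-host values using Python's statistics module. It is an illustration only: the function name summarize is hypothetical, and the use of population variance and the "inclusive" quartile method are assumptions, since the exact methods used by sFlow-RT are not stated here.

```python
import statistics

def summarize(values):
    """Compute the listed summary statistics for one metric across hosts.

    Hypothetical helper. Assumes at least two values, population
    variance, and the "inclusive" quartile method.
    """
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "max": max(values),
        "min": min(values),
        "sum": sum(values),
        "avg": statistics.fmean(values),
        "var": statistics.pvariance(values),
        "sdev": statistics.pstdev(values),
        "med": statistics.median(values),
        "q1": q1,
        "q2": q2,  # second quartile, equal to the median
        "q3": q3,
        "iqr": q3 - q1,
    }

# Example: load_one values reported by five hosts (made-up numbers)
stats = summarize([1, 2, 3, 4, 5])
print(stats["med"], stats["iqr"])
```

The median and inter-quartile range are less sensitive to a single misbehaving host than the average and standard deviation, which is one reason to track both pairs.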
Any programming language capable of making HTTP requests (Perl, Python, Java, JavaScript, bash, etc.) can be used to retrieve metrics from sFlow-RT using a URL of the form:
http://server:8008/metric/agents/metrics/json?filter
Where:
  • server: The host running sFlow-RT.
  • agents: A semicolon-separated list of host addresses or names, or ALL to include all hosts.
  • metrics: A comma-separated list of metrics to retrieve.
  • filter: A filter to further restrict the hosts to include in the query.
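The URL pattern above can be assembled programmatically. The sketch below is a minimal helper under stated assumptions: the name metric_url and its parameters are hypothetical, the port defaults to 8008 as in the pattern, and the filter is passed through as a single pre-formed string.

```python
def metric_url(server, agents, metrics, filter_expr=None, port=8008):
    """Build an sFlow-RT metric query URL (hypothetical helper).

    agents      -- the string "ALL" or a list of host names/addresses
    metrics     -- list of metric names, e.g. ["avg:load_one"]
    filter_expr -- optional filter string appended verbatim after "?"
    """
    agent_part = "ALL" if agents == "ALL" else ";".join(agents)
    metric_part = ",".join(metrics)
    url = "http://%s:%d/metric/%s/%s/json" % (
        server, port, agent_part, metric_part)
    if filter_expr:
        url += "?" + filter_expr
    return url

print(metric_url("localhost", ["web1", "web2", "web3"],
                 ["avg:load_one", "avg:http_method_get"]))
```

Keeping URL construction in one place avoids mixing up the two separators: semicolons between agents, commas between metrics.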
Use of the API is best illustrated by a few examples:
http://localhost:8008/metric/web1;web2;web3/avg:load_one,avg:http_method_get/json
produces the results:
[
 {
  "metricN": 3,
  "metricName": "avg:load_one",
  "metricValue": 0.10105350773249354,
  "updateTime": 1360040842222
 },
 {
  "metricN": 3,
  "metricName": "avg:http_method_get",
  "metricValue": 54.015954359255026,
  "updateTime": 1360040842721
 }
]
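A result like the one above is a JSON array with one object per requested metric. As a minimal sketch, the following parses that sample output and indexes it by metric name; the helper name index_by_name is hypothetical, and updateTime is assumed to be a Unix timestamp in milliseconds.

```python
import json
from datetime import datetime, timezone

# Sample response copied from the query result shown above
SAMPLE = """[
 {"metricN": 3, "metricName": "avg:load_one",
  "metricValue": 0.10105350773249354, "updateTime": 1360040842222},
 {"metricN": 3, "metricName": "avg:http_method_get",
  "metricValue": 54.015954359255026, "updateTime": 1360040842721}
]"""

def index_by_name(results):
    """Map each metricName to its result object (hypothetical helper)."""
    return {r["metricName"]: r for r in results}

by_name = index_by_name(json.loads(SAMPLE))
load = by_name["avg:load_one"]

# Assumption: updateTime is milliseconds since the Unix epoch
updated = datetime.fromtimestamp(load["updateTime"] / 1000, tz=timezone.utc)
print(load["metricValue"], load["metricN"], updated.isoformat())
```

In this sample, metricN is 3 for a query that named three hosts.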
The following query uses a filter to select servers whose hostname starts with the prefix "mem":
http://localhost:8008/metric/ALL/med:bytes_out,iqr:bytes_out/json?host_name=mem*
The following Python script polls sFlow-RT for cluster statistics every 60 seconds and posts the results to a Graphite collector (10.0.0.151):
import socket
import time

import requests

# Connect to Graphite's plaintext listener
sock = socket.socket()
sock.connect(("10.0.0.151", 2003))

url = 'http://localhost:8008/metric/ALL/sum:load_one/json'
while True:
  r = requests.get(url)
  if r.status_code != 200: break
  for v in r.json():
    mname  = v["metricName"]
    mvalue = v["metricValue"]
    mtime  = v["updateTime"] // 1000  # milliseconds to seconds
    message = 'clusterB.%s %f %i\n' % (mname, mvalue, mtime)
    sock.sendall(message.encode())  # sockets require bytes in Python 3
  # Sleep once per poll, even when no metrics were returned
  time.sleep(60)
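The message formatting in the script above can be separated from the network I/O so that it can be checked without a running Graphite collector. The sketch below is one way to do that; the function name graphite_lines and the prefix parameter are hypothetical, and the line layout ("path value timestamp") is the same one the script uses.

```python
def graphite_lines(results, prefix):
    """Format sFlow-RT metric results as Graphite plaintext lines.

    Hypothetical helper: results is the parsed JSON list returned by
    sFlow-RT, prefix is the Graphite path prefix (e.g. "clusterB").
    """
    lines = []
    for v in results:
        seconds = v["updateTime"] // 1000  # milliseconds to seconds
        lines.append("%s.%s %f %i\n" % (
            prefix, v["metricName"], v["metricValue"], seconds))
    return lines

# Made-up metric value for illustration
sample = [{"metricN": 3, "metricName": "sum:load_one",
           "metricValue": 1.5, "updateTime": 1360040842222}]
for line in graphite_lines(sample, "clusterB"):
    print(line, end="")
```

Each returned line can then be encoded and passed to sock.sendall() inside the polling loop.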
Finally, sFlow (and sFlow-RT) is not limited to monitoring server metrics. The switches connecting the servers in the clusters can also be monitored (the sFlow standard is supported by most switch vendors). The network can quickly become a bottleneck as cluster size increases and it is important to track metrics such as link utilization, packet discards etc. that can result in severe performance degradation. In addition, sFlow instrumentation is available for Apache, NGINX, Java, Memcached and custom applications - providing details such as URLs, response times, status codes, etc., and tying application, server and network performance together to provide a comprehensive view of performance.
