SignalFx is an example of a cloud-based analytics service. SignalFx provides a
REST API for uploading metrics and a web portal that makes it simple to combine and trend data and to build and share dashboards.
This article describes a proof of concept demonstrating how SignalFx's cloud service can be used to cost-effectively monitor large-scale cloud infrastructure by leveraging
standard sFlow instrumentation. SignalFx offers a free 14-day trial, making it easy to evaluate solutions based on this demonstration.
The diagram shows the measurement pipeline. Standard sFlow measurements from hosts, hypervisors, virtual machines, containers, load balancers, web servers and network switches stream to the
sFlow-RT real-time analytics engine. Metrics are pushed from sFlow-RT to SignalFx using the REST API.
Over 40 vendors implement the sFlow standard, and compatible products are listed on sFlow.org. The open source Host sFlow agent exports standard sFlow metrics from hosts, virtual machines, containers, and local services. For additional background, the Velocity conference talk provides an introduction to sFlow and a case study from a large social networking site.
SignalFx's service is priced based on the number of data points that they need to store, and they estimate a cost of $15 per host per month to record comprehensive host statistics at 10 second granularity. At that rate, collecting metrics from a cluster of 1,000 hosts would cost as much as $15,000 per month.
There are important scalability and cost advantages to placing the sFlow-RT analytics engine in front of the metrics collection service. For example, in large-scale cloud environments the metrics for each member of a dynamic pool aren't necessarily worth trending, since virtual machines are frequently added and removed. Instead, sFlow-RT tracks all the members of the pool, calculates summary statistics for the pool, and logs the summary statistics. This pre-processing can significantly reduce storage requirements, reducing costs and increasing query performance. The sFlow-RT analytics software also calculates traffic flow metrics, hot/missed Memcache keys, and top URLs, exports events via syslog to Splunk, Logstash, etc., and provides access to detailed metrics through its REST API.
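To make the idea of pool summary statistics concrete, here is a minimal sketch in plain JavaScript that reduces a list of per-host values to five numbers. The function names `quantile` and `summarize` are hypothetical, and the linear-interpolation quartile rule is an assumption chosen for illustration; sFlow-RT may compute its quartiles differently.

```javascript
// Quantile of an already sorted array using linear interpolation.
// This interpolation rule is an assumption, not necessarily sFlow-RT's.
function quantile(sorted, p) {
  var pos = (sorted.length - 1) * p;
  var lo = Math.floor(pos);
  var hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Reduce any number of per-host values to five summary values for the pool.
function summarize(values) {
  if (values.length === 0) return null;
  var sorted = values.slice().sort(function(a, b) { return a - b; });
  return {
    min: sorted[0],
    q1: quantile(sorted, 0.25),
    med: quantile(sorted, 0.5),
    q3: quantile(sorted, 0.75),
    max: sorted[sorted.length - 1]
  };
}

// Example: one-minute load averages reported by five pool members.
var summary = summarize([5, 1, 4, 2, 3]);
```

However many hosts join or leave the pool, only five values per interval need to be stored.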
The following steps were involved in setting up the proof of concept.
First, register for a free trial at SignalFx.com.
Download and install sFlow-RT.
Create a signalfx.js script in the sFlow-RT home directory with the following lines (use the token from your SignalFx account):
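// SignalFx ingest endpoint and API token (replace with your own token)
var url = "https://ingest.signalfx.com/v2/datapoint";
var token = "YOUR_APP_API_TOKEN";

setIntervalHandler(function() {
  // Summary statistics to calculate across the cluster
  var metrics = ['min:load_one','q1:load_one','med:load_one',
                 'q3:load_one','max:load_one'];
  // Query all agents, restricted to Linux hosts
  var vals = metric('ALL',metrics,{os_name:['linux']});
  var gauges = [];
  for each (var val in vals) {
    gauges.push({
      // Replace anything other than letters, digits and underscore,
      // e.g. min:load_one becomes min_load_one
      metric: val.metricName.replace(/[^a-zA-Z0-9_]/g,"_"),
      dimensions:{cluster:"Linux"},
      value: val.metricValue
    });
  }
  var body = {"gauge":gauges};
  var req = {
    url:url,
    operation:'post',
    headers: {
      'Content-Type':'application/json',
      'X-SF-TOKEN':token
    },
    body: JSON.stringify(body)
  };
  // Log a warning if the upload fails instead of aborting the handler
  try { http2(req); }
  catch(e) { logWarning("metric upload failed " + e); }
}, 10); // run the handler every 10 seconds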
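The payload construction in the script can be separated from the sFlow-RT calls so that it can be checked outside sFlow-RT, for example in Node.js. The sketch below is one way to do that. The helper name `toDatapointBody` is hypothetical, and skipping entries that lack a numeric `metricValue` is an added assumption (for the case where no hosts have reported yet), not something the original script does.

```javascript
// Build the JSON body for a datapoint request from metric results shaped
// like those used in the script ({metricName: ..., metricValue: ...}).
function toDatapointBody(vals, dimensions) {
  var gauges = [];
  for (var i = 0; i < vals.length; i++) {
    var val = vals[i];
    // Assumption: entries without a numeric value are skipped.
    if (typeof val.metricValue !== 'number' || !isFinite(val.metricValue)) continue;
    gauges.push({
      metric: val.metricName.replace(/[^a-zA-Z0-9_]/g, '_'),
      dimensions: dimensions,
      value: val.metricValue
    });
  }
  return JSON.stringify({gauge: gauges});
}

// Sample input made up for illustration; the last entry has no value.
var sample = [
  {metricName: 'min:load_one', metricValue: 0.1},
  {metricName: 'max:load_one', metricValue: 3.2},
  {metricName: 'med:load_one'}
];
var body = toDatapointBody(sample, {cluster: 'Linux'});
```

Inside the interval handler, the result of `toDatapointBody(vals, {cluster:"Linux"})` could be passed as the `body` of the request object.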
Add the following sFlow-RT configuration entry to load the script:
script.file=signalfx.js
Now start sFlow-RT.
Cluster performance metrics describes the summary metrics that sFlow-RT can calculate. In this case, the load average minimum, maximum, and quartiles for the cluster are being calculated and pushed to SignalFx every 10 seconds, matching the interval set in the script.
Install Host sFlow agents on the physical or virtual machines in your cluster and direct them to send metrics to the sFlow-RT host. The installation steps can be easily automated using orchestration tools like Puppet, Chef, Ansible, etc.
Physical and virtual switches in the cluster can be configured to send sFlow to sFlow-RT in order to add traffic metrics to the mix, exporting metrics that characterize traffic between service tiers, etc. However, in public cloud environments, traffic flow information is typically not available. The articles Amazon Elastic Compute Cloud (EC2) and Rackspace cloudservers describe how Host sFlow agents can be configured to monitor traffic between virtual machines in the cloud.
Metrics should start appearing in SignalFx as soon as the Host sFlow agents are started.
In this example, sFlow-RT is exporting five metrics to summarize the cluster performance, reducing the total monthly cost of monitoring the 1,000-host cluster to less than $15 per month. Of course, there are likely to be more metrics that you will want to track, but the ability to selectively log high-value metrics provides a way to control costs and maximize benefits.
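The size of the saving follows from the two figures quoted in this article. The short sketch below works through the arithmetic; the variable names are illustrative, and the $15 summary figure is treated as an upper bound because the article states only "less than $15".

```javascript
var hosts = 1000;
var perHostMonthly = 15;     // dollars per host per month (estimate quoted above)
var summaryMonthlyMax = 15;  // dollars per month, upper bound for the summary metrics

// Cost of logging comprehensive statistics for every host.
var perHostTotal = hosts * perHostMonthly;

// Lower bound on the cost reduction from logging only the summary metrics.
var minSavingsFactor = perHostTotal / summaryMonthlyMax;
```

With these inputs, `perHostTotal` is 15000 and `minSavingsFactor` is 1000, so the summary approach is at least a thousand times cheaper for this cluster.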
If you are managing physical infrastructure then sFlow provides a simple way to incorporate network telemetry. For example, add the following metrics to the script to summarize network health:
- max:ifinutilization
- max:ifoututilization
- sum:ifindiscards
- sum:ifinerrors
- sum:ifoutdiscards
- sum:ifouterrors
A network connecting 1,000 physical hosts would have considerably more than 1,000 switch ports, and summarizing the per port statistics greatly reduces the cost of monitoring the network. For a catalog of network, host, and application metrics, see Metrics.
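One way to fold the network metrics listed above into the script is to keep them in a separate group from the host metrics. The sketch below is an assumption-laden illustration: the `groups` structure and the `cluster: 'Network'` dimension are hypothetical, and it assumes that the `os_name` filter used for hosts would exclude switch ports, so the network group is queried without it.

```javascript
// Host summary metrics from the original script.
var hostMetrics = ['min:load_one', 'q1:load_one', 'med:load_one',
                   'q3:load_one', 'max:load_one'];

// Network health metrics from the list above.
var networkMetrics = ['max:ifinutilization', 'max:ifoututilization',
                      'sum:ifindiscards', 'sum:ifinerrors',
                      'sum:ifoutdiscards', 'sum:ifouterrors'];

// Hypothetical grouping: each group has its own filter and dimensions.
var groups = [
  {names: hostMetrics, filter: {os_name: ['linux']}, dimensions: {cluster: 'Linux'}},
  {names: networkMetrics, filter: null, dimensions: {cluster: 'Network'}}
];

// Inside the sFlow-RT interval handler, each group would be queried in turn.
// Assumption: the filter argument to metric() can be omitted.
//   var g = groups[i];
//   var vals = g.filter ? metric('ALL', g.names, g.filter)
//                       : metric('ALL', g.names);

var totalMetrics = hostMetrics.length + networkMetrics.length;
```

This adds six network data points per interval, regardless of how many switch ports are being monitored.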