(chart from SCOM: How self-tuning threshold baseline is computed)
Calculating a baseline is a common technique in network and system management. The article SCOM: How self-tuning threshold baseline is computed describes how a value is monitored over time, allowing a statistical envelope of likely values to be calculated. If the actual value falls outside the envelope, an alert is generated.
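As a rough illustration of the idea, the sketch below builds an envelope from past samples and flags values that fall outside it. It assumes the simplest possible envelope, the mean plus or minus k standard deviations. This is not the algorithm SCOM uses, and the function names, the sample values and the choice of k=3 are all hypothetical.

```python
import statistics

def temporal_envelope(history, k=3.0):
    """Return (low, high) bounds: mean +/- k standard deviations of past values.

    Assumption: a simple symmetric envelope, chosen only for illustration.
    """
    mean = statistics.fmean(history)
    spread = statistics.pstdev(history)
    return mean - k * spread, mean + k * spread

def is_anomalous(value, envelope):
    """True if the value falls outside the envelope, i.e. an alert would fire."""
    low, high = envelope
    return value < low or value > high

# Hypothetical history of a metric sampled once a minute (e.g. CPU utilization %).
history = [40, 42, 38, 41, 39, 43, 40, 37, 42, 38]
envelope = temporal_envelope(history)

print(is_anomalous(41, envelope))  # inside the envelope
print(is_anomalous(55, envelope))  # outside the envelope
```

A wider envelope (larger k) generates fewer false alarms but is slower to report a genuine change.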
With any statistical baseline there is always a possibility that a normal value will fall outside the baseline envelope and trigger a false alarm. There is a tradeoff between making the baseline sensitive enough to quickly report an anomaly and avoiding excessive numbers of false alarms. For example, suppose that the value is monitored every minute. If the envelope covers 99.9% of values then between 1 and 2 false alarms per day would be expected. Reducing the sensitivity by choosing an envelope that covers 99.99% reduces the false alarm rate to approximately 1 per week.
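The arithmetic behind these figures can be checked with a short sketch. It assumes one sample per minute, as in the example, and treats every normal sample as having the same independent chance of falling outside the envelope. The function name is hypothetical.

```python
def false_alarms_per_day(coverage, samples_per_day=24 * 60):
    """Expected number of normal samples per day that fall outside the envelope."""
    return (1.0 - coverage) * samples_per_day

# 99.9% envelope: about 1.44 false alarms per day.
per_day_999 = false_alarms_per_day(0.999)

# 99.99% envelope: about 0.144 per day, or roughly 1 per week.
per_week_9999 = false_alarms_per_day(0.9999) * 7

print(round(per_day_999, 2), round(per_week_9999, 2))
```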
However, calculating a more accurate baseline is complicated by the need to monitor for a longer period. In the above example it would take at least a week to calculate the 99.99% baseline. Further complicating the calculation of longer term baselines is that the approach assumes a predictable and relatively static demand on the system. If demand is changing rapidly then the false alarm rate will go up since by the time the baseline is calculated it will no longer reflect the current behavior of the system.
False alarms also create a scalability problem when the time-based, or temporal, baseline approach described above is used to monitor large numbers of items, since the number of false alarms grows with the number of items being monitored. For example, at 1 false alarm per week per item, going from 1 item to 1,000 items increases the false alarm rate to 1 every 10 minutes; 10,000 items generate a false alarm every minute; and 100,000 items generate a false alarm every 6 seconds.
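The scaling can be reproduced with a few lines of arithmetic. The sketch assumes that false alarms from different items are independent, so the rates simply add. The names are hypothetical.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60

def seconds_between_false_alarms(items, alarms_per_item_per_week=1.0):
    """Average gap between false alarms across all monitored items."""
    return SECONDS_PER_WEEK / (items * alarms_per_item_per_week)

for items in (1, 1_000, 10_000, 100_000):
    gap = seconds_between_false_alarms(items)
    print(f"{items:>7} items: one false alarm every {gap:,.1f} seconds")
```

With 1,000 items the gap is about 605 seconds (roughly 10 minutes), with 10,000 items about 60 seconds, and with 100,000 items about 6 seconds.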
The following chart shows how the accuracy of temporal baseline declines with system size as the number of false alarms drowns out useful alerts.
In addition, a spatial baseline requires no training period, allowing anomalies to be immediately identified. For example, when monitoring a converged data center environment, a spatial baseline can be applied immediately as new resources are added to a service pool, whereas a temporal baseline approach would require time to calculate a baseline for the new member of the pool. In fact, the addition of resources to the pool could cause a flurry of temporal baseline false alarms as the load profile of existing members of the resource pool changes, putting them outside their historic norms.
The table above compares performance metrics between servers within a cluster (see Top servers). It is immediately apparent from the table that the server at the top has metrics that differ significantly from the other members of the 1,000 server cluster, indicating that the server is experiencing a performance anomaly.
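One way to automate this kind of peer comparison is sketched below. It assumes a single metric per server and flags servers that sit far from the cluster median, measured in units of the median absolute deviation (MAD). The choice of median and MAD, the threshold of 5, the host names and the metric values are all assumptions made for illustration; the article does not prescribe a particular statistic.

```python
import statistics

def spatial_outliers(metrics, threshold=5.0):
    """Return hosts whose metric is far from the cluster median.

    metrics: dict mapping host name -> current metric value.
    Distance is measured in multiples of the median absolute deviation.
    """
    values = list(metrics.values())
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # All typical hosts report the same value; anything different stands out.
        return [host for host, v in metrics.items() if v != median]
    return [host for host, v in metrics.items()
            if abs(v - median) / mad > threshold]

# Hypothetical 1,000 server cluster: 999 similar servers and one outlier.
cluster = {f"server{i}": 1.0 + 0.01 * (i % 10) for i in range(1, 1000)}
cluster["server1000"] = 9.5

print(spatial_outliers(cluster))
```

Because the baseline is computed from the current values of the peers, no history is needed, and a newly added server can be compared against the pool as soon as it reports its first measurement.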
To summarize, the following table compares temporal and spatial baseline techniques as they apply to small and large scale system monitoring:
The challenge in implementing a spatial baseline approach to anomaly detection is efficiently collecting metrics from all the systems in order to be able to compare them and create a baseline.
The sFlow standard is widely implemented by data center equipment vendors, providing an efficient solution for managing performance in large scale converged, virtualized and cloud data center environments. The sFlow architecture provides a highly scalable mechanism for centrally collecting metrics from all the network, server and storage resources in the data center, making it ideally suited to spatial baselining.