This chart was generated using the free sFlowTrend application to monitor an Apache web server using the sFlow standard. The chart shows a real-time, minute-by-minute view of Top URIs by Operations/s for a busy web server. What is interesting about the chart is the sudden drop-off in total operations per second over the last few minutes.
The drop in throughput can be verified by examining the standard HTTP performance counters that are exported using sFlow's efficient push mechanism. The Counters chart above shows the same drop in throughput.
A couple of possible explanations come to mind. The first is that the size of pages has increased, possibly because large images were added.
The Top URI extensions by Bytes/s chart shown above makes it clear that the proportion of image data hasn't changed and that the overall data rate has fallen, so the drop in throughput doesn't appear to be a bandwidth problem.
Another possibility is that there has been an increase in server latency. The Top URIs by Duration chart above shows a recent increase in the latency of the http://10.0.0.150/login.php page.
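The ranking behind a Top URIs by Duration view can be sketched in a few lines of Python. This is only an illustration: the `(uri, duration_us)` record shape and the function name `top_uris_by_duration` are assumptions made for the example, not part of sFlowTrend or the sFlow standard.

```python
from collections import defaultdict


def top_uris_by_duration(samples, n=5):
    """Rank URIs by mean request duration, slowest first.

    samples: iterable of (uri, duration_us) pairs. This shape is a
    hypothetical stand-in for sampled HTTP operations.
    """
    totals = defaultdict(lambda: [0, 0])  # uri -> [sum of durations, count]
    for uri, duration_us in samples:
        totals[uri][0] += duration_us
        totals[uri][1] += 1
    means = {uri: total / count for uri, (total, count) in totals.items()}
    # Sort by mean duration, largest first, and keep the top n entries.
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Fed with samples in which only /login.php requests are slow, the function returns that URI at the top of the list, which is the pattern the chart shows.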
At this point the problem can probably be resolved by talking with the application team to see if they have made any recent changes to the login page. However, there is additional information available that might help further diagnose the problem.
Host sFlow agents installed on the servers provide a scalable way of monitoring performance. The CPU utilization chart above shows a drop in CPU load on the web server that coincides with the reduced web throughput. It appears that the performance problem isn't related to web server CPU, but is likely the result of requests to a slow backend system.
Note: If it had been a CPU-related issue, we would have expected the latency to increase for all URIs, not just the login.php page.
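The reasoning in this note can be expressed as a rough check: compare per-URI latency before and after the drop and see whether the slowdown is confined to a few URIs or spread across most of them. The sketch below assumes hypothetical dictionaries mapping URI to mean latency, and the thresholds `factor` and `share` are arbitrary illustrative values rather than recommended settings.

```python
def slowed_uris(before, after, factor=2.0):
    """Return URIs whose mean latency grew by at least `factor`.

    before, after: dicts mapping URI -> mean latency in the same unit
    (hypothetical inputs, e.g. derived from sampled HTTP operations).
    """
    return sorted(
        uri
        for uri, old in before.items()
        if uri in after and old > 0 and after[uri] >= factor * old
    )


def looks_server_wide(before, after, factor=2.0, share=0.8):
    """True if at least `share` of the URIs seen in both periods slowed down.

    A server-wide slowdown hints at a shared resource such as CPU;
    a slowdown limited to a few URIs hints at something specific to them.
    """
    common = [uri for uri in before if uri in after]
    if not common:
        return False
    return len(slowed_uris(before, after, factor)) / len(common) >= share
```

With latencies where only /login.php has slowed, `slowed_uris` returns just that URI and `looks_server_wide` returns False, which matches the conclusion that the web server's CPU is not the bottleneck.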
Network visibility is a critical component of application performance monitoring. In this case, network traffic data can help by identifying the backend systems that the web server depends on. Fortunately, most switch vendors support the sFlow standard and the traffic data is readily accessible in sFlowTrend.
The Top servers chart above shows the top services and servers by Frames/s. The drop in traffic to the web server, 10.0.0.150, is readily apparent, as is a drop in traffic to the Memcached server, 10.0.0.151 (TCP:11211). The Memcached server is used to cache the results of database queries in order to improve site performance and scalability. However, the performance problem doesn't seem to be directly related to Memcached performance, since the amount of Memcached traffic has dropped proportionally with the HTTP traffic. If there had been an increase in Memcached traffic, this might have indicated that the Memcached server was overloaded.
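The "dropped proportionally" test can be made concrete with a small calculation. The following is a minimal sketch, assuming frame rates for each service have already been read off for the periods before and after the drop; the function names and the `tolerance` value are illustrative assumptions, not something sFlowTrend provides.

```python
def relative_change(before, after):
    """Fractional change from `before` to `after` (for example, -0.6 is a 60% drop)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (after - before) / before


def dropped_proportionally(http_before, http_after, mc_before, mc_after,
                           tolerance=0.15):
    """True if HTTP and Memcached traffic changed by about the same fraction.

    Inputs are frame rates (Frames/s) for the two services; `tolerance`
    is the largest allowed gap between the two fractional changes.
    """
    http_change = relative_change(http_before, http_after)
    mc_change = relative_change(mc_before, mc_after)
    return abs(http_change - mc_change) <= tolerance
```

If HTTP traffic falls by 60% while Memcached traffic falls by a similar fraction, the check passes and the cache is probably not the cause. If Memcached traffic rose while HTTP traffic fell, the check fails, which is the overload pattern described above.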
A final piece of information available through sFlow is the link utilization trend, which confirms that the drop in performance isn't due to a lack of network capacity.
At this point we have a pretty thorough understanding of the impact of the problem on application, server and network resources. Talking to the developers reveals a recent update to the login.php script that introduced a software bug that failed to properly cache information. The resulting increase in load to the database was causing the login page to load slowly and resulted in the drop in site throughput. Fixing the bug returned site performance to normal levels.
Note: This example is a recreation of a typical performance problem using real servers and switches generating sFlow data. However, the load is artificially generated using Apache JMeter since actual production data can't be shown.
Trying out sFlow monitoring on your own site is easy. The sFlowTrend application is a free download. There are open source sFlow modules available for popular web servers, including Apache, NGINX, Tomcat and node.js. The open source Host sFlow agent runs on most operating systems and enabling sFlow on switches is straightforward (see sFlow.org for a list of switches supporting the sFlow standard). The article, Choosing an sFlow analyzer, provides additional information for large-scale deployments.