Friday, April 29, 2011

Amazon EC2 outage


Amazon cloud glitch knocks out popular websites - Server outage hits sites Reddit, Quora and Foursquare hard, Computerworld, Thursday April 21st, 2011. The article describes the impact of a major three-day outage in one of Amazon's Elastic Compute Cloud (EC2) data centers on prominent social networking companies and their services.

The screen shot from the Amazon Service Health Dashboard shows the extent and duration of the failure. The following note, which appeared on the dashboard page, gives an initial description of the failure:

8:54 AM PDT We’d like to provide additional color on what we're working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it’s difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We’re starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.

A detailed postmortem, Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region, was published on April 29.

The failure in this case resulted from interaction between the network and the Elastic Block Store (EBS) storage service. An initial network configuration error caused a loss of network capacity, which triggered the storage service to start re-mirroring volumes, further overloading the network and further disrupting the storage service. The resulting network brownout affected not just storage and compute services, but also the control functions needed to recover.

The Amazon failure demonstrates the tight coupling between network, storage and servers in converged cloud environments. Networked storage in particular dramatically increases network loads and must be closely managed in order to avoid congestion.

The sFlow standard provides scalable monitoring of all the application, storage, server and network elements in the data center, both physical and virtual. Implementing an sFlow monitoring solution helps break down management silos, ensuring the coordination of resources needed to manage converged infrastructures, optimize performance and avoid service failures.

Monday, April 25, 2011

NGINX


The nginx-sflow-module project is an open source implementation of sFlow monitoring for the Nginx (pronounced engine x) web server. The module exports the counter and transaction structures discussed in sFlow for HTTP.

The advantage of using sFlow is the scalability it offers for monitoring the performance of large web server clusters or load balancers where request rates are high and conventional logging solutions generate too much data or impose excessive overhead. Real-time monitoring of HTTP provides essential visibility into the performance of large-scale, complex, multi-layer services constructed using Representational State Transfer (REST) architectures. In addition, monitoring HTTP services using sFlow is part of an integrated performance monitoring solution that provides real-time visibility into applications, servers and switches (see sFlow Host Structures).

The nginx-sflow-module software is designed to integrate with the Host sFlow agent to provide a complete picture of server performance. Download, install and configure Host sFlow before proceeding to install nginx-sflow-module - see Installing Host sFlow on a Linux Server. There are a number of options for analyzing cluster performance using Host sFlow, including Ganglia and sFlowTrend.

Note: the nginx-sflow-module picks up its configuration from the Host sFlow agent. The Host sFlow sampling.http setting can be used to override the default sampling setting to set a specific sampling rate for HTTP requests.
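For example, a minimal /etc/hsflowd.conf might look like the following. The collector address and sampling rates here are purely illustrative, and the exact syntax depends on the version of Host sFlow you have installed, so check the Host sFlow documentation:

```
sflow {
  polling = 20         # export counters every 20 seconds
  sampling = 400       # default 1-in-400 packet sampling
  sampling.http = 100  # override: sample 1-in-100 HTTP requests
  collector {
    ip = 10.0.0.111    # sFlow analyzer address (illustrative)
  }
}
```

With a configuration like this, nginx-sflow-module would sample 1-in-100 HTTP requests and send the results to the collector at 10.0.0.111.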

Next, download the nginx sources from http://wiki.nginx.org/Install and the nginx-sflow-module sources from http://nginx-sflow-module.googlecode.com/. The following commands compile and install nginx with the sflow-module:

tar -xvzf nginx-sflow-module-0.9.3.tar.gz
tar -xvzf nginx-1.0.0.tar.gz
cd nginx-1.0.0
./configure --add-module=/root/nginx-sflow-module-0.9.3
make
make install

Once installed, the nginx-sflow-module will stream measurements to a central sFlow analyzer. Currently, the only software that can decode HTTP sFlow is sflowtool. Download, compile and install the latest sflowtool sources on the system you are using to receive sFlow from the servers in the nginx cluster.

Running sflowtool will display output of the form:

[pp@test]$ /usr/local/bin/sflowtool
startDatagram =================================
datagramSourceIP 10.0.0.111
datagramSize 116
unixSecondsUTC 1294273499
datagramVersion 5
agentSubId 6486
agent 10.0.0.150
packetSequenceNo 6
sysUpTime 44000
samplesInPacket 1
startSample ----------------------
sampleType_tag 0:2
sampleType COUNTERSSAMPLE
sampleSequenceNo 6
sourceId 3:65537
counterBlock_tag 0:2201
http_method_option_count 0
http_method_get_count 247
http_method_head_count 0
http_method_post_count 2
http_method_put_count 0
http_method_delete_count 0
http_method_trace_count 0
http_methd_connect_count 0
http_method_other_count 0
http_status_1XX_count 0
http_status_2XX_count 214
http_status_3XX_count 35
http_status_4XX_count 0
http_status_5XX_count 0
http_status_other_count 0
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 3434
sourceId 3:65537
meanSkipCount 2
samplePool 7082
dropEvents 0
inputPort 0
outputPort 1073741823
flowBlock_tag 0:2100
extendedType socket4
socket4_ip_protocol 6
socket4_local_ip 10.0.0.150
socket4_remote_ip 10.1.1.63
socket4_local_port 80
socket4_remote_port 61401
flowBlock_tag 0:2201
flowSampleType http
http_method 2
http_protocol 1001
http_uri /favicon.ico
http_host 10.0.0.150
http_referrer http://10.0.0.150/membase.php
http_useragent Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-us) AppleW
http_bytes 284
http_duration_uS 335
http_status 404
endSample   ----------------------
endDatagram   =================================
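A couple of fields in the flow sample deserve explanation: http_protocol encodes the HTTP version as major * 1000 + minor (so 1001 above is HTTP/1.1), and meanSkipCount gives the effective 1-in-N sampling rate, so sampled counts can be scaled up to estimate totals. A small Python sketch (the function names are illustrative):

```python
def decode_http_protocol(value):
    """Decode sFlow http_protocol: major * 1000 + minor, e.g. 1001 -> HTTP/1.1."""
    return "HTTP/%d.%d" % (value // 1000, value % 1000)

def scale_sampled_count(num_samples, mean_skip_count):
    """Estimate the total represented by sampled requests.

    With 1-in-N sampling (meanSkipCount N), each sample stands for
    roughly N requests."""
    return num_samples * mean_skip_count

print(decode_http_protocol(1001))    # HTTP/1.1
print(scale_sampled_count(3434, 2))  # 3434 samples at 1-in-2 ~ 6868 requests
```

The same scaling applies to any metric derived from the flow samples, for example estimating total bytes transferred from the sampled http_bytes values.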

The -H option causes sflowtool to output the HTTP request samples using the combined log format:

[pp@test]$ /usr/local/bin/sflowtool -H
10.1.1.63 - - [05/Jan/2011:22:39:50 -0800] "GET /membase.php HTTP/1.1" 200 3494 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-us) AppleW"
10.1.1.63 - - [05/Jan/2011:22:39:50 -0800] "GET /favicon.ico HTTP/1.1" 404 284 "http://10.0.0.150/membase.php" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-us) AppleW"

Converting sFlow to combined logfile format allows existing log analyzers to be used to analyze the sFlow data. For example, the following commands use sflowtool and webalizer to create reports:

/usr/local/bin/sflowtool -H | rotatelogs log/http_log &
webalizer -o report log/*
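Since the -H output is standard combined log format, it is also easy to process with a simple script. The following Python sketch parses a combined log line of the kind shown above (the regular expression is a minimal illustration, not a complete log parser):

```python
import re

# Minimal matcher for Apache combined log format lines, such as those
# emitted by "sflowtool -H" (illustrative sketch, not a full parser).
COMBINED = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('10.1.1.63 - - [05/Jan/2011:22:39:50 -0800] '
        '"GET /favicon.ico HTTP/1.1" 404 284 '
        '"http://10.0.0.150/membase.php" "Mozilla/5.0"')

m = COMBINED.match(line)
print(m.group('client'), m.group('uri'), m.group('status'))
# 10.1.1.63 /favicon.ico 404
```

A script like this could, for example, tail the rotatelogs output and count error responses by URI in real time.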

The resulting webalizer report shows top URLs:


Finally, the real potential of HTTP sFlow is as part of a broader performance management system providing real-time visibility into applications, servers, storage and networking across the entire data center.


For example, the diagram above shows typical elements in a Web 2.0 data center (e.g. Facebook, Twitter, Wikipedia, YouTube, etc.). A cluster of web servers handles requests from users. Typically, the application logic for the web site runs on the web servers in the form of server side scripts (PHP, Ruby, ASP, etc.). The web applications access the database to retrieve and update user data. However, the database can quickly become a bottleneck, so a cache is used to store the results of database queries. The combination of sFlow from all the web servers, Memcached servers and network switches provides end-to-end visibility into performance that scales to handle even the largest data center.

Thursday, April 14, 2011

Timeline



The article, Visibility and control in HP Converged Infrastructure: A history of innovation and standards, describes the joint research between HP Labs and CERN that first demonstrated the scalability of packet sampling for monitoring data networks. The article briefly describes some of the factors that led to the adoption of the sFlow standard and goes on to describe the role of sFlow in HP's Converged Infrastructure/FlexFabric architecture.

This article reviews the evolution of the sFlow standard over the last decade and looks forward to how sFlow will continue to meet the challenges of scalable performance monitoring in the next decade.

The sFlow standard was developed to monitor high speed switched networks and sFlow's growth has closely tracked developments in Ethernet switching. The chart above plots milestones in the development of sFlow against the growing Ethernet switching market and the adoption of 1 Gigabit, 10 Gigabit and virtual switch technologies.

Ten years ago, the adoption of high speed Ethernet switching had severely limited the value of probes for monitoring network traffic. The sFlow architecture addressed this challenge by embedding instrumentation within the switches in order to provide network-wide visibility. sFlow was initially published by InMon Corp. as RFC 3176; the sFlow.org consortium was founded in 2004 in order to further develop and promote solutions based on sFlow technology.

sFlow.org 2004

Since 2004, sFlow has grown into the leading multi-vendor standard for monitoring high-speed switched networks. The growth in vendor support for sFlow has been driven by the move to 1G and, more recently, 10G Ethernet switches.
The sFlow.org site lists the large number of switches with embedded sFlow monitoring. Switch vendors supporting sFlow now include: AlaxalA, Alcatel-Lucent, Allied Telesis, Arista Networks, Blade Network Technologies, Brocade, Dell, D-Link, Enterasys, Extreme Networks, Force10 Networks, Fortinet, Hewlett-Packard, Hitachi, IBM, Juniper Networks, NEC, Netgear, Voltaire and Vyatta.

Strong vendor involvement in the sFlow standard has ensured that new management challenges are quickly addressed. For example, as wireless networking gained popularity, sFlow was extended to provide visibility into wireless networks and was available in products by 2007.

The growth in virtualization has led to the adoption of sFlow monitoring in virtual switches, providing visibility into virtualized and cloud networking. Managing converged, virtualized and cloud infrastructures requires a unified view of performance, and sFlow was extended to include storage, server and application performance monitoring.

sFlow.org 2010

Where will the next decade take sFlow?

Demand for more energy efficiency is driving current efforts to extend sFlow to include power metering.

Expect to see sFlow in an increasing number of 40G and 100G switching products. The sFlow protocol is designed to monitor high speed switch fabrics and provides low cost, wire-speed monitoring at speeds of 100 Gbit/s and beyond, with the scalability to monitor the large, flat, layer 2 fabrics emerging to support virtualization.

Finally, the unique scalability of sFlow dramatically simplifies management by providing a single, centralized view of performance across all resources in the data center. Measurement eliminates uncertainty and reduces the complexity of managing large systems. An effective monitoring system is the foundation for automation: reducing costs, improving efficiency and optimizing performance in the data center. In the future, expect to see sFlow monitoring tightly integrated in data center orchestration tools, fully exploiting the flexibility of virtualization and convergence to automatically adjust to changing workloads. For additional information, the Data center convergence, visibility and control presentation describes the critical role that measurement plays in managing costs and optimizing performance.

Monday, April 4, 2011

The s in sFlow


Have you ever wondered what the s in sFlow stands for? While the s is meant to indicate sampling (of counters and packets), the wide range of applications for sFlow suggests additional meanings:
  • s for Standard, the sFlow standard ensures interoperable, multi-vendor performance monitoring.
  • s for Switch, broad support for the sFlow standard by switch vendors allows best-in-class products to be selected while maintaining end-to-end visibility from wireless and wired campus switches to high performance data center switches.
  • s for Storage, sFlow provides visibility into all types of networked storage, including Ethernet Storage Area Network (SAN) technologies such as FCoE and AoE.
  • s for Server, standard sFlow performance metrics from servers simplify management by eliminating the need for separate network and server monitoring systems.
  • s for System, sFlow is an integrated measurement system, providing the visibility into switches, storage and servers needed to manage performance in converged, virtualized and cloud environments.
  • s for Service, sFlow embedded in server protocols like HTTP and Memcache provides visibility into application transactions and response times. Linking application, server and network performance allows sFlow to deliver end-to-end visibility into services, mapping services and the resources that they depend on.
  • s for Security, sFlow provides the visibility needed to detect unauthorized activity and enforce security policies.
  • s for Simplify, sFlow eliminates complexity. Simplifying management reduces costs, increases flexibility and improves scalability.
  • s for Scalability, sFlow has the scalability needed to manage the performance of even the largest data center. 
The requirement for scalability brings us back to sampling: sFlow's distributed packet and counter sampling mechanisms are fundamental building blocks that give sFlow the scalability to deliver data center wide visibility and control.