sFlow: April 2010

Thursday, April 29, 2010

Configuring FortiGate appliances

The recent FortiOS 4.0 MR2 release adds sFlow support to Fortinet's FortiGate® appliances.

The following commands configure a FortiGate appliance to sample packets at 1-in-512, poll counters every 30 seconds and send sFlow to an analyzer (10.0.0.50) over UDP using the default sFlow port (6343):

config system sflow
    set collector-ip 10.0.0.50
    set collector-port 6343
end

Then for each interface:

config system interface
    edit <interface>
        set sflow-sampler enable
        set sample-rate 512
        set sample-direction both
        set polling-interval 30
    next
end

A previous posting discussed the selection of sampling rates. Additional information can be found on the Fortinet web site.
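As a rough illustration of what a 1-in-512 sampling rate means for accuracy, the sketch below scales a sample count up to an estimated packet total and applies the commonly cited packet sampling bound, %error <= 196 * sqrt(1/c), where c is the number of samples in the traffic class of interest (95% confidence). The function names and the example sample count are assumptions made for illustration; they are not part of FortiOS or of any sFlow analyzer.

```python
import math


def estimate_total_packets(samples: int, sampling_rate: int) -> int:
    """Scale a count of sampled packets up to an estimated packet total."""
    return samples * sampling_rate


def percent_error_bound(samples: int) -> float:
    """Approximate upper bound on the percentage error (95% confidence)
    for a traffic class represented by `samples` sampled packets."""
    if samples <= 0:
        raise ValueError("need at least one sample")
    return 196.0 * math.sqrt(1.0 / samples)


if __name__ == "__main__":
    rate = 512          # matches "set sample-rate 512" above
    samples = 10_000    # hypothetical number of samples received
    print(estimate_total_packets(samples, rate))
    print(round(percent_error_bound(samples), 2))
```

Note that the error bound depends only on the number of samples collected, not on the sampling rate itself, which is why a coarser rate can still give accurate results on busy links.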

See Trying out sFlow for suggestions on getting started with sFlow monitoring and reporting.

Sunday, April 18, 2010

Cluster performance

Convergence simplifies the data center by connecting flexible pools of storage and computation using a high-speed switched Ethernet fabric. Scale-out computing and storage solutions provide a way to efficiently exploit the resources within a converged data center to deliver scalable services. Scale-out architectures make use of clusters of systems to deliver services. Systems can be added and removed from the cluster to increase and decrease capacity to match demand. Converged data centers make it easy to assign systems to clusters and move systems between clusters as demand changes, increasing efficiency and flexibility. Examples of scale-out computing include: web farms, Hadoop/Map-Reduce clusters, NAS/iSCSI storage clusters and memcached clusters.

The performance of a cluster depends on the performance of the systems in the cluster and the network that connects them. Monitoring cluster performance requires a scalable monitoring solution that integrates network and system monitoring. Most switch vendors support the sFlow standard for network performance monitoring. Host sFlow extends visibility to include server performance, providing the integrated, scalable view of network and system performance needed to manage a converged network and the service clusters that it contains.

The image above shows the performance of a cluster of 1,000 servers. The charts trend combined measurements from all the servers to give a picture of the overall performance of the cluster. The charts simplify management by treating the cluster as if it were a single server with 8,000 processors, 16 terabytes of memory and 1 terabit/second of network bandwidth.

The sFlow analyzer has a real-time view of the performance of all the servers in the cluster and can easily combine the data to generate these charts. If problems are detected with the overall cluster performance, it is easy to drill down to the individual servers and identify the source of the problem (see Top servers).
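The "single large server" view is simple summation over per-server figures. The sketch below is a minimal illustration, assuming each of the 1,000 servers has 8 processors, 16 GB of memory and a 1 Gbit/s network interface; those per-server values are an assumption chosen to be consistent with the cluster totals quoted above, and the `ServerCapacity` type is hypothetical rather than part of any sFlow tool.

```python
from dataclasses import dataclass


@dataclass
class ServerCapacity:
    cpus: int
    memory_gb: int
    nic_gbps: float


def cluster_totals(servers):
    """Combine per-server capacity into cluster-wide totals."""
    return {
        "cpus": sum(s.cpus for s in servers),
        "memory_tb": sum(s.memory_gb for s in servers) / 1000,
        "network_tbps": sum(s.nic_gbps for s in servers) / 1000,
    }


if __name__ == "__main__":
    # Assumed homogeneous cluster: 1,000 identical servers.
    cluster = [ServerCapacity(cpus=8, memory_gb=16, nic_gbps=1.0)
               for _ in range(1000)]
    print(cluster_totals(cluster))
```

The same summation applies to the time-varying metrics (CPU utilization, memory in use, bytes per second), computed once per polling interval.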

The traffic visibility from the switches provides context for the cluster performance metrics, identifying the clients making use of cluster services and the back end resources that the cluster depends on. The chart above shows total network activity for the cluster using sFlow data from all the switches (see Hybrid server monitoring). The chart provides a combined view of cluster network activity, integrating data from all switch ports (6,000) to generate a chart that represents the total cluster network activity (see Choosing an sFlow analyzer).

In this case, it is easy to see that the cluster is making heavy use of NFS storage (provided by an NFS scale-out storage array) and that the overall network traffic is dominated by storage traffic (see Networked storage). The cluster-wide network and server performance charts make it easy to see correlations between metrics. In this case, it is apparent that the NFS traffic is strongly correlated with system swapping activity in the cluster.
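A visual correlation like this can also be checked numerically. The sketch below computes a Pearson correlation coefficient between two time series sampled at the same intervals; the series values are made up for illustration and are not taken from the charts, and nothing here implies that any particular sFlow analyzer computes correlation this way.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical per-interval samples: NFS traffic and swap activity.
    nfs_mbps = [120, 340, 560, 610, 300, 150]
    swap_pages_per_sec = [10, 45, 80, 95, 40, 12]
    print(round(pearson(nfs_mbps, swap_pages_per_sec), 3))
```

A value close to 1 indicates that the two metrics rise and fall together; it does not by itself show which one causes the other.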

Performance management in a converged data center requires a converged approach to data center visibility (see Management silos). The sFlow architecture delivers a centralized, real-time view of performance across all the networking, storage and computing elements in the data center, offering visibility at all levels, from individual components, to scale-out clusters, to the entire data center.

Saturday, April 10, 2010

Top servers

The image above shows the output of the Linux "top" command. Each row in the table corresponds to a process and the values in the row indicate how much of the system resources (memory and CPU) are consumed by the process. Sorting the table quickly identifies the top consumers of system resources.

Identifying "top" processes is a staple of system management and most operating systems have a tool that displays a sorted table of processes (e.g. Unix top, Windows Task Manager, OS X Activity Monitor).

When managing a data center full of servers, a top servers tool provides similar benefits, rapidly identifying servers with performance problems.

In the top servers table shown above, each row corresponds to a server in the data center. Sorting the table by server load quickly finds the most heavily loaded servers.

The challenge in constructing a data center wide top servers table is finding a scalable way to collect performance metrics from all the servers in the data center so that the metrics can be combined and sorted in a single table.

The screen capture shows actual data collected from over 1,000 servers. A Host sFlow agent was installed on each server. The agent is an open source implementation of the Host sFlow standard currently being developed at sFlow.org. The agent requires minimal server resources: only 50K of memory and negligible CPU. The combined network traffic from all 1,000+ Host sFlow agents is a little over 100K bits per second.

The Host sFlow agents provide the sFlow analyzer with a real-time view of the load on all the servers in the data center, making it possible to construct the data center wide top servers table.
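Once the metrics from every server are in one place, building the table is a sort. The sketch below is a minimal illustration, assuming each server's latest metrics are held as a dictionary; the field names (`host`, `load_one`, `cpu_util`) and the sample values are hypothetical and do not reflect the Host sFlow export format.

```python
def top_servers(metrics, key="load_one", n=10):
    """Return the n servers with the highest value for the given metric."""
    return sorted(metrics, key=lambda m: m[key], reverse=True)[:n]


if __name__ == "__main__":
    # Hypothetical latest readings, one dictionary per server.
    latest = [
        {"host": "web-01", "load_one": 0.8, "cpu_util": 22.0},
        {"host": "web-02", "load_one": 7.5, "cpu_util": 96.0},
        {"host": "db-01", "load_one": 3.1, "cpu_util": 61.0},
    ]
    for row in top_servers(latest, key="load_one", n=2):
        print(row["host"], row["load_one"])
```

Changing the `key` argument re-sorts the same data by a different metric, which mirrors clicking a column header in the table.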

Host sFlow combined with sFlow monitoring built into the network switches (see Hybrid server monitoring) provides a complete picture of the performance of each server in the data center. The traffic visibility from the switches provides context for a server's performance metrics, identifying the clients making use of its services and the back end resources that it depends on.

Sunday, April 4, 2010

Hybrid server monitoring

(image from Fiber Channel over Ethernet (FCoE) Primer)

Current trends toward convergence tightly link networking, storage and system performance. In a converged environment, system administrators need to be aware of network traffic linking applications, storage and users in order to avoid performance problems. The dynamic application environment created by virtual machine migration, scale-out storage and elastic service pools requires that server administrators be aware of network I/O in order to optimize performance and avoid creating problems through poor workload placement choices.

Server operating systems and hardware integrate the instrumentation needed to monitor CPU, memory and disk performance. However, server network adapters typically lack the hardware support needed for traffic monitoring, leaving system administrators with a very limited view of server network I/O. Without hardware support, the network monitoring tools that are available to system administrators are typically used only for troubleshooting since using the tools operationally would adversely impact server performance.

Solving the problem of poor server network visibility requires a broader perspective. Depending on the data center network topology, each server is attached to a blade, top of rack (ToR) or end of row (EoR) switch. The diagram above illustrates the one-to-one relationship between the network adapter and the switch connecting the server to storage and networking resources. Monitoring traffic on a server's switch port provides a complete picture of server network I/O.

Switch vendors recognize the need for network-wide visibility and most have implemented hardware support for the sFlow standard in their data center switches. Combining performance metrics from the server with network visibility from the adjacent switch creates a hybrid monitoring solution that exploits the strengths of existing server and switch instrumentation to provide a complete picture of system performance.
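The hybrid approach amounts to a join between two data sets: server metrics keyed by server, and traffic measurements keyed by switch port, linked by the one-to-one server-to-port relationship. The sketch below is a minimal illustration under that assumption; the dictionary layouts, field names and the `server_to_port` mapping are all hypothetical and would in practice come from the analyzer's own topology and measurement data.

```python
def hybrid_view(host_metrics, port_traffic, server_to_port):
    """Merge each server's own metrics with traffic seen on its switch port."""
    view = {}
    for server, metrics in host_metrics.items():
        port = server_to_port.get(server)
        # Servers with no known port keep their host metrics only.
        traffic = port_traffic.get(port, {}) if port is not None else {}
        view[server] = {**metrics, **traffic}
    return view


if __name__ == "__main__":
    host_metrics = {"web-01": {"cpu_util": 35.0, "mem_used_gb": 6.2}}
    port_traffic = {("tor-sw-1", 12): {"in_mbps": 410.0, "out_mbps": 95.0}}
    server_to_port = {"web-01": ("tor-sw-1", 12)}
    print(hybrid_view(host_metrics, port_traffic, server_to_port))
```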

Similar challenges exist in virtual server environments. The integration of sFlow traffic monitoring in the virtual switch (e.g. Xen Cloud Platform) with system performance metrics obtained from virtual machines provides a complete picture of cloud performance. The emerging VEPA standard allows much of the virtual switch functionality to be offloaded from the server software to the adjacent physical switch hardware. VEPA will be a firmware upgrade for most switches, so selecting a switch with sFlow support today provides visibility into physical server network I/O and also provides an upgrade path to extend visibility into the virtualization layer as VEPA becomes available.

One challenge remains before this hybrid monitoring strategy can be widely implemented. Currently, performance monitoring of servers is highly fragmented. Each server hardware, operating system, and system management vendor creates its own agents and software for performance monitoring, none of which interoperate. The emerging sFlow host and power extensions define standard export formats so that performance management tools can easily combine network and server measurements to build a complete picture of data center performance.