Sunday, April 18, 2010

Cluster performance


Convergence simplifies the data center by connecting flexible pools of storage and computation using a high-speed switched Ethernet fabric. Scale-out computing and storage solutions provide a way to efficiently exploit the resources within a converged data center to deliver scalable services. Scale-out architectures use clusters of systems to deliver services, and systems can be added to or removed from a cluster to match capacity to demand. Converged data centers make it easy to assign systems to clusters and move systems between clusters as demand changes, increasing efficiency and flexibility. Examples of scale-out computing include web farms, Hadoop/MapReduce clusters, NAS/iSCSI storage clusters and memcached clusters.

The performance of a cluster depends on the performance of the systems in the cluster and the network that connects them. Monitoring cluster performance requires a scalable monitoring solution that integrates network and system monitoring. Most switch vendors support the sFlow standard for network performance monitoring. Host sFlow extends visibility to include server performance, providing the integrated, scalable view of network and system performance needed to manage a converged network and the service clusters that it contains.

The image above shows the performance of a cluster of 1,000 servers. The charts trend combined measurements from all the servers to give a picture of the overall performance of the cluster. The charts simplify management by treating the cluster as if it were a single server with 8,000 processors, 16 terabytes of memory and 1 terabit/second of network bandwidth.
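The "single server" view comes down to summing per-server measurements. The following is a minimal sketch of that arithmetic, not the analyzer's implementation. It assumes 1,000 identical servers, each with 8 processors, 16 gigabytes of memory and a 1 gigabit/second network interface (per-server figures inferred from the totals above), and the `ServerStats` record and its field names are hypothetical rather than Host sFlow counter names.

```python
from dataclasses import dataclass


@dataclass
class ServerStats:
    """Hypothetical per-server capacity record (field names are illustrative)."""
    cpus: int
    mem_bytes: int
    nic_bps: int


def cluster_totals(servers):
    """Sum each capacity field so the cluster can be viewed as one large server."""
    return ServerStats(
        cpus=sum(s.cpus for s in servers),
        mem_bytes=sum(s.mem_bytes for s in servers),
        nic_bps=sum(s.nic_bps for s in servers),
    )


# Assumed uniform cluster: 8 CPUs, 16 GB of memory and a 1 Gb/s NIC per server.
servers = [ServerStats(cpus=8, mem_bytes=16 * 10**9, nic_bps=10**9) for _ in range(1000)]
totals = cluster_totals(servers)

print(totals.cpus)                 # processors
print(totals.mem_bytes / 10**12)   # terabytes of memory
print(totals.nic_bps / 10**12)     # terabits/second of bandwidth
```

The same summation applies to time-varying metrics such as CPU utilization or bytes transferred, computed once per polling interval.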

The sFlow analyzer has a real-time view of the performance of all the servers in the cluster and can easily combine the data to generate these charts. If problems are detected with the overall cluster performance, it is easy to drill down to the individual servers and identify the source of the problem (see Top servers).
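Drilling down from a cluster-wide chart to individual servers can be thought of as ranking hosts by the metric of interest. The sketch below illustrates the idea under stated assumptions: the host names, the load values and the `top_servers` helper are all hypothetical, and an actual sFlow analyzer may rank servers differently.

```python
import heapq


def top_servers(metric_by_host, n=5):
    """Return the n (host, value) pairs with the largest metric values, highest first."""
    return heapq.nlargest(n, metric_by_host.items(), key=lambda item: item[1])


# Hypothetical latest load-average readings keyed by host name.
load = {"web-01": 0.42, "web-02": 3.9, "web-03": 0.37, "web-04": 1.2}

for host, value in top_servers(load, n=2):
    print(host, value)
```

Using `heapq.nlargest` avoids sorting every host when only a handful of outliers is needed, which matters more as the cluster grows.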


The traffic visibility from the switches provides context for the cluster performance metrics, identifying the clients making use of cluster services and the back-end resources that the cluster depends on. The chart above shows total network activity for the cluster using sFlow data from all the switches (see Hybrid server monitoring). It integrates data from all 6,000 switch ports into a single view of total cluster network activity (see Choosing an sFlow analyzer).

In this case, it is easy to see that the cluster is making heavy use of NFS storage (provided by an NFS scale-out storage array) and that the overall network traffic is dominated by storage traffic (see Networked storage). The cluster-wide network and server performance charts also make it easy to see correlations between metrics. Here, the NFS traffic is strongly correlated with system swapping activity in the cluster.
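A visual correlation like this one can be checked numerically. The sketch below computes a Pearson correlation coefficient between two time series. The sample values are invented for illustration and are not measurements from the cluster described here, and the `pearson` helper is an assumption about how one might quantify the relationship, not a feature of any particular sFlow analyzer.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length and at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Invented samples taken at the same polling intervals (illustrative only).
nfs_mbps = [120, 340, 560, 410, 150]     # cluster-wide NFS traffic
swap_pages = [10, 45, 80, 55, 14]        # cluster-wide pages swapped per second

r = pearson(nfs_mbps, swap_pages)
print(round(r, 3))
```

A value of `r` close to 1 indicates that the two series rise and fall together. Correlation alone does not establish which metric drives the other.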

Performance management in a converged data center requires a converged approach to data center visibility (see Management silos). The sFlow architecture delivers a centralized, real-time view of performance across all the networking, storage and computing elements in the data center, offering visibility at all levels, from individual components, to scale-out clusters, to the entire data center.
