Tuesday, November 16, 2010

Complexity kills


The July 2010 presentation, NetFlow/IPFIX Various Thoughts, from the IETF 3rd NMRG Workshop on NetFlow/IPFIX Usage in Network Management, describes some of the challenges that still need to be addressed in NetFlow/IPFIX. In particular, the slide above describes how increased flexibility has resulted in greater complexity when configuring and deploying NetFlow/IPFIX monitoring systems.

In contrast, sFlow implementations have very few configuration options. While there are superficial similarities between sFlow and NetFlow/IPFIX, the two approaches to network performance management reflect profound differences between the design goals of the two standards (see Standards).

NetFlow/IPFIX was developed to export WAN traffic measurements and is typically deployed in IP routers. Configuring routers is a complex task, requiring the configuration of subnets, routing protocols, WAN interfaces and so on. Many of the functions in a router are implemented in software, providing a flexible platform that permits complex measurements to be made. Over time, options have been added to NetFlow/IPFIX in order to export increasingly complex measurements used to characterize WAN traffic.

sFlow evolved to provide end-to-end monitoring of high-speed layer 2/3 Ethernet switches. Ethernet switches offer plug-and-play connectivity and require very little configuration. Unlike routers, switches perform most of their functions in hardware, relying on software only to perform simple management tasks. The need to embed measurement in hardware resulted in a standard that is very simple with minimal configuration options. However, the basic sFlow measurements, while simple to configure and implement in switches, provide a rich source of information about the performance of switched networks. Instead of relying on the switches to analyze the traffic, raw data is sent to a central sFlow analyzer (see Choosing an sFlow analyzer). The sFlow architecture results in a highly scalable system that can monitor the large numbers of high-speed switch ports found in layer 2 networks (see Superlinear).
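To make this division of labor concrete, the sketch below simulates the two measurements an sFlow agent exports, 1-in-N packet sampling and interface counters, leaving all analysis to the central collector. This is an illustrative simulation, not an sFlow implementation; the sampling rate, the traffic mix and the in-memory "export" are invented for the example.

```python
# Minimal sketch (not an sFlow implementation): the agent side only does
# 1-in-N packet sampling and counter maintenance; analysis happens centrally.
import random
from collections import Counter

SAMPLING_RATE = 1024          # illustrative 1-in-N sampling rate
samples = []                  # sampled packet headers "exported" to the collector
octets = frames = 0           # interface counters, polled periodically

def packet_arrived(header, length):
    """Called for every frame the simulated switch forwards."""
    global octets, frames
    frames += 1
    octets += length
    # Statistical sampling: on average 1 in SAMPLING_RATE packets is chosen.
    if random.randrange(SAMPLING_RATE) == 0:
        samples.append(header)   # in sFlow this would go out in a UDP datagram

# Simulate some traffic: a mix of traffic classes identified only by a label.
traffic_mix = ["web"] * 700 + ["storage"] * 250 + ["backup"] * 50
for _ in range(100_000):
    packet_arrived(random.choice(traffic_mix), length=1000)

# The central analyzer sees only the samples, yet recovers the traffic mix.
print("counters:", frames, "frames,", octets, "octets")
print("sampled mix:", Counter(samples).most_common())
```

Because the sampling is statistical, the collector recovers the relative traffic mix from a tiny fraction of the packets, which is what makes the measurement simple enough to implement in switch hardware with minimal configuration.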

The goal of convergence is to simplify data centers, creating flexible pools of storage and computing running over a flat, high bandwidth, low latency, Ethernet fabric (see Convergence). Eliminating complexity is essential if the scalability and flexibility of a converged infrastructure is to be realized.

Microsoft's Chief Software Architect, Ray Ozzie, eloquently describes the dangers of complexity in Dawn of a New Day: "Complexity kills. Complexity sucks the life out of users, developers and IT. Complexity makes products difficult to plan, build, test and use. Complexity introduces security challenges. Complexity causes administrator frustration."

Maintaining visibility in large scale, dynamic data center environments requires a measurement technology that is designed for the task. sFlow is a mature, multi-vendor standard, supported by most switch vendors, that delivers the scalable, plug-and-play visibility needed to manage performance in converged data center environments.

Finally, the end-to-end visibility that sFlow provides is a critical element in building scalable systems. Measurement eliminates uncertainty and reduces the complexity of managing large systems (see Scalability). An effective monitoring system is the foundation for automation, reducing costs, improving efficiency and optimizing performance in the data center.

Sunday, November 14, 2010

Shrink ray


(image from Despicable Me)

In the movie, Despicable Me, a shrink ray features prominently, making it possible to steal the Moon by shrinking it small enough to fit in the villain's pocket.

The ability to handle large, high-speed networks is one of the key benefits of the sFlow standard. This scalability results from sFlow's packet sampling technology, which acts like a shrink ray, shrinking network traffic down so that it is easier to analyze and reducing even the largest network to a manageable size.


Shrinking an image is another way of illustrating the scaling function that an sFlow monitoring system performs. When shrinking an image, sampling and compression operations reduce the amount of data needed to store the image while preserving the essential features of the original.

Choosing the right sampling rate is the key to a successful sFlow deployment. The sampling rate acts as the network shrink factor, reducing the resources needed to manage the network while preserving the essential features needed for a clear picture of network activity. For example, a sampling rate of 1-in-8192 shrinks even the busiest network down to a manageable size (see AMS-IX).
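To show what that shrink factor means in practice, here is a rough, back-of-the-envelope sketch of how sampled counts are scaled back up into traffic estimates. The sample count, frame size and sampling rate are invented numbers, and the error formula is the rule of thumb commonly quoted for random packet sampling.

```python
# Sketch: scaling sampled counts back up to estimate total traffic,
# using hypothetical numbers for a 1-in-8192 deployment.
SAMPLING_RATE = 8192            # 1-in-8192, as in the AMS-IX example
samples_seen  = 12_000          # packet samples received for a traffic class
avg_frame_len = 800             # bytes, taken from the sampled headers

est_frames = samples_seen * SAMPLING_RATE
est_bytes  = est_frames * avg_frame_len

# Rule of thumb from the packet sampling literature: at 95% confidence the
# relative error of a class containing n samples is roughly 196 * sqrt(1/n) percent.
pct_error = 196 * (1.0 / samples_seen) ** 0.5

print(f"estimated {est_frames:,} frames, {est_bytes/1e9:.1f} GB, "
      f"+/- {pct_error:.1f}%")
```

The larger the number of samples collected for a traffic class, the tighter the estimate, which is why even aggressive sampling rates give an accurate picture of the dominant flows on a busy network.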

Monday, November 1, 2010

NUMA

SMP architecture

As the number of processor cores increases, system architectures have moved from Symmetric Multi-Processing (SMP) to Non-Uniform Memory Access (NUMA). SMP systems are limited in scalability by contention for access to the shared memory. In a NUMA system, memory is divided among groups of CPUs, increasing the bandwidth and reducing the latency of access to memory within a node at the cost of increased latency for non-local memory access. Intel Xeon (Nehalem) and AMD Opteron (Magny-Cours) based servers provide commodity examples of the NUMA architecture.

NUMA architecture

System software running on a NUMA architecture needs to be aware of the processor topology in order to properly allocate memory and processes to maximize performance (see Process Scheduling Challenges in the Era of Multi-Core Processors). Since NUMA-based servers are widely deployed, most server operating systems are NUMA-aware and take location into account when scheduling tasks and allocating memory.
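For example, on a Linux server the topology information the operating system uses for these decisions is exposed through sysfs and can be inspected with a few lines of code (the node layout and distance values will, of course, vary with the hardware):

```python
# Sketch: discovering NUMA topology on Linux by reading sysfs
# (/sys/devices/system/node). Output depends on the machine.
import glob, os

for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    name = os.path.basename(node)
    with open(os.path.join(node, "cpulist")) as f:
        cpus = f.read().strip()
    with open(os.path.join(node, "distance")) as f:
        # One relative distance per node; 10 = local, larger = more remote.
        distances = f.read().split()
    print(f"{name}: cpus {cpus}, distances {distances}")
```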

Virtualization platforms also need to be location-aware when allocating resources to virtual machines on NUMA systems. The article, How to optimize VM memory and processor performance, describes some of the issues involved in allocating virtual machine vCPUs to NUMA nodes.
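As a rough illustration of this kind of tuning, the sketch below uses the libvirt Python bindings to pin a virtual machine's vCPUs to the physical CPUs of a single NUMA node so that its memory accesses stay local. The domain name "vm1" and the node 0 CPU set are assumptions for the example; in practice the hypervisor's own NUMA placement logic should be preferred where it is available.

```python
# Sketch (assumes libvirt-python and a running domain named "vm1"):
# pin each vCPU to the physical CPUs of NUMA node 0.
import libvirt

NODE0_CPUS = {0, 1, 2, 3}                 # physical CPUs in node 0 (example values)

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("vm1")            # hypothetical domain name

ncpus = conn.getInfo()[2]                 # number of physical CPUs on the host
cpumap = tuple(i in NODE0_CPUS for i in range(ncpus))

for vcpu in range(dom.maxVcpus()):
    dom.pinVcpu(vcpu, cpumap)             # restrict this vCPU to node 0's CPUs
```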


Ethernet networks exhibit similar, NUMA-like properties: sending data over a short transmission path offers lower latency and higher bandwidth than sending the data over a longer transmission path. While bandwidth within an Ethernet switch is high (multi-terabit backplanes are not uncommon), the bandwidth of the Ethernet links connecting switches is only 1Gbit/s or 10Gbit/s (with 40Gbit/s and 100Gbit/s on the horizon). Shortest path bridging (see 802.1aq and TRILL) further increases the bandwidth, and reduces the latency, of communication between systems that are "close".

Virtualization and the need to support virtual machine mobility (e.g. vMotion/XenMotion/Xen Live Migration) is driving the adoption of large, flat, high-speed, layer-2, switched Ethernet fabrics in the data center. A layer-2 fabric allows a virtual machine to keep its IP address and maintain network connections when it moves (performing a "live" migration). However, while a layer-2 fabric provides transparent connectivity that allows virtual machines to move, the performance of the virtual machine is highly dependent on its communication patterns and location.

As servers are pooled into large clusters, virtual machines can easily be moved, not just between NUMA nodes within a server, but between servers within the cluster. For optimal performance, the cluster orchestration software needs to be aware of the network topology and workloads in order to place each VM in the optimal location. The paper, Tashi: Location-aware Cluster Management, describes a network aware cluster management system, currently supporting Xen and KVM.
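The following sketch illustrates the kind of topology-aware placement score such a system might compute; the server loads, traffic matrix and distance values are invented for the example, although in practice they would be derived from sFlow measurements of the hosts and switches.

```python
# Sketch of a topology-aware placement score: lower is better, penalizing
# busy servers and long network paths to chatty peer VMs. All data is invented.

# Network "distance" between servers: 0 = same server, 1 = same top-of-rack
# switch, 2 = traffic must cross the core.
distance = {
    ("serverA", "serverA"): 0, ("serverA", "serverB"): 1, ("serverA", "serverC"): 2,
    ("serverB", "serverB"): 0, ("serverB", "serverC"): 2, ("serverC", "serverC"): 0,
}
def dist(a, b):
    return distance.get((a, b), distance.get((b, a)))

cpu_load  = {"serverA": 0.9, "serverB": 0.4, "serverC": 0.3}  # fraction busy
traffic   = {("vm1", "vm2"): 800.0}                           # Mbit/s between VMs
placement = {"vm2": "serverC"}                                # already-placed peers

def score(vm, server, peers=placement):
    """Combine server load with network cost to the VM's communication peers."""
    s = cpu_load[server]
    for (a, b), mbps in traffic.items():
        if vm in (a, b):
            peer = b if vm == a else a
            if peer in peers:
                s += mbps / 1000.0 * dist(server, peers[peer])
    return s

best = min(cpu_load, key=lambda srv: score("vm1", srv))
print("place vm1 on", best)
```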

The inclusion of the sFlow standard in network switches and virtualization platforms (see XCP, XenServer and KVM) provides the visibility into each virtual machine's current workload and dependencies, including tracking the virtual machine as it migrates across the data center.


In the article, Network visibility in the data center, an example was presented showing how virtual machine migration could cause a cascade of performance problems. The illustration above demonstrates how virtual machine migration can be used to optimize performance. In this example, sFlow monitoring identifies that two virtual machines, VM1 and VM2, are exchanging significant amounts of traffic across the core of the network. In addition, sFlow data from the servers shows that while the server currently hosting VM1 is close to capacity, there is spare capacity on the server hosting VM2. Migrating VM1 to VM2's server reduces network traffic through the core and reduces the latency of communication between VM1 and VM2.
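A sketch of how an analyzer might surface this kind of opportunity is shown below: sampled flow records are scaled up into a VM-to-VM traffic matrix, and pairs that exchange heavy traffic across the core are flagged as migration candidates. The records, sampling rate, VM-to-server mapping and threshold are all illustrative assumptions.

```python
# Sketch: build a VM-to-VM traffic matrix from sampled flow records and
# flag chatty pairs hosted on different servers. All inputs are invented.
from collections import defaultdict

SAMPLING_RATE = 2048                      # example agent sampling rate
vm_location = {"VM1": "serverA", "VM2": "serverC"}

# (src_vm, dst_vm, sampled_bytes) tuples as an analyzer might accumulate them.
samples = [("VM1", "VM2", 1500), ("VM1", "VM2", 1500), ("VM2", "VM1", 64)] * 5000

matrix = defaultdict(int)
for src, dst, size in samples:
    pair = tuple(sorted((src, dst)))
    matrix[pair] += size * SAMPLING_RATE  # scale sampled bytes back up

THRESHOLD = 10e9                          # bytes; flag pairs above this
for (a, b), total in matrix.items():
    if total > THRESHOLD and vm_location[a] != vm_location[b]:
        print(f"{a}<->{b}: ~{total/1e9:.1f} GB across the core; "
              f"consider co-locating on {vm_location[a]} or {vm_location[b]}")
```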

Note: For many protocols low latency is extremely important; examples include Memcached, FCoE, NFS, iSCSI and RDMA over Converged Ethernet (RoCE). It's the Latency, Stupid is an excellent, if somewhat dated, article describing the importance of low latency in networks. The article, Latency Is Everywhere And It Costs You Sales - How To Crush It, presents a number of examples demonstrating the value of low latency and discusses strategies for reducing latency.

The virtual machine migration examples illustrate the value of the integrated view of network, storage, system and application performance that sFlow provides (see sFlow Host Structures). More broadly, visibility is the key to controlling costs, improving efficiency, reducing power and optimizing performance in the data center.

Finally, there are two interesting trends taking data centers in opposite directions. From the computing side, there is a move from SMP to NUMA systems in order to increase scalability and performance. On the networking side, there is a trend toward creating non-blocking architectures, analogous to a move from the current NUMA structure of networking to an SMP model. While there is an appeal to hiding the network from applications in order to create a "uniform" cloud, the physics of data transmission is inescapable: the shorter the communication path, the greater the bandwidth and the lower the latency. Instead of trying to hide the network, a better long-term strategy is to make the network structure and performance visible to system software so that it appears as additional tiers in the NUMA hierarchy, allowing operating systems, hypervisors and cluster orchestration software to optimally position workloads and manage the network resources needed to deliver cloud services. Bringing network resources under the control of a unified "cloud operating system" will dramatically simplify management and ensure the tight coordination of resources needed for optimal performance.