Monday, November 1, 2010

NUMA

SMP architecture

As the number of processor cores increases, system architectures have moved from Symmetric Multi-Processing (SMP) to Non-Uniform Memory Access (NUMA). SMP systems are limited in scalability by contention for access to the shared memory. In a NUMA system, memory is divided among groups of CPUs, increasing the bandwidth and reducing the latency of access to memory within a module, at the cost of increased latency for non-local memory access. Intel Xeon (Nehalem) and AMD Opteron (Magny-Cours) based servers provide commodity examples of the NUMA architecture.

NUMA architecture
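
To make the local-versus-remote distinction concrete, here is a minimal C sketch (an illustration, not code from the article) that uses Linux's libnuma library to keep a thread and the memory it allocates on the same node:

    /* Minimal sketch: keep a thread and its memory on the same NUMA node
     * using Linux's libnuma. Compile with: gcc local_alloc.c -lnuma */
    #include <stdio.h>
    #include <stdlib.h>
    #include <numa.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA not supported on this system\n");
            return EXIT_FAILURE;
        }

        int node = 0;                     /* target NUMA node */
        size_t size = 64 * 1024 * 1024;   /* 64MB buffer */

        numa_run_on_node(node);           /* restrict this thread to node 0's CPUs */
        char *buf = numa_alloc_onnode(size, node);  /* memory physically on node 0 */
        if (buf == NULL) {
            perror("numa_alloc_onnode");
            return EXIT_FAILURE;
        }

        buf[0] = 1;                       /* touch the memory; accesses stay local */

        numa_free(buf, size);
        return EXIT_SUCCESS;
    }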

System software running on a NUMA architecture needs to be aware of the processor topology in order to allocate memory and processes for maximum performance (see Process Scheduling Challenges in the Era of Multi-Core Processors). Since NUMA-based servers are widely deployed, most server operating systems are NUMA aware and take location into account when scheduling tasks and allocating memory.
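
As an illustration of what "NUMA aware" means in practice, the following sketch uses libnuma to discover which node owns each CPU, the information a scheduler needs before it can make location-aware decisions (illustrative code, not taken from any particular operating system):

    /* Sketch: discovering the NUMA topology with libnuma.
     * Compile with: gcc topology.c -lnuma */
    #include <stdio.h>
    #include <numa.h>

    int main(void)
    {
        if (numa_available() < 0) return 1;

        int nodes = numa_max_node() + 1;
        int ncpus = numa_num_configured_cpus();

        printf("NUMA nodes: %d\n", nodes);
        for (int cpu = 0; cpu < ncpus; cpu++) {
            int node = numa_node_of_cpu(cpu);  /* which node owns this CPU? */
            if (node >= 0)
                printf("cpu %d -> node %d\n", cpu, node);
        }
        return 0;
    }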

Virtualization platforms also need to be location aware when allocating resources to virtual machines on NUMA systems. The article, How to optimize VM memory and processor performance, describes some of the issues involved in allocating virtual machine vCPUs to NUMA nodes.
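
For example, a management tool could pin a guest's vCPUs to the CPUs of a single NUMA node through the libvirt C API. The sketch below is hypothetical: the domain name "vm1", and the assumption that CPUs 0-3 belong to node 0, are illustrative only:

    /* Sketch: pinning vCPU 0 of a guest to physical CPUs 0-3 (assumed here
     * to form NUMA node 0) using libvirt. Compile with: gcc pin.c -lvirt */
    #include <stdio.h>
    #include <libvirt/libvirt.h>

    int main(void)
    {
        virConnectPtr conn = virConnectOpen("qemu:///system");
        if (conn == NULL) {
            fprintf(stderr, "failed to connect to hypervisor\n");
            return 1;
        }

        virDomainPtr dom = virDomainLookupByName(conn, "vm1"); /* hypothetical name */
        if (dom == NULL) {
            virConnectClose(conn);
            return 1;
        }

        unsigned char cpumap[1] = { 0x0f }; /* bits 0-3 set: allow CPUs 0,1,2,3 */
        if (virDomainPinVcpu(dom, 0, cpumap, sizeof(cpumap)) < 0)
            fprintf(stderr, "failed to pin vCPU\n");

        virDomainFree(dom);
        virConnectClose(conn);
        return 0;
    }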


Ethernet networks exhibit similar, NUMA-like properties: sending data over a short transmission path offers lower latency and higher bandwidth than sending the data over a longer path. While bandwidth within an Ethernet switch is high (multi-terabit capacity backplanes are not uncommon), the bandwidth of the Ethernet links connecting switches is only 1Gbit/s or 10Gbit/s (with 40Gbit/s and 100Gbit/s on the horizon). Shortest path bridging (see 802.1aq and TRILL) further increases the bandwidth, and reduces the latency, of communication between systems that are "close".
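
To put rough numbers on the path-length argument, each store-and-forward hop adds at least the frame's serialization delay (frame size divided by link rate). The short calculation below is a back-of-the-envelope illustration:

    /* Sketch: per-hop serialization delay for a 1500-byte Ethernet frame
     * at various link rates. */
    #include <stdio.h>

    int main(void)
    {
        double frame_bits = 1500 * 8;     /* standard Ethernet MTU */
        double rates_gbps[] = { 1, 10, 40, 100 };

        for (int i = 0; i < 4; i++) {
            /* bits / (Gbit/s * 1000) yields microseconds */
            double usec = frame_bits / (rates_gbps[i] * 1e3);
            printf("%5.0f Gbit/s: %6.2f us per hop\n", rates_gbps[i], usec);
        }
        return 0;
    }

At 1Gbit/s a single frame takes 12 microseconds per hop, so every additional switch on the path adds measurable delay even before queueing is considered.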

Virtualization and the need to support virtual machine mobility (e.g. vMotion/XenMotion/Xen Live Migration) are driving the adoption of large, flat, high-speed, layer-2, switched Ethernet fabrics in the data center. A layer-2 fabric allows a virtual machine to keep its IP address and maintain network connections when it moves (performing a "live" migration). However, while a layer-2 fabric provides the transparent connectivity that allows virtual machines to move, the performance of a virtual machine remains highly dependent on its communication patterns and location.

As servers are pooled into large clusters, virtual machines can easily be moved, not just between NUMA nodes within a server, but between servers within the cluster. For optimal performance, the cluster orchestration software needs to be aware of the network topology and workloads in order to place each VM in the best location. The paper, Tashi: Location-aware Cluster Management, describes a network-aware cluster management system, currently supporting Xen and KVM.
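
A sketch of the kind of decision such an orchestrator makes, hypothetical and not Tashi's actual algorithm: given a traffic matrix between VMs (of the sort sFlow can supply) and a hop-count matrix between hosts, choose the host that minimizes traffic-weighted path length, subject to capacity:

    /* Sketch of a network-aware placement heuristic (hypothetical, not
     * Tashi's actual algorithm): choose the host for a VM that minimizes
     * traffic-weighted hop count to the VM's peers, subject to capacity. */
    #include <stdio.h>

    #define NVMS   3
    #define NHOSTS 2

    /* bytes/s exchanged between VM pairs (e.g. derived from sFlow) */
    static double traffic[NVMS][NVMS] = {
        { 0,   5e6, 1e3 },
        { 5e6, 0,   2e3 },
        { 1e3, 2e3, 0   },
    };

    static int hops[NHOSTS][NHOSTS] = { { 0, 4 }, { 4, 0 } }; /* switch hops */
    static int placement[NVMS] = { 0, 1, 1 };       /* current host of each VM */
    static double free_cap[NHOSTS] = { 0.2, 0.6 };  /* spare capacity fraction */

    /* cost of placing VM v on host h, given everyone else's placement */
    static double cost(int v, int h)
    {
        double c = 0;
        for (int peer = 0; peer < NVMS; peer++)
            if (peer != v)
                c += traffic[v][peer] * hops[h][placement[peer]];
        return c;
    }

    int main(void)
    {
        int v = 0;                 /* consider migrating VM 0 */
        double need = 0.3;         /* capacity VM 0 requires */
        int best = placement[v];
        double best_cost = cost(v, best);

        for (int h = 0; h < NHOSTS; h++) {
            if (h != placement[v] && free_cap[h] < need) continue;
            double c = cost(v, h);
            if (c < best_cost) { best = h; best_cost = c; }
        }
        printf("place VM %d on host %d (cost %.0f)\n", v, best, best_cost);
        return 0;
    }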

The inclusion of the sFlow standard in network switches and virtualization platforms (see XCP, XenServer and KVM) provides visibility into each virtual machine's current workload and dependencies, including tracking the virtual machine as it migrates across the data center.
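
sFlow agents export measurements as UDP datagrams, conventionally to port 6343. The sketch below is a simplified illustration, not a complete sFlow v5 decoder; it receives datagrams and prints the leading header fields:

    /* Sketch: receiving sFlow datagrams on the standard UDP port (6343)
     * and decoding the first two header fields of the v5 datagram format.
     * A simplified illustration, not a complete sFlow decoder. */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(6343);            /* standard sFlow port */
        if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("bind");
            return 1;
        }

        unsigned char buf[2048];
        for (;;) {
            ssize_t n = recvfrom(sock, buf, sizeof(buf), 0, NULL, NULL);
            if (n < 8) continue;                /* too short for the header */

            uint32_t version, addr_type;        /* fields are big-endian (XDR) */
            memcpy(&version, buf, 4);
            memcpy(&addr_type, buf + 4, 4);
            printf("sFlow v%u datagram, %zd bytes, agent address type %u\n",
                   ntohl(version), n, ntohl(addr_type));
        }
    }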


In the article, Network visibility in the data center, an example was presented showing how virtual machine migration could cause a cascade of performance problems. The illustration above demonstrates how virtual machine migration can instead be used to optimize performance. In this example sFlow monitoring identifies that two virtual machines, VM1 and VM2, are exchanging significant amounts of traffic across the core of the network. In addition, sFlow data from the servers shows that while the server currently hosting VM1 is close to capacity, there is spare capacity on the server hosting VM2. Migrating VM1 to VM2's server reduces network traffic through the core and lowers the latency of communication between VM1 and VM2.

Note: For many protocols low latency is extremely important; examples include Memcached, FCoE, NFS, iSCSI, and RDMA over Converged Ethernet (RoCE). It's the Latency, Stupid is an excellent, if somewhat dated, article describing the importance of low latency in networks. The article, Latency Is Everywhere And It Costs You Sales - How To Crush It, presents a number of examples demonstrating the value of low latency and discusses strategies for reducing latency.

The virtual machine migration examples illustrate the value of the integrated view of network, storage, system and application performance that sFlow provides (see sFlow Host Structures). More broadly, visibility is the key to controlling costs, improving efficiency, reducing power and optimizing performance in the data center.

Finally, there are two interesting trends taking data centers in opposite directions. From the computing side, there is a move from SMP to NUMA systems in order to increase scalability and performance. On the networking side, there is a trend toward creating non-blocking architectures, analogous to a move from the current NUMA-like structure of networking to an SMP model. While there is an appeal to hiding the network from applications in order to create a "uniform" cloud, the physics of data transmission is inescapable: the shorter the communication path, the greater the bandwidth and the lower the latency. Instead of trying to hide the network, a better long-term strategy is to make the network structure and performance visible to system software so that the network appears as additional tiers in the NUMA hierarchy, allowing operating systems, hypervisors and cluster orchestration software to optimally position workloads and manage the network resources needed to deliver cloud services. Bringing network resources under the control of a unified "cloud operating system" will dramatically simplify management and ensure the tight coordination of resources needed for optimal performance.
