Monday, April 22, 2013

Multi-tenant traffic in virtualized network environments

Figure 1: Network virtualization (credit Brad Hedlund)
Network Virtualization: a next generation modular platform for the data center virtual network describes the basic concepts of network virtualization. Figure 1 shows the architectural elements of the solution which involves creating tunnels to encapsulate traffic between hypervisors. Tunneling allows the controller to create virtual networks between virtual machines that are independent of the underlying physical network (Any Network in the diagram).
Figure 2: Physical and virtual packet paths
Figure 2 shows a virtual network on the upper layer and maps its paths onto the physical network below. The network virtualization architecture is not aware of the topology of the underlying physical network, so the physical locations of virtual machines and the resulting packet paths are unlikely to bear any relationship to their logical relationships, producing an inefficient "spaghetti" of traffic flows. When a network manager observes traffic on the physical network, whether between hypervisors, between top of rack switches, or from virtual machine to virtual machine, it will appear to have very little structure.
Figure 3: Apparent virtual network traffic matrix
Figure 3 shows a traffic matrix in which the probability of any virtual machine talking to any other virtual machine is uniform. A network designed to carry this flat traffic matrix must itself be topologically flat, i.e. provide equal bandwidth between all hosts.
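As a rough illustration (hypothetical code, not from any measurement described here), the apparently flat traffic matrix of Figure 3 can be sketched as a matrix in which every pair of virtual machines is equally likely to exchange traffic:

```python
import random

def apparent_traffic_matrix(n_vms, flow_prob=0.1):
    """Uniform random traffic matrix: every pair of VMs is equally
    likely to communicate (the flat matrix of Figure 3).
    flow_prob is an illustrative parameter, not a measured value."""
    return [[1 if i != j and random.random() < flow_prob else 0
             for j in range(n_vms)]
            for i in range(n_vms)]

# A network sized for this matrix must assume any host may talk to
# any other host at full rate, hence the flat, non-blocking fabric.
matrix = apparent_traffic_matrix(100)
```

Because no row or column of such a matrix is structurally denser than any other, the only safe design is equal bandwidth everywhere.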
Figure 4: Relative cost of different topologies (from Flyways To De-Congest Data Networks)
Figure 4 shows that eliminating over-subscription to create a flat network is expensive, ranging from 2 to 5 times the cost of a conventional network design. Applying this same strategy to the road system would be the equivalent of connecting every town and city with an 8-lane freeway, no matter how small or remote the town. In practice, traffic studies guide development and roads are built where they are needed to satisfy demand. A similar, measurement-based, approach can be applied to network design.

In fact, the traffic matrix isn't random; it just appears random because the virtual machines have been randomly scattered around the data center by the network virtualization layer. Consider an important use case for network virtualization - multi-tenant isolation. Virtual networks are created for each tenant and configured to isolate and protect tenants from each other in the public cloud. Virtual machines assigned to each tenant are free to communicate among themselves, but are prevented from communicating with other tenants in the data center.
Figure 5: Traffic matrix within and between tenants
Figure 5 shows the apparently random traffic matrix shown in Figure 3, but this time the virtual machines have been grouped by tenant and the tenants have been sorted from largest to smallest. The resulting traffic matrix has some interesting features:
  1. The largest tenant occupies a small fraction of the total area in the traffic matrix.
  2. Tenant size rapidly decreases with most tenants being much smaller than the largest few.
  3. The traffic matrix is extremely sparse.
Even this picture is misleading, because if you drill down to look at a single tenant, their traffic matrix is likely to be equally sparse.
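The regrouping step behind Figure 5 can be sketched in a few lines (the vm_tenant mapping is a hypothetical input, e.g. {'vm1': 'acme'}): sort VMs by tenant, largest tenant first, and note that tenant isolation means only intra-tenant cells can ever be non-zero.

```python
from collections import Counter

def tenant_block_matrix(vm_tenant):
    """Reorder VMs by tenant (largest first) and mark which pairs may
    communicate. Isolation restricts traffic to intra-tenant cells,
    yielding the sparse block diagonal structure of Figure 5."""
    sizes = Counter(vm_tenant.values())
    order = sorted(vm_tenant,
                   key=lambda vm: (-sizes[vm_tenant[vm]], vm_tenant[vm], vm))
    allowed = [[int(i != j and vm_tenant[a] == vm_tenant[b])
                for j, b in enumerate(order)]
               for i, a in enumerate(order)]
    return order, allowed
```

The fraction of cells that can carry traffic is the sum of squared tenant sizes divided by the square of the total VM count, which shrinks rapidly as the number of tenants grows - hence the extreme sparseness.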
Figure 6: Traffic from large map / reduce cluster
Figure 6 shows the traffic matrix for a common large scale workload that a tenant might run in the cloud - map / reduce (Hadoop) - and the paper, Traffic Patterns and Affinities, discusses the sparseness and structure of this traffic matrix in some detail.
Note: There is a striking similarity between the traffic matrices in figures 5 and 6. The reason for the strong diagonal in the Hadoop traffic matrix is that the Hadoop scheduler is topologically aware, assigning compute tasks to nodes that are close to the storage they are going to operate on, and orchestrating storage replication in order to minimise non-local transfers. However, when this workload is run over a virtualized network, the virtual machines are scattered, turning this highly localized and efficient traffic pattern into randomly distributed traffic.
Apart from Hadoop, how else might a large tenant use the network? It's worth focusing on large tenants since their workloads are likely to be the hardest to accommodate. Netflix is one of the largest and most sophisticated tenants in the Amazon Elastic Compute Cloud (EC2) and the presentation, Dynamically Scaling Netflix in the Cloud, provides some interesting insights into their use of cloud resources.
Figure 7: Netflix elastic load balancing pools
Figure 7 shows how Netflix distributes copies of its service across availability zones. Each service instance, A, B or C, is implemented by a scale out pool of virtual machines (EC2 instances). Note also the communication patterns between service pools, resulting in a sparse, structured traffic matrix.
Figure 8: Elastic load balancing
Figure 8 shows how each service within an availability zone is dynamically scaled based on measured demand. As demand increases, additional virtual machines are added to the pool. When demand decreases, virtual machines are released from the pool.
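A minimal sketch of such a scaling rule (the capacity and utilization parameters are illustrative assumptions, not Netflix's actual policy):

```python
import math

def pool_size(demand_rps, vm_capacity_rps=100, target_util=0.7, min_vms=2):
    """Hypothetical elastic scaling rule: size the pool so that
    measured demand runs each VM at the target utilization,
    never dropping below a minimum pool size."""
    return max(min_vms, math.ceil(demand_rps / (vm_capacity_rps * target_util)))
```

For example, with these assumed parameters a demand of 1,000 requests per second yields a pool of 15 VMs, and as demand falls overnight the pool shrinks back toward its floor.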
Figure 9: Variation in number of Netflix instances over a 24 hour period
Figure 9 shows how the number of virtual machines in each pool varies over the course of a day as a result of the elastic load balancing. Looking at the graph, one can see that a significant fraction of the virtual machines in each service pool is recycled each day.

Elastic load balancing is a service provided by the underlying infrastructure, so the service provider is aware of the pools and the members within each pool. Since it's in the nature of a load balancing pool that each instance has a similar traffic pattern to its peers, observing the communication patterns of active pool members would allow a topology aware orchestration controller to select poorly placed VMs when making a removal decision and to add new VMs in locations that are close to their peers.
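One way such a removal decision could work (a sketch under assumed inputs; the distance metric is hypothetical, e.g. 0 = same rack, 2 = across the core):

```python
def worst_placed(pool, distance):
    """Topology-aware shrink decision: when the pool must lose a
    member, release the VM with the greatest total network distance
    to its peers, since its traffic crosses the most fabric."""
    return max(pool, key=lambda vm: sum(distance(vm, p)
                                        for p in pool if p != vm))
```

Over many scale-down events this steadily concentrates each pool's surviving members close together, without ever migrating a running VM.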
Note: Netflix maintains a base number of reserved instances (reserved instances are the least expensive option, provided you can keep them busy) and uses this "free" capacity for analytics tasks (Hadoop) during off peak periods. Exposing basic locality information to tenants would allow them to better configure topology aware workloads like Hadoop, delivering improved performance and reducing traffic on the shared physical network.
Multi-tenancy is just one application of network virtualization. However, the general concept of creating multiple virtual networks implies constraints on communication patterns, and a location aware virtual network controller will be able to reduce network loads, improve application performance, and increase scalability by placing nodes that communicate with each other topologically close together.

There are challenges in dealing with large tenants, since they may have large groups of machines that need to be provided with high bandwidth communication. Helios: A Hybrid Electrical/Optical Switch Architecture for Modular Data Centers describes some of the limitations of fixed configuration networks and describes how optical networking can be used to flexibly allocate bandwidth where it is needed.
Figure 10: Demonstrating AWESOME in the Pursuit of the Optical Data Center
Figure 10, from the article Demonstrating AWESOME in the Pursuit of the Optical Data Center, shows a joint Plexxi and Calient solution that orchestrates connectivity based on what Plexxi terms network affinities. This technology can be used to "rewire" the network to create tailored pods to efficiently accommodate large tenants. The paper, PAST: Scalable Ethernet for Data Centers, describes how software defined networking can be used to exploit the capabilities of merchant silicon to deliver bandwidth where it is needed.

However flexible the network, coordinated management of storage, virtual machine and networking resources is required to fully realize the flexibility and efficiency promised by cloud data centers. The paper, Joint VM Placement and Routing for Data Center Traffic Engineering, shows that jointly optimizing network and server resources can yield significant benefits.
Note: Vint Cerf recently revealed that Google has re-engineered its data center networks to use OpenFlow based software defined networks, possibly bringing networking under the coordinated control of their data center resource management system. In addition, one of the authors of the Helios paper, Amin Vahdat, is a distinguished engineer at Google and has described Google's use of optical networking and OpenFlow in the context of WAN traffic engineering; it would be surprising if Google weren't applying similar techniques within their data centers.
Comprehensive measurement is an essential, but often overlooked, component of an adaptive architecture. The controller cannot optimally place workloads if the traffic matrix, link utilizations, and server loads are not known. The widely supported sFlow standard addresses the requirement for pervasive visibility by embedding instrumentation within physical and virtual switches, and in the servers and applications making use of the network to provide the integrated view of performance needed for unified control.
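The aggregation step a controller needs can be sketched generically (the (src, dst, bytes) tuple format here is illustrative; a real collector would decode sFlow datagrams per the sFlow version 5 specification):

```python
from collections import defaultdict

def traffic_matrix(flow_records):
    """Aggregate flow records, such as those decoded from sFlow flow
    samples, into the traffic matrix a placement controller needs.
    Each record is an assumed (src, dst, bytes) tuple."""
    matrix = defaultdict(int)
    for src, dst, nbytes in flow_records:
        matrix[(src, dst)] += nbytes
    return dict(matrix)
```

Fed continuously with samples from every physical and virtual switch, this matrix is what lets the controller see the tenant and pool structure described above, rather than the apparently random traffic of Figure 3.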

Finally, there are significant challenges to realizing revolutionary improvements in data center flexibility and scalability, many of which aren't technical. Network virtualization, management silos and missed opportunities discusses how inflexible human organizational structures are being reflected in the data center architectures proposed by industry consortia. The article talks about OpenStack, but the recently formed OpenDaylight consortium seems to have similar issues, freezing in place existing architectures that offer incremental benefits rather than providing the flexibility needed for radical innovation and improvement.

1 comment:

  1. The ultimate test for any new technology or methodology seems to be this one: can I figure out quickly what is causing a problem? Knowledge of physical infrastructure (status, availability over time, load, etc) remains a requirement even in a completely virtualized world. The way that data is gathered and correlated is what requires augmentation to a virtualized world. Contextualization of what tenant traffic is going over which part of the infrastructure is only possible if you understand both sides of the virtualization bubble.