Friday, March 22, 2013

Network virtualization, management silos and missed opportunities

Conway's law states that "organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations."
Figure 1: Management silos
A previous article, Management silos, described how the organization of operations teams into functional silos (network, storage and server groups) creates an inflexible management structure that makes it hard to deal with highly dynamic cloud architectures.
Figure 2: OpenStack Quantum Intro
When you look at the OpenStack architecture shown in Figure 2, it bears a strong resemblance to existing organizational silos. Is this really the best way to architect next generation cloud systems or is it simply a demonstration of Conway's law in action?

The OpenStack compute scheduler documentation describes the factors that can be included in deciding which compute node to use when starting a virtual machine. Notable by their absence is any mention of storage or network location. In contrast, the Hadoop scheduler is both storage and network topology aware, allowing it to place compute tasks close to storage and replicate data within racks for increased performance and across racks for availability. A previous article, System boundary, discussed the importance of including all the tightly coupled network, storage, and compute resources within an integrated control system. Another article, NUMA, discussed the importance of location awareness for optimal performance.
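
To make the idea of location awareness concrete, here is a minimal sketch of a placement decision that takes network topology into account. It is not OpenStack or Hadoop code: the names Node, distance and place, the "/dc1/rack1" location strings and the free_slots map are all hypothetical and exist only for this illustration. The distance function counts hops through a simple datacenter/rack/host tree, in the spirit of rack awareness, so that two hosts in the same rack are closer than two hosts in different racks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A compute or storage host (hypothetical model for this sketch)."""
    name: str
    location: str  # assumed topology path, e.g. "/dc1/rack1"


def path(node):
    """Return the node's position in the topology tree as a list of labels."""
    return [p for p in node.location.split("/") if p] + [node.name]


def distance(a, b):
    """Hops from a to b through their closest common ancestor in the tree."""
    pa, pb = path(a), path(b)
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def place(candidates, storage, free_slots):
    """Pick the candidate with free capacity that is closest to the storage node.

    Ties on distance are broken in favor of the host with more free slots.
    Returns None when no candidate has capacity.
    """
    usable = [n for n in candidates if free_slots.get(n.name, 0) > 0]
    if not usable:
        return None
    return min(usable, key=lambda n: (distance(n, storage), -free_slots[n.name]))
```

With a storage node in rack1, a compute host in rack1 is preferred over a less loaded host in rack2, and the scheduler only falls back to the remote rack when the local one is full. The point is simply that this decision needs topology information as an input, which is exactly what a scheduler cannot use if the network service does not expose it.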

Note: OpenStack was selected as a representative example to demonstrate architectural features that are common to many cloud stacks. This article shouldn't be seen as a specific criticism of OpenStack, but as a general discussion of cloud architectures.
Figure 3: OpenStack Quantum Intro
OpenStack is still in active development, so one might hope that future schedulers will be enhanced to be more location aware as the network service matures. However, looking at Figure 3, it appears that this will not be possible since the APIs being developed to access the network service do not expose network topology or performance information to the scheduler.

Figure 4: VMware NSX Network Virtualization
Figure 4 shows how the situation becomes even worse as additional layers are added, further removing the scheduler from the information it needs to be location aware. In fact, the lack of location awareness is touted as an advantage: "A complete and feature rich virtual network can be defined at liberty from any constraints in physical switching infrastructure features, topologies or resources."

Each orchestration layer kicks the problem of network resource management down to lower layers, until you are left selecting from a range of vendor specific fabrics which also hide the network topology and present the abstraction of a single switch.
Figure 5: Juniper QFabric Architecture
On a slightly different tack, consider whether the organizational divisions in cloud orchestration systems are being justified based on one or more Fallacies of Distributed Computing:
  1. The network is reliable
  2. Latency is zero
  3. Bandwidth is infinite
  4. The network is secure
  5. Topology doesn't change
  6. There is one administrator
  7. Transport cost is zero
  8. The network is homogeneous
A corollary to Conway's law is that flexible organizations are willing and able to reorganize to produce optimal designs. The DevOps movement is breaking down the silos between application development and operations teams in order to improve the agility and reliability of cloud-based applications. The standards for cloud computing are just starting to emerge, and it would be tragic if the opportunity to deliver agile, robust, efficient and scalable cloud systems were lost because of an inability to create the flexible, cross-disciplinary design groups needed to re-imagine the relationship between networking, storage and computing and produce new architectures.

It is easy to be complacent based on the buzz around cloud computing, software defined networking and the software defined data center. However, if these architectures don't deliver on their promise, there is competition waiting in the wings - see Return of the Borg: How Twitter Rebuilt Google’s Secret Weapon. The difference is that these alternative architectures are being developed by flexible organizations that are prepared to consider all aspects of their stack in order to make disruptive improvements.

The unified visibility across all network, server, storage and application resources provided by the multi-vendor sFlow standard offers a solution. Piercing through the layers of abstraction and architectural silos delivers the comprehensive real-time analytics and location awareness needed for efficient scheduling.
