Sunday, September 22, 2013

Wile E. Coyote

One of the classic moments in a Road Runner cartoon is Wile E. Coyote pursuing the Road Runner into a cloud of dust. Wile E. Coyote starts to suspect that something is wrong, but keeps running until the moment of realization that he is no longer on the road, but is instead suspended in mid-air over a chasm.

In the cartoon, the dust cloud allows Wile E. Coyote to temporarily defy the laws of physics by hiding the underlying physical topography. The Road Runner is under no such illusion - by staying in the lead, the Road Runner is able to see the road ahead and stay on firm ground.

Example of an SDN solution with tunnels
Current network virtualization architectures are built on a similar cartoon reality - hiding the network under a cloud (using an overlay network of tunnels) and asserting that applications will somehow be insulated from the physical network topology and communication devices.

The network virtualization software used to establish and manage the overlay is a form of distributed computing system that delivers network connectivity as a service. Vendors of network virtualization software who assert that their solution is "independent of underlying hardware" are making flawed assumptions about networking that are common to distributed computing systems and are collectively known as the Fallacies of Distributed Computing:
  1. The network is reliable
  2. Latency is zero
  3. Bandwidth is infinite
  4. The network is secure
  5. Topology doesn't change
  6. There is one administrator
  7. Transport cost is zero
  8. The network is homogeneous
This article isn't intended to dismiss the value of the network virtualization abstraction. Virtualizing networking greatly increases operational flexibility. In addition, the move of complex functionality from the network core to edge hardware and virtual switches simplifies configuration and deployment of network functions (e.g. load balancing, firewalls, routing, etc.). However, in order to realize the virtual network abstraction, the orchestration system needs to be aware of the physical resources on which the service depends. The limitations of ignoring physical networking are demonstrated in the article, Multi-tenant performance isolation, which provides a real-life example of the type of service failure that impacts the entire data center and is difficult to address with current network virtualization architectures.

To be effective, network virtualization architectures need to be less like Wile E. Coyote, blindly running into trouble, and more like the Road Runner, fully aware of the road ahead, safely navigating around obstacles and using resources to maximum advantage. In much the same way that the hypervisor takes responsibility for managing limited physical resources like memory, CPU cycles and I/O bandwidth in order to deliver compute virtualization, the network virtualization system needs to be aware of the physical networking resources in order to integrate them into the virtualization stack. The article, NUMA, draws the parallel between how operating systems optimize performance by being aware of the location of resources and how cloud orchestration systems need to be similarly location aware.

One of the main reasons for the popularity of current overlay approaches to network virtualization has nothing to do with technology. The organizational silos that separate networking, compute and application operational teams in most enterprises make it difficult to deploy integrated solutions. Given the organizational challenges, it is easy to see the appeal to vendors of creating overlay based products that bypass the network silo and deliver operational flexibility to the virtualization team - see Network virtualization, management silos and missed opportunities. However, as network virtualization reaches the mainstream and software defined networking matures, expect to see enterprises integrate their functional teams and the emergence of network virtualization solutions that address current limitations. The article, Multi-tenant traffic in virtualized network environments, examines the architectural problems with current cloud architectures and describes the benefits of taking a holistic, visibility driven, approach to coordinating network, compute, storage and application resources.

Thursday, September 12, 2013

Packet loss

The timeline describes an outage on Sunday, August 25th in Amazon's Elastic Block Store (EBS) service that affected a number of companies, including Instagram, Vine, Airbnb and Flipboard - see Business Week: Another Amazon Outage Exposes the Cloud's Dark Lining.

The root cause of the outage is described as a partial failure of a networking device that resulted in excessive packet loss. Network monitoring can detect many problems before they become serious and reduce the time to identify the root cause of problems when they do occur. However, traditional network monitoring technologies such as SNMP polling for interface counters don't easily scale, leaving operations teams with limited, delayed information on critical network performance indicators such as packet loss. Fortunately, the challenge of large scale interface monitoring is addressed by the sFlow standard's counter push mechanism, in which each device is responsible for periodically sending its own counters to a central collector - see Push vs Pull.
The sFlow counter export mechanism extends beyond the network to include server and application performance metrics and the article, Cluster performance metrics, shows how queries to the sFlow-RT analyzer's REST API are used to feed performance metrics from server clusters to operations tools like Graphite.
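For example, here is a minimal Python sketch showing how the summary error and discard rates might be retrieved from the REST API and forwarded to Graphite's plaintext interface. The sFlow-RT address matches the query below; the Graphite host, port and metric prefix are hypothetical.
#!/usr/bin/env python
# Sketch: poll summary error/discard rates from sFlow-RT's REST API and
# forward them to Graphite using the plaintext protocol.
# Assumptions: sFlow-RT at 10.0.0.162:8008 (as in the query below), a carbon
# daemon listening on graphite.example.com:2003 (hypothetical host).
import json, socket, time, urllib2

SFLOW_RT = 'http://10.0.0.162:8008'
METRICS  = 'sum:ifinerrors,sum:ifindiscards,sum:ifouterrors,sum:ifoutdiscards'
CARBON   = ('graphite.example.com', 2003)

while True:
    result = json.loads(
        urllib2.urlopen('%s/metric/ALL/%s/json' % (SFLOW_RT, METRICS)).read())
    now = int(time.time())
    # one plaintext line per metric: <path> <value> <timestamp>
    lines = ['network.%s %s %d' % (m['metricName'].replace(':', '.'),
                                   m['metricValue'], now)
             for m in result if 'metricValue' in m]
    sock = socket.create_connection(CARBON)
    sock.sendall('\n'.join(lines) + '\n')
    sock.close()
    time.sleep(30)  # match the sFlow counter export interval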

The following query builds on the Cluster performance metrics example to show how real-time information summarizing network-wide packet loss and error rates can easily be gathered, even if there are thousands of switches and tens of thousands of links.
$ curl http://10.0.0.162:8008/metric/ALL/sum:ifinerrors,sum:ifindiscards,sum:ifouterrors,sum:ifoutdiscards,max:ifinerrors,max:ifindiscards,max:ifouterrors,max:ifoutdiscards/json
[
 {
  "lastUpdateMax": 56187,
  "lastUpdateMin": 487,
  "metricN": 119,
  "metricName": "sum:ifinerrors",
  "metricValue": 0.05025125628140704
 },
 {
  "lastUpdateMax": 56187,
  "lastUpdateMin": 487,
  "metricN": 119,
  "metricName": "sum:ifindiscards",
  "metricValue": 0.05025125628140704
 },
 {
  "lastUpdateMax": 56187,
  "lastUpdateMin": 487,
  "metricN": 119,
  "metricName": "sum:ifouterrors",
  "metricValue": 2.1955918328651283
 },
 {
  "lastUpdateMax": 56187,
  "lastUpdateMin": 487,
  "metricN": 119,
  "metricName": "sum:ifoutdiscards",
  "metricValue": 2.1955918328651283
 },
 {
  "agent": "10.0.0.30",
  "dataSource": "4",
  "lastUpdate": 487,
  "lastUpdateMax": 27487,
  "lastUpdateMin": 487,
  "metricN": 119,
  "metricName": "max:ifinerrors",
  "metricValue": 0.05025125628140704
 },
 {
  "agent": "10.0.0.30",
  "dataSource": "4",
  "lastUpdate": 487,
  "lastUpdateMax": 27487,
  "lastUpdateMin": 487,
  "metricN": 119,
  "metricName": "max:ifindiscards",
  "metricValue": 0.05025125628140704
 },
 {
  "agent": "10.0.0.28",
  "dataSource": "8",
  "lastUpdate": 7288,
  "lastUpdateMax": 27488,
  "lastUpdateMin": 7288,
  "metricN": 119,
  "metricName": "max:ifouterrors",
  "metricValue": 0.7425742574257426
 },
 {
  "agent": "10.0.0.28",
  "dataSource": "8",
  "lastUpdate": 7288,
  "lastUpdateMax": 27488,
  "lastUpdateMin": 7288,
  "metricN": 119,
  "metricName": "max:ifoutdiscards",
  "metricValue": 0.7425742574257426
 }
]
The query result identifies the links with the highest packet error and discard rates (links are identified by the agent and dataSource fields for each of the max:ifindiscards, ifoutdiscards, ifinerrors, and ifouterrors metrics). Also of interest are the sum: metrics, measuring the total packet loss and error rates summed across all the links in the network.
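As a simple illustration, the following Python sketch (using the same sFlow-RT address and the field names shown in the result above) retrieves the max: metrics and prints the agent and dataSource of the worst interface for each:
# Sketch: report the worst interface for each error/discard metric.
# Uses the same sFlow-RT address and field names as the query above.
import json, urllib2

url = ('http://10.0.0.162:8008/metric/ALL/'
       'max:ifindiscards,max:ifoutdiscards,max:ifinerrors,max:ifouterrors/json')
for m in json.loads(urllib2.urlopen(url).read()):
    # each max: entry identifies the agent and dataSource of the worst link
    print '%s = %s (agent %s, dataSource %s)' % (
        m['metricName'], m['metricValue'], m.get('agent'), m.get('dataSource'))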

The sFlow counter push mechanism is also useful for tracking availability: the periodic export of counters by each device acts as a "heartbeat" - if messages stop arriving from a device then there is a problem with the device or network that needs to be investigated. The metricN field in the query result indicates the number of data sources that contributed to the summary metrics, and the lastUpdateMax and lastUpdateMin values indicate how long ago (in milliseconds) the oldest and most recent updates within the group were received. A lastUpdateMax value exceeding a multiple of the configured sFlow export period (typically 30 seconds), or a decreasing value of metricN, indicates a problem - possibly the type of "grey" partial failure in the Amazon example.
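A minimal Python sketch of this kind of heartbeat check follows; the 30 second export period and the threshold of three missed intervals are assumptions chosen for illustration:
# Sketch: heartbeat check based on lastUpdateMax and metricN.
# Assumptions: 30 second counter export period, alert after 3 missed intervals.
import json, time, urllib2

URL = 'http://10.0.0.162:8008/metric/ALL/sum:ifindiscards/json'
EXPORT_MS = 30 * 1000
last_n = None

while True:
    summary = json.loads(urllib2.urlopen(URL).read())[0]
    stale = summary['lastUpdateMax'] > 3 * EXPORT_MS
    shrinking = last_n is not None and summary['metricN'] < last_n
    if stale or shrinking:
        # a device, or the path to it, may have failed silently
        print 'WARNING: metricN=%d lastUpdateMax=%dms' % (
            summary['metricN'], summary['lastUpdateMax'])
    last_n = summary['metricN']
    time.sleep(30)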
The chart shows the error and discard metrics plotted over time. There is clearly an issue with a large number of discarded packets. The trend line for the worst interface tracks the total across all interfaces, indicating that the problem is localized to a specific link, rather than a general congestion related problem involving multiple links.

The network related outage described in this example is not an isolated incident; other incidents described on this blog include: Amazon EC2 outage, Gmail outage, Delay vs utilization for adaptive control, and Multi-tenant performance isolation. The article, Visibility and the software defined data center, describes how the sFlow standard delivers the comprehensive visibility needed to automatically manage data center resources, increasing efficiency, and reducing downtime due to network related outages.