Figure 1: Long vs Short flows (from The Nature of Datacenter Traffic: Measurements & Analysis)
The chart in Figure 1 is taken from the paper The Nature of Datacenter Traffic: Measurements & Analysis, which provides a comprehensive analysis of network traffic patterns in a large-scale data center running a realistic workload. The chart shows that most flows are short lived: over 50% of flows last less than 1 second. However, these short flows consume very little of the bandwidth. Most bandwidth is consumed by the small number of long-lived flows, those with a duration between 10 seconds and 1000 seconds.
Figure 2: ECMP vs SDN/sFlow (from DevoFlow: Scaling Flow Management for High-Performance Networks)
Note: A Clos network is a best case for ECMP since it offers the largest number of alternative equal-cost paths. One would expect dynamic routing of large flows to deliver even greater improvements on non-Clos networks.
DevoFlow takes its definition of large flows from the paper Hedera: Dynamic Flow Scheduling for Data Center Networks, which defines a "large flow" as a flow that consumes 10% of a link's total bandwidth. For example, when monitoring 1 Gigabit links, a large flow would be defined as any flow exceeding 100 Mbits/second. For a 10 Gigabit link, a large flow would be defined as any flow exceeding 1 Gbit/second. The 1-in-1000 sampling rate in the DevoFlow paper was chosen so that large flows on its 1 Gigabit links would be detected within 1 second.
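The 10% definition reduces to a simple threshold calculation. The following is a minimal sketch of that arithmetic; the function name `large_flow_threshold_bps` and its `fraction` parameter are illustrative and do not come from the Hedera or DevoFlow papers.

```python
def large_flow_threshold_bps(link_speed_bps: float, fraction: float = 0.10) -> float:
    """Return the rate, in bits/second, above which a flow counts as large.

    `fraction` is the share of link bandwidth used as the cutoff; the 10%
    default follows the definition quoted in the text.
    """
    if link_speed_bps <= 0:
        raise ValueError("link speed must be positive")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return link_speed_bps * fraction


# Link speeds discussed in the text, in bits/second.
for name, speed in [("1 Gigabit", 1e9), ("10 Gigabit", 10e9)]:
    threshold = large_flow_threshold_bps(speed)
    print(f"{name} link: large flow exceeds {threshold / 1e6:.0f} Mbits/second")
```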
The sampling-based scheme scales easily to higher speeds: a sampling rate of 1-in-10,000 would detect large flows within a second on 10 Gigabit links, a sampling rate of 1-in-100,000 would detect large flows within a second on 100 Gigabit links, and so on. In each case the monitoring load on a central controller would be the same, i.e. the monitoring overhead needed to drive load balancing is small and does not go up with network speed. For more information, see Scalability and accuracy of packet sampling.
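The scaling argument can be checked with a rough calculation: if the sampling rate is reduced in proportion to link speed, a flow at the 10% threshold generates the same expected number of samples per second. The sketch below assumes an average packet size of 1500 bytes, which is an assumption made for illustration and not a figure from the papers; the function and variable names are likewise hypothetical.

```python
def expected_samples_per_second(flow_bps: float, sampling_n: int,
                                packet_bytes: int = 1500) -> float:
    """Expected packet samples per second from one flow under 1-in-N sampling.

    packet_bytes is an assumed average packet size (illustrative only).
    """
    if flow_bps < 0 or sampling_n <= 0 or packet_bytes <= 0:
        raise ValueError("flow rate must be >= 0; N and packet size must be > 0")
    packets_per_second = flow_bps / (8 * packet_bytes)
    return packets_per_second / sampling_n


# (link name, link speed in bits/second, N for 1-in-N sampling), as in the text.
SCENARIOS = [
    ("1 Gigabit", 1e9, 1_000),
    ("10 Gigabit", 10e9, 10_000),
    ("100 Gigabit", 100e9, 100_000),
]

for name, link_bps, n in SCENARIOS:
    large_flow_bps = 0.10 * link_bps  # flow right at the 10% threshold
    rate = expected_samples_per_second(large_flow_bps, n)
    print(f"{name} link, 1-in-{n}: about {rate:.1f} samples/second")
```

Under these assumptions every scenario yields the same expected sample rate, roughly eight samples per second for a flow at the threshold, which is the sense in which the controller's monitoring load does not grow with link speed.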
The detection speed of 1 second makes sense given the bimodal distribution shown in Figure 1. Ignoring flows that last less than a second means that the controller can ignore the short flows, which consume very little bandwidth and are handled well by existing hardware load balancing techniques. Reacting to these short flows would consume controller resources and be counterproductive, since the flows would likely end before any controls could take effect. If the controller can react within a few seconds to large flows, it will have effective control over 90% of the bandwidth consumed on the network.
Note: The paper, Estimating the Volume of Elephant Flows under Packet Sampling, describes how sampling reduces the resources needed to detect large flows.
A skeptical reader might have noticed that the papers referenced in this article so far all relate to a map/reduce (e.g. Hadoop) workload and be concerned about the general applicability of software defined load balancing.
Figure 3: Peak Period Aggregate Traffic Composition (North America, Mobile Access)
Figure 4: Peak Period Aggregate Traffic Composition (North America, Mobile Access)
Netflix hosts its service within Amazon EC2, so it is not unreasonable to expect that the network bandwidth within the Amazon cloud is strongly driven by large video flows, along with other related activities that also generate large flows, such as transcoding video files and off-peak Amazon Elastic MapReduce jobs (see Dynamically Scaling Netflix in the Cloud).
Other research papers have examined the impact of large flows on total bandwidth consumption:
- Understanding Internet Traffic Streams: Dragonflies and Tortoises
- On the Characteristics and Reasons of Long-lived flows
- On the correlation of Internet flow characteristics
Figure 5: Performance aware software defined networking
All the components needed to take SDN load balancing into the mainstream are now in place. The sFlow standard is a mature, robust, scalable measurement technology that is almost universally supported by switch vendors, and vendor support for OpenFlow is rapidly increasing, so finding network equipment that supports both sFlow and OpenFlow is not difficult. OpenFlow controllers are readily available, and InMon's sFlow-RT real-time analytics engine detects large flows and provides the APIs needed to drive load balancing SDN applications. Load balancing is poised to be a killer application that will drive SDN into the mainstream.