Saturday, February 23, 2013

SDN and large flows

Figure 1: Long vs Short flows (from The Nature of Datacenter Traffic: Measurements & Analysis)
Load balancing LAG/ECMP groups proposed using the combination of sFlow and OpenFlow to address the challenges faced by existing link aggregation (LAG) and equal cost multi-path routing (ECMP) architectures. An important question to ask is, how fast does a software defined networking load balancer have to be in order to significantly improve network performance? If load balancing decisions need to be made in the order of nanoseconds or microseconds, then software defined networking isn't a feasible approach. However, if decisions can be made in the order of seconds, then software defined networking is an attractive approach because of the increased flexibility offered by an external software controller.

The chart in Figure 1 was taken from the paper, The Nature of Datacenter Traffic: Measurements & Analysis, which provides a comprehensive analysis of network traffic patterns in a large scale data center environment running a realistic workload. The chart shows that most traffic flows are short lived: over 50% of flows last less than 1 second. However, these short flows consume very little of the bandwidth. Most bandwidth is consumed by the small number of long lived flows, flows with a duration between 10 seconds and 1000 seconds.
Figure 2: ECMP vs SDN/sFlow (from DevoFlow: Scaling Flow Management for High-Performance Networks)
The chart in Figure 2 is taken from the paper, DevoFlow: Scaling Flow Management for High-Performance Networks. This paper presents simulation results to examine the effect of using sFlow and OpenFlow to load balance long lived flows. The chart uses ECMP load balancing as a baseline (the light blue bars) and compares alternative approaches to load balancing traffic on a fat-tree/CLOS network topology. The results of load balancing based on sFlow traffic measurements (packets sampled at 1-in-1000 on 1G links) is shown by the brick red bars. The simulation demonstrates that load balancing of large flows can significantly improve throughput over ECMP.

Note: A CLOS network is a best case for ECMP since it offers the largest number of alternative equal cost paths. One would expect dynamic routing of large flows to deliver even greater improvements on non-CLOS networks.

DevoFlow refers to the paper, Hedera: Dynamic Flow Scheduling for Data Center Networks, for a definition of large flows - a "large flow" is defined as a flow that consumes at least 10% of a link's total bandwidth. For example, when monitoring 1 Gigabit links, a large flow is any flow exceeding 100 Mbits/second; for a 10 Gigabit link, a large flow is any flow exceeding 1 Gigabit/second. The 1-in-1000 sampling rate in the DevoFlow paper was selected so that large flows on their 1 Gigabit links could be detected within 1 second.

The sampling based scheme easily scales to higher speeds, since a sampling rate of 1-in-10,000 would detect large flows within a second on 10Gigabit links, a sampling rate of 1-in-100,000 would detect large flows within a second on 100Gigabit links etc. In each case the monitoring load on a central controller would be the same, i.e. the monitoring overhead to drive load balancing is small and doesn't go up with network speed. For more information, see Scalability and accuracy of packet sampling.
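As a back-of-the-envelope check, this constant monitoring load can be verified with a short sketch. The function name and the uniform 1500 byte packet size are illustrative assumptions (real traffic has a packet size mix), but the result matches the scaling argument above:

```python
def samples_per_second(flow_bps, sampling_rate, pkt_bytes=1500):
    """Expected sFlow samples/sec generated by a single flow.

    Assumes a uniform packet size; real traffic varies, so this is
    only an order-of-magnitude estimate.
    """
    pkts_per_second = flow_bps / (pkt_bytes * 8)
    return pkts_per_second / sampling_rate

# A "large" flow consumes 10% of link bandwidth (the Hedera definition).
# Scaling the sampling rate with link speed keeps the sample stream,
# and hence controller load, constant.
for link_bps, rate in [(1e9, 1000), (10e9, 10000), (100e9, 100000)]:
    large_flow = 0.1 * link_bps
    print('%dG link, 1-in-%d: %.1f samples/sec' %
          (link_bps / 1e9, rate, samples_per_second(large_flow, rate)))
```

In each case a large flow generates roughly eight samples per second, so the flow is reliably detected within a second regardless of link speed.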

The detection speed of 1 second makes sense given the bi-modal distribution shown in Figure 1. Ignoring flows that last less than a second means that the controller can ignore the short flows (which consume very little bandwidth and are handled well by existing hardware load balancing techniques). Reacting to these short flows would consume controller resources and be counterproductive, since the flows would likely end before any controls could take effect. If the controller can react within a few seconds to large flows, it will have effective control over 90% of the bandwidth consumed on the network.

Note: The paper, Estimating the Volume of Elephant Flows under Packet Sampling, describes how sampling reduces the resources needed to detect large flows.

A skeptical reader might have noticed that the papers referenced in this article so far all relate to a map/reduce (e.g. Hadoop) workload and be concerned about the general applicability of software defined load balancing.
Figure 3: Peak Period Aggregate Traffic Composition (North America, Fixed Access)
For another proof point, consider Figure 3, from Sandvine's Global Internet Phenomena Report 2H 2012. The chart shows that 72% of peak period downstream bandwidth in North America consists of large flows (comprising Real-Time Entertainment such as Netflix, Hulu, YouTube etc. and Filesharing). Netflix alone accounts for 33% of all North American primetime downstream bandwidth.
Figure 4: Peak Period Aggregate Traffic Composition (North America, Mobile Access)
Figure 4 shows that large flows also dominate bandwidth consumption by mobile devices. However, on mobile platforms YouTube streaming dominates, accounting for nearly 31% of peak period mobile bandwidth.

Netflix hosts its service within Amazon EC2, therefore it's not unreasonable to expect that the network bandwidth within the Amazon cloud is strongly driven by large video flows (along with other related activities that also generate large flows: transcoding video files, off peak Amazon Elastic MapReduce, etc. - see Dynamically Scaling Netflix in the Cloud).

Other research papers have examined the impact of large flows on total bandwidth consumption.
From all the evidence, it is clear that load balancing of large flows offers a significant opportunity to improve network performance in data centers, wide area and access networks, both wired and wireless.
Figure 5: Performance aware software defined networking
Figure 5 shows the elements of the software defined load balancing system described in Load balancing LAG/ECMP groups. This is a flexible architecture that could be used to load balance large flows in many different environments, including data center, WAN, wireless, and carrier networks. An important benefit of software defined networking is that control decisions are no longer limited to algorithms offered by network equipment providers - an SDN load balancer can take into account business concerns relating to value, cost, security, licensing etc., selecting traffic paths to maximize business benefit.

All the components needed to take SDN load balancing into the mainstream are now in place. The sFlow standard is a mature, robust, scalable measurement technology that is almost universally supported by switch vendors, and vendor support for OpenFlow is rapidly increasing - so finding network equipment that supports both sFlow and OpenFlow is not difficult. OpenFlow controllers are readily available and InMon's sFlow-RT real-time analytics engine detects large flows and provides the APIs needed to drive load balancing SDN applications. Load balancing is poised to be a killer application that will drive SDN into the mainstream.

Friday, February 22, 2013

Configuring Quanta switches

Configuring sFlow on a Quanta switch can be accomplished through the web interface (shown above), or through the command line interface.

The following commands configure a Quanta switch through the command line interface to sample packets at 1-in-2000, poll counters every 20 seconds and send sFlow to an analyzer using the default sFlow port (6343):

For each interface:
sflow collector-address
sflow rate 2000
sflow interval 20
A previous posting discussed the selection of sampling rates. Additional information can be found on the Quanta web site.
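A common rule of thumb (an illustrative assumption here, not vendor guidance - see the previous posting for the full discussion) is to scale the sampling rate linearly with link speed, so that each link generates a roughly constant stream of samples:

```python
def suggested_sampling_rate(link_speed_bps):
    """Rule-of-thumb sampling rate: 1-in-N where N scales with link speed.

    This keeps the per-link sample stream roughly constant, e.g.
    1-in-1000 on a 1G link and 1-in-10000 on a 10G link. The divisor
    is an assumption for illustration; tune it to your monitoring needs.
    """
    return max(1, int(link_speed_bps / 1e6))

print(suggested_sampling_rate(1e9))    # 1G link  -> 1-in-1000
print(suggested_sampling_rate(10e9))   # 10G link -> 1-in-10000
```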

See Trying out sFlow for suggestions on getting started with sFlow monitoring and reporting.

Tuesday, February 5, 2013

Measurement delay, counters vs. packet samples

This chart compares the frame rate reported for a switch port based on sFlow interface counter and packet sample measurements (shown in blue and gold respectively). The chart was created using sFlow-RT, which asynchronously updates metrics as soon as new data arrives, demonstrating the fastest possible response to both counter and packet sample measurements.

In this case, the counter export interval was set to 20 seconds and the blue line, trending the ifInUcastPkts counter, shows that it can take up to 40 seconds before the counter metric fully reflects a change in traffic level (illustrating the frequency resolution bounds imposed by the Nyquist-Shannon sampling theorem). The frames metric, calculated from packet samples, responds far more quickly, immediately detecting a change in traffic and fully reflecting the new value within a few seconds.
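A small sketch (with hypothetical numbers) shows why a counter-derived rate can lag by up to two export intervals. If the traffic level steps half way through a 20 second interval, the first export blends the old and new rates; only the export after that reflects the new level:

```python
def counter_rate(prev_count, curr_count, interval_seconds):
    """Frame rate computed from successive counter exports."""
    return (curr_count - prev_count) / interval_seconds

# Traffic steps from 1,000 to 5,000 frames/sec half way through a
# 20 second export interval. Counter values at successive exports:
c0 = 0
c1 = c0 + 10 * 1000 + 10 * 5000   # interval straddling the step
c2 = c1 + 20 * 5000               # steady at the new rate

print(counter_rate(c0, c1, 20))   # blended, misleading value (3000.0)
print(counter_rate(c1, c2, 20))   # correct value, 40s after the step (5000.0)
```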

The counter push mechanism used by sFlow is extremely efficient, permitting faster counter updates than are practical using large scale counter polling - see Push vs Pull. Reducing the counter export interval below 20 seconds would increase the responsiveness, but at the cost of increased overhead and reduced scalability. On the other hand, packet sampling automatically allocates monitoring resources to busy links, providing a highly scalable way to quickly detect traffic flows wherever they occur in the network, see Eye of Sauron.

The difference in responsiveness is important when driving software defined networking applications, where the ability to rapidly detect large flows ensures responsive and stable controls. Packet sampling also provides richer detail than counters, allowing a controller to identify the root cause of traffic increases and drive corrective actions.

While not as responsive as packet sampling, counter updates provide important complementary functionality:
  1. Counters are maintained in hardware and provide precise traffic totals.
  2. Counters capture rare events, like packet discards, that can severely impact performance.
  3. Counters report important link state information, like link speed, LAG group membership etc.
The combination of periodic counter updates and packet sampling makes sFlow a highly scalable and responsive method of monitoring network performance, delivering the critical metrics needed for effective control of network resources.

Monday, February 4, 2013

Cluster performance metrics

The diagram shows three clusters of servers in a data center (e.g. web, Memcache, Hadoop, or application cluster). Each cluster may consist of hundreds of individual servers. Host sFlow agents on the physical or virtual hosts efficiently export standard metrics that can be processed by a wide range of performance analysis tools, including: Graphite, Ganglia, sFlowTrend, log analyzers, etc.

While many tools are capable of consuming sFlow metrics directly, not all tools have the scalability to handle large clusters, or the ability to calculate summary statistics characterizing cluster performance (rather than simply reporting on each member of the cluster).

This article describes how the sFlow-RT analytics engine is used to collect sFlow from large clusters and report on overall cluster health. The diagram illustrates how sFlow-RT is deployed, consuming sFlow measurements from the servers and making summary statistics available through its REST API so that they can be used to populate a performance dashboard like Graphite.

The following summary statistics are supported by sFlow-RT:
  • max: Largest value
  • min: Smallest value
  • sum: Total value
  • avg: Average value
  • var: Variance
  • sdev: Standard deviation
  • med: Median value
  • q1: First quartile
  • q2: Second quartile (same as med)
  • q3: Third quartile
  • iqr: Inter-quartile range (i.e. q3 - q1)
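For readers who want to reproduce these statistics offline, the Python standard library computes the same quantities. This is a sketch: the quartile interpolation method may differ slightly from sFlow-RT's internal calculation, and the sample load values are made up:

```python
import statistics

def summarize(values):
    """Summary statistics matching the list above (q2 is the same as med)."""
    q1, med, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    return {
        'max': max(values), 'min': min(values), 'sum': sum(values),
        'avg': statistics.mean(values),
        'var': statistics.variance(values),   # sample variance
        'sdev': statistics.stdev(values),
        'med': med, 'q1': q1, 'q3': q3, 'iqr': q3 - q1,
    }

# Hypothetical load_one values from an 8 host cluster
load = [0.1, 0.3, 0.2, 0.9, 0.4, 0.2, 0.5, 0.3]
s = summarize(load)
print('median %.2f, iqr %.3f' % (s['med'], s['iqr']))
```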
Any programming language capable of making HTTP requests (Perl, Python, Java, JavaScript, bash, etc.) can be used to retrieve metrics from sFlow-RT using a URL of the form http://server:8008/metric/agents/metrics/json, where:
  • server The host running sFlow-RT
  • agents A semicolon separated list of host addresses or names, or ALL to include all hosts.
  • metrics A comma separated list of metrics to retrieve.
  • filter A filter to further restrict the hosts to include in the query.
Use of the API is best illustrated by a few examples. A query for the avg:load_one and avg:http_method_get metrics across all hosts produces the results:
[
 {
  "metricN": 3,
  "metricName": "avg:load_one",
  "metricValue": 0.10105350773249354,
  "updateTime": 1360040842222
 },
 {
  "metricN": 3,
  "metricName": "avg:http_method_get",
  "metricValue": 54.015954359255026,
  "updateTime": 1360040842721
 }
]
The following query uses a filter to select servers whose hostname starts with the prefix "mem":
The following Python script polls sFlow-RT for cluster statistics every 60 seconds and posts the results to a Graphite collector:
import requests
import socket
import time

# Graphite plaintext protocol collector - host and port are examples
graphite = ('localhost', 2003)

url = 'http://localhost:8008/metric/ALL/sum:load_one/json'
while True:
  r = requests.get(url)
  if r.status_code != 200: break
  vals = r.json()
  if len(vals) > 0:
    sock = socket.socket()
    sock.connect(graphite)
    for v in vals:
      mname  = v["metricName"]
      mvalue = v["metricValue"]
      mtime  = v["updateTime"] / 1000
      message = 'clusterB.%s %f %i\n' % (mname, mvalue, mtime)
      sock.sendall(message.encode())
    sock.close()
  time.sleep(60)
Finally, sFlow (and sFlow-RT) is not limited to monitoring server metrics. The switches connecting the servers in the clusters can also be monitored (the sFlow standard is supported by most switch vendors). The network can quickly become a bottleneck as cluster size increases and it is important to track metrics such as link utilization, packet discards etc. that can result in severe performance degradation. In addition, sFlow instrumentation is available for Apache, NGINX, Java, Memcached and custom applications - providing details such as URLs, response times, status codes, etc., and tying application, server and network performance together to provide a comprehensive view of performance.

Sunday, February 3, 2013

Delay vs utilization for adaptive control

Google AppEngine
The Google App Engine Blog describes an interesting performance related outage that occurred on Friday, October 26, 2012, with a result that "from approximately 7:30 to 11:30 AM US/Pacific, about 50% of requests to App Engine applications failed."

One comment from an App Engine user stood out, "We noticed a 5x increase in server instances. I think the scaling algorithm kicked in when instance latency grew to 60 seconds. Request latency is a key component in the decision to spawn more instances, right?"

Service level agreements are typically expressed in terms of latency/response time/delay, so response time needs to be managed. It seems intuitively obvious that monitoring response time and taking action if response time is seen to be increasing is the right approach to service scaling. However, there are serious problems with response time as a control metric.

This article discusses the problems with using response time to drive control decisions. The discussion has broad relevance to the areas of server scaling, cloud orchestration, load balancing and software defined networking, where cloud systems need to adapt to changing demand.
Figure 1: Response Time vs Utilization (from Performance by Design)
The discussion requires some background on queueing theory, which can be used to describe how application response time changes as load on a system increases. Figure 1 shows the relationship between utilization and response time. The graph shows that response time remains fairly flat until utilization approaches 60-70%, after which response time increases rapidly.
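For an M/M/1 queue, response time is R = S / (1 - ρ), where S is the service time and ρ the utilization. A minimal sketch (the M/M/1 model is a standard simplification, not a formula taken from the referenced book) reproduces the knee in the curve:

```python
def mm1_response_time(service_time, utilization):
    """M/M/1 queue: R = S / (1 - rho). Valid for 0 <= rho < 1."""
    assert 0 <= utilization < 1
    return service_time / (1.0 - utilization)

# Response time stays flat at low utilization, then climbs steeply
for rho in (0.1, 0.5, 0.7, 0.9, 0.95):
    print('utilization %.2f -> %.1fx service time' %
          (rho, mm1_response_time(1.0, rho)))
```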

Problem 1: Non-linear gain

Anyone who has held their ears because of the loud screech of a public address system has experienced the effect of gain on the stability of a feedback system. As the volume on the amplifier is increased there comes a point where the amplified sound from the speakers is picked up and re-amplified in a self sustaining feedback loop - resulting in an ear splitting screech. The only way to stop the sound is to turn the volume down, or turn off the microphone.
Figure 2: Step response vs loop gain (from PID controller)
This effect is well known in control theory. Figure 2 shows the effect of amplification, or gain, on the stability of feedback control. The chart shows the response of the controller in the face of an abrupt (step) change. As the gain of the feedback response is increased, the system overshoots and oscillates before settling at a new level. If the gain is increased sufficiently, the feedback control becomes unstable and generates a self sustaining oscillation.
Figure 3: Gain vs utilization
Figure 3 shows how the non-linearity of delay measurements effectively increases gain (slope of curve) as the load increases. For example, the curve is fairly flat (low gain) at 50% utilization and much steeper (high gain) at 90% utilization. If the gain is high enough, the system becomes unstable.
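The gain is the slope of the response time curve. Differentiating the M/M/1 relationship R = S / (1 - ρ) gives dR/dρ = S / (1 - ρ)², so (continuing the M/M/1 sketch, a simplification rather than the book's exact model) gain grows quadratically as utilization approaches 100%:

```python
def mm1_gain(service_time, utilization):
    """Slope of the M/M/1 response time curve: dR/drho = S / (1 - rho)^2."""
    return service_time / (1.0 - utilization) ** 2

# The gain at 90% utilization is 25 times the gain at 50%
print(mm1_gain(1.0, 0.5))   # 4.0
print(mm1_gain(1.0, 0.9))   # ~100
```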

Problem 2: Non-linear delay

Delay and stability, describes how delay in a feedback loop results in system instability.
Figure 4: Effect of delay on stability (from Delay and stability)
Figure 4 shows that the effect of increasing delay on the stability of a feedback loop is similar to increasing the gain. As delay is increased the response starts to oscillate and if the delay is large enough, the controller becomes unstable.

Response time is what is referred to as a lagging (delayed) indicator of performance. Delay is intrinsic to the measurement since response time can only be calculated when a request completes.
Figure 5: Measurement delay vs measured response time
Figure 5 shows how measurement delay increases with response time. The linear relationship between measurement delay and measured response time should be intuitively obvious: for example, if the average response time is reported as 1 second, the measurement is based on requests that arrived on average 1 second earlier and are now completing. If the average response time increases to 2 seconds, then the measurement is based on requests that arrived on average 2 seconds ago. If the measurement delay is large enough, the system becomes unstable.


Use of response time as a control variable leads to insidious performance problems - the controller appears to work well when the system is operating at low to moderate utilizations, but suddenly becomes unstable if an unexpected surge in demand moves the system into the high gain, high delay (unstable) region. Once the system has been destabilized, it can continue to behave erratically, even after the surge in demand has passed. A full shutdown may be the only way to restore stable operation. From the Google blog: "11:10 am - We determine that App Engine’s traffic routers are trapped in a cascading failure, and that we have no option other than to perform a full restart with gradual traffic ramp-up to return to service."

The solution to controlling response time lies in the recognition that response time is a function of system utilization. Instead of basing control actions on measured response time, controls should be based on measured utilization. Utilization is an easy to measure, low latency, linear metric that can be used to construct stable and responsive feedback control systems. Since response time is a function of utilization, controlling utilization effectively controls response time.
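Because the relationship is linear, a utilization-based scaler can be very simple. The following is a hypothetical proportional controller (an illustration, not a description of App Engine or any production system): demand is proportional to instances × utilization, so the pool is resized to bring utilization back to a target level:

```python
import math

def target_instances(current_instances, measured_utilization, target=0.6):
    """Size the pool so utilization returns to the target level.

    The 0.6 target keeps the system below the knee of the response
    time curve (an assumption; pick a target for your workload).
    """
    demand = current_instances * measured_utilization
    return max(1, math.ceil(demand / target))

print(target_instances(10, 0.9))   # overloaded -> scale out to 15
print(target_instances(10, 0.3))   # underused  -> scale in to 5
```

Because utilization responds linearly and with low delay, this controller avoids both the non-linear gain and the measurement lag problems described above.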

The sFlow standard provides the multi-vendor, scalable visibility into changing demand needed to deliver stable and effective scaling, load balancing, control and orchestration solutions.