Sunday, February 3, 2013

Delay vs utilization for adaptive control

Google AppEngine
The Google App Engine Blog describes an interesting performance-related outage that occurred on Friday, October 26, 2012, with the result that "from approximately 7:30 to 11:30 AM US/Pacific, about 50% of requests to App Engine applications failed."

One comment from an App Engine user stood out, "We noticed a 5x increase in server instances. I think the scaling algorithm kicked in when instance latency grew to 60 seconds. Request latency is a key component in the decision to spawn more instances, right?"

Service level agreements are typically expressed in terms of latency/response time/delay, so response time needs to be managed. It seems intuitively obvious that monitoring response time and taking action if response time is seen to be increasing is the right approach to service scaling. However, there are serious problems with response time as a control metric.

This article discusses the problems with using response time to drive control decisions. The discussion has broad relevance to server scaling, cloud orchestration, load balancing and software-defined networking, where cloud systems need to adapt to changing demand.
Figure 1: Response Time vs Utilization (from Performance by Design)
The discussion requires some background on queueing theory, which can be used to describe how application response time changes as load on a system increases. Figure 1 shows the relationship between utilization and response time. The graph shows that response time remains fairly flat until utilization approaches 60-70%, after which response time increases rapidly.
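The shape of this curve can be reproduced with a few lines of code. The sketch below assumes the simplest textbook queueing model, a single-server M/M/1 queue, where mean response time is R = S / (1 - ρ) for service time S and utilization ρ. The model behind Figure 1 is not stated here and may differ (a multi-server model stays flat for longer), so treat the numbers as illustrative only. The function name is hypothetical.

```python
def mm1_response_time(service_time, utilization):
    """Mean response time of an M/M/1 queue: R = S / (1 - rho).

    Assumption: a single-server queue with Poisson arrivals and
    exponential service times. Real systems will deviate from this.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


if __name__ == "__main__":
    # Response time in multiples of the service time (S = 1).
    for rho in (0.10, 0.50, 0.70, 0.90, 0.95):
        print(f"utilization {rho:.2f} -> response time {mm1_response_time(1.0, rho):.1f}")
```

Under this model, going from 50% to 90% utilization multiplies response time by five, and the curve has no upper bound as utilization approaches 100%.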

Problem 1: Non-linear gain

Anyone who has held their ears because of the loud screech of a public address system has experienced the effect of gain on the stability of a feedback system. As the volume on the amplifier is increased, there comes a point where the amplified sound from the speakers is picked up and re-amplified in a self-sustaining feedback loop, resulting in an ear-splitting screech. The only way to stop the sound is to turn the volume down, or turn off the microphone.
Figure 2: Step response vs loop gain (from PID controller)
This effect is well known in control theory. Figure 2 shows the effect of amplification, or gain, on the stability of feedback control. The chart shows the response of the controller in the face of an abrupt (step) change. As the gain of the feedback response is increased, the system overshoots and oscillates before settling at a new level. If the gain is increased sufficiently, the feedback control becomes unstable and generates a self-sustaining oscillation.
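The same progression from smooth settling to overshoot to instability shows up in even the simplest discrete-time loop. The sketch below is a toy model, not the PID controller behind Figure 2: at each step the output is corrected by `gain` times the current error, so the error is multiplied by (1 - gain) every step. The function name and parameters are illustrative assumptions.

```python
def step_response(gain, steps=30, target=1.0):
    """Simulate x[k+1] = x[k] + gain * (target - x[k]) from x = 0.

    Toy feedback loop: the error shrinks by a factor (1 - gain) per step.
    """
    x = 0.0
    trace = [x]
    for _ in range(steps):
        x += gain * (target - x)  # correct in proportion to the error
        trace.append(x)
    return trace


if __name__ == "__main__":
    for gain in (0.5, 1.5, 2.0, 2.5):
        trace = step_response(gain)
        print(f"gain {gain}: peak {max(trace):.3g}, final {trace[-1]:.3g}")
```

In this toy loop a gain below 1 settles without overshoot, a gain between 1 and 2 overshoots and rings before settling, a gain of exactly 2 oscillates indefinitely, and anything larger diverges.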
Figure 3: Gain vs utilization
Figure 3 shows how the non-linearity of delay measurements effectively increases gain (slope of curve) as the load increases. For example, the curve is fairly flat (low gain) at 50% utilization and much steeper (high gain) at 90% utilization. If the gain is high enough, the system becomes unstable.
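To put rough numbers on the slope, the sketch below again assumes an M/M/1 queue (an assumption made here for illustration, not something Figure 3 is known to be based on). Differentiating R = S / (1 - ρ) with respect to ρ gives a slope of S / (1 - ρ)², which is the gain a response-time controller effectively sees at utilization ρ. The function name is hypothetical.

```python
def mm1_gain(service_time, utilization):
    """Slope dR/d(rho) of the M/M/1 response time curve: S / (1 - rho)^2.

    Assumption: M/M/1 model, used only to illustrate how the slope grows.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization) ** 2


if __name__ == "__main__":
    g50 = mm1_gain(1.0, 0.5)
    g90 = mm1_gain(1.0, 0.9)
    print(f"slope at 50%: {g50:.1f}, slope at 90%: {g90:.1f}, ratio: {g90 / g50:.0f}x")
```

Under this model the slope at 90% utilization is about 25 times the slope at 50%, so a controller tuned to behave well at moderate load operates with a far higher effective gain under heavy load.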

Problem 2: Non-linear delay

The article Delay and stability describes how delay in a feedback loop results in system instability.
Figure 4: Effect of delay on stability (from Delay and stability)
Figure 4 shows that the effect of increasing delay on the stability of a feedback loop is similar to increasing the gain. As delay is increased the response starts to oscillate and if the delay is large enough, the controller becomes unstable.
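A small simulation makes the point concrete. The sketch below is a toy loop, not the model used for Figure 4: the controller applies a fixed gain, but the measurement it acts on is `delay` steps old. The gain is held at 0.5 throughout, so any change in behavior comes from the delay alone. Names and parameter values are illustrative assumptions.

```python
def simulate_delayed_loop(gain, delay, steps=200, target=1.0):
    """Simulate x[k+1] = x[k] + gain * (target - x[k - delay]).

    Toy model: the correction is based on a measurement that is
    `delay` steps stale. The output is 0 before the step change.
    """
    history = [0.0] * (delay + 1)
    for _ in range(steps):
        observed = history[-1 - delay]  # stale measurement
        history.append(history[-1] + gain * (target - observed))
    return history[delay:]


def residual_error(trace, target=1.0, window=20):
    """Largest deviation from the target over the last `window` samples."""
    return max(abs(x - target) for x in trace[-window:])


if __name__ == "__main__":
    for delay in (0, 1, 4):
        err = residual_error(simulate_delayed_loop(0.5, delay))
        print(f"delay {delay}: residual error {err:.3g}")
```

With no delay the loop settles smoothly. With a one-step delay it overshoots and rings but still converges. With a four-step delay the same gain produces a growing oscillation.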

Response time is what is referred to as a lagging (delayed) indicator of performance. Delay is intrinsic to the measurement since response time can only be calculated when a request completes.
Figure 5: Measurement delay vs measured response time
Figure 5 shows how measurement delay increases with response time. The linear relationship between measurement delay and measured response time should be intuitively obvious: for example, if the average response time is reported as 1 second, the measurement is based on requests that arrived on average 1 second earlier and are now completing. If the average response time increases to 2 seconds, then the measurement is based on requests that arrived on average 2 seconds ago. If the measurement delay is large enough, the system becomes unstable.


Use of response time as a control variable leads to insidious performance problems: the controller appears to work well when the system is operating at low to moderate utilizations, but suddenly becomes unstable if an unexpected surge in demand moves the system into the high gain, high delay (unstable) region. Once the system has been destabilized, it can continue to behave erratically, even after the surge in demand has passed. A full shutdown may be the only way to restore stable operation. From the Google blog: "11:10 am - We determine that App Engine’s traffic routers are trapped in a cascading failure, and that we have no option other than to perform a full restart with gradual traffic ramp-up to return to service."

The solution to controlling response time lies in the recognition that response time is a function of system utilization. Instead of basing control actions on measured response time, controls should be based on measured utilization. Utilization is an easy to measure, low latency, linear metric that can be used to construct stable and responsive feedback control systems. Since response time is a function of utilization, controlling utilization effectively controls response time.
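As a rough illustration of what a utilization-driven scaling rule might look like, the sketch below sizes a pool so that the measured load would sit at a target utilization. It is a minimal sketch under stated assumptions, not App Engine's algorithm: the function name, the 60% default target, and the instance cap are all hypothetical, and it assumes load spreads evenly across instances.

```python
import math


def desired_instances(current_instances, measured_utilization,
                      target_utilization=0.6, max_instances=100):
    """Return the instance count that would bring utilization to the target.

    Assumptions (hypothetical, for illustration): load is spread evenly,
    so total demand is current_instances * measured_utilization.
    """
    if current_instances < 1:
        raise ValueError("current_instances must be at least 1")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    demand = current_instances * measured_utilization
    # Round before ceil so floating point noise does not add an instance.
    needed = math.ceil(round(demand / target_utilization, 6))
    return max(1, min(max_instances, needed))


if __name__ == "__main__":
    print(desired_instances(10, 0.9))  # overloaded pool: scale out
    print(desired_instances(10, 0.3))  # underused pool: scale in
```

Because the rule is linear in measured utilization, the size of the correction is proportional to the size of the load change at every operating point, which avoids the load-dependent gain of a response-time controller.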

The sFlow standard provides the multi-vendor, scalable visibility into changing demand needed to deliver stable and effective scaling, load balancing, control and orchestration solutions.
