Friday, June 14, 2013

Multi-tenant performance isolation

This incident report from an OpenStack based cloud data center illustrates how performance problems can propagate and affect multiple tenants within the data center. This article will examine the incident and describe how performance aware software defined networking can be used to improve performance isolation in multi-tenant environments.

The incident report describes an external distributed denial of service (DDoS) attack that was launched some time before 9:30. The measurement system started to detect the effects of the attack at 9:30, and it took until 10:00 to fully identify the attack and start planning a response. The plan to null route the traffic was implemented at 10:09 and the incident was fully resolved at 10:29.
The article SDN and delay discusses the components of delay in a feedback control loop. Applying that breakdown to the timeline of the DDoS incident identifies the following components of response delay:
  • Measurement delay, 30 minutes
  • Planning delay, 9 minutes
  • Configuration delay, not broken out, included in planning delay
  • Response delay, < 20 minutes
  • Loop delay, 59 minutes
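
The breakdown above can be reproduced from the four timestamps in the incident report. The following is a minimal sketch; the dictionary keys and the function name are illustrative labels chosen here, not terms from the report, and the configuration delay is folded into planning because the report does not break it out.

```python
from datetime import datetime, timedelta

def t(hhmm):
    """Parse an HH:MM time of day (the date is irrelevant for the differences)."""
    return datetime.strptime(hhmm, "%H:%M")

# Timestamps quoted in the incident report; the key names are illustrative.
timeline = {
    "detected": t("09:30"),    # measurement system starts to see the attack
    "identified": t("10:00"),  # attack fully identified, planning starts
    "configured": t("10:09"),  # null route implemented
    "resolved": t("10:29"),    # incident fully resolved
}

def delay_components(tl):
    """Split the control loop into its delay components."""
    return {
        "measurement": tl["identified"] - tl["detected"],
        "planning_and_configuration": tl["configured"] - tl["identified"],
        "response": tl["resolved"] - tl["configured"],
        "loop": tl["resolved"] - tl["detected"],
    }

delays = delay_components(timeline)
for name, value in delays.items():
    print(f"{name}: {int(value.total_seconds() // 60)} minutes")
```

The loop delay is the sum of the other three components, which is why shortening any one stage shortens the overall response.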
Threats to performance aren't just external. The following related incident report shows that an internal host (likely a compromised host that was part of the initial DDoS attack) was responsible for disrupting service for other tenants within the data center.
In this case the problem was resolved faster, in 11 minutes (however, if this host was part of the original DDoS attack, then the total time to detect and isolate it was 2 hours 10 minutes).

While automation is an important part of OpenStack (and other cloud orchestration systems), current architectures don't include the feedback mechanisms and coordinated controls needed for effective multi-tenant performance isolation; see the article Network virtualization, management silos, and missed opportunities.

The key to building responsive, performance-optimizing controllers is a pervasive, scalable, real-time monitoring system. The sFlow instrumentation embedded within the physical and virtual switches (in this case Open vSwitch), load balancers and hypervisors enables real-time monitoring of the entire cloud data center.

The next step is to integrate the real-time analytics into the orchestration system. The article performance aware software defined networking describes the basic elements of a performance optimizing controller.
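
As a rough illustration of what such a feedback loop involves, the sketch below runs the measurement, planning and configuration stages over a series of traffic samples. It is a toy simulation under stated assumptions: the function names, the addresses, the threshold and the sample data are all hypothetical, and a real controller would read measurements from the monitoring system and push null routes to the network rather than update a Python set.

```python
# Hypothetical sketch of a feedback control loop. Names, addresses and the
# threshold are illustrative assumptions, not taken from any real controller.

def plan(rates_bps, threshold_bps, already_blocked):
    """Planning: select destinations whose inbound rate exceeds the threshold."""
    return {
        dst
        for dst, rate in rates_bps.items()
        if rate > threshold_bps and dst not in already_blocked
    }

def control_loop(samples, threshold_bps):
    """Run measurement -> planning -> configuration over successive samples.

    `samples` is a list of dicts mapping a destination address to its measured
    inbound bits per second, standing in for a real-time measurement feed.
    Returns the (sample_index, destination) null-route actions taken.
    """
    blocked = set()
    actions = []
    for i, rates in enumerate(samples):            # measurement
        for dst in sorted(plan(rates, threshold_bps, blocked)):
            blocked.add(dst)                       # configuration (simulated)
            actions.append((i, dst))
    return actions

samples = [
    {"10.0.0.1": 2e6, "10.0.0.2": 5e6},    # normal traffic
    {"10.0.0.1": 3e6, "10.0.0.2": 9e8},    # attack on 10.0.0.2 begins
    {"10.0.0.1": 2e6, "10.0.0.2": 1.2e9},  # attack continues
]
actions = control_loop(samples, threshold_bps=5e8)
print(actions)
```

In this simulation the loop acts on the first sample in which the threshold is crossed, so the measurement interval bounds how quickly the controller can respond.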
The article DDoS describes a fully automated system for DDoS mitigation with a loop delay of around 10 seconds, i.e. it is able to detect, characterize, null route and eliminate an attack within 10 seconds (over 300 times faster than the manual process). The controller is fast enough to prevent the attack from fully developing, cutting the peak traffic by a factor of 4.
Even faster responses are possible using software defined networking (SDN): the article Controlling large flows with OpenFlow describes an experimental controller that can mitigate a denial of service attack in 2 seconds.
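
The speed-up figures follow directly from the loop delays. The short calculation below is only a back-of-the-envelope check; it compares each automated loop delay against the 59 minute manual loop delay from the incident above, and the labels are illustrative.

```python
# Manual loop delay from the incident timeline: 59 minutes, in seconds.
manual_loop_s = 59 * 60

# Automated loop delays quoted in the text, in seconds.
automated_loops_s = {
    "automated null routing": 10,
    "experimental OpenFlow controller": 2,
}

speedups = {
    name: manual_loop_s / loop_s for name, loop_s in automated_loops_s.items()
}
for name, factor in speedups.items():
    print(f"{name}: about {factor:.0f}x faster than the manual process")
```

A 10 second loop works out to 354 times faster than the manual process, consistent with the "over 300 times" figure.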

Denial of service mitigation is just one example of multi-tenant performance isolation. Tenants run many types of application within their cloud deployments that stress the infrastructure. The articles Multi-tenant traffic in virtualized network environments, Pragmatic software defined networking and Resource allocation look at some of the architectural issues involved in managing cloud performance.
