Monday, December 15, 2014

Stop thief!

The Host-sFlow project recently added CPU steal to the set of CPU metrics it exports. The Linux proc(5) man page describes the steal field of /proc/stat as follows:
steal (since Linux 2.6.11)
       (8) Stolen time, which is the time spent in other operating systems
       when running in a virtualized environment
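As a rough illustration of how the steal field can be turned into a percentage, the following Python sketch takes two samples of the aggregate cpu line in /proc/stat and divides the change in steal by the change in total CPU time. The function names are made up for this example, and this is not how the Host sFlow agent itself is implemented; it only assumes the field order documented in proc(5).

```python
import time


def parse_cpu_line(line):
    """Return the numeric fields of the aggregate 'cpu' line from /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def steal_percent(before, after):
    """Percentage of elapsed CPU time that was stolen between two samples."""
    # Field order per proc(5): user nice system idle iowait irq softirq steal.
    # Kernels older than 2.6.11 report fewer fields and have no steal counter.
    if len(before) < 8 or len(after) < 8:
        return 0.0
    # Only the first eight fields are summed: guest and guest_nice (fields 9
    # and 10) are already counted inside user and nice.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


def sample_steal_percent(interval=1.0, path="/proc/stat"):
    """Read /proc/stat twice, `interval` seconds apart (Linux only)."""
    with open(path) as f:
        before = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.readline())
    return steal_percent(before, after)
```

On a Linux guest, calling sample_steal_percent() returns the share of CPU time taken by the hypervisor over the sampling interval; on bare metal the value would normally be zero.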
Keeping close track of the stolen time metric is particularly important when managing virtual machines in a public cloud. For example, Netflix and Stolen Time includes the discussion:
So how does Netflix handle this problem when using Amazon’s Cloud? Adrian admits that they tracked this statistic so closely that when an instance crossed a stolen time threshold the standard operating procedure at Netflix was to kill the VM and start it up on a different hypervisor. What Netflix realized over time was that once a VM was performing poorly because another VM was crashing the party, usually due to a poorly written or compute intensive application hogging the machine, it never really got any better and their best learned approach was to get off that machine.
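The threshold policy described in the quote could be sketched along the following lines. The class name, the 10% threshold, and the five-sample window are all hypothetical values chosen for illustration; the quoted passage does not say what threshold Netflix used or how long an instance had to exceed it.

```python
from collections import deque


class StealWatch:
    """Flag an instance whose steal stays above a threshold for a full window."""

    def __init__(self, threshold_pct=10.0, window=5):
        # Both defaults are illustrative assumptions, not published values.
        self.threshold_pct = threshold_pct
        self.samples = deque(maxlen=window)

    def observe(self, steal_pct):
        """Record one steal sample; return True if the instance should be replaced."""
        self.samples.append(steal_pct)
        window_full = len(self.samples) == self.samples.maxlen
        # Requiring every sample in the window to exceed the threshold avoids
        # reacting to a single short spike.
        return window_full and all(s > self.threshold_pct for s in self.samples)
```

Requiring a full window of consecutive high readings is one possible design choice; a monitoring system fed by sFlow counters could just as well use an average or a percentile over the window.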
The following articles describe how to monitor public cloud instances using Host sFlow agents:
The CPU steal metric is particularly relevant to Network Function Virtualization (NFV). Virtual appliances implementing network functions such as load balancing are especially sensitive to stolen CPU cycles, which can severely impact application response times. Application Delivery Controller (ADC) vendors export sFlow metrics from their physical and virtual appliances (see sFlow leads convergence of multi-vendor application, server, and network performance management). The addition of CPU steal to the set of sFlow metrics exported by virtual appliances will allow NFV orchestration tools to better optimize service pools.
