Tuesday, January 6, 2015

Open vSwitch performance monitoring

Credit: Accelerating Open vSwitch to “Ludicrous Speed”
Accelerating Open vSwitch to "Ludicrous Speed" describes the architecture of Open vSwitch. When a packet arrives, the OVS Kernel Module checks its cache to see if there is an entry that matches the packet. If there is a match then the packet is forwarded within the kernel. Otherwise, the packet is sent to the user space ovs-vswitchd process to determine the forwarding decision based on the set of OpenFlow rules that have been installed or, if no rules are found, by passing the packet to an OpenFlow controller. Once a forwarding decision has been made, the packet and the forwarding actions are passed back to the OVS Kernel Module which caches the decision and forwards the packet. Subsequent packets in the flow will then be matched by the cache and forwarded within the kernel.
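The fast path / slow path split described above can be sketched as a toy model. This is only an illustration, not Open vSwitch code: the `Datapath` class, the `OPENFLOW_RULES` table, and the `ask_controller` helper are hypothetical names, and a real flow key contains many more fields than the three used here.

```python
from typing import Callable, Dict, Hashable, Tuple

FlowKey = Hashable
Actions = Tuple[str, ...]

# Hypothetical OpenFlow rule table held in user space.
OPENFLOW_RULES: Dict[FlowKey, Actions] = {
    ("10.0.0.1", "10.0.0.2", 80): ("output:2",),
}


def ask_controller(key: FlowKey) -> Actions:
    """Stand-in for an OpenFlow controller; always answers 'drop' here."""
    return ("drop",)


def user_space_decision(key: FlowKey) -> Actions:
    """Slow path: consult installed rules, else defer to the controller."""
    actions = OPENFLOW_RULES.get(key)
    return actions if actions is not None else ask_controller(key)


class Datapath:
    """Toy model of the kernel cache with a user-space fallback."""

    def __init__(self, slow_path: Callable[[FlowKey], Actions]) -> None:
        self.cache: Dict[FlowKey, Actions] = {}
        self.slow_path = slow_path
        self.hits = 0
        self.misses = 0

    def receive(self, key: FlowKey) -> Actions:
        actions = self.cache.get(key)
        if actions is not None:
            # Cache hit: forwarded without leaving the "kernel".
            self.hits += 1
            return actions
        # Cache miss: ask user space, then cache the decision.
        self.misses += 1
        actions = self.slow_path(key)
        self.cache[key] = actions
        return actions


dp = Datapath(slow_path=user_space_decision)
flow = ("10.0.0.1", "10.0.0.2", 80)
results = [dp.receive(flow) for _ in range(3)]
```

Only the first packet of the flow reaches `user_space_decision`; the remaining packets are served from the cache, which is why the hit and miss counters are a useful measure of how much work is being pushed to user space.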

The recent Open vSwitch 2014 Fall Conference included the talk, Managing Open vSwitch across a large heterogeneous fleet by Chad Norgan, describing Rackspace's experience with running a large scale OpenStack deployment using Open vSwitch for network virtualization. The talk describes the key metrics that Rackspace collects to monitor the performance of the large pools of Open vSwitch instances.

This article discusses the metrics presented in the Rackspace talk and describes how the embedded sFlow agent in Open vSwitch was extended to efficiently export the metrics.
The first chart trends the number of entries in each of the OVS Kernel Module caches across all the virtual switches in the OpenStack deployment.
The next chart trends the cache hit / miss rates for the OVS Kernel Module. Processing packets using cached entries in the kernel is much faster than sending them to user space and requires far fewer CPU cycles, so maintaining a high cache hit rate is critical to handling the large volume of traffic in a cloud data center.
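As a rough illustration of how a hit rate might be derived, the sketch below assumes the hit and miss counters are cumulative, so the rate over an interval comes from the difference between two successive readings. The function names and the `bits` parameter (used to tolerate counter wrap) are assumptions made for this example.

```python
from typing import Optional


def counter_delta(prev: int, curr: int, bits: int = 32) -> int:
    """Difference between two readings of a counter that wraps at 2**bits."""
    return (curr - prev) % (1 << bits)


def hit_rate(prev_hits: int, prev_misses: int,
             curr_hits: int, curr_misses: int,
             bits: int = 32) -> Optional[float]:
    """Fraction of lookups served from the cache during the interval.

    Returns None when no lookups occurred, since the ratio is undefined.
    """
    hits = counter_delta(prev_hits, curr_hits, bits)
    misses = counter_delta(prev_misses, curr_misses, bits)
    total = hits + misses
    return hits / total if total else None
```

For example, `hit_rate(100, 10, 190, 20)` covers an interval with 90 hits and 10 misses. Returning `None` for an idle interval avoids reporting a misleading 0% hit rate when there was simply no traffic.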
The third chart from the Rackspace presentation tracks the CPU consumed by ovs-vswitchd as it handles cache misses. Excessive CPU utilization can result in poor network performance and dropped packets. Reducing the CPU cycles consumed by networking frees up resources that can be used to host additional virtual machines and generate additional revenue.

Currently, monitoring Open vSwitch cache performance involves polling each switch using the ovs-dpctl command and collecting the results. Polling is complex to configure and maintain; operational complexity is reduced if Open vSwitch is able to push the metrics (see Push vs Pull).
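To give a sense of what the polling approach involves, here is a minimal sketch that parses the output of `ovs-dpctl show`. It assumes the output contains `lookups:`, `flows:` and `masks:` lines in the form shown in `SAMPLE`; the sample values are made up, the exact format may vary between Open vSwitch versions, and only the first datapath in the output is parsed.

```python
import re
import subprocess
from typing import Dict

# Illustrative output only; the numbers are invented for this example.
SAMPLE = """system@ovs-system:
    lookups: hit:1200 missed:300 lost:2
    flows: 45
    masks: hit:2500 total:4 hit/pkt:1.67
    port 0: ovs-system (internal)
"""

LOOKUPS = re.compile(r"lookups:\s*hit:(\d+)\s+missed:(\d+)\s+lost:(\d+)")
FLOWS = re.compile(r"^\s*flows:\s*(\d+)", re.MULTILINE)
MASKS = re.compile(r"masks:\s*hit:(\d+)\s+total:(\d+)")


def parse_dpctl_show(text: str) -> Dict[str, int]:
    """Extract cache counters for the first datapath found in the text."""
    stats: Dict[str, int] = {}
    m = LOOKUPS.search(text)
    if m:
        stats["hits"], stats["misses"], stats["lost"] = map(int, m.groups())
    m = FLOWS.search(text)
    if m:
        stats["flows"] = int(m.group(1))
    m = MASKS.search(text)
    if m:
        stats["mask_hits"], stats["masks"] = map(int, m.groups())
    return stats


def poll_local_switch() -> Dict[str, int]:
    """Run ovs-dpctl on this host (typically requires root) and parse it."""
    out = subprocess.run(["ovs-dpctl", "show"], check=True,
                         capture_output=True, text=True).stdout
    return parse_dpctl_show(out)


parsed = parse_dpctl_show(SAMPLE)
```

A script like this would have to be scheduled on, or pointed at, every switch and its results shipped to a collector, which is the operational overhead that pushing the metrics avoids.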

The following sFlow structure was defined to allow Open vSwitch to export cache statistics along with the other sFlow metrics that are pushed by the sFlow agent:
/* Open vSwitch data path statistics */
/* see datapath/datapath.h */
/* opaque = counter_data; enterprise = 0; format = 2207 */
struct ovs_dp_stats {
  unsigned int hits;
  unsigned int misses;
  unsigned int lost;
  unsigned int mask_hits;
  unsigned int flows;
  unsigned int masks;
}
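On the receiving side, the body of this structure could be decoded along the following lines. sFlow encodes data using XDR, so each `unsigned int` is a 4-byte big-endian value and the six fields occupy 24 bytes. The sketch assumes the caller has already located the counter record with enterprise = 0 and format = 2207 and stripped its header; `OvsDpStats` and `decode_ovs_dp_stats` are names invented for this example.

```python
import struct
from typing import NamedTuple


class OvsDpStats(NamedTuple):
    """Fields of ovs_dp_stats, in the order they appear on the wire."""
    hits: int
    misses: int
    lost: int
    mask_hits: int
    flows: int
    masks: int


# Six XDR unsigned ints: big-endian, 4 bytes each.
_LAYOUT = struct.Struct(">6I")


def decode_ovs_dp_stats(payload: bytes) -> OvsDpStats:
    """Decode the opaque body of an ovs_dp_stats counter record."""
    if len(payload) < _LAYOUT.size:
        raise ValueError(
            f"expected at least {_LAYOUT.size} bytes, got {len(payload)}")
    return OvsDpStats(*_LAYOUT.unpack_from(payload))
```

For instance, `decode_ovs_dp_stats(struct.pack(">6I", 1200, 300, 2, 2500, 45, 4))` yields a tuple whose `hits` field is 1200 and whose `masks` field is 4.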
The sFlow agent was also extended to export CPU and memory statistics for the ovs-vswitchd process by populating the app_resources structure - see sFlow Application Structures.

These extensions are the latest in a set of recent enhancements to the Open vSwitch sFlow implementation.
The Open vSwitch project first added sFlow support five years ago and these recent enhancements build on the detailed visibility into network traffic provided by the core Open vSwitch sFlow implementation and the complementary visibility into hosts, hypervisors, virtual machines and containers provided by the Host sFlow project.
Visibility and the software defined data center
Broad support for the sFlow standard across the cloud data center stack provides simple, efficient, low cost, scalable, and comprehensive visibility. The standard metrics can be consumed by a broad range of open source and commercial tools, including: sflowtool, sFlow-Trend, sFlow-RT, Ganglia, Graphite, InfluxDB and Grafana.

6 comments:

  1. What version of OVS is needed to export these metrics (hits, misses, etc.) via sFlow?

  2. Hi.

    What command can I use (you mentioned ovs-dpctl but which exactly) to get those statistics shown in the charts?

    You mentioned that the standard metrics can be consumed by a variety of tools. Are those cache statistics not standard? If they aren't, what tool can I use to analyze them?

    Thanks!

    Replies
    1. The metrics are exported when you configure an sFlow polling interval. The easiest way to configure sFlow on Open vSwitch is to install the Host sFlow agent. It ships with the sflowovsd daemon that automatically synchronizes the Host sFlow agent settings and Open vSwitch settings.

      The ovs_dp_stats structure is understood by sflowtool which allows the data to be incorporated in scripts and pushed into other tools. You could also use sFlow-RT to calculate statistics and send them to other tools.

  3. Is there a tool like "nfdump" for sFlow? I need to get metrics and flows via the CLI.

    Replies
    1. You might want to take a look at sFlow-RT. Command line scripts can poll the REST API for metrics. The articles, Cluster performance metrics and RESTflow give examples.
