
Monday, September 26, 2016

Asynchronous Docker metrics

Docker allows large numbers of lightweight containers to be started and stopped within seconds, creating an agile infrastructure that can rapidly adapt to changing requirements. However, this rapidly changing population of containers poses a challenge to traditional monitoring methods, which struggle to keep pace with the changes. For example, periodic polling methods take time to detect new containers and can miss short-lived containers entirely.

This article describes how the latest version of the Host sFlow agent is able to track the performance of a rapidly changing population of Docker containers and export a real-time stream of standard sFlow metrics.
The diagram above shows the life cycle status events associated with a container. The Docker Remote API provides a set of methods that allow the Host sFlow agent to communicate with the Docker daemon to list containers and receive asynchronous container status events. The Host sFlow agent uses the events to keep track of running containers and periodically exports CPU, memory, network and disk performance counters for each container.

The diagram at the beginning of this article shows the sequence of messages, going from top to bottom, required to track a container. The Host sFlow agent first registers for container lifecycle events before asking for all the currently running containers. Later, when a new container is started, Docker immediately sends an event to the Host sFlow agent, which requests additional information (such as the container process identifier - PID) that it can use to retrieve performance counters from the operating system. Initial counter values are retrieved and exported along with container identity information as an sFlow counters message and a polling task for the new container is initiated. Container counters are periodically retrieved and exported while the container continues to run (2 polling intervals are shown in the diagram). When the Host sFlow agent receives an event from Docker indicating that the container is being stopped, it retrieves the final values of the performance counters, exports a final sFlow message, and removes the polling task for the container.
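
The same pattern can be sketched in a few lines of Python using the docker SDK (a minimal illustration only: the Host sFlow agent itself is implemented in C against the Docker Remote API, and the 10 second polling interval, the metric names and the print-based export below are assumptions made for the example):

  import threading
  import time

  import docker

  client = docker.from_env()
  running = {}                 # container id -> container object
  lock = threading.Lock()

  def export(cid, stats):
      # Stand-in for encoding and sending an sFlow counters message.
      cpu = stats["cpu_stats"]["cpu_usage"]["total_usage"]
      mem = stats["memory_stats"].get("usage", 0)
      print(f"{cid[:12]} cpu_ns={cpu} mem_bytes={mem}")

  def poll(interval=10):
      # Periodic counter export for every tracked container.
      while True:
          with lock:
              tracked = list(running.values())
          for c in tracked:
              try:
                  export(c.id, c.stats(stream=False))
              except docker.errors.APIError:
                  pass  # container went away between event and poll
          time.sleep(interval)

  # Start by tracking the containers that are already running.
  with lock:
      for c in client.containers.list():
          running[c.id] = c

  threading.Thread(target=poll, daemon=True).start()

  # Then follow lifecycle events to keep the tracked set current.
  for event in client.events(decode=True, filters={"type": "container"}):
      if event["Action"] == "start":
          with lock:
              running[event["id"]] = client.containers.get(event["id"])
      elif event["Action"] == "die":
          with lock:
              c = running.pop(event["id"], None)
          if c:
              try:
                  export(c.id, c.stats(stream=False))  # final export
              except docker.errors.APIError:
                  pass

As in the diagram, a final set of counters is exported when a container dies, so even short-lived containers leave a complete record.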

This method of asynchronously triggered periodic counter export allows an sFlow collector to accurately track rapidly changing container populations in large scale deployments. The diagram only shows the sequence of events relating to monitoring a single container. Docker network visibility demonstration shows the full range of network traffic and system performance information being exported.

Detailed real-time visibility is essential for fully realizing the benefits of agile container infrastructure, providing the feedback needed to track and automatically optimize the performance of large scale microservice deployments.

Thursday, June 25, 2015

WAN optimization using real-time traffic analytics

TATA Consultancy Services white paper, Actionable Intelligence in the SDN Ecosystem: Optimizing Network Traffic through FRSA, demonstrates how real-time traffic analytics and SDN can be combined to perform real-time traffic engineering of large flows across a WAN infrastructure.
The architecture being demonstrated is shown in the diagram (this diagram has been corrected - the diagram in the white paper incorrectly states that sFlow-RT analytics software uses a REST API to poll the nodes in the topology. In fact, the nodes stream telemetry using the widely supported, industry standard, sFlow protocol, providing real-time visibility and scalability that would be difficult to achieve using polling - see Push vs Pull).

The load balancing application receives real-time notifications of large flows from the sFlow-RT analytics software and programs the SDN Controller (in this case OpenDaylight) to push forwarding rules to the switches to direct the large flows across a specific path. Flow Aware Real-time SDN Analytics (FRSA) provides an overview of the basic ideas behind large flow traffic engineering that inspired this use case.
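
To give a feel for the analytics side, the following Python sketch uses sFlow-RT's REST API to define a large flow signature, set a threshold, and long-poll for events (the flow keys, the 10MByte/s threshold and the localhost address are assumptions for illustration, and the steer() stub stands in for calls to the SDN controller's API):

  # Sketch: detect large flows with sFlow-RT and hand them to a controller.
  # Assumes sFlow-RT is running at http://localhost:8008.
  import requests

  RT = "http://localhost:8008"

  # Define the flow signature used to identify large flows.
  requests.put(f"{RT}/flow/tcp/json", json={
      "keys": "ipsource,ipdestination,tcpsourceport,tcpdestinationport",
      "value": "bytes",
  })

  # Fire an event when a flow exceeds ~10 MBytes/s on any link.
  requests.put(f"{RT}/threshold/large_flow/json", json={
      "metric": "tcp",
      "value": 10000000,
  })

  def steer(flow_key):
      """Stub: push forwarding rules via the SDN controller's API."""
      print("large flow detected:", flow_key)

  event_id = -1
  while True:
      # Long poll for threshold events newer than the last one seen.
      r = requests.get(f"{RT}/events/json",
                       params={"eventID": event_id,
                               "maxEvents": 10, "timeout": 60})
      events = r.json()
      if events:
          event_id = events[0]["eventID"]   # newest first
      for e in reversed(events):
          if e["thresholdID"] == "large_flow":
              steer(e["flowKey"])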

While OpenDaylight is used in this example, an interesting alternative for this use case would be the ONOS SDN controller running the Segment Routing application. ONOS is specifically designed with carriers in mind and segment routing is a natural fit for the traffic engineering task described in this white paper.
Leaf and spine traffic engineering using segment routing describes a demonstration combining real-time analytics and SDN control in a data center context. The demonstration was part of the recent 2015 Open Networking Summit (ONS) conference Showcase and presented in the talk, CORD: FABRIC An Open-Source Leaf-Spine L3 Clos Fabric, by Saurav Das.

Tuesday, February 5, 2013

Measurement delay, counters vs. packet samples

This chart compares the frame rate reported for a switch port based on sFlow interface counter and packet sample measurements (shown in blue and gold respectively). The chart was created using sFlow-RT, which asynchronously updates metrics as soon as new data arrives, demonstrating the fastest possible response to both counter and packet sample measurements.

In this case, the counter export interval was set to 20 seconds and the blue line, trending the ifInUcastPkts counter, shows that it can take up to 40 seconds before the counter metric fully reflects a change in traffic level (illustrating the frequency resolution bounds imposed by Nyquist-Shannon). The frames metric, calculated from packet samples, responds far more quickly, immediately detecting a change in traffic and fully reflecting the new value within a few seconds.
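
The two traces correspond to two different calculations; a simplified sketch of both estimators (the 20 second export interval and a 1-in-1000 sampling rate are assumed values):

  # Sketch: estimating frame rate from counters vs. packet samples.

  def rate_from_counters(prev_count, curr_count, export_interval=20):
      """Counter-based rate: a delta over the export interval, so a step
      change in traffic is only fully reflected after two intervals."""
      return (curr_count - prev_count) / export_interval

  def rate_from_samples(num_samples, sampling_rate=1000, elapsed=1.0):
      """Sample-based rate: each sample represents ~sampling_rate frames,
      so an estimate is available within seconds of a traffic change."""
      return num_samples * sampling_rate / elapsed

  # Example: 30 samples in one second at 1-in-1000 => ~30,000 frames/s.
  print(rate_from_samples(30))
  print(rate_from_counters(1_000_000, 1_600_000))  # 30,000 frames/s avg

Because samples arrive continuously, the sample-based estimate can be refreshed every second, while the counter-based estimate only changes when a new export arrives.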

The counter push mechanism used by sFlow is extremely efficient, permitting faster counter updates than are practical using large scale counter polling - see Push vs Pull. Reducing the counter export interval below 20 seconds would increase the responsiveness, but at the cost of increased overhead and reduced scalability. On the other hand, packet sampling automatically allocates monitoring resources to busy links, providing a highly scalable way to quickly detect traffic flows wherever they occur in the network - see Eye of Sauron.

The difference in responsiveness is important when driving software defined networking applications, where the ability to rapidly detect large flows ensures responsive and stable controls. Packet sampling also provides richer detail than counters, allowing a controller to identify the root cause of traffic increases and drive corrective actions.

While not as responsive as packet sampling, counter updates provide important complementary functionality:
  1. Counters are maintained in hardware and provide precise traffic totals.
  2. Counters capture rare events, like packet discards, that can severely impact performance.
  3. Counters report important link state information, like link speed, LAG group membership etc.
The combination of periodic counter updates and packet sampling makes sFlow a highly scalable and responsive method of monitoring network performance, delivering the critical metrics needed for effective control of network resources.

Monday, August 27, 2012

Push vs Pull

Push-me-pull-you from Doctor Dolittle
There are two major performance monitoring architectures:
  • Push: metrics are periodically sent by each monitored system to a central collector. Examples of push architectures include: sFlow, Ganglia, Graphite, collectd and StatsD.
  • Pull: a central collector periodically requests metrics from each monitored system. Examples of pull architectures include: SNMP, JMX, WMI and libvirt.
The remainder of this article will explore some of the strengths and weaknesses of push and pull architectures:


Discovery
  • Push: Agents automatically send metrics as soon as they start up, ensuring that they are immediately detected and continuously monitored. Speed of discovery is independent of the number of agents.
  • Pull: The collector must periodically sweep the address space to find new agents. Speed of discovery depends on the sweep interval and the size of the address space.

Scalability
  • Push: The polling task is fully distributed among agents, resulting in linear scalability. A lightweight central collector listens for updates and stores measurements. Minimal work for agents to periodically send a fixed set of measurements. Agents are stateless, exporting data as soon as it is generated.
  • Pull: Workload on the central poller increases with the number of devices polled. Additional work for the poller to generate requests and maintain session state in order to match requests and responses. Additional work for agents to parse and process requests. Agents are often required to maintain state so that metrics can be retrieved later by the poller.

Security
  • Push: Agents are inherently secure against remote attacks since they do not listen for network connections.
  • Pull: The polling protocol can potentially open the system up to remote access and denial of service attacks.

Operational Complexity
  • Push: Minimal configuration required for agents: a polling interval and the address of the collector. Firewalls need to be configured only for unidirectional communication of measurements from agents to collector.
  • Pull: The poller needs to be configured with the list of devices to poll, the security credentials to access them and the set of measurements to retrieve. Firewalls need to be configured to allow bi-directional communication between poller and agents.

Latency
  • Push: The low overhead and distributed nature of the push model permit measurements to be sent more frequently, allowing the management system to react quickly to changes. In addition, many push protocols, like sFlow, are implemented on top of UDP, providing non-blocking, low-latency transport of measurements.
  • Pull: The lack of scalability in polling typically means that measurements are retrieved less often, resulting in a delayed view of performance that makes the management system less responsive to changes. The two-way communication involved in polling adds latency as connections are established and authenticated before measurements can be retrieved.

Flexibility
  • Push: Relatively inflexible: a pre-determined, fixed set of measurements is exported periodically.
  • Pull: Flexible: the poller can ask for any metric at any time.
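
A toy agent makes the contrast in the table concrete (illustrative only: the JSON payload, port number and load average metric are arbitrary choices, and a real sFlow agent exports XDR-encoded structures, not JSON):

  # Toy push agent: stateless, fire-and-forget UDP export of metrics.
  import json
  import os
  import socket
  import time

  COLLECTOR = ("collector.example.com", 9999)  # assumed collector address
  INTERVAL = 20                                # seconds between exports

  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

  while True:
      load1, _, _ = os.getloadavg()            # one example metric
      payload = json.dumps({
          "host": socket.gethostname(),
          "timestamp": time.time(),
          "load_1m": load1,
      }).encode()
      sock.sendto(payload, COLLECTOR)          # no session, no blocking
      time.sleep(INTERVAL)

Note that the agent needs only two configuration settings, the export interval and the collector address, and never listens for connections - exactly the discovery, security and operational simplicity properties described above.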

The push model is particularly attractive for large scale cloud environments where services and hosts are constantly being added, removed, started and stopped. Maintaining lists of devices to poll for statistics in these environments is challenging, and the discovery, scalability, security, low latency and simplicity of the push model make it a clear winner.

The sFlow standard is particularly well suited to large scale monitoring of cloud infrastructures, delivering the comprehensive visibility into the performance of network, compute and application resources needed for effective management and control.

In practice, a hybrid approach provides the best overall solution. The core set of standard metrics needed to manage performance and detect problems is pushed using sFlow and a pull protocol is used to retrieve diagnostic information from specific devices when a problem is detected.

Saturday, March 20, 2010

Host sFlow



Virtualization extends networking into servers through virtual switches, virtual routers and virtual firewalls. The sFlow standard, already built into most vendors' switches, provides visibility and control of traffic on the physical network. As virtual device vendors implement sFlow (e.g. Open vSwitch and Vyatta), visibility is extended into the virtual network on the server. The implementation of sFlow monitoring on servers offers an opportunity to extend visibility into server performance.

Monitoring physical and virtual server performance is a challenging task. Consider that the number of virtual machines per server is going up; currently 20-40 virtual machines per physical machine is not unusual. Monitoring a data center with 10,000 physical switch ports might involve monitoring as many as 5,000 physical servers, 10,000 virtual switches, 200,000 virtual switch ports and 100,000 virtual servers.

The proven scalability of sFlow's counter polling mechanism offers an efficient way to monitor the large numbers of physical and virtual servers in a data center. The Host sFlow extension, currently being developed on sFlow.org, offers a standard way to export physical and virtual server performance statistics (i.e. CPU, memory and I/O metrics).

Host sFlow integrates with sFlow in physical and virtual switches to unify network and system visibility, helping to break down management silos and provide the end-to-end visibility into data center resources needed for effective management and control.

Feb. 15, 2011 Update: To find more recent articles on using sFlow to monitor servers, click on the server label below.

Tuesday, December 1, 2009

Large Hadron Collider


LHCb (Photo Credit: CERN)

The Large Hadron Collider at CERN has been in the news as it comes online.

An interesting paper (Management of the LHCb Network Based on SCADA System) describes the data collection network associated with the LHCb experiment. High-speed switched Ethernet networks are used to collect measurements from the experiment and to control its operation. The paper states that, "Sophisticated monitoring of both networks at all levels is essential for the successful operation of the experiment."

The network monitoring system uses sFlow to measure network utilization on the core switches. "Because there are so many ports in the core switches, the SNMP query of interface counters takes a long time and occupies a lot CPU and memory resource."

The distributed counter polling mechanism in sFlow provides a highly scalable alternative to SNMP polling, delivering reliable monitoring in the most demanding environments. Network visibility in the data center is equally important and sFlow provides the scalability and performance needed to maintain effective control of high-speed data center networks.

Friday, June 26, 2009

Sampling rates


A previous posting discussed the scalability and accuracy of packet sampling and the advantages of packet sampling for network-wide visibility.

Selecting a suitable packet sampling rate is an important part of configuring sFlow on a switch. The table gives suggested values that should work well for general traffic monitoring in most networks. However, if traffic levels are unusually high the sampling probability may be decreased (e.g. use 1 in 5000 instead of 1 in 2000 for 10Gb/s links).
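
Lower sampling probabilities remain accurate on busy links because the error of a traffic estimate depends only on the number of samples collected: at 95% confidence the relative error is bounded by roughly 196/sqrt(n) percent, per the packet sampling analysis on sFlow.org. A quick sketch (the 1.5M packets/s figure is an assumed load for a busy 10Gb/s link):

  # Sketch: relative error of an sFlow traffic estimate at 95% confidence
  # is bounded by roughly 196 / sqrt(samples) percent (per sFlow.org).
  import math

  def percent_error(samples):
      return 196.0 / math.sqrt(samples)

  # A 10Gb/s link at ~1.5M packets/s sampled 1-in-5000 still yields
  # ~300 samples/s, so a one-minute estimate uses ~18,000 samples:
  print(percent_error(18000))  # ~1.5% error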

Configure sFlow monitoring on all interfaces on the switch for full visibility. Packet sampling is implemented in hardware so all the interfaces can be monitored with very little overhead.

Finally, select a suitable counter polling interval so that link utilizations can be accurately tracked. Generally the polling interval should be set to export counters at least twice as often as the data will be reported (see the Nyquist-Shannon sampling theorem for an explanation). For example, to trend utilization with minute granularity, select a polling interval of between 20 and 30 seconds. Don't be concerned about setting relatively short polling intervals; counter polling with sFlow is very efficient, allowing more frequent polling with less overhead than is possible with SNMP.

Tuesday, June 16, 2009

Trying out sFlow


If you are interested in network-wide visibility and want to start experimenting with sFlow, take a look at your network and see if any of the switches are sFlow capable. Most switch vendors support sFlow, including: Brocade, Hewlett-Packard, Juniper Networks, Extreme Networks, Force10 Networks, 3Com, D-Link, Alcatel-Lucent, H3C, Hitachi, NEC, AlaxalA, Allied Telesis and Comtec (for a complete list of switches, see sFlow.org).

If you don't already have switches with sFlow support, consider purchasing a switch to experiment with. There are a number of inexpensive switches with sFlow support (check the list of switches on sFlow.org), alternatively you may be able to pick up a used switch on eBay.

Finally, the open source Host sFlow agent can be used to monitor host traffic and traffic between virtual machines on a virtual server (Xen®, VMware®, KVM).

Once you have access to a source of sFlow data, you will need an sFlow analyzer. The sFlowTrend application (shown above) is a free, purpose built, sFlow analyzer that will allow you to try out the full range of sFlow functionality, including:
  • decoding and filtering on data from packet headers (including VLANs, priorities, MAC addresses, Ethernet types, as well as TCP/IP fields)
  • accurate analysis, trending and reporting of packet samples
  • trending of sFlow counters
  • support for sFlow MIB to automatically configure sFlow on switches
Many traffic analyzers claim support for sFlow, but provide only partial support. It is worth starting with sFlowTrend to see the full capabilities of sFlow and to gain experience with sFlow monitoring before evaluating larger scale solutions.

Future posts on this blog will use sFlowTrend to demonstrate how sFlow monitoring can be used to solve common network problems. Downloading a copy of sFlowTrend will allow you to try the different strategies on your own network.

Saturday, June 6, 2009

Choosing an sFlow analyzer


sFlow achieves network-wide visibility by shifting complexity away from the switches to the sFlow analysis application. Simplifying the monitoring task for the switch makes it possible to implement sFlow in hardware, providing wire-speed performance, without increasing the cost of the switch. However, the shift of complexity to the sFlow analysis application makes the selection of the sFlow analyzer a critical factor in realizing the full benefits of sFlow monitoring.

To illustrate some of the features that you should look for in an sFlow analyzer, consider the following basic question, "Which hosts are generating the most traffic on the network?" The chart provides information that answers the question, displaying the top traffic sources and the amount of traffic that they generate. In order to generate this chart, the sFlow analyzer needs to support the following features:
  1. Since the busiest hosts in the network could be anywhere, the sFlow analyzer needs to monitor every link in the network to accurately generate the chart.
  2. Traffic may traverse a number of monitored switch ports, in the example above, traffic between hosts A and B is monitored by 10 switch ports. In order to correctly report on the amount of traffic by host, the sFlow analyzer needs to combine data from the different switch ports in a way that correctly calculates the traffic totals and avoids under or over counting.
  3. The sFlow analyzer must fully support sFlow's packet sampling mechanism in order to accurately calculate traffic volumes (see the sketch after this list).
  4. Notice that the chart contains IPv4, IPv6 and MAC addresses. The sFlow analyzer needs to be able to decode packet headers and report on all the protocols in use on the network, including layer 2 and layer 3 traffic. Traffic on local area networks (LANs) is much more diverse than routed wide area network (WAN) traffic. In addition to the normal TCP/IP traffic seen on the WAN, LAN traffic can include multicast, broadcast, service discovery (Bonjour), host configuration (DHCP), printing, backup and storage traffic not typically seen on the WAN.
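
A minimal sketch of items 3 and 4, scaling decoded packet samples into traffic totals (the decoder below handles only untagged IPv4 over Ethernet; coping with VLAN tags, IPv6 and the multi-port over-counting problem from item 2 is precisely what distinguishes a full sFlow analyzer):

  # Sketch: top traffic sources from sFlow packet samples.
  # Each sample carries the head of the packet plus the sampling rate;
  # scaling sampled frame lengths by the rate estimates total traffic.
  from collections import Counter

  totals = Counter()  # source IP -> estimated bytes

  def account(header: bytes, frame_length: int, sampling_rate: int):
      """Decode an untagged IPv4-over-Ethernet header and tally bytes."""
      if len(header) < 34 or header[12:14] != b"\x08\x00":
          return  # not plain IPv4; a real analyzer decodes far more
      src_ip = ".".join(str(b) for b in header[26:30])
      totals[src_ip] += frame_length * sampling_rate

  def top_talkers(n=5):
      return totals.most_common(n)
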
When selecting an sFlow analyzer, try to arrange an evaluation and test the product on a full scale production network. Evaluating scalability and accuracy is not something that is easily performed in a test lab.

Saturday, May 16, 2009

Link utilization


One of the basic tasks in monitoring network traffic is to accurately track the utilization of links in your network. A managed switch will provide a standard set of counters for each interface that can be retrieved periodically using SNMP and used to trend link utilization, packet rates, errors and discards.
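
However the counters are delivered, trending utilization is a simple delta calculation over successive exports (a sketch, using the standard interface octet counter and link speed values):

  # Sketch: link utilization from two successive interface counter exports.
  def utilization(prev_octets, curr_octets, interval_s, if_speed_bps):
      """Fraction of link capacity used over the polling interval."""
      bits = (curr_octets - prev_octets) * 8
      return bits / (interval_s * if_speed_bps)

  # 75 MB received over 30s on a 1Gb/s link => 2% utilization.
  print(utilization(0, 75_000_000, 30, 1_000_000_000))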

sFlow provides an alternative to SNMP counter polling. The sFlow agent in the switch periodically sends, or "pushes," its own counters to the central collector. Pushing counters is much more efficient than retrieving them using SNMP, requiring 10-20 times fewer network packets to retrieve the same information. The sFlow protocol uses XDR to encode the counters. XDR is much simpler to encode and decode than the ASN.1 encoding that the SNMP protocol uses, so the CPU load on the switches and the collectors is also significantly reduced. Finally, distributing the counter polling task among the switches further reduces the load on the central collector.
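
The encoding difference is easy to see: XDR fields are fixed-width, big-endian values that map directly onto a single unpack call, with no tag-length-value parsing (a sketch; the two-counter layout shown is illustrative, not the full sFlow counter record format):

  # Sketch: XDR values are fixed-width big-endian, so decoding is a
  # single struct.unpack -- no ASN.1 tag/length/value walking needed.
  import struct

  def decode_octet_counters(buf: bytes):
      """Unpack two unsigned 64-bit XDR hyper integers (in/out octets)."""
      in_octets, out_octets = struct.unpack_from(">QQ", buf, 0)
      return in_octets, out_octets

  print(decode_octet_counters(struct.pack(">QQ", 123456789, 987654321)))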

The benefits of using sFlow to retrieve interface statistics become clear when you monitor large networks. Instead of requiring 5-10 servers dedicated to SNMP polling, a single sFlow analyzer can collect counters from all the interfaces in the network, providing a centralized view of utilization throughout the network, rapidly identifying any areas of congestion.