Monday, August 27, 2012

Push vs Pull

Push-me-pull-you from Doctor Dolittle
There are two major performance monitoring architectures:
  • Push: metrics are periodically sent by each monitored system to a central collector. Examples of push architectures include sFlow, Ganglia, Graphite, collectd and StatsD.
  • Pull: a central collector periodically requests metrics from each monitored system. Examples of pull architectures include SNMP, JMX, WMI and libvirt.
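To make the distinction concrete, here is a minimal Python sketch of both patterns over UDP on the loopback interface. The JSON payload, the `read_metrics` helper and all function names are assumptions made for illustration; real protocols such as sFlow and SNMP define their own encodings and message formats.

```python
import json
import socket
import threading


def read_metrics():
    # Hypothetical metric source; a real agent would sample the OS or application.
    return {"host": "web-1", "cpu_load": 0.42}


def udp_socket():
    # Loopback UDP socket on an ephemeral port, with a timeout so the demo cannot hang.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))
    sock.settimeout(2.0)
    return sock


def push_once(collector_addr):
    """Push: the agent sends its metrics unprompted and listens for nothing."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(json.dumps(read_metrics()).encode(), collector_addr)


def serve_one_poll(agent_sock):
    """Pull, agent side: wait for a request, parse it, reply with what was asked for."""
    request, poller_addr = agent_sock.recvfrom(4096)
    wanted = json.loads(request)["metrics"]
    metrics = read_metrics()
    reply = {name: metrics[name] for name in wanted if name in metrics}
    agent_sock.sendto(json.dumps(reply).encode(), poller_addr)


def poll_once(poller_sock, agent_addr, wanted):
    """Pull, poller side: send a request, then block until the response arrives."""
    poller_sock.sendto(json.dumps({"metrics": wanted}).encode(), agent_addr)
    reply, _ = poller_sock.recvfrom(4096)
    return json.loads(reply)


def demo():
    with udp_socket() as collector, udp_socket() as agent, udp_socket() as poller:
        # Push: one datagram, agent to collector.
        push_once(collector.getsockname())
        pushed = json.loads(collector.recvfrom(4096)[0])

        # Pull: a request and a response, with the agent listening on a port.
        server = threading.Thread(target=serve_one_poll, args=(agent,))
        server.start()
        pulled = poll_once(poller, agent.getsockname(), ["cpu_load"])
        server.join()
    return pushed, pulled


if __name__ == "__main__":
    print(demo())
```

Note that the push path needs a listening socket only on the collector, while the pull path needs one on every agent.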
The remainder of this article will explore some of the strengths and weaknesses of push and pull architectures:

Discovery
  • Push: Agent automatically sends metrics as soon as it starts up, ensuring that it is immediately detected and continuously monitored. Speed of discovery is independent of number of agents.
  • Pull: Discovery requires collector to periodically sweep address space to find new agents. Speed of discovery depends on discovery sweep interval and size of address space.

Scalability
  • Push: Polling task fully distributed among agents, resulting in linear scalability. Lightweight central collector listens for updates and stores measurements. Minimal work for agents to periodically send fixed set of measurements. Agents are stateless, exporting data as soon as it is generated.
  • Pull: Workload on central poller increases with the number of devices polled. Additional work on poller to generate requests and maintain session state in order to match requests and responses. Additional work for agents to parse and process requests. Agents often required to maintain state so that metrics can be retrieved at a later time by the poller.

Security
  • Push: Push agents are inherently secure against remote attacks since they do not listen for network connections.
  • Pull: Polling protocol can potentially open up system to remote access and denial of service attacks.

Operational Complexity
  • Push: Minimal configuration required for agents: polling interval and address of collector. Firewalls need to be configured for unidirectional communication of measurements from agents to collector.
  • Pull: Poller needs to be configured with list of devices to poll, security credentials to access the devices and the set of measurements to retrieve. Firewalls need to be configured to allow bi-directional communication between poller and agents.

Latency
  • Push: The low overhead and distributed nature of the push model permits measurements to be sent more frequently, allowing the management system to quickly react to changes. In addition, many push protocols, like sFlow, are implemented on top of UDP, providing non-blocking, low-latency transport of measurements.
  • Pull: The lack of scalability in polling typically means that measurements are retrieved less often, resulting in a delayed view of performance that makes the management system less responsive to changes. The two-way communication involved in polling increases latency as connections are established and authenticated before measurements can be retrieved.

Flexibility
  • Push: Relatively inflexible: a predetermined, fixed set of measurements is periodically exported.
  • Pull: Flexible: poller can ask for any metric at any time.
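The discovery and scalability points in the comparison above can be sketched as a small in-process simulation. This is only an illustration under stated assumptions: the `PushCollector` and `Poller` classes, the `probe` callback and the example subnet are all hypothetical, and no real network traffic is involved.

```python
import ipaddress


class PushCollector:
    """Learns about agents from the updates they send; needs no device list."""

    def __init__(self):
        self.agents = {}  # agent address -> latest metrics received

    def on_update(self, agent_addr, metrics):
        """Store an update and report whether this agent was previously unknown."""
        is_new = agent_addr not in self.agents
        self.agents[agent_addr] = metrics
        return is_new


class Poller:
    """Has to sweep a configured address space and track each outstanding request."""

    def __init__(self, subnet):
        self.candidates = [str(ip) for ip in ipaddress.ip_network(subnet).hosts()]
        self.pending = {}   # request id -> address polled (session state)
        self.known = set()  # addresses that answered a probe
        self._next_id = 0

    def sweep(self, probe):
        """Probe every candidate address; `probe` is a hypothetical callback that
        returns metrics for a live agent and None otherwise."""
        requests_sent = 0
        for addr in self.candidates:
            request_id = self._next_id
            self._next_id += 1
            self.pending[request_id] = addr      # remember what we asked
            requests_sent += 1
            reply = probe(addr, request_id)
            del self.pending[request_id]         # match the response to its request
            if reply is not None:
                self.known.add(addr)
        return requests_sent
```

With the subnet `10.0.0.0/28` and two live agents, the sweep issues 14 requests to find both, whereas the push collector handles exactly one update per agent and discovers each on its first message.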

The push model is particularly attractive for large scale cloud environments where services and hosts are constantly being added, removed, started and stopped. Maintaining lists of devices to poll for statistics in these environments is challenging, and the automatic discovery, scalability, security, low latency and simplicity of the push model make it a clear winner.
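One way a push collector might cope with hosts that come and go is to treat every update as a heartbeat and drop agents that fall silent. The sketch below is an assumption about how such a collector could be written, not something the push protocols themselves prescribe; the `AgentRegistry` class and the timeout value are hypothetical.

```python
class AgentRegistry:
    """Tracks which agents are currently reporting, with no configured device list."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # agent address -> time of the most recent update

    def on_update(self, agent_addr, now):
        # A new agent is registered, and a known one refreshed, by the same call.
        self.last_seen[agent_addr] = now

    def expire(self, now):
        """Remove and return agents that have been silent for longer than the timeout."""
        gone = sorted(
            addr for addr, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
        for addr in gone:
            del self.last_seen[addr]
        return gone

    def active(self):
        return sorted(self.last_seen)
```

A timeout of a few export intervals is a plausible starting point, since a single lost UDP datagram should not cause an agent to be dropped.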

The sFlow standard is particularly well suited to large scale monitoring of cloud infrastructures, delivering the comprehensive visibility into the performance of network, compute and application resources needed for effective management and control.

In practice, a hybrid approach provides the best overall solution. The core set of standard metrics needed to manage performance and detect problems is pushed using sFlow, and a pull protocol is used to retrieve diagnostic information from specific devices when a problem is detected.
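The hybrid approach can be sketched as a handler that inspects each pushed update and only falls back to a pull when something looks wrong. The function name, the threshold table and the `pull_diagnostics` callback are all hypothetical; in a real deployment the callback would wrap whatever pull protocol the device supports.

```python
def handle_pushed_metrics(agent_addr, metrics, thresholds, pull_diagnostics):
    """Check a pushed update against thresholds; pull details only on a breach.

    `pull_diagnostics(agent_addr, metric_names)` is a hypothetical callback that
    stands in for a targeted query over a pull protocol.
    """
    breached = sorted(
        name for name, limit in thresholds.items()
        if metrics.get(name, 0) > limit
    )
    if not breached:
        return None  # the common case: no request is sent to the device
    return pull_diagnostics(agent_addr, breached)
```

The routine metrics keep arriving by push for every device, while the more expensive request/response exchange is reserved for the few devices that have crossed a threshold.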
