Sunday, September 26, 2010

Memcached


The diagram shows typical elements in a Web 2.0 data center (e.g. Facebook, Twitter, Wikipedia, YouTube, etc.). A cluster of web servers handles requests from users. Typically, the application logic for the web site runs on the web servers in the form of server-side scripts (PHP, Ruby, ASP, etc.). The web applications access the database to retrieve and update user data. However, the database can quickly become a bottleneck, so a cache is used to store the results of database queries.

The cache is a critical element for scalability and the cache itself must be highly scalable. The cache consists of open source Memcached software deployed on a cluster of servers. Memcached is an example of a scale-out system: adding additional servers to a Memcached cluster linearly increases the capacity of the cache.
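The scale-out behavior follows from the way Memcached clients distribute keys across the cluster: each key is hashed to select a server, so adding a server adds its memory to the overall cache. The sketch below is a minimal illustration of this idea, assuming a hypothetical server list and a simple modulo hash; production clients generally use consistent hashing so that resizing the cluster remaps as few keys as possible.

import hashlib

# Hypothetical Memcached cluster; a real deployment would list its own
# host:port pairs here.
SERVERS = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]

def server_for_key(key, servers=SERVERS):
    # Hash the key and map it onto one of the servers. Each server holds
    # a disjoint share of the keys, so total cache capacity grows with
    # the number of servers.
    digest = hashlib.md5(key.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

print(server_for_key("foo:1234"))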

The following example shows how the cache is typically integrated into the web application logic:

function get_foo(foo_id)
    # check the cache first
    foo = memcached_get("foo:" . foo_id)
    return foo if defined foo

    # cache miss: query the database, then populate the cache for the next request
    foo = fetch_foo_from_database(foo_id)
    memcached_set("foo:" . foo_id, foo)
    return foo
end

To give some perspective on the challenge of managing performance in this environment, consider the scale of the problem. In December 2008, Facebook's Memcached cluster consisted of over 800 servers, providing over 28 terabytes of cache memory and handling over 160 million requests per second (see Scaling memcached at Facebook).

One particularly challenging task is identifying the most popular keys in an operational Memcached cluster. Determining which cache keys are popular is critical to managing and optimizing Memcached performance. Consider the following example: if all the web servers are making frequent reference to a shared cache item and it is removed from the cache, the result can be a "stampede" to the database to retrieve the value. If this situation isn't managed carefully the result can be a cascading failure as the database becomes overloaded (see Cache Miss Storm).
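The sketch below illustrates the stampede, assuming a pool of web-server threads that all read the same popular key; the key name, thread count and query cost are hypothetical values chosen for illustration. When the key is evicted, every thread misses at roughly the same time and issues its own database query.

import threading, time

DB_QUERY_SECONDS = 0.5          # assumed cost of regenerating the value
cache = {}                      # stand-in for the Memcached cluster
db_queries = 0                  # load reaching the database
counter_lock = threading.Lock()

def get_profile(key):
    global db_queries
    value = cache.get(key)
    if value is not None:
        return value
    # Cache miss: every caller falls through to the database.
    with counter_lock:
        db_queries += 1
    time.sleep(DB_QUERY_SECONDS)  # simulate the expensive query
    cache[key] = "profile-data"
    return cache[key]

# A popular key has just been evicted; 50 web-server threads request it at once.
threads = [threading.Thread(target=get_profile, args=("user:42",)) for _ in range(50)]
for t in threads: t.start()
for t in threads: t.join()
print("database queries during the miss:", db_queries)  # ~50 instead of 1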

Memcached users can now add sFlow monitoring to their Memcached servers in order to improve visibility (see sFlow for memcached). The following code extract shows the minimal impact of sFlow instrumentation on the performance of the Memcached request processing loop:

#define SFLOW_SAMPLE_TEST(c) (unlikely((--c->thread->sflow_skip)==0))
...
if(SFLOW_SAMPLE_TEST(c)) {
  sflow_sample(&sFlow, c, SFMC_PROT_ASCII,
               sflow_map_nread(c->cmd),
               ITEM_key(it),
               it->nkey,
               SFLOW_TOKENS_UNKNOWN,
               (ret == STORED) ? it->nbytes : 0,
               SFLOW_DURATION_UNKNOWN,
               sflow_map_status(ret));
}
...

If you examine the code, you can see that the overhead of adding sFlow transaction sampling is essentially the cost of maintaining one additional performance counter. Memcached already maintains over 30 performance counters, and these counters are also exported by the sFlow agent using sFlow's efficient "counter push" mechanism (see Link utilization). Overall, the sFlow agent imposes negligible overhead on the Memcached server since no data analysis is performed locally. With sFlow monitoring, the raw data is streamed to a central sFlow Analyzer (see Choosing an sFlow analyzer).
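The technique behind SFLOW_SAMPLE_TEST can be sketched in a few lines: each thread decrements a skip counter on every request and takes a sample only when the counter reaches zero, at which point the counter is reset to a random value whose mean equals the configured sampling rate. The following is a simplified illustration of that idea rather than the actual agent code, with an assumed sampling rate:

import random

SAMPLING_RATE = 1000   # assumed: on average 1 in 1000 requests is sampled
samples = []

def next_skip(rate=SAMPLING_RATE):
    # Requests to skip before the next sample; drawing the skip at random
    # (mean equal to the sampling rate) avoids bias from periodic traffic.
    return random.randint(1, 2 * rate - 1)

skip = next_skip()

def handle_request(key):
    global skip
    skip -= 1                  # the only per-request cost: one counter update
    if skip == 0:
        samples.append(key)    # record the sampled transaction for export
        skip = next_skip()
    # ... normal request processing continues ...

for i in range(1000000):
    handle_request("key:%d" % (i % 500))
print("samples taken:", len(samples))   # roughly 1,000

Because the samples are exported immediately rather than analyzed on the server, the per-request cost really is just the counter update.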

The Memcached transaction samples allow the sFlow Analyzer to easily identify popular keys, who is using the keys, what operations are being performed, and the status of those operations (see Scalability and accuracy of packet sampling). In addition, the Memcached performance counters allow the analyzer to report on overall load, cache hit/miss rates, etc. This data is available across all the machines in the Memcached cluster, providing the integrated view of cluster performance needed to improve cache efficiency and avoid stampedes.
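As a rough illustration of the analysis, the sketch below counts the keys seen in a stream of transaction samples and scales the counts by the sampling rate to estimate per-key request volumes; the sample records and the sampling rate are hypothetical, and a real sFlow analyzer would decode the standard Memcached transaction structures instead.

from collections import Counter

SAMPLING_RATE = 1000   # assumed: each sample represents ~1000 transactions

# Hypothetical transaction samples received from the cluster:
# (server, operation, key, status)
samples = [
    ("10.0.0.1", "GET", "user:42", "HIT"),
    ("10.0.0.2", "GET", "user:42", "HIT"),
    ("10.0.0.3", "GET", "user:42", "MISS"),
    ("10.0.0.1", "GET", "item:7",  "HIT"),
    ("10.0.0.2", "SET", "user:42", "STORED"),
]

key_counts = Counter(key for _, _, key, _ in samples)
for key, count in key_counts.most_common(5):
    # Scale sampled counts up to estimate actual request volume per key.
    print(key, "~", count * SAMPLING_RATE, "requests")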

However, managing Memcached performance involves more than simply monitoring the application statistics. The sFlow standard also provides server metrics (see Cluster performance and Top servers), identifying performance problems with the servers in the Memcached cluster before they affect the performance of the Memcached application. In addition, sFlow is implemented by most switch vendors, providing the detailed visibility into network performance needed to avoid congestion and packet loss problems that can severely impact the performance of Memcached.

Memcached demonstrates how the performance of all the elements in the data center is interrelated. The sFlow standard addresses this challenge by unifying network, server and application performance monitoring to deliver the integrated view needed for effective management (see sFlow Host Structures).

Feb. 15, 2011 Update: For more recent articles on Memcached, including identifying hot/missed keys, click on the Memcache label below.

Wednesday, September 8, 2010

Superlinear

(diagram from Going Superlinear)

The article, Going Superlinear, describes factors that affect the scalability of parallel processing systems. Ideally a parallel system will show a linear improvement in performance as the number of processors increases. However, typical parallel systems show sublinear performance since memory or I/O eventually becomes a bottleneck. Finally, the article states that it is sometimes possible to achieve superlinear scalability where adding processors creates a disproportionate increase in performance.

Applying the concepts from parallel processing to analyze the scalability of network performance monitoring systems is an interesting exercise. The parallel processing that allows a network monitoring system to scale is embedded in the network switches and operates as part of the packet forwarding function in each switch port. Therefore, it makes sense to measure system size in number of switch ports, not number of processors. The switch resources consumed by the monitoring system are a limiting factor. If an increase in network size requires a disproportionate increase in switch resources, then the monitoring system will have limited scalability.
The following diagram plots the relationship between network size and parallel speedup. The sublinear (limited scalability) region is shown in red and the superlinear (scalable) region is shown in green. The speedups associated with the sFlow and NetFlow network monitoring systems are plotted on the chart.


In a NetFlow monitoring system, each switch maintains a flow cache containing active connections. As the network size increases, the amount of memory allocated to each flow cache must be increased to accommodate the larger number of active connections in a bigger network. This disproportionate increase in the switch resources needed to implement NetFlow results in sublinear scalability.

The article states that superlinear scalability is possible if the monitoring system can:
• Do disproportionately less work.
• Harness disproportionately more resources.
The sFlow monitoring system harnesses both techniques to increase scalability:
• Less work. Packet sampling allows the switch to perform disproportionately less work (see Scalability and accuracy of packet sampling). Packet samples are not stored on the switch, but are immediately sent to a central collector for analysis. The elimination of the flow cache further reduces the switch resources consumed by monitoring.
• More resources. Moving analysis to a central collector harnesses the disproportionately larger memory and computational resources available on a server compared to the limited resources available on a switch (see Choosing an sFlow analyzer).
Moving the flow cache from the switches to the central analyzer greatly improves memory efficiency (a rough calculation follows this list):
• NetFlow caches must be sized to handle worst case loads. Most switches in the network will have excess flow cache capacity while a few busy switches might have insufficient flow cache memory. Centralizing the cache allows memory to be pooled and flexibly re-assigned from idle to busy switches as traffic patterns change.
• While some NetFlow implementations support packet sampling, the use of sampling has no impact on memory overhead since on-board flow caches are sized for worst case (unsampled) configurations. Thus, if packet sampling is used with NetFlow, the result is poor cache utilization. However, with a centralized cache the memory savings associated with sampling increase with network size.
• Traffic paths often traverse multiple switches. With NetFlow, each switch in the path will keep a copy of the flow in its cache. A centralized cache reduces memory requirements by eliminating the redundant copies.
• Centralizing the cache replaces scarce, expensive switch memory with abundant, inexpensive server memory.
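A back-of-envelope sketch of the pooling argument, using purely hypothetical figures (the switch count, worst-case cache size, typical flow counts and path length below are assumptions chosen only to illustrate the comparison):

# All figures are illustrative assumptions, not measurements.
NUM_SWITCHES = 100
WORST_CASE_FLOWS_PER_SWITCH = 100000   # each on-switch cache sized for peak load
AVG_ACTIVE_FLOWS_PER_SWITCH = 10000    # typical load is far below the peak
AVG_SWITCHES_PER_PATH = 3              # each flow is seen by ~3 switches

# Distributed NetFlow-style caches: every switch carries a worst-case cache.
distributed_entries = NUM_SWITCHES * WORST_CASE_FLOWS_PER_SWITCH

# Centralized cache: only the flows actually active, counted once per flow
# rather than once per switch along the path.
centralized_entries = NUM_SWITCHES * AVG_ACTIVE_FLOWS_PER_SWITCH // AVG_SWITCHES_PER_PATH

print("distributed cache entries:", distributed_entries)   # 10,000,000
print("centralized cache entries:", centralized_entries)   # ~333,000

Under these assumptions the centralized cache needs well under a tenth of the entries, and the gap widens as sampling is applied and as the network grows.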

Memory efficiency is just one aspect of sFlow's scalability. Other aspects of the sFlow architecture that contribute to its scalability have been discussed in previous articles: Link utilization, Measurement traffic and Measurement overhead.

The differences in scalability between NetFlow and sFlow reflect differences in the size and type of network targeted by the two technologies. Typically, NetFlow monitoring is selectively applied to a relatively small number of routed links, whereas sFlow is used to monitor all the links in large switched networks (see LAN and WAN).

The sFlow standard is widely supported by switch vendors, delivering the scalability needed to manage the performance of large, converged networks. The scalability of sFlow allows performance monitoring to extend beyond the network to include virtualization, server and cloud computing in a single integrated system.


Wednesday, September 1, 2010

Cloud-scale performance monitoring


A cloud data center consists of large numbers of physical servers, each running a hypervisor with one or more virtual switches connecting the physical network to virtual machines (see Anatomy of an open source cloud). An integrated approach to network and system management is required in order to manage performance in this environment (see Management silos).

The combination of sFlow in Open vSwitch and Host sFlow provides a lightweight, scalable performance monitoring solution for open source virtualization (Xen, XenServer and KVM) and cloud platforms (OpenStack and Xen Cloud Platform). However, coordinating and managing the configuration of sFlow monitoring on the large number of virtual switches and virtual machines in a cloud data center is a complex task that needs to be addressed as part of a scalable monitoring solution.

In order to address this challenge, the latest version of the Host sFlow agent adds the ability to automatically configure sFlow on the Open vSwitch. The Host sFlow agent already includes a highly scalable configuration mechanism (see DNS-SD) and the integration with the Open vSwitch extends this mechanism to configure network and system performance monitoring throughout the cloud.
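To make the DNS-SD approach concrete, the sketch below shows how an agent might discover its collector and settings by querying SRV and TXT records. The domain name, the record attributes (sampling, polling) and the use of the dnspython library are illustrative assumptions; the actual record layout is described in the DNS-SD article referenced above.

import dns.resolver   # dnspython (resolve() requires version 2.0 or later)

DOMAIN = "example.com"   # assumed search domain for the data center

# An SRV record identifies the sFlow collector host and port.
srv = dns.resolver.resolve("_sflow._udp." + DOMAIN, "SRV")[0]
collector, port = str(srv.target).rstrip("."), srv.port

# A TXT record carries key=value options, e.g. sampling and polling rates.
settings = {}
for rdata in dns.resolver.resolve("_sflow._udp." + DOMAIN, "TXT"):
    for item in rdata.strings:
        key, _, value = item.decode().partition("=")
        settings[key] = value

print("send sFlow to %s:%d with settings %s" % (collector, port, settings))

In this scheme, updating the DNS records would change the monitoring settings for every agent in the domain without touching individual servers.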

Referring back to the diagram, the sFlow standard provides a performance monitoring solution that spans all the elements of the cloud. Most physical networking devices already include sFlow support (see Multi-vendor support), providing visibility into network and storage activity. Open vSwitch and Host sFlow extend visibility to include virtual networking and physical/virtual server performance respectively. Thus, sFlow provides the integrated end-to-end view of performance needed to manage resources throughout the cloud (see sFlow Host Structures).