Friday, December 30, 2011

Using Ganglia to monitor Memcache clusters


The Ganglia charts show Memcache performance metrics collected using sFlow. Enabling sFlow monitoring in Memcache servers provides a highly scalable solution for monitoring the performance of large Memcache clusters. Embedded sFlow monitoring simplifies deployments by eliminating the need to poll for metrics; instead, metrics are pushed directly from each Memcache server to the central Ganglia collector. An sFlow implementation for Memcached is currently available; see http://host-sflow.sourceforge.net/relatedlinks.php.

The article, Ganglia 3.2 released, describes the basic steps needed to configure Ganglia as an sFlow collector. Once configured, Ganglia will automatically discover and track new Memcache servers as they are added to the network.

Note: To try out Ganglia's sFlow/Memcache reporting, you will need to download Ganglia 3.3.
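Ganglia receives sFlow on the standard sFlow UDP port, 6343. As a minimal sketch (assuming a default single-channel setup; adjust the port for your environment), the receiving side of gmond.conf looks like:

```
udp_recv_channel {
  port = 6343
}
```

With this channel in place, any Memcache server sending sFlow to the Ganglia host will be discovered automatically.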

By default, Ganglia automatically displays the Memcache metrics. However, two optional settings in the gmond.conf file can be used to modify how Ganglia handles the sFlow Memcache metrics.

sflow {
  accept_memcache_metrics = no
  multiple_memcache_instances = no
}

Setting the accept_memcache_metrics flag to no will cause Ganglia to ignore sFlow Memcache metrics.

The multiple_memcache_instances setting must be set to yes in cases where there are multiple Memcache instances running on each server in the cluster. Each Memcache instance will be identified by the server port included in the title of the charts. For example, the following chart is reporting on the Memcache server listening on port 11211 on host ganglia:


Together, Ganglia and sFlow offer a comprehensive view of the performance of a cluster of Memcache servers, providing not just Memcache-related metrics, but also the server CPU, memory, disk and network IO metrics needed to fully characterize cluster performance.

Note: A Memcache sFlow agent does more than simply export performance counters; it also exports detailed data on Memcache operations that can be used to monitor hot keys, missed keys, top clients, etc. The operation data complements the counter data displayed in Ganglia, helping to identify the root cause of problems. For example, in one case Ganglia showed that the Memcache miss rate was high, and an examination of the transactions identified a mistyped key in the application code as the root cause. In addition, Memcache performance is critically dependent on network latency and packet loss; here again, sFlow provides the necessary visibility, since most switch vendors already include support for the sFlow standard.
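As an illustration of the counters involved, here is a short Python sketch (illustrative only, not part of Ganglia or the sFlow agent) that parses the output of the Memcached "stats" command, such as the telnet transcripts shown in the comments below, and computes the miss rate from the get_hits and get_misses counters:

```python
def parse_stats(text):
    """Parse 'STAT <name> <value>' lines into a dict of counter strings."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats

def miss_rate(stats):
    """Fraction of get requests that missed the cache."""
    hits = int(stats["get_hits"])
    misses = int(stats["get_misses"])
    total = hits + misses
    return misses / total if total else 0.0

# Sample counters taken from the first telnet transcript below.
sample = """STAT cmd_get 24510
STAT get_hits 20478
STAT get_misses 4032
END"""

if __name__ == "__main__":
    stats = parse_stats(sample)
    print("miss rate = %.1f%%" % (100 * miss_rate(stats)))
```

A sustained jump in this ratio is the kind of symptom the Ganglia charts surface; the per-operation sFlow records then identify which keys are missing.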

7 comments:

  1. Been scouring the net for references and help with a problem with our Ganglia monitoring. So far this is the only reference that is close to what I am looking for.

    I have a server with 2 Memcache instances running on separate ports, 11211 and 11311. I would like to monitor these 2 instances but have only managed to get 1 working, not both.

    I've already changed multiple_memcache_instances from no to yes in my gmond.conf, but I am still not able to see the 2 instances running.

    I also tried copying memcached.pyconf to another configuration, memcache_db.pyconf, and changing the port value there to 11311. But still no luck.

    Thanks for the help.

    1. Are you able to connect to each of your Memcached instances using telnet? For example,

      telnet localhost 11211
      stats

    2. Yes, I can connect to both Memcache instance ports.

  2. Yes. See the transcript below.



    [XXX ~]$ telnet 11211
    Trying 0.0.43.203...
    telnet: connect to address 0.0.43.203: Invalid argument
    telnet: Unable to connect to remote host: Invalid argument
    [darwinv@LSDWFE02 ~]$ telnet localhost 11211
    Trying 127.0.0.1...
    Connected to localhost.localdomain (127.0.0.1).
    Escape character is '^]'.
    stats
    STAT pid 9706
    STAT uptime 897
    STAT time 1348736305
    STAT version 1.2.8
    STAT pointer_size 64
    STAT rusage_user 0.734888
    STAT rusage_system 0.863868
    STAT curr_items 3779
    STAT total_items 4027
    STAT bytes 14980199
    STAT curr_connections 131
    STAT total_connections 208
    STAT connection_structures 145
    STAT cmd_flush 0
    STAT cmd_get 24510
    STAT cmd_set 4027
    STAT get_hits 20478
    STAT get_misses 4032
    STAT evictions 0
    STAT bytes_read 16301688
    STAT bytes_written 710713042
    STAT limit_maxbytes 4294967296
    STAT threads 2
    STAT accepting_conns 1
    STAT listen_disabled_num 0
    END
    quit
    Connection closed by foreign host.
    [XXX ~]$

    [XXX ~]$ telnet localhost 11311
    Trying 127.0.0.1...
    Connected to localhost.localdomain (127.0.0.1).
    Escape character is '^]'.
    stats
    STAT pid 9718
    STAT uptime 968
    STAT time 1348736383
    STAT version 1.2.8
    STAT pointer_size 64
    STAT rusage_user 9.265591
    STAT rusage_system 6.967940
    STAT curr_items 136856
    STAT total_items 152337
    STAT bytes 28827795
    STAT curr_connections 77
    STAT total_connections 191
    STAT connection_structures 100
    STAT cmd_flush 0
    STAT cmd_get 261235
    STAT cmd_set 152337
    STAT get_hits 124215
    STAT get_misses 137020
    STAT evictions 0
    STAT bytes_read 32631056
    STAT bytes_written 38919484
    STAT limit_maxbytes 1073741824
    STAT threads 2
    STAT accepting_conns 1
    STAT listen_disabled_num 0
    END
    quit
    Connection closed by foreign host.
    [XXX ~]$

  3. Are you running the latest sFlow build (1.4.13) of Memcached, https://github.com/sflow/memcached?

    What version of Ganglia are you using? What version of Host sFlow?

  4. I'm trying to use the sFlow (jmx-agent 0.6.1) and Ganglia (3.5.0, source build) pair for JVM monitoring.
    gmond.conf:

    udp_recv_channel {
      port = 6343
    }

    sflow {
      accept_vm_metrics = yes
    }

    First of all, is there any way to see the logs, other than running in "-d" (debug) mode?

    gmond -m displays no "vm_*" specific metrics. I thought that those metrics would be injected after the sFlow UDP datagrams arrived, but... no logs, no errors, no metrics.

    1. You should be looking at the instructions in Using Ganglia to monitor Java virtual machines; if you have any further questions, please post them there.
