Friday, October 19, 2012

Using Ganglia to monitor GPU performance


The Ganglia charts show GPU health and performance metrics collected using sFlow (see GPU performance monitoring). The combination of Ganglia and sFlow provides a highly scalable solution for monitoring the performance of large GPU-based compute clusters, eliminating the need to poll for GPU metrics. Instead, all the host and GPU metrics are efficiently pushed directly to the central Ganglia collector.
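To make the push model concrete, here is a minimal Python sketch of the receiving side: a collector simply waits for sFlow datagrams on UDP port 6343 (the standard sFlow port) and never has to contact the hosts. This is not how gmond is implemented; the function names are hypothetical, and the sketch only decodes the fixed sFlow version 5 datagram header, not the counter samples that carry the host and GPU metrics.

```python
import ipaddress
import socket
import struct

SFLOW_PORT = 6343  # standard sFlow UDP port


def parse_sflow_header(datagram: bytes) -> dict:
    """Decode the fixed header of an sFlow version 5 datagram (illustrative helper)."""
    # All fields are big-endian 32-bit integers, except the agent address.
    version, addr_type = struct.unpack_from("!II", datagram, 0)
    if version != 5:
        raise ValueError(f"unsupported sFlow version: {version}")
    if addr_type == 1:      # IPv4 agent address
        addr_len = 4
    elif addr_type == 2:    # IPv6 agent address
        addr_len = 16
    else:
        raise ValueError(f"unknown agent address type: {addr_type}")
    offset = 8
    agent = ipaddress.ip_address(datagram[offset:offset + addr_len])
    offset += addr_len
    sub_agent, sequence, uptime_ms, num_samples = struct.unpack_from(
        "!IIII", datagram, offset
    )
    return {
        "agent": str(agent),
        "sub_agent": sub_agent,
        "sequence": sequence,
        "uptime_ms": uptime_ms,
        "num_samples": num_samples,
    }


def listen(port: int = SFLOW_PORT) -> None:
    """Block forever, printing the header of every datagram pushed to us."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    while True:
        datagram, (sender, _) = sock.recvfrom(65535)
        print(sender, parse_sflow_header(datagram))
```

Calling `listen()` on a spare machine is enough to see which agents are sending; the point of the sketch is that adding another server to the cluster adds no work on the collector side beyond receiving its datagrams.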

The screen capture shows the new GPU metrics, including:
  • Processes
  • GPU Utilization
  • Memory R/W Utilization
  • ECC Errors
  • Power
  • Temperature
The article, Ganglia 3.2 released, describes the basic steps needed to configure Ganglia as an sFlow collector. Once configured, Ganglia will automatically discover and track new servers as they are added to the network.

Note: Support for the GPU metrics is currently available in Ganglia only if you compile gmond from the latest development sources.
