The Ganglia charts show GPU health and performance metrics collected using sFlow (see GPU performance monitoring). The combination of Ganglia and sFlow provides a highly scalable solution for monitoring the performance of large GPU-based compute clusters, eliminating the need to poll for GPU metrics. Instead, all the host and GPU metrics are efficiently pushed directly to the central Ganglia collector.
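To make the push model concrete, here is a minimal Python sketch of the idea: an agent sends its current metrics to a collector in a single UDP datagram, and the collector never has to ask for them. This is an illustration only. It does not use the real sFlow wire format (a JSON payload stands in for it), and the host name, metric names, and function names are all hypothetical.

```python
import json
import socket


def push_metrics(collector_addr, metrics):
    """Send one datagram holding the host's current metrics; no reply is expected."""
    # JSON is a stand-in here; real sFlow agents use their own binary encoding.
    payload = json.dumps(metrics).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, collector_addr)


def receive_once(sock):
    """Read a single pushed datagram and return (sender_ip, decoded_metrics)."""
    data, sender = sock.recvfrom(65535)
    return sender[0], json.loads(data.decode("utf-8"))


def demo():
    """Run a collector and one agent on the loopback interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as collector:
        collector.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        collector.settimeout(2.0)
        addr = collector.getsockname()

        # Hypothetical sample values, named after the metrics listed in this post.
        sample = {
            "host": "node01",
            "gpu_util_pct": 87,
            "gpu_mem_rw_util_pct": 42,
            "gpu_ecc_errors": 0,
        }
        push_metrics(addr, sample)
        return receive_once(collector)


if __name__ == "__main__":
    print(demo())
```

The collector only listens; it keeps no list of hosts to query and opens no connection per host. That is the property that lets a push design scale to large clusters.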
The screen capture shows the new GPU metrics, including:
- GPU Utilization
- Memory R/W Utilization
- ECC Errors
Note: Support for the GPU metrics is currently available in Ganglia only if you compile gmond from the latest development sources.