Wednesday, September 5, 2012

GPU performance monitoring


NVIDIA's Compute Unified Device Architecture (CUDA™) dramatically increases computing performance by harnessing the power of the graphics processing unit (GPU). Recently, NVIDIA published the sFlow NVML GPU Structures specification, defining a standard set of metrics for reporting GPU health and performance, and extended the Host sFlow agent to export the GPU metrics.

The following sflowtool output displays the sFlow metrics; the GPU metrics are the nvml_ fields in the counterBlock_tag 5703:1 block:
[pp@test] /usr/local/bin/sflowtool
startDatagram =================================
datagramSourceIP 10.0.0.150
datagramSize 512
unixSecondsUTC 1346360234
datagramVersion 5
agentSubId 100000
agent 10.0.0.150
packetSequenceNo 1
sysUpTime 3000
samplesInPacket 1
startSample ----------------------
sampleType_tag 0:2
sampleType COUNTERSSAMPLE
sampleSequenceNo 1
sourceId 2:1
counterBlock_tag 0:2001
adaptor_0_ifIndex 1
adaptor_0_MACs 1
adaptor_0_MAC_0 000000000000
adaptor_1_ifIndex 2
adaptor_1_MACs 1
adaptor_1_MAC_0 e0cb4e98f891
adaptor_2_ifIndex 3
adaptor_2_MACs 1
adaptor_2_MAC_0 e0cb4e98f890
counterBlock_tag 0:2005
disk_total 145102770176
disk_free 46691696640
disk_partition_max_used 76.06
disk_reads 477615
disk_bytes_read 13102692352
disk_read_time 2227298
disk_writes 2370522
disk_bytes_written 193176428544
disk_write_time 445531146
counterBlock_tag 0:2004
mem_total 12618829824
mem_free 2484174848
mem_shared 0
mem_buffers 971259904
mem_cached 8214761472
swap_total 12580810752
swap_free 12580810752
page_in 6400433
page_out 94324428
swap_in 0
swap_out 0
counterBlock_tag 5703:1
nvml_device_count 1
nvml_processes 0
nvml_gpu_mS 0
nvml_mem_mS 0
nvml_mem_bytes_total 6441598976
nvml_mem_bytes_free 6429614080
nvml_ecc_errors 0
nvml_energy_mJ 74569
nvml_temperature_C 54
nvml_fan_speed_pc 30
counterBlock_tag 0:2003
cpu_load_one 0.040
cpu_load_five 0.240
cpu_load_fifteen 0.350
cpu_proc_run 0
cpu_proc_total 229
cpu_num 8
cpu_speed 1600
cpu_uptime 896187
cpu_user 21731800
cpu_nice 120230
cpu_system 5686620
cpu_idle 2844149774
cpu_wio 2992230
cpu_intr 570
cpu_sintr 222180
cpu_interrupts 166594944
cpu_contexts 266986130
counterBlock_tag 0:2006
nio_bytes_in 0
nio_pkts_in 0
nio_errs_in 0
nio_drops_in 0
nio_bytes_out 0
nio_pkts_out 0
nio_errs_out 0
nio_drops_out 0
counterBlock_tag 0:2000
hostname test0
UUID 00000000000000000000000000000000
machine_type 3
os_name 2
os_release 2.6.35.14-106.fc14.x86_64
endSample   ----------------------
endDatagram   =================================
Note: Currently, only the Linux version of Host sFlow includes GPU support, and the agent must be compiled from source on a system that has the NVML library installed.
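
To work with these metrics in a script, the sflowtool text output can be grouped by counterBlock_tag. The following is a minimal sketch, not part of sflowtool or Host sFlow: the helper names parse_counter_blocks and gpu_memory_used_bytes are hypothetical, and SAMPLE is an abbreviated copy of the output shown above.

```python
from collections import defaultdict

# Abbreviated copy of the sflowtool output shown above.
SAMPLE = """\
counterBlock_tag 0:2004
mem_total 12618829824
mem_free 2484174848
counterBlock_tag 5703:1
nvml_device_count 1
nvml_mem_bytes_total 6441598976
nvml_mem_bytes_free 6429614080
nvml_temperature_C 54
nvml_fan_speed_pc 30
counterBlock_tag 0:2003
cpu_load_one 0.040
endSample   ----------------------
"""

GPU_TAG = "5703:1"  # tag of the NVML block in the output above


def parse_counter_blocks(text):
    """Group 'key value' lines by the counterBlock_tag line that precedes them.

    Values are kept as strings. If the text holds several samples, blocks
    with the same tag are merged and later values overwrite earlier ones.
    """
    blocks = defaultdict(dict)
    tag = None
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or single-token lines
        key, value = parts
        if key == "counterBlock_tag":
            tag = value.strip()
        elif key in ("startSample", "endSample"):
            tag = None  # sample boundary: stop attributing lines to a block
        elif tag is not None:
            blocks[tag][key] = value.strip()
    return dict(blocks)


def gpu_memory_used_bytes(gpu_block):
    """GPU memory in use, derived as total minus free."""
    total = int(gpu_block["nvml_mem_bytes_total"])
    free = int(gpu_block["nvml_mem_bytes_free"])
    return total - free


if __name__ == "__main__":
    gpu = parse_counter_blocks(SAMPLE)[GPU_TAG]
    print("temperature (C):", gpu["nvml_temperature_C"])
    print("memory used (bytes):", gpu_memory_used_bytes(gpu))
```

In practice the same function could be applied to text captured from a running sflowtool process instead of the SAMPLE string.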

The inclusion of GPU metrics in Host sFlow offers a scalable, lightweight solution for monitoring compute cluster performance. In addition to exporting a comprehensive set of standard performance metrics, the Host sFlow agent also offers a convenient API for exporting custom application metrics.
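
For monitoring, counter values are usually converted into rates by comparing two successive samples. The sketch below assumes, based only on the field names, that nvml_gpu_mS and nvml_mem_mS are cumulative milliseconds of GPU and GPU memory activity and that nvml_energy_mJ is cumulative millijoules consumed; the NVML structures specification should be checked before relying on this. The function name gpu_rates and the second reading are hypothetical, and counter wrap is ignored.

```python
def gpu_rates(prev, curr, interval_s):
    """Turn two successive counter readings into average rates.

    Assumption (from the field names, not confirmed here): *_mS fields are
    cumulative busy milliseconds and nvml_energy_mJ is cumulative millijoules.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    interval_ms = interval_s * 1000.0
    return {
        # busy milliseconds per elapsed millisecond, as a percentage
        "gpu_util_pct": 100.0 * (curr["nvml_gpu_mS"] - prev["nvml_gpu_mS"]) / interval_ms,
        "mem_util_pct": 100.0 * (curr["nvml_mem_mS"] - prev["nvml_mem_mS"]) / interval_ms,
        # millijoules per millisecond equals watts
        "power_w": (curr["nvml_energy_mJ"] - prev["nvml_energy_mJ"]) / interval_ms,
    }


if __name__ == "__main__":
    # First reading: values from the sflowtool output above.
    first = {"nvml_gpu_mS": 0, "nvml_mem_mS": 0, "nvml_energy_mJ": 74569}
    # Second reading: hypothetical values, 20 seconds later.
    second = {"nvml_gpu_mS": 5000, "nvml_mem_mS": 1000, "nvml_energy_mJ": 1274569}
    print(gpu_rates(first, second, interval_s=20))
```

When nvml_device_count is greater than 1, the counters may cover several devices, so a utilization figure derived this way would need to be interpreted accordingly.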

The sFlow standard isn't limited to monitoring compute resources; most network switch vendors include sFlow support, providing detailed visibility into cluster communication patterns and network utilization. Combining sFlow from switches, servers and applications delivers a comprehensive view of cluster performance.
