Tuesday, May 5, 2020

NVIDIA, Mellanox, and Cumulus

Two recent press releases, "Riding a Cloud: NVIDIA Acquires Network-Software Trailblazer Cumulus" and "NVIDIA Completes Acquisition of Mellanox, Creating Major Force Driving Next-Gen Data Centers", describe NVIDIA's moves to provide high-speed data center networks that connect compute clusters using its GPUs to accelerate big data workloads, including deep learning, climate modeling, animation, data visualization, physics, and molecular dynamics.

Real-time visibility into compute, network, and GPU infrastructure is required to manage and optimize the unified infrastructure. This article explores how the industry standard sFlow technology, supported by all three vendors, can deliver comprehensive visibility.

Cumulus Linux simplifies operations by running the same operating system, Linux, on the switches that runs on the servers. Cumulus Networks and Mellanox have a long history of working with the Linux community to integrate support for switches. The latest Linux kernels now include native support for network ASICs, seamlessly integrating with standard Linux routing (FRR, Quagga, Bird, etc.), configuration (Puppet, Chef, Ansible, etc.), and monitoring (collectd, netstat, top, etc.) tools.

The article "Linux 4.11 kernel extends packet sampling support" describes enhancements to the Linux kernel to support industry standard sFlow instrumentation in network ASICs. Cumulus Linux and Mellanox both support the new Linux APIs. Cumulus Linux uses the open source Host sFlow agent to stream telemetry gathered from the hardware, Linux operating system, and applications to a remote collector.

The articles "Ubuntu 18.04" and "CentOS 8" describe how to install the Host sFlow agent on popular host Linux distributions. The Host sFlow agent is also available as a Docker image for easy deployment with container orchestration systems; see "Host, Docker, Swarm and Kubernetes monitoring". Extending network visibility to the host allows network traffic to be associated with the applications running on the host, and provides details about the resources those applications consume and the network quality of service they receive.

The Host sFlow agent also supports the sFlow NVML GPU Structures extension to export key metrics from NVIDIA GPUs using the NVIDIA Management Library (NVML), see GPU performance monitoring.
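As a rough illustration of what enabling the agent on a GPU host might involve, the following sketch renders a small Host sFlow configuration file from Python. The function name `render_hsflowd_conf`, the collector address, and the sampling and polling values are all hypothetical, and the setting and module names (`sampling`, `polling`, `collector`, `nvml`, `docker`) are assumptions that should be checked against the Host sFlow documentation for the installed version.

```python
def render_hsflowd_conf(collector_ip, udp_port=6343, sampling=1000,
                        polling=30, modules=()):
    """Return the text of a hypothetical hsflowd.conf.

    The setting and module names used here are assumptions, not a
    verified description of any particular Host sFlow release.
    """
    lines = [
        "sflow {",
        f"  sampling = {sampling}",   # assumed: 1-in-N packet sampling rate
        f"  polling = {polling}",     # assumed: counter export interval, seconds
        f"  collector {{ ip = {collector_ip} udpport = {udp_port} }}",
    ]
    # Each optional module is written as an empty block, e.g. "nvml { }".
    for name in modules:
        lines.append(f"  {name} {{ }}")
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 10.0.0.50 is a placeholder collector address.
    print(render_hsflowd_conf("10.0.0.50", modules=("nvml", "docker")))
```

Generating the file from a template like this is one way to keep switch, server, and GPU host settings consistent when the configuration is pushed out with tools such as Ansible or Puppet.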

Enabling sFlow across the network, compute, and GPU stack provides a real-time, data center-wide view of performance. The sFlow-RT real-time analytics engine offers a convenient method of integrating sFlow analytics with popular orchestration, DevOps, and SDN tools. Examples include: "Cumulus Networks, sFlow and data center automation"; "Flow metrics with Prometheus and Grafana"; "ECMP visibility with Cumulus Linux"; "Fabric View"; and "Troubleshooting connectivity problems in leaf and spine fabrics".
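To give a feel for this kind of integration, here is a minimal sketch of a script that talks to sFlow-RT over its REST API. It assumes an sFlow-RT instance at a placeholder address such as http://localhost:8008, and assumes the `/flow/{name}/json` and `/activeflows/{agent}/{name}/json` endpoints behave as described in the sFlow-RT documentation. The flow name `pair` and the helper function names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen


def flow_definition(keys, value="bytes"):
    """Build a flow definition: group traffic by `keys`, measure `value`."""
    return {"keys": ",".join(keys), "value": value}


def define_flow_request(base_url, name, definition):
    """Build (but do not send) the PUT request that installs a flow definition."""
    body = json.dumps(definition).encode("utf-8")
    return Request(
        f"{base_url}/flow/{quote(name)}/json",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def active_flows_url(base_url, name, agent="ALL", max_flows=10):
    """URL used to query the largest active flows for a defined flow name."""
    return f"{base_url}/activeflows/{agent}/{quote(name)}/json?maxFlows={max_flows}"


def top_flows(base_url, name, max_flows=10, timeout=5):
    """Fetch and decode the largest active flows (requires a running sFlow-RT)."""
    url = active_flows_url(base_url, name, max_flows=max_flows)
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    base = "http://localhost:8008"  # placeholder sFlow-RT address
    req = define_flow_request(
        base, "pair", flow_definition(["ipsource", "ipdestination"]))
    urlopen(req, timeout=5).close()  # install the flow definition
    print(top_flows(base, "pair"))
```

Separating request construction from the network calls keeps the script easy to test offline and to adapt to other tools, for example by feeding the results into an automation workflow.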

