sFlow: NVIDIA Cumulus Linux 5.11 for AI / ML

Monday, November 18, 2024

NVIDIA Cumulus Linux 5.11 includes major upgrades to the sFlow agent that fully expose the advanced instrumentation built into NVIDIA Spectrum-X silicon. The enhanced real-time telemetry is particularly relevant to the AI / machine learning workloads that Spectrum-X is designed to handle.

With Cumulus Linux 5.11, the sFlow agent is easily configured using NVUE commands (see Monitoring System Statistics and Network Traffic with sFlow):

nv set system sflow dropmon hw
nv set system sflow poll-interval 20
nv set system sflow collector 192.0.2.1
nv set system sflow state enabled
nv config apply

Note: In this case, enabling dropmon ensures that every dropped packet is captured, along with ingress port and drop reason (e.g. ttl_exceeded).

The same commands should be applied to every switch in the fabric for comprehensive visibility.
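Applying the configuration fabric-wide can be scripted. The following is a minimal sketch, not a tested deployment tool: the function names sflow_commands and apply_sflow, the DRY_RUN variable, the cumulus login name, and the switch names in the usage note are all assumptions made for illustration. It also assumes key-based SSH access to each switch and that nv config apply completes without an interactive prompt in that session.

```shell
#!/bin/sh
# Hypothetical helper: print the NVUE commands from this article,
# substituting the collector address passed as the first argument.
sflow_commands() {
    collector="$1"
    cat <<EOF
nv set system sflow dropmon hw
nv set system sflow poll-interval 20
nv set system sflow collector ${collector}
nv set system sflow state enabled
nv config apply
EOF
}

# Hypothetical helper: apply the commands to every switch listed after
# the collector address. With DRY_RUN=1 the commands are only printed.
# Assumes SSH key access as user "cumulus" (an assumption, adjust as needed).
apply_sflow() {
    collector="$1"
    shift
    for switch in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "# ${switch}"
            sflow_commands "${collector}"
        else
            # The multi-line command string is run by the remote shell.
            ssh "cumulus@${switch}" "$(sflow_commands "${collector}")"
        fi
    done
}
```

For example, DRY_RUN=1 apply_sflow 192.0.2.1 leaf01 leaf02 prints the commands that would be sent to two hypothetically named switches, which is a reasonable check before running the same line without DRY_RUN.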

The article RDMA over Converged Ethernet (RoCE) describes how sFlow provides detailed visibility into the RoCE flows used to move data between GPUs in an AI / ML data center fabric. The chart above, from the RDMA network visibility demonstration at the SC22 conference, shows that sFlow monitoring easily scales to the 400/800G speeds needed for machine learning.

In this example, the sFlow-RT real-time analytics engine receives sFlow telemetry from all the switches and servers in the fabric. The article Deploy real-time network dashboards using Docker compose describes how to quickly set up an sFlow-RT, Prometheus, and Grafana stack to capture and display metrics. The article Dropped packet metrics with Prometheus and Grafana describes how to add a dashboard to display packet drop notifications.
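Once sFlow-RT is running, its REST API offers a quick way to confirm that telemetry is arriving from the switches. The sketch below assumes sFlow-RT is listening on its default HTTP port 8008 on localhost and that curl is installed. The helper names (rt_url, list_agents, define_pair_flow, top_pairs), the SFLOW_RT variable, and the flow name pair are illustrative choices, not part of the original setup.

```shell
#!/bin/sh
# Base URL of the sFlow-RT instance (assumed default; override as needed).
SFLOW_RT="${SFLOW_RT:-http://localhost:8008}"

# Build a full URL from a REST path, tolerating a trailing slash in SFLOW_RT.
rt_url() {
    printf '%s%s\n' "${SFLOW_RT%/}" "$1"
}

# List the sFlow agents (switches and servers) that sFlow-RT has heard from.
list_agents() {
    curl -s "$(rt_url /agents/json)"
}

# Define a flow named "pair" that tracks bytes per source/destination IP pair.
define_pair_flow() {
    curl -s -X PUT -H 'Content-Type: application/json' \
        --data '{"keys":"ipsource,ipdestination","value":"bytes"}' \
        "$(rt_url /flow/pair/json)"
}

# Show the largest active "pair" flows across all agents (default: top 5).
top_pairs() {
    curl -s "$(rt_url "/activeflows/ALL/pair/json?maxFlows=${1:-5}")"
}
```

Running list_agents after enabling sFlow on the switches should return an entry for each switch that is sending telemetry. Calling define_pair_flow once and then top_pairs 10 gives a rough view of the busiest address pairs without waiting for dashboards to be built.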

If you are standing up a new NVIDIA Spectrum-X / Cumulus Linux network, enable sFlow on all the switches and set up an instance of sFlow-RT for real-time, fabric-wide visibility into traffic flows and dropped packets. Real-time network visibility is particularly relevant to AI / ML data center networks, where congestion and dropped packets can result in serious performance degradation.
