Monday, October 14, 2024

OCP Global Summit 2024

AI networking is a popular topic at the up coming OCP Global Summit in San Jose, California, with an entire morning on Wednesday October 16 devoted to the subject.
Of particular interest is the talk, Leveraging open technologies to monitor packet drops in AI cluster fabrics, by Aldrin Isaac, eBay, describing the challenge, AI clusters operate most efficiently over lossless networks for optimum job completion times which can be significantly impacted by dropped packets. Although networks can be designed to minimize packet loss by choosing the right network topology, optimizing network devices and protocols, an effective monitoring and troubleshooting network performance tool is still required. Such tool should capture packet drops, raise notifications and identify various drop reasons and pin point where the drops caused congestions. In turn, it allows the governing management application to tune configurations of relevant infrastructure components, including switches, NICs and GPU servers.

The talk will share the results and best practices of a TAM (Telemetry and Monitoring) solution being prepared for deployment at eBay. It leverages OCP’s SAI and open sFlow drop notification technologies as part of eBay’s ongoing initiatives to adopt open networking hardware and community SONiC for its data centers.

The sFlow Dropped Packet Notification Structures extension mentioned in the talk adds real-time packet drop notifications (including dropped packet header and drop reason) as part of an industry standard sFlow telemetry feed, making the data available to open source and commercial sFlow analytics tools.

For example, Dropped packet metrics with Prometheus and Grafana describes how to incorporate sFlow dropped packet notifications into operational dashboards using current implementations for Arista, VyOS, and Linux servers. The availability of drop monitoring in SONiC will extend this capability to the wide range of hardware platforms supporting the SONiC network operating system.

No comments:

Post a Comment