Tuesday, October 22, 2024

Leveraging open technologies to monitor packet drops in AI cluster fabrics

In this talk from the recent OCP Global Summit, Aldrin Isaac, eBay, describes the challenge, AI clusters operate most efficiently over lossless networks for optimum job completion times which can be significantly impacted by dropped packets. Although networks can be designed to minimize packet loss by choosing the right network topology, optimizing network devices and protocols, an effective monitoring and troubleshooting network performance tool is still required. Such tool should capture packet drops, raise notifications and identify various drop reasons and pin point where the drops caused congestions. In turn, it allows the governing management application to tune configurations of relevant infrastructure components, including switches, NICs and GPU servers.

The talk shares the results and best practices of a TAM (Telemetry and Monitoring) solution being prepared for deployment at eBay. It leverages OCP’s SAI and open sFlow drop notification technologies as part of eBay’s ongoing initiatives to adopt open networking hardware and community SONiC for its data centers.

The sFlow Dropped Packet Notification Structures extension mentioned in the talk adds real-time packet drop notifications (including dropped packet header and drop reason) as part of an industry standard sFlow telemetry feed, making the data available to open source and commercial sFlow analytics tools.

For example, Dropped packet metrics with Prometheus and Grafana describes how to incorporate sFlow dropped packet notifications into operational dashboards using current implementations by Arista Networks, VyOS, FD.io / VPP, and Linux servers. Current network hardware is capable of reporting on dropped packets, so ask your network equipment vendor about their plains to support the sFlow extension so that you can befit from this transformational capability.

Monday, October 14, 2024

OCP Global Summit 2024

AI networking is a popular topic at the up coming OCP Global Summit in San Jose, California, with an entire morning on Wednesday October 16 devoted to the subject.
Of particular interest is the talk, Leveraging open technologies to monitor packet drops in AI cluster fabrics, by Aldrin Isaac, eBay, describing the challenge, AI clusters operate most efficiently over lossless networks for optimum job completion times which can be significantly impacted by dropped packets. Although networks can be designed to minimize packet loss by choosing the right network topology, optimizing network devices and protocols, an effective monitoring and troubleshooting network performance tool is still required. Such tool should capture packet drops, raise notifications and identify various drop reasons and pin point where the drops caused congestions. In turn, it allows the governing management application to tune configurations of relevant infrastructure components, including switches, NICs and GPU servers.

The talk will share the results and best practices of a TAM (Telemetry and Monitoring) solution being prepared for deployment at eBay. It leverages OCP’s SAI and open sFlow drop notification technologies as part of eBay’s ongoing initiatives to adopt open networking hardware and community SONiC for its data centers.

The sFlow Dropped Packet Notification Structures extension mentioned in the talk adds real-time packet drop notifications (including dropped packet header and drop reason) as part of an industry standard sFlow telemetry feed, making the data available to open source and commercial sFlow analytics tools.

For example, Dropped packet metrics with Prometheus and Grafana describes how to incorporate sFlow dropped packet notifications into operational dashboards using current implementations for Arista, VyOS, and Linux servers. The availability of drop monitoring in SONiC will extend this capability to the wide range of hardware platforms supporting the SONiC network operating system.

Monday, October 7, 2024

Vector Packet Processor (VPP)

VPP with sFlow - Part 1 and VPP with sFlow - Part 2 describe the journey to add industry standard sFlow instrumentation to the Vector Packet Processor (VPP) an Open Source Terabit Software Dataplane for software routers running on commodity x86 / ARM hardware.

The main conclusions based on testing described in the two VPP blog posts are:

  1. If sFlow is not enabled on a given interface, there is no regression on other interfaces.
  2. If sFlow is enabled, copying packets costs 11 CPU cycles on average
  3. If sFlow takes a sample, it takes only marginally more CPU time to enqueue.
    • No sampling gets 9.88Mpps of IPv4 and 14.3Mpps of L2XC throughput,
    • 1:1000 sampling reduces to 9.77Mpps of L3 and 14.05Mpps of L2XC throughput,
    • and an overly harsh 1:100 reduces to 9.69Mpps and 13.97Mpps only.

The VPP sFlow plugin provides a lightweight method of exporting real-time sFlow telemetry from a VPP based router. Including the plugin with VPP distributions has no impact on performance. Enabling the plugin provides real-time visibility that opens up additional use cases for VPPs programmable dataplane. For example, VPP is well suited to packet filtering use cases where the number of ACL entries would exceed the capabilities of an ASIC. Combined with real-time visibility to identify DDoS attacks, VPP provides an effective means of mitigating the attacks by scrubbing the unwanted traffic.