Saturday, November 15, 2025

SC25: SDSC Expanse cluster live AI/ML metrics

The SDSC Expanse cluster live AI/ML metrics dashboard is a joint InMon / San Diego Supercomputer Center (SDSC) demonstration at SC25, The International Conference for High Performance Computing, Networking, Storage, and Analysis, being held this week in St. Louis, November 16-21. Click on the dashboard link during the show to see live traffic.

By default, the dashboard shows the Last 24 Hours of traffic. To explore the data, select Last 30 Days for a long-term view, select Last 5 Minutes for an up-to-the-second view, click on items in a chart legend to show only the selected metrics, or drag to select an interval and zoom in.

The Expanse cluster at the San Diego Supercomputer Center is a batch-oriented science computing gateway serving thousands of users and a wide range of research projects; see Google News for examples.

The SDSC Expanse cluster live AI/ML metrics dashboard displays real-time metrics for workloads running on the cluster:

    • Total Traffic: Total traffic entering fabric
    • Cluster Services: Traffic associated with Lustre, Ceph and NFS storage, and Slurm workload management
    • Core Link Traffic: Histogram of load on fabric links
    • Edge Link Traffic: Histogram of load on access ports
    • RDMA Operations: Total RDMA operations
    • RDMA Avg. Bytes per Operation: Average RDMA operation size
    • Infiniband Operations: Total RoCEv2 Infiniband operations broken out by type
    • Compute / Exchange Interval: Detected period of compute / exchange activity on fabric
    • Congestion Notification Messages: Total ECN / CNP congestion messages
    • Infiniband Ack. Credits: Average number of credits in RoCEv2 Infiniband acknowledgements
    • Packet Discards: Total ingress / egress discards
    • Packet Errors: Total ingress / egress errors
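
To make two of these charts concrete, the following is a minimal sketch of how an average operation size and a link load histogram could be derived from raw values. It is an illustration only, not the dashboard's implementation: the function names, the ten equal-width buckets, and the representation of link load as a fraction between 0 and 1 are all assumptions.

```python
def avg_bytes_per_operation(total_bytes, total_operations):
    """Average operation size in bytes; 0.0 when no operations were seen."""
    if total_operations == 0:
        return 0.0
    return total_bytes / total_operations


def utilization_histogram(utilizations, bins=10):
    """Count links per equal-width load bucket over the range [0, 1].

    `utilizations` holds one value per link, expressed as a fraction
    of link capacity (an assumed representation for this sketch).
    """
    counts = [0] * bins
    for u in utilizations:
        u = min(max(u, 0.0), 1.0)          # clamp out-of-range samples
        index = min(int(u * bins), bins - 1)  # 1.0 falls in the last bucket
        counts[index] += 1
    return counts
```

A histogram of this kind shows at a glance whether load is spread evenly across links or concentrated on a few heavily loaded ones.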

AI Metrics with Prometheus and Grafana describes how to quickly set up the monitoring stack for your own AI / ML network using industry standard telemetry from leading switch vendors (Arista, Cisco, Dell, Edge-Core, Juniper, HPE, NVIDIA, SONiC etc.).

Monday, November 3, 2025

Ultra Ethernet Transport

The Ultra Ethernet Consortium has a mission to "Deliver an Ethernet based open, interoperable, high performance, full-communications stack architecture to meet the growing network demands of AI & HPC at scale." The recently released UE-Specification-1.0.1 includes an Ultra Ethernet Transport (UET) protocol with functionality similar to RDMA over Converged Ethernet (RoCEv2).

The sFlow instrumentation embedded as a standard feature of data center switch hardware from all leading vendors (Arista, Cisco, Dell, Juniper, NVIDIA, etc.) provides a cost-effective solution for gaining visibility into UET traffic in large production AI / ML fabrics.

docker run -p 8008:8008 -p 6343:6343/udp sflow/prometheus
The easiest way to get started is to use the pre-built sflow/prometheus Docker image to analyze the sFlow telemetry. The chart at the top of this page shows an up-to-the-second view of UET operations using the included Flow Browser application; see Defining Flows for a list of available UET attributes. Getting Started describes how to set up the sFlow monitoring system.
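
Once the container is running, a quick reachability check can confirm that the HTTP port published by the docker command (8008) is answering. The sketch below assumes the analyzer exposes a /version REST endpoint on that port; the endpoint path, the helper names, and the localhost address are assumptions that should be verified against the sFlow-RT REST API documentation.

```python
from urllib.request import urlopen


def version_url(host="localhost", port=8008):
    """Build the URL of the (assumed) version endpoint on the published HTTP port."""
    return f"http://{host}:{port}/version"


def fetch_version(host="localhost", port=8008, timeout=5.0):
    """Return the response body as text, or None if the service is unreachable."""
    try:
        with urlopen(version_url(host, port), timeout=timeout) as response:
            return response.read().decode("utf-8").strip()
    except OSError:
        # Covers connection refused, timeouts, and HTTP errors (URLError is an OSError)
        return None
```

Calling fetch_version() returns None until the container is up, which makes it usable as a simple readiness probe before pointing switches at UDP port 6343.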

Flow metrics with Prometheus and Grafana describes how to collect custom network traffic flow metrics using the Prometheus time series database and include the metrics in Grafana dashboards. Use the Flow Browser to explore UET flow metrics and then configure a Prometheus scrape task to collect useful operational metrics.
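
As a rough illustration of what such a scrape task requests, the sketch below builds the URL for a flow metric export. The export path and the metric, key, label, value, and maxFlows parameters follow my recollection of the Flow metrics with Prometheus and Grafana article and should be checked against it. The function name is hypothetical, and the ipsource / ipdestination keys are generic placeholders rather than UET attributes; substitute attributes found with the Flow Browser.

```python
from urllib.parse import urlencode

# Assumed export path of the Prometheus application bundled in the image
EXPORT_PATH = "/app/prometheus/scripts/export.js/flows/ALL/txt"


def flow_metrics_url(host, metric, keys, labels, value="bytes",
                     port=8008, max_flows=100):
    """Build the URL a Prometheus scrape job would fetch for one flow metric.

    Each flow key is paired with the Prometheus label it is exported under,
    so `keys` and `labels` must have the same length.
    """
    if len(keys) != len(labels):
        raise ValueError("keys and labels must have the same length")
    params = [("metric", metric)]
    params += [("key", k) for k in keys]      # repeated key= parameters
    params += [("label", l) for l in labels]  # repeated label= parameters
    params += [("value", value), ("maxFlows", str(max_flows))]
    return f"http://{host}:{port}{EXPORT_PATH}?{urlencode(params)}"
```

Fetching the resulting URL with a browser or curl before adding it to the Prometheus configuration is a quick way to confirm that the flow definition returns the expected metrics.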