Saturday, November 15, 2025

SC25: SDSC Expanse cluster live AI/ML metrics

The SDSC Expanse cluster live AI/ML metrics dashboard is a joint InMon / San Diego Supercomputer Center (SDSC) demonstration at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC25), being held this week in St. Louis, November 16-21. Click on the dashboard link during the show to see live traffic.

By default, the dashboard shows the Last 24 Hours of traffic. Explore the data: select Last 30 Days for a long-term view, select Last 5 Minutes for an up-to-the-second view, click on items in a chart legend to show the selected metric, or drag to select an interval and zoom in.

The Expanse cluster at the San Diego Supercomputer Center is a batch-oriented science computing gateway serving thousands of users and a wide range of research projects; see Google News for examples.

The SDSC Expanse cluster live AI/ML metrics dashboard displays real-time metrics for workloads running on the cluster:

    • Total Traffic: Total traffic entering fabric
    • Cluster Services: Traffic associated with Lustre, Ceph and NFS storage, and Slurm workload management
    • Core Link Traffic: Histogram of load on fabric links
    • Edge Link Traffic: Histogram of load on access ports
    • RDMA Operations: Total RDMA operations
    • RDMA Avg. Bytes per Operation: Average RDMA operation size
    • Infiniband Operations: Total RoCEv2 Infiniband operations broken out by type
    • Compute / Exchange Interval: Detected period of compute / exchange activity on fabric
    • Congestion Notification Messages: Total ECN / CNP congestion messages
    • Infiniband Ack. Credits: Average number of credits in RoCEv2 Infiniband acknowledgements
    • Packet Discards: Total ingress / egress discards
    • Packet Errors: Total ingress / egress errors
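
To make two of the derived metrics above more concrete, here is a minimal Python sketch of how an average operation size and a link load histogram could be computed from raw values. This post does not describe how the dashboard actually calculates them, so the function names, the 10% bucket width, and the input values are all assumptions chosen for illustration.

```python
from collections import Counter


def avg_bytes_per_operation(total_bytes: float, total_ops: float) -> float:
    """Average operation size; returns 0.0 when no operations were seen.

    Hypothetical helper: assumes byte and operation totals cover the same interval.
    """
    return total_bytes / total_ops if total_ops > 0 else 0.0


def link_load_histogram(utilizations, bucket_width=10):
    """Count links per utilization bucket.

    `utilizations` holds per-link load as a percentage of link speed.
    Keys of the result are the lower bound of each bucket; a link at
    100% is counted in the top bucket. The bucket width is an assumption.
    """
    buckets = Counter()
    top = 100 - bucket_width  # lower bound of the highest bucket
    for u in utilizations:
        clamped = min(max(u, 0.0), 100.0)  # guard against out-of-range samples
        start = min(int(clamped // bucket_width) * bucket_width, top)
        buckets[start] += 1
    return dict(sorted(buckets.items()))


# Illustrative values only, not measurements from Expanse.
mean_size = avg_bytes_per_operation(total_bytes=4_096_000, total_ops=1_000)
histogram = link_load_histogram([3.0, 7.5, 12.0, 55.0, 100.0])
```

A histogram like this summarizes how evenly load is spread across links: many links in the low buckets alongside a few in the top bucket would suggest hot spots rather than uniform utilization.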

AI Metrics with Prometheus and Grafana describes how to quickly set up the monitoring stack for your own AI / ML network using industry standard telemetry from leading switch vendors (Arista, Cisco, Dell, Edge-Core, Juniper, HPE, NVIDIA, SONiC etc.).
