By default, the dashboard shows the Last 24 Hours of traffic. To explore the data, select Last 30 Days for a long-term view, select Last 5 Minutes for an up-to-the-second view, click an item in a chart legend to show only the selected metric, or drag across a chart to select an interval and zoom in.
The Expanse cluster at the San Diego Supercomputer Center is a batch-oriented science computing gateway serving thousands of users and a wide range of research projects (see Google News for examples). The SDSC Expanse cluster live AI/ML metrics dashboard displays real-time metrics for workloads running on the cluster:
- Total Traffic: Total traffic entering fabric
- Cluster Services: Traffic associated with Lustre, Ceph and NFS storage, and Slurm workload management
- Core Link Traffic: Histogram of load on fabric links
- Edge Link Traffic: Histogram of load on access ports
- RDMA Operations: Total RDMA operations
- RDMA Avg. Bytes per Operation: Average RDMA operation size
- Infiniband Operations: Total RoCEv2 Infiniband operations broken out by type
- Compute / Exchange Interval: Detected period of compute / exchange activity on fabric
- Congestion Notification Messages: Total ECN / CNP congestion messages
- Infiniband Ack. Credits: Average number of credits in RoCEv2 Infiniband acknowledgements
- Packet Discards: Total ingress / egress discards
- Packet Errors: Total ingress / egress errors
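Two of the metrics above are derived values rather than raw counters, and a small sketch may make them more concrete. The Python below is illustrative only: the function names, the percent-utilization input, and the 10% bucket width are assumptions made for this example, not the dashboard's actual implementation.

```python
from collections import Counter


def avg_bytes_per_operation(byte_rate, op_rate):
    """Average operation size: bytes per second divided by operations per second.

    Returns 0.0 when no operations were observed, to avoid dividing by zero.
    """
    if op_rate <= 0:
        return 0.0
    return byte_rate / op_rate


def load_histogram(utilizations, bucket_width=10):
    """Count links per utilization bucket.

    `utilizations` holds one percent-utilization value (0-100) per link.
    Assumes `bucket_width` divides 100 evenly (hypothetical default: 10).
    Keys in the result are the lower edge of each bucket, in percent.
    """
    top_bucket = (100 // bucket_width) - 1
    counts = Counter()
    for value in utilizations:
        clamped = min(max(value, 0.0), 100.0)
        # A fully loaded link (100%) is counted in the top bucket.
        index = min(int(clamped // bucket_width), top_bucket)
        counts[index * bucket_width] += 1
    return dict(sorted(counts.items()))
```

For example, `load_histogram([5, 12, 18, 95, 100])` groups five links into the 0%, 10%, and 90% buckets, which is the kind of distribution the Core Link Traffic and Edge Link Traffic charts summarize.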
AI Metrics with Prometheus and Grafana describes how to quickly set up the monitoring stack for your own AI / ML network using industry-standard telemetry from leading switch vendors (Arista, Cisco, Dell, Edge-Core, Juniper, HPE, NVIDIA, SONiC, etc.).
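Once metrics are stored in Prometheus, a time window such as the dashboard's default Last 24 Hours corresponds to a range query against the Prometheus HTTP API (`/api/v1/query_range`, with `query`, `start`, `end`, and `step` parameters). The sketch below only builds such a URL. The server address and the metric name `example_fabric_bytes_per_second` are hypothetical placeholders, not names taken from the article.

```python
from urllib.parse import urlencode


def build_range_query_url(base_url, promql, end_ts, window_seconds, step_seconds):
    """Build a Prometheus range-query URL covering the window that ends at end_ts.

    Timestamps are Unix seconds; step_seconds is the resolution of the result.
    """
    params = {
        "query": promql,
        "start": end_ts - window_seconds,
        "end": end_ts,
        "step": step_seconds,
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


# Hypothetical server and metric name, used purely for illustration.
url = build_range_query_url(
    "http://localhost:9090",
    "sum(example_fabric_bytes_per_second)",
    end_ts=1700000000,
    window_seconds=24 * 3600,  # a "Last 24 Hours" style window
    step_seconds=300,          # one sample every 5 minutes
)
```

Changing `window_seconds` to `30 * 24 * 3600` or `5 * 60` would mirror the Last 30 Days and Last 5 Minutes selections, usually with a correspondingly coarser or finer `step_seconds`.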

