AI Metrics is available on GitHub. The application provides performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks that use NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communication: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter.
The screen capture is from a containerlab topology that emulates an AI compute cluster connected by a leaf and spine network. The metrics include:
- Total Traffic: Total traffic entering the fabric
- Operations: Total RoCEv2 operations broken out by type
- Core Link Traffic: Histogram of load on fabric links
- Edge Link Traffic: Histogram of load on access ports
- RDMA Operations: Total RDMA operations
- RDMA Bytes: Average RDMA operation size
- Credits: Average number of credits in RoCEv2 acknowledgements
- Period: Detected period of compute / exchange activity on the fabric (in this case just over 0.5 seconds)
- Congestion: Total ECN / CNP congestion messages
- Errors: Total ingress / egress errors
- Discards: Total ingress / egress discards
- Drop Reasons: Packet drop reasons
Note: Clicking on peaks in the charts shows values at that time.
This article gives step-by-step instructions to run the demonstration.
Download the sflow-rt/containerlab project from GitHub.
git clone https://github.com/sflow-rt/containerlab.git
cd containerlab
./run-clab
Run the above commands to download the sflow-rt/containerlab GitHub project and run Containerlab on a system with Docker installed. Docker Desktop is a convenient way to run the labs on a laptop.
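If you want to confirm that the lab environment started, list the running containers from another terminal on the host. Exact container names depend on the run-clab script, but you should see the Containerlab / sFlow-RT environment in the output:
docker ps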
containerlab deploy -t rocev2.yml
Start the 3 stage leaf and spine emulation.
The initial launch may take a couple of minutes as the container images are downloaded for the first time. Once the images are downloaded, the topology deploys in a few seconds.
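To verify the deployment, you can list the nodes in the running topology using containerlab's standard inspect subcommand:
containerlab inspect -t rocev2.yml
The output lists each emulated switch and host along with its container name and management IP address.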
./topo.py clab-rocev2
Run the command above to send the topology to the AI Metrics application and connect to http://localhost:8008/app/ai-metrics/html/ to access the dashboard shown at the top of this article.
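If the dashboard remains empty, confirm that sFlow-RT is running and receiving telemetry from the emulated switches by querying its REST API; /version and /agents/json are standard sFlow-RT endpoints:
curl http://localhost:8008/version
curl http://localhost:8008/agents/json
The second command should list an sFlow agent for each switch in the topology.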
docker exec -it clab-rocev2-h1 hping3 exec rocev2.tcl 172.16.2.2 10000 500 100
Run the command above to simulate RoCEv2 traffic between h1 and h2 during a training run. The run consists of 100 cycles; each cycle involves an exchange of 10000 packets followed by a 500 millisecond pause (simulating GPU compute time). You should immediately see charts updating in the dashboard.
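Based on the description above, the trailing arguments appear to map to packets per exchange, pause in milliseconds, and number of cycles (check rocev2.tcl for the authoritative argument order). For example, doubling the pause should show up as a longer detected Period in the dashboard:
docker exec -it clab-rocev2-h1 hping3 exec rocev2.tcl 172.16.2.2 10000 1000 100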
Clicking on the wave symbol in the top right of the Period chart pulls up a fast-updating Periodicity chart showing how sub-second variability in the traffic, driven by the exchange / compute cycles, is used to compute the period shown in the dashboard, in this case just under 0.7 seconds.
Connect to the included sflow-rt/trace-flow application, http://localhost:8008/app/trace-flow/html/
suffix:stack:.:1=ibbt
Enter the filter above and press Submit. The instant a RoCEv2 flow is generated, its path should be shown. See Defining Flows for information on traffic filters.
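Flows can also be defined programmatically through sFlow-RT's REST API, the same mechanism described in Defining Flows. The following is a minimal sketch that tracks RoCEv2 traffic using the InfiniBand base transport (ibbt) stack suffix from the filter above; the flow name rocev2 is an arbitrary choice:
curl -X PUT -H 'Content-Type: application/json' -d '{"keys":"ipsource,ipdestination","value":"bytes","filter":"suffix:stack:.:1=ibbt"}' http://localhost:8008/flow/rocev2/json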
docker exec -it clab-rocev2-leaf1 vtysh
The routers in this topology run the open source FRRouting daemon used in data center switches with NVIDIA Cumulus Linux or SONiC network operating systems. Type the command above to access the CLI on the leaf1 switch.
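vtysh also accepts commands non-interactively via its -c option, which is convenient for quick checks without entering the CLI:
docker exec clab-rocev2-leaf1 vtysh -c "show ip bgp summary"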
leaf1# show running-config
For example, type the above command in the leaf1 CLI to see the running configuration.
Building configuration...
Current configuration:
!
frr version 10.2.3_git
frr defaults datacenter
hostname leaf1
log stdout
!
interface eth3
ip address 172.16.1.1/24
ipv6 address 2001:172:16:1::1/64
exit
!
router bgp 65001
bgp bestpath as-path multipath-relax
bgp bestpath compare-routerid
neighbor fabric peer-group
neighbor fabric remote-as external
neighbor fabric description Internal Fabric Network
neighbor fabric capability extended-nexthop
neighbor eth1 interface peer-group fabric
neighbor eth2 interface peer-group fabric
!
address-family ipv4 unicast
redistribute connected route-map HOST_ROUTES
exit-address-family
!
address-family ipv6 unicast
redistribute connected route-map HOST_ROUTES
neighbor fabric activate
exit-address-family
exit
!
route-map HOST_ROUTES permit 10
match interface eth3
exit
!
ip nht resolve-via-default
!
end
The switch is using BGP to establish equal cost multi-path (ECMP) routes across the fabric. This is very similar to a configuration you would expect to find in a production network. Containerlab provides a great environment to experiment with topologies and configuration options before putting them into production.
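To see the ECMP routes directly, ask FRRouting for the route to the h2 subnet (172.16.2.0/24, based on the h2 address used in the traffic test above). With the configuration shown, the route should list multiple next hops, one per spine uplink:
docker exec clab-rocev2-leaf1 vtysh -c "show ip route 172.16.2.0/24"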
The results from this containerlab project are very similar to those observed in production networks; see Comparing AI / ML activity from two production networks. Follow the instructions in AI Metrics with Prometheus and Grafana to deploy this solution to monitor production GPU cluster traffic.
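As a starting point, a minimal Prometheus scrape configuration for sFlow-RT might look like the sketch below. The metrics_path shown is an illustrative assumption; use the exact path given in the AI Metrics with Prometheus and Grafana article:
cat > prometheus.yml << 'EOF'
scrape_configs:
  - job_name: 'sflow-rt-ai-metrics'
    # metrics_path is an assumed example; see the linked article for the correct path
    metrics_path: /app/ai-metrics/scripts/metrics.js/prometheus/txt
    static_configs:
      - targets: ['localhost:8008']
EOF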