Monday, November 25, 2024

Topology aware flow analytics with NVIDIA NetQ

NVIDIA Cumulus Linux 5.11 for AI / ML describes how NVIDIA 400/800G Spectrum-X switches combined with the latest Cumulus Linux release deliver enhanced real-time telemetry that is particularly relevant to the AI / machine learning workloads that Spectrum-X switches are designed to handle.

This article shows how to extract Topology from an NVIDIA fabric in order to perform advanced fabric aware analytics, for example: detect flow collisions, trace flow paths, and de-duplicate traffic.

In this example, we will use NVIDIA NetQ, a highly scalable, modern network operations toolset that provides visibility, troubleshooting, and validation of your Cumulus and SONiC fabrics in real time.

netq show lldp json
The NetQ Link Layer Discovery Protocol (LLDP) service simplifies the task of gathering neighbor data from switches in the network and, with the json option, makes the output easy to process with a Python script such as lldp-rt.py.

The simplest way to try sFlow-RT is to use the pre-built sflow/topology Docker image that packages sFlow-RT with additional applications that are useful for monitoring network topologies.

docker run -p 6343:6343/udp -p 8008:8008 sflow/topology
Configure Cumulus Linux to stream sFlow telemetry to sFlow-RT on UDP port 6343 (the default for sFlow).
netq show lldp json | ./lldp-rt.py http://sflow-rt:8008/topology/json
The above command puts it all together, taking LLDP data from NetQ, converting it to sFlow-RT format, and posting the fabric topology to the sFlow-RT REST API.
Access the sFlow-RT web interface on port 8008. The Topology application includes a dashboard to verify that all the nodes and links in the topology are fully covered by the sFlow telemetry stream.
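
The conversion step performed by a script like lldp-rt.py can be sketched as follows. This is a minimal illustration, not the actual script: the input field names (hostname, ifname, peer_hostname, peer_ifname) are assumptions about the shape of the NetQ JSON records, and the sample data is made up. The output follows the sFlow-RT topology layout of named links, each with node1, port1, node2, and port2.

```python
import json

def lldp_to_topology(neighbors):
    """Convert LLDP neighbor records into an sFlow-RT style topology dict.

    The input field names are assumptions made for this sketch.
    """
    links = {}
    seen = set()
    for entry in neighbors:
        a = (entry["hostname"], entry["ifname"])
        b = (entry["peer_hostname"], entry["peer_ifname"])
        # LLDP reports each link from both ends; keep one copy per link.
        key = tuple(sorted([a, b]))
        if key in seen:
            continue
        seen.add(key)
        (node1, port1), (node2, port2) = key
        links["L%d" % len(seen)] = {
            "node1": node1, "port1": port1,
            "node2": node2, "port2": port2,
        }
    return {"links": links}

# Hypothetical neighbor records: the first two describe the same link.
sample = [
    {"hostname": "leaf1", "ifname": "swp1",
     "peer_hostname": "spine1", "peer_ifname": "swp1"},
    {"hostname": "spine1", "ifname": "swp1",
     "peer_hostname": "leaf1", "peer_ifname": "swp1"},
    {"hostname": "leaf1", "ifname": "swp2",
     "peer_hostname": "spine2", "peer_ifname": "swp1"},
]

topology = lldp_to_topology(sample)
print(json.dumps(topology, indent=2))
```

The resulting JSON document is what would be sent to the /topology/json REST endpoint shown in the command above.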

Getting Started is a step by step guide to sFlow-RT applications, APIs, and community support.

Thursday, November 21, 2024

SC24 Over 10 Terabits per Second of WAN Traffic

The SC24 WAN Stress Test chart shows 10.3 Terabits per second of WAN traffic to The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC24) conference held this week in Atlanta. The conference network used in the demonstration, SCinet, is described as the most powerful and advanced network on Earth, connecting the SC community to the world.

SC24 Real-time RoCEv2 traffic visibility describes a demonstration of wide area network bulk data transmission using RDMA over Converged Ethernet (RoCEv2) flows typically seen in AI/ML data centers. In the example, sustained 3.2Tbits/second transmissions from sources geographically distributed around the United States were demonstrated.

SC24 Dropped packet visibility demonstration shows how the sFlow data model integrates three telemetry streams: counters, packet samples, and packet drop notifications. Each type of data is useful on its own, but together they provide the comprehensive network wide observability needed to drive automation. Real-time network visibility is particularly relevant to AI / ML data center networks where congestion and dropped packets can result in serious performance degradation and in this screen capture you can see multiple 400Gbits/s RoCEv2 flows.

SC24 SCinet traffic describes the architecture of the real-time monitoring system used to generate these charts. This chart shows that over 225 Petabytes of data were transferred during the show.

Wednesday, November 20, 2024

SC24 Real-time RoCEv2 traffic visibility

The chart shows eight 400Gbits/s RDMA over Converged Ethernet (RoCEv2) flows, typically seen in AI / ML data centers, totaling 3.2 Tbits/s. The unique challenge in this case is that flows are being routed from locations scattered around the United States to Atlanta, the location of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC24) conference.
SC24 Network Research Exhibit: The Resilient, Performant Networks and Distributed Processing demonstration aims to explore performance limitations and enablers for high volume bulk data transfers. Maintaining stable 400Gbits/s RoCEv2 connections over a wide area network is challenging since the packets have to traverse multiple links, avoid contention on links, and deal with buffering associated with transmission latency that is orders of magnitude higher than in the data center environments where RoCEv2 is typically deployed. One way latency across the USA is a minimum of 16 milliseconds due to the speed of light, and in practice is quite a bit larger, whereas latency across a leaf and spine data center fabric is measured in microseconds.
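
To put the latency difference in perspective, the following back-of-envelope sketch computes how much data a single 400Gbits/s flow has in flight for a given one way latency. The 16 millisecond figure comes from the text above. The 10 microsecond fabric latency is an assumed value chosen only for illustration.

```python
def bits_in_flight(rate_bits_per_s, one_way_latency_us):
    """Bits on the wire at any instant for a flow running at the given rate."""
    return rate_bits_per_s * one_way_latency_us // 1_000_000

RATE = 400 * 10**9  # one 400Gbits/s RoCEv2 flow

wan_bits = bits_in_flight(RATE, 16_000)   # 16 ms minimum across the USA
fabric_bits = bits_in_flight(RATE, 10)    # assumed 10 us data center fabric

print("WAN:    %d MB in flight" % (wan_bits // 8 // 10**6))
print("Fabric: %d kB in flight" % (fabric_bits // 8 // 10**3))
print("Ratio:  %dx" % (wan_bits // fabric_bits))
```

Under these assumptions each flow has hundreds of megabytes in flight on the wide area path, compared with hundreds of kilobytes inside a data center fabric.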
During setup it was noticed that total throughput with 8 concurrent flows was only 2.7Tbits/s (instead of the 3Tbits/second plus expected). Examining a real-time view of the throughput revealed that the two smallest flows, pink and light green at the top of the chart, were likely sharing a 400Gbits path since each flow was only transferring 200Gbps. The next flow down, light blue, appeared to be unstable and wasn't maintaining a constant 400Gbps.
Drilling down to look at the unstable flow showed that it was oscillating between 280Gbits/s and 400Gbits/s with a period of around 15 seconds. Further investigation revealed that the cause of the instability was a collision with a smaller flow on one of the links traversed by this flow. Once the flow collisions were resolved, all flows achieved close to 400Gbits/s, allowing the full 3Tbits/s transfer rate shown at the top of this article.
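
The observed 2.7Tbits/s total is consistent with the per-flow rates described above, as the following rough tally suggests. It assumes that the five unaffected flows each ran at the full 400Gbits/s and that the unstable flow averaged the midpoint of its 280 to 400Gbits/s range. Neither assumption is stated in the measurements.

```python
# Per-flow rates in Gbits/s during setup (partly assumed, see lead-in).
healthy = [400] * 5               # assumed: five flows at full rate
colliding = [200, 200]            # two flows sharing one 400Gbits/s path
unstable = [(280 + 400) // 2]     # assumed: average of the oscillation range

total_gbps = sum(healthy + colliding + unstable)
print("Total: %.2f Tbits/s" % (total_gbps / 1000))
```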
In this example, the sFlow-RT real-time analytics engine receives sFlow telemetry from switches, routers, and servers in the SCinet network and creates metrics to drive the real-time charts. Getting Started provides a quick introduction to deploying and using sFlow-RT for real-time network-wide flow analytics.

Real-time network visibility is particularly relevant to AI / ML data center networks where congestion and dropped packets can result in serious performance degradation of machine learning tasks. Industry standard sFlow instrumentation is supported by the high speed 400/800G switches currently being deployed in AI / ML data centers. Enabling sFlow analytics provides the visibility needed to optimize performance.

Network visibility complements existing system management tools used to provide visibility into compute nodes, extending visibility into the fabric to directly observe problems in the network that can't easily be inferred from the compute nodes, and providing a second pair of eyes with an independent view of performance.

Finally, check out the SC24 Dropped packet visibility demonstration to learn about one of the newest developments in sFlow monitoring and see a live demonstration.

Tuesday, November 19, 2024

SC24 SCinet traffic

The real-time dashboard shows total network traffic at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC24) conference being held this week in Atlanta. The dashboard shows that 31 Petabytes of data have been transferred already and the conference has just started.

The conference network used in the demonstration, SCinet, is described as the most powerful and advanced network on Earth, connecting the SC community to the world.

In this example, the sFlow-RT real-time analytics engine receives sFlow telemetry from switches, routers, and servers in the SCinet network and creates metrics to drive the real-time charts in the dashboard. Getting Started provides a quick introduction to deploying and using sFlow-RT for real-time network-wide flow analytics.

Finally, check out the SC24 Dropped packet visibility demonstration to learn about one of the newest developments in sFlow monitoring and see a live demonstration.

Monday, November 18, 2024

NVIDIA Cumulus Linux 5.11 for AI / ML


NVIDIA Cumulus Linux 5.11 includes major upgrades to the sFlow agent that fully exposes the advanced instrumentation built into NVIDIA Spectrum-X silicon. The enhanced real-time telemetry is particularly relevant to the AI / machine learning workloads that Spectrum-X is designed to handle.

With Cumulus Linux 5.11, the sFlow agent is easily configured using nvue commands, see Monitoring System Statistics and Network Traffic with sFlow:

nv set system sflow dropmon hw
nv set system sflow poll-interval 20
nv set system sflow collector 192.0.2.1
nv set system sflow state enabled
nv config apply

Note: In this case, enabling dropmon ensures that every dropped packet is captured, along with ingress port and drop reason (e.g. ttl_exceeded).

The same commands should be applied to every switch in the fabric for comprehensive visibility.
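
Since the configuration is identical on every switch, the command list can be generated once and pushed with whatever automation tool is already in use. The sketch below only builds the command text from the settings shown above. The switch names are hypothetical, and how the commands are delivered to each switch is left open.

```python
def sflow_commands(collector, poll_interval=20, dropmon=True):
    """Return the NVUE commands from the article for one switch."""
    cmds = []
    if dropmon:
        cmds.append("nv set system sflow dropmon hw")
    cmds += [
        f"nv set system sflow poll-interval {poll_interval}",
        f"nv set system sflow collector {collector}",
        "nv set system sflow state enabled",
        "nv config apply",
    ]
    return cmds

# Hypothetical switch inventory, used only to label the output.
switches = ["leaf1", "leaf2", "spine1", "spine2"]

for name in switches:
    print(f"# {name}")
    print("\n".join(sflow_commands("192.0.2.1")))
```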

RDMA over Converged Ethernet (RoCE) describes how sFlow provides detailed visibility into RoCE flows used to move data between GPUs in an AI / ML data center fabric. The chart above from the RDMA network visibility demonstration at the SC22 conference shows that sFlow monitoring easily scales to the 400/800G speeds needed for machine learning.
In this example, the sFlow-RT real-time analytics engine receives sFlow telemetry from all the switches and servers in the fabric. Deploy real-time network dashboards using Docker compose describes how to quickly set up an sFlow-RT, Prometheus, Grafana stack to capture and display metrics. Dropped packet metrics with Prometheus and Grafana describes how to add a dashboard to display packet drop notifications.

If you are standing up a new NVIDIA Spectrum-X / Cumulus Linux network, enable sFlow on all the switches and set up an instance of sFlow-RT for real-time fabric-wide visibility into traffic flows and dropped packets. Real-time network visibility is particularly relevant to AI / ML data center networks where congestion and dropped packets can result in serious performance degradation.

Sunday, November 17, 2024

SC24 Dropped packet visibility demonstration

The real-time dashboard is a joint InMon / Arista demonstration at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC24) conference being held this week in Atlanta.

The conference network used in the demonstration, SCinet, is described as the most powerful and advanced network on Earth, connecting the SC community to the world.

The sFlow Packet Drop Monitoring In High Performance Networks dashboard combines telemetry from all the Arista switches in the SCinet network to provide a real-time network-wide view of performance. Each of the three charts demonstrates a different type of measurement in the sFlow telemetry stream:

  • Counters: Total Traffic shows total traffic calculated from interface counters streamed from all interfaces. Counters provide a useful way of accurately reporting byte, frame, error and discard counters for each network interface. In this case, the chart rolls up data from all interfaces to trend total traffic on the network.
  • Samples: Top Flows shows the top 5 largest traffic flows traversing the network. The chart is based on sFlow's random packet sampling mechanism, providing a scalable method of determining the hosts and services responsible for the traffic reported by the counters. Visibility into top flows is essential if one wants to take action to manage network usage and capacity: immediately identifying DDoS attacks, elephant flows, and tracking changing service demands.
    Note: Network addresses have been masked for privacy.
  • Notifications: Dropped Packets shows each dropped packet, the device that dropped it, and the reason it was dropped. Dropped packets have a profound impact on network performance and availability. Packet discards due to congestion can significantly impact application performance. Dropped packets due to black hole routes, expired TTLs, MTU mismatches, etc can result in insidious connection failures that are time consuming and difficult to diagnose.
    Note: Network addresses have been masked for privacy.
The sFlow data model integrates the three telemetry streams: counters, packet samples, and drop notifications. Each type of data is useful on its own, but together they provide the system wide observability needed to drive automation.
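
A chart like Top Flows can be driven from the sFlow-RT REST API by defining a flow and then polling for the largest active flows. The sketch below only constructs the two requests and does not send them. The host name sflow-rt, the flow name pair, and the choice of keys are assumptions made for illustration and are not taken from the dashboard's actual configuration.

```python
import json
import urllib.request

def define_flow_request(base_url, name, keys, value):
    """Build (but do not send) a PUT request defining an sFlow-RT flow."""
    body = json.dumps({"keys": keys, "value": value}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/flow/{name}/json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

def top_flows_url(base_url, name, max_flows=5):
    """URL that returns the largest active flows across all agents."""
    return f"{base_url}/activeflows/ALL/{name}/json?maxFlows={max_flows}"

BASE = "http://sflow-rt:8008"  # assumed address of the sFlow-RT instance

req = define_flow_request(BASE, "pair", "ipsource,ipdestination", "bytes")
url = top_flows_url(BASE, "pair")

print(req.get_method(), req.full_url)
print(url)
# With a running instance, urllib.request.urlopen(req) would install the
# definition and urllib.request.urlopen(url) would fetch the top flows.
```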
Dropped packet metrics with Prometheus and Grafana describes how to incorporate real-time dropped packet metrics into operational dashboards for rapid troubleshooting of network performance problems.

If you have Arista switches in your network, try enabling sFlow to gain insight into network traffic. Dropped packet notifications with Arista Networks describes how to configure sFlow to include dropped packet notifications. Real-time network visibility is particularly relevant to AI / ML data center networks where congestion and dropped packets can result in serious performance degradation.

Tuesday, November 12, 2024

Worldwide deployment of real-time flow analytics

Industry standard sFlow telemetry is widely supported by network equipment vendors and network management platforms. However, the advent of real-time sFlow analytics has opened up a range of new applications for sFlow. The map above shows the proportion of sFlow-RT instances running in each of the over 70 countries in which it is deployed.

The following use cases are driving current deployments:

Addressing the challenge of operating AI / ML clusters is the emerging application for sFlow visibility. High speed (400/800G) data center switches needed to handle machine learning traffic flows include sFlow agents and real-time analytics are essential to optimize the network so that expensive GPU and compute resources are fully utilized, see Leveraging open technologies to monitor packet drops in AI cluster fabrics.

If you would like to see how real-time network analytics can transform network operations, Getting Started describes how to download and configure sFlow-RT analytics software for use in your network, or how to try it out using an emulator, or pre-captured data.

Tuesday, October 22, 2024

Leveraging open technologies to monitor packet drops in AI cluster fabrics

In this talk from the recent OCP Global Summit, Aldrin Isaac, eBay, describes the challenge, AI clusters operate most efficiently over lossless networks for optimum job completion times which can be significantly impacted by dropped packets. Although networks can be designed to minimize packet loss by choosing the right network topology, optimizing network devices and protocols, an effective monitoring and troubleshooting network performance tool is still required. Such tool should capture packet drops, raise notifications and identify various drop reasons and pin point where the drops caused congestions. In turn, it allows the governing management application to tune configurations of relevant infrastructure components, including switches, NICs and GPU servers.

The talk shares the results and best practices of a TAM (Telemetry and Monitoring) solution being prepared for deployment at eBay. It leverages OCP’s SAI and open sFlow drop notification technologies as part of eBay’s ongoing initiatives to adopt open networking hardware and community SONiC for its data centers.

The sFlow Dropped Packet Notification Structures extension mentioned in the talk adds real-time packet drop notifications (including dropped packet header and drop reason) as part of an industry standard sFlow telemetry feed, making the data available to open source and commercial sFlow analytics tools.

For example, Dropped packet metrics with Prometheus and Grafana describes how to incorporate sFlow dropped packet notifications into operational dashboards using current implementations by Arista Networks, VyOS, FD.io / VPP, and Linux servers. Current network hardware is capable of reporting on dropped packets, so ask your network equipment vendor about their plans to support the sFlow extension so that you can benefit from this transformational capability.

Monday, October 14, 2024

OCP Global Summit 2024

AI networking is a popular topic at the upcoming OCP Global Summit in San Jose, California, with an entire morning on Wednesday October 16 devoted to the subject.
Of particular interest is the talk, Leveraging open technologies to monitor packet drops in AI cluster fabrics, by Aldrin Isaac, eBay, describing the challenge, AI clusters operate most efficiently over lossless networks for optimum job completion times which can be significantly impacted by dropped packets. Although networks can be designed to minimize packet loss by choosing the right network topology, optimizing network devices and protocols, an effective monitoring and troubleshooting network performance tool is still required. Such tool should capture packet drops, raise notifications and identify various drop reasons and pin point where the drops caused congestions. In turn, it allows the governing management application to tune configurations of relevant infrastructure components, including switches, NICs and GPU servers.

The talk will share the results and best practices of a TAM (Telemetry and Monitoring) solution being prepared for deployment at eBay. It leverages OCP’s SAI and open sFlow drop notification technologies as part of eBay’s ongoing initiatives to adopt open networking hardware and community SONiC for its data centers.

The sFlow Dropped Packet Notification Structures extension mentioned in the talk adds real-time packet drop notifications (including dropped packet header and drop reason) as part of an industry standard sFlow telemetry feed, making the data available to open source and commercial sFlow analytics tools.

For example, Dropped packet metrics with Prometheus and Grafana describes how to incorporate sFlow dropped packet notifications into operational dashboards using current implementations for Arista, VyOS, and Linux servers. The availability of drop monitoring in SONiC will extend this capability to the wide range of hardware platforms supporting the SONiC network operating system.

Monday, October 7, 2024

Vector Packet Processor (VPP)

VPP with sFlow - Part 1 and VPP with sFlow - Part 2 describe the journey to add industry standard sFlow instrumentation to the Vector Packet Processor (VPP) an Open Source Terabit Software Dataplane for software routers running on commodity x86 / ARM hardware.

The main conclusions based on testing described in the two VPP blog posts are:

  1. If sFlow is not enabled on a given interface, there is no regression on other interfaces.
  2. If sFlow is enabled, copying packets costs 11 CPU cycles on average.
  3. If sFlow takes a sample, it takes only marginally more CPU time to enqueue.
    • No sampling gets 9.88Mpps of IPv4 and 14.3Mpps of L2XC throughput,
    • 1:1000 sampling reduces to 9.77Mpps of L3 and 14.05Mpps of L2XC throughput,
    • and an overly harsh 1:100 reduces to 9.69Mpps and 13.97Mpps only.
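
The throughput figures above translate into small percentage reductions, which the following sketch computes. It reads the "L3" figure in the 1:1000 line as the IPv4 number, which is an interpretation of the list rather than something stated explicitly.

```python
# Throughput in Mpps, taken from the list above.
baseline = {"IPv4": 9.88, "L2XC": 14.3}
sampled = {
    "1:1000": {"IPv4": 9.77, "L2XC": 14.05},
    "1:100": {"IPv4": 9.69, "L2XC": 13.97},
}

def percent_drop(before, after):
    """Relative throughput reduction as a percentage of the baseline."""
    return 100.0 * (before - after) / before

for rate, results in sampled.items():
    for path, mpps in results.items():
        drop = percent_drop(baseline[path], mpps)
        print(f"{rate} {path}: {drop:.2f}% lower than no sampling")
```

All four reductions come out between roughly 1% and 2.5%, including the 1:100 case that the list describes as overly harsh.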

The VPP sFlow plugin provides a lightweight method of exporting real-time sFlow telemetry from a VPP based router. Including the plugin with VPP distributions has no impact on performance. Enabling the plugin provides real-time visibility that opens up additional use cases for VPP's programmable dataplane. For example, VPP is well suited to packet filtering use cases where the number of ACL entries would exceed the capabilities of an ASIC. Combined with real-time visibility to identify DDoS attacks, VPP provides an effective means of mitigating the attacks by scrubbing the unwanted traffic.

Tuesday, September 10, 2024

Emulating congestion with Containerlab

The Containerlab dashboard above shows variation in throughput in a leaf and spine network due to large "Elephant" flow collisions in an emulated network, see Leaf and spine traffic engineering using segment routing and SDN for a demonstration of the issue using physical switches.

This article describes the steps needed to emulate realistic network performance problems using Containerlab. First, using the FRRouting (FRR) open source router to build the topology provides a lightweight, high performance, routing implementation that can be used to efficiently emulate large numbers of routers using the native Linux dataplane for packet forwarding. Second, the containerlab tools netem set command can be used to introduce packet loss, delay, jitter, or restrict bandwidth of ports.

The netem tool makes use of the Linux tc (traffic control) module. Unfortunately, if you are using Docker desktop, the minimal virtual machine used to run containers does not include the tc module.

multipass launch docker
Instead, use Multipass as a convenient way to create and start an Ubuntu virtual machine with Docker support on your laptop. If you are already on a Linux system with Docker installed, skip forward to the git clone step.
multipass ls
List the multipass virtual machines.
Name                    State             IPv4             Image
docker                  Running           192.168.65.3     Ubuntu 22.04 LTS
                                          172.17.0.1
Make a note of the IP address(es) of the docker virtual machine.
multipass shell docker
Run a shell inside the docker virtual machine.
git clone https://github.com/sflow-rt/containerlab.git
Install sflow-rt/containerlab project.
cd containerlab
./run-clab

Run Containerlab using Docker.

In this example we will be using the 3 Stage Clos Topology shown above.
env SAMPLING=10 containerlab deploy -t clos3.yml
Start a leaf and spine topology emulation, but use a sampling rate of 1-in-10 rather than the default of 1-in-1000. See Large flow detection for a discussion of scaling sampling rates with link speed to get consistent results between the emulation and a physical network.
./bw.py clab-clos3
Rate limit the links in the topology to 10Mbps.
./topo.py clab-clos3
Post the topology to the sFlow-RT real-time analytics container. Access the Containerlab Dashboard shown at the top of this page using a web browser to connect to http://192.168.65.3:8008/ (where 192.168.65.3 is the IP address of the docker virtual machine noted earlier).
docker exec -it clab-clos3-h1 iperf3 -c 172.16.2.2 --parallel 2
Run a series of iperf3 tests to create pairs of large flows between h1 and h2. When the flows take different paths across the fabric the total available bandwidth is 20Mbps. If the flows hash onto the same path, then they share 10Mbps bandwidth and the throughput is halved.
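
The effect of a collision on throughput can be captured in a few lines. This is a simplified model, assuming that each flow is limited only by the 10Mbps link on its chosen path and that flows sharing a path split its capacity equally. The path names are hypothetical labels.

```python
from collections import Counter

LINK_MBPS = 10  # rate limit applied to the links by bw.py

def per_flow_rate(flow_paths, link_mbps=LINK_MBPS):
    """Rate of each flow when flows on the same path share it equally."""
    flows_on_path = Counter(flow_paths)
    return [link_mbps / flows_on_path[path] for path in flow_paths]

def aggregate_throughput(flow_paths, link_mbps=LINK_MBPS):
    """Total throughput: each path in use carries at most one link's worth."""
    return sum(per_flow_rate(flow_paths, link_mbps))

different_paths = ["spine1", "spine2"]  # flows hashed to different paths
same_path = ["spine1", "spine1"]        # flows collide on one path

print(aggregate_throughput(different_paths), "Mbps without collision")
print(aggregate_throughput(same_path), "Mbps with collision")
```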

This example demonstrates that Containerlab is not restricted to emulating and validating configurations, but can also be used to emulate performance issues. In this example, the effects of large flow collisions are relevant to the performance of data center fabrics handling AI/ML workloads where large flow collisions can significantly limit performance, see RoCE networks for distributed AI training at scale.

Monday, August 19, 2024

Dropped packet metrics with Prometheus and Grafana

Dropped packets due to black hole routes, buffer exhaustion, expired TTLs, MTU mismatches, etc. can result in insidious connection failures that are time consuming and difficult to diagnose. Dropped packet notifications with Arista Networks, VyOS dropped packet notifications and Using sFlow to monitor dropped packets describe implementations of the sFlow Dropped Packet Notification Structures extension for Arista Networks switches, VyOS routers, and Linux servers respectively, providing end to end visibility into packet drop events (including switch port, drop reason and packet header for each dropped packet).

Flow metrics with Prometheus and Grafana describes how to define flow metrics and create dashboards to trend the flow metrics over time. This article describes how the same setup can be used to define and trend metrics based on dropped packet notifications.

  - job_name: sflow-rt-drops
    metrics_path: /app/prometheus/scripts/export.js/flows/ALL/txt
    static_configs:
      - targets: ['sflow-rt:8008']
    params:
      metric: ['dropped_packets']
      key:
        - 'node:inputifindex'
        - 'ifname:inputifindex'
        - 'reason'
        - 'stack'
        - 'macsource'
        - 'macdestination'
        - 'null:vlan:untagged'
        - 'null:[or:ipsource:ip6source]:none'
        - 'null:[or:ipdestination:ip6destination]:none'
        - 'null:[or:icmptype:icmp6type:ipprotocol:ip6nexthdr]:none'
      label:
        - 'switch'
        - 'port'
        - 'reason'
        - 'stack'
        - 'macsource'
        - 'macdestination'
        - 'vlan'
        - 'src'
        - 'dst'
        - 'protocol'
      value: ['frames']
      dropped: ['true']
      maxFlows: ['20']
      minValue: ['0.001']

The Prometheus scrape configuration above is used to keep track of drop notifications. The highlighted dropped setting is used to select drop notifications for the metric (the default dropped:['false'] creates flow metrics based on packet samples and is used to trend normal traffic).
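
Prometheus turns the target, metrics_path, and params settings into a single HTTP request, so it can be useful to build the same URL by hand when debugging a scrape job. The sketch below does this for a shortened subset of the keys and labels in the configuration above. The http scheme is the Prometheus default and is assumed here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Subset of the params from the scrape configuration, shortened for clarity.
params = {
    "metric": ["dropped_packets"],
    "key": ["node:inputifindex", "ifname:inputifindex", "reason"],
    "label": ["switch", "port", "reason"],
    "value": ["frames"],
    "dropped": ["true"],
    "maxFlows": ["20"],
    "minValue": ["0.001"],
}

def scrape_url(target, metrics_path, params, scheme="http"):
    """Build the URL a scrape of this job would request."""
    # doseq=True repeats the parameter name once per list element.
    return f"{scheme}://{target}{metrics_path}?{urlencode(params, doseq=True)}"

url = scrape_url(
    "sflow-rt:8008",
    "/app/prometheus/scripts/export.js/flows/ALL/txt",
    params,
)
print(url)
```

Requesting the printed URL with a browser or curl against a running sFlow-RT instance should return the same text that Prometheus scrapes.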

Deploy real-time network dashboards using Docker compose is the simplest way to deploy an sFlow-RT, Prometheus, and Grafana stack with some basic dashboards. Install sFlow-RT Dropped Packets dashboard, code 21721, in Grafana to see the dashboard shown at the top of this page, displaying Drop Locations, Drop Reasons and Dropped Packet Details.

Thursday, July 25, 2024

Dropped packet notifications with Arista Networks

Visibility into dropped packets is essential for Artificial Intelligence/Machine Learning (AI/ML) workloads, where a single dropped packet can stall large scale computational tasks, idling millions of dollars worth of GPU/CPU resources, and delaying the completion of business critical workloads. Enabling real-time sFlow telemetry provides the observability into traffic flows and packet drops needed to effectively manage these networks.

The availability of the Arista EOS 4.31.4M maintenance release brings sFlow dropped packet monitoring (previously demonstrated using the 4.30.1F feature release, see SC23 Dropped packet visibility demonstration) to production networks, see EOS Life Cycle Policy.
sflow sampling 50000
sflow polling-interval 20
sflow vrf mgmt destination 203.0.113.100
sflow vrf mgmt source-interface Management0
sflow run
The above Arista EOS commands enable sFlow counter polling and packet sampling on all ports, sending the sFlow telemetry to the sFlow analyzer at 203.0.113.100.
flow tracking mirror-on-drop
  sample limit 100 pps
  !
  tracker SFLOW
    exporter SFLOW
      format sflow
      collector sflow
      local interface Management0
  no shutdown
The above commands add sFlow Dropped Packet Notification Structures to the sFlow telemetry feed using Broadcom Mirror on Drop (MoD) instrumentation. Broadcom implements mirror-on-drop in Jericho 2, Trident 3, and Tomahawk 3, or later ASICs.
In this example, the sFlow-RT real-time analytics engine receives sFlow telemetry from switches/routers and creates metrics to drive the real-time Grafana dashboard shown at the top of the article. Deploy real-time network dashboards using Docker compose describes how to quickly deploy a monitoring stack consisting of sFlow-RT, a Prometheus time series database, and Grafana dashboards.

Thursday, February 22, 2024

VyOS 1.4 LTS released

Protectli Vault - 4 Port

The VyOS 1.4.0 (Sagitta) LTS release announcement is exciting news! VyOS is an open source router operating system based on Linux that can be installed on commodity PC hardware - for optimal performance at least 1GB RAM and 4GB of storage space is recommended.

The new 1.4 LTS release includes a significantly enhanced implementation of industry standard sFlow telemetry based on the open source Host sFlow agent.

set system sflow interface eth0
set system sflow interface eth1
set system sflow interface eth2
set system sflow interface eth3
set system sflow polling 30
set system sflow sampling-rate 1000
set system sflow drop-monitor-limit 50
set system sflow server 192.0.2.100
Enter the commands above to enable sFlow monitoring on interfaces eth0, eth1, eth2, and eth3. Interface counters will be exported every 30 seconds, packets will be sampled with probability 1/1000, and up to 50 packet headers (and drop reasons) per second will be collected from packets dropped by the router. The sFlow telemetry stream will be sent to an sFlow collector at 192.0.2.100.
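
With a sampling probability of 1/1000, an sFlow analyzer estimates totals by scaling what it sees in the samples by the sampling rate. The sketch below shows that scaling. The sample counts and the 10 second measurement window are hypothetical numbers chosen for illustration.

```python
SAMPLING_RATE = 1000  # matches "set system sflow sampling-rate 1000"

def estimate_packets(samples, sampling_rate=SAMPLING_RATE):
    """Estimated packets on the wire given the number of samples received."""
    return samples * sampling_rate

def estimate_bits_per_second(sampled_bytes, window_s, sampling_rate=SAMPLING_RATE):
    """Estimated traffic rate from the total size of the sampled packets."""
    return sampled_bytes * 8 * sampling_rate / window_s

# Hypothetical measurement: 120 samples totalling 150,000 bytes in 10 seconds.
packets = estimate_packets(120)
rate = estimate_bits_per_second(150_000, 10)

print(f"about {packets} packets, {rate / 1e6:.0f} Mbits/s")
```

The estimate becomes more accurate as more samples are collected, which is why lower speed links are typically configured with more aggressive sampling.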

Running Docker on the sFlow collector makes it easy to run a variety of sFlow analytics tools.

docker run --rm -p 6343:6343/udp sflow/sflowtool
Run the sflow/sflowtool image to decode and print the contents of the sFlow telemetry stream and verify receipt of data.
docker run --rm -p 6343:6343/udp sflow/tcpdump tcp port 80
Run the sflow/tcpdump image to decode and filter sampled packet headers. For more complex packet analysis tasks, try the sflow/tshark image.
Run the sflow/sflowtrend image to trend interface counters and top flows.
Deploy real-time network dashboards using Docker compose describes how to configure Prometheus and Grafana to capture time series data and create custom dashboards.
Dropped packet reason codes in VyOS describes how the new Linux kernel in VyOS 1.4 provides detailed visibility into every dropped packet (including the reason it was dropped). This capability is used by the new sFlow agent to implement the sFlow Dropped Packet Notification Structures extension to provide network-wide visibility into dropped packets.

Download VyOS today to try out the new features. Pre-built LTS images are available with paid support, but anyone can build an image from sources or download the latest rolling release.

Monday, January 15, 2024

Raspberry Pi 5 network emulation with Containerlab

The GitHub sflow-rt/containerlab project contains example network topologies for the Containerlab network emulation tool that demonstrate real-time streaming telemetry in realistic data center topologies and network configurations. The examples use the same FRRouting (FRR) engine that is part of SONiC, NVIDIA Cumulus Linux, and DENT network operating systems. Containerlab can be used to experiment before deploying solutions into production. Examples include: tracing ECMP flows in leaf and spine topologies, EVPN visibility, and automated DDoS mitigation using BGP Flowspec and RTBH controls.
Raspberry Pi 5 real-time network analytics describes how to install Docker on a Raspberry Pi 5.
docker run hello-world
Run the hello-world container to verify that Docker is properly installed and running before proceeding.
git clone https://github.com/sflow-rt/containerlab.git
Download the sflow-rt/containerlab project from GitHub.
cd containerlab
./run-clab
Start Containerlab.
containerlab deploy -t clos5.yml
Start the 5 stage leaf and spine topology shown at the top of this page. The initial launch may take a couple of minutes as the container images are downloaded for the first time. Once the images are downloaded, the topology deploys in around 10 seconds.
./topo.py clab-clos5
Push the topology to the sFlow-RT analytics software.
An instance of the sFlow-RT real-time analytics engine receives industry standard sFlow telemetry from all the switches in the network. All of the switches in the topology are configured to send sFlow to the sFlow-RT instance. In this case, Containerlab is running the pre-built sflow/clab-sflow-rt image which packages sFlow-RT with useful applications for exploring the data.
Connect to the web interface on port 8008. The sFlow-RT dashboard verifies that telemetry is being received from 10 agents (the 10 switches in the Clos fabric). See the sFlow-RT Quickstart guide for more information.
The Containerlab Dashboard (click on sFlow-RT Apps tab and containerlab-dashboard button) shows a real-time dashboard displaying up to the second traffic.
docker exec -it clab-clos5-h1 iperf3 -c 172.16.4.2
Each of the hosts in the network has an iperf3 server, so running the above command will test bandwidth between h1 and h4.
docker exec -it clab-clos5-h1 iperf3 -c 2001:172:16:4::2
Generate a large IPv6 flow between h1 and h4. The traffic flows should immediately appear in the Top Flows chart. You can check the accuracy by comparing the values reported by iperf3 with those shown in the chart.
Click on the Topology tab to see a real-time weathermap of traffic flowing over the topology. See how repeated iperf3 tests take different ECMP (equal-cost multi-path) routes across the network.
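Repeated tests can take different routes because ECMP typically selects a next hop by hashing packet header fields, and each iperf3 run uses a new source port. The Python sketch below illustrates the idea with a toy hash. It is not the hash function used by the Linux kernel or by any switch ASIC, and the h1 address, source ports, and spine names are hypothetical.

```python
import hashlib

def pick_next_hop(flow, next_hops):
    """Choose a next hop by hashing the flow's 5-tuple (toy illustration)."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return next_hops[int.from_bytes(digest[:4], "big") % len(next_hops)]

next_hops = ["spine1", "spine2"]  # hypothetical equal-cost next hops

# Same addresses and destination port (5201, the iperf3 default),
# but a different source port for each test
paths = []
for source_port in range(50000, 50008):
    flow = ("172.16.1.2", "172.16.4.2", "tcp", source_port, 5201)
    paths.append(pick_next_hop(flow, next_hops))
    print(source_port, "->", paths[-1])
```

Every packet in a flow hashes to the same next hop, so an individual flow stays on one path while different flows spread across the available paths.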
docker exec -it clab-clos5-leaf1 vtysh
Linux with open source routing software (FRRouting) is an accessible alternative to vendor routing stacks (no registration / license required, no restriction on copying means you can share images on Docker Hub, no need for virtual machines). FRRouting is popular in production network operating systems (e.g. Cumulus Linux, SONiC, DENT, etc.) and the VTY shell provides an industry standard CLI for configuration, so labs built around FRR allow realistic network configurations to be explored.
docker exec -it clab-clos5-leaf1 vtysh -c "show running-config"
Use vtysh to show the running configuration on leaf1.
containerlab destroy -t clos5.yml
When you are finished, run the above command to stop the containers and free the resources associated with the emulation. Try out other topologies from the project to explore topics such as DDoS mitigation, BGP Flowspec, and EVPN.

Note: If you are building your own topologies, the Raspberry Pi 5 8G can comfortably handle topologies with up to 50 FRR/Alpine Linux nodes.

Getting Started provides an introduction to sFlow-RT analytics and APIs. Containerlab provides a useful environment for developing and testing monitoring applications for sFlow-RT before moving them into production.

Moving monitoring solutions from Containerlab to production is straightforward since sFlow is widely implemented in datacenter equipment from vendors including: A10, Arista, Aruba, Cisco, Edge-Core, Extreme, Huawei, Juniper, NEC, Netgear, Nokia, NVIDIA, Quanta, and ZTE. In addition, the open source Host sFlow agent makes it easy to extend visibility beyond the physical network into the compute infrastructure.

Raspberry Pi 5 real-time network analytics describes how to deploy an sFlow-RT, Prometheus, and Grafana monitoring stack to monitor live network traffic.

Tuesday, January 9, 2024

Raspberry Pi 5 real-time network analytics

CanaKit Raspberry Pi 5 Starter Kit - Aluminum
This article describes how to build an inexpensive Raspberry Pi 5 based server for real-time flow analytics using industry standard sFlow streaming telemetry. Support for sFlow is widely implemented in datacenter equipment from vendors including: A10, Arista, Aruba, Cisco, Edge-Core, Extreme, Huawei, Juniper, NEC, Netgear, Nokia, NVIDIA, Quanta, and ZTE.
In this example, we will use an 8G Raspberry Pi 5 running Raspberry Pi OS Lite (64-bit).  The easiest way to format a memory card and install the operating system is to use the Raspberry Pi Imager (shown above).
Click on the EDIT SETTINGS button to customize the installation.
Set a hostname, username, and password.
Click on the SERVICES tab and select Enable SSH. Click SAVE to save the settings and then YES to apply the settings and create a bootable micro SD card. These initial settings allow the Raspberry Pi to be accessed over the network without having to attach a screen, keyboard, and mouse.
ssh pp@192.168.4.170
Use ssh to log into the Raspberry Pi (having installed the micro SD card).
sudo apt-get update && sudo apt-get -y upgrade
Update the packages and OS to the latest versions.
curl -sSL https://get.docker.com | sh
Install Docker.
sudo usermod -aG docker $USER
Grant permission to run Docker without the sudo command. Exit the ssh session and log in again to pick up the new settings.
docker run hello-world
Run the hello-world container to verify that Docker is properly installed and running.
git clone https://github.com/sflow-rt/prometheus-grafana.git
cd prometheus-grafana
./start.sh
Start sFlow-RT, Prometheus, and Grafana using Docker compose.
Configure sFlow Agents embedded in switches, routers and servers to stream sFlow telemetry to the Raspberry Pi. The sFlow-RT Getting Started guide shows how to verify that sFlow is being received and includes tools for flow and counter based analytics.
For example, the Flow Browser application lets you list attributes of network traffic that you are interested in and trend top flows with the attributes in real-time (up to the second). Defining Flows describes the flow analytics capability of sFlow-RT that can be explored.
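Flows can also be defined programmatically. The Python sketch below builds the kind of REST request described in Defining Flows, naming a flow and listing the key attributes and the value to track. The flow name `pair`, the helper names, and the localhost address are assumptions made for illustration. The request is constructed but not sent, so the example runs without a live sFlow-RT instance.

```python
import json
from urllib.request import Request, urlopen

def flow_definition_request(name, keys, value="bytes",
                            base_url="http://localhost:8008"):
    """Build a PUT request that defines a flow from a list of key attributes."""
    body = json.dumps({"keys": ",".join(keys), "value": value}).encode()
    return Request(f"{base_url}/flow/{name}/json", data=body, method="PUT",
                   headers={"Content-Type": "application/json"})

def send(request):
    """Send the request to a running sFlow-RT instance (not called here)."""
    with urlopen(request, timeout=5) as response:
        return response.status

# Track bytes exchanged between pairs of IP addresses
request = flow_definition_request("pair", ["ipsource", "ipdestination"])
print(request.get_method(), request.full_url)
print(request.data.decode())
```

Calling `send(request)` against a running instance installs the definition, after which the largest flows can be queried from the REST API.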
Deploy real-time network dashboards using Docker compose describes how to configure Prometheus and Grafana to capture time series data and create custom dashboards.
The Raspberry Pi 5 is surprisingly capable: this pocket-sized server can easily monitor thousands of high speed (100G+) links, providing up to the second visibility into network flows. In this example, sFlow telemetry from 100 switches, each with 48 active 100G ports, was easily handled by the Raspberry Pi 5. Performance of the Prometheus database is likely to be the limiting factor given the relatively slow disk performance of the micro SD card, but could be improved by adding an M.2 PCIe disk.
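To put the scale of this example in perspective, the short Python calculation below works out the number of monitored ports and their aggregate capacity. The counter polling interval is an assumption chosen for illustration and is not a figure from the test.

```python
switches = 100
ports_per_switch = 48
port_speed_gbps = 100

# Assumption for illustration only: each port exports counters every 20 seconds
polling_interval_seconds = 20

total_ports = switches * ports_per_switch
aggregate_capacity_tbps = total_ports * port_speed_gbps / 1000
counter_records_per_second = total_ports / polling_interval_seconds

print(f"Monitored ports: {total_ports}")
print(f"Aggregate capacity: {aggregate_capacity_tbps} Tbit/s")
print(f"Counter records per second: {counter_records_per_second}")
```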