Tuesday, October 20, 2020

Docker DDoS testbed


Docker testbed describes how to use Docker Desktop to build a test network to experiment with real-time sFlow streaming telemetry and analytics. This article extends the testbed to experiment with distributed denial of service (DDoS) detection and mitigation techniques described in Real-time DDoS mitigation using BGP RTBH and FlowSpec.

Start a Host sFlow agent using the pre-built sflow/host-sflow image:
docker run --rm -d -e "COLLECTOR=host.docker.internal" -e "SAMPLING=10" \
--net=host -v /var/run/docker.sock:/var/run/docker.sock:ro \
--name=host-sflow sflow/host-sflow
Start ExaBGP using the pre-built sflow/exabgp image. ExaBGP connects to the sFlow-RT analytics software and displays BGP RTBH / Flowspec controls sent by sFlow-RT:
docker run --rm sflow/exabgp
In a second terminal window, start an instance of the sFlow-RT analytics software using the pre-built sflow/ddos-protect image:
GW=`docker network inspect bridge -f '{{range .IPAM.Config}}{{.Gateway}}{{end}}'`

SUBNET=`docker network inspect bridge -f '{{range .IPAM.Config}}{{.Subnet}}{{end}}'`

docker run --rm -p 6343:6343/udp -p 8008:8008 -p 1179:1179 --name=sflow-rt \
sflow/ddos-protect -Dddos_protect.router=$GW -Dddos_protect.as=65001 \
-Dddos_protect.enable.flowspec=yes -Dddos_protect.group.local=$SUBNET \
-Dddos_protect.mode=automatic \
-Dddos_protect.udp_amplification.action=filter \
-Dddos_protect.udp_amplification.threshold=5000
Open the sFlow-RT dashboard at http://localhost:8008/
The sFlow Agents gauge confirms that sFlow is being received from the Host sFlow agent. Now access the DDoS Protect application at http://localhost:8008/app/ddos-protect/html/index.html
The BGP chart at the bottom right verifies that the BGP connection has been established so that controls can be sent to ExaBGP, which will display them in its terminal window.
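The same check can be made from the command line. As a quick sketch, assuming the default port mapping used above, the sFlow-RT REST API lists the agents it is currently receiving telemetry from:
curl -s http://localhost:8008/agents/json
The Host sFlow agent's address should appear in the returned JSON once telemetry is flowing.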

Finally, hping3 can be used to generate simulated DDoS attacks. Start a simulated DNS amplification attack using the pre-built sflow/hping3 image:
GW=`docker network inspect bridge -f '{{range .IPAM.Config}}{{.Gateway}}{{end}}'`

docker run --rm sflow/hping3 --flood --udp -k -a 198.51.100.1 -s 53 $GW
The attack shows up immediately in the DDoS Protect dashboard, http://localhost:8008/app/ddos-protect/html/index.html
The udp_amplification chart shows the traffic rising to cross the threshold and trigger the control shown in the Controls chart.
The ExaBGP log shows the Flowspec rule that was sent to block the attack, filtering traffic to 172.17.0.1/32 with UDP source port 53. Type CTRL+C in the hping3 window to end the attack.
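The threshold events that trigger controls can also be retrieved from the sFlow-RT REST API, for example (the maxEvents parameter limits the number of entries returned):
curl -s 'http://localhost:8008/events/json?maxEvents=10'
When you have finished experimenting, type CTRL+C in the ExaBGP and sFlow-RT terminal windows and stop the detached Host sFlow agent container:
docker stop host-sflow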

This testbed provides a convenient way to become familiar with the tools to automatically mitigate DDoS attacks. The following articles provide additional information on moving the solution into production: Real-time DDoS mitigation using BGP RTBH and FlowSpec, Pushing BGP Flowspec rules to multiple routers, and Monitoring DDoS mitigation. Real-time network and system metrics as a service provides background on the sFlow-RT analytics platform running the DDoS Protect application.

Wednesday, October 7, 2020

Broadcom Mirror on Drop (MoD)

Networking Field Day 23 included a presentation by Bhaskar Chinni describing Broadcom's Mirror-on-Drop (MOD) capability. MOD-capable hardware can generate a notification whenever a packet is dropped by the ASIC, reporting the packet header and the reason the packet was dropped. MOD is supported by Trident 3, Tomahawk 3, and Jericho 2 or later ASICs, which are used in popular switches widely deployed in data centers.

The recently published sFlow Dropped Packet Notification Structures specification adds drop notifications to industry-standard sFlow telemetry export, complementing the existing push-based counter and packet sampling measurements. The inclusion of drop monitoring in sFlow will allow the benefits of MOD to be fully realized, ensuring consistent end-to-end visibility into dropped packets across multiple vendors and network operating systems.

Using Advanced Telemetry to Correlate GPU and Network Performance Issues demonstrates how packet drop notifications from NVIDIA Mellanox switches form part of an integrated sFlow telemetry stream that provides the system-wide observability needed to drive automation.

MOD instrumentation on Broadcom-based switches provides the foundation network vendors need to integrate the functionality with sFlow agents and add dropped packet notifications to their telemetry streams.

Tuesday, October 6, 2020

Using Advanced Telemetry to Correlate GPU and Network Performance Issues


The image above was captured from the recent talk Using Advanced Telemetry to Correlate GPU and Network Performance Issues [A21870], presented at the NVIDIA GTC conference. The talk includes a demonstration of monitoring a high performance GPU compute cluster in real time. The real-time dashboard provides an up-to-the-second view of key performance metrics for the cluster.

This diagram shows the elements of the GPU compute cluster that was demonstrated. Cumulus Linux running on the switches reduces operational complexity by allowing the same Linux operating system to run on the network devices as on the compute servers. sFlow telemetry is generated by the open source Host sFlow agent that runs on the servers and the switches, using standard Linux APIs to enable instrumentation and gather measurements. On the switches, the measurements are offloaded to the ASIC to provide line-rate monitoring.

Telemetry from all the switches and servers in the cluster is streamed to an sFlow-RT analyzer, which builds a real-time view of performance that can be used to drive operational dashboards and automation.
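The dashboard shown in the talk is specific to that cluster, but the underlying data can be pulled from sFlow-RT's REST API. As a minimal sketch, assuming sFlow-RT is listening on its default port 8008 (the exact names under which the NVML GPU counters appear aren't listed in the talk, so the standard load_one host metric is used as a stand-in), the maximum one-minute load average across all agents can be retrieved with:
curl -s http://localhost:8008/metric/ALL/max:load_one/json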

The Real-time GPU and network telemetry dashboard combines measurements from all the devices to provide a view of cluster performance. Each of the three charts demonstrates a different type of measurement in the sFlow telemetry stream:
  1. GPU Utilization is based on sFlow's counter push mechanism, exporting NVIDIA Management Library (NVML) counters. This chart trends buffer, memory, and execution utilization of the GPUs in the cluster.
  2. Network Traffic is based on sFlow's random packet sampling mechanism, supported by the Linux kernel on servers, and offloaded to the Mellanox ASIC on the switches. This chart trends the top network flows crossing the network (a flow definition sketch follows this list).
  3. Network Drops is based on sFlow's recently added dropped packet notification mechanism, see Using sFlow to monitor dropped packets. This chart trends dropped packet source and destination addresses and the reason the packet was dropped.
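The Network Traffic chart depends on a flow definition programmed into sFlow-RT. The following is a minimal sketch using the REST API, not the exact definition used in the talk; the flow name pair and the choice of keys are illustrative:
curl -s -X PUT -H "Content-Type:application/json" \
-d '{"keys":"ipsource,ipdestination","value":"bytes"}' \
http://localhost:8008/flow/pair/json

curl -s 'http://localhost:8008/activeflows/ALL/pair/json?maxFlows=5'
The first command asks sFlow-RT to decode packet samples and track byte rates by source / destination address pair; the second retrieves the current top flows.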
The cluster is running a video transcoding workload in which video is streamed across the network to a GPU where it is transcoded, and the result is returned. A normal transcoding task is shown on the left, where the charts show an increase in GPU and network activity and zero dropped packets. A failed transcoding task is shown in the middle: here the GPU activity is low, there is no network activity, and there is a sequence of packets dropped by an access control list (ACL). Removing the ACL fixes the problem, which is confirmed by the new data shown on the right of the trend charts.

The sFlow data model integrates the three telemetry streams: counters, packet samples, and drop notifications. Each type of data is useful on its own, but together they provide the system-wide observability needed to drive automation.