Thursday, January 13, 2022

Cisco ASR 9000 Series Routers

Cisco already supports industry standard sFlow telemetry across a range of products, and the recent IOS-XR Release 7.5.1 extends support to Cisco ASR 9000 Series Routers.

Note: The ASR 9000 series routers also support Cisco NetFlow. The article Rapidly detecting large flows, sFlow vs. NetFlow/IPFIX describes why you should choose sFlow if you are interested in real-time monitoring and control applications.

The following commands configure an ASR 9000 series router to sample packets at 1-in-20,000 and stream telemetry to an sFlow analyzer (192.127.0.1) on UDP port 6343.

flow exporter-map SF-EXP-MAP-1
 version sflow v5
 !
 packet-length 1468
 transport udp 6343
 source GigabitEthernet0/0/0/1
 destination 192.127.0.1
 dfbit set
!

Configure the sFlow analyzer address in an exporter-map.

flow monitor-map SF-MON-MAP
 record sflow
 sflow options
  extended-router
  extended-gateway
  if-counters polling-interval 300
  input ifindex physical
  output ifindex physical
 !
 exporter SF-EXP-MAP-1
!

Configure sFlow options in a monitor-map.

sampler-map SF-SAMP-MAP
 random 1 out-of 20000
!

Define the sampling rate in a sampler-map.

interface GigabitEthernet0/0/0/3
 flow datalinkframesection monitor-map SF-MON-MAP sampler SF-SAMP-MAP ingress

Enable sFlow on each interface for complete visibility into network traffic.
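
The configuration can be verified using IOS-XR's flow monitoring show commands (a quick check; output varies by release, so consult the IOS-XR documentation for your version):

show flow exporter-map SF-EXP-MAP-1
show flow monitor-map SF-MON-MAP
show sampler-map SF-SAMP-MAP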

The diagram shows the general architecture of an sFlow monitoring deployment. All the switches stream sFlow telemetry to a central sFlow analyzer for network-wide visibility. Host sFlow agents installed on servers can extend visibility into the compute infrastructure, and provide network visibility from virtual machines in the public cloud. In this instance, the sFlow-RT real-time analyzer provides an up-to-the-second view of performance that is used to drive operational dashboards and network automation. The recommended sFlow configuration settings are optimized for real-time monitoring of the large scale networks targeted by Cisco ASR 9000 series routers.

docker run -p 8008:8008 -p 6343:6343/udp sflow/prometheus

Getting started with sFlow-RT is very simple; for example, the above command uses the pre-built sflow/prometheus Docker image to start analyzing sFlow. Real-time DDoS mitigation using BGP RTBH and FlowSpec, Monitoring leaf and spine fabric performance, and Flow metrics with Prometheus and Grafana describe additional use cases for real-time sFlow analytics.
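
Once the analyzer is running, a quick way to confirm that telemetry is arriving is to query the REST API for the list of sFlow agents that have been heard from (a minimal check, assuming sFlow-RT is listening on localhost):

curl http://localhost:8008/agents/json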

Note: There is a wide range of options for sFlow analysis. See sFlow Collectors for a list of open source and commercial software.

Cisco first introduced sFlow support in the Nexus 3000 Series in 2012. Today, there is a range of Cisco products that include sFlow support. The broad support for sFlow by Cisco and other leading vendors (e.g. A10, Arista, Aruba, Edge-Core, Extreme, Huawei, Juniper, NEC, Netgear, Nokia, NVIDIA, Quanta, and ZTE) makes sFlow an attractive option for multi-vendor network performance monitoring, particularly for those interested in real-time monitoring and automation.

Monday, December 6, 2021

Real-time Kubernetes cluster monitoring example

The Sunburst GPU chart updates every second to show a real-time view of the share of GPU resources being consumed by namespaces operating on the Nautilus hyperconverged Kubernetes cluster. The Nautilus cluster tightly couples distributed storage, GPU, and CPU resources shared among the participating research organizations.

The Sunburst Process chart provides an up-to-the-second view of the cluster-wide share of CPU resources used by each namespace.

The Sunburst DNS chart shows a real-time view of network activity generated by each namespace. The chart is produced by looking up DNS names for network addresses observed in packet flows using the Kubernetes DNS service. The domain names contain information about the namespace, service, and node generating the packets. Most traffic is exchanged between nodes within the cluster (identified as local). The external (not local) traffic is also shown by DNS name.

The Sunburst Protocols chart shows the different network protocols being used to communicate between nodes in the cluster, including the IP-over-IP tunnel traffic used for network virtualization.

Clicking on a segment in the Sunburst Protocols chart allows the selected traffic to be examined in detail using the Flow Browser. In this example, DNS names are again used to translate raw packet flow data into inter-namespace flows. See Defining Flows for information on the flow analytics capabilities that can be explored using the browse-flows application.
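
Flows can also be defined programmatically using the sFlow-RT REST API. The following sketch defines a simple source / destination address flow and then queries the top active flows (the flow name pair is arbitrary, and localhost assumes sFlow-RT is running locally):

curl -X PUT -H "Content-Type: application/json" \
  -d '{"keys":"ipsource,ipdestination","value":"bytes"}' \
  http://localhost:8008/flow/pair/json

curl 'http://localhost:8008/activeflows/ALL/pair/json?maxFlows=10'
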
The Discard Browser provides a detailed view of any network packets dropped in the cluster. In this chart, inter-namespace dropped packets are displayed, identifying the haproxy service as the largest source of dropped packets.

The final chart shows an up-to-the-second view of the average power consumed by a GPU in the cluster (approximately 250 Watts per GPU).

The diagram shows the elements of the monitoring solution. Host sFlow agents deployed on each node in the Kubernetes cluster stream standard sFlow telemetry to an instance of the sFlow-RT real-time analytics software, which provides cluster-wide metrics through a REST API. The metrics can be viewed directly, or imported into time series databases like Prometheus and trended in dashboards using tools like Grafana.

Note: sFlow is widely supported by network switches and routers. Enable sFlow monitoring in the physical network infrastructure for end-to-end visibility.

Create the following sflow-rt.yml file to deploy the pre-built sflow/prometheus Docker image, bundling sFlow-RT with the applications used in this article:

apiVersion: v1
kind: Service
metadata:
  name: sflow-rt-sflow
spec:
  type: NodePort
  selector:
    name: sflow-rt
  ports:
    - protocol: UDP
      port: 6343
---
apiVersion: v1
kind: Service
metadata:
  name: sflow-rt-rest
spec:
  type: LoadBalancer
  selector:
    name: sflow-rt
  ports:
    - protocol: TCP
      port: 8008
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sflow-rt
spec:
  replicas: 1
  selector:
    matchLabels:
      name: sflow-rt
  template:
    metadata:
      labels:
        name: sflow-rt
    spec:
      containers:
      - name: sflow-rt
        image: sflow/prometheus:latest
        ports:
          - name: http
            protocol: TCP
            containerPort: 8008
          - name: sflow
            protocol: UDP
            containerPort: 6343

Run the following command to deploy the service:

kubectl apply -f sflow-rt.yml
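
To confirm that the deployment is running, list the pods matching the name=sflow-rt label used in the manifest above:

kubectl get pods -l name=sflow-rt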

Now create the following host-sflow.yml file to deploy the pre-built sflow/host-sflow Docker image:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: host-sflow
spec:
  selector:
    matchLabels:
      name: host-sflow
  template:
    metadata:
      labels:
        name: host-sflow
    spec:
      restartPolicy: Always
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
      - name: host-sflow
        image: sflow/host-sflow:latest
        env:
          - name: COLLECTOR
            value: "sflow-rt-sflow"
          - name: SAMPLING
            value: "10"
          - name: NET
            value: "host"
          - name: DROPMON
            value: "enable"
        volumeMounts:
          - mountPath: /var/run/docker.sock
            name: docker-sock
            readOnly: true
      volumes:
        - name: docker-sock
          hostPath:
            path: /var/run/docker.sock

Run the following command to deploy the agents:

kubectl apply -f host-sflow.yml

Telemetry should immediately start streaming as a Host sFlow agent is started on each node in the cluster.
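
You can verify that the DaemonSet has scheduled an agent pod on every node (names taken from the host-sflow.yml manifest above):

kubectl get daemonset host-sflow
kubectl get pods -l name=host-sflow -o wide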

Note: Exporting GPU performance metrics from the NVIDIA GPUs in the Nautilus cluster requires a special version of the Host sFlow agent built using the NVIDIA supplied Docker image that includes GPU drivers, see https://gitlab.nrp-nautilus.io/prp/sflow/

Access the sFlow-RT web user interface to confirm that telemetry is being received.
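
The web interface is published by the sflow-rt-rest LoadBalancer service created earlier; look up its external address and connect to port 8008 in a web browser:

kubectl get service sflow-rt-rest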

The sFlow-RT Status page confirms that telemetry is being received from all 180 nodes in the cluster.

Note: If you don't currently have access to a production Kubernetes cluster, you can experiment with this solution using Docker Desktop, see Kubernetes testbed.

The charts shown in this article are accessed via the sFlow-RT Apps tab.

The sFlow-RT applications are designed to explore the available metrics, but don't provide persistent storage. Prometheus export functionality allows metrics to be recorded in a time series database to drive operational dashboards, see Flow metrics with Prometheus and Grafana.
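
For example, the metrics exposed for Prometheus can be previewed in Prometheus exposition format (a quick sketch; the /prometheus/metrics/ALL/ALL/txt endpoint is provided by the sflow/prometheus image used in this article, substitute the address of your sFlow-RT instance for localhost):

curl http://localhost:8008/prometheus/metrics/ALL/ALL/txt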

Monday, November 1, 2021

Sunburst

The recently released open source Sunburst application provides a real-time visualization of the protocols running on a network. The Sunburst application runs on the sFlow-RT real-time analytics platform, which receives standard streaming sFlow telemetry from switches and routers throughout the network to provide comprehensive visibility.

docker run -p 8008:8008 -p 6343:6343/udp sflow/prometheus

The pre-built sflow/prometheus Docker image packages sFlow-RT with the applications for exploring real-time sFlow analytics. Run the command above, configure network devices to send sFlow to the application on UDP port 6343 (the default sFlow port), and connect with a web browser to port 8008 to access the user interface.
 
The chart at the top of this article demonstrates the visibility that sFlow can provide into nested protocol stacks that result from network virtualization. For example, the most deeply nested set of protocols shown in the chart is:
  1. eth: Ethernet
  2. q: IEEE 802.1Q VLAN
  3. trill: Transparent Interconnection of Lots of Links (TRILL)
  4. eth: Ethernet
  5. q: IEEE 802.1Q VLAN
  6. ip: Internet Protocol (IP) version 4
  7. udp: User Datagram Protocol (UDP)
  8. vxlan: Virtual eXtensible Local Area Network (VXLAN)
  9. eth: Ethernet
  10. ip: Internet Protocol (IP) version 4
  11. esp: IPsec Encapsulating Security Payload (ESP)

Click on a segment in the sunburst chart to further explore the selected protocol using the Flow Browser application.

The Flow Browser allows the full set of flow attributes to be explored, see Defining Flows for details. In this example, the filter was added by clicking on a segment in the Sunburst application and additional keys were entered to show inner and outer IP addresses in the tunnel.

Thursday, October 21, 2021

InfluxDB Cloud


InfluxDB Cloud is a cloud hosted version of InfluxDB. The free tier makes it easy to try out the service and has enough capability to satisfy simple use cases. In this article we will explore how metrics based on sFlow streaming telemetry can be pushed into InfluxDB Cloud.

The diagram shows the elements of the solution. Agents in host and network devices are configured to stream sFlow telemetry to an sFlow-RT real-time analytics engine instance. The Telegraf Agent queries sFlow-RT's REST API for metrics and pushes them to InfluxDB Cloud.

docker run -p 8008:8008 -p 6343:6343/udp --name sflow-rt -d sflow/prometheus

Use Docker to run the pre-built sflow/prometheus image which packages sFlow-RT with the sflow-rt/prometheus application. Configure sFlow agents to stream data to this instance.

Create an InfluxDB Cloud account. Click the Data tab. Click on the Telegraf option and the InfluxDB Output Plugin button to get the URL to post data. Click the API Tokens option and generate a token.

[agent]
  interval = "15s"
  round_interval = true
  metric_batch_size = 5000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = "1s"
  hostname = ""
  omit_hostname = true

[[outputs.influxdb_v2]]
  urls = ["INFLUXDB_CLOUD_URL"]
  token = "INFLUXDB_CLOUD_TOKEN"
  organization = "INFLUXDB_CLOUD_USER"
  bucket = "sflow"

[[inputs.prometheus]]
  urls = ["http://host.docker.internal:8008/prometheus/metrics/ALL/ifinutilization,ifoututilization/txt"]
  metric_version = 2

Create a telegraf.conf file. Substitute INFLUXDB_CLOUD_URL, INFLUXDB_CLOUD_TOKEN, and INFLUXDB_CLOUD_USER with values retrieved from the InfluxDB Cloud account.

docker run -v $PWD/telegraf.conf:/etc/telegraf/telegraf.conf:ro \
-d --name telegraf telegraf

Use Docker to run the telegraf agent.

Data should start appearing in InfluxDB Cloud. Use the Explore tab to see what data is available and to create charts. In this case we are plotting ingress / egress utilization for each switch port in the network.
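
For example, a Flux query along the following lines retrieves the utilization metrics (a sketch assuming Telegraf's metric_version = 2 convention, which writes Prometheus metrics into a prometheus measurement with the metric name as the field):

from(bucket: "sflow")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "prometheus")
  |> filter(fn: (r) => r._field == "ifinutilization" or r._field == "ifoututilization")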

Telegraf sFlow input plugin describes why you would normally bypass Telegraf and have InfluxDB directly retrieve metrics from sFlow-RT. However, in the case of InfluxDB Cloud, Telegraf acts as a secure gateway, retrieving metrics locally using the inputs.prometheus module, and forwarding to the InfluxDB Cloud using the outputs.influxdb_v2 module. InfluxDB 2.0 released describes the settings used in the inputs.prometheus module.

Modify the urls setting in the inputs.prometheus section of the telegraf.conf file to add additional metrics and/or define flows.
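
For example, the following sketch adds a second input that asks sFlow-RT to export a country-level traffic flow metric. The export.js path and parameters mirror those used by the bundled sflow-rt/prometheus application; the sflow_country_bps metric name is illustrative, and the square brackets may need to be URL-encoded depending on the HTTP client:

[[inputs.prometheus]]
  urls = ["http://host.docker.internal:8008/app/prometheus/scripts/export.js/flows/ALL/txt?metric=sflow_country_bps&key=null:[country:ipsource:both]:unknown,null:[country:ipdestination:both]:unknown&label=src,dst&value=bytes&scale=8&aggMode=sum&minValue=1000&maxFlows=100"]
  metric_version = 2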

There are important scalability and cost advantages to placing the sFlow-RT analytics engine in front of the metrics collection service. For example, in large scale cloud environments the metrics for each member of a dynamic pool aren't necessarily worth trending, since virtual machines / containers are frequently added and removed. Instead, sFlow-RT can be instructed to track all the members of the pool, calculate summary statistics for the pool, and log the summary statistics. This pre-processing can significantly reduce storage requirements, lowering costs and increasing query performance.

Host, Docker, Swarm and Kubernetes monitoring describes how to deploy sFlow agents to monitor compute infrastructure.

The sFlow-RT Prometheus Exporter application exposes a REST API that allows metrics to be summarized, filtered, and synthesized. Exposing these capabilities through a REST API allows the Telegraf inputs.prometheus module to control the behavior of the sFlow-RT analytics pipeline and retrieve a small set of high value metrics tailored to your requirements.

Wednesday, October 20, 2021

Telegraf sFlow input plugin

The Telegraf agent is bundled with an SFlow Input Plugin for importing sFlow telemetry into the InfluxDB time series database. However, the plugin has major caveats that severely limit the value that can be derived from sFlow telemetry.

Currently only Flow Samples of Ethernet / IPv4 & IPv4 TCP & UDP headers are turned into metrics. Counters and other header samples are ignored.

Series Cardinality Warning

This plugin may produce a high number of series which, when not controlled for, will cause high load on your database.

InfluxDB 2.0 released describes how to use sFlow-RT to convert sFlow telemetry into useful InfluxDB metrics.

Using sFlow-RT overcomes the limitations of the Telegraf sFlow Input Plugin, making it possible to fully realize the value of sFlow monitoring:

  • Counters are a major component of sFlow, efficiently streaming detailed network counters that would otherwise need to be polled via SNMP. Counter telemetry is ingested by sFlow-RT and used to compute an extensive set of Metrics that can be imported into InfluxDB, as sketched in the example after this list.
  • Flow Samples are fully decoded by sFlow-RT, yielding visibility that extends beyond the basic Ethernet / IPv4 / TCP / UDP header metrics supported by the Telegraf plugin to include ARP, ICMP, IPv6, DNS, VxLAN tunnels, etc. The high cardinality of raw flow data is mitigated by sFlow-RT's programmable real-time flow analytics pipeline, exposing high value, low cardinality, flow metrics tailored to business requirements.
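
For example, the interface counter metrics computed by sFlow-RT can be queried directly from its REST API (a minimal sketch assuming sFlow-RT's default port; ifinoctets and ifinutilization are standard sFlow-RT metric names):

curl http://localhost:8008/metric/ALL/ifinoctets,ifinutilization/json
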
In addition, there are important scalability and cost advantages to placing the sFlow-RT analytics engine in front of InfluxDB. For example, in large scale cloud environments the metrics for each member of a dynamic pool aren't necessarily worth trending, since virtual machines / containers are frequently added and removed. Instead, sFlow-RT can be instructed to track all the members of the pool, calculate summary statistics for the pool, and log the summary statistics. This pre-processing can significantly reduce storage requirements, lowering costs and increasing query performance.

Tuesday, October 12, 2021

Grafana Cloud


Grafana Cloud is a cloud hosted version of Grafana, Prometheus, and Loki. The free tier makes it easy to try out the service and has enough capability to satisfy simple use cases. In this article we will explore how metrics based on sFlow streaming telemetry can be pushed into Grafana Cloud.

The diagram shows the elements of the solution. Agents in host and network devices are configured to stream sFlow telemetry to an sFlow-RT real-time analytics engine instance. The Grafana Agent queries sFlow-RT's REST API for metrics and pushes them to Grafana Cloud.

docker run -p 8008:8008 -p 6343:6343/udp --name sflow-rt -d sflow/prometheus

Use Docker to run the pre-built sflow/prometheus image which packages sFlow-RT with the sflow-rt/prometheus application. Configure sFlow agents to stream data to this instance.

Create a Grafana Cloud account. Click on the Agent button on the home page to get the configuration settings for the Grafana Agent.

Click on the Prometheus button to get the configuration to forward metrics from the Grafana Agent.

Enter a name and click on the Create API key button to generate configuration settings that include a URL, username, and password that will be used in the Grafana Agent configuration.

server:
  log_level: info
  http_listen_port: 12345
prometheus:
  wal_directory: /tmp/wal
  global:
    scrape_interval: 15s
  configs:
    - name: agent
      host_filter: false
      scrape_configs:
        - job_name: 'sflow-rt-analyzer'
          metrics_path: /prometheus/analyzer/txt
          static_configs:
            - targets: ['host.docker.internal:8008']
        - job_name: 'sflow-rt-metrics'
          metrics_path: /prometheus/metrics/ALL/ALL/txt
          static_configs:
            - targets: ['host.docker.internal:8008']
          metric_relabel_configs:
            - source_labels: ['agent', 'datasource']
              separator: ':'
              target_label: instance
        - job_name: 'sflow-rt-countries'
          metrics_path: /app/prometheus/scripts/export.js/flows/ALL/txt
          static_configs:
            - targets: ['host.docker.internal:8008']
          params:
            metric: ['sflow_country_bps']
            key: ['null:[country:ipsource:both]:unknown','null:[country:ipdestination:both]:unknown']
            label: ['src','dst']
            value: ['bytes']
            scale: ['8']
            aggMode: ['sum']
            minValue: ['1000']
            maxFlows: ['100']
        - job_name: 'sflow-rt-asns'
          metrics_path: /app/prometheus/scripts/export.js/flows/ALL/txt
          static_configs:
            - targets: ['host.docker.internal:8008']
          params:
            metric: ['sflow_asn_bps']
            key: ['null:[asn:ipsource:both]:unknown','null:[asn:ipdestination:both]:unknown']
            label: ['src','dst']
            value: ['bytes']
            scale: ['8']
            aggMode: ['sum']
            minValue: ['1000']
            maxFlows: ['100']
      remote_write:
        - url: API_URL
          basic_auth:
            username: API_USERID
            password: API_KEY

Create an agent.yaml configuration file. Substitute the API_URL, API_USERID, and API_KEY with values from the API Key settings obtained previously.

docker run -v $PWD/data:/etc/agent/data -v $PWD/agent.yaml:/etc/agent/agent.yaml \
--name grafana-agent -d grafana/agent

Use Docker to run the Grafana Agent.

Data should start appearing in Grafana Cloud. Install the sFlow-RT Health, sFlow-RT Countries and Networks, and sFlow-RT Network Interfaces dashboards to view the data. For example, the Countries and Networks dashboard above shows traffic entering and leaving your network broken out by network and country. Flow metrics with Prometheus and Grafana describes how to build Prometheus scrape_configs that will cause sFlow-RT to export custom traffic flow metrics.

There are important scalability and cost advantages to placing the sFlow-RT analytics engine in front of the metrics collection service. For example, in large scale cloud environments the metrics for each member of a dynamic pool aren't necessarily worth trending, since virtual machines / containers are frequently added and removed. Instead, sFlow-RT can be instructed to track all the members of the pool, calculate summary statistics for the pool, and log the summary statistics. This pre-processing can significantly reduce storage requirements, lowering costs and increasing query performance.

Host, Docker, Swarm and Kubernetes monitoring describes how to deploy sFlow agents to monitor compute infrastructure.

The sFlow-RT Prometheus Exporter application exposes a REST API that allows metrics to be summarized, filtered, and synthesized. Exposing these capabilities through a REST API allows Prometheus scrape_configs to control the behavior of the sFlow-RT analytics pipeline and retrieve a small set of high value metrics tailored to your requirements.

Thursday, October 7, 2021

DDoS protection quickstart guide

DDoS Protect is an open source denial of service mitigation tool that uses industry standard sFlow telemetry from routers to detect attacks and automatically deploy BGP remotely triggered blackhole (RTBH) and BGP Flowspec filters to block attacks within seconds.

This document pulls together links to a number of articles that describe how you can quickly try out DDoS Protect and get it running in your environment:

DDoS Protect is a lightweight solution that uses standard telemetry and control (sFlow and BGP) capabilities of routers to automatically block disruptive volumetric denial of service attacks. You can quickly evaluate the technology on your laptop or in a test lab. The solution leverages standard features of modern routing hardware to scale easily to large high traffic networks.
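
For a quick trial, the pre-built sflow/ddos-protect Docker image bundles sFlow-RT with the DDoS Protect application (a minimal sketch, assuming the image's default settings; port 1179 is an unprivileged port that can be used for BGP sessions with routers):

docker run -p 8008:8008 -p 6343:6343/udp -p 1179:1179 sflow/ddos-protect

Configure routers to send sFlow to UDP port 6343 and connect to port 8008 with a web browser to access the DDoS Protect user interface.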