Tuesday, May 12, 2020

Real-time network and system metrics as a service

The sFlow-RT real-time analytics engine receives industry standard sFlow telemetry as a continuous stream from network and host devices and coverts the raw data into useful measurements that can be be queried through a REST API. A single sFlow-RT instance can monitor the entire data center, providing a comprehensive view of performance, not just of the individual components, but of the data center as a whole.

This article is an interactive tutorial intended to familiarize the reader with the REST API. The examples can be run on a laptop using recorded data so that access to a live network is not required.

The data was captured from the leaf and spine test network shown above (described in Fabric View).
curl -O https://raw.githubusercontent.com/sflow-rt/fabric-view/master/demo/ecmp.pcap
First, download the captured sFlow data.

You will need to have a system with Java or Docker to run the sFlow-RT software.
curl -O https://inmon.com/products/sFlow-RT/sflow-rt.tar.gz
tar -xzf sflow-rt.tar.gz
./sflow-rt/get-app.sh sflow-rt browse-metrics
./sflow-rt/get-app.sh sflow-rt browse-flows
./sflow-rt/get-app.sh sflow-rt prometheus
./sflow-rt/start.sh -Dsflow.file=$PWD/ecmp.pcap
The above commands download and run sFlow-RT, with browse-metrics, browse-flows, and prometheus applications on a system with Java 1.8+ installed.
docker run --rm -v $PWD/ecmp.pcap:/sflow-rt/ecmp.pcap \
-p 8008:8008 --name sflow-rt sflow/prometheus -Dsflow.file=ecmp.pcap
Alternatively, the above command runs sFlow-RT and applications using Docker.
The REST API is documented using OpenAPI. Use a web browser to access the REST API explorer at http://localhost:8008/api/index.html.

Each of the examples can be run in a terminal window using curl, or you can simply click on the link to see the results in your web browser.
Measurements are represented by sFlow-RT in the form of a logical table. Each agent is a device on the network and is uniquely identified by an IP address. Each agent may have one or more datasources that represent a logical source of measurements. For example, a network switch will have a data source for each port.
curl http://localhost:8008/agents/json
List the sFlow agents.
curl http://localhost:8008/metrics/json
List the names of the metrics being exported by the agents. The available metrics depend on the types of agent streaming data to sFlow-RT. For a list of supported sFlow metrics, see Metrics.
curl http://localhost:8008/metric/ALL/max:ifinutilization,max:ifoututilization/json
Find the switch ports with the highest input and output utilization.

The metric query walks the table and returns a value that summarizes each metric in a comma separated list. The following summary statistics are supported:
  • max: Maximum value
  • min: Smallest value
  • sum: Total value
  • avg: Average value
  • var: Variance
  • sdev: Standard deviation
  • med: Median value
  • q1: First quartile
  • q2: Second quartile (same as med:)
  • q3: Third quartile
  • iqr: Inter-quartile range (i.e. q3 - q1)
The browse-metrics application makes use of the metric REST API and can be used to query and trend metrics.
Click on the link below to plot a graph of the switch port with the highest input utilization (screen capture shown above):
http://localhost:8008/app/browse-metrics/html/index.html?metric=ifinutilization
The following examples show how to retrieve metric values without summarization.
curl http://localhost:8008/table/ALL/ifinutilization,ifoututilization/json
Get a table of input and output utilization for every switch port. The table query doesn't summarize metrics. Instead, the query returns rows from the logical table that include the metrics specified in the query.
curl http://localhost:8008/dump/ALL/ALL/json
Dump all metric values for all agents. The dump query is similar to the table query, but instead of walking the table row by row, individual metrics are traversed in their internal order.
curl http://localhost:8008/prometheus/metrics/ALL/ALL/txt
Dump all metric values in Prometheus Exporter format. For example, the Grafana sFlow-RT Network Interfaces dashboard makes use of the query to populate the Prometheus time series database.

There are two types of measurement carried by sFlow: periodically exported counters and randomly sampled packets. So far the examples have been querying metrics derived from the counters.
curl http://localhost:8008/flowkeys/json
Get the list of attributes that are being extracted from sampled packet headers. The available attributes depend on the type of traffic flowing in the network. For a list of supported packet attributes, see Defining Flows.
curl -H "Content-Type:application/json" -X PUT \
--data '{"keys":"ipsource,ipdestination",value:"bytes"}' \
http://localhost:8008/flow/srcdst/json
Define a new "flow" metric called srcdst that calculates the bytes per second between each pair of communicating IP addresses on the network.
curl http://localhost:8008/metric/ALL/max:srcdst/json
Find the maximum value of the newly defined srcdst flow metric, i.e. the switch port on the network observing the highest bandwidth flow of packets.
[{
 "agent": "192.168.0.12",
 "metricName": "max:srcdst",
 "topKeys": [
  {
   "lastUpdate": 1274,
   "value": 3.392739739066506E8,
   "key": "10.4.3.2,10.4.4.2"
  },
  {
   "lastUpdate": 2352,
   "value": 2.155296894816872E8,
   "key": "10.4.1.2,10.4.2.2"
  }
 ],
 "metricN": 10,
 "lastUpdate": 1275,
 "lastUpdateMax": 2031,
 "metricValue": 3.392739739066506E8,
 "dataSource": "4",
 "lastUpdateMin": 1267
}]
In addition to providing a metric value, the result also includes topKeys, showing the top flows seen at the switch port.
Click on the link below to trend the srcdst metric (screen capture shown above):
http://localhost:8008/app/browse-metrics/html/index.html?metric=srcdst
There are additional queries specific to flow metrics.
curl http://localhost:8008/activeflows/ALL/srcdst/json
Find the largest flows gathered from all the interfaces in the network.
[
 {
  "flowN": 7,
  "agent": "192.168.0.12",
  "value": 5.537867023642346E8,
  "dataSource": "4",
  "key": "10.4.1.2,10.4.2.2"
 },
 {
  "flowN": 6,
  "agent": "192.168.0.11",
  "value": 5.1034569007443213E8,
  "dataSource": "38",
  "key": "10.4.3.2,10.4.4.2"
 },
 {
  "flowN": 6,
  "agent": "192.168.0.11",
  "value": 1469003.6788768284,
  "dataSource": "4",
  "key": "10.4.4.2,10.4.3.2"
 },
 {
  "flowN": 7,
  "agent": "192.168.0.12",
  "value": 1306006.2405022713,
  "dataSource": "37",
  "key": "10.4.2.2,10.4.1.2"
 }
]
Each flow returned identifies the number of locations it was observed and the port with the maximum value. For example, the largest flow from 10.4.1.2 to 10.4.2.2 was seen by 7 data sources and its maximum value 5.5e8 was observed by data source 4 on agent 192.168.0.12.
Click on the link below to plot a graph of the top flows using the browse-flows application (screen capture shown above):
http://localhost:8008/app/browse-flows/html/index.html?keys=ipsource,ipdestination&value=bps
Note how quickly the graph changes as it tracks new elephant flows in real time.

See RESTflow for a more detailed discussion of sFlow-RT's flow REST API.

This tutorial has just scratches the surface of the capabilities of sFlow-RT's analytics engine. The Writing Applications tutorial provides further examples and a discussion of how to build applications using Python and JavaScript, see Real-time DDoS mitigation using BGP RTBH and FlowSpecFabric View and Flow metrics with Prometheus and Grafana for examples of sFlow-RT applications.

Seeing your own data is more interesting than a canned demonstration. Network Equipment lists devices that support sFlow. Ubuntu 18.04 and CentOS 8 describe how to install the open source Host sFlow agent on popular Linux distributions, extending visibility into compute and cloud infrastructure. The Host sFlow agent is also available as a Docker image for easy deployment with container orchestration systems, see Host, Docker, Swarm and Kubernetes monitoring.

Even if you don't have access to a production environment, the Docker testbed and Kubernetes testbed examples show how to build a virtual testbed using Docker Desktop. Alternatively, Mininet flow analytics and Mininet dashboard provide starting points if you want to experiment with software defined networking (SDN).

Finally, join the sFlow-RT community to ask questions and share solutions and operational experience.

Tuesday, May 5, 2020

NVIDIA, Mellanox, and Cumulus

Recent press releases, Riding a Cloud: NVIDIA Acquires Network-Software Trailblazer Cumulus and NVIDIA Completes Acquisition of Mellanox, Creating Major Force Driving Next-Gen Data Centers, describe NVIDIA's moves to provide high speed data center networks to connect compute clusters that use of their GPUs to accelerate big data workloads, including: deep learning, climate modeling, animation, data visualization, physics, molecular dynamics etc.

Real-time visibility into compute, network, and GPU infrastructure is required manage and optimize the unified infrastructure. This article explores how the industry standard sFlow technology supported by all three vendors can deliver comprehensive visibility.

Cumulus Linux simplifies operations, providing the same operating system, Linux, that runs on the servers. Cumulus Networks and Mellanox have a long history of working with the Linux community to integrate support for switches. The latest Linux kernels now include native support for network ASICs, seamlessly integrating with standard Linux routing (FRR, Quagga, Bird, etc), configuration (Puppet, Chef, Ansible, etc) and monitoring (collectd, netstat, top, etc) tools.

Linux 4.11 kernel extends packet sampling support describes enhancements to the Linux kernel to support industry standard sFlow instrumentation in network ASICs. Cumulus Linux and Mellanox both support the new Linux APIs. Cumulus Linux uses the open source Host sFlow agent to stream telemetry gathered from the hardware, Linux operating system, and applications to a remote collector.

Ubuntu 18.04 and CentOS 8 describe how to install the Host sFlow agent on popular host Linux distributions. The Host sFlow agent is also available as a Docker image for easy deployment with container orchestration systems, see Host, Docker, Swarm and Kubernetes monitoring. Extending network visibility to the host allows network traffic to be associated with applications running on the host as well as providing details about the resources consumed by the applications and the network quality of service being delivered to the applications.

The Host sFlow agent also supports the sFlow NVML GPU Structures extension to export key metrics from NVIDIA GPUs using the NVIDIA Management Library (NVML), see GPU performance monitoring.

Enabling sFlow across the network, compute, and GPU stack provides a real-time, data center wide, view of performance. The sFlow-RT real-time analytics engine offers a convenient method of integrating sFlow analytics with popular orchestration, DevOps and SDN tools, examples include: Cumulus Networks, sFlow and data center automationFlow metrics with Prometheus and GrafanaECMP visibility with Cumulus LinuxFabric View, and Troubleshooting connectivity problems in leaf and spine fabrics.

Friday, April 24, 2020

Monitoring DDoS mitigation

Real-time DDoS mitigation using BGP RTBH and FlowSpec and Pushing BGP Flowspec rules to multiple routers describe how to deploy the ddos-protect application. This article focuses on how to monitor DDoS activity and control actions.

The diagram shows the elements of the solution. Routers stream standard sFlow telemetry to an instance of the sFlow-RT real-time analytics engine running the ddos-protect application. The instant a DDoS attack is detected, RTBH and / or Flowspec actions are pushed via BGP to the routers to mitigate the attack. Key metrics are published using the Prometheus exporter format over HTTP and events are sent using the standard syslog protocol.
The sFlow-RT DDoS Protect dashboard, shown above, makes use of the Prometheus time series database and the Grafana metrics visualization tool to track DDoS attack mitigation actions.
The sFlow-RT Countries and Networks dashboard, shown above, breaks down traffic by origin network and country to provide an indication of the source of attacks.  Flow metrics with Prometheus and Grafana describes how to build additional dashboards to provide additional insight into network traffic.
In this example, syslog events are directed to an Elasticsearch, Logstash, and Kibana (ELK) stack where they are archived, queried, and analyzed. Grafana can be used to query Elasticsearch to incorporate event data in dashboards. The Grafana dashboard example above trends DDoS events and displays key information in a table below.

The tools demonstrated in this article are not the only ones that can be used. If you already have monitoring for your infrastructure then it makes sense to leverage the existing tools rather than stand up a new monitoring system. Syslog events are a standard that are widely supported by on-site (e.g. Splunk) and cloud based (e.g. Solarwinds Loggly) SIEM tools. Similarly, the Prometheus metrics export protocol widely supported (e.g. InfluxDB).

Wednesday, April 15, 2020

Pushing BGP Flowspec rules to multiple routers

Real-time DDoS mitigation using BGP RTBH and Flowspec describes the open source DDoS Protect application. The software runs on the sFlow-RT real-time analytics engine, which receives industry standard sFlow telemetry from routers and pushes controls using BGP. A recent enhancement to the application pushes controls to multiple routers in order to protect networks with redundant edge routers.
ddos_protect.router=10.0.0.96,10.0.0.97
Configuring multiple BGP connections is simple, the ddos_protect.router configuration option has been extended to accept a comma separated list of IP addresses for the routers that will be connecting to the controller.
Alternatively, a BGP Flowspec/RTBH reflector can be used to propagate the controls. Flowspec is a recent addition to open source BGP software, FRR and Bird, and it should be possible to use this software to reflect Flowspec controls. A reflector can be a useful place to implement policies that direct controls to specific enforcement devices.

Support for multiple BGP connections in the DDoS Protect application reduces the complexity of simple deployments by removing the requirement for a reflector. Controls are pushed to all devices, but differentiated policies can still be implemented by configuring each device's response to controls.

Tuesday, March 24, 2020

Kubernetes testbed

The sFlow-RT real-time analytics platform receives a continuous telemetry stream from sFlow Agents embedded in network devices, hosts and applications and converts the raw measurements into actionable metrics, accessible through open APIs, see Writing Applications.

Application development is greatly simplified if you can emulate the infrastructure you want to monitor on your development machine. Docker testbed describes a simple way to develop sFlow based visibility solutions. This article describes how to build a Kubernetes testbed to develop and test configurations before deploying solutions into production.
Docker Desktop provides a convenient way to set up a single node Kubernetes cluster, just select the Enable Kubernetes setting and click on Apply & Restart.

Create the following sflow-rt.yml file:
apiVersion: v1
kind: Service
metadata:
  name: sflow-rt-sflow
spec:
  type: NodePort
  selector:
    name: sflow-rt
  ports:
    - protocol: UDP
      port: 6343
---
apiVersion: v1
kind: Service
metadata:
  name: sflow-rt-rest
spec:
  type: LoadBalancer
  selector:
    name: sflow-rt
  ports:
    - protocol: TCP
      port: 8008
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sflow-rt
spec:
  replicas: 1
  selector:
    matchLabels:
      name: sflow-rt
  template:
    metadata:
      labels:
        name: sflow-rt
    spec:
      containers:
      - name: sflow-rt
        image: sflow/prometheus:latest
        ports:
          - name: http
            protocol: TCP
            containerPort: 8008
          - name: sflow
            protocol: UDP
            containerPort: 6343
Run the following command to deploy the service:
kubectl apply -f sflow-rt.yml
Now create the following host-sflow.yml file:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: host-sflow
spec:
  selector:
    matchLabels:
      name: host-sflow
  template:
    metadata:
      labels:
        name: host-sflow
    spec:
      restartPolicy: Always
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
      - name: host-sflow
        image: sflow/host-sflow:latest
        env:
          - name: COLLECTOR
            value: "sflow-rt-sflow"
          - name: SAMPLING
            value: "10"
          - name: NET
            value: "flannel"
        volumeMounts:
          - mountPath: /var/run/docker.sock
            name: docker-sock
            readOnly: true
      volumes:
        - name: docker-sock
          hostPath:
            path: /var/run/docker.sock
Run the following command to deploy the service:
kubectl apply -f host-sflow.yml
In this case, there is only one node, but the command will deploy an instance of Host sFlow on every node in a Kubernetes cluster to provide a comprehensive view of network, server, and application performance.

Note: The single node Kubernetes cluster uses the Flannel plugin for Cluster Networking. Setting the sflow/host-sflow environment variable NET to flannel instruments the cni0 bridge used by Flannel to connect Kubernetes pods. The NET and SAMPLING settings will likely need to be changed when pushing the configuration into a production environment, see sflow/host-sflow for options.

Run the following command to verify that the Host sFlow and sFlow-RT pods are running:
kubectl get pods
The following output:
NAME                        READY   STATUS    RESTARTS   AGE
host-sflow-lp4db            1/1     Running   0          34s
sflow-rt-544bff645d-kj4km   1/1     Running   0          21h
The following command displays the network services:
kubectl get services
Generating the following output:
NAME             TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
kubernetes       ClusterIP      10.96.0.1       <none>        443/TCP          13d
sflow-rt-rest    LoadBalancer   10.110.89.167   localhost     8008:31317/TCP   21h
sflow-rt-sflow   NodePort       10.105.87.169   <none>        6343:31782/UDP   21h
Access to the sFlow-RT REST API is available via localhost port 8008.
The sFlow-RT web interface confirms that telemetry is being received from 1 sFlow agent (the Host sFlow instance monitoring the Kubernetes node).
ab -c 4 -n 10000 -b 500 -l http://10.0.0.73:8008/dump/ALL/ALL/json
The command above uses the ab - Apache HTTP server benchmarking tool to generate network traffic by repeatedly querying the sFlow-RT instance using the Kubernetes node IP address (10.0.0.73).
The screen capture above shows the sFlow-RT Flow Browser application reporting traffic in real-time.
#!/usr/bin/env python
import requests

requests.put(
  'http://10.0.0.73:8008/flow/elephant/json',
  json={'keys':'ipsource,ipdestination', 'value':'bytes'}
)
requests.put(
  'http://10.0.0.73:8008/threshold/elephant_threshold/json',
  json={'metric':'elephant', 'value': 10000000/8, 'byFlow':True, 'timeout': 1}
)
eventurl = 'http://10.0.0.73:8008/events/json'
eventurl += '?thresholdID=elephant_threshold&maxEvents=10&timeout=60'
eventID = -1
while 1 == 1:
  r = requests.get(eventurl + '&eventID=' + str(eventID))
  if r.status_code != 200: break
  events = r.json()
  if len(events) == 0: continue

  eventID = events[0]['eventID']
  events.reverse()
  for e in events:
    print(e['flowKey'])
The above elephant.py script is modified from the version in Docker testbed to reference the Kubernetes node IP address (10.0.0.73).
./elephant.py     
10.1.0.72,192.168.65.3
The output above is generated immediately when traffic is generated using the ab command. The IP addresses correspond to those displayed in the Flow Browser chart.
curl http://10.0.0.73:8008/prometheus/metrics/ALL/ALL/txt
Run the above command to metrics from the Kubernetes cluster exported using Prometheus export format.

This article was focussed on using Docker Desktop to move sFlow real-time analytics solutions into a Kubernetes production environment. Docker testbed describes how to use Docker Desktop to create an environment to develop the applications.

Thursday, March 19, 2020

SFMIX San Francisco shelter in place

A shelter in place order restricted San Francisco residents to their homes beginning at 12:01 a.m. on March 17, 2020. Many residents work for Bay Area technology companies such as Salesforce, Facebook, Twitter, Google, Netflix and Apple. Employees from these companies are able to, and have been instructed to, work remotely from their homes. In addition, other housebound residents are making use of social networking to keep in touch with friends and family as well as streaming media and online gaming for entertainment.

The traffic trend chart above from the San Francisco Metropolitan Internet Exchange (SFMIX) shows the change in network traffic that has resulted from the shelter in place order. Peak traffic has increased by around 10Gbit/s (a 25% increase) and continues throughout the day (whereas peaks previously occurred in the evenings).

The SFMIX network directly connects a number of data centers in the Bay Area and the member organizations that peer from those data centers.  Peering through the exchange network keeps traffic local by directly connecting companies with their employees and customers and avoiding potentially congested service provider networks.
SFMIX recently finished a network upgrade to 100Gbit/s Arista switches and all fiber optic connections and so is easily able to handle the increase in traffic.

Network visibility is critical to being able to quickly respond to unexpected changes in network usage. The sFlow measurement technology built into high speed switches is a standard method of monitoring Internet Exchanges (IXs) - Internet Exchange (IX) Metrics is a measurement tool developed with SFMIX.

Every organization depends on their networks and visibility is critical to manage the challenges posed by a rapidly changing environment. sFlow is an industry standard that is widely implemented by network vendors.  Enable sFlow telemetry from existing network equipment and deploy an sFlow analytics tool to gain visibility into your network traffic. sFlowTrend is a free tool that can be installed and running in minutes. Flow metrics with Prometheus and Grafana describes how to integrate network traffic visibility into existing operational dashboards.

Thursday, March 12, 2020

Ubuntu 18.04

Ubuntu 18.04 comes with Linux kernel version 4.15. This version of the kernel includes efficient in-kernel packet sampling that can be used to provide network visibility for production servers running network heavy workloads, see Berkeley Packet Filter (BPF).
This article provides instructions for installing and configuring the open source Host sFlow agent to remotely monitor servers using the industry standard sFlow protocol. The sFlow-RT real-time analyzer is used to demonstrate the capabilities of sFlow telemetry.

Find the latest Host sFlow version on the Host sFlow download page.
wget https://github.com/sflow/host-sflow/releases/download/v2.0.25-3/hsflowd-ubuntu18_2.0.25-3_amd64.deb
sudo dpkg -i hsflowd-ubuntu18_2.0.25-3_amd64.deb
sudo systemctl enable hsflowd
The above commands download and install the software.
sflow {
  collector { ip=10.0.0.30 }
  pcap { speed=1G-1T }
  tcp { }
  systemd { }
}
Edit the /etc/hsflowd.conf file. The above example sends sFlow to a collector at 10.0.0.30, enables packet sampling on all network adapters, adds TCP performance information, and exports metrics for Linux services. See Configuring Host sFlow for Linux for the complete set of configuration options.
sudo systemctl restart hsflowd
Restart the Host sFlow daemon to start streaming telemetry to the collector.
sflow {
  dns-sd { domain=.sf.inmon.com }
  pcap { speed=1G-1T }
  tcp { }
  systemd { }
}
Alternatively, if you have control over a DNS domain, you can use DNS SRV records to advertise sFlow collector address(es). In the above example, the sf.inmon.com domain will be queried for collectors.
sflow-rt          A       10.0.0.30
_sflow._udp   60  SRV     0 0 6343  sflow-rt
The above entries from the sf.inmon.com zone file directs the sFlow to sflow-rt.sf.inmon.com (10.0.0.30). If you change the collector, all Host sFlow agents will pick up the change within 60 seconds (the DNS time to live specified in the SRV entry).

Now that the Host sFlow agent has been configured, it's time to install an sFlow collector on server 10.0.0.30, which we will assume is also running Ubuntu 18.04.

First install Java.
sudo apt install openjdk-11-jre-headless
Next, install the latest version of sFlow-RT along with browse-metrics, browse-flows and prometheus applications.
LATEST=`wget -qO - https://inmon.com/products/sFlow-RT/latest.txt`
wget https://inmon.com/products/sFlow-RT/sflow-rt_$LATEST.deb
sudo dpkg -i sflow-rt_$LATEST.deb
sudo /usr/local/sflow-rt/get-app.sh sflow-rt browse-metrics
sudo /usr/local/sflow-rt/get-app.sh sflow-rt browse-flows
sudo /usr/local/sflow-rt/get-app.sh sflow-rt prometheus
sudo systemctl enable sflow-rt
sudo systemctl start sflow-rt
Finally, allow sFlow and HTTP requests through the firewall.
sudo ufw allow 6343/udp
sudo ufw allow 8008/tcp
System Properties describes configuration options that can be set in the /usr/local/sflow-rt/conf.d/sflow-rt.conf file. See Download and install for instructions on securing access to sFlow-RT as well as links to additional applications.
Use a web browser to connect to http://10.0.0.30:8008/ to access the sFlow-RT web interface (shown above). The Status page confirms that sFlow is being received.
curl http://10.0.0.30:8008/prometheus/metrics/ALL/ALL/txt
The above command retrieves metrics for all the hosts in Prometheus export format.

Configure Prometheus or InfluxDB to periodically retrieve and store metrics. The following examples demonstrate the use of Grafana to query a Prometheus database to populate dashboards: sFlow-RT Network Interfaces, sFlow-RT Countries and Networks, and sFlow-RT Health.