Wednesday, June 29, 2016

Configuring OpenSwitch

The following configuration enables sFlow monitoring of all interfaces on a white box switch running the OpenSwitch operating system, sampling packets at 1-in-4096, polling counters every 20 seconds and sending the sFlow to an analyzer (10.0.0.50) on UDP port 6343 (the default sFlow port):
switch(config)# sflow collector 10.0.0.50
switch(config)# sflow sampling 4096
switch(config)# sflow polling 20
switch(config)# sflow enable
A previous posting discussed the selection of sampling rates.  Additional information can be found in the OpenSwitch sFlow User Guide.
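To get a sense of the sample stream these settings generate, the expected number of samples per second is simply the packet rate divided by the sampling rate. The following Python sketch illustrates the arithmetic (the packet rates are hypothetical examples, not measurements):
#!/usr/bin/env python
# Estimate the sFlow sample rate produced by the 1-in-4096 setting above.
# The packet rates below are illustrative; substitute actual traffic levels.

sampling_rate = 4096

for pps in [10000, 100000, 1000000]:
  samples = pps / float(sampling_rate)
  print("%8d packets/sec -> ~%.1f samples/sec" % (pps, samples))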

See Trying out sFlow for suggestions on getting started with sFlow monitoring and reporting.

Thursday, June 16, 2016

Cisco Tetration analytics

The June 15, 2016 press release, Cisco Tetration Analytics: the most Comprehensive Data Center Visibility and Analysis in Real Time, at Scale, announced the new Cisco Tetration Analytics platform. The platform collects telemetry from proprietary agents on servers and embedded in hardware on certain Nexus 9k switches, analyzes the data, and presents results via Web GUI, REST API, and as events.

Cisco Tetration Analytics Data Sheet describes the hardware requirements:
Platform Hardware                                      Quantity
Cisco Tetration Analytics computing nodes (servers)    16
Cisco Tetration Analytics base nodes (servers)         12
Cisco Tetration Analytics serving nodes (servers)      8
Cisco Nexus 9372PX Switches                            3

And the power requirements:
Property                                                                        Cisco Tetration Analytics Platform
Peak power for Cisco Tetration Analytics Platform (39-RU single-rack option)    22.5 kW
Peak power for Cisco Tetration Analytics Platform (39-RU dual-rack option)      11.25 kW per rack (22.5 kW total)

No pricing is given, but based on the hardware, data center space, power and cooling requirements, this brute force approach to analytics will be reassuringly expensive to purchase and operate.

Update June 22, 2016: See 451 Research report, Cisco Tetration: a $3m, 1,700-pound appliance for network traffic analytics is born, for pricing information.
A much less expensive alternative is to use industry standard sFlow agents embedded in Cisco Nexus 9k/3k switches and in switches from over 40 other vendors. The open source Host sFlow agent extends visibility to servers and applications by streaming telemetry from the Linux, Windows, FreeBSD, Solaris, and AIX operating systems, hypervisors, Docker containers, web servers (Apache, NGINX, Tomcat, HAProxy) and Java application servers.

The diagram shows how the sFlow-RT real-time analytics engine receives a continuous telemetry stream from sFlow instrumentation built into network, server, and application infrastructure, and delivers analytics through APIs that can easily be integrated with a wide variety of on-site and cloud orchestration, DevOps, and Software Defined Networking (SDN) tools.

Minimizing cost of visibility describes why lightweight monitoring is critical to realizing the value that telemetry can bring to improving operational efficiency. In the case of the sFlow based solution, the critical data path instrumentation is built into the switch ASICs and in the Linux kernel, ensuring that there is negligible impact on operational performance.

The sFlow-RT analytics software shown in the diagram provides real-time (sub second) visibility for 5,000 unique end points (virtual machines or bare metal servers), the upper limit of scalability in the Tetration data sheet, using a single virtual machine or Docker container with 4 GBytes of RAM and 4 CPU cores. With additional memory and CPU the solution easily scales to 100,000 unique end points.
How can sFlow provide real-time visibility at scale and consume so few resources? Shrink ray describes how advanced statistical techniques are used to select and analyze measurements that capture the essential features of network and system performance. A statistical approach yields fast, accurate answers, while minimizing the resources required to measure, transport and analyze the data.
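The accuracy of random packet sampling depends only on the number of samples collected for a class of traffic, not on the total traffic volume; the commonly quoted bound from sFlow.org's packet sampling theory is a worst case error of 196 * sqrt(1/c) percent (at 95% confidence) for a traffic class that generated c samples. A quick sketch of the numbers:
#!/usr/bin/env python
from math import sqrt

# Worst case percentage error (95% confidence) for a traffic class
# that generated c samples: error <= 196 * sqrt(1/c)
def pct_error(c):
  return 196.0 * sqrt(1.0 / c)

for c in [100, 1000, 10000]:
  print("%6d samples -> within %.1f%%" % (c, pct_error(c)))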
The sFlow-RT analytics platform was selected as an example because of the overlap in capabilities with the Cisco Tetration analytics platform. However, sFlow is non-proprietary and there are many other open source and commercial sFlow analytics solutions listed on sFlow.org.

The Cisco press release states, "Available in July 2016, the first Tetration platform will be a full rack appliance that is deployed on-premise at the customer’s data center." On the other hand, the sFlow based solution described here is available today and can be installed and running in minutes on a virtual machine or in a Docker container.

Wednesday, June 15, 2016

Programmable hardware: Barefoot Networks, PISA, and P4

Barefoot Networks recently came out of stealth to reveal their  Tofino 6.5Tbit/second (65 X 100GE or 260 X 25GE) fully user-programmable switch. The diagram above, from the talk Programming The Network Data Plane by Changhoon Kim of Barefoot Networks, shows the Protocol Independent Switch Architecture (PISA) of the programmable switch silicon.
A logical switch data-plane described in the P4 language is compiled to program the general purpose PISA hardware. For example, the following P4 code snippet is part of a P4 sFlow implementation:
table sflow_ing_take_sample {
    /* take_sample > MAX_VAL_31 and valid sflow_session_id => take the sample */
    reads {
        ingress_metadata.sflow_take_sample : ternary;
        sflow_metadata.sflow_session_id : exact;
    }
    actions {
        nop;
        sflow_ing_pkt_to_cpu;
    }
}
Network visibility is one of the major use cases for P4 based switches. Improving Network Monitoring and Management with Programmable Data Planes describes how P4 can be used to collect information about latency and queueing in the switch forwarding pipeline.
The document also describes an architecture for In-band Network Telemetry (INT) in which the ingress switch is programmed to insert a header containing measurements to packets entering the network. Each switch in the path is programmed to append additional measurements to the packet header. The egress switch is programmed to remove the header so that the packet can be delivered to its destination. The egress switch is responsible for processing the measurements or sending them on to analytics software.

In-band telemetry is an interesting example of the flexibility provided by P4 programmable hardware and the detailed information that can be gathered about latency and queueing from the hardware forwarding pipeline. However, there are practical issues that should be considered with this approach:
  1. Transporting measurement headers is complex with different encapsulations for each transport protocol:  Geneve, VxLAN, etc.
  2. Addition of headers increases the size of packets and risks causing traffic to be dropped downstream due to maximum transmission unit (MTU) restrictions.
  3. The number of measurements that can be added by each switch and the number of switches adding measurements in the path needs to be limited.
  4. In-band telemetry cannot be incrementally deployed. Ideally, all devices need to participate, or at a minimum, the ingress and egress devices need to be in-band telemetry aware.
  5. In-band telemetry transports data from the data plane to the control/management planes, providing a potential attack surface that could be exploited by crafting malicious packets with fake measurement headers.
The sFlow architecture provides an out of band alternative for transporting the per packet forwarding plane measurements defined by INT. Instead of adding the measurements to the egress packet, measurements can be attached as metadata to the sampled packets that are handed to the switch CPU. The sFlow agent immediately forwards the additional packet metadata as part of the standard sFlow telemetry stream to a centralized collector. Using sFlow as the telemetry transport has a number of benefits:
  1. Simple to deploy since there is no modification of packets (no issues with encapsulations, MTU, number of measurements, path length, incremental deployment, etc.)
  2. Extensibility of sFlow protocol allows additional forwarding plane measurements to augment standard sFlow measurements, fully integrating the new measurements with sFlow data exported from other switches in the network (sFlow is supported by over 40 switch vendors and is a standard feature of switch ASICs).
  3. sFlow is a unidirectional telemetry transport protocol that originates from the device management plane and can be sent out of band, limiting possible attack surfaces.
The great thing about programmable hardware is that behavior can be modified by changing the software. Implementing out of band telemetry is a matter of combining measurements from the P4 INT code with the P4 sFlow agent code. Compiling and installing out of band sFlow telemetry code reprograms the hardware to implement the new scheme.

The advent of P4 and programmable hardware opens up exciting possibilities for defining additional packet sample metadata, counters, and gauges to augment the sFlow telemetry stream and gain additional insight into the performance of production traffic in large scale, high capacity, networks.

The real-time sFlow streaming telemetry can be used to drive automated controls, for example, to block DDoS attacks or to load balance large "Elephant" flows across multiple paths. Here again, P4 combined with programmable hardware makes it possible to create additional control capabilities. For example, to block the large number of source addresses involved in a DDoS attack, sFlow analytics would be used to identify the attackers and their points of ingress, and each switch would then be programmed with filters appropriate to its location in the network. The ability to customize the hardware to address specific tasks makes more efficient use of hardware resources than is possible with a fixed function device: defining a specialized DDoS drop table would allow for a much larger number of filters than would be possible with a general purpose ACL table.
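As a sketch of the analytics side of such a control loop, the following Python script uses the same sFlow-RT REST API pattern as the examples elsewhere on this blog to flag destinations receiving an unusually high packet rate and report the ingress switch. The threshold value is an arbitrary illustration, and the block() function is a stub where a hardware specific action (for example, adding an entry to a P4 drop table) would go:
#!/usr/bin/env python
import requests
import json

rt = 'http://127.0.0.1:8008'

# flow keyed on ingress switch/port and victim address
flow = {'keys':'node:inputifindex,ipdestination','value':'frames'}
requests.put(rt+'/flow/ddos/json',data=json.dumps(flow))

# raise an event when a destination receives more than 100,000 packets/second
threshold = {'metric':'ddos','value':100000,'byFlow':True,'timeout':60}
requests.put(rt+'/threshold/ddos/json',data=json.dumps(threshold))

def block(ingress,victim):
  # stub: push a filter for the victim address to the identified ingress switch
  print("block traffic to %s arriving at %s" % (victim,ingress))

eventurl = rt+'/events/json?thresholdID=ddos&maxEvents=10&timeout=60'
eventID = -1
while 1 == 1:
  r = requests.get(eventurl + "&eventID=" + str(eventID))
  if r.status_code != 200: break
  events = r.json()
  if len(events) == 0: continue

  eventID = events[0]["eventID"]
  for e in events:
    ingress,victim = e['flowKey'].split(',')
    block(ingress,victim)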

Tuesday, June 14, 2016

Merchant silicon based routing, flow analytics, and telemetry

Drivers for growth describes how switches built on Broadcom merchant silicon ASICs dominate the current generation of data center switches, reduce hardware costs, and support an open ecosystem of switch operating systems (Cumulus Linux, OpenSwitch, Dell OS10, Broadcom FASTPATH, Pica8 PicOS, Open Network Linux, etc.).

The router market is poised to be similarly disrupted with the introduction of devices based on Broadcom's Jericho ASIC, which has the capacity to handle over 1 million routes in hardware (the full Internet routing table is currently around 600,000 routes).
An edge router is a very pricey box indeed, often costing anywhere from $100,000 to $200,000 per 100 Gb/sec port, depending on features in the router and not including optical cables that are also terribly expensive. Moreover, these routers might only be able to cram 80 ports into a half rack or full rack of space. The 7500R universal spine and 7280R universal leaf switches cost on the order of $3,000 per 100 Gb/sec port, and they are considerably denser and less expensive. - Leaving Fixed Function Switches Behind For Universal Leafs
Broadcom Jericho ASICs are currently available in Arista 7500R/7280R routers and in Cisco NCS 5000 series routers. Expect further disruption to the router market when white box versions of the 1U router hardware enter the market.
There was general enthusiasm for Broadcom Jericho based routers in a recent discussion on the North American Network Operators' Group (NANOG) mailing list, Arista Routing Solutions, so merchant silicon based routers should be expected to sell well.
The Broadcom Jericho ASICs also include hardware instrumentation to support industry standard sFlow traffic monitoring and streaming telemetry. For example, the following commands enable sFlow on all ports on an Arista router:
sflow source-interface Management1
sflow destination 170.1.1.11
sflow polling-interval 30
sflow sample 65535
sflow run
See EOS System Configuration Guide for details.

Cisco supports standard sFlow on its merchant silicon based switch platforms: see Cisco adds sFlow support, Cisco adds sFlow support to Nexus 9K series, and Cisco SF250, SG250, SF350, SG350, SG350XG, and SG550XG series switches. Unfortunately, IOS XR on Cisco's Jericho based routers doesn't yet support sFlow. Instead, a complex set of commands is required to configure Cisco's proprietary NetFlow and streaming telemetry protocols:
RP/0/RP0/CPU0:router#config
RP/0/RP0/CPU0:router(config)#flow exporter-map exp1
RP/0/RP0/CPU0:router(config-fem)#version v9
RP/0/RP0/CPU0:router(config-fem-ver)#options interface-table timeout 300
RP/0/RP0/CPU0:router(config-fem-ver)#options sampler-table timeout 300
RP/0/RP0/CPU0:router(config-fem-ver)#template data timeout 300
RP/0/RP0/CPU0:router(config-fem-ver)#template options timeout 300
RP/0/RP0/CPU0:router(config-fem-ver)#exit 
RP/0/RP0/CPU0:router(config-fem)#transport udp 12515
RP/0/RP0/CPU0:router(config-fem)#source Loopback0
RP/0/RP0/CPU0:router(config-fem)#destination 170.1.1.11
RP/0/RP0/CPU0:router(config-fem)#exit
RP/0/RP0/CPU0:router(config)#flow monitor-map MPLS-IPv6-fmm
RP/0/RP0/CPU0:router(config-fmm)#record mpls ipv6-fields labels 3
RP/0/RP0/CPU0:router(config-fmm)#exporter exp1
RP/0/RP0/CPU0:router(config-fmm)#cache entries 10000
RP/0/RP0/CPU0:router(config-fmm)#cache permanent
RP/0/RP0/CPU0:router(config-fmm)#exit
RP/0/RP0/CPU0:router(config)#sampler-map FSM
RP/0/RP0/CPU0:router(config-sm)#random 1 out-of 65535
RP/0/RP0/CPU0:router(config-sm)# exit
And further commands are needed to enable monitoring on each interface (and there can be a large number of interfaces given the high port density of these routers):
RP/0/RP0/CPU0:router(config)#interface HundredGigE 0/3/0/0
RP/0/RP0/CPU0:router(config-if)#flow mpls monitor MPLS-IPv6-fmm sampler FSM ingress
See Netflow Configuration Guide for Cisco NCS 5500 Series Routers, IOS XR Release 6.0.x for configuration details and limitations.

We are still not done; further steps are required to enable the equivalent of sFlow's streaming telemetry.

Create a policy file defining the counters to export:
{
 "Name": "Test",
 "Metadata": {
  "Version": 25,
  "Description": "This is a sample policy",
  "Comment": "This is the first draft",
  "Identifier": "data that may be sent by the encoder to the mgmt stn"
 },
 "CollectionGroups": {
  "FirstGroup": {
  "Period": 30,
  "Paths": [
   "RootOper.InfraStatistics.Interface(*).Latest.GenericCounters"
   ]
  }
 }
}
Copy the policy file to the router:
$ scp Test.policy cisco@170.1.1.1:/telemetry/policies
Finally, configure the JSON encoder:
Router# configure
Router(config)#telemetry encoder json
Router(config-telemetry-json)#policy group FirstGroup
Router(config-policy-group)#policy Test
Router(config-policy-group)#destination ipv4 170.1.1.11 port 5555
Router(config-policy-group)#commit
See Cisco IOS XR Telemetry Configuration Guide for details.
Software defined analytics describes how the sFlow architecture disaggregates the flow analytics pipeline and integrates telemetry export to reduce complexity and increase flexibility. The reduced configuration complexity is clearly illustrated by the two configuration examples above.

Unlike the complex and disparate monitoring mechanisms in IOS XR, sFlow offers a simple, flexible and unified monitoring solution that exposes the full monitoring capabilities of the Broadcom Jericho ASIC. Expect a future release of IOS XR to add sFlow support, since sFlow is a natural fit for the hardware capabilities of Jericho based router platforms and its addition would provide feature parity with Cisco's merchant silicon based switches.

Finally, the real-time visibility provided by sFlow supports a number of important use cases for high performance routers, including:
  • DDoS mitigation
  • Load balancing ECMP paths
  • BGP route analytics
  • Traffic engineering
  • Usage based accounting
  • Enforcing usage quotas

Wednesday, June 8, 2016

Docker networking with IPVLAN and Cumulus Linux

Macvlan and Ipvlan Network Drivers are being added as Docker networking options. The IPVlan L3 Mode shown in the diagram is particularly interesting since it dramatically simplifies the network by extending routing to the hosts and eliminating switching entirely.

Eliminating the complexity associated with switching broadcast domains, VLANs, spanning tree, etc. allows a purely routed network to be easily scaled to very large sizes. However, there are some challenges to overcome:
IPVlan will require routes to be distributed to each endpoint. The driver only builds the Ipvlan L3 mode port and attaches the container to the interface. Route distribution throughout a cluster is beyond the initial implementation of this single host scoped driver. In L3 mode, the Docker host is very similar to a router starting new networks in the container. They are on networks that the upstream network will not know about without route distribution.
Cumulus Networks has been working to simplify routing in the ECMP leaf and spine networks and the white paper Routing on the Host: An Introduction shows how the routing configuration used on Cumulus Linux can be extended to the hosts.

Update June 2, 2016: Routing on the Host contains packaged versions of the Cumulus Quagga daemon for Ubuntu, Redhat and Docker.
This article explores the combination of Cumulus Linux networking with Docker IPVLAN using a simple test bed built using free software: VirtualBox, CumulusVX switches, and Ubuntu 16.04 servers. This setup should result in a simple, easy to manage, easy to monitor networking solution for Docker, since all the switches and servers will be running Linux, allowing the same routing, monitoring, and orchestration software to be used throughout.
Using Cumulus VX with VirtualBox and Creating a Two-Spine, Two-Leaf Topology provide detailed instructions on building and configuring the leaf and spine network shown in the diagram. However, BGP was configured as the routing protocol instead of OSPF (see BGP configuration made simple with Cumulus Linux). For example, the following commands configure BGP on leaf1:
interface swp1
 ipv6 nd ra-interval 10
 link-detect
!
interface swp2
 ipv6 nd ra-interval 10
 link-detect
!
interface swp3
 ipv6 nd ra-interval 10
 link-detect
!
router bgp 65130
 bgp router-id 192.168.0.130
 bgp bestpath as-path multipath-relax
 neighbor EBGP peer-group
 neighbor EBGP remote-as external
 neighbor swp1 interface peer-group EBGP
 neighbor swp1 capability extended-nexthop
 neighbor swp2 interface peer-group EBGP
 neighbor swp2 capability extended-nexthop
 neighbor swp3 interface peer-group EBGP
 neighbor swp3 capability extended-nexthop
Auto-configured IPv6 link local addresses dramatically simplify the configuration of equal cost multi-path (ECMP) routing, eliminating the need to assign IP addresses and subnets to the routed switch ports. The simplified configuration is easy to template, all the switches have a very similar configuration, and an orchestration tool like Puppet, Chef, Ansible, Salt, etc. can be used to automate the process of configuring the switches.
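For example, here is a minimal sketch of such templating in Python (the switch parameters are the example values from the leaf1 configuration above; a production deployment would more likely use Ansible or Jinja2 templates):
#!/usr/bin/env python
# Render the per-switch BGP stanza shown above from a few per-switch parameters.

def bgp_config(asn, router_id, ports):
  lines = ["router bgp %d" % asn,
           " bgp router-id %s" % router_id,
           " bgp bestpath as-path multipath-relax",
           " neighbor EBGP peer-group",
           " neighbor EBGP remote-as external"]
  for p in ports:
    lines.append(" neighbor %s interface peer-group EBGP" % p)
    lines.append(" neighbor %s capability extended-nexthop" % p)
  return "\n".join(lines)

print(bgp_config(65130, '192.168.0.130', ['swp1','swp2','swp3']))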
Two Ubuntu 16.04 hosts were created, attached to leaf1 and leaf2 respectively. Each server has two network adapters: enp0s3 connected to the out of band network used to manage the switches and enp0s8 connected to the respective leaf switch.
Why the strange Ethernet interface names (e.g. enp0s3 instead of eth0)? They are the result of the predictable network interface names mechanism that is the default in Ubuntu 16.04. I can't say I am a fan: predictable interface names are difficult to read, push the problem of device naming up the stack, and make it difficult to write portable orchestration scripts.
Quagga is used for BGP routing on the Cumulus Linux switches and on the Ubuntu hosts. The host BGP configurations are virtually identical to the switch configurations, but with an added redistribute stanza to automatically advertise locally attached addresses and subnets:
redistribute connected route-map IPVLAN
The IPVLAN route-map is used to control the routes that are advertised, limiting them to the range of addresses that have been allocated to the IPVLAN orchestration system. Route filtering is an example of the flexibility that BGP brings as a routing protocol: top of rack switches can filter routes advertised by the hosts to protect the fabric from misconfigured hosts, and hosts can be configured to selectively advertise routes.

At this point, configuring routes between the hosts is easy. Configure a network on host1:
user@host1:~$ sudo ip address add 172.16.134.1/24 dev enp0s8
And a route immediately appears on host2:
user@host2:~$ ip route
default via 10.0.0.254 dev enp0s3 onlink 
10.0.0.0/24 dev enp0s3  proto kernel  scope link  src 10.0.0.135 
172.16.134.0/24 via 169.254.0.1 dev enp0s8  proto zebra  metric 20 onlink 
Add a network to host2:
user@host2:~$ sudo ip address add 172.16.135.1/24 dev enp0s8
And it appears on host1:
cumulus@host1:~$ ip route
default via 10.0.0.254 dev enp0s3 onlink 
10.0.0.0/24 dev enp0s3  proto kernel  scope link  src 10.0.0.134 
172.16.134.0/24 dev enp0s8  proto kernel  scope link  src 172.16.134.1 
172.16.135.0/24 via 169.254.0.1 dev enp0s8  proto zebra  metric 20 onlink
Connectivity across the leaf and spine fabric can be verified with a ping test:
user@host2:~$ ping 172.16.134.1
PING 172.16.134.1 (172.16.134.1) 56(84) bytes of data.
64 bytes from 172.16.134.1: icmp_seq=1 ttl=61 time=2.60 ms
64 bytes from 172.16.134.1: icmp_seq=2 ttl=61 time=2.59 ms
Ubuntu 16.04 was selected as the host operating system since it has built-in IPVLAN support. The Docker Experimental Features distribution, which includes the IPVLAN Docker networking plugin, is installed on the two hosts.

The following command creates a Docker IPVLAN network on host1:
user@host1:~$ docker network create -d ipvlan --subnet=172.16.134.0/24 \
-o parent=enp0s8 -o ipvlan_mode=l3 ipv3
Note: Route(s) to the IPVLAN network would be automatically distributed if the above command had an option to attach the subnets to the parent interface.

The following commands start a container attached to the network and show the network settings as seen by the container:
user@host1:~$ docker run --net=ipv3 -it --rm alpine /bin/sh
/ # ifconfig
eth0      Link encap:Ethernet  HWaddr 08:00:27:70:FA:B5  
          inet addr:172.16.134.2  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe70:fab5%32582/64 Scope:Link
          UP BROADCAST RUNNING NOARP MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:2 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1%32582/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

/ # ip route
default dev eth0 
172.16.134.0/24 dev eth0  src 172.16.134.2 
Connectivity between containers attached to the same IPVLAN can be verified by starting a second container and performing a ping test:
user@host1:~$ docker run --net=ipv3 -it --rm alpine /bin/sh
/ # ping 172.16.134.2
PING 172.16.134.2 (172.16.134.2): 56 data bytes
64 bytes from 172.16.134.2: seq=0 ttl=64 time=0.085 ms
64 bytes from 172.16.134.2: seq=1 ttl=64 time=0.077 ms
Unfortunately, connecting containers to the IPVLAN network interferes with the auto-configured IPv6 link local BGP connection between host1 and leaf1, which results in host1 being disconnected from the leaf and spine fabric, and losing connectivity to host2. Assigning static IPv4 addresses to leaf1 and host1 complicates the configuration but solves the problem. For example, here is the Quagga BGP configuration on host1:
router bgp 65134
 bgp router-id 192.168.0.134
 redistribute connected route-map NON-MGMT
 neighbor 192.168.1.130 remote-as 65130
Now it is possible to connect to the containers remotely from host2:
user@host2:~$ ping 172.16.134.2
PING 172.16.134.2 (172.16.134.2) 56(84) bytes of data.
64 bytes from 172.16.134.2: icmp_seq=1 ttl=61 time=2.72 ms
64 bytes from 172.16.134.2: icmp_seq=2 ttl=61 time=2.79 ms
Now that we have end to end connectivity, how can we monitor traffic?
The open source Host sFlow agent is installed on the switches and Docker hosts to export standard sFlow telemetry.
The sFlow-RT real-time analytics software can be installed on a host, or spun up in a container using a simple Dockerfile:
FROM   centos:centos6
RUN    yum install -y wget
RUN    yum install -y java-1.7.0-openjdk
RUN    wget http://www.inmon.com/products/sFlow-RT/sflow-rt.tar.gz
RUN    tar -xzf sflow-rt.tar.gz
EXPOSE 8008 6343/udp
CMD    ./sflow-rt/start.sh
Update July 22, 2016: sFlow-RT can now be run from Docker Hub, see sflow/sflow-rt for instructions.

Writing Applications provides an overview of sFlow-RT's REST API. The following Python script demonstrates end to end visibility by reporting the specific path that a large "Elephant" flow takes across the leaf and spine fabric:
#!/usr/bin/env python
import requests
import json

rt = 'http://127.0.0.1:8008'

flow = {'keys':'node:inputifindex,ipsource,ipdestination,ipttl','value':'bytes'}
requests.put(rt+'/flow/elephant/json',data=json.dumps(flow))

threshold = {'metric':'elephant','value':100000000/8,'byFlow':True,'timeout':600}
requests.put(rt+'/threshold/elephant/json',data=json.dumps(threshold))

eventurl = rt+'/events/json?thresholdID=elephant&maxEvents=10&timeout=60'
eventID = -1
while 1 == 1:
  r = requests.get(eventurl + "&eventID=" + str(eventID))
  if r.status_code != 200: break
  events = r.json()
  if len(events) == 0: continue

  eventID = events[0]["eventID"]
  events.reverse()
  for e in events:
    print e['flowKey']
Running the script and generating a large flow from host2 (172.16.135.1) to an IPVLAN connected container on host1 (172.16.134.2) gives the following output:
$ ./elephant.py 
spine2,172.16.135.1,172.16.134.2,62
leaf1,172.16.135.1,172.16.134.2,61
host2,172.16.135.1,172.16.134.2,64
leaf2,172.16.135.1,172.16.134.2,63
host1,172.16.135.1,172.16.134.2,61
The last field is the IP TTL (time to live). Ordering the events by TTL shows the path across the network (since the TTL is decremented by each switch along the path):
host2 -> leaf2 -> spine2 -> leaf1 -> host1
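The ordering step is easy to automate; a small Python sketch that sorts the event lines produced by the script above by their TTL field and prints the resulting path:
#!/usr/bin/env python
# Order flow events (node,ipsource,ipdestination,ipttl) by TTL to recover the path.

events = [
  "spine2,172.16.135.1,172.16.134.2,62",
  "leaf1,172.16.135.1,172.16.134.2,61",
  "host2,172.16.135.1,172.16.134.2,64",
  "leaf2,172.16.135.1,172.16.134.2,63",
  "host1,172.16.135.1,172.16.134.2,61"
]

# higher TTL means closer to the sender
path = sorted(events, key=lambda e: int(e.split(',')[-1]), reverse=True)
print(" -> ".join(e.split(',')[0] for e in path))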

Monitoring resources on the physical switches is critical to large scale IPVLAN deployments since the number of routes must be kept within the table sizes supported by the hardware. Broadcom ASIC table utilization metrics, DevOps, and SDN describes how sFlow exports hardware routing table metrics.

The use of Cumulus Linux greatly simplified the configuration of the ECMP fabric and allowed a common set of routing, monitoring, and orchestration software to be used for both hosts and switches. A common set of tools dramatically simplifies the task of configuring, testing, and managing end to end Docker networking.

Finally, end to end IP routing using the new Docker IPVLAN networking plugin is very promising. Eliminating overlays and bridging improves scalability, reduces operational complexity, and facilitates automation using a mature routing control plane (BGP).

Monday, June 6, 2016

Streaming telemetry

The OpenConfig project has been getting a lot of attention lately. A number of large network operators, led by Google, are developing "a consistent set of vendor-neutral data models (written in YANG) based on actual operational needs from use cases and requirements from multiple network operators."

The OpenConfig project extends beyond configuration, "Streaming telemetry is a new paradigm for network monitoring in which data is streamed from devices continuously with efficient, incremental updates. Operators can subscribe to the specific data items they need, using OpenConfig data models as the common interface."

Anees Shaikh's Network Field Day talk provides an overview of OpenConfig and includes an example that demonstrates how configuration and state are combined in a single YANG data model. In the example, read/write config attributes used to configure a network interface (name, description, MTU, operational state) are combined with the state attributes needed to verify the configuration (MTU, name, description, oper-status, last-change) and collect metrics (in-octets, in-ucast-pkts, in-broadcast-pkts, ...).

Anees positions OpenConfig streaming telemetry mechanism as an attractive alternative to polling for metrics using Simple Network Management Protocol (SNMP) - see Push vs Pull for a detailed comparison between pushing (streaming) and pulling (polling) metrics.

Streaming telemetry is not unique to OpenConfig. Industry standard sFlow is a streaming telemetry alternative to SNMP that has seen rapid vendor adoption over the last decade. Drivers for growth discusses how the rise of merchant silicon and virtualization have accelerated adoption of sFlow, particularly in data centers.

                 sFlow                       OpenConfig Telemetry              SNMP
Organization     sFlow.org                   OpenConfig.net                    IETF
Users            General Purpose             Large Service Providers           General Purpose
Scope            Data Plane, Control Plane   Management Plane, Control Plane   Control Plane
Vendor Support   40+ (see sFlow.org)         1 (Cisco IOS XR)                  Near universal
Models           structure definitions       YANG models                       Management Information Base
Encoding         XDR (RFC 4506)              protobufs, JSON, NetConf          ASN.1
Transport        UDP                         UDP, HTTP                         UDP
Mode             Push                        Push                              Pull

The table compares sFlow and OpenConfig Telemetry. There are a number of similarities: sFlow and OpenConfig are both driven by participation based organizations that publish standards to ensure multi-vendor interoperability, and both push standard sets of metrics using standard encodings over widely supported transport protocols.

However, important differences result from OpenConfig's exclusive focus on large service provider configuration and monitoring requirements. Telemetry is tied to the hierarchical YANG configuration models, making it easy to correlate operational and configured state, but limiting the scope of monitoring to the management and control planes of devices that are configured using OpenConfig.

In contrast, sFlow is a management and control plane agnostic monitoring technology (i.e. a device may be configured using CLI, NetConf, JSON RPC, OpenConfig, etc. to use any control plane: OpenFlow, BGP, OSPF, TRILL, spanning tree, etc.). In addition, sFlow is primarily concerned with the data and control planes, i.e. capturing information about packets and forwarding actions.
The article, Management, control and data planes in network devices and systems, by Ivan Pepelnjak, provides background on data, control, and management plane terminology.
Gathering data plane telemetry requires hardware support and merchant silicon vendors (Broadcom, Cavium/XPliant, Intel/Fulcrum, Marvell etc.) include sFlow instrumentation in their switching/routing ASICs. Embedded hardware support allows sFlow to efficiently stream standardized data plane telemetry from all devices in a large multi-vendor network.

To conclude, sFlow and OpenConfig shouldn't be viewed as competing technologies. Instead, their complementary capabilities and shared architectural model make it easy to combine sFlow and OpenConfig into an integrated management solution that unifies monitoring and control of the management, control, and data planes.

Friday, June 3, 2016

Internet of Things (IoT) telemetry

The internet of things (IoT) is the network of physical objects—devices, vehicles, buildings and other items—embedded with electronics, software, sensors, and network connectivity that enables these objects to collect and exchange data. - ITU

The recently released Raspberry Pi Zero (costing $5) is an example of the type of embedded low power computer enabling IoT. These small devices are typically wired to one or more sensors (measuring temperature, humidity, location, acceleration, etc.) and embedded in or attached to physical devices.

Collecting real-time telemetry from large numbers of small devices that may be located within many widely dispersed administrative domains poses a number of challenges, for example:
  • Discovery - How are newly connected devices discovered?
  • Configuration - How can the numerous individual devices be efficiently configured?
  • Transport - How efficiently are measurements transported and delivered?
  • Latency - How long does it take before measurements are remotely accessible? 
This article will use the Raspberry Pi as an example to explore how the architecture of the industry standard sFlow protocol and its implementation in the open source Host sFlow agent provide a method of addressing the challenges of embedded device monitoring.

The following steps describe how to install the Host sFlow agent on Raspbian Jessie (the Debian Linux based Raspberry Pi operating system).
sudo apt-get update
sudo apt-get install libpcap-dev
git clone https://github.com/sflow/host-sflow
cd host-sflow
make
sudo make install
The resulting Host sFlow binary is extremely small (only 163,300 bytes in this case):
pi@raspberrypi:~ $ ls -l /usr/sbin/hsflowd 
-rwx------ 1 root root 163300 Jun  1 17:18 /usr/sbin/hsflowd
Next, create the /etc/hsflowd.conf file for the device:
sflow {
  agent = eth0
  agent.cidr=::/0
  DNSSD = on
  DNSSD_domain = .sf.inmon.com
  jsonPort = 36343
  pcap { dev = eth0 }
}
There are a number of important points to note about this configuration:
  • The configuration is not device specific - this same configuration can be pre-loaded in every device.
  • Prefer IPv6 addresses as a way of identifying the agent since they are more likely to be globally unique.
  • DNS Service Discovery (DNS-SD) is used to retrieve dynamic configuration on startup and to periodically refresh the configuration. Hosting a single copy of the configuration (in the form of SRV and TXT records on the DNS server responsible for the sf.inmon.com domain) minimizes the complexity of managing large numbers of devices.
  • Network visibility provides a way to monitor the interactions between the devices on the network. The pcap entry enables a Berkeley Packet Filter to efficiently sample network traffic using instrumentation built into the Linux kernel.
  • Custom Metrics can be sent along with the extensive set of standard sFlow metrics by including the jsonPort entry.
Now start the daemon:
sudo /etc/init.d/hsflowd start
Now add an entry to the sf.inmon.com.zone file on the DNS server:
_sflow._udp   30  SRV     0 0 6343  collector.sf.inmon.com.
In this case, the SRV record specifies that sFlow records should be sent via UDP to collector.sf.inmon.com on port 6343. The TTL is set to 30 seconds so that agents will pick up any changes within 30 seconds. A larger TTL should be used to improve scalability if there are large numbers of devices.

The following example shows how Custom Metrics can be used to export sensor data. The temp.py script exports the CPU temperature:
#!/usr/bin/env python

import json
import socket

tempC = int(open('/sys/class/thermal/thermal_zone0/temp').read()) / 1e3
msg = {
  "rtmetric": {
    "datasource": "sensors",
    "tempC": { "type": "gaugeFloat", "value": tempC }
  }
}
sock = socket.socket(socket.AF_INET,socket.SOCK_DGRAM)
sock.sendto(json.dumps(msg),("127.0.0.1",36343))
The following crontab entry runs the script every minute:
* * * * * /home/pi/temp.py
The Host sFlow agent will automatically pick up the configuration via a DNS request and start making measurements, which are immediately sent in standard sFlow UDP datagrams to the designated sFlow collector, collector.sf.inmon.com. sFlow's immediate transmission of measurements minimizes the memory requirements on the agent (since data doesn't have to be stored for later retrieval) and minimizes the latency before measurements are accessible on the collector (and can be acted on).

It should also be noted that all communication is initiated by the device (DNS requests and transmission of telemetry via sFlow). This means that the radio on the device can be powered down between transmissions to save power (and extend battery life if the device is battery powered).
Raspberry Pi real-time network analytics describes how to build a low cost sFlow analyzer using a Raspberry Pi model 3 b and sFlow-RT real-time analytics software. The following command queries the sFlow-RT REST API to show the set of standard metrics being exported by the agent (2001:470:67:27d:d811:aa7e:9e54:30e9):
pi@raspberrypi:~ $ curl http://localhost:8008/metric/2001:470:67:27d:d811:aa7e:9e54:30e9/json
{
 "2.1.bytes_in": 3211.5520193372945,
 "2.1.bytes_out": 462.2822036458858,
 "2.1.bytes_read": 0,
 "2.1.bytes_written": 4537.818511431161,
 "2.1.contexts": 7006.546480008057,
 "2.1.cpu_guest": 0,
 "2.1.cpu_guest_nice": 0,
 "2.1.cpu_idle": 99.17638114546376,
 "2.1.cpu_intr": 0,
 "2.1.cpu_nice": 0,
 "2.1.cpu_num": 4,
 "2.1.cpu_sintr": 0.025342118601115054,
 "2.1.cpu_speed": 0,
 "2.1.cpu_steal": 0,
 "2.1.cpu_system": 0.456158134820071,
 "2.1.cpu_user": 0.3294475418144957,
 "2.1.cpu_utilization": 0.8236188545362393,
 "2.1.cpu_wio": 0.012671059300557527,
 "2.1.disk_free": 24435570688,
 "2.1.disk_total": 29627484160,
 "2.1.disk_utilization": 17.523977160453796,
 "2.1.drops_in": 0,
 "2.1.drops_out": 0,
 "2.1.errs_in": 0,
 "2.1.errs_out": 0,
 "2.1.host_name": "raspberrypi",
 "2.1.icmp_inaddrmaskreps": 0,
 "2.1.icmp_inaddrmasks": 0,
 "2.1.icmp_indestunreachs": 0,
 "2.1.icmp_inechoreps": 0,
 "2.1.icmp_inechos": 0,
 "2.1.icmp_inerrors": 0,
 "2.1.icmp_inmsgs": 0,
 "2.1.icmp_inparamprobs": 0,
 "2.1.icmp_inredirects": 0,
 "2.1.icmp_insrcquenchs": 0,
 "2.1.icmp_intimeexcds": 0,
 "2.1.icmp_intimestamps": 0,
 "2.1.icmp_outaddrmaskreps": 0,
 "2.1.icmp_outaddrmasks": 0,
 "2.1.icmp_outdestunreachs": 0,
 "2.1.icmp_outechoreps": 0,
 "2.1.icmp_outechos": 0,
 "2.1.icmp_outerrors": 0,
 "2.1.icmp_outmsgs": 0,
 "2.1.icmp_outparamprobs": 0,
 "2.1.icmp_outredirects": 0,
 "2.1.icmp_outsrcquenchs": 0,
 "2.1.icmp_outtimeexcds": 0,
 "2.1.icmp_outtimestampreps": 0,
 "2.1.icmp_outtimestamps": 0,
 "2.1.interrupts": 4438.56380300131,
 "2.1.ip_defaultttl": 64,
 "2.1.ip_forwarding": 2,
 "2.1.ip_forwdatagrams": 0,
 "2.1.ip_fragcreates": 0,
 "2.1.ip_fragfails": 0,
 "2.1.ip_fragoks": 0,
 "2.1.ip_inaddrerrors": 0,
 "2.1.ip_indelivers": 9.165072011280088,
 "2.1.ip_indiscards": 0,
 "2.1.ip_inhdrerrors": 0,
 "2.1.ip_inreceives": 9.215429549803606,
 "2.1.ip_inunknownprotos": 0,
 "2.1.ip_outdiscards": 0,
 "2.1.ip_outnoroutes": 0,
 "2.1.ip_outrequests": 2.1653741565112297,
 "2.1.ip_reasmfails": 0,
 "2.1.ip_reasmoks": 0,
 "2.1.ip_reasmreqds": 0,
 "2.1.ip_reasmtimeout": 0,
 "2.1.load_fifteen": 0.05,
 "2.1.load_fifteen_per_cpu": 0.0125,
 "2.1.load_five": 0.02,
 "2.1.load_five_per_cpu": 0.005,
 "2.1.load_one": 0,
 "2.1.load_one_per_cpu": 0,
 "2.1.machine_type": "arm",
 "2.1.mem_buffers": 52133888,
 "2.1.mem_cached": 383287296,
 "2.1.mem_free": 238026752,
 "2.1.mem_shared": 0,
 "2.1.mem_total": 970506240,
 "2.1.mem_used": 297058304,
 "2.1.mem_utilization": 30.608591437339783,
 "2.1.os_name": "linux",
 "2.1.os_release": "4.4.9-v7+",
 "2.1.page_in": 0,
 "2.1.page_out": 2.2157316950347465,
 "2.1.part_max_used": 31.74,
 "2.1.pkts_in": 11.582233860408904,
 "2.1.pkts_out": 2.266089233558264,
 "2.1.proc_run": 1,
 "2.1.proc_total": 237,
 "2.1.read_time": 0,
 "2.1.reads": 0,
 "2.1.swap_free": 104853504,
 "2.1.swap_in": 0,
 "2.1.swap_out": 0,
 "2.1.swap_total": 104853504,
 "2.1.tcp_activeopens": 0,
 "2.1.tcp_attemptfails": 0,
 "2.1.tcp_currestab": 6,
 "2.1.tcp_estabresets": 0,
 "2.1.tcp_incsumerrs": 0,
 "2.1.tcp_inerrs": 0,
 "2.1.tcp_insegs": 2.568234464699365,
 "2.1.tcp_maxconn": 4294967295,
 "2.1.tcp_outrsts": 0,
 "2.1.tcp_outsegs": 1.913586463893645,
 "2.1.tcp_passiveopens": 0,
 "2.1.tcp_retranssegs": 0,
 "2.1.tcp_rtoalgorithm": 1,
 "2.1.tcp_rtomax": 120000,
 "2.1.tcp_rtomin": 200,
 "2.1.udp_incsumerrors": 0,
 "2.1.udp_indatagrams": 6.596837546580723,
 "2.1.udp_inerrors": 0,
 "2.1.udp_noports": 0,
 "2.1.udp_outdatagrams": 0.2517876926175849,
 "2.1.udp_rcvbuferrors": 0,
 "2.1.udp_sndbuferrors": 0,
 "2.1.uptime": 46603,
 "2.1.uuid": "22c4ce8c-067e-4517-8c00-8d822efc4897",
 "2.1.write_time": 8.333333333333334,
 "2.1.writes": 0.6042904622822036,
 "2.ifadminstatus": "up",
 "2.ifdirection": "full-duplex",
 "2.ifindex": "2",
 "2.ifindiscards": 0,
 "2.ifinerrors": 0,
 "2.ifinoctets": 3241.1101037574294,
 "2.ifinpkts": 11.483831973405863,
 "2.ifinucastpkts": 11.483831973405863,
 "2.ifinutilization": 0.025928880830059432,
 "2.ifname": "eth0",
 "2.ifoperstatus": "up",
 "2.ifoutdiscards": 0,
 "2.ifouterrors": 0,
 "2.ifoutoctets": 406.6686813740304,
 "2.ifoutpkts": 1.8636043114737586,
 "2.ifoutucastpkts": 1.8636043114737586,
 "2.ifoututilization": 0.0032533494509922435,
 "2.ifspeed": 100000000,
 "2.iftype": "ethernetCsmacd",
 "sensors.tempC": 49.388
}
Note the custom temperature metric at the end of the list.
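The custom gauge can also be retrieved programmatically; a minimal sketch, assuming the metric name can be appended to the /metric/<agent>/ path used in the query above and that each returned entry carries metricName and metricValue fields:
#!/usr/bin/env python
import requests

rt = 'http://127.0.0.1:8008'
agent = '2001:470:67:27d:d811:aa7e:9e54:30e9'

# query just the custom temperature gauge exported by temp.py
r = requests.get('%s/metric/%s/sensors.tempC/json' % (rt, agent))
for m in r.json():
  print("%s %s" % (m['metricName'], m['metricValue']))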

In addition, enabling traffic monitoring in the Host sFlow agent provides detailed flow information along with the metrics to provide visibility into interactions between the devices on the network. In this case the wired Ethernet interface (eth0) is being monitored, but monitoring the wireless interface (wlan0) would be a way to gain visibility into messages exchanged over an ad-hoc wireless mesh network connecting devices. RESTflow describes how to perform flow analytics using sFlow-RT.
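As a starting point, the following sketch follows the same REST API pattern as the earlier scripts to define a flow keyed on source and destination address and then read back the largest current conversations between devices (the /activeflows query and its maxFlows parameter are assumed here):
#!/usr/bin/env python
import requests
import json

rt = 'http://127.0.0.1:8008'

# track traffic between devices by source and destination address
flow = {'keys':'ipsource,ipdestination','value':'bytes'}
requests.put(rt+'/flow/iot/json',data=json.dumps(flow))

# print the current top conversations observed in the packet samples
r = requests.get(rt+'/activeflows/ALL/iot/json?maxFlows=5')
for f in r.json():
  print("%s %s" % (f['key'], f['value']))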

In conclusion, sFlow provides a standard way to export metrics and traffic information. Most network equipment vendors already provide sFlow support and the technology has a number of architectural features that are well suited to addressing the challenges of extending visibility to and gathering telemetry from large scale IoT deployments.

Thursday, June 2, 2016

OVS Orbit podcast with Ben Pfaff

OVS Orbit Episode 6 is a wide ranging discussion between Ben Pfaff and Peter Phaal covering the industry standard sFlow measurement protocol, the implementation of sFlow in Open vSwitch, and the network analytics use cases and application areas supported by sFlow, including: OpenStack, Open Network Virtualization (OVN), DDoS mitigation, ECMP load balancing, Elephant and Mice flows, Docker containers, Network Function Virtualization (NFV), and microservices.

Follow the link to listen to the podcast, read the extensive show notes, follow related links, and subscribe to the podcast.

Wednesday, June 1, 2016

Raspberry Pi real-time network analytics

The Raspberry Pi model 3b is not much bigger than a credit card, costs $35, runs Linux, and has 1G of RAM and a powerful 4 core 64 bit ARM processor. This article will demonstrate how to turn the Raspberry Pi into a Terabit/second real-time network analytics engine capable of monitoring hundreds of switches and thousands of switch ports.
The diagram shows how the sFlow-RT real-time analytics engine receives a continuous telemetry stream from industry standard sFlow instrumentation built into network, server and application infrastructure, and delivers analytics through APIs that can easily be integrated with a wide variety of on-site and cloud orchestration, DevOps, and Software Defined Networking (SDN) tools.
A future article will examine how the Host sFlow agent can be used to efficiently stream measurements from large numbers of inexpensive Raspberry Pi devices ($5 for model Zero) to the sFlow-RT collector to monitor and control the "Internet of Things" (IoT).
The following instructions show how to install sFlow-RT on Raspbian Jessie (the Debian Linux based Raspberry Pi operating system).
wget http://www.inmon.com/products/sFlow-RT/sflow-rt_2.0-1092.deb
sudo dpkg -i --ignore-depends=openjdk-7-jre-headless sflow-rt_2.0-1092.deb
We are ignoring the dependency on openjdk and will use the default Raspbian Java 1.8 version instead.

Next, edit /usr/local/sflow-rt/conf.d/sflow-rt.jvm and replace the default settings with the following:
-Xms600M
-Xmx600M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSIncrementalMode
These new settings reduce the requested memory to fit within the 1G of RAM on the Raspberry Pi and leave some memory for system tasks. The G1GC garbage collector is not available in the ARM Java implementation, so we will use incremental concurrent mark and sweep instead.

Start the sFlow-RT daemon:
sudo service sflow-rt start
The sFlow-RT web interface should now be accessible at http://<raspberrypi_ip>:8008/

Finally, Agents provides information on configuring devices to send sFlow to the Raspberry Pi analyzer. Visit the http://<raspberrypi_ip>:8008/agents/html page to verify that data is being received.
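The same check can be scripted; a minimal sketch, assuming the /agents/json counterpart of the web page above returns a map keyed by agent address:
#!/usr/bin/env python
import requests

rt = 'http://127.0.0.1:8008'

# list the agents currently streaming sFlow to this collector
agents = requests.get(rt+'/agents/json').json()
for addr in agents:
  print(addr)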

Writing Applications provides an overview of the sFlow-RT APIs. For example, run the following Python script on the Raspberry Pi to log traffic flows that exceed 100Mbits/second:
#!/usr/bin/env python
import requests
import json

rt = 'http://127.0.0.1:8008'

flow = {'keys':'ipsource,ipdestination','value':'bytes'}
requests.put(rt+'/flow/pair/json',data=json.dumps(flow))

threshold = {'metric':'pair','value':100000000/8,'byFlow':True,'timeout':1}
requests.put(rt+'/threshold/elephant/json',data=json.dumps(threshold))

eventurl = rt+'/events/json?thresholdID=elephant&maxEvents=10&timeout=60'
eventID = -1
while 1 == 1:
  r = requests.get(eventurl + "&eventID=" + str(eventID))
  if r.status_code != 200: break
  events = r.json()
  if len(events) == 0: continue

  eventID = events[0]["eventID"]
  events.reverse()
  for e in events:
    print e['flowKey']
In addition, there are a number of open source sFlow-RT Applications available on the Downloads page (e.g. Mininet dashboard, DDoS mitigation, etc.) and articles describing use cases for sFlow-RT on this blog.