Wednesday, June 8, 2016

Docker networking with IPVLAN and Cumulus Linux

Macvlan and Ipvlan Network Drivers are being added as Docker networking options. The IPVlan L3 Mode shown in the diagram is particularly interesting since it dramatically simplifies the network by extending routing to the hosts and eliminating switching entirely.

Eliminating the complexity associated with switching broadcast domains, VLANs, spanning tree, etc. allows a purely routed network to be easily scaled to very large sizes. However, there are some challenges to overcome:
IPVlan requires routes to be distributed to each endpoint. The driver only builds the IPVlan L3 mode port and attaches the container to the interface; route distribution throughout a cluster is beyond the scope of this initial, single-host scoped driver. In L3 mode, the Docker host acts much like a router, originating new networks for its containers. These are networks that the upstream network will not know about without route distribution.
Cumulus Networks has been working to simplify routing in the ECMP leaf and spine networks and the white paper Routing on the Host: An Introduction shows how the routing configuration used on Cumulus Linux can be extended to the hosts.

Update June 2, 2016: Routing on the Host contains packaged versions of the Cumulus Quagga daemon for Ubuntu, Red Hat and Docker.
This article explores the combination of Cumulus Linux networking with Docker IPVLAN using a simple test bed built from free software: VirtualBox, Cumulus VX switches, and Ubuntu 16.04 servers. The result should be a simple networking solution for Docker that is easy to manage and monitor, since all the switches and servers run Linux, allowing the same routing, monitoring, and orchestration software to be used throughout.
Using Cumulus VX with VirtualBox and Creating a Two-Spine, Two-Leaf Topology provide detailed instructions on building and configuring the leaf and spine network shown in the diagram. However, BGP was configured as the routing protocol instead of OSPF (see BGP configuration made simple with Cumulus Linux). For example, the following commands configure BGP on leaf1:
interface swp1
 ipv6 nd ra-interval 10
 link-detect
!
interface swp2
 ipv6 nd ra-interval 10
 link-detect
!
interface swp3
 ipv6 nd ra-interval 10
 link-detect
!
router bgp 65130
 bgp router-id 192.168.0.130
 bgp bestpath as-path multipath-relax
 neighbor EBGP peer-group
 neighbor EBGP remote-as external
 neighbor swp1 interface peer-group EBGP
 neighbor swp1 capability extended-nexthop
 neighbor swp2 interface peer-group EBGP
 neighbor swp2 capability extended-nexthop
 neighbor swp3 interface peer-group EBGP
 neighbor swp3 capability extended-nexthop
Auto-configured IPv6 link-local addresses dramatically simplify the configuration of equal-cost multi-path (ECMP) routing, eliminating the need to assign IP addresses and subnets to the routed switch ports. The simplified configuration is easy to template: all the switches have a very similar configuration, and an orchestration tool such as Puppet, Chef, Ansible, or Salt can be used to automate the process of configuring them.
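As a rough illustration, the BGP stanza could be generated from a Jinja2-style template (a sketch; the asn, router_id, and fabric_ports variables are hypothetical and would be filled in per switch by the orchestration tool):
router bgp {{ asn }}
 bgp router-id {{ router_id }}
 bgp bestpath as-path multipath-relax
 neighbor EBGP peer-group
 neighbor EBGP remote-as external
{% for port in fabric_ports %}
 neighbor {{ port }} interface peer-group EBGP
 neighbor {{ port }} capability extended-nexthop
{% endfor %}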
Two Ubuntu 16.04 hosts were created, attached to leaf1 and leaf2 respectively. Each server has two network adapters: enp0s3 connected to the out of band network used to manage the switches and enp0s8 connected to the respective leaf switch.
Why the strange Ethernet interface names (e.g. enp0s3 instead of eth0)? They are the result of the predictable network interface names mechanism that is the default in Ubuntu 16.04. I can't say I am a fan: predictable interface names are difficult to read, push the problem of device naming up the stack, and make it difficult to write portable orchestration scripts.
Quagga is used for BGP routing on the Cumulus Linux switches and on the Ubuntu hosts. The host BGP configurations are virtually identical to the switch configurations, but with an added redistribute stanza to automatically advertise locally attached addresses and subnets:
redistribute connected route-map IPVLAN
The IPVLAN route-map is used to control the routes that are advertised, limiting them to the range of addresses that have been allocated to the IPVLAN orchestration system. Route filtering is an example of the flexibility that BGP brings as a routing protocol: top of rack switches can filter routes advertised by the hosts to protect the fabric from misconfigured hosts, and hosts can be configured to selectively advertise routes.
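For example, if the IPVLAN orchestration system allocates addresses from 172.16.0.0/16 (an assumption, consistent with the subnets used later in this article), the route-map might look something like this:
ip prefix-list IPVLAN seq 10 permit 172.16.0.0/16 le 24
!
route-map IPVLAN permit 10
 match ip address prefix-list IPVLAN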

At this point, configuring routes between the hosts is easy. Configure a network on host1:
user@host1:~$ sudo ip address add 172.16.134.1/24 dev enp0s8
And a route immediately appears on host2:
user@host2:~$ ip route
default via 10.0.0.254 dev enp0s3 onlink 
10.0.0.0/24 dev enp0s3  proto kernel  scope link  src 10.0.0.135 
172.16.134.0/24 via 169.254.0.1 dev enp0s8  proto zebra  metric 20 onlink 
Add a network to host2:
user@host2:~$ sudo ip address add 172.16.135.1/24 dev enp0s8
And it appears on host1:
cumulus@host1:~$ ip route
default via 10.0.0.254 dev enp0s3 onlink 
10.0.0.0/24 dev enp0s3  proto kernel  scope link  src 10.0.0.134 
172.16.134.0/24 dev enp0s8  proto kernel  scope link  src 172.16.134.1 
172.16.135.0/24 via 169.254.0.1 dev enp0s8  proto zebra  metric 20 onlink
Connectivity across the leaf and spine fabric can be verified with a ping test:
user@host2:~$ ping 172.16.134.1
PING 172.16.134.1 (172.16.134.1) 56(84) bytes of data.
64 bytes from 172.16.134.1: icmp_seq=1 ttl=61 time=2.60 ms
64 bytes from 172.16.134.1: icmp_seq=2 ttl=61 time=2.59 ms
Ubuntu 16.04 was selected as the host operating system since its kernel has built-in IPVLAN support. The Docker Experimental Features distribution, which includes the IPVLAN Docker networking plugin, is installed on the two hosts.

The following command creates a Docker IPVLAN network on host1:
user@host1:~$ docker network create -d ipvlan --subnet=172.16.134.0/24 \
-o parent=enp0s8 -o ipvlan_mode=l3 ipv3
Note: Route(s) to the IPVLAN network would be automatically distributed if the above command had an option to attach the subnets to the parent interface.

The following commands start a container attached to the network and show the network settings as seen by the container:
user@host1:~$ docker run --net=ipv3 -it --rm alpine /bin/sh
/ # ifconfig
eth0      Link encap:Ethernet  HWaddr 08:00:27:70:FA:B5  
          inet addr:172.16.134.2  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe70:fab5%32582/64 Scope:Link
          UP BROADCAST RUNNING NOARP MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:2 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1%32582/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

/ # ip route
default dev eth0 
172.16.134.0/24 dev eth0  src 172.16.134.2 
Connectivity between containers attached to the same IPVLAN can be verified by starting a second container and performing a ping test:
user@host1:~$ docker run --net=ipv3 -it --rm alpine /bin/sh
/ # ping 172.16.134.2
PING 172.16.134.2 (172.16.134.2): 56 data bytes
64 bytes from 172.16.134.2: seq=0 ttl=64 time=0.085 ms
64 bytes from 172.16.134.2: seq=1 ttl=64 time=0.077 ms
Unfortunately, connecting containers to the IPVLAN network interferes with the auto-configured IPv6 link-local BGP session between host1 and leaf1, disconnecting host1 from the leaf and spine fabric and cutting off connectivity to host2. Assigning static IPv4 addresses to leaf1 and host1 complicates the configuration but solves the problem. For example, here is the Quagga BGP configuration on host1:
router bgp 65134
 bgp router-id 192.168.0.134
 redistribute connected route-map NON-MGMT
 neighbor 192.168.1.130 remote-as 65130
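For completeness, leaf1 also needs a static IPv4 address on the port facing host1 and a matching neighbor statement. The following is only a sketch, assuming host1 attaches to swp3 and is numbered 192.168.1.134 (both assumptions, chosen to be consistent with the neighbor address above). The port address is set using ifupdown2 on Cumulus Linux:
auto swp3
iface swp3
    address 192.168.1.130/24
and the corresponding Quagga BGP configuration on leaf1:
router bgp 65130
 neighbor 192.168.1.134 remote-as 65134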
Now it is possible to connect to the containers remotely from host2:
user@host2:~$ ping 172.16.134.2
PING 172.16.134.2 (172.16.134.2) 56(84) bytes of data.
64 bytes from 172.16.134.2: icmp_seq=1 ttl=61 time=2.72 ms
64 bytes from 172.16.134.2: icmp_seq=2 ttl=61 time=2.79 ms
Now that we have end to end connectivity, how can we monitor traffic?
Installing the open source Host sFlow agent on the switches and Docker hosts provides a unified stream of standard sFlow measurements from every element in the test bed.
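Each agent is pointed at the analytics collector through /etc/hsflowd.conf. A minimal sketch (the collector address 10.0.0.162 is just an example, and the exact settings vary with the Host sFlow version):
sflow {
  DNSSD = off
  sampling = 400
  polling = 20
  collector {
    ip = 10.0.0.162
    udpport = 6343
  }
}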
The sFlow-RT real-time analytics software can be installed on a host, or spun up in a container using a simple Dockerfile:
FROM   centos:centos6
RUN    yum install -y wget
RUN    yum install -y java-1.7.0-openjdk
RUN    wget http://www.inmon.com/products/sFlow-RT/sflow-rt.tar.gz
RUN    tar -xzf sflow-rt.tar.gz
EXPOSE 8008 6343/udp
CMD    ./sflow-rt/start.sh
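Building and running the image is then straightforward (the sflow-rt image name is arbitrary; port 8008 serves the REST API and 6343/udp receives sFlow):
$ docker build -t sflow-rt .
$ docker run -d -p 8008:8008 -p 6343:6343/udp sflow-rt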
Update July 22, 2016: sFlow-RT can now be run from Docker Hub, see sflow/sflow-rt for instructions.

Writing Applications provides an overview of sFlow-RT's REST API. The following Python script demonstrates end to end visibility by reporting the specific path that a large "Elephant" flow takes across the leaf and spine fabric:
#!/usr/bin/env python
import requests
import json

rt = 'http://127.0.0.1:8008'

# define a flow to track bytes by observation point, source, destination and IP TTL
flow = {'keys':'node:inputifindex,ipsource,ipdestination,ipttl','value':'bytes'}
requests.put(rt+'/flow/elephant/json',data=json.dumps(flow))

# raise an event when a flow exceeds 100Mbit/s (the value is expressed in bytes per second)
threshold = {'metric':'elephant','value':100000000/8,'byFlow':True,'timeout':600}
requests.put(rt+'/threshold/elephant/json',data=json.dumps(threshold))

# long poll for threshold events, printing the flow keys as they arrive
eventurl = rt+'/events/json?thresholdID=elephant&maxEvents=10&timeout=60'
eventID = -1
while True:
  r = requests.get(eventurl + "&eventID=" + str(eventID))
  if r.status_code != 200: break
  events = r.json()
  if len(events) == 0: continue

  eventID = events[0]["eventID"]
  events.reverse()
  for e in events:
    print e['flowKey']
Running the script and generating a large flow from host1 (172.16.135.1) to an IPVLAN connected container on host2 (172.16.134.2) gives the following output:
$ ./elephant.py 
spine2,172.16.135.1,172.16.134.2,62
leaf1,172.16.135.1,172.16.134.2,61
host2,172.16.135.1,172.16.134.2,64
leaf2,172.16.135.1,172.16.134.2,63
host1,172.16.135.1,172.16.134.2,61
The last field is the IP TTL (time to live). Ordering the events by TTL shows the path across the network (since the TTL is decremented by each switch along the path):
host2 -> leaf2 -> spine2 -> leaf1 -> host1

Monitoring resources on the physical switches is critical to large scale IPVLAN deployments since the number of routes must be kept within the table sizes supported by the hardware. Broadcom ASIC table utilization metrics, DevOps, and SDN describes how sFlow exports hardware routing table metrics.

The use of Cumulus Linux greatly simplified the configuration of the ECMP fabric and allowed a common set of routing, monitoring, and orchestration software to be used for both hosts and switches. A common set of tools dramatically simplifies the task of configuring, testing, and managing end to end Docker networking.

Finally, end to end IP routing using the new Docker IPVLAN networking plugin is very promising. Eliminating overlays and bridging improves scalability, reduces operational complexity, and facilitates automation using a mature routing control plane (BGP).
