Eliminating the complexity associated with switching broadcast domains, VLANs, spanning tree, etc. allows a purely routed network to be easily scaled to very large sizes. However, there are some challenges to overcome:
IPVlan will require routes to be distributed to each endpoint. The driver only builds the Ipvlan L3 mode port and attaches the container to the interface. Route distribution throughout a cluster is beyond the initial implementation of this single host scoped driver. In L3 mode, the Docker host is very similar to a router starting new networks in the container. They are on networks that the upstream network will not know about without route distribution.

Cumulus Networks has been working to simplify routing in the ECMP leaf and spine networks and the white paper Routing on the Host: An Introduction shows how the routing configuration used on Cumulus Linux can be extended to the hosts.
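To make the route distribution requirement concrete, consider the alternative: without a routing protocol running on the hosts, every upstream switch would need a static route for each container subnet. A hypothetical Quagga static route on a top of rack switch might look like the following (the subnet and next hop address are illustrative, not taken from this test bed):

ip route 172.16.134.0/24 192.168.1.134

Routing on the host replaces this per-subnet manual step with automatic advertisement over BGP.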
Update June 2, 2016: Routing on the Host contains packaged versions of the Cumulus Quagga daemon for Ubuntu, Red Hat, and Docker.
This article explores the combination of Cumulus Linux networking with Docker IPVLAN using a simple test bed built with free software: VirtualBox, Cumulus VX switches, and Ubuntu 16.04 servers. The result should be a simple, easy to manage, easy to monitor networking solution for Docker, since all the switches and servers run Linux, allowing the same routing, monitoring, and orchestration software to be used throughout.
Using Cumulus VX with VirtualBox and Creating a Two-Spine, Two-Leaf Topology provide detailed instructions on building and configuring the leaf and spine network shown in the diagram. However, BGP was configured as the routing protocol instead of OSPF (see BGP configuration made simple with Cumulus Linux). For example, the following commands configure BGP on leaf1:
interface swp1
 ipv6 nd ra-interval 10
 link-detect
!
interface swp2
 ipv6 nd ra-interval 10
 link-detect
!
interface swp3
 ipv6 nd ra-interval 10
 link-detect
!
router bgp 65130
 bgp router-id 192.168.0.130
 bgp bestpath as-path multipath-relax
 neighbor EBGP peer-group
 neighbor EBGP remote-as external
 neighbor swp1 interface peer-group EBGP
 neighbor swp1 capability extended-nexthop
 neighbor swp2 interface peer-group EBGP
 neighbor swp2 capability extended-nexthop
 neighbor swp3 interface peer-group EBGP
 neighbor swp3 capability extended-nexthop

Auto-configured IPv6 link local addresses dramatically simplify the configuration of equal cost multi-path (ECMP) routing, eliminating the need to assign IP addresses and subnets to the routed switch ports. The simplified configuration is easy to template: all the switches have a very similar configuration, and an orchestration tool like Puppet, Chef, Ansible, or Salt can be used to automate the process of configuring the switches.
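Since only the AS number, router ID, and list of fabric-facing ports vary between switches, the BGP stanza lends itself to templating. The fragment below is a hypothetical Jinja2 sketch (the variable names are illustrative, not from the articles referenced above) of the kind of template an Ansible role might render for each switch:

router bgp {{ asn }}
 bgp router-id {{ router_id }}
 bgp bestpath as-path multipath-relax
 neighbor EBGP peer-group
 neighbor EBGP remote-as external
{% for port in fabric_ports %}
 neighbor {{ port }} interface peer-group EBGP
 neighbor {{ port }} capability extended-nexthop
{% endfor %}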
Two Ubuntu 16.04 hosts were created, attached to leaf1 and leaf2 respectively. Each server has two network adapters: enp0s3 connected to the out of band network used to manage the switches and enp0s8 connected to the respective leaf switch.
Why the strange Ethernet interface names (e.g. enp0s3 instead of eth0)? They are the result of the predictable network interface names mechanism that is the default in Ubuntu 16.04. I can't say I am a fan: predictable interface names are difficult to read, push the problem of device naming up the stack, and make it difficult to write portable orchestration scripts.

Quagga is used for BGP routing on the Cumulus Linux switches and on the Ubuntu hosts. The host BGP configurations are virtually identical to the switch configurations, but with an added redistribute stanza to automatically advertise locally attached addresses and subnets:
redistribute connected route-map IPVLAN

The IPVLAN route-map is used to control the routes that are advertised, limiting them to the range of addresses that have been allocated to the IPVLAN orchestration system. Route filtering is an example of the flexibility that BGP brings as a routing protocol: top of rack switches can filter routes advertised by the hosts to protect the fabric from misconfigured hosts, and hosts can be configured to selectively advertise routes.
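The route-map itself is not shown in this article, but a minimal sketch would pair it with a prefix-list covering the allocated range (172.16.0.0/16 is an assumption based on the addresses used later in this article):

ip prefix-list IPVLAN-RANGE seq 10 permit 172.16.0.0/16 le 24
!
route-map IPVLAN permit 10
 match ip address prefix-list IPVLAN-RANGE

Because a route-map ends with an implicit deny, any connected route outside the allocated range (the management network, for example) would not be advertised.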
At this point, configuring routes between the hosts is easy. Configure a network on host1:
user@host1:~$ sudo ip address add 172.16.134.1/24 dev enp0s8

And a route immediately appears on host2:
user@host2:~$ ip route
default via 10.0.0.254 dev enp0s3 onlink
10.0.0.0/24 dev enp0s3 proto kernel scope link src 10.0.0.135
172.16.134.0/24 via 169.254.0.1 dev enp0s8 proto zebra metric 20 onlink
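The route is learned over BGP and can also be inspected directly in Quagga on host2 (a sketch, assuming the vtysh shell is installed alongside the Quagga daemons):

user@host2:~$ sudo vtysh -c 'show ip bgp 172.16.134.0/24'

The output should show the advertising neighbor and the AS path the prefix took across the fabric.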
Add a network to host2:
user@host2:~$ sudo ip address add 172.16.135.1/24 dev enp0s8

And it appears on host1:
cumulus@host1:~$ ip route
default via 10.0.0.254 dev enp0s3 onlink
10.0.0.0/24 dev enp0s3 proto kernel scope link src 10.0.0.134
172.16.134.0/24 dev enp0s8 proto kernel scope link src 172.16.134.1
172.16.135.0/24 via 169.254.0.1 dev enp0s8 proto zebra metric 20 onlink
Connectivity across the leaf and spine fabric can be verified with a ping test:

user@host2:~$ ping 172.16.134.1
PING 172.16.134.1 (172.16.134.1) 56(84) bytes of data.
64 bytes from 172.16.134.1: icmp_seq=1 ttl=61 time=2.60 ms
64 bytes from 172.16.134.1: icmp_seq=2 ttl=61 time=2.59 ms

Ubuntu 16.04 was selected as the host operating system since it has built-in IPVLAN support. The Docker Experimental Features distribution, which includes the IPVLAN Docker networking plugin, is installed on the two hosts.
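A quick way to confirm the prerequisites (these checks are additions, not part of the original setup) is to verify that the kernel ships the ipvlan module and that the Docker engine reports itself as an experimental build:

user@host1:~$ modinfo ipvlan | head -n 2
user@host1:~$ docker version | grep -i experimental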
The following command creates a Docker IPVLAN network on host1:
user@host1:~$ docker network create -d ipvlan --subnet=172.16.134.0/24 \
  -o parent=enp0s8 -o ipvlan_mode=l3 ipv3

Note: Route(s) to the IPVLAN network would be automatically distributed if the above command had an option to attach the subnets to the parent interface.
The following commands start a container attached to the network and show the network settings as seen by the container:
user@host1:~$ docker run --net=ipv3 -it --rm alpine /bin/sh
/ # ifconfig
eth0      Link encap:Ethernet  HWaddr 08:00:27:70:FA:B5
          inet addr:172.16.134.2  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe70:fab5%32582/64 Scope:Link
          UP BROADCAST RUNNING NOARP MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:2 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1%32582/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

/ # ip route
default dev eth0
172.16.134.0/24 dev eth0  src 172.16.134.2

Connectivity between containers attached to the same IPVLAN can be verified by starting a second container and performing a ping test:
user@host1:~$ docker run --net=ipv3 -it --rm alpine /bin/sh
/ # ping 172.16.134.2
PING 172.16.134.2 (172.16.134.2): 56 data bytes
64 bytes from 172.16.134.2: seq=0 ttl=64 time=0.085 ms
64 bytes from 172.16.134.2: seq=1 ttl=64 time=0.077 ms

Unfortunately, connecting containers to the IPVLAN network interferes with the auto-configured IPv6 link local BGP connection between host1 and leaf1, which results in host1 being disconnected from the leaf and spine fabric, and losing connectivity to host2. Assigning static IPv4 addresses to leaf1 and host1 complicates the configuration but solves the problem. For example, here is the Quagga BGP configuration on host1:
router bgp 65134
 bgp router-id 192.168.0.134
 redistribute connected route-map NON-MGMT
 neighbor 192.168.1.130 remote-as 65130
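The 192.168.1.130 neighbor address implies a statically addressed link between leaf1 and host1. A sketch of what that might look like follows; the host-facing switch port (swp4), the /24 mask, and host1's 192.168.1.134 address are assumptions, since only the neighbor address appears in the configuration above. On host1:

user@host1:~$ sudo ip address add 192.168.1.134/24 dev enp0s8

And the corresponding Quagga configuration on leaf1:

interface swp4
 ip address 192.168.1.130/24
!
router bgp 65130
 neighbor 192.168.1.134 remote-as 65134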
Now it is possible to connect to the containers remotely from host2:

user@host2:~$ ping 172.16.134.2
PING 172.16.134.2 (172.16.134.2) 56(84) bytes of data.
64 bytes from 172.16.134.2: icmp_seq=1 ttl=61 time=2.72 ms
64 bytes from 172.16.134.2: icmp_seq=2 ttl=61 time=2.79 ms

Now that we have end to end connectivity, how can we monitor traffic?
Open source Host sFlow agents are installed on the switches and Docker hosts to provide the measurements needed for traffic monitoring.
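As an illustration of what the agent configuration involves, here is a minimal /etc/hsflowd.conf sketch pointing the agent at an sFlow collector; the collector address 10.0.0.86 is an assumption, and the sampling and polling values would be tuned for the environment:

sflow {
  DNSSD = off
  polling = 20
  sampling = 512
  collector {
    ip = 10.0.0.86
    udpport = 6343
  }
}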
The sFlow-RT real-time analytics software can be installed on a host, or spun up in a container using a simple Dockerfile:
FROM centos:centos6
RUN yum install -y wget
RUN yum install -y java-1.7.0-openjdk
RUN wget http://www.inmon.com/products/sFlow-RT/sflow-rt.tar.gz
RUN tar -xzf sflow-rt.tar.gz
EXPOSE 8008 6343/udp
CMD ./sflow-rt/start.sh

Update July 22, 2016: sFlow-RT can now be run from Docker Hub, see sflow/sflow-rt for instructions.
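With the Dockerfile above, building and launching the image is straightforward, for example (the sflow-rt image tag is arbitrary):

user@host1:~$ docker build -t sflow-rt .
user@host1:~$ docker run -d -p 8008:8008 -p 6343:6343/udp sflow-rt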
Writing Applications provides an overview of sFlow-RT's REST API. The following Python script demonstrates end to end visibility by reporting the specific path that a large "Elephant" flow takes across the leaf and spine fabric:
#!/usr/bin/env python
import requests
import json

rt = 'http://127.0.0.1:8008'

# define a flow signature: reporting node + input port, addresses, and IP TTL, measured in bytes/second
flow = {'keys':'node:inputifindex,ipsource,ipdestination,ipttl','value':'bytes'}
requests.put(rt+'/flow/elephant/json',data=json.dumps(flow))

# trigger an event when a flow exceeds 100Mbit/s (expressed in bytes per second)
threshold = {'metric':'elephant','value':100000000/8,'byFlow':True,'timeout':600}
requests.put(rt+'/threshold/elephant/json',data=json.dumps(threshold))

# long poll for threshold events and print the flow key of each one
eventurl = rt+'/events/json?thresholdID=elephant&maxEvents=10&timeout=60'
eventID = -1
while True:
  r = requests.get(eventurl + "&eventID=" + str(eventID))
  if r.status_code != 200: break
  events = r.json()
  if len(events) == 0: continue
  eventID = events[0]["eventID"]
  events.reverse()
  for e in events:
    print e['flowKey']

Running the script and generating a large flow from host1 (172.16.135.1) to an IPVLAN connected container on host2 (172.16.134.2) gives the following output:
$ ./elephant.py
spine2,172.16.135.1,172.16.134.2,62
leaf1,172.16.135.1,172.16.134.2,61
host2,172.16.135.1,172.16.134.2,64
leaf2,172.16.135.1,172.16.134.2,63
host1,172.16.135.1,172.16.134.2,61

The last field is the IP TTL (time to live). Ordering the events by TTL shows the path across the network (since the TTL is decremented by each switch along the path):
host2 -> leaf2 -> spine2 -> leaf1 -> host1
Monitoring resources on the physical switches is critical to large scale IPVLAN deployments since the number of routes must be kept within the table sizes supported by the hardware. Broadcom ASIC table utilization metrics, DevOps, and SDN describes how sFlow exports hardware routing table metrics.
The use of Cumulus Linux greatly simplified the configuration of the ECMP fabric and allowed a common set of routing, monitoring, and orchestration software to be used for both hosts and switches. A common set of tools dramatically simplifies the task of configuring, testing, and managing end to end Docker networking.
Finally, end to end IP routing using the new Docker IPVLAN networking plugin is very promising. Eliminating overlays and bridging improves scalability, reduces operational complexity, and facilitates automation using a mature routing control plane (BGP).