Thursday, June 5, 2014

Cumulus Networks, sFlow and data center automation

Cumulus Networks and InMon Corp have ported the open source Host sFlow agent to the upcoming Cumulus Linux 2.1 release. The Host sFlow agent already supports the Linux, Windows, FreeBSD, Solaris, and AIX operating systems and the KVM, Xen, XCP, XenServer, and Hyper-V hypervisors, delivering a standard set of performance metrics from switches, servers, hypervisors, virtual switches, and virtual machines - see Visibility and the software defined data center.

The Cumulus Linux platform makes it possible to run the same open source agent on switches, servers, and hypervisors - providing unified end-to-end visibility across the data center. The open networking model that Cumulus is pioneering offers exciting opportunities. Cumulus Linux allows popular open source server orchestration tools to also manage the network, and the combination of real-time, data center-wide analytics with orchestration makes it possible to create self-optimizing data centers.

Install and configure Host sFlow agent

The following command installs the Host sFlow agent on a Cumulus Linux switch:
sudo apt-get install hsflowd
Note: Network managers may find this command surprising since it is usually not possible to install third-party software on switch hardware. What is even more radical is that Cumulus Linux allows users to download source code and compile it on their switch. Instead of being dependent on the switch vendor to fix a bug or add a feature, users are free to change the source code and contribute the changes back to the community.

The sFlow agent requires very little configuration, automatically monitoring all switch ports using the following default settings:

Link Speed     Sampling Rate    Polling Interval
1 Gbit/s       1-in-1,000       30 seconds
10 Gbit/s      1-in-10,000      30 seconds
40 Gbit/s      1-in-40,000      30 seconds
100 Gbit/s     1-in-100,000     30 seconds

Note: The default settings ensure that large flows (defined as flows consuming 10% of link bandwidth) are detected within approximately 1 second - see Large flow detection.
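As a rough sanity check on those defaults, the following back-of-the-envelope calculation (a minimal Python sketch, assuming ~1,500 byte packets - the packet size is an assumption, the 10% large flow threshold comes from the article above) estimates how many packet samples per second a large flow generates at each default sampling rate:
# Estimate how quickly a "large" flow (10% of link bandwidth) is detected
# with the default 1-in-N sampling rates. Assumes ~1,500 byte packets;
# actual packet sizes and analyzer detection thresholds will vary.

defaults = {            # link speed (bits/s) -> default sampling rate (1-in-N)
    1e9:   1_000,
    10e9:  10_000,
    40e9:  40_000,
    100e9: 100_000,
}

PACKET_BYTES = 1500     # assumed average packet size
THRESHOLD = 0.10        # large flow = 10% of link bandwidth

for speed, n in sorted(defaults.items()):
    flow_bps = speed * THRESHOLD                    # large flow bit rate
    pkts_per_sec = flow_bps / (PACKET_BYTES * 8)    # packets per second in the flow
    samples_per_sec = pkts_per_sec / n              # expected samples per second
    print(f"{speed/1e9:>5.0f}G link: ~{samples_per_sec:.1f} samples/s from a large flow")
Because the default sampling rate scales with link speed, a large flow produces roughly eight samples per second regardless of link speed - comfortably enough for an analyzer to detect it within the one second target.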

Once the Host sFlow agent is installed, there are two alternative configuration mechanisms that can be used to tell the agent where to send the measurements:

1. DNS Service Discovery (DNS-SD)

This is the default configuration mechanism for Host sFlow agents. DNS-SD uses a special type of DNS record (the SRV record) to allow hosts to automatically discover servers. For example, adding the following line to the site DNS zone file will enable sFlow on all the agents and direct the sFlow measurements to an sFlow analyzer (10.0.0.1):
_sflow._udp 300 SRV 0 0 6343 10.0.0.1
No Host sFlow agent-specific configuration is required; each switch or host will automatically pick up the settings when the Host sFlow agent is installed, when the device is restarted, or if settings on the DNS server are changed.

Default sampling rates and polling interval can be overridden by adding a TXT record to the zone file. For example, the following TXT record reduces the sampling rate on 10G links to 1-in-2000 and the polling interval to 20 seconds:
_sflow._udp 300 TXT (
"txtvers=1"
"sampling.10G=2000"
"polling=20"
)
Note: Currently defined TXT options are described on sFlow.org.

The article DNS-SD describes how DNS service discovery allows sFlow agents to automatically discover their configuration settings. The slides DNS Service Discovery from a talk at the SF Bay Area Large Scale Production Engineering Meetup provide additional background.
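Before rolling the agents out, it can be useful to confirm that the zone is serving the expected records. The sketch below is illustrative only - it assumes the dnspython package and an example.com zone, neither of which is part of the Host sFlow agent:
# Query the DNS-SD records that Host sFlow agents look up, to verify that
# the zone is serving the expected collector and settings.
# Requires dnspython >= 2.0 (pip install dnspython); older releases use
# dns.resolver.query() instead of resolve().

import dns.resolver

ZONE = "example.com"    # assumed zone name, for illustration

# The SRV record tells agents where to send sFlow (target and port)
for srv in dns.resolver.resolve(f"_sflow._udp.{ZONE}", "SRV"):
    print(f"collector {srv.target} port {srv.port}")

# The TXT record carries optional sampling/polling overrides
for txt in dns.resolver.resolve(f"_sflow._udp.{ZONE}", "TXT"):
    print("options", b" ".join(txt.strings).decode())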

2. Configuration File

The Host sFlow agent is configured by editing the /etc/hsflowd.conf file. For example, the following configuration disables DNS-SD, instructs the agent to send sFlow to 10.0.0.1, reduces the sampling rate on 10G links to 1-in-2000 and the polling interval to 20 seconds:
sflow {
  DNSSD = off

  polling = 20
  sampling.10G = 2000
  collector {
    ip = 10.0.0.1
  }
}
The Host sFlow agent must be restarted for configuration changes to take effect:
sudo /etc/init.d/hsflowd restart
All hosts and switches can share the same settings, so it is straightforward to use orchestration tools such as Puppet, Chef, etc. to manage the sFlow configuration.
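As an illustration of how little there is to manage, the following sketch (a hypothetical helper, not part of any orchestration tool) renders the hsflowd.conf shown above from a small settings dictionary; a Puppet or Chef template would do exactly the same job.
# Render /etc/hsflowd.conf from a few shared settings - the same kind of
# template an orchestration tool (Puppet, Chef, Ansible, ...) would manage.
# The values below mirror the example configuration in this article.

settings = {
    "collector_ip": "10.0.0.1",
    "polling": 20,
    "sampling_10G": 2000,
}

TEMPLATE = """sflow {{
  DNSSD = off

  polling = {polling}
  sampling.10G = {sampling_10G}
  collector {{
    ip = {collector_ip}
  }}
}}
"""

def render_hsflowd_conf(cfg: dict) -> str:
    """Return hsflowd.conf contents for the given settings."""
    return TEMPLATE.format(**cfg)

if __name__ == "__main__":
    # In practice the output would be written to /etc/hsflowd.conf on each
    # node and hsflowd restarted; here we just print it.
    print(render_hsflowd_conf(settings))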

Collecting and analyzing sFlow

Figure 1: Visibility and the software defined data center
Figure 1 shows the general architecture of sFlow monitoring. Standard sFlow agents embedded within the elements of the infrastructure stream essential performance metrics to management tools, ensuring that every resource in a dynamic cloud infrastructure is immediately detected and continuously monitored.

  • Applications -  e.g. Apache, NGINX, Tomcat, Memcache, HAProxy, F5, A10 ...
  • Virtual Servers - e.g. Xen, Hyper-V, KVM ...
  • Virtual Network - e.g. Open vSwitch, Hyper-V extensible vSwitch
  • Servers - e.g. BSD, Linux, Solaris and Windows
  • Network - over 40 switch vendors, see Drivers for growth

The sFlow data from a Cumulus switch contains standard Linux performance statistics in addition to the interface counters and packet samples that you would typically get from a networking device.

Note: Enhanced visibility into host performance is important on open switch platforms since they may be running a number of user installed services that can stress the limited CPU, memory and IO resources.

For example, the following sflowtool output shows the raw data contained in an sFlow datagram from a switch running Cumulus Linux:
startDatagram =================================
datagramSourceIP 10.0.0.160
datagramSize 1332
unixSecondsUTC 1402004767
datagramVersion 5
agentSubId 100000
agent 10.0.0.233
packetSequenceNo 340132
sysUpTime 17479000
samplesInPacket 7
startSample ----------------------
sampleType_tag 0:2
sampleType COUNTERSSAMPLE
sampleSequenceNo 876
sourceId 2:1
counterBlock_tag 0:2001
adaptor_0_ifIndex 2
adaptor_0_MACs 1
adaptor_0_MAC_0 6c641a000459
counterBlock_tag 0:2005
disk_total 0
disk_free 0
disk_partition_max_used 0.00
disk_reads 980
disk_bytes_read 4014080
disk_read_time 1501
disk_writes 0
disk_bytes_written 0
disk_write_time 0
counterBlock_tag 0:2004
mem_total 2056589312
mem_free 1100533760
mem_shared 0
mem_buffers 33464320
mem_cached 807546880
swap_total 0
swap_free 0
page_in 35947
page_out 0
swap_in 0
swap_out 0
counterBlock_tag 0:2003
cpu_load_one 0.390
cpu_load_five 0.440
cpu_load_fifteen 0.430
cpu_proc_run 1
cpu_proc_total 95
cpu_num 2
cpu_speed 0
cpu_uptime 770774
cpu_user 160600160
cpu_nice 192970
cpu_system 77855100
cpu_idle 1302586110
cpu_wio 4650
cpuintr 0
cpu_sintr 308370
cpuinterrupts 1851322098
cpu_contexts 800650455
counterBlock_tag 0:2006
nio_bytes_in 405248572711
nio_pkts_in 394079084
nio_errs_in 0
nio_drops_in 0
nio_bytes_out 406139719695
nio_pkts_out 394667262
nio_errs_out 0
nio_drops_out 0
counterBlock_tag 0:2000
hostname cumulus
UUID fd-01-78-45-93-93-42-03-a0-5a-a3-d7-42-ac-3c-de
machine_type 7
os_name 2
os_release 3.2.46-1+deb7u1+cl2+1
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:2
sampleType COUNTERSSAMPLE
sampleSequenceNo 876
sourceId 0:44
counterBlock_tag 0:1005
ifName swp42
counterBlock_tag 0:1
ifIndex 44
networkType 6
ifSpeed 0
ifDirection 2
ifStatus 0
ifInOctets 0
ifInUcastPkts 0
ifInMulticastPkts 0
ifInBroadcastPkts 0
ifInDiscards 0
ifInErrors 0
ifInUnknownProtos 4294967295
ifOutOctets 0
ifOutUcastPkts 0
ifOutMulticastPkts 0
ifOutBroadcastPkts 0
ifOutDiscards 0
ifOutErrors 0
ifPromiscuousMode 0
endSample   ----------------------
startSample ----------------------
sampleType_tag 0:1
sampleType FLOWSAMPLE
sampleSequenceNo 1022129
sourceId 0:7
meanSkipCount 128
samplePool 130832512
dropEvents 0
inputPort 7
outputPort 10
flowBlock_tag 0:1
flowSampleType HEADER
headerProtocol 1
sampledPacketSize 1518
strippedBytes 4
headerLen 128
headerBytes 6C-64-1A-00-04-5E-E8-E7-32-77-E2-B5-08-00-45-00-05-DC-63-06-40-00-40-06-9E-21-0A-64-0A-97-0A-64-14-96-9A-6D-13-89-4A-0C-4A-42-EA-3C-14-B5-80-10-00-2E-AB-45-00-00-01-01-08-0A-5D-B2-EB-A5-15-ED-48-B7-34-35-36-37-38-39-30-31-32-33-34-35-36-37-38-39-30-31-32-33-34-35-36-37-38-39-30-31-32-33-34-35-36-37-38-39-30-31-32-33-34-35-36-37-38-39-30-31-32-33-34-35-36-37-38-39-30-31-32-33-34-35
dstMAC 6c641a00045e
srcMAC e8e73277e2b5
IPSize 1500
ip.tot_len 1500
srcIP 10.100.10.151
dstIP 10.100.20.150
IPProtocol 6
IPTOS 0
IPTTL 64
TCPSrcPort 39533
TCPDstPort 5001
TCPFlags 16
endSample   ----------------------
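sflowtool's line-oriented output is easy to post-process with scripts. The short sketch below (an illustrative example, not part of sflowtool) groups the key/value lines into samples so that, for instance, the flow samples can be pulled out for further analysis:
# Parse sflowtool's line-oriented output into a list of samples, where each
# sample is a dict of the key/value pairs between startSample/endSample.
# Illustrative only - feed it output such as: sflowtool | python parse.py

import sys

def parse_samples(lines):
    samples, current = [], None
    for line in lines:
        line = line.strip()
        if line.startswith("startSample"):
            current = {}
        elif line.startswith("endSample"):
            if current is not None:
                samples.append(current)
            current = None
        elif current is not None and " " in line:
            key, value = line.split(None, 1)
            current[key] = value
    return samples

if __name__ == "__main__":
    for sample in parse_samples(sys.stdin):
        if sample.get("sampleType") == "FLOWSAMPLE":
            print(sample.get("srcIP"), "->", sample.get("dstIP"),
                  "bytes", sample.get("sampledPacketSize"))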
While sflowtool is extremely useful, there are many other open source and commercial sFlow analysis tools available.
Note: The sFlow Collectors list on sFlow.org contains a number of additional tools.

There is a great deal of variety among sFlow collectors - many focus on the network, others have a compute infrastructure focus, and yet others report on application performance. The shared sFlow measurement infrastructure delivers value in each of these areas. However, as network, storage, host and application resources are brought together and automated to create cloud data centers, a new set of sFlow analytics tools is emerging to deliver the integrated real-time visibility required to drive automation and optimize performance and efficiency across the data center.
While network administrators are likely to be familiar with sFlow, application development and operations teams may be unfamiliar with the technology. The 2012 O'Reilly Velocity conference talk provides an introduction to sFlow aimed at the DevOps community.
Cumulus Linux presents the switch as a server with a large number of network adapters, an abstraction that will be instantly familiar to anyone with server management experience. For example, displaying interface information on Cumulus Linux uses the standard Linux command:
ifconfig swp2
On the other hand, network administrators experienced with switch CLIs may find that Linux commands take a little time to get used to - the above command is roughly equivalent to:
show interfaces fastEthernet 6/1
However, the basic concepts of networking don't change and these skills are essential to designing, automating, operating and troubleshooting data center networks. Open networking platforms such as Cumulus Linux are an important piece of the automation puzzle, taking networking out of its silo and allowing a combined NetDevOps team to manage network, server, and application resources using proven monitoring and orchestration tools such as Ganglia, Graphite, Nagios, CFEngine, Puppet, Chef, Ansible, and Salt.
