Sunday, December 6, 2009

Standards



Data center convergence and virtualization offer the promise of improved efficiency and flexibility. Developing a network visibility strategy when planning the network upgrades needed to support convergence is essential since only network equipment with embedded traffic monitoring will provide the data center wide visibility essential for effective control.

This article examines the proprietary and standards-based protocols for embedded traffic monitoring as well as their current status and level of support.

Before looking at the standards, it's important to understand how current traffic monitoring protocols fit in the protocol stack. The article, sFlow and NetFlow, describes how sFlow operates at layer 2 (switches) and Cisco NetFlow (and variants from other vendors such as j-Flow, NetStream, LFAP etc.) operates at layer 3 (routers).

There are two organizations responsible for most networking standards:
  1. IETF (Internet Engineering Task Force) is responsible for layer 3-7 protocols (IP, routing, DNS, telnet, smtp etc.)
  2. IEEE (Institute of Electrical and Electronics Engineers) is responsible for layer 2 protocols (Ethernet, switching, 802.11 etc.)
The IETF has recently developed a standard alternative to proprietary IP flow monitoring protocols (Cisco NetFlow, j-Flow, LFAP, NetStream etc.). The IPFIX standard was created in order "to transfer IP flow data from IPFIX exporters to collectors." This focus on IP flow export is consistent with the IETF's responsibility for the TCP/IP suite of protocols and explicitly avoids addressing layer 2 monitoring (switches). The IPFIX standard was published in 2008. Unfortunately, IPFIX has found little support among router vendors, who continue to implement proprietary solutions.

The IEEE publishes the standards for Ethernet and bridging/switching, and is currently developing the set of standards for Data Center Bridging (DCB) that are driving data center convergence. The IEEE would seem to be the natural place to standardize a protocol for monitoring switched Ethernet traffic. However, the IEEE's focus is on the mechanics of layer 2 connectivity, and network management protocols have not been a priority.

The sFlow.org industry consortium was formed in order to develop a multi-vendor standard to address the need for network visibility in layer 2 devices. Many of the members of sFlow.org actively participate in the IEEE standards process, ensuring that sFlow is well matched to the challenge of monitoring current and emerging IEEE standard networks (Ethernet, 40/100G, DCB etc.). The sFlow standard was published in 2001 and it is now implemented by most switch vendors (see sFlow.org).

In selecting a standard for data center visibility, it is important to understand how data center networks are changing. The traditional three layer architecture, in which traffic is moved up to the core and back, does not scale well. Instead, the trend is to integrate access and aggregation layer switches into a flat layer 2 network using shortest-path bridging (IEEE 802.1aq) so that traffic bypasses the core to deliver the increased scalability, bandwidth and reduced latency needed to support converged data center workloads.

The shift in data center network architecture requires a corresponding shift in network management. Instead of monitoring and controlling at layer 3 in the core routers, visibility and control functions move to layer 2 switches and the network edge.

sFlow is the only standard specifically designed for embedded monitoring of layer 2 devices. Selecting switches with embedded sFlow provides a low cost, scalable means of obtaining layer 2-7 visibility into all traffic flowing over the switched network (including storage traffic). Building a network visibility strategy around the sFlow standard maximizes the choice of vendors and ensures interoperable monitoring in mixed vendor environments, eliminating vendor lock-in and facilitating "best in class" product selection.

Tuesday, December 1, 2009

Large Hadron Collider


LHCb (Photo Credit: CERN)

The Large Hadron Collider at CERN has been in the news as it comes online.

An interesting paper (Management of the LHCb Network Based on SCADA System) describes the data collection network associated with the LHCb experiment. High-speed switched Ethernet networks are used to collect measurements from the experiment and to control its operation. The paper states that, "Sophisticated monitoring of both networks at all levels is essential for the successful operation of the experiment."

The network monitoring system uses sFlow to measure network utilization on the core switches. "Because there are so many ports in the core switches, the SNMP query of interface counters takes a long time and occupies a lot CPU and memory resource."

The distributed counter polling mechanism in sFlow provides a highly scalable alternative to SNMP polling, delivering reliable monitoring in the most demanding environments. Network visibility in the data center is equally important and sFlow provides the scalability and performance needed to maintain effective control of high-speed data center networks.
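
Whether interface counters are polled with SNMP or pushed by the sFlow agent, turning them into a utilization figure is the same arithmetic. The following is a minimal sketch of that calculation, not code from the LHCb paper; the function name, parameters and sample readings are hypothetical.

```python
def utilization(prev_octets, curr_octets, interval_s, if_speed_bps, counter_bits=64):
    """Return link utilization (0.0 to 1.0) between two octet counter readings."""
    modulus = 1 << counter_bits
    # Octet counters are cumulative and wrap at 2**counter_bits, so take the
    # difference modulo the counter size.
    delta_octets = (curr_octets - prev_octets) % modulus
    bits = delta_octets * 8
    return bits / (interval_s * if_speed_bps)


# Two hypothetical readings taken 20 seconds apart on a 1 Gbit/s port.
u = utilization(prev_octets=1_000_000, curr_octets=251_000_000,
                interval_s=20, if_speed_bps=1_000_000_000)
print(f"{u:.1%}")  # prints 10.0%
```

With sFlow, each switch sends these counter readings on its own schedule, so the collector only has to do this subtraction rather than issue a request for every port.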

Tuesday, November 17, 2009

Vyatta



Vyatta's open-source software router provides a way of adding routing and security functions to virtual environments (Xen, VMWare and Hyper-V).

Vyatta recently announced the availability of the Core 6 Alpha release, adding sFlow support. This early alpha release (the production version is due in early 2010) can be downloaded now. If you want to try it out, the bootable live CD version offers a quick way to see how the router works without dedicating a server to the project.

The following configuration sets the sampling rate on interface eth0 to 1-in-100 and sends sFlow to a collector at 10.0.0.50:
system {
 flow-accounting {
  interface eth0 {
   sampling-rate 100
  }
  sflow {
   agentid 0
   server 10.0.0.50 {
   }
  }
 }
}

Note: A previous posting discussed the selection of sampling rates. Additional information can be found on the Vyatta web site.
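
To see what a 1-in-100 sampling rate means for the numbers an analyzer reports, here is a minimal sketch of how sampled counts are scaled back up. The function names and the sample count are hypothetical, and the error bound is the commonly cited approximation for packet sampling accuracy (about 95% confidence), included here as an assumption rather than something stated in this post.

```python
import math


def estimate_total(samples, sampling_rate):
    """Scale the number of sampled packets up to an estimate of the total."""
    return samples * sampling_rate


def percent_error_bound(samples):
    """Approximate worst-case percentage error (about 95% confidence)."""
    if samples <= 0:
        raise ValueError("at least one sample is needed")
    return 196.0 * math.sqrt(1.0 / samples)


samples = 400  # hypothetical number of samples seen for one traffic class
total = estimate_total(samples, sampling_rate=100)
error = percent_error_bound(samples)
print(total)            # prints 40000
print(round(error, 1))  # prints 9.8
```

The accuracy depends on the number of samples collected, not on the sampling rate itself, which is why a busier link can use a larger sampling-rate value.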

The screen capture was generated using sFlowTrend to monitor the Vyatta router. Both the Vyatta router software and sFlowTrend can be downloaded at no charge, offering an inexpensive way to try out sFlow.

Making sure that all network devices (both physical and virtual) support sFlow provides the data center wide visibility needed for effective control of network resources. sFlow is already supported by most hardware vendors and Vyatta's support for sFlow extends visibility into the virtual network.

Sunday, October 25, 2009

Probes




The RMON (Remote MONitoring) standard was developed in the early 1990's to standardize network monitoring devices (usually referred to as "probes"). At the time, Ethernet LANs consisted of coax cables that were shared by a number of hosts. Repeaters were used to connect the cables and extend the network. In this environment, a single RMON probe would see all the traffic on the shared network, providing complete network visibility.

Multi-port switches started to become popular in the mid 1990's and SPAN/mirror ports were added to switches to continue to allow probe-based monitoring. The increasing number of ports per switch and the increasing port speeds have made the use of probes a challenge. The need for embedded instrumentation was becoming clear.

In the late 1990's, Cisco introduced the NetFlow protocol, embedding L3-4 monitoring in routers and in 2001 the sFlow protocol was introduced, embedding L2-7 monitoring in switches. Interest in network visibility has accelerated the adoption of the sFlow standard among switch vendors, further limiting the role of probes.

The chart clearly shows the trend toward embedded monitoring. Google Insights for Search was used to trend the popularity of the search terms sFlow, NetFlow, RMON and Probe compared to overall searches relating to Network Monitoring & Management. The sFlow and NetFlow lines track closely and exceed general interest in Network Monitoring & Management, indicating that they are increasingly important topics. The RMON and Probe lines track closely with each other and show a rapid decline compared to Network Monitoring & Management, indicating declining interest in probes as monitoring shifts from probes to embedded instrumentation.

Current trends toward data center convergence increase the need for visibility and control. Complete network visibility is likely to involve both NetFlow and sFlow. NetFlow provides visibility into routers while sFlow extends visibility into the increasingly important switching layer, including: virtual servers, blade servers and edge switches.

Note: The overall downward trend in all the lines results from the increasing population of Internet users. As more people use the Internet, the proportion of Internet users interested in any one topic is diluted. This is particularly true of technical topics. In the past, the technical barriers to using the Internet skewed the user population and resulted in more searches relating to technical topics. Now that everyone is online, the majority of searches relate to more populist topics.

Wednesday, October 21, 2009

Network visibility


The chart, created using Google News Timeline, trends the growing interest in "network visibility" from 1990 to 2008. The dot-com bubble (2000) shows up as a clear spike, followed by a plateau. Starting in 2005 interest has accelerated with strong annual gains.

The sFlow standard, designed to provide network-wide visibility, was introduced in 2001 and the rapid growth in multi-vendor support for sFlow closely tracks this chart.

Finally, current trends toward data center convergence and virtualization create new management challenges that continue to drive interest in the visibility and control made possible by sFlow.

Tuesday, October 20, 2009

802.1aq and TRILL


There are a number of drivers increasing demand for bandwidth in the data center:
  • Multi-core processors and blade servers greatly increase computational density, creating a corresponding demand for bandwidth.
  • Networked storage increases demand for bandwidth.
  • Virtualization and server consolidation ensure that servers are fully utilized, further increasing demand for bandwidth.
Virtual machine mobility (e.g. VMWare vMotion, Citrix XenMotion or Xen Live Migration) requires a large, flat layer 2 network so that machines can be moved without reconfiguring their network settings. The increasing size of the layer 2 network, combined with the increasing demand for bandwidth, challenges the scalability of current Ethernet switching technologies.

The diagram illustrates the problem. Currently, Ethernet uses the spanning tree protocol to determine forwarding paths. The tree structure forces traffic to the network core, creating a bottleneck. In addition, the tree structure doesn't allow traffic to flow on backup links, further limiting usable bandwidth. An alternative forwarding technique (used by routers) is to select shortest paths through the network. Shortest path forwarding allows traffic flows to bypass the core, reducing the bottleneck. The added benefit of shortest path forwarding is its ability to make use of all the links, including backup links, further increasing capacity.
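
The difference between the two forwarding schemes can be sketched on a three-switch topology. This is a simplified illustration with hypothetical switch names: the tree is built with a breadth-first search from the core, whereas the real spanning tree protocol elects a root and blocks ports using bridge IDs and path costs.

```python
from collections import deque


def neighbors(links):
    """Build an adjacency map from a set of undirected links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def shortest_path(links, src, dst):
    """Breadth-first search for a minimum-hop path using only the given links."""
    adj = neighbors(links)
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in sorted(adj.get(node, ())):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    if dst not in parent:
        return None
    path = []
    node = dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def spanning_tree(links, root):
    """Keep one loop-free set of links reaching every switch from the root."""
    adj = neighbors(links)
    tree = set()
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in sorted(adj[node]):
            if nxt not in seen:
                seen.add(nxt)
                tree.add((node, nxt))
                queue.append(nxt)
    return tree


# Two edge switches, each linked to the core, plus a direct "backup" link.
links = {("core", "edge1"), ("core", "edge2"), ("edge1", "edge2")}
tree = spanning_tree(links, "core")

tree_path = shortest_path(tree, "edge1", "edge2")
direct_path = shortest_path(links, "edge1", "edge2")
print(tree_path)    # prints ['edge1', 'core', 'edge2']
print(direct_path)  # prints ['edge1', 'edge2']
```

With the tree, the edge1 to edge2 link carries no traffic and everything crosses the core. With shortest path forwarding, the same link is used and the core is bypassed.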

There are two emerging standards for shortest path forwarding in switches:
  • TRILL (Transparent Interconnect of Lots of Links)
  • IEEE 802.1aq (Shortest Path Bridging)
Both TRILL and 802.1aq use IS-IS (Intermediate System to Intermediate System) routing to select shortest paths through the network. The protocols are very similar, but are being proposed by two different standards bodies: TRILL is being developed by IETF and 802.1aq by the IEEE.

It is surprising to see the IETF working on a LAN bridging protocol. The IETF is responsible for Internet protocols (TCP/IP, routing) and the IEEE is responsible for LAN protocols (802.11, Ethernet, bridging/switching). Adopting the IEEE 802.1aq standard makes the most sense, since it will ensure interoperability with the IEEE data center bridging standards being developed to support FCoE and facilitate data center convergence.

Finally, while more efficient network topologies will help increase network capacity, the days of relying on network over-provisioning are over. Much tighter control of bandwidth is going to be required in order to cope with converged data center workloads. Selecting switches that support the sFlow standard provides the visibility and control needed to manage the increasing demand for bandwidth.


Saturday, October 17, 2009

SR-IOV


The diagram (source Intel: Virtual Machine Direct Connect (VMDc)) illustrates the close relationship between the network adapter and the virtual switch in providing networking to virtual machines. Currently most virtual server systems use a software virtual switch (vSwitch) to share access to the network adapter among the virtual machines (VMs).

The Single Root I/O Virtualization (SR-IOV) standard being implemented by 10G network adapter vendors provides hardware support for virtualization, allowing virtual machines to directly share the network adapter without involving the software virtual switch. Hardware sharing improves performance and frees CPU cycles to be used by the virtual machines. Software is still needed to configure and manage the switching function on the network adapter and integrate it with the management of the virtual machines. Virtual switch software is evolving, offloading performance critical functions to the network adapter, while continuing to provide the management interface.

Maintaining visibility and control of the edge is a critical component of an effective data center management strategy. Since virtual switches provide the first layer of switching in a virtualized environment, they comprise the network edge. Integrating the virtual switches into the overall network management system is essential.

Previously, the role of the VEPA protocol in integrating software virtual switches with hardware switches was discussed. VEPA support in the network adapter offers integration between a hardware switch and the network adapter. Most switch vendors support sFlow for traffic monitoring, and the combination of sFlow and VEPA would provide visibility and control of virtual machine traffic.

Ultimately, the evolving functionality of the network adapter/virtual switch is likely to deliver the visibility, performance, security and quality of service capabilities needed from the network edge. This trend is illustrated by the roadmap for the Open vSwitch, which includes support for both sFlow and SR-IOV, along with OpenFlow for control of the edge.

Wednesday, October 14, 2009

VEPA


A virtual switch is a software component of a virtual server, providing network connectivity to virtual machines (VMs). The challenge with virtual switches is integrating them into the rest of the network in order to maintain visibility and control.

The diagram shows how the emerging VEPA (Virtual Ethernet Port Aggregator) standard addresses this challenge by ensuring that packets from the virtual machines connected to the virtual switch (shown in green) also pass through an adjacent hardware switch (Bridge). In a blade server, the adjacent hardware switch would be the blade switch. If stand-alone servers are used, then the adjacent hardware switch would be the top of rack switch.

Passing traffic through the hardware switch offloads tasks such as rate limiting and access control lists (ACLs), simplifying the virtual switch and freeing CPU cycles that can be used by the virtual machines.

The sFlow standard is widely supported by switch vendors. Selecting blade switches and top of rack switches with sFlow and VEPA support will offer visibility and control of the network edge.

Thursday, October 8, 2009

Network edge



InMon's quota controller brings together many of the topics that have been discussed on this blog, clearly illustrating the role of network-wide visibility in achieving effective control of the network.

The diagram shows the basic elements: a centralized sFlow analyzer receives sFlow data from every switch in the network, producing a real-time, network-wide view of traffic and accurately tracking the network topology. A centralized controller enforces management policies by automatically applying configuration settings to the edge switches in order to control traffic. For more information, see Controlling Traffic for a detailed description of InMon's controller and its application to peer-to-peer (P2P) traffic control.
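
The division of labor between the analyzer and the controller can be sketched as follows. This is an illustrative outline only, not InMon's implementation: the record layout, host addresses, switch names and quota value are all hypothetical.

```python
from collections import defaultdict


def usage_by_host(flow_samples, sampling_rate):
    """Analyzer: estimate bytes sent per host from samples from every switch."""
    totals = defaultdict(int)
    for sample in flow_samples:
        totals[sample["src"]] += sample["bytes"] * sampling_rate
    return dict(totals)


def hosts_over_quota(usage, quota_bytes):
    """Policy: list the hosts whose estimated usage exceeds the quota."""
    return sorted(host for host, used in usage.items() if used > quota_bytes)


def plan_controls(over_quota, edge_port_of):
    """Controller: find the edge switch port where each control belongs."""
    return {host: edge_port_of[host] for host in over_quota if host in edge_port_of}


# Hypothetical samples and topology (host -> (edge switch, port)).
samples = [
    {"src": "10.0.0.1", "bytes": 1500},
    {"src": "10.0.0.1", "bytes": 1500},
    {"src": "10.0.0.1", "bytes": 1500},
    {"src": "10.0.0.2", "bytes": 64},
]
edge_port_of = {"10.0.0.1": ("edge-sw-3", 12), "10.0.0.2": ("edge-sw-1", 4)}

usage = usage_by_host(samples, sampling_rate=100)
controls = plan_controls(hosts_over_quota(usage, quota_bytes=100_000), edge_port_of)
print(controls)  # prints {'10.0.0.1': ('edge-sw-3', 12)}
```

The mapping from host to edge port is what the topology tracking provides. Without it, the controller would know which host to limit but not where to apply the setting.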

Generally, this level of control is only possible because of the timely and complete picture of the network state that sFlow monitoring provides. In control engineering terms, sFlow makes the network observable, an essential prerequisite for control.

An accurate picture of the network state allows controls to be targeted where they will be most effective and have the least impact on other traffic: the edge of the network. The alternative, measurement and control at the network core (performed by the core switches and routers, or by channeling traffic through shared control points such as firewalls and traffic shapers), can result in serious performance problems as busy core devices become overloaded by additional measurement and control tasks. In addition, control at the core is ineffective if the traffic doesn't cross the core. On the other hand, all traffic crosses the edge and control at the edge is scalable since the number of edge devices grows with the network, providing additional measurement (sFlow) and control capacity as the network grows.

Interestingly, the centralized visibility into switched traffic that sFlow provides is being paralleled by a move toward centralized control of switches (see OpenFlow). The combination of centralized visibility and centralized control of network traffic paths has the potential to revolutionize data center networking, delivering the performance, scalability and control needed to build large, converged data centers.

In order to achieve visibility and control in the data center, it is essential to ensure that the edge is fully observable and controllable. Data center convergence is shifting the network edge to include components of blade servers and virtual servers. Finally, the Open vSwitch project is interesting because it will offer visibility (sFlow) and control (OpenFlow) at the edge of the virtualized data center (currently including support for Xen/XenServer, KVM, and VirtualBox).

Monday, October 5, 2009

Management silos


Currently, most organizations split the management of the data center among different groups: networking, storage, systems and possibly regulatory compliance. Enterprise applications require resources from each of these functional areas and a failure in any of these areas can have a significant impact on the business. The strategy of splitting management responsibilities by functional area has worked because these functional areas have traditionally been loosely coupled and the data center environment has been relatively static.

The trend towards convergence of computing, storage and networking in order to create a more dynamic and efficient infrastructure makes these different functions richly dependent on each other, forcing a change in management strategy. For example, server virtualization means that a small change made by the systems group could have a major effect on network bandwidth. The increasing demand for bandwidth by networked storage accounts for a significant proportion of overall network bandwidth, again making the network vulnerable to changes made by the storage group. The recent Gmail outage illustrates the complex relationships between the elements needed to maintain services in a converged environment.

Convergence and interdependence between the resources in a converged data center require a cross functional approach to management in order to ensure successful operations. The diagram shows the migration from management silos, in which each group monitors its own resources and uses its own management tools (but has very limited visibility into the other components of the data center), to an integrated management strategy in which all the components in the data center are monitored by a single measurement system that provides shared visibility into all the elements of the data center. Data center wide visibility is critical, ensuring that each group is aware of its impact on shared resources, eliminating finger pointing, and providing the information needed to take control of the data center.

The sFlow measurement technology, built into most vendors' network equipment, ensures data center wide visibility of all resources, including switches, storage, servers, blade servers and virtual servers. As networks, systems and storage converge, the visibility provided by sFlow in the network provides an increasingly complete picture of all aspects of data center operations, delivering the converged visibility needed to manage the converged data center.

Data center control


Feedback control is a powerful technique for managing complex systems. Most people are familiar with examples of feedback control, even if they don't have a clear understanding of the underlying theory. For example, anti-lock brakes use feedback control to ensure that a driver can quickly stop the car without skidding and losing control. Sensors attached to each wheel send measurements to a controller that adjusts the amount of braking force applied to each wheel. If too much force is applied to a wheel and it starts to skid, then the sensor detects that the wheel is locking. The controller reduces brake force, reapplying it when it detects that the wheel is turning again.

In the data center, many tools are available to deploy configuration changes. What is often missing are the sensors to provide the data center wide visibility needed to determine where configuration changes (controls) should be applied and to assess the effectiveness of the changes. The sFlow sensors, built into network devices from most network vendors, in combination with an sFlow analyzer, provide the feedback that is essential to maintain effective control over the data center.
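
The anti-lock brake behavior (release when the wheel locks, reapply when it turns again) maps onto a simple on/off feedback loop. The sketch below applies the same idea to a link: a control is applied when measured utilization rises above one threshold and removed when it falls below a lower one. The thresholds, readings and function name are hypothetical.

```python
def control_step(utilization, limited, high=0.8, low=0.5):
    """Decide whether the control should be active after one measurement."""
    if not limited and utilization > high:
        return True   # too much load: apply the control
    if limited and utilization < low:
        return False  # load has recovered: release the control
    return limited    # otherwise keep the current state


# Hypothetical utilization readings fed back from the measurement system.
readings = [0.4, 0.85, 0.7, 0.6, 0.45, 0.3]
limited = False
history = []
for reading in readings:
    limited = control_step(reading, limited)
    history.append(limited)
print(history)  # prints [False, True, True, True, False, False]
```

The gap between the two thresholds keeps the control from switching on and off with every small fluctuation. Without the measurements, there is nothing to drive either decision.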

Saturday, October 3, 2009

Virtual servers


Convergence blurs the traditional line between the servers and the network. In order to maintain visibility and control, it is important to identify and monitor all the switches in the converged network. Previously, the importance of maintaining visibility while migrating to blade servers was discussed. Maintaining visibility while virtualizing servers creates similar challenges.

The diagram shows the migration of multiple stand-alone servers connected to a stand-alone switch to a single server running multiple virtual machines. In this transition, where did the switch go? Popular virtual server systems such as VMWare® and Xen® make use of software "virtual switches" to connect virtual machines together and to the network. Using sFlow to monitor the virtual switches ensures that the benefits of virtualization can be realized without losing the visibility into network traffic that is essential for network troubleshooting, traffic accounting and security.

Currently, software probes are required to monitor virtual switches. However, the approach of using software probes has similar limitations to using probes to monitor physical switches: probes have limited performance and the installation and configuration of probes adds complexity to the task of managing the network. To provide a truly scalable solution, visibility must be an integral part of every switch, physical or virtual. The need for visibility is evident to virtualization vendors and virtual switches will soon be available with built-in sFlow support.

Multi-vendor support for sFlow ensures that all the layers in the network, from virtual switches and blade switches to top of rack switches and core switches, can be monitored using a single technology. Convergence to high speed switched Ethernet unifies LAN and SAN connectivity and convergence to sFlow for traffic monitoring provides the network-wide visibility needed to manage and control the unified network.

Friday, October 2, 2009

Gmail outage


The Thursday, September 24th service outage with Google Gmail was widely reported (see Google Gmail Users Hit With Another Service Disruption, The Wall Street Journal).

On Friday, September 25th Google published an incident report, Google Apps Incident Report, that describes some of the factors leading to the failure. The report makes interesting reading, concluding that the root cause was a high load on the Contacts service and that this load was the result of a combination of the following:
  • A network issue in a data center, which caused additional load on the Contacts service
  • A very high utilization of the Contacts service
  • An update to Gmail that inadvertently increased the load on the Contacts service
This incident demonstrates the complex dependencies between the networking and computing components in a cloud computing environment. Data center wide visibility helps avoid this type of collapse, discovering dependencies and identifying capacity problems early enough to allow proactive action to be taken. When a service failure does happen, visibility is critical for quickly identifying the problem and targeting the controls needed to mitigate the failure.

Thursday, October 1, 2009

Blade servers


The need for network visibility in the data center and the challenge posed by converged networking and networked storage can be managed if all the switches in the data center include the sFlow monitoring standard that most vendors support.

The diagram shows the migration from stand-alone servers connected to a stand-alone switch to a blade server (collapsing the discrete servers into blades within a common chassis). In this transition, where did the switch go? Typically a blade server also encloses a blade switch that provides network connectivity to each of the blades, connecting them to each other and to the rest of the data center. The management of the blade switch may be integrated into the blade server manager and the blade switch may not be described as a switch, but it serves the function of a switch (providing connectivity by directing Ethernet packets).

Convergence blurs the traditional line between the servers and the network and it is important to identify and monitor all the switches in order to maintain visibility and control of the network. In the case of a blade server, select a blade switch with sFlow in order to manage the growing demand for bandwidth that comes with data center consolidation. Ask your blade server vendor about sFlow and select a networking solution that provides the performance, visibility and control needed to successfully operate a converged data center.

Saturday, September 26, 2009

Ethernet growth


There are a number of drivers increasing demand for bandwidth in the data center:
  • Multi-core processors and blade servers greatly increase computational density, creating a corresponding demand for bandwidth.
  • Networked storage increases demand for bandwidth.
  • Virtualization and server consolidation ensure that servers are fully utilized, further increasing demand for bandwidth.

The chart shows a projection of the adoption of higher speed Ethernet server interconnects as the increasing demand for bandwidth drives current deployment of 10G Ethernet and accelerates adoption of emerging 40G and 100G Ethernet products.

The scalability, performance and relatively low cost of Ethernet make it the clear choice of networking technology for providing converged SAN and LAN connectivity. The sFlow standard, already supported by most Ethernet switch vendors, provides the network visibility that is essential for managing data centers in this rapidly evolving environment.

Thursday, September 24, 2009

Multi-vendor support


Multi-vendor support of the sFlow standard has been increasing rapidly over the last 5 years. Initially published by InMon Corp. as RFC 3176, sFlow has grown into the leading multi-vendor standard for monitoring high-speed switched networks. The growth in vendor support of sFlow has been driven by the move to 1G and more recently 10G Ethernet switches. The sFlow.org industry consortium, responsible for developing and promoting the sFlow standard, lists the large number of switches that implement sFlow. The switch vendors supporting sFlow now include: HP, IBM, Dell, Brocade, Juniper, BLADE, 3Com, H3C, Force10, Hitachi, AlaxalA, NEC, Alcatel-Lucent, D-Link, Extreme Networks, Allied Telesis and Comtec.

Broad vendor support delivers the network-wide visibility that is essential for managing the convergence of voice, data and storage on the campus and in the data center. It is likely that you have products from one or more of these vendors - if you would like more information on options to extend visibility in your network, ask them about sFlow.

September 11, 2012 Update: An updated version of this article describes the continued growth (nearly doubling) of sFlow support among vendors over the last couple of years, including Cisco's recent support for sFlow.

Tuesday, September 22, 2009

LAN and WAN


There is widespread confusion about the differences between sFlow and NetFlow and they are often simply referred to collectively as xFlow, implying that the two technologies are interchangeable. The sFlow and NetFlow posting described some of the technical differences between the two technologies, in particular, describing how sFlow operates at the Switch/Ethernet/LAN level and NetFlow operates at the Router/IP/WAN level. This division helps explain where the two technologies fit in the market.

The chart breaks down networking into switching and routing on the x-axis and network speed on the y-axis and then plots the application areas for sFlow and NetFlow. Since sFlow is built into switch ASICs, it offers monitoring solutions that span the full range of layer 2-3 switching products, from inexpensive switches aimed at office and small business environments, to the most demanding applications in supercomputer data centers, Internet exchange points and digital effects render farms. NetFlow is typically found in enterprise class routers. Since performance critical components of NetFlow are often implemented in software, NetFlow isn't widely used for monitoring at the high end of the router market (tier-1 ISPs). The cost of NetFlow enabled equipment limits its use at the low end of the router market.

Dividing the market into routing and switching products and the related applications helps explain why some vendors support sFlow while others support NetFlow. The chart also explains why a vendor might offer sFlow on their switch products and NetFlow on their router products. In practice, most networks blend switching and routing in order to meet the varied requirements of the different services running on the network. In many cases, a network monitoring strategy that embraces both sFlow and NetFlow delivers the most complete visibility into network activity.

Wednesday, September 16, 2009

Networked storage


A previous posting discussed how sFlow is used to provide visibility in the data center. This post looks more closely at the challenge posed by networked storage.

There are many good reasons to use networked storage: the storage resources can be shared, replicated and backed up independently of the systems that use them. In a virtual server environment, using networked storage for the virtual machine images simplifies the replication of virtual machines and the migration of virtual machines between servers (e.g. VMWare vMotion, Citrix XenMotion or Xen Live Migration).

In addition, migrating storage from a dedicated storage area network (SAN) to a single Converged Enhanced Ethernet (CEE) network promises to reduce cost and create a more flexible data center infrastructure. However, this migration also places additional demands on the LAN infrastructure.

Regardless of the type of networked storage (iSCSI, NFS, AoE or FCoE), the management of network bandwidth is critical to successful deployment and operation. For example, the chart above shows site-wide traffic from a large campus network broken out by protocol. The storage traffic (iSCSI) is clearly the largest load on the network, dwarfing the amount of web (HTTP) traffic.

The visibility into network traffic provided by sFlow is critical to effectively managing network resources. If the network is poorly provisioned, congestion associated with storage traffic will degrade quality of service (QoS) for other applications on the network and impair system performance since network congestion will also manifest itself as slow disk performance.

Saturday, September 12, 2009

Network visibility in the data center


Current trends toward Virtualization, Converged Enhanced Ethernet (CEE), Fibre Channel over Ethernet (FCoE), Service Oriented Architectures (SOA) and Cloud Computing are part of a broader re-architecture of the data center in which enterprise applications are decomposed into simpler elements that can be deployed, moved, replicated and connected using high-speed switched Ethernet.

The following example, shown in the diagram, illustrates the management challenges faced in this new environment. A system manager decides to move a virtual machine from one server to another. The system management tools show that there is plenty of capacity on the destination server and this looks like a safe move. Unfortunately, the move causes the storage traffic, which had previously been confined to a single switch, to congest links across the data center causing system wide performance problems.

In this new environment the traditional siloed approach in which different teams manage the network, storage and servers does not work. An integrated approach to management is needed if the full benefits of a converged data center are to be achieved. Ensuring network-wide visibility into the storage, network and services running in the data center, their traffic volumes and dependencies is a critical component of an integrated management strategy.

In order to achieve data center wide visibility, every layer of the data center network, including the core, distribution, top of rack and blade server switches, needs to be instrumented. This might seem like a daunting (and expensive!) challenge. However, most vendors have integrated sFlow into their switch products. The sFlow standard provides a proven solution that is available in network products from the leading computer and network vendors, including HP, IBM, Dell, Brocade, BLADE, Juniper, Force10 and 3Com (for a complete list, see sFlow.org). Making sFlow a requirement when building out a new data center is a sound investment, adding very little to the cost of the network, but ensuring that the visibility needed to safely deploy, optimize and scale up new services is available to the operations team.

Saturday, August 8, 2009

Voice quality of service


Complaints were coming in about poor voice quality and dropped calls.

The challenge in managing VoIP (Voice over IP) deployments is that voice traffic can take a variety of paths across the network and problems anywhere along the path can affect call quality. In the diagram, the voice traffic is shown by the RTP (Real-time Transport Protocol) connections between the phones and computers.

Fortunately, the network switches support sFlow and the network management team had recently installed an sFlow analyzer (Traffic Sentinel), providing them with network-wide visibility. The sFlow analyzer alerted the team to excessive broadcast traffic affecting a large part of the network (see Link utilization to see how sFlow provides network-wide interface statistics). The packet header and packet path information provided by sFlow allowed the analyzer to rapidly locate the source of the broadcasts to a single server. A call to the server administrator identified the cause of the recent broadcast activity; they were testing a new application for distributing software on the network and the application was generating the broadcast traffic. Shutting down the application resolved the problem.

This example demonstrates how sFlow monitoring can assist in ensuring quality of service (QoS). Proactively managing traffic allows network managers to act before service levels deteriorate to the point where users are complaining. The paper, Managing Quality of Service Using sFlow, provides a detailed discussion of this approach.

Finally, the example illustrates the challenges created by network convergence and virtualization. Typically, different teams manage voice, data, computing and storage services using their own management tools. The example shows how the "siloed" approach to management fails as services converge to share the same network resources. As well as creating challenges, convergence offers the opportunity to dramatically simplify management. Instrumenting the converged network, by selecting network equipment with sFlow support, reduces the number of tools needed to provide visibility into each of these areas (voice, data, computing, storage) and delivers the shared visibility into the network infrastructure essential for avoiding conflicts.

Tuesday, July 28, 2009

Configuring Extreme switches

The following commands configure an Extreme Networks switch (10.0.0.246), sampling packets at 1-in-512, polling counters every 30 seconds and sending the sFlow to an analyzer (10.0.0.50) over UDP using the default sFlow port (6343):
enable sflow
configure sflow agent 10.0.0.246
configure sflow collector 10.0.0.50 port 6343
configure sflow sample-rate 512
configure sflow poll-interval 30
enable sflow backoff-threshold
configure sflow backoff-threshold 100
enable sflow ports all
A previous posting discussed the selection of sampling rates. Additional information can be found on the Extreme Networks web site.

See Trying out sFlow for suggestions on getting started with sFlow monitoring and reporting.

Note: Extreme Networks switches support automatic backoff of sampling rates based on a settable samples-per-second threshold. This mechanism ensures that a poorly selected sampling rate will not generate excessive numbers of samples (see sFlow Version 5, section 4.2.2).

Wednesday, July 15, 2009

Configuring Force10 switches

The following commands configure a Force10 switch (10.0.0.245), sampling packets at 1-in-512, polling counters every 30 seconds and sending the sFlow to an analyzer (10.0.0.50) over UDP using the default sFlow port (6343):
config> sflow collector 10.0.0.50 agent-addr 10.0.0.245
config> sflow sample-rate 512
config> sflow polling 30
config> sflow enable
Then for each interface:
interface> sflow enable
You can also use the following command to list the configuration settings:

show sflow
A previous posting discussed the selection of sampling rates. Additional information can be found on the Force10 Networks web site.
See Trying out sFlow for suggestions on getting started with sFlow monitoring and reporting.

Wednesday, July 8, 2009

Configuring Brocade switches

The following configuration enables sFlow monitoring of all interfaces on a Brocade (formerly Foundry) FGS switch, sampling packets at 1-in-512, polling counters every 30 seconds and sending the sFlow to an analyzer (10.0.0.50) on UDP port 6343 (the default sFlow port):
fgs(config)# int e 0/1/1 to 0/1/24
fgs(config-mif-0/1/1-0/1/24)# sflow forwarding
fgs(config-mif-0/1/1-0/1/24)# exit
fgs(config)# sflow destination 10.0.0.50 6343
fgs(config)# sflow sample 512
fgs(config)# sflow polling-interval 30
fgs(config)# sflow enable
You can also use the following command to list the configuration settings:
fgs# show sflow
A previous posting discussed the selection of sampling rates. Additional information can be found on the Brocade web site.

See Trying out sFlow for suggestions on getting started with sFlow monitoring and reporting.

Monday, July 6, 2009

Configuring Juniper switches

The following configuration enables sFlow monitoring of all interfaces on a Juniper EX3200 switch, sampling packets at 1-in-500, polling counters every 30 seconds and sending the sFlow to an analyzer (10.0.0.50) on UDP port 6343 (the default sFlow port):
protocols {
 sflow {
  polling-interval 30;
  sample-rate 500;
  collector 10.0.0.50 {
   udp-port 6343;
  }
  interfaces ge-0/0/0.0;
  interfaces ge-0/0/1.0;
  interfaces ge-0/0/2.0;
  interfaces ge-0/0/3.0;
  interfaces ge-0/0/4.0;
  interfaces ge-0/0/5.0;
  interfaces ge-0/0/6.0;
  interfaces ge-0/0/7.0;
  interfaces ge-0/0/8.0;
  interfaces ge-0/0/9.0;
  interfaces ge-0/0/10.0;
  interfaces ge-0/0/11.0;
  interfaces ge-0/0/12.0;
  interfaces ge-0/0/13.0;
  interfaces ge-0/0/14.0;
  interfaces ge-0/0/15.0;
  interfaces ge-0/0/16.0;
  interfaces ge-0/0/17.0;
  interfaces ge-0/0/18.0;
  interfaces ge-0/0/19.0;
  interfaces ge-0/0/20.0;
  interfaces ge-0/0/21.0;
  interfaces ge-0/0/22.0;
  interfaces ge-0/0/23.0;
 }
}
A previous posting discussed the selection of sampling rates. Additional information on configuring Juniper switches can be found on the Juniper Networks web site.

See Trying out sFlow for suggestions on getting started with sFlow monitoring and reporting.

Friday, June 26, 2009

Sampling rates


A previous posting discussed the scalability and accuracy of packet sampling and the advantages of packet sampling for network-wide visibility.

Selecting a suitable packet sampling rate is an important part of configuring sFlow on a switch. The table gives suggested values that should work well for general traffic monitoring in most networks. However, if traffic levels are unusually high, the sampling rate may be decreased (e.g. sample 1 in 5000 packets instead of 1 in 2000 for 10Gb/s links).

Configure sFlow monitoring on all interfaces on the switch for full visibility. Packet sampling is implemented in hardware so all the interfaces can be monitored with very little overhead.

Finally, select a suitable counter polling interval so that link utilizations can be accurately tracked. Generally the polling interval should be set to export counters at least twice as often as the data will be reported (see Nyquist-Shannon sampling theory for an explanation). For example, to trend utilization with minute granularity, select a polling interval of between 20 and 30 seconds. Don't be concerned about setting relatively short polling intervals; counter polling with sFlow is very efficient, allowing more frequent polling with less overhead than is possible with SNMP.
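The polling guideline above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, the counters are assumed to be 64-bit octet counters that do not wrap between polls, and the "at least twice as often" rule is applied as a simple upper bound on the interval.

```python
def max_polling_interval(report_granularity_s: float) -> float:
    """Longest polling interval that still exports counters at least
    twice per reporting period (the rule of thumb in the text)."""
    if report_granularity_s <= 0:
        raise ValueError("reporting granularity must be positive")
    return report_granularity_s / 2.0


def link_utilization(prev_octets: int, curr_octets: int,
                     interval_s: float, link_bps: float) -> float:
    """Fraction of link capacity used between two counter exports.

    Assumes 64-bit octet counters, so wrap-around is ignored here.
    """
    bits = (curr_octets - prev_octets) * 8
    return bits / interval_s / link_bps


# Trending with one-minute granularity: poll at least every 30 seconds.
interval = max_polling_interval(60)

# 375,000,000 octets in 30 seconds on a 1Gb/s link.
util = link_utilization(0, 375_000_000, interval, 1e9)
print(f"poll every {interval:.0f}s, utilization {util:.0%}")
```

Any interval at or below the returned value satisfies the guideline, which is consistent with the 20 to 30 second range suggested for minute granularity.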

Tuesday, June 23, 2009

sFlow MIB


Configuring switches through the switch command line interface (CLI) can be complex and time consuming, especially when monitoring needs to be configured on every switch in order to achieve network-wide visibility.

The sFlow MIB provides a way for an sFlow analysis application to use SNMP to automatically configure sFlow settings on the switches that it wants to monitor. Since the sFlow MIB is an optional part of the sFlow standard, not all sFlow capable switches can be configured using SNMP. However, HP ProCurve and Alcatel-Lucent switches support the sFlow MIB making it easy to quickly try out sFlow monitoring using the free sFlowTrend application. The screen capture above shows the sFlowTrend setting needed to enable SNMP configuration of sFlow.

Many traffic analyzers that claim sFlow support do not support the sFlow MIB. If you have switches that support the sFlow MIB, then selecting an analyzer that supports the sFlow MIB will ensure a successful deployment.

Future posts on this blog will describe the configuration commands needed to enable sFlow on additional vendors' switches.

Tuesday, June 16, 2009

Trying out sFlow


If you are interested in network-wide visibility and want to start experimenting with sFlow, take a look at your network and see if any of the switches are sFlow capable. Most switch vendors support sFlow, including: Brocade, Hewlett-Packard, Juniper Networks, Extreme Networks, Force10 Networks, 3Com, D-Link, Alcatel-Lucent, H3C, Hitachi, NEC AlaxalA, Allied Telesis and Comtec (for a complete list of switches, see sFlow.org).

If you don't already have switches with sFlow support, consider purchasing a switch to experiment with. There are a number of inexpensive switches with sFlow support (check the list of switches on sFlow.org). Alternatively, you may be able to pick up a used switch on eBay.

Finally, the open source Host sFlow agent can be used to monitor host traffic and traffic between virtual machines on a virtual server (Xen®, VMware®, KVM).

Once you have access to a source of sFlow data, you will need an sFlow analyzer. The sFlowTrend application (shown above) is a free, purpose built, sFlow analyzer that will allow you to try out the full range of sFlow functionality, including:
  • decoding and filtering on data from packet headers (including VLANs, priorities, MAC addresses, Ethernet types, as well as TCP/IP fields)
  • accurate analysis, trending and reporting of packet samples
  • trending of sFlow counters
  • support for sFlow MIB to automatically configure sFlow on switches
Many traffic analyzers claim support for sFlow, but provide only partial support. It is worth starting with sFlowTrend to see the full capabilities of sFlow and to gain experience with sFlow monitoring before evaluating larger scale solutions.

Future posts on this blog will use sFlowTrend to demonstrate how sFlow monitoring can be used to solve common network problems. Downloading a copy of sFlowTrend will allow you to try the different strategies on your own network.

Saturday, June 6, 2009

Choosing an sFlow analyzer


sFlow achieves network-wide visibility by shifting complexity away from the switches to the sFlow analysis application. Simplifying the monitoring task for the switch makes it possible to implement sFlow in hardware, providing wire-speed performance, without increasing the cost of the switch. However, the shift of complexity to the sFlow analysis application makes the selection of the sFlow analyzer a critical factor in realizing the full benefits of sFlow monitoring.

To illustrate some of the features that you should look for in an sFlow analyzer, consider the following basic question, "Which hosts are generating the most traffic on the network?" The chart provides information that answers the question, displaying the top traffic sources and the amount of traffic that they generate. In order to generate this chart, the sFlow analyzer needs to support the following features:
  1. Since the busiest hosts in the network could be anywhere, the sFlow analyzer needs to monitor every link in the network to accurately generate the chart.
  2. Traffic may traverse a number of monitored switch ports; in the example above, traffic between hosts A and B is monitored by 10 switch ports. In order to correctly report on the amount of traffic by host, the sFlow analyzer needs to combine data from the different switch ports in a way that correctly calculates the traffic totals and avoids under or over counting.
  3. The sFlow analyzer must fully support sFlow's packet sampling mechanism in order to accurately calculate traffic volumes.
  4. Notice that the chart contains IPv4, IPv6 and MAC addresses. The sFlow analyzer needs to be able to decode packet headers and report on all the protocols in use on the network, including layer 2 and layer 3 traffic. Traffic on local area networks (LANs) is much more diverse than routed wide area network (WAN) traffic. In addition to the normal TCP/IP traffic seen on the WAN, LAN traffic can include multicast, broadcast, service discovery (Bonjour), host configuration (DHCP), printing, backup and storage traffic not typically seen on the WAN.
When selecting an sFlow analyzer, try to arrange an evaluation and test the product on a full scale production network.  Evaluating scalability and accuracy is not something that is easily performed in a test lab.

Monday, June 1, 2009

Accuracy and packet loss


Traffic records are often lost:
  1. A switch must reliably perform its primary function of forwarding packets, so if there is any contention for resources in the switch, measurement records will be discarded.
  2. There will inevitably be some loss of measurement records as they are transferred over the network from the switches to the traffic analyzer. Again, measurement traffic is a low priority and may be discarded if the network is busy.
  3. Finally, a traffic analyzer may lose traffic records if large numbers of switches are being monitored and records are arriving faster than they can be processed.
The chart shows the effect of lost records on the accuracy of sFlow and NetFlow monitoring:
  1. NetFlow has no mechanism to compensate for lost records. If NetFlow records are lost then traffic will be underreported. The greater the number of records lost, the lower the reported traffic. The bursty and unpredictable traffic produced by NetFlow monitoring increases the likelihood that NetFlow records will be lost. The loss of even one NetFlow record can significantly affect accuracy since a single flow record may summarize a large transfer of data and represent a substantial fraction of the overall network traffic.
  2. sFlow's packet sampling mechanism treats record loss as a decrease in the sampling probability. The sFlow records contain information that allows the traffic analyzer to measure the effective sampling rate, compensate for the packet loss, and generate corrected values. Each sFlow record represents a single packet event and large flows of traffic will generate a number of sFlow records. Thus, the loss of an sFlow record does not represent a significant loss of data and doesn't affect the overall accuracy of traffic measurements.
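The compensation described for sFlow can be illustrated with a short Python sketch. It assumes the analyzer can read a running count of packets that were eligible for sampling (the sample pool) from the records it receives, and that this count is a wrapping 32-bit counter. The function names and the numbers are hypothetical and chosen only to show the arithmetic.

```python
COUNTER_MODULUS = 2 ** 32  # assumption: the sample pool is a wrapping 32-bit counter


def effective_sampling_rate(first_pool: int, last_pool: int,
                            samples_received: int) -> float:
    """Packets seen by the switch per sample that actually reached the analyzer."""
    if samples_received <= 0:
        raise ValueError("need at least one received sample")
    packets_seen = (last_pool - first_pool) % COUNTER_MODULUS
    return packets_seen / samples_received


def estimate_packets(samples_received: int, sampling_rate: float) -> float:
    """Scale a sample count up to an estimated packet count."""
    return samples_received * sampling_rate


# Switch configured for 1-in-512 sampling; the sample pool advanced by
# 512,000 packets, but only 800 of the roughly 1,000 samples arrived.
configured_rate = 512
received = 800
rate = effective_sampling_rate(0, 512_000, received)

uncorrected = estimate_packets(received, configured_rate)  # underreports
corrected = estimate_packets(received, rate)               # compensates for loss
print(rate, uncorrected, corrected)
```

Scaling by the configured rate would underreport in the same way that lost flow records do. Scaling by the measured rate treats the loss as a lower sampling probability, which is the behavior the text describes.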
Underreporting traffic, particularly during peak periods is a serious problem for troubleshooting, congestion management and traffic engineering applications. For usage-based billing applications, underreported traffic represents lost revenue.

When monitoring using NetFlow and sFlow to achieve network-wide visibility, situating the traffic analyzer near the NetFlow sources will help reduce the loss of flow records and improve accuracy.

Saturday, May 30, 2009

Measurement overhead


How much extra traffic will network monitoring generate? The goal of network-wide visibility is to improve performance on the network, so the extra traffic generated by monitoring needs to be small and must not degrade performance.

The chart looks at the overhead in terms of measurement records reported per packet on the network. Ideally the overhead associated with monitoring should be small and constant (less than 0.1% of the traffic). Since flow-oriented monitoring (e.g. NetFlow) involves the creation and export of flow records, the overhead is determined by the average number of packets in a flow. If there are a large number of packets in a flow, the overhead will be low. However, if the number of packets per flow is small then the overhead will be high and in the worst case may result in a flow record being generated and exported for every packet on the network.

In practice, the number of packets per flow can vary enormously depending on the type of traffic being monitored. DNS traffic is one packet per flow, web traffic will typically have 5-10 packets per flow and video streams may have thousands of packets per flow.

The overhead generated by flow monitoring can become acute during a worm outbreak or when the network is subjected to a denial of service attack (DoS attack). In both cases large numbers of single packet flows are created and the additional overhead created by flow monitoring is likely to exacerbate the problem. The impact of this increased measurement traffic on the network is made worse by the traffic bursts that flow monitoring creates. It is precisely during these times that network visibility is most needed so that the threat can be identified and controlled.

Since sFlow is not a flow-based protocol, the overhead is completely unaffected by the number of packets per flow. sFlow's use of packet sampling limits the overhead of traffic monitoring and ensures accurate, timely, network-wide visibility without impacting network performance - even during extreme traffic situations like a denial of service attack.
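The difference in overhead can be made concrete with a small Python calculation. This is a simplified model, not a measurement: it assumes exactly one exported record per flow for flow monitoring and one record per sampled packet for sFlow. The packets-per-flow figures are the ones quoted above (with 2,000 standing in for "thousands") and the 1-in-512 sampling rate is just an example setting.

```python
def flow_records_per_packet(packets_per_flow: float) -> float:
    """Flow monitoring: one record per flow, so overhead falls as flows get longer."""
    return 1.0 / packets_per_flow


def sampled_records_per_packet(sampling_rate: int) -> float:
    """Packet sampling: one record per N packets, regardless of flow length."""
    return 1.0 / sampling_rate


traffic_mix = {"DNS": 1, "web": 5, "video": 2000}  # packets per flow
SAMPLING_RATE = 512                                # 1-in-512 packet sampling

for name, packets in traffic_mix.items():
    print(f"{name:>5}: flow export {flow_records_per_packet(packets):.4f}, "
          f"sampling {sampled_records_per_packet(SAMPLING_RATE):.4f} records/packet")
```

With single-packet flows, as in a DoS attack, the flow export column reaches one record per packet while the sampling column stays fixed at 1/512.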

Wednesday, May 27, 2009

Measurement traffic


The charts, based on measurements from switches in a production environment, compare NetFlow and sFlow in terms of the load that they generate on the network. The following observations can be made based on this data:
  • NetFlow monitoring generates periodic bursts of traffic; the periodicity is confirmed by the sharp spikes in the frequency chart. This behavior is typical of flow-based traffic monitoring protocols (see Exporting IP flows using IPFIX) since flow generation involves maintaining a cache of active flows on the switch and the use of timers to trigger flow export.
  • sFlow monitoring generates a random pattern of traffic with no periodicity and no bursts. The randomness is confirmed by the flat frequency chart.
Network-wide visibility involves collecting traffic data from large numbers of switches and routers. The bursts of traffic generated by flow monitoring can cause problems with delay, packet loss and jitter that will affect other traffic on the network. The periodicity observed in flow monitoring creates the risk that the different streams of monitoring traffic will synchronize and reinforce each other as large numbers of devices are monitored.

It is essential that the technology used to manage network traffic does not itself cause traffic problems. The random, low-level, background traffic that sFlow generates ensures that large networks can be safely monitored without any adverse effects. This behavior is no accident; sFlow was designed to be scalable and the random packet sampling mechanism in sFlow is one of the reasons that its traffic is well behaved.

Saturday, May 23, 2009

Control



Control theory is an area of engineering and applied mathematics dealing with the behavior and control of dynamic systems. Many of the concepts can usefully be applied to network visibility and control.

The diagram shows the basic elements of a feedback controller. When controlling a network, the network is the "System" and the "Sensor" takes observations of the system (sFlow) and converts them into an estimate of the current network state (link utilizations, traffic flows etc.). The measured network state is compared to a Reference (usage policies, thresholds etc.) and any deviations from the desired behavior are used to trigger a control action (blocking a port, setting a rate limit etc.), changing the behavior of the network and restoring service levels.
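One pass around such a feedback loop might look like the following Python sketch. Everything here is hypothetical: the Policy class, the port names and the single utilization threshold are placeholders for whatever reference and measurements a real deployment would use, and the function only reports which ports deviate rather than applying a control action.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Policy:
    """The Reference: the highest link utilization considered acceptable."""
    max_utilization: float


def control_step(measured: Dict[str, float], policy: Policy) -> List[str]:
    """Compare the measured state with the reference and return the ports
    that deviate, i.e. the candidates for a control action."""
    return [port for port, utilization in sorted(measured.items())
            if utilization > policy.max_utilization]


# Measured state, as an sFlow analyzer might estimate it from counters.
state = {"ge-0/0/1": 0.95, "ge-0/0/2": 0.40, "ge-0/0/3": 0.82}
to_limit = control_step(state, Policy(max_utilization=0.80))
print(to_limit)  # ports where a rate limit or similar action could be applied
```

In a running system this step would repeat each time new measurements arrive, which is why the responsiveness of the sensor matters.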

Control theory has concepts of stability, observability, controllability and robustness that are very general and worth thinking about in the context of network management:
  • Stability is a way of describing how well behaved a system is. If you make a small change and the system's behavior oscillates wildly then it isn't stable (routing instability and congestion are examples of instability in a network setting).
  • Observability is a way of saying, "You can't control what you can't see." If you don't incorporate traffic measurement into the network design (by specifying switches with built-in traffic monitoring) then traffic will not be observable. Every device needs to have built-in traffic monitoring if you want to ensure that the whole network is observable.
  • Controllability is something that should be considered when designing the network; deploying managed switches in each layer of the network with appropriate control capabilities (e.g. access control lists, rate-limiting, priorities etc.) ensures controllability.
  • Robustness is a measure of how resilient the control system is. The managed network should degrade gracefully during unexpected situations (failures, DoS, Slashdot etc.).
sFlow was designed to provide the network-wide visibility needed for effective traffic control. sFlow has the attributes, described in Control Systems Design, that the measurement component of a control system requires: reliability, accuracy, responsiveness, noise immunity, linearity and non-intrusiveness.

In describing the responsiveness requirement, the author states, "Slow responding measurements can not only affect the quality of control but can actually make the feedback loop unstable." sFlow's timely reporting of link utilization data and packet samples provides the responsive visibility into network traffic needed to make the information actionable. While flow-based measurements provide useful usage data for traffic accounting and reporting, they are by their nature less responsive than sFlow and less useful for control.

Tuesday, May 19, 2009

sFlow and NetFlow


If you are interested in network-wide visibility it is easy to be confused by the different types of traffic monitoring available. Technologies such as sFlow, Cisco NetFlow®, Juniper J-Flow, NetStream and IPFIX all appear to perform similar functions, but are supported by different network equipment vendors.

The situation is simpler than it appears. In reality there are only two basic types of traffic monitoring available:
  1. Layer-2 (L2) packet: sFlow is designed to provide network-wide visibility. Monitoring all the way to the layer-2 access ports requires a protocol that scales well and can easily be implemented on a layer-2 switch. Because sFlow is packet-based, it is able to report in detail on all types of traffic on the network.
  2. Layer-3 (L3) flow: NetFlow, J-Flow, NetStream and IPFIX are all very similar technologies. Flow monitoring is typically implemented on routers and provides information about TCP/IP connections with limited visibility into other types of traffic.
To be scalable and cost effective, traffic monitoring needs to be built into switches and routers. Start by taking an inventory of the devices in your network and see what traffic monitoring they provide. You will probably find that your network is represented by one of the three diagrams above.

The diagrams show three typical scenarios based on the type of equipment in the network:
  1. All the switches and routers support sFlow and a central sFlow analyzer provides network-wide visibility. Most vendors support sFlow, so it is possible to build an sFlow capable network to meet almost any requirement.
  2. The switches support sFlow and the routers support L3 flow monitoring. In this case a traffic analyzer that supports sFlow and L3 flow monitoring will also be able to provide network-wide visibility. This situation typically occurs in multi-vendor environments where sFlow is supported by the switch vendor and flow monitoring is supported by the router vendor.
  3. The routers support L3 flow monitoring and the switches have no built-in traffic monitoring capability. In this case, only traffic through the routers is monitored providing very limited visibility into the data center and campus. This situation is typical of single vendor networks where the vendor exclusively supports L3 flow monitoring.
The first step to improved network visibility is to select a traffic analysis tool and enable whatever traffic monitoring is available from existing network equipment.

Making traffic monitoring a selection requirement for future network upgrades will allow you to increase network visibility over time. Selecting network equipment from one of the many vendors that support sFlow does not add to the network cost. Adding traffic monitoring later is likely to be prohibitively expensive.

Monday, May 18, 2009

Scalability and accuracy of packet sampling


This chart from Packet Sampling Basics is useful for explaining why sFlow's packet sampling mechanism provides the accuracy and scalability needed for network-wide visibility. The chart shows that the accuracy of a traffic measurement (e.g. How much bandwidth is being consumed by backup traffic?) increases rapidly as the number of samples contributing to the measurement increases.

The chart shows that the percentage accuracy is independent of the number of packets on the network. This independence is the key to sFlow's scalability. For example, a measurement will be accurate to within about 5% as long as it is based on at least 1,500 samples. Only 1,500 samples are required whether the network contains one switch or 1,000 switches, 10Mbps links or 100Gbps links.

The accuracy of sampled data is also independent of the type of traffic: traffic can consist of a small number of large connections, many small connections, traffic can arrive in bursts or spread out over time. In all cases the accuracy is determined only by the number of samples.
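The relationship between sample count and accuracy can be sketched in Python using the usual 95% confidence approximation for random sampling, in which the percentage error is bounded by 196 times the square root of 1/c, where c is the number of samples in the class being measured. The formula is stated here as an assumption that is consistent with the 1,500 sample figure above. The function names are hypothetical.

```python
import math


def percent_error(samples: int) -> float:
    """Approximate 95% confidence bound on the percentage error of an
    estimate based on the given number of samples."""
    if samples <= 0:
        raise ValueError("need at least one sample")
    return 196.0 * math.sqrt(1.0 / samples)


def samples_needed(target_percent_error: float) -> int:
    """Smallest sample count whose error bound is at or below the target."""
    if target_percent_error <= 0:
        raise ValueError("target error must be positive")
    return math.ceil((196.0 / target_percent_error) ** 2)


for c in (100, 1_500, 10_000):
    print(f"{c:>6} samples: within about {percent_error(c):.1f}%")

print(samples_needed(5.0), "samples for a 5% bound")
```

Neither function takes link speed, switch count or total packet count as an input, which is the independence the text describes.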

The packet sampling mechanism in sFlow is implemented in hardware, providing wire-speed performance. When a switch samples a packet, the sampled packet header and packet path information is immediately sent to the central traffic analyzer. Promptly sending the sFlow data reduces the amount of memory needed on the switch and provides the sFlow collector with a real-time view of network activity.

Using sFlow to monitor all the switches in the network provides a robust and accurate means of monitoring traffic suitable for exacting applications such as network billing and charge-back.  The redundancy that end-to-end monitoring provides ensures that very little data is lost, even when switches fail or are taken down for maintenance.