Wednesday, May 18, 2016

Identifying bad ECMP paths

In the talk Move Fast, Unbreak Things! at the recent DevOps Networking Forum, Petr Lapukhov described how Facebook has tackled the problem of detecting packet loss in Equal Cost Multi-Path (ECMP) networks. At Facebook's scale, there are many parallel paths, and actively probing all of them generates a lot of data. The active tests generate over 1 Terabit/second of measurement data per Facebook data center, and a Hadoop cluster with hundreds of compute nodes is required per data center to process the data.

Processing the active test data detects packet loss within approximately 20 seconds, but doesn't pinpoint where the packets are being dropped. A custom multi-path traceroute tool (fbtracert) is used to follow up and narrow down the location of the packet loss.

While described as measuring packet loss, the test system is really measuring path loss. For example, if there are 64 ECMP paths in a pod, then the loss of one path would result in a packet loss of approximately 1 in 64 packets in traffic flows that cross the ECMP group.
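
The arithmetic behind the 1 in 64 figure can be sketched in a few lines of Python. This is only an illustration, assuming flows hash uniformly across the ECMP paths and that a bad path drops everything sent over it; the function names and the 99% confidence figure are made up for the example and are not taken from the talk.

```python
import math

def expected_loss_fraction(total_paths: int, failed_paths: int = 1) -> float:
    """Fraction of packets lost, assuming uniform hashing across paths
    and that every packet sent over a failed path is dropped."""
    if total_paths <= 0 or not 0 <= failed_paths <= total_paths:
        raise ValueError("need 0 <= failed_paths <= total_paths and total_paths > 0")
    return failed_paths / total_paths

def probes_needed(total_paths: int, failed_paths: int = 1,
                  confidence: float = 0.99) -> int:
    """Smallest number of independently hashed probes such that at least
    one crosses a failed path with the given probability (assumption:
    each probe picks a path uniformly at random)."""
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    healthy = 1 - expected_loss_fraction(total_paths, failed_paths)
    if healthy == 0:
        return 1  # every path is bad, so the first probe is lost
    if healthy == 1:
        raise ValueError("no failed paths, nothing to detect")
    # Solve healthy**n <= 1 - confidence for the smallest integer n.
    return math.ceil(math.log(1 - confidence) / math.log(healthy))

if __name__ == "__main__":
    print(expected_loss_fraction(64))   # 1 failed path out of 64
    print(probes_needed(64))            # probes for 99% chance of hitting it
```

Under these assumptions a single bad path among 64 shows up as a loss rate of 1/64, which is why a large number of probes is needed before the loss stands out.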

Black hole detection describes an alternative approach. Industry standard sFlow instrumentation embedded within most vendors' switch hardware provides visibility into the paths that packets take across the network - see Packet paths. In some ways the sFlow telemetry is very similar to the traceroute tests: each measurement identifies the specific location where a packet was seen.
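
To make the idea concrete, here is a minimal Python sketch of how location-tagged samples could be grouped to show where each flow was observed. It is not the method from the Black hole detection article; the `(flow, switch)` record format, the switch names, and the helper functions are hypothetical, and real sFlow records carry much more detail.

```python
from collections import defaultdict

def switches_seen_per_flow(samples):
    """Group sampled observations by flow.
    Each sample is a hypothetical (flow, switch) pair: the flow key and
    the switch that reported the sampled packet."""
    seen = defaultdict(set)
    for flow, switch in samples:
        seen[flow].add(switch)
    return dict(seen)

def flows_missing_at(samples, expected_switches):
    """Return flows that were sampled somewhere in the network but never
    at any of the switches where they were expected to appear.
    Caveat: with random sampling a small flow may simply not have been
    sampled there, so this only yields candidates to investigate."""
    expected = set(expected_switches)
    return {
        flow: sorted(switches)
        for flow, switches in switches_seen_per_flow(samples).items()
        if not switches & expected
    }

if __name__ == "__main__":
    samples = [
        ("10.0.1.5->10.0.2.9", "leaf1"),
        ("10.0.1.5->10.0.2.9", "spine2"),
        ("10.0.1.5->10.0.2.9", "leaf2"),
        ("10.0.1.7->10.0.2.9", "leaf1"),
        ("10.0.1.7->10.0.2.9", "spine3"),  # never seen at leaf2
    ]
    print(flows_missing_at(samples, {"leaf2"}))
```

Because every sample already names the switch that saw the packet, the last place a flow was observed falls out of the data directly, without a follow-up traceroute.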

The passive sFlow monitoring approach has significant benefits:
  1. Eliminates active test traffic since production traffic exercises network paths.
  2. Eliminates traffic generators and test targets required to perform the active tests.
  3. Simplifies analysis since sFlow measurements provide a direct indication of anomaly location.
  4. Reduces operational complexity and associated costs.
Enabling sFlow throughout the network continuously monitors all paths and can rapidly detect routing anomalies. In addition, sFlow is a general purpose solution that delivers the visibility needed to manage leaf-spine networks and the distributed applications that they support. The following examples illustrate the breadth of solutions supported by sFlow analytics:
