Monday, June 8, 2020

Large flow marking using BGP Flowspec

Elephant Detection in Virtual Switches & Mitigation in Hardware discusses a VMware and Cumulus demonstration, Elephants and Mice, in which the virtual switch on a host detects and marks large "Elephant" flows and the hardware switch enforces priority queueing to prevent Elephant flows from adversely affecting latency of small "Mice" flows.

SDN and WAN optimization describes a presentation by Amin Vahdat on Google's SDN-based wide area network traffic engineering solution, in which traffic prioritization allows Google to reduce costs by fully utilizing WAN bandwidth.

Deconstructing Datacenter Packet Transport describes how priority marking of packets associated with large flows can improve completion times for flows crossing the data center fabric. Simulation results presented in the paper show that prioritization of short flows over large flows can significantly improve throughput (reducing flow completion times by a factor of 5 or more at high loads).

This article demonstrates a self-contained real-time Elephant flow marking solution that leverages the real-time visibility and control features available in commodity switch hardware.

The diagram shows the elements of the solution. An instance of the sFlow-RT real-time analytics engine receives streaming sFlow telemetry from a pair of edge routers. A mix of many small flows and a few large flows arrives at the left router. All flows have the default Best Effort (be) Differentiated Services Code Point (DSCP) 0 marking (indicated in blue). As soon as a large flow is detected, a BGP Flowspec rule is pushed to the router, remarking the flow as Lower Effort (le) DSCP 1 (see RFC 8622: A Lower-Effort Per-Hop Behavior (LE PHB) for Differentiated Services). The large flow is continuously monitored and the Flowspec rule is withdrawn when the flow ends.

The less-than-best-effort class ensures that the large flow doesn't compete for bandwidth and buffer resources with the small flows, resulting in faster completion times and lower latency for time-sensitive traffic while minimally impacting throughput of the large flow.
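
To make the marking concrete, the following standalone sketch shows how a DSCP value maps onto the 8-bit DS field of the IP header. DSCP occupies the upper six bits and ECN the lower two, so the le code point 1 appears on the wire as 0x04. The function names are illustrative and are not part of sFlow-RT or any router API.

```javascript
// DSCP uses the upper 6 bits of the DS field; the lower 2 bits carry ECN.
// dscpToDsField and dsFieldToDscp are hypothetical helper names.
function dscpToDsField(dscp, ecn = 0) {
  if (!Number.isInteger(dscp) || dscp < 0 || dscp > 63) {
    throw new RangeError('DSCP must be an integer in 0..63');
  }
  return (dscp << 2) | (ecn & 0x3);
}

function dsFieldToDscp(dsField) {
  return (dsField & 0xff) >> 2;
}

const BE = 0; // default Best Effort marking
const LE = 1; // Lower Effort marking (RFC 8622)

console.log('be DS field: 0x' + dscpToDsField(BE).toString(16));
console.log('le DS field: 0x' + dscpToDsField(LE).toString(16));
```

A packet capture taken downstream of the router would therefore be expected to show a DS field of 0x04 (ignoring ECN bits) for packets in a remarked flow.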

The following partial configuration enables sFlow and BGP Flowspec on an Arista EOS device (EOS 4.22 or later):
!
service routing protocols model multi-agent
!
sflow sample 16384
sflow polling-interval 30
sflow destination 10.0.0.70
sflow run
!
interface Ethernet1
   flow-spec ipv4 ipv6
!
interface Management1
   ip address 10.0.0.96/24
!
ip routing
!
ipv6 unicast-routing
!
router bgp 65096
   router-id 10.0.0.96
   neighbor 10.0.0.70 remote-as 65070
   neighbor 10.0.0.70 transport remote-port 1179
   neighbor 10.0.0.70 allowas-in 3
   neighbor 10.0.0.70 send-community extended
   neighbor 10.0.0.70 maximum-routes 12000 
   !
   address-family flow-spec ipv4
      neighbor 10.0.0.70 activate
   !
   address-family flow-spec ipv6
      neighbor 10.0.0.70 activate
   !
   address-family ipv4
      neighbor 10.0.0.70 activate
   !
   address-family ipv6
      neighbor 10.0.0.70 activate
!
The following sFlow-RT mark.js script implements the flow marking controller:
var routers = [
  {router:'10.0.0.96',agent:'10.0.0.170'},
  {router:'10.0.0.97',agent:'10.0.0.171'}
];
var my_as = '65070';
var my_id = '0.6.6.6';
var flow_t = 2;
var threshold_val = 10000000/8;
var threshold_t = 10;
var dscp_name = 'le';
var dscp_val = 1;
var enable_v6 = false;
var max_controls = 1000;

var controls = {};

var bgp_opts = {ipv6:enable_v6,flowspec:true,flowspec6:enable_v6};

function bgpClose(router) {
  var key, ctl;
  for(key in controls) {
    ctl = controls[key];
    if(ctl.router != router) continue;

    ctl.success = false;
  }
}
function bgpOpen(router) {
  var key, ctl;
  for(key in controls) {
    ctl = controls[key];
    if(ctl.router != router) continue;

    ctl.success = bgpAddFlow(ctl.router, ctl.flowspec);
  }
}
var agentToRouter = {};
var controlCount = {};
routers.forEach(function(rec) {
  bgpAddNeighbor(rec.router,my_as,my_id,bgp_opts,bgpOpen,bgpClose);
  // Fall back to the router address when no separate sFlow agent address is given
  agentToRouter[rec.agent || rec.router] = rec.router;
  controlCount[rec.router] = 0;
});

setFlow('mark_tcp', {
  keys: 'ipsource,ipdestination,tcpsourceport,tcpdestinationport',
  value:'bytes',
  filter:'direction=ingress&ipprotocol=6&ipdscp!='+dscp_val,
  t:flow_t
});
setFlow('mark_tcp6', {
  keys: 'ip6source,ip6destination,tcpsourceport,tcpdestinationport',
  value:'bytes',
  filter:'direction=ingress&ip6nexthdr=6&ip6dscp!='+dscp_val,
  t:flow_t
});

setThreshold('mark_tcp', {
  metric:'mark_tcp',
  value:threshold_val,
  byFlow:true,
  timeout:threshold_t
});
setThreshold('mark_tcp6', {
  metric:'mark_tcp6',
  value:threshold_val,
  byFlow:true,
  timeout:threshold_t
});

setEventHandler(function(evt) {
  var router = agentToRouter[evt.agent];
  if(!router) {
    return;
  }

  var key = router + '-' + evt.flowKey;
  if(controls[key]) {
    return;
  }

  if(controlCount[router] >= max_controls) {
    return;
  }

  var [saddr,daddr,sport,dport] = evt.flowKey.split(',');
  var ctl = {
    key:key,
    router:router,
    event:evt,
    flowspec: {
      match: {
        source:saddr,
        destination:daddr,
        'source-port':'='+sport,
        'destination-port':'='+dport,
      },
      then: {
       'traffic-marking':dscp_name
      }
    }
  };
  switch(evt.eventID) {
    case 'mark_tcp':
      ctl.flowspec.match.version = '4';
      ctl.flowspec.match.protocol = '=6';
      break;
    case 'mark_tcp6':
      ctl.flowspec.match.version = '6';
      ctl.flowspec.match.protocol = '=6';
      break;
  }
  if(!enable_v6 && '6' == ctl.flowspec.match.version) {
    return;
  }

  ctl.success = bgpAddFlow(ctl.router, ctl.flowspec);
  controls[ctl.key] = ctl;
  controlCount[router]++;
  logInfo('mark add '+router+' '+evt.flowKey);
},['mark_tcp','mark_tcp6']);

setIntervalHandler(function(now) {
  var key, ctl, evt, triggered;
  for(key in controls) {
    ctl = controls[key];
    evt = ctl.event;
    if(thresholdTriggered(evt.thresholdID, evt.agent,
                          evt.dataSource+'.'+evt.metric,
                          evt.flowKey)) continue;

    if(ctl.success) bgpRemoveFlow(ctl.router,ctl.flowspec);
    delete controls[key];
    controlCount[ctl.router]--;
    logInfo('mark remove '+ctl.router+' '+ctl.event.flowKey);
  }
});
Some notes on the script:
  1. The routers array contains the set of BGP routers that are to be controlled. The router attribute specifies the IP address that will initiate the BGP connection and the agent attribute specifies the sFlow agent address of the router. 
  2. TCP connections exceeding the threshold_val of 10 Mbit/s (expressed in bytes per second as 10000000/8) will be marked.
  3. The max_controls value of 1000 caps the number of Flowspec rules that can be installed in each router in order to avoid exceeding the capabilities of the hardware.
  4. The setFlow() function, see Defining Flows, tracks ingress TCP flows that haven't been marked as LE.
  5. The setThreshold() function defines a threshold to identify large unmarked flows.
  6. The setEventHandler() function triggers the marking action in response to a threshold event.
  7. The setIntervalHandler() function runs every second, finding large flows that have finished and removing their controls.
  8. See Writing Applications for more information.
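
The bookkeeping described in the notes above can be sketched as a standalone simulation that runs without sFlow-RT. This is a simplified illustration, not the controller itself: installRule and removeRule are hypothetical stand-ins for bgpAddFlow and bgpRemoveFlow, the interval check compares a supplied rate against the threshold instead of calling thresholdTriggered(), and the cap is lowered to 2 so its effect is visible.

```javascript
// Standalone sketch of the controller's add/remove bookkeeping.
const thresholdBytesPerSec = 10000000 / 8; // 10 Mbit/s expressed in bytes/s
const maxControls = 2;                     // demo cap (the script uses 1000)

const controls = new Map(); // key -> {router, flowKey, success}
const controlCount = {};    // router -> number of tracked rules
const log = [];

// Hypothetical stand-ins for bgpAddFlow / bgpRemoveFlow.
function installRule(router, flowKey) {
  log.push('mark add ' + router + ' ' + flowKey);
  return true;
}
function removeRule(router, flowKey) {
  log.push('mark remove ' + router + ' ' + flowKey);
}

// Mirrors the event handler: skip duplicates and respect the per-router cap.
function onLargeFlow(router, flowKey) {
  const key = router + '-' + flowKey;
  if (controls.has(key)) return false;
  const count = controlCount[router] || 0;
  if (count >= maxControls) return false;
  const success = installRule(router, flowKey);
  controls.set(key, { router, flowKey, success });
  controlCount[router] = count + 1;
  return true;
}

// Mirrors the interval handler: withdraw rules for flows that are no longer large.
function onInterval(ratesBytesPerSec) {
  for (const [key, ctl] of controls) {
    if ((ratesBytesPerSec[key] || 0) > thresholdBytesPerSec) continue;
    if (ctl.success) removeRule(ctl.router, ctl.flowKey);
    controls.delete(key);
    controlCount[ctl.router]--;
  }
}

const router = '10.0.0.96';
const flowA = '172.16.1.174,172.16.2.175,39208,5001';
const flowB = '172.16.1.174,172.16.2.175,39210,5001';
const flowC = '172.16.1.174,172.16.2.175,39212,5001';

onLargeFlow(router, flowA); // rule installed
onLargeFlow(router, flowA); // ignored: already controlled
onLargeFlow(router, flowB); // rule installed
onLargeFlow(router, flowC); // ignored: cap reached
onInterval({ [router + '-' + flowA]: 2000000 }); // A still large, B has ended
console.log(log.join('\n'));
```

In this run, two rules are added, the duplicate and over-cap events are dropped, and only the rule for the finished flow is withdrawn, leaving one active control on the router.
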
The easiest way to run the script is to use Docker with the pre-built sflow/ddos-protect image. Running the following command on host 10.0.0.70 launches the controller:
docker run --net=host \
-v $PWD/mark.js:/sflow-rt/mark.js \
sflow/ddos-protect -Dscript.file=mark.js
Use iperf to generate a large flow to test the controller, then list the Flowspec rules on the edge router:
localhost#sh bgp flow-spec ipv4
BGP Flow Specification rules for VRF default
Router identifier 10.0.0.96, local AS number 65096
Rule status codes: # - not installed, M - received from multiple peers

   Matching Rule                                                Actions
   172.16.1.174/32;172.16.2.175/32;DP:=5001;SP:=39208;          Mark DSCP: 0x1
Command line output from the edge router confirms that the large flow has been detected and is being marked.

Note: Real-time DDoS mitigation using BGP RTBH and FlowSpec describes how DDoS attacks can be automatically mitigated in real-time using a control scheme very similar to the one described in this article. The Docker image used above includes the DDoS mitigation controller.