Wednesday, September 27, 2017

Real-time visibility and control of campus networks

Many of the examples on this blog describe network visibility-driven control of data center networks. However, campus networks face many similar challenges, and the availability of industry standard sFlow telemetry and RESTful control APIs in campus switches makes it possible to apply feedback control there as well.

HPE Aruba has an extensive selection of campus switches that combine programmatic control via a REST API with hardware sFlow support:
  • Aruba 2530 
  • Aruba 2540 
  • Aruba 2620
  • Aruba 2930F
  • Aruba 2930M
  • Aruba 3810
  • Aruba 5400R
  • Aruba 8400
This article presents an example of implementing quota controls using HPE Aruba switches.
Typically, a small number of hosts are responsible for the majority of traffic on the network: identifying those hosts, and applying controls to their traffic to prevent them from unfairly dominating, ensures fair access to all users.

Peer-to-peer protocols (P2P) pose some unique challenges:
  • P2P protocols make use of very large numbers of connections in order to quickly transfer data. The large number of connections allows a P2P user to obtain a disproportionate amount of network bandwidth; even a small number of P2P users (less than 0.5% of users) can consume over 90% of the network bandwidth.
  • P2P protocols (and users) are very good at getting through access control lists (ACLs), for example by using non-standard ports or by using port 80 (web). Trying to maintain an effective filter to identify P2P traffic is a challenge, and the resulting complex rule sets consume significant resources in devices attempting to perform classification.
In this example, all switches are configured to stream sFlow telemetry to an instance of the sFlow-RT real-time sFlow analyzer running a quota controller. When a host exceeds its traffic quota, a REST API call is made to the host's access switch, instructing the switch to mark the host's traffic as low priority. Marking the traffic ensures that if congestion occurs elsewhere in the network, typically on the Internet access links, priority queuing will cause marked packets to be dropped first, reducing the bandwidth consumed by the marked host. The quota controller continues to monitor the host, and the marking action is removed when the host's traffic returns to acceptable levels.
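The mark / unmark decision described above can be sketched as a small pure function. This is only an illustration in plain JavaScript, not part of sFlow-RT or the controller script: the reconcile() function, its arguments, and the sample rates are all hypothetical.

```javascript
// Hypothetical helper (illustrative only): decide which controls to add or remove.
// rates:  map of host IP -> current rate in bytes per second
// marked: map of host IP -> true for hosts that are currently marked
// quota:  limit in bytes per second
function reconcile(rates, marked, quota) {
  var actions = [];
  var ip;
  // Mark hosts that exceed the quota and are not yet marked.
  for (ip in rates) {
    if (rates[ip] > quota && !marked[ip]) actions.push({op: 'mark', ip: ip});
  }
  // Unmark hosts whose traffic has returned to acceptable levels.
  for (ip in marked) {
    if (!(rates[ip] > quota)) actions.push({op: 'unmark', ip: ip});
  }
  return actions;
}

var quota = 10 * 1000000 / 8; // 10 Mbit/s expressed in bytes per second
var actions = reconcile(
  {'10.0.0.70': 2000000, '10.0.0.71': 50000, '10.0.0.72': 100000}, // made-up rates
  {'10.0.0.72': true},
  quota
);
console.log(JSON.stringify(actions));
// [{"op":"mark","ip":"10.0.0.70"},{"op":"unmark","ip":"10.0.0.72"}]
```

Keeping the decision separate from the REST call makes the control logic easy to test without a switch.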

A usage quota is simply a limit on the amount of traffic that a user is allowed to generate in a given amount of time. Usage quotas have a number of attributes that make them an effective means of managing P2P activity:
  • A simple usage quota is easy to maintain and enforce and encourages users to be more responsible in their use of shared resources.
  • Since quota-based controls are concerned with the overall amount of traffic that a host generates and not the specific type of traffic, they don't encourage users to tailor P2P application settings to bypass access control rules, and so their traffic is easier to monitor.
  • A quota system can be implemented using standard network hardware, without the addition of a "traffic shaping" appliance that can become a bottleneck and point of failure.
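To make the definition of a usage quota concrete, the following is a minimal sketch of a per-host byte counter over a sliding time window. It is plain JavaScript for illustration only: UsageTracker and its methods are hypothetical names, and the byte counts are made up. In the controller described in this article the measurement itself is performed by sFlow-RT.

```javascript
// Hypothetical usage tracker (illustrative only).
function UsageTracker(windowSeconds, quotaBytes) {
  this.windowMs = windowSeconds * 1000;
  this.quotaBytes = quotaBytes;
  this.samples = {}; // host -> array of {time, bytes}
}

// Record bytes sent by a host at a given time (milliseconds).
UsageTracker.prototype.record = function(host, bytes, timeMs) {
  if (!this.samples[host]) this.samples[host] = [];
  this.samples[host].push({time: timeMs, bytes: bytes});
};

// Total bytes sent by a host within the window ending at nowMs.
UsageTracker.prototype.usage = function(host, nowMs) {
  var cutoff = nowMs - this.windowMs;
  var kept = (this.samples[host] || []).filter(function(s) {
    return s.time > cutoff;
  });
  this.samples[host] = kept; // discard samples that have aged out
  return kept.reduce(function(sum, s) { return sum + s.bytes; }, 0);
};

// True if the host has generated more than the quota within the window.
UsageTracker.prototype.overQuota = function(host, nowMs) {
  return this.usage(host, nowMs) > this.quotaBytes;
};

// 10 Mbit/s sustained over a 10 second window is 12,500,000 bytes.
var tracker = new UsageTracker(10, 10 * 1000000 / 8 * 10);
tracker.record('10.0.0.70', 8000000, 1000);
tracker.record('10.0.0.70', 6000000, 5000);
console.log(tracker.overQuota('10.0.0.70', 6000));  // true: both samples in window
console.log(tracker.overQuota('10.0.0.70', 12000)); // false: first sample aged out
```

The quota depends only on byte counts and time, not on ports or protocols, which is why it needs no traffic classification rules.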
The following quota.js script implements the quota controller functionality:
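var user = 'manager';     // switch REST API user name
var password = 'manager'; // switch REST API password
var scheme = 'http://';   // URL scheme used to reach the switch REST API
var mbps = 10;            // quota in Mbit/s
var interval = 10;        // moving average interval in seconds
var timeout = 20;         // threshold timeout in seconds
var dscp = '10';          // DSCP value applied to marked traffic, af11(10)
var groups = {'ext':['0.0.0.0/0'],'inc':['10.0.0.0/8'],'exc':['10.1.0.0/16']};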

function runCmd(agent,cmd) {
  var headers = {'Content-Type':'application/json','Accept':'application/json'};
  // create session
  var auth = http2({
    url:scheme+agent+'/rest/v3/login-sessions',
    operation:'post',
    headers:headers,
    body: JSON.stringify({userName:user, password:password})
  });
  headers['Cookie']=JSON.parse(auth.body).cookie;

  // make request
  var resp = http2({
    url:scheme+agent+'/rest/v3/cli',
    operation:'post',
    headers:headers,
    body: JSON.stringify({cmd:cmd})
  });
  var result = base64Decode(JSON.parse(resp.body).result_base64_encoded);

  // end session
  var end = http2({
    url:scheme+agent+'/rest/v3/login-sessions',
    operation:'delete',
    headers:headers
  });
  return result;
}

setGroups('site',groups);

setFlow('src', {
  keys:'ipsource',
  value:'bytes',
  filter:'direction=ingress&group:ipsource:site=inc&group:ipdestination:site=ext',
  t:interval
});

setThreshold('quota', {
  metric:'src',
  value:mbps*1000000/8,
  byFlow: true,
  timeout:timeout,
  filter:{ifspeed:[10000000,100000000,1000000000]}
});

var controls = {};
setEventHandler(function(evt) {
  var ip = evt.flowKey;
  if(controls[ip]) return;

  var agent = evt.agent;
  var ds = evt.dataSource;
  controls[ip] = {agent:agent,ds:ds,time:Date.now()};
  logInfo('mark '+ip+' agent '+agent);
  try { runCmd(agent,'qos device-priority '+ip+' dscp '+dscp); }
  catch(e) { logWarning('runCmd error ' + e); }
}, ['quota']);

setIntervalHandler(function() {
  for(var ip in controls) {
    var ctl = controls[ip];
    if(thresholdTriggered('quota',ctl.agent,ctl.ds+'.src',ip)) continue;

    logInfo('unmark '+ip+' agent '+ctl.agent);
    try { runCmd(ctl.agent,'no qos device-priority '+ip); }
    catch(e) { logWarning('runCmd error ' + e); }
    delete controls[ip];
  }
});
Some notes on the script:
  • Writing Applications provides an overview of the sFlow-RT scripting API.
  • HPE ArubaOS-Switch REST API and JSON Schema Reference Guide 16.03 describes the REST API calls in the runCmd() function.
  • The groups variable defines groups of IP addresses, identifying external addresses (ext), addresses included as candidates for the quota controller (inc), and addresses excluded from quota controls (exc).
  • Defining Flows describes the arguments to the setFlow() function. In this case, calculating a 10 second moving average of traffic from local sources (inc) to external destinations (ext).
  • The setThreshold() function creates a threshold that triggers if the moving average of traffic from an address exceeds 10Mbit/s. The thresholds are only applied to 10M, 100M and 1G access ports, ensuring that controls are only applied to access layer switches and not to 10G ports on aggregation and core switches.
  • The setEventHandler() function processes the events, keeping track of existing controls and implementing new DSCP marking rules.
  • The setIntervalHandler() function runs periodically and removes marking rules when they are no longer required (after the threshold timeout has expired).
  • The ArubaOS commands that add and remove DSCP marking for the host are qos device-priority and no qos device-priority, passed to the runCmd() calls.
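The role of the ext, inc and exc groups in the flow filter can be illustrated with a standalone sketch. It assumes that an address is assigned to the group with the most specific matching prefix, which is what the exclusion of 10.1.0.0/16 from 10.0.0.0/8 implies. The ipToInt(), inCidr(), classify() and quotaCandidate() helpers are hypothetical and are not how sFlow-RT implements setGroups() internally.

```javascript
var groups = {'ext':['0.0.0.0/0'],'inc':['10.0.0.0/8'],'exc':['10.1.0.0/16']};

// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipToInt(ip) {
  return ip.split('.').reduce(function(acc, octet) {
    return (acc * 256) + parseInt(octet, 10);
  }, 0);
}

// True if ip falls inside the CIDR block, e.g. '10.0.0.0/8'.
function inCidr(ip, cidr) {
  var parts = cidr.split('/');
  var bits = parseInt(parts[1], 10);
  var size = Math.pow(2, 32 - bits); // number of addresses in the block
  return Math.floor(ipToInt(ip) / size) === Math.floor(ipToInt(parts[0]) / size);
}

// Return the group with the longest matching prefix (assumed matching rule).
function classify(ip, groups) {
  var best = null, bestBits = -1;
  for (var name in groups) {
    groups[name].forEach(function(cidr) {
      var bits = parseInt(cidr.split('/')[1], 10);
      if (bits > bestBits && inCidr(ip, cidr)) { best = name; bestBits = bits; }
    });
  }
  return best;
}

// Mirrors the intent of the flow filter: source in inc, destination in ext.
function quotaCandidate(src, dst) {
  return classify(src, groups) === 'inc' && classify(dst, groups) === 'ext';
}

console.log(classify('10.2.3.4', groups)); // inc
console.log(classify('10.1.2.3', groups)); // exc
console.log(classify('8.8.8.8', groups));  // ext
console.log(quotaCandidate('10.2.3.4', '8.8.8.8')); // true
console.log(quotaCandidate('10.1.2.3', '8.8.8.8')); // false
```

Under this assumption, traffic between two campus hosts is never counted against the quota, because the destination falls in inc or exc rather than ext.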
The easiest way to try out the script is to use Docker to run sFlow-RT:
docker run -v $PWD/quota.js:/sflow-rt/quota.js -e "RTPROP=-Dscript.file=quota.js" -p 6343:6343/udp -p 8008:8008 sflow/sflow-rt
2017-09-27T02:32:14+0000 INFO: Listening, sFlow port 6343
2017-09-27T02:32:14+0000 INFO: Listening, HTTP port 8008
2017-09-27T02:32:14+0000 INFO: quota.js started
2017-09-27T02:33:53+0000 INFO: mark 10.0.0.70 agent 10.0.0.232
2017-09-27T02:34:25+0000 INFO: unmark 10.0.0.70 agent 10.0.0.232
The output indicates that a control marking traffic from 10.0.0.70 was added to edge switch 10.0.0.232 and removed just over 30 seconds later.
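The time each control was in place can be read straight from the log. The following is a small sketch in plain JavaScript (run outside sFlow-RT, for example with Node.js) that pairs the mark and unmark lines shown above. The parseTime() and controlDurations() helpers are hypothetical, and the regular expression assumes the log format is exactly as it appears in this output.

```javascript
var lines = [
  '2017-09-27T02:33:53+0000 INFO: mark 10.0.0.70 agent 10.0.0.232',
  '2017-09-27T02:34:25+0000 INFO: unmark 10.0.0.70 agent 10.0.0.232'
];

// Parse a timestamp such as 2017-09-27T02:33:53+0000 into milliseconds.
function parseTime(ts) {
  // Insert the colon that the ISO 8601 extended offset format expects (+00:00).
  return Date.parse(ts.replace(/([+-]\d{2})(\d{2})$/, '$1:$2'));
}

// Return the duration in seconds of each completed control.
function controlDurations(lines) {
  var pattern = /^(\S+) INFO: (mark|unmark) (\S+) agent (\S+)$/;
  var started = {}, durations = [];
  lines.forEach(function(line) {
    var m = pattern.exec(line);
    if (!m) return; // skip lines that are not control events
    var key = m[3] + '@' + m[4]; // host and agent identify a control
    var time = parseTime(m[1]);
    if (m[2] === 'mark') {
      started[key] = time;
    } else if (key in started) {
      durations.push({host: m[3], agent: m[4], seconds: (time - started[key]) / 1000});
      delete started[key];
    }
  });
  return durations;
}

console.log(JSON.stringify(controlDurations(lines)));
// [{"host":"10.0.0.70","agent":"10.0.0.232","seconds":32}]
```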
The screen capture using Flow Trend to monitor the core switches shows the controller in action. The marking rule is added as soon as traffic from 10.0.0.70 exceeds the 10Mbps quota for 10 seconds (the DSCP marking shown in the chart changes from red be(0) to blue af11(10)). The marking rule is removed once the traffic returns to normal.
The quota settings in the demonstration were aggressive. In practice the threshold settings would be higher and the timeouts longer in order to minimize control churn. The trend above was collected from a university network of approximately 20,000 users that implemented a similar control scheme. In this case, the quota controller was able to consistently mark 50% of the traffic as low priority. Between 10 and 20 controls were in place at any given time and the controller made around 10 control changes an hour (adding or removing a control). During busy periods, congestion on the campus Internet access links was eliminated since marked traffic could be discarded when necessary.

Implementing usage quotas is just one example of applying measurement based control to a campus network. Other interesting applications include:
Real-time measurement and programmatic APIs are becoming standard features of campus switches, allowing visibility driven automatic control to adapt the network to changing network demands and security threats.
