Anemone: Edge-based network management - Microsoft Research


Oct 29, 2013


Anemone

Edge-based network management

http://www.research.microsoft.com/projects/anemone/

Mort (Richard Mortier)
Paul Barham, Austin Donnelly, Rebecca Isaacs


Preamble: Microsoft Research

Over 700 people worldwide, spread through 6 research labs
  Bangalore, Beijing, Cambridge, Redmond, San Francisco, Silicon Valley
  Cover a wide range of CS and EE areas

MSR Charter
  Advance the state-of-the-art through cutting-edge research and publishing in the open literature
  Provide competitive edge to Microsoft’s product groups through technology transfer and consultation
  Engage with academic community through participation in conferences, programme committees, journal editorial boards, student thesis committees

Cambridge lab is about 80 researchers, split into 4 main areas
  Networking, systems, distributed systems
    Magpie, Topology discovery, Pastry, Avalanche, Vigilante, Anemone
  Languages, security, theory
  Graphics, vision, machine learning
  Integrated systems, HCI, hardware

The process of monitoring and controlling a large complex distributed system of dumb devices where failures are common and resources scarce

Networks are large: 10^5 hosts, 10^3 routers
Networks are heterogeneous: 130 router hardware/OS combinations
Networks run distributed protocols: OSPF, BGP, all very loosely synchronized
Networks undergo continuous change: links fail and recover, upgrades occur

Network management is hard!

State of the art?

Tools to help visualize and inspect network

1. Get topology
  Recursive use of ping and traceroute
2. Get traffic data
  Routers using SNMP and NetFlow™
3. Analyze and present the data
  Wrap it all up in a GUI: triggers, graphs, top-10s, etc


Unfortunately…

There are problems!



Traffic is becoming more opaque to the network core
  Increasing deployment of IPSec, tunnelling, encryption
traceroute data is ambiguous and only polls the topology
  Best case is the reverse path anyway
SNMP data is often buggy
  Non-critical part of router operation
Routers are often resource starved
  Not built using the latest CPU, memory technologies

The result is that such systems can end up presenting inaccurate, untimely, incomplete data

Anemone

Edge-based distributed network management platform

Collect flow information from hosts, and
Combine with topology information from routing protocols

Enables applications
  Visualize current network state
  Analyse flow data for intrusion detection
  Simulate reconfiguration/failure for planning
  Control the network, automatically and in real-time

Benefits

Anemone has a priori benefits over state of the art

Visibility into opaque protocols
  See into encrypted/tunnelled traffic e.g. IPSec, PPTP
Plentiful resources at hosts
  They need only deal with their own traffic
Independence from poor quality data
  No more reliance on SNMP and traceroute data

Applications

Where is my traffic going today?

Anemone is a platform for network management apps

Pictures of current topology and traffic
  Routes+flows+forwarding rules
  BIG PICTURE
In fact, where did my traffic go yesterday?
  Keep historical data for capacity planning, etc
A platform for anomaly detection
  Historical data suggests “normality,” live monitoring allows anomalies to be detected
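The anomaly-detection point above (historical data suggests normality, live monitoring flags departures from it) can be sketched as a threshold on a historical baseline. This is only an illustration under stated assumptions, not a description of Anemone's detector: the name `is_anomalous`, the threshold `k` of three standard deviations, and the sample flow counts are all hypothetical.

```python
from statistics import mean, stdev

def is_anomalous(history, live_value, k=3.0):
    """Flag live_value if it lies more than k standard deviations
    from the mean of the historical samples (assumed threshold)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any different value is a departure.
        return live_value != mu
    return abs(live_value - mu) > k * sigma

# Hypothetical flows-per-hour counts recorded for one host.
history = [100, 104, 98, 101, 97, 102, 99, 103]

print(is_anomalous(history, 105))  # within the usual spread
print(is_anomalous(history, 150))  # far outside the usual spread
```

A real deployment would likely need per-time-of-day baselines, since the slides note that load changes with time of day, week and year.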

Applications

Where might my traffic go tomorrow?

Anemone enables ‘what-if’ analysis

Plug into a simulator back-end
  Discrete event simulator or flow allocation solver
Run multiple ‘what-if’ scenarios
  …failures
  …reconfigurations
  …technology deployments
  E.g. “What happens to the network if we coalesce all the mail servers into one datacenter?”
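A failure scenario of the kind listed above can be sketched by removing a link, recomputing routes, and comparing link loads before and after. The sketch below is a much simpler stand-in for the simulator back-end the slides mention: it assumes hop-count shortest paths found by breadth-first search, a single path per demand, and hypothetical names (`shortest_path`, `link_loads`, `without_link`) and topology.

```python
from collections import deque, defaultdict

def shortest_path(adj, src, dst):
    """Hop-count shortest path via BFS; returns a node list or None."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nbr in sorted(adj[node]):  # sorted for deterministic tie-breaks
            if nbr not in prev:
                prev[nbr] = node
                queue.append(nbr)
    if dst not in prev:
        return None
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

def link_loads(adj, demands):
    """Sum each demand's bandwidth onto every link of its path."""
    loads = defaultdict(float)
    unroutable = []
    for (src, dst), mbps in demands.items():
        path = shortest_path(adj, src, dst)
        if path is None:
            unroutable.append((src, dst))
            continue
        for a, b in zip(path, path[1:]):
            loads[tuple(sorted((a, b)))] += mbps
    return dict(loads), unroutable

def without_link(adj, a, b):
    """Copy of the topology with the undirected link a-b removed."""
    return {n: {m for m in nbrs if {n, m} != {a, b}} for n, nbrs in adj.items()}

# Hypothetical four-node ring and two demands (Mbit/s).
adj = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"}}
demands = {("A", "C"): 10, ("A", "B"): 5}

before, _ = link_loads(adj, demands)
after, lost = link_loads(without_link(adj, "A", "B"), demands)
print(before)
print(after, lost)
```

In this toy case the failed link's traffic shifts onto the A-D and C-D links, which is the sort of consequence a planner would want to see before the failure happens.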

Applications

Where should my traffic be going?

Anemone helps close the control loop



Use it to support an application that recomputes link weights to implement policy goals
  Recomputation on the order of hours or days
This enables more dynamic policies
  Network configuration could be modified to track e.g. time of day/week/year load changes
  …potentially reducing bandwidth costs

Where are we now?

Studying feasibility and building prototypes



Three major components


Flow collection


Route collection


Anemone platform


Data collection: flows

Synthesise flow data from low-level packet tracing

Hosts track active flows
  Using ETW, low overhead event posting infrastructure
  Built prototype device driver provider & user-space consumer
Took 24h packet traces from a client and a server
  Peaks were at 165 (client) and 5667 (server) live flows per sec, and 39 (client) and 567 (server) active flows per sec
  Datasets of quite manageable size
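Synthesising flows from packet events, as described above, amounts to grouping packets by a flow key and retiring flows that go quiet. The sketch below is an assumption-laden illustration rather than the prototype: it uses plain tuples instead of ETW events, treats a flow as a 5-tuple, and picks a hypothetical 60-second idle timeout; `FlowTable` and its methods are invented names.

```python
from dataclasses import dataclass

IDLE_TIMEOUT = 60.0  # seconds; an assumed value, not taken from the slides

@dataclass
class Flow:
    first_seen: float
    last_seen: float
    packets: int = 0
    byte_count: int = 0

class FlowTable:
    def __init__(self, idle_timeout=IDLE_TIMEOUT):
        self.idle_timeout = idle_timeout
        self.flows = {}    # flow key (5-tuple) -> Flow
        self.expired = []  # (key, Flow) pairs retired after going idle

    def expire(self, now):
        """Retire flows that have been idle for longer than the timeout."""
        idle = [k for k, f in self.flows.items()
                if now - f.last_seen > self.idle_timeout]
        for key in idle:
            self.expired.append((key, self.flows.pop(key)))

    def observe(self, ts, key, size):
        """Account one packet, creating a flow record if none is live."""
        self.expire(ts)
        flow = self.flows.get(key)
        if flow is None:
            flow = self.flows[key] = Flow(first_seen=ts, last_seen=ts)
        flow.last_seen = ts
        flow.packets += 1
        flow.byte_count += size

# Hypothetical packet events: (timestamp, 5-tuple, size in bytes).
web = ("10.0.0.1", 40000, "10.0.0.2", 80, "tcp")
dns = ("10.0.0.1", 53000, "10.0.0.3", 53, "udp")
table = FlowTable()
table.observe(0.0, web, 1500)
table.observe(0.5, dns, 80)
table.observe(30.0, web, 1500)
table.observe(70.0, web, 40)  # by now the dns flow has been idle > 60 s

print(len(table.flows), len(table.expired))
```

Only the per-flow summaries need to leave the host, which is what keeps the dataset small relative to the raw packet trace.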

Interlude: OSPF routing 101

How does a packet get from any A to any B?

Learn network topology; compute shortest paths



For each node
1. Discover adjacencies (~immediate neighbours)
2. Advertise these link states to all other routers
3. Build link state database (~network topology)
4. Compute shortest paths to all destination prefixes
5. Forward to next-hop using longest-prefix-match (~most specific route)
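Steps 4 and 5 above can be sketched with Dijkstra's algorithm followed by a longest-prefix-match lookup. This is a toy model under stated assumptions, not an OSPF implementation: the link state database is a plain dictionary of costs, and the router names, prefixes and function names are hypothetical.

```python
import heapq
import ipaddress

def shortest_path_next_hops(links, source):
    """Dijkstra over {node: {neighbour: cost}}.
    Returns {node: (cost, first hop from source)}."""
    best = {}
    heap = [(0, source, None)]  # (cost so far, node, first hop used)
    while heap:
        cost, node, first = heapq.heappop(heap)
        if node in best:
            continue  # already settled at a lower or equal cost
        best[node] = (cost, first)
        for nbr, weight in links.get(node, {}).items():
            if nbr not in best:
                hop = nbr if first is None else first
                heapq.heappush(heap, (cost + weight, nbr, hop))
    return best

def build_forwarding_table(best, prefixes):
    """prefixes maps a prefix string to the router advertising it."""
    return [(ipaddress.ip_network(p), best[r][1])
            for p, r in prefixes.items() if r in best]

def lookup(table, address):
    """Longest-prefix match: the most specific matching route wins."""
    addr = ipaddress.ip_address(address)
    matches = [(net.prefixlen, hop) for net, hop in table if addr in net]
    return max(matches, key=lambda m: m[0])[1] if matches else None

# Hypothetical link state database as seen from router R1.
links = {
    "R1": {"R2": 1, "R3": 2},
    "R2": {"R1": 1, "R4": 1},
    "R3": {"R1": 2, "R4": 1},
    "R4": {"R2": 1, "R3": 1},
}
best = shortest_path_next_hops(links, "R1")
table = build_forwarding_table(best, {"10.0.0.0/8": "R3", "10.1.0.0/16": "R4"})

print(lookup(table, "10.1.2.3"))  # matches both prefixes; the /16 wins
print(lookup(table, "10.2.0.1"))  # matches only the /8
```

Every router runs the same computation on the same database, which is why the resulting hop-by-hop forwarding decisions are consistent.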

Data collection: routes

Passive collection of network critical control protocol

OSPF is link-state so collect link state adverts
  Completely passive, modulo configuration
  Process data to recover network “events” and topology
Data collected for (local, backbone) areas (20 days)
  LSA DB size: (700, 1048) LSAs ~ (21, 34) kB
  Event totals: (2526, 3238) events ~ (5.3, 6.7) evts/hr
  Small, generally stable with bursts of activity
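Recovering "events" from collected link state can be pictured as diffing successive snapshots of the database. The sketch below assumes a drastically simplified database, a mapping from (router, neighbour) to cost, which is far less than a real OSPF LSA carries; the event names and functions are hypothetical. It also rechecks the event rates quoted above from the totals and the 20-day collection period.

```python
def diff_link_state(old, new):
    """Compare two snapshots {(router, neighbour): cost} and list
    the changes as (event, link, cost) tuples."""
    events = []
    for link in sorted(new.keys() - old.keys()):
        events.append(("link_up", link, new[link]))
    for link in sorted(old.keys() - new.keys()):
        events.append(("link_down", link, old[link]))
    for link in sorted(old.keys() & new.keys()):
        if old[link] != new[link]:
            events.append(("cost_change", link, new[link]))
    return events

def events_per_hour(total_events, days):
    """Average event rate over a collection period given in days."""
    return total_events / (days * 24)

# Hypothetical snapshots taken before and after a burst of activity.
old = {("R1", "R2"): 10, ("R2", "R3"): 10}
new = {("R1", "R2"): 20, ("R3", "R4"): 5}
events = diff_link_state(old, new)
print(events)

# Totals from the slide: (local, backbone) areas over 20 days.
local_rate = round(events_per_hour(2526, 20), 1)
backbone_rate = round(events_per_hour(3238, 20), 1)
print(local_rate, backbone_rate)
```

The recomputed rates agree with the (5.3, 6.7) events per hour stated on the slide.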

[Figure: plots labelled “complete dataset” and “steady state”. NB: spike to ~100 from initial DB collection truncated for readability. Annotated intervals: 1-2 mins: RouterDeadInterval? 10 mins: data ca. 25/Nov? 30 mins: LSRefreshTime? 35 mins: LSRefreshTime+CheckAge?]

The Anemone platform

Data unification, distribution and presentation

“Distributed database,” logically containing
1. Traffic flow matrix (bandwidths), {srcs} × {dsts}
  Hosts can supply flows they source and sink
  Only need a subset of this data to get complete traffic matrix
2. …each entry annotated with current route, src to dst
  Note src/dst might be e.g. (IP end-point, application)
  OSPF supplies topology → routes
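The logical contents described above can be sketched as a dictionary keyed by (src, dst), filled from host reports and then annotated with routes. Because each flow is visible at both of its end-points, a report from either one is enough, which is one way to read "only need a subset of this data". All names, the one-bandwidth-per-pair simplification and the sample values are assumptions.

```python
def build_traffic_matrix(host_reports):
    """host_reports: {reporting host: [(src, dst, mbps), ...]}.
    A flow may be reported by both end-points; keep a single copy."""
    matrix = {}
    for reporter, flows in host_reports.items():
        for src, dst, mbps in flows:
            if reporter not in (src, dst):
                continue  # a host only vouches for flows it sources or sinks
            matrix.setdefault((src, dst), mbps)
    return matrix

def annotate_with_routes(matrix, routes):
    """Attach the current route (or None if unknown) to each entry."""
    return {pair: {"mbps": mbps, "route": routes.get(pair)}
            for pair, mbps in matrix.items()}

# Hypothetical reports: host C is not instrumented and reports nothing.
reports = {
    "A": [("A", "B", 10), ("C", "A", 2)],
    "B": [("A", "B", 10), ("B", "C", 4)],
}
matrix = build_traffic_matrix(reports)

# Hypothetical routes, as would be derived from the OSPF topology.
routes = {("A", "B"): ["A", "R1", "B"]}
annotated = annotate_with_routes(matrix, routes)
print(annotated)
```

Here the matrix still contains C's flows even though C reported nothing, because its peers A and B did.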

System outline

[Diagram labels: Hosts, Routers, Packets, Flows, Routeing protocol, Topology, Anemone platform, Traffic matrix (srcs, dsts), Set of routes (routes), Visualize, Simulate, Simulator, Control]

The Anemone platform

Provides an API for presenting data



Wish to be able to answer queries like
  “Who are the top-10 traffic generators?”
    Easy to aggregate, don’t care about topology
  “What is the load on link l?”
    Can aggregate from hosts, but need to know routes
  “What happens if we remove links {l … m}?”
    Interaction between traffic matrix, topology, even flow control



Related work


{ distributed, continuous query, temporal } databases


Sensor networks, Astrolabe, SDIMS, PHI …
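The first two example queries above can be sketched against a route-annotated traffic matrix. This is not the Anemone API, which the slides leave as an open question; the function names, data layout and sample values are assumptions chosen to show why the first query ignores topology while the second needs routes.

```python
from collections import Counter

def top_generators(matrix, n=10):
    """Top-n sources by total bandwidth; no topology needed."""
    totals = Counter()
    for (src, _dst), mbps in matrix.items():
        totals[src] += mbps
    return totals.most_common(n)

def link_load(matrix, routes, link):
    """Total bandwidth of all entries whose route crosses the link,
    in either direction."""
    a, b = link
    load = 0
    for pair, mbps in matrix.items():
        path = routes.get(pair, [])
        hops = set(zip(path, path[1:]))
        if (a, b) in hops or (b, a) in hops:
            load += mbps
    return load

# Hypothetical traffic matrix (Mbit/s) and routes through routers R1, R2.
matrix = {("A", "C"): 10, ("A", "B"): 5, ("B", "C"): 4}
routes = {
    ("A", "C"): ["A", "R1", "R2", "C"],
    ("A", "B"): ["A", "R1", "B"],
    ("B", "C"): ["B", "R1", "R2", "C"],
}

print(top_generators(matrix, 2))
print(link_load(matrix, routes, ("R1", "R2")))
```

The third query, removing links, additionally requires recomputing the routes, which is where the simulator back-end comes in.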

The Anemone platform

Currently forming the core of the demo!



Have simulation model
  OSPF data gives topology, event list, routes
  Simple load model to start with (load ~ # subnets)
  Predecessor matrix (from SPF) reduces flow-data query set
Where/what/how much to distribute/aggregate?
  Is data read- or write-dominated?
  Which is more dynamic, flow or topology data?
  Can the system successfully self-tune?
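One reading of the predecessor-matrix point above is that, once shortest paths are known, a link-load query needs flow data only from the sources whose routes actually cross that link. The sketch below illustrates that reading and is not taken from the demo: it builds the predecessor matrix with Floyd-Warshall rather than per-node SPF, and the topology and names are hypothetical.

```python
INF = float("inf")

def predecessor_matrix(nodes, weights):
    """Floyd-Warshall over directed link costs {(u, v): cost}.
    pred[s][d] is the node just before d on the shortest path s -> d."""
    dist = {s: {d: (0 if s == d else INF) for d in nodes} for s in nodes}
    pred = {s: {d: None for d in nodes} for s in nodes}
    for (u, v), w in weights.items():
        dist[u][v] = w
        pred[u][v] = u
    for k in nodes:
        for s in nodes:
            for d in nodes:
                if dist[s][k] + dist[k][d] < dist[s][d]:
                    dist[s][d] = dist[s][k] + dist[k][d]
                    pred[s][d] = pred[k][d]
    return pred

def pairs_using_link(pred, nodes, link):
    """(src, dst) pairs whose shortest path crosses the directed link."""
    pairs = []
    for s in nodes:
        for d in nodes:
            node = d
            while node != s:
                prev = pred[s][node]
                if prev is None:
                    break  # d is unreachable from s
                if (prev, node) == link:
                    pairs.append((s, d))
                    break
                node = prev
    return pairs

# Hypothetical topology: a chain A-B-C-D plus an expensive A-D link.
nodes = ["A", "B", "C", "D"]
undirected = {("A", "B"): 1, ("B", "C"): 1, ("C", "D"): 1, ("A", "D"): 5}
weights = {}
for (u, v), w in undirected.items():
    weights[(u, v)] = w
    weights[(v, u)] = w

pred = predecessor_matrix(nodes, weights)
pairs = pairs_using_link(pred, nodes, ("B", "C"))
hosts_to_query = sorted({s for s, _ in pairs})
print(pairs, hosts_to_query)
```

In this example only A and B need to be asked about traffic on the B to C direction of the link; C and D can be left alone.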

The Anemone platform

Many outstanding research questions



Can we do as well/better than e.g. NetFlow™?
  Accuracy of data vs. completeness of instrumentation
Which data sets should we distribute and how?
  Just OSPF data? Just flow data? A mixture?
  Use DHTs? IP multicast?
  How many levels of aggregation?
  How many nodes should a query touch?
What sort of API is suitable?
  Example queries for sample applications

http://www.research.microsoft.com/projects/anemone/

Building a coherent edge-based network management platform using flow monitoring and standard routeing protocols

Applications include visualization, simulation, dynamic control
Research issues include
  Accuracy: will not be able to monitor 100% of traffic
  Scalability: want to manage a 300,000 node network
  Robustness: must work as nodes fail or network partitions
  Control systems: use the data to optimize the network in real-time, as well as just observe and simulate

Backup slides


SNMP


Internet routeing


Security

SNMP

Protocol to manage information tables at devices



Provides get, set, trap, notify operations
  get, set: read, write values
  trap: signal a condition (e.g. threshold exceeded)
  notify: reliable trap
Complexity mostly in the table design
  Some standard tables, but many vendor specific
  Non-critical, so often tables populated incorrectly

Internet routeing


Q: how to get a packet from node to destination?

A1: advertise all reachable destinations and apply a consistent cost function (distance vector)
A2: learn network topology and compute consistent shortest paths (link state)
  Each node (1) discovers and advertises adjacencies; (2) builds link state database; (3) computes shortest paths
A1, A2: Forward to next-hop using longest-prefix-match
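Answer A1 above, distance vector, can be sketched as nodes repeatedly advertising their tables to neighbours and keeping the cheapest route heard so far. The sketch assumes synchronous rounds, positive link costs and no failures, which sidesteps the convergence problems real distance-vector protocols face; the topology and names are hypothetical.

```python
def distance_vector(links):
    """links: {node: {neighbour: cost}}.
    Returns {node: {destination: (cost, next hop)}} after convergence."""
    nodes = list(links)
    table = {n: {n: (0, n)} for n in nodes}  # each node knows itself
    changed = True
    while changed:
        changed = False
        # Adverts sent this round: a copy of every node's current table.
        snapshot = {n: dict(t) for n, t in table.items()}
        for node in nodes:
            for nbr, link_cost in links[node].items():
                for dest, (cost, _hop) in snapshot[nbr].items():
                    candidate = link_cost + cost
                    known = table[node].get(dest)
                    if known is None or candidate < known[0]:
                        table[node][dest] = (candidate, nbr)
                        changed = True
    return table

# Hypothetical triangle where the direct A-C link is expensive.
links = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 1},
    "C": {"A": 5, "B": 1},
}
table = distance_vector(links)
print(table["A"])
```

Unlike the link-state answer A2, no node here ever learns the topology; each one knows only costs and next hops.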

Security


Threat: malicious/compromised host
  Authenticate participants
  Must secure route collector as if a router
Threat: DoS on monitors
  Difference between client under DoS and server?
  Rate pace output from monitors
Threat: eavesdropping
  Standard IPSec/encryption solutions

Have not considered cross-domain implications
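The suggestion above to rate pace output from monitors could be realised in several ways; a token bucket is one common choice and is sketched here purely as an assumption, since the slides do not name a mechanism. The class name, the rate and burst values, and the explicit `now` argument (used instead of a real clock so the behaviour is easy to follow) are all hypothetical.

```python
class TokenBucket:
    """Allow at most `rate` reports per second on average,
    with bursts of up to `burst` reports."""

    def __init__(self, rate, burst):
        self.rate = rate      # tokens added per second
        self.burst = burst    # maximum tokens the bucket can hold
        self.tokens = burst   # start full
        self.last = 0.0       # time of the previous call

    def allow(self, now, cost=1):
        """Return True if a report costing `cost` tokens may be sent at `now`."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# A monitor limited to 2 reports/s with a burst of 3.
bucket = TokenBucket(rate=2, burst=3)
sent = [bucket.allow(t) for t in (0.0, 0.0, 0.0, 0.0, 0.5, 0.5)]
print(sent)
```

A monitor under a flood would then drop or aggregate the reports that are refused, so that its own output cannot be used to amplify the attack.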