Experience in Black-box OSPF Measurement - AT&T Labs Research


OSPF Monitor - NSDI 2004

OSPF Monitor


Architecture, Design and Deployment Experience

Aman Shaikh

Albert Greenberg

AT&T Labs - Research


NSDI 2004


Objectives for OSPF Monitor


Real-time analysis of OSPF behavior

Troubleshooting, alerting, validation of maintenance

Real-time snapshots of OSPF network topology

Off-line analysis

Post-mortem analysis of recurring problems

Generate statistics and reports about network performance


Identify anomaly signatures


Facilitate tuning of configurable parameters


Improve maintenance procedures


Analyze OSPF behavior in commercial networks


OSPF Monitor in a Nutshell


Collect OSPF LSAs (Link State Advertisements) passively from network

Every router describes its local connectivity in an LSA

Router originates an LSA due to...

Change in network topology

Periodic soft-state refresh

LSA is flooded to other routers in the domain

Flooding is reliable and hop-by-hop

Flooding leads to duplicate copies of LSAs being received

Every router stores LSAs (self-originated + received) in link-state database (= topology graph)

Real-time analysis of LSA streams

Archive LSAs for off-line analysis


Components


Data collection: LSA Reflector (LSAR)

Passively collects OSPF LSAs from network

“Reflects” streams of LSAs to LSAG

Archives LSAs for analysis by OSPFScan

Real-time analysis: LSA aGgregator (LSAG)

Monitors network for topology changes, LSA storms, node flaps and anomalies

Off-line analysis: OSPFScan


Supports queries on LSA archives


Allows playback and modeling of topology changes


Allows emulation of OSPF routing


Example

[Figure: example deployment. LSAR 1 and LSAR 2 collect LSAs from an OSPF network (Area 0, Area 1, Area 2), each keeping an LSA archive and “reflecting” LSAs over TCP connections to LSAG for real-time monitoring. The LSA archives are replicated to OSPFScan for off-line analysis.]


How LSAR attaches to Network


Host mode

Join multicast group

Adv: completely passive

Disadv: not reliable, delayed initialization of LSDB

Full adjacency mode

Form full adjacency (= peering session) with a router

Adv: reliable, immediate initialization of LSDB

Disadv: LSAR’s instability can impact entire network

Partial adjacency mode

Keep adjacency in a state that allows LSAR to receive LSAs, but does not allow data forwarding over link

Adv: reliable, LSAR’s instability does not impact entire network, immediate initialization of LSDB

Disadv: can raise alarms on the router


Partial Adjacency for LSAR

[Figure: LSAR and router R with the adjacency held in a partial state. LSAR says “I have LSA L”; R repeatedly asks “Please send me LSA L” and, still needing LSA L from LSAR, does not complete the adjacency.]

LSAR-R link is not used for data forwarding



Router R does not advertise a link to LSAR



Routers (except R) not aware of LSAR’s presence



Does not trigger routing calculations in network



LSAR’s going up/down does not impact network



LSAR does not originate any LSAs


LSA aGgregator (LSAG)

Analyzes “reflected” LSAs from LSARs in real-time

Generates console messages:

Change in OSPF network topology

ADJACENCY COST CHANGE: rtr 10.0.0.1 (intf 10.0.0.2) rtr 10.0.0.5 old_cost 1000 new_cost 50000 area 0.0.0.0

Node flaps

RTR FLAP: rtr 10.0.0.12 no_flaps 7 flap_window 570 sec

LSA storms

LSA STORM: lstype 3 lsid 10.1.0.0 advrt 10.0.0.3 area 0.0.0.0 no_lsas 7 storm_window 470 sec

Anomalous behavior

TYPE-3 ROUTE FROM NON-BORDER RTR: ntw 10.3.0.0/24 rtr 10.0.0.6 area 0.0.0.0


Dumps snapshots of network topology
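
The RTR FLAP message on this slide reports a flap count inside a time window. A minimal sketch of how such a sliding-window counter could be built is given here. The class name `FlapDetector`, the threshold of 5 flaps and the 600-second window are assumptions made for illustration, not LSAG's actual code or parameters.

```python
from collections import defaultdict, deque


class FlapDetector:
    """Sliding-window flap counter (illustrative; not LSAG's actual code)."""

    def __init__(self, max_flaps=5, window_sec=600):
        # Both thresholds are hypothetical values chosen for the example.
        self.max_flaps = max_flaps
        self.window_sec = window_sec
        self.events = defaultdict(deque)  # router id -> timestamps of flaps

    def record_flap(self, rtr, now):
        """Record one flap for `rtr` at time `now` (seconds).

        Returns a console-style message when the number of flaps inside
        the window exceeds `max_flaps`, otherwise None.
        """
        q = self.events[rtr]
        q.append(now)
        # Drop flaps that have fallen out of the window.
        while q and now - q[0] > self.window_sec:
            q.popleft()
        if len(q) > self.max_flaps:
            span = int(now - q[0])
            return f"RTR FLAP: rtr {rtr} no_flaps {len(q)} flap_window {span} sec"
        return None


# One flap every 95 seconds on the same router.
detector = FlapDetector()
messages = [detector.record_flap("10.0.0.12", t) for t in range(0, 700, 95)]
alerts = [m for m in messages if m]
```

Keeping one deque per router means each flap costs amortized constant time, which matters if the counter runs on every incoming LSA.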


OSPFScan


Tools for off-line analysis of LSA archives


Parse, select (based on queries), and analyze


Functionality supported by OSPFScan


Classification of LSA traffic


Change LSAs, refresh LSAs, duplicate LSAs


Emulation of OSPF Routing


How OSPF routing tables evolved in response to network changes


What the end-to-end path within the OSPF domain looked like at any instant


Modeling of topology changes


Vertex addition/deletion and link addition/deletion/change_cost


Playback of topology change events


Statistics and report generation
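
The classification of LSA traffic into change, refresh and duplicate LSAs can be sketched as follows. This is a simplified illustration, not OSPFScan's code: it assumes an LSA is identified by (type, link-state id, advertising router), that a repeated or older sequence number marks a duplicate, and that a newer instance with unchanged contents is a refresh. The `Lsa` type and `classify` function are hypothetical, and LS age, checksums and sequence-number wraparound are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lsa:
    lstype: int
    lsid: str
    advrt: str
    seq: int      # LSA sequence number
    body: bytes   # contents describing the advertised connectivity


def classify(lsdb, lsa):
    """Classify `lsa` against `lsdb` and install it if it is newer."""
    key = (lsa.lstype, lsa.lsid, lsa.advrt)
    current = lsdb.get(key)
    if current is not None and lsa.seq <= current.seq:
        return "duplicate"   # this instance (or a newer one) was already seen
    lsdb[key] = lsa
    if current is not None and current.body == lsa.body:
        return "refresh"     # newer instance, same contents
    return "change"          # first sighting, or contents changed


lsdb = {}
stream = [
    Lsa(1, "10.0.0.1", "10.0.0.1", 1, b"cost=1000"),
    Lsa(1, "10.0.0.1", "10.0.0.1", 1, b"cost=1000"),   # second copy via flooding
    Lsa(1, "10.0.0.1", "10.0.0.1", 2, b"cost=1000"),   # periodic refresh
    Lsa(1, "10.0.0.1", "10.0.0.1", 3, b"cost=50000"),  # cost change
]
labels = [classify(lsdb, lsa) for lsa in stream]
```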


Performance Evaluation


Performance of LSAR and LSAG through lab experiments

LSAR and LSAG are key to real-time monitoring

How performance scales with LSA-rate and network size


Experimental Setup

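[Figure: experimental setup. A PC running Zebra holds the emulated topology and sends LSAs over an OSPF adjacency to LSAR. LSAR passes LSAs over TCP connections to LSAG and back to Zebra. The measured components are labelled SUT.]

Measure LSA pass-through time for LSAR

Measure LSA processing time for LSAG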


Methodology


Send a burst of LSAs from Zebra to LSAR


Vary number of LSAs (l) in a burst of 1 sec duration

Use of fully connected graph as the emulated topology

Vary number of nodes (n) in the topology

Performance measurements

LSAR performance: LSA “pass-through” time

Zebra measures time difference between sending and receiving an LSA from LSAR


LSAG performance: LSA processing time


Instrumentation of LSAG code
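
In a fully connected emulated topology each of the n routers neighbors all n-1 others, so the graph has n(n-1)/2 links. The sketch below builds such a topology. The function name `full_mesh`, the router names and the uniform link cost are hypothetical and are not taken from the experiments.

```python
from itertools import combinations


def full_mesh(n, cost=1):
    """Return the links of a fully connected graph on n routers.

    Each link is a (router_a, router_b, cost) tuple; the uniform cost
    is an arbitrary choice for the example.
    """
    routers = [f"r{i}" for i in range(1, n + 1)]
    return [(a, b, cost) for a, b in combinations(routers, 2)]


links = full_mesh(5)
link_count = len(links)   # n * (n - 1) / 2 links for n = 5
```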


LSAR Performance


LSAG Performance


Deployment


Tier-1 ISP network

Area 0, 100+ routers; point-to-point links

Deployed since January, 2003

LSA archive size: 8 MB/day

LSAR connection: partial adjacency mode

Enterprise network

15 areas, 500+ routers; Ethernet-based LANs

Deployed since February, 2002

LSA archive size: 10 MB/day

LSAR connection: host mode


LSAG in Day-to-day Operations

Generation of alarms by feeding messages into higher layer network management systems

Grouping of messages to reduce the number of alarms

Prioritization of messages

Validation of maintenance steps and monitoring the impact of these steps on network-wide OSPF behavior

Example:

Network operators use cost-out/cost-in of links to carry out maintenance

A “link-audit” web-page allows operators to keep track of link costs in real-time
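
A link-audit table of this kind could be kept current by parsing LSAG's cost-change console messages. The sketch below is an assumption about how that might be done, not the deployed tool: the regular expression is inferred from the single sample message in these slides, and `update_audit` and the `COST_OUT_THRESHOLD` value are hypothetical.

```python
import re

# Field layout inferred from the one sample message shown in the slides.
COST_CHANGE = re.compile(
    r"rtr (?P<src>\S+) \(intf (?P<intf>\S+)\).*?rtr (?P<dst>\S+) "
    r"old_cost (?P<old>\d+) new_cost (?P<new>\d+) area (?P<area>\S+)"
)

# Hypothetical: treat costs at or above this value as "costed out".
COST_OUT_THRESHOLD = 50000


def update_audit(audit, message):
    """Update the link-cost table from one console message, if it is a cost change."""
    m = COST_CHANGE.search(message)
    if m is None:
        return audit          # some other message type; table unchanged
    cost = int(m["new"])
    audit[(m["src"], m["dst"])] = {
        "cost": cost,
        "costed_out": cost >= COST_OUT_THRESHOLD,
    }
    return audit


audit = {}
update_audit(audit, "ADJACENCY COST CHANGE: rtr 10.0.0.1 (intf 10.0.0.2) "
                    "rtr 10.0.0.5 old_cost 1000 new_cost 50000 area 0.0.0.0")
update_audit(audit, "RTR FLAP: rtr 10.0.0.12 no_flaps 7 flap_window 570 sec")
```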


Problems Caught by LSAG


Equipment problem


Detected internal problems in a crucial router in enterprise network

Problem manifested as episodes of OSPF adjacency flapping

Configuration problem

Identified assignment of same router-id to two routers in enterprise network

OSPF implementation bug

Caught a bug in type-3 LSA generation code of a router vendor in ISP network

Faster refresh of LSAs than standards-mandated rate


Long Term Analysis by OSPFScan


LSA traffic analysis


Identified excessive duplicate LSA traffic in some areas of Enterprise Network

Led to root-cause analysis and preventative steps

Statistics generation

Inter-arrival time of change LSAs in ISP network

Fine-tuning configurable timers related to route calculation (= SPF calculation)

Mean down-time and up-time for links and routers in ISP network


Assessment of reliability and availability
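
The two statistics named on this slide can be computed from an archive roughly as follows. This is a sketch under stated assumptions: the timestamps and link events are made-up sample data, and the function names are hypothetical rather than OSPFScan commands.

```python
from statistics import mean


def inter_arrival_times(timestamps):
    """Gaps between consecutive change-LSA arrival times (seconds)."""
    ts = sorted(timestamps)
    return [b - a for a, b in zip(ts, ts[1:])]


def mean_up_down(events):
    """Mean up-time and down-time of one link or router.

    `events` is a time-ordered list of (timestamp, state) pairs with
    state "up" or "down"; each state lasts until the next event.
    """
    up, down = [], []
    for (t0, state), (t1, _) in zip(events, events[1:]):
        (up if state == "up" else down).append(t1 - t0)
    return (mean(up) if up else None, mean(down) if down else None)


change_lsa_times = [0, 5, 65, 70, 400]   # made-up sample, in seconds
gaps = inter_arrival_times(change_lsa_times)

link_events = [(0, "up"), (100, "down"), (130, "up"), (430, "down"), (440, "up")]
mean_up, mean_down = mean_up_down(link_events)
```

Short gaps between change LSAs indicate how often a delayed SPF run would batch several changes together, which is the link to timer tuning mentioned above.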


Lessons Learned through Deployment


New tools reveal new failure modes


Real-time alerting and off-line analysis are complementary

Distributed architecture helped a lot

OSPF exhibits significant activity in real networks

Maintenance and genuine problems

Add functionality incrementally and through interaction with users


Archive all LSAs


LSA volume is manageable


Don’t throw away refresh and duplicate LSAs


Conclusion


Three component architecture


LSAR: data collection


LSAG: real-time analysis

OSPFScan: off-line analysis

Performance analysis

LSAR and LSAG scale well as LSA-rate and network size increase

Deployment

Deployed in Tier-1 ISP and Enterprise network


Has proved to be an extremely valuable tool for network management


“OSPF Monitor was a Lifesaver”


VP of Networking, Enterprise network



Future Work


Real-time analysis

Correlation with other fault and performance data for more meaningful alerting

Prioritization of alerts

Off-line analysis

Correlation with other data sources

Work already underway: BGP, fault, performance

Identification of problem signatures and feeding them into real-time component for problem prediction


Backup Slides



Overview of OSPF


OSPF is a link-state protocol

Every router learns entire network topology

Topology is represented as graph

Routers are vertices, links are edges

Every link is assigned weight through configuration

Every router uses Dijkstra’s single source shortest path algorithm to build its forwarding table

Router builds Shortest Path Tree (SPT) with itself as root

Shortest Path Calculation (SPF)

Packets are forwarded along shortest paths defined by link weights
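
The SPF calculation described on this slide can be illustrated with a small Dijkstra implementation that also records the first hop toward each destination, which is what a forwarding table needs. The graph, router names and link weights are made up for the example, and equal-cost multipath is not handled.

```python
import heapq


def spf(graph, root):
    """Dijkstra's algorithm from `root`.

    `graph` maps router -> {neighbor: link weight}. Returns
    {router: (cost, first_hop)}, where first_hop is the neighbor of
    `root` on the shortest path (None for the root itself).
    """
    best = {root: (0, None)}
    heap = [(0, root)]
    done = set()
    while heap:
        cost, node = heapq.heappop(heap)
        if node in done:
            continue              # stale heap entry
        done.add(node)
        first_hop = best[node][1]
        for nbr, weight in graph[node].items():
            new_cost = cost + weight
            if nbr not in best or new_cost < best[nbr][0]:
                best[nbr] = (new_cost, nbr if node == root else first_hop)
                heapq.heappush(heap, (new_cost, nbr))
    return best


graph = {
    "R1": {"R2": 10, "R3": 50},
    "R2": {"R1": 10, "R3": 20, "R4": 100},
    "R3": {"R1": 50, "R2": 20, "R4": 30},
    "R4": {"R2": 100, "R3": 30},
}
table = spf(graph, "R1")
```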


Areas in OSPF


OSPF allows domain to be divided into areas for scalability

Areas are numbered 0, 1, 2 …

Hub-and-spoke with area 0 as hub

Every link is assigned to exactly one area

Routers with links in multiple areas are called border routers

[Figure: Area 1 and Area 2 attached to Area 0 through border routers.]


Summarization with Areas


Each router learns


Entire topology of its attached areas


Information about subnets in remote areas and their distance from the border routers


Distance = sum of link costs from border router to subnet

[Figure: two panels. “OSPF domain” shows Area 0 (routers R1, R2, R3 and border routers B1, B2) and Area 1 (routers C1, C2 and subnets 10.10.4.0/24 and 10.10.5.0/24) with link costs. “R1’s View” shows the full Area 0 topology, with Area 1 reduced to the distances from B1 and B2 to the two subnets.]
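
With summarization, a router such as R1 picks, for each remote subnet, the border router that minimizes its own shortest-path cost to that border router plus the distance the border router advertises. A minimal sketch follows. All cost values are hypothetical and are not read from the figure, and `best_border_router` is an illustrative name.

```python
def best_border_router(cost_to_border, advertised):
    """Return (total_cost, border_router) for one remote subnet.

    total_cost = cost from this router to the border router
               + distance from the border router to the subnet.
    """
    return min(
        (cost_to_border[b] + dist, b)
        for b, dist in advertised.items()
        if b in cost_to_border
    )


# R1's shortest-path costs to the border routers inside area 0 (hypothetical).
cost_to_border = {"B1": 100, "B2": 140}

# Distances advertised by each border router for area 1 subnets (hypothetical).
advertised = {
    "10.10.4.0/24": {"B1": 20, "B2": 70},
    "10.10.5.0/24": {"B1": 60, "B2": 10},
}

routes = {
    subnet: best_border_router(cost_to_border, dists)
    for subnet, dists in advertised.items()
}
```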


Link State Advertisements (LSAs)


Every router describes its local connectivity in Link State Advertisements (LSAs)

Router originates an LSA due to…

Change in network topology

Example: link goes down or comes up

Periodic soft-state refresh

Recommended value of interval is 30 minutes

LSA is flooded to other routers in the domain

Flooding is reliable and hop-by-hop

Includes change and refresh LSAs

Flooding leads to duplicate copies of LSAs being received

Every router stores LSAs (self-originated + received) in link-state database (= topology graph)


Adjacency


Neighbor routers (i.e., routers connected by a physical link) form an adjacency

The purpose is to make sure

Link is operational and routers can communicate with each other

Neighbor routers have consistent view of network topology

To avoid loops and black holes

Link gets used for data forwarding only after adjacency is established

Use of periodic Hellos to monitor the status of link and adjacency
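
Hello-based monitoring amounts to declaring a neighbor down when no Hello has arrived within a dead interval. The sketch below illustrates that timer logic only. The 40-second dead interval is a commonly configured value used here as an assumption, and `NeighborMonitor` is a hypothetical name, not part of any router implementation.

```python
class NeighborMonitor:
    """Tracks the last Hello seen from each neighbor (illustrative only)."""

    def __init__(self, dead_interval=40.0):
        # 40 s is an assumed, commonly configured value.
        self.dead_interval = dead_interval
        self.last_hello = {}   # neighbor id -> time of last Hello (seconds)

    def hello_received(self, neighbor, now):
        self.last_hello[neighbor] = now

    def dead_neighbors(self, now):
        """Neighbors whose last Hello is older than the dead interval."""
        return sorted(
            n for n, t in self.last_hello.items()
            if now - t > self.dead_interval
        )


mon = NeighborMonitor()
mon.hello_received("10.0.0.1", 0.0)
mon.hello_received("10.0.0.2", 0.0)
mon.hello_received("10.0.0.1", 30.0)   # 10.0.0.2 has gone silent
down = mon.dead_neighbors(45.0)
```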


Equipment Problem at Enterprise Network


Internal errors in a router in area 0


Episodes where router would drop adjacencies with other routers


Problem manifested in LSAG as “ADJ UP” and “ADJ DOWN” messages


Not visible in other network management systems


Led to proactive maintenance


LSA Traffic in Enterprise Network

[Figure: duplicate, change and refresh LSA traffic plotted against days, one panel each for Area 0, Area 2, Area 3 and Area 4. Annotations: “Artifact: 23 hr day (Apr 7)” and two points marked “Genuine Anomaly”.]


Overhead: Duplicate LSAs


Why do some areas witness substantial duplicate LSA traffic, while other areas do not witness any?

OSPF flooding over LANs leads to control plane asymmetries and to imbalances in duplicate LSA traffic