Experience in Black-box OSPF Measurement


UCSC
SHS
A Case-study of OSPF Behavior in a Large Enterprise Network
Aman Shaikh, UCSC
Chris Isett, Siemens Health Services
Albert Greenberg, AT&T Labs-Research
Matthew Roughan, AT&T Labs-Research
Joel Gottlieb, AT&T Labs-Research
IMW – November 07, 2002
Aman Shaikh
IMW - 2002
1
Why Study OSPF Behavior?
• Any meaningful performance assurance depends on routing
stability
– An internal network change (OSPF event) can have major
impact on services, flows and customers
• Transients can degrade services significantly (e.g., VoIP)
• Expectations for IP network management are higher
– Improve OSPF performance, particularly reliable and fast
detection of topology change, without introducing instabilities
– Changes are needed
• Parameter adjustment or more fundamental
• Realistic workload models for simulations are needed
– Testing scalability, convergence, reliability
• However, the behavior and performance of OSPF in large ISPs and
enterprise networks are not well understood
OSPF
• OSPF is a Link-state routing protocol
– All routers in the domain come to a consistent
view of the topology by exchange of Link
State Advertisements (LSAs)
• Router describes its local connectivity (i.e., set
of links) in an LSA
– Set of LSAs (self-originated + received) at a
router = topology
• Hierarchical routing
– OSPF domain can be divided into areas
– Hub-and-spoke topology with area 0 as hub
and other non-zero areas as spokes
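The idea that a router's set of LSAs amounts to the topology can be sketched as follows. This is a minimal illustration, not the format used on the wire: the `lsdb` dictionary, the router names and the `build_topology` helper are all hypothetical, and real router LSAs carry more fields (link types, sequence numbers, age, checksum).

```python
from collections import defaultdict

# Hypothetical, simplified link-state database:
# advertising router -> list of (neighbor, cost) taken from its LSA.
lsdb = {
    "R1": [("R2", 10), ("R3", 5)],
    "R2": [("R1", 10), ("R3", 1)],
    "R3": [("R1", 5), ("R2", 1)],
}

def build_topology(lsdb):
    """Return a weighted directed graph: router -> {neighbor: cost}."""
    graph = defaultdict(dict)
    for router, links in lsdb.items():
        for neighbor, cost in links:
            # Keep a link only if the neighbor's LSA also lists this router,
            # mirroring the bidirectional check made before a link is used.
            back_links = dict(lsdb.get(neighbor, []))
            if router in back_links:
                graph[router][neighbor] = cost
    return dict(graph)

topology = build_topology(lsdb)
print(topology["R1"])
```

Every router that holds the same set of LSAs derives the same graph, which is what a consistent view of the topology means here.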
OSPF Performance
• OSPF processing impacts convergence,
(in)stability
– Load is increasing as networks grow
• Bulk of OSPF processing is due to LSAs
– Sending/receiving LSAs
– LSAs can trigger Route calculation (Dijkstra’s
algorithm)
• Understanding dynamics of LSA traffic is key for
a better understanding of OSPF
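The route calculation that an LSA can trigger is a shortest-path computation over the topology graph. The sketch below is a textbook Dijkstra implementation under assumed inputs; the `graph` literal and the `spf` function name are illustrative, and production OSPF code also handles areas, equal-cost paths and external routes.

```python
import heapq

def spf(graph, source):
    """Dijkstra's algorithm: cost of the cheapest path from source to each node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a cheaper path was already found
        for neighbor, cost in graph.get(node, {}).items():
            candidate = d + cost
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist

# Hypothetical three-router topology with link costs.
graph = {
    "R1": {"R2": 10, "R3": 5},
    "R2": {"R1": 10, "R3": 1},
    "R3": {"R1": 5, "R2": 1},
}
distances = spf(graph, "R1")
print(distances)
```

In this example R1 reaches R2 more cheaply through R3 (5 + 1) than over the direct link (10). Each such run costs CPU time, which is why the volume of change LSAs matters.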
Methodology
• Categorize and baseline LSA traffic
• Detect, diagnose and act on anomalies
• Propose changes to improve performance
Categorizing LSA Traffic
• A router originates an LSA due to…
– Change in network topology → Change LSAs
• Example: link goes down or comes up
• Detection of anomalies and problems
– Periodic soft-state refresh → Refresh LSAs
• Recommended value of interval is 30 minutes
• Forms baseline LSA traffic
• LSAs are disseminated using reliable flooding
– Includes change and refresh LSAs
– Flooding leads to duplicate copies of LSAs being received at a router → Duplicate LSAs
– Overhead: wastes resources
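One way a passive monitor could sort received LSAs into the three categories is sketched below. This is an assumption-laden illustration rather than the classifier used in the study: the `classify_lsa` function, the `(seq, body)` representation and the rule that a first-seen LSA counts as a change are all hypothetical, and sequence-number wraparound is ignored.

```python
def classify_lsa(db, key, seq, body):
    """Classify one received LSA instance as 'duplicate', 'refresh' or 'change'.

    db   -- dict mapping LSA key -> (latest sequence number, latest body)
    key  -- identity of the LSA, e.g. (type, link-state ID, advertising router)
    seq  -- sequence number of the received instance
    body -- contents of the received instance
    """
    previous = db.get(key)
    if previous is not None and seq <= previous[0]:
        return "duplicate"      # this instance (or a newer one) was already seen
    db[key] = (seq, body)
    if previous is not None and body == previous[1]:
        return "refresh"        # new instance, same contents
    return "change"             # new LSA or new contents

db = {}
key = ("router", "10.0.0.1", "10.0.0.1")
results = [
    classify_lsa(db, key, 1, "links: A,B"),
    classify_lsa(db, key, 1, "links: A,B"),
    classify_lsa(db, key, 2, "links: A,B"),
    classify_lsa(db, key, 3, "links: A"),
]
print(results)
```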
Highlights of the Results
• Categorize, baseline and predict
– Categories: Refresh, Change, Duplicate; External, Internal
– Bulk of LSA traffic is due to refresh
– Refresh LSA traffic is smooth: no evidence of refresh
synchronization across network
– Refresh LSA traffic is predictable from router configuration info
• Detect, diagnose and act
– Almost all change LSAs arise from persistent yet partial failure modes
– Internal LSA spikes
• Indicate router hardware degradation
• Carry out preventive maintenance
– External LSA spikes
• Indicate degradation in customer connectivity
• Call customer before customer calls you
• Propose Improvements
– Simple configuration changes to reduce duplicate LSA traffic
Enterprise Network Case Study
• The network provides customers with connectivity to
applications and databases residing in the data center
• OSPF network
– 15 areas, 500 routers
• This case study covers 8 areas, 250 routers
• One month: April 2002
– Link-layer = Ethernet-based LANs
• Customers are connected via leased lines
– Customer routes are injected via EIGRP into OSPF
• The routes are propagated via external LSAs
• Quite reasonable for the enterprise network in question
Enterprise Network Topology
[Figure: enterprise network topology. Labels: customers attached via external (EIGRP) links; OSPF domain with Area A, Area B, Area C and Area 0; LAN1 and LAN2 within an area; border routers B1 and B2; monitor on LAN1; servers (database, applications) in Area 0.]
• Monitor is completely passive
• No adjacencies with any routers
• Receives LSAs on a multicast group
LSA Traffic in Different Areas
[Figure: refresh, change and duplicate LSAs per day over the days of the month, one panel each for Area 0, Area 2, Area 3 and Area 4. Annotations: "Genuine Anomaly" (two places) and "Artifact: 23 hr day (Apr 7)".]
Baseline LSA Traffic: Refresh LSAs
• Refresh LSA traffic can be reliably predicted using
information available in router configuration files
– Important for workload modeling
– See paper for details
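The baseline arithmetic can be sketched as follows, assuming the number of LSAs in an area has already been derived from the configuration files (the paper describes how; that step is not reproduced here). The function name and the LSA count of 100 are hypothetical. With the recommended 30-minute interval, each LSA is refreshed 48 times in a 24-hour day.

```python
REFRESH_INTERVAL_S = 30 * 60  # recommended refresh interval (30 minutes)

def expected_refresh_lsas(num_lsas, hours=24, interval_s=REFRESH_INTERVAL_S):
    """Expected refresh LSAs in a period, if every LSA is refreshed once per interval."""
    return num_lsas * hours * 3600 // interval_s

# Hypothetical area with 100 LSAs.
normal_day = expected_refresh_lsas(100)
short_day = expected_refresh_lsas(100, hours=23)  # e.g. a 23-hour day
print(normal_day, short_day)
```

The 23-hour case shows how a shorter measurement day lowers the daily count without any change in the network.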
[Figure: refresh LSAs per day, expected (from configuration) versus actual, for Area 2 and Area 3.]
Refresh process is not synchronized
Negligible LSA clumping
• No evidence of synchronization
– Contrary to simulation-based study in [Basu01]
• Reasons
– Changes in the topology help break synchronization
– LSA refresh at one router is not coupled with LSA refresh at
other routers
– Drift in the refresh interval of different routers
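One simple way to quantify clumping is to bin refresh arrivals into fixed windows and compare the busiest window with the average. The peak-to-mean ratio below is a hypothetical metric chosen for illustration, not the measure used in the study; a value near 1 means smooth traffic, and larger values mean arrivals bunch together.

```python
from collections import Counter

def peak_to_mean(arrival_times_s, duration_s, window_s=1.0):
    """Ratio of the busiest window's LSA count to the mean count per window."""
    if not arrival_times_s:
        return 0.0
    bins = Counter(int(t // window_s) for t in arrival_times_s)
    mean_per_window = len(arrival_times_s) / (duration_s / window_s)
    return max(bins.values()) / mean_per_window

# Made-up arrival times over a 10-second observation period.
smooth = [float(i) for i in range(10)]   # one LSA per second
clumped = [i / 10 for i in range(10)]    # all ten LSAs within the first second
print(peak_to_mean(smooth, 10), peak_to_mean(clumped, 10))
```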
Anomaly Detection: Change LSAs
[Figure: change LSAs per day over the days of the month, external versus internal; vertical axis from 1 to 10000.]
• Internal to OSPF domain versus external
– Change LSAs due to external events dominated
– Not surprising due to large number of leased lines used to
import customer routes into OSPF
• Customer volatility → network volatility
Root Causes of Change LSAs
• Persistent problem → flapping → numerous change LSAs
– Internal LSA spikes → hardware router problems
• OSPF monitor identified a problem (not visible to SNMP-based network mgt tools) early and led to preventive maintenance
– External LSA spikes → customer route volatility
• Overload of an external link to a customer between 8 pm – 4 am causes EIGRP session on that link to flap
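Tracing a spike back to a single flapping source can be sketched by counting change LSAs per LSA key and then looking at the hourly profile of the heaviest contributor. The event data, key names and threshold below are made up for illustration, and both helper functions are hypothetical.

```python
from collections import Counter

def flapping_sources(events, threshold):
    """events: iterable of (hour_of_day, lsa_key) pairs, one per change LSA."""
    counts = Counter(key for _, key in events)
    return {key: n for key, n in counts.items() if n >= threshold}

def hourly_profile(events, lsa_key):
    """Number of change LSAs per hour of day for one LSA key."""
    return Counter(hour for hour, key in events if key == lsa_key)

# Made-up change LSAs for one day: one external route flaps at night.
events = (
    [(21, "ext-customer-A")] * 40
    + [(2, "ext-customer-A")] * 35
    + [(14, "ext-customer-B")]
    + [(9, "rtr-R7")] * 2
)
suspects = flapping_sources(events, threshold=10)
profile = hourly_profile(events, "ext-customer-A")
print(suspects, dict(profile))
```

A profile concentrated in particular hours, as in the overloaded-link example above, points to a load-related cause rather than a random failure.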
[Figure: total LSAs in area 2 versus LSAs due to the flapping link, shown per day in April 2002 and per hour on April 11, 2002.]
Overhead: Duplicate LSAs
[Figure: duplicate LSAs per day in area 2 and area 3.]
• Why do some areas witness substantial duplicate LSA
traffic, while other areas do not witness any?
– OSPF flooding over LANs leads to control plane
asymmetries and to imbalances in duplicate LSA
traffic
LSA Flooding over Broadcast LANs
[Figure: routers on a LAN, with the DR and BDR marked.]
• DR = Designated Router, BDR = Backup Designated Router
• Who becomes DR and BDR depends on configuration
• Flooding on a LAN is a two-step process:
1. A router multicasts LSA to DR and BDR
2. DR or BDR multicasts LSA to other routers
• LSA appears only twice on LAN instead of n – 1 times
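The two-step process can be sketched as a list of transmissions on the LAN. This is a simplified model: acknowledgments, retransmissions and the case where the BDR steps in are left out, and the function names are hypothetical. The two multicast addresses are the standard OSPF groups for the DR/BDR (AllDRouters) and for all OSPF routers (AllSPFRouters).

```python
ALL_SPF_ROUTERS = "224.0.0.5"  # every OSPF router on the LAN listens here
ALL_D_ROUTERS = "224.0.0.6"    # only the DR and BDR listen here

def flood_on_lan(originator, dr):
    """Return the (sender, destination group) transmissions for one LSA."""
    if originator == dr:
        # The DR originates the LSA itself and multicasts it once.
        return [(dr, ALL_SPF_ROUTERS)]
    return [
        (originator, ALL_D_ROUTERS),  # step 1: router -> DR and BDR
        (dr, ALL_SPF_ROUTERS),        # step 2: DR -> all other routers
    ]

def copies_without_dr(n):
    """Copies on the LAN if the originator sent to each of n - 1 peers separately."""
    return n - 1

transmissions = flood_on_lan("R3", dr="R1")
print(len(transmissions), copies_without_dr(10))
```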
Control Plane Asymmetry
• Two LANs (LAN1 and LAN2) in each area
• Monitor is on LAN1
• Routers B1 and B2 are connected to LAN1 and
LAN2
• LSAs originated on LAN2 can get duplicated
depending on which routers have become DR
and BDR on LAN1
– Leads to control plane asymmetry
– Four cases
Four Cases
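[Figure: four cases, labeled by (DR, BDR) on LAN1. B1 and B2 attach to both LAN1 and LAN2; R and R' denote other routers on LAN1.]
Case 1 (B1, B2): B1 is DR, B2 is BDR
Case 2 (B1, R): B1 is DR, another router is BDR
Case 3 (R, B1): another router is DR, B1 is BDR
Case 4 (R, R'): neither B1 nor B2 is DR or BDR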
Eliminating Duplicate LSA Traffic
                                  Case 1   Case 2   Case 3   Case 4
Duplicate LSA traffic             High     None     High     None
Deterministic via configuration   Yes      No       No       Yes

[Figure annotations: Area 2 and Area 3 are each marked (X) under one of the cases, with arrows labeled "configuration change"; the positions of the marks are not recoverable from this transcript.]
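How configuration decides the case can be sketched with a simplified DR election and a lookup of the duplicate-traffic level for each case. Several things here are assumptions: the election below just picks the two highest-priority eligible routers, with ties broken by router ID, and ignores that a real OSPF election does not preempt an existing DR; router IDs are compared as strings for brevity; and setting priority 0 on B1 and B2 is shown as one possible way to reach Case 4, not necessarily the change made in the study.

```python
# Duplicate LSA traffic per case, as listed in the table.
DUPLICATE_TRAFFIC = {1: "High", 2: "None", 3: "High", 4: "None"}

def elect_dr_bdr(priorities):
    """Simplified election: priority 0 is ineligible, highest priority wins."""
    eligible = sorted(
        (r for r, p in priorities.items() if p > 0),
        key=lambda r: (priorities[r], r),
        reverse=True,
    )
    dr = eligible[0] if eligible else None
    bdr = eligible[1] if len(eligible) > 1 else None
    return dr, bdr

def classify_case(dr, bdr, border=("B1", "B2")):
    """Map the (DR, BDR) pair on LAN1 to one of the four cases."""
    dr_is_border, bdr_is_border = dr in border, bdr in border
    if dr_is_border and bdr_is_border:
        return 1
    if dr_is_border:
        return 2
    if bdr_is_border:
        return 3
    return 4

# Hypothetical LAN1 priorities: B1 and B2 made ineligible.
dr, bdr = elect_dr_bdr({"B1": 0, "B2": 0, "R": 1, "R'": 1})
case = classify_case(dr, bdr)
print(dr, bdr, case, DUPLICATE_TRAFFIC[case])
```

Cases 2 and 3 depend on which routers happen to win the election, which is consistent with the table marking them as not deterministic via configuration.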
Summary
• Categorize and baseline LSA traffic
– Refresh LSAs: constitute bulk of overall LSA traffic
• No evidence of synchronization between different routers
• Refresh LSA traffic predictable from configuration information
• Detect, diagnose and act on anomalies
– Change LSAs: can indicate persistent yet partial
failure modes
• Internal LSA spikes → hardware router problems → preventive router maintenance
• External LSA spikes → customer congestion problems → “preventive” customer care
• Propose changes to improve performance
– Duplicate LSAs: can arise from control plane
asymmetries
• Simple configuration changes can eliminate duplicate LSAs
and improve performance
Future Work
• Study OSPF behavior in other commercial networks
– ISPs, enterprise networks
• Longer term studies
• Combine with other data sources
– BGP: interaction with OSPF
– Traffic: impact of routing on forwarding
• Convergence
• Better monitoring and management tools
• Good simulation models
– Combine with router-level measurements [Shaikh &
Greenberg, IMW ‘01]
Backup
Questions
• OSPF is a Link-state routing protocol
– All routers in the domain come to a consistent view of the
topology by exchange of Link State Advertisements (LSAs)
– Three categories of LSAs: refresh, change, duplicate
• Refresh
– Is the refresh traffic predictable? Can it be baselined?
– Is refresh traffic synchronized in real networks?
• Change
– What is the nature of change LSA traffic, arising from internal
and external sources?
– What do the failure modes look like?
– Is it possible to use this traffic to trigger preventive maintenance
(e.g., just as measurements of bit error rates trigger
preventive maintenance of the data plane)?
• Duplicate
– Can duplicate LSAs be reduced? At what cost to reliability?
Router Model
[Figure: router model. The route processor (CPU) runs the OSPF process: LSA processing, LSA flooding, topology view, SPF calculation and FIB update. Interface cards, connected by the switching fabric, hold the FIB and forward data packets; LSAs and LS Acks pass between the interface cards and the route processor.]