Finding a Needle in a Haystack:
Pinpointing Significant BGP Routing
Changes in an IP Network
Jian Wu (University of Michigan)
Z. Morley Mao (University of Michigan)
Jennifer Rexford (Princeton University)
Jia Wang (AT&T Labs Research)
1
Motivation
[Diagram: paths from a source to a destination across AS1, AS2, AS3, and AS4, with border routers (BR); labels: Failure, Disruption, Congestion, Mitigation]
A backbone network is vulnerable to routing changes that occur in other domains.
2
Goal
• Identify important routing anomalies
  • Lost reachability
  • Persistent flapping
  • Large traffic shifts
Contributions:
• Build a tool to identify a small number of important routing disruptions from a large volume of raw BGP updates in real time.
• Use the tool to characterize routing disruptions in an operational network.
3
Interdomain Routing:
Border Gateway Protocol
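[Diagram: AS 1, which contains 12.34.158.0/24, announces “I can reach 12.34.158.0/24” over eBGP to AS 2; AS 2 distributes the route among its border routers (BR) over iBGP and announces “I can reach 12.34.158.0/24 via AS 1” over eBGP to AS 3; data traffic for 12.34.158.5 flows in the opposite direction]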
Prefix-based: one route per prefix
Path-vector: list of ASes in the path
Incremental: every update indicates a change
Policy-based: local ranking of routes
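The first three properties can be illustrated with a minimal sketch. It assumes a toy routing table that maps each prefix to a single AS path; the names `announce`, `withdraw`, `export`, and `lookup` are hypothetical and not part of any BGP implementation. Policy-based ranking among competing routes is deliberately left out.

```python
import ipaddress


def announce(rib, prefix, as_path):
    """Incremental update: install or replace the one route kept for a prefix."""
    rib[ipaddress.ip_network(prefix)] = list(as_path)


def withdraw(rib, prefix):
    """Incremental update: remove the route for a prefix, if present."""
    rib.pop(ipaddress.ip_network(prefix), None)


def export(as_path, local_as):
    """Path-vector: an AS prepends its own number before re-announcing a route."""
    return [local_as] + list(as_path)


def lookup(rib, address):
    """Return (prefix, AS path) for the longest matching prefix, or None."""
    addr = ipaddress.ip_address(address)
    matches = [net for net in rib if addr in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return best, rib[best]


# AS 1 originates the prefix; AS 2 re-announces it to AS 3 "via AS 1".
rib_as3 = {}
announce(rib_as3, "12.34.158.0/24", export([1], local_as=2))
print(lookup(rib_as3, "12.34.158.5"))
```

In this sketch, AS 3 ends up with the path `[2, 1]` for 12.34.158.0/24, so traffic to 12.34.158.5 is sent toward AS 2.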
4
Capturing Routing Changes
A large operational network
(8/16/2004 – 10/10/2004)
[Diagram: border routers (BR) of the network, with a BGP Monitor and CPE]
5
Challenges
• Large volume of BGP updates
  • Millions daily, very bursty
  • Too much for an operator to manage
• Different from root-cause analysis
  • Identify changes and their effects
  • Focus on actionable events rather than diagnosis
  • Diagnose causes in/near the AS
6
System Architecture
[Diagram: processing pipeline fed by BGP updates from the border routers (BR)]
BGP Updates (10^6) → BGP Update Grouping → Events (10^5) → Event Classification → “Typed” Events → Event Correlation → Clusters (10^3) → Traffic Impact Prediction (with Netflow Data) → Large Disruptions (10^1)
Side outputs: Persistent Flapping Prefixes (10^1), Frequent Flapping Prefixes (10^1)
From millions of updates to a few dozen reports
7
Grouping BGP Updates into Events
Challenge: A single routing change
• leads to multiple update messages
• affects routing decisions at multiple routers
[Diagram: BGP Updates → BGP Update Grouping → Events, with Persistent Flapping Prefixes as a side output]
Approach:
• Group together all updates for a prefix with inter-arrival time < 70 seconds.
• Flag prefixes with changes lasting > 10 minutes.
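The grouping rule can be sketched as follows. This is an offline, batch version under simplifying assumptions (the tool described in the talk runs online); the function name `group_updates` and the `(timestamp, prefix)` input format are hypothetical.

```python
from collections import defaultdict

EVENT_TIMEOUT = 70          # seconds between updates of the same event
CONVERGENCE_TIMEOUT = 600   # 10 minutes, in seconds


def group_updates(updates):
    """Group (timestamp_seconds, prefix) updates into per-prefix events.

    Returns (events, flapping): events is a list of
    (prefix, start, end, update_count); flapping is the set of prefixes
    with at least one event lasting longer than CONVERGENCE_TIMEOUT.
    """
    by_prefix = defaultdict(list)
    for ts, prefix in updates:
        by_prefix[prefix].append(ts)

    events = []
    for prefix, stamps in by_prefix.items():
        stamps.sort()
        start = prev = stamps[0]
        count = 1
        for ts in stamps[1:]:
            if ts - prev < EVENT_TIMEOUT:
                count += 1                       # same event continues
            else:
                events.append((prefix, start, prev, count))
                start, count = ts, 1             # a new event begins
            prev = ts
        events.append((prefix, start, prev, count))

    flapping = {p for p, start, end, _ in events
                if end - start > CONVERGENCE_TIMEOUT}
    return events, flapping


updates = [(0, "p"), (30, "p"), (90, "p"), (200, "p"), (0, "q")]
print(group_updates(updates))
```

Here the first three updates for `p` form one event (gaps of 30 and 60 seconds), and the update at 200 seconds starts a second event because the gap of 110 seconds exceeds the timeout.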
8
Grouping Thresholds
• Based on our understanding of BGP and data analysis
• Event timeout: 70 seconds
  • 2 * MRAI timer + 10 seconds
  • 98% of inter-arrival times < 70 seconds
• Convergence timeout: 10 minutes
  • BGP usually converges within a few minutes
  • 99.9% of events < 10 minutes
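The arithmetic behind the event timeout, and the kind of check used to validate a threshold against data, can be sketched as below. The 30-second MRAI value is an assumption (a commonly used default for eBGP sessions, not stated on the slide), and both function names are hypothetical.

```python
MRAI_SECONDS = 30  # assumption: common default MRAI for eBGP sessions


def event_timeout(mrai=MRAI_SECONDS, slack=10):
    """Event timeout = 2 * MRAI timer + 10 seconds of slack."""
    return 2 * mrai + slack


def fraction_below(values, threshold):
    """Fraction of observed values strictly below a threshold."""
    values = list(values)
    if not values:
        raise ValueError("no values to evaluate")
    return sum(v < threshold for v in values) / len(values)


print(event_timeout())                                  # 70
print(fraction_below([1, 5, 69, 70, 100], threshold=70))  # 0.6
```

With real inter-arrival times as input, `fraction_below` is the kind of measurement that would yield figures such as the 98% reported above.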
9
Persistent Flapping Prefixes
A surprising finding:
15.2% of updates were caused by persistent-flapping prefixes even though flap damping is enabled.
• Types of persistent flapping
  • Conservative damping parameters (78.6%)
  • Protocol oscillations due to MED (18.3%)
  • Unstable interfaces or BGP sessions (3.0%)
10
Example: Unstable eBGP Session
[Diagram: ISP with routers AE, BE, CE, DE; a Peer; a Customer; prefix p]
• Flap damping parameters are session-based
• Damping is not implemented for iBGP sessions
11
Event Classification
Challenge: Major concerns in network management
• Changes in reachability
• Heavy load of routing messages on the routers
• Changes in the flow of traffic through the network
[Diagram: Events → Event Classification → “Typed” Events, e.g., Loss/Gain of Reachability]
Solution: classify events by the severity of their impact
12
Event Category – “No Disruption”
[Diagram: ISP routers AE, BE, CE, DE, EE; neighbors AS1 and AS2; prefix p; label: No Traffic Shift]
“No Disruption”: no border routers have any traffic shift. (50.3%)
13
Event Category – “Internal Disruption”
[Diagram: ISP routers AE, BE, CE, DE, EE; neighbors AS1 and AS2; prefix p; label: Internal Traffic Shift]
“Internal Disruption”: all traffic shifts are internal. (15.6%)
14
Event Category – “Single External
Disruption”
[Diagram: ISP routers AE, BE, CE, DE, EE; neighbors AS1 and AS2; prefix p; label: External Traffic Shift]
“Single External Disruption”: only one of the traffic shifts is external. (20.7%)
15
Statistics on Event Classification
Category                       Events    Updates
No Disruption                  50.3%     48.6%
Internal Disruption            15.6%      3.4%
Single External Disruption     20.7%      7.9%
Multiple External Disruption    7.4%     18.2%
Loss/Gain of Reachability       6.0%     21.9%
• The first 3 categories have significant day-to-day variations
• The number of updates per event depends on the type of event and the number of affected routers
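The five categories can be expressed as a small decision procedure. This is a sketch under stated assumptions: each border router's traffic shift is taken as already labeled `"none"`, `"internal"`, or `"external"`, reachability before and after the event is given as a flag, and "Multiple External Disruption" is read as more than one external shift. The function name `classify_event` is hypothetical.

```python
def classify_event(shifts, reachable_before=True, reachable_after=True):
    """Classify one event from per-router traffic-shift labels.

    shifts: dict mapping border-router name to "none", "internal",
    or "external" (assumed to be computed elsewhere).
    """
    if reachable_before != reachable_after:
        return "Loss/Gain of Reachability"
    kinds = list(shifts.values())
    external = kinds.count("external")
    if external > 1:
        return "Multiple External Disruption"
    if external == 1:
        return "Single External Disruption"
    if "internal" in kinds:
        return "Internal Disruption"   # every shift that occurred is internal
    return "No Disruption"             # no border router shifted traffic


print(classify_event({"AE": "none", "BE": "none", "CE": "none"}))
print(classify_event({"AE": "internal", "BE": "external", "CE": "internal"}))
```

The reachability check comes first because a prefix that gains or loses all routes has no meaningful "shift" between egress points.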
16
Event Correlation
Challenge: A single routing change
• affects multiple destination prefixes
[Diagram: “Typed” Events → Event Correlation → Clusters]
Solution: group events of the same type that occur close together in time
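One way to group same-type, close-occurring events is sketched below. The 60-second window is a hypothetical value chosen for illustration (the slide does not give one), as are the function name `correlate` and the `(start_time, event_type, prefix)` input format.

```python
from collections import defaultdict


def correlate(events, window=60):
    """Cluster events of the same type whose start times are close.

    events: iterable of (start_time_seconds, event_type, prefix).
    An event joins the current cluster of its type if it starts within
    `window` seconds of the previous event in that cluster.
    Returns a list of (event_type, [(start_time, prefix), ...]).
    """
    by_type = defaultdict(list)
    for start, etype, prefix in events:
        by_type[etype].append((start, prefix))

    clusters = []
    for etype, items in by_type.items():
        items.sort()
        current = [items[0]]
        for item in items[1:]:
            if item[0] - current[-1][0] <= window:
                current.append(item)
            else:
                clusters.append((etype, current))
                current = [item]
        clusters.append((etype, current))
    return clusters


events = [(0, "internal", "a"), (10, "internal", "b"),
          (500, "internal", "c"), (5, "external", "d")]
for cluster in correlate(events):
    print(cluster)
```

In this example the two internal events at 0 and 10 seconds fall into one cluster, while the internal event at 500 seconds and the external event each form their own.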
17
EBGP Session Reset
• Caused most of the “single external disruption” events
• Check if the number of prefixes using that session as the best route changes dramatically
[Plot: number of prefixes over time, with a session failure followed by a session recovery]
• Validation with Syslog router report (95%)
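The check described above can be sketched as a scan over a per-session time series. The 50% change threshold, the function name `detect_session_changes`, and the sample counts are all hypothetical; the slide only says the number of prefixes "changes dramatically".

```python
def detect_session_changes(counts, min_fraction=0.5):
    """Flag dramatic changes in the number of prefixes using a session.

    counts: list of (timestamp, prefix_count) for one eBGP session,
    in time order. A change between consecutive samples is flagged when
    it is at least `min_fraction` of the larger of the two counts.
    Returns a list of (timestamp, "failure" or "recovery").
    """
    changes = []
    for (_, before), (ts, after) in zip(counts, counts[1:]):
        larger = max(before, after)
        if larger == 0:
            continue  # session unused in both samples
        if abs(after - before) / larger >= min_fraction:
            kind = "failure" if after < before else "recovery"
            changes.append((ts, kind))
    return changes


samples = [(0, 1000), (60, 990), (120, 5), (180, 4), (240, 1000)]
print(detect_session_changes(samples))
```

For these samples, the drop from 990 to 5 prefixes is reported as a failure at 120 seconds and the jump from 4 to 1000 as a recovery at 240 seconds; the small fluctuations in between are ignored.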
18
Hot-Potato Changes
[Diagram: ISP routers AE, BE, CE with internal distances 11, 9, 10 toward the egress points for prefix P]
“Hot-potato routing” = route to closest egress point
• Caused “internal disruption” events
• Validation with OSPF measurement (95%)
[Teixeira et al – SIGMETRICS’ 04]
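A hot-potato change can be illustrated with a minimal sketch: a router picks the egress point with the smallest internal (IGP) distance, and a change in those distances can move its traffic to a different egress without any external BGP change. The function names are hypothetical, and the assignment of the distances 11, 9, and 10 to particular routers is illustrative, not read from the figure.

```python
def closest_egress(igp_distance):
    """Return the egress point with the smallest IGP distance.

    igp_distance: dict mapping egress-router name to distance.
    Ties are broken by router name (an assumption of this sketch).
    """
    return min(sorted(igp_distance), key=igp_distance.get)


def hot_potato_change(before, after):
    """Return (old_egress, new_egress) if the closest egress changed, else None."""
    old, new = closest_egress(before), closest_egress(after)
    return (old, new) if old != new else None


before = {"AE": 11, "BE": 9, "CE": 10}
after = {"AE": 11, "BE": 12, "CE": 10}   # internal distance to BE increased
print(hot_potato_change(before, after))
```

Here the router moves its traffic from BE to CE purely because of an internal distance change, which is why such events are classified as internal disruptions.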
19
Traffic Impact Prediction
Challenge: Routing changes have different impacts on the network, depending on the popularity of the destinations
[Diagram: Clusters → Traffic Impact Prediction (with Netflow Data from the border routers) → Large Disruptions]
Solution: weigh each cluster by traffic volume
20
Traffic Impact Prediction
• Traffic weight
  • Per-prefix measurement from Netflow
  • 10% of prefixes account for 90% of traffic
• Traffic weight of a cluster
  • The sum of the “traffic weight” of its prefixes
• A small number of large clusters have large traffic weight
  • Mostly session resets and hot-potato changes
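The weighting step can be sketched as follows. Normalizing per-prefix byte counts to fractions of total traffic is an assumption of this sketch, as are the function names and the sample numbers; the slide only states that a cluster's weight is the sum of its prefixes' traffic weights.

```python
def prefix_weights(bytes_by_prefix):
    """Convert per-prefix byte counts into fractions of total traffic."""
    total = sum(bytes_by_prefix.values())
    if total == 0:
        return {p: 0.0 for p in bytes_by_prefix}
    return {p: b / total for p, b in bytes_by_prefix.items()}


def cluster_weight(cluster_prefixes, weights):
    """Traffic weight of a cluster: sum of the weights of its prefixes."""
    return sum(weights.get(p, 0.0) for p in set(cluster_prefixes))


def rank_clusters(clusters, weights):
    """Return cluster names ordered from largest to smallest traffic weight."""
    return sorted(clusters,
                  key=lambda name: cluster_weight(clusters[name], weights),
                  reverse=True)


weights = prefix_weights({"a": 900, "b": 50, "c": 50})
clusters = {"small": ["b", "c"], "reset": ["a"]}
print(rank_clusters(clusters, weights))
```

Although the `small` cluster contains more prefixes, the `reset` cluster ranks first because its single prefix carries most of the traffic, which mirrors the skew noted above.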
21
Performance Evaluation
• Memory
  • Static memory: “current routes”, 600 MB
  • Dynamic memory: “clusters”, 300 MB
• Speed
  • 99% of 1-second intervals of updates can be processed within 1 second
  • Occasional execution lag
  • Every 70-second interval of updates can be processed within 70 seconds
Measurements were based on a 900 MHz CPU
22
Conclusion
• BGP troubleshooting system
  • Fast, online fashion
  • Operators’ concerns (reachability, flapping, traffic)
• Significant information reduction
  • Millions of updates → a few dozen large disruptions
• Uncovered important network behavior
  • Hot-potato changes
  • Session resets
  • Persistent-flapping prefixes
23