Single External Disruption - EECS
Finding a Needle in a Haystack:
Pinpointing Significant BGP Routing
Changes in an IP Network
Jian Wu (University of Michigan)
Z. Morley Mao (University of Michigan)
Jennifer Rexford (Princeton University)
Jia Wang (AT&T Labs Research)
1
Motivation
[Figure: a source and a destination connected through AS1, AS2, AS3 and AS4; annotations: Failure, Disruption, Congestion, Mitigation]
A backbone network is vulnerable to routing changes that occur in other domains.
2
Goal
Identify important routing anomalies
Lost reachability
Persistent flapping
Large traffic shifts
Contributions:
• Build a tool to identify a small number of important routing disruptions from a large volume of raw BGP updates in real time.
• Use the tool to characterize routing disruptions in an operational network.
3
Interdomain Routing:
Border Gateway Protocol
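[Figure: AS 1, which contains 12.34.158.0/24, announces “I can reach 12.34.158.0/24” to AS 2 over eBGP; AS 2 distributes the route internally over iBGP and announces “I can reach 12.34.158.0/24 via AS 1” to AS 3 over eBGP; data traffic to 12.34.158.5 flows in the opposite direction.]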
Prefix-based: one route per prefix
Path-vector: list of ASes in the path
Incremental: every update indicates a change
Policy-based: local ranking of routes
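The first three properties can be illustrated with a minimal sketch. The `Update` record and `apply_update` function are hypothetical names for this example and are not taken from the tool described in these slides. The policy-based ranking of competing routes is deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Update:
    """One BGP update for one prefix (hypothetical, simplified format)."""
    prefix: str                          # prefix-based: e.g. "12.34.158.0/24"
    as_path: Optional[Tuple[int, ...]]   # path-vector: list of ASes; None = withdrawal


def apply_update(rib: Dict[str, Tuple[int, ...]], update: Update) -> bool:
    """Apply one incremental update to a per-prefix table.

    Returns True when the stored route for the prefix actually changed.
    """
    old = rib.get(update.prefix)
    if update.as_path is None:
        rib.pop(update.prefix, None)     # withdrawal removes the route
    else:
        rib[update.prefix] = update.as_path
    return old != rib.get(update.prefix)


rib: Dict[str, Tuple[int, ...]] = {}
changed_1 = apply_update(rib, Update("12.34.158.0/24", (2, 1)))     # announcement
changed_2 = apply_update(rib, Update("12.34.158.0/24", (2, 4, 1)))  # new path
changed_3 = apply_update(rib, Update("12.34.158.0/24", None))       # withdrawal
```

Each of the three updates changes the table, and the table is empty after the withdrawal.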
4
Capturing Routing Changes
A large operational network
(8/16/2004 – 10/10/2004)
[Figure: network of border routers (BR) with a BGP monitor; labels: BGP, CPE, Monitor]
5
Challenges
Large volume of BGP updates
Millions daily, very bursty
Too much for an operator to manage
Different from root-cause analysis
Identify changes and their effects
Focus on actionable events rather than
diagnosis
Diagnose causes in/near the AS
6
System Architecture
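[Figure: processing pipeline fed by the border routers; numbers in parentheses are orders of magnitude]
BGP Updates (10^6) -> BGP Update Grouping -> Events (10^5) -> Event Classification -> “Typed” Events -> Event Correlation -> Clusters (10^3) -> Traffic Impact Prediction (with Netflow Data) -> Large Disruptions (10^1)
Side outputs: Persistent Flapping Prefixes (10^1) from BGP Update Grouping; Frequent Flapping Prefixes (10^1) from Event Correlation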
From millions of updates to a few dozen reports
7
Grouping BGP Updates into Events
Challenge: A single routing change
leads to multiple update messages
affects routing decisions at multiple routers
[Figure: BGP Updates -> BGP Update Grouping -> Events, with Persistent Flapping Prefixes as a side output]
Approach:
•Group together all updates
for a prefix with
inter-arrival < 70 seconds
•Flag prefixes with changes
lasting > 10 minutes.
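The two grouping rules above can be sketched as follows. This is an offline, batch-style illustration with hypothetical function names and a simplified `(timestamp, prefix)` input; the tool on these slides works online, and its data structures are not shown in the source.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

EVENT_TIMEOUT = 70          # seconds, from the slide
CONVERGENCE_TIMEOUT = 600   # seconds (10 minutes), from the slide


def group_updates(updates: Iterable[Tuple[float, str]]) -> Dict[str, List[List[float]]]:
    """Group (timestamp, prefix) updates into per-prefix events.

    An update joins the current event of its prefix when it arrives less than
    EVENT_TIMEOUT seconds after the previous update for that prefix.
    """
    events: Dict[str, List[List[float]]] = defaultdict(list)
    for ts, prefix in sorted(updates):
        prefix_events = events[prefix]
        if prefix_events and ts - prefix_events[-1][-1] < EVENT_TIMEOUT:
            prefix_events[-1].append(ts)     # same event continues
        else:
            prefix_events.append([ts])       # a new event starts
    return dict(events)


def persistent_flappers(events: Dict[str, List[List[float]]]) -> Set[str]:
    """Prefixes with at least one event lasting longer than CONVERGENCE_TIMEOUT."""
    return {
        prefix
        for prefix, prefix_events in events.items()
        if any(ev[-1] - ev[0] > CONVERGENCE_TIMEOUT for ev in prefix_events)
    }


# Hypothetical input: prefix "a" changes twice, prefix "b" flaps every 60 s for 11 minutes.
sample = [(0, "a"), (30, "a"), (200, "a")] + [(t, "b") for t in range(0, 661, 60)]
events = group_updates(sample)
flapping = persistent_flappers(events)
```

Here prefix "a" yields two events (the 170-second gap exceeds the timeout), while prefix "b" forms a single 660-second event and is flagged as persistently flapping.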
8
Grouping Thresholds
Based on our understanding of BGP
and data analysis
Event timeout: 70 seconds
2 * MRAI timer + 10 seconds
98% of inter-arrival times < 70 seconds
Convergence timeout: 10 minutes
BGP usually converges within a few
minutes
99.9% of events last < 10 minutes
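As a worked check of the event timeout, the sketch below assumes an MRAI timer of 30 seconds, a common default for eBGP sessions that is consistent with 2 * MRAI + 10 = 70. The helper names and the sample gaps are hypothetical; the second function shows how a coverage figure such as "98%" could be measured from observed inter-arrival times.

```python
from typing import Sequence

MRAI_SECONDS = 30  # assumption: default eBGP MRAI, not stated on the slide


def event_timeout(mrai: float = MRAI_SECONDS) -> float:
    """Event timeout as given on the slide: 2 * MRAI timer + 10 seconds."""
    return 2 * mrai + 10


def fraction_below(gaps: Sequence[float], threshold: float) -> float:
    """Fraction of inter-arrival gaps strictly below the threshold."""
    if not gaps:
        raise ValueError("need at least one inter-arrival gap")
    return sum(1 for g in gaps if g < threshold) / len(gaps)


# Made-up gaps in seconds, for illustration only (not measured data).
sample_gaps = [1, 5, 29, 31, 58, 65, 69, 120]
timeout = event_timeout()
coverage = fraction_below(sample_gaps, timeout)
```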
9
Persistent Flapping Prefixes
A surprising finding:
15.2% of updates were caused by
persistent-flapping prefixes even though
flap damping is enabled.
Types of persistent flapping
Conservative damping parameters (78.6%)
Protocol oscillations due to MED (18.3%)
Unstable interfaces or BGP sessions (3.0%)
10
Example: Unstable eBGP Session
[Figure: ISP with routers AE, BE, CE and DE, a peer, and a customer with prefix p]
Flap damping parameters are session-based
Damping not implemented for iBGP sessions
11
Event Classification
Challenge: Major concerns in network management
Changes in reachability
Heavy load of routing messages on the routers
Change of flow of the traffic through the network
[Figure: Events -> Event Classification -> “Typed” Events, e.g., Loss/Gain of Reachability]
Solution: classify events by severity of their impact
12
Event Category – “No Disruption”
[Figure: prefix p, AS1, AS2, and ISP routers AE, BE, CE, DE, EE; annotation: No Traffic Shift]
“No Disruption”: no border routers have any traffic shift. (50.3%)
13
Event Category – “Internal Disruption”
[Figure: prefix p, AS1, AS2, and ISP routers AE, BE, CE, DE, EE; annotation: Internal Traffic Shift]
“Internal Disruption”: all traffic shifts are internal. (15.6%)
14
Event Category – “Single External Disruption”
[Figure: prefix p, AS1, AS2, and ISP routers AE, BE, CE, DE, EE; annotation: External Traffic Shift]
“Single External Disruption”: only one of the traffic shifts is external. (20.7%)
15
Statistics on Event Classification
Category                        Events    Updates
No Disruption                   50.3%     48.6%
Internal Disruption             15.6%      3.4%
Single External Disruption      20.7%      7.9%
Multiple External Disruption     7.4%     18.2%
Loss/Gain of Reachability        6.0%     21.9%
The first three categories show significant day-to-day variations
The number of updates per event depends on the event type and the number of affected routers
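The five categories in the table can be expressed as a small decision function. This is a sketch under stated assumptions: the input format (one label per border router, plus reachability before and after the event) is hypothetical, and the rule for "Multiple External Disruption" is inferred from the category name rather than stated on the slides.

```python
from typing import Mapping


def classify_event(
    shifts: Mapping[str, str],
    reachable_before: bool,
    reachable_after: bool,
) -> str:
    """Classify one event from per-router traffic-shift labels.

    `shifts` maps a border router name to "none", "internal" or "external"
    (a hypothetical encoding chosen for this example).
    """
    if reachable_before != reachable_after:
        return "Loss/Gain of Reachability"
    external = sum(1 for s in shifts.values() if s == "external")
    internal = sum(1 for s in shifts.values() if s == "internal")
    if external == 0 and internal == 0:
        return "No Disruption"               # no border router shifts traffic
    if external == 0:
        return "Internal Disruption"         # all traffic shifts are internal
    if external == 1:
        return "Single External Disruption"  # only one shift is external
    return "Multiple External Disruption"    # assumed: more than one external shift


example = classify_event(
    {"A": "internal", "B": "none", "C": "external"},
    reachable_before=True,
    reachable_after=True,
)
```

In this example one router shifts traffic externally and another internally, so the event is a "Single External Disruption".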
16
Event Correlation
Challenge: A single routing change
affects multiple destination prefixes
[Figure: “Typed” Events -> Event Correlation -> Clusters]
Solution:
group events of the same type that occur close together in time
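A minimal sketch of this grouping step follows. The slide gives neither the time window nor any further matching criteria, so the 60-second `window` and the `(timestamp, type, prefix)` event format are assumptions made only for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str, str]        # (timestamp, event type, prefix)
Cluster = Tuple[str, List[str]]       # (event type, prefixes in the cluster)


def correlate(events: Iterable[Event], window: float = 60.0) -> List[Cluster]:
    """Cluster same-type events whose timestamps are within `window` seconds
    of the previous event of that type (the window value is hypothetical)."""
    clusters: List[Cluster] = []
    open_cluster: Dict[str, int] = {}   # event type -> index of its open cluster
    last_seen: Dict[str, float] = {}    # event type -> timestamp of its last event
    for ts, etype, prefix in sorted(events):
        idx = open_cluster.get(etype)
        if idx is not None and ts - last_seen[etype] <= window:
            clusters[idx][1].append(prefix)        # extend the open cluster
        else:
            open_cluster[etype] = len(clusters)    # start a new cluster
            clusters.append((etype, [prefix]))
        last_seen[etype] = ts
    return clusters


clusters = correlate([
    (0, "internal", "p1"),
    (5, "internal", "p2"),
    (7, "single-external", "p3"),
    (500, "internal", "p4"),
])
```

The first two internal events fall into one cluster, the external event forms its own, and the late internal event starts a third cluster.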
17
EBGP Session Reset
Caused most of “single external disruption”
events
Check if the number of prefixes using that
session as the best route changes
dramatically
[Figure: number of prefixes over time, with session failure and session recovery marked]
Validation with Syslog router report (95%)
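The check described above can be sketched as a comparison of consecutive samples of the per-session prefix count. What counts as "dramatically" is not specified on the slide; the 50% relative change used here, the function name and the sample values are all assumptions.

```python
from typing import List, Sequence, Tuple


def detect_session_changes(
    counts: Sequence[Tuple[float, int]],
    min_change: float = 0.5,
) -> List[Tuple[float, str]]:
    """Flag large jumps in the number of prefixes whose best route uses a session.

    `counts` is a time-ordered list of (timestamp, number of prefixes).
    `min_change` is a hypothetical relative threshold.
    """
    changes: List[Tuple[float, str]] = []
    for (_, c0), (t1, c1) in zip(counts, counts[1:]):
        base = max(c0, c1)
        if base == 0:
            continue                        # session carried no prefixes at all
        delta = (c1 - c0) / base
        if delta <= -min_change:
            changes.append((t1, "failure"))   # sharp drop: likely session failure
        elif delta >= min_change:
            changes.append((t1, "recovery"))  # sharp rise: likely session recovery
    return changes


# Made-up samples: (seconds, prefixes using the session as best route).
samples = [(0, 1000), (60, 990), (120, 10), (180, 12), (240, 995)]
changes = detect_session_changes(samples)
```

With these samples the drop at 120 seconds is reported as a failure and the rise at 240 seconds as a recovery; small fluctuations are ignored.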
18
Hot-Potato Changes
[Figure: ISP routers AE, BE and CE forwarding toward prefix P; IGP distances 11, 9 and 10]
“Hot-potato routing” = route to closest egress point
Caused “internal disruption” events
Validation with OSPF measurement (95%)
[Teixeira et al – SIGMETRICS’ 04]
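Hot-potato selection and the resulting egress change can be illustrated as below. The egress names and the distance values are hypothetical inputs for this example, and breaking ties by name is an assumption; real routers apply further BGP tie-breaking steps.

```python
from typing import Dict, Optional, Tuple


def closest_egress(igp_distance: Dict[str, int]) -> str:
    """Pick the egress point with the smallest IGP distance.

    Ties are broken by egress name (an assumption of this sketch).
    """
    return min(sorted(igp_distance), key=igp_distance.__getitem__)


def hot_potato_change(
    before: Dict[str, int],
    after: Dict[str, int],
) -> Optional[Tuple[str, str]]:
    """Return (old egress, new egress) if an IGP change moved the closest
    egress point, or None if the choice is unchanged."""
    old, new = closest_egress(before), closest_egress(after)
    return (old, new) if old != new else None


# Hypothetical distances: the path to egress A gets longer (9 -> 11),
# so egress B (distance 10) becomes the closest exit.
before = {"A": 9, "B": 10}
after = {"A": 11, "B": 10}
shift = hot_potato_change(before, after)
```

No BGP route changed in this example; the internal traffic shift from A to B is caused purely by the change in IGP distance.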
19
Traffic Impact Prediction
Challenge: Routing changes differ in their impact on the network, depending on the popularity of the affected destinations
[Figure: Clusters and Netflow Data from the border routers -> Traffic Impact Prediction -> Large Disruptions]
Solution: weigh each cluster by traffic volume
20
Traffic Impact Prediction
Traffic weight
Per-prefix measurement from netflow
10% of prefixes account for 90% of traffic
Traffic weight of a cluster
the sum of “traffic weight” of the prefixes
A small number of large clusters have
large traffic weight
Mostly session resets and hot-potato
changes
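The weighting rule above (a cluster's weight is the sum of its prefixes' traffic weights) can be sketched as follows. The byte counts, the cutoff for a "large" disruption and the function names are hypothetical; the slides do not state how the threshold is chosen.

```python
from typing import Dict, Iterable, List


def cluster_weight(prefixes: Iterable[str], traffic_weight: Dict[str, int]) -> int:
    """Sum of the per-prefix traffic weights; unmeasured prefixes count as 0."""
    return sum(traffic_weight.get(p, 0) for p in prefixes)


def large_disruptions(
    clusters: Dict[str, List[str]],
    traffic_weight: Dict[str, int],
    threshold: int,
) -> List[str]:
    """Names of clusters whose traffic weight reaches the (hypothetical) threshold."""
    return [
        name
        for name, prefixes in clusters.items()
        if cluster_weight(prefixes, traffic_weight) >= threshold
    ]


# Made-up per-prefix traffic volumes, e.g. bytes taken from Netflow records.
weights = {"p1": 9000, "p2": 500, "p3": 10}
clusters = {"c1": ["p1", "p2"], "c2": ["p3", "p4"]}
large = large_disruptions(clusters, weights, threshold=1000)
```

Only cluster "c1" is reported, because it contains the popular prefix "p1".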
21
Performance Evaluation
Memory
Static memory: “current routes”, 600 MB
Dynamic memory: “clusters”, 300 MB
Speed
99% of 1-second intervals of updates
can be processed within 1 second
Occasional execution lag
Every interval of 70 seconds of updates
can be processed within 70 seconds
Measurements were taken on a 900 MHz CPU
22
Conclusion
BGP troubleshooting system
Fast, online fashion
Operators’ concerns (reachability, flapping, traffic)
Significant information reduction
millions of updates -> a few dozen large disruptions
Uncovered important network behavior
Hot-Potato changes
Session resets
Persistent-flapping prefixes
23