A Framework for Measuring and
Predicting the Impact of Routing Changes
Ying Zhang Z. Morley Mao Jia Wang
1
Internet routing changes
Various causes
Link failures, configuration changes, topology changes, etc.
Direct influence on the data plane
Transient data-plane disruption
Packet loss, increased delay, forwarding loops
[Figure: Source and Destination connected across the Internet through BR/C nodes, showing the old path and the new path.]
2
Motivation
Frequent routing dynamics can cause
transient disruption in the data plane
Inconsistent routes during convergence
Real-time applications can be affected
Predicting performance impact can assist
more intelligent route selection
3
Measuring and predicting the impact
Comprehensively measure the impact of
routing changes
Characterize the properties of routing
changes that cause traffic disruption
Search for patterns to help prediction
4
Outline
Motivation
Methodology
Characterization of data-plane failures
Failure prediction model
5
Methodology
Data collection
Control plane: local real-time BGP updates
Data plane: ping and traceroute probes for each update
A light-weight active probing methodology
A coarse-grained performance metric: reachability
Destination reachable: any ping reply
Scalable to many destinations with live IPs
Measurement-based approach
No simplifying assumptions
Empirical evidence
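The reachability metric above can be sketched in a few lines. This is only an illustration of the "any ping reply" rule; the `PingResult` record and its fields are hypothetical names, not part of the original framework.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PingResult:
    """One ping probe (hypothetical record type); rtt_ms is None if no reply arrived."""
    target: str
    rtt_ms: Optional[float]


def is_reachable(results: Sequence[PingResult]) -> bool:
    """Coarse-grained metric: the destination counts as reachable if any ping got a reply."""
    return any(r.rtt_ms is not None for r in results)


# Illustrative probes to a documentation-range address: one lost, one answered.
probes = [PingResult("192.0.2.1", None), PingResult("192.0.2.1", 12.5)]
print(is_reachable(probes))
```

Because only the presence of a reply matters, no delay or loss-rate bookkeeping is needed per destination, which is what keeps the metric cheap across many prefixes.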
6
Our approach
Focus: measure data-plane failures caused by routing
changes
Coarse-grained performance metrics
Methodology: light-weight active probing
Triggered by locally observed routing updates
Probing target of a live IP within the prefix
[Figure: an update (Prefix: P, AS path: A D B) observed at the measurement point; old path and new path across the Internet to Prefix P, with AS A, AS B, AS C, and AS D.]
7
Framework
Our approach
Focus: measure data-plane failure caused by routing
changes
Methodology: light-weight active probing
Triggered by locally observed routing updates
Probing target of a live IP within the prefix
[Figure: ping and traceroute probes toward Live IP 1 within Prefix P; old path and new path across the Internet, with AS A, AS B, AS C, AS D, and the measurement point.]
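A minimal sketch of update-triggered probing, assuming a pre-built table of live IPs per prefix. The names `BgpUpdate`, `pick_target`, and `probes_for` are hypothetical, and the sketch only decides which probes to issue; it does not send any packets.

```python
import ipaddress
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class BgpUpdate:
    """A locally observed routing update (hypothetical record type)."""
    prefix: str
    as_path: Tuple[str, ...]


def pick_target(update: BgpUpdate, live_ips: Dict[str, List[str]]) -> Optional[str]:
    """Return a known live IP that falls inside the updated prefix, if any."""
    network = ipaddress.ip_network(update.prefix)
    for ip in live_ips.get(update.prefix, []):
        if ipaddress.ip_address(ip) in network:
            return ip
    return None


def probes_for(update: BgpUpdate, live_ips: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """List the (probe type, target) pairs to issue for one update."""
    target = pick_target(update, live_ips)
    if target is None:
        return []  # no live IP known for this prefix, so nothing to probe
    return [("ping", target), ("traceroute", target)]


# Illustrative data: documentation-range prefix, AS path taken from the slide.
update = BgpUpdate("198.51.100.0/24", ("A", "D", "B"))
live = {"198.51.100.0/24": ["198.51.100.7"]}
print(probes_for(update, live))
```

Keeping target selection separate from probe dispatch makes it easy to skip prefixes without a known live IP before any probing resources are spent.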
8
Framework
Probing control
Background probing
Identifying persistent failures
Verifying live IP’s response
Resource control
Ignoring updates due to table transfers
Imposing maximum probing duration
Accuracy control
Imposing maximum waiting duration
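The maximum probing duration can be illustrated with a small scheduling sketch. The fixed probing interval and the function name `probe_schedule` are assumptions made for illustration; the slides do not state how probes are spaced.

```python
from typing import List


def probe_schedule(start: int, interval: int, max_probing: int) -> List[int]:
    """Send times (in seconds) from `start`, one every `interval` seconds,
    stopping once `max_probing` seconds have elapsed (assumed policy)."""
    times = []
    t = start
    while t - start < max_probing:
        times.append(t)
        t += interval
    return times


# Illustrative values: probe every 10 s, give up after 35 s.
print(probe_schedule(0, 10, 35))  # [0, 10, 20, 30]
```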
9
Outline
Motivation
Methodology
Characterization of data-plane
failures
Failure prediction model
10
Characterization of data-plane failures
Failure types
Reachability failure
Ping reply is not received due to network problems
Forwarding loops
A subset of reachability failures
Transient loops observed in the path
Failure properties
Affected networks
Failure duration
Failure predictability
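One way to flag a forwarding loop from a traceroute is to look for a router address that reappears later in the path. The sketch below assumes each hop is a router IP string, or None when the hop did not respond; the function name is hypothetical and the rule is a simplification, not necessarily the detection rule used in the study.

```python
from typing import Dict, List, Optional


def has_forwarding_loop(hops: List[Optional[str]]) -> bool:
    """True if a responding router reappears after a gap of at least one hop.

    Immediately repeated hops are ignored, since they can be probing artifacts.
    """
    last_seen: Dict[str, int] = {}
    for i, hop in enumerate(hops):
        if hop is None:
            continue  # unresponsive hop carries no identity
        if hop in last_seen and last_seen[hop] < i - 1:
            return True
        last_seen[hop] = i
    return False


# Illustrative paths using private addresses.
looping = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.2"]
clean = ["10.0.0.1", "10.0.0.2", None, "10.0.0.3"]
print(has_forwarding_loop(looping), has_forwarding_loop(clean))  # True False
```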
11
Overall reachability failure statistics
Internet experiments for 11 weeks

              Unreachable
              Loop    Other   All     Reachable
Incidence     6%      36%     42%     57%
Prefix        23%     72%     73%     83%
AS            33%     38%     63%     98%
12
Affected network locations
Understanding the networks affected by routing changes
Most affected ASes are near the edge and in foreign countries
A small fraction of destinations experience many unreachable
incidences
13
Failure durations
Short duration
Most last less than 300 seconds
Transient routing failure, convergence delay
10% of incidences have longer durations
Configuration errors or path failures
14
Failure predictability
Destination prefix information
Appearance probability
Probability of an unreachable incidence for prefix D
Destination prefix and AS path segments
Conditional probability on AS path segments
Probability of an unreachable event occurring given a particular AS path segment
Responsible AS
Where traceroute stops
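These quantities can be estimated empirically from probing history. The sketch below is one possible reading, assuming each history record is a (prefix, AS path, failed) tuple and that the appearance probability is the fraction of a prefix's probed updates that ended unreachable. All names and the sample data are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Record = Tuple[str, Tuple[str, ...], bool]  # (prefix, AS path, failed?)


def appearance_probability(history: List[Record], prefix: str) -> float:
    """Fraction of probed updates for `prefix` that were unreachable (assumed definition)."""
    outcomes = [failed for p, _, failed in history if p == prefix]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def contains_segment(path: Sequence[str], segment: Sequence[str]) -> bool:
    """True if `segment` occurs as a contiguous run inside `path`."""
    n = len(segment)
    return any(tuple(path[i:i + n]) == tuple(segment) for i in range(len(path) - n + 1))


def failure_given_segment(history: List[Record], segment: Sequence[str]) -> float:
    """Empirical P(unreachable | AS path contains `segment`)."""
    outcomes = [failed for _, path, failed in history if contains_segment(path, segment)]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def responsible_as(hop_ases: Sequence[Optional[str]]) -> Optional[str]:
    """Last AS that answered before traceroute stopped (None entries are silent hops)."""
    for asn in reversed(hop_ases):
        if asn is not None:
            return asn
    return None


# Made-up history for illustration only.
history = [
    ("P", ("A", "D", "B"), True),
    ("P", ("A", "C", "B"), False),
    ("P", ("A", "D", "B"), False),
    ("Q", ("A", "D", "E"), True),
]
print(appearance_probability(history, "P"))
print(failure_given_segment(history, ("A", "D")))
print(responsible_as(["A", "D", None, None]))
```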
15
Outline
Motivation
Methodology
Characterization of data-plane failures
Failure prediction model
16
Prediction model
Prefix and AS segment information
The data plane failure likelihood ratio
$$\Lambda(Y) = \frac{P(Y = 1 \mid R; D)}{P(Y = 0 \mid R; D)}$$
P(Y=1|R;D): the conditional probability of data-plane failure given a routing
update R for prefix D
Assuming the failure on each AS is independent
$$P(Y = 1 \mid R = x_1, x_2, \ldots, x_n; D) = 1 - \prod_{i=1}^{n} \bigl(1 - P(Y = 1 \mid x_i; D)\bigr)$$
$x_i$ is the responsible AS in the historical data
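A small numerical sketch of the model: under the independence assumption, the per-AS failure probabilities combine as one minus the product of their complements, and the likelihood ratio follows directly. The function names and the per-AS probabilities in the example are made up for illustration.

```python
from math import inf, prod
from typing import Iterable


def failure_probability(per_as_probs: Iterable[float]) -> float:
    """P(Y=1 | R = x1..xn; D) = 1 - prod_i (1 - P(Y=1 | xi; D)), assuming independence."""
    return 1.0 - prod(1.0 - p for p in per_as_probs)


def likelihood_ratio(p_fail: float) -> float:
    """P(Y=1 | R; D) / P(Y=0 | R; D); infinite when failure is certain."""
    if p_fail >= 1.0:
        return inf
    return p_fail / (1.0 - p_fail)


# Made-up per-AS failure probabilities for a three-AS path.
p = failure_probability([0.1, 0.2, 0.5])
print(round(p, 2))                    # 0.64
print(round(likelihood_ratio(p), 2))  # 1.78
```

A route would then be flagged as likely to fail when this ratio exceeds a chosen decision threshold.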
17
Evaluation
The trade-off between selectivity and sensitivity
The decision threshold on the likelihood ratio determines the false-positive and
false-negative rates
Receiver operating characteristic
Evaluation results
60% detection rate
with 18% false positives
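Each point on a receiver operating characteristic curve pairs a detection rate with a false-positive rate for one threshold. The sketch below computes one such point; the scores, labels, and the function name `roc_point` are illustrative assumptions and do not reproduce the reported results.

```python
from typing import Sequence, Tuple


def roc_point(scores: Sequence[float], labels: Sequence[bool], threshold: float) -> Tuple[float, float]:
    """Return (detection rate, false-positive rate) when predicting a failure
    for every update whose score is at least `threshold`."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr


# Made-up likelihood ratios and ground-truth outcomes (True = data-plane failure).
scores = [0.2, 0.9, 1.5, 3.0, 0.4]
labels = [False, False, True, True, True]
print(roc_point(scores, labels, 1.0))
print(roc_point(scores, labels, 0.3))
```

Lowering the threshold catches more failures but also flags more healthy updates, which is the trade-off the curve makes visible.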
18
Conclusion
Developed an efficient framework for
measuring and predicting data-plane
failures caused by routing changes
Identified patterns to accurately predict
data-plane failures
Provided suggestions for more intelligent
route selections
19