A Framework for Measuring and Predicting the Impact of Routing Changes
Ying Zhang Z. Morley Mao Jia Wang
1
Internet routing changes
• Various causes
  - Link failures, configuration changes, topology changes, etc.
• Direct influence on the data plane
  - Transient data-plane disruption
  - Packet loss, increased delay, forwarding loops
[Figure: a Source and a Destination connected across the Internet by an old path and a new path, through nodes labeled BR and C.]
2
Motivation
Frequent routing dynamics can cause transient disruption in the data plane
Inconsistent routes during convergence
Real-time applications can be affected
Predicting performance impact can assist more intelligent route selection
3
Measuring and predicting the impact
Comprehensively measure the impact of routing changes
Characterize the properties of routing changes that cause traffic disruption
Search for patterns to help prediction
4
Outline
Motivation
Methodology
Characterization of data-plane failures
Failure prediction model
5
Methodology
• Data collection
  - Control plane: local real-time BGP updates
  - Data plane: ping and traceroute probes for each update
• A lightweight active probing methodology
  - A coarse-grained performance metric: reachability
  - Destination reachable: any ping reply
  - Scalable to many destinations with live IPs
• Measurement-based approach
  - No simplifying assumptions
  - Empirical evidence
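The reachability metric above ("destination reachable: any ping reply") can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the function name and the representation of a ping round as a list of round-trip times (with None for a timeout) are assumptions.

```python
from typing import Optional, Sequence


def is_reachable(ping_rtts_ms: Sequence[Optional[float]]) -> bool:
    """Return True if at least one ping in the round received a reply.

    Each element is the round-trip time of one ping in milliseconds, or
    None when that ping timed out (hypothetical representation).
    """
    return any(rtt is not None for rtt in ping_rtts_ms)
```

An empty round counts as unreachable here, since no reply was observed.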
6
Our approach
• Focus: measure data-plane failures caused by routing changes
  - Coarse-grained performance metrics
• Methodology: lightweight active probing
  - Triggered by locally observed routing updates
  - Probing target: a live IP within the prefix
[Figure: the Measurement Framework observes an update (Prefix: P, AS path: A D B); the old path and the new path cross the Internet among AS A, AS B, AS C, and AS D toward Prefix P.]
7
Our approach
• Focus: measure data-plane failures caused by routing changes
• Methodology: lightweight active probing
  - Triggered by locally observed routing updates
  - Probing target: a live IP within the prefix
[Figure: the Measurement Framework, labeled with Ping and Traceroute probes toward Live IP 1 within Prefix P; the old path and the new path cross the Internet among AS A, AS B, AS C, and AS D.]
8
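The trigger step (a locally observed update for a prefix leads to probing a live IP inside that prefix) might look like the sketch below. The BgpUpdate class, the pick_probe_target function, and the idea of a precollected list of live IPs are all hypothetical names introduced for illustration; the addresses come from documentation ranges.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class BgpUpdate:
    """Hypothetical, simplified view of one observed routing update."""
    prefix: str               # e.g. "192.0.2.0/24"
    as_path: Tuple[str, ...]  # e.g. ("A", "D", "B")


def pick_probe_target(update: BgpUpdate, live_ips: Sequence[str]) -> Optional[str]:
    """Return the first known live IP that falls inside the updated prefix."""
    network = ipaddress.ip_network(update.prefix)
    for ip in live_ips:
        if ipaddress.ip_address(ip) in network:
            return ip
    return None  # no known live IP in this prefix, so nothing to probe


update = BgpUpdate(prefix="192.0.2.0/24", as_path=("A", "D", "B"))
target = pick_probe_target(update, ["198.51.100.7", "192.0.2.10"])
```

Ping and traceroute would then be sent to the returned target; the sketch stops before any network I/O.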
Probing control
• Background probing
  - Identifying persistent failures
  - Verifying the live IP's response
• Resource control
  - Ignoring updates due to table transfers
  - Imposing a maximum probing duration
• Accuracy control
  - Imposing a maximum waiting duration
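One way to picture the "maximum probing duration" control is a probing loop with a time cap. The sketch below assumes that probing for an update continues at a fixed interval until a reply arrives or the cap is reached; the slides do not state the stopping rule, the interval, or the cap, so all three are assumptions, and the clock is simulated rather than real.

```python
from typing import Callable, List


def probe_with_cap(send_probe: Callable[[], bool],
                   interval_s: float,
                   max_probing_s: float) -> List[bool]:
    """Probe until a reply arrives or the probing-duration cap is reached.

    send_probe is a hypothetical callable returning True on a reply.
    Time is simulated by adding interval_s per probe; nothing sleeps.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    results: List[bool] = []
    elapsed_s = 0.0
    while elapsed_s <= max_probing_s:
        replied = send_probe()
        results.append(replied)
        if replied:
            break  # destination is reachable again; stop spending probes
        elapsed_s += interval_s
    return results
```

With a 10-second interval and a 30-second cap, a destination that never answers is probed four times (at 0, 10, 20, and 30 seconds) and then dropped.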
9
Outline
Motivation
Methodology
Characterization of data-plane failures
Failure prediction model
10
Characterization of data-plane failures
• Failure types
  - Reachability failure
    - Ping reply is not received due to network problems
  - Forwarding loops
    - A subset of reachability failures
    - Transient loops observed in the path
• Failure properties
  - Affected networks
  - Failure duration
  - Failure predictability
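A forwarding loop "observed in the path" can be detected from a traceroute hop list. The sketch below is one simple heuristic, not necessarily the rule used in the study: it flags a loop when a hop address reappears after at least one other position in between. The function name and the use of None for unanswered hops are assumptions.

```python
from typing import Dict, Optional, Sequence


def has_forwarding_loop(hops: Sequence[Optional[str]]) -> bool:
    """Return True if some hop address reappears non-consecutively.

    hops lists the responding address at each TTL, with None where no
    reply was received. An address repeated at adjacent TTLs is not
    treated as a loop in this sketch.
    """
    last_seen: Dict[str, int] = {}
    for index, hop in enumerate(hops):
        if hop is None:
            continue
        if hop in last_seen and last_seen[hop] != index - 1:
            return True
        last_seen[hop] = index
    return False
```

For example, a trace that alternates between 10.0.0.2 and 10.0.0.3 is flagged, while a trace with one unanswered hop and no repeats is not.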
11
Overall reachability failure statistics
             Unreachable
             Loop    Other   All     Reachable
Incidence    6%      36%     42%     57%
Prefix       23%     72%     73%     83%
AS           33%     38%     63%     98%
Internet experiments for 11 weeks
12
Affected network locations
• Understanding the networks affected by routing changes
  - Most affected ASes are near the edge and in foreign countries
  - A small fraction of destinations experience many unreachable incidences
13
Failure durations
• Short duration
  - Most last less than 300 seconds
  - Transient routing failure, convergence delay
• 10% of incidences have a longer duration
  - Configuration errors or path failures
14
Failure predictability
• Destination prefix information
  - Appearance probability
    - Probability of an unreachable incidence for prefix D
• Destination prefix and AS path segments
  - Conditional probability on AS path segments
    - Probability of an unreachable event occurring given a particular AS path segment
• Responsible AS
  - Where traceroute stops
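Both probabilities above can be estimated empirically from a history of probed updates. The sketch below assumes a record format of (prefix, AS path, unreachable flag), treats a "segment" as a contiguous run of ASes in the path, and estimates each probability as a simple fraction of matching records; none of these details is given in the slides.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical history record: (prefix, AS path, was the destination unreachable?)
Record = Tuple[str, Tuple[str, ...], bool]


def contains_segment(path: Sequence[str], segment: Sequence[str]) -> bool:
    """True if segment occurs as a contiguous run inside path."""
    n = len(segment)
    return any(tuple(path[i:i + n]) == tuple(segment)
               for i in range(len(path) - n + 1))


def failure_probability(history: Sequence[Record], prefix: str,
                        segment: Sequence[str] = ()) -> Optional[float]:
    """Fraction of matching updates that led to an unreachable incidence.

    With an empty segment this is the per-prefix estimate; with a segment
    it is conditioned on AS paths that contain that segment.
    """
    matching = [failed for p, path, failed in history
                if p == prefix and contains_segment(path, segment)]
    if not matching:
        return None  # no history for this prefix/segment combination
    return sum(matching) / len(matching)


history = [
    ("P", ("A", "D", "B"), True),
    ("P", ("A", "C", "B"), False),
    ("P", ("A", "D", "B"), False),
    ("Q", ("A", "D", "E"), True),
]
```

Returning None rather than 0 for unseen combinations keeps "no evidence" distinct from "never failed".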
15
Outline
Motivation
Methodology
Characterization of data-plane failures
Failure prediction model
16
Prediction model
Prefix and AS segment information
The data-plane failure likelihood ratio
$$\Lambda(Y) = \frac{P(Y = 1 \mid R; D)}{P(Y = 0 \mid R; D)}$$
• $P(Y = 1 \mid R; D)$: the conditional probability of a data-plane failure given a routing update $R$ for prefix $D$
• Assuming the failure on each AS is independent:
$$P(Y = 1 \mid R = x_1, x_2, \ldots, x_n; D) = 1 - \prod_{i=1}^{n} \bigl(1 - P(Y = 1 \mid x_i; D)\bigr)$$
$x_i$ is the responsible AS in the history data
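The independence formula and the likelihood ratio can be checked numerically with a short sketch. The function names are illustrative, and the per-AS probabilities are made-up inputs rather than measured values. Because Y is binary, P(Y = 0 | R; D) is taken as 1 minus P(Y = 1 | R; D).

```python
import math
from typing import Iterable


def combined_failure_probability(per_as_probs: Iterable[float]) -> float:
    """1 - prod(1 - p_i): probability that at least one AS causes a failure,
    assuming the per-AS failures are independent."""
    no_failure = 1.0
    for p in per_as_probs:
        no_failure *= 1.0 - p
    return 1.0 - no_failure


def likelihood_ratio(p_fail: float) -> float:
    """P(Y=1 | R; D) / P(Y=0 | R; D) for a binary outcome Y."""
    if p_fail >= 1.0:
        return math.inf  # failure is certain; the ratio is unbounded
    return p_fail / (1.0 - p_fail)


# Made-up example: two ASes on the path, each with failure probability 0.5.
p = combined_failure_probability([0.5, 0.5])
ratio = likelihood_ratio(p)
```

With these inputs the combined probability is 0.75 and the ratio is 3.0, so the update is three times as likely to cause a failure as not.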
17
Evaluation
• The trade-off between selectivity and sensitivity
  - The decision threshold determines the false-positive and false-negative rates
  - Receiver operating characteristic
• Evaluation results
  - 60% detection rate with 18% false positives
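The threshold trade-off can be illustrated by computing one point of a receiver operating characteristic curve. The sketch assumes that an update is predicted to fail when its score (for instance, a likelihood ratio) is at or above the threshold; the scores and labels below are toy values, not data from the study.

```python
from typing import Sequence, Tuple


def roc_point(scores: Sequence[float], failed: Sequence[bool],
              threshold: float) -> Tuple[float, float]:
    """Return (detection rate, false-positive rate) at one threshold."""
    tp = fp = fn = tn = 0
    for score, did_fail in zip(scores, failed):
        predicted_fail = score >= threshold
        if predicted_fail and did_fail:
            tp += 1
        elif predicted_fail and not did_fail:
            fp += 1
        elif did_fail:
            fn += 1
        else:
            tn += 1
    detection_rate = tp / (tp + fn) if (tp + fn) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return detection_rate, false_positive_rate


# Toy data: four updates, two of which actually caused failures.
scores = [3.0, 0.5, 2.0, 0.1]
failed = [True, True, False, False]
curve = [roc_point(scores, failed, t) for t in (0.0, 1.0, 2.5)]
```

Raising the threshold lowers the false-positive rate at the cost of missed failures; sweeping it over all score values traces the full curve.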
18
Conclusion
Developed an efficient framework for measuring and predicting data-plane failures caused by routing changes
Identified patterns to accurately predict data-plane failures
Provided suggestions for more intelligent route selection
19