Multipath tracing with Paris traceroute
Download
Report
Transcript Multipath tracing with Paris traceroute
Internet measurements: fault
detection, identification, and
topology discovery
Renata Teixeira
Laboratoire LIP6
CNRS and UPMC Paris Universitas
Internet monitoring is essential
For
–
Monitor service-level agreements
–
Fault diagnosis
–
Diagnose anomalous behavior
For
1
network operators
users or content/application providers
–
Verify network performance
–
Verify network neutrality
Network operators can’t know
the user’s experience
AS3
AS2
AS1
Network
–
–
2
AS4
operators only have data of one AS
AS4 doesn’t detect any problem
AS3 doesn’t know who is affected by the failure
End users can’t know what
happens in the network
AS3
AS2
AS1
End-hosts
3
AS4
can only monitor end-to-end paths
Network tomography to rescue
Inference of unknown network properties from
measurable ones
Network operators
End users
–
Monitor network paths
–
Cooperative monitoring
–
From monitoring hosts
–
Among end users
–
From users to popular
services
–
•
In network
•
Third-party monitoring services
From home gateways
http://cmon.grenouille.com
http://www.nanodatacenters.eu
4
Fault diagnosis using end-to-end
measurements
Faults
are persistent reachability problems
detection
continuous path monitoring
identification
binary tomography
5
Outline
Background
Fault
detection
–
Active vs. passive measurements
–
Reducing overhead of active measurements
–
Disambiguating one-way failures
Fault
identification using binary tomography
–
Correlated path reachability
–
Topology discovery
Open
6
in network tomography
issues
Network tomography to infer
link performance
What
are the properties of network links?
–
Loss rate
–
Delay
–
Bandwidth
–
Connectivity
D
A
AS 1 C
B
Given
–
7
end-to-end measurements
No access to routers
F
E AS 2
The origins
MINC:
Multicast-based
Inference of Network-internal
Characteristics
Key
–
probe
sender
idea: multicast probes
Exploit correlation in traces to
estimate link properties
probe
collectors
[MINC project, 1999]
8
Inferring link loss rates
m
Assumptions
–
Known, logical-tree topology
–
Losses are independent
α2
α3
–
Multicast probes
t1
t2
1
0
1
1
1
1
Method
–
Maximum likelihood
estimates for αk
[Adams, 2000]
9
success
probabilities
α1
^ α2
^ α3
^
α1
estimated
success
probabilities
Binary tomography
Labels
–
–
–
Loss-rate estimation requires
tight correlation
Instead, separate good/bad
performance
If link is bad, all paths that
cross the link are bad
[Duffield, 2006]
10
m
links as good or bad
α1
α2
α3
t1
t2
1
0
0
1
1
1
bad
good
Single-source tree
“Smallest
Consistent Failure
Set” algorithm
–
–
Assumes a single-source tree
and known topology
Find the smallest set of links
that explains bad paths
•
Given bad links are uncommon
•
Bad link is the root of maximal
bad subtree
[Duffield, 2006]
11
m
bad
t1
t2
1
0
0
1
1
1
bad
good
Fault identification with binary
tomography
Fault
monitoring needs multiple
sources and targets
Problem
–
m2
t1
t2
becomes NP-hard
Minimum hitting set problem
Iterative
greedy heuristic
–
Given the set of links in bad paths
–
Iteratively choose link that explains
the max number of bad paths
[Kompella, 2007] [Dhamdhere, 2007]
12
m1
Hitting set of link =
paths that traverse
the link
Practical issues
Topology
–
Need to measure accurate topology
Multicast
–
–
13
of targets is not always practical
Need one-way performance from round-trip probes
Links
–
not available
Need to extract correlation from unicast probes
Even using probes from different monitors
Control
–
is often unknown
can fail for some paths, but not all
Need to extend tomography algorithms
Outline
Background
Fault
detection with no control of targets
–
Active vs. passive measurements
–
Reducing overhead of active measurements
–
Disambiguating one-way failures
Fault
identification using binary tomography
–
Correlated path reachability without multicast
–
Topology discovery
Open
14
in network tomography
issues
Detection techniques
Active
–
–
probing: ping
Send probe, collect response
From any end host
•
Works for network operators and end users
Passive
–
–
15
analysis of user’s traffic
Tap incoming and outgoing traffic
•
At user’s machines or servers: tcpdump, pcap
•
Inside the network: DAG card
Monitor status of TCP connections
Detection with ping
probe
ICMP
echo request
t
If
receives reply
–
If
no reply before timeout
–
m
16
reply
ICMP
echo reply
Then, path is good
Then, path is bad
Persistent failure or
measurement noise?
Many
–
Timeout may be too short
–
Rate limiting at routers
–
Some end-hosts don’t respond to ICMP request
–
Transient congestion
–
Routing change
Need
–
17
reasons to lose probe or reply
to confirm that failure is persistent
Otherwise, may trigger false alarms
Failure confirmation
Upon
detection of a failure, trigger extra probes
Goal: minimize detection errors
–
–
Sending more probes
Waiting longer between probes
Tradeoff:
detection error and detection time
loss burst
packets on
a path
time
Detection error
[Cunha, 2009]
18
Passive detection at end hosts
tcpdump/pcap
captures packets
Track status of each TCP connection
–
RTTs, timeouts, retransmissions
Multiple
–
If current seq. number > last seq. number seen
•
–
Path is good
If current seq. number = last seq. number seen
•
Timeout has occurred
•
After four timeouts, declare path as bad
[Zhang, 2004]
19
timeouts indicate path is bad
Passive detection inside the
network is hard
Traffic
–
Need special hardware
•
–
volume is too high
DAG cards can capture packets at high speeds
May lose packets
Tracking
20
TCP connections is hard
–
May not capture both sides of a connection
–
Large processing and memory overhead
Passive vs. active detection
Passive
+ No
need to inject traffic
+ Detects all failures that
affect user’s traffic
+ Responses from targets
that don’t respond to ping
‒ Not
always possible to tap
user’s traffic
‒ Only detects failures in
paths with traffic
21
Active
+ No
need to tap user’s traffic
+ Detects failures in any
desired path
‒ Probing
–
–
overhead
Cover a large number of paths
Detect failures fast
Outline
Background
Fault
detection with no control of targets
–
Active vs. passive measurements
–
Reducing overhead of active measurements
–
Disambiguating one-way failures
Fault
identification using binary tomography
–
Correlated path reachability without multicast
–
Topology discovery
Open
22
in network tomography
issues
Active monitoring: reducing
probing overhead
M1
T1
target
network
C
A
T2
D
B
monitors
M2
23
Goal
detect failures of any of the
interfaces in the target network
with minimum probing overhead
target hosts
T3
The coverage solution
T1
M1
C
A
T2
D
T3
B
Instead of probing all paths, select the
minimum set of paths that covers all
interfaces in target network
Coverage problem is NP-hard
M2
–
Solution: greedy set-cover heuristic
[Nguyen, 2004] [Bejerano,2003]
24
Coverage solution doesn’t detect
all types of failures
Detects
–
Failures that affect all packets that traverse the
faulty interface
•
But
–
Eg., interface or router crashes, fiber cuts, bugs
not path-specific failures
Failures that affect only a subset of paths that cross
the faulty interface
•
Eg., router misconfigurations
[Nguyen, 2009]
25
fail-stop failures
New formulation of failure
detection problem
Select
–
the frequency to probe each path
Lower frequency per-path probing can achieve a
high frequency probing of each interface
T1
1 every 9 mins
M1
1 every 3 mins
M2
[Nguyen, 2009]
26
C
A
B
D
T2
T3
Outline
Background
Fault
detection with no control of targets
–
Active vs. passive measurements
–
Reducing overhead of active measurements
–
Disambiguating one-way failures
Fault
identification using binary tomography
–
Correlated path reachability without multicast
–
Topology discovery
Open
27
in network tomography
issues
Is failure in forward or
reverse path?
probe
t
Paths
reply
m
28
can be asymmetric
–
Load balancing
–
Hot-potato routing
Disambiguating one-way losses:
Spoofing
t
m
[Katz-Bassett, 2008]
29
Spoofer
Monitor requests to spoofer to
send probe
Spoofer sends spoofed probe
with source address of the
monitor
If reply reaches the monitor,
reverse path is good
Limits of spoofing
Network
–
operators often drop spoofed packets
Spoofed packets are normally used for attacks
Placement
–
t
of spoofer
Paths from spoofer to
targets need to be
independent than paths
from monitors
m
30
Summary: Fault detection
End
–
–
users: passive plus active probing
Passive measurements capture user’s experience
Active probes
•
When path has no traffic
•
When TCP connections are too short
Network
–
–
31
operators: alarms plus active probing
Alarm systems directly report many faults
Active monitoring to capture customer’s experience
•
Detect blackholes (i.e., faults that don’t appear in alarms)
•
Detect faults in other networks
Outline
Background
Fault
detection with no control of targets
–
Active vs. passive measurements
–
Reducing overhead of active measurements
–
Disambiguating one-way failures
Fault
identification
–
Correlated path reachability without multicast
–
Topology discovery
Open
32
in network tomography
issues
Uncorrelated measurements
lead to errors
Lack
of synchronization
leads to inconsistencies
–
–
Probes cross links at
different times
Path may change between
probes
m
t1
mistakenly
inferred failure
33
t2
Sources of inconsistencies
In
–
In
–
–
34
measurements from a single monitor
Probing all targets can take time
measurements from multiple monitors
Hard to synchronize monitors for all probes to reach
a link at the same time
Impossible to generalize to all links
Inconsistent measurements with
multiple monitors
mK
path reachability
…
m1, tN good
mK,t1 good
mK, tN bad
inconsistent
measurements
35
…
good
…
m1,t1
…
m1
tN
t1
Solution:
Reprobe paths after failure
mK
path reachability
…
m1, tN bad
Consistency
mK, tN bad
has a cost
–
Delays fault identification
–
Cannot identify short failures
[Cunha, 2009]
36
mK,t1 good
…
good
…
m1,t1
…
m1
tN
t1
Summary: Correlated
measurements
Trade-off:
–
–
Faster identification leads to false alarms
Slower identification misses short failures
Network
–
–
37
operators
Too many false alarms are unmanageable
Longer failures are the ones that need intervention
End
–
consistency vs. identification speed
users
Even short failures affect performance
Outline
Background
Fault
detection with no control of targets
–
Active vs. passive measurements
–
Reducing overhead of active measurements
–
Disambiguating one-way failures
Fault
identification
–
Correlated path reachability without multicast
–
Topology discovery
Open
38
in network tomography
issues
Measuring router topology
With
–
Topology of one network
–
Routing monitors (OSPF or IS-IS)
No
39
access to routers (or “from inside”)
access to routers (or “from outside”)
–
Multi-AS topology or from end-hosts
–
Monitors issue active probes: traceroute
Topology from inside
Routing
protocols flood state of each link
–
Periodically refresh link state
–
Report any changes: link down, up, cost change
Monitor
–
listens to link-state messages
Acts as a regular router
•
AT&T’s OSPFmon or Sprint’s PyRT for IS-IS
Combining
–
link states gives the topology
Easy to maintain, messages report any changes
[Mortier] [Shaikh, 2004]
40
Inferring a path from outside:
traceroute
Actual path
TTL exceeded
from B.1
TTL exceeded
from A.1
A.1
m
A
A.2
B.1
B
B.2
t
TTL = 1
TTL = 2
Inferred path
m
41
A.1
B.1
t
A traceroute path can be
incomplete
Load
–
balancing is widely used
Traceroute only probes one path
Sometimes
–
ICMP rate limiting
–
Anonymous routers
Tunnelling
–
42
taceroute has no answer (stars)
(e.g., MPLS) may hide routers
Routers inside the tunnel may not decrement TTL
Traceroute under load balancing
Actual path
A
C
TTL = 2
m
B
t
E
L
D
TTL = 3
Inferred path
A
Missing nodes
and links
C
False link
m
E
L
B
[Augustin, 2006]
43
D
t
Errors happen even under
per-flow load balancing
Flow 1
A
m
L
C
TTL = 2
Port 2
B
E
t
D
TTL = 3
Port 3
Traceroute
–
–
Needs to match probe to response
Response only has the header of the issued probe
[Augustin, 2006]
44
uses the destination port as identifier
Paris traceroute
Solves
–
the problem with per-flow load balancing
Probes to a destination belong to same flow
Changes
–
Use the UDP checksum
m
[Augustin, 2006]
45
the location of the probe identifier
L
A
C
TTL = 2
Port 1
TTL = 3
Port 1
Checksum 2
Checksum 3
B
D
E
t
Topology from traceroutes
Actual topology
1
m1
A
3
2
1
1 B 2
3
3
C 4
2
D
1 2
t1
Inferred topology
D.1
m1 A.1
C.1
C.2
t2
m2
m2
B.3
Inferred nodes = interfaces, not routers
Coverage depends on monitors and targets
46
t1
–
Misses links and routers
–
Some links and routers appear multiple times
t2
Alias resolution:
Map interfaces to routers
Direct probing
–
–
Responses from the same
router will have close IP
identifiers and same TTL
Record-route IP option
–
Records up to nine IP
addresses of routers in the path
[Spring, 2002] [Sherwood, 2008]
47
Inferred topology
Probe an interface, may receive
response from another
D.1
m1 A.1
t1
C.1
C.2
m2
B.3
t2
same router
Large-scale topology
measurements
Probing
–
–
–
a large topology takes time
E.g., probing 1200 targets from PlanetLab nodes
takes 5 minutes on average (using 30 threads)
Probing more targets covers more links
But, getting a topology snapshot takes longer
Snapshot
–
Paths may change during snapshot
Hard
–
48
may be inaccurate
to get up-to-date topology
To know that a path changed, need to re-probe
Faster topology snapshots
Probing
redundancy
–
Intra-monitor
–
Inter-monitor
Doubletree
–
Combines backward and
forward probing to
eliminate redundancy
[Donnet, 2005]
49
t1
D
m1 A
C
B
m2
t2
Summary: Topology discovery
Network
–
–
Own network: routing messages
Neighbor networks: traceroutes
End
–
–
users: combining traceroutes
Be aware of inaccuracies
•
False or missing links and nodes
•
Hidden hops: stars, tunneling
Fault identification with lower precision
•
50
operators
Determine the network to blame
Outline
Background
Fault
detection with no control of targets
–
Active vs. passive measurements
–
Reducing overhead of active measurements
–
Disambiguating one-way failures
Fault
identification
–
Correlated path reachability without multicast
–
Topology discovery
Open
51
in network tomography
issues
Tomography algorithms
Make
robust to measurement noise
Make
robust to topology uncertainties
–
–
Multiple topologies close to the time of an event
Multiple paths between a monitor and a target
Identify
–
–
52
other types of faults
Path specific
Intermittent
Monitoring techniques
Track
–
dynamics of large-scale topologies
Fast identification requires up-to-date topology
Passive
–
–
detection inside a network
High speed packet processing
Detect faults with incomplete information
Large-scale
–
Consolidating measurements becomes bottleneck
Define
–
–
53
deployment
changes to easy fault diagnosis
Router reports or behavior
Common monitoring infrastructure
REFERENCES
54
Network tomography theory
Survey on network tomography
–
Traffic matrix estimation
–
55
R. Castro, M. Coates, G. Liang, R. Nowak, and B. Yu, “Network
Tomography: Recent Developments”, Statistical Science, Vol. 19, No.
3 (2004), 499-517.
Y. Vardi, “Network Tomography: Estimating Source-Destination Traffic
Intensities from Link Data”, Journal of the American Statistical
Association, Vol. 91, 1996.
Inference of link performance/connectivity
–
MINC project: http://gaia.cs.umass.edu/minc/
–
A. Adams et al., “The Use of End-to-end Multicast Measurements for
Characterizing Internal Network Behavior”, IEEE Communications
Magazine, May 2000.
Binary tomography
Single-source tree algorithm
–
Applying tomography in one network
–
A. Dhamdhere, R. Teixeira, C. Dovrolis, and C. Diot,
“NetDiagnoser:Troubleshooting network unreachabilities using end-toend probes and routing data”, CoNEXT, 2007.
Obtaining accurate path status for binary tomography
–
56
R. R. Kompella, J. Yates, A. Greenberg, A. C. Snoeren, “Detection
and Localization of Network Blackholes”, IEEE INFOCOM, 2007.
Applying tomography in multiple network topology
–
N. Duffield, “Network Tomography of Binary Network Performance
Characteristics”, IEEE Transactions on Information Theory, 2006.
I. Cunha, R. Teixeira, N. Feamster, and C. Diot, “Measurement
Methods for Fast and Accurate Blackhole Identification with Binary
Tomography”, Thomson technical report CR-PRL-2009-05-006, 2009.
Topology from inside
IS-IS monitoring
–
OSPF monitoring
–
A. Shaikh and A. Greenberg, “OSPF Monitoring: Architecture,
Design and Deployment Experience”, NSDI 2004
Commercial products
–
57
R. Mortier, “Python Routeing Toolkit (`PyRT')”,
https://research.sprintlabs.com/pyrt/
Packet Design: http://www.packetdesign.com/
Topology with traceroute
Tracing accurate paths under load-balancing
–
Reducing overhead to trace topology of a network and alias
resolution with direct probing
–
R. Sherwood, A. Bender, N. Spring, “DisCarte: A Disjunctive Internet
Cartographer”, SIGCOMM, 2008.
Reducing overhead to trace a multi-network topology
–
58
N. Spring, R. Mahajan, and D. Wetherall, “Measuring ISP Topologies
with Rocketfuel”, SIGCOMM 2002.
Use of record route to obtain more accurate topologies
–
B. Augustin et al., “Avoiding traceroute anomalies with Paris
traceroute”, IMC, 2006.
B. Donnet, P. Raoult, T. Friedman, and M. Crovella, “Efficient
Algorithms for Large-Scale Topology Discovery”, SIGMETRICS, 2005.
Reducing overhead of active
fault detection
Selection of paths to probe
–
H. Nguyen and P. Thiran, “Active measurement for multiple link
failures diagnosis in IP networks”, PAM, 2004.
–
Yigal Bejerano and Rajeev Rastogi, “Robust monitoring of link
delays and faults in IP networks”, INFOCOM, 2003.
Selection of the frequency to probe paths
–
59
H. X. Nguyen , R. Teixeira, P. Thiran, and C. Diot, " Minimizing
Probing Cost for Detecting Interface Failures: Algorithms and
Scalability Analysis", INFOCOM, 2009.
Internet-wide fault detection
systems
Detection with BGP monitoring plus continuous pings, spoofing to
disambiguate one-way failures, traceroute to locate faults
–
Detection with passive monitoring of traffic of peer-to-peer systems
or content distribution networks, traceroutes to locate faults
–
60
E. Katz-Bassett, H. V. Madhyastha, J. P. John, A. Krishnamurthy, D.
Wetherall, T. Anderson, “Studying Black Holes in the Internet with
Hubble”, NSDI, 2008.
M. Zhang, C. Zhang, V. Pai, L. Peterson, and R. Wang, “PlanetSeer:
Internet Path Failure Monitoring and Characterization in Wide-Area
Services”, OSDI, 2004.