dimacs-tomography2008

Download Report

Transcript dimacs-tomography2008

Challenges in Making
Tomography Practical
Yiyi Huang, Georgia Tech
Nick Feamster, Georgia Tech
Renata Teixeira, LIP6
Christophe Diot, Thomson
Problem
• Network operators need to detect and isolate
faults quickly, before customers complain
• Plenty of existing alarms
– SNMP traps
– Active probes
– Anomaly detection systems
• Unfortunately, this set of alarms does not help
operators locate and eliminate problems that
induce problems on end-to-end paths
2
Network Tomography to the Rescue
Monitor
Targets
y
x
• Send end-to-end probes through the network
• Monitor paths for differences in reachability
• Infer location of reachability problem from these differences
3
Some Problems
• Scalability vs. speed: Detection must be fast
• Ambiguity: Losses are one-way but don’t
always have access to both ends of the path
• Lack of synchronization: Different monitors
see different conditions
• Dynamics: Topology can change, loss can be
transient
4
Doppler: Making Tomography Practical
• Fast, scalable detection
– Solution: Monitor selection algorithm to reduce the
number of monitors and targets so that “cycle times”
are fast
• Transient packet loss
– Solution: Triggered confirmation of failed paths
• One-way losses
– Solution: New algorithm based on IP spoofing
• Dynamic routing
– Solution: Periodic snapshots of the network topology
Controlled evaluation on VINI, plus limited wide-area experiments.
5
Fast, Scalable Detection
• Select monitors, targets to satisfy two conditions
– All interfaces are “covered” (or diagnosable)
– The number of monitors is small enough to ensure a
short round time
• Two goals
– Coverage: When a failure occurs, system detects it
• Every interface is covered by at least one path
– Diagnosability: When a failure occurs, system locates it
• Every interface is covered by a unique set of paths
6
Offline Path Selection: Diagnosability
• Step 1: Compute the set of paths that cover all
interfaces (greedy set cover heuristic)
• Step 2: Compute hitting set for each interface
• Step 3: Build equivalence classes for interfaces
with common hitting set
– For each interface in a set with more than one
interface, find path that crosses only that interface
7
Detection, Confirmation, Correlation
• Periodic (once per 5 minutes) topology snapshot
from all monitors to all destinations keeps track
of underlying topology before the failure
• Detection: Periodic probes (once per “cycle
time”) detect failure
• Confirmation: When a probe is lost, the monitor
sends three additional probes. If all three are
lost, path is determined to have failed.
• Correlation: Paths that fail within 10 seconds of
one another are grouped.
8
Disambiguating One-Way Losses:
Spoofing
• Monitor sends request to spoofer to send probe
• Probe has IP address of the monitor
• If reply reaches the monitor, reverse path is
working
T
M
Spoofer: Send spoofed
packet with source address of M
9
Identification: NetDiagnoser
• Binary network tomography algorithm
[Dhamdhere et al.]
• Input: hosts, destinations, topology before the
failure
• Output: Set of possible locations for the fault
10
Evaluation of Detection Algorithms
• Controlled experiments on the VINI testbed
– Emulated copy of Abilene network on wide-area paths
– Probing strategy emulates the paths that would be probed in monitor
selection algorithm
– Compare reduced set of paths to “aggressive” measurement
approach
• Varied failure location and duration
– Duration varied from 5 to 80 seconds
– Test repeated for each failed link
• Measure detection and false alarm rates
• Preliminary experiments using data from real-world networks
11
Detection: Scale and Speed
• Compute reduction in the number of paths
required to achieve coverage and diagnosability
– Reduction from about 27,000 paths to 151 paths
• For real-world networks, compute corresponding
reduction in cycle time
– Reduction from aout 3.5 minutes to < 5 seconds
12
Single-Link Failures
• More selective probing identifies more of the
shorter link failures (due to shorter cycle time)
• Also results in fewer false alarms
13
Single-Node Failures
• Similar results to single-link failures
– Selective measurements result in faster detection,
fewer false alarms
14
Does Failure Confirmation Reduce the
Total Number of Alarms?
• Confirmation reduces the number of failures by
> 35%
• Correlation further reduces the number of
alarms (by about a factor of 10)
15
How Quickly can Doppler Identify
Failures?
• Answer: Roughly 20 seconds using the reduced
set of paths
• Two main components
– Detection/Confirmation: Time from when failure was
injected to the time Doppler could detect and confirm
the failure
– Correlation: Time to group failures and construct
reachability matrix
16
Detection and Confirmation Delay
Most failures are detected within
3-5 seconds
17
Correlation Delay
Reducing the number of paths
to probe significantly reduces
total correlation time
18
Summary
• Making tomography practical is challenging
–
–
–
–
Asynchronous measurements
Scale and speed
Changing topologies
Ambiguity about forward and reverse paths
• Doppler: Set of techniques to address many of
these problems
• Current analysis is still performed offline
– Many additional challenges remain to coordinate
online measurements
19