Overlay Network Monitoring and its Applications

Download Report

Transcript Overlay Network Monitoring and its Applications

Tomography-based Overlay Network
Monitoring and its Applications
Yan Chen
Joint work with David Bindel, Brian Chavez,
Hanhee Song, and Randy H. Katz
UC Berkeley
Motivation
• Applications of end-to-end distance monitoring
–
–
–
–
–
Overlay routing/location
Peer-to-peer systems
VPN management/provisioning
Service redirection/placement
Cache-infrastructure configuration
• Requirements for E2E monitoring system
–
–
–
–
Scalable & efficient: small amount of probing traffic
Accurate: capture congestion/failures
Incrementally deployable
Easy to use
Existing Work
• Static estimation:
– Global Network Positioning (GNP)
• Dynamic monitoring
– Loss rates: RON (n2 measurement)
– Latency: IDMaps, Dynamic Distance Maps, Isobar
• Latency similarity under normal conditions doesn’t imply
similar losses !
• Network tomography
– Focusing on inferring the characteristics of physical
links rather than E2E paths
– Limited measurements -> under-constrained system,
unidentifiable links
Problem Formulation
Given n end hosts on an overlay network and O(n2)
paths, how to select a minimal subset of paths to
monitor so that the loss rates/latency of all
topology
other paths can be inferred.
Overlay Network
Operation Center
measurements
End hosts
•
Key idea: select a basis set of k paths that
completely describe all O(n2) paths (k «O(n2))
–
–
–
Select and monitor k linearly independent paths to
compute the loss rates of basis set
Infer the loss rates of all other paths
Applicable for any additive metrics, like latency
Path Matrix and Path Space
A
B
p
lj
• Path loss rate p, link loss rate l
1 p 
 (1  l )
j
j s.t .vj 1
• Totally s links, path vector v
s
s
log( 1  p)   vj log( 1  lj )   vjxj v x
T
j 1
• Path matrix G
O(n2)
– Put all r =
paths
together
– Path loss rate vector b
j 1
rs
Gx  b, G {0 | 1} , b  
r 1
Sample Path Matrix
A
b2
1
b1
3
D
2
B
b3
C
1 1 0
G  0 0 1
1 1 1
 x1   b1 
G  x 2   b 2 
 x3  b3 
link 2
(1,-1,0)
(1,1,0)
row space
(measured)
null space
(unmeasured)
link 1
link 3
1 
0 b1 / 2
• x1 - x2 unknown => cannot
( x1  x2 )  
0  b / 2
xG 
1

x
compute x1, x2
  3   1 
2
• Set of vectors [1 1 0]T
0
1  b2 
form null space
1
• To separate identifiable vs. x  ( x1  x2 )  1
N
 
unidentifiable components:
2
 0 
x = xG + xN
• All E2E paths are in path b  Gx  Gx  Gx  Gx
G
N
G
space, i.e., GxN = 0
Intuition through Topology
Virtualization
• Virtual links:
minimal path
segments
whose loss
rates uniquely
identified
• Can fully
describe all
paths
• xG : similar
forms as
virtual links
1’
1

2
Rank(G)=1
Rank(G)=2 1
1’
2’
2
1’
1
2
Rank(G)=3
3
2’
3’
4
3

4’

1
2
1
1
2
3
Virtualization
Real links (solid) and overlay
paths (dotted) going through them
Virtual links
5
Algorithms
• Select k = rank(G) linearly independent paths to
monitor
– Use rank revealing decomposition, e.g., QR with column
pivoting
– Leverage sparse matrix: time O(rk2) and memory O(k2)
• E.g., 10 minutes for n = 350 (r = 61075) and k = 2958
• Compute the loss rates of other paths
Solve xG with GxG  b, then b  GxG
– Time O(k2) and memory O(k2)
How much measurement saved ?
• k « O(n2) ?
• For power-law Internet topology, M nodes, N end
hosts
– There are O(M) links and N > M/2 (with proof)
– If n = O(N), overlay network has O(n) IP links, k = O(n)
When a Small Portion of End Hosts on Overlay
• Internet has moderate hierarchical structure [TGJ+02]
– If a pure hierarchical structure (tree): k = O(n)
– If no hierarchy at all (worst case, clique): k = O(n2)
– Internet should fall in between …
For reasonably large n, (e.g., 100), k = O(nlogn)
BRITE 20K-node hierarchical topology
Mercator 284K-node real router topology
Practical Issues
• Topology measurement errors tolerance
– Care about path loss rates than any interior links
– Poor router alias resolution present show multiple links
for one => assign similar loss rates to all the links
– Unidentifiable routers => ignore them as virtualization
• Topology changes
– Add/remove/change one path incurs O(k2) time
– Topology relatively stable in order of a day =>
incremental detection
• Simulation
Evaluation
– Topology
• BRITE: Barabasi-Albert, Waxman, hierarchical: 1K – 20K nodes
• Real router topology from Mercator: 284K nodes
– Fraction of end hosts on the overlay: 1 - 10%
– Loss rate distribution (90% links are good)
• Good link: 0-1% loss rate; bad link: 5-10% loss rates
• Good link: 0-1% loss rate; bad link: 1-100% loss rates
– Loss model:
• Bernouli: independent drop of packet
• Gilbert: busty drop of packet
– Path loss rate simulated via transmission of 10K pkts
• Metric: path loss rate estimation accuracy
– Absolute/relative errors
– Lossy path inference
Experiments on Planet Lab
• 51 hosts, each from
different organizations
– 51 × 50 = 2,550 paths
• Simultaneous loss rate
measurement
US (40)
– 300 trials, 300 msec each
– In each trial, send a 40-byte
UDP pkt to every other host
• Simultaneous topology
measurement
– Traceroute
• Experiments: 6/24 – 6/27
– 100 experiments in peak
hours
# of
hosts
Areas and Domains
Europe (6)
International
(11)
.edu
33
.org
3
.net
2
.gov
1
.us
1
France
1
Sweden
1
Denmark
1
Germany
1
UK
2
Taiwan
1
Hong Kong
1
Asia (2)
Canada
2
Australia
1
Tomography-based Overlay Monitoring Results
• Loss rate distribution
loss
rate
[0, 0.05)
%
95.9%
lossy path [0.05, 1.0] (4.1%)
[0.05, 0.1)
[0.1, 0.3)
[0.3, 0.5)
[0.5, 1.0)
1.0
15.2%
31.0%
23.9%
4.3%
25.6%
• Accuracy
– On average k = 872 out of 2550
– Absolute error |p – p’|:
• Average 0.0027 for all paths, 0.0058 for lossy paths
– Relative error
 p( ) p' ( ) 
F ( p, p' )  max 
,

 p' ( ) p( ) 
where p( )  max(  , p), p' ( )  max(  , p' )
Absolute and Relative Errors
• For each experiment, get its 95 percentile absolute and
relative errors for estimation of 2,550 paths
Lossy Path Inference Accuracy
• 90 out of 100 runs have coverage over 85% and false
positive less than 10%
• Many caused by the 5% threshold boundary effects
Topology Measurement Error Tolerance
• Out of 13 sets of pair-wise traceroute …
• On average 248 out of 2550 paths have no or
incomplete routing information
• No router aliases resolved
Conclusion: robust against topology measurement
errors
Performance Improvement with Overlay
• With single-node relay
• Loss rate improvement
– Among 10,980 lossy paths:
– 5,705 paths (52.0%) have loss rate reduced by 0.05 or more
– 3,084 paths (28.1%) change from lossy to non-lossy
• Throughput improvement
– Estimated with
1.5
throughput 
rtt lossrate
– 60,320 paths (24%) with non-zero loss rate, throughput computable
– Among them, 32,939 (54.6%) paths have throughput improved,
13,734 (22.8%) paths have throughput doubled or more
• Implications: use overlay path to bypass congestion or
failures
Adaptive Overlay Streaming Media
Stanford
5. Alert +
New Overlay Path
OVERLAY NETWORK
OPERATION CENTER
2. Register trigger
4. Detect congestion /
failure
UC Berkeley
X
CLIENT
UC San Diego
1. Setup
connection
SERVER
3. Network congestion /
failure
7. Skip-free streaming
media recovery
6. Setup New Path
OVERLAY RELAY
NODE
HP Labs
•
•
•
•
Implemented with Winamp client and SHOUTcast server
Congestion introduced with a Packet Shaper
Skip-free playback: server buffering and rewinding
Total adaptation time < 4 seconds
Adaptive Streaming Media Architecture
MEDIA
SOURCE
OVERLAY NETWORK
OPERATION CENTER
SERVER
CLIENT
Winamp client
Triggering /
alert + new path
SHOUTcast
Server
RELAY
Overlay Layer
Buffering Layer
Winamp Video/Audio Filter
Byte Counter
Path Management
RELAY
Path Management
TCP/IP Layer
Calculated
concatenation
point
TCP/IP Layer
Client 4 Client 3 Client 2 Client 1
From
SHOUTcast
server
BUFFER
Byte
Counter
Overlay
Layer
TCP/IP Layer
Internet
Client 1
OVERLAY RELAY NODE
Client 2
INTERNET
Client 3
Client 4
Conclusions
• A tomography-based overlay network monitoring
system
– Given n end hosts, characterize O(n2) paths with a
basis set of O(nlogn) paths
– Selectively monitor O(nlogn) paths to compute the
loss rates of the basis set, then infer the loss rates
of all other paths
• Both simulation and real Internet experiments
promising
• Built adaptive overlay streaming media system
on top of monitoring services
– Bypass congestion/failures for smooth playback
within seconds
Backup Slides