Network Awareness & Failure Resilience in Self
Download
Report
Transcript Network Awareness & Failure Resilience in Self
Network Awareness & Failure Resilience
in
Self-Organising Overlay Networks
L. Massoulié,
A.-M. Kermarrec,
A.J. Ganesh
Microsoft Research
Context: unstructured overlays
Peer machines connected to the Internet,
Each only maintains IP addresses of
neighbors.
Applications:
Low-level: message dissemination (flooding)
Higher-level: tree construction; content
search,…
E.g., Gnutella, ScaMP,…
Overlay structure = graph of “who knows
who” relations; choice of neighbors flexible.
The
Internet…
Objectives
Adapt overlay graph structure, for:
Improving resilience to failures,
Reducing network load,
Reducing network impact on application
performance.
Network Awareness
Cost of overlay connection (i,j):
•
•
Communication cost = number of network hops n(i,j);
Application cost = propagation delay from i to j
(both measured by ping).
Assumption: some function c(i,j) captures network
cost, to be minimised; easily measured.
Failure Resilience
i.e., connectivity in the presence of link / node failures.
Benchmark: connectivity of random graphs
(relevant for existing systems, like ScaMP)
Random graph on N nodes, with mean degree of c.log(N)
supports node or link failure rates up to 1-1/c.
degree distribution
3000
2500
2000
Number of nodes
disconnections: due
to isolated nodes
1500
1000
500
0
0
10
20
30
40
Degree
50
60
70
Formal problem statement
Adapt graph in a distributed way, keeping number
of edges fixed, so as to reduce
objective function E (G ) w
d2
c (i , j )
i
di=degree of node i;
Forces degree balancing
i
(i , j )
c(i,j)=cost of maintaining
connection (i,j), to both
network and overlay app.
Parameter w:
controls trade-off between objectives
Solution: a Metropolis algorithm
Periodically each node i picks two current
neighbours j, k
j
k
j
k
Candidate rewiring:
i
i
Local evaluation of impact on energy:
E 2w (dk di 1) c ( j , k ) c (i , j )
Rewiring accepted with probability:
E /T di (di 1)
Min 1, e
dk (dk 1)
Metropolis algorithm (2)
Defines Markov chain on set of connected graphs
with initial number of edges E, and stationary
distribution:
1 E (G ) /T
P (G )
Z
e
hence concentrates on low energy configurations.
Analysis of failure resilience
Key result:
–
For an average degree of c.log(N), c>0,
resulting graph remains connected for link failure rates
up to exp(-1/c).
Improves upon failure resilience of uniform
random graphs (cf. Erdös-Renyi law);
Essentially optimal failure resilience for uniform
random link failures.
Experimental results
Topologies:
–
–
Georgia Tech transit-stub model, 5050 core nodes, 100node LANs attached to core nodes.
(Subgraph of) Microsoft’s corporate network, with 298
core nodes.
Initial overlay: random, with average degree of
2.log(N) (based on ”ScaMP” system)
Costs: delays (1ms per local hop, 40ms per nonlocal hop).
N=50,000 peers.
Number of disconnections
Connectivity (Corp)
180
160
140
120
100
80
60
(0,0,0)
(10,1,100)
(10,1,1000)
40
20
0
0
5000
10000
15000
Number of faulty nodes
20000
25000
Degree (Corp)
50,000
mean value= 19.39
Number of nodes
45,000
40,000
(0,0,0)
(10,1,100)
35,000
(10,1,1000)
30,000
(50,1,100)
(50,1,1000)
25,000
20,000
15,000
10,000
5,000
0
0
10
20
30
40
50
Number of neighbours
60
70
Network delays
(w,iterations)
T=1
Mean distance to neighbors (GT5050)
(0,0)
483
(10,100)
408
(10,1000)
290
(10,5000)
196
Reduced by a factor of 4 on Corp and >2 on GT-5050.
Similar reductions on standard deviation of distances
between neighbours.
Outlook
Trade-off between network locality and
degree balancing;
Study application-related costs for
diverse applications:
–
–
For fast dissemination, aggregation of network
delays not enough “small-world” topologies;
Other distances for other applications:
“semantic distance” between peers for content
searching.