Graph Sampling Techniques for Studying Unstructured Overlays

Download Report

Transcript Graph Sampling Techniques for Studying Unstructured Overlays

Respondent-driven Sampling
for Characterizing
Unstructured Overlays
A. H. Rasti
M. Torkjazi
R. Rejaie
N. Duffield
W. Willinger
D. Stutzbach
University of Oregon
AT&T Labs - Research
Stutzbach Enterprises
Graciously Presented By:
Shubho Sen
Research
AT&T Labs -
Motivation
• P2P systems are very popular in practice.
– Millions of simultaneous users.
– A significant fraction of Internet traffic
• Measurement studies aid understanding existing systems
and user behavior.
• Capturing an accurate global “snapshot” is often infeasible.
– P2P systems are distributed, large, and rapidly changing.
– P2P crawlers are likely to capture incomplete or distorted snapshots
• Sampling is a natural approach, and has been used implicitly
in most earlier P2P measurement studies.
 How can we collect representative samples?
•2
The Graph Sampling Problem
• We focus on sampling peer properties, such as number of
neighbors (degree), access link bandwidth, session time, # files
• Sampling peer properties has two steps:
– Discovering and selecting peers (or samples)
– Measuring the desired properties of selected peers
• Selecting peers uniformly at random is hard – there are two
sources of bias [Stutzbach:IMC06]
– Topological: high-degree peers are more likely to be selected
– Temporal: short-lived peers are more likely to be selected
• Random walks are a promising approach to sampling
– The resulting bias is precisely known
– Samples can be collected in parallel by multiple walkers
•3
Sampling Using Random Walk
• Random walks can be described with a transition matrix
P(x,y)
• P(x,y) : probability of moving from x to y
 1

P( x, y )   deg( x)
 0
y is a neighborof x
otherwise
• Pr(x,y) : probability of moving from x to y after r moves
• Random walks converge to a stationary distribution
deg( x)
 ( x)  lim(vP )(x) 
r 
2E
r
• Problem: we need a uniform distribution
•4
Metropolized Random Walk (MRW)
• The Metropolis-Hastings method modifies the transition matrix to yield
the desired uniform distribution [Stutzbach:IMC06]
deg( x)

P( x, y) min(deg( y) ,1) if x  y
Q( x, y)  
1   Q( x, y)
if x  y

x y

x
y
• MRW method:
– Select a neighbor y of x uniformly at random
– Transition to y with probability min( deg(x)/deg(y) , 1)
– Otherwise, self-loop to x.
– Results in uniform stationary dist. (x)= 1/|V|
 MRW compensates for bias as samples are collected
•5
This paper
• Presents a new graph sampling technique,
Respondent-Driven Sampling (RDS)
• Compares the performance of RDS and MRW
sampling techniques using simulations &
experiments
•6
Respondent-driven Sampling
• A development of Snowball Sampling [Salganik04]
• Commonly used in social sciences to sample
“hidden” populations, e.g. HIV+ individuals
• Social relationships (references) are used by sampler
to diffuse into hidden populations
– Each person introduces n other persons
– Similar to random walk (n = 1)
• We adopt the RDS technique from social sciences
for sampling P2P networks
•7
RDS Formulation
• Goal: Estimate the distribution of node property X
• Perform regular random walk, collect values of property X and
node degree (deg(v)) at each visited node
• Deal with the bias during the post-processing as follows:
– Divide possible values for X into several ranges: {R1, . . . ,Rm}
– Partition nodes with the X value within the same range: {V1, . . . ,Vm}
• Using Hansen-Hurwitz estimator to compensate for the bias, the
proportion of all nodes in group i is estimated as follows:
pˆ i



vTi
vT
1
1
deg( v)
• Ti: visited samples in group i
• T: all visited samples
deg( v)
•8
Evaluation Overview
• Performance metric
– Consider only peer properties that may interact with the walk:
• 1) Peer Degree, 2) Peer Uptime, 3) Peer RTT
– Compare the dist. of the these peer properties from samples and
“ground truth” using Kolmogorov-Smirnov (KS) statistics
• Evaluation Methodology
– Evaluation over static graphs
• Effect of graph structure
– Evaluation over dynamic graphs (session level simulation)
• Benefits of parallel Sampling (see the paper)
• Effect of 1) churn, 2) peer discovery, 3) target peer degree
– Experiments over Gnutella network
•9
Evaluation: Static Graphs
• Using graphs with different degree distribution &
clustering characteristics:
–
–
–
–
Random graphs (ER): Erdos-Renyi
Small-world graphs (SW): Watts and Strogatz
Scale-free graphs (BA): Barabasi and Albert
Hierarchical Scale-Free graphs (HSF): Barabasi ‘02
• Power-law degree distribution
• Node clustering is inversely proportional to node degree
– Gnutella graphs (GA): Snapshots of Gnutella Ultrapeer
topology
•10
Hierarchical Scale-Free (HSF)
•11
Static Graphs
• Accuracy of both techniques is
improved with the number of
samples in most cases
• The rate of improvement in
accuracy is much lower over HSF
especially for MRW
• Walkers are likely to get trapped
within clusters in HSF graphs
• Leaving a cluster requires visiting high
degree nodes but MRW is less likely to
visit these nodes
• Rewiring a small fraction of
randomly selected edges in HSF
significantly improves accuracy
for both techniques
 RDS is less sensitive to graph clustering
than MRW
•12
Dynamic Graphs
• Churn is a primary limiting factor
for accuracy
• Session len.> 5m  Very good
sampling accuracy
• Churn model has little effect
• Similar impact on other peer
properties (see the paper)
• Sampling error is small once nodes
have sufficient connectivity (> 5)
• Lower accuracy for smaller degree
is due to graph partitioning
• Partitioned nodes in History mech.
reduce the accuracy of sampling
•13
Experiment: Gnutella
• Run crawler, 1000 RDS & 1000
MRW walkers in parallel
– 500 steps per walker
• Use captured snapshots by crawler
as a “rough” reference
– Show min, max, avg KS over 6
experiments
– Focus only on degree dist
• The degree dist from samples &
crawls are very similar (KS~0.03)
• The accuracy is an order of
magnitude lower than dynamic sim
due to inaccurate reference.
• Both sampling technique achieve
similar accuracy
•14
Conclusions & Future Work
• RDS always performs as good or better than MRW
• High level of graph clustering can significantly degrade
the accuracy of both RDS and MRW
– RDS is less sensitive than MRW to graph clustering
• There is sweet spot for the number of parallel samplers.
• Poor connectivity & high dynamics adversely affect the
accuracy of both techniques.
• Future Work:
– RDS is a promising approach for sampling user properties in
Online Social Networks
– Sampling over directed graphs raises new challenges.
•15
Thank You !
Different Grpah Structures
Dynamic Simulation Setting
• Simulation environment
– Session-time distributions : Weibull, Exponential,
Pareto
– Poisson arrival process
– Peer discovery : Oracle, FIFO, HeartBeat, History
– Target population : 100’000
– Min. Degree : 3-30
– Sampling Parameters:
• Node degree (DEG)
• Node query latency (RTT)
• Session length/uptime (UT)
•18
Dynamic Graphs: Effect of Parallelism
• Too much parallelism
does not improve
performance
• Too long random walks
have negative effect
• Sweet spot exists
•20