Slides - VLDB 2009


Coordinated weighted sampling for estimating aggregates over multiple weight assignments
Edith Cohen, AT&T Research
Haim Kaplan, Tel Aviv University
Shubho Sen, AT&T Research
Data model
• Universe U of items: i1, i2, i3, i4, …
• Multiple weight assignments defined over the items: each item i has weights w1(i), w2(i), w3(i), …
[Figure: table of items i1, …, i4 with their weights under w1, w2, w3. Example: items are IP addresses; w1, w2, w3 = traffic on day 1, day 2, day 3.]
Data model (example)
• Universe U of items with multiple weight assignments.
[Figure: items are facebook users; w1(i) = male friends, w2(i) = female friends, w3(i) = online per day.]
Data model (example)
• Universe U of items with multiple weight assignments.
[Figure: items are customers; w1(i), w2(i), w3(i) = wait time for item 1, item 2, item 3.]
Data model (example)
• Universe U of items with multiple weight assignments.
[Figure: items are license plates; w1, w2, w3 = GW bridge traffic on day 1, day 2, day 3.]
Data model (example)
• Universe U of items with multiple weight assignments.
[Figure: items are web pages; w1(i) = bytes, w2(i) = out links, w3(i) = in links.]
Aggregation queries
• Selection predicate d(i) over items in U
• Compute the sum of the weights of all items for which the predicate holds:

$$\sum_{i \,:\, d(i)=\text{true}} w(i)$$
Examples:
• Total number of links out of pages of some web site
• Total number of flows into a particular network
This work: Aggregates depending on more than one weight function
• Selection predicate d(i) over items in U
• Want to compute the sum of f(i) over all items for which the predicate holds:

$$\sum_{i \,:\, d(i)=\text{true}} f(i)$$

f(i) depends on some subset of the weights given to i.
Simple examples: $f(i) = \max_b w_b(i)$, $f(i) = \min_b w_b(i)$, $f(i) = \max_b w_b(i) - \min_b w_b(i)$
Data model (example)
• Universe U of items with multiple weight assignments.
[Figure: items are customers; w1(i), w2(i), w3(i) = wait time for item 1, item 2, item 3.]
$f(i) = \max_b w_b(i)$ aggregated over a subset gives the total waiting time of these customers.
Data model
• Universe U of items with multiple weight assignments.
[Figure: items are IP addresses; w1, w2, w3 = traffic on day 1, day 2, day 3.]
$f(i) = \max_b w_b(i) - \min_b w_b(i)$ aggregated over a network gives the sum of maximum changes in the number of flows.
Challenges
• Massive data volumes, too large to store or to transport elsewhere
• w1 may not be available when we process w2, etc.; the assignments are collected at different places/times
• Many different types of queries (d(i), f(i)), not known in advance
⇒ Exact aggregation can be very resource consuming or simply impossible
Challenge: which summary to keep, and how to estimate the aggregate?
Solution: Use Sampling
Keep a sample
Desirable Properties of Sampling Scheme:
• Scalability: efficient on data streams and distributed
data, decouple the summarization of different sets
• Applicable to a wide range of d(i), f(i)
• Good estimators: unbiased, small variance
Sampling a single weight function
• Independent (Bernoulli) with probability k·w(i)/W: Poisson, PPS sampling
• k times without replacement: order (bottom-k) sampling, which includes PPSWOR and priority sampling [Rosén 72, Rosén 97, CK07, DLT07]
• Methods can be implemented efficiently in a stream/reservoir/distributed setting
• Order (bottom-k) sampling dominates other methods (fixed sample size, with unbiased estimators [CK07, DLT07])
Rank-based definition of order/bottom-k sampling
• For each item draw a random rank value from some distribution $f_{w(i)}$
• Order/bottom-k: pick the k smallest-ranked items into your sample
Rank distributions: draw a seed $u(i) \sim U[0,1]$ for each item
• $r(i) = u(i)/w(i)$ : priority order samples
• $r(i) = -\ln(u(i))/w(i)$ : PPSWOR order samples
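A minimal sketch of bottom-k sampling with these two rank distributions (our own illustrative code; the function and variable names are not from the paper):

```python
import heapq
import math
import random

def bottom_k_sample(weights, k, rank="priority"):
    """Bottom-k (order) sample: keep the k items with smallest rank.

    weights: dict mapping item -> w(i) > 0.
    rank(i) = u(i)/w(i) gives priority sampling;
    rank(i) = -ln(u(i))/w(i) gives PPSWOR.
    """
    ranked = []
    for i, w in weights.items():
        u = 1.0 - random.random()            # seed u(i) in (0, 1]
        r = u / w if rank == "priority" else -math.log(u) / w
        ranked.append((r, i))
    return heapq.nsmallest(k, ranked)        # the k smallest-ranked items
```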
Relations between samples by different weight functions
Independent sampling: get a sample from each weight function independently
Coordinated sampling: if w2(i) ≥ w1(i) then r2(i) ≤ r1(i).
For example, by using the same seed u(i) for all weight functions wb:
r1(i) = u(i)/w1(i)
r2(i) = u(i)/w2(i)
Can be achieved efficiently via hashing
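Coordination via hashing can be sketched as follows: derive the shared seed u(i) from a hash of the item's identifier, so the same item gets the same seed under every weight assignment (an illustrative sketch; the hash choice is our assumption):

```python
import hashlib

def shared_seed(item_id):
    """u(i) in (0, 1], derived deterministically from the item id."""
    h = int.from_bytes(hashlib.sha256(str(item_id).encode()).digest()[:8], "big")
    return (h + 1) / 2.0**64

def coordinated_rank(item_id, w):
    """Priority rank r_b(i) = u(i)/w_b(i); a larger weight always yields
    a smaller rank, since u(i) is shared across assignments."""
    return shared_seed(item_id) / w
```

With this, w2(i) ≥ w1(i) immediately gives r2(i) ≤ r1(i), and summarizers running at different places/times need no communication to coordinate.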
Coordination is critical
• Coordination is critical for tight estimates
• We develop estimators for coordinated
samples and analyze them
There is a lot of previous work on coordination, in the context of survey sampling and when the weights are uniform.
Horvitz Thompson Estimators [HT52]
HT estimator: give each item i an adjusted weight a(i).
Let p(i) be the probability that item i is sampled.
If i is in the sample, a(i) = w(i)/p(i); otherwise a(i) = 0.

$$E[a(i)] = p(i)\cdot\frac{w(i)}{p(i)} + (1-p(i))\cdot 0 = w(i)$$

$$a(Q) = \sum_{i\in Q} a(i) = \sum_{i\in Q\cap S} a(i) \quad\text{(S is the sample)}$$

⇒ a(Q) is an unbiased estimator of w(Q)
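In code, the HT estimate of w(Q) is a one-liner once the inclusion probabilities are known (a sketch; the argument names are ours):

```python
def ht_estimate(sample, weights, p, predicate):
    """Horvitz-Thompson estimate of w(Q), Q = {i : predicate(i)}.

    sample: sampled item ids; p[i]: inclusion probability of item i.
    Unsampled items contribute a(i) = 0, so the sum runs over the sample.
    """
    return sum(weights[i] / p[i] for i in sample if predicate(i))
```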
Using HT for order samples
Problem:
• Cannot compute p(i) from an order sample
• Therefore cannot compute the HT adjusted weights a(i) = w(i)/p(i)
Solution: Apply HT on partitions
• For item i, if we can find a partition of the sample space such that we can compute the conditioned p(i) for the cell that the sampled set belongs to,
• then apply HT using that conditioned p(i).
⇒ Obtain unbiased adjusted weights for each item
[Figure: the sample space partitioned into cells with conditioned probabilities p1(i), …, p6(i).]
Estimators for aggregates over multiple weights
Suppose we want to estimate:

$$\sum_{i \,:\, d(i)=\text{true}} \min_b w_b(i)$$

We have a sample for each weight function, using independent or coordinated ranks.
Estimator for $\sum_{i \,:\, d(i)=\text{true}} \min_b w_b(i)$
• Identify the items i for which d(i) = true and for which you know $\min_b w_b(i)$ (= items contained in all samples)
• Compute p(i), the probability of such an item being sampled for all weight functions (conditioned on a subspace that depends on the other ranks)
• Set $a(i) = \min_b w_b(i)/p(i)$
• Sum up these adjusted weights
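A minimal sketch of this estimator for coordinated priority ranks, under our simplifying assumption that each assignment b contributes a fixed conditioning threshold tau_b (the k-th smallest rank among the other items in sample b); with r_b(i) = u(i)/w_b(i), item i appears in every sample exactly when u(i) < min_b tau_b·w_b(i). The paper's conditioning is more careful; this only illustrates the shape of the computation:

```python
def min_estimator(items_in_all, weights_by_b, tau_by_b, predicate):
    """Sketch: estimate sum of min_b w_b(i) over {i : predicate(i)}.

    items_in_all: items present in the sample of EVERY assignment b,
    so min_b w_b(i) is known exactly. Assumes coordinated priority
    ranks r_b(i) = u(i)/w_b(i) and per-assignment thresholds tau_b.
    """
    total = 0.0
    for i in items_in_all:
        if not predicate(i):
            continue
        # i is in all samples iff u(i) < min_b tau_b * w_b(i)
        p = min(1.0, min(tau_by_b[b] * weights_by_b[b][i] for b in tau_by_b))
        total += min(weights_by_b[b][i] for b in tau_by_b) / p
    return total
```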
Independent vs Coordinated
 1  p(i) 
var[a(i)]  min b w (i) 

 p(i ) 

b

2
Variance is small when p(i) is large
Coordinated sketches have larger p(i)
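For completeness, this is the standard variance of an HT adjusted weight; writing $v = \min_b w_b(i)$:

```latex
\operatorname{var}[a(i)] = \mathbb{E}[a(i)^2] - \mathbb{E}[a(i)]^2
  = p(i)\left(\frac{v}{p(i)}\right)^2 - v^2
  = v^2 \cdot \frac{1 - p(i)}{p(i)}
```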
Estimator for $\sum_{i \,:\, d(i)=\text{true}} \max_b w_b(i)$
• Identify the items i for which d(i) = true and for which you know $\max_b w_b(i)$
• If you see all the weights of i, then you know $\max_b w_b(i)$
• But you never see them all if $\min_b w_b(i) = 0$
Estimator for $\sum_{i \,:\, d(i)=\text{true}} \max_b w_b(i)$
Use the consistency of the ranks:
If the largest weight you see for i has rank smaller than $\min_b \{k\text{-th smallest rank of an item in } I\setminus\{i\}\}$, then you know it is the maximum.
Estimator for $\sum_{i \,:\, d(i)=\text{true}} \max_b w_b(i)$
• If the largest weight you see for i has rank smaller than $\min_b \{k\text{-th smallest rank of an item in } I\setminus\{i\}\}$, then you know it is the maximum.
• Compute p(i), the probability of this happening
• Set $a(i) = \max_b w_b(i)/p(i)$
• Sum up these adjusted weights
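Under the same simplifying assumptions as the min-estimator sketch (coordinated priority ranks, per-assignment thresholds tau_b), the condition and its probability can be sketched as follows; shared_seed() is the hash-based seed from the coordination sketch above, and again this only illustrates the shape of the computation:

```python
def max_estimator(samples_by_b, weights_by_b, tau_by_b, predicate):
    """Sketch: estimate sum of max_b w_b(i) over {i : predicate(i)}.

    If the coordinated rank of i's largest observed weight w* is below
    every tau_b, then w* is the true max: a larger hidden weight would
    have an even smaller rank, so i would appear in that sample too.
    """
    min_tau = min(tau_by_b.values())
    total = 0.0
    for i in set().union(*samples_by_b.values()):
        if not predicate(i):
            continue
        w_star = max(weights_by_b[b][i]
                     for b, s in samples_by_b.items() if i in s)
        if shared_seed(i) / w_star < min_tau:      # max certainly observed
            p = min(1.0, w_star * min_tau)         # Pr[u(i) < w* * min_b tau_b]
            total += w_star / p
    return total
```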
More estimators
Now we can get an unbiased estimator for the L1 difference:

$$\sum_i \left(\max_b w_b(i) - \min_b w_b(i)\right)$$

Can prove that this estimator is nonnegative.
Empirical Evaluation
Data sets:
• IP packet traces:
Items: (src,dest,src-port,dest-port).
wb: total # of bytes in hour b
• Netflix data sets:
Items: movies.
wb: total # of ratings in month b
• Stock data:
Items: ticker symbols
wb: high value on day b
Netflix:
[Plot: $\mathrm{var}\left[\sum_i a_{\mathrm{ind}}(\min_b w_b(i))\right]$ vs. $\mathrm{var}\left[\sum_i a_{\mathrm{coor}}(\min_b w_b(i))\right]$, comparing independent and coordinated samples.]
Netflix:
[Plot: $\sum_i \mathrm{var}[a(f(i))]$ for various f.]
Summary
• Coordinated sketches improve accuracy, and sometimes estimation is not feasible without them
• Coordination can be achieved with little overhead using state-of-the-art sampling techniques
Thank you!
Application: Sensor nodes recording daily vehicle
traffic in different locations in a city
• Items: vehicles (license plate numbers)
• Sets: All vehicles observed at a particular
location/date
Example queries:
• Number of distinct vehicles in Manhattan on
election day (size of the union of all Manhattan
locations on election day)
• Number of trucks with PA license plates that
crossed both the Hudson and the East River on June
18, 2009
Application: Items are IP addresses. Sets h1,…,h24 correspond to destination IPs observed in different 1-hour time periods.
Example Queries: number of distinct IP dests
• In the first 12 hours (size of union of h1,…,h12)
• In at most one/at least 4 out of h1,…,h24
• In the intersection of h9,…,h17 (all business hours)
• Same queries restricted to: blacklisted IP’s, servers
in a specific country etc.
Uses: Traffic Analysis, Anomaly detection
Application: Text/HyperText corpus: items are
features/terms, sets are documents (or vice versa)
Example Queries:
•Number of distinct hyperlinks to financial news
websites from websites in the .edu domain
•Fraction of pages citing A that also cite B
•Similarity (Jaccard coefficient) of two documents
Uses: research, duplicate elimination, similarity
search or clustering
Application: Market basket data set. Items are
consumers/baskets, subsets are goods (or vice versa)
Example queries:
• Likelihood of purchasing beer if diapers were
purchased
• Number of baskets from zip code 07932 containing
shampoo
• Number of consumers purchasing baby food and not
diapers
Uses: Marketing, placement, pricing, choosing products
Rank-based definition of sampling
• For each item draw a random rank value from some distribution $f_{w(i)}$
• Poisson: pick all items with rank value below a threshold τ(k)
• Order/bottom-k: pick the k smallest-ranked items into your sample
Rank distributions: draw a seed $u \sim U[0,1]$
• $r(i) = u/w(i)$ : Poisson PPS, priority order samples
• $r(i) = -\ln(u)/w(i)$ : PPSWOR order samples
Horvitz Thompson Estimators [HT52]
Give each item i an adjusted weight a(i):
Let p(i) be the probability that item i is sampled.
If i is in the sample, a(i) = w(i)/p(i); otherwise a(i) = 0.

$$E[a(i)] = p(i)\cdot\frac{w(i)}{p(i)} + (1-p(i))\cdot 0 = w(i)$$

⇒ a(i) is an unbiased estimator of w(i)
Rank Conditioning Estimator for a single bottom-k sample set [CK07, DLT07]
Use the principle of applying HT on a partition.
[Figure: a set A of ten items with rank values 0.15, 0.62, 0.10, 0.79, 0.51, 0.04, 0.42, 0.38, 0.87, 0.94; the 4-sample of A consists of the four smallest-ranked items (ranks 0.04, 0.10, 0.15, 0.38, all < 0.42).]
For item i in the sample, the partition is based on the k-th smallest rank value among all other items (excluding i) being some fixed value. In this specific case, k = 4 and the fixed value is 0.42.
Compute p(i): the probability that the rank value of i is smaller than 0.42.
For priority order samples, r(i) = u/w(i), so p(i) = Pr[r(i) < 0.42] = min{1, 0.42·w(i)}
and a(i) = w(i)/p(i) = max{w(i), 100/42}.
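A minimal sketch of this rank-conditioning computation for priority bottom-k samples, assuming (our convention) that the (k+1)-th smallest rank is retained as the conditioning threshold:

```python
def priority_adjusted_weights(ranks, weights, k):
    """Rank-conditioning adjusted weights for a priority bottom-k sample.

    ranks: dict item -> r(i) = u(i)/w(i); weights: dict item -> w(i).
    Requires at least k+1 items, so the (k+1)-th smallest rank exists.
    """
    order = sorted(ranks, key=ranks.get)
    sample, tau = order[:k], ranks[order[k]]   # threshold tau (0.42 above)
    # For each sampled i, the k-th smallest rank among the OTHERS is tau,
    # so p(i) = Pr[r(i) < tau] = min(1, tau * w(i)).
    return {i: weights[i] / min(1.0, tau * weights[i]) for i in sample}
```

This reproduces a(i) = max{w(i), 1/τ} from the example, with τ = 0.42.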