Slides - VLDB 2009
Coordinated weighted sampling for estimating aggregates over multiple weight assignments
Edith Cohen, AT&T Research
Haim Kaplan, Tel Aviv University
Shubho Sen, AT&T Research
Data model
• Universe U of items:
[Diagram: items i1, i2, i3, i4, …, each carrying weights w1(i), w2(i), w3(i).
Items: IP addresses; w1, w2, w3: traffic on day 1, day 2, day 3.]
Multiple weight assignments defined over the items
Data model (example)
• Universe U of items:
[Diagram: items i1, i2, i3, i4, …, each carrying weights w1(i), w2(i), w3(i).
Items: facebook users; w1: male friends, w2: female friends, w3: online per day.]
Data model (example)
• Universe U of items:
[Diagram: items i1, i2, i3, i4, …, each carrying weights w1(i), w2(i), w3(i).
Items: customers; w1, w2, w3: wait time for item 1, item 2, item 3.]
Data model (example)
• Universe U of items:
[Diagram: items i1, i2, i3, i4, …, each carrying weights w1(i), w2(i), w3(i).
Items: license plates; w1, w2, w3: GW bridge traffic on day 1, day 2, day 3.]
Data model (example)
• Universe U of items:
[Diagram: items i1, i2, i3, i4, …, each carrying weights w1(i), w2(i), w3(i).
Items: web pages; w1: bytes, w2: out-links, w3: in-links.]
Aggregation queries
• Selection predicate d(i) over items in U
• Compute the sum of the weights of all items for which the predicate holds: $\sum_{i : d(i)=\text{true}} w(i)$
Examples:
• Total number of links out of pages of some web site
• Total number of flows into a particular network
This work: Aggregates depending on more than one weight function
• Selection predicate d(i) over items in U
• Want to compute the sum of f(i) for all items for which the predicate holds: $\sum_{i : d(i)=\text{true}} f(i)$
f(i) depends on some subset of the weights given to i
Simple examples: $f(i) = \max_b w_b(i)$, $f(i) = \min_b w_b(i)$, $f(i) = \max_b w_b(i) - \min_b w_b(i)$
Data model (example)
• Universe U of items:
[Diagram: items i1, i2, i3, i4, …, each carrying weights w1(i), w2(i), w3(i).
Items: customers; w1, w2, w3: wait time for item 1, item 2, item 3.]
$f(i) = \max_b w_b(i)$ aggregated over a subset gives the total waiting time of these customers
Data model
• Universe U of items:
[Diagram: items i1, i2, i3, i4, …, each carrying weights w1(i), w2(i), w3(i).
Items: IP addresses; w1, w2, w3: traffic on day 1, day 2, day 3.]
$f(i) = \max_b w_b(i) - \min_b w_b(i)$ aggregated over a network gives the sum of the maximum changes in the # of flows
Challenges
• Massive data volumes, too large to store or to transport elsewhere
• w1 may not be available when we process w2, etc.; the assignments are collected at different places/times
• Many different types of queries (d(i), f(i)), not known in advance
Exact aggregation can be very resource-consuming or simply impossible
Challenge: which summary to keep, and how to estimate the aggregate?
Solution: Use Sampling
Keep a sample
Desirable Properties of Sampling Scheme:
• Scalability: efficient on data streams and distributed
data, decouple the summarization of different sets
• Applicable to a wide range of d(i), f(i)
• Good estimators: unbiased, small variance
Sampling a single weight function
• Independent (Bernoulli) with probability k·w(i)/W: Poisson, PPS sampling
• k times without replacement: order (bottom-k) sampling, which includes PPSWOR and priority sampling (Rosen 72, Rosen 97, CK07, DLT07)
• Methods can be implemented efficiently in a stream/reservoir/distributed setting
• Order (bottom-k) sampling dominates other methods (fixed sample size, with unbiased estimators [CK07, DLT07])
Rank-based definition of
Order/bottom-k sampling
• For each item draw a random rank value from some distribution $f_{w(i)}$
• Order/bottom-k: pick the k smallest-ranked items into your sample
Rank distributions (with seed $u(i) \sim U[0,1]$):
• $r(i) = u(i)/w(i)$: priority order samples
• $r(i) = -\ln(u(i))/w(i)$: PPSWOR order samples
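A minimal sketch of bottom-k sampling with these rank distributions (Python; `bottom_k_sample` and its `scheme` flag are illustrative names, not from the talk):

```python
import math
import random

def bottom_k_sample(weights, k, scheme="priority"):
    """Sketch of order (bottom-k) sampling.

    weights: dict mapping item -> w(i) > 0
    Returns the k items with the smallest rank values, with their ranks.
    """
    ranks = {}
    for item, w in weights.items():
        u = random.random()                       # the seed u(i) ~ U[0,1]
        if scheme == "priority":
            ranks[item] = u / w                   # r(i) = u(i)/w(i)
        else:
            ranks[item] = -math.log(1.0 - u) / w  # PPSWOR; 1-u avoids log(0)
    chosen = sorted(ranks, key=ranks.get)[:k]
    return {i: ranks[i] for i in chosen}
```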
Relations between samples by different weight functions
Independent sampling: get a sample from each weight function independently
Coordinated sampling: if $w_2(i) \ge w_1(i)$ then $r_2(i) \le r_1(i)$.
For example, by using the same seed u(i) for all weight assignments:
$r_1(i) = u(i)/w_1(i)$
$r_2(i) = u(i)/w_2(i)$
Can be achieved efficiently via hashing
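A sketch of how the shared seed can come from a hash of the item key, so that each weight assignment (possibly summarized at a different place or time) derives the same u(i) without any coordination messages; the function names here are illustrative assumptions:

```python
import hashlib

def seed(item):
    """Reproducible seed u(item) in (0, 1], derived by hashing the item key."""
    h = int(hashlib.sha1(str(item).encode("utf-8")).hexdigest(), 16)
    u = (h % (2 ** 53)) / float(2 ** 53)
    return u if u > 0.0 else 0.5 / 2 ** 53   # avoid a zero seed

def priority_rank(item, w):
    """Coordinated priority rank r(i) = u(i)/w(i); because the same u(i) is
    reused for every weight assignment, w2(i) >= w1(i) implies r2(i) <= r1(i)."""
    return seed(item) / w

def coordinated_bottom_k(weights, k):
    """Bottom-k sample of one weight assignment using the shared, hashed seeds."""
    ranks = {i: priority_rank(i, w) for i, w in weights.items()}
    return sorted(ranks, key=ranks.get)[:k]
```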
Coordination is critical
• Coordination is critical for tight estimates
• We develop estimators for coordinated
samples and analyze them
There is a lot of previous work on coordination in the context of survey sampling and when weights are uniform.
Horvitz Thompson Estimators [HT52]
HT estimator: give each item i an adjusted weight a(i).
Let p(i) be the probability that item i is sampled.
If i is in the sample, a(i) = w(i)/p(i); otherwise a(i) = 0.
$E[a(i)] = p(i)\cdot\frac{w(i)}{p(i)} + (1-p(i))\cdot 0 = w(i)$
$a(Q) = \sum_{i \in Q} a(i) = \sum_{i \in Q \cap S} a(i)$
a(Q) is an unbiased estimator of w(Q)
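A minimal sketch of computing a(Q) from a sample, assuming p(i) is available for every sampled item (the function names are hypothetical):

```python
def ht_estimate(sample, inclusion_prob, weight, predicate):
    """Horvitz-Thompson estimate of w(Q) = sum of w(i) over items with d(i) true.

    sample:          iterable of sampled items (the set S)
    inclusion_prob:  i -> p(i), probability that i is sampled
    weight:          i -> w(i)
    predicate:       i -> bool, the selection predicate d(i)
    Items outside the sample implicitly contribute a(i) = 0.
    """
    return sum(weight(i) / inclusion_prob(i) for i in sample if predicate(i))
```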
Using HT for order samples
Problem
• Cannot compute p(i) from an order sample
• Therefore cannot compute HT adjusted weights
a(i)=w(i)/p(i)
Solution: Apply HT on partitions
• For item i, if we can find a partition of the sample space such that we can compute the conditioned p(i) for item i for the cell that the sampled set belongs to,
• then apply HT using that conditioned p(i).
Obtain unbiased adjusted weights for each item.
[Diagram: the sample space partitioned into cells with conditioned probabilities p1(i), p2(i), …, p6(i)]
Estimators for aggregates
over multiple weights
Suppose we want to estimate: $\sum_{i : d(i)=\text{true}} \min_b w_b(i)$
We have a sample for each weight
function using independent or coordinated
ranks
Estimator for $\sum_{i : d(i)=\text{true}} \min_b w_b(i)$
Identify the items i for which d(i) = true and for which you know $\min_b w_b(i)$ (= items contained in all samples)
Compute p(i), the probability that such an item is sampled under all weight functions (conditioned on a subspace that depends on the other ranks)
Set $a(i) = \min_b w_b(i)/p(i)$
Sum up these adjusted weights (a simplified sketch follows below)
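A simplified sketch of this estimator. To keep p(i) explicit, it assumes coordinated Poisson (PPS) samples, where item i is included in sample b iff u(i) < tau_b · w_b(i) with the same seed u(i) for every b, rather than the bottom-k samples and rank conditioning used in the talk; the names `samples`, `taus`, and `predicate` are illustrative:

```python
def estimate_sum_of_min(samples, taus, predicate):
    """Estimate sum over {i : d(i)} of min_b w_b(i) from coordinated PPS samples.

    samples: list of dicts; samples[b] maps each sampled item i -> w_b(i).
             Item i is sampled in assignment b iff u(i) < taus[b] * w_b(i),
             with the SAME seed u(i) for every b (coordination).
    taus:    list of per-assignment sampling thresholds.
    """
    # min_b w_b(i) is known exactly for the items contained in ALL samples.
    common = set(samples[0])
    for s in samples[1:]:
        common &= set(s)

    total = 0.0
    for i in common:
        if not predicate(i):
            continue
        min_w = min(s[i] for s in samples)
        # With a shared seed, i is in every sample iff u(i) < min_b taus[b]*w_b(i),
        # so p(i) = min(1, min_b taus[b]*w_b(i)).
        p = min(1.0, min(t * s[i] for t, s in zip(taus, samples)))
        total += min_w / p   # HT adjusted weight a(i) = min_b w_b(i) / p(i)
    return total
```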
Independent vs Coordinated
$\operatorname{var}[a(i)] = \left(\min_b w_b(i)\right)^2 \cdot \frac{1-p(i)}{p(i)}$
Variance is small when p(i) is large
Coordinated sketches have larger p(i)
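A short check of the variance formula above, writing $m = \min_b w_b(i)$ and using $E[a(i)] = m$:

```latex
\operatorname{var}[a(i)]
  = E[a(i)^2] - \big(E[a(i)]\big)^2
  = p(i)\left(\frac{m}{p(i)}\right)^{2} + \big(1-p(i)\big)\cdot 0^{2} - m^{2}
  = m^{2}\,\frac{1-p(i)}{p(i)}
```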
Estimator for $\sum_{i : d(i)=\text{true}} \max_b w_b(i)$
Identify the items i for which d(i) = true and for which you know $\max_b w_b(i)$
If you see all weights then you know $\max_b w_b(i)$
But you never see them all if $\min_b w_b(i) = 0$
Estimator for $\sum_{i : d(i)=\text{true}} \max_b w_b(i)$
Use the consistency of the ranks:
If the rank of the largest weight you see for i is smaller than the minimum over b of the k-th smallest rank among the items of $I \setminus \{i\}$, then you know it is the maximum.
Estimator for $\sum_{i : d(i)=\text{true}} \max_b w_b(i)$
If the rank of the largest weight you see for i is smaller than the minimum over b of the k-th smallest rank among the items of $I \setminus \{i\}$, then you know it is the maximum.
Compute p(i), the probability of this happening
Set $a(i) = \max_b w_b(i)/p(i)$
Sum up these adjusted weights
More estimators
Now we can get an unbiased estimator for the L1 difference:
$\sum_{i : d(i)=\text{true}} \left( \max_b w_b(i) - \min_b w_b(i) \right)$
We can prove that this estimator is nonnegative
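For the two-assignment case this is exactly the L1 difference, since the absolute difference of two numbers is their max minus their min; this is why the estimator combines the max and min estimators above:

```latex
\sum_{i :\, d(i)=\text{true}} \big|\, w_1(i) - w_2(i) \,\big|
  \;=\; \sum_{i :\, d(i)=\text{true}} \Big( \max_b w_b(i) \;-\; \min_b w_b(i) \Big)
```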
Empirical Evaluation
Data sets:
• IP packet traces:
  Items: (src, dest, src-port, dest-port).
  $w_b$: total # of bytes in hour b
• Netflix data sets:
  Items: movies.
  $w_b$: total # of ratings in month b
• Stock data:
  Items: ticker symbols.
  $w_b$: high value on day b
[Plot: Netflix data, ratio $\operatorname{var}\big[a_{\mathrm{ind}}\big(\sum_i \min_b w_b(i)\big)\big] \,/\, \operatorname{var}\big[a_{\mathrm{coor}}\big(\sum_i \min_b w_b(i)\big)\big]$, comparing independent and coordinated samples]
[Plot: Netflix data, $\operatorname{var}\big[a\big(\sum_i f(i)\big)\big]$ for various f]
Summary
• Coordinated sketches improve accuracy, and sometimes estimation is not feasible without them
• Coordination can be achieved with little overhead using state-of-the-art sampling techniques
Thank you!
Application: Sensor nodes recording daily vehicle
traffic in different locations in a city
• Items: vehicles (license plate numbers)
• Sets: All vehicles observed at a particular
location/date
Example queries:
• Number of distinct vehicles in Manhattan on
election day (size of the union of all Manhattan
locations on election day)
• Number of trucks with PA license plates that
crossed both the Hudson and the East River on June
18, 2009
Application: Items are IP addresses. Sets h1,…,h24 correspond to destination IPs observed in different 1-hour time periods.
Example Queries: number of distinct IP dests
• In the first 12 hours (size of union of h1,…,h12)
• In at most one/at least 4 out of h1,…,h24
• In the intersection of h9,…,h17 (all business hours)
• Same queries restricted to: blacklisted IPs, servers in a specific country, etc.
Uses: Traffic Analysis, Anomaly detection
Application: Text/HyperText corpus: items are
features/terms, sets are documents (or vice versa)
Example Queries:
•Number of distinct hyperlinks to financial news
websites from websites in the .edu domain
•Fraction of pages citing A that also cite B
•Similarity (Jaccard coefficient) of two documents
Uses: research, duplicate elimination, similarity
search or clustering
Application: Market basket data set. Items are
consumers/baskets, subsets are goods (or vice versa)
Example queries:
• Likelihood of purchasing beer if diapers were
purchased
• Number of baskets from zip code 07932 containing
shampoo
• Number of consumers purchasing baby food and not
diapers
Uses: Marketing, placement, pricing, choosing products
Rank-based definition of
sampling
• For each item draw a random rank value from some distribution $f_{w(i)}$
• Poisson: pick all items with rank value below τ(k)
• Order/bottom-k: pick the k smallest-ranked items into your sample
Rank distributions (with seed $u \sim U[0,1]$):
• $r(i) = u/w(i)$: Poisson PPS, priority order samples
• $r(i) = -\ln(u)/w(i)$: PPSWOR order samples
Horvitz Thompson Estimators [HT52]
Give each item i an adjusted weight a(i):
Let p(i) be the probability that item i is sampled.
If i is in the sample, a(i) = w(i)/p(i); otherwise a(i) = 0.
$E[a(i)] = p(i)\cdot\frac{w(i)}{p(i)} + (1-p(i))\cdot 0 = w(i)$
a(i) is an unbiased estimator of w(i)
Rank Conditioning Estimator for single
bottom-k sample set [CK07, DLT07]
Use the principle of applying HT on a partition
[Diagram: a set A with rank values 0.15, 0.62, 0.10, 0.79, 0.51, 0.04, 0.42, 0.38, 0.87, 0.94; the 4-sample of A contains the items with the four smallest ranks, all < 0.42]
For item i in the sample, the partition is based on the k-th smallest rank value among all other items (excluding i) in the sample being some fixed value.
In this specific case, k = 4 and the fixed value is 0.42.
Compute p(i): the probability that the rank value of i is smaller than 0.42.
For priority order samples, r(i) = u/w(i); p(i) = Pr[r(i) < 0.42] = min{1, 0.42·w(i)}; a(i) = w(i)/p(i) = max{w(i), 100/42}.
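A sketch of this worked example in code (the function name is illustrative); the conditioning value tau plays the role of 0.42:

```python
def priority_adjusted_weight(w, tau):
    """Rank-conditioning adjusted weight for a priority order sample,
    conditioned on the k-th smallest rank among the OTHER items being tau.

    r(i) = u / w with u ~ U[0,1], so
    p(i) = Pr[r(i) < tau] = min(1, tau * w)  and  a(i) = w / p(i).
    """
    p = min(1.0, tau * w)
    return w / p

# The slide's example with tau = 0.42: a(i) = max(w(i), 100/42)
print(priority_adjusted_weight(1.0, 0.42))  # 1/0.42 ≈ 2.381
print(priority_adjusted_weight(5.0, 0.42))  # 5.0 (since 0.42 * 5 > 1)
```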