Improving the Accuracy of Continuous Aggregates &
Mining Queries Under Load Shedding

Yan-Nei Law* and Carlo Zaniolo
Computer Science Dept., UCLA
*Bioinformatics Institute, Singapore
Continuous Aggregate Queries on Data Streams: Sampling & Load Shedding

Only random samples are available for computing aggregate queries because of:
- limitations of remote sensors or transmission lines, or
- load-shedding policies implemented when overloads occur.

When overloads occur (e.g., due to a burst of arrivals) we can either
1. drop queries altogether, or
2. sample the input, which is much preferable (a sketch follows below).

Key objective: achieve answer accuracy with sparse samples for complex aggregates on windows.

Can we improve answer accuracy with minimal overhead?
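
A minimal sketch (ours, not from the talk) of option 2: a load shedder that samples the input stream with rate P and scales the aggregate back up:

    import random

    def load_shedder(stream, p):
        # Keep each tuple independently with probability p.
        for tup in stream:
            if random.random() < p:
                yield tup

    # Under overload, halve the load instead of dropping the query;
    # dividing by p makes the SUM estimate unbiased.
    sampled = load_shedder(range(1000), p=0.5)
    print(sum(sampled) / 0.5)  # estimate of sum(range(1000)) = 499500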
General Architecture

[Figure: input streams S1, ..., Si, ..., Sn pass through load shedders into the query network, where query operators compute the aggregates (∑).]

Basic Idea:
- Optimize the sampling rates of the load shedders for accurate answers.

Previous Work [BDM04]:
- Find an error bound for each aggregate query.
- Determine sampling rates that minimize query inaccuracy within the limits imposed by the resource constraints.
- Works only for SUM and COUNT.
- No error model is provided.
A New Approach

- Exploit the correlation between answers at different points in time.
  - Example: sensor data [VAA04, AVA04].
- Objective: adjust the current answer by the past answers in such a way that:
  - low sampling rate → current answer less accurate → more dependent on history;
  - high sampling rate → current answer more accurate → less dependent on history.
- We propose a Bayesian quality-enhancement module that achieves this objective automatically and reduces the uncertainty of the approximate answers.
- A larger class of queries will be considered: SUM, COUNT, AVG, quantiles.
Our Model

- The observed answer Ã is computed from random samples of the complete stream, taken with sampling rate P.
- We propose a Bayesian method that obtains an improved answer by combining:
  - the observed answer,
  - the error model, and
  - the history of the answer.

[Figure: streams S1, ..., Si, ..., Sn flow through load shedders (sampling rate P) into the query network of aggregate operators (∑); each observed answer Ã, together with its history, enters the Quality Enhancement Module, which outputs the improved answer.]
Error Model of the Aggregate Answers

- Ã: the approximate answer obtained from a random sample taken with sampling rate P.
- Key result:  p(A | Ã) ~ N(Ã, σ²),
  where the error variance σ² depends on the aggregate. Below, N is the window size, and μ̂ and σ̂² denote the mean and variance of the window elements:

  SUM:    σ² = ((1 − P) / P) · N · (σ̂² + μ̂²)
  COUNT:  σ² = ((1 − P) / P) · N
  AVG:    σ² = ((1 − P) / P) · (σ̂² + μ̂²) / N
  p-th quantile [B86]:
          σ² = p(1 − p) / (P · N · (f(F⁻¹(p)))²),
          where F is the cumulative distribution and f = F′ is the density function.
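
These variances are cheap to evaluate from running statistics of the window; the following sketch (our illustration, with assumed names) encodes them directly:

    # Error variances of the sampled aggregates (names assumed):
    # rate is the sampling rate P, n the window size, and mu/var the
    # mean and variance of the window elements.
    def var_sum(rate, n, mu, var):
        return (1 - rate) / rate * n * (var + mu ** 2)

    def var_count(rate, n):
        return (1 - rate) / rate * n

    def var_avg(rate, n, mu, var):
        return (1 - rate) / rate * (var + mu ** 2) / n

    def var_quantile(rate, n, q, density_at_quantile):
        # q-th quantile; density_at_quantile is f(F^-1(q)) [B86].
        return q * (1 - q) / (rate * n * density_at_quantile ** 2)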
Use of the Error Model

- Derive accuracy estimates for a larger class of queries, in order to optimize the load-shedding policy.
  - Idea: minimize the variance of each query (a sketch follows below).
- Enhance the quality of the query answer on the basis of statistical information derived from the past.
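
As an illustration of the variance-minimization idea (our sketch, not the paper's procedure): if each query's variance has the form c_i · (1 − P_i)/P_i, as for SUM above, then under a total sampling budget Σ P_i ≤ B the summed variance is minimized by P_i ∝ √c_i.

    # Hypothetical allocation of sampling rates across queries:
    # minimizing sum(c_i / P_i) subject to sum(P_i) = budget gives
    # P_i proportional to sqrt(c_i) (clipping at 1.0 is a simplification).
    import math

    def allocate_rates(costs, budget):
        weights = [math.sqrt(c) for c in costs]
        scale = budget / sum(weights)
        return [min(1.0, w * scale) for w in weights]

    print(allocate_rates([100.0, 25.0, 1.0], budget=1.0))
    # -> [0.625, 0.3125, 0.0625]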
Learning the Prior Distribution from the Past

- Statistical information on the answers can be:
  - spatial, e.g., readings from the neighbors;
  - temporal, e.g., the past answers {xi}.
- Model the distribution of the answer by a normal distribution:
  - By MLE, pdf ~ N(μs, σs²), where μs = (Σ xi)/n and σs² = (Σ (xi − μs)²)/n; only μs and σs need to be stored (see the sketch below).
  - This requires only a minimal amount of computation time.
  - It assumes that there is no concept change.
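
A minimal sketch (names assumed by us) of maintaining this prior incrementally; only two running sums are stored, matching the slide's claim that μs and σs suffice:

    # Incremental MLE of a normal prior N(mu_s, sigma_s^2) over the
    # past answers {x_i}; stores only running sums, O(1) space.
    class NormalPrior:
        def __init__(self):
            self.n = 0
            self.sum_x = 0.0
            self.sum_x2 = 0.0

        def update(self, x):
            # Fold one past answer into the running sums.
            self.n += 1
            self.sum_x += x
            self.sum_x2 += x * x

        def params(self):
            # MLE estimates: mu_s = sum(x)/n, sigma_s^2 = sum(x^2)/n - mu_s^2.
            mu = self.sum_x / self.n
            var = self.sum_x2 / self.n - mu * mu
            return mu, var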
Observations

- Combining the prior N(μs, σs²) learned from the past with the observed answer Ã, whose error model is N(Ã, σ²), gives the improved answer N(μt, σt²):

      μt  = (μs·σ² + Ã·σs²) / (σs² + σ²)     (mean)
      σt² = (σs²·σ²) / (σs² + σ²)            (variance)

- Reduced uncertainty: a small σs gives σt << σ.
- Compromise between the prior and the observed answer:
  - large σ → less accurate Ã → more dependent on μs;
  - small σ → more accurate Ã → less dependent on μs;
  - an uncertain prior (i.e., a large σs) will not have much effect on the improved answer.

A sketch of this update follows below.
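
The update is the standard normal-normal conjugate rule; a minimal sketch, with names assumed by us:

    # Combine the prior N(mu_s, var_s) learned from history with the
    # observed answer a_tilde, whose error model is N(a_tilde, var);
    # var comes from the error model of the previous slides.
    def improve_answer(a_tilde, var, mu_s, var_s):
        var_t = (var_s * var) / (var_s + var)
        mu_t = (mu_s * var + a_tilde * var_s) / (var_s + var)
        return mu_t, var_t

    # Low sampling rate -> large var -> answer pulled toward the prior;
    # high sampling rate -> small var -> answer stays near a_tilde.
    print(improve_answer(10.0, var=4.0, mu_s=8.0, var_s=1.0))  # (8.4, 0.8)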
Generalizing to Mining Functions: K-Means Is a Generalized AVG Query

[Figure: relative error for the first mean and for the second mean.]
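
Why k-means is a generalized AVG query: each centroid recomputation is just an AVG over the tuples currently assigned to that cluster, so the AVG error model applies to each mean independently. A minimal sketch of the update step (assignment step omitted, names ours):

    # clusters maps cluster id -> values sampled into that cluster;
    # each new mean is an AVG aggregate over its cluster.
    def update_means(clusters):
        return {cid: sum(vals) / len(vals) for cid, vals in clusters.items()}

    print(update_means({0: [1.0, 2.0, 3.0], 1: [10.0, 12.0]}))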
Quantiles (dataset with concept drifts):

  p-th quantile   Approximate   Posterior
  20%             3.30          2.67
  40%             1.59          1.56
  60%             0.30          0.20
  80%             0.21          0.13

  Average relative error for each quantile.
Changing Distribution

- The corrections are effective for distributions besides normal ones.
- Changing distributions (a.k.a. concept changes) can be easily detected; we used a two-sample test (see the sketch below).
- The old prior is then dropped and a new prior is constructed.
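
A sketch of the change detector; the slides do not name the specific two-sample test, so we use the Kolmogorov-Smirnov test from scipy as an example:

    from scipy.stats import ks_2samp

    def concept_changed(history, recent, alpha=0.01):
        # Signal a concept change when the recent answers are unlikely
        # to come from the same distribution as the historical ones.
        stat, p_value = ks_2samp(history, recent)
        return p_value < alpha

    # On a change: drop the old prior and start a new one from `recent`.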
Minimal Overheads

Computation costs are introduced by:
- calculating the posterior distribution, and
- detecting changes.

  Query      Approximate   Posterior
  SUM        200           200
  Quantile   1230          1240
  K-means    960           990

  Time (in ms) for each query.
Summary

- We proposed a Bayesian quality-enhancement method for approximate aggregates in the presence of sampling.
- Our method:
  - works for ordered statistics and data mining functions as well as for traditional aggregates, and
  - handles concept changes in the data streams.