EURO07cs240B - UCLA Computer Science
Improving the Accuracy of Continuous Aggregates &
Mining Queries Under Load Shedding
Yan-Nei Law* and Carlo Zaniolo
Computer Science Dept. UCLA
*Bioinformatics Institute Singapore
Continuous Aggregate Queries on Data Streams:
Sampling & Load Shedding
Only random samples are available for computing aggregate
queries, because of:
limitations of remote sensors or transmission lines, or
load-shedding policies implemented when overloads occur.
When overloads occur (e.g., due to a burst of arrivals) we can:
1. drop queries altogether, or
2. sample the input (much preferable).
Key objective: achieve answer accuracy with sparse samples for
complex aggregates on windows.
Can we improve answer accuracy with minimal overhead?
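The "sample the input" option can be sketched as a Bernoulli load shedder that keeps each arriving tuple independently with probability P. This is a minimal illustration, not the paper's code; the name load_shedder and the seed parameter are assumptions for reproducibility.

```python
import random

def load_shedder(stream, P, seed=0):
    """Bernoulli load shedder sketch: keep each tuple independently
    with probability P. An unbiased SUM over the full stream can then
    be recovered by scaling the sampled sum by 1/P."""
    rng = random.Random(seed)
    return [x for x in stream if rng.random() < P]
```

With P = 1 the shedder passes everything through; with P = 0.5 it keeps roughly half the tuples, so scaling the sampled sum by 1/P approximates the true sum.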
General Architecture
[Figure: a data stream passes through load shedders S1, ..., Si, ..., Sn; the sampled streams feed a query network whose operators end in aggregates (∑).]
Basic Idea:
Optimize the sampling rates of the load shedders for accurate answers.
Previous Work [BDM04]:
Find an error bound for each aggregate query.
Determine sampling rates that minimize query inaccuracy within the
limits imposed by resource constraints.
Only works for SUM and COUNT.
No error model provided.
A New Approach
Exploit the correlation between answers at different points in time.
Example: sensor data [VAA04, AVA04].
Objective: the current answer can be adjusted using the past
answers, so that:
Low sampling rate → current answer less accurate → more
dependent on history.
High sampling rate → current answer more accurate → less
dependent on history.
We propose a Bayesian quality enhancement module which
achieves this objective automatically and reduces the
uncertainty of the approximate answers.
A larger class of queries will be considered:
SUM, COUNT, AVG, quantiles.
Our Model
The observed answer Ã is computed from random samples of the
complete stream with sampling rate P.
We propose a Bayesian method to obtain the improved answer by
combining:
the observed answer,
the error model, and
the history of the answers.
[Figure: the architecture of the previous slide (data stream, load shedders S1, ..., Sn, query network of aggregate operators ∑), extended with a Quality Enhancement Module that combines each observed answer Ã with the stored History to produce the improved answer.]
Error Model of the Aggregate Answers
Ã – approximate answer obtained by a random sample with
sampling rate P.
Key result: p(A | Ã) ~ N(Ã, δ²), where for a window of N tuples with
mean μ_A and variance σ_A²:
SUM:    δ² = ((1 − P)/P) · N · (σ_A² + μ_A²)
COUNT:  δ² = ((1 − P)/P) · N
AVG:    δ² = ((1 − P)/(P · N)) · σ_A²
p-th Quantile [B86]:  δ² = p(1 − p) / (P · N · (f(F⁻¹(p)))²),
where F is the cumulative distribution and f = F′ is the density function.
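These variance formulas are easy to evaluate per query. The sketch below is an illustrative transcription (the function names are not from the paper), assuming the window size N and the per-window statistics are available:

```python
def sum_var(P, N, mean_a, var_a):
    # SUM: delta^2 = ((1-P)/P) * N * (sigma_A^2 + mu_A^2)
    return (1 - P) / P * N * (var_a + mean_a ** 2)

def count_var(P, N):
    # COUNT: delta^2 = ((1-P)/P) * N
    return (1 - P) / P * N

def avg_var(P, N, var_a):
    # AVG: delta^2 = ((1-P)/(P*N)) * sigma_A^2
    return (1 - P) / (P * N) * var_a

def quantile_var(P, N, p, density_at_quantile):
    # p-th quantile [B86]: delta^2 = p(1-p) / (P*N*(f(F^{-1}(p)))^2)
    return p * (1 - p) / (P * N * density_at_quantile ** 2)
```

Note how every formula shrinks to zero as P approaches 1 (no shedding, exact answer for SUM/COUNT) and grows as P decreases, which is exactly the behavior the Bayesian module exploits.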
Use of Error Model
Derive accuracy estimates for a larger class of
queries, for optimizing the load-shedding
policy.
Idea: minimize the variance of each query.
Enhance the quality of the query answer on
the basis of statistical information derived
from the past.
Learning Prior Distribution from the Past
Statistical information on the answers:
Spatial – e.g., readings from the neighbors.
Temporal – e.g., the past answers {xᵢ}.
Model the distribution of the answer by a normal distribution:
By MLE, pdf ~ N(μ_s, σ_s²) where μ_s = Σᵢ xᵢ/n and σ_s² = Σᵢ (xᵢ − μ_s)²/n;
only need to store μ_s and σ_s.
Requires only a minimal amount of computation time.
Assumes that there is no concept change.
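The claim that only μ_s and σ_s need to be stored holds because both MLE estimates can be maintained from running sums. PriorModel below is a hypothetical helper illustrating this, not code from the paper:

```python
class PriorModel:
    """Maintain the MLE normal prior N(mu_s, sigma_s^2) over the past
    answers, using only the count, the sum, and the sum of squares,
    so each update costs O(1)."""

    def __init__(self):
        self.n = 0
        self.s = 0.0    # running sum of answers
        self.ss = 0.0   # running sum of squared answers

    def update(self, x):
        self.n += 1
        self.s += x
        self.ss += x * x

    @property
    def mean(self):          # mu_s = sum(x_i) / n
        return self.s / self.n

    @property
    def var(self):           # sigma_s^2 = sum((x_i - mu_s)^2) / n
        m = self.mean
        return self.ss / self.n - m * m
```

This matches the slide's "minimal amount of computation time": each past answer is folded in with three additions, and no answer history needs to be retained.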
Observations
Combining the prior N(μ_s, σ_s²) with the observed answer Ã, whose
error model is N(A, δ²), gives a normal posterior with:
mean:     (σ_s² · Ã + δ² · μ_s) / (σ_s² + δ²)
variance: τ² = (σ_s² · δ²) / (σ_s² + δ²)
Reduced uncertainty: τ ≤ min(σ_s, δ); in particular, a sharp prior
(small σ_s) gives τ << δ.
Compromise between prior and observed answer:
large δ → less accurate Ã → more dependent on μ_s
small δ → more accurate Ã → less dependent on μ_s
an uncertain prior (i.e., large σ_s) will not have much effect on the
improved answer.
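The normal-normal update above is a one-liner in practice. The following sketch (with an assumed function name) returns both the improved answer and its reduced variance:

```python
def improved_answer(a_obs, delta2, prior_mean, prior_var):
    """Normal-normal Bayesian update.
    a_obs:      observed (approximate) answer, A-tilde
    delta2:     its variance delta^2 from the error model
    prior_mean: mu_s learned from the past answers
    prior_var:  sigma_s^2 of the prior
    Returns (posterior mean, posterior variance tau^2)."""
    w = prior_var + delta2
    post_mean = (prior_var * a_obs + delta2 * prior_mean) / w
    post_var = prior_var * delta2 / w
    return post_mean, post_var
```

When delta2 is large (sparse samples) the result leans toward the prior mean; when delta2 is small it stays close to the observed answer, which is precisely the automatic compromise the slide describes.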
Generalizing to Mining Functions:
K-means is a Generalized AVG Query
[Plots: relative error for the first and for the second cluster mean.]
Quantiles (dataset with concept drifts):
p-th Quantile   Approximate   Posterior
20%             3.30          2.67
40%             1.59          1.56
60%             0.30          0.20
80%             0.21          0.13
Average Rel. Err. for every quantile
Changing Distribution
Corrections are effective for distributions besides
normal ones.
Changing distributions (a.k.a. concept
changes) can be easily detected; we used a
two-sample test.
The old prior is then dropped and a new prior is
constructed.
Minimal Overheads
Computation costs introduced by:
Calculating posterior distribution
Detecting changes
Query      Approximate   Posterior
SUM        200           200
Quantile   1230          1240
K-means    960           990
Time (in ms) for each query
Summary
Proposed a Bayesian quality enhancement
method for approximate aggregates in the
presence of sampling.
Our method:
works for ordered statistics and data mining
functions as well as for traditional aggregates,
and also
handles concept changes in the data streams.