Transcript 幻灯片 1
Probabilistic Threshold Range
Aggregate Query Processing over
Uncertain Data
Wenjie Zhang
University of New South Wales & NICTA, Australia
Joint work:
Shuxiang Yang, Ying Zhang, Xuemin Lin (UNSW & NICTA)
Outline
2
Background and Preliminaries
Probabilistic Threshold Range Aggregate Query
Exact query processing
Approximate query processing: Simple Sampling & Double
Sampling
Experiments
Conclusion
DB@UNSW
Applications
3
Many applications involve data that is imperfect due to
data randomness and incompleteness
limitation of equipment
delay or lose in data transfer
……
Applications
Sensor networks
Environmental surveillance
Moving objects
Data cleaning and integration
……
DB@UNSW
Applications
4
Sensor Networks:
Sensor
readings are often imprecise due to equipment
limitation and periodical reporting mechanism.
(figures are borrowed from Jian et al, SIGMOD08)
DB@UNSW
Applications
5
Mobile Equipments / Moving Objects
A mobile object reports its location periodically, the exact location is
often uncertain.
DB@UNSW
Applications
6
Satellite data
DB@UNSW
Applications
7
Data Quality
Social Data Collection: Errors and estimation inherent in
customer surveys and sampling
DBG @ UNSW
Outline
8
Background and Preliminaries
Modeling
Uncertainty & Related Work
Probabilistic Threshold Range Query
Conclusion
DB@UNSW
Modeling Uncertainty ( cont. )
9
1.
Uncertain Objects Model
Continuous case: described using a probability density
function (PDF) fU such that
f U (u )du 1. E.g., uniform
u U
distribution, normal distribution.
DB@UNSW
Modeling Uncertainty ( cont. )
10
2.
Uncertain Objects Model
Discrete case : described using a set of instances each
instance u has an occurrence probability pu
u U
pu 1
DB@UNSW
Possible World Semantics
11
Given a set of uncertain objects {U1,U2, ..., Un}, a possible world
W = {u1,u2, .., un} is a set of n instances --- one instance per
uncertain object
The probability of a possible worlds is
n
P(W) =
i 1
P(ui )
Let Ω be the set of all possible world, clearly,
W
P (W ) 1
DB@UNSW
Probabilistic Queries:
12
Query Evaluation [CKP03, CXPSV04, DS04, DS05, DS07, SD07]
Aggregate Queries [BDJR05, MJ07, CG07]
Join Queries [CSP06, AW07]
Top-k queries [SIC07, YLSK08, RDS07, HJZL08]
Nearest Neighbor Queries [KKR07, CCMC08]
Skyline Queries [PJLY07]
……
DB@UNSW
Range query
13
Uncertain objects, exact query
Probability threshold is often assigned
DBG @ UNSW
Related Work
14
Range Queries [TCXNKP05, BPS06, AY08]
Given a rectangle r and a probabilistic threshold t , find all objects that
appear in r with probability at least t.
Appearance probability
xo. region r
r
o. pdf ( x )dx
o.region
DB@UNSW
U-tree
15
Probabilistically Constrained Region ( PCR )
PCR (0.2)
Multi PCRs
DB@UNSW
[TCXNKP05]
Outline
16
Introduction
Modeling Uncertainty & Related Work
Probabilistic Threshold Range Aggregate Query
(PTRA)
Conclusion
DB@UNSW
Contribution
17
Formally define PTRA query
aU-Tree structure for exact PTRA query
singleSample and doubleSample techniques for
approximate answer.
DB@UNSW
Problem Statement
18
Given a set of uncertain objects and query q , return
the number of uncertain objects with appearance
probability no less than threshold pq
DB@UNSW
Problem Definition
19
Assume threshold =
0.5, if the appearance
probability computed
for b is > 0.5 and for c
is < 0.5, then the
aggregate returned is
2 (a & b)
DB@UNSW
Exact Query Processing ( aU-Tree)
20
Main idea: add aggregate information on U-tree
Advantage:
stop at intermediate level if pruned or fully
covered by the query
Disadvantage: otherwise, still need to drill down to the
leaf nodes.
For
a large portion of uncertain objects, appearance
probability needs to be computed
Expensive for a massive number of instances per object!
DB@UNSW
Exact Query Processing ( aU-Tree)
21
DB@UNSW
singleSample
22
Sampling the instances of the uncertain objects.
If m’ out of m sampled instances are inside query region, then
the approximate appearance probability is m’/m
DB@UNSW
singleSample ( cont. )
23
An immediate application of Chernoff-Hoeffding bound
DB@UNSW
doubleSample
24
Single Sampling is expensive when there is a
massive number of objects!
Sampling the uncertain objects as well.
Naive : uniform sampling objects from all uncertain objects.
DB@UNSW
doubleSample: Accuracy
25
•Note: “ appearance probability” of each object follows uniform distribution
means spatial location is uniformly distributed.
•Using Chernoff-Hoeffding bound.
DB@UNSW
doubleSample: Our Approach
26
Skew!
Aim: select K disjoint groups covering all objects with the
minimum “skew”; i.e. objects in each group with “uniform”
distribution. (Then do uniform sampling of objects in each
group.)
The optimization problem is NP-hard.
Observation:
Min-skew is a good heuristic to conduct such a group.
aU-tree groups objects with a similar principle to the minskew.
DB@UNSW
doubleSample: Our Approach
27
Step 1: choose K subtrees to cover all objects with the total
minimum skew. NP-hard!
Find a level L such that the number of nodes at level L is smaller
than K but the number of nodes at level L-1 is larger than K.
Feed the min-skew algorithm with the subtrees at level L.
(note: if at a level L, the number of nodes = K, then these K
subtrees are chosen.)
Step 2: sample objects in each subtree.
Step 3. sample instances in each sampled object.
DB@UNSW
Experiments
28
Algorithms:
exact, singleSample, doubleSample
Data set:
LB : 53k objects at long beach country
CA : 62k objects at California
Synthetic aircraft dataset in 3D
10k instances for each points follow Uniform or constrained-Gaussian
Setting : C++, P4 2.8GHz , 2G memory, Debian linux, Page size
DB@UNSW
8K
Efficiency
29
DB@UNSW
Accuracy
30
DB@UNSW
Accuracy ( cont. )
31
DB@UNSW
Conclusion
32
Definition of PTRA
aU-Tree technique
Sampling technique
Future work. Any approach with theoretic
guarantee?
DB@UNSW
33
Thanks
DB@UNSW
Min-Skew technique
34
DB@UNSW