Transcript 幻灯片 1

Probabilistic Threshold Range
Aggregate Query Processing over
Uncertain Data
Wenjie Zhang
University of New South Wales & NICTA, Australia
Joint work:
Shuxiang Yang, Ying Zhang, Xuemin Lin (UNSW & NICTA)
Outline
2


Background and Preliminaries
Probabilistic Threshold Range Aggregate Query
Exact query processing
 Approximate query processing: Simple Sampling & Double
Sampling



Experiments
Conclusion
DB@UNSW
Applications
3

Many applications involve data that is imperfect due to





data randomness and incompleteness
limitation of equipment
delay or lose in data transfer
……
Applications





Sensor networks
Environmental surveillance
Moving objects
Data cleaning and integration
……
DB@UNSW
Applications
4

Sensor Networks:
 Sensor
readings are often imprecise due to equipment
limitation and periodical reporting mechanism.
(figures are borrowed from Jian et al, SIGMOD08)
DB@UNSW
Applications
5

Mobile Equipments / Moving Objects

A mobile object reports its location periodically, the exact location is
often uncertain.
DB@UNSW
Applications
6

Satellite data
DB@UNSW
Applications
7

Data Quality

Social Data Collection: Errors and estimation inherent in
customer surveys and sampling
DBG @ UNSW
Outline
8

Background and Preliminaries
 Modeling


Uncertainty & Related Work
Probabilistic Threshold Range Query
Conclusion
DB@UNSW
Modeling Uncertainty ( cont. )
9

1.
Uncertain Objects Model
Continuous case: described using a probability density
function (PDF) fU such that 
f U (u )du  1. E.g., uniform
u U
distribution, normal distribution.
DB@UNSW
Modeling Uncertainty ( cont. )
10

2.
Uncertain Objects Model
Discrete case : described using a set of instances each
instance u has an occurrence probability pu

u U
pu  1
DB@UNSW
Possible World Semantics
11

Given a set of uncertain objects {U1,U2, ..., Un}, a possible world
W = {u1,u2, .., un} is a set of n instances --- one instance per
uncertain object

The probability of a possible worlds is

n
P(W) =

i 1
P(ui )
Let Ω be the set of all possible world, clearly,

W 
P (W )  1
DB@UNSW
Probabilistic Queries:
12

Query Evaluation [CKP03, CXPSV04, DS04, DS05, DS07, SD07]

Aggregate Queries [BDJR05, MJ07, CG07]





Join Queries [CSP06, AW07]
Top-k queries [SIC07, YLSK08, RDS07, HJZL08]
Nearest Neighbor Queries [KKR07, CCMC08]
Skyline Queries [PJLY07]
……
DB@UNSW
Range query
13


Uncertain objects, exact query
Probability threshold is often assigned
DBG @ UNSW
Related Work
14
Range Queries [TCXNKP05, BPS06, AY08]

Given a rectangle r and a probabilistic threshold t , find all objects that
appear in r with probability at least t.
Appearance probability

xo. region r
r
o. pdf ( x )dx
o.region
DB@UNSW
U-tree
15
Probabilistically Constrained Region ( PCR )
PCR (0.2)
Multi PCRs
DB@UNSW
[TCXNKP05]
Outline
16




Introduction
Modeling Uncertainty & Related Work
Probabilistic Threshold Range Aggregate Query
(PTRA)
Conclusion
DB@UNSW
Contribution
17



Formally define PTRA query
aU-Tree structure for exact PTRA query
singleSample and doubleSample techniques for
approximate answer.
DB@UNSW
Problem Statement
18
Given a set of uncertain objects and query q , return
the number of uncertain objects with appearance
probability no less than threshold pq
DB@UNSW
Problem Definition
19
Assume threshold =
0.5, if the appearance
probability computed
for b is > 0.5 and for c
is < 0.5, then the
aggregate returned is
2 (a & b)
DB@UNSW
Exact Query Processing ( aU-Tree)
20

Main idea: add aggregate information on U-tree
 Advantage:
stop at intermediate level if pruned or fully
covered by the query
 Disadvantage: otherwise, still need to drill down to the
leaf nodes.
 For
a large portion of uncertain objects, appearance
probability needs to be computed
 Expensive for a massive number of instances per object!
DB@UNSW
Exact Query Processing ( aU-Tree)
21
DB@UNSW
singleSample
22


Sampling the instances of the uncertain objects.
If m’ out of m sampled instances are inside query region, then
the approximate appearance probability is m’/m
DB@UNSW
singleSample ( cont. )
23
An immediate application of Chernoff-Hoeffding bound
DB@UNSW
doubleSample
24

Single Sampling is expensive when there is a
massive number of objects!
Sampling the uncertain objects as well.
Naive : uniform sampling objects from all uncertain objects.

DB@UNSW
doubleSample: Accuracy
25
•Note: “ appearance probability” of each object follows uniform distribution
means spatial location is uniformly distributed.
•Using Chernoff-Hoeffding bound.
DB@UNSW
doubleSample: Our Approach
26




Skew!
Aim: select K disjoint groups covering all objects with the
minimum “skew”; i.e. objects in each group with “uniform”
distribution. (Then do uniform sampling of objects in each
group.)
The optimization problem is NP-hard.
Observation:
 Min-skew is a good heuristic to conduct such a group.
 aU-tree groups objects with a similar principle to the minskew.
DB@UNSW
doubleSample: Our Approach
27

Step 1: choose K subtrees to cover all objects with the total
minimum skew. NP-hard!




Find a level L such that the number of nodes at level L is smaller
than K but the number of nodes at level L-1 is larger than K.
Feed the min-skew algorithm with the subtrees at level L.
(note: if at a level L, the number of nodes = K, then these K
subtrees are chosen.)
Step 2: sample objects in each subtree.
Step 3. sample instances in each sampled object.
DB@UNSW
Experiments
28
Algorithms:
exact, singleSample, doubleSample
Data set:
LB : 53k objects at long beach country
CA : 62k objects at California
Synthetic aircraft dataset in 3D
10k instances for each points follow Uniform or constrained-Gaussian
Setting : C++, P4 2.8GHz , 2G memory, Debian linux, Page size
DB@UNSW
8K
Efficiency
29
DB@UNSW
Accuracy
30
DB@UNSW
Accuracy ( cont. )
31
DB@UNSW
Conclusion
32




Definition of PTRA
aU-Tree technique
Sampling technique
Future work. Any approach with theoretic
guarantee?
DB@UNSW
33
Thanks
DB@UNSW
Min-Skew technique
34
DB@UNSW