Top-k Queries on Uncertain Data: On score Distribution and Typical
Download
Report
Transcript Top-k Queries on Uncertain Data: On score Distribution and Typical
Top-k Queries on Uncertain Data: On
score Distribution and Typical Answers
Presented by Qian Wan, HKUST
Based on [1][2]
Introduction: Uncertain Data
Management
Modeling Uncertain Data
Uncertain data management
Possible Worlds Model
Top-k, Join, kNN, Skyline, Indexing, etc.
Uncertain Data Mining
Clustering, Classification, Frequent Pattern,
Outlier Detection
Introduction: Data Representation
A simple way to representing
probabilistic data
Each tuple has a confidence
Pr(instance)=
∏Pr(attendance) x
∏Pr(absence)
Mutual Exclusion Constraints
for each tuple*
Scoring function*
Introduction: Other Works
K tuples that co-exist in a possible
world
U-Topk
Returning tuples according to marginal
distribution of top-k results
U-kRanks and PT-k
Introduction: Other Works
(Example)
Introduction: Other Works
(drawback)
The top-k result may be atypical
The distribution of scores is not used
Introduction: c-Typical-Top k
3-Typical-Top 2 scores of this
example is {118, 183, 235}
Expected distance is 6.6
The vectors are {(t2, t6),
(T7,T6), (T7,T3)}
Algorithm
Distribution of
top-2 tuples’
scores
Algorithm – Naïve approach
INPUT: tuples with membership
probabilities
OUTPUT: Top-k scores distribution
IDEA: recursively go through all
possible worlds to calculate all
probabilities, until reaching a threshold
Algorithm – a DP approach
D(i,j): score
distribution of top-j
starting at Ti.
The main problem is
D(1,k) (?)
Algorithm – a DP approach
Transformation:
D(i,j) =
TF[D(i+1,j),D(i+1,j-1)]
D(i+1,j):
D(i+1,j-1):
For each (v,p) add (v,
p(1-pi))
For each (v,p) add (v+si,
p*pi)
Merge duplicate items
Bottom up DP
Approximation
Handling More Real Scenarios
Handling Mutually Exclusive Rules
Compress the ME group
Refine by lead tuple region
Handling Ties
When two tuples have the same score,
rank them according to probability
Algorithm
0.25
0.2
118, 0.2
0.15
183, 0.15
235, 0.12Series1
0.1
0.05
0
0
50
100
150
200
250
3-Typical-Top 2
scores
c-Typical-Top k
3-Typical-Top 2 scores of this
example is {118, 183, 235}
Expected distance is 6.6
The vectors are {(t2, t6),
(T7,T6), (T7,T3)}
Computing c-Typical-Top k
Define F^a(j) to be the optimal
objective over {sj, …, sn} where a is the
number of typical scores.
G^a(j) means the same
Computing c-Typical-Top k
Just solve the two function optimization
problem, using DP
Boundary conditions
Empirical Study
3 -Typical VS U-Topk
Empirical Study
Empirical Study
Q&A
Reference
[1] Charu C. Aggarwal, Philip S. Yu “A Survey of
Uncertain Data Algorithms and Applications”, IEEE
Transactions on Knowledge and Data Engineering,
2009
[2] Tingjian Ge, Stan Zdonik, Samuel Madden. Top-k
Queries on Uncertain Data: On Score Distribution and
Typical Answers. SIGMOD, 2009