Top-k Queries on Uncertain Data: On score Distribution and Typical

Transcript Top-k Queries on Uncertain Data: On score Distribution and Typical

Top-k Queries on Uncertain Data: On
score Distribution and Typical Answers
Presented by Qian Wan, HKUST
Based on [1][2]
Introduction: Uncertain Data
Management

Modeling Uncertain Data


Uncertain data management


Possible Worlds Model
Top-k, Join, kNN, Skyline, Indexing, etc.
Uncertain Data Mining

Clustering, Classification, Frequent Pattern,
Outlier Detection
Introduction: Data Representation





A simple way to representing
probabilistic data
Each tuple has a confidence
Pr(instance)=
∏Pr(attendance) x
∏Pr(absence)
Mutual Exclusion Constraints
for each tuple*
Scoring function*
Introduction: Other Works

K tuples that co-exist in a possible
world


U-Topk
Returning tuples according to marginal
distribution of top-k results

U-kRanks and PT-k
Introduction: Other Works
(Example)
Introduction: Other Works
(drawback)


The top-k result may be atypical
The distribution of scores is not used
Introduction: c-Typical-Top k



3-Typical-Top 2 scores of this
example is {118, 183, 235}
Expected distance is 6.6
The vectors are {(t2, t6),
(T7,T6), (T7,T3)}
Algorithm
Distribution of
top-2 tuples’
scores
Algorithm – Naïve approach



INPUT: tuples with membership
probabilities
OUTPUT: Top-k scores distribution
IDEA: recursively go through all
possible worlds to calculate all
probabilities, until reaching a threshold
Algorithm – a DP approach


D(i,j): score
distribution of top-j
starting at Ti.
The main problem is
D(1,k) (?)
Algorithm – a DP approach

Transformation:


D(i,j) =
TF[D(i+1,j),D(i+1,j-1)]
D(i+1,j):


D(i+1,j-1):




For each (v,p) add (v,
p(1-pi))
For each (v,p) add (v+si,
p*pi)
Merge duplicate items
Bottom up DP
Approximation
Handling More Real Scenarios

Handling Mutually Exclusive Rules



Compress the ME group
Refine by lead tuple region
Handling Ties

When two tuples have the same score,
rank them according to probability
Algorithm
0.25
0.2
118, 0.2
0.15
183, 0.15
235, 0.12Series1
0.1
0.05
0
0
50
100
150
200
250
3-Typical-Top 2
scores
c-Typical-Top k



3-Typical-Top 2 scores of this
example is {118, 183, 235}
Expected distance is 6.6
The vectors are {(t2, t6),
(T7,T6), (T7,T3)}
Computing c-Typical-Top k


Define F^a(j) to be the optimal
objective over {sj, …, sn} where a is the
number of typical scores.
G^a(j) means the same
Computing c-Typical-Top k


Just solve the two function optimization
problem, using DP
Boundary conditions
Empirical Study

3 -Typical VS U-Topk
Empirical Study
Empirical Study
Q&A
Reference


[1] Charu C. Aggarwal, Philip S. Yu “A Survey of
Uncertain Data Algorithms and Applications”, IEEE
Transactions on Knowledge and Data Engineering,
2009
[2] Tingjian Ge, Stan Zdonik, Samuel Madden. Top-k
Queries on Uncertain Data: On Score Distribution and
Typical Answers. SIGMOD, 2009

Top-k Queries on Uncertain Data: On score Distribution and Typical

Transcript Top-k Queries on Uncertain Data: On score Distribution and Typical

Directory