Efficient Top-k Query Evaluation on Probabilistic Data By
Download
Report
Transcript Efficient Top-k Query Evaluation on Probabilistic Data By
EFFICIENT TOP-K QUERY EVALUATION
ON PROBABILISTIC DATA
PAPER BY
CHRISTOPHER R´E
NILESH DALVI
DAN SUCIU
Presented By
Chandrashekar Vijayarenu
Anirban Maiti
AGENDA
Overview
DNF Evaluation
Query Evaluation
Problem definition
Monte Carlo (MC) Simulation
Multi simulation(MS)
Experiments
PROBABILISTIC DATABASES
Probabilistic database is an uncertain database in which the
possible worlds have associated probabilities. The simplistic
definition is that every tuple belongs to the database with some
probability (between 0 - 1)
Example:
SELECT QUERY ON PROBABILISTIC DATABASE
Find directors with a highly rated Drama and low rated
comedy
SELECT
DISTINCT d.dirName AS Director
FROM
AMZNReviews a, AMZNReviews b,
TitleMatch ax, TitleMatch by,
IMDBMovie x, IMDBMovie y,
IMDBDirector d
WHERE
a.asin=ax.asin and b.asin=by.asin
and ax.mid=x.mid and by.mid=y.mid
and x.did=y.did and y.did=d.did
and x.genre=’comedy’ and y.genre=’drama’
User specifies a SQL Query
and a number k, and the
system returns the highest
ranked k answers.
and abs(x.year - y.year) <= 5
and a.rating<2 and b.rating>4
Major challenge in probabilistic database is query evaluation.
Dalvi and Suciu have shown that most SQL queries are #P Complete.
This paper propose a new approach to query evaluation on probabilistic
databases, by combining top-k style queries.
QUERY PROCESSING CHALLENGES
Compute exact output probabilities is computationally
hard. Meaning, any algorithm computing the
probabilities need to iterate through all possible
subsets of TitleMatch.
Potential answers for which we need to calculate the
probability is large. User is likely to end up inspecting
just the first few of them.
This paper introduces the multisimulation algorithm
which enables effective processing of probabilistic
queries with some error guarantees.
OVERVIEW
POSSIBLE WORLDS
A possible world is thus any subset of the tuples in the database and its
probability can be computed as a product of the probabilities of the tuples in it,
and the respective probabilities of the tuples that are not in that world.
Consider the following probabilistic database containing two relations S and T:
A probabilistic database over schema S is a pair (W,P) where W = {W1, . . .
,Wn} is a set of database instances over S, and P : W ->[0, 1] is a probability
distribution (i.e. P j=1,n P(Wj) = 1). Each instance Wj for which P(Wj) > 0 is
called a possible world.
Let Jp be a database instance over schema Sp. Then Mod(Jp) is the
probabilistic database (W,P) over the schema S obtained as described below
DNF FORMULA
In boolean logic, a disjunctive normal form (DNF)
is a standardization of a logical formula which is
a disjunction of conjunctive clauses.
In our Example
E = (t1 Λ t5) V t2 = true
in the possible worlds W3,W7,W10,W11, and its
probability is thus
P(E) = P(W3) + P(W7) + P(W10) + P(W11).
QUERY EVALUATION DNF EVALUATION
qe = SELECT *
FROM
AMZNReviews a, AMZNReviews b,
TitleMatch ax, TitleMatch by,
IMDBMovie x, IMDBMovie y,
IMDBDirector d
WHERE ...
Each answer returned by qe will have 7 tuple variables
defined in where clause: (a, b, axp, byp, x, y, d)
axp and byp are probabilistic tuples. (From TupleMatch
table)
Thus, every row returned by qe defines a boolean
expression t.E = axp Λ byp.
INTRODUCE GROUP BY
Next we group the rows by their directors, and for each
group
G = {(axp1, byp1), . . . , (axpm, bypm)}
DNF formula: G.E = (axp1 Λ byp1) V . . . V (axpm Λ bxpm)
The director’s probability give by P(G.E).
How to calculate the director’s probability?
Brute Force Approach: Choose every possible world and calculate
the truth value of the boolean expression.
p = P(G.E) is the frequency with which G.E = true
#P Hard problem
Monte Carlo Simulation: Choose the possible world at random and
calculate the truth value.
PROBLEM DEFINITION
G = {G1, . . . ,Gn} of n objects, with unknown
Probabilities p1, . . . , pn, and a number k <= n.
Our goal is to find a set of k objects with the
highest probabilities, denoted Top-K Subset of G.
Solution:
The idea in our algorithm is to run in parallel
several Monte-Carlo simulations, one for each
candidate answer, and approximate each
probability only to the extent needed to compute
correctly the top-k answers
MONTE CARLO (MC) SIMULATION
An MC algorithm repeatedly chooses at random a possible world,
and computes the truth value of the Boolean expression G.E
(Eq.(3)); the probability p = P(G.E) is approximated by the
frequency ˜p with which G.E was true.
Luby and Karp have described the variant shown in Algorithm
fix an order on the disjuncts: t1, t2, . . . , tm
C := 0
repeat
Choose a random disjunct ti Є G
Choose a random truth assignment s.t. ti.E = true
if forall j < i tj.E = false then C := C + 1
until N times
return ˜p = C/N
MULTISIMULATION (MS)
Assumptions:
Intervals:
MULTISIMULATION (MS)
Critical Region:
The critical region, top objects, and bottom objects are:
(c, d) = (topk(a1, . . . , an), topk+1(b1, . . . , bn))……Eq (5)
T = {Gi | d <= ai}
B = {Gi | bi <= c}
Algorithm:
MS TopK(G, k) : /* G = {G1, . . . ,Gn} */
Let [a1, b1] = . . . = [an, bn] = [0, 1], (c, d) = (0, 1)
while c <= d do
Case 1: choose a double crosser to simulate
Case 2: choose upper and lower crosser to simulate
Case 3: choose a maximal crosser to simulate
Update (c, d) using Eq.(5)
end while
return TopK = T = {Gi | d <= ai}
EXAMPLE : LET US SELECT TOP 2.
c
d
G1
G2
G3
G4
G5
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
EXPERIMENTS
EXPERIMENTS RESULTS
THANK YOU