Transcript ppt slides

Probabilistic Ranking of Database
Query Results
Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International University
Gerhard Weikum, MPI Informatik
Presented by
Raghunath Ravi
Sivaramakrishnan Subramani
CSE@UTA
1
Roadmap
Motivation
 Key Problems
 System Architecture
 Construction of Ranking Function
 Implementation
 Experiments
 Conclusion and open problems

2
Motivation
Many-answers problem
 Two alternative solutions:
Query reformulation
Automatic ranking
 Apply probabilistic model in IR to DB
tuple ranking

3
Example – Realtor Database

House Attributes: Price, City, Bedrooms, Bathrooms,
SchoolDistrict,Waterfront, BoatDock,Year

Query: City =`Seattle’ AND Waterfront = TRUE

Too Many Results!

Intuitively, Houses with lower Price, more
Bedrooms, or BoatDock are generally preferable
4
Rank According to Unspecified
Attributes
Score of a Result Tuple t depends on
 Global Score: Global Importance of Unspecified
Attribute Values [CIDR2003]
◦ E.g., Newer Houses are generally preferred

Conditional Score: Correlations between
Specified and Unspecified Attribute Values
◦ E.g., Waterfront  BoatDock
Many Bedrooms  Good School District
5
Roadmap
Motivation
 Key Problems
 System Architecture
 Construction of Ranking Function
 Implementation
 Experiments
 Conclusion and open problems

6
Key Problems

Given a Query Q, How to Combine the
Global and Conditional Scores into a
Ranking Function.
Use Probabilistic Information Retrieval
(PIR).

How to Calculate the Global and
Conditional Scores.
Use Query Workload and Data.
7
Roadmap
Motivation
 Key Problems
 System Architecture
 Construction of Ranking Function
 Implementation
 Experiments
 Conclusion and open problems

8
System Architecture
9
Roadmap
Motivation
 Key Problems
 System Architecture
 Construction of Ranking Function
 Implementation
 Experiments
 Conclusion and open problems

10
PIR Review



p(b | a) p(a)
p ( a | b) 
p(b)
Bayes’ Rule
Product Rule p(a, b | c)  p(a | c) p(b | a, c)
Document (Tuple) t, Query Q
R: Relevant Documents
R = D - R: Irrelevant Documents
p(t | R) p( R)
p( R | t )
p(t | R)
p(t )
Score(t ) 


p( R | t ) p(t | R ) p( R )
p(t | R )
p(t )
11
Adaptation of PIR to DB
Tuple t is considered as a document
 Partition t into t(X) and t(Y)
 t(X) and t(Y) are written as X and Y
 Derive from initial scoring function until
final ranking function is obtained

12
Preliminary Derivation
13
Limited Independence Assumptions

Given a query Q and a tuple t, the X (and Y)
values within themselves are assumed to be
independent, though dependencies between
the X and Y values are allowed
p( X C )   p( x C )
xX
p (Y C )   p ( y C )
yY
14
Continuing Derivation
15
Pre-computing Atomic Probabilities
in Ranking Function
p( y W ) Relative frequency in W
p( y D) Relative frequency in D
p( x y,W ) (#of tuples in W that conatains x, y)/total # of
tuples in W
p( x y, D) (#of tuples in D that conatains x, y)/total # of
tuples in D
Use Workload
p ( y | R)
1
Score(t )  

yY p( y | D) yY xX p( x | y, D)
Use Data
16
Roadmap
Motivation
 Key Problems
 System Architecture
 Construction of Ranking Function
 Implementation
 Experiments
 Conclusion and open problems

17
Architecture of Ranking Systems
18
Scan Algorithm
Preprocessing - Atomic Probabilities Module
 Computes and Indexes the Quantities
P(y | W), P(y | D), P(x | y,W), and P(x | y, D) for All
Distinct Values x and y
Execution
 Select Tuples that Satisfy the Query
 Scan and Compute Score for Each Result-Tuple
 Return Top-K Tuples
19
Beyond Scan Algorithm

Scan algorithm is Inefficient
Many tuples in the answer set

Another extreme
Pre-compute top-K tuples for all possible queries
Still infeasible in practice

Trade-off solution
Pre-compute ranked lists of tuples for all possible atomic
queries
At query time, merge ranked lists to get top-K tuples
20
Output from Index Module

CondList Cx
{AttName, AttVal, TID, CondScore}
B+ tree index on (AttName, AttVal,
CondScore)

GlobList Gx
{AttName, AttVal, TID, GlobScore}
B+ tree index on (AttName, AttVal,
GlobScore)
21
Index Module
22
Preprocessing Component
Preprocessing
 For Each Distinct Value x of Database, Calculate and Store the
Conditional (Cx) and the Global (Gx) Lists as follows
◦ For Each Tuple t Containing x Calculate
and add to Cx and Gx respectively

Sort Cx, Gx by decreasing scores
Execution
Query Q: X1=x1 AND … AND Xs=xs

Execute Threshold Algorithm [Fag01] on the following lists:
Cx1,…,Cxs, and Gxb, where Gxb is the shortest list among Gx1,…,Gxs
23
List Merge Algorithm
24
Roadmap
Motivation
 Key Problems
 System Architecture
 Construction of Ranking Function
 Implementation
 Experiments
 Conclusion and open problems

25
Experimental Setup


Datasets:
◦ MSR HomeAdvisor Seattle
(http://houseandhome.msn.com/)
◦ Internet Movie Database (http://www.imdb.com)
Software and Hardware:
 Microsoft SQL Server2000 RDBMS
 P4 2.8-GHz PC, 1 GB RAM
 C#, Connected to RDBMS through DAO
26
Quality Experiments
Conducted on Seattle Homes and Movies
tables
 Collect a workload from users
 Compare Conditional Ranking Method in
the paper with the Global Method
[CIDR03]

27
Quality Experiment-Average Precision



For each query Qi , generate a set Hi of 30 tuples likely to contain a
good mix of relevant and irrelevant tuples
Let each user mark 10 tuples in Hi as most relevant to Qi
Measure how closely the 10 tuples marked by the user match the
10 tuples returned by each algorithm
28
Quality Experiment- Fraction of Users
Preferring Each Algorithm


5 new queries
Users were given the top-5 results
29
Performance Experiments

Datasets
Table
NumTuples
Seattle Homes
US Homes

Database Size (MB)
17463
1.936
1380762
140.432
Compare 2 Algorithms:
 Scan algorithm
 List Merge algorithm
30
Performance Experiments – Precomputation Time
31
Performance Experiments –
Execution Time
32
Roadmap
Motivation
 Key Problems
 System Architecture
 Construction of Ranking Function
 Implementation
 Experiments
 Conclusion and open problems

33
Conclusions – Future Work
Conclusions
 Completely Automated Approach for the ManyAnswers Problem which Leverages Data and Workload
Statistics and Correlations
 Based on PIR
Drawbacks
 Mutiple-table query
 Non-categorical attributes
Future Work
 Empty-Answer Problem
 Handle Plain Text Attributes
34
Questions?
35