Transcript of PPT slides
Probabilistic Ranking of Database
Query Results
Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International University
Gerhard Weikum, MPI Informatik
Presented by
Raghunath Ravi
Sivaramakrishnan Subramani
CSE@UTA
1
Roadmap
Motivation
Key Problems
System Architecture
Construction of Ranking Function
Implementation
Experiments
Conclusion and open problems
2
Motivation
Many-answers problem
Two alternative solutions:
Query reformulation
Automatic ranking
Apply a probabilistic IR model to DB tuple ranking
3
Example – Realtor Database
House Attributes: Price, City, Bedrooms, Bathrooms, SchoolDistrict, Waterfront, BoatDock, Year
Query: City = 'Seattle' AND Waterfront = TRUE
Too Many Results!
Intuitively, Houses with a lower Price, more Bedrooms, or a BoatDock are generally preferable
4
Rank According to Unspecified
Attributes
Score of a Result Tuple t depends on
Global Score: Global Importance of Unspecified
Attribute Values [CIDR2003]
◦ E.g., Newer Houses are generally preferred
Conditional Score: Correlations between
Specified and Unspecified Attribute Values
◦ E.g., Waterfront ⇒ BoatDock, Many Bedrooms ⇒ Good School District
5
Roadmap
Motivation
Key Problems
System Architecture
Construction of Ranking Function
Implementation
Experiments
Conclusion and open problems
6
Key Problems
Given a Query Q, How to Combine the
Global and Conditional Scores into a
Ranking Function.
Use Probabilistic Information Retrieval
(PIR).
How to Calculate the Global and
Conditional Scores.
Use Query Workload and Data.
7
Roadmap
Motivation
Key Problems
System Architecture
Construction of Ranking Function
Implementation
Experiments
Conclusion and open problems
8
System Architecture
9
Roadmap
Motivation
Key Problems
System Architecture
Construction of Ranking Function
Implementation
Experiments
Conclusion and open problems
10
PIR Review
Bayes' Rule:    p(a | b) = p(b | a) p(a) / p(b)
Product Rule:   p(a, b | c) = p(a | c) p(b | a, c)
Document (Tuple) t, Query Q
R: Relevant Documents
R̄ = D − R: Irrelevant Documents
p(R | t) = p(t | R) p(R) / p(t),   p(R̄ | t) = p(t | R̄) p(R̄) / p(t)
Score(t) = p(R | t) / p(R̄ | t) = [p(t | R) p(R)] / [p(t | R̄) p(R̄)] ∝ p(t | R) / p(t | R̄)
11
Adaptation of PIR to DB
Tuple t is considered as a document
Partition t into t(X), the specified attribute values, and t(Y), the unspecified
attribute values (see the sketch after this slide)
t(X) and t(Y) are written as X and Y for short
Derive from the initial scoring function until the final ranking function is obtained
12
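As a concrete illustration of the partition step above, here is a minimal Python sketch (not the authors' code; the dict-based tuple and query layout is an assumption of this example):

    def partition_tuple(t, query):
        # Split a tuple into t(X), the attribute values specified in the query,
        # and t(Y), the remaining (unspecified) attribute values.
        x = {a: v for a, v in t.items() if a in query}
        y = {a: v for a, v in t.items() if a not in query}
        return x, y

    # Example with the realtor query City = 'Seattle' AND Waterfront = TRUE:
    house = {"Price": 450000, "City": "Seattle", "Bedrooms": 3,
             "Waterfront": True, "BoatDock": True, "Year": 1998}
    query = {"City": "Seattle", "Waterfront": True}
    X, Y = partition_tuple(house, query)
    # X == {"City": "Seattle", "Waterfront": True}
    # Y == {"Price": 450000, "Bedrooms": 3, "BoatDock": True, "Year": 1998}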
Preliminary Derivation
13
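The derivation itself appeared as an equation image on this slide. A hedged reconstruction of the steps it plausibly follows, starting from the PIR score on slide 11 and consistent with the final ranking function on slide 16 (the approximation R̄ ≈ D reflects that the relevant set R is small compared to the database):

    Score(t) ∝ p(t | R) / p(t | R̄)
             ≈ p(t | R) / p(t | D)                               (R̄ = D − R ≈ D)
             = p(X, Y | R) / p(X, Y | D)                         (t partitioned into X and Y)
             = [p(Y | R) p(X | Y, R)] / [p(Y | D) p(X | Y, D)]   (product rule)

The limited independence assumptions on the next slide then factor p(Y | ·) and p(X | Y, ·) into products over individual values, and the workload W is used to estimate the quantities conditioned on R, which yields the ranking function on slide 16.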
Limited Independence Assumptions
Given a query Q and a tuple t, the X (and Y)
values within themselves are assumed to be
independent, though dependencies between
the X and Y values are allowed
p(X | C) = ∏_{x ∈ X} p(x | C)
p(Y | C) = ∏_{y ∈ Y} p(y | C)
14
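A small numeric illustration of this factorization (the probability values below are made up purely for the example):

    import math

    # Per-value probabilities p(y | C) for one tuple's unspecified values;
    # under the limited independence assumption, p(Y | C) is their product.
    p_y_given_C = {"Bedrooms=3": 0.20, "BoatDock=True": 0.05, "Year=1998": 0.02}
    p_Y_given_C = math.prod(p_y_given_C.values())   # 0.20 * 0.05 * 0.02 = 0.0002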
Continuing Derivation
15
Pre-computing Atomic Probabilities
in Ranking Function
p(y | W): relative frequency of y in W
p(y | D): relative frequency of y in D
p(x | y, W): (# of tuples in W that contain both x and y) / (# of tuples in W that contain y)
p(x | y, D): (# of tuples in D that contain both x and y) / (# of tuples in D that contain y)
Score(t) ∝ ∏_{y ∈ Y} p(y | W) / p(y | D) · ∏_{y ∈ Y} ∏_{x ∈ X} p(x | y, W) / p(x | y, D)
Use Workload: the relevant-set quantities p(y | R) and p(x | y, R) are estimated from W
Use Data: the quantities conditioned on D are estimated from the database
16
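Putting the pieces together, a sketch of how the final ranking function could be evaluated once the four atomic probabilities have been pre-computed (plain dictionaries stand in for the indexed tables, and the smoothing of zero frequencies is ignored here):

    import math

    def score(x_values, y_values, p_y_W, p_y_D, p_xy_W, p_xy_D):
        # Score(t) ∝ ∏_{y∈Y} p(y|W)/p(y|D) · ∏_{y∈Y} ∏_{x∈X} p(x|y,W)/p(x|y,D)
        global_part = math.prod(p_y_W[y] / p_y_D[y] for y in y_values)
        conditional_part = math.prod(p_xy_W[(x, y)] / p_xy_D[(x, y)]
                                     for y in y_values for x in x_values)
        return global_part * conditional_part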
Roadmap
Motivation
Key Problems
System Architecture
Construction of Ranking Function
Implementation
Experiments
Conclusion and open problems
17
Architecture of Ranking Systems
18
Scan Algorithm
Preprocessing - Atomic Probabilities Module
Computes and Indexes the Quantities
P(y | W), P(y | D), P(x | y,W), and P(x | y, D) for All
Distinct Values x and y
Execution
Select Tuples that Satisfy the Query
Scan and Compute Score for Each Result-Tuple
Return Top-K Tuples
19
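A minimal sketch of the execution phase of the Scan algorithm described above (dict-based tuples as before; score_tuple stands in for the ranking function from slide 16):

    import heapq

    def scan_algorithm(tuples, query, score_tuple, k=10):
        # Select the tuples that satisfy the query ...
        matches = [t for t in tuples
                   if all(t.get(a) == v for a, v in query.items())]
        # ... compute the score of each result tuple and return the top-K.
        return heapq.nlargest(k, matches, key=lambda t: score_tuple(t, query))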
Beyond Scan Algorithm
Scan algorithm is Inefficient
Many tuples in the answer set
Another extreme
Pre-compute top-K tuples for all possible queries
Still infeasible in practice
Trade-off solution
Pre-compute ranked lists of tuples for all possible atomic
queries
At query time, merge ranked lists to get top-K tuples
20
Output from Index Module
CondList Cx
{AttName, AttVal, TID, CondScore}
B+ tree index on (AttName, AttVal,
CondScore)
GlobList Gx
{AttName, AttVal, TID, GlobScore}
B+ tree index on (AttName, AttVal,
GlobScore)
21
Index Module
22
Preprocessing Component
Preprocessing
For Each Distinct Value x of Database, Calculate and Store the
Conditional (Cx) and the Global (Gx) Lists as follows
◦ For Each Tuple t Containing x, Calculate its Conditional Score and Global Score
and add them to Cx and Gx respectively
Sort Cx, Gx by decreasing scores
Execution
Query Q: X1=x1 AND … AND Xs=xs
Execute Threshold Algorithm [Fag01] on the following lists:
Cx1,…,Cxs, and Gxb, where Gxb is the shortest list among Gx1,…,Gxs
23
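A compact sketch of the list-merge step in the spirit of the Threshold Algorithm [Fag01]: the per-query lists (Cx1,…,Cxs and Gxb) are scanned in parallel in decreasing score order, a top-K heap is maintained, and scanning stops once the K-th best aggregate score reaches the threshold implied by the current scan depth. The product aggregate and the dictionary-based random access are assumptions of this illustration, not the authors' implementation.

    import heapq
    from math import prod

    def list_merge(sorted_lists, random_access, k=10):
        # sorted_lists : per list, [(tid, score), ...] sorted by decreasing score
        # random_access: per list, {tid: score} used for random lookups
        # Aggregate score of a tuple = product of its scores across all lists.
        seen, top_k = set(), []                 # top_k is a min-heap of (score, tid)
        max_depth = max(len(lst) for lst in sorted_lists)
        for depth in range(max_depth):
            # Sorted access: pull the next entry from every list.
            for lst in sorted_lists:
                if depth >= len(lst):
                    continue
                tid, _ = lst[depth]
                if tid in seen:
                    continue
                seen.add(tid)
                aggregate = prod(r.get(tid, 0.0) for r in random_access)
                heapq.heappush(top_k, (aggregate, tid))
                if len(top_k) > k:
                    heapq.heappop(top_k)
            # Threshold: best aggregate any unseen tuple could still achieve.
            threshold = prod(lst[min(depth, len(lst) - 1)][1] for lst in sorted_lists)
            if len(top_k) == k and top_k[0][0] >= threshold:
                break
        return sorted(top_k, reverse=True)      # [(score, tid), ...], best first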
List Merge Algorithm
24
Roadmap
Motivation
Key Problems
System Architecture
Construction of Ranking Function
Implementation
Experiments
Conclusion and open problems
25
Experimental Setup
Datasets:
◦ MSR HomeAdvisor Seattle
(http://houseandhome.msn.com/)
◦ Internet Movie Database (http://www.imdb.com)
Software and Hardware:
Microsoft SQL Server 2000 RDBMS
P4 2.8-GHz PC, 1 GB RAM
C#, connected to the RDBMS through DAO
26
Quality Experiments
Conducted on Seattle Homes and Movies
tables
Collect a workload from users
Compare Conditional Ranking Method in
the paper with the Global Method
[CIDR03]
27
Quality Experiment – Average Precision
For each query Qi , generate a set Hi of 30 tuples likely to contain a
good mix of relevant and irrelevant tuples
Let each user mark 10 tuples in Hi as most relevant to Qi
Measure how closely the 10 tuples marked by the user match the
10 tuples returned by each algorithm
28
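A tiny sketch of how that comparison can be quantified, assuming the measure is simply the fraction of overlap between the user's 10 marked tuples and an algorithm's top 10 (an assumption; the paper defines the exact precision metric):

    def precision(user_marked, algorithm_top10):
        # Fraction of the user's relevant tuples that the algorithm also returned.
        return len(set(user_marked) & set(algorithm_top10)) / len(user_marked)

    # precision({1, 2, 3, 4, 5}, {2, 3, 5, 7, 11})  ->  0.6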
Quality Experiment – Fraction of Users
Preferring Each Algorithm
5 new queries
Users were given the top-5 results
29
Performance Experiments
Datasets:
Table           NumTuples   Database Size (MB)
Seattle Homes   17463       1.936
US Homes        1380762     140.432
Compare 2 Algorithms:
Scan algorithm
List Merge algorithm
30
Performance Experiments – Precomputation Time
31
Performance Experiments –
Execution Time
32
Roadmap
Motivation
Key Problems
System Architecture
Construction of Ranking Function
Implementation
Experiments
Conclusion and open problems
33
Conclusions – Future Work
Conclusions
Completely Automated Approach for the Many-Answers Problem which Leverages Data and
Workload Statistics and Correlations
Based on PIR
Drawbacks
Multiple-table queries
Non-categorical attributes
Future Work
Empty-Answer Problem
Handle Plain Text Attributes
34
Questions?
35