PPT - Crystal

Download Report

Transcript PPT - Crystal

Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International University
Gerhard Weikum, MPI Informatik
Presented by: Kiran Karnam

Introduction & Motivation

Problem Definition

Architecture

Ranking Function

Implementation

Experiments

Conclusions & Limitations



Many-answers problem
Two alternative solutions:
Query reformulation
Automatic ranking
Apply probabilistic model in IR to DB tuple
ranking

Many answers problem
SELECT * FROM REALTOR_DB
WHERE CITY=‘SEATTLE’ ;

Query reformulation

Automatic ranking
 Specified
Attributes
city
 Unspecified Attributes
View
School District
Boat Dock

Global Score:
Global score which captures the global importance
of unspecified attribute values.
Eg: VIEW=‘WATERFRONT’

Conditional Score:
which captures the strengths of dependencies (or
correlations) between specified and unspecified attribute
values.
Eg: If CITY=‘SEATTLE’ and VIEW=‘WATERFRONT’

Important Rules and Theorem required

Bayes’ Rule:
p(a/b) = [ p(b/a) p(a) ] / [p(b)]

Product Rule:
p(a,b/c) = p(a/c) * p(b/a,c)

Bayes theorem shows the relation between two
conditional probabilities which are the reverse of
each other

The probability of an event A given an event B
depends not only on the relationship between
events A and B but on the marginal probability (or
"simple probability") of occurrence of each event

Document (Tuple) t, Query Q
R: Relevant Documents
R = D - R: Irrelevant Documents

Tuple t is considered as a document

Partition t into t(X) and t(Y)

t(X) and t(Y) are written as X and Y

Derive from initial scoring function until final
ranking function is obtained

Given a query Q and a tuple t, the X (and Y)
values within themselves are assumed to be
independent, though dependencies between the
X and Y values are allowed

If Many Queries Specify Set X of Conditions then there is
Preference Correlation between Attributes in X.

Global: E.g., If Many Queries ask for Waterfront then
p(Waterfront=TRUE) is high.

Conditional: E.g., If Many Queries ask for 4-Bedroom Houses
in Good School Districts, then p(Bedrooms=4 |
SchoolDistrict=`good’), p(SchoolDistrict=`good’ |
Bedrooms=4) are high.

Final Ranking Formula is
Where:
p(y|W) = Relative frequency of unspecified attribute ‘y’
given workload ‘W’
p(y|D)= Relative frequency of unspecified attribute ‘y’
given data base ‘D’
p(x|y,W)=Frequency of correlation between x and y in W
P(x|y,D)=Frequency of correlation between x and y in D


Pre processing
◦ Atomic probability module
◦ Index module

Intermediate Knowledge Reference layer

Query processing
◦ Scan algorithm
◦ List merge algorithm
 Computation
of modules:
p(y | W), p(y | D), p(x | y, W), and p(x | y, D) for
all distinct values of x and y.
 Storing these atomic probabilities as database tables
in intermediate knowledge representation layer with
appropriate indexes.
 Computation of index module resulting in conditional
and global lists table.

CONDITIONAL LISTS Cx:
Contains <TID, CondScore> in descending order

GLOBAL LISTS Gx:
Contains <TID,GlobScore> in descending order

Select Tuples that Satisfy the Query

Scan and Compute Score for Each Result-Tuple

Return Top-K Tuples
Scan algorithm is Inefficient
Many tuples in the answer set

Another approach
Pre-compute top-K tuples for all possible queries
Still infeasible in practice

Trade-off solution
Pre-compute ranked lists of tuples for all possible atomic queries
At query time, merge ranked lists to get top-K tuples

Databases Used
◦ MSN Home Advisor database
(http://houseandhome.msn.com/)
◦ Internet Movie Database

Software and Hardware:
•
Microsoft SQL Server2000 RDBMS
•
P4 2.8-GHz PC, 1 GB RAM
•
C#, Connected to RDBMS through DAO

Quality Experiments

Performance Experiments
Query: select * from SeattleHomes where
City=‘Seattle’ and Bedroom=1;
Conditional ranked condos with garages the
highest
 Global failed to recognize importance of the
unspecified attribute Garage=‘Y’


User preference of rankings


5 new queries
Users were given the top-5 results

Compare 2 algorithms
◦ Scan algorithm
◦ List Merge algorithm

Execution time of performance algorithms

Completely Automated Approach for the Many-Answers
Problem which Leverages Data and Workload Statistics
and Correlations
LIMITATION:
Existence of correlations between text and non-text data.
Future Work
 Empty-Answer Problem
 Handle Plain Text Attributes




Surajit Chaudhuri, Gautam Das, Vagelis Hristidis, Gerhard Weikum,
Probabilistic Ranking of Database Query Results, VLDB 2004.
users.cs.fiu.edu/~vagelis/presentations/ProbRanking.ppt
http://crystal.uta.edu/~cse6339/Fall08DBIR.htm
http://crystal.uta.edu/~cse6339/Fall09DBIR.htm