Slides - CSE Buffalo
Download
Report
Transcript Slides - CSE Buffalo
Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International University
Gerhard Weikum, MPI Informatik
30th VLDB Conference Toronto ,Canada,2004
Presented By
Abhishek Jamloki
CSE@UB
Realtor DB:
Table D=(TID, Price , City, Bedrooms, Bathrooms,
LivingArea, SchoolDistrict, View, Pool, Garage,
BoatDock)
SQL query:
Select * From D
Where City=Seattle AND View=Waterfront
Consider a database table D with n tuples {t1, …, tn} over a set of m
categorical attributes A = {A1, …, Am}
a query Q:
SELECT * FROM D
WHERE
X1=x1 AND … AND Xs=xs
where each Xi is an attribute from A and xi is a value in its domain.
specified attributes: X ={X1, …, Xs}
unspecified attributes: Y = A – X
Let S be the answer set of Q
How to rank tuples in S and return top-k tuples to the user?
IR Treatment
Query Reformulation
Automatic Ranking
Correlations are ignored in high dimensional spaces of IR
• Automated Ranking function proposed based on
1) A global score of unspecified attributes
2) A conditional score (strength of correlation between
specified and unspecified attributes)
Automatic estimation using workload and data analysis
• Bayes’ Rule
p(b | a) p(a)
p ( a | b)
p(b)
• Product Rule
p(a, b | c) p(a | c) p(b | a, c)
Document t, Query Q
R: Relevant document set
R = D - R: Irrelevant document set
p(t | R) p( R)
p( R | t )
p(t | R)
p(t )
Score(t )
p( R | t ) p(t | R ) p( R )
p(t | R )
p(t )
Each tuple t is treated as a document
Partition t into two parts
t(X): contains specified attributes
t(Y): contains unspecified attributes
Replace t with X and Y
Replace R with D
Comprehensive dependency models have unacceptable
preprocessing and query processing costs
Choose a middle ground.
Given a query Q and a tuple t, the X (and Y) values
within themselves are assumed to be independent,
though dependencies between the X and Y values are
allowed
p( X | C ) p( x | C )
xX
p(Y | C ) p( y | C )
yY
Workload W: a collection of ranking queries that have been
executed on our system in the past.
Represented as a set of “tuples”, where each tuple represents a
query and is a vector containing the corresponding values of
the specified attributes.
We approximate R as all query “tuples” in W that also request
for X (approximation is novel to this paper)
Properties of the set of relevant tuples R can be obtained by
only examining the subset of the workload that contains
queries that also request for X
Substitute p(y | R) as p(y | X,W)
p(y | W) the relative frequencies of each distinct value y in the
workload
p( y | D) relative frequencies of each distinct value y in the
database (similar to IDF concept in IR)
p(x | y,W) confidences of pair-wise association rules in the
workload, that is:
(#of tuples in W that contains x, y)/total # of tuples in W
p(x | y,D):
(#of tuples in D that contains x, y)/total # of tuples in D
Stored as auxiliary tables in the intermediate knowledge
representation layer
p(y | w) {AttName, AttVal, Prob}
◦ B+Tree index on (AttName, AttVal)
p(y | D) {AttName, AttVal, Prob}
◦ B+Tree index on (AttName, AttVal)
p(x | y,W) {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}
◦ B+Tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)
p(x | y,D) {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}
◦ B+Tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)
Preprocessing - Atomic Probabilities Module
Computes and Indexes the Quantities
P(y | W), P(y | D), P(x | y, W), and P(x | y, D)
for All Distinct Values x and y
Execution
Select Tuples that Satisfy the Query
Scan and Compute Score for Each Result-Tuple
Return Top-K Tuples
Trade off between pre-processing and query processing
Pre-compute ranked lists of the tuples for all possible “atomic”
queries. Then at query time, given an actual query that specifies a
set of values X, we “merge” the ranked lists corresponding to each
x in X to compute the final Top-K tuples.
We should be able to perform merging without scanning the entire
ranked lists
Threshold algorithm can be used for this purpose
A feasible adaptation of TA should keep the number of sorted
streams small
Number of sorted streams will depend on number of attributes in
database
At query time we do a TA-like merging of several ranked lists
(i.e. sorted streams)
The required number of sorted streams depends only on the
number of specified attribute values in the query and not on
the total number of attributes in the database
Such a merge operation is only made possible due to the
specific functional form of our ranking function resulting from
our limited independence assumptions
Index Module: takes as inputs the association rules and the database,
and for every distinct value x, creates two lists Cx and Gx, each
containing the tuple-ids of all data tuples that contain x, ordered in
specific ways.
Conditional List Cx: consists of pairs of the form <TID, CondScore>,
ordered by descending CondScore
TID: tuple-id of a tuple t that contains x
Global List Gx: consists of pairs of the form <TID, GlobScore>,
ordered by descending GlobScore, where TID is the tuple-id of a
tuple t that contains x and
At query time we retrieve and multiply the scores of t in the
lists Cx1,…,Cxs and in one of Gx1,…,Gxs. This requires only
s+1 multiplications and results in a score2 that is proportional
to the actual score.
Two kinds of efficient access operations are needed:
First, given a value x, it should be possible to perform a
GetNextTID operation on lists Cx and Gx in constant time,
tuple-ids in the lists should be efficiently retrievable one-byone in order of decreasing score. This corresponds to the sorted
stream access of TA.
Second, it should be possible to perform random access on the
lists, that is, given a TID, the corresponding score (CondScore
or GlobScore) should be retrievable in constant time.
These lists are stored as database tables –
CondList Cx
{AttName, AttVal, TID, CondScore}
B+Tree index on (AttName, AttVal, CondScore)
GlobList Gx
{AttName, AttVal, TID, GlobScore}
B+Tree index on (AttName, AttVal, GlobScore)
Space consumed by the lists is O(mn) bytes (m is the number
of attributes and n the number of tuples of the database table)
We can store only a subset of the lists at preprocessing time, at
the expense of an increase in the query processing time.
Determining which lists to retain/omit at preprocessing time
done by analyzing the workload.
Store the conditional lists Cx and the corresponding global
lists Gx only for those attribute values x that occur most
frequently in the workload
Probe the intermediate knowledge representation layer at
query time to compute the missing information
The following Datasets were used:
MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
Internet Movie Database (http://www.imdb.com)
Software and Hardware:
Microsoft SQL Server2000 RDBMS
P4 2.8-GHz PC, 1 GB RAM
C#, Connected to RDBMS through DAO
Evaluated using two ranking methods
1) Conditional
2) Global
Several hundred workload queries were collected for both the
datasets and ranking algorithm trained on this workload
For each query Qi , generate a set Hi of 30 tuples likely to
contain a good mix of relevant and irrelevant tuples
Let each user mark 10 tuples in Hi as most relevant to Qi
Measure how closely the 10 tuples marked by the user match
the 10 tuples returned by each algorithm
Users were given the Top-5 results of the two ranking methods
for 5 queries (different from the previous survey), and were
asked to choose which rankings they preferred
Compared performance of the various implementations of the
Conditional algorithm: List Merge, its space-saving variant
and Scan
Datasets used:
Completely automated approach for the Many-Answers
Problem which leverages data and workload statistics and
correlation
Probabilistic IR models were adapted for structured data.
Experiments demonstrate efficiency as well as quality of the
ranking system
Many relational databases contain text columns in addition to
numeric and categorical columns. Whether correlations
between text and non-text data can be leveraged in a
meaningful way for ranking ?
Comprehensive quality benchmarks for database ranking need
to be established