Effective Keyword-Based Selection of Relational Databases
Download
Report
Transcript Effective Keyword-Based Selection of Relational Databases
Effective Keyword-Based
Selection of Relational
Databases
By Bei Yu, Guoliang Li, Karen Sollins &
Anthony K. H. Tung
Presented by Deborah Kallina
Introduction
The database selection problem for relational
data sources
Demand for keyword-based searches of
structured databases
What about searches over multiple structured
databases?
Current Approach
Document collections are summarized with keyword
frequency lists
Inadequate because…
Relational tables are typically normalized
Results must consider number of join operations
Example:
query Q = {multimedia,
database, VLDB}
The Usefulness of a Database in
Answering a Keyword Query
A database is useful if it has high quality
results to the keyword query.
What are “high quality” results?
The results contain all the query keywords.
There is a keyword relationship – the query
keywords can be connected meaningfully in the
database.
This paper proposes…
A method that summaries the relationships
between keywords in a database.
A ranking method based on the keyword
summaries in order to select the most useful
databases for a given keyword query.
Why Summaries?
We have an equation which measures the
usefulness of a database DB to query Q
But calculating this score isn’t feasible for every
database in the system!
So we estimate the usefulness of the database
using a summary of the relationships between all
pairs of keywords in the database.
Results of a Keyword Query
A minimal tree of tuples containing
all the query keywords
* tuples in the tree are connected
according to their relationships
defined in the database schema
The more distant the relationships,
the weaker the relationships
connecting the tuples.
distance - the number of join
operations in a joining sequence of
tuples
Examples:
t1
t2
t3, distance = 2
t4, distance = 0
DB2 of Figure 1,
Q = {multimedia, binder}
t1
t5
t12 or t4
t9
t12,
distance = 2
t15
t10
t1
t5
t12,
distance = 4
The Matrices
Each relational database DB has several types of matrices
one D matrix – represents the presence or absence of each
keyword in each tuple in DB
multiple T matrices – represents the relationship between tuples
in a relational database at each distance
construct the Wd matrices – the frequencies of keyword pairs at
distance d; Example: Figure 3
Calculate the Key Relationship Matrix (KRM)
wd(ki, kj) is the frequency of d-distance joining sequences to
connect the pair of keywords ki and kj
is the maximum number of join operations (user-specified)
Example: Figure 4
Q = {database:VLDB} for DB2
in Figure 3, the only non-zero is when d = 2
1/(2+1) * 1 occurrence = 0.33
Implementation in SQL
The generation of the matrices: D, T1, T2, … T and
W0, W1, … , W , for each DB, can be performed
conveniently inside the local RDBMS using SQL.
Tables are created for each matrix.
Tables can be maintained dynamically when tuples
are added to the database or old tuples are deleted.
Data Sources Selection Using KRM
Estimate the score of the data source DB to
Q through the scores of the individual pairs.
With the KR-summary, we can effectively
rank a set of databases. A higher score of the
data source DB to Q indicates a better
ranking.
Definitions & Metrics for Comparing
Experimental Results
real ranking – the “real” score of each database
based on Equation 2-1
recall – compares the accumulated score of the top
l databases selected based on the summaries of
the source databases against the total available
score when we select top l databases according to
real ranking (summaries vs. real ranking)
precision – measures the fraction of the top l
selected databases which have high quality results
Experimental Results
Observations:
The selection performance
of KR-summaries generally
gets better as gets larger.
The precision and recall
performance for different
values of tends to cluster
into groups.
There are big gaps in both
precisions and recalls
between KR-summaries
when 0 ≤ ≤ 1 and when
is greater.
Conclusion
“The estimation method is effective in
assigning high ranks to useful databases,
although less relevant or irrelevant databases
might be selected.”