Study Abroad Program Search
Download
Report
Transcript Study Abroad Program Search
DBXplorer: A System for KeywordBased Search over Relational
Databases
Sanjay Agrawal
Surajit Chaudhuri
Gautam Das
Presented by
Bhushan Pachpande
Contents
1.
2.
3.
4.
5.
6.
Introduction
Overview of DBXplorer
Symbol Table Design - Publish
Keyword Search
Support for Generalized Matches
Conclusion
Introduction
Internet search engines have popularized keyword-based search.
Searching on traditional database management system is done
through customized applications which are closely tied to the
database schema.
Traditional database management systems do not support keywordbased search.
e.g. search the Microsoft intranet on ‘Jim Gray’ to obtain matched rows,
i.e., rows in the database where ‘Jim Gray’ occur.
In this paper, DBXplorer, an efficient and scalable keyword search
utility for relational databases, is described.
The goal is to enable such searches without necessarily requiring the
users to know the schema of the respective databases.
Example
Searching for
a book
Keywords “Programming” by “Ritchie”
Less probability of presence of both keywords in single
row of table
For result, rows need to be generated by joining tables
on the fly (all possible combinations)
Authors
AuthorsBooks
Books
BookStores
Store
Overview of DBXplorer
Objective - Given a set of query keywords, DBXplorer returns all
rows (either from single tables, or by joining tables connected by
foreign-key joins) such that the each row contains all keywords.
Applying IR techniques from the documents world to databases
is difficult, because of
Database normalization – by which logical units of information
may be fragmented and scattered across several tables
Matching row may be obtained by joining several tables on the fly
IR techniques use Inverted Lists = Symbol Table in databases
Symbol Table - Stores the information about keywords at
different granularities (column/row), i.e. for each keyword stores
the list of all rows
Overview of DBXplorer - Publish
Enabling keyword search in DBXplorer requires 2 steps
Publish
Search
enables database for keywords by building Symbol table and
associated structures
retrieves matching rows from published database
Steps in Publish
Step 1: A database is identified, along with the set of tables and
columns within the database to be published.
Step 2: Auxiliary tables [like Symbol table] are created for
supporting keyword searches.
Overview of DBXplorer - Search
Steps in Search
Looks up symbol table to find tables / columns which
contain keywords
Identify all possible joins (subsets of table if joined) whose
rows might contain required keywords
Generate SQL statements for each join (gives rows which
contain all keywords), rank rows and return
Architecture of DBXplorer
The publish component provides
interfaces to
select a database,
select tables/columns within the
database to publish, and
modify/remove/maintain the
publication.
For a given set of keywords, the
search component provides
interfaces to
retrieve matching databases from a
set of published databases, and
selectively identify tables,
columns/rows that need to be
searched within each database
identified in step (1).
Architecture of DBXplorer
Symbol Table Design
Exact match problem considered only
Symbol Table (S) - stores the information about keywords at
different granularities (column/row), i.e. for each keyword stores
the list of all rows
Column Level granularity (Pub-Col)
For every keyword S maintains list of all database columns (i.e.
table.column)
Cell level granularity (Pub-cell)
For every keyword S maintains list of all database cells (i.e.
table.column.rowid)
Symbol Table - Design factors
Space and Time Requirements
Size: pub-col are smaller than pub-cell since repetition of
keyword in a column does not increases entries in case of pubcol
Time: pub-col takes less time to build
Keyword search performance
Depends on efficient generation and execution of SQL statement
(built from symbol table entries)
pub-cell returns more number of SQL statements than pub-col as
for a keyword in column there are multiple entries
Ease of maintenance
Insert/Update: required for insertion of distinct new value in case
of pub-col while pub-cell needs for every update/insert
Same for delete
Storing Symbol Table
Store symbol tables (pub-col) in
database as (keyword hash ,
column Id)
FK Compression (Foreign Key)
v3
If there is foreign key relationship
between c1 and c2, store only c1
CP Compression
v2
v4
c1
x
c2
Uncompressed hash table
Partition H into a minimum number
of bipartite cliques (a bipartite clique
is any subgraph of H with a maximal
number of edges).
Compress each clique.
Stores symbol table (pub-cell) in
database as (keyword hash, list of
all cellids)
ColumnsMap table
Compressed hash table
Search - Enumerating Join Trees
Step1 - Looks up symbol table to find tables / columns which
contain keywords
Step2 - Enumerate join trees
Identify and enumerate all potential subsets of tables in the
database that, if joined, might contain rows having all keywords.
The resulting relation will contain all potential rows having all
keywords specified in the query.
keywords
If we view the schema graph G
as an undirected graph, this step
enumerates join trees, i.e., subtrees of G such that
(a)
(b)
the leaves belong to MatchedTables
together, the leaves contain all
keywords of the query
Join Trees
Search – Identify matching rows
The input to this final search step is the enumerated join trees.
Each join tree is then mapped to a single SQL statement that joins the
tables as specified in the tree, and selects those rows that contain all
keywords.
The retrieved rows are ranked before being output.
Rows ranked by number of joins involved (ties broken arbitrarily) (same as
keywords occurring close to one another in documents are ranked higher)
Join Trees
Generalized Matches – Token Matches
Token matches - the keyword in the query matches only a token or
sub-string of an attribute value (e.g., retrieve rows of address by
specifying only a street name).
Pub-Prefix method
B+ tree indexes can be used to retrieve rows whose cell matches a
given prefix string
This clause is of the form
WHERE T.C LIKE ‘P%K%’
During publishing of a database, for every keyword K, the entry
(hash(K), T.C, P) is kept in the symbol table if there exists a string in
column T.C which
contains a token K, and
has prefix P
Generalized Matches - Token Matches
Let the hash values of the searchable tokens i.e., ‘string’,
‘ball’ and ‘round’ be 1, 2 and 3 respectively
RowId
C
1
this is a string
2
this string
3
this is a ball
4
no string
5
any ball is round
Pub-Prefix table
Database table T
Consider searching keyword “string”
Pub-Prefix table returns prefixes “th” and “no” and subsequent SQL will
contain (T.C LIKE ‘th%string%’) OR (T.C LIKE ‘no%string%’)
Conclusion
This paper discusses DBXplorer, a system that enables
keyword-based search in relational databases.
DBXplorer uses symbol table alternatives to store the
location of keywords in database.
DBXplorer support exact matches and generalized
matches upto some extent.