Transcript DBXplorer

Sanjay Agrawal
Surajit Chaudhuri
Gautham Das
Microsoft Research
Microsoft Research
Microsoft Research
Presented by
Anusha Mannem

Introduction

Overview Of DBXplorer

Publish & Design of Symbol Table

Finding Matches For Keyword Search

Support for Generalized Matches(Token Matching)

Experimental Results

Conclusions

Internet search engines have popularized keyword based search.
Alternative to Keyword-Based search is Structured Search.
Ex: Searching for books in Bookseller's database

Books Travel  Lonely planet  Asia
(or)
Books Travel  Rough Guides  Europe
Other similar example is http://m.www.yahoo.com/

Example for Keyword-Based Search is searching for “Jim Gary” on
Microsoft intranet to obtain matched rows, i.e., rows in DB where ‘Jim
Gary’ occur.

DBXplorer, an efficient and scalable keyword search utility for relational
databases.

The goal of DBXplorer is to enable keyword searches on multiple
databases without requiring the users to know the schema of respective
databases.

How it works?
When a DBXplorer is given a set of query keywords, it returns all the rows
(either from single table, or by joining tables connected by foreign-key
joins) such that each row contains all the keywords.

Enabling this keyword search requires 2 steps
(a)Publish Step
(b)Search Step

Publish : A databases is enabled for keyword search through the following
steps.

Step 1 : A database is identified, along with the set of tables or columns
within the database to be published.

Step 2 : Auxiliary tables are created for supporting keyword searches. The
most important structure is Symbol Table S that is used at search time to
efficiently determine the locations of query keywords in the database

Search: Given a query consisting of a set of keywords, it is answered as
follows.

Step 1: Looks up symbol table to identify the tables, and columns/rows of
the database which contain the query keywords.

Step 2: Identify and enumerate all possible joins (subsets of table if
joined) whose rows might contain required keywords

Step 3: Generate SQL statements for each join (gives rows which contain
all keywords), and the final rows are ranked and presented to the user.
 Publish and Search components are packaged as two separate COM
objects.

Publish Component provides interface to
(a)
Select a database
(b)
Select tables/columns within the database to publish ,and
(c)
modify/remove/maintain the publication

For given set of keywords, the search component provides interfaces to
(a)
Retrieve matching databases from set of published databases.
(b)
Selectively identify tables, columns/rows that need to be searched within
each database identified in step (a)

Exact match problem is considered i.e., each keyword in the query must
match the value of an attribute in a row of a table.

Symbol Table S, which stores the information of the query keywords at
different levels (row/column ) of granularity.

The two granularity levels are :

Column level granularity (Pub-Col) : for each keyword , we only store the
list of columns where they occur.

Row Level granularity (Pub-Cell) : for each keyword we keep track of all
the rows that contains the keyword.

Space and Time Requirements
Pub-Col is much faster and smaller when compared to Pub-Cell.

Keyword Search Performance
Pub-Col is effective if there are an indexes on the published columns

Ease of Maintenance
Pub-Col is easier to maintain as it requires an update only if the insertions
cause new values in column data.

Hence, Pub-Col symbol table is always better than Pub-Cell table, unless
certain columns do not have indexes.

Symbol table is stored as a relation with two attributes : Keyword and
ColId.

Alternative way is to keep hashed values of keywords in the symbol table .

FK-Comp:
If the set of values in column c1 is a subset of the values in another
column c2 due to key-foreign key relationship, we retain only a single hash
table entry for keywords common to both as (keyval,c1), since the foreign
key constraint can be used to infer presence of keyval in c2.

CP-Comp :
V2
(a)Partition H into a minimum number
c1
V3
of bipartite cliques (a bipartite clique
is any subgraph of H with a
maximal number of edges).
v4
x
c2
Uncompressed hash table
(b) Compress each clique.
(map the symbol table S to a bipartite
graph H having two node sets, HashVal &
ColId, where every row (v, c) in table S
corresponds to an edge in H.)
ColumnsMap table
Compressed hash table

Step 1 :
Symbol table is searched(using generated SQL) to
identify the database tables, columns/cells that contain at least
one query keyword.

Step 2 :
Enumerating Join trees

Step 3 :
Identifying matching rows

Identify and enumerate all potential subsets of tables in the
database that, if joined, might contain rows having all
keywords.

The resulting relation will contain all potential rows having all
keywords specified in the query.
If we view the schema graph G as an
undirected graph, this step
enumerates join trees, i.e., sub-trees
of G such that:
(a) the leaves belong to MatchedTables
and
(b) together, the leaves contain all
keywords of the query
keywords
Join Trees

The input to this final search step is enumerated join trees.

Each join tree is then mapped to a single SQL statement that joins the
tables as specified in the tree, and select those rows that contain all
keywords.

The retrieved rows are ranked before being output.

Our approach is to rank the rows by number of joins involved
E.g., documents in which keywords occur close to one another are ranked
higher than documents in which keywords are far apart.

Token matches - the keyword in the query matches only a token or substring of an attribute value (e.g., retrieve rows of address by specifying
only a street name).

Pub-Prefix method
 B+ tree indexes can be used to retrieve rows whose cell matches a given prefix
string
 This clause is of the form
WHERE T.C LIKE ‘P%K%’
 During publishing of a database, for every keyword K, the entry (hash(K), T.C,
P) is kept in the symbol table if there exists a string in column T.C which
▪ contains a token K, and
▪ has prefix P
Let the hash values of the searchable tokens i.e., ‘string’, ‘ball’
and ‘round’ be 1, 2 and 3 respectively
RowId
C
1
this is a string
2
this string
3
this is a ball
4
no string
5
any ball is round
Pub-Prefix table
Database table T
Consider searching keyword “string”
Pub-Prefix table returns prefixes “th” and “no” and subsequent SQL will
contain (T.C LIKE ‘th%string%’) OR (T.C LIKE ‘no%string%’)

Setup: The Experiments are on a 450 MHz 256 MB Intel P-III machine.

Four databases were used for evaluation:
 TPC-H data of sizes from 100 to 500 MB
 USR is Microsoft employee address DB of 130 MB with 19 tables
 ML is a 375 MB mailing list DB with 38 tables
 KB is a 365 MB DB with 84 tables containing information on articles
and help manuals on various shipped products

In particular it has been showed that :

Pub-Col table is compact when compared to Pub-Cell

Pub-Col scales linearly with data size, and is independent of data
distribution.

Pub-Prefix is compact compared to Pub-Cell and has a significantly better
performance when full text indexes are not present
Publish & Search performance of Pub-Col & Pub-Cell are compared:
Fig: Matching Databases
Fig: Join Trees
Fig: Matching Rows

DBXplorer has been implemented using a commercial relational database
and web server and allows users to interact via a browser at front-end.

Although the authors discussed only the case where there is a single
database, this techniques can be extended to keyword search over multiple
databases.

DBXplorer support exact matches and generalized matches up to some
extent.

As mentioned earlier, the Pub-Col alternative is the best when columns
have indexes on them. If a full-text index is available, use Pub-Col with
the full-text index. Instead, if only a traditional index is available and the
column width is small, use Pub-Prefix, otherwise use Pub-Cell.