DBXplorer: A system for keyword based search over relational
Download
Report
Transcript DBXplorer: A system for keyword based search over relational
DBXplorer: A System for Keyword Based
Search over Relational Databases
Sanjay Agrawal
Surajit Chaudhuri
Gautam Das
Microsoft Research
Microsoft Research
Microsoft Research
Presented by:
DEEP PANCHOLI
1000556121
Introduction
The two most common types of search are Structured Search
and Keyword Based Search
Example of Structured Search http://autos.yahoo.com/
A similar example is to search for books in bookseller’s
database e.g. Books->Travel->Maps->Asia->Russia
We all already know what is keyword based search and one
example can be searching for ‘Jim Gray’ on Microsoft
Intranet to obtain matched rows
Introduction (cont)
Problems faced with Keyword based search implementation
Need to know schema
Normalized databases
Availability of indexes
Built on the concept laid by BANKS paper explained in the
last lecture.
Symbol tables
Compacting the symbol tables
Search requirements
Overview of DBXplorer
DBXplorer returns all rows either from single table or from
multiple tables, using FK-joins, such that each row has all the
keywords
Publish
1. Identify a database and tables and columns within it that are to be
enabled for search
2. Create auxiliary tables (Symbol Tables)
Search
1. Look up the Symbol table
2. Searching in possible subsets of tables
3. Construct and execute SQL statement and rank the results before
displaying to user
Different Symbol Table Designs
We will only consider exact match problem
Two important levels of granularities
Column level granularity (Pub-Col)
Cell level granularity (Pub-Cell)
Table=Authors
Fname
Lname
John
Marshall
John
Shawn
Shawn
Archer
Jacquelyn
Marshall
Factors that affect granularity
Space and time requirements
Pub-Col is faster and occupies less space
Keyword search performance
Pub-Col if there is an index on the column
Ease of Symbol table maintenance
Pub-Col is easier to maintain as it contains updates only if there is
addition of a new distinct values
Hence, the Pub-Col alternative is almost always better than Pub-
Cell unless if certain columns contain no indexes
If an index is available for column, we should use Pub-Col
granularity
Pub-Col representation
Store simply as Keyword-ColId
Alternative is to use Hashvalue-ColId since storing keywords
is wasteful as strings can be long and of varying lengths
Compression Algorithms
FK-Comp: If column c1 is a subset of values in another
column c2, we retain only values in c1
CP-Comp: It is used when pairs of columns share common
keywords but are not tied by FK
Pub-Col Algorithm
Search Component
Common step for all kinds of granularities
It makes use of join trees
Hence, if we join tables that occur in the join tree the
resulting relation will contain all potential rows containing all
keywords specified in the query
Example of graph tree
Finally SQL query is generated and run
The result is then ranked before outputting. The basic
approach is to rank them based on the number of joins
involved which is similar to Banks approach
Search Algorithm
Case of Token matches
Token matches are matches in which keyword match with a
token or a substring of attribute value
Pub-Prefix method efficiently enables token match
capabilities by exploiting available B+ tree indexes
Symbol table has entry (hash(k),T.C, P)
Case of Token matches (cont)
Pub-Prefix method result is comparable to Pub-Cell method
when the column width is small (i.e. less than 100
characters)
For columns where strings are greater than hundreds of
characters, Pub-Cell outperforms Pub-Prefix significantly
Important issue is to determine the appropriate prefix length
stored in symbol table. However, Pub-Prefix method is still
being researched upon
Other research is going on in field of stemming of query
keywords
Experimental Results
The experiments were carried out on a 450MHz 256 MB
Intel P-3 machine. There were 4 databases used for
evaluation:
TPC-H data of sizes from 100 to 500 MB
USR is Microsoft employee address DB of 130 MB with 19
tables
ML is a 375 MB mailing list DB with 38 tables
KB is a 365 MB DB with 84 tables containing information on
articles and help manuals on various shipped products
System Architecture for DBXplorer
Experimental Results (cont)
In particular the authors show the following:
Pub-Col is compact compared to Pub-Cell
Pub-Col scales linearly with data size and is independent of data
distribution
Pub-Prefix is compact compared to Pub-Cell and has a
significantly better performance when full text indexes are not
present
Pub-Col and Pub-Cell symbol table size
comparison
Symbol table publishing time
comparison
Query performance
Other Observations
It was also noticed that search scales with number of query
keywords. The query was varied with 2 to 10 keywords and
still the average query time was between 1 to 1.3 seconds
Also, it was noticed that FK-Comp and CP-Comp reduce the
size of Pub-Col by a factor of 0.45 to 0.90 depending on size
of original table
However, it was noticed that compression added a negligible
overhead on search performance
Effectiveness of Pub-Prefix method
The Pub-prefix method was tested on workload consisting of
100 random keywords from character column of width 64
bytes in the KB database.
It was noticed that the performance of Pub-Prefix increased
with increase in Pub-Prefix length and gave the optimum
performance at prefix-length of 8
This is because as the length increases, beyond a certain limit
the optimizer decides to scan the original table compared to
index search
Conclusion
Although, we discussed only about a single database query,
this technique can be applied to search multiple databases
also
DBXplorer is easy to use with any Database Management
system
As mentioned before, the Pub-Col alternative is the best
when columns have indexes on them. A hybrid table can be
created so that if there is an index for a column, we use PubCol granularity and if there is no index, we use Pub-Cell
granularity