Transcript DBxplorer
DBxplorer: A System for Keyword-Based Search
over Relational Databases
Paper By: Sanjay Agrawal, Surajit Chaudhuri, Gautam Das
Presented By
Bhushan Chaudhari
University of Texas at Arlington
1
Contents
Typical Key-word Search
Overview of DBxplorer
Processing Component for producing Symbol Table
Storing Symbol table in databases
Search Component
Compaction Algorithms
Finding Matches using graphs
Generalized Matches
Experiments and Statistics
2
Key-word Search
General Scenario
Finding specified keywords in same or different tables or
completely different schemas
Physical schema needs to be explored
Existing indexes need to be leveraged by using proper data
structures
Data structure => Structure to be followed by symbol tables
3
Levels of Granularity
Column level granularity
Row level granularity
2
4
2
TOYOTA
COROLLA
2
TOYOTA
CAMRY
2
HONDA
ACCORD
2
HONDA
CRV
4
Overview of DBxplorer
Supports Conjunctive Queries
Implementation using MSSQL Server2000 and IIS
Web Server
ODBC Interface for database Connection
Uses the functionality of relational engines very well
5
Overview of DBxplorer (Contd ..)
Publish
Determine the tables to be published
all_tab, all_tab_user
Table relations in form of graphs
Columns to be published
all_tab_columns, user_tab_columns
“select table_name from user_tab_columns”
“where column_name = ‘desired_name’
Building Symbol Tables
6
Overview of DBxplorer (Contd ..)
Search
Symbol table is looked to identify the tables,
columns and rows containing keywords
Join trees (set of tables which are related)
Query is constructed for each join tree and
tuples containing all keywords are found
7
Design Alternatives for Symbol Table
Location Granularity
Column level and cell level
“Why there is no row level granularity? “
Hard to implement
SQL queries work w.r.t. columns
8
Factors affected due to granularity
Space and time requirements
Searching time
Time required to build
Using in-built operators like ‘distinct’ accumulating all values
inside column becomes easy
Most of the typical database systems use Hash Value
indexes which are good for equality searches
9
Factors affected due to granularity
(Contd..)
Search Performance
Typically depends upon presence of indices
Ease of maintenance
One time creation
Insert
Update
Delete
Pub-Col
Y/N
Y/N
N
Pub-Cell
Y
Y
Y
“ Which type of symbol table should we use?”
10
Storing Symbol Tables
Pub-Col Representation
Key-word -> Column Id
Hash Value -> Column Id
Types of Compression Algorithms Used
Foreign Key Compression (FK-Comp)
General Compression Technique (CP-Comp)
11
Building Compression table
12
Compression Algorithm
13
Storing Symbol Tables (Contd..)
Pub-Cell Representation
Hash Value -> Cell ID
Hash Value -> Cell ID List
“Retrieval of all locations for a key-word is achieved by looking up a
single row from pub-cell symbol table”
No Compression
Pre-computation is complex
Inverted lists can be implemented using this table
14
Finding Matches for Keyword Search
• Each join tree is mapped to a SQL Query and selects those
rows that contain all keywords.
• Ranking is based upon no. of joins (Quite similar to ranking
upon proximity of words in documents)
15
Search Algorithm
16
Supporting Generalized Matches
Where T.C like ‘%STRING%’
Traditional databases use B+ Trees indices
Pub-Prefix Representation
Microsoft SQL Server (Most Recent version)
Where CONTAINS(C,’String’);
Enables token searches having form WHERE T.C LIKE
‘P%K%’
Symbol table entry (Hash(k), T.C, P)
Efficiency depends on length of prefix
Length of prefix also affects symbol table size and build
time.
Stemming
17
Experiments – Symbol Table Granularity
Symbol Table Size
18
Experiments – Symbol Table Granularity
Publishing Time
19
Experiments – Symbol Table Granularity
Search Performance
20
Experiments – Scalability of Pub-Col
Data Size and Distribution
21
Experiments – Scalability of Pub-Col
Number of Keywords in Search
22
Experiments – Scalability of Pub-Col
Effectiveness of Compression Techniques
23
Experiments – Scalability of Pub-Col
Effectiveness of Pub-Prefix Method
24
References
DBxplorer: A System for Keyword Based Search over Relational
Databases ICDE 2002
By: Sanjay Agrawal, Surajit Chaudhuri, Gautam Das
DBXplorer: Enabling Keyword Search over Relational Databases.
(Demo), SIGMOD Conference 2002: 627
By: Sanjay Agrawal, Surajit Chaudhuri, Gautam Das
25