Transcript DBxplorer

DBxplorer: A System for Keyword-Based Search
over Relational Databases
Paper By: Sanjay Agrawal, Surajit Chaudhuri, Gautam Das
Presented By
Bhushan Chaudhari
University of Texas at Arlington
1
Contents




Typical Key-word Search
Overview of DBxplorer
Processing Component for producing Symbol Table
Storing Symbol table in databases


Search Component



Compaction Algorithms
Finding Matches using graphs
Generalized Matches
Experiments and Statistics
2
Key-word Search

General Scenario




Finding specified keywords in same or different tables or
completely different schemas
Physical schema needs to be explored
Existing indexes need to be leveraged by using proper data
structures
Data structure => Structure to be followed by symbol tables
3
Levels of Granularity


Column level granularity
Row level granularity
2
4
2
TOYOTA
COROLLA
2
TOYOTA
CAMRY
2
HONDA
ACCORD
2
HONDA
CRV
4
Overview of DBxplorer




Supports Conjunctive Queries
Implementation using MSSQL Server2000 and IIS
Web Server
ODBC Interface for database Connection
Uses the functionality of relational engines very well
5
Overview of DBxplorer (Contd ..)

Publish

Determine the tables to be published



all_tab, all_tab_user
Table relations in form of graphs
Columns to be published

all_tab_columns, user_tab_columns
“select table_name from user_tab_columns”
“where column_name = ‘desired_name’

Building Symbol Tables
6
Overview of DBxplorer (Contd ..)

Search



Symbol table is looked to identify the tables,
columns and rows containing keywords
Join trees (set of tables which are related)
Query is constructed for each join tree and
tuples containing all keywords are found
7
Design Alternatives for Symbol Table

Location Granularity

Column level and cell level
“Why there is no row level granularity? “
Hard to implement
SQL queries work w.r.t. columns
8
Factors affected due to granularity

Space and time requirements




Searching time
Time required to build
Using in-built operators like ‘distinct’ accumulating all values
inside column becomes easy
Most of the typical database systems use Hash Value
indexes which are good for equality searches
9
Factors affected due to granularity
(Contd..)

Search Performance


Typically depends upon presence of indices
Ease of maintenance




One time creation
Insert
Update
Delete
Pub-Col
Y/N
Y/N
N
Pub-Cell
Y
Y
Y
“ Which type of symbol table should we use?”
10
Storing Symbol Tables

Pub-Col Representation



Key-word -> Column Id
Hash Value -> Column Id
Types of Compression Algorithms Used

Foreign Key Compression (FK-Comp)

General Compression Technique (CP-Comp)
11
Building Compression table
12
Compression Algorithm
13
Storing Symbol Tables (Contd..)

Pub-Cell Representation


Hash Value -> Cell ID
Hash Value -> Cell ID List
“Retrieval of all locations for a key-word is achieved by looking up a
single row from pub-cell symbol table”

No Compression

Pre-computation is complex

Inverted lists can be implemented using this table
14
Finding Matches for Keyword Search
• Each join tree is mapped to a SQL Query and selects those
rows that contain all keywords.
• Ranking is based upon no. of joins (Quite similar to ranking
upon proximity of words in documents)
15
Search Algorithm
16
Supporting Generalized Matches

Where T.C like ‘%STRING%’




Traditional databases use B+ Trees indices
Pub-Prefix Representation





Microsoft SQL Server (Most Recent version)
Where CONTAINS(C,’String’);
Enables token searches having form WHERE T.C LIKE
‘P%K%’
Symbol table entry (Hash(k), T.C, P)
Efficiency depends on length of prefix
Length of prefix also affects symbol table size and build
time.
Stemming
17
Experiments – Symbol Table Granularity

Symbol Table Size
18
Experiments – Symbol Table Granularity

Publishing Time
19
Experiments – Symbol Table Granularity

Search Performance
20
Experiments – Scalability of Pub-Col

Data Size and Distribution
21
Experiments – Scalability of Pub-Col

Number of Keywords in Search
22
Experiments – Scalability of Pub-Col

Effectiveness of Compression Techniques
23
Experiments – Scalability of Pub-Col

Effectiveness of Pub-Prefix Method
24
References


DBxplorer: A System for Keyword Based Search over Relational
Databases ICDE 2002
By: Sanjay Agrawal, Surajit Chaudhuri, Gautam Das
DBXplorer: Enabling Keyword Search over Relational Databases.
(Demo), SIGMOD Conference 2002: 627
By: Sanjay Agrawal, Surajit Chaudhuri, Gautam Das
25