Transcript Slides

NUITS: A Novel User Interface for Efficient Keyword Search over Databases
The integration of DB and IR provides users with a wide range of high quality services
1. Internet users search information with search engines
Expect to query databases in the same way
Not know the database schema and SQL
2. Hidden Web problem
Most of data on the Web are stored in databases. They are
hidden to Web search engines because of the mismatch of
search interfaces between Web search engines and databases
3. Integrating different classes of information systems
Modern information systems manage many kinds of data
structured relational data
semi-structured XML documents
unstructured text documents
……
Keyword search: to provide a unified query language/interface
Critical Challenges:
Efficiency
Result presentation
……
NUITS
 High efficiency
 User-friendly result display
 Supporting advanced keyword queries
Characteristics of NUITS
Input:
In addition to simple keyword queries (a set of keywords),
advanced queries with specified conditions are also supported
Output:
Search results can be organized into clusters, facilitating users'
quick browsing
Search Engine:
Adopt a new and efficient search algorithm
A
r
c
h
i
t
e
c
t
u
r
e
Note: All examples in this poster are based on DBLP dataset
Keyword Query Specification
 Simple keyword: just a keyword
eg: database
 Typed keyword: a type can be either a relation-name or
attribute-name
eg: Paper:database
Author:*
Writer:*
 Conditional keyword: conditions associated with a keyword
eg: database year>2000
database year~2000
Keyword query: a set of keywords associated with Boolean
operators
Q ::= p | (Q) | Q AND Q | Q OR Q | NOT Q
p: a keyword which can be a simple keyword, typed keyword, or
conditional keyword
Q: a keyword query
eg: Ullman AND (database OR algorithm)
Search Algorithm
 NUITS adopts a data-graph-based search algorithm, and each
result is a tuple-connection-tree
 The algorithm proposes a dynamic programming approach to
find the optimal top-1 with a low time complexity
 It computes top-k minimum cost tuple-connection-trees oneby-one incrementally, and does not need to compute or sort all
result trees in order to find the top-k results
TreeCluster: Clustering Results
In order to assist users to find the needed results, we propose to
organize the result trees into two levels of clusters
 Structural-Level Clustering: Clustering trees using tree
isomorphism. The two trees below are isomorphic, and both mean
the two authors Hristidis and Papakonstantinou coauthor papers
An example of result tree
Another example of result tree
The structural pattern of the above two trees
 Content-Level Clustering: Based on keyword frequencies
and content similarity. If the size of the cluster is larger than the
user-given threshold after structural-level clustering, the contentlevel clustering further clusters result trees
eg: Considering a keyword query Gray Database, we cluster the
results based on content of author names, because Gray occurs
less frequently than Database in DBLP dataset. Then there may
be clusters for different authors, such as Jim Gray or W.A. Gray
Structural-Level
Clusters
Structural
pattern
Tuple-connection-tree
Content-Level
Clusters
Implementation of TreeCluster
TreeCLuster is cost effective:
 Use labels to represent schema information of each result
tree and reformulate the clustering problem as a problem of
judging whether labeled trees are isomorphic
 Rank user keywords according to their frequencies in
databases, and further partition the large clusters based on
keyword nodes
 Give each cluster a readable description, and present the
description and each result graphically
A
r
c
h
i
t
e
c
t
u
r
e