Keyword Searching and Browsing in databases using BANKS
Download
Report
Transcript Keyword Searching and Browsing in databases using BANKS
Bidirectional Expansion
for Keyword Search
on Graph Databases
Varun Kacholia
Soumen Chakrabarti
Rushi Desai
Shashank Pandit
S. Sudarshan
Hrishikesh Karambelkar
http://www.cse.iitb.ac.in/banks/
Keyword Search on Graph
Representation of Data
Keyword search on relational, XML,
HTML, etc. data
BANKS, Discover, DBXplorer, XRank, etc.
Need to find a (closely) connected set of
nodes that together match all given
keywords
Focus of our work
Search algorithms to find connections
between nodes
Outline
Data, Query and Response Models
Backward Search Algorithm
Bidirectional Search Algorithm
Experiments
Related Work
Conclusions
Graph Data Model
Data modeled as a directed weighted graph:
BANKS [ICDE’02]
Can model relational, XML, HTML, etc. data
E.g., DBLP database
Node = tuple
Edge = foreign key reference
BANKS: Keyword search…
Multi-Query Optimization
paper
writes
Soumen
Sudarshan
Prasan Roy
author
Graph Data Model (2)
E.g., XML data
<proceedings>
<paper id=“1”>
<title>Databases</title>
</paper>
<paper id=“2”>
<title>Keyword Search</title>
<cite ref=“1”>Databases</cite>
</paper>
</proceedings>
paper (@id = 1)
title
proceedings
paper (@id = 2)
cite
title
Response Model
Response: Minimal, rooted tree connecting
keyword nodes
Undirected: Discover, DBXplorer
Directed: BANKS
paper Multi-Query Optimization
E.g., Sudarshan Roy
writes
author
Sudarshan
writes
author
Prasan Roy
Response Ranking
Edge Score = EA
Node Score = NA
Smaller tree => higher score
E.g., BANKS: EA = 1/ (S edge weights)
Measure of authority of nodes in tree
E.g., BANKS: NA = S (leaf and root node
authorities)
Overall score = f (EA, NA)
l
E.g., BANKS: f (EA, NA) = EA . NA
Finding Answer Trees
Backward Expanding Search:
BANKS [ICDE02]
Intuition: travel backwards from keyword nodes till
you hit a common node
Query: sudarshan roy
paper
MultiQuery Optimization
writes
authors
Sudarshan
Prasan Roy
Backward Search: Algorithm
Algorithm
Run concurrent single source shortest path
iterators from each node matching a
keyword
Traverse the graph edges in reverse direction
Output next nearest node on each get-next() call
Do best-first search across iterators
Output node if in the intersection of sets of
nodes reached from each keyword
Backward Search: Limitations
Wasteful exploration of graph:
Frequently occurring keywords
“Hub” nodes in the graph (high in-degree)
“Shashank Sudarshan Database”
…
Schema Legend
Database
…
author
writes
Shashank
Sudarshan
paper
Bidirectional Search: Motivation
Bidir Search: Intuition
First cut solution:
Don’t go backward if keyword matches many
nodes
Don’t go backward if node points to a hub
Instead explore forward from other keywords
Bidir Search: Example
“Shashank Sudarshan Database”
…
Database
…
…
Schema Legend
author
Shashank
Sudarshan
writes
paper
Bidir Search: Issues
What should threshold for not expanding
be?
Our solution: prioritize expansion of nodes
based on spreading activation
to penalize frequent keywords and bushy trees
How to manage exploration in both
directions?
Bidir Search: Spreading Activation
Spreading Activation
Node with highest activation explored first
Every node given an initial activation
1/5
1/5
1/5
1/5
1/5
“John”
Gives low activation to frequently
occurring keywords
Bidir Search: Spreading Activation
Spreading Activation
Node with highest activation explored first
Activation spread to neighbors (μ = 0.3)
1/5
1
0
0.7 x 1/5 x 1/4
1
0
0.7 x 1/5 x 1/4
0
0.7 x 1/5 x 1/4
0
0.7 x 1/5 x 1/4
1
0.3 x 1/5
1
Gives low activation to neighbors of hubs
Bidir Search: Iterators
How to manage exploration in both directions?
6
7
[1,∞] 3
[∞,1] 4
[2,3
∞]
[2,∞]
[0,∞]
…
1
2
“A”
“B”
5
[∞,1]
[∞,0]
Single backward iterator + single forward
iterator w/ suitable datastructures
[Dist from “A”, Dist from “B”]
[∞,∞]
[∞,∞ 2]
E.g., to keep track of parents of nodes
Details in full paper
Bidir Search: Algorithm
Algorithm
Activate matching nodes; insert into backward
iterator
while (iterators are not empty)
Choose iterator for expansion in best-first manner
Explore node with highest activation
Spread activation to neighbors
Update path weights (and other datastructures)
Propagate values to ancestors if necessary
Insert nodes explored in the backward direction into the
forward iterator /* for future forward exploration */
Stop when top-k results are produced
Bidir Search: top-k results
Results need not be generated “in-order”
Naïve solution
Store results in an intermediate heap
Output top k results after mk total results
have been generated (m ~ 10)
Can do better
Compute upper bound on score of next
result; output answers with a higher score
Similar to NRA algorithm (Fagin et al.,
PODS’01)
Experiments
Datasets
Workload
DBLP, IMDB ~ 2 million nodes, 9 million edges
US Patent DB ~ 4 million nodes, 15 million edges
Keywords randomly picked from results of SQL join statements
Search algorithms
MI-Bkwd: original backward search
SI-Bkwd: backward search with single backward iterator
Bidirec: bidirectional search
Time taken/nodes explored
Iterator for every node matching a keyword
Measured when 10th answer is generated (or last answer if
#answers < 10)
Origin size
#nodes matched by keywords in the query
Experiments (2)
MI-Bkwd versus SI-Bkwd
SI-Bkwd gain increases with origin size, #
keywords
Experiments (3)
SI-Bkwd versus Bidirec
Bidirec gain increases with origin size, #
keywords
Experiments (4)
Precision/Recall experiments
Relevant answers are well-defined; can be
generated through SQL statements
Both MI-Backward and Bidirectional show
similar performance
Recall ~ 100%
Precision ~ 100% at near full recall
Few irrelevant answers produced before
generating all relevant answers
Bidirectional runs faster, yet minimal loss of
relevant results!
Experiments (5)
Comparison with Sparse:
Hristidis et al. [VLDB’03]
Generate join expressions leading to query results
Use DB-provided scores for ranking tuples and
aggregate them to rank answer trees
For top-k results: automatically determine required
number of join expressions
Sparse-LB
Manually generate required join expressions
Sparse needs to do at least this much (and usually a lot
more!)
Bidirectional versus Sparse-LB
Bidirectional outperforms by a factor of ~ 3 (esp. when
#joins is large)
Experiments (6)
SI-Bkwd versus Bidirec: by origin size
A = (T,S,S,S)
B = (M,M,M,M)
C = (M,L,L,L)
D = (M,M,L,L)
E = (T,L,L,L)
F = (T,S,M,L)
G = (T,M,L,L)
H = (T,T,T,L)
Bidirec gains more with unbalanced origin
sizes
Discussion
Bidirectional search as dynamic per-tuple
join ordering
Related work in this area: Eddies
Bidirectional search
Schema-less
Prioritization based on activation instead of
selectivity
Generate answers in relevance order
Related Work
Keyword querying on relational data:
Discover (UCSD), DBExplorer (Microsoft)
Keyword querying on XML
Use SQL generation, without in-memory data
structures
Issues: generate join plans, re-use common
sub-expressions, etc.
XRank (Cornell), Schema-Free XQuery
(Michigan), …
Tree model is too limited
ObjectRank
Conclusions
Graph model
Convenient common denominator
representation
Schema-free querying leads to graph
search
Purely backward strategy inadequate
Bidirectional search with spreading activation
performs much better
Dynamically choose join order on per-tuple
basis
Thank You!
Questions??
Future of Keyword Search in DBs
Next generation of intelligent search will require
context information
Is there a killer app?
E.g. search email, files, calendar, ..
Information integration will be important
Graph structured data will be a key component
Deep web?
Display of answers
Users don’t want to see schema details
Can we leverage off existing (Web) apps?
BANKS Future Work
Applications of BANKS
Exploit BANKS to integrate different sources of data
Extract information, Infer soft links
BANKS for personal information management
Soumen Chakrabarti, Sunita Sarawagi and students
SPIN: Search Personal Information Networks
Ongoing/future work on BANKS:
More sysadmin/user control on ranking
Characterize bidirectional search better
One size does not fit all
BANKS provides infrastructure
And find other applications
Security
Bidir Search: top-k results (2)
Compute upper bound on score of next result;
output answers with a higher score
Computing the bound
mi = minimum path length explored backward from
keyword i
unseen answer node: 1/(m1 + m2 + … + mn )
visited answer node: suppose reached from first x
keywords with distance di
1/[(d1 + d2 + … + dx ) + (mx+1 + mx+2 + … + mn )]
combine this with max node prestige
We simply use
1/(m1 + m2 + … + mn )!
Experiments show no significant loss in using this
heuristic