Keyword Searching and Browsing in databases using BANKS

Download Report

Transcript Keyword Searching and Browsing in databases using BANKS

Bidirectional Expansion
for Keyword Search
on Graph Databases
Varun Kacholia
Soumen Chakrabarti
Rushi Desai
Shashank Pandit
S. Sudarshan
Hrishikesh Karambelkar
http://www.cse.iitb.ac.in/banks/
Keyword Search on Graph
Representation of Data

Keyword search on relational, XML,
HTML, etc. data

BANKS, Discover, DBXplorer, XRank, etc.
Need to find a (closely) connected set of
nodes that together match all given
keywords
 Focus of our work


Search algorithms to find connections
between nodes
Outline
Data, Query and Response Models
 Backward Search Algorithm
 Bidirectional Search Algorithm
 Experiments
 Related Work
 Conclusions

Graph Data Model

Data modeled as a directed weighted graph:
BANKS [ICDE’02]


Can model relational, XML, HTML, etc. data
E.g., DBLP database


Node = tuple
Edge = foreign key reference
BANKS: Keyword search…
Multi-Query Optimization
paper
writes
Soumen
Sudarshan
Prasan Roy
author
Graph Data Model (2)

E.g., XML data
<proceedings>
<paper id=“1”>
<title>Databases</title>
</paper>
<paper id=“2”>
<title>Keyword Search</title>
<cite ref=“1”>Databases</cite>
</paper>
</proceedings>
paper (@id = 1)
title
proceedings
paper (@id = 2)
cite
title
Response Model

Response: Minimal, rooted tree connecting
keyword nodes


Undirected: Discover, DBXplorer
Directed: BANKS
paper Multi-Query Optimization
E.g., Sudarshan Roy
writes
author
Sudarshan
writes
author
Prasan Roy
Response Ranking

Edge Score = EA



Node Score = NA



Smaller tree => higher score
E.g., BANKS: EA = 1/ (S edge weights)
Measure of authority of nodes in tree
E.g., BANKS: NA = S (leaf and root node
authorities)
Overall score = f (EA, NA)
l
 E.g., BANKS: f (EA, NA) = EA . NA
Finding Answer Trees

Backward Expanding Search:
BANKS [ICDE02]

Intuition: travel backwards from keyword nodes till
you hit a common node
Query: sudarshan roy
paper
MultiQuery Optimization
writes
authors
Sudarshan
Prasan Roy
Backward Search: Algorithm

Algorithm

Run concurrent single source shortest path
iterators from each node matching a
keyword
Traverse the graph edges in reverse direction
 Output next nearest node on each get-next() call



Do best-first search across iterators
Output node if in the intersection of sets of
nodes reached from each keyword
Backward Search: Limitations
Wasteful exploration of graph:


Frequently occurring keywords
“Hub” nodes in the graph (high in-degree)
“Shashank Sudarshan Database”
…
Schema Legend
Database
…
author
writes
Shashank
Sudarshan
paper
Bidirectional Search: Motivation
Bidir Search: Intuition

First cut solution:



Don’t go backward if keyword matches many
nodes
Don’t go backward if node points to a hub
Instead explore forward from other keywords
Bidir Search: Example
“Shashank Sudarshan Database”
…
Database
…
…
Schema Legend
author
Shashank
Sudarshan
writes
paper
Bidir Search: Issues

What should threshold for not expanding
be?

Our solution: prioritize expansion of nodes
based on spreading activation


to penalize frequent keywords and bushy trees
How to manage exploration in both
directions?
Bidir Search: Spreading Activation

Spreading Activation


Node with highest activation explored first
Every node given an initial activation
1/5
1/5
1/5
1/5
1/5
“John”

Gives low activation to frequently
occurring keywords
Bidir Search: Spreading Activation

Spreading Activation


Node with highest activation explored first
Activation spread to neighbors (μ = 0.3)
1/5
1
0
0.7 x 1/5 x 1/4
1
0
0.7 x 1/5 x 1/4
0
0.7 x 1/5 x 1/4
0
0.7 x 1/5 x 1/4
1
0.3 x 1/5

1
Gives low activation to neighbors of hubs
Bidir Search: Iterators

How to manage exploration in both directions?
6
7
[1,∞] 3
[∞,1] 4
[2,3
∞]
[2,∞]
[0,∞]

…
1
2
“A”
“B”
5
[∞,1]
[∞,0]
Single backward iterator + single forward
iterator w/ suitable datastructures


[Dist from “A”, Dist from “B”]
[∞,∞]
[∞,∞ 2]
E.g., to keep track of parents of nodes
Details in full paper
Bidir Search: Algorithm

Algorithm


Activate matching nodes; insert into backward
iterator
while (iterators are not empty)




Choose iterator for expansion in best-first manner
Explore node with highest activation
Spread activation to neighbors
Update path weights (and other datastructures)



Propagate values to ancestors if necessary
Insert nodes explored in the backward direction into the
forward iterator /* for future forward exploration */
Stop when top-k results are produced
Bidir Search: top-k results
Results need not be generated “in-order”
 Naïve solution




Store results in an intermediate heap
Output top k results after mk total results
have been generated (m ~ 10)
Can do better


Compute upper bound on score of next
result; output answers with a higher score
Similar to NRA algorithm (Fagin et al.,
PODS’01)
Experiments

Datasets



Workload


DBLP, IMDB ~ 2 million nodes, 9 million edges
US Patent DB ~ 4 million nodes, 15 million edges
Keywords randomly picked from results of SQL join statements
Search algorithms

MI-Bkwd: original backward search




SI-Bkwd: backward search with single backward iterator
Bidirec: bidirectional search
Time taken/nodes explored


Iterator for every node matching a keyword
Measured when 10th answer is generated (or last answer if
#answers < 10)
Origin size

#nodes matched by keywords in the query
Experiments (2)

MI-Bkwd versus SI-Bkwd

SI-Bkwd gain increases with origin size, #
keywords
Experiments (3)

SI-Bkwd versus Bidirec

Bidirec gain increases with origin size, #
keywords
Experiments (4)

Precision/Recall experiments


Relevant answers are well-defined; can be
generated through SQL statements
Both MI-Backward and Bidirectional show
similar performance
Recall ~ 100%
 Precision ~ 100% at near full recall
 Few irrelevant answers produced before
generating all relevant answers


Bidirectional runs faster, yet minimal loss of
relevant results!
Experiments (5)

Comparison with Sparse:
Hristidis et al. [VLDB’03]




Generate join expressions leading to query results
Use DB-provided scores for ranking tuples and
aggregate them to rank answer trees
For top-k results: automatically determine required
number of join expressions
Sparse-LB



Manually generate required join expressions
Sparse needs to do at least this much (and usually a lot
more!)
Bidirectional versus Sparse-LB

Bidirectional outperforms by a factor of ~ 3 (esp. when
#joins is large)
Experiments (6)

SI-Bkwd versus Bidirec: by origin size
A = (T,S,S,S)
B = (M,M,M,M)
C = (M,L,L,L)
D = (M,M,L,L)
E = (T,L,L,L)
F = (T,S,M,L)
G = (T,M,L,L)
H = (T,T,T,L)

Bidirec gains more with unbalanced origin
sizes
Discussion
Bidirectional search as dynamic per-tuple
join ordering
 Related work in this area: Eddies


Bidirectional search
Schema-less
 Prioritization based on activation instead of
selectivity
 Generate answers in relevance order

Related Work

Keyword querying on relational data:
Discover (UCSD), DBExplorer (Microsoft)



Keyword querying on XML



Use SQL generation, without in-memory data
structures
Issues: generate join plans, re-use common
sub-expressions, etc.
XRank (Cornell), Schema-Free XQuery
(Michigan), …
Tree model is too limited
ObjectRank
Conclusions

Graph model


Convenient common denominator
representation
Schema-free querying leads to graph
search



Purely backward strategy inadequate
Bidirectional search with spreading activation
performs much better
Dynamically choose join order on per-tuple
basis
Thank You!
Questions??
Future of Keyword Search in DBs

Next generation of intelligent search will require
context information




Is there a killer app?


E.g. search email, files, calendar, ..
Information integration will be important
Graph structured data will be a key component
Deep web?
Display of answers


Users don’t want to see schema details
Can we leverage off existing (Web) apps?
BANKS Future Work

Applications of BANKS


Exploit BANKS to integrate different sources of data


Extract information, Infer soft links
BANKS for personal information management


Soumen Chakrabarti, Sunita Sarawagi and students
SPIN: Search Personal Information Networks
Ongoing/future work on BANKS:

More sysadmin/user control on ranking



Characterize bidirectional search better


One size does not fit all
BANKS provides infrastructure
And find other applications
Security
Bidir Search: top-k results (2)


Compute upper bound on score of next result;
output answers with a higher score
Computing the bound



mi = minimum path length explored backward from
keyword i
unseen answer node: 1/(m1 + m2 + … + mn )
visited answer node: suppose reached from first x
keywords with distance di



1/[(d1 + d2 + … + dx ) + (mx+1 + mx+2 + … + mn )]
combine this with max node prestige
We simply use

1/(m1 + m2 + … + mn )!
Experiments show no significant loss in using this
heuristic