Transcript link
Towards Graph Containment Search and Indexing
Chen Chen 1, Xifeng Yan 2, Philip S. Yu 2, Jiawei Han 1, Dong-Qing Zhang 3, Xiaohui Gu 2
1
University of Illinois at Urbana-Champaign
2IBM
T.J. Watson Research Center
3Thomson
Research
VLDB ’07, September 23-28, 2007, Vienna, Austria
2007. 12. 28
Summarized by Dongjoo Lee, IDS Lab., Seoul National University
Presented by Dongjoo Lee, IDS Lab., Seoul National University
Contents
Graph Search
cIndex Basic Framework
Indexing Features
Indexing Model
cIndex-Basic
Hierarchical Indexing Models
–
cIndex-BottomUp, cIndex-TopDown
Index Maintenances
Experiments
Discussion
Copyright 2007 by CEBT
2
Graph Search
gc
Subgraph
Traditional Graph Search
q
Supergraph
Graph Containment Search
gc q
Find all supergraphs of a query graph
Find all subgraphs of a query graph
Subgraph isomorphism
SCAN
Answers of Graph Search
Answers of
Graph Containment Search
Sample Database
q
Sample Query
Copyright 2007 by CEBT
3
Subgraph Indexing & Pruning
f1
q1
ga
gb
f2
q2
f3
q3
gc
f4
Sample Database
Feature-Graph Matrix
Indexed Subgraphs
SCAN
Queries
Traditional Graph Search - Inclusion Logic
Given a query graph q and a database graph g D, if a feature f q and f g,
then q g . That is, if feature f is in q then the graphs not having f are
pruned.
Graph Containment Search - Exclusion Logic
If a feature f q and f g, then g q. That is, if feature f is not in q
then the graphs having f are pruned.
Copyright 2007 by CEBT
4
Basic Search Framework
1.
2.
Off-line index construction
Generate and select a feature set F from the graph database D
f ∈ F, Df = {g | f ⊆ g, g ∈D}
Search
3.
Test indexed features in F against the query q which returns all f q, and
compute the candidate query answer set, Cq = D − fDf (f q, f ∈ F).
Verification
Check each graph g in the database set Cq to see whether g is really a
subgraph of q
Search Time
Let’s reduce these
Copyright 2007 by CEBT
5
Redundancy-Aware Feature Selection
Select minimal features that cover
many graphs that other features
didn’t cover from frequent features
Expected number of reduced
subgraph isomorphism tests
X
X
X
X
X
q1
f2
X
q2
Jf = np(1-p’) -1
–
n : |D|
–
p : frequency of f in D
–
p’ : probability that a query graph
having f
f3
q3
p p’=> Maximum at p = ½
Frequent subgraph mining algorithm
f1
X
f4
FSG, GASTON, gSpan
Copyright 2007 by CEBT
6
Maximum Coverage With Cost
Definition 6. (Maximum Coverage With Cost). Given a set of subsets S = {S1, S2,
…, Sm} of the universal set U = {1, 2, … , n} and a cost parameter
associated with any Si ∈ S, find a subset T of S such that |∪Si∈TSi|−|T|
is maximized.
U = {1, 2, 3, 4, 5, 6, 7, 8, 9}
S = {{4,5,6}, {1,2,4,5},{4,5,7,8}, {1,4,7}}
f1
f2
f3
f4
= 3
Set Cover Problem
http://en.wikipedia.org/wiki/Set_cover_problem
1
2
3
4
5
6
Contrast Graph Matrix
7 8
9
This is NP-complete problem
The greedy feature selection process can approximate the optimal index with K features within
a ratio of 1 − 1/e. [D. S. Hochbaum, editor. “Approximation Algorithms for NP-Hard Problems”, 1997]
Copyright 2007 by CEBT
7
cIndex-Basic
1
2
3
4
5
6
7 8
9
Time complexity : O(|F0||D||L|) ----> O(|F0||Ds||L|)
Reduced by sampling
Space usage : O(|F0||D||L|) ----> O(|F0||D| + |F0||L|)
Reduced by virtualization
Copyright 2007 by CEBT
8
Hierarchical Indexing Models
cIndex-BottomUp
cIndex-TopDown
Copyright 2007 by CEBT
9
Index Maintenance
“ostrich” strategy
Stick with the same set of selected features and the same
hierarchical structure built
Sampling strategy
Periodically take small samples (Fs, Ds, Ls) and construct sample
index Is
–
if performance of legacy index I is much worse than Is then reconstruct
index
–
else take ostrich strategy
Copyright 2007 by CEBT
10
Experiments (1)
Chemical Descriptor Search
a model graph database usually includes a set of fundamental substructures, called
descriptors. These descriptors, shared by specific groups of known molecules, often
indicate particular chemical and physical properties. Given a molecule, fast searching
for its “descriptor” substructures can help researchers to quickly predict its attributes.
Comparison
SCAN
FB
–
Extract features using gIndex and apply features as Basic Framework
cIndex-Basic, cIndex-BottomUp, cIndex-TopDown
Copyright 2007 by CEBT
11
Experiments (2)
Subgraph Isomorphism Test Numbers
Effectiveness of Data Space Reduction
Copyright 2007 by CEBT
Query Processing Time
Index Maintenance
12
Experiments (3)
Performance of Hierarchical Indices
Scalability of Hierarchical Indices
Index Size
Copyright 2007 by CEBT
13