Transcript link

Towards Graph Containment Search and Indexing
Chen Chen 1, Xifeng Yan 2, Philip S. Yu 2, Jiawei Han 1, Dong-Qing Zhang 3, Xiaohui Gu 2
1
University of Illinois at Urbana-Champaign
2IBM
T.J. Watson Research Center
3Thomson
Research
VLDB ’07, September 23-28, 2007, Vienna, Austria
2007. 12. 28
Summarized by Dongjoo Lee, IDS Lab., Seoul National University
Presented by Dongjoo Lee, IDS Lab., Seoul National University
Contents

Graph Search

cIndex Basic Framework

Indexing Features

Indexing Model

cIndex-Basic

Hierarchical Indexing Models
–
cIndex-BottomUp, cIndex-TopDown

Index Maintenances

Experiments

Discussion
Copyright  2007 by CEBT
2
Graph Search
gc
Subgraph

Traditional Graph Search

q
Supergraph

Graph Containment Search

gc  q
Find all supergraphs of a query graph
Find all subgraphs of a query graph
Subgraph isomorphism
SCAN
Answers of Graph Search
Answers of
Graph Containment Search
Sample Database
q
Sample Query
Copyright  2007 by CEBT
3
Subgraph Indexing & Pruning
f1
q1
ga
gb
f2
q2
f3
q3
gc
f4
Sample Database

Feature-Graph Matrix
Indexed Subgraphs
SCAN
Queries
Traditional Graph Search - Inclusion Logic
Given a query graph q and a database graph g  D, if a feature f  q and f  g,
then q  g . That is, if feature f is in q then the graphs not having f are
pruned.
Graph Containment Search - Exclusion Logic
If a feature f  q and f  g, then g  q. That is, if feature f is not in q
then the graphs having f are pruned.
Copyright  2007 by CEBT
4
Basic Search Framework
1.
2.
Off-line index construction

Generate and select a feature set F from the graph database D

f ∈ F, Df = {g | f ⊆ g, g ∈D}
Search

3.
Test indexed features in F against the query q which returns all f  q, and
compute the candidate query answer set, Cq = D − fDf (f  q, f ∈ F).
Verification

Check each graph g in the database set Cq to see whether g is really a
subgraph of q
Search Time

Let’s reduce these
Copyright  2007 by CEBT
5
Redundancy-Aware Feature Selection


Select minimal features that cover
many graphs that other features
didn’t cover from frequent features
Expected number of reduced
subgraph isomorphism tests



X
X
X
X
X
q1
f2
X
q2
Jf = np(1-p’) -1
–
n : |D|
–
p : frequency of f in D
–
p’ : probability that a query graph
having f
f3
q3
p  p’=> Maximum at p = ½
Frequent subgraph mining algorithm

f1
X
f4
FSG, GASTON, gSpan
Copyright  2007 by CEBT
6
Maximum Coverage With Cost
Definition 6. (Maximum Coverage With Cost). Given a set of subsets S = {S1, S2,
…, Sm} of the universal set U = {1, 2, … , n} and a cost parameter 
associated with any Si ∈ S, find a subset T of S such that |∪Si∈TSi|−|T|
is maximized.
U = {1, 2, 3, 4, 5, 6, 7, 8, 9}
S = {{4,5,6}, {1,2,4,5},{4,5,7,8}, {1,4,7}}
f1
f2
f3
f4
 = 3
Set Cover Problem
http://en.wikipedia.org/wiki/Set_cover_problem
1
2
3
4
5
6
Contrast Graph Matrix
7 8
9
This is NP-complete problem
The greedy feature selection process can approximate the optimal index with K features within
a ratio of 1 − 1/e. [D. S. Hochbaum, editor. “Approximation Algorithms for NP-Hard Problems”, 1997]
Copyright  2007 by CEBT
7
cIndex-Basic
1

2
3
4
5
6
7 8
9
Time complexity : O(|F0||D||L|) ----> O(|F0||Ds||L|)
Reduced by sampling

Space usage : O(|F0||D||L|) ----> O(|F0||D| + |F0||L|)
Reduced by virtualization
Copyright  2007 by CEBT
8
Hierarchical Indexing Models

cIndex-BottomUp

cIndex-TopDown
Copyright  2007 by CEBT
9
Index Maintenance
 “ostrich” strategy

Stick with the same set of selected features and the same
hierarchical structure built
 Sampling strategy

Periodically take small samples (Fs, Ds, Ls) and construct sample
index Is
–
if performance of legacy index I is much worse than Is then reconstruct
index
–
else take ostrich strategy
Copyright  2007 by CEBT
10
Experiments (1)

Chemical Descriptor Search


a model graph database usually includes a set of fundamental substructures, called
descriptors. These descriptors, shared by specific groups of known molecules, often
indicate particular chemical and physical properties. Given a molecule, fast searching
for its “descriptor” substructures can help researchers to quickly predict its attributes.
Comparison

SCAN

FB
–

Extract features using gIndex and apply features as Basic Framework
cIndex-Basic, cIndex-BottomUp, cIndex-TopDown
Copyright  2007 by CEBT
11
Experiments (2)
Subgraph Isomorphism Test Numbers
Effectiveness of Data Space Reduction
Copyright  2007 by CEBT
Query Processing Time
Index Maintenance
12
Experiments (3)
Performance of Hierarchical Indices
Scalability of Hierarchical Indices
Index Size
Copyright  2007 by CEBT
13