Transcript cIndex

Towards Graph Containment Search and Indexing
Chen Chen1, Xifeng Yan2, Philip S. Yu2, Jiawei Han1, Dong-Qing Zhang3, Xiaohui Gu2
1
University of Illinois at Urbana-Champaign
2 IBM T. J. Watson Research Center
3 Thomson Research
33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Problem Definition

Given a graph database D = {g1, … gn}, and a graph
query q, one could formulate two basic search problems:

(1) (traditional) graph search: find all graphs gi in D s.t. q is a
subgraph of gi.





GraphGrep: D. Shasha, J. T.-L. Wang, and R. Giugno. PODS 2002.
gIndex: X. Yan, P. S. Yu, and J. Han. SIGMOD 2004.
C-Tree: H. He and A. K. Singh. ICDE 2006.
Tree+Δ: P. Zhao, J. X. Yu, and P. S. Yu. VLDB 2007.
(2) graph containment search: find all graphs gi in D s.t. q is a
supergraph of gi
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.
33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Applications

Chem-Informatics

Pattern Recognition

Cyber Security (Virus Signature Detection)

Information Management (User-interest Mapping)
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.
33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Example
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.
33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Preliminary Definitions

Subgraph Isomorphism:

For two labeled graphs g and g’, a subgraph isomorphism is an
injective function
f : V(g) -> V(g’), s.t.
∀v ∈ V(g), l(v) = l’(f(v));
∀(u, v) ∈ E(g), (f(u), f(v)) ∈ E(g’) and l(u, v) = l’(f(u), f(v)).


f is called an embedding of g in g’
Subgraph and Supergraph

If there exists an embedding of g in g’, then g is a subgraph of g’,
denoted by g ⊆ g’, and g’ is a supergraph of g.
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.
33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Feature-based Indexing Methodology

Naïve solution (SCAN):



Examines the database D sequentially and compares each graph
gi with the query graph q to decide whether q ⊇ gi.
Subgraph isomorphism problem is NP-complete.
Feature-based indexing:



Similar model graphs gi and gj are likely to have similar
isomorphism testing results w.r.t. the same query graph.
Let f be a common substructure shared by gi and gj. If f ⊆ q, then
gi ⊆ q and gj ⊆ q. Therefore, we can save on isomorphism test.
Select a feature set F from graph database D. If feature f ∈ F is
not a subgraph of q, then the graphs having f as subgraph are
pruned.
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.
33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Basic Framework

Off-line index construction: Generate and select a feature
set F from the graph database D. For feature f ∈ F, Df =
{g|f ⊆ g, g ∈ D}, which can be represented by an inverted
list over D.

Search: Test indexed features in F against the query q
which returns all f ⊆ q, and compute the candidate query
answer set, Cq = D – ∪f Df (f ⊆ q, f ∈ F).

Verification: Check each graph g in the candidate set Cq
to see whether g is really a subgraph of q.
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.
33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Cost Model

Search Time Formula:
 T ( f , q)   T ( g , q)  T
f F
|F| +
index
gCq
|Cq|
(negligible)
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.
33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Feature Graph Matrix
ga
gb
gc
f1
1
1
1
f2
1
1
0
f3
1
1
0
f4
1
0
0
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.
33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Feature Generation

Good features should be frequent, but not too frequent in
the database.



frequent: index more graphs in database
too frequent: simple and easy to be contained by query graph
Use frequent subgraph mining algorithms, e.g. gSpan[1],
to generate an initial set of frequent subgraphs.
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.
33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Feature Selection

Given a set of queries {q1, q2, …, qr}, an optimal index
should be able to maximize the total gain from naïve
SCAN:
r
Jtotal  r | D | ( Cql  r | F |)
l 1
r
 |  f
Df |  r | F |
 ql , f  F
l 1
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.
33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Feature Selection



Set i-th row to 0 if the query has feature fi as its subgraph.
Concatenate feature graph matrix to form a global matrix.
fi covers a set of columns -> Maximum Coverage with Cost:
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.
33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Maximum Coverage with Cost



Given a set of subsets S = {S1, S2, …, Sm} of the universal set U
= {1, 2, …, n} and a cost parameter λassociated with any Si ∈ S,
find a subset T of S such that |∪Si ∈TSi| - λ|T| is maximized.
Can be reduced from set cover, and therefore is NPcomplete.
Greedy heuristic method, in each iteration:




Select a row i with the most # of non-zero entries from global
matrix M.
Set j-th column to 0 if Mij = 1
Note that selecting a row is associate with a cost r, so stop the
iteration if no rows have more than r non-zero entries.
The greedy heuristic achieves an approximation ratio of 1-1/e.
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.
33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Algorithm: cIndex-Basic


Input: Graph Matrix M over r queries
Output: Selected Features F.

1: F = ∅ ;

2: while ∃i, ∑j Mij > r do
3: select row i with most non-zero entries in M;
4: F = F ∪ {fi};
5: for each column j s.t. Mij is not zero do
6:
delete column j from M
7: delete row i;
8: return F;






Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.
33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Complexity

Time Complexity:


O(|F0||D||r|), where |D| and |r| can be reduced by sampling and clustering
on graph database and queries.
Space Complexity:

Use a compact matrix, reduce the space complexity from O(|F0||D||r|) to
O(|F0||D| + |F0||r|)
q1 q2 q3
f1
0
3
0
f2
2
2
0
f3
0
2
2
f4
1
1
1
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.
33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Hierarchical Indexing Models

The cIndex-Basic algorithm builds a flat index structure,
where each feature is tested sequentially and
deterministically against any input queries.

Hierarchical index may improve the performance:


Bottom-up
Top-down
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.
33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
cIndex-BottomUp

Build index layer by layer staring from the bottom-level graphs.



The first-level index L1 is built on the original graph database by cIndex-Basic.
The features in L1 can be regarded as another graph database, where cIndexBasic can be executed again to form second-level index L2.
Disadvantage: high-level features are simple and easy to be contained be
queries.
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.
33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
cIndex-TopDown



Select feature fi that covers most columns in global graph matrix M.
Divide queries into two groups: contain fi and do not contain fi. Divide M into
two parts according to query groups.
Run the above steps recursively on new matrices, until we reach a small
number of queries in a group (to avoid overfitting).
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.
33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Experiments
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.
33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Thank You!
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.
33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna