SPIN: Mining Maximal Frequent Subgraphs from Graph Databases

Transcript SPIN: Mining Maximal Frequent Subgraphs from Graph Databases

SPIN: Mining Maximal
Frequent Subgraphs from
Graph Databases
Jun Huan, Wei Wang, Jan Prins,
Jiong Yang
KDD 2004
Introduction
► Graphs
model a relations among data
 Inter-disciplinary research
► Huge
number of recurring patterns
► To mining only maximal frequent subgraphs.
 None of its super graphs are frequent
Advantages
► Reducing
the total number of mined
subgraphs
 Saving space and analysis effort
► Reducing
mining time
► Non-maximal frequent subgraph can be
reconstructed.
► Maximal frequent subgraphs are of most
interest in some appliations.
Algorithm
► Mining
all frequent trees from a general
graph database.
 Tree normalization is simpler than graph.
 In certain applications, most of the frequent
subgraphs are really trees.
 Use current subgraph mining algorithm
 Mining subtrees from a forest
Algorithm
► Reconstruct
all maximal subgraphs from the
mined trees.
 For each frequent tree T, find all frequent
subgraphs whose canonical spanning tree are
isomorphic to T
 Enumerate the equvalence class of a tree T
 Maximal subgraph mining
Tree-based Equivalence Classes
►A
subtree T is a spanning tree of G if T
contains all nodes in G.
 Maximal one: canonical spanning tree
► Group
all frequent subgraphs in to
equivalence classes based on spanning trees.
Spanning tree
Tree-based Equivalence Classes
back
12 singletons group
b
y
b
x
a
b
y
a
b
b
a
x
x
y
a
x
a
a
a
a
a
y
y
a
y
a
x
a
x
a
x
a
b
b
y
a
x
a
b
a
y
y
a
y
a
x
a
y
y
a
a
b
x
a
a
x
a
a
x
a
Enumerating Graphs from Trees
►G
C :{e1,e2,…,en}
GO
 If frequent -> edge  C (candidate set)
► Search
space of G： G:C ={G+y|y  2C}
Optimizations
► Removing
a set of frequent subgraphs that
can not be maximal from a search space
► Locally maximal：frequent subgraph G is
maximal in its equivalence class
► Globally maximal：maximal frequent in a
graph database
► Avoid enumerating subgraphs which are not
locally maximal.
Bottom-up Pruning
► G’
=G
C
 G’ is frequent : each graph in search space is a
subgraph of G’ and not maximal
Tail Shrink
► Embedding
of G in G’ is a subgraph isomorphism
f from G to G’
 Two embeddings of L in P
l1->P1, l2->P2, l3->P3, l4->P4
l1->P1, l2->P3 ,l3->P2 ,l4->P4
go
Tail Shrink
► candidate
graph G
edge (i, j, el) is associative to a
 It appears in every embedding of G in a graph
databases
► If
a tree T contains a set of associative
edges, any maximal frequent graph G, a
superset of T, must contains all associative
edges.
Tail Shrink
► Remove
associative edges from candidate sets
and augment them to T without missing any
maximal ones
 Reducing the search space
 Prune the entire equivalences class in certain cases
►A
set of associative edges C of a tree T is lethal
 G’ = T C has a canonical spanning tree different
from that of T
go
External-Edge Pruning
► Remove
one equivalence class without any
knowledge about its candidate edges
► External-edge for a graph G: it connects a node in
G and a node not in G
► (i, el, vl) is associative to a graph G
 Every embedding f of G in a graph G’, G’ has a node v
with the label vl
 v connects to the node f(i) with an edge label el in G’
 Not exist node j  V[G] such that v = f(j)
Associative external edges
Experiments
► 2.8GHz
Pentium Xeon,
► 512KB L2 cache,2GB main memory
► Red Hat Linux 7.3
► C++ Programming language
Synthetic Dataset
D10KT30L200I11V4E4
DTP CA data set
DTP CM data set

SPIN: Mining Maximal Frequent Subgraphs from Graph Databases

Transcript SPIN: Mining Maximal Frequent Subgraphs from Graph Databases

Directory