#### Transcript SPIN: Mining Maximal Frequent Subgraphs from Graph Databases

SPIN: Mining Maximal Frequent Subgraphs from Graph Databases Jun Huan, Wei Wang, Jan Prins, Jiong Yang KDD 2004 Introduction ► Graphs model a relations among data Inter-disciplinary research ► Huge number of recurring patterns ► To mining only maximal frequent subgraphs. None of its super graphs are frequent Advantages ► Reducing the total number of mined subgraphs Saving space and analysis effort ► Reducing mining time ► Non-maximal frequent subgraph can be reconstructed. ► Maximal frequent subgraphs are of most interest in some appliations. Algorithm ► Mining all frequent trees from a general graph database. Tree normalization is simpler than graph. In certain applications, most of the frequent subgraphs are really trees. Use current subgraph mining algorithm Mining subtrees from a forest Algorithm ► Reconstruct all maximal subgraphs from the mined trees. For each frequent tree T, find all frequent subgraphs whose canonical spanning tree are isomorphic to T Enumerate the equvalence class of a tree T Maximal subgraph mining Tree-based Equivalence Classes ►A subtree T is a spanning tree of G if T contains all nodes in G. Maximal one: canonical spanning tree ► Group all frequent subgraphs in to equivalence classes based on spanning trees. Spanning tree Tree-based Equivalence Classes back 12 singletons group b y b x a b y a b b a x x y a x a a a a a y y a y a x a x a x a b b y a x a b a y y a y a x a y y a a b x a a x a a x a Enumerating Graphs from Trees ►G C :{e1,e2,…,en} GO If frequent -> edge C (candidate set) ► Search space of G： G:C ={G+y|y 2C} Optimizations ► Removing a set of frequent subgraphs that can not be maximal from a search space ► Locally maximal：frequent subgraph G is maximal in its equivalence class ► Globally maximal：maximal frequent in a graph database ► Avoid enumerating subgraphs which are not locally maximal. Bottom-up Pruning ► G’ =G C G’ is frequent : each graph in search space is a subgraph of G’ and not maximal Tail Shrink ► Embedding of G in G’ is a subgraph isomorphism f from G to G’ Two embeddings of L in P l1->P1, l2->P2, l3->P3, l4->P4 l1->P1, l2->P3 ,l3->P2 ,l4->P4 go Tail Shrink ► candidate graph G edge (i, j, el) is associative to a It appears in every embedding of G in a graph databases ► If a tree T contains a set of associative edges, any maximal frequent graph G, a superset of T, must contains all associative edges. Tail Shrink ► Remove associative edges from candidate sets and augment them to T without missing any maximal ones Reducing the search space Prune the entire equivalences class in certain cases ►A set of associative edges C of a tree T is lethal G’ = T C has a canonical spanning tree different from that of T go External-Edge Pruning ► Remove one equivalence class without any knowledge about its candidate edges ► External-edge for a graph G: it connects a node in G and a node not in G ► (i, el, vl) is associative to a graph G Every embedding f of G in a graph G’, G’ has a node v with the label vl v connects to the node f(i) with an edge label el in G’ Not exist node j V[G] such that v = f(j) Associative external edges Experiments ► 2.8GHz Pentium Xeon, ► 512KB L2 cache,2GB main memory ► Red Hat Linux 7.3 ► C++ Programming language Synthetic Dataset D10KT30L200I11V4E4 DTP CA data set DTP CM data set