Frequent Structure Mining

Download Report

Transcript Frequent Structure Mining

Frequent Structure Mining
Robert Howe
University of Vermont
Spring 2014
Original Authors
•
This presentation is based on the paper
Zaki MJ (2002). Efficiently mining frequent trees in a
forest. Proceedings of the 8th ACM SIGKDD
International Conference.
•
The author’s original presentation was used to make
this one.
•
I further adapted this from Ahmed R. Nabhan’s
modifications.
2
Outline
•
Graph Mining Overview
•
Mining Complex Structures - Introduction
•
Motivation and Contributions of author
•
Problem Definition and Case Examples
•
Main Ingredients for Efficient Pattern Extraction
•
Experimental Results
•
Conclusions
3
Outline
•
Graph Mining Overview
•
Mining Complex Structures - Introduction
•
Motivation and Contributions of author
•
Problem Definition and Case Examples
•
Main Ingredients for Efficient Pattern Extraction
•
Experimental Results
•
Conclusions
4
Why Graph Mining?
•
Graphs are convenient structures that can represent
many complex relationships.
•
We are drowning in graph data:
•
Social Networks
•
Biological Networks
•
World Wide Web
5
UVM
High School
BU
Facebook Data
(Source: Wolfram|Alpha Facebook Report)
6
Facebook Data
(Source: Wolfram|Alpha Facebook Report)
7
Biological Data
(Source: KEGG Pathways Database)
8
Some Graph Mining
Problems
•
Pattern Discovery
•
Graph Clustering
•
Graph Classification and Label Propagation
•
Structure and Dynamics of Evolving Graphs
9
Graph Mining Framework
Mining graph patterns is a fundamental problem in data mining.
Exponential Pattern Space
Graph Data
Mine
Relevant Patterns
Select
Structure Indices
Exploratory Task
Clustering
10
Classification
Basic Concepts
•
•
Graph – A graph G is a 3-tuple G = (V,
E, L) where
•
V is the finite set of nodes.
•
E ⊆ V × V is the set of edges
•
L is a labeling function for edges and nodes.
A
C
B
D
Subgraph – A graph G1 = (V1, E1, L1)
is a subgraph of G2 = (V2, E2, L2) iff:
•
V1 ⊆ V2
•
E1 ⊆ E2
•
L1(v) = L2(v) for all v ∈ V1.
•
L1(e) = L2(e) for all e ∈ E1.
A
C
11
B
Basic Concepts
•
Graph Isomorphism – “A bijection between the
vertex sets of G1 and G2 such that any two vertices u
and v which are adjacent in G1 are also adjacent in
G2.” (Wikipedia)
A
3
B
5
4
C
•
D
E
1
Subgraph Isomorphism is even harder (NPComplete!)
12
2
Basic Concepts
•
Graph Isomorphism – Let G1 = (V1, E1, L1) and G2 =
(V2, E2, L2). A graph isomorphism is a bijective
function f: V1 → V2 satisfying
•
L1(u) = L1( f (u)) for all u ∈ V1.
•
For each edge e1 = (u,v) ∈ E1, there exists e2 = (
f(u), f(v)) ∈ E2 such that L1(e1) = L2(e2).
•
For each edge e2 = (u,v) ∈ E2, there exists e1 =
–1(u), f –1(v)) ∈ E such that L (e ) = L (e ).
1
1 1
2 2
13
(f
Discovering Subgraphs
•
TreeMiner and gSpan both employ subgraph or
substructure pattern mining.
•
Graph or subgraph isomorphism can be used as an
equivalence relation between two structures.
•
There is an exponential number of subgraph patterns
inside a larger graph (as there are 2n node subsets in
each graph and then there are edges.)
•
Finding frequent subgraphs (or subtrees) tends to be
useful in data mining.
14
Outline
•
Graph Mining Overview
•
Mining Complex Structures - Introduction
•
Motivation and Contributions of author
•
Problem Definition and Case Examples
•
Main Ingredients for Efficient Pattern Extraction
•
Experimental Results
•
Conclusions
15
Mining Complex Structures
•
•
Frequent structure mining tasks
•
Item sets – Transactional, unordered data.
•
Sequences – Temporal/positional, text, biological sequences.
•
Tree Patterns – Semi-structured data, web mining, bioinformatics,
etc.
•
Graph Patterns – Bioinformatics, Web Data
“Frequent” is a broad term
•
Maximal or closed patterns in dense data
•
Correlation and other statistical metrics
•
Interesting, rare, non-redundant patterns.
16
Anti-Monotonicity
The black line is always decreasing
•
A monotonic function is a
consistently increasing or
decreasing function*.
•
The author refers to a
monotonically decreasing
function as anti-monotonic.
•
The frequency of a supergraph cannot be greater than
the frequency of a subgraph
(similar to Apriori).
* Very Informal Definition
17
(Source: SIGMOD ’08)
Outline
•
Graph Mining Overview
•
Mining Complex Structures - Introduction
•
Motivation and Contributions of author
•
Problem Definition and Case Examples
•
Main Ingredients for Efficient Pattern Extraction
•
Experimental Results
•
Conclusions
18
Tree Mining –
Motivation
(Source: University of Washington)
•
Capture intricate (subspace) patterns
•
Can be used (as features) to build global
models (classification, clustering, etc.)
•
Ideally suited for categorical, highdimensional, complex, and massive data.
•
Interesting Applications
•
Semi-structured Data – Mine structure and content
•
Web usage mining – Log mining (user sessions as
trees)
•
Bioinformatics – RNA secondary structures,
Phylogenetic trees
19
Classification Example
•
Subgraph patterns can be used as features for
classification.
# of sides
2
3
4
5
6
7
8
Amount
0
1
0
0
1
0
0
“Hexagons are a
commonly occurring
subgraph in organic
compounds.”
•
Off-the-shelf classifiers (like neural networks or
genetic algorithms) can be trained using these
vectors.
•
Feature selection is very useful too.
20
Contributions
•
Systematic subtree enumeration.
•
Extensions for mining unlabeled or unordered
subtrees or sub-forests.
•
Optimizations
•
Representing trees as strings.
•
Scope-lists for subtree occurrences.
21
Outline
•
Graph Mining Overview
•
Mining Complex Structures - Introduction
•
Motivation and Contributions of author
•
Problem Definition and Case Examples
•
Main Ingredients for Efficient Pattern Extraction
•
Experimental Results
•
Conclusions
22
How does searching for
patterns work?
•
Start with graphs with small sizes.
•
Extend k-size graphs by one node to generate k + 1
candidate patterns.
•
Use a scoring function to evaluate each candidate.
•
A popular scoring function is one that defines the
minimum support. Only graphs with frequency
greater than minisup are kept.
23
How does searching for
patterns work?
•
“The generation of size k + 1 subgraph candidates
from size k frequent subgraphs is more complicated
and more costly than that of itemsets” – Yan and
Han (2002), on gSpan
•
Where do we add a new edge?
•
It is possible to add a new edge to a pattern and
then find that doesn’t exist in the database.
•
The main story of this presentation is on good
candidate generation strategies.
24
TreeMiner
•
TreeMiner uses a technique for numbering tree
nodes based on DFS.
•
This numbering is used to encode trees as vectors.
•
Subtrees sharing a common prefix (e.g. the first k
numbers in vectors) form an equivalence class.
•
Generate new candidate (k + 1)-subtrees from
equivalence classes of k-subtrees (e.g. Apriori)
25
TreeMiner
•
This is important because candidate subtrees are
generated only once!
•
(Remember the subgraph isomorphism problem that
makes it likely to generate the same pattern over
and over)
26
Definitions
•
Tree – An undirected graph where there is exactly
one path between any two vertices.
•
Rooted Tree – Tree with a special node called root.
This tree has no root node.
It is an unrooted tree.
This tree has a root node.
It is a rooted tree.
27
Definitions
•
Ordered Tree – The ordering of a node’s children
matters.
v1
v2
v3
≠
v2
v1
v3
•
Example: XML Documents
•
Exercise – Prove that ordered trees must be rooted.
28
Definitions
•
Labeled Tree – Nodes have labels.
•
Rooted trees also have some
special terminology.
•
ancestor
parent
embedded
sibling
Parent – The node one closer to
the root.
•
Ancestor – The node n edges
closer to the root, for any n.
•
Siblings – Two nodes with the
same parent.
29
embedded
sibling
sibling
ancestor(X,Y) :parent(X,Y).
ancestor(X,Y) :parent(Z,Y),
ancestor(X,Z).
sibling(X,Y) :parent(Z,X),
parent(Z,Y).
Definitions
•
Embedded Siblings – Two nodes sharing a
common ancestor.
•
Numbering – The node’s position in a traversal
(normally DFS) of the tree.
•
•
A node has a number ni and a label L(ni).
Scope – The scope of a node nl is [l, r], where nr is
the rightmost leaf under nl (again, DFS numbering).
30
Definitions
v0
•
Embedded Subtrees – S =
(Ns, Bs) is an embedded
subtree of T = (N, B) if and only
if the following conditions are
met:
•
v1
v2
Ns ⊆ N (the nodes have to
be a subset).
•
b = (nx, ny) ∊ Bs iff nx is an
ancestor of ny.
•
For each subset of nodes Ns
there is one embedded
subtree or subforest.
v6
v7
v8
v3
v4
v5
subtree
v1
v4
31
v5
(Colors are only on this graph to show
corresponding nodes)
Definitions
v0
•
•
•
Match Label – The node
numbers (DFS numbers) in
T of the nodes in S with
matching labels.
v1
v2
A match label uniquely
identifies a subtree.
v6
v7
v8
v3
v4
v5
subtree
This is useful because a
labeling function must be
surjective but will not
necessarily be bijective.
v1
v4
{v1, v4, v5} or {1, 4, 5}
32
v5
(Colors are only on this graph to show
corresponding nodes)
Definitions
v0
v1
•
v2
Subforest – A
disconnected pattern
generated in the same
way as an embedded
subtree.
v6
v7
v8
v3
v4
v5
subforest
33
v1
v7
v4
v8
(Colors are only on this graph to show
corresponding nodes)
Problem Definition
•
Given a database (forest) D of trees, find all frequent
embedded subtrees.
•
Frequent – Occurring a minimum number of times
(use user-defined minisup).
•
Support(S) – The number of trees in D that contain
at least one occurrence of S.
•
Weighted-Support(S) – The number of
occurrences of S across all trees in D.
34
Exercise
Generate an embedded subtree or subforest for the set of
nodes Ns = {v1, v2, v5}. Is this an embedded subtree or
subforest, and why? Assume a labeling function L(x) = x.
v1
v0
v1
v2
v6
v7
v8
v3
v4
v2
v5
v5
This is an embedded subtree
because all of the nodes are
connected.
(*Cough* Exam 35Question *Cough*)
Outline
•
Graph Mining Overview
•
Mining Complex Structures - Introduction
•
Motivation and Contributions of author
•
Problem Definition and Case Examples
•
Main Ingredients for Efficient Pattern Extraction
•
Experimental Results
•
Conclusions
36
Main Ingredients
•
Pattern Representation
•
•
Candidate Generation
•
•
Trees as strings
No duplicates.
Pattern Counting
•
Scope-based List (TreeMiner)
•
Pattern-based Matching (PatternMatcher)
37
String Representation
•
With N nodes, M branches, and a max fanout of F:
•
An adjacency matrix takes (N)(F + 1) space.
•
An adjacency list requires 4N – 2 space.
•
A tree of (node, child, sibling) requires 3N space.
•
String representation requires 2N – 1 space.
38
String Representation
•
String representation is labels with a backtrack
operator, –1.
0
2
1
3
2
1
0
1
3
2
1
–1
2
–1 –1
39
2
–1 –1
2
–1
Candidate Generation
•
Equivalence Classes – Two subtrees are in the
same equivalence class iff they share a common
prefix string P up to the (k – 1)-th node.
•
This gives us simple equivalence testing of a
fixed-size array.
•
Fast and parallel – Can be run on a GPU.
•
Caveat – The order of the tree matters.
40
Candidate Generation
•
Generate new candidate (k + 1)-subtrees from
equivalence classes of k-subtrees.
•
Consider each pair of elements in a class, including selfextensions.
•
Up to two new candidates for each pair of joined
elements.
•
All possible candidate subtrees are enumerated.
•
Each subtree is generated only once!
41
Candidate Generation
•
Each class is represented in memory by a prefix
string and a set of ordered pairs indicating nodes
that exist in that class.
•
A class is extended by applying a join operator ⊗ on
all ordered pairs in the class.
42
Candidate Generation
Equivalence Class
Prefix String 12
1
1
2
2
4
3
This generates two elements: (3, v1) and (4, v0)
The element notation can be confusing because the first
item is a label and the second item is a DFS node number.
43
Candidate Generation
•
Theorem 1. Define a join operator ⊗ on two
elements as (x, i) ⊗ (y, j). Then apply one of the
following cases:
(1) If i = j and P is not empty, add (y, j) and (y, j + 1)
to class [Px]. If P is empty, only add (y, j + 1) to
[Px].
(2) If i > j, add (y, j) to [Px].
(3) If i < j, no new candidate is possible.
44
Candidate Generation
•
Consider the prefix class from the previous example:
P = (1, 2) which contains two elements, (3, v1) and (4,
v0).
1. Join (3, v1) ⊗ (3, v1) – Case (1) applies, producing
(3, v1) and (3, v2) for the new class P3 = (1, 2, 3).
2. Join (3, v1) ⊗ (4, v0) – Case (2) applies.
(Don’t worry, there’s an illustration on the next slide.)
45
Candidate Generation
1
2
3
1
⊗
2
=
3
A class with prefix {1,2} contains a
node with label 3. This is written as
(3, v1), meaning a node labeled ‘3’ is
added at position 1 in DFS order of
nodes.
46
1
1
2
2
3
3
3
3
Prefix = (1, 2, 3)
New nodes = (3, v2), (3, v1)
Candidate Generation
1
2
1
⊗
2
4
=
1
1
1
2
2
2
3
3
3
4
3
3
3
Prefix = (1, 2, 3)
New nodes = (3, v2), (3, v1), (4, v0)
47
The Algorithm
TREEMINER( D, minisup ):
F1 = { frequent 1-subtrees}
F2 = { classes [P]1 of frequent 2-subtrees }
for all [P]1 ∈ E do
Enumerate-Frequent-Subtrees( [P]1 )
ENUMERATE-FREQUENT-SUBTREES( [P] ):
for each element (x, i) ∈ [P] do
[Px] = ∅
for each element (y, j) ∈ [P] do
R = { (x, i) ⊗ (y, j) }
L(R) = { L(x) ∩⊗ L(y) }
if for any R ∈ R, R is frequent, then
[Px] = [Px] ∪ {R}
Enumerate-Frequent-Subtrees( [Px] )
48
ScopeList Join
•
•
Recall that the scope is
the interval between the
lowest numbered child
(or self) node and the
highest numbered child
node, using DFS
numbering.
v0
[1, 5]
v1
v6
[0, 8]
v7
[7, 8]
[3, 5]
v8
[2, 2]
v2
v3
v4
This can be used to
calculate support.
[4, 4]
49
v5
[5, 5]
[8, 8]
ScopeList Join
•
ScopeLists are used to calculate support.
•
Let x and y be nodes with scopes sx = [lx, ux], sy = [ly,
uy].
•
sx contains sy iff lx ≤ ly and ux ≥ uy.
•
A scope list represents the entire forest.
50
ScopeList Join
•
•
A ScopeList is a list of (t, m, s) 3-tuples.
•
t is the tree ID.
•
m is the match label of the (k – 1)-length prefix of xk.
•
s is the scope of the last item, xk.
The use of scope lists allows constant time
computations of whether y is a descendent or
embedded sibling of x.
51
Outline
•
Graph Mining Overview
•
Mining Complex Structures - Introduction
•
Motivation and Contributions of author
•
Problem Definition and Case Examples
•
Main Ingredients for Efficient Pattern Extraction
•
Experimental Results
•
Conclusions
52
Experimental Results
•
Machine: 500Mhz PentiumII, 512MB memory, 9GB disk, RHEL 6.0
•
Synthetic Data: Web browsing
•
Parameters: N = #Labels, M = #Nodes, F = Max Fanout, D = Max Depth, T = #Trees
•
Create master website tree W
•
For each node in W, generate #children (0 to F)
•
Assign probabilities of following each child or to backtrack; adding up to 1
•
Recursively continue until D is reached
•
Generate a database of T subtrees of W
•
Start at root. Recursively at each node generate a random number
•
(0 – 1) to decide which child to follow or to backtrack.
•
Default parameters: N=100, M=10,000, D=10, F=10, T=100,000
•
Three Datasets: D10 (all default values), F5 (F=5), T1M (T=106)
•
Real Data: CSLOGS – 1 month web log files at RPI CS
•
Over 13361 pages accessed (#labels)
•
Obtained 59,691 user browsing trees (#number of trees)
•
Average string length of 23.3
per tree
53
Distribution of Frequent
Trees
F5: Max-Fanout = 5
T1M: 106 Trees
Sparse
Dense
Take-Home Point: Many large, frequent trees can be
discovered.
54
Experiments (Sparse)
Sparse
Dense
Take-Home Point: Both algorithms are able to cope with
relatively short patterns in sparse
data.
55
Experiments (Dense)
Sparse
Dense
(Artificial Dataset)
(Real-World Dataset)
Take-Home Point: Long patterns at low-support (length=20); the
level-wise approach suffers.
The authors use the artificial dataset to justify TreeMiner as 20
times faster than pattern matcher.
56
Outline
•
Graph Mining Overview
•
Mining Complex Structures - Introduction
•
Motivation and Contributions of author
•
Problem Definition and Case Examples
•
Main Ingredients for Efficient Pattern Extraction
•
Experimental Results
•
Conclusions
57
Conclusions
•
•
TREEMINER: A novel tree mining approach.
•
Non-duplicate candidate generation.
•
Scope-List joins for frequency comparison.
Framework for tree-mining tasks
•
Frequent subtrees in a forest of rooted, labeled, ordered trees.
•
Frequent subtrees in a single tree.
•
There are extensions for unlabeled and unordered trees.
58
Caveats
•
Frequent does not always mean significant!
•
Exhaustive enumeration is a problem even though the
candidate generation in TreeMiner is efficient.
•
Low min_sup values increases true positives at the cost of
increasing false positives.
•
State-of-the-art graph miners make use of the structure of the
search space (e.g. the LEAP search algorithm) to extract only
significant structures.
•
Candidate structures can be generated by tree miners and
evaluated by some other mean.
59
Question One
Generate an embedded subtree or subforest for the set of
nodes Ns = {v1, v2, v5}. Is this an embedded subtree or
subforest, and why? Assume a labeling function L(x) = x.
v1
v0
v1
v2
v6
v7
v8
v3
v4
v2
v5
This is an embedded subtree
because all of the nodes are
connected.
v5
60
Question Two
Why is the frequency of subgraphs a good function to
evaluate candidate patterns? How could it be better?
Answer. The frequency of subgraphs is a monotonically
decreasing function, meaning supergraphs are not more
frequent than subgraphs. This is a desirable property
combined with a minimum support threshold to reduce the
search space as subgraph patterns get bigger.
However, frequency does not always imply significance –
another metric must be used to evaluate the candidates
generated by a graph miner for significance.
61
Question Three
How is a string representation of a tree useful in graph
mining? What requirements does it place on the graph?
Answer. A string representation of a tree is useful
because string comparisons are worst-case O(n) and
can be easily optimized. However, it requires that a tree
be rooted and ordered, because otherwise the string
comparison operator would not be valid.
62