Mining Significant Graph Patterns by Leap Search

Download Report

Transcript Mining Significant Graph Patterns by Leap Search

Mining Significant
Graph Patterns by
Leap Search
Shachar Langer
Background (1)
Background (2)
Background (3)
How much data is out there?
• iTunes has over 11 million music tracks.
• Amazon has over 7.5 million titles.
• 100 hours of video are uploaded to YouTube every minute.
• In 2012, we created 2.5 quintillion bytes of data every day.
Data Mining
Definition
The practice of examining large databases in
order to generate new information.
Data Mining
Applications
• Financial Data Analysis
• Biological Data Analysis
• Retail Industry
• Telecommunication Industry
… and many more!
Branch and Bound
Example
Mining Significant
Graph Patterns by
Leap Search
Xifeng Yan (IBM T. J. Watson)
Hong Cheng, Jiawei Han (UIUC)
Philip S. Yu (UIC)
Frequent Graph Pattern
Given a graph dataset D = {𝐺1 , 𝐺2 , … , 𝐺𝑛 } and a minimum frequncy
threshold 𝜃, find subgraphs g, s.t.
𝑓𝑟𝑒𝑞(𝑔) ≥ 𝜃
Where freq(g) is the percentage of graphs in D that contain g.
The are many enumeration algorithm available: AGM, FSG, gSpan,
Path-join and so on.
Problems
1. Exponential Pattern Set.
2. Redundant frequent subgraphs.
Optimal Graph Pattern Mining
Given a graph dataset D = {𝐺1 , 𝐺2 , … , 𝐺𝑛 } and an objective function
F(g), where g is a subgraph in D, find a graph pattern g* such that
F(g*) is maximized.
Notations
• V(g) – The vertex set of a graph g.
• E(g) – The edge set of a graph g.
• 𝑔 ⊆ 𝑔′ - g is a subgraph of g’. g’ is a super-graph of g.
Frequency
Given a graph dataset D = {𝐺1 , 𝐺2 , … , 𝐺𝑛 } and a subgraph g, the
supporting graph set of g is 𝐷𝑔 = 𝐺 𝑔 ⊆ 𝐺, 𝐺 ∈ 𝐷 . The frequency of
g is
𝐷𝑔
𝐷
.
Objective Functions
For a given objective function, optimal graph patterns are the most
significant ones. Examples:
• Chi-square
• G-test
• Information Gain
Branch-and-Bound
Is it the best we can do?
No! We will view a mining strategy called LEAP (Descending Leap
Mine) that includes two new mining concepts:
1. structural leap search
2. Frequency-descending mining
Problem Setting
Non-Monotonicity (1)
Graph patterns are enumerated in increasing order of their size.
Unlike frequency measure, objective functions such as G-test
score and information gain are
neither monotonic nor
anti-monotonic with respect to
graph size.
But…
Non-Monotonicity (2)
Let p(g) and q(g) be the frequency of g in positive graphs and negative graphs
(sometimes written p and q). They are called positive frequency and negative
frequency respectively.
Assume F(g) is a function of p, q,
F(g)=f(p(g),q(g))
Non-Monotonicity (3)
Although F(g) might not be anti-monotonic, it follows a basic rule:
If the frequency difference of a pattern in the positive dataset and the
negative dataset increases, the pattern become more significant
Non-Monotonicity (4)
Mathematically,
Non-Monotonicity (5)
G-test as an example
G-test tells whether the frequency of a pattern in the
positive dataset fits its distribution in the negative dataset.
Non-Monotonicity (6)
G-test as an example
With some simple calculations, we have
G-test score follows the property shown before!
Non-Monotonicity (7)
Upper Bound
An upper bound for graph g can now be:
Structural Leap Search (1)
In the existing branch-and-bound method we only perform
“vertical” pruning.
Can sibling branches be pruned too? Can we
prune “horizontally”?
Structural Leap Search (2)
Yes!
This is called prune by structural proximity.
Many branches exhibit strong similarity in their frequencies
and their significance.
Structural Leap Search (3)
Frequency-Descending Mining (1)
When a complex objective is considered, pattern frequency
is often put aside and never use in existing solutions,
but if we rank all of subgraph patterns according to their
frequency, significant graph patterns often have a high
rank.
This phenomenon is called frequency association.
Frequency-Descending Mining (2)
Existing solutions have to set the frequency threshold very
low so that the optimal pattern will not be missed.
Unfortunately, low-frequency threshold could generate a
huge set of low-significance redundant patterns with long
mining time
Frequency-Descending Mining (3)
The researchers found that if all of frequent subgraphs are
ranked in increasing order of their frequency, significant
subgraph patterns are often in the
high-end range.
Frequency-Descending Mining (4)
Descending Leap Mine
Experiments
Conclusions
• We examined an increasingly important issue in frequent subgraph mining.
• The huge number of frequent subgraphs makes it hard to analyze.
• A comprehensive study on general mining strategy was performed,
which is able to mine the most significant graph pattern measured by
different kind of non-monotonic objective functions.
Thank you
for your attention!