Summarize-Mine
Download
Report
Transcript Summarize-Mine
Chen Chen1, Cindy X. Lin1, Matt Fredrikson2,
Mihai Christodorescu3, Xifeng Yan4, Jiawei Han1
1University of Illinois at Urbana-Champaign
2University of Wisconsin at Madison
3IBM T. J. Watson Research Center
4University of California at Santa Barbara
1
Outline
Motivation
The efficiency bottleneck encountered in big networks
Patterns must be preserved
Summarize-Mine
Experiments
Summary
2
3
Frequent Subgraph Mining
Find all graphs p such that |Dp| >= min_sup
Get into the topological structures of graph data
Useful for many downstream applications
4
Challenges
Subgraph isomorphism checking is inevitable for any
frequent subgraph mining algorithm
This will have problems on big networks
Suppose there is only one triangle in the network
But there are 1,000,000 length-2 paths
We must enumerate all these 1,000,000, because any one
of them has the potential to grow into a full triangle
5
Too Many Embeddings
Subgraph isomorphism is NP-hard
So, when the problem size increases, …
During the checking, large graphs are grown from
small subparts
For small subparts, there might be too many
(overlapped) embeddings in a big network
Such embedding enumerations will finally kill us
6
Motivating Application
System call graphs from security research
Model dependencies among system calls
Unique subgraph signatures for malicious programs
Compare malicious/benign programs
These graphs are very big
Thousands of nodes on average
We tried state-of-art mining technologies, but failed
7
Our Approach
Subgraph isomorphism checking cannot be done on
large networks
So we do it on small graphs
Summarize-Mine
Summarize: Merge nodes by label and collapse
corresponding edges
Mine: Now, state-of-art algorithms should work
8
Mining after Summarization
G1
G2
a
b
c
a
c
b
a
…
Original
…
…
Summarize
g1
g2
a
c
b
a
c
b
a
…
Summary
Mining
&
Output
…
…
9
Remedy for Pattern Changes
Frequent subgraphs are presented on a different
abstraction level
False negatives & false positives, compared to true
patterns mined from the un-summarized database D
False negatives (recover)
Randomized technique + multiple rounds
False positives (delete)
Verify against D
Substantial work can be transferred to the summaries
10
Outline
Motivation
Summarize-Mine
The algorithm flow-chart
Recovering false negatives
Verifying false positives
Experiments
Summary
11
12
False Negatives
For a pattern p, if each of its vertices bears a different label,
then the embeddings of p must be preserved after
summarization
Since we are merging groups of vertices by label, the nodes
of p should stay in different groups
Otherwise,
Gi
a
p
a
c
b
a
b
gi
a
...
c
b
c
13
Missing Prob. of Embeddings
Suppose
Assign xj nodes for label lj (j=1,…,L) in the summary Si =>
xj groups of nodes with label lj in the original graph Gi
Pattern p has mj nodes with label lj
Then
14
No “Collision” for Same Labels
Consider a specific embedding f: p->Gi, f is preserved if
vertices in f(p) stay in different groups
Randomly assign mj nodes with label lj to xj groups,
the probability that they will not “collide” is:
Multiply probabilities for independent events
15
Example
A pattern with 5 labels, each label => 2 vertices
m1 = m2 = m3 = m4 = m5 = 2
Assign 20 nodes in the summary (i.e., 20 node groups
in the original graph) for each label
The summary has 100 vertices
x1 = x2 = x3 = x4 = x5 = 20
The probability that an embedding will persist
19 19 19 19 19
0.774
20 20 20 20 20
16
Extend to Multiple Graphs
Setting x1,…,xL to the same values across all Gi’s in the
database
only depends on m1,…,mL, i.e., pattern p’s vertex
label distribution
We denote this probability as q(p)
For each of p’s support graphs in D, it has a probability
of at least q(p) to continue support p
Thus, the overall support can be bounded below by a
binomial random variable
17
Support Moves Downward
18
False Negative Bound
19
Example, Cont.
As above, q(p)=0.774
min_sup=50
min_sup'
40
39
38
37
36
35
1 round
0.5966
0.4622
0.3346
0.2255
0.1412
0.0820
2 rounds
0.3559
0.2136
0.1119
0.0508
0.0199
0.0067
3 rounds
0.2123
0.0988
0.0374
0.0115
0.0028
0.0006
20
False Positives
Gi
a
p
a
a
b
gi
a
a
c
a
b
c
b
c
Much easier to handle
Just check against the original database D
Discard if this “actual” support is less than min_sup
21
The Same Skeleton as gSpan
DFS code tree
Depth-first search
Minimum DFS code?
Check support by
isomorphism tests
Record all one-edge
extensions along the way
Pass down the projected
database and recurse
22
Integrate Verification Schemes
Top-Down and Bottom-Up
Possible factors
Amount of false positives
Top-down
verification
Transaction ID
list for p1 => Dcan
p1
be performed early
Top-down preferred
Just
search
within
D-D
Just
search
within
Dpp12;
if frequent, can stop
by experiments
Transaction ID list for p2 => Dp2
23
Summary-Guided Verification
Substantial verification work can be performed on the
summaries, as well
Got it!
24
Iterative Summarize-Mine
Use a single pattern tree to hold all results spanning
across multiple iterations
No need to combine pattern sets in a final step
Avoid verifying patterns that have already been checked
by previous iterations
Verified support graphs are accurate, they can help prepruning in later iterations
Details omitted
25
Outline
Motivation
Summarize-Mine
Experiments
Summary
26
Dataset
Real data
W32.Stration, a family of mass-mailing worms
W32.Virut, W32.Delf, W32.Ldpinch, W32.Poisonivy, etc.
Vertex # up to 20,000 and edge # even higher
Avg. # of vertices: 1,300
Synthetic data
Size, # of distinct node/edge labels, etc.
Generator details omitted
27
A Sample Malware Signature
Mined from W32.Stration
A malware reading and leaking certain registry
settings related to the network devices
28
Comparison with gSpan
gSpan is an efficient graph pattern mining algorithm
Graphs with different size are randomly drawn
Eventually, gSpan cannot work
29
The Influence of min_sup'
Total vs. False Positives
The gap corresponds to true patterns
It gradually widens as we decrease min_sup'
30
Summarization Ratio
10/1 node(s) before/after summarization => ratio=10
Trading-off min_sup' and t as the inner loop
A range of reasonable parameters in the middle
31
Scalability
On the synthetic data
Parameters are tuned as done above
32
Outline
Motivation
Summarize-Mine
Experiments
Summary
33
Summary
We solve the frequent subgraph mining problem for
graphs with big size
We found interesting malware signatures
Our algorithm is much more efficient, while the state-
of-art mining technologies do not work
We show that patterns can be well preserved on
higher-level by a good generalization scheme
Very useful, given the emerging trend of huge networks
The data has to be preprocessed and summarized
34
Summary
Our method is orthogonal to many previous works on
this topic => Combine for further improvement
Efficient pattern space traversal
Other data space reduction techniques different from
our compression within individual transactions
Transaction sampling, merging, etc.
They perform compression between transactions
35
36