Summarize-Mine

Chen Chen1, Cindy X. Lin1, Matt Fredrikson2,
Mihai Christodorescu3, Xifeng Yan4, Jiawei Han1
1University of Illinois at Urbana-Champaign
2University of Wisconsin at Madison
3IBM T. J. Watson Research Center
4University of California at Santa Barbara
1
Outline
 Motivation
 The efficiency bottleneck encountered in big networks
 Patterns must be preserved
 Summarize-Mine
 Experiments
 Summary
2
3
Frequent Subgraph Mining
 Find all subgraph patterns p whose support |Dp|, the number of graphs in database D that contain p, is at least min_sup
 Exposes the topological structure of graph data
 Useful for many downstream applications
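As a rough illustration of the support count |Dp| defined above, the Python sketch below checks by brute force whether a small labeled pattern occurs in each database graph. The graph representation (a dict of node labels plus a set of undirected edges) and the names `contains`, `support`, and `edge_set` are assumptions made for this example, not part of the slides; the exhaustive search is only workable on toy inputs.

```python
from itertools import permutations

def contains(graph, pattern):
    """Return True if `pattern` has at least one embedding in `graph`.

    Both arguments are (labels, edges) pairs: `labels` maps node -> label,
    `edges` is a set of frozenset({u, v}) undirected edges.
    Brute force: try every injective mapping of pattern nodes to graph nodes.
    """
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        # Labels must agree on every mapped node.
        if any(p_labels[u] != g_labels[mapping[u]] for u in p_nodes):
            continue
        # Every pattern edge must map onto an existing graph edge.
        if all(frozenset(mapping[u] for u in e) in g_edges for e in p_edges):
            return True
    return False

def support(database, pattern):
    """Number of database graphs that contain the pattern, i.e. |Dp|."""
    return sum(1 for g in database if contains(g, pattern))

def edge_set(pairs):
    return {frozenset(p) for p in pairs}

# Toy database of three labeled graphs (hypothetical data).
G1 = ({1: "a", 2: "b", 3: "c"}, edge_set([(1, 2), (2, 3)]))
G2 = ({1: "a", 2: "b"}, edge_set([(1, 2)]))
G3 = ({1: "a", 2: "c"}, edge_set([(1, 2)]))
database = [G1, G2, G3]

# Pattern: a single edge between an "a" node and a "b" node.
pattern = ({"u": "a", "v": "b"}, edge_set([("u", "v")]))

sup = support(database, pattern)
is_frequent = sup >= 2  # min_sup = 2 in this toy setting
```

The permutation loop is exactly the kind of embedding enumeration that becomes too expensive on big graphs.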
4
Challenges
 Subgraph isomorphism checking is inevitable for any
frequent subgraph mining algorithm
 This becomes a problem on big networks
 Suppose there is only one triangle in the network
 But there are 1,000,000 length-2 paths
 We must enumerate all these 1,000,000, because any one
of them has the potential to grow into a full triangle
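The imbalance between candidate paths and completed triangles can be reproduced on a small hypothetical network: a hub joined to many leaves, plus a single leaf-to-leaf edge. The construction and the function names below are assumptions for illustration only; the leaf count is kept small so the sketch runs instantly.

```python
from itertools import combinations

def build_star_with_one_triangle(n_leaves):
    """Hub node 0 joined to leaves 1..n_leaves, plus one leaf-leaf edge (1, 2)."""
    adj = {v: set() for v in range(n_leaves + 1)}

    def add_edge(u, v):
        adj[u].add(v)
        adj[v].add(u)

    for leaf in range(1, n_leaves + 1):
        add_edge(0, leaf)
    add_edge(1, 2)  # closes exactly one triangle: 0-1-2
    return adj

def count_length2_paths(adj):
    # A length-2 path is a center node plus an unordered pair of its neighbors.
    return sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())

def count_triangles(adj):
    # Try to close every length-2 path; each triangle is found once per corner.
    closed = 0
    for center, nbrs in adj.items():
        for u, v in combinations(sorted(nbrs), 2):
            if v in adj[u]:
                closed += 1
    return closed // 3

adj = build_star_with_one_triangle(100)
n_paths = count_length2_paths(adj)    # candidates that have to be examined
n_triangles = count_triangles(adj)    # patterns that actually complete
```

The number of length-2 paths grows quadratically with the hub degree while the triangle count stays at one, so almost all of the enumeration work is wasted.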
5
Too Many Embeddings
 Subgraph isomorphism is NP-hard
 So, when the problem size increases, …
 During the checking, large graphs are grown from
small subparts
 For small subparts, there might be too many
(overlapped) embeddings in a big network
 Enumerating all of these embeddings eventually becomes infeasible
6
Motivating Application
 System call graphs from security research
 Model dependencies among system calls
 Unique subgraph signatures for malicious programs
 Compare malicious/benign programs
 These graphs are very big
 Thousands of nodes on average
 We tried state-of-the-art mining techniques, but they failed
7
Our Approach
 Subgraph isomorphism checking cannot be done on
large networks
 So we do it on small graphs
 Summarize-Mine
 Summarize: Merge nodes by label and collapse
corresponding edges
 Mine: Now, state-of-the-art algorithms should work
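A minimal sketch of the Summarize step, assuming each node is assigned uniformly at random to one of a fixed number of groups for its label, and that edges falling inside a single group are simply dropped. Both choices, along with the function name `summarize` and its parameters, are assumptions for illustration; the slides do not spell out these details.

```python
import random

def summarize(labels, edges, groups_per_label, rng):
    """Randomly merge same-label nodes into groups and collapse edges.

    labels: dict node -> label
    edges: iterable of (u, v) undirected edges
    groups_per_label: dict label -> number of groups for that label
    Returns (summary_labels, summary_edges, group_of).
    """
    group_of = {}
    for node, label in labels.items():
        # A summary node is identified by (label, group index).
        group_of[node] = (label, rng.randrange(groups_per_label[label]))

    summary_labels = {g: g[0] for g in group_of.values()}

    summary_edges = set()
    for u, v in edges:
        gu, gv = group_of[u], group_of[v]
        if gu != gv:  # assumption: edges inside one group are not kept
            summary_edges.add(frozenset((gu, gv)))  # parallel edges collapse
    return summary_labels, summary_edges, group_of

# Hypothetical input graph with labels a, b, c.
labels = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "c"}
edges = [(1, 4), (2, 4), (3, 5), (4, 6), (5, 6), (1, 2)]
s_labels, s_edges, group_of = summarize(
    labels, edges, {"a": 2, "b": 1, "c": 1}, random.Random(0)
)
```

Re-running with a different random seed gives a different summary, which is what makes multiple rounds useful for recovering missed patterns.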
8
Mining after Summarization
[Figure: original graphs G1, G2, … (nodes labeled a, b, c) are summarized into smaller graphs g1, g2, …; mining runs on the summaries and produces the output]
9
Remedy for Pattern Changes
 Frequent subgraphs are presented on a different
abstraction level
 False negatives & false positives, compared to true
patterns mined from the un-summarized database D
 False negatives (recover)
 Randomized technique + multiple rounds
 False positives (delete)
 Verify against D
 Substantial work can be transferred to the summaries
10
Outline
 Motivation
 Summarize-Mine
 The algorithm flow-chart
 Recovering false negatives
 Verifying false positives
 Experiments
 Summary
11
12
False Negatives
 For a pattern p, if each of its vertices bears a different label,
then the embeddings of p must be preserved after
summarization
 Since we are merging groups of vertices by label, the nodes
of p should stay in different groups
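 Otherwise, an embedding can be lost when two same-label vertices of p are merged into one group
[Figure: graph Gi, pattern p, and summary gi, with nodes labeled a, b, c]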
13
Missing Prob. of Embeddings
 Suppose
 Assign xj nodes for label lj (j=1,…,L) in the summary Si =>
xj groups of nodes with label lj in the original graph Gi
 Pattern p has mj nodes with label lj
 Then
14
No “Collision” for Same Labels
 Consider a specific embedding f: p->Gi, f is preserved if
vertices in f(p) stay in different groups
 Randomly assign mj nodes with label lj to xj groups; the probability that they will not “collide” is $\frac{x_j (x_j - 1) \cdots (x_j - m_j + 1)}{x_j^{m_j}}$
 Multiply these probabilities over the labels j = 1,…,L, treating them as independent events
15
Example
 A pattern with 5 labels, each label => 2 vertices
 m1 = m2 = m3 = m4 = m5 = 2
 Assign 20 nodes in the summary (i.e., 20 node groups
in the original graph) for each label
 The summary has 100 vertices
 x1 = x2 = x3 = x4 = x5 = 20
 The probability that an embedding will persist: $\frac{19}{20} \times \frac{19}{20} \times \frac{19}{20} \times \frac{19}{20} \times \frac{19}{20} \approx 0.774$
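The arithmetic in this example can be checked with a short Python sketch. It assumes each of the mj same-label nodes is dropped independently and uniformly into one of xj groups, so the no-collision probability is xj(xj-1)…(xj-mj+1) / xj^mj; the function names are made up for this illustration.

```python
from fractions import Fraction

def no_collision_probability(m, x):
    """Probability that m nodes, each placed uniformly at random into one of
    x groups, all land in different groups: x(x-1)...(x-m+1) / x^m."""
    prob = Fraction(1)
    for i in range(m):
        prob *= Fraction(x - i, x)
    return prob

def persist_probability(m_counts, x_counts):
    """Multiply the per-label probabilities, treating labels as independent."""
    prob = Fraction(1)
    for m, x in zip(m_counts, x_counts):
        prob *= no_collision_probability(m, x)
    return prob

# 5 labels, 2 pattern vertices per label, 20 groups per label.
q = persist_probability([2] * 5, [20] * 5)
q_rounded = round(float(q), 3)  # 0.774, as on the slide
```

Exact fractions are used so the result can be compared against (19/20)^5 without floating-point noise.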
16
Extend to Multiple Graphs
 Set x1,…,xL to the same values across all Gi’s in the database
 The persistence probability then only depends on m1,…,mL, i.e., pattern p’s vertex label distribution
 We denote this probability as q(p)
 Each of p’s support graphs in D has a probability of at least q(p) of continuing to support p
 Thus, the overall support can be bounded below by a binomial random variable
17
Support Moves Downward
18
False Negative Bound
19
Example, Cont.
 As above, q(p)=0.774
 min_sup=50
min_sup'    40      39      38      37      36      35
1 round     0.5966  0.4622  0.3346  0.2255  0.1412  0.0820
2 rounds    0.3559  0.2136  0.1119  0.0508  0.0199  0.0067
3 rounds    0.2123  0.0988  0.0374  0.0115  0.0028  0.0006
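The numbers in this example appear consistent with one particular reading, sketched below. It is an assumption, since the bound itself did not survive in this transcript: the per-round miss probability is taken as P(X < min_sup') for X ~ Binomial(min_sup, q(p)), and rounds are treated as independent, so t rounds give that value to the power t. The function name and its parameters are hypothetical.

```python
from math import comb

def miss_probability(min_sup, min_sup_reduced, q, rounds=1):
    """Probability that a pattern is missed in every round (assumed model).

    Per round: P(X < min_sup_reduced) with X ~ Binomial(min_sup, q).
    Rounds are assumed independent, so the per-round value is raised
    to the power `rounds`.
    """
    per_round = sum(
        comb(min_sup, k) * q**k * (1 - q) ** (min_sup - k)
        for k in range(min_sup_reduced)
    )
    return per_round ** rounds

q = 0.774
# Rows: number of rounds; columns: min_sup' = 40, 39, ..., 35.
table = {
    t: [miss_probability(50, s, q, rounds=t) for s in range(40, 34, -1)]
    for t in (1, 2, 3)
}
```

Under this reading, lowering min_sup' and adding rounds both shrink the miss probability, which matches the trend in the example.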
20
False Positives
[Figure: graph Gi, pattern p, and summary gi, with nodes labeled a, b, c]
 Much easier to handle
 Just check against the original database D
 Discard if this “actual” support is less than min_sup
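The check against D can be sketched as a simple filter. To keep the example short, the containment test is passed in as a function, and the toy version used here (patterns and graphs as sets of labeled edges, containment as subset testing) is only a stand-in for real subgraph isomorphism; all names are hypothetical.

```python
def verify_candidates(candidates, database, contains, min_sup):
    """Keep only candidates whose actual support in `database` reaches min_sup.

    contains(graph, pattern) -> bool is supplied by the caller; in a real
    miner it would be a subgraph isomorphism test.
    """
    verified = {}
    for pattern in candidates:
        actual_support = sum(1 for graph in database if contains(graph, pattern))
        if actual_support >= min_sup:
            verified[pattern] = actual_support
    return verified

# Toy stand-in: graphs and patterns are frozensets of labeled edges, and
# "containment" is plain subset testing (NOT real subgraph isomorphism).
def toy_contains(graph, pattern):
    return pattern <= graph

database = [
    frozenset({("a", "b"), ("b", "c")}),
    frozenset({("a", "b"), ("a", "c")}),
    frozenset({("a", "b")}),
]
candidates = [
    frozenset({("a", "b")}),               # actual support 3
    frozenset({("a", "b"), ("b", "c")}),   # actual support 1: false positive
]
result = verify_candidates(candidates, database, toy_contains, min_sup=2)
```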
21
The Same Skeleton as gSpan
 DFS code tree
 Depth-first search
 Minimum DFS code?
 Check support by
isomorphism tests
 Record all one-edge
extensions along the way
 Pass down the projected
database and recurse
22
Integrate Verification Schemes
 Top-Down and Bottom-Up
 Possible factors
 Amount of false positives
 Top-down verification can be performed early
 Top-down preferred by experiments
[Figure: patterns p1 and p2 with their transaction ID lists, p1 => Dp1 and p2 => Dp2. Annotations: “Just search within Dp1”; “Just search within D-Dp2; if frequent, can stop”]
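One way to read the two verification orders, assuming p1 is a subpattern of p2, so that every graph containing p2 also contains p1. Bottom-up verifies p1 first and then only searches p1's transaction ID list for p2; top-down verifies p2 first and only searches the remaining graphs for p1. The function names and the subset-based containment test are assumptions made for this sketch.

```python
def support_ids(pattern, database, contains, search_ids):
    """IDs in `search_ids` whose graph contains `pattern`."""
    return {i for i in search_ids if contains(database[i], pattern)}

def verify_bottom_up(p1, p2, database, contains):
    """Verify the smaller pattern first; p2 is only searched within D_p1."""
    all_ids = set(range(len(database)))
    d_p1 = support_ids(p1, database, contains, all_ids)
    d_p2 = support_ids(p2, database, contains, d_p1)
    return d_p1, d_p2

def verify_top_down(p1, p2, database, contains, min_sup):
    """Verify the larger pattern first.

    If p2 is frequent, p1 must be frequent too and is not searched at all
    (returned as None). Otherwise p1 is only searched within D - D_p2.
    """
    all_ids = set(range(len(database)))
    d_p2 = support_ids(p2, database, contains, all_ids)
    if len(d_p2) >= min_sup:
        return None, d_p2
    d_p1 = d_p2 | support_ids(p1, database, contains, all_ids - d_p2)
    return d_p1, d_p2

# Toy stand-in for subgraph isomorphism: subset testing on labeled edge sets.
def toy_contains(graph, pattern):
    return pattern <= graph

database = [
    frozenset({("a", "b"), ("b", "c")}),
    frozenset({("a", "b"), ("a", "c")}),
    frozenset({("a", "b")}),
]
p1 = frozenset({("a", "b")})
p2 = frozenset({("a", "b"), ("b", "c")})

bottom_up = verify_bottom_up(p1, p2, database, toy_contains)
top_down = verify_top_down(p1, p2, database, toy_contains, min_sup=2)
top_down_early_stop = verify_top_down(p1, p2, database, toy_contains, min_sup=1)
```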
23
Summary-Guided Verification
 Substantial verification work can be performed on the summaries as well
24
Iterative Summarize-Mine
 Use a single pattern tree to hold all results spanning
across multiple iterations
 No need to combine pattern sets in a final step
 Avoid verifying patterns that have already been checked
by previous iterations
 Verified support graphs are accurate, so they can help pre-pruning in later iterations
 Details omitted
25
Outline
 Motivation
 Summarize-Mine
 Experiments
 Summary
26
Dataset
 Real data
 W32.Stration, a family of mass-mailing worms
 W32.Virut, W32.Delf, W32.Ldpinch, W32.Poisonivy, etc.
 Vertex # up to 20,000 and edge # even higher
 Avg. # of vertices: 1,300
 Synthetic data
 Size, # of distinct node/edge labels, etc.
 Generator details omitted
27
A Sample Malware Signature
 Mined from W32.Stration
 A malware reading and leaking certain registry
settings related to the network devices
28
Comparison with gSpan
 gSpan is an efficient graph pattern mining algorithm
 Graphs of different sizes are randomly drawn
 Beyond a certain graph size, gSpan no longer works
29
The Influence of min_sup'
 Total vs. False Positives
 The gap corresponds to true patterns
 It gradually widens as we decrease min_sup'
30
Summarization Ratio
 10/1 node(s) before/after summarization => ratio=10
 Trading-off min_sup' and t as the inner loop
 A range of reasonable parameters in the middle
31
Scalability
 On the synthetic data
 Parameters are tuned as done above
32
Outline
 Motivation
 Summarize-Mine
 Experiments
 Summary
33
Summary
 We solve the frequent subgraph mining problem for large graphs
 We found interesting malware signatures
 Our algorithm remains efficient where state-of-the-art mining techniques do not work
 We show that patterns can be well preserved at a higher level by a good generalization scheme
 Very useful, given the emerging trend of huge networks
 The data has to be preprocessed and summarized
34
Summary
 Our method is orthogonal to much previous work on this topic => Combine for further improvement
 Efficient pattern space traversal
 Other data space reduction techniques, different from our compression within individual transactions
 Transaction sampling, merging, etc.
 They perform compression between transactions
35
36
