graph-mining

Download Report

Transcript graph-mining


Introduction

Important Concepts in MCL Algorithm

MCL Algorithm

The Features of MCL Algorithm

Summary


Decompose a network into subnetworks
based on some topological properties
Usually we look for dense subnetworks
3
Algorithms:
Exact: have proven solution quality and time
complexity
 Approximate: heuristics are used to make them
efficient
Example algorithms:
 Highly connected subgraphs (HCS)
 Restricted neighborhood search clustering (RNSC)
 Molecular Complex Detection (MCODE)
 Markov Cluster Algorithm (MCL)

4

Intuition:
◦ High connected nodes could be in one cluster
◦ Low connected nodes could be in different clusters.

Model:
◦ A random walk may start at any node
◦ Starting at node r, if a random walk will reach node
t with high probability, then r and t should be
clustered together.
An undirected graph and its adjacency matrix representation.
An undirected graph and its adjacency list representation.
Theorem. Let M be the adjacency matrix for
graph G. Then each (i, j) entry in M r is the
number of paths of length r from vertex i to vertex
j.
Note: This is the standard power of m,
not a Boolean product.
7
8
9

Graph power
◦ The kth power of a graph G: a graph with the same
set of vertices as G and an edge between two
vertices iff there is a path of length at most k
between them
◦ The number of paths of length k between any two
nodes can be calculated by raising adjacency matrix
of G to the exponent k
◦ Then, G’s kth power is defined as the graph whose
adjacency matrix is given by the sum of the first k
powers of the adjacency matrix:
10
G
G2
G3
11


Given a weighted graph G(V,E,w), the all-pairs
shortest paths problem is to find the shortest
paths between all pairs of vertices vi, vj ∈ V.
A number of algorithms are known for solving
this problem.



Consider the multiplication of the weighted
adjacency matrix with itself - except, in this
case, we replace the multiplication operation
in matrix multiplication by addition, and the
addition operation by minimization.
Notice that the product of weighted
adjacency matrix with itself returns a matrix
that contains shortest paths of length 2
between any pair of nodes.
It follows from this argument that An contains
all shortest paths.

Markov process
◦ The probability that a random will take an edge at
node u only depends on u and the given edge.
◦ It does not depend on its previous route.
◦ This assumption simplifies the computation.



Flow of network is used to approximate the
partition
There is an initial amount of flow injected
into each node.
At each step, a percentage of flow will goes
from a node to its neighbors via the outgoing
edges.

Edge Weight
◦ Similarity between two nodes
◦ Considered as the bandwidth or connectivity.
◦ If an edge has higher weight than the other, then
more flow will be flown over the edge.
◦ The amount of flow is proportional to the edge
weight.
◦ If there is no edge weight, then we can assign the
same weight to all edges.

Two natural clusters
A

B
When the flow reaches the border points, it is likely to
return back, than cross the border.

When the flow reaches A, it has four possible
outcomes.
◦ Three back into the cluster, one leak out.
◦ ¾ of flow will return, only ¼ leaks.


Flow will accumulate in the center of a cluster
(island).
The border nodes will starve.

Simualtion of Random Flow in graph

Two Operations: Expansion and Inflation

Intrinsic relationship between MCL process
result and cluster structure



Observation 1:
The number of Higher-Length paths in G is
large for pairs of vertices lying in the same
dense cluster
Small for pairs of vertices belonging to
different clusters


Oberservation 2:
A Random Walk in G that visits a dense
cluster will likely not leave the cluster until
many of its vertices have been visited


nxn Adjacency matrix A.
◦ A(i,j) = weight on edge from i to j
◦ If the graph is undirected A(i,j)=A(j,i), i.e.
A is symmetric
nxn Transition matrix P.
◦ P is row stochastic
◦ P(i,j) = probability of stepping on node j
from node i
= A(i,j)/∑iA(i,j)
• Flow: Transition probability from a node to another node.
• Flow matrix: Matrix with the flows among all nodes; ith
column represents flows out of ith node. Each column sums
to 1.
1
2
0.5
1
3
0.5
2
1
Flow
3
1
Matrix
1
2
3
1
0
0.5
0
2
1.0
0
1.0
3
0
0.5
0
24
Adjacency matrix A
Transition matrix P
1
1
1
1
1
1
1/2
1/2
t=0
1
1
1/2
1/2
t=1
t=0
1
1
1/2
1/2
1
1
1/2
1/2
t=1
t=0
1
1
1/2
1/2
1
1/2
1
1/2
t=2
1
1
1/2
1/2
t=1
t=0
1
1
1/2
1
1
1/2
1/2
t=2
1
1
1/2
1/2
t=3
1
1
1/2
1/2
1/2


xt(i) = probability that the surfer is at node i at time t
xt+1(i) = ∑j(Probability of being at node j)*Pr(j->i)
=∑jxt(j)*P(j,i)

xt+1 = xtP = xt-1*P*P= xt-2*P*P*P = …=x0 Pt

What happens when the surfer keeps walking for a long
time?



Measure or Sample any of these—high-length
paths, random walks and deduce the cluster
structure from the behavior of the samples
quantities.
Cluster structure will show itself as a peaked
distribution of the quantities
A lack of cluster structure will result in a flat
distribution

Markov Chain

Random Walk on Graph

Some Definitions in MCL



A Random Process with Markov Property
Markov Property: given the present state,
future states are independent of the past
states
At each step the process may change its state
from the current state to another state, or
remain in the same state, according to a
certain probability distribution.



A walker takes off on some arbitrary vertex
He successively visits new vertices by
selecting arbitrarily one of outgoing edges
There is not much difference between
random walk and finite Markov chain.


Simple Graph
Simple graph is undirected graph in which
every nonzero weight equals 1.


Associated Matrix
The associated matrix of G, denoted MG ,is
defined by setting the entry (MG)pq equal to
w(vp,vq)


Markov Matrix
The Markov matrix associated with a graph G
is denoted by TG and is formally defined by
letting its qth column be the qth column of M
normalized



The associate matrix and markov matrix is
actually for matrix M+I
I denotes diagonal matrix with nonzero
element equals 1
Adding a loop to every vertex of the graph
because for a walker it is possible that he will
stay in the same place in his next step


Find Higher-Length Path
Start Point: In associated matrix that the
quantity (Mk)pq has a straightforward
interpretation as the number of paths of
length k between vp and vq
MG
(MG+I)2
MG



Flow is easier with dense regions than across
sparse boundaries,
However, in the long run, this effect
disappears.
Power of matrix can be used to find higherlength path but the effect will diminish as the
flow goes on.


Idea: How can we change the distribution of
transition probabilities such that prefered
neighbours are further favoured and less
popular neighbours are demoted.
MCL Solution: raise all the entries in a given
column to a certain power greater than 1 (e.g.
squaring) and rescaling the column to have
the sum 1 again.


Expansion Operation: power of matrix,
expansion of dense region
Inflation Operation: mention aboved,
elimination of unfavoured region
Input: A, Adjacency matrix
Initialize M to MG, the canonical
transition matrix M:= MG:= (A+I) D-1
Enhances flow to well-connected nodes
as well as to new nodes.
Expand: M := M*M
Inflate: M := M.^r (r usually
2), renormalize columns
Increases inequality in each column.
“Rich get richer, poor get poorer.”
Prune
Converged?
No
Saves memory by removing entries close
to zero.
Yes
Output clusters
53

http://www.micans.org/mcl/ani/mclanimation.html



Find attractor: the node a is an attractor if
Maa is nonzero
Find attractor system: If a is an attractor then
the set of its neighbours is called an attractor
system.
If there is a node who has arc connected to
any node of an attractor system, the node will
belong to the same cluster as that attractor
system.
Attractor Set={1,2,3,4,5,6,7,8,9,10}
The Attractor System is {1,2,3},{4,5,6,7},{8,9},{10}
The overlapping clusters are
{1,2,3,11,12,15},{4,5,6,7,13},{8,9,12,13,14,15},{10,12,13}



how many steps are requred before the
algorithm converges to a idempoent matrix?
The number is typically somewhere between
10 and 100
The effect of inflation on cluster granularity



MCL stimulates random walk on graph to find
cluster
Expansion promotes dense region while
Inflation demotes the less favoured region
There is intrinsic relationship between MCL
result and cluster structure
The original algorithm for clustering graphs using stochastic
flows.
Advantages:
• Simple and elegant.
• Widely used in Bioinformatics because of its noise tolerance
and effectiveness.
Disadvantages:
• Very slow.
- Takes 1.2 hours to cluster a 76K node social network.
• Prone to output too many clusters.
- Produces 1416 clusters on a 4741 node PPI network.
Scalable Graph Clustering using
6
1