Community Structure
Download
Report
Transcript Community Structure
Computational Molecular Biology
Community Structures
What is Community Structure
Definition:
A community is a group of nodes in which:
There are more edges (interactions) between nodes within
the group than to nodes outside of it
My T. Thai
[email protected]
2
Why Community Structure (CS)?
Many systems can be expressed by a network,
in which nodes represent the objects and edges
represent the relations between them:
Social networks: collaboration, online social
networks
Technological networks: IP address networks,
WWW, software dependency
Biological networks: protein interaction networks,
metabolic networks, gene regulatory networks
My T. Thai
[email protected]
3
Why CS?
Yeast Protein interaction
networks
My T. Thai
[email protected]
4
Why CS?
IP address
network
My T. Thai
[email protected]
5
Why Community Structure?
Nodes in a community have some common
properties
Communities represent some properties of a
networks
Examples:
In social networks, represent social groupings based
on interest or background
In citation networks, represent related papers on one
topic
In metabolic networks, represent cycles and other
functional groupings
My T. Thai
[email protected]
6
How to detect a community?
My T. Thai
[email protected]
7
Early Work
Using hierarchical clustering
Overview of this method:
For each pair (u,v), calculate weight wuv which
represents how closely connected u and v are
Initialize G = (V, emptyset)
At each iteration, add an edge with the strongest
weight
My T. Thai
[email protected]
8
Early Work
My T. Thai
[email protected]
9
Early Work
How to define the weight wuv
Many different methods have been proposed:
Number disjoint paths between u and v
Number of possible paths between u and v
Disadvantages:
Tendency to separate the boundary vertices from
the communities (to which they should belong)
My T. Thai
[email protected]
10
An Overview of Recent Work
Disjoint CS
Overlapping CS
Centralized Approach
Define the quantity of modularity and use the
greedy algorithms, IP, SDP
Spectral clustering
Random Walk, Clique Percolation
Localized Approach
My T. Thai
[email protected]
11
Edge Betweeness
Focus on the edges which are least central, i.e.,,
the edges which are most “between”
communities
Instead of adding edge to G = (V, emptyset),
progressively removing edges from an original
graph G = (V,E)
My T. Thai
[email protected]
12
Edge Betweeness
Definition:
For each edge (u,v), the edge betweeness of (u,v) is
defined as the number of shortest paths between any
pair of nodes in a network that run through (u,v)
betweeness(u,v) = | { Pxy | x, y in V, Pxy is a shortest
path between x and y, and (u,v) in Pxy}|
My T. Thai
[email protected]
13
Why Edge Betweeness
My T. Thai
[email protected]
14
Algorithm
Initialize G = (V,E) representing a network
while E is not empty
Calculate the betweeness of all edges in G
Remove the edge e with the highest betweeness, G
= (V, E – e)
Indeed, we just need to recalculate the
betweeness of all edges affected by the removal
My T. Thai
[email protected]
15
Time Complexity
Let |V| = n and |E| = m
Calculate the betweeness of all edges: O(mn)
Since we need to recalculate each time we
remove an edge: O(m2n)
My T. Thai
[email protected]
16
An Example
My T. Thai
[email protected]
17
Disadvantages/Improvements
Can we improve the time complexity?
The communities are in the hierarchical form,
can we find the disjoint communities?
My T. Thai
[email protected]
18
Define the quantity (measurement) of
modularity Q and find an approximation
algorithm to maximize Q
My T. Thai
[email protected]
19
How to define Q
Let A be an adjacency matrix of the network
Fraction of edges that fall within communities
where
My T. Thai
[email protected]
20
How to define Q
What is the problem? If we try to maximize the
above equation, then we may put all nodes into
one single community
How to fix it?
My T. Thai
[email protected]
21
How to define Q?
Let kv be the degree of node v
The term kv kw /2m represents the probability of an
edge existing between vertices v and w if connections
are made random but respecting vertex degrees
My T. Thai
[email protected]
22
Greedy Algorithm
Initially, we have n communities (each node is
a community)
At each step, join two communities whose the
hierarchical tree has the largest increase in Q
Stop when we left with a single community
(run n -1 steps)
My T. Thai
[email protected]
23
Disadvantages
It is still a hierarchical approach
Cannot escape a suboptimal maximum
How to avoid the suboptimal maximum?
My T. Thai
[email protected]
24
Local Communities
My T. Thai
[email protected]
25
Overview
Find communities based on local information
(not information of entire network)
Two ways:
Detect the communities
Define local modularity, then greedily optimize this
function
My T. Thai
[email protected]
26
What We Have Learnt
Can we use this Q?
How can we “twist” it to make it work?
My T. Thai
[email protected]
27
Consider this Figure
1. Suppose we have perfect
knowledge of some
subgraphs C
2. Then we should know
some neighbors on C, lie
in U
3. Visit some neighbors of
nodes in U may extend
the knowledge of C
4. Now, can we re-use Q
(defined before) on C?
My T. Thai
[email protected]
28
Some Definitions
Again, define an adjacent matrix A (wrt C) as follows:
Consider this quantity:
where m*
1
Aij (# of edges in the partial adj matrix)
2 ij
(i, j ) 1 iff both vi and v j are in C
My T. Thai
[email protected]
29
Relationship between B, C, U
Consider nodes in B
If C is a sharp community, then
Nodes in B have more connections to nodes in C
Nodes in B have less (a few) connections to nodes
in U
My T. Thai
[email protected]
30
Definitions
Define a boundary-adjacent matrix B as follows:
Define local community R:
where δ(i,j) = 1 iff vi in B and vj in C or vice versa. Otherwise, δ(i,j) = 0.
T: #edges with one ore more endpoints in B
I: #edges in B such that none of their endpoints in U
My T. Thai
[email protected]
31
Properties of R
0<R<1
Directly proportional to the sharpness of the
boundary given by B
When R is undefined?
My T. Thai
[email protected]
32
A Greedy Algorithm
My T. Thai
[email protected]
33
Overview of Second Method
Start at a vertex, check the degree of each
vertex with respect to each one-hop neighbors,
two-hop neighbors, …, l-hops neighbors
Why?
If the community is highly connected, the lhops neighbors tend to revisit the nodes
At the boundary, the number of newly added
edges decreases
My T. Thai
[email protected]
34
Some Definitions
kie(j): Emerging degree of a vertex i which is l-hops
away vertex j is defined as the number of edges (u,i)
where u is not within l-hops away from j
Kjl: Total emerging degree of all nodes that exactly lhops away from j
where Sjl is the set of all vertices exactly l-hops away from j
Initially, Kj0= degree of node j = kj
My T. Thai
[email protected]
35
Some Definitions
The change in total emerging degree
My T. Thai
[email protected]
36
Algorithm
1. Randomly choose a starting vertex j
2. Initially, l = 0, add j to C (C is a community),
and K0j = kj
3. l = l++; add all l-hop neighbors of j to C
4. Compute ΔKjl. If ΔKjl < α, then return C.
Otherwise, repeat step 3
My T. Thai
[email protected]
37
Any Problem?
How to define α? Do we need to define α? If not,
what should we change?
What if the starting vertex is the “bridge one”? What
can we do?
My T. Thai
[email protected]
38
Impact of α
α = 0, never stop until explore the entire
connected subgraph
α is large, stop sooner (l is small), resulting in
many small communities
α is too large, return n singleton communities
(α > kmax where kmax is the largest degree)
My T. Thai
[email protected]
39
A Small Example
Obtained by the Alg
Actual CS of the Karate Club
My T. Thai
[email protected]
40
Dynamic Communities
My T. Thai
[email protected]
41
Dynamic Networks
A dynamic network
A collection of network snapshots at many time points.
Changes are frequently introduced
Insertions / Removals of nodes
Insertions / Removals of edges
t=0
t=1
t=2
t=3
• Event decomposition
– Insertions / Removals of nodes = {Insert/Remove
a node}
– Insertions / Removals of edges = {Insert/Remove
an edge}
42
Recall…
• Modularity function
• However, max Q is NP-hard
43
Adaptive Solutions
An adaptive method:
To maximize the gained modularity with low
computational complexity
Locally compute a new structure based on local
information after each change of the network
A basic community structure
Only for the first snapshot
Adaptively update the network communities
based on this basic structure
Method: Blondel et al (2008)
44
QCA: An Adaptive Method
Input
network
Blondel’s
method
:
:
Basic communities
Network
changes
• Need to handle
–
–
–
–
Node insertion
Edge insertion
Node removal
Edge removal
45
Updated
communities
Membership determination
A node actively determines its
Membership
S
u
FinS(u)
FoutC(u)
C
46
Introducing a new node
Possibilities
No new edges
New edges linking with one community
New edges linking multiple communities
u
C1
C2
C3
47
Handling node insertion
u
C1
C2
Fout
C2(u)
C3
Join u to the community C
with the highest FoutC(u)
FoutC1(u)
FoutC3(u)
48
Introducing a new edge
Possibilities
A new edge is inside a single community
A new edge is joining two communities
u
v
v
u
49
Handling edge insertion
Keep the current community structure intact
b
a
a
b
– Find qu,C,D and qv,D,C
– Join a to C or D according
to qu,C,D and qv,C,D
– If a (or b) changes its
membership
• Check all a’s neighbors
50
Time complexity
Inserting a new node
Visit all neighbors of u at most once
O(du)
Inserting a new edge
Computing qu,C(u),C(v) and in constant time
O(1)
51
Removing an edge
Resulting community is either:
Remains unchanged
Breaks up into smaller communities
If it contains substructures that are less attractive to the
others
u
v
52
Handling edge removal
Strategy
Find maximal ‘quasi’-cliques
Let the other singletons determine their best
communities
53
Removing a node
All edges connected to u will be removed
Resulting community either
Remains unchanged
Breaks up into smaller spices and merged to others
54
Handling node removal
Strategy
3-Clique percolation
Let the left over nodes determine their best
communities
55
Experimental Results
Test our algorithms on real-world data traces
Enron email, ArXiv citation and Facebook networks
In comparison with the Blondel’s method at each snapshot
• Metrics
–
–
–
–
Modularity
Number of communities
Normalized Mutual Information (NMI)
Running time
56
ArXiv e-print Citation network
Modularity
# Communities
57
NMI
Running Time
Facebook
Modularity
# Communities
58
NMI
Running Time