Community Structure

Download Report

Transcript Community Structure

Computational Molecular Biology
Community Structures
What is Community Structure
 Definition:
 A community is a group of nodes in which:
 There are more edges (interactions) between nodes within
the group than to nodes outside of it
My T. Thai
[email protected]
2
Why Community Structure (CS)?
 Many systems can be expressed by a network,
in which nodes represent the objects and edges
represent the relations between them:
 Social networks: collaboration, online social
networks
 Technological networks: IP address networks,
WWW, software dependency
 Biological networks: protein interaction networks,
metabolic networks, gene regulatory networks
My T. Thai
[email protected]
3
Why CS?
Yeast Protein interaction
networks
My T. Thai
[email protected]
4
Why CS?
IP address
network
My T. Thai
[email protected]
5
Why Community Structure?
 Nodes in a community have some common
properties
 Communities represent some properties of a
networks
 Examples:
 In social networks, represent social groupings based
on interest or background
 In citation networks, represent related papers on one
topic
 In metabolic networks, represent cycles and other
functional groupings
My T. Thai
[email protected]
6
How to detect a community?
My T. Thai
[email protected]
7
Early Work
 Using hierarchical clustering
 Overview of this method:
 For each pair (u,v), calculate weight wuv which
represents how closely connected u and v are
 Initialize G = (V, emptyset)
 At each iteration, add an edge with the strongest
weight
My T. Thai
[email protected]
8
Early Work
My T. Thai
[email protected]
9
Early Work
 How to define the weight wuv
 Many different methods have been proposed:
 Number disjoint paths between u and v
 Number of possible paths between u and v
 Disadvantages:
 Tendency to separate the boundary vertices from
the communities (to which they should belong)
My T. Thai
[email protected]
10
An Overview of Recent Work
 Disjoint CS
 Overlapping CS
 Centralized Approach
 Define the quantity of modularity and use the
greedy algorithms, IP, SDP
 Spectral clustering
 Random Walk, Clique Percolation
 Localized Approach
My T. Thai
[email protected]
11
Edge Betweeness
 Focus on the edges which are least central, i.e.,,
the edges which are most “between”
communities
 Instead of adding edge to G = (V, emptyset),
progressively removing edges from an original
graph G = (V,E)
My T. Thai
[email protected]
12
Edge Betweeness
 Definition:
 For each edge (u,v), the edge betweeness of (u,v) is
defined as the number of shortest paths between any
pair of nodes in a network that run through (u,v)
 betweeness(u,v) = | { Pxy | x, y in V, Pxy is a shortest
path between x and y, and (u,v) in Pxy}|
My T. Thai
[email protected]
13
Why Edge Betweeness
My T. Thai
[email protected]
14
Algorithm
 Initialize G = (V,E) representing a network
 while E is not empty
 Calculate the betweeness of all edges in G
 Remove the edge e with the highest betweeness, G
= (V, E – e)
 Indeed, we just need to recalculate the
betweeness of all edges affected by the removal
My T. Thai
[email protected]
15
Time Complexity
 Let |V| = n and |E| = m
 Calculate the betweeness of all edges: O(mn)
 Since we need to recalculate each time we
remove an edge: O(m2n)
My T. Thai
[email protected]
16
An Example
My T. Thai
[email protected]
17
Disadvantages/Improvements
 Can we improve the time complexity?
 The communities are in the hierarchical form,
can we find the disjoint communities?
My T. Thai
[email protected]
18
Define the quantity (measurement) of
modularity Q and find an approximation
algorithm to maximize Q
My T. Thai
[email protected]
19
How to define Q
 Let A be an adjacency matrix of the network
 Fraction of edges that fall within communities
where
My T. Thai
[email protected]
20
How to define Q
 What is the problem? If we try to maximize the
above equation, then we may put all nodes into
one single community
 How to fix it?
My T. Thai
[email protected]
21
How to define Q?
Let kv be the degree of node v
 The term kv kw /2m represents the probability of an
edge existing between vertices v and w if connections
are made random but respecting vertex degrees
My T. Thai
[email protected]
22
Greedy Algorithm
 Initially, we have n communities (each node is
a community)
 At each step, join two communities whose the
hierarchical tree has the largest increase in Q
 Stop when we left with a single community
(run n -1 steps)
My T. Thai
[email protected]
23
Disadvantages
 It is still a hierarchical approach
 Cannot escape a suboptimal maximum
 How to avoid the suboptimal maximum?
My T. Thai
[email protected]
24
Local Communities
My T. Thai
[email protected]
25
Overview
 Find communities based on local information
(not information of entire network)
 Two ways:
 Detect the communities
 Define local modularity, then greedily optimize this
function
My T. Thai
[email protected]
26
What We Have Learnt
 Can we use this Q?
 How can we “twist” it to make it work?
My T. Thai
[email protected]
27
Consider this Figure
1. Suppose we have perfect
knowledge of some
subgraphs C
2. Then we should know
some neighbors on C, lie
in U
3. Visit some neighbors of
nodes in U may extend
the knowledge of C
4. Now, can we re-use Q
(defined before) on C?
My T. Thai
[email protected]
28
Some Definitions
 Again, define an adjacent matrix A (wrt C) as follows:
 Consider this quantity:
where m* 
1
Aij (# of edges in the partial adj matrix)

2 ij
 (i, j )  1 iff both vi and v j are in C
My T. Thai
[email protected]
29
Relationship between B, C, U
 Consider nodes in B
 If C is a sharp community, then
 Nodes in B have more connections to nodes in C
 Nodes in B have less (a few) connections to nodes
in U
My T. Thai
[email protected]
30
Definitions
 Define a boundary-adjacent matrix B as follows:
 Define local community R:
where δ(i,j) = 1 iff vi in B and vj in C or vice versa. Otherwise, δ(i,j) = 0.
T: #edges with one ore more endpoints in B
I: #edges in B such that none of their endpoints in U
My T. Thai
[email protected]
31
Properties of R
 0<R<1
 Directly proportional to the sharpness of the
boundary given by B
 When R is undefined?
My T. Thai
[email protected]
32
A Greedy Algorithm
My T. Thai
[email protected]
33
Overview of Second Method
 Start at a vertex, check the degree of each
vertex with respect to each one-hop neighbors,
two-hop neighbors, …, l-hops neighbors
 Why?
 If the community is highly connected, the lhops neighbors tend to revisit the nodes
 At the boundary, the number of newly added
edges decreases
My T. Thai
[email protected]
34
Some Definitions
 kie(j): Emerging degree of a vertex i which is l-hops
away vertex j is defined as the number of edges (u,i)
where u is not within l-hops away from j
 Kjl: Total emerging degree of all nodes that exactly lhops away from j
where Sjl is the set of all vertices exactly l-hops away from j
 Initially, Kj0= degree of node j = kj
My T. Thai
[email protected]
35
Some Definitions
 The change in total emerging degree
My T. Thai
[email protected]
36
Algorithm
1. Randomly choose a starting vertex j
2. Initially, l = 0, add j to C (C is a community),
and K0j = kj
3. l = l++; add all l-hop neighbors of j to C
4. Compute ΔKjl. If ΔKjl < α, then return C.
Otherwise, repeat step 3
My T. Thai
[email protected]
37
Any Problem?
 How to define α? Do we need to define α? If not,
what should we change?
 What if the starting vertex is the “bridge one”? What
can we do?
My T. Thai
[email protected]
38
Impact of α
 α = 0, never stop until explore the entire
connected subgraph
 α is large, stop sooner (l is small), resulting in
many small communities
 α is too large, return n singleton communities
(α > kmax where kmax is the largest degree)
My T. Thai
[email protected]
39
A Small Example
Obtained by the Alg
Actual CS of the Karate Club
My T. Thai
[email protected]
40
Dynamic Communities
My T. Thai
[email protected]
41
Dynamic Networks
 A dynamic network
 A collection of network snapshots at many time points.
 Changes are frequently introduced


Insertions / Removals of nodes
Insertions / Removals of edges
t=0
t=1
t=2
t=3
• Event decomposition
– Insertions / Removals of nodes = {Insert/Remove
a node}
– Insertions / Removals of edges = {Insert/Remove
an edge}
42
Recall…
• Modularity function
• However, max Q is NP-hard
43
Adaptive Solutions
 An adaptive method:
 To maximize the gained modularity with low
computational complexity
 Locally compute a new structure based on local
information after each change of the network
 A basic community structure
Only for the first snapshot
Adaptively update the network communities
based on this basic structure
Method: Blondel et al (2008)
44
QCA: An Adaptive Method
Input
network
Blondel’s
method
:
:
Basic communities
Network
changes
• Need to handle
–
–
–
–
Node insertion
Edge insertion
Node removal
Edge removal
45
Updated
communities
Membership determination
A node actively determines its
Membership
S
u
FinS(u)
FoutC(u)
C
46
Introducing a new node
 Possibilities
 No new edges
 New edges linking with one community
 New edges linking multiple communities
u
C1
C2
C3
47
Handling node insertion
u
C1
C2
Fout
C2(u)
C3
Join u to the community C
with the highest FoutC(u)
FoutC1(u)
FoutC3(u)
48
Introducing a new edge
 Possibilities
 A new edge is inside a single community
 A new edge is joining two communities
u
v
v
u
49
Handling edge insertion
Keep the current community structure intact
b
a
a
b
– Find qu,C,D and qv,D,C
– Join a to C or D according
to qu,C,D and qv,C,D
– If a (or b) changes its
membership
• Check all a’s neighbors
50
Time complexity
 Inserting a new node
 Visit all neighbors of u at most once
 O(du)
 Inserting a new edge
 Computing qu,C(u),C(v) and in constant time
 O(1)
51
Removing an edge
 Resulting community is either:
 Remains unchanged
 Breaks up into smaller communities
 If it contains substructures that are less attractive to the
others
u
v
52
Handling edge removal
 Strategy
 Find maximal ‘quasi’-cliques
 Let the other singletons determine their best
communities
53
Removing a node
 All edges connected to u will be removed
 Resulting community either
 Remains unchanged
 Breaks up into smaller spices and merged to others
54
Handling node removal
 Strategy
 3-Clique percolation
 Let the left over nodes determine their best
communities
55
Experimental Results
 Test our algorithms on real-world data traces
 Enron email, ArXiv citation and Facebook networks
 In comparison with the Blondel’s method at each snapshot
• Metrics
–
–
–
–
Modularity
Number of communities
Normalized Mutual Information (NMI)
Running time
56
ArXiv e-print Citation network
Modularity
# Communities
57
NMI
Running Time
Facebook
Modularity
# Communities
58
NMI
Running Time