Guest lecture by Dr. Yuhua Jiao "Hierarchical Agglomerative

Download Report

Transcript Guest lecture by Dr. Yuhua Jiao "Hierarchical Agglomerative

Presented by Yuhua Jiao
2012-12-4
Outline
• Limitation of some network clustering methods
• Hierarchical Agglomerative Clustering
– Method
– Performance evaluation
• Results and Discussion
– Data preparation
– Empirical evaluation
– Multi-resolution view of a physical interaction
network
Background of network clustering
• Challenges in biological network analysis
– Inference of structure of subgroups of related vertices
– Prediction of possible links not represented in data
• Network clustering is a valuable approach for
– summarizing the structure in large networks,
– predicting unobserved interactions
– predicting functional annotations
Common limitations for
some network clustering algorithms
• Poor resolution of top-level clusters
– Stochastic block models
• Over-splitting of bottom-level clusters
– Hierarchical network model
• Requirements to pre-define the number of
clusters prior to analysis
– Stochastic block models
• An inability to jointly cluster over multiple
interaction types
Hierarchical network model by
Clauset, Moore, and Newman (CMN)
Hierarchical Agglomerative Clustering
• Hierarchical Agglomerative Clustering
– An approximation for optimizing a network
probability motivated by CMN.
– Interactions with vertices outside a group often
provide more information than within-group
interactions.
– Power Graph Analysis is a lossless transformation of
biological networks into a compact, less redundant
representation, exploiting the abundance of cliques
and bicliques as elementary topological motifs.
HAC-Method
• Notation
G (V , E )
– Graph
– Groups
{C k : k Î1,… ,K }
– Edges between groups
eij = åu Îi ,v Î j euv for i ¹ j
eii = åu <v Îi euv
ìï 1 An edge between u and v
euv = í
ïî 0 No edge between u and v
– Total possible connections
tij = ni n j for i ¹ j
– Number of holes
hij = tij - eij
tii = ni ( ni - 1) 2
HAC-Method
• For a given pair of group i and j, edges between
groups are result of tij independent Bernoulli trials.
• The probability of observed edges, conditioned on
parameter θij
( )
eij
(
Pij qij = qij 1- qij
)
hij
• The maximum likelihood estimate of θij is
• The maximum likelihood value of Pij(θij) is
eij
hij
P = eij hij
ML
ij
tij
tij
q̂ij = eij tij
An instance of flat model
• Given two groups: ni = 5 nj = 4
• Probability density is
eij
hij
P = eij hij
ML
ij
tij
tij
• The likelihood of the flat
model
( )
L M = Õ PijML
i£j
HAC-Method
• Generalization to hierarchical model
– Binary dendrogram T
– Each node r in the dendrogram represents the
joining of vertices in left sub-tree L(r) and vertices
in right sub-tree R(r).
– Er and hr are numbers of edges and holes crossing
between the left and right sub-trees.
( ) Õ
L M =
r £ r ¢Îtop
ML
PrML
P
r¢ Õ r
r
HAC-Agglomerative clustering
• Maximum likelihood guide tree
– K top-level clusters
– R total tree nodes
– Merge clusters 1 and 2 into
cluster 1’, defining a new model
M’
l12ML
P
(
)
=
=Õ
P P
L ( M)
L M¢
K
k =3
ML
1¢ k
ML ML
1k
2k
1’
1
2 Current top level
HAC-Agglomerative clustering
HAC-Bayesian model selection for
terminal clusters
• During the merging process, if clusters 1 and 2 are
selected for merging and are both collapsed, the
probability ratio is calculated, where the subscripts
indicate edges and holes within and between groups.
l12C =
Beta
(å
Õ
)
e + 1,å i £ j =1 hij + 1,
i £ j =1 ij
2
2
i £ j =1
2
(
)
Beta eij + 1,hij + 1
• The merged cluster is collapsed if λc ≥1 .
• Clusters of two vertices are always merged because λc = 1.
Performance Evaluation
• Data preparation
– BioGRID database (http://thebiogrid.org)
– The graph is undirected and unweighted with no self
edges.
• Other methods
–
–
–
–
Fast Modularity (CNM)
Variational Bayes modularity (VBM)
Graph Diffusion Kernel (GDK)
Heuristic merging scores
• Edge density (HAC-E)
• Combined edge density and shared neighbor density (HAC-ES)
• Decomposed Newman modularity Q from CNM (HAC-Q)
Link Prediction
• Starting with a real-world network, training networks are generated
by deleting a specified fraction of edges.
• A test set is defined by the held-out edges and a random choice of
an equal number of holes.
• The trained group structure provides maximum likelihood estimates
for edges within and between clusters (Eq. 9). For VBM and CNM,
we estimated edge densities between all pairs of clusters and
within all clusters. For hierarchical models, we estimated densities
between all left and right clusters at all tree levels. For GDK, each
pair’s diffusion was directly used to rank pairs.
• Finally we assessed precision and recall of pairs in the test set
ranked by link probability or GDK score.
Results and Discussion
• Data Preparation
Further Discussion
• Extending HAC to dynamic networks is limited:
– A solution is required to the identifiability problem: how complexes
inferred at one time point correspond to complexes inferred at other
time points.
– Transitions of a protein from one complex to another must be
permitted by the model, requiring dynamical coupling between
network snapshots.
•
Dynamical Hierarchical Agglomerative Clustering (DHAC)
– Maximum likelihood is converted to fully Bayesian statistics
– The likelihood modularity is ‘kernelized’ with an adaptive bandwidth
to couple network clusters at nearby time points.
– Matching clusters across time points is solved with a new belief
propagation method that extends Expectation-Maximization and belief
propagation for bipartite matching to consistently match multiple
time-evolving clusters.