PCA-based Consensus methods outperform all other algorithms!


Integrated Mining of PPI Networks:
A Case for Ensemble Clustering
Srinivasan Parthasarathy
Department of Computer Science and Engineering
The Ohio State University
Joint work with Sitaram Asur and Duygu Ucar
Copyright 2006, Data Mining Research Laboratory
I. Preliminaries and Motivation
Proteins
• Central component of cell machinery and life
– It is the proteins dynamically generated by a cell
that execute the genetic program [Kahn 1995]
• Proteins work with other proteins [Von Mering et
al 2002]
– Form large interaction networks, typically referred to
as protein-protein interaction (PPI) networks
– Regulate and support each other for specific
functionality or process
Protein-Protein Interaction Networks
• Why analyze?
– To fully understand cellular machinery, simply
listing proteins is not enough – (clusters of)
interactions need to be delineated as well
[Von Mering et al 2002]
• Understanding the organism
– Protein function prediction
• E.g. no functional annotations
for one-third of baker’s yeast
– Drug design
• Goal: To find modular clusters
Challenges in Analyzing PPI Networks
– Noisy data
• False positives [Deane 2002], false negatives [Hsu 06]
– Existence of Hub Nodes
• Particularly problematic for standard clustering and graph
partitioning algorithms -- lead to very large core clusters and not
much else!
– Proteins can be multi-faceted
• Can belong to multiple functional groups – most clustering
algorithms make hard (single-membership) assignments, hence the need for soft or fuzzy clustering
– Data Integration Issues
• Multiple Sources
– Two-hybrid (Y2H), mass spectrometry, genetic co-occurrence
• Different targets
– Y2H, Mass Spec – target binding
– Gene co-occurrence – target functional
• Different weaknesses (missing certain interactions)
– Y2H – translation
– mass-spectrometry – transport & sensing
Ensemble Clustering
• A useful approach to combine the results from multiple clustering
arrangements into a single arrangement based on consensus [SG03]
• Objective: map the clusters obtained by different algorithms to a
single clustering arrangement
• Our hypothesis: potentially offers a viable solution to all of these problems
simultaneously
– Given the solid theory behind ensembles in classification, it is likely to be
particularly useful in a noisy environment.
• A weak analogy: the audience vote in Who Wants to Be a Millionaire
– Naturally handles arrangements produced from different sources or
domain-driven segmentation.
Ensemble Clustering on PPI networks:
Key Questions
• What are the base clustering methods and
arrangements to use in the context of interaction
networks?
– How to handle the influence of noise and hubs?
• How do we scale to problems the size of
interaction networks?
• How do we address the issue of soft clustering?
• How to address the issue of data integration?
– Another day, another time
II. Ensemble Clustering Framework
Bird's-eye view (coarse grained)
[Figure: framework overview. Labels: scale-free graph; x topology-based similarity metrics; y clustering algorithms; xy base clustering arrangements; cluster representation; (soft) consensus clustering; final clusters]
Similarity Metrics
• Central to any clustering algorithm
• Key idea:
– Leverage topological information to determine the
similarity between two proteins in the interaction
network
– With an ensemble approach we are not limited to one!
• Metrics :
– Clustering coefficient based (edge oriented, local)
– Edge Betweenness based (edge oriented, global)
– Neighborhood based (local, non-edge oriented)
Clustering coefficient-based similarity
• Clustering coefficient
– "all-my-friends-know-each-other" property
– Measures the interconnectivity of a node’s neighbors.
• Clustering coefficient-based similarity of two connected
nodes vi and vj
– Measures the contribution of the edge between the nodes towards
the clustering coefficient of the nodes
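The formula for this slide did not survive in the transcript. As a rough sketch: the local clustering coefficient of a node v with k neighbors and n links among those neighbors is CC(v) = 2n / (k(k-1)). One possible reading of "contribution of the edge" is the drop in CC(vi) + CC(vj) when the edge (vi, vj) is removed. That reading, the function names and the toy graph below are assumptions for illustration, not the authors' stated definition.

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

def cc_similarity(adj, u, v):
    """Assumed form: drop in CC(u) + CC(v) caused by deleting edge (u, v)."""
    before = clustering_coefficient(adj, u) + clustering_coefficient(adj, v)
    # Copy the adjacency sets with the edge (u, v) removed.
    pruned = {n: set(nb) for n, nb in adj.items()}
    pruned[u].discard(v)
    pruned[v].discard(u)
    after = clustering_coefficient(pruned, u) + clustering_coefficient(pruned, v)
    return before - after

# Toy graph: a triangle a-b-c plus a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
```

Under this reading, an edge inside the triangle (a, b) scores high, while the edge to the pendant node (c, d) scores below zero because removing it raises CC(c).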
Edge betweenness-based similarity
• Shortest path edge betweenness [Newman et al]
– “I-am-between-every-pair” property
– Computes the fraction of shortest paths passing through
an edge
– Edges that lie between communities have high values of
betweenness
– Edge betweenness-based similarity
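The similarity formula is missing from the transcript. The sketch below computes shortest-path edge betweenness on an unweighted, undirected graph using a Brandes-style accumulation (a choice made here, not something the slide specifies). It returns raw path counts rather than fractions. The mapping to a similarity, 1 - b / max(b), is an assumption chosen only so that edges between communities get low similarity.

```python
from collections import deque

def edge_betweenness(adj):
    """Shortest-path edge betweenness (raw counts) for an unweighted, undirected graph."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        dist = {s: 0}
        sigma = {n: 0.0 for n in adj}
        sigma[s] = 1.0
        preds = {n: [] for n in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, pushing path dependencies onto edges.
        delta = {n: 0.0 for n in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Each unordered pair of endpoints was counted once from each side.
    return {e: c / 2.0 for e, c in score.items()}

def betweenness_similarity(scores):
    """Assumed mapping: the edge with the highest betweenness gets similarity 0."""
    if not scores:
        return {}
    top = max(scores.values())
    return {e: 1.0 - b / top for e, b in scores.items()}

# Toy graph: two triangles (1,2,3) and (4,5,6) joined by the bridge 3-4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
scores = edge_betweenness(adj)
sims = betweenness_similarity(scores)
```

In the toy graph the bridge 3-4 carries all nine cross-triangle shortest paths, so it ends up with the lowest similarity.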
Neighborhood-based similarity
• “my-friends-are-your-friends” property
• Based on the number of common neighbors between
nodes (Czekanowski-Dice metric [Brun et al, 2004])
where Int(i) = number of neighbors of node i
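The formula itself is missing from the transcript. A minimal sketch, assuming the similarity form of the Czekanowski-Dice coefficient, 2|A ∩ B| / (|A| + |B|), applied to the neighbor sets of the two nodes. Whether the slide used this similarity form or the equivalent distance (one minus it), and whether a node counts as part of its own neighborhood, are assumptions; the `include_self` flag is hypothetical and only makes that choice explicit.

```python
def dice_similarity(adj, i, j, include_self=True):
    """Dice coefficient 2|A & B| / (|A| + |B|) of the neighbor sets of i and j."""
    a = set(adj[i])
    b = set(adj[j])
    if include_self:
        # Assumption: each node belongs to its own neighborhood, so two
        # directly linked nodes always share at least two members.
        a.add(i)
        b.add(j)
    if not a and not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))

# Toy graph: triangle 1-2-3 with node 4 hanging off node 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
```

Nodes 1 and 2 share their whole neighborhood and score 1.0; nodes 1 and 4 share only node 3 and score 0.4.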
Base Clustering
• Base clustering algorithms : Different criteria
– kMetis
– Repeated bisections
– Direct k-way partitioning
• Topology-based similarity measures : weight
interactions
– Clustering coefficient-based – local, targets FP
– Edge betweenness-based – global, targets FP
– Neighborhood – local, potentially targets FN & FP
• 3 × 3 = 9 arrangements (variance is good!)
– k clusters per arrangement (9 × k clusters in total)
PCA-based Consensus Technique
Cluster Purification
Dimensionality Reduction
Consensus Clustering
Cluster Purification
• Goal : Prune unreliable base clusters
• Intra-cluster similarity measure
where SP(i,j) represents shortest path between i and j
• Low intra-cluster distance => high reliability
• Remove clusters with low reliability
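The purification measure is missing from the transcript. The sketch below assumes it is the average of SP(i,j) over all pairs in a cluster, with shortest paths measured in hops on the full interaction graph. The averaging, the penalty for unreachable pairs and the cutoff `max_mean_distance` are all assumptions made for illustration.

```python
from collections import deque
from itertools import combinations

def shortest_path_lengths(adj, source):
    """BFS hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def mean_intra_cluster_distance(adj, cluster):
    """Average SP(i, j) over all pairs in the cluster (assumed aggregation)."""
    members = sorted(cluster)
    if len(members) < 2:
        return 0.0
    dists = {m: shortest_path_lengths(adj, m) for m in members}
    # Assumption: unreachable pairs are charged len(adj), longer than any real path.
    total = sum(dists[i].get(j, len(adj)) for i, j in combinations(members, 2))
    n_pairs = len(members) * (len(members) - 1) // 2
    return total / n_pairs

def purify(adj, clusters, max_mean_distance):
    """Keep only clusters whose mean intra-cluster distance is within the cutoff."""
    return [c for c in clusters if mean_intra_cluster_distance(adj, c) <= max_mean_distance]

# Toy graph: the path 1-2-3-4-5, with one tight and one stretched base cluster.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
base_clusters = [{1, 2, 3}, {1, 5}]
kept = purify(adj, base_clusters, max_mean_distance=2.0)
```

The cluster {1, 5} has a mean distance of 4 hops and is pruned; {1, 2, 3} averages 4/3 and is kept.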
Dimensionality Reduction
• Cluster membership matrix to represent pruned base clusters
• Dimensions likely to be high (9 X k)
• Clustering inefficient for high-dimensional data
– Distance metric computations do not scale well
• A lot of noise and redundancy in the matrix
• Solution : Reduce dimensions of the matrix
– Apply logistic PCA
– Variant of PCA for binary data (Schein et al, 2003)
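To make the input to the reduction step concrete, here is a minimal sketch of a binary cluster membership matrix: one row per protein, one column per surviving base cluster. The function name and toy data are hypothetical. The logistic PCA step itself is not shown, since the slide gives no details of its implementation.

```python
def membership_matrix(proteins, arrangements):
    """Rows are proteins, columns are base clusters; 1 marks membership."""
    # Flatten all arrangements into one list of columns.
    columns = [cluster for arrangement in arrangements for cluster in arrangement]
    return [[1 if p in cluster else 0 for cluster in columns] for p in proteins]

# Toy data: two base arrangements of two clusters each, so 2 x 2 = 4 columns.
proteins = ["p1", "p2", "p3", "p4"]
arrangements = [
    [{"p1", "p2"}, {"p3", "p4"}],
    [{"p1"}, {"p2", "p3", "p4"}],
]
matrix = membership_matrix(proteins, arrangements)
```

With nine arrangements of k clusters each, the same construction yields up to 9 × k columns before pruning, which is the high-dimensional matrix the slide refers to.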
Consensus Clustering
• Agglomerative Hierarchical Clustering
– Bottom-up clustering algorithm
– Begin with each point in a separate cluster
– Iteratively merge clusters that are similar
• Recursive Bisection (RBR) algorithm
• Soft Clustering Variants
– Find initial clusters using agglo or RBR
– Assign points to multiple clusters based on similarity
– Hub nodes have high propensity for multiple membership
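A minimal sketch of the soft-assignment step, assuming each protein is represented by its reduced vector, clusters by their centroids, and similarity by cosine. The rule used here (give a point every cluster whose centroid is at least `ratio` times as similar as its best one) and the value of `ratio` are assumptions; the slides do not state the actual criterion.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def soft_assign(points, hard_clusters, ratio=0.8):
    """Give each point every cluster whose centroid is nearly as similar as its best one."""
    centroids = [centroid([points[p] for p in c]) for c in hard_clusters]
    memberships = {}
    for name, vec in points.items():
        sims = [cosine(vec, c) for c in centroids]
        best = max(sims)
        memberships[name] = [k for k, s in enumerate(sims) if best > 0 and s >= ratio * best]
    return memberships

# Toy data: two tight groups plus a "hub" point lying between them.
points = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.1, 0.9],
    "hub": [0.7, 0.7],
}
hard_clusters = [["a", "b", "hub"], ["c", "d"]]  # e.g. from agglo or RBR
memberships = soft_assign(points, hard_clusters)
```

Only the hub point ends up in both clusters, which mirrors the observation that hubs have a high propensity for multiple membership.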
Ensemble Framework (Detailed View)
[Figure: detailed framework diagram. Labels: weighted graph; topological metrics; weights; base clustering; base clustering arrangements; cluster purification; pruning; principal component analysis; consensus clustering; agglomerative clustering; soft; final clusters; PCA-agglo; PCA-rbr; PCA-softvariants]
III. Evaluation
Validation Metrics: Domain Independent
• Topological measure : Modularity [Newman&Girvan04]
– Measures the modularity within clusters
– dij represents fraction of edges linking nodes in clusters i
and j
• Information theoretic measure : Normalized Mutual
Information [Strehl & Ghosh03]
– Measures the shared information between the consensus
and base clustering arrangements
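Both formulas are missing from the transcript. The sketch below uses the standard definitions from the cited works: Newman-Girvan modularity, the sum over clusters i of d_ii - (sum_j d_ij)^2, and Strehl and Ghosh's NMI, I(A;B) / sqrt(H(A) H(B)). The function names and toy data are illustrative. How the talk aggregated NMI over the nine base arrangements (for example, by averaging) is not stated and is not assumed here.

```python
import math
from collections import Counter

def modularity(edges, cluster_of):
    """Newman-Girvan modularity: intra-cluster edge fraction minus its expected value."""
    m = len(edges)
    intra = Counter()   # edges with both endpoints in the same cluster
    degree = Counter()  # edge endpoints falling in each cluster
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            intra[cu] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(labels_a, labels_b):
    """Mutual information normalized by sqrt(H(A) * H(B))."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    mi = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    denom = math.sqrt(entropy(labels_a) * entropy(labels_b))
    return mi / denom if denom else 0.0

# Toy graph: two triangles joined by one bridge, clustered one triangle each.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
cluster_of = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
q = modularity(edges, cluster_of)
```

NMI is 1 when two arrangements agree up to a relabeling of clusters and 0 when they are statistically independent.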
Validation Metric: Domain Dependent
• Domain-based measure:
– Gene ontology annotations for each cluster of proteins
• Cellular Component
• Molecular Function
• Biological Process
– P-value to measure statistical significance of clusters
• Computes the probability of the grouping being random
• Smaller p-values represent higher biological significance
– Clustering Score to measure overall clustering arrangement
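The slide does not give the p-value formula. A common choice for Gene Ontology enrichment is the hypergeometric tail: the probability that a randomly drawn cluster of the same size contains at least as many proteins with a given annotation. The sketch below assumes that form; the function and parameter names are hypothetical. The clustering score is not reproduced because its definition is not recoverable from the transcript.

```python
from math import comb

def go_enrichment_p_value(population, annotated, cluster_size, hits):
    """P(at least `hits` annotated proteins in a random cluster of `cluster_size`)."""
    total = comb(population, cluster_size)
    upper = min(annotated, cluster_size)
    # Sum the hypergeometric probabilities for hits, hits + 1, ..., upper.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

# Toy numbers: 10 proteins, 3 carry the annotation, and a cluster of 3 contains all of them.
p = go_enrichment_p_value(population=10, annotated=3, cluster_size=3, hits=3)
```

Only one of the 120 possible clusters of size 3 contains all three annotated proteins, so the toy p-value is 1/120.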
Experimental Setup
• Algorithms proposed by Strehl et al, 2003
– HyperGraph Partitioning Algorithm (HGPA)
• Minimal Hyperedge Separator using HMetis
– Meta-Clustering Algorithm (MCLA)
• Group related hyperedges to form meta-clusters
• Assign each point to the closest meta-cluster
– Cluster-based Similarity Partitioning Algorithm (CSPA)
• Pairwise similarity matrix is partitioned with METIS
• Algorithms proposed by Gionis et al, ICDE 2005
– Agglomerative algorithm (CE-agglo)
– Density-based clustering algorithm (CE-balls)
– Use strict thresholds and are non-parametric
• Database of Interacting Proteins (DIP)
– 4928 proteins, 17194 interactions
Modularity and NMI
CSPA algorithm ran out of memory
CE-agglo and CE-balls algorithms resulted in pairs and singleton clusters
(cluster-sizes 2121 and 2783 respectively)
Algorithm    Modularity   NMI
PCA-agglo    0.471        0.66
PCA-rbr      0.46         0.656
MCLA         0.41         0.614
HGPA         0.1          0.275
PCA-based consensus methods provide best scores!
Comparison with Ensemble Algorithms
[Figure: bar chart of clustering score (axis 0 to 1) for the Process, Function and Component ontologies across the ensemble algorithms CE-balls, CE-agglo, HGPA, PCA-agglo, PCA-rbr, MCLA and Wt-agglo]
PCA-based Consensus methods outperform all other algorithms!
MCLA performs best of the other algorithms
Existing Solutions to Identify Dense Regions
• Molecular Complex Detection (MCODE)
– Bader et al, 2003
– Use local neighborhood density to identify seed
vertices
– Group highly weighted vertices around seed vertices
• Markov Cluster Algorithm (MCL)
– Dongen et al 2000
– Random walks on the graph will infrequently go from
one natural cluster to another
– Cluster structure separates out
– Fast, scalable and non-parametric
Comparison with MCODE and MCL
• MCODE produced only 59 clusters
– Not all proteins clustered (794/4928)
– 10-20 clusters insignificant
• MCL produced 1246 clusters
– Most of the clusters insignificant (close to 75-80%)
Algorithm    Modularity
PCA-agglo    0.471
MCL          0.217
MCODE        0.372
Soft Clustering: Comparison with Hub Duplication (Ucar 2006)
[Figure: hub duplication flowchart. Labels: for hub Hi; hub-induced subgraph Si; dense components of Si; D’i; duplicate Hi; i++; graph partitioning]
Benefits of Soft Ensemble Clustering
A closer look at soft clustering performance
• CKA1 (hub protein)
Base Algorithm   Annotation
Direct-bet       Kinase CK2 complex
Direct-cc        rRNA metabolism
RBR-bet          Kinase CK2 complex
RBR-cc           Kinase CK2 complex
Metis-bet        Cell organization and biogenesis
Metis-cc
Consensus        Annotation
PCA-agglo        Kinase CK2 complex
PCA-softagglo    Kinase CK2 complex; rRNA metabolism; Cell organization and biogenesis
Concluding Remarks
• Clustering PPI networks is challenging
– Noise
– Presence of hubs
– Need for soft clustering
– Integration
• Ensemble clustering shows promise as a unified method
to handle these problems
– Competes well against existing stand-alone solutions
– Scalable -- straightforward parallelization for the most part
• Ongoing work
– General applicability
• WWW applications
• Social network analysis
– Explicit modeling of domain knowledge
• E.g. encoding directionality
– Data Integration
• Key is to weight edges and/or components of the ensemble
– Leveraging graphical models
– More robust base models
• Extrinsic similarity measures
• Impact of anomalies
Questions?
• We acknowledge the following grants for support
– NSF: CAREER-IIS-0347662
– NSF: NGS-CNS-0406386
– NSF: RI-CNS-0403342
– DOE: ECPI-FG02
• Graduate Student Colleagues
– S. Asur and D. Ucar
• Details
– http://dmrl.cse.ohio-state.edu
– www.cse.ohio-state.edu/~srini/