
BFAM Project BF-S15T07
“Efficient clustering algorithms for genome-wide expression analysis”
BFAM Project BF-S15T08
“Modeling and visualization of biochemical networks”
Sebastian Wernicke
Arno Buchner
Jan Griebsch
Jens Ernst
Misc. projects in Bioinformatics
Hanjo Täubig
Moritz Maass
Project I: Efficient Clustering Algorithms for Genome-wide Expression Analysis
[Figure: pipeline diagram with boxes labelled Gene Expression Data, Normalization, Expression Profiles, Similarity Measure, Clustering]
1. Retrospect: The SR-Algorithm
• Powerful algorithm for similarity-based clustering
• Based on methods of spectral graph theory,
numerical linear algebra and randomization
• Applicable not only to gene expression profiles
but to any class of biological objects where pair-wise
similarity is defined
• Thoroughly mathematically analyzed with respect
to noise-robustness and running time
• Complexity: Θ(n²), and hence optimal (reading all pair-wise
similarities already takes Θ(n²) time)
• New: Parallelized version and optimized version for
sparse similarity matrices.
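The slides do not spell out the SR-algorithm, so the following is only a generic illustration of the spectral idea it builds on, not the authors' method. For two planted clusters, the signs of the leading eigenvector of a ±1 similarity matrix split the objects into two groups. The function names, the power-iteration solver, the fixed start vector and the toy matrix are all assumptions made for this sketch.

```python
import math

def leading_eigenvector(matrix, iterations=200):
    """Approximate the dominant eigenvector by plain power iteration."""
    n = len(matrix)
    v = [1.0] + [0.0] * (n - 1)  # fixed start vector (assumption)
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def spectral_bipartition(similarity):
    """Split objects into two groups by the sign of the leading eigenvector.

    `similarity` holds +1 for "similar" and -1 for "dissimilar" pairs.
    """
    v = leading_eigenvector(similarity)
    return [0 if x >= 0 else 1 for x in v]

# Toy input: objects 0-2 and 3-5 form two clusters; the pair (0, 3) is a
# false positive, i.e. wrongly reported as similar.
S = [
    [ 1,  1,  1,  1, -1, -1],
    [ 1,  1,  1, -1, -1, -1],
    [ 1,  1,  1, -1, -1, -1],
    [ 1, -1, -1,  1,  1,  1],
    [-1, -1, -1,  1,  1,  1],
    [-1, -1, -1,  1,  1,  1],
]
labels = spectral_bipartition(S)
```

In this toy case the single wrong entry does not change the sign pattern, which hints at why spectral methods tolerate a certain amount of noise.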
2. Tests on Synthetic Data (1)
Output quality as a function of n and the amount of noise
(false positive, false negative rate α). The number of
clusters is specified to the algorithm.
[Plots: output quality (up to 1.0) over n and α (up to 0.45)]
500 – 2000 genes forming 4 clusters with 20%-49% false positives/negatives
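The slides do not say how the synthetic inputs were generated. One plausible model, given here purely as an assumption, plants k equally sized clusters and flips every pair-wise "similar / not similar" relation independently with probability α, which produces false negatives inside clusters and false positives between them. The names `planted_similarity` and `error_rate` are hypothetical.

```python
import random

def planted_similarity(n, k, alpha, seed=0):
    """Return (labels, S) for n objects in k planted clusters.

    S[i][j] is 1 if i and j are reported as similar, else 0. Each true
    relation is flipped independently with probability alpha.
    """
    rng = random.Random(seed)
    labels = [i % k for i in range(n)]  # balanced clusters (assumption)
    S = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            truth = 1 if labels[i] == labels[j] else 0
            observed = 1 - truth if rng.random() < alpha else truth
            S[i][j] = S[j][i] = observed  # keep the matrix symmetric
    return labels, S

def error_rate(labels, S):
    """Fraction of off-diagonal pairs whose entry contradicts the planting."""
    n = len(labels)
    wrong = sum(
        S[i][j] != (1 if labels[i] == labels[j] else 0)
        for i in range(n) for j in range(i + 1, n)
    )
    return wrong / (n * (n - 1) // 2)

labels, S = planted_similarity(n=200, k=4, alpha=0.3)
```

A clustering algorithm can then be scored by comparing its output with `labels` for growing n and α.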
Tests on Synthetic Data (2)
Output quality as a function of n and the amount of noise
(false positive, false negative rate α). The number of
clusters is found by the algorithm.
[Plots: output quality (up to 1.0) over n and α (up to 0.45)]
500 – 4000 genes forming 4 clusters with 20%-45% false positives/negatives
Tests on Synthetic Data (3)
Running time as a function of n and the amount of noise
(false positive, false negative rate α) on a 1GHz machine.
[Plots: running time in seconds (5.0 to 293.0) over n (5000 to 30,000) and α (up to 0.45); one panel is labelled α = 0.45]
5,000 – 30,000 genes, i.e. 25,000,000 – 900,000,000 similarity values
4. Clustering Protein Interaction Networks
• Experiments with a network from the STRING system
provided by the Bork group at EMBL.
• Data: Escherichia coli, orthologous group-based
• Edge scores: Interaction intensities defined by
score = 1 − (1 − neighborhood score) × (1 − fusion score) × (1 − co-occurrence score)
[ Courtesy of C. von Mering, Nucleic Acids Res. 2003 Jan 1;31(1):258-61 ]
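The edge score above can be sketched as a small function. The function name and the range check are assumptions. The formula itself is the one quoted on the slide, and it behaves like a "noisy OR": the combined score is high as soon as any one channel is high.

```python
def combined_score(neighborhood, fusion, cooccurrence):
    """Combine three evidence scores in [0, 1] into one edge score."""
    for s in (neighborhood, fusion, cooccurrence):
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    return 1.0 - (1.0 - neighborhood) * (1.0 - fusion) * (1.0 - cooccurrence)

print(combined_score(0.5, 0.5, 0.5))  # 0.875
print(combined_score(0.9, 0.0, 0.0))  # a single strong channel dominates
```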
4.1 Methods Currently Applied in STRING
• Functional module extraction: Generic partition-based
clustering methods (Single Linkage, Markov Clustering)
have been applied to identify functional modules in the
network.
• However: Due to the definition of the interaction score
as a combination of three different channels, multiple
cluster structures are superimposed in this data set.
• Generalized Clustering: Grouping such that any
protein (/orthologous group) can belong to multiple
clusters. The density of each cluster should be as high
as possible, whereas the inter-cluster connectivity
(excluding overlaps) should be minimized.
4.2. Schematic representation:
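[Figure: three panels. “Cluster Structure”: four overlapping clusters 1 to 4 whose intersections define the “Lsets” 1; 1,2; 1,3; 2; 2,3; 1,2,3; 2,4; 3; 3,4; 4. “Interaction Matrix (original form)”. “Interaction Matrix (permuted with respect to cluster structure)”.]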
4.3. Construction of Intersecting Clusters:
1. Construction of elementary sets by SR-techniques
Result: A partition of the protein set into a fixed
number k of elementary sets. The value of k may
safely be overestimated.
Intra- and inter-Lset edge densities:
k = 150;
Mean intra-Lset density: 0.309
Inter-Lset connectivity: 0.024
[Figure: frequency distribution of edge densities within and between Lsets; an annotation marks Lsets belonging to the same cluster]
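The slides report the two densities without defining them. A minimal sketch, assuming an unweighted graph, that "mean intra-Lset density" is the average over Lsets of realized edges divided by possible pairs, and that "inter-Lset connectivity" is the fraction of realized edges among all pairs lying in different Lsets, could look as follows. All names are hypothetical.

```python
from itertools import combinations

def density_within(edges, part):
    """Fraction of possible pairs inside `part` that are edges."""
    pairs = list(combinations(sorted(part), 2))
    if not pairs:
        return 0.0  # singleton sets count as density 0 (assumption)
    return sum(1 for p in pairs if frozenset(p) in edges) / len(pairs)

def partition_densities(edges, parts):
    """Return (mean intra-set density, inter-set connectivity)."""
    intra = sum(density_within(edges, p) for p in parts) / len(parts)
    cross_pairs = 0
    cross_edges = 0
    for a, b in combinations(parts, 2):
        cross_pairs += len(a) * len(b)
        cross_edges += sum(
            1 for u in a for v in b if frozenset((u, v)) in edges
        )
    return intra, cross_edges / cross_pairs

# Toy graph: two Lsets {0, 1} and {2, 3}, joined by the single edge (1, 2).
edges = {frozenset(p) for p in [(0, 1), (2, 3), (1, 2)]}
intra, inter = partition_densities(edges, [{0, 1}, {2, 3}])
print(intra, inter)  # 1.0 0.25
```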
2. Definition of the Lset-graph
Some pairs of Lsets are still highly connected.
This is represented by a graph structure whose
nodes are Lsets. Maximal cliques in this graph
are macroscopic clusters, which can overlap.
[Figure: Lset-graph with nodes 1; 1,2; 1,3; 2; 2,3; 1,2,3; 2,4; 3; 3,4; 4]
Note: This means that the method self-corrects
an over-estimated value of k.
3. Construction of the intersecting clusters
The cliques are extracted using the Tsukiyama-algorithm.
Result: 144 clusters
Intra-cluster density: 0.269
Inter-cluster connectivity: 0.020 (excl. overlaps)
Quality assessment based on biological expert
knowledge: currently pending
The clusters are being compared with a known
set of protein-to-pathway assignments.
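Steps 2 and 3 can be sketched on a toy example. Two caveats: the slides do not state when two Lsets count as "highly connected", so the `threshold` below is a hypothetical parameter, and the basic Bron-Kerbosch recursion is swapped in here for the Tsukiyama algorithm named on the slide, since both enumerate maximal cliques.

```python
def lset_graph(density, threshold):
    """Connect two Lsets when their mutual edge density reaches threshold."""
    n = len(density)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if density[i][j] >= threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def maximal_cliques(adj):
    """Enumerate all maximal cliques (basic Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))  # r cannot be extended any further
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques

# Hypothetical inter-Lset densities for five Lsets.
density = [
    [0.0, 0.4, 0.3, 0.0, 0.0],
    [0.4, 0.0, 0.5, 0.0, 0.0],
    [0.3, 0.5, 0.0, 0.4, 0.3],
    [0.0, 0.0, 0.4, 0.0, 0.6],
    [0.0, 0.0, 0.3, 0.6, 0.0],
]
clusters = maximal_cliques(lset_graph(density, threshold=0.2))
```

Here Lset 2 belongs to both resulting clusters, {0, 1, 2} and {2, 3, 4}, which is exactly the kind of overlap the generalized clustering is meant to capture.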
5. Mathematical Result Evaluation in Comparative
Analysis of Clustering Algorithms
• Mathematical scoring scheme for clustering quality:
• Suppose a clustering has induced the partition
C = {C_1, C_2, …, C_k} of the set of genes {X_1, X_2, …, X_n}.
• Denote the similarity between a pair of genes X_i, X_j
by s(X_i, X_j).
• Denote the cluster containing X_i by C(X_i) and the
center of some cluster C by X_C.
Cluster Homogeneity:
Separation:
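The formulas for the two scores are not preserved in this transcript. A common center-based definition that matches the notation introduced above, given here as an assumption rather than as the authors' exact formulas, is:

```latex
H_{\mathrm{ave}} = \frac{1}{n} \sum_{i=1}^{n} s\bigl(X_i, X_{C(X_i)}\bigr),
\qquad
S_{\mathrm{ave}} = \frac{\sum_{i<j} |C_i|\,|C_j|\; s\bigl(X_{C_i}, X_{C_j}\bigr)}
                        {\sum_{i<j} |C_i|\,|C_j|}
```

Under this reading, a good clustering combines high homogeneity (genes resemble their own cluster center) with low separation (cluster centers resemble each other as little as possible).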
• Remarks:
1. The cluster analysis was conducted in the form of a
blind test. Use of expert knowledge or supervised
learning techniques was not intended.
2. No prior selection of genes was required.
3. Neither normalization/standardization of the expression data
nor of the similarity/distance measure was explicitly required.
• Choice of similarity measure s for the evaluation:
Pearson correlation coefficient, because it is invariant under
scaling and translation of expression profiles (transformations
that some participants applied).
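The evaluation can be sketched in a few lines. The sketch assumes center-based scores: homogeneity is the average similarity of each profile to its cluster center, and separation is the size-weighted average similarity between cluster centers. It also assumes that a cluster center is the component-wise mean profile. These definitions and all function names are assumptions. Setting `absolute=True` scores with |Pearson| instead, so that anti-correlated profiles count as similar.

```python
import math

def pearson(x, y):
    """Pearson correlation of two profiles (undefined for constant profiles)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def center(profiles):
    """Component-wise mean profile of a cluster (assumption)."""
    return [sum(col) / len(profiles) for col in zip(*profiles)]

def _similarity(absolute):
    return (lambda a, b: abs(pearson(a, b))) if absolute else pearson

def homogeneity(clusters, absolute=False):
    """Average similarity of every profile to its own cluster center."""
    sim = _similarity(absolute)
    scores = [sim(p, center(c)) for c in clusters for p in c]
    return sum(scores) / len(scores)

def separation(clusters, absolute=False):
    """Size-weighted average similarity between pairs of cluster centers."""
    sim = _similarity(absolute)
    centers = [center(c) for c in clusters]
    num = den = 0.0
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            w = len(clusters[i]) * len(clusters[j])
            num += w * sim(centers[i], centers[j])
            den += w
    return num / den

# Two toy clusters: rising profiles and falling profiles.
clusters = [
    [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]],
    [[3.0, 2.0, 1.0], [6.0, 4.0, 2.0]],
]
h = homogeneity(clusters)
s = separation(clusters)
s_abs = separation(clusters, absolute=True)
```

The toy data also show the invariance mentioned above: the profile (2, 4, 6) is a scaled copy of (1, 2, 3), and both correlate perfectly with their common center.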
• Homogeneity and Separation in the Clusterings (NRO)
[Scatter plot, NRO data set (Pearson correlation): separation against homogeneity. Point labels: “Average” (2), “Average” (3), Kröger (10), “Binary” (16), “SR”, “SOM” (2), “Ward” (2), “Optimum”; further labels: (20), (2)]
• Using |Pearson| to accommodate anti-correlation
[Scatter plot, NRO data set (absolute Pearson correlation): separation against homogeneity. Point labels: “SOM” (2), “Ward” (2), Kröger (10), “Binary” (16), “Average” (2), “SR”, “Average” (3), “Optimum”; further labels: (16), (20), (3)]
• An SR-clustering with 16 clusters on the NRO data:
the appropriately permuted similarity matrix
[Figure: permuted similarity matrix with two annotations: “The gray off-diagonal blocks suggest some inter-cluster similarity. Cluster overlap is conceivable here.” and “Isolated clusters with high confidence”]
6. Cooperation within the BFAM Network:
1. Cooperation with Genomatix Software GmbH:
• Extension of cluster analysis by integration of information
from biological databases and expert knowledge
2. Cooperation with Genomatix Software GmbH, Biomax
Informatics GmbH, the group of Prof. Lasser and the
group of Prof. Kriegel:
• Comparative analysis of clustering algorithms
3. Publications:
[1] “Similarity-Based Clustering Algorithms for Gene Expression Profiles”,
J. Ernst, Dissertation, Technische Universität München, 2002
[2] “Generalized Clustering of Gene Expression Profiles – A Spectral Approach”,
J. Ernst, Proc. of the Int. Conference on Bioinformatics, Bangkok, 2002
[3] “The Complexity of Detecting Fixed-Density Clusters”, H. Täubig et al.,
Proc. of the 5th Italian Conference on Algorithms and Complexity, 2003
Chair for Efficient Algorithms
Algorithms for Bioinformatics
Graph Theory
Combinatorial Optimization
Randomized Algorithms
Algorithm Visualization
Complexity Theory
Computer Algebra
Petri Nets
Scheduling
Project “Clustering”
Project “Biological Networks”
Misc. Bioinformatics Projects