(DCCA) for grouping of genes
Download
Report
Transcript (DCCA) for grouping of genes
Anindya Bhattacharya and Rajat K. De
Bioinformatics, 2008
Introduction
Divisive Correlation Clustering Algorithm
Results
Conclusions
2
Introduction
Divisive Correlation Clustering Algorithm
Results
Conclusions
3
Correlation Clustering
4
Correlation clustering is proposed by Bansal
et al. in Machine Learning, 2004.
It is basically based on the notion of graph
partitioning.
5
How to construct the graph?
Nodes: genes.
Edges: correlation between the genes.
Two types of edges:
Positive edge.
Negative edge.
6
For example:
X
Y
Positive correlation coefficient: Positive edge(
)
X
Y
Negative correlation coefficient: Negative edge(
)
Graph Construction
A
B
C
A
Graph Partitioning
B Cluster 1
C
G
E
D
G
F
H
G
E
D
Cluster 2
G
F
H
7
How to measure the quality of clusters?
The number of agreements.
The number of disagreements.
The number of agreements: the number of
genes that are in correct clusters.
The number of disagreements: the number of
genes wrongly clustered.
8
For example:
Cluster 1
A
B
C
D
Cluster 2
E
The measure of agreements is the sum of:
(1) # of positive edges in the same clusters
(2) # of negative edges in different clusters
4+4=8
The measure of disagreements is the sum of:
(1) # of negative edges in the same clusters
(2) # of positive edges in different clusters
0+2=2
9
Minimization of disagreements or
equivalently Maximization of agreements!
However, it’s NP-Complete proved by Bansal
et al., 2004.
Another problem is without the magnitude of
correlation coefficients.
10
Introduction
Divisive Correlation Clustering Algorithm
Results
Conclusions
11
Pearson correlation coefficient
Terms and measurements used in DCCA
Divisive Correlation Clustering Algorithm
12
Consider a set of genes,
, for
each of which expression values are given.
The Pearson correlation coefficient between
two genes and is defined as:
lth sample value of gene
mean value of gene
from
samples
13
: and are positively
correlated with the degree of correlation as
its magnitude.
: and are negatively
correlated with value
.
14
We define some terms and measurements
used in DCCA:
Attraction
Repulsion
Attraction/Repulsion value
Average correlation value
15
Attraction: There’s an attraction between
and if
.
Repulsion: There’s a repulsion between and
if
.
Attraction/Repulsion value: Magnitude of
is the strength of attraction or
repulsion.
16
The genes will be grouped into disjoint
clusters
.
Average correlation value: Average
correlation value for a gene with respect to
cluster is defined as:
the number of data points in
17
indicates that the average
correlation for a gene with other genes
inside the cluster .
Average correlation value reflects the degree
of inclusion of to cluster .
18
Divisive Correlation Clustering Algorithm
m samples
X1 1
m
K disjoint clusters
DCCA
C1
n genes
Xn 1
C2
Ck
m
19
Step 1:
Step 2: for each iteration, do:
Step 2-i:
20
Step 2:
Step 2-ii:
Step 2-iii:
Which cluster exists the most repulsion value?
C1
C2
Cp
Cluster C!
21
Step 2-iv:
Cp
xk
xk
xk
xk
xk
xi
xi
xk
xk
Cluster C
Cq
xj
xj
22
Step 2-v:
Place a copy of xk
xk
C1
C2
CK
The highest average correlation value!
C1
Cx2k
CK
CNEW: new clusters
23
Step 2-vi:
C1
C2
CK
Any change?
C1
C2
CNEW: new clusters
CK
24
Introduction
Divisive Correlation Clustering Algorithm
Results
Conclusions
25
Performance comparison
A synthetic dataset ADS
Nine gene expression datasets
26
A synthetic dataset ADS:
Three groups.
27
Experimental results:
Clustering correctly.
28
Experimental results:
Undesired Clusters.
Undesired Clusters.
29
Five yeast datasets:
Yeast ATP, Yeast PHO, Yeast AFR, Yeast AFRt,
Yeast Cho et al.
Four mammalian datasets:
GDS958 Wild type, GDS958 Knocked out,
GDS1423, GDS2745.
30
Performance comparison: z-score is
calculated by observing the relation between
a clustering result and the functional
annotation of the genes in the cluster.
Mutual information
The entropies for each cluster-attribute pair.
Attributes
The entropies for clustering result
independent of attributes.
The entropies for each of
the NA attributes
independent of clusters.
31
z-score is defined as:
Mean of these MI-values.
The computed MI for the clustered
data, using the attribute database.
The standard deviation
of these MI-values.
MIrandom is computed by computing MI for a clustering
obtained by randomly assigning genes to clusters of uniform
size and repeating until a distribution of values is obtained.
32
A higher value of z indicates that genes would
be better clustered by function, indicating a
more biologically relevant clustering result.
Gibbons ClusterJudge tool is used to
calculating z-score for five yeast datasets.
33
Experimental results:
34
Experimental results:
35
Experimental results:
36
Experimental results:
37
Experimental results:
38
Introduction
Divisive Correlation Clustering Algorithm
Results
Conclusions
39
Pros:
DCCA is able to obtain clustering solution from
gene-expression dataset with high biological
significance.
DCCA detects clusters with genes in similar
variation pattern of expression profiles, without
taking the expected number of clusters as an
input.
40
Cons:
The computation cost for repairing any
misplacement occurring in clustering step is high.
DCCA will not work if dataset contains less than 3
samples. The correlation value will be either +1 or
-1.
41