Divisive Correlation Clustering Algorithm (DCCA) for grouping of genes


Anindya Bhattacharya and Rajat K. De
Bioinformatics, 2008




Introduction
Divisive Correlation Clustering Algorithm
Results
Conclusions

Correlation Clustering


Correlation clustering was proposed by Bansal et al. (Machine Learning, 2004).
It is based on the notion of graph partitioning.

How to construct the graph?
 Nodes: genes.
 Edges: correlations between pairs of genes.

Two types of edges:
 Positive edge.
 Negative edge.

For example:
 Positive correlation coefficient between genes X and Y: positive edge.
 Negative correlation coefficient between genes X and Y: negative edge.

Graph Construction and Graph Partitioning
[Figure: a signed graph over the genes A, B, C, D, E, F, G, H is partitioned into Cluster 1 = {A, B, C} and Cluster 2 = {D, E, F, G, H}.]
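To make the construction concrete, here is a minimal Python sketch (not from the paper) that builds such a signed graph with NumPy; the toy expression matrix and gene names are made up for illustration.

```python
import numpy as np

def build_signed_graph(expr, gene_names):
    """Build a signed correlation graph: one node per gene, one edge per
    gene pair, labelled positive or negative by the sign of the Pearson
    correlation between the two expression profiles."""
    corr = np.corrcoef(expr)            # Pearson correlations between gene rows
    edges = []
    n = len(gene_names)
    for i in range(n):
        for j in range(i + 1, n):
            sign = "positive" if corr[i, j] > 0 else "negative"
            edges.append((gene_names[i], gene_names[j], sign, corr[i, j]))
    return edges

# Toy example: 4 genes, 5 samples
expr = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],   # gene A
    [1.1, 2.1, 2.9, 4.2, 5.1],   # gene B (similar to A -> positive edge)
    [5.0, 4.0, 3.0, 2.0, 1.0],   # gene C (opposite of A -> negative edge)
    [2.0, 1.0, 4.0, 3.0, 5.0],   # gene D
])
for u, v, sign, r in build_signed_graph(expr, ["A", "B", "C", "D"]):
    print(f"{u}-{v}: {sign} edge (r = {r:+.2f})")
```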

How to measure the quality of clusters?
 The number of agreements.
 The number of disagreements.


The number of agreements: the number of genes that are in correct clusters.
The number of disagreements: the number of genes wrongly clustered.

For example:
[Figure: a signed graph clustered into Cluster 1 = {A, B, C, D} and Cluster 2 = {E}.]

The measure of agreements is the sum of:
(1) the # of positive edges within the same cluster
(2) the # of negative edges between different clusters
Here: 4 + 4 = 8.

The measure of disagreements is the sum of:
(1) the # of negative edges within the same cluster
(2) the # of positive edges between different clusters
Here: 0 + 2 = 2.
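A small Python sketch of this agreement/disagreement count for a given clustering, following the definitions above; the edge list and cluster assignment below are hypothetical, not the exact graph of the figure.

```python
def count_agreements(edges, cluster_of):
    """edges: list of (gene_u, gene_v, sign) with sign '+' or '-'.
    cluster_of: dict mapping gene -> cluster id.
    Agreements    = positive edges inside a cluster + negative edges across clusters.
    Disagreements = negative edges inside a cluster + positive edges across clusters."""
    agree = disagree = 0
    for u, v, sign in edges:
        same = cluster_of[u] == cluster_of[v]
        if (sign == "+" and same) or (sign == "-" and not same):
            agree += 1
        else:
            disagree += 1
    return agree, disagree

# Hypothetical toy graph (signs chosen for illustration only)
edges = [("A", "B", "+"), ("A", "C", "+"), ("B", "C", "+"),
         ("A", "E", "-"), ("C", "E", "+")]
cluster_of = {"A": 1, "B": 1, "C": 1, "E": 2}
print(count_agreements(edges, cluster_of))   # (4, 1)
```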



Minimization of disagreements, or equivalently maximization of agreements (their sum is the fixed total number of edges)!
However, this problem is NP-complete, as proved by Bansal et al., 2004.
Another problem is that only the signs of the correlation coefficients are used; their magnitudes are ignored.




Introduction
Divisive Correlation Clustering Algorithm
Results
Conclusions



Pearson correlation coefficient
Terms and measurements used in DCCA
Divisive Correlation Clustering Algorithm


Consider a set of genes X = {x_1, x_2, ..., x_n}, for each of which m expression values (samples) are given.
The Pearson correlation coefficient between two genes x_i and x_j is defined as:

corr(x_i, x_j) = \frac{\sum_{l=1}^{m} (x_{il} - \bar{x}_i)(x_{jl} - \bar{x}_j)}{\sqrt{\sum_{l=1}^{m} (x_{il} - \bar{x}_i)^2 \, \sum_{l=1}^{m} (x_{jl} - \bar{x}_j)^2}}

where x_{il} is the lth sample value of gene x_i and \bar{x}_i is the mean value of gene x_i over the m samples.
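As a sanity check, the coefficient can be computed directly from the formula above; a short NumPy sketch (variable names are illustrative), which agrees with np.corrcoef.

```python
import numpy as np

def pearson(x_i, x_j):
    """Pearson correlation between two expression profiles of m samples:
    covariance divided by the product of the standard deviations,
    both computed over the m samples."""
    x_i, x_j = np.asarray(x_i, float), np.asarray(x_j, float)
    di, dj = x_i - x_i.mean(), x_j - x_j.mean()
    return (di * dj).sum() / np.sqrt((di ** 2).sum() * (dj ** 2).sum())

x_i = [2.1, 3.5, 1.0, 4.2, 3.3]
x_j = [1.9, 3.6, 1.2, 4.0, 3.1]
print(pearson(x_i, x_j))                 # close to +1: strongly positively correlated
print(np.corrcoef(x_i, x_j)[0, 1])       # same value via NumPy
```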


corr(x_i, x_j) > 0: x_i and x_j are positively correlated, with the degree of correlation given by its magnitude.
corr(x_i, x_j) < 0: x_i and x_j are negatively correlated, with degree |corr(x_i, x_j)|.

We define some terms and measurements
used in DCCA:
 Attraction
 Repulsion
 Attraction/Repulsion value
 Average correlation value



Attraction: there is an attraction between x_i and x_j if corr(x_i, x_j) > 0.
Repulsion: there is a repulsion between x_i and x_j if corr(x_i, x_j) < 0.
Attraction/Repulsion value: the magnitude |corr(x_i, x_j)| is the strength of the attraction or repulsion.


The genes will be grouped into K disjoint clusters C_1, C_2, ..., C_K.
Average correlation value: the average correlation value of a gene x_i with respect to a cluster C_p is defined as

\frac{1}{|C_p|} \sum_{x_j \in C_p} corr(x_i, x_j)

where |C_p| is the number of data points (genes) in C_p.


The average correlation value indicates the average correlation of a gene x_i with the other genes inside the cluster C_p.
It reflects the degree of inclusion of x_i in cluster C_p.
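A minimal sketch of this average correlation value, assuming a precomputed Pearson correlation matrix and a cluster represented as a list of gene indices; excluding the gene itself from the average is an assumption made here.

```python
import numpy as np

def average_correlation(corr, i, cluster):
    """Average correlation value of gene i with respect to a cluster:
    the mean of corr(i, j) over the genes j in the cluster
    (the gene itself is excluded here -- an assumption)."""
    members = [j for j in cluster if j != i]
    return float(np.mean([corr[i, j] for j in members]))

# corr: an n x n Pearson correlation matrix over the genes
corr = np.corrcoef(np.random.default_rng(0).normal(size=(6, 10)))
print(average_correlation(corr, 1, [0, 2, 4]))   # degree of inclusion of gene 1
```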

Divisive Correlation Clustering Algorithm
[Figure: an input matrix of n genes (X1, ..., Xn) by m samples is fed to DCCA, which outputs K disjoint clusters C1, C2, ..., CK.]

Step 1:

Step 2: for each iteration, do:
 Step 2-i:

Step 2:
 Step 2-ii:
 Step 2-iii: Which cluster has the highest repulsion value?
[Figure: among the clusters C1, C2, ..., Cp, the one with the highest repulsion is selected as cluster C.]
 Step 2-iv:
[Figure: the genes of cluster C (e.g. xi, xj, xk) are divided between two clusters Cp and Cq.]
 Step 2-v:
Place a copy of xk in the cluster, among C1, C2, ..., CK, with which it has the highest average correlation value (CNEW: the new clusters).
 Step 2-vi:
Compare the new clusters CNEW with the previous clusters C1, C2, ..., CK: any change?
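Only the headings of Steps 1 and 2-i to 2-vi survive in this transcript, so the following Python sketch fills in the loop structure under explicit assumptions (split the cluster containing the strongest repulsion, then reassign every gene to the cluster with the highest average correlation value, until nothing changes). It is a rough illustration of the iteration, not the authors' exact algorithm.

```python
import numpy as np

def dcca_sketch(expr, max_iter=100):
    """Rough structural sketch of the DCCA iteration, NOT the published
    algorithm: the splitting and repair details are assumed."""
    corr = np.corrcoef(expr)              # Step 1 (assumed): correlation matrix
    n = corr.shape[0]
    clusters = [list(range(n))]           # start with a single cluster
    for _ in range(max_iter):
        # Step 2-iii: find the most repulsive (most negative) within-cluster pair
        worst = (0.0, None, None)
        for c in clusters:
            for a in range(len(c)):
                for b in range(a + 1, len(c)):
                    i, j = c[a], c[b]
                    if corr[i, j] < worst[0]:
                        worst = (corr[i, j], i, j)
        _, i, j = worst
        if i is None:                     # no repulsion inside any cluster: stop
            break
        # Step 2-iv (assumed): seed a new cluster with one gene of that pair
        clusters = [[g for g in c if g != j] for c in clusters] + [[j]]
        # Step 2-v (simplified): reassign every gene to the cluster with the
        # highest average correlation value
        new = [[] for _ in clusters]
        for g in range(n):
            avgs = []
            for c in clusters:
                members = [m for m in c if m != g]
                avgs.append(np.mean(corr[g, members]) if members else 1.0)  # lone gene: assume 1.0
            new[int(np.argmax(avgs))].append(g)
        clusters = [c for c in new if c]  # Step 2-vi: iterate until stable
    return clusters
```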




Introduction
Divisive Correlation Clustering Algorithm
Results
Conclusions

Performance comparison
 A synthetic dataset ADS
 Nine gene expression datasets

A synthetic dataset ADS:
Three groups.

Experimental results:
Clustering correctly.

Experimental results:
Undesired clusters.

Five yeast datasets:
 Yeast ATP, Yeast PHO, Yeast AFR, Yeast AFRt,
Yeast Cho et al.

Four mammalian datasets:
 GDS958 Wild type, GDS958 Knocked out,
GDS1423, GDS2745.

Performance comparison: the z-score is calculated by observing the relation between a clustering result and the functional annotation of the genes in each cluster.
It is based on mutual information, computed from:
 the entropies for each cluster-attribute pair,
 the entropy of the clustering result independent of the attributes,
 the entropies for each of the NA attributes independent of the clusters.

The z-score is defined as:

z = \frac{MI_{real} - \overline{MI}_{random}}{\sigma(MI_{random})}

where MI_real is the computed MI for the clustered data using the attribute database, \overline{MI}_{random} is the mean of the MI-values for random clusterings, and \sigma(MI_{random}) is their standard deviation. MI_random is computed by computing MI for a clustering obtained by randomly assigning genes to clusters of uniform size, repeating until a distribution of values is obtained.


A higher value of z indicates that the genes are better clustered by function, i.e. a more biologically relevant clustering result.
The Gibbons ClusterJudge tool is used to calculate the z-score for the five yeast datasets.

Experimental results:
[Result tables/figures for the yeast and mammalian datasets.]




Introduction
Divisive Correlation Clustering Algorithm
Results
Conclusions

Pros:
 DCCA is able to obtain clustering solutions of high biological significance from gene-expression datasets.
 DCCA detects clusters of genes with similar variation patterns in their expression profiles, without taking the expected number of clusters as an input.

Cons:
 The computation cost of repairing any misplacement that occurs in the clustering step is high.
 DCCA will not work if the dataset contains fewer than 3 samples, since every correlation value will then be either +1 or -1 (see the small check below).
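The "fewer than 3 samples" limitation can be checked quickly: with only two samples, any two non-constant expression profiles lie on a straight line, so their Pearson correlation is forced to +1 or -1. A small NumPy check on toy data:

```python
import numpy as np

# 5 genes measured in only 2 samples: every off-diagonal Pearson
# correlation is +1 or -1, since two points always fit a line exactly.
expr = np.random.default_rng(1).normal(size=(5, 2))
print(np.round(np.corrcoef(expr), 6))
```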