Lecture 1/12/2006

Download Report

Transcript Lecture 1/12/2006

Cluster Analysis of Gene Expression Profiles
• Identifying groups of genes that exhibit a similar expression
"behavior" across a number of experimental conditions
– Assuming that such "co-expression" will tell us something about these
genes are regulated or even possibly something about their function
(Functional Genomics)
• Using information from multiple genes at a time - as opposed to
the single gene at a time analysis we did so far
• We can also cluster biological samples based on the expression of
some or all of the genes
– Example: Identifying groups of molecularly similar tumor
– "Molecular phenotyping"
• "Unsupervised learning" in the computer science lingo
1-12-2006
1
Cluster analysis
source("http://eh3.uc.edu/teaching/cfg/2006/R/ClusterAnalysis.R")
• Often a large portion of genes are "not interesting"
• The meaning of the "not interesting" depends on the
context
– Possibly we are interested in genes that whose expression is not
constant across all experimental conditions. To remove "noninteresting" genes one can apply a "variation filter".
– Various sorts of "filtering" of "non-interesting" genes generally
amounts to performing some kind of informal statistical testing
with a very low confidence.
• For now, we will just play with our data with some
more exciting examples to follow
• We have six measurements for each gene and will try to
cluster genes and experimental conditions using this
data
1-12-2006
2
Cluster analysis
> load(url("http://eh3.uc.edu/teaching/cfg/2006/data/SimpleData.RData"))
> Nic<-grep("Nic",dimnames(SimpleData)[[2]])
> Ctl<-grep("Ctl",dimnames(SimpleData)[[2]])
> MNic<-apply(SimpleData[,Nic],1,mean,na.rm=TRUE)
> VNic<-apply(SimpleData[,Nic],1,var,na.rm=TRUE)
> MCtl<-apply(SimpleData[,Ctl],1,mean,na.rm=TRUE)
> VCtl<-apply(SimpleData[,Ctl],1,var,na.rm=TRUE)
> NNic<-apply(!is.na(SimpleData[,Nic]),1,sum,na.rm=TRUE)
> NCtl<-apply(!is.na(SimpleData[,Ctl]),1,sum,na.rm=TRUE)
> VNicCtl<-(((NNic-1)*VNic)+((NCtl-1)*VCtl))/(NCtl+NNic-2)
> DF<-NNic+NCtl-2
> TStat<-abs(MNic-MCtl)/((VNicCtl*((1/NNic)+(1/NCtl)))^0.5)
> TPvalue<-2*pt(TStat,DF,lower.tail=FALSE)
> SigGenes<-(TPvalue<0.001)
> sum(SigGenes)
[1] 7
1-12-2006
3
Cluster analysis
> library(marray)
> library(mclust)
> pal<-maPalette(low="green", high="red", mid="black")
> MinExp<-min(SimpleData[SigGenes,2:7])
> MaxExp<-max(SimpleData[SigGenes,2:7])
> heatmap(data.matrix(SimpleData[SigGenes,2:7]),Colv=NA,Rowv=NA,col=pal,labRow=as.character(SimpleData[SigGenes,1]),scale="none")
> maColorBar(seq(MinExp,MaxExp,(MaxExp-MinExp)/5), col=pal, horizontal=FALSE, k=5)
AF057156
16
AF186373
14
NM008181
12
NM008000
9.9
AF192382
8
AK014133
1-12-2006
Ctl.2
Ctl.1
Nic.2
Nic.1
Nic
Ctl
M77497
4
Cluster analysis
> heatmap(data.matrix(SimpleData[SigGenes,2:7]),col=pal,labRow=as.character(SimpleData[SigGenes,1]),scale="none")
AF057156
16
NM008181
14
AK014133
12
AF192382
9.9
AF186373
M77497
8
•
Nic
Nic.2
Nic.1
Ctl.2
Ctl
Ctl.1
NM008000
Genes were selected based on their differences between Nic and Ctl treatments - not
obvsious except for one gene
1-12-2006
5
Cluster analysis - centered data
> CenteredData<-SimpleData[,2:7]-apply(SimpleData[,2:7],1,mean,na.rm=T)
> heatmap(data.matrix(CenteredData[SigGenes,]),col=pal,labRow=as.character(SimpleData[SigGenes,1]),scale="none")
> heatmap(data.matrix(SimpleData[SigGenes,2:7]),col=pal,labRow=as.character(SimpleData[SigGenes,1]))
M77497
AF057156
AF057156
NM008181
NM008181
AK014133
AF186373
AF192382
AK014133
AF186373
AF192382
M77497
NM008000
1-12-2006
Nic
Nic.2
Nic.1
Ctl.2
Ctl
Ctl.1
Nic
Nic.2
Nic.1
Ctl.2
Ctl
Ctl.1
NM008000
6
Hierarchical Clustering
•
Calculating the "distance" or "similarity between each pair of expression
profiles
•
Merging two "closest" profiles, forming a "node" in the clustering tree and
re-calculating the "distance between such a "sub-cluster" and rest of the
profiles or sub-clusters using on of the "linkage" principles. Again merge
two closest sub-clusters
•
Complete linkage - define the distance/similarity between the two clusters
as the maximum/minimum distance/similarity between pairs of profiles
in which one profile is from the first sub-cluster and the other profile is
from the second sub-cluster
•
Average linkage - define the distance/similarity between the two clusters
as the average distance/similarity between pairs of profiles in which one
profile is from the first sub-cluster and the other profile is from the second
sub-cluster
•
Single linkage - define the distance/similarity between the two clusters as
the minimum/maximum distance/similarity between pairs of profiles in
which one profile is from the first sub-cluster and the other profile is from
the second sub-cluster
1-12-2006
7
Euclidian Distance
•
R actually operates on distances, so similarities have to be transformed
into distances - usually straightforward
•
Euclidian distance:
xi  ( xi1 , xi 2 ,...,xiM )...i th profile
x j  ( x j1 , x j 2 ,..., x jM )... j th profile, i, j  1,..., N
M
d e (x i , x j ) 
 (x
ik
 x jk ) 2 ...Euclidian distance ; N(N - 1)/2 distances
k 1
•
In 2 and 3 dimensions, this is our usual, every day's distance
> EDistances<-dist(CenteredData[SigGenes,],method = "euclidean", diag = T, upper = T)
> print(EDistances,digits=2)
1-12-2006
8
Distance Matrix
Distance Matrix - whole:
34
440
596 2797 4466 4512 7651
34
0.00 8.55 5.64 5.46 8.15 8.03 9.14
440
8.55 0.00 3.01 3.19 0.82 0.82 1.13
596
5.64 3.01 0.00 0.33 2.53 2.48 3.59
2797 5.46 3.19 0.33 0.00 2.71 2.62 3.72
4466 8.15 0.82 2.53 2.71 0.00 0.47 1.18
4512 8.03 0.82 2.48 2.62 0.47 0.00 1.14
7651 9.14 1.13 3.59 3.72 1.18 1.14 0.00
Distance Matrix - lower triangular:
> EDistances<-dist(CenteredData[SigGenes,],method = "euclidean")
> print(EDistances,digits=2)
34
440
440
8.55
596
5.64 3.01
596 2797 4466 4512
2797 5.46 3.19 0.33
4466 8.15 0.82 2.53 2.71
4512 8.03 0.82 2.48 2.62 0.47
7651 9.14 1.13 3.59 3.72 1.18 1.14
1-12-2006
9
Dendrograms - Complete Linkage
> Clustering<-hclust(EDistances,method="complete")
> plot(Clustering)
Cluster Dendrogram
34
34
6
8
Distance Matrix - lower triangular:
440
440
8.55
596
5.64 3.01
596 2797 4466 4512
4
4466 8.15 0.82 2.53 2.71
4512 8.03 0.82 2.48 2.62 0.47
4512
4466
7651
2797
596
440
2
7651 9.14 1.13 3.59 3.72 1.18 1.14
0
Height
2797 5.46 3.19 0.33
EDistances
hclust (*, "complete")
1-12-2006
10
Clustering genes and samples
> EDistancesS<-dist(t(CenteredData[SigGenes,]),method = "euclidean")
> ClusteringS<-hclust(EDistancesS,method="complete")
> heatmap(data.matrix(CenteredData[SigGenes,]),Colv=as.dendrogram(ClusteringS),Rowv=as.dendrogram(Clustering),
col=pal,scale="none")
4512
4466
> TwoClusters<-cutree(ClusteringS,k = 2, h = NULL)
> TwoClusters
Ctl
440
1
Nic Nic.1 Nic.2 Ctl.1 Ctl.2
2
2
2
1
1
7651
2797
596
1-12-2006
Ctl.2
Ctl
Ctl.1
Nic.2
Nic
Nic.1
34
11
Clustering by partitioning: K-means algorithm
•
For a pre-specified number of clusters iterate between calculating cluster
"centroides" (i.e. cluster means) and re-assigning each profile to the
cluster with the closest "centroid"
t
μqt  (qt 1 ,..., qM
)...centroid for the q th cluster after t iterations
c t  (c1t ,..., c Nt )...classification vector defining the membership for each
profile after t iterations
•
t+1st iteration:
1) for each profile i, calculate Euclidian distances to all centroids
d e (x i , μ qt ) and assign it to the closest one
c it 1  arg min{ d e (x i , μ qt ), q  1,..., Q}
2) re - calculate medians
N
x
μ qt 1 
•
i
i 1, cit 1  q
nq
iterate until ct+1=ct
1-12-2006
12
Clustering k-means
> TwoCKmeans<-kmeans(t(CenteredData[SigGenes,]), 2, iter.max = 10)
> TwoCKmeans
K-means clustering with 2 clusters of sizes 3, 3
Cluster means:
34
1
440
596
2.510742 -0.9565299
0.2554164
2 -2.510742
2797
4466
4512
7651
0.3246475 -0.770937 -0.7398173 -1.181848
0.9565299 -0.2554164 -0.3246475
0.770937
0.7398173
1.181848
Clustering vector:
Ctl
1
Nic Nic.1 Nic.2 Ctl.1 Ctl.2
2
2
2
1
1
Within cluster sum of squares by cluster:
[1] 1.0805679 0.9474704
Available components:
[1] "cluster"
1-12-2006
"centers"
"withinss" "size"
13
Questions
• How many clusters there are in the data?
• What is the statistical significance of a
clustering?
• What is a confidence in assigning any
particular expression profile to any
particular cluster?
• Difficult questions, particularly difficult
to answer when using heuristic methods like
hierarchical clustering and k-means
• Need statistical models
1-12-2006
14
Two genes at a time
•Are these two genes co-expressed?
•By looking at their expression patterns alone, combined with the “null
distribution” of the similarity measure in non-co-expressed genes, we could
conclude that this is the case.
YDR113c and YDR183c
Pearson Correlation = .83
Statistical Significance
P-value = 0.00006
CLUSTER 1 ; 39 N 2
S
G2
M
G1
S
G2
M
4
G1
0
0.8
-2
Expression Level
2
1.2
-4
0.4
0.0
0
50
100
Time
1-12-2006
150
-1.0
-0.8
-0.6
-0.4
-0.2
0.0
0.2
0.4
0.6
0.8
1.0
Pearson's Correlation Coefficient
15
Another look
•What if we knew that there are two and only two distinct patterns in the data
and we know how they look (thick dashed lines)?
•Given this additional information we are likely to conclude that our two genes
actually have different patterns of expression.
CLUSTER 1 ; 39 N 2
S
G2
M
G1
S
G2
M
0
-4
-2
Expression Level
2
4
G1
0
1-12-2006
50
100
T ime
150
16
Many genes at a time
•Simultaneous detection of “patterns” of expression defined by groups of
expression profiles and assignment of individual expression profiles to
appropriate patterns.
•By looking at “all” genes at the same time, we came up with a completely
different conclusion than when looking at only two of them.
•Questions: How many clusters? How confident are we in the number of
clusters in the data? How confident are we that our two genes belong to two
different clusters? Is such a confidence statement taking into account the
“uncertainty” about the true number of clusters?
CLUSTER
S
G2
1 ;
39
M
N
38
G1
S
G2
M
0
-4
-2
Expression Level
2
4
G1
0
50
100
150
T i me
CLUSTER
S
G2
1 ;
M
39
N
G1
17
S
G2
M
0
-4
-2
Expression Level
2
4
G1
0
1-12-2006
50
100
T i me
150
17
Gene-specific normalization of the data
CLUSTER 1
M
2
0
-2
0
50
100
G1
4
G2
S
G2
150
0
50
Time
G2
15
S
M
10
100
150
G1
S
G2
M
G1
S
G2
M
0
5
0
1-12-2006
M
10
G1
Expression Level
M
G2
5
15
G2
S
CLUSTER 2
0
Expression Level
S
G1
Time
CLUSTER 1
G1
M
2
S
0
G1
-2
M
-4
G2
Scaled Expression Level
S
-4
Scaled Expression Level
4
G1
CLUSTER 2
50
100
Time
150
0
50
100
Time
150
18
Clustering using non-normalized data
K-means
1-12-2006
Euclidian Distance
Pearson's correlation
19
Clustering using normalized data
K-means
1-12-2006
Euclidian Distance
Pearson's correlation
20
Why do we cluster?
• “Guilt by association”
Co-expression
Co-regulation
Functional relationship
1-12-2006
Dissecting regulatory
mechanisms
Assigning function to
genes
21
1-12-2006
Ribosome
Cell cycle
Purine metabolism
Pyrmidine metaoblism
DNA Polymerase
Cell cycle
0 – 160min
W303 0h – 12h
SK1 0h – 12h
Why do we cluster - Functional
Annotation?
22
Dissecting the gene expression
regulatory mechanisms
S.Tavazoie, J.D.Hughes, M.J.Campbell, R.J.Cho, G.M.Church. Systematic determination
of genetic network architecture, Nat.Genet., 22, (1999) 281-285.
1-12-2006
23