Identifying differentially expressed genes and clusters of genes in

Discrimination and clustering with
microarray gene expression data
Terry Speed, Jane Fridlyand,
Yee Hwa Yang and Sandrine Dudoit*
Department of Statistics, UC Berkeley,
*Department of Biochemistry, Stanford University
ENAR, Charlotte NC, March 27 2001
Outline
Introductory comments
Classification
Clustering
A synthesis
Concluding remarks
Tumor classification
A reliable and precise classification of tumors
is essential for successful treatment of cancer.
Current methods for classifying human
malignancies rely on a variety of morphological,
clinical and molecular variables.
In spite of recent progress, there are still
uncertainties in diagnosis. Also, it is likely that
the existing classes are heterogeneous.
DNA microarrays may be used to characterize
the molecular variations among tumors by
monitoring gene expression on a genomic scale.
Tumor classification, ctd
There are three main types of statistical problems
associated with tumor classification:
1. The identification of new/unknown tumor classes
using gene expression profiles;
2. The classification of malignancies into known classes;
3. The identification of “marker” genes that characterize
the different tumor classes.
These issues are relevant to other questions we meet,
e.g. characterising/classifying neurons or the toxicity
of chemicals administered to cells or model animals.
Gene Expression Data
Gene expression data on p genes for n samples
                   mRNA samples
Genes   sample1  sample2  sample3  sample4  sample5  …
1          0.46     0.30     0.80     1.51     0.90  ...
2         -0.10     0.49     0.24     0.06     0.46  ...
3          0.15     0.74     0.04     0.10     0.20  ...
4         -0.45    -1.03    -0.79    -0.56    -0.32  ...
5         -0.06     1.06     1.35     1.09    -1.09  ...
Gene expression level of gene i in mRNA sample j
= Log(Red intensity / Green intensity) for two-colour cDNA arrays, or
  Log(Avg. PM - Avg. MM) for Affymetrix oligonucleotide arrays
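A minimal sketch of the two expression measures, for illustration only. The slide writes "Log" without a base; base 2 is assumed here, and the function names and the handling of non-positive PM - MM differences are likewise assumptions, not part of the talk.

```python
import math


def log_ratio(red, green):
    """Two-colour measure: log2(red / green). Base 2 is an assumption."""
    return math.log2(red / green)


def log_pm_minus_mm(pm, mm):
    """Measure based on probe pairs: log2(avg PM - avg MM).

    pm and mm are lists of intensities. The measure is undefined when
    the averaged difference is not positive, so this sketch raises.
    """
    diff = sum(pm) / len(pm) - sum(mm) / len(mm)
    if diff <= 0:
        raise ValueError("avg PM - avg MM must be positive to take a log")
    return math.log2(diff)
```

For example, a spot with red intensity 400 and green intensity 100 gets an expression level of 2 on this scale, i.e. a four-fold difference.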
Comparison of discrimination methods
In this field many people are inventing new methods of
classification or using quite complex ones (e.g. SVMs).
Is this necessary?
We did a study comparing several methods on three
publicly available tumor data sets: the Leukemia data
set, the Lymphoma data set, and the NIH 60 tumor cell
line data, as well as some unpublished data sets.
We compared NN, FLDA, DLDA, DQDA and CART, the
last with or without aggregation (bagging or boosting).
The results were unequivocal: simplest is best!
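To give a sense of how simple the simple methods are, here is a hedged sketch of DLDA (diagonal linear discriminant analysis): class means per gene, one pooled within-class variance per gene, and assignment to the class with the smallest variance-scaled squared distance. Equal class priors are assumed, the function names are illustrative, and this is not the code used in the study.

```python
def fit_dlda(X, y):
    """X: list of samples (each a list of p expression values); y: class labels.

    Returns (class means, pooled within-class variance per gene).
    """
    classes = sorted(set(y))
    p = len(X[0])
    means = {}
    pooled_ss = [0.0] * p
    for k in classes:
        rows = [x for x, label in zip(X, y) if label == k]
        means[k] = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
        for r in rows:
            for j in range(p):
                pooled_ss[j] += (r[j] - means[k][j]) ** 2
    dof = len(X) - len(classes)  # pooled degrees of freedom
    variances = [ss / dof for ss in pooled_ss]
    return means, variances


def predict_dlda(model, x):
    """Assign x to the class minimising sum_j (x_j - mean_kj)^2 / var_j."""
    means, variances = model

    def score(k):
        return sum((xj - mj) ** 2 / vj
                   for xj, mj, vj in zip(x, means[k], variances))

    return min(means, key=score)
```

Usage: `model = fit_dlda(X, y)` followed by `predict_dlda(model, x_new)`. Genes with zero pooled variance would need to be filtered out first.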
Images of correlation matrix between 81 samples
(two panels: 4,682 genes; 50 genes)
Lymphoma data set: 29 B-CLL, 9 FL, 43 DLBCL
Cluster Analysis
Can cluster genes, cell samples, or both.
Strengthens signal when averages are taken
within clusters of genes (Eisen).
Useful (essential?) when seeking new
subclasses of cells, tumors, etc.
Leads to readily interpreted figures.
Clusters
Taken from Nature, February 2000:
A. Alizadeh et al., Distinct types of diffuse large
B-cell lymphoma identified by gene expression profiling.
Discovering sub-groups
Clustering problems
Suppose we have gene expression
data on p genes for n tumor mRNA
samples in the form of gene expression
profiles xi = (xi1, …, xip), i=1,…,n.
Three related tasks are:
1. Estimating the number of tumor clusters;
2. Assigning each tumor sample to a cluster;
3. Assessing the strength/confidence of cluster
assignments for individual tumors.
These are generic clustering problems.
Assessing the strength/confidence of
cluster assignments
The silhouette width of an observation is
s = (b - a) / max(a, b)
where a is the average dissimilarity
between the observation and all others in
the cluster to which it belongs, and b is the
smallest of the average dissimilarities
between the observation and ones in other
clusters. Large s means well clustered.
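The definition above can be sketched directly in code. This is a minimal illustration, assuming Euclidean distance as the dissimilarity and a width of 0 for an observation that is alone in its cluster (a common convention, not stated on the slide); the function names are illustrative.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def silhouette_widths(points, labels, dissim=euclidean):
    """Return s = (b - a) / max(a, b) for every observation."""
    widths = []
    for i, (x, own) in enumerate(zip(points, labels)):
        sums, counts = {}, {}
        for j, (y, other) in enumerate(zip(points, labels)):
            if i == j:
                continue  # an observation is not compared with itself
            sums[other] = sums.get(other, 0.0) + dissim(x, y)
            counts[other] = counts.get(other, 0) + 1
        rivals = [c for c in counts if c != own]
        if own not in counts or not rivals:
            widths.append(0.0)  # singleton cluster or only one cluster
            continue
        a = sums[own] / counts[own]                    # own-cluster average
        b = min(sums[c] / counts[c] for c in rivals)   # nearest other cluster
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths
```

Values near 1 indicate a well-clustered observation, values near 0 an observation lying between two clusters, and negative values an observation that is on average closer to another cluster than to its own.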
Bagging
• In discriminant analysis, it is well known that gains in
accuracy can be obtained by aggregating predictors built
from perturbed versions of the learning set (cf. bagging
and boosting).
• In the bootstrap aggregating or bagging procedure,
perturbed learning sets of the same size as the original
learning set are formed by drawing at random with
replacement from the learning set, i.e., by forming nonparametric bootstrap replicates of the learning set.
• Predictors are built for each perturbed dataset and
aggregated by plurality voting.
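The bagging procedure just described can be sketched as follows. The base predictor here is a nearest-centroid rule, chosen only to keep the example short; it is an assumption for illustration, as are the function names and the default of 50 bootstrap replicates.

```python
import random
from collections import Counter


def nearest_centroid_fit(X, y):
    """Base predictor (illustrative choice): one centroid per class."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}


def nearest_centroid_predict(centroids, x):
    return min(centroids,
               key=lambda k: sum((a - b) ** 2 for a, b in zip(x, centroids[k])))


def bagged_predict(X, y, x_new, n_boot=50, seed=0):
    """Aggregate predictors built on bootstrap replicates by plurality vote."""
    rng = random.Random(seed)
    n = len(X)
    votes = Counter()
    for _ in range(n_boot):
        # Nonparametric bootstrap: draw n cases with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        model = nearest_centroid_fit([X[i] for i in idx], [y[i] for i in idx])
        votes[nearest_centroid_predict(model, x_new)] += 1
    winner = votes.most_common(1)[0][0]
    return winner, votes
```

The vote counts returned alongside the winning class give a rough indication of how stable the prediction is across perturbed learning sets.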
Bagging a clustering algorithm
For a fixed number k of clusters
– Generate multiple bootstrap learning sets (B=50);
– Apply the clustering algorithm to each bootstrap
learning set;
– Re-label the clusters for the bootstrap learning sets so
that there is maximum overlap with the original
clustering of these observations;
– The cluster assignment of each observation is then
obtained by plurality voting.
Record for each observation its cluster vote (CV),
which is the proportion of votes in favour of the
“winning” cluster.
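A hedged sketch of the steps above. The clustering algorithm is a small k-means stand-in with a deterministic start, used only so the example is self-contained; the slide does not name the algorithm, and all function names are illustrative. Re-labelling tries every permutation of the k labels, which is only practical for small k. An observation that never appears in a bootstrap set keeps its original label with a cluster vote of 0, which is a convention of this sketch.

```python
import itertools
import random
from collections import Counter


def kmeans(points, k, n_iter=25):
    """Stand-in clustering algorithm: Lloyd's k-means, started from the
    first k distinct points. Returns one label per point."""
    centers = []
    for p in points:
        if p not in centers:
            centers.append(list(p))
        if len(centers) == k:
            break
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(len(centers)),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centers[c])))
                  for p in points]
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def bagged_clustering(points, k, n_boot=50, seed=0):
    rng = random.Random(seed)
    n = len(points)
    original = kmeans(points, k)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        boot = kmeans([points[i] for i in idx], k)
        # Re-label so that overlap with the original clustering is maximal.
        best = max(itertools.permutations(range(k)),
                   key=lambda perm: sum(perm[b] == original[i]
                                        for i, b in zip(idx, boot)))
        for i, b in set(zip(idx, boot)):  # one vote per observation
            votes[i][best[b]] += 1
    labels, cluster_votes = [], []
    for i in range(n):
        if votes[i]:
            label, count = votes[i].most_common(1)[0]
            labels.append(label)
            cluster_votes.append(count / sum(votes[i].values()))
        else:
            labels.append(original[i])
            cluster_votes.append(0.0)
    return labels, cluster_votes
```

A cluster vote close to 1 means the observation lands in the same cluster in almost every bootstrap replicate; low values flag observations whose assignment is unstable.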
Lymphoma data set
Leukemia data set
Comparison of clustering and other
approaches to microarray data analysis
Cluster analyses:
1) are usually outside the normal framework of
statistical inference;
2) are less appropriate when only a few genes are likely
to change;
3) need lots of experiments.
Single gene approaches:
1) may be too noisy in general to show much;
2) may not reveal coordinated effects of positively
correlated genes;
3) are harder to relate to pathways.
Clustering as a means to an end
We and others (Stanford) are working on methods
which try to combine clustering with
more traditional approaches to microarray data
analysis.
Idea: find clusters of genes and average their
responses to reduce noise and enhance
interpretability.
Use testing to assign significance with averages of
clusters of genes as we would with single genes.
Clustering genes
E.g. p=5 (dendrogram leaves: genes 1, 2, 3, 4, 5)
Cluster 6=(1,2)
Cluster 7=(1,2,3)
Cluster 8=(4,5)
Cluster 9=(1,2,3,4,5)
Let p = number of genes.
1. Calculate within class
correlation.
2. Perform hierarchical
clustering which will produce
(2p-1) clusters of genes.
3. Average within clusters of
genes.
4. Perform testing on averages
of clusters of genes as if they
were single genes.
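The four steps can be sketched as below. Several choices are assumptions made for illustration and are not specified on the slide: dissimilarity is taken as 1 minus the within-class correlation, linkage is average linkage, and the test is a Welch two-sample t-statistic. Function names are illustrative, and gene indices start at 0.

```python
import math
from statistics import mean, variance


def within_class_corr(data, classes):
    """data[g][j] = expression of gene g in sample j.
    Genes are correlated after removing each class mean (step 1)."""
    centred = []
    for row in data:
        cmean = {c: mean(v for v, k in zip(row, classes) if k == c)
                 for c in set(classes)}
        centred.append([v - cmean[k] for v, k in zip(row, classes)])

    def corr(u, v):
        num = sum(a * b for a, b in zip(u, v))
        return num / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

    p = len(data)
    return [[corr(centred[g], centred[h]) for h in range(p)] for g in range(p)]


def hierarchical_clusters(corr):
    """Average-linkage agglomeration (step 2). Returns 2p-1 clusters as
    tuples of gene indices: the p singletons first, then each merge."""
    active = [(g,) for g in range(len(corr))]
    clusters = list(active)

    def dissim(a, b):
        return mean(1 - corr[g][h] for g in a for h in b)

    while len(active) > 1:
        a, b = min(((a, b) for i, a in enumerate(active)
                    for b in active[i + 1:]),
                   key=lambda pair: dissim(*pair))
        active = [c for c in active if c not in (a, b)] + [a + b]
        clusters.append(a + b)
    return clusters


def t_statistic(x, y):
    """Welch two-sample t-statistic (an assumed choice of test)."""
    return (mean(x) - mean(y)) / math.sqrt(variance(x) / len(x)
                                           + variance(y) / len(y))


def cluster_t_statistics(data, classes, group_a, group_b):
    """Steps 3 and 4: average within each cluster, then test the averages."""
    results = []
    for members in hierarchical_clusters(within_class_corr(data, classes)):
        avg = [mean(data[g][j] for g in members) for j in range(len(classes))]
        x = [v for v, k in zip(avg, classes) if k == group_a]
        y = [v for v, k in zip(avg, classes) if k == group_b]
        results.append((members, t_statistic(x, y)))
    return results
```

Sorting the returned list by absolute t-statistic gives a ranking of single genes and gene clusters together. Because the 2p-1 clusters are nested, their statistics are strongly dependent, which matters for any multiple-testing adjustment.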
Data - Ro1
Transgenic mice with a modified Gi-coupled receptor (Ro1).
Experiment: induced expression of Ro1 in mice.
8 control (ctl) mice
9 treatment mice, eight weeks after Ro1 was induced.
Long-term question: Which groups of genes work together?
Based on paper: Conditional expression of a Gi-coupled
receptor causes ventricular conduction delay and a lethal
cardiomyopathy, see Redfern C. et al. PNAS, April 25, 2000.
http://www.pnas.org also
http://www.GenMAPP.org/ (Conklin lab, UCSF)
Histogram
Cluster of genes
(1703, 3754)
Top 15 averages of gene clusters
T        Group ID
-13.4    7869
-12.1    3754
 11.8    6175
 11.7    4689
 11.3    6089
 11.2    1683
-10.7    2272
 10.7    9955
 10.7    5179
 10.6    3916
-10.4    8255
-10.4    4772
-10.4    10548
 10.3    9476
= (1703, 3754)
Might be influenced by 3754
 1 0.7 0.7 
0.7 1 0.8 

0.7 0.8 1 

= (6194, 1703, 3754)
= (4572, 4772, 5809)
Correlation
= (2534, 1343, 1954)
= (6089, 5455, 3236, 4014)
1 0.5 0.5
0.5 1 0.8

0.5 0.8 1 

Closing remarks
More sophisticated classification methods may
become justified when data sets are larger.
There seems to be considerable room for
approaches which bring cluster analysis into a
more traditional statistical framework.
The idea of using clustering to obtain derived
variables seems promising, but has yet to realise
this promise.
Acknowledgments
UCB
Yee Hwa Yang
Jane Fridlyand
Stanford
Sandrine Dudoit
WEHI
Natalie Thorne
UCSF
Bruce Conklin
Karen Vranizan
PMCI
David Bowtell
Chuang Fong Kong