Advanced analysis:
Classification, clustering and other
multivariate methods.
Statistics for Microarray Data Analysis – Lecture 4
The Fields Institute for Research in
Mathematical Sciences
May 25, 2002
1
Cluster and discriminant analysis
• These techniques group, or equivalently classify,
observational units on the basis of measurements.
• They differ according to their aims, which in turn depend on
the availability of a pre-existing basis for the grouping.
• In cluster analysis, there are no predefined groups or labels
for the observations, while discriminant analysis is based on
the existence of such groups or labels.
• Alternative terminology
– Computer science: unsupervised and supervised learning.
– Microarray literature: class discovery and class prediction.
2
Tumor classification
A reliable and precise classification of tumors is essential for
successful diagnosis and treatment of cancer.
Current methods for classifying human malignancies rely on a
variety of morphological, clinical, and molecular variables.
In spite of recent progress, there are still uncertainties in
diagnosis. Also, it is likely that the existing classes are
heterogeneous.
DNA microarrays may be used to characterize the molecular
variations among tumors by monitoring gene expression on a
genomic scale. This may lead to a more reliable classification
of tumors.
3
Tumor classification, cont
There are three main types of statistical problems
associated with tumor classification:
1. The identification of new/unknown tumor classes using
gene expression profiles - cluster analysis;
2. The classification of malignancies into known classes - discriminant analysis;
3. The identification of “marker” genes that characterize
the different tumor classes - variable selection.
These issues are relevant to many other questions, e.g.
characterizing/classifying neurons or the toxicity of
chemicals administered to cells or model animals.
4
Clustering microarray data
We can cluster genes (rows), mRNA samples (cols), or
both at once.
• Clustering leads to readily interpretable figures.
• Clustering strengthens the signal when averages are taken
within clusters of genes (Eisen).
• Clustering can be helpful for identifying patterns in time or
space. Our main experience is in this area.
• Clustering is useful, perhaps essential, when seeking new
subclasses of cell samples (tumors, etc).
5
Recap: cDNA gene expression data
Data on p genes for n mRNA samples

                    mRNA samples
         sample1  sample2  sample3  sample4  sample5  ...
Genes
  1        0.46     0.30     0.80     1.51     0.90   ...
  2       -0.10     0.49     0.24     0.06     0.46   ...
  3        0.15     0.74     0.04     0.10     0.20   ...
  4       -0.45    -1.03    -0.79    -0.56    -0.32   ...
  5       -0.06     1.06     1.35     1.09    -1.09   ...

x_ij = expression level of gene i in mRNA sample j
     = log( Red intensity / Green intensity )
y_j  = tumor class for sample j.
6
The problem of finding patterns
In many situations we have a set of hybridizations with
temporal or spatial aspects to them, and seek genes
exhibiting patterns. An early example was the yeast cell cycle
data, a relatively easy case, as the nature of the pattern of
interest was clear.
Harder examples include developmental time series, or the
current case study (the mouse olfactory bulb). Here genes
are “on” or “off”, or “come on” and “go off” separately and
jointly, according to their own programs. In a sense every
gene has its own pattern, though we expect some sharing of
patterns. But most importantly, we usually don’t know the
patterns a priori.
Our problem seems to be distinguishing interesting or real
patterns from meaningless variation, at the level of the gene.
7
The problem, continued
Current solutions to our problem include the various forms of principal
component analysis (PCA), e.g. singular value decomposition (SVD) and
other descriptive multivariate techniques, and, of course, cluster analysis.
These all seem to be based on the idea that we are mostly interested in
patterns shared by enough genes to stand out.
Can we do better?
How can contextual information be used to guide our search?
Have I even posed a meaningful question? Few answers here.
8
Strategies to find patterns
Two principal approaches (that I can see)
1a. Find the genes whose expression fits specific, predefined
patterns.
1b. Find the genes whose expression follows the pattern of a
predefined gene or set of genes.
2. Carry out some kind of exploratory analysis to see what
expression patterns emerge.
I plan to briefly touch on 1a, and then discuss approach 2.
9
Some comments on strategy 2
In essence we are using the large-scale hybridizations as a
screen. Some reasonable fraction of genes with presumptive
‘interesting patterns’ will have these checked by in situ hybs or
something similar.
In essence we want to select genes, based on our analysis, in a
way that gives a good hit rate (low false positive rate), certainly
compared to random selection or to checking out ‘candidate’
genes.
Some patterns are ‘stronger’ than others, and we would like to
see the full range of ‘real’ patterns in our data, including real but
weak ones. A strategy we tried early on was this: simply rank
genes on variation across the set of hybs. The problem was that
the highly ranked genes tended to give the same few patterns:
we didn’t see the full range.
10
Strategy 1a: identifying genes with
specific, predefined patterns
11
Finding early and late response
genes in a short time series

[Figure: log ratio M plotted against time; the curve u shows the values of x² at the time points 1/4, 1/2, 1, 4 and 24 (i.e. 1/16, 1/4, 1, 16 and 576).]

Which genes increase or decrease like the function x²?
12
Doing this in the vector world
• For each gene, we have a vector, y, of expression
estimates at the different time points.
• Project the vector onto the space spanned by the vector u
(the values of x² at our time points): u = (1/16, 1/4, 1, 16, 576).
• C is the scalar product of y with u:

    C = y · u = (1/16) y_1/4 + (1/4) y_1/2 + (1) y_1 + (16) y_4 + (576) y_24

Use a qq-plot of C to identify genes with “significant” projections along u.
14
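A minimal Python sketch of this screen, assuming the expression estimates sit in a genes × time-points matrix Y; the data and variable names below are illustrative, not from the case study:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Time points of the short series and the template u = t^2 at those points.
time_points = np.array([0.25, 0.5, 1.0, 4.0, 24.0])
u = time_points ** 2                           # (1/16, 1/4, 1, 16, 576)

# Y: genes x time points matrix of expression estimates (log ratios).
rng = np.random.default_rng(0)
Y = rng.normal(size=(5000, len(time_points)))  # placeholder data

# C = y . u for every gene: the scalar product with the template.
C = Y @ u

# Normal qq-plot of C; genes far off the line have "significant"
# projections along u.
stats.probplot(C, dist="norm", plot=plt)
plt.title("qq-plot of projections C along u")
plt.show()

# e.g. flag the genes with the most extreme projections
top_genes = np.argsort(np.abs(C))[::-1][:50]
```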
Strategy 2: pattern discovery
15
Clustering microarray data
We can cluster cell samples (cols), e.g. neuron cells, for
identification (profiles). Also, we might want to estimate the
number of different neuron cell types in a set of samples, based
on gene expression levels.
We can cluster genes (rows), e.g. using large numbers of
yeast experiments, to identify groups of co-regulated genes.
We can cluster genes (rows) to reduce redundancy (cf. variable
selection) in predictive models.
16
Row & column
clusters
Taken from Nature, February 2000.
Paper by A. Alizadeh et al.,
“Distinct types of diffuse large B-cell lymphoma
identified by gene expression profiling.”
17
Discovering tumor subclasses
18
Basic principles of clustering
Aim: to group observations that are “similar” based on
predefined criteria.
Issues: Which genes / arrays to use?
Which similarity or dissimilarity measure?
Which clustering algorithm?
It is advisable to reduce the number of genes from the
full set to some more manageable number, before
clustering. The basis for this reduction is usually quite
context specific, see later example.
19
There are many measures of (dis)similarity
• Euclidean distance;
• Mahalanobis distance;
• Manhattan metric;
• Minkowski metric;
• Canberra metric;
• one minus correlation;
• etc.
And: there are many methods of clustering…
20
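Most of these measures are available in scipy; a small sketch on made-up data (the Mahalanobis variant needs an inverse covariance matrix, here estimated from the data):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# X: observations (e.g. genes) in rows, features (e.g. arrays) in columns.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))

d_euclid    = squareform(pdist(X, metric="euclidean"))
d_manhattan = squareform(pdist(X, metric="cityblock"))    # Manhattan metric
d_minkowski = squareform(pdist(X, metric="minkowski", p=3))
d_canberra  = squareform(pdist(X, metric="canberra"))
d_onecorr   = squareform(pdist(X, metric="correlation"))  # one minus correlation

# Mahalanobis distance needs the inverse covariance of the features.
VI = np.linalg.inv(np.cov(X, rowvar=False))
d_mahal = squareform(pdist(X, metric="mahalanobis", VI=VI))
```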
Partitioning methods
Partition the data into a prespecified number k of
mutually exclusive and exhaustive groups.
Iteratively reallocate the observations to clusters
until some criterion is met, e.g. minimize within
cluster sums of squares.
Examples:
– k-means, self-organizing maps (SOM), PAM, etc.;
– Fuzzy: needs stochastic model, e.g. Gaussian mixtures.
21
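A k-means sketch in Python (scikit-learn), assuming the observations to be clustered are the rows of X; k is prespecified and the algorithm iteratively reallocates observations to minimise within-cluster sums of squares:

```python
import numpy as np
from sklearn.cluster import KMeans

# X: observations in rows (e.g. arrays or genes), features in columns.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))       # toy data

k = 4                                # prespecified number of clusters
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

labels = km.labels_                  # cluster assignment of each observation
wss = km.inertia_                    # total within-cluster sum of squares
```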
Hierarchical methods
Hierarchical clustering methods produce a tree
or dendrogram.
They avoid specifying how many clusters are
appropriate by providing a partition for each k
obtained from cutting the tree at some level.
The tree can be built in two distinct ways
– bottom-up: agglomerative clustering;
– top-down: divisive clustering.
22
Agglomerative methods
• Start with n clusters.
At each step, merge the two closest clusters using a
measure of between-cluster dissimilarity, which
reflects the shape of the clusters.
• Between-cluster dissimilarity measures
– Unweighted Pair Group Method with Arithmetic
mean (UPGMA): average of pairwise
dissimilarities.
– Single-link: minimum of pairwise dissimilarities.
– Complete-link: maximum of pairwise dissimilarities.
23
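The three between-cluster dissimilarities above correspond to the "average", "single" and "complete" linkage methods in scipy; a short sketch on toy data, with the tree cut to give a partition into k clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster  # dendrogram also available
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 12))              # observations x features (toy data)

d = pdist(X, metric="euclidean")           # pairwise dissimilarities

Z_upgma    = linkage(d, method="average")  # UPGMA: average pairwise dissimilarity
Z_single   = linkage(d, method="single")   # single-link: minimum
Z_complete = linkage(d, method="complete") # complete-link: maximum

# Cutting the tree at some level gives a partition for each k.
labels_k4 = fcluster(Z_upgma, t=4, criterion="maxclust")
```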
Divisive methods
• Start with only one cluster.
At each step, split clusters into two parts.
• Advantages. Obtain the main structure of the
data, i.e. focus on upper levels of the dendrogram.
• Disadvantages. Computational difficulties when
considering all possible divisions into two
groups.
24
Partitioning vs. hierarchical
Partitioning:
Advantages
• Optimal for certain
criteria.
Disadvantages
• Need initial k;
• Often require long
computation times.
Hierarchical
Advantages
• Faster computation.
Disadvantages
• Rigid;
• Cannot correct later
for erroneous
decisions made
earlier.
25
Three generic clustering problems
Three important tasks (which are generic) are:
1. Estimating the number of clusters;
2. Assigning each observation to a cluster;
3. Assessing the strength/confidence of cluster
assignments for individual observations.
Not equally important in every problem. We now
return to our mouse olfactory bulb experiment.
26
Recall: we have sets of contrasts (two
forms) across the bulb as our patterns
Gene #15,228
27
Patterns and a metric on them
Aim: To identify spatial patterns of differential gene expression.
For every gene, we can define a magnitude for its expression
profile.
    d_i = β̂_i' (X'X)^(-1) β̂_i

where β̂_i is the vector of estimated contrasts for gene i and X is the design matrix.
Simply ranking genes according to the size of di produced a list
dominated by similar patterns.
Sub aim: To identify genes with different gene expression
patterns.
The slight novelty here is that we are clustering on estimated
contrasts from the data, not the log ratios themselves.
28
Clustering genes based on their
similarity in expression patterns
• Start with sets of genes exhibiting some minimal level of
differential expression across the bulb - “top 100” from each
comparison x 15 comparisons = 635 genes chosen here.
• Carry out hierarchical clustering of the 635 genes, building a
dendrogram using Mahalanobis distance and Ward
agglomeration.
• Now consider all 635 clusters of >2 genes in the tree. Singles
are added separately.
• Measure the heterogeneity h of a cluster by calculating the
15 SDs across the cluster of each of the pairwise effects, and
taking the largest.
• Choose a score s (see plots) and take all maximal disjoint
clusters with h < s. Here we used s = 0.46 and obtained 16
clusters.
29
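A rough Python sketch of this procedure, under two simplifying assumptions: the estimated pairwise effects are whitened first, so that Euclidean distance on the whitened effects equals Mahalanobis distance on the originals (scipy's Ward linkage works on Euclidean distances), and clusters are read off from cuts of the tree rather than from all subtrees. The data and names are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

# B: 635 genes x 15 estimated pairwise effects (random stand-in data).
rng = np.random.default_rng(4)
B = rng.normal(size=(635, 15))

# Whiten: Euclidean distance between rows of Bw = Mahalanobis distance on B.
L = np.linalg.cholesky(np.linalg.inv(np.cov(B, rowvar=False)))
Bw = B @ L

# Ward agglomeration on the whitened effects.
Z = linkage(Bw, method="ward")

def heterogeneity(effects):
    """h = the largest of the 15 SDs of the effects across a cluster."""
    return effects.std(axis=0, ddof=1).max()

# Score partitions into k clusters and count clusters passing h < s.
s = 0.46
for k in (10, 16, 25):
    labels = cut_tree(Z, n_clusters=k).ravel()
    hs = [heterogeneity(B[labels == c])
          for c in np.unique(labels) if (labels == c).sum() > 1]
    print(k, "clusters;", sum(h < s for h in hs), "with h <", s)
```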
Plots guiding choice of clusters of genes
[Plots: number of clusters (patterns) and number of genes against cluster heterogeneity h (max of the 15 SDs).]
30
[Figure: expression profiles plotted over the 15 pairwise effects (DA, VA, MA, LA, DP, VP, MP, LP, VD, MD, LD, MV, LV, LM, …). Red: genes chosen; blue: controls.]
31
16 Clusters Systematically Arranged (6 point representation)
32
One Cluster (6 point representation)
Thick red line: average; dashed blue: ±2h.
33
Visualizing gene expression patterns across regions in our data
set, and finding a set of genes that show different spatial patterns.
34
Discrimination
35
Discrimination
• A predictor or classifier for K tumor classes partitions the
space X of gene expression profiles into K disjoint
subsets, A1, ..., AK, such that for a sample with expression
profile x = (x1, ..., xp) ∈ Ak the predicted class is k.
• Predictors are built from past experience, i.e., from
observations which are known to belong to certain
classes. Such observations comprise the learning set L =
(x1, y1), ..., (xn,yn).
• A classifier built from a learning set L is denoted by
C(·, L): X → {1, 2, ..., K},
with the predicted class for observation x being C(x, L).
36
Fisher linear discriminant analysis
First applied in 1935 by M. Barnard at the suggestion of R. A.
Fisher (1936), Fisher linear discriminant analysis (FLDA)
consists of
i. finding linear combinations x'a of the gene expression
profiles x = (x1, ..., xp) with large ratios of between-groups to
within-groups sums of squares - discriminant variables;
ii. predicting the class of an observation x by the class
whose mean vector is closest to x in terms of the
discriminant variables.
37
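A minimal FLDA sketch using scikit-learn's LinearDiscriminantAnalysis as a stand-in; the learning set below is made-up data with 72 samples, 40 genes and 3 classes:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Learning set: X is samples x genes, y the known tumor classes.
rng = np.random.default_rng(5)
X = rng.normal(size=(72, 40))
y = rng.integers(0, 3, size=72)

flda = LinearDiscriminantAnalysis().fit(X, y)

# (i) discriminant variables: linear combinations with large ratios of
#     between-groups to within-groups sums of squares.
discriminant_vars = flda.transform(X)      # shape (72, K-1) = (72, 2)

# (ii) predict the class of a new profile by the closest class mean
#      in terms of the discriminant variables.
x_new = rng.normal(size=(1, 40))
print(flda.predict(x_new))
```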
FLDA
38
Maximum likelihood discriminant rules
When the class conditional densities pr(x|y=k) are
known, the maximum likelihood (ML) discriminant rule
predicts the class of an observation x by
C(x) = argmax_k pr(x | y = k).
For multivariate normal class densities, i.e., for
x | y = k ~ N(μ_k, Σ_k), this is (in general) a quadratic rule.
In practice, population mean vectors and covariance
matrices are estimated by the corresponding sample
quantities.
39
ML discriminant rules - special cases
1. Linear discriminant analysis, LDA. When the class densities
have the same covariance matrix, Σ_k = Σ, the discriminant
rule is based on the square of the Mahalanobis distance and
is linear, given by

    C(x) = argmin_k (x − μ_k)' Σ^(-1) (x − μ_k).

2. Diagonal linear discriminant analysis, DLDA. In this simplest
case, the class densities have the same diagonal
covariance matrix Σ = diag(s_1², …, s_p²), and

    C(x) = argmin_k Σ_{j=1..p} (x_j − μ_kj)² / s_j².

Note. Weighted gene voting of Golub et al. (1999) is a
minor variant of DLDA for two classes (wrong variance
calculation).
40
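DLDA is simple enough to write out directly; a from-scratch sketch of the rule above on made-up data (the function names are mine, and the per-gene variances are pooled within-class estimates):

```python
import numpy as np

def dlda_fit(X, y):
    """Class means and pooled per-gene variances for DLDA."""
    classes = np.unique(y)
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    resid = X - means[np.searchsorted(classes, y)]
    var = (resid ** 2).sum(axis=0) / (len(X) - len(classes))
    return classes, means, var

def dlda_predict(X, classes, means, var):
    """C(x) = argmin_k sum_j (x_j - mu_kj)^2 / s_j^2."""
    d = ((X[:, None, :] - means[None, :, :]) ** 2 / var).sum(axis=2)
    return classes[np.argmin(d, axis=1)]

# Toy usage: 60 samples, 40 genes, 3 classes.
rng = np.random.default_rng(6)
X = rng.normal(size=(60, 40))
y = rng.integers(0, 3, size=60)
classes, means, var = dlda_fit(X, y)
print(dlda_predict(X[:5], classes, means, var))
```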
Nearest neighbor rule
Nearest neighbor methods are based on a measure of
distance between observations, such as the Euclidean
distance or one minus the correlation between two gene
expression profiles.
The k-nearest neighbor rule, due to Fix and Hodges
(1951), classifies an observation x as follows:
i. find the k observations in the learning set that
are closest to x;
ii. predict the class of x by majority vote, i.e.,
choose the class that is most common among
those k observations.
The number of neighbors k can be chosen by cross-validation.
41
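A k-NN sketch with scikit-learn, choosing k by cross-validation on made-up data; Euclidean distance is used here (one minus correlation would need the brute-force neighbour search):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(7)
X = rng.normal(size=(72, 40))       # learning set: samples x genes (toy data)
y = rng.integers(0, 3, size=72)

# Choose the number of neighbours k by cross-validation.
grid = GridSearchCV(
    KNeighborsClassifier(metric="euclidean"),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
    cv=5,
)
grid.fit(X, y)
print("chosen k:", grid.best_params_["n_neighbors"])

# Classify a new profile by majority vote among its k nearest neighbours.
x_new = rng.normal(size=(1, 40))
print(grid.predict(x_new))
```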
Nearest neighbor rule
42
Classification trees
Binary tree structured classifiers are constructed by
repeated splits of subsets (nodes) of the measurement
space X into two descendant subsets, starting with X itself.
Each terminal subset is assigned a class label and the
resulting partition of X corresponds to the classifier.
Three main aspects of tree construction are:
i. the selection of the splits;
ii. the decision to declare a node terminal or to continue splitting;
iii. the assignment of each terminal node to a class.
Different tree classifiers use different approaches to these
three issues. Here, we use CART: Classification And
Regression Trees, Breiman et al. (1984).
43
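A tree sketch using scikit-learn's DecisionTreeClassifier as a stand-in for CART (splits chosen by the Gini criterion, growth simply limited by depth; the data are made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(8)
X = rng.normal(size=(72, 40))       # learning set: samples x genes (toy data)
y = rng.integers(0, 3, size=72)

# Repeated binary splits of the measurement space, starting from X itself;
# each terminal node is assigned a class label.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

print(export_text(tree))            # the selected splits and terminal classes
print(tree.predict(X[:5]))          # predicted classes for some observations
```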
Classification trees
44
Aggregating predictors
Breiman (1996, 1998) found that gains in accuracy
could be obtained by aggregating predictors built from
perturbed versions of the learning set.
In classification, the multiple versions of the predictor
are aggregated by voting.
Let C(., Lb) denote the classifier built from the bth
perturbed learning set Lb , and let wb denote the weight
given to predictions made by this classifier.
The predicted class for an observation x is given by
argmax_k Σ_b w_b I(C(x, L_b) = k).
45
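A bagging-style sketch of this aggregation, with trees as the base classifier, equal weights w_b = 1 and non-parametric bootstrap samples (all choices here are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(9)
X = rng.normal(size=(72, 40))        # learning set (toy data)
y = rng.integers(0, 3, size=72)
K, B = 3, 50                         # number of classes, perturbed learning sets

# Build one classifier per bootstrap sample of the learning set.
classifiers = []
for b in range(B):
    idx = rng.integers(0, len(X), size=len(X))    # non-parametric bootstrap
    classifiers.append(DecisionTreeClassifier(random_state=b).fit(X[idx], y[idx]))

# Aggregate by voting: argmax_k sum_b w_b I(C(x, L_b) = k), with w_b = 1.
votes = np.zeros((len(X), K))
for clf in classifiers:
    votes[np.arange(len(X)), clf.predict(X)] += 1
aggregated_class = votes.argmax(axis=1)
```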
Aggregating predictors
1. Bagging. Bootstrap samples of the same size as the
original learning set.
- non-parametric bootstrap, Breiman (1996);
- convex pseudo-data, Breiman (1998).
2. Boosting. Freund and Schapire (1997), Breiman
(1998).
The data are resampled adaptively so that the weights
in the resampling are increased for those cases most
often misclassified.
The aggregation of predictors is done by weighted
voting.
46
Prediction votes
For aggregated classifiers, prediction votes assessing the
strength of a prediction may be defined for each
observation.
The prediction vote (PV) for an observation x is defined to
be
PV(x) = max_k ∑_b w_b I(C(x, L_b) = k) / ∑_b w_b.
When the perturbed learning sets are given equal weights,
i.e., w_b = 1, the prediction vote is simply the proportion of
votes for the “winning” class, regardless of whether it is
correct or not.
Prediction votes belong to [0,1].
47
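Given a samples × classes matrix of (weighted) votes, such as the one built in the bagging sketch above, the prediction votes are just the winning proportion per observation:

```python
import numpy as np

# votes: samples x classes matrix of weighted votes from an ensemble,
# e.g. the `votes` array from the bagging sketch (here two toy rows).
votes = np.array([[38.0, 7.0, 5.0],
                  [20.0, 25.0, 5.0]])

# PV(x) = max_k sum_b w_b I(C(x, L_b) = k) / sum_b w_b, a value in [0, 1].
prediction_votes = votes.max(axis=1) / votes.sum(axis=1)
predicted_class = votes.argmax(axis=1)
print(prediction_votes)              # [0.76 0.5 ]
print(predicted_class)               # [0 1]
```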
Other prediction methods
• Support vector machines (SVMs)
• Neural networks
• Bayesian regression methods
48
Datasets
• Leukemia data, Golub et al. (1999).
• n=72 tumor mRNA samples;
• 3 known classes (B-cell ALL, T-cell ALL, AML);
• p=6,817 genes.
• Lymphoma data, Alizadeh et al. (2000).
• n=81 tumor mRNA samples;
• 3 known classes (CLL, FL, DLBCL);
• p=4,682 genes.
• NCI 60 data, Ross et al. (2000).
• n=64 cell line mRNA samples;
• 9 known classes;
• p=5,244 genes.
49
Data pre-processing
• Image analysis and normalization: beyond our
control for these data.
• Imputation: k-nearest neighbor imputation, where
genes are “neighbors” and the similarity measure
between two genes is the correlation in their
expression profiles.
• Standardization: Standardize observations
(arrays) to have mean 0 and variance 1 across
variables (genes).
50
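A pre-processing sketch in Python; note that scikit-learn's KNNImputer uses a NaN-aware Euclidean distance rather than correlation, so this only approximates the scheme above, and the transpose makes genes the rows ("neighbors") during imputation:

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(10)
X = rng.normal(size=(72, 500))                 # samples x genes (toy data)
X[rng.random(X.shape) < 0.02] = np.nan         # sprinkle in missing values

# k-nearest neighbor imputation with genes as the "neighbors".
X_imputed = KNNImputer(n_neighbors=10).fit_transform(X.T).T

# Standardize observations (arrays) to mean 0 and variance 1 across genes.
X_std = (X_imputed - X_imputed.mean(axis=1, keepdims=True)) \
        / X_imputed.std(axis=1, keepdims=True)
```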
Comparison of predictors:
Study design
The original datasets are repeatedly randomly divided
into a learning set and a test set, comprising respectively
2/3 and 1/3 of the data.
For each of N=150 runs:
• Select a subset of p genes from the learning set based on
their ratio of between to within-groups sums of squares,
BSS/WSS. p=50 for lymphoma, p=40 for leukemia, p=30
for NCI 60.
• Build the different predictors using the learning sets with p
genes.
• Apply the predictors to the observations in the test set to
obtain test set error rates.
51
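A sketch of this design in Python, with the BSS/WSS gene filter computed on the learning set only and a k-NN classifier standing in for "the different predictors"; the data, p and N below are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def bss_wss(X, y):
    """Ratio of between-group to within-group sums of squares per gene."""
    overall = X.mean(axis=0)
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        bss += len(Xk) * (Xk.mean(axis=0) - overall) ** 2
        wss += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    return bss / wss

rng = np.random.default_rng(11)
X = rng.normal(size=(81, 4682))     # e.g. lymphoma-sized data (made up here)
y = rng.integers(0, 3, size=81)
p, N = 50, 150

errors = []
for run in range(N):
    # 2/3 learning set, 1/3 test set.
    Xl, Xt, yl, yt = train_test_split(X, y, test_size=1/3, random_state=run)
    # Select the p genes with the largest BSS/WSS on the learning set.
    genes = np.argsort(bss_wss(Xl, yl))[::-1][:p]
    clf = KNeighborsClassifier(n_neighbors=3).fit(Xl[:, genes], yl)
    errors.append(np.mean(clf.predict(Xt[:, genes]) != yt))

print("median test set error over", N, "runs:", float(np.median(errors)))
```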
Lymphoma data, 3 classes: Test set error rates; N=150 LS/TS runs.
52
Leukemia data, 2 classes: Test set error rates; 150 LS/TS runs.
53
Leukemia data, 3 classes: Test set error rates; 150 LS/TS runs.
54
NCI 60 data: Test set error rates; 150 LS/TS runs.
55
Leukemia data: Image of correlation matrix for 72 mRNA
samples based on the top 40 genes for the 3-class BSS/WSS.
56
Prediction votes
Leukemia data, 2 classes. Top panels: % correct predictions;
bottom panels: prediction votes/strengths.
57
Results
• In the main comparison, the nearest neighbor classifier
and DLDA had the smallest error rates, while FLDA had
the highest error rates.
• Aggregation improved the performance of CART
classifiers, the largest gains being with boosting and
bagging with convex pseudo-data.
• For the lymphoma and leukemia datasets, increasing the
number of variables to p=200 didn't greatly affect the
performance of the various classifiers. There was an
improvement for the NCI 60 dataset.
• A more careful selection of a small number of genes
p=10 improved the performance of FLDA dramatically.
58
Discussion
• “Diagonal” LDA: ignoring correlation between genes helped
here.
• Unlike classification trees and nearest neighbors, LDA is unable
to take into account gene interactions.
• Although nearest neighbors are simple and intuitive classifiers,
their main limitation is that they give very little insight into
mechanisms underlying the class distinctions.
• Classification trees are capable of handling and revealing
interactions between variables.
• Useful by-product of aggregated classifiers: prediction votes.
• Variable selection: A crude criterion such as BSS/WSS may not
identify the genes that discriminate between all the classes and
may not reveal interactions between genes.
59
Closing remarks on discrimination
This is a very active area of research now. Much larger
data sets are becoming available. SVM and other kernel
methods are becoming the “methods of choice”,
outperforming the ones we have discussed here. The
first reference in this area is:
Michael P. S. Brown, William Noble Grundy, David Lin, Nello
Cristianini, Charles W. Sugnet, Terrence S. Furey, Manuel Ares Jr.,
David Haussler (2000). Knowledge-based analysis of microarray
gene expression data by using support vector machines.
PNAS 97: 262-267.
60
Acknowledgments
Clustering section based on lecture 3 in Temple short course
The olfactory bulb data is provided by Dave Lin.
Discriminant analysis section: based on S. Dudoit and J. Fridlyand
(Bioconductor short course lecture 5)
http://www.bioconductor.org
61