Transcript Document
Classification of Microarray Data
- Recent Statistical Approaches
Geoff McLachlan and Liat Ben-Tovim Jones
Department of Mathematics & Institute for Molecular Bioscience
University of Queensland
Tutorial for the APBC January 2005
Institute for Molecular Bioscience,
University of Queensland
Outline of Tutorial
• Introduction to Microarray Technology
• Detecting Differentially Expressed Genes in Known Classes of Tissue Samples
BREAK
• Cluster Analysis: Clustering Genes and Clustering Tissues
• Supervised Classification of Tissue Samples
• Linking Microarray Data with Survival Analysis
“Large-scale gene expression studies are not
a passing fashion, but are instead one aspect
of new work of biological experimentation,
one involving large-scale, high throughput
assays.”
Speed et al., 2002, Statistical Analysis of Gene Expression Microarray Data, Chapman and Hall/CRC
Growth of microarray and microarray methodology literature listed in PubMed from 1995 to
2003.
The category ‘all microarray papers’ includes those found by searching PubMed for
microarray* OR ‘gene expression profiling’. The category ‘statistical microarray papers’
includes those found by searching PubMed for ‘statistical method*’ OR ‘statistical
techniq*’ OR ‘statistical approach*’ AND microarray* OR ‘gene expression profiling’.
A microarray is a new technology which allows the measurement of the expression levels of thousands of genes simultaneously. Two developments made this possible:
(1) sequencing of the genome (human, mouse, and others)
(2) improvement in technology to generate high-density arrays on chips (glass slides or nylon membrane).
The entire genome of an organism can be probed at a single point in time.
(1) mRNA Levels Indirectly Measure Gene Activity
Every cell contains the same DNA. Cells differ in the DNA (genes) that is active at any one time. Genes code for proteins through the intermediary of mRNA. The activity (expression) of a gene can therefore be determined by the presence of its complementary mRNA.
Target and Probe DNA
Probe DNA
- known
Sample (target)
- unknown
(2) Microarrays Indirectly Measure Levels of mRNA
• mRNA is extracted from the cell
• mRNA is reverse transcribed to cDNA (mRNA itself is unstable)
• cDNA is labeled with fluorescent dye (the TARGET)
• The sample is hybridized to known DNA sequences on the array (tens of thousands of genes) (the PROBE)
• If present, complementary target binds to probe DNA (complementary base pairing)
• Target bound to probe DNA fluoresces
Spotted cDNA Microarray
Compare the gene expression levels for two cell
populations on a single microarray.
Microarray Image
Red: High expression in target labelled with cyanine 5 dye
Green : High expression in target labelled with cyanine 3 dye
Yellow : Similar expression in both target samples
Assumptions:
(1) Gene expression: cellular mRNA levels directly reflect gene expression.
(2) Fluorescence intensity: the intensity of bound target is a measure of the abundance of the mRNA in the sample.
Experimental Error
Sample contamination
Poor quality/insufficient mRNA
Reverse transcription bias
Fluorescent labeling bias
Hybridization bias
Cross-linking of DNA (double strands)
Poor probe design (cross-hybridization)
Defective chips (scratches, degradation)
Background from non-specific hybridization
The Microarray Technologies
Spotted Microarray:
• cDNAs, clones, or short and long oligonucleotides deposited onto glass slides
• Each gene (or EST) represented by its purified PCR product
• Simultaneous analysis of two samples (treated vs untreated cells) provides an internal control
• Gives relative gene expressions
Affymetrix GeneChip:
• Short oligonucleotides synthesized in situ onto glass wafers
• Each gene represented multiply, using 16-20 (preferably non-overlapping) 25-mers
• Each oligonucleotide has a single-base mismatch partner for internal control of hybridization specificity
• Gives absolute gene expressions
Each with its own advantages and disadvantages.
Pros and Cons of the Technologies
Spotted Microarray:
• Flexible and cheaper
• Allows study of genes not yet sequenced (spotted ESTs can be used to discover new genes and their functions)
• Variability in spot quality from slide to slide
• Provides information only on relative gene expressions between cells or tissue samples
Affymetrix GeneChip:
• More expensive yet less flexible
• Good for whole-genome expression analysis where the genome of that organism has been sequenced
• High quality with little variability between slides
• Gives a measure of absolute expression of genes
Aims of a Microarray Experiment
• observe changes in a gene in response to external stimuli
(cell samples exposed to hormones, drugs, toxins)
• compare gene expressions between different tissue types
(tumour vs normal cell samples)
To gain understanding of
• function of unknown genes
• disease process at the molecular level
Ultimately to use as tools in Clinical Medicine for diagnosis,
prognosis and therapeutic management.
Importance of Experimental Design
• Good DNA microarray experiments should have clear objectives.
• They should not be performed as “aimless data mining in search of unanticipated patterns that will provide answers to unasked questions”
(Richard Simon, BioTechniques 34:S16-S21, 2003)
Replicates
Technical replicates: arrays that have been hybridized to the same
biological source (using the same treatment, protocols, etc.)
Biological replicates: arrays that have been hybridized to different
biological sources, but with the same preparation, treatments, etc.
Extracting Data from the Microarray
•Cleaning
Image processing
Filtering
Missing value estimation
•Normalization
Remove sources of systematic variation.
(Arrays for Sample 1, Sample 2, Sample 3, Sample 4, etc.)
Gene Expressions from Measured Intensities
Spotted Microarray:
log2(Cy5 intensity / Cy3 intensity)
Affymetrix:
(Perfect Match intensity) − (Mismatch intensity)
Data Transformation
A generalized log transformation of the form log2{x + sqrt(x^2 + c^2)} is used; see Rocke and Durbin (2001), Munson (2001), Durbin et al. (2002), and Huber et al. (2002).
Representation of Data from M Microarray Experiments
[N x M matrix of expression values: rows Gene 1, ..., Gene N (expression profiles); columns Sample 1, ..., Sample M (expression signatures).]
Assume we have extracted gene expression values from the intensities. Gene expressions can be displayed as heat maps.
Microarrays present new problems for statistics because
the data is very high dimensional with very little
replication.
Gene Expression Data represented as an N x M Matrix
The N rows correspond to the N genes (expression profiles); the M columns correspond to the M samples, i.e. the microarray experiments (expression signatures).
Microarray Data Notation
Represent the data by the N x M matrix A.
Classifying Tissues on Gene Expressions:
A = (y_1, ..., y_M), where the feature vector y_j contains the expression levels of the N genes in the jth experiment (j = 1, ..., M); y_j is the expression signature.
Classifying Genes on the Tissues:
A^T = (y_1, ..., y_N), where the feature vector y_j contains the expression levels over the M tissues for the jth gene (j = 1, ..., N); y_j is the expression profile.
In the N x M matrix A:
N = No. of genes (10^3 to 10^4)
M = No. of tissues (10 to 10^2)
Classification of Tissues on Gene Expressions:
Standard statistical methodology appropriate
when M >> N,
BUT here
N >> M
Classification of Genes on the Basis of the Tissues:
Falls in standard framework, BUT not all the
genes are independently distributed.
Mehta et al (Nature Genetics, Sept. 2004):
“The field of expression data analysis is particularly
active with novel analysis strategies and tools being
published weekly”, and the value of many of these
methods is questionable. Some results produced by using
these methods are so anomalous that a breed of ‘forensic’
statisticians (Ambroise and McLachlan, 2002; Baggerly et
al., 2003) who doggedly detect and correct other HDB
(high-dimensional biology) investigators’ prominent
mistakes, has been created.
Outline of Tutorial
•Introduction to Microarray Technology
•Detecting Differentially Expressed Genes in Known Classes
of Tissue Samples
•Cluster Analysis: Clustering Genes and Clustering Tissues
•Supervised Classification of Tissue Samples
•Linking Microarray Data with Survival Analysis
[Diagram: the N x M expression matrix, with rows Gene 1, ..., Gene N and columns Sample 1, ..., Sample M; the M tissue samples are divided into known classes, Class 1 and Class 2.]
Fold Change is the Simplest Method
Calculate the log ratio between the two classes
and consider all genes that differ by more than
an arbitrary cutoff value to be differentially
expressed. A two-fold difference is often
chosen.
Fold change is not a statistical test.
Multiplicity Problem
Perform a test for each gene to determine the statistical significance
of differential expression for that gene.
Problem: When many hypotheses are tested, the probability of a
type I error (false positive) increases sharply with the number of
hypotheses.
Further complicated by gene co-regulation and subsequent
correlation between the test statistics.
Example:
Suppose we measure the expression of 10,000 genes in a microarray experiment. If none of the 10,000 genes were differentially expressed, then we would expect:
• with P = 0.05 for each test, 500 false positives;
• with P = 0.05/10,000 for each test, 0.05 false positives.
Methods for dealing with the Multiplicity Problem
•The Bonferroni Method
controls the family wise error rate (FWER).
(FWER is the probability that at least one false positive
error will be made.) - but this method is very
conservative, as it tries to make it unlikely that even
one false rejection is made.
•The False Discovery Rate (FDR)
emphasizes the proportion of false positives among the
identified differentially expressed genes.
Test of a Single Hypothesis
The M tissue samples are classified with respect to g classes on
the basis of the N gene expressions.
Assume that there are ni tissue samples from each class
Ci (i = 1, …, g), where
M = n1 + … + ng.
Take a simple case where g = 2.
The aim is to detect whether some of the thousands of genes
have different expression levels in class C1 than in class C2.
Test of a Single Hypothesis (cont.)
For gene j, let Hj denote the null hypothesis of no association
between its expression level and membership of the class, where
(j = 1, …, N).
H_j = 0: the null hypothesis for the jth gene holds.
H_j = 1: the null hypothesis for the jth gene does not hold.

             Retain Null      Reject Null
H_j = 0      correct          type I error
H_j = 1      type II error    correct
Two-Sample t-Statistic
Student's t-statistic:
T_j = (ȳ_1j − ȳ_2j) / sqrt(s²_1j/n_1 + s²_2j/n_2)
Two-Sample t-Statistic
Pooled form of the Student's t-statistic, with an assumed common variance in the two classes:
T_j = (ȳ_1j − ȳ_2j) / {s_j sqrt(1/n_1 + 1/n_2)}
Two-Sample t-Statistic
Modified t-statistic of Tusher et al. (2001):
T_j = (ȳ_1j − ȳ_2j) / {s_j sqrt(1/n_1 + 1/n_2) + a_0}
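To make the three statistics concrete, here is a minimal Python sketch (not part of the original tutorial) that computes the unpooled, pooled, and SAM-style moderated t-statistics gene by gene for a genes-by-samples numpy array X; the class sizes and the fudge constant a0 are illustrative choices.

```python
import numpy as np

def two_sample_t_stats(X, n1, a0=0.1):
    """Gene-wise two-sample t-statistics for an N x (n1+n2) expression matrix X.

    Rows are genes; the first n1 columns are class 1, the rest are class 2.
    Returns the unpooled, pooled, and moderated (SAM-style) statistics.
    """
    X1, X2 = X[:, :n1], X[:, n1:]
    n2 = X2.shape[1]
    m1, m2 = X1.mean(axis=1), X2.mean(axis=1)
    v1, v2 = X1.var(axis=1, ddof=1), X2.var(axis=1, ddof=1)

    # Unpooled statistic.
    t_unpooled = (m1 - m2) / np.sqrt(v1 / n1 + v2 / n2)

    # Pooled statistic: common variance assumed in the two classes.
    s2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = np.sqrt(s2) * np.sqrt(1.0 / n1 + 1.0 / n2)
    t_pooled = (m1 - m2) / se

    # Modified statistic of Tusher et al. (2001): a small constant a0 in the
    # denominator damps genes with tiny variances.
    t_moderated = (m1 - m2) / (se + a0)
    return t_unpooled, t_pooled, t_moderated

# Example with simulated data: 1000 genes, 7 + 8 samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 15))
t_u, t_p, t_m = two_sample_t_stats(X, n1=7)
```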
Multiple Hypothesis Testing
Consider measures of error for multiple hypotheses.
Focus on the rate of false positives with respect to the
number of rejected hypotheses, Nr.
Possible Outcomes for N Hypothesis Tests

             Accept Null    Reject Null    Total
Null True    N_00           N_01           N_0
Non-True     N_10           N_11           N_1
Total        N − N_r        N_r            N
FWER is the probability of getting one or more false positives out of all the hypotheses tested:
FWER = Pr{N_01 ≥ 1}
Bonferroni method for controlling the FWER
Consider N hypothesis tests:
H_0j versus H_1j, j = 1, ..., N,
and let P_1, ..., P_N denote the N P-values for these tests.
The Bonferroni Method:
Given P-values P_1, ..., P_N, reject the null hypothesis H_0j if
P_j < α / N.
False Discovery Rate (FDR)
FDR = #(false positives) / #(rejected hypotheses)
The FDR is essentially the expectation of the
proportion of false positives among the identified
differentially expressed genes.
In terms of the table of possible outcomes above,
FDR = N_01 / N_r.
Controlling FDR
Benjamini and Hochberg (1995)
Key papers on controlling the FDR
• Genovese and Wasserman (2002)
• Storey (2002, 2003)
• Storey and Tibshirani (2003a, 2003b)
• Storey, Taylor and Siegmund (2004)
• Black (2004)
• Cox and Wong (2004)
Benjamini-Hochberg (BH) Procedure
Controls the FDR at level α when the P-values following the null distribution are independent and uniformly distributed.
(1) Let p_(1) ≤ ... ≤ p_(N) be the ordered observed P-values.
(2) Calculate k̂ = argmax_{1 ≤ k ≤ N} {k : p_(k) ≤ αk/N}.
(3) If k̂ exists, then reject the null hypotheses corresponding to p_(1), ..., p_(k̂). Otherwise, reject nothing.
Example: Bonferroni and BH Tests
Suppose that 10 independent hypothesis tests are carried out
leading to the following ordered P-values:
0.00017 0.00448 0.00671 0.00907 0.01220
0.33626 0.39341 0.53882 0.58125 0.98617
(a) With α = 0.05, the Bonferroni test rejects any hypothesis whose P-value is less than α / 10 = 0.005.
Thus only the first two hypotheses are rejected.
(b) For the BH test, we find the largest k such that p_(k) ≤ kα / N.
Here k = 5, thus we reject the first five hypotheses.
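The worked example can be checked with a few lines of Python; this sketch simply applies the two rules above to the ten P-values and should report 2 rejections for Bonferroni and 5 for BH.

```python
import numpy as np

p = np.array([0.00017, 0.00448, 0.00671, 0.00907, 0.01220,
              0.33626, 0.39341, 0.53882, 0.58125, 0.98617])
alpha, N = 0.05, len(p)

# Bonferroni: reject H_0j if P_j < alpha / N.
bonferroni_rejected = int(np.sum(p < alpha / N))           # -> 2

# Benjamini-Hochberg: largest k with p_(k) <= k * alpha / N, then
# reject the hypotheses with the k smallest P-values.
order = np.sort(p)
below = np.nonzero(order <= np.arange(1, N + 1) * alpha / N)[0]
bh_rejected = int(below[-1] + 1) if below.size else 0      # -> 5

print(bonferroni_rejected, bh_rejected)
```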
SHORT BREAK
Null Distribution of the Test Statistic
Permutation Method
The null distribution has a resolution on the order of the
number of permutations.
If we perform B permutations, then the P-value will be estimated
with a resolution of 1/B.
If we assume that each gene has the same null distribution and
combine the permutations, then the resolution will be 1/(NB)
for the pooled null distribution.
Using just the B permutations of the class labels for the gene-specific statistic T_j, the P-value for T_j = t_j is assessed as
p_j = #{b : |t^(b)_0j| ≥ |t_j|} / B,
where t^(b)_0j is the null version of t_j after the bth permutation of the class labels.
If we pool over all N genes, then
p_j = Σ_{i=1}^N #{b : |t^(b)_0i| ≥ |t_j|} / (NB).
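A sketch of the permutation calculations just described, assuming a genes-by-samples matrix X with the class-1 columns first; with B random permutations the gene-specific P-values have resolution 1/B, and pooling the null values over all N genes gives a resolution of roughly 1/(NB). This is an illustrative implementation, not the tutorial's own code.

```python
import numpy as np

def perm_pvalues(X, n1, B=1000, seed=0):
    """Permutation P-values for gene-wise pooled two-sample t-statistics.

    X is an N x (n1 + n2) matrix (rows = genes); the first n1 columns are
    class 1.  Returns gene-specific P-values and P-values based on the
    null values pooled over all N genes.
    """
    rng = np.random.default_rng(seed)
    N, M = X.shape

    def pooled_t(mat):
        x1, x2 = mat[:, :n1], mat[:, n1:]
        m, k = x1.shape[1], x2.shape[1]
        s2 = ((m - 1) * x1.var(axis=1, ddof=1) +
              (k - 1) * x2.var(axis=1, ddof=1)) / (m + k - 2)
        return (x1.mean(axis=1) - x2.mean(axis=1)) / np.sqrt(s2 * (1 / m + 1 / k))

    t_obs = pooled_t(X)
    null = np.empty((B, N))
    for b in range(B):
        null[b] = pooled_t(X[:, rng.permutation(M)])   # permute the class labels

    # Gene-specific P-value: fraction of that gene's permuted |t| >= observed |t|.
    p_gene = (np.abs(null) >= np.abs(t_obs)).mean(axis=0)

    # Pooled P-value: compare each observed |t_j| with all N*B null values.
    pooled_null = np.sort(np.abs(null).ravel())
    n_less = np.searchsorted(pooled_null, np.abs(t_obs), side="left")
    p_pooled = 1.0 - n_less / pooled_null.size
    return p_gene, p_pooled
```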
Null Distribution of the Test Statistic: Example
Suppose we have two classes of tissue samples, with three samples from each class. Consider the expressions of two genes, Gene 1 and Gene 2.

           Class 1                       Class 2
Gene 1:    A1(1)  A2(1)  A3(1)           B4(1)  B5(1)  B6(1)
Gene 2:    A1(2)  A2(2)  A3(2)           B4(2)  B5(2)  B6(2)
To find the null distribution of the test statistic for Gene 1, we proceed under the assumption that there is no difference between the classes (for Gene 1), so that its six values can be relabelled
Gene 1:    A1(1)  A2(1)  A3(1)  A4(1)  A5(1)  A6(1),
and we permute the class labels, e.g.
Perm. 1:   A1(1)  A2(1)  A4(1)   |   A3(1)  A5(1)  A6(1)
...
There are 10 distinct permutations.
Ten Permutations of Gene 1 (class 1 | class 2)
A1(1) A2(1) A3(1) | A4(1) A5(1) A6(1)
A1(1) A2(1) A4(1) | A3(1) A5(1) A6(1)
A1(1) A2(1) A5(1) | A3(1) A4(1) A6(1)
A1(1) A2(1) A6(1) | A3(1) A4(1) A5(1)
A1(1) A3(1) A4(1) | A2(1) A5(1) A6(1)
A1(1) A3(1) A5(1) | A2(1) A4(1) A6(1)
A1(1) A3(1) A6(1) | A2(1) A4(1) A5(1)
A1(1) A4(1) A5(1) | A2(1) A3(1) A6(1)
A1(1) A4(1) A6(1) | A2(1) A3(1) A5(1)
A1(1) A5(1) A6(1) | A2(1) A3(1) A4(1)
As there are only 10 distinct permutations here, the
null distribution based on these permutations is too
granular.
Hence consideration is given to permuting the labels
of each of the other genes and estimating the null
distribution of a gene based on the pooled permutations
so obtained.
But there is a problem with this method in that the null values of the test statistic for each gene do not necessarily have the theoretical null distribution that we are trying to estimate.
Suppose we were to use Gene 2 also to estimate
the null distribution of Gene 1.
Suppose that Gene 2 is differentially expressed,
then the null values of the test statistic for Gene 2
will have a mixed distribution.
           Class 1                       Class 2
Gene 1:    A1(1)  A2(1)  A3(1)           B4(1)  B5(1)  B6(1)
Gene 2:    A1(2)  A2(2)  A3(2)           B4(2)  B5(2)  B6(2)

For Gene 2, relabel its six values as
Gene 2:    A1(2)  A2(2)  A3(2)  B4(2)  B5(2)  B6(2)
and permute the class labels, e.g.
Perm. 1:   A1(2)  A2(2)  B4(2)   |   A3(2)  B5(2)  B6(2)
...
Example of a null case: with 7 N(0,1) points and 8 N(0,1) points; histogram of the pooled two-sample t-statistic under 1000 permutations of the class labels, with the t_13 density superimposed.
Example of a non-null case (the gene is differentially expressed): with 7 N(0,1) points and 8 N(10,9) points; histogram of the pooled two-sample t-statistic under 1000 permutations of the class labels, with the t_13 density superimposed.
The SAM Method
Use the permutation method to calculate the null
distribution of the modified t-statistic (Tusher et al., 2001).
The order statistics t_(1), ..., t_(N) are plotted against their null expectations (the SAM plot).
It is a good test in situations where there are more genes being over-expressed than under-expressed, or vice versa.
Two-component Mixture Model Framework
Two-component model:
f(t_j) = π_0 f_0(t_j) + π_1 f_1(t_j),
where π_0 is the proportion of genes that are not differentially expressed, and π_1 = 1 − π_0 is the proportion that are.
Then
τ_0(t_j) = π_0 f_0(t_j) / f(t_j)
is the posterior probability that gene j is not differentially expressed.
1) Form a statistic t_j for each gene, where a large positive value of t_j corresponds to a gene that is differentially expressed across the tissues.
2) Compute the P-values p_j corresponding to the t_j and fit a mixture of beta distributions (including a uniform component) to them, where the uniform component corresponds to the class of genes that are not differentially expressed.
or
3) Fit to t_1, ..., t_N a mixture of two normal densities with a common variance, where the first component has the smaller mean (it corresponds to the class of genes that are not differentially expressed). It is assumed that the t_j have been transformed so that they are (approximately) normally distributed.
4) Let τ̂_0(t_j) denote the (estimated) posterior probability that gene j belongs to the first component of the mixture. If we conclude that gene j is differentially expressed when
τ̂_0(t_j) ≤ c_0,
then this decision minimizes the (estimated) Bayes risk for the costs of misallocation implied by the threshold c_0.
Estimated FDR
Suppose τ_0(t) is monotonic decreasing in t. Then
τ̂_0(t_j) ≤ c_0   for   t_j ≥ t_0,
and the FDR can be estimated by
FDR̂ = π̂_0 {1 − F̂_0(t_0)} / {1 − F̂(t_0)} ≈ E{τ_0(T) | T ≥ t_0},
where F̂_0 and F̂ are the estimated distribution functions corresponding to the null density f_0 and the mixture density f of the statistic, respectively.
For a desired control level α, say α = 0.05, define
t_0 = argmin_t {FDR̂(t) ≤ α},   where   FDR̂(t) = π̂_0 {1 − F̂_0(t)} / {1 − F̂(t)}.   (1)
If (1) is monotonic in t, then using (1) to control the FDR [with π̂_0 = 1 and F̂(t) taken to be the empirical distribution function] is equivalent to using the Benjamini-Hochberg procedure based on the P-values corresponding to the statistics t_j.
Example
The study of Hedenfalk et al. (2001), used cDNA arrays to
measure the gene expressions in breast tumour tissues taken from
patients with mutations in either the BRCA1 or BRCA2 genes.
We consider their data set of M = 15 patients, comprising
two patient groups: BRCA1 (7) versus BRCA2-mutation
positive (8), with N = 3,226 genes.
The problem is to find genes which are differentially
expressed between the BRCA1 and BRCA2 patients.
Fit
f(t_j) = π_0 N(0,1) + π_1 N(μ_1, σ_1²)
to the N values of t_j (the pooled two-sample t-statistic). The jth gene is taken to be differentially expressed if
τ̂_0(t_j) ≤ c_0.
Estimated FDR for various levels of c_0

c_0     N_r     FDR̂
0.5     1702    0.29
0.4     1235    0.23
0.3     850     0.18
0.2     483     0.12
0.1     175     0.06
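The two-component normal mixture approach can be sketched as follows: a small EM loop fits π_0 N(0,1) + π_1 N(μ_1, σ_1²) to a vector of statistics t, and the FDR at a threshold c_0 is estimated by the average posterior null probability τ̂_0 over the "rejected" genes. The simulated statistics and the plain EM loop below are stand-ins for the actual EMMIX-based analysis of the Hedenfalk data.

```python
import numpy as np
from scipy.stats import norm

def fit_null_mixture(t, n_iter=200):
    """EM for f(t) = pi0*N(0,1) + pi1*N(mu1, sigma1^2) fitted to statistics t."""
    pi0, mu1, sig1 = 0.9, t.mean(), t.std()
    for _ in range(n_iter):
        f0 = norm.pdf(t, 0.0, 1.0)
        f1 = norm.pdf(t, mu1, sig1)
        tau0 = pi0 * f0 / (pi0 * f0 + (1 - pi0) * f1)   # posterior prob. of "null"
        pi0 = tau0.mean()
        w = 1.0 - tau0
        mu1 = np.sum(w * t) / np.sum(w)
        sig1 = np.sqrt(np.sum(w * (t - mu1) ** 2) / np.sum(w))
    return pi0, mu1, sig1, tau0

def estimated_fdr(tau0, c0):
    """N_r and FDR estimate: mean posterior null probability over rejected genes."""
    reject = tau0 <= c0
    n_reject = int(reject.sum())
    return n_reject, (float(tau0[reject].mean()) if n_reject else 0.0)

# Example with simulated statistics: 90% null N(0,1), 10% from N(3, 1.5^2).
rng = np.random.default_rng(1)
t = np.concatenate([rng.normal(0, 1, 2900), rng.normal(3, 1.5, 326)])
pi0, mu1, sig1, tau0 = fit_null_mixture(t)
for c0 in (0.5, 0.4, 0.3, 0.2, 0.1):
    print(c0, *estimated_fdr(tau0, c0))
```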
Use of the P-Value as a Summary Statistic
(Allison et al., 2002)
Instead of using the pooled form of the t-statistic, we can work
with the value pj, which is the P-value associated with tj
in the test of the null hypothesis of no difference in expression
between the two classes.
The distribution of the P-value is modelled by the h-component mixture model
f(p_j) = Σ_{i=1}^h π_i f(p_j; a_i1, a_i2),
where a_11 = a_12 = 1.
Use of the P-Value as a Summary Statistic
Under the null hypothesis of no difference in expression for the jth gene, p_j will have a uniform distribution on the unit interval, i.e. the beta(1,1) distribution.
The beta(a_1, a_2) density is given by
f(p; a_1, a_2) = p^(a_1 − 1) (1 − p)^(a_2 − 1) / B(a_1, a_2),   0 ≤ p ≤ 1,
where B(a_1, a_2) = Γ(a_1) Γ(a_2) / Γ(a_1 + a_2).
Outline of Tutorial
•Introduction to microarray technology
•Detecting Differentially Expressed Genes in Known Classes
of Tissue Samples
•Cluster Analysis: Clustering Genes and Clustering Tissues
•Supervised Classification of Tissue Samples
•Linking Microarray Data with Survival Analysis
Two Clustering Problems:
• Clustering of genes on the basis of the tissues: the genes are not all independent (n = N, p = M)
• Clustering of tissues on the basis of the genes: a nonstandard problem in cluster analysis (n = M, p = N, so n << p)
CLUSTERING OF GENES WITH
REPEATED MEASUREMENTS:
Medvedovic and Sivaganesan (2002)
Celeux, Martin, and Lavergne (2004)
Chapter 5 of McLachlan et al. (2004)
UNSUPERVISED CLASSIFICATION
(CLUSTER ANALYSIS)
INFER CLASS LABELS z1, …, zn of y1, …, yn
Initially, hierarchical distance-based methods
of cluster analysis were used to cluster the
tissues and the genes
Eisen, Spellman, Brown, & Botstein (1998, PNAS)
Hierarchical (agglomerative) clustering algorithms
are largely heuristically motivated and there exist a
number of unresolved issues associated with their
use, including how to determine the number of
clusters.
“in the absence of a well-grounded statistical
model, it seems difficult to define what is
meant by a ‘good’ clustering algorithm or the
‘right’ number of clusters.”
(Yeung et al., 2001, Model-Based Clustering and Data Transformations
for Gene Expression Data, Bioinformatics 17)
McLachlan and Khan (2004). On a resampling approach for tests on the number of clusters with mixture model-based clustering of the tissue samples. Special issue of the Journal of Multivariate Analysis 90 (2004), edited by Mark van der Laan and Sandrine Dudoit (UC Berkeley).
Attention is now turning towards a model-based
approach to the analysis of microarray data
For example:
• Broët, Richardson, and Radvanyi (2002). Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. Journal of Computational Biology 9
•Ghosh and Chinnaiyan (2002). Mixture modelling of gene expression
data from microarray experiments. Bioinformatics 18
•Liu, Zhang, Palumbo, and Lawrence (2003). Bayesian clustering with
variable and transformation selection. In Bayesian Statistics 7
• Pan, Lin, and Le, 2002, Model-based cluster analysis of microarray
gene expression data. Genome Biology 3
• Yeung et al., 2001, Model based clustering and data transformations
for gene expression data, Bioinformatics 17
The notion of a cluster is not easy to define.
There is a very large literature devoted to
clustering when there is a metric known in
advance; e.g. k-means. Usually, there is no a
priori metric (or equivalently a user-defined
distance matrix) for a cluster analysis.
That is, the difficulty is that the shape of the
clusters is not known until the clusters have
been identified, and the clusters cannot be
effectively identified unless the shapes are
known.
In this case, one attractive feature of
adopting mixture models with elliptically
symmetric components such as the normal
or t densities, is that the implied clustering
is invariant under affine transformations of
the data (that is, under operations relating
to changes in location, scale, and rotation
of the data).
Thus the clustering process does not
depend on irrelevant factors such as the
units of measurement or the orientation of
the clusters in space.
MIXTURE OF g NORMAL COMPONENTS
f(x) = π_1 φ(x; μ_1, Σ_1) + ... + π_g φ(x; μ_g, Σ_g),
where
−2 log φ(x; μ, Σ) = (x − μ)^T Σ^{−1} (x − μ) + constant.
(x − μ)^T Σ^{−1} (x − μ) is the MAHALANOBIS DISTANCE;
(x − μ)^T (x − μ) is the (squared) EUCLIDEAN DISTANCE.
MIXTURE OF g NORMAL COMPONENTS
f(x) = π_1 φ(x; μ_1, Σ_1) + ... + π_g φ(x; μ_g, Σ_g)
k-means corresponds to the constraint
Σ_1 = ... = Σ_g = σ² I
(equal spherical covariance matrices, giving SPHERICAL CLUSTERS).
With a mixture model-based approach to
clustering, an observation is assigned
outright to the ith cluster if its density in
the ith component of the mixture
distribution (weighted by the prior
probability of that component) is greater
than in the other (g-1) components.
f(x) = π_1 φ(x; μ_1, Σ_1) + ... + π_i φ(x; μ_i, Σ_i) + ... + π_g φ(x; μ_g, Σ_g)
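The outright assignment rule can be illustrated with scikit-learn's GaussianMixture on simulated two-dimensional data; note the tutorial's own analyses use the EMMIX/EMMIX-GENE software, so this sketch is only a generic normal mixture fit.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Simulated data: two elliptical clusters in the plane.
rng = np.random.default_rng(2)
x1 = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=150)
x2 = rng.multivariate_normal([4, 3], [[1.5, -0.4], [-0.4, 0.8]], size=100)
X = np.vstack([x1, x2])

# Fit a g = 2 normal mixture with unrestricted covariance matrices.
gmm = GaussianMixture(n_components=2, covariance_type="full",
                      n_init=5, random_state=0).fit(X)

# Outright assignment: each observation goes to the component with the
# largest pi_i * phi(x; mu_i, Sigma_i), i.e. the maximum posterior probability.
labels = gmm.predict(X)
posterior = gmm.predict_proba(X)      # tau_i(x_j) for each component
print(gmm.weights_, labels[:10])
```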
Figure 7: Contours of the fitted component
densities on the 2nd & 3rd variates for the blue crab
data set.
Estimation of Mixture Distributions
It was the publication of the seminal paper of
Dempster, Laird, and Rubin (1977) on the EM
algorithm that greatly stimulated interest in
the use of finite mixture distributions to
model heterogeneous data.
McLachlan and Krishnan (1997, Wiley)
• If need be, the normal mixture model can
be made less sensitive to outlying
observations by using t component densities.
• With this t mixture model-based approach,
the normal distribution for each component
in the mixture is embedded in a wider class
of elliptically symmetric distributions with an
additional parameter called the degrees of
freedom.
The advantage of the t mixture model is that,
although the number of outliers needed for
breakdown is almost the same as with the
normal mixture model, the outliers have to
be much larger.
Mixtures of Factor Analyzers
A normal mixture model without restrictions
on the component-covariance matrices may
be viewed as too general for many situations
in practice, in particular, with high
dimensional data.
One approach for reducing the number of
parameters is to work in a lower dimensional
space by using principal components; another
is to use mixtures of factor analyzers
(Ghahramani & Hinton, 1997).
Two Groups in Two Dimensions. All cluster information would
be lost by collapsing to the first principal component. The
principal ellipses of the two groups are shown as solid curves.
Mixtures of Factor Analyzers
Principal components or a single-factor analysis model provides only a global linear model.
A global nonlinear approach is obtained by postulating a mixture of linear submodels:
f(x_j) = Σ_{i=1}^g π_i φ(x_j; μ_i, Σ_i),
where
Σ_i = B_i B_i^T + D_i   (i = 1, ..., g),
B_i is a p x q matrix of factor loadings and D_i is a diagonal matrix.
Single-Factor Analysis Model
Y_j = μ + B U_j + e_j   (j = 1, ..., n),
where U_j is a q-dimensional (q < p) vector of latent or unobservable variables called factors, and B is a p x q matrix of factor loadings.
The U_j are i.i.d. N(0, I_q), independently of the errors e_j, which are i.i.d. N(0, D), where D is a diagonal matrix,
D = diag(σ_1², ..., σ_p²).
Conditional on membership of the ith component of the mixture,
Y_j = μ_i + B_i U_ij + e_ij   (i = 1, ..., g),
where U_i1, ..., U_in are independently and identically distributed (i.i.d.) N(0, I_q), independently of the e_ij, which are i.i.d. N(0, D_i), where D_i is a diagonal matrix (i = 1, ..., g).
There is an infinity of choices for B_i, as the model still holds if B_i is replaced by B_i C_i, where C_i is an orthogonal matrix. Choose C_i so that
B_i^T D_i^{−1} B_i
is diagonal.
The number of free parameters in each component-covariance matrix is then
pq + p − q(q − 1)/2.
The reduction in the number of parameters is then
½ {(p − q)² − (p + q)}.
We can fit the mixture of factor analyzers model using an alternating ECM (AECM) algorithm.
1st cycle: declare the missing data to be the component-indicator vectors. Update the estimates of π_i and μ_i.
2nd cycle: declare the missing data to be also the factors. Update the estimates of B_i and D_i.
M-step on 1st cycle: for i = 1, ..., g,
π_i^(k+1) = Σ_{j=1}^n τ_ij^(k) / n,
μ_i^(k+1) = Σ_{j=1}^n τ_ij^(k) y_j / Σ_{j=1}^n τ_ij^(k),
where τ_ij^(k) is the current posterior probability that y_j belongs to the ith component.
M-step on 2nd cycle: B_i^(k+1) and D_i^(k+1) are updated using the current weighted sample covariance matrix of the ith component,
V_i^(k+1/2) = Σ_{j=1}^n τ_ij^(k) (y_j − μ_i^(k+1)) (y_j − μ_i^(k+1))^T / Σ_{j=1}^n τ_ij^(k),
with
γ_i^(k) = (B_i^(k) B_i^(k)T + D_i^(k))^{−1} B_i^(k),
B_i^(k+1) = V_i^(k+1/2) γ_i^(k) {I_q − γ_i^(k)T B_i^(k) + γ_i^(k)T V_i^(k+1/2) γ_i^(k)}^{−1},
D_i^(k+1) = diag{V_i^(k+1/2) − B_i^(k+1) γ_i^(k)T V_i^(k+1/2)}.
Work in the q-dimensional space:
(B_i B_i^T + D_i)^{−1} = D_i^{−1} − D_i^{−1} B_i (I_q + B_i^T D_i^{−1} B_i)^{−1} B_i^T D_i^{−1},
|B_i B_i^T + D_i| = |D_i| / |I_q − B_i^T (B_i B_i^T + D_i)^{−1} B_i|.
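The value of these identities is that only q x q matrices need to be inverted. A small numpy check follows (the values of p, q, B, and D are illustrative only); the determinant is verified via the equivalent form |B B^T + D| = |D| |I_q + B^T D^{−1} B|.

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = 500, 6                        # e.g. p genes, q factors
B = rng.normal(size=(p, q))
d = rng.uniform(0.5, 2.0, size=p)    # diagonal of D

# Inverse of Sigma = B B^T + D computed via q x q matrices only:
# Sigma^{-1} = D^{-1} - D^{-1} B (I_q + B^T D^{-1} B)^{-1} B^T D^{-1}.
Dinv_B = B / d[:, None]
core = np.linalg.inv(np.eye(q) + B.T @ Dinv_B)
Sigma_inv = np.diag(1.0 / d) - Dinv_B @ core @ Dinv_B.T

# log|Sigma| via |B B^T + D| = |D| * |I_q + B^T D^{-1} B|.
logdet = np.sum(np.log(d)) + np.linalg.slogdet(np.eye(q) + B.T @ Dinv_B)[1]

# Check against the direct p x p computation (cheap here, costly for large p).
Sigma = B @ B.T + np.diag(d)
print(np.allclose(Sigma_inv, np.linalg.inv(Sigma), atol=1e-8))
print(np.isclose(logdet, np.linalg.slogdet(Sigma)[1]))
```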
PROVIDES A MODEL-BASED
APPROACH TO CLUSTERING
McLachlan, Bean, and Peel, 2002, A Mixture Model-Based Approach to the Clustering of Microarray Expression Data, Bioinformatics 18, 413-422
http://www.bioinformatics.oupjournals.org/cgi/screenpdf/18/3/413.pdf
Example: Microarray Data
Colon Data of Alon et al. (1999):
M = 62 (40 tumours; 22 normals) tissue samples on N = 2,000 genes, in a 2,000 x 62 matrix.
[Fitted mixture of 2 normal components; fitted mixture of 2 t components.]
Clustering of COLON Data: Genes using EMMIX-GENE
[Heat maps of the 20 gene groups (1-20) obtained for the colon data.]
Clustering of COLON Data: Tissues using EMMIX-GENE
[Heat maps of the 20 gene groups (1-20), with the colon tissues clustered.]
Heat Map Displaying the Reduced Set of 4,869 Genes on the 98 Breast Cancer Tumours
Heat Map of Top 1867 Genes
[Heat maps of the 40 gene groups (1-40) for the breast cancer data.]
Summary of the 40 gene groups, where i = group number, m_i = number in group i, and U_i = −2 log λ_i:

i   m_i   U_i       i   m_i   U_i      i   m_i   U_i      i   m_i   U_i
1   146   112.98    11  66    25.72    21  44    13.77    31  53    9.84
2   93    74.95     12  38    25.45    22  30    13.28    32  36    8.95
3   61    46.08     13  28    25.00    23  25    13.10    33  36    8.89
4   55    35.20     14  53    21.33    24  67    13.01    34  38    8.86
5   43    30.40     15  47    18.14    25  12    12.04    35  44    8.02
6   92    29.29     16  23    18.00    26  58    12.03    36  56    7.43
7   71    28.77     17  27    17.62    27  27    11.74    37  46    7.21
8   20    28.76     18  45    17.51    28  64    11.61    38  19    6.14
9   23    28.44     19  80    17.28    29  38    11.38    39  29    4.64
10  23    27.73     20  55    13.79    30  21    10.72    40  35    2.44
Heat Map of Genes in Group G1
Heat Map of Genes in Group G2
Heat Map of Genes in Group G3
Number of Components
in a Mixture Model
Testing for the number of components,
g, in a mixture is an important but very
difficult problem which has not been
completely resolved.
Order of a Mixture Model
A mixture density with g components might
be empirically indistinguishable from one
with either fewer than g components or
more than g components. It is therefore
sensible in practice to approach the question
of the number of components in a mixture
model in terms of an assessment of the
smallest number of components in the
mixture compatible with the data.
Likelihood Ratio Test Statistic
An obvious way of approaching the problem of testing for the smallest value of the number of components in a mixture model is to use the likelihood ratio test statistic (LRTS), −2 log λ. Suppose we wish to test the null hypothesis
H_0: g = g_0   versus   H_1: g = g_1
for some g_1 > g_0.
We let Ψ̂_i denote the MLE of Ψ calculated under H_i (i = 0, 1). Then the evidence against H_0 will be strong if λ is sufficiently small, or equivalently, if −2 log λ is sufficiently large, where
−2 log λ = 2{log L(Ψ̂_1) − log L(Ψ̂_0)}.
Bootstrapping the LRTS
McLachlan (1987) proposed a resampling approach to the assessment of the P-value of the LRTS in testing
H_0: g = g_0   versus   H_1: g = g_1
for a specified value of g_0.
Bayesian Information Criterion
The Bayesian information criterion (BIC) of Schwarz (1978) uses
2 log L(Ψ̂) − d log n
as the penalized log likelihood to be maximized in model selection (d is the number of free parameters), including the present situation for the number of components g in a mixture model.
Gap statistic (Tibshirani et al., 2001)
Clest (Dudoit and Fridlyand, 2002)
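As an illustration of choosing g by a penalized likelihood criterion, the following sketch fits normal mixtures for a range of g with scikit-learn and compares BIC values (scikit-learn reports BIC in the "smaller is better" form −2 log L + d log n); the resampling, gap-statistic, and Clest approaches mentioned above are not shown.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(100, 5)),
               rng.normal(3, 1, size=(60, 5))])   # two groups in 5 dimensions

bics = {}
for g in range(1, 6):
    gmm = GaussianMixture(n_components=g, covariance_type="full",
                          n_init=5, random_state=0).fit(X)
    bics[g] = gmm.bic(X)            # -2 log L + d log n (smaller is better)

g_hat = min(bics, key=bics.get)
print(bics, "chosen g =", g_hat)
```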
Outline of Tutorial
•Introduction to Microarray Technology
•Detecting Differentially Expressed Genes in Known Classes
of Tissue Samples
•Cluster Analysis: Clustering Genes and Clustering Tissues
•Supervised Classification of Tissue Samples
•Linking Microarray Data with Survival Analysis
Supervised Classification (Two Classes)
[Diagram: expression matrix of Gene 1, ..., Gene p on Sample 1, ..., Sample n, with the samples labelled Class 1 (good prognosis) or Class 2 (poor prognosis).]
Microarray to be used as routine
clinical screen
by C. M. Schubert
Nature Medicine
9, 9, 2003.
The Netherlands Cancer Institute in Amsterdam is to become the first institution
in the world to use microarray techniques for the routine prognostic screening of
cancer patients. Aiming for a June 2003 start date, the center will use a panoply
of 70 genes to assess the tumor profile of breast cancer patients and to
determine which women will receive adjuvant treatment after surgery.
Selection bias in gene extraction on the
basis of microarray gene-expression data
Ambroise and McLachlan
Proceedings of the National Academy of Sciences
Vol. 99, Issue 10, 6562-6566, May 14, 2002
http://www.pnas.org/cgi/content/full/99/10/6562
Supervised Classification of Tissue Samples
We OBSERVE the CLASS LABELS z1, …, zn where
zj = i if jth tissue sample comes from the ith class
(i=1,…,g).
AIM: TO CONSTRUCT A CLASSIFIER c(y) FOR
PREDICTING THE UNKNOWN CLASS LABEL z
OF A TISSUE SAMPLE y.
e.g.
g = 2 classes
C1 - DISEASE-FREE
C2 - METASTASES
LINEAR CLASSIFIER
FORM
c(y) = β_0 + β^T y = β_0 + β_1 y_1 + ... + β_p y_p
for the prediction of the group label z of a future entity with feature vector y.
FISHER'S LINEAR DISCRIMINANT FUNCTION
ẑ = sign{c(y)},
where
β = S^{−1} (ȳ_1 − ȳ_2),
β_0 = −½ (ȳ_1 + ȳ_2)^T S^{−1} (ȳ_1 − ȳ_2),
and ȳ_1, ȳ_2, and S are the sample means and pooled sample covariance matrix found from the training data.
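A minimal Python sketch of Fisher's rule as defined above, computing β and β_0 from two training samples and classifying by the sign of c(y); it assumes p < n so that S is invertible (with microarray data the genes would first be reduced, or S regularized).

```python
import numpy as np

def fisher_ldf(Y1, Y2):
    """Fisher's linear discriminant from two training samples.

    Y1, Y2 are n1 x p and n2 x p arrays. Returns (beta0, beta) such that
    a new y is assigned by the sign of c(y) = beta0 + beta^T y.
    """
    n1, n2 = len(Y1), len(Y2)
    m1, m2 = Y1.mean(axis=0), Y2.mean(axis=0)
    S = ((Y1 - m1).T @ (Y1 - m1) + (Y2 - m2).T @ (Y2 - m2)) / (n1 + n2 - 2)
    beta = np.linalg.solve(S, m1 - m2)        # S^{-1}(ybar1 - ybar2)
    beta0 = -0.5 * (m1 + m2) @ beta           # -(1/2)(ybar1 + ybar2)^T S^{-1}(ybar1 - ybar2)
    return beta0, beta

def classify(y, beta0, beta):
    # +1 -> class 1, -1 -> class 2
    return int(np.sign(beta0 + beta @ y))

rng = np.random.default_rng(5)
Y1 = rng.multivariate_normal([1, 1], np.eye(2), size=30)
Y2 = rng.multivariate_normal([-1, -1], np.eye(2), size=30)
b0, b = fisher_ldf(Y1, Y2)
print(classify(np.array([1.0, 0.5]), b0, b))
```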
SUPPORT VECTOR CLASSIFIER
Vapnik (1995)
c(y) = β_0 + β_1 y_1 + ... + β_p y_p,
where β_0 and β are obtained by solving
min_{β, β_0}  ½ ||β||² + C Σ_{j=1}^n ξ_j
subject to  ξ_j ≥ 0,  z_j c(y_j) ≥ 1 − ξ_j   (j = 1, ..., n),
where ξ_1, ..., ξ_n are the slack variables (all zero in the separable case).
The solution has the form
β̂ = Σ_{j=1}^n â_j z_j y_j,
with nonzero â_j only for those observations j for which the constraints are exactly met (the support vectors). Hence
c(y) = Σ_{j=1}^n â_j z_j y_j^T y + β̂_0 = Σ_{j=1}^n â_j z_j ⟨y_j, y⟩ + β̂_0.
Support Vector Machine (SVM)
REPLACE y by h(y). Then
c(y) = Σ_{j=1}^n â_j z_j ⟨h(y_j), h(y)⟩ + β̂_0 = Σ_{j=1}^n â_j z_j K(y_j, y) + β̂_0,
where the kernel function K(y_j, y) = ⟨h(y_j), h(y)⟩ is the inner product in the transformed feature space.
HASTIE et al. (2001, Chapter 12)
The Lagrangian (primal function) is
L_P = ½ ||β||² + C Σ_{j=1}^n ξ_j − Σ_{j=1}^n a_j {z_j c(y_j) − (1 − ξ_j)} − Σ_{j=1}^n l_j ξ_j,   (1)
which we minimize with respect to β, β_0, and the ξ_j.
Setting the respective derivatives to zero, we get
β = Σ_{j=1}^n a_j z_j y_j,   (2)
0 = Σ_{j=1}^n a_j z_j,   (3)
a_j = C − l_j   (j = 1, ..., n),   (4)
with a_j ≥ 0, l_j ≥ 0, and ξ_j ≥ 0 (j = 1, ..., n).
By substituting (2) to (4) into (1), we obtain the Lagrangian dual function
L_D = Σ_{j=1}^n a_j − ½ Σ_{j=1}^n Σ_{k=1}^n a_j a_k z_j z_k y_j^T y_k.   (5)
We maximize (5) subject to 0 ≤ a_j ≤ C and Σ_{j=1}^n a_j z_j = 0.
In addition to (2) to (4), the constraints include
a_j {z_j c(y_j) − (1 − ξ_j)} = 0,   (6)
l_j ξ_j = 0,   (7)
z_j c(y_j) − (1 − ξ_j) ≥ 0,   (8)
for j = 1, ..., n.
Together these equations (2) to (8) uniquely characterize the solution to the primal and dual problem.
Leo Breiman (2001)
Statistical modeling:
the two cultures (with discussion).
Statistical Science 16, 199-231.
Discussants include Brad Efron and David Cox
GUYON, WESTON, BARNHILL &
VAPNIK (2002, Machine Learning)
LEUKAEMIA DATA:
Only 2 genes are needed to obtain a zero
CVE (cross-validated error rate)
COLON DATA:
Using only 4 genes, CVE is 2%
Since p>>n, consideration given to
selection of suitable genes
SVM: FORWARD or BACKWARD (in terms of
magnitude of weight βi)
RECURSIVE FEATURE ELIMINATION (RFE)
FISHER: FORWARD ONLY (in terms of CVE)
GUYON et al. (2002)
“The success of the RFE indicates that RFE has a
built in regularization mechanism that we do not
understand yet that prevents overfitting the
training data in its selection of gene subsets.”
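RFE with a linear SVM can be sketched with scikit-learn (SVC plus RFE, which repeatedly drops the genes with the smallest |β_i|); the simulated data below merely stand in for the leukaemia or colon sets, and the error rate of such a pipeline still has to be assessed by external cross-validation (see the following sections).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

# Simulated data: 60 tissues x 500 genes, with the first 10 genes informative.
rng = np.random.default_rng(6)
X = rng.normal(size=(60, 500))
z = np.repeat([0, 1], 30)
X[z == 1, :10] += 1.5

# Linear SVM; RFE drops the genes with the smallest |beta_i| at each step.
svm = SVC(kernel="linear", C=1.0)
rfe = RFE(estimator=svm, n_features_to_select=4, step=0.1).fit(X, z)

selected = np.flatnonzero(rfe.support_)
print("genes retained:", selected)
```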
Error Rate Estimation
Suppose there are two groups G1 and G2
c(y) is a classifier formed from the
data set
(y1, y2, y3,……………, yn)
The apparent error is the proportion of
the data set misallocated by c(y).
Cross-Validation
From the original data set, remove y_1 to give the reduced set
(y_2, y_3, ..., y_n).
Then form the classifier c^(1)(y) from this reduced set. Use c^(1) to allocate y_1 to either G_1 or G_2.
Repeat this process for the second data point, y_2, so that this point is assigned to either G_1 or G_2 on the basis of the classifier c^(2) formed without it. And so on, up to y_n.
Ten-Fold Cross Validation
[The data are split into 10 parts; each part (1-10) is used in turn as the test set, with the remaining nine parts used as the training set.]
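The selection bias issue can be illustrated directly. The sketch below compares cross-validation with the gene selection redone inside every training fold (the corrected, external CV) against selection done once on all the data (the biased, internal CV), in the spirit of Ambroise and McLachlan (2002); the t-statistic filter and linear SVM are simple stand-ins. On the pure-noise data the corrected estimate should sit near 50%, while the biased one is far too optimistic.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold

def top_genes_by_t(X, z, k):
    """Indices of the k genes with the largest absolute two-sample t-statistic."""
    m1, m2 = X[z == 0].mean(0), X[z == 1].mean(0)
    s = np.sqrt(X[z == 0].var(0, ddof=1) / (z == 0).sum() +
                X[z == 1].var(0, ddof=1) / (z == 1).sum())
    return np.argsort(-np.abs((m1 - m2) / s))[:k]

def cv_error(X, z, k=20, n_splits=10, internal_selection=True, seed=0):
    errs = []
    for tr, te in StratifiedKFold(n_splits, shuffle=True,
                                  random_state=seed).split(X, z):
        genes = (top_genes_by_t(X[tr], z[tr], k) if internal_selection
                 else top_genes_by_t(X, z, k))    # biased: selection sees all samples
        clf = SVC(kernel="linear").fit(X[tr][:, genes], z[tr])
        errs.append(np.mean(clf.predict(X[te][:, genes]) != z[te]))
    return float(np.mean(errs))

rng = np.random.default_rng(7)
X = rng.normal(size=(60, 2000))          # pure noise: true error rate is 50%
z = np.repeat([0, 1], 30)
print("external (corrected) CV error:", cv_error(X, z, internal_selection=True))
print("internal (biased) CV error:  ", cv_error(X, z, internal_selection=False))
```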
Figure 1: Error rates of the SVM rule with RFE procedure
averaged over 50 random splits of colon tissue samples
Figure 3: Error rates of Fisher’s rule with stepwise forward
selection procedure using all the colon data
Figure 5: Error rates of the SVM rule averaged over 20 noninformative
samples generated by random permutations of the class labels of the
colon tumor tissues
ADDITIONAL REFERENCES
Selection bias ignored:
XIONG et al. (2001, Molecular Genetics and Metabolism)
XIONG et al. (2001, Genome Research)
ZHANG et al. (2001, PNAS)
Aware of selection bias:
SPANG et al. (2001, In Silico Biology)
WEST et al. (2001, PNAS)
NGUYEN and ROCKE (2002)
BOOTSTRAP APPROACH
Efron's (1983, JASA) .632 estimator:
B.632 = .368 AE + .632 B1,
where B1 is the bootstrap error rate when the rule R*_k is applied to a point not in the training sample.
A Monte Carlo estimate of B1 is
B1 = Σ_{j=1}^n E_j / n,
where
E_j = Σ_{k=1}^K I_jk Q_jk / Σ_{k=1}^K I_jk,
with
I_jk = 1 if x_j is not in the kth bootstrap sample, 0 otherwise,
and
Q_jk = 1 if R*_k misallocates x_j, 0 otherwise.
Toussaint & Sharpe (1975) proposed the ERROR RATE ESTIMATOR
A(w) = (1 − w) AE + w CV2E,
where w = 0.5.
McLachlan (1977) proposed w = w_0, where w_0 is chosen to minimize the asymptotic bias of A(w) in the case of two homoscedastic normal groups. The value of w_0 was found to range between 0.6 and 0.7, depending on the values of p, Δ, and n_1/n_2.
The .632+ estimate of Efron & Tibshirani (1997, JASA):
B.632+ = (1 − w) AE + w B1,
where
w = .632 / (1 − .368 r),
r = (B1 − AE) / (γ − AE)   (the relative overfitting rate),
γ = Σ_{i=1}^g p_i (1 − q_i)   (the estimate of the no-information error rate).
If r = 0, then w = .632, and so B.632+ = B.632;
if r = 1, then w = 1, and so B.632+ = B1.
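A small helper that combines AE, B1, and the no-information rate γ into the .632+ estimate as defined above; the numerical inputs are assumed to have been computed already, and the clipping of r to [0, 1] follows the usual convention.

```python
def b632_plus(ae, b1, gamma):
    """Efron & Tibshirani (1997) .632+ error estimate.

    ae    : apparent error rate of the rule on its own training data
    b1    : bootstrap error rate on points left out of the bootstrap samples
    gamma : estimated no-information error rate, sum_i p_i * (1 - q_i)
    """
    r = (b1 - ae) / (gamma - ae) if (gamma > ae and b1 > ae) else 0.0
    r = min(max(r, 0.0), 1.0)                 # clip the overfitting rate to [0, 1]
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * ae + w * b1

# Example: AE = 0.05, B1 = 0.25, gamma = 0.50.
print(b632_plus(0.05, 0.25, 0.50))
```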
MARKER GENES FOR HARVARD DATA
For an SVM based on 64 genes, and using 10-fold CV, we noted the number of times a gene was selected.

No. of genes:     55   18   11   7    8    6    10   8    12   17
Times selected:   1    2    3    4    5    6    7    8    9    10
MARKER GENES FOR HARVARD DATA
tubulin, alpha, ubiquitous
Cluster Incl N90862
cyclin-dependent kinase inhibitor 2C
(p18, inhibits CDK4)
DEK oncogene (DNA binding)
Cluster Incl AF035316
transducin-like enhancer of split 2,
homolog of Drosophila E(sp1)
ADP-ribosyltransferase (NAD+; poly
(ADP-ribose) polymerase)
benzodiazapine receptor (peripheral)
Cluster Incl D21063
galactosidase, beta 1
high-mobility group (nonhistone
chromosomal) protein 2
cold inducible RNA-binding protein
Cluster Incl U79287
BAF53
tubulin, beta polypeptide
thromboxane A2 receptor
H1 histone family, member X
Fc fragment of IgG, receptor,
transporter, alpha
sine oculis homeobox
(Drosophila) homolog 3
transcriptional intermediary
factor 1 gamma
transcription elongation factor
A (SII)-like 1
like mouse brain protein E46
minichromosome maintenance
deficient (mis5, S. pombe) 6
transcription factor 12 (HTF4,
helix-loop-helix transcription
factors 4)
guanine nucleotide binding
protein (G protein), gamma 3,
linked
dihydropyrimidinase-like 2
Cluster Incl AI951946
transforming growth factor,
beta receptor II (70-80kD)
protein kinase C-like 1
Breast cancer data set in van’t Veer et al.
(van’t Veer et al., 2002, Gene Expression Profiling Predicts
Clinical Outcome Of Breast Cancer, Nature 415)
These data were the result of microarray experiments
on three patient groups with different classes of
breast cancer tumours.
The overall goal was to identify a set of genes that
could distinguish between the different tumour
groups based upon the gene expression information
for these groups.
van de Vijver et al. (2002) considered a further 234 breast
cancer tumours but have only made available the data for
the top 70 genes based on the previous study of van ‘t Veer
et al. (2002)
Number of   Error Rate for Top 70 Genes    Error Rate for Top 70 Genes    Error Rate for 5422 Genes
Genes       (without correction for        (with correction for           (with correction for
            Selection Bias as Top 70)      Selection Bias as Top 70)      Selection Bias)
1           0.50                           0.53                           0.56
2           0.32                           0.41                           0.44
4           0.26                           0.40                           0.41
8           0.27                           0.32                           0.43
16          0.28                           0.31                           0.35
32          0.22                           0.35                           0.34
64          0.20                           0.34                           0.35
70          0.19                           0.33                           -
128         -                              -                              0.39
256         -                              -                              0.33
512         -                              -                              0.34
1024        -                              -                              0.33
2048        -                              -                              0.37
4096        -                              -                              0.40
5422        -                              -                              0.44
Nearest-Shrunken Centroids
(Tibshirani et al., 2002)
The usual estimates of the class means ȳ_i are shrunk toward the overall mean ȳ of the data, where
ȳ_i = Σ_{j=1}^n z_ij y_j / n_i   and   ȳ = Σ_{j=1}^n y_j / n.
The nearest-centroid rule assigns a feature vector y to the class whose centroid is closest, where y_v is the vth element of y and ȳ_iv = (ȳ_i)_v.
In this rule, we replace the sample mean ȳ_iv of the vth gene by its shrunken estimate, obtained by soft-thresholding the standardized difference between ȳ_iv and ȳ_v, where
m_i = (1/n_i − 1/n)^{1/2}
standardizes for the class sample size.
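As a rough stand-in for the PAM software of Tibshirani et al. (2002), scikit-learn's NearestCentroid offers centroid shrinkage through its shrink_threshold parameter; the sketch below runs it for several thresholds on simulated data of the same shape as the Alon et al. colon set. (Note this CV does not correct for any gene selection done beforehand.)

```python
import numpy as np
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)
X = rng.normal(size=(62, 2000))              # 62 tissues x 2000 genes
z = np.array([1] * 40 + [0] * 22)            # 40 tumours, 22 normals
X[z == 1, :30] += 1.0                        # 30 informative genes

# shrink_threshold controls how far the class centroids are shrunk toward
# the overall centroid (larger values zero out more genes).
for delta in (None, 0.5, 1.0, 2.0):
    clf = NearestCentroid(shrink_threshold=delta)
    acc = cross_val_score(clf, X, z, cv=10).mean()
    print(delta, round(1 - acc, 3))          # 10-fold CV error rate
```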
Comparison of Nearest-Shrunken Centroids with SVM
Apply (i) nearest-shrunken centroids and
(ii) the SVM with RFE
to colon data set of Alon et al. (1999), with
N = 2000 genes and M = 62 tissues
(40 tumours, 22 normals)
Nearest-Shrunken Centroids applied to Alon data
(a) Overall Error Rates
(b) Class-specific Error Rates
SVM with RFE applied to Alon data
(a) Overall Error Rates
(b) Class-specific Error Rates
Outline of Tutorial
•Introduction to Microarray Technology
•Detecting Differentially Expressed Genes in Known Classes
of Tissue Samples
•Cluster Analysis: Clustering Genes and Clustering Tissues
•Supervised Classification of Tissue Samples
•Linking Microarray Data with Survival Analysis
Breast tumours have a genetic signature. The expression
pattern of a set of 70 genes can predict whether a tumour
is going to prove lethal, despite treatment, or not.
“This gene expression profile will outperform all currently
used clinical parameters in predicting disease outcome.”
van ’t Veer et al. (2002), van de Vijver et al. (2002)
Problems
•Censored Observations – the time of occurrence of the event
(death) has not yet been observed.
•Small Sample Sizes – study limited by patient numbers
•Specific Patient Group – is the study applicable to other
populations?
•Difficulty in integrating different studies (different
microarray platforms)
A Case Study: The Lung Cancer data sets from
CAMDA’03
Four independently acquired lung cancer data sets
(Harvard, Michigan, Stanford and Ontario).
The challenge: To integrate information from different
data sets (2 Affy chips of different versions, 2 cDNA arrays).
The final goal: To make an impact on cancer biology and
eventually patient care.
“Especially, we welcome the methodology of survival analysis
using microarrays for cancer prognosis (Park et al.
Bioinformatics: S120, 2002).”
Methodology of Survival Analysis using Microarrays
Cluster the tissue samples (eg using hierarchical clustering), then
compare the survival curves for each cluster using a non-parametric
Kaplan-Meier analysis (Alizadeh et al. 2000).
Park et al. (2002), Nguyen and Rocke (2002) used partial least
squares with the proportional hazards model of Cox.
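A sketch of the "cluster, then compare survival curves" step using the lifelines package (assumed to be available): Kaplan-Meier curves are fitted per cluster and compared with a log-rank test. The times, censoring indicators, and cluster labels below are illustrative placeholders for the outputs of a prior cluster analysis of the expression data.

```python
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

# Assumed inputs: follow-up times, event indicators (1 = recurrence/death,
# 0 = censored), and cluster labels from a prior cluster analysis.
time = np.array([5, 8, 12, 20, 24, 30, 33, 40, 45, 50, 60, 70], dtype=float)
event = np.array([1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0])
cluster = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2])

kmf = KaplanMeierFitter()
for c in (1, 2):
    mask = cluster == c
    kmf.fit(time[mask], event[mask], label=f"cluster {c}")
    print(c, kmf.median_survival_time_)

# Nonparametric comparison of the two survival curves.
res = logrank_test(time[cluster == 1], time[cluster == 2],
                   event_observed_A=event[cluster == 1],
                   event_observed_B=event[cluster == 2])
print("log-rank P-value:", res.p_value)
```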
Unsupervised vs. Supervised Methods
Semi-supervised approach of Bair and Tibshirani (2004), to combine
gene expression data with the clinical data.
AIM: To link gene-expression data with survival from lung cancer
in the CAMDA’03 challenge
A CLUSTER ANALYSIS
We apply a model-based clustering approach to classify tumour
tissues on the basis of microarray gene expression.
B SURVIVAL ANALYSIS
The association between the clusters so formed and patient
survival (recurrence) times is established.
C DISCRIMINANT ANALYSIS
We demonstrate the potential of the clustering-based prognosis
as a predictor of the outcome of disease.
Lung Cancer
Approx. 80% of lung cancer patients have NSCLC (of which
adenocarcinoma is the most common form).
All Patients diagnosed with NSCLC are treated on the basis of
stage at presentation (tumour size, lymph node involvement and
presence of metastases).
Yet 30% of patients with resected stage I lung cancer will die of
metastatic cancer within 5 years of surgery.
Want a prognostic test for early-stage lung adenocarcinoma to
identify patients more likely to recur, and therefore who would
benefit from adjuvant therapy.
Lung Cancer Data Sets
(see http://www.camda.duke.edu/camda03)
Wigle et al. (2002), Garber et al. (2001), Bhattacharjee et al. (2001),
Beer et al. (2002).
Heat Map for 2880 Ontario Genes (39 Tissues)
Heat Maps for the 20 Ontario Gene-Groups (39 Tissues)
Tissues are ordered as: Recurrence (1-24) and Censored (25-39)
Expression Profiles for Useful Metagenes (Ontario 39 Tissues)
[Plots of log expression value versus tissue for Gene Groups 1, 2, 19, and 20, with our Tissue Clusters 1 and 2 indicated; tissues ordered as Recurrence (1-24) and Censored (25-39).]
Tissue Clusters
CLUSTER ANALYSIS via EMMIX-GENE of the 20 METAGENES yields TWO CLUSTERS:
CLUSTER 1 (poor prognosis): 23 (recurrence) plus 8 (censored)
CLUSTER 2 (good prognosis): 1 (recurrence) plus 7 (censored)
SURVIVAL ANALYSIS:
LONG-TERM SURVIVOR (LTS) MODEL
S(t) = Pr{T > t} = π_1 S_1(t) + π_2,
where T is the time to recurrence and π_1 = 1 − π_2 is the prior probability of recurrence.
Adopt a Weibull model for the survival function S_1(t) for recurrence.
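A minimal numpy sketch of this long-term survivor form with a Weibull survival function for the recurrence times; the parameter values are illustrative only.

```python
import numpy as np

def lts_survival(t, pi1, shape, scale):
    """Long-term survivor model: S(t) = pi1 * S1(t) + (1 - pi1),

    where S1 is a Weibull survival function with the given shape and scale,
    and 1 - pi1 is the proportion of long-term survivors (never recur).
    """
    s1 = np.exp(-(np.asarray(t, dtype=float) / scale) ** shape)
    return pi1 * s1 + (1.0 - pi1)

# Illustrative values: 80% of patients subject to recurrence,
# Weibull shape 1.5 and scale 500 days for the recurrence times.
t = np.linspace(0, 2000, 5)
print(lts_survival(t, pi1=0.8, shape=1.5, scale=500.0))
```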
Fitted LTS Model vs. Kaplan-Meier
PCA of Tissues Based on Metagenes [plots of the first two principal components]
PCA of Tissues Based on All Genes (via SVD) [plots of the first two principal components]
Cluster-Specific Kaplan-Meier Plots
Survival Analysis for Ontario Dataset
• Nonparametric analysis:

Cluster   No. of Tissues   No. of Censored   Mean Time to Failure (SE)
1         29               8                 665 (85.9)
2         8                7                 1388 (155.7)

A significant difference between the Kaplan-Meier estimates for the two clusters (P = 0.027).
• Cox's proportional hazards analysis:

Variable                      Hazard Ratio (95% CI)   P-value
Cluster 1 vs. Cluster 2       6.78 (0.9 – 51.5)       0.06
Tumor stage (I vs. II&III)    1.07 (0.57 – 2.0)       0.83
Discriminant Analysis (Supervised Classification)
A prognosis classifier was developed to predict the class
of origin of a tumor tissue with a small error rate after
correction for the selection bias.
A support vector machine (SVM) was adopted to identify important genes that play a key role in predicting the clinical outcome, using all the genes, and the metagenes.
A cross-validation (CV) procedure was used to calculate
the prediction error, after correction for the selection bias.
ONTARIO DATA (39 tissues): Support Vector Machine (SVM) with Recursive Feature Elimination (RFE)
[Plot of the ten-fold cross-validation error rate (CV10E) versus log2(number of genes).]
Ten-fold Cross-Validation Error Rate (CV10E) of the Support Vector Machine (SVM) applied to g = 2 clusters (G1: 1-14, 16-29, 33, 36, 38; G2: 15, 30-32, 34, 35, 37, 39).
STANFORD DATA
918 genes based on 73 tissue samples from 67 patients.
Row and column normalized, retained 451 genes after
select-genes step. Used 20 metagenes to cluster tissues.
Retrieved histological groups.
Heat Maps for the 20 Stanford Gene-Groups (73 Tissues)
Tissues are ordered by their histological classification:
Adenocarcinoma (1-41), Fetal Lung (42), Large cell (43-47), Normal (48-52), Squamous cell (53-68), Small cell (69-73)
STANFORD CLASSIFICATION:
Cluster 1: 1-19 (good prognosis)
Cluster 2: 20-26 (long-term survivors)
Cluster 3: 27-35 (poor prognosis)
Heat Maps for the 15 Stanford Gene-Groups (35 Tissues)
Tissues are ordered by the Stanford classification into AC groups: AC group 1 (1-19), AC group 2 (20-26), AC group 3 (27-35)
Expression Profiles for Top Metagenes (Stanford 35 AC Tissues)
[Plots of log expression value versus tissue for Gene Groups 1-4, with Stanford AC groups 1, 2, and 3 and the misallocated tissues indicated.]
Cluster-Specific Kaplan-Meier Plots
Survival Analysis for Stanford Dataset
• Kaplan-Meier estimation:

Cluster   No. of Tissues   No. of Censored   Mean Time to Failure (SE)
1         17               10                37.5 (5.0)
2         5                0                 5.2 (2.3)

A significant difference in survival between the clusters (P < 0.001).
• Cox's proportional hazards analysis:

Variable                          Hazard Ratio (95% CI)   P-value
Cluster 3 vs. Clusters 1&2        13.2 (2.1 – 81.1)       0.005
Grade 3 vs. grades 1 or 2         1.94 (0.5 – 8.5)        0.38
Tumor size                        0.96 (0.3 – 2.8)        0.93
No. of tumors in lymph nodes      1.65 (0.7 – 3.9)        0.25
Presence of metastases            4.41 (1.0 – 19.8)       0.05
Survival Analysis for Stanford Dataset
• Univariate Cox's proportional hazards analysis (metagenes):

Metagene   Coefficient (SE)   P-value
1          1.37 (0.44)        0.002
2          -0.24 (0.31)       0.44
3          0.14 (0.34)        0.68
4          -1.01 (0.56)       0.07
5          0.66 (0.65)        0.31
6          -0.63 (0.50)       0.20
7          -0.68 (0.57)       0.24
8          0.75 (0.46)        0.10
9          -1.13 (0.50)       0.02
10         0.73 (0.39)        0.06
11         0.35 (0.50)        0.48
12         -0.55 (0.41)       0.18
13         -0.61 (0.48)       0.20
14         0.22 (0.36)        0.53
15         1.70 (0.92)        0.06
Survival Analysis for Stanford Dataset
• Multivariate Cox's proportional hazards analysis (metagenes):

Metagene   Coefficient (SE)   P-value
1          3.44 (0.95)        0.0003
2          -1.60 (0.62)       0.010
8          -1.55 (0.73)       0.033
11         1.16 (0.54)        0.031

The final model consists of four metagenes.
STANFORD DATA: Support Vector Machine (SVM) with Recursive Feature Elimination (RFE)
[Plot of the ten-fold cross-validation error rate (CV10E) versus log2(number of genes).]
Ten-fold Cross-Validation Error Rate (CV10E) of the Support Vector Machine (SVM) applied to g = 2 clusters.
CONCLUSIONS
We applied a model-based clustering approach to
classify tumors using their gene signatures into:
(a) clusters corresponding to tumor type
(b) clusters corresponding to clinical outcomes
for tumors of a given subtype
In (a), almost perfect correspondence between
cluster and tumor type, at least for non-AC
tumors (but not in the Ontario dataset).
CONCLUSIONS (cont.)
The clusters in (b) were identified with clinical
outcomes (e.g. recurrence/recurrence-free and
death/long-term survival).
We were able to show that gene-expression
data provide prognostic information, beyond
that of clinical indicators such as stage.
CONCLUSIONS (cont.)
Based on the tissue clusters, a discriminant analysis
using support vector machines (SVM) demonstrated
further the potential of gene expression as a tool for guiding treatment therapy and patient care for lung cancer patients.
This supervised classification procedure was used to
provide marker genes for prediction of clinical
outcomes.
(In addition to those provided by the cluster-genes
step in the initial unsupervised classification.)
LIMITATIONS
Small number of tumors available (e.g. Ontario and Stanford datasets).
Clinical data available for only subsets of the tumors;
often for only one tumor type (AC).
High proportion of censored observations limits
comparison of survival rates.