
Class prediction for experiments with microarrays
Lara Lusa
Inštitut za biomedicinsko informatiko, Medicinska fakulteta
Lara.Lusa at mf.uni-lj.si
Outline
• Objectives of microarray experiments
• Class prediction
– What is a predictor?
– How to develop a predictor?
• What methods are available?
• Which features should be used in the predictor?
– How to evaluate a predictor?
• Internal vs. external validation
– Some examples of what can go wrong
• The molecular classification of breast cancer
Scheme of an experiment
• Study design
• Performance of the experiment
– Sample preparation
– Hybridization
– Image analysis
• Quality control and normalization
• Data analysis
– Class comparison
– Class prediction
– Class discovery
• Interpretation of the results
Aims of high-throughput experiments
• Class comparison (supervised)
– establish differences in gene expression between predetermined classes (phenotypes)
• Tumor vs. normal tissue
• Recurrent vs. non-recurrent patients treated with a drug (Ma, 2004)
• ER+ vs. ER- patients (West, 2001)
• BRCA1, BRCA2 and sporadic tumors in breast cancer (Hedenfalk, 2001)
• Class prediction (supervised)
– prediction of phenotype using gene expression data
• morphology of a leukemia patient based on his gene expression (ALL vs. AML, Golub 1999)
• which patients with breast cancer will develop a distant metastasis within 5 years (van't Veer, 2002)
• Class discovery (unsupervised)
– discover groups of samples or genes with similar expression
• Luminal A, B, C(?), Basal, ERBB2+, Normal in breast cancer (Perou 2001, Sørlie 2003)
Data from microarray experiments
How to develop a predictor?
• On a training set of samples:
– Select a subset of genes (feature selection)
– Use the gene expression measurements (X) to obtain a RULE g(X), based on gene expression, for the classification of new samples
• Predict class membership (Y) of new samples (test set)
An example from Duda et al.
Rule: Nearest-neighbor classifier
– For each sample of the independent data set ("test set"), calculate Pearson's (centered) correlation of its gene expression with each sample from the training set
– Classification rule: assign the new sample to the class of the training-set sample that has the highest correlation with the new sample
[Figure: correlations between the new sample and the samples from the training set; Bishop, 2006]
Rule: K-nearest-neighbor classifier
– For each sample of the independent data set ("test set"), calculate Pearson's (centered) correlation of its gene expression with each sample from the training set
– Classification rule: assign the new sample to the class to which the majority of the K training-set samples with the highest correlations belong (e.g., K = 3)
[Figure: correlations between the new sample and the samples from the training set; Bishop, 2006]
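A minimal sketch of this rule in Python/NumPy (all names are illustrative, not from the slides); with K = 1 it reduces to the nearest-neighbor classifier of the previous slide:

```python
import numpy as np

def knn_pearson(X_train, y_train, x_new, k=3):
    """Assign x_new to the majority class among the k training samples
    whose expression profiles have the highest Pearson (centered)
    correlation with it; k=1 gives the nearest-neighbor rule."""
    # Pearson correlation = cosine similarity of the centered profiles
    Xc = X_train - X_train.mean(axis=1, keepdims=True)
    xc = x_new - x_new.mean()
    r = (Xc @ xc) / (np.linalg.norm(Xc, axis=1) * np.linalg.norm(xc))
    top = np.argsort(r)[-k:]                       # the k most correlated samples
    labels, counts = np.unique(y_train[top], return_counts=True)
    return labels[np.argmax(counts)]               # majority vote

# Toy usage: 6 training samples x 100 genes, two classes (0 and 1)
rng = np.random.default_rng(0)
X, y = rng.normal(size=(6, 100)), np.array([0, 0, 0, 1, 1, 1])
print(knn_pearson(X, y, X[0] + 0.1 * rng.normal(size=100)))  # predicted class
```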
Rule: Method of centroids (Sørlie et al. 2003)
• Class prediction rule:
– Define a centroid for each class on the original data set ("training set"): for each gene, average its expression over the samples assigned to that class
– For each sample of the independent data set ("test set"), calculate Pearson's (centered) correlation of its gene expression with each centroid
– Classification rule: assign the sample to the class whose centroid has the highest correlation with the sample (if the highest correlation is below .1, do not assign)
[Figure: the new sample is assigned to the class whose centroid has the highest correlation with it]
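A sketch of the rule in Python/NumPy (illustrative names; the .1 cut-off is the one quoted above):

```python
import numpy as np

def method_of_centroids(X_train, y_train, x_new, threshold=0.1):
    """Sorlie-style nearest-centroid rule with Pearson (centered)
    correlation and a minimum-correlation cut-off."""
    classes = np.unique(y_train)
    # Centroid of a class: gene-wise mean over the samples assigned to it
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    Cc = centroids - centroids.mean(axis=1, keepdims=True)
    xc = x_new - x_new.mean()
    r = (Cc @ xc) / (np.linalg.norm(Cc, axis=1) * np.linalg.norm(xc))
    best = int(np.argmax(r))
    if r[best] < threshold:
        return None, r[best]        # correlation too low: do not assign
    return classes[best], r[best]   # assigned class and its correlation
```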
Rule: Diagonal Linear Discriminant Analysis (DLDA)
• Calculate the mean expression of the samples from Class 1 and from Class 2 in the training set for each of the G genes, $\bar{x}_{1j}$ and $\bar{x}_{2j}$, and the pooled within-class variance $s_j^2$ ($j = 1, \dots, G$)
• For each sample $x^*$ of the test set evaluate whether

$$\sum_{j=1}^{G} \frac{(x^*_j - \bar{x}_{1j})^2}{s_j^2} < \sum_{j=1}^{G} \frac{(x^*_j - \bar{x}_{2j})^2}{s_j^2}$$

where $x^*_j$ is the expression of the j-th gene for the new sample
• Classification rule: if the above inequality is satisfied, classify the sample in Class 1, otherwise in Class 2.
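A sketch of the rule in Python/NumPy, directly following the inequality above (names are illustrative):

```python
import numpy as np

def dlda_fit(X1, X2):
    """Gene-wise class means and pooled within-class variances from the
    training samples of Class 1 (X1) and Class 2 (X2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    s2 = (((X1 - m1) ** 2).sum(axis=0) + ((X2 - m2) ** 2).sum(axis=0)) \
         / (len(X1) + len(X2) - 2)               # pooled variance s_j^2
    return m1, m2, s2

def dlda_predict(x_new, m1, m2, s2):
    """Classify into Class 1 if the variance-scaled squared distance to
    the Class 1 means is the smaller one."""
    d1 = np.sum((x_new - m1) ** 2 / s2)
    d2 = np.sum((x_new - m2) ** 2 / s2)
    return 1 if d1 < d2 else 2
```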
Rule: Diagonal Linear Discriminant Analysis (DLDA)
• A particular case of discriminant analysis, under the hypotheses that
– the features are not correlated
– the variances of the two classes are the same
• Other methods used in microarray studies are variants of discriminant analysis:
– Compound covariate predictor
– Weighted vote method
(Bishop, 2006)
Other popular classification methods
• Classification and Regression Trees (CART)
• Prediction Analysis of Microarrays (PAM)
• Support Vector Machines (SVM)
• Logistic regression
• Neural networks
(Bishop, 2006)
How to choose a classification method?
• No single method is optimal in every situation
– No Free Lunch Theorem: in the absence of assumptions we should not prefer any classification algorithm over another
– Ugly Duckling Theorem: in the absence of assumptions there is no "best" set of features
The bias-variance tradeoff

$$\mathrm{MSE} = E_D\big[(g(x; D) - F(x))^2\big] = \big(E_D[g(x; D) - F(x)]\big)^2 + E_D\big[(g(x; D) - E_D[g(x; D)])^2\big] = \mathrm{Bias}^2 + \mathrm{Variance}$$

(Hastie et al., 2001; Duda et al., 2001)
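The decomposition can be checked by simulation. A small sketch (the toy setup is assumed, not from the slides): repeatedly draw training sets D from a known truth F, fit predictors of different complexity, and estimate the bias squared and the variance of g(x0; D) at a fixed point; more flexible fits trade bias for variance:

```python
import numpy as np

rng = np.random.default_rng(1)
F = lambda x: np.sin(2 * np.pi * x)     # the true function F(x)
x0, n, reps, sigma = 0.25, 30, 2000, 0.3

def bias_variance(degree):
    """Monte Carlo estimate of Bias^2 and Variance of a polynomial fit
    g(x0; D) over `reps` independent training sets D."""
    preds = np.empty(reps)
    for i in range(reps):
        x = rng.uniform(0, 1, n)
        y = F(x) + rng.normal(0, sigma, n)           # one training set D
        preds[i] = np.polyval(np.polyfit(x, y, degree), x0)
    return (preds.mean() - F(x0)) ** 2, preds.var()  # Bias^2, Variance

for degree in (1, 3, 7):
    b2, v = bias_variance(degree)
    print(f"degree {degree}: bias^2 = {b2:.4f}, variance = {v:.4f}")
```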
Feature selection
• Can ALL the gene expression variables be included in the classifier?
• Which variables should be used to build the classifier?
– Filter methods: applied prior to building the classifier; one feature at a time, or joint-distribution approaches (a minimal sketch follows below)
– Wrapper methods: feature selection performed implicitly by the classifier (e.g., CART, PAM)
(From Fridlyand, CBMB Workshop)
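A minimal one-gene-at-a-time filter (t-statistic ranking is an assumed, common choice, not the only one):

```python
import numpy as np
from scipy import stats

def filter_genes(X, y, n_keep=50):
    """Rank genes by the absolute two-sample t-statistic between the two
    classes and keep the n_keep most discriminating ones."""
    t, _ = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
    return np.argsort(-np.abs(t))[:n_keep]

# Usage: selected = filter_genes(X_train, y_train); X_train[:, selected]
```

As the cross-validation slides below stress, a filter like this must be re-run on each training set, never once on the complete data.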
A comparison of classifiers' performance for microarray data
• Dudoit, Fridlyand and Speed (2002, JASA), on 3 data sets
– DA, DLDA, k-NN, SVM, CART
• Good performance of simple classifiers such as DLDA and NN
• Feature selection: a small number of features included in the classifier
How to evaluate the performance of a classifier
• Classification error
– A sample is classified in a class to which it does not belong: g(X) ≠ Y
– Predictive accuracy = % of correctly classified samples
– In a two-class problem, using the terminology of diagnostic tests ("+" = diseased, "-" = healthy):
• Sensitivity = P(classified + | true +)
• Specificity = P(classified - | true -)
• Positive predictive value = P(true + | classified +)
• Negative predictive value = P(true - | classified -)
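These quantities follow directly from the cross-tabulation of predicted and true labels; a sketch with "+" coded as 1 and "-" as 0 (illustrative code, not from the slides):

```python
import numpy as np

def diagnostic_summary(y_true, y_pred):
    """Predictive accuracy and the four diagnostic-test quantities."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
    tn = np.sum((y_pred == 0) & (y_true == 0))   # true negatives
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
    return {"accuracy":    (tp + tn) / y_true.size,
            "sensitivity": tp / (tp + fn),       # P(classified + | true +)
            "specificity": tn / (tn + fp),       # P(classified - | true -)
            "PPV":         tp / (tp + fp),       # P(true + | classified +)
            "NPV":         tn / (tn + fn)}       # P(true - | classified -)
```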
Class prediction: how to assess the predictive accuracy?
• Use an independent data set (split the data into a training set and a test set)
• If one is not available?
– ABSOLUTELY WRONG: apply your predictor to the same data you used to develop it and see how well it predicts
– OK:
• cross-validation (repeated train/test splits, each part of the data left out in turn)
• bootstrap
How to develop a cross-validated class predictor
• Training set (the samples not left out):
– Select a subset of genes
– Use the gene expression measurements to obtain the parameters of a "mathematical function" (the predictor)
• Test set (the left-out samples):
– Predict class membership using the predictor developed on the training set
• Repeat over the train/test splits so that every sample is predicted once
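A sketch of a fully cross-validated predictor in Python/NumPy (leave-one-out, with a t-test filter and a nearest-mean rule as assumed placeholder components); the essential point is that the gene selection is repeated inside every fold, using only that fold's training samples:

```python
import numpy as np
from scipy import stats

def loocv_accuracy(X, y, n_genes=50):
    """Leave-one-out cross-validation of a simple class predictor.
    Feature selection is redone INSIDE each fold; selecting genes on the
    complete data set first would bias the accuracy estimate."""
    hits = 0
    for i in range(len(y)):
        tr = np.arange(len(y)) != i                   # training fold
        Xtr, ytr = X[tr], y[tr]
        t, _ = stats.ttest_ind(Xtr[ytr == 0], Xtr[ytr == 1], axis=0)
        g = np.argsort(-np.abs(t))[:n_genes]          # genes chosen on this fold only
        m0 = Xtr[ytr == 0][:, g].mean(axis=0)
        m1 = Xtr[ytr == 1][:, g].mean(axis=0)
        pred = int(np.sum((X[i, g] - m1) ** 2) < np.sum((X[i, g] - m0) ** 2))
        hits += (pred == y[i])
    return hits / len(y)
```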
Supervised prediction (Dupuy and Simon, JNCI 2007):
• 12/28 studies reported a misleading estimate of prediction accuracy
• 50% of the studies contained one or more major flaws
Class prediction: a famous example
• van't Veer et al. reported the results obtained with the wrong analysis in the paper, and the correct analysis (with less striking results) only in the supplementary material
What went wrong?
• The wrong analysis produces highly biased estimates of predictive accuracy
• Going beyond the quantification of predictive accuracy and attempting to make inference with the cross-validated class predictor: THE INFERENCE MADE IS NOT VALID

Microarray predictor vs. observed outcome (LOO CV):

Microarray predictor   Observed <5 yrs   Observed >5 yrs
Bad prognosis          31                18
Good prognosis         2                 26

Simulation under the hypothesis that there is no difference between the classes (n = 100):

Nominal level   Prop. of rejected H0
0.01            0.268
0.05            0.414
0.10            0.483

(Lusa, McShane, Radmacher, Shih, Wright, Simon, Statistics in Medicine, 2007)
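A small simulation in the spirit of Lusa et al. (2007), illustrating the second point (the sample size, number of genes and number of runs below are illustrative, not the paper's): the data carry no signal and the LOO CV itself is done correctly, yet a naive Fisher test of the cross-validated predictions against the true labels rejects H0 far more often than the nominal level:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, G, n_sim, alpha = 40, 500, 200, 0.05
y = np.repeat([0, 1], n // 2)                       # two balanced classes
rejections = 0
for _ in range(n_sim):
    X = rng.normal(size=(n, G))                     # pure noise: H0 is true
    pred = np.empty(n, dtype=int)
    for i in range(n):                              # correct LOO CV ...
        tr = np.arange(n) != i
        t, _ = stats.ttest_ind(X[tr][y[tr] == 0], X[tr][y[tr] == 1], axis=0)
        g = np.argsort(-np.abs(t))[:10]             # ... selection inside the fold
        m0 = X[tr][y[tr] == 0][:, g].mean(axis=0)
        m1 = X[tr][y[tr] == 1][:, g].mean(axis=0)
        pred[i] = int(np.sum((X[i, g] - m1) ** 2) < np.sum((X[i, g] - m0) ** 2))
    # ... but testing the CV predictions against the labels is not valid
    table = [[np.sum((pred == a) & (y == b)) for b in (0, 1)] for a in (0, 1)]
    rejections += stats.fisher_exact(table)[1] < alpha
print(f"Proportion of rejected H0 at nominal {alpha}: {rejections / n_sim:.2f}")
```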
Odds ratio = 15.0, p-value = 4 × 10^(-6)

Parameter       Logistic coeff.   Std. error   Odds ratio   95% CI
Grade           -0.08             0.79         1.1          [0.2, 5.1]
ER              0.5               0.94         1.7          [0.3, 10.4]
PR              -0.75             0.93         2.1          [0.3, 13.1]
Size (mm)       -1.26             0.66         3.5          [1.0, 12.8]
Age             1.4               0.79         4.0          [0.9, 19.1]
Angioinvasion   -1.55             0.74         4.7          [1.1, 20.1]
Microarray      2.87              0.851        17.6         [3.3, 93.7]

(Michiels et al., 2005, Lancet)
Final remarks
• Simple classification methods such as DLDA have proved to work well for microarray studies and to outperform fancier methods
• Many classification methods that have been proposed in the field under new names are just slight modifications of already known techniques
Final remarks
• Report all the necessary information about your classifier so that others can apply it to their data
• Evaluate the predictive accuracy of the classifier correctly
– in "early microarray times", many papers presented analyses that were not correct, or drew wrong conclusions from their work
– even now, middle- and low-impact-factor journals keep publishing obviously wrong analyses
• Don't apply methods without understanding exactly
– what they are doing
– on which assumptions they rely
Other issues in classification
• Missing data
• Class representation
• Choice of distance function
• Standardization of observations and variables
– An example where all this matters…
Class discovery
• Mostly performed through hierarchical clustering of genes and samples
– An often-abused method in microarray analysis, used instead of supervised methods
• In very few examples
– is the stability and reproducibility of the clustering assessed
– are the results "validated" or further used after "discovery"
– is a rule for the classification of new samples given
• "Projection" of the clustering onto new data sets still seems problematic
– It becomes a class prediction problem
Molecular taxonomy of breast cancer
• Perou/Sørlie (Stanford/Norway)
– Class sub-type discovery (Perou, Nature 2001; Sørlie, PNAS 2001; Sørlie, PNAS 2003)
– Association of the discovered classes with survival and other clinical variables (Sørlie, PNAS 2001; Sørlie, PNAS 2003)
– Validation of the findings by assigning the class labels defined from class discovery to independent data sets (Sørlie, PNAS 2003)
Sørlie et al., PNAS 2003
[Figure: hierarchical clustering of the 122 samples from the paper using the "intrinsic gene set" (~500 genes); average linkage, distance = 1 - Pearson's (centered) correlation. For each class: number of core samples (node correlation ρ) and percentage of ER-positive samples, with n = 79 core samples (64%) in total: 28 (ρ > .32), 89% ER+; 11 (ρ > .28), 82% ER+; 11 (ρ > .34), 64% ER+; 19 (ρ > .41), 22% ER+; 10 (ρ > .31), 2/3 ER+]
Can we assign subtype membership to samples from independent data sets?
Sørlie et al. 2003, method of centroids (class prediction rule):
– Define a centroid for each class on the original data set ("training set"): for each gene, average its expression over the samples assigned to that class
– For each sample of the independent data set ("test set", here the West data set), calculate Pearson's (centered) correlation of its gene expression with each centroid
– Classification rule: assign the sample to the class whose centroid has the highest correlation with the sample (if below .1, do not assign)
[Figure: the new sample is assigned to the class whose centroid has the highest correlation with it]

The breast cancer subtypes:
• Cited thousands of times
• Widely used in research papers and praised in editorials
• Recent concerns raised about their reproducibility and robustness
Predicted class membership: Sørlie subtypes on our data
• Tam113: tamoxifen-treated breast cancers, 113 ER+ / 0 ER-
• BRCA60: hereditary breast cancers (42 ER+ / 16 ER-)
• Loris: "I obtained the subtypes on our data! All the samples from Tam113 are Lum A; a bit strange... there are no Lum B in our data set."
• Lara: "Have you tried also on the BRCA60?"
• Loris: "No [...] Those are mostly Lum A, too. Some are Normal, very strange... there are no Basal among the ER-!"
• Lara: "[...] Have you mean-centered the genes?"
• Loris: "No [...] It looks better on BRCA60: now the ER- are mostly Basal... On Tam113 I get many Lum B... But 50% of the samples from Tam113 are NOT Luminal anymore!"
Something is wrong!
How are the systematic differences between microarray platforms/batches taken into account?
• Sørlie et al. 2003 data set: genes were mean- (and eventually median-) centered
"[…], the data file was adjusted for array batch differences as follows; on a gene-by-gene basis, we computed the mean of the nonmissing expression values separately in each batch. Then for each sample and each gene, we subtracted its batch mean for that gene. Hence, the adjusted array would have zero row-means within each batch. This ensures that any variance in a gene is not a result of a batch effect."
"Rows (genes) were median-centered and both genes and experiments were clustered by using an average hierarchical clustering algorithm."
• West et al. data set (Affymetrix, single-channel data): genes were "centered" (mean-centering)
"Data were transformed to a compatible format by normalizing to the median experiment […] Each absolute expression value in a given sample was converted to a ratio by dividing by its average expression value across all samples."
• van't Veer et al. data set: genes do not seem to have been mean-centered
• Other data sets where the method was applied: genes were always centered
Possible concerns on the application of the method of centroids
• How are the classification results influenced by...
– the normalization of the data (mean-centering of the genes)? (see the sketch below)
– differences in subtype prevalence across data sets?
– the presence of study (or batch) effects?
– the choice of the method of centroids as a classification method?
– the use of the arbitrary cut-off for non-classifiable samples?
(Lusa et al., Challenges in projecting clustering results across gene expression-profiling datasets, JNCI 2007)
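A sketch of the mean-centering concern (illustrative code, not from the paper): the gene-wise centering is computed from the test data set itself, so the correlation of a given sample with the centroids, and therefore its assigned subtype, depends on which other samples it happens to be normalized with:

```python
import numpy as np

def correlate_with_centroids(X_test, centroids, center_genes=True):
    """Pearson (centered) correlation of every test sample with every
    class centroid, optionally after gene-wise mean-centering of the
    TEST data set (as done before applying Sorlie's centroids)."""
    if center_genes:
        X_test = X_test - X_test.mean(axis=0)   # depends on the whole test set!
    Xc = X_test - X_test.mean(axis=1, keepdims=True)
    Cc = centroids - centroids.mean(axis=1, keepdims=True)
    r = (Xc @ Cc.T) / np.outer(np.linalg.norm(Xc, axis=1),
                               np.linalg.norm(Cc, axis=1))
    return r                                    # rows: samples, cols: classes

# Sample i is assigned to r[i].argmax(); re-running the function on a subset
# of X_test (e.g., only the ER+ samples) can change r and the assignments.
```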
Sotiriou's data set: ER (ligand-binding assay): 34 ER- / 65 ER+; 7650 clones (6878 unique)
1. Effects of mean-centering the genes

Sørlie's centroids (derived from the centered data set) were applied, with the method of centroids, to Sotiriou's data set (336/552 common and unique clones), either gene-centered (C) or not centered (N), on the full data set (99 samples), the ER+ subset (65 samples) and the ER- subset (34 samples).

Number of samples classified into each class (in parentheses: samples with ρ < .1) and, for the full data set, the number of ER+ samples among them:

Class       | Full data, centered  | Full data, not centered | ER+ subset, centered | ER+ subset, not centered
            | classified     ER+   | classified     ER+      | classified           | classified
Luminal A   | 43 (5)         41    | 59 (1)         55       | 19 (6)               | 55 (1)
Luminal B   | 13 (2)         11    | 1 (1)          1        | 13 (3)               | 1 (0)
ERBB2+      | 13 (2)         6     | 10 (0)         2        | 11 (1)               | 2 (0)
Basal       | 21 (0)         0     | 5 (0)          0        | 11 (5)               | 0 (0)
Normal      | 9 (0)          7     | 24 (2)         7        | 11 (1)               | 7 (0)
2. Effects of the prevalence of subgroups in the (training and) testing set

Training set: 10 ER+ / 10 ER-

Test set          | Predictive accuracy (ER+ / ER-)
55 ER+ / 24 ER-   | 95% / 79%
55 ER+ / 24 ER-   | 78% / 88%
24 ER+ / 24 ER-   | 88% / 83%
12 ER+ / 24 ER-   | 92% / 79%
55 ER+ / 0 ER-    | 53% / ND
0 ER+ / 24 ER-    | ND / 62%
2b. What is the role played by the prevalence of subgroups in the training and testing set?

ER status prediction on Sotiriou's data set, using the method of centroids on 751 variance-filtered unique clones, over multiple (100) random splits into a training and a testing set, with both centered (C) and non-centered (N) data:
• Training set: ω_tr = 1/2 (n_tr = 20), i.e. 10 ER+ / 10 ER-
• Testing set: 0 ≤ ω_test ≤ 1 (n_test = 24): 0 ER+/24 ER-, 1 ER+/23 ER-, …, 24 ER+/0 ER-
• ω: % of ER+ samples in the testing set

[Figure: % correctly classified in the ER+ class, % correctly classified in the ER- class, and % correctly classified overall, as functions of ω]
3. (Possible) study effect on real data

Predicted class membership for the van't Veer data set, using the ER-status classifier derived from Sotiriou's data, with centered and with non-centered genes (in parentheses: samples with ρ < .1; Cor: correlation with the assigned centroid, min-max):

van't Veer (centered):
                | True ER+  | True ER-  | Cor (min-max)
Predicted ER+   | 39 (1)    | 4 (2)     | .42 (.03-.62)
Predicted ER-   | 7 (4)     | 67 (4)    | .26 (.01-.55)

van't Veer (non-centered):
                | True ER+  | True ER-  | Cor (min-max)
Predicted ER+   | 43 (43)   | 8 (7)     | .02 (-.24-.13)
Predicted ER-   | 3 (3)     | 63 (53)   | -.03 (-.23-.16)

• The predictive accuracy is the same in the two analyses
• Most of the samples in the non-centered analysis would not be classifiable using the threshold
Conclusions I
• "Must"s for a clinically useful classifier
– It classifies a new sample unambiguously, independently of any other samples being considered for classification at the same time
– The clinical meaning of the subtype assignment (survival probability, probability of response to treatment) must be stable across the populations to which the classifier might be applied
– The technology used to assay the samples must be stable and reproducible: a sample assayed on different occasions must be assigned to the same subtype
• BUT we showed that the subgroup assignments of new samples can be substantially influenced by
– the normalization of the data (the appropriateness of gene-centering depends on the situation)
– the proportion of samples from each subtype in the test set
– the presence of systematic differences across data sets
– the use of arbitrary rules for identifying non-classifiable samples
• Most of our conclusions also apply to different classification methods
Conclusions II
• Most of the studies claiming to have validated the subtypes have focused only on comparing clinical outcome differences
– This shows consistency of results between studies
– BUT it does not provide a direct measure of the robustness of the classification, which is essential before using the subtypes in clinical practice
• Careful thought must be given to the comparability of patient populations and datasets
• Many difficulties remain in validating and extending class discovery results to new samples, and a robust classification rule remains elusive
The subtyping of breast cancer seems promising, BUT a standardized definition of the subtypes based on a robust measurement method is needed
Some useful resources and readings
• Books
– Simon et al. – Design and Analysis of DNA Microarray Investigations – Ch. 8
– Speed (Ed.) – Statistical Analysis of Gene Expression Microarray Data – Ch. 3
– Bishop – Pattern Recognition and Machine Learning
– Hastie, Tibshirani and Friedman – The Elements of Statistical Learning
– Duda, Hart and Stork – Pattern Classification
• Software for data analysis
– R and Bioconductor (www.r-project.org, www.bioconductor.org)
– BRB Array Tools (http://linus.nci.nih.gov)
• Web sites
– BRB/NCI web site (NIH)
– Tibshirani's web site (Stanford)
– Terry Speed's web site (Berkeley)