Transcript Slide 1

Alexander Statnikov
Discovery Systems Laboratory
Department of Biomedical Informatics
Vanderbilt University
10/3/2007
1
Project history
 Joint project with Chun Li and Constantin Aliferis
 Cancer Research 2005 paper by Hu et al.: “Genome-Wide
Association Study in Esophageal Cancer Using GeneChip
Mapping 10K Array”
 Reported near-perfect classification of cancer patients & healthy controls
on the basis of only SNP data from a case-control GWA study.
 This finding suggests that esophageal cancer is a solely genetic disease…
 Initial idea of Chun Li
 At DSL we had independently obtained the GWA dataset before
Chun and Constantin initiated this project
2
Background
 SNPs make up >90% of all human genetic variation
and have been extensively studied for functional
relationships between phenotype & genotype.
 Modern high-throughput genotyping technologies
allow fast evaluation of SNPs on a genome-wide scale
at a relatively low cost.
 During the last 2 years, several studies have reported
success in using SNP genotyping assays in GWA
studies in cancer. Probably the strongest result is
reported in the study by Hu et al.
3
Claims of Hu et al.
 “Using the generalized linear model (GLM) with
adjustment for potential confounders and multiple
comparisons, we identified 37 SNPs associated with
disease.”
 “When the 37 SNPs identified from the GLM recessive
mode were used in a principal components analysis, the
first principal component correctly predicted 46 of 50
cases and 47 of 50 controls.” […] “The permutation tests
indicated that our PCA classification can be
generalized.”
4
5
Study dataset & its preparation
 Study dataset:
 50 esophageal squamous cell carcinoma patients
 50 healthy controls (matched by age, sex, place of residence)
 10k Affymetrix SNP arrays with 11,555 SNPs
 Additional variables:
 Age
 Tobacco use
 Alcohol consumption
 Family history
 Consumption of pickled vegetables
 Removed ~1.5k SNPs to minimize genotyping errors
 Implemented recessive A encoding
 Imputed missing genotypes
6
SNP selection:
Original method of Hu et al.
(denoted as GLM1)
 Fit a GLM model using data for all 100 subjects:
Probability(Cancer) = 1 / (1 + exp(-f)), where
f = a + b ∙ SNP + c ∙family history + d ∙alcohol consumption
 Obtain deviances:
 D1 - deviance of the above fitted model
 D0 - deviance of the null model (without predictor variables)
 From the χ² distribution, compute a p-value for the test statistic
D0-D1 with 3 degrees of freedom
 Perform Bonferroni correction at 0.05 alpha level
7
SNP selection:
Unbiased GLM-based method
(denoted as GLM2)
 Fit a GLM model using data for all 100 subjects:
Probability(Cancer) = 1 / (1 + exp(-f)), where
f = a + b ∙ SNP + c ∙family history + d ∙alcohol consumption
 Obtain deviances:
 D1 - deviance of the above fitted model
 D0΄- deviance of the model with family history and alcohol
consumption
 From the χ² distribution, compute a p-value for the test statistic
D0΄-D1 with 1 degree of freedom
 Perform Bonferroni correction at 0.05 alpha level
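A minimal sketch of how the two tests differ for a single SNP, using only the Python standard library. The deviances D0΄ and D1 and the SNP count of 10,000 are hypothetical values chosen for illustration, not figures from the study; D0 is the null deviance implied by 50 cases and 50 controls. The closed-form χ² tail probabilities hold only for 1 and 3 degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (closed forms for df = 1 and df = 3)."""
    if x <= 0:
        return 1.0
    tail = math.erfc(math.sqrt(x / 2.0))
    if df == 1:
        return tail
    if df == 3:
        return tail + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)
    raise ValueError("only df = 1 and df = 3 are needed here")

# Null deviance of a logistic model with 50 cases and 50 controls: -2 * 100 * ln(0.5)
D0 = -2 * 100 * math.log(0.5)
D0_prime = 110.0  # hypothetical: model with family history and alcohol consumption only
D1 = 109.5        # hypothetical: same model plus one SNP

n_snps = 10_000             # assumed number of SNPs tested
threshold = 0.05 / n_snps   # Bonferroni-adjusted alpha level

p_glm1 = chi2_sf(D0 - D1, df=3)        # GLM1: SNP and covariates tested jointly
p_glm2 = chi2_sf(D0_prime - D1, df=1)  # GLM2: SNP tested given the covariates

print(f"GLM1: p = {p_glm1:.2e}, significant: {p_glm1 < threshold}")
print(f"GLM2: p = {p_glm2:.2e}, significant: {p_glm2 < threshold}")
```

With these made-up deviances the SNP adds almost nothing to the fit, yet GLM1 passes the Bonferroni threshold because the covariates alone account for most of D0-D1.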
8
Recap of SNP selection methods
Method             | GLM1 (Hu et al.)                         | GLM2 (Current study)
D1                 | SNP, family history, alcohol consumption | SNP, family history, alcohol consumption
D0                 | Null                                     | family history, alcohol consumption
Degrees of freedom | 3                                        | 1
9
Classification:
Original method of Hu et al.
 Perform principal component analysis (PCA) on selected
SNPs using all 100 subjects in the dataset.
 Extract the first principal component (PC1).
 Use the following rule to classify each of the same 100
subjects as used for the PCA:
If PC1 > 0, classify as control, otherwise classify as case
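A rough sketch of the PC1 rule on a toy genotype matrix, assuming mean-centred data and a power-iteration estimate of the first principal component (the slides do not say how the PCA was computed). The function names and the 6 x 4 matrix are hypothetical. Note that the sign of a principal component is arbitrary, so any implementation of this rule has to fix an orientation; the toy data are arranged so that controls land on the positive side.

```python
import math

def first_pc_scores(rows, iters=200):
    """Project mean-centred rows onto the first principal component (power iteration)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    v = [1.0 + 0.1 * j for j in range(p)]  # arbitrary non-zero starting direction
    for _ in range(iters):
        proj = [sum(c[j] * v[j] for j in range(p)) for c in centred]
        w = [sum(centred[i][j] * proj[i] for i in range(n)) for j in range(p)]  # X^T X v
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(p)) for c in centred]

def classify(pc1_scores):
    """Rule from the slide: PC1 > 0 means control, otherwise case."""
    return ["control" if s > 0 else "case" for s in pc1_scores]

# Toy recessive-coded genotypes (0/1) for 6 subjects and 4 SNPs
genotypes = [
    [1, 1, 0, 1], [1, 1, 1, 1], [1, 0, 1, 1],  # toy controls
    [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0],  # toy cases
]
predicted = classify(first_pc_scores(genotypes))
print(predicted)
```

The same subjects are used both to compute PC1 and to be classified, which is exactly the train-and-test-on-the-same-data setup of the original method.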
10
Evaluation of classification
performance
 Hu et al. used the proportion of correct classifications; their
classifier is trained and tested on the same dataset
 We employ the area under the ROC curve (AUC) as the performance
metric and a repeated 10-fold cross-validation scheme
[Figure: repeated 10-fold cross-validation on the SNP dataset (100 subjects). Each repetition yields ten per-fold AUC values whose average is that repetition's estimate (e.g., 0.83, 0.81, …, 0.79).]
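A small sketch of the two ingredients of this evaluation: a rank-based AUC and the averaging of per-fold AUCs within one repetition. The `auc` helper is an illustrative implementation, not code from the study; the ten fold values are those shown for the first repetition on the slide.

```python
from statistics import mean

def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count 1/2)."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Per-fold AUCs of the first repetition, as read off the slide
fold_aucs = [0.9, 0.8, 0.9, 0.6, 0.9, 0.8, 0.7, 0.8, 0.9, 1.0]
run_auc = mean(fold_aucs)
print(round(run_auc, 2))  # 0.83
```

Repeating the 10-fold split with different random partitions and averaging the per-repetition estimates reduces the dependence of the result on any single split.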
11
Reproducing findings of Hu et al.
 Using the GLM1 method, Hu et al. reported 37 significant
SNPs; we found 226!
 Apparently, they used an extra filtering step that was not
reported in the paper (personal comm. with their PI).
 Nevertheless, the application of
PCA-based classifier (as in Hu et al.)
to GLM1 significant SNPs resulted
in 0.93 proportion of correct
classifications and 0.98 AUC.
 Major findings are reproduced
using methods of Hu et al.
12
Bias in SNP selection method
GLM1 of Hu et al.
 Calculation of p-values in GLM1 does not reflect
significance of the SNP, but the significance of 3
variables combined (SNP, family history, and alcohol
consumption)
 Family history & alcohol consumption are strong risk
factors ⇒ p-value is biased towards 0.
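One way to see the bias is to split the GLM1 test statistic into two parts, writing D0΄ for the deviance of the model with family history and alcohol consumption only (the baseline used by GLM2). This is a sketch of the reasoning, assuming all three models are nested and fitted by maximum likelihood, so that both terms are non-negative:

```latex
D_0 - D_1
  \;=\; \underbrace{(D_0 - D_0')}_{\text{covariates only; identical for every SNP}}
  \;+\; \underbrace{(D_0' - D_1)}_{\text{SNP contribution; the GLM2 statistic}},
\qquad D_0 - D_0' \ge 0,\quad D_0' - D_1 \ge 0 .
```

Every SNP therefore starts with the same head start D0 - D0΄ that has nothing to do with its genotype, and a SNP with only a small chance contribution can already cross the significance threshold.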
13
Bias in SNP selection method
GLM1 by Hu et al.
 The distribution of SNP p-values for method GLM1 is not
uniform: most p-values are < 10^-3
 On the contrary, GLM2 reflects significance of SNPs and does
not suffer from the above bias:
 Its distribution of SNP p-values is uniform
 It returns no SNPs significant at the Bonferroni adjusted
alpha-level
[Figure: SNP p-value plot with the Bonferroni adjusted α-level marked]
14
Empirical demonstration of bias in
SNP selection method
 Main idea: Create a null distribution where SNPs are
completely unrelated to the response variable and see
how frequently methods GLM1 and GLM2 find
statistically significant SNPs.
Repeat 1,000 times:
1. Permute all subjects in the SNP data while leaving the
response variable, family history of esophageal cancer,
and alcohol consumption intact.
2. Apply GLM1 and GLM2 to the permuted SNP data.
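The permutation loop can be sketched as follows. The `select` callback is a hypothetical stand-in for GLM1 or GLM2 (any function that returns the SNPs it declares significant); only the shuffling logic is the point here. Rows of the SNP matrix are reassigned to subjects at random, while the outcome and covariate vectors keep their original order.

```python
import random

def permute_snp_rows(snps, rng):
    """Reassign whole SNP profiles to subjects at random; the input list is not modified."""
    order = list(range(len(snps)))
    rng.shuffle(order)
    return [snps[i] for i in order]

def count_runs_with_hits(snps, outcome, covariates, select, n_perm=1000, seed=0):
    """Number of permuted datasets in which `select` reports at least one significant SNP."""
    rng = random.Random(seed)
    runs = 0
    for _ in range(n_perm):
        permuted = permute_snp_rows(snps, rng)
        if select(permuted, outcome, covariates):  # non-empty list of hits
            runs += 1
    return runs

# Toy usage with a hypothetical selector that never finds anything
snps = [[0, 1], [1, 1], [1, 0], [0, 0]]
outcome = [1, 1, 0, 0]
covariates = [[1, 0], [0, 1], [0, 0], [1, 1]]
print(count_runs_with_hits(snps, outcome, covariates, lambda s, o, c: [], n_perm=20))  # 0
```

Because whole rows are moved, the correlation structure among SNPs is preserved; only the link between genotype and the outcome and covariates is broken.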
15
Results of permutation experiments
 GLM1 found significant SNPs in all 1000 permutations!
The number of significant SNPs found in a permuted
dataset ranges from 185 to 1,938 (357 on average).
 GLM2 found significant SNPs in only 48/1000
permutations. The number of significant SNPs found in
a permuted dataset ranges from 1 to 3.
 GLM1 is biased, while GLM2 is not.
16
Bias in the classification
performance estimate of Hu et al.
 All data-analysis methods of Hu et al. use data for all
subjects. Neither cross-validation nor independent
sample validation was performed.
 We repeated their data-analysis (GLM1+PCA) embedded
in the repeated 10-fold cross-validation design. The
resulting performance is only 0.68 AUC (versus 0.98
AUC).
 0.30 AUC bias (overestimation) in the reported results
17
Empirical demonstration of
performance estimation bias
 Main idea: Create a null distribution where SNPs are
completely unrelated to the response variable (i.e.
AUC=0.5), apply GLM1+PCA methodology and record
resulting performance estimates.
Repeat 1,000 times:
1. Permute all subjects in the SNP data while leaving the response
variable, family history of esophageal cancer, and alcohol
consumption intact.
2. Apply GLM1 to the permuted SNP data.
3. Build and apply classifier using PCA.
4. Estimate classification performance (AUC).
18
Results of permutation experiments
 Classification performance of GLM1+PCA; both
methods applied as in Hu et al. to all data (no cross-validation): 0.99 AUC
 Classification performance of GLM1+PCA; GLM1 applied
to all data, PCA applied by cross-validation (incomplete
cross-validation): 0.98 AUC
 Classification performance by GLM1+PCA applied by
cross-validation: 0.50 AUC
 0.48-0.49 AUC bias (overestimation) under the null
19
20
Classification:
Support Vector Machines (SVMs)
 Supervised baseline technique for many types of
high-throughput data (microarray, proteomics, etc.).
 Trained and applied by cross-validation
[Figure: two scatter plots of subjects on SNP 1 and SNP 2 axes, with cases and controls marked and unlabeled subjects shown as “?”]
21
SNP selection for fitting SVMs:
Recursive Feature Elimination
 Among the best performing techniques for the analysis of
microarray gene expression data
 Applied only to a training set during cross-validation
[Figure: recursive feature elimination. An SVM model is fit on 10,000 SNPs and a performance estimate is recorded; the 5,000 SNPs important for classification are kept and the 5,000 not important for classification are discarded. The step repeats on the remaining 5,000 SNPs (2,500 kept, 2,500 discarded), and so on.]
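A compact sketch of the halving scheme, under several assumptions that are not in the slides: the linear SVM is fitted by plain full-batch subgradient descent on the regularised hinge loss (a simplified stand-in for a real SVM solver), features are ranked by squared weight, and the toy data have 2 informative and 14 noise features. In the study's setting this whole procedure would run on the training portion of each cross-validation split only.

```python
import random

def fit_linear_svm(X, y, feats, epochs=100, lr=0.1, lam=0.1):
    """Linear SVM weights via subgradient descent on the regularised hinge loss."""
    w = {j: 0.0 for j in feats}
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = {j: lam * w[j] for j in feats}
        grad_b = 0.0
        for xi, yi in zip(X, y):
            t = 1 if yi == 1 else -1
            if t * (sum(w[j] * xi[j] for j in feats) + b) < 1:  # margin violated
                for j in feats:
                    grad_w[j] -= t * xi[j] / n
                grad_b -= t / n
        for j in feats:
            w[j] -= lr * grad_w[j]
        b -= lr * grad_b
    return w

def rfe(X, y, stop_at):
    """Refit, rank features by squared weight, discard the lower half; repeat."""
    feats = list(range(len(X[0])))
    path = [feats]
    while len(feats) > stop_at:
        w = fit_linear_svm(X, y, feats)
        keep = max(stop_at, len(feats) // 2)
        feats = sorted(feats, key=lambda j: w[j] ** 2, reverse=True)[:keep]
        path.append(feats)
    return path

# Toy data: features 0 and 1 separate the classes, features 2..15 are noise
rng = random.Random(0)
y = [1] * 20 + [0] * 20
X = [[rng.gauss(2 if t else -2, 1) for _ in range(2)] + [rng.gauss(0, 1) for _ in range(14)]
     for t in y]

path = rfe(X, y, stop_at=2)
print([len(subset) for subset in path])  # [16, 8, 4, 2]
```

Each subset in `path` would get its own performance estimate, and the smallest subset whose performance is still acceptable is the one kept.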
22
Classification results:
repeated 10-fold cross-valid. estimates
“+” denotes building of classifier by ensembling technique
23
24
Feedback on our analysis
from Hu et al.
1. Concerning bias in SNP selection:
 “If we use p-values to rank the SNPs, the two methods
[GLM1 and GLM2] will give the same order.”
 Our comment:
 Ranking of SNPs is irrelevant because the method of Hu
et al. (GLM1) as described and used in their paper is the
method for selection (and not ranking) of SNPs.
25
Feedback on our analysis
from Hu et al.
2. Concerning bias in estimation of classifier performance:
 “It was not our purpose to develop a classifier in this initial
pilot effort.”
 “…we made these calculations as a frame of reference only.”
 The authors presented results of their “cross-validation
effort”. SNPs were selected by GLM1 on all 100 subjects and the
classifier was trained and tested by cross-validation (2/3 of the data
used for training and 1/3 for testing). This cross-validation
procedure was repeated 1,000 times with different splits
into training and testing sets.
26
Feedback on our analysis
from Hu et al.
 The authors obtain the following
histogram of classification
performance estimates
 Our comment:
 These results are expected because
their SNP selection procedure
utilizes both training and testing
data. This is “incomplete cross-validation”,
which has been shown to cause
biased performance estimation of
the classifier.
[Figure: histogram of the proportion of correct classifications]
27
Publications
 Statnikov A, Li C, Aliferis CF (2007) “Effects of
Environment, Genetics and Data Analysis Pitfalls in an
Esophageal Cancer Genome-Wide Association Study.”
PLoS ONE 2(9): e958.
 Statnikov A, Li C, Aliferis CF (2007) “A statistical
reappraisal of the findings of an esophageal cancer
genome-wide association study.” Cancer Research,
(accepted).
28
Conclusions
 Data-analysis pitfalls in Hu et al. led researchers to (1) identify
non-statistically significant SNPs and (2) derive biased
estimates of classification performance.
 Environmental factors and family history have modest
association with the disease, while SNPs do not appear to be
associated.
 It is crucially important to have sound statistical analysis in
genome-wide association studies.
 The amount of work involved in demonstrating errors (even
obvious ones), correcting the analysis, communicating with the
authors, and publishing the rebuttal is significantly greater
than publishing the original paper!
29