Wolfinger Russ - MCP Conference 2015

Reproducibility and Ranks of
True Positives in Large Scale
Genomics Experiments
Russ Wolfinger (1), Dmitri Zaykin (2), Lev Zhivotovsky (3),
Wendy Czika (1), Susan Shao (1)
(1) SAS Institute, Inc.; (2) National Institute of Environmental Health
Sciences; (3) Vavilov Institute of General Genetics
MCP Vienna
July 11, 2007
Criticism of Statistical Methods in
Genomics
1. Two labs run the same microarray experiment,
and resulting lists of significant genes barely
overlap.
2. Significant SNPs from a genetic study are not
validated in subsequent follow up studies.
Conclusions from scientific community:
Statistical results are not reproducible.
Genomics technology is not reliable.
“P vs FC” Controversy
• Occurred recently within the FDA-driven
Microarray Quality Control Consortium (MAQC)
• Biologists, chemists, regulators concerned with
lack of reproducibility of significant gene lists,
and have observed that lists based on fold
change (FC) are more consistent than those
based on p-values (P)
• Statisticians usually seek an optimal tradeoff
between specificity (Type 1) and sensitivity
(Type 2, power), often portrayed in a Receiver
Operating Characteristics (ROC) plot
Outline
1. Reproducibility versus specificity and
sensitivity
2. Rank distribution of a single true positive
3. P-value combination methods for
multiple true positives
All results are based on simulation.
Questions
• Should statisticians concern themselves
with reproducibility, the hallmark of
science? YES!
• How to define reproducibility?
• How does it relate to specificity and
sensitivity?
• Is it possible to dialectically reconcile
conflicting perspectives, or at least provide
an explanatory (and hence mollifying)
framework?
Simulation Study 1: Based on
MAQC Phase 1 Experiment
• Initially designed and implemented by Wendell
Jones, Expression Analysis Inc.
• Two treatment groups, n=5 in each
• 15,000 genes, 1000 truly changed with varying
degrees of expression that mimic real data
• Coefficient of variation (CV) on original data
scale set to varying percentages:
(2, 10, 30, 100)
Simulation Study 1 (continued)
• For the sake of simplicity, we focus only on gene
selection rules based on fold change (FC, same
as effect size) or simple t-test p-values
• Note that gene lists can be constructed in many
other ways; e.g. shrunken t-statistics
• Use Proportion of Overlapping Genes (POG) as
a measure of reproducibility, based on simple
Venn diagram
• Compute POG on simulated pairs of gene lists;
list sizes range from 10 to 15000
• Require direction of FC to match
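A minimal sketch of the POG calculation described above. It assumes POG is the number of genes shared by two equally sized lists divided by the list size, and that each list stores a signed fold change per gene; the function name `pog` and the toy gene ids are illustrative and not taken from the study.

```python
def pog(list_a, list_b):
    """Proportion of Overlapping Genes between two equally sized gene lists.

    Each list maps gene id -> signed fold change (e.g. a log2 ratio).
    A gene counts as overlapping only if it appears in both lists with
    the same direction of change.
    """
    if len(list_a) != len(list_b) or not list_a:
        raise ValueError("lists must be non-empty and of equal size")
    shared = 0
    for gene, fc_a in list_a.items():
        fc_b = list_b.get(gene)
        if fc_b is not None and (fc_a > 0) == (fc_b > 0):
            shared += 1
    return shared / len(list_a)


# Toy example: g1 and g3 agree, g2 flips sign, g4 and g7 are not shared.
lab1 = {"g1": 1.8, "g2": -2.1, "g3": 0.9, "g4": 1.2}
lab2 = {"g1": 2.0, "g2": 1.5, "g3": 1.1, "g7": -0.8}
print(pog(lab1, lab2))  # 0.5
```

Dropping the sign check would give the plain Venn-diagram overlap, which can only be equal to or larger than the direction-matched value.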
Simulated POG vs. Gene List Size
[Figure: two panels, FC Ranking and P-Value Ranking. Each plots POG
(0 to 1) against gene list size (axis ticks from 10 to 10,000), with
one curve per CV level: 2%, 10%, 30%, and 100%.]
Three Dimensions
CV=2%
[Figure: three-dimensional plots for FC Ranking and P-Value Ranking.
Axis labels: sensitivity, one_minus_specificity, pog (each 0 to 1),
and log10size (1 to 5).]
Discussion 1
• Reproducibility is not monotonically related to
specificity and sensitivity.
• There appear to be tradeoffs in all three
dimensions: specificity, sensitivity, and
reproducibility.
• The weight attached to each dimension
depends on the objectives of the study.
• Simple rules based on both FC and P-value
cutoffs appear viable as a starting
compromise.
• Challenge you to …
Enter the Third Dimension
Specificity - Sensitivity - Reproducibility
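One way to read the "simple rules based on both FC and P-value cutoffs" from Discussion 1 is as a two-stage filter: keep genes that pass both cutoffs, then order them by absolute fold change. The sketch below assumes that reading; the function name `select_genes`, the default cutoffs, and the toy data are hypothetical and not taken from the slides.

```python
def select_genes(results, p_cutoff=0.05, fc_cutoff=1.0, top=None):
    """Pick genes that pass both a p-value and a fold-change cutoff.

    results: list of (gene, log2_fold_change, p_value) tuples.
    Returns gene ids ordered by decreasing |log2 fold change|,
    optionally truncated to the first `top` entries.
    """
    kept = [r for r in results
            if r[2] <= p_cutoff and abs(r[1]) >= fc_cutoff]
    kept.sort(key=lambda r: abs(r[1]), reverse=True)
    return [gene for gene, _, _ in kept[:top]]


toy = [
    ("g1", 2.5, 0.001),    # passes both cutoffs
    ("g2", 0.3, 0.0001),   # tiny p-value but small effect
    ("g3", -3.0, 0.20),    # large effect but weak evidence
    ("g4", -1.4, 0.01),    # passes both cutoffs
]
print(select_genes(toy))  # ['g1', 'g4']
```

The cutoffs control how much weight each dimension gets: a loose p-value cutoff behaves like FC ranking, and a loose FC cutoff behaves like p-value ranking.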
Volcano Plots Help Visualize
Ranking Rules
[Figure: volcano plot of -log10(p) against diff (-4 to 4), with
curves labeled sqrt(d), |d|, and d*d.]
“Dormant” Volcano from Two-Sample
T-Test (df=4) on 10,000 Genes
Simulation Study 2A: Number of Best
T-Test Results Required to Cover a
Single True Positive
• Compare different ranking rules based on P,
FC, or functional combination
• Two treatment groups, n=100 in each
• 38,500 t-tests (4 df), only 1 truly changed
• Power for the one true positive set to (80, 90,
95, 99, and 80-Šidák) at alpha=5%
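The ranking rules compared in this study can be sketched as score functions of a p-value p and an effect size d. The sketch assumes the "functional combinations" are products of log(p) with a power of |d|, matching the column labels in the results table; the names `RULES` and `rank_of` and the toy data are illustrative only. Under this reading, the reported "number of best results required to cover the true positive with 95% probability" would correspond to the 95th percentile of the true positive's rank across simulated replicates.

```python
import math

# Each rule maps (p, d) to a score where smaller means "more significant".
# log(p) is negative, so multiplying it by a power of |d| pushes tests
# with large effects toward the top of the list.
RULES = {
    "p":              lambda p, d: p,
    "log(p)*|d|^0.5": lambda p, d: math.log(p) * math.sqrt(abs(d)),
    "log(p)*|d|":     lambda p, d: math.log(p) * abs(d),
    "log(p)*d^2":     lambda p, d: math.log(p) * d * d,
    "|d|":            lambda p, d: -abs(d),
}


def rank_of(index, tests, rule):
    """1-based rank of tests[index] under a rule (ties count against it)."""
    score = RULES[rule]
    target = score(*tests[index])
    return sum(1 for p, d in tests if score(p, d) <= target)


# Toy data: (p-value, effect size); pretend the first test is the true positive.
tests = [(0.001, 0.8), (0.0005, 0.2), (0.03, 2.5), (0.4, 0.1)]
for rule in RULES:
    print(rule, rank_of(0, tests, rule))
```

In this toy data only the square-root rule places the first test at rank 1; the other rules rank it second, behind either the smallest p-value or the largest effect.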
Simulation Study 2A Results
Number of best t-test (df=4) results out of 38,500 required to cover a
single true positive with 95% probability
Ranking by:
Power        p-value (p)   log(p) |d|^(1/2)   log(p) d^2   log(p) |d|   |d|
80% at 5%    7255          6727               6544         6410         6374
90% at 5%    2067          1868               1863         1937         2322
95% at 5%    467           422                455          531          856
99% at 5%    11            11                 16           26           101
80% at a*    1             1                  1            2            12
p: p-value; d: effect size; a*: 1-(1-0.05)^(1/38500)
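The threshold a* in the table footnote is the Šidák-adjusted per-test level, 1-(1-alpha)^(1/m). A short sketch of the arithmetic follows; the function name `sidak_alpha` is an assumption, and `expm1`/`log1p` are used only to keep precision when m is large.

```python
import math


def sidak_alpha(alpha, m):
    """Per-test level a* = 1 - (1 - alpha)**(1/m) for m tests."""
    # -expm1(log1p(-alpha) / m) equals 1 - (1 - alpha)**(1/m) but avoids
    # the cancellation that occurs when the result is very close to zero.
    return -math.expm1(math.log1p(-alpha) / m)


print(sidak_alpha(0.05, 38500))    # about 1.33e-06
print(sidak_alpha(0.05, 200000))   # about 2.56e-07
```

For small alpha this is close to the Bonferroni level alpha/m (0.05/38500 is about 1.30e-06).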
Simulation Study 2B: Number of Best
Chi-Square Test Results Required to
Cover a Single True Positive
• Again compare different ranking rules based on
p-value, effect size, or a functional combination
• Two binomial proportions, n=500 in each group
• 200,000 chi-square 1-df tests, only 1 true association
• Genetic allele frequency for true negatives simulated
to be uniform [0.05,0.95]
• Genetic allele frequency for true positive control group
set to 0.1 or 0.5. Frequency for case group set higher
to achieve power of (80, 90, 95, 99, and 80-Šidák) at
alpha=5%
Simulation Study 2B Results
Number of best chi-square (1 df) test results out of 200,000 required to
cover a single true positive with 95% probability
TP case frequency 0.1
Ranking by:
Power        p-value (p)   log(p) |d|^(1/2)   log(p) d^2   log(p) |d|   |d|
80% at 5%    38776         43559              46292        49332        58689
90% at 5%    12159         15075              16895        19675        27466
95% at 5%    2753          3764               4667         5900         10102
99% at 5%    55            101                157          261          869
80% at a*    1             1                  1            2            7
p: p-value; d: effect size; a*: 1-(1-0.05)^(1/200,000)
Simulation Study 2B Results
Number of best chi-square (1 df) test results out of 200,000 required to
cover a single true positive with 95% probability
TP case frequency 0.5
Ranking by:
Power        p-value (p)   log(p) |d|^(1/2)   log(p) d^2   log(p) |d|   |d|
80% at 5%    39940         35887              33784        31678        28451
90% at 5%    11107         9293               8451         7682         6685
95% at 5%    2962          2338               2078         1856         1582
99% at 5%    51            36                 31           27           23
80% at a*    1             1                  1            1            1
p: p-value; d: effect size; a*: 1-(1-0.05)^(1/200,000)
Discussion 2
• Incorporating effect size into ranking
rules can improve ranking performance,
particularly when variance of true
positives is comparatively larger than
variance of true negatives
• Possible Empirical Bayes effect
Simulation Study 3: Compare Power of
P-Value Combination Methods with
Multiple True Positives
• 5,000 Chi-Square (1 df) tests
• Number of true associations ranges from 10 to
200 with various powers
• Compare Šidák, Simes, Fisher Combination,
and three more modern methods:
– Gamma Method (GM)
– Truncated Product Method (TPM)
– Rank Truncated Product (RTP)
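For reference, the three classical global tests named above can be sketched in a few lines. This is a sketch under the usual textbook definitions, assuming independent p-values strictly between 0 and 1; the function names are illustrative. The Fisher tail probability uses the closed form of the chi-square upper tail for even degrees of freedom, so no external library is needed.

```python
import math


def sidak(pvals):
    """Global p-value from the smallest p-value: 1 - (1 - p_min)^m."""
    m = len(pvals)
    return -math.expm1(m * math.log1p(-min(pvals)))


def simes(pvals):
    """Simes global p-value: min over i of m * p_(i) / i."""
    m = len(pvals)
    return min(m * p / i for i, p in enumerate(sorted(pvals), start=1))


def fisher(pvals):
    """Fisher combination: -2 * sum(log p) is chi-square with 2m df under H0."""
    m = len(pvals)
    half = -sum(math.log(p) for p in pvals)  # half of the chi-square statistic
    if half == 0.0:
        return 1.0
    # Upper tail for even df 2m: exp(-h) * sum_{k<m} h^k / k!, summed via logs
    # so that large m does not underflow exp(-h) on its own.
    logs = [-half + k * math.log(half) - math.lgamma(k + 1) for k in range(m)]
    return min(1.0, sum(math.exp(t) for t in logs))


pvals = [0.01, 0.04]
print(sidak(pvals), simes(pvals), fisher(pvals))
```

Šidák looks only at the smallest p-value, Fisher pools all of them, and Simes sits in between by scanning the ordered p-values.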
Gamma Method (GM)
• Generalization of Fisher and Stouffer
• Sum inverse Gamma-transformed 1 - p_i
• Tune using Soft Truncation Threshold,
accommodates effect heterogeneity
Truncated Product Method (TPM)
• Combine only the subset of p-values less
than some threshold
• Assess significance by evaluating
product distribution via Monte Carlo on
uniforms.
• Upon rejecting the null, can claim true
positives are in the subset
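The TPM steps listed above can be sketched as follows, assuming independent tests so that null p-values can be simulated as independent uniforms. The names `tpm_pvalue`, `tau`, `n_sim`, and `seed` are illustrative, and the (hits + 1) / (n_sim + 1) estimate is one common Monte Carlo convention rather than something stated on the slide.

```python
import math
import random


def log_truncated_product(pvals, tau):
    """Log of the product of the p-values that are <= tau (0.0 if none are)."""
    return sum(math.log(p) for p in pvals if p <= tau)


def tpm_pvalue(pvals, tau=0.05, n_sim=2000, seed=0):
    """Monte Carlo p-value of the truncated product under uniform nulls."""
    rng = random.Random(seed)
    observed = log_truncated_product(pvals, tau)
    m = len(pvals)
    hits = 0
    for _ in range(n_sim):
        # 1 - random() lies in (0, 1], which avoids log(0).
        null = [1.0 - rng.random() for _ in range(m)]
        if log_truncated_product(null, tau) <= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)


# Five strong signals among twenty tests, versus no p-value below tau.
print(tpm_pvalue([1e-6] * 5 + [0.5] * 15))
print(tpm_pvalue([0.5] * 20))
```

When no p-value falls below the threshold the product is empty, so the sketch returns 1.0 and the global null is not rejected.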
Rank Truncated Product (RTP)
• Combine the K smallest p-values
• Assess significance by evaluating
product distribution with Monte Carlo
• K=1 same as Šidák, K=max same as
Fisher
• On rejecting the null, cannot claim true
positives are in the subset
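A matching sketch for RTP, again assuming independent tests and uniform null p-values. The names `rtp_pvalue`, `k`, `n_sim`, and `seed` are illustrative. The usage example checks the K=1 case against the Šidák formula, which should agree up to Monte Carlo error.

```python
import math
import random


def log_rank_product(pvals, k):
    """Log of the product of the k smallest p-values."""
    return sum(math.log(p) for p in sorted(pvals)[:k])


def rtp_pvalue(pvals, k, n_sim=20000, seed=0):
    """Monte Carlo p-value of the rank truncated product under uniform nulls."""
    rng = random.Random(seed)
    observed = log_rank_product(pvals, k)
    m = len(pvals)
    hits = 0
    for _ in range(n_sim):
        # 1 - random() lies in (0, 1], which avoids log(0).
        null = [1.0 - rng.random() for _ in range(m)]
        if log_rank_product(null, k) <= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)


pvals = [0.01, 0.5, 0.6, 0.7]
sidak = 1.0 - (1.0 - min(pvals)) ** len(pvals)
print(rtp_pvalue(pvals, k=1), sidak)  # the two should be close
```

Unlike TPM, the K smallest p-values always enter the product whether or not they are small, which is consistent with the slide's point that rejecting the null does not localize the true positives to that subset.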
Simulation Study 3 Results
Power of different p-value combination methods
from 5,000 chi-square (1 df) tests
#TA   TA Power   Šidák   Simes   Fisher   GM 0.05   GM 0.1   TPM 0.05   TPM 0.01   TPM 0.005
10    0.90       0.899   0.756   0.225    0.791     0.650    0.279      0.455      0.550
50    0.50       0.498   0.351   0.525    0.799     0.789    0.595      0.650      0.656
50    0.60       0.592   0.553   0.693    0.961     0.950    0.788      0.876      0.888
100   0.30       0.297   0.181   0.598    0.644     0.697    0.595      0.543      0.495
100   0.40       0.401   0.339   0.831    0.926     0.944    0.861      0.853      0.825
200   0.20       0.202   0.143   0.756    0.653     0.746    0.696      0.563      0.490
200   0.25       0.255   0.216   0.920    0.883     0.936    0.895      0.814      0.742
200   0.30       0.297   0.300   0.981    0.978     0.992    0.980      0.949      0.915

#TA   TA Power   TPM 0.001   RTP 10   RTP 50   RTP 100   RTP 200
10    0.90       0.752       0.879    0.814    0.739     0.625
50    0.50       0.601       0.636    0.751    0.769     0.764
50    0.60       0.864       0.875    0.947    0.951     0.942
100   0.30       0.378       0.377    0.544    0.607     0.649
100   0.40       0.715       0.703    0.874    0.907     0.926
200   0.20       0.332       0.314    0.511    0.605     0.682
200   0.25       0.545       0.509    0.765    0.847     0.904
200   0.30       0.764       0.715    0.932    0.967     0.984
#TA: number of true associations (TA)
Discussion 3
• Gamma Method competitive as a global
test
• Truncated Product Method enables more
specific inference.