Accounting for Multiple Testing

Multiple Testing
Matthew Kowgier
Multiple Testing
• In statistics, the multiple comparisons/testing
problem occurs when one considers a set of
statistical inferences simultaneously.
– The chance of an error in inference grows with
the number of tests performed
– Some hypothesis tests will incorrectly reject the
null hypothesis by chance alone
What a P-value isn’t
• P-value is NOT the probability of H0 given the
data
• P-value takes no account of the power of the
study
– Power is the probability of rejecting H0 when it is
actually false
What a P-value IS
• “Informal measure of the compatibility of the
data with the null hypothesis”
– Jewell 2004
• If we repeated our experiment over and over
again, each time taking a random sample of
observable units (people), what proportion of
the time could we expect to observe a result
(test statistic) at least as extreme, by chance
alone?
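This repeated-sampling idea can be made concrete by simulation. Below is a minimal Python sketch, assuming a one-sample t statistic, normally distributed data, and a two-sided alternative; all names and parameter values are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical setup: H0 says the population mean is 0.
    n = 30
    observed = rng.normal(loc=0.5, scale=1.0, size=n)  # the sample we actually saw
    t_obs = observed.mean() / (observed.std(ddof=1) / np.sqrt(n))

    # Repeat the experiment many times with H0 true (mean really 0) and count
    # how often the test statistic is at least as extreme as the one observed.
    reps = 100_000
    sims = rng.normal(loc=0.0, scale=1.0, size=(reps, n))
    t_sim = sims.mean(axis=1) / (sims.std(axis=1, ddof=1) / np.sqrt(n))
    p_empirical = np.mean(np.abs(t_sim) >= abs(t_obs))  # two-sided
    print(f"empirical p-value ~ {p_empirical:.4f}")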
Type I Error
• “False positive”: the error of rejecting a null
hypothesis when it is actually true.
• The error of accepting an alternative
hypothesis (the real hypothesis of interest)
when the results can be attributed to chance.
• Occurs when we observe a difference when in
truth there is none.
– e.g., A court finding a person guilty of a crime that they did not
actually commit.
• Convention is to set the Type I error rate (alpha) to
0.05 or 0.01
– if H0 is true, there is only a 5 in 100 (or 1 in 100)
chance that we will reject it by mistake.
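A quick simulation shows the Type I error rate in action. This is a sketch assuming a two-sample t-test on normal data (scipy required); the sample sizes and replicate counts are arbitrary:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # Both groups are drawn from the SAME distribution, so H0 is true
    # and every rejection at alpha = 0.05 is a false positive.
    reps, n, alpha = 10_000, 50, 0.05
    false_positives = sum(
        stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue < alpha
        for _ in range(reps)
    )
    print(f"false positive rate ~ {false_positives / reps:.3f}")  # close to 0.05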
Type II Error
• “False negative”: the error of failing to reject a
null hypothesis when the alternative
hypothesis is true.
• The error of failing to observe a difference
when in truth there is one.
– e.g., A court finding a person not guilty of a crime
that they did actually commit.
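The same simulation idea illustrates Type II error: make H0 false and count how often the test misses the real difference. Again a hedged Python sketch; the effect size of 0.3 and sample size of 50 are arbitrary choices:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)

    # A real difference of 0.3 exists, so H0 is false; failing to reject
    # is a Type II error, and its frequency estimates beta.
    reps, n, alpha, true_diff = 10_000, 50, 0.05, 0.3
    misses = sum(
        stats.ttest_ind(rng.normal(0, 1, n),
                        rng.normal(true_diff, 1, n)).pvalue >= alpha
        for _ in range(reps)
    )
    beta = misses / reps
    print(f"Type II error rate ~ {beta:.2f}, power ~ {1 - beta:.2f}")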
                          Actual Condition
Test Result               Affected                          Not Affected
Shows “infected”          True Positive                     False Positive (Type I Error)
Shows “not infected”      False Negative (Type II Error)    True Negative
How Stringent a P-value?
• P < 0.05
– By chance alone, under the null hypothesis we will
observe a positive result (a false positive) in 5% of
our tests
– The expected number of false positives therefore
grows with the number of tests:
– 5/100 tests
– 50/1,000 tests
– 500/10,000 tests
– 5,000/100,000 tests
– 50,000/1,000,000 tests
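These fractions are simply alpha times the number of tests; a short Python check reproduces them:

    # Expected false positives if every null hypothesis is true
    alpha = 0.05
    for m in (100, 1_000, 10_000, 100_000, 1_000_000):
        print(f"{m:>9,} tests -> ~{int(alpha * m):,} false positives")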
Genome Wide Association
• 12,000, 550,000, 1,000,000 SNPs
• Multiple diseases add tests
• Stratifying by sex, ethnicity, smoking status, etc.
adds tests (and reduces power by effectively
reducing the sample size)
• Need to rethink our critical P-value
Not Accounting for Multiple Tests
• Invalid statistical conclusions
• Confidence intervals that fail to contain the
population parameter more often than their
nominal coverage suggests
• Incorrect rejections of H0 (false positives)
Implications
• Clinical Trial
– May result in approval of a drug as an improvement
over existing drugs, when it is in fact equivalent to
the existing drugs.
– It could also happen by chance that the new drug
appears worse for some side-effect, when it is
actually not worse for this side-effect.
Accounting for Multiple Testing
• Make standards for each comparison more
stringent than for a single test
• Bonferroni correction
– Adjust the allowable Type I error by dividing alpha
by the number of tests
– E.g., 20 tests: the p-value cut-off becomes
0.05/20 = 0.0025
– E.g., 500,000 tests: the p-value cut-off becomes
0.05/500,000 = 0.0000001
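A minimal sketch of the correction in Python; the function name and the example p-values are invented for illustration:

    import numpy as np

    def bonferroni(pvals, alpha=0.05):
        # Reject H0 only where p < alpha / m, with m the number of tests.
        pvals = np.asarray(pvals)
        cutoff = alpha / len(pvals)
        return pvals < cutoff, cutoff

    # Hypothetical p-values from 20 tests
    pvals = [0.001, 0.004, 0.03, 0.04] + [0.2] * 16
    reject, cutoff = bonferroni(pvals)
    print(f"cut-off = {cutoff}")                  # 0.05/20 = 0.0025
    print(f"rejected {reject.sum()} of {len(pvals)} tests")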
Accounting for Multiple Testing
• Bonferroni is thought to be too stringent,
particularly for GWAS
• False Discovery Rate (FDR)
– Instead of controlling the chance of any false
positives (as Bonferroni does), FDR controls the
expected proportion of false positives
– A FDR threshold is determined from the observed
p-value distribution, and hence is adaptive to the
amount of signal in your data.
FDR
• A q-value replaces the p-value: it is the smallest
FDR at which the test would be called significant
• http://faculty.washington.edu/jstorey/qvalue/
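For reference, the Benjamini–Hochberg step-up procedure is one common way to control the FDR (the q-value package linked above implements a related, more adaptive estimator). A hedged Python sketch with invented p-values:

    import numpy as np

    def benjamini_hochberg(pvals, fdr=0.05):
        # Sort p-values, find the largest k with p_(k) <= (k/m) * fdr,
        # and reject the k hypotheses with the smallest p-values.
        pvals = np.asarray(pvals)
        m = len(pvals)
        order = np.argsort(pvals)
        below = pvals[order] <= (np.arange(1, m + 1) / m) * fdr
        reject = np.zeros(m, dtype=bool)
        if below.any():
            k = np.nonzero(below)[0].max()
            reject[order[:k + 1]] = True
        return reject

    # Hypothetical mix: 10 true signals among 1,000 tests
    rng = np.random.default_rng(3)
    pvals = np.concatenate([rng.uniform(0, 0.001, 10), rng.uniform(0, 1, 990)])
    print(f"BH rejections at FDR 0.05: {benjamini_hochberg(pvals).sum()}")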