The P Value is Dead
Planning, Performing, and Publishing Research with Confidence Limits
A tutorial lecture given at the annual meeting of the American College of Sports Medicine, Seattle, June 4, 1999.
© Will G Hopkins
Physiology and Physical Education
University of Otago
Dunedin NZ
[email protected]
Outline
Definitions and Mis/interpretations
Planning
Sample size
Performing
Sample size "on the fly"
Publishing
Methods, Results, Discussion
Meta-analysis
Publishing non-significant outcomes
Conclusions
Dis/advantages
Definitions and Mis/interpretations
Confidence limits: Definitions
"Margin of error"
Example: Survey of 1000 voters
Democrats 43%, Republicans 33%
Margin of error is ± 3% (for a result of 50%...)
Likely range of true value
"Likely" is usually 95%.
"True value" = population value
= value if you studied the entire population.
Example: Survey of 1000 voters
Democrats 43% (likely range 40 to 46%)
Democrats - Republicans 10% (likely range 5 to 15%)
Example: in a study of 64 subjects, the correlation between
height and weight was 0.68 (likely range 0.52 to 0.79).
[Figure: the observed value of the correlation with its lower and upper confidence limits, plotted on a correlation-coefficient axis from 0.00 to 1.]
Confidence interval: difference between the upper
and lower confidence limits.
Amazing facts about confidence intervals
(for normally distributed statistics)
To halve the interval, you have to quadruple sample size.
A 99% interval is 1.3 times as wide as a 95% interval.
You need 1.7 times the sample size for the same width.
A 90% interval is 0.8 of the width of a 95% interval.
You need 0.7 times the sample size for the same width.
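These ratios follow directly from the normal quantiles. A quick check in Python (my own sketch, not part of the original lecture), assuming a normally distributed statistic whose interval half-width is a z value times a standard error that shrinks as 1/sqrt(n):

    # Check of the interval-width ratios quoted above.
    from scipy.stats import norm

    z95 = norm.ppf(0.975)    # 1.96
    z99 = norm.ppf(0.995)    # 2.576
    z90 = norm.ppf(0.950)    # 1.645

    print(z99 / z95)         # ~1.31: a 99% interval is ~1.3 times as wide as a 95%
    print((z99 / z95) ** 2)  # ~1.73: sample size needed for the same width
    print(z90 / z95)         # ~0.84: a 90% interval is ~0.8 of the width of a 95%
    print((z90 / z95) ** 2)  # ~0.70: sample size needed for the same width

The halving-needs-quadrupling rule is the same fact in another guise: the width scales as 1/sqrt(n).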
How to Derive Confidence Limits
Find a function(true value, observed value, data) with a
known probability distribution.
Calculate a critical value, such that for 2.5% of the time,
function(true value, observed value, data) < critical value.
[Figure: the probability distribution of the function (e.g. chi-squared, for the function (n-1)s²/σ²), with a shaded area of 0.025 beyond the critical value.]
Rearranging, for 2.5% of the time,
true value > function'(observed value, data, critical value)
= upper confidence limit
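As an illustration of this recipe (my own sketch with a made-up sample, not the lecture's code): for a standard deviation, the function (n-1)s²/σ² has a chi-squared distribution with n-1 degrees of freedom, so rearranging the 2.5% and 97.5% critical values gives the 95% confidence limits for the true SD.

    import numpy as np
    from scipy.stats import chi2

    data = np.array([4.9, 5.3, 5.1, 4.7, 5.6, 5.0, 5.2, 4.8])  # hypothetical sample
    n = len(data)
    s2 = data.var(ddof=1)    # sample variance

    # rearrange (n-1)s2/sigma2 ~ chi-squared(n-1) at the 2.5% and 97.5% points
    lower_sd = np.sqrt((n - 1) * s2 / chi2.ppf(0.975, n - 1))
    upper_sd = np.sqrt((n - 1) * s2 / chi2.ppf(0.025, n - 1))
    print(np.sqrt(s2), lower_sd, upper_sd)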
Mis/interpretation of confidence limits
Hard to misinterpret confidence limits for simple
proportions and correlation coefficients.
Easier to misinterpret changes in means.
Example: The change in blood volume in a study
was 0.52 L (likely range 0.12 to 0.92 L).
For 95% of subjects, the change was/would be between
0.12 and 0.92 L.
The average change in the population would be between
0.12 and 0.92 L.
The change for the average subject would be between
0.12 and 0.92 L.
There may be individual differences in the change.
P value: Definition
The probability of a more extreme absolute value
than the observed value if the true value was zero
or null.
Example: 20 subjects, correlation = 0.25, p = 0.29.
[Figure: the distribution of correlations for no effect and n = 20, on a correlation-coefficient axis from -0.5 to 0.5; the area in the tails beyond the observed effect (r = 0.25) is the p value of 0.29.]
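This p value can be reproduced with the usual t statistic for a correlation; the code below is my own sketch, not part of the talk.

    import math
    from scipy.stats import t

    r, n = 0.25, 20
    t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
    p = 2 * t.sf(abs(t_stat), df=n - 2)   # two-sided p value
    print(round(p, 2))                    # ~0.29, as in the example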
"Statistically Significant": Definitions
P < 0.05
Zero lies outside the confidence interval.
Examples: four correlations for samples of size 20.
[Figure: the four correlations and their likely ranges plotted on a correlation-coefficient axis from -0.50 to 1.]

    r       likely range      P
    0.70    0.37 to 0.87      0.007
    0.44    0.00 to 0.74      0.05
    0.25    -0.22 to 0.62     0.29
    0.00    -0.44 to 0.44     1.00
Incredibly interesting information about statistical
significance and confidence intervals
[Figure: confidence intervals relative to zero for p < 0.05 (zero outside the interval), p = 0.05 (zero at a limit), and p > 0.05 (zero inside the interval).]
Two independent estimates of a normally distributed statistic with equal confidence intervals are significantly different at the 5% level if the overlap of their intervals is less than 0.29 (1 - √2/2) of the length of the interval.
If the intervals are very unequal...
[Figure: pairs of very unequal confidence intervals illustrating p < 0.05, p = 0.05, and p > 0.05.]
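A quick numeric check of the 0.29 figure for the equal-interval case (my sketch, assuming two independent estimates with the same standard error):

    import math

    se = 1.0                       # common standard error (arbitrary units)
    length = 2 * 1.96 * se         # length of each 95% interval

    # the difference is just significant at the 5% level when
    # |difference| = 1.96 * se * sqrt(2)
    diff_at_p05 = 1.96 * se * math.sqrt(2)

    overlap = length - diff_at_p05
    print(overlap / length)        # ~0.293 = 1 - sqrt(2)/2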
Type I and II Errors
You could be wrong about significance or lack of it.
Type I error = false alarm.
Rate = 5% for zero real effect.
Type II error = failed alarm.
Traditional acceptable rate = 20% for smallest worthwhile
effect.
Lots of tests for significance implies more chance of
at least one false alarm: "inflated type I error".
Ditto type II error?
Deal with inflated type I error by reducing the critical p value.
Should we adjust confidence intervals? No.
Mis/interpretation of P < 0.05
(for an observed positive effect)
The effect is probably big.
There's a < 5% chance the effect is zero.
There's a < 2.5% chance the effect is < zero.
There's a high chance the effect is > zero.
The effect is publishable.
Mis/interpretation of P > 0.05
(for an observed positive effect)
The effect is not publishable.
There is no effect.
The effect is probably zero or trivial.
There's a reasonable chance the effect is < zero.
Planning Research
Sample Size via Statistical Significance
Sample size must be big enough to be sure you will
detect the smallest worthwhile effect.
To be sure: 80% of the time.
Detect: P < 0.05.
Smallest worthwhile effect: what impacts your subjects
correlation = 0.10
relative risk = 1.2 (or frequency difference = 10%)
difference in means = 0.2 of a between-subject standard deviation
change in means = 0.5 of a within-subject standard deviation
Example: 760 subjects to detect a correlation of 0.10.
Example: 68 subjects to detect a 0.5% change in a
crossover study when the within-subject variation is 1%.
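For the correlation example, the standard Fisher-z approximation gives a similar answer (my sketch; it returns about 780 rather than the 760 quoted above):

    import math
    from scipy.stats import norm

    r = 0.10
    z_alpha = norm.ppf(0.975)   # two-sided 5% significance
    z_beta = norm.ppf(0.80)     # 80% power
    n = ((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3
    print(round(n))             # ~783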
But 95% likely range doesn't work properly with
traditional sample-size estimation (maybe).
Example: Correlation of 0.06, sample size of 760...
47.5% + 47.5% (=95%) likely range:
[Figure: the 95% likely range for r = 0.06 on a correlation-coefficient axis from -0.1 to 0.1. Not significant, but could be substantial. Huh?]
47.5% + 30% likely range:
[Figure: the 47.5% + 30% likely range for the same correlation. Not significant, and can't be substantial. OK!]
Sample Size via Confidence Limits
Sample size must be big enough for acceptable
precision of the effect.
Precision means 95% confidence limits.
Acceptable means any value of the effect within these
limits will not impact your subjects.
Example: need 380 subjects to delimit a correlation of
zero.
[Figure: the confidence interval for N = 380, just spanning the smallest worthwhile effects (-0.10 to 0.10) on the correlation-coefficient axis.]
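A rough check of the 380 figure (my sketch, via the Fisher-z approximation, which gives about 385): choose n so that the 95% limits for an observed correlation of zero just reach the smallest worthwhile values of ±0.10.

    import math
    from scipy.stats import norm

    smallest_worthwhile = 0.10
    z95 = norm.ppf(0.975)
    n = (z95 / math.atanh(smallest_worthwhile)) ** 2 + 3
    print(round(n))    # ~385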
But sample size needed to detect or delimit
smallest effect is overkill for larger effects.
Example: confidence limits for correlations of 0.10 and 0.80 with a sample size of 760...
[Figure: confidence intervals for correlations of 0.10 and 0.80 with n = 760 on a correlation-coefficient axis from -0.1 to 1; the interval for 0.80 is far narrower than it needs to be.]
So why not start with a smaller sample and do more subjects only if necessary?
Yes, I call it...
Performing Research
Sample Size "On the Fly"
Start with a small sample; add subjects until you
get acceptable precision for the effect.
Acceptable precision defined as before.
Need qualitative scale for magnitudes of effects.
Example: sample sizes to delimit correlations...
[Figure: sample sizes of roughly 380, 350, 270, 155, and 46 subjects to delimit correlations of increasing magnitude, shown against the qualitative scale on the correlation-coefficient axis: trivial (<0.1), small (0.1-0.3), moderate (0.3-0.5), large (0.5-0.7), very large (0.7-0.9), nearly perfect (0.9-1).]
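Here is a minimal sketch of the on-the-fly idea (my own illustration with simulated data; the stopping rule, a confidence interval narrower than 0.2, is only a stand-in for the acceptable-precision criterion above):

    import numpy as np

    rng = np.random.default_rng(1)

    def ci_for_r(x, y):
        # 95% confidence limits for a correlation via the Fisher z transform
        n = len(x)
        r = np.corrcoef(x, y)[0, 1]
        z = np.arctanh(r)
        half = 1.96 / np.sqrt(n - 3)
        return np.tanh(z - half), np.tanh(z + half)

    # hypothetical population with a true correlation of about 0.5
    x_all = rng.normal(size=5000)
    y_all = 0.5 * x_all + rng.normal(scale=np.sqrt(0.75), size=5000)

    n = 20                                  # start with a small sample
    while True:
        lo, hi = ci_for_r(x_all[:n], y_all[:n])
        if hi - lo < 0.2 or n >= 5000:      # acceptable precision (or give up)
            break
        n += 10                             # add more subjects
    print(n, lo, hi)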
Problems with sampling on the fly
Do not sample until you get statistical significance: the
resulting outcomes are biased larger than life.
Sampling until the confidence interval is acceptable
produces bias, but it is negligible.
But researchers will rush into print as soon as they get
statistical significance.
And funding agencies prefer to give money once
(but you could give some back!).
And all the big effects have been researched anyway?
No, not really.
Publishing Research
In the Methods
"We show the precision of our estimates of outcome
statistics as 95% confidence limits (which define the
likely range of the true value in the population from
which we drew our sample)."
Amazingly useful tips on calculating confidence limits
Simple differences between means: stats program.
Other normally distributed statistics: mean and p value.
Relative risks: stats program.
Correlations: Fisher's z transform (sketched below).
Standard deviations and other root mean square variations:
chi-squared distribution.
Coefficients of variation: standard deviation of 100x natural log
of the variable. Back transform for CV>5%.
Use the adjustment of Tate and Klett to get shorter intervals for
SDs and CVs from small samples.
Example:
[Figure: usual and adjusted 95% confidence limits for the coefficient of variation (%) of 10 subjects in 2 tests, on an axis from 0 to 3%; the adjusted interval is shorter.]
Ratios of independent standard deviations: F distribution.
R2 (variance explained): convert to a correlation.
Use the spreadsheet at sportsci.org/stats for all the above.
Effect-size (mean/standard deviation): non-central
F distribution or bootstrapping.
Really awful statistics: bootstrapping.
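As promised, a sketch of the Fisher-z recipe for correlations (my code; the spreadsheet at sportsci.org/stats does the same job and more):

    import math
    from scipy.stats import norm

    def correlation_limits(r, n, level=0.95):
        # approximate confidence limits for a correlation via Fisher's z transform
        z = math.atanh(r)
        half = norm.ppf(0.5 + level / 2) / math.sqrt(n - 3)
        return math.tanh(z - half), math.tanh(z + half)

    print(correlation_limits(0.68, 64))   # ~(0.52, 0.79), matching the earlier example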
Bootstrapping (Resampling) for confidence limits
Use for difficult statistics, e.g. for grossly non-normal
repeated measures with missing values. Here's how...
For a large-enough sample, you can recreate (sort of) the
population by duplicating the sample endlessly.
Draw 1000 samples (of same size as your original) from
this population.
Calculate your outcome statistic for each of these
samples, rank them, then find the 25th and 975th placegetters. These are the confidence limits.
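A minimal sketch of those steps (my code, with a made-up paired sample and the correlation as the outcome statistic):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=30)
    y = 0.6 * x + rng.normal(size=30)          # hypothetical paired data

    boot_stats = []
    for _ in range(1000):                      # draw 1000 resamples
        idx = rng.integers(0, len(x), len(x))  # sample with replacement
        boot_stats.append(np.corrcoef(x[idx], y[idx])[0, 1])

    boot_stats.sort()
    print(boot_stats[24], boot_stats[974])     # 25th and 975th placegetters = 95% limits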
Problems
Painful to generate.
No good for infrequent levels of nominal variables.
In the Results
In TEXT
Change or difference in means
First mention:
...0.42 (95% confidence/likely limits/range -0.09 to 0.93) or
...0.42 (95% confidence/likely limits/range ± 0.51).
Thereafter:
...2.6 (1.4 to 3.8) or 2.6 (± 1.2) etc.
Correlations, relative risks, odds ratios, standard deviations,
ratios of standard deviations: can't use ± because the
confidence interval is skewed:
...a correlation of 0.90 (0.67 to 0.97)...
...a coefficient of variation of 1.3% (0.9 to 1.9)...
In TABLES
Confidence intervals:

                  r      likely range
    Variable A    0.70   0.37 to 0.87
    Variable B    0.44   0.00 to 0.74
    Variable C    0.25   -0.22 to 0.62
    Variable D    0.00   -0.44 to 0.44

P values:

                  r      p
    Variable A    0.70   0.007
    Variable B    0.44   0.05
    Variable C    0.25   0.29
    Variable D    0.00   1.00

Asterisks:

                  r
    Variable A    0.70**
    Variable B    0.44*
    Variable C    0.25
    Variable D    0.00
In FIGURES
[Figure: change in power (%), from -10 to 10, for subjects told carbohydrate, told placebo, and not told; bars are 95% likely ranges.]
[Figure: change in 5000-m time (%) over 0 to 14 weeks of training time for live low + train low, live high + train high, and live high + train low groups at sea level and altitude, with the likely range of the true change shown for one point.]
In the Discussion
Interpret the observed effect and its 95%
confidence limits qualitatively.
Example: you observed a moderate correlation, but the
true value of the correlation could be anything between
trivial and very strong.
[Figure: the qualitative scale on the correlation-coefficient axis: trivial (<0.1), small (0.1-0.3), moderate (0.3-0.5), large (0.5-0.7), very large (0.7-0.9), nearly perfect (0.9-1).]
Meta-Analysis
Deriving a single estimate and confidence interval
for an effect from several studies.
Here's how it works for two:
[Figure: Study 1 and Study 2 combined into Study 1+2, shown first for equal confidence intervals and then for unequal confidence intervals; the combined interval is narrower than either study's own.]
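For normally distributed statistics the combining is typically done by weighting each study by the inverse of its squared standard error; here is a sketch for two studies with made-up numbers (my illustration, not data from the talk):

    import math

    effects = [1.2, 0.6]               # observed effect from each study
    ses = [0.5, 0.3]                   # and its standard error

    weights = [1 / se**2 for se in ses]
    combined = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    combined_se = math.sqrt(1 / sum(weights))

    print(combined,
          combined - 1.96 * combined_se,
          combined + 1.96 * combined_se)   # narrower than either study's own interval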
Publishing non-significant outcomes
Publishing only significant effects from small-scale
studies leads to publication bias.
Publishing effects with confidence limits regardless
of magnitude is free of bias.
Many smaller studies are probably better than a
few larger ones anyway.
So bully the editor into accepting the paper about
your seemingly inconclusive small-scale study.
Conclusions
Disadvantages of Statistical Significance
Emphasizes testing of hypotheses.
Aim is to detect an effect--effects are zero until proven
otherwise.
Have to understand Type I and II errors.
Hard to understand; easy to misinterpret.
Have to consider sample size.
Focuses on statistically significant effects.
Advantages of Statistical Significance
Familiar.
All stats programs give p values.
Easy to put asterisks in tables and figures.
Disadvantages of Confidence Limits
Unfamiliar.
Not always available in stats programs.
Cluttersome in tables.
Display in time series can be a challenge.
Advantages of Confidence Limits
Emphasizes precision of estimation.
Aim is to delimit an effect--effects are never zero.
Only one kind of "error".
Meaning is reasonably clear, even to lay readers.
No confusion between significance and magnitude.
Journals now require them.