The P Value is Dead
Planning, Performing, and Publishing Research with Confidence Limits
A tutorial lecture given at the annual meeting of the American College of Sports Medicine, Seattle, June 4, 1999.
© Will G Hopkins
Physiology and Physical Education
University of Otago
Dunedin NZ
[email protected]
Outline
Definitions and Mis/interpretations
Planning
Sample size
Performing
Sample size "on the fly"
Publishing
Methods, Results, Discussion
Meta-analysis
Publishing non-significant outcomes
Conclusions
Dis/advantages
Definitions and Mis/interpretations
Confidence limits: Definitions
"Margin of error"
Example: Survey of 1000 voters
Democrats 43%, Republicans 33%
Margin of error is ± 3% (for a result of 50%...)
Likely range of true value
"Likely" is usually 95%.
"True value" = population value
= value if you studied the entire population.
Example: Survey of 1000 voters
Democrats 43% (likely range 40 to 46%)
Democrats - Republicans 10% (likely range 5 to 15%)
Example: in a study of 64 subjects, the correlation between
height and weight was 0.68 (likely range 0.52 to 0.79).
[Figure: the observed value of the correlation with its lower and upper confidence limits, plotted on a correlation-coefficient axis from 0.00 to 1.]
Confidence interval: difference between the upper
and lower confidence limits.
Amazing facts about confidence intervals
(for normally distributed statistics)
To halve the interval, you have to quadruple sample size.
A 99% interval is 1.3 times as wide as a 95% interval.
You need 1.7 times the sample size for the same width.
A 90% interval is 0.8 of the width of a 95% interval.
You need 0.7 times the sample size for the same width.
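These ratios follow directly from the normal quantiles. A quick check in Python (my own sketch, not part of the original lecture), assuming a normally distributed statistic whose interval half-width is a z value times a standard error that shrinks as 1/sqrt(n):

    # Check of the interval-width ratios quoted above.
    from scipy.stats import norm

    z95 = norm.ppf(0.975)    # 1.96
    z99 = norm.ppf(0.995)    # 2.576
    z90 = norm.ppf(0.950)    # 1.645

    print(z99 / z95)         # ~1.31: a 99% interval is ~1.3 times as wide as a 95%
    print((z99 / z95) ** 2)  # ~1.73: sample size needed for the same width
    print(z90 / z95)         # ~0.84: a 90% interval is ~0.8 of the width of a 95%
    print((z90 / z95) ** 2)  # ~0.70: sample size needed for the same width

The halving-needs-quadrupling rule is the same fact in another guise: the width scales as 1/sqrt(n).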
How to Derive Confidence Limits
Find a function(true value, observed value, data) with a
known probability distribution.
Calculate a critical value, such that for 2.5% of the time,
function(true value, observed value, data) < critical value.
[Figure: the probability distribution of the function (e.g. chi-squared, for the function (n-1)s²/σ²), with a shaded area of 0.025 beyond the critical value.]
Rearranging, for 2.5% of the time,
true value > function'(observed value, data, critical value)
= upper confidence limit
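As an illustration of this recipe (my own sketch with a made-up sample, not the lecture's code): for a standard deviation, the function (n-1)s²/σ² has a chi-squared distribution with n-1 degrees of freedom, so rearranging the 2.5% and 97.5% critical values gives the 95% confidence limits for the true SD.

    import numpy as np
    from scipy.stats import chi2

    data = np.array([4.9, 5.3, 5.1, 4.7, 5.6, 5.0, 5.2, 4.8])  # hypothetical sample
    n = len(data)
    s2 = data.var(ddof=1)    # sample variance

    # rearrange (n-1)s2/sigma2 ~ chi-squared(n-1) at the 2.5% and 97.5% points
    lower_sd = np.sqrt((n - 1) * s2 / chi2.ppf(0.975, n - 1))
    upper_sd = np.sqrt((n - 1) * s2 / chi2.ppf(0.025, n - 1))
    print(np.sqrt(s2), lower_sd, upper_sd)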
Mis/interpretation of confidence limits
Hard to misinterpret confidence limits for simple
proportions and correlation coefficients.
Easier to misinterpret changes in means.
Example: The change in blood volume in a study
was 0.52 L (likely range 0.12 to 0.92 L).
For 95% of subjects, the change was/would be between
0.12 and 0.92 L.
The average change in the population would be between
0.12 and 0.92 L.
The change for the average subject would be between
0.12 and 0.92 L.
There may be individual differences in the change.
P value: Definition
The probability of a more extreme absolute value
than the observed value if the true value was zero
or null.
Example: 20 subjects, correlation = 0.25, p = 0.29.
[Figure: the distribution of correlations for no effect and n = 20, on a correlation-coefficient axis from -0.5 to 0.5; the area in the tails beyond the observed effect (r = 0.25) is the p value of 0.29.]
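This p value can be reproduced with the usual t statistic for a correlation; the code below is my own sketch, not part of the talk.

    import math
    from scipy.stats import t

    r, n = 0.25, 20
    t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
    p = 2 * t.sf(abs(t_stat), df=n - 2)   # two-sided p value
    print(round(p, 2))                    # ~0.29, as in the example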
"Statistically Significant": Definitions
P < 0.05
Zero lies outside the confidence interval.
Examples: four correlations for samples of size 20.
[Figure: the four correlations and their likely ranges plotted on a correlation-coefficient axis from -0.50 to 1.]

    r       likely range      P
    0.70    0.37 to 0.87      0.007
    0.44    0.00 to 0.74      0.05
    0.25    -0.22 to 0.62     0.29
    0.00    -0.44 to 0.44     1.00
Incredibly interesting information about statistical
significance and confidence intervals
[Figure: confidence intervals relative to zero for p < 0.05 (zero outside the interval), p = 0.05 (zero at a limit), and p > 0.05 (zero inside the interval).]
Two independent estimates of a normally distributed statistic with equal confidence intervals are significantly different at the 5% level if the overlap of their intervals is less than 0.29 (1 - √2/2) of the length of the interval.
If the intervals are very unequal...
[Figure: pairs of very unequal confidence intervals illustrating p < 0.05, p = 0.05, and p > 0.05.]
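A quick numeric check of the 0.29 figure for the equal-interval case (my sketch, assuming two independent estimates with the same standard error):

    import math

    se = 1.0                       # common standard error (arbitrary units)
    length = 2 * 1.96 * se         # length of each 95% interval

    # the difference is just significant at the 5% level when
    # |difference| = 1.96 * se * sqrt(2)
    diff_at_p05 = 1.96 * se * math.sqrt(2)

    overlap = length - diff_at_p05
    print(overlap / length)        # ~0.293 = 1 - sqrt(2)/2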
Type I and II Errors
You could be wrong about significance or lack of it.
Type I error = false alarm.
Rate = 5% for zero real effect.
Type II error = failed alarm.
Traditional acceptable rate = 20% for smallest worthwhile
effect.
Lots of tests for significance implies more chance of
at least one false alarm: "inflated type I error".
Ditto type II error?
Deal with inflated type I error by reducing the critical p value.
Should we adjust confidence intervals? No.
Mis/interpretation of P < 0.05
(for an observed positive effect)
The effect is probably big.
There's a < 5% chance the effect is zero.
There's a < 2.5% chance the effect is < zero.
There's a high chance the effect is > zero.
The effect is publishable.
Mis/interpretation of P > 0.05
(for an observed positive effect)
The effect is not publishable.
There is no effect.
The effect is probably zero or trivial.
There's a reasonable chance the effect is < zero.
Planning Research
Sample Size via Statistical Significance
Sample size must be big enough to be sure you will
detect the smallest worthwhile effect.
To be sure: 80% of the time.
Detect: P < 0.05.
Smallest worthwhile effect: what impacts your subjects
correlation = 0.10
relative risk = 1.2 (or frequency difference = 10%)
difference in means = 0.2 of a between-subject standard deviation
change in means = 0.5 of a within-subject standard deviation
Example: 760 subjects to detect a correlation of 0.10.
Example: 68 subjects to detect a 0.5% change in a
crossover study when the within-subject variation is 1%.
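For the correlation example, the standard Fisher-z approximation gives a similar answer (my sketch; it returns about 780 rather than the 760 quoted above):

    import math
    from scipy.stats import norm

    r = 0.10
    z_alpha = norm.ppf(0.975)   # two-sided 5% significance
    z_beta = norm.ppf(0.80)     # 80% power
    n = ((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3
    print(round(n))             # ~783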
But 95% likely range doesn't work properly with
traditional sample-size estimation (maybe).
Example: Correlation of 0.06, sample size of 760...
47.5% + 47.5% (=95%) likely range:
[Figure: the 95% likely range for r = 0.06 on a correlation-coefficient axis from -0.1 to 0.1. Not significant, but could be substantial. Huh?]
47.5% + 30% likely range:
[Figure: the 47.5% + 30% likely range for the same correlation. Not significant, and can't be substantial. OK!]
Sample Size via Confidence Limits
Sample size must be big enough for acceptable
precision of the effect.
Precision means 95% confidence limits.
Acceptable means any value of the effect within these
limits will not impact your subjects.
Example: need 380 subjects to delimit a correlation of
zero.
[Figure: the confidence interval for N = 380, just spanning the smallest worthwhile effects (-0.10 to 0.10) on the correlation-coefficient axis.]
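A rough check of the 380 figure (my sketch, via the Fisher-z approximation, which gives about 385): choose n so that the 95% limits for an observed correlation of zero just reach the smallest worthwhile values of ±0.10.

    import math
    from scipy.stats import norm

    smallest_worthwhile = 0.10
    z95 = norm.ppf(0.975)
    n = (z95 / math.atanh(smallest_worthwhile)) ** 2 + 3
    print(round(n))    # ~385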
But sample size needed to detect or delimit
smallest effect is overkill for larger effects.
Example: confidence limits for correlations of 0.10 and 0.80 with a sample size of 760...
[Figure: confidence intervals for correlations of 0.10 and 0.80 with n = 760 on a correlation-coefficient axis from -0.1 to 1; the interval for 0.80 is far narrower than it needs to be.]
So why not start with a smaller sample and do more subjects only if necessary?
Yes, I call it...
Performing Research
Sample Size "On the Fly"
Start with a small sample; add subjects until you
get acceptable precision for the effect.
Acceptable precision defined as before.
Need qualitative scale for magnitudes of effects.
Example: sample sizes to delimit correlations...
[Figure: sample sizes of roughly 380, 350, 270, 155, and 46 subjects to delimit correlations of increasing magnitude, shown against the qualitative scale on the correlation-coefficient axis: trivial (<0.1), small (0.1-0.3), moderate (0.3-0.5), large (0.5-0.7), very large (0.7-0.9), nearly perfect (0.9-1).]
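Here is a minimal sketch of the on-the-fly idea (my own illustration with simulated data; the stopping rule, a confidence interval narrower than 0.2, is only a stand-in for the acceptable-precision criterion above):

    import numpy as np

    rng = np.random.default_rng(1)

    def ci_for_r(x, y):
        # 95% confidence limits for a correlation via the Fisher z transform
        n = len(x)
        r = np.corrcoef(x, y)[0, 1]
        z = np.arctanh(r)
        half = 1.96 / np.sqrt(n - 3)
        return np.tanh(z - half), np.tanh(z + half)

    # hypothetical population with a true correlation of about 0.5
    x_all = rng.normal(size=5000)
    y_all = 0.5 * x_all + rng.normal(scale=np.sqrt(0.75), size=5000)

    n = 20                                  # start with a small sample
    while True:
        lo, hi = ci_for_r(x_all[:n], y_all[:n])
        if hi - lo < 0.2 or n >= 5000:      # acceptable precision (or give up)
            break
        n += 10                             # add more subjects
    print(n, lo, hi)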
Problems with sampling on the fly
Do not sample until you get statistical significance: the
resulting outcomes are biased larger than life.
Sampling until the confidence interval is acceptable
produces bias, but it is negligible.
But researchers will rush into print as soon as they get
statistical significance.
And funding agencies prefer to give money once
(but you could give some back!).
And all the big effects have been researched anyway?
No, not really.
Publishing Research
In the Methods
"We show the precision of our estimates of outcome
statistics as 95% confidence limits (which define the
likely range of the true value in the population from
which we drew our sample)."
Amazingly useful tips on calculating confidence limits
Simple differences between means: stats program.
Other normally distributed statistics: mean and p value.
Relative risks: stats program.
Correlations: Fisher's z transform (sketched below).
Standard deviations and other root mean square variations:
chi-squared distribution.
Coefficients of variation: standard deviation of 100x natural log
of the variable. Back transform for CV>5%.
Use the adjustment of Tate and Klett to get shorter intervals for
SDs and CVs from small samples.
Example:
[Figure: usual and adjusted 95% confidence limits for the coefficient of variation (%) of 10 subjects in 2 tests, on an axis from 0 to 3%; the adjusted interval is shorter.]
Ratios of independent standard deviations: F distribution.
R2 (variance explained): convert to a correlation.
Use the spreadsheet at sportsci.org/stats for all the above.
Effect-size (mean/standard deviation): non-central
F distribution or bootstrapping.
Really awful statistics: bootstrapping.
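As promised, a sketch of the Fisher-z recipe for correlations (my code; the spreadsheet at sportsci.org/stats does the same job and more):

    import math
    from scipy.stats import norm

    def correlation_limits(r, n, level=0.95):
        # approximate confidence limits for a correlation via Fisher's z transform
        z = math.atanh(r)
        half = norm.ppf(0.5 + level / 2) / math.sqrt(n - 3)
        return math.tanh(z - half), math.tanh(z + half)

    print(correlation_limits(0.68, 64))   # ~(0.52, 0.79), matching the earlier example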
Bootstrapping (Resampling) for confidence limits
Use for difficult statistics, e.g. for grossly non-normal
repeated measures with missing values. Here's how...
For a large-enough sample, you can recreate (sort of) the
population by duplicating the sample endlessly.
Draw 1000 samples (of same size as your original) from
this population.
Calculate your outcome statistic for each of these
samples, rank them, then find the 25th and 975th placegetters. These are the confidence limits.
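A minimal sketch of those steps (my code, with a made-up paired sample and the correlation as the outcome statistic):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=30)
    y = 0.6 * x + rng.normal(size=30)          # hypothetical paired data

    boot_stats = []
    for _ in range(1000):                      # draw 1000 resamples
        idx = rng.integers(0, len(x), len(x))  # sample with replacement
        boot_stats.append(np.corrcoef(x[idx], y[idx])[0, 1])

    boot_stats.sort()
    print(boot_stats[24], boot_stats[974])     # 25th and 975th placegetters = 95% limits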
Problems
Painful to generate.
No good for infrequent levels of nominal variables.
In the Results
In TEXT
Change or difference in means
First mention:
...0.42 (95% confidence/likely limits/range -0.09 to 0.93) or
...0.42 (95% confidence/likely limits/range ± 0.51).
Thereafter:
...2.6 (1.4 to 3.8) or 2.6 (± 1.2) etc.
Correlations, relative risks, odds ratios, standard deviations,
ratios of standard deviations: can't use ± because the
confidence interval is skewed:
...a correlation of 0.90 (0.67 to 0.97)...
...a coefficient of variation of 1.3% (0.9 to 1.9)...
In TABLES
Confidence intervals:

                  r      likely range
    Variable A    0.70   0.37 to 0.87
    Variable B    0.44   0.00 to 0.74
    Variable C    0.25   -0.22 to 0.62
    Variable D    0.00   -0.44 to 0.44

P values:

                  r      p
    Variable A    0.70   0.007
    Variable B    0.44   0.05
    Variable C    0.25   0.29
    Variable D    0.00   1.00

Asterisks:

                  r
    Variable A    0.70**
    Variable B    0.44*
    Variable C    0.25
    Variable D    0.00
In FIGURES
[Figure: change in power (%), from -10 to 10, for subjects told carbohydrate, told placebo, and not told; bars are 95% likely ranges.]
[Figure: change in 5000-m time (%) over 0 to 14 weeks of training time for live low + train low, live high + train high, and live high + train low groups at sea level and altitude, with the likely range of the true change shown for one point.]
In the Discussion
Interpret the observed effect and its 95%
confidence limits qualitatively.
Example: you observed a moderate correlation, but the
true value of the correlation could be anything between
trivial and very strong.
[Figure: the qualitative scale on the correlation-coefficient axis: trivial (<0.1), small (0.1-0.3), moderate (0.3-0.5), large (0.5-0.7), very large (0.7-0.9), nearly perfect (0.9-1).]
Meta-Analysis
Deriving a single estimate and confidence interval
for an effect from several studies.
Here's how it works for two:
[Figure: Study 1 and Study 2 combined into Study 1+2, shown first for equal confidence intervals and then for unequal confidence intervals; the combined interval is narrower than either study's own.]
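For normally distributed statistics the combining is typically done by weighting each study by the inverse of its squared standard error; here is a sketch for two studies with made-up numbers (my illustration, not data from the talk):

    import math

    effects = [1.2, 0.6]               # observed effect from each study
    ses = [0.5, 0.3]                   # and its standard error

    weights = [1 / se**2 for se in ses]
    combined = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    combined_se = math.sqrt(1 / sum(weights))

    print(combined,
          combined - 1.96 * combined_se,
          combined + 1.96 * combined_se)   # narrower than either study's own interval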
Publishing non-significant outcomes
Publishing only significant effects from small-scale
studies leads to publication bias.
Publishing effects with confidence limits regardless
of magnitude is free of bias.
Many smaller studies are probably better than a
few larger ones anyway.
So bully the editor into accepting the paper about
your seemingly inconclusive small-scale study.
Conclusions
Disadvantages of Statistical Significance
Emphasizes testing of hypotheses.
Aim is to detect an effect--effects are zero until proven
otherwise.
Have to understand Type I and II errors.
Hard to understand; easy to misinterpret.
Have to consider sample size.
Focuses on statistically significant effects.
Advantages of Statistical Significance
Familiar.
All stats programs give p values.
Easy to put asterisks in tables and figures.
Disadvantages of Confidence Limits
Unfamiliar.
Not always available in stats programs.
Cluttersome in tables.
Display in time series can be a challenge.
Advantages of Confidence Limits
Emphasizes precision of estimation.
Aim is to delimit an effect--effects are never zero.
Only one kind of "error".
Meaning is reasonably clear, even to lay readers.
No confusion between significance and magnitude.
Journals now require them.