Expected distributions of p-values in replications are not what you
High Expectations
R.J. Watt
D.I. Donaldson
University of Stirling
A typical population
We start with a simulated population of known effect size: r=0.3
A typical design
We sample from the population: 42 participants, random sampling
A random sample
n=42; sample: r=0.34; population: r=0.3
p=0.03 (2-tailed): the null hypothesis is rejected
Another random sample
n=42; sample: r=0.30; population: r=0.3
p=0.055 (2-tailed): the null hypothesis is not rejected
This is a miss (because we know that the population has an effect).
Step 1: how common are misses?
In this design, 50% of samples generate a miss.
[Figure: distribution of p values across samples, with markers at p=0.05, p=0.01 and p=0.001.]
NB: 10% of samples give p<0.0016; 10% give p>0.47.
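The 50% miss rate is easy to check with a short Monte Carlo. This is a sketch, not the authors' code: it draws bivariate-normal samples with population correlation 0.3 and n=42, and uses the Fisher-z normal approximation for the two-tailed p value (all function names are ours).

```python
import math
import random

random.seed(1)

def sample_r(rho, n):
    """Draw n points from a bivariate normal with correlation rho and
    return the sample Pearson correlation."""
    xs, ys = [], []
    for _ in range(n):
        x = random.gauss(0, 1)
        xs.append(x)
        ys.append(rho * x + math.sqrt(1 - rho ** 2) * random.gauss(0, 1))
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

def p_two_tailed(r, n):
    """Two-tailed p for H0: rho = 0, via the Fisher-z normal approximation."""
    stat = abs(math.atanh(r)) * math.sqrt(n - 3)
    return 2 * (1 - 0.5 * (1 + math.erf(stat / math.sqrt(2))))

pvals = [p_two_tailed(sample_r(0.3, 42), 42) for _ in range(4000)]
miss_rate = sum(p >= 0.05 for p in pvals) / len(pvals)
print(f"miss rate at alpha = 0.05: {miss_rate:.2f}")
```

With r=0.3 and n=42 the design has roughly 50% power, so about half of the simulated samples miss.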
Step 1: miss rates are high
In this design, 50% of samples generate a miss.
Two implications:
1. the effect will be found by someone and reported
2. the effect has a 50% chance of subsequent “failure to replicate”
NB. Privileged viewpoint: we know that the effect exists.
Back to our 2 samples
[Figure: p value for combined data plotted against p value for replicate alone, for the original study. Regions show where replications fail, where the combination fails, where replications are individually significant, and where replications individually fail but the combined data are significant.]
The two samples we started with
[Figure: original study and replication, plotted as p value for combined data against p value for replicate alone.]
1. Replication failed.
2. Combined data is more significant than original alone.
Many further attempts at replication
[Figure: original study and replications, plotted as p value for combined data against p value for replicate alone.]
failure to replicate: 51%
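The pattern in the figure, a replicate that fails on its own while the pooled data stay significant, can be reproduced with a sketch of our own (Fisher-z approximation assumed throughout): fix one “published” original sample, then draw many exact replicates and compare each replicate's p value alone with the p value of the combined data.

```python
import math
import random

random.seed(7)

RHO, N = 0.3, 42

def draw_sample(n, rho):
    """One sample of (x, y) pairs from a bivariate normal with correlation rho."""
    pts = []
    for _ in range(n):
        x = random.gauss(0, 1)
        pts.append((x, rho * x + math.sqrt(1 - rho ** 2) * random.gauss(0, 1)))
    return pts

def r_and_p(pts):
    """Pearson r and a two-tailed p (Fisher-z normal approximation)."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    syy = sum((y - my) ** 2 for _, y in pts)
    r = sxy / math.sqrt(sxx * syy)
    stat = abs(math.atanh(r)) * math.sqrt(n - 3)
    return r, 2 * (1 - 0.5 * (1 + math.erf(stat / math.sqrt(2))))

# Keep drawing originals until one is "published" (p < 0.05).
original = draw_sample(N, RHO)
while r_and_p(original)[1] >= 0.05:
    original = draw_sample(N, RHO)

trials, fails, combined_ok = 2000, 0, 0
for _ in range(trials):
    replicate = draw_sample(N, RHO)
    _, p_rep = r_and_p(replicate)
    _, p_comb = r_and_p(original + replicate)
    if p_rep >= 0.05:              # replication fails on its own
        fails += 1
        if p_comb < 0.05:          # ...but the combined data are significant
            combined_ok += 1

fail_rate = fails / trials
frac_combined_sig = combined_ok / max(fails, 1)
print(f"failure to replicate: {fail_rate:.2f}; "
      f"combined data significant in {frac_combined_sig:.2f} of failures")
```

About half of the replicates fail on their own, yet in most of those failures the pooled data remain significant, which is the point the slide is making.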
Step 1 summary: known population
• for our sample n=42, r=0.34 (p=0.04),
– the probability of a failure to replicate is 50%
• What if we don’t know the population it came from?
Step 2: unknown population
• Same original sample; now suppose the population is unknown
• Constraints on population:
1. the distribution of effect sizes that could produce the current sample
2. the a priori distribution of effect sizes
Constraint 1
The density function for populations that our sample could have arisen from.
Constraint 2
The simulated density function for effect sizes in Psychology. We use the underlying distribution of effect sizes; what we then see is the published distribution. It matches the Science paper.
Step 2: unknown population
• We draw replication samples from anywhere in this distribution of possible source populations,
• and ask how often they are significant.
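Step 2 can be sketched as follows. We take Constraint 1 to be the Fisher-z likelihood of the observed sample; since the talk does not give the functional form of Constraint 2, we assume, purely for illustration, a half-normal a priori distribution with scale 0.28 (the best-fit value quoted at the end of the deck). Replication populations are then drawn from the resulting posterior.

```python
import bisect
import math
import random

random.seed(3)

R_OBS, N = 0.34, 42
SE = 1 / math.sqrt(N - 3)        # standard error on the Fisher-z scale
Z_OBS = math.atanh(R_OBS)

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Candidate population effect sizes on a grid.
grid = [i / 1000 for i in range(950)]
# Constraint 1: how likely each population is to produce the observed sample.
like = [normal_pdf(Z_OBS, math.atanh(r), SE) for r in grid]
# Constraint 2 (assumed form): half-normal a priori distribution, scale 0.28.
prior = [normal_pdf(r, 0.0, 0.28) for r in grid]

post = [l * p for l, p in zip(like, prior)]
total = sum(post)
cdf, c = [], 0.0
for w in post:                   # cumulative weights for posterior sampling
    c += w / total
    cdf.append(c)

def two_tailed_p(z, n):
    stat = abs(z) * math.sqrt(n - 3)
    return 2 * (1 - 0.5 * (1 + math.erf(stat / math.sqrt(2))))

# Draw a source population from the posterior, then one replication from it.
fails, trials = 0, 4000
for _ in range(trials):
    rho = grid[min(bisect.bisect_left(cdf, random.random()), len(grid) - 1)]
    z_rep = random.gauss(math.atanh(rho), SE)
    if two_tailed_p(z_rep, N) >= 0.05:
        fails += 1
fail_rate = fails / trials
print(f"failure to replicate: {fail_rate:.2f}")
```

With these assumptions the failure rate comes out close to the 60% reported on the next slide; the exact value depends on the assumed shape of the a priori distribution.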
[Figure: original study and replications, plotted as p value for combined data against p value for replicate alone.]
failure to replicate: 60%
Step 2 summary: unknown population
• for our sample n=42, r=0.34 (p=0.04),
– given an a priori distribution of population effect sizes,
– the probability of a failure to replicate is 60%
• What about other combinations of r and n?
Final step: other r & n
• r: we already have an a priori Psychology distribution
• n: 90 years of journal articles; median n: JEP: 30, QJEP: 25
The Simulation Process
Original studies (>1000):
– r chosen at random from the a priori population
– n chosen at random from [10 20 40 80]
– keep p<0.05: “published”
Replications (1000 exact replications per study):
– r chosen at random from all source populations
– n kept as original
– count p<0.05: % success = proportion of successful replications
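The pipeline above can be sketched in a few lines. The shape of the a priori distribution is our assumption (half-normal, scale 0.28, matching the best-fit value quoted at the end of the deck). In a simulation the true r of every published study is known, so exact replications can be drawn directly from each study's source population, which in aggregate is equivalent to drawing from all source populations consistent with the data.

```python
import math
import random

random.seed(11)

SCALE = 0.28                 # assumed scale of the a priori half-normal
NS = [10, 20, 40, 80]        # candidate sample sizes, as in the talk

def two_tailed_p(z, n):
    """Two-tailed p from a Fisher-z value (normal approximation)."""
    stat = abs(z) * math.sqrt(n - 3)
    return 2 * (1 - 0.5 * (1 + math.erf(stat / math.sqrt(2))))

published = []               # (true rho, n) for originals with p < 0.05
while len(published) < 2000:
    rho = min(abs(random.gauss(0, SCALE)), 0.99)   # a priori effect size
    n = random.choice(NS)
    z = random.gauss(math.atanh(rho), 1 / math.sqrt(n - 3))
    if two_tailed_p(z, n) < 0.05:                  # publication filter
        published.append((rho, n))

successes = 0
for rho, n in published:     # exact replication: same population, same n
    z_rep = random.gauss(math.atanh(rho), 1 / math.sqrt(n - 3))
    if two_tailed_p(z_rep, n) < 0.05:
        successes += 1
replication_rate = successes / len(published)
print(f"replication rate: {replication_rate:.2f}")
```

The aggregate rate this produces depends on the assumed prior shape, so it need not match the talk's 40-45% exactly; the qualitative result, a replication rate well below the nominal power people expect, is robust.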
Final Step: other r & n
[Figure: proportion of successful replications (0 to 1) against p-value for the original sample (1e-05 to 0.1, log scale), for n = 10, 20, 40 and 80. Each dot is one study: r is random, n as indicated. Mean replication rate = 40-45%.]
Discussion 1
• The only free parameter in this is
– the a priori distribution of effect sizes in Psychology.
• The result also depends on
– the distribution of n in Psychology
Discussion 1
• If you accept this, then it necessarily follows:
– if: everyone is behaving impeccably
– then: p(replication) is 40-45% in psychology
• or
– if: p(replication) is 40-45% in psychology
– then: everyone is behaving impeccably
• What about other a priori distributions?
Overall Summary
• effects of the a priori assumption on outcome:
[Figure: we thought %replication should be up here; actually, %replication should be down here.]
Discussion 1(a)
– if: everyone is behaving impeccably
– then: p(replication) is typically 40-45%
This is inescapable.
The End
• very much inspired by:
Cumming, G. (2012) Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis
Start with a population that generates a power of 42%, to match the typical value for Psychology: r=0.3, n=35.
Only the green studies are published.
This is the distribution of all expected effect sizes (r=0.3, n=35).
[Figure: all measured effect sizes; mean(r) = 0.3.]
But this is the distribution of published effect sizes (r=0.3, n=35).
[Figure: published effect sizes (ie those where p<0.05); mean(r) = 0.45.]
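The inflation from mean r = 0.3 to mean r ≈ 0.45 is easy to reproduce. A sketch under the Fisher-z approximation (our shorthand for drawing sample effect sizes, not the authors' method): draw many studies, keep the significant ones, and compare the two means.

```python
import math
import random

random.seed(5)

RHO, N = 0.3, 35
SE = 1 / math.sqrt(N - 3)        # Fisher-z standard error
Z_CRIT = 1.96 * SE               # significance cutoff (p < 0.05, 2-tailed)

all_r, published_r = [], []
for _ in range(20000):
    z = random.gauss(math.atanh(RHO), SE)   # one study's sample effect size
    r = math.tanh(z)
    all_r.append(r)
    if abs(z) > Z_CRIT:                     # "published": p < 0.05
        published_r.append(r)

mean_all = sum(all_r) / len(all_r)
mean_pub = sum(published_r) / len(published_r)
power = len(published_r) / len(all_r)
print(f"mean r (all): {mean_all:.2f}, mean r (published): {mean_pub:.2f}, "
      f"power: {power:.2f}")
```

The fraction of significant studies also recovers the actual power of roughly 42% quoted above.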
Now we ask what the power is when calculated from the published sample effect size of 0.45. The answer is 80%, compared with the actual power of 42%.
Power Calculations
• Actual power = 42%
• Power calculated from published effect sizes =
80%
• This difference arises because published effect
sizes are over-estimates
– caused by publication bias
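Both numbers can be recovered from a normal-approximation power calculation on the Fisher-z scale (our sketch; the slide's exact 80% sits slightly above this approximation):

```python
import math

def power_two_tailed(rho, n, alpha_z=1.96):
    """Approximate power of a two-tailed correlation test via Fisher's z."""
    se = 1 / math.sqrt(n - 3)
    zeta = math.atanh(rho) / se          # noncentrality on the z scale
    Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
    return (1 - Phi(alpha_z - zeta)) + Phi(-alpha_z - zeta)

print(f"actual power (rho=0.30, n=35):  {power_two_tailed(0.30, 35):.2f}")  # 0.42
print(f"apparent power (r=0.45, n=35):  {power_two_tailed(0.45, 35):.2f}")  # 0.78
```

Plugging in the published (inflated) effect size of 0.45 nearly doubles the apparent power, which is exactly the over-estimate the slide describes.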
[Figure: Apparent (Sample) Power plotted against Real (Population) Power, both on a 0 to 1 scale.]
This graph shows how much power is over-estimated.
What can we do?
• anything that increases the Type II error rate will make matters worse
– eg reducing alpha (0.05)
• anything that decreases the Type II error rate might make matters better**
– eg adaptive sampling
** except that 42% power may be maximally reinforcing for the researcher
null hypothesis testing - the outcomes

                 Population
              no effect       has effect
p<0.05        Type I error    correct
p>0.05        correct         Type II error
A pair of studies, with different outcomes, occupy the same column. It cannot be known which column. So a failure to replicate implies either of:
i) the 1st study made a Type I error & the 2nd study is correct
ii) the 1st study is correct & the 2nd study made a Type II error
It cannot be known which.
• we get:
– p(Type I error):
• given the null hypothesis is true,
• the probability of obtaining your result or better
• we want:
– p(Type II error):
• given the null hypothesis is not true,
• the probability of obtaining your result or better
• think about:
– given it is a weekday,
– the probability of dying in hospital
• compared with:
– given it is not a weekday,
– what is the probability of dying in hospital?
• these are unrelated
The effects of publication bias on the effect sizes that are seen.
[Figure: red: n=10; green: n=20; blue: n=40; yellow: n=80.]
Finding the best-fit value for the prior effect size: minimum chi-square at 0.28.
[Figure: frequency of effect sizes 1) in the Science paper, 2) in the whole simulated population (with ef=0.28), 3) in simulated published studies.]