Can I Have a P-value For That, Please?
Christopher J. Miller
Associate Director, Biostatistics
AstraZeneca, LP
[email protected]
Outline
Definitions
Quiz
Hypothesis testing and Power
no math
philosophy
Things that make no sense to me
testing for differences at baseline
post-hoc power calculations
Biostatistics
A term which ought to mean “statistics for
biology” but is now increasingly reserved
for medical statistics.
S. Senn
Biostatistician
One who has neither the
intellect for mathematics nor
the commitment for medicine,
but likes to dabble in both.
S. Senn
Biometrics
An alternative name for statistics,
especially if applied to the life sciences.
The advantage of the name compared to
statistics is that the general public does not
understand what it means, whereas with
statistics the general public thinks it
understands what it means.
S. Senn
Quiz time!
A 95% Confidence Interval of (5 to 11) for the population mean implies:
1. The probability that the true mean is between 5 and 11 is 0.95 (95%).
2. Ninety-five percent of the time (for 95% of samples) the interval will include the true mean. Five to 11 is one such interval.
3. Five to 11 covers 95% of the possible values of the true mean.
A 95% Confidence Interval of (5 to 11) for the population mean implies:
1. The probability that the true mean is between 5 and 11 is 0.95 (95%).
2. Ninety-five percent of the time (for 95% of samples) the interval will include the true mean. Five to 11 is one such interval. (correct)
3. Five to 11 covers 95% of the possible values of the true mean.
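Interpretation 2 is the frequentist one, and it can be checked by simulation. The following is a minimal sketch (my own, not from the slides): draw repeated samples from a population with a known mean and count how often the 95% interval covers it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, sigma, n, trials = 8.0, 5.0, 30, 10_000

covered = 0
for _ in range(trials):
    sample = rng.normal(true_mean, sigma, n)
    m = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(0.975, df=n - 1)   # critical value for a 95% CI
    covered += (m - t_crit * se) <= true_mean <= (m + t_crit * se)

# About 95% of the intervals cover the true mean; any single interval,
# such as (5 to 11), either covers it or it does not.
print(f"coverage: {covered / trials:.3f}")
```

The probability statement is about the long-run procedure, not about any one interval.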
A p-value < 0.05:
1. Assuming the treatment is not effective, there is less than a 5% chance of obtaining such results.
2. The observed effect from the treatment is so large that there is less than a 5% chance that the treatment truly is no better than placebo.
3. On average, fewer than 5% of placebo-treated patients will do better than active-treated patients.
A p-value < 0.05:
1. Assuming the treatment is not effective, there is less than a 5% chance of obtaining such results. (correct)
2. The observed effect from the treatment is so large that there is less than a 5% chance that the treatment truly is no better than placebo.
3. On average, fewer than 5% of placebo-treated patients will do better than active-treated patients.
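Interpretation 1 can likewise be checked by simulation (my own sketch, not from the slides): when the null hypothesis is true, p < 0.05 occurs in about 5% of experiments.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, trials = 25, 10_000

# Both "active" and "placebo" are drawn from the same population,
# so the null hypothesis is true by construction.
false_positives = 0
for _ in range(trials):
    active = rng.normal(0, 1, n)
    placebo = rng.normal(0, 1, n)
    _, p = stats.ttest_ind(active, placebo)
    false_positives += p < 0.05

print(f"type I error rate: {false_positives / trials:.3f}")  # about 0.05
```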
Thoughts
How many people got both correct?
P-values and confidence intervals are often
misinterpreted.
P-values and confidence intervals do not
necessarily answer a relevant question.
Misunderstandings lead us to present
analyses that are nonsensical.
Hypothesis Testing
Question: Is the average effect of active
treatment better than that of placebo?
Null Hypothesis: Assume that there is no effect.
H0: μA = μP, or μA − μP = 0
Alternative Hypothesis:
Ha: μA > μP, or μA − μP > 0
Hypothesis testing (cont’d)
Assume Ho is true (true means equal)
Choose an analysis model and study design
Power the study
Run an experiment
Collect data
See if you have enough evidence to reject Ho
Ho not false until proven false
Ho is never proven to be true
“not guilty, until proven guilty”
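The steps above can be sketched with a two-sample t-test. The numbers here are invented for illustration (a true shift of 2 units is assumed), not from any trial:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical data: active shifts the outcome by 2 units versus placebo
# (invented numbers, chosen only to illustrate the procedure).
active = rng.normal(2.0, 1.0, 40)
placebo = rng.normal(0.0, 1.0, 40)

# Test Ho: muA = muP against Ha: muA > muP at the 5% level.
t, p = stats.ttest_ind(active, placebo, alternative="greater")
reject = p < 0.05     # enough evidence to reject Ho?
print(f"t = {t:.2f}, p = {p:.2g}, reject Ho: {reject}")
```

A large p-value would mean only that Ho was not rejected, never that Ho was proven true.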
Hypothesis Testing Essentials
Population
Parameters
Probabilities are related to long-run
relative frequency of events in a series
of trials
Essentials: Population
“A largely theoretical concept which refers to
a (sometimes infinite or undefined) totality of
observations of interest.”
Example: All potential patients who might use
a new drug.
Essentials: Parameters
Used in conjunction with an underlying
population
“A function of the values of this population
which define their distribution”
Unobservable and unknowable
Nature, God, Truth
Example: Population mean or variance
When similar functions are calculated from a
sample, they are called “statistics”.
Essentials: Probabilities and decisions
Parameters cannot have a probability
They are either equal to some value or not
Hypotheses cannot have a probability
They are either true or false
A decision to accept or reject a hypothesis is made
indirectly using the probability of the evidence
given the hypothesis, rather than vice versa.
Errors in decisions are controlled, on average,
based on an assumed series of results.
A 95% Confidence Interval of (5 to 11) for the population mean implies:
1. The probability that the true mean is between 5 and 11 is 0.95 (95%).
2. Ninety-five percent of the time (for 95% of samples) the interval will include the true mean. This is one such interval. (correct)
3. Five to 11 covers 95% of the possible values of the true mean.
A p-value < 0.05:
1. Assuming the treatment is not effective, there is less than a 5% chance of obtaining such results. (correct)
2. The observed effect from the treatment is so large that there is less than a 5% chance that the treatment truly is no better than placebo.
3. On average, only 5% of placebo-treated patients will do better than active-treated patients.
Things that make no sense to me #1
Baseline differences
You’re reporting on a randomized, parallel-group trial. Active versus placebo.
To your dismay, the groups appear to have
been “different” at baseline
Mean (SD): 23 (2.3) versus 32 (2.7)
We need a p-value to tell us “how different”
they are!
P<0.05 tells us the study is uninterpretable, right?
The test
What is the “deep structure”?
Population?
Parameter of interest?
Long-term process?
Decision rule’s meaning?
Point?
Problem
Test appears to say something about the
adequacy of the given allocation, whereas
it can only be a test of the allocation
procedure.
What are we testing?
Null Hypothesis
The process of randomization will result
in balance across treatment groups.
Population
All possible random assignments of
patients to treatment.
What are we saying when p < 0.05?
When comparing 2 drugs after treatment
….the difference is rather large to be caused by chance
alone, therefore chance must not be the whole explanation.
Infer that the drugs have an effect on outcome.
Null hypothesis is not true.
When comparing 2 drugs before treatment
….the difference is rather large to be caused by chance
alone, therefore chance must not be the whole explanation.
Infer that randomization has not taken place???…fraud???
Type I error???…inadequate sample???
Bottom line
The underlying problem is that
randomization is, by definition, a chance
mechanism!
So, no matter what the p-value is – unless
we are willing to accept tampering as a
possibility – we need to conclude that
something unusual has happened because of
CHANCE alone!
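The point can be demonstrated by simulation (my own sketch, not from the slides): randomize one pool of patients into two groups over and over, and a baseline test flags "imbalance" at p < 0.05 in about 5% of allocations, by chance alone.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_patients, trials = 60, 10_000

significant = 0
for _ in range(trials):
    baseline = rng.normal(25, 5, n_patients)  # one pool of patients
    assignment = rng.permutation(n_patients)  # pure randomization
    group_a = baseline[assignment[:30]]
    group_b = baseline[assignment[30:]]
    _, p = stats.ttest_ind(group_a, group_b)
    significant += p < 0.05

# By construction only chance separates the groups, so a baseline
# "imbalance" at p < 0.05 appears in about 5% of randomizations.
print(f"fraction of 'imbalanced' trials: {significant / trials:.3f}")
```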
Further silliness
Baseline imbalance does not necessarily
mean that meaningful treatment inferences
cannot be made
P-value for baseline test has no relation to
the ability to make valid treatment
comparisons at the end of the trial.
Solutions
ANCOVA
Answers the question: “If both groups had had
average overall baseline values, what treatment
difference would we have seen?”
Makes an average allowance for imbalance
Stratification
Allows valid treatment comparisons within each stratum.
Need to think of this before the trial if you want
to do it correctly.
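The ANCOVA idea can be sketched with statsmodels, using invented numbers (a true treatment effect of 3 units and a known baseline slope are assumptions of this example): regress the outcome on baseline and treatment, then read off the baseline-adjusted treatment effect.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 100

# Hypothetical trial: outcome depends on baseline plus a treatment
# effect of 3 units, with random noise.
baseline = rng.normal(25, 5, n)
treatment = np.repeat([0, 1], n // 2)
outcome = 0.8 * baseline + 3.0 * treatment + rng.normal(0, 2, n)

df = pd.DataFrame({"outcome": outcome, "baseline": baseline,
                   "treatment": treatment})

# ANCOVA: adjust the treatment comparison for baseline, answering
# "what difference would we have seen at equal baseline values?"
model = smf.ols("outcome ~ baseline + treatment", data=df).fit()
print(model.params["treatment"])   # close to the true effect of 3
```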
In short…
The fact that baseline tests are commonly
performed without much apparent harm is no
more of a defense than saying of the policy of
treating viruses with antibiotics that most
patients recover.
S. Senn
Power
Power
Systems are subject to random variation
otherwise, why would we experiment?
our lives would be simple without it
We try to see through the random variation
(noise) and determine the true effect (signal)
Power (cont’d)
How?
Well-planned, adequately-powered experiments
Loose definition of power:
“The probability that a statistically
significant difference will be found when the null
hypothesis is false (i.e., when the treatments
truly are not equal).”
What Determines Power?
Hypothesis and model
Sample size
Variability among observations
What risk are you willing to take of wrongly
rejecting Ho?
How small of a difference among treatments
do you need to detect?
Calculating Power
Determine variable of primary interest
mean change from baseline in symptoms
Determine comparison of primary interest
and null hypothesis
assume mean active is the same as placebo
Determine analysis method
ANCOVA
Calculating Power (cont’d)
Get an estimate of population variability
among experimental units (Sigma)
literature
pilot/previous trials
can be a joke
Determine smallest difference between
treatments you would like to detect (Delta)
often a joke
Clinically relevant difference
A somewhat nebulous concept with various
conventions used by statisticians in their
power calculations and incidentally,
therefore, a means by which they drive their
medical colleagues to distraction. This is
used in the theory of clinical trials, as
opposed to the cynically relevant difference,
which is used in the practice.
S. Senn
Calculating Power (cont’d)
Determine risk you’re willing to take of
wrongly rejecting Ho
Type I error (α)
Decide there’s an effect when there really isn’t one
“false conviction”
set low at 5%, but arbitrary
Calculating Power (cont’d)
Sample size (n) and Power are the only elements left!
[Figure: power curve, with Power (%) from 50 to 100 on the y-axis and Sample Size per Group (n) from 50 to 250 on the x-axis.]
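With everything else fixed, a power-versus-sample-size curve like the one in the figure can be computed directly. A sketch using statsmodels, where the planning values for Delta and Sigma are assumptions of this example:

```python
from statsmodels.stats.power import TTestIndPower

# Hypothetical planning numbers: smallest relevant difference (Delta)
# of 2 units and an assumed population SD (Sigma) of 5 units.
delta, sigma, alpha = 2.0, 5.0, 0.05
effect_size = delta / sigma          # standardized effect (Cohen's d)

analysis = TTestIndPower()
for n in (50, 100, 150, 200, 250):
    power = analysis.power(effect_size=effect_size, nobs1=n, alpha=alpha)
    print(f"n per group = {n:3d}  power = {power:.2f}")

# Or solve for the sample size that gives 90% power:
n_needed = analysis.solve_power(effect_size=effect_size, power=0.90,
                                alpha=alpha)
print(f"n per group for 90% power: {n_needed:.0f}")
```

Changing any assumption (Delta, Sigma, alpha, the model) changes the whole curve, which is why the estimates feeding it matter so much.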
Summary of power
Power is a function of:
hypothesis being tested
statistical model
sample size
assumed variability of population
risk you’re willing to take
minimum “relevant effect size”
No guarantees
Working definition
Power is the probability of a possible
outcome of a potential decision conditional
upon an imaginable circumstance given a
conceivable value of an algebraic embodiment
of an abstract mathematical idea and the strict
adherence to an extremely precise rule.
S. Senn
Things that make no sense to me #2
Post-hoc power calculations
Suppose we’ve run a well-designed and
adequately-powered study that “fails”
“fails” usually means p>0.05.
We need an excuse.
Post-hoc power calculations
Obviously, the study was underpowered!
assume that the variability was larger than
anticipated
the sample size was therefore too small
all other assumptions were fine
What was the “actual power” of this wimpy
study?
So, you see, the drug probably does work!
…I am just a terrible scientist.
Post-hoc power calculations
How do you pick which assumptions were
correct/incorrect when recalculating power?
Arbitrary
Ridiculous to do based on the results of 1 study
A view that I support
“The power of a trial is a useful concept when
planning the trial but has little relevance to the
interpretation of its results.” (S. Senn)
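A sketch of why post-hoc power adds nothing (my own illustration, not from the slides): for a z-test, "observed power" computed by plugging the observed effect in as if it were the true effect is just a transformation of the p-value. A trial with p = 0.05 always has observed power of about 50%, no matter what was studied.

```python
from scipy.stats import norm

def observed_power(p_value, alpha=0.05):
    """'Post-hoc power' for a two-sided z-test, treating the observed
    effect as if it were the true effect (the questionable step);
    the negligible far-tail rejection probability is ignored."""
    z_obs = norm.ppf(1 - p_value / 2)
    z_alpha = norm.ppf(1 - alpha / 2)
    return 1 - norm.cdf(z_alpha - z_obs)

for p in (0.05, 0.10, 0.20):
    print(f"p = {p:.2f} -> 'observed power' = {observed_power(p):.2f}")
```

Since observed power is determined by the p-value already in hand, recomputing it cannot rescue a "failed" study.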
Conclusion
…Be careful!
References
Lang T, Secic M. How to Report Statistics in Medicine. 1997.
Senn S. Statistical Issues in Drug Development. 1997.