Interpreting p-values


Understanding p-values
Annie Herbert
Medical Statistician
Research and Development Support Unit
0161 2064567
Outline
• Population & Sample
• What is a p-value?
• P-values vs. Confidence Intervals
• One-sided and two-sided tests
• Multiplicity
• Common types of test
• Computer outputs
Timetable
Time       Task
60 mins    Presentation
20 mins    Coffee Break
90 mins    Practical Tasks in IT Room
‘Population’ and ‘Sample’
• Studying population of interest
• Usually would like to know typical value and
spread of outcome measure in population
• Data from entire population usually impossible
or inefficient/expensive so take a sample
(even census data can have missing values)
• Want sample to be ‘representative’ of population
• Randomise
Randomised Controlled Trial (RCT)
[Flow diagram: POPULATION → SAMPLE → RANDOMISATION → GROUP 1 → OUTCOME and GROUP 2 → OUTCOME]
5 Key Questions
• What is the target population?
• What is the sample, and is it representative of
the target population?
• What is the main research question?
• What is the main outcome?
• What is the main explanatory factor?
Example – Dolphin Study
• Population: people suffering mild to moderate
depression
• Sample: outpatients diagnosed with mild to moderate
depression, recruited through internet, radio,
newspapers and hospitals
• Question: does animal-facilitated therapy help treatment
of depression?
• Outcome: Hamilton depression score at baseline and
end of treatment
• Explanatory Factors: whether patients participated in
dolphin programme (treatment) or outdoor nature
programme (control)
Dolphin Study - Making Comparisons
Hamilton Depression Score    Baseline      2 Weeks       Reduction
                             Mean (SD)     Mean (SD)     Mean (SD)
Treatment Group (N=15)       14.5 (2.6)    7.3 (2.5)     7.3 (3.5)
Control Group (N=15)         14.5 (2.2)    10.9 (3.4)    3.6 (3.4)
BMJ - Antonioli & Reveley, 2005;331:1231 (26 November)
Dolphin Study - does the treatment make a difference?
• For both groups the Hamilton depression score
decreased between baseline and 2 weeks
• Clearly for our sample the treatment group has a
better mean reduction by:
7.3 - 3.6 = 3.7 points
• What does this tell us about the target population?
What is a p-value?
• Assume that there is really no difference in the
target population (this is the null hypothesis)
• p-value: how likely is it that we would see at
least as much difference as we did in our
sample?
• Dolphin study example: if treatments are equally
effective, how likely is it that we would see a
difference in mean reduction between the
treatment and control groups of at least 3.7
points? P=0.007
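The definition above can be illustrated by simulation. The sketch below is not the analysis used in the study. It assumes normally distributed outcomes, equal variances in both groups, and a pooled two-sample t statistic, and it works only from the rounded summary values in the comparison table. It generates many pairs of groups of 15 under the null hypothesis and counts how often the t statistic is at least as extreme as the one observed. The function names are hypothetical.

```python
import random

N_PER_GROUP = 15  # group size from the Dolphin study table


def mean_and_variance(values):
    """Sample mean and sample variance (n - 1 denominator)."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / (len(values) - 1)
    return m, v


def pooled_t(group_a, group_b):
    """Two-sample t statistic using a pooled variance estimate."""
    na, nb = len(group_a), len(group_b)
    ma, va = mean_and_variance(group_a)
    mb, vb = mean_and_variance(group_b)
    pooled_var = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (ma - mb) / (pooled_var * (1 / na + 1 / nb)) ** 0.5


# Observed t statistic, rebuilt from the rounded summary values
# (mean reductions 7.3 and 3.6, SDs 3.5 and 3.4, equal group sizes).
obs_pooled_var = (3.5 ** 2 + 3.4 ** 2) / 2
obs_t = (7.3 - 3.6) / (obs_pooled_var * 2 / N_PER_GROUP) ** 0.5


def simulate_p_value(n_sims=20000, seed=1):
    """Share of null-hypothesis simulations at least as extreme as obs_t."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_sims):
        # Under the null both groups come from the same distribution.
        # The t statistic does not depend on its mean or SD, so N(0, 1) is used.
        g1 = [rng.gauss(0, 1) for _ in range(N_PER_GROUP)]
        g2 = [rng.gauss(0, 1) for _ in range(N_PER_GROUP)]
        if abs(pooled_t(g1, g2)) >= abs(obs_t):
            extreme += 1
    return extreme / n_sims


p_sim = simulate_p_value()
print(f"observed t = {obs_t:.2f}, simulated two-sided p = {p_sim:.4f}")
```

Under these assumptions the simulated proportion should land near the reported P=0.007. It will not match exactly, because it is a random estimate built from rounded inputs.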
Assessing the p-value
• Large p-value:
– Quite likely to see these results by chance
– Cannot be sure of a difference in the target
population
• Small p-value:
– Unlikely to see these results by chance
– There may be a difference in the target
population
What is a small/large p-value?
• Cut-off point (‘significance level’) is arbitrary
• Significance level set to 5% (0.05) by convention
• Regard the p-value as the ‘weight of evidence’
• P < 5%: strong evidence of a difference
• P ≥ 5%: no evidence of a difference
(does not mean evidence of no difference)
Types of Statistical Error
• Type I Error = rejecting the null hypothesis
when it is in fact true (a ‘false positive’).
• Type II Error = not rejecting the null
hypothesis when it is in fact false
(a ‘false negative’).
Confidence Intervals
• Confidence interval = “range of values that we
can be confident will contain the true value of the
population”
• The “give or take a bit” for best estimate
• Dolphin study example: what is the range of
values that we can be confident contains the
true difference of mean reduction between
treatment and control group?
(95% CI: 1.1 to 6.2)
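As a rough check on where such an interval comes from, the sketch below computes a 95% confidence interval for the difference in mean reduction from the rounded summary values in the comparison table. It assumes a pooled-variance t interval with 28 degrees of freedom and takes the critical value 2.048 from standard t tables. The study's own calculation may have differed.

```python
import math

# Rounded summary values from the Dolphin study table.
n_treat = n_ctrl = 15
mean_treat, sd_treat = 7.3, 3.5   # reduction in treatment group
mean_ctrl, sd_ctrl = 3.6, 3.4     # reduction in control group

# Best estimate: difference in mean reduction.
diff = mean_treat - mean_ctrl

# Pooled variance and standard error of the difference (assumes equal variances).
pooled_var = ((n_treat - 1) * sd_treat ** 2 + (n_ctrl - 1) * sd_ctrl ** 2) / (
    n_treat + n_ctrl - 2
)
se = math.sqrt(pooled_var * (1 / n_treat + 1 / n_ctrl))

# Two-sided 95% critical value of the t distribution with 28 degrees of
# freedom, taken from standard tables (assumption: pooled t interval).
T_CRIT_DF28 = 2.048

lower = diff - T_CRIT_DF28 * se
upper = diff + T_CRIT_DF28 * se
print(f"difference = {diff:.1f}, 95% CI = ({lower:.1f}, {upper:.1f})")
```

This gives an interval of roughly 1.1 to 6.3, close to the reported 1.1 to 6.2. The small gap is presumably due to rounding in the published summary values.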
p-values vs. Confidence Intervals
• p-value:
- Weight of evidence to reject null hypothesis
- No clinical interpretation
• Confidence Interval:
- Can be used to reject null hypothesis
- Clinical interpretation
- Effect size
- Direction of effect
- Precision of population estimate
Statistical Significance vs. Clinical Importance
• p-value < 0.05, CI doesn’t contain 0: indicates a statistically
significant difference.
• What is the size of this difference, and is it enough to
change current practice?
• E.g. Dolphin study:
- P=0.007
- 95% CI = (1.1, 6.2)
• Expense? Side-effects? Ease of use?
• Consider clinically important difference when making
sample size calculations/interpreting results
One-sided & Two-sided Tests
• One-sided test: a difference is only possible
in one particular direction.
• Two-sided test: interested in any difference
between groups, whether worse or better.
Dolphin study example: is the mean reduction in
the treatment group less than or greater than the
mean reduction in the control group?
• In real life, almost always two-sided.
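To show how the two kinds of test relate numerically, the sketch below computes one-sided and two-sided p-values for the Dolphin study. It assumes a pooled two-sample t-test with 28 degrees of freedom and uses only the rounded summary values (mean reductions 7.3 and 3.6, SDs 3.5 and 3.4, 15 per group). The tail area of the t distribution is obtained by numerical integration. The function names are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return const * (1 + x * x / df) ** (-(df + 1) / 2)


def upper_tail(t, df, steps=2000):
    """P(T >= t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return 0.5 - area_0_to_t  # half the distribution lies above zero


# t statistic rebuilt from the rounded summary values (pooled variance).
n = 15
pooled_var = (3.5 ** 2 + 3.4 ** 2) / 2
t_obs = (7.3 - 3.6) / math.sqrt(pooled_var * 2 / n)
df = 2 * n - 2

one_sided = upper_tail(t_obs, df)   # difference in one direction only
two_sided = 2 * one_sided           # difference in either direction
print(f"t = {t_obs:.2f}, one-sided p = {one_sided:.4f}, two-sided p = {two_sided:.4f}")
```

Because the t distribution is symmetric, the two-sided p-value is simply twice the one-sided value. Under these assumptions the two-sided result is close to the reported P=0.007.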
Multiplicity
E.g. Significance level = 0.05
On average, 1 in 20 tests will be ‘significant’, even when
there is no difference in the target population
Number of tests    Chance of at least one significant value
1                  0.05
2                  0.10
3                  0.14
5                  0.23
10                 0.40
20                 0.64
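The figures in this table are consistent with the formula 1 - (1 - 0.05)^k for k tests. That formula assumes the tests are independent and that there is no real difference in the target population, an assumption the slide does not state. A short sketch:

```python
ALPHA = 0.05  # significance level used on the slide

# Chance of at least one 'significant' result among k independent tests
# when the null hypothesis is true for every one of them.
chances = {}
for k in (1, 2, 3, 5, 10, 20):
    chances[k] = 1 - (1 - ALPHA) ** k
    print(f"{k:>2} tests: {chances[k]:.4f}")
```

Rounded to two decimal places, these values match the table. If the tests are correlated (for example, several related outcomes measured on the same patients), the true chance will differ.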
Reducing Multiplicity Problems
• Pick one outcome to be primary
• Specify tests in advance
• Focus on research question and keep
number of tests to a minimum
• Do not necessarily believe a single significant
result (repeat experiment, use meta-analysis)
Types of Outcome Data
            Categorical             Numerical/Continuous
Example     Yes/No                  Weight
Graphs      Bar/Pie Chart           Histogram/Boxplot
Summary     Frequency/Proportion    Mean (SD) or Median (IQR)
Test        Chi-squared             t-test or Mann-Whitney U (two groups)
Notable Exceptions
• Comparing more than two groups
• Continuous explanatory factors
• Paired Data:
- Paired t-test
- Wilcoxon
- McNemar
• Time-to-event Data: Log-rank test
(For all of the above, seek statistical advice)
Computer Output - StatsDirect
Computer Output - SPSS
Final Pointers
• Plan analyses in advance
– Seek statistical advice
• Start with graphs and summary statistics
• Keep number of tests to a minimum
• Include confidence intervals
• ‘Absence of evidence is not evidence of absence’