Significance testing „Hypothesis testing and estimation”


Significance testing
and confidence intervals
Ágnes Hajdu
EPIET Introductory course
3.10.2011
The idea of statistical inference
Generalisation to the population
Conclusions based
on the sample
Population
Hypotheses
Sample
2
Inferential statistics
• Uses patterns in the sample data to draw
inferences about the population represented,
accounting for randomness.
• Two basic approaches:
– Hypothesis testing
– Estimation
• Common goal: conclude on the effect of an
independent variable (exposure) on a
dependent variable (outcome).
3
The aim of a statistical test
To reach a scientific decision (“yes” or “no”) on
a difference (or effect), on a probabilistic basis,
on observed data.
4
Why significance testing?
Botulism outbreak in Italy:
“The risk of illness was higher among diners who
ate home preserved green olives (RR=3.6).”
Is the association due to chance?
5
The two hypotheses
Null Hypothesis (H0):
There is NO difference between the two groups
(= no effect)
(e.g. RR=1)
Alternative Hypothesis (H1):
There is a difference between the two groups
(= there is an effect)
(e.g. RR=3.6)
When you perform a test of statistical significance you usually
reject or do not reject the Null Hypothesis (H0)
6
Botulism outbreak in Italy
• Null hypothesis (H0): “There is no
association between consumption of green
olives and Botulism.”
• Alternative hypothesis (H1): “There is an
association between consumption of green
olives and Botulism.”
7
Hypothesis testing and the null
hypothesis
• Tests of statistical significance
• Data not consistent with H0 :
– H0 can be rejected in favour of some alternative
hypothesis H1 (the objective of our study).
• Data are consistent with the H0 :
– H0 cannot be rejected
You cannot say that the H0 is true.
You can only decide to reject it or not reject it.
8
How to decide when to reject the
null hypothesis?
H0 rejected using reported p value
p-value = probability that our result (e.g. a
difference between proportions or a RR) or
more extreme values could be observed under
the null hypothesis
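The definition above can be made concrete with an exact calculation. The following is a minimal sketch, not part of the original slides: it takes the counts of the botulism 2x2 table given later in this lecture (135 diners, 13 ill, 52 ate olives, 9 of those ill) and adds up the probabilities of the observed count and of all more extreme (larger) counts, assuming illness is unrelated to eating olives. This is the one-sided form of Fisher's exact test; the function names are illustrative, and the result is not expected to equal the chi-square p value quoted in the slides.

```python
from math import comb

def hypergeom_pmf(k, total, ill, exposed):
    """P(exactly k ill among the exposed) if illness is unrelated to exposure (H0)."""
    return comb(ill, k) * comb(total - ill, exposed - k) / comb(total, exposed)

def one_sided_exact_p(observed, total, ill, exposed):
    """Probability of the observed count or a more extreme (larger) one under H0."""
    upper = min(ill, exposed)
    return sum(hypergeom_pmf(k, total, ill, exposed)
               for k in range(observed, upper + 1))

# Counts from the botulism table in these slides.
p_exact = one_sided_exact_p(observed=9, total=135, ill=13, exposed=52)
print(f"one-sided exact p = {p_exact:.4f}")
```

A small value means that a result like the observed one would be unusual if H0 were true.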
9
p values – practicalities
Small p values = low degree of compatibility between H0
and the observed data:
you reject H0, the test is significant
Large p values = high degree of compatibility between H0
and the observed data:
you don’t reject H0, the test is not
significant
We can never reduce to zero the probability
that our result was observed by chance alone
10
Levels of significance – practicalities
We need a cut-off!
0.01
0.05
0.10
p value > 0.05 = H0 not rejected (not significant)
p value ≤ 0.05 = H0 rejected (significant)
BUT:
Always give the exact p-value rather than “significant”
vs. “non-significant”.
11
Examples from the literature
• ”The limit for statistical significance was set at p=0.05.”
• ”There was a strong relationship (p<0.001).”
• ”…, but it did not reach statistical significance (ns).”
• ”The relationship was statistically significant (p=0.0361).”
p=0.05: an agreed convention,
not an absolute truth
”Surely, God loves the 0.06 nearly as much
as the 0.05” (Rosnow and Rosenthal, 1991)
12
p = 0.05 and its errors
• Level of significance, usually p = 0.05
• p value used for decision making
But still 2 possible errors:
H0 should not be rejected, but it was rejected :
Type I or alpha error
H0 should be rejected, but it was not rejected :
Type II or beta error
13
Types of errors
                                   Truth
Decision based on the p value      No diff (H0 not to be rejected)   Diff (H0 to be rejected, H1)
H0 not rejected (No diff)          Right decision (1-α)              Type II error (β)
H0 rejected (H1) (Diff)            Type I error (α)                  Right decision (1-β)
• H0 is “true” but rejected: Type I or α error
• H0 is “false” but not rejected: Type II or β error
14
More on errors
• Probability of Type I error:
– Value of α is determined in advance of the test
– The significance level is the level of α error that we
would accept (usually 0.05)
• Probability of Type II error:
– Value of β depends on the size of effect (e.g. RR, OR)
and sample size
– 1-β: Statistical power of a study to detect an effect of
a specified size (e.g. 0.80)
– Fix β in advance: choose an appropriate sample size
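How β depends on effect size and sample size can be sketched numerically. The example below is an illustration only: it uses a common normal approximation for a two-sided comparison of two proportions with equal group sizes, and the risks of 20% and 10% are hypothetical values chosen for the example, not data from the slides.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p_exposed, p_unexposed, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided test comparing two proportions.

    Normal approximation; the negligible contribution of the opposite tail is ignored.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(p_exposed * (1 - p_exposed) / n_per_group
              + p_unexposed * (1 - p_unexposed) / n_per_group)
    return NormalDist().cdf(abs(p_exposed - p_unexposed) / se - z_alpha)

# Hypothetical risks: 20% among exposed, 10% among unexposed.
for n in (50, 100, 200, 400):
    power = approx_power(0.20, 0.10, n)
    print(f"n per group = {n:3d}: power = {power:.2f}, beta = {1 - power:.2f}")
```

With these assumed risks, roughly 200 persons per group are needed to reach a power of about 0.80; smaller groups leave a larger β.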
15
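Even more on errors
[Figure: distributions of the test statistic T when H0 is true and when H1 is true, with the decision according to the p value. Marked areas: 1-α (H0 true, H0 not rejected: ok), α (H0 true, H0 rejected: error of the 1st kind, significance level), β (H1 true, H0 not rejected: error of the 2nd kind), 1-β (H1 true, H0 rejected: ok, power)]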
16
Principles of significance testing
• Formulate the H0
• Test your sample data against H0
• The p value tells you whether your data are
consistent with H0
i.e., whether your sample data are consistent with a
chance finding (large p value), or whether there is
reason to believe that there is a true difference
(association) between the groups you tested
• You can only reject H0, or fail to reject it!
17
Quantifying the association
• Test of association of exposure and outcome
• E.g. Chi2 test or Fisher’s exact test
• Comparison of proportions
• Chi2-value quantifies the association
• The larger the Chi2-value, the smaller the p value
– the more the observed data deviate from the
assumption of independence (no effect).
18
Chi-square value
= sum of all cells: for each cell, subtract the
expected number from the observed number,
square the difference, and divide by the
expected number
$$\chi^2 = \sum \frac{(\text{observed num.} - \text{expected num.})^2}{\text{expected num.}}$$
19
Botulism outbreak in Italy
2x2 table
             Ill   Non ill   Total
Olives         9        43      52
No olives      4        79      83
Total         13       122     135
Expected proportion of ill and not ill: 10 % ill, 90 % non-ill
Expected number of ill and not ill for each cell:
             Ill              Non ill
Olives       52 x 10% = 5     52 x 90% = 47
No olives    83 x 10% = 8     83 x 90% = 75
20
Chi-square value
Botulism outbreak in Italy
             Ill                  Non ill
Olives       (9 - 5.01)²/5.01     (43 - 46.99)²/46.99
No olives    (4 - 7.99)²/7.99     (79 - 75.01)²/75.01
χ² = 5.73
p = 0.016
21
Botulism outbreak in Italy
“The relative risk (RR) of illness among diners
who ate home preserved green olives was 3.6
(p=0.016).”
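The chi-square value and p value for the botulism table can be reproduced in a few lines. This is a sketch using only the Python standard library; the function name is illustrative. It applies the cell-by-cell definition (observed minus expected, squared, divided by expected) without continuity correction, and uses the fact that for one degree of freedom the upper tail of the chi-square distribution equals erfc(sqrt(x / 2)).

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """a, b = ill / not ill among exposed; c, d = ill / not ill among unexposed."""
    n = a + b + c + d
    observed = [a, b, c, d]
    row_totals = [a + b, a + b, c + d, c + d]
    col_totals = [a + c, b + d, a + c, b + d]
    chi2 = 0.0
    for obs, row, col in zip(observed, row_totals, col_totals):
        expected = row * col / n          # expected count under independence
        chi2 += (obs - expected) ** 2 / expected
    # For 1 degree of freedom: P(chi-square >= x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

# Botulism table: olives 9 ill / 43 not ill, no olives 4 ill / 79 not ill.
chi2, p = chi_square_2x2(9, 43, 4, 79)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

The result should be close to the χ² of 5.73 and the p value of about 0.016 reported in the slides; small differences in the last digit come from rounding.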
The p-value is smaller than the chosen
significance level of α = 5%.
→ Null hypothesis can be rejected.
There is a 0.016 probability (16/1000) that an association at least as strong as the one observed would have
occurred by chance, if there were no true association between
eating olives and illness.
22
Epidemiology and statistics
23
Criticism on significance testing
“Epidemiological applications need more than a
decision as to whether chance alone could have
produced an association.”
(Rothman et al. 2008)
→ Estimation of an effect measure (e.g. RR,
OR) rather than significance testing.
24
Why estimation?
Botulism outbreak in Italy:
“The risk of illness was higher among diners who
ate home preserved green olives (RR=3.6).”
How confident can we be in the result?
What is the precision of our point estimate?
25
The epidemiologist needs measurements
rather than probabilities
χ² is a test of association
OR, RR are measures of association on a continuous scale
with an infinite number of possible values
The best estimate = point estimate
Range of values allowing for random variability:
Confidence interval → precision of the point estimate
26
Confidence interval (CI)
Range of values, on the basis of the sample
data, in which the population value (or true
value) may lie.
• Frequently used formulation:
“If the data collection and analysis could be
replicated many times, the CI should include the
true value of the measure 95% of the time.”
27
Confidence interval (CI)
e.g. CI for means
95% CI = x̄ – 1.96 SE up to x̄ + 1.96 SE
α = 5%
[Figure: sampling distribution with the central area 1-α between the lower and upper limits of the 95% CI, and α/2 in each tail]
Indicates the amount of random error in the estimate
Can be calculated for any „test statistic“, e.g.: means, proportions, ORs, RRs
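The 95% CI for a mean (mean ± 1.96 SE) can be sketched as follows. The incubation periods are hypothetical values invented for the example, and the function name is illustrative. The multiplier 1.96 comes from the normal distribution; for a sample this small a t-distribution quantile would give a somewhat wider interval.

```python
from math import sqrt
from statistics import mean, stdev

def mean_ci95(values):
    """Return (mean, lower limit, upper limit) using mean +/- 1.96 * SE."""
    m = mean(values)
    se = stdev(values) / sqrt(len(values))   # standard error of the mean
    return m, m - 1.96 * se, m + 1.96 * se

# Hypothetical incubation periods in hours (not data from the slides).
sample = [12, 18, 24, 24, 30, 36, 36, 48, 48, 72]
m, lower, upper = mean_ci95(sample)
print(f"mean = {m:.1f} h, 95% CI = {lower:.1f} to {upper:.1f} h")
```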
28
CI terminology
Point estimate
Confidence interval
RR = 1.45 (0.99 – 2.1)
Lower
confidence
limit
Upper
confidence
limit
29
Width of confidence interval depends on …
• The amount of variability in the data
• The size of the sample
• The arbitrary level of confidence you desire for your
study (usually 90%, 95%, 99%)
A common way to use the CI for an OR/RR:
If 1.0 is included in the CI → not significant
If 1.0 is not included in the CI → significant
30
Looking at the CI
[Figure: confidence intervals of studies A and B on an RR axis, with reference marks at RR = 1 and at a large RR]
Study A, large sample, precise results, narrow CI – SIGNIFICANT
Study B, small size, wide CI - NON SIGNIFICANT
Study A, effect close to NO EFFECT
Study B, no information about absence of large effect
31
Are more studies better or worse?
• Decision making based on results from a collection of
studies is not facilitated when each study is classified as a
YES or NO decision.
• Need to look at the point estimate and its CI
• But also consider its clinical or biological significance
[Figure: 20 studies with different results, plotted against RR = 1]
32
Botulism outbreak in Italy
• How confident can we be in the result?
• Relative risk = 3.6 (point estimate)
• 95% CI for the relative risk:
(1.17 ; 11.07)
If the study could be repeated many times, 95% of the intervals calculated in this way
would include the true relative risk; here the interval runs from 1.17 to 11.07.
33
Botulism outbreak in Italy
“The risk of illness was higher among diners
who ate home preserved green olives
(RR=3.6, 95% CI 1.17 to 11.07).”
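The slides do not state how the interval was calculated. As a sketch, the commonly used log-transformation method (standard error of ln RR from the four cell counts) reproduces the reported limits for the botulism table; the function and parameter names are illustrative.

```python
from math import exp, log, sqrt

def relative_risk_ci(ill_exp, n_exp, ill_unexp, n_unexp, z=1.96):
    """Relative risk with a CI computed on the log scale (assumed method)."""
    rr = (ill_exp / n_exp) / (ill_unexp / n_unexp)
    se_log = sqrt(1 / ill_exp - 1 / n_exp + 1 / ill_unexp - 1 / n_unexp)
    lower = exp(log(rr) - z * se_log)
    upper = exp(log(rr) + z * se_log)
    return rr, lower, upper

# Botulism table: 9 ill of 52 who ate olives, 4 ill of 83 who did not.
rr, lower, upper = relative_risk_ci(9, 52, 4, 83)
print(f"RR = {rr:.1f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Because the interval is symmetric on the log scale, it is asymmetric around the point estimate on the RR scale.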
34
The p-value (or CI) function
• A graph showing the p value for all possible values of
the estimate (e.g. OR or RR).
• Quantitative overview of the statistical relation
between exposure and disease for the set of data.
• All confidence intervals can be read from the curve.
• The function can be constructed from the confidence
limits in Episheet.
35
Example: Chlordiazepoxide use and congenital
heart disease
             C use   No C use
Cases            4        386
Controls         4       1250
OR = (4 x 1250) / (4 x 386) = 3.2
p=0.08 ; 95% CI=0.81–13
From Rothman K
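The odds ratio and its 95% CI for the small study can be checked with the usual log-scale (Woolf) approximation. This is a sketch under that assumption; the slides do not say which method produced the reported interval, and the reported p value may come from a different method, so it is not recomputed here. The function and parameter names are illustrative.

```python
from math import exp, log, sqrt

def odds_ratio_ci(exp_cases, unexp_cases, exp_controls, unexp_controls, z=1.96):
    """Odds ratio with a CI computed on the log scale (Woolf approximation)."""
    odds_ratio = (exp_cases * unexp_controls) / (unexp_cases * exp_controls)
    se_log = sqrt(1 / exp_cases + 1 / unexp_cases
                  + 1 / exp_controls + 1 / unexp_controls)
    lower = exp(log(odds_ratio) - z * se_log)
    upper = exp(log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Small study: 4 exposed and 386 unexposed cases, 4 exposed and 1250 unexposed controls.
odds_ratio, lower, upper = odds_ratio_ci(4, 386, 4, 1250)
print(f"OR = {odds_ratio:.1f}, 95% CI {lower:.2f} to {upper:.0f}")
```

The wide interval reflects the two cells with only 4 observations each, which dominate the standard error.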
[Figure: p-value function for the odds ratio; point estimate 3.2, p=0.08, 95% CI 0.81–13]
37
Example: Chlordiazepoxide use and congenital
heart disease – large study
             C use   No C use
Cases         1090     14 910
Controls      1000     15 000
OR = (1090 x 15000) / (1000 x 14910) = 1.1
p=0.04 ; 95% CI=1.05-1.2
From Rothman K
Precision and strength of association
[Figure with labels: Strength, Precision]
39
Confidence interval provides more
information than p value
• Magnitude of the effect (strength of association)
• Direction of the effect (RR > or < 1)
• Precision of the point estimate of the effect
(variability)
The p value cannot provide these!
40
What we have to evaluate the study
χ²: A test of association. It depends on sample size.
p value: Probability that equal (or more extreme)
results can be observed by chance alone
OR, RR: Direction & strength of association
(independently from sample size)
  if > 1 risk factor
  if < 1 protective factor
CI: Magnitude and precision of effect
41
Comments on p-values and CIs
• Presence of significance does not prove
clinical or biological relevance of an effect.
• A lack of significance is not necessarily a lack
of an effect:
“Absence of evidence is not evidence of
absence”.
42
Comments on p values and CIs
• A huge effect in a small sample or a small
effect in a large sample can result in identical
p values.
• A statistical test will always give a significant
result if the sample is big enough.
• p values and CIs do not provide any
information on the possibility that the
observed association is due to bias or
confounding.
43
χ² and Relative Risk
         Cases   Non cases   Total
E            9          51      60
NE           5          55      60
Total       14         106     120
χ² = 1.3
p = 0.13
RR = 1.8
95% CI [ 0.6 - 4.9 ]

         Cases   Non cases   Total
E           90         510     600
NE          50         550     600
Total      140        1060    1200
χ² = 12
p = 0.0002
RR = 1.8
95% CI [ 1.3 - 2.5 ]
« Too large a difference and you are doomed to
statistical significance »
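The two tables above differ only by a factor of ten in every cell. A short sketch (using the standard shortcut formula for the uncorrected chi-square of a 2x2 table; the function name is illustrative) shows that the relative risk stays the same while the chi-square value grows in proportion to the sample size.

```python
def chi_square_and_rr(a, b, c, d):
    """a, b = cases / non-cases among exposed; c, d = cases / non-cases among unexposed."""
    n = a + b + c + d
    # Shortcut formula for the Pearson chi-square of a 2x2 table (no correction).
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    rr = (a / (a + b)) / (c / (c + d))
    return chi2, rr

small_chi2, small_rr = chi_square_and_rr(9, 51, 5, 55)
large_chi2, large_rr = chi_square_and_rr(90, 510, 50, 550)
print(f"n = 120:  chi2 = {small_chi2:.1f}, RR = {small_rr:.1f}")
print(f"n = 1200: chi2 = {large_chi2:.1f}, RR = {large_rr:.1f}")
```

The strength of association is identical in both tables; only the amount of data, and therefore the test statistic and p value, has changed.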
44
Common source outbreak suspected
Exposure   Cases   Non cases   Total   AR%
Yes           15          20      35   42.8%
No            50         200     250   20.0%
Total         65         220     285   23%
χ² = 9.1
p = 0.002
RR = 2.1
95% CI = 1.4-3.4
REMEMBER: These values do not provide any information on the
possibility that the observed association is due to a bias or confounding.
45
Recommendations
• Always look at the raw data (2x2-table). How
many cases can be explained by the exposure?
• Interpret with caution associations that
achieve statistical significance.
• Double caution if this statistical significance is
not expected.
• Use confidence intervals to describe your
results.
• Report p values precisely.
46
Suggested reading
• KJ Rothman, S Greenland, TL Lash, Modern Epidemiology,
Lippincott Williams & Wilkins, Philadelphia, PA, 2008
• SN Goodman, R Royall, Evidence and Scientific Research,
AJPH 78, 1568, 1988
• SN Goodman, Toward Evidence-Based Medical Statistics.
1: The P Value Fallacy, Ann Intern Med. 130, 995, 1999
• C Poole, Low P-Values or Narrow Confidence Intervals:
Which are more Durable? Epidemiology 12, 291, 2001
47
Previous lecturers
• Alain Moren
• Paolo D’Ancona
• Lisa King
• Preben Aavitsland
• Doris Radun
• Manuel Dehnert
48