Inference, Causation and Side Effects
Inference and Hypothesis Testing
Lecture 8b
June 22, 2005
Kevin Schwartzman MD
Inference & Hypothesis Testing
Reading
• Fletcher, chapters 9 and 11
Inference & Hypothesis Testing - Slide 1
Objectives
Students will be able to:
1. Distinguish between association and causation
2. Describe strengths and weaknesses of various study
designs with respect to causal inference
3. Define the terms alpha (significance level), beta,
type 1 and 2 errors, and statistical power
4. Explain the conceptual relationship between sample
size requirement and statistical power
5. Distinguish between clinical and statistical significance
6. Distinguish between measures of health effect and
measures of statistical association
Inference & Hypothesis Testing - Slide 2
Objectives
7. Distinguish indices of etiologic versus clinical or
public health importance
8. Describe the phenomenon of multiple comparisons
9. Apply this concept to understanding the difference
between primary and secondary research
objectives, and hypothesis testing versus hypothesis
generation
10. Provide potential explanations for conflicting
findings
Inference & Hypothesis Testing - Slide 3
During this course, we have focused primarily on
generation of valid estimates of association between
exposure (or treatment) and outcome, using various
study designs.
Valid estimates of association require that bias be
avoided (e.g. selection, measurement) and that
potential confounders be properly addressed.
Effect measures and the related statistical tests are used
to evaluate potential associations.
Association of exposure and outcome does not
necessarily imply causation - even without obvious
confounding, there may be other factors responsible.
Inference & Hypothesis Testing - Slide 4
Causes of most diseases remain incompletely
understood.
e.g. smoking is the strongest known risk factor
for lung cancer, but
- some non-smokers develop lung cancer too
- most smokers do not develop lung cancer
Rothman points out that differences which appear to be
due to “chance” now may ultimately be shown
to relate to important causal factors, e.g. genetics.
Inference & Hypothesis Testing - Slide 5
Hill criteria - used to evaluate possibility of causation;
(from Rothman, Modern Epidemiology, 2nd ed., 1998)
1. Strength of association (magnitude of effect)
2. Consistency - more an argument against associations due to chance
3. Specificity
- the idea that a cause should lead to a single effect is not tenable
4. Temporality
5. Biologic gradient (dose-response)
6. Biologic plausibility
7. Coherence (also biologic)—with natural history of disease/condition
8. Experimental evidence
9. Analogy — parallels exposure-outcome relationship for similar
exposure and/or disease
Inference & Hypothesis Testing - Slide 6
Study Design and Inference
Ecologic Studies
- best for generating questions about possible associations
- no individual-level information about confounding factors
- many other potential explanations for putative associations
Cross-Sectional Studies
- better ability to assess confounders
- significant problems with temporality
- best suited for stable exposures and outcomes
Inference & Hypothesis Testing - Slide 7
Cohort Studies
- temporality usually not a problem
- threatened by selection bias, ascertainment issues (especially losses to follow-up), and confounding
Case-Control Studies
- temporality may be problematic
- key issue is control selection
- ascertainment issues, particularly "reconstruction" of exposure history
- confounders (measured, unmeasured)
Inference & Hypothesis Testing - Slide 8
Randomized Clinical Trials
May not be feasible, ethical, or appropriate
(e.g. etiologic studies)
Control of confounding is only truly optimal at
time of randomization
Selection of subjects may severely hamper
generalizability
Ascertainment issues (blinding, losses to follow-up)
Cointerventions, overlapping treatments
(“contamination”)
Efficacy vs. effectiveness (what is the relevant
question?)
Variable concordance between experimental and
observational studies addressing same questions.
Inference & Hypothesis Testing - Slide 9
Example: Association vs. Causation?
Spitzer et al. Use of beta-agonists and the risk of death
and near-death from asthma (NEJM 1992)
Beta-agonists, e.g. Ventolin ® - bronchodilators for
asthma treatment
Used Saskatchewan prescription database
Identified cohort of over 12,000 patients using asthma
medications
Inference & Hypothesis Testing - Slide 10
Nested Case-Control Study:
Identified 44 deaths "probably" due to asthma, plus
85 subjects with near-fatal asthma; these were the
cases
655 controls randomly selected, after matching on
region, welfare use, age, hospitalization within 2 years,
date of entry into cohort, date of index event
Inference & Hypothesis Testing - Slide 11
Investigators found increased odds of beta-agonist use
among patients with fatal or near-fatal asthma
Also found dose-response effect,
e.g., adjusted OR 8.0 for 1-2 canisters of albuterol
per month compared with no use
Does increased use of beta agonists cause
asthma fatalities?
Inference & Hypothesis Testing - Slide 12
Biologic plausibility:
• Beta agonists have cardiac effects, e.g. possible
arrhythmias - but this cause of death was not
specifically evaluated (though patients in study generally
died of respiratory failure)
Other potential explanations:
- confounding by indication (severity of disease)
- difficult to adjust for severity with this study design
- patient/physician behaviour
- marker for overreliance on symptomatic treatment
as opposed to more definitive anti-asthma therapy
- may also be a marker for quality of asthma care,
some aspects of which could not be adjusted for
in this type of analysis
Inference & Hypothesis Testing - Slide 13
Like other scientific research, inference in epidemiology
involves explicit hypothesis testing
This requires a clearly defined hypothesis a priori
The data obtained are used to address the primary hypothesis
using a statistical approach, i.e.
1. There is a so-called "alternative hypothesis" (Ha) advanced, regarding an association between exposure and outcome
2. There is a corresponding null hypothesis (Ho), which states that there is no association between exposure and outcome.
Inference & Hypothesis Testing - Slide 14
Various "test statistics" are calculated, depending on
the setting, which summarize the difference
between the data observed and those expected
under the null hypothesis, while accounting for
variance (spread) in the data.
Statistical hypothesis testing involves an estimate
as to the probability of obtaining an equally or more
extreme test statistic if the null hypothesis is in fact
correct.
Inference & Hypothesis Testing - Slide 15
Alpha (α) is the preset probability level at which
we reject the null hypothesis,
i.e. at which we judge it unlikely to explain the observed findings
- this is conventionally set at α = 0.05,
i.e. the probability of the observed findings
is 5% or less if the null hypothesis is in fact correct
- the estimated probability of obtaining an
equally or more extreme test statistic under
the null hypothesis is the significance level
("P-value")
Inference & Hypothesis Testing - Slide 16
Use of the confidence interval (e.g. 95% CI) has
the advantage of incorporating the notion of
spread/precision while preserving the hypothesis
testing component
- the 95% confidence interval may be thought of as the range within which 95% of the estimates for a value (e.g. mean cholesterol levels among patients with first myocardial infarcts) or an association (e.g. odds ratio for smoking in lung cancer patients vs controls) would be expected to lie, if the sampling or study were repeated many times
- we are "95% confident" that the "true" value lies within the specified range
- based on the observed value and its variance
Inference & Hypothesis Testing - Slide 17
For example, the 95% CI for an OR is:

exp[ln(OR) ± 1.96 x √var(ln OR)]

where

var(ln OR) = 1/a + 1/b + 1/c + 1/d

(a, b, c, d being the four cells of the 2 x 2 table)

In general, if the 95% CI for the effect measure does
not include the null value, then the association is
statistically significant at the P ≤ 0.05 threshold.
Similarly, if the 99% CI does not include the null value,
then the association is significant at the P ≤ 0.01
threshold.
Inference & Hypothesis Testing - Slide 18
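A minimal sketch in Python of this calculation, using an illustrative 2 x 2 table (the cell counts a, b, c, d below are hypothetical, not from the lecture):

import math

# Hypothetical 2 x 2 table: a, b = exposed and unexposed cases;
# c, d = exposed and unexposed controls (illustrative counts only)
a, b, c, d = 30, 70, 20, 80

odds_ratio = (a * d) / (b * c)
var_ln_or = 1/a + 1/b + 1/c + 1/d            # var(ln OR)
se_ln_or = math.sqrt(var_ln_or)

lower = math.exp(math.log(odds_ratio) - 1.96 * se_ln_or)
upper = math.exp(math.log(odds_ratio) + 1.96 * se_ln_or)

print(f"OR = {odds_ratio:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
# If the interval excludes 1 (the null value), P <= 0.05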
The larger the sample size, the smaller the variance
of the various parameter estimates and of estimates
of effect measures
(increased precision with larger sample sizes)
- reflected in narrower confidence intervals
- stratification and related techniques for statistical adjustment of confounders lead to wider confidence intervals, because of smaller numbers in study subgroups
Inference & Hypothesis Testing - Slide 19
Failure to reject the null hypothesis does not mean
the alternative hypothesis is false.
The sample size may simply be too small to detect a
“significant” deviation from values expected under
the null hypothesis.
Example: You believe a coin is “loaded”, i.e. not fair
• Of 10 tosses, 7 are heads - could happen quite easily with a fair coin (P = 0.34, two-sided)
– For 7/10, the 95% CI is (0.35, 0.93)
• Of 100 tosses, 70 are heads - very unlikely to happen with a "fair" coin (P = 0.00008, two-sided)
– For 70/100, the 99% CI is (0.57, 0.81)
Inference & Hypothesis Testing - Slide 20
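The two-sided P-values quoted above can be reproduced with an exact binomial calculation; a short sketch using only the Python standard library (the confidence intervals on the slide are exact Clopper-Pearson intervals, not computed here):

from math import comb

def two_sided_binomial_p(heads, tosses):
    """Exact two-sided P-value for the null hypothesis of a fair coin (p = 0.5)."""
    extreme = max(heads, tosses - heads)
    upper_tail = sum(comb(tosses, k) for k in range(extreme, tosses + 1)) / 2**tosses
    return min(1.0, 2 * upper_tail)

print(round(two_sided_binomial_p(7, 10), 2))      # 0.34 - easily compatible with a fair coin
print(round(two_sided_binomial_p(70, 100), 5))    # 0.00008 - very unlikely with a fair coin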
The probability of failing to reject the null hypothesis
when the alternative hypothesis is in fact true
is known as beta (β)

β depends on the number of observations,
the variance of the measurement in question, and
the magnitude of the "true" difference/association (Ha)
- as well as the significance threshold

β decreases with:
- more observations
- smaller variance
- a larger true underlying effect
- a higher α-level
Inference & Hypothesis Testing - Slide 21
α and β are analogous to false-positive and false-negative rates with diagnostic tests:

                     Accept Ho        Reject Ho
Ho true              Correct          Type 1 error
Ha true (Ho false)   Type 2 error     Correct

α = P (type 1 error) = P (reject Ho | Ho true)
β = P (type 2 error) = P (fail to reject Ho | Ha true)
Inference & Hypothesis Testing - Slide 22
The statistical power of a study is analogous to
the sensitivity of a diagnostic test
It is the ability to detect an association when it is in fact
present in the underlying population of interest
Hence the power of a study is equal to 1 - β
just as true positive rate = 1 - false negative rate
Insufficient power leads to type 2 error--probably a
much more frequent phenomenon than type 1 error
Failure to detect an association does not prove
there is no association
Power issues should be addressed when studies are
reported
Inference & Hypothesis Testing - Slide 23
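To make the idea of power concrete, the sketch below simulates many hypothetical two-arm trials with a true drop in risk from 20% to 10% and counts how often a simple two-sided z-test on the risk difference reaches significance; the group size of 200 and the test used are illustrative choices, not from the lecture:

import math, random

random.seed(1)  # for reproducibility

def trial_significant(n, p1, p0, z_crit=1.96):
    """Simulate one two-arm trial of n per group and test the risk
    difference with a simple two-sided z-test (normal approximation)."""
    x1 = sum(random.random() < p1 for _ in range(n))
    x0 = sum(random.random() < p0 for _ in range(n))
    r1, r0 = x1 / n, x0 / n
    se = math.sqrt(r1 * (1 - r1) / n + r0 * (1 - r0) / n)
    return se > 0 and abs(r1 - r0) / se > z_crit

# Fraction of simulated trials that detect the true 20% vs 10% difference
power = sum(trial_significant(200, 0.10, 0.20) for _ in range(2000)) / 2000
print(power)   # roughly 0.8 with 200 subjects per group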
In designing studies, a minimum power of 80% is
usually targeted; sometimes 90% (β = 0.20 or 0.10)
The estimated sample size requirement hinges on
the power to detect a specified magnitude of
effect/association, assuming a given variance in the
measurements of interest.
For example, the ability to detect a decrease in the
10-year cumulative incidence of first myocardial
infarction from 20% to 10% among subjects
randomized to an intensive exercise program vs. usual
care
Inference & Hypothesis Testing - Slide 24
During the design phase, this requires
a specific definition of Ha,
e.g. a decrease in risk from 20% to 10%
associated with intensive exercise
--not simply “a reduction in risk”
For 80% power to detect
a significant difference with α = 0.05, and
an underlying decrease in the risk of MI
from 20% to 10%,
219 subjects per study group are required
For 90% power, 286 per group are needed
Based on binomial distribution, two-tailed test
Inference & Hypothesis Testing - Slide 25
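The 219 and 286 figures can be reproduced with a standard sample size formula for comparing two proportions (normal approximation with a continuity correction); the slide states only that the calculation is based on the binomial distribution and a two-tailed test, so the exact method used by the lecturer is an assumption here:

from math import sqrt, ceil

def n_per_group(p1, p0, z_alpha=1.95996, z_beta=0.84162):
    """Per-group sample size for comparing two proportions, two-tailed test,
    normal approximation with continuity correction (Fleiss).
    Defaults: alpha = 0.05 (z = 1.96) and 80% power (z = 0.84);
    use z_beta = 1.28155 for 90% power."""
    p_bar = (p1 + p0) / 2
    diff = abs(p1 - p0)
    n = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * sqrt(p1 * (1 - p1) + p0 * (1 - p0))) ** 2 / diff ** 2
    n_cc = n / 4 * (1 + sqrt(1 + 4 / (n * diff))) ** 2   # continuity correction
    return ceil(n_cc)

print(n_per_group(0.20, 0.10))                  # 219 per group (80% power)
print(n_per_group(0.20, 0.10, z_beta=1.28155))  # 286 per group (90% power)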
This may also require an estimate of the expected
variance in measurements--particularly if continuous
For example,
to detect a treatment effect of an antihypertensive drug:
If underlying mean diastolic BP is
80 mm Hg in treatment group and 90 in comparison group,
with standard deviation of 10 in both, then
16 subjects are needed per group, for 80% power
- If the standard deviation is 20 in both groups,
then 63 per group are needed for 80% power
Predictions about effect size and measurement variability
may be “best guesses” based on other research,
or may be obtained from pilot studies conducted in the
same or similar settings
Inference & Hypothesis Testing - Slide 26
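For a continuous outcome such as blood pressure, the usual per-group formula is n = 2 x (z_alpha/2 + z_beta)^2 x sd^2 / delta^2; the sketch below reproduces the first figure above (about 16 per group for a 10 mm Hg difference with SD 10 and 80% power):

from math import ceil

def n_per_group_means(delta, sd, z_alpha=1.95996, z_beta=0.84162):
    """Per-group sample size to detect a difference 'delta' between two means
    with common standard deviation 'sd' (two-tailed alpha = 0.05, 80% power)."""
    return ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2)

print(n_per_group_means(delta=10, sd=10))   # 16 per group, as quoted above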
If the magnitude of the “true” association is greater
than predicted, power will increase for a given
sample size.
If it is less than predicted, power will decrease for a
given sample size.
Hence it is best to base sample size requirements on
conservative estimates of effect/association.
Sample size calculations should also account for
refusals, incomplete data, dropouts, etc.
Inference & Hypothesis Testing - Slide 27
Statistical hypothesis testing involves a judgment
as to whether the observed findings are likely to
relate to chance, rather than to a true association
Hence an effect, association, or contrast that is
statistically significant may be considerable or it
may be trivial; statistical significance per se does
not imply causation or even clinical/scientific
importance
For a fixed contrast (e.g. 1% fewer myocardial
infarctions with a given treatment), statistical
significance increases with the number of
observations
Inference & Hypothesis Testing - Slide 28
Example:
Researcher 1 conducts a randomized clinical trial comparing
5-year mortality among smokers post-myocardial infarction
who receive vs do not receive an intensive smoking
cessation intervention.
The following results are obtained:
          Intervention    No Intervention
Dead      20              40
Alive     180             160
For death, RR = (20/200)/(40/200) = 0.5 (0.30, 0.82)
RD = 0.1 - 0.2 = -0.1 (-0.17, -0.03)
P = 0.005
Inference & Hypothesis Testing - Slide 29
Researcher 2 conducts a randomized clinical trial
comparing 5-year mortality among post-myocardial
infarction patients who receive vs do not receive a novel
medication.
Recruitment is massive:
          Medication    No Medication
Dead      19,500        20,000
Alive     80,500        80,000

RR = (19,500/100,000)/(20,000/100,000)
   = 0.975 (0.958, 0.992)
RD = -0.005 (-0.0085, -0.0015)
P = 0.005
Inference & Hypothesis Testing - Slide 30
Which finding carries more clinical significance?
How many patients must be treated to save one
additional life?
For study 1, 1/RD = 1/0.1 = 10 (5.9, 33)
For study 2, 1/RD = 1/0.005 = 200 (117, 667)
The “number needed to treat” (NNT) is the reciprocal of
the estimated risk difference
Like the risk difference, it reflects the frequency of the
outcome in the target group, as well as the treatment
effect
Hence even if an intervention has a major protective
effect, the risk difference will be small (and the NNT
large) if the outcome of concern is uncommon
Inference & Hypothesis Testing - Slide 31
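A sketch reproducing the Study 1 figures above: risk ratio and risk difference with 95% confidence intervals (on the usual log and linear scales), and the NNT as the reciprocal of the absolute risk difference.

import math

# Study 1 (Slide 29): intervention vs no intervention, deaths over 5 years
d1, n1 = 20, 200     # deaths, total in intervention group
d0, n0 = 40, 200     # deaths, total in control group

r1, r0 = d1 / n1, d0 / n0

# Risk ratio and 95% CI on the log scale
rr = r1 / r0
se_ln_rr = math.sqrt(1/d1 - 1/n1 + 1/d0 - 1/n0)
rr_ci = (math.exp(math.log(rr) - 1.96 * se_ln_rr),
         math.exp(math.log(rr) + 1.96 * se_ln_rr))

# Risk difference and 95% CI (normal approximation)
rd = r1 - r0
se_rd = math.sqrt(r1 * (1 - r1) / n1 + r0 * (1 - r0) / n0)
rd_ci = (rd - 1.96 * se_rd, rd + 1.96 * se_rd)

# Number needed to treat: reciprocal of the absolute risk difference
nnt = 1 / abs(rd)

print(f"RR = {rr:.2f} ({rr_ci[0]:.2f}, {rr_ci[1]:.2f})")   # 0.50 (0.30, 0.82)
print(f"RD = {rd:.2f} ({rd_ci[0]:.2f}, {rd_ci[1]:.2f})")   # -0.10 (-0.17, -0.03)
print(f"NNT = {nnt:.0f}")                                  # 10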
Magnitude of health effect is not the same as strength
of statistical association
A lower P value or larger test statistic does not make a
treatment more effective or a risk factor for disease
more important
For a given effect size, statistical association grows
stronger as the number of observations/subjects grows
and/or the variance among observations/subjects falls
Health effects are measured using
a) ratio measures, e.g. odds ratio, risk or rate ratio
b) difference measures e.g. risk or rate difference
Confidence intervals should accompany point estimates
of effect size (ratio or difference)
Inference & Hypothesis Testing - Slide 32
Ratio measures of effect emphasize the “etiologic”
importance of an exposure or intervention, i.e. the
scientific relationship between exposure and outcome
An exposure is often said to be a “strong” risk factor
for an outcome if the associated ratio measure
(odds ratio, risk or rate ratio) is high
Hypothetical example:
- rate of malignant mesothelioma in asbestos-exposed
persons: 100/100,000 person-years
- rate in unexposed: 0.1/100,000 person-years
IRR = 1000;
ID = 99.9/100,000 person-years
Inference & Hypothesis Testing - Slide 33
The attributable fraction AMONG THE EXPOSED
is also an indication of “etiologic” importance,
since it reflects the magnitude of the ratio measure
e.g. (IRR-1)/IRR, which is 99.9% in this example
(99.9% of mesothelioma cases in asbestos-exposed
persons are the result of that exposure)
In contrast, the risk or rate difference emphasizes
the impact of an agent on the health of a group,
community, or population
Inference & Hypothesis Testing - Slide 34
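Putting the mesothelioma example into code (rates taken directly from the slides above):

rate_exposed = 100 / 100_000     # mesothelioma rate, asbestos-exposed
rate_unexposed = 0.1 / 100_000   # mesothelioma rate, unexposed

irr = rate_exposed / rate_unexposed            # incidence rate ratio
rate_diff = rate_exposed - rate_unexposed      # incidence rate difference
af_exposed = (irr - 1) / irr                   # attributable fraction among the exposed

print(irr, rate_diff * 100_000, af_exposed)    # 1000.0  99.9 per 100,000 p-y  0.999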
An exposure can lead to a large risk/rate difference
if the outcome in question is common,
even if the risk/rate ratio is relatively low
Hypothetical example:
- rate of myocardial infarction among 60 year-old male smokers: 1,500/100,000 person-years
- rate among 60 year-old male nonsmokers: 1,000/100,000 person-years
IRR = 1.5, but ID = 500/100,000 person-years
Inference & Hypothesis Testing - Slide 35
Population attributable risk/rate:
the absolute reduction in risk or incidence
which would occur in the population
if the exposure of concern were removed
= It - Io, which is equivalent to Pexp x (Ie - Io)
since It = (Ie x Pexp) + [Io x (1 - Pexp)]
Probably the key parameter for public health
since it incorporates both the rate/risk difference
and the frequency of exposure
Inference & Hypothesis Testing - Slide 36
Population attributable risk/rate fraction:
the fraction of disease in the population
that would be eliminated if
the exposure of concern were removed
= (It - Io)/It, or (RR-1)/RR x P(exposure | disease)
A rare disease which is almost always seen
in association with a characteristic exposure
may have a very high PAR fraction
but a very low PAR in absolute terms
Inference & Hypothesis Testing - Slide 37
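A sketch combining the two formulas above, using the smoking and myocardial infarction rates from the previous slide; the exposure prevalence Pexp is an assumed illustrative value, not from the lecture:

# Illustrative inputs (p_exp is hypothetical; rates are from Slide 35)
p_exp = 0.25                 # prevalence of exposure (smoking) in the population
i_e = 1500 / 100_000         # incidence among the exposed
i_o = 1000 / 100_000         # incidence among the unexposed

# Total population incidence: weighted average of exposed and unexposed rates
i_t = i_e * p_exp + i_o * (1 - p_exp)

par = i_t - i_o              # population attributable rate = Pexp x (Ie - Io)
paf = par / i_t              # population attributable rate fraction

print(par * 100_000, paf)    # 125 per 100,000 person-years; ~0.11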
• Epidemiology has been defined as "the study of the distribution and determinants of disease occurrence in human populations" (MacMahon and Pugh)
• The methods of epidemiology may be used to address questions in clinical medicine, public health, occupational health, etc.
Inference & Hypothesis Testing – Slide 38
• Clinical medicine involves management of individual
patients; management is usually based on experience
with other, similar patients (hopefully involving
published studies which were properly conducted and
reported)
• Public health practice involves the prevention or
treatment of disease on a community or population
level.
• Epidemiologic methods are in fact crucial to clinical and
public health research and decisions, but priorities for
clinical and public health intervention may not match
priorities in epidemiologic research.
Inference & Hypothesis Testing - Slide 39
Clinicians have a duty to individual patients
Public health practice is more explicitly utilitarian
(“greatest good for the greatest number”)
Some of the greatest challenges in clinical medicine
and public health involve situations where answers
to the “classical” etiologic or treatment questions
are well known
Such challenges may involve questions of behaviour
uptake, adherence, accessibility, equitability, funding,
etc.
Inference & Hypothesis Testing - Slide 40
Examples:
- diagnosis of pulmonary embolism
- smoking cessation in persons at risk for cardiovascular disease
- injury prevention
- tuberculosis treatment programs
- uptake of safer sexual practices
Inference & Hypothesis Testing - Slide 41
Regardless of the content area, it is best to focus on
one key hypothesis/comparison of interest when
designing, conducting, and reporting a study
This allows a clear idea of the target population
and sampling strategy, and of data to be collected
regarding exposure, outcome, and potential
confounders or effect modifiers
Permits unambiguous sample size estimates
Avoids the problem of multiple comparisons
Inference & Hypothesis Testing - Slide 42
Recall that α is the preset probability level at which
we reject the null hypothesis (usually no association
between exposure and outcome), based on the
probability of observing results as or more extreme
under the null hypothesis.

With the conventional α = 0.05 threshold, we accept
a 5% chance that we are erroneously rejecting the
null hypothesis, for a given exposure-outcome
association (a "false-positive" finding)

If we examine 2 potential exposure-outcome
associations, the chance of correctly detecting no
association when none in fact exists is (0.95)² =
0.9025
Inference & Hypothesis Testing - Slide 43
Hence there is a 9.75% chance of detecting at least
one spurious (false) association

The more potential associations we examine, the
greater the probability of detecting a "significant"
association even if the truth is that no association of
any sort exists

This probability is 1 - (1 - α)^x, where x is the number of
associations/comparisons examined (this assumes
that each is independent of the others)

If 10 exposures are examined in a case-control study,
there is a 40% chance that at least one will yield a
P-value of 0.05 or less, if none of the exposures is truly
related to the outcome of concern
Inference & Hypothesis Testing - Slide 44
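The 9.75% and 40% figures follow directly from 1 - (1 - α)^x, assuming independent comparisons:

def familywise_error(alpha, comparisons):
    """Probability of at least one false-positive finding across
    'comparisons' independent tests, each at significance level alpha."""
    return 1 - (1 - alpha) ** comparisons

print(familywise_error(0.05, 2))    # 0.0975 -> 9.75%
print(familywise_error(0.05, 10))   # ~0.40  -> 40%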
This is the phenomenon of multiple comparisons: the
more comparisons that are made, the greater the
chance of a statistically significant
difference/association in the absence of any true
effect.
This does not exclude the possibility that true
associations have indeed been identified--but it makes
them much harder to distinguish and to justify
There are statistical techniques for adjustment for
multiple comparisons
- α may be decreased, such that the aggregate
"false-positive" probability remains low
Inference & Hypothesis Testing - Slide 45
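One simple adjustment of this kind (not named on the slide, so offered only as an illustration) is the Bonferroni correction, which divides α by the number of comparisons so that the aggregate false-positive probability stays near the original level:

def bonferroni_alpha(alpha, comparisons):
    """Per-comparison significance threshold under a Bonferroni correction."""
    return alpha / comparisons

adjusted = bonferroni_alpha(0.05, 10)
print(adjusted)                          # 0.005 per comparison
print(1 - (1 - adjusted) ** 10)          # aggregate false-positive risk ~0.049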
Lowering the value of α confers its own problems
- this is usually done post hoc
- this renders both "false" and "true" associations more difficult to detect, since the sample size is already fixed
Inference & Hypothesis Testing - Slide 46
Researchers generally distinguish between primary
and secondary objectives or endpoints
The primary objective involves the key question
to be answered, and design of the study (including
sample size) hinges on this objective
Secondary objectives involve questions of interest
which the investigators would like to explore
It is understood that these questions may be
addressed with less precision and less power to
detect differences between groups
Detection of secondary associations or of differences
in secondary endpoints does not carry the same
weight as the primary association/endpoint of interest
(e.g. subgroup analyses)
Inference & Hypothesis Testing - Slide 47
A study is labelled as “hypothesis testing” when study
design, conduct, and analysis are based primarily on
one specific hypothesized association
The evaluation of secondary endpoints or multiple
associations should be considered to represent
“hypothesis generation,” i.e.
the identification of future primary research questions
Inference & Hypothesis Testing - Slide 48
Why do studies addressing the same question sometimes
yield conflicting results?
Internal validity issues (selection or information bias,
confounding)--especially if different study designs
are used
Differences in exposure or outcome definition
“Association” actually due to chance (type 1 error)
True association not detected (type 2 error)
Differences in setting (generalizability issues)
e.g. different population characteristics,
differential presence of effect modifiers
Was the same question truly addressed?
Inference & Hypothesis Testing - Slide 49
Example 1:
Many studies have evaluated the protective effect
of the BCG vaccine against tuberculosis;
estimates of efficacy range from 0 to 80%
Differences in who was vaccinated and when
--in some instances, vaccinated individuals were
already infected (conferring no protection)
Behr and colleagues demonstrated that the vaccine
likely lost protective elements over the years
Inference & Hypothesis Testing - Slide 50
Example 2:
Risk of drug-induced hepatitis among persons
taking isoniazid preventive therapy for tuberculosis
A US Public Health Service study in the early 1970s
found substantial risks which increased with age--
over 2% for persons over 50, and a number of deaths
A recent Seattle study found risks on the order of
0.1%, with no fatalities
The earlier study was distorted by a large number of
deaths in Baltimore, which experienced an unrelated
epidemic of hepatitis at that time
The more recent study involved stricter monitoring
techniques (and probably stricter patient selection)
Inference & Hypothesis Testing - Slide 51
Example 3:
Conflicting reports about the advisability of calcium-channel blockers for the treatment of hypertension
Estacio, NEJM 1998:
“A prospective, randomized, blinded clinical study
in a population of patients with non-insulin dependent
diabetes mellitus and hypertension demonstrated
that treatment with enalapril for a mean of five years
was associated with a lower incidence of myocardial
infarction than was treatment with nisoldipine [a
calcium-channel blocker] for the same period.”
Inference & Hypothesis Testing - Slide 52
Tuomilehto, NEJM 1999:
“Our trial demonstrated that [calcium-channel blocker]
-based antihypertensive treatment is particularly
beneficial in older diabetic patients with isolated
systolic hypertension. Thus our findings do not
support the hypothesis that the use of long-acting
calcium-channel blockers may be harmful in diabetic
patients.”
Different target population
Different comparison:
in study 1,
it was calcium-channel blocker vs alternative drug;
in study 2
it was calcium-channel blocker vs placebo!
Inference & Hypothesis Testing - Slide 53
Example 4
Mechanical ventilation strategies for ARDS
“Control” groups were not managed consistently, which
altered the results of the intervention
Example 5
Post-menopausal hormone replacement therapy
- Findings from randomized controlled trial contradicted
earlier observational studies
- Presumably reflects selection and measurement
issues, other confounders