EPI-820_Lect7_Stat_I..

EPI-820 Evidence-Based
Medicine
LECTURE 7: CLINICAL STATISTICAL
INFERENCE
Mat Reeves BVSc, PhD
1
Objectives
• Understand the theoretical underpinnings
and the flaws associated with the current
approach to clinical statistical testing (the
frequentist approach).
• Understand the difference between testing
and estimation
• Understand the advantages of the CI and the
CI functions.
• Understand the logic of a Bayesian Approach
2
Personal Statistical History….
• Post-DVM
• Clueless. Sceptical of the role of statistics
• Thinks research = the search for P < 0.05
• PhD Era:
• Increasing obsession with stat methods
• Lots of tools! SLR, ANOVA, MLR, LR, LL & Cox
• Thinks statistics = “real science”
• Post-PhD:
• Healthy scepticism for the way stats are used
• Stats = methods which have inherent limitations
• Not a substitute for clear scientific thought or understanding the
“scientific method”
3
Review of Significance Tests
Substantive hypothesis: Cows on BST will tend to gain weight
Null hypothesis (Ho): the mean body wt. of cows trt with BST
is not different from the mean body wt. of control cows
Ux = Uy
Alternative hypothesis (Ha): the mean body wt. of cows trt with
BST is different from the mean body wt. of control cows
Ux ≠ Uy
4
Review of Significance Tests
- Logically, if Ho is refuted Ha is confirmed
- investigator seeks to 'nullify' Ho
Expt:
20 cows randomized to BST (X) and control (Y). Measure wt.
gain. Calculate mean wt. change per group.
5
Review of Significance Tests
Assumptions:
i) Sample statistic (X - Y) is one instance of an infinitely large
number of sample statistics obtained from an infinite number of
replications of the expt., under the same conditions (frequentist
assumption)
ii) Populations are normally distributed, equal variance
iii) The Ho is true
6
Review of Significance Tests (t-test)
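$t = \frac{\bar{X} - \bar{Y}}{S_{xy}}$, approximately N(0, 1) under Ho (strictly, a t distribution with df degrees of freedom)
$df = (n_1 - 1) + (n_2 - 1)$
Where:
$S_{xy} = \sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right) S^2}$ = standard error of the difference between two independent means.
$S^2$ = estimate of pooled population variance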
- t may take on any value, no value is logically inconsistent with
Ho! Smaller t values are more consistent with Ho being true.
- all else equal, larger n’s increase value of t (higher power).
7
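As a rough illustration of the pooled t statistic described on this slide, the sketch below computes t and df for two small samples. The function name `pooled_t` and the weight-gain figures are hypothetical, invented only to show the arithmetic; they are not data from the lecture.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(x, y):
    """Return (t, df) for two independent samples, assuming equal variances."""
    n1, n2 = len(x), len(y)
    df = (n1 - 1) + (n2 - 1)
    # Pooled variance S^2: df-weighted average of the two sample variances.
    s2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / df
    # Standard error of the difference between two independent means.
    se_diff = sqrt((1 / n1 + 1 / n2) * s2)
    return (mean(x) - mean(y)) / se_diff, df


# Hypothetical weight gains (kg), made up for illustration only.
bst = [12.0, 15.0, 11.0, 14.0, 13.0]
control = [10.0, 12.0, 9.0, 11.0, 13.0]

t, df = pooled_t(bst, control)
print(f"t = {t:.2f}, df = {df}")  # t = 2.00, df = 8
```

Holding the sample variances fixed, a larger n shrinks the standard error and so inflates t, which is the point made in the last bullet of the slide.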
Review of Significance Tests
Large values of t indicate:
i) test assumptions are true, a rare event has occurred
ii) one of the assumptions of the test is false, and by convention
it is assumed that the Ho is not true.
- By convention, relative frequency of t where we decide to choose
(ii) above as a logical conclusion is set to 5% (alpha level or
significance level)
- Expt: t = 2.55, p = 0.02, reject Ho - result is significant
8
Review of Significance Tests
- Type I error (alpha) occurs 5% of the time when Ho is true
- Type II error (beta) occurs B% of the time when Ho is false
- Alpha and beta are inversely related
- Fixing alpha at 5% means Sp is 95%
- Beta is not set 'a priori', hence Se (power) tends to be low
- Scientific caution dictates that we set alpha small
- Scientific ignorance dictates we ignore beta!
9
Alpha and beta are inversely related


10
Relationship between diagnostic test result and disease status
                                DISEASE
TEST                  PRESENT (D+)      ABSENT (D-)
POSITIVE (T+)         TP (a)            FP (b)            PVP = a/(a + b)
NEGATIVE (T-)         FN (c)            TN (d)            PVN = d/(c + d)
                      Se = a/(a + c)    Sp = d/(b + d)
Se = P(T+|D+)
Sp = P(T-|D-)
11
Relationship between significance test results and truth
                                TRUTH
TEST                  Ho False            Ho True
REJECT Ho (SIGNF.)    TP (1 - B)          FP, Type I (a)      PVP = TP/(TP + FP)
ACCEPT Ho             FN, Type II (B)     TN (1 - a)          PVN = TN/(TN + FN)
                      Se = TP/(TP + FN)   Sp = TN/(TN + FP)
Se = Power (1 - B)
12
Power
- Probability of rejecting Ho when Ho is false
- Se = TP/(TP + FN) or (1 - B)
- Power is a function of:
i) Alpha (increase by making Ha one-sided, i.e., Ux > Uy)
(consistent with changing the cut-off value)
ii) Reliability (as measured by SE of the difference)
- Power increases with decreasing SE
- SE decreases with increasing sample size (= decr variance)
iii) Size of treatment effect
13
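The three drivers of power listed above (alpha, the SE of the difference, and the size of the treatment effect) can be sketched with a normal approximation. This is an assumption made for simplicity; the lecture's t-test would use the t distribution, and the function name and parameter values here are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided comparison of two means.

    Uses a normal approximation with a common, known SD (an assumption).
    """
    z = NormalDist()
    # SE of the difference between two independent means, equal group sizes.
    se = sd * sqrt(2 / n_per_group)
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect / se
    # Probability that the test statistic falls in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


for n in (10, 50, 200):
    print(n, round(approx_power(effect=1.0, sd=2.0, n_per_group=n), 2))
```

With no true effect the function returns alpha itself, which is the Type I error rate; power rises as n grows (smaller SE), as the effect grows, or as alpha is relaxed.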
The Consequences of Low Power
i) difficult to interpret negative results
- truly no effect
- expt unable to detect true difference
ii) increased proportion of Type I errors in the literature
iii) fail to identify many important associations
iv) low power means low precision (indicated by the confidence
interval)
14
Questions?
• What proportion of statistically significant
findings published in the literature are false
positive (Type 1) errors?
• Which well-known measure does this proportion correspond to,
and which elements does it therefore depend on?
15
Hypothetical outcomes of 500 experiments, a= 0.05, Power= 0.50, and
20% prevalence of false Ho’s
                                TRUTH
TEST                  Ho FALSE      Ho TRUE
REJECT Ho (SIGNF.)       50            20
ACCEPT Ho                50           380
Total                   100           400        N = 500
Se = 50%
Sp = 95%
PV+ = 50/70 = 71%
If all signf. results published, 29% are Type I errors
16
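The 500-experiment table can be reproduced with a few lines of arithmetic. The sketch below treats a significance test like a diagnostic test and computes the predictive value of a significant result from alpha, power, and the prevalence of false Ho's; the function name is hypothetical.

```python
def significant_result_breakdown(n_experiments, alpha, power, prevalence_false_h0):
    """Return (true positives, false positives, PV+) among significant results."""
    false_h0 = n_experiments * prevalence_false_h0  # experiments with a real effect
    true_h0 = n_experiments - false_h0              # experiments with no effect
    tp = power * false_h0   # real effects correctly declared significant
    fp = alpha * true_h0    # Type I errors
    return tp, fp, tp / (tp + fp)


tp, fp, ppv = significant_result_breakdown(500, alpha=0.05, power=0.50,
                                           prevalence_false_h0=0.20)
print(f"TP = {tp:.0f}, FP = {fp:.0f}, PV+ = {ppv:.0%}, Type I share = {1 - ppv:.0%}")
```

Lowering either the power or the prevalence of false Ho's in the call above raises the share of significant findings that are Type I errors, even though alpha stays at 5%.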
The P value
- probability of obtaining a value of the test statistic (X) at least as
large as the one observed, given the Ho is true
- P (>=X | Ho true)
Common Incorrect Interpretations
- It is NOT P (Ho true|Data)!!!
- We can never state the probability of a hypothesis being
true! (under the frequentist approach)
- It is NOT the probability that the results were due to chance!
17
Criticisms of Significance Tests
i) Decision vs Inference (Neyman-Pearson)
- pioneers of modern statistics were interested in producing
results that enabled decisions to be made
- problem of automatic acceptance or rejection based on an
arbitrary cutoff (P= 0.04 vs P=0.06)
- results should adjust your degree of belief in a hypothesis
rather than forcing you to accept an artificial dichotomy
- "intellectual economy"
18
Criticisms of Significance Tests
ii) Asymmetry of significance tests
- frequently, the experimental data can be found to be consistent
with a Ho of no effect or a Ho of a 20% increase
- acceptance of both Ho's given the data leads to 2 very
different conclusions!
- asymmetry was recognized by Fisher, hence convention is to
identify theory with the Ha but to test the Ho
- Is there an effect? is the wrong question! Should ask:
What is the size of the effect?
19
Criticisms of Significance Tests
iii) Corroborative power of significance tests
- Both Fisherian and Neyman-Pearson schools make no
assumption about the prior probability of Ho
- Both schools presume Ho is almost always false
- rejection of Ho does nothing to illuminate which of the vast
number of Ha’s are supported by the data!
- Failing to reject Ho does not prove Ho is true (Popper:
'we can falsify hypotheses but not confirm them')
20
Criticisms of Significance Tests
iv) Effect size and significance tests
- Test statistics and p values are a function of both effect size
and sample size
- Cannot infer the size of an effect by inspecting the P value;
reporting P < 0.00001 has no scientific merit!
- Highly significant results may be derived from trivial effects
if sample size is large.
- Confidence intervals give plausible range for the unknown
popl parameter (signf tests show what the parameter is not!)
21
Relationship between the Size of the
Sample and the Size of the P Value
• Example RCT:
• Intervention: new a/b for pneumonia.
• Outcome: Recovery Rate = % of patients in
clinical recovery by 5 days
• Facts:
• Known = Existing drug of choice results in 35%
recovery rate at 5 days
• Unknown = New drug improves recovery rate by
5% (to 40%)
22
P values Generated by RCT by Sample Size
Sample Size (N = 2x)    P value (Chi-square)
100                     0.465
500                     0.103
600                     0.074
700                     0.053
800                     0.039
1000                    0.021
23
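The table can be approximately reproduced with an uncorrected Pearson chi-square test on a 2x2 table. Two assumptions are made here that the slide does not state: the listed sample size is taken to be the number of patients per arm, and no continuity correction is applied. The function name is hypothetical.

```python
from math import erfc, sqrt


def chi_square_p(n_per_group, rate_old=0.35, rate_new=0.40):
    """P value of an uncorrected chi-square test (1 df) for two recovery rates."""
    a = rate_new * n_per_group   # recovered, new drug
    b = n_per_group - a          # not recovered, new drug
    c = rate_old * n_per_group   # recovered, existing drug
    d = n_per_group - c          # not recovered, existing drug
    n = 2 * n_per_group
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square distribution with 1 df.
    return erfc(sqrt(chi2 / 2))


for n in (100, 500, 600, 700, 800, 1000):
    print(n, round(chi_square_p(n), 3))
```

The effect (35% versus 40%) is identical in every row; only the sample size changes, yet the P value moves from clearly non-significant to significant.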
Conclusion?
Significance testing should be abandoned and replaced with
interval estimation (point estimate and CI)! Why?
- not couched in pseudo-scientific hypothesis testing language
- carries no decision-making implications
- gives a plausible range for the unknown popl parameter
- gives a clue as to sample size (width of the CI)
- avoids the danger of inferring a large effect when the result is highly
significant
24
Interval estimation
- view "experimentation" as a measurement exercise
- want an unbiased, precise measure of effect
- Point estimate: best estimate of the true effect, given the data (aka
MLE) and it indicates the magnitude of effect (but is imprecise)
- Confidence intervals indicate degree of precision of estimate.
Represent a set of all possible values for the parameter that
are consistent with the data
- width of CI depends on variability and level of confidence (%)
25
Interval estimation
- 90% CI:
- 90% of such intervals will include the true unknown popl.
parameter (necessary frequentist interpretation)
- it does not represent a 90% probability of including the true
unknown popl. parameter within it
- CIs indicate magnitude and precision.
- CIs are linked to alpha and hypothesis testing: confidence level = (1 - alpha), e.g., 95% for alpha = 0.05
26
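The frequentist reading of a 90% CI (90% of such intervals cover the true parameter over repeated experiments) can be checked by simulation. The sketch below is a minimal illustration under assumed conditions: normally distributed data with a known SD, and invented values for the true mean, SD, and sample size.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def coverage(true_mean=10.0, sd=2.0, n=25, level=0.90, reps=2000, seed=1):
    """Fraction of simulated CIs for a mean that contain the true mean."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z_crit * sd / sqrt(n)  # known-SD interval, for simplicity
    hits = 0
    for _ in range(reps):
        sample_mean = mean(rng.gauss(true_mean, sd) for _ in range(n))
        # The interval sample_mean +/- half_width covers the truth when
        # the sample mean lies within half_width of the true mean.
        if abs(sample_mean - true_mean) <= half_width:
            hits += 1
    return hits / reps


print(f"Observed coverage of 90% CIs: {coverage():.3f}")
```

Each replication either covers the true mean or does not; the 90% describes the long-run behaviour of the procedure, not any single interval.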
Interval estimation - Example
                 OUTCOME
             +       -      Total
TRT A        7      13       20       P(success) = 35%
TRT B       14       6       20       P(success) = 70%
Significance test: P = 0.06 or NS!
Interval estimation of difference: 35% (95% CI = -1% to +71%)
27
Confidence Intervals
- CI are non-uniform, true parameter is more likely to be located
centrally than near to limits. Therefore precise location of
boundary is irrelevant!
- For a study to be reassuring about a lack of effect, boundaries
of CI should be near the null value
- CIs have clear advantages over the p-value but still suffer from
the necessary frequentist interpretation (a CI represents one
member of a family of CIs produced by an infinite number of
replications of the same experiment)
- CI functions
28
Which is the more important study?
[Figure: confidence intervals for Study A and Study B plotted against the null point, with the axis running toward larger effect]
29
Importance of Beta (Type II error) and
Sample Size in RCT’s
(Freiman et al 1978)
• Reviewed 71 "negative" (P > 0.05) RCTs
published from 1960-77
• Assume 25% treatment effect:
• 94% (N= 67) of trials had < 90% power
• Only 15% (N= 10) had sufficient evidence to
conclude no effect
• Assume 50% treatment effect:
• 70% (N= 50) of trials had < 90% power
• Only 32% (N= 16) had sufficient evidence to
conclude no effect
30
The P Value Fallacy - Goodman
• Derives from the simultaneous application of
the p-value as:
• A long-run, error-based, deductive tool (Neyman-Pearson
frequentist application), and
• A short-run, evidential and inductive tool (i.e.,
what is the meaning of this particular result?)
• The p-value was never designed to serve
these two conflicting roles
31
The Bayes Factor - Goodman
• Comparison of how well two hypotheses predict the
data:
Bayes factor = P (Data | given the Ho) / P (Data | given the Ha)
• Allows explicitly the incorporation of external
evidence (in terms of prior probability/belief)
• Use of Bayesian statistics shows that weight of
evidence against the Ho is not as strong as the p-value suggests (Table 2)
32
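To make the last point concrete, the sketch below converts a two-sided p-value into a minimum Bayes factor and then into a posterior probability for Ho. The bound exp(-z^2/2) is an assumption brought in here (the Gaussian minimum Bayes factor usually associated with Goodman's papers); it is not stated on the slide, and the function names and the 50% prior are hypothetical.

```python
from math import exp
from statistics import NormalDist


def min_bayes_factor(p_two_sided):
    """Smallest Bayes factor for Ho vs Ha under a Gaussian approximation (assumed)."""
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)
    return exp(-z * z / 2)


def posterior_prob_h0(prior_prob_h0, bayes_factor):
    """Update the probability of Ho: posterior odds = prior odds x Bayes factor."""
    prior_odds = prior_prob_h0 / (1 - prior_prob_h0)
    post_odds = prior_odds * bayes_factor
    return post_odds / (1 + post_odds)


bf = min_bayes_factor(0.05)
print(f"Minimum Bayes factor at p = 0.05: {bf:.3f}")
print(f"Posterior P(Ho) from a 50% prior: {posterior_prob_h0(0.5, bf):.3f}")
```

Under these assumptions, even the most favourable case for Ha at p = 0.05 leaves Ho with a posterior probability well above 5% when the prior is 50%, which is why the evidence is weaker than the p-value appears to suggest.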