
Biostatistics in Practice
Session 3: Testing Hypotheses
Peter D. Christenson
Biostatistician
http://gcrc.humc.edu/Biostat
Readings for Session 3
from StatisticalPractice.com
• Significance test / hypothesis testing
• Significance tests simplified
Example
Consider a parallel study:
1. Randomize an equal number of subjects to treatment A or
treatment B.
2. Follow all subjects for a specified period of time.
3. Measure X = post-pre change in an outcome, such as cholesterol.
Primary Aim: Do treatments A and B differ in mean
effectiveness?
Restated aim: If μA and μB are the true, unknown, mean post-pre changes that would occur if all potential subjects received treatment A or treatment B, do we have evidence from our limited sample whether μA ≠ μB?
Extreme Outcome #1
Suppose results from the study are plotted as:
[Figure: dot plot of X for treatment groups A and B; each point is a separate subject.]
Obviously, B is more effective than A.
Extreme Outcome #2
Suppose results from the study are plotted as:
[Figure: dot plot of X for treatment groups A and B; each point is a separate subject.]
Obviously, A and B are equally effective.
More Realistic Possible Outcome I
Suppose results from the study are plotted as:
[Figure: dot plot of X for treatment groups A and B; each point is a separate subject.]
Is the overlap small enough to claim that B is more effective?
More Realistic Possible Outcome II
Suppose the ranges are narrower, with the same group mean
difference:
[Figure: dot plot of X for treatment groups A and B; each point is a separate subject.]
Now, is this minor overlap sufficient to come to a conclusion?
More Realistic Possible Outcome III
Suppose the ranges are wider, but so is the group difference:
[Figure: dot plot of X for treatment groups A and B; each point is a separate subject.]
Is the overlap small enough to claim that B is more effective?
More Realistic Possible Outcome IV
Here, the ranges for X are the same as the last slide, but there
are many more subjects:
[Figure: dot plot of X for treatment groups A and B; each point is a separate subject.]
So, just examining the overlap isn’t sufficient to come to a
conclusion, since intuitively the larger N should affect the results.
Our Goal
Goal: We need a rule that can be consistently applied to most
studies to make the decision whether or not μA ≠ μB.
From the previous 4 slides, relevant measures that will go into
our decision rule are:
1. Number of subjects, N; could be different for the groups.
2. Difference between groups in observed means (X-bar for A
and for B subjects).
3. Variability among subjects (SD for A and B subjects).
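To see how these three measures work together, here is a minimal Python sketch (not part of the original handout; the numbers are hypothetical, chosen so that the mean difference and SE match the figure discussed on later slides):

```python
# A hypothetical illustration of how the three measures combine.
import math

n_a, n_b = 50, 50                # 1. number of subjects in each group
xbar_a, xbar_b = 10.0, 11.128    # 2. observed mean post-pre changes
sd_a, sd_b = 7.0, 7.0            # 3. variability among subjects

# The standard error of the difference in means uses all three measures:
se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
t = (xbar_b - xbar_a) / se       # signal-to-noise ratio used by the test
print(f"difference = {xbar_b - xbar_a:.3f}, SE = {se:.2f}, t = {t:.2f}")
```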
Goal, Continued
Goal: We need a rule that can be consistently applied to most
studies to make the decision whether or not μA ≠ μB.
Other relevant issues:
1. Our conclusion could be wrong. We need to incorporate a
mechanism for minimizing that possibility.
2. Small differences are probably unimportant. Can we
incorporate that as well?
A Graphical Look at All of the Issues
The figure on the following slide shows most of the issues that are
involved in testing hypotheses.
It is complicated, but we will work through each of the factors it addresses on the slides after the figure:
1. Null hypothesis H0 vs. alternative hypothesis HA.
2. Decision rule: Choose HA if … [involves Ns, means, and SDs].
3. α = Prob(Type I error) = Prob(choosing HA when H0 is true).
4. β = Prob(Type II error) = Prob(choosing H0 when HA is true).
5. What changes if N were larger?
Graphical Representation of Hypothesis Tests
[Figure: two overlapping curves for the observed mean difference x-bar: a red curve for H0 centered at 0 and a blue curve for HA centered at 3. The critical region outside ±2.8 is shaded \\\ under the red curve, the region between -2.8 and +2.8 is shaded /// under the blue curve, and a green line marks the observed x-bar = 1.128.]
1: Null hypothesis H0 vs. alternative hypothesis HA.
All statistical tests have two hypotheses to choose from:
The null hypothesis states a negative conclusion, that there is “no
effect”, which could mean various specific outcomes in different
studies. It always includes at least one mathematical expression
that is 0.
Here, the null hypothesis is H0: μA - μB = 0. This states that the post-pre changes are, on average, the same for A as for B. The left (red) curve has its peak at this 0.
The alternative hypothesis includes every possibility other than 0, i.e., HA: μA - μB ≠ 0. In the figure, we chose just one alternative for illustration, namely that μA - μB = 3. The right (blue) curve has its peak at this 3.
For each curve, the height represents the relative likelihood of possible values of x-bar (the observed mean difference), so values near the peak are the most probable.
2: Decision Rule for Choosing H0 or HA.
A poor, but seemingly reasonable, rule.
First suppose that we only consider choosing between H0 and the particular HA: μA - μB = 3, as in the figure.
Common sense might say that we calculate x-bar (the mean of the changes for B subjects minus the mean of the changes for A subjects), and then choose H0 if x-bar is closer to 0, the hypothesized value under H0, or choose HA if it is closer to 3, the hypothesized value under HA.
The green line in the figure is at the x-bar from the sample, which is 1.128, and so H0 would be chosen with this rule, since 1.128 is closer to 0 than to 3.
A problem with this rule is that we cannot state how certain we are about our decision. It seems like a reasonable way to choose between the two possibilities, but if we used the rule in many studies, we could not say what proportion (90%? 95%?) of our decisions would be correct.
2: Decision Rule for Choosing H0 or HA.
The correct rule.
To start to quantify the certainty of some conclusions we will
make, recall the reasoning for confidence intervals.
If H0 is true, we expect that x-bar will not only be close to 0, but that with 95% probability it will be within about* ±2SE of 0, i.e., between about -2.8 and +2.8. This is the region under the H0 (red) curve that is not shaded with \\\.
Thus, the decision rule is: Choose HA if x-bar is outside 0±2SE, the critical region. The reason for using this rule is that if H0 is really true, then there is only a 5% chance we would get an x-bar in the critical region, and so only a 5% chance of wrongly deciding on HA. Roughly, if the rule is applied consistently, then only 5% of tests of true null hypotheses will end in false positive conclusions, although which ones are wrong is unknown.
*See a textbook for exact calculations. The multiplier is slightly larger than 2.
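As a sketch only (not from the handout), the rule can be written in a few lines of Python. The SE value of 1.4 is an assumption, inferred from the slide's ±2.8 ≈ ±2SE critical region:

```python
# Decision rule: choose HA if x-bar falls outside 0 ± 2*SE.
xbar = 1.128        # observed mean difference (the green line in the figure)
se = 1.4            # assumed standard error, so that 2*SE = 2.8
cutoff = 2 * se     # approximate; the exact multiplier is slightly above 2

if abs(xbar) > cutoff:
    print("x-bar is in the critical region: choose HA")
else:
    print("x-bar is within 0 ± 2SE: choose H0")  # this branch runs here
```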
3: Probabilities of False Positive Conclusions
A false positive conclusion, i.e., choosing HA (a positive conclusion) when H0 is really true (so the conclusion is false), is considered the more serious error, denoted "Type I".
We have guaranteed (previous slide) that the rate for this error,
denoted α=level of significance, is 0.05, or that there is a 5%
chance of it occurring.
The 0.05 or 5% value is just the conventional level of risk for
positive conclusions that scientists have decided is acceptable. The
FDA also requires this level in most clinical studies.
The concept carries over for other levels of risk, though, and
statistical tables can determine the critical region for other levels,
e.g., approximately 0±1.65SE for α=0.10, where we would choose
HA more often, and make twice as many mistakes in the long run
in so doing.
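The multipliers for these critical regions come from statistical tables or software; here is a hedged sketch using scipy (not part of the handout):

```python
# Two-sided critical multipliers for different alpha levels
# (normal approximation; exact t multipliers are slightly larger).
from scipy.stats import norm

for alpha in (0.05, 0.10):
    z = norm.ppf(1 - alpha / 2)  # quantile cutting off alpha/2 in each tail
    print(f"alpha = {alpha:.2f}: critical region is outside 0 ± {z:.2f}*SE")
# alpha = 0.05 -> about 1.96*SE; alpha = 0.10 -> about 1.64*SE
```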
4: Probabilities of False Negative Conclusions
In our figure example, we choose H0, i.e., no treatment difference, a negative conclusion, since x-bar = 1.128 is between -2.8 and +2.8. If x-bar had fallen in the critical region and we had chosen HA, we would know there was only a 5% chance of such a result when H0 is true.
Can we also quantify the chances of a false negative conclusion,
which we might be making here?
Yes, but it will depend on what really constitutes a "false negative". I.e., we conclude μA - μB = 0, but if really μA - μB = 0.0001, are we wrong in a practical sense? Often, a value for a clinically relevant effect is specified, such as 3 in the figure example. Then, if HA: μA - μB = 3 is really true, but we choose H0, we have made a Type II error. Its probability is the area under the correct (HA now, blue) curve in the region where H0 is chosen (///). The computer needs to calculate this, and it is 0.41 here.
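The following sketch shows one way the computer's calculation could be approximated. The SE of 1.4 is an assumption inferred from the ±2.8 cutoffs, and a normal approximation is used, which is why it gives about 0.44 rather than the slide's exact, t-based 0.41:

```python
# Beta = area under the HA (blue) curve, centered at 3, that falls
# inside the "choose H0" region (-2.8, +2.8).
from scipy.stats import norm

se = 1.4            # assumed standard error of x-bar
mu_alt = 3.0        # clinically relevant difference under HA
lo, hi = -2.8, 2.8  # region where H0 is chosen

beta = norm.cdf(hi, loc=mu_alt, scale=se) - norm.cdf(lo, loc=mu_alt, scale=se)
print(f"beta ~ {beta:.2f}")  # ~0.44 here; the slide's exact value is 0.41
```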
3 and 4: Tradeoffs Between Risks of Two Errors
In our figure example, if μA - μB = 3 is the smallest difference that we care about (smaller differences are 0 in a practical sense), then we have an α = 0.05 chance of wrongly declaring that the treatments differ when in fact they are identical, and a β = 0.41 chance of declaring them the same when they really differ by 3.
If we try to decrease the risk of one of the errors, the risk of the other error increases, i.e., α↑ as β↓. [This is the same tradeoff as between the sensitivity and specificity of diagnostic tests.] To visualize it on our figure, imagine shifting the ///\\\ demarcation at 2.8 to the left, to say 2.7. That increases α. Then the /// area, i.e., β, decreases.
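This tradeoff can be checked numerically. Here is a sketch under the same assumptions as before (SE = 1.4 and the alternative at 3, both hypothetical):

```python
# Moving the demarcation from 2.8 to 2.7 raises alpha and lowers beta.
from scipy.stats import norm

se, mu_alt = 1.4, 3.0  # assumed SE and clinically relevant difference
for cut in (2.8, 2.7):
    alpha = 2 * (1 - norm.cdf(cut, loc=0, scale=se))  # \\\ area under H0 curve
    beta = (norm.cdf(cut, loc=mu_alt, scale=se)
            - norm.cdf(-cut, loc=mu_alt, scale=se))   # /// area under HA curve
    print(f"cutoff ±{cut}: alpha ~ {alpha:.3f}, beta ~ {beta:.2f}")
# cutoff ±2.8: alpha ~ 0.046, beta ~ 0.44
# cutoff ±2.7: alpha ~ 0.054, beta ~ 0.42
```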
Practical application: If A is a current treatment and B is a potential new one, then a smaller α means we are more concerned with avoiding the marketing of a non-superior new drug, while a smaller β means we are more concerned with not missing a superior new drug.
5: Effect of Study Size on Risks of Error
In the previous slide, the FDA may want a small α, and the drug company might want a small β. To achieve both, a larger study could be performed. We can verify this with our graph.
In our figure example, suppose we had had a larger study, say twice as many subjects in each group. Then both curves would be narrower, since their widths depend on the SE, which has N in the denominator. If we maintain α = 0.05, the ///\\\ demarcation will shift to the left due to the narrowed left curve, and β will be much smaller, due to both the narrower right curve and the demarcation shift. The demarcation could then be shifted back to the right to lower α, which increases the current β but still keeps it small.
There are algorithms to choose the right N to achieve any desired
α and β.
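One such algorithm is the standard normal-approximation sample size formula. Here is a sketch; the σ = 7 and Δ = 3 values are hypothetical, carried over from the earlier sketches:

```python
# Subjects per group for a two-sided, two-sample comparison of means.
from math import ceil
from scipy.stats import norm

def n_per_group(sigma, delta, alpha=0.05, beta=0.20):
    """N per group achieving the given alpha and beta (power = 1 - beta)."""
    z_a = norm.ppf(1 - alpha / 2)  # controls the Type I error rate
    z_b = norm.ppf(1 - beta)       # controls the Type II error rate
    return ceil(2 * (sigma * (z_a + z_b) / delta) ** 2)

print(n_per_group(sigma=7.0, delta=3.0))  # about 86 per group for 80% power
```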
Power of a Study
Statistical power = 1 – β.
Power is thus the probability of correctly detecting an effect in a study. In our example, the drug company is really thinking not in terms of β, but in terms of the ability of the study to detect that the new drug is actually more effective, if in fact it is.
Since the FDA requires α = 0.05, a major component of designing a study is the determination of its size so that it has sufficient power.
This is the topic of the next session, Session 4.