Controlling the Experimentwise Type I Error Rate


Controlling the Experimentwise
Type I Error Rate When Survival
Analyses Are Planned for Subsets
of the Sample.
Greg Yothers, MA
National Surgical Adjuvant Breast and Bowel Project (NSABP)
University of Pittsburgh, Department of Statistics
Joint Work With John Bryant, PhD
Director of the NSABP
University of Pittsburgh, Departments of Statistics and Biostatistics
• This work concerns the design and analysis of
clinical trials to compare treatment to control where
we wish to test the primary hypothesis on several
subgroups in addition to the global test.
• Unless steps are taken to control for multiple
comparisons, the type I error rate will be inflated in
this situation.
• Controlling for multiple comparisons generally
leads to a loss of power so that subgroup analyses
are often avoided. However, subgroup analyses
often serve a legitimate scientific purpose, and
should not be entirely avoided.
• To address this problem, we propose a method
whereby a pre-specified experimentwise alpha is
“spent” or allocated among the global (stratified) test
and the constituent subset (stratum-level) tests.
• We find the method to be efficient in terms of
experimentwise power when the treatment effect in
each stratum is in the same direction and the
magnitude of the range of treatment effects between
strata is not too great.
• The procedure can be used to make the design of a
clinical trial robust against the presence of a
treatment-by-strata interaction when a significant
interaction is not anticipated.
Outline
• Motivating Example - NSABP Protocol B-29.
• Define Experimentwise Type I Error Rate.
• Common methods of dealing with subgroup
testing: How do they control Type I error rate?
• Multiple testing approach: Perform all tests at
reduced nominal levels of significance so that the
experimentwise Type I error rate is controlled.
• Exploration of how to spend alpha on the
individual tests to achieve ‘good’ operating
characteristics for the overall experiment.
NSABP B-29 Schema
Eligibility: T1 or T2 or T3; pN0; M0; ER-Positive
Decision to use Chemotherapy*
No Chemotherapy (stratification: Age, Pathologic Tumor Size)
• Group 1: Tamoxifen
• Group 2: Tamoxifen + Octreotide
Chemotherapy (stratification: Age, Pathologic Tumor Size)
• Group 3: AC + Tamoxifen
• Group 4: AC + Tamoxifen + Octreotide
* The decision to use AC chemotherapy must be made prior to randomization.
Design Considerations
• $H_0$: Relative Risk = 1.
• Power $\geq$ .8 to detect Relative Risk $\leq$ .75, using a
.05-level two-sided stratified log-rank test.
• Power requirements and assumptions about rates
of accrual dictate the following:
i) Accrual of 3,000 patients over 5 years with
3 years additional follow-up.
ii) Final analysis following the 400th event.
• Physicians involved in the design of the trial thought
the effect of Octreotide would be unlikely to
materially interact with chemotherapy status.
• In planning the trial it was felt to be important to
provide for individual tests for the effect of Octreotide
in the presence of chemotherapy as well as in its
absence.
• It was considered unacceptable to treat these
subgroup analyses as post-hoc, or exploratory, so it
was necessary to design an analysis plan that
controlled for the experimentwise error rate.
Definition
Experimentwise Type I Error Rate
The probability of finding a significant
difference between treatment and control on
either the overall stratified test or any of the
stratum-specific tests given that no
difference exists.
Common approaches to controlling
experimentwise Type I error rate
• Unprotected Subgroup Tests: Perform the overall
stratified test at level $\alpha$; follow up with
stratum-specific $\alpha$-level tests.
• Protected Subgroup Tests: Perform the overall
stratified test at level $\alpha$; follow up with
stratum-specific $\alpha$-level tests only if the
treatment-by-strata interaction is significant.
• Protected Subgroup Tests (alternative): Test for
treatment-by-strata interaction at level $\alpha$. If the
interaction is significant, test for treatment effect
individually in each stratum at level $\alpha$. If the
interaction is not significant, test for overall
treatment effect at level $\alpha$.
Equivalence of Protection Schemes
• The two alternatives for protecting the
stratum-specific tests are actually quite similar in
operating characteristics, since if both the interaction
test and the overall stratified test are significant,
it is almost certain that at least one stratum-level
test will also be significant.
• It can be shown that this is true with probability
one in the case of k = 2 strata.
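The k = 2 claim can be checked with a short derivation. This is only a sketch, and it rests on an assumption the slides do not state: that the interaction test is based on the standardized difference between the two stratum-specific log hazard ratio estimates, written $Z_{int}$ here (a label introduced for this note only). $Z_0$, $Z_1$, $Z_2$ are the standardized overall and stratum-level log-rank statistics and $a$ is the share of the total variance contributed by stratum 1, as defined in the multiple testing part of this talk. All tests use the same critical value $c$.

```latex
\begin{align*}
Z_0 &= Z_1\sqrt{a} + Z_2\sqrt{1-a}, \\
Z_{int} &= Z_1\sqrt{1-a} - Z_2\sqrt{a} \qquad \text{(assumed form of the interaction statistic)}, \\
Z_0^2 + Z_{int}^2 &= Z_1^2 + Z_2^2 \qquad \text{(the map $(Z_1, Z_2) \mapsto (Z_0, Z_{int})$ is orthogonal)}, \\
|Z_0| > c \text{ and } |Z_{int}| > c
  &\;\Rightarrow\; Z_1^2 + Z_2^2 > 2c^2
  \;\Rightarrow\; \max(|Z_1|, |Z_2|) > c .
\end{align*}
```

Under that assumption, whenever both the overall test and the interaction test reject, at least one stratum-level test rejects as well, which is why the two protection schemes behave alike for two strata.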
[Figure: Experimentwise level of significance (vertical axis, roughly 0.0775 to 0.1175) plotted against a, the proportion of events in stratum 1 (horizontal axis, 0 to 0.5), for the unprotected and protected schemes.]
Range of experimentwise type I error rate for protected and
unprotected schemes. All tests performed at $\alpha$ = .05.

Number of Strata    Unprotected Tests    Protected Tests
2                   .098-.115            .080-.098
3                   .143-.161            .090-.098
4                   .185-.204            .094-.098
5                   .226-.245            .095-.098
7                   .302-.319            .096-.098
10                  .401-.416            .097-.098
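The unprotected column can be approximated by simulation. The following is a minimal sketch, not the authors' computation: it assumes that under the null hypothesis the standardized stratum statistics are independent standard normals and that the standardized stratified statistic is their sum weighted by the square roots of the event proportions (the relation used later in this talk). The function name and its arguments are hypothetical.

```python
import math
import random
from statistics import NormalDist


def unprotected_error_rate(weights, alpha=0.05, n_sim=200_000, seed=0):
    """Monte Carlo estimate of the experimentwise type I error rate when the
    overall stratified test and every stratum-level test are all run at
    level alpha (the "unprotected" scheme).

    weights: relative numbers of events (variances) in the strata.
    """
    total = sum(weights)
    root_a = [math.sqrt(w / total) for w in weights]  # sqrt of event proportions
    c = NormalDist().inv_cdf(1 - alpha / 2)           # two-sided critical value
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        # Standardized stratum statistics under the null hypothesis.
        z = [rng.gauss(0.0, 1.0) for _ in weights]
        # Standardized stratified statistic: weighted sum of the stratum statistics.
        z0 = sum(zi * ra for zi, ra in zip(z, root_a))
        if abs(z0) > c or any(abs(zi) > c for zi in z):
            rejections += 1
    return rejections / n_sim
```

For example, `unprotected_error_rate([1, 1])` simulates two strata with equal numbers of events, and `unprotected_error_rate([1, 9])` a 1:9 split; the estimates carry Monte Carlo error of roughly 0.001 at the default number of simulations.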
Multiple testing approach
• We now consider a multiple testing approach
where one performs an overall test for
treatment effect based on the stratified log-rank
statistic followed by tests within each stratum.
• All tests are carried out at reduced levels of
significance so that the experimentwise level
of significance is maintained at a specified
rate.
$L_1$ and $L_2$ - the log-rank statistics from the individual stratum tests.
$V_1$ and $V_2$ - the variances of the log-rank statistics $L_1$ and $L_2$.
$L_0 = L_1 + L_2$ - the stratified log-rank statistic.
Then, since $L_1$ and $L_2$ are independent, $V_0 = V_1 + V_2$.
$\alpha_0$ - the nominal level of significance of the test based on $L_0$.
$c_0$ - the corresponding critical value from the standard normal distribution.
$\alpha_1$ and $\alpha_2$ - the nominal levels of significance of the tests based on $L_1$ and $L_2$.
$c_1$ and $c_2$ - the corresponding critical values from the standard normal distribution.
Now, $Z_i = L_i / \sqrt{V_i}$, $i = 0, 1, 2$, represents the standardized log-rank statistics.
\[
L_0 = L_1 + L_2
\;\Rightarrow\; Z_0 \sqrt{V_0} = Z_1 \sqrt{V_1} + Z_2 \sqrt{V_2}
\;\Rightarrow\; Z_0 = Z_1 \sqrt{a} + Z_2 \sqrt{1 - a},
\]
where $a = \dfrac{V_1}{V_1 + V_2}$, $0 < a < 1$.
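The relation between the standardized statistics can be checked numerically. This is a small illustrative sketch; the function name and the input values in the usage note are hypothetical and are not data from the trial.

```python
import math


def standardized_statistics(l1, v1, l2, v2):
    """Return (z0, z1, z2, a) for two strata, given the stratum log-rank
    statistics l1, l2 and their variances v1, v2."""
    v0 = v1 + v2                      # variance of the stratified statistic
    z1 = l1 / math.sqrt(v1)           # standardized stratum 1 statistic
    z2 = l2 / math.sqrt(v2)           # standardized stratum 2 statistic
    z0 = (l1 + l2) / math.sqrt(v0)    # standardized stratified statistic
    a = v1 / v0                       # share of the variance in stratum 1
    return z0, z1, z2, a
```

With made-up inputs such as `standardized_statistics(-12.0, 60.0, -5.0, 40.0)`, the returned `z0` agrees with `z1 * sqrt(a) + z2 * sqrt(1 - a)` up to floating-point rounding, and `a` equals 0.6.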
Let $RR_i$ represent the relative risk in the $i$th stratum.
$H_0: RR_i = 1$, $i = 1, 2$, denotes the null hypothesis, and
$RR_1 = RR_1^{Alt}$ and $RR_2 = RR_2^{Alt}$ a specific alternative hypothesis.
Then, when the alternative hypothesis is true, the test statistic
\[
Z_i = \frac{L_i}{\sqrt{V_i}}
\quad \text{is asymptotically distributed as} \quad
N\!\left( \ln\!\left(RR_i^{Alt}\right) \sqrt{V_i},\; 1 \right).
\]
Let $\theta_i = \ln\!\left(RR_i^{Alt}\right) \sqrt{V_i}$; then
\[
W_i = Z_i - \theta_i = \frac{L_i}{\sqrt{V_i}} - \ln\!\left(RR_i^{Alt}\right) \sqrt{V_i}
\]
is distributed as standard normal under the alternative hypothesis.
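As a worked illustration of the shift under the alternative, the sketch below computes the mean of the standardized statistic and the power of a single two-sided test. It assumes the common approximation that the log-rank variance is about one quarter of the number of events under 1:1 randomization; that approximation, the function names, and the use of theta as the symbol for the mean are assumptions made for this example rather than details given in the slides.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def noncentrality(rr_alt, variance):
    """Mean of the standardized log-rank statistic under the alternative:
    theta = ln(RR_alt) * sqrt(V)."""
    return math.log(rr_alt) * math.sqrt(variance)


def two_sided_power(theta, alpha):
    """Power of a two-sided level-alpha test based on Z ~ N(theta, 1)."""
    c = STD.inv_cdf(1 - alpha / 2)
    return STD.cdf(-c - theta) + (1 - STD.cdf(c - theta))


# Design example from the talk: 400 events and relative risk .75.
# V is approximated by events / 4 (an assumption of this sketch).
theta_overall = noncentrality(0.75, 400 / 4)
power_at_05 = two_sided_power(theta_overall, 0.05)
power_at_04 = two_sided_power(theta_overall, 0.04)
```

Under these assumptions `power_at_05` comes out near the 0.82 quoted for the .05-level overall test, and `power_at_04` near the 0.79 quoted when the overall test is run at the .04 level.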
Definition
Experimentwise Power
The probability of detecting at least one significant
difference during the multiple testing procedure
given the true RR in each stratum. When the true RR
in each stratum is 1, we refer to the power as the
Type I error rate.
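The experimentwise power against a specific alternative
hypothesis can be written as:
\[
\mathrm{Power}(\alpha_0, \alpha_1, \alpha_2, a, \theta_1, \theta_2)
= 1 - \iint \phi(z_1 - \theta_1)\, \phi(z_2 - \theta_2)\, dz_1\, dz_2 ,
\]
where $\phi$ denotes the standard normal density,
$\theta_i = \ln\!\left(RR_i^{Alt}\right) \sqrt{V_i}$ is the mean of $Z_i$ under the
alternative, and the integral is taken over the acceptance region
defined by:
\[
\left\{ (z_1, z_2) : |z_1| < c_1,\; |z_2| < c_2,\; |z_0| < c_0 \right\},
\qquad z_0 = z_1 \sqrt{a} + z_2 \sqrt{1 - a} .
\]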
Using the simplified region of integration, we can
rewrite the power as follows:
\[
\begin{aligned}
\mathrm{Power}(\alpha_0, \alpha_1, \alpha_2, a, \theta_1, \theta_2)
= 1 - \int_{-c_2}^{c_2} \phi(z_2 - \theta_2)
\Bigg[ &\Phi\!\left( \min\!\left( c_1, \max\!\left( -c_1, \frac{c_0 - z_2 \sqrt{1 - a}}{\sqrt{a}} \right) \right) - \theta_1 \right) \\
- &\Phi\!\left( \max\!\left( -c_1, \min\!\left( c_1, \frac{-c_0 - z_2 \sqrt{1 - a}}{\sqrt{a}} \right) \right) - \theta_1 \right) \Bigg] dz_2 ,
\end{aligned}
\]
where $\Phi$ is the CDF of the standard normal distribution.
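The single-integral form lends itself to direct numerical evaluation. The following is a minimal sketch using Simpson's rule and only the Python standard library; it is not the authors' S-Plus code. The function names, the choice of Simpson's rule, and the symbol theta for the mean of each standardized statistic under the alternative are assumptions of this example, and all nominal levels must lie strictly between 0 and 1.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def crit(alpha):
    """Two-sided standard normal critical value for nominal level alpha."""
    return STD.inv_cdf(1 - alpha / 2)


def power_two_strata(alpha0, alpha1, alpha2, a, theta1, theta2, n=2000):
    """Experimentwise power for k = 2 strata.

    a is the proportion of the variance (events) in stratum 1; theta1 and
    theta2 are the means of Z1 and Z2 under the alternative (0 under H0).
    n is the (even) number of Simpson subintervals.
    """
    c0, c1, c2 = crit(alpha0), crit(alpha1), crit(alpha2)
    ra, rb = math.sqrt(a), math.sqrt(1 - a)

    def integrand(z2):
        # Limits on z1 implied by |z0| < c0, clipped to |z1| < c1.
        upper = min(c1, max(-c1, (c0 - z2 * rb) / ra))
        lower = max(-c1, min(c1, (-c0 - z2 * rb) / ra))
        return STD.pdf(z2 - theta2) * (STD.cdf(upper - theta1) - STD.cdf(lower - theta1))

    # Composite Simpson's rule over -c2 <= z2 <= c2.
    h = 2 * c2 / n
    total = integrand(-c2) + integrand(c2)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * integrand(-c2 + i * h)
    acceptance_probability = total * h / 3
    return 1 - acceptance_probability
```

Calling `power_two_strata(0.05, 0.05, 0.05, 0.5, 0.0, 0.0)` gives the experimentwise type I error rate of the unprotected scheme with equal strata, and passing nonzero `theta1` and `theta2` gives experimentwise power against that alternative.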
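These results generalize to k strata as follows:
\[
Z_0 = \sum_{i=1}^{k} Z_i \sqrt{a_i}, \quad \text{where } a_i = \frac{V_i}{\sum_{j=1}^{k} V_j}, \quad 0 < a_i < 1, \quad \sum_{j=1}^{k} a_j = 1 .
\]
\[
\begin{aligned}
&\mathrm{Power}(\alpha_0, \alpha_1, \ldots, \alpha_k, a_1, \ldots, a_k, \theta_1, \ldots, \theta_k) \\
&\quad = 1 - \int_{-c_k}^{c_k} \cdots \int_{-c_2}^{c_2} \phi(z_2 - \theta_2) \cdots \phi(z_k - \theta_k) \\
&\qquad \times \Bigg[ \Phi\!\left( \min\!\left( c_1, \max\!\left( -c_1, \frac{c_0 - \sum_{i=2}^{k} z_i \sqrt{a_i}}{\sqrt{a_1}} \right) \right) - \theta_1 \right) \\
&\qquad\quad - \Phi\!\left( \max\!\left( -c_1, \min\!\left( c_1, \frac{-c_0 - \sum_{i=2}^{k} z_i \sqrt{a_i}}{\sqrt{a_1}} \right) \right) - \theta_1 \right) \Bigg] dz_2 \cdots dz_k .
\end{aligned}
\]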
• The multiple integral in the previous equation can be
difficult to evaluate when the number of strata goes
beyond about 3 or 4.
• Fortunately there is a recursive representation of the
power function that facilitates computation when there
are many strata.
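Given $\alpha_0, \alpha_1, \ldots, \alpha_k, a_1, \ldots, a_k, \theta_1, \ldots, \theta_k$,
define
\[
\psi_r(z) = \Pr\!\left( |Z_j| \le c_j,\; j = 1, \ldots, r,\;\; \sum_{i=1}^{r} Z_i \sqrt{a_i} \le z \right),
\]
then,
\[
\psi_1(z) = \Phi\!\left( \max\!\left( -c_1, \min\!\left( c_1, \frac{z}{\sqrt{a_1}} \right) \right) - \theta_1 \right) - \Phi\!\left( -c_1 - \theta_1 \right),
\]
and,
\[
\psi_{r+1}(z) = \int_{-c_{r+1}}^{c_{r+1}} \phi(u - \theta_{r+1})\, \psi_r\!\left( z - u \sqrt{a_{r+1}} \right) du .
\]
For k strata,
\[
\mathrm{Power}(\alpha_0, \alpha_1, \ldots, \alpha_k, a_1, \ldots, a_k, \theta_1, \ldots, \theta_k)
= 1 - \left[ \psi_k(c_0) - \psi_k(-c_0) \right].
\]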
An S-Plus function implementing the recursive method of calculating
power is available.
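The recursion can be sketched in a few lines of Python. This is not the S-Plus function mentioned above: it is a naive illustration that re-evaluates the inner function at every quadrature point, so its cost grows quickly with the number of strata, whereas a practical implementation would tabulate each level on a grid. The function names, the use of Simpson's rule, and the names psi and theta for the recursive function and the alternative means are assumptions of this example.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def simpson(f, lo, hi, n=200):
    """Composite Simpson's rule with an even number n of subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3


def psi(r, z, c, a, theta):
    """Pr(|Z_j| <= c_j for j = 1..r and sum_{i<=r} Z_i*sqrt(a_i) <= z).

    c, a, theta are lists for strata 1..k (index 0 holds stratum 1).
    """
    if r == 1:
        top = max(-c[0], min(c[0], z / math.sqrt(a[0])))
        return STD.cdf(top - theta[0]) - STD.cdf(-c[0] - theta[0])
    cr, root, th = c[r - 1], math.sqrt(a[r - 1]), theta[r - 1]
    # Integrate over the r-th statistic and recurse on the remaining ones.
    return simpson(
        lambda u: STD.pdf(u - th) * psi(r - 1, z - u * root, c, a, theta),
        -cr, cr,
    )


def experimentwise_power(alpha0, alphas, a, theta):
    """Experimentwise power for k = len(alphas) strata.

    alpha0: nominal level of the overall test; alphas: stratum-level nominal
    levels; a: variance proportions (summing to 1); theta: means of the
    standardized stratum statistics (all 0 under the null hypothesis).
    """
    c0 = STD.inv_cdf(1 - alpha0 / 2)
    c = [STD.inv_cdf(1 - al / 2) for al in alphas]
    k = len(alphas)
    return 1 - (psi(k, c0, c, a, theta) - psi(k, -c0, c, a, theta))
```

For instance, `experimentwise_power(0.05, [0.05] * 3, [1/3] * 3, [0.0] * 3)` evaluates the experimentwise type I error rate for three equally sized strata with every test run at the .05 level.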
How should we spend alpha?
• The question arises as to how the type I error rates should be
divided between the overall and the stratum-specific tests, or
rather, how much alpha should be spent on the stratum-specific
tests.
• For k = 2 strata and $\alpha_{exper}$ = 0.05, the table and figure which
follow show a variety of combinations of the nominal size of the
overall test ($\alpha_0$) and the nominal size of the within-stratum tests
($\alpha_1$ and $\alpha_2$).
• For simplicity, we only consider the case where $\alpha_1 = \alpha_2$. The
possibilities form a continuum from (.05, 0) (no stratum-specific
tests) to (0, .0253) (no overall test).
• Given $\alpha_{exper}$, $\alpha_0$, and the constraint $\alpha_1 = \alpha_2$, the common value of
$\alpha_1$ and $\alpha_2$ is a function of a (the proportion of events in the first
stratum); however, the effect of varying a is weak.
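One way to find the common stratum-level size for a given overall size is a root search on the null-hypothesis version of the two-strata power integral. The sketch below uses bisection and Simpson's rule; the function names and numerical choices are assumptions made for illustration, and it requires the overall size to lie strictly between 0 and the experimentwise level.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def null_error(alpha0, alpha1, a, n=1000):
    """Experimentwise type I error rate for k = 2 strata when the overall
    test has size alpha0 and both stratum tests have size alpha1."""
    c0 = STD.inv_cdf(1 - alpha0 / 2)
    c1 = STD.inv_cdf(1 - alpha1 / 2)
    ra, rb = math.sqrt(a), math.sqrt(1 - a)

    def integrand(z2):
        upper = min(c1, max(-c1, (c0 - z2 * rb) / ra))
        lower = max(-c1, min(c1, (-c0 - z2 * rb) / ra))
        return STD.pdf(z2) * (STD.cdf(upper) - STD.cdf(lower))

    # Composite Simpson's rule over -c1 <= z2 <= c1 (n must be even).
    h = 2 * c1 / n
    total = integrand(-c1) + integrand(c1)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * integrand(-c1 + i * h)
    return 1 - total * h / 3


def stratum_alpha(alpha0, a, alpha_exper=0.05, tol=1e-6):
    """Bisection for the common stratum-level size that brings the
    experimentwise error rate to alpha_exper (needs 0 < alpha0 < alpha_exper)."""
    lo, hi = 1e-9, alpha_exper
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # The error rate increases with the stratum-level size.
        if null_error(alpha0, mid, a) > alpha_exper:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2
```

For example, `stratum_alpha(0.04, 0.5)` searches for the stratum-level size that pairs with an overall size of .04 when the two strata contribute equally; repeating the call with `a = 0.25` or `a = 0.10` shows how weakly the answer depends on a.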
Possible $\alpha$-spending schemes for k = 2 strata, $\alpha_{exper}$ = .05, and $\alpha_1 = \alpha_2$.

                 $\alpha_1 = \alpha_2$
$\alpha_0$       a = 0.50    a = 0.25    a = 0.10
.050             .0000       .0000       .0000
.045             .0060       .0060       .0057
.040             .0099       .0104       .0108
.030             .0161       .0168       .0183
.020             .0207       .0214       .0229
.010             .0240       .0244       .0250
.000             .0253       .0253       .0253
[Figure: Possible $\alpha$-spending schemes for k = 2 strata, $\alpha_{exper}$ = .05, and $\alpha_1 = \alpha_2$. Size of the stratum-level tests (0 to 0.025) plotted against the size of the overall test (0 to 0.05), with one curve each for a = .5, a = .25, and a = .1.]
[Figure: Experimentwise power ($\alpha_1 = \alpha_2$, a = .5) and power of the overall stratified test, plotted against the size of the overall test (0.01 to 0.05). Curves: experimentwise power, power of overall test, baseline overall power.]
Now we see how power is affected when there is no
difference between strata and some of our alpha is spent on
stratum-specific tests. This figure shows the case where
there is an overall 25% reduction in event rate, no
treatment-stratum interaction, and there are 200 events in
each of the k = 2 strata. The overall stratified log-rank test
at the .05 level has power 0.82. Using the multiple testing
procedure with $\alpha_0 = 0.04$ and $\alpha_1 = \alpha_2 = 0.0099$ yields a
power of 0.79 for the overall test and an experimentwise
power of 0.80. Thus, spending 1% of total alpha (that is, .01
of the .05, by setting $\alpha_0 = 0.04$) leads to a very small loss
of power even in the case of no interaction. Setting $\alpha_0$ less
than about 0.03, on the other hand, leads to a rather
substantial loss in power.
[Figure: Experimentwise power ($\alpha_1 = \alpha_2$) and power of the overall test in the presence of treatment-strata interaction, plotted against the size of the overall test (0.01 to 0.05). Curves: experimentwise power, power of overall test, baseline overall power.]
Next, we see how power is affected in the presence of
treatment-strata interaction when some of our alpha is spent
on stratum-specific tests. The figure shows the case of a
41% reduction in event rate in one stratum and a 9%
reduction in the other stratum. We again assume that there
are 400 total events and that the number of events on the
control arm of each stratum is equal. When the strata are
pooled we have a 25% reduction in event rate. The overall
stratified log-rank test at the .05 level has power 0.829.
Reducing $\alpha_0$ to 0.04 reduces the power of the overall test to
0.804, but increases the experimentwise power to 0.906.
Spending any more than 1% of alpha on subgroup tests
does not materially increase the experimentwise power
even in the presence of this very substantial interaction.
[Figure: Experimentwise power ($\alpha_1 = \alpha_2$) for various pairs of reduction in event rates, plotted against the size of the overall test (0.01 to 0.05). Curves: (50, 0), (40, 10), (35, 15), (25, 25), and baseline overall power.]
Next, we see how varying the magnitude of interaction affects
experimentwise power. The figure shows a variety of pairs of
reduction in event rates in the two strata such that when the
strata are pooled we have a 25% reduction in event rate. We
again assume that there are 400 total events and that the
number of events on the control arm of each stratum is equal.
The overall stratified log-rank test at the .05 level has power
approximately 0.82. Reducing $\alpha_0$ to 0.04 dramatically
increases the experimentwise power in the presence of
interaction. Spending any more than 1% of alpha on
subgroup tests does not materially increase the
experimentwise power even in the presence of substantial
interaction. Note that for small interaction (the pair (35, 15)),
the power is maximized near $\alpha_0$ = 0.04.
[Figure: Experimentwise power ($\alpha_1 = \alpha_2$) for various pairs of reduction in event rates when the strata are unbalanced, plotted against the size of the overall test (0.01 to 0.05). Curves: (50, 0), (40, 10), (35, 15), (25, 25), and baseline overall power.]
Until now we have only considered what happens when
the numbers of patients in the two strata are roughly
equal. We next consider the case where most of the
patients are assigned to the stratum with the smaller
treatment effect. The figure shows a variety of pairs of
reduction in event rates for comparison with the
previous figure. We again assume that there are 400
total events, but now the numbers of events on the
control arms of the two strata are not equal. The
number of events on the control arm of stratum 2 is
three times the number on the control arm of
stratum 1. We see that the power suffers when most of
the patients are in the stratum with a small reduction in
event rate.
[Figure: Experimentwise power ($\alpha_1 = \alpha_2$) for various pairs of reduction in event rates with no overall treatment effect, plotted against the size of the overall test (0.01 to 0.05). Curves: (25, -25), (20, -20), (12, -12), (0, 0).]
Now, we consider the case of no overall treatment
effect but varying degrees of treatment-strata
interaction. The figure shows a variety of pairs of
reduction in event rates in the two strata such that
when the strata are pooled we have no reduction in
event rate. We again assume that there are 400 total
events and that the number of events on the control
arm of each stratum is equal. Reducing $\alpha_0$ increases
the experimentwise power in the presence of
interaction. When there is no overall effect and the
treatment is beneficial in one stratum and detrimental
in the other, the multiple testing approach is not very
powerful.
[Figure: Average experimentwise power ($\alpha_1 = \alpha_2$) for various allocations of events to the control arms of the two strata, plotted against the size of the overall test (0.01 to 0.05). Curves: (1:1), (1:2), (1:3), (1:5), (1:9), and baseline overall power.]
The final figure shows average experimentwise power for
various allocations of events to the control arms (rates of
accrual) of the two strata. In each case, there is a 25%
reduction in event rate when the strata are pooled. We place a
prior probability distribution on the difference in percent
reduction in event rate. The prior is normal with mean zero
and standard deviation such that there is a 5% probability of
qualitative interaction (treatment is beneficial in one stratum
and detrimental in the other). The figure shows the expected
power given the prior distribution.
We see that the multiple testing procedure is most effective
when the numbers of events on the control arms of the two
strata are not too far out of balance.
Conclusion
• The alpha spending approach described here is very
efficient and effective when the treatment effect is in
the same direction in each stratum and there may or
may not be small to moderate differences in the size
of the effect between strata.
• The method is also sensitive to the balance of
allocation of patients (events) to the two strata. When
the sizes of the stratum level tests are equal, the
approach seems to be quite effective when the
balance is no worse than about 3 to 1. We suggest
spending more alpha on the stratum with the most
patients (events) when the number of patients is out
of balance.
• Spending between ½ and 1 percent of alpha (setting
$\alpha_0$ equal to .045 to .04) would seem to be a prudent
choice for k = 2 strata and the range of circumstances
explored in this paper when substantial interaction is
thought to be unlikely a priori.
• When there is no overall effect but there may be
offsetting effects between the strata, the alpha
spending approach is not very powerful. Designing
the trial for a test for interaction would be much
more effective in this situation. If one were to use
the multiple testing procedure in this situation, most
of the alpha should be spent on the within strata tests.
• In the design of NSABP protocol B-29, we expected
little or no interaction and nearly equal accrual to the
two stratum levels. Given our design assumptions in
B-29, we spent about ½% of alpha on stratum-level
tests and set the sizes of the stratum-level tests equal.
If we had anticipated unequal accrual to the strata or
significant interaction, we likely would have altered
our choices. Our choice of alpha spending ($\alpha_0 \approx$
0.045, $\alpha_1 = \alpha_2 \approx$ 0.006) proved to preserve power in the
presence of mild perturbations of design
assumptions.
• The tools described in this paper can be adapted to
the design of other potential trials. Given prior
beliefs regarding the likelihood of significant
treatment-strata interaction, balance of accrual to
the stratum levels, and other factors, one can
explore the sensitivity of power to design
assumptions and parameters much as we have in
the latter part of this paper.