
University of Wisconsin – Madison
Department of Biostatistics and Medical Informatics
Equivalence Trials: a Horse of a
Different Color
Rick Chappell, Ph.D.
Professor
Department of Biostatistics and Medical Informatics
University of Wisconsin Medical School
[email protected]
Δ what? Choice of outcome
scale in non-inferiority trials
Rick Chappell
Professor
Department of Statistics and
Department of Biostatistics and Medical Informatics
University of Wisconsin
[email protected]
Outline
I. Definition of Equivalence Trials with a Motivating Example
II. Choosing the Null Hypothesis
III. Consequences of the Choice
IV. One way to avoid the Choice - “Mixed Hypotheses” (with Xiaodan Wei)
V. Consequences of Not Trying to Find an Effect
I. A Motivating Example: SPORTIF III (Darius, 2002)
Company: AstraZeneca
Comparison: H 376/95 (Ximelagatran) vs. Warfarin
Treatment and Followup: Up to twenty-six months
Outcome: Stroke and other Events
Measured by annual incidence rate
Type: Equivalency, margin 2% per year (6% vs. 4%)
Sample size: 3000 patients
Definition of Equivalence Trial
Consider treatment rate T and control rate C:
• A Superiority Trial examines the treatment effect C - T and attempts to show it to be positive (when small values are good).
• A larger sample size n yields more power to precisely estimate C - T and detect a difference in effects.
• An Equivalence (non-inferiority, active control) Trial examines C - T and attempts to show that it is not too small.
• Naively, one can put a confidence interval on C - T and claim success if it contains 0. The chance of this is maximized with a lower n.
Solution
Require that the trial show the treatment difference to be no worse than a "prespecified [negative] degree of inferiority" Δ [ICH E-3]:
C - T ≥ Δ .
Temple and Ellenberg (2000) point out that this is a "not-too-much-inferiority trial".
Here too, a larger n, by giving more precision to the estimate of C - T, raises the chance of success.
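For concreteness, here is a minimal sketch (illustrative rates and arm sizes, not trial data) of the confidence-interval form of this requirement: compute a lower confidence bound for C - T and claim noninferiority only if it stays above the prespecified negative margin Δ.

from statistics import NormalDist
import math

# Hypothetical annual event proportions and arm sizes (not SPORTIF results).
n_c, n_t = 1500, 1500
rate_c, rate_t = 0.040, 0.048        # control and treatment event proportions
delta = -0.02                        # prespecified (negative) margin for C - T

diff = rate_c - rate_t
se = math.sqrt(rate_c * (1 - rate_c) / n_c + rate_t * (1 - rate_t) / n_t)
lower = diff - NormalDist().inv_cdf(0.975) * se   # one-sided 97.5% lower bound
verdict = "noninferior" if lower > delta else "noninferiority not shown"
print(f"C - T = {diff:.3f}, lower bound = {lower:.3f}: {verdict}")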
II. Specifying the Null Hypothesis
Hypotheses in Superiority Trials
H0: C - T = 0 (Treatment may have no effect)
vs.
HA: C - T > 0 (Treatment has Good Effect)
Hypotheses in Equivalence Trials
H0: C - T < Δ (Treatment might be much worse)
vs.
HA: C - T ≥ Δ (Treatment isn’t much worse)
Ways Effects can be Defined for
a Time-to-Event Trial
1. Survival function S(t) at all followup times
2. Hazard function λ(t) at all followup times
3. Event rate by a given time
4. Median time to event
H0 in a superiority trial is the same for definitions 1. and 2.
Definitions 3. and 4. are weaker, but are implied by 1. and 2.
"In Superiority Trials, all Null Hypotheses are the Same but all
Alternatives are Different."
- Tolstoy?
For example,
SC(t) - ST(t) = Δ
λC(t) - λT(t) = Δ'
medianC - medianT = Δ''
are in general all incompatible unless Δ = Δ' = Δ'' = 0.
And Chen (2000) gives methods for testing equivalence of differences of proportions, ratios of proportions, and odds ratios.
These hypotheses also are incompatible unless the margins are zero.
Margins are not zero in equivalence studies, so we must pick the scale carefully. It ought to be the scale on which the specified margin is clinically relevant.
That may not be the scale that gives the most convenient statistical analysis.
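A minimal numerical sketch of why a margin on one scale does not transfer to another (my assumption: exponential event times, with illustrative control hazards): a fixed hazard difference of 0.02 per year corresponds to different two-year survival differences and different median differences depending on the baseline rate.

import math

# Assuming exponential event times: S(t) = exp(-lambda*t), median = ln(2)/lambda.
d_hazard = 0.02                      # fixed hazard-scale margin, per year
for lam_c in (0.04, 0.08, 0.16):     # illustrative control hazards
    lam_t = lam_c + d_hazard
    d_surv2 = math.exp(-2 * lam_c) - math.exp(-2 * lam_t)    # 2-year survival difference
    d_median = math.log(2) / lam_c - math.log(2) / lam_t     # median difference, years
    print(f"control hazard {lam_c:.2f}: S(2) difference {d_surv2:.3f}, "
          f"median difference {d_median:.1f} years")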
A Motivating Example: SPORTIF III (Darius, 2002)
Company: AstraZeneca
Comparison: H 376/95 vs. Warfarin
Treatment and Followup: Up to twenty-six months
Outcome: Stroke and other Events
Measured by annual incidence rate
Type: Equivalency, margin 2% per year (6% vs. 4%)
Sample size: 3000 patients
In the SPORTIF example, the outcome is the annual incidence rate, approximately the yearly hazard λ, so that the hypotheses are
H0: λC - λT < -2% per year
vs.
HA: λC - λT ≥ -2% per year
• If event rates are constant then λ is just the exponential rate parameter, estimated by
λ̂ = Σ δi / Σ fi ,
where δi is an event indicator, fi is the followup time, and summation is over all subjects.
• But typically the event rate will change with followup. Then λ becomes an average annual rate over two years.
• How to estimate it efficiently - using the full two years of data? What if followup varies?
• How to estimate it robustly - not assuming any parametric distribution?
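A minimal sketch of the estimator in the first bullet above, λ̂ = Σ δi / Σ fi, i.e., total events divided by total person-years of follow-up; the data are toy values, not SPORTIF data.

# Toy data: event indicators and follow-up times (years) for a handful of subjects.
events   = [1, 0, 0, 1, 0, 0, 0, 1]
followup = [0.7, 2.0, 1.4, 1.1, 2.0, 0.3, 1.8, 0.9]

lam_hat = sum(events) / sum(followup)    # events per person-year of follow-up
print(f"estimated annual event rate: {lam_hat:.3f}")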
III. Consequences of the Choice(s)
Here, we chose
1. The noninferiority margin, Δ = -2%/year; and
2. The scale of comparison in H0, the annual incidence rate (approximately, the hazard).
There is much literature on choice #1 but almost none on #2.
Choosing Δ = -2%/year:
Δ should be smaller in magnitude than the original estimated effect of Warfarin. If the event rate with Warfarin is 4%, reduced from say 7% without it, it would be nonsensical to consider Δ ≤ -3%: that would allow "therapeutic equivalence" to be as bad as no treatment.
But how do we choose the scale of comparison?
Scale choice and balanced randomization
Consider the simple normal two-sample constant variance case:
XiT ~ iid N(T, σ²)
XiC ~ iid N(C, σ²)
Then the usual (unstandardized) test statistic for superiority is the difference in sample means. Obviously, the allocation which minimizes its variance is 1:1.
Suppose we are designing an equivalence trial to test the hypothesis
H0: C - T < Δ .
Then the optimal allocation is still 1:1. However, if we want to test
H0: C / T < Δ* ,
which is equivalent to
H0: C - Δ* × T < 0
instead, then the optimal allocation is Δ* : 1 in favor of the control group - a big difference.
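A minimal sketch of the allocation point (unit variance and a hypothetical margin Δ* = 0.5, chosen only for illustration): scan the fraction of patients given to the control arm and report where Var(Ĉ - Δ*·T̂) is smallest. The point is simply that the ratio-scale null moves the optimum away from 1:1.

# Var(C_hat - Delta* * T_hat) = sigma^2 * (1/n_C + Delta*^2 / n_T), with n_C = f * N.
sigma2, n_total, d_star = 1.0, 1000, 0.5      # hypothetical values
grid = [i / 100 for i in range(1, 100)]       # candidate control fractions f
_, best_f = min(
    (sigma2 * (1 / (f * n_total) + d_star ** 2 / ((1 - f) * n_total)), f) for f in grid
)
print(f"variance-minimizing control fraction: {best_f:.2f} (vs. 0.50 on the difference scale)")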
Scale choice and power
Suppose, in a trial with a binary “failure” outcome, we are deciding between the null hypothesis of additive noninferiority
H0: pT - pC ≥ .004
and that of multiplicative noninferiority
H0: pT / pC ≥ 1.5.
These are equal at pC = .008. Should power be the same for
HA: pT = pC = .008?
Answer: no (surprisingly, to me and the trial’s
principal investigator).
For proportions less than .01, the range in
which we are interested, the hypothesis of
multiplicative noninferiority is much more
demanding and requires a larger sample
size:
About 21,000 instead of 14,000!
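A rough check of this gap using normal approximations (my assumptions: one-sided α = 0.025, 80% power, pT = pC = 0.008 under the alternative, and the ratio margin tested on the log scale). The exact totals depend on the test and power actually used, so they differ somewhat from the 14,000 and 21,000 quoted above, but the multiplicative margin clearly demands the larger trial.

import math
from statistics import NormalDist

z = NormalDist().inv_cdf
za, zb = z(1 - 0.025), z(0.80)                # one-sided alpha = 0.025, power = 0.80
p = 0.008                                     # common event rate under the alternative

# Additive margin of 0.004, tested on the risk-difference scale.
var_diff = 2 * p * (1 - p)                    # n * Var(pT_hat - pC_hat)
n_add = (za + zb) ** 2 * var_diff / 0.004 ** 2

# Multiplicative margin of 1.5, tested on the log risk-ratio scale.
var_logratio = 2 * (1 - p) / p                # n * Var(log(pT_hat / pC_hat))
n_ratio = (za + zb) ** 2 * var_logratio / math.log(1.5) ** 2

print(f"total n, additive margin:       {2 * math.ceil(n_add):,}")
print(f"total n, multiplicative margin: {2 * math.ceil(n_ratio):,}")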
IV. An Unusual Choice of Scale for the
Sake of Clinical Relevance
We may be interested in equivalence on different
scales depending on the parameter’s size. For
example, consider a trial with a binary outcome
whose rates in two groups are p1 and p2 . We
might be interested in their difference for small p1
or their ratio for large p1.
If so, we could have a pair of mixed hypotheses:
H0: p2 - p1 ≥ Δ1 if p1 ≤ π*
    p2 / p1 ≥ Δ2 if p1 > π*
vs.
HA: p2 - p1 < Δ1 if p1 ≤ π*
    p2 / p1 < Δ2 if p1 > π*
where Δ1 and Δ2 are the equivalence limits for the
difference and ratio. π* is the point at which the null
hypothesis changes from the difference to the ratio
test. We set π* = Δ1 / (Δ2 – 1) for continuity.
The treatments are equivalent if equivalence holds
on either the additive or multiplicative scale.
Alternatively, we could require both to hold.
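A minimal sketch of the resulting null boundary (the margins Δ1 = .004 and Δ2 = 1.5 are borrowed from the earlier power example, purely for illustration): below π* the boundary is the additive limit p1 + Δ1, above it the multiplicative limit Δ2·p1, and the choice π* = Δ1 / (Δ2 - 1) makes the two pieces meet.

def mixed_boundary(p1, delta1=0.004, delta2=1.5):
    # Boundary of the noninferiority region for p2, given control rate p1.
    pi_star = delta1 / (delta2 - 1.0)          # switch point, chosen for continuity
    return p1 + delta1 if p1 <= pi_star else delta2 * p1

for p1 in (0.002, 0.008, 0.020):               # below, at, and above pi* = 0.008
    print(f"p1 = {p1:.3f}: noninferiority boundary for p2 = {mixed_boundary(p1):.4f}")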
To derive the test statistic, first rotate and center H0 to make it symmetric about the vertical axis. Let
θ = (1/2) tan⁻¹(Δ2) + π/8 ,
ψ = (1/2) tan⁻¹(Δ2) - π/8 ,
B = [ cos θ, sin θ ; -sin θ, cos θ ] ,
(x, y)ᵀ = B (p1 - π*, p2 - π* Δ2)ᵀ ,
where B is the rotation matrix and (x, y) are the new parameters translated from (p1, p2). Now the null hypothesis boundary is
H0: y = tan(ψ)|x| .
Let X1 and X2 be independent variables from the Bin(n, p1) and Bin(n, p2) distributions, and let
p̂1 = X1 / n and p̂2 = X2 / n .
Estimate the rotated parameters (x, y) and the covariance matrix as
[ σ̂x², σ̂xy ; σ̂xy, σ̂y² ] = B diag( p̂1(1 - p̂1), p̂2(1 - p̂2) ) Bᵀ ,
(xn, yn)ᵀ = B (p̂1 - π*, p̂2 - π* Δ2)ᵀ .
Consider the test statistic
M = ( yn - tan(ψ)|xn| ) / sqrt( [ σ̂y² + tan²(ψ) σ̂x² - 2 σ̂xy tan(ψ) sign(xn) ] / n ) .
• M converges to the standard normal distribution at all differentiable points of H0.
• At points far from π*, M is equivalent to the statistic from the relevant difference or ratio test.
• But at π*, M converges to a mixture of normal and half-normal distributions. This can be shown by deriving results for a hyperbolic H0 and letting the hyperbola converge to a bent line.
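A minimal computational sketch of M as reconstructed above (the margins Δ1 = .004 and Δ2 = 1.5 and the counts are illustrative, and the plug-in variance is the simple binomial one from the previous slide).

import math

def mixed_ni_statistic(x1, x2, n, delta1=0.004, delta2=1.5):
    # Rotation set-up: kink pi*, rotation angle theta, rotated-boundary angle psi.
    pi_star = delta1 / (delta2 - 1.0)
    theta = 0.5 * math.atan(delta2) + math.pi / 8
    psi = 0.5 * math.atan(delta2) - math.pi / 8
    c, s = math.cos(theta), math.sin(theta)
    p1, p2 = x1 / n, x2 / n
    # (xn, yn) = B (p1_hat - pi*, p2_hat - pi* * delta2)
    u, v = p1 - pi_star, p2 - pi_star * delta2
    xn, yn = c * u + s * v, -s * u + c * v
    # Rotated covariance: B diag(p1(1 - p1), p2(1 - p2)) B^T
    a, b = p1 * (1 - p1), p2 * (1 - p2)
    var_x = c * c * a + s * s * b
    var_y = s * s * a + c * c * b
    cov_xy = c * s * (b - a)
    t = math.tan(psi)
    var_m = (var_y + t * t * var_x - 2 * cov_xy * t * math.copysign(1.0, xn)) / n
    return (yn - t * abs(xn)) / math.sqrt(var_m)

print(mixed_ni_statistic(x1=40, x2=52, n=10000))   # illustrative counts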
V. Consequences of “Not
Trying to Find an Effect”
Leads to questions about other notions of
quality in clinical trials besides sample size:
- noncompliance
- drug impurity
- loss to follow-up
- enrollment of ineligibles
- other protocol violations
• Obviously we want to minimize protocol violations in all
trials. However, they have fundamentally different effects
depending on the type of study they afflict.
• In Superiority Trials conducted with proper randomization
and blinding, these violations degrade the treatment effect
C - T and are thus conservative: they bias results towards a
conclusion of no effect.
• In Equivalence Trials, even with proper randomization and
blinding, these violations can degrade the treatment effect
and are thus anti-conservative: they can bias results towards
a conclusion of equivalence.
Intent to Treat Revisited
Is it still the ironclad standard for primary
analysis?
"No participants should be withdrawn from the analysis due to
lack of adherence. The price to be paid is a possible decrease
in power."
- Friedman, Furberg and DeMets, referring to superiority trials.
General agreement, including in ICH guidelines:
"An analysis using all available data should be carried out
for all studies intended to establish efficacy" [ICH E-3].
"Intent to Treat" Analysis in the presence of noncompliance
• Has decreased power compared to situation with full compliance
and
• Results in estimate of C - T biased towards 0
- conservative in superiority trials
- anticonservative in equivalence trials
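A minimal simulation sketch of that anticonservatism (all numbers are my own illustration, not trial data): the control event rate is 4%/year, the new treatment is truly inferior at 7%/year, and the noninferiority margin is 2%/year, so a correct analysis should essentially never declare noninferiority. Noncompliers are assumed, crudely, to end up with the other arm's event rate; as noncompliance grows, the ITT estimate of C - T shrinks toward 0 and false noninferiority claims become common.

import math, random

random.seed(1)
p_control, p_new, margin = 0.04, 0.07, 0.02    # truly inferior by 3% > 2% margin
n, reps = 2000, 500

def arm_events(p_assigned, p_other, noncompliance):
    # Each noncomplier behaves, for simplicity, like a patient on the other arm's therapy.
    return sum(random.random() < (p_other if random.random() < noncompliance else p_assigned)
               for _ in range(n))

def claims_noninferiority(noncompliance):
    pc = arm_events(p_control, p_new, noncompliance) / n
    pt = arm_events(p_new, p_control, noncompliance) / n
    upper = (pt - pc) + 1.96 * math.sqrt(pt * (1 - pt) / n + pc * (1 - pc) / n)
    return upper < margin                      # one-sided 97.5% upper bound below margin

for w in (0.0, 0.2, 0.4):
    rate = sum(claims_noninferiority(w) for _ in range(reps)) / reps
    print(f"noncompliance {w:.0%}: noninferiority wrongly claimed in {rate:.1%} of trials")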
"As Treated" Analysis in the
presence of noncompliance
•
Also has decreased power compared to situation with full
compliance:
and
•
Biases the estimate of C - T in an unknown fashion
- ? in superiority trials
- ? in equivalence trials
Recommendation:
Stick to Intent to Treat in equivalence trials but
• Take Care to maximize quality of the data
• Pay Attention to patterns of quality during the trial
• Summarize aspects of quality in the report
Quality "Before, During and After”.
ICH E-9 says that in an equivalence trial, the role of the
full analysis (intent-to-treat) data set "should be considered
very carefully."
What percent of noncompliance is unacceptable?
• ICH E-10 states:
"The trial should also be conducted with high quality
(e.g., good compliance, few losses to follow-up)."
• It also has useful advice:
"The trial conduct should also adhere closely to
that of the historical trials."
• That is, the design and patient population should be similar
to previous trials used to determine evidence of sensitivity
to drug effects.
• My conclusion:
If noncompliance is less than that achieved in
prior trials which showed efficacy, good.
But if not, beware (same logic as in choice of Δ).
Other interesting problems:
We don’t just want to know if Ximelagatran is noninferior to Warfarin, we want to know if it “works” - if it is better than Placebo. We infer:
Effect of Xi. vs. Pl. = Effect of Xi. vs. Wa. + Effect of Wa. vs. Pl.
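A small arithmetic sketch of that indirect inference (the effect sizes and standard errors below are hypothetical, chosen only to show how the pieces combine; note that the standard errors add in quadrature, so the indirect comparison is less precise than either trial alone).

import math

eff_xi_wa, se_xi_wa = -0.5, 0.6   # Xi. vs. Wa., %/year benefit (hypothetical equivalence-trial result)
eff_wa_pl, se_wa_pl = 3.0, 0.8    # Wa. vs. Pl., %/year benefit (hypothetical historical-trial result)

eff_xi_pl = eff_xi_wa + eff_wa_pl                    # Effect of Xi. vs. Pl.
se_xi_pl = math.sqrt(se_xi_wa ** 2 + se_wa_pl ** 2)  # variances of independent estimates add
print(f"Xi. vs. Pl.: {eff_xi_pl:.1f} %/year benefit (SE {se_xi_pl:.1f})")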
[Diagram: the Historical Trial randomizes nH patients to Placebo (nH/2) vs. Old Drug (nH/2) and compares them; the Equivalence Trial randomizes nE patients to Old Drug (nE/2) vs. New Drug (nE/2) and compares them. The New Drug vs. Placebo comparison (marked "?") is never made directly.]
The inference's validity depends on randomization and on comparison with a past trial, in order to estimate the treatment effect without a direct comparison with placebo.
• Past trials give Historical Evidence of Sensitivity to Drug Effects (HESDE)
• HESDE is relevant only if the populations in the two trials are similar
This contrasts with a superiority trial's validity, which depends upon randomization: arms are drawn from the same population (Lachin, 1988).
[Diagram: a superiority trial randomizes n patients to Old Drug (n/2) vs. New Drug (n/2) and compares them directly.]
But populations change:
• Age distributions change
• Other characteristics change
• Adjuvant therapies arise
• Earlier diagnosis is possible
• The disease itself may change
These imply that we should use a recent trial for comparison.
There is a problem with continuously comparing to the most recent trials: "Equivalency Drift" (referred to as "Bio-creep" in one FDA guidance).
The Problem of Equivalency Drift
[Figure: benefit over placebo plotted from 0 to +4%, with a margin of equivalency of 2%. Drug 1 shows a benefit near +4%; Drug 2 is declared equivalent to Drug 1, Drug 3 to Drug 2, and Drug 4 to Drug 3, and each successive drug's benefit drifts toward 0.]
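A worst-case arithmetic sketch of the drift in the figure (the 4% starting benefit and 2% margin come from the figure; the chain of approvals is hypothetical): each new drug may be up to the full margin worse than its comparator and still be declared equivalent.

benefit, margin = 4.0, 2.0              # Drug 1's benefit over placebo (%); equivalency margin (%)
print(f"Drug 1: benefit over placebo = {benefit:.0f}%")
for k in range(2, 5):
    benefit -= margin                   # each successive drug may be up to 'margin' worse
    print(f"Drug {k}: benefit could be as low as {benefit:.0f}%")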
Another Thorny Consideration
Suppose you represent a drug manufacturer conducting a clinical trial and you know that the trial’s results would be used to help a future competitor show its drug to be effective. Then the narrower your confidence intervals, the easier you make it for your competitor! You are motivated to make your results as imprecise as possible, while still permitting FDA approval.
[Figure: benefit scale from 0 to +4%, with Drug 1 and Drug 2 declared equivalent.]
One last (favorable!) consequence
Suppose we have multiple alternative hypotheses of equivalence and require them all to hold. Then the naïve individual testing approach is conservative.
[Figure: benefit scale from 0 to +4%, with Drug 1 and Drug 2 declared equivalent.]
References
Chen, J.J., Tsong, Y., and Kang, S. "Tests for equivalence or non-inferiority between two proportions." Drug Information Journal 34, pp. 569-578 (2000).
Friedman, L.M., Furberg, C., and DeMets, D.L. Fundamentals of Clinical Trials. Springer-Verlag, New York (1998).
Halperin, J.L. "Ximelagatran compared with warfarin for prevention of thromboembolism in patients with nonvalvular atrial fibrillation: Rationale, objectives, and design of a pair of clinical studies and baseline patient characteristics (SPORTIF III and V)." Am. Heart J. 146, pp. 431-8 (2003).
International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use. Guidances E3: Structure and Content of Clinical Study Reports (1995); E9: Statistical Principles for Clinical Trials (1998); and E10: Choice of Control Group in Clinical Trials (2000). http://www.ich.org/ich5e.html#Reports
Lachin, J.M. "Statistical properties of randomization in clinical trials." Controlled Clinical Trials 9, pp. 289-311 (1988).
Temple, R. and Ellenberg, S.S. "Placebo-controlled trials and active-control trials in the evaluation of new treatments. Part 1: ethical and scientific issues." Annals of Internal Medicine 133, pp. 455-63 (2000).