Adaptive Designs
P. Bauer
Medical University of Vienna
June 2007
Content
• Statistical issues to be addressed in planning a
classical frequentist trial
• Sequential trials – a step forward
• Type of design modifications
• The principles to achieve flexibility
• The price to be paid
Weighting
Optimality
Feasibility
• A typical scenario
• Concluding remarks
The classical frequentist trial
• The details of the design, such as sample size,
method of randomization, …, are laid down in
advance.
• The statistical analysis strategy is also pre-fixed.
• There is a lack of flexibility to deal with current
information emerging from inside or outside
the trial which could raise the demand for
design modifications.
The classical frequentist trial – Statistical issues for planning
• Population
Inclusion criteria
• Treatments (doses)
Application time(s), period and mode
• Main outcome variable(s)
Measurement time(s)
Recruitment time
Follow up time
…
• Secondary outcome variable(s)
• Safety variables
Statistical issues - continued
• Analysis strategy
Statistical model (e.g., parametric versus non-parametric)
Goal (e.g., superiority versus non-inferiority)
Test statistics
Covariables
Handling Multiplicity
Subgroups
Handling missing information
Checking the assumptions
Handling deviations from the assumptions
Checking the stability of results
Handling current safety data (dropping treatments?)
...
Statistical issues - continued
• Significance level
• Power
• Sample size
Relevant effect size(s)
Variability of the outcome variable
Outcome in the control group(s)
Drop out rate
...
• Dealing with the unexpected?
A step forward - Sequential designs
• The sample size is not fixed!
By looking at the outcome more than once during
the trial, decisions such as stopping early with an efficacy
claim or for futility may be taken.
The sample size adapts to the true effect.
• Further design issues
Number of interim analyses
Timing of interim analyses
Decision rules to be applied in the interim analyses
Rules for dropping treatments
Maximum sample size
...
Sequential designs - continued
• The false positive error rate can only be calculated
in advance if the decision rules are also specified
in advance!
• For particular deviations from the rules the impact
on the error rate may be known.
• E.g., in case of a large mid-trial effect you may
increase the sample size, in case of a small effect you may
decrease it! POSCH et al. [2003], see later
“Planned Adaptivity”
• Lay down a set of adaptation rules among which you
can choose
• Calculate the maximum possible Type I error
rate inflation which may occur with these rules
• Adjust the rejection boundaries so that the
maximum type I error rate is controlled
• Like in sequential designs you have to
adhere to the pre-specified set of rules!
“Fully” adaptive (flexible) designs
Adaptive or flexible designs allow for mid-trial
design modifications based on information from
in- and outside the trial without compromising on
the false positive error rate (and hopefully
improving the performance of the running trial).
• Traditionally the notion “adaptive” was used for data-dependent randomization methods like the play-the-winner treatment allocation rule.
Issues of flexibility
• Selection of treatments – Reallocation of samples
• Modification of the total sample size
• Modification of statistical analysis
Choosing “optimal” scores …
• Insertion or skipping of interim analyses
• Subgroup selection
• Changing goals (non-inferiority → superiority)
• Modification of endpoints
• ...
Only writing amendments for “online” design
modifications will not be the general solution!
The invariance principles
to achieve flexibility:
1. Adaptive combination tests
• In a two-stage design an interim analysis is
planned after a sample size of n1 observations.
• Denote by p1 and p2 the p-values of a test in
the disjoint first and second stage samples,
respectively, of the planned two-stage design
(instead, stage-wise z-scores, i.e., standardized treatment
differences, could be used).
1. Two stage combination tests (cont.)
• If different patients are investigated at the two
stages these two p-values p1 and p2, under the
null hypothesis, generally are independent and
uniformly distributed on [0,1] (can be relaxed).
These properties still hold true if the second stage
of the design is modified in the interim analysis
based on information from the first stage!
1. Two stage combination tests (cont.)
• The final test decision is based on a pre-fixed
combination of the two p-values (or z-scores),
e.g., on the product function [R. A. FISHER]
$p_1\,p_2$
or the inverse normal function
$w_1\,\Phi^{-1}(1-p_1) + w_2\,\Phi^{-1}(1-p_2)$   [$= w_1 z_1 + w_2 z_2$]
with $w_1, w_2$ prefixed, $w_1^2 + w_2^2 = 1$.
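Illustration (not part of the original slides): a minimal Python sketch of the two combination rules for a two-stage design without early stopping. The critical value cα of Fisher's product criterion and the bound z_{1-α} of the inverse normal combination are the standard ones; the p-values in the example call are made up.

```python
# Minimal sketch of the two combination rules for a two-stage design
# (no early stopping), assuming independent, uniform stage-wise p-values
# under H0; the p-values in the example call are made up.
import numpy as np
from scipy.stats import norm, chi2

ALPHA = 0.025

def fisher_reject(p1, p2, alpha=ALPHA):
    """Fisher's product criterion: reject H0 if p1*p2 <= c_alpha, where
    c_alpha = exp(-chi2_{4,1-alpha}/2) (about 0.0038 for alpha = 0.025)."""
    c_alpha = np.exp(-chi2.ppf(1 - alpha, df=4) / 2)
    return p1 * p2 <= c_alpha

def inverse_normal_reject(p1, p2, w1, w2, alpha=ALPHA):
    """Inverse normal combination with prefixed weights, w1**2 + w2**2 = 1:
    reject H0 if w1*Phi^-1(1-p1) + w2*Phi^-1(1-p2) >= z_{1-alpha}."""
    z = w1 * norm.ppf(1 - p1) + w2 * norm.ppf(1 - p2)
    return z >= norm.ppf(1 - alpha)

# Equal planned stage sizes n1 = n2 give w1 = w2 = 1/sqrt(2).
print(fisher_reject(0.045, 0.08))              # 0.045*0.08 = 0.0036 <= 0.0038
print(inverse_normal_reject(0.045, 0.08, 1 / np.sqrt(2), 1 / np.sqrt(2)))
```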
The weighted z-scores – an explanation
(we pretend to know the variance σ²)
The z-score of the 1st stage (balanced design):
$z_1 = \dfrac{\bar{x}_{1A} - \bar{x}_{1B}}{\sigma\sqrt{2/n_1}}$
(mean treatment effect divided by its standard error)
If there is no treatment effect (under the null) the z-score for large sample sizes follows a standard normal distribution!
The weighted z-scores – an explanation (cont.)
(we pretend to know the variance σ²)
The z-scores:
1st stage: $z_1 = \dfrac{\bar{x}_{1A} - \bar{x}_{1B}}{\sigma\sqrt{2/n_1}}$
2nd stage: $z_2 = \dfrac{\bar{x}_{2A} - \bar{x}_{2B}}{\sigma\sqrt{2/n_2}}$
Total: $z = \dfrac{\bar{x}_{A} - \bar{x}_{B}}{\sigma\sqrt{2/(n_1+n_2)}} = w_1 z_1 + w_2 z_2$, with $w_1 = \sqrt{\dfrac{n_1}{n_1+n_2}}$ and $w_2 = \sqrt{\dfrac{n_2}{n_1+n_2}}$.
The clue
• The z-scores from disjoint samples under the null for
large samples follow independent standard normal distributions.
• Adaptations (e.g., increasing the sample size
from $n_2$ to $\tilde n_2$) performed before the sample at the
following stage is observed obviously have no
influence on this universal property!
• Using the planned sample sizes n1 and n2 for the
weights w1 and w2 to combine the scores $z_1$ and
$\tilde z_2$, we again get under the null a standard normal
test statistic, which can easily be used for testing.
• If no adaptation is made we end up with the
conventional test!
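A hedged Monte Carlo sketch of "the clue" (not from the original slides): with the weights fixed at the planned n1, n2, the inverse normal combination keeps the level even when the second-stage size is chosen after looking at the interim data. The one-sample setting, σ = 1, and the adaptation rule are purely illustrative.

```python
# Monte Carlo sketch: with weights fixed at the *planned* n1, n2, the
# inverse normal combination keeps the level even if the second-stage
# size is chosen after looking at the interim data.
# One-sample z-test, sigma = 1, H0: mu = 0; the adaptation rule is only
# illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
alpha, n1, n2 = 0.025, 50, 50                    # planned stage sizes
w1, w2 = np.sqrt(n1 / (n1 + n2)), np.sqrt(n2 / (n1 + n2))
z_crit = norm.ppf(1 - alpha)

def adapted_n2(z1):
    """Hypothetical rule: double the second stage after an unpromising interim look."""
    return 2 * n2 if z1 < 1.0 else n2

n_sim, rejections = 100_000, 0
for _ in range(n_sim):
    x1 = rng.normal(0.0, 1.0, n1)                # first-stage data under H0
    z1 = np.sqrt(n1) * x1.mean()
    m2 = adapted_n2(z1)                          # data-driven second-stage size
    x2 = rng.normal(0.0, 1.0, m2)
    z2 = np.sqrt(m2) * x2.mean()                 # still N(0,1) under H0
    rejections += (w1 * z1 + w2 * z2) >= z_crit

print(rejections / n_sim)                        # close to alpha = 0.025
```

If the weights were instead recomputed from the adapted second-stage size (i.e., the pooled sufficient statistic were used), level control would no longer be guaranteed; this is the weighting issue taken up later.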
1. Two stage combination tests (cont.)
• Don’t pool the samples, combine the test statistics!
The distribution of the combination function under the null
does not depend on design modifications.
• Hence the adaptive test is still a test at the level α
for the modified design!
• Applicable for multiple looks
• Sequential decision boundaries can be applied
• Recursive application allows for a flexible # of looks
BAUER [1989], BAUER and KÖHNE [1994], LEHMACHER and
WASSMER [1999], CUI et al.[1999], BRANNATH et al.[2002]
The two stage combination test
with sequential decision boundaries
[Diagram of the decision scheme:]
1st stage (n1 observations), based on $p_1 \in [0,1]$: rejection of H0 if $p_1 \le \alpha_1$; futility stop and acceptance of H0 if $p_1 > \alpha_0$; continuation with adaptation of the 2nd stage (n2 further observations, n1+n2 in total) otherwise.
2nd stage, based on $C(p_1, p_2)$: rejection of H0‘ if $C(p_1, p_2) \le c$; acceptance of H0‘ otherwise.
The invariance principles (cont.)
2. The conditional error concept
• The conditional error probability of a designed
trial at a particular time is the probability under
the null hypothesis to get a rejection later on,
given the information observed up to that time.
PROSCHAN and HUNSBERGER [1995]
Any design change is allowed at any (unplanned)
time if the new design has a conditional error probability
never exceeding that of the original design!
• Design modifications always preserving the
conditional error prob. also preserve the level α!
MÜLLER, SCHÄFER [2001, 2004]
[Diagram (recursive application): at each interim analysis the options are early rejection of H0, futility stop (acceptance of H0), or continuation with another n2 observations, after which the same scheme can be applied again.]
CEF
• The conditional error function is defined as the
conditional error probability as a function of the
outcome up to the interim analysis, in our example a
function of p1
• It can be explicitly defined in advance, like the
circular conditional error function (PROSCHAN & HUNSBERGER, 1995),
or implicitly by the pre-planned design, e.g., of a
group sequential design (MÜLLER & SCHÄFER, 2001)
Combination test – Conditional error
• If the conditional error can be calculated then there
is a close relationship between the two approaches
POSCH and BAUER [1999]
Example: Assume that the final test decision for the
product criterion was planned to be
$p_1 p_2 \le c_\alpha$   [α = 0.025, $c_\alpha$ = 0.0038, $p_1$ = 0.045]
Any second stage test with the rejection region
$p_2 \le c_\alpha / p_1$   [$p_2 \le 0.0844$]
would preserve the conditional error!
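A quick numerical check of this example (not on the original slide), assuming the standard critical value of Fisher's product criterion:

```python
# Quick numerical check of the slide's numbers, assuming the standard
# critical value of Fisher's product criterion.
import numpy as np
from scipy.stats import chi2

alpha = 0.025
c_alpha = np.exp(-chi2.ppf(1 - alpha, df=4) / 2)
p1 = 0.045
# Given p1, the planned test rejects iff p2 <= c_alpha / p1; under H0 this
# happens with probability min(1, c_alpha / p1) -- the conditional error.
print(c_alpha, min(1.0, c_alpha / p1))   # compare with the slide: ~0.0038 and ~0.084
```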
Conditional error function – general
• Testing the mean of a normal distribution (known
variance) in a design with fixed sample size n
• Look into the data after n1 observations:
• One-sided critical region at the end: $\{Z_n \ge z_{1-\alpha}\}$
• Sufficient test statistic after n1 observations: $Z_{n_1}$
$\mathrm{CEF}(z_{n_1}) = \mathrm{CRP}_0(z_{n_1}) = \mathrm{Prob}_0(Z_n \ge z_{1-\alpha} \mid z_{n_1})$
Interim analysis after n1 observations – Reassessment of trial perspectives
What is the probability to reject H0 at the end,
conditionally on the results after n1 observations?
Conditional Power
CRPµ (Conditional Rejection Probability):
$\mathrm{CRP}_\mu(z_{n_1}) = \mathrm{Prob}_\mu(Z_n \ge z_{1-\alpha} \mid z_{n_1})$
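As a reading aid (not on the original slide): for this fixed-sample z-test the conditional rejection probability has a simple closed form. With information fraction r = n1/n and n2 = n - n1 remaining observations, a standard decomposition gives:

```latex
% Split the final z-score into the observed and the not-yet-observed part:
%   Z_n = \sqrt{r}\, Z_{n_1} + \sqrt{1-r}\, Z^{(2)},
% where Z^{(2)} is the z-score of the remaining n_2 observations and,
% for true mean \mu, Z^{(2)} \sim N(\mu \sqrt{n_2}/\sigma,\, 1).  Hence
\[
  \mathrm{CRP}_{\mu}(z_{n_1})
    = 1 - \Phi\!\left(
        \frac{z_{1-\alpha} - \sqrt{r}\, z_{n_1}}{\sqrt{1-r}}
        - \frac{\mu \sqrt{n_2}}{\sigma}
      \right),
  \qquad
  \mathrm{CEF}(z_{n_1}) = \mathrm{CRP}_{0}(z_{n_1}).
\]
```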
Conditional power – Overall power
Overall power: Expected value of the r. v.
„conditional power“ taken over all possible
interim outcomes (including those which
definitely have not been observed in the trial).
It is tempting to reassess
the trial perspectives after knowing the
interim results.
Conditional power
The true conditional power is unknown since the true
mean µ is unknown too!
Proposals to estimate the conditional power:
1. Use some fixed effect size μc for determining
the conditional power, e.g., the effect size
μc = μd from the planning phase:
$\mathrm{Prob}_{\mu_d}(Z_n \ge z_{1-\alpha} \mid Z_{n_1})$
Estimation of the conditional power
2. Use the effect observed in the interim analysis:
$\mathrm{Prob}_{\hat\mu = z_{n_1}/\sqrt{n_1}}(Z_n \ge z_{1-\alpha} \mid Z_{n_1})$
3. Weighted mixture of 1 and 2, e.g., based on the
posterior distribution of µ in the interim analysis
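Illustration (not from the original slides): a small Python sketch of proposals 1 and 2, using the closed form of the conditional power for the one-sample z-test with known σ; the interim z-score and all other numbers are made up.

```python
# Sketch of proposals 1 and 2, using the closed form of the conditional
# power for the one-sample z-test with known sigma; the interim z-score
# and all numbers are made up.
import numpy as np
from scipy.stats import norm

def conditional_power(z_n1, n, n1, mu, sigma=1.0, alpha=0.025):
    """P_mu(Z_n >= z_{1-alpha} | Z_{n1} = z_n1) for the fixed-sample z-test."""
    r, n2 = n1 / n, n - n1
    a = (norm.ppf(1 - alpha) - np.sqrt(r) * z_n1) / np.sqrt(1 - r)
    return 1 - norm.cdf(a - mu * np.sqrt(n2) / sigma)

sigma, alpha, beta, mu_d = 1.0, 0.025, 0.2, 1.0
n = int(np.ceil((norm.ppf(1 - alpha) + norm.ppf(1 - beta)) ** 2 * sigma**2 / mu_d**2))
n1 = n // 2                                  # interim look halfway through
z_n1 = 1.0                                   # some interim z-score
mu_hat = sigma * z_n1 / np.sqrt(n1)          # interim effect estimate

print(conditional_power(z_n1, n, n1, mu=mu_d))    # proposal 1: planning effect mu_d
print(conditional_power(z_n1, n, n1, mu=mu_hat))  # proposal 2: "predictive" power
```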
To remember
Since the conditional power in any case
depends on the interim outcome $Z_{n_1}$ it is a
random variable!
Density of the conditional power using the
effect size from the planning phase
$\mathrm{Prob}_{\mu_d}(Z_n \ge z_{1-\alpha} \mid Z_{n_1})$
(further called „Conditional Power“)
α = 0.05/2 = 0.025, 1-β = 0.8, µd = 1,
$n = (z_{1-\alpha} + z_{1-\beta})^2\,\sigma^2/\mu_d^2$
I. Density of the conditional power in a
perfectly powered study depending
on the inspection time r
• We assume that the true effect size is exactly
equal to the one used in the planning phase.
• Under these hypothetical assumptions we
calculate the density of the true conditional
power depending on the information time
r=n1/n.
[Four figures: density of the Conditional Power based on µd = 1 for increasing information times r; x-axis: conditional power.]
II. The conditional power halfway through
the trial depending on the true effect μ
• We assume that a mid-trial interim analysis
(r=1/2) is performed.
• We assume that for calculating the conditional
power the experimenter uses the effect size μd
the study has been powered for.
• We calculate the conditional power depending
on the unknown true effect size μ which is
generally different from μd .
[Four figures: density of the Conditional Power based on µd = 1 at r = 1/2 for different true effect sizes µ; x-axis: conditional power.]
III. Comparison of the
„Conditional Power“
$\mathrm{Prob}_{\mu_d}(Z_n \ge z_{1-\alpha} \mid Z_{n_1})$
vs.
the „Predictive Power“
(using the interim effect estimate)
$\mathrm{Prob}_{\hat\mu = z_{n_1}/\sqrt{n_1}}(Z_n \ge z_{1-\alpha} \mid Z_{n_1})$
Comparison of Conditional and „Predictive“ Power
[Five figures comparing the Conditional Power (based on µd) with the „Predictive“ Power (based on the interim estimate); x-axis: conditional/predictive power.]
IV. The deviation between predictive and
true conditional power – simulations
Pro’s
• Enormous flexibility (no pre-specification),
e.g., look into the un-blinded data at whatever
time to decide on forthcoming interim analyses.
• When using the appropriate combination function
(conditional error function), in the case of no design
adaptations the analysis is the conventional one. In this case no price
is paid for the option of flexibility!
• Every conventional test with a fixed adaptation
rule can deal with unexpected deviations from
this rule via the conditional error function.
Con’s
(There is no such thing as a free lunch!)
• Non-standard test statistics other than the
sufficient statistics are used.
• Conflicting decisions with conventional analyses
may occur (restrictions can be imposed).
• Problems of estimation without mandatory
adaptation rules (CI’s available).
• Problems of interpretation if the adaptation
changes the hypothesis (multiple inference
available, see later “Combining phases”).
Estimation (without early stopping)
$\hat\mu_u = w_1(\bar x_{A1} - \bar x_{B1}) + w_2(\bar x_{A2} - \bar x_{B2})$   (unbiased)
$\hat\mu_{mu} = \tilde u\,(\bar x_{A1} - \bar x_{B1}) + (1 - \tilde u)\,(\bar x_{A2} - \bar x_{B2})$   (median unbiased), with random weight
$\tilde u = \dfrac{w_1\sqrt{n_1}}{w_1\sqrt{n_1} + w_2\sqrt{\tilde n_2}}$
Confidence intervals
• Define adaptive tests for every point in the
parameter space
• The one-sided (1-α)-CI contains all points for
which the one sided null hypothesis is not
rejected at the level α
• Use the same sequential boundaries for the
shifted test statistics (Conservative RCI which
can be calculated at any interim analysis)
• Use conditional error functions for all parameter
values to perform the dual tests
• Use an ordering in the sample space so that
early stopping always leads to wider CIs
BRANNATH et al. (2006); MEHTA et al. (2007); BRANNATH et al.( )
The optimality issue of not using
sufficient test statistics *
• Simple null and alternative hypotheses
• Fixed sample size reassessment rule defines the
spending functions for early rejection and early
acceptance respectively
• No costs for interim analyses
• There is a non-adaptive LR-test (with interim
analyses at all sample sizes possible in the
adaptive design) which leads to a lower average
sample size both under the null and the alternative
hypothesis.
TSIATIS & MEHTA (2003)
The optimality issue (cont.) *
• A risk function in form of a weighted sum of the
type I error probability, the type II error probability
and the expected information is considered
• The Bayes risk in form of an expectation of this
risk function over an a-priori distribution on a finite
grid in the parameter space is minimized
• Each admissible decision rule is a Bayes rule with
a solution based on the sufficient statistics
JENNISON & TURNBULL (2006)
The optimality issue (cont.)
• These optimality results only apply for the a-priori
fixed risk structure
• Real life is higher dimensional and more complex
than the simple models
E.g., a simple risk structure may (and often
does) change over time (and the trial) !
The optimality issue (cont.)
Example
• An unexpected safety issue arises in a clinical
trial, so that a larger sample size is required under
the experimental therapy to achieve a sufficiently
precise estimate of the cost benefit relationship
before registration
• Now, the original weight in the a-priori defined risk
function for the expected information may be
completely irrelevant. The costs for additional
sampling may be completely dominated by the
need to get more information on a variable other
than any pre-planned outcome variable
The optimality issue (cont.)
“Fully” adaptive designs are a way to deal with
such situations of a changing environment while
still not compromising on the type I error rate
• Given the adaptation and the interim data we
can again use conditionally optimal designs
based on the sufficient statistics for the rest of
the trial *
BRANNATH et al. (2006a)
The weighting issue:
When does sample size adaptation never inflate
the type I error rate of the conventional test?
One-sided z-test for a normal mean, α = 0.025, 1-β = 0.8, total sample size n.
Sample size adaptation in an interim look at the data
after half of the sample, n1 = n/2: the second-stage size n2 may be changed to $\tilde n_2$, giving a new total $\tilde n = n_1 + \tilde n_2$.
When does sample size adaptation never inflate
the type I error rate of the conventional test?
(POSCH et al. 2003)
α = 0.025, 1-β = 0.8, n1 = n/2, second-stage size changed from n2 to $\tilde n_2$ (new total $\tilde n$).
[Figure: results displayed as a function of $\tilde n / n$.]
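Illustration (not from the original slides): a Monte Carlo sketch estimating the actual level of the conventional pooled z-test under a data-driven second-stage sample size. The specific rule below (enlarge after a clearly large interim effect, shrink after a clearly small one) only illustrates the direction discussed by POSCH et al. (2003); rules acting the other way round can inflate the level.

```python
# Monte Carlo sketch: actual level of the conventional pooled z-test
# (sufficient statistic, not a prefixed-weight combination) when n2 is
# chosen after an interim look at n1 = n/2.  The rule below only
# illustrates the direction discussed on the slide.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
alpha, n = 0.025, 100
n1, planned_n2 = n // 2, n - n // 2
z_crit = norm.ppf(1 - alpha)

def new_n2(z1):
    """Hypothetical rule: enlarge the second stage after a clearly large
    interim effect, shrink it after a clearly small one."""
    if z1 > 1.5:
        return planned_n2 + 20
    if z1 < 1.0:
        return planned_n2 - 20
    return planned_n2

n_sim, rejections = 100_000, 0
for _ in range(n_sim):
    x1 = rng.normal(0.0, 1.0, n1)            # H0: mu = 0, sigma = 1
    z1 = np.sqrt(n1) * x1.mean()
    x2 = rng.normal(0.0, 1.0, new_n2(z1))
    pooled = np.concatenate([x1, x2])
    rejections += np.sqrt(pooled.size) * pooled.mean() >= z_crit

print(rejections / n_sim)                    # stays at or below alpha for this rule
```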
The weighting issue
• Adaptive tests generalize group sequential designs.
They can be identical to the conventional tests
when no adaptations are performed
• However, in case of (sample size) adaptations, due
to the a-priori definition of the combination function,
observations from different stages are in general
weighted differently (the decision of the adaptive
test is not based on the sufficient test statistics)
• Extreme situations with absurd test decisions can be
constructed (if you decide to take one observation
for the second stage instead of the planned 500, this
observation may overrule the first stage data)!
The weighting issue (cont.)
Comments
• There are statistics used in other fields where the
observations may be weighted differently
• Typically the rejection region of the adaptive test is
consistent with the fixed sample test if one applies
• reasonable sample size reassessment strategies,
• adequate early stopping boundaries,
• the marginally conservative dual test (which rejects
only if both the adaptive and the LR-test based on
the total sample reject). *
The weighting issue (cont.)
An alternative – the worst case adjustment *
• For every interim outcome determine the worst case
sample size reassessment rule which would
produce the largest type I error rate for a test based
on the conventional sufficient statistics
• Adjust the critical level of this test so that the level is
controlled for all possible sample size rules
This test can be uniformly improved by a fully
adaptive test based on the worst case CE-function!
PROSCHAN & HUNSBERGER (1995), WASSMER (1999), BRANNATH
et al. (2006b)
Reassessment of trial perspectives –
may we be misguided?
• Scenario
Interim look halfway through a classical trial with
α = 0.025 (one-sided), power 1-β = 0.80, normal
distribution, σ² = 1, effect size for planning μd = 1
Conditional power to get a rejection at the end,
given the first half of the outcome data
Reassessment of trial perspectives (cont.)
• The conditional power is random because it
depends on the random interim results.
• The true conditional power is unknown, since
it depends on the unknown effect size.
• Some plug in the a priori effect size μd from
the planning phase.
• Some use the effect size observed in the
interim analysis.
The distribution
of the conditional power *
• What is the distribution of the conditional power
if we look into the data halfway through an
underpowered trial, given that the actual effect size is
only 40% of the (optimistic) effect size used in
the planning phase?
BAUER and KÖNIG [2006]
Density of the conditional power
[Figure: densities of the conditional power computed with µd and with the observed interim effect; x-axis: conditional power.]
Con’s (continued)
A fundamental concern
A trial is sized to finally reach decisions with
a controlled probability of erroneous decisions.
Why do we expect that early information from
the trial will reliably guide us to the right track?
• Possible impact of the interim analysis on the
course of the trial (more than in sequential trials)
• Flexible designs are more difficult to handle
(also as compared to sequential designs)
The feasibility issue
• Flexible designs require careful planning and
improved logistics in order to maintain the integrity
and persuasiveness of the results
• The control of the information flow is crucial, as
various un-blinded material may be needed in
interim analyses to achieve good decisions about
the necessity and type of adaptations to be made
• The adaptation process has to be documented
carefully so that adaptations can be justified,
considering the complications connected with
such designs
The feasibility issue (cont.)
• Besides other (known) prices to be paid
for flexibility, the feasibility issues may
be the largest obstacle to a wide
use of flexible designs!
Reflection Paper on Methodological Issues
in Confirmatory Clinical Trials with Flexible
Design and Analysis Plan, EMEA, 2006
FOR A SURVEY OF APPLICATIONS, see BAUER & EINFALT (2006)
Planning for adaptations is important …
SEVERAL AUTHORS (1994 – )
• “Sometimes one can get the impression that
papers on flexible design have to include this
type of warning, and the further impression that
the authors feel somewhat guilty for showing
how much freedom is possible, and, after having
gone public, want to get some distance between
their work and the possible consequences.”
TALK BY J.RÖHMEL, 15.10.2005
A typical application:
Dose selection and confirmative inference
(the burning issue of combining phases)
• Scenario: 4 doses, Placebo, parallel groups, balanced
• Many-one comparisons of doses with Placebo
• Multiple level α
• E.g., a sequential adaptive Dunnett, Hochberg or
strictly hierarchical test procedure
BAUER & KIESER (1999), HOMMEL (2001)
“Adaptive Seamless Designs”: POSCH et al. (2005), THE
NOVARTIS GROUP, e.g., BIOMETRICAL JOURNAL (2006)
Comparison of ASD for treatment selection with
separate phase II and III trials (1)
[Diagram:]
Standard 2-phase development: plan & design phase IIb with arms A, B, C, D and Control (learning); then plan & design a separate phase III trial (confirming).
Adaptive seamless design: one trial planned and designed as phase IIb and III with arms A, B, C, D and Control; dose selection at an interim analysis; learning, selecting and confirming within the same trial.
BRETZ et al. (2006)
Combining phases (cont.)
Adaptive interim analysis
Options
• Skip low, ineffective doses (use a surrogate endpoint?)
or high, unsafe doses
• Early stopping with a positive efficacy claim or
for futility (surrogate?)
• Redistribution of the sample units “saved”
among the remaining doses and Placebo
• Change of the reallocation ratio because more
observations on a high dose are needed in order
to address an arising safety problem
• Increase of the planned total sample size
• …
Selection and multiple (closed) testing
• Treatments: A, B, C (Control)
• Two comparisons: A vs. C and B vs. C
• We predefine adaptive combination tests for the null
hypotheses
A=C, B=C , A=B=C (no effects)
• Treatment B has been selected for the second stage
• Final analysis: B is claimed to be effective if both the
global null hypothesis
A=B=C
and the individual null hypothesis
B=C
are rejected in their combination test at the level α
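Illustration (not from the original slides): a minimal Python sketch of this closed-test decision with inverse normal combination tests. A Bonferroni adjustment is used here as one simple first-stage test of the intersection A=B=C (an adaptive Dunnett test would be a less conservative choice); all p-values are made up.

```python
# Minimal sketch of the closed-test decision after selecting treatment B
# at the interim: claim efficacy of B only if the combination tests for
# A=B=C (global) *and* B=C both reject at level alpha.  Bonferroni is used
# here as one simple first-stage test of the intersection; all p-values
# are made up.
import numpy as np
from scipy.stats import norm

alpha = 0.025
w1 = w2 = 1 / np.sqrt(2)                     # prefixed equal stage weights

def inverse_normal_reject(p1, p2):
    z = w1 * norm.ppf(1 - p1) + w2 * norm.ppf(1 - p2)
    return z >= norm.ppf(1 - alpha)

# Stage 1: p-values of A vs. C and B vs. C (B looks best and is selected)
p1_AC, p1_BC = 0.20, 0.03
p1_global = min(1.0, 2 * min(p1_AC, p1_BC))  # Bonferroni for A=B=C
# Stage 2: only B and C are carried on, so B vs. C also tests the intersection
p2_BC = 0.01

claim_B = inverse_normal_reject(p1_global, p2_BC) and inverse_normal_reject(p1_BC, p2_BC)
print("efficacy claim for B:", claim_B)
```

The closure over the two null hypotheses containing B=C is what allows B to be selected at the interim while the multiple level α is kept.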
Reflection Paper (EMEA)
• “In general changes to designs of an ongoing phase III trial
are not recommended. If such changes are anticipated in a
confirmatory clinical trial this would require pre-planning and
a clear justification from an experimental point of view”
• “Studies with interim analyses where there are marked
differences between different study parts or stages, will be
difficult to interpret”
• “From a regulatory point of view, whenever trials are
planned to incorporate design modifications based on the
results of an interim analysis, the applicant must pre-plan
methods to ensure that results from different stages of the
trial can be justifiably combined. In this respect, studies with
adaptive designs need at least the same careful
investigation of heterogeneity and justification … as is
usually required for the combination of individual trials in a
meta-analysis”
Reflection Paper (EMEA)
• “The need to reassess sample size in some experimental
conditions is acknowledged. However, if more than one
sample size reassessment seems necessary, this might
raise concern that the experimental conditions are
fluctuating and not fully understood”
• “External knowledge from other studies may suggest … . In
such cases, adaptive designs may allow an opportunity to
discuss changes of the primary endpoint, changes in the
components … .”
• “A change in the primary endpoint after an interim analysis
should not be acceptable: …”
• “An adaptive design, combined with a multiple testing
procedure, may offer the opportunity to stop recruitment of a
placebo group after an interim analysis as soon as
superiority of the experimental treatment over placebo has
been demonstrated”
Reflection Paper (EMEA)
• “Investigator may wish to further investigate more than one
dose of the experimental treatment in phase III. Early interim
results may resolve some of the ambiguities and recruitment
may be stopped for some doses. …: it is not sufficient to
show that some dose of the experimental treatment is
effective. In consequence, a multiple testing procedure to
identify the appropriate dose should be incorporated”
• Switching from non-inferiority to superiority: “ may be the
desire to continue the study to demonstrate, with additional
patients, superiority of the experimental treatment over the
active comparator. This possibility should, however, be set
into perspective”
• “If, based on interim analysis results, it can be assumed that
the trial will still have sufficient power but using a
randomization ratio of, say 1:2, then this may be seen as a
useful option”
Reflection Paper (EMEA)
• “In some cases of late phase II development, the selection of
doses is already well established and further investigation in
phase II would be performed with the same endpoints that
are of relevance in phase III. Similar considerations as
outlined for the selection of treatment arms at an interim
analysis may apply and would allow for the conduct of a
combined phase II / phase III trial”
• “Phase II / phase III combination trials, when appropriately
planned, may be used to better investigate the correlation
between surrogate endpoints and clinical endpoints, and
may, therefore, support the process of providing justification
that an optimal dose-regimen for the experimental drug has
been selected”
• “However, it will not be acceptable to argue … for the
acceptability of an application with only one combined
phase II / phase III trial”
Conclusions
• In late phase trials flexibility should be used
carefully and thoughtfully to maintain the integrity
and persuasiveness of the results
• Too early looks may be strongly misleading
• Flexible designs are an excellent tool to deal
with unexpected findings (e.g., larger sample
sizes are needed because of an arising safety
issue, or can be afforded because of a new
market situation, a dose adaptation seems
to be indicated, a subgroup stands out, …)
References:
Special issue “Adaptive designs in clinical trials”
Biometrical Journal, 48 (4), 2006
Thank you for your patience!