Lecture 1: Regression Setting
Issues in the Use of Adaptive
Clinical Trial Designs
Scott S. Emerson, M.D., Ph.D.
Professor of Biostatistics
University of Washington
1
© 2002, 2003, 2004 Scott S. Emerson, M.D., Ph.D.
Clinical Trials
Experimentation in human volunteers
– Investigates a new treatment/preventive agent
• Safety:
» Are there adverse effects that clearly outweigh any
potential benefit?
• Efficacy:
» Can the treatment alter the disease process in a
beneficial way?
• Effectiveness:
» Would adoption of the treatment as a standard affect
morbidity / mortality in the population?
2
Clinical Trial Design
Finding an approach that best addresses the often
competing goals: Science, Ethics, Efficiency
• Basic scientists: focus on mechanisms
• Clinical scientists: focus on overall patient health
• Ethical: focus on patients on trial, future patients
• Economic: focus on profits and/or costs
• Governmental: focus on validity of marketing claims
• Statistical: focus on questions answered precisely
• Operational: focus on feasibility of mounting trial
3
Statistical Planning
Ensure that the trial will satisfy the various
collaborators as much as possible
• Discriminate between relevant scientific hypotheses
– Scientific and statistical credibility
• Protect economic interests of sponsor
– Efficient designs
– Economically important estimates
• Protect interests of patients on trial
– Stop if unsafe or unethical
– Stop when credible decision can be made
• Promote rapid discovery of new beneficial treatments
4
Refine Scientific Hypotheses
– Target population
• Inclusion, exclusion, important subgroups
– Intervention
• Dose, administration (intention to treat)
– Measurement of outcome(s)
• Efficacy/effectiveness, toxicity
– Statistical hypotheses in terms of some summary
measure of outcome distribution
• Mean, geometric mean, median, odds, hazard, etc.
– Criteria for statistical credibility
• Frequentist (type I, II errors) or Bayesian
5
Statistics to Address Variability
At the end of the study:
– Frequentist and/or Bayesian data analysis to assess
the credibility of clinical trial results
• Estimate of the treatment effect
– Single best estimate
– Precision of estimates
• Decision for or against hypotheses
– Binary decision
– Quantification of strength of evidence
6
Statistical Sampling Plan
Ethical and efficiency concerns are addressed
through sequential sampling
• During the conduct of the study, data are analyzed at periodic
intervals and reviewed by the DMC
• Using interim estimates of treatment effect
– Decide whether to continue the trial
– If continuing, decide on any modifications to
» scientific / statistical hypotheses and/or
» sampling scheme
7
Sampling Plan: General Approach
– Perform analyses when sample sizes $N_1, \ldots, N_J$
• Can be randomly determined
– At each analysis choose stopping boundaries
• $a_j < b_j < c_j < d_j$
– Compute test statistic $T(X_1, \ldots, X_{N_j})$
• Stop if $T < a_j$ (extremely low)
• Stop if $b_j < T < c_j$ (approximate equivalence)
• Stop if $T > d_j$ (extremely high)
• Otherwise continue (with possible adaptive modification of
analysis schedule, sample size, etc.)
– Boundaries for modification of sampling plan
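The four-boundary rule above can be sketched in a few lines of Python. This is only an illustration: the function names `interim_decision` and `first_stop` and the numeric boundaries in the usage note are hypothetical and are not taken from the slides.

```python
from typing import Optional, Sequence, Tuple

Bounds = Tuple[float, float, float, float]  # (a_j, b_j, c_j, d_j)

def interim_decision(t: float, bounds: Bounds) -> str:
    """Apply the rule at one analysis: stop low, stop for equivalence, stop high, or continue."""
    a, b, c, d = bounds
    if not (a <= b <= c <= d):
        raise ValueError("boundaries must be ordered a <= b <= c <= d")
    if t < a:
        return "stop: extremely low"
    if b < t < c:
        return "stop: approximate equivalence"
    if t > d:
        return "stop: extremely high"
    return "continue"

def first_stop(stats: Sequence[float], bounds: Sequence[Bounds]) -> Optional[Tuple[int, str]]:
    """Return (analysis number, decision) at the first analysis that stops, else None."""
    for j, (t, bj) in enumerate(zip(stats, bounds), start=1):
        decision = interim_decision(t, bj)
        if decision != "continue":
            return j, decision
    return None
```

For example, with hypothetical boundaries `(-3.0, -0.2, 0.2, 3.0)` at the first analysis and `(-2.0, -0.5, 0.5, 2.0)` at the second, the statistics `[1.0, 2.5]` continue at the first analysis and stop high at the second.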
8
Sequential Sampling Issues
– Design stage
• Choosing sampling plan which satisfies desired operating
characteristics
– E.g., type I error, power, sample size requirements
– Monitoring stage
• Flexible implementation of the stopping rule to account for
assumptions made at design stage
– E.g., adjust sample size to account for observed variance
– Analysis stage
• Providing inference based on true sampling distribution of
test statistics
9
Sequential Sampling Strategies
Two broad categories of sequential sampling
– Prespecified stopping guidelines
– Adaptive procedures
10
Prespecified Stopping Plans
Prior to collection of data, specify
– Scientific and statistical hypotheses of interest
– Statistical criteria for credible evidence
– Rule for determining maximal statistical information
• E.g., fix power, maximal sample size, or calendar time
– Randomization scheme
– Rule for determining schedule of analyses
• E.g., according to sample size, statistical information, or
calendar time
– Rule for determining conditions for early stopping
• E.g., boundary shape function for stopping boundaries on the
scale of some test statistic
11
Adaptive Sampling Plans
At each interim analysis, possibly modify
– Scientific and statistical hypotheses of interest
– Statistical criteria for credible evidence
– Maximal statistical information
– Randomization ratios
– Schedule of analyses
– Conditions for early stopping
12
Adaptive Sampling: Examples
– E.g., Modify sample size to account for estimated
information (variance or baseline rates)
• No effect on type I error IF
– Estimated information independent of estimate of
treatment effect
» Proportional hazards,
» Normal data, and/or
» Carefully phrased alternatives
– And willing to use conditional inference
» Carefully phrased alternatives
13
Estimation of Statistical
Information
If maximal sample size is maintained, the study
discriminates between null hypothesis and an
alternative measured in units of statistical
information
$$n = \frac{\delta_1^2 V}{(\theta_1 - \theta_0)^2}$$
$$\frac{(\theta_1 - \theta_0)^2}{V} = \frac{\delta_1^2}{n}$$
14
Estimation of Statistical
Information
If statistical power is maintained, the study sample
size is measured in units of statistical
information
$$n = \frac{\delta_1^2 V}{(\theta_1 - \theta_0)^2}$$
$$\frac{n}{V} = \frac{\delta_1^2}{(\theta_1 - \theta_0)^2}$$
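The two views of statistical information on these slides (fix the sample size and see which alternative is discriminated, or fix the power and see which sample size is needed) can be sketched numerically. The sketch rests on assumptions not stated on the slides: the sample size relation is taken to be n = delta1^2 * V / (theta1 - theta0)^2, with V the variability of a single sampling unit, and the standardized alternative delta1 is taken to be z_{1-alpha} + z_{power} for a one-sided fixed-sample test. All function names are hypothetical.

```python
from statistics import NormalDist

def standardized_alternative(alpha: float, power: float) -> float:
    """Assumed form delta1 = z_{1-alpha} + z_{power} for a one-sided fixed-sample test."""
    z = NormalDist().inv_cdf
    return z(1 - alpha) + z(power)

def sample_size(delta1: float, V: float, theta1: float, theta0: float) -> float:
    """Power maintained: n = delta1^2 * V / (theta1 - theta0)^2."""
    return delta1 ** 2 * V / (theta1 - theta0) ** 2

def detectable_difference(delta1: float, V: float, n: float) -> float:
    """Maximal sample size maintained: |theta1 - theta0| = delta1 * sqrt(V / n)."""
    return delta1 * (V / n) ** 0.5

def reestimated_sample_size(planned_n: float, planned_V: float, observed_V: float) -> float:
    """With power maintained, n scales in proportion to the variability estimate."""
    return planned_n * observed_V / planned_V
```

If the interim variability estimate is 20% larger than planned, `reestimated_sample_size` increases n by 20%, while keeping n fixed instead inflates the detectable difference by a factor of sqrt(1.2).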
15
Adaptive Sampling: Examples
– E.g., Proschan & Hunsberger (1995)
• Modify ultimate sample size based on conditional power
– Computed under current best estimate (if high enough)
• Make adjustment to inference to maintain Type I error
– E.g., Self-designing Trial (Fisher, 1998)
• Combine arbitrary test statistics from sequential groups
• Prespecify weighting of groups “just in time”
– Specified at immediately preceding analysis
• Fisher’s test statistic is N(0,1) under the null hypothesis of no
treatment difference on any of the endpoints tested
– E.g., Randomized Play the Winner
• Biased coin favors currently best performing treatment
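As an illustration of the biased-coin idea, here is a minimal simulation of the usual urn formulation of randomized play the winner. The urn rule is an assumption, since the slide does not spell it out: start with one ball per arm, draw a ball to assign each patient, then add a ball for the same arm after a success or for the other arm after a failure. The function name and success probabilities are hypothetical, and responses are treated as available immediately.

```python
import random
from typing import Dict, List

def randomized_play_the_winner(p_success: Dict[str, float],
                               n_patients: int,
                               rng: random.Random) -> List[str]:
    """Simulate arm assignments under an urn-based play-the-winner rule (two arms, A and B)."""
    urn = ["A", "B"]                      # one ball per arm to start
    assignments = []
    for _ in range(n_patients):
        arm = rng.choice(urn)             # biased coin: proportional to urn composition
        assignments.append(arm)
        success = rng.random() < p_success[arm]
        other = "B" if arm == "A" else "A"
        urn.append(arm if success else other)  # reward success, penalize failure
    return assignments

if __name__ == "__main__":
    arms = randomized_play_the_winner({"A": 0.9, "B": 0.1}, 2000, random.Random(1))
    print("fraction assigned to A:", arms.count("A") / len(arms))
```

When arm A has the higher success probability, the urn drifts toward A, so most later patients are assigned to the arm that is currently performing best.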
16
Motivation for Adaptive Designs
Scientific and statistical hypotheses of interest
– Modify target population, intervention, measurement
of outcome, alternative hypotheses of interest
– Possible justification
• Changing conditions in medical environment
– Approval/withdrawal of competing/ancillary treatments
– Diagnostic procedures
• New knowledge from other trials about similar treatments
• Evidence from ongoing trial
– Toxicity profile (therapeutic index)
– Subgroup effects
17
Motivation for Adaptive Designs
Modification of other design parameters may have
great impact on the hypotheses considered
– Statistical criteria for credible evidence
– Maximal statistical information
– Randomization ratios
– Schedule of analyses
– Conditions for early stopping
18
Prespecified vs Adaptive
Major issues with use of adaptive designs
– What do we truly gain?
• Can proper evaluation of trial designs obviate need?
– What can we lose?
• Efficiency? (and how should it be measured?)
• Scientific inference?
– Science vs Statistics vs Game theory
– Definition of scientific/statistical hypotheses
– Quantifying precision of inference
19
Major Issue: Frequentist Inference
Frequentist inference is still the most commonly
used form of quantifying statistical strength of
evidence
– Estimates which minimize bias, MSE
– Confidence intervals
– P values; type I, II errors
Frequentist inference depends on sampling
distribution
20
Prespecified Sampling Plan
– Perform analyses when sample sizes $N_1, \ldots, N_J$
• Can be randomly determined
– At each analysis choose stopping boundaries
• $a_j < b_j < c_j < d_j$
– Compute test statistic $T(X_1, \ldots, X_{N_j})$
• Stop if $T < a_j$ (extremely low)
• Stop if $b_j < T < c_j$ (approximate equivalence)
• Stop if $T > d_j$ (extremely high)
• Otherwise continue as prespecified
21
Boundary Scales
– Stopping rule for one test statistic is easily
transformed to a rule for another statistic
• “Group sequential stopping rules”
– Sum of observations
– Point estimate of treatment effect
– Normalized (Z) statistic
– Fixed sample P value
– Error spending function
• Conditional probability
• Predictive probability
• Bayesian posterior probability
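To make the idea of interchangeable scales concrete, the sketch below converts a boundary given on the point-estimate (MLE) scale to the sum, Z, and fixed-sample P value scales. It assumes a one-sample normal model with known standard deviation `sigma` and a lower-tailed P value (small estimates favor the treatment); both are simplifying assumptions, and the function name is hypothetical.

```python
import math
from statistics import NormalDist
from typing import Dict

def boundary_scales(xbar: float, n_j: int, sigma: float, theta0: float = 0.0) -> Dict[str, float]:
    """Express one boundary value on several equivalent scales."""
    se = sigma / math.sqrt(n_j)           # standard error of the sample mean at analysis j
    z = (xbar - theta0) / se              # normalized (Z) statistic
    return {
        "sum": n_j * xbar,                # sum of observations (partial sum)
        "mle": xbar,                      # point estimate of treatment effect
        "z": z,
        "fixed_sample_p": NormalDist().cdf(z),  # lower-tailed P value ignoring the stopping rule
    }
```

Because each scale is a one-to-one function of the others at a given sample size, a stopping rule stated on any one of them defines the same rule on all of them.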
22
Unified Family: MLE Scale
Boundary shape function unifying families of
stopping rules (Kittelson & Emerson, 1999)
– Wang & Tsiatis (1987) based families (R=0, A=0)
• P=1: O’Brien & Fleming (1979); P= 0.5: Pocock (1977)
• Emerson & Fleming (1989); Pampallona & Tsiatis (1994)
– Triangular test (Whitehead, 1983): (P=1, R=0, A=1)
– Seq cond probability ratio test (Xiong & Tan, 1994)
– Some boundaries constant on conditional or
predictive power
– Extensions: Peto-Haybittle (using Burington &
Emerson, 2003)
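A small sketch may help show how the parameters P, R, and A pick out the named families. It assumes the boundary shape function on the MLE scale has the form (A + Pi^(-P) * (1 - Pi)^R) * G, where Pi = N_j / N_J is the proportion of maximal information and G is a scaling constant. That form is not written on the slide and is an assumption here, as is the function name; in practice G would be found by a numerical search for the desired type I error and power.

```python
def unified_boundary(pi: float, P: float, R: float = 0.0, A: float = 0.0, G: float = 1.0) -> float:
    """Boundary on the MLE scale at information fraction pi (assumed unified-family form)."""
    if not 0.0 < pi <= 1.0:
        raise ValueError("pi must lie in (0, 1]")
    return (A + pi ** (-P) * (1.0 - pi) ** R) * G

if __name__ == "__main__":
    fractions = [0.25, 0.50, 0.75, 1.00]
    # G is set to the final-analysis critical value on the MLE scale (illustrative values).
    obf = [unified_boundary(pi, P=1.0, G=-0.043) for pi in fractions]
    pocock = [unified_boundary(pi, P=0.5, G=-0.050) for pi in fractions]
    print("O'Brien-Fleming:", [round(b, 3) for b in obf])
    print("Pocock:         ", [round(b, 3) for b in pocock])
```

With R = 0 and A = 0, P = 1 gives boundaries proportional to 1/Pi and P = 0.5 gives boundaries proportional to 1/sqrt(Pi). Up to rounding, these patterns agree with the O'Brien-Fleming and Pocock efficacy boundaries on the MLE scale tabulated in the Frequentist Inference slide (-0.171, -0.086, -0.057, -0.043 and -0.099, -0.070, -0.057, -0.050).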
23
Spectrum of Conditions for
Early Stopping
– Down columns: Early stopping vs no early stopping
– Across rows: One-sided vs two-sided decisions
24
Spectrum of Boundary Shape
Functions
A wide variety of boundary shapes possible
– All of the rules depicted have the same type I error
and power to detect the design alternative
25
Boundary Scales
Conditional Probability Scale:
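– Threshold at final analysis from unified family: $t_{\bar X_J}$
– Hypothesized value of mean: $\theta_*$
$$C_j(t_{\bar X_J}, \theta_*) = \Pr\left(\bar X_J \ge t_{\bar X_J} \mid \bar X_j;\ \theta_*\right) = 1 - \Phi\left(\frac{N_J\,[t_{\bar X_J} - \theta_*] - N_j\,[\bar x_j - \theta_*]}{\sigma\sqrt{N_J - N_j}}\right)$$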
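The conditional probability scale can be computed directly. The sketch below assumes independent normal observations with known standard deviation `sigma` (a symbol not defined on the slide) and evaluates the probability that the final sample mean reaches the threshold t, given the interim mean and a hypothesized mean `theta_star`. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def conditional_power(t: float, theta_star: float, xbar_j: float,
                      n_j: int, n_J: int, sigma: float) -> float:
    """Pr(final mean >= t | interim mean xbar_j), future data generated with mean theta_star."""
    if not 0 < n_j < n_J:
        raise ValueError("need 0 < n_j < n_J")
    # The remaining n_J - n_j observations must sum to at least n_J*t - n_j*xbar_j.
    numerator = n_J * (t - theta_star) - n_j * (xbar_j - theta_star)
    denominator = sigma * math.sqrt(n_J - n_j)
    return 1.0 - NormalDist().cdf(numerator / denominator)
```

The answer depends heavily on `theta_star`: the same interim result can give a high conditional probability under an optimistic hypothesis and a low one under the current estimate.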
26
Boundary Scales
Predictive Probability Scale:
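– Prior distribution: $\theta \sim N(\zeta, \tau^2)$
$$H_j(t_{\bar X_J}) = \int \Pr\left(\bar X_J \ge t_{\bar X_J} \mid \bar X_j, \theta\right)\, p(\theta \mid \bar X_j)\, d\theta$$
$$= 1 - \Phi\left(\frac{N_J\,[N_j\tau^2 + \sigma^2]\,[t_{\bar X_J} - \bar x_j] + \sigma^2\,[N_J - N_j]\,[\bar x_j - \zeta]}{\sqrt{[N_J - N_j]\,[N_J\tau^2 + \sigma^2]\,[N_j\tau^2 + \sigma^2]\,\sigma^2}}\right)$$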
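The predictive probability averages the conditional probability over the posterior distribution of the mean. A minimal sketch follows, assuming normal data with known standard deviation `sigma` and a normal prior with mean `zeta` and standard deviation `tau`. The function name is hypothetical, and the closed form used comes from the standard normal-normal conjugate calculation.

```python
import math
from statistics import NormalDist

def predictive_power(t: float, xbar_j: float, n_j: int, n_J: int,
                     sigma: float, zeta: float, tau: float) -> float:
    """Pr(final mean >= t | interim mean xbar_j), averaging over the posterior for the mean."""
    if not 0 < n_j < n_J:
        raise ValueError("need 0 < n_j < n_J")
    s2, t2 = sigma ** 2, tau ** 2
    numerator = (n_J * (n_j * t2 + s2) * (t - xbar_j)
                 + s2 * (n_J - n_j) * (xbar_j - zeta))
    denominator = math.sqrt((n_J - n_j) * (n_J * t2 + s2) * (n_j * t2 + s2) * s2)
    return 1.0 - NormalDist().cdf(numerator / denominator)
```

As `tau` shrinks toward zero the prior becomes a point mass at `zeta`, and the result approaches the conditional probability computed under that fixed mean. A very large `tau` behaves like a noninformative prior.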
27
Boundary Scales
Bayesian Posterior Scale:
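– Prior: $\theta \sim N(\zeta, \tau^2)$
$$B_j(\theta_*) = \Pr\left(\theta \ge \theta_* \mid X_1, \ldots, X_{N_j}\right) = 1 - \Phi\left(\frac{[N_j\tau^2 + \sigma^2]\,\theta_* - N_j\tau^2\,\bar x_j - \sigma^2\zeta}{\sigma\tau\sqrt{N_j\tau^2 + \sigma^2}}\right)$$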
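The Bayesian posterior scale can be sketched the same way. The code assumes normal data with known standard deviation `sigma` and a normal prior with mean `zeta` and standard deviation `tau`, and returns the posterior probability that the mean is at least `theta_star`. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def posterior_probability(theta_star: float, xbar_j: float, n_j: int,
                          sigma: float, zeta: float, tau: float) -> float:
    """Pr(theta >= theta_star | data) under a N(zeta, tau^2) prior and known sigma."""
    s2, t2 = sigma ** 2, tau ** 2
    # Posterior mean is (n_j*t2*xbar_j + s2*zeta) / (n_j*t2 + s2);
    # posterior variance is s2*t2 / (n_j*t2 + s2).
    numerator = (n_j * t2 + s2) * theta_star - n_j * t2 * xbar_j - s2 * zeta
    denominator = sigma * tau * math.sqrt(n_j * t2 + s2)
    return 1.0 - NormalDist().cdf(numerator / denominator)
```

With a very diffuse prior (large `tau`) the result is close to one minus the normal CDF evaluated at `(theta_star - xbar_j) * sqrt(n_j) / sigma`, so the posterior is then driven almost entirely by the interim estimate.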
28
Major Issue: Frequentist
Inference
Frequentist operating characteristics are based on
the sampling distribution
– Stopping rules do affect the sampling distribution of
the usual statistics
• MLEs are not normally distributed
• Z scores are not standard normal under the null
– (1.96 is irrelevant)
• The null distribution of fixed sample P values is not uniform
– (They are not true P values)
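A short Monte Carlo simulation illustrates why 1.96 stops being the relevant critical value once a stopping rule is in play. The setup is hypothetical and not from the slides: equally sized groups, a two-sided test at every analysis, and stopping the first time |Z| exceeds 1.96.

```python
import math
import random

def repeated_testing_type1_error(n_looks: int = 4, reps: int = 20000,
                                 crit: float = 1.96, seed: int = 0) -> float:
    """Estimate Pr(|Z_k| > crit at some analysis k) when the null hypothesis is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        s = 0.0
        for k in range(1, n_looks + 1):
            s += rng.gauss(0.0, 1.0)            # standardized contribution of group k
            if abs(s / math.sqrt(k)) > crit:    # Z statistic on all data so far
                rejections += 1
                break                           # trial stops at first rejection
    return rejections / reps

if __name__ == "__main__":
    print("1 look :", repeated_testing_type1_error(n_looks=1))
    print("4 looks:", repeated_testing_type1_error(n_looks=4))
```

With a single analysis the estimate should be close to the nominal 0.05. With four analyses it is noticeably larger, which is the sense in which the fixed-sample Z threshold and P value no longer have their usual interpretation.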
29
Sequential Sampling: The Price
It is only through full knowledge of the sampling
plan that we can assess the full complement of
frequentist operating characteristics
– In order to obtain inference with maximal precision
and minimal bias, the sampling plan must be well
quantified
– (Note that adaptive designs using ancillary statistics
pose no special problems if we condition on those
ancillary statistics.)
30
Sampling Distribution of
Estimates
31
Sampling Distribution of
Estimates
32
Sampling Distributions with
Stopping Rules
33
Sampling Distribution
For any known stopping rule, however, we can
compute the correct sampling distribution with
specialized software
– From the computed sampling distributions we then
compute
• Bias adjusted estimates
• Correct (adjusted) confidence intervals
• Correct (adjusted) P values
– Candidate designs can then be compared with
respect to their operating characteristics
34
Evaluation of Designs
Process of choosing a trial design
– Define candidate design
• Usually constrain two operating characteristics
– Type I error, power at design alternative
– Type I error, maximal sample size
– Evaluate other operating characteristics
• Different criteria of interest to different investigators
– Modify design
– Iterate
35
Operating Characteristics
The same regardless of the type of stopping rule
– Frequentist power curve
• Type I error (null) and power (design alternative)
– Sample size requirements
• Maximum, average, median, other quantiles
• Stopping probabilities
– Inference at study termination (at each boundary)
• Frequentist or Bayesian (under spectrum of priors)
– Futility measures
• Conditional power, predictive power
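Several of these operating characteristics can be estimated by simulation for any fully specified stopping rule. The sketch below is an assumption-laden illustration rather than the specialized software the slides refer to. It uses normal data with known `sigma`, equal group sizes, boundaries on the Z scale, and a one-sided test in which large Z favors rejection. The boundary values in the usage example are approximate O'Brien-Fleming-type values chosen only for illustration.

```python
import math
import random
from typing import Dict, List

def operating_characteristics(theta: float, n_per_look: int,
                              upper_z: List[float], lower_z: List[float],
                              sigma: float = 1.0, reps: int = 20000,
                              seed: int = 0) -> Dict[str, object]:
    """Estimate rejection probability, average sample size, and stopping probabilities."""
    rng = random.Random(seed)
    J = len(upper_z)
    stops = [0] * J
    rejections = 0
    total_n = 0
    for _ in range(reps):
        s = 0.0
        for j in range(J):
            # Sum of one group of n_per_look observations with mean theta.
            s += rng.gauss(n_per_look * theta, sigma * math.sqrt(n_per_look))
            n = n_per_look * (j + 1)
            z = s / (sigma * math.sqrt(n))
            if z >= upper_z[j] or z <= lower_z[j] or j == J - 1:
                stops[j] += 1
                total_n += n
                if z >= upper_z[j]:
                    rejections += 1
                break
    return {"reject_prob": rejections / reps,
            "avg_n": total_n / reps,
            "stop_probs": [c / reps for c in stops]}

if __name__ == "__main__":
    upper = [4.049, 2.863, 2.337, 2.024]      # illustrative efficacy boundaries
    lower = [-math.inf] * 4                   # no early stopping for futility
    for theta in (0.0, 0.3):
        print(theta, operating_characteristics(theta, 25, upper, lower))
```

Running it over a grid of `theta` values traces out the power curve and the sample size distribution, which is the kind of side-by-side comparison of candidate designs the slides describe.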
36
At Design Stage
In particular, at design stage we can know
– Conditions under which trial will continue at each
analysis
• Estimates, inference, conditional and predictive power
– Tradeoffs between early stopping and loss in
unconditional power
37
Frequentist Inference
O'Brien-Fleming
Boundary      N     MLE  Bias Adj Estimate            95% CI  P val
Efficacy    425  -0.171             -0.163  (-0.224, -0.087)  0.000
Efficacy    850  -0.086             -0.080  (-0.130, -0.025)  0.002
Efficacy   1275  -0.057             -0.054  (-0.096, -0.007)  0.012
Efficacy   1700  -0.043             -0.043   (-0.086, 0.000)  0.025
Futility    425   0.086              0.077    (0.001, 0.139)  0.977
Futility    850   0.000             -0.006   (-0.061, 0.044)  0.401
Futility   1275  -0.029             -0.031   (-0.079, 0.010)  0.067
Futility   1700  -0.043             -0.043   (-0.086, 0.000)  0.025

Pocock
Boundary      N     MLE  Bias Adj Estimate            95% CI  P val
Efficacy    425  -0.099             -0.089  (-0.152, -0.015)  0.010
Efficacy    850  -0.070             -0.065  (-0.114, -0.004)  0.018
Efficacy   1275  -0.057             -0.055  (-0.101, -0.001)  0.023
Efficacy   1700  -0.050             -0.050   (-0.099, 0.000)  0.025
Futility    425   0.000             -0.010   (-0.084, 0.053)  0.371
Futility    850  -0.029             -0.035   (-0.095, 0.014)  0.078
Futility   1275  -0.042             -0.044   (-0.098, 0.002)  0.029
Futility   1700  -0.050             -0.050   (-0.099, 0.000)  0.025
38
Efficiency / Unconditional Power
Tradeoffs between early stopping and loss of power
[Figure panels: Boundaries; Loss of Power; Avg Sample Size]
39
Stochastic Curtailment
Boundaries transformed to conditional or predictive
power
– Key issue: Computations are based on assumptions
about the true treatment effect
• Conditional power
– “Design”: based on hypotheses
– “Estimate”: based on current estimates
• Predictive power
– “Prior assumptions”
40
Conditional/Predictive Power
Symmetric O’Brien-Fleming
                      Conditional Power    Predictive Power
   N     MLE          Design   Estimate    Sponsor   Noninf
Efficacy (rejects 0.00)
 425  -0.171           0.500      0.000      0.002    0.000
 850  -0.085           0.500      0.002      0.015    0.023
1275  -0.057           0.500      0.091      0.077    0.124
Futility (rejects -0.0855)
 425   0.085           0.500      0.000      0.077    0.000
 850   0.000           0.500      0.002      0.143    0.023
1275  -0.028           0.500      0.091      0.241    0.124

O’Brien-Fleming Efficacy, P=0.8 Futility
                      Conditional Power    Predictive Power
   N     MLE          Design   Estimate    Sponsor   Noninf
Efficacy (rejects 0.00)
 425  -0.170           0.500      0.000      0.002    0.000
 850  -0.085           0.500      0.002      0.015    0.023
1275  -0.057           0.500      0.093      0.077    0.126
Futility (rejects -0.0866)
 425   0.047           0.719      0.000      0.222    0.008
 850  -0.010           0.648      0.015      0.247    0.063
1275  -0.031           0.592      0.142      0.312    0.177
41
Efficiency / Unconditional Power
Tradeoffs between early stopping and loss of power
[Figure panels: Boundaries; Loss of Power; Avg Sample Size]
42
Key Issues
Very different probabilities based on assumptions
about the true treatment effect
– Extremely conservative O’Brien-Fleming boundaries
correspond to conditional power of 50% (!) under
alternative rejected by the boundary
– Resolution of apparent paradox: if the alternative
were true, there is less than .0001 probability of
stopping for futility at the first analysis
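The 50% figure follows from a short calculation, sketched here under two assumptions: the normal-theory conditional probability formula with known standard deviation, and an O'Brien-Fleming boundary that is constant on the partial-sum scale, so an interim result exactly on the boundary satisfies N_j x̄_j = N_J t. Here t is the final-analysis threshold and the hypothesis rejected by the efficacy boundary is θ* = 0.

```latex
% Conditional probability of crossing the final threshold, evaluated under the
% rejected hypothesis \theta_* = 0 for a result exactly on the boundary
% (N_j \bar{x}_j = N_J t).
\[
\begin{aligned}
C_j(t, \theta_* = 0)
  &= 1 - \Phi\!\left(\frac{N_J\,[t - 0] - N_j\,[\bar{x}_j - 0]}{\sigma\sqrt{N_J - N_j}}\right) \\
  &= 1 - \Phi\!\left(\frac{N_J t - N_J t}{\sigma\sqrt{N_J - N_j}}\right)
   = 1 - \Phi(0) = 0.5 .
\end{aligned}
\]
```

The tabulated efficacy boundaries fit this pattern up to rounding: with t = -0.043 at N = 1700, the boundary at N = 425 is about 1700 × (-0.043) / 425 = -0.172, against the tabulated -0.171. Because Φ(0) = 0.5, the result is the same whichever tail is used to define the probability.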
43
Further Comments
Neither conditional power nor predictive power
have good foundational motivation
– Frequentists should use Neyman-Pearson paradigm
and consider optimal unconditional power across
alternatives
• And conditional/predictive power is not a good indicator of
loss of unconditional power
– Bayesians should use posterior distributions for
decisions
44
Fully Adaptive Sampling
What is the cost of planning not to plan?
– In order to provide frequentist estimation, we must
know the rule used to modify the clinical trial
• Hypothesis testing of a null is possible with fully adaptive
trials
– Statistics: type I error is controlled
– Game theory: chance of “winning” with completely
ineffective therapy is controlled
– Science:
» At best: ability to discriminate clinically relevant
hypotheses may be impaired
» At worst: uncertainty as to what the treatment has an
effect on
45
Prespecified Modification Rules
Adaptive sampling plans exact a price in statistical
efficiency
– Tsiatis & Mehta (2002)
• A classic prespecified group sequential stopping rule can be
found that is more efficient than a given adaptive design
– Shi & Emerson (2003)
• Fisher’s test statistic in the self-designing trial provides
markedly less precise inference than that based on the MLE
– To compute the sampling distribution of the latter, the
sampling plan must be known
46
Conditional/Predictive Power
Additional issues with maintaining conditional or
predictive power
– Modification of sample size may allow precise
knowledge of interim treatment effect
• Interim estimates may cause change in study population
– Time trends due to investigators gaining or losing
enthusiasm
• In extreme cases, potential for unblinding of individual
patients
– Effect of outliers on test statistics
47
Self-Designing Trial
Additional issues with Self-Designing Trial
– The self-designing trial requires pre-specification of
the analysis at which the trial stops
• Trial stops when all of the remaining weight is to be applied
at the current analysis, as specified at the previous analysis
48
Randomized Play the Winner
Additional issues with Randomized Play the
Winner
– For a fixed total sample size, greatest efficiency is
obtained when ratio of sample sizes is equal to ratio
of statistical information from arms
• Constant ratio of standard deviation of observations to
sample size
– (Of course, PTW is designed to minimize the number
of subjects receiving an inferior treatment, which may
be a greater cost in total patients and time)
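The efficiency claim for a fixed total sample size can be checked numerically. The sketch below assumes a comparison of two arm means with known standard deviations and searches for the allocation that minimizes the variance of the estimated difference. The function names and numbers are hypothetical.

```python
def variance_of_difference(n1: int, n2: int, sd1: float, sd2: float) -> float:
    """Variance of the difference in sample means for arm sizes n1 and n2."""
    return sd1 ** 2 / n1 + sd2 ** 2 / n2

def best_allocation(total: int, sd1: float, sd2: float) -> int:
    """Arm-1 sample size minimizing the variance when n1 + n2 = total."""
    return min(range(1, total),
               key=lambda n1: variance_of_difference(n1, total - n1, sd1, sd2))

if __name__ == "__main__":
    n1 = best_allocation(300, sd1=2.0, sd2=1.0)
    print("arm 1:", n1, "arm 2:", 300 - n1)
```

With standard deviations in a 2:1 ratio the minimizing allocation is also 2:1, so the ratio of standard deviation to sample size is the same in both arms. An allocation driven by interim success rates, as in play the winner, will generally depart from this ratio and therefore give up some precision for the same total number of patients.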
49
Final Comments
Adaptive designs versus prespecified stopping
rules
– Adaptive designs come at a price of efficiency and
(sometimes) scientific interpretation
– With adequate tools for careful evaluation of designs,
there is little need for adaptive designs
50
Bottom Line
You better think (think)
about what you’re
trying to do…
-Aretha Franklin
51