The Pharmaceutical Industry’s Dilemma with Large
Patient Sample Sizes in Clinical Trials: The
Statistician’s Role Now and in the Future
Davis Gates Jr., Ph.D.
Associate Director
Schering-Plough Research Institute
Brief Background on the Pharmaceutical Industry’s Clinical
Trials Study Process
There are many areas of drug development in the pharmaceutical
industry, such as toxicology, drug device development (inhalers, tablet
composition), market research, and long-term safety and surveillance,
each warranting its own discussion.
For this presentation, we will focus on Clinical Trials, which are
performed in a sequence of steps called Phases. The first three
Phases (I, II, III) allow researchers to collect reliable information
while best protecting patients.
Phased clinical trials require considerable input from statisticians.
For one, the determination of sample size can at times challenge the
statistician's role in a drug development program.
Phase I: Focusing on Safety
Phase I trials are the first step in testing a new treatment approach in humans.
 Usually a small sample size, divided into cohorts (for example, N = 12/dose)
 What range of doses is safe (monitoring side effects)
 Route of administration (oral, injection into vein or muscle)
 Schedule of dosing (once-a-week, once-a-day, twice-a-day)
Phase II: Studying Effectiveness
 Sample sizes are much larger (approaching/reaching large clinical trials)
 Select optimal doses based on a range of dose levels
 Optimal doses are selected from a combination of safety and efficacy
 Sometimes more than one dose is carried forward to Phase III
Phase III: Final Decision for Proposal of Treatment Options
 Collective sample sizes across studies are in the thousands
 Determine target dosing for market approval (Pivotal Studies)
 Compile the safety profile by pooling studies (Adverse Events)
 Determine whether observed adverse events are acceptable with treatment
 Compare the test drug against standard therapy, or in some cases placebo
 Finalization of the program results in submission to health authorities
Key Statistical Terms
Randomization
The process of randomly assigning a treatment to subjects enrolled into a
clinical trial, such that the subject has a pre-assigned probability of being
selected to either test drug(s) or a standard therapy/placebo.
A simple case would be to equally allocate subjects to a test drug or
standard treatment such that the sample sizes of two groups are as close
to equal as possible.
Other cases could assign unequal sample sizes across treatments, such as
twice as many subjects to the active treatment as to placebo, for ethical
reasons to minimize those assigned to placebo. Here, a 2:1 ratio is
designated such that for every subject assigned to placebo (P), two are
assigned to the active treatment (A), set up in block sizes in multiples
of three (APA, AAP, PAA).
Randomization is usually enforced through use of a pre-specified
randomization schedule generated by a statistician.
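As an illustration, here is a minimal sketch of how such a blocked 2:1 schedule might be generated; the seed and block count are arbitrary choices for the example, not from the talk.

```python
import random

def randomization_schedule(n_blocks, block=("A", "A", "P"), seed=2024):
    """Blocked 2:1 randomization: each block of three holds two actives (A)
    and one placebo (P), permuted at random (APA, AAP, or PAA)."""
    rng = random.Random(seed)  # fixed seed keeps the schedule reproducible
    schedule = []
    for _ in range(n_blocks):
        b = list(block)
        rng.shuffle(b)  # permute treatment positions within the block
        schedule.extend(b)
    return schedule

print("".join(randomization_schedule(4)))  # 12 assignments, two A's per block
```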
Blinding
A double-blind trial is one in which neither the subject nor the
investigator conducting the trial knows the treatment assignment.
An evaluator-blind trial is one in which the subject knows the
treatment assignment but the evaluating investigator does not. When two
treatment regimens differ substantially, it is not feasible to blind the
subject.
Open-label studies are at times performed, usually to assess safety, in
which both the subject and investigator are un-blinded.
Pivotal trials, used to assess the efficacy of a drug for approval, are
usually double-blind, though there are exceptions, such as evaluator-blind
studies with well-defined biological endpoints.
Superiority
The test drug "beating" the standard treatment or placebo, usually
confirmed with a p-value (p < 0.05) showing that the test drug is
statistically significantly better than the standard treatment or placebo.
Clinically Meaningful Difference
A pre-defined minimum magnitude justifying a clinically relevant difference,
regardless of the observed statistical test. For example, a trial can be so
large that small (meaningless) differences can be detected, so a pre-defined
difference is stated, such as a 0.5-point difference in a quality-of-life
questionnaire. In many cases, study team members do not want to commit
to a magnitude if one doesn't already exist.
The general belief in the industry is that the p-value “gets you in the door”
with health authorities such as the FDA, then one argues the clinical
relevance in the approval process.
Primary Endpoint
The designated measure of efficacy (driven by a primary hypothesis) that
determines the success or failure of the trial, from which the study is sized.
Generally, primary endpoints are powered at 80% to 90% to detect
differences, at a two-sided 5% level of significance. Underpowered
endpoints are a company risk, but overpowered studies (>99% power) are
not well received by regulatory agencies.
Secondary Endpoints
Additional measures of efficacy generally used as supportive information
for the trial, but cannot determine study success when the primary endpoint
is not successful.
Usually not considered to size trials unless they are determined to be key in
providing additional test drug benefit. Even in these cases, sizing of the
trial can be overlooked.
Addition of “Key” Secondary Endpoints
Lately, Key Secondary Endpoints have been added to trial designs to
support additional benefits of the test drug, which should also be
considered to size the study, as the treatment difference in these endpoints
may require a larger sample size than the primary endpoint.
Typical Sample Size Calculations address the Primary Endpoint:
Patient Reported Symptoms Study
A sample size of 125 subjects per treatment arm is required to detect a
difference of 1.0 point or more between active and placebo, assuming a
two-sided 5% level of significance and 90% power, with a pooled standard
deviation of 2.40 points.
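A sketch of the normal-approximation formula behind this kind of statement, n per arm = 2((z_alpha/2 + z_beta) * sd / delta)^2; it lands a few subjects below the quoted 125, which presumably adds a t-distribution refinement and some rounding up.

```python
import math
from statistics import NormalDist

def n_per_arm(delta, sd, alpha=0.05, power=0.90):
    """Normal-approximation sample size per arm for a two-group comparison."""
    z = NormalDist().inv_cdf
    z_a = z(1 - alpha / 2)  # 1.96 for a two-sided 5% level
    z_b = z(power)          # 1.2816 for 90% power
    return math.ceil(2 * ((z_a + z_b) * sd / delta) ** 2)

# 122 by this approximation, vs. the quoted 125
print(n_per_arm(delta=1.0, sd=2.40))
```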
Primary Endpoint
Total Nasal Symptom Score (a sum of congestion, nasal itching, sneezing,
and post nasal drip) averaged over four weeks.
Key Secondary Endpoints
1) Proportion of Symptom Free Days (each determined per patient)
across the four week treatment period
2) Quality of Life Questionnaire at Endpoint (the last post-baseline
observation carried forward)
3) Morning Peak Nasal Inspiratory Flow Rate (Liters/minute) averaged
over four weeks
Power Calculations for 125 subjects per treatment arm

Endpoint                        Delta(a)   Pooled STD (CV)   Power
Total Nasal Symptom Score       1.0        2.4 (0.42)        90%
Symptom Free Days               0.1        0.28 (0.36)       80%
Quality of Life Questionnaire   0.5        0.8 (0.63)        >99%
Morning Peak Flow               10         36 (0.28)         60%

Joint power (assuming independent Endpoints) = 43%
a: Delta = the treatment difference between the active and placebo.
Re-powering around the weakest endpoint to assure a reasonable
probability of study success: propose 205 subjects per arm for 80% power
to detect the Morning Peak Flow difference.
Power Calculations for 205 subjects per treatment arm

Endpoint                        Delta   Pooled STD (CV)   Power
Total Nasal Symptom Score       1.0     2.4 (0.42)        98%
Symptom Free Days               0.1     0.28 (0.36)       95%
Quality of Life Questionnaire   0.5     0.8 (0.63)        >99%
Morning Peak Flow               10      36 (0.28)         80%

Joint power (assuming independent Endpoints) = 74%
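As a check, a short sketch using the normal approximation for a two-sample comparison, power = Phi(delta / (sd * sqrt(2/n)) - z_alpha/2), approximately reproduces both tables, with joint power taken as the product across endpoints under the stated independence assumption.

```python
import math
from statistics import NormalDist

ND = NormalDist()
Z_A = ND.inv_cdf(0.975)  # two-sided 5% level of significance

endpoints = {  # name: (delta, pooled SD), taken from the tables above
    "Total Nasal Symptom Score":     (1.0, 2.4),
    "Symptom Free Days":             (0.1, 0.28),
    "Quality of Life Questionnaire": (0.5, 0.8),
    "Morning Peak Flow":             (10, 36),
}

for n in (125, 205):
    joint = 1.0
    print(f"n = {n} per arm")
    for name, (delta, sd) in endpoints.items():
        power = ND.cdf(delta / (sd * math.sqrt(2 / n)) - Z_A)
        joint *= power  # product across endpoints assumes independence
        print(f"  {name:30s} {power:6.1%}")
    print(f"  joint power = {joint:.0%}")  # ~43% at 125, ~74% at 205
```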
Multiplicity
Multiple Endpoints require adjustments to control the overall alpha level of
significance. In general, the primary endpoint is tested first, followed by
tests of the key secondary endpoints.
 Bonferroni: Split the alpha level across all the key secondary
endpoints and test them simultaneously, so the failure of one key
secondary does not affect the testing of another key secondary.
 Sequential: Order the key secondary endpoints and test them in the
pre-specified sequence; if one test fails, however, the following tests
lose the overall alpha control. More desirable in cases where one has
knowledge of the probability of success. (A sketch of these first two
schemes follows the list.)
 Create a family tree: Divide the key secondary endpoints into
groups, assign a partitioned alpha to each, and test the families
simultaneously, but sequentially within each family.
 More complex methods, such as Hochberg, etc., can be applied,
but are worthy of a separate presentation.
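A minimal sketch of the first two schemes; the p-values below are hypothetical placeholders, not trial results.

```python
ALPHA = 0.05

def bonferroni(p_values, alpha=ALPHA):
    """Split alpha equally; each key secondary is tested on its own."""
    return [p < alpha / len(p_values) for p in p_values]

def fixed_sequence(p_values, alpha=ALPHA):
    """Test in the pre-specified order at full alpha; stop at first failure."""
    results = []
    for p in p_values:
        if p < alpha:
            results.append(True)
        else:
            break  # endpoints after a failure lose strict alpha control
    return results + [False] * (len(p_values) - len(results))

p_secondaries = [0.020, 0.030, 0.004]  # hypothetical p-values
print(bonferroni(p_secondaries))       # [False, False, True] at alpha/3
print(fixed_sequence(p_secondaries))   # [True, True, True]
```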
When is it time to reconsider including Key Secondary Endpoints?
 When powering a key secondary endpoint up to 80% forces the
primary endpoint to be heavily overpowered (>99%), such that
meaningless treatment differences become statistically significant.
o Example: An easily powered primary endpoint (such as forced
expiratory volume in an asthma study using an analysis of
covariance) followed by a more difficult-to-power key secondary
endpoint (such as a time-to-infrequent-event analysis using a
log-rank test)
 When the joint power across the primary and key secondary
endpoints is <50%, usually a sign that there are too many key
secondary endpoints, or it is not feasible to properly power one or
more of the key secondary endpoints without overpowering the
primary endpoint. In these cases, consider running a separate
trial.
Overview of Factors Influencing Sample Size Calculations
 Powering the Primary Endpoint
 The Need to Power Key Secondary Endpoints
Minimum Sample Size Requirements for a pooled-studies safety
database
 Superiority/Non-Inferiority Margins (Treatment differences)
Additional Tools for Dealing with Sample Size Issues
 Adaptive Designs: adjustments carried out during the trial
o Dropping ineffective treatment arms
o Re-estimating the sample size
 Pooling studies
o Pool similar design studies prior to analysis of endpoints
o Pool all studies for evaluation of adverse events
Non-Inferiority for Studies in which a Placebo is Not Feasible
A defined criterion whereby the test drug is no worse than the standard
treatment. The criteria require use of confidence intervals for the
treatment difference, where the lower bound cannot fall below a
pre-specified margin, such as a percentage of the treatment effect size.
Back to the Pools of Studies for Examination of Safety
A single study designed to examine efficacy is too small for a thorough
examination of adverse events (an undesired outcome such as a
headache, cough, or more serious medical condition such as elevated
blood pressure, dizziness, or liver toxicity).
At 125 subjects per treatment arm, if an underlying adverse event rate in a
control/placebo is 2%, a rate of 8.5% or more must occur in the active
treatment arm to reach statistical significance (p < 0.05 at 50% power using
a binary outcome test). Thus, it would require more than a four-fold
increase in adverse events for chance to be reasonably ruled out.
Therefore, similar studies are pooled to accrue sufficient sample size to
examine adverse events.
Suppose the placebo event rate is 2%; what is the least significant
difference (LSD) event rate in the active treatment arm for the following
pools of data?

Sample Size per treatment   LSD active treatment event rate
125                         8.5%
250                         6.0%
500                         4.5%
1000                        3.6%
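A sketch that roughly reproduces this table, assuming a continuity-corrected two-proportion z-test; the choice of test is an assumption, and an exact (Fisher-type) calculation would shift the rates slightly.

```python
import math
from statistics import NormalDist

Z_A = NormalDist().inv_cdf(0.975)

def lsd_rate(p_placebo, n, step=0.0001):
    """Smallest active-arm rate whose continuity-corrected two-proportion
    z-test against p_placebo reaches z = 1.96 at n subjects per arm."""
    p2 = p_placebo
    while True:
        p2 += step
        pbar = (p_placebo + p2) / 2            # pooled event rate
        se = math.sqrt(pbar * (1 - pbar) * 2 / n)
        z = (p2 - p_placebo - 1 / n) / se      # 1/n: continuity correction
        if z >= Z_A:
            return p2

for n in (125, 250, 500, 1000):
    print(n, f"{lsd_rate(0.02, n):.1%}")  # ~8.4%, 5.8%, 4.4%, 3.6%
```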
Rare Events occurring in less than 1% of subjects
Rare but serious adverse events, which can include death or severe
liver damage, require large databases to detect safety signals.
For example, detecting a difference of 0.5% in adverse event rates could
require over 3000 subjects per treatment arm.
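For a rough sense of scale, assuming (hypothetically) background rates of 0.5% vs. 1.0% at 80% power and two-sided 5% significance, the standard two-proportion formula gives roughly 4,700 subjects per arm; the "over 3000" figure depends on the assumed rates and power.

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf
p1, p2 = 0.005, 0.010  # assumed background vs. active rare-event rates
n = ((z(0.975) + z(0.80)) ** 2
     * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)
print(math.ceil(n))  # ~4,700 subjects per arm under these assumptions
```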
Some notes on Non-Inferiority
Placebo controlled trials are easy examples to use when discussing
treatment effect sizes and sample size calculations in public
presentations, but in the world of clinical trials they are not always
ethical.
For example, trials in oncology (cancer), HIV, or in diseases involving
pain, etc., require every patient to be on some type of therapy.
Therefore, new drugs are tested against active comparators, usually the
"standard of care". Here, if the new drug is "at least as effective" and
is believed to have another advantage, such as better safety, lower cost,
or easier dosing compliance, then tests for non-inferiority can be
considered.
Non-Inferiority Criteria
Criteria are based on lower bounds of confidence intervals of the
treatment difference. Usually the assumption is that both drugs are
equally effective (but there are exceptions), and the magnitude of the
lower bound is a pre-defined fraction of the full treatment effect against
an inactive treatment.
This fractional approach drives up the sample size.
Example of a Sample Size Statement
For the test of non-inferiority in lung function, defined as the forced
expiratory volume in Liters (FEV1), a sample size of 145 subjects for
each treatment is required, assuming a standard deviation of 0.30 Liters
at about 80% power.
Non-inferiority is achieved when the lower bound of the 95% confidence
interval of the treatment difference is -0.10 Liters or more (no upper
bound requirement), which is one-half the magnitude of the estimated
treatment difference of an active treatment vs. an inactive treatment
(0.20 Liters).
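A minimal sketch of this criterion check from summary statistics; the observed difference fed in below is hypothetical.

```python
import math
from statistics import NormalDist

def noninferior(diff, sd, n_per_arm, margin=0.10):
    """True when the lower bound of the two-sided 95% CI for the treatment
    difference (test minus standard) is at or above -margin."""
    se = sd * math.sqrt(2 / n_per_arm)
    lower = diff - NormalDist().inv_cdf(0.975) * se
    return lower >= -margin

# hypothetical observed difference of -0.02 L: lower bound ~ -0.089 L -> True
print(noninferior(diff=-0.02, sd=0.30, n_per_arm=145))
```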
Pre-Defined Fractions: Effect on Sample Size for Non-Inferiority

Fraction of Effect Size      Sample Size/Treatment
for Lower Bound 95% CI
1.00                          37
0.75                          64
0.67                          81
0.50                         143
0.33                         316
0.25                         567
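A sketch reproducing the table's pattern under the assumptions above (sd = 0.30 L, full effect 0.20 L, margin = fraction of the effect, 80% power, one-sided 2.5% as the lower bound of a two-sided 95% CI, drugs truly equal); the normal-approximation values land within a few subjects of the quoted column, which presumably reflects t-based or software-specific conventions.

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf
SD, EFFECT = 0.30, 0.20  # Liters, from the FEV1 example above

for frac in (1.0, 0.75, 2/3, 0.5, 1/3, 0.25):
    margin = frac * EFFECT  # non-inferiority margin as a fraction of effect
    n = 2 * ((z(0.975) + z(0.80)) * SD / margin) ** 2
    print(f"{frac:.2f}  {math.ceil(n):4d}")
# prints 36, 63, 80, 142, 318, 566 -- within a few subjects of the
# quoted 37, 64, 81, 143, 316, 567
```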
There are many other methods and criteria used to determine the lower
bound criterion for non-inferiority. However, the one-half treatment
effect criterion has been used in clinical trials, and it illustrates the
effect on sample size. In fact, this criterion results in a four-fold
increase in sample size over the test of superiority (beating the
standard drug at 80% power and 5% significance).
The important consideration is that the magnitude of the lower bound
criterion must be below a reasonable definition of the therapeutic
advantage of the standard of care.
Conclusions and Overall Thoughts for Statisticians
 The pharmaceutical industry will be reacting to increased scrutiny
of submitted data for drug approval.
o More safety data will be required (large safety trials designed
to evaluate rare adverse events)
o More efficient pivotal trials will be required to contain
enrollment costs in Phase III programs.
 Other methods of drug evaluation will evolve and work their way
into the industry
o The map of the human genome will allow more efficient
enrollment criteria through better subject identification.
o Development of more specific biomarkers will help reduce
the variability of outcome measures.