Transcript 2007
Statistical Principles for Clinical Research
Conducting Clinical Trials 2007
Sponsored by:
NIH General Clinical Research Center
Los Angeles Biomedical Research Institute at Harbor-UCLA Medical Center
November 1, 2007
Peter D. Christenson
Speaker Disclosure Statement
The speaker has no financial relationships
relevant to this presentation.
Recommended Textbook: Making Inference
Design issues
Biases
How to read papers
Meta-analyses
Dropouts
Non-mathematical
Many examples
Example: Harbor Study Protocol
18 Pages of Background and Significance,
Preliminary Studies, and Research Design and
Methods. Then:
“Pearson correlation, repeated measure of the
general linear model, ANOVA analyses and student t
tests will be used where appropriate. …
The [two] main parameters of interest will be … [A
and B. For A, using a t-test] 40 subjects provide 80%
assurance that a XX reduction … will be detected,
with p<0.05.
Similar comparisons as for … [A and B] will be
carried out …”
Example: Harbor Study Protocol
The good ….
“The [two] main parameters of interest will be … [A
and B. For A, using a t-test,] 40 subjects provide
80% assurance that a XX reduction … will be
detected, with p<0.05.”
Because:
• Explicit: Specifies primary outcome of interest.
• Explicit: Justification for # of subjects.
Example: Harbor Study Protocol
… the Bad …
“Pearson correlation, repeated measure of the
general linear model, ANOVA analyses and student t
tests will be used where appropriate. …”
Because:
• Boilerplate.
• These methods are almost always used.
• “Where appropriate”?
• Tries to satisfy reviewer, not science.
Example: Harbor Study Protocol
… and the Ugly.
“Similar comparisons as for … [A and B] will be
carried out …”
Because:
• Primary (1º) is OK: difference between 2 visits for 2 measures, A & B.
• But, 15 measures taken at each of 19 visits.
• Torture the data long enough, and it will
confess to something.
Goals of this Presentation
More good.
Less bad.
Less ugly.
Biostatistical Involvement in Studies
Off-site statistical design and analysis
Multicenter studies; data coordinating center.
In-house drug company statisticians.
CRO through NIH or drug company.
Local study contracted elsewhere
e.g. UCLA, USC, CRO.
Local protocol, and statistical design and analysis
Occasionally multicenter.
Studies with Off-Site Biostatistics
Not responsible for statistical design and analysis.
Are responsible for study conduct that may:
• … impact analysis, believability of results.
• … reduce sensitivity (power) of the study to
be able to detect effects.
Review of Basic Method
of Inference
from
Clinical Studies
Typical Study Data Analysis
Large enough “signal-to-noise ratio” → Proves an
effect beyond a reasonable doubt. Often:
Ratio = Signal / Noise = Observed Effect / (Natural Variation/√N)
For a t-test comparing two groups:
t Ratio = Difference in Means / (SD/√N)
Degree of allowable doubt → How large t needs to be.
5% (p<0.05) → |t| > ~2
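A minimal sketch of the t ratio as signal over noise. The slide's SD/√N is shorthand; this sketch assumes the standard pooled standard error for two independent groups, and the data values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def t_ratio(group_a, group_b):
    """Pooled two-sample t ratio: difference in means over its standard error."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    std_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))  # the "noise"
    return (mean(group_a) - mean(group_b)) / std_error  # signal / noise


# Hypothetical measurements, for illustration only.
treated = [5, 7, 9]
control = [1, 3, 5]
t = t_ratio(treated, control)
print(f"t = {t:.2f}")
```

The |t| > ~2 rule of thumb applies to moderately large studies; with very few subjects, as in this toy data, the required |t| is larger.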
Meaning of p-value
p-value:
Probability of a test statistic (ratio) that is at least as
deviant as was observed, if there is really no effect.
Smaller p-values ↔ more evidence of effect.
Validity of p-value interpretation typically requires:
• Proper data generation, e.g., randomness.
• Subjects provide independent information.
• Data is not used in other statistical tests.
or: an accounting for not satisfying these criteria.
→ p-values are earned by satisfying these criteria appropriately.
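One way to see the definition concretely is a permutation calculation: if there is really no effect, every relabelling of subjects into the two groups is equally likely, and the p-value is the share of relabellings at least as deviant as the one observed. This is a sketch under the assumption of random assignment and independent subjects; the data are hypothetical.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def abs_diff(sum_a):
        # |mean of A - mean of B| given the sum of the values labelled A
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = abs_diff(sum(group_a))
    # Every way of relabelling subjects is equally likely if there is no effect.
    splits = list(combinations(range(len(pooled)), n_a))
    as_deviant = sum(
        abs_diff(sum(pooled[i] for i in idx)) >= observed - 1e-12
        for idx in splits
    )
    return as_deviant / len(splits)


# Hypothetical measurements, for illustration only.
p = permutation_p_value([5, 7, 9], [1, 3, 5])
print(f"p = {p:.2f}")
```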
Analogy with Diagnostic Testing
Analogy: Effect ↔ Disease; Study Claim ↔ Diagnosis.

                      Study Claims: No Effect    Study Claims: Effect
Truth: No Effect      Correct (Specificity)      Error
Truth: True Effect    Error                      Correct (Sensitivity)

Typical:
Specificity: Set p≤0.05 → Specificity=95%.
Sensitivity: Power: Maximize. Choose N for 80%.
Study Conduct Impacting Analysis
↓ effect detectability (and ↓ratio) results from:
Non-adherence of study personnel to the protocol in
general. [Increases variation.]
Enrolling subjects who do not satisfy inclusion or
exclusion criteria. [Can decrease the observed effect.
E.g., no effect in the 10% wrongly included and a real
effect of 50% → observed effect ≈ 0.9(50%) = 45%.]
Subjects not completing entire study. [May decrease
N, or give potentially conflicting results.]
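The dilution arithmetic in the second item can be sketched as follows. The size-inflation factor is an added assumption: it takes the ratio to be proportional to effect × √N with the SD unchanged, so N must grow by the square of the effect shrinkage to keep the same ratio.

```python
def diluted_effect(true_effect, fraction_ineligible):
    """Average effect when a fraction of enrolled subjects cannot respond."""
    return (1 - fraction_ineligible) * true_effect


def size_inflation(true_effect, fraction_ineligible):
    """Factor by which N must grow to keep the same ratio, SD unchanged."""
    observed = diluted_effect(true_effect, fraction_ineligible)
    return (true_effect / observed) ** 2


observed = diluted_effect(0.50, 0.10)
inflation = size_inflation(0.50, 0.10)
print(f"observed effect = {observed:.0%}, N inflation = {inflation:.2f}x")
```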
Potentially Conflicting Results
Example: Subjects not completing the entire study.
Tigabine Study Results: How Believable?
[Table of study results not reproduced; three results are labeled 1, 2, 3.]
Conclusions differ depending on how non-completing
subjects (24%) are handled in the analysis.
Primary analysis here is specified, but we would prefer
robustness to the method of analysis (agreement),
which is more likely with more completing subjects.
Study Conduct Impacting Analysis
Intention-to-Treat (ITT)
ITT typically specifies that all subjects are included
in analysis, regardless of treatment compliance or
whether lost to follow-up.
Purposes: Avoid bias from subjective exclusions or
differential exclusion between treatment groups;
sometimes argued to mimic non-compliance in real
world setting.
More emphasis on policy implications of societal
effectiveness than on scientific efficacy.
Not appropriate for many studies.
Lost to follow-up:
Always minimize; no “real world” analogy as for
treatment compliance.
Need to define outcomes for non-completing subjects.
Current Harbor study:
The planned N≈1200 would become N≈3000 if ITT were used with
20% lost to follow-up and those lost counted as treatment failures.
ITT: Need to Impute Unknown Values
[Figure: two panels showing change from baseline for individual subjects at baseline, intermediate visit, and final visit.]
LOCF (last observation carried forward), applied to observations: ignores presumed progression.
LRCF (last rank carried forward), applied to ranks: maintains expected relative progression.
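A minimal sketch of LOCF, assuming each subject's visits are stored in order with None marking a missed visit. The subject labels and values are hypothetical.

```python
def locf(visits):
    """Replace each missing value (None) with the last observed value."""
    filled, last = [], None
    for value in visits:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Hypothetical changes from baseline: baseline, intermediate visit, final visit.
subjects = {
    "completer": [0.0, -2.0, -5.0],
    "lost after intermediate visit": [0.0, -3.0, None],
}
for label, visits in subjects.items():
    print(label, locf(visits))
```

The carried-forward final value for the second subject equals the intermediate value, which is how LOCF ignores any progression that would have occurred after the last observed visit.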
Study Conduct Impacting Feasibility
Potential Effects of Slow Enrollment
• Needed N may be impossible → Study stopped.
• Competitive site enrollment → Local financial loss.
• Insufficient person-years (PY) of observation for
some studies, even if N is attained:
[Figure: # of subjects enrolled (up to N) vs. year (0 to 2); area under the enrollment curve = PY.
Planned enrollment: detects effect = Δ.
Slower enrollment: detects effect = 1.1Δ.
Slower yet: detects effect = 1.7Δ.]
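A sketch of the person-years idea: each subject contributes follow-up from enrollment until the end of the study, so the same N gives fewer PY when enrollment is slower. The even-pace enrollment and the numbers used are hypothetical, not those behind the figure.

```python
def person_years(enrollment_times, study_end):
    """Total follow-up: each subject contributes time from enrollment to study end."""
    return sum(max(study_end - t, 0.0) for t in enrollment_times)


def uniform_enrollment(n, enrollment_period):
    """n subjects enrolled at an even pace over the enrollment period (years)."""
    return [enrollment_period * (i + 0.5) / n for i in range(n)]


study_end = 2.0  # years, matching the 0 to 2 year axis; other numbers are hypothetical
planned = person_years(uniform_enrollment(100, 1.0), study_end)
slower = person_years(uniform_enrollment(100, 2.0), study_end)
print(f"planned: {planned:.0f} PY, slower: {slower:.0f} PY")
```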
Biostatistical Involvement in Studies
Off-site statistical design and analysis
Multicenter studies; data coordinating center.
In-house drug company statisticians.
By CRO through NIH or drug company.
Local study contracted elsewhere
e.g. UCLA, USC, CRO
Local protocol, and statistical design and analysis
Occasionally multicenter.
Local Protocols and Data Analysis
1. Develop protocol and data analysis plan.
2. Have randomization and blinding strategy, if
study requires.
3. Data management.
4. Perform data analyses.
Local Data Analysis Resources
Biostatistician:
Peter Christenson.
Develop study design, analysis plan.
Advise throughout for any study.
Perform all non-basic analyses.
Full responsibility for studies with funded %FTE.
Review some protocols for committees.
Data Management:
Database development for GCRC studies by
database manager.
Statistical Components of Protocols
• Target population / source of subjects.
• Quantification of aims, hypotheses.
• Case definitions, endpoints quantified.
• Randomization plan, if any.
• Masking, if used.
• Study size: screen, enroll, complete.
• Use of data from non-completers.
• Justification of study size (power, precision, other).
• Methods of analysis.
• Mid-study analyses.
Selected
Statistical Components
and Issues
Case Definitions and Endpoints
• Primary case definitions and endpoints need
careful thought.
• Will need to report results based on these.
Example: Study at Harbor
Definition of cure very strict.
Analyzed data with this definition.
Cure rates too low - would not be taken seriously.
Scientific method → need to report them; otherwise
cherry-picking.
Publication: Use primary definition; explain; also report
with secondary definition. Less credible.
Randomization
• Helps assure attributability of treatment effects.
• Blocked randomization assures approximate
chronologic equality of numbers of subjects in each
treatment group.
• Recruiters must not have access to randomization
list.
• List can be created with a random number
generator in software, printed tables in stat texts,
or even shuffled slips of paper.
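A minimal sketch of a blocked randomization list made with a software random number generator. The block size of 4, the treatment labels, and the seed are assumptions for illustration; in practice the list would be generated and held by someone other than the recruiters.

```python
import random


def blocked_randomization(n_subjects, treatments=("A", "B"), block_size=4, seed=None):
    """Assignment list in which every complete block is balanced across treatments."""
    if block_size % len(treatments) != 0:
        raise ValueError("block_size must be a multiple of the number of treatments")
    rng = random.Random(seed)
    per_block = block_size // len(treatments)
    assignments = []
    while len(assignments) < n_subjects:
        block = list(treatments) * per_block
        rng.shuffle(block)  # random order within the block
        assignments.extend(block)
    return assignments[:n_subjects]


# Hypothetical 12-subject list; the seed only makes the example reproducible.
schedule = blocked_randomization(12, seed=2007)
print(schedule)
```

Because each block of 4 contains two of each treatment, group sizes never differ by more than 2 at any point in enrollment.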
Non-completing Subjects
• Enrolled subjects are never “dropouts”.
• Protocol should specify:
– Primary analysis set (e.g., ITT or per-protocol).
– How final values will be assigned to non-completers.
• Time-to-event (survival analysis) studies may
not need final assignments; use time followed.
• Study size estimates should incorporate the
number of expected non-completers.
Study Size: Power
Power = Probability of detecting real effects of a
specified minimal (clinically relevant) magnitude
• Power will be different for each outcome.
• Power depends on the statistical method.
• Five factors including power are inter-related.
Fixing four of these specifies the fifth:
– Study size
– Heterogeneity among subjects (SD)
– Magnitude of treatment effect to be detected
– Power to detect this magnitude of effect
– Acceptable chance of false positive conclusion,
usually 0.05
Free Study Size Software
www.stat.uiowa.edu/~rlenth/Power
Free Study Size Software: Example
Pilot data: SD=8.19 in 36 subjects.
We propose N=40 subjects/group in order to provide
80% power to detect (p<0.05) an effect Δ of 5.2:
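As a rough cross-check of these numbers, here is a minimal sketch using a normal approximation for a two-sided comparison of two group means. It is not the method of the software above; a t-based calculation typically gives a slightly larger N.

```python
from math import ceil, sqrt
from statistics import NormalDist


def power_two_sample(delta, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison of means."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = delta / (sd * sqrt(2 / n_per_group))  # expected ratio if effect = delta
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def subjects_per_group(delta, sd, power=0.80, alpha=0.05):
    """Approximate subjects per group needed for the requested power."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * (z_sum * sd / delta) ** 2)


power_40 = power_two_sample(delta=5.2, sd=8.19, n_per_group=40)
needed = subjects_per_group(delta=5.2, sd=8.19)
print(f"power with N=40/group: {power_40:.2f}")
print(f"N/group for 80% power (normal approximation): {needed}")
```

Changing any one of SD, Δ, power, or the false positive rate in the calls above shows how the five factors trade off against study size.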
Study Size: May Not Be Based on Power
Precision refers to how well a measure is estimated.
Margin of error = the ± value (half-width) of the 95%
confidence interval.
Smaller margin of error ←→ greater precision.
To achieve a specified margin of error, solve the CI
formula for N.
Polls: N ≈ 1000→ margin of error on % ≈ 1/√N ≈ 3%.
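The poll rule of thumb can be sketched from the usual 95% confidence interval for a percentage. The sketch assumes simple random sampling and the worst case of a true percentage near 50%, where the margin is 1.96 × 0.5/√N ≈ 1/√N.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96


def margin_of_error(n, p=0.5):
    """Half-width of the 95% CI for a proportion p estimated from n subjects."""
    return Z95 * sqrt(p * (1 - p) / n)


def n_for_margin(margin, p=0.5):
    """Solve the CI formula for N to achieve a specified margin of error."""
    return ceil((Z95 / margin) ** 2 * p * (1 - p))


poll_margin = margin_of_error(1000)
poll_n = n_for_margin(0.03)
print(f"N=1000 gives a margin of {poll_margin:.1%}")
print(f"a 3% margin needs N={poll_n}")
```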
Pilot Studies, Phase I, Some Phase II: Power not
relevant; may have a goal of obtaining an SD for
future studies.
Mid-Study Analyses
• Mid-study comparisons should not be made
before study completion unless planned for
(interim analyses). Early comparisons are
unstable, and can invalidate final comparisons.
• Interim analyses are planned comparisons at
specific times, usually by an unmasked advisory
board. They allow stopping the study early due
to very dramatic effects, and final comparisons,
if study continues, are adjusted to validly
account for “peeking”.
[Figure: observed effect (vertical axis, 0 marked) vs. number of subjects enrolled over time, annotated “Too many analyses” and “Wrong early conclusion”.]
Need to monitor, but also account for many analyses
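A small simulation sketch of why unplanned peeking needs to be accounted for. It assumes a deliberately simplified setting: no true effect, a known SD of 1, a z-type ratio checked at equally spaced looks, and the study stopped at the first look with p<0.05. The number of looks, study size, and seed are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(looks, n_total=100, n_trials=2000, alpha=0.05, seed=1):
    """Share of no-effect studies declared 'significant' at any of the looks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    look_sizes = {round(n_total * (k + 1) / looks) for k in range(looks)}
    hits = 0
    for _ in range(n_trials):
        running_sum = 0.0
        for n in range(1, n_total + 1):
            running_sum += rng.gauss(0.0, 1.0)  # true effect is zero, SD = 1
            if n in look_sizes and abs(running_sum / sqrt(n)) > z_crit:
                hits += 1
                break  # study stopped and an effect claimed
    return hits / n_trials


one_look = false_positive_rate(looks=1)
five_looks = false_positive_rate(looks=5)
print(f"final analysis only: {one_look:.3f}")
print(f"five unadjusted looks: {five_looks:.3f}")
```

With a single final analysis the false positive rate should be near the nominal 5%; with five unadjusted looks it is noticeably higher, which is the inflation that planned interim analyses adjust for.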
Mid-Study Analyses
• Mid-study reassessment of study size is advised
for long studies. Only standard deviations to
date, not effects themselves, are used to assess
original design assumptions.
• Feasibility analysis:
– may use the assessment noted above to
decide whether to continue the study.
– may measure effects, like interim analyses, by
unmasked advisors, to project ahead on the
likelihood of finding effects at the planned end
of study.
Examples: Studies at Harbor
Randomized; not masked; data available to PI.
Compared treatment groups repeatedly, as more
subjects were enrolled.
Study 1: Groups do not differ; plan to add more
subjects.
Consequence → final p-value not valid; probability
requires no prior knowledge of effect.
Study 2: Groups differ significantly; plan to stop study.
Consequence → use of this p-value not valid; the
probability requires incorporating later comparison.
Multiple Analyses at Study End
[Figure: false positive conclusions from torturing data with subgroup analyses.]
Replacing “Subgroup” with “Analysis” gives a similar problem.
Lagakos NEJM 354(16):1667-1669.
Multiple Analyses at Study End
• There are formal methods to incorporate the
number of multiple analyses.
• Bonferroni
• Tukey
• Dunnett
• Transparency of what was done is most
important.
• Should be aware of number of analyses and
report it with any conclusions.
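A sketch of the simplest of these adjustments, Bonferroni, alongside the chance of at least one false positive when nothing is adjusted. The second function assumes independent tests and no real effects, which is an idealization; the 15 × 19 case echoes the 15 measures at 19 visits in the earlier protocol example.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold that keeps the overall false positive rate <= alpha."""
    return alpha / n_tests


def chance_of_any_false_positive(alpha, n_tests):
    """Chance of at least one p < alpha; assumes independent tests, no real effects."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 10, 15 * 19):
    unadjusted = chance_of_any_false_positive(0.05, n_tests)
    threshold = bonferroni_threshold(0.05, n_tests)
    print(f"{n_tests:4d} tests: P(any false positive) = {unadjusted:.3f}, "
          f"Bonferroni threshold = {threshold:.5f}")
```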
Summary:
Bad Science That May Seem So Good
1. Re-examining data, or using many outcomes,
seeming to be performing due diligence.
2. Adding subjects to a study that is showing
marginal effects; or, stopping early due to strong
results.
3. Examining effects in subgroups. See NEJM 2006
354(16):1667-1669.
Actually bad? Could be negligent NOT to do these,
but need to account for doing them.
Statistical Software
Professional Statistics Software Package
Output.
Stored data; accessible.
Enter code; syntax.
Microsoft Excel for Statistics
• Primarily for descriptive statistics.
• Limited output.
Almost Free On-Line Statistics Software
www.statcrunch.com
Run from browser; not local.
$5 per 6 months of usage.
Potential HIPAA concerns.
Supported by NSF.
Typical Statistics Software Package
Select Methods from Menus
www.ncss.com
www.minitab.com
www.stata.com
$100 - $500
Output after menu selection
Data in spreadsheet
This and other biostat talks are posted at http://gcrc.labiomed.org/biostat
Conclusions
Don’t put off slow enrollment; find the cause; solve it.
I am available.
Do put off analyses of efficacy, not of design
assumptions.
I am available.
P-values are earned, by following methods which are
needed for them to be valid.
I am available.
You may have to pay for lack of attention to protocol
decisions, to satisfy the scientific method.
I am available.
Software always takes more time than expected.
Thank You
Nils Simonson, in Furberg & Furberg, Evaluating Clinical Research