Response vs. Remission - Oregon Psychiatric Association
Critical Evaluation of Clinical Trial Data
Erick Turner, M.D.
Oregon Health & Science University
Dept of Psychiatry; Dept of Pharmacology
Portland VA Medical Center
Mood Disorders Center
Disclosure
No trade names, advertising, or product-group messages
Recovering promotional speaker
– Last “slip” was in fall of 2005
Objectives
Things to watch for in evaluating medical information
Heighten your level of skepticism and paranoia
May or may not apply to today’s talks
More about clinical trials in general, esp. industry-sponsored
Studies Presented Today
CATIE
STAR*D
STEP-BD
BOLDER
The A*C*R*O*N*Y*M Study
Effect of Acronym Name
Doubled the citation rate
Independent of study size, quality, outcome
Source
– Poster: What's in a NAME?
– Peer Review Congress 2005 (AMA)
Standard Clinical Trials vs. Large Simple Clinical Trials
Signal-to-noise
Small & clean N (standard clinical trials)
Big & dirty N (large simple trials)
– “Dirt” comes out in the wash
Efficacy vs. Effectiveness
Patients: “squeaky clean” vs. “real world”
Comorbidities
– EtOH, other drugs
– Depression + anxiety
“The clinical evidence”
Whose evidence?
– Intellectual COI
• “I was right! I’ve been vindicated!”
• Attracting grant money - “the Midas touch”
Which evidence?
– Available evidence-based medicine
Selective Publication
Nonsignificant studies tend not to get published
Some studies never see the light of day
Among studies that are published
– Selective presentation of endpoints within those studies
– “Outcome reporting bias”
Why the Need for Selective Publication?
Unimpressive effect sizes in psychiatry
Many NS antidepressant trials
– 47/92 (51%) active tx arms NS
• Khan 2003 Neuropsychopharm
• Later-approved drugs and dosages
“The Emperor’s New Drugs”
80% of drug effect duplicated by placebo
2-point difference between drug and placebo
– HAMD-17-item max = 50 points
– 21-item max = 62 points
Kirsch I. Prevention & Treatment, Volume 5, Article 23, posted July 15, 2002
There Must Be 50 Ways . . .
…to put lipstick on a pig
Splice the Y-Axis
[Figure: Mania Rating Scale scores for divalproex (Depakote), lithium, and placebo over days 0-21 on protocol, plotted with a spliced y-axis; *p < 0.05]
(Bowden et al, JAMA, 271:12, March 1994)
Show Change from Baseline (not Absolute) Scores
[Figure: Mania Rating Scale score, change from baseline, by study day (0-21)]
(Keck et al, Am J. Psychiatry, 160:4, April 2003)
Non-Psychiatric Example
Graph in PDR shows change scores
Same numbers replotted as absolute scores
Don’t Show Variability in Data
Noise in data
– random variability
– Interindividual differences
• Perhaps your patient isn’t “Mr. Mean”
Showing just means can be misleading
– Liquid N2
Prefer error bars (or even raw data points)
But how much/little overlap do you want the error bars to show?
Have it Your Way
– Small: standard error
– Medium: confidence interval (95%)
– Large: standard deviation
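A minimal sketch (Python, hypothetical scores) of how the three options rank by size: for any n above about 4, the standard error is smallest, the 95% confidence-interval half-width sits in the middle, and the standard deviation is largest.

    import statistics

    scores = [12, 15, 9, 20, 14, 11, 17, 13]   # hypothetical change scores
    n = len(scores)
    sd = statistics.stdev(scores)              # standard deviation (large bars)
    se = sd / n ** 0.5                         # standard error (small bars)
    ci_half = 1.96 * se                        # 95% CI half-width (medium bars)
    print(f"SE {se:.2f} < CI {ci_half:.2f} < SD {sd:.2f}")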
Overpower Your Study
Unnecessarily large N
Clinically insignificant result becomes statistically significant
Candidate A vs. Candidate B
Effect of the Number of Voters
The split:

Total No. Voters    P value         News Headline
1,000               0.95            tie
10,000              0.84            tie
100,000             0.53            tie
1,000,000           0.046 (<.05)    A wins
10,000,000          <.0001          A wins by a landslide!!
Disclaimer: Assumes that popular vote matters
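The slide never shows the split itself, but under a plain normal-approximation z-test a 50.1% vs. 49.9% split reproduces every p value in the table, so this Python sketch assumes that split:

    from math import erf, sqrt

    def two_sided_p(k, n):
        # z-test of H0 "the race is a tie" (true share = 0.5), normal approx.
        z = (k - n / 2) / sqrt(n / 4)
        return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

    for n in (1_000, 10_000, 100_000, 1_000_000, 10_000_000):
        k = round(0.501 * n)        # Candidate A's votes at the assumed split
        print(f"N = {n:>10,}   p = {two_sided_p(k, n):.4f}")

The split is identical in every row; only N changes, and the headline flips from “tie” to “landslide.”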
Limitation of P Values
P values confounded by sample size
– Clinically insignificant difference can be statistically very significant
P values tell about precision
– how likely the observed difference could have occurred by chance
Clinicians and pts also interested in magnitude of effect
– Effect size
– Confidence intervals
– Reading: Jacob Cohen, “The Earth Is Round (p < .05)”
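A minimal sketch (Python, hypothetical group scores) of reporting magnitude alongside precision: Cohen’s d plus a rough 95% confidence interval for the mean difference.

    import statistics
    from math import sqrt

    drug    = [14, 12, 17, 10, 15, 13, 16, 11]   # hypothetical HAMD decreases
    placebo = [12, 10, 15,  9, 13, 11, 14, 10]

    diff = statistics.mean(drug) - statistics.mean(placebo)
    sp = sqrt((statistics.variance(drug) + statistics.variance(placebo)) / 2)
    d = diff / sp                      # Cohen's d (pooled SD, equal group sizes)
    se = sp * sqrt(2 / len(drug))      # SE of the mean difference
    lo, hi = diff - 1.96 * se, diff + 1.96 * se
    print(f"diff = {diff:.2f} points, d = {d:.2f}, 95% CI ~ ({lo:.2f}, {hi:.2f})")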
Underpowered Studies
Could have clinically significant difference
N too small to reach statistical significance
Michael Jordan free-throw shootout
MJ vs. ET -- 7 free throws each
MJ makes 7, I make 3
P = .07 (NS, Fisher Exact test)
Conclusions
– There was “no difference” between us.
– I’m as good as Michael Jordan!
Vickers A, Medscape 2006. Michael Jordan Won’t Accept the Null Hypothesis: Notes on Interpreting High P Values
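The slide’s p value is easy to verify; a quick sketch (Python, assuming scipy is available):

    from scipy.stats import fisher_exact

    # 2x2 table: rows = shooter, columns = (made, missed)
    table = [[7, 0],    # Michael Jordan: 7 of 7
             [3, 4]]    # the speaker: 3 of 7
    _, p = fisher_exact(table)
    print(f"p = {p:.2f}")   # ~0.07: "not significant," yet hardly equality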
Lack of a significant difference does not mean equality!
If it’s not black, it’s not necessarily white, either… could be gray
Study could be underpowered
Beware claims of equivalence
But what if Ns are adequate?
Claims of Equivalence
Example: Two drugs performed “the same”.
Were both medications really equally effective?
Or were they equally ineffective?
St. John’s Wort vs. Sertraline
[Figure: HAM-D scores for hypericum and sertraline over study weeks 0-8]
Mean decrease = 47% for Zoloft (vs. 38% for hypericum), p = .06
JAMA Apr 10, 2002 -- Vol 287, No. 14, 1807-1814
. . . and with Placebo in the Picture
[Figure: HAM-D scores for hypericum, sertraline, and placebo over study weeks 0-8]

Comparison      p
Hyp vs. Pbo     .59
Ser vs. Pbo     .18
Ser vs. Hyp     .06
St. John’s Wort vs. Sertraline
Analysis of the other primary efficacy endpoint
[Figure: % full responders: hypericum 24%, sertraline 25%; p = .99, chi-squared test, Yates corrected]
. . . with Placebo in the Picture
[Figure: % full responders with placebo included: placebo 32%, hypericum 24%, sertraline 25%]
Comparative Claims
FDA leery
– …of equivalence claims
– …of superiority claims
FDA does not allow them in labeling (package insert) or advertising
How a sponsor can engineer an advantage:
Efficacy advantage
– Underdose the competing drug
Safety advantage
– Dose the competing drug too high and/or too fast
Transitivity
Am J Psychiatry 163:185-194, February 2006
Consider the Source
RESULTS: Of the 42 reports identified by the authors, 33 were sponsored by a pharmaceutical company. In 90.0% of the studies, the reported overall outcome was in favor of the sponsor’s drug. This pattern resulted in contradictory conclusions across studies when the findings of studies of the same drugs but with different sponsors were compared.
Beware the Comparison to Nothing!
Open-label study: pts know what they are getting
– Voice alteration in VNS trials
Often single-arm w/ no placebo control
Anyone ever seen an open-label study in which pts did not get better compared to baseline?
(How do they get published?)
Single-Blind Studies
A step above open-label in rigor
Investigators know what tx the study pt is getting
Examples:
– Acupuncture studies
– Many device studies (e.g. rTMS)
The Problem with Single-Blind Studies: Clever Hans
Use Lots of Scales
Don’t Put All Your Eggs in One Basket
Observer-based
– MADRS
– CGI
• CGI-I (improvement)
• CGI-S (severity)
– HAMD in all its flavors
• 17-item
• 21-item
• 28-item
• 33-item
Self-report
– BDI (Beck)
– QIDS-SR (STAR*D)
– Quality of life scales
Pros and Cons of Many Scales
The upside of multiple endpoints:
– Internal replication
– Robustness (vs. fragile finding)
The downside
– Increased probability of chance finding
– Multiplicity, aka multiple comparisons
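How fast that downside bites: with k independent endpoints each tested at alpha = .05, the chance of at least one spurious finding is 1 - 0.95^k. A simplified sketch (real endpoints are correlated, so the true inflation is somewhat smaller):

    # P(at least one false positive) across k independent tests at alpha = .05
    for k in (1, 2, 5, 10, 20):
        print(f"{k:>2} endpoints: P(chance finding) = {1 - 0.95 ** k:.2f}")

Twenty endpoints give roughly a 64% chance of a “significant” result even when every true drug effect is zero.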
Put Enough Monkeys at Enough Typewriters . . .
…and sooner or later you’ll have the complete works of William Shakespeare
Multiple Subscales
From the HAMD 33-item, you also get . . .
– 28-item
– 21-item
– 17-item
– 6-item (“core items”)
Anxiety subscale of the HAMD
Depression subscale of the PANSS
But was it in the original protocol?
What Can You Do With All These Scales?
Continuous measure
– Use each score as-is (absolute score)
– Change from baseline
Transform into categorical measure
– Cutoffs: patients fall either above or below
– Remitters
– Responders
Responders
Just “responders”
– >= 50% decrease from baseline
• Ex. baseline score 40 -> endpoint score 20
– < 50% ==> “nonresponder”
• Ex. baseline score 40 -> endpoint score 21
Gradations of responders
– Partial responders (25-50% decrease from baseline)
– Full responders (>50% decrease)
Remitters
“Remission” usually = absolute score (HAMD < 8)
STAR*D defines remission as 75% decrease from baseline
Advantage: a set threshold deemed clinically significant
But % remitters may still differ between groups only to an extent that is just barely statistically significant (remember the “election” slide)
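A minimal sketch (Python) pulling these categories together; the thresholds follow the two slides above, including the slide’s STAR*D-style 75% criterion, and the function name is mine:

    def classify(baseline, endpoint):
        """Label one patient from baseline and endpoint HAMD scores."""
        pct_drop = 100 * (baseline - endpoint) / baseline
        return {
            "responder":           pct_drop >= 50,       # >= 50% decrease
            "partial responder":   25 <= pct_drop < 50,  # 25-50% decrease
            "remitter (HAMD < 8)": endpoint < 8,         # usual absolute cutoff
            "remitter (75% drop)": pct_drop >= 75,       # STAR*D-style, per slide
        }

    print(classify(40, 20))   # responder: exactly 50% decrease
    print(classify(40, 21))   # nonresponder: 47.5% decrease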
Handling Dropouts
LOCF
– last observation carried forward
OC
– Observed cases
– aka completers
MMRM
– Mixed model repeated measures
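A minimal sketch of LOCF, the simplest of the three (hypothetical weekly scores; None marks visits missed after dropout):

    def locf(scores):
        """Carry the last non-missing observation forward."""
        filled, last = [], None
        for s in scores:
            if s is not None:
                last = s
            filled.append(last)
        return filled

    print(locf([28, 24, 21, None, None]))   # -> [28, 24, 21, 21, 21]

Note the built-in bias: the dropout is frozen at the last observed score, which can flatter or penalize a treatment depending on when and why patients leave.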
HARKing
Hypothesizing After the Results are Known
A priori vs. post hoc
How the FDA Guards Against This
FDA gets protocol before study begins
Sponsors can’t “censor” studies that don’t go well
Drugs approved based on all studies
It’s the Protocol, Stupid!
“If the Devil is in the Details, Salvation is in the Protocol”
– Talk by Paul Andreason, FDA
Primary endpoints
– a priori hypothesis
– Where you’re placing your bet
Secondary endpoints
– Exploratory
– If you make it, fine, but don’t make a big deal about it.
– Repeat study, designate it as primary, see if it replicates
Off-Label Use
Drug used for something FDA has not approved it for
(FDA does not regulate prescribing)
Often appropriate to prescribe off-label
– No approved drugs for the condition (but why not?)
– You’ve exhausted the approved drugs
Ask: why isn’t the drug approved for this condition?
– Could they have submitted an application and gotten it rejected?
– If they haven’t submitted an application, why not?
How Do You Know Whether a Drug Is FDA-Approved for the Condition You’re Treating?
Beware of sources that talk about “uses”
– AHFS Drug Information (“The Red Book”)
– Fluoxetine uses: obesity, bipolar d/o, myoclonus, cataplexy, EtOH dependence
Gabapentin has never been approved for any psych indication
Just look in the package insert or PDR
– Indications & Usage section
– More details in the Clinical Trials section
The End