Response vs. Remission - Oregon Psychiatric Association


Critical Evaluation of Clinical Trial Data
Erick Turner, M.D.
Oregon Health & Science University
Dept of Psychiatry; Dept of Pharmacology
Portland VA Medical Center
Mood Disorders Center
Disclosure
No trade names, advertising, or product-group messages
Recovering promotional speaker
– Last “slip” was in fall of 2005
Objectives
Things to watch for in evaluating medical
information
Heighten your level of skepticism and
paranoia
May or may not apply to today’s talks
More about clinical trials in general, esp.
industry-sponsored
Studies Presented Today
CATIE
STAR*D
STEP-BD
BOLDER
The A*C*R*O*N*Y*M Study
Effect of Acronym Name
 Doubled the citation rate
 Independent of study size, quality,
outcome
 Source
– Poster: What's in a NAME?
– Peer Review Congress 2005 (AMA)
Standard Clinical Trials vs.
Large Simple Clinical Trials
Signal-to-noise
Small & clean N (standard clinical trials)
Big & dirty N (large simple clinical trials)
The "dirt" "comes out in the wash"
Efficacy vs. Effectiveness
Patients: “squeaky clean” vs. “real world”
Comorbidities
– EtOH, other drugs
– Depression + anxiety
“The clinical evidence”
Whose evidence?
– Intellectual COI
• “I was right! I’ve been vindicated!”
• Attracting grant money - “the Midas touch”
Which evidence?
– Available evidence-based medicine
Selective Publication
Nonsignificant studies tend not to get
published
Some studies never see the light of day
Among studies that are published
– Selective presentation of endpoints within
those studies
– “Outcome reporting bias”
Why the Need for
Selective Publication?
Unimpressive effect sizes in psychiatry
Many NS antidepressant trials
– 47/92 (51%) active tx arms NS
• Khan 2003 Neuropsychopharm
• Later-approved drugs and dosages
“The Emperor’s New Drugs”
80% of drug effect duplicated by placebo
2-point difference between drug and
placebo
– HAMD-17-item max = 50 points
– 21-item max = 62 points
Kirsch I. Prevention & Treatment, Volume 5, Article 23, posted July 15, 2002
There Must Be 50 Ways . . .
…to put lipstick on a pig
Splice the Y-Axis
Depakote and Lithium
[Figure: Mania Rating Scale scores (0-30) for placebo, divalproex, and lithium over days 0-21 on protocol; *p < 0.05]
(Bowden et al, JAMA, 271:12, March 1994)
Show Change from Baseline
(not Absolute) Scores
[Figure: Mania Rating Scale scores over study days 0-21]
(Keck et al, Am J. Psychiatry, 160:4, April 2003)
Non-Psychiatric Example
Graph in PDR
Change scores
Same numbers
Absolute scores
Don’t Show Variability in Data
Noise in data
– random variability
– Interindividual differences
• Perhaps your patient isn’t “Mr. Mean”
Showing just means can be misleading
– Liquid N2
Prefer error bars (or even raw data points)
But how much/little overlap do you
want the error bars to show?
Have it Your Way
Small → standard error
Medium → 95% confidence interval
Large → standard deviation
Overpower Your Study
Unnecessarily large N
Clinically insignificant result → statistically significant
Candidate A vs. Candidate B
Effect of the Number of Voters
The split:

Total No. Voters    P value         News Headline
1,000               0.95            tie
10,000              0.84            tie
100,000             0.53            tie
1,000,000           0.046 (<.05)    A wins
10,000,000          <.0001          A wins by a landslide!!
Disclaimer: Assumes that popular vote matters
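The pattern in the table can be reproduced with a quick sketch. The split itself is not stated on the slide; a fixed hypothetical split of 50.1% vs. 49.9% is assumed here because it reproduces the table's p-values under a normal approximation to the binomial:

```python
import math

def two_sided_p(n_voters, share_a=0.501):
    """Two-sided p-value for H0: the true split is 50/50,
    using a normal approximation to the binomial."""
    se = math.sqrt(0.25 / n_voters)      # standard error of a proportion under H0
    z = abs(share_a - 0.5) / se
    return math.erfc(z / math.sqrt(2))   # = 2 * (1 - Phi(z))

for n in (1_000, 10_000, 100_000, 1_000_000, 10_000_000):
    print(f"{n:>10,} voters: p = {two_sided_p(n):.4f}")
```

The split never changes; only N does, and the "tie" becomes a "landslide."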
Limitation of P Values
 P values confounded by sample size
– Clinically insignificant difference can be statistically
very significant
 P values tell about precision
– how likely it is that the observed difference could have occurred by chance
 Clinicians and pts also interested in magnitude of
effect
– Effect size
– Confidence intervals
– Reading: Jacob Cohen, "The Earth Is Round (p < .05)"
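As a sketch of what "magnitude of effect" looks like in practice, here is a minimal effect-size and confidence-interval calculation. The means, pooled SD, and per-arm N below are hypothetical, loosely in the spirit of the 2-point HAMD difference mentioned earlier; they are not from the talk:

```python
import math

def cohens_d(mean1, mean2, sd_pooled):
    """Standardized mean difference (Cohen's d)."""
    return (mean1 - mean2) / sd_pooled

def ci_95_diff(mean1, mean2, sd_pooled, n_per_arm):
    """Approximate 95% CI for the raw mean difference (equal-n arms)."""
    se = sd_pooled * math.sqrt(2 / n_per_arm)
    diff = mean1 - mean2
    return diff - 1.96 * se, diff + 1.96 * se

# Hypothetical numbers: drug endpoint mean 10, placebo 12 on the HAMD,
# pooled SD 8, 200 patients per arm
print(round(abs(cohens_d(10.0, 12.0, 8.0)), 2))   # 0.25, a "small" effect
lo, hi = ci_95_diff(12.0, 10.0, 8.0, 200)
print(f"95% CI for the 2-point difference: {lo:.1f} to {hi:.1f}")
```

The interval and d convey what the bare p-value hides: how big the difference is, and how precisely it is estimated.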
Underpowered Studies
Could have clinically significant difference
N too small to reach statistical significance
Michael Jordan free-throw shootout
 MJ vs. ET -- 7 free throws each
 MJ makes 7, I make 3
 P = .07 (NS, Fisher Exact test)
 Conclusions
– There was “no difference” between us.
– I’m as good as Michael Jordan!
Vickers A, Medscape 2006. Michael Jordan Won't Accept the Null Hypothesis:
Notes on Interpreting High P Values
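The p = .07 figure can be checked with a small stdlib-only sketch of the Fisher exact test (the helper name is my own):

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]:
    sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one."""
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    total = comb(n, row1)

    def prob(k):  # probability that the top-left cell equals k
        return comb(col1, k) * comb(n - col1, row1 - k) / total

    p_obs = prob(a)
    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs + 1e-12)

# MJ makes 7 of 7, the speaker makes 3 of 7
print(round(fisher_exact_2x2(7, 0, 3, 4), 2))  # → 0.07
```

With only 7 shots each, even a 7-vs-3 gap fails to reach p < .05; the test is underpowered, not the players equal.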
Lack of a significant difference does
not mean equality!
If it’s not black, it’s not necessarily white,
either… could be gray
Study could be underpowered
Beware claims of equivalence
But what if Ns are adequate?
Claims of Equivalence
Example: Two drugs performed “the
same”.
Were both medications really equally
effective?
Or were they equally ineffective?
St. John’s Wort vs. Sertraline
[Figure: HAM-D scores (0-30) for hypericum and sertraline over study weeks 0-8. Mean decrease = 47% for sertraline vs. 38% for hypericum; p = .06]
JAMA Apr 10, 2002 -- Vol 287, No. 14, 1807-1814
. . . and with Placebo in the Picture
[Figure: HAM-D scores (0-30) for hypericum, sertraline, and placebo over study weeks 0-8]

Comparison     p
Hyp vs. Pbo    .59
Ser vs. Pbo    .18
Ser vs. Hyp    .06
St. John’s Wort vs. Sertraline
Analysis of other primary efficacy endpoint
[Figure: % full responders: hypericum 24%, sertraline 25%; p = .99, chi-squared test (Yates corrected)]
. . . with Placebo in the Picture
[Figure: % full responders: hypericum 24%, sertraline 25%, placebo 32%]
Comparative Claims
 FDA leery
– …of equivalence claims
– …of superiority claims
 FDA does not allow them in labeling (package
insert, advertising)
 Efficacy advantage
– Underdose competing drug
 Safety advantage
– Dose competing drug too high and/or too fast
Transitivity
Am J Psychiatry 163:185-194, February 2006
Consider the Source
RESULTS: Of the 42 reports identified by
the authors, 33 were sponsored by a
pharmaceutical company. In 90.0% of the
studies, the reported overall outcome was
in favor of the sponsor’s drug. This pattern
resulted in contradictory conclusions
across studies when the findings of studies
of the same drugs but with different
sponsors were compared.
Beware the Comparison
to Nothing!
 Open-label study - pts know what they are
getting
– Voice alteration in VNS trials
 Often single-arm w/ no placebo control
 Anyone ever seen an open-label study in which
pts did not get better compared to baseline?
 (How do they get published?)
Single-Blind Studies
A step above open-label in rigor
Investigators know what tx the study pt is
getting
Examples:
– Acupuncture studies
– Many device studies (e.g. rTMS)
The Problem with Single-Blind Studies:
Clever Hans
Use Lots of Scales
Don’t Put All Your Eggs in One Basket
 Observer-based
– MADRS
– CGI
• CGI-I (improvement)
• CGI-S (severity)
– HAMD in all its
flavors
• 17-item
• 21-item
• 28-item
• 33-item
 Self-report
– BDI (Beck)
– QIDS-SR (STAR*D)
– Quality of life scales
Pros and Cons of Many Scales
The upside of multiple endpoints:
– Internal replication
– Robustness (vs. fragile finding)
The downside
– Increased probability of chance finding
– Multiplicity, aka multiple comparisons
Put Enough Monkeys at Enough
Typewriters . . .
…and sooner or
later you’ll
have the
complete works
of William
Shakespeare
Multiple Subscales
 HAMD 33-item, you also get . . .
– 28-item
– 21-item
– 17-item
– 6-item ("core" items)
 Anxiety subscale of the HAMD
 Depression subscale of the PANSS
 But was it in the original protocol?
What Can You Do
With All These Scales?
 Continuous measure
– Use each score as-is (absolute score)
– Change from baseline
 Transform into categorical measure
– Cutoffs → patients either above or below
– Remitters
– Responders
Responders
 Just “responders”
– ≥50% decrease from baseline
• Ex.: baseline score 40 → endpoint score 20
– <50% decrease → "nonresponder"
• Ex.: baseline score 40 → endpoint score 21
 Gradations of responders
– Partial responders (25-50% decrease from baseline)
– Full responders (>50% decrease)
Remitters
 “Remission” usually = absolute score (HAMD
< 8)
 STAR*D defines remission as 75% decrease
from baseline
 Advantage - set threshold deemed clinically
significant
 But the % of remitters may still differ between
groups only to an extent that is barely statistically
significant (remember the "election" slide)
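The response/remission cutoffs above can be sketched as a small classifier. The cutoffs (≥50% drop = responder, 25-50% = partial, absolute HAMD < 8 = remitter) come from the slides; the function itself is an illustration, not a validated instrument:

```python
def classify(baseline, endpoint, remission_cutoff=8):
    """Categorize a patient from baseline and endpoint depression scores,
    using the cutoffs described on the slides."""
    pct_drop = (baseline - endpoint) / baseline
    labels = []
    if endpoint < remission_cutoff:      # absolute-score criterion
        labels.append("remitter")
    if pct_drop >= 0.5:                  # change-from-baseline criteria
        labels.append("responder")
    elif pct_drop >= 0.25:
        labels.append("partial responder")
    else:
        labels.append("nonresponder")
    return labels

print(classify(40, 20))  # 50% drop → ['responder']
print(classify(40, 21))  # 47.5% drop → ['partial responder']
print(classify(40, 6))   # → ['remitter', 'responder']
```

Note how the 40 → 20 vs. 40 → 21 patients land in different categories despite a 1-point difference: dichotomizing a continuous measure is itself a choice worth scrutinizing.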
Handling Dropouts
LOCF
– last observation carried forward
OC
– Observed cases
– aka completers
MMRM
– Mixed model repeated measures
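A minimal sketch of how LOCF differs from an observed-cases (completers) analysis, on a toy patient record (the function name is my own):

```python
def locf(series):
    """Last observation carried forward: fill missing visits (None)
    with the most recent observed value."""
    out, last = [], None
    for v in series:
        last = v if v is not None else last
        out.append(last)
    return out

# Toy patient: HAMD observed at weeks 0-2, dropped out before weeks 3-4
visits = [28, 22, 18, None, None]
print(locf(visits))  # [28, 22, 18, 18, 18]: the week-2 score becomes the endpoint

# An observed-cases ("completers") analysis would simply drop this patient
completers = [p for p in [visits] if None not in p]
print(len(completers))  # 0: the dropout contributes nothing
```

The two approaches can tell different stories about the same data, which is why the choice of dropout handling belongs in the protocol, not in the analysis afterward.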
HARKing
Hypothesizing
After the
Results are
Known
 A priori vs. post hoc
How the FDA
Guards Against This
FDA gets protocol before study begins
Sponsors can’t “censor” studies that don’t
go well
Drugs approved based on all studies
It’s the Protocol, Stupid!
 “If the Devil is in the Details, Salvation is in the
Protocol”
– Talk by Paul Andreason, FDA
 Primary endpoints
– a priori hypothesis
– Where you’re placing your bet
 Secondary endpoints
– Exploratory
– If you make it, fine, but don’t make a big deal about it.
– Repeat study, designate it as primary, see if it replicates
Off-Label Use
 Drug used for something FDA has not approved
it for
 (FDA does not regulate prescribing)
 Often appropriate to prescribe off-label
– No approved drugs for condition (but why not?)
– You’ve exhausted approved drugs
 Ask why isn’t drug approved for this condition?
– Could they have submitted and gotten it rejected?
– If they haven’t submitted an application, why not?
How Do You Know Whether a Drug Is FDA-Approved for the Condition You're Treating?
Beware of sources that talk about “uses”
– AHFS Drug Information (“The Red Book”)
– Fluoxetine uses: obesity, bipolar d/o, myoclonus,
cataplexy, EtOH dependence
 Gabapentin has never been approved for any psych
indication
 Just look in the package insert or PDR
– Indications & Usage section
– More details in Clinical Trials section
The End