Measurement Challenges
in Addiction
Treatment Research
Michael L. Dennis, Ph.D.
Chestnut Health Systems, Bloomington, IL
Presentation at the International Conference on Outcome Measurement,
September 11, 2008, Bethesda, MD. This presentation supported by National
Institute on Drug Abuse (NIDA) grant no R37 DA11323 and Center for
Substance Abuse Treatment (CSAT), Substance Abuse and Mental Health
Services Administration (SAMHSA) contract 270-07-019. The opinions are
those of the author and do not reflect official positions of the consortium or
government. Available online at www.chestnut.org/LI/Posters or by contacting
Joan Unsicker at 720 West Chestnut, Bloomington, IL 61701, phone: (309) 827-6026, fax: (309) 829-4661, e-Mail: [email protected]
Objectives are to...
Examine why more traditional clinical trials type
researchers need to care about measurement
Provide explicit practical examples of how
addressing measurement in Addiction Research can
help improve it
Since the early 1960s, Jacob Cohen and
colleagues have suggested that clinical trials
research should:
Focus on statistical power, which is
- the probability of finding what you are looking for given that it is there
Combine data from multiple clinical trials into meta analyses, which can be used
- as a more stable estimate of truth
- to evaluate the accuracy of our early estimates and how methods can be improved
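To make the definition of power concrete, here is a minimal sketch of how it can be approximated for a two-group comparison. It assumes a two-sided z-test with equal group sizes and a standardized effect size (Cohen's d). The function name and the example values are illustrative and are not taken from the presentation.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    d is the true standardized mean difference (Cohen's d) and
    n_per_group is the sample size in each of the two groups.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)   # critical value for the test
    shift = d * sqrt(n_per_group / 2)   # expected value of the test statistic
    # Probability that the statistic falls in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Illustrative values only: a medium effect with 64 people per group.
print(round(power_two_sample(0.5, 64), 3))
```

With no true effect (d = 0) the function returns alpha, which is the false positive rate rather than power in any useful sense.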
In a review of over 200 meta analyses of
medical, social and legal studies published
between 1960 and 1990, Lipsey consistently found
Less than a third of the individual articles coded even mentioned
- the statistical power of their core contrast
- the reliability, validity, or sensitivity of their outcome measure
Relative to the final effect size estimated from the meta analysis, the studies averaged less than 50% power
- in other words, it was more accurate to flip a coin than to use statistical tests the way they were being used “on average” in the published literature
Movement to Improve the Methodological
Quality of Clinical Trials Research
In 1993 a group of 30 experts (medical journal editors,
clinical trialists, epidemiologists, and methodologists)
met in Ottawa to try to identify methodological gaps in
the literature
In 1996 this growing group issued the Consolidated
Standards of Reporting Trials (CONSORT;
www.consort-statement.org)
Since 2000, NIH has required Data and Safety Monitoring Boards (DSMBs) on all Phase 3
and multi-site Phase 2 studies (Notice OD-00-38),
which also push CONSORT
Today virtually every major medical, psychiatric,
psychological, criminological, and social journal has
signed onto CONSORT
Basic ways to increase power
Increase sample size
Increase observations
(While the most common approach, these are also the most expensive and logistically difficult to do)
Target a higher severity/less heterogeneous
sample
Increase implementation
Reduce measurement error
Reduce unexplained variance (which may be
systematic)
More accurately model error and unexplained
variance in analysis
Today’s focus
Observed Effect Size as a function of “True” effect size (Cohen’s d) and reliability of dependent variable
[Figure: observed effect size (observed d, 0.00 to 0.90) plotted against reliability of the dependent variable (0.10 to 1.00, where 1.00 means no measurement error), with one line per true effect size: d=.2, d=.4, d=.8.]
“Observed” effect size goes down with lower reliability
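The shrinkage shown in this figure can be sketched with the classical test theory attenuation formula, in which measurement error inflates the observed standard deviation and so shrinks the standardized difference by the square root of the reliability. This is an assumption: the slides do not state which formula was used to draw the figure, so the plotted values may differ from what this sketch produces.

```python
from math import sqrt


def observed_d(true_d, reliability):
    """Observed Cohen's d under classical attenuation (an assumed model).

    Observed variance = true variance / reliability, so the observed
    standardized difference is true_d * sqrt(reliability).
    """
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    return true_d * sqrt(reliability)


# Tabulate the three true effect sizes shown in the figure.
for true_d in (0.2, 0.4, 0.8):
    row = [round(observed_d(true_d, r / 10), 2) for r in range(1, 11)]
    print(f"d={true_d}: {row}")
```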
Sample size required for 80% power as a function of “True” effect size (Cohen’s d) and reliability of dependent variable
[Figure: n per group for 80% power (0 to 1000) plotted against reliability of the dependent variable (0.10 to 1.00), with one line per true effect size: d=.2, d=.4, d=.8.]
A reliability of .7 doubles sample size requirements
Increasing reliability from .4 to .7 cuts sample size requirements by over 50%
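The link between reliability and required sample size can be sketched as follows. The sketch assumes a two-sided, two-sample z-test approximation and the classical attenuation model (observed d = true d times the square root of reliability); neither is stated in the slides. Under that model, required n grows in proportion to 1/reliability. The callouts above describe a steeper penalty (a reliability of .7 doubling n), so the figure was presumably generated under a different attenuation model that the transcript does not spell out. The function names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(observed_d, power=0.80, alpha=0.05):
    """Sample size per group for a two-sided, two-sample z-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / observed_d) ** 2)


def n_for_reliability(true_d, reliability, power=0.80, alpha=0.05):
    """Required n when the outcome has the given reliability.

    Assumes classical attenuation: observed d = true d * sqrt(reliability).
    """
    return n_per_group(true_d * sqrt(reliability), power, alpha)


# Compare a perfectly reliable outcome with less reliable ones.
for reliability in (1.0, 0.7, 0.4):
    print(reliability, n_for_reliability(0.2, reliability))
```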
Impact of Comprehensive Data Collection Protocol Certification on Measurement Issues
Measure                                      Cohen's d \a
Proportion of Inconsistencies (100%)*        -0.39
Duration (in Minutes)*                       -0.25
Denial/Misrepresentation (Staff Rating)*     -0.24
Context Effect (Staff Report)                -0.10
Proportion of Missing Data (100%)            -0.04
Atypicalness (Outfit in Logits)              -0.03
Randomness (Infit in Logits)                 -0.03
\a Cohen's d = (Post Certification - Pre Certification)/Pooled STD
* p<.05
Source: GAIN coordinating center
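The effect size in the footnote, (Post Certification - Pre Certification)/Pooled STD, can be computed as in the sketch below. The data are made up for illustration and are not GAIN coordinating center data. The pooled standard deviation uses the usual degrees-of-freedom weighting, which is an assumption about how the slide's values were pooled.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(pre, post):
    """Cohen's d for post minus pre, divided by the pooled standard deviation."""
    n_pre, n_post = len(pre), len(post)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_pre - 1) * variance(pre) + (n_post - 1) * variance(post)) / (
        n_pre + n_post - 2
    )
    return (mean(post) - mean(pre)) / sqrt(pooled_var)


# Hypothetical interview durations (minutes) before and after certification.
pre_certification = [120, 135, 150, 110, 140]
post_certification = [105, 125, 130, 100, 128]
print(round(cohens_d(pre_certification, post_certification), 2))
```

A negative value means the measure went down after certification, which for inconsistencies, duration, and denial/misrepresentation is the desired direction.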
Staff Experience Matters as well
[Figure: Z-score (-0.80 to 0.60; Good <-- Z-Score --> Bad) plotted against staff experience (0 to 100), with one line each for Inconsistencies, Missing, Randomness, Atypicalness, Duration, and Denial/Misrep.]
Major improvement over the first 15 interviews
Most improvements have occurred by 60 interviews
Source: GAIN coordinating center
Key Advantages of Creating Scales
and Indices for Clinical Research
One of the lowest cost ways to reduce measurement
error and increase statistical power
Reduce clinical omissions and backtracking for validity
checks
Increase conceptual robustness and interpretability, and
make results easier to explain to others
Facilitate profiling over a large number of items
Impact of Number of Items on Reliability (Alpha) Observed by Average Inter-item Correlation
[Figure: reliability (alpha, 0.0 to 1.0) plotted against number of items (1 to 30), with one line per average inter-item correlation (Avg Item R).]
Generally target .7 to .9
Behavioral Measures (e.g., how many days, times) have high reliability and max out around 3-5 items
Symptom counts related to a syndrome or latent construct usually max out in 5-13 items
Covert Scales (e.g., MMPI), summative indices, and other measures with low inter item R may take 30 items (or more)
Note you can also create summary measures across different sources of data
Source: Lennox et al 2006 (CFI=.98)
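The pattern in this figure follows from the standardized (Spearman-Brown) form of coefficient alpha, which depends only on the number of items and their average inter-item correlation. The sketch below assumes that standardized form; the slide does not say which variant of alpha was plotted, and the function names are illustrative.

```python
from math import ceil


def standardized_alpha(n_items, avg_r):
    """Standardized alpha from item count and average inter-item correlation."""
    return n_items * avg_r / (1 + (n_items - 1) * avg_r)


def items_needed(target_alpha, avg_r):
    """Smallest number of items reaching target_alpha (formula solved for k)."""
    return ceil(target_alpha * (1 - avg_r) / (avg_r * (1 - target_alpha)))


# Alpha climbs quickly at first and then flattens as items are added.
for k in (1, 3, 5, 13, 30):
    print(k, round(standardized_alpha(k, 0.3), 2))
```

For example, `items_needed(0.8, 0.3)` asks how many items with an average inter-item correlation of .3 it takes to reach the lower part of the .7 to .9 target band.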
Formal Measurement Models Can Be Used to
Place people along a more reliable/sensitive ruler (aka common
or latent factor)
Look at the slope/ discrimination of items (primarily 2 parameter
IRT)
Relate items in terms of their average severity
Look at the match/mismatch of people and item locations
(primarily Rasch / 1 parameter IRT)
Study real differences by primary substance, gender, race, age or
other groups
Identify potential bias at the item and test level by gender, race
or other groups
Identify atypical patterns of answers (e.g. outfit)
Identify random response patterns (e.g., infit) or less valid
response patterns
Replace missing data (whether small amounts or due to computer
adaptive testing)
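Several of the uses listed above rest on the item response function and on residual-based fit statistics. The sketch below shows the one-parameter (Rasch) and two-parameter logistic models for a yes/no item, plus the common mean-square definitions of person outfit and infit under the Rasch model. The person location, item locations, and responses are hypothetical. The slides report fit in logits, so their exact statistics may be scaled differently from these mean squares.

```python
from math import exp


def item_probability(theta, difficulty, discrimination=1.0):
    """Probability of endorsing an item.

    With discrimination fixed at 1.0 this is the Rasch (1 parameter) model;
    letting it vary by item gives the 2 parameter IRT model.
    """
    return 1.0 / (1.0 + exp(-discrimination * (theta - difficulty)))


def person_fit(theta, difficulties, responses):
    """Return (outfit, infit) mean squares for one person under the Rasch model."""
    z_squared_sum = 0.0      # sum of squared standardized residuals
    residual_sq_sum = 0.0    # sum of squared raw residuals
    variance_sum = 0.0       # sum of model variances (information)
    for difficulty, x in zip(difficulties, responses):
        p = item_probability(theta, difficulty)
        var = p * (1.0 - p)
        residual_sq = (x - p) ** 2
        z_squared_sum += residual_sq / var
        residual_sq_sum += residual_sq
        variance_sum += var
    outfit = z_squared_sum / len(responses)   # sensitive to outlying answers
    infit = residual_sq_sum / variance_sum    # information-weighted
    return outfit, infit


# Hypothetical person at theta = 0.5 answering five items of rising severity.
severities = [-1.5, -0.5, 0.0, 0.8, 1.6]
answers = [1, 1, 0, 1, 0]
print(person_fit(0.5, severities, answers))
```

Values near 1 mean the answers are about as noisy as the model expects; values well above 1 flag atypical or random response patterns.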
Impact of Item Discrimination (aka steepness of slope) on Sample Size Requirements
[Figure: n per group for 80% power (0 to 1000) plotted against average item discrimination/slope (0.5 = flat to 2.5 = steep), with one series per true effect size (number of items): d=.2 (50 items), d=.4 (10 items), d=.8 (10 items).]
16-36% reduction in sample size
IRT is generally more efficient if the items have low or varied discrimination
Rasch is generally more efficient if the items have high discrimination
Raw v Rasch v IRT Scales (my take)
Raw, Rasch and IRT scales generally correlate over .95 and
vary by less than 5% in sample size requirements/power
Raw scales are the easiest to calculate (even by hand) and get
most of the benefit. On the down side, the items are not equal, the scales rarely
help you build theory, and they require separate approaches to
handle missing data
Rasch scales focus on highly discriminating items, fitting the data to
a common measurement model that is very efficient when
comparing items and people and theories. On the down side,
they assume your focus is on building an interval ruler, that item
slopes are similar and that you want to compare subgroups of
people with each other or over time
IRT scales focus on fitting the measurement model to the data
(opposite of Rasch), explaining additional variance by adding
parameters for slope and guessing, and are particularly useful
when you have preexisting items with a wide range of
discrimination. On the down side, they are more difficult to
calculate and require multiple iterations and larger sample sizes.
Example of how scales can also be inter-related and used for validation
Structure of GAIN’s Psychopathology Measures and Validity Checks
General Individual Severity Scale (GISS)
Higher scores associated with greater dysfunction (e.g., dropping out of school, unemployment, financial problems, homelessness)
Substance Problem Scale
Substance Issues Index (SII)
Substance Abuse Scale (SAS)
Substance Dependence Scale (SDS)
Higher scores associated with alcohol and drug abuse medication (methadone, naltrexone, Antabuse, buprenorphine) and/or substance induced legal, mental health, physical health, and withdrawal problems
Behavior Complexity Scale
Inattentiveness Disorder Scale (IDS)
Hyperactivity-Impulsivity Scale (HIS)
Conduct Disorder Scale (CDS)
Higher scores associated with mental health treatment (e.g., Ritalin, Adderall, lithium), special/alternative education, school or work problems, gambling and other evidence of impulse control problems, and/or anti-social/borderline personality disorders
Internal Mental Distress Scale
Somatic Symptom Index (SSI)
Depression Symptom Scale (DSS)
Homicidal/Suicidal Thought Index (HSTI)
Anxiety/Fear Symptom Scale (AFSS)
Traumatic Distress Scale (TDS)
Higher scores associated with mental health treatment (e.g., anti depressants, serotonin reuptake inhibitors (SSRI), monoamine oxidase inhibitors (MAOI), sedatives) and/or a history of traumatic victimization, and/or high levels of stress
Crime/Violence Scale
General Conflict Tactic Scale (GCTS)
Property Crime Scale (PCS)
Interpersonal Crime Scale (ICS)
Drug Crime Scale (DCS)
Higher scores associated with arrests, detention/jail time, probation, parole, size of drug habit
Internalizing Disorder Subscale Item Calibrations when Considering Diagnoses Separately
[Figure: item calibrations in logits (-3 to 3) shown separately for the Depression, Anxiety, Trauma, and Suicidal subscales.]
Most common has narrow range of variation
Small to major gaps in measure
Internalizing Disorder Subscale Item Calibrations Considered as a Second Order Factor
[Figure: item calibrations in logits (-3 to 3) on a single common scale, with items labeled by subscale: Suicidal, Trauma, Anxiety, Somatic, Depression.]
On-Going Debates About SUD Concept
• Formal assumption that symptoms of “physiological dependence” (either tolerance or withdrawal) are markers of high severity
• Debate about whether “abuse” symptoms should be dropped, thought of as early dependence, or thought of as moderate/high severity markers that warrant treatment even in the absence of a full syndrome
• Debate about whether to treat diagnostic orphans (1-2 symptoms of dependence) as abuse or continue to ignore them
• Concern about whether the current symptoms (which were based primarily on adult data) are appropriate for use with adolescents
• Concern about the sensitivity to change
Sample Characteristics
                        Adolescents:    Young Adult:    Adults:
                        <18 (n=2474)    18-25 (n=344)   26+ (n=661)
Male                    74%             58%             47%
Caucasian               48%             54%             29%
African American        18%             27%             63%
Hispanic                12%             7%              2%
Average Age             15.6            20.2            37.3
Substance Disorder      85%             82%             90%
Internal Disorder       53%             62%             67%
External Disorder       63%             45%             37%
Crime/Violence          64%             51%             34%
Residential Tx          31%             56%             74%
Current CJ/JJ invol.    69%             74%             45%
Note: all significant, p < .01
Item Relationships Across Substances
[Figure: Rasch severity measure (-0.60 to 0.80) for each symptom, relative to the Average Item Severity (0.00).]
Withdrawal (+0.34)
Desp.PH/MH (+0.10)
Give up act. (+0.05)
Can't stop (+0.05)
Tolerance (0.00)
Loss of Control (-0.10)
Fights/troub. (0.17)
Role Failure (-0.12)
Time Cons. (-0.21)
Hazardous (-0.03)
Despite Legal (+0.10)
1st dimension explains 75% of variance (2nd explains 1.2%)
Physiological Sx: While Withdrawal is High severity, Tolerance is only Moderate
Dependence Sx: Other dependence Symptoms spread over continuum
Abuse Sx: Abuse Symptoms are also spread over continuum
Symptom Severity Varied by Drug
[Figure: Rasch severity measure (-0.60 to 0.80) for each symptom, plotted separately by substance. Legend: AVG (0.00), AMP (+0.89), OPI (+0.44), COC (-0.22), ALC (-0.44), CAN (-0.67).]
Withdrawal much less likely for CAN
Easier to endorse Withdrawal for AMP/OPI
Easier to endorse moderate Sx for COC/OPI
Easier to endorse time consuming for CAN
Easier to endorse fighting/trouble for ALC/CAN
Easier to endorse hazardous use for ALC/CAN
Easier to endorse despite legal problem for ALC/CAN
Symptom Severity Varied Even More By Age
[Figure: Rasch severity measure (-1 to 1.8) for each symptom, plotted separately by age group: <18, 18-25, 26+.]
Continued use in spite of legal problems more likely among Adol/YA
More likely to lead to fights among Adol/YA
Hazardous use more likely among Adol/YA
Adults more likely to endorse most symptoms
Rasch Severity by Past Month Status
[Figure: Rasch severity measure (-3.50 to 2.00) by past month status group: None; Diagnostic Orphan in early remission; Diagnostic Orphan; Lifetime SUD in CE 45+ days; Lifetime SUD in early remission; Abuse Only; Dependence Only; Both Abuse and Dependence.]
Diagnostic Orphans (1-2 dependence symptoms) are lower, but still overlap with other clinical groups
Severity by Past Year Symptom Count
[Figure: Rasch severity measure (-4.00 to 2.00) by past year symptom count (0 to 11).]
1. Better Gradation
2. Still a lot of overlap in range
Severity by Weighted (past month=2, past year=1) Number of Substance x SUD Symptoms
[Figure: Rasch severity measure (-4.00 to 2.00) by weighted symptom score, grouped as 0, 1-4, 5-8, 9-12, 13-16, 17-20, 21-24, 25-30, 31-40, 41+.]
1. Better Gradation
2. Much less overlap in range
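The weighted score on this slide can be sketched as a sum over every substance-by-symptom pair, with past-month symptoms counted twice and other past-year symptoms counted once. The data layout, key names, and example record below are hypothetical. Only the weights and the score groupings come from the slide.

```python
# Weights from the slide: past month = 2, earlier in the past year = 1, else 0.
RECENCY_WEIGHTS = {"past_month": 2, "2_to_12_months_ago": 1}

# Score groupings used on the slide's horizontal axis.
SCORE_BINS = [
    (0, 0, "0"), (1, 4, "1-4"), (5, 8, "5-8"), (9, 12, "9-12"),
    (13, 16, "13-16"), (17, 20, "17-20"), (21, 24, "21-24"),
    (25, 30, "25-30"), (31, 40, "31-40"),
]


def weighted_symptom_score(endorsements):
    """Sum recency weights over all (substance, symptom) pairs.

    endorsements maps (substance, symptom) to a recency label; any label
    not in RECENCY_WEIGHTS (for example "never") contributes 0.
    """
    return sum(RECENCY_WEIGHTS.get(recency, 0) for recency in endorsements.values())


def score_bin(score):
    """Return the slide's grouping label for a weighted score."""
    for low, high, label in SCORE_BINS:
        if low <= score <= high:
            return label
    return "41+"


# Hypothetical respondent (not real data).
example = {
    ("ALC", "hazardous use"): "past_month",
    ("ALC", "role failure"): "2_to_12_months_ago",
    ("CAN", "withdrawal"): "never",
    ("CAN", "time consuming"): "past_month",
}
score = weighted_symptom_score(example)
print(score, score_bin(score))
```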
Construct Validity (i.e., does it matter?)
[Figure: values for three ways of scoring SUD severity across the five validation measures named in the figure (Emotional Problems, Recovery Environment, Social Risk, Past Week Withdrawal, Frequency Of Use). Values are listed in the figure's column order.]
DSM diagnosis \a:             0.47  0.40  0.32  0.30  0.30
Symptom Count Continuous \b:  0.48  0.43  0.39  0.32  0.31
Weighted Symptom Rasch \c:    0.57  0.46  0.39  0.39  0.32
\a Categorized as past year physiological dependence, non-physiological dependence, abuse, other
\b Raw past year symptom count (0-11)
\c Symptoms weighted by recency (2=past month, 1=2-12 months ago, 0=other)
Past year symptom count did better than DSM
Rasch does a little better still
Implications for SUD Concept
“Tolerance” is not a good marker of high severity;
withdrawal (and substance induced health problems) are
“Abuse” symptoms are consistent with the overall syndrome
and represent moderate severity or “other reasons to treat in
the absence of the full blown syndrome”
Diagnostic orphans are lower severity, but relevant
Pattern of symptoms varies by substance and age, but all
symptoms are relevant
“Adolescents” experienced the same range of symptoms,
though they (and young adults) were particularly more likely
to be involved with the law, use in hazardous situations, and
to get into fights at lower severity
Symptom Counts appear to be more useful than the current
DSM approach to categorizing severity
While weighting by recency & drug delineated severity, it did
not improve construct validity