DIF detection using OLR - University of California, Davis
DIF detection using (Ordinal) Logistic Regression
Laura Gibbons, PhD
Paul K. Crane, MD MPH
Internal Medicine
University of Washington
Outline
• Brief statistical background
• DIFdetect package
• What do we do when we find DIF?
• New, simpler, faster solutions!
• Discussion
Statistical background
• Recall the definition of DIF: when a demographic characteristic interferes with the expected relationship between ability level and responses to an item
• A conditional definition; we have to control for ability level, or else we can’t differentiate between DIF and differential test impact
The 2 Parameter Logistic model
• logit P(Y=1 | a, b, θ) = D·a(θ − b)
  – Produces an item characteristic curve
  – Models the probability that a person correctly responds to an item given the item parameters (a, b) and their person-level θ
  – D is a constant
  – a, b notation is reversed from biomedical conventions
The 2 PL model
• logit P(Y=1 | a, b, θ) = D·a(θ − b)
  – b is the item difficulty
    • When θ = b, there is a 50% probability of getting the item correct
  – a is the item discrimination
    • a determines the slope around the point where θ = b
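A quick numeric illustration of the formula in Stata (a sketch only; the scaling constant D = 1.7 and the example values a = 1, b = 0 are assumptions chosen for illustration, not values from the talk):

  * 2PL response probability: P = invlogit(D*a*(theta - b))
  * Assumed example values: D = 1.7, a = 1, b = 0
  display invlogit(1.7*1*( 0 - 0))   // theta = b      -> 0.50
  display invlogit(1.7*1*( 1 - 0))   // theta = b + 1  -> about 0.85
  display invlogit(1.7*1*(-1 - 0))   // theta = b - 1  -> about 0.15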
Modest Uniform DIF
Item characteristic curves for "Close your eyes"
in Spanish and English speakers
1
0.5
0
-3
-2
-1
0
1
2
3
Non-Uniform DIF
[Figure: item category characteristic curves for the item “ability to walk 1 block,” separately in African-Americans (yellow lines) and whites; probability of endorsing (0 to 1) plotted against physical functioning from −3 to 3]
Uniform and Non-uniform DIF
[Figure: item characteristic curves for "Repeating Phrase" in English and Spanish speakers; probability (0 to 1) plotted against θ from −3 to 3]
Logistic regression applied to DIF detection
• Swaminathan and Rogers (1990)
• Tested two models:
  – P(Y=1 | X, group) = f(β1X + β2*group + β3*X*group)
  – P(Y=1 | X) = f(β1X)
• Compared the −2 log likelihoods of these two models to a chi-squared distribution with 2 df
• Uniform and non-uniform DIF tested at the same time (sketched below)
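A minimal Stata sketch of this simultaneous 2 df test (the variable names y, score, and group are hypothetical; score stands in for the observed total score X):

  * Model with uniform and non-uniform DIF terms
  logit y c.score i.group c.score#i.group
  estimates store full
  * Model with ability only
  logit y c.score
  estimates store reduced
  * Likelihood ratio test of both DIF terms at once (2 df)
  lrtest full reduced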
Camilli and Shepard (1994)
• Recommended a two-step procedure: first test for non-uniform DIF, then for uniform DIF (see the sketch after this list)
  – P(Y=1 | X, group) = f(β1X + β2*group + β3*X*group)
  – P(Y=1 | X, group) = f(β1X + β2*group)
  – P(Y=1 | X) = f(β1X)
• The −2 log likelihoods of each pair of models are compared to determine non-uniform DIF and uniform DIF in two separate steps
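A Stata sketch of the two-step comparison (again with hypothetical variable names); each step is a 1 df likelihood ratio test:

  * Step 1: non-uniform DIF (interaction term)
  logit y c.score i.group c.score#i.group
  estimates store m3
  logit y c.score i.group
  estimates store m2
  lrtest m3 m2     // 1 df test of the interaction
  * Step 2: uniform DIF (group main effect)
  logit y c.score
  estimates store m1
  lrtest m2 m1     // 1 df test of the group term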
Millsap and Everson (1994)
• Dismissive of “observed score” techniques such as logistic regression
• X contains several items that have DIF, so adjusting for X is theoretically problematic
• Advocated latent approaches such as IRT for DIF detection
Zumbo (1999)
• Extended the Swaminathan and Rogers framework to the ordinal logistic regression case to handle polytomous items
• Did not address the latent trait; also used a single step rather than two steps
Crane, van Belle, Larson (2004)
• The logistic regression model is a reparameterization of the IRT model, as long as IRT-derived θ estimates are used as the ability scores
• Raised the issue of multiple hypothesis testing of non-uniform DIF
Crane et al. (2004) – 2
• Biggest change was in the specific criteria for uniform DIF
  – Recognized that non-uniform and uniform DIF are analogous to effect modification and confounding, respectively
  – Employed epidemiological thinking about how to detect confounding relationships from the data: size of effect
Crane et al. (2004) – 3
• Same models used (though now θ, not X)
  – P(Y=1 | θ, group) = f(β1θ + β2*group)
  – P(Y=1 | θ) = f(β1'θ)
• Determine the impact of including the group term on the magnitude of the relationship between θ and item responses
• Determine the size of |(β1 − β1')/β1|. If this is large, uniform DIF (confounding) is present (sketched below)
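A minimal Stata sketch of this uniform DIF check (a sketch only; the variable names item, theta, and group are hypothetical, and theta stands for the IRT-derived ability estimate):

  * Model with the group term: the theta coefficient is beta1
  ologit item c.theta i.group
  local b1 = _b[theta]
  * Model without the group term: the theta coefficient is beta1'
  ologit item c.theta
  local b1prime = _b[theta]
  * Proportional change |(beta1 - beta1')/beta1|; a large value suggests uniform DIF
  display abs((`b1' - `b1prime')/`b1')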
Work still pending
• “Optimal” criteria for uniform and non-uniform DIF are unknown
  – Adjust α for multiple hypotheses?
  – Effect size for non-uniform DIF? In huge data sets, we are likely to have a significant interaction term.
  – What proportional change in β1 is meaningful uniform DIF?
Also under investigation
• What is the role of model fit statistics? For example, if non-uniform DIF is present, the model with group and ability only should not fit.
• How important is the proportional odds / graded response assumption? Should stereotype or other models be used in some instances?
DIFdetect package
• Can be downloaded from the web
• www.alz.washington.edu/DIFDETECT/welcome.html
• Stata-based, user-friendly package
For those who tire of clicking
• difd varlist, ABility(str) GRoups(str)
  [with lots of optional specifications]
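As a usage illustration (the item and variable names here are hypothetical, not from the talk), the capital letters in the syntax above mark the shortest abbreviations Stata will accept:

  * Test items item1-item20 for DIF by language, conditioning on theta
  difd item1-item20, ab(theta) gr(language)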
Outline revisited
Brief statistical background
DIFdetect package
• What do we do when we find DIF?
• New, simpler, faster solutions!
• Discussion
What to do when we find DIF?
• In educational settings, items with DIF are often discarded
• An unattractive option for us
  – Tests are too short as it is; we lose variation
  – We lose precision
  – DIF doesn’t mean that the item doesn’t measure the underlying construct at all, just that it does so differently in different groups
What do we do – 2
• Need a technique to incorporate items found to have DIF differently than DIF-free items
• Precedent for this approach in Reise, Widaman, and Pugh (1993)
  – Constrain parameters for DIF-free items to be identical across groups
  – Estimate parameters for items found with DIF separately in the appropriate groups
Compensatory DIF
• Compensatory DIF occurs when DIF in some items leads to erroneous findings in other items
  – Both false-positive and false-negative DIF findings
Adjust ability for DIF
1. Rearrange the data to estimate a DIF-adjusted theta score in PARSCALE
2. Use that new theta estimate to evaluate for compensatory DIF
• Repeat steps 1 and 2 until the same items are identified each time = no compensatory DIF
Rearrange data for PARSCALE
Item                          Population A    Population B
DIF-free item 1               Present         Present
DIF-free item 2               Present         Present
DIF-free item 3               Present         Present
… DIF-free item (n−1)         …               …
DIF item n (Population A)     Present         Missing
DIF item n (Population B)     Missing         Present
DIF item n+1 (Population A)   Present         Missing
DIF item n+1 (Population B)   Missing         Present
Modified data set
(X = missing response; each DIF item has been split into group-specific virtual items)
0001 12XX2
0002 12XX4
0003 01XX3
…
0132 1X2X2
0133 0X1X3
0134 1X2X4
…
0932 0XX22
0933 1XX23
0934 0XX14
…
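A minimal Stata sketch of this rearrangement (variable names such as item4 and group are hypothetical), creating group-specific "virtual" copies of a DIF item that are missing for the other group:

  * Split a DIF item into two virtual items, one scored per group
  gen item4_A = item4 if group == 0   // missing for group B members
  gen item4_B = item4 if group == 1   // missing for group A members
  drop item4
  * DIF-free items stay as single columns shared across both groups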
New tools!
1. difforpar itemlist, ID(id) RUnname(test0) ABility(ability0) GRoups(group)
   [with lots of optional specifications]
   Look at the log for lack of convergence, dropped variables, nonsense output, and other warnings.
New tools, continued
2. Run PARSCALE with code_test0.psl
3. Run thetain: thetain origdata origid test0
   [merges thetatest0 and sethetatest0 into the original data set]
The process continues
• Repeat steps 1–3 with the new thetas until the same items come up with DIF
• For short lists, you can read the log file
• For long lists, examine vars_testN.txt
• When finished, you can check Difd.dta for model fit and assumptions
Adjusting for additional groups
• mergevirtual origdata originalid
  [merges itemdata (containing the final virtual items) into the original data set]
• Run DIFforPar with the next group, using the new list of original and virtual items (can copy from vars_testN.txt), and do it all again!
Other tools for Stata
• PrePar: writes code and data for PARSCALE
  – Syntax: prepar namelist, ID(str) ru()
• DIFforSRZ: do-file for DIFdetect using the SRZ one-step criteria
  – Syntax: run difforsrz abil ru
  – Set the variable list, group, and criteria in the do-file
Coming soon
• DIFforPar extended for grouping variables with more than 2 categories; continuous in Stata, grouped in PARSCALE
• Samemetric.ado (for now, use: prepardata itemlist, ID(str) RUnname(str))
Have we adjusted for DIF / controlled for confounding?
• Can only adjust for measured covariates
• Confounders such as education level may mean different things for different groups
• Unmeasured confounders
• May lack power, or data may be too sparse
Adjusted cognitive ability scores
• So far, our adjusted scores correlate highly with non-adjusted scores
• They may contain additional information
• Language DIF
Questions and comments