Transcript zwick_DIF
Empirical Bayes DIF Assessment
Rebecca Zwick, UC Santa Barbara
Presented at Measured Progress
August 2007
Overview
Definition and causes of DIF
Assessing DIF via Mantel-Haenszel
EB enhancement to MH DIF (1994-2002, with
D. Thayer & C. Lewis)
Model and Applications
Simulation findings
Discussion
What’s differential item functioning?
DIF occurs when equally skilled members of 2
groups have different probabilities of answering
an item correctly.
(Only dichotomous items considered today)
IRT Definition of (absence of) DIF
Lord, 1980: P(Yi = 1 | θ, R) = P(Yi = 1 | θ, F)
means DIF is absent.
P(Yi = 1 | θ, G) is the probability of a correct
response to item i, given θ, in group G,
G = F (focal) or R (Reference).
θ is a latent ability variable, imperfectly
measured by test score S. (More later...)
Reasons for DIF
“Construct-irrelevant difficulty” (e.g., sports
content in a math item)
Differential interests or educational background:
NAEP History items with DIF favoring Black test-takers were about M. L. King, Harriet Tubman, and the
Underground Railroad (Zwick & Ercikan, 1989)
Often mystifying (e.g., “X + 5 = 10” has DIF; “Y +
8 = 11” doesn’t)
Mini-history of DIF analysis:
DIF research dates back to the 1960s
In the late 1980s (“Golden Rule”), testing
companies started including DIF analysis as a
QC procedure.
Mantel-Haenszel (Holland & Thayer, 1988):
method of choice for operational DIF analyses
Few assumptions
No complex estimation procedures
Easy to explain
Mantel-Haenszel:
Compare item performance for members of 2
groups, after matching on total test score, S.
Suppose we have K levels of the score used for
matching test-takers, s1, s2, …sK
In each of the K levels, data can be represented
as a 2 x 2 table
(Right/Wrong by
Reference/Focal).
Mantel-Haenszel
For each table, compute the conditional odds ratio:
[Odds of correct response | S = s_k, G = R] / [Odds of correct response | S = s_k, G = F]
A weighted combination of these K values is the MH odds ratio, α̂_MH.
The MH DIF statistic is -2.35 ln(α̂_MH).
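To make the arithmetic concrete, here is a minimal Python sketch (not from the talk) of the MH common odds ratio and the -2.35 ln transformation. The 2 x 2 counts and the function name mh_ddif are illustrative only.

```python
import numpy as np

def mh_ddif(tables):
    """Mantel-Haenszel common odds ratio and MH DIF statistic.

    tables: list of (A, B, C, D) counts for each matching-score level k,
            where A = reference right, B = reference wrong,
                  C = focal right,     D = focal wrong.
    Returns (alpha_hat, d_dif), with d_dif = -2.35 * ln(alpha_hat),
    so negative values indicate DIF against the focal group.
    """
    num, den = 0.0, 0.0
    for A, B, C, D in tables:
        n_k = A + B + C + D
        if n_k == 0:
            continue  # skip empty score levels
        num += A * D / n_k
        den += B * C / n_k
    alpha_hat = num / den
    return alpha_hat, -2.35 * np.log(alpha_hat)

# Illustrative counts at three matching-score levels (hypothetical data)
tables = [(40, 10, 30, 15), (60, 20, 45, 25), (80, 15, 70, 20)]
alpha_hat, d_dif = mh_ddif(tables)
print(f"alpha_MH = {alpha_hat:.3f}, MH DIF = {d_dif:.3f}")
```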
Mantel-Haenszel
The MH chi-square tests the hypothesis
H0: α_k = α = 1, k = 1, 2, …, K
versus
H1: α_k = α ≠ 1, k = 1, 2, …, K,
where α_k is the population odds ratio at score level k.
(The above H0 is similar, but not, in general, identical to the IRT H0;
see Zwick, 1990, Journal of Educational Statistics)
Mantel-Haenszel
ETS: The size of the DIF estimate, plus the chi-square results, is
used to categorize each item:
A: negligible DIF
B: slight to moderate DIF
C: substantial DIF
For B and C, “+” or “-” used to indicate DIF direction:
“-” means DIF against focal group.
Designation determines item’s fate.
Drawbacks to usual MH approach
May give impression that DIF status is
deterministic or is a fixed property of the item
Reviewers of DIF items often ignore SE
Is unstable in small samples, which may arise
in CAT settings
EB enhancement to MH:
Provides more stable results
May allow variability of DIF findings to be
represented in a more intuitive way
Can be used in three ways
Substitute more stable point estimates for MH
Provide probabilistic perspective on true DIF status
(A, B, C) and future observed status
[Loss-function-based DIF detection]
Main Empirical Bayes DIF Work
(supported by ETS and LSAC)
An EB approach to MH DIF analysis (with Thayer &
Lewis). JEM, 1999. [General approach, probabilistic DIF]
Using loss functions for DIF detection: An EB approach
(with Thayer & Lewis). JEBS, 2000. [Loss functions]
The assessment of DIF in CATs. In van der Linden & Glas
(Eds.) CAT: Theory and Practice, 2000. [review]
Application of an EB enhancement of MH DIF analysis to a
CAT (with Thayer). APM, 2002. [simulated CAT-LSAT]
What’s an Empirical Bayes Model?
(See Casella (1985), Am. Statistician)
In Bayesian statistics, we assume that parameters have
prior distributions that describe parameter “behavior.”
Statistical theory or past research may inform us about
the nature of those distributions.
Combining observed data with the prior distribution
yields a posterior (“after the data”) distribution that
can be used to obtain improved parameter estimates.
“EB” means prior’s parameters are estimated from data
(unlike fully Bayes models).
EB DIF Model
MHi is the MH statistic for item i.
σ_i² ≡ SE²(MHi) is the squared standard
error (treated as known) of MHi.
(Sensitivity analyses revealed no problem
with the known-variance assumption.)
EB DIF Model
f(MHi | λ_i) is N(λ_i, σ_i²), where λ_i is the unknown
DIF parameter (true value of DIF).
Note: The distribution of MH is asymptotically
normal (e.g., Agresti, 1990)
EB DIF Model
Prior:
f(λ_i) is N(μ, τ²),
where μ is the across-item mean of the DIF
parameters and τ² is the across-item variance.
(No DIF implies τ² = 0.)
EB DIF Model
f(λ_i | MHi) ∝ f(MHi | λ_i) f(λ_i) = posterior
distribution of λ_i, given MHi.
A Bayes model with normal prior and normal
likelihood yields a normal posterior; the posterior
mean and variance have simple expressions
(see, e.g., Gelman et al., 1995).
EB DIF Model
Posterior mean = W_i MHi + (1 - W_i) μ,
where W_i = τ² / (τ² + σ_i²).
EB DIF statistic = estimated posterior mean,
a weighted combination of MHi and μ̂.
Posterior variance = W_i σ_i².
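A minimal sketch of these posterior formulas, assuming μ and τ² are given (their estimation is shown on the next slide); the function name eb_posterior and the numeric inputs are illustrative.

```python
def eb_posterior(mh_i, se_i, mu, tau2):
    """Posterior mean and variance of the DIF parameter for item i,
    given its MH DIF statistic, its standard error, and the prior N(mu, tau2)."""
    sigma2_i = se_i ** 2                      # squared SE, treated as known
    w_i = tau2 / (tau2 + sigma2_i)            # shrinkage weight W_i
    post_mean = w_i * mh_i + (1 - w_i) * mu   # weighted combination of MH_i and mu
    post_var = w_i * sigma2_i                 # normal-normal posterior variance
    return post_mean, post_var

# Large SE -> heavy shrinkage toward mu; small SE -> estimate stays near MH_i
print(eb_posterior(mh_i=-1.4, se_i=1.2, mu=0.0, tau2=0.5))
print(eb_posterior(mh_i=-1.4, se_i=0.3, mu=0.0, tau2=0.5))
```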
Estimation of μ and τ²
We estimated μ and τ² from the current data:
μ̂ = Average(MHi).
τ̂² = V̂ar(MHi) - Average(SE_i²(MHi)),
where V̂ar(MHi) = observed across-item
variance of the MHi statistics; i.e., τ² is
estimated by deflating V̂ar(MHi) by the average
of the estimated squared standard errors.
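A sketch of this method-of-moments step; the truncation of τ̂² at zero is a practical safeguard not stated on the slide, and the five-item arrays are hypothetical.

```python
import numpy as np

def estimate_prior(mh, se):
    """Estimate the prior mean and variance from the current data.
    mh, se: arrays of MH DIF statistics and their standard errors over items."""
    mh = np.asarray(mh, dtype=float)
    se = np.asarray(se, dtype=float)
    mu_hat = mh.mean()                                    # average MH statistic
    # Observed across-item variance, deflated by the average squared SE
    tau2_hat = max(mh.var(ddof=1) - np.mean(se ** 2), 0.0)
    return mu_hat, tau2_hat

# Hypothetical 5-item example
mu_hat, tau2_hat = estimate_prior(mh=[-0.4, 0.1, 0.6, -1.2, 0.9],
                                  se=[0.5, 0.4, 0.6, 0.9, 0.5])
print(mu_hat, tau2_hat)
```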
Recall: EB DIF estimate is a weighted
combination of MHi and prior mean.
Prior mean will be (near) 0 because MHi
values sum to (about) 0 across items when
we match on number-right score (or similar
scores).
EB DIF estimate is closer to 0 than MH.
Lots of data: little "shrinkage" to 0
Sparse data: lots of shrinkage; prior leads to
more stable estimation.
Next…
Performance of EB DIF estimator
“Probabilistic DIF” idea
How does EB DIF estimator EBi
compare to MHi?
Applied to real data, including GRE
Applied to simulated data, including simulated
CAT-LSAT (Zwick & Thayer, 2002):
Testlet CAT data simulated, including items with
varying amounts of DIF
EB and MH both used to estimate (known) True
DIF
Performance compared using RMSR, variance, and
bias measures
Design of Simulated CAT
Pool: 30 5-item testlets (150 items total)
10 Testlets at each of 3 difficulty levels
Item data generated via 3PL model
CAT algorithm was based on testlet scores
Examinees received 5 testlets (25 items)
Test score (used as DIF matching variable) was
expected true score on pool (Zwick, Thayer, &
Wingersky, 1994 APM)
Simulation Conditions Differed on
Several Factors:
Ability distribution:
Always N(0,1) in Reference group
Focal group either N(0,1) or N(-1,1)
Initial sample size per group: 1000 or 3000
DIF: Absent or Present (in amounts that vary
across items)
600 replications for results shown today
Definition of True DIF for Simulation
True DIF = -2.35 ∫ ln{ [P_iR(θ) / Q_iR(θ)] / [P_iF(θ) / Q_iF(θ)] } f(θ) dθ,
where f(θ) is the reference-group ability distribution, P_iG(θ) is the
IRF for group G, and Q_iG(θ) = 1 - P_iG(θ).
Like MH with no measurement or sampling
error
Range of True DIF: -2.3 to 2.9, SD ≈ 1.
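A numerical-integration sketch of this definition; the 3PL item parameters, the D = 1.7 scaling, and the quadrature grid are illustrative assumptions, not values from the simulation.

```python
import numpy as np
from scipy.stats import norm

def irf_3pl(theta, a, b, c):
    """3PL item response function (D = 1.7 scaling assumed)."""
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

def true_dif(a_r, b_r, c_r, a_f, b_f, c_f, n_points=401):
    """True DIF in the MH metric: -2.35 times the log odds ratio of the
    reference- and focal-group IRFs, averaged over the reference-group
    ability distribution f(theta) = N(0, 1)."""
    theta = np.linspace(-5, 5, n_points)
    w = norm.pdf(theta)
    w /= w.sum()                              # quadrature weights for f(theta)
    p_r = irf_3pl(theta, a_r, b_r, c_r)
    p_f = irf_3pl(theta, a_f, b_f, c_f)
    log_or = np.log((p_r / (1 - p_r)) / (p_f / (1 - p_f)))
    return -2.35 * np.sum(log_or * w)

# Hypothetical item: same discrimination/guessing, but 0.3 logits harder for
# the focal group at equal theta -> negative True DIF (against the focal group)
print(true_dif(a_r=1.0, b_r=0.0, c_r=0.2, a_f=1.0, b_f=0.3, c_f=0.2))
```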
Definition of Root Mean Square Residual
RMSR is average deviation, in MH metric, of
DIF estimate from True DIF.
For each item in each condition, compute
RMSR = sqrt{ (1/R) Σ_{j=1}^{R} [EstDIF(j) - TrueDIF]² },
where j indexes replications,
R = 600 is the number of replications, and
EstDIF(j) is the EB or MH value from the jth replication.
MSR = Variance + Squared Bias
MSR = RMSR² =
(1/R) Σ_{j=1}^{R} [EstDIF(j) - Avg(EstDIF)]² + [Avg(EstDIF) - TrueDIF]²
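A sketch of both formulas for one item; est_dif would hold the 600 replicate estimates (EB or MH) and true_dif the item's known True DIF.

```python
import numpy as np

def rmsr_decomposition(est_dif, true_dif):
    """RMSR and its variance / squared-bias decomposition over replications."""
    est_dif = np.asarray(est_dif, dtype=float)
    msr = np.mean((est_dif - true_dif) ** 2)             # mean squared residual
    variance = np.mean((est_dif - est_dif.mean()) ** 2)  # replicate variance
    sq_bias = (est_dif.mean() - true_dif) ** 2           # squared bias
    return np.sqrt(msr), variance, sq_bias               # msr == variance + sq_bias
```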
RMSRs for No-DIF condition,
Initial N=1000; Item N’s = 80 to 300
Summary over 150 items    EB      MH
25th %ile                 .068    .543
Median                    .072    .684
75th %ile                 .078    .769
RMSRs - 50 hard items, DIF condition, Focal N(-1,1)
Focal N’s = 16 to 67; Reference N’s = 80 to 151

Summary over 50 items     EB      MH
25th %ile                 .514    1.190
Median                    .532    1.252
75th %ile                 .558    1.322
RMSRs for DIF condition, Focal N(-1,1)
Initial N=1000; Item N’s = 16 to 307
Summary over 150 items    EB      MH
25th %ile                 .464    .585
Median                    .517    .641
75th %ile                 .560    1.190
Variance and Squared Bias for Same Condition
Initial N=1000; Item N’s = 16 to 307
Summary over 150 items    EB Variance   EB Squared Bias   MH Variance   MH Squared Bias
25th %ile                 .191          .004              .335          .000
Median                    .210          .027              .402          .002
75th %ile                 .242          .088              1.402         .013
Summary: Performance of the EB DIF Estimator
RMSRs (and variances) are smaller for EB than for
MH, especially in (1) no-DIF case and
(2) very small-sample case.
EB estimates more biased than MH; bias is toward 0.
Above findings are consistent with theory.
Implications to be discussed.
“External” Applications/Elaborations of
EB DIF Point Estimation
Defense Dept: CAT-ASVAB (Krass & Segal,
1998)
ACT: Simulated multidimensional CAT data
(Miller & Fan, NCME, 1998)
ETS: Fully Bayes DIF model (NCME, 2007) of
Sinharay et al.: like EB, but the parameters of the
prior are determined using past data (see ZTL).
Also tried a loss function approach.
Probabilistic DIF
In our model, the posterior distribution is normal,
so it is fully determined by its mean and variance.
Can use posterior distribution to infer the
probability that DIF falls into each of the ETS
categories (C-, B-, A, B+, C+), each of which
corresponds to a particular DIF magnitude.
(Statistical significance plays no role here.)
Can display graphically.
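A sketch of this calculation; the ±1.0 and ±1.5 cutoffs are the conventional ETS magnitude thresholds, assumed here, and the posterior mean/SD come from the example on the next slide (the printed values are close to, but not exactly, the pie values, since the slide presumably used unrounded inputs).

```python
from scipy.stats import norm

def category_probs(post_mean, post_sd, cut_b=1.0, cut_c=1.5):
    """Posterior probabilities that true DIF falls in each ETS category,
    using the normal posterior N(post_mean, post_sd**2).
    Assumed magnitude cutoffs: |DIF| < 1 -> A, 1 <= |DIF| < 1.5 -> B,
    |DIF| >= 1.5 -> C (sign gives the +/- direction)."""
    cdf = lambda x: norm.cdf(x, loc=post_mean, scale=post_sd)
    return {
        "C-": cdf(-cut_c),
        "B-": cdf(-cut_b) - cdf(-cut_c),
        "A":  cdf(cut_b) - cdf(-cut_b),
        "B+": cdf(cut_c) - cdf(cut_b),
        "C+": 1 - cdf(cut_c),
    }

# Posterior mean .7 and posterior SD .8, as in the example that follows
print(category_probs(0.7, 0.8))
```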
Probabilistic DIF status for an “A” item in LSAT sim.
MH = 4.7, SE = 2.2, Identified Status = C+
Posterior Mean = EBi = .7, Posterior SD = .8
N_R = 101, N_F = 23
[Pie chart: posterior probabilities C- 0%, B- 1%, A 65%, B+ 20%, C+ 14%]
Probabilistic DIF, continued
The EB approach can be used to accumulate
DIF evidence across administrations.
Prior can be modified each time an item is
given: Use former posterior distribution as
new prior (Zwick, Thayer & Lewis, 1999).
Pie chart could then be modified to reflect
new evidence about an item’s status.
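Under the normal model this is just the same conjugate update applied again, with the former posterior playing the role of the prior; a minimal sketch with hypothetical numbers:

```python
def normal_update(prior_mean, prior_var, obs, obs_var):
    """One normal-normal Bayes update (same form as the EB posterior above)."""
    w = prior_var / (prior_var + obs_var)
    return w * obs + (1 - w) * prior_mean, w * obs_var

# Administration 1: start from the across-item prior (hypothetical mu_hat, tau2_hat)
m1, v1 = normal_update(prior_mean=0.0, prior_var=0.4, obs=-1.2, obs_var=0.9 ** 2)
# Administration 2: the former posterior N(m1, v1) serves as the new prior
m2, v2 = normal_update(prior_mean=m1, prior_var=v1, obs=-0.8, obs_var=0.7 ** 2)
print((m1, v1), (m2, v2))
```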
Predicting an Item’s Future Status: The
Posterior Predictive Distribution
A variation on the above can be used to predict
future observed DIF status
Mean of posterior predictive distribution is
same as posterior mean, but variance is larger.
For details and an application to GRE items,
see Zwick, Thayer, & Lewis, 1999 JEM.
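A small sketch of the predictive idea, reusing the earlier example's posterior; the future standard error and the 1.5 cutoff (the conventional C-level magnitude, with significance ignored as above) are assumptions.

```python
from scipy.stats import norm

# Posterior for the item's true DIF (values from the earlier "A" item example)
post_mean, post_var = 0.7, 0.8 ** 2
# Assumed standard error of the MH statistic at the item's next administration
future_se = 2.0

# Posterior predictive distribution of the future observed MH statistic:
# same mean as the posterior, but variance = posterior variance + sampling variance
pred_sd = (post_var + future_se ** 2) ** 0.5

# e.g., chance the item will again *look* like a C+ item (future observed MH DIF >= 1.5)
p_future_cplus = 1 - norm.cdf(1.5, loc=post_mean, scale=pred_sd)
print(pred_sd, p_future_cplus)
```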
Discussion
EB point estimates have advantages over MH
counterparts
EB approach can be applied to non-MH DIF methods
Advisability of shrinkage estimation for DIF needs to
be considered
Reducing Type I error may yield more interpretable results
Degree of shrinkage can be fine-tuned
Probabilistic DIF displays may have value in
conveying uncertainty of DIF results.