
Empirical Bayes DIF Assessment
Rebecca Zwick, UC Santa Barbara
Presented at Measured Progress, August 2007
Overview
 Definition and causes of DIF
 Assessing DIF via Mantel-Haenszel
 EB enhancement to MH DIF (1994-2002, with
D. Thayer & C. Lewis)
 Model and Applications
 Simulation findings
 Discussion
What’s differential item functioning?
 DIF occurs when equally skilled members of 2
groups have different probabilities of answering
an item correctly.
(Only dichotomous items considered today)
IRT Definition of (absence of) DIF
 Lord, 1980: P(Yi = 1 | θ, R) = P(Yi = 1 | θ, F) means DIF is absent.
 P(Yi = 1 | θ, G) is the probability of correct response to item i, given θ, in group G, G = F (Focal) or R (Reference).
 θ is a latent ability variable, imperfectly measured by test score S. (More later...)
Reasons for DIF
 “Construct-irrelevant difficulty” (e.g., sports
content in a math item)
 Differential interests or educational background:
NAEP History items with DIF favoring Black test-takers were about M. L. King, Harriet Tubman,
Underground Railroad (Zwick & Ercikan, 1989)
 Often mystifying (e.g., “X + 5 = 10” has DIF; “Y +
8 = 11” doesn’t)
Mini-history of DIF analysis:
 DIF research dates back to the 1960s.
 In the late 1980s (“Golden Rule”), testing companies started including DIF analysis as a QC procedure.
 Mantel-Haenszel (Holland & Thayer, 1988):
method of choice for operational DIF analyses
 Few assumptions
 No complex estimation procedures
 Easy to explain
Mantel-Haenszel:
 Compare item performance for members of 2 groups, after matching on total test score, S.
 Suppose we have K levels of the score used for matching test-takers, s1, s2, …, sK.
 In each of the K levels, data can be represented as a 2 x 2 table (Right/Wrong by Reference/Focal).
Mantel-Haenszel
 For each table, compute the conditional odds ratio:
   [Odds of correct response | S = sk, G = R] / [Odds of correct response | S = sk, G = F]
 A weighted combination of these K values is the MH odds ratio estimate, α̂_MH.
 The MH DIF statistic is -2.35 ln(α̂_MH).
Mantel-Haenszel
The MH chi-square tests the hypothesis,
H0: αk = α = 1, k = 1, 2, …, K
versus
H1: αk = α ≠ 1, k = 1, 2, …, K,
where αk is the population odds ratio at score level k.
(Above H0 is similar, but not, in general, identical to the IRT H0;
see Zwick, 1990 Journal of Educational Statistics)
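For concreteness, a minimal Python sketch of the computations on the last two slides; the counts are hypothetical, and the continuity-corrected chi-square form is the standard one rather than a formula quoted from the talk:

# Sketch of the MH odds ratio, MH DIF statistic, and MH chi-square from
# K 2 x 2 tables. Each table holds counts for one matching-score level:
#   A = Reference right, B = Reference wrong, C = Focal right, D = Focal wrong.
import math

def mh_d_dif(tables):
    num = 0.0  # sum over k of A_k * D_k / N_k
    den = 0.0  # sum over k of B_k * C_k / N_k
    for A, B, C, D in tables:
        N = A + B + C + D
        num += A * D / N
        den += B * C / N
    alpha_mh = num / den               # MH common odds ratio estimate
    return -2.35 * math.log(alpha_mh)  # MH DIF statistic

def mh_chi_square(tables):
    # Standard continuity-corrected MH chi-square (assumed form)
    sum_a = sum_e = sum_v = 0.0
    for A, B, C, D in tables:
        N = A + B + C + D
        n_ref, n_foc = A + B, C + D       # group totals at this score level
        m_right, m_wrong = A + C, B + D   # right/wrong totals
        sum_a += A
        sum_e += n_ref * m_right / N      # expected Reference-right count under H0
        sum_v += n_ref * n_foc * m_right * m_wrong / (N * N * (N - 1))
    return (abs(sum_a - sum_e) - 0.5) ** 2 / sum_v  # refer to chi-square, 1 df

# Hypothetical data for three score levels:
tables = [(40, 10, 30, 20), (35, 15, 25, 25), (20, 30, 10, 40)]
print(mh_d_dif(tables), mh_chi_square(tables))  # negative DIF statistic: DIF against the Focal group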
Mantel-Haenszel
 ETS: The size of the DIF estimate, together with the chi-square results, is used to categorize each item:
 A: negligible DIF
 B: slight to moderate DIF
 C: substantial DIF
 For B and C, “+” or “-” used to indicate DIF direction:
“-” means DIF against focal group.
 Designation determines item’s fate.
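A sketch of how such a categorization rule can be coded; the 1.0 and 1.5 thresholds (in the MH DIF metric) and the significance conditions are the commonly cited ETS values, supplied here for illustration rather than taken from the slide:

# Sketch of an A/B/C classification rule; thresholds and significance tests
# are assumed (commonly cited ETS values), not quoted from the talk.
def ets_category(d, se, z=1.96):
    # d: MH DIF statistic; se: its standard error; z: normal critical value
    sig_vs_0 = abs(d) / se > z          # significantly different from 0
    sig_vs_1 = (abs(d) - 1.0) / se > z  # |DIF| significantly greater than 1
    if abs(d) >= 1.5 and sig_vs_1:
        cat = "C"                       # substantial DIF
    elif abs(d) >= 1.0 and sig_vs_0:
        cat = "B"                       # slight to moderate DIF
    else:
        cat = "A"                       # negligible DIF
    # "-" = DIF against the Focal group (negative statistic)
    return cat if cat == "A" else cat + ("-" if d < 0 else "+")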
Drawbacks to usual MH approach
 May give impression that DIF status is
deterministic or is a fixed property of the item
 Reviewers of DIF items often ignore SE
 Is unstable in small samples, which may arise
in CAT settings
EB enhancement to MH:
 Provides more stable results
 May allow variability of DIF findings to be
represented in a more intuitive way
 Can be used in three ways
 Substitute more stable point estimates for MH
 Provide probabilistic perspective on true DIF status
(A, B, C) and future observed status
 [Loss-function-based DIF detection]
Main Empirical Bayes DIF Work
(supported by ETS and LSAC)
 An EB approach to MH DIF analysis (with Thayer &
Lewis). JEM, 1999. [General approach, probabilistic DIF]
 Using loss functions for DIF detection: An EB approach
(with Thayer & Lewis). JEBS, 2000. [Loss functions]
 The assessment of DIF in CATs. In van der Linden & Glas
(Eds.) CAT: Theory and Practice, 2000. [review]
 Application of an EB enhancement of MH DIF analysis to a
CAT (with Thayer). APM, 2002. [simulated CAT-LSAT]
What’s an Empirical Bayes Model?
(See Casella (1985), Am. Statistician)
 In Bayesian statistics, we assume that parameters have
prior distributions that describe parameter “behavior.”
 Statistical theory, or past research may inform us about
the nature of those distributions.
 Combining observed data with the prior distribution
yields a posterior (“after the data”) distribution that
can be used to obtain improved parameter estimates.
 “EB” means prior’s parameters are estimated from data
(unlike fully Bayes models).
EB DIF Model
MHi is the MH statistic for item i.
σi² = SE²(MHi) is the squared standard error (treated as known) of MHi.
(Sensitivity analyses revealed no problem with the known-variance assumption.)
EB DIF Model
 f(MHi | λi) is N(λi, σi²), where λi is the unknown DIF parameter (the true value of DIF) for item i.
 Note: The distribution of MH is asymptotically normal (e.g., Agresti, 1990).
EB DIF Model
Prior:
 f(λi) is N(μ, τ²),
where μ is the across-item mean of the DIF parameters and τ² is the across-item variance.
(No DIF implies τ² = 0.)
EB DIF Model
 f(λi | MHi) ∝ f(MHi | λi) f(λi) = posterior distribution of λi, given MHi.
 A Bayes model with a normal prior and normal likelihood yields a normal posterior; the posterior mean and variance have simple expressions (see, e.g., Gelman et al., 1995).
EB DIF Model
 Posterior mean = Wi MHi + (1 - Wi) μ,
   where Wi = τ² / (τ² + σi²).
 EB DIF statistic = estimated posterior mean, a weighted combination of MHi and μ̂.
 Posterior variance = Wi σi².
Estimation of μ and τ²
We estimated μ and τ² from the current data:
 μ̂ = Average(MHi).
 τ̂² = Var(MHi) - Average(SEi²(MHi)),
where Var(MHi) is the observed across-item variance of the MHi statistics; that is, τ² is estimated by deflating the observed variance by the average of the squared standard errors.
Recall: EB DIF estimate is a weighted
combination of MHi and prior mean.
Prior mean will be (near) 0 because MHi
values sum to (about) 0 across items when
we match on number-right score (or similar
scores).
EB DIF estimate is closer to 0 than MH.
Lots of data: little "shrinkage" to 0
Sparse data: lots of shrinkage; prior leads to
more stable estimation.
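A compact Python sketch of the full EB calculation just described; the MH values and standard errors are hypothetical, and truncating the variance estimate at zero is an added safeguard rather than something stated on the slides:

# Sketch of EB shrinkage of MH DIF statistics, following the formulas above.
import statistics

def eb_dif(mh, se):
    # mh: MH DIF statistics; se: their standard errors (one pair per item)
    mu_hat = statistics.mean(mh)                   # estimated prior mean
    var_mh = statistics.pvariance(mh)              # across-item variance of MH
    avg_se2 = statistics.mean(s ** 2 for s in se)  # average squared SE
    tau2_hat = max(var_mh - avg_se2, 0.0)          # deflate; truncation at 0 is assumed
    results = []
    for m, s in zip(mh, se):
        w = tau2_hat / (tau2_hat + s ** 2)         # weight on the observed MH
        post_mean = w * m + (1 - w) * mu_hat       # EB DIF estimate
        post_var = w * s ** 2                      # posterior variance
        results.append((post_mean, post_var))
    return results

# Imprecise items (large SE) shrink strongly toward the prior mean (near 0);
# precisely estimated items barely move.
mh = [1.8, -0.4, 0.2, -2.1, 0.6]
se = [1.5, 0.3, 0.4, 1.2, 0.5]
print(eb_dif(mh, se))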
Next…
 Performance of EB DIF estimator
 “Probabilistic DIF” idea
How does EB DIF estimator EBi
compare to MHi?
 Applied to real data, including GRE
 Applied to simulated data, including simulated
CAT-LSAT (Zwick & Thayer, 2002):
 Testlet CAT data simulated, including items with
varying amounts of DIF
 EB and MH both used to estimate (known) True
DIF
 Performance compared using RMSR, variance, and
bias measures
Design of Simulated CAT
 Pool: 30 5-item testlets (150 items total)
 10 testlets at each of 3 difficulty levels
 Item data generated via 3PL model
 CAT algorithm was based on testlet scores
 Examinees received 5 testlets (25 items)
 Test score (used as DIF matching variable) was expected true score on the pool (Zwick, Thayer, & Wingersky, 1994 APM)
Simulation Conditions Differed on
Several Factors:
 Ability distribution:
 Always N(0,1) in Reference group
 Focal group either N(0,1) or N(-1,1)
 Initial sample size per group: 1000 or 3000
 DIF: Absent or Present (in amounts that vary
across items)
 600 replications for results shown today
Definition of True DIF for Simulation
True DIF = -2.35 ∫ ln{ [PiR(θ) / QiR(θ)] / [PiF(θ) / QiF(θ)] } f(θ) dθ,
where f(θ) is the Reference group ability distribution, PiG(θ) is the IRF for group G, and QiG(θ) = 1 - PiG(θ).
Like MH with no measurement or sampling error.
Range of True DIF: -2.3 to 2.9, SD ≈ 1.
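A numerical sketch of this definition; the 3PL item parameters below are hypothetical, and the 1.7 scaling constant is an assumption about the parameterization used:

# Sketch: True DIF for one item, integrating the log odds ratio over N(0,1).
import math

def p3pl(theta, a, b, c):
    # 3PL IRF with the 1.7 scaling constant (assumed parameterization)
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

def true_dif(ref_params, foc_params, n=2001, lo=-6.0, hi=6.0):
    # Numerical integration of -2.35 * ln[(P_R/Q_R)/(P_F/Q_F)] against N(0,1)
    total = weight = 0.0
    step = (hi - lo) / (n - 1)
    for i in range(n):
        theta = lo + i * step
        f = math.exp(-0.5 * theta ** 2) / math.sqrt(2 * math.pi)  # N(0,1) density
        pr = p3pl(theta, *ref_params)
        pf = p3pl(theta, *foc_params)
        log_or = math.log((pr / (1 - pr)) / (pf / (1 - pf)))
        total += log_or * f * step
        weight += f * step
    return -2.35 * total / weight  # normalize for the truncated grid

# Hypothetical item: same a and c, Focal difficulty shifted by 0.3
print(round(true_dif((1.0, 0.0, 0.2), (1.0, 0.3, 0.2)), 2))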
Definition of Root Mean Square Residual
RMSR is average deviation, in MH metric, of
DIF estimate from True DIF.
For each item in each condition, compute
RMSR = sqrt{ (1/R) Σ_{j=1..R} [EstDIF(j) - TrueDIF]² }, where
j indexes replications
R = 600 is the number of reps
Est DIF(j) is EB or MH value from jth rep
MSR = Variance + Squared Bias
MSR = RMSR² = (1/R) Σ_{j=1..R} [EstDIF(j) - Avg(EstDIF)]² + [Avg(EstDIF) - TrueDIF]²
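A short sketch of these summary measures for a single item, given its R replicate estimates and its known True DIF (the inputs would come from the simulation):

# Sketch: RMSR and its variance / squared-bias decomposition for one item.
import math
import statistics

def rmsr_decomposition(est, true_dif):
    # est: R DIF estimates (EB or MH) across replications; true_dif: known value
    R = len(est)
    avg = statistics.mean(est)
    msr = sum((e - true_dif) ** 2 for e in est) / R
    variance = sum((e - avg) ** 2 for e in est) / R
    sq_bias = (avg - true_dif) ** 2
    # msr equals variance + sq_bias, up to floating-point rounding
    return math.sqrt(msr), variance, sq_bias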
RMSRs for No-DIF condition,
Initial N=1000; Item N’s = 80 to 300
Summary over 150 items:
              EB     MH
25th %ile    .068   .543
Median       .072   .684
75th %ile    .078   .769
RMSRs - 50 hard items, DIF condition, Focal N(-1,1)
Focal N’s = 16 to 67; Reference N’s = 80 to 151
Summary over 50 items:
              EB     MH
25th %ile    .514   1.190
Median       .532   1.252
75th %ile    .558   1.322
RMSRs for DIF condition, Focal N(-1,1)
Initial N=1000; Item N’s = 16 to 307
Summary over 150 items:
              EB     MH
25th %ile    .464   .585
Median       .517   .641
75th %ile    .560   1.190
Variance and Squared Bias for Same Condition
Initial N=1000; Item N’s = 16 to 307
Summary over 150 items:
              EB Variance   EB Squared Bias   MH Variance   MH Squared Bias
25th %ile       .191            .004             .335           .000
Median          .210            .027             .402           .002
75th %ile       .242            .088            1.402           .013
Summary-Performance of
EB DIF Estimator
 RMSRs (and variances) are smaller for EB than for
MH, especially in (1) no-DIF case and
(2) very small-sample case.
 EB estimates more biased than MH; bias is toward 0.
 Above findings are consistent with theory.
 Implications to be discussed.
“External” Applications/Elaborations of
EB DIF Point Estimation
 Defense Dept: CAT-ASVAB (Krass & Segal,
1998)
 ACT: Simulated multidimensional CAT data
(Miller & Fan, NCME, 1998)
 ETS: Fully Bayes DIF model (NCME, 2007) of
Sinharay et al: Like EB, but parameters of
prior are determined using past data (see ZTL).
Also tried loss function approach.
Probabilistic DIF
 In our model, posterior distribution is normal,
so is fully determined by mean and variance.
 Can use posterior distribution to infer the
probability that DIF falls into each of the ETS
categories (C-, B-, A, B+, C+), each of which
corresponds to a particular DIF magnitude.
(Statistical significance plays no role here.)
 Can display graphically.
Probabilistic DIF status for an “A” item in LSAT sim.
MH = 4.7, SE = 2.2, Identified Status = C+
Posterior Mean = EBi = .7, Posterior SD = .8
NR = 101, NF = 23
[Pie chart of posterior probabilities for the item’s true DIF category: C- 0%, B- 1%, A 65%, B+ 20%, C+ 14%]
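A sketch of how the displayed percentages can be computed from the normal posterior; the ±1.0 and ±1.5 cut points are assumed here to correspond to the ETS category boundaries and are not stated on the slide:

# Sketch: probability that true DIF falls in each category, from the posterior.
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def category_probs(post_mean, post_sd, cuts=(-1.5, -1.0, 1.0, 1.5)):
    labels = ("C-", "B-", "A", "B+", "C+")
    edges = (-math.inf,) + cuts + (math.inf,)
    return {lab: normal_cdf((hi - post_mean) / post_sd)
                 - normal_cdf((lo - post_mean) / post_sd)
            for lab, lo, hi in zip(labels, edges[:-1], edges[1:])}

# With posterior mean .7 and SD .8 (previous slide), this gives roughly
# 0% / 1% / 63% / 20% / 16% -- close to the pie chart, with differences
# reflecting rounding of the reported posterior mean and SD.
print(category_probs(0.7, 0.8))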
Probabilistic DIF, continued
 The EB approach can be used to accumulate DIF evidence across administrations.
 Prior can be modified each time an item is
given: Use former posterior distribution as
new prior (Zwick, Thayer & Lewis, 1999).
 Pie chart could then be modified to reflect
new evidence about an item’s status.
Predicting an Item’s Future Status: The
Posterior Predictive Distribution
 A variation on the above can be used to predict
future observed DIF status
 Mean of posterior predictive distribution is
same as posterior mean, but variance is larger.
 For details and an application to GRE items,
see Zwick, Thayer, & Lewis, 1999 JEM.
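Under the normal model above, the predictive variance works out to the posterior variance plus the squared standard error of the future MH statistic; a minimal sketch of that implication (not a formula shown in the talk):

# Sketch: posterior predictive distribution of a future observed MH statistic
# under the normal EB model (same mean as the posterior, larger variance).
import math

def posterior_predictive(post_mean, post_var, se_future):
    # se_future: standard error of the future MH statistic (treated as known)
    return post_mean, math.sqrt(post_var + se_future ** 2)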
Discussion
 EB point estimates have advantages over MH
counterparts
 EB approach can be applied to non-MH DIF methods
 Advisability of shrinkage estimation for DIF needs to
be considered
 Reducing Type I error may yield more interpretable results
 Degree of shrinkage can be fine-tuned
 Probabilistic DIF displays may have value in
conveying uncertainty of DIF results.