Applications of IRT Models
DIF and CAT
Which of these situations describes a
biased test?
The average score for males and females on an
item is not the same.
The correlation between males’ scores on an
item is stronger than that for the females’
scores.
A group of males and females with exactly the
same ability achieve different scores on an
item.
Disentangling the Terminology
Item impact
Item impact is evident when examinees from different groups have differing
probabilities of responding correctly to (or endorsing) an item because there are
true differences between the groups in the underlying ability being measured by
the item.
DIF
The differential probability of a correct response for examinees at the same trait
level but from different groups.
DIF occurs when examinees from different groups show differing probabilities of
success on (or endorsing) the item after matching on the underlying ability that the item
is intended to measure.
Item bias
Item bias occurs when examinees of one group are less likely to answer an item
correctly (or endorse an item) than examinees of another group because of some
characteristic of the test item or testing situation that is not relevant to the test
purpose.
Adverse Impact
Adverse impact is a legal term describing the situation in which group differences
in test performance result in disproportionate examinee selection or related
decisions (e.g., promotion). This is not evidence for test bias.
No DIF
There are two types of DIF
Uniform DIF
One group (here, the reference group) has a higher probability
of a correct response than the focal group across the entire ability scale.
Non-uniform DIF
The group with the higher probability of a
correct response changes across different
regions of the ability scale.
Uniform DIF
Non-uniform DIF
Differential Test Functioning
[Figure: DTF Against Reference Group. Curves for the Focal and Reference groups; x-axis: Theta (-3.0 to 3.0); y-axis: Proportion Correct True Score (0.0 to 1.0).]
Relationship between IRT and CTST models
It has been shown that there is a relationship
between the two-parameter (2PL) normal ogive IRT model and the
single-factor FA model (Lord & Novick, 1968)
The b-parameter equals the item threshold
parameter divided by the item factor loading
The discrimination parameter equals the factor
loading divided by the square root of the item's uniqueness (one minus the communality)
Highly discriminating items will have high factor loadings
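The conversion between the two sets of parameters can be sketched numerically. This is a minimal illustration, assuming the usual normal ogive parameterization with a standardized loading λ and threshold τ, so that a = λ / sqrt(1 - λ²) and b = τ / λ; the function name `factor_to_irt` is hypothetical.

```python
import math

def factor_to_irt(loading, threshold):
    """Convert a standardized factor loading and threshold into
    normal-ogive discrimination (a) and difficulty (b).
    Assumes a single-factor model with unit-variance latent responses."""
    if not 0.0 < abs(loading) < 1.0:
        raise ValueError("loading must be non-zero and strictly between -1 and 1")
    uniqueness = 1.0 - loading ** 2        # one minus the communality
    a = loading / math.sqrt(uniqueness)    # discrimination
    b = threshold / loading                # difficulty
    return a, b

a, b = factor_to_irt(loading=0.6, threshold=0.3)
print(round(a, 3), round(b, 3))
```

With a loading of 0.6 the discrimination is about 0.75; raising the loading toward 1 drives the discrimination up sharply, which is the sense in which highly discriminating items have high loadings.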
Examining Measurement Invariance in CTST
Examining factorial invariance
Configural invariance
Zero and non-zero loading patterns are the same across groups
Pattern (metric) invariance
The factor loadings are equal across groups
Scalar (strong) invariance
The factor loadings and intercepts are equal across groups
Any group differences in means can be attributed to the common
factors, which allows for meaningful group mean comparisons
Strict invariance
Factor loadings, intercepts, and unique variances are equal across
groups
Any systematic differences in group means, variances, or covariances are
due to the common factors
Examining DIF in IRT
IRT tests of DIF examine whether the IRC (item response curve) is the
same for the reference group as it is for the focal group.
The focal group is the smaller group in question (the minority group).
The reference group is the larger group that generally has the established
parameters.
If the curves differ, then the probability that an individual
in one group with ability x responds correctly differs from the
probability that an individual with the same ability x in the other group
responds correctly.
DTF refers to a difference in the test characteristic curves,
obtained by summing the item response functions for each
group.
DTF is perhaps more important for selection because decisions are made
based on test scores, not individual item responses.
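As a rough illustration of DTF, the sketch below sums 2PL item response functions into a test characteristic curve for each group and takes the difference at each theta. The item parameters, the scaling constant D = 1.7, and the function names are all assumptions made for the example.

```python
import math

D = 1.7  # logistic scaling constant (assumed)

def irf(theta, a, b):
    """2PL item response function."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: the sum of the item response functions."""
    return sum(irf(theta, a, b) for a, b in items)

def dtf(theta, reference_items, focal_items):
    """Expected-score difference (reference minus focal) at a given theta."""
    return tcc(theta, reference_items) - tcc(theta, focal_items)

# Hypothetical (a, b) parameters; the second item is harder for the focal group.
reference_items = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.7)]
focal_items = [(1.0, -0.5), (1.2, 0.4), (0.8, 0.7)]

for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, round(dtf(theta, reference_items, focal_items), 3))
```

Because only one item differs here, the DTF curve is just that item's DIF; with several DIF items in opposite directions the item-level differences can cancel at the test level.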
Procedures for Detecting DIF/DTF
Parametric Procedures
Compare item parameters from two groups of examinees
Lord's Chi-Square
Likelihood Ratio Test
Compare IRFs from two groups of examinees by measuring areas between them
Raju's Area Measures
Likelihood Ratio Test
$G_j^2 = [-2\log L(\text{compact model})] - [-2\log L(\text{augmented model})]$
Distributed as a chi-square with degrees of freedom
equal to the difference in the number of parameters
estimated in the compact and the augmented model
The compact model assumes item parameters are the same
for both groups
The augmented model constrains anchor items to be equal,
but allows items of interest to have parameters that vary
across groups
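The likelihood ratio comparison can be sketched as follows, assuming the two log-likelihoods and parameter counts have already been obtained from separately fitted compact and augmented models. The numbers below are invented for illustration, and the chi-square tail probability is computed with a small standard-library helper instead of a statistics package.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    nu = 1 if df % 2 else 2
    q = math.erfc(math.sqrt(half)) if nu == 1 else math.exp(-half)
    while nu < df:
        # Recurrence: Q(x | nu + 2) = Q(x | nu) + (x/2)^(nu/2) e^(-x/2) / Gamma(nu/2 + 1)
        q += half ** (nu / 2.0) * math.exp(-half) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return q

def lr_dif_test(loglik_compact, n_par_compact, loglik_augmented, n_par_augmented):
    """G^2 = -2 log L(compact) - (-2 log L(augmented)), with its df and p-value."""
    g2 = -2.0 * loglik_compact + 2.0 * loglik_augmented
    df = n_par_augmented - n_par_compact
    return g2, df, chi2_sf(g2, df)

# Hypothetical fit results: freeing a and b for one studied item adds 2 parameters.
g2, df, p = lr_dif_test(-5230.4, 40, -5226.1, 42)
print(round(g2, 2), df, round(p, 4))
```

A small p-value suggests that letting the studied item's parameters differ across groups improves fit, which is read as evidence of DIF for that item.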
Raju’s Area Measures
Signed and unsigned areas
Indicates the area between two IRCs
Requires separate calibrations of the item parameters in each group, then
use a linear transformation to put them on the same scale
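Signed area $= b_2 - b_1$
Unsigned area $= |b_2 - b_1|$ (when $a_1 = a_2$)
Unsigned area $= \left| \dfrac{2(a_2 - a_1)}{D a_1 a_2} \ln\left[ 1 + \exp\left( \dfrac{D a_1 a_2 (b_2 - b_1)}{a_2 - a_1} \right) \right] - (b_2 - b_1) \right|$ (when $a_1 \neq a_2$)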
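A minimal sketch of the closed-form area measures for two 2PL item response curves already placed on a common scale, with a brute-force numerical integral as a sanity check. The symbols a and b for discrimination and difficulty, D = 1.7, and all function names are assumptions for the example.

```python
import math

D = 1.7  # logistic scaling constant (assumed)

def irf(theta, a, b):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def softplus(x):
    """Numerically stable ln(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def signed_area(a1, b1, a2, b2):
    return b2 - b1

def unsigned_area(a1, b1, a2, b2):
    if a1 == a2:
        return abs(b2 - b1)
    k = D * a1 * a2 * (b2 - b1) / (a2 - a1)
    return abs(2.0 * (a2 - a1) / (D * a1 * a2) * softplus(k) - (b2 - b1))

def numeric_unsigned_area(a1, b1, a2, b2, lo=-20.0, hi=20.0, n=40000):
    """Trapezoid-rule integral of |P1 - P2| over a wide theta range."""
    step = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        t = lo + i * step
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * abs(irf(t, a1, b1) - irf(t, a2, b2))
    return total * step

print(unsigned_area(1.0, 0.0, 1.5, 0.5), numeric_unsigned_area(1.0, 0.0, 1.5, 0.5))
```

When the discriminations differ, the curves cross and the signed area understates the gap between them, which is why the unsigned area is reported alongside it.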
Procedures for Detecting DIF/DTF
Non-Parametric Procedures
Examine bivariate frequencies between item responses and
group membership, conditional on levels of the ability
or trait estimate
Simultaneous Item Bias Test (SIBTEST)
Mantel-Haenszel (MH)
Logistic Regression
Procedures for Detecting DIF/DTF
Simultaneous Item Bias Test (SIBTEST)
Examinees are matched on an estimate of their true
score (ability)
Creates a weighted mean difference between the
reference and focal groups, which is then tested
statistically
The means are adjusted to correct for differences in the
ability distributions with a regression correction procedure
Research on this procedure has examined how
Type I error rates change when the percentage of DIF
items is large
SIBTEST
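$H_0: \beta_{UNI} = 0$
$H_1: \beta_{UNI} \neq 0$
$\beta_{UNI} = \int B(\theta)\, f_F(\theta)\, d\theta$
$B(\theta) = P(\theta, R) - P(\theta, F)$
$f_F(\theta)$ is the density function for $\theta$ in the focal group
$d\theta$ is the differential of theta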
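To make the definition concrete, the sketch below evaluates the integral of the group difference in response probability, weighted by the focal-group density, for known 2PL curves and a normal focal distribution. This computes the population quantity only; it is not the SIBTEST estimator, which works from observed matching scores with a regression correction. The item parameters, D = 1.7, and the function names are assumptions.

```python
import math

D = 1.7  # logistic scaling constant (assumed)

def irf(theta, a, b):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def normal_pdf(theta, mean, sd):
    z = (theta - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def beta_uni(ref, focal, focal_mean=0.0, focal_sd=1.0, lo=-6.0, hi=6.0, n=1200):
    """Trapezoid-rule integral of [P(theta, R) - P(theta, F)] * f_F(theta)."""
    step = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        t = lo + i * step
        weight = 0.5 if i in (0, n) else 1.0
        diff = irf(t, *ref) - irf(t, *focal)          # B(theta)
        total += weight * diff * normal_pdf(t, focal_mean, focal_sd)
    return total * step

# Hypothetical (a, b) pairs: the item is 0.5 logits-scale units harder for the focal group.
print(round(beta_uni(ref=(1.0, 0.0), focal=(1.0, 0.5)), 4))
```

A positive value favors the reference group and a negative value favors the focal group; identical curves give zero.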
Mantel-Haenszel (MH)
Compares the item performance of two groups
who were previously matched on the ability scale
Total test score can be used
For each item, K 2x2 contingency tables are constructed,
one for each of the K ability levels
DIF is shown if the odds of correctly answering
the item at a given score level differ between the
two groups
Mantel-Haenszel (MH)
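Response to Suspect Item
Group j            Right (1)    Wrong (0)
Reference group    A_j          B_j
Focal group        C_j          D_j
Odds of a correct response at score level j: $p_{Rj} / (1 - p_{Rj})$ for the reference group and $p_{Fj} / (1 - p_{Fj})$ for the focal group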
Mantel-Haenszel (MH)
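The statistic for detecting DIF in an item is
$\chi^2_{MH} = \dfrac{\left( \left| \sum_{j=1}^{K} A_j - \sum_{j=1}^{K} E(A_j) \right| - 0.5 \right)^2}{\sum_{j=1}^{K} \mathrm{Var}(A_j)}$
$\alpha_{MH} = \dfrac{\sum_{j=1}^{K} A_j D_j / N_{..j}}{\sum_{j=1}^{K} B_j C_j / N_{..j}}$
$\Delta\alpha_{MH} = -2.35 \ln(\alpha_{MH})$
•Type A items – negligible DIF with $|\Delta\alpha_{MH}| < 1$
•Type B items – moderate DIF with $1 \le |\Delta\alpha_{MH}| \le 1.5$, and MH test is statistically significant
•Type C items – large DIF with $|\Delta\alpha_{MH}| > 1.5$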
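The MH computations can be sketched from a list of 2x2 tables, one per score level. The expected value and variance of A_j are not given on the slide; the sketch assumes the standard hypergeometric forms E(A_j) = N_Rj M_1j / N_j and Var(A_j) = N_Rj N_Fj M_1j M_0j / [N_j^2 (N_j - 1)]. The counts are invented, the function names are hypothetical, and treating a non-significant item in the 1 to 1.5 range as Type A is also an assumption.

```python
import math

def mantel_haenszel(tables):
    """tables: list of (A, B, C, D) counts, one 2x2 table per score level.
    Returns (chi-square, common odds ratio alpha, delta on the -2.35 ln scale).
    Assumes at least one informative table (non-zero variance and B*C sum)."""
    sum_a = sum_e = sum_var = num = den = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a table this small carries no information
        n_ref, n_foc = a + b, c + d
        m_right, m_wrong = a + c, b + d
        sum_a += a
        sum_e += n_ref * m_right / n
        sum_var += n_ref * n_foc * m_right * m_wrong / (n * n * (n - 1))
        num += a * d / n
        den += b * c / n
    chi2 = (abs(sum_a - sum_e) - 0.5) ** 2 / sum_var
    alpha = num / den
    delta = -2.35 * math.log(alpha)
    return chi2, alpha, delta

def classify(chi2, delta, critical=3.841):
    """A/B/C labels; 3.841 is the .05 chi-square critical value for 1 df."""
    if abs(delta) > 1.5:
        return "C"
    if abs(delta) >= 1.0 and chi2 > critical:
        return "B"
    return "A"

# Hypothetical counts at three score levels: (A_j, B_j, C_j, D_j).
tables = [(30, 20, 20, 30), (40, 10, 30, 20), (45, 5, 40, 10)]
chi2, alpha, delta = mantel_haenszel(tables)
print(round(chi2, 2), round(alpha, 2), round(delta, 2), classify(chi2, delta))
```

An alpha above 1 (negative delta) means the odds of success are higher for the reference group at matched score levels.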
Logistic Regression
$p(u = 1 \mid X) = \dfrac{e^{f(x)}}{1 + e^{f(x)}}$
$p(u = 1 \mid X)$ is the conditional probability of obtaining a
correct answer given X independent variables
$f(x) = \beta_0 + \beta_1 \theta + \beta_2 G + \beta_3 \theta G$
G is the independent (group) variable
$\theta$ is the matching criterion (normally test score)
If the group effect is significant and the interaction is not, then there is
uniform DIF
If the interaction is significant, then there is non-uniform DIF
Conduct model comparisons by adding each successive model term
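The difference between the two DIF types shows up in how the group gap in log-odds behaves across the matching variable. The sketch below only evaluates the model for given coefficients (it does not fit it); the coefficient values and function names are invented for illustration, with the group coded 0 for reference and 1 for focal.

```python
import math

def p_correct(theta, group, b0, b1, b2, b3):
    """Logistic regression DIF model: f = b0 + b1*theta + b2*G + b3*theta*G."""
    f = b0 + b1 * theta + b2 * group + b3 * theta * group
    return 1.0 / (1.0 + math.exp(-f))   # same as e^f / (1 + e^f)

def logit_gap(theta, b2, b3):
    """Log-odds difference between group 1 and group 0 at a given theta."""
    return b2 + b3 * theta

uniform = dict(b0=-0.2, b1=1.0, b2=-0.5, b3=0.0)      # group effect only
non_uniform = dict(b0=-0.2, b1=1.0, b2=-0.2, b3=0.4)  # group-by-theta interaction

for theta in (-2.0, 0.0, 2.0):
    print(theta,
          round(logit_gap(theta, uniform["b2"], uniform["b3"]), 2),
          round(logit_gap(theta, non_uniform["b2"], non_uniform["b3"]), 2))
```

With no interaction the gap is the same at every theta (uniform DIF); with an interaction the gap changes with theta and can switch sign, here at theta = -b2 / b3.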
Computerized Adaptive Testing (CAT)
Goal: obtain precision of measurement equal to
that of a linear test, but with greater efficiency.
Give people only the items that are informative
about them.
Reduce testing time and opportunity for error.
CAT System
Initial ability estimate.
  Mean
  Prior
Select first item.
  Most discriminating.
  Least discriminating.
Estimate ability.
  MLE
  Bayesian Methods
Select items.
  Max info.
  Exposure control.
  Content specs.
Estimate ability.
Check stopping rule.
  SE stopping rule.
  Max # of items.
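The loop above can be sketched in a few lines. This is a simplified illustration under several assumptions not taken from the slides: a 2PL item pool with D = 1.7, the prior mean as the initial estimate, maximum-information selection, a Bayesian (EAP) ability estimate on a fixed grid with a standard normal prior, and no exposure control or content constraints. The pool and the simulated examinee are invented.

```python
import math
import random

D = 1.7
GRID = [-4.0 + 0.1 * i for i in range(81)]  # theta grid for EAP

def prob(theta, a, b):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def information(theta, a, b):
    p = prob(theta, a, b)
    return (D * a) ** 2 * p * (1.0 - p)

def eap(responses):
    """Posterior mean and SD of theta on GRID under a N(0, 1) prior."""
    weights = []
    for t in GRID:
        log_w = -0.5 * t * t
        for a, b, u in responses:
            p = prob(t, a, b)
            log_w += math.log(p if u else 1.0 - p)
        weights.append(math.exp(log_w))
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_cat(pool, respond, se_target=0.3, max_items=20):
    theta, se = 0.0, 1.0                       # initial estimate: prior mean and SD
    remaining, administered = list(pool), []
    while remaining and len(administered) < max_items and se > se_target:
        # Select the unused item with maximum information at the current estimate.
        item = max(remaining, key=lambda ab: information(theta, ab[0], ab[1]))
        remaining.remove(item)
        u = respond(item)
        administered.append((item[0], item[1], u))
        theta, se = eap(administered)          # re-estimate ability
    return theta, se, administered

rng = random.Random(0)
pool = [(rng.uniform(0.8, 2.0), rng.uniform(-3.0, 3.0)) for _ in range(200)]
true_theta = 1.0
respond = lambda item: int(rng.random() < prob(true_theta, item[0], item[1]))
theta_hat, se, administered = run_cat(pool, respond)
print(round(theta_hat, 2), round(se, 2), len(administered))
```

The loop stops when either the SE target or the maximum test length is reached, mirroring the two stopping rules listed above.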
Issues of Research in a CAT system.
Early Issues
Precision of measurement
Equivalence
Reliability of Estimate, Test Form Equivalence (Test Information),
Testing Mode
Efficiency
Estimation procedure, Prior estimates
Item selection methods, Test length
Newer Issues
Security
Item exposure
Testlet models
Item Exposure and Item Selection Methods
Sympson-Hetter
Directly controls item exposure probabilistically
P(S) probability that an item is selected as the best item
P(A) probability that an item is administered
P(A|S) conditional probability that an item is administered
given that it is selected
Places a filter between item selection and item administration
Items are administered below a prespecified maximum exposure rate
Item exposure parameter
P(A)=P(A|S)*P(S)<=rmax
P(A|S) is easy to determine if P(S) is known, but P(S) must
be determined through an iterative process
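A minimal sketch of the probabilistic filter, assuming P(S) is already known for the item (in practice it comes out of the iterative simulations mentioned above). The rule used here, P(A|S) = min(1, rmax / P(S)), is the usual way to satisfy the inequality; the function names and numbers are hypothetical.

```python
import random

def exposure_parameter(p_selected, r_max):
    """P(A|S) chosen so that P(A) = P(A|S) * P(S) does not exceed r_max."""
    if p_selected <= r_max:
        return 1.0
    return r_max / p_selected

def administer(p_admin_given_selected, rng):
    """Filter between selection and administration: pass with probability P(A|S)."""
    return rng.random() < p_admin_given_selected

# Hypothetical item that is the best choice for half of all examinees.
p_s, r_max = 0.5, 0.2
k = exposure_parameter(p_s, r_max)

rng = random.Random(1)
n_examinees = 20000
n_administered = 0
for _ in range(n_examinees):
    selected = rng.random() < p_s              # item comes up as the best item
    if selected and administer(k, rng):
        n_administered += 1
print(k, n_administered / n_examinees)
```

When the filter rejects an item, the algorithm moves on to the next-best item, so popular items are shared out at roughly the target rate.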
Item Exposure and Item Selection Methods
Conditional Sympson-Hetter or SLC (Stocking
and Lewis, 1998)
SH controls item exposure for the population as a whole, but
at particular ability levels the exposure rates can still be
quite high
P(A|S) is determined at specific trait levels rather
than across a population
Item Exposure and Item Selection Methods
a-stratified design (STR CAT; Chang & Ying, 1996,
1999)
Partition the item pool into multiple levels and the test into multiple stages
according to the discrimination parameters
Start with the less discriminating items
Then use a b-matching item selection procedure
This approach seems to improve item pool utilization and
balance item exposure rates
It is less computationally complex
No other restrictions on item exposure are imposed
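The selection rule can be sketched as follows: sort the pool by discrimination, cut it into strata, and at each stage pick the unused item in that stage's stratum whose difficulty is closest to the current ability estimate. The pool, the dictionary layout, and the function names are assumptions for the example, and equal-sized strata are a simplification.

```python
import math

def stratify_by_a(pool, n_strata):
    """Sort items by discrimination and cut into levels, lowest a first.
    If the pool does not divide evenly, the last level is smaller."""
    ordered = sorted(pool, key=lambda item: item["a"])
    size = math.ceil(len(ordered) / n_strata)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def select_item(strata, stage, theta, used_ids):
    """b-matching: the unused item in this stage's stratum with b closest to theta.
    Raises ValueError if the stratum has no unused items left."""
    candidates = [it for it in strata[stage] if it["id"] not in used_ids]
    return min(candidates, key=lambda it: abs(it["b"] - theta))

# Hypothetical six-item pool.
pool = [
    {"id": 1, "a": 0.5, "b": 1.0},
    {"id": 2, "a": 0.6, "b": -0.2},
    {"id": 3, "a": 1.0, "b": 0.0},
    {"id": 4, "a": 1.1, "b": 0.9},
    {"id": 5, "a": 1.6, "b": 0.1},
    {"id": 6, "a": 1.8, "b": -1.5},
]
strata = stratify_by_a(pool, n_strata=3)
first = select_item(strata, stage=0, theta=0.0, used_ids=set())
print(first["id"])
```

Early stages draw on low-discrimination items while the ability estimate is still rough, saving the highly discriminating items for later stages when they are most useful.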