applications_of_irt_models


Applications of IRT Models
DIF and CAT
Which of these situations describes a biased test?

The average score for males and females on an item is not the same.
The correlation between males’ scores on an item is stronger than that for the females’ scores.
A group of males and females with exactly the same ability achieve different scores on an item.
Disentangling the Terminology

Item impact
Item impact is evident when examinees from different groups have differing probabilities of responding correctly to (or endorsing) an item because there are true differences between the groups in the underlying ability being measured by the item.

DIF
The differential probability of a correct response for examinees at the same trait level but from different groups.
DIF occurs when examinees from different groups show differing probabilities of success on (or endorsing) the item after matching on the underlying ability that the item is intended to measure.

Item bias
Item bias occurs when examinees of one group are less likely to answer an item correctly (or endorse an item) than examinees of another group because of some characteristic of the test item or testing situation that is not relevant to the test purpose.
Adverse Impact

Adverse impact is a legal term describing the situation in which group differences in test performance result in disproportionate examinee selection or related decisions (e.g., promotion). This is not, by itself, evidence of test bias.
[Figure: No DIF]
There are two types of DIF

Uniform DIF
The reference group always has a higher probability of a correct response than the focal group, at every ability level.

Non-uniform DIF
The direction of one group’s advantage in the likelihood of a correct response changes in different regions of the ability scale.

[Figure: Uniform DIF]
[Figure: Non-uniform DIF]
Differential Test Functioning

[Figure: DTF Against Reference Group. Curves for the Focal and Reference groups; x-axis: Theta (-3.0 to 3.0); y-axis: Proportion Correct True Score (0.0 to 1.0).]
Relationship between IRT and CTST models

It has been shown that there is a relationship between 2PL normal ogive IRT models and the single-factor FA model (Lord & Novick, 1968).
The b-parameter is equal to the threshold parameter divided by the item factor loading.
The discrimination parameter is equal to the factor loading divided by the square root of the item’s uniqueness (one minus the communality).
Highly discriminating items will have high factor loadings.
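
The conversion can be sketched in a few lines. This is only an illustration, assuming a standardized loading λ and threshold τ from a single-factor model on the normal ogive metric, so that a = λ / sqrt(1 − λ²) and b = τ / λ. The function name `factor_to_irt` and the example values are hypothetical and do not come from the slides.

```python
import math


def factor_to_irt(loading, threshold):
    """Convert a standardized factor loading and threshold to
    normal-ogive 2PL parameters (a, b).

    Assumes a single common factor with unit variance, so the item's
    uniqueness is 1 - loading**2.
    """
    uniqueness = 1.0 - loading ** 2
    a = loading / math.sqrt(uniqueness)  # discrimination
    b = threshold / loading              # difficulty / location
    return a, b


# Hypothetical item: loading 0.6, threshold 0.3
a, b = factor_to_irt(loading=0.6, threshold=0.3)

# A higher loading gives a higher discrimination
a_high, _ = factor_to_irt(loading=0.8, threshold=0.3)
```

Because the uniqueness shrinks as the loading grows, the discrimination rises faster than the loading itself, which is why highly discriminating items show high loadings.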
Examining Measurement Invariance in CTST

Examining factorial invariance

Configural invariance
Zero and non-zero loading patterns are the same across groups

Pattern (metric) invariance
The factor loadings are equal across groups

Scalar (strong) invariance
The factor loadings and intercepts are equal across groups
Any group differences in means can be attributed to the common factors, which allows for meaningful group mean comparisons

Strict invariance
Factor loadings, intercepts, and unique variances are equal across groups
Any systematic differences in group means, variances, or covariances are due to the common factors
Examining DIF in IRT

IRT tests of DIF examine whether the IRC (item response curve) is the same for the reference group as it is for the focal group.
The focal group is the smaller group in question (the minority group).
The reference group is the larger group that generally has the established parameters.
If the curves differ, the probability of a correct response for an individual with ability x in one group is different from that for an individual with the same ability x in the other group.

DTF refers to a difference in the test characteristic curves, obtained by summing the item response functions for each group.
DTF is perhaps more important for selection because decisions are made based on test scores, not individual item responses.
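
A rough sketch of how a test characteristic curve is built from item response functions, and how DTF shows up as a gap between the two groups' curves. It assumes a 2PL logistic model with scaling constant D = 1.7. The item parameters are made up for illustration, and the helper names `irf_2pl` and `tcc` are not from the slides.

```python
import math


def irf_2pl(theta, a, b, D=1.7):
    """2PL item response function: P(correct | theta)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def tcc(theta, items):
    """Test characteristic curve: sum of the item response functions."""
    return sum(irf_2pl(theta, a, b) for a, b in items)


# Hypothetical (a, b) parameters; the second item is harder for the focal group
reference_items = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.5)]
focal_items = [(1.0, -0.5), (1.2, 0.4), (0.8, 0.5)]

thetas = [x / 2 for x in range(-6, 7)]  # -3.0 to 3.0 in steps of 0.5
dtf = [tcc(t, reference_items) - tcc(t, focal_items) for t in thetas]

for t, d in zip(thetas, dtf):
    print(f"theta={t:+.1f}  expected-score gap={d:.3f}")
```

Here a single DIF item is enough to separate the curves. With several DIF items favouring different groups, the item-level differences can partly cancel in the sum.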
Procedures for Detecting DIF/DTF

Parametric Procedures

Compare item parameters from two groups of examinees
Lord’s Chi-Square
Likelihood Ratio Test

Compare IRFs from two groups of examinees by measuring areas between them
Raju’s Area Measures
Likelihood Ratio Test

$$G_j^2 = -2\log L(\text{compact model}) + 2\log L(\text{augmented model})$$

Distributed as a chi-square with degrees of freedom equal to the difference in the number of parameters estimated in the compact and the augmented model.
The compact model assumes item parameters are the same for both groups.
The augmented model constrains anchor items to be equal, but allows items of interest to have parameters that vary across groups.
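
As a minimal sketch of the comparison, the statistic and its p-value can be computed from the two fitted log-likelihoods. The log-likelihood values below are invented for illustration. `chi2_sf` is a small stdlib-only helper written for this example (closed-form series for integer degrees of freedom), not a library function.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Q(n, h) = exp(-h) * sum_{k<n} h^k / k!, with n = df / 2
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: start from Q(1/2, h) = erfc(sqrt(h)) and step upward
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)
    for k in range((df - 1) // 2):
        total += math.exp(-half) * term
        term *= half / (k + 1.5)
    return total


def likelihood_ratio_dif(loglik_compact, loglik_augmented, n_freed_params):
    """G^2 = -2 log L(compact) + 2 log L(augmented), with its p-value."""
    g2 = -2.0 * loglik_compact + 2.0 * loglik_augmented
    return g2, chi2_sf(g2, n_freed_params)


# Hypothetical fit: freeing a and b for one studied item adds 2 parameters
g2, p_value = likelihood_ratio_dif(-5234.8, -5230.1, n_freed_params=2)
print(f"G2 = {g2:.2f}, p = {p_value:.4f}")
```

A small p-value suggests that letting the studied item's parameters differ across groups improves fit, which is the evidence for DIF in this framework.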
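Raju’s Area Measures

Signed and unsigned areas
Indicate the area between two IRCs
Require separate calibrations of the item parameters in each group, followed by a linear transformation to put them on the same scale

With $a_1, b_1$ and $a_2, b_2$ denoting the discrimination and difficulty estimates in the two groups and $D$ the scaling constant:

$$\text{Signed area} = b_2 - b_1$$

$$\text{Unsigned area} = |b_2 - b_1| \quad (a_1 = a_2)$$

$$\text{Unsigned area} = \left| \frac{2(a_2 - a_1)}{D a_1 a_2} \ln\left[1 + \exp\left(\frac{D a_1 a_2 (b_2 - b_1)}{a_2 - a_1}\right)\right] - (b_2 - b_1) \right| \quad (a_1 \neq a_2)$$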
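
The area formulas can be sketched as a small function. This assumes 2PL items with no guessing parameter, estimates already placed on a common scale, and D = 1.7. The name `raju_areas` and the example parameter values are hypothetical.

```python
import math


def raju_areas(a1, b1, a2, b2, D=1.7):
    """Signed and unsigned area between two 2PL item response curves.

    (a1, b1) and (a2, b2) are the discrimination and difficulty
    estimates from the two groups, on the same scale.
    """
    signed = b2 - b1
    if math.isclose(a1, a2):
        # Parallel curves never cross, so the unsigned area is |b2 - b1|
        return signed, abs(signed)
    k = 2.0 * (a2 - a1) / (D * a1 * a2)
    z = D * a1 * a2 * (b2 - b1) / (a2 - a1)
    # ln(1 + exp(z)) computed in a numerically stable way
    log_term = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    unsigned = abs(k * log_term - (b2 - b1))
    return signed, unsigned


# Hypothetical estimates: same difficulty, different discrimination
signed, unsigned = raju_areas(a1=1.0, b1=0.0, a2=2.0, b2=0.0)
print(signed, unsigned)
```

When the curves cross, the signed area can be zero while the unsigned area is not, which is why both measures are reported.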
Procedures for Detecting DIF/DTF

Non-Parametric Procedures

Examine bivariate frequencies between item responses and group membership, conditional on levels of ability or trait estimates
Simultaneous Item Bias Test (SIBTEST)
Mantel-Haenszel (MH)
Logistic Regression

Procedures for Detecting DIF/DTF

Simultaneous Item Bias Test (SIBTEST)
Examinees are matched on an estimate of their true score
Creates a weighted mean difference between the reference and focal groups, which is then tested statistically
The means are adjusted to correct for differences in the ability distributions with a regression correction procedure
Type I error rates of this procedure have been examined for cases where the percentage of DIF items is large
SIBTEST

$$H_0: \beta_{UNI} = 0$$
$$H_1: \beta_{UNI} \neq 0$$
$$\beta_{UNI} = \int B(\theta)\, f_F(\theta)\, d\theta$$
$$B(\theta) = P(\theta, R) - P(\theta, F)$$

$f_F(\theta)$ is the density function for $\theta$ in the focal group
$d\theta$ is the differential of theta
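
To make the β_UNI parameter concrete, it can be evaluated by numerical integration when the two groups' response functions are known. This is a sketch of the population quantity only, not of the SIBTEST estimator with its regression correction. It assumes 2PL response functions and a normal focal-group density, and all names and parameter values are illustrative.

```python
import math


def irf_2pl(theta, a, b, D=1.7):
    """2PL item response function."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def normal_pdf(theta, mean=0.0, sd=1.0):
    z = (theta - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def beta_uni(ref_params, focal_params, focal_mean=0.0, focal_sd=1.0,
             lo=-6.0, hi=6.0, n=1200):
    """Trapezoid-rule approximation of the integral of B(theta) * f_F(theta)."""
    step = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        theta = lo + i * step
        # B(theta) = P(theta, R) - P(theta, F)
        b_theta = irf_2pl(theta, *ref_params) - irf_2pl(theta, *focal_params)
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * b_theta * normal_pdf(theta, focal_mean, focal_sd)
    return total * step


# Hypothetical (a, b) pairs
no_dif = beta_uni((1.0, 0.0), (1.0, 0.0))
uniform_dif = beta_uni((1.0, 0.0), (1.0, 0.5))   # item harder for focal group
crossing_dif = beta_uni((0.5, 0.0), (1.5, 0.0))  # curves cross at theta = 0
print(no_dif, uniform_dif, crossing_dif)
```

In the crossing case the positive and negative differences cancel under a symmetric focal density, so β_UNI is near zero even though the curves differ.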
Mantel-Haenszel (MH)

Compares the item performance of two groups who were previously matched on the ability scale
Total test score can be used
K 2x2 contingency tables are made for each item, one for each of the K ability levels
DIF is shown if the odds of correctly answering the item at a given score level differ for the two groups
Mantel-Haenszel (MH)

Response to Suspect Item (score level j)

Group              Right (1)    Wrong (0)
Reference group    A_j          B_j
Focal group        C_j          D_j

$$\alpha_j = \frac{p_{Rj} / (1 - p_{Rj})}{p_{Fj} / (1 - p_{Fj})}$$
Mantel-Haenszel (MH)

The statistic for detecting DIF in an item is

$$MH\chi^2 = \frac{\left( \left| \sum_{j=1}^{K} A_j - \sum_{j=1}^{K} E(A_j) \right| - 0.5 \right)^2}{\sum_{j=1}^{K} \mathrm{Var}(A_j)}$$

$$\alpha_{MH} = \frac{\sum_{j=1}^{K} A_j D_j / N_{..j}}{\sum_{j=1}^{K} B_j C_j / N_{..j}}$$

$$\Delta\alpha_{MH} = -2.35 \ln(\alpha_{MH})$$

•Type A items – negligible DIF with |ΔαMH| < 1
•Type B items – moderate DIF with 1 <= |ΔαMH| <= 1.5, and MH test is statistically significant
•Type C items – large DIF with |ΔαMH| > 1.5
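
The whole MH procedure can be sketched from the K contingency tables. This sketch assumes the usual hypergeometric expectation and variance for A_j, namely E(A_j) = N_Rj·M_1j / N_j and Var(A_j) = N_Rj·N_Fj·M_1j·M_0j / (N_j²(N_j − 1)), and takes ΔαMH = −2.35 ln(αMH). The counts are made up. The `classify` helper treats a non-significant item in the Type B range as Type A, which is an assumption beyond what the slide states.

```python
import math


def mantel_haenszel(tables):
    """tables: one (A, B, C, D) tuple per score level, where
    A/B = reference right/wrong and C/D = focal right/wrong."""
    sum_a = sum_e = sum_var = num = den = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # level carries no information
        n_ref, n_foc = a + b, c + d
        m_right, m_wrong = a + c, b + d
        sum_a += a
        sum_e += n_ref * m_right / n
        sum_var += n_ref * n_foc * m_right * m_wrong / (n * n * (n - 1))
        num += a * d / n
        den += b * c / n
    chi2 = (abs(sum_a - sum_e) - 0.5) ** 2 / sum_var
    alpha = num / den                 # common odds ratio
    delta = -2.35 * math.log(alpha)   # delta-scale effect size
    return chi2, alpha, delta


def classify(delta, chi2, critical=3.841):
    """A/B/C labels; 3.841 is the .05 chi-square cutoff for df = 1."""
    size = abs(delta)
    if size > 1.5:
        return "C"
    if size >= 1.0 and chi2 > critical:
        return "B"
    return "A"


# Hypothetical counts at three score levels
tables = [(40, 10, 30, 20), (30, 20, 20, 30), (20, 30, 10, 40)]
chi2, alpha, delta = mantel_haenszel(tables)
print(f"chi2={chi2:.2f} alpha={alpha:.2f} delta={delta:.2f} "
      f"type={classify(delta, chi2)}")
```

With this sign convention, a negative ΔαMH means the item favours the reference group.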
Logistic Regression

$$p(u = 1 \mid X) = \frac{e^{f(x)}}{1 + e^{f(x)}}$$

$p(u = 1 \mid X)$ is the conditional probability of obtaining a correct answer given the independent variables X

$$f(x) = \beta_0 + \beta_1 \theta + \beta_2 G + \beta_3 \theta G$$

G is the independent (group) variable
$\theta$ is the matching criterion (normally test score)

If the group effect is significant and the interaction is not, then there is uniform DIF
If the interaction is significant, then there is non-uniform DIF
Conduct model comparisons by adding each successive model term
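
A short sketch of how the group and interaction terms translate into the two DIF patterns. It evaluates the model for chosen coefficients and does not fit it to data. The coefficient values are invented, and G is assumed to be coded 0 for the reference group and 1 for the focal group.

```python
import math


def p_correct(theta, group, b0, b1, b2, b3):
    """p(u = 1 | X) with f(x) = b0 + b1*theta + b2*G + b3*theta*G."""
    f = b0 + b1 * theta + b2 * group + b3 * theta * group
    return 1.0 / (1.0 + math.exp(-f))


def group_gap(theta, coefs):
    """Focal minus reference probability at the same matching score."""
    return p_correct(theta, 1, *coefs) - p_correct(theta, 0, *coefs)


# Hypothetical coefficients (b0, b1, b2, b3)
uniform = (0.0, 1.0, -0.5, 0.0)     # group effect only
nonuniform = (0.0, 1.0, 0.0, -0.6)  # interaction only

thetas = [-2.0, -1.0, 1.0, 2.0]
uniform_gaps = [group_gap(t, uniform) for t in thetas]
nonuniform_gaps = [group_gap(t, nonuniform) for t in thetas]

print([round(g, 3) for g in uniform_gaps])     # same sign everywhere
print([round(g, 3) for g in nonuniform_gaps])  # sign flips across theta
```

In practice the coefficients would be estimated, and the nested models (matching variable only, plus group, plus interaction) compared with likelihood ratio tests.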
Computerized Adaptive Testing (CAT)

Obtain precision of measurement equal to that of a linear test, but with greater efficiency.
Give people only the items that are informative about them.
Reduce testing time and opportunity for error.

CAT System

Initial ability estimate: mean, prior.
Select first item: most discriminating, least discriminating.
Estimate ability: MLE, Bayesian methods.
Select items: max info, exposure control, content specs.
Estimate ability.
Check stopping rule: SE stopping rule, max # of items.
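
The steps of a CAT system can be sketched as a small simulation. This is one possible configuration and only an illustration: 2PL items, a start at the prior mean, maximum-information item selection, EAP (Bayesian) ability estimation on a grid, and stopping on a standard-error target or a maximum test length. Exposure control and content constraints are left out, and the item pool is randomly generated.

```python
import math
import random

D = 1.7
GRID = [g / 10 for g in range(-40, 41)]  # quadrature points for theta


def prob(theta, a, b):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def info(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = prob(theta, a, b)
    return (D * a) ** 2 * p * (1.0 - p)


def eap(responses):
    """Posterior mean and SD of theta under a standard normal prior."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)
        for (a, b), u in responses:
            p = prob(t, a, b)
            w *= p if u else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def run_cat(pool, true_theta, se_target=0.35, max_items=20, rng=None):
    rng = rng or random.Random(0)
    theta, se = 0.0, 1.0          # initial estimate: prior mean and SD
    remaining = list(pool)
    responses = []
    while remaining and len(responses) < max_items and se > se_target:
        # Select the unused item with maximum information at the estimate
        item = max(remaining, key=lambda ab: info(theta, *ab))
        remaining.remove(item)
        # Simulate the examinee's response
        u = rng.random() < prob(true_theta, *item)
        responses.append((item, u))
        theta, se = eap(responses)  # re-estimate ability
    return theta, se, len(responses)


pool_rng = random.Random(1)
pool = [(pool_rng.uniform(0.6, 2.0), pool_rng.uniform(-3.0, 3.0))
        for _ in range(200)]
theta_hat, se, n_used = run_cat(pool, true_theta=1.0)
print(f"estimate={theta_hat:.2f}, SE={se:.2f}, items used={n_used}")
```

Swapping `eap` for a maximum-likelihood estimator, or changing the selection key, gives the other branches of the flow listed above.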
Issues of Research in a CAT System

Early Issues

Precision of measurement
Estimation procedure, Prior estimates

Equivalence
Reliability of Estimate, Test Form Equivalence (Test Information), Testing Mode

Efficiency
Item selection methods, Test length
Newer Issues

Security
Item exposure

Testlet models
Item Exposure and Item Selection Methods

Sympson-Hetter
Directly controls item exposure probabilistically
Places a filter between item selection and item administration
Items are administered below a prespecified maximum exposure rate, rmax

P(S): probability that an item is selected as the best item
P(A): probability that an item is administered
P(A|S): conditional probability that an item is administered given that it is selected (the item exposure parameter)

P(A) = P(A|S) * P(S) <= rmax

P(A|S) is easy to determine if P(S) is known, but P(S) must be determined through an iterative process
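
A minimal sketch of the two pieces of the method: one adjustment step that sets P(A|S) from an estimated P(S), and the probabilistic filter applied at administration time. The selection rates are invented. A real calibration would alternate simulation and adjustment until the rates settle, since changing P(A|S) changes P(S). The fallback in `administer` is an assumption made so the function always returns an item.

```python
import random


def update_exposure_params(p_selected, r_max):
    """Choose P(A|S) per item so that P(A) = P(A|S) * P(S) <= r_max."""
    return {
        item: min(1.0, r_max / p_s) if p_s > 0 else 1.0
        for item, p_s in p_selected.items()
    }


def administer(ranked_items, p_admin_given_selected, rng):
    """Walk down the ranked candidates and administer the first item
    that passes its exposure filter."""
    for item in ranked_items:
        if rng.random() < p_admin_given_selected[item]:
            return item
    return ranked_items[-1]  # assumption: fall back to the last candidate


# Hypothetical selection rates from a simulation run
p_selected = {"item1": 0.60, "item2": 0.25, "item3": 0.10}
r_max = 0.30
k = update_exposure_params(p_selected, r_max)

rng = random.Random(0)
chosen = administer(["item1", "item2", "item3"], k, rng)
print(k, chosen)
```

Items selected less often than rmax pass the filter every time. Only over-selected items are held back.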
Item Exposure and Item Selection
Methods

Conditional Sympson-Hetter or SLC (Stocking and Lewis, 1998)
SH controls the item exposure for a population, but at various ability levels the exposure rates can be quite high
P(A|S) is determined at specific trait levels rather than across a population

Item Exposure and Item Selection
Methods

a-stratified design (STR CAT; Chang & Ying, 1996, 1999)

Partition the item pool into multiple levels and stages according to the discrimination parameters
Start with the less discriminating items
Then use a b-matching item selection procedure
This approach seems to improve item pool utilization and balance item exposure rates
It is less computationally complex
No other restrictions on item exposure are imposed
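
The stratification and b-matching steps can be sketched as follows. This is an illustration only: items are (a, b) tuples, strata are equal-sized slices of the pool sorted by a, and the tiny pool and function names are made up.

```python
def stratify_by_a(pool, n_strata):
    """Sort items by discrimination and cut the pool into strata,
    lowest-a stratum first."""
    ordered = sorted(pool, key=lambda item: item[0])
    size = -(-len(ordered) // n_strata)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def select_b_matching(stratum, theta, used):
    """Within a stratum, pick the unused item whose difficulty b is
    closest to the current ability estimate."""
    candidates = [item for item in stratum if item not in used]
    return min(candidates, key=lambda item: abs(item[1] - theta))


# Hypothetical pool of (a, b) pairs
pool = [(0.5, -1.0), (0.7, 0.8), (1.0, 0.1),
        (1.1, -0.6), (1.6, 0.3), (1.9, 1.2)]
strata = stratify_by_a(pool, n_strata=3)

used = set()
first = select_b_matching(strata[0], theta=0.0, used=used)  # early stage
used.add(first)
later = select_b_matching(strata[2], theta=0.4, used=used)  # final stage
print(strata, first, later)
```

Early stages draw from the low-a stratum while the ability estimate is still rough, and the most discriminating items are saved for later stages.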