Item Response Theory
Item Response Theory in Health
Measurement
Outline
Contrast IRT with classical test theory
Introduce basic concepts in IRT
Illustrate IRT methods with ADL and
IADL scales
Discuss empirical comparisons of IRT
and CTT
Advantages and disadvantages of IRT
When would it be appropriate to use
IRT?
Test Theory
Any item in any health measure has two parameters:
The level of ability required to answer the question correctly. In health this translates into the level of health at which the person doesn’t report this problem
The level of discrimination of the item: how accurately it distinguishes well from sick
Classical Test Theory
This is the most common paradigm for scale development
and validation in health.
Few theoretical assumptions, so broadly applicable
Partitions observed score into True Score + Error
Probability of a given item response is a function of
person to whom item is administered and nature of item
Item difficulty: proportion of examinees who answer item
correctly (in health context: item severity…)
Item discrimination: biserial correlation between item and
total test score.
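The two classical item statistics above can be computed directly from a response matrix. The following is a minimal sketch, assuming dichotomous items coded 0/1. The data are hypothetical, and a point-biserial (Pearson) item-total correlation is used as a simple stand-in for the biserial correlation named on the slide.

```python
import math

def item_difficulty(item_scores):
    """Proportion of respondents scoring 1 on a dichotomous item."""
    return sum(item_scores) / len(item_scores)

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def item_total_correlation(responses, item_index):
    """Point-biserial correlation between one item and the total score.

    The total here includes the item itself (an uncorrected item-total
    correlation); this is an assumption of the sketch.
    """
    item = [row[item_index] for row in responses]
    total = [sum(row) for row in responses]
    return pearson(item, total)

# Hypothetical data: 6 respondents x 3 items (1 = correct / endorsed)
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
]
n_items = len(responses[0])
difficulties = [item_difficulty([row[j] for row in responses]) for j in range(n_items)]
discriminations = [item_total_correlation(responses, j) for j in range(n_items)]
print(difficulties, discriminations)
```

Both statistics are computed from this particular sample, which is why they change when the sample changes.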
Classical test theory
Probability of ‘no’ answer depends on type of item
(difficulty) and the level of physical functioning (e.g. SF-36 bathing vs. able to do vigorous activities)
Some limitations
Item difficulty, discrimination, and ability are confounded
Sample dependent; item difficulty estimates will be different in different
samples. Estimate of ability is item dependent
Difficult to compare scores across two different tests because not
on same scale
Often, ordinal scale of measurement for test
Assumes equal errors of measurement at all levels of ability
Item Response Theory
Complete theory of measurement and item selection
Theoretically, item characteristics are not sample
dependent; estimates of ability are not item dependent
Item scores are presented on the same scale as ability
Puts all individual scores on standardized, interval level
scale; easy to compare between tests and individuals
Item Response Theory
Assumes that a normally distributed latent trait underlies
performance on a measure
Assumes unidimensionality, i.e., all items measure the same construct
Assumes local independence, i.e., items are uncorrelated with each other when ability is held constant
Given unidimensionality, the probability of endorsing an item is a
monotonically increasing function of the latent trait
(see the item characteristic curves in next slide)
Illustration of IRT with ADL
and IADL Scales
The latent traits represent the ability to perform self-care
activities and instrumental activities (necessary for
independent living)
Item difficulty (b): the level of function corresponding to
a 50% chance of endorsing the item
Item discrimination (a): slope of the item characteristic
curve, or how well it differentiates low from high
functioning people
Example of differing item characteristic curves
(Note: parameter = 2.82 for the steep curve, 0.98 for the shallow curve)
IRT can show distribution of respondents along theta and can also
show distribution of item difficulties (lower chart)
And can also show you the theta location of different
response levels (here 0 to 3 scale)
Differential Item Functioning
Assuming that the measured ability is unidimensional and
that the items measure the same ability, the item curve
should be unique except for random variations, irrespective of the
group for whom the item curve is plotted…
…items that do not yield the same item response function
for two or more groups are violating one of the fundamental
assumptions of item response theory, namely that the item and
the test in which it is contained are measuring the same
unidimensional trait…
Possible DIF
Item Bias
Items may be biased against one gender, linguistic, or
social group
Can result in people being falsely identified with problems or
missing problems
Two elements in bias detection
Statistical detection of Differential Item Functioning
Item review
If source of problems not related to performance, then item is
biased
DIF detection
Important part of test validation
Helps to ensure measurement equivalence
Scores on individual items are compared for two groups:
Reference
Focal: group under study
Groups matched on total test score (ability)
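One common way to compare an item across reference and focal groups matched on total score is the Mantel-Haenszel common odds ratio. The slides do not name a specific statistic, so this choice is an assumption, and the counts below are hypothetical. A value near 1 suggests no DIF on the item.

```python
from collections import defaultdict

def mantel_haenszel_odds_ratio(records):
    """Mantel-Haenszel common odds ratio across total-score strata.

    records: iterable of (group, total_score, item_response), with group in
    {"reference", "focal"} and item_response in {0, 1}.
    """
    # counts[score] = [ref_yes, ref_no, focal_yes, focal_no]
    counts = defaultdict(lambda: [0, 0, 0, 0])
    for group, total, response in records:
        offset = 0 if group == "reference" else 2
        counts[total][offset + (0 if response == 1 else 1)] += 1
    numerator = denominator = 0.0
    for ref_yes, ref_no, focal_yes, focal_no in counts.values():
        n = ref_yes + ref_no + focal_yes + focal_no
        numerator += ref_yes * focal_no / n
        denominator += ref_no * focal_yes / n
    return numerator / denominator

def make_records(table):
    """Expand {score: (ref_yes, ref_no, focal_yes, focal_no)} into records."""
    records = []
    for score, (ry, rn, fy, fn) in table.items():
        records += [("reference", score, 1)] * ry + [("reference", score, 0)] * rn
        records += [("focal", score, 1)] * fy + [("focal", score, 0)] * fn
    return records

# Hypothetical counts per total-score stratum
no_dif = make_records({1: (4, 6, 4, 6), 2: (7, 3, 7, 3)})
with_dif = make_records({1: (6, 4, 4, 6), 2: (8, 2, 6, 4)})

or_no_dif = mantel_haenszel_odds_ratio(no_dif)
or_with_dif = mantel_haenszel_odds_ratio(with_dif)
print(or_no_dif, or_with_dif)
```

An item flagged this way would still need the item review step before being called biased.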
DIF detection
DIF can be uniform or nonuniform
Uniform: probability of correctly answering item is consistently higher for one group
Nonuniform: probability of correctly answering item is higher for one group at some points on the scale; perhaps lower at other points
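The distinction can be illustrated with two-parameter logistic curves for the two groups. In this sketch, which uses hypothetical parameters, a shift in difficulty alone gives uniform DIF, while a difference in discrimination makes the curves cross and gives nonuniform DIF.

```python
import math

def icc(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def dif_pattern(ref_params, focal_params, thetas, tolerance=1e-9):
    """Classify DIF by the sign of the reference-minus-focal difference."""
    diffs = [icc(t, *ref_params) - icc(t, *focal_params) for t in thetas]
    positive = any(d > tolerance for d in diffs)
    negative = any(d < -tolerance for d in diffs)
    if positive and negative:
        return "nonuniform"   # advantage switches sides along theta
    if positive or negative:
        return "uniform"      # one group is favoured everywhere
    return "none"

thetas = [x / 10 for x in range(-30, 31)]  # theta grid from -3 to 3

# Hypothetical (a, b) pairs for reference and focal groups
uniform_case = dif_pattern((1.0, 0.0), (1.0, 0.5), thetas)     # same slope, shifted difficulty
nonuniform_case = dif_pattern((1.0, 0.0), (2.0, 0.0), thetas)  # different slopes, curves cross
no_dif_case = dif_pattern((1.0, 0.0), (1.0, 0.0), thetas)
print(uniform_case, nonuniform_case, no_dif_case)
```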
3 models
One-parameter (Rasch) model provides estimates of item difficulty only
Two-parameter model provides estimates of difficulty and discrimination
Three-parameter model allows for guessing
IRT does have different methods for dichotomous and polytomous item scales
IRT models: dichotomous items
One-parameter model
Probability correct response (given theta)
= 1/{1 + exp[–(theta – item difficulty)]}
Two-parameter model
Probability correct response (given theta)
= 1/{1 + exp[–discrimination (theta – item difficulty)]}
Three-parameter model
Adds pseudo-guessing parameter
Two parameter model is most appropriate for
epidemiological research
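The three dichotomous models can be written as short functions. This sketch uses the usual logistic form, in which the probability of a correct response rises with theta. The three-parameter formula is the standard textbook form and is not spelled out on the slide; all parameter values are hypothetical.

```python
import math

def prob_1pl(theta, b):
    """One-parameter (Rasch) model: difficulty b only."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def prob_2pl(theta, a, b):
    """Two-parameter model: discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def prob_3pl(theta, a, b, c):
    """Three-parameter model: adds a pseudo-guessing lower asymptote c."""
    return c + (1.0 - c) * prob_2pl(theta, a, b)

# When theta equals the item difficulty, the 1PL and 2PL give 0.5;
# the 3PL gives c + (1 - c) / 2.
p1 = prob_1pl(0.5, 0.5)
p2 = prob_2pl(1.0, 2.0, 0.0)
p3 = prob_3pl(0.0, 1.2, 0.0, 0.2)
print(p1, p2, p3)
```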
Steps in applying IRT
Step One: Assess dimensionality
Factor analytic techniques
Exploratory factor analysis
Study ratio of first to second eigenvalues (should be 3:1 or 4:1)
Also χ2 tests for dimensionality
Calibrate items
Calculate item difficulty and discrimination and examine how
well model fits
χ2 goodness of fit test
Compare goodness of fit between one-parameter and two-parameter
models
Examine root mean square residual (values should be < 2.5)
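The eigenvalue-ratio check in Step One can be sketched as follows. The function takes an inter-item correlation matrix and returns the ratio of its first to second eigenvalue. To stay within the standard library, this sketch uses power iteration with deflation; in practice a linear algebra library would normally be used. The matrix is hypothetical.

```python
import math

def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def dominant_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    n = len(matrix)
    vector = [1.0 + 0.01 * i for i in range(n)]  # arbitrary start vector
    for _ in range(iterations):
        product = mat_vec(matrix, vector)
        norm = math.sqrt(sum(p * p for p in product))
        vector = [p / norm for p in product]
    eigenvalue = sum(v * p for v, p in zip(vector, mat_vec(matrix, vector)))
    return eigenvalue, vector

def eigenvalue_ratio(corr):
    """Ratio of the first to the second eigenvalue of a correlation matrix."""
    n = len(corr)
    first, vec = dominant_eigenpair(corr)
    # Deflate: remove the first component, then find the next largest.
    deflated = [[corr[i][j] - first * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
    second, _ = dominant_eigenpair(deflated)
    return first / second

# Hypothetical 6-item correlation matrix with all inter-item correlations 0.5
corr = [[1.0 if i == j else 0.5 for j in range(6)] for i in range(6)]
ratio = eigenvalue_ratio(corr)
print(ratio)
```

For this matrix the eigenvalues are 3.5 and 0.5, so the ratio is well above the 3:1 to 4:1 guideline.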
Steps in IRT: continued
Score the examinees
Get item information estimates
Based on discrimination adjusted for ‘standard error’
Study test information
If choosing items from a larger pool, can discard items with low information, and retain items that give more information where it is needed
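Scoring the examinees means estimating theta from each response pattern once item parameters are known. The slides do not say which estimator was used; the sketch below assumes an expected a posteriori (EAP) estimate under a two-parameter model with a standard normal prior, evaluated on a grid. Item parameters are hypothetical.

```python
import math

def prob_2pl(theta, a, b):
    """Two-parameter logistic probability of endorsing an item."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def eap_score(responses, items, grid_min=-4.0, grid_max=4.0, points=81):
    """EAP estimate of theta for one 0/1 response pattern.

    items is a list of (a, b) pairs; the prior is standard normal.
    """
    step = (grid_max - grid_min) / (points - 1)
    numerator = denominator = 0.0
    for k in range(points):
        theta = grid_min + k * step
        weight = math.exp(-0.5 * theta * theta)  # unnormalised N(0, 1) prior
        for response, (a, b) in zip(responses, items):
            p = prob_2pl(theta, a, b)
            weight *= p if response == 1 else 1.0 - p
        numerator += theta * weight
        denominator += weight
    return numerator / denominator

# Hypothetical (discrimination, difficulty) pairs for three items
items = [(1.0, -1.0), (1.5, 0.0), (0.8, 1.0)]
low = eap_score([0, 0, 0], items)
mid = eap_score([1, 1, 0], items)
high = eap_score([1, 1, 1], items)
print(low, mid, high)
```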
Item Information
Item information is a function of item difficulty and
discrimination. It is high when item difficulty is close to
the average level of function in the group and when ICC
slope is steep
The ADL scale example
Caregiver ratings of ADL and IADL performance for 1686 people
1048 with dementia and 484 without dementia
1364 had complete ratings
ADL/IADL example
Procedures
Assessed dimensionality. Found two dimensions: ADL and
IADL
Assessed fit of one-parameter and two-parameter model for each scale
Two-parameter better
Only 3 items fit one-parameter model
Significant improvement in χ2 goodness of fit
Used two-parameter model to get item statistics for 7 ADL items
and 7 IADL items
ADL/IADL
Got results for each item: difficulty, discrimination, fit to model
Results for item information and total scale information
Example of IRT with Relative’s Stress
Scale
The latent trait (theta) represents the intensity of stress due
to recent life events
Item severity or difficulty (b): the level of stress
corresponding to a 50% chance of endorsing the item
Item discrimination (a): slope of the item characteristic
curve, or how well it differentiates low from high stress
cases
Item information is a function of both: high when (b) is
close to group stress level and (a) is steep
Stress Scale: Item Information
item information is a function of item difficulty and
discrimination. It is high when item difficulty is close to
group stress level and when ICC slope is steep
item  1    2    3    4    5    6    7    8    9    10
info  .05  .5   .4   .05  .9   27   .5   .4   .06  .08
Stress Scale: Item Difficulty
Item severity or difficulty (b) indicates the level of stress
(on theta scale) corresponding to a 50% chance of
endorsing the item
item   1    2    3    4    5    6    7    8    9    10
diff.  6.2  3.9  3.4  6.2  2.8  1.6  2.3  3.8  9.5  7.9
Stress Scale: Item Discrimination
item discrimination reflected in the slope of the item
characteristic curve (ICC): how well does the item
differentiate low from high stress cases?
item  1    2    3    4    5    6    7    8    9    10
disc  0.2  0.6  0.5  0.2  0.8  4.3  0.7  0.5  0.2  0.2
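To see what the tabulated values imply, the difficulty and discrimination estimates for the ten stress items can be plugged into a two-parameter logistic curve. This sketch assumes the values are on the logit metric with no extra scaling constant, which the slides do not state. It measures how much the endorsement probability changes across a one-unit window of theta centred on each item's difficulty.

```python
import math

# Values copied from the stress scale slides (items 1 to 10)
difficulty = [6.2, 3.9, 3.4, 6.2, 2.8, 1.6, 2.3, 3.8, 9.5, 7.9]
discrimination = [0.2, 0.6, 0.5, 0.2, 0.8, 4.3, 0.7, 0.5, 0.2, 0.2]

def endorse_probability(theta, a, b):
    """Two-parameter logistic probability of endorsing an item."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def probability_change(item_number, half_width=0.5):
    """Rise in endorsement probability from b - half_width to b + half_width."""
    a = discrimination[item_number - 1]
    b = difficulty[item_number - 1]
    return (endorse_probability(b + half_width, a, b)
            - endorse_probability(b - half_width, a, b))

# Item whose curve rises most sharply around its own difficulty
sharpest = max(range(1, 11), key=probability_change)
print(sharpest, probability_change(sharpest), probability_change(1))
```

Under these assumptions, item 6 separates respondents just below and just above its difficulty far more sharply than the low-discrimination items such as item 1.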
Example of developing Index of
Instrumental Support
Community Sample: CSHA-1
Needed baseline indicator of social support as it is an important predictor of health
Concept: Availability and quality of instrumental support
Blended IRT and classical methods
Sample
8089 people
Randomly divided into two samples: development and validation
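A random split into development and validation samples can be sketched as below. The slides do not say how the 8089 people were divided, so the equal-halves split, the fixed seed, and the sequential ID numbers are all assumptions made for illustration.

```python
import random

def split_sample(ids, seed=0):
    """Randomly split IDs into development and validation halves."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical participant IDs 1..8089
development, validation = split_sample(range(1, 8090))
print(len(development), len(validation))
```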
Procedures
Item selection and coding
7 items
Procedure
IRT analyses
Tested dimensionality
Two-parameter model
Estimated item parameters
Estimated item and test information
Scored individual levels of support
External validation
Internal consistency
Construct validity
Correlation with size of social network
Correlation with marital status
Correlation with gender
Predictive validity
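Internal consistency is often summarised with Cronbach's alpha. The slides do not name the coefficient used, so alpha is an assumption here, and the response data are hypothetical.

```python
def variance(values):
    """Sample variance (n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def cronbach_alpha(responses):
    """Cronbach's alpha for a respondents x items matrix of scores."""
    k = len(responses[0])
    item_variances = [variance([row[j] for row in responses]) for j in range(k)]
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1.0 - sum(item_variances) / total_variance)

# Hypothetical data: 6 respondents x 3 dichotomous items
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
]
alpha = cronbach_alpha(responses)

# Items that always agree give the maximum value of 1
identical = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
alpha_identical = cronbach_alpha(identical)
print(alpha, alpha_identical)
```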
Empirical comparison of IRT and
CTT in scale validation
Few studies. So far, proponents of IRT assume it is better. However,
IRT and CTT often select the same items
High correlations between CTT and IRT difficulty and discrimination
Very high (0.93) correlations between CTT and IRT estimates of total score
Empirical comparisons (cont’d)
Little difference in criterion or predictive validity of IRT scores
IRT scores are only slightly better
When item discriminations are highly varied, IRT is better
IRT item parameters can be sample dependent
Need to establish validity on different samples, as in CTT
Advantages of IRT
Contribution of each item to precision of total test score
can be assessed
Estimates precision of measurement at each level of ability and for
each examinee
With large item pool, item and test information excellent for test-building to suit different purposes
Graphical illustrations are helpful
Can tailor test to needs: For example, can develop a criterion-referenced test that has most precision around the cut-off score
Advantages of IRT
Interval level scoring
More analytic techniques can be used with the scale
Ability on different tests can be easily compared
Good for tests where a core of items is administered, but different groups get different subsets (e.g., cross-cultural testing, computer adapted testing)
Disadvantages of IRT
Strict assumptions
Large sample size (minimum 200; 1000 for complex models)
More difficult to use than CTT: computer programs not readily available
Models are complex and difficult to understand
When should you use IRT?
In test-building with
Large item pool
Large number of subjects
Cross-cultural testing
To develop short versions of tests
(But also use CTT, and your knowledge of the test)
In test validation to supplement information from classical analyses
Software for IRT analyses
Rasch or one parameter models:
BICAL (Wright)
RASCH (Rossi)
RUMM 2010 http://www.arach.net.au/~rummlab/
Two or three parameter models
NOHARM (McDonald)
LOGIST
TESTFACT
LISREL
MULTILOG