Item Analysis: Classical and Beyond


Item Analysis: Classical and Beyond
SCROLLA Symposium
Measurement Theory and Item Analysis
Modified for EPE/EDP 711 by Kelly Bradley on January 8, 2013
Why is item analysis relevant?
Item analysis provides a way of measuring the
quality of questions - seeing how appropriate
they were for the respondents and how well they
measured their ability.
Item analysis also provides a way of re-using
items in different instruments with prior
knowledge of how they are going to perform.
What kinds of item analysis are there?
Item Analysis
• Classical
• Latent Trait Models
  – Rasch
  – Item Response Theory (IRT1, IRT2, IRT3, IRT4)
Classical Test Theory
Classical analysis is the easiest and most
widely used form of analysis. The statistics
can be computed by generic statistical
packages (or at a push by hand) and need
no specialist software.
Classical analysis is performed on the survey
or test instrument as a whole rather than on
the item. Although item statistics can be
generated, they apply only to that group of
students on that collection of items.
Classical Test Theory Assumptions
Classical test theory assumes that any test
score (or survey instrument sum) is
composed of a “true” value plus random error.
Crucially, it assumes that this error is normally
distributed, uncorrelated with the true score,
and has a mean of zero.
x_obs = x_true + G(0, σ_err)
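A minimal sketch of this assumption in Python, using made-up true scores and an assumed error spread:

```python
import numpy as np

rng = np.random.default_rng(0)

n_respondents = 500
true_scores = rng.uniform(20, 80, size=n_respondents)  # hypothetical "true" values
error_sd = 5.0                                          # assumed error spread

# Observed score = true score plus normally distributed error with mean zero
observed = true_scores + rng.normal(0.0, error_sd, size=n_respondents)

# The error averages out to roughly zero and is uncorrelated with the true score
errors = observed - true_scores
print(round(errors.mean(), 2))
print(round(np.corrcoef(true_scores, errors)[0, 1], 2))
```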
Classical Analysis Statistics
• Difficulty (item-level statistic)
• Discrimination (item-level statistic)
• Reliability (instrument-level statistic)
Classical Test Theory Difficulty
The difficulty of a (single response selection)
question in classical analysis is simply the
proportion of people who answered the
question incorrectly. For multiple mark
questions, it is the average mark expressed
as a proportion.
Given on a scale of 0-1, the higher the
proportion the greater the difficulty.
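A small Python illustration with a made-up 0/1 response matrix, computing difficulty as the proportion of incorrect answers:

```python
import numpy as np

# 0/1 scored responses: rows = respondents, columns = items (made-up data)
responses = np.array([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
])

# Difficulty as defined above: proportion answering each item incorrectly,
# so higher values mean harder items (0-1 scale)
difficulty = 1 - responses.mean(axis=0)
print(difficulty)   # the last item was answered correctly by everyone, so its difficulty is 0.0
```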
Classical Test Theory Discrimination
The discrimination of an item is the (Pearson)
correlation between candidates' marks on the
item and their total test marks.
Being a correlation, it can vary from –1 to +1,
with higher values indicating (desirable) high
discrimination.
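A matching sketch for discrimination on made-up data (in practice the item is often removed from the total before correlating, to avoid inflating the value):

```python
import numpy as np

# Made-up 0/1 response matrix: rows = respondents, columns = items
responses = np.array([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
])

total = responses.sum(axis=1)   # each respondent's total test mark

# Pearson correlation of each item's marks with the total test marks
discrimination = [np.corrcoef(responses[:, j], total)[0, 1]
                  for j in range(responses.shape[1])]
print(np.round(discrimination, 2))
```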
Classical Test Theory Reliability
Reliability is a measure of how well the test or
survey “holds together”. For practical
reasons, internal consistency estimates are
the easiest to obtain; these indicate the
extent to which each item correlates with
every other item.
This is measured on a scale of 0-1. The
greater the number, the higher the reliability.
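One common internal consistency estimate is Cronbach's alpha; a minimal sketch on made-up 0/1 responses:

```python
import numpy as np

def cronbach_alpha(scores):
    """Internal consistency (Cronbach's alpha) for an items-in-columns score matrix."""
    k = scores.shape[1]                          # number of items
    item_variances = scores.var(axis=0, ddof=1)  # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Made-up 0/1 responses: rows = respondents, columns = items
responses = np.array([
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
])
print(round(cronbach_alpha(responses), 2))   # about 0.62 for this made-up matrix
```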
Classical Analysis versus
Latent Trait Models
• Classical analysis has the survey, or test (not
the item), as its basis. Although the statistics
generated are often generalized to similar
populations completing a similar survey, or
taking a similar test, they really apply only to
those students taking that test.
• Latent trait models aim to look beyond that, at the
underlying traits which produce the test
performance. They are measured at item level
and provide sample-free measurement.
Latent Trait Models
• Latent trait models have been around since the
1940s, but were not widely used until the 1960s.
Although theoretically possible, it is practically
infeasible to use them without specialist software.
• They aim to measure the underlying ability (or trait)
which produces the test performance rather than
measuring performance per se.
• This leads to them being sample-free. As the
statistics are not dependent on the test situation
which generated them, they can be used more
flexibly.
Rasch versus
Item Response Theory
Mathematically, Rasch is identical to the most basic IRT
model (IRT1); however, there are some important
differences which make it a more viable proposition
for practical testing. (A sketch of the shared
mathematical form follows this slide.)
For instance,
• In Rasch the model is superior. Data which do not fit
the model are discarded (carefully, not simply dumped).
• Rasch does not permit abilities to be estimated for
extreme items and persons.
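A minimal sketch of that shared mathematical form (the Rasch / IRT1 probability of a correct response, written without the 1.7 scaling constant some formulations include):

```python
import math

def rasch_prob(theta, b):
    """Rasch / IRT1 probability of a correct response, given ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Probability rises with ability and falls as item difficulty grows (made-up values)
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(rasch_prob(theta, b=0.5), 2))
```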
IRT - the generalized model
P_g(θ) = c_g + (1 - c_g) / (1 + e^(-a_g(θ - b_g)))
where
a_g = gradient of the ICC at the point b_g (item discrimination)
b_g = the ability level at which the gradient a_g is maximal (item difficulty)
c_g = probability of persons of very low ability correctly answering (or endorsing) question g
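A short Python version of this three-parameter logistic form, evaluated with made-up parameter values:

```python
import math

def irt3_prob(theta, a, b, c):
    """Probability that a person of ability theta answers (endorses) the item,
    given discrimination a, difficulty b and lower asymptote ("guessing") c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# A made-up four-option multiple-choice item: c near 1/4, moderate discrimination
print(round(irt3_prob(theta=0.0, a=1.2, b=0.5, c=0.25), 2))
```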
IRT - Item Characteristic Curves
• An ICC plots the probability of a respondent
correctly answering (endorsing) the question
against their ability (likeliness to endorse).
The higher the ability, the higher the
chance that they will respond correctly.
On the curve: c is the lower intercept, b is the
ability at which the gradient is maximal, and
a is that gradient.
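A minimal plotting sketch (Python with matplotlib, made-up items) for drawing a few ICCs:

```python
import numpy as np
import matplotlib.pyplot as plt

def irt3_prob(theta, a, b, c):
    """Three-parameter logistic ICC (same form as the formula above)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 200)          # ability axis

# Three made-up items with different discrimination, difficulty and guessing values
for a, b, c in [(1.5, 0.0, 0.0), (0.8, -1.0, 0.0), (1.2, 1.0, 0.25)]:
    plt.plot(theta, irt3_prob(theta, a, b, c), label=f"a={a}, b={b}, c={c}")

plt.xlabel("Ability (theta)")
plt.ylabel("Probability of correct response / endorsement")
plt.legend()
plt.show()
```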
IRT - About the Parameters Difficulty
• Although there is no “correct” difficulty for
any one item, it is clearly desirable that the
difficulty of the test (or survey instrument) is
centred around the average ability of the
respondents.
• The higher the “b” parameter, the more
difficult the question.
– The higher b is, the lower the probability of
the question being answered correctly at any
given ability level.
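For example, holding ability and the other parameters fixed (made-up values), a larger b gives a lower chance of a correct answer:

```python
import math

def irt_prob(theta, a, b, c=0.0):
    """Logistic IRT probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# For a respondent of average ability (theta = 0), harder items (larger b)
# give a lower chance of a correct answer
for b in (-1.0, 0.0, 1.0, 2.0):
    print(b, round(irt_prob(theta=0.0, a=1.0, b=b), 2))
```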
IRT - About the Parameters Discrimination
• In IRT (unlike Rasch) maximal discrimination
is sought.
– Thus the higher the “a” parameter the more
desirable the question.
• Differences in the discrimination of questions
can lead to differences in the difficulties of
questions across the ability range.
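A small numerical sketch of that effect: with different “a” values the ICCs cross, so the relative difficulty of two made-up items reverses between low and high ability:

```python
import math

def irt2_prob(theta, a, b):
    """Two-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Item X: low discrimination.  Item Y: high discrimination, slightly higher b.
# Their ICCs cross, so which item is "harder" depends on the ability level.
for theta in (-2.0, 2.0):
    p_x = irt2_prob(theta, a=0.5, b=-0.5)
    p_y = irt2_prob(theta, a=2.0, b=0.5)
    print(theta, round(p_x, 2), round(p_y, 2))
```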
IRT - About the Parameters Guessing
• A high “c” parameter suggests that
candidates with very little ability may choose
the correct answer.
• This is rarely a valid parameter outwith
multiple choice testing, and the value
should not vary excessively from the
reciprocal of the number of choices
(roughly 0.25 for a four-option item).
IRT - Parameter Estimation
• Before being used (in an item bank or for
measurement) items must first be calibrated;
that is, their parameters must be estimated.
• There are two main procedures - Joint Maximum
Likelihood and Marginal Maximum Likelihood. JML is
most common for IRT1 and 2, while MML is used
more frequently for IRT3.
• Bayesian estimation and estimated bounds may be
imposed on the data to avoid high discrimination
items being over-valued.
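A much-simplified calibration sketch (not the full JML procedure, which alternates between person and item estimates): abilities are treated as known, and the discrimination and difficulty of one simulated item are estimated by maximum likelihood, with bounds on “a” standing in for the constraints mentioned above:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Simulate one item's 0/1 responses from persons of known ability (made-up 2PL truth)
theta = rng.normal(size=1000)
true_a, true_b = 1.3, 0.4
p_true = 1.0 / (1.0 + np.exp(-true_a * (theta - true_b)))
x = rng.binomial(1, p_true)

def neg_log_likelihood(params):
    a, b = params
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    p = np.clip(p, 1e-9, 1 - 1e-9)            # guard against log(0)
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Bounding "a" stands in for the constraints used to stop discrimination estimates running away
result = minimize(neg_log_likelihood, x0=[1.0, 0.0],
                  bounds=[(0.2, 4.0), (-4.0, 4.0)])
print(np.round(result.x, 2))                  # estimated (a, b), close to the simulated values
```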