
Introduction to Estimation Techniques in Item
Response Theory Models
Mister Ibik
Division of Psychology in Education
Arizona State University
Applied Item Response Theory Conference
September 21, 2006
Tempe, Arizona
1
Motivation and Objectives
• Why estimate?
– Distinguishing feature of IRT modeling as compared to classical
techniques is the presence of parameters
– These parameters characterize and guide inference regarding
entities of interest (i.e., examinees, items)
• We will think through:
– Different estimation situations
– Alternative estimation techniques
– The logic (and some math) underpinning these techniques
– Various strengths and weaknesses
• What you will have
– An overview focusing on principles
– A resource to be revisited…and revisited…and revisited
2
Outline
• Maximum Likelihood and Bayesian Theory
• Estimation of Person Parameters When Item Parameters are Known
– ML
– MAP
– EAP
• Estimation of Item Parameters When Person Parameters are Known
– ML
• Simultaneous Estimation of Item and Person Parameters
– JML
– CML
– MML
• (Some) Other approaches
3
Notation
• Persons indexed by i = 1,…,N
• Items indexed by j = 1,…,J
• Xij denotes the random variable corresponding to the scored
item responses for person i to item j; xij denotes the realized
value
• Xi denotes the vector of item responses for person i
• θi denotes the latent trait (ability) of person i
• ωj denotes the vector of parameters for item j (bj, aj, cj)
• X denotes the full collection of item responses (data)
• Θ = (θ1,…, θN) denotes the full collection of latent traits
• Ω = (ω1,…, ωJ) denotes the full collection of item parameters
4
Maximum Likelihood: Logic
• A general approach to parameter estimation
• The use of a model implies that the data may be sufficiently
characterized by the features of the model, including the
unknown parameters
• Parameters govern the data in the sense that the data depend
on the parameters
– Given values of the parameters we can calculate the
(conditional) probability of the data
– P(Xij = 1 | θi, bj) = exp(θi – bj)/(1+ exp(θi – bj))
• Maximum likelihood (ML) estimation asks: “What are the
values of the parameters that make the data most probable?”
5
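As a concrete illustration of “given values of the parameters we can calculate the probability of the data,” here is a minimal Python sketch of the Rasch response function quoted above; the function name and example values are illustrative choices, not part of the talk.

```python
import numpy as np

def rasch_prob(theta, b):
    """P(X_ij = 1 | theta_i, b_j) = exp(theta - b) / (1 + exp(theta - b))."""
    return np.exp(theta - b) / (1.0 + np.exp(theta - b))

# A person of average ability (theta = 0) facing an easy item (b = -1.5)
# answers correctly with high probability:
print(rasch_prob(0.0, -1.5))   # approximately 0.82
```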
Decomposing The Joint
Distribution Into Conditional Probabilities
• Assume
– Respondent independence
– Local independence
• The joint distribution of the data may be represented as
PX | Θ, Ω  PX1 ,, X N | 1 ,, N , ω1 ,, ω J 
N
  PX i |  i , ω1 , ω 2 ,  , ω J 
i 1
  PX ij |  i , ω j 
N
J
i 1
j 1
6
The Likelihood
• The Likelihood may be thought of as the conditional
probability, where the data are known and the parameters vary
PX | Θ, Ω  LΘ, Ω | X
• Let Pij = P(Xij = 1 | θi, ωj)
LΘ, Ω | X    P X ij  xij |  i , ω j 
N
J
i 1
j 1
N
J
i 1
j 1
  ( Pij ) ij (1  Pij )
x
1 xij
• The goal is to maximize this function – what values of the
parameters yield the highest value?
7
Maximum Likelihood Estimation
• It is numerically easier to maximize the natural logarithm of
the likelihood
ln L(Θ, Ω | X) = ∑_{i=1}^{N} ∑_{j=1}^{J} [ xij ln(Pij) + (1 − xij) ln(1 − Pij) ]
• The log-likelihood has the same maximum as the likelihood
8
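A small sketch of the log-likelihood just written out, assuming the Rasch model, known item difficulties, and made-up responses for a single examinee; trying different trial values of θ is exactly the “which value makes the data most probable” question.

```python
import numpy as np

def log_likelihood(theta, b, x):
    """ln L(theta | x) = sum_j [ x_j ln(P_j) + (1 - x_j) ln(1 - P_j) ],
    Rasch model with known item difficulties b."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

b = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # hypothetical difficulties
x = np.array([1, 1, 1, 0, 0])               # hypothetical responses
for trial in (-1.0, 0.0, 1.0):
    print(trial, log_likelihood(trial, b, x))
```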
ML: Numerical Techniques
• One can maximize a function by finding a point where its
derivative is 0
• A variety of methods are available for maximizing L, or ln[L]
– Newton-Raphson
– Fisher Scoring
– Estimation-Maximization (EM)
• The generality of ML estimation and these numerical
techniques results in the same concepts and estimation routines
being employed across modeling situations
– Logistic regression, log-linear modeling, FA, SEM, LCA
9
ML Estimation of Person Parameters When
Item Parameters Are Known: Assumptions
• Assume
– item parameters bj, aj, and cj, are known
– respondent and local independence
Conditional probability now depends on the person parameter only:
P(X | Θ) = P(X1,…, XN | θ1,…, θN) = ∏_{i=1}^{N} ∏_{j=1}^{J} P(Xij | θi)
Likelihood function for the person parameters only:
L(θ1,…, θN | X) = ∏_{i=1}^{N} ∏_{j=1}^{J} (Pij)^xij (1 − Pij)^(1−xij)
10
ML Estimation of Person Parameters
When Item Parameters Are Known
ln L(θ1,…, θN | X) = ∑_{i=1}^{N} ∑_{j=1}^{J} [ xij ln(Pij) + (1 − xij) ln(1 − Pij) ]
• Choose each θi such that ln[L] is maximized
• Example from Embretson and Reise (2000)
– One examinee responding to 10 items
ln L(θi | Xi) = ∑_{j=1}^{J} [ xij ln(Pij) + (1 − xij) ln(1 − Pij) ]
– To maximize ln[L] we need to find where the derivative is 0
– Let’s do things visually…
11
ML Estimation of Person Parameters When Item
Parameters Are Known: The Log-Likelihood
• The log-likelihood to
be maximized
• Select a start value
and iterate towards a
solution using
Newton-Raphson
• A “hill-climbing”
sequence
12
ML Estimation of Person Parameters When
Item Parameters Are Known: Newton-Raphson
• Start at -1.0
• If we were at the
maximum, the
derivative would be 0
• ∂ ln L(θi | xi) / ∂θi = 3.211
• Need to move to the
right…how much
depends on the second
derivative
13
ML Estimation of Person Parameters When
Item Parameters Are Known: Newton-Raphson
• Move to 0.09
• ∂ ln L(θi | xi) / ∂θi = −0.335
• Need to move to the
left…but only a little
14
ML Estimation of Person Parameters When
Item Parameters Are Known: Newton-Raphson
• Move to -0.0001
• ∂ ln L(θi | xi) / ∂θi = 0.0003
• Need to move?
• When the change in θi
is arbitrarily small
(e.g., less than 0.001),
stop estimation
• No meaningful
change in next step
• The key is that the
tangent is 0: the fact
that the estimate is
also 0 is coincidence!
15
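A minimal sketch of the Newton-Raphson hill-climbing just illustrated, assuming the Rasch model with known difficulties; the stopping rule mirrors the slide (stop once the change in θ is arbitrarily small). The responses, difficulties, and start value are hypothetical.

```python
import numpy as np

def ml_theta(b, x, theta=-1.0, tol=0.001, max_iter=50):
    """ML estimate of theta for one examinee via Newton-Raphson (Rasch model)."""
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(theta - b)))
        first = np.sum(x - p)             # d ln L / d theta
        second = -np.sum(p * (1 - p))     # d^2 ln L / d theta^2
        step = first / second
        theta = theta - step              # Newton-Raphson update
        if abs(step) < tol:               # change in theta is arbitrarily small
            break
    return theta

b = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])   # hypothetical difficulties
x = np.array([1, 1, 0, 1, 0])               # hypothetical responses
print(ml_theta(b, x))
```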
ML Estimation of Person Parameters
When Item Parameters Are Known
• We demonstrated univariate estimation for a single θi
• But we have N examinees each with a θi to be estimated
• Multivariate estimation based on L(θ1,…, θN | X) much more
involved…
• However, thanks to the assumption of respondent independence,
the partial derivative of the multivariate likelihood function with
respect to one examinee’s parameter involves only the terms
corresponding to that examinee
• The payoff is that the updates to each θi are independent of
one another
• We can proceed one examinee at a time
16
ML Estimation of Person Parameters When
Item Parameters Are Known: Standard Errors
• The approximate, asymptotic standard error of the ML
estimate of θi is
SE(θ̂i) = 1 / √I(θi) ≈ 1 / √I(θ̂i)
• where I(θi) is the information function: I(θi) = −E[ ∂² ln[L] / ∂θi² ]
• Standard errors are
– asymptotic with respect to the number of items
– approximate because only an estimate of θi is employed
– asymptotically approximately unbiased
17
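For the Rasch model the information function works out to I(θi) = ∑j Pij(1 − Pij), so the standard error above can be sketched as follows; the difficulties and the value plugged in for θ̂i are hypothetical.

```python
import numpy as np

def rasch_se(theta_hat, b):
    """SE(theta_hat) = 1 / sqrt(I(theta_hat)), with I(theta) = sum_j P_j (1 - P_j)."""
    p = 1.0 / (1.0 + np.exp(-(theta_hat - b)))
    return 1.0 / np.sqrt(np.sum(p * (1 - p)))

b = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])   # hypothetical difficulties
print(rasch_se(0.0, b))                     # SE evaluated at the estimate
```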
ML Estimation of Person Parameters When
Item Parameters Are Known: Strengths
• ML estimates have some desirable qualities
– They are consistent
– If a sufficient statistic exists, then the MLE is a function of that
statistic (Rasch models)
– Asymptotically normally distributed
– Asymptotically most efficient (least variable) estimator among
the class of normally distributed unbiased estimators
• Asymptotically with respect to what?
18
ML Estimation of Person Parameters When
Item Parameters Are Known: Weaknesses
• ML estimates have some undesirable qualities
– Estimates may fly off into outer space
– They do not exist for so called “perfect scores” (all 1’s or 0’s)
– Can be difficult to compute or verify when the likelihood
function is not single peaked (may occur with 3-PLM or more
complex IRT models)
19
ML Estimation of Person Parameters When
Item Parameters Are Known: Weaknesses
• Strategies to handle wayward solutions
– Bound the amount of change at any one iteration
• Atheoretical
• No longer common
– Use an alternative estimation framework (Fisher, Bayesian)
• Strategies to handle perfect scores
– Do not estimate θi
– Use an alternative estimation framework (Bayesian)
• Strategies to handle local maxima
– Re-estimate the parameters using different starting points and
look for agreement
20
ML Estimation of Person Parameters When
Item Parameters Are Known: Alternatives
• An alternative to the Newton-Raphson technique is Fisher’s
method of scoring
– Instead of a matrix of second derivatives (used to see how far to
step in climbing the hill) it uses the information matrix (which is
based on the second derivatives)
– This usually leads to quicker convergence
– Often is more stable than Newton-Raphson
• But what about those perfect scores?
21
Bayes’ Theorem
• We can avoid some of the problems that occur in ML
estimation by employing a Bayesian approach
• Bayes’ Theorem for random variables A and B
P(A | B) = P(B | A) P(A) / P(B)
where P(A | B) is the posterior distribution of A given B (“the probability of A, given B”), P(B | A) is the conditional probability of B given A, P(A) is the prior probability of A, and P(B) is the marginal probability of B
22
Bayesian Estimation of
Person Parameters: The Posterior
• Select a prior distribution for θi denoted P(θi)
• Recall the likelihood function takes on the form P(Xi | θi)
• The posterior density of θi given Xi is
PX i |  i P i 
P i | X i  

PX i 

PX i |  i P i 
 PX
i
|  i P i d i

• Since P(Xi) is a constant
Pi | Xi   PXi | i Pi 
23
Bayesian Estimation of
Person Parameters: The Posterior
Pi | Xi   PXi | i Pi 
The Likelilhood
The Prior
The Posterior
24
Maximum A Posteriori
Estimation of Person Parameters
Pi | Xi   PXi | i Pi 
~
• The Maximum A Posteriori (MAP) estimate  i is the
maximum of the posterior density of θi
• Computed by maximizing the posterior density
• What we’ve done is add prior beliefs to the data, whereas in
ML all we had was the data
• If the prior is the uniform distribution (aka “noninformative”)
then the posterior is proportional to the likelihood
Pi | Xi   PXi | i 
• In this case, the MAP is very similar to the ML estimate
25
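A sketch of MAP estimation under a standard normal prior (the prior is an assumption of this example, not something the slide specifies). Relative to the ML Newton-Raphson above, the log prior simply adds −θ to the first derivative and −1 to the second, which is what keeps perfect scores from flying off.

```python
import numpy as np

def map_theta(b, x, theta=0.0, tol=0.001, max_iter=50):
    """MAP estimate of theta under a N(0, 1) prior (Rasch model, known b)."""
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(theta - b)))
        first = np.sum(x - p) - theta         # gradient of log-likelihood + log-prior
        second = -np.sum(p * (1 - p)) - 1.0   # second derivative of the log posterior
        step = first / second
        theta = theta - step
        if abs(step) < tol:
            break
    return theta

b = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])
print(map_theta(b, np.array([1, 1, 1, 1, 1])))   # finite even for a perfect score
```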
MAP Estimation of Person Parameters: Features
• The approximate, asymptotic standard error of the MAP is
SE(θ̃i) = 1 / √I(θi) ≈ 1 / √I(θ̃i)
where I(θi) is the information from the posterior density
• Advantages of the MAP estimator
– Exists for every response pattern – why?
– Generally leads to a reduced tendency for local extrema
• Disadvantages of the MAP estimator
– Must specify a prior
– Exhibits shrinkage in that it is biased towards the mean: May
need lots of items to “swamp” the prior if it’s misspecified
– Calculations are iterative and may take a long time
– May result in local extrema
26
Expected A Posteriori (EAP)
Estimation of Person Parameters
• The Expected A Posteriori (EAP) estimator is the mean of the
posterior distribution
θ̄i = ∫ θi P(θi | Xi) dθi
• Exact computations are often intractable
• We approximate the integral using numerical techniques
• Essentially, we take a weighted average of the values, where
the weights are determined by the posterior distribution
– Recall that the posterior distribution is itself determined by the
prior and the likelihood
27
Numerical Integration Via Quadrature
[Figure: the posterior distribution with quadrature points; the heights sum to ≈ .165 and are rescaled to relative weights, e.g., .002 ⁄ .165 = .015 and .021 ⁄ .165 = .127]
• The Posterior
Distribution
• With quadrature
points
• Evaluate the heights
of the distribution at
each point
• Use the relative
heights as the
weights
28
EAP Estimation of θi via Quadrature
• The Expected A Posteriori (EAP) is estimated by a weighted
average:
θ̄i = ∫ θi P(θi | Xi) dθi ≈ ∑_r Qr H(Qr)
where H(Qr) is the weight of point Qr in the posterior (compare
Embretson & Reise, 2000, p. 177)
• The standard error is the standard deviation in the posterior
and may also be approximated via quadrature
s(θi) = √[ ∫ (θi − θ̄i)² P(θi | Xi) dθi ] ≈ √[ ∑_r (Qr − θ̄i)² H(Qr) ]
29
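A sketch of the EAP and posterior-standard-deviation computations via quadrature, assuming the Rasch model and a N(0, 1) prior; the grid of quadrature points and the data are arbitrary illustrative choices.

```python
import numpy as np

def eap_theta(b, x, n_points=41):
    """EAP estimate and posterior SD via quadrature (Rasch model, N(0, 1) prior)."""
    q = np.linspace(-4, 4, n_points)                        # quadrature points Q_r
    prior = np.exp(-0.5 * q**2)                             # N(0, 1) heights (unnormalized)
    p = 1.0 / (1.0 + np.exp(-(q[:, None] - b[None, :])))    # P_j at each point
    like = np.prod(p**x * (1 - p)**(1 - x), axis=1)         # likelihood at each point
    post = prior * like
    h = post / post.sum()                                   # H(Q_r): relative heights
    eap = np.sum(q * h)                                     # weighted average
    sd = np.sqrt(np.sum((q - eap)**2 * h))                  # posterior SD
    return eap, sd

b = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])
x = np.array([1, 1, 0, 1, 0])
print(eap_theta(b, x))
```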
EAP Estimation of Person Parameters
• Advantages
– Exists for all possible response patterns
– Non-iterative solution strategy
– Not a maximum, therefore no local extrema
– Has smallest MSE in the population specified by the prior
• Disadvantages
– Must specify a prior
– Exhibits shrinkage to the prior mean: If the prior is misspecified,
may need lots of items to “swamp” the prior
30
ML Estimation of Item Parameters When
Person Parameters Are Known: Assumptions
• Assume
– person parameters θi are known
– respondent and local independence
Conditional probability now depends on the item parameters only:
P(X | Ω) = P(X1,…, XN | ω1,…, ωJ) = ∏_{i=1}^{N} ∏_{j=1}^{J} P(Xij | ωj)
Likelihood function for the item parameters only:
L(ω1,…, ωJ | X) = ∏_{i=1}^{N} ∏_{j=1}^{J} (Pij)^xij (1 − Pij)^(1−xij)
31
ML Estimation of Item Parameters When
Person Parameters Are Known: The Likelihood
ln L(ω1, ω2,…, ωJ | X) = ∑_{i=1}^{N} ∑_{j=1}^{J} [ xij ln(Pij) + (1 − xij) ln(1 − Pij) ]
• Choose each ωj such that ln[L] is maximized
• Just as we could estimate subjects one at a time thanks to
respondent independence, we can estimate items one at a time
thanks to local independence
– Can add a few items and calibrate them without recalibrating the
items already calibrated
– Useful for item banking and linking
32
ML Estimation of Item Parameters When
Person Parameters Are Known: Estimation
ln L(ω1, ω2,…, ωJ | X) = ∑_{i=1}^{N} ∑_{j=1}^{J} [ xij ln(Pij) + (1 − xij) ln(1 − Pij) ]
• Estimate ω1, then ω2, then ω3,…
• In the case of the Rasch model we have one item parameter,
ωj = bj, and estimation is similar to the case of estimating subjects
• In the case of the 2-PL, ωj = (bj, aj), and 3-PL, ωj = (bj, aj, cj),
we can go item by item, but not parameter by parameter
• Use a multivariate form of the Newton-Raphson to update
item parameters
33
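A sketch of going item by item for the 2-PL with the θi treated as known. For convenience it hands the maximization to scipy’s general-purpose optimizer instead of coding the multivariate Newton-Raphson mentioned above; the simulated data are only there to show that the item’s parameters are roughly recovered.

```python
import numpy as np
from scipy.optimize import minimize

def fit_2pl_item(theta, x):
    """ML estimates of (a_j, b_j) for one item, person parameters known."""
    def neg_log_lik(params):
        a, b = params
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        p = np.clip(p, 1e-10, 1 - 1e-10)          # guard against log(0)
        return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return minimize(neg_log_lik, x0=[1.0, 0.0], method="Nelder-Mead").x

rng = np.random.default_rng(0)
theta = rng.normal(size=500)                       # "known" person parameters
x = rng.binomial(1, 1.0 / (1.0 + np.exp(-1.2 * (theta - 0.3))))
print(fit_2pl_item(theta, x))                      # roughly recovers a = 1.2, b = 0.3
```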
ML Estimation of Item Parameters When
Person Parameters Are Known: Standard Errors
• To obtain the approximate, asymptotic standard errors
– Invert the associated information matrix, which yields the
variance-covariance matrix
– Take the square root of the elements of the diagonal:
√( diag{ [I(b, a, c)]⁻¹ } )
• Asymptotic w.r.t. sample size and approximate because we
only have estimates of the parameters
• This is conceptually similar to those for the estimation of θ:
SE(θ̂i) = √( [I(θ̂i)]⁻¹ )
• But why do we need a matrix approach?
34
ML Estimation of Item Parameters When
Person Parameters Are Known: Features
• ML estimates of item parameters have same properties as
those for person parameters: consistent, efficient, asymptotic
(w.r.t. subjects)
• aj parameters can be difficult to estimate and tend to get inflated
with small sample sizes
• cj parameters are often difficult to estimate well
– Usually because there’s not a lot of information in the data about
the asymptote
– Especially true when items are easy
• Generally need larger and more heterogeneous samples to
estimate 2-PL and 3-PL
• Can employ Bayesian estimation (more on this later)
35
Estimation of Person and Item
Parameters When Neither Are Known
• Often we won’t know either person or item parameters
• Need to simultaneously estimate all parameters
• Model indeterminacies rear their heads
• “Straightforward” ML of all parameters not feasible
• Use ML procedures
– Joint Maximum Likelihood
– Conditional Maximum Likelihood
– Marginal Maximum Likelihood
• Use Bayesian procedures
– Marginal Maximum Likelihood
– Fully Bayesian
36
Joint Maximum Likelihood Estimation
• A 2-step iterative
approach, starting with
initial “estimates” of
item parameters
• Stopping rules: Iterate
until the absolute change
in…
– any item parameter is
arbitrarily small (<.001)
– all item parameters is
arbitrarily small (<.001)
– the likelihood is
arbitrarily small (<.001)
– etc.
The two alternating steps:
1. Solve for person parameter estimates using current item parameter estimates
2. Solve for item parameter estimates using current person parameter estimates
37
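A bare-bones sketch of the JML alternation for the Rasch model, reusing the Newton-Raphson updates from earlier. Identification is handled crudely by centering the difficulties, persons with perfect or zero scores are dropped beforehand, and the stopping rule is just a fixed number of cycles; operational programs do considerably more.

```python
import numpy as np

def jml_rasch(x, n_cycles=25, n_inner=10):
    """Joint ML for the Rasch model: alternate updates of theta given b
    and b given theta. Assumes perfect/zero scores were removed."""
    n, j = x.shape
    theta, b = np.zeros(n), np.zeros(j)
    for _ in range(n_cycles):
        for _ in range(n_inner):   # step 1: persons given current items
            p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
            theta = theta + np.sum(x - p, axis=1) / np.sum(p * (1 - p), axis=1)
        for _ in range(n_inner):   # step 2: items given current persons
            p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
            b = b + np.sum(p - x, axis=0) / np.sum(p * (1 - p), axis=0)
        b = b - b.mean()           # identification: fix the scale origin
    return theta, b

rng = np.random.default_rng(1)
true_theta, true_b = rng.normal(size=300), np.linspace(-1.5, 1.5, 8)
x = rng.binomial(1, 1.0 / (1.0 + np.exp(-(true_theta[:, None] - true_b[None, :]))))
keep = (x.sum(axis=1) > 0) & (x.sum(axis=1) < x.shape[1])   # drop perfect/zero scores
theta_hat, b_hat = jml_rasch(x[keep])
print(np.round(b_hat, 2))   # compare with true_b
```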
Joint Maximum Likelihood Estimation of
Person and Item Parameters: Weaknesses
• JML estimates are not consistent unless both the number of
subjects and the number of items grow large in the correct ratio
• This never happens
• As N grows large, the number of person parameters grows, which
leads to inconsistency in the estimates of the item parameters
– Some attempts have been made to correct for the bias
in the parameter estimates
– These seem to be dependent on the actual configurations of the
parameters (i.e., not general)
[Figure: the N × J person-by-item response matrix, with persons 1,…, N1,…, N2 in rows and items 1,…, J1,…, J2 in columns]
38
Joint Maximum Likelihood Estimation of
Person and Item Parameters: Weaknesses
• Estimates of standard errors calculated based on the
information matrix
• These estimates are only asymptotically unbiased if the
parameter estimates are consistent, which they are not
• The standard error for an individual’s parameter, θi, ignores
the uncertainty in the item parameter estimates
• The standard error for an item’s parameter(s) ignores the
uncertainty in the person parameter estimates
• Everything difficult about ML gets worse
– Local maxima, minima
– Difficulties with estimating a’s and especially c’s
39
Conditional Maximum Likelihood
Estimation of Person and Item Parameters
• In the Rasch model, we have sufficient statistics for the
parameters
• Can condition on the sufficient statistic for one type of parameter
and maximize the resulting function (the conditional likelihood)
with respect to the other type of parameter.
– θi can be estimated without reference to bj
– bj can be estimated without reference to θi
– Typically, we estimate bj and then use ML to estimate θi
• The resulting estimates are consistent!
• Numerically intensive, tough with a large number of items
• We do not have sufficient statistics for θi in the 2- and 3-PL
– Cannot get the benefits of CML
40
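A sketch of the CML idea for the Rasch model: the raw score ri is sufficient for θi, so conditioning on it leaves a likelihood that involves only the difficulties, via elementary symmetric functions of exp(−bj). The data are simulated and the optimizer is an off-the-shelf convenience; only differences among the bj are identified, hence the centering at the end.

```python
import numpy as np
from scipy.optimize import minimize

def esf(eps):
    """Elementary symmetric functions gamma_0, ..., gamma_J of eps = exp(-b)."""
    gamma = np.zeros(len(eps) + 1)
    gamma[0] = 1.0
    for e in eps:
        gamma[1:] = gamma[1:] + e * gamma[:-1]
    return gamma

def neg_cond_log_lik(b, x):
    """Negative conditional log-likelihood: theta conditioned out via raw scores."""
    gamma = esf(np.exp(-b))
    r = x.sum(axis=1)                          # raw scores (sufficient statistics)
    return np.sum(x @ b) + np.sum(np.log(gamma[r]))

rng = np.random.default_rng(2)
theta = rng.normal(size=400)
true_b = np.array([-1.0, -0.3, 0.2, 0.4, 0.7])
x = rng.binomial(1, 1.0 / (1.0 + np.exp(-(theta[:, None] - true_b[None, :]))))
fit = minimize(neg_cond_log_lik, x0=np.zeros(5), args=(x,), method="BFGS")
print(np.round(fit.x - fit.x.mean(), 2))       # centered estimates, compare with true_b
```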
Marginal Maximum Likelihood Estimation
of Person and Item Parameters: Motivation
• One way to view the consistency problem in JML is that in
order to get solid estimates of item parameters, we need more
people, but each person brings along a new θ to be estimated
• Wouldn’t it be great if this weren’t the case? Wouldn’t it be
great if we could have a likelihood function that did not have
θi?
• CML does this using a sufficient statistic to get rid of θi, but it
can only be employed for the Rasch model
• How else can we get rid of θi?
41
Marginal Maximum Likelihood Estimation
of Person and Item Parameters: Theory
• Marginal Maximum Likelihood (MML) integrates θi out of the
likelihood function
• The resulting function is called the marginal likelihood
• May be viewed as the likelihood after averaging (or
integrating or collapsing) over the distribution of θi
• The big idea:
– With known values of the θi we could estimate the item
parameters
– Though we don’t know the θi, we assume a distribution…
– And say θi is at each location with a certain probability and
estimate item parameters accordingly
42
MML Estimation of Person and Item
Parameters: The Marginal Likelihood
Marginal likelihood: L(ω1,…, ωJ | X) = ∏_{i=1}^{N} ∫ [ ∏_{j=1}^{J} (Pij)^xij (1 − Pij)^(1−xij) ] P(θi) dθi
– The bracketed term is the likelihood for one examinee’s response pattern
– The outer product is over examinees; the integral is over the prior distribution P(θi)
• Essentially we’re evaluating the likelihood at each possible
value of θi and then taking a weighted average, where the
weights are defined by the prior distribution
43
Numerical Integration Via Quadrature
[Figure: the normal prior with quadrature points; the heights sum to ≈ 5 and are rescaled to relative weights, e.g., .054 ⁄ 5 = .011 and .242 ⁄ 5 = .048]
• Normal Distribution
is the most common
choice of prior
• Select quadrature
points
• Evaluate the heights
of the distribution at
each point
• Use the relative
heights as the
weights
44
MML: Estimation
1. Choose quadrature points, calculate weights
2. Maximize the marginal likelihood or the log of the marginal
likelihood to obtain item parameter estimates
3. Employ the item parameter estimates to estimate person
parameters
– ML
– MAP
– EAP
45
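A sketch of the marginal log-likelihood from the previous slides, evaluated via quadrature under a N(0, 1) prior for the Rasch model; step 2 would maximize this function over the item parameters (with a numerical optimizer or, more commonly, the EM algorithm described next). The data are simulated.

```python
import numpy as np

def marginal_log_lik(b, x, n_points=41):
    """Marginal log-likelihood of Rasch difficulties b, with theta integrated
    out against a N(0, 1) prior via quadrature."""
    q = np.linspace(-4, 4, n_points)                        # quadrature points Q_r
    w = np.exp(-0.5 * q**2)
    w = w / w.sum()                                         # prior weights
    p = 1.0 / (1.0 + np.exp(-(q[:, None] - b[None, :])))    # P_j at each point
    pattern = p[None] ** x[:, None, :] * (1 - p[None]) ** (1 - x[:, None, :])
    like = pattern.prod(axis=2)                             # L(x_i | Q_r), persons x points
    return np.sum(np.log(like @ w))                         # sum of log marginals

rng = np.random.default_rng(3)
theta = rng.normal(size=200)
true_b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
x = rng.binomial(1, 1.0 / (1.0 + np.exp(-(theta[:, None] - true_b[None, :]))))
print(marginal_log_lik(true_b, x))
```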
MML: Estimation via the EM Algorithm
• The Expectation-Maximization (EM) algorithm is a general
method for estimating parameters given incomplete data
• In IRT, our data are incomplete because we “only” have item
responses, not values for the θi
• Big idea
– If we had complete data (i.e., values for the θi) we could
estimate item parameters
– We “fill in” the incomplete data (the Expectation) to get
complete data to enable estimation (the Maximization)
– Revisit the incomplete data, and repeat
46
MML: Item Parameter
Estimation via the EM Algorithm
• Start with provisional estimates for the item parameters and
pre-specified points along the θ scale (as in quadrature)
1. Expectation step
– Obtain the posterior distribution of θ via Bayes’ Theorem
• Use the prior for θ and the likelihood function for the data
(treating item parameters as known)
• Approximate this by using the quadrature points
– Estimate the number of examinees who get each item right at
each quadrature point
• These are called pseudocounts
47
MML: Item Parameter
Estimation via the EM Algorithm
2. Maximization step
– Use the estimated posterior distribution and pseudocounts
(from E-step) as though they were real data
• As if the posterior was the true distribution of θ
• As if the pseudocounts were the true numbers of correct
responses at θ
– Estimate item parameters using ML with known θ
• Take these new parameter estimates and return to the E-step
• Iterate between E- and M-steps until convergence
• Upon completion, run one iteration of Newton-Raphson to
find final parameter estimates and standard errors
48
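A sketch of one E-step of the kind just described, assuming the Rasch model and a N(0, 1) prior on a fixed quadrature grid. For each quadrature point it returns the expected number of examinees located there and the expected number of correct responses to each item there; these pseudocounts are what the M-step treats as data when it re-estimates the item parameters.

```python
import numpy as np

def e_step(b, x, n_points=41):
    """E-step pseudocounts for Rasch MML-EM, given provisional difficulties b."""
    q = np.linspace(-4, 4, n_points)                        # quadrature points
    prior = np.exp(-0.5 * q**2)
    prior = prior / prior.sum()
    p = 1.0 / (1.0 + np.exp(-(q[:, None] - b[None, :])))
    pattern = p[None] ** x[:, None, :] * (1 - p[None]) ** (1 - x[:, None, :])
    post = pattern.prod(axis=2) * prior                     # posterior, persons x points
    post = post / post.sum(axis=1, keepdims=True)
    n_r = post.sum(axis=0)                                  # expected examinees at each point
    r_jr = post.T @ x                                       # expected correct responses, points x items
    return q, n_r, r_jr

rng = np.random.default_rng(4)
theta = rng.normal(size=150)
true_b = np.array([-1.0, 0.0, 0.5, 1.0])
x = rng.binomial(1, 1.0 / (1.0 + np.exp(-(theta[:, None] - true_b[None, :]))))
q, n_r, r_jr = e_step(np.zeros(4), x)                       # provisional b = 0
print(n_r.sum(), r_jr.sum(axis=0))                          # recovers N and the item totals
```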
MML: Person Parameter
Estimation via the EM Algorithm
• As a byproduct of the estimation, we get probabilities for the
latent distribution at each quadrature point (part of the E-step)
– These constitute an empirical representation of the posterior
distribution of θ in the population
– Use these in conjunction with estimates of item parameters to
estimate person parameters
– MAP, EAP
49
MML: Strengths and Weaknesses
• Parameter estimates and standard errors seem to be consistent
• Estimates available for all response patterns
• Marginal likelihoods seem to follow general likelihood theory
for hypothesis testing
• However, we must specify a prior distribution
– Look at existing research
– Can estimate features of the prior by “floating” the prior
– Estimates may be robust to prior assumptions if data are
sufficient to swamp the prior
• Existing software (BILOG, PARSCALE, MULTILOG,
ConQuest)
50
Bayesian Modal Estimation:
Marginal Maximum a Posteriori
• We’ve already seen Bayesian elements and arguments
– MAP, EAP, MML
• Can include prior distributions for item parameters
– Use these priors to augment the likelihood to yield the posterior
– Maximize the posterior (or its log) to get estimates
– Estimates known as Marginal Maximum a Posteriori (MMAP)
– Helps to stabilize estimation
• Fewer local maxima
• Fewer parameters flying off into space
– Much more common for c’s and a’s than b’s
– Encounter issues of prior specification, shrinkage bias
51
Conclusion: What Have We
Done And Where Do We Go?
• We’ve covered a lot!
• Takeaway points
– Different estimation circumstances
– Principles of ML estimation
– Differences between Bayesian and ML approaches
– Strategies and logic of estimation
• What to do now
– Take a deep breath
– Revisit this, connecting principles and pictures to equations
– Revisit this again in other contexts
• Estimation of other models
• (Even more) alternatives in IRT (e.g., MCMC)
52