Transcript: estimation_1_class

Estimation of Item Response Models
Mister Ibik
Division of Psychology in Education
Arizona State University
EDP 691: Advanced Topics in Item Response Theory
1
Motivation and Objectives
• Why estimate?
– Distinguishing feature of IRT modeling as compared to classical
techniques is the presence of parameters
– These parameters characterize and guide inference regarding
entities of interest (i.e., examinees, items)
• We will think through:
– Different estimation situations
– Alternative estimation techniques
– The logic and mathematics underpinning these techniques
– Various strengths and weaknesses
• What you will have
– A detailed introduction to principles and mathematics
– A resource to be revisited…and revisited…and revisited
2
Outline
• Some Necessary Mathematical Background
• Maximum Likelihood and Bayesian Theory
• Estimation of Person Parameters When Item Parameters are Known
– ML
– MAP
– EAP
• Estimation of Item Parameters When Person Parameters are Known
– ML
• Simultaneous Estimation of Item and Person Parameters
– JML
– CML
– MML
• Other Approaches
3
Background: Finding the Root of an Equation
• Newton-Raphson Algorithm
– Finds the root of an equation
– Example: the function f(x) = x²
[Figure: plot of f(x) = x² for x from −2.5 to 2.5]
– Has a root (where f(x) = 0) at x = 0
4
Newton-Raphson
• Newton-Raphson takes a given point, x0, and systematically
progresses to find the root of the equation
– Utilizes the slope of the function to find where the root may be
• The slope of the function is given by the derivative
– Denoted f′(x) or ∂f(x)/∂x
– Gives the slope of the straight line that is tangent to f(x) at x
– Tangent: best linear prediction of how the function is changing
– For x0, the best guess for the root is the point where the tangent line equals 0
– This occurs at x0 − f(x0) / (∂f(x0)/∂x)
– So the next candidate point for the root is: x1 = x0 − f(x0) / (∂f(x0)/∂x)
5
Newton-Raphson Updating (1)
• Suppose x0 = 1.5
• x1 = x0 − f(x0) / (∂f(x0)/∂x)
• f(x) = x², ∂f(x)/∂x = 2x
• f(x0) = 2.25, ∂f(x0)/∂x = 3
• x1 = 1.5 − 2.25/3 = 0.75
[Figure: plot of f(x) = x² with the tangent at x0 = 1.5 crossing zero at x1 = 0.75]
6
Newton-Raphson Updating (2)
• Now x1 = 0.75
• x2 = x1 − f(x1) / (∂f(x1)/∂x)
• f(x) = x², ∂f(x)/∂x = 2x
• f(x1) = 0.5625, ∂f(x1)/∂x = 1.5
• x2 = 0.75 − 0.5625/1.5 = 0.375
[Figure: plot of f(x) = x² with the tangent at x1 = 0.75 crossing zero at x2 = 0.375]
7
Newton-Raphson Updating (3)
• Now x2 = 0.375
• x3 = x2 − f(x2) / (∂f(x2)/∂x)
• f(x) = x², ∂f(x)/∂x = 2x
• f(x2) = 0.1406, ∂f(x2)/∂x = 0.75
• x3 = 0.375 − 0.1406/0.75 = 0.1875
[Figure: zoomed-in plot of f(x) = x² with the tangent at x2 = 0.375 crossing zero at x3 = 0.1875]
8
Newton-Raphson Updating (4)
• Now x3 = 0.1875
• x4 = x3 − f(x3) / (∂f(x3)/∂x)
• f(x) = x², ∂f(x)/∂x = 2x
• f(x3) = 0.0352, ∂f(x3)/∂x = 0.375
• x4 = 0.1875 − 0.0352/0.375 = 0.0938
[Figure: zoomed-in plot of f(x) = x² with the tangent at x3 = 0.1875 crossing zero at x4 = 0.0938]
9
Newton-Raphson Example
Iteration   x        f(x)     ∂f(x)/∂x   f(x)/(∂f(x)/∂x)   x − f(x)/(∂f(x)/∂x)
    0       1.5000   2.2500   3.0000     0.7500            0.7500
    1       0.7500   0.5625   1.5000     0.3750            0.3750
    2       0.3750   0.1406   0.7500     0.1875            0.1875
    3       0.1875   0.0352   0.3750     0.0938            0.0938
    4       0.0938   0.0088   0.1875     0.0469            0.0469
    5       0.0469   0.0022   0.0938     0.0234            0.0234
    6       0.0234   0.0005   0.0469     0.0117            0.0117
    7       0.0117   0.0001   0.0234     0.0059            0.0059
    8       0.0059   0.0000   0.0117     0.0029            0.0029
    9       0.0029   0.0000   0.0059     0.0015            0.0015
   10       0.0015   0.0000   0.0029     0.0007            0.0007
10
Newton-Raphson Summary
• Iterative algorithm for finding the root of an equation
• Takes a starting point and systematically progresses to find the
root of the function
• Requires the derivative of the function, ∂f(x)/∂x
• Each successive point is given by x − f(x) / (∂f(x)/∂x)
• The process continues until we get arbitrarily close, as usually
measured by the change in some function
11
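As a concrete companion to this summary, here is a minimal Python sketch (not part of the original slides) of the update x_next = x − f(x)/(∂f(x)/∂x) applied to the f(x) = x² example; the function names, tolerance, and iteration cap are illustrative choices.

```python
# A minimal Newton-Raphson sketch reproducing the f(x) = x^2 example above.

def newton_raphson(f, f_prime, x0, tol=1e-3, max_iter=50):
    """Iterate x <- x - f(x)/f'(x) until the change in x is arbitrarily small."""
    x = x0
    for _ in range(max_iter):
        x_new = x - f(x) / f_prime(x)      # the Newton-Raphson update
        if abs(x_new - x) < tol:           # stopping rule: change in x
            return x_new
        x = x_new
    return x

# f(x) = x^2 has derivative 2x; starting at x0 = 1.5 the iterates halve each
# time: 1.5, 0.75, 0.375, 0.1875, ... as in the table on the previous slide.
root = newton_raphson(lambda x: x**2, lambda x: 2 * x, x0=1.5)
print(root)
```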
Difficulties With Newton-Raphson
• Some functions have multiple roots
• Which root is found often depends on the start value
[Figure: plot of a function with multiple roots over x from −2.5 to 2.5]
12
Difficulties With Newton-Raphson
• Numerical complications can arise
• When the derivative is relatively small in magnitude, the
algorithm shoots into outer space
[Figure: plot of a function with a nearly flat region, where the slope is close to zero]
13
Logic of Maximum Likelihood
• A general approach to parameter estimation
• The use of a model implies that the data may be sufficiently
characterized by the features of the model, including the
unknown parameters
• Parameters govern the data in the sense that the data depend
on the parameters
– Given values of the parameters we can calculate the
(conditional) probability of the data
– P(Xij = 1 | θi, bj) = exp(θi – bj)/(1+ exp(θi – bj))
• Maximum likelihood (ML) estimation asks: “What are the
values of the parameters that make the data most probable?”
14
Example: Series of Bernoulli
Variables With Unknown Probability
• Bernoulli variable: P(X = 1) = p
• The probability of the data is given by p^X × (1 − p)^(1 − X)
• Suppose we have two random variables X1 and X2
  P(X1, X2 | p) = ∏_{j=1}^{2} p^{Xj} (1 − p)^{1 − Xj}
• When taken as a function of the parameters, it is called the likelihood
• Suppose X1 = 1, X2 = 0
• P(X1 = 1, X2 = 0 | p) = L(p | X1 = 1, X2 = 0) = p × (1 − p)
• Choose p to maximize the conditional probability of the data
  – For p = 0.1, L = 0.1 × (1 − 0.1) = 0.09
  – For p = 0.2, L = 0.2 × (1 − 0.2) = 0.16
  – For p = 0.3, L = 0.3 × (1 − 0.3) = 0.21
15
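A small Python sketch (not from the lecture) that evaluates the Bernoulli likelihood L(p | X1 = 1, X2 = 0) = p(1 − p) over a grid of candidate values, reproducing the figures above; the grid spacing is an arbitrary choice.

```python
# Evaluate the Bernoulli likelihood over a grid of candidate p values.

import numpy as np

x = np.array([1, 0])                 # observed data: X1 = 1, X2 = 0
p_grid = np.linspace(0.0, 1.0, 11)   # candidate values of p: 0.0, 0.1, ..., 1.0

# L(p | x) = prod_j p^{x_j} (1 - p)^{1 - x_j}
likelihood = np.array([np.prod(p**x * (1 - p)**(1 - x)) for p in p_grid])

for p, L in zip(p_grid, likelihood):
    print(f"p = {p:.1f}  L = {L:.4f}")

# The maximum is at p = 0.5 (L = 0.25), the sample proportion of 1s.
```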
Example: Likelihood Function
[Figure: the likelihood function L(p | X1 = 1, X2 = 0) = p(1 − p) plotted over p from 0 to 1, peaking at p = 0.5]
16
The Likelihood Function in IRT
• The Likelihood may be thought of as the conditional probability, where the data are known and the parameters vary
  P(X | Θ, Ω) = L(Θ, Ω | X)
• Let Pij = P(Xij = 1 | θi, ωj)
  L(Θ, Ω | X) = ∏_{i=1}^{N} ∏_{j=1}^{J} P(Xij = xij | θi, ωj) = ∏_{i=1}^{N} ∏_{j=1}^{J} (Pij)^{xij} (1 − Pij)^{1 − xij}
• The goal is to maximize this function – what values of the
parameters yield the highest value?
17
Log-Likelihood Functions
• It is numerically easier to maximize the natural logarithm of
the likelihood
  ln L(Θ, Ω | X) = ∑_{i=1}^{N} ∑_{j=1}^{J} [ xij ln(Pij) + (1 − xij) ln(1 − Pij) ]
• The log-likelihood has the same maximum as the likelihood
18
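To make the double sum concrete, here is a short Python sketch of the log-likelihood, using Rasch-model probabilities as in the earlier conditional-probability example; the response matrix, person parameters, and item difficulties below are made up for illustration.

```python
# ln L = sum_i sum_j [ x_ij ln(P_ij) + (1 - x_ij) ln(1 - P_ij) ],
# with Rasch probabilities P_ij = exp(theta_i - b_j) / (1 + exp(theta_i - b_j)).

import numpy as np

def rasch_prob(theta, b):
    """P(X_ij = 1 | theta_i, b_j) for all persons i and items j."""
    z = theta[:, None] - b[None, :]          # N x J matrix of theta_i - b_j
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(X, theta, b):
    """ln L(theta, b | X) under local and respondent independence."""
    P = rasch_prob(theta, b)
    return np.sum(X * np.log(P) + (1 - X) * np.log(1 - P))

theta = np.array([-1.0, 0.0, 1.0])           # hypothetical person parameters
b = np.array([-0.5, 0.0, 0.5, 1.0])          # hypothetical item difficulties
X = np.array([[1, 0, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 1, 0]])                 # hypothetical 0/1 responses

print(log_likelihood(X, theta, b))
```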
Maximizing the Log-Likelihood
• Note that at the maximum of the function, the slope of the
tangent line equals 0
• The slope of the tangent is given by the first derivative
• If we can find the point at which the first derivative equals 0,
we will have also found the point at which the function is
maximized
19
Overview of Numerical Techniques
• One can maximize the ln[L] function by finding a point where
its derivative is 0
• A variety of methods are available for maximizing L, or ln[L]
– Newton-Raphson
– Fisher Scoring
– Expectation-Maximization (EM)
• The generality of ML estimation and these numerical
techniques results in the same concepts and estimation routines
being employed across modeling situations
– Logistic regression, log-linear modeling, FA, SEM, LCA
20
ML Estimation of Person Parameters
When Item Parameters Are Known
• Assume item parameters bj, aj, and cj are known
• Assume unidimensionality, local and respondent independence
  P(X | Θ) = P(X1, …, XN | θ1, …, θN) = ∏_{i=1}^{N} ∏_{j=1}^{J} P(Xij | θi)
  (the conditional probability now depends on the person parameter only)
  L(θ1, …, θN | X) = ∏_{i=1}^{N} ∏_{j=1}^{J} (Pij)^{xij} (1 − Pij)^{1 − xij}
  (the likelihood function for the person parameters only)
  ln L(θ1, …, θN | X) = ∑_{i=1}^{N} ∑_{j=1}^{J} [ xij ln(Pij) + (1 − xij) ln(1 − Pij) ]
21
ML Estimation of Person Parameters
When Item Parameters Are Known
• Choose each θi such that L or ln[L] is maximized
• Let’s suppose we have one examinee
  ln L(θi | Xi) = ∑_{j=1}^{J} [ xij ln(Pij) + (1 − xij) ln(1 − Pij) ]
• Maximize this function using any of several methods
• We’ll use Newton-Raphson
22
Newton-Raphson Estimation Recap
• Recall NR seeks to find the root of a function (where the function = 0)
• NR updates follow the general structure
  x_next = x − f(x) / (∂f(x)/∂x)
  (updated value = current value − [function of interest] / [derivative of the function of interest])
• What is our function of interest? What is the derivative of this function?
23
Newton-Raphson
Estimation of Person Parameters
• Newton-Raphson uses the derivative of the function of interest
• Our function is itself a derivative: the first derivative of ln[L] with respect to θi
  ∂ln[L(θi | xi)] / ∂θi
• We’ll need the second derivative as well as the first derivative
  ∂²ln[L(θi | xi)] / ∂θi²
• Updates are given by
  θi_next = θi − [ ∂ln[L(θi | xi)] / ∂θi ] / [ ∂²ln[L(θi | xi)] / ∂θi² ]
24
ML Estimation of Person Parameters When Item
Parameters Are Known: The Log-Likelihood
• The log-likelihood to
be maximized
• Select a start value
and iterate towards a
solution using
Newton-Raphson
• A “hill-climbing”
sequence
25
ML Estimation of Person Parameters When
Item Parameters Are Known: Newton-Raphson
• Start at -1.0
  ∂ln[L(θi | xi)] / ∂θi = 3.211
  ∂²ln[L(θi | xi)] / ∂θi² = −2.920
  θi_next = −1 − (3.211 / −2.920) ≈ 0.09
26
ML Estimation of Person Parameters When
Item Parameters Are Known: Newton-Raphson
• Move to 0.09
  ∂ln[L(θi | xi)] / ∂θi = −0.335
  ∂²ln[L(θi | xi)] / ∂θi² = −3.363
  θi_next = 0.09 − (−0.335 / −3.363) ≈ −0.0001
27
ML Estimation of Person Parameters When
Item Parameters Are Known: Newton-Raphson
• Move to -0.0001
  ∂ln[L(θi | xi)] / ∂θi = −0.0003
  ∂²ln[L(θi | xi)] / ∂θi² = −3.368
• When the change in θi
is arbitrarily small
(e.g., less than 0.001),
stop estimation
• No meaningful
change in next step
• The key is that the
tangent is 0
28
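The Python sketch below mirrors the kind of iteration shown on the last few slides, but for a hypothetical examinee and hypothetical known item difficulties under the Rasch model (the lecture's actual data are not reproduced here). It uses the standard Rasch derivatives ∂ln[L]/∂θ = Σ_j (xj − Pj) and ∂²ln[L]/∂θ² = −Σ_j Pj(1 − Pj).

```python
# Newton-Raphson ML estimation of a single theta with known Rasch difficulties.

import numpy as np

def estimate_theta(x, b, theta0=0.0, tol=1e-3, max_iter=50):
    theta = theta0
    for _ in range(max_iter):
        P = 1.0 / (1.0 + np.exp(-(theta - b)))   # P_j at the current theta
        first = np.sum(x - P)                    # d ln L / d theta
        second = -np.sum(P * (1 - P))            # d^2 ln L / d theta^2
        theta_new = theta - first / second       # Newton-Raphson update
        if abs(theta_new - theta) < tol:         # stop when the change is small
            return theta_new
        theta = theta_new
    return theta

b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])        # hypothetical known difficulties
x = np.array([1, 1, 0, 1, 0])                    # one examinee's responses (not a perfect score)
print(estimate_theta(x, b, theta0=-1.0))
```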
Newton-Raphson Estimation
of Multiple Person Parameters
• But we have N examinees each with a θi to be estimated
  L(θ1, …, θN | X) = ∏_{i=1}^{N} ∏_{j=1}^{J} (Pij)^{xij} (1 − Pij)^{1 − xij}
  ln L(θ1, …, θN | X) = ∑_{i=1}^{N} ∑_{j=1}^{J} [ xij ln(Pij) + (1 − xij) ln(1 − Pij) ]
• We need a multivariate version of the Newton-Raphson
algorithm
29
First Order Derivatives

  [θ1, θ2, …, θN]′_next = [θ1, θ2, …, θN]′ − H⁻¹ g

  where g is the N × 1 vector of first derivatives, with elements ∂ln[L]/∂θ1, ∂ln[L]/∂θ2, …, ∂ln[L]/∂θN,
  and H is the N × N matrix of second derivatives with elements ∂²ln[L]/∂θi∂θk

• First order derivatives of the log-likelihood
• ∂ln[L]/∂θi only involves terms corresponding to subject i (Why???)
30
Second Order Derivatives

  (Same multivariate Newton-Raphson update as on the previous slide: θ_next = θ − H⁻¹ g.)

• Hessian: second order partial derivatives of the log-likelihood
• This matrix needs to be inverted. Why???
• In the current context, this matrix is diagonal
31
Second Order Derivatives

  (Same multivariate Newton-Raphson update: θ_next = θ − H⁻¹ g.)
• The inverse of the Hessian is diagonal with elements that are
the reciprocals of the diagonal of the Hessian
• Updates for each θi do not depend on any other subject’s θ
32
Second Order Derivatives

  (Same multivariate Newton-Raphson update: θ_next = θ − H⁻¹ g.)
• The updates for each θi are independent of one another
• The procedure can be performed one examinee at a time
33
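A tiny numerical check (not from the slides) that a diagonal Hessian makes the multivariate update equivalent to independent one-at-a-time updates; all numbers below are made up.

```python
# When H is diagonal, theta_next = theta - H^{-1} g reduces to element-wise
# updates theta_i - g_i / h_ii, so each examinee can be updated separately.

import numpy as np

g = np.array([3.2, -0.4, 1.1])          # gradient: d lnL / d theta_i per person
h_diag = np.array([-2.9, -3.4, -3.1])   # diagonal of the Hessian
H = np.diag(h_diag)
theta = np.array([-1.0, 0.1, -0.5])

full = theta - np.linalg.solve(H, g)    # multivariate update with the full matrix
elementwise = theta - g / h_diag        # one person at a time

print(np.allclose(full, elementwise))   # True
```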
ML Estimation of Person Parameters When
Item Parameters Are Known: Standard Errors
• The approximate, asymptotic standard error of the ML estimate of θi is
  SE(θ̂i) = 1 / √I(θi) ≈ 1 / √I(θ̂i)
• where I(θi) is the information function: I(θi) = −E[ ∂²ln[L] / ∂θi² ]
• Standard errors are
– asymptotic with respect to the number of items
– approximate because only an estimate of θi is employed
– asymptotically approximately unbiased
34
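A short sketch of this standard-error calculation, assuming the Rasch model, in which the information function reduces to I(θ) = Σ_j Pj(1 − Pj); the difficulty values and the estimate plugged in are hypothetical.

```python
# SE(theta_hat) = 1 / sqrt(I(theta_hat)), with Rasch information sum_j P_j (1 - P_j).

import numpy as np

def rasch_se(theta_hat, b):
    P = 1.0 / (1.0 + np.exp(-(theta_hat - b)))
    info = np.sum(P * (1 - P))              # test information at theta_hat
    return 1.0 / np.sqrt(info)

b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])   # hypothetical known difficulties
print(rasch_se(0.25, b))                    # SE for a hypothetical estimate
```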
ML Estimation of Person Parameters When
Item Parameters Are Known: Strengths
• ML estimates have some desirable qualities
– They are consistent
– If a sufficient statistic exists, then the MLE is a function of that
statistic (Rasch models)
– Asymptotically normally distributed
– Asymptotically most efficient (least variable) estimator among
the class of normally distributed unbiased estimators
• Asymptotically with respect to what?
35
ML Estimation of Person Parameters When
Item Parameters Are Known: Weaknesses
• ML estimates have some undesirable qualities
– Estimates may fly off into outer space
– They do not exist for so-called “perfect scores” (all 1’s or 0’s)
– Can be difficult to compute or verify when the likelihood
function is not single peaked (may occur with 3-PLM or more
complex IRT models)
36
ML Estimation of Person Parameters When
Item Parameters Are Known: Weaknesses
• Strategies to handle wayward solutions
– Bound the amount of change at any one iteration
• Atheoretical
• No longer common
– Use an alternative estimation framework (Fisher, Bayesian)
• Strategies to handle perfect scores
– Do not estimate θi
– Use an alternative estimation framework (Bayesian)
• Strategies to handle local maxima
– Re-estimate the parameters using different starting points and
look for agreement
37
ML Estimation of Person Parameters When
Item Parameters Are Known: Weaknesses
• An alternative to the Newton-Raphson technique is Fisher’s
method of scoring
– Instead of the Hessian, it uses the information matrix (based on
the Hessian)
– This usually leads to quicker convergence
– Often is more stable than Newton-Raphson
• But what about those perfect scores?
38
Bayes’ Theorem
• We can avoid some of the problems that occur in ML
estimation by employing a Bayesian approach
• All entities treated as random variables
• Bayes’ Theorem for random variables A and B
  Posterior distribution of A, given B: “The probability of A, given B.”

  P(A | B) = P(B | A) P(A) / P(B)

  where P(B | A) is the conditional probability of B given A, P(A) is the prior probability of A,
  and P(B) is the marginal probability of B
39
Bayes’ Theorem
• If A is discrete
  P(A | B) = P(B | A) P(A) / P(B) = P(B | A) P(A) / ∑_A P(B | A) P(A)
• If A is continuous
  P(A | B) = P(B | A) P(A) / P(B) = P(B | A) P(A) / ∫_A P(B | A) P(A) dA
• Note that P(B|A) = L(A|B)
40
Bayesian Estimation of
Person Parameters: The Posterior
• Select a prior distribution for θi denoted P(θi)
• Recall the likelihood function takes on the form P(Xi | θi)
• The posterior density of θi given Xi is
PXi | i P i 
Pi | Xi  

PXi 

PXi |  i P i 
 PX
i
| i P i d i

• Since P(Xi) is a constant
Pi | Xi   PXi | i Pi 
41
Bayesian Estimation of
Person Parameters: The Posterior
  P(θi | Xi) ∝ P(Xi | θi) P(θi)
  (The Posterior ∝ The Likelihood × The Prior)
42
Maximum A Posteriori
Estimation of Person Parameters
  P(θi | Xi) ∝ P(Xi | θi) P(θi)
• The Maximum A Posteriori (MAP) estimate θ̃i is the maximum of the posterior density of θi
• Computed by maximizing the posterior density, or its log
• Find θi such that
  ∂ln[P(θi | Xi)] / ∂θi = 0
• Use Newton-Raphson or Fisher scoring
• Max of ln[P(θi| Xi)] occurs at max of ln[P(Xi | θi)] + ln[P(θi)]
• This can be thought of as augmenting the likelihood with prior
information
43
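A sketch of MAP estimation via Newton-Raphson under a N(0, 1) prior and a Rasch likelihood (hypothetical items and responses); the prior contributes −θ to the first derivative and −1 to the second, which is why an estimate exists even for a perfect score.

```python
# MAP estimation of theta: augment the ln L derivatives with the N(0,1) prior terms.

import numpy as np

def estimate_theta_map(x, b, theta0=0.0, tol=1e-3, max_iter=50):
    theta = theta0
    for _ in range(max_iter):
        P = 1.0 / (1.0 + np.exp(-(theta - b)))
        first = np.sum(x - P) - theta            # d ln[posterior] / d theta
        second = -np.sum(P * (1 - P)) - 1.0      # d^2 ln[posterior] / d theta^2
        theta_new = theta - first / second       # Newton-Raphson update
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])                 # hypothetical difficulties
print(estimate_theta_map(np.array([1, 1, 1, 1, 1]), b))   # exists even for a perfect score
print(estimate_theta_map(np.array([1, 1, 0, 1, 0]), b))
```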
Choice of Prior Distribution
• Choosing P(θi) ~ U(−∞, ∞) makes the posterior proportional to the likelihood
  P(θi | Xi) ∝ P(Xi | θi) P(θi) ∝ P(Xi | θi)
• In this case, the MAP is very similar to the ML estimate
• The prior distribution P(θi) is often assumed to be N(0, 1)
– The normal distribution commonly justified by appeal to CLT
– Choice of mean and variance identifies the scale of the latent
continuum
44
MAP Estimation of Person Parameters: Features
• The approximate, asymptotic standard error of the MAP is
  SE(θ̃i) = 1 / √I(θi) ≈ 1 / √I(θ̃i)
  where I(θi) is the information from the posterior density
• Advantages of the MAP estimator
– Exists for every response pattern – why?
– Generally leads to a reduced tendency for local extrema
• Disadvantages of the MAP estimator
– Must specify a prior
– Exhibits shrinkage in that it is biased towards the mean: May
need lots of items to “swamp” the prior if it’s misspecified
– Calculations are iterative and may take a long time
– May result in local extrema
45
Expected A Posteriori (EAP)
Estimation of Person Parameters
• The Expected A Posteriori (EAP) estimator is the mean of the posterior distribution

  θ̄i = ∫ θi P(θi | Xi) dθi
• Exact computations are often intractable
• We approximate the integral using numerical techniques
• Essentially, we take a weighted average of the values, where
the weights are determined by the posterior distribution
– Recall that the posterior distribution is itself determined by the
prior and the likelihood
46
Numerical Integration Via Quadrature
[Figure: posterior distribution with quadrature points; the evaluated heights sum to ≈ .165, so a height of .002 gets weight .002/.165 = .015 and a height of .021 gets weight .021/.165 = .127]
• The Posterior
Distribution
• With quadrature
points
• Evaluate the heights
of the distribution at
each point
• Use the relative
heights as the
weights
47
EAP Estimation of θ via Quadrature
• The Expected A Posteriori (EAP) is estimated by a weighted average:

  θ̄i = ∫ θi P(θi | Xi) dθi ≈ ∑_r Qr H(Qr)

  where H(Qr) is the weight of point Qr in the posterior (compare Embretson & Reise, 2000, p. 177)
• The standard error is the standard deviation in the posterior and may also be approximated via quadrature

  SE(θ̄i) = √[ ∫ (θi − θ̄i)² P(θi | Xi) dθi ] ≈ √[ ∑_r (Qr − θ̄i)² H(Qr) ]
48
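A sketch of the EAP computation via quadrature, assuming a Rasch likelihood, a N(0, 1) prior, and an evenly spaced grid of quadrature points; the data and grid choices below are illustrative.

```python
# EAP and posterior SD via quadrature: weights H(Q_r) are normalized prior x likelihood heights.

import numpy as np

def eap(x, b, n_points=41, lo=-4.0, hi=4.0):
    Q = np.linspace(lo, hi, n_points)                       # quadrature points
    prior = np.exp(-0.5 * Q**2)                             # N(0,1) up to a constant
    P = 1.0 / (1.0 + np.exp(-(Q[:, None] - b[None, :])))    # P_j at each point
    like = np.prod(P**x * (1 - P)**(1 - x), axis=1)         # L(Q_r | x)
    H = prior * like
    H = H / H.sum()                                         # posterior weights
    theta_eap = np.sum(Q * H)                               # posterior mean
    se_eap = np.sqrt(np.sum((Q - theta_eap)**2 * H))        # posterior SD
    return theta_eap, se_eap

b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])   # hypothetical known difficulties
print(eap(np.array([1, 1, 0, 1, 0]), b))
print(eap(np.array([0, 0, 0, 0, 0]), b))    # defined even for an all-0 pattern
```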
EAP Estimation of θ via Quadrature
• Advantages
– Exists for all possible response patterns
– Non-iterative solution strategy
– Not a maximum, therefore no local extrema
– Has smallest MSE in the population
• Disadvantages
– Must specify a prior
– Exhibits shrinkage to the prior mean: If the prior is misspecified,
may need lots of items to “swamp” the prior
49
ML Estimation of Item Parameters When
Person Parameters Are Known: Assumptions
• Assume
– person parameters θi are known
– respondent and local independence
  L(b1, a1, c1, …, bJ, aJ, cJ | X) = ∏_{i=1}^{N} ∏_{j=1}^{J} (Pij)^{xij} (1 − Pij)^{1 − xij}

  ln L(b1, a1, c1, …, bJ, aJ, cJ | X) = ∑_{i=1}^{N} ∑_{j=1}^{J} [ xij ln(Pij) + (1 − xij) ln(1 − Pij) ]
• Choose values for item parameters that maximize ln[L]
50
Newton-Raphson Estimation
  [b1, a1, c1, …, bJ, aJ, cJ]′_next = [b1, a1, c1, …, bJ, aJ, cJ]′ − H⁻¹ g

  where g is the 3J × 1 vector of first derivatives of ln[L] with respect to each item parameter
  (∂ln[L]/∂b1, ∂ln[L]/∂a1, ∂ln[L]/∂c1, …, ∂ln[L]/∂cJ) and H is the 3J × 3J matrix of second
  derivatives, with elements such as ∂²ln[L]/∂b1², ∂²ln[L]/∂a1∂b1, and ∂²ln[L]/∂cJ∂bJ
• What is the structure of this matrix?
51
ML Estimation of Item Parameters
When Person Parameters Are Known
• Just as we could estimate subjects one at a time thanks to
respondent independence, we can estimate items one at time
thanks to local independence
• Multivariate Newton-Raphson, one item at a time:

  [bj, aj, cj]′_next = [bj, aj, cj]′ − Hj⁻¹ gj

  where gj = [ ∂ln[L]/∂bj, ∂ln[L]/∂aj, ∂ln[L]/∂cj ]′ and Hj is the 3 × 3 matrix of second
  derivatives of ln[L] with respect to bj, aj, and cj
52
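As a hedged illustration of estimating one item's parameters when the θs are treated as known, the sketch below fits a single 2-PL item by maximizing ln[L] with a general-purpose optimizer (scipy) rather than the hand-coded Newton-Raphson update shown on the slide; the simulated θs and responses are made up.

```python
# ML estimation of one 2-PL item's (b_j, a_j) with "known" person parameters.

import numpy as np
from scipy.optimize import minimize

def neg_log_lik(params, theta, x):
    """Negative log-likelihood for a single 2-PL item: params = (b_j, a_j)."""
    b, a = params
    P = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    P = np.clip(P, 1e-10, 1 - 1e-10)               # guard against log(0)
    return -np.sum(x * np.log(P) + (1 - x) * np.log(1 - P))

rng = np.random.default_rng(0)
theta = rng.normal(size=500)                       # treated-as-known person parameters
true_b, true_a = 0.3, 1.2                          # generating values for the simulation
P_true = 1.0 / (1.0 + np.exp(-true_a * (theta - true_b)))
x = (rng.uniform(size=500) < P_true).astype(int)   # simulated 0/1 responses

result = minimize(neg_log_lik, x0=np.array([0.0, 1.0]), args=(theta, x))
print(result.x)                                    # estimates of (b_j, a_j)
```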
ML Estimation of Item Parameters When
Person Parameters Are Known: Standard Errors
• To obtain the approximate, asymptotic standard errors
  – Invert the associated information matrix, which yields the variance-covariance matrix
  – Take the square root of the elements of the diagonal
  √( Diag[ I(b, a, c)⁻¹ ] )
• Asymptotic w.r.t. sample size and approximate because we only have estimates of the parameters
• This is conceptually similar to those for the estimation of θ: SE(θ̂i) = √( I(θ̂i)⁻¹ )
• But why do we need a matrix approach?
53
ML Estimation of Item Parameters When
Person Parameters Are Known: Standard Errors
• ML estimates of item parameters have the same properties as
those for person parameters: consistent, efficient, asymptotic
(w.r.t. subjects)
• aj parameters can be difficult to estimate, tend to get inflated
with small sample sizes
• cj parameters are often difficult to estimate well
– Usually because there’s not a lot of information in the data about
the asymptote
– Especially true when items are easy
• Generally need larger and more heterogeneous samples to
estimate 2-PL and 3-PL
• Can employ Bayesian estimation (more on this later)
54