Transcript presentation (slides)
Measurement: 7 key statistical and
psychometric principles for policy researchers
Andrew Ho
Harvard Graduate School of Education
University of Michigan Short Course
Ann Arbor, Michigan, December 7, 2016
1
Measurement: 7 Key Principles
1. We don’t validate tests. We validate score uses.
2. Content is king. Not models. Content.
3. Start with Classical Test Theory (CTT): The
descriptive stats of measurement.
4. Your reliability is not Reliability (G Theory).
5. Item Response Theory (IRT) is just a model. A
very, very useful model.
6. Your scale is pliable. Bend it; don’t break it.
7. Know the process that generated your scores.
Use them accordingly.
2
Front Matter
3
How to learn measurement?
Learn it. Use it.
Learn it again.
Use it again.
4
Google Docs for Advance Readings
NB Annotations? Perusall? Or Google Docs?
Read.
Write.
Do.
5
Measurement: 7 Key Principles
1. We don’t validate tests. We validate score uses.
2. Content is king. Not models. Content.
3. Start with Classical Test Theory (CTT): The
descriptive stats of measurement.
4. Your reliability is not Reliability (G Theory).
5. Item Response Theory (IRT) is just a model. A
very, very useful model.
6. Your scale is pliable. Bend it; don’t break it.
7. Know the process that generated your scores.
Use them accordingly.
6
Validation (Kane, 2006; 2013)
© Andrew Ho
Harvard Graduate School of Education 7
4 Measurement Theories (see Briggs, 2013)
1. Operationalism (Bridgman, 1927): “we mean by a
concept nothing more than a set of operations.”
2. Instrumentalism (Duhem, 1954): a good
measurement is useful.
3. Representationalism (Stevens, 1946; Suppes & Zinnes,
1963): Axiomatic distinctions between scales (NOIR),
discernable empirically.
4. Classicism (Michell, 1990): A good measurement must
be quantitative (has equal-interval properties); we can
verify its nature empirically.
• Modern test validation theory (e.g., Kane, 2013) is
dominated by instrumentalism, concerned with test
score uses and interpretations.
• This can be frustrating!
© Andrew Ho
Harvard Graduate School of Education 8
Five sources of validity evidence for score use (5 Cs)
1. Content
– Evidence based on test content
• e.g., Alignment studies
2. Cognition
– Evidence based on response processes
• e.g., Think-aloud protocols
3. Coherence
– Evidence based on internal structure
• e.g., Reliability analyses, EFA/CFA/IRT
4. Correlation
– Evidence based on relations to other variables
• e.g., Convergent evidence
5. Consequence
– Evidence based on consequences of testing
• e.g., Long-term evaluations. Had you not measured… what then?
[Callout:] “I developed a scale with theory. I fit a CFA and got a good CFI. My reliability is greater than 0.8. My scores predict desirable outcomes. So I have a valid and reliable measure.”
[Callout:] “WAIT! What are your scores? How will you use them? What would have happened had you not measured?”
© Andrew Ho
Harvard Graduate School of Education 9
Measurement: 7 Key Principles
1. We don’t validate tests. We validate score uses.
2. Content is king. Not models. Content.
3. Start with Classical Test Theory (CTT): The
descriptive stats of measurement.
4. Your reliability is not Reliability (G Theory).
5. Item Response Theory (IRT) is just a model. A
very, very useful model.
6. Your scale is pliable. Bend it; don’t break it.
7. Know the process that generated your scores.
Use them accordingly.
10
Measurement (1) Know your items. Read each one. Take the test.

Col  Var  Variable Description and Labels
1    X1   You have high standards of teacher performance. (1 = strongly disagree, 2 = disagree, 3 = slightly disagree, 4 = slightly agree, 5 = agree, 6 = strongly agree)
2    X2   You are continually learning on the job. (1 = strongly disagree, 2 = disagree, 3 = slightly disagree, 4 = slightly agree, 5 = agree, 6 = strongly agree)
3    X3   You are successful in educating your students. (1 = not successful, 2 = a little successful, 3 = successful, 4 = very successful)
4    X4   It’s a waste of time to do your best as a teacher. (1 = strongly agree, 2 = agree, 3 = slightly agree, 4 = slightly disagree, 5 = disagree, 6 = strongly disagree)
5    X5   You look forward to working at your school. (1 = strongly disagree, 2 = disagree, 3 = slightly disagree, 4 = slightly agree, 5 = agree, 6 = strongly agree)
6    X6   How much of the time are you satisfied with your job? (1 = never, 2 = almost never, 3 = sometimes, 4 = always)

As is typical of many datasets, TSUCCESS contains:
• Multiple variables – or “indicators” – that record teachers’ responses to the survey items.
• These multiple indicators are intended to provide teachers with replicate opportunities to report their job satisfaction (“teacher job satisfaction” being the focal “construct” in the research).

To incorporate these multiple indicators successfully into subsequent analysis – whether as outcome or predictor – you must deal with several issues:
1. You must decide whether each of the indicators should be treated as a separate variable in subsequent analyses, or whether they should be combined to form a “composite” measure of the underlying construct of teacher job satisfaction.
2. To form such a composite, you must be able to confirm that the multiple indicators actually “belong together” in a single composite.
3. If you can confirm that the multiple indicators do indeed belong together in a composite, you must decide on the “best way” to form that composite.
© John Willett and Andrew Ho, Harvard Graduate School of Education
Unit 6a – Slide 11
Measurement (2): Always know the scale of your items. Score your test.
Indicators are not created equally.
Different scales
Positive or negative wording/direction/“polarity”
Different variances on similar scales
Different means on similar scales (difficulty)
Different associations with the construct (discrimination)
Var  Variable Description and Labels
X1   You have high standards of teacher performance. (1 = strongly disagree, 2 = disagree, 3 = slightly disagree, 4 = slightly agree, 5 = agree, 6 = strongly agree)
X2   You are continually learning on the job. (1 = strongly disagree, 2 = disagree, 3 = slightly disagree, 4 = slightly agree, 5 = agree, 6 = strongly agree)
X3   You are successful in educating your students. (1 = not successful, 2 = a little successful, 3 = successful, 4 = very successful)
X4   It’s a waste of time to do your best as a teacher. (1 = strongly agree, 2 = agree, 3 = slightly agree, 4 = slightly disagree, 5 = disagree, 6 = strongly disagree)
X5   You look forward to working at your school. (1 = strongly disagree, 2 = disagree, 3 = slightly disagree, 4 = slightly agree, 5 = agree, 6 = strongly agree)
X6   How much of the time are you satisfied with your job? (1 = never, 2 = almost never, 3 = sometimes, 4 = always)
• Different Indicators Have Different
Metrics:
i. Indicators X1, X2, X4, & X5 are
measured on 6-point scales.
ii. Indicators X3 & X6 are measured on 4-point scales.
iii. Does this matter, and how do we deal
with it in the compositing process?
iv. Is there a “preferred” scale length?
• Some Indicators “Point” In A
“Positive” Direction And Some In A
“Negative” Direction:
i. Notice the coding direction of X4,
compared to the directions of the rest of
the indicators.
ii. When we composite the indicators,
what should we do about this?
• Coding Indicators On The “Same”
Scale Does Not Necessarily Mean
That They Have The Same “Value”
At The Same Scale Points:
i. Compare scale point “3” for indicators
X3 and X6, for instance.
ii. How do we deal with this, in
compositing?
© John Willett and Andrew Ho, Harvard Graduate School of
Education
Unit 6a – Slide 12
Measurement (3) Look at your data…
. list X1-X6 in 1/35, nolabel clean

        X1   X2   X3   X4   X5   X6
  1.     5    5    3    3    4    2
  2.     4    3    2    1    1    2
  3.     4    4    2    2    2    2
  4.     .    6    3    5    3    3
  5.     4    4    3    2    4    3
  6.     .    5    2    4    3    3
  7.     4    4    4    4    5    3
  8.     6    4    4    1    1    2
  9.     6    6    3    6    5    3
 10.     3    5    3    6    3    3
 (remaining rows not shown)

Every row is a person. A person-by-item matrix, a standard data representation in psychometrics. Note that we have some missing data.

*-----------------------------------------------------------------------------
* Input the raw dataset, name and label the variables and selected values.
*-----------------------------------------------------------------------------
* Input the target dataset:
infile X1-X6 using "C:\My Documents\ … \Datasets\TSUCCESS.txt"

* Label the variables:
label variable X1 "Have high standards of teaching"
label variable X2 "Continually learning on job"
label variable X3 "Successful in educating students"
label variable X4 "Waste of time to do best as teacher"
label variable X5 "Look forward to working at school"
label variable X6 "Time satisfied with job"

* Label the values of the variables:
label define lbl1 1 "Strongly Disagree"  2 "Disagree"  3 "Slightly Disagree" ///
                  4 "Slightly Agree"     5 "Agree"     6 "Strongly Agree"
label values X1 X2 X3 lbl1

label define lbl2 1 "Strongly Agree"     2 "Agree"     3 "Slightly Agree" ///
                  4 "Slightly Disagree"  5 "Disagree"  6 "Strongly Disagree"
label values X4 lbl2

label define lbl3 1 "Not Successful"  2 "Somewhat Successful" ///
                  3 "Successful"      4 "Very Successful"
label values X3 lbl3

label define lbl4 1 "Almost Never"  2 "Sometimes" ///
                  3 "Almost Always" 4 "Always"
label values X6 lbl4

Standard data-input and indicator-naming statements. Label items descriptively, ideally with item stems/prompts.
Make absolutely sure that your item scales are oriented in the same direction: Positive should mean something similar. If not, fix it sooner or later.
© Andrew Ho, Harvard Graduate School of Education
Unit 6a – Slide 13
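A minimal Stata sketch (not from the original slides) of the “fix it sooner or later” step: reverse-coding X4 so that a higher value means greater satisfaction, plus the 1-to-6 rescaling of a 4-point item that a later slide mentions. Variable names follow the TSUCCESS items above.

* Reverse-code the negatively worded 6-point item:
gen X4r = 7 - X4
label variable X4r "Waste of time to do best as teacher (reversed)"
* One possible linear rescaling of a 4-point item to the 1-6 range:
gen X3r = 5*(X3 - 1)/3 + 1
gen X6r = 5*(X6 - 1)/3 + 1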
Measurement (4): Exploratory Data Analysis for Item Responses
. tabstat X1-X6, stats(mean sd n min max) col(statistics)

    variable |      mean        sd       N   min   max
    ---------+-------------------------------------------
          X1 |  4.331175  1.090758    5097     1     6
          X2 |  3.873361  1.247791    5109     1     6
          X3 |  3.152216   .673924    5144     1     4
          X4 |  4.223199  1.669808    5121     1     6
          X5 |  4.418882   1.33348    5116     1     6
          X6 |  2.835902  .5724269    5125     1     4
Are these items on the same “scale”?
[Figure: histograms (Percent by response category) for each item: Have high standards of teaching, Continually learning on job, Successful in educating students, Waste of time to do best as teacher, Look forward to working at school, Time satisfied with job.]
. table NMISSING
[Output: frequency of respondents by NMISSING, the number of missing item responses per respondent.]
© Andrew Ho, Harvard Graduate School of Education
Unit 6a – Slide 14
Measurement (7): To Standardize or not to Standardize
For an additive composite of “raw” indicators:
Each indicator remains in its original metric. Composite scores are the sum of the scores on the raw indicators, for each person in the sample:
$X_i = X_{1i} + X_{2i} + X_{3i} + X_{4i} + X_{5i} + X_{6i}$
where $X_{1i}$ is the raw score of the $i$th teacher on the 1st indicator, and so on…

For an additive composite of “standardized” indicators:
First, each indicator is standardized to a mean of 0 and a standard deviation of 1:
$X_{1i}^* = \frac{X_{1i} - 4.33}{1.09}$, $X_{2i}^* = \frac{X_{2i} - 3.87}{1.24}$, $X_{3i}^* = \frac{X_{3i} - 3.15}{0.67}$, $X_{4i}^* = \frac{X_{4i} - 4.23}{1.67}$, $X_{5i}^* = \frac{X_{5i} - 4.42}{1.33}$, $X_{6i}^* = \frac{X_{6i} - 2.84}{0.57}$
Then, the standardized indicator scores are summed:
$X_i^* = X_{1i}^* + X_{2i}^* + X_{3i}^* + X_{4i}^* + X_{5i}^* + X_{6i}^*$

This is consequential. Consider:
Are the scales interchangeable? Do a “very successful” and an “always” and a “slightly agree” share meaning?
How would you score the test? If it’s a 4-point item, do you only add a maximum of 4 points?
Standardizing assumes that 1) scores are sums of $X^*$, 2) scale points do not share meaning across items, and 3) “one standard deviation” has more in common than “one unit,” across items.
Here, we standardize, but I might possibly rescale by $X_3' = 5(X_3 - 1)/3 + 1$ (so values go from 1 to 6) and not standardize, in practice.
© John Willett and Andrew Ho, Harvard
Graduate School of Education
Unit 6a – Slide 15
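A minimal Stata sketch (not in the original slides) of the two composites just described, assuming X4 has already been reverse-coded; it sidesteps the missing-data decisions a real analysis would need to make.

* Additive composite of raw indicators (rowtotal treats missing items as 0; rowmean is an alternative):
egen Xraw = rowtotal(X1-X6)
* Additive composite of standardized indicators:
foreach v of varlist X1-X6 {
    egen z`v' = std(`v')
}
egen Xstd = rowtotal(zX1-zX6)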
Measurement (8): A Baseline Reliability Analysis

. alpha X1-X6, label item asis

Test scale = mean(unstandardized items)

                       item-test  item-rest  interitem
Item        Obs  Sign    corr.      corr.      cov.      alpha   Label
X1         5097   +     0.6119     0.4157    .3838505   0.6583   Have high standards of teaching
X2         5109   +     0.6480     0.4283    .3600352   0.6554   Continually learning on job
X3         5144   +     0.5196     0.3892    .4501671   0.6761   Successful in educating students
X4         5121   +     0.7318     0.4554    .3004472   0.6641   Waste of time to do best as teacher
X5         5116   +     0.7417     0.5444    .302952    0.6122   Look forward to working at school
X6         5125   +     0.6242     0.5338    .4331154   0.6592   Time satisfied with job
Test scale                                   .3717621   0.6955   mean(unstandardized items)

We use unstandardized items here; the std option would standardize them.
Sample size differs by item; relationships are estimated pairwise.
Positive signage because we already reversed the polarity of X4.
.372 is the straight average interitem covariance.
.696 is the scaled-up interitem correlation, our “internal consistency” estimate of reliability.
Cronbach’s alpha is 0.7 and can be interpreted as the estimated correlation between two sets of teacher scores from a replication of this measurement procedure (a measure of internal consistency, across replication of items).
Cronbach’s alpha is 0.7 and can also be interpreted as the estimated proportion of observed score variance accounted for by true score variance.
These are diagnostics that explain item functioning and sometimes, with additional analysis, warrant item adaptation or exclusion. However, no item should be altered or excluded on the basis of these statistics alone.
Item-Test Correlation is the straight correlation between item responses and total test scores (higher the better).
Item-Rest Correlation is the same, but the total test score excludes the target item (higher the better).
Interitem Covariance shows the average interitem covariance for all items excluding the target item (lower the better).
Alpha (excluded-item alpha) shows the would-be $\rho_\alpha$ estimate if the item were excluded (lower the better).
© Andrew Ho, Harvard Graduate School of Education
Unit 6a – Slide 16
Measurement (8): A Baseline Reliability Analysis
[Figure: item-level reliability diagnostics plotted by item number.]
© Andrew Ho, Harvard Graduate School of Education
Unit 6a – Slide 17
Measurement: 7 Key Principles
1. We don’t validate tests. We validate score uses.
2. Content is king. Not models. Content.
3. Start with Classical Test Theory (CTT): The
descriptive stats of measurement.
4. Your reliability is not Reliability (G Theory).
5. Item Response Theory (IRT) is just a model. A
very, very useful model.
6. Your scale is pliable. Bend it; don’t break it.
7. Know the process that generated your scores.
Use them accordingly.
18
Questions? Grit-S Scale: Duckworth & Quinn (2009) skip to 24.
New ideas and projects sometimes distract me...
Setbacks don't discourage me.
I have been obsessed with a certain idea... but...
I am a hard worker.
I often set a goal but later choose to pursue...
I have difficulty maintaining my focus on projects...
I finish whatever I begin.
I am diligent.
© Andrew Ho
Harvard Graduate School of Education 19
Classical Test Theory (1): Decomposition
• We can decompose any observed score into a true score and
an error term.
𝑋 = 𝑇 + 𝐸   (1)
– 𝑋 ~ an observed score for a person (𝑝) for one replication/form
(𝑓)
• Real replications are called strictly parallel: They have scores with equal
means and SDs that covary equally with each other.
– 𝑇 ~ a true score for a person
– 𝐸 ~ an error/residual term for a
person for one replication
𝑋𝑝𝑓 = 𝑇𝑝 + 𝐸𝑝𝑓
• Note that any error term
can be expressed as:
𝐸 =𝑋−𝑇
– The residual: the observed score
minus the expected score.
[Figure: one examinee’s observed score distribution over 3 replications, with observed scores $X_1, X_2, X_3$ scattered around the true score $T$ and errors $E_1, E_2, E_3$.]
© Andrew Ho
Harvard Graduate School of Education 20
Classical Test Theory (2): Expectation and Covariance
• We define the true score as the expected value of
observed scores, over replications:
$T = \mathbf{E}[X]$
– 𝐄, the expected value over replications, should not to be
confused with the error term, 𝐸. Unfortunate notation.
• The expected value of errors is 0:
$\mathbf{E}[E] = 0 \quad (2)$
• The covariance between true scores and errors is zero:
$\mathrm{Cov}(T, E) = 0 \quad (3)$
– In contrast, 𝑇 and 𝐸 are components of observed scores 𝑋
and will be correlated positively with 𝑋.
• Just as we can decompose the scores, we can
decompose the variance:
$\mathrm{Var}(X) = \sigma_X^2 = \sigma_T^2 + \sigma_E^2 \quad (4)$
© Andrew Ho
Harvard Graduate School of Education 21
Classical Test Theory (3): Reliability
• Now, the covariance between scores on forms $f$ and $g$:
$\mathrm{Cov}(X_{pf}, X_{pg}) = \mathrm{Cov}(T_p + E_{pf},\ T_p + E_{pg}) = \mathrm{Var}(T_p)$
• To obtain the correlation, we standardize the covariance. Assuming variances are equal across forms,
$\rho_{XX'} = \mathrm{Corr}(X_{pf}, X_{pg}) = \dfrac{\mathrm{Cov}(X_{pf}, X_{pg})}{\sigma_{X_f}\sigma_{X_g}} = \dfrac{\mathrm{Var}(T_p)}{\mathrm{Var}(X_p)}$
• We define reliability as this correlation, between observed scores, across replications:
$\rho_{XX'} = \mathrm{Corr}(X_{pf}, X_{pg}) = \dfrac{\sigma_T^2}{\sigma_X^2} = 1 - \dfrac{\sigma_E^2}{\sigma_X^2} \quad (5)$
• It is the proportion of observed score variance accounted for by true scores (and not error).
© Andrew Ho
Harvard Graduate School of Education 22
Classical Test Theory as Observed-on-True Regression
[Figure: scatterplot of Observed Score (y-axis) against True Score (x-axis).]
How much Observed Score Variance do True Scores predict?
Correlation is 0.6. $X = T + E$. Reliability is 0.36 (proportion of variance accounted for…).
$\sigma_E$ ~ Standard Error of Measurement
$\rho_{TX} = \sqrt{\rho_{XX'}}; \quad \rho_{XX'} = \rho_{TX}^2 \quad (6)$
It should make sense that correlations between two error-prone measures will be lower than between true and observed scores.
© Andrew Ho
Harvard Graduate School of Education 23
CTT Predictions (1): Variation increases reliability
• From Equation 5, we derive the standard error of measurement (SEM, $\sigma_E$) in terms of the observed SD, $\sigma_X$, and reliability:
$\sigma_E = \sigma_X\sqrt{1 - \rho_{XX'}} \quad (7)$
• The SEM (not to be confused with structural equation modeling) is, like the RMSE (root mean squared error), my preferred depiction of error, because it depends less on population characteristics like $\sigma_X$.
• If your reliability is low, it may be because you have a homogeneous sample. How could we generalize?
• Your correction for restriction of range:
$\rho_{X_B X_B'} = 1 - \dfrac{\sigma_{X_A}^2}{\sigma_{X_B}^2}\left(1 - \rho_{X_A X_A'}\right)$
© Andrew Ho
Harvard Graduate School of Education 24
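A quick arithmetic illustration of the range-restriction correction (hypothetical numbers, not from the slides): if reliability is .75 in a restricted sample A with SD 8, the predicted reliability in a broader population B with SD 12 is

. display 1 - (8^2/12^2)*(1 - .75)
.88888889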
CTT Predictions (2): Attenuated Coefficients
• From Equation 5, we know that observed SDs are inflated due to measurement error:
$\sigma_X = \sigma_{T_X} / \sqrt{\rho_{XX'}}$
• Correlations between two observed variables $X$ and $Y$ will thus be attenuated by measurement error in both variables (but not covariance, FYI!):
$\mathrm{Cov}(X, Y) = \mathrm{Cov}(T_X + E_X,\ T_Y + E_Y) = \mathrm{Cov}(T_X, T_Y)$
• We can “disattenuate” correlations to obtain correlations untainted by measurement error:
$\rho_{T_X T_Y} = \mathrm{Corr}(X, Y) / \sqrt{\rho_{XX'}\,\rho_{YY'}} \quad (8)$
• And we can disattenuate regression coefficients, too!
$\beta_{Y|T_X} = \beta_{Y|X} / \rho_{XX'} \quad (9)$
© Andrew Ho
Harvard Graduate School of Education 25
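A quick arithmetic illustration of Equation 8 (hypothetical numbers, not from the slides): an observed correlation of .40 between X and Y, with reliabilities .81 and .64, disattenuates to

. display .40/sqrt(.81*.64)
.55555556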
CTT Predictions (3): Regression to the mean
• Recalling our “observed-on-true” (𝑋 on 𝑇)
regression, how would we estimate a true score
given only an observed score? Inverting to a
“true-on-observed” regression, we obtain:
$\hat{T} = \mu_T + \beta_{T|X}(X - \mu_X)$
• Recalling that we assume 𝜇 𝑇 = 𝜇𝑋 and the
general fact that 𝛽𝐴|𝐵 = 𝜎𝐴 𝜌𝐴𝐵 /𝜎𝐵 , we can
simplify to “Kelley’s Regressed Scores”:
$\hat{T} = \rho_{XX'}X + (1 - \rho_{XX'})\mu_X \quad (10)$
• This is a wonderfully intuitive formula…
• Should we apply it to test scores in practice?
© Andrew Ho
Harvard Graduate School of Education 26
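A quick arithmetic illustration of Kelley’s formula (hypothetical numbers, not from the slides): with reliability .80, a group mean of 100, and an observed score of 130, the regressed true score estimate is

. display .8*130 + (1 - .8)*100
124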
CTT Predictions (4): The Spearman-Brown Prophecy Formula
• Assuming that items (and part-tests) are parallel,
the Spearman-Brown Prophecy Formula predicts
reliability as we increase or decrease the number
of items (or part-tests) by a factor of 𝑘:
$\rho_{XX'}^{SB} = \dfrac{k\,\rho_{XX'}}{1 + (k-1)\,\rho_{XX'}} \quad (11)$
• For example, if reliability is 0.81 and we double the test length, what is the predicted reliability?

. * Spearman-Brown
. display r(rho)
.81415596
. display 2*r(rho)/(1+r(rho))
.89755895
© Andrew Ho
Harvard Graduate School of Education 27
Questions? Grit-S Scale: Duckworth & Quinn (2009)
New ideas and projects sometimes distract me...
Setbacks don't discourage me.
I have been obsessed with a certain idea... but...
I am a hard worker.
I often set a goal but later choose to pursue...
I have difficulty maintaining my focus on projects...
I finish whatever I begin.
I am diligent.
© Andrew Ho
Harvard Graduate School of Education 28
Three steps to Cronbach’s 𝛼
1. Let’s estimate internal consistency reliability
with a “split half” correlation.
2. Let’s prophesize the reliability when we
double (or multiply by 𝑘 the test length) with
the Spearman-Brown Prophecy formula.
3. Let’s imagine the average of all possible split-half correlations, then prophesize the reliability when we double test length.
© Andrew Ho
Harvard Graduate School of Education 29
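A minimal Stata sketch (not from the slides) of steps 1 and 2 for the TSUCCESS items, assuming X4 has already been reverse-coded; step 3, averaging over all possible splits, is what alpha reports.

* One arbitrary split: odd versus even items.
egen half1 = rowtotal(X1 X3 X5)
egen half2 = rowtotal(X2 X4 X6)
corr half1 half2
display 2*r(rho)/(1 + r(rho))    // Spearman-Brown: prophesied full-length reliability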
Cronbach’s Alpha: The “mean” of measurement
• By far the most widely used reliability coefficient.
– Easy to calculate, generally robust (to, for example,
violations of parallelism), fairly well understood as an
internal consistency measure.
– Convenient interpretation as an average of all possible
split-half reliability coefficients (each of them “scaled
up” by Spearman-Brown).
– Convenient interpretation as a “lower bound to
reliability” as long as item sampling is the only source of
error (this is wishful thinking).
– For a total score $X$ comprised of $n$ items with item scores $X_i$:
$\alpha = \hat{\rho}_{XX'} = \dfrac{n}{n-1}\left(1 - \dfrac{\sum_i \sigma_{X_i}^2}{\sigma_X^2}\right) \quad (12)$
© Andrew Ho
Harvard Graduate School of Education 30
Estimating Reliability in Practice
• We don’t observe true scores. What can we do to
estimate 𝜌𝑋𝑋 ′ ?
• Three types of reliability
– Parallel Forms Reliability
• Correlation between scores from two tests comprised of items
drawn from the same population of items, separated by some
time interval. Often considered ideal but costly.
– Test-Retest Reliability
• Correlation between scores from the same test, separated by
some time interval. Does not consider error due to item
sampling, among other sources.
– Internal Consistency Reliability
• Treats specific, random, or all possible halves of tests as
replications. Does not consider error due to time intervals
(occasions), among other sources.
© Andrew Ho
Harvard Graduate School of Education 31
Measurement: 7 Key Principles
1. We don’t validate tests. We validate score uses.
2. Content is king. Not models. Content.
3. Start with Classical Test Theory (CTT): The
descriptive stats of measurement.
4. Your reliability is not Reliability (G Theory).
5. Item Response Theory (IRT) is just a model. A
very, very useful model.
6. Your scale is pliable. Bend it; don’t break it.
7. Know the process that generated your scores.
Use them accordingly.
32
Generalizability Theory (Brennan, 2002; Shavelson & Webb, 1991)
The Generalizability Study
• Target: Estimates of Variance
Components.
• Question: How can I
decompose score variability
into meaningful
components? How much
variability is attributable to
particular sources?
• Notes: It is often difficult to
conduct a G Study on
secondary data, because we
need replications to
estimate variability, and
there may be none. As
always, an ounce of design is
worth a pound of analysis.
The Decision Study
• Target: Standard errors and
generalizability coefficients
(generalized reliability
coefficients)
• Question: What are my
standard errors and
reliabilities? How many
replications over which
sources of error will enable
me to obtain sufficient
precision? What scoring
protocols will maximize
precision for minimal cost?
• Notes: The D Study design
need not be the same as the
G Study design!
© Andrew Ho
Harvard Graduate School of Education 33
G and D Studies in Hill, Charalambous, & Kraft (2012)
© Andrew Ho
Harvard Graduate School of Education 34
D Study Targets
• 𝜎𝛿2 - Relative Error Variance: If I were to replicate this
measurement procedure, how much would my scores vary
relative to each other?
– And the relative standard error, 𝜎𝛿
• 𝜎Δ2 - Absolute Error Variance: If I were to replicate this
measurement procedure, how much would my scores vary
on an absolute scale?
– And the absolute standard error, 𝜎Δ
• 𝐄𝜌2 - Generalizability Coefficient for Relative Error: The
proportion of relative score variance that is attributable to
persistent person effects.
– A generalized reliability coefficient.
• Φ - Generalizability Coefficient for Absolute Error: The
proportion of absolute score variance that is attributable to
persistent person effects.
– A generalized reliability coefficient when absolute scale points
matter.
© Andrew Ho
Harvard Graduate School of Education 35
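A minimal sketch (not from the slides) of how a persons-crossed-with-raters G study might be run in Stata, followed by D study coefficients; the dataset layout, variable names, and variance-component values are all hypothetical.

* G study: crossed random effects for person and rater (long format with variables score, person, rater).
mixed score || _all: R.person || _all: R.rater, reml
* D study, plugging in hypothetical variance components:
local vp  = 0.50     // person variance
local vr  = 0.05     // rater variance
local vpr = 0.20     // person-by-rater interaction plus residual
local nr  = 2        // raters per person in the D study design
display `vp'/(`vp' + `vpr'/`nr')                  // E(rho^2): relative-error generalizability
display `vp'/(`vp' + `vr'/`nr' + `vpr'/`nr')      // Phi: absolute-error generalizability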
Measurement: 7 Key Principles
1. We don’t validate tests. We validate score uses.
2. Content is king. Not models. Content.
3. Start with Classical Test Theory (CTT): The
descriptive stats of measurement.
4. Your reliability is not Reliability (G Theory).
5. Item Response Theory (IRT) is just a model. A
very, very useful model.
6. Your scale is pliable. Bend it; don’t break it.
7. Know the process that generated your scores.
Use them accordingly.
36
Motivation
• Item response theory (IRT) supports the vast majority
of large-scale educational assessments.
– State testing programs
– National and international assessments (NAEP, TIMSS,
PIRLS, PISA).
– Selection testing (SAT, ACT)
• Many presentations of IRT use unfamiliar jargon and
specialized software.
– We will try to connect IRT to other more flexible statistical
modeling frameworks.
– We will use Stata.
• Some presentations treat IRT with veneration.
– We will acknowledge IRT advantages but take a more
critical perspective.
37
Focus on IRT
Practical application of IRT methods to real data.
1. Classical Test Theory (CTT) and descriptive statistics.
– Reliability, item p-values, point biserial correlations.
2. Theoretical and practical benefits of IRT
– Parameter interpretation and estimation
3. “Ability” (𝜃) parameter estimation and scoring
– Expected a posteriori; test characteristic curve-based scores
A critical perspective on IRT score scales
4. Extreme features of modern test score distributions
5. Scale drift over time
6. Weak “equal interval” properties
38
Example: Item-Level MCAS Data
• I live in a state with a researcher-friendly approach
to item-level data:
• Released Items:
• Grade 4 Mathematics, 2012.
http://www.doe.mass.edu/infoservices/research
http://www.doe.mass.edu/mcas/2012/release/g4math.pdf
39
Sample Items
40
Sample Items and Scoring
• There are 42 items on the Grade 4 Mathematics test:
– 32 dichotomously scored multiple choice items
– 6 dichotomously scored short-answer items (Items 5, 6, 18, 26, 36, and 37)
– 4 constructed response questions (7, 19, 30, and 42) scored on a 5-point, 0-to-4 scale.
• For our purposes, these 4 constructed response items are scored
dichotomously, where students scoring a 3 or a 4 get a 1, and students
scoring a 0, 1, or 2 get a 0.
• There are around 70K complete item response vectors, but we’ll randomly sample 5,000 students from all students with complete records.
41
Learning to read/replicate Technical Reports
http://www.mcasservicecenter.com/documents
/MA/Technical%20Report/2012_Tech/201112%20MCAS%20Tech%20Rep.pdf
42
Focus on IRT
Practical application of IRT methods to real data.
1. Classical Test Theory (CTT) and descriptive statistics.
– Reliability, item p-values, point biserial correlations.
2. Theoretical and practical benefits of IRT
– Parameter interpretation and estimation
3. “Ability” (𝜃) parameter estimation and scoring
– Expected a posteriori; test characteristic curve-based scores
A critical perspective on IRT score scales
4. Extreme features of modern test score distributions
5. Scale drift over time
6. Weak “equal interval” properties
43
Classical Test Theory (Skip to 52)
• What is the range of item difficulties on the
test?
• How well do individual item scores correlate
with total test scores?
• How precise are total test scores over
replications of items?
44
A look at the data
45
Item Difficulty
[Figure: proportion correct by item number (Items 1–42).]
46
Item Difficulty
[Figure: proportion correct by item number, repeated.]
47
Stata’s alpha command
[Figure: item-level alpha diagnostics plotted by item number (Items 1–42).]
• Item-Test Correlation – Correlation between the item score and the total score. Higher for more internally consistent items.
• Item-Rest Correlation – Correlation between the item score and the total score, not including the item in question. Higher for more internally consistent items.
• Interitem Covariance – Average interitem covariance if the item in question were excluded. Lower for more internally consistent items.
• Alpha – “Omitted-item alpha,” if the item in question were excluded. Lower for more internally consistent items.
48
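A minimal Stata sketch (not from the slides) of the CTT diagnostics listed above, assuming dichotomous item variables item1-item42 from the MCAS example:

summarize item1-item42                     // item means of 0/1 items are CTT p-values
alpha item1-item42, item                   // item-test/item-rest correlations, omitted-item alpha
egen sumscore = rowtotal(item1-item42)     // number-correct total score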
Interpreting Alpha = 0.89: Three ways of looking at “reliability”
Three Definitions of Reliability
1. Reliability is the correlation between two sets of
observed scores from a replication of a
measurement procedure.
2. Reliability is the proportion of “observed score
variance” that is accounted for by “true score
variance.”
3. Reliability is like an average of pairwise interitem
correlations, “scaled up” according to the number
of items on the test (because averaging over more
items decreases error variance).
Three Necessary Intuitions
1. Any observed score is one of many possible
replications.
2. Any observed score is the sum of a “true score”
(average of all theoretical replications) and an
error term.
3. Averaging over replications gives us better
estimates of “true” scores by averaging over error
terms.
$\rho_{XX'} = \mathbf{E}[\mathrm{Corr}(X, X')]$
$\rho_{XX'} = \dfrac{\sigma_T^2}{\sigma_X^2}$
$\rho_\alpha = \dfrac{n_j\,\bar{r}}{1 + (n_j - 1)\,\bar{r}}$
[Figure: one examinee’s observed scores $X_{i1}, X_{i2}, X_{i3}$ scattered around true score $T_i$ with errors $E_{i1}, E_{i2}, E_{i3}$.]
The “Population Dependence” of CTT
[Figure: CTT item difficulties and item-rest correlations plotted by item number.]
• CTT versions of item difficulties and discriminations are “population dependent.”
• If items are administered to groups with higher or lower ability/proficiency, these statistics will differ, and not along any simple linear transformation.
• Score scales from large-scale testing programs must be flexible enough to accommodate many subpopulations, over time.
• A more flexible modeling framework is necessary.
50
CTT as imperfect framework: Item-Observed Score Regression
• Plot the probability of a correct response on item $i$, conditional on total test score $x$.
[Figure: empirical item-observed score regressions, proportion correct on each item plotted against sumscore (0–40).]
• Unfortunately, this doesn’t follow any consistent functional form:
• The shape depends upon the difficulty of other items included in the test!
51
Focus on IRT
Practical application of IRT methods to real data.
1. Classical Test Theory (CTT) and descriptive statistics.
– Reliability, item p-values, point biserial correlations.
2. Theoretical and practical benefits of IRT
– Parameter interpretation and estimation
3. “Ability” (𝜃) parameter estimation and scoring
– Expected a posteriori; test characteristic curve-based scores
A critical perspective on IRT score scales
4. Extreme features of modern test score distributions
5. Scale drift over time
6. Weak “equal interval” properties
52
Item Response Theory
• In classical test theory (CTT), we model individual
responses to particular items in what amounts to an
ANOVA framework.
• The object modeled statistically:
– CTT: 𝑋𝑝𝑖 = 𝜇 + 𝜈𝑝 + 𝜈𝑖 + 𝜈𝑝𝑖 , examinee (𝑝) item responses on a
test, given items (𝑖) from a universe of items.
– IRT: 𝑃𝑖 𝑋𝑝𝑖 = 1|𝜃𝑝 , 𝑎𝑖 , 𝑏𝑖 , 𝑐𝑖 = ⋯, an examinee’s probability of
a correct response on a particular item.
• Conception of what is measured:
– CTT: An examinee’s “true score,” defined as an expected value
across replications of observed scores, often 𝜏 on the score scale
or 𝜇𝑝 on the mean scale.
– IRT: Examinee’s location in some (often unidimensional) latent
space, represented by 𝜃𝑝 .
53
IRT Applications (vs. CTT)
• CTT
– Innumerable applications. CTT is still by far the most
widely used psychometric “tool kit.”
– Should be a knee-jerk first analysis, just as descriptive
statistics are a first step for most statistical work.
• IRT
– Conception of items as unique, independent units is
useful for test design and construction.
– Standard approach for large-scale testing.
– A “scaling model” more than a “statistical model” that
can result in a test score scale that is easier to use,
defend, and maintain, particularly for many students,
across tests, and over time.
54
IRT as a Scaling Model
• What if there were an alternative scale, 𝜃, for
which item characteristic curves would not depend
upon each other, could be easily modeled, and
would support useful interpretations?
The Logistic Function: $p = \dfrac{1}{1 + e^{-x}}$
The Logit Function: $x = \log\dfrac{p}{1-p} = \log p - \log(1-p)$
• $P_i(X_{pi} = 1 \mid \theta_p, b_i)$, the probability of person $p$ responding correctly to item $i$, abbreviated $P_i(\theta_p)$.
• The 1-Parameter Logistic (1PL) Model is widely known as the Rasch Model, after Georg Rasch’s 1961 monograph.
– $\log\dfrac{P_i(\theta_p)}{1 - P_i(\theta_p)} = a(\theta_p - b_i); \quad \theta_p \sim N(0,1)$
– $P_i(\theta_p) = \dfrac{1}{1 + \exp[-a(\theta_p - b_i)]}; \quad \theta_p \sim N(0,1)$
55
1 Parameter Logistic (1PL) Item Characteristic Curves (ICCs)
[Figure: left panel, scatterplot of CTT Difficulty (proportion correct) against IRT Difficulty estimates; right panel, 1PL ICCs for Items 1–8 plotted over Theta.]
56
IRT as a latent variable measurement model (SEM): 1PL
SEM Formulation: 𝑃 𝑋𝑝𝑖 = 1 𝜃𝑝 , 𝑏𝑖 = logistic(𝑚𝜃𝑝 − 𝑏𝑖 )
• The slope (loading), 𝑚, is constrained to be common across items.
• The intercept is reverse coded and not shown in the path diagram by convention.
• The latent variable, 𝜃, is constrained to have a variance of 1.
– This solves the problem of “scale indeterminacy.”
– Another solution is to set the slope or “loadings,” 𝑚, to equal 1, but interpretation is
often easier under a unit variance identification (UVI) of the latent variable.
. gsem (Theta -> (item1-item42)@m), logit var(Theta@1)
© Andrew Ho
Harvard Graduate School of Education 57
2-Parameter Logistic (2PL) Item Characteristic Curves (ICCs)
$\log\dfrac{P_i(\theta_p)}{1 - P_i(\theta_p)} = a_i(\theta_p - b_i); \quad \theta_p \sim N(0,1)$
[Figure: 2PL ICCs plotted over Theta.]
58
Item Characteristic Curve (ICC) Slider Questions
• What happens when we increase 𝑎 for the blue item?
Which item is more discriminating?
• What happens when we increase 𝑏 for the blue item?
Which item is more difficult?
• What happens when we increase 𝑐 for the blue item?
Which item is more discriminating?
• Try setting blue to .84, 0, .05 and red to .95,.3, .26. Why
might the 𝑐 parameter be the most difficult to estimate in
practice?
• Given this overlap, comparisons of items in terms of item
parameters instead of full curves will be shortsighted.
– Difficulty for which 𝜃? Discrimination for which 𝜃?
• For reference, the probability of a correct response when $\theta_p = b_i$ is $\dfrac{1 + c_i}{2}$. The slope at this inflection point is $\dfrac{a_i(1 - c_i)}{4}$.
59
IRT in Stata: 1PL (The Rasch Model)
© Andrew Ho
Harvard Graduate School of Education 60
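A minimal sketch (not from the slides) of one way to fit and inspect the 1PL with Stata’s irt suite (Stata 14+), assuming item variables item1-item42; the gsem formulation shown earlier is an equivalent route.

irt 1pl item1-item42              // common discrimination, item-specific difficulties
estat report, byparm              // item parameter estimates arranged by parameter
irtgraph icc item1-item8          // ICCs for items 1-8
predict theta1pl, latent          // empirical Bayes (EAP) estimates of theta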
1 Parameter Logistic (1PL) Item Characteristic Curves (ICCs)
[Figure: 1PL ICCs (Probability by Theta); the right panel marks selected item difficulty estimates (−2.59, −1.65, −.616, .249) on the Theta axis.]
© Andrew Ho
Harvard Graduate School of Education 61
1 Parameter Logistic (1PL) ICCs in Logit Space (Linear)
[Figure: 1PL item response functions for Items 1–8 in logit space, parallel lines over Theta.]
62
The 2-Parameter Logistic (2PL) IRT Model
$\log\dfrac{P_i(\theta_p)}{1 - P_i(\theta_p)} = a_i(\theta_p - b_i)$
Discrimination is the difference in the log-odds of a correct answer for every SD
distance of 𝜃𝑝 from 𝑏𝑖 .
Likelihood ratio test: reject the null hypothesis that discrimination parameters are jointly equal. The 2PL fits better than the 1PL.
© Andrew Ho
Harvard Graduate School of Education
63
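A minimal sketch (not from the slides) of the 1PL-versus-2PL comparison in Stata, assuming item1-item42:

quietly irt 1pl item1-item42
estimates store onepl
quietly irt 2pl item1-item42
estimates store twopl
lrtest onepl twopl                // H0: all discriminations equal (the 1PL restriction)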
2-Parameter Logistic (2PL) ICCs
[Figure: 2PL ICCs (Probability by Theta); the right panel marks selected difficulty estimates (−2.21, −1.02, .217) on the Theta axis.]
© Andrew Ho
Harvard Graduate School of Education 64
The 3-Parameter Logistic (3PL) IRT Model
$P_i(\theta_p) = c + (1 - c)\,\dfrac{\exp[a_i(\theta_p - b_i)]}{1 + \exp[a_i(\theta_p - b_i)]}$
The common $c$ parameter estimate is an estimated lower asymptote, the pseudo-guessing parameter. It is estimated in common across items ($c$ rather than $c_i$) due to considerable estimation challenges in practice.
Likelihood ratio test: reject null hypothesis that
the common pseudo-guessing parameter is zero.
© Andrew Ho
Harvard Graduate School of Education 65
3-Parameter Logistic (3PL) ICCs
[Figure: 3PL ICCs (Probability by Theta); the right panel marks selected difficulty estimates (−2.09, −.937, .332) on the Theta axis.]
© Andrew Ho
Harvard Graduate School of Education 66
Graphical Goodness of Fit for Items 1 and 8
[Figure: model-predicted ICCs (Predicted mean, item1 and item8) overlaid on empirical ICCs (eicc1, eicc8), plotted over Theta.]
67
Aside: (Establishing the IRT 𝜃 scale)

Unit Variance Identification (standard; Stata default)
1PL: $\log\dfrac{P(X_{pi}=1)}{1 - P(X_{pi}=1)} = a(\theta_p - b_i); \quad \theta \sim N(0,1)$
2PL: $\log\dfrac{P(X_{pi}=1)}{1 - P(X_{pi}=1)} = a_i(\theta_p - b_i); \quad \theta \sim N(0,1)$
$a$ – An SD unit in 𝜃 predicts an $a_i$ increase in the log odds of a correct response.
$b$ – The 𝜃 required for even odds (50%) of a correct response.

Unit Loading Identification (common in SEM)
1PL: $\log\dfrac{P(X_{pi}=1)}{1 - P(X_{pi}=1)} = \theta_p - b_i; \quad \theta \sim N(0, \sigma_\theta^2)$
2PL: $\log\dfrac{P(X_{pi}=1)}{1 - P(X_{pi}=1)} = a_i(\theta_p - b_i); \quad a_1 = 1; \quad \theta \sim N(0, \sigma_\theta^2)$
$a$ – A unit increase in 𝜃 predicts an $[a_i]$ unit increase in the log odds of a correct response.
𝜃 – Subtract $b$, and you’ll have the log-odds of a correct response to item 1.
© Andrew Ho
Harvard Graduate School of Education 68
Loose Sample Size Guidelines (Yen & Fitzpatrick)
• Rasch (1PL): 20 items and 200 examinees.
• Hulin, Lissak, and Drasgow:
– 2PL: 30 items, 500 examinees.
– 3PL: 60 items, 1000 examinees.
– Tradeoffs, maybe 30 items and 2000 examinees.
• Swaminathan and Gifford:
– 3PL: 20 items, 1000 examinees.
• Low scoring examinees needed for 3PL.
• Large samples (above 3500) needed for
polytomous items (scored 0/1/2/…),
particularly high or low difficulty items that will
have even higher or lower score points.
© Andrew Ho
Harvard Graduate School of Education 69
Focus on IRT
Practical application of IRT methods to real data.
1. Classical Test Theory (CTT) and descriptive statistics.
– Reliability, item p-values, point biserial correlations.
2. Theoretical and practical benefits of IRT
– Parameter interpretation and estimation
3. “Ability” (𝜃) parameter estimation and scoring
– Expected a posteriori; test characteristic curve-based scores
A critical perspective on IRT score scales
4. Extreme features of modern test score distributions
5. Scale drift over time
6. Weak “equal interval” properties
70
Preview: Estimating 𝜃𝑝 via Empirical Bayes (EAP)
[Figure: scatterplot matrix relating sum scores (sumscore), logit-transformed scores (logitx), and empirical Bayes means for Theta.]
© Andrew Ho
Harvard Graduate School of Education 71
Local Independence: A Fundamental IRT Assumption
$P_i(X_{pi} = 1 \mid \theta_p, a_i, b_i, c_i) = c_i + \dfrac{1 - c_i}{1 + \exp[-a_i(\theta_p - b_i)]}$
• Probability of a correct response to an item is a
function of ability/proficiency, the item parameters,
and nothing else.
• An examinee’s probability of a correct response to a
particular item,𝑃 𝑢𝑖 = 1 , can be determined by 𝜃.
Conditional on 𝜃, item responses are independent.
$P(u_i = 1 \mid \theta) = P(u_i = 1 \mid \theta, u_j, u_k, u_l, \ldots)$
• Another way to define independence: when the joint probability is the product of the marginals, e.g., $P(u_i, u_j \mid \theta) = P(u_i \mid \theta)\,P(u_j \mid \theta)$.
© Andrew Ho
Harvard Graduate School of Education 72
Local Independence and Unidimensionality
• Unidimensionality implies a single latent dimension
underlying performance.
• If, after item parameters and 𝜃, something else
could tell us more about item performance, there
would be a dependence on a dimension other than
𝜃 : multidimensionality
• Local independence follows from unidimensionality.
Local independence can also follow from
multidimensionality, given multiple 𝜃.
• Is a math test unidimensional? For any population
that takes it? A chemistry test? A reading test with
passages? Do we want tests to be unidimensional?
© Andrew Ho
Harvard Graduate School of Education 73
Some Practice IRT Calculations
$P_i(\theta_p) = c_i + \dfrac{1 - c_i}{1 + \exp[-a_i(\theta_p - b_i)]}$
• On a 3-item test, what parameters would you need to know in order to calculate the probability of getting a perfect score?
– Local independence really helps us out.
• Define $Q_i(\theta_p) = 1 - P_i(\theta_p)$.
• So the probability of an item response vector of 010 can be written as $Q_1(\theta_p)\,P_2(\theta_p)\,Q_3(\theta_p)$.
• Let’s say that we know $P_1(2) = 0.8$.
• We find out that an examinee with 𝜃 = 2 also got items
2 and 3 correct. How can we tell how much to increase
that examinee’s estimate of 𝜃?
© Andrew Ho
Harvard Graduate School of Education 74
An Illustrative Four-Item Test: Four Familiar ICCs

Par | Item      1        2        3        4
a              1        1        2        1
b             -1        0        1        2
c              0        0        0        0.2

Theta  P(theta)  P(u1=1|theta)  P(u2=1|theta)  P(u3=1|theta)  P(u4=1|theta)  P(1100|theta)  P(1100&theta)  P(theta|1100)
  -2   0.03125      0.15447        0.03230        0.00004        0.20089        0.00399        0.00012        0.00063
  -1   0.15625      0.50000        0.15447        0.00111        0.20485        0.06134        0.00958        0.04863
   0   0.31250      0.84553        0.50000        0.03230        0.22584        0.31672        0.09898        0.50213
   1   0.31250      0.96770        0.84553        0.50000        0.32357        0.27674        0.08648        0.43874
   2   0.15625      0.99394        0.96770        0.96770        0.60000        0.01243        0.00194        0.00985
   3   0.03125      0.99889        0.99394        0.99889        0.87643        0.00014        0.00000        0.00002
                                                                                          sum = 0.19711
© Andrew Ho
Harvard Graduate School of Education 75
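A worked check on the Theta = 0 row of the table (not from the slides), using local independence and Bayes’ rule:

. display 0.84553*0.50000*(1 - 0.03230)*(1 - 0.22584)    // P(1100|theta=0), approximately .31672
. display 0.31250*0.31672                                // P(1100 & theta=0), approximately .09898
. display 0.09898/0.19711                                // P(theta=0|1100), approximately .50213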
Aside: (Item Maps) (skip to 82)
• By picking a “response probability,” we can map items on to
the proficiency scale, e.g., RP70:
If $P_i(\theta_p) = c + \dfrac{1 - c}{1 + \exp[-a_i(\theta_p - b_i)]}$,
then $\theta_i^{map} = \dfrac{1}{a_i}\log\dfrac{RP - c}{1 - RP} + b_i$
http://nces.ed.gov/nationsreportcard/itemmaps/
Item mapping can give a scale substantive meaning
beyond simple normative ranking of examinees.
© Andrew Ho
Harvard Graduate School of Education 76
Estimating Item then Examinee Parameters
• Given the data, a person by item matrix of 0s and 1s, estimate 𝑎, 𝑏, and 𝑐
for each item, and 𝜃 for each examinee.
• Estimation generally proceeds by making initial guesses about item
parameters, e.g., their CTT versions.
– $a' = \dfrac{\rho'_{bis}}{\sqrt{1 - \rho_{bis}'^2}}$, a rescaling of item point-biserial correlations.
– $b' = -\Phi^{-1}(\%\ correct)$, an inverse-normal rescaling of item p-values.
– $c' = p_g$, the theoretical guessing probability.
• In a so-called joint estimation approach, we consider item parameters as
fixed, and estimate examinee parameters given these item parameter
estimates.
• Then we consider these just-estimated examinee parameters as fixed, and
we estimate our item parameters again, given our examinee parameter
estimates.
• This cycle repeats until some tolerance is reached wherein estimates
change very little.
• Most modern estimation procedures are known as “marginal maximum
likelihood” (random effects) approaches. Instead of estimating each
examinee’s 𝜃, they estimate the distribution of 𝜃, making computation for
complex models (anything above 1PL) tractable. Then, they estimate 𝜃s.
© Andrew Ho
Harvard Graduate School of Education 77
Estimation Example: Estimating Examinee Parameters
What is the probability of getting the first two items correct and the next two questions incorrect, for any given 𝜃? $P(1100 \mid \theta)$?

Par | Item      1        2        3        4
a              1        1        2        1
b             -1        0        1        2
c              0        0        0        0.2

Theta  P(u1=1|theta)  P(u2=1|theta)  P(u3=1|theta)  P(u4=1|theta)  Likelihood P(1100|theta)
-3        0.03230        0.00606        0.00000        0.20016        0.00016
-2.9      0.03805        0.00717        0.00000        0.20019        0.00022
-2.8      0.04479        0.00849        0.00000        0.20023        0.00030
-2.7      0.05265        0.01005        0.00000        0.20027        0.00042
-2.6      0.06180        0.01189        0.00000        0.20032        0.00059
-2.5      0.07243        0.01406        0.00001        0.20038        0.00081

© Andrew Ho
Harvard Graduate School of Education 78
Likelihood Functions and Posterior Distributions
[Figure: likelihood functions (solid lines) for response patterns, $P(1100 \mid \theta)$ and $P(1111 \mid \theta)$, with overlaid posterior distributions (dashed) informed by a Normal prior.]
© Andrew Ho
Harvard Graduate School of Education 79
Bayesian (EAP, MAP) Approaches
Par | Item      1        2        3        4
a              1        1        2        1
b             -1        0        1        2
c              0        0        0        0.2

Theta  P(theta)  P(u1=1|theta)  P(u2=1|theta)  P(u3=1|theta)  P(u4=1|theta)  P(1100|theta)  P(1100&theta)  P(theta|1100)
  -2   0.03125      0.15447        0.03230        0.00004        0.20089        0.00399        0.00012        0.00063
  -1   0.15625      0.50000        0.15447        0.00111        0.20485        0.06134        0.00958        0.04863
   0   0.31250      0.84553        0.50000        0.03230        0.22584        0.31672        0.09898        0.50213
   1   0.31250      0.96770        0.84553        0.50000        0.32357        0.27674        0.08648        0.43874
   2   0.15625      0.99394        0.96770        0.96770        0.60000        0.01243        0.00194        0.00985
   3   0.03125      0.99889        0.99394        0.99889        0.87643        0.00014        0.00000        0.00002
                                                                                          sum = 0.19711

Column guide: $P(\theta)$ is the prior; the item columns are individual item response likelihoods; $P(1100 \mid \theta)$ is the pattern likelihood $P(X \mid \theta)$; $P(1100\,\&\,\theta)$ is the joint probability $P(X, \theta) = P(X \mid \theta)\,P(\theta)$; $P(\theta \mid 1100)$ is the posterior; and the marginal $P(X)$ is the sum of the joint column (0.19711).
• Requires estimation at selected points, shown here at -2, -1, 0, 1, 2, 3.
• Maximum A Posteriori (MAP): Like a ML estimate regularized by prior
information.
• Expected A Posteriori (EAP): An average likelihood weighted by priors.
• Do you think these help with our infinite theta estimate for perfect scores?
• Do you think EAP and MAP estimates are the same?
© Andrew Ho
Harvard Graduate School of Education 80
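Continuing the worked example (not from the slides): the EAP estimate for pattern 1100 is the posterior-weighted average of the grid points, whereas the MAP estimate is the grid point with the largest posterior (theta = 0 here).

. display -2*0.00063 - 1*0.04863 + 0*0.50213 + 1*0.43874 + 2*0.00985 + 3*0.00002
.40861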
Intuition about Bayesian Inference
Overlaying Posterior
Distributions (dashed)
Informed by a
Normal prior.
• If the prior distribution were completely uninformative (uniform),
what would the posterior distribution look like?
• As priors get more informative, what happens to the shape and
location of the posterior distributions?
• As data/information increases, what happens to the shapes of the
likelihood functions and posteriors?
• What is an intuitive statistic for the estimated standard error of 𝜃,
given the posterior distribution?
© Andrew Ho
Harvard Graduate School of Education 81
Number-Correct vs. Pattern Scoring
• For the 1- and 2-parameter logistic models, a sufficient
statistic for estimating 𝜃 is $s = \sum_{i=1}^{n} a_i u_i$, the sum of item
discriminations for items that an examinee scores correctly.
• For Rasch models (1PL), this implies a 1-to-1 mapping of
number-correct scores to 𝜃.
• For 2PL (and 3PL, where there is no sufficient statistic),
different item response patterns will map to different 𝜃 even if
number-correct scores are the same (unless, for the 2PL, $\sum_{i=1}^{n} a_i u_i$ happens to be the same).
• Pattern scoring may be difficult to explain and defend to
practitioners.
• Pattern scoring may also be difficult operationally, as we often
calibrate IRT models on smaller, trial samples and then score
operationally for much larger samples.
– Since we may not have full representation of all $2^I$ possible item response patterns for long tests (1.1 trillion possible patterns for a 40-item test!), it’s easier to operationalize a number-correct
scoring algorithm.
© Andrew Ho
Harvard Graduate School of Education 82
1PL EAP Scoring
[Figure: sum scores (0–40) plotted against empirical Bayes (EAP) means for Theta, a one-to-one monotone mapping.]
© Andrew Ho
Harvard Graduate School of Education 83
2PL vs. 1PL EAP Scoring

Correlations among scores (sumscore, logitx, theta1pl, theta2pl, theta3pl):

             sumscore   logitx  theta1pl  theta2pl  theta3pl
  sumscore     1.0000
  logitx       0.9626   1.0000
  theta1pl     0.9873   0.9918    1.0000
  theta2pl     0.9833   0.9881    0.9962    1.0000
  theta3pl     0.9850   0.9870    0.9959    0.9996    1.0000

[Figure: 2PL theta estimates plotted against 1PL theta estimates, and theta estimates plotted against sum scores.]

Note the many-to-one mapping of 2PL 𝜃 to 1PL 𝜃. What makes a relatively high and relatively low $\theta_{2PL}$ for any given $\theta_{1PL}$?
© Andrew Ho
Harvard Graduate School of Education 84
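A minimal Stata sketch (not from the slides) of how such a comparison might be produced, assuming item1-item42 and that the variables below have not already been created; the exact logit transformation behind logitx is not shown in the slides, so the adjusted version here is an assumption.

egen sumsc = rowtotal(item1-item42)
gen logitx = logit((sumsc + 0.5)/43)       // assumed adjusted logit of proportion correct
quietly irt 1pl item1-item42
predict t1pl, latent
quietly irt 2pl item1-item42
predict t2pl, latent
quietly irt 3pl item1-item42               // common guessing parameter by default
predict t3pl, latent
pwcorr sumsc logitx t1pl t2pl t3pl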
The Test Characteristic Curve
[The four-item parameter and probability table from the preceding slides is repeated here.]
Is this a logistic curve?
© Andrew Ho
Harvard Graduate School of Education 85
The Test Characteristic Curve
• From this perspective, the IRT 𝜃 is a rescaled true
score, 𝑇.
• Note that examinees with the same 𝜃 have the same
true score 𝑇 (although neither is ever known in
practice; they are estimated).
• Lord (1980): True score (𝑇) and proficiency (𝜃) are the
same thing expressed on different scales of
measurement.
• But the measurement scale of 𝑇 depends on the items
in the test. The measurement scale for 𝜃 is
independent of the choice of test items (if the model
fits the data).
• The TCC can also be used for scoring...
© Andrew Ho
Harvard Graduate School of Education 86
A Test Characteristic Curve for a 25-Item 3PL Model
• The sum of all ICCs on a 25 Item Test.
• The TCC forms an intuitive basis for scoring:
– Invert the function: For any given score, find the appropriate 𝜃 and call it 𝜃
[Figure: test characteristic curve for the 25-item test, Total Test Score plotted against 𝜃.]
TCC scoring allows visualization of two key IRT functions: 1) a rescaling of the number-correct score to a scale with improved properties (though this is not what EAP does!), and 2) a compression of the center of the number-correct scale with respect to the extremes, increasing relative precision in that region.
© Andrew Ho
Harvard Graduate School of Education 87
The Test Characteristic Curve for Scoring
• We can use the TCC for scoring, but it is problematic,
particularly at the extremes.
[Figure: test characteristic curve with graphical score-to-𝜃 mapping, Total Test Score against 𝜃.]
Graphical TCC-based scoring will stretch out extreme scores far too much, particularly perfect scores (why?) and, for 3PL calibrations, scores at or below $\sum_i c_i$, which map to infinity or are undefined.
© Andrew Ho
Harvard Graduate School of Education 88
IRT Scoring Summary
• The 1PL (Rasch Model) has a 1-to-1 mapping of sum scores to
scale scores using, for example, EAP.
• The 2PL and 3PL map a single raw score to multiple scale
scores, resulting in “pattern scoring.”
• Back-mapping from raw scores to 𝜃 scale scores, that is,
“number-correct scoring” or “sum scoring,” is possible using a
graphical TCC approach, although there are some raw scores
that do not map to defined scale scores.
– Sum score mapping tables are also possible using the Lord-Wingersky algorithm, a more defensible (but hard to explain)
approach (see Orlando & Thissen, 2001) implemented in IRTPro.
• Arriving at the “right” scoring algorithm requires balancing
numerous priorities, including the fit of the model to data,
the use of the scores, and the “face validity” of the scoring
procedure.
© Andrew Ho
Harvard Graduate School of Education 89
Focus on IRT
Practical application of IRT methods to real data.
1. Classical Test Theory (CTT) and descriptive statistics.
– Reliability, item p-values, point biserial correlations.
2. Theoretical and practical benefits of IRT
– Parameter interpretation and estimation
3. “Ability” (𝜃) parameter estimation and scoring
– Expected a posteriori; test characteristic curve-based scores
A critical perspective on IRT score scales
4. Extreme features of modern test score distributions
5. Scale drift over time
6. Weak “equal interval” properties
90
TCC-Scaled Distributions in Practice (Ho & Yu, 2015)
[Figure: 2007 test score distributions (Percent by scale score) for TX and NY.]
© Andrew Ho
Harvard Graduate School of Education 91
6 “unskewed” distributions (Ho and Yu)
92
IRT scales have sparsely measured upper tails (Ho and Yu)
93
Skewness and Kurtosis of IRT-Scaled Distributions (Ho and Yu)
94
Focus on IRT
Practical application of IRT methods to real data.
1. Classical Test Theory (CTT) and descriptive statistics.
– Reliability, item p-values, point biserial correlations.
2. Theoretical and practical benefits of IRT
– Parameter interpretation and estimation
3. “Ability” (𝜃) parameter estimation and scoring
– Expected a posteriori; test characteristic curve-based scores
A critical perspective on IRT score scales
4. Extreme features of modern test score distributions
5. Scale drift over time
6. Weak “equal interval” properties
95
Non-Equivalent groups, Anchor Test (NEAT) Design
• We’ll start with a “trends over time” motivation.
– This year’s 4th graders differ from last year’s 4th graders.
– This year’s test differs from last year’s test.
– How can we address whether this year’s 4th graders perform
better than last year’s?
• One basis for a link across different tests taken by
different groups is an embedded “anchor test” that
both groups take.
– Anchor test items drive group comparisons, thus anchor item
selection and representation must be considered very
carefully.
– However, this design is quite practical, allowing for linking to occur within operational test administrations.
96
A Simple NEAT (Non-Equivalent Anchor Test) Scenario
The NEAT person-by-item data matrix, with items sorted by 1) X-only items, 2) Anchor items, then 3) Y-only items.
[Diagram: Group 1 (2012) has data for the 2012-only items and the Anchor items, with missing data for the 2013-only items; Group 2 (2013) has data for the Anchor items and the 2013-only items, with missing data for the 2012-only items.]
97
A Simple NEAT Scenario
• 5,000 persons × 5-item matrix of 2012-only items
• 10,000 persons × 15-item matrix of Anchor items
• 5,000 persons × 5-item matrix of 2013-only items
• There are usually far more noncommon items on state tests.
• This number of common items is about right.
• We can compare the proportion correct for common items in 2012 and 2013.
98
A “Delta Plot” comparing Proportions Correct
[Figure: scatterplot of anchor-item proportions correct in 2012 against 2013, with anchor item numbers plotted as points; average anchor-test number correct is 10.5 in 2012 and 10.4 in 2013.]
Did scores go down, or up?
Focus on the “anchor test” items that both groups took in common.
• Don’t relative proportions depend upon the population taking the test?
• Shouldn’t we consider the other items?
It sure would be nice to have some kind of item difficulty parameter that didn’t theoretically depend upon the population taking the test; some kind of model that allowed us to take the other item responses into account.
99
Linking with IRT: The Mean-Sigma Method
• We remember that, with IRT, if the model fits the
data, the item parameters are population invariant
up to a linear transformation.
– We can fit a 2-parameter logistic IRT model twice, completely separately, once to all the item responses in 2012, and once to all the item responses in 2013.
• When we look at our distributions of 𝜃, we find no average difference.
[Figure: separately calibrated 2012 and 2013 𝜃 distributions overlaid; difference in means: 0 standard deviation units.]
– This is expected. In any IRT calibration, we assume that 𝜃 are distributed 𝑁(0,1) by default: scale indeterminacy.
– We need to add the logic of the link into the model to map the 2013 scale to the 2012 scale.
100
Common-Item Linking with IRT
Items 1-5
2012 Only
Items 6-20
2012 and 2013
Items 21-25 2013 Only
101
Linking with IRT: The Mean-Sigma Method
In which year are items more difficult?
[Figure: scatterplot of anchor-item difficulty estimates, Difficulty 2012 (y-axis) against Difficulty 2013 (x-axis), with anchor item numbers plotted as points.]
We plot the $b_{2012}$ estimates from the base year on the $b_{2013}$ estimates from the year we wish to link to the base year. Estimate $m$ and $k$ to map difficulty, linearly, from 2013 to the 2012 scale:
$b_{2012}^{2013} = m\,b_{2013} + k$
This will allow us to map proficiency from 2013 to 2012:
$\theta_{2012}^{2013} = m\,\theta_{2013} + k$
How should we estimate $m$ and $k$?
102
Reminder: You have a choice of regression lines
The “ordinary least squares” regression criterion minimizes the sum of vertical squared residuals. Other definitions of “best fit” are possible:
• Vertical Squared Residuals (OLS)
• Horizontal Squared Residuals (X on Y)
• Orthogonal Residuals (PCA!), the “Principal Axis” Line
Which one would you choose for our purposes?

Linking with IRT: The Mean-Sigma Method
[Figure: the anchor-item difficulty scatterplot again, with two outlying items flagged: “What is interesting about this item?”]
Outliers reflect sampling variability or can indicate differential difficulty above and beyond that explained by the unidimensional 𝜃 scale. Multidimensionality, targeted instruction, or inflation? What should be done?
The ideal 𝑋 to 𝑌 transformation is symmetrical (NOT the OLS regression line). This is the “principal axis” regression line:
$m = s_{b_{2012}} / s_{b_{2013}}$
$k = \bar{b}_{2012} - m\,\bar{b}_{2013}$
Then, for all items and examinees:
$\theta_{2012}^{2013} = m\,\theta_{2013} + k$
$b_{2012}^{2013} = m\,b_{2013} + k$
$a_{2012}^{2013} = a_{2013} / m$
104
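A minimal Stata sketch (not from the slides) of the mean-sigma computation, assuming the anchor items’ difficulty estimates sit in hypothetical variables b2012 and b2013 and that a 2013 theta variable exists:

quietly summarize b2012
local mb12 = r(mean)
local sb12 = r(sd)
quietly summarize b2013
local m = `sb12'/r(sd)
local k = `mb12' - `m'*r(mean)
display "m = " `m' "   k = " `k'
gen theta2013_on2012 = `m'*theta2013 + `k'
gen b2013_on2012 = `m'*b2013 + `k'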
Linking with IRT
105
What have we accomplished?
[Figure: Before, separately calibrated 2012 and 2013 𝜃 densities with a difference in means of 0 standard deviation units; After, linked densities with a difference in means of .297 standard deviation units.]
[Figure: all 25 item characteristic curves (fitted probability of a correct response over Theta) on the same (2012) scale.]
106
Next Steps
• There are a number of other ways to
link tests in a NEAT design.
– Concurrent Calibration: Treat missing data as missing at random. Fit an IRT model to the whole data matrix (the NEAT matrix: 2012-only block, Anchor block, 2013-only block, with the off-blocks missing).
– Constrain 2013 parameters to be equal to 2012 parameters.
– “Characteristic Curve” approaches find the linear transformation that makes the item or test characteristic curves most similar. This uses all the information in the ICC (not just the 𝑏 parameter estimates) and is more state-of-the-art.
• One can compare how these different linking methods
lead to different estimates of grade-to-grade growth
over time.
107
Measurement: 7 Key Principles
1. We don’t validate tests. We validate score uses.
2. Content is king. Not models. Content.
3. Start with Classical Test Theory (CTT): The
descriptive stats of measurement.
4. Your reliability is not Reliability (G Theory).
5. Item Response Theory (IRT) is just a model. A
very, very useful model.
6. Your scale is pliable. Bend it; don’t break it.
7. Know the process that generated your scores.
Use them accordingly.
108
Equal-Interval Properties of IRT Scales?
• We have already shown that estimation of the IRT model
requires setting a scale for the latent variable, 𝜃, usually mean 0
and variance 1.
• The 𝜃 scale is equal-interval with respect to the log-odds of
correct responses to items.
• Lord (1975, 1980) shows that similar indeterminacy holds for
nonlinear transformations.
– “Once we have found the scale 𝜃 on which all item response curves
are (say) logistic, it is often thought that this scale has unique
virtues…” (Lord, 1980, p. 84).
• Imagine an exponential transformation of the 𝜃 scale to 𝜃*: 𝜃* = 𝐾e^{𝑘𝜃}. We simply transform the item response function to match, and all our predicted values for any given 𝜃 are the same (see the sketch below).
• This transformation is completely permissible. There is a logistic item response function on the 𝜃 scale and a transformed item response function on the 𝜃* scale.
• The data cannot tell which of the two is preferable.
109
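A minimal sketch of the point above, with hypothetical item parameters and transformation constants: transform 𝜃 exponentially, re-express the item response function on the new scale, and confirm that the fitted probabilities are unchanged.

```python
# A 2PL item response function on the theta scale, and the matching function
# on the transformed theta* = K * exp(k * theta) scale.
import numpy as np

def p_2pl(theta, a, b):
    """Probability of a correct response under a 2PL model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

K, k = 2.0, 0.7   # hypothetical transformation constants
a, b = 1.2, 0.3   # hypothetical item parameters

theta = np.linspace(-3, 3, 7)
theta_star = K * np.exp(k * theta)   # monotone transformation of the scale

def p_2pl_star(theta_star, a, b):
    """The same item response function, re-expressed on the theta* scale."""
    theta = np.log(theta_star / K) / k   # invert the transformation
    return p_2pl(theta, a, b)

# Identical predictions, so the data cannot prefer one scale over the other.
print(np.allclose(p_2pl(theta, a, b), p_2pl_star(theta_star, a, b)))  # True
```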
What to make of scale indeterminacy
• The logistic item response function is mathematically convenient and has a loose rational basis under normality assumptions.
• However, the data cannot tell which of many
plausible monotone transformations is more
desirable.
• There is no one “correct” or “natural” scale for
measuring traits or abilities in education.
• Interpretations should be robust to plausible
alternative score scales.
110
Scaling: Between the Ordinal and the Interval
• Imagine that adjacent numbers seem to sway as if
adjoined by springs.
• Successive distances, between 1 and 2, and 2 and 3,
are not only uneven but seem to expand and
contract unpredictably.
• But the springs are not infinitely compressible and
stretchable.
• The scale is pliable. Neither ordinal nor interval. Its
equal-interval argument is weak but not baseless.
111
The Fragility/Pliability of an Achievement Gap
• Two population-level test score distributions, say, high
socioeconomic status (𝑎) vs. low socioeconomic status (𝑏),
separated by 1 standard deviation unit.
• We can express the difference in terms of standard deviation units (see the sketch below):
  𝑑 = (𝑋_𝑎 − 𝑋_𝑏) / √((𝑠_𝑎² + 𝑠_𝑏²)/2) = 1
• Interpretations of averages rely on the equal-interval properties of
scales.
112
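A minimal sketch of the gap statistic above, using simulated scores for the two groups; the values are hypothetical and chosen so that 𝑑 is close to 1 by construction.

```python
# Standardized mean difference: mean gap over the pooled standard deviation.
import numpy as np

rng = np.random.default_rng(2)
x_a = rng.normal(1.0, 1.0, 5000)   # hypothetical high-SES scores
x_b = rng.normal(0.0, 1.0, 5000)   # hypothetical low-SES scores

d = (x_a.mean() - x_b.mean()) / np.sqrt((x_a.var(ddof=1) + x_b.var(ddof=1)) / 2)
print(round(d, 2))  # close to 1 by construction
```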
The Fragility/Pliability of an Achievement Gap
[Figure: The same two distributions under three scalings: untransformed (𝑑 = 1), negative skew (𝑑* = .89), and positive skew (𝑑* = .89).]
113
Plausible Transformations
• Consider a baseline standard normal distribution: 𝑥 ~ 𝑁(0,1).
• To which we apply a family of exponential transformations (thanks to Sean Reardon for this particular family):
  𝑥* = 𝑎 + 𝑏e^{𝑐𝑥}
• Constrained to preserve the mean and variance:
  E[𝑥*] = 0; Var[𝑥*] = 1
• These constraints result in the transformation (see the sketch below):
  𝑥* = −(sgn(𝑐) / √(e^{𝑐²} − 1)) · (1 − e^{𝑐𝑥 − 𝑐²/2})
• Where 𝑥* = 𝑥 when 𝑐 = 0.
114
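A minimal sketch checking the family as reconstructed above: apply the transformation to standard-normal scores for several values of 𝑐 and confirm the mean and variance are approximately preserved.

```python
# The exponential family of plausible transformations, with a mean/variance check.
import numpy as np

def exp_transform(x, c):
    """Exponential transformation of standard-normal x, preserving mean 0 and variance 1."""
    if c == 0:
        return x
    return -np.sign(c) / np.sqrt(np.exp(c**2) - 1) * (1 - np.exp(c * x - c**2 / 2))

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 200_000)

for c in (-0.5, 0.0, 0.5):
    x_star = exp_transform(x, c)
    print(c, round(x_star.mean(), 3), round(x_star.std(), 3))  # ~0 and ~1 for each c
```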
The Exponential Family
[Figure: Transformation curves for 𝑐 = +.5 and 𝑐 = −.5.]
The bounds of 𝑐 are set such that the slope of the transformation at the 5th percentile is 1/5 to 5 times that of the slope at the 95th percentile.
115
A Simple Sensitivity Check
1. Take existing scores.
2. Apply a “family of plausible transformations.”
3. Calculate metrics of interest from each “plausibly transformed” dataset.
4. Assess robustness of interpretations of metrics across plausible transformations (see the sketch below).
Is this the right family? Is 𝑐 bounded appropriately?
116
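A minimal sketch of the four steps with simulated scores and hypothetical values of 𝑐: transform both groups' scores with each member of the family, recompute the gap statistic, and inspect how much it moves.

```python
# Sensitivity of the standardized gap d to a family of plausible transformations.
import numpy as np

def exp_transform(x, c):
    if c == 0:
        return x
    return -np.sign(c) / np.sqrt(np.exp(c**2) - 1) * (1 - np.exp(c * x - c**2 / 2))

def cohens_d(x_a, x_b):
    return (x_a.mean() - x_b.mean()) / np.sqrt((x_a.var(ddof=1) + x_b.var(ddof=1)) / 2)

rng = np.random.default_rng(4)
x_a = rng.normal(1.0, 1.0, 5000)   # step 1: "existing" scores (simulated here)
x_b = rng.normal(0.0, 1.0, 5000)

for c in np.linspace(-0.5, 0.5, 5):   # step 2: family of plausible transformations
    d_c = cohens_d(exp_transform(x_a, c), exp_transform(x_b, c))  # step 3: recompute metric
    print(f"c = {c:+.2f}   d = {d_c:.2f}")                        # step 4: assess robustness
```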
The scale-sensitivity of a correlation (Reardon & Ho)
𝜌 ≈ (e^{𝑐²𝜌*} − 1) / (e^{𝑐²} − 1)
(A simulation sketch follows below.)
Is this the right family? Is 𝑐 bounded appropriately?
117
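A minimal sketch, my own simulation rather than Reardon & Ho's derivation, of how a correlation shifts when both variables are passed through the exponential family with the same 𝑐; the baseline correlation of 0.7 is hypothetical.

```python
# Simulate a bivariate normal correlation and re-estimate it after transformation.
import numpy as np

def exp_transform(x, c):
    if c == 0:
        return x
    return -np.sign(c) / np.sqrt(np.exp(c**2) - 1) * (1 - np.exp(c * x - c**2 / 2))

rng = np.random.default_rng(5)
rho = 0.7
x, y = rng.multivariate_normal([0, 0], [[1.0, rho], [rho, 1.0]], size=500_000).T

for c in (-0.5, -0.25, 0.0, 0.25, 0.5):
    r_c = np.corrcoef(exp_transform(x, c), exp_transform(y, c))[0, 1]
    print(f"c = {c:+.2f}   correlation = {r_c:.3f}")
```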
Referencing Plausibility to Skewness
[Figure: Distributions under the most extreme transformations in the family, 𝑐 = +.5 and 𝑐 = −.5, corresponding to skewness of about ±1.8.]
118
Skewness and Kurtosis of IRT-Scaled Distributions (Ho and Yu)
119
The Fragility of District-Level Residuals
• A midsize statewide dataset with over 30K students per grade.
• District-level residuals by grade cohort for Grade 6, over 250 districts.
• Using three prior years of test scores.
• Applied transformations to all grades, on standardized (z) and normalized (n) scores.
• District-level residuals are above the diagonal, mean percentile ranks of residuals (mPRRs, akin to Student Growth Percentiles; Castellano & Ho, 2012) below.
• Normalization vs. standardization makes little difference for these well-behaved distributions. Comparing negative to positive skew drops correlations to .85–.89.
[Table: Correlations among district-level residuals (above the diagonal) and mPRRs (below the diagonal) under standardized and normalized scores.]
120
The Skew-Dependence of District Residuals
[Figure: District residuals after a positive-skew transformation plotted against residuals after a negative-skew transformation (𝑟 = .88), with a marginal frequency histogram.]
121
What We’ve Done
Practical application of IRT methods to real data.
1. Classical Test Theory (CTT) and descriptive statistics.
– Reliability, item p-values, point biserial correlations.
2. Theoretical and practical benefits of IRT
– Parameter estimation
3. “Ability” (𝜃) parameter estimation and scoring
– Expected a posteriori; test characteristic curve-based scores
A critical perspective on IRT score scales
4. Extreme features of modern test score distributions
5. Scale drift over time
6. Weak “equal interval” properties
122
Measurement: 7 Key Principles
1. We don’t validate tests. We validate score uses.
2. Content is king. Not models. Content.
3. Start with Classical Test Theory (CTT): The
descriptive stats of measurement.
4. Your reliability is not Reliability (G Theory).
5. Item Response Theory (IRT) is just a model. A
very, very useful model.
6. Your scale is pliable. Bend it; don’t break it.
7. Know the process that generated your scores.
Use them accordingly.
123
Appendix: What about a more robust, ordinal statistic?
• The nonparametric literature gives us many alternative gap representations, for example (see the sketch below):
  Pr(𝑋_𝑎 > 𝑋_𝑏) = (Σ𝑟_𝑎 − 𝑛_𝑎(𝑛_𝑎 + 1)/2) / (𝑛_𝑎𝑛_𝑏)
  where Σ𝑟_𝑎 is the sum of the combined-sample ranks of group 𝑎.
• Interpretable on a 0/1 scale: the probability that a
randomly drawn 𝑋𝑎 is greater than a randomly
drawn 𝑋𝑏 , where 0.5 represents no gap.
• I’ve long been a fan of this statistic, but I’ve
recently been rethinking how to explain its
“transformation invariant” property.
124
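A minimal sketch of the rank-based gap statistic above, using simulated scores; for two normal distributions separated by one standard deviation, Pr(𝑋_𝑎 > 𝑋_𝑏) is roughly .76.

```python
# Pr(X_a > X_b) computed from the combined-sample ranks of group a.
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(6)
x_a = rng.normal(1.0, 1.0, 400)   # hypothetical group a scores
x_b = rng.normal(0.0, 1.0, 600)   # hypothetical group b scores

ranks = rankdata(np.concatenate([x_a, x_b]))   # midranks in the combined sample
r_a_sum = ranks[:len(x_a)].sum()               # rank sum for group a
n_a, n_b = len(x_a), len(x_b)

p_a_gt_b = (r_a_sum - n_a * (n_a + 1) / 2) / (n_a * n_b)
print(round(p_a_gt_b, 3))   # 0.5 would indicate no gap
```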
Ranks as Transformations
• The idea of ranks as transformations, themselves,
has been around as long as nonparametric statistics
and is reviewed well by Conover and Iman (1981).
• They show that the Mann-Whitney-Wilcoxon test,
for example, is equivalent to a 𝑡-test conducted on
ranks (with adjustment to 𝑡𝑐𝑟𝑖𝑡𝑖𝑐𝑎𝑙 ).
• Similarly, we can show that Pr(𝑋_𝑎 > 𝑋_𝑏) is a difference in average percentile ranks (see the check below):
  Pr(𝑋_𝑎 > 𝑋_𝑏) = (mean(𝑟_𝑎) − mean(𝑟_𝑏)) / (𝑛_𝑎 + 𝑛_𝑏) + 0.5
  where 𝑟_𝑎 and 𝑟_𝑏 are the combined-sample ranks in each group.
• Wait… an average of ranks?!
125
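A minimal numerical check, with simulated data, of the two claims above: the rank-sum and mean-rank expressions for Pr(𝑋_𝑎 > 𝑋_𝑏) agree (under the reconstruction of the formula above), and a 𝑡-test on the ranks behaves much like the Mann-Whitney-Wilcoxon test.

```python
# Rank identities for Pr(X_a > X_b), plus Conover & Iman's ranks-as-transformation point.
import numpy as np
from scipy.stats import mannwhitneyu, rankdata, ttest_ind

rng = np.random.default_rng(7)
x_a = rng.normal(0.3, 1.0, 400)
x_b = rng.normal(0.0, 1.0, 600)
n_a, n_b = len(x_a), len(x_b)

ranks = rankdata(np.concatenate([x_a, x_b]))
r_a, r_b = ranks[:n_a], ranks[n_a:]

# Two routes to Pr(X_a > X_b): via the rank sum, and via the mean-rank difference.
p_rank_sum = (r_a.sum() - n_a * (n_a + 1) / 2) / (n_a * n_b)
p_mean_ranks = (r_a.mean() - r_b.mean()) / (n_a + n_b) + 0.5
print(np.isclose(p_rank_sum, p_mean_ranks))   # True

# A t-test on ranks gives a p-value close to the Mann-Whitney-Wilcoxon p-value.
print(ttest_ind(r_a, r_b).pvalue, mannwhitneyu(x_a, x_b, alternative="two-sided").pvalue)
```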
The Rank-Interval Scale
• Far from liberating gaps from scale information, the
metric implicitly assumes that distances between
successive ranks are equal in interpretation.
• Compare:
– 1) Take observed scores. 2) Transform to IRT 𝜃 scale. 3)
Take average differences. 4) Perhaps standardize for
interpretation.
– 1) Take observed scores. 2) Apply 𝑟/(𝑛𝑎 + 𝑛𝑏 ) = 𝐹(𝑥).
3) Take average differences. 4) Add 0.5 for interpretation.
• The latter is a “uniform-izing” transformation, after
which we average, implicitly making equal-interval
assumptions on ranks.
126
Another Scale: Percentage Proficient
• By far the most widely available large-scale educational
statistic is the percentage of proficient students.
• I and others have condemned this statistic for gap
measurement, but it, too, is a difference of averages.
[Figure: The same gap under a rank (uniform-izing) transformation and under a proficiency (dichotomizing) transformation.]
127
A Different “Problem of Scale”
• Most large-scale analyses of educational scores are
entangled with implicit equal-interval assumptions,
whether on ranks, raw scores, or scale scores.
• Rather than throw up our hands, I’d like to attempt
to describe whether the pliability of educational
score scales is consequential.
• Doing so requires a framework for considering
“families of plausible transformations.”
128