Improving Content Validity:
A Confidence Interval
for Small Sample Expert Agreement
Jeffrey M. Miller & Randall D. Penfield
NCME, San Diego
April 13, 2004
University of Florida
[email protected] & [email protected]
INTRODUCING CONTENT VALIDITY
“Validity refers to the degree to which
evidence and theory support the
interpretations of test scores entailed by
proposed uses of tests”
(AERA/APA/NCME, 1999)
Content validity refers to the degree to
which the content of the items reflects
the content domain of interest (APA,
1954)
THE NEED FOR IMPROVED REPORTING
Content is a precursor to drawing a score-based
inference. It is evidence-in-waiting (Shepard,
1993; Yalow & Popham, 1983)
“Unfortunately, in many technical manuals,
content representation is dealt with in a
paragraph, indicating that selected panels of
subject matter experts (SMEs) reviewed the
test content, or mapped the items to the
content standards…” (Crocker, 2003)
QUANTIFYING CONTENT VALIDITY
Several indices for quantifying expert
agreement have been proposed
The mean rating across raters is often used in
calculations
However, the mean alone does not provide
information regarding its proximity to the
unknown population mean.
We need a usable inferential procedure to
gain insight into the accuracy of the sample
mean as an estimate of the population mean.
THE CONFIDENCE INTERVAL
A simple method is to calculate the traditional Wald
confidence interval:

X̄ ± t(df) · s/√n

However, this interval is inappropriate for rating scales.
1. There are too few raters and response categories to trust
that population normality has not been violated.
2. There is no reason to believe the distribution should be normal.
3. The rating scale is bounded, with categories that are discrete.
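For contrast, a minimal sketch of the Wald computation in Python (the ratings are illustrative, and t(df = 9) ≈ 2.262 is the two-tailed 95% critical value for 10 raters):

```python
import statistics

def wald_ci(ratings, t_crit):
    """Traditional Wald interval: mean +/- t_df * s / sqrt(n)."""
    n = len(ratings)
    mean = statistics.mean(ratings)
    s = statistics.stdev(ratings)      # sample SD with df = n - 1
    half = t_crit * s / n ** 0.5
    return mean - half, mean + half

# Hypothetical panel of 10 raters; t critical value for df = 9 at 95% is 2.262
ratings = [3, 3, 3, 3, 3, 3, 3, 3, 3, 4]
lo, hi = wald_ci(ratings, t_crit=2.262)
```

Note that this interval is symmetric about the mean, and for extreme mean ratings its limits can fall outside the scale's bounds, which is part of the objection raised above.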
AN ALTERNATIVE IS THE
SCORE CONFIDENCE INTERVAL
FOR RATING SCALES
Penfield (2003) demonstrated that the Score
method outperformed the Wald interval
especially when
The number of raters was small (e.g., ≤ 10)
The number of categories was small (e.g., ≤ 5)
Furthermore, this interval is asymmetric:
It is based on the actual distribution of the mean
rating of interest.
Its limits cannot extend below or above
the actual limits of the rating categories.
STEPS TO CALCULATING THE SCORE CONFIDENCE INTERVAL
1. Obtain values for n, k, and z
n = the number of raters
k = the highest possible rating
z = the standard normal variate associated
with the confidence level (e.g., +/- 1.96
at 95% confidence)
2. Calculate the mean item rating
The sum of the ratings for an item
divided by the number of raters
3. Calculate p, the mean rating expressed as a
proportion of its maximum:
p = ΣX / (nk)
Or, if the scale begins at 1, p is computed with a
correspondingly rescaled formula.
4. Use p to calculate the upper and lower
limits for a confidence interval for
population proportion (Wilson, 1927)
5. Calculate the upper and lower limits of
the Score confidence interval
for the population mean rating
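The five steps above can be sketched in Python. This is a sketch under the assumptions stated in the steps: the rating sum is treated as a binomial count out of n·k trials, the Wilson (1927) limits are applied to that proportion, and the function name is my own:

```python
import math

def score_ci(ratings, k, z=1.96):
    """Score confidence interval for the population mean rating."""
    n = len(ratings)                  # step 1: n, k, z
    total = sum(ratings)
    p = total / (n * k)               # steps 2-3: mean rating as proportion of maximum
    # step 4: Wilson (1927) limits for a population proportion, with N = n * k
    N = n * k
    half = z * math.sqrt(z ** 2 + 4 * N * p * (1 - p))
    denom = 2 * (N + z ** 2)
    p_lower = (2 * N * p + z ** 2 - half) / denom
    p_upper = (2 * N * p + z ** 2 + half) / denom
    # step 5: rescale the proportion limits back to the rating metric
    return k * p_lower, k * p_upper

# Ten raters, responses as in the shorthand example that follows
lo, hi = score_ci([3, 3, 3, 3, 3, 3, 3, 3, 3, 4], k=4)
```

With these ratings the limits come out near 2.50 and 3.51, matching the worked example.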
Shorthand Example
Item: 3 + ? = 8
The content of this item represents the ability to add single-digit
numbers.
Rating scale: 1 = Strongly Disagree, 2 = Disagree, 3 = Agree, 4 = Strongly Agree
Suppose the expert review session includes 10 raters.
The responses are 3, 3, 3, 3, 3, 3, 3, 3, 3, 4
Shorthand Example
n = 10
k = 4
z = 1.96
sum of the ratings = 31
X̄ = 31/10 = 3.10
p = ΣX / (nk) = 31 / (10 × 4) = 0.775
Shorthand Example (cont.)
With N = nk = 40:
p_L = [2Np + z² − z√(z² + 4Np(1 − p))] / [2(N + z²)] = (65.842 − 11.042) / 87.683 = 0.625
p_U = [2Np + z² + z√(z² + 4Np(1 − p))] / [2(N + z²)] = (65.842 + 11.042) / 87.683 = 0.877
Shorthand Example (cont.)
Lower limit = k · p_L = 4 × 0.6250 = 2.500
Upper limit = k · p_U = 4 × 0.8768 = 3.507
We are 95% confident that the
population mean rating falls somewhere
between 2.500 and 3.507
Content Validation
1. Method 1: Retain only items with a Score interval
of a particular width, based on
a. An a priori determination of appropriateness
b. An empirical standard (e.g., the 25th and 75th
percentiles of all widths)
2. Method 2: Retain items based on a hypothesis test
that the lower limit is above a particular value
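Both retention rules can be sketched as simple filters over precomputed intervals. In this sketch the intervals are those from the four-item example that follows, while the width cutoff of 1.05 and the criterion value of 2.0 are illustrative assumptions:

```python
# Hypothetical items mapped to precomputed 95% Score intervals (lower, upper)
intervals = {1: (3.08, 3.84), 2: (2.50, 3.51), 3: (1.59, 2.77), 4: (1.50, 2.68)}

# Method 1: retain items whose interval width does not exceed a cutoff,
# here an a priori maximum width of 1.05
widths = {item: hi - lo for item, (lo, hi) in intervals.items()}
retained_m1 = [item for item, w in widths.items() if w <= 1.05]

# Method 2: retain items whose lower limit is above a criterion value,
# here 2.0 on the 0-4 agreement scale
retained_m2 = [item for item, (lo, hi) in intervals.items() if lo > 2.0]
```

Under these cutoffs both rules happen to retain the same two items; in practice the two methods can disagree, since a narrow interval may still sit low on the scale.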
EXAMPLE WITH 4 ITEMS
Rating Frequency for 10 Raters and 95% Score CI

Item |  0 |  1 |  2 |  3 |  4 | Mean | Lower | Upper
  1  |  0 |  0 |  0 |  4 |  6 | 3.60 |  3.08 |  3.84
  2  |  0 |  0 |  2 |  5 |  3 | 3.10 |  2.50 |  3.51
  3  |  2 |  0 |  2 |  6 |  0 | 2.20 |  1.59 |  2.77
  4  |  1 |  2 |  3 |  3 |  1 | 2.10 |  1.50 |  2.68

(Columns 0-4 give the number of raters selecting each rating category.)
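The tabled values can be recovered from the rating frequencies alone. This is a sketch assuming the same Wilson-based computation described in the steps above; the function name and data layout are my own:

```python
import math

def score_ci_from_freqs(freqs, z=1.96):
    """Mean and Score CI from a list of frequencies for categories 0..k."""
    k = len(freqs) - 1
    n = sum(freqs)
    total = sum(rating * f for rating, f in enumerate(freqs))
    N, p = n * k, total / (n * k)
    half = z * math.sqrt(z ** 2 + 4 * N * p * (1 - p))
    denom = 2 * (N + z ** 2)
    lower = k * (2 * N * p + z ** 2 - half) / denom
    upper = k * (2 * N * p + z ** 2 + half) / denom
    return total / n, lower, upper

# Rating frequencies (categories 0-4) for the four example items
items = {1: [0, 0, 0, 4, 6], 2: [0, 0, 2, 5, 3],
         3: [2, 0, 2, 6, 0], 4: [1, 2, 3, 3, 1]}
results = {i: tuple(round(v, 2) for v in score_ci_from_freqs(f))
           for i, f in items.items()}
```

Each entry of `results` is (mean, lower, upper) and reproduces the corresponding table row to two decimals.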
Conclusions
1. The Score method provides a confidence interval that is
not dependent on the normality assumption.
2. It outperforms the Wald interval when the number of
raters and scale categories is small.
3. It provides a decision-making method for the fate of
items in expert review sessions.
4. Computational complexity can be eased through
simple programming in Excel, SPSS, and SAS.
For further reading,
Penfield, R. D. (2003). A score method for constructing
asymmetric confidence intervals for the mean of a
rating scale item. Psychological Methods, 8, 149-163.
Penfield, R. D., & Miller, J. M. (in press). Improving
content validation studies using an asymmetric
confidence interval for the mean of expert ratings.
Applied Measurement in Education.