Rater Reliability

How Good is Your Coding?
Why Estimate Reliability?
Quality of your data
Number of coders or raters needed
Reviewers/Grant Applications
For What Variables Do You Need Reliability Estimates?
Any variables with judgments
Ratings of any kind
Recordings, even of numbers or counts
Basically, all of them
Data Collection (1)
1 judge rates all targets. NA1.
2 judges, each rates a (different) half of the targets; or more than 2, but each rates different targets. NA2.
2 judges, each rates all targets; or 3 or more, all rate all. Crossed design.
4 judges, with a different pair rating each target (all targets rated by 2, but a different 2 for each target); or 3 or more, not all rating all. Nested design.
Data Collection (2)
IMHO, use a fully crossed design to estimate reliability (otherwise it will be hard to estimate and you will have to hire help). Fully crossed is good for the final data collection, too, but may not be feasible.
Use any design (crossed or nested) to collect the real data.
Use the proper estimate of reliability (fixed for crossed, random for nested, with the proper number of raters) for the design you finally use.
Estimation (1)
Use the data you collected to compute sums of squares
for judge, target, and error. SAS GLM can do this for
you.
Compute ICC(3,1) or ICC(2,1) depending on whether your raters will be fixed (crossed) or random (nested).
Apply Spearman-Brown to estimate the reliability of
your data.
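As a rough sketch of the first step (assuming a complete n-targets-by-k-raters matrix, and using Python/numpy rather than SAS GLM; the function name is my own, not from the slides), the three mean squares can be computed directly:

```python
# Minimal sketch, not SAS GLM: mean squares for a fully crossed
# targets x raters rating matrix (rows = targets, columns = raters).
import numpy as np

def anova_mean_squares(ratings):
    """Return (BMS, JMS, EMS): between-target, between-judge, and
    residual (error) mean squares for an n x k matrix with no missing data."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    ss_target = k * ((ratings.mean(axis=1) - grand) ** 2).sum()
    ss_judge = n * ((ratings.mean(axis=0) - grand) ** 2).sum()
    ss_error = ((ratings - grand) ** 2).sum() - ss_target - ss_judge
    return (ss_target / (n - 1),
            ss_judge / (k - 1),
            ss_error / ((n - 1) * (k - 1)))
```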
Estimation (2)
If you collected fully crossed data (all judges saw all
targets for entire study), you can treat each rater as a
column (item), and each target or study as a row
(person), and then compute Cronbach's alpha for those data as a rater reliability index. Alpha = ICC(3,k).
You can't do that if raters and targets are not crossed.
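As a minimal sketch of that computation (assuming complete, fully crossed data, and using numpy rather than SPSS or SAS), alpha treating raters as items equals ICC(3,k) = (BMS - EMS)/BMS:

```python
# Sketch: Cronbach's alpha with raters as columns (items) and targets as rows.
import numpy as np

def cronbach_alpha(ratings):
    """ratings: n targets (rows) x k raters (columns), no missing data."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]
    item_vars = ratings.var(axis=0, ddof=1)      # variance of each rater's column
    total_var = ratings.sum(axis=1).var(ddof=1)  # variance of the row sums
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```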
Illustration (1)
3 raters judge the rigor of 5 articles using a 1-to-5 scale.
Study   Jim   Joe   Sue
1       2     3     1
2       3     2     2
3       4     3     3
4       5     4     4
5       5     5     3
Illustration (2)
Computer Input: One column for ratings, one for
rater, one for target.
Analysis: GLM with rating as the dependent variable and rater, target, and rater by target as the effects (you can use SAS, SPSS, R, whatever).
Output: sums of squares and mean squares for each.
Source         Type III SS   Mean Square
Rater          3.73          1.87
Target         14.27         3.57
Rater*Target   2.93          .37
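For readers without SAS, here is a sketch that reproduces the table above from the example ratings using numpy (the ANOVA decomposition done by hand, not a GLM call):

```python
# Sketch reproducing the sums of squares and mean squares above.
import numpy as np

ratings = np.array([[2, 3, 1],   # study 1: Jim, Joe, Sue
                    [3, 2, 2],   # study 2
                    [4, 3, 3],   # study 3
                    [5, 4, 4],   # study 4
                    [5, 5, 3]])  # study 5
n, k = ratings.shape
grand = ratings.mean()
ss_rater = n * ((ratings.mean(axis=0) - grand) ** 2).sum()            # ~3.73
ss_target = k * ((ratings.mean(axis=1) - grand) ** 2).sum()           # ~14.27
ss_error = ((ratings - grand) ** 2).sum() - ss_rater - ss_target      # ~2.93
print(ss_rater / (k - 1),               # rater MS  ~1.87
      ss_target / (n - 1),              # target MS ~3.57
      ss_error / ((n - 1) * (k - 1)))   # error MS  ~0.37
```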
Illustration (3)
Use mean squares to compute intraclass correlations.
ICC(2,1), one random rater:

$$\mathrm{ICC}(2,1) = \frac{BMS - EMS}{BMS + (k-1)EMS + k(JMS - EMS)/n}
= \frac{3.57 - .37}{3.57 + (3-1)(.37) + 3(1.87 - .37)/5} = .61$$

ICC(3,1), one fixed rater:

$$\mathrm{ICC}(3,1) = \frac{BMS - EMS}{BMS + (k-1)EMS}
= \frac{3.57 - .37}{3.57 + (3-1)(.37)} = .74$$

Here BMS is the target (between-studies) mean square, JMS the rater (judge) mean square, EMS the error (rater-by-target) mean square, k the number of raters (3), and n the number of targets (5).
See Shrout & Fleiss (1979) for additional ICCs.
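The same arithmetic as a short script, plugging in the mean squares from the table:

```python
# Sketch: plug the mean squares into the ICC(2,1) and ICC(3,1) formulas above.
BMS, JMS, EMS = 3.57, 1.87, 0.37   # target, rater, and error mean squares
k, n = 3, 5                        # raters, targets

icc_2_1 = (BMS - EMS) / (BMS + (k - 1) * EMS + k * (JMS - EMS) / n)  # ~0.61
icc_3_1 = (BMS - EMS) / (BMS + (k - 1) * EMS)                        # ~0.74
print(round(icc_2_1, 2), round(icc_3_1, 2))
```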
Illustration (4)
Use Spearman-Brown to estimate the reliability of multiple raters and to estimate the number of raters needed for a desired level of reliability.
Reliability of 2 raters (stepped-up reliability):

$$\rho_{CC'} = \frac{k\,\rho_{ii}}{1 + (k-1)\,\rho_{ii}}$$

Raters needed for $r_{xx}$ of .90:

$$m = \frac{\rho^{*}(1 - \rho_{L})}{\rho_{L}(1 - \rho^{*})}$$

Random, ICC(2,1) = .61:

$$\rho_{CC'} = \frac{2(.61)}{1 + .61} = .76 \qquad m = \frac{.9(1 - .61)}{.61(1 - .90)} = 5.75 \rightarrow 6$$

Fixed, ICC(3,1) = .74:

$$\rho_{CC'} = \frac{2(.74)}{1 + .74} = .85 \qquad m = \frac{.9(1 - .74)}{.74(1 - .90)} = 3.16 \rightarrow 4$$

Here $\rho_{ii}$ (or $\rho_{L}$) is the single-rater reliability, k (or m) the number of raters, and $\rho^{*}$ the desired reliability.
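A small sketch of the two Spearman-Brown calculations (the function names are mine, not from the slides):

```python
# Sketch of the stepped-up reliability and raters-needed formulas.
def stepped_up(rho_single, k):
    """Reliability of the mean of k raters given single-rater reliability."""
    return k * rho_single / (1 + (k - 1) * rho_single)

def raters_needed(rho_single, rho_target):
    """Number of raters needed to reach a desired reliability."""
    return rho_target * (1 - rho_single) / (rho_single * (1 - rho_target))

print(stepped_up(0.61, 2), raters_needed(0.61, 0.90))  # ~0.76, ~5.75 -> 6 raters
print(stepped_up(0.74, 2), raters_needed(0.74, 0.90))  # ~0.85, ~3.16 -> 4 raters
```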
SPSS
Raters are columns, targets (studies) are rows
Analyze, Scale, Reliability Analysis
Drag all columns into Items
The default (Model: Alpha) will produce ICC(3,k)
In this case alpha = .897 (three judges; the same judges rate every target, and we take the average)
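A quick cross-check of that .897 without SPSS, computing alpha from the example data with numpy:

```python
# Cross-check of the alpha reported above; equals ICC(3,k) for complete data.
import numpy as np

ratings = np.array([[2, 3, 1], [3, 2, 2], [4, 3, 3], [5, 4, 4], [5, 5, 3]])
k = ratings.shape[1]
item_vars = ratings.var(axis=0, ddof=1).sum()   # sum of rater (column) variances
total_var = ratings.sum(axis=1).var(ddof=1)     # variance of target totals
alpha = (k / (k - 1)) * (1 - item_vars / total_var)
print(round(alpha, 3))  # ~0.897
```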
SPSS (2)
To get 1 fixed judge: Analyze, Scale, Reliability Analysis, move all columns into Items, then click Statistics
Check the box for Intraclass correlation coefficient
For 1 fixed judge, choose 2-way mixed, click OK, then run
In this case 1 fixed judge is .74.
For 1 random judge, click 1-way random
In this case, 1 random judge = .59 (not exactly the .61 above, because the one-way model pools rater variance into the error term).
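For reference, a sketch of the one-way random, single-rater ICC (ICC(1,1)) that this SPSS option reports; in the one-way model the rater and error sums of squares are pooled into a single within-target mean square:

```python
# Sketch of ICC(1,1) from the example's sums of squares and mean squares.
BMS = 3.57                                   # between-target mean square
ss_rater, ss_error, n, k = 3.73, 2.93, 5, 3
WMS = (ss_rater + ss_error) / (n * (k - 1))  # within-target mean square
icc_1_1 = (BMS - WMS) / (BMS + (k - 1) * WMS)
print(round(icc_1_1, 2))  # ~0.59
```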
Categorical Agreement
If the same data were categorical, we could compute a percent agreement for each item and average over items. This does not take chance agreement into account, but it is easy to do.
We should use kappa in such cases.
You can use SPSS if there are 2 raters, but not if there are more.
You can use SAS (my program) if there are more than two:
http://faculty.cas.usf.edu/mbrannick/software/kappa.htm
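A minimal sketch of percent agreement and Cohen's kappa for two raters (this is not the author's SAS program, and the category codes below are made up purely for illustration):

```python
# Sketch: percent agreement and Cohen's kappa for two raters who each
# assign a category to the same set of targets.
from collections import Counter

def percent_agreement(r1, r2):
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    n = len(r1)
    p_obs = percent_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement: product of the two raters' marginal proportions per category.
    p_chance = sum((c1[c] / n) * (c2[c] / n) for c in set(c1) | set(c2))
    return (p_obs - p_chance) / (1 - p_chance)

# Hypothetical categorical codes, for illustration only.
a = ["yes", "no", "yes", "yes", "no", "yes"]
b = ["yes", "no", "no", "yes", "no", "yes"]
print(percent_agreement(a, b), round(cohens_kappa(a, b), 2))
```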