Transcript: Reliability

Reliability
Consistency in testing
Types of variance
• Meaningful variance
– Variance between test takers that reflects differences in the ability or skill being measured
• Error variance
– Variance between test takers that is caused by factors other than differences in the ability or skill being measured
• Test developers as ‘variance chasers’
Sources of error variance
• Measurement error
• Environment
• Administration procedures
• Scoring procedures
• Examinee differences
• Test and items
Remember, OS = TS + E (observed score = true score + error)
Estimating reliability for NRTs (norm-referenced tests)
• Are the test scores reliable over time? Would a student get the same score if tested tomorrow?
• Are the test scores reliable over different forms of the same test? Would the student get the same score if given a different form of the test?
• Is the test internally consistent?
Reliability coefficient ($r_{xx}$)
• Range: 0.0 (totally unreliable test) to 1.0 (perfectly reliable test)
• Reliability coefficients are estimates of the proportion of systematic variance in the test scores
• A lower reliability coefficient means greater measurement error in the test scores
Test-retest reliability
1. Same students take the test twice
2. Calculate the correlation between the two administrations (Pearson's r; see the sketch after this list)
3. Interpret r as reliability (a conservative estimate)
• Problems
– Logistically difficult
– Learning might take place between tests
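Step 2 is an ordinary Pearson product-moment correlation between the two administrations. A minimal Python sketch, with hypothetical scores (the numbers are illustrative, not from the slides):

```python
# Test-retest reliability as a Pearson correlation.
# The two score lists are made up for illustration.
from statistics import mean

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

time1 = [50, 47, 62, 41, 55, 58]   # first administration
time2 = [52, 45, 60, 44, 54, 59]   # same students, retested
print(round(pearson_r(time1, time2), 2))  # interpreted as test-retest reliability
```

The same two-step procedure (correlate, then interpret r) applies to equivalent forms reliability below, with the two forms standing in for the two administrations.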
Equivalent forms reliability
1. Same students take parallel forms of test
2. Calculate correlation
• Problems
– Creating parallel forms can be tricky
– Logistical difficulty
University of Michigan English Placement Test
(University of Michigan English Placement Test Examiner's Manual)
Internal consistency reliability
• Calculating the reliability from a single
administration of a test
• Commonly reported
– Split-half
– Cronbach alpha
– K-R20
– K-R21
• Calculated automatically by many
statistical software packages
Split-half reliability
1. The test is split in half (e.g., odd / even)
creating “equivalent forms”
2. The two “forms” are correlated with each
other
3. The correlation coefficient is adjusted to
reflect the entire test length
– Spearman-Brown Prophecy formula
Calculating split half reliability

ID  Q1  Q2  Q3  Q4  Q5  Q6  Odd  Even
1    1   0   0   1   1   0    2     1
2    1   1   0   1   0   1    1     3
3    1   1   1   1   1   0    3     2
4    1   0   0   0   1   0    2     0
5    1   1   1   1   0   0    2     2
6    0   0   0   0   1   0    1     0

Odd:  Mean = 1.83, SD = 0.75
Even: Mean = 1.33, SD = 1.21
Calculating split half reliability (2)

Odd  Mean  Diff   Even  Mean  Diff   Prod.
2    1.83   0.17  1     1.33  -0.33  -0.056
1    1.83  -0.83  3     1.33   1.67  -1.386
3    1.83   1.17  2     1.33   0.67   0.784
2    1.83   0.17  0     1.33  -1.33  -0.226
2    1.83   0.17  2     1.33   0.67   0.114
1    1.83  -0.83  0     1.33  -1.33   1.104
                              Sum:    0.334
Calculating split half

$r = \frac{\sum(\text{Diff}_{odd} \times \text{Diff}_{even})}{n \, S_{odd} \, S_{even}} = \frac{0.334}{(6)(0.75)(1.21)} = 0.06$

Adjust for test length using the Spearman-Brown Prophecy formula:

$r_{xx} = \frac{2 \times 0.06}{(2-1)(0.06) + 1} = 0.11$
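For readers who want to check the arithmetic, here is a minimal Python sketch that reproduces the slide's calculation on the six-student item matrix above. The denominator follows the slide's layout (n times the two half-test SDs, with n−1 SDs):

```python
# Sketch reproducing the slide's split-half calculation.
from statistics import mean, stdev

items = [                       # six students x six items, from the example
    [1, 0, 0, 1, 1, 0],
    [1, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
]

odd  = [sum(row[0::2]) for row in items]   # Q1 + Q3 + Q5
even = [sum(row[1::2]) for row in items]   # Q2 + Q4 + Q6

# Half-test correlation as laid out on the slide:
# sum of deviation products / (n * SD_odd * SD_even).
n = len(items)
prods = sum((o - mean(odd)) * (e - mean(even)) for o, e in zip(odd, even))
r_half = prods / (n * stdev(odd) * stdev(even))   # ≈ 0.06

# Spearman-Brown adjustment to full test length.
r_xx = (2 * r_half) / ((2 - 1) * r_half + 1)      # ≈ 0.11
print(round(r_half, 2), round(r_xx, 2))
```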
Cronbach alpha
• Similar to split half but easier to calculate

$\alpha = 2\left(1 - \frac{S_{odd}^2 + S_{even}^2}{S_{total}^2}\right) = 2\left(1 - \frac{(0.75)^2 + (1.21)^2}{(1.47)^2}\right) = 0.12$
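A matching sketch for the two-half alpha formula above, reusing the odd/even half scores from the worked example:

```python
# Two-half Cronbach alpha, as on the slide.
from statistics import stdev

odd   = [2, 1, 3, 2, 2, 1]               # odd-item half scores from the example
even  = [1, 3, 2, 0, 2, 0]               # even-item half scores
total = [o + e for o, e in zip(odd, even)]  # whole-test scores

alpha = 2 * (1 - (stdev(odd) ** 2 + stdev(even) ** 2) / stdev(total) ** 2)
print(round(alpha, 2))   # ≈ 0.12
```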
K-R20
• “Rolls-Royce” of internal reliability
estimates
• Simulates calculating split-half reliability
for every possible combination of items
K-R20 formula

$K\text{-}R20 = \frac{k}{k-1}\left(1 - \frac{\sum S_i^2}{S_t^2}\right)$

Note that $S_t^2$ is the variance of the total scores, not the standard deviation, and $\sum S_i^2$, the sum of item variances, equals the sum of IF(1 − IF) across items.
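A sketch of K-R20 in Python. The slides do not compute K-R20 for the six-student example, so the result here is purely illustrative:

```python
# K-R20 sketch on the six-student item matrix from the split-half example.
items = [
    [1, 0, 0, 1, 1, 0],
    [1, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
]
k = len(items[0])   # number of items
n = len(items)      # number of examinees

# Item facility (IF) = proportion correct; item variance = IF(1 - IF).
ifs = [sum(row[i] for row in items) / n for i in range(k)]
sum_item_var = sum(p * (1 - p) for p in ifs)

# Total-score (population) variance, to match the IF(1-IF) item variances.
totals = [sum(row) for row in items]
m = sum(totals) / n
s2_t = sum((t - m) ** 2 for t in totals) / n

kr20 = (k / (k - 1)) * (1 - sum_item_var / s2_t)
print(round(kr20, 2))   # ≈ 0.41 for these illustrative data
```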
K-R21
• Slightly less accurate than K-R20, but can be calculated with just descriptive statistics
• Tends to underestimate reliability
K-R21 formula

$K\text{-}R21 = \frac{k}{k-1}\left(1 - \frac{M(k - M)}{kS^2}\right)$

Note that $S^2$ is the variance (the standard deviation squared) of the total scores.
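A minimal K-R21 sketch. As a hypothetical check, the descriptives are taken from the TAP report below, with S backed out of the reported SEM (an assumption, since the report does not list the SD directly):

```python
# K-R21 from descriptive statistics alone: k items, mean M, SD S.
def kr21(k, m, s):
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * s ** 2))

# Hypothetical check against the TAP report below: k = 40,
# M = 40 * 0.597 (mean item difficulty), and S backed out of
# SEM = S * sqrt(1 - KR20)  =>  S = 2.733 / sqrt(1 - 0.882).
k, m = 40, 40 * 0.597
s = 2.733 / (1 - 0.882) ** 0.5
print(round(kr21(k, m, s), 3))   # ≈ 0.87, matching the reported KR21 of 0.870
```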
Test summary report (TAP)

Number of Items Excluded    =  0
Number of Items Analyzed    =  40
Mean Item Difficulty        =  0.597
Mean Item Discrimination    =  0.491
Mean Point Biserial         =  0.417
Mean Adj. Point Biserial    =  0.369
KR20 (Alpha)                =  0.882
KR21                        =  0.870
SEM (from KR20)             =  2.733
# Potential Problem Items   =  9
High Grp Min Score (n=15)   =  31.000
Low Grp Max Score (n=14)    =  17.000
Split-Half (1st/ 2nd) Reliability = 0.307 (with Spearman-Brown = 0.470)
Split-Half (Odd/Even) Reliability = 0.865 (with Spearman-Brown = 0.927)
Standard Error of Measurement

If we give a student the same test repeatedly (test-retest), we would expect to see some variation in the scores, e.g.:

50, 49, 52, 50, 51, 49, 48, 50

With enough repetition, these scores would form a normal distribution. We would expect the student to score near the center of the distribution most often.
Standard Error of Measurement
• The greater the reliability of the test, the
smaller the SEM
• We expect the student to score within one
SEM approximately 68% of the time
• If a student has a score of 50 and the SEM is 3, we expect the student to score between 47 and 53 approximately 68% of the time on a retest
Interpreting the SEM

For a score of 29 (using the K-R21 SEM of 3):
• 26 ~ 32 is within 1 SEM
• 23 ~ 35 is within 2 SEM
• 20 ~ 38 is within 3 SEM
Calculating the SEM

$SEM = S\sqrt{1 - r_{xx}}$

What is the SEM for a test with a reliability of r = .889 and a standard deviation of 8.124?
SEM = 2.7

What if the same test had a reliability of r = .95?
SEM = 1.8
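A one-line check of the two worked examples above:

```python
# SEM = S * sqrt(1 - r_xx), using the numbers from the slide.
def sem(s, r_xx):
    return s * (1 - r_xx) ** 0.5

print(round(sem(8.124, 0.889), 1))   # 2.7
print(round(sem(8.124, 0.95), 1))    # 1.8
```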
Reliability for performance assessment

Traditional fixed response assessment:
Test-taker → Instrument (test) → Score

Performance assessment (e.g. writing, speaking):
Test-taker → Task → Performance → Rater / judge (applying a scale) → Score
Interrater/Intrarater reliability
1. Calculate the correlation between all combinations of raters
2. Adjust using Spearman-Brown to account for the total number of raters giving the score (see the sketch below)
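A minimal sketch of this two-step procedure, assuming three hypothetical raters scoring the same six performances (all ratings are made up for illustration). The average pairwise Pearson r is stepped up with Spearman-Brown to the reliability of the full rater panel:

```python
# Interrater reliability sketch: average pairwise correlation among raters,
# then Spearman-Brown stepped up to the full panel. Ratings are hypothetical.
from itertools import combinations
from statistics import mean

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

ratings = {                      # same six performances scored by three raters
    "rater1": [4, 3, 5, 2, 4, 3],
    "rater2": [4, 2, 5, 3, 4, 2],
    "rater3": [5, 3, 4, 2, 3, 3],
}

r_avg = mean(pearson_r(ratings[a], ratings[b])
             for a, b in combinations(ratings, 2))

k = len(ratings)                 # number of raters contributing to the score
r_panel = (k * r_avg) / (1 + (k - 1) * r_avg)   # Spearman-Brown
print(round(r_avg, 2), round(r_panel, 2))
```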