Classical Test Theory


Introduction to
Classical Test Theory (CTT)
X=T+e
•Meaning of X, T, and e.
•Basic assumptions
•Parallel tests
•Reliability
•Standard Error of Measurement
•p-value & point-biserial
X=T+e
•X = observed score (this is obvious)
•T = “true” score
•e = error
T and e require some explanation
T, “true” score
•If you take two forms of equal difficulty, you
get two different scores.
•Suppose you take many such tests
•T is your mean score on all these tests
•T is an unobservable theoretical concept
e, error
•Does NOT refer to “error” as in baseball NOR to
mistakes in testing or scoring.
•e is the difference between X and T.
•e is, thus, related to “standard error”
•If we have many samples:
–X is the sample statistic
–T is the average of X over the samples
–standard error of X is SD of X over the samples,
i.e., approximately the average size of e
e, error
•Why does a student get different scores on two
different tests of the same difficulty level?
•Short answer: Luck!
•Example: Spelling test, 1000-word pool.
Suppose you know 90%. Imagine two tests, 10
words each, assembled to have same average
score for all students.
–On one test, by luck, you know all 10 words.
–On the other, by really bad luck, you know only 7!
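The spelling-test example above is easy to simulate. The sketch below (not from the slides; the pool size, 90% knowledge rate, and 10-word forms are taken from the example) draws many random 10-word forms and shows that the mean score settles near 9 — the "true" score T — while individual forms swing above and below it by luck alone.

```python
import random

random.seed(42)

POOL = 1000          # words in the pool
KNOWN = 900          # the student knows 90% of them
TEST_LEN = 10        # words per test form

# Mark 900 of the 1000 words as "known" for this student.
known_words = set(range(KNOWN))

def take_test():
    """Draw a random 10-word form and count how many words the student knows."""
    form = random.sample(range(POOL), TEST_LEN)
    return sum(1 for w in form if w in known_words)

scores = [take_test() for _ in range(10000)]
mean_score = sum(scores) / len(scores)

print(f"mean score over 10000 forms (T): {mean_score:.2f}")
print(f"luckiest score seen:   {max(scores)}")
print(f"unluckiest score seen: {min(scores)}")
```

The mean over many forms hovers right around 9, but single forms of 7 (bad luck) and 10 (good luck) both show up, exactly as in the example.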
Basic Assumptions of CTT
•No cheating.
•No copying between examinees
•Luck is completely random.
–Difficulty level of form does not affect luck
–How you did on another test does not affect luck
•If you take two forms, no “learning” occurred while
taking the first
Parallel Tests
According to CTT, two tests are parallel if:
– All students have same “true” score on both tests.
– SD of observed scores is the same on both tests.
First condition relates to test difficulty. Second
condition relates to reliability.
So, we could simplify and say two tests are
parallel if:
– They have the same difficulty
– They have the same reliability
Parallel Tests: Beyond CTT
Of course, Parallel Tests must be much more
than just statistically parallel:
– Types of questions
– Content
– Time limit
– Test-taking and administration directions
– Legibility
– Art work
Reliability
•A test is a measurement. Two parallel tests are two
independent measurements. A student’s scores on two
parallel tests are likely to be different.
•Roughly speaking, the degree to which such differences
are minimized is Reliability.
•The greater the consistency of test scores between a test
and its parallel form, the greater the reliability of the test.
•Definition:
Reliability = correlation b/t scores on parallel forms.
Correlation
Analysis of Data that come in pairs
• Examples:
– {Sodium/serving | Sugar/serving} in sample of 10
cereals
– {Height | Weight} in a sample of 25 individuals
– {Score on one item | sum score on rest of items} in
a sample of 400 students (point-biserial)
– {Score on test form | Score on parallel form} in a
sample of 1000 students (reliability)
Interpreting Correlations
• Correlations measure the degree to which one variable
has a linear relationship with another.
• 0 ≤ Magnitude of Correlation ≤ 1
0 means no linear relationship
1 means perfect linear relationship
• Sign of correlation:
– Positive → increase in one gives increase in other
– Negative → increase in one gives decrease in other
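Correlation is just an arithmetic recipe on paired data. A minimal sketch (the height/weight numbers are made up for illustration, in the spirit of the {Height | Weight} example above):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the SDs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical {Height | Weight} pairs for 5 individuals:
heights = [62, 65, 67, 70, 73]
weights = [120, 140, 150, 165, 180]
print(f"r = {pearson_r(heights, weights):.3f}")
```

Perfectly linear pairs give r = 1 (or −1 for a decreasing line); unrelated pairs give r near 0.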
Interpreting Correlations
[Four example scatterplots, with correlations of .87, .56, −.57, and .19]
Interpreting Reliability as a Correlation
• The degree to which scores on a test are linearly
related to scores on a parallel test.
• 0 < Reliability < 1
• Reliability is typically about 0.9 for
standardized tests.
Methods for Estimating Reliability
• Parallel forms (we almost never have true parallel forms)
• Test-Retest (give the same test twice? Forget it!)
• Split-half
• Cronbach’s alpha
Estimating Reliability: Split-half
• Here’s a great idea:
- Split the test in half (two parallel halves)
- Correlate scores on the two halves
- Scale up to get the correlation that
corresponds to two full-length tests
• Hard to get two parallel halves.
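The "scale up" step uses the Spearman-Brown formula, 2r / (1 + r), which converts the half-length correlation into a full-length reliability. A small sketch of the whole procedure (the 6-student, 6-item data set is invented for illustration; the odd/even split is one arbitrary choice of halves):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the SDs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def split_half_reliability(responses):
    """responses: one row of 0/1 item scores per student.
    Split items into odd vs. even halves, correlate the half scores,
    then scale up with Spearman-Brown: 2r / (1 + r)."""
    odd = [sum(row[0::2]) for row in responses]
    even = [sum(row[1::2]) for row in responses]
    r_half = pearson_r(odd, even)
    return 2 * r_half / (1 + r_half)

# Hypothetical data: 6 students x 6 items, scored 0/1.
data = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(f"split-half reliability: {split_half_reliability(data):.3f}")
```

A different split of the items would give a somewhat different answer — which is exactly the problem Cronbach's alpha fixes on the next slide.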
Estimating Reliability: Cronbach’s alpha
• Another great idea:
- Do all possible split halves
- Take average of all the scaled-up
correlations
- That’s Cronbach’s alpha!
• Sounds computationally intensive
Estimating Reliability: Cronbach’s alpha
N = No. of items on test
SD = SD of scores on test
pi = p-value for item i

alpha = [N / (N – 1)] x [SD² – (p1(1 – p1) + p2(1 – p2) + … + pN(1 – pN))] / SD²
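The formula above (the KR-20 form of alpha, for 0/1-scored items) turns out not to be computationally intensive at all. A direct sketch, using an invented 6-student, 6-item data set:

```python
def kr20_alpha(responses):
    """Cronbach's alpha for 0/1 items, using the slide's formula:
    alpha = [N/(N-1)] x [SD^2 - sum of pi(1-pi)] / SD^2.
    responses: one row of item scores per student."""
    n_students = len(responses)
    n_items = len(responses[0])

    # SD^2 of total test scores (population variance).
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_students
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students

    # p-value of each item = proportion of students answering it correctly.
    p = [sum(row[i] for row in responses) / n_students for i in range(n_items)]
    item_var_sum = sum(pi * (1 - pi) for pi in p)

    return (n_items / (n_items - 1)) * (var_total - item_var_sum) / var_total

# Hypothetical data: 6 students x 6 items, scored 0/1.
data = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(f"alpha: {kr20_alpha(data):.3f}")
```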
Standard Error of Measurement
•SEM is an estimate of the average size of e in a
population.
•Beginning with X = T + e we can derive the following
formula for SEM:
SEM = SD x SQRT(1 – Reliability)
•CTT assumes every examinee has the same SEM
•If everyone took many parallel forms, the SD of their
scores would be the same for everyone.
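The SEM formula is a one-liner. A quick sketch with hypothetical numbers (an SD of 15 score points and the typical 0.90 reliability mentioned earlier):

```python
import math

def sem(sd, reliability):
    """Standard Error of Measurement: SD x SQRT(1 - Reliability)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical test: score SD of 15, reliability 0.90.
print(f"SEM: {sem(15, 0.90):.2f} score points")
```

So even on a quite reliable test, an examinee's observed scores would typically wander almost 5 points around their true score.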
CTT Item Parameters
•P-value:
–the mean of the scored responses for an item
–Used as an indicator of item difficulty
•Point-biserial:
–the correlation between score on an item and the
sum-score on all the other items on the test.
– Used as an indicator of item discrimination power.
•Simple, yet informative, statistics: accurately
measured with sample sizes as small as 400
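Both item parameters fall straight out of the definitions above. A minimal sketch (the 6-student data set is invented for illustration; the "rest score" is the sum over all other items, per the definition on the slide):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the SDs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def item_stats(responses, item):
    """p-value and point-biserial for one item.
    p-value = mean of the scored responses for the item.
    point-biserial = correlation of the item score with the
    sum score on all the OTHER items."""
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    p_value = sum(item_scores) / len(item_scores)
    return p_value, pearson_r(item_scores, rest_scores)

# Hypothetical data: 6 students x 6 items, scored 0/1.
data = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
p, rpb = item_stats(data, 0)
print(f"item 1: p-value = {p:.2f}, point-biserial = {rpb:.2f}")
```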
Limitations of CTT
•Item parameters change (even their order of
difficulty!) with student population, making
them hard to interpret.
•True scores change across test forms. Hard to
compare students who took different forms.
•Test level model, not item model.
•SEM is same for all examinees.
•Reliability changes with student population
Summary: Intro to CTT
•Basic equation is: X = T + e
•Parallel Test Forms
–Examinees have same T on both forms
–Observed Score SD is same for both forms.
•Reliability:
–Correlation of X across Parallel Tests.
–Estimated by Cronbach’s a (modified split-half
approach)
Summary: Intro to CTT
•SEM=Standard Error of Measurement
–SD of X for an examinee over many parallel
forms
–Related to reliability by simple formula
•CTT statistics have severe limitations
– item statistics change with student population
– student statistics change with test forms
– SEM is same for all student scores
– Reliability changes with student population
Introduction to CTT
Thanks again for coming! And for the nice
comments on my Basic Stats presentation.
Hope to see you next time when Liz will unravel the
complexities of setting standards and show how
CTT statistics and human judgment are used to
yield a logical step-by-step process that makes
sense of this complex enterprise.
Famous Two-Number Data Summary:
Mean & Standard Deviation
• Mean: the ordinary average of all the data.
– If you had to pick one number to typify your data.
– p-value is a mean
• Standard Deviation (SD): average deviation
from mean
– Obviously, not all the data equal the mean
– SD tells, on average, how spread out the data are
from the mean
– Can be used to identify extreme values
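Both numbers are simple averages. A minimal sketch using a handful of invented p-values (not the data behind the histogram on the next slide):

```python
import math

# Hypothetical item p-values for a short test.
data = [0.45, 0.52, 0.60, 0.63, 0.70, 0.78, 0.85]

mean = sum(data) / len(data)
# SD: root of the average squared deviation from the mean.
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))

print(f"mean = {mean:.2f}, SD = {sd:.2f}")
```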
Mean & SD
Example using p-values
[Histogram of item p-values, 0 to 1; Mean = 0.63, 1 SD = 0.18, 2-SD range marked]
Mean & SD
Example using Heights
[Histogram of heights, 55 to 80 inches; Mean = 67, 1 SD = 3.48, 2-SD range marked]
The “Standard Error”
• Data are sampled from a population.
• Sample Mean is calculated.
• A 2nd sample would have a different Mean.
The “Standard Error”
• How much would the Sample Mean vary, on
average, over many samples?
• What would the SD of the Sample Mean be
over many samples?
• That’s the Standard Error (SE)!
• It tells you how reliable your Sample Mean is.
Standard Error
• Ok, this is very nice. But in real life we
can’t take 10,000 samples!
• In real life we get ONE sample!
How can we possibly figure out the SE for
the sample mean for a real data sample?
• Magic of Statistics
SE = (SAMPLE SD ) / SQRT(sample size)
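The "magic" is easy to check by simulation (a sketch with an invented population of heights; the population mean and SD are hypothetical). We take thousands of samples to measure how the sample mean actually varies, then compare that to the one-sample formula:

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical population of 100,000 heights.
population = [random.gauss(67, 3.5) for _ in range(100000)]

# Take many samples and watch how the Sample Mean varies.
sample_size = 25
sample_means = [
    statistics.mean(random.sample(population, sample_size))
    for _ in range(2000)
]

# The SD of the sample means, observed directly from the simulation:
observed_se = statistics.pstdev(sample_means)

# The one-sample formula from the slide, applied to a single sample:
one_sample = random.sample(population, sample_size)
formula_se = statistics.stdev(one_sample) / math.sqrt(sample_size)

print(f"observed SE of the mean (many samples): {observed_se:.2f}")
print(f"formula estimate from ONE sample:       {formula_se:.2f}")
```

Both land near 3.5 / SQRT(25) = 0.7 — the formula recovers from one sample what would otherwise take thousands.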