Testing 05 - 厦门大学外文学院


Testing 05
Reliability
Errors & Reliability
• Errors in the test cause unreliability.
• The fewer the errors, the more reliable the test.
• Sources of error:
  – Obvious: poor health, fatigue, lack of interest
  – Less obvious: the facets discussed in Fig. 5.3
Reliability & Validity
• Reliability is a necessary condition for validity.
• Reliability and validity are complementary aspects of measurement.
• Reliability: how much of the performance is due to measurement error, i.e. to factors other than the language ability we want to measure.
• Validity: how much of the performance is due to the language ability we want to measure.
Reliability Measurement
• Reliability measurement includes logical analysis and empirical research, i.e. identifying the sources of error and estimating the magnitude of their effects on the scores.
Logical Analysis
• Example of identifying a source of error:
• Topic in an oral interview: business negotiation
  – Source of error: if we want to measure the test taker's ability to handle general topics
  – Indicator of the ability: if we want to measure the test taker's command of business English
Empirical Research
• Procedures are usually complex.
• Three kinds of theories:
  – Classical true score theory (CTS)
  – Generalizability theory (G-theory)
  – Item response theory (IRT)
Factors on Test Scores
• Characteristics of factors:
  – general vs. specific
  – lasting vs. temporary
  – systematic vs. unsystematic
• Factors that affect language test scores
Variance & Standard Deviation
• $s$: standard deviation of the sample
• $\sigma$: standard deviation of the population
• $s^2$: variance of the sample
• $\sigma^2$: variance of the population
• $s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
• where
  – $X$: individual score
  – $\bar{X}$: mean score
  – $n$: number of students
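To make the formula concrete, here is a minimal Python sketch (the scores are invented illustration data) that applies $s = \sqrt{\sum (X-\bar{X})^2/(n-1)}$ and checks the result against the standard library:

```python
import statistics

scores = [72, 85, 90, 68, 77, 81]   # hypothetical test scores
n = len(scores)
mean = sum(scores) / n              # X̄: mean score

# Sample variance: s² = Σ(X - X̄)² / (n - 1)
variance = sum((x - mean) ** 2 for x in scores) / (n - 1)
sd = variance ** 0.5                # s = √s²

print(sd)                           # manual result
print(statistics.stdev(scores))    # should match
```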
Correlation Coefficient
• Covariance (COV): a measure of how two variables, X and Y, vary together.
• $\mathrm{COV}(X,Y) = \frac{1}{n-1}\sum (X_i - \bar{X})(Y_i - \bar{Y})$
• Correlation coefficient (Pearson product-moment correlation coefficient):
• $r_{xy} = \frac{\mathrm{COV}(X,Y)}{s_x s_y}$
• $r_{xy} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{(n-1)\,s_x s_y}$
Correlation Coefficient
• where
  – $n$: number of score pairs (test takers)
  – $X_i$: individual score on the first half
  – $\bar{X}$: mean of the scores on the first half
  – $Y_i$: individual score on the second half
  – $\bar{Y}$: mean of the scores on the second half
  – $s_x$: standard deviation of the first half
  – $s_y$: standard deviation of the second half
Calculation of Correlation Coefficient
• Manually
• Manual + Excel
• Excel
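Besides Excel, the computation is easy to script. A minimal Python sketch (both score lists are invented) that applies $r_{xy} = \mathrm{COV}(X,Y)/(s_x s_y)$:

```python
# Hypothetical scores of the same students on two test halves
X = [10, 12, 9, 15, 11, 14]   # first half
Y = [11, 13, 8, 14, 12, 15]   # second half
n = len(X)

mx, my = sum(X) / n, sum(Y) / n

# COV(X,Y) = Σ(Xi - X̄)(Yi - Ȳ) / (n - 1)
cov = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / (n - 1)

# Sample standard deviations of each half
sx = (sum((x - mx) ** 2 for x in X) / (n - 1)) ** 0.5
sy = (sum((y - my) ** 2 for y in Y) / (n - 1)) ** 0.5

r = cov / (sx * sy)            # Pearson product-moment correlation
print(round(r, 3))
```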
Classical True Score Theory
• Also referred to as the classical reliability theory, because its major task is to estimate the reliability of the observed scores of a test; that is, it attempts to estimate the strength of the relationship between the observed score and the true score.
• Sometimes referred to as the true score theory, because its theoretical derivations are based on a mathematical model known as the true score model.
Assumptions in CTS
• Assumption 1: The observed score consists of the true score and the error score, i.e. $x = x_t + x_e$.
• Assumption 2: Error scores are unsystematic, random, and uncorrelated with the true score, i.e. $s_x^2 = s_t^2 + s_e^2$.
Parallel Test
• Two tests are parallel if
  – $\bar{x} = \bar{x}'$
  – $s_x^2 = s_{x'}^2$
  – $r_{xy} = r_{x'y}$
Correlation Between Parallel Tests
• If the observed scores on two parallel tests are
highly correlated, the effects of the error scores are
minimal.
• Reliability is the correlation between the observed
scores of two parallel tests.
• This definition is the basis for all estimates of reliability within CTS theory.
• Condition: the observed scores on the two tests must be experimentally independent.
Error Score Estimation and Measurement
• Relations between reliability, true score and
error score:
• The higher the portion of the true score, the
higher the correlation of the two parallel
tests. (True scores are systematic)
• The higher the portion of the error score, the
lower the correlation of the two parallel
tests. (Error scores are random)
Error Score Estimation and Measurement
• $r_{xx'} = s_t^2 / s_x^2$
• $(s_t^2 + s_e^2)/s_x^2 = 1$
• $s_e^2/s_x^2 = 1 - s_t^2/s_x^2$
• $s_t^2/s_x^2 = r_{xx'}$
• $s_e^2/s_x^2 = 1 - r_{xx'}$
• $s_e^2 = s_x^2\,(1 - r_{xx'})$
Approaches to Estimate Reliability
• Three approaches based on different sources
of errors.
• Internal consistency: source of errors from
within the test and scoring procedure
• Stability: How consistent test scores are
over time.
• Equivalence: Scores on alternative forms of
tests are equivalent.
Internal Consistency
• Dichotomous:
  – Split-half reliability estimates (the Spearman-Brown and the Guttman estimates)
  – Kuder-Richardson reliability coefficients
• Non-dichotomous:
  – Coefficient alpha
  – Rater consistency
Split-half Reliability Estimates
• Split the test into two halves that have equal means and variances (equivalence) and are independent of each other (independence).
• Ways to split:
  – 1. divide the test into the first and second halves
  – 2. random halves
  – 3. odd-even method
Spearman-Brown Reliability Estimate
• $r_{xx'} = \frac{2 r_{hh'}}{1 + r_{hh'}}$
• where
  – $r_{hh'}$: correlation between the two halves of the test
• Procedure:
  – 1. Divide the test into two equal halves
  – 2. Calculate the correlation coefficient between the two halves
  – 3. Calculate the Spearman-Brown reliability estimate
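A minimal sketch of the formula in Python (the 0.60 half-test correlation is an invented example value):

```python
def spearman_brown(r_hh: float) -> float:
    """Step up the half-test correlation to full-test reliability:
    r_xx' = 2·r_hh' / (1 + r_hh')."""
    return 2 * r_hh / (1 + r_hh)

# e.g. a half-half correlation of .60 implies full-test reliability .75
print(spearman_brown(0.60))    # 0.75
```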
Guttman Split-Half Estimate
• $r_{xx'} = 2\left(1 - \frac{s_{h1}^2 + s_{h2}^2}{s_x^2}\right)$
• where
  – $s_{h1}^2$: variance of the first half
  – $s_{h2}^2$: variance of the second half
  – $s_x^2$: variance of the total scores
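A minimal Python sketch of the Guttman formula (the three variances are invented example values):

```python
def guttman_split_half(var_h1: float, var_h2: float, var_total: float) -> float:
    """Guttman split-half: r_xx' = 2·(1 - (s_h1² + s_h2²) / s_x²)."""
    return 2 * (1 - (var_h1 + var_h2) / var_total)

# Hypothetical variances of the two halves and of the total scores
print(guttman_split_half(4.0, 5.0, 16.0))   # 0.875
```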
Kuder-Richardson Formula 20
• $r_{xx'} = \frac{k}{k-1}\left(1 - \frac{\sum pq}{s_x^2}\right)$
• where
  – $k$: number of items on the test
  – $p$: proportion of correct answers per item, i.e. correct answers / total answers (difficulty)
  – $q$: proportion of incorrect answers, i.e. $1 - p$
  – $s_x^2$: total test score variance
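A minimal Python sketch of KR-20 (the 0/1 answer matrix is invented illustration data; sample variance with n - 1 is used, matching the earlier variance formula):

```python
def kr20(item_responses):
    """KR-20 for dichotomous items.
    item_responses: one inner list of 0/1 item scores per student.
    r_xx' = k/(k-1) · (1 - Σpq / s_x²)"""
    n = len(item_responses)                     # number of students
    k = len(item_responses[0])                  # number of items
    totals = [sum(row) for row in item_responses]
    mean = sum(totals) / n
    var_x = sum((t - mean) ** 2 for t in totals) / (n - 1)  # total score variance
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in item_responses) / n       # item difficulty p
        sum_pq += p * (1 - p)                               # p·q for this item
    return k / (k - 1) * (1 - sum_pq / var_x)

# Hypothetical 0/1 answer matrix: 5 students × 4 items
data = [[1, 1, 0, 1],
        [1, 0, 0, 1],
        [0, 1, 1, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 1]]
print(round(kr20(data), 3))
```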
Kuder-Richardson Formula 21
• $r_{xx'} = \frac{k\,s_x^2 - \bar{x}(k - \bar{x})}{(k-1)\,s_x^2}$
• where
  – $k$: number of items on the test
  – $s_x^2$: total test score variance
  – $\bar{x}$: mean score
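KR-21 needs only summary statistics, so a sketch is short (the test length, mean, and variance are invented example values):

```python
def kr21(k: int, mean: float, var_x: float) -> float:
    """KR-21: r_xx' = (k·s_x² - x̄(k - x̄)) / ((k - 1)·s_x²)."""
    return (k * var_x - mean * (k - mean)) / ((k - 1) * var_x)

# Hypothetical 40-item test with mean 28 and total-score variance 30
print(round(kr21(40, 28.0, 30.0), 3))   # ≈ 0.738
```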
Coefficient alpha
• $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum s_i^2}{s_x^2}\right)$
• where
  – $k$: number of parts (items) of the test
  – $\sum s_i^2$: sum of the variances of the different parts of the test
  – $s_x^2$: variance of the total test scores
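A minimal Python sketch of coefficient alpha (the part scores are invented illustration data):

```python
def coefficient_alpha(part_scores):
    """Coefficient alpha: α = k/(k-1) · (1 - Σs_i² / s_x²).
    part_scores: one inner list of per-part scores per student."""
    k = len(part_scores[0])                    # parts of the test

    def var(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / (len(values) - 1)

    sum_part_vars = sum(var([row[i] for row in part_scores]) for i in range(k))
    totals = [sum(row) for row in part_scores]  # total score per student
    return k / (k - 1) * (1 - sum_part_vars / var(totals))

# Hypothetical scores of 5 students on 3 test parts
data = [[8, 7, 9], [6, 5, 7], [9, 8, 8], [4, 5, 6], [7, 6, 8]]
print(round(coefficient_alpha(data), 3))
```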
Comparison of Estimates: Assumptions

Estimate        | Assumes equivalence | Assumes independence | If equivalence is violated | If independence is violated
Spearman-Brown  | +                   | +                    | underestimate              | overestimate
Guttman         | –                   | +                    | (not assumed)              | overestimate
K-R             | +                   | +                    | underestimate              | overestimate
Coefficient α   | –                   | –                    | (not assumed)              | overestimate

(+ = the estimate makes the assumption; – = it does not.)
Summary: Estimate Procedure
• Spearman-Brown
  – 1. split the test
  – 2. compute the variances of each half
  – 3. compute the correlation coefficient between the two halves
  – 4. compute the reliability coefficient
Summary: Estimate Procedure
• Guttman
  – 1. split the test
  – 2. compute the variances of each half
  – 3. compute the variance of the whole test
  – 4. compute the reliability coefficient
Summary: Estimate Procedure
• K-R 20
  – 1. number of questions
  – 2. proportion of correct answers for each question
  – 3. proportion of incorrect answers for each question
  – 4. sum of the products of p and q
  – 5. variance of the whole test
  – 6. reliability coefficient
Summary: Estimate Procedure
• K-R 21
  – 1. number of questions
  – 2. mean of the test
  – 3. variance of the test
  – 4. reliability coefficient
Summary: Estimate Procedure
• Coefficient α
  – 1. number of parts of the test
  – 2. mean of each part
  – 3. variance of each part
  – 4. sum of the variances of all parts
  – 5. mean of the test
  – 6. variance of the test
  – 7. reliability coefficient
Rater Consistency
• Intra-rater
• Inter-rater
Intra-rater Reliability
• Rate each paper twice. Condition: the two
ratings must be independent of each other.
• Two ways of estimating:
• Spearman-Brown: Take each rating as a
split half and compute the reliability
coefficient.
Intra-rater Reliability
• Conditions: the two ratings must have similar means and variances to ensure the equivalence of the two ratings.
• Coefficient alpha: take the two ratings as two parts of a test.
• $\alpha = \frac{k}{k-1}\left(1 - \frac{s_{x1}^2 + s_{x2}^2}{s_{x1+x2}^2}\right)$
Intra-rater Reliability
• where
  – $k$: number of ratings
  – $s_{x1}^2$: variance of the first rating
  – $s_{x2}^2$: variance of the second rating
  – $s_{x1+x2}^2$: variance of the summed ratings
• Since $k = 2$, the formula reduces to the Guttman reliability coefficient formula.
Inter-rater Reliability
• If there are only two raters, use split-half estimates to obtain the reliability coefficient.
• Or use the grade (rank) correlation coefficient:
• $r_{xx'} = 1 - \frac{6\sum D^2}{n(n^2 - 1)}$
• where
  – $D$: difference between the grades (ranks) assigned by the two raters
Inter-rater Reliability
  – $n$: number of test takers
• See testing 05-2 sheet 5 for an example.
• Note: tied scores should share the same grade (rank).
• If there are more than two raters, use the coefficient alpha estimate.
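A minimal Python sketch of this rank-based estimate, assuming the shared-grade rule means tied scores receive the average of the tied ranks (the two raters' grades are invented):

```python
def shared_ranks(scores):
    """Rank from highest to lowest; tied scores share the average rank."""
    ordered = sorted(scores, reverse=True)
    ranks = {}
    for s in set(scores):
        first = ordered.index(s) + 1        # best rank held by this score
        count = ordered.count(s)
        ranks[s] = first + (count - 1) / 2  # average of the tied ranks
    return [ranks[s] for s in scores]

def rank_correlation(rater1, rater2):
    """r = 1 - 6·ΣD² / (n(n² - 1)), D = rank difference per test taker."""
    r1, r2 = shared_ranks(rater1), shared_ranks(rater2)
    n = len(rater1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical grades from two raters for 6 test takers (note the tie on 85)
print(round(rank_correlation([90, 85, 85, 70, 60, 95],
                             [88, 80, 84, 72, 65, 90]), 3))
```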
Stability (test-retest reliability)
• Administer the test twice to a group of individuals and compute the correlation between the two sets of scores. The correlation can then be interpreted as an indicator of how stable the scores are over time.
• Learning effects and practice effects must be taken into account.
Equivalence (parallel forms reliability)
• Use alternative forms of a given test. Compute and compare the means and standard deviations for each of the two forms to determine their equivalence. The correlation between the two sets of scores can be interpreted as an indicator of the equivalence of the two tests, or as an estimate of the reliability of either one.
GENERALIZABILITY THEORY
GENERALIZABILITY THEORY
• Generalizability theory (G-theory) is a framework
of factorial design and the analysis of variance. It
constitutes a theory and set of procedures for
specifying and estimating the relative effects of
different factors on observed test scores, and thus
provides a means for relating the uses or
interpretations to the way test users specify and
interpret different factors as either abilities or
sources of error.
GENERALIZABILITY THEORY
• G-theory treats a given measure or score as a
sample from a hypothetical universe of possible
measures, i.e. on the basis of an individual's
performance on a test we generalize to his
performance in other contexts.
• Reliability = generalizability
• The way we define a given universe of measures will depend upon the universe of generalization.
Application of G-theory
• Two stages:
  – G-study
  – D-study
G-study
• Consider the uses that will be made of the test scores, and investigate the sources of variance that are of concern or interest. On the basis of this generalizability study, the test developer obtains estimates of the relative sizes of the different sources of variance ('variance components').
D-study
• When the results of the G-study are
satisfactory, the test developer
administers the test under operational
conditions, and uses G-theory procedures
to estimate the magnitude of the
variance components. These estimates
provide information that can inform the
interpretation and use of the test
scores.
Significance of G-theory
• The application of G-Theory thus enables
test developers and test users to
specify the different sources of
variance that are of concern for a given
test use, to estimate the relative
importance of these different sources
simultaneously, and to employ these
estimates in the interpretation and use
of test scores.
Universes of Generalization and Universe of Measures
• Universe of generalization: a domain of uses or abilities (or both).
• Universe of possible measures: the types of test scores we would be willing to accept as indicators of the ability to be measured for the intended purpose.
Populations of Persons
• In addition to defining the universe of
possible measures, we must define the
group, or population of persons about whom
we are going to make decisions or
inferences.
Universe Score
• A universe score $x_p$ is thus defined as the mean of a person's scores on all measures from the universe of possible measures. The universe score is thus the G-theory analog of the CTS-theory true score. The variance of a group of persons' scores on all measures would be equal to the universe score variance $s_p^2$, which is similar to CTS true score variance in the sense that it represents the proportion of observed score variance that remains constant across different individuals and different measurement facets and conditions.
Universe Score
• The universe score is different from the
CTS true score, however, in that an
individual is likely to have different
universe scores for different universes of
measures.
Generalizability Coefficients
• The G-theory analog of the CTS-theory reliability coefficient is the generalizability coefficient, which is defined as the proportion of observed score variance that is universe score variance:
• $\rho_{xx'}^2 = s_p^2 / s_x^2$
• where $s_p^2$ is the universe score variance and $s_x^2$ is the observed score variance, which includes both universe score and error variance.
Estimation
• Variance components: sources of variance
• persons (p), forms (f), raters (r)
• $s_x^2 = s_p^2 + s_f^2 + s_r^2 + s_{pf}^2 + s_{pr}^2 + s_{fr}^2 + s_{pfr}^2$
• Use ANOVA to estimate the magnitude of the variance components.
• Analyse those that are significantly large.
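A minimal numeric sketch (all variance components are invented values) showing how the generalizability coefficient $\rho_{xx'}^2 = s_p^2/s_x^2$ falls out of this decomposition:

```python
# Hypothetical variance components from a G-study ANOVA:
# persons (p), forms (f), raters (r), and their interactions.
components = {
    "p": 4.0, "f": 0.3, "r": 0.2,
    "pf": 0.5, "pr": 0.4, "fr": 0.1, "pfr": 0.5,
}

var_x = sum(components.values())          # observed score variance s_x²
g_coefficient = components["p"] / var_x   # ρ² = s_p² / s_x²
print(round(g_coefficient, 3))            # proportion that is universe score variance
```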
Standard Error of Measurement (SEM)
• We need to know the extent to which a test score may vary (the SEM).
• Formula for estimating the SEM:
• $s_e = s_x\sqrt{1 - r_{xx'}}$
• Derivation:
  – $r_{xx'} = s_t^2/s_x^2$  (1)
  – $s_t^2/s_x^2 + s_e^2/s_x^2 = 1$  (2)
  – $s_e^2/s_x^2 = 1 - s_t^2/s_x^2$  (3)
  – $s_e^2/s_x^2 = 1 - r_{xx'}$
  – $s_e^2 = s_x^2(1 - r_{xx'})$
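A minimal Python sketch of the SEM formula (the SD and reliability are invented example values):

```python
def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: s_e = s_x·√(1 - r_xx')."""
    return sd * (1 - reliability) ** 0.5

# Hypothetical test: SD = 10, reliability = .91
print(round(sem(10.0, 0.91), 2))   # 3.0 — an observed 60 suggests a band of about 60 ± 3
```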
Interpretation of Test Scores
• Difficulty
• Distinction
• Z score
Difficulty for Dichotomous Scoring
• $p = R/n$
• where
  – $p$: difficulty index
  – $R$: number of right answers
  – $n$: number of students
Difficulty for Dichotomous Scoring (Corrected)
• $C_p = \frac{kp - 1}{k - 1}$
• where
  – $C_p$: corrected difficulty index
  – $p$: uncorrected difficulty index
  – $k$: number of choices
Difficulty for Nondichotomous Scoring
• $p = \text{mean} / \text{full score}$
• A desirable range of difficulty is roughly 30%–85%.
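A minimal Python sketch of the three difficulty indices above (the counts and scores are invented):

```python
def difficulty(right: int, n: int) -> float:
    """p = R/n for dichotomous items."""
    return right / n

def corrected_difficulty(p: float, k: int) -> float:
    """Cp = (k·p - 1) / (k - 1), correcting p for guessing among k choices."""
    return (k * p - 1) / (k - 1)

def difficulty_nondichotomous(mean: float, full_score: float) -> float:
    """p = mean / full score for partial-credit items."""
    return mean / full_score

# Hypothetical item: 30 of 50 students answer a 4-choice item correctly
p = difficulty(30, 50)                          # 0.6
print(p, round(corrected_difficulty(p, 4), 3))  # 0.6 0.467
print(difficulty_nondichotomous(6.8, 10))       # 0.68
```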
Distinction
• Label the top 27% of the total as the high group and the lowest 27% of the total as the low group.
• $D = P_H - P_L$
• where
  – $D$: distinction index
  – $P_H$: rate of correct answers in the high group
  – $P_L$: rate of correct answers in the low group
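A minimal Python sketch of the distinction index (the score/answer pairs are invented; the 27% cut follows the slide):

```python
def distinction(scores_correct):
    """D = P_H - P_L using the top and bottom 27% of total scorers.
    scores_correct: list of (total_score, answered_item_correctly) pairs."""
    ranked = sorted(scores_correct, key=lambda t: t[0], reverse=True)
    cut = max(1, round(len(ranked) * 0.27))
    high, low = ranked[:cut], ranked[-cut:]
    p_high = sum(c for _, c in high) / cut   # correct rate in the high group
    p_low = sum(c for _, c in low) / cut     # correct rate in the low group
    return p_high - p_low

# Hypothetical data: (total score, 1 if this item was answered correctly)
data = [(95, 1), (90, 1), (85, 1), (80, 0), (75, 1), (70, 0),
        (65, 1), (60, 0), (55, 0), (50, 0), (45, 0)]
print(round(distinction(data), 2))
```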
Z score
• A way of placing an individual score in the whole distribution of scores on a test; it expresses how many standard deviation units a score lies above or below the mean. Scores above the mean are positive; those below the mean are negative.
• An advantage of z scores is that they allow scores from different tests to be compared, where the means and standard deviations differ, and where score points may not be equal.
• $Z = \frac{X - \bar{X}}{s}$
T-score
• A transformation of a z score, equivalent to it but with the advantage of avoiding negative values, and hence often used for reporting purposes.
• $T = 10Z + 50$
Standardized Score
• A transformation of raw scores which provides a measure of relative standing in a group and allows comparison of raw scores from different distributions, e.g. from tests of different lengths. It does this by converting a raw score into a standard frame of reference, expressed in terms of its relative position in the distribution of scores. The z score is the most commonly used standardized score.
• Standardized score $= 100Z + 500$
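A minimal Python sketch converting one raw score through all three scales (the raw score, mean, and SD are invented):

```python
def z_score(x: float, mean: float, sd: float) -> float:
    """Z = (X - X̄) / s."""
    return (x - mean) / sd

def t_score(z: float) -> float:
    """T = 10Z + 50 (avoids negative values)."""
    return 10 * z + 50

def standardized_score(z: float) -> float:
    """Standardized score = 100Z + 500."""
    return 100 * z + 500

# Hypothetical: a raw score of 62 on a test with mean 70 and SD 8
z = z_score(62, 70, 8)                        # -1.0: one SD below the mean
print(z, t_score(z), standardized_score(z))   # -1.0 40.0 400.0
```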