Testing - Research and Development (11Oct07)


The BILC BAT:
A Research and Development
Success Story
Ray T. Clifford
BILC Professional Seminar
Vienna, Austria
11 October 2007
Language is the most complex of
human behaviors.
• Language proficiency is clearly not a
simple, one-dimensional trait.
• Therefore, language development cannot
be expected to be linear.
• However, language proficiency can be
assessed against a hierarchy of
identifiable common stages of language
skill development.
Testing Language Proficiency
in the Receptive Skills
• Norm-referenced statistical analyses are
problematic when testing for proficiency.
– Rasch one-factor IRT analysis assumes:
• A one-dimensional trait.
• Linear skill development.
• All test items discriminate equally well.
– Norm-referenced statistics are meant to
distinguish all students from one another, not
to separate passing students from failing
students.
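For reference, a minimal sketch of the one-parameter (Rasch)
model behind those assumptions; the function name and example
values are illustrative and not part of the BAT:

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """Rasch (one-parameter IRT) probability that a person of
    ability theta answers an item of difficulty b correctly.
    The single ability parameter and fixed slope are exactly the
    one-dimensional, equal-discrimination assumptions questioned
    above."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# A person whose ability equals the item's difficulty is
# predicted to answer correctly 50% of the time.
print(rasch_probability(theta=1.0, b=1.0))  # 0.5
```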
Testing Language Proficiency
in the Receptive Skills
• Norm-referenced statistical analyses are
problematic when testing for proficiency.
– They require too many subjects for use in
less commonly taught languages (LCTLs).
• About 100 to 300 test subjects of varying abilities
must answer each item.
• That many potential test subjects may not exist.
– The results do not have a direct relationship
to proficiency levels or other external criteria.
Testing Language Proficiency
in the Receptive Skills
• Norm-referenced statistical analyses are
problematic when testing for proficiency.
– There has not been an adequate way of
ensuring that the range of skills tested and the
difficulty of any given test match the targeted
range of the language proficiency scale.
– Setting passing scores using norm-referenced
statistics is an imprecise process.
– Setting multiple cut-scores from a total test
score violates the criterion-referenced
principle of non-compensatory scoring.
Test Development Procedures:
Norm-Referenced Tests
• Create a table of test specifications.
• Train item writers in item-writing techniques.
• Develop items.
• Test the items for difficulty and reliability by
administering them to several hundred learners.
• Use statistics to eliminate “bad” items.
• Administer the resulting test.
• Report results compared to other students, or
attempt to relate these norm-referenced results to
a polytomous set of criteria (such as the STANAG
scale).
Traditional Method of
Setting Cut Scores

[Figure: scores (0–100) on the test to be calibrated, plotted
for groups of "known" ability: a Level 1 group, a Level 2
group, and a Level 3 group.]
The Results You Hope For:
[Figure: the hoped-for outcome: the Level 1, Level 2, and
Level 3 groups' score distributions separate cleanly on the
0–100 scale, so cut scores can be drawn between them.]
The Results You Always Get:

[Figure: the groups' score distributions overlap; the regions
marked "???" are score ranges where Level 1, Level 2, and
Level 3 examinees cannot be told apart.]
Why is there always an overlap?
• Total scores are by definition
“compensatory” scores.
– Every answer guessed correctly adds to the
individual’s score.
– There is no way to check for ability at a given
proficiency level.
• Students with different abilities may have
attained the same score, for example by
– Answering only the Level 1 questions right.
– Answering 25% of all the questions right.
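As a concrete illustration (the item counts are hypothetical,
not the BAT's), two response patterns can produce the same
compensatory total while demonstrating very different abilities:

```python
# Hypothetical 30-item test: 10 items at each of Levels 1, 2, 3.
# Tuples give correct answers per level, lowest level first.
student_a = (10, 0, 0)  # solid Level 1 ability, nothing above it
student_b = (3, 4, 3)   # ~33% everywhere, consistent with guessing

total_a = sum(student_a)  # 10
total_b = sum(student_b)  # 10

# A single total score cannot tell these examinees apart, even
# though only student A has demonstrated sustained Level 1 ability.
print(total_a == total_b)  # True
```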
[The same overlap figure is repeated.]
No matter where the cut scores are
set, they are wrong for someone.
A Better Way
• We can test language proficiency using
criterion-referenced instead of norm-referenced
testing procedures.
Criterion-Referenced Proficiency
Testing in the Receptive Skills
• Items must strictly adhere to the
proficiency “Table of Specifications”.
• Every component of the test item must be
aligned with and match the specifications
of a single level of the proficiency scale.
– The text difficulty
– The author's purpose
– The task asked of the reader/listener
Criterion-Referenced Proficiency
Testing in the Receptive Skills
• Testing reading and listening proficiency
requires “Independent, non-compensatory
scoring” for each proficiency level, not
calculating a single score for the entire
test.
• This makes the test development process
more complex.
– Requires trained item writers and reviewers.
– Begins with “modified Angoff” ratings instead
of IRT procedures to validate items.
The BILC
Benchmark Advisory Test
(Reading)
Is a Criterion-Referenced
Proficiency Test.
Steps in the Process
1. We updated the STANAG 6001
Proficiency Scale.
a. Each level describes a measurable point on
the scale.
b. These assessment points are not arbitrary,
but represent useful levels of ability, e.g.,
Survival, Functional, and Professional.
c. Thus, each level represents a defined
“construct” of language ability.
Steps in the Process
2. We validated the scale.
a. The hierarchical nature of these constructs
had been operationally – but not statistically
– validated.
b. A statistical validation process was run in
Sofia, Bulgaria.
c. The results substantiated the validity of the
scale’s operational use.
STANAG 6001
Scale Validation Exercise
Conducted in
Sofia, Bulgaria
13 October 2005
Instructions
• On the top of a blank piece of paper, write
the following information:
1. Your current work assignment:
Teacher, Tester, Administrator, Other______
2. Your first (or dominant) language: _________
3. You do not need to write your name!
Instructions
• Next, write the numbers 0, 1, 2, 3, 4, and 5
down the left side of the paper.
Instructions
• You will now be shown 6 descriptions of
language speaking proficiency.
• Each description will be labeled with a
color.
Instructions
• Rank the descriptions according to their
level of difficulty by writing their color
designation next to the appropriate number:
0 (easiest) = Color ?
1 (next easiest) = Color ?
2 (next easiest) = Color ?
3 (next easiest) = Color ?
4 (next easiest) = Color ?
5 (most difficult) = Color ?
Ready?
• The descriptions will now be presented…
– One at a time,
– In a random sequence,
– For 15 seconds each.
• You will see each of the descriptors 4
times.
• Thank you for participating in this
experiment.
STANAG 6001 Scale Validation:
A Timed Exercise Without Training
• 74 people turned in their rankings.
• They marked their current work
assignments as:
– Administrator: 49
– Teacher: 26
– Tester: 19
– Other: 1
Results of the
STANAG Scale Validation
( n = 74 )
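The results chart from this slide is not reproduced in the
transcript. As an illustration only (the presentation does not
say which statistic was computed), one simple summary of such
ranking data is the proportion of respondents whose ordering of
the six color-labeled descriptors matches the intended 0–5
hierarchy; the color names and responses below are hypothetical:

```python
INTENDED = ["red", "blue", "green", "yellow", "orange", "purple"]

# Hypothetical responses; the actual colors and rankings from the
# Sofia exercise are not given in this transcript.
responses = [
    ["red", "blue", "green", "yellow", "orange", "purple"],
    ["red", "green", "blue", "yellow", "orange", "purple"],
]

exact = sum(r == INTENDED for r in responses)
print(f"{exact}/{len(responses)} exact matches "
      f"({100.0 * exact / len(responses):.0f}%)")
```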
Steps in the Process
3. We used the STANAG 6001 base
proficiency levels as the definitive
specifications for item development.
a. The author's task and purpose in producing
the text have to be aligned with the question
or task asked of the reader.
b. The written (or audio) text type and linguistic
characteristics of each item must also be
characteristic of the proficiency level
targeted by the item.
Steps in the Process
4. The items developed then had to pass a
strict review of whether each item
matched the design specifications.
a. Multiple expert judges made independent
judgments of whether each item matched
the targeted level.
b. Only items that passed this review with the
unanimous consensus of trained judges were
taken to the next step.
Steps in the Process
5. The next step was a “bracketing” process
to check the adequacy of the question’s
multiple choice options.
a. Experts were asked to make independent
judgments about how likely a learner at the
next lower level would be to answer the
question correctly.
• Responses significantly above chance (25% for
a four-option item) made the item unacceptable.
• In such cases the item, the item question, or the
item choices had to be discarded or revised.
Steps in the Process
5. (Cont.)
b. Experts made independent judgments about
how likely a learner at the next higher level
would be to answer each question correctly.
• If the item would not be answered correctly by
this more competent group, it was rejected.
• (Because of human limitations, such as inattention,
fatigue, and carelessness, it was recognized that
the correct-response probability for this more
competent group would be less than 100%.)
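A minimal sketch of how such a bracketing rule might be applied
to judges' probability ratings; the 10-point guessing margin and
the 90% ceiling expectation are assumptions for illustration, not
figures from the BAT process:

```python
from statistics import mean

CHANCE = 25.0  # expected success rate when guessing among 4 options

def brackets_ok(lower_level_ratings, higher_level_ratings,
                guess_margin=10.0, ceiling=90.0):
    """Accept an item only if judges expect learners one level
    below the target to score near chance, and learners one level
    above the target to score near (though not at) 100%."""
    too_easy_below = mean(lower_level_ratings) > CHANCE + guess_margin
    too_hard_above = mean(higher_level_ratings) < ceiling
    return not (too_easy_below or too_hard_above)

# Lower-level judges rate near chance; higher-level near ceiling.
print(brackets_ok([20, 25, 30], [95, 90, 100]))  # True
```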
Steps in the Process
6. Items that passed the technical
specifications review and the bracketing
process, then underwent a “Modified
Angoff” rating procedure.
a. Expert judges rated the probability that each
item would be correctly answered by a
person who was fully competent at the
targeted proficiency level.
b. If the independent probability ratings
produced an outlier rating or a standard
deviation of more than 5 points, the item
was rejected and/or revised.
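A sketch of the screening rule in step 6b; the outlier test used
here (a rating more than two standard deviations from the mean)
is an assumption, since the slide does not define "outlier":

```python
from statistics import mean, stdev

def angoff_screen(ratings, max_sd=5.0, outlier_z=2.0):
    """Return True if an item's independent Angoff ratings
    (percent-correct estimates for a fully competent examinee)
    are consistent enough to keep the item."""
    m, s = mean(ratings), stdev(ratings)
    if s > max_sd:
        return False  # judges disagree too much
    if s > 0 and any(abs(r - m) / s > outlier_z for r in ratings):
        return False  # at least one judge is an outlier
    return True

print(angoff_screen([80, 82, 85, 78, 81]))  # True: tight cluster
print(angoff_screen([80, 82, 85, 78, 60]))  # False: SD too large
```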
Steps in the Process
7. Items found acceptable in the “Modified
Angoff” rating procedure were
assembled into an online test.
a. The test had three subtests of 20 items
each.
b. A separate subtest for each of the Reading
proficiency Levels 1, 2, and 3.
c. Each test was to be graded separately.
d. “Sustained performance” (passing) on each
subtest was defined as the mean Angoff
rating minus one standard deviation or 70%.
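One plausible reading of the passing rule in 7d, sketched below;
the slide does not say whether 70% is a floor or a default, so
the max() is an assumption:

```python
from statistics import mean, stdev

def subtest_cut_score(angoff_ratings, floor=70.0):
    """Passing ("sustained performance") threshold for a subtest:
    the mean Angoff rating minus one standard deviation, read here
    as never dropping below a 70% floor (assumption)."""
    return max(mean(angoff_ratings) - stdev(angoff_ratings), floor)

# Hypothetical ratings for one subtest's items:
print(subtest_cut_score([82, 80, 85, 79, 84]))  # 82.0 - ~2.55 = ~79.4
```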
More About Scoring
• Scoring had to follow criterion-referenced,
non-compensatory proficiency assessment
procedures.
– “Sustained” ability would be required to qualify
as proficient at each level.
– Summary ratings would consider both “Floor”
and “Ceiling” abilities.
– Each learner’s performance profile would
determine “between-level” ratings (if any).
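A sketch of non-compensatory profile scoring under these rules.
Only the "Sustained" band (70% or more) is defined in the slides;
the Developing/Emerging/Random cutoffs and the plus-level rule
below are assumptions for illustration:

```python
def classify(pct):
    """Category for one subtest score (bands below 70 are assumed)."""
    if pct >= 70:
        return "Sustained"
    if pct >= 50:
        return "Developing"
    if pct >= 35:
        return "Emerging"
    return "Random"

def overall_rating(scores):
    """scores: percent correct on the Level 1, 2, and 3 subtests.
    Floor = highest level sustained with every lower level also
    sustained; a "+" marks Developing performance one level higher
    (assumed interpretation of "between-level" ratings)."""
    cats = [classify(s) for s in scores]
    floor = 0
    for level, cat in enumerate(cats, start=1):
        if cat == "Sustained" and floor == level - 1:
            floor = level
    plus = floor < len(cats) and cats[floor] == "Developing"
    return f"Level {floor}{'+' if plus else ''}"

print(overall_rating([95, 80, 75]))  # Level 3
print(overall_rating([90, 75, 55]))  # Level 2+
print(overall_rating([85, 40, 20]))  # Level 1
```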
And the results…
More pilot testing will be done,
but here are the results of the first 36 pilot
tests:
BAT - Reading Proficiency Profiles
n = 36

[Figure: each examinee's performance on the Level 1, 2, and 3
subtests was categorized as Sustained (70% or more),
Developing, Emerging, or Random; the resulting profiles
yielded these ratings:]
– Level 3: n = 19
– Level 2+: n = 7
– Level 2: n = 5 and n = 1 (two different profiles)
– Level 1+: n = 2 and n = 1 (two different profiles)
– Level 1: n = 1
Congratulations!
Working together, we have solved a major
testing problem –
a problem which has plagued language
testers for decades.
We have developed a criterion-referenced
proficiency test of Reading which
– Accurately assigns proficiency levels.
– Has both face validity and statistical validity.
Questions?
Some additional thoughts…
• The assessment points or levels in the STANAG 6001 scale may
be thought of as “chords” – each of which describes a short segment
along an extended multi-dimensional proficiency development scale.
• These “chords” are cross-dimensional constellations of factors
that represent different levels of language ability. Like the concept
of “chords” in calculus, these defined progress levels allow us to
accurately measure whether the particular set of factors described at
each level has been mastered.
• Each proficiency level or factor constellation can also be seen as a
separate construct, and these constructs can be shown to form an
ascending array or hierarchy of increasing language proficiency
which meets Guttman scaling criteria.
• Therefore, these “points” in the scale can also indicate overall
proficiency development.
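Since the level constructs are said to meet Guttman scaling
criteria, a minimal sketch of the standard check, the coefficient
of reproducibility, may be useful; the data are hypothetical and
the 0.90 benchmark is the conventional Guttman criterion, not a
figure from the presentation:

```python
def coefficient_of_reproducibility(patterns):
    """patterns: per-person pass/fail results on subtests ordered
    easiest to hardest (True = passed). A perfect Guttman scale
    never shows a pass after a fail; each reversal counts as one
    error against the ideal pattern."""
    errors = total = 0
    for pattern in patterns:
        total += len(pattern)
        failed_yet = False
        for passed in pattern:
            if not passed:
                failed_yet = True
            elif failed_yet:
                errors += 1  # pass above an earlier fail
    return 1.0 - errors / total

data = [(True, True, True), (True, True, False),
        (True, False, False), (True, False, True)]  # last: 1 error
cr = coefficient_of_reproducibility(data)
print(round(cr, 3), cr >= 0.90)  # 0.917 True
```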