Technology's Roles in the NCLB/SOL Assessment Dance

Tests for Higher Standards: Provide Focus + Facilitate Achievement

The Many Threats to Test Validity
David Mott, Tests for Higher Standards and Reports Online Systems
Presentation at the Virginia Association of Test Directors (VATD) Conference, Richmond, VA, October 28, 2009
The Many Threats to Test Validity
In order for a test or assessment to have any value whatsoever, it must be possible to make reasonable inferences from the score. This is much harder than it seems. The test instruments, the testing conditions, the students, the score interpreters, and perhaps Fate ALL need to be working together to produce data worth using. Many specific threats will be delineated; a number of solutions suggested; and audience participation is strongly encouraged.
Validity and Value come from the same Latin root. The word has to do with being strong, well, good.

Validity = Value
Initial Attitude Adjustment
Amassing Statistics
The government are very keen on amassing statistics — they collect them, raise them to the nth power, take the cube root and prepare wonderful diagrams. But what you must never forget is that every one of those figures comes in the first instance from the village watchman, who just puts down what he damn well pleases.
(J. C. Stamp (1929). Some Economic Factors in Modern Life. London: P. S. King and Son)
Distance from Data
I have noticed that the farther one is from the source of data, the more likely one is to believe that the data could be a good basis for action.
(D. E. W. Mott (2009). Quotations.)
The Examination, as shown by the Ghost of Testing Past
Validity — Older Formulations
1950s through 1980s
• content validity
• concurrent validity
• predictive validity
• construct validity
Lee J. Cronbach
Content Validity —
Refers to the extent to which a measure represents all facets of a given social construct (constructs such as Reading Ability, Math Computation Proficiency, Optimism, Driving Skill, etc.). It is a more formal term than face validity: face validity refers not to what the test actually measures, but to what it appears to measure. Face validity is whether a test "looks valid" to the examinees who take it, the administrative personnel who decide on its use, and others.
Concurrent Validity —
Refers to a demonstration of how well a test correlates with a measure that has previously been validated. The two measures may be for the same construct, or for different, but presumably related, constructs.
Predictive Validity —
Refers to the extent to which a score on a scale or test predicts scores on some criterion measure. For example, how well do your final benchmarks predict scores on the state SOL Tests?
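For a quick, concrete check of predictive validity, the usual starting point is the Pearson correlation between each student's benchmark score and the later SOL score. A minimal sketch in Python; the names and numbers below are invented for illustration, not from the talk:

    # Correlate benchmark scores with later SOL scores for the same students.
    # All numbers are made up for illustration.
    from statistics import correlation  # available in Python 3.10+

    benchmark = [72, 85, 60, 91, 78, 66, 88]  # hypothetical benchmark percents
    sol       = [75, 82, 58, 94, 80, 70, 85]  # the same students' SOL scores

    r = correlation(benchmark, sol)  # Pearson r
    print(f"Pearson r = {r:.2f}")    # the closer to 1.0, the better the prediction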
Construct Validity —
Refers to whether a scale measures or correlates with the theorized underlying psychological construct (e.g., "fluid intelligence") that it claims to measure. It is related to the theoretical ideas behind the trait under consideration, i.e., the concepts that organize how aspects of personality, intelligence, subject-matter knowledge, etc. are viewed.
Validity — New Formulation
1990s through now
Six aspects or views of Construct Validity:
• content aspect
• substantive aspect
• structural aspect
• generalizability aspect
• external aspect
• consequential aspect
Samuel Messick
Validity — New Formulation
Six aspects or views of Construct Validity:
• Content aspect – evidence of content relevance, representativeness, and technical quality
• Substantive aspect – theoretical rationales for consistency in test responses, including process models, along with evidence that the processes are actually used in the assessment tasks
• Structural aspect – judges the fidelity of scoring to the actual structure of the construct domain
• Generalizability aspect – the extent to which score properties and interpretations generalize to related populations, settings, and tasks
• External aspect – includes convergent and discriminant evidence from multitrait-multimethod comparisons as well as evidence of relevance and utility
• Consequential aspect – appraises the value of score interpretation as a basis for action and the actual and potential consequences of test use, especially in regard to invalidity related to bias, fairness, and distributive justice
Administration Validity
• Administration Validity is my own term. A test administration or a test session is valid if nothing happens that causes a test, an assessment, or a survey to fail to reflect the actual situation. Test-session validity is an alternate term.
Administration Validity
• Many things can come between the initial creation of an assessment from valid materials and the final uses of the scores that come from that assessment.
• Imagine a chain that is only as strong as its weakest link. If any link breaks, the value of the whole chain is lost.
• This session deals with some of those weak links.
Areas of Validity Failure
• We create a test out of some "valid" items.
Discuss some of the realities most of us face: we either have some "previously validated" tests, or we have a "validated" item bank we make tests from. Let's assume that they really are valid; that is, the materials have good content matches with the Standards/Curriculum Frameworks/Blueprints, and so on.
Areas of Validity Failure
Some examples of things that can creep in during the supposedly "mechanical" process of creating a test from a bank.
• Here are two items from a Biology benchmark test we recently made for a client:
Two Biology Items
Bio.3b
5. Which organic compound is correctly matched with the subunit that composes it?
   A maltose – fatty acids
   B starch – glucose
   C protein – amino acids
   D lipid – sucrose

Bio.3b
6. Which organic compounds are the building blocks of proteins?
   A sugars
   B nucleic acids
   C amino acids
   D polymers
Two Biology Items
(The same two items, now shown with the standard they measure:)
Standard BIO.3b: The student will investigate and understand the chemical and biochemical principles essential for life. Key concepts include b) the structure and function of macromolecules.
Two Biology Items
Bio.3b
5. Which organic compound is correctly matched with the subunit that composes it?
   A maltose – fatty acids
   B starch – glucose
   C protein – amino acids *
   D lipid – sucrose

Bio.3b
6. Which organic compounds are the building blocks of proteins?
   A sugars
   B nucleic acids
   C amino acids *
   D polymers

(* = keyed answer. Note that the correct pairing in item 5 hands students the answer to item 6.)
A Life Science Item
LS.6c
12. In this energy pyramid, which letter would represent producers?
[Energy pyramid diagram with levels labeled A through D]
   A
   B
   C
   D
The same Life Science Item “Randomized”
LS.6c
12. In this energy pyramid, which letter would represent producers?
[The same energy pyramid diagram, levels labeled A through D]
   C
   D
   A
   B
(The "randomized" choices now read C, D, A, B, even though the letters refer to fixed positions in the diagram.)
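One way a test generator might avoid this trap, sketched below under assumed item fields (the "fixed_order" flag is my invention, not a TfHS feature): let the item bank mark items whose choices refer to positions in a figure, and shuffle only the rest.

    import random

    # Hypothetical item record; "fixed_order" marks items whose answer letters
    # point at labeled positions in a diagram and therefore must not move.
    pyramid_item = {
        "stem": "In this energy pyramid, which letter would represent producers?",
        "choices": ["A", "B", "C", "D"],
        "fixed_order": True,
    }

    def present_choices(item):
        """Return answer choices, shuffled only when shuffling is safe."""
        choices = list(item["choices"])
        if not item.get("fixed_order"):
            random.shuffle(choices)
        return choices

    print(present_choices(pyramid_item))  # always ['A', 'B', 'C', 'D']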
Moving from test creation to test administration
What Can Fail in the Test Administration Process
• Students aren't properly motivated:
  • Random responding
  • Patterning responses
  • Unnecessary guessing
  • Cheating
Let's look at what some of these look like:
One Student's Item Analysis
[Bubble-sheet grid, answer columns A–D; the item-by-item right/wrong record:]
Item:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Score:  1  0  0  0  0  1  0  0  0  0  0  0  0  0  1  0  0  1  0  1  0  0  0  0  1
Total: 6
What happened here?
Another Student's Item Analysis
[Bubble-sheet grid, answer columns A–D; the item-by-item right/wrong record:]
Item:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Score:  0  0  0  1  0  1  0  0  0  0  0  0  1  0  0  1  0  1  1  0  0  0  1  0  0
Total: 7
What happened here?
Yet Another Student's Item Analysis
[Bubble-sheet grid, answer columns A–D; the item-by-item right/wrong record:]
Item:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Score:  1  1  1  1  1  1  1  1  1  1  1  1  1  0  1  1  1  1  0  0  0  1  0  0  1
Total: 19
What happened here?
What Can Fail in the Test Administration Process
• Students or teachers make mistakes:
  • Stopping before the end of the test
  • Getting off position on answer sheets
  • Giving a student the wrong answer sheet
  • Scoring a test with the wrong key
Let's look at what some of these look like:
Yet, Yet Another Student's Item Analysis
[Bubble-sheet grid for a 30-row answer sheet, answer columns A–D; "–" marks rows with no response recorded:]
Item:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Score:  1  1  1  1  1  1  –  0  0  0  1  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  –  –  –  –
Total: 8
What happened here?
Moving to diagnosing students’ needs
What is the obvious conclusion about these test results?
Results for a Three-Standard Test (subtotals out of 5 items per standard)

             4.1   4.2   4.3   Total
Student 1     3     2     4      9
Student 2     4     3     5     12
Student 3     3     1     5      9
Student 4     5     0     5     10
Student 5     5     0     4      9
Student 6     5     2     5     12
Student 7     4     1     4      9
Student 8     4     0     3      7
Student 9     4     1     3      8
Student 10    5     2     4     11
Student 11    5     1     4     10
Student 12    5     0     5     10
Average      .87   .22   .85   .64
What do you think now?
Results for a Three-Standard Test (item by item)

Standard    4.1 4.1 4.1 4.1 4.1  4.3 4.3 4.3 4.3 4.3  4.2 4.2 4.2 4.2 4.2
Item          1   2   3   4   5    6   7   8   9  10   11  12  13  14  15  Total
Student 1     1   1   0   1   0    1   1   1   0   1    1   1   0   0   0      9
Student 2     1   0   1   1   1    1   1   1   1   1    1   1   1   0   0     12
Student 3     1   1   1   0   0    1   1   1   1   1    1   0   0   0   0      9
Student 4     1   1   1   1   1    1   1   1   1   1    0   0   0   0   0     10
Student 5     1   1   1   1   1    1   1   1   0   1    0   0   0   0   0      9
Student 6     1   1   1   1   1    1   1   1   1   1    1   1   0   0   0     12
Student 7     0   1   1   1   1    1   1   1   0   1    1   0   0   0   0      9
Student 8     1   1   1   0   1    0   1   1   1   0    0   0   0   0   0      7
Student 9     1   1   1   1   0    1   1   0   1   0    1   0   0   0   0      8
Student 10    1   1   1   1   1    1   1   1   0   1    1   1   0   0   0     11
Student 11    1   1   1   1   1    1   1   0   1   1    1   0   0   0   0     10
Student 12    1   1   1   1   1    1   1   1   1   1    0   0   0   0   0     10
Average     .92 .92 .92 .83 .75  .92 1.00 .83 .67 .83  .67 .33 .08 .00 .00   .64
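Both tables are views of the same 0/1 score matrix; the first aggregates by standard, the second looks item by item. A minimal sketch of computing both views (only the first two students' rows are shown; the data layout is mine, not a TfHS/ROS format):

    # Per-standard averages vs. per-item difficulty from one score matrix.
    scores = {  # student -> 0/1 scores on items 1-15 (first two students shown)
        "Student 1": [1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0],
        "Student 2": [1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    }
    standards = ["4.1"] * 5 + ["4.3"] * 5 + ["4.2"] * 5  # standard for each item

    # The "obvious" view: average by standard.
    for std in ("4.1", "4.2", "4.3"):
        cols = [i for i, s in enumerate(standards) if s == std]
        pts = sum(row[i] for row in scores.values() for i in cols)
        print(f"{std} average: {pts / (len(scores) * len(cols)):.2f}")

    # The closer look: per-item p-values. An item nearly everyone misses
    # (items 14 and 15 above) drags down its standard's average and may say
    # more about the item, or untaught content, than about the students.
    for i, std in enumerate(standards):
        p = sum(row[i] for row in scores.values()) / len(scores)
        print(f"item {i + 1} ({std}): p = {p:.2f}")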
The chain has many links
• Nearly any of them can break
• Try to find the weakest links in your organization's efforts
• Fix them – one by one
What are some of my solutions to all of this?
• To the problems of mistakes in test creation:
  • Use test blueprints
  • Be very careful of automatic test construction
  • Read the test carefully yourself and answer the questions
  • Have someone else read the test carefully and answer the questions
  • Use "Kid-Tested" items *
* Future TfHS initiative
What are some of my solutions to all of this?
• Be careful when reading reports – look past the obvious
• For problems of careless, unmotivated test taking by students (even cheating): make the test less of a contest between the system/teacher and the student and more of a communication device between them
• Watch the students as they take the test, and realize that proctoring rules necessary for high-stakes tests are possibly not best for formative or semi-formative assessments
• Look for/flag pattern marking and rapid responding * (a sketch follows below)
* Future TfHS/ROS initiative
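As a rough sketch of how such flagging might work (the run length of 8 is an arbitrary threshold of mine, not a TfHS/ROS value):

    # Flag "pattern marking": long runs of the same answer choice.
    def flag_pattern_marking(letters, run_len=8):
        """Return True if any letter repeats run_len or more times in a row."""
        run, last = 0, None
        for ch in letters:
            run = run + 1 if ch == last else 1
            last = ch
            if run >= run_len:
                return True
        return False

    print(flag_pattern_marking("ABDCABDA" + "C" * 10))  # True: a run of ten C's
    print(flag_pattern_marking("ACDBBADCABDCACDB"))     # False: no long runs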
Here is a graph showing the timing of student responses to an item
[Histogram: "Number of Responses Over Time to a Rather Easy Item". X-axis: Time (in sec), 0 to 6.5 in half-second steps; y-axis: Number of Responses, 0 to 16.]
For online tests it is possible to screen for rapid responding *
[The same histogram: "Number of Responses Over Time to a Rather Easy Item". X-axis: Time (in sec), 0 to 6.5; y-axis: Number of Responses, 0 to 16.]
* Future TfHS/ROS initiative
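A sketch of what such a screen could look like. The 1.5-second floor below is an assumed cutoff for "too fast to have read the item"; the talk does not give a specific value.

    # Flag responses made faster than a student could plausibly read the item.
    FLOOR_SEC = 1.5  # assumed cutoff; would be tuned per item

    times = [0.4, 2.1, 3.0, 0.9, 4.5, 1.2, 2.8]  # made-up response times (sec)

    too_fast = [i + 1 for i, t in enumerate(times) if t < FLOOR_SEC]
    print(f"Items flagged as rapid responses: {too_fast}")  # [1, 4, 6]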
A major new way of communicating!
• Let the students tell you when they don't know or understand something – eliminate guessing
• New MC scoring scheme * (a worked sketch follows below):
  • 1 point for each correct answer
  • 0 points for each wrong answer
  • ⅓ point for each unanswered question
• Students mark where they run out of time
* Future TfHS/ROS initiative
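A minimal sketch of the arithmetic, using counts inferred from the worked example on the next slide (25 items: 11 correct, 8 wrong, 6 left blank, with time marked as running out after item 20):

    # The proposed 1, 1/3, 0 scheme next to regular right-only scoring.
    def new_score(correct, wrong, blank):
        return correct * 1 + wrong * 0 + blank * (1 / 3)

    n_items, reached = 25, 20         # 20 = the item where time ran out
    correct, wrong, blank = 11, 8, 6

    regular = correct                            # 11
    proposed = new_score(correct, wrong, blank)  # 11 + 6/3 = 13.00

    print(f"Regular:   {regular / n_items:.2%}")   # 44.00%
    print(f"1, 1/3, 0: {proposed / n_items:.2%}")  # 52.00%

    # Corrected for test length: divide by the items actually reached.
    print(f"Regular:   {regular / reached:.2%}")   # 55.00%
    print(f"1, 1/3, 0: {proposed / reached:.2%}")  # 65.00%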
Continued
A major new way of communicating!
• Students have to be taught the new rules
• Students need one or two tries to get the hang of it
• Students need to know when the new scoring applies
• It is better for students to admit not knowing than to guess (with four answer choices, a blind guess is worth only ¼ point on average, less than the ⅓ point for leaving the item blank)
One Student's Item Analysis
[Answer sheet for a 25-item test, answered under the new scoring scheme; the student left some items blank and marked where time ran out.]

Answering under the new scoring scheme (Test Length: 25)
           Regular   1, ⅓, 0
Score        11       13.00
Percent    44.00%    52.00%

Corrected for Test Length
           Regular   1, ⅓, 0
Score        11       13.00
Percent    55.00%    65.00%
Humor
Time flies like an arrow;
fruit flies like a banana.
We sometimes need to take a 90° turn in our thinking
My contact information
• David Mott – [email protected] | 866.724.7997 | 804.282.3111
• TfHS website – www.tfhs.net
• ROS website – rosworks.com