The Science of Test Development
The Science and Art of
Exam Development
Paul E. Jones, PhD
Thomson Prometric
What is validity and how do I know
if my test has it?
Validity
“Validity refers to the degree to which evidence and theory
support the interpretations of test scores entailed by the
proposed uses of tests. Validity is, therefore, the most
fundamental consideration in developing and evaluating tests.”
(APA Standards, 1999, p. 9)
A test may yield valid judgments about people…
• If it measures the domain it was defined to measure.
• If the test items have good measurement properties.
• If the test scores and the pass/fail decisions are reliable.
• If alternate forms of the test are on the same scale.
• If you apply defensible judgment criteria.
• If you allow enough time for competent (but not necessarily speedy) candidates to take the test.
• If it is presented to the candidate in a standardized fashion, without environmental distractions.
• If the test taker is not cheating and the test has not deteriorated.
Is this a Valid Test?
1. 4 - 3 = _____
2. 9 - 2 = _____
3. 4 - 4 = _____
4. 7 - 6 = _____
5. 5 - 1 = _____
6. 3 - 2 = _____
7. 8 - 7 = _____
8. 9 - 5 = _____
9. 6 - 2 = _____
10. 8 - 3 = _____
Validity = Technical Quality of the Testing System
[Diagram: the testing-system cycle — Design, Development, Item Bank, Form Assembly, Data Gathering, Item Analysis & Selection, Standard Setting, Live Testing]
The Validity Argument is Part of the Testing System
[Diagram: the same testing-system cycle, with documentation (“Doc”) produced at every stage to support the validity argument]
How should I start a new testing
initiative?
A Testing System Begins with Design
[Diagram: the testing-system cycle — Design, Development, Item Bank, Form Assembly & Equating, Data Gathering, Item Analysis & Selection, Standard Setting, Live Testing — beginning with Design]
Test Design Begins with Test Definition
• Test Title
• Credential Name
• Test Purpose (“This test will certify that the successful candidate has important knowledge and skills necessary to…”)
• Intended Audience
• Candidate Preparation
• High-Level Knowledge and Skills Covered
• Products or Technologies Addressed
• Knowledge and Skills Assumed but Not Tested
• Knowledge and Skills Related to the Test but Not Tested
• Borderline Candidate Description
• Testing Methods
• Test Organization
• Test Stakeholders
• Other Information
Test Definition Begins with Program Design
Test Definition Leads to Practice Analysis
1. Knowledge of Networking Practices
   1.1 Installing the Network
       1.1.1 Install and configure a workstation network card.
             1.1.1.1 Identify the correct NIC to support a given network scenario.
                     1.1.1.1.1 Ethernet
                     1.1.1.1.2 Token Ring
                     1.1.1.1.3 FDDI
                     1.1.1.1.4 Wireless
             1.1.1.2 Install and configure the correct device to support a dial-up scenario.
             1.1.1.3 Given an installation scenario, select the appropriate drivers and configure resource settings.

Test Objective: “Given a network installation scenario, select the appropriate NIC and configure the settings (e.g., IRQ, full/half duplex, speeds, etc.).”
Practice Analysis Leads to Test Objectives
Test Objectives are Embedded in a Blueprint

ID | Section / Performance Statement / Test Objective Cue | Target (Current Form) | Target (Bank) | Have | To Go
Section 1 | Understand concepts as they apply to AAM | | | |
1.1 | Know general networking and domain concepts and terminology (IP address, subnet mask, Media Access Control [MAC] address, node, routing, domain types [e.g., Internet, AAM, etc.]). | 1.00 | 3 | 3 | 0
1.2 | Know the general concepts and terminology of High Availability (HA) environments (resource group, cluster topologies, highly available vs. fault tolerant, virtual IP, single point of failure, heartbeating, LAN vs. WAN failover considerations). | 1.00 | 3 | 4 | (1)
1.3 | Know fundamental AAM concepts, components, and infrastructure (preferred node list, isolation detection, failure detection, network line usage as it relates to AAM). | 3.00 | 9 | 2 | 7
1.4 | Know the elements of Resource Groups (data sources, process or service, managed IP, network interface card, node alias, utility process). | 3.00 | 9 | 9 | 0

(A parenthesized “To Go” value, such as (1), indicates a surplus: the bank already holds more items than the target.)
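Read this way, “To Go” is simple bookkeeping: the bank target minus the items already on hand, with bank targets scaled up from the per-form targets. A minimal sketch of that calculation, using the counts above (the 3x form-to-bank ratio is inferred from the slide’s numbers, not stated on it):

```python
# Blueprint bookkeeping: how many items still need to be written.
# Counts come from the blueprint slide; the bank target is assumed
# to be 3x the per-form target.
BANK_RATIO = 3
blueprint = [  # (objective_id, form_target, items_in_bank)
    ("1.1", 1, 3),
    ("1.2", 1, 4),
    ("1.3", 3, 2),
    ("1.4", 3, 9),
]
for obj, form_target, have in blueprint:
    bank_target = form_target * BANK_RATIO
    print(f"{obj}: need {bank_target}, have {have}, to go {bank_target - have}")
```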
Once I have a blueprint, how do I
develop appropriate exam items?
The Testing System
[Diagram: the testing-system cycle, with Development as the focus of this section]
Creating Items
[Diagram: the anatomy of an item — content characteristics (text, graphics, audio, video, simulations, applications), response modes (choose one, choose many, drag & drop, brief free response, essay free response, simulation/application), and scoring — yielding item types such as single and multiple multiple-choice (M/C) and single and multiple point-and-click (P&C)]
Desirable Measurement Properties of Items
• Item-objective linkage
• Appropriate difficulty
• Discrimination
• Interpretability
Item-Objective Linkage
Good Item Development Practices
• SME writers in a social environment
• Industry-accepted item-writing principles
• An item-banking tool
• Mentoring
• Rapid editing
• Group technical reviews
How can I gather and use data to
develop an item bank?
The Testing System
[Diagram: the testing-system cycle, with Data Gathering as the focus of this section]
Classical Item Analysis: Difficulty and Discrimination
Classical Option Analysis: Good Item

Option  | n   | Proportion | Discrim | Q1 (10-30) | Q2 (31-33) | Q3 (34-36) | Q4 (37-40) | Q5 (41-50)
A       | 121 | 0.144      | -0.217  | 48         | 28         | 24         | 16         | 5
B (key) | 661 | 0.786      |  0.326  | 104        | 114        | 139        | 166        | 138
C       | 22  | 0.026      | -0.136  | 11         | 8          | 1          | 1          | 1
D       | 36  | 0.043      | -0.142  | 16         | 9          | 9          | 1          | 1

(Q1–Q5 are total-score bands.) The keyed option draws a growing share of responses in each higher score band, so it discriminates positively while every distractor discriminates negatively — the signature of a good item.
Classical Option Analysis: Problem Item

Option  | n   | Proportion | Discrim | Q1 | Q2 | Q3 | Q4 | Q5
A       | 187 | 0.222      | -0.018  | 41 | 37 | 37 | 49 | 23
B       | 66  | 0.078      |  0.108  | 12 | 8  | 13 | 12 | 21
C       | 260 | 0.309      |  0.022  | 57 | 43 | 55 | 54 | 51
D (key) | 327 | 0.389      | -0.051  | 69 | 71 | 68 | 69 | 50

Here the keyed option (apparently D) discriminates slightly negatively, and the only option favored by high scorers is a distractor (B) — a signal to review the item for a possible miskey or a flawed stem.
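As a concrete reference, here is a minimal sketch of how such statistics can be computed — difficulty as the proportion choosing the key, discrimination as the item-total point-biserial correlation (one common definition; the slides do not state the exact method used):

```python
import numpy as np

def classical_item_stats(responses, key, total_scores):
    """Classical difficulty and discrimination for one item.

    responses    : array of chosen options ('A'..'D'), one per candidate
    key          : the keyed (correct) option
    total_scores : each candidate's total test score
    """
    correct = (responses == key).astype(float)
    p_value = correct.mean()                            # difficulty (proportion correct)
    discrim = np.corrcoef(correct, total_scores)[0, 1]  # point-biserial discrimination
    return p_value, discrim

def option_counts_by_quintile(responses, total_scores, options="ABCD"):
    """Counts of each option within five total-score bands (Q1..Q5)."""
    cuts = np.quantile(total_scores, [0.2, 0.4, 0.6, 0.8])
    band = np.digitize(total_scores, cuts)              # 0..4 -> Q1..Q5
    return {opt: [int(np.sum((responses == opt) & (band == q))) for q in range(5)]
            for opt in options}
```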
IRT Item Analysis: Difficulty and Discrimination
[Figure: three 3PL item characteristic curves with parameters (a = 0.6, b = -1.5, c = 0.4), (a = 1.2, b = -0.5, c = 0.1), and (a = 1.0, b = 1.0, c = 0.25)]
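Under the three-parameter logistic (3PL) model, a is discrimination (curve steepness), b is difficulty (location), and c is the pseudo-guessing lower asymptote. A short sketch evaluating the standard 3PL curve for the three items shown:

```python
import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    """3PL item characteristic curve: probability of a correct response
    at ability theta (D = 1.7 is the usual normal-ogive scaling constant)."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

theta = np.linspace(-4, 4, 9)
for a, b, c in [(0.6, -1.5, 0.4), (1.2, -0.5, 0.1), (1.0, 1.0, 0.25)]:
    print(f"a={a}, b={b}, c={c}:", np.round(p_3pl(theta, a, b, c), 2))
```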
Good IRT Model Fit
How can I assemble test forms
from my item bank?
The Testing System
[Diagram: the testing-system cycle, with Form Assembly & Equating as the focus of this section]
Reliability
“Reliability refers to the degree to which test scores are free from
errors of measurement.” (APA Standards, 1985, p. 19)
More Reliable Test
[Chart: scores for Mia (true score = 73) and Kim (true score = 67) across thirty forms of a more reliable test; cut score = 70, score axis from 60 to 85]
Less Reliable Test
[Chart: scores for Mia (true score = 73) and Kim (true score = 67) across thirty forms of a less reliable test; cut score = 70, score axis from 60 to 85]
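The point of the two charts: with more measurement error, Mia (who should pass) sometimes fails and Kim (who should fail) sometimes passes. A toy simulation of that effect, assuming normally distributed error (the SEM values are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
CUT = 70
candidates = {"Mia": 73, "Kim": 67}   # true scores from the charts

# Observed score = true score + error; a smaller standard error of
# measurement (SEM) corresponds to a more reliable test.
for sem in (1.0, 4.0):                # illustrative SEMs
    print(f"SEM = {sem}:")
    for name, true_score in candidates.items():
        observed = true_score + rng.normal(0.0, sem, size=30)  # thirty forms
        print(f"  {name} passes {(observed >= CUT).sum()} of 30 forms")
```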
How to Enhance Reliability When Assembling Test Forms
• Score reliability/generalizability
  – Select items with good measurement properties.
  – Present enough items.
  – Target items at candidate ability level.
  – Sample items consistently from across the content domain (use a clearly defined test blueprint).
• Score dependability
  – Same as above.
  – Minimize differences in test difficulty.
• Pass/fail consistency
  – Select enough items.
  – Target items at the cut score.
  – Maintain the same score-distribution shape between forms.
Building Simultaneous Parallel Forms Using Classical Theory
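The slide illustrates this graphically; as a rough sketch of the idea (a toy heuristic, not Prometric’s actual procedure), items can be ranked by p-value and dealt to the forms in a snake order so mean difficulty comes out nearly equal:

```python
def split_parallel(items):
    """items: list of (item_id, p_value). Returns two forms with nearly
    equal mean difficulty via a snake (boustrophedon) deal."""
    ranked = sorted(items, key=lambda it: it[1])
    form_a, form_b = [], []
    for i, item in enumerate(ranked):
        # deal order per group of four: A, B, B, A
        (form_a if i % 4 in (0, 3) else form_b).append(item)
    return form_a, form_b

pool = [("i1", 0.45), ("i2", 0.80), ("i3", 0.62), ("i4", 0.58),
        ("i5", 0.71), ("i6", 0.49), ("i7", 0.66), ("i8", 0.77)]
form_a, form_b = split_parallel(pool)
mean_p = lambda form: sum(p for _, p in form) / len(form)
print(round(mean_p(form_a), 3), round(mean_p(form_b), 3))  # ~0.63 each
```

In practice the deal would also be constrained by the blueprint so both forms match on content as well as difficulty.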
Building Simultaneous Parallel Forms Using IRT
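Under IRT, parallelism is judged by the test information function: forms assembled to have matching information across the ability scale measure with comparable precision everywhere. A minimal check of two candidate forms, assuming hypothetical 2PL item parameters:

```python
import numpy as np

def test_information(thetas, items):
    """Sum of 2PL item information functions; items is a list of (a, b)."""
    total = np.zeros_like(thetas)
    for a, b in items:
        p = 1.0 / (1.0 + np.exp(-a * (thetas - b)))
        total += a**2 * p * (1.0 - p)
    return total

thetas = np.linspace(-3, 3, 7)
form_a = [(1.1, -0.5), (0.9, 0.2), (1.3, 0.8)]   # hypothetical parameters
form_b = [(1.0, -0.4), (1.0, 0.1), (1.2, 0.9)]
print(np.round(test_information(thetas, form_a), 2))
print(np.round(test_information(thetas, form_b), 2))  # should track form_a closely
```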
Setting Cut Scores
Why not just set the cut score at 75% correct?
Setting Cut Scores
Why not just set the cut score so that 80% of the
candidates pass?
The logic of criterion-based cut score setting
• Certain knowledge and skills are necessary for practice.
• The test measures an important subset of these knowledge and
skills, and thus readiness for practice.
• The passing [cut] score is such that those who pass have a high
enough level of mastery of the KSJs to be ready for practice [at
the level defined in the test definition], while those who fail do
not. (Kane, Crooks, and Cohen, 1997)
The Main Goal in Setting Cut Scores
Meeting the “Goldilocks Criteria”
“We want the passing score to be neither too high nor too low, but at least
approximately, just right.”
Kane, Crooks, and Cohen, 1997, p. 8
Two General Approaches to Setting Cut Scores
• Test-centered approaches:
  – Modified Angoff
  – Bookmark
• Examinee-centered approaches:
  – Borderline
  – Contrasting Groups
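For instance, in the Modified Angoff method each SME judges, item by item, the probability that a borderline candidate would answer correctly; averaging over judges and summing over items yields the recommended raw cut score. A minimal sketch with hypothetical ratings:

```python
# Modified Angoff: judges estimate P(borderline candidate answers item
# correctly); the recommended raw cut score is the expected raw score of
# a borderline candidate. All ratings below are hypothetical.
ratings = {
    "judge1": [0.70, 0.55, 0.80, 0.60, 0.65],
    "judge2": [0.75, 0.50, 0.85, 0.55, 0.70],
    "judge3": [0.65, 0.60, 0.75, 0.60, 0.60],
}
n_items = 5
item_means = [sum(judge[i] for judge in ratings.values()) / len(ratings)
              for i in range(n_items)]
cut_score = sum(item_means)
print(f"recommended cut score: {cut_score:.2f} out of {n_items}")  # ~3.28
```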
The Testing System
[Diagram: the testing-system cycle, with Standard Setting as the focus of this section]
What should I consider as I
manage my testing system?
Security of a Testing System
[Diagram: the testing-system cycle, with these measures attached to Development and the Item Bank]
• Write more items!!!
• Create authentic items.
• Use isomorphs.
• Use automated item generation.
• Use secure banking software and connectivity.
• Use in-person development.
Security of a Testing System
[Diagram: the testing-system cycle, with these measures attached to Live Testing]
• Establish prerequisite qualifications.
• Use narrow testing windows.
• Establish test/retest restrictions.
• Use identity verification and biometrics.
• Require test takers to sign NDAs.
• Monitor test takers on site.
• Intervene if cheating is detected.
• Monitor individual test center performance.
• Track suspicious test takers over time.
Security of a Testing System
[Diagram: the testing-system cycle, with these measures attached to Data Gathering and Item Analysis & Selection]
• Perform frequent, detailed psychometric review.
• Restrict the use of items and test forms.
• Analyze response times.
• Perform DRIFT analyses.
• Calibrate items efficiently.
Item Parameter Drift
[Figure: item characteristic curves plotting P(theta) against theta from -4 to 4]
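A common way to operationalize a DRIFT analysis (sketched here in general form; the values and threshold are assumptions, not Prometric’s procedure): recalibrate items on recent response data and flag any whose difficulty has shifted materially from the banked value, since a sudden drop in difficulty can indicate item exposure or compromise.

```python
# Flag items whose IRT difficulty (b) drifted between calibrations.
# Banked and recalibrated values, and the threshold, are hypothetical.
banked = {"i1": -0.50, "i2": 1.10, "i3": 0.20}
recal  = {"i1": -0.45, "i2": 0.30, "i3": 0.25}
THRESHOLD = 0.5   # logits

for item, b_old in banked.items():
    shift = recal[item] - b_old
    if abs(shift) > THRESHOLD:
        print(f"{item}: b shifted {shift:+.2f} -> review for exposure or compromise")
```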
Security of a Testing System
[Diagram: the testing-system cycle, with these delivery models attached to Form Assembly]
• Many unique fixed forms
• Linear on-the-fly testing (LOFT)
• Computerized adaptive testing (CAT)
• Computerized mastery testing (CMT)
• Multistage testing (MST)
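Of these, CAT limits item exposure most aggressively: at each step it administers the unused item that is most informative at the candidate’s current ability estimate, so different candidates see largely different items. A bare-bones sketch of that selection rule under the 2PL model (hypothetical pool):

```python
import math

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

# Hypothetical item pool: (item_id, discrimination a, difficulty b)
pool = [("i1", 1.2, -1.0), ("i2", 0.8, 0.0), ("i3", 1.5, 0.4), ("i4", 1.0, 1.2)]

theta_hat = 0.3   # current provisional ability estimate
next_item = max(pool, key=lambda it: info_2pl(theta_hat, it[1], it[2]))
print("administer:", next_item[0])   # the most informative remaining item
```

Operational CAT layers exposure control on top of this rule, precisely so the most informative items are not overused.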
Item Analysis Activity