Introduction to Test Development
Graham McMahon, MD, MMSc.
Sarah E. Peyre, EdD
Educational Research Methods Program
Learning Objectives
Understand the pros and cons of various question types for written examinations
Learn how to determine
Item difficulty and
Item discrimination
Understand the psychometrics of a high stakes test
Validity
Reliability
Standard Setting
Come to our Workshop!
Work in small groups to…
Review problematic multiple choice items
Establish validity and reliability for a test
Participate in standard setting exercise
Question Types – Pros and Cons
Essay Items
Short Answer and Completion Items
Matching Items
True-False and Multiple-Choice Tests
Interviews
Portfolios
….all can be scored and can be subject to test development
Multiple-Choice Items
Stem
An 85-year-old woman has difficulty raising her arms above her head and combing her hair. She has morning aches in her shoulders and neck. Her reflexes are symmetrical and normal. There is no muscle tenderness or joint swelling.
Lead-in
Which one of the following laboratory tests should be obtained to confirm the most likely diagnosis?
Responses (one correct response; the others are distractors)
A. Anti-nuclear antibody.
B. Erythrocyte sedimentation rate.
C. Serum concentration of creatine kinase.
D. Serum concentration of angiotensin-converting enzyme.
E. Urine microscopy.
Tips for writing discriminant MCQs
Be sure that each item reflects a clearly defined learning outcome
Stem
The stem of the item should be self-contained and written in clear and
precise language.
Avoid ‘trigger’ words (e.g. pin-rolling tremor)
Negatives, excepts, absolutes and qualifiers in question stems are no-no’s.
Responses
All answers should be plausible and homogeneous
Items need to be independent of one another
Answer choices should be similar in length and grammatical form
List answer choices in alphabetical or numerical order
Avoid ‘all of the above’ as a response
Avoid technical flaws (tense or plurality for example)
Pros and Cons of MCQs
Pros
Useful for measuring learning outcomes at almost any level
Easy to understand
Easy to score
Easily analyzed for effectiveness
Allow broad coverage efficiently
Cons
Good questions take a long time to write
Good questions are difficult to write
Constrain creative responses from learners
May have more than one correct answer
Item Analysis
Qualitative: looks at whether the content
matches the information, attitude, characteristic
or behavior being assessed
Quantitative:
Item difficulty
Item discrimination
Determining item difficulty
The percentage of participants who get that item correct
Item difficulty scores can range from 0 to 100%
Low value = high difficulty
High value = low difficulty
[Figure: number of students achieving each score (0 to 100) for a hard exam, a normal exam, and an easy exam]
Difficulty           Percent correct
High (Difficult)     <= 30%
Medium (Moderate)    > 30% and < 80%
Low (Easy)           >= 80%
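A minimal sketch of the difficulty calculation, using the category cut-offs listed on this slide. The function names and the sample data are hypothetical and only illustrate the arithmetic.

```python
def item_difficulty(responses):
    """Percentage of test takers who answered the item correctly.

    `responses` is a list of booleans, one per test taker (True = correct).
    """
    if not responses:
        raise ValueError("need at least one response")
    return 100.0 * sum(responses) / len(responses)


def difficulty_category(percent_correct):
    """Map percent correct onto the three categories from the slide."""
    if percent_correct <= 30:
        return "high"    # difficult item
    if percent_correct < 80:
        return "medium"  # moderate item
    return "low"         # easy item


# Hypothetical item: 12 of 40 students answered correctly.
responses = [True] * 12 + [False] * 28
p = item_difficulty(responses)
print(p, difficulty_category(p))  # 30.0 high
```

Note that a low percentage means a hard item, so the category names run in the opposite direction to the number.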
Discrimination Index
The Discrimination Index distinguishes for each item between the performance of students who did well on the exam and students who did poorly.
Index of discrimination: the % of people answering correctly in one extreme group minus the % of people answering correctly in the other extreme group
Item discrimination scores can range from -1.00 to +1.00
Example
100 test takers: 20 of the top 25 students answered correctly, but only 5 of the lowest 25 students did.
DI = (20-5)/25 = 0.6
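The same calculation as a short sketch, assuming the extreme groups are the top and bottom quarter of test takers ranked by total score. The function name, the `fraction` parameter, and the made-up data are illustrative assumptions; ties in total score are broken by input order.

```python
def discrimination_index(item_correct, total_scores, fraction=0.25):
    """Upper-group correct count minus lower-group correct count, per group size."""
    n = len(total_scores)
    if len(item_correct) != n:
        raise ValueError("one item result is needed per test taker")
    k = int(n * fraction)  # size of each extreme group
    if k == 0:
        raise ValueError("groups are empty; use more test takers or a larger fraction")
    # Rank test takers from lowest to highest total score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    upper_correct = sum(item_correct[i] for i in upper)
    lower_correct = sum(item_correct[i] for i in lower)
    return (upper_correct - lower_correct) / k


# Made-up data matching the example's counts: 100 test takers,
# 5 of the lowest 25 and 20 of the top 25 answered the item correctly.
total_scores = list(range(100))
item_correct = [i < 5 or i >= 80 for i in range(100)]
print(discrimination_index(item_correct, total_scores))  # (20 - 5) / 25 = 0.6
```

A negative result means weaker students outperformed stronger students on the item.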
Item Discrimination (D)   Item Difficulty
                          High     Med      Low
D <= 0%                   review   review   review
0% < D < 30%              ok       review   ok
D >= 30%                  ok       ok       ok
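The review grid above can be written as a small lookup. This is only a sketch: the function name and the string labels are hypothetical, and D is taken as a proportion between -1 and 1 rather than a percentage.

```python
def review_decision(difficulty, discrimination):
    """Return "review" or "ok" for an item.

    `difficulty` is "high", "med" or "low"; `discrimination` is D as a
    proportion (for example 0.25 for 25%).
    """
    if difficulty not in ("high", "med", "low"):
        raise ValueError("difficulty must be 'high', 'med' or 'low'")
    if discrimination <= 0:
        return "review"  # non-discriminating or negatively discriminating item
    if discrimination < 0.30:
        # Weak discrimination is flagged only for medium-difficulty items.
        return "review" if difficulty == "med" else "ok"
    return "ok"


print(review_decision("med", 0.15))   # review
print(review_decision("high", 0.15))  # ok
print(review_decision("low", -0.10))  # review
```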
Item Analysis Report
[Figure: sample item analysis report, headed by the order ID and group number, with a percentages panel and a counts panel]
The left half shows percentages, the right half counts.
The correct option is indicated in parentheses.
Point Biserial is similar to the discrimination index, but is not based on fixed upper and
lower groups. For each item, it compares the mean score of students who chose the correct
answer to the mean score of students who chose the wrong answer.
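A sketch of the point biserial under common textbook assumptions: the difference between the two mean scores is scaled by the standard deviation of all total scores and by the proportion answering correctly. The function name is hypothetical, and using the population standard deviation (`pstdev`) is a choice made here, not something the report specifies.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(item_correct, total_scores):
    """Point-biserial correlation between one item (right/wrong) and total score."""
    right = [s for c, s in zip(item_correct, total_scores) if c]
    wrong = [s for c, s in zip(item_correct, total_scores) if not c]
    if not right or not wrong:
        raise ValueError("need at least one correct and one incorrect response")
    spread = pstdev(total_scores)
    if spread == 0:
        raise ValueError("total scores have no variance")
    p = len(right) / len(total_scores)  # proportion answering correctly
    return (mean(right) - mean(wrong)) / spread * sqrt(p * (1 - p))


# Hypothetical data: five students, one item.
item_correct = [True, True, True, False, False]
total_scores = [9, 8, 7, 4, 2]
print(round(point_biserial(item_correct, total_scores), 3))
```

Computed this way, the value equals the Pearson correlation between the 0/1 item score and the total score, so every student contributes rather than only the extreme groups.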
Test Validity
Validity:
The extent to which inferences made from a test are
appropriate, meaningful, or useful.
Does my test measure what it is intended to measure?
Content validity: expert review
Criterion validity (predictive/concurrent): scores can be related to another known metric
Construct validity: successfully differentiates between levels of learners
Kissing Cousins
A test cannot be valid until it is reliable
Test Reliability
Reliability: measures the underlying construct consistently (trustworthiness/stability)
Test-retest reliability
Alternate forms reliability
Internal consistency reliability (Cronbach's alpha)
Inter-rater reliability
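Of the reliability types listed, internal consistency is the easiest to compute directly from one test administration. Below is a sketch of Cronbach's alpha from the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The function name and the data layout (one row per student, one column per item) are assumptions for illustration.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix: rows are students, columns are items."""
    k = len(scores[0])  # number of items
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [pvariance([row[j] for row in scores]) for j in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have no variance")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical 0/1 scores for four students on three items.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
]
print(round(cronbach_alpha(scores), 3))
```

The same variance function is used for items and totals; mixing population and sample variances would distort the ratio.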
How do I set a passing grade?
Standard Setting
Norm referenced: Z-scores
Number of standard deviations below the mean
Criterion referenced: Angoff Method
A panel of experts is asked to evaluate each item and estimate the fraction of minimally competent students who would answer each item correctly
Ratings are averaged across the experts for each item, discussed, and then summed to get the panel's raw cut score
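Both approaches reduce to simple arithmetic, sketched below. The function names and sample numbers are hypothetical. The norm-referenced version assumes the cut is placed a chosen number of standard deviations below the mean and uses the population standard deviation. The Angoff version leaves out the discussion step, which cannot be automated.

```python
from statistics import mean, pstdev


def norm_referenced_cut(scores, sd_below):
    """Cut score placed `sd_below` standard deviations below the mean score."""
    return mean(scores) - sd_below * pstdev(scores)


def angoff_cut_score(ratings):
    """Raw cut score from Angoff ratings.

    `ratings[e][i]` is expert e's estimate of the fraction of minimally
    competent students who would answer item i correctly.
    """
    n_items = len(ratings[0])
    # Average across experts for each item, then sum across items.
    item_means = [mean(expert[i] for expert in ratings) for i in range(n_items)]
    return sum(item_means)


# Hypothetical panel: three experts rating a four-item test.
ratings = [
    [0.6, 0.8, 0.5, 0.9],
    [0.7, 0.7, 0.4, 0.8],
    [0.5, 0.9, 0.6, 1.0],
]
print(round(angoff_cut_score(ratings), 2))  # expected raw score of a borderline student
print(round(norm_referenced_cut([60, 70, 80], sd_below=1), 2))
```

The Angoff result is in raw-score units (items correct), so on a four-item test it lies between 0 and 4.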
Thank you!
Welcome to Our
Workshop on Test
Development!
Graham McMahon, MD, MMSc.
Sarah E. Peyre, EdD
Educational Research Methods
The Academy at Harvard Medical School
Outline
Learning Objectives
Creating MCQ Items
Item Template
Item Flaws
Tips for Success
Establishing Validity and Reliability for a Test
Mock Standard Setting
Item Creation
Consider beginning with the end in mind
What is it that you think the medical student should demonstrate that he/she knows or knows how to do?
This should be an objective from your lesson plan.
Learning Activities
Objectives
Evaluation
Item Stems: Clinical Vignettes
Things to consider:
Patient description (46-year-old female)
Functional disability (difficulty rising from a seated position, but has no difficulty flexing her legs)
The question based on this item template:
A 46-year-old female has difficulty rising from a seated position, but has no difficulty flexing her legs. Which of the following muscles has been injured?
[Objective: Identify and explain the function of the muscles in the…. ]
Item Creation
Lead-in: The most likely diagnosis is
Options: disorders, diseases
Objective: Describe the signs and symptoms of X. Compare and contrast the signs and symptoms of X, Y and Z.
Lead-in: Which of the following additional symptoms would you expect to be present?
Options: symptoms
Objective: same as above
Lead-in: The most likely cause is
Options: bacteria, toxins, medications, metabolic defects
Objective: List and explain the causes of X.
Lead-in: The most likely mechanism is
Options: disease mechanisms, pharmacologic mechanisms
Objective: Diagram and explain the mechanism of drug X.
Item Templates
Other considerations:
Age, gender, race, ethnicity
Site of care (ER, office visit)
Presenting complaint
presents for a routine physical exam
presents with a headache
Duration
Patient history, family history
Physical findings
There is no history of…
He has a history of…
Lab values, imaging studies, pathology reports
Treatment, subsequent findings
Item Creation
Add the lead-in (question) and the options
Which of the following pulmonary variables is most likely to be
lower than normal in this patient?
A. Alveolar-arterial PO2 difference
B. Compliance of the lung
C. Oncotic pressure of the alveolar fluid
D. Work of breathing
E. Residual volume
Item Creation: Taking Recall up
to Another Level
Recall question:
What area is supplied with blood by the posterior inferior cerebellar artery?
[Objective: Identify the areas of the brain supplied by the
major cerebral arteries.]
Item Creation: Taking Recall up
to Another Level
Application question:
A 62-year-old man develops left-sided limb ataxia,
Horner’s syndrome, nystagmus and loss of facial pain
and temperature. Which artery is most likely to be
occluded?
[Objective: Differentiate the signs and symptoms that would occur
upon occlusion of each of the major cerebral arteries.]
Your Turn!
Review the distributed questions
and identify strengths and
weaknesses in each.
Question
Acute intermittent porphyria is the result of a
defect in the biosynthetic pathway for
A. collagen
B. corticosteroid
C. fatty acid
D. glucose
E. heme
Rewritten….
An otherwise healthy 33-year-old male has mild weakness and
occasional episodes of steady, severe abdominal pain with some
cramping but no diarrhea. One aunt and a cousin have had
similar episodes. During an episode, his abdomen is distended,
and bowel sounds are decreased. Neurological examination
shows mild weakness in the upper arms. These findings suggest
a defect in the biosynthetic pathway for:
A. collagen
B. corticosteroid
C. fatty acid
D. glucose
E. heme
Question
A 52-year-old male presents to the office with a one-week history of
flank pain and hematuria. Past medical history is unremarkable.
Physical examination reveals a left-sided abdominal mass. The
greatest risk factor for renal cell carcinoma is
A. diabetes
B. female gender
C. hyperlipidemia
D. low body mass index
E. smoking
Question
Which of the following is a correct statement about cystic
fibrosis (CF)?
A. The incidence of CF is 1:2000.
B. Children with CF usually die in their teens.
C. Males with CF are sterile.
D. CF is an autosomal recessive disease.
E. Symptoms of CF only appear in infancy.
What other flaws can you detect in this question?
Item Flaws: Unfocused items
Which of the following is correct regarding [topic]?
There is not enough information in the stem to answer
the question without looking at the options.
The responses are disparate. The distractors have to be
100% false. Thus, the question basically becomes a
true/false question. Avoid these!
A 45-year-old man comes to the physician because of a 6-week history of a non-productive cough. An X-ray film of the chest shows a 0.8 cm well-circumscribed peripheral nodule in the right lung. Biopsy shows a necrotizing granuloma. Which of the following is the most likely diagnosis?
(A) Pulmonary embolus
(B) Small cell carcinoma
(C) Pseudomonas aeruginosa infection
(D) Histoplasma capsulatum
(E) Herpes pneumonitis
(F) Metastatic renal cell carcinoma
A healthy 57-year-old woman comes to the physician because of a 2 cm mass in her right breast. Biopsy reveals an invasive ductal carcinoma. Which of the following is the most important prognostic factor?
(A) High grade tumor cytology
(B) Infiltrative nature of tumor into benign breast
(C) Numerous mitotic figures
(D) Amount of tumor fibrosis
(E) Presence of lymph node metastasis
(F) Number of plasma cells in tumor
A 63-year-old man comes to the physician because of a 6-week history of
progressive dyspnea on exertion, orthopnea, and ankle edema. He has
received multiagent chemotherapy for Waldenström’s macroglobulinemia
for the past year. Urinalysis shows proteinuria. A bone marrow biopsy
shows a partial response to therapy with ongoing marrow involvement
still identified. Which of the following is the most likely diagnosis?
(A) Cardiac amyloidosis
(B) Viral myocarditis
(C) Cardiac sarcoidosis
(D) Myocardial infarct
(E) Hypertrophic cardiomyopathy
A question submitted
In aortic stenosis what other abnormal heart
sounds might accompany the resulting
murmur?
A. Physiological splitting of S2
B. An accentuated S2
C. Paradoxical splitting of S2
D. A muffled S2
Revised question
A 60-year-old patient with an active lifestyle is found to
have a systolic murmur on a routine physical exam. He
currently has no symptoms. If this were aortic stenosis,
what other abnormal heart sounds might accompany
the systolic murmur?
A.) Physiological splitting of S2
B.) An accentuated S2
C.) Paradoxical splitting of S2
D.) A muffled S2
Summary
Utilize action verbs to write objectives
Write your exam items based on the objectives
Tie the clinical vignette to the lead-in
Choose appropriate options with one best answer
Avoid technical flaws
Utilize an item checklist to ensure that you have done all
you can to write the best items possible.
Pretest your items
Establishing Validity and
Reliability
(Groups)
Standard Setting
(Groups)
Graham McMahon
[email protected]
Item Discrimination: Examples
Number of students per group = 100
Item No.   Correct answers, upper 1/4   Correct answers, lower 1/4   Item Discrimination Index
1          90                           20                           0.7
2          80                           70                           0.1
3          100                          0                            1
4          100                          100                          0
5          50                           50                           0
6          20                           60                           -0.4
Distracter Analysis: Examples
Item 1                         A*   B    C    D    E    Omit
% of students in upper ¼       20   5    0    0    0    0
% of students in the middle    15   10   10   10   5    0
% of students in lower ¼       5    5    5    10   0    0
Item 2                         A    B    C    D*   E    Omit
% of students in upper ¼       0    5    5    15   0    0
% of students in the middle    0    10   15   5    20   0
% of students in lower ¼       0    5    10   0    10   0
(*) marks the correct answer.
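A sketch of how the two items tabulated above could be screened automatically. The flagging rules are assumptions chosen for illustration, not rules stated in the slides: a distractor is flagged if no student chose it, or if it attracted more upper-group than lower-group students. The Omit column is left out.

```python
def flag_distractors(options, correct):
    """Return a dict of flagged distractors.

    `options` maps each option label to a tuple of
    (% in upper quarter, % in the middle, % in lower quarter).
    """
    flags = {}
    for label, (upper, middle, lower) in options.items():
        if label == correct:
            continue  # only distractors are screened
        if upper + middle + lower == 0:
            flags[label] = "never chosen"
        elif upper > lower:
            flags[label] = "attracts stronger students"
    return flags


# Percentages copied from the two example items (Omit column excluded).
item1 = {"A": (20, 15, 5), "B": (5, 10, 5), "C": (0, 10, 5),
         "D": (0, 10, 10), "E": (0, 5, 0)}
item2 = {"A": (0, 0, 0), "B": (5, 10, 5), "C": (5, 15, 10),
         "D": (15, 5, 0), "E": (0, 20, 10)}

print(flag_distractors(item1, correct="A"))  # {}
print(flag_distractors(item2, correct="D"))  # {'A': 'never chosen'}
```

Under these rules, option A of Item 2 is the only distractor flagged, since no student selected it.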