Writing Effective Test Questions
Terry Stratton, Ph.D.
Assistant Dean, Student Assessment & Program Evaluation
Amy Murphy-Spencer, Ed.S.
Coordinator, Testing Services
Office of Medical Education
University of Kentucky College of Medicine
January 20, 2010
Acknowledgements
Center for Excellence in Medical
Education (CEME)
Dorcas Beatty
Others
Don Witzke, Ph.D.
Paul Murphy, M.D.
Donna Weber, Ph.D.
Darrell Jennings, M.D.
Outline
Basic Definitions/Vernacular
Group Exercise #1: Test-Taking “Savvy”
Reliability and Validity
Measurement Error
Item Writing Tips: Dos and Don’ts
Group Exercise #2: Critiquing Test Items
Test Item Statistics/Test Item Banking
Wrap-Up
Basic Vernacular
Assessment:
Q: How does something perform?
Evaluation:
Q: Does something work? Is
something effective?
What to Assess?
Attitudes
Altruism
Professionalism
Knowledge
Physiology
Biochemistry
Skills
Problem-Solving
Provider-Patient Communication
Clinical Procedures (e.g., NG tube placement)
How to Assess?
Self-Assessment
Ask the learner
Observation
Ward Evaluations
Often non-standardized, subjective
Objective Structured Clinical Exams (OSCEs)
Usually standardized, objective
Tests/Exams
Written
Oral
What is a Test?
A test is a sample of items or tasks
which represents a specified body of
knowledge or performance.
Coupled with the test is a scoring
procedure that enables one to estimate
examinees’ knowledge or performance.
Emphasis is on differentiation (separating those who have
the knowledge/skill from those who don’t)
Types of Test Items
Essays
Short-Answer
True/False
Multiple Choice (single best answer)
Extended Matching
Any of these formats may involve factual
recall, problem-solving, etc.
Our Focus Today
Knowledge Domain
Written Examinations
Multiple Choice Items (single best answer)
NBME-Type Vignettes
Anatomy of a Test Item
Who is the primary author of the
Declaration of Independence?
A. Alexander Hamilton
B. Benjamin Franklin
C. George Washington
D. James Madison
E. Thomas Jefferson (*)
Stem: the question itself
Options: all of the answer choices (A-E)
Distractors or Foils: the incorrect options
Do Tests Measure ONLY
Knowledge?
Exercise #1:
Test of General Rock & Roll
Knowledge
Are you ready and willing to
participate?
Are you a savvy test taker?
Which pair of artists met their
deaths from drug overdoses?
A. Joplin and Slick
B. Hendrix and Joplin
C. Hendrix and Redding
D. Joplin and Chapin
E. Lennon and Morrison
Demonstrates the “convergence” strategy: find the
answer that includes responses most frequently used
among all the options.
The Payola Scandal of the late
1950s and early 1960s involved:
A. tax evasion by recording artists
B. accusations of Rock & Roll as communist
propaganda
C. paying radio DJs to play records produced by record
companies
D. television game shows
E. violation of copyright laws
The word cue “pay” tips off the answer and makes it more likely C will be
guessed by examinees even if they don’t really know the answer.
The Moog synthesizer is an:
A. instrument
B. recording medium
C. musical genre of the 1960s
D. Top 40 rock group
E. song
The grammatical cue “an” tips off this answer; A is the only grammatically
correct response.
The late 1960s Top 40 rock group “The
Monkees” is credited with being the first to
produce a recording with this instrument.
A. banjo
B. Moog synthesizer
C. slack guitar
D. slide guitar
E. sitar
Conceptual Proximity: The answer is tipped off by the prior question.
What was the primary reason Cleveland was selected
as the site for the Rock & Roll Hall of Fame?
A. Most rock and roll records were produced there
B. Cash
C. Pity
D. Economic stability
E. Alan Freed, a local 1950s Cleveland DJ, is largely
credited for coining the term Rock & Roll to identify
this new music genre
The longest response is usually the correct answer.
Which of the following statements characterizes the
sound of surf music from the 1960s?
A. All surf songs were instrumental
B. There were never drums in surf songs
C. All surf songs were produced by the Beach Boys or by Jan
& Dean
D. Many surf songs consisted of a twangy guitar sound and
higher range vocals
E. All surf music had female backup vocals and a reggae beat
“All or nothing concept”: exceptions to all, always, never, and none are
abundant, so test takers tend to skip options containing those words and
choose the non-absolute responses.
A technique for creating the staccato “thumping” bass
sound characteristic of funk was:
A. using a capo
B. using a slide
C. slapping the bass strings with the thumb
D. plucking the strings with a pick
E. plucking the strings with a fingernail
“Homonym and Synonym Cues”: Guessers might choose C, just because
‘slapping’ and ‘thumping’ are more closely associated than the alternatives.
Which record company contributed most to the
“Memphis Sound”?
A. Stax
B. Capitol and Warner Bros.
C. Columbia and Chess
D. Motown and Atco
E. Chess and American
This is another grammatical cue. The question asks for one company, and
only one option lists a single company.
The British Invasion included which of the following
Rock & Roll groups?
A. Gerry and the Pacemakers
B. The Swinging Blue Jeans
C. The Animals
D. The Beatles
E. All of the Above
Most know the Beatles were British. To choose E as correct, one needs to
know one other group was British – Gerry, spelled with a G, might be a clue
that this is a British group.
Purposes of Testing
Communicate to students what is important
Motivate students to study
Identify areas where remediation is needed
Provide feedback on learning to students
Determine final grades/make promotion
decisions
Identify areas where course/curriculum is weak
“The amount of attention given to evaluating something should reflect its
relative importance”
Susan Case, NBME, 1995
What Should be Tested?
Exam content should match course
objectives
Important topics should be weighted
accordingly
# items/testing time should reflect topic
importance
The sample of items should represent the
domain of instructional content
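One way to make the number of items reflect topic importance is a simple test blueprint. The sketch below is illustrative only: the function name `allocate_items`, the topic weights, and the 50-item exam length are assumptions, not part of the workshop materials.

```python
def allocate_items(topic_weights, total_items):
    """Split total_items across topics in proportion to their weights.

    Uses the largest-remainder method so the counts always sum to total_items.
    """
    total_weight = sum(topic_weights.values())
    exact = {t: total_items * w / total_weight for t, w in topic_weights.items()}
    counts = {t: int(x) for t, x in exact.items()}  # round down first
    leftover = total_items - sum(counts.values())
    # Give the remaining items to the topics with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda t: exact[t] - counts[t], reverse=True)
    for topic in by_remainder[:leftover]:
        counts[topic] += 1
    return counts


# Hypothetical weights (e.g., lecture hours per topic); not from the workshop.
weights = {"Physiology": 12, "Biochemistry": 8, "Pharmacology": 4}
print(allocate_items(weights, 50))
```

The weights could come from lecture hours, the number of course objectives, or faculty judgment; the point is only that the allocation is made explicit before items are written.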
Questions thus far?
Reliability
AKA reproducibility, dependability, internal
consistency
The degree to which we would obtain the same
result if an examination were repeated
Reliable methods of assessment will tend to yield
similar results when repeated under similar
conditions
Methods of estimating reliability
Cronbach’s alpha (α)
Test-retest
Parallel forms
Split-half
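As a rough illustration of the first method, Cronbach's alpha can be computed from a matrix of item scores using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The sketch below assumes dichotomously scored items (1 = correct, 0 = incorrect); the function name and the score matrix are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """scores: list of examinee rows, each a list of item scores (e.g., 0/1).

    Assumes at least two items and some spread in the total scores;
    otherwise the formula is undefined.
    """
    k = len(scores[0])
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 5 examinees (rows) by 4 items (columns).
demo_scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(demo_scores), 2))
```

Real exams have far more examinees and items; testing software typically reports this coefficient automatically.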
Lowest Acceptable Reliability
Scores from a single test with a
reliability coefficient of
less than .70
should not be used to characterize or
evaluate individuals or groups in
higher-stakes situations.
Reliability & Decision-Making
High reliability is demanded
when the decision:
is important
is final
is irreversible
is not confirmable with
other data
concerns
individuals
has lasting
consequences
Low reliability is tolerable
when the decision:
is of minor importance
is in the early stages of
decision-making
is reversible
is confirmable by other
data
concerns groups
has temporary effects
Source: Gronlund NE & Linn RL. Measurement & Evaluation in Teaching (6th Edition). New York:
Macmillan, 1990.
Standard Error of Measurement
If a single student were to take the same
test repeatedly, with no new learning
taking place between assessments and no
memory of the questions, the standard
deviation of his/her repeated test scores
would be the standard error of
measurement.
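Because repeated testing of this kind is not practical, the standard error of measurement is usually estimated from a single administration using the score standard deviation and the reliability coefficient. The formula below is the standard classical test theory estimate; the numbers in the worked example are hypothetical.

```latex
\mathrm{SEM} = s_X \sqrt{1 - r_{XX'}}
% s_X     : standard deviation of observed test scores
% r_{XX'} : reliability coefficient of the test
%
% Hypothetical example: s_X = 10 points, r_{XX'} = 0.84
% \mathrm{SEM} = 10 \sqrt{1 - 0.84} = 10 \times 0.4 = 4 \text{ points}
```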
Validity
The extent to which a test measures what
it purports to measure (e.g., IQ).
An indication of how well a measure
corresponds with reality
Is the exam representative of the universe of
possible exam items from that content area?
Does the exam provide data that increase the
accuracy of decisions made about the
examinee?
Reliability and Validity in a
Nutshell
True Score Theory
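In its usual textbook form, true score theory can be summarized by the equations below. They are the standard classical test theory statements, added here as a reminder rather than taken from the original slide, and they assume that errors are random and uncorrelated with true scores.

```latex
X = T + E
% X : observed score,  T : true score,  E : measurement error

\sigma^2_X = \sigma^2_T + \sigma^2_E

\text{reliability} = \frac{\sigma^2_T}{\sigma^2_X} = 1 - \frac{\sigma^2_E}{\sigma^2_X}
```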
Measurement Error
Random Measurement Error
All chance (random) errors that confound
the measurement of any phenomenon
Inversely related to the reliability of the
measuring instrument
Possible examples:
Testing environment
Testing administration
Examinee preparedness
Effects of Random Error
Non-Random Measurement Error
All systematic errors that confound the
measurement of any phenomenon (bias)
Inversely related to the validity of the
measuring instrument
Possible examples:
Insufficient sampling of items
Assessing what was not taught
Poorly written test items
Effects of Non-Random Error
Sources of Error in Test Scores
Test Quality
Unclear, lengthy directions
Emphasis on non-focal skills (e.g., heavy reading in a
math test)
Item Sampling
Insufficient length to reflect course content
Not enough time for examinees to finish
Item Quality
Items are ambiguous and/or fail to discriminate among
examinees
Environmental Characteristics
Noise, distractions, temperature
Scoring
Important/core topics are not weighted accordingly
Questions?
Okay, How Do You Get There?
Avoid T/F questions
Write test questions with fewer flaws
(less error)
Develop a secure test item bank
Select items with known measurement
characteristics
Select items representing course
content
Multiple-Choice Questions: Pros
Versatility in measuring all levels of cognitive
ability
Highly reliable test scores
Scoring efficiency and accuracy
Objective measurement of student achievement or
ability
Wide sampling of content or objectives
Reduced guessing factor when compared to T/F
items
Different response alternatives which can provide
diagnostic feedback
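The reduced guessing factor can be made concrete with a binomial calculation. The sketch below compares blind guessing on true/false items with guessing on five-option items; the 10-item test and the 7-correct passing mark are assumptions chosen only for illustration.

```python
from math import comb


def prob_at_least(n_items, n_options, min_correct):
    """Probability of getting at least min_correct items right by blind guessing."""
    p = 1 / n_options  # chance of guessing any single item correctly
    return sum(
        comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
        for k in range(min_correct, n_items + 1)
    )


# Hypothetical 10-item test where 7 correct answers are needed to pass.
true_false = prob_at_least(10, 2, 7)
five_option = prob_at_least(10, 5, 7)
print(f"T/F: {true_false:.4f}  five-option: {five_option:.6f}")
```

Under these assumptions, a pure guesser passes the true/false version roughly one time in six, but almost never passes the five-option version.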
Multiple-Choice Questions: Cons
Are difficult and time consuming to
construct
Tend to favor simple recall of facts
Place a high degree of dependence on the
student's reading ability and instructor's
writing ability
One Best Answer Items: “Do”
Focus items on important concepts or problems
Gear items toward assessing application of
knowledge, not recall of an isolated fact
Write item stem to pose a clear question; one
should be able to answer an item with the
options (foils) covered
Make all options (foils) homogeneous
Construct items that require examinees to
compare the relative correctness of options
e.g., “Which of the following Xs is most likely to result in Y?”
One Best Answer Items: “Don’ts”
The following items are unfocused and have
heterogeneous options:
“Which of the following statements is correct?”
“Each of the following statements is correct
EXCEPT:”
Do not include additional irrelevant data
(e.g., unnecessary background, etc.)
Avoid items that pose irrelevant difficulty
Do not write items that allow options
to be eliminated in a T/F fashion
Avoid items containing vague
references to time (e.g., frequently,
usually, rarely, etc.)
Single Best Answer Items: Common Flaws
In poorly constructed single best answer items, the
correct response will probably be:
The longest response
The most qualified or detailed response
The response without spelling or grammatical errors
Neither of two responses that mean the same thing
A middle numerical value (not an extreme value)
The response which grammatically fits with the stem
One of two responses which are the opposite of each other
Source: Camp, MG. Maximizing your Score on Multiple-Choice Exams.
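Some of these flaws can be screened for automatically before an exam is assembled. The sketch below checks a single item for three cues discussed in this session; the function name, the cue wording, and the choice of D as the keyed answer for the surf music item are assumptions.

```python
import re

ABSOLUTES = re.compile(r"\b(all|always|never|none)\b", re.IGNORECASE)


def find_cues(options, answer_index):
    """Return a list of test-wiseness cues detected in one item.

    options: list of option strings; answer_index: position of the keyed answer.
    """
    cues = []
    lengths = [len(o) for o in options]
    longest = max(lengths)
    # Keyed answer is strictly longer than every distractor.
    if lengths[answer_index] == longest and lengths.count(longest) == 1:
        cues.append("keyed answer is the longest option")
    # Absolute terms appear in distractors but not in the keyed answer.
    flagged = [i for i, o in enumerate(options) if ABSOLUTES.search(o)]
    if flagged and answer_index not in flagged:
        cues.append("absolute terms appear only in distractors")
    if any(o.strip().lower() == "all of the above" for o in options):
        cues.append("uses 'all of the above'")
    return cues


# The surf music item from Exercise #1, with D assumed to be the keyed answer.
surf_options = [
    "All surf songs were instrumental",
    "There were never drums in surf songs",
    "All surf songs were produced by the Beach Boys or by Jan & Dean",
    "Many surf songs consisted of a twangy guitar sound and higher range vocals",
    "All surf music had female backup vocals and a reggae beat",
]
print(find_cues(surf_options, 3))
```

A screen like this only catches surface cues; it does not replace peer review of item content.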
Layout: Stem and Options
Less Desirable Item Model
More Desirable Item Model
Group Exercise #2:
Critiquing Test Items
Item #1
A 58-year-old man with a history of heavy
alcohol use and previous psychiatric
hospitalization is confused and agitated. He
speaks of experiencing the world as unreal.
This symptom is called:
A. de-personalization
B. signal anxiety
C. de-realization
D. focal memory deficit
E. derailment
Option C repeats clue in stem;
responses not alphabetized.
Item #2
Following a second episode of infection,
what is the likelihood that a woman is
infertile?
A. Less than 20%
B. 20% to 30%
C. Greater than 50%
D. 90%
E. 75%
Overlapping numerical ranges (C, D, and E)
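Overlap among numerical options is easy to check mechanically. In the sketch below, each option from Item #2 is recoded as an inclusive range of whole percentages; that recoding (for example, "Less than 20%" as 0 to 19) and the function name are assumptions made for illustration.

```python
from itertools import combinations


def overlapping_pairs(ranges):
    """ranges: dict mapping option letter to an inclusive (low, high) pair.

    Returns every pair of options whose ranges share at least one value.
    """
    return [
        (a, b)
        for (a, ra), (b, rb) in combinations(ranges.items(), 2)
        if ra[0] <= rb[1] and rb[0] <= ra[1]
    ]


# Item #2 options recoded as whole-number percentage ranges (assumed coding).
item2 = {
    "A": (0, 19),    # Less than 20%
    "B": (20, 30),   # 20% to 30%
    "C": (51, 100),  # Greater than 50%
    "D": (90, 90),   # 90%
    "E": (75, 75),   # 75%
}
print(overlapping_pairs(item2))  # [('C', 'D'), ('C', 'E')]
```

Both D and E fall inside C, so an examinee who believes the answer is 75% or 90% cannot tell which option is keyed.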
Item #3
Severe obesity in early adolescence:
A. usually responds dramatically to dietary
regimens
B. often is related to endocrine disorders
C. always has a 75% chance of clearing
spontaneously
D. never shows a poor prognosis
E. usually responds to pharmacotherapy and
intensive psychotherapy
Subjective qualifiers in all options;
responses not alphabetized
Item #4
Arrange the parents of the following children with Down’s
syndrome in order of highest to lowest risk of recurrence.
Assume that the maternal age in all cases is 22 years and that a
subsequent pregnancy occurs within 5 years. The karyotypes of
the daughters are:
I. 46, XX, -14, +T (14q21q) pat
II. 46, XX, -14, +T (14q21q) de novo
III. 46, XX, -14, +T (14q21q) mat
IV. 46, XX, -21, +T (14q21q) pat
V. 47, XX, -21, +T (21q21q) (parents not karyotyped)
A. III, IV, I, V, II
B. IV, III, V, I, II
C. III, I, IV, V, II
D. IV, III, I, V, II
E. III, IV, I, II, V
Complex K-type, confusing, unacceptable item type
Item #5
Peer review committees in HMOs may move to take action
against a physician’s credentials to care for participants of the
HMO. There is an associated requirement to assure that the
physician receives due process in the course of these
activities. Due process must include which of the following:
A. Notice, an impartial forum, a chance to hear and confront
evidence against him/her
B. Proper notice, a tribunal empowered to make the decision, a
chance to confront witnesses against him/her, and a chance to
present evidence in defense
C. Reasonable and timely notice, impartial panel empowered to
make a decision, a chance to hear evidence against
himself/herself and to confront witnesses, and the ability to
present evidence in defense.
C is the most detailed and lengthy option
Item #6
Local anesthetics are most effective in the:
A. anionic form, acting from inside the nerve
membrane
B. cationic form, acting from inside the nerve
membrane
C. cationic form, acting from outside the nerve
membrane
D. uncharged form, acting from inside the nerve
membrane
E. uncharged form, acting from outside the nerve
membrane
Repetitious form types and use of opposites
Item #7
In patients with advanced dementia,
Alzheimer’s type, the memory defect:
A. can be treated adequately with
phosphatidylcholine (lecithin)
B. possibly involves the cholinergic system
C. is never seen in patients with neurofibrillary
tangles at autopsy
D. could be a sequela of early Parkinsonianism
E. is never severe
Option B not grammatically correct; use of absolutes in C and E;
responses not alphabetized
Item #8
Secondary gain is:
A. synonymous with malingering
B. a frequent problem in obsessive-compulsive disorder
C. a complication of a variety of illnesses
and tends to prolong many of them
D. never seen in organic brain damage
Unfocused stem; responses not alphabetized
Item Templates: Vignettes
The patient vignettes may include some or all of
the following components:
Age, Gender (e.g., 45-year-old man)
Site of Care (e.g., comes to the emergency room)
Presenting Complaint (e.g., because of a headache)
Duration (e.g., that has continued for 2 days)
Patient History (with Family History)
Physical Findings
+/- Results of Diagnostic Studies
+/- Initial Treatment, Subsequent Findings, etc.
Current/prior medications
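For item banking, the template components can be stored as separate fields and assembled into a stem. The sketch below is one possible layout, not a prescribed format: the class name, field names, and the default lead-in question are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Vignette:
    age_gender: str
    site_of_care: str
    complaint: str
    duration: str
    history: str = ""
    findings: str = ""
    # Hypothetical default lead-in; replace with the question being asked.
    lead_in: str = "Which of the following is the most likely diagnosis?"

    def stem(self):
        # The opening sentence hardcodes the article "A" for simplicity.
        opening = (
            f"A {self.age_gender} {self.site_of_care} "
            f"{self.complaint} {self.duration}."
        )
        parts = [opening, self.history, self.findings, self.lead_in]
        return " ".join(p for p in parts if p)  # skip empty components


example = Vignette(
    age_gender="45-year-old man",
    site_of_care="comes to the emergency room",
    complaint="because of a headache",
    duration="that has continued for 2 days",
)
print(example.stem())
```

Keeping the components separate makes it easier to check that each vignette follows the same order and that no component has been left out by accident.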
After the Exam
When appropriate, classify items into
multiple “subtests”
Review item analysis with students &
faculty
Provide feedback to each student
Revise (if necessary) and retain quality
test questions in a secure item bank
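An item analysis typically reports, for each item, its difficulty (the proportion of examinees answering correctly) and a discrimination index. The sketch below computes both from 0/1 scored responses, using the point-biserial correlation between each item and the total score on the remaining items; the function name and the response data are hypothetical, and commercial item-banking software may use slightly different definitions.

```python
from statistics import mean, pstdev


def item_statistics(responses):
    """responses: list of examinee rows of 0/1 item scores.

    Returns one (difficulty, discrimination) pair per item.
    """
    n_items = len(responses[0])
    stats = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        rest = [sum(row) - row[i] for row in responses]  # score on other items
        difficulty = mean(item)
        sd_item, sd_rest = pstdev(item), pstdev(rest)
        if sd_item == 0 or sd_rest == 0:
            discrimination = None  # undefined when there is no variation
        else:
            cov = mean(x * y for x, y in zip(item, rest)) - difficulty * mean(rest)
            discrimination = cov / (sd_item * sd_rest)
        stats.append((difficulty, discrimination))
    return stats


# Hypothetical responses: 4 examinees (rows) by 3 items (columns).
demo_responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
for number, (p, r) in enumerate(item_statistics(demo_responses), start=1):
    print(f"Item {number}: difficulty={p:.2f}, discrimination={r:.2f}")
```

Items with very low or negative discrimination are the usual candidates for revision before they go back into the bank.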
Importance of Feedback
Feedback: Information on examinees’
past performance intended to guide their
future performance
Reports should be simple, meaningful, and
easy-to-interpret
Reports should help examinees identify
strengths & weaknesses
Security issue: return exams or not?
Pro: old exams guide learning
Con: threatens exam security
Sample Student Test Report
Item Analysis Report
Item Quality Graph
Measurement Error Chart
Sample of LXR Item Bank
Sample of LXR Exam Summary
Our Website
Student Assessment & Program Evaluation
Office of Medical Education
http://www.mc.uky.edu/meded/tande/index.asp
Thank you!!