Personality Research and IRT

Personality/Psychopathology Measurement and IRT:
promising opportunities
Rob Meijer
Personality Assessment
• Diagnosis of personality and personality disorders requires
an evaluation of the individual
• Self-report and peer-report questionnaires are often used
to determine personality traits
• Contexts: health care, clinical psychology, personnel
selection and development
Topic
• How can item response theory (IRT) improve
understanding of personality questionnaires?
• Discuss several applications and what I have learned from my
cooperation with clinical, personality, and I/O
psychologists
• Too few research projects convincingly communicate
the relative superiority of the IRT approach in
the personality domain
Topic
• IRT is applied mostly in educational and cognitive measurement
• The purpose of cognitive assessment is precise and valid
scaling of individual differences
• In (applied) personality assessment the purpose is test score
interpretation and prediction of wide-ranging behavior
• In cognitive assessment the items are sampled from a large
domain (e.g., spelling); in personality assessment many domains
are restricted: there are only a limited number of
indicators of, e.g., social introversion, friendliness, or
narcissism
Topic
• When IRT is transported from cognitive abilities into
typical performance assessment, special issues and
problems arise
• E.g., limited indicators (items), underlying distribution not
normal
Personality Assessment
• In 2002 and 2003, 20 of 39 research articles in JEM and 32
of 52 in APM involved IRT
• 2 out of 122 in Journal of Personality Assessment and 6 of
106 articles in Psychological Assessment included IRT
• Partly due to different psychometrics prevalent in the two
fields
CTT
CTT scale construction: item difficulty, item
discrimination, and reliability
Drawback: reliability and SEM are constant for all
respondents
$$\mathrm{SEM} = S_x \sqrt{1 - r_{xx}}$$
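As a rough numerical illustration of the CTT formula SEM = S_x * sqrt(1 - r_xx), the Python sketch below computes the SEM for a hypothetical scale. The function name `ctt_sem` and the values SD = 10 and r_xx = .84 are assumptions chosen for illustration, not values from the presentation.

```python
import math


def ctt_sem(sd_x: float, reliability: float) -> float:
    """Classical standard error of measurement: S_x * sqrt(1 - r_xx)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return sd_x * math.sqrt(1.0 - reliability)


# Hypothetical scale (assumed values): SD = 10, reliability = .84
print(round(ctt_sem(10.0, 0.84), 2))
```

Because S_x and r_xx are computed for the whole sample, the same SEM is attached to every respondent, whatever his or her score.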
IRT
• IRT assumes that a person has a true location on a continuous
latent dimension (theta). Theta is assumed to probabilistically cause
how a person responds to an item
• The equation that relates theta to the probability of endorsing an item is
the item response function (IRF), here for dichotomous item scores
$$P(\theta) = \frac{\exp[a(\theta - b)]}{1 + \exp[a(\theta - b)]}$$
IRT
• Item difficulty (b) is the point on the latent variable scale
at which the probability of endorsing the item equals .50
• Item discrimination (a) is proportional to the slope of the
IRF
• Important feature: IRT estimates the joint relation between
person properties and item properties
• a usually between [.5, 2.5]
• b usually between -2.5 (easy) and +2.5 (difficult)
[Figure: Item Response Functions (IRF) for a = .5 and a = 1.5. x-axis: Theta (−4 to 4); y-axis: Probability (0.0 to 1.0)]
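The item response function for dichotomous items, P(theta) = exp[a(theta - b)] / (1 + exp[a(theta - b)]), can be sketched in a few lines of Python. The function name `irf` and the theta grid are illustrative assumptions; the slopes .5 and 1.5 are the two values shown in the IRF figure.

```python
import math


def irf(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic IRF: probability of endorsing the item."""
    # 1 / (1 + exp(-z)) is algebraically identical to exp(z) / (1 + exp(z)).
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Compare a weakly (a = .5) and a strongly (a = 1.5) discriminating item,
# both located at b = 0.
for theta in (-4, -2, 0, 2, 4):
    print(theta, round(irf(theta, 0.5, 0.0), 3), round(irf(theta, 1.5, 0.0), 3))
```

At theta = b both items give a probability of .50; away from b the item with the larger a moves toward 0 or 1 more quickly, which is what the steeper curve in the figure shows.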
IRT
Assumptions:
(1) Unidimensionality
(2) Monotonic relation between trait level and probability of
endorsing an item
Statistical evaluation of model-to-data goodness of fit
Item and scale analysis
• CTT: item discrimination, item difficulty, reliability
• IRT: item analysis is done in a similar way but item
discrimination, difficulty, and reliability are examined in a
more powerful way
• Instead of test reliability, item information plays an
important role
Item and Test information
• Information indicates how well an item discriminates
among respondents who are at different levels of the latent
variable
• Items provide different amounts of information at different
ranges of the latent variable
• (1) Item information is additive across items: test
information function
• (2) information is inversely related to the SEM
Item and scale analysis
Item and Test information
• The amount of information an item provides is determined
by the item discrimination
• The location on the latent trait where information is
maximized is determined by the item difficulty
[Figure: Item Information for a = .5 and a = 1.5. x-axis: Theta (−4 to 4); y-axis: Information (0.0 to 0.6)]
[Figure: Item Information for b = 1 and b = −1. x-axis: Theta (−4 to 4); y-axis: Information (0.0 to 0.6)]
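For the dichotomous logistic model used here, item information has the standard closed form I(theta) = a^2 * P(theta) * (1 - P(theta)). The sketch below uses that form to illustrate the points made on the information slides: information peaks at theta = b, its height grows with a, item information adds up to scale information, and the standard error at a given theta is 1 / sqrt(information). The function names and the three-item scale are hypothetical.

```python
import math


def irf(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic item response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of one dichotomous item at theta."""
    p = irf(theta, a, b)
    return a * a * p * (1.0 - p)


def scale_information(theta: float, items) -> float:
    """Test information: the sum of the item information functions."""
    return sum(item_information(theta, a, b) for a, b in items)


def conditional_sem(theta: float, items) -> float:
    """Standard error of measurement at theta: 1 / sqrt(information)."""
    return 1.0 / math.sqrt(scale_information(theta, items))


print(item_information(0.0, 1.5, 0.0))  # a^2 / 4 = 0.5625 at theta = b
print(item_information(0.0, 0.5, 0.0))  # a^2 / 4 = 0.0625 at theta = b

# Hypothetical three-item scale given as (a, b) pairs.
items = [(1.5, -1.0), (1.5, 1.0), (0.5, 0.0)]
for theta in (-3, -1, 0, 1, 3):
    print(theta, round(conditional_sem(theta, items), 2))
```

Unlike the single CTT value, the standard error here differs across theta: it is smallest where the items are located and grows toward the extremes.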
Polytomous scores
• Graded response model (GRM) for Likert data
• a-parameter: its magnitude reflects the degree to which the
item is related to the trait
• Two or more location parameters, b1, b2, ...
(their number equals the number of response categories minus one);
they reflect the spacing of the response categories along the
trait scale
• Thus for m = 5 answer categories there are four location
parameters: b1, b2, b3, and b4
Depression items (a = slope parameter; b1 to b4 = location parameters)
Item    a       b1      b2      b3      b4
1       1.07    −1.30   −0.45   0.45    1.70
2       2.30    −0.01   0.45    1.90    2.70
3       1.60    −1.60   −0.75   0.16    0.41
Item 2: I have recently considered killing myself
Item 3: I am sometimes down in the dumps
Option response curves
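Option response curves under the graded response model can be sketched as differences between adjacent boundary curves, where each boundary curve is a logistic function with the item's slope a and one of its location parameters. In the sketch below the parameters of depression items 1 and 2 are taken from the table above; the function names and the 0-to-4 category coding are assumptions made for illustration.

```python
import math


def boundary_prob(theta: float, a: float, b: float) -> float:
    """Probability of responding in or above the category bounded by b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def grm_category_probs(theta: float, a: float, bs) -> list:
    """Category probabilities for one GRM item; bs must be increasing."""
    # Cumulative curves, padded with 1 (lowest) and 0 (above the highest).
    cum = [1.0] + [boundary_prob(theta, a, b) for b in bs] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(bs) + 1)]


item1 = (1.07, [-1.30, -0.45, 0.45, 1.70])  # depression item 1
item2 = (2.30, [-0.01, 0.45, 1.90, 2.70])   # depression item 2

for theta in (-2.0, 0.0, 2.0):
    p1 = [round(p, 2) for p in grm_category_probs(theta, *item1)]
    p2 = [round(p, 2) for p in grm_category_probs(theta, *item2)]
    print(theta, p1, p2)
```

With m = 5 categories each item yields five probabilities that sum to one at every theta; plotting them against theta gives the option response curves.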
Example 1: Construct validity clinical scales
• Can we use scales as a diagnostic instrument to classify
persons into different categories?
• In clinical psychology/psychiatry many rating scales are
constructed so that they cover DSM-IV(-TR) categories. On
the basis of the scale score a person is classified into
categories such as no, mild, or severe mental illness
• Because diagnostic criteria influence how psychiatric
disorders are recognized, researched and treated it is very
important to ensure their empirical validity
Practical Features
• Clinical change: to measure the degree of change within
the individual, the scale should discriminate in the
trait range of interest
• The scale should discriminate around the cut-off scores
• Diagnostic Interview-Expanded Substance Scale
• IRT analysis to investigate the quality of the scale
• Can the scale be used as a diagnostic instrument?
Alcohol use disorder (Langenbucher et al., 2004)
Cocaine use disorder
Conclusion
• Dense clustering of symptom item response functions
implies that a number of criteria (items) of substance abuse
carry the same information
• Measurement precision in only a narrow trait range
• Trichotomous diagnostic scheme of the DSM-IV
(undiagnosed, dependence, abuse) is not supported, only
impaired/less impaired can be distinguished
Conclusion
• Additional severe criteria (items) are needed to reliably and
broadly identify serious degrees of addictive pathology
• Additional mild criteria for screening and prevention and
establishing base rates (epidemiology)
• But are constructs fully continuous? And can we find
measures (items) across the entire range?
Quasi-traits
• Researchers often assume that all constructs are fully
continuous, defined at both ends of the construct
• IRT modeling shows that many personality constructs used
in clinical scales (psychopathology) are highly skewed or
quasi-traits
• For example, self-esteem
Quasi-traits
• One explanation is that this is not due to poor items or
options but due to the nature of the self-esteem construct;
items only differentiate between people with low self-esteem
because this is the only end of the construct that is
meaningful
• Future research should clarify whether we can write items
that also discriminate at the medium levels of the latent
trait
Example 2: Type D personality
• What is the effect of narrow band constructs combined
with limited item pools on the construct validity of our
scales?
• When only a few items have high slopes and the remainder
have low slopes care should be taken in interpreting the
latent trait.
Context
• Influence of psychological factors on health, illness, and
death
• Psychosomatic research on cardiac disease needs to
include personality
• Distress as a risk factor
• High levels of distress are linked to anxiety, stress, and
anger → vital exhaustion
• DS-14: 7 items Negative Affect + 7 items Social
Inhibition
• Type D: score above the median on both scales; increased risk
Example 2: Type D
• Negative Affect (NA): tendency to experience aversive
emotional states with feelings of dysphoria, tension, and
worry (α = .88; factor loadings .6-.8)
• Social Inhibition (SI): tendency to inhibit self-expression in social
interactions in order to avoid disapproval by others (α =
.86; factor loadings .6-.8) (Emons, Meijer, & Denollet, 2006)
Negative affectivity
Item    Content                             Position in DS14    Lower-level construct
NA1     Worries about unimportant things    2                   Anxious apprehension
NA2     Often feels unhappy                 4                   Dysphoria
NA3     Is easily irritated                 5                   Irritability
NA4     Takes gloomy view of things         7                   Dysphoria
NA5     Is often in a bad mood              9                   Irritability
NA6     Often worries about something       12                  Anxious apprehension
NA7     Is often down in the dumps          13                  Dysphoria
Negative affectivity (a = slope parameter; b1 to b4 = location parameters)
Item    a       b1      b2      b3      b4
Anx1    1.07    −1.30   −0.45   0.45    1.70
Dys1    2.36    −0.25   0.51    1.29    2.15
Irit1   1.45    −1.37   −0.39   0.81    2.18
Dys2    2.76    −0.43   0.32    1.16    2.05
Irit2   1.30    −0.43   0.55    1.81    2.70
Anx2    1.60    −1.58   −0.70   0.16    1.41
Dys3    3.61    −0.30   0.38    1.10    1.78
Example 2: Type D
• Variable pattern of slopes may be problematic
• (1) The dysphoria items NA7, NA4, and NA2 dominate
the construct; the remaining items are less important
• (2) A practitioner should be very careful in interpreting
the underlying construct: NA is mainly dysphoria, and in
particular “I am often down in the dumps”
• (3) The latent trait does not reflect variance on a common
latent variable shared by all items, but reflects individual
differences on the items with the highest slopes
[Figure: item information functions in four panels: NA3 “I am often irritated”; NA1 “I often make a fuss about unimportant things”; NA7 “I am often down in the dumps”; NA4 “I take a gloomy view of things”. x-axis: Scale Score (−3 to 3); y-axis: Information (0 to 5)]
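How strongly the high-slope dysphoria items dominate the scale can be illustrated by computing graded response model item information from the Negative Affectivity parameters in the table above. The sketch uses the standard Fisher information for a polytomous item, the sum over categories of (dP_k/dtheta)^2 / P_k. The function names and the evaluation point theta = 0.5 are assumptions; this is not the analysis code behind the reported results.

```python
import math


def boundary_prob(theta: float, a: float, b: float) -> float:
    """Probability of responding in or above the category bounded by b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def grm_item_information(theta: float, a: float, bs) -> float:
    """Fisher information of one GRM item: sum_k (dP_k/dtheta)^2 / P_k."""
    cum = [1.0] + [boundary_prob(theta, a, b) for b in bs] + [0.0]
    # Derivative of each cumulative curve; it is 0 for the padded ends.
    dcum = [a * p * (1.0 - p) for p in cum]
    info = 0.0
    for k in range(len(bs) + 1):
        p_k = cum[k] - cum[k + 1]
        dp_k = dcum[k] - dcum[k + 1]
        if p_k > 0.0:
            info += dp_k * dp_k / p_k
    return info


# Slope and location parameters copied from the Negative Affectivity table.
na_items = {
    "Anx1": (1.07, [-1.30, -0.45, 0.45, 1.70]),
    "Dys1": (2.36, [-0.25, 0.51, 1.29, 2.15]),
    "Irit1": (1.45, [-1.37, -0.39, 0.81, 2.18]),
    "Dys2": (2.76, [-0.43, 0.32, 1.16, 2.05]),
    "Irit2": (1.30, [-0.43, 0.55, 1.81, 2.70]),
    "Anx2": (1.60, [-1.58, -0.70, 0.16, 1.41]),
    "Dys3": (3.61, [-0.30, 0.38, 1.10, 1.78]),
}

theta = 0.5  # assumed evaluation point
info = {name: grm_item_information(theta, a, bs)
        for name, (a, bs) in na_items.items()}
total = sum(info.values())
for name, value in info.items():
    print(name, round(value, 2), f"{100 * value / total:.0f}%")
```

The printed shares show how much of the scale information at this trait level comes from each item, which makes the dominance of the dysphoria items concrete.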
Example 3: Validity of test scores
• Test score validity: validity scales, e.g., the F-scale in the MMPI,
consist of items rarely endorsed in the normal population; high
scores invalidate the interpretation of the MMPI
• Can we identify and interpret invalid test scores by
studying the configuration of individual item scores by
means of fit statistics proposed in the context of
item response theory (IRT)? (Meijer, Egberink, Emons, & Sijtsma, 2008)
Context
• On the basis of an IRT model, observed and expected item
scores can be compared; many unexpected item scores
alert the researcher that the total score may not adequately
reflect the trait being measured
• Gap between psychometric characteristics of several
statistical tests and measures on the one hand and the
articles that describe the practical usefulness of these
measures on the other hand.
Context
• Try to integrate psychometric analysis with information
from qualitative sources to make judgments about the
validity of an individual’s test score. And replication!
1. Explore the usefulness of person-fit statistics to identify
invalid test scores using real data, and
2. Validate information obtained from IRT using personality
theory and qualitative data obtained from observation and
interviews
Rationale of the method
• When measuring, e.g., suicidal ideation in depression, every
person who endorses the statement “I have recently considered
killing myself” is expected to also endorse the statement
“I don’t seem to care what happens to me” (relative to the
previous item, this item is less extreme, or more popular)
• However, in practice, when analyzing personality data,
“errors” against this perfect pattern are found
• Many errors may point to invalid person scaling
Fit statistics
• 0100100000000001001011001010010100001100   X+ = 12
• 1001000010110010111111000000000000000000   X+ = 12
• 0101110111001010001011110001011111000000   X+ = 20
• 1111110111111111101101000100000000000000   X+ = 20
• Many person-fit statistics exist; we used several statistical tests,
among them normed Guttman errors (ZGE)
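The idea behind counting Guttman errors can be sketched for the four dichotomous patterns shown on this slide. The sketch assumes the 40 items are ordered from most to least popular, which the slide does not state. A Guttman error is then any item pair in which the more popular item is scored 0 and the less popular item is scored 1. Dividing by the maximum possible count for the given total score X+ gives a simple normed value. This is an illustration only, not the ZGE statistic used in the study, which is defined for polytomous item scores.

```python
def guttman_errors(pattern) -> int:
    """Count (0, 1) pairs in which the 0 precedes the 1."""
    errors = 0
    zeros_seen = 0
    for score in pattern:
        if score == 0:
            zeros_seen += 1
        else:
            # Every earlier 0 forms one error with this 1.
            errors += zeros_seen
    return errors


def normed_guttman_errors(pattern) -> float:
    """Guttman errors divided by the maximum possible given X+."""
    n_items = len(pattern)
    total = sum(pattern)
    max_errors = total * (n_items - total)
    return guttman_errors(pattern) / max_errors if max_errors else 0.0


# The four response patterns from the slide (assumed easy-to-hard order).
raw = [
    "0100100000000001001011001010010100001100",
    "1001000010110010111111000000000000000000",
    "0101110111001010001011110001011111000000",
    "1111110111111111101101000100000000000000",
]
patterns = [[int(c) for c in s] for s in raw]

for p in patterns:
    print(sum(p), guttman_errors(p), round(normed_guttman_errors(p), 2))
```

Patterns with the same total score X+ can differ greatly in the number of errors, which is the information a person-fit statistic adds to the total score.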
Data
• Harter’s Self-Perception Profile for Children (SPPC),
polytomous item scores (4-point scale)
• Intended to determine how children between 8 and 12
years of age judge their own functioning in several specific
domains and how they judge their global self-worth
• 6 subscales each consisting of 6 items:
Scholastic Competence (SC), Social Acceptance (SA),
Athletic Competence (AC), Physical Appearance (PA),
Behavioral Conduct (BC), Global Self-worth (GS)
Procedure
• 611 children between 6 and 12 years of age
• Inspection of model fit
• Calculation of person-fit statistics
• Interviewing teachers and children, and observation of
children
• Re-administration of the SPPC
Results
• In general, young children (8/9 years of age) responded less
consistently than older children
• Asking children to select personality statements that better
describe them may be relatively complex, especially for
young children
• They need to understand the meaning of these statements
and need a frame of reference similar to that of
older children. We observed that the
meaning of some items was problematic, and that
inconsistent answering behavior was often due to a learning
disability
Results
• Older children chose categories 2 and 3 more often
than younger children did
• Older girls chose options 2 and 3 more often than older
boys did. We speculate that these shifts point to a more
differentiated self-concept in older children than in
young children, and to a more differentiated self-concept
in girls than in boys
Profiles
• Similar profiles with different item score patterns
[Figure: score profiles of children no. 275, no. 94, and no. 242. x-axis: SPPC Domains (SC, SA, AC, PA, BC, GS); y-axis: Score (0 to 100)]
Profiles
• Child 275: very inconsistent item score pattern
(SC: 422124, SA: 444414, AC: 411444, PA: 313414,
BC: 124443, GS: 344143)
• Child 94: consistent
(SC: 112422, SA: 443423, AC: 444322, PA: 222242,
BC: 333333, GS: 424233)
• Child 242: consistent
(SC: 222232, SA: 443333, AC: 333343, PA: 322233,
BC: 433223, GS: 343333)
Re-administration
• As expected, the ZGE scores collected at the second
administration were lower than the ZGE scores collected at
the first administration.
• 8 out of the 27 children again produced irregular item
score patterns
• For 4 children this was due to cognitive problems: learning
disability, problems with reading comprehension skills
and/or lexical processing speed.
• For 2 other children this may be due to the home situation:
they come from troubled homes, have difficult relations
with their parents and, perhaps as a result of this, are
very insecure.
Conclusions
• In clinical practice and applied research, the fundamental
question often is not whether unexpected item score
patterns exist but whether the patterns have any theoretical
or applied validity
• Because nothing in a (statistical) fit procedure guarantees
that identified patterns have associations with external
criteria or diagnostic categories it is important to use
information from other sources. Thus, one may combine
information from fit statistics with information obtained
from other subtest scores (score profiles), interviews,
and/or observation.