Analyzing Survey Error with Latent Class Models
Paul Biemer
RTI International and University of North Carolina
March 18, 2005
What is Latent Class Analysis?
Special case of log-linear analysis with latent variables
Latent variables are constructs which are measured imperfectly by
indicator variables
Traditional LCA assumes local independence
i.e., P(A and B|X) = P(A|X)P(B|X) for latent variable X and
indicators A and B
LCA models contain
Structural component – describes relationship among latent
variables and covariates
Measurement component – describes the relationship among
the indicators, latent variables and covariates
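As a small illustration of the local independence assumption, the sketch below builds the joint distribution of two indicators within a single latent class as the product of their conditional distributions. The probability values are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical conditional distributions within one latent class x.
p_a_given_x = {"yes": 0.9, "no": 0.1}   # P(A = a | X = x), assumed values
p_b_given_x = {"yes": 0.8, "no": 0.2}   # P(B = b | X = x), assumed values

# Local independence: P(A = a, B = b | X = x) = P(A = a | X = x) * P(B = b | X = x)
joint = {
    (a, b): p_a_given_x[a] * p_b_given_x[b]
    for a, b in product(p_a_given_x, p_b_given_x)
}

for cell, prob in joint.items():
    print(cell, round(prob, 4))
```

Within a class, knowing the answer to A tells us nothing further about B; any association between A and B in the observed data is attributed to the latent variable.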
Uses of Latent Class Analysis
in Survey Research
Substantive researchers focus on the structural component of the
LCM
Errors treated as nuisance parameters
Survey methods researchers focus on the measurement component
Estimate components of total survey error
Evaluation of questionnaires and alternative survey designs
Population size estimation
Compensation for missing data
Survey bias adjustment
Objective of LCA for Measurement
Error Analysis
Obtain estimates of classification error for a categorical survey
variable
For example, false positive and false negative error rates
Why are these LCA estimates useful?
Quantify the measurement error in the data
Identify the correlates of measurement error
Trace error to its root causes
Eliminate the cause through redesign
Example – Estimating the Error in
Survey Measurements of Marijuana Use
Three Indicators of Marijuana Use
Indicator A - How long has it been since you last used marijuana
or hashish?
A = “Yes” if the response indicates use in the last 12 months
A = “No” otherwise
Indicator B - Now think about the past 12 months from your 12-month reference date through today. On how many days in the
past 12 months did you use marijuana or hashish?
B = “Yes” if response is 1 or more days;
B = “No” otherwise
Indicator C – a composite variable based upon 7 questions such
as
used in last 12 months?
spent a great deal of time getting it, using it, or getting over its
effects?
used drug much more often or in larger amounts than intended?
C = “Yes” if response is positive to any question suggesting use in last
12 months
C = “No” otherwise
Statistical Framework
NOTATION
X = true drug use status (1 if use, 2 if no use); an unknown latent variable
A, B, and C are 3 dichotomous indicators of X
$$P(A=a, B=b, C=c) = \sum_x P(X=x)\,P(A=a \mid X=x)\,P(B=b \mid X=x)\,P(C=c \mid X=x)$$
or
$$\pi_{abc} = \sum_x \pi_x \, \pi_{a|x} \, \pi_{b|x} \, \pi_{c|x}$$
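The model equation can be evaluated directly. The sketch below computes the eight cell probabilities of the A×B×C table from a set of parameters; the function name and all parameter values are hypothetical, chosen only to illustrate the formula.

```python
from itertools import product

def cell_probabilities(pi_x, pi_a, pi_b, pi_c):
    """Return {(a, b, c): pi_abc} for three dichotomous indicators.

    pi_x[x]    = P(X = x)
    pi_a[x][a] = P(A = a | X = x), and likewise for pi_b and pi_c.
    Categories are coded 1 (use / "yes") and 2 (no use / "no").
    """
    cells = {}
    for a, b, c in product((1, 2), repeat=3):
        # Sum over the latent classes of the product of the class size
        # and the three conditional response probabilities.
        cells[(a, b, c)] = sum(
            pi_x[x] * pi_a[x][a] * pi_b[x][b] * pi_c[x][c] for x in (1, 2)
        )
    return cells

# Hypothetical parameter values (not estimates from any survey).
pi_x = {1: 0.10, 2: 0.90}
pi_a = {1: {1: 0.92, 2: 0.08}, 2: {1: 0.01, 2: 0.99}}
pi_b = {1: {1: 0.98, 2: 0.02}, 2: {1: 0.01, 2: 0.99}}
pi_c = {1: {1: 0.93, 2: 0.07}, 2: {1: 0.02, 2: 0.98}}

pi_abc = cell_probabilities(pi_x, pi_a, pi_b, pi_c)
for cell in sorted(pi_abc):
    print(cell, round(pi_abc[cell], 6))
```

In this coding, pi_a[2][1] plays the role of a false positive rate for A and pi_a[1][2] the role of a false negative rate.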
Log-linear Formulation of the Latent
Class Model
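$$\pi_{abc} = \sum_x \pi_{abcx} = \sum_x \pi_x \, \pi_{a|x} \, \pi_{b|x} \, \pi_{c|x}$$
is equivalent to
$$\log m_{abcx} = u + u_x^{X} + u_a^{A} + u_b^{B} + u_c^{C} + u_{ax}^{AX} + u_{bx}^{BX} + u_{cx}^{CX}$$
in which
$$m_{abcx} = n\,\pi_{abcx}$$
i.e., hierarchical LLM {AX BX CX}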
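One way to see the equivalence is to factor the log-linear expected counts and sum over the other indicators. The short derivation below assumes only the hierarchical model {AX BX CX}; the "+" subscript for summation over an index is notation introduced here for illustration.

```latex
m_{abcx} = e^{u + u_x^{X}} \; e^{u_a^{A} + u_{ax}^{AX}} \; e^{u_b^{B} + u_{bx}^{BX}} \; e^{u_c^{C} + u_{cx}^{CX}}

\pi_x = \frac{m_{+++x}}{n}, \qquad
\pi_{a|x} = \frac{m_{a++x}}{m_{+++x}}
          = \frac{e^{u_a^{A} + u_{ax}^{AX}}}{\sum_{a'} e^{u_{a'}^{A} + u_{a'x}^{AX}}}

\text{(and similarly for } \pi_{b|x},\ \pi_{c|x}\text{), so that}\quad
\pi_{abcx} = \frac{m_{abcx}}{n} = \pi_x \, \pi_{a|x} \, \pi_{b|x} \, \pi_{c|x}
```

Because the model has no AB, AC, or BC interaction terms, each indicator's factor depends only on its own category and on x, which is exactly the local independence assumption.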
Estimation
Use MLE to obtain estimates of
$\pi_x$, $\pi_{a|x}$, $\pi_{b|x}$ and $\pi_{c|x}$
from the multinomial likelihood equation of the A×B×C
classification table
$$L(ABC) = C \prod_a \prod_b \prod_c (\pi_{abc})^{m_{abc}}$$
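Maximum likelihood estimates for latent class models are commonly computed with the EM algorithm. The sketch below is a minimal EM fit for two classes and three dichotomous indicators, not the procedure used for the results that follow. The function names, starting values, and the "true" parameters used to generate the demonstration table are all assumptions made for illustration.

```python
from itertools import product

CLASSES = (1, 2)   # latent classes: 1 = use, 2 = no use
CATS = (1, 2)      # indicator categories: 1 = "yes", 2 = "no"
CELLS = list(product(CATS, repeat=3))

def class_joint(pi_x, cond, cell, x):
    """pi_x * pi_{a|x} * pi_{b|x} * pi_{c|x} for one class and one cell."""
    return pi_x[x] * cond[0][x][cell[0]] * cond[1][x][cell[1]] * cond[2][x][cell[2]]

def fit_lca_em(counts, n_iter=2000):
    """EM for a 2-class LCA; counts maps (a, b, c) to a cell count."""
    n = sum(counts.values())
    # Assumed starting values: class 1 leans "yes", class 2 leans "no".
    pi_x = {1: 0.5, 2: 0.5}
    cond = [{1: {1: 0.8, 2: 0.2}, 2: {1: 0.2, 2: 0.8}} for _ in range(3)]
    for _ in range(n_iter):
        # E-step: posterior class membership P(X = x | a, b, c) per cell.
        post = {}
        for cell in CELLS:
            joint = {x: class_joint(pi_x, cond, cell, x) for x in CLASSES}
            total = sum(joint.values())
            post[cell] = {x: joint[x] / total for x in CLASSES}
        # M-step: re-estimate parameters from the expected class counts.
        class_n = {
            x: sum(counts[cell] * post[cell][x] for cell in CELLS) for x in CLASSES
        }
        pi_x = {x: class_n[x] / n for x in CLASSES}
        cond = [
            {
                x: {
                    k: sum(
                        counts[cell] * post[cell][x]
                        for cell in CELLS if cell[j] == k
                    ) / class_n[x]
                    for k in CATS
                }
                for x in CLASSES
            }
            for j in range(3)
        ]
    return pi_x, cond

# Demonstration table: expected counts under hypothetical parameters.
true_pi_x = {1: 0.10, 2: 0.90}
true_cond = [
    {1: {1: 0.92, 2: 0.08}, 2: {1: 0.01, 2: 0.99}},  # indicator A
    {1: {1: 0.98, 2: 0.02}, 2: {1: 0.01, 2: 0.99}},  # indicator B
    {1: {1: 0.93, 2: 0.07}, 2: {1: 0.02, 2: 0.98}},  # indicator C
]
counts = {
    cell: 10000 * sum(class_joint(true_pi_x, true_cond, cell, x) for x in CLASSES)
    for cell in CELLS
}

pi_hat, cond_hat = fit_lca_em(counts)
print("estimated P(X=1):", round(pi_hat[1], 4))
print("estimated P(A=2|X=1):", round(cond_hat[0][1][2], 4))
```

Note that with two classes and three dichotomous indicators the model has 7 free parameters and the 2×2×2 table has 7 degrees of freedom, so this basic model is just identified and leaves no degrees of freedom for testing fit. The sketch also uses a single set of starting values; in practice several starts are advisable.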
Some Results
(modeling details in Biemer and Wiesen, 2000)
LCA models were fit to three years of data from the National Survey
on Drug Use and Health
Several important anomalies were discovered in the estimates of
marijuana use
Low-frequency marijuana users tended to answer negatively to
the frequency question
Composite variable was subject to false positives as a result of a
questionnaire problem that was subsequently corrected
False Positive Error Rates Under Model 1 (in percent)
Indicator of Past Year Use    1994    1995    1996
P(A = 1|X=2)                  0.03    0.01    0.08
P(B = 1|X=2)                  0.73    0.78    0.84
P(C = 1|X=2)                  4.07    1.17    1.36
Estimates of False Negative Error Rates (in percent)
Indicator of Past Year Use    1994    1995    1996
P(A = 2|X=1)                  7.29    8.96    8.60
P(B = 2|X=1)                  1.17    0.90    1.39
P(C = 2|X=1)                  6.60    5.99    7.59
Frequency of Use for Persons Responding ‘No’ to A
Days used in past 12 months    Percent
More than 300 days             5.84
201 to 300 days                0.96
101 to 200 days                0.93
51 to 100 days                 1.45
25 to 50 days                  2.96
12 to 24 days                  4.76
6 to 11 days                   6.06
3 to 5 days                    18.41
1 to 2 days                    58.62
Other Applications
Nonsampling Error Research
Identifying flawed questions and other questionnaire
problems
Estimating census undercount in a capture-recapture
framework
Characterizing respondents, interviewers, and
questionnaire elements that contribute to survey error
Adjusting for nonresponse and missing data in
surveys
Other Applications (cont’d)
Substantive Research
Causal modeling
Log-linear analysis compensating for measurement
error
Cluster analysis
Variable reduction and scale construction
Importance of Model Validity Depends
Upon the Application
In the previous example, validity was “proven” by ability to identify
real questionnaire problems.
In other applications, this type of validation may be quite difficult.
Further, LCA methodology is being pushed to adjust the reported
survey estimates for misclassification bias.
Unemployment rate
Expenditures
Total population size in a census
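To make the idea of a misclassification bias adjustment concrete, the sketch below uses a simple moment-based correction for a dichotomous variable: the observed rate equals the true rate times (1 - false negative rate) plus the complement of the true rate times the false positive rate, and this relationship is inverted. This is one common textbook correction, not necessarily the method used in the applications listed; the function names and numbers are hypothetical.

```python
def observed_rate(true_rate, false_pos, false_neg):
    """Rate a survey would report, given the true rate and the error rates."""
    return true_rate * (1.0 - false_neg) + (1.0 - true_rate) * false_pos

def adjusted_rate(observed, false_pos, false_neg):
    """Invert observed_rate to recover the true rate from the reported one."""
    denom = 1.0 - false_pos - false_neg
    if denom <= 0:
        raise ValueError("error rates are too large to invert the relationship")
    return (observed - false_pos) / denom

# Hypothetical values: 10% true prevalence, 1% false positives, 8% false negatives.
reported = observed_rate(0.10, false_pos=0.01, false_neg=0.08)
corrected = adjusted_rate(reported, false_pos=0.01, false_neg=0.08)
print(round(reported, 4), round(corrected, 4))
```

The correction is only as good as the error rates fed into it, which is why the validity of the LCA estimates matters more in this use than in questionnaire diagnosis.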
Some Issues for Future Research
Investigating the Validity of LCA Estimates
Robustness of the estimates of classification error probabilities
to violations of the model assumptions
Local dependence
Unobserved heterogeneity
Dependent classification errors
Unequal probability sampling
Sample clustering
Some Issues for Future Research
(cont’d)
Robustness of the model fit statistics
L² and X²
Convergence problems
Local maxima
Boundary solutions
Bias in the estimated standard errors
Effects of weighting
Clustered samples
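For reference, the two fit statistics named above are the likelihood-ratio chi-square (L²) and the Pearson chi-square (X²), both comparing observed cell counts with model-expected counts. The sketch below computes them for a table stored as dictionaries; the function name and the example counts are hypothetical.

```python
import math

def fit_statistics(observed, expected):
    """Return (L2, X2) for dicts of observed and model-expected cell counts.

    L2 = 2 * sum n * log(n / m)      (likelihood-ratio chi-square)
    X2 = sum (n - m)^2 / m           (Pearson chi-square)
    Cells with an observed count of zero contribute nothing to L2.
    """
    l2 = 2.0 * sum(
        n * math.log(n / expected[cell]) for cell, n in observed.items() if n > 0
    )
    x2 = sum((n - expected[cell]) ** 2 / expected[cell] for cell, n in observed.items())
    return l2, x2

# Hypothetical two-cell example, used only to exercise the formulas.
observed = {"cell_1": 10, "cell_2": 20}
expected = {"cell_1": 15.0, "cell_2": 15.0}
l2, x2 = fit_statistics(observed, expected)
print(round(l2, 4), round(x2, 4))
```

Both statistics assume simple random sampling when referred to a chi-square distribution, which is why weighting and clustering raise the robustness questions listed on this slide.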
Some Recent Literature
Asparouhov, T. (2004). “Weighting for Unequal
Probability of Selection in Latent Variable Modeling,” Mplus Web
Notes: No. 7, Version 3, Muthén & Muthén
Patterson, B., Dayton, M., and Graubard, B. (2002). “Latent Class
Analysis of Complex Sample Survey Data: Application to Dietary
Data,” JASA, Vol. 97, No. 459, pp. 721-741
Vermunt, J. and Magidson, J. (2001). “Latent Class Analysis with
Sampling Weights,” presented at the Sixth Annual Meeting of the
Methodology Section of the American Sociological Association,
University of Minnesota
Biemer, P., Brown, G., and Judson, D. (2004). “Robustness of LCA
Estimates of Population Size to Model Failure,” unpublished
Census Bureau project reports