A Theoretical Framework for Adaptive Collection Designs
Jean-François Beaumont, Statistics Canada
David Haziza, Université de Montréal
International Total Survey Error Workshop
Québec, June 19-22, 2011
Overview
Selected literature review
Framework
• Definition of the problem
• Choice of quality indicator and cost function
• Mathematical formulation of the problem
Solution and discussion
Conclusion
Literature review:
Groves & Heeringa (2006, JRSS, Series A)
Responsive designs: Use paradata to guide
changes in the features of data collection in order
to achieve higher quality estimates per unit cost
• Paradata: Data about data collection process
• Examples of features: mode of data collection, use of
incentives, …
• Need to define quality and determine quality indicators
• Two main concepts: phase and phase capacity
Literature review:
Groves & Heeringa (2006, JRSS, Series A)
Phase: Period of data collection during which the
same set of methods is used
• Phase 1:
gather information about design features
• Phases 2+:
alter features (e.g., subsampling of
nonrespondents, larger incentives, …)
A phase is continued until its phase capacity is
reached
• Judged by the stability of an indicator as the phase
matures
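One way to make "stability of an indicator" operational is sketched below. The rule itself (range of the last few values within a tolerance), the window length, the tolerance, and the helper name `phase_capacity_reached` are illustrative assumptions, not something specified by Groves & Heeringa.

```python
def phase_capacity_reached(indicator_history, window=3, tol=0.01):
    """Return True when the indicator has been stable over the last `window` updates.

    indicator_history: indicator values in chronological order (e.g., one per day).
    Stability is defined here (an assumption) as the range of the most recent
    window + 1 values being no larger than `tol`.
    """
    if len(indicator_history) < window + 1:
        return False  # not enough history to judge stability
    recent = indicator_history[-(window + 1):]
    return max(recent) - min(recent) <= tol


# Hypothetical daily values of an indicator during one phase
still_moving = phase_capacity_reached([0.30, 0.42, 0.50, 0.55])
levelled_off = phase_capacity_reached([0.30, 0.42, 0.50, 0.505, 0.507, 0.508])
```

In practice the indicator, the window, and the tolerance would be chosen during the planning of the survey.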
Literature review:
Schouten, Cobben & Bethlehem (2009, SM)
Goal: determine an indicator of nonresponse bias
as an alternative to response rates
Proposed a quality indicator, called R-indicator:
$R(\rho) = 1 - 2\,\text{Pop.Std.Dev.}(\rho_i,\ i \in U)$, $0 \le R(\rho) \le 1$
• Population standard deviation must be estimated
• Response probabilities, $\rho_i$, must be estimated using
some model
An issue: indicator depends on the proper choice
of model (choice of auxiliary variables)
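As a small illustration, the R-indicator can be computed from a set of estimated response probabilities as one minus twice their standard deviation. The sketch below is unweighted and uses divisor n for the standard deviation; both are simplifying assumptions, and the function name `r_indicator` is hypothetical.

```python
import math


def r_indicator(rho_hat):
    """R = 1 - 2 * (standard deviation of the estimated response probabilities)."""
    n = len(rho_hat)
    mean = sum(rho_hat) / n
    # Unweighted standard deviation with divisor n (assumption); a real survey
    # would typically use design weights here.
    sd = math.sqrt(sum((r - mean) ** 2 for r in rho_hat) / n)
    return 1.0 - 2.0 * sd


# Equal propensities give R close to 1; dispersed propensities give a lower R.
r_equal = r_indicator([0.6, 0.6, 0.6, 0.6])
r_spread = r_indicator([0.2, 0.4, 0.8, 1.0])
```

The value depends entirely on the model used to produce `rho_hat`, which is the issue raised on this slide.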
Literature review:
Schouten, Cobben & Bethlehem (2009, SM)
Another issue: indicator does not depend on the
variables of interest but nonresponse bias does
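Maximal bias of $\hat{\mu}_{NA}$: $\dfrac{(1 - R(\rho))\, S(y)}{2\bar{\rho}}$
$\hat{\mu}_{NA}$ is the unadjusted estimator of the population mean:
$\hat{\mu}_{NA} = \dfrac{\sum_{i \in s_r} w_i y_i}{\sum_{i \in s_r} w_i}$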
Two limitations of maximal bias (and R-indicator):
• unadjusted estimator is rarely used in practice
• depends on proper specification of $\rho_i$
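A minimal numeric sketch of the maximal-bias bound, $(1 - R(\rho))\,S(y) / (2\bar{\rho})$, follows, where $\bar{\rho}$ denotes the average response probability. The function name and the input values are hypothetical and serve only to show the arithmetic.

```python
def maximal_bias(r_indicator_value, s_y, rho_bar):
    """Upper bound on the absolute bias of the unadjusted mean estimator.

    r_indicator_value: R-indicator, between 0 and 1
    s_y: population standard deviation of the variable of interest
    rho_bar: average response probability (must be positive)
    """
    return (1.0 - r_indicator_value) * s_y / (2.0 * rho_bar)


# Hypothetical inputs: R = 0.8, S(y) = 10, average response probability = 0.5
bound = maximal_bias(0.8, 10.0, 0.5)
```

A higher R-indicator or a higher average response probability lowers the bound, but the bound still concerns an estimator that is not adjusted for nonresponse.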
Literature review:
Peytchev, Riley, Rosen, Murphy & Lindblad (2010, SRM)
Goal: Reduce nonresponse bias through case
prioritization
Suggest targeting individuals with lower estimated
response probabilities
• For instance, give them larger incentives or give
interviewer incentives
• Their approach is basically equivalent to trying to
increase the R-indicator (or achieving a more
balanced sample)
Recommend using auxiliary variables that are
associated with the variables of interest
Literature review:
Laflamme & Karaganis (2010, ECQ)
Development and implementation of responsive
designs for CATI surveys at Statistics Canada
Planning phase:
• before data collection starts (determination of strategies,
analyses of previous data, …)
Initial collection phase:
• evaluate different indicators to determine when the next
phase should start
Two Responsive Design (RD) phases
Literature review:
Laflamme & Karaganis (2010, ECQ)
RD phase 1:
• prioritize cases (based on paradata or other information)
with the objective of improving response rates
• increase the number of respondents (desirable)
RD phase 2:
• prioritize cases with the objective of reducing the
variability of response rates between domains of
interest (increasing R-indicator)
• likely reduce the variability of weight adjustments
(desirable)
Literature review:
Schouten, Calinescu & Luiten (2011, Stat. Netherlands)
First paper to propose a theoretical framework for
adaptive survey designs
Suggest:
• Maximizing quality for a given cost; or
• Minimizing cost for a given quality
Requires a quality indicator (e.g., overall response
rate, R-indicator, Maximal bias, …)
• Which one to use?
Definition of the problem
Adaptive collection design: Any procedure of
call prioritization or resource allocation that is
dynamic as data collection progresses
• Uses paradata (or other information) to adapt to
what is observed during data collection
• Focus on call prioritization
Our objective: Maximize quality for a given cost
Context: CATI surveys
Choice of quality indicator
Focus of the literature: Find collection designs
that reduce nonresponse bias (or maximize the R-indicator) of an unadjusted estimator
We think the focus should not be on nonresponse
bias. Why?
• Any bias that can be removed at the collection stage
can also be removed at the estimation stage
We suggest reducing nonresponse variance of an
estimator adjusted for nonresponse
Quality indicator
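Suppose we want to estimate the total: $\theta = \sum_{i \in U} y_i$
Assuming that nonresponse is uniform within cells,
an asymptotically unbiased estimator is:
$\hat{\theta}_A = \sum_{g=1}^{G} \sum_{i \in s_{rg}} \dfrac{w_{gi}}{\hat{\rho}_g}\, y_{gi}$, with $\hat{\rho}_g = \dfrac{n_{rg}}{n_g}$
Quality indicator: The nonresponse variance
$\mathrm{var}_q(\hat{\theta}_A \mid s) \approx \sum_{g=1}^{G} (\rho_g^{-1} - 1)(n_g - 1)\, S_{wy,g}^2$
$\rho_g = E_q(\hat{\rho}_g \mid s) = E_q(n_{rg} \mid s) / n_g$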
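The sketch below illustrates the two quantities on this slide: the cell-adjusted estimator of the total (each respondent's weighted value divided by the cell response rate) and the approximate nonresponse variance, taken here as the sum over cells of (1/rho_g - 1)(n_g - 1) S²_wy,g. The data layout, the function names, and the use of respondent values to estimate S²_wy,g are assumptions made for illustration.

```python
from statistics import variance


def adjusted_total(cells):
    """Sum over cells of (sum of w*y over respondents) / (observed cell response rate)."""
    total = 0.0
    for cell in cells:
        n_rg = len(cell["wy_resp"])      # number of respondents in the cell
        rho_hat = n_rg / cell["n"]       # observed response rate n_rg / n_g
        total += sum(cell["wy_resp"]) / rho_hat
    return total


def nonresponse_variance(rho, n, s2_wy):
    """Approximate nonresponse variance for given cell response probabilities."""
    return sum((1.0 / r - 1.0) * (n_g - 1) * s2 for r, n_g, s2 in zip(rho, n, s2_wy))


# Hypothetical data: two cells, values are w_gi * y_gi for respondents only
cells = [
    {"n": 4, "wy_resp": [10.0, 20.0]},
    {"n": 5, "wy_resp": [5.0, 5.0, 5.0, 5.0, 5.0]},
]
theta_hat = adjusted_total(cells)
# S^2 estimated from respondents (assumption)
s2 = [variance(c["wy_resp"]) for c in cells]
v = nonresponse_variance([0.5, 1.0], [4, 5], s2)
```

A cell with full response (rho_g = 1) contributes nothing to the nonresponse variance, as does a cell in which the weighted values do not vary.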
Overall cost
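Overall cost: $C_{TOT} = \sum_{g=1}^{G} C_{TOT,g}$
$C_{TOT,g} = \sum_{i \in s_{rg}} \left[ (m_{gi} - 1)\, C_{NR,g} + C_{R,g} \right] + \sum_{i \in s_g - s_{rg}} m_{gi}\, C_{NR,g}$
$m_{gi}$: total number of attempts for unit $i$
$C_{NR,g}$: cost of an unsuccessful attempt
$C_{R,g}$: cost of an interview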
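To make the cost function concrete, the sketch below computes the realized cost of one cell: a respondent with m attempts costs m - 1 unsuccessful attempts plus one interview, and a nonrespondent costs m unsuccessful attempts. The function name and the figures are hypothetical.

```python
def cell_cost(attempts_resp, attempts_nonresp, c_nr, c_r):
    """Realized collection cost of one cell.

    attempts_resp: number of attempts for each respondent (last attempt is the interview)
    attempts_nonresp: number of attempts for each nonrespondent
    c_nr: cost of an unsuccessful attempt
    c_r: cost of an interview
    """
    cost_resp = sum((m - 1) * c_nr + c_r for m in attempts_resp)
    cost_nonresp = sum(m * c_nr for m in attempts_nonresp)
    return cost_resp + cost_nonresp


# Two respondents (1 and 3 attempts) and two nonrespondents (5 and 2 attempts)
cost = cell_cost([1, 3], [5, 2], c_nr=2.0, c_r=10.0)
```

The overall cost is the sum of such cell costs over all cells.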
Expected overall cost
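Expected overall cost:
$\bar{C}_{TOT} = E_q(C_{TOT} \mid s) = \sum_{g=1}^{G} \bar{C}_{TOT,g}$
$\bar{C}_{TOT,g} = (C_{R,g} - C_{NR,g})\, n_g \rho_g + C_{NR,g} \sum_{i \in s_g} \bar{m}_{gi}$
$\bar{m}_{gi} = E_q(m_{gi} \mid s) = m(p_{gi}, M_{gi})$
Assumption: $\bar{m}_{gi}$ does not depend on $\rho_g$
$\bar{C}_{TOT} = \kappa_0 + \sum_{g=1}^{G} \kappa_{1g}\, n_g \rho_g$,
with $\kappa_0 = \sum_{g=1}^{G} C_{NR,g} \sum_{i \in s_g} \bar{m}_{gi}$ and $\kappa_{1g} = C_{R,g} - C_{NR,g}$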
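Under the assumption that the expected number of attempts does not depend on the target response probability, the expected cost is linear in the cell response probabilities: a constant (cost of an unsuccessful attempt times the expected attempts, summed over units and cells) plus, for each cell, (interview cost minus unsuccessful-attempt cost) times n_g times rho_g. The sketch below computes these two coefficients and the expected cost. The names `kappa0` and `kappa1`, the data layout, and the figures are assumptions for illustration.

```python
def cost_coefficients(cells):
    """Return (kappa0, [kappa1 per cell]) for the linear expected-cost function."""
    # Constant part: unsuccessful-attempt cost times expected attempts, over all units
    kappa0 = sum(c["c_nr"] * sum(c["m_bar"]) for c in cells)
    # Per-cell slope: extra cost of an interview over an unsuccessful attempt
    kappa1 = [c["c_r"] - c["c_nr"] for c in cells]
    return kappa0, kappa1


def expected_cost(cells, rho):
    """kappa0 + sum over cells of kappa1_g * n_g * rho_g."""
    kappa0, kappa1 = cost_coefficients(cells)
    return kappa0 + sum(
        k * len(c["m_bar"]) * r for k, c, r in zip(kappa1, cells, rho)
    )


# Hypothetical cells: m_bar holds the expected number of attempts of each sampled unit
cells = [
    {"c_nr": 2.0, "c_r": 10.0, "m_bar": [3.0, 4.0]},
    {"c_nr": 1.0, "c_r": 6.0, "m_bar": [2.0, 2.0, 2.0]},
]
kappa0, kappa1 = cost_coefficients(cells)
c_bar = expected_cost(cells, [0.5, 0.8])
```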
Mathematical formulation
Objective: Find $\rho_g$, $g = 1, \ldots, G$, that minimizes the
nonresponse variance $\mathrm{var}_q(\hat{\theta}_A \mid s)$
subject to a fixed expected overall cost, $\bar{C}_{TOT} = K$
Solution: $\rho_g \propto \left[ \dfrac{(1 - n_g^{-1})\, S_{wy,g}^2}{\kappa_{1g}} \right]^{1/2}$
Note: Equivalent to maximizing the R-indicator
only in a very special scenario
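The allocation can be sketched numerically. Minimizing the sum over cells of (1/rho_g - 1)(n_g - 1) S²_wy,g subject to a linear expected cost, constant + sum of slope_g × n_g × rho_g = K, with a Lagrange multiplier gives rho_g proportional to the square root of (1 - 1/n_g) S²_wy,g / slope_g; the constant of proportionality then follows from the cost constraint. The code below is a sketch under those assumptions: the names are hypothetical, and no attempt is made to cap the targets at 1.

```python
import math


def optimal_targets(n, s2_wy, kappa1, kappa0, budget):
    """Target response probabilities that meet the budget exactly.

    n: cell sample sizes n_g
    s2_wy: cell variances of w*y
    kappa1: per-cell cost slopes; kappa0: constant part of the expected cost
    budget: fixed expected overall cost K
    """
    # Shape of the solution, up to a common scale factor
    shape = [
        math.sqrt((1.0 - 1.0 / n_g) * s2 / k)
        for n_g, s2, k in zip(n, s2_wy, kappa1)
    ]
    # Scale so that kappa0 + sum(kappa1_g * n_g * rho_g) equals the budget
    unit_cost = sum(k * n_g * a for k, n_g, a in zip(kappa1, n, shape))
    scale = (budget - kappa0) / unit_cost
    return [scale * a for a in shape]


# Hypothetical example: equal sizes and costs, cell 1 four times as variable as cell 2
rho = optimal_targets([100, 100], [4.0, 1.0], [1.0, 1.0], kappa0=0.0, budget=90.0)
```

With equal cell sizes and costs, the more variable cell receives the higher target. The targets are equal across cells, which is what maximizing the R-indicator would aim for, only when the quantity under the square root is the same in every cell.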
Implementation
Find the effort $e_{gi}$ (number of attempts) necessary
to achieve the target response probability $\rho_g$:
$e_{gi} = \dfrac{\ln(1 - \rho_g)}{\ln(1 - p_{gi})}$
Procedure: Select cases to be interviewed with
probability proportional to the effort $e_{gi}$
Issues:
1) Avoid small estimated $p_{gi}$ to avoid an
unduly large effort $e_{gi}$
2) Might want to ensure that a certain
time has elapsed between two
consecutive calls
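The effort formula and the selection step can be sketched as follows. The effort is ln(1 - rho_g) / ln(1 - p_gi), which is the number of attempts needed to reach the target when each attempt succeeds independently with probability p_gi. The floor `p_min` used to address the first issue, the function names, and the use of `random.choices` for the proportional draw are assumptions for illustration.

```python
import math
import random


def effort(rho_target, p, p_min=0.05):
    """Number of attempts needed to reach rho_target with per-attempt probability p."""
    p = max(p, p_min)  # floor on small estimated probabilities (assumption)
    return math.log(1.0 - rho_target) / math.log(1.0 - p)


def select_case(efforts, rng):
    """Draw one case index with probability proportional to its effort."""
    return rng.choices(range(len(efforts)), weights=efforts, k=1)[0]


# Target 0.75 with p = 0.5 per attempt requires two attempts: 1 - 0.5**2 = 0.75
e = effort(0.75, 0.5)

# Hypothetical cases with per-attempt probabilities 0.5, 0.2 and 0.001 (floored at 0.05)
efforts = [effort(0.75, p) for p in (0.5, 0.2, 0.001)]
rng = random.Random(2011)
next_case = select_case(efforts, rng)
```

A minimum time between two consecutive calls to the same case (the second issue) would be handled by restricting the set of eligible cases before the draw.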
Graph of variance vs cost
[Figure: minimum nonresponse variance plotted against expected overall cost]
Revised solution
Solution of the optimization problem is found
before data collection starts
May be a good idea to revise the solution
periodically (e.g., daily)
• Some parameters might need to be modified
• Update remaining budget and expected overall cost
• The revised optimization problem is similar to the initial
one
Revised solution
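Solution (same as before): $\rho_g \propto \left[ \dfrac{(1 - n_g^{-1})\, S_{wy,g}^2}{\kappa_{1g}} \right]^{1/2}$
Revised target response probability: $\tilde{\rho}_g = \dfrac{n_g \rho_g - n_{rg}}{n_g - n_{rg}}$ (could be negative)
Effort: $e_{gi} = \dfrac{\ln(1 - \tilde{\rho}_g)}{\ln(1 - p_{gi})}$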
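The revision step can be illustrated with a short sketch. Given the target rho_g, the cell sample size n_g, and the number of respondents obtained so far n_rg, the revised target for the remaining cases is (n_g rho_g - n_rg) / (n_g - n_rg). Setting the effort to zero when the revised target is not positive is an assumption made here, as are the function names.

```python
import math


def revised_target(rho_g, n_g, n_rg):
    """Response probability still needed among the n_g - n_rg remaining cases."""
    return (n_g * rho_g - n_rg) / (n_g - n_rg)


def remaining_effort(rho_g, n_g, n_rg, p_gi):
    """Effort for a remaining case; zero once the cell target is already met."""
    if n_rg >= n_g:
        return 0.0  # no cases left to call in this cell
    target = revised_target(rho_g, n_g, n_rg)
    if target <= 0.0:
        return 0.0  # target already reached (handling of this case is an assumption)
    return math.log(1.0 - target) / math.log(1.0 - p_gi)


# Target 0.6 in a cell of 100: with 40 respondents, the remaining 60 cases need 1/3
behind = revised_target(0.6, 100, 40)
# With 70 respondents the target is exceeded and the revised value is negative
ahead = revised_target(0.6, 100, 70)
```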
Conclusion
Next steps:
• Simulation study
• Adapt the theory for practical applications
• Test in a real production environment
Which quality indicator? Nonresponse variance?
Others?
Reduction of nonresponse bias: subsampling of
nonrespondents
• Our approach could be used within the subsample
Thanks
For more information, please contact:
Jean-François Beaumont
David Haziza