Methodological challenges in
integrating data collections in
business statistics
Paul Smith
Office for National Statistics
Outline
• Data quality for different sources
quality measures for survey and administrative
inputs
quality measures for outputs
• Combinations of sources
familiar and more advanced situations
• Mode effects
• Models
• Discussion
Statistical data collections - quality
• Relevance
generally questions conform to desired concepts
may be tailored for
• practicality
• consistency across collections even if concepts differ
• Accuracy
affected by sampling
impacts from non-response, measurement error
• Timeliness
generally relatively timely
Administrative data - quality
• Relevance
questions conform to administrative (not statistical)
concepts
few concessions to statistical needs
• Accuracy
unaffected by sampling
processes to discourage non-response
treatment of measurement error differs by variable
• Timeliness
generally slow
Differences between types of source
• Sampling accuracy is measurable for
surveys, not relevant for administrative data
sources
confidence in quality reduced for admin data
balance of accuracy measures different
• Building statistical requirements into
administrative series
requires negotiation and agreement
VAT classification information in the UK
INSEE has statistical and accounting information
well integrated
Questionnaire design
• Questionnaire design principles mostly used
in designing statistical collections
• Administrative data seen as “forms” not
“questionnaires”
less attention to question phrasing to obtain required
answer
more on statutory requirements
Output data quality
• Data quality from combined outputs can be
challenging to measure
function of the qualities of the input sources, and the
methods used to combine them
some well-known general approaches
development of measures needed for particular
cases (e.g. from models)
Combinations of sources - 1
• Frame and sample information
Sampling frames typically derived from
administrative sources
Multiple uses of frame information
• sample design
• sample selection
• validation and editing
• estimation and variance estimation
Quality easily derived – standard situation
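As an illustration of the standard situation, the sketch below computes a stratified expansion estimate of a total and its estimated variance, with frame counts per stratum taken from the register. The stratum sizes and sample values are hypothetical, and simple random sampling without replacement within each stratum is assumed.

```python
from statistics import mean, variance

def stratified_total(strata):
    """Expansion estimate of a total and its estimated variance.

    strata: list of dicts with "N" (frame count in the stratum) and
    "y" (sampled values). Assumes simple random sampling without
    replacement within each stratum and at least two sampled units.
    """
    total = 0.0
    var = 0.0
    for s in strata:
        N, y = s["N"], s["y"]
        n = len(y)
        total += N * mean(y)
        # (1 - n/N) is the finite population correction
        var += N * N * (1 - n / N) * variance(y) / n
    return total, var

# Hypothetical strata: many small businesses, few large ones
strata = [
    {"N": 100, "y": [10.0, 12.0, 14.0, 16.0]},
    {"N": 20, "y": [200.0, 220.0, 240.0, 260.0]},
]
est, var = stratified_total(strata)
print(est, var)
```

The same frame counts drive the design, the selection and the weights, which is why quality is straightforward to derive here.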
Combinations of sources - 2
• Dual-frame surveys
More than one administrative source
Pension funds survey in the UK
Units
Business register
Challenges of population inflation if matching not perfect
Estimate probability that unit appears in sample from
either source
• use in appropriate weighting procedure
• adjustment for P(in both surveys) depends on survey
type
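A minimal sketch of the weighting idea, assuming the two samples are drawn independently and that matching across frames is perfect, so each sampled unit appears once with both selection probabilities known. The function name and all values are hypothetical; other survey types would need a different adjustment for units in both samples.

```python
def inclusion_probability(p_a, p_b):
    """P(unit is selected from at least one frame).

    Assumes independent selection from the two frames; a probability
    of 0.0 means the unit is not on that frame.
    """
    return p_a + p_b - p_a * p_b

# Hypothetical deduplicated sample: (value, P(selected | frame A), P(selected | frame B))
sample = [
    (50.0, 0.5, 0.0),   # on frame A only
    (80.0, 0.5, 0.2),   # on both frames
    (30.0, 0.0, 0.2),   # on frame B only
]

# Weight each distinct unit by the inverse of its overall inclusion probability
estimate = sum(y / inclusion_probability(pa, pb) for y, pa, pb in sample)
print(round(estimate, 3))
```

If a unit on both frames is not matched, it is treated as two units with their single-frame weights, which is the population inflation noted above.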
Combinations of sources - 3
• Multiple surveys
different periodicity
• summary information monthly, detail annually
• for example capital expenditure – quarterly breakdown,
annual summary
• Benchmarking
where short-period surveys small (and variable) and
annual larger (and less variable)
• Quality measures
account for sampling error in both sources
account for non-response and measurement errors
in larger survey
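The simplest form of benchmarking can be sketched as a pro-rata adjustment: the short-period estimates are scaled so that they sum to the annual total. The figures are hypothetical, and this is only the most basic approach; it can introduce a step between years, which smoother benchmarking methods are designed to avoid.

```python
def prorata_benchmark(sub_annual, annual_total):
    """Scale sub-annual estimates so that they sum to the annual benchmark."""
    factor = annual_total / sum(sub_annual)
    return [x * factor for x in sub_annual]

# Hypothetical quarterly estimates from the small survey (sum = 400)
quarterly = [90.0, 110.0, 95.0, 105.0]
# Hypothetical total from the larger annual survey
annual = 440.0

benchmarked = prorata_benchmark(quarterly, annual)
print(benchmarked)
```

The quarterly pattern is preserved while the level comes from the annual source, so the quality of the result depends on sampling error in both surveys.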
Combinations of sources - 4
• Auxiliary information
If administrative concept not close to statistical
concept, data may still be useful
Auxiliary information in estimation
• not required to be correct, only correlated with outcome
• the better the correlation, the better the accuracy
Auxiliary information in validation
• use tax data to improve validation follow-up activity
• Data confrontation
Use multiple sources to identify discrepancies
Balancing
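The use of auxiliary information in estimation can be illustrated with a ratio estimator, where an administrative variable (for example a tax turnover figure) known for the whole population scales up the survey values. The variable names and numbers are hypothetical; the auxiliary variable does not need to measure the statistical concept, only to correlate with it.

```python
def ratio_estimate(y_sample, x_sample, x_population_total):
    """Ratio estimate of the population total of y.

    y_sample: survey values for sampled units
    x_sample: auxiliary (administrative) values for the same units
    x_population_total: auxiliary total over the whole population
    """
    return x_population_total * sum(y_sample) / sum(x_sample)

# Hypothetical sampled businesses: survey turnover and tax turnover
survey_turnover = [12.0, 30.0, 18.0]
tax_turnover = [10.0, 25.0, 15.0]
tax_total = 5000.0  # known from the administrative source

estimate = ratio_estimate(survey_turnover, tax_turnover, tax_total)
print(estimate)
```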
Mode effects
• Mode effects manifest in several ways
differences in contact rate
differences in response rate given contact
differences in question replies given response
• Test differences through a designed
experiment (van den Brakel & Renssen 1998,
2005)
evaluates whole-process differences (not individual
steps)
non-response adjustment if there are good predictors of
response among the auxiliary data (variance increases)
model-based adjustments for other changes
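A much simplified sketch of comparing two modes in an experiment: the difference in mean response between the modes, with its standard error. This assumes units were assigned to modes at random with equal probability and is not the design-based procedure of the cited papers; the data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def mode_difference(mode_a, mode_b):
    """Difference in means between two modes, its standard error and their ratio.

    Assumes independent groups formed by random assignment; the
    standard error uses separate sample variances for each group.
    """
    diff = mean(mode_a) - mean(mode_b)
    se = sqrt(variance(mode_a) / len(mode_a) + variance(mode_b) / len(mode_b))
    return diff, se, diff / se

# Hypothetical responses to the same question under two modes
paper = [10.0, 12.0, 14.0, 16.0]
web = [9.0, 11.0, 13.0, 15.0]

diff, se, ratio = mode_difference(paper, web)
print(diff, round(se, 4), round(ratio, 4))
```

Because the comparison is made on final responses, it reflects the whole process (contact, response and measurement together), not any single step.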
Temporal differences
• Administrative data often have longer
reference period than statistical requirement
• Implies temporal disaggregation (model-based; Dagum & Cholette 2006)
• Quality implications
estimated data as inputs
sensitivity of model to interesting changes
Models for combining data
• Full flexibility in combining data available
through modelling approach
• Models at boundary between statistical
producer and user
• Ideally statistical results insensitive to model
assumptions
small area estimates
• useful for social surveys
• challenges for business surveys not yet resolved
modelling for unit structures - BRES
Discussion
• Aim: more from existing sources
often imperfect matches
modelling is the only appropriate approach
• subjective
• robust to assumptions
• sensitivity analysis
• Mixed mode collections
usability and low cost
data combination
quality components harder to measure
• for more details see the paper, or contact
[email protected]