Transcript Admin data

Examining the use of administrative
data for annual business statistics
Joanna Woods, Ria Sanderson, Tracy Jones,
Daniel Lewis
Overview
• Background
- Motivation
- Admin data
- Variables of interest
• Methods tested
- Discontinuing the survey
- Cut-off sampling
• Results
• Conclusions
Motivation
• Drive to increase the use of admin data for
business statistics
- reduce survey costs
- decrease burden on survey respondents
• One possibility - replace survey data with admin
data
- Some variables have admin data directly available
- Other variables do not have a direct source of
admin data available
Annual Business Survey
• The Annual Business Survey (ABS) collects
financial variables
• Target population = UK economy
• Stratified simple random sample by industry,
region & employment
• Samples approximately 60,000 businesses
• Businesses with employment > 249 are
completely enumerated
• Ratio estimation
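A minimal sketch of ratio estimation within one stratum, assuming an auxiliary variable (for example register turnover) is known for every business in the stratum. The sampled ratio of the survey variable to the auxiliary variable is scaled up by the stratum total of the auxiliary variable. The function name and all figures are hypothetical and are not taken from the survey.

```python
def ratio_estimate(sample_y, sample_x, population_x_total):
    """Ratio estimator of a stratum total: (sum of y / sum of x) * X."""
    sum_x = sum(sample_x)
    if sum_x == 0:
        raise ValueError("auxiliary variable sums to zero in the sample")
    return sum(sample_y) / sum_x * population_x_total


# Hypothetical stratum with three sampled businesses.
sample_acquisitions = [10.0, 0.0, 40.0]   # survey variable y
sample_turnover = [100.0, 50.0, 250.0]    # auxiliary variable x
stratum_turnover_total = 4000.0           # known total of x in the stratum

estimate = ratio_estimate(sample_acquisitions, sample_turnover,
                          stratum_turnover_total)
```

Here the sample ratio is 50 / 400 = 0.125, so the estimated stratum total is 0.125 × 4000 = 500.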
Available administrative data
• Two main sources available:
- VAT turnover data
- Company accounts data (balance sheet variables)
• These overlap with, but do not fully cover, the
target population
• Properties of these data sources are different
Survey population and admin data
[Diagram: the survey population and the administrative data overlap;
the overlap is the matched part]
Administrative data sources
• VAT turnover
- Created annual data sets for 2003-2008
- Matched to units in the survey population
- Match rate 73-75%, few missing values
• Company Accounts (balance sheets)
- Annual data from April 2003 to March 2009
- Complex matches to units in survey population
- Low match rate and many missing values
ABS variables
• ABS variables which do not have admin data
directly available include
- Total Acquisitions – investment in land, existing buildings,
and computers
- Total Disposals – sale of land and existing buildings
• Proportion of zeros varies within each sizeband
- Total Acquisitions: 71% for 0-9 emp, 9% for >250 emp
- Total Disposals: 93% for 0-9 emp, 43% for >250 emp
Acquisitions & Disposals
Methods Tested
• Aim: to see if admin data sources can be helpful as
auxiliary variables in estimating these totals to reduce
the sample size.
• Discontinuing the survey
- Predict values for investment variables based on
models derived from past survey data.
• Cut-off sampling
- Stop sampling some businesses
- Use admin data to estimate for these units
- Consider simple ratio adjustment
Methods Tested: Considerations
• Discontinuing the survey
- Advantages: no survey is required (provided admin data is
available for all)
- Disadvantages: model parameters fixed, cannot respond to
changes in economy, may introduce bias; different models
required for different survey variables
• Cut-off sampling
- Advantages: reduces the burden placed on small businesses;
reduces survey costs
- Disadvantages: still requires a survey component; may
introduce bias
Methods tested: Discontinuing the
survey
• Produce models using past survey & admin data
to produce estimates
- Linear model – predict values for positive returns
- Logistic model – predict probability of positive return
• Build a model using data from last survey
• Model covariates can be admin data variables
• Apply model to future years & evaluate results.
Methods tested: Discontinuing the
survey - Linear model
• Aim - predict values for acquisitions/disposals
• Have skewed data, use log transformation
• Use positive returns from year t to create a model
• Apply model to year t+1, t+2 ... to get predicted
value for each business
• Back transform prediction to get back to original
linear scale
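The linear-model steps can be sketched as follows, assuming a single covariate (log turnover) for brevity, whereas the models tested used several. The slides do not say which back-transformation was applied; plain `exp()` is assumed here, and on noisy data that choice tends to understate the mean on the original scale. All names and figures are hypothetical.

```python
import math


def fit_log_linear(turnover, acquisitions):
    """Fit log(acquisitions) = a + b * log(turnover) by ordinary least
    squares, using only units with a positive return."""
    pairs = [(math.log(x), math.log(y))
             for x, y in zip(turnover, acquisitions) if x > 0 and y > 0]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_original_scale(intercept, slope, turnover):
    """Predict on the log scale, then back-transform with exp()."""
    return [math.exp(intercept + slope * math.log(x)) for x in turnover]


# Hypothetical year t data: the zero return is excluded from the fit.
turnover_t = [100.0, 1000.0, 10000.0, 500.0]
acquisitions_t = [1.0, 10.0, 100.0, 0.0]
a, b = fit_log_linear(turnover_t, acquisitions_t)

# Apply the year t model to hypothetical year t+1 turnover.
predicted_t1 = predict_original_scale(a, b, [2000.0])
```

In this toy data acquisitions are exactly 1% of turnover for the positive returns, so the fitted slope is 1 and the prediction for a turnover of 2000 is about 20.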
Methods tested: Discontinuing the
survey - Logistic model
• Aim – predict probability of company returning a
positive value
• Use all returned data from year t to model the
probability of a business returning a positive
value
• Apply model to units in year t+1 to get a predicted
probability for each business
• Multiply linear model prediction & logistic model
probability to produce predicted value for every
unit
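A minimal sketch of the two-part prediction, assuming a logistic model with one covariate fitted by Newton-Raphson. The covariate, the data, and the value standing in for the linear-model prediction are hypothetical; the models tested used more covariates.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, iterations=25):
    """Fit P(y = 1) = sigmoid(a + b * x) by Newton-Raphson."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        p = [sigmoid(a + b * xi) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative Hessian) entries.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system explicitly.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def expected_value(prob_positive, value_if_positive):
    """Predicted value = P(positive return) * predicted positive value."""
    return prob_positive * value_if_positive


# Hypothetical year t data: 1 = positive return, 0 = zero return.
log_turnover = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
positive_return = [0, 0, 1, 0, 1, 1]
a, b = fit_logistic(log_turnover, positive_return)

# Year t+1 unit with log turnover 5.0 and a hypothetical
# linear-model prediction of 20.0 for its positive value.
prob = sigmoid(a + b * 5.0)
predicted = expected_value(prob, 20.0)
```

Because every unit receives a probability below one, the combined prediction is always smaller than the linear-model prediction on its own.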
Results: Discontinuing the survey
• Acquisitions
• Best linear model for predicting log(total acquisitions)
– Intercept,
– Standard Industrial Classification(SIC) at three digit level,
– Region,
– Employment band,
– log turnover,
– log turnover *SIC section
• R-squared = 0.66
Results: Discontinuing the survey
• Acquisitions
• Best logistic model for predicting probability of a
positive return
– Intercept,
– SIC division level,
– Region,
– Employment band,
– log turnover,
• Produced one of the lowest AIC
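For reference, AIC is 2k − 2 ln L, where k is the number of fitted parameters and L is the maximised likelihood; lower values are preferred. A small sketch for a model of positive/zero returns follows, using hypothetical outcomes and fitted probabilities rather than results from the survey.

```python
import math


def bernoulli_log_likelihood(outcomes, probabilities):
    """Log-likelihood of 0/1 outcomes given fitted probabilities."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p)
               for y, p in zip(outcomes, probabilities))


def aic(log_likelihood, n_parameters):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_parameters - 2 * log_likelihood


# Hypothetical outcomes and fitted probabilities from two candidate models.
outcomes = [0, 0, 1, 1]
intercept_only = [0.5, 0.5, 0.5, 0.5]      # 1 parameter
with_covariate = [0.2, 0.3, 0.7, 0.8]      # 2 parameters

aic_small = aic(bernoulli_log_likelihood(outcomes, intercept_only), 1)
aic_large = aic(bernoulli_log_likelihood(outcomes, with_covariate), 2)
```

The extra parameter is only worth keeping when the gain in log-likelihood outweighs the penalty of 2 per parameter.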
Results: Discontinuing the survey
Methods tested: Cut-off sampling
• Reduces burden but introduces bias
• Create a cut-off, based on employment
• Stop sampling below the cut-off
• Use sample information above the cut-off to
estimate for units below the cut-off in an effort
to reduce bias
• Missing data and match rates are the main
difficulty => can’t be applied to full survey
population, still need a sample
Simple ratio adjustment
• Estimate for units below the cut-off:
$$\hat{Y}_c = X_c \, \frac{\hat{Y}_m}{\hat{X}_m}$$
- $X_c$: total of auxiliary variable below cut-off
- $\hat{X}_m$: estimate of auxiliary variable above cut-off
- $\hat{Y}_m$: estimate of variable of interest above cut-off
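A worked sketch of the simple ratio adjustment: the total of the auxiliary variable below the cut-off is multiplied by the ratio, above the cut-off, of the estimated variable of interest to the estimated auxiliary variable. The function name and figures are hypothetical.

```python
def below_cutoff_estimate(x_below_total, y_above_estimate, x_above_estimate):
    """Simple ratio adjustment: Y_c_hat = X_c * (Y_m_hat / X_m_hat)."""
    if x_above_estimate == 0:
        raise ValueError("auxiliary estimate above the cut-off is zero")
    return x_below_total * y_above_estimate / x_above_estimate


# Hypothetical division-level figures.
x_c = 2000.0       # admin turnover total for units below the cut-off
y_m_hat = 300.0    # survey estimate of acquisitions above the cut-off
x_m_hat = 12000.0  # estimate of turnover above the cut-off

y_c_hat = below_cutoff_estimate(x_c, y_m_hat, x_m_hat)
division_total = y_m_hat + y_c_hat
```

With these figures the ratio above the cut-off is 300 / 12000 = 0.025, giving 50 for the units below the cut-off and 350 for the division. The adjustment assumes that the same ratio holds below the cut-off, which is where bias can enter.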
Results: Simple ratio adjustment
Conclusions
• Discontinuing survey
- not an option for this variable
• Underpredicts
• Growth rates differ
• Cut-off sampling with simple ratio adjustment
- can give reasonable results in some divisions but
not all
- sample size savings can be made where method
works well but is dependent on match rate
- multiple auxiliary variables are required
Any questions?