Improved Register Data Matching and its Impact on Survey Population Estimates
Steve Vale
Office for National Statistics, UK
Contents
• Background
• Current matching systems
• Enhancements
• Impact on survey populations
Background
• No common business identifier in UK
• Data from different sources matched using
name, address and postcode
• Software based around SSAName3
• Limited clerical input for “possible match”
category (>10 employment)
• Quality marker (“inquiry stop”) used to
indicate probability of duplication and to
exclude some enterprises from survey
populations
VAT only proven: 188,380 (emp 846,100)
PAYE only proven: 115,688 (emp 3,517,000)
VAT & PAYE proven: 407,796 (emp 18,775,000)
PAYE only unproven: 169,028 (emp 571,700)
Inq Stop 6,7: 188,356 (emp 385,450)
VAT only unproven: 721,046 (emp 1,370,000)
VAT & PAYE unproven: 349,983 (emp 1,450,000)
PAYE only >20: 1,100 (emp 67,520)
Inq Stop 9: 17,153 (emp 56,410)
VAT only >20: 3,475 (emp 412,700)
Inquiry Stop 6 Units - Time series
[Chart: monthly count of Inquiry Stop 6 units, June 2002 to July 2004; vertical axis from 100,000 to 200,000]
The Project
• Aim to improve the quality of automatic
matching
• Reduce the number of units on the
register that are not included in survey
populations
• Improve certainty about probability of
duplication
• Part funded by Eurostat
Matching Process 1
• Name is standardised to form a name
key
• Name keys are checked against existing
records at decreasing levels of accuracy
until possible matches are found
• The name, address and post codes of
possible matches are compared, and a
score out of 100 is calculated
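The search at decreasing levels of accuracy can be sketched as follows. This is only an illustration and not the SSAName3 algorithm, which the slides do not describe: the noise-word list, the three key levels and the helper names (standardise, key_levels, build_index, find_candidates) are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical noise words; the real standardisation rules are not given.
NOISE_WORDS = {"LTD", "LIMITED", "PLC", "CO", "COMPANY", "AND", "THE"}


def standardise(name):
    """Upper-case, drop apostrophes and other punctuation, remove noise words."""
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", name.upper().replace("'", ""))
    return [t for t in cleaned.split() if t not in NOISE_WORDS]


def key_levels(name):
    """Return name keys from strictest to loosest (assumed levels)."""
    tokens = standardise(name)
    if not tokens:
        return ["", "", ""]
    return [
        " ".join(tokens),          # level 0: all words, in order
        " ".join(sorted(tokens)),  # level 1: all words, any order
        tokens[0],                 # level 2: first word only
    ]


def build_index(register):
    """Index every register record under its key at each level."""
    index = [defaultdict(list) for _ in range(3)]
    for rec_id, name in register.items():
        for level, key in enumerate(key_levels(name)):
            if key:
                index[level][key].append(rec_id)
    return index


def find_candidates(name, index):
    """Try each level in turn and stop at the first one that finds records."""
    for level, key in enumerate(key_levels(name)):
        hits = index[level].get(key, [])
        if hits:
            return level, hits
    return None, []


register = {1: "Smith's Bakery Ltd", 2: "Jones & Sons Plumbing"}
index = build_index(register)
```

With this toy register, find_candidates("SMITHS BAKERY LIMITED", index) finds record 1 at the strictest level, while find_candidates("Jones Heating", index) only finds record 2 at the loosest level. Candidates found this way would then go on to the name, address and postcode comparison.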
Matching Process 2
• If the score is >79 it is considered to be
a definite match
• If the score is between 60 and 79 it is
considered a possible match, and is
reported for clerical checking
• If the score is <60 it is considered a
non-match
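A minimal sketch of the scoring and the three score bands follows. The thresholds are the ones on the slide; the scoring itself is an assumption, since the slides only say that name, address and postcode are compared and scored out of 100. The field weights and the use of difflib string similarity are hypothetical stand-ins for the real comparison.

```python
from difflib import SequenceMatcher

# Hypothetical weights summing to 100; the real weighting is not given.
WEIGHTS = {"name": 50, "address": 30, "postcode": 20}


def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def match_score(rec_a, rec_b):
    """Weighted similarity of name, address and postcode, out of 100."""
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())


def classify(score):
    """Apply the bands from the slide: >79, 60 to 79, <60."""
    if score > 79:
        return "definite match"
    if score >= 60:
        return "possible match"
    return "non-match"
```

Under these assumptions two identical records score 100 and are classified as a definite match, while a score of exactly 79 falls in the possible-match band and is reported for clerical checking.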
Matching Process 3
• Possible matches are checked clerically
and linked where appropriate using an
on-line system
• Non-matches with >9 employment are
checked - if no link is found they are
sent a Business Register Survey form
• Samples of definite matches and
smaller non-matches are checked
periodically
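The follow-up rules above can be written as a small routing function. This is a sketch only: the function name, the outcome labels and the returned action strings are illustrative, and only the rules themselves (including the >9 employment cut-off) come from the slide.

```python
def follow_up(outcome, employment):
    """Return the follow-up action for one matching outcome."""
    if outcome == "possible match":
        # Checked clerically and linked where appropriate.
        return "clerical check"
    if outcome == "non-match" and employment > 9:
        # Checked; a Business Register Survey form is sent if no link is found.
        return "clerical check, then survey form if no link"
    # Definite matches and smaller non-matches are only sampled.
    return "periodic sample check"
```

For example, follow_up("non-match", 25) returns the clerical-check-then-form action, while follow_up("non-match", 3) returns the periodic sample check.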
Improvements 1
• Re-matching using cleaned addresses
– Gains from timing
– Gains from cleaning and standardising
addresses
– Needs extra storage space on the register
for cleaned addresses (approx. 3Gb)
– Address cleaning tool used: Matchcode5
by Capscan
Improvements 2
• Enhancing name keys
– Standardised creation
– Inclusion of part of postcode
• Better treatment of compound names
– E.g. John Smith trading as Smiths Bakery
• More use of data on company
registrations to assist matching of
corporate units
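The two name-key enhancements can be illustrated together. This is a sketch under stated assumptions: the slides do not say which part of the postcode is used, so the outward code (everything before the final three characters) is assumed here, and the "t/a" abbreviation, the "|" separator and the function names are hypothetical.

```python
import re

# Splits compound names on "trading as"; "t/a" is an assumed extra variant.
TRADING_AS = re.compile(r"\s+(?:trading as|t/a)\s+", re.IGNORECASE)


def split_compound(name):
    """Split 'X trading as Y' into its separate names."""
    return [part.strip() for part in TRADING_AS.split(name) if part.strip()]


def outward_code(postcode):
    """Return the postcode without its final three characters (the inward code)."""
    compact = postcode.replace(" ", "").upper()
    return compact[:-3] if len(compact) > 3 else compact


def compact_name(name):
    """Upper-case the name and keep only letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())


def enhanced_keys(name, postcode):
    """Build one key per component name, each including part of the postcode."""
    area = outward_code(postcode)
    return [f"{compact_name(part)}|{area}" for part in split_compound(name)]
```

Using the example from the slide with a made-up postcode, enhanced_keys("John Smith trading as Smiths Bakery", "AB12 3CD") gives one key for "John Smith" and one for "Smiths Bakery", so a record held under either name in the same postcode area can be found.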
Results 1
• Approximately 30% of units outside
survey populations will match to units
already in those populations
• Less than 5% of the remainder are
duplicates of units in the survey
populations
• Some units in survey populations were found
to be duplicates (possibly around 1%)
Results 2
• Overall impact:
– 6% more units in survey populations
– Maximum of 1.4% increase in
employment
– Timing of change is an issue
– The risk of duplication will be less than the
risk of under-coverage
Conclusions
• Matching rates will be improved by regular
re-matching using cleaned addresses.
• Initial matching by name can be improved if
part of the postcode is included.
• Improvements to matching increase the
certainty that the remaining unmatched units
are genuinely single source.
• Desk profiling and clerical matching can
reduce duplication still further if targeted at
high risk units.
Further information
• www.statistics.gov.uk/idbr
• [email protected]
Any Questions?