Matching of administrative
data to validate the 2011
Census in England and Wales
NRS & RSS Edinburgh, October 2012
AGENDA
• Context: 2011 Census quality assurance and the role of
administrative data
• Data matching challenges and solutions
• Data to be matched
• Matching methods and interpretation
• Substantive results so far . . .
An overview of the methods
Method: DSE; Bias adj; Overcount; Ratio estimator; Nat adj; Coverage imputation
Product: 5 yr age/sex, CCS areas; 5 yr age/sex, EA/LA level; 1 yr age/sex, OA level
Quality assurance: Core checks; Supplementary analysis; QA Review and sign-off; Main QA Panel; High Level QA Panel; First Release
Challenges and solutions
Issue | Solution
Matching limited to small QA ‘window’ | Match selected LAs ahead of QA
Some data not available in advance | Flexible data architecture so new sources can be added
Research questions only emerge during QA | Stratified approach to matching so the methods were tailored to the questions
Scale of matching task potentially huge | Initially restrict matching to CCS postcode clusters
One-to-many address matches | Revised address data architecture
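Restricting matching to CCS postcode clusters amounts to a form of blocking: only records that share a postcode inside the sampled clusters are ever compared. The sketch below is illustrative only. The record layout, the `postcode` field name and the function `candidate_pairs` are assumptions made for this example, not the architecture actually used.

```python
from collections import defaultdict


def candidate_pairs(census_records, pr_records, ccs_postcodes):
    """Yield (census, pr) pairs that share a postcode within the CCS sample.

    All names here are hypothetical; records are plain dicts with a
    'postcode' key that is assumed to be standardised already.
    """
    # Index Patient Register records by postcode, keeping only sampled postcodes.
    pr_by_postcode = defaultdict(list)
    for rec in pr_records:
        if rec["postcode"] in ccs_postcodes:
            pr_by_postcode[rec["postcode"]].append(rec)

    # Compare each Census record only with PR records in the same postcode.
    for rec in census_records:
        for cand in pr_by_postcode.get(rec["postcode"], []):
            yield rec, cand


census = [{"id": "c1", "postcode": "AB1 2CD"}, {"id": "c2", "postcode": "ZZ9 9ZZ"}]
pr = [
    {"id": "p1", "postcode": "AB1 2CD"},
    {"id": "p2", "postcode": "AB1 2CD"},
    {"id": "p3", "postcode": "ZZ9 9ZZ"},
]
pairs = list(candidate_pairs(census, pr, ccs_postcodes={"AB1 2CD"}))
```

With blocking, the number of comparisons grows with the size of each postcode rather than with the product of the two full files, which is what keeps the task tractable inside a short QA window.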
Data to be matched
Census | Non-Census
Post-out Address Register | NHS Patient Register
Address Register History File | Higher Education Statistics Agency (HESA) data
Census returns | English and Welsh School Censuses
‘Associated Address’ data | Electoral Registers
Census Management Information System | Valuation Office Agency data
Methods
• Data cleaning, de-duplication, standardisation, quality analysis
• Definitional alignment with Census enumeration base
• Exact matching (dwelling: address; person: name, DoB, gender and postcode)
• Score-based address matching
• Probabilistic person matching
• Clerical resolution of candidate pairs from automatch
• Clerical search for unmatched residuals
• Resolution of unmatched residuals against the Address Register History File and Census ‘associated addresses’
• Evidence-based assessment of residuals
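The automatic steps in the list above (standardisation, exact matching, then probabilistic person matching with a clerical band) can be sketched as follows. This is a minimal illustration in the Fellegi-Sunter style, not the method actually implemented: the m- and u-probabilities, the thresholds and all function names are assumptions chosen for the example.

```python
import math
import re


def standardise(record):
    """Upper-case each value and drop non-alphanumerics so trivial
    formatting differences do not block a match."""
    return {k: re.sub(r"[^A-Z0-9]", "", str(v).upper()) for k, v in record.items()}


def exact_match(a, b, fields=("name", "dob", "gender", "postcode")):
    """True when every matching field agrees exactly."""
    return all(a[f] == b[f] for f in fields)


# Hypothetical (m, u) pairs: m = P(agree | true match), u = P(agree | non-match).
M_U = {
    "name": (0.95, 0.01),
    "dob": (0.97, 0.003),
    "gender": (0.99, 0.5),
    "postcode": (0.90, 0.001),
}


def match_weight(a, b, m_u=M_U):
    """Sum of log2 agreement/disagreement weights over the fields."""
    total = 0.0
    for field, (m, u) in m_u.items():
        if a[field] == b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(weight, upper=10.0, lower=0.0):
    """Illustrative thresholds: above 'upper' is an automatic match,
    below 'lower' a non-match, anything between goes to clerical review."""
    if weight >= upper:
        return "match"
    if weight >= lower:
        return "clerical"
    return "non-match"


census = standardise({"name": "Ann Smith", "dob": "1990-04-01", "gender": "F", "postcode": "EH1 3DG"})
pr_same = standardise({"name": "ann smith", "dob": "1990-04-01", "gender": "f", "postcode": "EH13DG"})
pr_near = standardise({"name": "Anne Smyth", "dob": "1990-04-01", "gender": "F", "postcode": "SW1A 1AA"})
pr_other = standardise({"name": "Bob Jones", "dob": "1975-12-25", "gender": "M", "postcode": "CF10 1EP"})
```

Pairs classified as "clerical" correspond to the candidate pairs passed to clerical resolution; pairs below the lower threshold feed the unmatched residual that the later steps try to account for.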
Interpretation: Who is actually present?
Non-URs
Census non-usual residents (matched and unmatched to PR)
PR records unmatched to Census respondents and assessed as not present
Matched to address deactivated in the field
Matched to unoccupied or vacant/absent/2nd res dummy
Matched to ARHF invalid address
UR elsewhere, this is Usual Address 1 Year Ago
Matched to Census UR elsewhere
Unaccounted
Unmatched and unaccounted for
PR records unmatched to Census respondents and assessed present
PR matched to Census missed/unaccounted-for address
PR/Census confirmed URs
PR/Census matched records
Census URs unmatched to PR
PR matched to address with ‘occupied’ dummy
PR validated through other administrative sources
Match rates in a ‘control’ LA
Female outcomes in a ‘control’ LA
Male outcomes in a ‘control’ LA
Match results in university towns
University town: female outcomes
University town: male outcomes
London: population churn
London churn: female outcomes
London churn: male outcomes
London LA: implied sex ratios
Data mining to address specific Census/PR anomalies
University Hall of Residence
GP registrations/Hall capacity
Female students living in halls in April 2011 by NHS Authority acceptance date
Male students living in halls in April 2011 by NHS Authority acceptance date
LA summary: proportion of F4s and proportion unresolved, within CCS postcode clusters
LA summary: concentration of Flag 4s in the PR residual
[Figure: Flag 4s in CCS clusters and in the PR residual. x-axis: % Flag 4s in LA CCS clusters; y-axis: % Flag 4s in unmatched PR residual; series: Controls, University, Inner London, Outer London]
LA summary: LA types, residual size and Flag 4s
[Figure: Size and Flag 4 status of the PR residual, by LA. x-axis: Percentage of PR unmatched/unresolved; y-axis: % Flag 4s in residual; series: Controls, Met, Inner London, Outer London]
Further investigations
• Planned analysis of the PR residuals’ addresses and households to identify ‘ghost’ records
• Longitudinal matching of the 2012 Patient Register to 2011 data to identify registrations that have been cancelled by GP practices in the year following Census
• Cluster analysis of all E&W LAs to see whether the typology of LAs identified through matching is mirrored in list inflation patterns nationally
• Multi-level modelling to summarise results, with individual and area level explanatory variables