Census 2011: A question of confidentiality

Census 2011 – A Question of Confidentiality
Statistical Disclosure Control for the 2011 Census
Carole Abrahams
ONS Methodology
BSPS – York, September 2011
Overview
• Brief introduction to SDC
• Census outputs & confidentiality
• Record swapping
• Data utility
• 2001 vs 2011
• Communal Establishments
• Further work
Introduction to SDC (1) - What is disclosure risk?
There is a disclosure risk when information is
published that could allow an intruder to
determine the identity or particulars of:
• an individual
• a household or family
• a business
• or another statistical unit
Introduction to SDC (2) - Examples of disclosure risk
• Identification disclosure
• Attribute disclosure (AD)
• Group disclosure
Introduction to SDC (3) - Statistical Disclosure Control
Statistical Disclosure Control (SDC) involves
either:
• introducing sufficient ambiguity or damage into
published statistics, or reducing their level of detail,
so that the risk of disclosing confidential
information is reduced to an acceptable level
and/or:
• controlling access to data
Census outputs and confidentiality
• Disclosure control of Census outputs required
by law
• Pledge on Census forms
• Visible variables
– can be used to identify an individual, family or household
– and so to find out something new about them
– Data Environment Analysis Service (DEAS)
• Sensitive variables
– defined by the Data Protection Act (DPA)
Risk – Utility balance
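[Figure: risk-utility map. Vertical axis: Disclosure Risk (information about confidential units), from Low to High. Horizontal axis: Data Utility (information about legitimate items), increasing to High. Labelled on the plot: No data, Released Data, Original Data, and a Maximum Tolerable Risk threshold.]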
SDC for Census 2001
• Random record swapping
• Lack of harmonisation and late changes to
agreed methodology
• Small cell adjustment (SCA) applied in England, Wales and
Northern Ireland, but not in Scotland
• SCA protected individual tables, but some
risk remained through differencing
• Effect on utility at low geographies and in
creating bespoke geographies
Census Geography
104 Delivery Groups (DGs) in England & Wales
• ≈ 4 LADs in a DG
• ≈ 20 MSOAs in an LAD
• ≈ 20 OAs in an MSOA
[Diagram: nested geography. OAs sit within MSOAs, MSOAs within LADs, and LADs within a DG.]
SDC for Census 2011
• Registrars General (RsG) agreement, November 2006
– Small cell counts allowed as long as there is ‘sufficient uncertainty’
– Main risk is attribute disclosure
• Targeted record swapping
– Targeted at ‘risky’ records
– Risk score looks at particular variables and takes account of
geography
– Risk scores for individuals are combined into a household score
– Households are swapped
– Households are swapped only as far as the geographic level at
which their risk is considered ‘high’
– Imputation is counted as part of the protection
Targeted swapping (1)
• Households
− Risk score on uniqueness/rarity of small number of key
variables at different geographies
• Probability of being swapped
− inversely related to area imputation rate
− positively related to household risk score
• Matching
− look for matches only as far afield as is necessary
− match on household size, and on other variables if possible
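A minimal sketch of the selection and matching ideas above, under stated assumptions. The slides give only the direction of each relationship, so the multiplicative formula, the `base_rate` parameter, the 0 to 1 risk scale and the `Household` fields are all hypothetical illustrations, not the ONS method.

```python
from dataclasses import dataclass


@dataclass
class Household:
    hh_id: str
    oa: str            # output area code
    size: int          # number of residents
    risk_score: float  # hypothetical scale: 0 (not risky) to 1 (very risky)


def swap_probability(risk_score, area_imputation_rate, base_rate=0.5):
    """Hypothetical form: rises with the household risk score and
    falls as the imputation rate of the area rises."""
    p = base_rate * risk_score * (1.0 - area_imputation_rate)
    return max(0.0, min(1.0, p))


def find_match(target, candidates):
    """Return the first candidate in a different OA with the same
    household size, or None if there is no such candidate."""
    for c in candidates:
        if c.oa != target.oa and c.size == target.size:
            return c
    return None


risky = Household("h1", "OA1", 3, 0.8)
pool = [
    Household("h2", "OA1", 3, 0.1),  # same OA, so not eligible
    Household("h3", "OA2", 2, 0.1),  # different size, so not eligible
    Household("h4", "OA2", 3, 0.2),  # eligible match
]
partner = find_match(risky, pool)
```

In this toy pool the partner found for `h1` is `h4`. A fuller version would prefer candidates that also agree on other variables and would widen the search area only when no match is found nearby.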
Targeted swapping – an example of how it works (1)
Household is in an area that has a high response rate, and therefore low imputation, so the area has a higher than average swapping rate.
• Risky within OA: swap with household in another OA in the same MSOA
• Risky within MSOA: swap with household in another MSOA in the same LA
• Risky within LA: swap with household in another LA within the delivery group
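The escalation in this example can be written as a small lookup. This is only a sketch: the slide does not say how a household flagged at several levels is treated, so using the highest flagged level is an assumption, and the names `LEVELS`, `SWAP_SCOPE` and `swap_scope` are hypothetical.

```python
# Geographic levels from smallest to largest, as used in the example.
LEVELS = ["OA", "MSOA", "LA"]

# Where to look for a swap partner, given the level at which the
# household is risky (wording follows the slide).
SWAP_SCOPE = {
    "OA": "another OA in the same MSOA",
    "MSOA": "another MSOA in the same LA",
    "LA": "another LA in the same delivery group",
}


def swap_scope(risky_at):
    """risky_at: set of levels at which the household is flagged risky.
    Assumption: the highest flagged level decides how far it is swapped.
    Returns None when the household is not flagged at any level."""
    highest = None
    for level in LEVELS:
        if level in risky_at:
            highest = level
    return SWAP_SCOPE[highest] if highest else None
```

For example, `swap_scope({"OA"})` gives "another OA in the same MSOA", while `swap_scope({"OA", "MSOA"})` gives "another MSOA in the same LA".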
Targeted swapping – an example of how it works (2)
Household found to be risky within its OA and is selected for swapping. It is only swapped between OAs in the same MSOA.
Households are matched on:
• Adults = 2
• Children = 1
• Pets = 2
Swapping & Sufficient uncertainty
• Level of swapping in an area is determined by the
level of non-response / imputation
• Swapping is lower where there are more imputed records
• Sufficient uncertainty has been assessed using
two measures:
– Percentage of real attribute disclosures (ADs)
protected by imputation & swapping
– Percentage of apparent ADs created
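One way to compute two such percentages from a table before and after protection is sketched below. The slides do not define the measures precisely, so the definitions here are assumptions for illustration: a table row counts as an AD when exactly one cell is non-zero, a real AD counts as protected when the published table no longer shows the same AD, and an apparent AD counts as created when it was not a real AD in the original table. The function names and toy counts are hypothetical.

```python
def ad_cell(row):
    """Return the index of the only non-zero cell if the row is an
    attribute disclosure (everyone shares one category), else None."""
    nonzero = [i for i, n in enumerate(row) if n > 0]
    return nonzero[0] if len(nonzero) == 1 else None


def ad_metrics(original, published):
    """Both arguments map a row key (e.g. an OA code) to a list of counts.
    Returns (% of real ADs protected, % of published ADs that are new)."""
    real = {k: ad_cell(r) for k, r in original.items()
            if ad_cell(r) is not None}
    apparent = {k: ad_cell(r) for k, r in published.items()
                if ad_cell(r) is not None}
    protected = [k for k, c in real.items() if apparent.get(k) != c]
    created = [k for k, c in apparent.items() if real.get(k) != c]
    pct_protected = 100.0 * len(protected) / len(real) if real else 0.0
    pct_created = 100.0 * len(created) / len(apparent) if apparent else 0.0
    return pct_protected, pct_created


# Toy tables: rows are areas, columns are categories of one variable.
original = {"OA1": [3, 0, 0], "OA2": [2, 1, 0],
            "OA3": [0, 4, 0], "OA4": [1, 1, 1]}
published = {"OA1": [2, 1, 0], "OA2": [3, 0, 0],
             "OA3": [0, 4, 0], "OA4": [1, 1, 1]}
pct_protected, pct_created = ad_metrics(original, published)
```

In the toy data, one of the two real ADs (OA1) disappears and one of the two published ADs (OA2) is new, so both percentages are 50.0.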
Effect of targeted swapping on data utility
[Figure: two panels, "LLTI by OA" and "LLTI by MSOA". LLTI = limiting long-term illness.]
• Typical effect of swapping on numbers of people with LLTI
• Based on 2001 data
• Utility higher at MSOA than at OA
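A toy illustration of why utility holds up better at MSOA level: when two households in the same MSOA exchange OAs, OA counts change but the MSOA totals do not. The records, field names and helper functions below are hypothetical, and real swapping involves many households and matching rules.

```python
from collections import Counter

# Hypothetical household records: "llti" is the number of residents
# with a limiting long-term illness.
households = [
    {"id": "h1", "oa": "OA1", "msoa": "M1", "llti": 2},
    {"id": "h2", "oa": "OA2", "msoa": "M1", "llti": 0},
    {"id": "h3", "oa": "OA3", "msoa": "M2", "llti": 1},
]


def llti_counts(hhs, level):
    """Total LLTI count per area at the given level ("oa" or "msoa")."""
    totals = Counter()
    for h in hhs:
        totals[h[level]] += h["llti"]
    return dict(totals)


def swap_oas(hhs, id_a, id_b):
    """Return a copy of the records in which two households have
    exchanged OA codes; the input list is left unchanged."""
    out = [dict(h) for h in hhs]
    by_id = {h["id"]: h for h in out}
    a, b = by_id[id_a], by_id[id_b]
    a["oa"], b["oa"] = b["oa"], a["oa"]
    return out


swapped = swap_oas(households, "h1", "h2")  # both households are in M1
oa_before = llti_counts(households, "oa")
oa_after = llti_counts(swapped, "oa")
msoa_before = llti_counts(households, "msoa")
msoa_after = llti_counts(swapped, "msoa")
```

Here `oa_before` and `oa_after` differ (the two LLTI cases move from OA1 to OA2), while `msoa_before` and `msoa_after` are identical.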
Summary of SDC methodology
• Main effect on utility will be for small cells at low level
geographies
• Tables will be consistent and additive
• Will use minimum average cell size
• All univariate residence-based tables at OA publishable
• There will be no small cell adjustment
• Tables will contain apparent small cells and apparent ADs,
but an intruder cannot find out anything about an individual
case with a “high degree of confidence”
Communal establishments
For client residents:
For staff residents:
Further work
• Minority population outputs
• Flow data
• Microdata
• Workplace tables
• Commissioned tables
• Contact: SDC [email protected]