PRAM - unece

WP 15
Experience of using a Post
Randomisation Method (PRAM) at ONS
Christine Bycroft, Katherine Merrett
Office for National Statistics, UK
Outline
• What is PRAM
• Why we needed to adapt the PRAM method
• Adapted PRAM Methodology
• Disclosure risks
• Effect on Data Quality
• Conclusions
What is PRAM
• PRAM is a disclosure control technique for categorical data
in microdata files.
• The values of a categorical variable are changed according
to a prescribed probability.
• Each new perturbed value may or may not be different from
the original value.
• For example, a person who is classified as a widow may be
re-classified as single.
Probability mechanism for PRAM
• The probability mechanism is described by an invertible
transition matrix P
• One P matrix for each variable
• Let $P = (p_{ij})$ be an $L \times L$ matrix for a variable having $L$
categories. The entries of the matrix are the conditional
probabilities
$$p_{ij} = \Pr(\text{new value} = j \mid \text{old value} = i)$$
• $p_{ii}$ is the probability of no change
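A minimal sketch of how a transition matrix drives the perturbation, using only the Python standard library. The category labels, the matrix values and the function name `pram` are illustrative assumptions, not taken from the ONS implementation.

```python
import random

def pram(values, categories, P, rng):
    """Return a perturbed copy of `values`.

    P[i][j] is Pr(new value = categories[j] | old value = categories[i]).
    """
    index = {c: i for i, c in enumerate(categories)}
    perturbed = []
    for v in values:
        row = P[index[v]]
        # Draw the new category, using row i of P as the distribution.
        perturbed.append(rng.choices(categories, weights=row, k=1)[0])
    return perturbed

categories = ["single", "married", "widowed"]
# Hypothetical transition matrix; each row sums to 1.
P = [
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.10, 0.80],
]

rng = random.Random(0)  # fixed seed so the run is repeatable
original = ["single"] * 5 + ["married"] * 5 + ["widowed"] * 5
perturbed = pram(original, categories, P, rng)
```

Each record is redrawn independently, so a perturbed value may equal the original one; the diagonal entries control how often that happens.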
Risk and data utility for PRAM
Disclosure risk
• PRAM offers protection by inflow and outflow:
• inflow from safe combinations of values to risky combinations
• outflow from risky combinations to safe combinations.
Data Utility
• the Invariant PRAM method preserves univariate frequencies
in expectation
• No control over joint distributions
- may create edit failures, e.g. a 14-year-old doctor
- or highly unusual combinations, e.g. a 17-year-old widow
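The invariance property can be checked numerically: if t is the vector of category counts, P is invariant when the expected counts after perturbation, the sum over i of t_i p_ij, equal t_j. The sketch below uses hypothetical counts and one simple invariant construction (keep the value with probability 1 - theta, otherwise redraw from the observed distribution); this construction is an assumption chosen for illustration, not the matrix used at ONS.

```python
def expected_frequencies(t, P):
    """Expected counts after PRAM: element j is sum_i t[i] * P[i][j]."""
    k = len(t)
    return [sum(t[i] * P[i][j] for i in range(k)) for j in range(k)]

def invariant_matrix(t, theta):
    """One simple invariant matrix: keep the value with probability
    1 - theta, otherwise redraw it from the observed distribution t / N."""
    n = sum(t)
    k = len(t)
    return [[(1 - theta) * (i == j) + theta * t[j] / n for j in range(k)]
            for i in range(k)]

t = [60, 30, 10]  # hypothetical category counts
P_inv = invariant_matrix(t, theta=0.2)
# A symmetric matrix that is NOT invariant for these counts.
P_sym = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]

print(expected_frequencies(t, P_inv))
print(expected_frequencies(t, P_sym))
```

With the invariant matrix the expected counts match t up to floating-point rounding; with the symmetric matrix they drift to roughly 52, 31 and 17. Invariance holds only in expectation, so a single realisation can still differ from the original counts.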
Why adapt the PRAM method?
• Applied to the 2001 Individual Sample of Anonymised
Records (SARs) drawn from the Census.
(population uniques are known from the Census records)
• Used recoding as first method to reduce risk
• Do not apply PRAM to the whole file
• Perturb only remaining high risk records (small proportion of
all records)
• Wish to preserve exact univariate frequencies, not just
expected values
• Wish to control joint distributions to minimise edit failures
and unusual combinations
Adapted PRAM Methodology
• Perturbing only those records which are high risk
• For the transition matrix P, we want to:
– Maximise the probability of changing values
– Preserve frequencies (i.e. P is invariant)
– Create perturbed records that are feasible and will not
result in highly unusual combinations
• Define a linear programming problem
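One plausible way to write these requirements as linear constraints on P is sketched below. The symbols $t_i$ (the count of high-risk records in category $i$) and $F$ (a set of transitions judged infeasible) are assumptions introduced for illustration; the slides do not give the exact constraint set used.

```latex
\begin{aligned}
\sum_{j=1}^{L} p_{ij} &= 1 && \text{for all } i && \text{(each row is a probability distribution)} \\
p_{ij} &\ge 0 && \text{for all } i, j \\
\sum_{i=1}^{L} t_i \, p_{ij} &= t_j && \text{for all } j && \text{(invariance: frequencies are preserved)} \\
p_{ij} &= 0 && \text{for } (i, j) \in F && \text{(infeasible transitions are excluded)}
\end{aligned}
```

The identity matrix satisfies the first three constraints trivially, so the constraints alone do not force any change; it is the objective function that moves probability off the diagonal.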
Adapted PRAM Methodology
• The LP routine minimised the objective function, subject to
constraints. The objective function is
$$\sum_{i} w_{ii}\, p_{ii} + \sum_{i \neq j} w_{ij}\, p_{ij}$$
where $W = (w_{ij})$ is a weight matrix, with a low weight for a
preferred transition and a high weight for a
non-preferred transition
• We have set up a Weight Matrix to avoid extreme
transitions.
• Rather than having extreme changes that might create
highly unusual individuals or invalid combinations, we prefer
to keep the values as they are.
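To make the role of the weights concrete, the sketch below evaluates the weighted sum of w_ij p_ij for three candidate matrices. The distance-based weights and the `stay_weight` parameter are hypothetical choices for an ordered variable such as an age band; they are not the weights used at ONS.

```python
def objective(W, P):
    """Sum of w_ij * p_ij over all i, j (diagonal and off-diagonal terms)."""
    k = len(P)
    return sum(W[i][j] * P[i][j] for i in range(k) for j in range(k))

def distance_weights(k, stay_weight):
    """Hypothetical weights for an ordered variable: moving |i - j|
    categories costs |i - j|; keeping the value costs `stay_weight`."""
    return [[stay_weight if i == j else abs(i - j) for j in range(k)]
            for i in range(k)]

W = distance_weights(4, stay_weight=2)

# Candidate matrices; all are invariant when the four counts are equal.
P_near = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]  # swap neighbours
P_far = [[0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0]]   # reverse the order
P_stay = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]  # no change

print(objective(W, P_near))  # 4
print(objective(W, P_far))   # 8
print(objective(W, P_stay))  # 8
```

With these weights a move to a neighbouring category (cost 1) is preferred to no change (cost 2), which in turn is preferred to a jump of three categories (cost 3), so a minimiser favours small changes and keeps a value rather than making an extreme transition.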
Implementation
• PRAM variables sequentially - greatest contribution to risk
first
• Define weight matrix for each variable
• LP solved in SAS, to get P transition matrix
• PRAM within control variables (e.g. PRAM age within marital
status categories)
• Implementation of pij probabilities preserves exact
frequencies
• Check for edit failures, and correct
• Perturbed records are flagged as being imputed (whether
changed or not)
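The slides do not say how the pij probabilities were realised so that frequencies are preserved exactly. As a stand-in that shares that property, the sketch below permutes the values of one variable among the high-risk records within each category of a control variable, and flags every treated record as imputed. This is a plain within-group permutation, not the ONS procedure; the field names and records are hypothetical.

```python
import random
from collections import Counter, defaultdict

def permute_within_groups(records, target, control, rng):
    """Shuffle `target` values among records sharing the same `control` value.

    Returns new record dicts, each flagged as imputed whether or not its
    value actually changed.
    """
    groups = defaultdict(list)
    for pos, rec in enumerate(records):
        groups[rec[control]].append(pos)

    out = [dict(rec, imputed=True) for rec in records]
    for positions in groups.values():
        values = [records[p][target] for p in positions]
        rng.shuffle(values)  # a permutation keeps every count unchanged
        for p, v in zip(positions, values):
            out[p][target] = v
    return out

# Hypothetical high-risk records only; the rest of the file is left untouched.
high_risk = [
    {"age": 34, "marital": "married"},
    {"age": 41, "marital": "married"},
    {"age": 58, "marital": "married"},
    {"age": 72, "marital": "widowed"},
    {"age": 80, "marital": "widowed"},
    {"age": 23, "marital": "single"},
]

rng = random.Random(1)
perturbed = permute_within_groups(high_risk, "age", "marital", rng)
```

Because values only move between records with the same marital status, the joint counts of age by marital status are identical before and after, and no age can land in a marital-status category where it did not already occur.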
Results: Disclosure risks
• Our aim was to only protect against attempts at exact
matching. Assumed that perturbing the value of one
variable in a high risk record provides sufficient protection
• Protection by high outflow, but low inflow
• Results showed high proportions changed, except for the last
variables in the sequence
• Acceptable, since these variables had the lowest overall
contribution to disclosure risk, and only a small number of
records were affected
Results: Data Quality
• Preservation of the univariate frequencies: excellent results
• Preservation of the multivariate frequencies:
a) very few records failed the edit checks
b) tables compared before and after PRAM
c) for each cell: ratio of the relative error due to
PRAM to the relative sampling error
Effect on Data Quality
Cell frequency before PRAM       0-5   6-10   11-20   21-40   41-90   91-150   150-500   500+   Total
Percentage of cells, ratio >1     35     25      24      13      15       10        17     10      16
Percentage of cells, ratio >2      9      8       6       4       5        4         7      4       5
Table 1: Percentage of cells across all tables with a ratio of the error due
to PRAM to the sampling error of greater than 1 and 2
• Results from 15 tables (nearly 3,000 cells)
• The effect of perturbation relative to sampling error
decreases as the cell size increases. Thus the
damage done by PRAM is greater for cells with low
frequencies.
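A rough sketch of the per-cell ratio, to show why small cells fare worse. The slides do not define the relative sampling error, so the formula below (a simple-random-sampling approximation) and the sample size are assumptions made for illustration.

```python
import math

def error_ratio(before, after, sample_size):
    """Relative error due to perturbation divided by relative sampling error.

    Assumes simple random sampling, so the relative standard error of a cell
    count n in a sample of size N is approximated by sqrt((1 - n/N) / n).
    """
    if before <= 0 or before >= sample_size:
        return None  # ratio is undefined for empty or all-inclusive cells
    rel_pram_error = abs(after - before) / before
    rel_sampling_error = math.sqrt((1 - before / sample_size) / before)
    return rel_pram_error / rel_sampling_error

N = 100_000  # hypothetical sample size
for before in (4, 25, 100, 400):
    after = before + 3  # the same absolute change in every cell
    print(before, round(error_ratio(before, after, N), 2))
```

For a fixed absolute change the relative error falls like 1/n while the relative sampling error falls only like 1/sqrt(n), so under this approximation the ratio shrinks as the cell count grows.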
Conclusions
• As used in this context on targeted records, PRAM is an
efficient method of data perturbation, which is well
controllable.
• Applying PRAM to a small proportion of the file has
allowed us to strike a good balance between recoding
and minimising the damage from perturbation.