Statistics Netherlands

Transcript Statistics Netherlands

Preserving edits when perturbing microdata for statistical disclosure control
Natalie Shlomo
Southampton Statistical Sciences Research Institute,
University of Southampton
&
Ton de Waal
Statistics Netherlands

Topics for Discussion

- Statistical Disclosure Control – an example
- PRAM – Post Randomization Method of perturbation
- Preserving edit constraints
- Micro and macro edit constraints
- Method for perturbing data and maintaining edits
- Evaluation study
- Results
- Impact on the risk of re-identification
- Discussion

Statistical Disclosure Control: Example

Record in microdata:
- Name of speaker: Ton de Waal
- Nationality: Dutch
- Nationality of co-author: Israeli
- Ever stole candy: yes

Name of speaker: direct identifier
Nationality and Nationality of co-author: indirect identifiers
Stealing candy: sensitive variable

Statistical Disclosure Control: Example

“Protected” record:
- Nationality: Dutch
- Nationality of co-author: Israeli
- Ever stole candy: yes

I’m still not happy!

Post-Randomization Perturbation Method

- The method changes values of categorical variables according to a prescribed probability transition matrix:

  $p_{ij} = P(\text{perturbed category} = j \mid \text{original category} = i)$

- The matrix $P = (p_{ij})$ is applied independently to each record.
- For each record, the value is changed or left unchanged according to these probabilities and a random draw.
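
A minimal sketch of how a transition matrix could be applied record by record, using only the Python standard library. The function name `apply_pram`, the three nationality categories and the matrix values are illustrative assumptions, not taken from the talk.

```python
import random

def apply_pram(values, categories, P, rng):
    """Perturb each value independently.

    P[i][j] is the probability that original category i is released
    as category j, so every row of P must sum to 1.
    """
    for row in P:
        if abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("each row of P must sum to 1")
    index = {c: i for i, c in enumerate(categories)}
    perturbed = []
    for v in values:
        row = P[index[v]]
        # One random draw per record, weighted by the row for its category.
        perturbed.append(rng.choices(categories, weights=row, k=1)[0])
    return perturbed

# Hypothetical categories and transition probabilities (assumptions).
categories = ["Dutch", "Israeli", "Canadian"]
P = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
original = ["Dutch", "Israeli", "Israeli", "Canadian"]
perturbed = apply_pram(original, categories, P, random.Random(0))
```

With an identity matrix every record keeps its value; the further the rows move away from the identity, the more records change.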

Properties of PRAM

- $t$: vector of original frequencies
- $t^*$: vector of perturbed frequencies
- $E(t^* \mid t) = tP$
- Invariant PRAM: $P$ selected in a special manner
- Condition of invariance: $t = tP$ $(= E(t^* \mid t))$
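
The invariance condition can be checked numerically. The sketch below uses one known two-stage construction: starting from any transition matrix P, form Q with entries Q[j][l] = t[l] P[l][j] / (tP)[j] and use R = PQ, which satisfies tR = t. This is an assumption for illustration; the talk does not say how its invariant matrices were obtained, and the frequencies and matrix values are made up.

```python
def expected_frequencies(t, P):
    """Return tP, the expected perturbed frequency vector E(t*|t)."""
    k = len(t)
    return [sum(t[i] * P[i][j] for i in range(k)) for j in range(k)]

def make_invariant(t, P):
    """Two-stage construction R = P Q, which satisfies t R = t.

    Assumes every entry of tP is positive.
    """
    k = len(t)
    s = expected_frequencies(t, P)
    Q = [[t[l] * P[l][j] / s[j] for l in range(k)] for j in range(k)]
    return [[sum(P[i][j] * Q[j][l] for j in range(k)) for l in range(k)]
            for i in range(k)]

# Illustrative frequencies and transition matrix (assumptions).
t = [50.0, 30.0, 20.0]
P = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]]

tP = expected_frequencies(t, P)   # differs from t: P is not invariant
R = make_invariant(t, P)
tR = expected_frequencies(t, R)   # equals t up to rounding
```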

Example of using PRAM

Record after PRAM:
- Nationality: Dutch
- Nationality of co-author: Canadian
- Ever stole candy: yes

Edit constraints

- Changing values of categorical variables will cause edited records to fail edit constraints:
  - the data are of low utility
  - an inconsistent record signals to a potential attacker that the record was perturbed
- Example of an edit constraint:
  - “two authors of a paper at the UN/ECE Work Session in Ottawa are not from Canada and the Netherlands”
- The attacker knows that the record in our example has been perturbed

Preserving Edit Constraints

Clean Data Set -> Perturbation -> Failed Edits -> Imputation -> Clean Data Set

- Take edits into account as much as possible while applying PRAM
- After PRAM has been applied: correct remaining edit failures by hot-deck imputation
- Correct records while keeping the perturbed variables fixed
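
A rough sketch of the correction step, assuming a simple random hot-deck in which the perturbed variable stays fixed and the other variable in the edit is copied from a donor record that passes the edit and shares the perturbed value. The field names, the donor rule and the function names are hypothetical; the talk does not specify the hot-deck procedure in this detail.

```python
import random

def passes_edits(record):
    # Edit from the talk's example: the two authors are not from
    # Canada and the Netherlands.
    pair = {record["nationality"], record["coauthor_nationality"]}
    return pair != {"Dutch", "Canadian"}

def hot_deck_correct(records, fixed_var, impute_var, rng):
    """Repair failing records by copying impute_var from a clean donor
    that has the same value of fixed_var (the perturbed variable)."""
    donors = [r for r in records if passes_edits(r)]
    corrected = []
    for r in records:
        fixed = dict(r)
        if not passes_edits(r):
            pool = [d for d in donors if d[fixed_var] == r[fixed_var]]
            if pool:  # without a donor the record is left unchanged
                fixed[impute_var] = rng.choice(pool)[impute_var]
        corrected.append(fixed)
    return corrected

records = [
    {"nationality": "Dutch", "coauthor_nationality": "Canadian", "stole_candy": "yes"},
    {"nationality": "Canadian", "coauthor_nationality": "Canadian", "stole_candy": "no"},
    {"nationality": "Dutch", "coauthor_nationality": "Israeli", "stole_candy": "yes"},
]
corrected = hot_deck_correct(records, "coauthor_nationality", "nationality",
                             random.Random(0))
```

In this toy data the only donor with a Canadian co-author is the second record, so the failing record receives the nationality Canadian while its perturbed co-author value is untouched.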

Example

Record after PRAM:
- Nationality: Dutch
- Nationality of co-author: Canadian
- Ever stole candy: yes

Edited record after PRAM:
- Nationality: Canadian
- Nationality of co-author: Canadian
- Ever stole candy: yes

I’m happy!

Micro Edit Constraints

- Data: 1995 Israel Census sample data, 35,773 individuals aged 15 and over in 15,468 households across all regions and characteristics
- Variable age perturbed – 86 categories
- 14 micro-edits, such as:
  - E1: {Under 16 and ever married} = Failure
  - E2: {Age of marriage under 14} = Failure
  - E3: {Age difference between spouses over 25} = Failure
  - E4: {Age of mother under 14} = Failure
  - E5: {Year of immigration less than year of birth} = Failure
  - E7: {Under 16 and relation is spouse or parent} = Failure
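
Edits of this kind are record-level logical checks. The sketch below encodes E1, E2, E5 and E7 as listed; E3 and E4 need data on other household members and are left out. All field names are hypothetical and do not come from the census file.

```python
def failed_edits(rec):
    """Return the labels of the micro-edits that a record fails."""
    failures = []
    if rec["age"] < 16 and rec["ever_married"]:
        failures.append("E1")
    if rec["age_at_marriage"] is not None and rec["age_at_marriage"] < 14:
        failures.append("E2")
    if (rec["year_of_immigration"] is not None
            and rec["year_of_immigration"] < rec["year_of_birth"]):
        failures.append("E5")
    if rec["age"] < 16 and rec["relation"] in {"spouse", "parent"}:
        failures.append("E7")
    return failures

# A record whose age was perturbed down to 15 while the other fields stayed.
perturbed_record = {
    "age": 15, "ever_married": True, "age_at_marriage": 22,
    "year_of_birth": 1950, "year_of_immigration": None, "relation": "spouse",
}
clean_record = {
    "age": 45, "ever_married": True, "age_at_marriage": 22,
    "year_of_birth": 1950, "year_of_immigration": 1960, "relation": "spouse",
}
print(failed_edits(perturbed_record))
print(failed_edits(clean_record))
```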

Macro Edit Constraints

Let $D$ be a dataset and $D(c)$ the cell frequency of cell $c$.

- Hellinger Distance:

  $$HD(D_{orig}, D_{pert}) = \sqrt{\frac{1}{2n} \sum_{c \in C} \left( \sqrt{D_{orig}(c)} - \sqrt{D_{pert}(c)} \right)^2}$$

  - Symmetrical distance metric that measures the difference between the original and perturbed probability distributions
  - Information loss is indicated by a larger Hellinger Distance

- Cramer’s V:

  $$V_{1,2} = \sqrt{\frac{\chi^2}{N \cdot \min((C_1 - 1), (C_2 - 1))}}$$

  and

  $$CV_{i,j}(D_{orig}, D_{pert}) = V_{i,j}(D_{orig}) - V_{i,j}(D_{pert})$$

  - Measures association between two categorical variables
  - Information loss is defined by the reduction in Cramer’s V
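
Both measures are straightforward to compute from cell counts. The sketch below assumes the standard forms: a Hellinger distance with square roots of the cell frequencies scaled by the total count n, and Cramer's V as the square root of chi-square over N times the smaller of (rows - 1) and (columns - 1). The function names and the toy counts are illustrative.

```python
import math

def hellinger(orig, pert):
    """orig and pert map each cell to its frequency; both sum to n."""
    n = sum(orig.values())
    cells = set(orig) | set(pert)
    total = sum((math.sqrt(orig.get(c, 0)) - math.sqrt(pert.get(c, 0))) ** 2
                for c in cells)
    return math.sqrt(total / (2 * n))

def cramers_v(table):
    """table is a list of rows of counts; assumes no empty row or column."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * min(len(table) - 1, len(table[0]) - 1)))

# Toy tables (made up): perfect association before, none after perturbation.
v_orig = cramers_v([[10, 0], [0, 10]])
v_pert = cramers_v([[5, 5], [5, 5]])
loss_in_association = v_orig - v_pert
```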

Macro Edit Constraints

- Impact on $R^2$ through

  $$SSB = \sum_i n_i (\bar{x}_{i\cdot} - \bar{x})^2$$

- Measures the proportion of “between” variance out of the total variance, i.e. homogeneity of the dependent variable within groupings
- Information loss is defined as the proportional reduction in the “between” sum of squares for perturbed groupings compared to original groupings: $SSB_{pert} / SSB_{orig}$
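
As a worked illustration, the between-group sum of squares can be computed once with the original grouping and once with the perturbed grouping. The six records below are made up purely for the arithmetic; `ssb` is a hypothetical helper name.

```python
def ssb(values, groups):
    """Between-group sum of squares: sum_i n_i * (mean_i - overall_mean)^2."""
    overall = sum(values) / len(values)
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    return sum(len(vs) * (sum(vs) / len(vs) - overall) ** 2
               for vs in by_group.values())

# 1 = unemployed, 0 = not unemployed (toy data, not from the study).
unemployed = [1, 1, 0, 0, 0, 0]
age_orig = ["15-17", "15-17", "25-34", "25-34", "45-54", "45-54"]
age_pert = ["15-17", "25-34", "25-34", "15-17", "45-54", "45-54"]

ratio = ssb(unemployed, age_pert) / ssb(unemployed, age_orig)
```

Here the original grouping gives SSB = 4/3 and the perturbed grouping gives SSB = 1/3, so the ratio is 0.25: moving two records between age groups pulls the group means toward the overall mean.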

Methods of Perturbation

- Perturb the variable randomly across all categories.
- Perturb within a limited range of the variable, i.e. divide the variable into subgroups and calculate a transition matrix for each subgroup.
- Perturb the variable within control groups defined by other highly correlated variables.
- Compound highly correlated variables.

Evaluation Study

Perturbation of 86 categories of age:
- Random perturbation across all ages
- Age perturbed within categories of marital status (married, divorced, widowed and single). Invariant matrices calculated for each category.
- Age perturbed within categories of marital status x broad age groupings (15-17, 18-24, 25-44, 45-64, 65-74, 75+)
- Age perturbed within narrow age groupings (15-17, 18-24, 25-34, 35-44, 45-54, 55-64, 65-69, 70-74, 75+)
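
Restricting perturbation to a grouping amounts to a block-structured transition matrix: an age can only move to another age in the same group. The sketch below illustrates this for the narrow age groupings with a deliberately simple rule (keep the age with probability `keep_prob`, otherwise draw uniformly from the other ages in the band). That rule, the upper bound of 100 for the 75+ band and the function names are assumptions; the study used invariant matrices, which this sketch does not reproduce.

```python
import random

# Narrow age groupings from the evaluation study; the upper bound of the
# open-ended 75+ band is an assumption made for this sketch.
BANDS = [(15, 17), (18, 24), (25, 34), (35, 44), (45, 54),
         (55, 64), (65, 69), (70, 74), (75, 100)]

def band_of(age):
    for lo, hi in BANDS:
        if lo <= age <= hi:
            return lo, hi
    raise ValueError(f"age {age} is outside all bands")

def perturb_age_within_band(age, keep_prob, rng):
    """Keep the age with probability keep_prob, otherwise replace it with
    a different age drawn uniformly from the same band."""
    lo, hi = band_of(age)
    if rng.random() < keep_prob:
        return age
    others = [a for a in range(lo, hi + 1) if a != age]
    return rng.choice(others)

rng = random.Random(0)
ages = [16, 23, 30, 44, 52, 61, 67, 72, 80]
perturbed_ages = [perturb_age_within_band(a, 0.7, rng) for a in ages]
```

Because every perturbed age stays in its original band, edits that depend only on the band (for example, being under 16) fail less often than under random perturbation across all ages.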

Number of Edit Failures

Method of Perturbation                        No edit failures
Random                                        31,983
Within Marital Status                         33,143
Within Marital Status and Broad Age Groups    35,023
Within Narrow Age Groups                      35,440

[Figure: number of records with 1, 2, 3 and 4+ errors, by perturbation method (Random, Marital Status, Marital Status and Age Groups, Age Groups)]

Note: large reduction in the number of micro-edit failures

Macro Edits Results

[Figure: Hellinger Distance - District*Age*Sex, by perturbation method (Random, Marital Status, Marital Status and Age Groups, Age Groups). Annotation: distortion to distribution]

[Figure: Reduction in Cramer's V - In Labour Force*Age, by perturbation method (Random, Marital Status, Marital Status and Age Groups, Age Groups). Annotation: loss in measures of association]

Macro Edits Results

[Figure: Percent Unemployed by Age Group (15-17, 18-24, 25-34, 35-44, 45-54, 55-64, 65-69, 70-74, 75+), comparing "Unemployed - original age" with "Unemployed - random perturbed age". Annotation: impact on R2 and shrinking means]

[Figure: Ratio of Between Variance, Percent Unemployed*Age: ratio of SSB by perturbation method (Random, Marital Status, Marital Status and Age Groups, Age Groups). Annotation: loss in homogeneity within age]

Disclosure Risk Measures

- Percent of unperturbed records in small cells of the key
- Without perturbation, the expected number of correct matches would be measured by $1/F_k$, where $F_k$ is the number of records in cell $k$. Because only a proportion $p_d$ of records were likely not to be perturbed, the expected number of correct matches is $p_d/F_k$.
- Proportion of records perturbed within a 5-year age difference: the higher the proportion, the more likely it is to obtain a correct linkage
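
A small sketch of the matching measure, assuming that the per-record contribution p_d / F_k is summed over all records, optionally restricted to small cells. The aggregation rule, the `max_cell_size` parameter and the toy key values are assumptions for illustration.

```python
from collections import Counter

def expected_correct_matches(keys, p_d, max_cell_size=None):
    """Sum p_d / F_k over records, where F_k is the number of records
    sharing the record's key value (its cell)."""
    cell_sizes = Counter(keys)
    total = 0.0
    for key in keys:
        f_k = cell_sizes[key]
        if max_cell_size is None or f_k <= max_cell_size:
            total += p_d / f_k
    return total

# Toy key values: three cells of sizes 2, 1 and 4.
keys = ["a", "a", "b", "c", "c", "c", "c"]
without_perturbation = expected_correct_matches(keys, p_d=1.0)
with_perturbation = expected_correct_matches(keys, p_d=0.75)
```

Without perturbation each cell contributes exactly one expected match, so the toy data gives 3; with 75 percent of records left unperturbed the figure drops to 2.25.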

Disclosure Risk Results

[Figure: Percent Unperturbed Records, by perturbation method (Random, Marital Status, Marital Status and Age Groups, Age Groups). Annotation: slightly higher percent unperturbed records for more controlled PRAM]

[Figure: Expected Number of Correct Matches, by perturbation method (Original, Random, Marital Status, Marital Status and Age Groups, Age Groups). Annotation: no increased disclosure risk for more controlled PRAM]

Disclosure Risk Results

[Figure: Percent Perturbed Within 5 Years, by perturbation method (Random, Marital Status, Marital Status and Age Groups, Age Groups). Annotation: large increase in percent perturbed within 5 year age band for more controlled PRAM]

- The data protector must weigh the increased disclosure risk against the benefits of obtaining higher utility data

Discussion

- Controls in the perturbation raise the utility of the data by minimizing micro and macro edit failures.
- The risk of re-identification increases slightly, depending on the risk measure and the disclosure risk scenario.
- Protecting microdata by PRAM alone leaves a high disclosure risk in the microdata, so PRAM should be combined with other data masking techniques.
- There is a need for more sophisticated methods for correcting edit failures on perturbed microdata, based on the principle of minimum change, in order to improve the quality and utility of the data.