3.1 A Narrow Definition of Disclosure Risk


The Application of the Concept
of Uniqueness for Creating
Public Use Microdata Files
Jay J. Kim, U.S. National Center for Health Statistics
Dong M. Jeong, Korea National Statistical Office
Contents





Introduction
Intruders and Disclosure
Measures of Disclosure Risk
1. Narrow Definition of Disclosure Risk
2. Broader Definition of Disclosure Risk
Evaluation of Definition of Disclosure Risk
Concluding Remarks
1. Introduction.



Government agencies release microdata files from
their survey data or administrative records data.
Large amounts of information on individuals are
available to many organizations and data users, who
can become “intruders”.
If a public use microdata file (PUMF) is released,
intruders can try to match their records with the
ones from the PUMF and gain access to new
information.



Intruders use common variables between PUMF
and their files for linking the records on two files,
which are called “key variables” or “matching
variables”.
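A minimal sketch of how such linking on key variables could work, assuming both files are lists of dictionaries. The variable names, the records, and the helper `link_on_keys` are hypothetical and only illustrate the idea of matching on shared variables.

```python
from collections import defaultdict

# Hypothetical key variables shared by both files (illustrative names only).
KEY_VARS = ("gender", "age", "marital_status")


def link_on_keys(intruder_file, pumf, key_vars=KEY_VARS):
    """Return (intruder_record, pumf_record) pairs that agree on every key variable."""
    # Index the PUMF by its key-variable values for fast lookup.
    index = defaultdict(list)
    for rec in pumf:
        index[tuple(rec[k] for k in key_vars)].append(rec)

    matches = []
    for rec in intruder_file:
        for hit in index.get(tuple(rec[k] for k in key_vars), []):
            matches.append((rec, hit))
    return matches


# Made-up records: the intruder knows a name, the PUMF carries income.
intruder_file = [{"name": "A", "gender": "F", "age": 34, "marital_status": "married"}]
pumf = [
    {"gender": "F", "age": 34, "marital_status": "married", "income": 52000},
    {"gender": "M", "age": 51, "marital_status": "single", "income": 41000},
]

matches = link_on_keys(intruder_file, pumf)
```

A single match on a rare combination of key values is what lets the intruder attach the new information (here, income) to a named person.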
In the U.S., laws such as Title 13 stipulate
protection of the confidentiality of many types of
data.
Thus, the data disseminating agencies must protect
the confidentiality of the individuals on the
PUMFs. On the other hand, they should not ignore
the data users’ needs, i.e., the utility of the data files.




Here, we develop probability models quantifying
disclosure risk for a microdata file.
This is a modification of the Marsh, et al (1991)
procedure.
The model can use population and sample
“uniques” only, or it can also include population
twins or triplets.
We will show the results of applying the probability
model, using population and sample uniques only, to create
disclosure-limited microdata files from
the 2005 Korean demographic census data.
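As a rough illustration of the terms "uniques", "twins" and "triplets", the sketch below counts how many key-variable combinations are shared by exactly one, two or three records. The records and the helper names are made up for the example and are not taken from the census data.

```python
from collections import Counter


def class_sizes(records, key_vars):
    """Count how many records share each combination of key-variable values."""
    return Counter(tuple(r[k] for k in key_vars) for r in records)


def count_classes_of_size(records, key_vars, size):
    """Number of key-variable combinations shared by exactly `size` records."""
    return sum(1 for n in class_sizes(records, key_vars).values() if n == size)


# Toy population: one unique, one pair of twins, one set of triplets.
population = [
    {"gender": "F", "age": 101},
    {"gender": "M", "age": 40}, {"gender": "M", "age": 40},
    {"gender": "F", "age": 40}, {"gender": "F", "age": 40}, {"gender": "F", "age": 40},
]
keys = ("gender", "age")

uniques = count_classes_of_size(population, keys, 1)
twins = count_classes_of_size(population, keys, 2)
triplets = count_classes_of_size(population, keys, 3)
```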
2. Intruders and Disclosure
Potential intruders:
i). Organizational intruders, e.g., credit card
companies, mortgage departments of banks,
insurance companies, credit bureaus, trade
associations, etc.
ii). Individual intruders: with readily available high
powered computers, anyone can assemble his
own database using information in the public
domain and become an intruder.

Two types of disclosure:
i). Identity disclosure – identification.
If the intruder is a journalist and tries to
embarrass the data disseminating agencies, his
claim that he has been successful in identifying
someone on their PUMF would be sufficient.
If the intruder publicizes the findings in the news
media, it could have a devastating effect on the
agencies’ data collection efforts.
ii). Attribute disclosure;
After identification is made, one can gain new
sensitive information.
For defining a measure of disclosure risk, we will
consider that identity disclosure is the same as
disclosure.
3. Measures of Disclosure Risk

Define
P(a) = the probability of key variables being
recorded identically in both PUMF and
intruder’s file;
P(b|a) = the probability that an individual
appears in the PUMF, which equals the sampling
fraction for that individual in the PUMF;
P(c|a,b) = the probability of population unique;
and
P(d|a,b,c) = the probability of verifying
population unique.

Marsh, et al (1991) defined the probability of
correct identification of an individual as
P(a) P(b|a) P(c|a,b) P(d|a,b,c)

We modify the model of Marsh et al.
In their formula, we assume that
i). There are no recording or classification errors
for the values of the key variables, i.e., P(a) = 1.

ii). We can verify correctly population uniqueness
with certainty, i.e., P(d|a,b,c) = 1.
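The effect of the two assumptions can be sketched as follows. The function names are hypothetical, and the input values (a 2 percent sampling fraction and a 0.54 percent share of population uniques) are borrowed from the evaluation later in the talk purely as an illustration.

```python
def marsh_probability(p_a, p_b_given_a, p_c_given_ab, p_d_given_abc):
    """Product P(a) P(b|a) P(c|a,b) P(d|a,b,c) from Marsh et al. (1991)."""
    factors = (p_a, p_b_given_a, p_c_given_ab, p_d_given_abc)
    if any(not 0.0 <= p <= 1.0 for p in factors):
        raise ValueError("each probability must lie in [0, 1]")
    result = 1.0
    for p in factors:
        result *= p
    return result


def simplified_probability(sampling_fraction, p_population_unique):
    """Same product with P(a) = 1 (no recording errors) and P(d|a,b,c) = 1."""
    return marsh_probability(1.0, sampling_fraction, p_population_unique, 1.0)


# Illustrative inputs only.
risk = simplified_probability(sampling_fraction=0.02, p_population_unique=0.0054)
```

With both assumptions in place, only the sampling fraction and the probability of population uniqueness remain in the product.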

Disclosure can occur when all the following 5
conditions are met:
i). An individual is unique in a population based on
key variables.
If the intruder’s file is a 100 percent population
file, he can establish uniqueness of a certain
individual by using his file.
ii). The individual is on the PUMF.
iii). The individual is on intruder’s file.
An intruder can have information on key
variables for a specific person and try to
examine whether that person appears in the
PUMF. In this case, intruder’s file has a single
record.
iv). The individual is unique on the PUMF, and
v). The individual is unique on the intruder’s file.
Define
A = an individual of interest;
F1 = PUMF;
F2 = an intruder’s file;
P1 = unique class in the population;
S1F1 = unique class in PUMF;
and
S1F2 = unique class in intruder’s file.
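The five conditions can be checked mechanically in a simulation. The sketch below assumes each record carries a hypothetical `id` field so that membership of individual A in a file can be tested; a real PUMF would not contain such an identifier, and the toy records are invented.

```python
def key_of(record, key_vars):
    """Tuple of key-variable values for one record."""
    return tuple(record[k] for k in key_vars)


def disclosure_conditions(person, population, pumf, intruder_file, key_vars):
    """Evaluate the five conditions for individual A (`person`)."""
    target = key_of(person, key_vars)

    def class_size(records):
        # Number of records sharing A's key-variable values.
        return sum(1 for r in records if key_of(r, key_vars) == target)

    def contains(records):
        # Uses the hypothetical `id` field, available only in a simulation.
        return any(r["id"] == person["id"] for r in records)

    in_pumf = contains(pumf)
    in_intruder = contains(intruder_file)
    return {
        "unique_in_population": class_size(population) == 1,          # A in P1
        "in_pumf": in_pumf,                                            # A in F1
        "in_intruder_file": in_intruder,                               # A in F2
        "unique_in_pumf": in_pumf and class_size(pumf) == 1,           # A in S1F1
        "unique_in_intruder_file": in_intruder and class_size(intruder_file) == 1,  # A in S1F2
    }


population = [
    {"id": 1, "gender": "F", "age": 101},
    {"id": 2, "gender": "M", "age": 40},
    {"id": 3, "gender": "M", "age": 40},
]
pumf = [population[0], population[1]]
intruder_file = [population[0], population[2]]
keys = ("gender", "age")

conditions = disclosure_conditions(population[0], population, pumf, intruder_file, keys)
disclosure_possible = all(conditions.values())
```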
3.1 A Narrow Definition of
Disclosure Risk
This definition depends on the population and
sample uniques only.
3.1.1 Assume an Intruder Conducts a Fishing Expedition

The probability of correct identification:
$$P[(A \in F_1) \cap (A \in F_2) \cap (A \in S_{1F1}) \cap (A \in S_{1F2}) \cap (A \in P_1)] \tag{1}$$
If an individual is a population unique, he is also a
sample unique, i.e.,
$$P[(A \in S_{1F1}) \cap (A \in S_{1F2}) \cap (A \in P_1)] = P[(A \in S_{1F1}) \cap (A \in S_{1F2}) \mid (A \in P_1)] \, P(A \in P_1) = P(A \in P_1)$$
Equation (1) reduces to
$$P[(A \in F_1) \cap (A \in F_2) \cap (A \in P_1)]$$
which can be further re-expressed as follows:
$$P[(A \in F_1) \cap (A \in F_2) \mid (A \in P_1)] \, P(A \in P_1) \tag{2}$$
The event that A is unique in the population is
independent of whether A is selected in the sample or
not. Thus, equation (2) reduces to
$$P[(A \in F_1) \cap (A \in F_2)] \, P(A \in P_1) \tag{3}$$
The event that A is in the PUMF is usually
independent of the event that A is in the intruder’s
file. In this case, equation (3) can be simplified as
$$P(A \in F_1) \, P(A \in F_2) \, P(A \in P_1) \tag{4}$$
However, a survey can be a subset of another survey.
For example, the U.S. Census Bureau’s PUMF is a subset
of its census sample. Thus, if F1 is a subset of F2,
$$P[(A \in F_1) \cap (A \in F_2)] = P(A \in F_1)$$
and equation (3) becomes
$$P(A \in F_1) \, P(A \in P_1) \tag{5}$$
Also,
$$P[(A \in F_1) \cap (A \in F_2)] = P(A \in F_2) \, P[(A \in F_1) \mid (A \in F_2)] \tag{6}$$
so the disclosure risk can be written as
$$P(A \in P_1) \, P(A \in F_2) \times (\text{Subsampling Rate of } F_1 \text{ from } F_2)$$
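The two cases in equations (4) to (6) can be compared numerically. This is a sketch with hypothetical function names and illustrative probabilities, not the authors' implementation.

```python
def risk_independent_files(p_f1, p_f2, p_unique):
    """Equation (4): PUMF and intruder's file are selected independently."""
    return p_f1 * p_f2 * p_unique


def risk_nested_files(p_f2, subsampling_rate, p_unique):
    """Equations (5) and (6): the PUMF (F1) is a subsample of the intruder's file (F2).

    P(A in F1) = P(A in F2) * (subsampling rate of F1 from F2).
    """
    p_f1 = p_f2 * subsampling_rate
    return p_f1 * p_unique


# Illustrative values: a 2% PUMF, a 10% intruder file, 0.5% population uniques.
independent = risk_independent_files(p_f1=0.02, p_f2=0.10, p_unique=0.005)
nested = risk_nested_files(p_f2=0.10, subsampling_rate=0.20, p_unique=0.005)
```

With these inputs the nested case gives a risk ten times larger than the independent case, because being in F1 already implies being in F2.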
3.1.2 Assuming an Intruder Already
Knows That A is in PUMF
If the intruder has response knowledge, then
$$P(A \in F_1) = 1$$
Thus, from equation (4), the disclosure risk will be
$$P(A \in F_2) \, P(A \in P_1)$$
3.2 Broader Definition of Disclosure
Risk

Even if an individual is not unique in the
population, he still can be identified with additional
information.

Suppose C individuals in the population have the
same values of the key variables and matching to
any one of them is equally likely.
Define
PC = Equivalence class of size C in the population.
Then the probability of correct identification is
$$\frac{1}{C} \, P[(A \in F_1) \cap (A \in F_2) \cap (A \in S_{1F1}) \cap (A \in S_{1F2}) \cap (A \in P_C)]$$
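A small sketch of the 1/C adjustment, assuming the joint probability has already been computed elsewhere. The function name and the numbers are hypothetical.

```python
def broader_risk(p_joint, class_size):
    """Scale a joint probability by 1/C for an equivalence class of size C."""
    if class_size < 1:
        raise ValueError("class_size must be at least 1")
    if not 0.0 <= p_joint <= 1.0:
        raise ValueError("p_joint must lie in [0, 1]")
    return p_joint / class_size


# With C = 1 the factor 1/C equals 1; twins (C = 2) halve the probability.
narrow = broader_risk(0.0001, 1)
twins = broader_risk(0.0001, 2)
```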
4. Evaluation of Disclosure Risk

We used the measures of disclosure risk developed
here in creating PUMS from the 2005 Korean census
data.

We show the results of the applications on the 2005
census data from Choongchung (CC) Province.

The masking scheme used is coarsening (grouping) categories.
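How grouping categories reduces the number of uniques can be sketched as below. The 5-year age bands mirror the grouping used later in the talk, while the records and helper names are invented for illustration.

```python
from collections import Counter


def group_age(age, width=5):
    """Map a single year of age to a band label such as '30-34'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def count_uniques(records, key_vars):
    """Number of records whose key-variable combination appears exactly once."""
    sizes = Counter(tuple(r[k] for k in key_vars) for r in records)
    return sum(1 for n in sizes.values() if n == 1)


people = [
    {"gender": "F", "age": 30},
    {"gender": "F", "age": 31},
    {"gender": "F", "age": 33},
    {"gender": "M", "age": 30},
    {"gender": "M", "age": 58},
]
keys = ("gender", "age")

before = count_uniques(people, keys)
coarse = [{**p, "age": group_age(p["age"])} for p in people]
after = count_uniques(coarse, keys)
```

In this toy file every record is unique on single years of age, while only the two men remain unique after grouping.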

Korea National Statistical Office (KNSO) creates
the 2 percent PUMFs by taking a 20 percent
subsample of the 10 percent census sample,
(0.1 x 0.2 = 0.02).
F1 : 2 percent PUMF.
F2 : 10 percent census sample.
Table 1. Population Size, and Number of Households and
Housing Units – CC Province

                       Population   Households   Housing Units
Census                  1,798,397      660,526         586,757
Census Sample (10%)       189,505       71,091          65,398
2% Microdata               38,027       14,218          13,038

Key variables used: gender (2); age (111); marital
status (4); relationship to householder (14);
household type (5); tenure (6); building type of
residence (12); and type of housing and number of
floors of the building (12).
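To give a feel for why uniques arise, the sketch below multiplies the category counts listed above to get the number of possible key-variable combinations and compares it with the CC Province population from Table 1. This back-of-the-envelope comparison is an added illustration, not part of the authors' method.

```python
from math import prod

# Category counts as listed for the eight key variables.
categories = {
    "gender": 2,
    "age": 111,
    "marital status": 4,
    "relationship to householder": 14,
    "household type": 5,
    "tenure": 6,
    "building type of residence": 12,
    "type of housing and number of floors": 12,
}

possible_cells = prod(categories.values())
population = 1_798_397  # CC Province population from Table 1

cells_per_person = possible_cells / population
```

There are far more possible combinations than people, so some individuals occupying a cell alone is to be expected before any grouping.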

The probability of a population unique is calculated
using the 100 percent census file.

Without grouping, the number of uniques is 9,664.
It is 0.54 % of 1.8 million.

If we assume that the intruder has a 10 percent
census sample file, the disclosure risk is
$$0.1 \times 0.2 \times 0.0054 \approx 0.00011$$
However, whole blocks are selected in the 10
percent census sample, thus residents in the sample
blocks know that their neighbors are also in the
sample. To those who have response knowledge,
the disclosure risk is
$$0.2 \times 0.0054 \approx 0.0011$$
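The arithmetic behind these two figures can be retraced as follows, using the counts quoted above (9,664 uniques in a population of 1,798,397). The variable names are illustrative.

```python
population = 1_798_397   # CC Province population (Table 1)
uniques = 9_664          # population uniques without grouping

p_unique = round(uniques / population, 4)   # share of population uniques, 4 decimals

p_in_census_sample = 0.10   # intruder holds the 10 percent census sample (F2)
subsampling_rate = 0.20     # PUMF (F1) is a 20 percent subsample of F2

# Intruder has the 10 percent sample but no response knowledge.
risk_fishing = p_in_census_sample * subsampling_rate * p_unique

# Intruder knows the person is in the 10 percent sample (response knowledge).
risk_response_knowledge = subsampling_rate * p_unique
```

Rounded to two significant figures, these reproduce the 0.00011 and 0.0011 reported above.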
Table 2. Number of Unique Persons before Grouping
Categories
Variables combined: Gender, Age, Relationship, Marital Status.
# of Vars per combination: 1 (four combinations), 2 (six), 3 (four), 4 (one).
# of Uniques reported: 0, 0, 0, 5, 2, 0, 0, 65, 11, 0, 167, 30, 2, 349, 713.
Table 3. Number of Uniques with 5 Year Intervals for Age
Variables combined: Gender, Grouped Age, Relationship, Marital Status.
# of Vars per combination: 1, 2, 2, 2, 3, 3, 3, 4.
# of Uniques (before → after grouping age): 5 → 2, 2 → 0, 65 → 6, 11 → 1, 167 → 18, 30 → 3, 349 → 53, 713 → 106.
Table 4. Number of Uniques with Grouped Age and
Relationship Categories
Variables combined: Gender, Grouped Age, Grouped Relationship, Marital Status.
# of Vars per combination: 2, 3, 3, 4.
# of Uniques (before → after grouping relationship): 6 → 2, 18 → 4, 53 → 3, 106 → 8.
Table 5. Number of Uniques with Grouped Age,
Relationship and Marital Status Categories
Variables combined: Gender, Grouped Age, Grouped Relationship, Grouped Marital Status.
# of Vars per combination: 3, 3, 4.
# of Uniques (before → after grouping marital status): 3 → 1, 3 → 3, 8 → 4.
Table 6. Two different groupings in the number
of categories (original number of categories in parentheses)

             Relationship   Building Type   Type of Housing and # of Floors   # of Uniques
Grouping 1   9 (14)         6 (12)          6 (12)                            501
Grouping 2   3 (14)         4 (12)          4 (12)                            495

Probability of unique = .028 % for both
groupings.
If we assume the intruder has the 10 percent
census sample file, the disclosure risk is
0.0000056
< 1 in 100,000.
If we assume response knowledge, the disclosure
risk goes up to
0.000028.
5. Concluding Remarks

We developed comprehensive probability
models quantifying disclosure risk for microdata
files and applied them to the Korean census
data.

Using the models, we measured the disclosure
risks for the original census data. The risks were
too high.

We grouped categories of the key variables and
re-calculated the disclosure risks. The risks were
lowered to a satisfactory level.

For creating their official 2 percent PUMFs from
the census data, KNSO used the approaches
mentioned here including the measures of
disclosure risks and grouping categories.
Thank you very much !
Jay J. Kim
Dong M. Jeong
Disclaimer: This paper represents the views of the
authors and should not be interpreted as representing
the views, policies or practices of the Centers for
Disease Control and Prevention, National Center for
Health Statistics.