Jerry's presentation on risk measures

Estimating Identification Risks for Microdata
Jerome P. Reiter
Institute of Statistics and Decision Sciences
Duke University, Durham NC, USA
Measures of identification disclosure risk

Number of population uniques:
Does not incorporate intruders’ knowledge.
May not be useful for continuous data.
Hard to gauge effects of SDL procedures.
Hard to estimate accurately.

Probability-based methods
(Direct matching using external databases.
Indirect matching using existing data set.)
Require assumptions about intruder behavior.
May be costly to obtain external databases.
Notation for methods
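Actual record j: $y_j = (y_j^A, y_j^U)$
Released record j: $z_j = (z_j^A, z_j^U)$

Available data: $z_j^A = (z_j^{Ap}, z_j^{Ad})$

Unavailable + perturbed data combined: $z_j^C = (z_j^U, z_j^{Ap})$

(Superscripts: A = available to the intruder, U = unavailable, Ap = available and perturbed, Ad = available and not perturbed.)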
Probability of identification


Let J = j when record j in Z (j = 1, ..., r) matches the target record, t.
J = r + 1 when target is not in Z.

$$\Pr(J = j \mid t, Z) = \frac{\Pr(Z^C \mid J = j, t, Z^{Ad}) \, \Pr(J = j \mid t, Z^{Ad})}{\sum_{k=1}^{r+1} \Pr(Z^C \mid J = k, t, Z^{Ad}) \, \Pr(J = k \mid t, Z^{Ad})}$$
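The formula above is a Bayes-rule normalization over the r + 1 possible values of J. A minimal sketch, assuming the likelihood terms and prior terms have already been computed for each candidate; the function name and the numbers are hypothetical, not from the slides:

```python
def identification_probabilities(likelihoods, priors):
    """Normalize likelihood * prior over j = 1..r+1.

    likelihoods[j] stands for Pr(Z^C | J = j, t, Z^Ad) and priors[j] for
    Pr(J = j | t, Z^Ad); the last entry is the "target not in Z" outcome.
    """
    if len(likelihoods) != len(priors):
        raise ValueError("likelihoods and priors must have the same length")
    weights = [lik * pri for lik, pri in zip(likelihoods, priors)]
    total = sum(weights)
    if total == 0:
        raise ValueError("every candidate has zero weight")
    return [w / total for w in weights]


# Hypothetical inputs: r = 3 released records plus the "not in Z" outcome.
likelihoods = [0.20, 0.05, 0.00, 0.10]
priors = [0.25, 0.25, 0.25, 0.25]
posterior = identification_probabilities(likelihoods, priors)
```

With equal priors, the posterior simply rescales the likelihoods so they sum to one.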
Calculating $\Pr(J = j \mid t, Z^{Ad})$
CASE 1: Target assumed to be in Z:

Units whose $z_j^{Ad}$ do not match target’s values have zero probability.
For matches, probability equals $1/n_t$, where $n_t$ is number of matches in Z.
Probability equals zero for j = r + 1.
Calculating $\Pr(J = j \mid t, Z^{Ad})$
CASE 2: Target not assumed to be in Z:

Units whose $z_j^{Ad}$ do not match target’s values have zero probability.
For matches, probability is $1/N_t$, where $N_t$ is number of matches in population.
For j = r + 1, probability is $(N_t - n_t)/N_t$.
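The two cases can be sketched in a few lines. This is an illustration under stated assumptions: records are represented as tuples of their unperturbed available values, and the function name, the `population_matches` parameter, and the toy data are hypothetical.

```python
def prior_probabilities(released_ad, target_ad, population_matches=None):
    """Return Pr(J = j | t, Z^Ad) for j = 1..r+1 (last entry: target not in Z).

    released_ad: one tuple of unperturbed available values per released record.
    target_ad: the target's values on the same variables.
    population_matches: N_t for Case 2, or None for Case 1 (target assumed in Z).
    """
    matches = [z == target_ad for z in released_ad]
    n_t = sum(matches)
    if population_matches is None:
        # Case 1: probability 1/n_t for matches, zero for j = r + 1.
        if n_t == 0:
            raise ValueError("Case 1 requires at least one matching record")
        per_match, not_in_z = 1.0 / n_t, 0.0
    else:
        # Case 2: probability 1/N_t for matches, (N_t - n_t)/N_t for j = r + 1.
        big_n_t = population_matches
        if big_n_t < max(n_t, 1):
            raise ValueError("N_t must be at least n_t and at least 1")
        per_match, not_in_z = 1.0 / big_n_t, (big_n_t - n_t) / big_n_t
    return [per_match if m else 0.0 for m in matches] + [not_in_z]


# Hypothetical released records described by (sex, age group).
released_ad = [("F", "30-34"), ("M", "30-34"), ("F", "30-34")]
target_ad = ("F", "30-34")
case1 = prior_probabilities(released_ad, target_ad)
case2 = prior_probabilities(released_ad, target_ad, population_matches=10)
```

In this toy example two released records match the target, so Case 1 splits the probability evenly between them, while Case 2 leaves most of the mass on the "not in Z" outcome.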
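Splitting $\Pr(Z^C \mid J = j, t, Z^{Ad})$

$$
\begin{aligned}
\Pr(Z^C \mid J = j, t, Z^{Ad}) ={} & \Pr(z_j^{Ap} \mid J = j, t, Z^{Ad}) \\
& \times \Pr(z_j^U \mid z_j^{Ap}, J = j, t, Z^{Ad}) \\
& \times \Pr(z_1^C, \dots, z_{j-1}^C, z_{j+1}^C, \dots, z_r^C \mid z_j^C, J = j, t, Z^{Ad})
\end{aligned}
$$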
Calculating $\Pr(z_j^{Ap} \mid J = j, t, Z^{Ad})$
Data swapping:
Repeatedly simulate swapping mechanism using Z.
Estimate probabilities for combinations of original + swapped values.
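The simulation idea for swapping can be sketched as follows. The swapping rule here (shuffle the values of a randomly chosen fraction of records) is an assumption for illustration, as are the function names and the toy data; the slides do not specify the mechanism.

```python
import random
from collections import Counter, defaultdict


def swap_values(values, swap_fraction, rng):
    """Assumed swap rule: shuffle values among a random subset of records."""
    values = list(values)
    chosen = rng.sample(range(len(values)), round(swap_fraction * len(values)))
    shuffled = [values[i] for i in chosen]
    rng.shuffle(shuffled)
    for i, v in zip(chosen, shuffled):
        values[i] = v
    return values


def estimate_swap_probabilities(z_values, swap_fraction, n_sims=200, seed=0):
    """Estimate Pr(value after swap | value before swap) by repeated simulation.

    The released values z_values stand in for the originals, as on the slide.
    """
    rng = random.Random(seed)
    counts = defaultdict(Counter)
    for _ in range(n_sims):
        swapped = swap_values(z_values, swap_fraction, rng)
        for before, after in zip(z_values, swapped):
            counts[before][after] += 1
    return {
        before: {after: c / sum(ctr.values()) for after, c in ctr.items()}
        for before, ctr in counts.items()
    }


# Hypothetical released race codes for 100 records.
z_race = ["W"] * 70 + ["B"] * 20 + ["O"] * 10
probs = estimate_swap_probabilities(z_race, swap_fraction=0.30)
```

Each row of `probs` is an estimated conditional distribution; an intruder would look up the entry for the pair (target's value, released value) of each candidate record.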
Calculating $\Pr(z_j^{Ap} \mid J = j, t, Z^{Ad})$
Noise addition:
Assume variable k perturbed using Gaussian noise with mean zero and known variance $\sigma^2$.

$$\Pr(z_{jk}^{Ap} \mid J = j, t, Z^{Ad}) = N(z_{jk} \mid t_{jk}, \sigma^2)$$
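For noise addition the likelihood term is a normal density evaluated at each released value, centered at the target's true value. A minimal sketch with hypothetical names and numbers; it ignores any constraint applied after the noise is added (such as forcing values to be positive):

```python
import math


def normal_density(x, mean, sigma):
    """Density of N(mean, sigma^2) at x."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def noise_likelihoods(released_values, target_value, sigma):
    """Evaluate N(z_jk | t_jk, sigma^2) for every released record j."""
    return [normal_density(z, target_value, sigma) for z in released_values]


# Hypothetical released values of variable k and the target's true value.
released_values = [1200.0, 1500.0, 2600.0]
liks = noise_likelihoods(released_values, target_value=1450.0, sigma=290.0)
```

Records whose released value lies close to the target's true value receive the largest likelihood.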
Calculating $\Pr(z_j^U \mid z_j^{Ap}, J = j, t, Z^{Ad})$

$$\Pr(z_j^U \mid z_j^A, t, Z^{Ad}) = \int \Pr(z_j^U \mid y_j^U, z_j^A, t, Z^{Ad}) \, \Pr(y_j^U \mid z_j^A, t, Z^{Ad}) \, dy_j^U$$

First distribution is determined by the SDL methods.
Second distribution is the best model for predicting unavailable variables given what is known.
Calculating $\Pr(z_j^U \mid z_j^{Ap}, J = j, t, Z^{Ad})$

$$\Pr(z_j^U \mid z_j^A, t, Z^{Ad}) = 1$$

when values in U are not perturbed. Intruders may act this way to avoid computations.
It is prudent to evaluate risk assuming they do.
Calculating $\Pr(z_1^C, \dots, z_{j-1}^C, z_{j+1}^C, \dots, z_r^C \mid z_j^C, J = j, t, Z^{Ad})$

Assume independence to obtain:

$$\prod_{i \neq j} \Pr(z_i^C \mid z_i^{Ad})$$

where

$$\Pr(z_i^C \mid z_i^{Ad}) = \int \Pr(z_i^C \mid y_i^C, z_i^{Ad}) \, \Pr(y_i^C \mid z_i^{Ad}) \, dy_i^C$$
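One consequence of the independence assumption may be worth spelling out. This is a sketch, not part of the original slides, and it assumes every factor $\Pr(z_i^C \mid z_i^{Ad})$ is positive:

```latex
\prod_{i \neq j} \Pr(z_i^C \mid z_i^{Ad})
  = \frac{\prod_{i=1}^{r} \Pr(z_i^C \mid z_i^{Ad})}{\Pr(z_j^C \mid z_j^{Ad})},
  \qquad j = 1, \dots, r.
```

The numerator is the same for every j up to r, so once the full product is known, only the single factor $\Pr(z_j^C \mid z_j^{Ad})$ has to be evaluated per candidate record. The slide does not say how the case j = r + 1 is handled.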
Simulations

51,016 heads of household from 2000 CPS.

Potentially available variables:
Age, Sex, Race, Marital Status, Property Tax

Unavailable variables:
Education, Income, Social Security, Child
Support Payments
Simulations: SDL Procedures




Age: Group in five-year intervals.
Race and Marital Status: Swap randomly 30% of values for each variable.
Property taxes: For positive taxes, add noise from $N(0, 290^2)$. Constrain values to be positive. Do not alter 0s.
Other variables: Leave at original values.
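The procedures above could be applied along the following lines. This is a sketch under several assumptions that the slides do not confirm: the swap rule shuffles values among a randomly chosen 30% of records, positivity is enforced by redrawing the noise, the noise standard deviation is taken to be 290, and the field names and toy records are hypothetical.

```python
import random


def recode_age(age, width=5):
    """Map an age in years to a five-year interval label, e.g. 37 -> '35-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def swap_fraction(values, fraction, rng):
    """Assumed swap rule: shuffle the values of a random fraction of records."""
    values = list(values)
    chosen = rng.sample(range(len(values)), round(fraction * len(values)))
    shuffled = [values[i] for i in chosen]
    rng.shuffle(shuffled)
    for i, v in zip(chosen, shuffled):
        values[i] = v
    return values


def add_tax_noise(tax, sigma, rng):
    """Leave zeros alone; otherwise add N(0, sigma^2) noise, redrawing until positive."""
    if tax == 0:
        return tax
    while True:
        noisy = tax + rng.gauss(0.0, sigma)
        if noisy > 0:
            return noisy


def apply_sdl(records, sigma=290.0, seed=0):
    """Return a protected copy of the records; other variables are left unchanged."""
    rng = random.Random(seed)
    released = [dict(r) for r in records]
    for rec in released:
        rec["age"] = recode_age(rec["age"])
        rec["property_tax"] = add_tax_noise(rec["property_tax"], sigma, rng)
    for var in ("race", "marital_status"):
        swapped = swap_fraction([r[var] for r in released], 0.30, rng)
        for rec, value in zip(released, swapped):
            rec[var] = value
    return released


# Hypothetical toy records, not CPS data.
records = [
    {
        "age": 20 + 3 * i,
        "sex": "FM"[i % 2],
        "race": "WBO"[i % 3],
        "marital_status": "MSD"[i % 3],
        "property_tax": 0.0 if i % 4 == 0 else 500.0 * i,
    }
    for i in range(10)
]
released = apply_sdl(records)
```

Redrawing until positive is only one way to constrain the noisy values; truncating at a small positive floor would be another, and the two give different likelihoods for an intruder.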
Simulations: Targets

Everyman: has values near median for all variables.

Unique: Sample unique on combination of age, sex, race, marital status.

Big I: Highest income in data set.

Big P: Highest property tax in data set.
Simulations: Summary of results




Swaps needed to protect Unique.
Age recode plus swaps gives good protection.
Knowing property taxes greatly increases probabilities of identification.
Adding noise to positive tax values is not sufficient. (Top-coding helps.)