Jerry's presentation on risk measures
Estimating Identification Risks for Microdata
Jerome P. Reiter
Institute of Statistics and Decision Sciences
Duke University, Durham NC, USA
Measures of identification disclosure risk
Number of population uniques:
Does not incorporate intruders’ knowledge.
May not be useful for continuous data.
Hard to gauge effects of SDL procedures.
Hard to estimate accurately.
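As a rough illustration of the uniques-based measure, the sketch below counts records that are unique on a set of key variables. It works on whatever file it is given, so on a sample it counts sample uniques, not population uniques; the function name, records, and variable names are hypothetical.

```python
from collections import Counter


def count_uniques(records, keys):
    """Count records whose combination of key values appears exactly once."""
    combos = Counter(tuple(rec[k] for k in keys) for rec in records)
    return sum(1 for count in combos.values() if count == 1)


# Hypothetical toy file; variable names are illustrative only.
records = [
    {"age": 34, "sex": "F", "race": "W"},
    {"age": 34, "sex": "F", "race": "W"},
    {"age": 51, "sex": "M", "race": "B"},
    {"age": 29, "sex": "F", "race": "A"},
]
n_unique = count_uniques(records, ["age", "sex", "race"])
print(n_unique)  # 2
```

The count says nothing about what an intruder knows, which is the first drawback listed above.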
Probability-based methods
(Direct matching using external databases.
Indirect matching using existing data set.)
Require assumptions about intruder behavior.
May be costly to obtain external databases.
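Notation for methods
Actual record j: $y_j = (y_j^{A}, y_j^{U})$
Released record j: $z_j = (z_j^{A}, z_j^{U})$
Available data: $z_j^{A} = (z_j^{Ap}, z_j^{Ad})$
Unavailable + perturbed data combined: $z_j^{C} = (z_j^{U}, z_j^{Ap})$
(A: available to the intruder. U: unavailable. Ap: available and perturbed. Ad: available and not perturbed.)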
Probability of identification
Let J = j when record j in Z matches the
target record, t.
J = r + 1 when target is not in Z.
$$\Pr(J = j \mid t, Z) = \frac{\Pr(Z^{C} \mid J = j, t, Z^{Ad}) \, \Pr(J = j \mid t, Z^{Ad})}{\sum_{j'=1}^{r+1} \Pr(Z^{C} \mid J = j', t, Z^{Ad}) \, \Pr(J = j' \mid t, Z^{Ad})}$$
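The identification probability is Bayes' rule over the r + 1 possible values of J. A minimal sketch of the normalization, assuming the likelihood terms and prior terms have already been computed for each candidate; the function name and the numbers are hypothetical.

```python
def identification_probabilities(likelihoods, priors):
    """Posterior Pr(J = j | t, Z) for j = 1..r+1.

    likelihoods[j] stands for Pr(Z^C | J = j, t, Z^Ad) and priors[j] for
    Pr(J = j | t, Z^Ad). The last entry is the "target not in Z" case.
    """
    if len(likelihoods) != len(priors):
        raise ValueError("likelihoods and priors must have the same length")
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    total = sum(joint)
    if total == 0:
        raise ValueError("all candidates have zero probability")
    return [value / total for value in joint]


# Hypothetical values: three released records plus the "not in Z" case.
likelihoods = [0.20, 0.05, 0.0, 0.10]
priors = [0.25, 0.25, 0.25, 0.25]
posterior = identification_probabilities(likelihoods, priors)
```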
Calculating $\Pr(J = j \mid t, Z^{Ad})$
CASE 1: Target assumed to be in Z:
Units whose $z_j^{Ad}$ do not match target's values have zero probability.
For matches, probability equals $1/n_t$, where $n_t$ is number of matches in Z.
Probability equals zero for j = r+1.
Calculating $\Pr(J = j \mid t, Z^{Ad})$
CASE 2: Target not assumed to be in Z:
Units whose $z_j^{Ad}$ do not match target's values have zero probability.
For matches, probability is $1/N_t$, where $N_t$ is number of matches in pop'n.
For j = r+1, probability is $(N_t - n_t)/N_t$.
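Both cases can be sketched in one function. This assumes the unperturbed available values of each released record and of the target can be compared for exact equality, and that the population match count is supplied by the analyst; the function name, argument names, and data are hypothetical.

```python
def prior_match_probabilities(released_ad, target_ad, population_matches=None):
    """Pr(J = j | t, Z^Ad) for j = 1..r, with a final entry for j = r + 1.

    population_matches=None -> Case 1 (target assumed to be in Z).
    population_matches=N_t  -> Case 2 (target not assumed to be in Z).
    """
    is_match = [rec == target_ad for rec in released_ad]
    n_t = sum(is_match)  # number of matches in Z
    if population_matches is None:
        if n_t == 0:
            raise ValueError("Case 1 needs at least one matching record")
        per_match, not_in_z = 1.0 / n_t, 0.0
    else:
        big_n_t = population_matches  # number of matches in the population
        if big_n_t <= 0 or big_n_t < n_t:
            raise ValueError("population_matches must be positive and >= n_t")
        per_match, not_in_z = 1.0 / big_n_t, (big_n_t - n_t) / big_n_t
    return [per_match if m else 0.0 for m in is_match] + [not_in_z]


# Hypothetical unperturbed available values (sex, age group).
released_ad = [("F", "30-34"), ("M", "30-34"), ("F", "30-34")]
target_ad = ("F", "30-34")
case1 = prior_match_probabilities(released_ad, target_ad)
case2 = prior_match_probabilities(released_ad, target_ad, population_matches=8)
print(case1)  # [0.5, 0.0, 0.5, 0.0]
print(case2)  # [0.125, 0.0, 0.125, 0.75]
```

In both cases the r + 1 probabilities sum to one.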
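Splitting $\Pr(Z^{C} \mid J = j, t, Z^{Ad})$
$$\begin{aligned}
\Pr(Z^{C} \mid J = j, t, Z^{Ad}) = {} & \Pr(z_j^{Ap} \mid J = j, t, Z^{Ad}) \\
& \times \Pr(z_j^{U} \mid z_j^{Ap}, J = j, t, Z^{Ad}) \\
& \times \Pr(z_1^{C}, \ldots, z_{j-1}^{C}, z_{j+1}^{C}, \ldots, z_r^{C} \mid z_j^{C}, J = j, t, Z^{Ad})
\end{aligned}$$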
Calculating $\Pr(z_j^{Ap} \mid J = j, t, Z^{Ad})$
Data swapping:
Repeatedly simulate swapping
mechanism using Z.
Estimate probabilities for combinations
of original + swapped values.
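A minimal sketch of the simulation idea for data swapping. The swapping mechanism here (pick a fixed fraction of records at random and permute their values among themselves) is an assumption for illustration, not necessarily the mechanism used by the data releaser; function names and data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def simulate_swap(values, rate, rng):
    """Assumed mechanism: permute the values of a random fraction of records."""
    out = list(values)
    n_swap = int(round(rate * len(out)))
    chosen = rng.sample(range(len(out)), n_swap)
    shuffled = chosen[:]
    rng.shuffle(shuffled)
    for src, dst in zip(chosen, shuffled):
        out[dst] = values[src]
    return out


def estimate_swap_probs(values, rate, n_sims=200, seed=0):
    """Estimate Pr(value after swapping = b | value before = a) by simulation."""
    rng = random.Random(seed)
    counts = defaultdict(Counter)
    for _ in range(n_sims):
        swapped = simulate_swap(values, rate, rng)
        for before, after in zip(values, swapped):
            counts[before][after] += 1
    return {
        before: {after: c / sum(cnt.values()) for after, c in cnt.items()}
        for before, cnt in counts.items()
    }


# Hypothetical released column, used as the starting point as the slide suggests.
race = ["W"] * 70 + ["B"] * 20 + ["O"] * 10
probs = estimate_swap_probs(race, rate=0.30)
```

Each row of `probs` is an estimated distribution over released values for one starting value, which is the kind of quantity needed for the swapped variables in the likelihood.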
Calculating $\Pr(z_j^{Ap} \mid J = j, t, Z^{Ad})$
Noise addition:
Assume variable k perturbed using Gaussian noise with mean zero and known variance $\sigma^2$.
$$\Pr(z_{jk}^{Ap} \mid J = j, t, Z^{Ad}) = N(z_{jk} \mid t_{jk}, \sigma^2)$$
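For noise addition the likelihood term is a normal density evaluated at the released value, centered at the target's true value. A small sketch, assuming the noise standard deviation is known to the intruder; the function names and the tax values are illustrative only.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def noise_likelihoods(released_values, target_value, sd):
    """Density of each released value given that its record is the target."""
    return [normal_pdf(z, target_value, sd) for z in released_values]


# Hypothetical released property taxes and a target whose true tax is 1450.
released_tax = [1200.0, 1500.0, 2600.0]
lik = noise_likelihoods(released_tax, target_value=1450.0, sd=290.0)
```

Released values close to the target's true value receive the largest density, so noise that is small relative to the spread of the variable leaves the target easy to pick out.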
Calculating $\Pr(z_j^{U} \mid z_j^{Ap}, J = j, t, Z^{Ad})$
$$\Pr(z_j^{U} \mid z_j^{A}, t, Z^{Ad}) = \int \Pr(z_j^{U} \mid y_j^{U}, z_j^{A}, t, Z^{Ad}) \, \Pr(y_j^{U} \mid z_j^{A}, t, Z^{Ad}) \, dy_j^{U}$$
First distribution is for SDL methods.
Second distribution is best model for predicting unavailable variables given what is known.
Calculating $\Pr(z_j^{U} \mid z_j^{Ap}, J = j, t, Z^{Ad})$
$$\Pr(z_j^{U} \mid z_j^{A}, t, Z^{Ad}) = 1$$
when values in U are not perturbed. Intruders may act this way to avoid computations.
It is prudent to evaluate risk assuming they do.
Calculating $\Pr(z_1^{C}, \ldots, z_{j-1}^{C}, z_{j+1}^{C}, \ldots, z_r^{C} \mid z_j^{C}, J = j, t, Z^{Ad})$
Assume independence to obtain:
$$\prod_{i \neq j} \Pr(z_i^{C} \mid z_i^{Ad})$$
where
$$\Pr(z_i^{C} \mid z_i^{Ad}) = \int \Pr(z_i^{C} \mid y_i^{C}, z_i^{Ad}) \, \Pr(y_i^{C} \mid z_i^{Ad}) \, dy_i^{C}$$
Simulations
51,016 heads of household from 2000 CPS.
Potentially available variables:
Age, Sex, Race, Marital Status, Property Tax
Unavailable variables:
Education, Income, Social Security, Child
Support Payments
Simulations: SDL Procedures
Age: Group in five year intervals.
Race and Marital Status:
Swap randomly 30% of values for each variable.
Property taxes:
For positive taxes, add noise from $N(0, 290^2)$.
Constrain values to be positive. Do not alter 0s.
Other variables: Leave at original values.
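The procedures above could be sketched as follows. The details are assumptions made for illustration: swapping is done by permuting values within a random 30% of records, and the positivity constraint is enforced by redrawing the noise, neither of which is stated on the slide. All function names and data are hypothetical.

```python
import random


def group_age(age, width=5):
    """Recode age into an interval label such as '35-39'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"


def swap_fraction(values, rate, rng):
    """Assumed swap: permute values among a random fraction of records."""
    out = list(values)
    chosen = rng.sample(range(len(out)), int(round(rate * len(out))))
    shuffled = chosen[:]
    rng.shuffle(shuffled)
    for src, dst in zip(chosen, shuffled):
        out[dst] = values[src]
    return out


def add_tax_noise(taxes, sd, rng):
    """Add N(0, sd^2) noise to positive taxes; zeros are left unaltered."""
    out = []
    for tax in taxes:
        if tax == 0:
            out.append(0.0)
            continue
        noisy = tax + rng.gauss(0.0, sd)
        while noisy <= 0:  # assumed way to keep released values positive
            noisy = tax + rng.gauss(0.0, sd)
        out.append(noisy)
    return out


rng = random.Random(0)
ages = [37, 40, 64]
race = ["W", "B", "W", "O", "W", "B", "W", "W", "O", "W"]
taxes = [0.0, 50.0, 1200.0]

released_age = [group_age(a) for a in ages]
released_race = swap_fraction(race, 0.30, rng)
released_tax = add_tax_noise(taxes, 290.0, rng)
```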
Simulations: Targets
Everyman : has values near median for all variables.
Unique : Sample unique on combination of age, sex, race, marital status.
Big I : Highest income in data set.
Big P : Highest property tax in data set.
Simulations: Summary of results
Swaps needed to protect Unique.
Age recode plus swaps give good protection.
Knowing property taxes greatly
increases probabilities of identification.
Adding noise to positive tax values is
not sufficient. (Top-coding helps.)