Impact on the quality of the SARs Perturbation and Imputation


Assessing Disclosure Risk in Sample Microdata Under Misclassification
Natalie Shlomo
Southampton Statistical Sciences Research Institute
University of Southampton
This is joint work with Prof Chris Skinner
Topics Covered
• Introduction and motivation
• Disclosure risk assessment for sample microdata:
  • Probabilistic modelling extended for misclassification
  • Probabilistic record linkage – linking the frameworks
• Risk-utility framework for microdata subject to misclassification
• Discussion
Introduction
• Disclosure risk scenario: ‘intruder’ attack on microdata through linking to available public data sources
• Linkage via identifying key variables common to both sources, e.g. sex, age, region, ethnicity
• Agencies limit risk of identification through statistical disclosure limitation (SDL) methods:
  • Non-perturbative – sub-sampling, recoding and collapsing categories of key variables, deleting variables
  • Perturbative – data swapping, additive noise, misclassification (PRAM) and synthetic data
Introduction
• Need to quantify the risk of identification to inform microdata release
• Probabilistic models for quantifying risk of identification based on population uniqueness on identifying key variables
• Population counts in contingency tables spanned by key variables are unknown
• Distributional assumptions are needed to draw inference from the sample for estimating population parameters
• The standard approach assumes that the microdata have not been altered, either through misclassification arising from data processes or through perturbation purposely introduced for SDL
Introduction
• Expand probabilistic modelling for quantifying risk of identification under misclassification/perturbation
• For perturbative methods of SDL, risk assessment is typically based on probabilistic record linkage:
  • Conservative assessment of risk of identification
  • Assumes that the intruder has access to the original dataset and does not take into account the protection afforded by sampling
• Fit probabilistic record linkage into the probabilistic modelling framework for categorical matching variables
Disclosure Risk Assessment
Probabilistic Modelling – no misclassification
• Let $f = \{f_k\}$ denote a $q$-way frequency table which is a sample from a population table $F = \{F_k\}$, where $k = (k_1, \ldots, k_q)$ indicates a cell, $F_k$ is the population count and $f_k$ the sample count in cell $k$
• Disclosure risk measures:
$$\tau_1 = \sum_k I(f_k = 1, F_k = 1), \qquad \tau_2 = \sum_k I(f_k = 1)\,\frac{1}{F_k}$$
• For unknown population counts, estimate from the conditional distribution of $F_k \mid f_k$:
$$\hat{\tau}_1 = \sum_k I(f_k = 1)\,\hat{P}(F_k = 1 \mid f_k = 1), \qquad \hat{\tau}_2 = \sum_k I(f_k = 1)\,\hat{E}\!\left(\frac{1}{F_k} \,\Big|\, f_k = 1\right)$$
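As a minimal sketch of the two global measures when the population counts are known, the following computes them directly from paired cell counts. The function name `global_risk` and the five-cell toy table are hypothetical and are not data from the talk.

```python
def global_risk(sample_counts, population_counts):
    """Return (tau1, tau2) for paired cell counts f_k and F_k."""
    tau1 = 0
    tau2 = 0.0
    for f_k, F_k in zip(sample_counts, population_counts):
        if f_k == 1:  # only sample uniques contribute to either measure
            tau1 += 1 if F_k == 1 else 0  # sample unique that is also population unique
            tau2 += 1.0 / F_k             # chance of a correct match within the cell
    return tau1, tau2


# Hypothetical toy table with five cells.
f = [1, 0, 1, 2, 1]
F = [1, 3, 4, 10, 2]
tau1, tau2 = global_risk(f, F)
print(tau1, tau2)
```

In practice the population counts are unknown, which is why the estimated versions replace the indicator and the reciprocal by model-based conditional expectations.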
Disclosure Risk Assessment
• Natural assumption: $F_k \sim \text{Poisson}(\lambda_k)$
• Bernoulli sampling: $f_k \mid F_k \sim \text{Bin}(F_k, \pi_k)$, where $\pi_k$ is the sampling fraction in cell $k$
• It follows that:
$$f_k \sim \text{Poisson}(\pi_k \lambda_k) \quad \text{and} \quad F_k - f_k \mid f_k \sim \text{Poisson}(\lambda_k (1 - \pi_k))$$
where the $F_k \mid f_k$ are conditionally independent
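The conditional distribution can be checked numerically. The sketch below computes $P(F_k = 1 \mid f_k = 1)$ by direct Bayes summation under the Poisson and binomial assumptions and compares it with the closed form $\exp(-\lambda_k(1 - \pi_k))$. The values of `lam` and `pi` are hypothetical, and the summation is truncated at an assumed upper limit `max_F`.

```python
import math


def poisson_pmf(n, mean):
    return math.exp(-mean) * mean ** n / math.factorial(n)


def binom_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


def prob_pop_unique_given_sample_unique(lam, pi, max_F=60):
    """P(F_k = 1 | f_k = 1) by summing the joint probability over F_k = 1..max_F-1."""
    joint = [poisson_pmf(F, lam) * binom_pmf(1, F, pi) for F in range(1, max_F)]
    return joint[0] / sum(joint)


lam, pi = 2.5, 0.1  # hypothetical cell mean and sampling fraction
direct = prob_pop_unique_given_sample_unique(lam, pi)
closed_form = math.exp(-lam * (1 - pi))
print(direct, closed_form)
```

The two numbers agree up to truncation error, which is negligible here because the Poisson tail beyond `max_F` is tiny for this value of `lam`.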
Disclosure Risk Assessment
• Skinner and Holmes (1998) and Elamir and Skinner (2006) use log-linear models to estimate the parameters $\{\lambda_k\}$
• Sample frequencies $f_k$ are independent Poisson distributed with mean $\mu_k = \pi_k \lambda_k$
• Log-linear model for estimating $\{\lambda_k\}$ expressed as:
$$\log(\mu_k) = x_k' \beta$$
where $x_k$ is the row of the design matrix for cell $k$, containing the key variables and their interactions
• MLEs calculated by solving the score equation:
$$\sum_k \left[ f_k - \exp(x_k' \beta) \right] x_k = 0$$
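One way to solve the score equation is Newton-Raphson. The sketch below is a pure-Python illustration on a hypothetical 2x2 table with a main-effects model; it is not the software used in the talk. The helper names `solve` and `fit_loglinear`, the toy counts and the sampling fraction `pi` are assumptions, and the first column of the design matrix is assumed to be an intercept.

```python
import math


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_loglinear(X, f, tol=1e-10, max_iter=100):
    """Newton-Raphson for sum_k [f_k - exp(x_k' beta)] x_k = 0."""
    K, p = len(X), len(X[0])
    beta = [math.log(sum(f) / K)] + [0.0] * (p - 1)  # start at the mean cell count
    for _ in range(max_iter):
        mu = [math.exp(sum(X[k][j] * beta[j] for j in range(p))) for k in range(K)]
        score = [sum((f[k] - mu[k]) * X[k][j] for k in range(K)) for j in range(p)]
        info = [[sum(mu[k] * X[k][a] * X[k][b] for k in range(K)) for b in range(p)]
                for a in range(p)]
        step = solve(info, score)
        beta = [beta[j] + step[j] for j in range(p)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


# Hypothetical 2x2 table: columns are intercept, row effect, column effect.
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
f = [12, 5, 7, 3]
pi = 0.01  # hypothetical equal sampling fraction

beta_hat = fit_loglinear(X, f)
mu_hat = [math.exp(sum(x_j * b_j for x_j, b_j in zip(x, beta_hat))) for x in X]
lambda_hat = [m / pi for m in mu_hat]
print([round(m, 3) for m in mu_hat])
```

For the main-effects model the fitted values should coincide with the familiar (row total x column total) / n values, which gives a convenient check on the solver.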
Disclosure Risk Assessment
• Fitted values calculated by: $\hat{\mu}_k = \exp(x_k' \hat{\beta})$ and $\hat{\lambda}_k = \hat{\mu}_k / \pi_k$
• Individual risk measures estimated by:
$$\hat{P}(F_k = 1 \mid f_k = 1) = \exp(-\hat{\lambda}_k (1 - \pi_k))$$
$$\hat{E}\!\left(\frac{1}{F_k} \,\Big|\, f_k = 1\right) = \frac{1 - \exp(-\hat{\lambda}_k (1 - \pi_k))}{\hat{\lambda}_k (1 - \pi_k)}$$
• Skinner and Shlomo (2009) develop goodness-of-fit criteria which minimize the bias of disclosure risk estimates, for example, for $\tau_1$:
$$\hat{B}_1 = \sum_k \hat{\lambda}_k \exp(-\hat{\lambda}_k)(1 - \pi_k)\left\{ (f_k - \hat{\mu}_k) + (1 - \pi_k)\,\frac{(f_k - \hat{\mu}_k)^2 - f_k}{2 \pi_k} \right\}$$
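The two individual risk formulas are simple to evaluate once $\hat{\lambda}_k$ is available. The sketch below implements them and cross-checks the second against a direct series, using the fact that $F_k - 1$ is Poisson with mean $\lambda_k(1 - \pi_k)$ for a sample unique. Function names and the input values are hypothetical.

```python
import math


def prob_population_unique(lam, pi):
    """P(F_k = 1 | f_k = 1) under the Poisson model."""
    return math.exp(-lam * (1 - pi))


def expected_inverse_count(lam, pi):
    """E(1 / F_k | f_k = 1) under the Poisson model (closed form)."""
    rate = lam * (1 - pi)
    return (1 - math.exp(-rate)) / rate


def expected_inverse_count_series(lam, pi, terms=100):
    """Same expectation by summing 1 / (1 + m) over m ~ Poisson(rate)."""
    rate = lam * (1 - pi)
    return sum(math.exp(-rate) * rate ** m / math.factorial(m) / (1 + m)
               for m in range(terms))


lam_hat, pi = 2.0, 0.1  # hypothetical fitted cell mean and sampling fraction
p_unique = prob_population_unique(lam_hat, pi)
closed = expected_inverse_count(lam_hat, pi)
series = expected_inverse_count_series(lam_hat, pi)
print(p_unique, closed, series)
```

Summing `closed` over all sample uniques gives the estimate of $\tau_2$, and summing `p_unique` gives the estimate of $\tau_1$.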
Disclosure Risk Assessment
• Criteria related to tests for over- and under-dispersion:
  • Over-fitting: sample marginal counts produce too many random zeros, leading to expected cell counts too high for non-zero cells and under-estimation of risk
  • Under-fitting: sample marginal counts do not take into account structural zeros, leading to expected cell counts too low for non-zero cells and over-estimation of risk
• The criterion selects the model using a forward search algorithm which minimizes $\hat{B}_i / \sqrt{\hat{v}_i}$ for $\hat{\tau}_i$, $i = 1, 2$, where $\hat{v}_i$ is the variance of $\hat{B}_i$
Disclosure Risk Assessment
Example: Population of 944,793 from UK 2001 Census, SRS sample size 9,448
Key: Area (2), Sex (2), Age (101), Marital Status (6), Ethnicity (17), Economic Activity (10) – 412,080 cells
Model Selection:
Starting solution: main-effects log-linear model, which indicates under-fitting (minimum error statistics too large)
Add in higher interaction terms until minimum error statistics indicate fit
Model Search Example (SRS n=9,448)
True values: $\tau_1 = 159$, $\tau_2 = 355.9$
Area – ar, Sex – s, Age – a, Marital Status – m, Ethnicity – et, Economic Activity – ec

Model               $\hat{\tau}_1$   $\hat{\tau}_2$   $\hat{B}_1/\sqrt{\hat{v}_1}$   $\hat{B}_2/\sqrt{\hat{v}_2}$
Independence - I    386.6            701.2            48.54                          114.19
All 2 way - II      104.9            280.1            -1.57                          -2.65
1: I + {a*ec}       243.4            494.3            54.75                          59.22
2: 1 + {a*et}       180.1            411.6            3.07                           9.82
3: 2 + {a*m}        152.3            343.3            0.88                           1.73
4: 3 + {s*ec}       149.2            337.5            0.26                           0.92
5a: 4 + {ar*a}      148.5            337.1            -0.01                          0.84
5b: 4 + {s*m}       147.7            335.3            0.02                           0.66
6b: 5b + {ar*a}     147.0            335.0            -0.24                          0.56
6c: 5b + {ar*m}     148.9            337.1            -0.04                          0.72
6d: 5b + {m*ec}     146.3            331.4            -0.24                          0.03
7c: 6c + {m*ec}     147.5            333.2            -0.34                          0.06
7d: 6d + {ar*a}     145.6            331.0            -0.44                          -0.03
Model Search Example
Preferred Model: {a*ec}{a*et}{a*m}{s*ec}{ar*a}
True Global Risk: $\tau_1 = 159$, $\tau_2 = 355.9$
Estimated Global Risk: $\hat{\tau}_1 = 148.5$, $\hat{\tau}_2 = 337.1$
[Figure: true per-record risk measure plotted against estimated per-record risk measure, log scale, both axes from 0.001 to 1]
Disclosure Risk Assessment
• Skinner and Shlomo (2009) address complex survey designs:
  • Sampling clusters introduce dependencies – key variables cut across clusters and the assumption holds in practice
  • Stratification – include strata id in key variables to account for differential inclusion probabilities
  • Survey weights – use pseudo maximum likelihood estimation, where the score function is modified to:
$$\sum_k \left[ \hat{F}_k - \exp(x_k' \beta) \right] x_k = 0$$
    and $\pi_k$ is changed to $\hat{\pi}_k = f_k / \hat{F}_k$, where $\hat{F}_k = \sum_{i \in k} w_i$
• Partition large tables and assess partitions separately (assumes that the partitioning variable has an interaction with other key variables)
Disclosure Risk Assessment Under Misclassification
• Model assumes no misclassification errors, either arising from data processes or purposely introduced for SDL
• Shlomo and Skinner (2010) address misclassification errors
• Let:
$$M_{kj} = P(\tilde{X} = k \mid X = j)$$
where $X$ denotes the cross-classified key variables:
  • $X$ in the population is fixed
  • $\tilde{X}$ in the microdata is subject to misclassification
Disclosure Risk Assessment Under Misclassification
• The per-record disclosure risk measure of a match of external unit $B$ to a unique record in microdata $A$ that has undergone misclassification:
$$P(A = B \mid \tilde{f}_k = 1) = \frac{M_{kk} / (1 - \pi M_{kk})}{\sum_j F_j M_{kj} / (1 - \pi M_{kj})} \le \frac{1}{F_k} \qquad (1)$$
• For small misclassification and small sampling fractions:
$$\frac{M_{kk}}{\sum_j F_j M_{kj}} \quad \text{or} \quad \frac{M_{kk}}{\tilde{F}_k}$$
• Global measure:
$$\tau_2 = \sum_k I(\tilde{f}_k = 1)\,\frac{M_{kk}}{\tilde{F}_k} \qquad (2)$$
estimated by:
$$\hat{\tau}_2 = \sum_k I(\tilde{f}_k = 1)\, M_{kk}\, \hat{E}\!\left(\frac{1}{\tilde{F}_k} \,\Big|\, \tilde{f}_k = 1\right) \qquad (3)$$
where the per-record risk is $M_{kk}\, \hat{E}(1/\tilde{F}_k \mid \tilde{f}_k = 1)$
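A small numerical sketch of the per-record measure (1) and its approximation may help fix the indexing. It assumes a single sampling fraction `pi` and a matrix stored as `M[k][j]` = P(observed k | true j). The three-category matrix and population counts are hypothetical.

```python
def risk_under_misclassification(k, M, F, pi):
    """Per-record risk (1) for a sample unique observed in cell k."""
    numerator = M[k][k] / (1 - pi * M[k][k])
    denominator = sum(F[j] * M[k][j] / (1 - pi * M[k][j]) for j in range(len(F)))
    return numerator / denominator


def risk_approximation(k, M, F):
    """Approximation for small misclassification and small sampling fractions."""
    return M[k][k] / sum(F[j] * M[k][j] for j in range(len(F)))


# Hypothetical three-category example.
M = [[0.90, 0.05, 0.05],
     [0.05, 0.90, 0.05],
     [0.05, 0.05, 0.90]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
F = [2, 40, 10]
pi = 0.01

exact = risk_under_misclassification(0, M, F, pi)
approx = risk_approximation(0, M, F)
no_error = risk_under_misclassification(1, identity, F, pi)
print(exact, approx, no_error)
```

With the identity matrix the measure reduces to $1/F_k$, the risk without misclassification, and with any other matrix it stays at or below that bound.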
Perturbation Methods of SDL
PRAM (Post-randomisation method)
• Probability transition matrix $M^c$ containing conditional probabilities $M^c_{kj}$ for a category $c$:
$$M^c_{kj} = P(\tilde{X}^c = k \mid X^c = j)$$
• Let $T$ be a vector of frequencies
• On each record, category $c$ is changed or not changed according to $M^c$ and the result of a draw of a random variate $u$
• $T^*$ is the vector of perturbed frequencies
• Unbiased moment estimator of the original data:
$$\hat{T} = T^* (M^c)^{-1}$$
assuming $M^c$ has an inverse (dominant on the diagonal)
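The PRAM steps above can be sketched for a two-category variable. The code assumes the row-vector convention implied by $\hat{T} = T^* (M^c)^{-1}$, so the matrix is stored as `P[j][k]` = P(perturbed = k | original = j). All names, the matrix and the data are hypothetical.

```python
import random


def apply_pram(values, P, rng):
    """Perturb each categorical value using row P[j] as the draw probabilities."""
    categories = list(range(len(P)))
    return [rng.choices(categories, weights=P[j])[0] for j in values]


def frequencies(values, n_categories):
    counts = [0] * n_categories
    for v in values:
        counts[v] += 1
    return counts


def moment_estimate_2x2(t_star, P):
    """T_hat = T* P^{-1} for a two-category variable (row-vector convention)."""
    det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
    inv = [[P[1][1] / det, -P[0][1] / det],
           [-P[1][0] / det, P[0][0] / det]]
    return [t_star[0] * inv[0][j] + t_star[1] * inv[1][j] for j in range(2)]


P = [[0.9, 0.1], [0.2, 0.8]]          # hypothetical transition matrix
original = [0] * 700 + [1] * 300      # hypothetical original values
rng = random.Random(0)

perturbed = apply_pram(original, P, rng)
t_star = frequencies(perturbed, 2)
t_hat = moment_estimate_2x2(t_star, P)
print(t_star, t_hat)
```

The estimator is unbiased rather than exact: applied to the expected perturbed frequencies `[690, 310]` it returns `[700, 300]`, while for a single random draw it returns values scattered around them.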
Perturbation Methods of SDL
PRAM (Post-randomisation method) - cont.
• Invariant PRAM: $T M^c = T$ (the vector of the original frequencies is an eigenvector of $M^c$)
• Define: $R^c = M^c \hat{Q}^c$, with
$$Q^c_{jk} = P(X^c = j \mid \tilde{X}^c = k) = \frac{M^c_{kj}\, P(X^c = j)}{\sum_l M^c_{kl}\, P(X^c = l)}$$
• To ensure the correct non-perturbation probability on the diagonal, define $R^{c*} = \alpha R^c + (1 - \alpha) I$
• Expected values of the marginal distribution are preserved
• Exact marginal distribution preserved using a without-replacement selection strategy to select records for perturbation
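The invariance property can be verified numerically. This sketch builds the two-stage matrix by Bayes' rule and blends it with the identity, again assuming the row-vector storage `P[j][k]` = P(perturbed = k | original = j). The matrix, the frequencies and the function names are hypothetical.

```python
def invariant_matrix(P, t):
    """Build R = P Q, where Q reverses P by Bayes' rule using frequencies t."""
    n = len(t)
    # s[k]: expected perturbed frequency of category k
    s = [sum(t[l] * P[l][k] for l in range(n)) for k in range(n)]
    # Q[k][j] = P(original = j | perturbed = k)
    Q = [[P[j][k] * t[j] / s[k] for j in range(n)] for k in range(n)]
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def blend_with_identity(R, alpha):
    """R* = alpha R + (1 - alpha) I."""
    n = len(R)
    return [[alpha * R[i][j] + (1 - alpha) * (1.0 if i == j else 0.0)
             for j in range(n)] for i in range(n)]


P = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]  # hypothetical
t = [500, 300, 200]                                       # hypothetical frequencies
R_star = blend_with_identity(invariant_matrix(P, t), alpha=0.55)

expected = [sum(t[i] * R_star[i][j] for i in range(3)) for j in range(3)]
row_sums = [sum(row) for row in R_star]
print(expected, row_sums)
```

The expected perturbed frequencies equal the original ones, so no correction of the kind $T^* (M^c)^{-1}$ is needed when analysing the perturbed data.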
Perturbation Methods of SDL
Random Record Swapping
• Exchange values of a key variable between pairs of records
• Pairs selected within control strata to minimize bias
• Typically a geographical variable is swapped within a large area:
  • Geography is highly identifiable
  • Conditional independence assumption usually met (sensitive variables relatively independent of geography)
  • Does not produce inconsistent records
  • Marginal distributions preserved at higher geographies
• Can be targeted to high-risk records
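A minimal sketch of random record swapping within control strata follows. The record layout (`lad`, `region`, `age`) and the function name are hypothetical. Pairs are drawn at random, so a pair that happens to share the same value is left effectively unchanged and the realised swap rate can fall below the nominal one.

```python
import random


def swap_within_strata(records, key, stratum, rate, rng):
    """Swap `key` between random pairs of records that share a `stratum` value."""
    swapped = [dict(rec) for rec in records]  # copy so the input is untouched
    groups = {}
    for idx, rec in enumerate(swapped):
        groups.setdefault(rec[stratum], []).append(idx)
    for indices in groups.values():
        n_pairs = int(rate * len(indices) / 2)  # about `rate` of records selected
        chosen = rng.sample(indices, 2 * n_pairs)
        for a, b in zip(chosen[::2], chosen[1::2]):
            swapped[a][key], swapped[b][key] = swapped[b][key], swapped[a][key]
    return swapped


# Hypothetical microdata: geography `lad` nested in a larger `region`.
records = [{"lad": i % 4, "region": i % 2, "age": 20 + i} for i in range(40)]
swapped = swap_within_strata(records, key="lad", stratum="region", rate=0.2,
                             rng=random.Random(1))
```

Because values are only exchanged, the distribution of `lad` within each `region` is unchanged, which illustrates the point that marginal distributions are preserved at higher geographies.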
Misclassification Example
• Population of individuals from the 2001 United Kingdom (UK) Census, N=1,468,255
• 1% SRS sample, n=14,683
• Six key variables: Local Authority (LAD) (11), sex (2), age groups (24), marital status (6), ethnicity (17), economic activity (10), K=538,560
Misclassification Example
• Record swapping: LAD swapped randomly, e.g. for a 20% swap:
  Diagonal: $M^c_{kk} = 0.8$
  Off-diagonal: $M^c_{kj} = 0.2 \times n_k / \sum_{l \neq k} n_l$, where $n_k$ is the number of records in the sample from LAD $k$
• PRAM: LAD misclassified, e.g. for a 20% misclassification:
  Diagonal: $M^c_{kk} = 0.8$
  Off-diagonal: $M^c_{kj} = 0.02$ (0.2 / 10)
  Parameter: $\alpha = 0.55$
Misclassification Example
• Random 20% perturbation on LAD
• Global risk measures: expected correct matches from sample uniques (SUs)

Global Risk Measure                                       PRAM     Swapping
True risk measure in original sample                      358.1    362.4
Estimated naïve risk measure ignoring misclassification   349.5    358.6
Risk measure on non-perturbed records                     292.2    292.8
Risk measure under misclassification (1)                  299.7    298.9
Sample uniques                                            2,779    2,831
Approximation based on diagonals $M^c_{kk}$ (2)           299.8    298.9
Estimated risk measure under misclassification (3)        283.1    286.8

Expected correct match per sample unique: PRAM 10.8%, record swapping 10.6%
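The percentages in the last line follow from dividing the risk measure under misclassification (1) by the number of sample uniques, as this short calculation with the table values shows. The variable names are illustrative.

```python
# Values taken from the table: risk measure (1) and number of sample uniques.
risk_under_misclassification = {"PRAM": 299.7, "Record swapping": 298.9}
sample_uniques = {"PRAM": 2779, "Record swapping": 2831}

match_rate = {
    method: 100 * risk_under_misclassification[method] / sample_uniques[method]
    for method in sample_uniques
}
for method, pct in match_rate.items():
    print(f"{method}: {pct:.1f}%")
```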
Misclassification Example
• Estimating individual per-record risk measures for a 20% random swap based on log-linear modelling (log scale):
[Figure: risk measure (1) plotted against estimated risk measure (3), log scale]
• From the perspective of the intruder, it is difficult to identify high-risk (population unique) records
Information Loss Measures
• Utility measured by whether inference can be carried out on perturbed data similar to original data
• Use proxy information loss measures on distributions calculated from microdata
• Distance metrics:
$$AAD(D_{orig}, D_{pert}) = \sum_{r,c} \left| D_{pert}(r, c) - D_{orig}(r, c) \right| / RC$$
where $RC$ is the number of cells in the distribution
Let $D_{avg} = \sum_{r,c} D_{orig}(r, c) / RC$; then
$$RAAD(D_{orig}, D_{pert}) = 100 \times (D_{avg} - AAD) / D_{avg}$$
is a measure of average absolute perturbation compared to average cell size
Also, can consider the Kolmogorov-Smirnov statistic, Hellinger’s distance and relative differences in means or variances
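Both distance metrics are easy to compute for a two-way table held as a list of rows. The sketch below uses hypothetical 2x2 tables of counts; the function names mirror the slide's notation.

```python
def aad(d_orig, d_pert):
    """Average absolute difference per cell between two tables of equal shape."""
    cells = [(o, p) for row_o, row_p in zip(d_orig, d_pert)
             for o, p in zip(row_o, row_p)]
    return sum(abs(p - o) for o, p in cells) / len(cells)


def raad(d_orig, d_pert):
    """100 * (D_avg - AAD) / D_avg, with D_avg the average original cell size."""
    n_cells = sum(len(row) for row in d_orig)
    d_avg = sum(sum(row) for row in d_orig) / n_cells
    return 100 * (d_avg - aad(d_orig, d_pert)) / d_avg


# Hypothetical original and perturbed tables.
d_orig = [[10, 20], [30, 40]]
d_pert = [[12, 18], [29, 41]]
aad_value = aad(d_orig, d_pert)
raad_value = raad(d_orig, d_pert)
print(aad_value, raad_value)
```

A value of RAAD close to 100 indicates that the average perturbation is small relative to the average cell size.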
Risk-Utility Map
[Figure: risk-utility map for PRAM and record swapping; random perturbation versus targeted perturbation on non-white ethnicities]
Risk is measured over sample uniques (SU), writing the misclassification probabilities as $M_{jk}$:
$$\sum_{j \in SU} \frac{M_{jj} / (1 - \pi M_{jj})}{\sum_k F_k M_{jk} / (1 - \pi M_{jk})}$$
Probabilistic Record Linkage
• $\tilde{X}_a$ is the value of the vector of cross-classified identifying key variables for unit $a$ in the microdata ($a \in s_1$)
• $X_b$ is the corresponding value for unit $b$ in the external database ($b \in s_2$), $s_2 \subset P$
• Misclassification mechanism via probability matrix:
$$P(\tilde{X}_a = k \mid X_a = j) = M_{kj}$$
• Comparison vector $\gamma(\tilde{X}_a, X_b)$ for pairs of units $(a, b) \in s_1 \times s_2$
• For a subset $\tilde{s} \subset s_1 \times s_2$, partition the set of pairs in $\tilde{s}$ into matches ($M$) and non-matches ($U$) through the likelihood ratio $m(\gamma) / u(\gamma)$, where
$$m(\gamma) = P(\gamma(\tilde{X}_a, X_b) = \gamma \mid (a, b) \in M)$$
$$u(\gamma) = P(\gamma(\tilde{X}_a, X_b) = \gamma \mid (a, b) \in U)$$
Probabilistic Record Linkage
• $p = P((a, b) \in M)$ is the probability that a pair is in $M$
• Probability of a correct match:
$$p_{M \mid \gamma} = P((a, b) \in M \mid \gamma(\tilde{X}_a, X_b)) = \frac{m(\gamma)\, p}{m(\gamma)\, p + u(\gamma)(1 - p)}$$
• Estimate parameters using previous test data or the EM algorithm, assuming conditional independence:
$$m(\gamma) = P(\gamma(\tilde{X}_a, X_b) \mid (a, b) \in M) = P(\gamma_1(\tilde{X}_a, X_b) \mid (a, b) \in M) \times P(\gamma_2(\tilde{X}_a, X_b) \mid (a, b) \in M) \times \cdots \times P(\gamma_K(\tilde{X}_a, X_b) \mid (a, b) \in M)$$
Linking the Two Frameworks
• No misclassification

            Non-match                      Match          Total
Disagree    $n(N-1) - f_j(F_j-1)$          $n - f_j$      $Nn - f_j F_j$
Agree       $f_j(F_j-1)$                   $f_j$          $f_j F_j$
Total       $n(N-1)$                       $n$            $Nn$

$$m(\gamma) = f_j / n, \qquad u(\gamma) = f_j (F_j - 1) / [n (N - 1)], \qquad p = 1/N$$
$$p_{M \mid \gamma} = \frac{(1/N)(f_j / n)}{(1/N)(f_j / n) + (1 - 1/N)\, f_j (F_j - 1) / [n (N - 1)]} = \frac{1}{F_j}$$
Linking the Two Frameworks
• With misclassification, denote
$$\tilde{f}_j = M_{jj} f_j + \sum_{k \neq j} M_{jk} f_k$$

            Non-match                                     Match               Total
Disagree    $Nn - n - \tilde{f}_j F_j + M_{jj} f_j$       $n - M_{jj} f_j$    $Nn - \tilde{f}_j F_j$
Agree       $\tilde{f}_j F_j - M_{jj} f_j$                $M_{jj} f_j$        $\tilde{f}_j F_j$
Total       $Nn - n$                                      $n$                 $Nn$

$$m(\gamma) = M_{jj} f_j / n, \qquad u(\gamma) = (\tilde{f}_j F_j - M_{jj} f_j) / [n (N - 1)], \qquad p = 1/N$$
$$p_{M \mid \gamma} = \frac{(1/N)\, M_{jj} f_j / n}{(1/N)\, M_{jj} f_j / n + (1 - 1/N)(\tilde{f}_j F_j - M_{jj} f_j) / [n (N - 1)]} = \frac{M_{jj} f_j}{\tilde{f}_j F_j} \approx \frac{M_{jj}}{\tilde{F}_j}$$
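The algebra linking the two frameworks can be checked with exact rational arithmetic. The sketch below plugs $m(\gamma)$, $u(\gamma)$ and $p$ into the match probability and confirms that it reduces to $1/F_j$ without misclassification and to $M_{jj} f_j / (\tilde{f}_j F_j)$ with it. All numerical values and function names are hypothetical.

```python
from fractions import Fraction


def match_probability(m, u, p):
    """p_{M|gamma} = m p / (m p + u (1 - p))."""
    return m * p / (m * p + u * (1 - p))


def linkage_probability(f_j, F_j, f_tilde_j, M_jj, n, N):
    """Match probability for agreement on cell j, from the 2x2 table of pairs."""
    m = M_jj * f_j / n
    u = (f_tilde_j * F_j - M_jj * f_j) / (n * (N - 1))
    p = Fraction(1, N)
    return match_probability(m, u, p)


n, N = 100, 10_000   # hypothetical sample and population sizes
f_j, F_j = 3, 250    # hypothetical sample and population counts in cell j

# No misclassification: M_jj = 1 and the observed count equals f_j.
no_error = linkage_probability(Fraction(f_j), F_j, Fraction(f_j), Fraction(1), n, N)

# With misclassification: M_jj = 0.8 and a hypothetical observed count of 4.
with_error = linkage_probability(Fraction(f_j), F_j, Fraction(4), Fraction(4, 5), n, N)
print(no_error, with_error)
```

Using `Fraction` keeps the comparison exact, so the equalities hold without any floating-point tolerance.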
Linking the Two Frameworks
• Matching 2,853 sample uniques to the population and blocking on all key variables except LAD results in 1,534,293 possible pairs

                Non-match    Match    Total
Disagree LAD    1,388,069    619      1,388,688
Agree LAD       143,321      2,234    145,555
Total           1,531,390    2,853    1,534,293

$m(\gamma) = 0.78$, $u(\gamma) = 0.09$, $p = 0.002$
• On average across blocks, the probability of a correct match given an agreement on LAD is $p_{M \mid \gamma} = 0.015$
Linking the Two Frameworks
• Probability of a correct match given an agreement, $p_{M \mid \gamma}$, for each $\gamma(\tilde{X}_a, X_b) = j$
• Compare to the risk measure $M_{jj} / \tilde{F}_j$
• Summing over $p_{M \mid \gamma}$ gives the global disclosure risk measure of 289.5
Discussion
• Global disclosure risk measures are accurately estimated for a risk-utility assessment, assuming a known non-misclassification probability
• Empirical evidence of a connection between F&S (Fellegi and Sunter) record linkage and probabilistic modelling for estimating identification risk
• Estimation carried out through log-linear modelling for the probabilistic modelling or the EM algorithm for the F&S record linkage
• Individual disclosure risk measures are more difficult to estimate without knowing true population parameters in both frameworks
• From the perspective of the intruder, it is difficult to identify sample uniques that are population uniques
Discussion
• Statistical Agencies (MRP) need to:
- Assess disclosure risk objectively
- Set tolerable risk thresholds according to different access modes
- Optimize and combine SDL techniques
- Provide guidelines on how to analyze disclosure-controlled datasets
• Future dissemination strategies present new challenges:
- Synthetic data for web access prior to accessing real data
- Online SDL techniques for flexible table-generating software and remote access
- Auditing query systems
• Bridge the Statistical and Computer Science literature on privacy-preserving algorithms
Thank you for your attention