Revision of Jim's presentation on Measures of Disclosure Risk


Measures of Disclosure Risk and Harm
Diane Lambert, Journal of Official Statistics, 9 (1993), pp. 313-331
Jim Lynch
NISS/SAMSI & University of South Carolina
1
Measures of Disclosure Risk and Harm
• Introduction-Discussion (Section 7)
• What is Disclosure?
• Risk of Perceived Identification
• Modeling the Intruder
• Risk of True Identification
• Disclosure Harm
2
Discussion (Section 7)
• It is the intruder, and not the structure of
the data alone, that controls disclosure.
• When the intruder is sure enough that a
released record belongs to a respondent
– There is a reidentification.
– It may be incorrect, but the intruder perceives
there to be a reidentification.
3
Discussion (Section 7)
• The risk of perceived disclosure and the risk of true
disclosure cannot be measured without considering the
seriousness of the threat posed by the intruder's strategy.
• The harm that follows from a reidentification
– Depends on the attributes, if any, that the intruder infers about the
target
– The harm cannot be measured without considering the strategy that
the intruder uses to infer sensitive attributes.
• Once the intruder's strategy is modeled, disclosure risk and
harm can be evaluated
• Risk is measured in terms of probabilities
• Harm is measured in losses or costs.
4
Discussion (Section 7)
• All the agency can do to reduce disclosure risk or
harm is
– to mask the data before release
– or carefully select the individuals and organizations that
are given the data, or both.
• The models developed here imply that masking
and releasing only a subset of records does not
necessarily protect against disclosure.
• Masking may lower the risk of true reidentification
– But it may also lead to false reidentifications and false
inferences about attributes.
– The fact that inferred attributes may be wrong may be
little comfort to the respondent whose record is reidentified.
5
Discussion (Section 7)
• Masking also complicates data analysis
– An agency cannot be expected to predict and
minimize all the effects of masking on all the analyses
of interest.
– Nor is it reasonable to expect the data analyst to
describe how the data will be analyzed before the
data are obtained so that the agency can verify that
the conclusions will be the same for the masked data
as they would have been for the original data.
– Future masking techniques may preserve more
general features of the data, but for now data masked
enough to preserve confidentiality can be a challenge
to analyze appropriately.
6
Discussion (Section 7)
• It does seem reasonable to put some of the burden for
protecting confidentiality on the researcher.
– Institutions and researchers have to abide by all sorts of
conditions in experiments involving humans.
– The experience in those and other areas ought to provide some
guidance on protecting respondents in agency databases from
unscrupulous intruders.
– This would not necessarily remove the need for some masking, but it
might reduce the need for heroic masking that severely limits the
usefulness of the data.
• “Confidentiality issues for medical data miners,” Jules J. Berman,
Pathology Informatics Cancer Diagnosis Program, DCTD, NCI, NIH,
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed&cmd=Retrieve&list_uids=12234715&dopt=Citation
7
Discussion (Section 7)
• One could argue that models of disclosure
are hopeless because the issues are too
complex and the intruder too mysterious.
• This paper, though, argues that models of
disclosure are indispensable.
– They force definitions and assumptions to be
stated explicitly.
– When the assumptions are realistic, models of
disclosure can lead to practical measures of
disclosure risk and harm.
8
What is Disclosure?
• Key Attributes
– Useful for identification but usually not sensitive
– E.g., age, location, marital status, and profession
• Sensitive Attributes
– Disease, debts, credit rating
• Scenario: A sample of records is released
– Obvious identifiers removed
– Some attributes left intact such as marital status
– Others modified to protect confidentiality
• Incomes truncated, professions grouped more coarsely, ages on pairs
of records swapped; some attributes on some records might be missing
or imputed.
9
What is Disclosure?
• Two major types of disclosure
– Identification or Re-identification
• Equivalent to inadvertent release of an identifiable record
– Attribute Disclosure
• Occurs when the intruder believes something new has been
learned about the respondent.
• May occur with or without re-identification
• E.g., the intruder may narrow the list of possible target records to
two with nearly the same value of a sensitive attribute. Then the
attribute is disclosed although the target record is not located. Or
two records may be averaged so the released record belongs to
no one. Yet the debt on the averaged record may disclose
something about the debt carried by the targeted individual. The
agency must decide whether attribute disclosures without
identifications are important.
10
What is Disclosure?
• This paper considers only disclosures that involve reidentifications, NOT
attribute disclosures without reidentification.
• Attribute disclosures that result from reidentification are considered to the extent
that they harm the respondent.
• In this paper, the risk of disclosure is the
risk of reidentifying a released record and
the harm from disclosure depends on
what is learned from the identification.
11
What is Disclosure?
• Attribute disclosures that do not involve
identification are ignored
• This assumes that all intruders first look
for the record that is most likely to be
correct and then take information about
the targeted attribute from that record.
• Intruders with other strategies are ignored.
12
What is Disclosure?
• Includes
– true and false reidentifications and
– true and false attribute disclosures.
– Correct and incorrect inferences can be distinguished if desired (as
happens with measures of harm)
• It distinguishes between
– true identification and true attribute disclosure and
– perceived identification and perceived attribute disclosure (the
intruder believes the information is correct)
where the former matters when correct inferences are to be
prevented and the latter when perceived inferences are to be
prevented.
13
The Risk of Perceived Identification
• Basic Premise: Disclosure is limited only to
the extent that the intruder is discouraged
from making any inferences, correct or
incorrect, about a particular target
respondent.
14
The Risk of Perceived Identification
• Format (Similar to Jerry’s last time)
– Population of N records denoted Z
– A random sample of n masked records X=(x1,…, xn)
with k attributes
– Masking suppresses attributes in Z, adds random
noise, truncates outliers, or swaps values of an
attribute between records. Knowing this, which, if any,
record in the released file should be linked to the
target respondent’s record Y?
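
As a concrete, hypothetical illustration of these masking operations, the Python sketch below applies truncation, random noise, coarser grouping, and swapping to a toy record set; all field names and parameter values are made up.

```python
# Illustrative sketch (not from the paper): toy versions of the masking
# operations listed above. Field names and parameters are hypothetical.
import random

random.seed(0)
records = [{"age": 34, "income": 52_000, "profession": "nurse"},
           {"age": 61, "income": 240_000, "profession": "surgeon"}]

def mask(recs, income_cap=150_000, noise_sd=2.0):
    recs = [dict(r) for r in recs]                    # work on a copy
    for r in recs:
        r["income"] = min(r["income"], income_cap)    # truncate outliers
        r["age"] += round(random.gauss(0, noise_sd))  # add random noise
        r["profession"] = {"nurse": "health", "surgeon": "health"}.get(
            r["profession"], "other")                 # group more coarsely
    i, j = random.sample(range(len(recs)), 2)         # swap ages on a pair
    recs[i]["age"], recs[j]["age"] = recs[j]["age"], recs[i]["age"]
    return recs

print(mask(records))
```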
15
The Risk of Perceived Identification
• Rational Intruder has two options.
– 1. Decide that one of the released records belongs to the target
respondent (i.e., link the i-th released record xi to the target
record Y).
– 2. Decide not to link any released record to Y (the null link),
perhaps because none of the released records is close enough
to what the intruder expects for Y, or perhaps because too many
released records are close to what the intruder expects for Y.
• Rational intruder chooses the link (nonnull or null)
believed most likely to be correct whenever any incorrect
choice incurs the same positive loss and a correct link
incurs no loss. (See Duncan and Lambert (1989) for
details.)
16
The Risk of Perceived Identification

p_i = the intruder's probability that the i-th released record in X is the target's.
q = 1 − Σ_{i=1}^{n} p_i is the intruder's probability that the target record has not been released.
Intruder's protocol
o If q > max_i p_i, don't link (choose the null link).
o If max_i p_i is large enough, then the intruder links with a released record.
o max_i p_i is called the risk of perceived re-identification.
o The risk of perceived re-identification is not defined in terms of
the intruder's expected loss or the agency's or respondent's
expected loss
  – it is defined in terms of the seriousness of the threat posed by
the intruder
  – it depends only on the intruder's posterior probability that
the i-th released record is the correct one.
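
A minimal Python sketch of this decision rule, with made-up values of p_i:

```python
# Sketch of the rational-intruder rule above: link to the released record with
# the largest probability p_i unless the null link (probability q) is more likely.
p = [0.05, 0.62, 0.10]          # intruder's probabilities for the released records (illustrative)
q = 1 - sum(p)                  # probability the target record was not released

best_i, best_p = max(enumerate(p), key=lambda t: t[1])
if q > best_p:
    print("null link: the intruder decides the target record was not released")
else:
    print(f"link released record {best_i}; risk of perceived re-identification = {best_p}")
```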
17
The Risk of Perceived Identification
Other Measures
D(X) = max_{j≤N} max_{i≤n} P[x_i is the j-th population record | X] = max_{j≤N} max_{i≤n} P_ij(X)
D(X) = max_{i,j≤n} P[x_i is respondent 1's population record and x_j is respondent 2's population record | X]
D_ave(X) = Σ_{j=1}^{N} max_{i≤n} P_ij(X) / N = D_tot(X) / N
D_τ(X) = |{ j ≤ N : max_{i≤n} P_ij(X) ≥ τ }|, where |A| denotes the cardinality of A
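
A small sketch computing these measures from a hypothetical matrix P_ij (n = 2 released records, N = 3 population records); the numbers are illustrative only:

```python
# Sketch: file-level disclosure measures from a hypothetical matrix
# P[i][j] = P(released record i is population record j | X).
P = [[0.70, 0.05, 0.10],      # n = 2 released records (rows)
     [0.10, 0.40, 0.20]]      # N = 3 population records (columns)

col_max = [max(P[i][j] for i in range(len(P))) for j in range(len(P[0]))]

D     = max(col_max)                      # worst-case population record
D_tot = sum(col_max)                      # summed over the population
D_ave = D_tot / len(P[0])                 # averaged over the population
tau   = 0.5
D_tau = sum(m >= tau for m in col_max)    # records identifiable above threshold tau
print(D, D_ave, D_tot, D_tau)
```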
18
Modeling the Intruder
Example 4.1 – Pop of 2 Records: N=2=n
• One continuous attribute
• Intruder makes judgments about M(Y), the
masked version of target Y
• A series of judgments leads the intruder to model the prior
for M(Y) as lognormal(m, s) with (m, s) = (0, 1) (prior
denoted f1(x))
• Information about the other respondent, Y’, is
modeled as M(Y’)~lognormal(2,1) denoted f2(x)
• E(M(Y))=1.65 and E(M(Y’))=12.2
• Released data is X=(7,20)
19
Modeling the Intruder
Example 4.1 – A “Posterior Calculation”
• p1=P(M(Y)=7|X=(7,20))=P(M(Y’)=20|X=(7,20))
=f1(7)f2(20)/[f1(7)f2(20)+f2(7)f1(20)]=.89
• In the original population Y=Z1<Z2=Y’; p1 is just the
probability that the order is preserved in the released
data after masking. The terminology of “prior” and
“posterior” is not meant to suggest that this is Bayesian; it is
just modeling the masking.
• If masking techniques require order to be preserved
then p1=1 and the joint distribution of M(Y) and M(Y’)
is not f1f2.
20
Modeling the Intruder
Example 4.1
D(X) = max_{j≤2} max_{i≤2} P[x_i is the j-th pop record | X = (7, 20)]
     = max{ P(M(Y) = 7 | X = (7, 20)), P(M(Y) = 20 | X = (7, 20)) }
     = max{ .89, .11 } = .89
• Suppose only one record is released and it is x = 7. Then,
p1 = P(M(Y) is selected and M(Y) = 7 | X = (7))
   = .5 f1(7) / [.5 f1(7) + .5 f2(7)] = .13
• In this case, D(X) = max(.13, .87) = .87
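
The whole of Example 4.1 can be checked numerically; the sketch below reproduces .89, .13, and .87 from the stated lognormal models:

```python
# Reproduces Example 4.1 under the stated intruder model:
# M(Y) ~ lognormal(0,1), M(Y') ~ lognormal(2,1), released X = (7, 20).
from math import log, exp, pi, sqrt

def lognorm_pdf(x, m, s):
    return exp(-(log(x) - m) ** 2 / (2 * s ** 2)) / (x * s * sqrt(2 * pi))

f1 = lambda x: lognorm_pdf(x, 0, 1)   # intruder's model for the target's masked record
f2 = lambda x: lognorm_pdf(x, 2, 1)   # model for the other respondent

# Both records released, X = (7, 20): probability the ordering is preserved
p1 = f1(7) * f2(20) / (f1(7) * f2(20) + f2(7) * f1(20))
print(round(p1, 2))                   # 0.89

# Only one record released, x = 7 (each respondent sampled with probability 1/2)
p1_single = 0.5 * f1(7) / (0.5 * f1(7) + 0.5 * f2(7))
print(round(p1_single, 2), round(max(p1_single, 1 - p1_single), 2))   # 0.13, 0.87
```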
21
Modeling the Intruder
Example 4.2 – n of N records
• Intruder believes that the i-th record in pop Z will
appear as Mi = M(Yi) ~ fi(x)
• The probability that the nth released record
belongs to target Y1 is
pn=P(Y1 is sampled and M1=xn|X)
=P(xn is sampled from f1 and x1,…, xn-1 are
sampled from f2,…, fN)/P(x1,…, xn are
sampled from f1,…, fN)
22
Modeling the Intruder
Example 4.2 - n=2 of N=3 records
Non-unique Records
• p1 = P(Y1 is sampled and M1 = x1 | X = (x1, x2))
     = (1/3) f1(x1) [.5 (f2(x2) + f3(x2))] / Σ_{j=1}^{3} (1/3) f_j(x1) [.5 Σ_{i≠j} f_i(x2)]
• If f1 = f2 = f3 then p1 = 1/3 (in general, p1 = 1/N).
• If f1 = f2 then p1 ≤ 1/2 (in general, p1 ≤ 1/(N−1) if
f1 = f2 = … = f_{N−1}).
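
A sketch of this N = 3, n = 2 calculation with hypothetical normal densities; setting all three densities equal recovers p1 = 1/3, and setting f1 = f2 gives a value just under 1/2:

```python
# Sketch of the n = 2 of N = 3 calculation above; the densities and the
# observed values (0.4, -1.2) are illustrative choices.
from math import exp, pi, sqrt

def norm_pdf(x, m, s):
    return exp(-(x - m) ** 2 / (2 * s ** 2)) / (s * sqrt(2 * pi))

def p1(f, x1, x2):
    """P(Y1 is sampled and M1 = x1 | X = (x1, x2)) for N = 3, n = 2."""
    num = (1 / 3) * f[0](x1) * 0.5 * (f[1](x2) + f[2](x2))
    den = sum((1 / 3) * f[j](x1) * 0.5 * sum(f[i](x2) for i in range(3) if i != j)
              for j in range(3))
    return num / den

same = [lambda x: norm_pdf(x, 0, 1)] * 3
print(p1(same, 0.4, -1.2))            # exactly 1/3: records indistinguishable

mixed = [lambda x: norm_pdf(x, 0, 1), lambda x: norm_pdf(x, 0, 1),
         lambda x: norm_pdf(x, 5, 1)]
print(p1(mixed, 0.4, -1.2))           # just under 1/2 when f1 = f2
```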
23
Example 4.2 - n=1 of N=2 records
Unknown respondents may be reidentifiable
• Intruder’s priors on Z
– Y1~Unif[-4,4], Y2~N(0,1), x1=-2.25
• p1 = P(Y1 is sampled and Y1 = x1 | X = (−2.25))
     = f1(x1) / Σ_{j=1}^{2} f_j(x1)
     = (1/8) / [1/8 + f2(x1)] ≈ .798
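
The .798 figure follows directly from the two densities; a quick check:

```python
# Check of the calculation above: Y1 ~ Unif[-4, 4], Y2 ~ N(0, 1),
# one released record x1 = -2.25, each respondent sampled with probability 1/2.
from math import exp, pi, sqrt

f1 = 1 / 8                                     # Unif[-4, 4] density at -2.25
f2 = exp(-(-2.25) ** 2 / 2) / sqrt(2 * pi)     # N(0, 1) density at -2.25
print(round(f1 / (f1 + f2), 3))                # 0.798
```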
24
Example 4.2 - n=10 of N=100 records
Sampling by itself need not protect confidentiality
• Target is thought to be the smallest in the population
• The priors: Y1 ~ LogN(0, .5), Y2, …, Y100 iid ~ LogN(2, .5)
• Masking is iid multiplicative LogN(0, .5)
• Uncertainty in the released records (masking + intruder prior):
M1 ~ LogN(0, 1) = f1, M2, …, M100 iid ~ LogN(2, 1) = f2
• X=(.05, .14, 1.5, 2.4, 3.2, 3.8, 4.6, 8.7, 10.3, 10.7)
25
Example 4.2 - n=10 of N=100 records
Sampling by itself need not protect confidentiality
• p_i = P(M1 = x_i | X)
      = [ (1/100) (C(99,9)/C(100,10)) f1(x_i) Π_{j≠i} f2(x_j) ]
        / [ Σ_{j=1}^{10} (1/100) (C(99,9)/C(100,10)) f1(x_j) Π_{k≠j} f2(x_k) + (C(99,10)/C(100,10)) Π_{j=1}^{10} f2(x_j) ]
      = [ f1(x_i) / f2(x_i) ] / [ Σ_{j=1}^{10} f1(x_j) / f2(x_j) + 900 ]
26
Example 4.2 - n=10 of N=100 records
Sampling by itself need not protect confidentiality
Values of p_i

x       f1(x)      f2(x)       f1(x)/f2(x)   p_i (constant 90)   p_i (constant 900)
0.05    0.089778   0.0000304   2955.62       0.861949            0.697245
0.14    0.412456   0.0010941    376.99       0.109942            0.088934
1.50    0.244974   0.0745956      3.28       0.000958            0.000775
2.40    0.113310   0.0883285      1.28       0.000374            0.000303
3.20    0.063384   0.0878392      0.72       0.000210            0.000170
3.80    0.043065   0.0841587      0.51       0.000149            0.000121
4.60    0.027068   0.0775133      0.35       0.000102            0.000082
8.70    0.004417   0.0452479      0.10       0.000028            0.000023
10.30   0.002553   0.0366537      0.07       0.000020            0.000016
10.70   0.002247   0.0348145      0.06       0.000019            0.000015
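
The p_i columns of the table can be reproduced from the simplified formula above; the constants 90 and 900 correspond to the two right-hand columns:

```python
# Sketch reproducing the p_i columns above from
# p_i = (f1(x_i)/f2(x_i)) / (sum_j f1(x_j)/f2(x_j) + constant),
# with f1 = LogN(0,1) and f2 = LogN(2,1).
from math import log, exp, pi, sqrt

def lognorm_pdf(x, m, s):
    return exp(-(log(x) - m) ** 2 / (2 * s ** 2)) / (x * s * sqrt(2 * pi))

X = [0.05, 0.14, 1.5, 2.4, 3.2, 3.8, 4.6, 8.7, 10.3, 10.7]
ratio = [lognorm_pdf(x, 0, 1) / lognorm_pdf(x, 2, 1) for x in X]

for const in (90, 900):                     # the two constants tabulated above
    p = [r / (sum(ratio) + const) for r in ratio]
    print(const, [round(v, 4) for v in p])  # p_1 is about 0.86 (90) or 0.70 (900)
```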
27
Risk of True Identification
• The agency cannot control the intruder's
perceptions and actions once the data are
released.
• All it can do is count the number of true
identifications for an intruder with a given set of
beliefs about the target and source file.
• A reasonable measure of the risk of true
identification, then, is simply the fraction of
released records (or number of released
records) that an intruder can correctly reidentify.
28
Risk of True Identification
• Distinguishes “Risk of Matching” (Spruill, 1982) from “Risk of
True Identification” (risk of matching is the proportion of masked
records whose closest source records are the actual
source records that generated them)
• To illustrate risk of true identification, consider
the following example where N is large and n is
small, so that we can calculate using sampling
with replacement
29
Risk of True Identification
• p_n1 = P(x_1, …, x_{n−1} is from f_2, …, f_N and x_n is from f_1) / P(x_1, …, x_n is from f_1, …, f_N)

       = [ f_1(x_n) Π_{i=1}^{n−1} P(x_i is from f_2 or f_3 or … or f_N) ] / [ Π_{i=1}^{n} P(x_i is from f_1 or f_2 or … or f_N) ]

       = [ f_1(x_n) Π_{i=1}^{n−1} Σ_{j=2}^{N} f_j(x_i) ] / [ Π_{i=1}^{n} Σ_{j=1}^{N} f_j(x_i) ]
30
Risk of True Identification
Thus, p_11 ≥ p_n1 if and only if

  p_11 / p_n1 = [ f_1(x_1) Π_{i=2}^{n} Σ_{j=2}^{N} f_j(x_i) ] / [ f_1(x_n) Π_{i=1}^{n−1} Σ_{j=2}^{N} f_j(x_i) ] ≥ 1

  ⟺ f_1(x_1) / Σ_{j=2}^{N} f_j(x_1) ≥ f_1(x_n) / Σ_{j=2}^{N} f_j(x_n)

For f_i = N(y_i, Σ_i) this last inequality becomes

  exp{ −.5 (x_1 − y_1)′ Σ_1^{−1} (x_1 − y_1) } / Σ_{j=2}^{N} exp{ −.5 (x_1 − y_j)′ Σ_j^{−1} (x_1 − y_j) }
    ≥ exp{ −.5 (x_n − y_1)′ Σ_1^{−1} (x_n − y_1) } / Σ_{j=2}^{N} exp{ −.5 (x_n − y_j)′ Σ_j^{−1} (x_n − y_j) }
31
Risk of True Identification
Source y_j       9.8     10.8    14.1    14.6    14.7    15.0    30.0    40.7    47.1    53.2
x1 = 32 (15)     0.016   0.024   0.065   0.072   0.074   0.078   0.202   0.183   0.156   0.130
x2 = 35 (30)     0.010   0.017   0.048   0.054   0.056   0.059   0.199   0.205   0.188   0.164

TABLE 1
The intruder's probability p_ij that the released record in the i-th row comes from
the source record in the j-th column. The intruder knows the source values. Unknown to
the intruder, 32 is the masked version of 15.0 and 35 is the masked version of
30.0. Masking: f_j ~ LogN(y_j, .5)
• Risk of true identification is low (zero if .078 is too low to link). Look down
columns 15 and 30.
• Risk of matching is not zero for both records? Look across rows: 32 is
matched with 30, which is incorrect, but 35 is matched with 30? (Why not
40.7?) The claimed risk of matching is 1/2?
• Risk of perceived re-identification? Look down all columns. If 1 minus the sum
of a column is more than the maximum of the column, the intruder is
wasting their time: the rational decision for that column is that the
record has not been released. This is an assumption about the intruder.
In this example, it is true for all the columns.
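
Table 1 can be reproduced (to rounding) under one reading of the setup, namely that f_j is lognormal with log-mean log y_j and log-scale s.d. 0.5, and that p_ij is computed by pairing the two released records with two distinct source records, as in Example 4.1. That reading is an assumption, not stated explicitly above:

```python
# Sketch reproducing Table 1 under one reading of the setup (an assumption):
# f_j = LogN(log y_j, sd = 0.5) and p_ij proportional to
# f_j(x_i) * sum_{k != j} f_k(x_other), as in Example 4.1.
from math import log, exp, pi, sqrt

def lognorm_pdf(x, m, s):
    return exp(-(log(x) - m) ** 2 / (2 * s ** 2)) / (x * s * sqrt(2 * pi))

y = [9.8, 10.8, 14.1, 14.6, 14.7, 15.0, 30.0, 40.7, 47.1, 53.2]   # source records
x1, x2 = 32.0, 35.0                                               # released records

def row(x_this, x_other):
    w = [lognorm_pdf(x_this, log(yj), 0.5) *
         sum(lognorm_pdf(x_other, log(yk), 0.5) for yk in y if yk != yj)
         for yj in y]
    return [v / sum(w) for v in w]

print([round(p, 3) for p in row(x1, x2)])   # approximately the row for x1 = 32
print([round(p, 3) for p in row(x2, x1)])   # approximately the row for x2 = 35
```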
32
Disclosure Harm
• Considers only harm to the respondent (not to agencies,
researchers, etc.) whose released record has been reidentified or is perceived to have been reidentified
• Scenario
– Masked data X = (x1, …, xn) is released, where xi = (xi1, …, xik) and
xi1 is a binary attribute of interest. Assume that the target record
is Y1 and that the intruder has linked Y1 to x1. Let x1,-1 = (x12, …, x1k)
and X-1 = (x1,-1, x2, …, xn)
– Because of masking the intruder believes, independently of
everything else, that
• x11 = Y11 with probability q
• x11 = 1 − Y11 with probability 1 − q
33
Disclosure Harm
Let x11 = 1. Then
P(Y11 = 1 | X) = P(Y11 = 1 | x11 = 1, X-1)
  = P(x11 = 1 | Y11 = 1, X-1) P(Y11 = 1 | X-1)
    / [ P(x11 = 1 | Y11 = 1, X-1) P(Y11 = 1 | X-1) + P(x11 = 1 | Y11 = 0, X-1) P(Y11 = 0 | X-1) ]
  = q P(Y11 = 1 | X-1) / [ q P(Y11 = 1 | X-1) + (1 − q) P(Y11 = 0 | X-1) ]
Similarly, when x11 = 0,
P(Y11 = 1 | X) = P(Y11 = 1 | x11 = 0, X-1)
  = (1 − q) P(Y11 = 1 | X-1) / [ (1 − q) P(Y11 = 1 | X-1) + q P(Y11 = 0 | X-1) ]
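
A one-function sketch of this update; q and the prior P(Y11 = 1 | X-1) are illustrative inputs:

```python
# Sketch of the posterior update above: the intruder's probability that the
# target's true binary attribute Y11 is 1, given the released value x11, the
# probability q that masking left x11 unchanged, and a prior P(Y11 = 1 | X-1).
def posterior_y11(x11, q, prior1):
    like1 = q if x11 == 1 else 1 - q          # P(x11 | Y11 = 1)
    like0 = 1 - q if x11 == 1 else q          # P(x11 | Y11 = 0)
    return like1 * prior1 / (like1 * prior1 + like0 * (1 - prior1))

print(posterior_y11(x11=1, q=0.9, prior1=0.3))   # released 1, masking rarely flips
print(posterior_y11(x11=0, q=0.9, prior1=0.3))
```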
34
Disclosure Harm-Logistic Regression
Assume log[ P(Yi1 = 1 | x-i) / P(Yi1 = 0 | x-i) ] = β0 + Σ_{j=1}^{k} βj xij.
Then the contribution to the log likelihood of xi1 = δi, where δi = 0 or 1, is
  log P(xi1 = δi | x-i) = log{ q^δi (1 − q)^(1−δi) P(Yi1 = 1 | x-i) + q^(1−δi) (1 − q)^δi P(Yi1 = 0 | x-i) }.
Obtain the MLEs β̂0, …, β̂k and estimate P(Y11 = 1 | x1) by
  exp{ β̂0 + Σ_{j=1}^{k} β̂j x1j } / [ 1 + exp{ β̂0 + Σ_{j=1}^{k} β̂j x1j } ]
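
A sketch of fitting this misclassification-adjusted logistic likelihood by numerical maximization; the simulated data and the choice of a general-purpose optimizer are illustrative, not from the paper:

```python
# Sketch: maximize the misclassification-adjusted logistic likelihood above.
# Simulated data and optimizer choice are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, k, q = 500, 2, 0.8                        # q = P(released bit equals the true bit)
Xmat = rng.normal(size=(n, k))               # the other (unmasked) attributes
true_beta = np.array([-0.5, 1.0, -1.5])      # beta_0, beta_1, ..., beta_k
p_true = 1 / (1 + np.exp(-(true_beta[0] + Xmat @ true_beta[1:])))
Y = rng.binomial(1, p_true)                  # true binary attribute Y_i1
flip = rng.binomial(1, 1 - q, size=n)
x1 = np.where(flip == 1, 1 - Y, Y)           # released (possibly flipped) attribute

def negloglik(beta):
    p1 = 1 / (1 + np.exp(-(beta[0] + Xmat @ beta[1:])))   # P(Y_i1 = 1 | x_-i)
    lik = np.where(x1 == 1, q * p1 + (1 - q) * (1 - p1),
                             (1 - q) * p1 + q * (1 - p1))
    return -np.sum(np.log(lik))

fit = minimize(negloglik, x0=np.zeros(k + 1), method="Nelder-Mead")
print(fit.x)                                 # estimates of (beta_0, ..., beta_k)
```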
35
Disclosure Harm-Measures of Harm
• Harm H(Y11, X) is a variable that takes on various values
depending on the action that the intruder takes based on their
strategy
• These values are losses and are
– 0 if the record is not identified
– c_FN if the re-identification is incorrect and Y11 is not inferred
– c_TN if the re-identification is correct and Y11 is not inferred
– and, if Y11 is inferred, one of the following, where the first
subscript indicates whether the re-identification is correct (T) or
incorrect (F) and the second whether the inferred attribute is
correct or incorrect:

Re-ID \ Inference   Correct   Incorrect
Correct             c_TT      c_TF
Incorrect           c_FT      c_FF
36
Disclosure Harm
Some Possibly Delusionary Closing Comments
• Think of the source data, Y, as the parameter
• The released data, X, is the sample
• This is somewhat like a two person game where
the agency plays the role of Mother Nature and
the intruder is the other person
• The agency controls the way it generates the
released data
37
Disclosure Harm
Some Possibly Delusionary Closing Comments
• When we describe the mechanism/structure/model that
is used to generate released data, we are, at least in part,
specifying the model X|Y.
– Are we totally specifying this?
– There are at the very least some nuisance parameters regarding
weights, e.g.
• Is there a meaningful interpretation, from the agency's
perspective, in randomizing over the parameter?
• Perhaps we should reverse the roles of the agency and
the intruder. The parameter is then the intruder’s
strategy. In any event Lambert is suggesting that we
need to model the intruder strategy and formulate the
problem from a decision theory standpoint.
38
Disclosure Harm
Some Possibly Delusionary Closing Comments
Addendum Based on the Talk Last Week
• Last week, Bahjat described the mechanism/structure/model that
could be used to generate released data based on swapping.
• There the model for X|Y was completely specified.
• Modeling the intruder’s behavior involves
– Modeling the intruder’s prior p(Y) on the source data Y
• Here Y is the source tabular data prior to the swapping (not the original data
from which the source table was made).
• The prior has to satisfy the constraints imposed by swapping (column and
row totals preserved) if known.
– Then calculating the posterior pn(Y|X) where n is the number of swaps
– One intruder scenario is that the intruder is interested in a target and
has very precise info on some identifier variables yI for the target, where
I is a subset of the variables in the table. The intruder is really
interested in determining yS where y = (yI,yS) and should calculate the
distribution of yS|yI under pn(Y|X).
39
Disclosure Harm
Some Possibly Delusionary Closing Comments
Addendum Based on the Talk Last Week
– For Bahjat’s example last time, with prior p(a)=1/6, a=4,…,9, the
posterior is essentially the prior when n=32. E.g.,
p32(8|8)= p(8)P32(8|8)/p32(8)=1/6(.0471406/.04551803)
=1/6(1.03564554)
If this is the intruder’s prior, the intruder’s prior opinion about
what is the actual table has changed very little by knowing the
released table.
– Since the process is reversible, P32(a’|a) is the posterior for the
equilibrium prior where a is the disclosed table after 32 swaps.
There is quite a bit of difference between P32(a’|a) and p32(a’|a).
– In Bahjat’s example you have six tables labeled by a. For row 1
of these tables (yI=0 in my notation) the entries for yS =0 or 1 are
(9,1), (8,2),…,(4,6). Thus, the distribution of yS|yI under pn(Y|X)
is approximately (39/60, 21/60) (the six tables are almost equally
likely when n=32 under this prior). So, yS = 0 is roughly twice as likely as
yS = 1 given yI = 0.
40