Shlomo - Turing Gateway to Mathematics

Download Report

Transcript Shlomo - Turing Gateway to Mathematics

Privacy Preserving Probabilistic
Record Linkage
Duncan Smith
([email protected])
Natalie Shlomo
([email protected])
Social Statistics, School of Social Sciences
University of Manchester
The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 262608 (DwB - Data without Boundaries).
1
Topics Covered
•
•
•
•
•
•
Introduction
Probabilistic Record Linkage
String Anonymisation
Putting the probabilities back into Privacy–
Preserving Record Linkage
Experiment
Discussion
2
Introduction
•
Probabilistic record linkage is an important area for
official statistics (Fellegi and Sunter, 1969)
•
Administrative sources are being used to improve the
quality of surveys or to replace the traditional census
•
Traditionally, all datasets in one location, eg. NSI,
and matching variables (first name, last name,
address) used to link data without the need for
anonymisation
•
Data on individuals may be in distinct databases and
may be owned by different custodians: Alice (A) and
Bob (B)
•
Privacy restrictions prevent the release of certain
variables or information is suppressed/coarsened
that uniquely identifies an individual
3
Introduction
•
Techniques have been developed to anonymise
identifying variables
•
Third party (Carole) only sees matching variables and
returns pairs of unique record IDs (assigned by Alice
and Bob)
•
Two possible scenarios (there are more…):
•
Trusted Carole –sees the true values of single
matching variable
•
Non-trusted Carole –sees anonymised values of
single matching variable
•
Privacy preserving record linkage (PPRL) allows exact
matching and can allow linkage based on similarity
scores generated from anonymised values
•
F&S probabilistic record linkage typically not used in
PPRL
4
Introduction
•
Alice and Bob clean, harmonize and standardize
data and anonymise matching variables (using the
same method and seed)
•
In our new approach, we apply probabilistic record
linkage to anonymised values to obtain a probability
of a correct match (PPPRL)
•
Motivation:
•
•
Data can be held within an archive, users can
carry out PPPRL within a ‘black box’ - dynamic
database integration
•
Three party Alice, Bob, Carole scenario as set out
in Beyond 2011 project (Carole has access to
original values and can calculate string
comparators)
In PPRL, no possibility of clerical review and links
classified into 2 classes: true matches and false
matches
5
Probabilistic record linkage
•
Two datasets A and B and set of all possible matches
A  B  {( a ,b ); a  A,b  B }
•
Classify into two sets:
U  {( a ,b ) | a  b , a  A,b  B }
Fellegi and Sunter (1969) theory of probabilistic record
linkage:
•
Decision rule: likelihood ratio of agreement m(  ) / u(  )
(or disagreement ( 1  m(  )) /( 1  u(  )) ) where m(  )
probabillity of agreement in comparison vector  given
a match and u(  ) probability of an agreement given
a non-match (alternative)
M  {( a ,b ) | a  b , a  A,b  B }
•
•
Assume conditional independence so test statistic is sum
of log likelihood ratios
•
All pairs classified into matches, non-matches and those
sent off to clerical review
6
Probabilistic record linkage
•
Under a simple comparison vector for a pair j:
 j  (  1j ,..., Qj ) has an element of 1 if matching variable q
agrees or 0 otherwise, then
mq  P(  qj  1 | ( a ,b )j  M )
Q
P(  | M )   mq ( 1  mq )
j
q 1
•
 qj
1 qj
uq  P(  qj  1 | ( a ,b )j U )
Q
P(  | U )   uq ( 1  uq )
j
 qj
1 qj
q 1
Apply EM algorithm under the Binomial distribution:
E-step: expectation of the complete data log-likelihood
using Bayes theorem and initiating values of the
parameters
M-step: compute current estimates of parameters by
maximizing expected log-likelihood, i.e. setting partial
derivatives equal to zero for three maximization problems
7
String Comparators
•
•
Use string comparator to enhance decision rule
When strings are known, Jaro string comparator:
 q ( X a , X b )  1 / 3  (# common / str_len1# common / str_len 2 
( 1  0.5  (# half transposit ion /# common )))
•
str_len1 and str_len2 length of strings, #common is
number of common bits and #half transposition
where a bit moves 1 position left or right
Or:  q ( X a , X b ) Distance for numerical values
Down weight agreement weight (likelihood ratio):
q( X a , X b )
•
mq
 ( 1   q ( X a , X b ))
1  mq
uq
1  uq
Down weight log2 likelihood ratios:
 q ( X a ,X b )
 mq 


 uq 


( 1 q ( X a ,X b ))
 1  mq 


 1  uq 


8
String Anonymisation
•
Hash functions are used to anonymise strings (in this
case an ordinal/continuous key value treated as a
string)
•
Hash functions convert strings to integers and equal
strings produce equal hash values so linkage can be
carried out on equality or inequality of hash values
•
Anonymise strings to allow for estimating similarity
scores
•
Churches and Christensen (2004) proposed hashing
bigrams:
'john' → {'jo', 'oh', 'hn'} → {21299418, 21496024, 20971735}
'jon' → {'jo', 'on'} → {21299418, 21889246}
•
Similarity score: Dice Coefficient
DA ,B
2| A B|

| A|  | B |
9
String Anonymisation
•
Schnell et al. (2009) propose the use of Bloom filters (an
array of 0 and 1) which can be represented as an integer in
its binary form
•
m bits (all initially set to 0) and k hash functions used to map an
element to one of the m array positions
•
To add an element, produce k hash functions to get k array
positions and set these positions to 1
•
To query whether an element is in the set, feed it to each of the k
hash functions to get k array positions and check whether there is
a zero
•
If there’s a zero, element is not in the set, otherwise if all 1’s it
might be in the set (false positives)
•
Dice coefficient can be estimated from a pair of Bloom filters
•
These methods may be open to attacks where intruders can
learn the length of strings or identify tokens, eg. bigrams
•
Various approaches have been proposed to deal with these
issues
10
String Anonymisation
•
Minwise hashing (Broder 1997) generates a random
permutation of a set of elements and returns the hash
for the first ordered element
•
Consider union of two sets A and B, the probability of a hash
collision on the first ordered element is the Jaccard
similarity score:
| A B|
J A ,B 


| A B|
Generate hash values for all tokens in a set and return
minimum hash value
Estimate of Jaccard similarity score based on many hash
values where the number of collisions is distributed (m is
the number of hash functions): n ~ Bin ( m , J A ,B ) and
n and:
estimated by: Ĵ

A ,B
m
J ( 1  J A,B )
Var( Ĵ A,B )  A,B
m
11
String Anonymisation
•
Li and Konig (2011) propose b-bit hashing where
instead of returning full minwise hash value, only b-bits
are returned
•
The method was proposed to reduce storage but for our
purpose mitigates against frequency attacks and
disguises the size of token sets
•
Proposed method: concatenated 1-bit minwise hashing
Example: Minwise hashes and 1-bit minwise hashes under a binary
representation for S1={’jo’,’oh’,’hn’} and S2= {’jo’,’on’}
H1
H2
H3
H4
H5
S1
451153726
1123790273
2501120381
2030682762
965995804
S2
797504823
1123790273
262296169
1744666338
965995804
H1
H2
H3
H4
H5
S1
0
1
1
0
0
S2
1
1
1
0
0
…
Hm
…
Hm
…
Sn
…
Sn
12
String Anonymisation
•
•
•
•
•
•
Estimating the Jaccard similarity score under the
concatenated 1-bit minwise hashing results is:
n
1  J A2 ,B
Ĵ A ,B  2
1
Var( Ĵ A ,B ) 
and
m
m
The increase in variance from minwise hashing is:
1
1
J A ,B
In our example, for minwise hashing with 5 hash
functions, the estimate of the Jaccard similarity score is
2/5 and for the concatenated 1-bit hashing 3/5, the
true value is 1/4
Simulation Study: File A 300 names, File B obtained by
perturbing File A to simulate typographical errors
Tokenized bigrams with leading and trailing underscores
True Jaccard scores compared with estimated scores on
all pairs in A x B
13
String Anonymisation
• Bias in the Bloom filter approaches (corrected Bloom filter based
on more complex estimate which reduces bias)
• Smaller variance in minwise hash compared to concatenated 1-bit
hash but requires more storage (integers vs. bits)
• Concatenated hash approximately same MSE as the Bloom filter
• Precision can be controlled by choice of m – the number of hash
14
functions
Multinomial EM Algorithm
•
Extend to K categories, k=1,…,K where each category is
a grouping of similarity scores, i.e. 8 categories with
(inclusive) upper bounds:
[0.2,0.4,0.6,0.8,0.9,0.95,0.999,1]
•
Element in e agreement vector for variable q of pair j
j

with similarity score in category k, q ,k  1 , otherwise 0
•
Let:
mq ,k  P(  qj,k  1 | ( a ,b )j  M ) uq ,k  P(  qj,k  1 | ( a ,b )j U )
Q
•
K
Q
P(  | M )   mq ,k
|M |
p
| M U |
j
K
P(  | U )   uq ,k
 qj ,k
j
q  1 k 1
 qj ,k
q 1 k 1
The M-step is the MLE parameters of a multinomial:
R
m̂q ,k   ĝ 
j 1
j
jm q ,k
R
/  ĝ jm
j 1
R
ûq ,k   ĝ 
j 1
j
ju q ,k
R
/  ĝ ju
j 1
R
p̂   ĝ jm / R
j 1
15
Fellegi and Sunter Decision Rule and Blocking
•
•
•
•
•
•
Thresholds typically calculated with test
In PPRL, need to determine one cut-off value
Blocking reduces number of pairs that need to be
examined by introducing exact matching
Blocking variables pre-determined and anonymised;
examples of blocking variables: first 3 digits of postcode, initial of first name, year of birth
In PPRL literature, set all pairs with similarity scores
below a threshold to non-matches and produce ‘fuzzy’
blocking based on high similarity scores
Methods include: canopy clustering (McCallum et al.,
2000) which divides the pairs into overlapping subsets
before classification; multibit tree structures to identify
similar comparison vectors under the Bloom filter
framework (Bachteler et al.,2013 ), and more
16
Experiment
•
•
•
•
•
1000 records from a Census database with attached
English names (File A)
File B generated by perturbing File A under a
probabilistic approach including swapping, deleting and
transposing characters on variables: Gender, Year of
Birth, Month of Birth and First Name
4 Perturbed datasets perturbed at different levels of
perturbation
A random sample of 700 records from File A and a
random sample of 400 records from perturbed files
used for matching
No blocking was carried out
17
Experiment
PPPRL:
•Binary EM: standard EM approach based on exact matching of
strings. No similarity score used
•LR weighted: outputs of Binary EM and downweight likelihood
ratios
•Log LR weighted: outputs of Binary EM and downweight log
likelihood ratios
• EM (8): multinomial EM approach with 9 bins having upper bounds
[0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.999, 1]. Jaccard similarity score
(with padded underscores on bigrams)
•EM (15): multinomial EM approach with 15 bins having upper
bounds [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.925,
0.95, 0.975, 0.999, 1]. As above
PRL:
•Jaro: multinomial EM approach with 8 bins using Jaro string
comparator
•Jaro-Winkler: As above but with Jaro-Winkler string comparator
18
Experiment
•
•
•
Correct links identified and used to construct precisionrecall plots
Plots show for any given threshold the precision and
recall based on false positives , true positives, false
negatives , true negatives, and can be used to
compare approaches
Good approaches will produce curves in the upper right
of the plot
tp
Pr ecision 
tp  fp
tp
Re call 
tp  fn
19
Experiments
low perturbation
high perturbation
• All approaches perform better with low level of perturbation
• Binary EM without similarity scores performs the worst
• Down weighting log likelihood ratios outperforms down weighting
of likelihood ratios
• Multinomial EM outperforms Binary EM with no clear difference
between 8 category and 15 category Jaccard score schemes
• Jaro and Jaro-Winkler schemes provide the best performance,
although these are not privacy preserving
20
Discussion
• PPPRL does not allow clerical review and one
threshold is determined based on posterior probability
of a correct link
• PPPRL can be tailored to different types of variables
via the choice/design of the tokenization scheme
• So far dealt with 1 to 1 matching
• Multinomial EM offers improved classification over the
unweighted and weighted binary EM schemes
• Under trusted Carole, the Jaro and Jaro-Winkler
schemes outperformed the padded bigram
tokenization scheme under PPPRL
21
Thank you for your attention
22