幻灯片 1 - Renmin University of China

Download Report

Transcript 幻灯片 1 - Renmin University of China

Record Linkage
Chengyin Xia
[email protected]
2015/7/18
1
1 Introduction
1.1 What is record linkage?
Record linkage is the methodology of bringing together
corresponding records from two or more files or finding duplicates
within files.
Record linkage, in the present context, is simply the bringing
together of information from two records that are believed to
relate to the same entity in reality.
2015/7/18
2
1.2 Application
The term record linkage originated in the public health area
when files of individual patients were brought together using name,
date-of-birth and other information.
In recent years, advances have yielded computer systems that
incorporate sophisticated ideas from computer science, statistics,
and operations research. Some of the work originated in
epidemiological and survey applications.
Very recent work is in the related areas of information retrieval
and data mining.
2015/7/18
3
2 Challenges
This task is easiest when unique, verified identification numbers
(e.g., ID Numbers) are readily available.
The task is more challenging when:
(1) the files lack unique identification numbers,
(2) information is recorded in non-standardized formats,
(3) the files are large.
In this case, names, addresses, and/or dates of birth are frequently
used in the matching process.
2015/7/18
4
3 Methods
3.1 Categories
Deterministic Record Linkage
Probabilistic Record Linkage
2015/7/18
5
3.2 Deterministic Record Linkage
3.2.1 Introduction
In deterministic record linkage, a pair of records is said to be a
link if the two records agree exactly on each element within a
collection of identifiers called the match key.
For example, when comparing two records on last name, street
name, year of birth, and street number, the pair of records is
deemed to be a link only if the names agree on all characters, the
years of birth are the same, and the street numbers are identical.
2015/7/18
6
3.2.2 Advantages and Disadvantages
Advantages:
• Simple and easy for implementation.
• Few false matches.
Disadvantages:
• Hard to find all matches.
• If the match key was weakened, both matches and false
matches would increase.
2015/7/18
7
3.3 Probabilistic Record Linkage
Fellegi and Sunter [1969] mathematically formalized probabilistic
record linkage based on earlier work of Newcombe et al. [1959].
Under the Fellegi–Sunter model, pairs of records are classified as
links, possible links, or non-links.
We consider the product space, A×B, of two files we denote by A
and B.
A×B to be partitioned into two sets: the set of true matches,
denoted by M, and the set of non-matches, denoted by U.
2015/7/18
8
3.3.1 The Ratio, R
Probabilistic record linkage assigns a numerical value to (the
similarity of) a pair of records, r, as a monotonically increasing
function of the ratio, R, (e.g. ln(R)), of two conditional
probabilities:
where γ is an arbitrary agreement pattern in a comparis on space Γ.
Let Γ consist of the eight possible patterns representing simple
agreement or disagreement on three fields (e.g., the last name,
street name, and street number) of a pair of records, r, drawn from
A×B.
2015/7/18
9
3.3.1 The Ratio, R
2015/7/18
10
3.3.2 Fellegi–Sunter Decision Rule
Fellegi and Sunter [1969] proposed the following decision rule in
such cases:
a) If R≥Upper, then call the pair, r, a designated match or a
designated link.
b) If Lower<R<Upper, then call the pair, r,a designated potential
match or a designated potential link and assign the pair to
clerical review.
c) If R≤Lower, then call the pair, r, a designated non-match or a
designated non-link.
2015/7/18
11
2015/7/18
12
3.3.3 Conditional Independence
We assume that conditional independence holds for all
combinations of fields(variables) that we use in matching.
2015/7/18
13
3.3.4 Definitions
Marginal probabilities:
P(agree first|M), P(agree last|M), P(agree age|M), P(agree first|U),
P(agree last|U), and P(agree age|U).
The marginal probabilities P(·|M) and P(·|U) are called m- and uprobabilities, respectively.
The base 2 logarithm of the ratio of the probabilities,log2(R) is
called the matching weight, total agreement weight, binit
weight or score.
The logarithms of the ratios of probabilities associated with
individual fields are called the individual agreement weights.
The m- and u-probabilities are also referred to as matching
parameters.
2015/7/18
14
3.3.5 Weighting
The individual agreement weight, wi, of the ith field of the rth pair
of records is computed from the m- and u- probabilities as follows.
2015/7/18
15
We further assume here that we want to base our scores on the
entries in n fields of each record. In this case, we want to consider
the n-long vector γ=(γ1,…,γn) and the n-dimensional cross-product
space Γ= Γ 1×···× Γ n.
2015/7/18
16
3.3.6 Example
Consider two databases that are to be matched. In each database
the value of the “gender” field is incorrect 10% of the time on
each member of the pairs that are matches. We also assume that
the gender value is never blank.
We wish to compare two records based on the values of two of
their fields: gender and Social Security Number.
(1) m-probability and u-probabilty of gender
The probability that both gender field values are correct is
0.9*0.9=0.81
The probability that both gender field values are incorrect
is0.1*0.1=0.01
Therefore, the desired m-probability is 0.81+0.01=0.82
2015/7/18
17
(2) m-probability and u-probability of Social Security
Number
Assume further that the m-probability for the Social Security
Number is 0.6 and the u-probability for the Social Security
Number is 10−7( one in 10 million).
(3) the agreement and disagreement weights for the Social
Security Number and gender
2015/7/18
18
(4) matching weight or score
The matching weight or score for this example is:
2015/7/18
19
2015/7/18
20