Linking Mortality and
Inpatient Discharge Records:
Comparing Deterministic and
Probabilistic Methodologies
Richard Miller
Mike Yuan
Office of Health Informatics
Bureau of Community Health Promotion
Wisconsin Division of Public Health
June 2011
Linking (matching) Mortality Records and Inpatient
Discharge Records
Why Combine Mortality Records and Inpatient Discharge
Records?
How to link or match records
Method 1: Deterministic record linkage
Method 2: Probabilistic record linkage
How do the results compare?
Lessons learned
Why Combine Mortality Records and Inpatient
Discharge Records?
Improve surveillance of CVD and other chronic diseases
Enhanced surveillance analysis opportunities
– Mortality records capture CVD only if an underlying or contributing cause
– Inpatient records capture CVD treated in that setting, but the case history
ends at discharge
Capture hospital record information on demographics, co-morbidities,
complications, and surgical procedures.
Measure treatment outcomes on a population basis
The Time Frame for Linked Records
Analyses are more complete the more time there is to find a death
record following a hospitalization
The scale of mortality and inpatient records in Wisconsin:
• 2 million inpatient discharge records, 2006-08
– Smaller number of individual patients
• 140,000 mortality records, 2006-08
How to find matching records? How to define links between records?
False Positives and Negatives
Matching records involves finding a balance between
false positive and false negative matches.
• False positive matches combine records for different people.
• False negatives fail to include all persons in the dataset of
matched records, possibly introducing bias.
Method 1.
Deterministic Record Linkage
Pairs of records are compared for exactly matching identifying
information. Exact matches determine true record matches.
Works perfectly only if information that uniquely identifies the
same individual in two datasets is available, is captured
perfectly, and is recorded perfectly
In real world data systems:
– uniquely identifying elements are often not available;
– recorded data have small differences between records; and
– some records have missing values in some fields.
Method 2.
Probabilistic Record Linkage
Every pair of records has some probability of being a “true match.”
Specialized software estimates that probability by applying
statistical principles and tools.
Set some threshold for “high probability matches”
• A common criterion is 0.9 probability of being a true match
• This defines the risk of accepting false positives
Some methods impute matches for pairs that look unlikely only because
of possible reporting and recording errors.
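The 0.9 criterion can be illustrated with Bayes' rule in odds form. This is a minimal sketch, not the calculation any particular linkage product performs: the function names, the prior, and the likelihood ratios below are hypothetical values chosen only for illustration.

```python
def match_probability(prior: float, likelihood_ratio: float) -> float:
    """Posterior probability that a record pair is a true match.

    prior            -- assumed share of candidate pairs that are true matches
    likelihood_ratio -- how much more likely the observed field agreements
                        are for true matches than for non-matches
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


THRESHOLD = 0.9  # the "common criterion" quoted on the slide


def is_high_probability(prior: float, likelihood_ratio: float) -> bool:
    """True when the pair clears the high-probability threshold."""
    return match_probability(prior, likelihood_ratio) >= THRESHOLD


# Hypothetical numbers: 1 in 1,000 candidate pairs is a true match.
strong = is_high_probability(prior=0.001, likelihood_ratio=10_000)
weak = is_high_probability(prior=0.001, likelihood_ratio=1_000)
```

With a rare prior, even strong field agreement may not clear the threshold, which is why the threshold governs the false-positive risk.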
Part I. Deterministic Linkage among Inpatient Records
Identifying Patients = de-duplicating inpatient records
Method: Iterative application of combinations of elements with
person-matching face validity.
Available fields:
• Initials
• 3-digit encryption of last name (Miller = M460)
• Date of birth
• Gender
• ZIP code of residence
• Insurance ID >> “SSN-like string”
• Hospital and medical record number
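The example value "Miller = M460" is consistent with the American Soundex code, so the name encoding in the field list may be Soundex; the slides do not name the algorithm, and that is an assumption here. A minimal sketch of American Soundex under that assumption:

```python
# Letter-to-digit groups of American Soundex; vowels, H, W, and Y have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code (letter + 3 digits) for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (first + "".join(digits) + "000")[:4]


code = soundex("Miller")  # matches the slide's example value
```

Because many surnames share one code, a field like this is a weak identifier on its own and is normally combined with other elements.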
Part I: Deterministic Linkage among Inpatient Records
Uniqueness of Patient Identifiers
Wisconsin Inpatients Discharged 2006-08, N=2,017,339
“Patient” Identifier              % Records with identifier   % with unique values
Initials + DOB + sex              100%                        56.2%
Initials + DOB + sex + ZIP        99.9%                       63.4%
Policy number + DOB + sex         92.1%                       64.7%
SSN-like string + DOB + sex       78.2%                       61.2%
Hospital + medical record number  99.7%                       70.9%
Part I: Deterministic Linkage among Inpatient Records
Record links were evaluated by looking for three indicators
of false positive matches:
1. Any later admission date preceding the earliest
admission’s discharge date.
2. Any admission date preceding the previous
admission’s discharge date.
3. Records indicating the patient died but patient has
later hospitalizations.
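The three indicators can be sketched as date checks over the records grouped under one putative patient. The `Stay` record layout and the flag names are hypothetical, and the strict `<` comparison (so that a same-day transfer is not flagged) is an assumption rather than something the slides specify.

```python
from datetime import date
from typing import NamedTuple


class Stay(NamedTuple):
    admit: date
    discharge: date
    died: bool  # discharge status says the patient died


def false_positive_flags(stays):
    """Return the set of warning flags raised by one patient's linked stays."""
    ordered = sorted(stays, key=lambda s: (s.admit, s.discharge))
    flags = set()
    if len(ordered) < 2:
        return flags
    first = ordered[0]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.admit < first.discharge:   # indicator 1
            flags.add("overlaps_first_stay")
        if cur.admit < prev.discharge:    # indicator 2
            flags.add("overlaps_previous_stay")
    for s in ordered:                     # indicator 3
        if s.died and any(t.admit > s.discharge for t in ordered):
            flags.add("admitted_after_death")
    return flags


clean = [Stay(date(2006, 1, 1), date(2006, 1, 5), False),
         Stay(date(2006, 3, 1), date(2006, 3, 4), True)]
suspect = [Stay(date(2006, 1, 1), date(2006, 1, 10), True),
           Stay(date(2006, 1, 5), date(2006, 1, 7), False),
           Stay(date(2006, 2, 1), date(2006, 2, 3), False)]
```

A flagged group is only a candidate false positive; it still needs review, since date errors in a single record can raise the same flags.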
Part II: Deterministic Matching
of Patients to Mortality Records
Matched the 1,280,000 resident patients against the 135,000 deaths of
Wisconsin residents that occurred in Wisconsin.
Which inpatient record? The most recent one…
Iterative procedures use a succession of identifiers (combinations of
the available data elements).
• Construct a linking identifier
• Select records with unique values of the “linker”
• Sort each set by that linking identifier
• Match and merge those records with identical linker values
• Collect the remaining records
• Construct an alternative linking combination
• Repeat until plausible linking combinations have been exhausted.
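The iterative steps above can be sketched as a loop over a list of linker definitions. This is an illustration under assumptions, not the authors' code: records are plain dictionaries, the field names and sample values are invented, and a dictionary lookup stands in for the sort-and-merge step.

```python
from collections import Counter


def unique_index(records, fields):
    """Map each complete, unique linker value to its record."""
    keys = [tuple(r.get(f) for f in fields) for r in records]
    complete = [k for k in keys if all(v not in (None, "") for v in k)]
    counts = Counter(complete)
    return {k: r for k, r in zip(keys, records) if counts[k] == 1}


def iterative_link(patients, deaths, linkers):
    """Apply each linker in turn; matched records leave the pool."""
    pairs = []
    for fields in linkers:
        p_index = unique_index(patients, fields)
        d_index = unique_index(deaths, fields)
        matched = sorted(p_index.keys() & d_index.keys())
        pairs += [(p_index[k]["id"], d_index[k]["id"], fields) for k in matched]
        done_p = {p_index[k]["id"] for k in matched}
        done_d = {d_index[k]["id"] for k in matched}
        patients = [r for r in patients if r["id"] not in done_p]
        deaths = [r for r in deaths if r["id"] not in done_d]
    return pairs, patients, deaths


# Invented sample data: the second person's ZIP differs between sources.
patients = [
    {"id": "P1", "initials": "RM", "dob": "1940-01-02", "sex": "M", "zip": "53703", "ssn": "111"},
    {"id": "P2", "initials": "MY", "dob": "1950-05-06", "sex": "M", "zip": "53705", "ssn": "222"},
    {"id": "P3", "initials": "AB", "dob": "1930-07-08", "sex": "F", "zip": "53711", "ssn": None},
]
deaths = [
    {"id": "D1", "initials": "RM", "dob": "1940-01-02", "sex": "M", "zip": "53703", "ssn": "111"},
    {"id": "D2", "initials": "MY", "dob": "1950-05-06", "sex": "M", "zip": "53717", "ssn": "222"},
    {"id": "D3", "initials": "CD", "dob": "1925-03-04", "sex": "F", "zip": "53711", "ssn": "333"},
]
linkers = [("initials", "dob", "sex", "zip"), ("initials", "dob", "sex", "ssn")]
pairs, left_patients, left_deaths = iterative_link(patients, deaths, linkers)
```

Ordering the linkers from most to least trusted matters, because a record matched in an early pass is never reconsidered in a later one.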
Part II: Deterministic Matching
of Patients to Mortality Records
Iterative matching in two phases:
I. Match the records for in-hospital deaths
• Less time between events and more data elements in common
• Date of death = discharge date
• Hospital is match element
• 25% of deaths; 2% of inpatients.
II. Examine the remaining records for matches
Part II: Deterministic Matching of Inpatient Records to Mortality Records
Phase I. Linked In-Hospital Deaths
Matched Records and Remaining Unmatched Records

Linker                        # Pairs    % of inpatient   % of mortality   Unmatched   Unmatched
                              matched    records          records          inpatient   mortality
All In-Hospital Deaths                                                     32,816      35,745
Initials + DOB + Sex + ZIP    26,022     79%              73%              6,794       9,723
Initials + DOB + Sex + SSN    2,666      8%               7%               4,128       7,057
Initials + Sex + ZIP3 + DOD   1,496      4%               4%               2,632       5,561
Hospital + DOD + DOB          833        2%               2%               1,799       4,728
Initials + Sex + DOB          37         0.1%             0.1%             1,762       4,691
All Linked Pairs              31,054     94.6%            86.9%
Part II: Deterministic Matching of Inpatient Records to Mortality Records
Phase 2: Linked Residual Deaths and Patients
Matched Records and Remaining Unmatched Records

Linker                        # Pairs    % of inpatient   % of mortality   Unmatched   Unmatched
                              matched    records          records          inpatient   mortality
Residual Deaths                                                            1,195,638   104,023
Initials + DOB + Sex + ZIP    53,059     4%               51%              1,142,579   50,964
Initials + DOB + Sex + SSN    5,514      <1%              11%              1,137,065   45,450
All Residual Linked Pairs     58,573     4.9%             56.3%
Part II: Deterministic Matching of Inpatient Records to
Mortality Records
Combined results:
Linked 66% of the mortality records to a hospital patient
• 89,627 of the 135,077 total 2006-08 resident and occurrence
deaths
Evaluated results with logic tests
• Admission date after previous discharge date
• Not hospitalized again after discharged ‘expired’
• Agreement rates among other data elements
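The combined figure can be reproduced from the Phase I and Phase 2 totals reported in the tables. The short check below uses only numbers from the slides; the variable names are ours.

```python
# Totals reported on the Phase I and Phase 2 slides.
phase1_pairs, phase1_inpatients, phase1_deaths = 31_054, 32_816, 35_745
phase2_pairs, phase2_inpatients, phase2_deaths = 58_573, 1_195_638, 104_023
total_deaths = 135_077

linked = phase1_pairs + phase2_pairs
share_of_deaths = 100 * linked / total_deaths

phase1_rates = (round(100 * phase1_pairs / phase1_inpatients, 1),
                round(100 * phase1_pairs / phase1_deaths, 1))
phase2_rates = (round(100 * phase2_pairs / phase2_inpatients, 1),
                round(100 * phase2_pairs / phase2_deaths, 1))

print(linked, round(share_of_deaths), phase1_rates, phase2_rates)
```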
Part III: Probabilistic Matching of Inpatient Records to
Mortality Records
A “probabilistic record linkage methodology” recognizes that a pair of
records has some probability of being a “true match.”
Specialized software products estimate that probability:
• LinkSolv – our choice
• LinkPlus
• LinkPro
LinkSolv is based on Bayesian statistics as applied by Fellegi and Sunter
and considerably developed by Dr. Michael McGlincy, the software
developer.
Part III: Probabilistic Matching of Inpatient Records to
Mortality Records
LinkSolv compares pairs of fields, incorporating a number of adjustments
to account for real-world violations of statistical assumptions:
• The probability that apparently different values may both be correct;
• Rates of missing data;
• Estimated rates of reporting errors; and
• Discounting some weights for matching/mismatching values if
agreements/disagreements on one field are related to
agreements/disagreements on another.
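For orientation, the textbook Fellegi-Sunter field weights that such adjustments build on can be sketched as follows. This is not LinkSolv's implementation: the m and u values are invented for illustration, treating a missing value as a zero weight is one common convention assumed here, and none of the adjustments listed above (correlated fields, reporting-error rates) is modeled.

```python
import math


def field_weight(a, b, m: float, u: float) -> float:
    """Log2 agreement/disagreement weight for one compared field.

    m -- assumed probability the field agrees when the pair is a true match
    u -- assumed probability the field agrees when the pair is not a match
    """
    if a is None or b is None:
        return 0.0  # assumption: a missing value is uninformative
    if a == b:
        return math.log2(m / u)
    return math.log2((1.0 - m) / (1.0 - u))


def pair_weight(rec_a: dict, rec_b: dict, params: dict) -> float:
    """Sum field weights, which assumes fields are independent."""
    return sum(field_weight(rec_a.get(f), rec_b.get(f), m, u)
               for f, (m, u) in params.items())


# Hypothetical (m, u) values, for illustration only.
PARAMS = {"dob": (0.98, 0.0001), "sex": (0.99, 0.5), "zip3": (0.90, 0.05)}

a = {"dob": "1940-01-02", "sex": "M", "zip3": "537"}
b = {"dob": "1940-01-02", "sex": "M", "zip3": "535"}
total = pair_weight(a, b, PARAMS)
```

In this sketch, agreement on date of birth contributes far more than agreement on sex, because chance agreement on sex is common.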
Comparisons may be for exact matches or acceptable differences
Part III: Probabilistic Matching of Inpatient Records to
Mortality Records
Some simplifying decisions:
• Use the most recent inpatient discharge identified by the
deterministic linkage process
• Drop the 30% of patients who are mothers and their newborns
• Work only with the patients whose last hospitalization was in
2006
Part III: Probabilistic Matching of Inpatient Records to
Mortality Records
Experimented with comparison fields:
• Disaggregate birth date or not?
• Break up ZIP into ZIP-3 and ZIP-2 components or not?
• Break up name into separate initials and encrypted field?
• Use full SSN or just last 4 digits (SSN-4)?
• Use elements only available for the in-hospital deaths?
Part III: Probabilistic Matching of Inpatient Records to
Mortality Records
Final model was relatively simple:
• Last initial + encryption (Miller = M460)
• First initial
• SSN-4
• Date of birth as one field
• Gender M/F
• ZIP-3
Part III: Probabilistic Matching of Inpatient Records to
Mortality Records
This model was applied to three overlapping subsets of records,
along with estimated corrections to statistical assumptions.
We merged the three linkage passes in a multiple imputation process
that applies Markov chain Monte Carlo techniques to create five
alternative sets of paired records.
– Identifies additional record pairs that have a low but real probability of being true matches, due to possible measurement
errors.
For evaluation purposes, we de-duplicated these 5 sets to identify a
final set of 36,562 inpatient-mortality records linked with
probabilistic methods.
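De-duplicating the five imputed sets amounts to taking their union. A minimal sketch, assuming each linked pair is represented as a hypothetical `(patient_id, death_id)` tuple; the per-pair count is an extra we add to show which links varied across imputations.

```python
from collections import Counter


def combine_imputations(imputed_sets):
    """Return the de-duplicated pairs and how many sets each pair appears in."""
    counts = Counter()
    for pairs in imputed_sets:
        counts.update(set(pairs))  # count a pair at most once per imputed set
    return set(counts), counts


# Invented example with three imputed sets instead of five.
sets = [
    [("P1", "D1"), ("P2", "D2")],
    [("P1", "D1"), ("P2", "D2"), ("P7", "D9")],
    [("P1", "D1")],
]
combined, appearances = combine_imputations(sets)
```

A pair present in every set is a stable link, while one that appears in only some sets is a link that changed between imputations.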
Comparison of Results
Combined Linked Pairs

How Linked?                              Number of      % of Deterministic   % of Probabilistic
                                         Linked Pairs   Matches              Matches
Both Methods                             31,367         93%                  86%
Same Death Record Matched to
  Different Patient Records              636            2%                   2%
Probabilistic Only                       4,559          --                   12%
Deterministic Only                       1,673          5%                   --
Total                                    38,235         100% (33,676)        100% (36,562)

• 93% of deterministic matches were confirmed by the probabilistic matches.
• 14% of probabilistic matches were not captured by deterministic linking.
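The four categories in the comparison can be derived with set operations on the two lists of linked pairs. This sketch assumes pairs are hypothetical `(patient_id, death_id)` tuples and that each death record is linked at most once per method; it is an illustration, not the authors' procedure.

```python
def compare_linkages(det_pairs, prob_pairs):
    """Classify linked pairs by which method(s) found them."""
    det, prob = set(det_pairs), set(prob_pairs)
    both = det & prob
    det_rest = det - both
    prob_rest = prob - both
    # Death records that each method linked to a different patient.
    conflicts = {d for _, d in det_rest} & {d for _, d in prob_rest}
    return {
        "both_methods": len(both),
        "same_death_different_patient": len(conflicts),
        "probabilistic_only": sum(d not in conflicts for _, d in prob_rest),
        "deterministic_only": sum(d not in conflicts for _, d in det_rest),
    }


# Invented example pairs.
det = [("P1", "D1"), ("P2", "D2"), ("P3", "D3")]
prob = [("P1", "D1"), ("P9", "D2"), ("P4", "D4")]
summary = compare_linkages(det, prob)
```

Counting a conflicting death record once, rather than once per method, is what lets the category counts sum to the combined total in the table.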
Comparison of Results
Evaluating the discrepant results:
High-probability matches not found in the deterministic matches.
The most common issue was discrepancies in the last two ZIP digits.
Low-probability matches
• 2% of the record pairs identified by both methods were evaluated by LinkSolv as
having a low probability of being a true match.
• This suggests that some deterministic criteria are weaker than would be
desirable, notably last name encryption and SSN.
Deterministic matches not confirmed by probabilistic matching. Should we be wary of
this 5% of matches?
• These are disproportionately in-hospital deaths
Conclusions
De-duplicating patients
The strongest linking combination was patient’s initials + date of birth + sex + ZIP.
• Yielded reasonable and apparently robust results.
Given the observed instability of ZIP code in the population of deceased recent patients,
we should experiment with substituting ZIP-3.
• This will result in fewer ‘patients’ being identified.
• The trade-off is the creation of more false-positive matches.
Conclusions
Linking patients to mortality records
The probabilistic process yields more matched pairs than the deterministic process,
but not dramatically so.
Overall, the more rigorous probabilistic method validated the results of the
deterministic linkage.
Initials, date of birth, and sex
• Patient and mortality records generally reliable and consistent.
ZIP
• Less reliable - small moves often result in different ZIPs.
• Older patients particularly likely to make such moves.
• Probabilistic models only used ZIP-3
SSN
• Using full SSN limited the success of exact matching.
• SSNs were teased out of policy numbers but are often missing or are a spouse’s
SSN.
• Probabilistic models used only SSN-4.
Conclusions
Both methods created reasonable sets of matched pairs of
records
Those sets had a high degree of common pairs.
The deterministic process is probably more accessible and
efficient for the general user.
However, the quality is heavily dependent on the
completeness and accuracy of the recorded data.
Conclusions
The probabilistic process, particularly as developed in LinkSolv, is more
statistically rigorous and will more thoroughly identify matched pairs.
Using multiply-imputed output datasets requires sophisticated statistical
treatment by well-trained researchers.
• Useful lessons can be learned from the application of both methods to the
same datasets. The probabilistic process provides a rigorous evaluation and,
perhaps, validation of the results of deterministic exact-matching.
• The probabilistic process provides insights into the utility of particular data
elements; this may be used to refine and improve a deterministic matching
process.
Acknowledgments
We gratefully acknowledge the support of CSTE’s
Cardiovascular Disease Surveillance Data Pilot
Project
We are indebted to Dr. Michael McGlincy, Strategic
Matching Inc., for his thoughtful advice.
Linking Mortality and Inpatient Records:
Comparing Deterministic and Probabilistic
Methods
Richard Miller
608.267.3858
HerngLeh (Mike) Yuan
608.267.2487
Wisconsin Division of Public Health