Data Heterogeneity Study
(Not Data Quality)
(OR)
“Type 2 Diabetes: A modern day
St. Valentine’s Day Massacre”
Feb. 14, 2011
Purposes
• Compare Mayo and Intermountain data
– ICD-9 / diagnostic codes
– CPT procedure codes
– Medications
– Labs (fasting glucose / HbA1c)
– Associated conditions (obesity, …?)
– Practice characteristics (specialties)
– Health access characteristics
Specific Aims
• To determine the relative frequency of the occurrence of
each of the ICD-9 codes for T2DM and significant
comorbidities or prediabetes syndromes, including obesity
indicators, at the 2 institutions, by age, sex, and ethnicity
• To determine the relative frequency of medications
documented for treatment of T2DM
• To determine the relative frequency of the performance of
diagnostic tests for T2DM (fasting or non-fasting BG,
HbA1c), and the values of results
• To pilot test the Northwestern algorithm for electronically
defining T2DM in an equivalent way at the 2 institutions and
describe differences attributable to variation in the EHR
data
Study Design
Phase 1
• Determine sample space of findings
– Association mining (Susan Welch). Use set of
“seed” codes, retrieve broader set of codes /
findings that associate with these
– Run separately at each institution, then merge
them
– Avoids human subjectivity in selection of
codes/findings
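A minimal sketch of the seed-code expansion described above. The slides do not say which association-mining method is used, so the lift measure, the thresholds, the function names, and the sample codes here are all assumptions for illustration only.

```python
from collections import Counter


def associated_codes(patient_codes, seeds, min_lift=1.5, min_support=2):
    """Return non-seed codes that co-occur with the seed codes more than chance.

    patient_codes: one collection of codes/findings per patient (hypothetical layout).
    seeds: the starting "seed" codes.
    Lift is an assumed measure: P(code | patient has a seed) / P(code).
    """
    patient_sets = [set(codes) for codes in patient_codes]
    seeds = set(seeds)
    seed_patients = [codes for codes in patient_sets if codes & seeds]
    n = len(patient_sets)
    if n == 0 or not seed_patients:
        return {}

    overall = Counter(code for codes in patient_sets for code in codes)
    with_seed = Counter(code for codes in seed_patients for code in codes)

    found = {}
    for code, k in with_seed.items():
        if code in seeds or k < min_support:
            continue
        lift = (k / len(seed_patients)) / (overall[code] / n)
        if lift >= min_lift:
            found[code] = round(lift, 2)
    return found


def merge_findings(site_a, site_b):
    """Union of the code sets found separately at each institution."""
    return sorted(set(site_a) | set(site_b))
```

Each institution would run associated_codes on its own data, and only the resulting code lists would be passed to merge_findings, so no patient-level data needs to be exchanged at this phase.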
Study Design
Phase 2
• Retrieve data at each institution
– 1 observation / patient / episode / data
category
– Assign random patient ID (discard link)
– Assign random date shift
– Assemble 1 dataset / data category
• Exchange data and merge into 1 common
dataset
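The de-identification steps above could be sketched as follows. This is an illustration under stated assumptions, not the study's actual procedure: the record layout ("patient_id", "date"), the 30-day shift window, and the choice of one shift per patient (which keeps intervals between a patient's episodes intact) are all hypothetical.

```python
import secrets
from datetime import date, timedelta


def deidentify(records, max_shift_days=30):
    """Replace patient IDs with random IDs and shift dates by a random offset.

    records: list of dicts with 'patient_id' and 'date' (datetime.date) keys;
    this layout is assumed for illustration.
    """
    id_map = {}     # original ID -> random ID, kept only inside this call
    shift_map = {}  # original ID -> per-patient date shift
    out = []
    for rec in records:
        pid = rec["patient_id"]
        if pid not in id_map:
            id_map[pid] = secrets.token_hex(8)
            offset = secrets.randbelow(2 * max_shift_days + 1) - max_shift_days
            shift_map[pid] = timedelta(days=offset)
        new_rec = dict(rec)  # copy so the input records are not modified
        new_rec["patient_id"] = id_map[pid]
        new_rec["date"] = rec["date"] + shift_map[pid]
        out.append(new_rec)
    # The maps are never returned or stored, so the link is discarded.
    return out
```

Because the mapping tables exist only inside the function, the returned dataset cannot be traced back to the original identifiers once the call completes.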
Phase 3
• Analyze data (within institution, and
between)
– Raw frequencies of codes, procedures
– Distributions of glucose, HbA1c, BMI
– Relative frequencies of T2DM findings
– Associations between / among T2DM findings
– Associations between demographics / T2DM findings
– Associations between health access / T2DM findings
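As a rough illustration of the relative-frequency comparison listed above, the sketch below computes, per institution, the fraction of patients carrying each code and then the between-institution difference. The input layout (pairs of patient ID and code), the function names, and the choice of denominator (patients with at least one observation) are assumptions, not details given in the slides.

```python
from collections import Counter


def relative_frequencies(observations):
    """Fraction of patients with each code.

    observations: iterable of (patient_id, code) pairs; repeated pairs for
    the same patient are counted once. Layout assumed for illustration.
    """
    patients = set()
    seen = set()
    for pid, code in observations:
        patients.add(pid)
        seen.add((pid, code))
    if not patients:
        return {}
    counts = Counter(code for _, code in seen)
    return {code: k / len(patients) for code, k in counts.items()}


def compare_sites(freq_a, freq_b):
    """Per-code difference (site A minus site B); a missing code counts as 0."""
    codes = sorted(set(freq_a) | set(freq_b))
    return {c: freq_a.get(c, 0.0) - freq_b.get(c, 0.0) for c in codes}
```

The same two functions would apply to procedures or medications by substituting those identifiers for the diagnostic codes.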
Phase 4
• Interpretation
– What data are different between institutions?
– Why are they different?
– What else affects them?
– What data are not different?
– What is impact of time interval, time
– NOT “Who has Type 2 DM”?
• Conclusions
Other Issues
• Relationship/synergy with Centerphase
Project
• Relationship to other SHARP projects
• What about unstructured data / NLP?
• Is this dataset a shared resource (and if so,
when)?