Incorporating Single-Site and Network-based Data Quality
Assessment in the SAFTINet Distributed Research Network
Lisa Schilling, MD, MSPH
Department of Medicine
University of Colorado School of Medicine
Colorado Health Outcomes Program
Addressing Variations in Data Quality to Facilitate Multi-Institutional Comparative Effectiveness Research
AcademyHealth 2013 Annual Research Meeting
Monday June 24, 2013 11:30A-1:00P
Funding provided by AHRQ 1R01HS019908 (Scalable Architecture for Federated Translational Inquiries Network)
Acknowledging Collaborations
Observational Medical Outcomes Partnership
Partners and Collaborators
• University of Colorado School of Medicine
• American Academy of Family Physicians (AAFP)
• Ohio State University (OSU) Department of Biomedical Informatics
• Colorado Community Managed Care Network and the Colorado Associated Community Health Information Enterprise (CCMCN/CACHIE)
• Salud Family Health Centers
• Metro Community Provider Network (MCPN)
• Denver Health and Hospital Authority (DHHA)
• Cherokee Health Systems (CHS)
• Colorado Department of Health Care Policy & Financing (HCPF)
• QED Clinical d/b/a CINA
• Observational Medical Outcomes Partnership (OMOP)
• Recombinant Data Corporation, Inc
SAFTINet DRN Architecture
Partner Network
Why ROSITA?
• ROSITA: Reusable OMOP and SAFTINet Interface Adaptor
• ROSITA: The only bilingual Muppet
• Converts EHR data into research limited data set
1. Replaces local codes with standardized codes
2. Replaces direct identifiers with random identifiers
3. Supports clear-text and encrypted record linkage
4. Provides data quality metrics
5. Pushes data sets to grid node for distributed queries
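Steps 1 and 2 above can be illustrated with a minimal Python sketch. This is not ROSITA code. The field names (mrn, dx_code, person_id), the code map, and the "STD-" codes are all hypothetical, and a real mapping would come from a curated vocabulary.

```python
import secrets

# Hypothetical local-to-standard code map (an assumption for illustration).
LOCAL_TO_STANDARD = {"HTN": "STD-0001", "DM2": "STD-0002"}


def pseudonymize(records, id_field="mrn", code_field="dx_code"):
    """Return de-identified copies of records plus the ID crosswalk."""
    crosswalk = {}  # direct identifier -> random identifier
    out = []
    for rec in records:
        direct_id = rec[id_field]
        if direct_id not in crosswalk:
            crosswalk[direct_id] = secrets.token_hex(8)
        # Copy every field except the direct identifier.
        new_rec = {k: v for k, v in rec.items() if k != id_field}
        new_rec["person_id"] = crosswalk[direct_id]
        # Unmapped local codes become None so they can be counted later.
        new_rec[code_field] = LOCAL_TO_STANDARD.get(rec[code_field])
        out.append(new_rec)
    return out, crosswalk


records = [
    {"mrn": "A100", "dx_code": "HTN"},
    {"mrn": "A100", "dx_code": "DM2"},
    {"mrn": "B200", "dx_code": "XYZ"},
]
clean, crosswalk = pseudonymize(records)
```

The crosswalk stays with the data partner. Only the records carrying random identifiers would move on to later steps.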
ROSITA: From EHR to CER data
[Diagram; recoverable labels: "Or flat files", "Concept mapping"]
Profiling Data
• Profiling and mapping EHR or surrogate EHR data
is huge!
– Difficult to validate
– Continuous responsibility
– Low cost, low burden
• Three large data “sources” to assess:
– The “raw” source data from ETL (input)
– The post-processed data in OMOP CDM V4 (ROSITA
output)
– Cross-grid comparisons (SAFTINet query portal)
Single-site data quality assessment using
ROSITA
• ROSITA reporting system based on
JasperServer Community Edition
• LZ (landing zone) = data with site-specific values & coding schema
• OMOP = data transformed into CDM V4 format with OMOP concept IDs
Three types of DQ profiling reports:
1. LZ (original data)
2. OMOP (transformed data)
3. LZ-OMOP comparison (what changed during transformation?)
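One simple form an LZ-OMOP comparison could take is a per-table record count before and after transformation. The Python sketch below makes several assumptions: both sides use the same table names, the counts are made up, and the function name is hypothetical. It does not reproduce the actual ROSITA report.

```python
def compare_counts(lz_counts, omop_counts):
    """Compare per-table record counts before and after transformation."""
    report = []
    for table in sorted(set(lz_counts) | set(omop_counts)):
        lz = lz_counts.get(table, 0)
        omop = omop_counts.get(table, 0)
        report.append(
            {"table": table, "lz": lz, "omop": omop, "dropped": lz - omop}
        )
    return report


# Illustrative counts only; not taken from any SAFTINet partner.
lz_counts = {"person": 100, "visit_occurrence": 250}
omop_counts = {"person": 100, "visit_occurrence": 245}
report = compare_counts(lz_counts, omop_counts)
```

A non-zero "dropped" value flags a table where records were lost or filtered during transformation and may deserve a closer look.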
Data profiling using OMOP OSCAR
• Rules-based data profiling system originally
created by OMOP investigators
• Calculates different DQ stats based on data type
• DQ results are stored in a dedicated table in
ROSITA
• DQ results are pushed to the grid node for multi-site queries
SAFTINet OSCAR rules by variable type
Statistic Type
1 – Count (of records)
2 – Mean
3 – Standard Deviation
4 – Minimum
5 – 25th Percentile
6 – Median
7 – 75th Percentile
8 – Maximum
9 – Number of NULL Values
10 – Number of Empty String Values *
11 – Count (distinct)
Variable Type
Categorical
Continuous: Numeric, Date, ID
[Table: statistic types (rows) by variable type (columns); the per-cell marks showing which statistics apply to each type are not recoverable here]
* This will return the count of empty string values when the underlying column being analyzed is of type VARCHAR. Otherwise it will return 0.
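The eleven statistic types can be sketched in Python for a single column of values. This is an illustration, not the OSCAR implementation. It assumes that the record count includes NULLs and that the distinct count excludes NULLs and empty strings. It also uses Python's default quantile method, which may differ from the one OSCAR uses.

```python
import statistics


def profile(values, is_numeric=True):
    """Compute the statistic types (keyed 1-11) for one column of values."""
    present = [v for v in values if v is not None and v != ""]
    stats = {
        1: len(values),                     # count of records
        9: sum(v is None for v in values),  # number of NULL values
        10: sum(v == "" for v in values),   # number of empty strings
        11: len(set(present)),              # count of distinct values
    }
    # Distribution statistics only make sense for numeric data.
    if is_numeric and len(present) >= 2:
        q1, median, q3 = statistics.quantiles(present, n=4)
        stats.update({
            2: statistics.mean(present),
            3: statistics.stdev(present),
            4: min(present),
            5: q1,
            6: median,
            7: q3,
            8: max(present),
        })
    return stats


result = profile([1, 2, 3, 4, 5, None])
id_result = profile(["a", "b", "b", ""], is_numeric=False)
```

Passing is_numeric=False restricts the output to statistics 1, 9, 10, and 11, the same four that the Continuous ID example reports.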
Example – Continuous ID
The following rule would be used to gather Continuous statistics for the
visit_occurrence_id field in the visit_occurrence table:
Column Name          Value
source_schema_name   omop
source_table_name    visit_occurrence
variable_name        visit_occurrence_id
variable_type        1 – Continuous
data_type            3 – ID
variable_formula     visit_occurrence_id
Statistics gathered: 1 = count; 9 = # of null; 10 = # of empty string; 11 = count distinct
And the following results would be generated for this rule:
source_schema_name  source_table_name  variable_name        variable_type  variable_value  statistic_type  statistic_value
omop                visit_occurrence   visit_occurrence_id  1              NULL            1               564
omop                visit_occurrence   visit_occurrence_id  1              NULL            9               0
omop                visit_occurrence   visit_occurrence_id  1              NULL            10              0
omop                visit_occurrence   visit_occurrence_id  1              NULL            11              564
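How a rule like this might be turned into result rows can be sketched with Python and an in-memory SQLite database. The rule fields come from the example. The SQL templates, the run_rule helper, and the four-row table are assumptions for illustration and do not reflect how OSCAR or ROSITA is actually implemented.

```python
import sqlite3

# Rule fields taken from the example; the table contents below are made up.
rule = {
    "source_schema_name": "omop",
    "source_table_name": "visit_occurrence",
    "variable_name": "visit_occurrence_id",
    "variable_formula": "visit_occurrence_id",
}

# Statistic types for an ID variable, expressed as SQL aggregates.
ID_STATS = {
    1: "COUNT(*)",
    9: "SUM(CASE WHEN {f} IS NULL THEN 1 ELSE 0 END)",
    10: "SUM(CASE WHEN {f} = '' THEN 1 ELSE 0 END)",
    11: "COUNT(DISTINCT {f})",
}


def run_rule(conn, rule):
    """Return (statistic_type, statistic_value) rows for one rule."""
    rows = []
    for stat_type, template in ID_STATS.items():
        expr = template.format(f=rule["variable_formula"])
        # Identifiers are interpolated directly, so rules must be trusted.
        sql = f"SELECT {expr} FROM {rule['source_table_name']}"
        value = conn.execute(sql).fetchone()[0]
        rows.append((stat_type, value))
    return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visit_occurrence (visit_occurrence_id INTEGER)")
conn.executemany(
    "INSERT INTO visit_occurrence VALUES (?)", [(1,), (2,), (3,), (None,)]
)
results = run_rule(conn, rule)
```

Each tuple corresponds to one row of the results table, with the schema, table, and variable names carried over from the rule.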
Variable type: 1 = categorical, 2 = continuous
Stat type: 1 = count by record
OSCAR Results in ROSITA
Jasper Reports
• JREPORT001 Landing Zone Key Summary Statistics:
– Summary statistics for Landing Zone database; includes, for select fields:
– Number of records in every table
– Summary statistics (mean, minimum/maximum, number of missing) on numeric fields
– Summary statistics (frequency) on categorical/character fields (excluding direct identifiers, such as Social Security Number, Medical Record Number, names, and addresses)
• JREPORT002 Random records for manual chart validation:
– Selection of records for chart review: randomly selects 25 visit occurrence records (and all associated records from person, provider, care site, organization, drug exposure, procedure occurrence, condition occurrence, and observation tables) from the ROSITA landing zone database.
• JREPORT003 OMOP Database Key Summary Statistics:
– Summary statistics for OMOP database
– Source: SAFTINet Data Validation V1.0 2012 Nov 20.docx
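The random selection behind JREPORT002 could look something like the Python sketch below. It simplifies the real report by assuming that every associated record carries a visit_occurrence_id, whereas person, provider, and care site records would be linked through other keys. The function name and sample data are hypothetical.

```python
import random


def sample_for_chart_review(visits, related, n=25, seed=None):
    """Pick up to n random visits and gather their associated records."""
    rng = random.Random(seed)
    chosen = rng.sample(visits, min(n, len(visits)))
    chosen_ids = {v["visit_occurrence_id"] for v in chosen}
    # Keep only the related rows that belong to a sampled visit.
    linked = {
        table: [r for r in rows if r["visit_occurrence_id"] in chosen_ids]
        for table, rows in related.items()
    }
    return chosen, linked


# Made-up data: 100 visits, each with one condition record.
visits = [{"visit_occurrence_id": i} for i in range(1, 101)]
related = {
    "condition_occurrence": [
        {"visit_occurrence_id": i, "condition": "hypothetical"}
        for i in range(1, 101)
    ]
}
chosen, linked = sample_for_chart_review(visits, related, seed=0)
```

Fixing the seed makes a sample reproducible, which can help when the same 25 charts need to be pulled again during validation.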
ROSITA DQ Reporting
Care site statistics by visits / by patients (LZ)
Care site statistics by visits / by patients (OMOP)
Drug Exposure statistics (LZ)
Network-based DQ comparison queries
(Future work)
1. Implement DQ query in DCQL
2. Submit query to grid
3. Returns DQ statistics across all nodes
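Once statistics come back from all nodes, the comparison in step 3 could be as simple as the Python sketch below. It does not use DCQL or any grid API. The node names, the per-node dictionaries, and the function name are hypothetical stand-ins for returned results.

```python
def compare_across_nodes(node_stats, statistic_type):
    """Collect one statistic from every node and summarize its spread."""
    values = {
        node: stats[statistic_type] for node, stats in node_stats.items()
    }
    lo, hi = min(values.values()), max(values.values())
    return {"values": values, "min": lo, "max": hi, "range": hi - lo}


# Hypothetical per-node results keyed by statistic type
# (1 = count of records, 9 = number of NULL values).
node_stats = {
    "site_a": {1: 1000, 9: 10},
    "site_b": {1: 800, 9: 120},
}
summary = compare_across_nodes(node_stats, 9)
```

A wide range for the same statistic across sites is a prompt for follow-up with the outlying partner, not a verdict on its data.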
Conclusions
• A rules-based data quality assessment infrastructure
enables standardized DQ measures across all data
partners
• Three levels of DQ assessment:
– Landing Zone (raw)
– OMOP (transformed)
– Grid (multi-site)
• Exploring graphical visualizations (in R) for rapid
screening
DQ Visualization: Thousands of data values
at a glance…
Thank you!
Questions?