Transcript SD Launch

OVERVIEW OF THE SYNTHETIC DERIVATIVE
June 29, 2012
Melissa Basford, MBA
Program Manager – Synthetic Derivative
Synthetic Derivative resource overview
• Rich, multi-source database of de-identified
clinical and demographic data
• Contains ~1.8 million records
• ~1 million with detailed longitudinal data
• averaging 100k bytes in size
• an average of 27 codes per record
• Records are updated over time and are current
through August 2011.
SD Establishment
• Information collected during clinical care (sources: Star Server, HEO, EDW)
• Data parsing
• De-identification (one-way hash)
• Restructuring for research
• SD database
• Access through secured online application
• Data export
Data Types (so far)
• Narratives, such as: Clinical Notes, Discharge Summaries, History &
Physicals, Problem Lists, Surgical Reports, Progress Notes, Letters &
Clinical Communications
• Diagnostic codes, procedural codes
• Forms (intake, assessment)
• Reports (pathology, ECGs, echocardiograms)
• Lab values and vital signs
• Medication orders
• TraceMaster (ECGs), Tumor Registry, STS Registry
• ~120 SNPs for 7000+ & GWAS for 10,500+ samples
Technology + policy
• De-identification
• Derivation of 128-character identifier (RUI) from the MRN
generated by Secure Hash Algorithm (SHA-512)
• RUI is unique to input, cannot be used to regenerate MRN
• RUI links data through time and across data sources
• HIPAA identifiers removed using combination of custom techniques
and established de-identification software
• Restricted access & continuous oversight
• Access restricted to VU; not a public resource
• IRB approval for study (non-human subjects)
• Data Use Agreement
• Audit logs of all searches and data exports
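The RUI derivation described under De-identification can be sketched in a few lines of Python. This is a minimal illustration, not the SD's actual code: the function name `derive_rui` and the sample MRNs are hypothetical, and the slides do not say whether the MRN is salted or keyed before hashing, so the sketch hashes it directly. A SHA-512 digest is 512 bits, which is 128 hexadecimal characters, matching the identifier length stated above.

```python
import hashlib

def derive_rui(mrn: str) -> str:
    """Return a 128-character hex identifier derived from an MRN.

    Hypothetical helper for illustration; the real pipeline may add a
    secret salt or key, which the slides do not describe.
    """
    return hashlib.sha512(mrn.encode("utf-8")).hexdigest()

# Made-up MRNs, not real patient identifiers.
rui_a = derive_rui("000123456")
rui_b = derive_rui("000123457")

print(len(rui_a))                        # 128
print(rui_a == derive_rui("000123456"))  # True: same input, same RUI
print(rui_a == rui_b)                    # False: different MRNs differ
```

Because the same MRN always yields the same RUI, records from different sources and dates can be joined on the RUI without the MRN ever being stored.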
Date shift feature
• Our algorithm shifts the dates within a record by a time
period that is consistent within each record, but differs
across records
• up to 364 days backwards
• e.g., if the date in a particular record is April 1, 2005 and the
randomly generated shift is 45 days in the past, then the date in the
SD is February 15, 2005
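A minimal Python sketch of the date shift, assuming the shift is a whole number of days drawn once per record. The helper names `pick_shift` and `shift_record_dates` are hypothetical, and the 1 to 364 range is an assumption based on the "up to 364 days backwards" bullet; the actual algorithm is not described in more detail here.

```python
import random
from datetime import date, timedelta

MAX_SHIFT_DAYS = 364  # upper bound stated in the slide

def pick_shift(rng: random.Random) -> int:
    """Draw one per-record shift in days (assumed range: 1..364)."""
    return rng.randint(1, MAX_SHIFT_DAYS)

def shift_record_dates(dates, shift_days):
    """Move every date in one record back by the same number of days."""
    delta = timedelta(days=shift_days)
    return [d - delta for d in dates]

# The slide's example: April 1, 2005 shifted 45 days into the past.
record = [date(2005, 4, 1), date(2005, 4, 11)]
shifted = shift_record_dates(record, 45)
print(shifted[0])               # 2005-02-15
print(shifted[1] - shifted[0])  # 10 days, 0:00:00
```

Since every date in a record moves by the same amount, intervals between events within a record are preserved, while calendar dates no longer line up across records.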
What the SD can’t do
• Outbreaks and other date-specific studies
(catastrophes, etc)
• Find a specific patient (e.g. to contact)
• Replace large scale epidemiology research (e.g.
TennCare database)
• Temporal search capabilities limited (but under
development)
• “First this, then that” study designs require significant
manual effort
• Expect “timeline” views and searching Q1-Q2
Demographic Characteristics
                         SD          Davidson County   Tennessee    United States
N                        1,716,085   578,698           6,038,803    299,398,484
Gender (%)
  Female                 55.2        51.3              51.1         50.7
  Male                   44.6        48.7              48.9         48.3
  Unknown                0.2         -                 -            -
Race/Ethnicity* (%)
  Afr American           14.3        27.9              16.9         12.8
  Asian / Pacific        1.2         3.0               1.4          4.6
  Caucasian              80.5        60.1              77.5         66.4
  Hispanic               2.6         7.1               3.2          14.8
  Indian American        0.1         0.4               0.3          1.0
  Others                 1.4         -                 -            -
  Multiple Races         0           1.5               1.0          1.6
*A significant number of SD records are of unknown race/ethnicity. Multiple efforts are underway to better
classify these records including NLP on narratives.
Examples of frequent diagnoses in total SD
[Bar chart of diagnosis frequencies; y-axis 0 to 70,000]
Top diagnosis codes overall:
1. FEVER
2. CHEST PAIN
3. ABDOMINAL PAIN
4. COUGH
5. PAIN IN LIMB
6. HYPERTENSION
7. ROUTINE MEDICAL EXAM
8. ACUTE URI
9. MALAISE & FATIGUE
10. HEADACHE
11. URINARY TRACT INFECTION
Examples of frequent diagnoses among peds in SD
[Bar chart of diagnosis frequencies; y-axis 0 to 9,000]
Top diagnosis codes overall:
• ROUTIN CHILD HEALTH EXAM
• FEVER
• COUGH
• ACUTE PHARYNGITIS
• URIN TRACT INFECTION NOS
• VOMITING ALONE
• CARDIAC MURMURS NEC
• ABDOMINAL PAIN-SITE NOS
• OTITIS MEDIA NOS
• ACUTE URI NOS
• PAIN IN LIMB
Resources
• StarPanel
• Identified clinical data; designed for clinical use
• Record Counter
• De-identified clinical data; sophisticated phenotype searching
• Returns a number – record counts and aggregate demographics
• Synthetic Derivative
• De-identified clinical data; sophisticated phenotype searching
• Returns record counts AND de-identified narratives, test values,
medications, etc., for review and creation of study data sets
• Research Derivative
• Identified clinical data
• Programmer (human) supported
• BioVU
• Genotype data
• De-identified clinical data; sophisticated phenotype searching
• Able to link phenotype information to biological sample
LIVE DEMO
USING THE SD RESOURCE
SD Access Protocol
1. Researcher requests IRB exemption
2. Researcher enters StarBRITE to complete electronic application
(IRB status is in StarBRITE)
3. Researcher signs DUA
4. SD staff verify; access granted
5. Researcher accesses SD
Data Use Agreement Components
Phenotype Searching
• Definition of phenotype for cases and controls is critical
• May require consultation with experts
• Basic understanding of data elements; uses and
limitations of particular data points is important
• List of ‘watch outs’ under development
• Reviewing records manually to make case determination
(or even to calculate PPV of search methodology) will be
somewhat time consuming
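The PPV mentioned above is the fraction of records flagged by a search that manual review confirms as true cases. A minimal sketch in Python follows; the function name and the counts are made up for illustration and do not come from any SD study.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV = confirmed cases / all records flagged by the search."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no records were flagged by the search")
    return true_positives / flagged

# Made-up review: 50 flagged records read by hand, 45 confirmed as cases.
print(positive_predictive_value(45, 5))  # 0.9
```

In practice the reviewed records would be a random sample of the search results, so the PPV is an estimate whose precision depends on how many records are reviewed.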
The problem with ICD9 codes
• ICD9 codes give both false negatives and false positives
• False negatives:
• Outpatient billing limited to 4 diagnoses/visit
• Outpatient billing done by physicians (e.g., it takes too long to find
an unfamiliar ICD9 code)
• Inpatient billing done by professional coders:
• omit codes that don’t pay well
• can only code problems actually explicitly mentioned in documentation
• False positives
• Diagnoses evolve over time -- physicians may initially bill for suspected
diagnoses that later are determined to be incorrect
• Billing the wrong code (perhaps one that is easier for a busy clinician to find)
• Physicians may bill for a different condition if it pays for a given treatment
• Example: Anti-TNF biologics (e.g., infliximab) originally not covered for psoriatic
arthritis, so rheumatologists would code the patient as having rheumatoid arthritis
Lessons from preliminary phenotype
development (can be corrected)
• Eliminating negated and uncertain terms:
• “I don’t think this is MS”, “uncertain if multiple sclerosis”
• Delineating section tag of the note
• “FAMILY MEDICAL HISTORY: Mother had multiple
sclerosis.”
• Adding requirements for further signs of “severity of
disease”
• For MS: an MRI with T2 enhancement, myelin basic protein
or oligoclonal bands on lumbar puncture, etc.
• This could potentially miss patients with outside work-ups,
however
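The first two corrections above (dropping negated or uncertain mentions, and respecting the section tag) can be sketched with simple string rules. This is a rough illustration under stated assumptions, not the method used for the SD phenotype work: the cue lists, the excluded section name, and the function `is_affirmed_mention` are hypothetical, and a production system would use a validated negation lexicon and proper note sectioning.

```python
import re

# Hypothetical cue lists for illustration only; not a validated lexicon.
NEGATION_CUES = ("don't think", "do not think", "no evidence of", "ruled out")
UNCERTAINTY_CUES = ("uncertain", "possible", "questionable")
EXCLUDED_SECTIONS = {"FAMILY MEDICAL HISTORY"}

# Matches an all-caps section tag followed by a colon.
SECTION_RE = re.compile(r"^([A-Z][A-Z &]+):\s*(.*)$")

def is_affirmed_mention(line: str, terms) -> bool:
    """Return True if the line mentions a term without negation,
    uncertainty, or an excluded section tag."""
    section, body = None, line
    match = SECTION_RE.match(line)
    if match:
        section, body = match.group(1).strip(), match.group(2)
    text = body.lower().replace("\u2019", "'")  # normalize curly apostrophes
    mentioned = any(
        re.search(r"\b" + re.escape(term.lower()) + r"\b", text) for term in terms
    )
    if not mentioned:
        return False
    if section in EXCLUDED_SECTIONS:
        return False
    return not any(cue in text for cue in NEGATION_CUES + UNCERTAINTY_CUES)

MS_TERMS = ["MS", "multiple sclerosis"]
print(is_affirmed_mention("I don't think this is MS", MS_TERMS))         # False
print(is_affirmed_mention("uncertain if multiple sclerosis", MS_TERMS))  # False
print(is_affirmed_mention(
    "FAMILY MEDICAL HISTORY: Mother had multiple sclerosis.", MS_TERMS))  # False
print(is_affirmed_mention("Patient has multiple sclerosis.", MS_TERMS))   # True
```

Rules this crude will misfire (for example, the lowercase match on "MS" also hits the title "Ms."), which is one reason manual review of a sample remains necessary.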
Other lessons (more difficult to correct via
algorithms)
• A number of incorrect ICD9 codes for RA and MS assigned to patients
• Evolving disease
• “Recently diagnosed with Susac’s syndrome - prior diagnosis of MS
incorrect.” (Notes also included a thorough discussion of MS,
ADEM, and Susac’s syndrome.)
• Difference between two doctors:
• Presurgical admission H&P includes “rheumatoid arthritis” in the
past medical history
• Rheumatology clinic visit notes say the diagnosis is
“dermatomyositis” - never mention RA
• Sometimes incorrect diagnoses are propagated through the record
due to cutting-and-pasting / note reuse