Overview of the Synthetic Derivative
April 16, 2010
Melissa Basford, MBA
Program Manager – Synthetic Derivative
Synthetic Derivative resource overview

Rich, multi-source database of de-identified clinical and demographic data

Contains ~1.8 million records
~1 million with detailed longitudinal data
averaging 100k bytes in size
an average of 27 codes per record

Records are updated over time and are current through 7/31/09
SD Establishment
[Flow diagram. Labels: Information collected during clinical care; Star Server, HEO, EDW; Data Parsing; DE-IDENTIFICATION (one-way hash); Restructuring for research; SD Database; Access through secured online application; Data export]
Data Types (so far)

Narratives, such as:
Clinical Notes
Discharge Summaries
History & Physicals
Problem Lists
Surgical Reports
Progress Notes
Letters & Clinical Communications
Diagnostic codes, procedural codes
Forms (intake, assessment)
Reports (pathology, ECGs, echocardiograms)
Lab values and vital signs
Medication orders
TraceMaster (ECGs)
~100 SNPs for 7000+ samples
Research use cases assumed in resource development
(either alone, or with DNA samples)
Retrospective chart reviews
Hypothesis generation
Rapid preliminary data for grant submissions
Feasibility assessment
Technology + policy

De-identification

Derivation of 128-character identifier (RUI) from the MRN
generated by Secure Hash Algorithm (SHA-512)




RUI is unique to input, cannot be used to regenerate MRN
RUI links data through time and across data sources
HIPAA identifiers removed using combination of custom
techniques and established de-identification software
Restricted access & continuous oversight




Access restricted to VU; not a public resource
IRB approval for study (non-human subjects research)
Data Use Agreement
Audit logs of all searches and data exports
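The RUI derivation described under De-identification can be illustrated with a minimal sketch. It assumes the MRN is a plain string that is hashed directly; the function name `derive_rui` and the sample MRN are hypothetical, and the slides do not say whether a salt or secret key is mixed in, so none is shown. A SHA-512 digest is 64 bytes, which is 128 hexadecimal characters, consistent with the 128-character identifier.

```python
import hashlib


def derive_rui(mrn: str) -> str:
    """Return a 128-character hex digest of the MRN (hypothetical helper)."""
    # Assumption: the MRN is hashed as UTF-8 text with no salt or secret key.
    return hashlib.sha512(mrn.encode("utf-8")).hexdigest()


rui = derive_rui("000123456")  # made-up MRN, for illustration only
print(len(rui))  # 128
```

Because the same input always yields the same digest, records for one patient can be linked through time and across data sources without storing the MRN itself.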
Date shift feature

Our algorithm shifts the dates within a record by a time period that is consistent within each record, but differs across records

up to 364 days backwards
e.g. if the date in a particular record is April 1, 2005 and the randomly generated shift is 45 days in the past, then the date in the SD is February 15, 2005
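As a rough illustration of the date shift, the sketch below draws one shift per record and applies it to every date in that record. The function names and the uniform draw between 1 and 364 days are assumptions; the slides state only that the shift is randomly generated, is up to 364 days backwards, and is constant within a record.

```python
import random
from datetime import date, timedelta

MAX_SHIFT_DAYS = 364  # from the slide: shifts are up to 364 days backwards


def draw_shift_days(rng: random.Random) -> int:
    """Pick one shift for a record (hypothetical helper)."""
    # Assumption: uniform draw over 1..364 days; the real distribution is not stated.
    return rng.randint(1, MAX_SHIFT_DAYS)


def shift_dates(dates, shift_days):
    """Move every date in one record back by the same number of days."""
    return [d - timedelta(days=shift_days) for d in dates]


# The example from the slide: April 1, 2005 shifted 45 days into the past.
print(shift_dates([date(2005, 4, 1)], 45))  # [datetime.date(2005, 2, 15)]
```

Because every date in a record moves by the same amount, intervals between events inside a record are preserved, while dates are no longer comparable across records.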
What the SD can’t do

Outbreaks and other date-specific studies
(catastrophes, etc)

Find a specific patient (e.g. to contact)

Replace large scale epidemiology research
(e.g. TennCare database)

Temporal search capabilities limited (but
under development)


“First this, then that” study designs require significant manual effort
Expect “timeline” views and searching Q1-Q2
Demographic Characteristics

                       SD          Davidson County   Tennessee   United States
N                      1,716,085   578,698           6,038,803   299,398,484
Gender (%)
  Female               55.2        51.3              51.1        50.7
  Male                 44.6        48.7              48.9        48.3
  Unknown              0.2         -                 -           -
Race/Ethnicity* (%)
  Afr American         14.3        27.9              16.9        12.8
  Asian / Pacific      1.2         3.0               1.4         4.6
  Caucasian            80.5        60.1              77.5        66.4
  Hispanic             2.6         7.1               3.2         14.8
  Indian American      0.1         0.4               0.3         1.0
  Others               1.4         -                 -           -
  Multiple Races       0           1.5               1.0         1.6

*A significant number of SD records are of unknown race/ethnicity. Multiple efforts are underway to better classify these records including NLP on narratives.
Examples of frequent diagnoses in total SD
[Bar chart; y-axis ticks from 0 to 70,000. The x-axis diagnosis labels were garbled in extraction; legible fragments include hypertension, type II diabetes, hyperlipidemia, depressive disorder, anemia, coronary atherosclerosis, cardiac murmurs, tachycardia, hypothyroidism, congestive heart failure, edema, atrial fibrillation, hearing loss, type I diabetes, and sleep apnea.]
Top diagnosis codes overall:
1. FEVER
2. CHEST PAIN
3. ABDOMINAL PAIN
4. COUGH
5. PAIN IN LIMB
6. HYPERTENSION
7. ROUTINE MEDICAL EXAM
8. ACUTE URI
9. MALAISE & FATIGUE
10. HEADACHE
11. URINARY TRACT INFECTION
Examples of frequent diagnoses among peds in SD
[Bar chart; y-axis ticks from 0 to 9,000. The x-axis diagnosis labels were garbled in extraction; legible fragments include cardiac murmurs, asthma, reflux, gastroenteritis, scoliosis, failure to thrive, epilepsy, hydronephrosis, type I diabetes, ventricular septal defect, ADHD, sleep apnea, Down's syndrome, and hypertension.]
Top diagnosis codes overall:
• ROUTIN CHILD HEALTH EXAM
• FEVER
• COUGH
• ACUTE PHARYNGITIS
• URIN TRACT INFECTION NOS
• VOMITING ALONE
• CARDIAC MURMURS NEC
• ABDOMINAL PAIN-SITE NOS
• OTITIS MEDIA NOS
• ACUTE URI NOS
• PAIN IN LIMB
Examples of ICD-9 codes for rare diseases

Example Rare Disease           Frequency   Number in SD   Number in BioVU
Microcephalus                  0.00007     566            6
Pica                           0.00004     59             9
Septicemic Plague              0.00004     20             0
Pick’s Disease                 0.00004     72             7
Acromegaly and Gigantism       0.00041     464            57
Ehlers-Danlos Syndrome         0.00011     154            9
Narcolepsy without Cataplexy   0.00004     166            17
Spina Bifida                   0.00022     1327           77
Stiff-Man Syndrome             0.00007     42             5
Tourette Syndrome              0.00007     366            9
Bell’s Palsy                   0.00078     1509           141
Bulimia Nervosa                0.00021     640            35
Cushing’s                      0.00116     1065           129
Peyronie’s Disease             0.00018     369            57
Statistical considerations and
limitations
Working with biostats (Schildcrout) on these issues. Some
considerations:

Selection bias for inclusion in population; representativeness of
cohort and generalizability

Bias in ICD-9 coding

Confounding by indication

Severity of disease

Medication prescribed/ordered vs received

Timing




For example, an adverse event (AE) must come after the medication (time course)
Timescale upon which events could be attributed to events
Dropout (Death vs. discharge vs. transfer)
Intervention based on in-hospital disease history
Using the SD resource
SD Access Protocol
Researcher requests IRB exemption
Researcher enters StarBRITE to complete electronic application (IRB status is in StarBRITE)
Researcher signs DUA
SD staff verify; access granted
Researcher accesses SD
Data Use Agreement Components
Phenotype Searching

Definition of phenotype for cases and controls is
critical


Basic understanding of data elements; uses and
limitations of particular data points is important


May require consultation with experts
List of ‘watch outs’ under development
Reviewing records manually to make case
determination (or even to calculate PPV of search
methodology) will be somewhat time consuming
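The PPV mentioned above is the fraction of records returned by a search that manual review confirms as true cases. A minimal sketch follows; the function name and the review counts are hypothetical and are not taken from any SD study.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV = confirmed cases / all records flagged by the search."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no reviewed records were flagged by the search")
    return true_positives / flagged


# Hypothetical review: 100 flagged records, 85 confirmed on chart review.
print(positive_predictive_value(85, 15))  # 0.85
```

In practice the counts come from reviewing a sample of the flagged records, which is where the manual effort noted above is spent.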
The problem with ICD9 codes


ICD9 codes give both false negatives and false positives
False negatives:



Outpatient billing limited to 4 diagnoses/visit
Outpatient billing done by physicians (e.g., takes too long to find
the unknown ICD9)
Inpatient billing done by professional coders:



omit codes that don’t pay well
can only code problems actually explicitly mentioned in documentation
False positives



Diagnoses evolve over time -- physicians may initially bill for
suspected diagnoses that later are determined to be incorrect
Billing the wrong code (perhaps it is easier for a busy clinician to find)
Physicians may bill for a different condition if it pays for a given
treatment

Example: Anti-TNF biologics (e.g., infliximab) originally not covered for
psoriatic arthritis, so rheumatologists would code the patient as having
rheumatoid arthritis
Lessons from preliminary phenotype
development (can be corrected)

Eliminating negated and uncertain terms:
“I don’t think this is MS”, “uncertain if multiple sclerosis”

Delineating section tag of the note
“FAMILY MEDICAL HISTORY: Mother had multiple sclerosis.”

Adding requirements for further signs of “severity of disease”
For MS: an MRI with T2 enhancement, myelin basic protein or oligoclonal bands on lumbar puncture, etc.
This could potentially miss patients with outside workups, however
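The first two corrections (dropping negated or uncertain mentions, and ignoring mentions under a family-history section tag) can be sketched with simple pattern matching. This is an illustrative toy, not the method used for the SD: the cue list, the excluded section name, and the function name are all assumptions, and a production system would rely on a proper clinical NLP pipeline.

```python
import re

# Assumed cue phrases for negation and uncertainty; far from exhaustive.
NEGATION_OR_UNCERTAINTY = re.compile(
    r"\b(?:don['’]?t think|do not think|uncertain|unlikely|ruled out)\b",
    re.IGNORECASE,
)
# Assumed section tags whose content describes someone other than the patient.
EXCLUDED_SECTIONS = {"FAMILY MEDICAL HISTORY"}
# "MS" is matched case-sensitively to avoid hits on the title "Ms".
MS_TERM = re.compile(r"\bMS\b|\b[Mm]ultiple [Ss]clerosis\b")


def affirms_ms(sentence: str) -> bool:
    """Return True if the sentence mentions MS without negation, uncertainty,
    or an excluded section tag (hypothetical helper)."""
    header, sep, _body = sentence.partition(":")
    if sep and header.strip().upper() in EXCLUDED_SECTIONS:
        return False
    if NEGATION_OR_UNCERTAINTY.search(sentence):
        return False
    return bool(MS_TERM.search(sentence))


print(affirms_ms("I don’t think this is MS"))  # False
print(affirms_ms("FAMILY MEDICAL HISTORY: Mother had multiple sclerosis."))  # False
print(affirms_ms("Patient has multiple sclerosis."))  # True
```

The severity requirement would be a further filter on structured data (for example, presence of an MRI or lumbar puncture result) and is not shown here.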
Other lessons (more difficult
to correct via algorithms)




A number of incorrect ICD9 codes for RA and MS assigned to patients
Evolving disease
• “Recently diagnosed with Susac’s syndrome - prior diagnosis of MS incorrect.” (Notes also included a thorough discussion of MS, ADEM, and Susac’s syndrome.)
Difference between two doctors:
• Presurgical admission H&P includes “rheumatoid arthritis” in the past medical history
• Rheumatology clinic visit notes say the diagnosis is “dermatomyositis” - never mention RA
Sometimes incorrect diagnoses are propagated through the record due to cutting-and-pasting / note reuse
Resources

StarPanel
Identified clinical data; designed for clinical use

Record Counter
De-identified clinical data; sophisticated phenotype searching
Returns a number – record counts and aggregate demographics

Synthetic Derivative
De-identified clinical data; sophisticated phenotype searching
Returns record counts AND de-identified narratives, test values, medications, etc., for review and creation of study data sets

BioVU
De-identified clinical data; sophisticated phenotype searching
Able to link phenotype information to biological sample
SNP data
Live Demo