Phillips_i2b2 - Buffalo Ontology Site
Developing i2b2 Ontologies for the Long Haul
Lori Phillips, MS
Partners HealthCare Systems, Inc
April 25, 2012
National Centers for Biomedical Computing
What is i2b2?
Software for explicitly organizing and transforming person-oriented clinical data in a way that is optimized for research
Allows integration of clinical data, trials data, and genotypic data
A portable and extensible application framework
Modular software architecture allows additions without disturbing core parts
Available as open source at https://www.i2b2.org
Where is it used?
CTSAs:
Boston University
Case Western Reserve University (including Cleveland Clinic)
Children's National Medical Center (GWU), Washington D.C.
Duke University
Emory University (including Morehouse School of Medicine and Georgia Tech)
Harvard University (including Beth Israel Deaconess Medical Center, Brigham and Women's Hospital, Children's Hospital Boston, Dana Farber Cancer Center, Joslin Diabetes Center, Massachusetts General Hospital)
Medical University of South Carolina
Medical College of Wisconsin
Oregon Health & Science University
Penn State Milton S. Hershey Medical Center
Tufts University
University of Alabama at Birmingham
University of Arkansas for Medical Sciences
University of California Davis
University of California, Irvine
University of California, Los Angeles*
University of California, San Diego*
University of California San Francisco
University of Chicago
University of Cincinnati (including Cincinnati Children's Hospital Medical Center)
University of Colorado Denver (including Children's Hospital Colorado)
University of Florida
University of Kansas Medical Center
University of Kentucky Research Foundation
University of Massachusetts Medical School, Worcester
University of Michigan
University of Pennsylvania (including Children's Hospital of Philadelphia)
University of Pittsburgh (including their Cancer Institute)
University of Rochester School of Medicine and Dentistry
University of Texas Health Sciences Center at Houston
University of Texas Health Sciences Center at San Antonio
University of Texas Medical Branch (Galveston)
University of Texas Southwestern Medical Center at Dallas
University of Utah
University of Washington
University of Wisconsin - Madison (including Marshfield Clinic)
Virginia Commonwealth University
Weill Cornell Medical College
Academic Health Centers (does not include AHCs that are part of a CTSA):
Arizona State University
City of Hope, Los Angeles
Georgia Health Sciences University, Augusta
Hartford Hospital, CT
HealthShare Montana
Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), Boston
Nemours
Phoenix Children's Hospital
Regenstrief Institute
Thomas Jefferson University
University of Connecticut Health Center
University of Missouri School of Medicine
University of Tennessee Health Sciences Center
Wake Forest University Baptist Medical Center
HMOs:
Group Health Cooperative
Kaiser Permanente
International:
Georges Pompidou Hospital, Paris, France
Hospital of the Free University of Brussels, Belgium
Inserm U936, Rennes, France
Institute for Data Technology and Informatics (IDI), NTNU, Norway
Institute for Molecular Medicine Finland (FIMM)
Karolinska Institute, Sweden
Landspitali University Hospital, Reykjavik, Iceland
Tokyo Medical and Dental University, Japan
University of Bordeaux Segalen, France
University of Erlangen-Nuremberg, Germany
University of Goettingen, Goettingen, Germany
University of Leicester and Hospitals, England (Biomed. Res. Informatics Ctr. for Clin. Sci)
University of Pavia, Pavia, Italy
University of Seoul, Seoul, Korea
Companies:
Johnson and Johnson (tranSMART)
GE Healthcare Clinical Data Services
Why use i2b2?
Cohort discovery
Enables and simplifies research cohort discovery across an institution's large, heterogeneous clinical datasets
Hypothesis generation
Enables and simplifies analysis of data to support a hypothesis
Retrospective data analysis
Enables the retrospective analysis of data to support/refute claims.
i2b2 Workbench
Data Model
FACTS
The quantitative or factual data being queried
DIMENSIONS
Groups of hierarchies and descriptors that define the facts.
STAR SCHEMA
A single fact table surrounded by numerous dimension tables.
i2b2 Star Schema
observation_fact (fact table)
Patient_Num (PK), Encounter_Num (PK), Concept_CD (PK), Observer_CD (PK), Start_Date (PK), Modifier_CD (PK), Instance_Num (PK), End_Date, ValType_CD, TVal_Char, NVal_Num, ValueFlag_CD, Observation_Blob
patient_dimension
Patient_Num (PK), Birth_Date, Death_Date, Vital_Status_CD, Age_Num*, Gender_CD*, Race_CD*, Ethnicity_CD*
visit_dimension
Encounter_Num (PK), Start_Date, End_Date, Active_Status_CD, Location_CD*
concept_dimension
Concept_Path (PK), Concept_CD, Name_Char
observer_dimension
Observer_Path (PK), Observer_CD, Name_Char
modifier_dimension
Modifier_Path (PK), Modifier_CD, Name_Char
Each dimension table relates to observation_fact as one (1) to many (∞).
Observation (fact table) Primary Keys
Patient_num: Distinct number for every patient
Encounter_num: Distinct number for every visit
Concept_cd: Distinct code for every concept
Observer_cd: Distinct code for every observer
Start_date: Date-time observation began
Modifier_cd: Code to modify concept_cd
Instance_num: Mechanism to group concept modifiers
i2b2 Fact Table
In i2b2, an atomic fact is an observation on a patient.
Examples of facts
Diagnoses
Procedures
Lab data
Medications
Genetic data
i2b2 Dimension Tables
Dimension tables contain descriptive information about the facts.
Examples
Concept dimension describes the concepts stored in the concept_cd field.
Provider dimension contains information about the observer_cd field
Patient dimension contains information about the patient_num field
Visit dimension contains information about the encounter_num field
Modifier dimension contains information about the modifier_cd field
How does i2b2 use Ontologies?
By and large, the concepts stored in the fact table come from clinical coding systems or ontologies.
Largely dependent on data available to institution
Diagnoses: ICD9/ICD10/SNOMED
Procedures: CPT/ICD9
Medications: NDC/RXNORM
Lab results: LOINC
Molecular/genomic data
Custom or project specific data
Ontologies are used to organize query terms (and concepts) hierarchically.
Metadata table
Query terms are stored in a separate metadata table.
There is a one-to-one mapping of terms in the metadata to concepts in the dimension table.
The structure of the metadata table is integral to both the visualization of the query terms (tree) and the query mechanism itself.
Structure of Metadata Table
METADATA
C_HLEVEL              INT NULL
C_FULLNAME            VARCHAR(900) NULL
C_NAME                VARCHAR(2000) NULL
C_SYNONYM_CD          CHAR(1) NULL
C_VISUALATTRIBUTES    CHAR(3) NULL
C_TOTALNUM            INT NULL
C_BASECODE            VARCHAR(450) NULL
C_METADATAXML         TEXT NULL
C_FACTTABLECOLUMN     VARCHAR(50) NULL
C_TABLENAME           VARCHAR(50) NULL
C_COLUMNNAME          VARCHAR(50) NULL
C_COLUMNDATATYPE      VARCHAR(50) NULL
C_OPERATOR            VARCHAR(10) NULL
C_DIMCODE             VARCHAR(900) NULL
C_COMMENT             TEXT NULL
C_TOOLTIP             VARCHAR(900) NULL
UPDATE_DATE           DATETIME NULL
DOWNLOAD_DATE         DATETIME NULL
IMPORT_DATE           DATETIME NULL
SOURCESYSTEM_CD       VARCHAR(50) NULL
VALUETYPE_CD          VARCHAR(50) NULL
i2b2 Metadata Root Level Categories
Terms with c_hlevel = 1
Display name is c_name
Icon (folder or container) is determined by c_visualattributes
Example c_fullname: \Diagnoses\
Query terms are visualized hierarchically in tree
\Diagnoses\Respiratory system\Chronic obstructive diseases\Emphysema\
Level 1: Diagnoses
Level 2: Respiratory system
Level 3: Chronic obstructive diseases
Level 4: Emphysema
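As a minimal sketch of the numbering shown on this slide, the level and display name of a term can be derived from its backslash-delimited path. The helper names are hypothetical and are not part of i2b2; the sketch assumes the slide's convention that the root category is level 1, which may differ in a given installation.

```python
def path_segments(c_fullname):
    """Split a path such as '\\Diagnoses\\Respiratory system\\' into its terms."""
    return [segment for segment in c_fullname.split("\\") if segment]


def hlevel(c_fullname):
    """Level of a term, counting the root category as level 1 (assumption from the slide)."""
    return len(path_segments(c_fullname))


def display_name(c_fullname):
    """Last path segment, used here as a stand-in for c_name."""
    return path_segments(c_fullname)[-1]


emphysema = "\\Diagnoses\\Respiratory system\\Chronic obstructive diseases\\Emphysema\\"
print(hlevel(emphysema), display_name(emphysema))
```

Deriving the level from the path keeps c_hlevel and c_fullname consistent whenever a term is moved in the tree.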
Why are hierarchies so important for i2b2?
Hierarchies form the basis of both the visualization of the terms and the query mechanism itself.
select * from metadata where c_fullname like '\Diagnoses\Respiratory system\Chronic obstructive diseases\Emphysema\%' and c_hlevel = 5
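The lookup above (children of a level-4 term are the level-5 rows whose path starts with the parent path) can be tried against an in-memory SQLite table. This is only a sketch: the three-column table is a cut-down stand-in for the metadata table, the child terms are invented placeholders, and a production i2b2 database would normally run on a different engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata (c_hlevel INT, c_fullname VARCHAR(900), c_name VARCHAR(2000))"
)

parent = "\\Diagnoses\\Respiratory system\\Chronic obstructive diseases\\Emphysema\\"
rows = [
    (4, parent, "Emphysema"),
    # Placeholder children and grandchild, not real ontology terms.
    (5, parent + "Hypothetical child A\\", "Hypothetical child A"),
    (5, parent + "Hypothetical child B\\", "Hypothetical child B"),
    (6, parent + "Hypothetical child A\\Hypothetical grandchild\\", "Hypothetical grandchild"),
]
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)", rows)


def children(conn, c_fullname, c_hlevel):
    """Return the names of the terms one level below the given term."""
    cur = conn.execute(
        "SELECT c_name FROM metadata "
        "WHERE c_fullname LIKE ? AND c_hlevel = ? ORDER BY c_name",
        (c_fullname + "%", c_hlevel + 1),
    )
    return [name for (name,) in cur]


child_names = children(conn, parent, 4)
print(child_names)
```

The c_hlevel filter is what limits the result to direct children; without it the LIKE pattern would also return the term itself and every deeper descendant.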
Hierarchies in queries
select patient_num from observation_fact where concept_cd IN (select concept_cd from concept_dimension where concept_path LIKE '\Diagnoses\Respiratory system\Chronic obstructive diseases\Emphysema\%')
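A small runnable sketch of this query pattern, using SQLite in memory. The tables carry only the columns the query needs, and the concept codes (DEMO:...), patients, and extra terms are made up for illustration. DISTINCT is added here so each patient is reported once; the query on the slide does not include it.

```python
import sqlite3

EMPHYSEMA = "\\Diagnoses\\Respiratory system\\Chronic obstructive diseases\\Emphysema\\"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE concept_dimension (concept_path TEXT PRIMARY KEY, concept_cd TEXT, name_char TEXT)"
)
conn.execute(
    "CREATE TABLE observation_fact (patient_num INT, encounter_num INT, concept_cd TEXT)"
)

# Hypothetical concepts: the folder itself, one descendant, and one unrelated term.
conn.executemany("INSERT INTO concept_dimension VALUES (?, ?, ?)", [
    (EMPHYSEMA, "DEMO:E0", "Emphysema"),
    (EMPHYSEMA + "Hypothetical subtype\\", "DEMO:E1", "Hypothetical subtype"),
    ("\\Diagnoses\\Respiratory system\\Hypothetical other term\\", "DEMO:X0", "Hypothetical other term"),
])
# Hypothetical facts: (patient_num, encounter_num, concept_cd).
conn.executemany("INSERT INTO observation_fact VALUES (?, ?, ?)", [
    (1, 10, "DEMO:E1"),
    (1, 11, "DEMO:E0"),
    (2, 20, "DEMO:X0"),
    (3, 30, "DEMO:E0"),
])


def patients_under(conn, concept_path):
    """Patients with at least one fact coded at or below the given concept path."""
    cur = conn.execute(
        "SELECT DISTINCT patient_num FROM observation_fact "
        "WHERE concept_cd IN ("
        "  SELECT concept_cd FROM concept_dimension WHERE concept_path LIKE ?"
        ") ORDER BY patient_num",
        (concept_path + "%",),
    )
    return [patient_num for (patient_num,) in cur]


emphysema_patients = patients_under(conn, EMPHYSEMA)
print(emphysema_patients)
```

Because the path prefix selects the whole subtree, dragging a folder into a query picks up every concept code filed beneath it.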
i2b2 Ontologies for the Long Haul
How do I create i2b2 metadata for a known ontology?
ICD-10
What happens to my legacy clinical data when I have to move to ICD-10?
Merging ICD-9 with ICD-10
How do I handle genomic metadata?
Custom metadata?
NCBO BioPortal ICD-10
Building an ICD-10 Ontology with NCBO services
Pull data from NCBO via REST services.
Reorganize information into i2b2 Metadata format
bioportal/concepts/46302/all
<data>
  <pageNum>1</pageNum>
  <numPages>1832</numPages>
  <pageSize>50</pageSize>
  <numResultsPage>50</numResultsPage>
  <numResultsTotal>91590</numResultsTotal>
  <contents class="org.ncbo.stanford.bean.concept.ClassBeanResultListBean">
    <classBeanResultList>
      <classBean>
        <id>0-ICD10CM</id>
        <fullId>http://purl.bioontology.org/ontology/ICD10CM/0-ICD10CM</fullId>
        <label>ICD-10-CM TABULAR LIST of DISEASES and INJURIES</label>
        <type>class</type>
        <relations>
          <entry>
            <string>ChildCount</string>
            <int>0</int>
          </entry> ……
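To illustrate how a paged response like this could be read, the sketch below parses a shortened, well-formed stand-in for one page and extracts the paging counters plus each concept's id and label. It makes no network call. The sample string keeps only a few of the elements shown in the excerpt, and the function name is hypothetical rather than part of any NCBO client.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for one page of results, modeled on the excerpt (assumption).
SAMPLE_PAGE = """
<data>
  <pageNum>1</pageNum>
  <numPages>1832</numPages>
  <pageSize>50</pageSize>
  <numResultsTotal>91590</numResultsTotal>
  <contents>
    <classBeanResultList>
      <classBean>
        <id>0-ICD10CM</id>
        <label>ICD-10-CM TABULAR LIST of DISEASES and INJURIES</label>
        <type>class</type>
      </classBean>
    </classBeanResultList>
  </contents>
</data>
"""


def parse_page(xml_text):
    """Return paging info and (id, label) pairs for one page of concepts."""
    root = ET.fromstring(xml_text.strip())
    page = {
        "page_num": int(root.findtext("pageNum")),
        "num_pages": int(root.findtext("numPages")),
    }
    concepts = [
        (bean.findtext("id"), bean.findtext("label"))
        for bean in root.iter("classBean")
    ]
    return page, concepts


page, concepts = parse_page(SAMPLE_PAGE)
has_more_pages = page["page_num"] < page["num_pages"]
print(page, concepts, has_more_pages)
```

An extraction loop would repeat this for each page until page_num reaches num_pages, accumulating the concepts before reorganizing them into metadata rows.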
Primary challenges
i2b2 Metadata depends upon hierarchical information
c_fullname, c_tooltip maintain the hierarchy from root to leaves
Diseases of the respiratory system \ Chronic lower respiratory diseases \ Emphysema
Challenges..
NCBO REST service that enables pull of concepts includes immediate parent/child info only
Hierarchy must be computed
<data>
<classBean>
<id>J43</id>
<label>Emphysema</label>
<relations>
<entry>
<string>SuperClass</string>
<list>
<classBean>
<id>J40-J47</id>
<label>Chronic lower respiratory diseases</label>
</classBean>
</list>
</entry>
</relations>
</classBean>
</data>
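Since each concept carries only its immediate parent, the root-to-leaf path has to be assembled by walking the parent links upward. The following is a minimal sketch of that computation, not the released extraction workflow: the dictionary stands in for parsed SuperClass entries, and the "\Diagnoses\" root prefix is an assumption.

```python
# term id -> (label, parent id or None), as it might look after parsing SuperClass entries.
TERMS = {
    "J00-J99": ("Diseases of the respiratory system", None),
    "J40-J47": ("Chronic lower respiratory diseases", "J00-J99"),
    "J43": ("Emphysema", "J40-J47"),
}


def full_path(term_id, terms, root="\\Diagnoses\\"):
    """Build a c_fullname-style path by following parent links up to the top."""
    labels = []
    seen = set()
    current = term_id
    while current is not None:
        if current in seen:
            raise ValueError("cycle detected at " + current)
        seen.add(current)
        label, parent = terms[current]
        labels.append(label)
        current = parent
    # Labels were collected leaf-first, so reverse them for a root-first path.
    return root + "\\".join(reversed(labels)) + "\\"


emphysema_path = full_path("J43", TERMS)
print(emphysema_path)
```

The same walk yields c_tooltip if the labels are joined with " \ " instead, and the number of labels plus the root gives c_hlevel.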
NCBO Extraction workflow
[Workflow diagram. Labels: Request to extract ontology (ICD-10); Extraction Workflow; NCBO REST (XML); Process Extracted Data; i2b2 Metadata (Extracted ICD-10 terms)]
Released deliverables
https://community.i2b2.org/wiki/display/NCBO
What about my legacy ICD-9 data?
Ideally we would like an i2b2 ontology that integrates ICD-9 into ICD-10.
Mapping Tool
Tool to verify/(re)assign ontology mappings.
Navigating the Mapping Tool Tree
Displays terms mapped from one ontology within hierarchy of another
Mapped terms are displayed adjacent to terms they are mapped to and appear in bold
Adding a new mapping
ICD9:269.3, Mineral deficiency should appear for ICD10:E63 Other nutritional deficiencies
Copy term ICD9:269.3
Paste onto ICD10:E63 Other nutritional deficiencies
Move a mapping
Ascorbic acid deficiency (ICD9:267) can be moved down one level to Ascorbic acid deficiency (ICD10:E54)
Drag and drop down the term one level.
Unmap a mapping
ICD9:416.8 Other chronic pulmonary heart diseases appears in two places: the one attached to ICD10:I27.2 appears incorrect and can be unmapped.
The Unmapped Terms List
Free form list of terms to be mapped
Locate term you wish to map to in the hierarchy tree. Drag from table to term in the tree.
If you make a mistake you can either reassign the mapped term within the tree or unmap it from tree.
Unmap will cause it to reappear in the unmapped terms list if the term has no other mappings.
Assigning an unmapped term
Drag from unmapped terms list
Drop onto term we are mapping to
Unmapping a term
Drag term from tree
Drop onto unmapped terms list
Search Unmapped Terms By Name
Search Unmapped Terms by Code
Mapped Terms Viewer
Search Mapped Terms By Code
Search Mapped Terms By Name
Merging Ontologies
Mapping tool provides a visualization of what the merged ontologies would look like
What if we could extract a single metadata table from this?
Integration tool
[Workflow diagram. Labels: Request to integrate (ICD9 into ICD-10); Integration Workflow; Mapper Cell (Mapped ICD-9 terms); ICD-10 merged with ICD9 terms]
For each mapped ICD-9 term, compute ICD-10 hierarchy
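One way to picture the integration step: each mapped ICD-9 term is placed beneath the path of its ICD-10 target, so a single metadata table covers both code sets. The sketch below is an illustration under assumptions, not the integration tool itself. The two mappings come from the mapping examples earlier in the deck, while the ICD-10 paths, the row layout, and the choice to append the code to the term name are invented for the example.

```python
# Hypothetical ICD-10 paths keyed by code (the parent folder name is made up).
ICD10_PATHS = {
    "ICD10:E54": "\\Diagnoses\\Nutritional deficiencies\\Ascorbic acid deficiency\\",
    "ICD10:E63": "\\Diagnoses\\Nutritional deficiencies\\Other nutritional deficiencies\\",
}

# (ICD-9 code, ICD-9 name, ICD-10 code it is mapped to)
MAPPINGS = [
    ("ICD9:269.3", "Mineral deficiency", "ICD10:E63"),
    ("ICD9:267", "Ascorbic acid deficiency", "ICD10:E54"),
]


def level_of(path):
    """Number of path segments, with the root category counted as level 1."""
    return len([part for part in path.split("\\") if part])


def merge(icd10_paths, mappings):
    """Return metadata-like rows with mapped ICD-9 terms nested under ICD-10 terms."""
    rows = [
        {"c_basecode": code, "c_fullname": path, "c_hlevel": level_of(path)}
        for code, path in icd10_paths.items()
    ]
    for icd9_code, name, icd10_code in mappings:
        path = icd10_paths[icd10_code] + name + " (" + icd9_code + ")\\"
        rows.append({"c_basecode": icd9_code, "c_fullname": path, "c_hlevel": level_of(path)})
    return sorted(rows, key=lambda row: row["c_fullname"])


merged = merge(ICD10_PATHS, MAPPINGS)
for row in merged:
    print(row["c_hlevel"], row["c_basecode"], row["c_fullname"])
```

With rows laid out this way, a query on an ICD-10 folder also reaches legacy facts coded in ICD-9, because the ICD-9 paths share the ICD-10 prefix.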
How to handle genomic data
Ability to organize the variants for ease of navigation
Needs may differ between geneticist, physician, research scientist
Ability to query for the variant in the workbench
Genomic labs may report data differently
Define the variant so it may be reliably identified over time
Implication is that the identifier for the variant does not change over time or is maintainable.
How to (reliably) identify a genomic variant?
HGVS Name?
RS #?
Chr location, Nucleotide subst?
Gene name + flanking sequences?
All of them??
RS number
Uniquely identifies a variant over time ….but….
Novel variants may not have rs number
User may not want to submit to dbSNP
Gene name + flanking sequences
Not guaranteed if gene has several isoforms
EGFR
HGVS Name
Uniquely identifies variant within a referenced and versioned accession and details the nucleotide substitution.
NM_005228.3:c.2155G>T
NM_005228.3: RefSeq accession
c.: Coding DNA
2155: Position
G>T: Nucleotide substitution
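The parts labeled above can be pulled out of the name with a regular expression. This is a deliberately narrow sketch: it handles only simple single-nucleotide substitutions of the form shown in the example, not the full HGVS nomenclature, and the function and field names are hypothetical.

```python
import re

# Matches names such as NM_005228.3:c.2155G>T (simple substitutions only).
HGVS_SUBSTITUTION = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+)\.(?P<version>\d+):"
    r"(?P<coordinate_type>[cgmn])\."
    r"(?P<position>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_hgvs_substitution(name):
    """Split a simple HGVS substitution name into its labeled parts."""
    match = HGVS_SUBSTITUTION.match(name)
    if match is None:
        raise ValueError("not a simple HGVS substitution: " + name)
    parts = match.groupdict()
    parts["version"] = int(parts["version"])
    parts["position"] = int(parts["position"])
    return parts


variant = parse_hgvs_substitution("NM_005228.3:c.2155G>T")
print(variant)
```

Keeping the accession version as its own field makes it possible to notice when two names refer to different versions of the same reference sequence.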
Is there a common denominator in all of this?
Yes … all ultimately describe variant location on a chromosome.
Nucleotide substitution defines the physical manifestation of the variant.
WE PROPOSE:
HGVS name (n/t subst, positional info)
Flanking sequences (a way to verify positional info)
AS A WAY TO UNEQUIVOCALLY EQUATE TWO VARIANTS
ACROSS DOMAINS
ACROSS VERSIONS
Genomic MetadataXML record
GenomicMetadata
  Version: 1.0
  ReferenceGenomeVersion: hg18
  SequenceVariant
    HGVSName: NM_005228.3:c.2155G>T
    SystematicName: c.2155G>T
    SystematicNameProtein: p.Gly719Cys
    AaChange: missense
    DnaChange: substitution
  SequenceVariantLocation
    GeneName: EGFR
    FlankingSeq_5: GAATTCAAAAAGATCAAAGTGCTG
    FlankingSeq_3: GCTCCGGTGCGTTCGGCACGGTGT
    RegionType: exon
    RegionName: Exon 18
    Accessions
      Accession
        Name: NM_005228
        Type: mrna (NCBI)
      Accession
        Name: NP_005219
        Type: protein (NCBI)
      Accession
        Name: NT_004487
        Type: contig (NCBI)
    ChromosomeLocation
      Chromosome: chr7
      Region: 7p12
      Orientation: +
Organizational challenges
By Disease?
By Gene?
Combining equivalent terms
How to handle custom (local) metadata
The Edit Tool is ideal for creating a small, non-standard ontology for a local project.
Consider the case for classifying patients as smokers, non-smokers or smoking status unknown
The Custom Metadata folder is designed for use with the creation of local terms.
Create a “Smoking status” folder
Populate folder with “Smoker”, “Non-smoker”, etc
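For illustration, the rows behind such a folder might look like the output of the sketch below. It is not what the Edit Tool does internally. The "\Custom Metadata\" root path and the CUSTOM:... base codes are assumptions, and the visual attribute values ("FA" for an active folder, "LA" for an active leaf) follow the commonly used i2b2 convention and should be checked against the local installation.

```python
CUSTOM_ROOT = "\\Custom Metadata\\"  # assumed path of the Custom Metadata folder


def make_term(parent_path, name, is_folder, basecode=None):
    """Build one metadata-like row for a custom term."""
    fullname = parent_path + name + "\\"
    return {
        "c_hlevel": len([part for part in fullname.split("\\") if part]),
        "c_fullname": fullname,
        "c_name": name,
        "c_visualattributes": "FA" if is_folder else "LA",
        "c_basecode": basecode,
    }


def smoking_status_terms():
    """A 'Smoking status' folder plus one leaf per status (codes are hypothetical)."""
    folder = make_term(CUSTOM_ROOT, "Smoking status", is_folder=True)
    statuses = [
        ("Smoker", "CUSTOM:SMOKER"),
        ("Non-smoker", "CUSTOM:NONSMOKER"),
        ("Smoking status unknown", "CUSTOM:SMOKING_UNKNOWN"),
    ]
    leaves = [
        make_term(folder["c_fullname"], name, is_folder=False, basecode=code)
        for name, code in statuses
    ]
    return [folder] + leaves


terms = smoking_status_terms()
for term in terms:
    print(term["c_hlevel"], term["c_visualattributes"], term["c_fullname"])
```

Facts recorded with one of the leaf base codes would then be reachable by dragging either the leaf or the whole "Smoking status" folder into a query.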
Smoking status custom metadata
www.i2b2.org
https://community.i2b2.org/wiki
http://bioportal.bioontology.org