Nigam_NCBO-Webinar

Using ontologies to make sense
of unstructured medical data
Nigam Shah, MBBS, PhD
NCBO: Key activities
• We create and maintain a library of
biomedical ontologies.
• We build tools and Web services to enable
the use of ontologies and their derivatives.
• We collaborate with scientific communities
that develop and use ontologies.
Ontology Services
• Download
• Traverse
• Search
• Comment
Mapping Services
• Create
• Download
• Upload
Widgets
• Tree-view
• Auto-complete
• Graph-view
http://rest.bioontology.org
Views
Annotation
• Term recognition
Data Access
• Fetch “data” annotated with a given term
http://bioportal.bioontology.org
Annotation service
Process textual metadata to automatically tag text
with as many ontology terms as possible.
90 million calls,
~700 GB of data
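To make "tag text with as many ontology terms as possible" concrete, here is a minimal dictionary-matching sketch. It is not the NCBO Annotator and does not call any NCBO service; the mini-lexicon, the `EX:` identifiers, and the function name `tag_text` are all hypothetical and exist only for this illustration.

```python
import re

# Hypothetical mini-lexicon: surface form -> made-up term ID.
LEXICON = {
    "abdominal pain": "EX:0001",
    "myocardial infarction": "EX:0002",
    "pain": "EX:0003",
}

def tag_text(text, lexicon=LEXICON):
    """Return (start, end, term, term_id) for every lexicon term found in text."""
    hits = []
    for term, term_id in lexicon.items():
        # Whole-word, case-insensitive match of the term's surface form.
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.end(), term, term_id))
    # Overlapping matches are kept on purpose, so nested terms are all tagged.
    return sorted(hits)

annotations = tag_text("Patient reports abdominal pain; no myocardial infarction.")
```

Keeping overlapping matches ("abdominal pain" and "pain") is one way to read "as many terms as possible"; a real annotator would also handle synonyms and the ontology hierarchy.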
Resource index
PubMed Abstracts
Adverse Events (AERS)
GEO
:
Clinical Trials
DrugBank
Won 1st prize at the
2010 Semantic Web
Challenge @ ISWC
Creating Lexicons
[Diagram: Sentence in Clinical Note – 1 … Sentence in Clinical Note – m → Frequency counter → Term – 1 … Term – n, each with its Frequency]
[Table columns: ID, Term (Term-1 … Term-n), tf, df, Syntactic types (NN, JJ, …, VP, …)]
[Example row values, in column order: 150,879; 90,000; 0.90; 0.05; …; 0.03; …]
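The frequency-counting step can be sketched as follows. This assumes that tf means the total number of occurrences of a term across all notes and df means the number of notes that mention it; the slide does not define either. The function name `build_lexicon_counts` and the sample notes are hypothetical, and the syntactic-type columns (NN, JJ, VP) are left out because they would need a part-of-speech tagger.

```python
from collections import Counter

def build_lexicon_counts(notes, terms):
    """Count term frequency (tf) and document frequency (df) over notes."""
    tf, df = Counter(), Counter()
    for note in notes:
        lowered = note.lower()
        for term in terms:
            n = lowered.count(term.lower())  # naive substring count
            if n:
                tf[term] += n   # total occurrences across all notes
                df[term] += 1   # number of notes mentioning the term
    return tf, df

# Made-up notes, for illustration only.
notes = [
    "Chest pain on exertion. Chest pain resolved with rest.",
    "No chest pain. Mild fever.",
    "Fever and cough for three days.",
]
tf, df = build_lexicon_counts(notes, ["chest pain", "fever", "cough"])
```

Terms with very high df or an unexpected syntactic profile are the kind of entries one might prune when cleaning a lexicon, though the exact criteria are not given on the slide.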
Annotation Analytics
Analyzing tagged data for hypothesis generation in bioinformatics
Generic GO based analysis routine
• Get annotations for each
gene in a set
• Count the occurrence of
each annotation term in
the study set
• Count the occurrence of
that term in some
reference set (whole genome?)
• P-value for how surprising
their overlap is.
[Diagram labels: Genome, Study Set, Reference set]
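The four steps of this routine can be sketched with an upper-tail hypergeometric test, which is one common choice for the "how surprising is the overlap" P-value; the slide does not say which test is used. The gene and term names below are hypothetical, and the study set is assumed to be a subset of the reference set.

```python
from math import comb

def enrichment_p_value(study_hits, study_size, ref_hits, ref_size):
    """Upper-tail hypergeometric probability P(X >= study_hits)."""
    total = comb(ref_size, study_size)
    upper = min(study_size, ref_hits)
    tail = sum(
        comb(ref_hits, k) * comb(ref_size - ref_hits, study_size - k)
        for k in range(study_hits, upper + 1)
    )
    return tail / total

# Hypothetical gene -> annotation terms (step 1: get annotations).
annotations = {
    "g1": {"t1", "t2"}, "g2": {"t1"}, "g3": {"t1"}, "g4": {"t2"},
    "g5": set(), "g6": {"t2"}, "g7": set(), "g8": {"t3"},
}
study = ["g1", "g2", "g3"]
reference = list(annotations)

def count_term(genes, term):
    """Steps 2 and 3: count genes in a set that carry the term."""
    return sum(term in annotations[g] for g in genes)

# Step 4: P-value for term t1 in the study set versus the reference set.
p_t1 = enrichment_p_value(
    count_term(study, "t1"), len(study),
    count_term(reference, "t1"), len(reference),
)
```

In practice one P-value is computed per term, so some multiple-testing correction is normally applied afterwards.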
Annotation Analytics Landscape
[Figure: grid of ontologies against annotated data sets]
Ontologies: SNOMED-CT; NCIT; ICD-9; MeSH; …; Drugs, Chemicals; Cell Type; Human Disease; Gene Ontology
Data sets: Gene Sets; Patient Sets; Paper Sets; Grant Sets; Drug Sets; Health Indicator Warehouse datasets
Grid labels: Genes2MSH; GOPubMed; ?
Open questions
1. Can we use something other than the GO?
2. Lack of annotations: even today, roughly
20% of genes lack any GO annotation.
3. Annotation bias: annotations with certain
ontology terms are not independent of one
another.
4. Lack of a systematic mechanism to define a
level of abstraction.
Profiling a set of Aging genes
261 Age-related genes
Genome
Disease Ontology
~30% of genome
Using ontologies other than GO
ERCC6 → nucleoplasm
PARP1 → protein N-terminus binding
ERCC6 → <disease term?>
PARP1 → <disease term?>
Enrichment Analysis with the DO
GO annotations: http://www.geneontology.org/GO.downloads.annotations.shtml
Gene    GO term       Reference
ERCC6   GO:0005654    PMID:16107709
ERCC6   GO:0008094    PMID:16107709
PARP1   GO:0047485    PMID:16107709
ERCC6   GO:0005730    PMID:16107709
PARP1   GO:0003950    PMID:16107709
{ERCC6, PARP1} → PMID:16107709
www.ncbi.nlm.nih.gov/pubmed/16107709
NCBO Annotator:
http://bioportal.bioontology.org
{ERCC6, PARP1} → {Cockayne syndrome, DNA damage}
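The chain on this slide (gene to supporting paper, paper to disease terms) can be sketched as a simple join. Pairing the genes, GO IDs, and PMID row by row is an assumption about how the slide's table reads. The `disease_terms_by_pmid` mapping stands in for the output of a term recognizer run over the abstract; it is hypothetical apart from the two terms shown on the slide, and `genes_to_disease_terms` is an illustrative name.

```python
from collections import defaultdict

# (gene, GO term, reference) rows, paired in the order listed on the slide.
go_rows = [
    ("ERCC6", "GO:0005654", "PMID:16107709"),
    ("ERCC6", "GO:0008094", "PMID:16107709"),
    ("PARP1", "GO:0047485", "PMID:16107709"),
    ("ERCC6", "GO:0005730", "PMID:16107709"),
    ("PARP1", "GO:0003950", "PMID:16107709"),
]

# Stand-in for annotating each referenced abstract with disease terms.
disease_terms_by_pmid = {
    "PMID:16107709": {"Cockayne syndrome", "DNA damage"},
}

def genes_to_disease_terms(rows, terms_by_pmid):
    """Carry disease terms from each referenced paper over to its genes."""
    gene_terms = defaultdict(set)
    for gene, _go_id, pmid in rows:
        gene_terms[gene] |= terms_by_pmid.get(pmid, set())
    return dict(gene_terms)

gene_disease = genes_to_disease_terms(go_rows, disease_terms_by_pmid)
```

The resulting gene-to-disease-term table can then be fed to the same enrichment routine used with GO annotations.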
Annotation Analytics on EMR data
Analysis of tagged data from electronic health records
Profiling patient sets
ICD9 789.00
(Abdominal pain, unspecified site)
86k patient reports
Patient records processed from U. Pittsburgh NLP Repository with IRB approval.
Annotation (Clinical Text)
Generation of tagged data
Annotation Workflow
[Diagram: Text clinical note → Term recognition tool (NCBO Annotator) → Negation detection (NegEx Rules, NegEx Patterns) → Terms Recognized]
[Diagram: BioPortal – knowledge graph → Creating clean lexicons (Term – 1 … Term – n, with Frequency and Syntactic types; Diseases, Procedures, Drugs)]
[Diagram: patients P1, P2, P3, …, Pn, each with a sequence of notes tagged with ICD9 codes and recognized terms, for example "T1, T2, no T4"; "T5, T4, T3"; "T4, T3, T1"; "T8, T9, T4"; "T6, T8, T10"]
Terms form a temporal series of tags
Cohort of Interest → Further Analysis
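The negation step (a tag sequence such as "T1, T2, no T4") can be illustrated with a heavily simplified trigger-word rule. This is only in the spirit of NegEx and is not the NegEx rule set: the three triggers, the three-token look-back window, and the function name `tag_with_negation` are assumptions made for this sketch.

```python
import re

# Small illustrative subset of negation triggers; not the NegEx rule set.
NEGATION_TRIGGERS = ("no", "denies", "without")

def tag_with_negation(sentence, terms, window=3):
    """Return {term: negated?} for each term found in the sentence."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    result = {}
    for term in terms:
        term_tokens = term.lower().split()
        n = len(term_tokens)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == term_tokens:
                # Look for a trigger in the few tokens before the term.
                preceding = tokens[max(0, i - window):i]
                result[term] = any(t in NEGATION_TRIGGERS for t in preceding)
                break  # only the first mention is considered
    return result

tags = tag_with_negation(
    "Patient has fever and cough, no chest pain.",
    ["fever", "cough", "chest pain"],
)
```

Here "fever" and "cough" come back as affirmed and "chest pain" as negated, mirroring the "T1, T2, no T4" pattern in the diagram.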
Detecting the Vioxx Risk Signal
RA Patients (14,079)
Vioxx Patients (1,560)
MI Patients (1,827)
Vioxx → MI (339)
ROR of 2.058, CI of [1.804, 2.349]
The χ² statistic has p-value < 10^-7
ROR = 1.524, CI = [0.872, 2.666], χ² p-value = 0.06816.
p-value < 1.3 × 10^-24
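A reporting odds ratio (ROR) with a 95% confidence interval can be computed from a 2x2 table as sketched below. How the slide's counts map onto the table is an assumption: 14,079 RA patients are taken as the whole population, 1,560 as exposed to Vioxx, 1,827 as having MI, and 339 as having both. Under that reading the result agrees with the reported ROR of 2.058 and CI of [1.804, 2.349] to three decimals.

```python
from math import exp, log, sqrt

def reporting_odds_ratio(a, b, c, d, z=1.96):
    """ROR and confidence interval from a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    """
    ror = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(ROR)
    return ror, exp(log(ror) - z * se), exp(log(ror) + z * se)

# Assumed reading of the slide's counts (see the note above).
total, vioxx, mi, both = 14079, 1560, 1827, 339
a = both                       # Vioxx and MI
b = vioxx - both               # Vioxx, no MI
c = mi - both                  # MI, no Vioxx
d = total - vioxx - mi + both  # neither
ror, ci_low, ci_high = reporting_odds_ratio(a, b, c, d)
```

The interval uses the usual normal approximation on the log scale; the χ² p-values on the slide are not recomputed here.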
Detecting Adverse Events
[Figure: feature panels shown as Linear Space Features and Logarithmic Space Features]
Features: Drug frequency; Disease frequency; Observed drug-first fraction; Observed co-mention count; Drug-first fraction z-score (fixed drug); Co-mention count z-score (fixed drug); Drug-first fraction z-score (fixed disease); Co-mention count z-score (fixed disease)
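Two of the features named on this slide, the observed co-mention count and the observed drug-first fraction, might be computed from per-patient tag timelines as sketched below. The definitions are assumptions, since the slide gives only the feature names: a co-mention is taken to be a patient whose record mentions both the drug and the disease, and "drug-first" means the first drug mention precedes the first disease mention. The function name and the toy timelines are hypothetical.

```python
def comention_features(timelines, drug, disease):
    """Return (co-mention count, drug-first fraction) for one drug-disease pair.

    timelines maps a patient ID to a list of (time, tag) events.
    """
    co_mentions = 0
    drug_first = 0
    for events in timelines.values():
        drug_times = [t for t, tag in events if tag == drug]
        disease_times = [t for t, tag in events if tag == disease]
        if drug_times and disease_times:
            co_mentions += 1
            # Compare the first mention of each tag.
            if min(drug_times) < min(disease_times):
                drug_first += 1
    fraction = drug_first / co_mentions if co_mentions else None
    return co_mentions, fraction

# Toy timelines, for illustration only.
timelines = {
    "p1": [(1, "drugA"), (5, "diseaseX")],
    "p2": [(2, "diseaseX"), (3, "drugA")],
    "p3": [(1, "drugA")],
    "p4": [(1, "drugA"), (2, "diseaseX"), (3, "drugA")],
}
count, fraction = comention_features(timelines, "drugA", "diseaseX")
```

The z-score features would then compare these observed values against their distribution over all pairs sharing the same drug, or the same disease.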
Detecting Off-label use
Annotation Analytics Landscape
[Figure: grid of ontologies against annotated data sets]
Ontologies: SNOMED-CT; NCIT; ICD-9; MeSH; …; Drugs, Chemicals; Cell Type; Human Disease; Gene Ontology
Data sets: Gene Sets; Patient Sets; Paper Sets; Grant Sets; Drug Sets; Health Indicator Warehouse datasets
Grid labels: Genes2MSH; GOPubMed; Aging; EMRs
What questions can we ask?
Associations and outcomes
[Figure: matrix of entity types against each other: Gene; Disease; Drug; Device; Procedure; Environment]
Associations labeled: Enrichment; Off-label; Indications; Side effects
What questions can we ask?
Acknowledgements
• Paea LePendu
• Yi Liu
• Srinivasan Iyer
• Steve Racunas
• Anna Bauer-Mehren
• Clement Jonquet
• Rong Xu
• Mark Musen
• NIH – NCBO funding
• Mayo Team
• Hongfang Liu
• Stephen Wu
• Sylvia Holland
• Alex Skrenchuk
Mining Annotations of Grants, Publications
Grants from 1972 to 2007
30 funding agencies
Publications from Medline
Only “Journal articles”
Sponsorship and Allocation
Who funds what