Graduate Seminar - The International Conference on Bioinformatics

Towards ontology-driven navigation of the lipid bibliosphere
Christopher J. O. Baker, Rajaraman Kanagasabai,
Wee Tiong Ang, Anitha Veeramani, Hong-Sang Low, and
Markus R. Wenk
International Conference on Bioinformatics 2007
(InCoB 2007)
27-31 August 2007
Motivation
• Lipid research in the 21st century needs reliable and sensible integration of data from different sources.
• Lipid nomenclature in the biomedical literature is highly heterogeneous.
• Semantic data integration is necessary for lipid research, yet it is poorly achievable because there is no single unified, consistent, and universally accepted lipid classification system.
Objective
• Develop a system that facilitates navigation of the lipid bibliosphere using a standardized lipid vocabulary with precise semantics.
• Make use of the expressivity of a W3C-endorsed standard, the Web Ontology Language (OWL), to represent lipid nomenclature and hierarchy.
Lipids
• Lipids have many properties and biologically related information that need to be systematically captured in a domain model.
• Lipids have no universally accepted nomenclature.
• Lipid nomenclature is not always intuitive.
• Semantics of lipid terminology can be ambiguous, synonym-rich, and non-standard.
• Integration of lipid data is hampered by the lack of a unified classification system and the presence of multiple data formats.
Ontologies
• Capture knowledge: the meaning of important vocabulary (classes, properties/relations, and instance data in a domain model).
• Provide a common terminology for a domain.
• Make the content in information sources explicit.
• Provide an index and query model to a repository of information.
• Provide a basis for interoperability between information systems.
Lipid Ontology
Lipid Upper Ontology
• Implemented in the OWL-DL language
• Uses LIPIDMAPS systematic lipid nomenclature
• 560 named classes
• 352 lipid subclasses
• 71 object properties
• 4 data properties
• Lipid instance: LIPIDMAPS systematic name
• Depth: 8 levels
Modeling lipid information
• Multiple features of lipids are modeled in the Lipid_Specification concepts and are directly related to the lipid classification hierarchy found under the Lipid concept.
Linking lipids with other biological information
Lipid-Disease
• Modeled with the Disease concept
• Disease instance: disease name from the Disease Ontology
• The Lipid concept is linked to the Disease concept via the hasRole_In_Disease property
Lipid-Protein
• Modeled with the Protein concept
• Protein instance: protein name from Swiss-Prot
• The Lipid concept is linked to the Protein concept via the InteractsWith_Protein property
A LIPID has many names
• Phosphatidylcholine is an important component of the mucus layer in the large intestine.
• The distribution of these pores was examined using 1,2-di-oleoyl-sn-glycero-3-phosphocholine (DOPC) phospholipid vesicles under a standard fluorescent microscope.
• Lecithin is usually used as a synonym for pure phosphatidylcholine, which is the major component isolated from egg yolk or soy beans.
2-[[(2R)-2,3-di(octadecanoyloxy)propoxy]-hydroxyphosphoryl]oxyethyl-trimethylazanium
Modelling Synonyms
4 types of name:
• LIPIDMAPS systematic name
• IUPAC systematic name
• Broad lipid name (non-systematic)
• Exact lipid name (non-systematic)
Instances of names are connected via the properties:
• hasIUPAC_Synonym
• hasLIPIDMAPS_Synonym
• hasBroad_Lipid_Synonym
• hasExact_Lipid_Synonym
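The synonym model can be pictured as a small data structure: one lipid instance, keyed by its LIPIDMAPS systematic name, with a set of names under each of the four synonym properties. The following is a minimal Python sketch, not the OWL model itself. The property names follow the slide with spelling normalized; the class `LipidInstance`, the helper `build_index`, and the example values for DOPC are illustrative assumptions.

```python
from dataclasses import dataclass, field

# The four synonym properties (spelling normalized from the slide).
SYNONYM_PROPERTIES = (
    "hasLIPIDMAPS_Synonym",
    "hasIUPAC_Synonym",
    "hasBroad_Lipid_Synonym",
    "hasExact_Lipid_Synonym",
)


@dataclass
class LipidInstance:
    """A lipid individual identified by its LIPIDMAPS systematic name."""
    systematic_name: str
    synonyms: dict = field(
        default_factory=lambda: {prop: set() for prop in SYNONYM_PROPERTIES}
    )

    def add_synonym(self, prop: str, name: str) -> None:
        if prop not in SYNONYM_PROPERTIES:
            raise ValueError(f"unknown synonym property: {prop}")
        self.synonyms[prop].add(name)


def build_index(lipids):
    """Map every lower-cased name to the systematic names it may refer to."""
    index = {}
    for lipid in lipids:
        index.setdefault(lipid.systematic_name.lower(), set()).add(lipid.systematic_name)
        for names in lipid.synonyms.values():
            for name in names:
                index.setdefault(name.lower(), set()).add(lipid.systematic_name)
    return index


# Illustrative values only; they are not taken from the ontology itself.
dopc = LipidInstance("1,2-di-(9Z-octadecenoyl)-sn-glycero-3-phosphocholine")
dopc.add_synonym("hasExact_Lipid_Synonym", "DOPC")
dopc.add_synonym("hasBroad_Lipid_Synonym", "phosphatidylcholine")
dopc.add_synonym("hasBroad_Lipid_Synonym", "lecithin")

index = build_index([dopc])
print(index["dopc"])
```

A broad name such as "lecithin" would map to many lipid instances once more are added, which is why the index stores a set of systematic names rather than a single value.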
Literature Specification
Literature-driven, ontology-centric ….
Content Delivery Platform - Automated
• Document delivery from Pubmed-PDF / USPTO-HTML
• Tools for conversion of docs to text-minable text
Text Mining - Customized and Automated
• Regular Expressions, Named Entities, Relations, Co-reference
Domain Modeling / Customized / Rapid Prototype
• Knowledge Engineering Ontology Creation
Knowledge Navigation / Ontology Interrogation Tools Interactive
• Visual Query, Natural Language Interfaces
Service platform for knowledge-intensive lipid navigation tasks
Lipid Ontology as a knowledge integration vehicle
Major Knowledge Sources
• Lipid Ontology
• NLP tagged text
• Database content
OWL interrogation
• DL reasoning & inference
• nRQL (new RACER Query Language)
• Semantic query tools
Knowledge navigation: Ontology and Text Mining
1 Document Content
2 Sentence Extraction
3 Sentence Detection: lipid interaction protein
4 Entity Recognition: term identification / assign lipid class
5 Normalization: collapse lipid synonyms
6 Relation Extraction: Lipid-Protein or Lipid-Disease
"TLR4 binds to POPC", tagged as "<term category="protein">TLR4</term> binds to <term category="lipid">POPC</term>"
7 Classification: Identify ontology classes and specify relations for all sentences, proteins, lipid subclasses.
8 Populate OWL ontology (JENA API)
Term List DBs: Lipid names, LIPIDMAPS, Lipid Bank, KEGG classifications, Disease names, Protein names
Stemmed Interactions
Document and sentence metadata
Figure labels: Complete Instantiated OWL-DL Ontology; Indexed Lipid Sentences; Lipid Class; Lipid Instance
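Steps 4 to 6 of the pipeline can be illustrated with a minimal Python sketch built around the slide's own example sentence, "TLR4 binds to POPC". This is not the BioText Suite or the authors' implementation. The term lists, the synonym table, the interaction stems, and the function names are all hypothetical stand-ins chosen for illustration.

```python
import re

# Hypothetical term lists; the slides draw theirs from LIPIDMAPS, Lipid Bank, KEGG, etc.
TERM_LISTS = {
    "protein": ["TLR4"],
    "lipid": ["POPC", "1-palmitoyl-2-oleoyl-sn-glycero-3-phosphocholine"],
}
# Hypothetical synonym table used to collapse lipid synonyms (step 5).
CANONICAL_NAME = {"1-palmitoyl-2-oleoyl-sn-glycero-3-phosphocholine": "POPC"}
# Assumed interaction stems; the real stem list is not given in the slides.
INTERACTION_STEMS = ("bind", "interact")

CATEGORY_OF = {term: cat for cat, terms in TERM_LISTS.items() for term in terms}
# Longest terms first so a long name is never split by a shorter one.
ENTITY_RE = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(t) for t in sorted(CATEGORY_OF, key=len, reverse=True))
    + r")(?!\w)"
)
TAG_RE = re.compile(r'<term category="(\w+)">\s*(.*?)\s*</term>')


def tag_entities(sentence):
    """Step 4: wrap every known term in a <term category="..."> tag."""
    def wrap(match):
        term = match.group(0)
        return f'<term category="{CATEGORY_OF[term]}">{term}</term>'
    return ENTITY_RE.sub(wrap, sentence)


def normalize(name):
    """Step 5: map a synonym to its canonical name (unchanged if unknown)."""
    return CANONICAL_NAME.get(name, name)


def extract_relations(tagged):
    """Step 6: link adjacent lipid/protein mentions separated by an interaction stem."""
    mentions = [(m.group(1), m.group(2), m.start(), m.end())
                for m in TAG_RE.finditer(tagged)]
    relations = []
    for (cat_a, name_a, _, end_a), (cat_b, name_b, start_b, _) in zip(mentions, mentions[1:]):
        between = tagged[end_a:start_b].lower()
        if {cat_a, cat_b} == {"lipid", "protein"} and any(
            stem in between for stem in INTERACTION_STEMS
        ):
            lipid, protein = (name_a, name_b) if cat_a == "lipid" else (name_b, name_a)
            relations.append((normalize(lipid), "InteractsWith_Protein", protein))
    return relations


tagged = tag_entities("TLR4 binds to POPC")
print(tagged)
print(extract_relations(tagged))
```

Matching stems by substring is deliberately crude; a real pipeline would stem tokens and also handle co-reference, as the slides indicate.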
Knowledge integration pipeline
Specification
• Content acquisition pipeline:
• Automated Pubmed query
• Text format converter
Pipeline figure labels: User; User input query “lipid interact* protein”; Pubmed; 110 full text papers; NLP tagging; 2 sec/Doc; 87 docs tagged with relevant named entities; Ontology instantiation; “Instantiated ontology”; 123 lipids, 361 proteins, 920 lipid-protein interactions; Knowledge Navigation vehicle; Output for end user
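The slides do not say how the automated Pubmed query was issued. As one possible illustration, the sketch below builds a search URL for the public NCBI E-utilities ESearch endpoint from the user query shown in the figure. The function name `build_pubmed_query` and the `retmax` default are assumptions, and no network request is made.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint (assumed here; not named in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(term, retmax=100):
    """Return an ESearch URL that asks PubMed for IDs matching `term`."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_pubmed_query("lipid interact* protein")
print(url)
```

Fetching the URL returns matching PubMed identifiers; retrieving full text and converting it to minable text are separate steps that this sketch does not cover.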
Knowledge integration pipeline
Specification
• Text mining & NLP:
• BioText Suite for tokenization, part-of-speech tagging, named entity recognition, grounding, association mining
Knowledge integration pipeline
Specification
• Ontology instantiation pipeline:
• Custom script based on the JENA API
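The instantiation step turns each extracted statement into individuals and property assertions in the ontology. The authors used a custom script based on the JENA API (Java); the sketch below only illustrates the idea in Python by emitting N-Triples lines. The namespace `http://example.org/lipid-ontology#`, the function names, and the individual names are hypothetical; the class and property names (Lipid, Protein, Sentence, InteractsWith_Protein, occursIn_Sentence) are taken from the slides.

```python
# Hypothetical namespace; the real ontology IRI is not given in the slides.
NS = "http://example.org/lipid-ontology#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def iri(local_name):
    """Wrap a local name in the (assumed) ontology namespace."""
    return f"<{NS}{local_name}>"


def instantiate(lipid, protein, sentence_id):
    """Return N-Triples lines asserting three individuals and their links."""
    return [
        f"{iri(lipid)} <{RDF_TYPE}> {iri('Lipid')} .",
        f"{iri(protein)} <{RDF_TYPE}> {iri('Protein')} .",
        f"{iri(sentence_id)} <{RDF_TYPE}> {iri('Sentence')} .",
        f"{iri(lipid)} {iri('InteractsWith_Protein')} {iri(protein)} .",
        f"{iri(lipid)} {iri('occursIn_Sentence')} {iri(sentence_id)} .",
    ]


triples = instantiate("POPC", "TLR4", "sentence_1")
print("\n".join(triples))
```

In the real ontology a lipid individual would be typed with its specific lipid subclass rather than the top-level Lipid concept; the sketch uses the top-level class to stay short.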
Knowledge integration pipeline
Specification
• Knowledge navigation platform:
• Knowledge navigator or Knowlegator
• RACER
• nRQL
OWL-DL Query with nRQL
Mark-up Language    Description               Query Language
XML                 Structured Document       XPath, XQuery
RDF                 Data Model for objects    RDQL, RQL, Versa, Squish
OWL                 Data Model + Relations    nRQL, OWL-QL, JENA
Haarslev V., Moeller R., Wessel M. Querying the Semantic Web with Racer + nRQL. In: Sean Bechhofer, Volker Haarslev, Carsten Lutz, Ralf Moeller (Eds.), CEUR Workshop Proceedings of the KI-2004 Workshop on Applications of Description Logics (ADL 04), Ulm, Germany, Sep 24, 2004.
The New Racer Query Language
www.cs.concordia.ca/~haarslev/racer/racerqueries.pdf
• nRQL queries are built on a Lisp syntax
• Elementary query atoms, combinable into highly expressive but syntactically complex A-box queries to derive assertions about instance data (individuals).
• Unary concept query (instance classification and retrieval)
• Does this instance belong to this class?
• What are the instances of class X?
• To which classes does instance X belong?
• Binary role query
• What instances are related by relation X?
• Binary role constraint query
• Unary has known successor (Ancestor / Descendant)
• Negation
• Intersect / Conjunction
• Union / Disjunction
• Combinations (And / Union)
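To make the query atoms concrete, the sketch below assembles nRQL query strings in Python for a unary concept query, a binary role query, and a conjunction of the two. It is a string builder only and does not talk to RACER. The helper names are hypothetical, and the short concept and role names are an assumption: in a real RACER session OWL names are typically full namespace-qualified identifiers written between vertical bars.

```python
def concept_atom(var, concept):
    """Unary concept query atom: is `var` an instance of `concept`?"""
    return f"({var} |{concept}|)"


def role_atom(var_a, var_b, role):
    """Binary role query atom: are `var_a` and `var_b` related by `role`?"""
    return f"({var_a} {var_b} |{role}|)"


def conj(*atoms):
    """Conjunction of query atoms."""
    return "(and " + " ".join(atoms) + ")"


def union(*atoms):
    """Disjunction of query atoms."""
    return "(union " + " ".join(atoms) + ")"


def retrieve(variables, body):
    """Wrap a query body in an nRQL retrieve form with the given head variables."""
    return f"(retrieve ({' '.join(variables)}) {body})"


# What are the instances of class Lipid?
q_instances = retrieve(["?x"], concept_atom("?x", "Lipid"))

# Which lipids interact with which proteins?
q_interactions = retrieve(
    ["?l", "?p"],
    conj(
        concept_atom("?l", "Lipid"),
        concept_atom("?p", "Protein"),
        role_atom("?l", "?p", "InteractsWith_Protein"),
    ),
)

print(q_instances)
print(q_interactions)
```

Nested forms like these are what the slide calls syntactically complex, which motivates a visual query composer that hides the Lisp syntax from the end user.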
Knowledge Navigation Tool
Screenshot labels: Query Composition Panel; Results Panel; Ontology Content; Query Syntax; Concept Properties Overview; Query Engine Dialogue
Lipid Ontology as a Query Model
Query: Find documents containing sentences where lipids interact with proteins and the lipids are related to a disease.
Entity tables:
Lipid: Lipid_ID (PK), Lipid_Name, ...
Protein: Protein_ID (PK), Protein_Name, ...
Disease: Disease_ID (PK), Disease_Name, ...
Sentence: Sentence_ID (PK), Sentence_Text, ...
Document: Document_ID (PK), Title, Authors, Journal, ...
Relation tables:
interactsWith_Protein: Lipid_ID (FK1), Protein_ID (FK2)
relatedTo_Disease: Lipid_ID (FK1), Disease_ID (FK2)
occursIn_Sentence: Lipid_ID (FK1), Sentence_ID (FK2)
occursIn_Document: Sentence_ID (FK1), Document_ID (FK2)
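Read as a relational schema, the query above becomes a chain of joins over the relation tables. The sketch below builds that schema in an in-memory SQLite database and runs the query. It is an analogy for the query model, not how the ontology is actually queried (the system uses nRQL over OWL-DL). Only the table and column names come from the slide; all rows are placeholder data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Lipid    (Lipid_ID INTEGER PRIMARY KEY, Lipid_Name TEXT);
CREATE TABLE Protein  (Protein_ID INTEGER PRIMARY KEY, Protein_Name TEXT);
CREATE TABLE Disease  (Disease_ID INTEGER PRIMARY KEY, Disease_Name TEXT);
CREATE TABLE Sentence (Sentence_ID INTEGER PRIMARY KEY, Sentence_Text TEXT);
CREATE TABLE Document (Document_ID INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE interactsWith_Protein (Lipid_ID INTEGER, Protein_ID INTEGER);
CREATE TABLE relatedTo_Disease     (Lipid_ID INTEGER, Disease_ID INTEGER);
CREATE TABLE occursIn_Sentence     (Lipid_ID INTEGER, Sentence_ID INTEGER);
CREATE TABLE occursIn_Document     (Sentence_ID INTEGER, Document_ID INTEGER);

-- Placeholder rows: Lipid_A has a protein and a disease link, Lipid_B only a protein link.
INSERT INTO Lipid    VALUES (1, 'Lipid_A'), (2, 'Lipid_B');
INSERT INTO Protein  VALUES (1, 'Protein_A');
INSERT INTO Disease  VALUES (1, 'Disease_A');
INSERT INTO Sentence VALUES (1, 'Sentence mentioning Lipid_A'),
                            (2, 'Sentence mentioning Lipid_B');
INSERT INTO Document VALUES (1, 'Doc 1'), (2, 'Doc 2');
INSERT INTO interactsWith_Protein VALUES (1, 1), (2, 1);
INSERT INTO relatedTo_Disease     VALUES (1, 1);
INSERT INTO occursIn_Sentence     VALUES (1, 1), (2, 2);
INSERT INTO occursIn_Document     VALUES (1, 1), (2, 2);
""")

# Documents with a sentence mentioning a lipid that both interacts with a
# protein and is related to a disease.
QUERY = """
SELECT DISTINCT d.Title
FROM Document d
JOIN occursIn_Document od     ON od.Document_ID = d.Document_ID
JOIN occursIn_Sentence os     ON os.Sentence_ID = od.Sentence_ID
JOIN interactsWith_Protein ip ON ip.Lipid_ID = os.Lipid_ID
JOIN relatedTo_Disease rd     ON rd.Lipid_ID = os.Lipid_ID
ORDER BY d.Title
"""

titles = [row[0] for row in conn.execute(QUERY)]
print(titles)
```

Four joins are needed for a single question, which illustrates why the slides present the ontology, with named relations between concepts, as a more direct query model.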
Summary
• We build a lipid ontology in the Web Ontology Language (OWL) to represent the LIPIDMAPS classification hierarchy.
• The ontology model resolves nomenclature inconsistencies by grounding lipid synonyms to individual lipid names.
• We report a document delivery system that, in conjunction with a lipid-specific text mining platform, instantiates lipid sentences into the lipid ontology.
• We facilitate navigation of the lipid literature using a drag ‘n’ drop visual query composer that poses description logic queries to the OWL-DL ontology.
• Lipid-disease and lipid-protein statements in the lipid literature can be readily queried and made easily available to lipid researchers.
Acknowledgement
• A*STAR (Agency for Science, Technology and Research), Singapore Government.
• National University of Singapore, Graduate Student Travel Grant.