Semantics and Services enabled Problem Solving Environment

Semantics and Services enabled Problem
Solving Environment for Trypanosoma cruzi
Collaboration with National Centers for Biomedical Computing (R01)
National Heart, Lung, and Blood Institute; National Institutes of Health
Partners & Partner Institutions
• Kno.e.sis Center, Wright State University
(Prof. Amit Sheth-PI, Satya S. Sahoo, Pablo Mendes, <Karthik
Gomadam, Ajith Ranabahu>)
• Tarleton Lab, Cellular Biology, University of Georgia
(Prof. Rick Tarleton, Dr. Flora Logan, Brent Weatherly)
– Computer Science, University of Georgia
(Prof. Prashant Doshi, GRA)
• NCBO/Stanford Medical Informatics Center
(Prof. Mark Musen, Dr. Natasha Noy, Martin O'Connor)
• Potential community partners in later years:
– Prof. Alberto M. R. Dávila, Instituto Oswaldo Cruz, FIOCRUZ – Brazil
– Prof. Edmundo Carlos Grisard, Universidade Federal de Santa Catarina – Brazil
– Dr. Christian Stocker – University of Pennsylvania
Driving Biological Problem
• Trypanosoma cruzi (T. cruzi) is a protozoan parasite and the
causative agent of Chagas disease. Chagas disease afflicts 18 million
people in Latin America, leading to heart disease and sudden death.
Driving factors in T. cruzi research:
1. Identification of vaccine candidates in T. cruzi
2. Diagnostic techniques for identification of best antigens
3. Identification of genes for knockout in T. cruzi
Project Overview
• Facilitate T. cruzi research through an ontology-driven
(semantic) Problem Solving Environment
• Example Query: “Diagnostic techniques for identification
of best antigens”
• Potential Data Sources:
o Databases: TcruziDB + COG (for genome-scale analysis of
protein functions and evolution)
o Experimental Data: Genes with microarray transcript evidence
o Literature: PubMed for peer-reviewed published data
o Web services: BLAST, Clustal, PSORT
Opportunity: exploiting multimodal data
[Figure: multimodal data sources, binary and text, feeding semantics-enhanced search, browsing, complex query, and hypothesis validation]
• Scientific Literature: PubMed (300 documents published online each day)
• Health Information Services: Elsevier iConsult
• Smart Mashups (WebAPIs): BioMashup
• Public Datasets: NCBI genome and protein DBs (new sequences daily)
• Clinical Data: personal health history
• Laboratory Data: lab tests, RTPCR, mass spec
Computer Science Objectives
• Four semantic informatics components to
support the driving biological problem:
o Semantic provenance-enabled cyberinfrastructure for
experimental data
o Query interface usable by domain scientists for
complex query formulation and execution,
underpinned by ontology and services
o Semantic text analysis approaches for extraction of
knowledge from biomedical literature
o A semantic services-based smart mashups
environment
Semantic Provenance-enabled Infrastructure
• CONCLUSIONS: Results
e.g. genome, proteome, knock out
• PROVENANCE: Metadata about materials and methods
e.g. sample size, parameters, equipment
• VOCABULARY: Consistent description of data -> annotation
e.g. ‘gene’, ‘protein’, ‘lifecycle stage’
Courtesy: Dr. Flora Logan, Tarleton Lab, UGA
Provenance
• Scientific data is interpreted, evaluated and processed
based on multiple criteria
• Why – project details (Vaccine target for T. cruzi)
• Which – origin/initial sample (batch/cell line)
• How – generation process (statistical method used, ms/ms)
• What – equipment & parameters (QTOF, O18 labeling)
• When – time (24 hrs/48 hrs blood sample)
• Who – user
• Where – institution (Tarleton Lab, University of Georgia)
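The seven provenance questions above can be read as the fields of a record. The following is a minimal sketch only: the class name `ProvenanceRecord`, the helper `missing_fields`, and the placeholder values are assumptions made for illustration, not part of the project's infrastructure.

```python
from dataclasses import dataclass, asdict
from typing import List


@dataclass(frozen=True)
class ProvenanceRecord:
    """One provenance entry; field names mirror the seven questions on the slide."""
    why: str    # project details
    which: str  # origin / initial sample
    how: str    # generation process
    what: str   # equipment and parameters
    when: str   # time point
    who: str    # user
    where: str  # institution


def missing_fields(record: ProvenanceRecord) -> List[str]:
    """Return the names of provenance fields that were left empty."""
    return [name for name, value in asdict(record).items() if not value.strip()]


# Values below are taken from the slide where available; the rest are placeholders.
example = ProvenanceRecord(
    why="Vaccine target for T. cruzi",
    which="cell line (placeholder)",
    how="ms/ms",
    what="QTOF, O18 labeling",
    when="24 hrs blood sample",
    who="",
    where="Tarleton Lab, University of Georgia",
)
print(missing_fields(example))
```

A completeness check like this is one simple use of structured provenance: a record with an empty `who` can be flagged before the data is interpreted.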
Semantic Provenance
• Use of an ontology to handle terminological heterogeneity
in describing provenance and to provide a richer
model
• Domain-specific provenance is modeled in an ontology,
which is then used for annotation → semantic provenance
• Semantic provenance enables:
o Software agents to “understand” provenance
o Reasoning over provenance to discover new information
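As a rough illustration of both points, the sketch below maps differing instrument labels to one concept and then reasons over a subclass hierarchy. The concept names, the synonym table, and the hierarchy are invented for this example and are not taken from ProPreO.

```python
# Hypothetical mini-ontology: each class points to its parent class.
SUBCLASS_OF = {
    "QTOF_mass_spectrometer": "mass_spectrometer",
    "mass_spectrometer": "instrument",
}

# Hypothetical synonym table resolving terminological heterogeneity.
LABEL_TO_CONCEPT = {
    "qtof": "QTOF_mass_spectrometer",
    "q-tof": "QTOF_mass_spectrometer",
    "quadrupole time of flight": "QTOF_mass_spectrometer",
}


def normalize(label):
    """Map a free-text instrument label to an ontology concept, or None."""
    return LABEL_TO_CONCEPT.get(label.strip().lower())


def ancestors(concept):
    """Walk the subclass chain upward and collect every superclass."""
    found = []
    while concept in SUBCLASS_OF:
        concept = SUBCLASS_OF[concept]
        found.append(concept)
    return found


def is_a(concept, target):
    """True if concept equals target or is a (transitive) subclass of it."""
    return concept == target or target in ancestors(concept)


print(normalize("Q-TOF"), is_a(normalize("QTOF"), "instrument"))
```

The second function is the "reasoning" step in miniature: nothing states directly that a QTOF is an instrument, but a software agent can derive it from the hierarchy.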
Semantic Annotation of Experimental Data
Mass Spectrometry Data: ms/ms peaklist data
parent ion m/z    parent ion abundance    parent ion charge
830.9570          194.9604                2
fragment ion m/z    fragment ion abundance
580.2985            0.3592
688.3214            0.2526
779.4759            38.4939
784.3607            21.7736
1543.7476           1.3822
1544.7595           2.9977
1562.8113           37.4790
1660.7776           476.5043
Semantic Annotation of Experimental Data
<ms-ms_peak_list>
  <parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer"
             mode="ms-ms"/>
  <parent_ion m-z="830.9570" abundance="194.9604" z="2"/>
  <fragment_ion m-z="580.2985" abundance="0.3592"/>
  <fragment_ion m-z="688.3214" abundance="0.2526"/>
  <fragment_ion m-z="779.4759" abundance="38.4939"/>
  <fragment_ion m-z="784.3607" abundance="21.7736"/>
  <fragment_ion m-z="1543.7476" abundance="1.3822"/>
  <fragment_ion m-z="1544.7595" abundance="2.9977"/>
  <fragment_ion m-z="1562.8113" abundance="37.4790"/>
  <fragment_ion m-z="1660.7776" abundance="476.5043"/>
</ms-ms_peak_list>
[Figure callout: Ontological Concepts]
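Once the peak list is annotated in this form, it can be read by generic tooling. The sketch below parses the annotated peak list with Python's standard `xml.etree.ElementTree`; the XML is copied from the slide with straight quotes, and the variable names are chosen for illustration.

```python
import xml.etree.ElementTree as ET

PEAK_LIST_XML = """
<ms-ms_peak_list>
  <parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer"
             mode="ms-ms"/>
  <parent_ion m-z="830.9570" abundance="194.9604" z="2"/>
  <fragment_ion m-z="580.2985" abundance="0.3592"/>
  <fragment_ion m-z="688.3214" abundance="0.2526"/>
  <fragment_ion m-z="779.4759" abundance="38.4939"/>
  <fragment_ion m-z="784.3607" abundance="21.7736"/>
  <fragment_ion m-z="1543.7476" abundance="1.3822"/>
  <fragment_ion m-z="1544.7595" abundance="2.9977"/>
  <fragment_ion m-z="1562.8113" abundance="37.4790"/>
  <fragment_ion m-z="1660.7776" abundance="476.5043"/>
</ms-ms_peak_list>
"""

root = ET.fromstring(PEAK_LIST_XML.strip())

# Instrument and mode come from the <parameter> element.
parameter = root.find("parameter")
instrument = parameter.get("instrument")
mode = parameter.get("mode")

# The parent ion carries m/z, abundance and charge (z).
parent_el = root.find("parent_ion")
parent = {
    "mz": float(parent_el.get("m-z")),
    "abundance": float(parent_el.get("abundance")),
    "z": int(parent_el.get("z")),
}

# Each fragment ion is an (m/z, abundance) pair.
fragments = [
    (float(el.get("m-z")), float(el.get("abundance")))
    for el in root.findall("fragment_ion")
]

# The most abundant fragment ion in the list.
base_peak = max(fragments, key=lambda peak: peak[1])
print(mode, parent, base_peak)
```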
Proteomics Analysis Protocol
[Figure: workflow diagram; recoverable steps listed in reading order]
1. Cell Culture → (extract) → Glycoprotein Fraction
2. Glycoprotein Fraction → (proteolysis) → Glycopeptides Fraction
3. Separation technique I: 1 → n Glycopeptides Fractions
4. PNGase → Peptide Fraction
5. Separation technique II → n*m Peptide Fractions
6. Mass spectrometry → ms data and ms/ms data
7. ms data → (Data reduction) → ms peaklist → (binning) → N-dimensional array → Signal integration
8. ms/ms data → (Data reduction) → ms/ms peaklist → (Peptide identification) → Peptide list; Parent protein and peptide list
9. Data correlation
SPADE: Provenance Infrastructure
Semantic Web Process to incorporate provenance
[Figure: pipeline of agents with inputs (I) and outputs (O) and automatic semantic annotation: Biological Sample → Analysis by MS/MS → Raw Data → Raw Data to Standard Format agent → Standard Format Data → Data Preprocess agent → Filtered Data → DB Search agent (Mascot/Sequest) → Search Results → Results Postprocess agent (ProValt) → Final Output → Storage; Biological Information]
ProPreO Provenance Ontology
• An ontology to capture the experimental process, data and agents in
proteomics
• ProPreO ontology features:
o modeled in OWL-DL ontology language
o 490 classes, 30 named relationships
o 3.1 million instances of tryptic peptides
• Published through the National Center for Biomedical Ontology
(NCBO) and Open Biomedical Ontologies (OBO)
• For this project, we will need to model RT-PCR and microarray
protocols (where possible reusing existing/past efforts such as
MGED)
Resource page:
http://knoesis.wright.edu/research/semsci/application_domain/sem_life_sci/glycomics/resources/ontologies/propreo/
ProPreO ontology - provenance pathway
[Figure: provenance pathway from initial data to final data, passing through computational tasks and their parameters]
Querying for Hypothesis Validation
manual exploration
• (GeneID: 9215) has_associated_disease: Congenital muscular dystrophy, type 1D
• (GeneID: 9215) has_molecular_function: Acetylglucosaminyltransferase activity
Adapted from: Olivier Bodenreider, presentation at HCLS Workshop, WWW07
Querying for Hypothesis Validation
with semantically enhanced data
SELECT DISTINCT ?t ?g ?d {
  ?t is_a GO:0016757 .
  ?g has_molecular_function ?t .
  ?g has_associated_phenotype ?b2 .
  ?b2 has_textual_description ?d .
  FILTER regex(?d, "muscular dystrophy", "i") .
  FILTER regex(?d, "congenital", "i")
}
[Figure: GO:0016757 (glycosyltransferase) with isa descendants GO:0008194, GO:0016758, and GO:0008375 (acetylglucosaminyltransferase); LARGE (EG:9215) has_molecular_function GO:0008375 and has_associated_phenotype MIM:608840 (Muscular dystrophy, congenital, type 1D)]
From medinfo paper; slide adapted from a presentation by Olivier Bodenreider at HCLS Workshop, WWW07
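To make the query concrete, here is a small sketch that evaluates the same pattern over a handful of in-memory triples. The triples are toy data read off the slide figure, and the exact is_a links between the GO terms are an assumption; the helper names are invented. The is_a step is evaluated transitively, which is the inference that lets the broad term GO:0016757 reach the specific function annotated on the gene.

```python
import re

# Toy triples read off the slide figure; the exact is_a links are an assumption.
TRIPLES = {
    ("GO:0008194", "is_a", "GO:0016757"),
    ("GO:0016758", "is_a", "GO:0016757"),
    ("GO:0008375", "is_a", "GO:0008194"),
    ("EG:9215", "has_molecular_function", "GO:0008375"),
    ("EG:9215", "has_associated_phenotype", "MIM:608840"),
    ("MIM:608840", "has_textual_description",
     "Muscular dystrophy, congenital, type 1D"),
}


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is a triple."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is a triple."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def descendants(term):
    """Every term reachable from `term` by following is_a links downward."""
    found, frontier = set(), [term]
    while frontier:
        for child in subjects("is_a", frontier.pop()):
            if child not in found:
                found.add(child)
                frontier.append(child)
    return found


def glycosyltransferase_dystrophy_query():
    """Rows (?t, ?g, ?d) matching the graph pattern and both regex filters."""
    rows = set()
    for t in descendants("GO:0016757"):
        for g in subjects("has_molecular_function", t):
            for b2 in objects(g, "has_associated_phenotype"):
                for d in objects(b2, "has_textual_description"):
                    if (re.search("muscular dystrophy", d, re.I)
                            and re.search("congenital", d, re.I)):
                        rows.add((t, g, d))
    return rows


print(glycosyltransferase_dystrophy_query())
```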
Knowledge driven query formulation
Complex queries can also include:
- on-the-fly Web services execution to retrieve additional data
- inference rules to make implicit knowledge explicit
T. cruzi PSE Query Interface – Semantic
Annotation of Experimental Data
Mendes et al.: TcruziKB: Enabling Complex Queries for Genomic Data Exploration (2008)
Information Extraction (IE)
• Can we use ontologies to guide/help IE?
– lexicon: Fish Oils, Raynaud’s Disease, etc.
– types/labels: Fish Oils instance of Lipid
– relationships between types: Lipid affects Disease
• Simple identification of ontology terms in text is not
enough
– Compound Entities
– Complex Relationships
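A minimal sketch of ontology-guided extraction using the three ingredients listed above (lexicon, types, relationships between types). The sentence, the dictionaries, and the function names are assumptions made for illustration. It relies on plain substring matching, which is exactly the "simple identification" the slide warns is not enough: it cannot handle modified or composite entities.

```python
# Lexicon terms mapped to their ontology types (example terms from the slide).
LEXICON = {
    "fish oils": "Lipid",
    "raynaud's disease": "Disease",
}

# Relationships the ontology allows between types.
SCHEMA = {
    ("Lipid", "Disease"): "affects",
}


def find_entities(sentence):
    """Return (term, type) pairs for lexicon terms found in the sentence, in text order."""
    lowered = sentence.lower()
    hits = [
        (lowered.find(term), term, entity_type)
        for term, entity_type in LEXICON.items()
        if term in lowered
    ]
    return [(term, entity_type) for _, term, entity_type in sorted(hits)]


def candidate_relations(sentence):
    """Propose a relationship for each entity pair whose types the schema connects."""
    entities = find_entities(sentence)
    candidates = []
    for i, (left, left_type) in enumerate(entities):
        for right, right_type in entities[i + 1:]:
            relation = SCHEMA.get((left_type, right_type))
            if relation:
                candidates.append((left, relation, right))
    return candidates


sentence = "Dietary fish oils may benefit patients with Raynaud's disease."
print(candidate_relations(sentence))
```

The output is only a candidate: the schema says a Lipid may affect a Disease, but confirming that this sentence asserts it requires the parse-based methods described next.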
Challenge: Compound Entities
• Entities (MeSH terms) in sentences occur in modified forms
• “adenomatous” modifies “hyperplasia”
• “An excessive endogenous or exogenous stimulation”
modifies “estrogen”
• Entities can also occur as composites of 2 or more other entities
• “adenomatous hyperplasia” and “endometrium” occur as
“adenomatous hyperplasia of the endometrium”
Challenge: Compound Entities
• Entities are not always single tokens
– Complex entities
• Structurally complex
– Nested
– Overlapping
– Discontinuous
• “Semantically” complex
– Ontology term + Modifiers
– Compositions of ontology terms
• Large number of possible combinations of terms:
– low probability an ontology would contain all of them
Method 1 – Constituency Tree-based (Identify entities and
relationships in Parse Tree)
[Figure: constituency parse tree (TOP, S, NP, VP, PP, ADJP) of the sentence “the excessive endogenous or exogenous stimulation by estrogen induces adenomatous hyperplasia of the endometrium”, with modifiers, modified entities, and composite entities highlighted]
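Representation – Resulting RDF
[Figure: RDF graph distinguishing modifiers, modified entities, and composite entities]
• modified_entity1 hasModifier “An excessive endogenous or exogenous stimulation”
• modified_entity1 hasPart estrogen
• modified_entity2 hasModifier adenomatous
• modified_entity2 hasPart hyperplasia
• composite_entity1 hasPart modified_entity2
• composite_entity1 hasPart endometrium
• modified_entity1 induces composite_entity1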
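One way to hold this representation in code is a plain list of triples. The sketch below is one reading of the figure, not the authors' data model; the function names are invented. It separates the structural predicates (hasPart, hasModifier) from the named relationship, and resolves a compound entity down to the ontology terms it is built from.

```python
# One reading of the slide's RDF figure, written as (subject, predicate, object).
TRIPLES = [
    ("modified_entity1", "hasModifier",
     "An excessive endogenous or exogenous stimulation"),
    ("modified_entity1", "hasPart", "estrogen"),
    ("modified_entity2", "hasModifier", "adenomatous"),
    ("modified_entity2", "hasPart", "hyperplasia"),
    ("composite_entity1", "hasPart", "modified_entity2"),
    ("composite_entity1", "hasPart", "endometrium"),
    ("modified_entity1", "induces", "composite_entity1"),
]

STRUCTURAL = {"hasPart", "hasModifier"}


def leaf_parts(node):
    """Follow hasPart links down to the terms that have no parts of their own."""
    parts = [o for s, p, o in TRIPLES if s == node and p == "hasPart"]
    if not parts:
        return [node]
    leaves = []
    for part in parts:
        leaves.extend(leaf_parts(part))
    return leaves


def named_relationships():
    """Triples whose predicate is a domain relationship rather than structure."""
    return [t for t in TRIPLES if t[1] not in STRUCTURAL]


print(leaf_parts("composite_entity1"))
print(named_relationships())
```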
Preliminary Results
• Swanson’s discoveries – Associations between Migraine and
Magnesium [Hearst99]
– stress is associated with migraines
– stress can lead to loss of magnesium
– calcium channel blockers prevent some migraines
– magnesium is a natural calcium channel blocker
– spreading cortical depression (SCD) is implicated in some migraines
– high levels of magnesium inhibit SCD
– migraine patients have high platelet aggregability
– magnesium can suppress platelet aggregability
• Data sets generated using these entities (marked red on the original slide)
as boolean keyword queries against PubMed
• Bidirectional breadth-first search used to find paths in
resulting RDF
Paths between Migraine and Magnesium
Paths are considered interesting if they contain one or more named relationships
other than hasPart or hasModifier
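The search and the filter can be sketched as follows. This is an illustrative implementation under stated assumptions: the edges are hand-written from the statements on the previous slide (plus two structural edges added for contrast), the predicate names are invented, and the graph is treated as undirected. It is not the extracted RDF or the authors' code.

```python
from collections import deque

# Hand-written illustration edges, not extraction output.
EDGES = [
    ("stress", "associated_with", "migraine"),
    ("stress", "leads_to_loss_of", "magnesium"),
    ("magnesium", "inhibits", "spreading cortical depression"),
    ("spreading cortical depression", "implicated_in", "migraine"),
    ("modified_entity1", "hasPart", "magnesium"),
    ("modified_entity1", "hasModifier", "high levels"),
]
STRUCTURAL = {"hasPart", "hasModifier"}


def build_adjacency(edges):
    """Undirected adjacency list: node -> list of neighbours."""
    adj = {}
    for s, _p, o in edges:
        adj.setdefault(s, []).append(o)
        adj.setdefault(o, []).append(s)
    return adj


def _expand(adj, frontier, parents, other_parents):
    """Expand one BFS level; return the node where the two searches meet, if any."""
    for _ in range(len(frontier)):
        node = frontier.popleft()
        for neighbour in adj[node]:
            if neighbour not in parents:
                parents[neighbour] = node
                if neighbour in other_parents:
                    return neighbour
                frontier.append(neighbour)
    return None


def bidirectional_bfs(adj, start, goal):
    """Return a path from start to goal as a list of nodes, or None."""
    if start not in adj or goal not in adj:
        return None
    if start == goal:
        return [start]
    fwd, bwd = {start: None}, {goal: None}
    fwd_q, bwd_q = deque([start]), deque([goal])
    while fwd_q and bwd_q:
        # Always grow the smaller frontier.
        if len(fwd_q) <= len(bwd_q):
            meet = _expand(adj, fwd_q, fwd, bwd)
        else:
            meet = _expand(adj, bwd_q, bwd, fwd)
        if meet is not None:
            path, node = [], meet
            while node is not None:        # walk back to start
                path.append(node)
                node = fwd[node]
            path.reverse()
            node = bwd[meet]
            while node is not None:        # walk forward to goal
                path.append(node)
                node = bwd[node]
            return path
    return None


def is_interesting(path, edges):
    """True if some edge on the path is a named (non-structural) relationship."""
    for a, b in zip(path, path[1:]):
        predicate = next(p for s, p, o in edges if {s, o} == {a, b})
        if predicate not in STRUCTURAL:
            return True
    return False


adj = build_adjacency(EDGES)
path = bidirectional_bfs(adj, "migraine", "magnesium")
print(path, is_interesting(path, EDGES))
```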
Method 2 – Dependency tree-based
• Use a dependency parse to segment sentences
into Subject → Predicate → Object
• Subjects and Objects represent compound
entities
• Use corpus statistics to predict constituents of
compound entities
Algorithm
[Figure: dependency parse with the relationship head, the subject head, and the object heads marked]
Preliminary results
Extracted Triples
Predicting constituents
• Greedy mutual information based word grouping used to
predict constituents
– Given a sequence of tokens as input
– Compute dependency based mutual information for all token
pairs
– Starting at the head, attach tokens that increase average mutual
information of the token group so far
• Variants of this algorithm
– Compute the average dependency-based mutual information
across the corpus – use that as the threshold
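The greedy grouping can be sketched as below. Several details are assumptions, since the slide does not give them: the mutual-information scores are made-up numbers standing in for the dependency-based values computed from a corpus, tokens are assumed to be unique within the input, the average is taken over all token pairs in the group, and the `threshold` parameter plays the role of the corpus-wide average in the variant.

```python
from itertools import combinations

# Made-up pairwise scores standing in for dependency-based mutual information.
MI = {
    frozenset({"adenomatous", "hyperplasia"}): 3.0,
    frozenset({"hyperplasia", "endometrium"}): 2.0,
    frozenset({"adenomatous", "endometrium"}): 1.0,
    frozenset({"hyperplasia", "of"}): 0.1,
    frozenset({"hyperplasia", "the"}): 0.1,
}


def average_mi(group, mi):
    """Mean score over all token pairs in the group (missing pairs count as 0)."""
    pairs = list(combinations(group, 2))
    return sum(mi.get(frozenset(pair), 0.0) for pair in pairs) / len(pairs)


def group_constituents(tokens, head, mi, threshold=0.0):
    """Greedily grow a token group from the head while the average MI increases."""
    group = [head]
    remaining = [t for t in tokens if t != head]
    current = threshold  # baseline the first attachment has to beat
    while remaining:
        best_score, best_token = max(
            (average_mi(group + [t], mi), t) for t in remaining
        )
        if best_score <= current:
            break  # no remaining token raises the average
        group.append(best_token)
        remaining.remove(best_token)
        current = best_score
    # Report the group in original sentence order.
    return [t for t in tokens if t in group]


tokens = ["adenomatous", "hyperplasia", "of", "the", "endometrium"]
print(group_constituents(tokens, "hyperplasia", MI))
```

With these scores the head attaches "adenomatous" and then stops, because adding "endometrium" would lower the group average from 3.0 to 2.0. Raising `threshold` above 3.0 leaves the head on its own.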
Knowledge Discovery over text
[Figure: pipeline from text to knowledge discovery]
• Text → Extraction of Semantics from text (assigning interpretation to text) → Semantic metadata in the form of semi-structured data
• Semantic Metadata Guided Knowledge Exploration: triple-based semantic search, semantic browser
• Semantic Metadata Guided Knowledge Discovery: subgraph discovery
How do mashups help biologists?
• Integrate services and create useful applications
– Many biological data are exposed through services
– Need different ‘mix’ of data depending on the situation
– Need to have quick and easy creation methods
• Biologists may not be computer gurus!
Example biomashup (Domain – Metabolomics)
[Figure: chain of external services, each with a service description and inputs / outputs]
1. Metabolomics Experiment → Elevated Compounds (e.g., Pyruvate, Lactose, Malate, Creatinine)
2. KEGG: Find pathways that contain the compounds → Proteins contained in Pathways
3. NCBI: Find genes that code for each protein → Gene names or accession numbers
4. GEO: Identify microarray experiments with significant expression values for these genes → Microarray experiments (experimental data that support the findings)
KEGG - Kyoto Encyclopedia of Genes and Genomes
NCBI - National Center for Biotechnology Information
GEO – Gene Expression Omnibus
Courtesy – Prof. Michael L. Raymer – Kno.e.sis Center, WSU
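The chaining in this mashup can be sketched as a sequence of lookups, each feeding the next. Everything below is a stand-in: the dictionaries replace calls to KEGG, NCBI and GEO, and every pathway, protein, gene and experiment identifier is a made-up placeholder. No real service API is being modeled.

```python
# Stub tables standing in for KEGG, NCBI and GEO; all identifiers are placeholders.
PATHWAYS_BY_COMPOUND = {          # "KEGG": compound -> pathways
    "Pyruvate": ["pathway_A"],
    "Malate": ["pathway_A", "pathway_B"],
}
PROTEINS_BY_PATHWAY = {           # "KEGG": pathway -> proteins
    "pathway_A": ["protein_1"],
    "pathway_B": ["protein_2"],
}
GENES_BY_PROTEIN = {              # "NCBI": protein -> genes
    "protein_1": ["gene_x"],
    "protein_2": ["gene_y"],
}
EXPERIMENTS_BY_GENE = {           # "GEO": gene -> microarray experiments
    "gene_x": ["experiment_001"],
    "gene_y": [],
}


def lookup_all(keys, table):
    """Collect table entries for every key, de-duplicated, in first-seen order."""
    results = []
    for key in keys:
        for value in table.get(key, []):
            if value not in results:
                results.append(value)
    return results


def biomashup(compounds):
    """Chain the four steps: compounds -> pathways -> proteins -> genes -> experiments."""
    pathways = lookup_all(compounds, PATHWAYS_BY_COMPOUND)
    proteins = lookup_all(pathways, PROTEINS_BY_PATHWAY)
    genes = lookup_all(proteins, GENES_BY_PROTEIN)
    experiments = lookup_all(genes, EXPERIMENTS_BY_GENE)
    return {
        "pathways": pathways,
        "proteins": proteins,
        "genes": genes,
        "experiments": experiments,
    }


print(biomashup(["Pyruvate", "Malate"]))
```

The point of the mashup is visible even in the stub: the output of one service has to be reshaped into the input of the next, which is the mediation problem the following slides address.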
What are SMashups
• Smart Mashups
– Key Point: NOT INTELLIGENT. Just street smart
• Meta-programming
– Declarative approach to mashup creation based on a domain-specific
language [1]
– Importance of enabling easier data mediation
[1] Maximilien, Ranabahu and Gomadam, An Online Platform for Web APIs and Service Mashups.
Key Components
• What APIs and Services
– Find APIs and Services using APIHut search engine
– Collect and bin APIs in the search interface
– Export them to the smashmaker application
• How do I mediate between services
– Describe each API that you are using
• Reuse descriptions (search supported via
Smashmaker search)
– Integrate using “smashlets”
APIHut: Faceted API Search
• Faceted Search for finding APIs
– Taxonomy of facets
APIHut: Faceted API Search (Cont.)
• Serviut (Service Utilization Rank)
– Ranks APIs based on utilization, web popularity (uses Alexa)
and user ratings
• Find, drag and drop APIs to create APISets
• Demo: http://apihut.info
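The slide names three signals behind Serviut (utilization, web popularity, user ratings) but not how they are combined. The sketch below assumes a simple weighted sum over signals already normalized to [0, 1]; the weights, the class name, and the sample values are all hypothetical and are not the actual Serviut formula.

```python
from dataclasses import dataclass


@dataclass
class ApiStats:
    name: str
    utilization: float     # assumed normalized to [0, 1]
    web_popularity: float  # assumed normalized to [0, 1]
    user_rating: float     # assumed normalized to [0, 1]


# Hypothetical weights; the real ranking function is not given on the slide.
WEIGHTS = (0.5, 0.3, 0.2)


def score(api):
    """Weighted sum of the three signals."""
    w_use, w_pop, w_rate = WEIGHTS
    return (w_use * api.utilization
            + w_pop * api.web_popularity
            + w_rate * api.user_rating)


def rank(apis):
    """APIs sorted from highest to lowest score."""
    return sorted(apis, key=score, reverse=True)


apis = [
    ApiStats("api_c", 0.1, 0.1, 0.1),
    ApiStats("api_b", 0.2, 0.9, 0.9),
    ApiStats("api_a", 0.9, 0.2, 0.5),
]
print([api.name for api in rank(apis)])
```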
Next Steps for APIHut
• Incorporate SA-REST and hRESTs
• SA-REST
– Microformat based approach for adding meta-data to APIs
– Supports various facets
– Part of W3C XG on Semantic Web Services
• hRESTs
– Microformat for describing operations, inputs and outputs in an
API document
Smashlets: Smashup Enablers
• Smashlets
– Small application bits that users create
• For example, a smashlet would facilitate mediation
between a mapping API and a photo management
API
– Created using idiosyncratic user interfaces
• Users can create smashlets and share them
– Other users can find them, tag them
Unpublished work. Please consult before sharing.
SMashlet principle
[Figure: Service 1 Data Model → O-SMashlet → SMashmaker Application Data Model (ADM) → i-SMashlet → Service 2 Data Model → O-SMashlet → SMashmaker Application Data Model (ADM)]
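Read this way, an o-smashlet lifts a service's output into the shared application data model, and an i-smashlet lowers the application data model into the next service's input. The sketch below illustrates that reading with the photo and mapping APIs mentioned earlier; every field name and function name is hypothetical, and no real API is being described.

```python
def photo_o_smashlet(photo):
    """O-smashlet: photo-service output -> application data model (ADM)."""
    return {
        "label": photo["title"],
        "latitude": float(photo["geo"]["lat"]),
        "longitude": float(photo["geo"]["lon"]),
    }


def map_i_smashlet(adm):
    """I-smashlet: ADM -> input expected by the mapping service."""
    return {
        "marker_text": adm["label"],
        "position": [adm["latitude"], adm["longitude"]],
    }


def mediate(service_output, o_smashlet, i_smashlet):
    """Pass one service's output through the ADM to form the next service's input."""
    return i_smashlet(o_smashlet(service_output))


# Hypothetical photo-service response.
photo = {"title": "Field site", "geo": {"lat": "33.95", "lon": "-83.37"}}
print(mediate(photo, photo_o_smashlet, map_i_smashlet))
```

Because each smashlet only knows one service and the ADM, a smashlet written for one mashup can be reused in another that involves the same service.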
How does one create a SMashup
• Smashmaker
– Similar to Yahoo! Pipes, Google Mashup Editor, Popfly and
others
– Supports meta programming and greater component reuse
• Not just entire mashup reuse but parts of it
Creating a SMashup
• Import APIs from APIHut search
• Specify API types (Mapping API, Photo Sharing
API) to the smashmaker
– Can now find o-smashlets for the APIs imported
– Load i-smashlets
– Alert user if they are not there
• User can create a new one
• Arrange services
– Drag and drop APIs, select operations and smashlet
objects
• Export mashup
Involving the people
• APIHut developer network
– Social network allowing users to:
• Create additional documentation
• Upload code examples
• Share presentations via slideshare
• Discuss models and techniques
Current State of Research
• APIHut operational as a test-only alpha
– Email [email protected] or [email protected] for access.
• Currently investigating Smashmaker
• APIHut developer network
– Investigating use of Ning as opposed to creating a new platform
– Supercoolschool for a Facebook app
Preliminary Work to build upon
• Ontologies
o ProPreO ontology
o EnzyO Ontology
• Standards
o SAWSDL, <SA-REST, hRESTs>
• APIs/Tools/Systems
o Cubee, Semantic Browser, SAWSDL4J & Woden APIs,
SemBOWSER
o APIhut, SmashMaker
• Past projects:
o The Integrated Technology Resource for Biomedical Glycomics
Discussion
Project Web Site:
Semantics and Services enabled Problem Solving
Environment for Tcruzi