NCBO_Seminar_EFO_Atlas

Ontologically Modeling Sample Variables in
Gene Expression Data
James Malone
[email protected]
EBI, Cambridge, UK
Overview
• Application Background
• Motivation for ontologies – questions we want to answer
• Methodology
• Ontology and application
• Future work/things we’d like to do
Gene Expression: Archive to Atlas
[Figure: data flow from the ArrayExpress archive to the Atlas. Labels: AE/GEO acquire; Curation; >250,000 assays; >10,000 experiments; Re-annotate & summarize; Atlas]
Gene Expression Sample Variable Annotations
Annotations                   Archive     Atlas
Species                           330         9
Samples                       238,000    34,650
Annotations on samples        860,700   101,830
Unique sample annotations      37,500     6,600
Assays (Hybridizations)       246,000    30,000
Annotations on assays         569,700    67,000
Unique assay annotations       25,000     4,000
Use Cases
• Query support (e.g., query for ‘cancer’ and also get ‘leukemia’)
• Data visualisation – e.g., presenting an ontology tree to the user of
what is in the database
• Data integration by ontology terms – e.g., we assume that 'kidney' in
independent studies roughly means the same, so we can count how
many kidney samples we have in the database
• Intelligent template generation for different experiment types in
submission or data presentation
• Summary level data
• Nonsense detection – e.g., telling us that something marked as
cancer cannot also be marked as healthy
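The nonsense-detection use case can be sketched as a check for annotation terms that must not co-occur on one sample. This is a minimal illustration, not the Atlas implementation: the `DISJOINT_PAIRS` set, the sample names and the function name are all hypothetical.

```python
# Hypothetical pairs of annotation terms that must not appear on the same sample.
DISJOINT_PAIRS = {frozenset({"cancer", "healthy"})}


def find_contradictions(annotations, disjoint_pairs=DISJOINT_PAIRS):
    """Return every disjoint pair whose two terms are both in the annotations."""
    present = set(annotations)
    return [tuple(sorted(pair)) for pair in disjoint_pairs if pair <= present]


# Made-up sample annotations for illustration only.
samples = {
    "sample_1": ["kidney", "healthy"],
    "sample_2": ["kidney", "cancer", "healthy"],
}
flagged = {}
for name, terms in samples.items():
    hits = find_contradictions(terms)
    if hits:
        flagged[name] = hits
print(flagged)
```

A fuller check would first expand each annotation to its ancestors in the ontology, so that a sample marked ‘leukemia’ and ‘healthy’ is also caught.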
Questions we want to answer
• Diverse nature of annotations on data
• Need to support complex queries which contain semantic
information
• E.g. which genes are under-expressed in brain cancer samples in
human or mouse
• If we annotate with adenocarcinoma do we get this data?
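The adenocarcinoma question comes down to query expansion over the class hierarchy: a query for a term should also return data annotated with its subclasses. A minimal sketch follows, assuming a hypothetical child-to-parents map (`PARENTS`) and made-up sample annotations; it does not reflect how the Atlas stores its hierarchy.

```python
from collections import defaultdict, deque

# Hypothetical fragment of a disease hierarchy: child term -> parent terms.
PARENTS = {
    "leukemia": ["cancer"],
    "carcinoma": ["cancer"],
    "adenocarcinoma": ["carcinoma"],
}


def expand(term, parents):
    """Return the term together with all of its direct and indirect subclasses."""
    children = defaultdict(set)
    for child, supers in parents.items():
        for sup in supers:
            children[sup].add(child)
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in children[current] - seen:
            seen.add(child)
            queue.append(child)
    return seen


# Made-up annotations: sample name -> annotated term.
annotations = {"s1": "adenocarcinoma", "s2": "leukemia", "s3": "healthy"}
wanted = expand("cancer", PARENTS)
matching = sorted(name for name, term in annotations.items() if term in wanted)
print(matching)
```

Under these assumptions a sample annotated with ‘adenocarcinoma’ is returned for a ‘cancer’ query, which is the behaviour the question above asks about.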
Primary Question: Where to place our semantics?
[Figure: Atlas/AE data annotated with ‘cancer’ and ‘adenocarcinoma’]
Decoupling knowledge from data
[Figure: Atlas/AE]
Methodology: Reference vs Application Ontology
• There is debate in the community about the difference; here is our thesis
• A reference ontology describes a knowledge space; an
explicitly delineated part of a domain.
[Figure: the Biomedicine domain with regions labelled Human, Anatomy, Cell type, GO Process]
Methodology: Reference vs Application Ontology
• An application ontology describes an application or data space; an
explicitly delineated part of a domain.
• Should consume reference ontologies to meet application needs
[Figure: the Biomedicine domain with regions labelled Human, Anatomy, Cell type, GO Process]
Building the Experimental Factor Ontology
• We consume parts of reference ontologies from the domain
• Construct new classes and relations to answer our use cases
• The aim is reuse of existing resources, shared frameworks and mapping of
equivalencies where they exist
[Figure: EFO at the centre, drawing on the Ontology for Biomedical Investigations, the Relation Ontology, the Disease Ontology, Chemical Entities of Biological Interest (ChEBI), the Anatomy Reference Ontology and various species anatomy ontologies]
Identify Upper Level Structure
• We have taken a BFO-lite approach, hiding labels from users for
application purposes and sometimes using a different definition
information content entity (IAO)
site (BFO)
processual entity (BFO)
material entity (BFO)
specifically dependent continuant (BFO)
Specifically dependent continuant: A continuant [snap:Continuant] that
inheres in or is borne by other entities. Every instance of A requires some
specific instance of B which must always be the same.
Material property: A property or characteristic of some other entity. For
example, the mouse has the colour white.
Adding New Classes
@ www.ebi.ac.uk/efo/tools
• We wish to maximise our interoperability
• Submitters and other groups use many ontologies
• Trade-off: open to their data and preferences vs imposing a more
ordered view on semantics
• Our goal:
Where orthogonality exists we aim to import only that class. Where it
does not, we perform ‘mappings’ in our EFO classes via annotation
property references (in a similar way to xrefs)
• E.g. for ChEBI classes, import the ChEBI URI;
for ‘cancer’, create an EFO class and add multiple mappings
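The import-or-map rule above can be sketched as a small decision function. This is an illustration under stated assumptions: the `OntologyClass` structure, the function name and every `example.org` URI are hypothetical placeholders, not real EFO, ChEBI, NCI or DO identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyClass:
    uri: str
    label: str
    # xref-like annotation values pointing at equivalent external classes
    mappings: list = field(default_factory=list)


def add_class(label, source_uris, new_efo_uri):
    """Reuse the external URI if one source covers the concept, else map."""
    if len(source_uris) == 1:
        # Orthogonal case: import the external class under its own URI.
        return OntologyClass(uri=source_uris[0], label=label)
    # Overlapping case: create an EFO class and record each source as a mapping.
    return OntologyClass(uri=new_efo_uri, label=label, mappings=sorted(source_uris))


# Placeholder URIs for illustration only.
chemical = add_class(
    "water",
    ["http://example.org/chebi/CHEBI_placeholder"],
    "http://example.org/efo/EFO_placeholder_1",
)
cancer = add_class(
    "cancer",
    ["http://example.org/nci/cancer", "http://example.org/do/cancer"],
    "http://example.org/efo/EFO_placeholder_2",
)
print(chemical.uri, cancer.uri, cancer.mappings)
```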
Creating Class Mappings
• For overlapping ontologies, we aim to create a ‘mapping
class’
• Use semi-automated text mining with the “double-metaphone”
algorithm
• Perform matching of our values in database to ontology
class labels and definitions.
• Also perform mappings from EFO to other ontologies, so
that EFO: cancer = NCI: cancer, DO: cancer et al.
• Sanity checking over mappings before adding to ontology
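The matching step can be illustrated with a phonetic key: database values and class labels that sound alike receive the same key and become candidate mappings. Note the substitution: the sketch below uses a simplified Soundex as a stand-in, not Double Metaphone, which is considerably longer to implement. The labels, values and function names are hypothetical.

```python
# Simplified Soundex, used here only as a stand-in for Double Metaphone.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return a four-character phonetic code for a single word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code
    return (key + "000")[:4]


def phonetic_key(text):
    """Key for a multi-word value: one code per word."""
    return tuple(soundex(word) for word in text.split())


def candidate_mappings(db_values, class_labels):
    """Map each database value to the class labels sharing its phonetic key."""
    index = {}
    for label in class_labels:
        index.setdefault(phonetic_key(label), []).append(label)
    return {value: index.get(phonetic_key(value), []) for value in db_values}


labels = ["leukemia", "kidney", "liver"]     # made-up ontology class labels
values = ["leukaemia", "kidny", "heart"]     # made-up database values
print(candidate_mappings(values, labels))
```

Phonetic keys over-match as well as under-match, which is why the candidates are sanity checked before anything is added to the ontology.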
Keeping Up To Date with External Classes
• Use of a tool to automatically update metadata every
release (monthly)
• Uses BioPortal web services to access the latest
class URI/ID, definition and synonyms
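The monthly refresh can be sketched as a comparison between locally stored metadata and the latest external record. The sketch deliberately does not call BioPortal: `fetch_latest` is a hypothetical callable standing in for whatever web-service client is used, and the records and URIs are made up.

```python
def refresh_metadata(local, fetch_latest):
    """Overwrite stale local records in place and return the URIs that changed."""
    changed = []
    for uri, record in local.items():
        latest = fetch_latest(uri)
        if latest is not None and latest != record:
            local[uri] = dict(latest)
            changed.append(uri)
    return changed


# Made-up local copy of imported class metadata.
local = {
    "http://example.org/ext/class_1": {"definition": "old text", "synonyms": ["a"]},
    "http://example.org/ext/class_2": {"definition": "same", "synonyms": []},
}
# Stand-in for the remote service: a dictionary lookup.
remote = {
    "http://example.org/ext/class_1": {"definition": "new text", "synonyms": ["a", "b"]},
    "http://example.org/ext/class_2": {"definition": "same", "synonyms": []},
}
print(refresh_metadata(local, remote.get))
```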
Detecting Change in External Ontologies
• Bubastis tool for detecting axiomatic changes between
two ontologies (in our case 2 versions of same ontology)
• @todo: detect annotation property changes
• We also detect missing annotation properties with
Watchman tool (not released yet) – mainly used for labels
presently
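The idea of an axiomatic diff between two versions can be sketched with plain sets. This is a simplified illustration and not how Bubastis is implemented: each version is assumed to be a mapping from class name to a set of axiom strings, and the classes and axioms are made up.

```python
def diff_ontologies(old, new):
    """Report, per class, the axioms added and removed between two versions."""
    report = {}
    for cls in sorted(old.keys() | new.keys()):
        before = old.get(cls, set())
        after = new.get(cls, set())
        added, removed = after - before, before - after
        if added or removed:
            report[cls] = {"added": added, "removed": removed}
    return report


# Made-up axioms for two versions of the same ontology.
version_1 = {
    "leukemia": {"SubClassOf cancer"},
    "kidney": {"SubClassOf organ"},
}
version_2 = {
    "leukemia": {"SubClassOf hematologic cancer"},
    "kidney": {"SubClassOf organ"},
    "liver": {"SubClassOf organ"},
}
print(diff_ontologies(version_1, version_2))
```

Classes whose axioms are unchanged do not appear in the report, so an empty result means the two versions are axiomatically identical under this representation.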
Creating Relations and Equivalent Classes
[Figure: HeLa example relating species (human), cell line (HeLa), cell type (epithelial), organism part (cervix) and disease (cervical adenocarcinoma)]
Structure for queries
Gene Expression Atlas
• Linking data to the ontology
[Figure: Assay Table, Sample Table and Ontology Term Table in the database; a database-formulated query alongside an OWL model query]
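One way to picture the link between the data tables and the ontology is a join from samples to an ontology term table, driven by a set of terms that an ontology query has already expanded. The schema, table names and rows below are assumptions for illustration and do not reflect the actual Atlas database.

```python
import sqlite3

# Hypothetical two-table schema with made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE ontology_term (sample_id INTEGER REFERENCES sample(id), term TEXT);
INSERT INTO sample VALUES (1, 'sample_a'), (2, 'sample_b'), (3, 'sample_c');
INSERT INTO ontology_term VALUES (1, 'leukemia'), (2, 'adenocarcinoma'), (3, 'kidney');
""")


def samples_for_terms(conn, terms):
    """Return names of samples annotated with any of the given ontology terms."""
    terms = sorted(terms)
    placeholders = ", ".join("?" for _ in terms)
    rows = conn.execute(
        "SELECT DISTINCT s.name FROM sample s "
        "JOIN ontology_term t ON t.sample_id = s.id "
        f"WHERE t.term IN ({placeholders}) ORDER BY s.name",
        terms,
    )
    return [name for (name,) in rows]


# Terms as an ontology query for 'cancer' might return them (assumed here).
expanded = {"cancer", "leukemia", "adenocarcinoma"}
print(samples_for_terms(conn, expanded))
```

Keeping the expansion outside the SQL means the database only stores term assignments, while the hierarchy stays in the ontology.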
Gene Expression Atlas
@ www.ebi.ac.uk/gxa
Query for Cell adhesion genes in all ‘organism parts’
‘View on EFO’
ArrayExpress Archive
@ www.ebi.ac.uk/arrayexpress
Future Work: Linked Data
Linking data by dereferenceable URI for human and machine, e.g.
http://www.ebi.ac.uk/gxa/Experiment12345
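A common way to serve one URI to both audiences is HTTP content negotiation, where the client states the representation it wants in the `Accept` header. The sketch below only builds the two requests and sends nothing; whether the Atlas server would honour these headers is an assumption, since the slide presents this as future work.

```python
from urllib.request import Request

# Example URI taken from the slide.
EXPERIMENT_URI = "http://www.ebi.ac.uk/gxa/Experiment12345"


def build_request(uri, want_rdf):
    """Build a request asking for RDF (machine) or HTML (human)."""
    accept = "application/rdf+xml" if want_rdf else "text/html"
    return Request(uri, headers={"Accept": accept})


human = build_request(EXPERIMENT_URI, want_rdf=False)
machine = build_request(EXPERIMENT_URI, want_rdf=True)
print(human.get_header("Accept"), machine.get_header("Accept"))
```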
Future Work: RDF Triple Store
@ www.ebi.ac.uk/efo/semanticweb/atlas
• Q: Is a SPARQL query against an RDF triple store quicker than a SPARQL
query translated into SQL?
[Figure labels: OWL Ontology; RDFizer; SQL Translation Layer; Atlas Data; RDF Triple Store; SPARQL]
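The RDFizer step in the figure can be sketched as a function that turns annotation rows into N-Triples lines ready for loading into a triple store. The `BASE` namespace, the row layout and the term URI are hypothetical; the real RDFizer and its vocabulary are not described in the slides.

```python
BASE = "http://example.org/atlas/"  # hypothetical namespace


def rdfize(rows):
    """Turn (sample_id, property, term_uri) rows into N-Triples lines."""
    lines = []
    for sample_id, prop, term_uri in rows:
        subject = f"<{BASE}sample/{sample_id}>"
        predicate = f"<{BASE}property/{prop}>"
        lines.append(f"{subject} {predicate} <{term_uri}> .")
    return lines


# Made-up annotation row for illustration.
rows = [("s1", "organism_part", "http://example.org/efo/kidney")]
for line in rdfize(rows):
    print(line)
```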
Future Work: Data Integration
• Consuming reference ontologies and mapping to multiple ontologies
where overlap exists offers us maximum interoperability
• The advantage of triple stores is not immediate yet
• Impetus required: “should we champion this technology”
[Figure: a single QUERY over RDF triples drawn from Atlas, the Amino Acid Ontology and Swiss-Prot]
Summary
• We have created a sustainable approach to consuming multiple
reference ontologies
• Tooling solutions to expedite process
• We consider EFO to be a ‘view’ of such ontologies for our application
needs
• The primary aim of this work is to enable novel research with the
experimental data we have
• Specifically, we can answer new questions, integrate across our data
resources, visualise and summarise the data
• Our belief is that describing such data should be the driving force behind
ontology development
• Future work will look at linked data and RDF triple stores
Acknowledgements
• Ontology creation:
  • James Malone, Tomasz Adamusiak, Ele Holloway, Helen Parkinson, Jie Zheng (U Penn)
• Ontology Mapping tools and text mining evaluation:
  • Tim Rayner, Holly Zheng, Margus Lukk
• GUI Development:
  • Misha Kapushesky, Pasha Kurnosov, Anna Zhukova, Nikolay Kolesinkov
• External Review and anatomy:
  • Jonathan Bard, Jie Zheng
• ArrayExpress Production Staff
• EBI Rebholz Group (Whatizit text mining tool)
• Many source ontologies for terms and definitions, esp. Disease Ontology, Cell Type Ontology, FMA, NCIT, OBI
• Funders: EC (Gen2Phen, FELICS, MUGEN, EMERALD, ENGAGE, SLING), EMBL, NIH
• Eric Neumann, Joanne Luciano and Alan Ruttenberg
• W3C & HCLS Group - Eric Prud'hommeaux and Scott Marshall
• OBI developers