Trevor Paterson and Andy Law

The Development of an Ontology
for Data Integration and Query in
Comparative Genomics.
Trevor Paterson and Andy Law
Roslin Institute, Scotland
Aims:
- To develop ‘enabling technologies’ for comparative
genomics.
- To integrate disparate resources (genomic mapping,
DNA sequence, evolutionary relationships,
functional information) across species boundaries.
- To inform and expedite genomic mapping,
particularly in non-model organisms.
Collaborators:
Farm animal, crop and microbial genomics;
Bioinformatics; Computer Sciences;
Statistics.
- Roslin Institute: Dr. Andy Law (project co-ordinator), Dr. Trevor Paterson
- EBI: Dr. Peter Rice, Tony Burdett
- IFR: Dr. Ian Roberts
- JIC: Dr. Jo Dicks (RA Dr. Robert Davey)
- Manchester: Dr. Robert Stevens, Dr. Andrew Gibson
- Newcastle (Maths & Stats): Dr. Darren Wilkinson, Dr. Richard Boys (RA Dr. Madhuchhanda Bhattacharjee)
- Newcastle (Computing Science): Dr. Neil Wipat, Dr. Matthew Pocock, Professor Paul Watson
- SCRI: Dr. David Marshall
DISPARATE GENOMIC MAPPING DATA
- for individual species
- multiple datatypes
- in many non-standard formats and databases
- archived in many locations, variety of access protocols
- data of variable quality and completeness
PLUS ONLINE BIOINFORMATICS RESOURCES
- DNA sequence and genome projects
- Gene structure and function
- Protein structure, family, function
- Evolutionary history, orthology, homology
- Phenotypes (genetic traits and diseases)
- Population genetics
- Gene expression patterns
- Publications
Current integration between datasources and across
species is largely manual,
i.e. difficult, error-prone and very inefficient.
Why do Biologists want to integrate mapping data
across species?
What are they trying to do?
GOAL: MAP, IDENTIFY AND UNDERSTAND GENES
BEHIND PHENOTYPES (i.e. DISEASES & TRAITS)
ComparaGrid aims to assist this process by
exploiting existing mapping data across species
boundaries.
UNDERLYING BIOLOGICAL PRINCIPLE BEHIND
CROSS-SPECIES MAP COMPARISON
Conservation of Synteny:
“Conservation of (blocks of) gene order throughout chromosomal
evolution”
As species evolve and diverge, their chromosomes rearrange through
duplications, inversions, translocations etc - but blocks of genes
can be traced through evolutionary history between even
relatively divergent species (e.g. chicken and man).
Therefore the known gene order in these blocks in one species can
inform/predict the order of evolutionarily related genes
(orthologues) in other species.
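A minimal sketch of this principle, not part of ComparaGrid itself: given the gene order in two species and a table of asserted orthologues, report runs of genes whose orthologues are also adjacent in the other species, in either orientation. The function name `conserved_blocks`, the gene identifiers and the orthologue table are all hypothetical, chosen only for illustration.

```python
def conserved_blocks(order_a, order_b, orthologue, min_len=2):
    """Find runs of genes in species A whose orthologues are adjacent in
    species B, in the same or the inverted orientation."""
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    blocks = []
    current = []  # (gene in A, position of its orthologue in B)
    step = 0      # +1 same orientation, -1 inverted, 0 not yet known

    for gene in order_a:
        pos = pos_b.get(orthologue.get(gene))
        extends = False
        if pos is not None and current:
            delta = pos - current[-1][1]
            extends = abs(delta) == 1 and step in (0, delta)
        if extends:
            current.append((gene, pos))
            step = delta
            continue
        # The run is broken: keep it if it is long enough, then start afresh.
        if len(current) >= min_len:
            blocks.append([g for g, _ in current])
        current = [(gene, pos)] if pos is not None else []
        step = 0

    if len(current) >= min_len:
        blocks.append([g for g, _ in current])
    return blocks


# Hypothetical gene orders: the first three genes are inverted in species B.
species_a = ["A1", "A2", "A3", "A4", "A5"]
species_b = ["B3", "B2", "B1", "B9", "B4", "B5"]
orthologues = {"A1": "B1", "A2": "B2", "A3": "B3", "A4": "B4", "A5": "B5"}
blocks = conserved_blocks(species_a, species_b, orthologues)
```

In this made-up example the first block is inverted in species B relative to species A, yet it is still reported as conserved, which is the situation the inversion events in chromosomal evolution produce.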
[Figure: an ancestral chromosome evolving into modern species A, A’ and B. Labelled events: speciation event, duplicative inversion, breakage, inversion. Timeline: 20M years ago, 10M years ago, now.]
[Figure: the same diagram, with the evolutionary history marked as a HYPOTHESIS. Sequence Similarity & Conserved Synteny => Orthology.]
COMPARATIVE GENOMICS USE CASE
Agribusiness wants to map the underlying genetic basis of the
‘Tasty Bacon’ Trait (a QTL).
[Figure: pig QTL (Genetic) Map showing the ‘Tasty Bacon’ QTL.]
COMPARATIVE GENOMICS USE CASE
The position of the QTL is correlated on various types of Pig
Genetic maps
[Figure: the ‘Tasty Bacon’ QTL on the pig QTL Map, Linkage Map and Radiation Hybrid Map.]
COMPARATIVE GENOMICS USE CASE
There is a ‘known’ homology between a Pig Marker/Sequence in
this region and the human genome
[Figure: pig QTL, Linkage, Radiation Hybrid and Cytogenetic Maps linked to the human genome. DNA Sequence Similarity => Homology =>? Orthology.]
COMPARATIVE GENOMICS USE CASE
A Physical Map of BAC clones exists for this region of the Human
Genome
[Figure: pig QTL, Linkage, Radiation Hybrid and Cytogenetic Maps linked to human Physical Mapping of BAC1, BAC2 and BAC3.]
COMPARATIVE GENOMICS USE CASE
There are known chicken expressed sequences homologous to
Human Gene Sequences in this region
[Figure: as before, with chicken EST1 and EST2 linked to the human Physical Mapping of BAC1, BAC2 and BAC3.]
COMPARATIVE GENOMICS USE CASE
Gene expression Data for these Chick ESTs might correlate with a
trait similar to ‘Tastiness’
[Figure: as before, with Expression Analysis added for chicken EST1 and EST2.]
COMPARATIVE GENOMICS USE CASE
The literature may detail Functions of Human genes in this region,
and homologies to genes in other species – helping the researcher
predict candidate genes in Pigs responsible for tastiness
[Figure: as before, with Linked References added alongside Physical Mapping and Expression Analysis.]
COMPARATIVE GENOMICS USE CASE:
HOW CAN WE AUTOMATE THIS?
Provide Architecture to Link and Traverse Data Sources:
=> GRID / Web services
Provide Data Standards to allow this:
=> Syntax and Semantics of Data
Formalise the Links between Data:
=> these Relationships are Data too
=> these are what the Biologists care about
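A minimal sketch of what "Relationships are Data too" could mean in practice, assuming the links from the use case are stored as plain (subject, relationship, object) records. The identifiers, relationship names and the function `reachable` are hypothetical and are not ComparaGrid terms. A breadth-first traversal then recovers the chain of links that a biologist would otherwise follow by hand.

```python
from collections import deque

# Hypothetical links, loosely following the use case above.
LINKS = [
    ("pig:QTL_tasty_bacon", "isLocatedOn", "pig:linkage_map_region"),
    ("pig:linkage_map_region", "contains", "pig:marker_1"),
    ("pig:marker_1", "isHomologousTo", "human:sequence_1"),
    ("human:sequence_1", "isContainedIn", "human:BAC2"),
    ("chicken:EST1", "isHomologousTo", "human:sequence_1"),
]


def reachable(start, links):
    """Breadth-first traversal over the links, in both directions.
    Returns, for every reachable item, the shortest chain of links to it."""
    neighbours = {}
    for subject, relation, obj in links:
        neighbours.setdefault(subject, []).append((relation, obj))
        neighbours.setdefault(obj, []).append((relation, subject))

    paths = {start: []}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for relation, other in neighbours.get(node, []):
            if other not in paths:
                paths[other] = paths[node] + [(node, relation, other)]
                queue.append(other)
    return paths


paths = reachable("pig:QTL_tasty_bacon", LINKS)
for hop in paths["chicken:EST1"]:
    print(hop)
```

Each hop in a returned path is itself a link record, so the evidence for a cross-species connection can be inspected step by step.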
WHAT DOES COMPARAGRID NEED TO INTEGRATE
DATASOURCES IN A BIOLOGICALLY
RELEVANT FASHION?
A lightweight Exchange Standard or a heavyweight
Ontology in OWL-DL?
1. Lightweight Mapping from RDB Schema to standard
Minimally: a data exchange standard
(defines structure and vocabulary for data exchange):
=> XML Schema? RDF?
(a ‘straightforward’ mapping by data providers;
integration logic handling the meaning of
‘relationships’ must be in the Application)
WHAT DOES COMPARAGRID NEED TO INTEGRATE
DATASOURCES?
A lightweight Exchange Standard or a heavyweight
Ontology in OWL-DL?
2. More Heavyweight Mapping
Capturing the Semantics of the Data
=> Defined RDFS Vocabulary?
(mapping still quite lightweight,
data is better defined & more reliably integrated,
integration of data can be automatic,
Applications can rely on semantics)
WHAT DOES COMPARAGRID NEED TO INTEGRATE
DATASOURCES?
A lightweight Exchange Standard or a heavyweight
Ontology in OWL-DL?
3. Heavyweight Mapping
Semantically represent the Relationships between Data
(and Relationships between Relationships…):
=> Formal Ontology (OWL-DL)
(mapping from datasource to Ontology is
complex and specialist;
automatic integration and inference is possible
over data represented as individuals of the
ontology)
DO WE NEED YET ANOTHER ONTOLOGY?
• We think comparative genomics is very different from other
biological knowledge domains…(SO, OBO, GO…)
• We need to integrate both abstract and physical data – experimental
observations positioning ‘markers’ on abstract maps, and physical
locations of ‘features’ on representations of DNA sequences
• Metadata is important – we need to treat mapping data as assertions
– that might be accepted or rejected on the basis of quality, provenance
and trust
• We need to represent evolutionary relationships between mapped
objects – these are also assertions – not facts – based on the
relatedness of underlying physical objects (sequence similarity).
• Integration between datasources depends on accepting these
evolutionary assertions!
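A minimal sketch of treating mapping data as assertions that can be accepted or rejected. The record fields, the source names, the quality scores and the function `accept` are hypothetical. They only illustrate filtering on provenance and quality before any integration step relies on an evolutionary assertion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingAssertion:
    subject: str
    relation: str
    target: str
    source: str     # provenance: who made the assertion (hypothetical names)
    quality: float  # hypothetical quality score between 0 and 1


def accept(assertions, trusted_sources, min_quality):
    """Keep only assertions from trusted sources that meet the quality bar."""
    return [
        a for a in assertions
        if a.source in trusted_sources and a.quality >= min_quality
    ]


ASSERTIONS = [
    MappingAssertion("pig:marker_1", "isOrthologousTo", "human:gene_1", "curated_db", 0.9),
    MappingAssertion("pig:marker_2", "isOrthologousTo", "human:gene_2", "automated_pipeline", 0.9),
    MappingAssertion("pig:marker_3", "isOrthologousTo", "human:gene_3", "curated_db", 0.4),
]

accepted = accept(ASSERTIONS, trusted_sources={"curated_db"}, min_quality=0.5)
```

Changing the trusted sources or the threshold changes which cross-species links exist at all, which is why integration depends on accepting these assertions.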
IDEALIZED COMPARAGRID ARCHITECTURE:
The OWL Ontology forms the 'semantic glue' to integrate data sources
and express cross species queries.
The mapping between the data source schema and the integration
schema (the CG OWL Ontology) is critical.
COMPARAGRID STACK ARCHITECTURE:
A publisher service automates mapping DB Schema to OWL
Bespoke mapping rules map from DB-OWL to CG-OWL
[Figure: ComparaGrid stack diagram. Labels: Raw data (SQL); Syntax; Semantics; Aggregation; DB Publisher service; CG Transformer service; Integrator.]
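A minimal sketch of the two mapping steps named on this slide, using plain tuples in place of real OWL. It assumes the publisher step can be approximated by turning each table row into an individual with one property per column, and that the bespoke rules are a simple term-to-term lookup. The functions `publish` and `transform`, the `db:` and `cg:` prefixes and the example table are hypothetical.

```python
def publish(table, rows, key):
    """Automatic step (DB schema to DB-OWL): each row becomes an individual,
    each non-key column becomes a property named after the table and column."""
    triples = []
    for row in rows:
        subject = f"db:{table}/{row[key]}"
        triples.append((subject, "rdf:type", f"db:{table}"))
        for column, value in row.items():
            if column != key:
                triples.append((subject, f"db:{table}.{column}", value))
    return triples


def transform(triples, rules):
    """Bespoke step (DB-OWL to CG-OWL): rewrite terms that have a mapping
    rule and drop statements whose terms have no rule."""
    result = []
    for subject, prop, obj in triples:
        if prop == "rdf:type":
            if obj in rules:
                result.append((subject, prop, rules[obj]))
        elif prop in rules:
            result.append((subject, rules[prop], obj))
    return result


# Hypothetical table content and mapping rules.
rows = [{"id": 7, "name": "marker_1", "chromosome": "17"}]
rules = {"db:marker": "cg:Marker", "db:marker.name": "cg:hasName"}

db_owl = publish("marker", rows, key="id")
cg_owl = transform(db_owl, rules)
```

The first step needs no knowledge of biology, while the second encodes it in the rules, which is where the critical mapping effort lies.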
BUILDING THE COMPARAGRID ONTOLOGY
Stage I (Biologists & Bioinformaticians input)
• Define the Scope of the Domain
• Collect the terminology used in the Domain
• Interview practising experts
• Document some use cases
• Observe how the experts perform an analysis
• Define the terms and relationships necessary
• Model the knowledge domain
OUTPUTS:
- a model of the knowledge domain
- a prototype ontology (in OWL-DL): terms and relationships
necessary to represent the data and the relationships between data
(Using Protégé).
BUILDING THE COMPARAGRID ONTOLOGY
Stage II (Biologists, Bioinformaticians, Ontologists)
• Hold workshops for panels of experts across the scope
of the domain (animal, plant, microbe).
• Confirm the Concepts and Relationships that are
required.
• Confirm our model of the knowledge domain.
• Iterate and refine the prototype ontology representing this
model.
OUTPUT:
version 1 prototype ComparaGrid OWL Ontology
HIERARCHY OF CONCEPTS IN THE
COMPARAGRID ONTOLOGY
COMPARAGRID ONTOLOGY:
Simple Relationships = Properties
Hierarchy of Object to Object
Properties
Hierarchy of Object to Value
Properties
In OWL-DL complex relationships can be modelled as Concepts
Simple RDF Statement Representation of a Relationship:
Map (DomainConcept) -- isMapOf (property) --> Chromosome (DomainConcept)
Richer Representation as OWL Class:
isMapOf (Unidirectional Relationship) -- relatesFrom (property) --> Map (DomainConcept)
isMapOf (Unidirectional Relationship) -- relatesTo (property) --> Chromosome (DomainConcept)
isMapOf (Unidirectional Relationship) -- hasEvidence (property) --> Citation (DomainConcept)
Citation (DomainConcept) -- identifier (property) --> Value <String of Characters>
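A minimal sketch contrasting the two representations, using plain Python objects rather than OWL. It assumes the richer form can be read as a relationship object that points to both ends and carries its own evidence. The class and attribute names mirror the labels on the slide, but the Python modelling and the citation identifier are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Citation:
    identifier: str


@dataclass
class UnidirectionalRelationship:
    """A relationship modelled as a concept in its own right, so that it
    can carry evidence or take part in further relationships."""
    name: str
    relates_from: str
    relates_to: str
    has_evidence: List[Citation] = field(default_factory=list)

    def as_statement(self) -> Tuple[str, str, str]:
        # Collapsing to a simple statement discards the evidence.
        return (self.relates_from, self.name, self.relates_to)


# Simple statement: nothing can be attached to the relationship itself.
simple = ("Map", "isMapOf", "Chromosome")

# Richer form: the same relationship with a (made-up) citation as evidence.
rich = UnidirectionalRelationship(
    name="isMapOf",
    relates_from="Map",
    relates_to="Chromosome",
    has_evidence=[Citation("example-citation-1")],
)
```

The richer form can always be reduced to the simple statement, but not the reverse, which is the trade-off between the lightweight and heavyweight options.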
The Importance of Relationships
Biologists and Bioinformaticians see an important conceptual difference
between:
The ‘nuts and bolts’ relationships within the data
(‘EXPERIMENTAL OBSERVATIONS’ and ‘FACTS’)
Vs
The biological hypotheses (‘ASSERTIONS’)
Hopefully the richness and expressivity of OWL-DL will give us the
opportunity to capture the subtleties of the different types of relationships
and how they may relate to each other.
Critically we want to infer over the data represented as individuals – not
merely over properties of the ontology
COMPARAGRID ONTOLOGY:
Complex Relationships (as Concepts)
BUILDING THE COMPARAGRID ONTOLOGY
Stage III (Expert Ontologists)
• Refactor the prototype ontology according to good
design principles
• Build a core upper-level comparative mapping domain
ontology that will integrate with other domains
• Incorporate additional modules to represent specific
subdomains (Genetic Variation, Abstract Mapping
Concepts, Evidence, Evolutionary Relationships etc.)
OUTPUT:
modularised ComparaGrid OWL Ontology
THE MODULARISED COMPARAGRID ONTOLOGY
BUILDING THE COMPARAGRID ONTOLOGY
Timescale
• Stage I: 6 months
• Stage II: 6 months
• Stage III: ongoing / 3 years
Problem:
How do we develop the architecture and software
when we don’t have a final Ontology or model?
• Use the Prototype version?
• Use small hack ontologies for demonstration data?
But can we be sure the principles will work for the final larger,
more complex Ontology?
USING THE COMPARAGRID ONTOLOGY:
Querying distributed resources through the
ComparaGrid Stack Architecture
• Tools for converting DB schema to OWL ontology
• Tool support for mapping DB ontologies to CG ontology
• Automatic query translations up and down the stack
• Allows queries to be expressed and resolved in OWL
– should allow automated reasoning and inference
Under Development… i.e. Fun Time for the Computer Scientists…
Roslin Ark Database’s experience as Data
Providers (and Biologists/Users)
• We want to export and import data in a reusable format
• We could build all our own applications using a common data
format, allowing us to traverse data sets according to
assertions made between the data.
• …but we want to use ComparaGrid’s ‘clever’ integration and query
through OWL
• i.e. we want to exchange data as OWL – so we have to incorporate
mapping from schema to OWL into our service architecture
Roslin Ark Database’s experience as Data
Providers
Problems:
• We are waiting for the ‘final’ ontology
• We are waiting for the stack architecture
(…which is waiting for the ontology)
• The ComparaGrid Architecture/Toolset is being designed to map
from DB schema to OWL, but our DB schema captures none of
our domain model……our mapping should be from Object model
to OWL ….
• We have to implement our own mapping to OWL….
• We want to progress and ACTUALLY DO SOME BIOLOGY!
[Figure: ArkDB export architecture. Labels: ArkIIDB Schema (Object Table, Relationship Table) ≠ Object Model; Ibatis; Java Objects; Java Application (Web App, Drawing Applet, Download App); Betwixt / XSLT; CG RDFS Vocabulary; Web Service; Jena OWL Model API; SERIALISATION as XML/RDF, XML/RDF DB-OWL and XML/RDF CG-OWL.]
ComparaGrid Ontology:
Where are we at…and Why?
• Prototype OWL Ontology created:
- used to demonstrate mapping of ArkDB to Webservices.
- Ontology is flabby and poorly designed?
- Mapping from Java to OWL/XML is a cumbersome/manual
process.
• Refactoring/modularising the ComparaGrid OWL Ontology is
non-trivial (Research Project in its own right!).
- We are not able to use a ‘final’ ontology to drive the
development of services.
• Until we have a working common data format or ontology we
can’t start to import and export further datasources.
ComparaGrid Ontology:
Where are we at…and Why?
• Implementation of the ComparaGrid stack integration and query
architecture is ongoing.
• Automated / Assisted mapping tools under development.
(DB relational schema => DB-OWL => CG-OWL)
[Using hack ontology fragments in the interim.]
• We need further tools to support mapping from any ad hoc
database or object model to OWL.
ComparaGrid Ontology:
Where are we at…and Why?
• As data providers, Roslin ArkDB is dependent on the tools and
infrastructure being developed by ComparaGrid – without
knowing how much added value an ontology will give…
• We hope that the ontology will allow us to represent the
‘interesting’ biological relationships
• That it will facilitate automated integration and data
traversal
• That it will allow inference of new knowledge automatically
• However…the burden is put on the data mapping process
– a more lightweight approach would simplify this
(e.g. RDF/RDFS), but might require that applications
understand the context of information sources.
• RDF(S) is becoming quite well supported – and allows some
inference over semantic relationships. WOULD IT BE GOOD
ENOUGH FOR US?
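A minimal sketch of the kind of inference RDF(S) allows, written as a small forward-chaining loop over plain tuples rather than with an RDF library. It applies only two RDFS-style rules: statements propagate up `rdfs:subPropertyOf`, and `rdf:type` propagates up `rdfs:subClassOf`. The `cg:` terms, the example data and the function `rdfs_closure` are hypothetical.

```python
def rdfs_closure(triples):
    """Repeatedly apply two RDFS-style rules until nothing new is inferred."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                # (s p o) and (p subPropertyOf q) gives (s q o)
                if p2 == "rdfs:subPropertyOf" and s2 == p:
                    new.add((s, o2, o))
                # (s type C) and (C subClassOf D) gives (s type D)
                if p == "rdf:type" and p2 == "rdfs:subClassOf" and s2 == o:
                    new.add((s, "rdf:type", o2))
        if new <= inferred:
            return inferred
        inferred |= new


# Hypothetical vocabulary and data.
DATA = {
    ("cg:isOrthologueOf", "rdfs:subPropertyOf", "cg:isHomologueOf"),
    ("cg:LinkageMap", "rdfs:subClassOf", "cg:Map"),
    ("pig:gene_x", "cg:isOrthologueOf", "human:gene_y"),
    ("pig:map_1", "rdf:type", "cg:LinkageMap"),
}

closure = rdfs_closure(DATA)
```

A query for homologues or for maps in general would then find the pig data without the application knowing about the more specific terms. Whether this level of inference is enough, compared with OWL-DL, is the open question on this slide.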
