Bio-ontologies for Annotation and Service Discovery


Bio-ontologies for Annotation
and Service Discovery
Chris Wroe
( + material from Carole Goble, Alan Rector,
Jeremy Rogers, Ian Horrocks)
University of Manchester, UK
Overview

Example driven tour of the why, what and how of ontologies in life sciences
Cover the key features of an ontology: vocabulary, definitions, hierarchies, grammar & reasoning
Cover the key targets of ontology use: biological knowledge, service descriptions, (database schema)
Ontology – the discipline





Semantics – the meaning of meaning.
Philosophical discipline, branch of
philosophy that deals with the nature and
the organisation of reality.
Science of Being (Aristotle, Metaphysics,
IV,1)
What is being?
What are the features common to all
beings?
In science… ontology the thing


A resource to aid the precise
communication and integration of
information
Binds a community to communicate
information in some domain of
interest in a consistent manner.
Gene Ontology – a community
effort



Model organism databases need to be
integrated
Not possible if they all use a
different vocabulary
Gene Ontology Consortium got
together to form

“a dynamic controlled vocabulary that
can be applied to all eukaryotes”
Gene Ontology – keeping it simple

Provide three separate vocabularies
to describe:



The function a gene product is capable of.
The process a gene product takes part in.
The location at which the gene product has
been found.
Annotation
GO annotations: gene detail page in MGD for the vitamin D receptor gene, Vdr
Feature 1:
Ontologies provide a shared controlled vocabulary of concepts.
Gene ontology - definitions


A diverse community, so explicit definitions are important.
60% of GO concepts have a textual definition, e.g.

apoptotic nuclear changes GO:0030262
Changes affecting the nucleus and its contents during
apoptosis; includes condensation and fragmentation of
nuclear DNA and of the nucleus itself.
Feature 2:
Ontologies provide an agreed definition for each concept to ensure each concept is used in the same way.
Gene ontology – organisation


An alphabetical list of 11000 terms is not enough
Hierarchies allow similar terms to be grouped
together.
biological process
death
cell death
tissue death
necrosis
histolysis
Gene ontology – hierarchy use

GO hierarchy is used for



Navigation of concepts by users
Indexing of information in databases
Aggregating information
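A minimal sketch of how an is-a hierarchy supports aggregation: an annotation made with a specific term is also counted under every more general term. The parent links are an assumption about the nesting of the "death" fragment shown earlier (the flattened slide text does not preserve indentation), and the gene names and annotations are hypothetical.

```python
from collections import defaultdict

# Child -> parent links. ASSUMPTION: nesting guessed from the flattened slide.
PARENT = {
    "death": "biological process",
    "cell death": "death",
    "tissue death": "death",
    "necrosis": "tissue death",
    "histolysis": "tissue death",
}

# HYPOTHETICAL annotations: gene product -> most specific term chosen by a curator.
ANNOTATIONS = {"geneA": "necrosis", "geneB": "histolysis", "geneC": "cell death"}


def ancestors_and_self(term):
    """Yield the term followed by each ancestor up to the root."""
    while term is not None:
        yield term
        term = PARENT.get(term)


def aggregate(annotations):
    """Map each term to the gene products annotated to it or to any descendant."""
    index = defaultdict(set)
    for gene, term in annotations.items():
        for t in ancestors_and_self(term):
            index[t].add(gene)
    return dict(index)


INDEX = aggregate(ANNOTATIONS)
print(sorted(INDEX["tissue death"]))  # ['geneA', 'geneB']
print(sorted(INDEX["death"]))         # ['geneA', 'geneB', 'geneC']
```

The same index serves navigation and database indexing: a query for a general term returns everything annotated below it.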
Taxonomy remark 1

The world is not a tree, it’s a lattice
[Figure: lattice rooted at "animal", with nodes vermin, rodent, wild, domestic, pet, working, dog, mouse, cat, cow]
Taxonomy remark 2

What does the taxonomy mean?



Concept A is a parent of concept B iff every instance of B is
also an instance of A
Superset/subset
[Figure: ICONCLASS entries listed under Door: Closing the Door, Monumental Door, Metalwork of a Door, Door-Knocker, Threshold, Door-keeper. Callouts: "Kind of a door", "Action associated with a door", "Something attached to a door"]
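The superset/subset reading of a taxonomy can be sketched as a set test. The instance sets below are hypothetical and only illustrate why a "kind of" entry passes the test while a "something attached to" entry does not.

```python
def is_parent(instances_of_a, instances_of_b):
    """A is a parent of B iff every instance of B is also an instance of A."""
    return set(instances_of_b) <= set(instances_of_a)


# HYPOTHETICAL instances, invented for illustration only.
DOORS = {"front door", "cathedral door"}
MONUMENTAL_DOORS = {"cathedral door"}
DOOR_KNOCKERS = {"brass knocker"}

print(is_parent(DOORS, MONUMENTAL_DOORS))  # True: a monumental door is a kind of door
print(is_parent(DOORS, DOOR_KNOCKERS))     # False: a knocker is attached to a door
```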
Classification trickiness
The Celestial Emporium of Benevolent Knowledge, Borges
"On those remote pages it is written that animals are divided into:
a. those that belong to the Emperor
b. embalmed ones
c. those that are trained
d. suckling pigs
e. mermaids
f. fabulous ones
g. stray dogs
h. those that are included in this classification
i. those that tremble as if they were mad
j. innumerable ones
k. those drawn with a very fine camel's hair brush
l. others
m. those that have just broken a flower vase
n. those that resemble flies from a distance"
Classification is
task and culture specific
Dyirbal classification of objects in the universe,




Bayi: men, kangaroos, possums, bats, most snakes, most
fishes, some birds, most insects, the moon, storms,
rainbows, boomerangs, some spears, etc.
Balan: women, anything connected with water or fire,
bandicoots, dogs, platypus, echidna, some snakes, some
fishes, most birds, fireflies, scorpions, crickets, the stars,
shields, some spears, some trees, etc.
Balam: all edible fruit and the plants that bear them, tubers,
ferns, honey, cigarettes, wine, cake.
Bala: parts of the body, meat, bees, wind, yamsticks, some
spears, most trees, grass, mud, stones, noises, language, etc.
Gene ontology – directed acyclic graphs

Each concept is explicitly grouped either by is-a or part of
relationships





Functions are often grouped by type
Cellular components are often grouped by part
Each concept can have multiple parents
A concept's position is represented by a directed acyclic graph
Hierarchies are handcrafted so as to suit the ‘culture’ of
biologists
Feature 3:
Ontologies organise concepts in
multiple ways for multiple uses.
Principle of grouping should be explicit.
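A minimal sketch of a directed acyclic graph with typed links, where a concept may have several parents and the principle of grouping (is-a or part-of) is stored explicitly on each link. The terms and links are hypothetical and are not taken from GO.

```python
# HYPOTHETICAL fragment: each term maps to (relationship, parent) pairs.
EDGES = {
    "nucleolus": [("part-of", "nucleus")],
    "nucleus": [("is-a", "organelle"), ("part-of", "cell")],
    "organelle": [("part-of", "cell")],
    "cell": [],
}


def ancestors(term, relations=("is-a", "part-of")):
    """Collect every term reachable by following the chosen relationship types."""
    found = set()
    stack = [term]
    while stack:
        for rel, parent in EDGES.get(stack.pop(), []):
            if rel in relations and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def is_acyclic():
    """True if no term can reach itself, i.e. the graph is a DAG."""
    return all(term not in ancestors(term) for term in EDGES)


print(sorted(ancestors("nucleolus")))           # ['cell', 'nucleus', 'organelle']
print(sorted(ancestors("nucleus", ("is-a",))))  # ['organelle']
print(is_acyclic())                             # True
```

Keeping the relationship type on each link lets one application follow only is-a links while another follows part-of links over the same graph.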
Taking it further

GO concepts are often phrases


insulin control element activator complex, insulin
processing, insulin receptor, insulin receptor complex,
insulin receptor ligand, insulin receptor signalling pathway,
insulin secretion, insulin activated sodium/amino acid
transporter,
Components of phrase hidden to
computer applications
Explicit conceptualisation



Semantic similarity searching
Automated maintenance of hierarchies.
What we need is..


A formal grammar with which to compose
phrases
Software which can interpret phrases and
produce sound and complete hierarchies
The exploding bicycle







ICD-9 (E826) 8
READ-2 (T30..) 81
READ-3 87
ICD-10 (V10-19) 587
V31.22 Occupant of three-wheeled motor vehicle injured in
collision with pedal cycle, person on outside of vehicle,
nontraffic accident, while working for income
W65.40 Drowning and submersion while in bath-tub, street and
highway, while engaged in sports activity
X35.44 Victim of volcanic eruption, street and highway, while
resting, sleeping, eating or engaging in other vital activities
Defusing the exploding bicycle:
500 codes in pieces





10 things to hit…
Pedestrian / cycle / motorbike / car / HGV / train / unpowered vehicle / a tree / other
5 roles for the injured…
Driving / passenger / cyclist / getting in / other
5 activities when injured…
Resting / at work / sporting / at leisure / other
2 contexts…
In traffic / not in traffic
V12.24 Pedal cyclist injured in collision with two- or three-wheeled motor vehicle, unspecified pedal cyclist, nontraffic
accident, while resting, sleeping, eating or engaging in other
vital activities
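The arithmetic behind "500 codes in pieces" can be sketched as a Cartesian product of independent axes. The stated counts (10, 5, 5, 2) come from the slide; the slide text lists only nine values for the first axis, so enumerating the listed values gives 450 combinations rather than 500. The axis names are labels chosen here for illustration.

```python
from itertools import product
from math import prod

# Counts as stated on the slide.
STATED_COUNTS = {"thing hit": 10, "role": 5, "activity": 5, "context": 2}
print(prod(STATED_COUNTS.values()))  # 500

# Values as listed on the slide (nine of the ten "things to hit" are named).
AXES = {
    "thing hit": ["pedestrian", "cycle", "motorbike", "car", "HGV", "train",
                  "unpowered vehicle", "tree", "other"],
    "role": ["driving", "passenger", "cyclist", "getting in", "other"],
    "activity": ["resting", "at work", "sporting", "at leisure", "other"],
    "context": ["in traffic", "not in traffic"],
}

# Every pre-coordinated code corresponds to one tuple of axis values.
COMBINATIONS = list(product(*AXES.values()))
print(len(COMBINATIONS))  # 450
print(COMBINATIONS[0])    # ('pedestrian', 'driving', 'resting', 'in traffic')
```

Maintaining 22 axis values (21 listed here) is far cheaper than maintaining hundreds of enumerated codes, which is the case for composing phrases from parts.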
Coordination: Conceptual Lego
[Figure: building-block concepts: gene, hand, protein, cell, extremity, expression, body, lung, chronic, inflammation, acute, infection, bacterial, abnormal, deletion, normal, ischaemic, polymorphism]
Conceptual Lego
“SNPolymorphism of CFTRGene causing Defect in MembraneTransport of ChlorideIon causing Increase in Viscosity of Mucus in CysticFibrosis…”
“Hand which is anatomically normal”
DAML+OIL



Specifically designed to compose phrases in a
compositional manner
Becoming a standard ontology interchange
language
Adopted by W3C and will soon become the Web Ontology Language (OWL)
Reasoning support
Consistency: check if knowledge is meaningful
Subsumption: structure knowledge, compute taxonomy
Equivalence: check if two classes denote same set of instances
Instantiation: check if individual i is an instance of class C
Retrieval: retrieve set of individuals that instantiate C
Problems all reducible to consistency (satisfiability)
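The reductions can be written out as follows. This is the standard formulation for description logics that are closed under negation; the symbol \mathcal{K} for the knowledge base is notation introduced here, not taken from the slides.

```latex
\begin{align*}
C \sqsubseteq D &\iff C \sqcap \lnot D \text{ is unsatisfiable} \\
C \equiv D &\iff C \sqsubseteq D \text{ and } D \sqsubseteq C \\
\mathcal{K} \models C(i) &\iff \mathcal{K} \cup \{\lnot C(i)\} \text{ is inconsistent} \\
\mathrm{retrieve}(C) &= \{\, i \mid \mathcal{K} \models C(i) \,\}
\end{align*}
```

Subsumption is one unsatisfiability test, equivalence is two subsumption tests, and retrieval is an instantiation test per individual.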

Gene Ontology Next Generation

Early aim



Proof of concept showing DAML+OIL &
description logic can practically help in at least
one aspect of GO maintenance.
In cooperation with Mike Ashburner and the
GO editorial team
Further aims

Prototype an evolutionary environment in which
the benefits can be replicated on a larger scale
Preliminary task

Providing an exhaustive is-a taxonomy


GO is-a poly-hierarchy
It becomes increasingly laborious to make sure
that all concepts are linked to all possible is-a
parents
Metabolism terms:
e.g. heparin biosynthesis
Axis 1: Chemicals
[chemical] biosynthesis (GO:0009058)
[i] carbohydrate biosynthesis (GO:0016051)
[i] aminoglycan biosynthesis (GO:0006023)
[i] glycosaminoglycan biosynthesis (GO:0006024)
[i] heparin biosynthesis (GO:0030210)
Axis 2: Process
[i] heparin metabolism (GO:0030202)
[i] heparin biosynthesis (GO:0030210)
Is this important?

Complete taxonomy not necessary for browsing by
biologist (and may actually get in the way)

BUT… improves fidelity of DB record retrieval.

Asking for records annotated with ‘glycosaminoglycan
biosynthesis’ or more specific will lead to an additional result
O94923 SPTr ISS - D-glucuronyl C5-epimerase (Fragment)
How can we support the task?


Step 0. Translate to DAML+OIL syntax (provided by OilEd)
Provide DAML+OIL based definitions of GO concepts, initially in the metabolism area
DAML+OIL definitions for metabolism concepts

heparin biosynthesis



class heparin biosynthesis defined
subClassOf biosynthesis
restriction onProperty acts_on hasClass heparin
(acts_on is unique)
Paraphrase: biosynthesis which acts solely on heparin
glycosaminoglycan biosynthesis

class glycosaminoglycan biosynthesis defined
subClassOf biosynthesis
restriction onProperty acts_on hasClass glycosaminoglycan
Feature 4:
Ontologies provide a formal computer interpretable concept definition.
A chemical ontology


Initially used MESH to create a DAML+OIL ontology
from a subset of the chemical taxonomy (using UMLS
tools/ API)
Provides the following information
carbohydrates
  [i] polysaccharides
    [i] glycosaminoglycans
      [i] heparin
Reason over the combination


Combine GO definitions with chemical ontology
using OilEd API
Send to FaCT DL reasoner…
Paraphrased reasoning process

heparin biosynthesis
class heparin biosynthesis defined
subClassOf biosynthesis
restriction onProperty acts_on hasClass heparin

glycosaminoglycan biosynthesis
class glycosaminoglycan biosynthesis defined
subClassOf biosynthesis
restriction onProperty acts_on hasClass glycosaminoglycan

Known from the chemical ontology: heparin is-a glycosaminoglycan

Inferring a new is-a link

heparin biosynthesis is-a glycosaminoglycan biosynthesis
Feature 5:
Ontologies can become a dynamic service with reasoning support.
Output



OilEd API reports additional inferred is-a
relationships.
E.g.
heparin biosynthesis has new is-a parent
glycosaminoglycan biosynthesis
Sanitised version sent to GO editorial team for
comment.
The GO editorial team (Jane Lomax) makes changes to GO if
appropriate and sends back queries
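The inference reported above can be imitated with a toy structural check. This is not how FaCT works (it is a tableau-based description logic reasoner); it is a simplified sketch that assumes every definition has the fixed form "biosynthesis restricted on acts_on to one chemical", with the chemical taxonomy and the two definitions taken from the slides. Function and variable names are illustrative.

```python
# Chemical taxonomy from the MeSH-derived fragment: child -> parent.
CHEMICAL_IS_A = {
    "polysaccharide": "carbohydrate",
    "glycosaminoglycan": "polysaccharide",
    "heparin": "glycosaminoglycan",
}

# Defined classes: name -> (superclass, chemical that acts_on is restricted to).
DEFINITIONS = {
    "heparin biosynthesis": ("biosynthesis", "heparin"),
    "glycosaminoglycan biosynthesis": ("biosynthesis", "glycosaminoglycan"),
}


def chemical_subsumes(general, specific):
    """True if `specific` equals `general` or lies below it in the chemical taxonomy."""
    while specific is not None:
        if specific == general:
            return True
        specific = CHEMICAL_IS_A.get(specific)
    return False


def inferred_parents(name):
    """Defined classes whose definition is implied by the definition of `name`."""
    base, chemical = DEFINITIONS[name]
    return sorted(
        other
        for other, (other_base, other_chemical) in DEFINITIONS.items()
        if other != name
        and other_base == base
        and chemical_subsumes(other_chemical, chemical)
    )


print(inferred_parents("heparin biosynthesis"))            # ['glycosaminoglycan biosynthesis']
print(inferred_parents("glycosaminoglycan biosynthesis"))  # []
```

A real reasoner handles arbitrary nesting and several restrictions per class; the point here is only that the new is-a link follows mechanically from the definitions plus the chemical hierarchy.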
Results

Carbohydrate metabolism
  22 additional is-a links, 17 of which now in GO
Amino acid metabolism
  Further 17 additional is-a links now in GO
Currently preparing results for metabolism as a whole
Where next with GONG?

Moving from proof of concept requires dedicated
software tools to support the process.



Authoring/ Curation of DAML+OIL definitions
Tracking GO as it evolves
Tracking suggested changes and response to changes.
myGrid & high level ontologies

myGrid: Personalised extensible environments for data-intensive in silico experiments in biology
Higher level services: workflow, databases,
knowledge management, provenance…
Bioinformatics services are published as Web
services (and soon Grid Services)
http://www.ebi.ac.uk/collab/mygrid/service0/axis/index.html
Ontologies for Service Discovery

Find appropriate type of services
  sequence alignment
Find appropriate instances of that service
  BLAST (an algorithm for sequence alignment), as delivered by NCBI
Assist in forming an appropriate assembly of discovered services.
Find, select and execute instances of services while the workflow is being enacted.
Knowledge in the head of expert bioinformatician
An in silico experiment as a workflow
[Figure: example workflow; recoverable labels: Protein name, Fetch, Fetch, View, WF, Similar sequences, Structure, modelling, RASMOL]
Four-tiered service descriptions
Domain “semantic”
1. Class of service:
   • a protein sequence alignment, a protein sequence database.
2. Specific example of an abstract service:
   • BLAST, SWISS-PROT.
Business “operational”
3. Instance service description of a specific service:
   • BLAST, SWISS-PROT as offered by the EBI.
4. Invoked instance service description:
   • BLAST as offered by the EBI on a particular date, with particular parameters when a service was actually enacted.
Service description phrases




Build up a phrase describing classes of
service functionality.
Building blocks for phrase come from a
suite of ontologies
Template for the description based on
DAML-S specialised for bioinformatics.
Use reasoning to maintain a classification
of services
Suite
[Figure: suite of ontologies: Upper level ontology, Task ontology, Informatics ontology, Web service ontology, Molecular biology ontology, Bioinformatics ontology, Publishing ontology, Organisation ontology]
Two kinds of link between ontologies:
Specialises. All concepts are subclassed from those in the more general ontology.
Contributes concepts to form definitions.
[Figure: the same suite of ontologies, annotated with the properties used in service descriptions: parameters (input, output, precondition, effect), performs_task, uses-resource, is_function_of]
class-def defined BLAST-n_service_operation
  subclass-of atomic_service_operation
  has_Class performs_task
    (aligning has_Class has_feature local
              has_Class has_feature pairwise)
  has_Class produces_result
    (report has_Class is_report_of sequence_alignment)
  has_Class uses_resource
    (database has_Class contains
      (data has_Class encodes
        (sequence has_Class is_sequence_of
          nucleic_acid_molecule)))
  has_Class requires_input
    (data has_Class encodes
      (sequence has_Class is_sequence_of
        nucleic_acid_molecule))
  has_Class is_function_of (BLAST_application)

class-def defined pairwise_sequence_alignment_service
  subclass-of atomic_service_operation
  has_Class performs_task
    (aligning has_Class has_feature local
              has_Class has_feature pairwise)
  has_Class produces_result
    (report has_Class is_report_of sequence_alignment)
  has_Class uses_resource
    (database has_Class contains
      (data has_Class encodes
        (sequence has_Class is_sequence_of
          nucleic_acid_molecule)))
  has_Class requires_input
    (data has_Class encodes
      (sequence has_Class is_sequence_of
        nucleic_acid_molecule))
  has_Class is_function_of (BLAST_application)
Description driven classification
[Figure: myGrid version 0 architecture. Components: Portal, Client framework, Repository Client, Personal Repository, Workflow Client, Workflow Repository, Workflow enactment, Bioinformatics services, Ontology Client (Meta Data), Ontology Server (Meta Data), Service Type Directory, Service instance directory, DAML+OIL Reasoner (FaCT), Matcher and Ranker]
1. User selects values from a drop down list to create a property based description of their required service. Values are constrained to provide only sensible alternatives.
2. Once the user has entered a partial description they submit it for matching. The results are displayed below.
3. The user adds the operation to the growing workflow.
4. The workflow specification is complete and ready to match against those in the workflow repository.
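Step 2 above, matching a partial property-based description against registered services, can be sketched as follows. In myGrid the matching is delegated to the DAML+OIL reasoner (FaCT); this toy version replaces that with a flat property comparison plus one is-a lookup. The property names follow the BLAST-n description earlier, while the values, the task taxonomy and the second service are hypothetical.

```python
# HYPOTHETICAL is-a links between task concepts.
TASK_IS_A = {"pairwise local aligning": "aligning"}

# Flattened service descriptions; the second service is invented for contrast.
SERVICES = {
    "BLAST-n_service_operation": {
        "performs_task": "pairwise local aligning",
        "requires_input": "nucleic acid sequence data",
        "produces_result": "sequence alignment report",
    },
    "hypothetical_translation_service": {
        "performs_task": "translating",
        "requires_input": "nucleic acid sequence data",
        "produces_result": "protein sequence data",
    },
}


def is_a(value, wanted):
    """True if `value` equals `wanted` or is a more specific kind of it."""
    while value is not None:
        if value == wanted:
            return True
        value = TASK_IS_A.get(value)
    return False


def match(query):
    """Names of services that satisfy every property in a partial description."""
    return sorted(
        name
        for name, description in SERVICES.items()
        if all(
            prop in description and is_a(description[prop], wanted)
            for prop, wanted in query.items()
        )
    )


print(match({"performs_task": "aligning"}))
# ['BLAST-n_service_operation']
print(match({"requires_input": "nucleic acid sequence data"}))
# ['BLAST-n_service_operation', 'hypothetical_translation_service']
```

A partial description leaves properties out, so fewer constraints return more candidates; adding properties narrows the list.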
Ontology grounds out

Link ontology to WSDL and UDDI
[Figure: WSDL elements (types (XML Schema), messages, portType, operation, binding, service) and UDDI elements (businessEntity, businessService, bindingTemplate, tModel)]
Other uses of ontology

Labelling data items in databases


Semantic typing for controlling inputs
and outputs
Use by distributed query processing
Ontology/ registry issues




How to best integrate with existing
registry technology such as UDDI
How do ontological descriptions of
data relate to type systems
How big should the phrases become
within the ontology?
Who builds these descriptions?
Summary



Different ontologies can have a
different selection of features
tailored to requirements
Form a wide spectrum of resources
Powerful technology available

Harness it for end users
And finally.. predates computers

Linnaeus 18th Century





Nomenclature/ classification of species
Language independent (Latin)
Promoted sharing and integration of knowledge about
related species
A community effort – botanists / zoologists
Farr 19th Century



Nomenclature of disease for consistent cause of
death reporting
Allowed aggregation/integration of data to discover
new knowledge about the aetiology of Cholera.
A community effort -- surgeons
Links



All myGrid tools & ontology available
from:
http://www.mygrid.org.uk
GONG site: http://gong.man.ac.uk
Building ontologies site:
http://oiled.man.ac.uk/building
Acknowledgements

Manchester metadata team



Carole Goble, Robert Stevens, Sean
Bechhofer, Phil Lord, Alan Rector,
Jeremy Rogers, Chris Garwood
myGrid team
GO Consortium

Esp. Mike Ashburner, Midori Harris,
Jane Lomax
Sharing info  Sharing meaning
Metadata


Data describing the content
and meaning of resources
and services.
But everyone must speak
the same language…
Terminologies



Shared and common
vocabularies
For search engines, agents,
curators, authors and users
But everyone must mean
the same thing…
Ontologies
Shared and common
understanding of a domain

Essential for search, exchange and
discovery

Origin and History
• Humans require words (or at least symbols) to communicate efficiently. The mapping of words to things is only indirectly possible. We do it by creating concepts that refer to things.
• The relation between symbols and things has been described in the form of the meaning triangle:
[Figure: the meaning triangle, with labels Concept and “Jaguar”] [Ogden, Richards, 1923]
So what is an ontology? [Deborah McGuinness, Stanford]
[Figure: spectrum of ontology kinds, from informal to formal: Catalog/ID; Terms/glossary; Thesauri; Informal is-a; Formal is-a; Formal instance; Frames (properties); Value restrictions; Disjointness, inverse, part-of; General logical constraints. Examples placed on the spectrum: Gene Ontology, Mouse Anatomy, Arom, TAMBIS, EcoCyc, PharmGKB]
Human and machine communication [Maedche et al., 2002]
[Figure: Human Agent 1 and Human Agent 2 (HA1, HA2) exchange symbols, e.g. via natural language; Machine Agent 1 and Machine Agent 2 (MA1, MA2) exchange symbols, e.g. via protocols. Humans hold internal models and machines hold formal models; all agents commit to a shared Ontology Description with Formal Semantics for a specific domain, e.g. animals. A meaning triangle links the Symbol “JAGUAR”, the Concept and Things.]
Important life science ontologies

SWISS-PROT Keywords: the SWISS-PROT keyword list now has definitions (in natural language) associated with each keyword.
Edinburgh Anatomies: whole or partial anatomy ontologies for adult and developmental stages for several model organisms.
Ingenuity: the company has a large knowledge base of experimental findings in biology. Currently, their ontology is not viewable.
MGED: the ontology working group aims to develop ontologies for describing gene expression experiments and data.
Semiotes
Regulatory Networks Model
PharmGKB: Pharmacogenetics Knowledge Base.
TAMBIS ontology (TaO): an ontology of bioinformatics and molecular biology.
RiboWeb: an ontology describing ribosomal components, associated data and computations for processing those data.
EcoCyc: an ontology describing the genes, gene product function, metabolism and regulation within E. coli.
Molecular Biology Ontology (MBO): a general, reference ontology for molecular biology.
Gene Ontology (GO): an ontology describing the function, the process and cellular location of gene products from eukaryotes.
Mouse Genome Informatics GO browser
Mouse Anatomical Dictionary
ImMunoGeneTics (IMGT) Ontology
STAR/mmCIF: macromolecule structure ontology.
STAR/mmCIF Signal Transduction Knowledge Environment (STKE).
GENAROM: ontology of gene product interactions.
GeneX: ontologies for comparing gene expression across species.
EpoDB Controlled Vocabulary: function, cell and tissue type, developmental stage and experimental type.
CBIL Controlled Vocabulary: terms for human anatomy.
Japan Bio-Ontology Committee: including Signal Transduction Ontology