Ontologies for biological annotation

Weaving and untangling the GO
• is_a completeness ~9 slides
• granularity & BP ~3 slides
• Linking MF to BP ~15 slides
• Sensu ~13 slides
– linguistic qualifiers vs relations
• Linking GO to other ontologies ~40 slides
– GO+Cell
Tangled DAGs and complexity
• paths increasing
• GO process in general has multiple axes of classification
– qualifier -ve +ve
– anatomy
• structural
• spatial
– chemical
• structural
• functional
is_a
completeness
GO and is_a completeness
• Why?
• What’s wrong with every term having at
least one is_a or part_of parent?
– this is the way we’ve always done things
Ontologies should be
complete
• No errors of omission
• is_a completeness is the ontologically correct
thing to do
– every entity type is a subtype of some other thing
• Accurate ontologies = accurate queries
– currently a query for “find all kinds of
development” does not return “ovarian follicle
development”
• this is wrong
missing is_as hinder common tool use
• We should play nicely with the others in
the playground
• Most (non-GOC) tools expect is_a
completeness
– GO looks funny when viewed in other tools
• the standard is to show only is_a relations in
default tree view
– missing is_as break reasoners
Filling is_a gaps brings
practical benefits
• Easier for tools to find inconsistencies in
GO
• We can start to untangle displays
Example: current displays mix
relations
• it’s a mess
untangling is_a and part_of
• difficult if is_a hierarchy is incomplete
– is_a orphans show up at root node in pure
is_a display
• not everything must have an asserted
part_of parent
– can infer from is_a parents
The new complete cellular
component
• Current CC:
– 277 is_a orphans / 1688 terms
– avg is-a-paths-to-root 1.4
– avg mixed-paths-to-root 6.97
• Jane’s fixed CC:
– 0 is_a orphans
– avg is-a-paths-to-root 3.36
– avg mixed-paths-to-root 38.6
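The statistics above can be sketched on a toy graph. This is a minimal illustration, not the script used for the slide: the five terms and their links are invented, and how the real figures treat orphans when averaging is not stated in the source.

```python
# Hypothetical toy ontology: term -> list of (relation, parent) links.
LINKS = {
    "cellular_component": [],
    "cell": [("is_a", "cellular_component")],
    "organelle": [("is_a", "cellular_component"), ("part_of", "cell")],
    "nucleus": [("is_a", "organelle"), ("part_of", "cell")],
    "nucleolus": [("part_of", "nucleus")],  # no is_a parent: an is_a orphan
}
ROOT = "cellular_component"


def is_a_orphans(links, root):
    """Return the non-root terms that have no is_a parent."""
    return sorted(
        term
        for term, parents in links.items()
        if term != root and not any(rel == "is_a" for rel, _ in parents)
    )


def paths_to_root(links, root, term, relations):
    """Count distinct paths from term to root that use only the given relations."""
    if term == root:
        return 1
    return sum(
        paths_to_root(links, root, parent, relations)
        for rel, parent in links[term]
        if rel in relations
    )


print(is_a_orphans(LINKS, ROOT))
print(paths_to_root(LINKS, ROOT, "nucleus", {"is_a"}))
print(paths_to_root(LINKS, ROOT, "nucleus", {"is_a", "part_of"}))
```

In the toy graph the orphan has zero pure is_a paths to the root, which is why it would surface at the root node in a pure is_a display.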
Granularity and the organisation of GO:BP
Fixing the upper levels of BP
• The upper portion of any ontology is
very important for organisation
• Design decisions percolate down
• Many users exploring GO top-down see
this first
• Diamonds are particularly bad in the
upper level
– significantly increases tangledness
[Diagram: upper levels of biological process, showing biological process, cellular process, physiological process, cellular physiological process, organismal physiological process, and others]
Text definitions:
• biological process
– A phenomenon marked by changes that lead to a particular result, mediated by one or more gene products
• cellular process
– Processes that are carried out at the cellular level, but are not necessarily restricted to a single cell. For example, cell communication occurs among more than one cell, but occurs at the cellular level
• cellular physiological process
– The processes pertinent to the integrated function of a cell
• physiological process
– Those processes specifically pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms
• organismal physiological process
– The processes pertinent to the function of an organism above the cellular level; includes the integrated processes of tissues and organs
Consider… (long term view)
• Making top division by granularity of the
process itself
– biological process
• molecular level process?
• cellular level process
• (multi-cellular) level process
• These types are disjoint
• But what about physiological process?
– this is not disjoint from the granularity of the
process itself
Relations between GO ontologies
Outline
• We focus on MF & BP
• biological example from David
• the types and relations in reality
– maintaining the ALL-SOME definition of relations
• how should this be implemented in the GO?
– what links should be manifested
– retain some level of redundancy, or eliminate it?
[Diagram: histidine catabolism, with the MF terms for its steps and the BP terms for its variants]
Process (BP):
• GO:0006548 Histidine catabolism
– GO:0019557 Histidine catabolism to glutamate and formate
– GO:0019556 Histidine catabolism to glutamate and formamide
– GO:???????? Histidine catabolism to glutamate and formiminotetrahydrofolate
Functions (MF):
• GO:0004397 Histidine ammonia lyase activity
• GO:0016153 Urocanate hydratase activity
• GO:0050480 Imidazolonepropionase activity
• GO:0030409 Glutamate formimidoyltransferase activity
• GO:0050415 Formimidoylglutamase activity
• GO:0050129 N-formylglutamate deformylase activity
• GO:0050416 Formimidoylglutamate deiminase activity
Overbeek, et al. The Subsystems Approach to Genome Annotation and its
Use in the Project to Annotate 1000 Genomes. NAR 2005, 33-17:5691-5702
Ontological Representation
• I will try and be clear when I am talking
about
– types in reality
– types we wish to manifest as terms in the
GO (or in other ontologies)
• all GO terms should be types
• not all types need to have terms created - we
limit for practical reasons
What are the relations in
reality?
• Between types in the same ontology, different
levels of granularity
– part_of
• Between functions and processes (at the
same level of granularity)
– functioning_of
• Between component and function
– has_function
• Between process and component
– located_in
What are the instances and relations in reality?
[Diagram]
• some gene product instance has_function some molecular function instance (function)
• some molecular functionING instance (process) functioning_of some molecular function instance
• some molecular functionING instance part_of some multistep process instance
What are the types and type-level relations in reality?
[Diagram]
• some type of gene product has_function some type of molecular function (function)
• some type of molecular functionING (process) functioning_of some type of molecular function
• part (direction?) between some type of molecular functionING and some type of multistep process
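types example
issues:
-- ALL-SOME structure
[Diagram]
• coarse: histidine catabolism (process)
• fine: histidine ammonia lyase reaction (process)
• histidine ammonia lyase function (function)
• histidine ammonia lyase reaction functioning_of histidine ammonia lyase function
• part? between histidine catabolism and histidine ammonia lyase reaction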
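What are the types and relations in reality?
issues:
-- ALL-SOME structure
[Diagram]
• coarse: histidine catabolism to glutamate and formate (process)
• fine: Formimidoylglutamate deiminase reaction (process)
• Formimidoylglutamate deiminase function (function)
• Formimidoylglutamate deiminase reaction functioning_of Formimidoylglutamate deiminase function
• has part? between histidine catabolism to glutamate and formate and Formimidoylglutamate deiminase reaction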
We want to capture these real
relationships between
biological types
• Between granular levels
• Between orthogonal ontologies
• But first we must be clear on the
definitions of these types, and which
types should be manifested as GO
terms
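Can we just manifest this in the GO?
issues:
-- not all function terms have a corresponding functionING term
-- even if they do, redundancy is generally to be avoided
[Diagram]
• coarse: some type of multistep process
• fine: some type of molecular functionING (process)
• some type of molecular function (function)
• some type of molecular functionING functioning_of some type of molecular function
• has part(?) between some type of multistep process and some type of molecular functionING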
We already have some
redundancy
• function & process redundancy
• iron transport (BP)
• iron transporter (MF)
• function & component redundancy
• voltage-gated ion channel function
• voltage-gated ion channel complex
• If we retain this redundancy, these relations
can be trivially added
• But we don’t always have this redundancy
– not all functions have a corresponding functioning
term
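Manifest shortcut relationships
• one relation standing for two
[Diagram]
• coarse: some type of process
• fine: some type of molecular functionING (process)
• some type of molecular function (function)
• some type of molecular functionING functioning_of some type of molecular function
• has part(?) between some type of process and some type of molecular functionING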
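most functionings are implicit
• current paradigm
[Diagram]
• coarse: histidine catabolism (process)
• fine: histidine ammonia lyase REACTION (process)
• histidine ammonia lyase function (function)
• histidine ammonia lyase REACTION functioning_of histidine ammonia lyase function
• has part(?) between histidine catabolism and histidine ammonia lyase REACTION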
When do we manifest
functions and processes?
• Need consistent stable policy
• Nothing in function ontology should have
activity suffix
– even though to a biochemist activity==potential,
this is still confusing
• Beyond this, do we retain current policy
– some redundancy
• Or take a more extreme approach
– eliminate redundancy
– eliminate current ‘activity’ MF terms and manifest
corresponding reaction terms in BP (Amelia)
‘purist process’ approach
[Diagram]
• some type of gene product has_function histidine ammonia lyase function (function)
• histidine ammonia lyase reaction (process) functioning_of histidine ammonia lyase function
• part between histidine catabolism and histidine ammonia lyase reaction
When is it safe to eliminate
redundancy?
• Does functioning always imply function?
– iron transport does not imply iron transporter
– but we could still extend annotation to allow for
specification of functioning-as-function
• Reactions and other ‘single-step’ processes
involving no helper
– function and corresponding functioning imply one
another
• Redundancy between function and
component should be retained
• Any obsoletion obviously causes disruption
Difficult functionings
• Structural constituents
• functioning happens at lower level of
granularity than is covered by GO
• these will not be linked to process - for
now
Implementation
• Still need to curate the actual links
– trivial links can be computed automatically
• Can proceed independently of resolving
ontological issues
– most likely retain current policy re: manifesting
terms
– need to maintain 3 kinds of links
• granular (part, same ontology)
• functioning_of (function and functioning)
• ‘diagonal’
– ALL-SOME definition
Sensu
Sensu - outline
• Original use
– A linguistic qualifier
– denote differing community usage of a
terminological entity (a term)
• Perverted use
– A type qualifier
– Used for when the part_of structure is specific to
an organism type
• The fix
– provide separate mechanisms for each
Terms vs kinds
• The term ‘term’ is confusing
– Term (sensu GO)
– Term (sensu normal usage)
• strings, tokens
• GO is not a terminology
• A GO ID identifies a type of entity
– a kind of entity
– a universal (as opposed to instance)
– more specific than a class
– but not a concept
Sensu - original usage
• Sometimes the same string refers to different
types
– nucleus (sensu particle physicist)
– nucleus (sensu astrophysicist)
– nucleus (sensu biologist)
• Canonical GO example:
– bud
• no longer relevant, terms obsoleted
– trichome
Linguistic qualifiers are about
language, not biological reality
• No ontological requirement for linguistically
related terms to be ontologically related
– current GO docs are not correct
• trichome, sensu plant community
– should not state that there is some biological
relation between an instance of a trichome and the
plant community
The original usage has been
conflated
• Organism type specificity is a genuine
challenge for the GO
– ‘contextual’ part_ofs
– e.g. X part_of Y in species Z
• Sensu has been wrongly recruited to fix this
– standard pattern:
• X, sensu Z part_of Y
• X, sensu Z is_a Z
• Two problems
– conflation of meaning of sensu
– conflation results in lack of precision
• “as in, but not restricted to taxon” not rigorous enough
Two problems, two solutions
• Retain sensu as a linguistic qualifier only
– re-interpret as: sensu S community
– no requirement for taxon IDs
– no ontology structure requirements
• Introduce a new relation for genuine
organism-type specific terms
– in_organism
– standard inference rules can be used
• e.g.
– X in_organism X’, Y in_organism Y’, X is_a Y <=> X’ is_a Y’
Contextual synonyms
[Term]
name: trichome (sensu insecta)
synonym: "hair" EXACT []
synonym: "trichome" EXACT [] {context=insecta}
def: "A polarized cellular extension that covers much of the insect epidermis." []
[Term]
name: trichome (sensu plant)
synonym: "trichome" EXACT [] {context=plant}
def: "An outgrowth from the epidermis. Trichomes vary in size and complexity and include hairs, scales, and other structures and may be glandular. In Arabidopsis, patterning of trichome development is not random but does not appear to be lineage-based like stomata." []
Advantages
• Lexical qualifiers are dealt with using lexical oboedit tags
• No need to be as specific as a taxon
– only as specific as is needed to decontextualise
• No false reasoning is done over synonyms
– cellular component types and cell types should not
be siblings
• Big user-friendliness win?
– Displays customised for particular users may
choose to display contextual exact synonyms in
place of the wordier sensu name
in_organism
• Standard ALL-SOME definition:
• Type level definition:
– P in_organism O
• for all instances p of P, there exists some
organism o of type O, and some time t, such
that p in_organism o at time t
• More specific relation than located_in in
OBO relations ontology
• Standard logical rules can be applied
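[Diagram]
• photosystem I, in cyanobacteria is_a photosystem I
• thylakoid, in cyanobacteria is_a thylakoid
• photosystem I, in cyanobacteria part_of thylakoid, in cyanobacteria
• photosystem I, in cyanobacteria in_organism cyanobacteria
• thylakoid, in cyanobacteria in_organism cyanobacteria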
Open question
• Sometimes the relation between two types is
largely lexical
– eg trichome
• Sometimes it isn’t so clear
• Can we have both a relation to a taxon and a contextual synonym?
• Is ‘eye’ an exact contextual synonym for
‘compound eye’ for the arthropod community?
Practical considerations
• Use NCBI Taxonomy as our organism
ontology
• xref or relationship tags?
– xrefs are more lightweight
– relationship tags are more accurate
– relationship tags would be ‘dangling’ unless
organism ontology is loaded
• See next section…
Composite
terms in GO finally…
Composite terms - outline
• The problems inherent in composite terms
and diamonds - brief review
• Actively managing composite terms in GO
– big change: parseable logical definitions
• Implementation plan
• Progress so far: logical definitions referring to
cell types
• Pre vs post composition
– composite terms in ontologies and annotations
[Diagram: the is_a chains behind cysteine biosynthesis]
• biosynthesis is_a metabolism
• cysteine is_a serine family amino acid is_a amino acid is_a amine
• serine is_a serine family amino acid
Composed terms currently cause problems
– No link to external ontology term
– Redundancy
– Inconsistency
– Extra work
– Annotation bottleneck
– Tangled DAGs and confusing displays
• we have no way to disentangle
• Solution so far:
– fix errors based on results of term name parsing
(Obol)
• reactive, not proactive
Solution: actively manage
composed terms
• Composed terms should now/soon be
generated using oboedit plugin
– building block terms are recorded in ontology
along with composite term
• Correct DAG structure can be inferred from
external ontologies
– placement & consistency checking automated
– additional work can be automated
• synonyms, text definitions
How will composite terms be
recorded by oboedit?
• How do we record a definition for a composite
term?
– using a logical definition (computational essence)
• A logical definition consists of:
– a generic term (aka genus)
– relationships to other terms which serve to
discriminate this specific term from other is_a
children of the generic term (aka differentiae)
• Can be written in natural language as:
– A <generic term> which <discriminating
characteristics>
Example of composite term
record
• cysteine biosynthesis
– generic term:
• biosynthesis
– discriminating characteristics:
• outputs cysteine
– a biosynthesis process which outputs
cysteine
id: GO:0019344 ! cysteine biosynthesis
intersection_of: GO:0009058 ! biosynthesis
intersection_of: outputs CHEBI:15356 ! cysteine
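A record like this is easy for a program to split into genus and differentiae. The following is a minimal sketch, assuming only the three tags shown on the slide; it is not a full OBO parser, and the function name parse_logical_definition is made up for illustration.

```python
STANZA = """\
id: GO:0019344 ! cysteine biosynthesis
intersection_of: GO:0009058 ! biosynthesis
intersection_of: outputs CHEBI:15356 ! cysteine
"""


def parse_logical_definition(stanza):
    """Return (term id, genus id, [(relation, filler id), ...]) for one stanza."""
    term_id, genus, differentiae = None, None, []
    for line in stanza.splitlines():
        line = line.split("!", 1)[0].strip()  # drop the trailing "! comment"
        if not line:
            continue
        tag, _, value = line.partition(":")  # split at the first colon only
        parts = value.split()
        if tag == "id":
            term_id = parts[0]
        elif tag == "intersection_of":
            if len(parts) == 1:
                genus = parts[0]  # a bare ID is the generic term
            else:
                differentiae.append((parts[0], parts[1]))  # relation + filler
    return term_id, genus, differentiae


print(parse_logical_definition(STANZA))
```

For the stanza above this yields GO:0019344 with genus GO:0009058 and the single differentia (outputs, CHEBI:15356).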
Now we have the ability to
untangle
• Process axis view (primary is_as, via generic
term):
– biological_process
• metabolism
– biosynthesis
» cysteine biosynthesis
• Process participant axis view:
– amine
• amino acid
– serine family amino acid
» cysteine
• Combined view
– (same as current tangled diamond lattice)
Recording the relationship is
important
• Why not just a simple cross-product?
– e.g. biosynthesis x cysteine
• Relationships are important for reasoning and
querying
– Consider:
• cysteine biosynthesis from serine
• mRNA export from nucleus during heat stress
• Without the relations, the logical definition is
not specific enough
– the essence is not captured
Multiple discriminating
characteristics are allowed
• Cysteine biosynthesis from
serine
– Generic term:
• biosynthesis
– Discriminating characteristics:
• output cysteine
• input serine
intersection_of: GO:0009058
intersection_of: outputs CHEBI:15356
intersection_of: input CHEBI:17822
Composite terms can be
nested
• regulation of cysteine biosynthesis
intersection_of: GO:0050789 ! regulation of biological process
intersection_of: regulates GO:0019344 ! cysteine biosynthesis
id: GO:0019344 ! cysteine biosynthesis
intersection_of: GO:0009058
intersection_of: outputs CHEBI:15356
Composite terms can
optionally be manufactured in
bulk
• Generic term:
{metabolism,biosynthesis}
• Differentia: has_output {serine,
cysteine, …}
• With caution…
– Sparse vs dense matrices
– not all combinations are types
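Bulk manufacture amounts to a cross-product of generic terms and differentia fillers, filtered so that only combinations that are real types survive. A minimal sketch, assuming a curator-supplied whitelist (the allowed parameter and the naming pattern are illustrative assumptions, not an oboedit feature):

```python
from itertools import product


def manufacture(genera, relation, fillers, allowed=None):
    """Yield (name, genus, relation, filler) for each genus x filler pair.

    allowed is an optional set of (genus, filler) pairs that a curator has
    confirmed to be real types; when given, all other pairs are skipped.
    """
    for genus, filler in product(genera, fillers):
        if allowed is not None and (genus, filler) not in allowed:
            continue
        yield (f"{filler} {genus}", genus, relation, filler)


genera = ["metabolism", "biosynthesis"]
fillers = ["serine", "cysteine"]

# Dense matrix: every combination is generated.
for row in manufacture(genera, "has_output", fillers):
    print(row)

# Sparse matrix: only the combinations known to be types.
confirmed = {("biosynthesis", "cysteine")}
print(list(manufacture(genera, "has_output", fillers, allowed=confirmed)))
```

The unfiltered call produces four candidate terms; the filtered call keeps only cysteine biosynthesis, which is the "with caution" case from the slide.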
On the importance of
necessary and sufficient
conditions
• Why intersection_of?
• Why not just make normal links in the
GO DAG?
– normal relationships are for necessary
conditions only
– we want both necessary and sufficient
conditions
• captures the essence of the term
Normal DAG links only
capture necessary conditions,
not essence
[Diagram: immune cell activation, inflammatory response, macrophage activation]
• macrophage activation part_of inflammatory response
text def:
A change in morphology and behavior of a macrophage resulting from exposure to a cytokine, chemokine, cellular ligand, pathogen, or soluble factor
Normal DAG links only
capture necessary conditions,
not essence
[Diagram]
• macrophage activation is_a immune cell activation
• macrophage activation activates macrophage
• macrophage activation part_of inflammatory response
essence captured by genus-differentia
[Diagram]
• macrophage activation is_a immune cell activation
• macrophage activation part_of inflammatory response
id: GO:macrophage_activation
intersection_of: GO:cell_activation
intersection_of: activates CL:macrophage
essence captured by genus-differentia
[Diagram: cell activation (genus), immune cell activation, macrophage, macrophage activation, inflammatory response; links shown: is_a, activates, part_of]
The power of reason
• with genus-differentia definitions that
are computationally parseable, we can
do a lot more consistency checking
Pre- vs post- composition
• It makes sense to pre-compose terms and
maintain them as part of GO
• Annotations can post-compose terms if they
choose to do so
– MGI, DictyBase are doing this already
• results remain local to MOD
– AmiGO-NG will allow querying of these
• The two approaches are complementary and
compatible
– proviso: if done properly
SO already contains
composite terms
• A silenced gene is a gene which
has the quality of being silenced
Plan: outline
• We want all new composite terms to be
created using appropriate oboedit plugin
– logical definitions automatically recorded
– term management automated
• Changes:
– editors must now be ‘OBO-aware’
– annotators and end-users can remain unaware of
changes if they choose to do so
• but using the logical defs can bring benefits
• But first we need to find logical definitions for
all the existing composite terms
Where we were at, 2005
• Lots of terms to be retrofitted
– Where to start?
• Previous strategy:
– Obol guesses logical def for each term
– Obol uses logical def to reason
• errors of omission
• inconsistencies
– Batch reports to curators
[Workflow diagram. Components: GO editor, oboedit (OBO editor), go.obo, cell.obo, obol config, Obol name parser, go + ldefs, reasoner, ‘fixed’ go, obol report (cjm)]
Obol produces genus-differentia logical definitions
[Workflow diagram. Components: GO editor, oboedit (OBO editor), go.obo, cell.obo, obol config, Obol name parser, Ego.obo, reasoner, ‘fixed’ go, obol report (cjm)]
Limitations of this approach
• Good as proof-of-principle
• But..
– only the end results are evaluated
– Obol makes the same mistakes in guessing logical definitions at each iteration
– we want to evaluate and preserve the
logical definitions that are generated by
Obol
What we’ve been doing since then
• Focused on OBO Cell ontology
• Used Obol to infer logical defs
• Manually curate logical defs
• Feed back results to improve Obol
• Iterate and refine
• Use oboedit reasoner to check consistency between GO & CellO
• Next: incorporate into curation process
[Workflow diagram. Components: GO editor, oboedit (OBO editor), go.obo, cell.obo, obol config, Obol name parser, ego-cell.obo (cjm)]
Results so far
• Test set of 337 logical definitions
curated
– only a fraction of the composite terms in
GO
• Relations not finalised
• Composite terms involving CellO
present some interesting challenges
• …but first, here’s a demo
Open issues: what relations
do we use?
• We are concerned for now with relations between processes and cells
– neuroblast activation & neuroblast
– T cell differentiation & T cell
– T cell homeostasis & T cell
– cell homeostasis & homeostasis
– sperm incapacitation & sperm
– sperm motility & sperm
OBO Relations ontology
• OBO Relations ontology has
– has_participant
• sub-relations:
– has_agent (active participant)
– has_patient (inactive participant)
» (not in obo-rel yet)
– between a process and a continuant
– follows standard ALL-SOME structure
has_participant
• P has_participant C if and only if: given any
process p that instantiates P there is some
continuant c, and some time t, such that: c
instantiates C at t and c participates in p at t
• has_participant is a primitive instance-level relation
between a process, a continuant, and a time at which the
continuant participates in some way in the process. The
relation obtains, for example, when this particular
process of oxygen exchange across this particular
alveolar membrane has_participant this particular sample
of hemoglobin at this particular time
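The ALL-SOME reading of the type-level definition can be checked mechanically against instance data. The sketch below is an illustration only: the instance identifiers, the three dictionaries, and the function signature are assumptions, and real instance data of this kind is rarely available.

```python
def has_participant(process_type, continuant_type,
                    process_instances, instantiates, participates):
    """ALL-SOME check: every instance p of process_type must have some
    continuant c and time t such that c instantiates continuant_type at t
    and c participates in p at t."""
    return all(
        any(
            proc == p and continuant_type in instantiates.get((c, t), set())
            for (c, proc, t) in participates
        )
        for p, ptype in process_instances.items()
        if ptype == process_type
    )


# Hypothetical instance-level data.
PROCESSES = {"p1": "T cell homeostasis", "p2": "T cell homeostasis"}
# (continuant, time) -> types the continuant instantiates at that time
INSTANTIATES = {("c1", 1): {"T cell", "cell"}, ("c2", 1): {"B cell", "cell"}}
# (continuant, process, time) participation facts
PARTICIPATES = {("c1", "p1", 1), ("c2", "p2", 1)}

print(has_participant("T cell homeostasis", "cell",
                      PROCESSES, INSTANTIATES, PARTICIPATES))
print(has_participant("T cell homeostasis", "T cell",
                      PROCESSES, INSTANTIATES, PARTICIPATES))
```

With this toy data the type-level link to "cell" holds, but the link to "T cell" fails because one process instance has only a B cell participant. A single counterexample instance is enough to break an ALL-SOME relation.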
Is this the appropriate
relation?
neuroblast activation has_participant neuroblast
T cell differentiation has_participant T cell
T cell homeostasis has_participant T cell
cell homeostasis has_participant homeostasis
sperm incapacitation has_participant sperm
sperm motility has_participant sperm
these are all correct…
…but are they too general?
more specific kinds of
participation
• has_agent (has_active_participant)
– As for has_participant, but with the
additional condition that the component
instance is causally active in the relevant
process
• has_patient
(has_inactive_participant)
– Yes, this is a daft name
– The component instance is acted upon
• (not yet in OBO REL)
Cell differentiation
• T cell differentiation
– A cell differentiation instance in which a cell
acquires_features_of T cell
• problem:
– not a simple relation between the process
(T cell differentiation) and the cell (T cell)
• 3-place relation: process, instance, type
Cell differentiation, attempt 2
• T cell differentiation has_output T cell
– Compare to:
• cysteine biosynthesis has_output cysteine
• We should distinguish between participation
relations in which the continuant relations are
– transformation_of
– derives_from
• e.g. something made (biosynthesis) vs
something transformed (differentiation)
Cell differentiation, attempt 3
• T cell differentiation
has_transformed_output_participant T
cell
– …not exactly catchy…
has_primary_participant
• T cell differentiation
has_primary_participant T cell
– aka has_theme
• ontologically a good relation?
• Meaning partly resides in the process
term
• Can be migrated to other relations later
To decompose or not to
decompose
• We could have a logical definition for sperm
incapacitation
– genus: incapacitation
– differentia: has_participant sperm
• Requires creating a new term
– incapacitation
• Not used in any other logical def
• Logical def does not capture full essence
– this term is a little more complex
• involves at least three continuants
• Instead just use a relationship to capture
necessary conditions only
‘Anonymous’ terms
• border follicle cell delamination
– The splitting off of border cells from the
anterior epithelium
• genus: delamination
– no such term
• we can create as ‘anonymous’ term
– exists only in order to make logical
definitions
• ..or we can just create a normal term
Implementation
• We have 337 logical definitions (nearly)
ready
• When can we merge them into the GO?
adding logical defs to the GO
• Will this cause disruption to users?
• gene_ontology.obo file exactly the same as
before, but will have
– fewer inconsistencies!
– new intersection_of tags
• specified in obo v1.2
• can easily be ignored by parsers
• oboedit users must either:
– load cell.obo, relationship.obo at same time as go.obo
– OR select “allow dangling terms”
• may still confuse some users
– ‘anonymous’ terms
[Workflow diagram. Components: GO editor and CellO editor (oboedit); gene_ontology_edit.obo, rel.obo, and cell.obo in cvs; a filter producing gene_ontology.obo in cvs; power users & advanced applications; normal downstream stuff (website, amigo, users) unaffected]
Applications may want to take
advantage of enhanced GO
• enhanced GO isn’t just to help curation
• queries possible with ego:
– find genes associated with blood cells
• annotations to microglial cell activation
– differentiation of any microglial precursor
• annotations to monocyte differentiation
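A query of this kind combines three things: annotations, the cell type named in each process term's logical definition, and the is_a hierarchy of the cell ontology. The sketch below uses invented placeholder IDs, gene names, and a toy cell hierarchy; it only illustrates the shape of the query.

```python
# All IDs and data below are hypothetical placeholders.
CELL_IS_A = {
    "CL:microglial_cell": ["CL:blood_cell"],
    "CL:monocyte": ["CL:blood_cell"],
    "CL:neuron": [],
}
# Cell type taken from each process term's logical definition.
PROCESS_CELL = {
    "GO:microglial_cell_activation": "CL:microglial_cell",
    "GO:monocyte_differentiation": "CL:monocyte",
    "GO:neuron_differentiation": "CL:neuron",
}
ANNOTATIONS = [
    ("geneA", "GO:microglial_cell_activation"),
    ("geneB", "GO:monocyte_differentiation"),
    ("geneC", "GO:neuron_differentiation"),
]


def is_kind_of(cell, target):
    """True if cell is target or reaches it through is_a links."""
    if cell == target:
        return True
    return any(is_kind_of(parent, target) for parent in CELL_IS_A.get(cell, []))


def genes_associated_with(target):
    """Genes annotated to any process whose defined cell type is a kind of target."""
    return sorted({
        gene
        for gene, term in ANNOTATIONS
        if term in PROCESS_CELL and is_kind_of(PROCESS_CELL[term], target)
    })


print(genes_associated_with("CL:blood_cell"))
```

Neither annotation mentions blood cells directly; the genes are found only because the logical definitions link the process terms to cell types.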
Post-composition
• This approach is highly compatible with post-composition
• We should extend the annotation format to
allow denoting more specific classes
– e.g.
• cholesterol transport in liver
– advanced applications can query this
– standard applications suffer no loss
– extended annotations can be used to help seed
new terms in the ontology
• This is already being done (MGI,Dicty)
– we just want to capture this in an interoperable way
Post-composition in gene
association files
• New column in file format
…  Gene Product  Term ID                                Slots
   AABC1         GO:0030301 (cholesterol transport)     OBOREL:located_in[MA:liver]
   AABC2         GO:0048663 (neuron fate development)   OBOREL:has_primary_participant[FBbt:Y_neuron]
   AABC3         GO:000003
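A slot value of the form relation[filler] is simple to parse. This is a sketch under the assumption that each slot is exactly one relation ID followed by one bracketed term ID, as in the two examples on the slide; the proposed column format is not specified further in the source.

```python
import re

# One relation ID, then one term ID in square brackets.
SLOT = re.compile(r"^(?P<relation>[^\[\]]+)\[(?P<filler>[^\[\]]+)\]$")


def parse_slot(text):
    """Split 'RELATION[FILLER]' into (relation, filler)."""
    match = SLOT.match(text.strip())
    if match is None:
        raise ValueError(f"not a slot expression: {text!r}")
    return match.group("relation"), match.group("filler")


print(parse_slot("OBOREL:located_in[MA:liver]"))
print(parse_slot("OBOREL:has_primary_participant[FBbt:Y_neuron]"))
```

A tool that does not understand slots can skip the column entirely and still use the Term ID, which is the "small loss in specificity" fallback.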
Important note on post-composition
• This is not an either-or situation
• We will retain pre-composed terms
– terms will continue to be created for real biological
types
• Annotation post-composition can be used to
further refine existing pre-composed terms
– if the post-composed term is later created in the
GO, the annotation can be automatically migrated
• Tools can ignore post-composition for small
loss in specificity
– defaults to the current paradigm
Avoiding diamonds
• Surely larval locomotory
behavior involves a diamond?
• yes, but we can disentangle the two
axes of classification
Solution
• Curator asserts:
id: GO:larval_locomotory_behavior
intersection_of: GO:locomotory_behavior
intersection_of: occurs_in FBbt:larval_stage
• Oboedit infers diamond:
id: GO:larval_locomotory_behavior
intersection_of: GO:locomotory_behavior
intersection_of: occurs_in FBbt:larval_stage
is_a: GO:locomotory_behavior ! genus
is_a: GO:larval_behavior ! inferred
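The inference step can be sketched with a very small subsumption check. This is not the oboedit reasoner: it assumes that larval behavior has its own logical definition (behavior that occurs_in larval stage), which the slide does not show, and it compares differentia fillers by exact match only.

```python
# Asserted is_a links (child -> parents). Placeholder IDs as on the slide.
IS_A = {"GO:locomotory_behavior": {"GO:behavior"}}

# Logical definitions: term -> (genus, set of (relation, filler)).
# The definition of GO:larval_behavior is an assumption made for this sketch.
LDEFS = {
    "GO:larval_behavior": (
        "GO:behavior", {("occurs_in", "FBbt:larval_stage")}),
    "GO:larval_locomotory_behavior": (
        "GO:locomotory_behavior", {("occurs_in", "FBbt:larval_stage")}),
}


def ancestors_or_self(term):
    """The term plus everything reachable through asserted is_a links."""
    seen, stack = {term}, [term]
    while stack:
        for parent in IS_A.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inferred_parents(term):
    """Genus plus every defined term whose definition subsumes this one."""
    genus, diffs = LDEFS[term]
    parents = {genus}
    for other, (other_genus, other_diffs) in LDEFS.items():
        if other == term:
            continue
        # other subsumes term if its genus is at or above ours
        # and all of its differentiae are among ours.
        if other_genus in ancestors_or_self(genus) and other_diffs <= diffs:
            parents.add(other)
    return parents


print(sorted(inferred_parents("GO:larval_locomotory_behavior")))
```

The curator asserts only the genus and the differentia; the second is_a parent, and with it the diamond, falls out of the comparison of definitions.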
Next Steps
• Tidy up cell logical definitions
• integrate them into curation process
• Look at composite terms within GO
– larval locomotory behaviour
– regulation
• Chemicals
• Anatomical entities