Pre-coordination

GO terms implicitly refer to other terms
• cysteine biosynthesis
• myoblast fusion
• hydrogen ion transporter activity
• snoRNA catabolism
• wing disc pattern formation
• epidermal cell differentiation
• regulation of flower development
• interleukin-18 receptor complex
• B-cell differentiation
• dorsal ectoderm
[Diagram: biosynthesis is_a metabolism; cysteine is_a serine family amino acid is_a amino acid is_a amine. The cysteine chain is drawn a second time alongside serine.]
Composed terms currently cause problems
– No link to external ontology term
– Redundancy
– Inconsistency
– Extra work
– Annotation bottleneck
– Tangled DAGs and confusing displays
• we have no way to disentangle
• Solution so far:
– fix errors based on results of term name parsing (Obol)
• reactive, not proactive
Solution: actively manage composed terms
• Explicit pre-coordination
– Composed terms should now/soon be coordinated
using oboedit plugin
• building block terms are recorded in ontology along
with composite term
• Benefits:
– Correct DAG structure can be inferred from
external ontologies
• e.g. make sure GO + CHEBI “align”
– placement & consistency checking automated
– additional work can be automated
• synonyms, text definitions
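The first benefit can be illustrated with a minimal sketch. It assumes each composed term is stored as a (generic term, relation, external term) triple and that is_a links are a simple single-parent table; the data structures and the function name `inferred_superclasses` are illustrative only and are not how oboedit or Obol work internally.

```python
# Toy is_a links. The chemical chain follows the slides; a real run would
# load these from GO and CHEBI.
IS_A = {
    "cysteine": "serine family amino acid",
    "serine family amino acid": "amino acid",
    "amino acid": "amine",
    "biosynthesis": "metabolism",
}

# Composed terms as (generic term, relation, external term). Illustrative data.
DEFS = {
    "cysteine biosynthesis": ("biosynthesis", "outputs", "cysteine"),
    "serine family amino acid biosynthesis": (
        "biosynthesis", "outputs", "serine family amino acid"),
    "amino acid biosynthesis": ("biosynthesis", "outputs", "amino acid"),
}


def ancestors_or_self(term):
    """Return the term followed by all of its is_a ancestors."""
    chain = [term]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain


def inferred_superclasses(name):
    """Terms that must subsume `name`, judged from the building blocks alone."""
    genus, relation, filler = DEFS[name]
    result = []
    for other, (genus2, relation2, filler2) in DEFS.items():
        if other == name or relation2 != relation:
            continue
        # The other term is more general if both of its parts are more general.
        if genus2 in ancestors_or_self(genus) and filler2 in ancestors_or_self(filler):
            result.append(other)
    return result


print(inferred_superclasses("cysteine biosynthesis"))
```

Because cysteine is_a serine family amino acid in the chemical hierarchy, the GO-side is_a link falls out automatically; a missing or contradictory link in the hand-built DAG would show up as a difference from this list.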
How will terms be precoordinated by oboedit?
• How do we record a definition for a composite
term?
– using a logical definition (computational essence)
• A logical definition consists of:
– a generic term (aka genus)
– relationships to other terms which serve to
discriminate this specific term from other is_a
children of the generic term (aka differentiae)
• Can be written in natural language as:
– A <generic term> which <discriminating
characteristics>
Example of pre-coordination
• cysteine biosynthesis
• generic term:
– biosynthesis
• discriminating characteristics:
– outputs cysteine
– natural language (Aristotelian style):
• a biosynthesis process which outputs cysteine
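As a rough sketch of how a text definition could be generated from a logical definition, the snippet below fills in the "A <generic term> which <discriminating characteristics>" pattern. The class names `Differentia` and `LogicalDefinition` are invented for this illustration and do not correspond to any oboedit API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Differentia:
    relation: str  # e.g. "outputs"
    filler: str    # name of the related term, e.g. "cysteine"


@dataclass(frozen=True)
class LogicalDefinition:
    genus: str                            # the generic term
    differentiae: Tuple[Differentia, ...]  # the discriminating characteristics

    def to_text(self) -> str:
        """Render as 'A <generic term> which <discriminating characteristics>'."""
        characteristics = " and ".join(
            f"{d.relation} {d.filler}" for d in self.differentiae
        )
        return f"A {self.genus} which {characteristics}"


cysteine_biosynthesis = LogicalDefinition(
    genus="biosynthesis",
    differentiae=(Differentia("outputs", "cysteine"),),
)
print(cysteine_biosynthesis.to_text())
```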
Example in Obo format
[Term]
id: GO:0019344
name: cysteine biosynthesis
intersection_of: GO:0009058 ! biosynthesis
intersection_of: outputs CHEBI:15356 ! cysteine
is_a: GO:0009070 ! serine family amino acid biosynthesis
is_a: GO:0006534 ! cysteine metabolism
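To show how the stanza separates into generic term and discriminating characteristics, here is a small hand-rolled reader. It is a sketch that handles only the two intersection_of shapes used in this presentation (a bare ID, or a relation followed by an ID); it is not a general Obo parser, and the function name is made up.

```python
STANZA = """\
[Term]
id: GO:0019344
name: cysteine biosynthesis
intersection_of: GO:0009058 ! biosynthesis
intersection_of: outputs CHEBI:15356 ! cysteine
is_a: GO:0009070 ! serine family amino acid biosynthesis
is_a: GO:0006534 ! cysteine metabolism
"""


def parse_logical_definition(stanza):
    """Return (genus_id, [(relation, filler_id), ...]) for one [Term] stanza."""
    genus = None
    differentiae = []
    for line in stanza.splitlines():
        if not line.startswith("intersection_of:"):
            continue
        # Drop the tag name and the trailing "! comment", then split on spaces.
        value = line.split(":", 1)[1].split("!", 1)[0].split()
        if len(value) == 1:
            genus = value[0]                           # bare ID: the generic term
        elif len(value) == 2:
            differentiae.append((value[0], value[1]))  # relation + ID
    return genus, differentiae


print(parse_logical_definition(STANZA))
```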
Alternate syntax
GO:cysteine_biosynthesis == GO:biosynthesis ∏ outputs(CHEBI:cysteine)
• used in pheno-syntax
• more compact
• similar to OWL abstract syntax
• I use Obo1.2 format or natural language in the rest of this presentation
This allows us to dynamically untangle
• Process axis view (primary is_as, via generic
term):
– biological_process
• metabolism
– biosynthesis
» cysteine biosynthesis
• Process participant axis view:
– amine
• amino acid
– serine family amino acid
» cysteine
• Combined view
– (same as current tangled diamond lattice)
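The two axis views can be derived mechanically once the generic term and the participant are recorded separately. The following is a minimal sketch over a toy single-parent is_a table taken from the slide; the function names are hypothetical.

```python
# Toy is_a links, one parent per term, copied from the views above.
IS_A = {
    "biosynthesis": "metabolism",
    "metabolism": "biological_process",
    "cysteine": "serine family amino acid",
    "serine family amino acid": "amino acid",
    "amino acid": "amine",
}

# Logical definition: term -> (generic term, relation, participant)
DEFINITION = {"cysteine biosynthesis": ("biosynthesis", "outputs", "cysteine")}


def path_from_root(term):
    """Follow is_a links upward and return the path root-first."""
    path = [term]
    while path[-1] in IS_A:
        path.append(IS_A[path[-1]])
    return list(reversed(path))


def axis_views(term):
    """Split one composed term into its process axis and participant axis."""
    genus, _relation, participant = DEFINITION[term]
    return {
        "process": path_from_root(genus) + [term],
        "participant": path_from_root(participant),
    }


views = axis_views("cysteine biosynthesis")
print(views["process"])
print(views["participant"])
```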
Obol demo
• http://yuri.lbl.gov/amigo/obol
Recording the relationship is important
• Why not just a simple cross-product?
– e.g. biosynthesis x cysteine
• Relationships are important for reasoning and
querying
– Consider:
• cysteine biosynthesis from serine
• mRNA export from nucleus during heat stress
• Without the relations, the logical definition is
not specific enough
– the essence is not captured
• Relations should come from RO
– more required
Multiple discriminating characteristics are allowed
• Cysteine biosynthesis from serine
– Generic term:
• biosynthesis
– Discriminating characteristics:
• output cysteine
• input serine
[Term]
name: cysteine biosynthesis from serine
intersection_of: GO:0009058 ! biosynthesis
intersection_of: outputs CHEBI:15356 ! cysteine
intersection_of: input CHEBI:17822 ! serine
Composite terms can be nested
[Term]
id: GO:xxxxxxx
name: regulation of cysteine biosynthesis
intersection_of: GO:0050789 ! regulation of biological process
intersection_of: regulates GO:0019344 ! cysteine biosynthesis
[Term]
id: GO:0019344
name: cysteine biosynthesis
intersection_of: GO:0009058 ! biosynthesis
intersection_of: outputs CHEBI:15356 ! cysteine
YES: regulation^regulates(biosynthesis^outputs(cysteine))
NO: regulation^regulates(biosynthesis)^outputs(cysteine)
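The difference between the two bracketings is easier to see when the expression is built as a data structure. This sketch uses an invented `Composite` class whose discriminating characteristics may themselves point at another `Composite`; it only illustrates the nesting and is not the pheno-syntax implementation.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Composite:
    genus: str
    # Each entry is (relation, filler); a filler is a term name or a Composite.
    differentiae: Tuple[Tuple[str, Union[str, "Composite"]], ...] = ()


def render(term):
    """Write a term in the compact genus^relation(filler) syntax."""
    if isinstance(term, str):
        return term
    return term.genus + "".join(
        f"^{relation}({render(filler)})" for relation, filler in term.differentiae
    )


cysteine_biosynthesis = Composite("biosynthesis", (("outputs", "cysteine"),))

# Correct: the regulated process is the whole nested composite.
nested = Composite("regulation", (("regulates", cysteine_biosynthesis),))

# Incorrect: "outputs cysteine" is attached to the regulation itself.
flat = Composite("regulation", (("regulates", "biosynthesis"), ("outputs", "cysteine")))

print(render(nested))
print(render(flat))
```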
Composite terms can optionally be manufactured in bulk
• Generic term:
{metabolism,biosynthesis}
• Differentia: has_output {serine,
cysteine, …}
• With caution…
– Sparse vs dense matrices
– not all combinations are types
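A bulk run is essentially a filtered cross-product. The sketch below assumes a curator-supplied exclusion set for combinations that are not real types; the function name, the record layout and the example exclusion are all hypothetical.

```python
from itertools import product

GENERIC_TERMS = ["metabolism", "biosynthesis"]
CHEMICALS = ["serine", "cysteine"]


def manufacture(generics, fillers, relation="has_output", excluded=frozenset()):
    """Build one composite term record per allowed (generic term, filler) pair."""
    terms = []
    for generic, filler in product(generics, fillers):
        if (generic, filler) in excluded:
            continue  # "not all combinations are types"
        terms.append({
            "name": f"{filler} {generic}",
            "genus": generic,
            "relation": relation,
            "filler": filler,
        })
    return terms


all_terms = manufacture(GENERIC_TERMS, CHEMICALS)
print([t["name"] for t in all_terms])

# Arbitrary exclusion, chosen only to show the filter working.
some_terms = manufacture(GENERIC_TERMS, CHEMICALS,
                         excluded={("metabolism", "serine")})
print([t["name"] for t in some_terms])
```

With a dense matrix almost every pair is kept; with a sparse one the exclusion set (or an explicit inclusion list) does most of the work, which is where the caution comes in.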
On the importance of necessary and sufficient conditions
• Why intersection_of?
• Why not just make normal links in the
GO DAG?
– normal relationships are for necessary
conditions only
– we want both necessary and sufficient
conditions
• captures the essence of the term
Normal DAG links only capture necessary conditions, not essence
[Diagram: macrophage activation is_a immune cell activation; macrophage activation part_of inflammatory response]
text def: A change in morphology and behavior of a macrophage resulting from exposure to a cytokine, chemokine, cellular ligand, pathogen, or soluble factor
Indistinguishable by DAG
[Diagram: monocyte activation is_a immune cell activation; monocyte activation part_of inflammatory response]
text def: A change in morphology and behavior of a monocyte resulting from exposure to a cytokine, chemokine, cellular ligand, pathogen, or soluble factor
Essence captured by genus-differentia
[Diagram: macrophage activation is_a immune cell activation; macrophage activation part_of inflammatory response]
id: GO:macrophage_activation
intersection_of: GO:cell_activation
intersection_of: activates CL:macrophage
Essence captured by genus-differentia
[Diagram: immune cell activation is_a cell activation (the genus); macrophage activation is_a immune cell activation; macrophage activation part_of inflammatory response; macrophage activation activates CL:macrophage]
id: GO:macrophage_activation
intersection_of: GO:cell_activation
intersection_of: activates CL:macrophage
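The contrast between the last few slides can be sketched in a few lines. The DAG links below are the ones drawn in the diagrams; the logical definition for monocyte activation is an assumption made by analogy with the macrophage one (it is not given in the slides), and `recognise` is a made-up helper.

```python
# Necessary conditions only: the plain DAG links from the diagrams.
DAG_LINKS = {
    "macrophage activation": {
        ("is_a", "immune cell activation"),
        ("part_of", "inflammatory response"),
    },
    "monocyte activation": {
        ("is_a", "immune cell activation"),
        ("part_of", "inflammatory response"),
    },
}

# Necessary and sufficient conditions: genus plus discriminating characteristics.
# The monocyte entry is assumed by analogy, not taken from the slides.
LOGICAL_DEFS = {
    "macrophage activation": ("GO:cell_activation",
                              frozenset({("activates", "CL:macrophage")})),
    "monocyte activation": ("GO:cell_activation",
                            frozenset({("activates", "CL:monocyte")})),
}


def recognise(genus, differentiae):
    """Return the terms whose logical definition matches the description."""
    wanted = (genus, frozenset(differentiae))
    return [name for name, definition in LOGICAL_DEFS.items()
            if definition == wanted]


# The DAG alone cannot tell the two terms apart.
print(DAG_LINKS["macrophage activation"] == DAG_LINKS["monocyte activation"])

# The logical definitions can, and they let software name a described process.
print(recognise("GO:cell_activation", {("activates", "CL:macrophage")}))
```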
Current status of precoordinated terms
• SO already contains composite terms
– 46 pre-coordinated terms
– A silenced gene is a gene which has the
quality of being silenced
• GO-BP/CL integration underway
– retrospectively pre-coordinated terms
• Obol page has pre-coordinated terms from
automatic parsing
– http://www.fruitfly.org/~cjm/obol
Pre- vs post- coordinated
• Pre-coordination
– terms are in ontology with IDs and computable
definitions
– increases complexity of ontology
– complexity can be managed by tools
• e.g. new oboedit features
• Post-coordination
– terms are combined in the database
– forces more complexity in database schema and
database applications
Pre-coordination is useful in moderation
• Commonly used terms should be precoordinated
• e.g. cysteine biosynthesis; oocyte differentiation; pectoral fin
• Avoid taking to extremes
• cf ICD-9
• Where do we draw the line?
– ontologies should be built around one or a few
axes of classification
• term ‘explosion’ typically gets large when multiple axes
are combined
– we can change our minds later
• pre- and post- coordination are commensurable
Commensurability
• Annotator annotates to
– nucleus^part_of(astrocyte)
• Anatomy editor creates new term
– uses oboedit cross-product plugin
– astrocyte_nucleus = nucleus^part_of(astrocyte)
• Annotation can be dynamically ‘promoted’ to
new term in answer to queries
– various software techniques for achieving this
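One simple technique for the promotion step is a lookup from the post-coordinated expression to the pre-coordinated term, applied at query time. The sketch below represents expressions as (generic term, relation, filler) tuples and uses made-up gene names; it is one possible approach, not a description of existing GO software.

```python
# Pre-coordinated terms known to the ontology, keyed by their logical definition.
PRECOORDINATED = {}

# Stored annotations: gene product -> post-coordinated expression.
# The gene names are placeholders.
ANNOTATIONS = {
    "geneA": ("nucleus", "part_of", "astrocyte"),
    "geneB": ("nucleus", "part_of", "hepatocyte"),
}


def promote(expression):
    """Return the named term if one exists, otherwise the expression unchanged."""
    return PRECOORDINATED.get(expression, expression)


def query():
    """Answer a query with annotations promoted wherever a term now exists."""
    return {gene: promote(expr) for gene, expr in ANNOTATIONS.items()}


before = query()

# The anatomy editor later creates the new term.
PRECOORDINATED[("nucleus", "part_of", "astrocyte")] = "astrocyte_nucleus"

after = query()
print(before["geneA"], "->", after["geneA"])
```

The stored annotation never changes; only the answer to the query does, which is why the editor's decision can be made (or reversed) later.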
Post-coordination in GO annotations
• Pre- and post- coordination are compatible and commensurable
• We should extend the annotation format to
allow denoting more specific classes
– e.g.
• cholesterol transport in liver
– advanced applications can query this
– standard applications suffer no loss
– extended annotations can be used to help seed
new terms in the ontology
• This is already being done (MGI,Dicty)
– we just want to capture this in an interoperable way
Post-composition in gene association files
• New column in GA file format
…   Gene Product   Term ID                                 Properties
…   AABC1          GO:0030301 (cholesterol transport)      located_in(MA:liver)
…   AABC2          GO:0048663 (neuron fate development)    has_participant(FBbt:Y_neuron)
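A reader for the proposed Properties column could be as small as the sketch below. It assumes each property is written as relation(ID) exactly as in the two example rows; how several properties in one cell would be separated is not specified here, so the pattern simply picks up every relation(ID) it finds.

```python
import re

# relation name, then an ID in parentheses, e.g. located_in(MA:liver)
PROPERTY = re.compile(r"(\w+)\(([^()]+)\)")


def parse_properties(cell):
    """Return a list of (relation, term ID) pairs found in a Properties cell."""
    return PROPERTY.findall(cell)


print(parse_properties("located_in(MA:liver)"))
print(parse_properties("has_participant(FBbt:Y_neuron)"))
```

An application that does not understand the column can ignore it and still use the Term ID as before.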
Database issues
• Chado and GO DB can handle pre- and
post- coordination
– in theory anyway
• not yet fully tested
• How does it work?
– ‘anonymous term’ created for coordinated
term
– documentation in chado cvs
• chado/modules/cv/doc/