NCBIO-Berkeley
Some thoughts on PATO
Chris Mungall
BBOP
Hinxton
May 2006
Outline
Motivation revisited
The Ontology: PATO
OBD & using PATO for annotation
Who should use PATO?
Originally:
model organism mutant phenotypes
But also:
ontology-based evolutionary systematics
neuroscience; BIRN
clinical uses
OMIM
clinical records
to define terms in other ontologies
e.g. diploid cell; invasive tumor, engineered gene,
condensed chromosome
Unifying goal: integration
Integrating data
within and across these domains
across levels of granularity
across different perspectives
Requires
Rigorous formal definitions in both
ontologies and annotation schemas
Some thoughts on the
ontology itself
Outline
Definitions
how do we define PATO terms?
what exactly is it we’re defining?
is_a hierarchy
what are the top-level distinctions?
what are the finer grained distinctions?
shapes and colors
It’s all about the definitions
Everything is doomed to failure without
rigorous definitions
even more so with PATO than other ontologies
OBO Foundry Principle
Definitions should describe things in reality, not
how terms are used
def should not use the word ‘describing’
Should we come up with a policy for
definitions in PATO
currently: 19 defs (2.5 are circular)
proposed breakout session: examine all these
consistency: the property of holding together and retaining shape
amplitude: The size of the maximum displacement from the 'normal' position, when
periodic motion is taking place
placement: The spatial property of the way in which something is placed
pointed value: A sharp or tapered end
epinastic value: A downward bending of leaves or other plant parts
oblong value: Having a somewhat elongated form with approximately parallel sides
elliptic value: Elliptic shape
hearted value: Heart shaped
fasciated value: Abnormally flattened or coalesced
opacity: The property of not permitting the passage of electromagnetic radiation
opaque value: Not clear; not transmitting or reflecting light or radiant energy
undulate value: Having a sinuate margin and rippled surface
permeability: The property of something that can be pervaded by a liquid (as by osmosis
or diffusion)
porosity: The property of being porous; being able to absorb fluids
porous value: able to absorb fluids
viscosity: a property of fluids describing their internal resistance to flow
viscous value: a relatively high resistance to flow.
latency: The time that elapses between a stimulus and the response to it
power: The rate at which work is done
Proposal: genus-differentia
definitions
An S is a G which D
Each def should refine the is_a parent
Single is_a parent
Example: (non-PATO)
binucleate cell def= a cell which has two nuclei
Example (proposed PATO def):
convex shape def= a shape which has no indentations
opacity def= an optical quality which exists by virtue
of the bearer’s capacity to block the passage of
electromagnetic radiation
very similar to existing def
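A minimal sketch of how the "An S is a G which D" pattern could be applied mechanically. The function names are hypothetical, and the circularity check is only a crude word-overlap heuristic (it would catch "elliptic value: Elliptic shape" but not "porosity: The property of being porous").

```python
def genus_differentia(term: str, genus: str, differentia: str) -> str:
    """Render a definition in the pattern 'An S is a G which D'."""
    article = "an" if genus[0].lower() in "aeiou" else "a"
    return f"{term} def= {article} {genus} which {differentia}"


def looks_circular(term: str, definition_text: str) -> bool:
    """Crude check: does the definition reuse a word from the term it defines?

    The generic word 'value' is ignored because it appears in many current
    PATO term names without carrying their meaning.
    """
    term_words = {w for w in term.lower().split() if w != "value"}
    definition_words = {w.strip(".,;") for w in definition_text.lower().split()}
    return bool(term_words & definition_words)


print(genus_differentia("convex shape", "shape", "has no indentations"))
print(looks_circular("elliptic value", "Elliptic shape"))
```

Because the genus is passed in explicitly, each definition is forced to name its single is_a parent, which is the point of the policy.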
This policy will reap benefits
Advantages:
Helps avoid circularity
Ensures precision
Consistency in wording is user-friendly
Considerations:
Sometimes leads to awkward phrasing
-ity suffix - “an opacity which…”
Solution:
allow shortened gerund form
having…, being…., ….
most of the existing defs conform already
implicit prefix “A G which exists by virtue of the bearer…”
From the top down
First, the fake term ‘pato’ must be removed
How do we define ‘attribute’?
Note: I prefer the term ‘quality’ or ‘property’
attribute implies attribution
length_in_centimetres is an attribute
we can of course continue to say ‘attribute’ but I
use ‘quality’ in these slides
most of the new PATO defs are phrased as ‘a
property of…’ which I like, but this is inconsistent with
calling the root ‘attribute’
Well then, what is a quality/property?
What a quality is NOT
Qualities are not measurements
Instances of qualities exist independently of their
measurements
Qualities can have zero or more measurements
These are not the names of qualities:
percentage
process
abnormal
high
Some examples of qualities
The particular redness of the left eye of a
single individual fly
An instance of a quality type
The color ‘red’
A quality type
Note: the eye does not instantiate ‘red’
PATO represents quality types
PATO definitions can be used to classify quality
instances by the types they instantiate
[Diagram: the particular case of redness (of a particular fly eye) instantiates the type “red”; an instance of an eye (in a particular fly) instantiates the type “eye”; the redness inheres in (is a quality of, has_bearer) the eye instance]
Qualities are dependent
entities
Qualities require bearers
Bearers can be physical objects or processes
Example:
A shape requires a physical object to bear it
If the physical object ceases to exist (e.g. it
decomposes), then the shape ceases to exist
Some qualities are relational
they relate a bearer with other entities
e.g. sensitivity (to)
Compare with: functions
The PATO hierarchy
Proposal for a new top level division
Proposal for granular divisions
Proposal 1: top level division
Spatial quality
Definition: A quality which has a physical object as
bearer
Examples: color, shape, temperature, velocity,
ploidy, furriness, composition, texture
Spatiotemporal quality
Definition: A quality which has a process as bearer
Examples: rate, periodicity, regularity, duration
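The proposed top-level division depends only on the kind of bearer, so it can be sketched as a simple rule. The class and function names below are hypothetical, and "heartbeat" is an invented example bearer.

```python
from dataclasses import dataclass
from enum import Enum


class BearerKind(Enum):
    PHYSICAL_OBJECT = "physical object"
    PROCESS = "process"


@dataclass(frozen=True)
class QualityInstance:
    quality_type: str        # e.g. "shape", "rate"
    bearer: str              # e.g. "fly eye"; every quality requires a bearer
    bearer_kind: BearerKind


def top_level_division(quality: QualityInstance) -> str:
    """Classify a quality by the kind of entity that bears it."""
    if quality.bearer_kind is BearerKind.PHYSICAL_OBJECT:
        return "spatial quality"
    return "spatiotemporal quality"


eye_shape = QualityInstance("shape", "fly eye", BearerKind.PHYSICAL_OBJECT)
beat_rate = QualityInstance("rate", "heartbeat", BearerKind.PROCESS)
print(top_level_division(eye_shape))
print(top_level_division(beat_rate))
```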
Proposal 2: subsequent
divisions
Based on granularity (i.e. size scale)
a good account of granularity is vital for inferences
from molecular (gene) level to organismal
(disease) level
How do we partition the levels?
Some qualities are realised at certain levels
of granularity
Others can be realised across levels
shape, porosity
Sum-of-parts vs emergent
Scale | Bearer | Quality | Definition (proposed)
Physical | Cont. | Mass | Equivalent to the sum of the mass of the parts of the bearer (mass at the particle level is primitive/outwith PATO)
Physical | Cont. | Opacity | An optical quality manifest by the capacity of the bearer to block light
Phys/Chem | Liquid | Concentration | A compositional relational quality manifest by the relative quantity of some chemical type contained by the bearer
Molecular | Gene | splicing quality | manifest by the splicing processes undergone by the bearer
Cellular | Cell | ploidy | A cellular quality manifest by the number of genomes that are part of the bearer
Cellular | Cell | transformative potency?? | A cellular quality manifest by the capacity of the bearer cell to differentiate to different cell types

Scale | Bearer | Quality | Definition (proposed)
 | Cont. | morphology (_ shape; __ 2D shape; __ 3D shape) | A morphological quality which is manifest
Granular hierarchy
quality
spatial quality
spatial physical and physico-chemical quality
mass, concentration
spatial biological quality
spatial molecular quality
spatial cellular quality
spatial organismal quality
spatial quality, multiple scales
morphology/form
optical quality
color, opacity, fluorescence
Advantages of dividing by
granularity
Modular
strategic question
should we focus on biological qualities and work with
others on morphology, physics-based qualities etc?
Good for annotation
easy to constrain at high level
e.g. organismal qualities cannot be borne by molecules
Mirrors GO and OBO Foundry divisions
Easier to find terms
to be proved, but I believe so
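A sketch of how a granularity-based constraint such as "organismal qualities cannot be borne by molecules" might be checked at annotation time. The rule used here (scales must match unless the quality is cross-granular) is an assumption for illustration, not something the proposal specifies; the lookup tables are hypothetical and contain only examples taken from these slides.

```python
MULTI_SCALE = "multiple scales"

# Scale at which each quality is realised (illustrative entries only).
QUALITY_SCALE = {
    "ploidy": "cellular",
    "shape": MULTI_SCALE,
    "porosity": MULTI_SCALE,
}

# Scale of each bearer type (illustrative entries only).
BEARER_SCALE = {
    "cell": "cellular",
    "gene": "molecular",
}


def annotation_allowed(quality: str, bearer: str) -> bool:
    """Assumed rule: scales must match unless the quality spans scales."""
    quality_scale = QUALITY_SCALE[quality]
    return quality_scale == MULTI_SCALE or quality_scale == BEARER_SCALE[bearer]


print(annotation_allowed("ploidy", "cell"))
print(annotation_allowed("ploidy", "gene"))
print(annotation_allowed("shape", "gene"))
```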
Considerations
Possible objection:
The upper level of an ontology is what the
user sees first
terms such as “cross-granular quality” may
be perceived as undesirable and/or
abstruse by some users
Counter-argument
Solvable using ontology views
aka subsets, slims
Relative and absolute
Currently PATO terms often come in 3s:
e.g. mass, relative mass, absolute mass
Why do we need these?
PATO: One or two
hierarchies?
Currently two hierarchies
attribute
value
My position:
there should be one hierarchy of qualities
My compromise:
it should be possible to transform PATO
automatically into a single hierarchy
[Diagram: Current PATO. Two parallel is_a hierarchies, attribute (color; hue, sat., var.) and value (colorV; hueV, sat.V, var.V; blueV, darkV, paleV, blackV, …), linked by ‘range’]
[Diagram: Proposed change. A single is_a hierarchy under attribute: color; hue, sat., var.; blue, dark, pale, black, …]
Arguments for a single
hierarchy
Practical
elimination of redundancy
no clear line for deciding what should be A
and what should be V
shape, bumpy vs bumpiness
Ontological
what kind of thing is a ‘value’?
Diederich 1997: [quote here]
Arguments against
Two hierarchies reflect cognitive and linguistic
structures
e.g. the color of the rose changed from red to
brown
3 cognitive artifacts
we want to present data in a way that is natural to
users
…but this can be solved with a single collapsed hierarchy
Two are useful for cross-products
see later - distinguish modifiers from values
EAV is a common database pattern
so…?
Compromise: transformations
The Two Hierarchies approach is workable if
they can be automatically collapsed
Prerequisite: univocity
Each ‘value’ must be defined to mean exactly one
thing only
i.e. Each ‘value’ must be the ‘range’ of a single attribute
Example
having a value ‘fast’ that could be applied to both the
spatial quality ‘velocity’ and the process quality ‘duration’
would be forbidden
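The univocity prerequisite and the collapse on ‘ranges’ can be sketched as follows. The data structure (a list of value/attribute pairs) and the function names are assumptions made for illustration; the ‘fast’ example is the forbidden case described above.

```python
from collections import defaultdict


def univocity_violations(range_links):
    """Return every value that is the 'range' of more than one attribute."""
    attributes_by_value = defaultdict(set)
    for value, attribute in range_links:
        attributes_by_value[value].add(attribute)
    return {
        value: sorted(attributes)
        for value, attributes in attributes_by_value.items()
        if len(attributes) > 1
    }


def collapse(range_links):
    """Collapse two hierarchies into one: each value gets its attribute as is_a parent."""
    violations = univocity_violations(range_links)
    if violations:
        raise ValueError(f"cannot collapse, values are not univocal: {violations}")
    return {value: attribute for value, attribute in range_links}


links = [("blue", "hue"), ("fast", "velocity"), ("fast", "duration")]
print(univocity_violations(links))
print(collapse([("blue", "hue")]))
```

Refusing to collapse when a value has two candidate parents keeps the transformation automatic: no human decision is needed as long as the prerequisite holds.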
Collapse on ‘ranges’
[Diagram: the attribute hierarchy (color; hue, sat., var.) and the value hierarchy (colorV; hueV, sat.V, var.V; blueV, darkV, paleV, blackV, …) with their is_a and range links]
Shapes and colors
How many types of shape are
there?
notched, T-shaped, Y-shaped,
branched, unbranched, antrorse, retrorse,
curled, curved, wiggly, squiggly, round,
flat, square, oblong, elliptical, ovoid,
cuboid, spherical, egg-shaped, rod-shaped, heart-shaped, …
How do we define them?
How do we compare them?
Is it worth the effort?
Shape types need precise
definitions to be useful
Real shapes are not mathematical entities
but mathematical definitions can help
Axes of classification:
Dimensionality
2-4D (process “shapes”)
concave vs convex
angular vs non-angular
number of
sides
corners
Primitive and composed shapes
Work with morphometrics community?
Shape likeness
We can post-coordinate some shape types
egg-shaped
head-shaped
A2-segment-shaped
Dangers of circularity
Only for genuine likeness (e.g. homeotic
transformation)
not “heart-shaped leaf”
See annotation section of this presentation
Color
Keep PATO HSV model
but is black a color hue?
We should allow overlapping partitions of
color space
different domains have ‘sub-terminologies’ of color
Is color relational?
Humans vs tetrachromatic UV-seeing animals
Composition
using has_part
Color hierarchy
Physical quality
Optical quality: a physical quality which exists in virtue of the
bearer interacting with visible electromagnetic radiation
Chromatic quality: an optical quality which exists in virtue of the bearer
emitting, transmitting or reflecting visible electromagnetic radiation
Color hue
Color saturation
Color variation
Color
Opacity: an optical quality which exists in virtue of the bearer absorbing
visible electromagnetic radiation
opaque
translucent
transparent
Part 2: Annotation using PATO
Annotation scheme desiderata
OBD Dataflow
Proposed annotation scheme
Annotation scheme desiderata
Rigour
There is a subset of the scheme which
is simple
The entire scheme is expressive
It should have an unambiguous
mapping to real world entities
Even if PATO is completely unambiguous, an ill-defined annotation scheme may leave room for
ambiguity
Example:
Annotation:
E=eye, Q=red
What does this mean?
both eyes are red in this one fly instance
at least one eye is red in this one fly instance
a typical eye is red in this many-eyed spider
both eyes are red in this one fly at some point in time
both eyes are red in this one fly at all times
all eyes are red in all flies in this experiment
some eyes are red in some flies in this experiment
There should be a certain usable
subset that is simple
Rationale - MODs have limited resources:
building entry tools for simple subsets is easier
building databases and query/search engines is easier
curating with a less expressive formalism is easier, faster
and requires less training
MODs’ primary use case is search, for which expressivity is
less useful
Specifics
Tools should have an (optional) simple facade
Simple annotations should be expressible in a simple syntax
that is understood by users with relatively little training
There should be an exchange format and/or database
schemas that use traditional technology as might be used in
a MOD
eg XML, relational tables
The scheme must be highly
expressive
Rationale
May be required by other NCBCs (BIRN)
May be required for cbio 200 gene list
Will be required in future
Specifics
Expressive superset will be optional
MODs can ‘pick and choose’ their subset
Native exchange and storage format will be logic-based
Details outwith scope of this presentation
Dataflow
How will various kinds of phenotypic
data get into OBD?
what kinds of data suppliers will use
different formalisms?
3 scenarios… (more possible)
Example dataflow I
Generic MOD curator annotates phenotypes
using Phenote
Annotations stored directly in MOD’s central
DB
MOD periodically submits to OBD
eg using Phenote to create pheno-xml
OBD converts pheno-xml to native logic-based formalism
Users can query MOD directly, or OBD
OBD will allow more expressive queries and have
more data integrated
Example dataflow 2
Non-MOD generates complex annotations
and stores them locally
e.g. BIRN group?
Periodic submissions to OBD
e.g. as OWL or Obo-format instance data
OBD converts to native logic-based formalism
Users can query OBD using more complex
queries
Example dataflow 3
cBio MOD curates 200 genes using Phenote
Annotations may be stored outside normal MOD schema
schema may not be expressive enough for complicated phenotypes
TBD - up to MOD
Periodic submissions to OBD
Phenote can be used to submit pheno-xml, OWL or OBO
MOD doesn’t have to worry about format
OBD converts to native formalism
Users can query OBD using relatively complex queries
Is this (should it be) different from #1?
[Diagram: dataflow from MOD A, MOD B, MOD C and a Non-MOD source into OBD, with a pheno-detailed XML file as an intermediate]
Proposed annotation schema
The schema will be described informally
using a simple syntax
I use ‘E’ for entity and ‘Q’ for quality
Pretend it is EAV if you like
with implicit superfluous ‘A’
The schema has (will have) a formal
interpretation
aim: database exchange and removal of
ambiguities
can be expressed using logical language
OBD will use an internal logic-based
representation
Outline of annotation schema
‘EAV’ or ‘EQ’ is not enough
Fine for (very) simple subset
Extensions:
time
relational qualities
post-coordination of entity types
count qualities
measurements
…
Standard case: monadic
qualities
Examples
E=kidney, Q=hypertrophied
autodef: a kidney which is hypertrophied
We assume that there is more contextual
data (not shown)
e.g. genotype, environment, number of organisms
in study that showed phenotype
Interpretation (with the rest of the database
record):
all fish in this experiment with a particular
genotype had a hypertrophied kidney at some
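A minimal sketch of reading the simple syntax and producing the autodef shown above. The function names are hypothetical, and the parser assumes that field values contain no commas.

```python
def parse_annotation(text: str) -> dict:
    """Split 'E=kidney, Q=hypertrophied' into {'E': 'kidney', 'Q': 'hypertrophied'}."""
    fields = {}
    for part in text.split(","):
        key, _, value = part.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def autodef(annotation: dict) -> str:
    """Build the automatic definition for a monadic E/Q annotation."""
    entity = annotation["E"]
    article = "an" if entity[0].lower() in "aeiou" else "a"
    return f"{article} {entity} which is {annotation['Q']}"


annotation = parse_annotation("E=kidney, Q=hypertrophied")
print(annotation)
print(autodef(annotation))
```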
Quantification
long thick thoracic bristles
2 statements
E=thoracic bristle, Q=long
E=thoracic bristle, Q=thick
Default interpretation
A typical thoracic bristle is long and thick
Optional entity quantifiers
EQuant={some,all,most,<percentage>,<count>}
E=thoracic bristle, Q=long, EQuant=80%
80% of the thoracic bristles in this one individual fly
OBD internal representation
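One way the optional entity quantifier might be interpreted, sketched under the assumption that a missing EQuant falls back to the default "typical" reading. The function name and the tagged-tuple return shape are illustrative choices only.

```python
from typing import Optional, Tuple, Union

KEYWORDS = {"some", "all", "most"}


def parse_equant(raw: Optional[str]) -> Tuple[str, Union[str, float, int]]:
    """Turn an EQuant value into a (kind, value) pair."""
    if raw is None:
        # No quantifier given: use the default interpretation.
        return ("default", "typical")
    raw = raw.strip()
    if raw in KEYWORDS:
        return ("keyword", raw)
    if raw.endswith("%"):
        return ("percentage", float(raw[:-1]))
    return ("count", int(raw))


print(parse_equant("80%"))
print(parse_equant(None))
```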
Time
Example:
E=brain,Q=small,during=stage
An E which has a quality that instantiates Q
during T
E has the quality Q for some extent of time,
and that extent overlaps T
during and other temporal relations will
come from the OBO Relations ontology
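The reading "E has the quality Q for some extent of time, and that extent overlaps T" reduces to an interval overlap test. The sketch below assumes half-open numeric intervals in arbitrary time units; the names and the numbers are illustrative only.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: float  # inclusive
    end: float    # exclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True when the two half-open intervals share some extent of time."""
    return a.start < b.end and b.start < a.end


def holds_during(quality_extent: Interval, stage: Interval) -> bool:
    """'during' as used above: the quality's extent overlaps the stage."""
    return overlaps(quality_extent, stage)


small_brain_extent = Interval(0, 5)
stage = Interval(3, 8)
print(holds_during(small_brain_extent, stage))
```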
Relational qualities
E.g. sensitivity
E=eye, Q=sensitive, E2=red light
Post-coordinating entity types
E=blood in head Q=pooled
Problem:
The E may not be pre-defined (pre-coordinated,
pre-composed) in the anatomy ontology
We can post-compose a type representation
(aka make a cross-product)
E=(blood has_location(head))
The ability to post-coordinate may not be
available in the ‘simple-subset’
can be expressed easily in pheno-xml, obo, owl,
Phenote (soon)
OBD will handle all required reasoning
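A post-composed entity type can be held as a small structured value rather than as a string. This is a sketch with a hypothetical class name; it reproduces the E=(blood has_location(head)) form used above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostComposed:
    """A type built from a genus term, a relation and a filler term."""
    genus: str
    relation: str
    filler: str

    def __str__(self) -> str:
        return f"({self.genus} {self.relation}({self.filler}))"


blood_in_head = PostComposed("blood", "has_location", "head")
print(f"E={blood_in_head}, Q=pooled")
```

Keeping the three parts separate means a reasoner can work on the genus and the relation directly instead of re-parsing the text form.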
Pre-coordinating phenotypes
Mammalian phenotype ontology has pre-coordinated phenotype terms
osteoporosis
pink fur
OBD will be able to translate
post-coordinated queries to annotations on pre-defined terms
queries on pre-defined terms to post-coordinated
phenotypes
Requirement
computable logical definitions are added to MP
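A sketch of the two-way translation, assuming each pre-coordinated term carries a computable logical definition as an E/Q pair. The table below is hypothetical and is not taken from MP; a term without a logical definition (here, osteoporosis) simply cannot be translated, which is why the requirement matters.

```python
# Hypothetical logical definitions: pre-coordinated term -> (E, Q)
LOGICAL_DEFS = {
    "pink fur": ("fur", "pink"),
}


def to_post_coordinated(term: str):
    """Expand a pre-coordinated term into its E/Q pair, if one is defined."""
    return LOGICAL_DEFS.get(term)


def to_pre_coordinated(entity: str, quality: str):
    """Find the pre-coordinated term whose logical definition matches E/Q."""
    for term, definition in LOGICAL_DEFS.items():
        if definition == (entity, quality):
            return term
    return None


print(to_post_coordinated("pink fur"))
print(to_pre_coordinated("fur", "pink"))
print(to_post_coordinated("osteoporosis"))
```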
Count qualities
wingless
polydactyly
spermatocytes devoid of asters
Absence can never be
instantiated
wingless
E=wing, Q=absent
autodef “an instance of wing which is
absent”
Proposal: restate as:
E=mesothoracic segment, Q=missing part,
E2=wing
This has other advantages
works better for “spermatocyte devoid of
asters”
The quality of ‘being many’
does not inhere in a finger
Polydactyly
E=finger, Q=supernumerary
autodef: “a finger which is supernumerary”
Restate as:
E=hand, Q=supernumerary parts, E2=finger
“a hand which has more fingers as parts than is
typical”
With count extension
E=hand, Q=supernumerary parts, E2=finger,
Count=6
could also say +1
“a hand with 6 fingers, which is more than normal”
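The restatements for "wingless" and "polydactyly" follow one pattern: move the annotation from the part to the whole, and name the part in E2. A sketch, assuming a part-to-whole lookup is available; the two lookup tables contain only the examples given above, and the function name is hypothetical.

```python
# Whole that each part belongs to (examples from the slides only).
PART_OF = {"wing": "mesothoracic segment", "finger": "hand"}

# Count-like qualities and their restated, whole-level counterparts.
RESTATED_QUALITY = {
    "absent": "missing part",
    "supernumerary": "supernumerary parts",
}


def restate(annotation: dict) -> dict:
    """Rewrite a part-level count annotation as a whole-level one."""
    quality = annotation["Q"]
    if quality not in RESTATED_QUALITY:
        return dict(annotation)  # not a count quality: leave unchanged
    part = annotation["E"]
    restated = {"E": PART_OF[part], "Q": RESTATED_QUALITY[quality], "E2": part}
    if "Count" in annotation:
        restated["Count"] = annotation["Count"]
    return restated


print(restate({"E": "wing", "Q": "absent"}))
print(restate({"E": "finger", "Q": "supernumerary", "Count": 6}))
```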
Proposed PATO sub-hierarchy
part count quality
lacking
parts
having normal
part count
lacking
all
lacking
some
having extra
parts
Mass count qualities
furriness
porosity
Bearers possess these qualities by
virtue of the number and qualities of
their granular parts
hairiness by virtue of: number, width,
length, spacing, orientation of hair-parts
What is the essence of hairy?
Attempt 1:
E=skin,Q=hairy
but what if we do not have ‘hairy’ pre-coordinated
in PATO?
Alternate representation:
E=skin,Q=excess fine-grained parts,E2=hair
open Q: is this equivalent to, subsumed by, or
related to representation 1?
Another representation:
E=hair, Q=long
this is something different
increased brown fat cells
“increased brown fat cells”
Attempt 1:
E=brown fat cell, Q=increased
autodef: a brown fat cell which is increased
Restate as:
E=organism, Q=increased (granular) parts,
E2=brown fat cell
works better for “increased brown fat cells in upper
body”
OBD handles reasoning
should annotations to above be returned for
queries of PATO term “fatty”?
Relativity
PATO has terms like
large
increased
Context is implicit
strain
species
genus/order
Extension to make explicit
In_comparison_to
Bigger than average for species/genus/etc
E=brain,Q=large,In_comparison_to=<taxon-id>
default is same species as specified by genotype
Comparative phenotypes
E=brain,Q=large,In_comparison_to=<phenotype-id>
requires recording phenotype IDs
e.g. two experiments, same genotype, different
environment, phenotype stronger in one
Ratio & relative_to
Use cases:
Size of brain relative to size of skull
Size of brain relative to size of skull in an
individual when compared to size of brain
relative to size of skull in a typical
individual of that species
E=brain,Q=large,relative_to=skull,
in_comparison_to=<taxon-id>
defaults to: whole organism
Modifiers
E=bone,Q=notched,Mod=mild
Standardised qualitative modifiers
Meaning dependent on E and Q
Can have multiple, cross-cutting scales
qualitative and numeric/score based
[Diagram: a qualitative word scale (absent, mildly realised, normal, strong, extreme) shown against numeric score scales (0, 1, 10, 100 and 0.00, 0.01, 0.1, 1)]
Modifiers modify meaning of Q
Influence of Mod on Q is subjective but the direction
is objective
Example: E=adult_human_body, during=sleep
Q={low,high} temperature, Mod=mild,normal,moderate,extreme
[Diagram: scales for the body temperature example. Normality: abn+, abnormal, normal, abnormal, abn+. Word scale: absent, mildly realised, normal, strong, extreme. Score scale: NOT / N/A, 0.00, 0.01, 0.1, 1, 10, 100. Temperature: 35, 37, 39. Low temperature: 37, 36.5, 36, 35. High temperature: 37, 37.5, 38, 39.]
Modifiers and PATO
Modifiers are not qualities
Modifiers should not be in a true
ontology
But we can still give these PATO IDs
kept separate from core PATO ontology
Modifiers can be relational
relatum may be implicit
e.g. abnormal_with_respect_to
Modifiers serve similar purposes as
Values in tripartite EAV model
Difference:
absent, low, high are not treated in the same
way as genuine quality types like ‘notched’,
‘large’, ‘diploid’, ‘pink’
they are ingredients in the representation
language, and not types in an ontology
Heterozygous flies have very short and
highly branched arista laterals.
E=arista lateral, EQuant=all, Q=short,
Mod=extreme, in_comparison_to=Dmel
E=arista lateral, EQuant=all, Q=branched,
Mod=extreme, in_comparison_to=Dmel
Measurements
Measurements are not qualities
In the schema, representations of
measurements are attached to the
representations of qualities
Separate measurement schema
don’t need to discuss fine grained details
here
some data providers will require more
detail than others here
e.g. averages, error bars, …
E=tail, Q=length, Measurement=2cm
E=tail, Q=length, Measurement=+.1cm,
in_comparison_to=<individual-id>
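A sketch of parsing the two Measurement forms shown above. Treating a leading sign as "this is a difference relative to the in_comparison_to target" is an assumption read off the +.1cm example, and the names are hypothetical.

```python
import re
from typing import NamedTuple


class Measurement(NamedTuple):
    value: float
    unit: str
    is_difference: bool  # True when written with an explicit sign, e.g. "+.1cm"


_PATTERN = re.compile(r"^([+-]?)(\d*\.?\d+)\s*([A-Za-z]+)$")


def parse_measurement(text: str) -> Measurement:
    """Parse strings such as '2cm' or '+.1cm' into value, unit and sign flag."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised measurement: {text!r}")
    sign, number, unit = match.groups()
    value = float(number)
    if sign == "-":
        value = -value
    return Measurement(value, unit, is_difference=bool(sign))


print(parse_measurement("2cm"))
print(parse_measurement("+.1cm"))
```

Keeping the measurement as its own record, separate from Q=length, mirrors the point that measurements are attached to representations of qualities rather than being qualities themselves.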
Likeness
Shape likeness
Homeotic transformations
E=A2 segment, Q=morphology, Similar_to=A3 segment
Interp:
An A2 segment with the morphological features
of an A3 segment
but not “heart-shaped leaves”
Conditionals
Some phenotypes are only realised under
certain conditions
environment
including chemical interactions, RNA interference etc
we should separate conditionals (this phenotype
only seen in this envirotype with this genotype)
from data (on this occasion this phenotype seen in
this envirotype with this genotype)
Schema elements
Phenotype character:
E
Q
EQuant
E2
Count
Mod
Relative_to
In_comparison_to
Similar_to
Measurement
Temporal
Most of these elements are optional
data providers pick and choose their level of
future extensions
boolean combinations
conditional statements
eg environment
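The schema elements listed above could be gathered into a single record type. This sketch assumes that E and Q are the only required elements (the slides say only that most are optional) and stores every value as plain text; the class and method names are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class PhenotypeCharacter:
    E: str                                   # entity (assumed required)
    Q: str                                   # quality (assumed required)
    EQuant: Optional[str] = None
    E2: Optional[str] = None
    Count: Optional[str] = None
    Mod: Optional[str] = None
    Relative_to: Optional[str] = None
    In_comparison_to: Optional[str] = None
    Similar_to: Optional[str] = None
    Measurement: Optional[str] = None
    Temporal: Optional[str] = None

    def to_simple_syntax(self) -> str:
        """Render only the elements that were supplied, in schema order."""
        parts = [
            f"{f.name}={getattr(self, f.name)}"
            for f in fields(self)
            if getattr(self, f.name) is not None
        ]
        return ", ".join(parts)


arista = PhenotypeCharacter(
    E="arista lateral", Q="short", EQuant="all", Mod="extreme", In_comparison_to="Dmel"
)
print(arista.to_simple_syntax())
```

A provider using only the simple subset fills in E and Q and ignores the rest, while the same record type still accommodates the expressive superset.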
[Diagram: modifier scale: ++, +, ., -, --]