How different is anatomy?


Practical Ontologies
Lessons from the GO
February 2011
The time was 1998-99
• None of the model organism databases used standard terminology to describe biological function
• Drosophila sequence was imminent
• Largest genome sequenced at that time
• Two weeks, 3 dozen scientists, all new software
• How could we organize the annotation?
• Microarray technology was the latest research tool, and results needed to be described
• AI folk and ontologists organized the first “bio-ontologies” workshop at ISMB
The Gene Ontology—the beginning
• A handful of biologists (4) met in a bar in Montreal after the bio-ontologies workshop to share their frustrations and decided to just do it*…
• Would demonstrate possibilities for data integration across the MODs (FlyBase, SGD, MGD)
• Provided an organizing principle for the Drosophila genome annotation jamboree
* i.e. describe gene products in a biologically meaningful way.
Late summer 1999
[Figure: raw DNA sequence text with workflow labels: reads; assemble; sequence analysis; mountains of data; tentative function; filtering; love at first sight; piles of data; ‘GO’ directories; converging; first-pass predictions; functional knowns]
The Gene Ontology project
• Annotate now
  • The importance of stress-testing
  • Don’t delay, use your ontology today
• Do no harm (KISS)
  • i.e. target the low-hanging fruit, work on the obvious, high-confidence steps
• Collaborate on concrete projects
  • Focusing the mind
Annotations
• Have 3 primary components
  • The ontology term(s)
  • The entity instance (e.g. gene product)
  • The evidence for that assertion
• An annotation is an evidence-based assertion that a given entity is best classified/described by the given term(s)
[Figure: curation workflow. Read paper(s): PMID:17449867. Identify genes: SPCC622.16c. Identify GO terms associated with each gene: SPCC622.16c, GO:0005720. What type of evidence? IDA. Resulting annotation: SPCC622.16c, GO:0005720, IDA, PMID:17449867]
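The three components, plus the supporting reference, can be held in one small record. The following is a minimal sketch, not the GO project's actual file format: the class name `Annotation`, its field names, and the tab-separated `to_row` layout are assumptions made for illustration. The values come from the slide, and IDA is the GO evidence code "Inferred from Direct Assay".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One evidence-based assertion (hypothetical record layout)."""
    entity: str         # the entity instance, e.g. a gene product identifier
    term: str           # the ontology term identifier
    evidence_code: str  # the kind of evidence behind the assertion
    reference: str      # where the evidence was reported

    def to_row(self) -> str:
        # Tab-separated layout chosen for this sketch only.
        return "\t".join([self.entity, self.term, self.evidence_code, self.reference])


# Values taken from the curation example on the slide.
ann = Annotation(
    entity="SPCC622.16c",
    term="GO:0005720",
    evidence_code="IDA",
    reference="PMID:17449867",
)
print(ann.to_row())
```

Making the record immutable (`frozen=True`) reflects the idea that an annotation is a fixed assertion tied to its evidence; a correction would be a new record rather than an in-place change.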
Classification rule: Disambiguation
= bud initiation
= bud initiation
= bud initiation
The same name can be used to describe different things.
Classification rule: Disambiguation
= tooth bud initiation
= cellular bud initiation
= flower bud initiation
Include plain “bud initiation” as a synonym for each of these terms.
Disambiguation
Exactly the same thing can be described with different terms:
• Glucose synthesis
• Glucose biosynthesis
• Glucose formation
• Glucose anabolism
• Gluconeogenesis
Comparison is difficult, especially across species or across databases that each use one of these different variants.
Use a single term, and plenty of synonyms.
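Both disambiguation rules (one primary name with many synonyms, and one ambiguous synonym shared by several terms) can be sketched as a label index. The identifiers `EX:0000001` to `EX:0000004` and the helper names `build_index` and `lookup` are hypothetical, and the choice of "gluconeogenesis" as the primary name is an assumption made only for this example.

```python
from collections import defaultdict

# Hypothetical terms; the identifiers are placeholders, not real GO IDs.
TERMS = {
    "EX:0000001": {
        "name": "gluconeogenesis",
        "synonyms": ["glucose synthesis", "glucose biosynthesis",
                     "glucose formation", "glucose anabolism"],
    },
    "EX:0000002": {"name": "tooth bud initiation", "synonyms": ["bud initiation"]},
    "EX:0000003": {"name": "cellular bud initiation", "synonyms": ["bud initiation"]},
    "EX:0000004": {"name": "flower bud initiation", "synonyms": ["bud initiation"]},
}


def build_index(terms):
    """Map every lower-cased name or synonym to the set of term IDs using it."""
    index = defaultdict(set)
    for term_id, term in terms.items():
        for label in [term["name"], *term["synonyms"]]:
            index[label.lower()].add(term_id)
    return index


def lookup(index, label):
    """Return the sorted term IDs matching a label (empty list if none)."""
    return sorted(index.get(label.strip().lower(), set()))


INDEX = build_index(TERMS)
print(lookup(INDEX, "Glucose anabolism"))  # a single term
print(lookup(INDEX, "bud initiation"))     # three candidates to choose between
```

A search on any glucose variant lands on the same term, while a search on plain "bud initiation" returns all three candidates, leaving the curator to pick the specific one.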
Annotation for a healthy ontology
• Easier to find the most accurate term(s) to use
• Avoids annotation errors
• Easier for new curators to learn and understand
• Develop annotation guidelines and training material
• Enables automatic reasoning for searching & inference
• Bottom line: following basic construction rules makes more useful ontologies
Improvement needed: Closing the loop
[Figure labels: Typical ontology developer; Typical wet lab PI annotating data; “Doh! I get it now,” says the computer.]
GO in 2000-2008
Filling in annotation gaps (July 2008)
[Venn diagram: F = GO:0016301 kinase activity, P = GO:0016310 phosphorylation; regions 3823 (F only), 2230 (both), 1410 (P only)]
|F| = 6053
|P| = 3640
|F ∩ P| = 2230
|F ∩ not P| = 3823
[Figure sequence: annotations propagate over part_of; examples shown: KIC1 (IDA) and NDK1 (IDA)]
Filling in annotation gaps (2009)
GO:0016301 kinase activity
GO:0016310 phosphorylation
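The gap-filling idea can be sketched as a closure over part_of links. This is an illustration under stated assumptions, not the GO project's inference code: the single link from GO:0016301 to GO:0016310 is read off the slide's part_of arrows, the helper `propagate` is hypothetical, and KIC1 and NDK1 are used only because they appear as the slide's examples.

```python
# Assumption for this sketch: kinase activity is part_of phosphorylation,
# as the part_of arrows on the slide suggest.
PART_OF = {"GO:0016301": "GO:0016310"}


def propagate(annotations, part_of):
    """Return the annotations plus those implied by following part_of links."""
    inferred = set(annotations)
    frontier = list(annotations)
    while frontier:
        gene, term = frontier.pop()
        parent = part_of.get(term)
        if parent is not None and (gene, parent) not in inferred:
            inferred.add((gene, parent))
            frontier.append((gene, parent))  # keep following longer chains
    return inferred


# Direct annotations to kinase activity for the two genes shown on the slide.
direct = {("KIC1", "GO:0016301"), ("NDK1", "GO:0016301")}
closed = propagate(direct, PART_OF)

# The size of the gap is consistent with the counts on the July 2008 slide:
# annotated to F but not to P = |F| - |F ∩ P|.
gap = 6053 - 2230
print(sorted(closed))
print(gap)
```

Under this reading, every one of the 3823 annotations in F but not in P is a candidate for an inferred phosphorylation annotation.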
The H word—2011
[Figure axes: time; divergence]
• Characters in common are due to inheritance
• Allows inferences about common ancestor
Evolution of MSH2 subfamily
[Figure: biological process labels: somatic hypermutation of immunoglobulin genes; apoptosis; maintenance of DNA repeats; homologous recombination; DNA repair]
Ancestral inference
[Figure: phylogenetic tree with leaves E.c., A.t. MTHFR1, A.t. MTHFR2, D.d., S.p., S.c. MET13, S.p., S.c. MET12, C.e., D.m., A.g., D.r., G.g., H.s. MTHFR, R.n., M.m.; evidence labels “Biochemistry: purification and assay” and “Genetics: mutant phenotypes”; axis: divergence]
• Integration at points of common ancestry
• Infer “hidden” character of living organisms
• Explicitly leverage evolutionary relationships
Integrating different GO annotations
PAINT: Phylogenetic Annotation and Inference Tool
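The idea of integrating annotations at a point of common ancestry and then inferring the "hidden" character of unannotated descendants can be sketched on a toy tree. This is not PAINT's algorithm or data model: the tree topology, the node name `fungi_animal_ancestor`, the term `EX:0000010`, the experimental annotations, and the helper names are all hypothetical, with leaf names borrowed from the slide only as labels.

```python
# Hypothetical tree: each internal node maps to its children.
TREE = {
    "root": ["fungi_animal_ancestor", "A.t. MTHFR1"],
    "fungi_animal_ancestor": ["S.c. MET13", "H.s. MTHFR", "D.m."],
}

# Hypothetical experimental annotations on some leaves.
EXPERIMENTAL = {
    "S.c. MET13": {"EX:0000010"},
    "H.s. MTHFR": {"EX:0000010"},
}


def leaves_under(tree, node):
    """Collect all leaves below a node (a node without children is a leaf)."""
    children = tree.get(node)
    if not children:
        return [node]
    found = []
    for child in children:
        found.extend(leaves_under(tree, child))
    return found


def infer_from_ancestor(tree, ancestor, term, experimental):
    """Label each leaf under the ancestor as experimental or inferred for a term."""
    result = {}
    for leaf in leaves_under(tree, ancestor):
        if term in experimental.get(leaf, set()):
            result[leaf] = "experimental"
        else:
            result[leaf] = "inferred from ancestor"
    return result


inferred = infer_from_ancestor(TREE, "fungi_animal_ancestor", "EX:0000010", EXPERIMENTAL)
print(inferred)
```

In this sketch the decision to place the term on the ancestor is taken as given; in practice that step is a curator's judgment based on the experimental evidence in the descendants.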
Scoping
[Figure, 2009: SGD, MGD, GO, FlyBase]
• The ontology has clearly specified and clearly delineated content.
Decisions to make the work easier
• Provide definitions for everything
• Intelligible ontologies are more useful
  • To humans (for annotation) and
  • To machines (for searching, reasoning and error-checking)
• Use content-free unique identifiers
• Drive all semantics away from tracking
• Don’t confuse the representational technology with the conceptual modeling
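Why content-free identifiers help can be shown with a tiny registry: the identifier carries no meaning, so a term can be renamed without touching any annotation that points at it. The class `TermRegistry`, the `EX` prefix, and the gene name `geneA` are hypothetical and exist only for this sketch.

```python
import itertools


class TermRegistry:
    """Assigns opaque, sequential identifiers that encode nothing about the term."""

    def __init__(self, prefix="EX"):
        self._prefix = prefix
        self._counter = itertools.count(1)
        self._names = {}

    def add(self, name):
        # The ID is just a zero-padded counter; no semantics are stored in it.
        term_id = f"{self._prefix}:{next(self._counter):07d}"
        self._names[term_id] = name
        return term_id

    def rename(self, term_id, new_name):
        if term_id not in self._names:
            raise KeyError(term_id)
        self._names[term_id] = new_name

    def name(self, term_id):
        return self._names[term_id]


registry = TermRegistry()
term_id = registry.add("glucose synthesis")
annotations = [("geneA", term_id)]           # annotations store only the ID

registry.rename(term_id, "gluconeogenesis")  # the name changes, the ID does not
print(term_id, registry.name(annotations[0][1]))
```

Had the identifier embedded the name or a position in the hierarchy, the rename would have forced an update of every stored annotation.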
Implicit ontologies within the GO:
• cysteine biosynthesis (ChEBI)
• myoblast fusion (Cell Type Ontology)
• hydrogen ion transporter activity (ChEBI)
• snoRNA catabolism (Sequence Ontology)
• wing disc pattern formation (Drosophila anatomy)
• epidermal cell differentiation (Cell Type Ontology)
• regulation of flower development (Plant anatomy)
• B-cell differentiation (Cell Type Ontology)
Implicit anatomy ontology within the GO:
[Figure: GO term hierarchy: brain development; hindbrain development; metencephalon development; pons development; trigeminal motor nucleus development]
[Figure: two annotation examples. First: Mouse, Substantia nigra, Lewy body, Alpha-Synuclein, with relations “is bearer of”, “has part”, “number of”. Second: Ischemic Mouse, Nucleus, Golgi Apparatus, Lysosome, Condensed Mitochondrion, Orthodox Mitochondrion, Dark Material, with relations “is bearer of”, “number of”]
Common Interest
• Sociology: to enlist the community, the ontology must meet each individual group’s immediate needs.
• Too many people => Too many requirements
• Outstanding problems
  • Closing the loop between ontology construction and ontology application
  • QC improvements
  • Prioritizing tasks
  • Visualization
  • …
A cast of thousands