ppt - Gene Ontology Consortium


Real-life ontology
development:
lessons from the Gene
Ontology
• What is GO?
• Evolution of GO
• Mechanisms of updating GO
• Tools for ontology development
• Lessons learned
Gene Ontology
• Built for a very specific purpose:
“annotation of genes and proteins in
genomic and protein databases”
• Applicable to all species
Gene Ontology - scope
• Three disjoint axes:
– molecular function
• molecular role e.g. catalytic activity, binding
– biological process
• broad biological phenomena e.g. mitosis, growth,
digestion
– cellular component
• sub-cellular location e.g. nucleus, ribosome, origin
recognition complex
Gene Ontology
• Directed acyclic graph (DAG)
• Terms connected by two transitive
relations (edges):
– is_a
– part_of
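As an illustration of the DAG structure above, here is a minimal Python sketch that follows is_a and part_of edges upward to collect a term's ancestors. The term names come from the slides, but the specific links between them (and the intermediate term "organelle") are assumptions made for the example, not the actual GO graph.

```python
from collections import defaultdict

# Hypothetical toy edges: (child, relation, parent). Illustrative only.
EDGES = [
    ("origin recognition complex", "part_of", "nucleus"),
    ("nucleus", "is_a", "organelle"),
    ("organelle", "is_a", "cellular component"),
]


def build_parents(edges):
    """Index the edge list as child -> set of (relation, parent)."""
    parents = defaultdict(set)
    for child, relation, parent in edges:
        parents[child].add((relation, parent))
    return parents


def ancestors(term, parents):
    """Return every term reachable by following is_a / part_of edges upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


PARENTS = build_parents(EDGES)
print(sorted(ancestors("origin recognition complex", PARENTS)))
```

Because both relations are transitive, an annotation to a term can be read as also applying to everything this traversal returns.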
Gene Ontology
• Developed by an international
consortium
– about 50 members
• Editorial office, 4 full-time editors (ish)
• Many other part-time editors at
databases
• Multiple changes made a day
– made live immediately
Gene Ontology
• Main ontology format OBO flat file
• Changes are live immediately
– no releases
• Propagated to GO database
– monthly snapshots archived
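To give a feel for the OBO flat file, the sketch below reads [Term] stanzas into plain dictionaries. It is a simplified reader, not a full OBO parser: the EX: identifiers are made up for the example, and it assumes that everything after a "!" on a line is a comment.

```python
# Hypothetical OBO fragment; the EX: identifiers are not real GO IDs.
OBO_TEXT = """\
[Term]
id: EX:0000001
name: cellular component

[Term]
id: EX:0000002
name: nucleus
is_a: EX:0000001 ! cellular component

[Term]
id: EX:0000003
name: origin recognition complex
relationship: part_of EX:0000002 ! nucleus
"""


def parse_obo_terms(text):
    """Return one dict per [Term] stanza, mapping each tag to a list of values."""
    terms = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            current = {} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
            continue
        if not line or current is None or ":" not in line:
            continue
        tag, _, value = line.partition(":")
        value = value.split("!", 1)[0].strip()  # drop trailing "! comment"
        current.setdefault(tag.strip(), []).append(value)
    return terms


for term in parse_obo_terms(OBO_TEXT):
    print(term["id"][0], term["name"][0])
```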
Evolution of GO
• Original GO created in 2000
• Three databases involved:
– FlyBase (Drosophila)
– MGI (Mouse)
– SGD (S. cerevisiae)
• Used immediately
Evolution of GO
• Later databases:
– TAIR (Arabidopsis)
– TIGR (microbes including prokaryotes)
– SWISS-PROT (several thousand species inc. human)
– PSU (P. falciparum)
• Recent additions
– ZFIN (zebrafish)
– PAMGO (plant pathogens)
Evolution of GO
• GO development traditionally
annotation-driven
– development directed by use
• Terms added as new species annotated
• Terms added on an as-needed basis
Evolution of GO
• Resulted in ‘organic’ structure, little
formality
• Ontological formality added
subsequently
– philosophical and logical
Growth of GO
[Figure: GO term history 2001 - 2007. Series: defined terms, undefined terms, obsolete. Y-axis: number of terms (0 to 30000). X-axis: date, quarterly from Jan 2001 to Jan 2007.]
Modifying the graph:
• Before:
Modifying the graph:
• But then I need to annotate VW
Beetles, pre-1980
• The graph no longer works, because
the engine is in the boot
Modifying the graph:
• After:
Mechanisms for ontology
change
• Small incremental changes
• Initially all changes to the ontologies
made this way
Mechanisms for ontology
change
• Suggested changes initially
submitted by email
• Moved to an online tracking system
when this became unmanageable
Requesting changes to GO
- curator requests tracker
• Web-based tracking system hosted at
SourceForge.net
• Public
• Tracker item for each new request or
question
Curator requests tracker
Mechanisms for ontology
change
• Problems:
– Larger questions about the higher
ontology structure remain unresolved
– Makes some items impossible to close
– No sense of the ‘big picture’
– Large areas of the ontologies missing or
incomplete because no annotations
– Massive volume
• needed to increase the number of editors
Mechanisms for ontology
change
• Larger-scale changes:
– content meetings
– interest groups
Content meetings
• Short meetings aimed at developing
specific areas of GO ontology content
– proposals refined and discussed before
meeting
– small number of people (10-15)
– invited experts
– specific topics
Content meetings
• Further refinements made following
meeting by email
• Changes are made once consensus
reached
• Large number of terms typically added
(500+)
Content meetings
• Recent meetings:
– immunology
– interactions between organisms
– CNS development
Content meetings
• Advantages
– Allows a lot of detailed work to be
done on a very specific area
– Involves external expertise
Content meetings
• Problems:
– Expensive - everyone has to be in the
same location
– Only works for very specific topics
– Long lag time getting terms into
ontologies
Interest groups
• Groups of experts for a specific
topic
– e.g. development, cell cycle, plants
• Includes GO curators/annotators
and external experts
• Don’t typically meet face to face
Interest groups
• Communicate via email, desktop
sharing etc
• Transporters area of the ontology
recently revised this way
Interest groups
• Advantages
– Cheap, no travel required
– Allows a lot of detailed work to be
done on a very specific area
– Involves external expertise
Interest groups
• Disadvantages
– Harder to reach consensus when not
face to face
– Projects tend to drag on
Mechanisms for ontology
change
• Systematic changes via small working
groups
Systematic changes
• Projects not directly related to biological
content
• Systematic changes throughout ontology
• Small group of GO consortium members
– meets regularly by desktop sharing, voice
over IP
• Experts recruited to meetings as needed
Systematic changes
• Changes either
– made on a branch of the ontology and
merged in later
• always have big problems merging branched file
into main file
– merged directly into live ontology after session
• fast, but people get angry
is_a complete
• GO contains both is_a and part_of
relations
• Typically, graphs were a mixture of
incomplete is_a and part_of
hierarchies
• A result of ‘organic’ evolution of GO
• All graphs now have complete is_a
paths to root
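One way to check the "complete is_a paths to root" property is sketched below. It assumes a hypothetical in-memory layout where each term maps to its lists of is_a and part_of parents; the toy terms and links are illustrative, not taken from GO.

```python
def is_a_incomplete(terms, root):
    """Return the terms that cannot reach `root` using is_a links alone."""

    def reaches_root(term, seen):
        if term == root:
            return True
        seen.add(term)
        return any(
            parent not in seen and reaches_root(parent, seen)
            for parent in terms.get(term, {}).get("is_a", ())
        )

    return sorted(t for t in terms if not reaches_root(t, set()))


# Hypothetical toy graph: the complex has only a part_of parent.
TERMS = {
    "cellular component": {},
    "nucleus": {"is_a": ["cellular component"]},
    "origin recognition complex": {"part_of": ["nucleus"]},
}

print(is_a_incomplete(TERMS, "cellular component"))
```

A term with only part_of parents is reported, which is the kind of gap the is_a complete project set out to close.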
partial disjointness
• Biological process terms organised by
granularity:
– cellular process
– multicellular organism process
– multi-organism process
• To avoid massive increase in number of
paths to root, these terms are disjoint
– no is_a children in common
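The disjointness rule above can be checked mechanically. The sketch below reports any term that sits under more than one of the three granularity terms via is_a. It assumes disjointness is meant for all is_a descendants, not only direct children, and the child terms shown are placed for illustration only.

```python
from itertools import combinations


def is_a_descendants(term, children):
    """Collect every term below `term` by following is_a links downward."""
    found, stack = set(), [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def disjointness_violations(groups, children):
    """Map each pair of groups to the descendants they share, if any."""
    result = {}
    for a, b in combinations(groups, 2):
        shared = is_a_descendants(a, children) & is_a_descendants(b, children)
        if shared:
            result[(a, b)] = shared
    return result


GROUPS = [
    "cellular process",
    "multicellular organism process",
    "multi-organism process",
]

# Hypothetical is_a children (parent -> children); placement is illustrative.
CHILDREN = {
    "cellular process": ["cell division"],
    "multicellular organism process": ["digestion"],
    "multi-organism process": [],
}

print(disjointness_violations(GROUPS, CHILDREN))
```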
sensu
• sensu (meaning ‘in the sense of’)
used to disambiguate, by
taxonomic group, terms with
identical strings but different
meanings
• e.g. sporulation (sensu Viridiplantae)
vs. sporulation (sensu Bacteria)
sensu
• Current project to remove the sensu term
strings
• Replace with strings that represent the
true differentiae
• e.g.
– cell wall (sensu Bacteria) -> peptidoglycan-based cell wall
– cell wall (sensu Fungi) -> chitin- and beta-glucan-containing cell wall
Systematic changes to GO
• Advantages
– Fast
– Efficient
– Small number of people required
Systematic changes to GO
• Disadvantages
– Difficult to obtain wider consensus
– Changes sometimes have to be
undone
Useful tools for ontology
development
• WebEx
– desktop sharing, can control each other's
desktops
• wiki
– mainly internal
• Skype
– free international calls!
• conference calls
– not free
Tracking changes to GO
• General tracking
– files stored in cvs, all differences
trackable (in theory)
– far from ideal; a frequent discussion is
whether we should track history and date-stamp
terms
Tracking changes to GO
• Obsolete terms
– formerly stored within the ontology
– in OBO format, now a special kind of
deprecated term (tag is_obsolete)
– Soon to create ‘replaced_by’ and
‘consider’ tags to point to live terms
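A sketch of how the planned tags might be used once available: follow replaced_by from an obsolete term to a live one, and fall back to the consider suggestions when there is no single replacement. The dictionary layout, the EX: identifiers, and the max_hops guard are assumptions made for this example.

```python
def resolve_term(term_id, terms, max_hops=10):
    """Follow replaced_by links from an obsolete term to a live one.

    Returns (live_id, []) when a unique replacement exists, or
    (None, suggestions) when only 'consider' hints are available.
    """
    for _ in range(max_hops):
        term = terms[term_id]
        if not term.get("is_obsolete"):
            return term_id, []
        replacements = term.get("replaced_by", [])
        if len(replacements) != 1:
            return None, term.get("consider", [])
        term_id = replacements[0]
    raise ValueError("replaced_by chain too long or cyclic")


# Hypothetical terms; the identifiers are not real GO IDs.
TERMS = {
    "EX:1": {"name": "old term", "is_obsolete": True, "replaced_by": ["EX:2"]},
    "EX:2": {"name": "live term"},
    "EX:3": {"name": "ambiguous old term", "is_obsolete": True, "consider": ["EX:2"]},
}

print(resolve_term("EX:1", TERMS))
print(resolve_term("EX:3", TERMS))
```

The distinction matters for annotation: a replaced_by target can be substituted automatically, while a consider target needs a curator to decide.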
Tracking changes to GO
• Crediting experts
– traditionally no mechanism for doing
this
– creating abstracts for content
meetings, adding tag to term
– as yet no mechanism for crediting
individuals
Useful tools for ontology
development
• OBO-Edit
– ontology editor originally developed for
GO
– can be used for any OBO format
ontology
– developed by group of users
Useful tools for ontology
development
• Reasoner integrated into OBO-Edit
– based on OBOL
– detects missing links, redundant links
– soon: misplaced terms, automatic term
creation
• Validation system
– typographical errors, is_a orphans,
duplicate synonyms etc.
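Two of the checks named above are simple enough to sketch directly. This is not the OBO-Edit reasoner or its validation system, only an illustration of what "redundant link" and "is_a orphan" can mean on a toy graph; the term "cell cycle process" and all links are assumed for the example.

```python
def redundant_is_a_links(parents):
    """Find direct is_a links that are already implied by a longer is_a path."""

    def ancestors(term):
        seen, stack = set(), list(parents.get(term, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(parents.get(current, ()))
        return seen

    redundant = []
    for child, direct in parents.items():
        for parent in direct:
            others = [p for p in direct if p != parent]
            if any(parent in ancestors(other) for other in others):
                redundant.append((child, parent))
    return sorted(redundant)


def is_a_orphans(parents, root):
    """Return non-root terms that have no is_a parent at all."""
    return sorted(t for t, direct in parents.items() if t != root and not direct)


# Hypothetical is_a parents (child -> parents); illustrative only.
PARENTS = {
    "biological process": [],
    "cell cycle process": ["biological process"],
    "mitosis": ["cell cycle process", "biological process"],
    "growth": [],
}

print(redundant_is_a_links(PARENTS))
print(is_a_orphans(PARENTS, "biological process"))
```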
Lessons learned
• An ontology doesn’t have to be
perfect or complete to be used
• For domain ontologies, external
experts should be involved
• Communication is critical
• You will never please everyone