Jim Hu - Alliance for Bioinformatics, Computational Biology, and

EcoliWiki and GONUTS
Wiki-based Systems for Community Annotation
Jim Hu
Dept. of Biochemistry and Biophysics
Texas A&M University
Overview
• EcoliWiki and the central problem in genome annotation
• Gene Ontology and the Gene Ontology Normal Usage Tracking System (GONUTS)
• Live demos/Discussion
Annotation
• Goals for annotation:
– Coverage
– Accuracy
– Usefulness
• for scientists (human-readable)
• for machine inference generation (computer-understandable)
• Annotation is a moving target!
The need for Annotation is growing
People are limiting for annotation
• Major genome databases employ large numbers of people
• This model is problematic
– Curators are expensive
• NIH and NSF cannot afford to staff every organism at this level
– Broad expertise across all areas is hard
• Curators have to read papers in areas they were not trained in.
• Curators may not recognize the significance of papers in areas they were not trained in
• Can we make it:
– cheaper?
– faster?
– better?
Staff by role at three genome databases (WormBase, Gramene, and MGI; the source does not preserve which column belongs to which database):
Role              Column 1   Column 2   Column 3
Curation          31         19         6
Software          10         8          4
SysAdmin          4          0.25       0.25
User Support      3          0          1
Software QA       3          0          0
Administration*   3          1.5        1
Total             54         28.75      12.25
The Wikipedia approach
• Get your user community to work for free!
• aka "Community annotation" or "Community curation"
EcoliWiki
http://ecoliwiki.org or .net or .com
(most of our hits come from Google)
http://www.pasteur.fr/infosci/archives/mon/im_ele.html
“What is true of Escherichia coli is true of the elephant”
-- Jacques Monod
“Thanks to annotation creep, what’s false for E. coli is false for the elephant too”
-- Jim Hu
EcoliWiki philosophy
• Any registered user can edit
• Any registered user can register new users
• Any registered user can create new pages
• It's easier to revise than to create new content
– Seed content from other sites, mostly EcoCyc
But won't that invite chaos?
GenBank's managers are dead set against letting users into GenBank's files, however. They say there already are procedures to deal with errors in the database, and researchers themselves have created secondary databases that improve on what GenBank has to offer. "That we would wholesale start changing people's records goes against our idea of an archive," says David Lipman, director of the National Center for Biotechnology Information (NCBI), GenBank's home in Bethesda, Maryland. "It would be chaos."
Correct compared to what?
NCBI RefSeq:
Wikipedia:
Correct compared to what?
This is how biology achieves fidelity
A collage of books I haven’t read
Biology Wikis are proliferating
Participation is the major challenge
• Anyone can edit ≠ Anyone will edit
• Wikipedia: a tiny fraction of the users edit anything
– A tiny fraction of those do major editing
– Really big denominator
• Outreach to increase our user base
Participation is the major challenge
• Tools to make it easier to edit
Participation is the major challenge
• Biggest difference from other systems:
– Partial annotations are wanted
– It doesn't matter if you don't know the wiki markup
– It doesn't matter if what you're adding isn't fully worked out
• Someone else can fix it
• And you can fix what others write
Making it machine-friendly: ontologies
• Ontology:
– in philosophy: a metaphysical system for studying being
– in biology/bioinformatics: a structured representation of biological knowledge
• NCBO = National Center for Biomedical Ontology
• OBO = Open Biological Ontologies
• Examples
– MeSH
– Sequence ontology = SO
– Phenotype and trait ontology = PATO
– Gene Ontology = GO
– see the EBI ontology browser: http://www.ebi.ac.uk/ontologylookup/
What is an ontology?
• Controlled vocabulary with
– Term identifiers
• GO:0000075
– Name
• cell cycle checkpoint
– Definitions
• "A point in the eukaryotic cell cycle where progress through the cycle can be halted until conditions are suitable for the cell to proceed to the next stage." [GOC:mah, ISBN:0815316194]
– Relationships
• is_a GO:0000074 ! regulation of progression through cell cycle
• Terms arranged in a Directed Acyclic Graph (DAG)
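The term structure on this slide (identifier, name, definition, is_a relationships) can be sketched as a small data structure. This is a minimal illustration, not how any GO tool actually stores terms: the `Term` class, the `ancestors` helper, and the `EX:0000001` root term are all hypothetical names introduced here; only GO:0000075 and GO:0000074 come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One ontology term: identifier, name, definition, and is_a parents."""
    term_id: str
    name: str
    definition: str = ""
    is_a: list = field(default_factory=list)  # IDs of parent terms


def ancestors(term_id, terms):
    """Collect every term reachable by following is_a links upward.

    Because the graph is a DAG, a term may be reached by more than one
    path, so visited IDs are tracked in a set.
    """
    seen = set()
    stack = list(terms[term_id].is_a)
    while stack:
        parent = stack.pop()
        if parent in seen:
            continue
        seen.add(parent)
        if parent in terms:
            stack.extend(terms[parent].is_a)
    return seen


terms = {
    "GO:0000075": Term("GO:0000075", "cell cycle checkpoint",
                       is_a=["GO:0000074"]),
    "GO:0000074": Term("GO:0000074",
                       "regulation of progression through cell cycle",
                       is_a=["EX:0000001"]),
    # Hypothetical root term, for illustration only (not a real GO ID).
    "EX:0000001": Term("EX:0000001", "illustrative root term"),
}

print(sorted(ancestors("GO:0000075", terms)))
```

Walking upward like this is what lets an annotation to a specific term also count as an annotation to each of its more general parents.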
Pros and Cons of Ontologies
• Pros
– facilitate comparison across systems
– facilitate computer based reasoning systems
• Good for data mining!
• Cons
– Large and unwieldy
– Difficult to understand
– Difficult to use
– May never capture knowledge accurately
– Ontology development lags behind the field it tries to capture
• Example of a theme of genomics: imperfect tools can still be very powerful!
GO = Gene Ontology
• 3 ontologies for gene products
– Biological Process
– Molecular Function
– Cellular Component
• Used to make annotations
– aka Gene associations
– Term + qualifiers + evidence code + reference etc.
Figure labels: is_a, part_of
figure from GO consortium presentations
from GOC
Cellular Component
• where a gene product acts
figure from GO consortium presentations
from GOC
Cellular Component
figure from GO consortium presentations
from GOC
Molecular Function
• activities or “jobs” of a gene product
glucose-6-phosphate isomerase activity
figure from GO consortium presentations
from GOC
Molecular Function
insulin binding
insulin receptor activity
figure from GO consortium presentations
from GOC
Molecular Function
• A gene product may have several functions
• Sets of functions make up a biological process.
figure from GO consortium presentations
from GOC
Biological Process
a commonly recognized series of events
cell division
figure from GO consortium presentations
from GOC
Biological Process
transcription
figure from GO consortium presentations
from GOC
GO annotation
• Find papers
• Read them
– Find what genes are mentioned
– What assertions are made about the product?
– What GO terms are applicable?
• GO term browsers
– AmiGO http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
– GONUTS http://gowiki.tamu.edu
• New term needed?
– What evidence code should be used to record the assertion?
• Record gene associations in the MOD database
• Send gene associations to GO consortium
• Downloadable files that users doing electronic analysis can parse
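As a rough sketch of what "parsing the downloadable files" might look like, the example below assumes the file is a tab-delimited GO gene association file in which comment lines start with "!" and the first nine columns are database, object ID, symbol, qualifier, GO ID, reference, evidence code, with/from, and aspect. The slides do not name the file format, so that layout is an assumption here, and the sample rows (database `EXDB`, genes `exaA` and `exaB`, and their references) are invented for illustration.

```python
import csv
import io

# Assumed names for the first nine tab-separated columns.
COLUMNS = ["db", "db_object_id", "symbol", "qualifier", "go_id",
           "reference", "evidence", "with_from", "aspect"]

# Invented sample data; not real annotations.
SAMPLE = (
    "!gaf-version: 1.0\n"
    "EXDB\tEX0001\texaA\t\tGO:0000075\tPMID:0000000\tIDA\t\tP\n"
    "EXDB\tEX0002\texaB\t\tGO:0000075\tEXDB_REF:0000001\tIEA\t\tP\n"
)


def parse_associations(handle):
    """Return one dict per gene association, skipping '!' comment lines."""
    records = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("!"):
            continue
        records.append(dict(zip(COLUMNS, row)))
    return records


records = parse_associations(io.StringIO(SAMPLE))
for rec in records:
    print(rec["symbol"], rec["go_id"], rec["evidence"], rec["reference"])
```

Each parsed record carries the pieces listed earlier in the talk for an annotation: the term, any qualifiers, the evidence code, and the reference.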
Human vs Electronic GO annotations
• What is the basis for making a gene association?
• Human
– Experimental Evidence Codes
• EXP: Inferred from Experiment
• IDA: Inferred from Direct Assay
• IPI: Inferred from Physical Interaction
• IMP: Inferred from Mutant Phenotype
• IGI: Inferred from Genetic Interaction
• IEP: Inferred from Expression Pattern
– Computational Analysis Evidence Codes
• ISS: Inferred from Sequence or Structural Similarity
• ISO: Inferred from Sequence Orthology
• ISA: Inferred from Sequence Alignment
• ISM: Inferred from Sequence Model
• IGC: Inferred from Genomic Context
• RCA: Inferred from Reviewed Computational Analysis
– Author Statement Evidence Codes
• TAS: Traceable Author Statement
• NAS: Non-traceable Author Statement
– Curator Statement Evidence Codes
• IC: Inferred by Curator
• ND: No biological Data available
• Electronic
– Automatically-assigned Evidence Codes
• IEA: Inferred from Electronic Annotation
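The grouping of evidence codes above lends itself to a simple lookup table. The sketch below uses only the codes and category names listed on this slide; the names `EVIDENCE_CATEGORIES`, `CODE_TO_CATEGORY`, and `is_human_assigned` are hypothetical helpers introduced for illustration.

```python
# Evidence codes grouped as on the slide.
EVIDENCE_CATEGORIES = {
    "Experimental": ["EXP", "IDA", "IPI", "IMP", "IGI", "IEP"],
    "Computational Analysis": ["ISS", "ISO", "ISA", "ISM", "IGC", "RCA"],
    "Author Statement": ["TAS", "NAS"],
    "Curator Statement": ["IC", "ND"],
    "Automatically-assigned": ["IEA"],
}

# Invert the table so a code can be looked up directly.
CODE_TO_CATEGORY = {
    code: category
    for category, codes in EVIDENCE_CATEGORIES.items()
    for code in codes
}


def is_human_assigned(code):
    """True for every code except the automatically assigned IEA."""
    return CODE_TO_CATEGORY[code] != "Automatically-assigned"


codes_seen = ["IDA", "IEA", "TAS", "ISS", "IEA"]
human = [c for c in codes_seen if is_human_assigned(c)]
electronic = [c for c in codes_seen if not is_human_assigned(c)]
print(human)
print(electronic)
```

A filter like this is one way a user doing electronic analysis could separate curator-reviewed associations from purely automatic ones before drawing conclusions.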
GONUTS (http://gowiki.tamu.edu)
• Started as a wiki-based usage guide
• Each ontology term is a MediaWiki (MW) Category
– MW supports DAGs as Categories!
• Each term page has a notes area for user notes on usage
• Term pages list examples of genes that were annotated to this term
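To illustrate why MediaWiki categories can hold a DAG: a category page that contains a `[[Category:Parent]]` link becomes a subcategory of that parent, and a page may carry several such links. The sketch below generates wikitext for a term page on that basis. How GONUTS actually names or lays out its pages is not stated in the slides, so the `term_page_wikitext` function, the page layout, and the second parent name are assumptions made for illustration.

```python
def term_page_wikitext(name, parent_names, notes=""):
    """Build wikitext for a term's category page.

    Each [[Category:...]] link files this page under one parent
    category; more than one parent is allowed, which is what makes
    the category structure a DAG rather than a tree.
    """
    lines = [f"Usage notes for {name}", "", "== Notes ==", notes, ""]
    lines += [f"[[Category:{parent}]]" for parent in parent_names]
    return "\n".join(lines)


text = term_page_wikitext(
    "cell cycle checkpoint",
    # The second parent is hypothetical, added to show multiple parents.
    ["regulation of progression through cell cycle",
     "illustrative second parent"],
)
print(text)
```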
MOD gene pages
• Gene pages from established Model Organism Databases provide examples of best practices
Responding to community needs
User-created gene pages
• Annotation pages based on UniProt IDs
Supporting Annotation Jamborees in Cyberspace
• RefGenome subgroup of GO Consortium
– collaboration on annotation consistency
– Electronic Jamborees via teleconference
– Uses GONUTS to collect and compare
Thanks to
• EcoliWiki/GONUTS Team
– Nathan Liles
– Brenley McIntosh
– Debby Siegele
– Daniel Renfro
– Anand Venkatraman
– Adrienne Zweifel
• GO consortium
• EcoliHub Team Leaders
– Barry Wanner, PI, Purdue
– Walid Aref, co-PI, Purdue
– Tyrell Conway, co-PI, Oklahoma
– Mike Gribskov, co-PI, Purdue
– Peter Karp, co-PI, SRI
– Daisuke Kihara, co-PI, Purdue
• Funding NIH U24-GM077905
URLs: http://ecolihub.org
http://ecoliwiki.org
http://gowiki.tamu.edu