CACAO Biocurator Training

CACAO Fall 2011
CACAO
• Syllabus
• What is CACAO & why is it important?
• Training
• Examples
Mutualistic Relationship
• We want you to get experience with:
1. CRITICALLY reading scientific papers
2. Bioinformatics resources
3. Collaborating with other biocurators
4. Synthesizing functional annotations
• We want to get high quality functional annotations to contribute back to the GO Consortium and other biological databases
What is an annotation?
Hint: try looking for a definition on Wikipedia.
What is a functional annotation?
• Process of attaching information from the
scientific literature to proteins
Growing need for functional
annotations
• Advances in DNA sequencing mean lots of new
genomes & metagenomes
Classic MODel
Literature
Database
Curators
(rate limiting)
Datasets
Classic MODel is Expensive
YIKES!
Growing need for high quality
functional annotations
• High quality annotations allow us to infer the function of genes
• Which allows us to understand the capabilities of genomes and understand the patterns of gene expression
Two problems meet
How can we get
more curators
with finite budgets?
How can we
incorporate more
critical analysis into
undergraduate
education?
What does a functional annotation
have to do with this course?
• Process of attaching information from the
scientific literature to proteins
• CACAO will teach you to become a biocurator
– you will be adding functional annotations to the
biological database GONUTS
(http://gowiki.tamu.edu)
CACAO
Community Assessment of Community Annotation with Ontologies
- How well can you (with our coaching) assign gene functions using GO?
Can students become biocurators? YES!
Semester | Institutions | Rounds | Annotations* / Submitted
Spring 2010 | TAMU | 1 round | 118/153
Fall 2010 | TAMU, UCL | 4 rounds | 496/753
Spring 2011 | TAMU, Miami (Ohio), N. Texas, Penn State, Mich. State | 5 rounds | 726/1013
1340 GO annotations in 2 & 1/2 semesters!
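The 1340 total can be checked from the three semester figures above. The short sketch below adds them up and also computes what fraction of submitted annotations the first number in each pair represents; the variable names are illustrative only, and the meaning of the asterisk on "Annotations*" is not given on the slide.

```python
# (first number, submitted) for each semester, copied from the figures above.
results = {
    "Spring 2010": (118, 153),
    "Fall 2010": (496, 753),
    "Spring 2011": (726, 1013),
}

# Sum of the first number in each pair; this is the slide's headline total.
total_annotations = sum(first for first, _ in results.values())

# Sum of everything submitted across the three semesters.
total_submitted = sum(submitted for _, submitted in results.values())

# Fraction of submissions represented by the first number, per semester.
fractions = {sem: first / submitted for sem, (first, submitted) in results.items()}

print(total_annotations, total_submitted)
for semester, fraction in fractions.items():
    print(f"{semester}: {fraction:.1%}")
```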
Functional annotation with
Gene Ontology
• Controlled vocabulary with
– Term identifiers
• GO:0000075
– Name
• cell cycle checkpoint
– Definitions
• "A point in the eukaryotic cell cycle where
progress through the cycle can be halted
until conditions are suitable for the cell to
proceed to the next stage." [GOC:mah,
ISBN:0815316194]
– Relationships
• is_a GO:0000074 ! regulation of progression
through cell cycle
• Terms arranged in a Directed
Acyclic Graph (DAG)
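A minimal sketch of how a term record and the DAG could be represented in Python. The class name `GOTerm`, the `ancestors` helper, and the two-term dictionary are illustrative assumptions, not the GO Consortium's own data format; only the identifiers, names, and the is_a link come from the slide, and the parent's own parents are left empty here.

```python
from dataclasses import dataclass, field


@dataclass
class GOTerm:
    go_id: str                                  # term identifier, e.g. GO:0000075
    name: str                                   # human-readable name
    is_a: list = field(default_factory=list)    # identifiers of parent terms


# Hypothetical two-term excerpt built from the slide's example.
terms = {
    "GO:0000075": GOTerm("GO:0000075", "cell cycle checkpoint", ["GO:0000074"]),
    "GO:0000074": GOTerm("GO:0000074", "regulation of progression through cell cycle"),
}


def ancestors(go_id, terms):
    """Return the set of all terms reachable by following is_a links upward."""
    seen = set()
    stack = list(terms[go_id].is_a)
    while stack:
        parent = stack.pop()
        if parent in seen:
            continue            # a DAG can reach the same ancestor by two paths
        seen.add(parent)
        if parent in terms:
            stack.extend(terms[parent].is_a)
    return seen


print(ancestors("GO:0000075", terms))
```

Because a term in a DAG may have more than one parent, the traversal keeps a `seen` set so that shared ancestors are visited only once.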
Why use Ontologies?
• Standardization
• facilitate comparison across systems
• facilitate computer based reasoning systems
– Good for data mining!
• leading functional annotation ontology = Gene
Ontology (GO)
What is GO? Who is the GO
Consortium (GOC)?
• GO = ~30,000 terms for gene product attributes
1. Molecular Function (enzyme activity)
2. Biological Process (pathways)
3. Cellular Component (parts of the cell)
• GO Consortium - set of biological databases that are involved in developing GO and contributing GO annotations
Cellular Component
• where a gene product acts
Molecular Function
• activities or “jobs” of a gene product
glucose-6-phosphate isomerase activity
figure from GO consortium presentations
Biological Process
• a commonly recognized series of events
cell division
Figure from Nature Reviews Microbiology 6, 28-40 (January 2008)
Where can we find GO terms?
GONUTS
http://gowiki.tamu.edu
Search for GO terms on GONUTS
http://gowiki.tamu.edu
Which subontology (MF, BP or CC) would the following terms fit in?
GO:0003909 DNA ligase activity
GO:0071705 Nitrogen compound transport
GO:0007124 Pseudohyphal growth
GO:0015123 Acetate transmembrane transporter activity
GO:0071514 Genetic imprinting
GO:0005773 Vacuole
GO:0000312 Plastid small ribosomal subunit
What do we know so far?
1. You will be making functional (GO) annotations using
GO terms.
2. You can search for GO terms on GONUTS.
Questions?
Where are we adding GO
annotations?
GONUTS
http://gowiki.tamu.edu
Why are we using GONUTS?
• Students can add functional annotations to
proteins.
• It has all the GO terms in it, too.
• Some of the GO terms have usage notes.
• It works a lot like Wikipedia, so it’s familiar.
• It has the ability to keep track of each student’s
and team’s annotations.
• We run it.
http://gowiki.tamu.edu
REQUIRED parts of a GO annotation
http://gowiki.tamu.edu/wiki/index.php/ECOLI:LPOB
• GO term
• Evidence code
• Reference
• Notes (about evidence)
** I will cover this again!!
What do we know so far?
1. You will be making functional (GO) annotations using GO terms.
2. You can search for GO terms on GONUTS.
3. You will be adding your GO annotations to GONUTS.
4. There are 4 required parts to a GO annotation.
5. You have to base your annotation on an experiment
published in a scientific paper.
Questions?
Next week
• Review of GO & GO annotations
• More biocurator training
– lots of examples
– lots of practice
BICH 485 & 689 students - please stick
around to talk about these courses!
Plan for training
1. Synthesizing GO annotations
2. Refinements
3. Judging & Assessment
4. Individual & Team tracking
Part 1: Synthesizing
GO annotations
What can you annotate?
• Proteins.
– Any protein with a record in UniProt (Universal
Protein Resource - http://uniprot.org)
• How can you find proteins to annotate?
– Think of ways to identify a protein or paper to
annotate
Choosing a protein to annotate
1. randomly
2. topics of interest (e.g. efflux pump proteins, biofilms, marine biology)
3. papers you have come across while doing other stuff
4. methods you know or want to learn
5. phenotypes and mutants you are interested in
6. by author
7. by pathway or regulon
8. suggested by another
- high ratio of IEA:manual annotations in GONUTS
- mentioned in another class
9. current paper mentions another gene product
10. review papers (e.g. Annual Reviews are excellent sources)
11. UniProt, GONUTS, WikiPathways, PubMed searches
12. protein annotated by other teams
13. ask a coach
Search for GO terms on GONUTS
http://gowiki.tamu.edu
Practice
http://gowiki.tamu.edu
1. What is the GO term for GO:0004713?
2. What is the GO identifier for mitosis?
3. How many results (ballpark) do you get when you search for
cell division using the Go, Search or G buttons?
4. How many child terms are there for plasma membrane?
How many grandchildren?
5. What term is the parent of GO:006825?
Finding a scientific paper on a
certain protein
• Has to be a scientific paper with
experimental data in it.
– Anything else is a valid reason to challenge!
• PubMed, PubMed Central, Google Scholar…
• No review articles
• No books, textbooks, Wikipedia articles, class notes…
• You will need the PMID number
Practice - searching PubMed
http://pubmed.org
1. How many papers do you get when you search for “coli”?
2. How many of those papers are reviews?
3. What is the title of the oldest paper when you search for “coli AND RNA polymerase”?
4. How many results are there when you search for “GTPase activity and Gene Ontology”?
5. What is the PMID of the paper when you search for “Hu JC AND coli AND lysR AND 2010”?
Why do we annotate on GONUTS?
• UniProt (Universal Protein Resource)
will not let us annotate protein records
on their site.
• They are a professionally-curated & closed database.
• GONUTS will.
• GONUTS pulls the info from the UniProt record when it
makes a page for you to edit.
Making a protein page on GONUTS
requires a UniProt accession
• UniProt - http://www.uniprot.org
• UniProt is not community edited, so we can’t add
annotations directly to their database
Practice - Searching UniProt
http://uniprot.org
Find the UniProt accessions for:
a) Mouse Lsr protein
b) Diphtheria toxin from Corynebacterium
c) mutS from E. coli K-12
How do you make a new gene page in GONUTS?
• Use a UniProt accession to make a page on GONUTS that you can add your own annotations to.
• GoPageMaker will:
- Check if the page exists in GONUTS & take you there if it does.
- Make a page & pull all of the annotations from UniProt into a table that you can edit.
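The two GoPageMaker behaviors above amount to a lookup-or-create step, sketched below. This is an illustration only: `pages`, `fetch_uniprot_annotations`, and `go_page_maker` are hypothetical names, the in-memory dictionary stands in for the wiki, and nothing here contacts the real GONUTS or UniProt services. The accession pattern is the one UniProt documents for its accession numbers.

```python
import re

# Accession format as documented by UniProt (6 or 10 characters).
ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

pages = {}  # hypothetical stand-in for the set of existing GONUTS pages


def fetch_uniprot_annotations(accession):
    """Placeholder: a real tool would pull the annotations from UniProt."""
    return []


def go_page_maker(accession):
    """Return (page, created) for a UniProt accession."""
    if not ACCESSION.fullmatch(accession):
        raise ValueError(f"not a UniProt accession: {accession!r}")
    if accession in pages:
        return pages[accession], False      # page exists: go straight to it
    pages[accession] = {
        "accession": accession,
        "annotations": fetch_uniprot_annotations(accession),  # editable table
    }
    return pages[accession], True           # new page was made
```

Calling `go_page_maker` twice with the same accession creates the page the first time and returns the existing page the second time.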
Practice
http://gowiki.tamu.edu
1. How many annotations are on the page for the p53 protein from humans?
2. How many different evidence codes are there on the page for the Bub1a protein from mice?
3. Give one of the paper identifiers for an annotation for the LpxK protein from E. coli.
What do we know so far?
1. You will be making functional (GO) annotations using GO terms.
2. You can search for GO terms on GONUTS.
3. You will be adding your GO annotations to GONUTS.
4. There are 4 required parts to a GO annotation.
5. You have to base your annotation on an experiment published in a
scientific paper.
6. You can annotate any protein with a record in
UniProt.
7. You have to make a page in GONUTS for your
protein using the UniProt accession.
Questions?
What are evidence codes?
• Describe the type of work or analysis done by the authors
• 5 general categories of evidence codes:
1. Experimental
2. Computational
3. Author Statement
4. Curator Assigned
5. Automatically assigned by GO
• CACAO biocurators may only use certain experimental and computational evidence codes
Experimental Evidence Codes
• IDA: Inferred from Direct Assay
• IMP: Inferred from Mutant Phenotype
• IGI: Inferred from Genetic Interaction
• IEP: Inferred from Expression Pattern
• IPI: Inferred from Physical Interaction
• EXP: Inferred from Experiment
http://geneontology.org/GO.evidence.shtml
Computational Evidence Codes
• ISS: Inferred from Sequence or Structural Similarity
• ISO: Inferred from Sequence Orthology
• ISA: Inferred from Sequence Alignment
• ISM: Inferred from Sequence Model
• IGC: Inferred from Genomic Context
• IBA: Inferred from Biological Aspect of Ancestor
• IBD: Inferred from Biological Aspect of Descendant
• IKR: Inferred from Key Residues
• IRD: Inferred from Rapid Divergence
• RCA: Inferred from Reviewed Computational Analysis
http://geneontology.org/GO.evidence.shtml
Summary of Evidence Codes for CACAO
• IDA: Inferred from Direct Assay
• IMP: Inferred from Mutant Phenotype
• IGI: Inferred from Genetic Interaction
• IEP: Inferred from Expression Pattern
• ISO: Inferred from Sequence Orthology
• ISA: Inferred from Sequence Alignment
• ISM: Inferred from Sequence Model
• IGC: Inferred from Genomic Context
• If it’s not one of these 8, your annotation is incorrect!!!
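The "one of these 8" rule can be expressed as a simple membership check. The sketch below uses only the codes listed on these slides; the function name `classify` and its return strings are made up for illustration. Codes from the other three categories (author statement, curator assigned, automatically assigned) are not listed on the slides, so this sketch reports them as not listed.

```python
# Codes exactly as listed on the slides.
EXPERIMENTAL = {"IDA", "IMP", "IGI", "IEP", "IPI", "EXP"}
COMPUTATIONAL = {"ISS", "ISO", "ISA", "ISM", "IGC", "IBA", "IBD", "IKR", "IRD", "RCA"}

# The 8 codes CACAO biocurators may use.
CACAO_ALLOWED = {"IDA", "IMP", "IGI", "IEP", "ISO", "ISA", "ISM", "IGC"}


def classify(code):
    """Say whether an evidence code may be used in a CACAO annotation."""
    code = code.strip().upper()
    if code in CACAO_ALLOWED:
        return "allowed"
    if code in EXPERIMENTAL | COMPUTATIONAL:
        return "listed GO code, but not allowed in CACAO"
    return "not a listed code"


for example in ("IDA", "EXP", "ISS"):
    print(example, "->", classify(example))
```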
Required parts (for every annotation)
• GO term: GO:0004713
• Reference: PMID:1111
• Evidence code: IDA: Inferred from direct assay
• Notes (about evidence): Figure 2a
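As a rough illustration, the four required parts can be held in one record and checked for completeness. Everything named here (`Annotation`, `problems`, the message texts) is hypothetical, and the format rules are assumptions drawn from the example above: a GO identifier with seven digits, a reference written with the PMID: prefix, one of the 8 CACAO evidence codes, and a non-empty note. The check looks only at form; it cannot tell whether the GO term or evidence code is actually right for the paper.

```python
import re
from dataclasses import dataclass

ALLOWED_CODES = {"IDA", "IMP", "IGI", "IEP", "ISO", "ISA", "ISM", "IGC"}


@dataclass
class Annotation:
    go_id: str           # e.g. GO:0004713
    reference: str       # e.g. PMID:1111
    evidence_code: str   # e.g. IDA
    notes: str           # where in the paper the evidence is, e.g. Figure 2a


def problems(annotation):
    """Return a list of form problems; an empty list means all 4 parts are present."""
    found = []
    if not re.fullmatch(r"GO:\d{7}", annotation.go_id):
        found.append("GO ID should look like GO:0004713")
    if not re.fullmatch(r"PMID:\d+", annotation.reference):
        found.append("reference should look like PMID:1111")
    if annotation.evidence_code not in ALLOWED_CODES:
        found.append("evidence code is not one of the 8 CACAO codes")
    if not annotation.notes.strip():
        found.append("notes should say where in the paper the evidence is")
    return found


print(problems(Annotation("GO:0004713", "PMID:1111", "IDA", "Figure 2a")))
```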
What you might also have to fill in
http://geneontology.org/GO.evidence.shtml
Practice - Identify the problem annotation(s) & why
# | GO ID | Reference | Evidence Code | Notes
1. | GO:0003674 | PMID:20372022 | IDA: Inferred from Direct Assay | Table 2.
2. | GO:0016985 | PMID:20372022 | IMP: Inferred from Mutant Phenotype | Table 2.
3. | GO:0016985 | PMID:20372022 | IDA: Inferred from Direct Assay |
4. | GO:0016985 | PMID:20372022 | IDA: Inferred from Direct Assay | Table 2.
5. | GO:0003674 | PMID:20372022 | IDA: Inferred from Direct Assay | Table 2.
6. | GO:0016985 | PMID:20372002 | IGI: Inferred from Genetic Interaction | Table 2.
7. | GO:0016985 | 20372022 | IDA: Inferred from Direct Assay | Table 2.
8. | GO:0016985 | PMID:20372002 | EXP: Inferred from Experiment | Table 2.
9. What is the UniProt accession of the protein described/annotated?
How is CACAO scored?
• Points for a complete annotation
– GO term (right level of specificity)
– Reference (paper)
– Evidence code
– Identify where in the paper the evidence is
• Refinements used to steal points for incorrect &/or incomplete annotations
– Identify a problem
– Suggest correct alternative
• Refinements can be entered by any team (including the original team)
How can you get the annotations
required by Rubric #2?
1. Synthesize complete & correct
annotations.
2. Correctly refine (challenge & correct)
someone else’s annotation.
3. If your annotation gets challenged,
offer the best correction.
Summary
• You will be searching literature for
experimental evidence for a protein’s
function (MF), processes (BP) and
location (CC)
Where do annotations show
up?
Refinements & Challenges
What can you challenge?
Scoreboard
Schedule
Spring 2011 - Results by organism
[Figure: chart of annotation results per organism (y-axis from 0 to 250), with each organism's annotations counted as # wrong, # change, or # perfect. Organism labels on the x-axis include E. coli, C. elegans, Drosophila, Arabidopsis and human.]