GOA: Looking after GO annotations
Emily Dimmer
Gene Ontology Annotation (GOA) Database
European Bioinformatics Institute
Cambridge
UK
EBI is an Outstation of the European Molecular Biology Laboratory.
http://www.geneontology.org
EMBRACE Workshop 7-9th November 2007
Gene Ontology Annotation (GOA) Database
• Member of the GO Consortium since 2001
• Largest open-source contributor of annotations to GO
• Provides annotation for more than 139,000 species
• GOA’s priority is to annotate the human proteome
• GOA is responsible for human, chicken and bovine annotations in the GO Consortium
GOA Group
GOA office
EMBL-EBI
Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
GOA Group
• Emily Dimmer (GOA coordinator)
• Evelyn Camon (senior GOA curator)
• Rachael Huntley (GOA curator)
• Daniel Barrell (GOA file releases & database)
• David Binns (QuickGO, protein2go tools)
Along with the help of UniProt curators at the EBI, UniProt controlled
vocabularies, HAMAP group, InterPro group, IntAct curators, the IPI
group, Ensembl, other EBI groups
…and of course the GO editors and the other GO Consortium
annotation groups
How does GOA annotate to the GO?
• Electronic Annotation
• Manual Annotation
• Both these methods have their advantages
• They can be easily distinguished by the evidence code used.
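A minimal sketch of how the two methods might be told apart by evidence code. It assumes that IEA is the only code marking an electronic annotation and that every other code counts as manual; the function names are hypothetical, and the two sample rows reuse accessions and GO IDs from the Gene Association File example later in this talk.

```python
ELECTRONIC_CODES = {"IEA"}  # assumption: only IEA marks electronic annotation


def is_electronic(evidence_code: str) -> bool:
    """Return True when the evidence code marks an electronic annotation."""
    return evidence_code.strip().upper() in ELECTRONIC_CODES


def split_by_method(annotations):
    """Split (protein, go_id, evidence) tuples into (electronic, manual) lists."""
    electronic, manual = [], []
    for record in annotations:
        target = electronic if is_electronic(record[2]) else manual
        target.append(record)
    return electronic, manual


sample = [
    ("Q9H2K8", "GO:0004674", "IDA"),
    ("O00110", "GO:0003676", "IEA"),
]
electronic, manual = split_by_method(sample)
```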
Status of GOA Annotation (October 2007 Stats)

Evidence Source          Annotations   Proteins    UniProt coverage
Electronic annotations   22,774,674    3,362,148   63.7 %
Manual annotations       450,489       86,778      1.6 %

• Annotations provided to over 140,000 taxa
• Total of 415,576 PubMed references included as evidence.
• Manual annotations integrated from external model organism and multispecies databases: AgBase, DictyBase, Ensembl, FlyBase, GDB, GeneDB (S. pombe), Gramene, HGNC, MGI, Reactome, RGD, Roslin, SGD, TAIR, TIGR, WormBase, ZFIN, the IntAct protein-protein interaction database, LIFEdb and the Proteome Inc dataset
Core information needed for a GO annotation
1. Gene or gene product identifier
   e.g. Q9ARH1
2. GO term ID
   e.g. GO:0004674 (protein serine/threonine kinase activity)
3. Reference ID
   e.g. PubMed ID: 12374299, GO_REF:0000001
4. Evidence code
   e.g. IDA
..and also in some cases:
- Qualifiers available to modify interpretation of annotation: NOT, contributes_to, colocalizes_with
- ‘With’ column information, to provide further information on the method (evidence code)
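The four core fields and the two optional ones can be pictured as a small record type. This is only an illustrative sketch: the class name, field names and the `PMID:` prefix convention are assumptions made for the example, not the GOA database schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GoAnnotation:
    """One GO annotation: four core fields plus two optional ones."""
    object_id: str                    # gene or gene product identifier
    go_id: str                        # GO term ID
    reference: str                    # PubMed ID or GO_REF
    evidence: str                     # evidence code
    qualifier: Optional[str] = None   # NOT, contributes_to or colocalizes_with
    with_info: Optional[str] = None   # 'With' column information

    def __post_init__(self):
        # Every core field must be present; the optional ones may stay empty.
        for name in ("object_id", "go_id", "reference", "evidence"):
            if not getattr(self, name):
                raise ValueError(f"missing core field: {name}")


example = GoAnnotation("Q9ARH1", "GO:0004674", "PMID:12374299", "IDA")
```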
Electronic Annotation
• A number of different techniques used by different GO Consortium annotation groups.
• All resulting annotations must be high-quality and provide an explanation of the method (GO_REF)
1. Mapping of external concepts to GO terms
2. Automatic transfer of annotations to orthologs
Electronic annotation: GO mappings
External concept -> GO term
• Fatty acid biosynthesis (SwissProt keyword) -> GO:fatty acid biosynthesis (GO:0006633)
• EC:6.4.1.2 (EC number) -> GO:acetyl-CoA carboxylase activity (GO:0003989)
• IPR000438: Acetyl-CoA carboxylase carboxyl transferase beta subunit (InterPro entry) -> GO:acetyl-CoA carboxylase activity (GO:0003989)
• MF_00527: Putative 3-methyladenine DNA glycosylase (HAMAP) -> GO:DNA repair (GO:0006281)
Camon et al. BMC Bioinformatics. 2005; 6 Suppl 1:S17
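The mapping idea can be sketched as a lookup table from external concepts to GO IDs. The four entries below are the examples from this slide; the key layout, the function name and the placeholder protein ID are hypothetical, and the real mapping files are far larger.

```python
# (source, identifier) -> GO term IDs, taken from the slide's examples.
EXTERNAL_TO_GO = {
    ("SwissProt keyword", "Fatty acid biosynthesis"): ["GO:0006633"],
    ("EC", "6.4.1.2"): ["GO:0003989"],
    ("InterPro", "IPR000438"): ["GO:0003989"],
    ("HAMAP", "MF_00527"): ["GO:0006281"],
}


def electronic_annotations(protein_id, features):
    """Map a protein's external features to IEA annotations, without duplicates."""
    seen = set()
    result = []
    for feature in features:
        for go_id in EXTERNAL_TO_GO.get(feature, []):
            if go_id not in seen:
                seen.add(go_id)
                result.append((protein_id, go_id, "IEA"))
    return result


# "EXAMPLE_PROTEIN" is a placeholder, not a real accession.
demo = electronic_annotations(
    "EXAMPLE_PROTEIN",
    [("EC", "6.4.1.2"), ("InterPro", "IPR000438"),
     ("SwissProt keyword", "Fatty acid biosynthesis")],
)
```

The EC number and the InterPro entry both point at GO:0003989, so the sketch emits that term once.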
http://www.geneontology.org/GO.indices.shtml
Automatic transfer of annotations to orthologs
Ensembl COMPARA
Homologies between different species calculated
GO terms projected from MANUAL annotation only (IDA, IEP, IGI, IMP, IPI)
One-to-one and apparent one-to-one orthologies only used.
http://www.ensembl.org/info/data/compara
[Figure: projection of annotations between species. Species shown: Human, Mouse, Rat, Zebrafish, Xenopus, Drosophila, Macaque, Chimpanzee, Guinea Pig, Dog, Chicken, Anopheles, Tetraodon, Fugu, Aedes aegypti]
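A rough sketch of the projection rule described on this slide: only annotations with the listed manual evidence codes are transferred, and only across one-to-one or apparent one-to-one orthologies. The data layout, the orthology-type labels and the gene names are hypothetical, and labelling the projected rows IEA is an assumption based on this method being presented as electronic annotation. It is not the Ensembl Compara pipeline itself.

```python
PROJECTABLE_CODES = {"IDA", "IEP", "IGI", "IMP", "IPI"}
ALLOWED_ORTHOLOGY = {"one-to-one", "apparent one-to-one"}  # hypothetical labels


def project_annotations(source_annotations, orthologs):
    """
    source_annotations: list of (gene, go_id, evidence)
    orthologs: list of (source_gene, target_gene, orthology_type)
    Returns (target_gene, go_id, "IEA", source_gene) tuples.
    """
    terms_by_gene = {}
    for gene, go_id, evidence in source_annotations:
        if evidence in PROJECTABLE_CODES:
            terms_by_gene.setdefault(gene, []).append(go_id)

    projected = []
    for source_gene, target_gene, kind in orthologs:
        if kind not in ALLOWED_ORTHOLOGY:
            continue  # skip one-to-many and many-to-many relationships
        for go_id in terms_by_gene.get(source_gene, []):
            projected.append((target_gene, go_id, "IEA", source_gene))
    return projected


demo_annotations = [
    ("human_gene_A", "GO:0004674", "IDA"),
    ("human_gene_A", "GO:0005515", "TAS"),   # not projected: TAS is not listed
    ("human_gene_B", "GO:0009611", "IMP"),
]
demo_orthologs = [
    ("human_gene_A", "mouse_gene_A", "one-to-one"),
    ("human_gene_B", "mouse_gene_B", "one-to-many"),  # not projected
]
demo_projected = project_annotations(demo_annotations, demo_orthologs)
```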
Manual Annotation
• High-quality, specific annotations made using:
• Peer-reviewed papers
• A range of evidence codes to categorize the types of evidence found in a paper
• Very time consuming and requires trained biologists
Finding Annotations
In this study, we report the isolation and molecular characterization of the B. napus PERK1 cDNA, that is predicted to encode a novel receptor-like kinase. We have shown that like other plant RLKs, the kinase domain of PERK1 has serine/threonine kinase activity. In addition, the location of a PERK1-GTP fusion protein to the plasma membrane supports the prediction that PERK1 is an integral membrane protein… these kinases have been implicated in early stages of wound response…

PubMed ID: 12374299

…for B. napus PERK1 protein (Q9ARH1):
FUNCTION: protein serine/threonine kinase activity (GO:0004674)
COMPONENT: integral to plasma membrane (GO:0005887)
PROCESS: response to wounding (GO:0009611)
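To illustrate how one paper can yield annotations in all three ontologies, the PERK1 example might be written out as rows that share a protein and a reference. This is a sketch only: the function name and tuple layout are hypothetical, the one-letter aspect codes (F, C, P) are assumed, and no evidence codes are attached because the slide does not give one per term.

```python
ASPECT_LETTER = {"FUNCTION": "F", "COMPONENT": "C", "PROCESS": "P"}

# The three terms curated from PubMed ID 12374299 for Q9ARH1.
CURATED_TERMS = [
    ("FUNCTION", "protein serine/threonine kinase activity", "GO:0004674"),
    ("COMPONENT", "integral to plasma membrane", "GO:0005887"),
    ("PROCESS", "response to wounding", "GO:0009611"),
]


def annotation_rows(protein_id, pubmed_id, terms):
    """Build (protein, go_id, reference, aspect) rows for one curated paper."""
    return [
        (protein_id, go_id, f"PMID:{pubmed_id}", ASPECT_LETTER[aspect])
        for aspect, _name, go_id in terms
    ]


rows = annotation_rows("Q9ARH1", "12374299", CURATED_TERMS)
```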
Evidence Codes
IEA   Inferred from Electronic Annotation
IDA   Inferred from Direct Assay
IMP   Inferred from Mutant Phenotype
IPI   Inferred from Physical Interaction
IEP   Inferred from Expression Pattern
IGI   Inferred from Genetic Interaction
ISS*  Inferred from Sequence or Structural Similarity
IGC   Inferred from Genomic Context
RCA   Reviewed Computational Analysis
TAS   Traceable Author Statement
NAS   Non-traceable Author Statement
IC    Inferred from Curator Judgement
ND    No Data available

IDA: Inferred from Direct Assay
• Enzyme assays
• In vitro reconstitution
• Immunofluorescence
• Cell fractionation

TAS: Traceable Author Statement
• In the literature source the original experiments referred to are referenced.
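The code list lends itself to a simple lookup table, for example to tally the evidence codes in a set of annotations and reject anything unrecognised. The expansions below follow this slide, except that IPI is expanded as "Inferred from Physical Interaction", the GO Consortium's wording. The function name is hypothetical.

```python
from collections import Counter

EVIDENCE_CODES = {
    "IEA": "Inferred from Electronic Annotation",
    "IDA": "Inferred from Direct Assay",
    "IMP": "Inferred from Mutant Phenotype",
    "IPI": "Inferred from Physical Interaction",
    "IEP": "Inferred from Expression Pattern",
    "IGI": "Inferred from Genetic Interaction",
    "ISS": "Inferred from Sequence or Structural Similarity",
    "IGC": "Inferred from Genomic Context",
    "RCA": "Reviewed Computational Analysis",
    "TAS": "Traceable Author Statement",
    "NAS": "Non-traceable Author Statement",
    "IC": "Inferred from Curator Judgement",
    "ND": "No Data available",
}


def count_evidence(codes):
    """Count evidence codes, raising ValueError for any code not in the table."""
    unknown = sorted(set(codes) - set(EVIDENCE_CODES))
    if unknown:
        raise ValueError(f"unknown evidence codes: {unknown}")
    return Counter(codes)


counts = count_evidence(["IDA", "IEA", "IEA", "IMP"])
```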
The ‘Qualifier’ Column
The Qualifier column is used to modify the interpretation of an annotation.
Allowable values are: NOT, colocalizes_with, contributes_to
The ‘NOT’ qualifier
• 'NOT' is used to make an explicit note that the gene product is not associated with the GO term.
… particularly important when an association between a GO term and a gene product should be avoided (but might otherwise be made, especially by an automated method).
e.g. This protein does not have ‘kinase activity’ because it has been found to have a disrupted or missing ‘ATP binding’ domain.
Also used to document conflicting claims in the literature.
NOT can be used with ALL three GO Ontologies.
The ‘colocalizes_with’ qualifier
• Gene products that are transiently or peripherally associated with an organelle or complex may be annotated to the relevant cellular component term, using the 'colocalizes_with' qualifier.
Only used with GO Component Ontology
The ‘contributes_to’ qualifier
An individual gene product that is part of a complex can be annotated to terms that describe the action (function or process) of the whole complex,
i.e. annotating 'to the potential of the complex'
• distinguishes an individual subunit from complex functions
All gene products annotated using 'contributes_to' must also be annotated to a cellular component term representing the complex that possesses the activity.
Only used with GO Function Ontology
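The qualifier rules from the last three slides could be checked along these lines. This is a sketch under stated assumptions: ontologies are given as the one-letter aspect codes F, P and C, the function names and subunit IDs are hypothetical, and the second check only tests that some component annotation exists, not that the term really represents the complex.

```python
# Which ontologies (aspects) each qualifier may be used with.
QUALIFIER_ASPECTS = {
    "NOT": {"F", "P", "C"},        # all three ontologies
    "colocalizes_with": {"C"},     # Component only
    "contributes_to": {"F"},       # Function only
}


def qualifier_allowed(qualifier, aspect):
    """Return True if the qualifier (or no qualifier) is valid for the aspect."""
    if not qualifier:
        return True
    if qualifier not in QUALIFIER_ASPECTS:
        raise ValueError(f"unknown qualifier: {qualifier}")
    return aspect in QUALIFIER_ASPECTS[qualifier]


def lacks_component_annotation(annotations):
    """
    annotations: iterable of (object_id, qualifier, aspect).
    Return IDs that use contributes_to but have no component annotation at all.
    """
    contributes, has_component = set(), set()
    for object_id, qualifier, aspect in annotations:
        if qualifier == "contributes_to":
            contributes.add(object_id)
        if aspect == "C":
            has_component.add(object_id)
    return sorted(contributes - has_component)


demo = [
    ("subunit_1", "contributes_to", "F"),
    ("subunit_1", None, "C"),
    ("subunit_2", "contributes_to", "F"),
]
flagged = lacks_component_annotation(demo)
```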
Where does GOA data go?
QuickGO browser:
Human Insulin Receptor (P06213)…
http://www.ebi.ac.uk/quickgo
GO data in Ensembl
GOA data in Entrez Gene
http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
Gene Association Files
Tab delimited files: http://www.geneontology.org/GO.current.annotations.shtml

Columns 1-8:
DB       DB_Object_ID  DB_Object_Symbol  Qualifier*  GO_id       DB:Ref          Evidence  With*
UniProt  Q9H2K8        TAOK3_HUMAN       -           GO:0004674  PMID:10559204   IDA       -
UniProt  O00110        O00110_HUMAN      -           GO:0003676  GO_REF:0000002  IEA       InterPro:IPR007087
UniProt  P09884        DPOLA_HUMAN       -           GO:0000731  PMID:1730053    IMP       -
UniProt  P09936        UCHL1_HUMAN       NOT         GO:0005515  PMID:12082530   IPI       UniProt:P46527

Columns 9-15 (same row order):
Aspect  DB_Object_Name*              DB_Object_Synonym*  DB_Object_Type  Taxon       Date      Assigned By
F       Serine/threonine-protein..   IPI00410485         protein         taxon:9606  20070720  HGNC
F       -                            -                   protein         taxon:9606  20070720  UniProt
P       DNA polymerase alpha..       IPI00220317         protein         taxon:9606  20060825  UniProt
F       UCHL1: Ubiquitin carboxyl..  IPI00018352         protein         taxon:9606  20070720  IntAct

* = optional field
- = field left empty
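Because the files are tab delimited with the fifteen columns listed on this slide, a row can be read with a few lines of code. The sketch below is a minimal reader, not an official parser: the function name is hypothetical, lines starting with `!` are treated as comments, and the sample row is the UCHL1 example from the slide with its truncated name kept as shown.

```python
GAF_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GO_id",
    "DB:Ref", "Evidence", "With", "Aspect", "DB_Object_Name",
    "DB_Object_Synonym", "DB_Object_Type", "Taxon", "Date", "Assigned_By",
]


def parse_gaf_line(line):
    """Return a dict for one annotation row, or None for comment/blank lines."""
    line = line.rstrip("\r\n")  # keep trailing tabs so empty columns survive
    if not line or line.startswith("!"):
        return None
    fields = line.split("\t")
    if len(fields) != len(GAF_COLUMNS):
        raise ValueError(
            f"expected {len(GAF_COLUMNS)} columns, got {len(fields)}"
        )
    return dict(zip(GAF_COLUMNS, fields))


SAMPLE_LINE = "\t".join([
    "UniProt", "P09936", "UCHL1_HUMAN", "NOT", "GO:0005515",
    "PMID:12082530", "IPI", "UniProt:P46527", "F",
    "UCHL1: Ubiquitin carboxyl..", "IPI00018352", "protein",
    "taxon:9606", "20070720", "IntAct",
])
record = parse_gaf_line(SAMPLE_LINE)
```

Optional columns (Qualifier, With, DB_Object_Name, DB_Object_Synonym) come back as empty strings when a row leaves them blank.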
http://www.geneontology.org/GO.current.annotations.shtml
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/
http://www.ebi.ac.uk/GOA/downloads.html
Output from the GOA database
Redundant
Cow
Non-Redundant
based on IPI
(International Protein Index)
625 proteome sets
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/
… annotations are also displayed in:
• All GO Consortium Model Organism Databases integrate and exchange GO annotation data to ensure a comprehensive set of annotations for their organism/area of interest.
• Array products and data analysis: Affymetrix, Spotfire, Almac
… and Numerous Third Party Tools
(http://www.geneontology.org/GO.tools.shtml)
What’s new on the GO annotation front?
Reference Genomes
• Comprehensive annotation of a set of conserved pathway and disease-related proteins in human and orthologs in 11 other selected genomes
• Empowers comparative methods used in first pass annotation of other proteomes.
Arabidopsis thaliana
Caenorhabditis elegans
Danio rerio (zebrafish)
Dictyostelium discoideum
Drosophila melanogaster
Escherichia coli
Homo sapiens
Saccharomyces cerevisiae
Mus musculus
Schizosaccharomyces pombe
Gallus gallus
Rattus norvegicus
GOA annotation focuses
Cardiovascular GO annotation
Grant with the British Heart Foundation to support a
collaboration with HGNC curators to provide full Gene Ontology
annotation to genes associated with cardiovascular processes
wiki: http://wiki.geneontology.org/index.php/Cardiovascular
Immune GO annotation
Interest in actively GO annotating immune-relevant genes.
GOA, UCL and MGI are collaborating to improve annotation for immunologically important genes; WT grant pending.
wiki: http://wiki.geneontology.org/index.php/Immunology
Electronic Annotation developments
New mappings:
• Swiss-Prot Subcellular Location to GO (just released)
• Swiss-Prot UniPathway
Expansion of existing methods
• Ensembl Compara species expansion
Acknowledgements
Rolf Apweiler, Head of the EBI protein sequence database group
Emily Dimmer
Evelyn Camon
Rachael Huntley
Daniel Barrell
David Binns
Contact the GOA team: [email protected]
GOA web page: http://www.ebi.ac.uk/goa
The Gene Ontology Consortium and 1.5 members of GOA are currently supported by a P41 grant from the National Human Genome Research Institute (NHGRI) [grant HG002273]. GOA is also supported by core EMBL funding and a BBSRC Tools and Resources grant.