uniprotkb-goa_aug2011

Introduction to the Gene Ontology
and GO annotation resources
Rachael Huntley
UniProtKB-GOA Curator
EBI
EBI is an Outstation of the European Molecular Biology Laboratory.
Talk Overview
• Intro to GO and GO terms
• Annotating to GO
• Accessing GO annotations
• Practical use of GO
• GO analysis tools
• Precautions
Reasons for the Gene Ontology
• Increasing amounts of biological data available
• Large datasets need to be interpreted quickly
• Inconsistency in naming of biological concepts
www.geneontology.org
Increasing amounts of biological data available
Search on ‘DNA repair’...
get over 60,000 results
Expansion of sequence
information
Large datasets need to be interpreted quickly
• Need to organise the data, analyse it, share it
in a timely manner to benefit other researchers
• Inconsistencies in analyses make cross-species or cross-database comparison difficult
Inconsistency in naming of biological concepts
English is not a very precise language
• Same name for different concepts
• Different names for the same concept
An example …
Taction
Tactition
Tactile sense
?
Sensory perception of touch ; GO:0050975
The Gene Ontology
• A way to capture biological knowledge in a written and computable form
• A set of concepts and their relationships to each other arranged as a hierarchy
[Figure: term hierarchy running from less specific concepts to more specific concepts]
www.ebi.ac.uk/QuickGO
The Concepts in GO
1. Molecular Function
An elemental activity or task or job
• protein kinase activity
• insulin receptor activity
2. Biological Process
A commonly recognised series of events
• cell division
3. Cellular Component
Where a gene product is located
• mitochondrion
• mitochondrial matrix
• mitochondrial inner membrane
Anatomy of a GO term
Unique identifier
Term name
Synonyms
Definition
Cross-references
Ontology structure
• Directed acyclic graph
Terms can have more than one parent
• Terms are linked by
relationships
is_a
part_of
regulates
+ve regulates
-ve regulates
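A minimal sketch of how such a graph might be held in code, assuming a toy set of cellular component terms. The edges below are illustrative and not copied from a GO release. It shows one term with two parents and a walk up every path towards the root.

```python
from collections import deque

# child -> list of (relationship, parent). Illustrative edges only.
EDGES = {
    "mitochondrial inner membrane": [
        ("is_a", "organelle inner membrane"),
        ("part_of", "mitochondrial envelope"),
    ],
    "mitochondrial envelope": [("part_of", "mitochondrion")],
    "organelle inner membrane": [("is_a", "membrane")],
    "mitochondrion": [("is_a", "organelle")],
    "organelle": [("is_a", "cellular_component")],
    "membrane": [("is_a", "cellular_component")],
}


def ancestors(term, edges=EDGES):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for _relationship, parent in edges.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(ancestors("mitochondrial inner membrane")))
```

Because the graph is acyclic, the walk always terminates; the `seen` set only prevents visiting a shared ancestor twice.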
www.ebi.ac.uk/QuickGO
Searching for GO terms
http://www.ebi.ac.uk/QuickGO
http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
… there are more browsers available on the GO Consortium
Tools page:
http://www.geneontology.org/GO.tools.shtml
The latest OBO format Gene Ontology file can be downloaded from:
http://www.geneontology.org/ontology/gene_ontology.obo
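As a rough illustration of what the OBO file contains, the sketch below parses [Term] stanzas from a short inline sample. The sample is abbreviated and hand-written for illustration, the function name parse_obo_terms is hypothetical, and a real file carries many more tags than the four handled here.

```python
# Hand-written, abbreviated sample in OBO flat-file style (illustrative only).
SAMPLE_OBO = """\
[Term]
id: GO:0050975
name: sensory perception of touch
namespace: biological_process
is_a: GO:0050954 ! sensory perception of mechanical stimulus

[Term]
id: GO:0050954
name: sensory perception of mechanical stimulus
namespace: biological_process
is_a: GO:0007600 ! sensory perception
"""


def parse_obo_terms(text):
    """Return {term id: {'id', 'name', 'namespace', 'is_a': [...]}}."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            # A new stanza starts; only [Term] stanzas are kept.
            current = {"is_a": []} if line == "[Term]" else None
            if current is not None:
                stanzas.append(current)
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            value = value.split(" ! ")[0]  # drop the trailing comment
            if tag == "is_a":
                current["is_a"].append(value)
            elif tag in ("id", "name", "namespace"):
                current[tag] = value
    return {s["id"]: s for s in stanzas if "id" in s}


terms = parse_obo_terms(SAMPLE_OBO)
print(terms["GO:0050975"]["name"], "->", terms["GO:0050975"]["is_a"])
```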
The EBI's QuickGO browser
Search GO terms or proteins
http://www.ebi.ac.uk/QuickGO
GO Annotation
http://www.geneontology.org
Reactome
Aims of the GO project
• Compile the ontologies
- currently over 34,000 terms
- constantly increasing and improving
• Annotate gene products using ontology terms
- around 30 groups provide annotations
• Provide a public resource of data and tools
- regular releases of annotations
- tools for browsing/querying annotations and editing the ontology
Gene Ontology Annotation
(UniProtKB-GOA) database at the EBI
• Largest open-source contributor of annotations to GO
• Member of the GO Consortium since 2001
• Provides annotation for more than 360,000 species
• UniProtKB-GOA’s priority is to annotate the human
proteome
• UniProtKB-GOA is responsible for human, chicken
and bovine annotations in the GO Consortium
A GO annotation is …
…a statement that a gene product;
1. has a particular molecular function
   or is involved in a particular biological process
   or is located within a certain cellular component
2. as determined by a particular method
3. as described in a particular reference
Accession   Name   GO ID        GO term name                      Reference      Evidence code
P00505      GOT2   GO:0004069   aspartate transaminase activity   PMID:2731362   IDA
GOA makes annotations using two methods;
Electronic
• Quick way of producing large numbers of annotations
• Annotations use less-specific GO terms
Manual
• Time-consuming process producing lower numbers
of annotations
• Annotations are very detailed and accurate
Evidence Codes
Some examples…
Electronic
Inferred from Electronic Annotation (IEA) – e.g. UniProtKB keyword mapping
Manual
Inferred from Direct Assay (IDA) – e.g. enzyme assay
Inferred from Physical Interaction (IPI) – e.g. yeast 2-hybrid
See the full list on the GO Consortium website
http://www.geneontology.org/GO.evidence.shtml
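A small sketch of how evidence codes can separate manual from electronic annotations, assuming each annotation is a dict with an "evidence" key. The first record is the P00505 example from this talk; the second record is made up for illustration.

```python
# IEA is the evidence code used for electronic annotation.
ELECTRONIC_CODES = {"IEA"}

annotations = [
    {"protein": "P00505", "go_id": "GO:0004069", "evidence": "IDA"},
    {"protein": "P00505", "go_id": "GO:0005739", "evidence": "IEA"},  # made up
]


def split_by_curation(records):
    """Split records into (manual, electronic) lists by evidence code."""
    manual, electronic = [], []
    for record in records:
        if record["evidence"] in ELECTRONIC_CODES:
            electronic.append(record)
        else:
            manual.append(record)
    return manual, electronic


manual, electronic = split_by_curation(annotations)
print(len(manual), "manual;", len(electronic), "electronic")
```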
Electronic annotation by UniProtKB-GOA
1. Mapping of external concepts to GO terms
e.g. InterPro2GO, Swiss-Prot Keyword2GO, Enzyme Commission2GO
2. Automatic transfer of annotations to orthologs
Ensembl compara
e.g. Human, with orthologs in Macaque, Chimpanzee, Guinea Pig, Rat, Mouse, Cow, Dog, Chicken ...and more
Annotations are high-quality and have an explanation of the method (GO_REF)
Mapping of concepts from UniProtKB files
GO:0005856: cytoskeleton
GO:0004715 ; non-membrane spanning protein tyrosine kinase activity
GO:0007155 ; cell adhesion
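The mapping step can be pictured as a lookup table from external concepts to GO terms. The sketch below assumes a two-entry excerpt keyed by UniProtKB keyword. The keyword spellings, the accession EXAMPLE1 and the function name are hypothetical, and the real Keyword2GO file is not reproduced here.

```python
# Hypothetical excerpt of a keyword-to-GO table (GO IDs taken from the slide).
KEYWORD2GO = {
    "Cytoskeleton": ("GO:0005856", "cytoskeleton"),
    "Cell adhesion": ("GO:0007155", "cell adhesion"),
}


def electronic_annotations(accession, keywords, mapping=KEYWORD2GO):
    """Yield (accession, GO ID, term name, evidence code) for mapped keywords."""
    for keyword in keywords:
        if keyword in mapping:
            go_id, name = mapping[keyword]
            # Annotations produced by a mapping carry the IEA evidence code.
            yield accession, go_id, name, "IEA"


for row in electronic_annotations("EXAMPLE1", ["Cytoskeleton", "Unmapped keyword"]):
    print(row)
```

Keywords without an entry in the table are skipped, so a protein only receives the annotations the mapping can justify.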
Manual annotation by UniProtKB-GOA
High–quality, specific annotations made using:
• Peer-reviewed papers
• A range of evidence codes to categorise
the types of evidence found in a paper
http://www.ebi.ac.uk/GOA
Finding annotations in a paper
…for B. napus PERK1 protein (Q9ARH1)
In this study, we report the isolation and molecular characterization of the B. napus PERK1 cDNA, that is predicted to encode a novel receptor-like kinase. We have shown that like other plant RLKs, the kinase domain of PERK1 has serine/threonine kinase activity. In addition, the location of a PERK1-GTP fusion protein to the plasma membrane supports the prediction that PERK1 is an integral membrane protein…these kinases have been implicated in early stages of wound response…
PubMed ID: 12374299
Function: protein serine/threonine kinase activity GO:0004674
Component: integral to plasma membrane GO:0005887
Process: response to wounding GO:0009611
Additional information
• Qualifiers
Modify the interpretation of an annotation
• NOT (protein is not associated with the GO term)
• colocalizes_with (protein associates with complex but is not a bona fide member)
• contributes_to (describes action of a complex of proteins)
• 'With' column
Can include further information on the
method being referenced
e.g. the protein accession of an interacting protein
UniProtKB-GOA integrates manual annotations
from external groups
Human Protein Atlas
GO:0005739 Mitochondrion
also;
LifeDB (subcellular locations)
Reactome (pathways)
Status of UniProtKB-GOA Annotation
Aug 2011 Statistics
Evidence Source          Annotations   Proteins     UniProt Coverage
Electronic annotations   98,037,844    11,261,382   66%
Manual annotations*      910,536       159,630      0.9%
* Includes manual annotations integrated from external model organism and multi-species databases
How to access and use
GO annotation data
Where can you find annotations?
UniProtKB
Ensembl
Entrez gene
Gene Association Files
17 column files containing all information for each annotation
GO Consortium website
UniProtKB-GOA website
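A hedged sketch of reading one line of such a file, assuming the tab-separated GAF 2.0 layout. The column names follow my reading of that format, not anything stated on the slide. The accession, symbol, GO ID, reference and evidence code come from the P00505 example in this talk; the aspect, object type, taxon, date and assigned-by values are placeholders.

```python
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type", "taxon",
    "date", "assigned_by", "annotation_extension", "gene_product_form_id",
]

# One illustrative record; empty strings stand for columns left blank.
EXAMPLE_LINE = "\t".join([
    "UniProtKB", "P00505", "GOT2", "", "GO:0004069", "PMID:2731362", "IDA",
    "", "F", "", "", "protein", "taxon:9606", "20110801", "UniProtKB", "", "",
])


def parse_gaf_line(line):
    """Return a dict for one annotation line, or None for a '!' comment line."""
    if line.startswith("!"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GAF_COLUMNS):
        raise ValueError(f"expected {len(GAF_COLUMNS)} columns, got {len(fields)}")
    return dict(zip(GAF_COLUMNS, fields))


record = parse_gaf_line(EXAMPLE_LINE)
print(record["db_object_id"], record["go_id"], record["evidence_code"])
```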
GO browsers
The EBI's QuickGO browser
Search GO terms
or proteins
Find sets of
GO annotations
http://www.ebi.ac.uk/QuickGO
How scientists use the GO
• Access gene product functional information
• Analyse high-throughput genomic or proteomic datasets
• Validation of experimental techniques
• Get a broad overview of a proteome
• Obtain functional information for novel gene products
Some examples…
Analysis of high-throughput genomic datasets
MicroArray data analysis
[Figure: microarray expression clusters over time, attacked vs control, labelled: Defense response; Immune response; Response to stimulus; Toll regulated genes; JAK-STAT regulated genes; Puparial adhesion; Molting cycle; Hemocyanin; Amino acid catabolism; Lipid metabolism; Peptidase activity; Protein catabolism]
Bregje Wertheim at the Centre for Evolutionary Genomics,
Department of Biology, UCL and Eugene Schuster Group, EBI.
Analysis of high-throughput proteomic datasets
Characterisation of proteins interacting with ribosomal protein S19
(Orrù et al., Molecular and Cellular Proteomics 2007)
Validation of experimental techniques
Rat liver plasma membrane isolation
(Cao et al., Journal of Proteome Research 2006)
Obtain functional information for novel
gene products
MPYVSQSQHIDRVRGAIEGRLPAPGNSSRLVSSWQRSYEQYRLDPGSVIGPRVLTS
SELR DVQGKEEAFLRASGQCLARLHDMIRMADYCVMLTDAHGVTIDYRIDRDRRGD
FKHAGLYI GSCWSEREEGTCGIASVLTDLAPITVHKTDHFRAAFTTLTCSASPIFAPTG
ELIGVLDAS AVQSPDNRDSQRLVFQLVRQSAALIEDGYFLNQTAQHWMIFGHASRN
FVEAQPEVLIAFD ECGNIAASNRKAQECIAGLNGPRHVDEIFDTSAVHLHDVARTDTI
MPLRLRATGAVLYAR IRAPLKRVSRSACAVSPSHSGQGTHDAHNDTNLDAISRFLHS
RDSRIARNAEVALRIAGK HLPILILGETGVGKEVFAQALHASGARRAKPFVAVNCGAIP
DSLIESELFGYAPGAFTGA RSRGARGKIAQAHGGTLFLDEIGDMPLNLQTRLLRVLA
EGEVLPLGGDAPVRVDIDVICA THRDLARMVEEGTFREDLYYRLSGATLHMPPLRER
ADILDVVHAVFDEEAQSAGHVLTLD GRLAERLARFSWPGNIRQLRNVLRYACAVCDS
TRVELRHVSPDVAALLAPDEAALRPALA LENDERARIVDALTRHHWRPNAAAEALGM
InterProScan
www.ebi.ac.uk/InterProScan
Annotating novel sequences
• Can use BLAST queries to find similar sequences with
GO annotation which can be transferred to the new sequence
• Two tools currently available;
AmiGO BLAST (from GO Consortium)
http://amigo.geneontology.org/cgi-bin/amigo/blast.cgi
– searches the GO Consortium database
BLAST2GO (from Babelomics)
http://www.blast2go.org/
– searches the NCBI database
AmiGO BLAST
Exportin-T from Pongo abelii (Sumatran orangutan)
amigo.geneontology.org/cgi-bin/amigo/blast.cgi
Analysis of large
datasets using GO
Using the GO to provide a functional
overview for a large dataset
• Many GO analysis tools use GO slims to give a broad
overview of the dataset
• GO slims are cut-down versions of the GO and
contain a subset of the terms in the whole GO
• GO slims usually contain less-specialised GO terms
Slimming the GO using the ‘true path rule’
Many gene products are associated with a
large number of descriptive, leaf GO nodes:
Slimming the GO using the ‘true path rule’
…however annotations can be mapped up
to a smaller set of parent GO terms:
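The mapping-up step can be sketched as follows, assuming a toy parent table and a two-term slim. Both are invented for illustration, as are the gene names. Each annotation is replaced by every slim term that is either the annotated term itself or one of its ancestors, which is one possible policy; real slimming tools may differ in detail.

```python
from collections import Counter

# Toy ontology fragment: child -> parents (illustrative, not a GO release).
PARENTS = {
    "mitochondrial inner membrane": ["mitochondrial envelope"],
    "mitochondrial matrix": ["mitochondrion"],
    "mitochondrial envelope": ["mitochondrion"],
    "mitochondrion": ["organelle"],
    "organelle": ["cellular_component"],
}
SLIM = {"mitochondrion", "cellular_component"}


def map_to_slim(term, slim=SLIM, parents=PARENTS):
    """Return the slim terms that are `term` itself or one of its ancestors."""
    found, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            found.add(current)
        stack.extend(parents.get(current, []))
    return found


# Hypothetical gene products, each annotated to a specific (leaf) term.
annotations = {
    "geneA": ["mitochondrial inner membrane"],
    "geneB": ["mitochondrial matrix"],
}

counts = Counter()
for gene, gene_terms in annotations.items():
    slim_terms = set()
    for term in gene_terms:
        slim_terms |= map_to_slim(term)
    counts.update(slim_terms)  # each gene counts once per slim term

print(dict(counts))
```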
GO slims
Custom slims are available for download;
http://www.geneontology.org/GO.slims.shtml
or you can make your own using;
• QuickGO
http://www.ebi.ac.uk/QuickGO
• AmiGO's GO slimmer
http://amigo.geneontology.org/cgi-bin/amigo/slimmer
The EBI's QuickGO browser
Search GO terms
or proteins
Find sets of
GO annotations
Map-up annotations
with GO slims
www.ebi.ac.uk/QuickGO
GO analysis tools
Numerous Third Party Tools
www.geneontology.org/GO.tools.shtml
Term enrichment
• Most popular type of GO analysis
• Determines which GO terms are more often
associated with a specified list of genes/proteins
compared with a control list or rest of genome
• Many tools available to do this analysis
• User must decide which is best for their analysis
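One common way to score enrichment is a one-sided hypergeometric test, sketched below in plain Python. The slide does not name a specific test, so this is an assumption, and the function and parameter names are hypothetical. A real analysis would also correct for testing many GO terms at once, which is left out here.

```python
from math import comb


def enrichment_p_value(study_hits, study_size, population_hits, population_size):
    """P(X >= study_hits) when drawing study_size genes without replacement
    from a population in which population_hits genes carry the GO term."""
    total = comb(population_size, study_size)
    upper = min(study_size, population_hits)
    tail = sum(
        comb(population_hits, k)
        * comb(population_size - population_hits, study_size - k)
        for k in range(study_hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 5 of 10 study genes carry a term that 20 of
# 1000 genes in the reference list carry.
print(enrichment_p_value(5, 10, 20, 1000))
```

A small value suggests the term appears in the study list more often than chance would explain; the reference list chosen (control list or rest of genome) changes the result, so it should be reported with the analysis.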
Precautions when using GO
annotations for analysis
• The Gene Ontology is always changing and GO annotations are
continually being created
- always use a current version of both
- if publishing your analyses please report the versions/dates you used
http://www.geneontology.org/GO.cite.shtml
• Recommended that ‘NOT’ annotations are removed before analysis
- only ~5000 out of ~100 million annotations are ‘NOT’
- can confuse the analysis
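A minimal sketch of removing 'NOT' annotations before analysis, assuming each record is a dict whose "qualifier" value holds pipe-separated qualifiers as in a gene association file. The accessions EXAMPLE1 to EXAMPLE4 and the function names are hypothetical.

```python
def is_negated(qualifier):
    """True when 'NOT' is one of the pipe-separated qualifiers."""
    return "NOT" in qualifier.split("|")


def drop_not_annotations(records):
    """Keep only records that assert (not negate) the GO term."""
    return [r for r in records if not is_negated(r.get("qualifier", ""))]


records = [
    {"protein": "EXAMPLE1", "go_id": "GO:0005739", "qualifier": ""},
    {"protein": "EXAMPLE2", "go_id": "GO:0005739", "qualifier": "NOT"},
    {"protein": "EXAMPLE3", "go_id": "GO:0004674", "qualifier": "NOT|contributes_to"},
    {"protein": "EXAMPLE4", "go_id": "GO:0004674", "qualifier": "contributes_to"},
]

kept = drop_not_annotations(records)
print([r["protein"] for r in kept])
```

Splitting on the pipe avoids discarding a record merely because another qualifier happens to contain the letters "NOT".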
Precautions when using GO
annotations for analysis
• Unannotated is not unknown
- where there is no evidence in the literature for a process, function or
location the gene product is annotated to the appropriate ontology’s
root node with an ‘ND’ evidence code (no biological data), thereby
distinguishing between unannotated and unknown
• Pay attention to under-represented GO terms
- a strong under-representation of a pathway may mean that normal
functioning of that pathway is necessary for the given condition
The UniProtKB-GOA group
Curators:
Emily Dimmer
Rachael Huntley
Yasmin Alam-Faruque
Software developer:
Tony Sawford
Team leaders:
Rolf Apweiler
Claire O’Donovan
Email: [email protected]
http://www.ebi.ac.uk/GOA
Acknowledgements
Members of;
UniProtKB
GO Consortium
InterPro
IntAct
HAMAP
Funding
National Human Genome Research Institute (NHGRI)
Kidney Research UK
EMBL