Transcript Document

Gene Ontology (GO)
Emily Dimmer
GOA group
European Bioinformatics Institute
Wellcome Trust Genome Campus
Cambridge
UK
GO Tutorial Outline:
• Introduction to GO
• Description of the GO ontologies
• How groups annotate to GO
• Practical:
• Investigating the GO and OBO web sites
• Browsing the GO using the AmiGO
Browser.
• Open Biomedical Ontologies
• How GO is being used
• Available Tools
• GO slims
• Practical:
• Creating your own GO slim
Why is GO needed?
THE PROBLEM:
• Huge body of knowledge with an extremely large vocabulary to
describe it
• Vocabulary used is poorly defined
– i.e. one word can have different meanings
– or different names for the same concept
• Biological systems are complex and our knowledge of such systems
is incomplete
RESULT:
Large databases which are difficult to manage and
impossible to mine computationally
What is GO?
• A (part of the) solution:
GO:
“a controlled vocabulary that can be
applied to all organisms even as
knowledge of gene and protein roles in
cells is accumulating and changing”
What can scientists do with GO?
• Access gene product functional information
• Provide a link between biological knowledge and …
• gene expression profiles
• proteomics data
• Find how much of a proteome is involved in a process/
function/ component in the cell
• using a GO-Slim
(a slimmed down version of GO to summarize biological
attributes of a proteome)
• Map GO terms and incorporate manual GOA annotation into
own databases
• to enhance your dataset
• or to validate automated ways of deriving information about
gene function (text-mining).
Tactition
Taction
Tactile sense
perception of touch ; GO:0050975
GO
Three (Orthogonal) Ontologies
• Molecular Function: elemental activity or task
e.g. DNA binding, catalysis of a reaction
• Biological Process: broad objective or goal
e.g. mitosis, signal transduction, metabolism
• Cellular Component: location or complex
e.g. nucleus, ribosome
How does GO work?
• Provides a standard, species-neutral
way of representing biology
• GO covers ‘normal’ functions and
processes
– No pathological processes
– No experimental conditions
Content of GO
Molecular Function: 7,493 terms
Biological Process: 9,640 terms
Cellular Component: 1,634 terms
Total: 18,767 terms
Definitions: 16,696 (93.9 %)
What is GO?
• NOT a system of nomenclature or a list of
gene products
• GO doesn’t attempt to cover all aspects
of biology or evolutionary relationships
Open Biomedical Ontologies
http://obo.sourceforge.net
• NOT a dictated standard
• NOT a way to unify databases
http://www.geneontology.org
Reactome
Anatomy of a GO term
• GO terms are composed of:
• Term name
• Unique GO ID
• Definition (93 % of GO terms are
defined)
• Synonyms (optional)
• Database references (optional)
• Relationships to other GO terms
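The parts of a GO term listed above can be sketched as a small data structure. The snippet below is a minimal illustration only: it reads a simplified, OBO-like stanza into a `GOTerm` record. The class name, the `parse_stanza` helper, the placeholder definition text and the parent ID `GO:0000000` are all hypothetical; the term name, ID and synonyms are the "perception of touch" example from the earlier slide. Real OBO files carry extra detail (synonym scopes, trailing comments) that this sketch ignores.

```python
from dataclasses import dataclass, field


@dataclass
class GOTerm:
    go_id: str
    name: str
    definition: str = ""
    synonyms: list = field(default_factory=list)   # optional
    xrefs: list = field(default_factory=list)      # optional database references
    parents: list = field(default_factory=list)    # (relationship, parent GO ID)


def parse_stanza(text):
    """Parse one simplified 'tag: value' stanza into a GOTerm."""
    term = GOTerm(go_id="", name="")
    for line in text.strip().splitlines():
        tag, _, value = line.partition(": ")
        value = value.strip()
        if tag == "id":
            term.go_id = value
        elif tag == "name":
            term.name = value
        elif tag == "def":
            term.definition = value.strip('"')
        elif tag == "synonym":
            term.synonyms.append(value.strip('"'))
        elif tag == "xref":
            term.xrefs.append(value)
        elif tag == "is_a":
            term.parents.append(("is_a", value))
        elif tag == "relationship":
            rel, parent = value.split()[:2]
            term.parents.append((rel, parent))
    return term


# Hypothetical stanza: definition text and parent ID are placeholders.
SAMPLE = """
id: GO:0050975
name: perception of touch
def: "PLACEHOLDER definition text"
synonym: "tactition"
synonym: "taction"
synonym: "tactile sense"
is_a: GO:0000000
"""

term = parse_stanza(SAMPLE)
print(term.go_id, term.name, term.synonyms)
```

Keeping synonyms and database references as plain lists mirrors the slide: both are optional, so an empty list is a valid state.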
I. The GO Ontologies
Ontologies
• “Ontologies provide controlled, consistent
vocabularies to describe concepts and
relationships, thereby enabling knowledge
sharing” (Gruber 1993)
Ontology applications
Can be used to:
• Formalise the representation of biological
knowledge
• Describe a common and defined vocabulary for
database annotation
• Standardise database submissions
• Provide unified access to information through
ontology-based querying of databases, both
human and computational
• Improve management and integration of data
within databases.
• Facilitate data mining
Ontology Structure
• Ontologies can be represented as graphs,
where the vertices (nodes and leaves) are
connected by edges.
• The nodes are concepts in the ontology.
• The edges are the relationships between the
concepts
[Figure: three nodes connected by edges]
Ontology Structure
• The Gene Ontology is structured as a
hierarchical directed acyclic graph (DAG).
• Terms are linked by two relationships
– is-a
– part-of
• Terms can have more than one parent
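A DAG with typed edges can be sketched as a dictionary from each term to its list of (relationship, parent) pairs. The example below is an illustration under assumptions: the term names come from the cellular-component figure in this tutorial, but the exact edges are assumed, and the `PARENTS` table and `ancestors` helper are hypothetical names.

```python
# Each term maps to its parents; a term may have more than one parent.
# Edges are assumed for illustration.
PARENTS = {
    "cell": [],
    "membrane": [("part_of", "cell")],
    "chloroplast": [("part_of", "cell")],
    "mitochondrial membrane": [("is_a", "membrane")],
    "chloroplast membrane": [("is_a", "membrane"), ("part_of", "chloroplast")],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent edges upwards."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in parents.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(ancestors("chloroplast membrane")))
```

In a simple tree, "chloroplast membrane" could sit under only one of "membrane" or "chloroplast"; the DAG lets it sit under both.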
[Figure: Simple hierarchies (Trees) compared with Directed Acyclic Graphs]
[Figure: Directed Acyclic Graph with the terms cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane, linked by is-a and part-of relationships]
True Path Rule
• The path from a child term all the way
up to its top-level parent(s) must always
be true
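One practical consequence of the True Path Rule is that an annotation to a child term implies annotation to every ancestor. The sketch below illustrates that propagation. The term names come from the nuclear chromosome example in this tutorial, but the edges, the gene name `geneX` and the helper names are assumptions for illustration only.

```python
# Assumed parent links (relationship types omitted for brevity).
PARENTS = {
    "cell": [],
    "nucleus": ["cell"],
    "chromosome": ["cell"],
    "nuclear chromosome": ["chromosome", "nucleus"],
}


def with_ancestors(term, parents):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(annotations, parents):
    """Expand each gene's direct annotations to include all implied ancestor terms."""
    return {
        gene: set().union(*(with_ancestors(t, parents) for t in terms))
        for gene, terms in annotations.items()
    }


direct = {"geneX": {"nuclear chromosome"}}  # hypothetical gene product
print(propagate(direct, PARENTS))
```

If any implied ancestor would be false for a gene product annotated to the child, the ontology structure (not the annotation) needs fixing.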
[Figure: True Path Rule example with is-a and part-of links between the terms cell, cytoplasm, chromosome, nucleus and nuclear chromosome]
Ensuring Stability in a Dynamic Ontology
• Terms become obsolete when they are removed or redefined
• GO IDs are never deleted
• For each term, a comment is added to explain why the term
is now obsolete
[Figure: the Biological Process, Molecular Function and Cellular Component ontologies, each with a matching Obsolete Biological Process, Obsolete Molecular Function and Obsolete Cellular Component node]
Access to the Gene Ontology
• Downloads
• formats available:
OBO
GO
XML
OWL
MySQL
(http://www.geneontology.org/GO.downloads)
• Web-based tools
• AmiGO
(http://www.godatabase.org)
• QuickGO
(http://www.ebi.ac.uk/ego)
II. Annotating to GO
Use of GO terms to represent the activities
and localizations of gene products.
Basic information needed:
1. Database object (e.g. a protein or gene identifier)
e.g. Q9ARH1
2. Reference ID
e.g. PubMed ID: 12374299
3. GO term ID
e.g. GO:0004674
4. Evidence code
e.g. TAS
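The four pieces of basic information can be sketched as a small record with simple validation. This is a minimal illustration: the `Annotation` class and its field names are hypothetical, the GO ID pattern (the prefix `GO:` plus seven digits) is inferred from the IDs shown in this tutorial, and the evidence codes are the ones listed on the evidence code slides.

```python
import re
from dataclasses import dataclass

# Evidence codes as listed in this tutorial.
EVIDENCE_CODES = {"IEA", "IDA", "IEP", "IGI", "IMP", "IPI",
                  "ISS", "TAS", "NAS", "IC", "RCA", "ND"}
GO_ID_PATTERN = re.compile(r"^GO:\d{7}$")


@dataclass(frozen=True)
class Annotation:
    db_object_id: str   # e.g. a protein or gene identifier
    reference: str      # e.g. a PubMed ID
    go_id: str
    evidence: str

    def __post_init__(self):
        if not self.db_object_id or not self.reference:
            raise ValueError("annotation must name an object and a source")
        if not GO_ID_PATTERN.match(self.go_id):
            raise ValueError(f"malformed GO ID: {self.go_id!r}")
        if self.evidence not in EVIDENCE_CODES:
            raise ValueError(f"unknown evidence code: {self.evidence!r}")


# The example values from the slide.
example = Annotation("Q9ARH1", "PMID:12374299", "GO:0004674", "TAS")
print(example)
```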
GenNav: http://etbsun2.nlm.nih.gov:8000/perl/gennav.pl
J. Clark et al. Plant Physiology 2005 (in press)
Two types of GO Annotation:

Electronic Annotation

Manual Annotation
All annotations must:
• be attributed to a source.
• indicate what evidence was found to
support the GO term-gene/protein
association.
Electronic Annotation
• Provides large coverage
• High-quality
• BUT annotations tend to use high-level GO
terms and provide little detail.
Electronic Annotation
1. Assignment of GO terms to gene products
using existing information within database
entries
• Manual mapping of GO terms to concepts external
to GO (‘translation tables’).
• Proteins then electronically annotated with the
relevant GO term(s).
2. Automatic sequence analyses to transfer
annotations between highly similar gene
products
Electronic Annotation
Mappings of external concepts to GO
• Fatty acid biosynthesis (Swiss-Prot Keyword)
→ GO:Fatty acid biosynthesis (GO:0006633)
• EC:6.4.1.2 (EC number)
→ GO:acetyl-CoA carboxylase activity (GO:0003989)
• IPR000438: Acetyl-CoA carboxylase carboxyl transferase beta subunit (InterPro entry)
→ GO:acetyl-CoA carboxylase activity (GO:0003989)
• MF_00527: Putative 3-methyladenine DNA glycosylase (HAMAP)
→ GO:DNA repair (GO:0006281)
http://www.geneontology.org/GO.indices.shtml
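The translation-table approach can be sketched as a lookup from external identifiers to GO IDs. The mapping entries below are the four examples from the slide; the key prefixes (`SPKW:`, `InterPro:`, `HAMAP:`), the function name and the protein identifier `P00000` are hypothetical and chosen only for illustration. This is not the format of the real mapping files.

```python
# External concept -> GO IDs (entries taken from the slide; key format assumed).
EXTERNAL2GO = {
    "SPKW:Fatty acid biosynthesis": ["GO:0006633"],
    "EC:6.4.1.2": ["GO:0003989"],
    "InterPro:IPR000438": ["GO:0003989"],
    "HAMAP:MF_00527": ["GO:0006281"],
}


def electronic_annotations(protein_id, cross_references):
    """Return (protein, GO ID, evidence code, source) tuples for mapped cross-references."""
    annotations = []
    for xref in cross_references:
        for go_id in EXTERNAL2GO.get(xref, []):
            # Electronic annotations carry the IEA evidence code.
            annotations.append((protein_id, go_id, "IEA", xref))
    return annotations


# Hypothetical protein entry carrying an EC number and an InterPro hit.
for row in electronic_annotations("P00000", ["EC:6.4.1.2", "InterPro:IPR000438"]):
    print(row)
```

Recording the source cross-reference with each annotation keeps it attributable, which every annotation must be.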
Evaluation of the precision of electronic annotation
techniques (InterPro2GO, SPKW2GO, EC2GO)
• Compared a manually curated test set of GO-annotated
proteins with the electronic annotations
• InterPro2GO = most coverage
• EC2GO = 67 % of predictions exactly match the
manual GO annotation.
• 91-100 % of the time, the 3 mappings predicted GO terms
within the same lineage
Camon et al. BMC Bioinformatics 2005 in press
Manual Annotation
• High–quality, specific gene/gene product
associations made, using:
• Peer-reviewed papers
• Evidence codes to grade evidence
BUT – is very time consuming and requires trained
biologists
Finding GO terms
…for B. napus PERK1 protein (Q9ARH1)
In this study, we report the isolation and molecular characterization
of the B. napus PERK1 cDNA, that is predicted to encode a novel
receptor-like kinase. We have shown that like other plant RLKs,
the kinase domain of PERK1 has serine/threonine kinase activity.
In addition, the location of a PERK1-GTP fusion protein to the
plasma membrane supports the prediction that PERK1 is an
integral membrane protein…these kinases have been implicated in
early stages of wound response…
PubMed ID: 12374299
Function:
protein serine/threonine kinase activity
GO:0004674
Component:
integral to plasma membrane
GO:0005887
Process:
response to wounding
GO:0009611
GO Evidence Codes
Code    Definition
*IEA    Inferred from Electronic Annotation
IDA     Inferred from Direct Assay
IEP     Inferred from Expression Pattern
*IGI    Inferred from Genetic Interaction
IMP     Inferred from Mutant Phenotype
*IPI    Inferred from Physical Interaction
*ISS    Inferred from Sequence Similarity
TAS     Traceable Author Statement
NAS     Non-traceable Author Statement
*IC     Inferred from Curator
RCA     Inferred from Reviewed Computational Analysis
ND      No Data
* With column required
Manually annotated: all codes other than IEA
IDA:
• Enzyme assays
• In vitro reconstitution (transcription)
• Immunofluorescence
• Cell fractionation
TAS:
• In the literature source the original experiments
referred to are traceable (referenced).
GO Evidence Codes
• An additional identifier is needed in the With column for annotations
using certain evidence codes (those marked *: IEA, IGI, IPI, ISS, IC)
IGI:
• a gene identifier for the "other" gene involved in the
interaction
IPI:
• a gene or protein identifier for the "other" protein
involved in the interaction
IC:
• GO term from another annotation used as the
basis of a curator inference
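The With column requirement can be sketched as a simple check. The set of codes below is read from the asterisks in the evidence code table of this tutorial and should be treated as an assumption, as should the function name; real annotation pipelines may apply different or stricter rules.

```python
# Evidence codes marked "*" (With column required) in this tutorial.
WITH_REQUIRED = {"IEA", "IGI", "IPI", "ISS", "IC"}


def with_column_ok(evidence, with_value):
    """Return True unless a starred evidence code is missing its With identifier."""
    if evidence in WITH_REQUIRED:
        return bool(with_value and with_value.strip())
    return True


# An IPI annotation naming the "other" protein passes the check.
print(with_column_ok("IPI", "UniProt:P50995"))
# An IGI annotation without the "other" gene does not.
print(with_column_ok("IGI", ""))
```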
…some extra things:
• Annotation of a gene product to one ontology is
independent from its annotation to other ontologies.
• Only terms reflecting a normal activity or location are
annotated to.
• Usage of ‘unknown’ GO terms
(e.g. Molecular function unknown GO:0005554)
…some extra things: Qualifier Information
A set of ‘Qualifier’ terms is also available to curators to modify the
interpretation of an annotation.
Allowable values:
1. NOT
• a gene product is not associated with the GO term
• to document conflicting claims in the literature.
2. Contributes to
• distinguishes between individual subunit functions and whole
complex functions
• (used with GO Function Ontology)
3. Colocalizes with
• Transiently or peripherally associated with an organelle or
complex
• where the resolution of an assay is not accurate.
(used with GO Component Ontology)
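The qualifier rules above can be sketched as a lookup of which ontology each qualifier may be used with. The underscore spellings (`contributes_to`, `colocalizes_with`), the one-letter aspect codes (F, P, C), the `|` separator for combined qualifiers and the function name are assumptions made for this illustration.

```python
# Which ontology aspects each qualifier may be used with.
# F = Function, P = Process, C = Component.
QUALIFIER_ASPECTS = {
    "NOT": {"F", "P", "C"},
    "contributes_to": {"F"},
    "colocalizes_with": {"C"},
}


def qualifier_ok(qualifier, aspect):
    """Return True if every qualifier in the field is allowed for this aspect."""
    if not qualifier:
        return True  # the qualifier is optional
    return all(
        part in QUALIFIER_ASPECTS and aspect in QUALIFIER_ASPECTS[part]
        for part in qualifier.split("|")
    )


print(qualifier_ok("contributes_to", "F"))
print(qualifier_ok("colocalizes_with", "P"))
```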
Accessing annotations to the Gene
Ontology
1. Downloads
• Annotations – gene association files
• Ontologies and annotations – MySQL and XML
2. Web-based access
• AmiGO
(http://www.godatabase.org)
• QuickGO
(http://www.ebi.ac.uk/ego)
…among others…
Gene Association File
Example: three records for UniProt P06703, shown field by field
DB:                 UniProt | UniProt | UniProt
DB_Object_ID:       P06703 | P06703 | P06703
DB_Object_Symbol:   S106_HUMAN | S106_HUMAN | S106_HUMAN
Qualifier:          NOT (on one of the three records)
GOid:               GO:0008083 | GO:0007409 | GO:0005515
DB:Reference:       GOA:spkw | PMID:12152788 | PMID:12577318
Evidence:           IEA | NAS | IPI
With:               UniProt:P50995 (third record)
Aspect:             F | P | F
DB_Object_Name:     Calcyclin | Calcyclin | Calcyclin
DB_Object_Synonym:  IPI00027463 | IPI00027463 | IPI00027463
DB_Object_Type:     protein | protein | protein
taxon:              taxon:9606 | taxon:9606 | taxon:9606
Date:               20040426 | 20030721 | 20030721
Assigned by:        UniProt | UniProt | UniProt
• via web (GO consortium page)
http://www.geneontology.org/GO.current.annotations.shtml
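A gene association file can be read one line at a time by splitting on tabs and pairing the values with the fifteen column names shown above. The sketch below assumes tab-separated fields in that order and comment lines starting with `!`; the function name is hypothetical. The sample line is assembled from values on the slide, with the Qualifier field left empty as an assumption.

```python
COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GOid",
    "DB:Reference", "Evidence", "With", "Aspect", "DB_Object_Name",
    "DB_Object_Synonym", "DB_Object_Type", "taxon", "Date", "Assigned_by",
]


def parse_association_line(line):
    """Return a dict for one annotation line, or None for comment/blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("!"):
        return None
    fields = line.split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


# Sample record built from the slide values (Qualifier assumed empty).
SAMPLE_LINE = "\t".join([
    "UniProt", "P06703", "S106_HUMAN", "", "GO:0005515",
    "PMID:12577318", "IPI", "UniProt:P50995", "F", "Calcyclin",
    "IPI00027463", "protein", "taxon:9606", "20030721", "UniProt",
])

record = parse_association_line(SAMPLE_LINE)
print(record["DB_Object_ID"], record["GOid"], record["Evidence"], record["With"])
```

Empty fields stay as empty strings, so optional columns such as Qualifier and With keep their position in the record.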
Summary
• GO is still being developed and updated
- it requires a serious and ongoing effort.
– the biological community is involved
• New model organism databases are joining
the GO Consortium annotation effort
Practical session
1. Visit the GO website
2. Visit the OBO website
3. Browse the ontologies using the official GO
Consortium Browser – AmiGO
Part 1.
GO web site: www.geneontology.org
OBO web site: http://obo.sourceforge.net
AmiGO: http://www.godatabase.org
GO terms with
no children
Querying the GO
Search for
GO terms or
by Gene
symbol/name
Filter queries by
organism, data
source or
evidence
GOst tool
QuickGO browser: http://www.ebi.ac.uk/ego
OBO and
Gene Ontology Uses and
Tools
[Figure: Ontologies, surrounded by Developmental Stage, Molecular, Disease, Metabolic Pathway, Phenotype, Anatomy and Physiology]
Beyond GO – Open Biomedical
Ontologies
• Orthogonal to existing ontologies to facilitate
combinatorial approaches
- Share unique identifier space
- Include definitions
• Anatomies
• Cell Types
• Sequence Attributes
• Temporal Attributes
• Phenotypes
• Diseases
• More….
http://obo.sourceforge.net
Sequence Ontology
http://song.sourceforge.net
• Ontology of ‘small molecular
entities’
http://www.ebi.ac.uk/chebi
http://www.fruitfly.org/cgi-bin/ex/go.cgi
Access to GO and
its annotations
How to access the Gene ontology and its
annotations
1. Downloads
• Ontologies – (various – GO, OBO, XML, OWL
MySQL)
• Annotations – gene association files
• Ontologies and Annotations – MySQL and XML
2. Web-based access
• AmiGO
(http://www.godatabase.org)
• QuickGO
(http://www.ebi.ac.uk/ego)
among others…
http://www.ncbi.nlm.nih.gov/entrez
www.uniprot.org/
http://www.ebi.ac.uk/intact
SRS view…
http://srs.ebi.ac.uk
www.ensembl.org/
…analysis of high-throughput data according to GO
MicroArray data analysis
[Figure: clustered gene expression over time, attacked vs. control (gene tree: pearson; gene list: all genes (14010)), with clusters labelled: Defense response; Immune response; Response to stimulus; Toll regulated genes; JAK-STAT regulated genes; Puparial adhesion; Molting cycle; hemocyanin; Amino acid catabolism; Lipid metabolism; Peptidase activity; Protein catabolism]
Bregje Wertheim at the Centre for Evolutionary Genomics,
Department of Biology, UCL and Eugene Schuster Group, EBI.
…analysis of high-throughput data according to GO
Proteomics data analysis
GO classification
Kislinger T et al, Mol Cell Proteomics, 2003
Analysis of Data: Clustering
http://www.geneontology.org/GO.tools
Color indicates
up/down
regulation
GoMiner Tool, John Weinstein et al, Genome Biol. 4 (R28) 2003
Example of VLAD Output
Compare annotations associated with
the test set to the entire set of GO
annotations….
DNA Repair seems to be a common
theme.
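Comparing a test set with the full annotation set is often framed as an over-representation test. The sketch below uses a one-sided hypergeometric test, which is a common choice for this kind of comparison; the slide does not say which statistic VLAD uses, so this is an assumption, and the function name and all counts are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, sample_annotated):
    """P(X >= sample_annotated) under the hypergeometric distribution.

    population:       genes in the entire annotation set
    annotated:        genes in the population annotated to the GO term
    sample:           genes in the test set
    sample_annotated: genes in the test set annotated to the GO term
    """
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(sample_annotated, upper + 1)
    )
    return favourable / comb(population, sample)


# Hypothetical counts: 40 of 2000 genes carry the term; 8 of a 50-gene test set do.
p = enrichment_p_value(population=2000, annotated=40, sample=50, sample_annotated=8)
print(f"p = {p:.3g}")
```

When many GO terms are tested at once, the resulting p-values would normally need a multiple-testing correction, which this sketch omits.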
…overview proteome with GO Slim
http://www.ebi.ac.uk/integr8
Off-the-shelf GO slims
http://go.princeton.edu/cgi-bin/GOTermMapper
map2slim.pl
• distributed as part of the go-perl package
• maps a set of annotations up to their parent GO
slim terms
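Mapping annotations up to slim terms can be sketched as a walk up the parent links that stops at the first slim term on each path. This is a simplified illustration and not the algorithm implemented in map2slim.pl; the parent links, slim set, gene names and function names are hypothetical.

```python
from collections import Counter

# Assumed parent links and slim set, for illustration only.
PARENTS = {
    "cell": [],
    "nucleus": ["cell"],
    "chromosome": ["cell"],
    "nuclear chromosome": ["chromosome", "nucleus"],
}
SLIM = {"cell", "nucleus"}


def map_to_slim(term, parents, slim):
    """Return the first slim term met on each upward path from `term`."""
    found, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            found.add(current)  # stop climbing this path
            continue
        stack.extend(parents.get(current, []))
    return found


def slim_counts(annotations, parents, slim):
    """Count how many gene products map to each slim term."""
    counts = Counter()
    for terms in annotations.values():
        mapped = set()
        for term in terms:
            mapped |= map_to_slim(term, parents, slim)
        counts.update(mapped)
    return counts


annotations = {"geneA": {"nuclear chromosome"}, "geneB": {"chromosome"}}
print(slim_counts(annotations, PARENTS, SLIM))
```

Counting each gene product once per slim term gives the kind of proteome overview a GO slim is meant to provide.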
Summary
• The Gene Ontology project precipitated a
generalized implementation for ontologies for
molecular biology
• Bio-ontologies such as GO have facilitated
development of systems for hypothesis generation in
biological systems
• Further integration – creation of cross-products
between different ontologies
Practical II – Creation of GO slims
using the DAG-Edit tool.
http://sourceforge.net/projects/geneontology/
…loading the GO
ftp://ftp.geneontology.org/pub/go/ontology/gene_ontology.obo
…browsing the GO
…viewing GO terms
…searching for GO terms
…creating a new GO slim
…creating a renderer for the GO slim
…adding terms to the GO slim
…filtering GO for terms in the GO slim
…removing filters/renderers
…saving the newly created GO slim