the Gene Ontology


Introduction to the GO:
a user’s guide
NCSU GO Workshop
29 October 2009
Genomic Annotation

Genome annotation is the process of attaching biological information to genomic sequences. It consists of two main steps:
1. identifying functional elements in the genome: “structural annotation”
2. attaching biological information to these elements: “functional annotation”
Biologists often use the term “annotation” when they are referring only to structural annotation.
[Figure: structural annotation (DNA annotation, protein annotation) and functional annotation, with the labels CHICK_OLF6, TRAF 1, 2 and 3, TRAF 1 and 2, and catenin. Data from Ensembl Genome browser.]
Structural & Functional Annotation
Structural Annotation:
• Open reading frames (ORFs) predicted during genome assembly
• predicted ORFs require experimental confirmation
• the Sequence Ontology (SO) provides a structured controlled vocabulary for sequence annotation
Functional Annotation:
• annotation of gene products = Gene Ontology (GO) annotation
• initially, predicted ORFs have no functional literature and GO annotation relies on computational methods (rapid)
• functional literature exists for many genes/proteins prior to genome sequencing
• GO annotation does not rely on a completed genome sequence!
Introduction to GO
1. Bio-ontologies
2. the Gene Ontology (GO)
   • a GO annotation example
   • GO evidence codes
   • literature biocuration & computational analysis
   • ND vs no GO
   • sources of GO
3. Using the GO
4. The gene association file
1. Bio-ontologies
Bio-ontologies

Bio-ontologies are used to capture biological information in a way that can be read by both humans and computers.
• necessary for high-throughput “omics” datasets
• allows data sharing across databases

Objects in an ontology (e.g. genes, cell types, tissue types, stages of development) are well defined.

The ontology shows how the objects relate to each other.
Bio-ontologies: http://www.obofoundry.org/
[Figure: an ontology term, labelled with its digital identifier (for computers), its description (for humans), and its relationships between terms.]
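The three parts of an ontology term (identifier, description, relationships) can be sketched as a small data structure. This is a minimal illustration only: the `Term` class, the `ONTOLOGY` dictionary and the `ancestors` helper are hypothetical names, and the parent links below are simplified and skip intermediate terms that the real ontology contains.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One object in an ontology."""
    term_id: str                               # digital identifier (for computers)
    name: str                                  # description (for humans)
    is_a: list = field(default_factory=list)   # relationships: ids of parent terms


# Toy ontology; the parent links are simplified for illustration.
ONTOLOGY = {
    "GO:0008150": Term("GO:0008150", "biological process"),
    "GO:0008610": Term("GO:0008610", "lipid biosynthetic process", ["GO:0008150"]),
    "GO:0006633": Term("GO:0006633", "fatty acid biosynthetic process", ["GO:0008610"]),
}


def ancestors(term_id, ontology):
    """Return the ids of every term reachable by following is_a links upward."""
    found = set()
    stack = list(ontology[term_id].is_a)
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(ontology[parent].is_a)
    return found
```

Because the relationships are explicit, a computer can work out that anything annotated to a specific term also belongs to its more general parent terms.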
2. The Gene Ontology
Functional Annotation
Gene Ontology (GO) is the de facto method for functional annotation
• Widely used for functional genomics (high throughput)
• Many tools available for gene expression analysis using GO
• The GO Consortium homepage: http://www.geneontology.org
GO Mapping Example
NDUFAB1 (UniProt P52505)
Bovine NADH dehydrogenase (ubiquinone) 1, alpha/beta subcomplex, 1, 8kDa
Biological Process (BP or P)
GO:0006633 fatty acid biosynthetic process TAS
GO:0006120 mitochondrial electron transport, NADH to ubiquinone TAS
GO:0008610 lipid biosynthetic process IEA
Molecular Function (MF or F)
GO:0005504 fatty acid binding IDA
GO:0008137 NADH dehydrogenase (ubiquinone) activity TAS
GO:0016491 oxidoreductase activity TAS
GO:0000036 acyl carrier activity IEA
Cellular Component (CC or C)
GO:0005759 mitochondrial matrix IDA
GO:0005747 mitochondrial respiratory chain complex I IDA
GO:0005739 mitochondrion IEA
GO Mapping Example
The parts of each annotation for NDUFAB1 (UniProt P52505):
• GO:ID (unique), e.g. GO:0006633
• GO term name, e.g. fatty acid biosynthetic process
• GO evidence code, e.g. TAS
• aspect or ontology: Biological Process (BP or P), Molecular Function (MF or F), Cellular Component (CC or C)
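Each annotation line in the NDUFAB1 example pairs a GO:ID, a GO term name and an evidence code. A minimal sketch of splitting such a line into its parts follows; the `parse_annotation` function and the `ANNOTATION_LINE` pattern are hypothetical helpers written for this slide layout, not part of any GO tool.

```python
import re

# GO:ID (seven digits), then the term name, then a 2-3 letter evidence code.
ANNOTATION_LINE = re.compile(r"^(GO:\d{7})\s+(.+)\s+([A-Z]{2,3})$")


def parse_annotation(line):
    """Split 'GO:0005504 fatty acid binding IDA' into its three parts."""
    match = ANNOTATION_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not an annotation line: {line!r}")
    go_id, name, evidence = match.groups()
    return {"go_id": go_id, "name": name.strip(), "evidence": evidence}


example = parse_annotation("GO:0005504 fatty acid binding IDA")
```

The evidence code is always taken as the last token, so term names that themselves end in capital letters (such as "complex I") are still parsed correctly.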
GO EVIDENCE CODES
Direct Evidence Codes
IDA - inferred from direct assay
IEP - inferred from expression pattern
IGI - inferred from genetic interaction
IMP - inferred from mutant phenotype
IPI - inferred from physical interaction
Indirect Evidence Codes
inferred from literature
IGC - inferred from genomic context
TAS - traceable author statement
NAS - non-traceable author statement
IC - inferred by curator
inferred by sequence analysis
RCA - inferred from reviewed computational analysis
IS* - inferred from sequence*
IEA - inferred from electronic annotation
Other
NR - not recorded (historical)
ND - no biological data available
* IS* codes:
ISS - inferred from sequence or structural similarity
ISA - inferred from sequence alignment
ISO - inferred from sequence orthology
ISM - inferred from sequence model
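The grouping of evidence codes on this slide can be captured as a lookup table, which is handy when filtering annotations by how they were made. This is a sketch that mirrors the slide's grouping as listed here; the names `EVIDENCE_GROUPS` and `evidence_group` are hypothetical.

```python
# Groups follow the slide: direct codes, indirect codes (from literature
# or from sequence analysis), and other.
EVIDENCE_GROUPS = {
    "direct": {"IDA", "IEP", "IGI", "IMP", "IPI"},
    "literature": {"IGC", "TAS", "NAS", "IC"},
    "sequence analysis": {"RCA", "IEA", "ISS", "ISA", "ISO", "ISM"},
    "other": {"NR", "ND"},
}


def evidence_group(code):
    """Return the slide's group name for an evidence code, or 'unknown'."""
    for group, codes in EVIDENCE_GROUPS.items():
        if code in codes:
            return group
    return "unknown"
```

For example, `evidence_group("IDA")` gives `"direct"` and `evidence_group("IEA")` gives `"sequence analysis"`.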
GO EVIDENCE CODES
Biocuration of literature
• detailed function
• “depth”
• slower (manual)
Direct Evidence Codes
IDA - inferred from direct assay
IEP - inferred from expression pattern
IGI - inferred from genetic interaction
IMP - inferred from mutant phenotype
IPI - inferred from physical interaction
Indirect Evidence Codes
inferred from literature
IGC - inferred from genomic context
TAS - traceable author statement
NAS - non-traceable author statement
IC - inferred by curator
Biocuration of Literature: detailed gene function
Example protein: P05147
• Find a paper about the protein (PMID: 2976880).
• Read the paper to get experimental evidence of function.
• Use the most specific term possible.
• The experiment assayed kinase activity: use the IDA evidence code.
GO EVIDENCE CODES
Sequence analysis
• rapid (computational)
• “breadth” of coverage
• less detailed
Indirect Evidence Codes
inferred by sequence analysis
RCA - inferred from reviewed computational analysis
IS* - inferred from sequence*
IEA - inferred from electronic annotation
* IS* codes:
ISS - inferred from sequence or structural similarity
ISA - inferred from sequence alignment
ISO - inferred from sequence orthology
ISM - inferred from sequence model
Other
NR - not recorded (historical)
ND - no biological data available
Unknown Function vs No GO
• ND – no data
  • Biocurators have tried to add GO but there is no functional data available
  • Previously: “process_unknown”, “function_unknown”, “component_unknown”
  • Now: “biological process”, “molecular function”, “cellular component”
• No annotations (including no “ND”): biocurators have not annotated
  • this is important for your dataset: what % has GO?
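The question "what % has GO?" can be answered by counting three cases separately: gene products with real annotations, gene products with only ND, and gene products with no annotation at all. The sketch below assumes the annotations are already loaded into a dictionary from gene product ID to a list of evidence codes; the function name `go_coverage` and that input layout are assumptions for illustration.

```python
def go_coverage(gene_ids, annotations):
    """Return the percentage of gene products in each annotation state.

    gene_ids    -- list of gene product identifiers in the dataset
    annotations -- dict: identifier -> list of evidence codes (may be missing)
    """
    counts = {"annotated": 0, "nd_only": 0, "no_go": 0}
    for gene in gene_ids:
        codes = annotations.get(gene, [])
        if not codes:
            counts["no_go"] += 1        # biocurators have not annotated
        elif all(code == "ND" for code in codes):
            counts["nd_only"] += 1      # curated, but no functional data found
        else:
            counts["annotated"] += 1
    total = len(gene_ids)
    if total == 0:
        return {state: 0.0 for state in counts}
    return {state: 100.0 * n / total for state, n in counts.items()}


coverage = go_coverage(
    ["A", "B", "C", "D"],
    {"A": ["IDA", "IEA"], "B": ["ND"], "D": []},
)
```

In this toy dataset one of four gene products is annotated, one is ND only, and two have no GO.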
Sources of GO
1. Primary sources of GO: from the GO Consortium (GOC) & GOC members
   • most up to date
   • most comprehensive
2. Secondary sources: other resources that use GO provided by GOC members
   • public databases (e.g. NCBI, UniProtKB)
   • genome browsers (e.g. Ensembl)
   • array vendors (e.g. Affymetrix)
   • GO expression analysis tools

Different tools and databases display the GO annotations differently.

Since GO terms are continually changing and GO annotations are continually added, you need to know when GO annotations were last updated.
Secondary Sources of GO annotation

EXAMPLES:
• public databases (e.g. NCBI, UniProtKB)
• genome browsers (e.g. Ensembl)
• array vendors (e.g. Affymetrix)

CONSIDERATIONS:
• What is the original source?
• When was it last updated?
• Are evidence codes displayed?
For more information about GO

GO Evidence Codes:
http://www.geneontology.org/GO.evidence.shtml

gene association file information:
http://www.geneontology.org/GO.format.annotation.shtml

tools that use the GO:
http://www.geneontology.org/GO.tools.shtml

GO Consortium wiki:
http://wiki.geneontology.org/index.php/Main_Page
3. Using the GO
Use GO Browsers for:
• searching for GO terms
• searching for gene product annotation
• filtering sets of annotations and downloading results
• creating/using GO slims
GO Browsers

• QuickGO Browser (EBI GOA Project)
  • http://www.ebi.ac.uk/ego/
  • Can search by GO Term or by UniProt ID
  • Includes IEA annotations
• AmiGO Browser (GO Consortium Project)
  • http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
  • Can search by GO Term or by UniProt ID
  • Does not include IEA annotations
Use GO for…
• Determining which classes of gene products are over-represented or under-represented.
• Grouping gene products by biological function.
• Relating a protein’s location to its function.
• Focusing on particular biological pathways and functions (hypothesis-driven data interrogation).

http://www.geneontology.org/
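Over- and under-representation of a GO term in a gene list is often assessed with a hypergeometric test. The slides do not say which test any particular tool uses, so treat the sketch below as one common choice rather than the method of a specific tool; the function names are hypothetical.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated genes in the list.

    k -- genes in my list annotated to the GO term
    n -- size of my list
    K -- genes in the background annotated to the GO term
    N -- size of the background
    """
    total = comb(N, n)
    hits = range(k, min(n, K) + 1)
    return sum(comb(K, i) * comb(N - K, n - i) for i in hits) / total


def depletion_p_value(k, n, K, N):
    """P(X <= k): chance of seeing at most k annotated genes in the list."""
    total = comb(N, n)
    hits = range(0, min(k, n, K) + 1)
    return sum(comb(K, i) * comb(N - K, n - i) for i in hits) / total
```

A small `enrichment_p_value` suggests the term is over-represented in the list, and a small `depletion_p_value` suggests it is under-represented. In practice many GO terms are tested at once, so a multiple-testing correction is usually applied as well.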
However…
• many of these tools do not support non-model organisms
• the tools have different computing requirements
• it may be difficult to determine how up-to-date the GO annotations are
Need to evaluate tools for your system.
Evaluating GO tools
Some criteria for evaluating GO Tools:
1. Does it include my species of interest (or do I have to “humanize” my list)?
2. What does it require to set up (computer usage/online)?
3. What was the source for the GO (primary or secondary) and when was it last updated?
4. Does it report the GO evidence codes (and is IEA included)?
5. Does it report which of my gene products have no GO?
6. Does it report both over- and under-represented GO groups, and how does it evaluate this?
7. Does it allow me to add my own GO annotations?
8. Does it represent my results in a way that facilitates discovery?
4. Gene association files
The gene association (ga) file
• standard file format used to capture GO annotation data
• tab-delimited file containing 15* fields of information:
  • information about the gene product (database, accession, name, symbol, synonyms, species)
  • information about the function: GO ID, ontology, reference, evidence, qualifiers, context (with/from)
  • data about the functional annotation: date, annotator
* 2 additional fields will soon be added to capture information about isoforms and other ontologies.
[Figure: example gene association file with columns grouped as gene product information, function information, and metadata (when & who); an additional column was added to this example.]
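Reading the 15 tab-delimited fields of a gene association file can be sketched as follows. The column names are my own labels for the 15 columns as I understand the GO annotation file format, so check them against the format documentation before relying on them. The sample row is made up for illustration: its reference, date and annotator values are placeholders, not real annotation data.

```python
# Labels for the 15 columns, in file order (assumed from the format docs).
GAF_FIELDS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by",
]


def parse_gaf_line(line):
    """Return a dict for one annotation row, or None for a '!' comment line."""
    if line.startswith("!"):
        return None
    values = line.rstrip("\n").split("\t")
    if len(values) < len(GAF_FIELDS):
        raise ValueError(f"expected {len(GAF_FIELDS)} fields, got {len(values)}")
    # zip() ignores any extra trailing columns beyond the first 15.
    return dict(zip(GAF_FIELDS, values))


# Illustrative row only: PMID, date and annotator are placeholders.
SAMPLE = "\t".join([
    "UniProtKB", "P52505", "NDUFAB1", "", "GO:0005759",
    "PMID:0000000", "IDA", "", "C",
    "NADH dehydrogenase (ubiquinone) 1, alpha/beta subcomplex, 1, 8kDa",
    "", "protein", "taxon:9913", "20090101", "EXAMPLE",
])

record = parse_gaf_line(SAMPLE)
```

Here `record["go_id"]`, `record["evidence_code"]` and `record["aspect"]` hold the function information, while `record["date"]` and `record["assigned_by"]` hold the "when & who" metadata.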
Gene association files

• GO Consortium ga files
  • many organism specific files
  • also includes EBI GOA files
• EBI GOA ga files
  • UniProt file contains GO annotation for all species represented in UniProtKB
• AgBase ga files
  • organism specific files
  • AgBase GOC file – submitted to GO Consortium & EBI GOA
  • AgBase Community file – GO annotations not yet submitted or not supported
  • all files are quality checked