Getting Started
Getting Started: a user’s guide to the GO
TAMU GO Workshop
17 May 2010
Introduction to GO
1. Annotation
2. Bio-ontologies
3. the Gene Ontology (GO)
• a GO annotation example
• GO evidence codes
• literature biocuration & computational analysis
• ND vs no GO
• sources of GO
4. Using the GO
Genomic Annotation
Genome annotation is the process of attaching biological information to genomic sequences. It consists of two main steps:
1. identifying functional elements in the genome: “structural annotation”
2. attaching biological information to these elements: “functional annotation”
Biologists often use the term “annotation” when they are referring only to structural annotation.
[Figure: examples of structural annotation (DNA annotation: CHICK_OLF6; protein annotation labels: TRAF 1, 2 and 3; TRAF 1 and 2; data from Ensembl Genome browser) and of functional annotation (catenin)]
Structural & Functional Annotation
Structural Annotation:
• Open reading frames (ORFs) predicted during genome assembly
• predicted ORFs require experimental confirmation
• the Sequence Ontology (SO) provides a structured controlled vocabulary for sequence annotation
Functional Annotation:
• annotation of gene products = Gene Ontology (GO) annotation
• initially, predicted ORFs have no functional literature and GO annotation relies on computational methods (rapid)
• functional literature exists for many genes/proteins prior to genome sequencing
• GO annotation does not rely on a completed genome sequence!
1. Provides structural annotation for agriculturally important genomes
2. Provides functional annotation (GO)
3. Provides tools for functional modeling
4. Provides bioinformatics & modeling support for research community
Avian Gene Nomenclature
1. Bio-ontologies
Bio-ontologies
Bio-ontologies are used to capture biological information in a way that can be read by both humans and computers.
• necessary for high-throughput “omics” datasets
• allows data sharing across databases
Objects in an ontology (e.g. genes, cell types, tissue types, stages of development) are well defined.
The ontology shows how the objects relate to each other.
Bio-ontologies: http://www.obofoundry.org/
[Figure: ontologies combine a digital identifier (computers), a description (humans), and relationships between terms]
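The three ingredients named above (digital identifier, human-readable description, relationships between terms) can be sketched as a small data structure. This is only an illustration: the `Term` class and `ancestors` helper are hypothetical names, and every relationship is collapsed into a single "parents" link (in GO itself, mitochondrial matrix is linked to mitochondrion by part_of rather than is_a).

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One ontology term: an identifier for computers, a name for humans."""
    term_id: str
    name: str
    parents: list = field(default_factory=list)  # IDs of broader terms


# Three GO cellular component terms; relationship types are simplified
# to a single "parents" link for this sketch.
terms = {
    "GO:0005575": Term("GO:0005575", "cellular component"),
    "GO:0005739": Term("GO:0005739", "mitochondrion", ["GO:0005575"]),
    "GO:0005759": Term("GO:0005759", "mitochondrial matrix", ["GO:0005739"]),
}


def ancestors(term_id, ontology):
    """Return the IDs of every broader term reachable from term_id."""
    found = set()
    stack = list(ontology[term_id].parents)
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(ontology[parent].parents)
    return found


print(sorted(ancestors("GO:0005759", terms)))
```

Because the relationships are stored alongside the identifiers, a program can move from a specific term to its broader terms without any knowledge of the biology involved.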
2. The Gene Ontology
What is the Gene Ontology?
“a controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing”
• assigns functions to gene products at different levels, depending on how much is known about a gene product
• is used for a diverse range of species
• is structured to be queried at different levels, e.g.:
  find all the chicken gene products in the genome that are involved in signal transduction
  zoom in on all the receptor tyrosine kinases
• each human-readable GO function has a digital tag to allow computational analysis of large datasets
COMPUTATIONALLY AMENABLE ENCYCLOPEDIA OF GENE FUNCTIONS AND THEIR RELATIONSHIPS
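Querying "at different levels" can be sketched as follows: a query for a broad term also returns gene products annotated to its more specific child terms, while a query for a specific term returns only the narrower set. All term IDs and gene names below are hypothetical placeholders (not real GO IDs), and the helper names are assumptions made for this example.

```python
# Hypothetical toy hierarchy: broad term -> more specific child terms.
children = {
    "EX:signal_transduction": ["EX:rtk_signaling", "EX:gpcr_signaling"],
    "EX:rtk_signaling": [],
    "EX:gpcr_signaling": [],
}

# Hypothetical annotations: gene product -> set of terms it is annotated to.
annotations = {
    "geneA": {"EX:rtk_signaling"},
    "geneB": {"EX:gpcr_signaling"},
    "geneC": {"EX:signal_transduction"},
    "geneD": set(),
}


def term_and_descendants(term, children):
    """Collect a term plus every more specific term below it."""
    collected = {term}
    stack = [term]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in collected:
                collected.add(child)
                stack.append(child)
    return collected


def gene_products_for(term, children, annotations):
    """Gene products annotated to the term or to any of its descendants."""
    wanted = term_and_descendants(term, children)
    return sorted(g for g, ts in annotations.items() if ts & wanted)


# Broad query, then "zoom in" on one specific branch.
print(gene_products_for("EX:signal_transduction", children, annotations))
print(gene_products_for("EX:rtk_signaling", children, annotations))
```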
GO annotation example
NDUFAB1 (UniProt P52505)
Bovine NADH dehydrogenase (ubiquinone) 1, alpha/beta subcomplex, 1, 8kDa
Biological Process (BP or P)
GO:0006633 fatty acid biosynthetic process TAS
GO:0006120 mitochondrial electron transport, NADH to ubiquinone TAS
GO:0008610 lipid biosynthetic process IEA
Molecular Function (MF or F)
GO:0005504 fatty acid binding IDA
GO:0008137 NADH dehydrogenase (ubiquinone) activity TAS
GO:0016491 oxidoreductase activity TAS
GO:0000036 acyl carrier activity IEA
Cellular Component (CC or C)
GO:0005759 mitochondrial matrix IDA
GO:0005747 mitochondrial respiratory chain complex I IDA
GO:0005739 mitochondrion IEA
Parts of each annotation in the NDUFAB1 example:
• GO:ID (unique)
• GO term name
• GO evidence code
• aspect or ontology
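As a rough sketch, one line of the NDUFAB1 example can be split into these labelled parts in code. The `Annotation` type and `parse_line` function are hypothetical names for illustration, and the parsing assumes the slide layout (ID first, evidence code last), not any official file format.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    go_id: str      # unique GO:ID
    term_name: str  # human-readable GO term name
    evidence: str   # GO evidence code
    aspect: str     # P, F or C


def parse_line(line, aspect):
    """Split 'GO:ID term name EVIDENCE' as laid out on the slide."""
    go_id, rest = line.split(" ", 1)
    term_name, evidence = rest.rsplit(" ", 1)
    return Annotation(go_id, term_name, evidence, aspect)


example = parse_line(
    "GO:0006120 mitochondrial electron transport, NADH to ubiquinone TAS", "P"
)
print(example)
```

Splitting once from the left and once from the right keeps term names that contain spaces or commas intact.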
GO EVIDENCE CODES
Guide to GO Evidence Codes: http://www.geneontology.org/GO.evidence.shtml
Direct Evidence Codes
IDA - inferred from direct assay
IEP - inferred from expression pattern
IGI - inferred from genetic interaction
IMP - inferred from mutant phenotype
IPI - inferred from physical interaction
Indirect Evidence Codes
inferred from literature
IGC - inferred from genomic context
TAS - traceable author statement
NAS - non-traceable author statement
IC - inferred by curator
inferred by sequence analysis
RCA - inferred from reviewed computational analysis
IS* - inferred from sequence*
• ISS - inferred from sequence or structural similarity
• ISA - inferred from sequence alignment
• ISO - inferred from sequence orthology
• ISM - inferred from sequence model
IEA - inferred from electronic annotation
Other
NR - not recorded (historical)
ND - no biological data available
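The grouping of evidence codes on this slide can be written down as a lookup table, which is handy when summarising a dataset by type of evidence. This is a sketch that follows the slide's grouping only; the names `EVIDENCE_GROUPS` and `evidence_group` are assumptions for this example.

```python
# Evidence codes grouped as on the slide (not an official classification file).
EVIDENCE_GROUPS = {
    "direct": {"IDA", "IEP", "IGI", "IMP", "IPI"},
    "inferred from literature": {"IGC", "TAS", "NAS", "IC"},
    "inferred by sequence analysis": {"RCA", "ISS", "ISA", "ISO", "ISM", "IEA"},
    "other": {"NR", "ND"},
}


def evidence_group(code):
    """Return the slide's group name for an evidence code."""
    for group, codes in EVIDENCE_GROUPS.items():
        if code in codes:
            return group
    raise ValueError(f"unknown evidence code: {code}")


for code in ("IDA", "TAS", "IEA", "ND"):
    print(code, "->", evidence_group(code))
```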
Biocuration of literature
• detailed function
• “depth”
• slower (manual)
Biocuration of Literature: detailed gene function
P05147
Find a paper about the protein. PMID: 2976880
Read paper to get experimental evidence of function
Use most specific term possible
experiment assayed kinase activity: use IDA evidence code
GO EVIDENCE CODES: two ways annotations are made
Biocuration of literature
• detailed function
• “depth”
• slower (manual)
Sequence analysis
• rapid (computational)
• “breadth” of coverage
• less detailed
Unknown Function vs No GO
ND - no data
• Biocurators have tried to add GO but there is no functional data available
• Previously: “process_unknown”, “function_unknown”, “component_unknown”
• Now: “biological process”, “molecular function”, “cellular component”
No annotations (including no “ND”):
• biocurators have not annotated
• this is important for your dataset: what % has GO?
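The "what % has GO?" question can be answered with a simple tally that keeps the three cases apart: gene products with informative GO, gene products with only ND annotations, and gene products with no annotations at all. A minimal sketch, assuming annotations are available as a mapping from identifier to a list of evidence codes; the function name and the identifiers `p1` to `p4` are hypothetical.

```python
def go_coverage(dataset_ids, annotations):
    """Percentage of IDs with informative GO, ND only, or no GO at all.

    annotations maps an identifier to its list of evidence codes;
    an identifier missing from the mapping has never been annotated.
    """
    counts = {"annotated": 0, "ND only": 0, "no GO": 0}
    for identifier in dataset_ids:
        codes = annotations.get(identifier, [])
        if not codes:
            counts["no GO"] += 1
        elif all(code == "ND" for code in codes):
            counts["ND only"] += 1
        else:
            counts["annotated"] += 1
    total = len(dataset_ids)
    if total == 0:
        return {key: 0.0 for key in counts}
    return {key: 100.0 * n / total for key, n in counts.items()}


# Hypothetical dataset of four gene products.
dataset = ["p1", "p2", "p3", "p4"]
known = {"p1": ["IDA", "IEA"], "p2": ["ND"], "p4": ["ND", "ND", "ND"]}
print(go_coverage(dataset, known))
```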
Sources of GO
1. Primary sources of GO: from the GO Consortium (GOC) & GOC members
• most up to date
• most comprehensive
2. Secondary sources: other resources that use GO provided by GOC members
• public databases (e.g. NCBI, UniProtKB)
• genome browsers (e.g. Ensembl)
• array vendors (e.g. Affymetrix)
• GO expression analysis tools
Sources of GO annotation
Different tools and databases display the GO annotations differently.
Since GO terms are continually changing and GO annotations are continually added, you need to know when the GO annotations were last updated.
Secondary Sources of GO annotation
EXAMPLES:
public databases (eg. NCBI, UniProtKB)
genome browsers (eg. Ensembl)
array vendors (eg. Affymetrix)
CONSIDERATIONS:
What is the original source?
When was it last updated?
Are evidence codes displayed?
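When annotations are available as a gene association file (GAF), one way to check "when was it last updated?" is to read the date column of each annotation row. The sketch below assumes the tab-separated GAF layout in which lines starting with "!" are header comments and column 14 holds the annotation date as YYYYMMDD. The two sample rows are illustrative only: their reference and date fields are placeholders, not real annotation records.

```python
import io
from datetime import datetime


def latest_annotation_date(handle):
    """Return the most recent date found in column 14 of a GAF stream."""
    latest = None
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue  # skip header comments and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 14:
            continue  # not a complete annotation row
        try:
            row_date = datetime.strptime(columns[13], "%Y%m%d").date()
        except ValueError:
            continue  # unparseable date field
        if latest is None or row_date > latest:
            latest = row_date
    return latest


def make_row(go_id, evidence, aspect, date):
    """Build one illustrative 15-column row (placeholder reference and date)."""
    return "\t".join([
        "UniProtKB", "P52505", "NDUFAB1", "", go_id, "REF:placeholder",
        evidence, "", aspect, "", "", "protein", "taxon:9913", date, "EXAMPLE",
    ])


sample = "\n".join([
    "!gaf-version: 2.0",
    make_row("GO:0005759", "IDA", "C", "20091115"),
    make_row("GO:0005739", "IEA", "C", "20100301"),
]) + "\n"

print(latest_annotation_date(io.StringIO(sample)))
```

The same loop can also confirm whether evidence codes are present, since they sit in column 7 of each row.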
For more information about GO
GO Evidence Codes:
http://www.geneontology.org/GO.evidence.shtml
gene association file information:
http://www.geneontology.org/GO.format.annotation.shtml
tools that use the GO:
http://www.geneontology.org/GO.tools.shtml
GO Consortium wiki:
http://wiki.geneontology.org/index.php/Main_Page
All websites are listed on the AgBase workshop website.
3. Using the GO
http://www.geneontology.org/
However:
• many of these tools do not support non-model organisms
• the tools have different computing requirements
• it may be difficult to determine how up-to-date the GO annotations are
You need to evaluate tools for your system.
Some useful expression analysis tools:
Database for Annotation, Visualization and Integrated Discovery (DAVID)
http://david.abcc.ncifcrf.gov/
agriGO -- GO Analysis Toolkit and Database for Agricultural Community
http://bioinfo.cau.edu.cn/agriGO/
• used to be EasyGO
• chicken, cow, pig, mouse, cereals, dicots
• includes Plant Ontology (PO) analysis
Onto-Express
http://vortex.cs.wayne.edu/projects.htm#Onto-Express
• can provide your own gene association file
Funcassociate 2.0: The Gene Set Functionator
http://llama.med.harvard.edu/funcassociate/
• can provide your own gene association file
Evaluating GO tools
Some criteria for evaluating GO Tools:
1. Does it include my species of interest (or do I have to “humanize” my list)?
2. What does it require to set up (computer usage/online)?
3. What was the source for the GO (primary or secondary) and when was it last updated?
4. Does it report the GO evidence codes (and is IEA included)?
5. Does it report which of my gene products has no GO?
6. Does it report both over/under represented GO groups and how does it evaluate this?
7. Does it allow me to add my own GO annotations?
8. Does it represent my results in a way that facilitates discovery?
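For criterion 6, it helps to know what an over-representation test looks like. One common approach is a hypergeometric test, sketched below with the standard library only. This illustrates the general idea and is not a description of how any particular tool listed here computes its statistics; the function and parameter names are assumptions for this example.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: gene products in the background set
    K: background gene products annotated to the GO term
    n: gene products in your list
    k: gene products in your list annotated to the GO term
    """
    upper = min(n, K)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / comb(N, n)


# Toy numbers: 10 background genes, 5 carry the term, and both genes in a
# 2-gene list carry it.
p_value = hypergeom_upper_tail(k=2, n=2, K=5, N=10)
print(round(p_value, 4))
```

Under-representation uses the opposite tail, P(X <= k). Because many GO terms are tested at once, tools normally also apply a multiple-testing correction, which is worth checking when comparing them.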
Functional Modeling Considerations
Should I add my own GO?
• use GOProfiler to see how much GO is available for your species
• use GORetriever to find existing GO for your dataset
• Does the analysis tool allow me to add my own GO?
Should I do GO analysis and pathway analysis and network analysis?
• different functional modeling methods show different aspects of your data (complementary)
• is this type of data available for your species (or a close ortholog)?
What tools should I use?
• which tools have data for your species of interest?
• what type of accessions are accepted?
• availability (commercial and freely available)
Overview of functional modeling strategy
[Flowchart, boxes as labelled on the slide:
Microarray IDs; ArrayIDer; Protein/Gene identifiers
GORetriever; GO annotations
Genes/Proteins with no GO annotations; GOanna
GOSlimViewer
Pathways and network analysis: Ingenuity Pathways Analysis (IPA), Pathway Studio, Cytoscape, DAVID
GO Enrichment analysis: Ingenuity Pathways Analysis (IPA), Pathway Studio, Cytoscape, DAVID, EasyGO, Onto-Express, Onto-Express-to-go (OE2GO)]
Yellow boxes represent AgBase tools
Green/Purple boxes are non-AgBase resources
All workshop materials are available at AgBase.