Transcript Go ontology

Computer Science
Ph. D. Seminar
Gene Ontology (GO) Based Search for Protein
Structure Similarity Clustering Metrics
Ph.D. Candidate
Steve Johnson
Committee Members
Dr. Debasis Mitra, Dr. Philip Bernhard, Dr. Walter Bond,
Dr. Julia Grimwade
Date: September 12, 2011
Gene Ontology (GO) Based Search for
Protein Structure Similarity Clustering
Metrics
• GO Background
• GO Subontologies
• GO Annotations
• GO Relationships
• GO Tools
• GO Research
• Research Direction
Gene Ontology Background
The Gene Ontology (GO),
http://www.geneontology.org/, provides a
consistent vocabulary for describing the
attributes of proteins, specifically molecular
function, biological process and the cellular
component where the protein is found.
Gene Ontology Background
GO Consortium
• Berkeley Bioinformatics Open Source Project (BBOP)
• British Heart Foundation
• EcoliWiki
• FlyBase
• GeneDB
• UniProtKB-GOA
• Univ. of Maryland – IGS
• Mouse Genome Informatics (MGI)
• Rat Genome Database (RGD)
• Saccharomyces Genome Database (SGD)
• The Arabidopsis Information Resource (TAIR)
• WormBase
Gene Ontology Background
GO Consortium
• GO terms
o A set of integer IDs (i.e., GO terms) is
assigned to members of the GO
Consortium
• GO Consortium members
o provide annotations
o attend all meetings
o receive funding for supported databases
Gene Ontology Project
Facts
• Started in 1998
• Primary Goals
o Structured Vocabulary
o Use to annotate genes and gene products
• 3 Model Organisms
o FlyBase (Drosophila)
o Saccharomyces Genome Database (SGD)
o Mouse Genome Informatics (MGI) project
Gene Subontologies
Three Ontology Structure
• Biological Process
• Molecular Function
• Cellular component
Gene Subontologies
Biological Process
A biological process is a series of steps, that is, an
ordered sequence of molecular functions.
Examples of biological processes include the
following:
• Metabolic Process
• Photosynthetic Process
• Biosynthetic Process
Gene Subontologies
Molecular Function
Molecular function describes the purpose of the
gene product. Unlike a biological process, it refers
to a single function.
Examples of molecular functions include the
following:
• Binding Activity
• Transport Activity
• Receptor Activity
Gene Subontologies
Cellular Component
Cellular component identifies the location of the
gene product within the structure of the cell.
Examples of cellular components include the
following:
• Organelle Part
• Cell Body Membrane
• Apical Complex
GO Annotations
GO Annotation Terms
Example
• Term: Glucose Biosynthetic Process
• ID: GO:0006094
• Definition: The formation of glucose from
noncarbohydrate precursors, such as
pyruvate, amino acids and glycerol.
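The term record above can be sketched as a small data structure. This is a minimal illustration, not the GO Consortium's own representation: the class name `GOTerm`, its field names, and the `numeric_id` helper are assumptions made for this example. It relies only on the fact that GO IDs are written as "GO:" followed by seven digits.

```python
import re
from dataclasses import dataclass

# GO IDs are written as "GO:" followed by a zero-padded seven-digit integer.
GO_ID_PATTERN = re.compile(r"^GO:(\d{7})$")


@dataclass(frozen=True)
class GOTerm:
    """Hypothetical container for one GO term (names chosen for this sketch)."""
    go_id: str
    name: str
    definition: str

    def numeric_id(self) -> int:
        """Return the integer part of the GO ID, or raise if it is malformed."""
        match = GO_ID_PATTERN.match(self.go_id)
        if match is None:
            raise ValueError(f"not a valid GO ID: {self.go_id!r}")
        return int(match.group(1))


# The example term from the slide.
term = GOTerm(
    go_id="GO:0006094",
    name="Glucose Biosynthetic Process",
    definition=(
        "The formation of glucose from noncarbohydrate precursors, "
        "such as pyruvate, amino acids and glycerol."
    ),
)
print(term.go_id, term.numeric_id())
```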
GO Annotations
GO Annotation Term Statistics
Molecular Function:   8,637 terms
Biological Process:  17,069 terms
Cellular Component:   2,432 terms
Total:               28,138 terms
As of September 2009
GO Annotations
GO Annotation Methods
• Electronic Annotation
• Manual Annotation
• All annotations include
o Source
o Supporting evidence
GO Annotations
GO Annotation Methods
Manual Annotation
• Primary source is published literature
• Curators perform sequence similarity
analyses to transfer annotations between
highly similar gene products (BLAST,
protein domain analysis)
GO Annotations
GO Annotation Methods
Electronic Annotation
• Database entries
o Manual mapping of GO terms to concepts
external to GO (‘translation tables’)
o Proteins then electronically annotated with the
relevant GO term(s)
• Automatic sequence similarity analyses to
transfer annotations between highly
similar gene products
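The translation-table step described above can be sketched as follows. This is only an illustration under stated assumptions: the table key `KW:Mitochondrion`, the protein accessions `PROT_A` and `PROT_B`, and the function name are hypothetical, and real translation tables are far larger. The evidence code `IEA` (Inferred from Electronic Annotation) marks the results as electronic.

```python
# Hypothetical translation table: a concept external to GO -> GO term IDs.
# Curators build such tables manually; this one has a single made-up entry.
TRANSLATION_TABLE = {
    "KW:Mitochondrion": ["GO:0005739"],
}

# Hypothetical database entries: protein accession -> external concepts.
DATABASE_ENTRIES = {
    "PROT_A": ["KW:Mitochondrion"],
    "PROT_B": ["KW:Unmapped"],  # no table entry, so no annotation is produced
}


def annotate_electronically(entries, table):
    """Return (protein, GO ID, evidence code) tuples derived from the table."""
    annotations = []
    for protein, concepts in entries.items():
        for concept in concepts:
            for go_id in table.get(concept, []):
                annotations.append((protein, go_id, "IEA"))
    return annotations


electronic_annotations = annotate_electronically(DATABASE_ENTRIES, TRANSLATION_TABLE)
print(electronic_annotations)
```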
GO Annotations
GO Annotation Example
1A71: Liver Alcohol Dehydrogenase
• Cellular Component: Mitochondrion (GO:0005739)
• Biological Process: Ethanol Catabolic Process (GO:0006068)
• Molecular Function: Oxidoreductase Activity
GO Annotations
Sample Annotations
GO Consortium members provide gene annotation data
based on information obtained from research-quality articles.
The information extracted from the articles is described as
“Annotation Sets”.
• Sample Annotation Sets
GO Annotations
File Format
The Gene Ontology website represents the annotation data
in graphical format. The Gene Ontology is part of the Open
Biomedical Ontologies (OBO), http://obo.sourceforge.net/.
• Current Species/Database Annotations
• Annotation File Format (GAF 2.0)
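A GAF 2.0 file is tab-delimited, with 17 columns per annotation line and comment lines starting with "!". The sketch below splits one line into named fields. The column keys are names chosen for this example, and the sample record is made up for illustration (the accession, symbol, and reference are placeholders, not real database entries).

```python
# The 17 GAF 2.0 columns, in file order (key names chosen for this sketch).
GAF20_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type", "taxon",
    "date", "assigned_by", "annotation_extension", "gene_product_form_id",
]


def parse_gaf_line(line):
    """Parse one GAF 2.0 line into a dict; return None for comment lines."""
    if line.startswith("!"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GAF20_COLUMNS):
        raise ValueError(f"expected {len(GAF20_COLUMNS)} columns, got {len(fields)}")
    return dict(zip(GAF20_COLUMNS, fields))


# Made-up record: optional columns are left empty.
sample_line = "\t".join([
    "UniProtKB", "P00000", "EXAMPLE", "", "GO:0006068", "PMID:0000000",
    "IDA", "", "P", "Example protein", "", "protein", "taxon:9606",
    "20110912", "UniProt", "", "",
])

record = parse_gaf_line(sample_line)
print(record["db_object_id"], record["go_id"], record["evidence_code"], record["aspect"])
```

The aspect column holds a single letter: P for biological process, F for molecular function, C for cellular component.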
GO Annotations
Evidence Code Categories
The annotation file includes evidence information, which
serves as a source to validate the annotation.
• Experimental Evidence Codes
• Computational Analysis Evidence Codes
• Author Statement Evidence Codes
• Curator Statement Evidence Codes
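The four categories above can be expressed as a lookup from evidence code to category. This is a sketch with a partial code list: each category contains more codes than shown here, and the function name and the "Other" fallback are assumptions of this example. Note that IEA (Inferred from Electronic Annotation) does not belong to any of the four categories listed.

```python
# Partial mapping of category -> evidence codes (not an exhaustive list).
EVIDENCE_CATEGORIES = {
    "Experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "Computational Analysis": {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA"},
    "Author Statement": {"TAS", "NAS"},
    "Curator Statement": {"IC", "ND"},
}


def evidence_category(code):
    """Return the category name for an evidence code, or "Other" if unlisted."""
    for category, codes in EVIDENCE_CATEGORIES.items():
        if code in codes:
            return category
    return "Other"


for code in ("IDA", "ISS", "TAS", "IC", "IEA"):
    print(code, "->", evidence_category(code))
```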
GO Annotations
GO Slims
GO Slims are subsets of GO annotation information that
provide a broader classification of terms.
GO Slim Application Example
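One common use of a GO Slim is to roll detailed annotations up to a handful of broad terms. The sketch below does this on a toy parent map. The edges are simplified for illustration and are not the real GO graph, and the single-term slim, the annotation list, and the function name are hypothetical.

```python
from collections import Counter

# Toy parent map (child -> parents); simplified, not the real GO structure.
PARENTS = {
    "GO:0006094": ["GO:0009058"],  # example term from the slides -> biosynthetic process
    "GO:0009058": ["GO:0008152"],  # biosynthetic process -> metabolic process
    "GO:0015979": ["GO:0008152"],  # photosynthesis -> metabolic process
    "GO:0008152": ["GO:0008150"],  # metabolic process -> biological_process (root)
    "GO:0008150": [],
}

# Hypothetical slim containing one broad term.
SLIM = {"GO:0008152"}


def map_to_slim(go_id, parents, slim):
    """Return the slim terms reachable from go_id (the term itself included)."""
    seen, hits, stack = set(), set(), [go_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            hits.add(current)
        stack.extend(parents.get(current, []))
    return hits


# Hypothetical detailed annotations, counted per slim term.
annotated_terms = ["GO:0006094", "GO:0015979", "GO:0008150"]
slim_counts = Counter(
    slim_id
    for go_id in annotated_terms
    for slim_id in map_to_slim(go_id, PARENTS, SLIM)
)
print(dict(slim_counts))
```

The root term maps to no slim term here because the slim sits below it in the toy graph.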
GO Relationships
A graph structure is used to establish relationships among
the terms for molecular function, biological process, and
cellular component features.
Primary Ontology Relations
• is a
• part of
• regulates
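The graph of typed relations can be sketched as a list of (child, relation, parent) edges. The edges below are a toy example written with term names rather than IDs. They are illustrative assumptions, not a copy of the real ontology, and the function names are chosen for this sketch. The traversal follows only the relation types it is given, since whether a "regulates" edge should count toward a term's ancestors depends on the application.

```python
from collections import defaultdict

# Toy edges: (child term, relation, parent term). Illustrative only.
EDGES = [
    ("mitochondrial membrane", "part_of", "mitochondrion"),
    ("mitochondrion", "is_a", "organelle"),
    ("organelle", "is_a", "cellular component"),
    ("regulation of glucose biosynthetic process", "regulates",
     "glucose biosynthetic process"),
    ("glucose biosynthetic process", "is_a", "biosynthetic process"),
]


def build_graph(edges):
    """Index the edges by child term."""
    graph = defaultdict(list)
    for child, relation, parent in edges:
        graph[child].append((relation, parent))
    return graph


def ancestors(term, graph, relations=("is_a", "part_of")):
    """Collect all terms reachable from `term` over the given relation types."""
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        for relation, parent in graph.get(current, []):
            if relation in relations and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


GRAPH = build_graph(EDGES)
print(sorted(ancestors("mitochondrial membrane", GRAPH)))
print(sorted(ancestors("regulation of glucose biosynthetic process", GRAPH,
                       relations=("is_a", "regulates"))))
```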
Gene Ontology Background
GO Mappings to EC Numbers
Enzyme Commission (EC) numbers are used to specify
categories of enzymes based on the chemical reactions
catalyzed. The UniProtKB-GOA EC2GO mapping provides
GO molecular function IDs for each classification.
• EC 1 - Oxidoreductases
• EC 2 - Transferases
• EC 3 - Hydrolases
• EC 4 - Lyases
• EC 5 - Isomerases
• EC 6 - Ligases
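The EC classes above, and one line of an EC-to-GO mapping, can be handled as follows. This is a sketch: the line layout "EC:number > GO:term name ; GO:id" is assumed here, and the sample line is an illustrative assumption that should be checked against the current mapping file. The function names are chosen for this example.

```python
import re

# Top-level EC classes as listed on the slide.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}

# Assumed mapping-line layout: "EC:<number> > GO:<term name> ; GO:<7 digits>".
EC2GO_LINE = re.compile(r"^(EC:[\d.\-]+) > GO:(.+) ; (GO:\d{7})$")


def ec_class(ec_number):
    """Return the top-level class name for an EC number such as 'EC:1.1.1.1'."""
    first_digit = int(ec_number.split(":")[-1].split(".")[0])
    return EC_CLASSES[first_digit]


def parse_ec2go_line(line):
    """Return (EC number, GO term name, GO ID), or None if the line does not match."""
    match = EC2GO_LINE.match(line.strip())
    return match.groups() if match else None


# Illustrative sample line (an assumption, not copied from the mapping file).
sample = "EC:1.1.1.1 > GO:alcohol dehydrogenase (NAD) activity ; GO:0004022"
ec_number, go_name, go_id = parse_ec2go_line(sample)
print(ec_number, ec_class(ec_number), go_id)
```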
GO Tools
• AmiGO
• OBO-Edit
• QuickGO
• GOanna
• agriGO
Gene Ontology
Database
• MySQL
• Querying GO MySQL
o SQL
o Perl
o GHOUL
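A query against the GO database might look like the sketch below. To keep it runnable without a server, an in-memory SQLite table stands in for the MySQL database. The table name `term` and the columns `acc`, `name`, and `term_type` are assumed to mirror the GO schema and should be checked against the actual database. The rows are the three terms used in earlier slides.

```python
import sqlite3

# In-memory SQLite stand-in for the GO MySQL database (schema is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE term ("
    "id INTEGER PRIMARY KEY, acc TEXT, name TEXT, term_type TEXT)"
)
conn.executemany(
    "INSERT INTO term (acc, name, term_type) VALUES (?, ?, ?)",
    [
        ("GO:0006094", "glucose biosynthetic process", "biological_process"),
        ("GO:0006068", "ethanol catabolic process", "biological_process"),
        ("GO:0005739", "mitochondrion", "cellular_component"),
    ],
)


def terms_of_type(connection, term_type):
    """Return (accession, name) rows for one subontology, ordered by accession."""
    cursor = connection.execute(
        "SELECT acc, name FROM term WHERE term_type = ? ORDER BY acc",
        (term_type,),
    )
    return cursor.fetchall()


for acc, name in terms_of_type(conn, "biological_process"):
    print(acc, name)
```

The same SELECT statement could be issued from a MySQL client, with the placeholder syntax adjusted to that driver.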
Gene Ontology
Interesting Research
• GO Annotation Consistency
• Automated Annotation
• BioCreAtIvE
• CLUGO
• Similarity Prediction Method
• Automated Protein Function Predictions
• Search for Genes with Similar Function
• Semantic Similarity
Dissertation
Research Hypothesis
There exist protein alignment
metrics/algorithms that can be used as
clustering indexes for proteins with matching GO
molecular function IDs.
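One possible way to check such a hypothesis is to score how well a clustering agrees with GO molecular function IDs. The sketch below uses cluster purity, which is swapped in here purely as an illustration and is not stated in the slides as the dissertation's method. The cluster labels and GO IDs are made-up data.

```python
from collections import Counter


def cluster_purity(cluster_labels, go_function_ids):
    """Fraction of proteins whose GO ID is the majority GO ID of their cluster."""
    if len(cluster_labels) != len(go_function_ids):
        raise ValueError("inputs must have the same length")
    if not cluster_labels:
        raise ValueError("inputs must not be empty")
    counts_by_cluster = {}
    for cluster, go_id in zip(cluster_labels, go_function_ids):
        counts_by_cluster.setdefault(cluster, Counter())[go_id] += 1
    majority_total = sum(
        counts.most_common(1)[0][1] for counts in counts_by_cluster.values()
    )
    return majority_total / len(cluster_labels)


# Hypothetical data: five proteins, two clusters from some alignment metric,
# and one GO molecular function ID per protein.
clusters = [0, 0, 0, 1, 1]
functions = ["GO:0004022", "GO:0004022", "GO:0005515", "GO:0005515", "GO:0005515"]
purity = cluster_purity(clusters, functions)
print(purity)
```

A purity of 1.0 would mean every cluster contains proteins with a single molecular function ID; lower values indicate mixed clusters.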
Gene Ontology
References
Evelyn B Camon, Daniel G Barrell, Emily C Dimmer, Vivian Lee, Michele
Magrane, John Maslen, David Binns and Rolf Apweiler; An evaluation of GO
annotation retrieval for BioCreAtIvE and GOA. BMC Bioinformatics 2005. 6
(Supplement 1): S17.
Mary E. Dolan, Li Ni, Evelyn Camon and Judith A. Blake; A procedure for
assessing GO annotation consistency. Bioinformatics 2005. 21 (Supplement 1):
i136 – i143.
In-Yee Lee, Jan-Ming Ho, Ming-Syan Chen; CLUGO: A Clustering Algorithm for
Automated Functional Annotations Based on Gene Ontology. Proceedings of
the 5th IEEE International Conference on Data Mining (ICDM'05): i136 – i143.
Gene Ontology Consortium; The Gene Ontology in 2010: extensions and
refinements. Nucleic Acids Research, 2009.
Evelyn Camon, Michele Magrane, Daniel Barrell, Vivian Lee, Emily Dimmer, John
Maslen, David Binns, Nicola Harte, Rodrigo Lopez and Rolf Apweiler; The
Gene Ontology Annotation (GOA) Database: sharing knowledge in UniProt with
Gene Ontology. Nucleic Acids Research, 2004 (32).
Gene Ontology Consortium; The Gene Ontology (GO) database and informatics
resource. Nucleic Acids Research, 2004 (32).
Seth Carbon, Amelia Ireland, Christopher J. Mungall, ShengQiang Shu, Brad
Marshall, Suzanna Lewis; AmiGO: online access to ontology and annotation
data. Bioinformatics Application Note. 22 (2), 2009: 288 – 289.