igor_ontologies_pathways
Tools in Bioinformatics
Ontologies and pathways
Why are ontologies needed?
Free text is the best way to tell a human reader
what a protein does
However, it is a poor way to tell that to a computer
When are we interested in a computer-interpretable
annotation?
We want all the proteins associated with a certain
disease
All the proteins localized to a lysosome
We found a cluster of “interesting” genes and we want
to know what they are involved in
We want to measure the similarity between gene pairs
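One way to make the last question concrete, once every gene carries a set of annotation terms, is to compare the sets directly. The sketch below uses the Jaccard index, which is only one of several possible similarity measures; the gene names and their annotations are invented for illustration.

```python
def jaccard(terms_a, terms_b):
    """Jaccard index of two annotation sets: |A & B| / |A | B|."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0  # convention chosen here for two unannotated genes
    return len(a & b) / len(a | b)


# Hypothetical genes and annotations, invented for illustration only.
annotations = {
    "geneA": {"lysosome", "proteolysis", "peptidase activity"},
    "geneB": {"lysosome", "proteolysis", "lipid binding"},
    "geneC": {"nucleus", "transcription"},
}

sim_ab = jaccard(annotations["geneA"], annotations["geneB"])  # 2 shared of 4 total
sim_ac = jaccard(annotations["geneA"], annotations["geneC"])  # nothing shared
```

This only works if both genes are annotated with the same vocabulary, which is the problem the following slides address.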
Simple solution
The simplest solution is to use a set of
keywords for every protein
Why is this a bad solution?
What’s in a name?
Glucose synthesis
Glucose biosynthesis
Glucose formation
Glucose anabolism
Gluconeogenesis
All refer to the process of making glucose
from simpler components
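A small sketch of why free keywords fail and what a controlled vocabulary buys us. The synonym table is hand-written for this example, and the identifier GO:0006094 is used as the term for gluconeogenesis; treat both as illustrative assumptions rather than as an official mapping file.

```python
# Free-text keywords as different annotators might write them (from the slide).
keywords = [
    "Glucose synthesis",
    "Glucose biosynthesis",
    "Glucose formation",
    "Glucose anabolism",
    "Gluconeogenesis",
]

# Naive string comparison treats every spelling as a separate concept.
distinct_strings = len({k.lower() for k in keywords})

# Hand-written synonym table (an assumption for this sketch):
# every spelling maps to one controlled term identifier.
SYNONYM_TO_TERM = {k.lower(): "GO:0006094" for k in keywords}


def normalize(keyword):
    """Return the controlled term for a keyword, or None if it is unknown."""
    return SYNONYM_TO_TERM.get(keyword.strip().lower())


distinct_terms = len({normalize(k) for k in keywords})
```

With plain strings the five keywords look like five different processes; after normalization they collapse to a single term.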
What’s in a name?
The problem:
Same name for different concepts
Different names for the same concept
Vast amounts of biological data from different
sources
Cross-species or cross-database
comparison is difficult
What is the Gene Ontology?
A (part of the) solution:
The Gene Ontology: “a controlled vocabulary
that can be applied to all organisms even as
knowledge of gene and protein roles in cells is
accumulating and changing”
A controlled vocabulary to describe gene
products - proteins and RNA - in any
organism.
What is GO?
One of the Open Biological Ontologies
Standard, species-neutral way of
representing biology
Three structured networks of defined terms to
describe gene product attributes
More like a phrase book than a biology text
book
How does GO work?
What information might we want to
capture about a gene product?
What does the gene product do?
Molecular function
Where and when does it act?
Cellular component
What is the purpose of these activities?
Biological process
Molecular Function
activities or “jobs” of a gene product
insulin binding
insulin receptor activity
Cellular Component
where a gene product acts
Cellular Component
Enzyme complexes in the component ontology
refer to places, not activities.
Biological Process
a commonly recognized series of events
cell division
Biological Process
transcription
Ontology Structure
Ontologies are structured as a hierarchical
directed acyclic graph
Terms can have more than one parent and
zero, one or more children
Terms are linked by two relationships
is-a
part-of
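A minimal sketch of such a graph in code, assuming a handful of component terms. The specific edges and their relationship types are chosen for illustration and are not taken from the GO files. Each term stores its parents, so a term can have more than one.

```python
# child term -> list of (relationship, parent term); edges are illustrative.
PARENTS = {
    "cell": [],
    "membrane": [("part-of", "cell")],
    "chloroplast": [("part-of", "cell")],
    "mitochondrial membrane": [("is-a", "membrane")],
    "chloroplast membrane": [("is-a", "membrane"), ("part-of", "chloroplast")],
}


def ancestors(term, parents=PARENTS):
    """All terms reachable by following parent links upward from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def children(term, parents=PARENTS):
    """Direct children of `term`: zero, one or more."""
    return {child for child, links in parents.items()
            if any(parent == term for _rel, parent in links)}
```

Here "chloroplast membrane" has two parents, which is what distinguishes a directed acyclic graph from a simple tree.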
Ontology Structure
[Figure: example graph with the terms cell, membrane, mitochondrial
membrane, chloroplast and chloroplast membrane, linked by is-a and
part-of relationships]
True Path Rule
The path from a child term all the way up to
its top-level parent(s) must always be true
cell
  nucleus
    chromosome
But what about bacteria?
True Path Rule
Resolved component ontology structure:
cell
  cytoplasm
  chromosome
    nuclear chromosome
  nucleus
    nuclear chromosome
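The practical consequence of the true path rule is that an annotation to a term implies an annotation to every ancestor of that term. The sketch below propagates annotations upward through the resolved structure; the parent links are my reading of the slide, and the two genes are hypothetical.

```python
# child term -> parent terms, following the resolved structure above.
PARENTS = {
    "cell": [],
    "cytoplasm": ["cell"],
    "chromosome": ["cell"],
    "nucleus": ["cell"],
    "nuclear chromosome": ["chromosome", "nucleus"],
}


def with_ancestors(term, parents=PARENTS):
    """The term itself plus every term on any path up to the root."""
    result = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result


def propagate(direct_annotations, parents=PARENTS):
    """Expand each gene's direct annotations to all implied ancestor terms."""
    return {
        gene: set().union(*(with_ancestors(t, parents) for t in terms))
        for gene, terms in direct_annotations.items()
    }


# Hypothetical genes, for illustration only.
direct = {
    "bacterial_gene": {"chromosome"},
    "yeast_gene": {"nuclear chromosome"},
}
implied = propagate(direct)
```

With the resolved structure, the bacterial gene annotated to "chromosome" is no longer forced into "nucleus", while the yeast gene annotated to "nuclear chromosome" still is.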
GO Annotation
Using GO terms to represent the activities
and localizations of a gene product
Annotations contributed by members of the
GO Consortium
model organism databases
cross-species databases, e.g. UniProt
Annotations freely available from GO website
GO Annotation
Electronic annotation
from mapping files
e.g. UniProt keyword2go
High quantity but low quality
Annotations mostly to general, less detailed terms
Not checked by curators
Manual annotation
From literature curation
Time consuming but high quality
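In practice the two kinds of annotation can be told apart by the evidence code attached to each record. The slides do not mention evidence codes, so this is an addition: the sketch assumes the code IEA marks electronic annotations and treats every other code as manually assigned. The gene names are hypothetical.

```python
from collections import namedtuple

# One annotation record: gene, GO term identifier, evidence code.
Annotation = namedtuple("Annotation", "gene go_id evidence")

# Hypothetical records, invented for illustration only.
annotations = [
    Annotation("geneA", "GO:0005764", "IEA"),  # electronic
    Annotation("geneA", "GO:0006094", "IDA"),  # manual, from an experiment
    Annotation("geneB", "GO:0005634", "IEA"),  # electronic
    Annotation("geneC", "GO:0005634", "TAS"),  # manual, from the literature
]


def split_by_curation(records):
    """Separate electronic (IEA) records from all other, curated records."""
    electronic = [r for r in records if r.evidence == "IEA"]
    manual = [r for r in records if r.evidence != "IEA"]
    return electronic, manual


electronic, manual = split_by_curation(annotations)
manual_genes = sorted({r.gene for r in manual})
```

Whether to drop electronic annotations depends on the analysis: keeping them gives coverage, dropping them gives confidence.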
Where do we see GO annotations?
Entrez Gene / GeneCards / SwissProt
Organism-specific databases
amigo.geneontology.org/
Pathways – beyond terms
Saying that a gene participates in
gluconeogenesis and binds pyruvate in the
nucleus does not provide us with all the
information
Pathway databases specify the place of a specific
gene/protein with respect to other genes doing
similar jobs
KEGG – Kyoto Encyclopedia of
Genes and Genomes
www.genome.jp/kegg/
http://www.genome.jp/kegg/pathway.html
Manually annotated
“Reference maps” linked to hundreds of
genomes
Focus on metabolic pathways
Can be used to answer questions:
Give me all the genes involved in pathway X!
Given a set of genes, is there a pathway that
has a lot of genes in our set?
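The second question is usually answered with an over-representation test. The sketch below uses a one-sided hypergeometric test, which is a common choice but not something the slides prescribe; the pathways and gene identifiers are made up, and in real use the p-values would also need a multiple-testing correction.

```python
from math import comb


def hypergeom_pvalue(universe, pathway, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `universe` genes, of which `pathway` belong to the pathway."""
    max_k = min(pathway, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(pathway, k) * comb(universe - pathway, selected - k)
        for k in range(overlap, max_k + 1)
    )
    return tail / total


def rank_pathways(gene_set, pathways, universe_size):
    """Return (name, overlap, p-value) per pathway, smallest p-value first."""
    results = []
    for name, members in pathways.items():
        overlap = len(gene_set & members)
        p = hypergeom_pvalue(universe_size, len(members), len(gene_set), overlap)
        results.append((name, overlap, p))
    return sorted(results, key=lambda r: r[2])


# Hypothetical data: 20 genes in total, two made-up pathways.
pathways = {
    "pathway X": {"g1", "g2", "g3", "g4", "g5"},
    "pathway Y": {f"g{i}" for i in range(6, 16)},
}
our_genes = {"g1", "g2", "g3", "g6"}
ranked = rank_pathways(our_genes, pathways, universe_size=20)
```

Here three of our four genes fall into the five-gene pathway X, which is unlikely by chance, so pathway X is ranked first.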
KEGG
BioCarta
http://www.biocarta.com/genes/index.asp
Focus on human signaling pathways
MSigDB
So far we have seen curated databases
Focus on the established knowledge
Always lagging behind
MSigDB – combines “established” with gene
sets that came up in some experiment
Up regulated after UV exposure
Down in colorectal cancers
Predicted targets of some transcription factor
Frequently more useful than GO/KEGG