igor_ontologies_pathways


Tools in Bioinformatics
Ontologies and pathways
Why are ontologies needed?
 Free text is the best way to describe to a human reader
what a protein does
 However, it is a lousy way to tell that to a computer
 When are we interested in a computer-interpretable
annotation?
    We want all the proteins associated with a certain disease
    We want all the proteins localized to the lysosome
    We found a cluster of “interesting” genes and we want to know what they are involved in
    We want to measure the similarity between gene pairs
Simple solution
 The simplest solution is to use a set of
keywords for every protein
 Why is this a bad solution?
What’s in a name?
    Glucose synthesis
    Glucose biosynthesis
    Glucose formation
    Glucose anabolism
    Gluconeogenesis
 All refer to the process of making glucose
from simpler components
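To make the point concrete, here is a minimal sketch of how a controlled vocabulary lets a computer treat these five names as one concept: every free-text variant is mapped to a single identifier. The identifier `TERM:0001` and the helper names `normalize` and `same_concept` are hypothetical, chosen only for illustration.

```python
# Hypothetical synonym table: each free-text variant maps to one identifier.
# "TERM:0001" is a made-up placeholder, not a real ontology ID.
SYNONYMS = {
    "glucose synthesis": "TERM:0001",
    "glucose biosynthesis": "TERM:0001",
    "glucose formation": "TERM:0001",
    "glucose anabolism": "TERM:0001",
    "gluconeogenesis": "TERM:0001",
}


def normalize(keyword):
    """Return the controlled-vocabulary ID for a keyword, or None if unknown."""
    return SYNONYMS.get(keyword.strip().lower())


def same_concept(a, b):
    """True only if both keywords are known and map to the same ID."""
    term_a, term_b = normalize(a), normalize(b)
    return term_a is not None and term_a == term_b


if __name__ == "__main__":
    print(same_concept("Glucose formation", "gluconeogenesis"))
    print(same_concept("glycolysis", "gluconeogenesis"))
```

Plain keyword matching would treat the five strings as five unrelated annotations; comparing identifiers instead of strings removes that ambiguity.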
What’s in a name?
The problem:
 Same name for different concepts
 Different names for the same concept
 Vast amounts of biological data from different
sources
 Cross-species or cross-database
comparison is difficult
What is the Gene Ontology?
 A (part of the) solution:

The Gene Ontology: “a controlled vocabulary
that can be applied to all organisms even as
knowledge of gene and protein roles in cells is
accumulating and changing”
 A controlled vocabulary to describe gene
products - proteins and RNA - in any
organism.
What is GO?
 One of the Open Biological Ontologies
 Standard, species-neutral way of
representing biology
 Three structured networks of defined terms to
describe gene product attributes
 More like a phrase book than a biology textbook
How does GO work?
What information might we want to
capture about a gene product?
 What does the gene product do?
    Molecular function
 Where and when does it act?
    Cellular component
 What is the purpose of these activities?
    Biological process
Molecular Function
 activities or “jobs” of a gene product
insulin binding
insulin receptor activity
Cellular Component
 where a gene product acts
 Enzyme complexes in the component ontology
refer to places, not activities.
Biological Process
 a commonly recognized series of events
    cell division
    transcription
Ontology Structure
 Ontologies are structured as a hierarchical
directed acyclic graph
 Terms can have more than one parent and
zero, one or more children
 Terms are linked by two relationships:
    is-a
    part-of
Ontology Structure
[Figure: example graph relating the terms cell, membrane, mitochondrial membrane, chloroplast and chloroplast membrane through is-a and part-of links]
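A directed acyclic graph with typed edges can be sketched in a few lines. The term names below come from the slide; the exact edges are an assumption, since only the term names and the two edge labels survive from the figure. The names `EDGES`, `build_graph` and `is_acyclic` are illustrative.

```python
from collections import defaultdict

# (child, relationship, parent) triples.
# ASSUMPTION: this edge layout is a plausible reading of the lost figure,
# not a copy of it.
EDGES = [
    ("membrane", "part-of", "cell"),
    ("chloroplast", "part-of", "cell"),
    ("mitochondrial membrane", "is-a", "membrane"),
    ("chloroplast membrane", "is-a", "membrane"),
    ("chloroplast membrane", "part-of", "chloroplast"),
]


def build_graph(edges):
    """Index the edges by child (-> parents) and by parent (-> children)."""
    parents = defaultdict(list)
    children = defaultdict(list)
    for child, rel, parent in edges:
        parents[child].append((rel, parent))
        children[parent].append((rel, child))
    return parents, children


def is_acyclic(parents):
    """Depth-first check that walking upward never returns to a term on the current path."""
    state = {}  # term -> "visiting" or "done"

    def visit(term):
        if state.get(term) == "done":
            return True
        if state.get(term) == "visiting":
            return False  # walked back onto the current path: a cycle
        state[term] = "visiting"
        ok = all(visit(parent) for _, parent in parents.get(term, []))
        state[term] = "done"
        return ok

    return all(visit(term) for term in list(parents))


if __name__ == "__main__":
    parents, children = build_graph(EDGES)
    print(parents["chloroplast membrane"])  # two parents, one per relationship
    print(is_acyclic(parents))
```

Under these assumed edges, "chloroplast membrane" has two parents, which is what distinguishes a DAG from a simple tree.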
True Path Rule
 The path from a child term all the way up to
its top-level parent(s) must always be true
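cell
    nucleus
        chromosome
 But what about bacteria? They have a chromosome but no nucleus, so for them the path from chromosome up through nucleus is not true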
True Path Rule
Resolved component ontology structure:
cell
    cytoplasm
    chromosome
        nuclear chromosome
    nucleus
        nuclear chromosome
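The true path rule means that annotating a gene product to a term implicitly annotates it to every ancestor of that term. The sketch below compares the two hierarchies from these slides; the parent links are read off the slide text, relationship types are ignored for simplicity, and the names `ORIGINAL`, `RESOLVED` and `ancestors` are illustrative.

```python
def ancestors(term, parents):
    """All terms reachable by following parent links upward from `term`."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# First structure: chromosome sits under nucleus.
ORIGINAL = {
    "chromosome": ["nucleus"],
    "nucleus": ["cell"],
}

# Resolved structure: chromosome hangs directly under cell, and the more
# specific term "nuclear chromosome" has two parents.
RESOLVED = {
    "cytoplasm": ["cell"],
    "chromosome": ["cell"],
    "nucleus": ["cell"],
    "nuclear chromosome": ["chromosome", "nucleus"],
}

if __name__ == "__main__":
    # A bacterial gene product annotated to "chromosome":
    print(ancestors("chromosome", ORIGINAL))   # implies "nucleus": wrong for bacteria
    print(ancestors("chromosome", RESOLVED))   # implies only "cell"
    print(ancestors("nuclear chromosome", RESOLVED))
```

In the first structure, a bacterial chromosome annotation would also claim a nucleus; in the resolved structure it implies only "cell", and the nuclear case gets its own, more specific term.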

GO Annotation
 Using GO terms to represent the activities
and localizations of a gene product
 Annotations contributed by members of the
GO Consortium
    model organism databases
    cross-species databases, e.g. UniProt
 Annotations freely available from the GO website
GO Annotation
 Electronic annotation
    From mapping files, e.g. UniProt keyword2go
    High quantity but low quality
    Annotations to low level terms
    Not checked by curators
 Manual annotation
    From literature curation
    Time consuming but high quality
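In practice the two kinds of annotation can be told apart by the evidence code attached to each annotation; GO uses the code IEA ("Inferred from Electronic Annotation") for entries that no curator has reviewed. The slides do not mention evidence codes, so treat the following as a sketch under that assumption. The gene names, the records and the helper `split_by_source` are hypothetical.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    gene: str
    term: str
    evidence: str  # GO evidence code, e.g. "IEA", "IDA", "TAS"


# Evidence codes treated as electronic (not reviewed by a curator).
ELECTRONIC = {"IEA"}

# Hypothetical records for illustration only.
ANNOTATIONS = [
    Annotation("geneA", "insulin binding", "IDA"),
    Annotation("geneA", "insulin receptor activity", "IEA"),
    Annotation("geneB", "cell division", "TAS"),
    Annotation("geneB", "transcription", "IEA"),
]


def split_by_source(annotations):
    """Separate curated annotations from electronic ones, preserving order."""
    manual = [a for a in annotations if a.evidence not in ELECTRONIC]
    electronic = [a for a in annotations if a.evidence in ELECTRONIC]
    return manual, electronic


if __name__ == "__main__":
    manual, electronic = split_by_source(ANNOTATIONS)
    print(len(manual), "manual;", len(electronic), "electronic")
```

Dropping the electronic records trades coverage for reliability, which mirrors the quantity versus quality trade-off listed above.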
Where do we see GO annotations?
 Entrez Gene / GeneCards / SwissProt
 Organism-specific databases
 amigo.geneontology.org/
Pathways – beyond terms
 Saying that a gene participates in
gluconeogenesis and binds pyruvate in the
nucleus does not provide us with all the
information
 Pathway databases specify the place of a
specific gene/protein with respect to other
genes doing similar jobs
KEGG – Kyoto Encyclopedia of
Genes and Genomes
 www.genome.jp/kegg/
 http://www.genome.jp/kegg/pathway.html
 Manually annotated
 “Reference maps” linked to hundreds of
genomes
 Focus on metabolic pathways
 Can be used to answer questions:
    Give me all the genes involved in pathway X!
    Given a set of genes, is there a pathway that
    has a lot of genes in our set?
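Both questions can be sketched with plain sets. The first is a lookup; the second is commonly answered with a hypergeometric test, which is one possible choice and not something KEGG itself prescribes. The pathway names, gene names and function names below are hypothetical, and no KEGG API is called.

```python
from math import comb

# Hypothetical pathway membership: pathway name -> set of gene names.
PATHWAYS = {
    "pathway X": {"g1", "g2", "g3", "g4"},
    "pathway Y": {"g5", "g6", "g7", "g8"},
}


def genes_in(pathway):
    """Question 1: all the genes involved in a pathway."""
    return PATHWAYS[pathway]


def hypergeom_sf(k, N, K, n):
    """P(overlap >= k) when n genes are drawn from N, of which K are in the pathway."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def enriched_pathways(gene_set, pathways, universe_size):
    """Question 2: rank pathways by how surprising their overlap with gene_set is."""
    results = []
    for name, members in pathways.items():
        overlap = len(gene_set & members)
        p = hypergeom_sf(overlap, universe_size, len(members), len(gene_set))
        results.append((name, overlap, p))
    return sorted(results, key=lambda r: r[2])  # smallest p-value first


if __name__ == "__main__":
    query = {"g1", "g2", "g3", "g9"}
    # ASSUMPTION: a universe of 20 genes in total.
    for name, overlap, p in enriched_pathways(query, PATHWAYS, universe_size=20):
        print(f"{name}: overlap={overlap}, p={p:.4f}")
```

When many pathways are tested at once, the p-values need a multiple-testing correction, which this sketch omits.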
BioCarta
 http://www.biocarta.com/genes/index.asp
 Focus on human signaling pathways
MSigDB
 So far we have seen curated databases
    Focus on the established knowledge
    Always lagging behind
 MSigDB – combines “established” gene sets with gene
sets that came up in some experiment
    Upregulated after UV exposure
    Down in colorectal cancers
    Predicted targets of some transcription factor
 Frequently more useful than GO/KEGG