comp - Imtech - Institute of Microbial Technology

Basics of Comparative Genomics
Dr G. P. S. Raghava




AIM: To understand the biology of organisms
Importance: More than 100 genomes sequenced,
more than 250 in progress
Definition: Comparison of the set of proteins of one
genome with that of another genome, plus comparison of gene
location, gene order and gene regulation
Applications
– Visualization of information on a genome
– Genome annotation (prediction of genes, repeats, regulatory
regions)
– Evolutionary information (gene loss, duplication, horizontal gene
transfer, ancestor)
– Essential genes for cell survival
– Classification of genes based on function

Tools and Databases
What is comparative genomics?
– Analyzing and comparing genetic material
from different species to study
evolution, gene function, and inherited
disease
– Understanding what is unique to
different species

Why Comparative Genomics?
– It tells us what is common and
what is unique between different
species at the genome level.
– Genome comparison may be the
surest and most reliable way to
identify genes and predict their
functions and interactions.
What is compared?

Gene location
Gene structure
– Exon number
– Exon lengths
– Intron lengths
– Sequence similarity
Gene characteristics
– Splice sites
– Codon usage
– Conserved synteny
A few facts from genome comparison

– High degree of conservation of microbial proteins
(~70% ancestral conserved regions)
– Proteins related to ENERGY processes are generally
found in all genomes
– Proteins related to COMMUNICATION represent
the most distinctive function in each genome
– INFORMATION-related proteins show complex
behaviour
– High frequency (~10%) of non-orthologous gene
displacement
A few terms

Homology: Homology is the relationship of
any two characters (such as two proteins
that have similar sequences) that have
descended, usually through divergence, from
a common ancestral character. Homologues
are thus components or characters (such as
genes/proteins with similar sequences) that
can be attributed to a common ancestor of
the two organisms during evolution.
Homologues can be orthologues,
paralogues or xenologues.



Orthologues are homologues that have evolved
from a common ancestral gene by speciation.
They usually have similar functions.
Paralogues are homologues that are related or
produced by duplication within a genome
followed by subsequent divergence. They often
have different functions.
Xenologues are homologues that are related by
an interspecies (horizontal) transfer of the
genetic material for one of the homologues. The
functions of the xenologues are quite often
similar.
Analogues

Analogues are non-homologous
genes/proteins that have arisen
convergently from unrelated
ancestors. They have similar functions
although they are unrelated in either
sequence or structure.
Frequently used terms

Homology
– Orthologous: derived from a common ancestral gene by speciation; usually have
similar functions
– Paralogous: produced by duplication of a gene within a genome; usually have
different functions
– Xenologous: related by an interspecies (horizontal
gene) transfer of genetic material; have similar functions

Analogous: not evolved from the same ancestor
Similarity: sequence similarity
Percent identity
Visualising Genome Information
Genome Annotation
The process of adding biological information and
predictions to a sequenced genome framework
All-against-all Self-comparison

How?
– Make a database of the proteome
– Use each protein as a query in a similarity search against the
database (BLAST, WU-BLAST or FASTA)
– Generate a matrix of alignment scores (P or E values)
  A conservative cutoff E value: 10e-6
Why?
– Number of gene families
This comparison distinguishes unique proteins from proteins
that have arisen from gene duplication, and also reveals the number of gene
families.
– Paralogs
Significantly matched pairs of protein sequences may be
paralogs.
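The self-comparison above can be sketched as a small clustering step over the search results. This is only an illustration: the hit tuples are invented, the function name is hypothetical, the cutoff of 1e-6 is one reading of the slide's cutoff value, and single-linkage grouping is an assumed way of turning significant pairs into families, not a method stated in the slides.

```python
from collections import defaultdict


def gene_families(proteins, hits, e_cutoff=1e-6):
    """Group proteins into families by single-linkage clustering.

    proteins: iterable of protein IDs in one proteome
    hits: iterable of (query, subject, e_value) tuples from an
          all-against-all search of that proteome against itself
    """
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the family representative (path halving)
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for query, subject, e_value in hits:
        # Ignore self-hits and pairs that are not significant
        if query != subject and e_value <= e_cutoff:
            parent[find(query)] = find(subject)

    families = defaultdict(set)
    for p in parent:
        families[find(p)].add(p)
    return list(families.values())


# Hypothetical search results (not real data)
proteins = ["P1", "P2", "P3", "P4", "P5"]
hits = [
    ("P1", "P1", 0.0),     # self-hit, ignored
    ("P1", "P2", 1e-40),   # significant: candidate paralogs
    ("P2", "P3", 1e-12),   # significant: joins the same family
    ("P4", "P5", 0.5),     # not significant
]
families = gene_families(proteins, hits)
unique = [f for f in families if len(f) == 1]
print(len(families), "families,", len(unique), "of them unique proteins")
```

Families with more than one member are the candidates for paralogs; single-member families correspond to the unique proteins.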
Between-Proteome Comparisons: Why?

To identify orthologs, gene families, and domains
Orthologs (proteins that share a common ancestry and function):
– A pair of proteins in two organisms that align along most of
their lengths with a highly significant alignment score.
– These proteins perform the core biological functions shared
by the two organisms.
– Two matched sequences (X in A, Y in B) may not be
orthologs
(for example, if Y and Z are paralogs in B, and X and Z are the orthologs)
– Identify true orthologs by requiring:
(a) highest-scoring match (best hit)
(b) E value < 0.01
(c) > 60% alignment over both proteins
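The three criteria (a) to (c) can be sketched as a filter over search hits. The Hit record, the function name and all numbers below are hypothetical; the sketch assumes that the best hit is chosen by alignment score first and then tested against the E-value and coverage thresholds.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str            # protein in organism A
    subject: str          # protein in organism B
    e_value: float
    score: float          # alignment score
    aligned_query: int    # query residues inside the alignment
    aligned_subject: int  # subject residues inside the alignment


def candidate_orthologs(hits, lengths, max_e=0.01, min_cov=0.60):
    """Return {query: subject} for best hits passing criteria (b) and (c)."""
    # (a) keep only the highest-scoring hit for each query
    best = {}
    for h in hits:
        if h.query not in best or h.score > best[h.query].score:
            best[h.query] = h

    result = {}
    for query, h in best.items():
        cov_query = h.aligned_query / lengths[h.query]
        cov_subject = h.aligned_subject / lengths[h.subject]
        # (b) E value < 0.01 and (c) > 60% of BOTH proteins aligned
        if h.e_value < max_e and cov_query > min_cov and cov_subject > min_cov:
            result[query] = h.subject
    return result


# Hypothetical data mirroring the X / Y / Z example
lengths = {"X": 300, "Y": 310, "Z": 305, "W": 500, "V": 120}
hits = [
    Hit("X", "Y", 1e-30, 200.0, 250, 250),  # matches, but not the best hit
    Hit("X", "Z", 1e-60, 350.0, 290, 290),  # best hit for X
    Hit("W", "V", 1e-5, 80.0, 100, 100),    # only 20% of W is aligned
]
orthologs = candidate_orthologs(hits, lengths)
print(orthologs)
```

Here X is paired with Z rather than its weaker match Y, and the W/V pair is dropped because the alignment covers too little of W. A stricter variant would also require the pairing to hold in the reverse search (reciprocal best hit), which is not part of the criteria listed above.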
Between-Proteome Comparisons: How?
1. Choose a yeast protein and perform a database similarity search of the
worm proteome (WU-BLAST): a yeast-versus-worm search
2. Group the worm seqs that match the yeast query seq with a highly significant P
value (10^-10 to 10^-100); also include the yeast query seq in the group
3. From the group made in 2, choose a worm seq and make a search of
the yeast proteome, using the same P limit
4. Add any matching yeast seq to the group made in 2
5. Repeat 3 & 4 for all initially matched seqs in the group
6. Repeat 1-5 for every yeast protein
7. As in 1-6, perform a comparable worm-versus-yeast search
8. Coalesce the groups of related seqs and remove any redundancies so
that every sequence is represented only once
9. Eliminate any matched pairs in which less than 80% of each seq is in
the alignment
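The nine steps can be sketched in a few functions. This is a simplified illustration, not the original pipeline: the search results are invented dictionaries standing in for WU-BLAST output, all function names are hypothetical, and the 80% coverage rule of step 9 is applied up front when hits are filtered, which is an assumption made to keep the sketch short and may group sequences differently from a filter applied at the end.

```python
P_LIMIT = 1e-10      # least significant P value accepted (assumed limit)
MIN_COVERAGE = 0.80  # fraction of each sequence that must be aligned


def significant(hits, p_limit=P_LIMIT, min_cov=MIN_COVERAGE):
    """hits: {query: [(subject, p_value, cov_query, cov_subject), ...]}.

    Returns {query: set of subjects} keeping only significant,
    well-covered matches.
    """
    return {
        q: {s for s, p, cq, cs in matches
            if p <= p_limit and cq >= min_cov and cs >= min_cov}
        for q, matches in hits.items()
    }


def build_group(seed, forward, backward):
    """Steps 1-5: the seed, its matches, and the matches of those matches."""
    matched = forward.get(seed, set())
    group = {seed} | matched
    for other in matched:
        group |= backward.get(other, set())
    return group


def coalesce(groups):
    """Step 8: merge groups sharing a sequence so each appears only once."""
    merged = []
    for g in groups:
        g = set(g)
        keep = []
        for m in merged:
            if m & g:
                g |= m          # absorb the overlapping group
            else:
                keep.append(m)
        merged = keep + [g]
    return merged


# Hypothetical search results (not real data)
yeast_vs_worm = {
    "yA": [("w1", 1e-50, 0.95, 0.90), ("w2", 1e-20, 0.85, 0.88)],
    "yB": [("w2", 1e-30, 0.90, 0.90)],
    "yC": [("w3", 1e-12, 0.40, 0.90)],  # fails the 80% coverage rule
}
worm_vs_yeast = {
    "w1": [("yA", 1e-48, 0.90, 0.95)],
    "w2": [("yA", 1e-19, 0.88, 0.85), ("yB", 1e-29, 0.90, 0.90)],
    "w3": [("yC", 1e-12, 0.90, 0.40)],
}

fwd = significant(yeast_vs_worm)
bwd = significant(worm_vs_yeast)
groups = [build_group(y, fwd, bwd) for y in fwd]    # steps 1-6
groups += [build_group(w, bwd, fwd) for w in bwd]   # step 7
# Step 8, then drop sequences left without any partner
families = [g for g in coalesce(groups) if len(g) > 1]
print(families)
```

With this toy input, yA, yB, w1 and w2 end up in one group, while yC and w3 are left out because their only alignment covers less than 80% of one sequence.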
Figure 1 Regions of the human and mouse homologous genes: Coding exons
(white), noncoding exons (gray), introns (dark gray), and intergenic regions
(black). Corresponding strong (white) and weak (gray) alignment regions of GLASS
are shown connected with arrows. Dark lines connecting the alignment regions
denote very weak or no alignment. The predicted coding regions of ROSETTA in
human, and the corresponding regions in mouse, are shown (white) between the
genes and the alignment regions.
Target Validation

Target validation involves taking steps to prove that a
DNA, RNA, or protein molecule is directly involved in a
disease process and is therefore a suitable target for
development of a new therapeutic compound.

Genes that do not belong to an established family are
critical to many disease processes and also need to be
validated as potential drug targets.