Molecular evolution of proteins and Phylogenetic Analysis
Molecular Evolution of Proteins
and Phylogenetic Analysis
Fred R. Opperdoes
Christian de Duve Institute of Cellular Pathology
(ICP) and Laboratory of Biochemistry, Université
catholique de Louvain, Brussels, Belgium
ICP-TROP
Contents (1)
Arguments in favour of a phylogenetic analysis of the
corresponding protein rather than the DNA
Codon bias
The long time horizon
Introns
Multigene families
Protein is the unit of selection
RNA editing
Contents (2)
Methods for the Multiple Alignment of Protein Sequences
Two sequences
Multiple sequences (automatic)
Manual alignment
Methods for the inference of protein phylogeny
Distance methods
Maximum parsimony
Reliability and rooting of trees
What is a phylogenetic tree and what does it tell you?
[Figure: rooted tree with external nodes A-E, internal nodes F-I and the root marked]
A-E are external nodes (extant)
F-I are internal (ancestral) nodes
OTUs are operational taxonomic units
They can be: species
They are the extant (existing) OTUs
Internal nodes represent ancestral units.
Topology: order of the nodes on the tree
The ‘tree of life’ based on rRNA sequences
[Figure: rRNA tree. Labels: Eukaryota (Algae, Fungi, Ciliates, Animals, Plants, Euglena, Kinetoplastida, Parabasalia, Microsporidia, Diplomonads; grouped as mitochondriates and amitochondriates), Eubacteria, Archaebacteria]
The fusion hypothesis: the eukaryotic cell is a chimaera of eubacterial and archaebacterial traits
[Figure: tree of Eukaryota (Algae, Fungi, Ciliates, Animals, Plants, Euglena, Kinetoplastida, Parabasalia, Microsporidia, Diplomonads), Eubacteria and Archaebacteria, annotated with "Energy metabolism", "Genetic machinery", "Root?" and "Common ancestor?"]
Triosephosphate isomerase
[Figure: phylogenetic tree of triosephosphate isomerase sequences (Swiss-Prot entries TPIS_HUMAN to TPIS_METBR), with clades labelled Animalia, Planta, Protists, Fungi, Eubacteria and Archaebacteria; scale bar 0.1; root position marked "Root?"]
Triosephosphate isomerase of eukaryotes is of typical eubacterial origin and probably entered the eukaryotic cell together with the bacterial endosymbiont that gave rise to the mitochondrion
What is required
A DNA or protein sequence
A set of homologous sequences
A good multiple sequence alignment
Several programs to create a
phylogenetic tree
DNA or protein ?
>TBTIM T.brucei TIM gene for microbody triosephosphate isomerase.
CTGCAGCAACTTACTGGGGACGCTGCTATCCTTTCTTCTTCATATTTCTCGTTTACCTAC
GTTTAGAGTCTCTGAGATCATTACTAGCAAGCAAACAAGAAGCCATTTGAGTTTCAAGCA
AAGTCTACCAAAAAACAAACTCTTATTATACCGTGCCAAATTATGTCCAAGCCACAACCC
ATCGCAGCAGCCAACTGGAAGTGCAACGGCTCCCAACAGTCTTTGTCGGAGCTTATTGAT
CTGTTTAACTCCACAAGCATCAACCACGACGTGCAATGCGTAGTGGCCTCCACCTTTGTT
CACCTTGCCATGACGAAGGAGCGTCTTTCACACCCCAAATTTGTGATTGCGGCGCAGAAC
GCCATTGCAAAGAGCGGTGCCTTCACCGGCGAAGTCTCCCTGCCCATCCTCAAAGATTTC
GGTGTCAACTGGATTGTTCTGGGTCACTCCGAGCGCCGCGCATACTATGGTGAGACAAAC
GAGATTGTTGCGGACAAGGTTGCCGCCGCCGTTGCTTCTGGTTTCATGGTTATTGCTTGC
ATCGGCGAAACGCTGCAGGAGCGTGAATCAGGTCGCACCGCTGTTGTTGTGCTCACACAG
ATCGCTGCTATTGCTAAGAAACTGAAGAAGGCTGACTGGGCCAAAGTTGTCATCGCCTAC
GAACCCGTTTGGGCCATTGGTACCGGCAAGGTGGCGACACCACAGCAAGCGCAGGAAGCC
CACGCACTCATCCGCAGCTGGGTGAGCAGCAAGATTGGAGCAGATGTCGCGGGAGAGCTC
CGCATTCTTTACGGCGGTTCTGTTAATGGAAAGAATGCGCGCACTCTTTACCAACAGCGA
GACGTCAACGGCTTCCTTGTTGGTGGTGCCTCACTTAAGCCAGAATTTGTGGACATCATC
AAAGCCACTCAGTGATTTTCCTTCATGTGTCAATGAGGTTTGGTGCTTTTGCCGTTGAGT
GGGTGAAGATAGCGGTATATATATATATATATATATATATATATGCGCAAGTGAATATAA
AAAAGATGTAAAGACAGGTAGCAGGGAGAAAACCTCGCATAACATTATAAAAGGGAGTGT
AACTGGAGTGGGAAAACAAAGGAAAGGGGGATTCGTGTATTGAGCATATGAGAAAAAAAA
AAGAAATTATGTTGTATGTTTTTACCTATAATTTATGCGAAGTGAATGACAAAACAAAAA
CCAAAAGGATATCATCATATGCTTTGTTTCATCCAAATGGTTGTTTCTTCCGTACCTCAG
GGTCACTACTTCGTTGAGTGTGGTTTTAGCGAGGAGAGGGAACAATAGGGGGTGTTGTAT
ACATTTACACGTACGTATCTTCCTTTACTCTCTCTTGCCTTCATTATATTCCCCCTTTTT
CTGGGAGAGGAAAAGAGAGTGTAGAATGAGGGGAGTACGTGTACGGAATTTTAACGATTA
CCCCCTTTTTTTTCTTTGAACTATTATTTTTAGAATTC
>P04789|TPIS_TRYBB Triosephosphate isomerase, glycosomal (TIM) (Triose-phosphate isomerase)
MSKPQPIAAANWKCNGSQQSLSELIDLFNSTSINHDVQCVVASTFVHLAMTKERLSHPKF
VIAAQNAIAKSGAFTGEVSLPILKDFGVNWIVLGHSERRAYYGETNEIVADKVAAAVASG
FMVIACIGETLQERESGRTAVVVLTQIAAIAKKLKKADWAKVVIAYEPVWAIGTGKVATP
QQAQEAHALIRSWVSSKIGADVAGELRILYGGSVNGKNARTLYQQRDVNGFLVGGASLKP
EFVDIIKATQ
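The relation between the two records above can be checked with a short translation routine. The following is a minimal sketch, assuming the standard genetic code and a reading frame that starts at the ATG of the coding region; the function name `translate` and the excerpt `cds_start` (the first 14 codons of the coding region in the DNA record) are illustrative choices, not part of the original slides.

```python
# Standard ("universal") genetic code, written in TCAG order:
# first base varies slowest, third base fastest.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"   # first base T
               "LLLLPPPPHHQQRRRR"   # first base C
               "IIIMTTTTNNKKSSRR"   # first base A
               "VVVVAAAADDEEGGGG")  # first base G
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna, to_stop=True):
    """Translate a coding sequence in frame 1; unknown codons become 'X'."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[pos:pos + 3], "X")
        if aa == "*" and to_stop:
            break  # stop codon: end of peptide sequence
        protein.append(aa)
    return "".join(protein)


# First 14 codons of the T. brucei TIM coding region shown above.
cds_start = "ATGTCCAAGCCACAACCCATCGCAGCAGCCAACTGGAAGTGC"
print(translate(cds_start))  # MSKPQPIAAANWKC
```

The printed peptide matches the start of the P04789 protein record.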
The universal genetic code

First      Second position               Third
position   U(T)   C      A      G        position
U(T)       Phe    Ser    Tyr    Cys      U(T)
           Phe    Ser    Tyr    Cys      C
           Leu    Ser    STOP   STOP     A
           Leu    Ser    STOP   Trp      G
C          Leu    Pro    His    Arg      U(T)
           Leu    Pro    His    Arg      C
           Leu    Pro    Gln    Arg      A
           Leu    Pro    Gln    Arg      G
A          Ile    Thr    Asn    Ser      U(T)
           Ile    Thr    Asn    Ser      C
           Ile    Thr    Lys    Arg      A
           Met    Thr    Lys    Arg      G
G          Val    Ala    Asp    Gly      U(T)
           Val    Ala    Asp    Gly      C
           Val    Ala    Glu    Gly      A
           Val    Ala    Glu    Gly      G
Arguments in favour of protein rather than DNA sequences
CODON BIAS :
The 64 possible triplet codons encode 20 amino acids. One amino acid may be encoded by 1 to 6 different codons, and 3 of the 64 codons, called stop (or termination) codons, specify "end of peptide sequence"
The different codons are used with unequal frequency and this distribution of frequency is referred to as "codon usage"
Codon usage varies between species. The code is degenerate, with most of the variation ("wobble") in the third codon position.
Arguments in favour of ... (codon bias 2)
Yeasts, protozoa, and animals have different codon preferences.
This would result in differences in DNA sequence that are related to codon bias and not to evolution.
Different species use different codons

Homo sapiens [gbmam]: 1 CDS's (389 codons)
fields: [triplet] [frequency: per thousand] ([number])
UUU 20.6(     8)  UCU  5.1(     2)  UAU  7.7(     3)  UGU  7.7(     3)
UUC 12.9(     5)  UCC 20.6(     8)  UAC 30.8(    12)  UGC  0.0(     0)
UUA 10.3(     4)  UCA 18.0(     7)  UAA  0.0(     0)  UGA  0.0(     0)
UUG 10.3(     4)  UCG  0.0(     0)  UAG  2.6(     1)  UGG 15.4(     6)

Saccharomyces cerevisiae [gbpln]: 9295 CDS's (4586264 codons)
fields: [triplet] [frequency: per thousand] ([number])
UUU 25.9(118900)  UCU 23.6(108308)  UAU 18.7( 85651)  UGU  8.0( 36624)
UUC 18.3( 83880)  UCC 14.3( 65421)  UAC 14.7( 67599)  UGC  4.6( 21255)
UUA 26.3(120698)  UCA 18.7( 85618)  UAA  1.0(  4476)  UGA  0.6(  2742)
UUG 27.2(124967)  UCG  8.5( 39137)  UAG  0.4(  2058)  UGG 10.4( 47694)
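The per-thousand figures in these tables are simply codon counts divided by the total number of codons and multiplied by 1000. A minimal sketch of that bookkeeping is given below; the function names `per_thousand` and `codon_usage` are illustrative, and the routine assumes an in-frame coding sequence without introns.

```python
from collections import Counter


def per_thousand(count, total):
    """Frequency of a codon per 1000 codons."""
    return 1000.0 * count / total


def codon_usage(cds):
    """Return {codon: (frequency per thousand, count)} for an in-frame CDS."""
    cds = cds.upper().replace("T", "U")
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: (per_thousand(n, total), n) for codon, n in counts.items()}


# Reproduce two entries of the tables above from their raw counts.
print(round(per_thousand(8, 389), 1))           # human UUU
print(round(per_thousand(118900, 4586264), 1))  # yeast UUU
```

Comparing two such dictionaries codon by codon gives a quick impression of how strongly the codon preferences of two organisms differ.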
Differences between the “Universal” and Mitochondrial Genetic Codes

Codon   Universal code   Mitochondrial code
UGA     Stop             Trp
AGA     Arg              Stop
AGG     Arg              Stop
AUA     Ile              Met

Modified from: Li and Graur, 1991, Fundamentals of Molecular Evolution, Sinauer Publ.
Arguments in favour... (codon bias)
Also, some protozoa (the ciliates) use the codons TAA and TAG to encode glutamine, rather than STOP
In mitochondria the codon TGA encodes tryptophan, rather than STOP
The inclusion of unique codons in a subset of the sequences will tend to make that subset appear more divergent than it really is
Arguments in favour... (codon bias 2)
High GC content of DNA seems to be associated with aerobiosis in prokaryotes (Naya et al., 2002)
In all major groups, organisms with AT-rich DNA as well as organisms with GC-rich DNA can be found.
The inclusion of unique codons in a subset of the sequences will tend to make that subset appear more divergent than it really is
GC content of DNA in aerobic and anaerobic prokaryotes
[Figure: GC content of DNA in anaerobic and in aerobic prokaryotes]
From Naya et al., J. Mol. Evol. 55 (2002) 260-264
The use of protein sequences in phylogeny requires knowledge of the properties of the amino acids and their single letter codes

Alanine         A     Leucine         L
Arginine        R     Lysine          K
Asparagine      N     Methionine      M
Aspartic acid   D     Phenylalanine   F
Cysteine        C     Proline         P
Glutamic acid   E     Serine          S
Glutamine       Q     Threonine       T
Glycine         G     Tryptophan      W
Histidine       H     Tyrosine        Y
Isoleucine      I     Valine          V
Arguments in favour of a phylogenetic
analysis of the corresponding protein rather
than the DNA
LONG TIME HORIZON :
When comparing sequences that have diverged for
possibly a billion years or more, it is very likely that the
wobble bases in the codons will have become
randomized. By excluding the wobble bases (a general
technique), one is actually looking at amino acid
sequences.
So why not take the protein sequence directly?
Advantages of the translation
of DNA into protein (1)
DNA is composed of only four kinds of unit: A, G, C and T
If gaps are not allowed, on the average, 25% of residues in two
randomly chosen aligned sequences would be identical
If gaps are allowed, as much as 50 % of residues in two randomly chosen aligned sequences can be identical. Such a situation may obscure any genuine relationship that may exist, especially when comparing distantly related or rapidly evolving gene sequences
Moreover, it is easier to translate a gene sequence into its corresponding protein than to remove the third wobble base from each of the codons in the gene
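The 25% figure for ungapped random DNA can be illustrated with a small simulation. This is a sketch under the assumption of uniform, independent residue frequencies (real sequences are compositionally biased, so actual values will differ); the helper names and the sequence length of 10000 are arbitrary choices. The same experiment with a 20-letter alphabet is included for comparison.

```python
import random


def random_sequence(length, alphabet, rng):
    """Draw a sequence with uniform, independent residue frequencies."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def ungapped_identity(a, b):
    """Fraction of identical positions when no gaps are allowed."""
    n = min(len(a), len(b))
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


rng = random.Random(0)  # fixed seed so the run is repeatable
dna_identity = ungapped_identity(
    random_sequence(10000, "ACGT", rng),
    random_sequence(10000, "ACGT", rng),
)
protein_identity = ungapped_identity(
    random_sequence(10000, "ACDEFGHIKLMNPQRSTVWY", rng),
    random_sequence(10000, "ACDEFGHIKLMNPQRSTVWY", rng),
)
print(f"random DNA:     {dna_identity:.1%} identical")
print(f"random protein: {protein_identity:.1%} identical")
```

With four equally frequent bases the expected identity is 1/4; with twenty equally frequent amino acids it is 1/20, which is why chance matches obscure a protein alignment far less.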
Alignment of two random DNA sequences
[Figure: example alignments] Without indels: 19% identity. Indels allowed: 56% identity
Advantages of the translation of DNA into protein (2)
Translation of DNA into 21 different types of codon (20 amino acids and a terminator) sharpens up the information considerably. Wrong-frame information is set aside
Third-base degeneracies are consolidated
After insertion of gaps to align two random protein sequences it can be expected that they are between 10-20% identical
As a result of the translation procedure the protein sequences with their 20 amino acids are much easier to align than the corresponding DNA sequences with only 4 nucleotides
Alignment of two random protein sequences
[Figure: example alignments] Without indels: 7% identity. Indels allowed: 22% identity
Advantages of the translation of DNA into protein (3)
If, after this, you still want to align distantly related gene sequences, you had better first prepare a protein alignment and then use it to guide the alignment of the gene sequences and the precise placement of indels.
Conclusion: The signal to noise ratio is greatly improved when using protein sequences over DNA sequences!
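Using a protein alignment as the template for a gene alignment amounts to replacing every aligned residue by its codon and every gap by three gap characters. The sketch below shows the idea; the function name `thread_codons` and the four-residue toy sequences are hypothetical, and the routine assumes that each coding sequence is in frame and intron-free.

```python
def thread_codons(aligned_protein, cds):
    """Place the codons of `cds` under an aligned protein sequence.

    Each residue is replaced by its codon, each '-' by '---'.
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for ch in aligned_protein if ch != "-")
    if n_residues > len(codons):
        raise ValueError("coding sequence is shorter than the protein")
    remaining = iter(codons)
    out = []
    for ch in aligned_protein:
        out.append("---" if ch == "-" else next(remaining))
    return "".join(out)


# Toy example: the first protein lacks the proline present in the second.
protein_alignment = {"seq1": "MK-L", "seq2": "MKPL"}
coding = {"seq1": "ATGAAACTG", "seq2": "ATGAAGCCACTT"}
for name, aligned in protein_alignment.items():
    print(name, thread_codons(aligned, coding[name]))
```

The indel is placed between codons, so the reading frame is preserved in the resulting DNA alignment.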
TBLASTN
The blast algorithm TBLASTN allows
the use of translated protein sequence
information to search for distant
relationships between genes
A protein sequence is compared with all
the translated sequences from a
nucleotide database
Nature of Sequence
Divergence in proteins
The observed sequence difference of two diverging
sequences takes the course of a negative exponential.
This is the result of the fact that each position is subject
to reverse changes ("back mutations") and multiple hits
Thus the observed percentage of difference between the
protein sequences is not proportional to the actual
evolutionary difference between two homologous
sequences
The evolutionary distance between two proteins is
expressed in PAM units. PAM (Dayhoff and Eck, 1968)
stands for "accepted point mutation"
Relation between % distance and PAM distance

PAM value   Distance (%)
80          50
100         60
200         75
250         85
300         92
[Label on the table: Twilight zone]
(From Doolittle, 1987, Of URFs and ORFs, University Science Books)
As the evolutionary distance increases, the probability of superimposed mutations becomes greater, so the observed percent difference falls increasingly short of the actual number of changes.
Relation between % distance and PAM distance
[Figure: plot of distance (%, axis 0-85) against PAM value (axis 100-400), with the twilight zone marked]
The Kimura correction for multiple substitutions
The formula used to correct for multiple hits is from Motoo Kimura (Kimura, M. The Neutral Theory of Molecular Evolution, Camb. Univ. Press, 1983, page 75):
K = -ln(1 - D - D^2/5), where D is the observed distance and K is the corrected distance.
This formula gives the mean number of estimated substitutions per site and, in contrast to D (the observed number), can be greater than 1, i.e. more than one substitution per site, on average. For example, if you observe 0.8 differences per site (80% difference; 20% identity), then the above formula predicts that there have been about 2.6 substitutions per site over the course of evolution since the 2 sequences diverged.
This can also be expressed in PAM units by multiplying by 100 (mean number of substitutions per 100 residues).
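The correction is easy to evaluate numerically. The following minimal sketch implements the formula exactly as quoted above; the function names are illustrative, and the error raised for very large D is a design choice of this example (the logarithm is undefined once 1 - D - D^2/5 reaches zero, at D of roughly 0.85).

```python
import math


def kimura_distance(d):
    """Corrected distance K = -ln(1 - D - D^2/5) for an observed distance D."""
    inner = 1.0 - d - d * d / 5.0
    if inner <= 0.0:
        raise ValueError("observed distance too large for the correction")
    return -math.log(inner)


def pam_units(d):
    """Corrected distance expressed as substitutions per 100 residues."""
    return 100.0 * kimura_distance(d)


for observed in (0.1, 0.5, 0.8):
    print(f"D = {observed:.1f}  K = {kimura_distance(observed):.2f}  "
          f"PAM = {pam_units(observed):.0f}")
```

For D = 0.8 the function returns a value a little above 2.6 substitutions per site, which illustrates how strongly the observed difference underestimates the real number of changes at large distances.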
Proteins evolve at highly different rates

Protein                          Rate of change       Theoretical lookback time
                                 (PAMs / 10^8 yrs)    (x 10^6 yrs)
Pseudogenes                      400                  45
Fibrinopeptides                  90                   200
Lactalbumins                     27                   670
Lysozymes                        24                   850
Ribonucleases                    21                   850
Haemoglobins                     12                   1500
Acid proteases                   8                    2300
Cytochrome c                     4                    5000
Glyceraldehyde-P dehydrogenase   2                    9000
Glutamate dehydrogenase          1                    18000

PAM = number of Accepted Point Mutations per 100 amino acids. Useful lookback time = 360 PAMs
Some Important Dates in History

Event                                 Number of years ago (x 10^9 yrs)
Origin of the Universe                15 ± 4
Formation of the Solar System         4.6
First Self-replicating System         3.5 ± 0.5
Prokaryotic-Eukaryotic Divergence     2.0 ± 0.5
Plant-Animal Divergence               ~1.0
Invertebrate-Vertebrate Divergence    0.5
Mammalian Radiation Beginning         ~0.1

From Doolittle, Of URFs and ORFs, 1987
Construction of a phylogenetic tree from phosphoglycerate kinase sequences
[Figure: two phosphoglycerate kinase trees (scale bar 0.1) with the taxa Rat, Mouse, Human, Horse, Drosophila, Schistosoma, Kluyveromyces, Yeast, Neurospora, Yarli, Plasmodium, Leishmania, Crithidia, Trypanosoma (1 and 2), Wheat, Zymomonas, Bacillus, Escherichia, Mycobacter and Methanobacter]
[Figure: excerpt of the multiple alignment (19 sequences, four column blocks)]
GL  DCGPESSKKY  AEAV  TRAKQIVWNGP
GL  DCGTESSKKY  AEAV  GRAKQIVWNGP
GL  DCGTESSKKY  AEAV  ARAKQIVWNGP
GL  DVGPKTRELF  AAPI  ARAKLIVWNGP
GL  DIGPKTIEEF  SKVI  SRAKTIVWNGP
GL  DIGPDSVKTF  NDAL  DTTQTIIWNGP
GL  DCGPKSIEEF  QKVI  GESKTILWNGP
GL  DNGPESRKLF  AATV  AKAKTIVWNGP
GL  DCGEESVKLF  TQAI  NESQTILWNGP
GL  DNGPESRKAF  AATV  AEAKTIVWNGP
GL  DAGPKSIENY  KDVI  LTSKTVIWNGP
AL  DIGPKTIEKY  VQTI  GKCKSAIWNGP
AL  DIGPKTIKIY  EDVI  AKCKSTIWNGP
AL  DIGPRTIHMY  EEVI  GRCKSAIWNGP
AL  DIGPKTRELY  RDVI  RESKLVVWNGP
IL  DIGDASAQEL  AEIL  KNAKTILWNGP
SL  DVGSKTIALF  ESYL  KTAKTIFWNGP
IL  DVGPKAVAAL  TEVL  KASKTLVWNGP
IY  DIGTNTITEY  AKFI  RDAKTIFANGP
Arguments in favour of a phylogenetic
analysis of the corresponding protein rather
than the DNA (3)
INTRONS :
A study of the evolution of a protein using its DNA
sequence should only include coding sequences
This requires that all the introns be edited out of every DNA sequence. This may be cumbersome and time consuming
An easier approach would be the direct translation of
the cDNA sequence into its corresponding protein
sequence
Typical structure of a eukaryotic gene
[Figure: gene map from 5' to 3' with the labels flanking region, TATA box, transcription initiation, initiation codon, Exon 1, Intron I, Exon 2, Intron II, Exon 3, stop codon, poly (A) addition site (AATAAA signal), flanking region]
Arguments in favour of a phylogenetic
analysis of the corresponding protein rather
than the DNA (4)
MULTIGENE FAMILIES :
Organisms may contain many highly similar genes,
while only one peptide sequence can be identified
(e.g. histones, tubulins and GAPDH in humans).
Using these DNA sequences, it would be difficult to
decide which are expressed and which not and thus
which genes to include in the analysis.
Moreover, if all the genes that are expressed encode
the same protein, then DNA differences are not
significant
Arguments in favour of a phylogenetic
analysis of the corresponding protein rather
than the DNA (5)
PROTEIN IS THE UNIT OF SELECTION :
For protein-encoding genes, the object on which
natural selection acts is the protein itself.
The underlying DNA sequence reflects this process in
combination with species-specific pressures on DNA
sequence (like the need for aerophiles to have DNA
that is GC richer).
If function demands that a protein maintains a specific
sequence, there still is room for the DNA sequence to
change.
Arguments in favour of a phylogenetic
analysis of the corresponding protein rather
than the DNA (6)
RNA EDITING :
The DNA sequence doesn't always translate directly into the amino acid sequence.
In post-transcriptional RNA editing, non-encoded nucleotides are inserted into the transcript or encoded nucleotides are removed from it.
This could lead to major differences in DNA sequence (sometimes more than 50%) that nevertheless lead to roughly the same protein sequence after final editing
Pan-editing of mitochondrial
RNA in Kinetoplastida
UCCuAuuA*AuUUUUUGuUA**UAu
AGuuuuuuAA*UGUUGuuuGGuGuA
*uuuuuuuAuUG*UGuuuAGuuuuG
uuuuGuuGuuGuuuGuuuG****GU
GuGuuAuuG**UUUUGAGAuuGuuG
Note that the mature mRNA would not be able to hybridise with the gene present in the kinetoplast DNA and thus cannot be detected as such.
Some good advice (1)
It is recommended to prepare the phylogenetic trees
both ways (DNA and Protein) and see how they look
For a group of species that are relatively close in time
and closely related (like viral proteins or vertebrate
enzymes), DNA-based analysis is probably a good
way to go, since you avoid problems of codon bias
and randomization of wobble bases. But check the
protein anyway
Some good advice (2)
Be aware of the problems of multigene families
(for instance coding for isoenzymes)
Be careful when you decide to exclude or include
such sequences (you may compare paralogous
rather than orthologous sequences)
Text available from:
[email protected]
Text and slides:
http://www.icp.be/~opperd/chapter8/
Website:
http://www.icp.be/~opperd/private/proteins.html
Alignment of two protein
sequences (1)
For the creation of a phylogenetic tree a good
alignment of protein sequences is of vital importance
Only homologous residues should be aligned with
each other
Doubtful regions should not be included in the
alignment
Aligned sequences should have similar lengths
Dot-Matrix plots
[Figure: dot-matrix plot of two homologous sequences with 81% identity]
[Figure: dot-matrix plot of two homologous sequences with 50% identity]
Pair-wise alignment of two protein sequences according to the ‘Dot-Matrix’ method
[Figure: four dot-matrix panels (A-D) in which the sequence CDEGLDPGSERK is plotted against CDEGLDPGSERK, CDEPLDPGSQRK, CDELDPGSQRK and CDEDGLSQLK; dots mark identical residues]
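A dot matrix of this kind can be produced in a few lines. The sketch below places a mark wherever the residue of one sequence equals the residue of the other; it uses the short example sequences from the slide, while the function names and the use of `*` as the dot symbol are choices made for this example. No window filtering is applied, so it is only suitable for short sequences.

```python
def dot_matrix(seq_a, seq_b):
    """Boolean matrix: entry [i][j] is True when seq_a[i] == seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(seq_a, seq_b, dot="*"):
    """Text rendering with seq_b along the top and seq_a down the side."""
    lines = ["  " + " ".join(seq_b)]
    for residue, row in zip(seq_a, dot_matrix(seq_a, seq_b)):
        cells = " ".join(dot if hit else " " for hit in row)
        lines.append(residue + " " + cells)
    return "\n".join(lines)


# Two substitutions: the main diagonal shows two missing dots.
print(render("CDEGLDPGSERK", "CDEPLDPGSQRK"))
print()
# One deletion: the diagonal shifts by one position after the gap.
print(render("CDEGLDPGSERK", "CDELDPGSQRK"))
```

Substitutions appear as holes in the diagonal, whereas an insertion or deletion makes the diagonal jump to a neighbouring one.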
Alignment of two protein
sequences (2)
Alignment requires the user to make assumptions
regarding relative costs of substitution versus insertions
and deletions (indels).
If substitution cost >> gap penalty: there will be many
short gaps and no phylogenetic information.
In general: search for maximum identity and minimize
the number of insertions and deletions.
Exclude regions that cannot be aligned unambiguously!
Visual alignment is possible using the "dot-matrix
method"
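The trade-off between substitution costs and gap penalties can be explored with a small global alignment routine. The following is a sketch of the classical dynamic-programming approach (Needleman-Wunsch) with a simple match/mismatch score and a linear gap penalty; the scoring values are arbitrary assumptions for illustration, and alignment programs in practice use substitution matrices and affine gap penalties instead.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


# Costly gaps: one gap and one substitution.
print(*global_align("CDEGLDPGSERK", "CDELDPGSQRK"), sep="\n")
# Free gaps: the substitution is replaced by two extra gaps.
print(*global_align("CDEGLDPGSERK", "CDELDPGSQRK", gap=0), sep="\n")
```

With gaps that cost nothing, the program avoids every mismatch by opening additional short gaps, which is the situation the slide warns against.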
Identity matrix as used in Clustal

C  10,
S   0, 10,
T   0,  0, 10,
P   0,  0,  0, 10,
A   0,  0,  0,  0, 10,
G   0,  0,  0,  0,  0, 10,
N   0,  0,  0,  0,  0,  0, 10,
D   0,  0,  0,  0,  0,  0,  0, 10,
E   0,  0,  0,  0,  0,  0,  0,  0, 10,
Q   0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
H   0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
R   0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
K   0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
M   0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
I   0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
L   0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
V   0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
F   0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
Y   0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
W   0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
Distance matrix with mutation costs for amino acids

      A S G L K V T P E D N I Q R F Y C H M W Z B X
Ala A 0 1 1 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
Ser S 1 0 1 1 2 2 1 1 2 2 1 1 2 1 1 1 1 2 2 1 2 2 2
Gly G 1 1 0 2 2 1 2 2 1 1 2 2 2 1 2 2 1 2 2 1 2 2 2
Leu L 2 1 2 0 2 1 2 1 2 2 2 1 1 1 1 2 2 1 1 1 2 2 2
Lys K 2 2 2 2 0 2 1 2 1 2 1 1 1 1 2 2 2 2 1 2 1 2 2
Val V 1 2 1 1 2 0 2 2 1 1 2 1 2 2 1 2 2 2 1 2 2 2 2
Thr T 1 1 2 2 1 2 0 1 2 2 1 1 2 1 2 2 2 2 1 2 2 2 2
Pro P 1 1 2 1 2 2 1 0 2 2 2 2 1 1 2 2 2 1 2 2 2 2 2
Glu E 1 2 1 2 1 1 2 2 0 1 2 2 1 2 2 2 2 2 2 2 1 2 2
Asp D 1 2 1 2 2 1 2 2 1 0 1 2 2 2 2 1 2 1 2 2 2 1 2
Asn N 2 1 2 2 1 2 1 2 2 1 0 1 2 2 2 1 2 1 2 2 2 1 2
Ile I 2 1 2 1 1 1 1 2 2 2 1 0 2 1 1 2 2 2 1 2 2 2 2
Gln Q 2 2 2 1 1 2 2 1 1 2 2 2 0 1 2 2 2 1 2 2 1 2 2
Arg R 2 1 1 1 1 2 1 1 2 2 2 1 1 0 2 2 1 1 1 1 2 2 2
Phe F 2 1 2 1 2 1 2 2 2 2 2 1 2 2 0 1 1 2 2 2 2 2 2
Tyr Y 2 1 2 2 2 2 2 2 2 1 1 2 2 2 1 0 1 1 3 2 2 1 2
Cys C 2 1 1 2 2 2 2 2 2 2 2 2 2 1 1 1 0 2 2 1 2 2 2
His H 2 2 2 1 2 2 2 1 2 1 1 2 1 1 2 1 2 0 2 2 2 1 2
Met M 2 2 2 1 1 1 1 2 2 2 2 1 2 1 2 3 2 2 0 2 2 2 2
Trp W 2 1 1 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 0 2 2 2
Glx Z 2 2 2 2 1 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 1 2 2
Asx B 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 1 2 1 2 2 2 1 2
??? X 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

The distance table is generated by calculating the minimum number of base mutations required to convert an amino acid in row i to an amino acid in column j. Note that, in this table, Met->Tyr is the only change scored as requiring all 3 codon positions to change.
Hydrophobicity matrix

        R  K  D  E  B  Z  S  N  Q  G  X  T  H  A  C  M  P  V  L  I  Y  F  W
Arg R  10 10  9  9  8  8  6  6  6  5  5  5  5  5  4  3  3  3  3  3  2  1  0
Lys K  10 10  9  9  8  8  6  6  6  5  5  5  5  5  4  3  3  3  3  3  2  1  0
Asp D   9  9 10 10  8  8  7  6  6  6  5  5  5  5  5  4  4  4  3  3  3  2  1
Glu E   9  9 10 10  8  8  7  6  6  6  5  5  5  5  5  4  4  4  3  3  3  2  1
Asx B   8  8  8  8 10 10  8  8  8  8  7  7  7  7  6  6  6  5  5  5  4  4  3
Glx Z   8  8  8  8 10 10  8  8  8  8  7  7  7  7  6  6  6  5  5  5  4  4  3
Ser S   6  6  7  7  8  8 10 10 10 10  9  9  9  9  8  8  7  7  7  7  6  6  4
Asn N   6  6  6  6  8  8 10 10 10 10  9  9  9  9  8  8  8  7  7  7  6  6  4
Gln Q   6  6  6  6  8  8 10 10 10 10  9  9  9  9  8  8  8  7  7  7  6  6  4
Gly G   5  5  6  6  8  8 10 10 10 10  9  9  9  9  8  8  8  8  7  7  6  6  5
??? X   5  5  5  5  7  7  9  9  9  9 10 10 10 10  9  9  8  8  8  8  7  7  5
Thr T   5  5  5  5  7  7  9  9  9  9 10 10 10 10  9  9  8  8  8  8  7  7  5
His H   5  5  5  5  7  7  9  9  9  9 10 10 10 10  9  9  9  8  8  8  7  7  5
Ala A   5  5  5  5  7  7  9  9  9  9 10 10 10 10  9  9  9  8  8  8  7  7  5
Cys C   4  4  5  5  6  6  8  8  8  8  9  9  9  9 10 10  9  9  9  9  8  8  5
Met M   3  3  4  4  6  6  8  8  8  8  9  9  9  9 10 10 10 10  9  9  8  8  7
Pro P   3  3  4  4  6  6  7  8  8  8  8  8  9  9  9 10 10 10  9  9  9  8  7
Val V   3  3  4  4  5  5  7  7  7  8  8  8  8  8  9 10 10 10 10 10  9  8  7
Leu L   3  3  3  3  5  5  7  7  7  7  8  8  8  8  9  9  9 10 10 10  9  9  8
Ile I   3  3  3  3  5  5  7  7  7  7  8  8  8  8  9  9  9 10 10 10  9  9  8
Tyr Y   2  2  3  3  4  4  6  6  6  6  7  7  7  7  8  8  9  9  9  9 10 10  8
Phe F   1  1  2  2  4  4  6  6  6  6  7  7  7  7  8  8  8  8  9  9 10 10  9
Trp W   0  0  1  1  3  3  4  4  4  5  5  5  5  5  6  7  7  7  8  8  8  9 10

Hydrophobicity scoring matrix constructed from hydrophilicity data (M. Levitt, J. Mol. Biol. 104, 59 [1976]), derived by George et al. 1990, Mutation Data Matrix and Its Uses, Methods in Enzymology 183, 333.
PAM 1 mutation matrix
1 PAM evolutionary distance

       Ala  Arg  Asn  Asp  Cys  Gln  Glu  Gly  His  Ile  Leu  Lys  Met  Phe  Pro  Ser  Thr  Trp  Tyr  Val
         A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
Ala A 9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
Arg R    1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
Asn N    4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
Asp D    6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
Cys C    1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Gln Q    3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
Glu E   10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
Gly G   21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
His H    1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
Ile I    2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
Leu L    3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4    2   15
Lys K    2   37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0    1    1
Met M    1    1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0    0    4
Phe F    1    1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3   28    0
Pro P   13    5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0    0    2
Ser S   28   11   34    7   11    4    6   16    2    2    1    7    4    3   17 9840   38    5    2    2
Thr T   22    2   13    4    1    3    2    2    1   11    2    8    6    1    5   32 9871    0    2    9
Trp W    0    2    0    0    0    0    0    0    0    0    0    0    0    1    0    1    0 9976    1    0
Tyr Y    1    0    3    0    3    0    1    0    4    1    1    0    0   21    0    1    1    2 9945    1
Val V   13    2    1    1    3    2    2    3    3   57   11    1   17    1    3    2   10    0    2 9901

[top row shows original amino acid; left column shows replacement amino acid]
Mutation probability matrix for the evolutionary distance of 1 PAM (i.e., one Accepted Point Mutation per 100 amino acids). An element of this matrix, [Mij], gives the probability that the amino acid in column j will be replaced by the amino acid in row i after a given evolutionary interval, in this case 1 PAM. Thus, there is a 0.56% probability that Asp will be replaced by Glu. To simplify the appearance, the elements are shown multiplied by 10,000. (Adapted from Figure 82. Atlas of Protein Sequence and Structure, Suppl 3, 1978, M.O. Dayhoff, ed. National Biomedical Research Foundation, 1979.)
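In the Dayhoff model, mutation probability matrices for larger distances are obtained by multiplying the 1 PAM matrix by itself (the 250 PAM probabilities are the 250th power of the 1 PAM matrix). The sketch below illustrates the operation on a hypothetical three-letter alphabet so that it stays short; the matrix `TOY_PAM1` is invented for illustration and is not part of the Dayhoff data. As in the table above, columns are the original residue, rows the replacement, and each column sums to 1.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, power):
    """Raise a square matrix to a non-negative integer power by squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical 1-step matrix for a 3-letter alphabet (about 1% change).
TOY_PAM1 = [
    [0.990, 0.006, 0.002],
    [0.007, 0.990, 0.004],
    [0.003, 0.004, 0.994],
]

toy_pam250 = mat_power(TOY_PAM1, 250)
for row in toy_pam250:
    print("  ".join(f"{p:.3f}" for p in row))
```

After 250 steps the diagonal elements are far below 0.99, while every column still sums to 1: the residue at a site has usually been replaced, sometimes several times.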
PAM 100 matrix as used in Clustal

C  14,
S  -1,   6,
T  -5,   2,   7,
P  -6,   1,  -1,  10,
A  -5,   2,   2,   1,   6,
G  -8,   1,  -3,  -3,   1,   8,
N  -8,   2,   0,  -3,  -1,  -1,   7,
D -11,  -1,  -2,  -4,  -1,  -1,   4,   8,
E -11,  -2,  -3,  -3,   0,  -2,   1,   5,   8,
Q -11,  -3,  -3,  -1,  -2,  -5,  -1,   1,   4,   9,
H  -6,  -4,  -5,  -2,  -5,  -7,   2,  -1,  -2,   4,  11,
R  -6,  -1,  -4,  -2,  -5,  -8,  -3,  -6,  -5,   1,   1,  10,
K -11,  -2,  -1,  -4,  -4,  -5,   1,  -2,  -2,  -1,  -3,   3,   8,
M -11,  -4,  -2,  -6,  -3,  -8,  -5,  -8,  -6,  -2,  -7,  -2,   1,  13,
I  -5,  -4,  -1,  -6,  -3,  -7,  -4,  -6,  -5,  -5,  -7,  -4,  -4,   2,   9,
L -12,  -7,  -5,  -5,  -5,  -8,  -6,  -9,  -7,  -3,  -5,  -7,  -6,   4,   2,   9,
V  -4,  -4,  -1,  -4,   0,  -4,  -5,  -6,  -5,  -5,  -6,  -6,  -6,   1,   5,   1,   8,
F -10,  -5,  -6,  -9,  -7,  -8,  -6, -11, -11, -10,  -4,  -7, -11,  -2,   0,   0,  -5,  12,
Y  -2,  -6,  -6, -11,  -6, -11,  -3,  -9,  -7,  -9,  -1, -10, -10,  -8,  -4,  -5,  -6,   6,  13,
W -13,  -4, -10, -11, -11, -13,  -8, -13, -14, -11,  -7,   1,  -9, -11, -12,  -7, -14,  -2,  -2,  19,
    C    S    T    P    A    G    N    D    E    Q    H    R    K    M    I    L    V    F    Y    W
PAM 250 matrix as used in Clustal

C  12,
S   0,   2,
T  -2,   1,   3,
P  -3,   1,   0,   6,
A  -2,   1,   1,   1,   2,
G  -3,   1,   0,  -1,   1,   5,
N  -4,   1,   0,  -1,   0,   0,   2,
D  -5,   0,   0,  -1,   0,   1,   2,   4,
E  -5,   0,   0,  -1,   0,   0,   1,   3,   4,
Q  -5,  -1,  -1,   0,   0,  -1,   1,   2,   2,   4,
H  -3,  -1,  -1,   0,  -1,  -2,   2,   1,   1,   3,   6,
R  -4,   0,  -1,   0,  -2,  -3,   0,  -1,  -1,   1,   2,   6,
K  -5,   0,   0,  -1,  -1,  -2,   1,   0,   0,   1,   0,   3,   5,
M  -5,  -2,  -1,  -2,  -1,  -3,  -2,  -3,  -2,  -1,  -2,   0,   0,   6,
I  -2,  -1,   0,  -2,  -1,  -3,  -2,  -2,  -2,  -2,  -2,  -2,  -2,   2,   5,
L  -6,  -3,  -2,  -3,  -2,  -4,  -3,  -4,  -3,  -2,  -2,  -3,  -3,   4,   2,   6,
V  -2,  -1,   0,  -1,   0,  -1,  -2,  -2,  -2,  -2,  -2,  -2,  -2,   2,   4,   2,   4,
F  -4,  -3,  -3,  -5,  -4,  -5,  -4,  -6,  -5,  -5,  -2,  -4,  -5,   0,   1,   2,  -1,   9,
Y   0,  -3,  -3,  -5,  -3,  -5,  -2,  -4,  -4,  -4,   0,  -4,  -4,  -2,  -1,  -1,  -2,   7,  10,
W  -8,  -2,  -5,  -6,  -6,  -7,  -4,  -7,  -7,  -5,  -3,   2,  -3,  -4,  -5,  -2,  -6,   0,   0,  17,
    C    S    T    P    A    G    N    D    E    Q    H    R    K    M    I    L    V    F    Y    W
Matrices often used for the
alignment of proteins
PAM 250 (Dayhoff et al., 1978)
BLOSUM62 (Henikoff-Henikoff, 1992)
JTT (Jones et al., 1992)
mtREV24 (Adachi-Hasegawa, 1996)
GONNET matrix (Gonnet et al., 1992)
Multiple alignment of protein
sequences
For the construction of reliable phylogenetic trees the quality of a
multiple alignment is of the utmost importance
There are many programs available for the multiple alignment of
proteins.
– A good program in the public domain is: ClustalW
– A similar program is Pileup of the GCG package
They quickly align sequence pairs and roughly determine the degrees
of identity between each pair
Then the sequences are aligned more precisely in a progressive way
starting with the two closest sequences
Most programs work best when the sequences have similar length.
Some rules of thumb for the
manual alignment of proteins (1)
An automatically produced multiple alignment often
needs manual adjustment to improve the quality of
the alignment.
Such improvement can be obtained by using all the
knowledge that is available about a protein.
If a structure is available you should use the detailed
information about secondary structure for the
alignment.
Some rules of thumb for the
manual alignment of proteins (2)
The rules for mutation of amino acids are dependent
on their physicochemical properties.
Surface residues (DRENK) are preferably mutated to
residues of similar properties. Since they are not, or
less, involved in protein folding they mutate rather
easily.
Hydrophobic residues (FAMILYVW) are preferentially
replaced by other hydrophobic ones. These residues
are mainly internal and determine the folding of the
protein. They thus mutate rather slowly.
Some rules of thumb for the
manual alignment of proteins (3)
The residues CHQST are indifferent and may be
replaced with any other type of residue
The residues (DRENKCHQST), when conserved
throughout the alignment are very likely residues that
are involved in the active site. So the multiple
alignment should be adjusted accordingly
Periodicity of charged residues may provide information as to the presence of elements of secondary structure such as α-helices and β-strands
[Figure: α-helix]
[Figure: β-strand]
Some rules of thumb for the
manual alignment of proteins (4)
Indels (insertions/deletions) are rarely found in elements of secondary structure; they occur mostly in loops.
Pro and Gly interfere with secondary structure
elements and thus have a preference for loops
Hydrophobicity (or hydropathy) profiles according to
Kyte and Doolittle of two homologous proteins are in
general strikingly similar
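A hydropathy profile is a sliding-window average of per-residue hydropathy values. The sketch below uses the Kyte and Doolittle scale; the window length of 9 is an assumption made for this example (the appropriate window depends on the question being asked), and the function name is illustrative. The test sequence is the first 60 residues of the T. brucei triosephosphate isomerase shown earlier.

```python
# Kyte and Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Mean hydropathy of every window of `window` consecutive residues."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


tim_start = "MSKPQPIAAANWKCNGSQQSLSELIDLFNSTSINHDVQCVVASTFVHLAMTKERLSHPKF"
profile = hydropathy_profile(tim_start)
for position, value in enumerate(profile, start=1):
    bar = "#" * int(round(abs(value) * 4))
    print(f"{position:3d} {value:6.2f} {'+' if value >= 0 else '-'}{bar}")
```

Plotting the profiles of two homologous proteins one above the other makes it easy to see where hydrophobic and hydrophilic stretches coincide, and hence where an alignment may need adjusting.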
Proline interferes with α-helix and β-sheet formation
[Figure] From Deber and Therien, 2002
Possible functions for proline in trans-membrane domains
[Figure] From Deber and Therien, 2002
Alignment of malate dehydrogenase sequences
[Clustal-style alignment; the conservation marks (* : .) under each block are not reproduced]

lcl|CHR34_tmp.0150  ----MKPST--LSRFKVTVLGASGAIGQPLALALVQNKRVSEL-----ALYDIVQPR---
lcl|CHR34_tmp.0140  ----MRRSQ--GCFFRVAVLGAAGGIGQPLSLLLKNNKYVKEL-----KLYDVKGGP---
lcl|CHR34_tmp.0130  MGLLFRRSLTALKKGKVVLFGCSNAVGQPLSLLLKMNPHVEELVCCNTAADDDVPGS---
lcl|CHR28_tmp.0050  -----------MSAVKVAVTGAAGQIGYALVPLIARGALLGPTTPVELRLLDIEPALKAL

lcl|CHR34_tmp.0150  -GVAVDLSHFPRKVKVTGYPTKWIHK--ALDGADLVLMSAGMPRRPGMT-HDDLFNTNAL
lcl|CHR34_tmp.0140  -GVAADLSHICAPAKVTGYTKDELSR--AVENADVVVIPAGIPRKPGMT-RDDLFNTNAS
lcl|CHR34_tmp.0130  -GIAADLSHIDTLPKVH-YATDEGQWPALLRDAQLILVCFGSSFDLLREDRDIALKAAAP
lcl|CHR28_tmp.0050  AGVEAELEDCAFPLLDKVVVTADPRV--AFDGVAIAIMCGAFPRKAGME-RKDLLEMNAR

lcl|CHR34_tmp.0150  TVNELSAAVARYAPKSV-LAIISNPLNSMVPVAAETLQRAGVYDPRKLFGIISLNMMRAR
lcl|CHR34_tmp.0140  IVRDLAIAVGTHAPKAI-VGIITNPVNSTVPVAAEALKKVGVYDPARLFGVTTLDVVRAR
lcl|CHR34_tmp.0130  TMRRVMAAVASSDTTGN-VAVVSSPVNALTPFCAELLKASGKFDPRKLFGVTTLDVIRTR
lcl|CHR28_tmp.0050  IFKEQGEAIAAVAASDCRVVVVGNPANTNALILLKSAQ--GKLNPRHVTAMTRLDHNRAL

lcl|CHR34_tmp.0150  KMLGDFTGQDPEMLDVPVIGGHSGQTIVPLFSHS--GVELRQEQVEYLTHRVR-------
lcl|CHR34_tmp.0140  TFVAEALGASPYDVDVPVIGGHSGETIVPLLSG---FPSLSEEQVRQLTHRIQ-------
lcl|CHR34_tmp.0130  KLVAGTLHMNPYDVNVPVVGGCGGVTACPLIAQT--GLRIPLDDIVRISGEVQSYGVLFE
lcl|CHR28_tmp.0050  SLLARKAGVPVSQVRNVIIWGNHSSTQVPDTDSAVIGTTPAREAIKDDALDDD-----FV

lcl|CHR34_tmp.0150  --VGGD-EVVKAKEGRGSSSLSMAFAAAEWADGVLRAMDGEKTLLQCSFVESPLFADKCR
lcl|CHR34_tmp.0140  --FGGD-EVVKAKDGAGSATLSMAFAGNEWTTAVLRALSGEKGVVVCTYVQS-TVEPSCA
lcl|CHR34_tmp.0130  AAVGADSHDALSTEVAPPVALGLAYAACDFSTSLLKALRGDVGIVECALVES-TMRSETP
lcl|CHR28_tmp.0050  QVVRGRGAEIIQLRGLSSAMSAAKAAVDHVHDWIHGTPEGVYVSMGVYSDENPYGVPSGL

lcl|CHR34_tmp.0150  FFGSTVEVCKEGIERVLPLPPLNEYEEEQLDRCLPDLEKN-IRKGLAFVAENAATSTPST
lcl|CHR34_tmp.0140  FFSSPVLLGNSGVEKIYPVPMLNAYEEKLMAKCLEGLQSN-ITKGIAFSNK---------
lcl|CHR34_tmp.0130  FFSSRVELGREGVQRVFPMGALTSYEHELIETAVPELMRD-VQAGIEAATQF--------
lcl|CHR28_tmp.0050  IFSFP-CTCHAGEWTVVSGKLNGDLGKQRLASTIAELQEERAQAGL--------------
Hydrophobicity profiles
Profiles according to Kyte and Doolittle of homologous
proteins are in general strikingly similar and may provide
a tool in the alignment of two or more proteins.
The two phosphoglycerate kinase sequences below
share 50% identical residues.
Trypanosoma congolense PGK
Euglena gracilis PGK
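A hydropathy profile of the kind described above can be sketched as a sliding-window average over the Kyte and Doolittle scale. The function name `hydropathy_profile` and the default window of 9 residues are assumptions made for this illustration, not something taken from the slides.

```python
# Kyte-Doolittle hydropathy values per amino acid (one-letter code)
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=9):
    """Mean hydropathy of each window of `window` residues along seq.

    The window size is an assumed default; returns one value per window start.
    """
    scores = [KD[aa] for aa in seq.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

# Hypothetical peptide, used only to show the call
print(hydropathy_profile("MKVLAAGIVGLLLAQ", window=9))
```

Plotting the profiles of two homologous proteins on the same axis then shows where peaks and troughs coincide, which can guide the placement of gaps.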
Tree construction methods (1)
Distance matrix methods
– Cluster analysis (UPGMA, WPGMA, etc)
– Fitch & Margoliash (1967)
– Transformed distance methods (eg. Li, 1981)
– Neighbor-joining (Saitou & Nei, 1987)
– ...many more
Parsimony methods
– Maximum parsimony
Other methods
– Maximum likelihood (Felsenstein, 1981)
– ... many more
Tree construction methods (2)
Character-based methods:
– maximum parsimony
– maximum likelihood
Non-character-based methods:
– distance matrix methods
Phylogeny (2)
Distance Matrix methods (in the public domain)
– Least squares method (Fitch and Margoliash)
  – Fitch, Kitsch of the Phylip package (Joe Felsenstein, Univ. Washington)
– Neighbor-joining method
  – Neighbor of the Phylip package (Joe Felsenstein, Univ. Washington)
  – Clustal, or Distnj in Protml package (Adachi and Hasegawa, Univ. Tokyo)
  – Darwin (Gaston Gonnet, ETH, Zurich, via mailserver or WWW)
Protein Maximum likelihood (in the public domain)
– Protml (Adachi and Hasegawa, Univ. Tokyo) (very cpu intensive)
– TreePuzzle (Strimmer and von Haeseler, 1997)
Protein maximum parsimony (in the public domain)
– Protpars (Joe Felsenstein, Univ. Washington)
– Paup (David Swofford, latest version will be commercial)
Some useful information
about phylogenetic trees
[Figure: rooted tree; OTUs A-E are the external nodes, F-I are the internal nodes, joined at the root]
A-E are external nodes (extant)
F-I are internal (ancestral) nodes
OTUs are operational taxonomic
units
They can be: species
They are the extant (existing) or extinct
(ancestral) OTUs
Topology: order of the nodes on the
tree
Distance Matrix Methods
UPGMA (Unweighted Pair Group Method with Arithmetic Mean) uses real
(uncorrected) distance values and a sequential clustering
algorithm. (Should only be used with closely related OTUs, or
when there is constancy of evolutionary rate.)
Transformed distance methods. Corrections may be introduced
to obtain trees with true evolutionary distances (PAM values,
Kimura), or corrections are carried out with reference to an
outgroup (Farris, 1971; Klotz et al., 1979). Should be used when
evolutionarily distant organisms are included in the dataset.
Neighbors relation methods
– FITCH (Fitch, 1981)
– Neighbor-Joining method (Saitou and Nei, 1987)
Should all be used with corrected (see above) distance
matrices.
Distance matrix
Uncorrected for Multiple Substitutions
      1       2       3       4       5
   0.00    0.63    0.63   22.88   18.50    AC007866_13   1
           0.00    0.63   22.57   18.50    AC007866_17   2
                   0.00   22.88   17.87    AC007866_15   3
                           0.00    5.64    AC007866_9    4
                                   0.00    AC007866_11   5
Using the Kimura correction method
Gap weighting is 0.000000
      1       2       3       4       5
   0.00    0.63    0.63   27.35   21.29    AC007866_13   1
           0.00    0.63   26.90   21.29    AC007866_17   2
                   0.00   27.35   20.47    AC007866_15   3
                           0.00    5.88    AC007866_9    4
                                   0.00    AC007866_11   5
Distance matrix as produced by the EMBOSS program distmat
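The effect of a correction for multiple substitutions can be illustrated with Kimura's approximation for protein distances, d = -ln(1 - p - 0.2 p^2), where p is the observed fraction of differing sites. That the corrected matrix above was computed with exactly this formula is an assumption; the function name is chosen for this sketch only.

```python
import math

def kimura_protein_distance(p):
    """Kimura's approximate correction for multiple substitutions in proteins.

    p is the observed fraction of differing sites (0 <= p < about 0.85).
    """
    arg = 1.0 - p - 0.2 * p * p
    if arg <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -math.log(arg)

# Uncorrected percentages taken from the matrix above
for pct in (0.63, 5.64, 18.50, 22.88):
    corrected = 100.0 * kimura_protein_distance(pct / 100.0)
    print(f"{pct:6.2f} -> {corrected:6.2f}")
```

Small distances are hardly changed, whereas large distances grow noticeably, which is why the correction matters most when distantly related sequences are compared.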
UPGMA
UPGMA (Unweighted Pair Group Method with Arithmetic Mean) uses real
(uncorrected) distance values and a sequential clustering algorithm.
(Should only be used with closely related OTUs, or when there is
constancy of evolutionary rate.)
Tree construction (UPGMA)
First cycle
      A   B   C   D   E
B     2
C     4   4
D     6   6   6
E     6   6   6   4
F     8   8   8   8   8
Cluster the pair of OTUs with the smallest distance, here A and
B. The branching point is positioned at a distance of 2 / 2 = 1 substitution.
Tree construction (UPGMA)
Following the first clustering A and B are considered as a single
composite OTU(A,B) and we now calculate the new distance matrix
as follows:
dist(A,B),C = (distAC + distBC) / 2 = 4
dist(A,B),D = (distAD + distBD) / 2 = 6
dist(A,B),E = (distAE + distBE) / 2 = 6
dist(A,B),F = (distAF + distBF) / 2 = 8
In other words, the distance between a simple OTU and a composite
OTU is the average of the distances between the simple OTU and
the constituent simple OTUs of the composite OTU. A new
distance matrix is then built from these newly calculated distances
and the whole cycle is repeated:
Tree construction (UPGMA)
Second cycle
      A,B   C    D    E
C      4
D      6    6
E      6    6    4
F      8    8    8    8
Tree construction (UPGMA)
Third cycle
      A,B   C    D,E
C      4
D,E    6    6
F      8    8    8
Tree construction (UPGMA)
Fourth cycle
      AB,C  D,E
D,E    6
F      8     8
Tree construction (UPGMA)
Fifth cycle
      ABC,DE
F      8
The final step consists of clustering the last OTU,
F, with the composite OTU.
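The clustering cycles above can be sketched in a few lines. The function `upgma` and its data layout (a dictionary keyed by pairs of OTU names) are assumptions made for this illustration. Ties are broken by taking the first pair encountered, so the order of the two merges at distance 4 may differ from the cycles shown, while the resulting tree is the same.

```python
from itertools import combinations

def upgma(names, dist):
    """Sequential clustering; dist maps frozenset({a, b}) -> distance.

    Returns a list of (cluster_1, cluster_2, branching_height) merges.
    """
    clusters = [(name,) for name in names]   # each cluster is a tuple of OTUs

    def avg(c1, c2):
        # mean distance over all pairs of simple OTUs from the two clusters
        return sum(dist[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        merges.append((c1, c2, avg(c1, c2) / 2))   # branch point at half the distance
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Distance matrix of the first cycle (lower triangle, row by row)
names = "ABCDEF"
rows = {"B": [2], "C": [4, 4], "D": [6, 6, 6], "E": [6, 6, 6, 4], "F": [8, 8, 8, 8, 8]}
dist = {frozenset((r, names[i])): d for r, vals in rows.items() for i, d in enumerate(vals)}

for c1, c2, height in upgma(names, dist):
    print("".join(c1), "+", "".join(c2), "at height", height)
```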
Pitfalls of UPGMA
The UPGMA clustering method is very
sensitive to unequal evolutionary rates.
Clustering works only if the data are
ultrametric.
Ultrametric distances are defined by the
satisfaction of the 'three-point condition'.
The three-point condition
For any three taxa: dist AC <= max (distAB, distBC) or,
in words: the two greatest distances are equal.
UPGMA thus assumes that the evolutionary rate is the same for
all branches.
If the assumption of rate constancy among lineages does
not hold, UPGMA may give an erroneous topology.
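Whether a distance matrix satisfies the three-point condition can be checked directly before UPGMA is applied. The sketch below is a minimal illustration; the function name and the dictionary layout are assumptions, and the two tiny matrices are made-up examples.

```python
from itertools import combinations

def is_ultrametric(names, dist, tol=1e-9):
    """True if, for every triple of taxa, the two largest distances are equal.

    dist maps frozenset({a, b}) -> distance between taxa a and b.
    """
    for a, b, c in combinations(names, 3):
        d_small, d_mid, d_large = sorted(
            (dist[frozenset((a, b))], dist[frozenset((a, c))], dist[frozenset((b, c))])
        )
        if abs(d_large - d_mid) > tol:
            return False
    return True

# Hypothetical three-taxon examples
clock_like = {frozenset("AB"): 2, frozenset("AC"): 4, frozenset("BC"): 4}
unequal_rates = {frozenset("AB"): 2, frozenset("AC"): 4, frozenset("BC"): 5}
print(is_ultrametric("ABC", clock_like), is_ultrametric("ABC", unequal_rates))
```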
Non-ultrametric tree
Unequal rates of mutation lead to wrong trees
UPGMA tree construction based on the data of the left tree would
result in the erroneous tree at the right
UPGMA (conclusion)
UPGMA uses real (uncorrected) distance values and
a sequential clustering algorithm.
This method of tree construction is very sensitive to
differences in branch length or unequal rates of
evolution.
It should only be used with closely related OTUs, or
when there is constancy of evolutionary rate.
The method is often used in combination with
isoenzyme or restriction site data or with
morphological criteria
Maximum Parsimony Methods
Use sequence information rather than distance
information
Evaluate all possible trees and select the tree that
requires the minimum number of substitutions, summed over
the informative sites
Maximum Parsimony analysis (2)
Parsimony implies that simpler hypotheses are preferable to
more complicated ones.
Maximum parsimony is a character-based method that infers a
phylogenetic tree by minimizing the total number of evolutionary
steps required to explain a given set of data, or in other words
by minimizing the total tree length.
The steps may be base or amino-acid substitutions for
sequence data, or gain and loss events for restriction site data.
Maximum Parsimony analysis (3)
Maximum parsimony, when applied to protein sequence data,
either considers each site of the sequence as a multistate
unordered character with 20 possible states (the amino acids)
(Eck and Dayhoff, 1966), or takes into account the genetic
code and the number of mutations (1, 2 or 3) that is required to
explain an observed amino-acid substitution. The latter method
is implemented in the PROTPARS program (Felsenstein, 1993).
The maximum parsimony method searches all possible tree
topologies for the optimal (minimal) tree. However, the number
of unrooted trees that have to be analysed increases rapidly with
the number of OTUs.
Maximum Parsimony analysis (4)
The number of rooted trees (N_r) for n OTUs is given by:
N_r = (2n-3)! / (2^{n-2} (n-2)!)
The number of unrooted trees (N_u) for n OTUs is given by:
N_u = (2n-5)! / (2^{n-3} (n-3)!)
Number of OTUs   unrooted trees   rooted trees
 2               1                1
 3               1                3
 4               3                15
 5               15               105
 6               105              945
 7               945              10,395
 8               10,395           135,135
 9               135,135          2,027,025
10               2,027,025        34,459,425
15               7.91E12          2.13E14
This rapid increase in the number of trees to be analysed may make it
impossible to apply the method to very large datasets. In that case the
parsimony method becomes very time consuming, even on very fast computers.
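The two formulas are easy to evaluate with exact integer arithmetic, which makes the growth in the number of trees tangible. The function names below are chosen for this sketch.

```python
from math import factorial

def n_rooted(n):
    """Number of rooted bifurcating trees for n >= 2 OTUs."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def n_unrooted(n):
    """Number of unrooted bifurcating trees for n >= 3 OTUs."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

for n in (4, 7, 10, 15):
    print(n, n_unrooted(n), n_rooted(n))
```

The number of unrooted trees for n OTUs equals the number of rooted trees for n - 1 OTUs, because a root can be placed on any branch of an unrooted tree.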
Maximum parsimony method for 4
nucleic-acid sequences
          Site
Sequence  1 2 3 4 5 6 7 8 9
1         A A G A G T G C A
2         A G C C G T G C G
3         A G A T A T C C A
4         A G A G A T C C G
For four OTUs there are three possible unrooted trees. The
trees are then analysed by searching for the ancestral
sequences and by counting the number of mutations required to
explain the respective trees:
(1) AAGAGTGCA                     AGATATCCA (3)
             \4                 2/
              \        4        /
               AGCCGTGCG --- AGAGATCCG
              /                 \
             /0                 0\
(2) AGCCGTGCG                     AGAGATCCG (4)
Tree I: 10 mutations
(1) AAGAGTGCA                     AGCCGTGCG (2)
             \1                 3/
              \        5        /
               AGGAGTGCA --- AGAGGTCCG
              /                 \
             /4                 1\
(3) AGATATCCA                     AGAGATCCG (4)
Tree II: 14 mutations
(1) AAGAGTGCA                     AGCCGTGCG (2)
             \1                 3/
              \        5        /
               AGAAGTGCA --- AGATGTCCG
              /                 \
             /5                 2\
(4) AGAGATCCG                     AGATATCCA (3)
Tree III: 16 mutations
Tree I has the topology with the least number of mutations and thus is the
most parsimonious tree.
Ancestral sequences are inferred at the internal nodes.
This analysis includes both informative and non-informative sites in the
sequence.
When only informative sites are included, far fewer sites need to be
analysed, which in the case of large datasets means a considerable gain
in CPU time.
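Counting the mutations for a given topology can be automated with Fitch's set-based method, sketched below for the four sequences of this example. Trees are written as nested tuples of sequence numbers; this representation and the function names are assumptions of the sketch. Fitch's method returns the minimum number of changes for each topology, which can be lower than a count read from one particular, hand-chosen set of ancestral sequences; the ranking of the three trees is what matters.

```python
SEQS = {1: "AAGAGTGCA", 2: "AGCCGTGCG", 3: "AGATATCCA", 4: "AGAGATCCG"}

def fitch_site(node, site, seqs):
    """Return (possible states, number of changes) below `node` for one site."""
    if not isinstance(node, tuple):                  # a tip: its observed base
        return {seqs[node][site]}, 0
    (left, n_left), (right, n_right) = (fitch_site(child, site, seqs) for child in node)
    common = left & right
    if common:                                       # children agree: no extra change
        return common, n_left + n_right
    return left | right, n_left + n_right + 1        # disagreement costs one change

def tree_length(tree, seqs):
    """Minimum number of substitutions summed over all sites."""
    n_sites = len(next(iter(seqs.values())))
    return sum(fitch_site(tree, s, seqs)[1] for s in range(n_sites))

# The three unrooted topologies, rooted on their internal branch
TREES = {"I": ((1, 2), (3, 4)), "II": ((1, 3), (2, 4)), "III": ((1, 4), (2, 3))}
for name, tree in TREES.items():
    print("Tree", name, tree_length(tree, SEQS))
```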
Informative sites
A site is informative only when there are at least two different kinds of
nucleotides at the site, each of which is represented in at least two of
the sequences under study.
          Site
Sequence  1 2 3 4 5 6 7 8 9
1         A A G A G T G C A
2         A G C C G T G C G
3         A G A T A T C C A
4         A G A G A T C C G
                  *   *   *
Informative sites are indicated by an asterisk (*)
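The definition of an informative site translates directly into a small test on each alignment column. The function name below is an assumption of this sketch; the sequences are those of the example.

```python
from collections import Counter

def informative_sites(seqs):
    """1-based positions where at least two different states each occur
    in at least two of the sequences."""
    sites = []
    for pos, column in enumerate(zip(*seqs), start=1):
        counts = Counter(column)
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(pos)
    return sites

alignment = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]
print(informative_sites(alignment))
```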
Informative sites only
Sequence  Sites 5, 7, 9
1         GGA
2         GGG
3         ACA
4         ACG
          ***
(1) GGA           ACA (3)
       \1       1/
        \   2   /
        GGG --- ACG
        /       \
       /0       0\
(2) GGG           ACG (4)
Tree I: 4 mutations
(1) GGA           GGG (2)
       \1       1/
        \   1   /
        GCA --- GCG
        /       \
       /1       1\
(3) ACA           ACG (4)
Tree II: 5 mutations
(1) GGA           GGG (2)
       \2       1/
        \   0   /
        GCG --- GCG
        /       \
       /1       2\
(4) ACG           ACA (3)
Tree III: 6 mutations
To infer a maximum parsimony tree, for each possible tree we calculate the
minimum number of substitutions at each informative site. In the above
example, for sites 5, 7, and 9, tree I requires in total 4 changes, tree II
requires 5 changes, and tree III requires 6 changes. In the final step, we
sum the number of changes over all the informative sites for each tree and
choose the tree associated with the smallest number of substitutions. In our
case, tree I is chosen because it requires the smallest number of changes (4)
at the informative sites.
How to find the best tree?
Maximum parsimony searches for the optimal (minimal) tree. In this
process more than one minimal tree may be found. To guarantee
finding the best possible tree, an exhaustive evaluation of all possible tree
topologies has to be carried out. However, this becomes impossible when
there are more than 12 OTUs in a dataset.
Branch and bound is a variation on maximum parsimony that guarantees
finding the minimal tree without having to evaluate all possible trees. This
way a larger number of taxa can be evaluated, but the method is still
limited.
Heuristic search is a method with step-wise addition and rearrangement
(branch swapping) of OTUs. Here finding the best tree is not guaranteed.
Since, in view of the size of the dataset, it is often not possible to carry out
an exhaustive or other search for the best tree, it is advised to change the
order of the taxa in the dataset and to repeat the analysis, or to instruct
the program to do this for you by providing a so-called jumble factor.
Consensus tree
Since the Maximum Parsimony method may result in more than one equally
parsimonious tree, a consensus tree should be created. For the creation of a
consensus tree see bootstrapping.
Parsimony and branch lengths
(1) G         A (3)
     \1     0/
      \  1  /
       C---A
      /     \
     /0     1\
(2) C         T (4)
(1) G         A (3)
     \0     1/
      \  1  /
       G---T
      /     \
     /1     0\
(2) C         T (4)
(1) G         A (3)
     \1     1/
      \  1  /
       C---T
      /     \
     /0     0\
(2) C         T (4)
Three possible trees (ancestral reconstructions) for 4 OTUs; all describe
the same final state by assuming a total of 3 steps.
Each final state is arrived at via a different route.
Each of the three trees is equally valid, but the number of steps along the
individual branches (or the length of each branch) is not determined.
For this reason branch lengths are not given in parsimony, but only the
total number of steps for a tree.
Some final notes on maximum parsimony
Maximum Parsimony (positive points):
– is based on shared and derived characters. It therefore is a
cladistic rather than a phenetic method
– does not reduce sequence information to a single number
– tries to provide information on the ancestral sequences
– evaluates different trees
Maximum Parsimony (negative points):
– does not assume an evolutionary model
– is slow in comparison with distance methods
– does not use all the sequence information (only informative
sites are used)
– does not correct for multiple mutations (does not imply a
model of evolution)
– does not provide information on the branch lengths
– is notorious for its sensitivity to codon bias
How to root an unrooted tree?
The majority of methods yield unrooted trees.
To root a tree one should add an outgroup to the dataset. An outgroup is
an OTU for which external information (e.g. paleontological information) is
available that indicates that the outgroup branched off before all other taxa.
Do not choose an outgroup that is very distantly related to your taxa. This
may result in serious topological errors.
Nor should you choose an outgroup that is too closely related to the taxa in
question. In this case it may not be a true outgroup.
The use of more than one outgroup generally improves the estimate of
tree topology.
In the absence of a good outgroup the root may be positioned by
assuming approximately equal evolutionary rates over all the branches. In
this way the root is put at the midpoint of the longest pathway between two
OTUs.
Maximum likelihood
It evaluates a hypothesis about evolutionary history in terms of
the probability that the proposed model and the hypothesized
history would give rise to the observed data set. A history with a
higher probability of reaching the observed state is preferred to
a history with a lower probability. The method searches for the
tree with the highest probability or likelihood.
The following programs are available from the web:
– DNAML (DNA data only. By Joe Felsenstein in the Phylip
package)
– FastDNAML (DNA data only. A faster algorithm applied by
Gary Olsen to Joe Felsenstein's DNAML program )
– ProtML (DNA and protein. By Adachi and Hasegawa, 1992)
– TreePuzzle (DNA and protein. By Strimmer and von
Haeseler, 1995). This program applies a heuristic method
and is much faster than PROTML, but does not guarantee to
find the best tree.
Advantages and disadvantages of the
maximum likelihood method
There are some supposed advantages of maximum likelihood
methods over other methods.
– It is the estimation method least affected by sampling error
– It is robust to many violations of the assumptions in the evolutionary model
– With very short sequences it tends to outperform alternative methods such
as parsimony or distance methods
– The method is statistically well founded
– It evaluates different tree topologies
– It uses all the sequence information
There are also some supposed disadvantages
– Maximum likelihood is very CPU intensive and thus extremely slow
– The result is dependent on the model of evolution used
Explication of the method
Maximum likelihood evaluates the probability that the chosen evolutionary model will have
generated the observed sequences. Phylogenies are then inferred by finding those trees that yield the
highest likelihood. Assume that we have the aligned nucleotide sequences for four taxa:
        1         j            N
(1)     A G G C U C C A A .... A
(2)     A G G U U C G A A .... A
(3)     A G C C C A G A A .... A
(4)     A U U U C G G A A .... C
and we want to evaluate the likelihood of the unrooted tree represented by the nucleotides of site j in
the sequence and shown below:
[Figure: unrooted tree connecting sequences (1), (2), (3) and (4) through two internal nodes]
What is the probability that this tree would have generated the data presented in the sequence under
the chosen model?
Likelihood for one site
The models are time-reversible, therefore the likelihood of the tree is independent of the position of
the root. Thus it is convenient to root the tree at an arbitrary internal node.
[Figure: the tree for site j (tips C, C, A and G) rooted at an arbitrary internal node]
Assume that nucleotide sites evolve independently (the Markovian
model of evolution). Then we can calculate the likelihood for each
site separately and combine these to the total likelihood.
For the likelihood for site j, we have to consider all the possible
scenarios by which the nucleotides present at the tips of the tree
could have evolved. So the likelihood for a particular site is the
summation of the probabilities of every possible reconstruction of
ancestral states, given some model of base substitution. In this
specific case all possible nucleotides A, G, C, and T may occupy
nodes (5) and (6), giving 4 x 4 = 16 possibilities:
L(j) = \sum_{x_5} \sum_{x_6} Prob(tips C, C, A, G; node (5) = x_5, node (6) = x_6), with x_5, x_6 in {A, G, C, T}
In the case of protein sequences each site may
occupy 20 states (those of the 20 amino acids) and
thus 400 possibilities have to be considered. Since
any one of these scenarios could have led to the
amino-acid configuration at the tips of the tree, we
must calculate the probability of each and
sum them to obtain the total probability for each
site j.
Likelihood for the full tree
The likelihood for the full tree is then the product of the likelihoods at each site:
L = L(1) \times L(2) \times \dots \times L(N) = \prod_{j=1}^{N} L(j)
Since the individual likelihoods are extremely small numbers it is convenient to sum the log
likelihoods at each site and report the likelihood of the entire tree as the log likelihood:
\ln L = \ln L(1) + \ln L(2) + \dots + \ln L(N) = \sum_{j=1}^{N} \ln L(j)
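The summation over ancestral states and the log-likelihood of the full tree can be sketched for a four-taxon tree. Several things here are assumptions made only for illustration: the Jukes-Cantor substitution model, a single branch length t for all five branches, and a tree in which sequences (1) and (2) join one internal node and (3) and (4) the other. Real programs use richer models and optimise the branch lengths.

```python
import math
from itertools import product

BASES = "ACGU"

def jc_prob(x, y, t):
    """Jukes-Cantor probability of ending in base y after branch length t, starting from x."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def site_likelihood(tips, t=0.1):
    """Likelihood of one site: sum over the 4 x 4 states of internal nodes (5) and (6)."""
    b1, b2, b3, b4 = tips
    total = 0.0
    for x5, x6 in product(BASES, repeat=2):
        total += (0.25 * jc_prob(x5, b1, t) * jc_prob(x5, b2, t)   # root at node (5)
                  * jc_prob(x5, x6, t)                             # internal branch
                  * jc_prob(x6, b3, t) * jc_prob(x6, b4, t))
    return total

def log_likelihood(seqs, t=0.1):
    """ln L of the tree: sum of the per-site log likelihoods."""
    return sum(math.log(site_likelihood(column, t)) for column in zip(*seqs))

print(site_likelihood("CCAG"))                      # site j of the example
print(log_likelihood(["AGGCUCCAA", "AGGUUCGAA", "AGCCCAGAA", "AUUUCGGAA"]))
```

Repeating the calculation for the other two topologies, by permuting the order of the four sequences, gives the log likelihoods that would be compared to choose the tree.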
The model of evolution
The PROTML program in the MOLPHY package (Adachi and Hasegawa,
1992), as well as the TreePUZZLE program by Strimmer and von Haeseler
(1995), have implemented an instantaneous rate matrix derived from the
Dayhoff empirical substitution matrix. This has been called the Dayhoff
model.
Recently a model called the JTT model of evolution, based upon the
updated empirical substitution matrix of Jones et al. (1992), has been
developed and implemented in these programs.
The maximum likelihood tree
The above procedure is then repeated for all
possible topologies (or for all possible trees).
The tree with the highest likelihood is the
maximum likelihood tree.
Bootstrapping
Bootstrapping is a way of testing the reliability of the dataset. It is the creation of
pseudoreplicate datasets by resampling. Bootstrapping allows you to assess
whether the distribution of characters has been influenced by stochastic effects.
In phylogenetic analyses nonparametric bootstrapping is the most commonly
used method. The pseudoreplicate datasets are generated by randomly
sampling the original character matrix to create new matrices of the same size
as the original. The frequency with which a given branch is found is recorded as
the bootstrap proportion. These proportions can be used as a measure of the
reliability (within limitations) of individual branches in the optimal tree.
Thus bootstrap analysis:
– is a statistical method for obtaining an estimate of error
– is used to evaluate the reliability of a tree
– is used to examine how often a particular cluster in a tree appears when nucleotides or
amino acids are resampled
NB: If the entire dataset is compatible and has not been biased by stochastic
effects, all bootstrap trees should in principle have the same topology!
The practice of bootstrapping and the
construction of a consensus tree
Take a dataset consisting of n sequences with m sites each (see below). A number of resampled datasets of
the same size (n x m) as the original dataset is produced. Each site is sampled at random with replacement, and exactly
as many sites are drawn as there were original sites, so some sites are drawn several times and others not at all. To be
statistically meaningful the number of resampled datasets should be high, equal to or higher than the number of
individual sites present in the dataset.
Our example dataset consists of 4 sequences with 10 sites each (see below). When three new datasets are
prepared by random sampling of sites, the following three sample sets of data can be obtained:
Sample 1
      0 1 2 0 3 0 1 2 0 1    (<- number of times each site is sampled)
      ___________________
A     A G G C U C C A A A
B     A G G U U C G A A A
C     A G C C C C G A A A
D     A U U U C C G A A C
Resampled dataset:
A     G G G U U U C A A A
B     G G G U U U G A A A
C     G C C C C C G A A A
D     U U U C C C G A A C
Distance matrix:
      A  B  C
B     1
C     6  5
D     8  7  4
Sample 2
      1 0 0 0 2 2 2 0 0 3
      ___________________
A     A G G C U C C A A A
B     A G G U U C G A A A
C     A G C C C C G A A A
D     A U U U C C G A A C
Resampled dataset:
A     A U U C C C C A A A
B     A U U C C G G A A A
C     A C C C C G G A A A
D     A C C C C G G C C C
Distance matrix:
      A  B  C
B     2
C     4  2
D     7  5  3
Sample 3
Distance matrix:
      A  B  C
B     1
C     3  2
D     6  3  4
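Drawing a pseudoreplicate amounts to choosing, for every site, how many times it is copied into the new dataset. The sketch below separates that random draw from the construction of the resampled sequences, so that the counts of Sample 1 can be replayed exactly. All function names are assumptions of this illustration.

```python
import random

DATA = ["AGGCUCCAAA", "AGGUUCGAAA", "AGCCCCGAAA", "AUUUCCGAAC"]   # sequences A-D

def bootstrap_counts(n_sites, rng):
    """Draw n_sites sites with replacement; return how often each site was drawn."""
    counts = [0] * n_sites
    for _ in range(n_sites):
        counts[rng.randrange(n_sites)] += 1
    return counts

def apply_counts(seqs, counts):
    """Build the pseudoreplicate in which site i is repeated counts[i] times."""
    return ["".join(ch * k for ch, k in zip(s, counts)) for s in seqs]

def differences(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(1)            # fixed seed only to make the sketch repeatable
counts = bootstrap_counts(len(DATA[0]), rng)
replicate = apply_counts(DATA, counts)
print(counts, replicate)

# Replaying the counts of Sample 1 reproduces its resampled dataset
sample1 = apply_counts(DATA, [0, 1, 2, 0, 3, 0, 1, 2, 0, 1])
print(sample1, differences(sample1[0], sample1[1]))
```

In a full analysis a tree would be built from each pseudoreplicate and the frequency of each cluster recorded as its bootstrap proportion.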
Consensus tree
A large number of datasets (between one hundred and one thousand, depending on computer power) and the same number
of different trees are generated in this way. In this specific case taxa A and B form a cluster in all three trees, while C clusters
with D in only one tree. There exist specialised programs, such as the program Consense in the Phylip package of
Joe Felsenstein, that are able to analyse all the resulting trees and prepare the most likely tree or consensus tree
from those data.
The resulting consensus tree for our small dataset is shown below. The number of times each branch point or node
occurred (the so-called bootstrap proportion) is indicated at each node.
Result
[Figure: consensus tree]
Original dataset:
A     A G G C U C C A A A
B     A G G U U C G A A A
C     A G C C C C G A A A
D     A U U U C C G A A C
Distance matrix:
      A  B  C
B     2
C     3  3
D     6  4  4
Again some good advice (1)
Tree topologies may strongly depend on the
following:
– DNA or Protein used in the analysis
– Distance or Parsimony methods applied
– The number of OTUs included in the alignment
– The order of the OTUs in the alignment
– The selection of a good outgroup
None of the methods may guarantee the one tree
with the correct topology
Again some good advice (2)
So as to have an idea of the reliability of the topology of the resulting
tree, one should do one or all of the following:
– Apply more than one of different methods (distance, parsimony) to
the dataset.
– Vary the parameters used by the different programs, such as seed
value and jumble factor for the order of OTU addition.
– Add or remove one or more OTUs and see how this influences tree
topology.
– Try to include an outgroup that may serve as a root for your tree.
– Apply Bootstrap or Jackknife analyses to your dataset and prepare a
consensus tree of 100 - 1000 replicas (depending on the size of the
dataset and on computer power).
Only when widely different methods provide you with similar or identical
tree topologies and such topologies are supported by good bootstrap
values (> 95%) can the trees be considered reliable.
Limitations of the various
methods
Distance approaches (UPGMA, corrected distances and neighbor-joining)
do not use the original (sequence) data, but derived distance
information. Some information is said to be lost.
Character-state approaches (Maximum Parsimony) are said to be
more powerful than distance methods because they use the raw
data. However, this is usually a small fraction of the data. Maximum
parsimony uses only the informative sites. So when the number of
informative sites is not large, this method is often less efficient than
distance methods (Saitou and Nei, 1986). Maximum parsimony is
notorious for its sensitivity to codon bias.
None of the methods is reliable when OTUs with highly unequal
evolutionary separation are included in the dataset.
Some terms used in molecular
evolution
Indel: position in a sequence alignment where one of the sequences
has acquired an insertion or extension or has undergone a deletion
Identity: percentage of identical residues in pairwise aligned
sequences. Normally deletions or insertions are not taken into
consideration, since it is not possible to tell how many events have
been at the basis of the creation of such an indel
Homology: two sequences are homologous or have homology when
they have evolved from a common ancestral sequence. The same
holds for the aligned residues in a sequence alignment. Homologous
residues are derived from a common ancestral residue.
Similarity and homology as a percentage should not be used. Two
sequences can be similar, and have a certain percentage of identity,
but cannot have a certain percentage of similarity. The same holds
for homology.
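The definition of identity given above, with indel positions left out of the count, can be written as a short function. The function name and the use of "-" as the gap symbol are assumptions of this sketch; the test words are the hypothetical sequences PHYLOGENY and PHOLOGENY.

```python
def percent_identity(a, b, gap="-"):
    """Percentage of identical residues over the aligned positions
    in which neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

print(percent_identity("PHYLOGENY", "PHOLOGENY"))   # one mismatch in nine positions
print(percent_identity("PHYLOGENY", "PH-LOGENY"))   # the gapped position is ignored
```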
Some PAM rates
Protein                      PAMs per 100 million years
IG kappa chain C region      37
Lactalbumin                  27
Epidermal growth factor      26
Haptoglobin alpha chain      20
Serum albumin                19
Phospholipase A              19
Hemoglobin alpha chain       12
Animal lysozyme              9.8
Myoglobin                    8.9
Amyloid AA                   8.7
Acid proteases               8.4
Myelin basic protein         7.4
Cytochrome b                 4.5
Lactate dehydrogenase        3.4
Adenylate kinase             3.2
Triosephosphate isomerase    2.8
Cytochrome c                 2.2
Plant ferredoxin             1.9
Glutamate dehydrogenase      0.9
Histone H4                   0.1
(Adapted from Table 1. Atlas of Protein Sequence and Structure, Suppl 3,
1978, M.O. Dayhoff, ed. National Biomedical Research Foundation,
1979.)
The three letter amino acid code
A  Ala     I  Ile     S  Ser
B  Asx     K  Lys     T  Thr
C  Cys     L  Leu     V  Val
D  Asp     M  Met     W  Trp
E  Glu     N  Asn     X  Xxx
F  Phe     P  Pro     Y  Tyr
G  Gly     Q  Gln     Z  Glx
H  His     R  Arg
Alignment of two protein
sequences (1)
Consider four hypothetical sequences:
PHYLOGENY, PHOLOGENY, PHLOGENY, PHOLONY
Alignment can be done in various ways:
PHYLOGENY
PHOLOGENY
PH-LOGENY
PHOLO--NY
or
PHY-LOGENY
PH-OLOGENY
PH--LOGENY
PH-OLO--NY
Tree construction using
distance-matrix methods
Phylogenetic tree constructed from 6
aligned sequences
A MOLECULAR--EVOLUTION
B MOLEKULARE-EVOLUTIEN
C MOLECULAIREEVOLUTIEN
D MO-ECALIAREEFOLUTIE-
E MO-ESALIARE-GOLUTIU-
F NO-ASELIAKE-HODATAU-
[Figure: tree joining sequences A-F; branch lengths shown on the tree: 1, 1, 1, 1, 2, 2, 1, 1, 2, 4]
Triosephosphate
isomerase
[Figure: phylogenetic tree of triosephosphate isomerase sequences; scale bar 0.1; groups labelled Animalia, Planta, Protists, Fungi, Eubacteria, Archaebacteria]
TPIS_HUMAN
TPIS_MACMU
TPIS_RABIT
TPIS_MOUSE
TPIS_RAT
TPIS_LATCH
TPIS_CHICK
TPIS_SCHJA
TPIS_SCHMA
TPIS_AEDTO
TPIS_CULPI
TPIS_CULTA
TPIS_ANOME
TPIS_DROME
TPIS_HELVI
TPIS_CAEEL
TPIS_GRAVE
TPIS_ARATH
TPIS_PETHY
TPIS_COPJA
TPIS_LACSA
TPIS_HORVU
TPIS_SECCE
TPIS_MAIZE
TPIS_ORYSA
TPIC_SPIOL
TPIC_SECCE
TPIS_STELP
TPIS_TRYBB
TPIS_TRYCR
TPIS_LEIME
TPI1_GIALA
TPI2_GIALA
TPIS_EMENI
TPIS_SCHPO
TPIS_YEAST
TPIS_COPCI
TPIS_BACSU
TPIS_STAAU
TPIS_BACME
TPIS_BACST
TPIS_LACDE
TPIS_LACLA
TPIS_CLOAB
TPIS_BORBU
TPIS_SYNY3
TPIS_PLAFA
TPIS_MYCHR
TPIS_MYCFL
TPIS_MYCHY
TPIS_MYCGE
TPIS_MYCPN
TPIS_TREPA
TPIS_MYCLE
TPIS_MYCTU
TPIS_CORGL
TPIS_STRCO
TPIS_XANFL
TPIS_CHLAU
TPIS_RHIET
PGKT_THEMA
TPIS_AQUAE
TPIS_VIBSA
TPIS_PSESY
TPIS_CHLPN
TPIS_CHLTR
TPIS_ECOLI
TPIS_ENTCL
TPIS_HAEIN
TPIS_VIBMA
TPIS_BUCAP
TPIS_HELPJ
TPIS_HELPY
TPIS_FRATU
TPIS_MORSP
TPIS_PYRHO
TPIS_PYRWO
TPIS_METTH
TPIS_ARCFU
TPIS_METJA
TPIS_METBR