Molecular Phylogeny

Fredj Tekaia
Institut Pasteur
Examples of phylogenetic trees
Pace (1997) described a tree
of life based on small subunit
rRNA sequences.
Pace, N. R. (1997) Science 276, 734-740
This tree shows the main
three branches described
by Woese and colleagues.
Chlamydiae
Fig. 1. Phylogeny of chlamydiae. 16S rRNA-based neighbor-joining tree
showing the affiliation of environmental and pathogenic chlamydiae with
major bacterial phyla. Arrow, to outgroup. Scale bar, 10% estimated
evolutionary distance.
Eukaryotes
(Baldauf et al., 2000)
Evolutionary processes include:
[Figure: evolution from an ancestor to the present species genome. Labelled processes: Phylogeny*, Expansion* (genesis, duplication, HGT), Exchange* (HGT) and Deletion* (loss). Other labels: original version, actual version.]
Hurles M (2004) Gene Duplication: The Genomic Trade in Spare Parts. PLoS Biol 2(7): e206.
Homolog - Paralog - Ortholog
[Figure: tree leading from an ancestral gene O through genes A and B to A1 and B1 in Species-1 and A2 and B2 in Species-2. Other labels: a, b, S1, S2.]
Homologs: A1, B1, A2, B2
Paralogs: A1 vs B1 and A2 vs B2
Orthologs: A1 vs A2 and B1 vs B2
Molecular evolution
GACGACCATAGACCAGCATAG
GACTACCATAGA-CTGCAAAG
*** ******** * *** **
GACGACCATAGACCAGCATAG
GACTACCATAGACT-GCAAAG
*** *********  *** **
Two possible positions for the indel
Molecular Phylogenetic Analysis
Study of evolutionary relationships between genes and
species
• The actual pattern of evolutionary history is the
phylogeny or evolutionary tree which we try to estimate.
• A tree is a mathematical structure which is used to
model the actual evolutionary history of a group of
sequences or organisms.
Molecular Phylogeny Analysis
• Specifying the history of gene evolution is one of the most important
aims of the current study of molecular evolution;
• From a given set of aligned sequences, molecular phylogeny methods propose
phylogenetic trees (inferred trees) that aim to reconstruct the history of
successive divergences that took place, during evolution, between the
considered sequences and their common ancestor. These trees may not be the
same as the true tree.
• Reconstruction of phylogenetic trees is a statistical problem, and a
reconstructed tree is an estimate of a true tree with a given topology and
given branch length;
• The accuracy of this estimation should be statistically established;
• In practice, phylogenetic analyses usually generate phylogenetic trees
with accurate parts and imprecise parts.
Nucleotide, amino-acid sequences
-GGAGCCATATTAGATAGA-
 Gly Ala Ile Leu Asp Arg
-GGAGCAATTTTTGATAGA-
 Gly Ala Ile Phe Asp Arg
• 3 different DNA positions but only one different amino acid position:
2 of the nucleotide substitutions are therefore synonymous and one is
non-synonymous.
DNA yields more phylogenetic information than proteins. The
nucleotide sequences of a pair of homologous genes have a higher
information content than the amino acid sequences of the
corresponding proteins, because mutations that result in synonymous
changes alter the DNA sequence but do not affect the amino acid
sequence. (But amino-acid sequences are more efficiently aligned)
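The count of synonymous and non-synonymous differences in the example above can be checked with a short sketch. It assumes both sequences are in frame and ungapped, and it uses a deliberately tiny codon table (only the codons present in the two example sequences); the function name `classify_differences` is hypothetical.

```python
# Minimal codon table: only the codons that occur in the two example sequences.
CODON_TO_AA = {
    "GGA": "Gly", "GCC": "Ala", "GCA": "Ala",
    "ATA": "Ile", "ATT": "Ile",
    "TTA": "Leu", "TTT": "Phe",
    "GAT": "Asp", "AGA": "Arg",
}


def classify_differences(seq1, seq2):
    """Count codons that differ synonymously and non-synonymously.

    Assumes equal-length, in-frame, ungapped sequences whose codons are
    all present in CODON_TO_AA.
    """
    synonymous = 0
    nonsynonymous = 0
    for i in range(0, len(seq1), 3):
        codon1, codon2 = seq1[i:i + 3], seq2[i:i + 3]
        if codon1 == codon2:
            continue
        if CODON_TO_AA[codon1] == CODON_TO_AA[codon2]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


seq_a = "GGAGCCATATTAGATAGA"
seq_b = "GGAGCAATTTTTGATAGA"
print(classify_differences(seq_a, seq_b))  # (2, 1)
```

Each differing codon here carries a single nucleotide change, so counting codons and counting nucleotide substitutions give the same result.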
Phenetics and Cladistics
Phenetics (Michener and Sokal, 1957): Pheneticists argued that
classifications should encompass as many variable characters as
possible, these characters being analysed by rigorous mathematical
methods.
Such methods (e.g. distance-based methods) place a greater emphasis on
the relationships among data sets than on the paths they have taken to
arrive at their current states.
Cladistics (Hennig 1966): emphasizes the need for large datasets
but differs from phenetics in that it does not give equal weight to
all characters.
Cladists are generally more interested in evolutionary pathways
than in relationships (e.g. maximum parsimony).
Key features of DNA-based phylogenetic trees
[Figure: an unrooted tree of A, B, C and D, with its branches, external nodes and internal nodes labelled, and the rooted trees numbered 1 to 5 for the same four sequences, each rooted at a hypothetical ancestor.]
Rooted and Unrooted trees
•An important distinction in phylogenetics is made between trees that
make an inference about a common ancestor and the direction of
evolution and those that do not.
[Figure: a rooted tree and an unrooted tree of the sequences A, B, C and D.]
•In rooted trees a single node is designated as a common ancestor,
and a unique path leads from it through evolutionary time to any
other node.
•Unrooted trees only specify the relationship between nodes and
say nothing about the direction in which evolution occurred.
•Roots can usually be assigned to unrooted trees through the use of
an outgroup.
Key features of DNA-based phylogenetic trees
The numbers of possible rooted ($N_R$) and unrooted ($N_U$) trees for n
sequences are given by:
$$N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$$
$$N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$$
• Note that only one of all possible trees can represent the true tree
that represents phylogenetic relationships among the sequences.

n     N_R         N_U
2     1           1
3     3           1
4     15          3
5     105         15
10    34459425    2027025
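The two formulas above can be evaluated directly; the sketch below reproduces the values in the table. The function names are hypothetical, and treating n = 2 as having a single unrooted tree follows the table rather than the formula, which is only defined for n of 3 or more.

```python
from math import factorial


def n_rooted(n):
    """Number of rooted bifurcating trees for n sequences (n >= 2)."""
    if n < 2:
        raise ValueError("need at least 2 sequences")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def n_unrooted(n):
    """Number of unrooted bifurcating trees for n sequences (n >= 2)."""
    if n < 2:
        raise ValueError("need at least 2 sequences")
    if n == 2:
        return 1  # convention taken from the table; the formula needs n >= 3
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (2, 3, 4, 5, 10):
    print(n, n_rooted(n), n_unrooted(n))
```

The rapid growth of both counts is the reason exhaustive tree searches become impractical for more than a handful of sequences.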
Gene tree - Species tree
[Figure: a gene tree for genes A to E, whose branch points are mutation events, and a species tree for species A to E, whose branch points are speciation events.]
These two kinds of event, mutation and speciation, are not expected to
occur at the same time, so a gene tree does not necessarily represent the
species tree.
Gene tree - Species tree
[Figure: a species tree and a gene tree for A, B and C drawn against a time axis, with duplication and speciation events marked.]
Tree construction: how to proceed?
1. Consider the set of sequences to analyse ;
2. Align these sequences "properly" ;
3. Apply tree-building methods ;
4. Evaluate statistically the obtained phylogenetic tree.
Methodology :
1- Multiple alignment;
2- Bootstrapping;
3- Consensus tree construction and evaluation;
Alignment is an essential preliminary to tree construction
GACGACCATAGACCAGCATAG
GACTACCATAGA-CTGCAAAG
*** ******** * *** **
GACGACCATAGACCAGCATAG
GACTACCATAGACT-GCAAAG
*** *********  *** **
Two possible positions for the indel
• If errors in indel placement are made in a multiple alignment then
the tree reconstructed by phylogenetic analysis is unlikely to be
correct.
Steps in Multiple Sequence Alignments
A common strategy of several popular multiple sequence alignment
algorithms is to:
1- generate a pairwise distance matrix based on all possible pairwise
alignments between the sequences being considered;
2- use a statistically based approach to construct an initial tree;
3- realign the sequences progressively in order of their relatedness
according to the inferred tree;
4- construct a new tree from the pairwise distances obtained in the
new multiple alignment;
5- repeat the process if the new tree is not the same as the previous
one.
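Step 1 of this strategy can be sketched as follows. This is an illustration under simple assumptions, not the procedure of any particular program: pairwise alignments use a basic Needleman-Wunsch global alignment with arbitrary scores (match +1, mismatch -1, linear gap -2), and the distance is the fraction of differing residues over ungapped columns. Sequences A and B come from the alignment example; C and D are hypothetical.

```python
from itertools import combinations


def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def p_distance(x, y):
    """Fraction of differing residues over columns without gaps."""
    pairs = [(p, q) for p, q in zip(x, y) if p != "-" and q != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(p != q for p, q in pairs) / len(pairs)


seqs = {
    "A": "GACGACCATAGACCAGCATAG",
    "B": "GACTACCATAGACTGCAAAG",
    "C": "GACGACCTTAGACCAGCATAG",  # hypothetical
    "D": "GACTACCATAGACTGCTAAG",   # hypothetical
}
matrix = {(x, y): p_distance(*global_align(seqs[x], seqs[y]))
          for x, y in combinations(sorted(seqs), 2)}
for pair, d in sorted(matrix.items(), key=lambda item: item[1]):
    print(pair, round(d, 3))
```

Sorting the pairs by distance gives the order in which a progressive aligner would start joining sequences.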
Steps in multiple alignment
A- Pairwise alignment
Example- 4 sequences, A, B, C, D
6 pairwise comparisons, then cluster analysis
[Figure: similarity tree grouping B with D and A with C.]
B- Multiple alignment following the tree from A
1. Align the most similar pair (B, D), with gaps to optimise the alignment
2. Align the next most similar pair (A, C)
3. Align the alignments (BD) with (AC): preserve existing gaps and add new gaps to optimise the alignment
Procedure
•An efficient procedure consists of aligning the amino-acid sequences
and using the resulting alignment as a template for the corresponding
nucleotide sequences.
Alignment is then guaranteed at the codon level.
1. Align the protein sequences of a family using ClustalW
2. Align the corresponding DNA sequences, using as template the
amino-acid alignment obtained in step 1
Note: clean the multiple alignment of gaps common to the majority of
the considered sequences
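A minimal sketch of step 2 and of the cleaning note, assuming each DNA sequence encodes exactly the residues of its aligned protein row (no stop codon, no frameshift). The function names `thread_codons` and `drop_gappy_columns` and the 0.5 threshold are assumptions made for illustration.

```python
def thread_codons(aligned_protein, dna):
    """Map an ungapped DNA sequence onto its gapped protein alignment row.

    Every residue consumes the next 3 nucleotides; every gap becomes '---'.
    """
    codons = []
    pos = 0
    for residue in aligned_protein:
        if residue == "-":
            codons.append("---")
        else:
            codons.append(dna[pos:pos + 3])
            pos += 3
    if pos != len(dna):
        raise ValueError("DNA length does not match the protein row")
    return "".join(codons)


def drop_gappy_columns(rows, max_gap_fraction=0.5):
    """Remove alignment columns in which more than max_gap_fraction of rows have a gap."""
    keep = [i for i in range(len(rows[0]))
            if sum(row[i] == "-" for row in rows) / len(rows) <= max_gap_fraction]
    return ["".join(row[i] for i in keep) for row in rows]


print(thread_codons("M-K", "ATGAAA"))            # ATG---AAA
print(drop_gappy_columns(["A-C", "A-C", "ATC"]))  # ['AC', 'AC', 'AC']
```

Applying the column filter to the protein alignment before threading the codons keeps the nucleotide alignment in frame.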
Phylogenetic tree construction methods
• A phylogenetic tree is characterised by its topology (form) and its
length (sum of its branch lengths) ;
• Each node of a tree is an estimation of the ancestor of the elements
included in this node;
• There are 3 main classes of phylogenetic methods for constructing
phylogenies from sequence data :
Methods directly based on sequences :
• Maximum Parsimony : find a phylogenetic tree that explains the data,
with as few evolutionary changes as possible.
• Maximum likelihood : find a tree that maximizes the probability of the
genetic data given the tree.
Methods indirectly based on sequences :
• Distance based methods (Neighbour Joining (NJ)): find a tree such
that branch lengths of paths between sequences (species) fit a matrix of
pairwise distances between sequences.
Parsimony
The concept of parsimony is at the heart of all character-based methods of phylogenetic reconstruction.
The 2 fundamental ideas of biological parsimony are:
1- Mutations are exceedingly rare events (?) ;
2- the more unlikely events a model invokes, the less likely
the model is to be correct.
As a result, the relationship that requires the smallest
number of mutations to explain the current state of the
sequences being considered is the relationship that is
most likely to be correct.
Parsimony
Informative and Uninformative Sites:
Multiple sequence alignment, for a parsimony approach, contains positions that
fall into two categories in terms of their information content : those that have
information (are informative) and those that do not (are uninformative).
Example:
seq \ position   1  2  3  4  5  6
1                G  G  G  G  G  G
2                G  G  G  A  G  T
3                G  G  A  T  A  G
4                G  A  T  C  A  T

In general, for a position to be informative regardless of how many
sequences are aligned, it has to have at least 2 different nucleotides,
and each of these nucleotides has to be present at least twice.
Position 1 is said to be invariant and is therefore uninformative, because all trees invoke the same number of
mutations (0);
Position 2 is uninformative because 1 mutation occurs in all three possible trees;
Position 3 likewise, because 2 mutations occur in all three trees;
Position 4 is uninformative because it requires 3 mutations in all possible trees;
Positions 5 and 6 are informative, because one of the trees invokes only one mutation and the other 2
alternative trees both require 2 mutations.
Krane & Raymer 2002
[Figure: for each of the six positions, the three possible unrooted trees relating sequences 1 to 4, with the nucleotide found at each tip.]
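The rule for informative sites can be checked on the four-sequence example. The sketch below assumes the alignment is given as equal-length strings, one per sequence; the function name `is_informative` is hypothetical.

```python
from collections import Counter


def is_informative(column):
    """A column is informative if at least 2 different nucleotides each occur at least twice."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


# Rows are sequences 1 to 4, columns are positions 1 to 6 of the example.
alignment = ["GGGGGG", "GGGAGT", "GGATAG", "GATCAT"]
columns = ["".join(col) for col in zip(*alignment)]
informative = [i + 1 for i, col in enumerate(columns) if is_informative(col)]
print(informative)  # [5, 6]
```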
Maximum Parsimony (Fitch, 1977)
Parsimony criterion consists of determining the minimum number of
changes (substitutions) required to transform a sequence to its nearest
neighbor.
The maximum parsimony algorithm searches for the minimum
number of genetic events (nucleotide substitutions or amino-acid
changes) to infer the most parsimonious tree from a set of
sequences.
The best tree is the one which needs the fewest changes.
Problems :
1. within practical computational limits, this often leads to the
generation of tens or more "equally most parsimonious trees" which
makes it difficult to justify the choice of a particular tree ;
2. long computation time is needed to construct a tree.
Maximum Parsimony (Fitch, 1977),...
The Maximum parsimony method takes account of
information pertaining to character variation in each position
of the sequence multiple alignment, to recreate the series of
nucleotide changes. The assumption, possibly erroneous, is
that evolution follows the shortest possible route and that the
correct phylogenetic tree is therefore the one that requires the
minimum number of nucleotide changes to produce the
observed differences between the sequences.
Trees are therefore constructed at random and the nucleotide
changes that they involve calculated until all possible
topologies have been examined and the one requiring the
smallest number of steps identified.
This is presented as the most likely inferred tree.
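For a fixed topology, the minimum number of changes at a site can be counted with the set-intersection procedure usually attributed to Fitch. The sketch below scores the three possible topologies for the four-sequence example used for informative sites. Trees are written as nested tuples of sequence names, which is an assumed representation; the position of the root is arbitrary and does not change the count.

```python
def fitch(tree, states):
    """Return (possible ancestral states, number of changes) for one site."""
    if isinstance(tree, str):          # a leaf: its observed nucleotide
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:                         # children agree: no extra change
        return common, left_changes + right_changes
    # children disagree: one change is needed on this pair of branches
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment positions."""
    names = list(alignment)
    length = len(alignment[names[0]])
    return sum(fitch(tree, {n: alignment[n][i] for n in names})[1]
               for i in range(length))


alignment = {"1": "GGGGGG", "2": "GGGAGT", "3": "GGATAG", "4": "GATCAT"}
topologies = [
    (("1", "2"), ("3", "4")),
    (("1", "3"), ("2", "4")),
    (("1", "4"), ("2", "3")),
]
scores = [parsimony_score(t, alignment) for t in topologies]
print(scores)  # [9, 9, 10]
```

On this small example two topologies tie at 9 changes, which illustrates the problem of several equally most parsimonious trees.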
Maximum likelihood
This approach is a purely statistically based method.
Probabilities are considered for every individual nucleotide
substitution in a sequence alignment.
Example:
..C..
..T..
..G..
Since transitions (exchanging a purine for a purine or a pyrimidine for a
pyrimidine) are observed roughly 3 times as often as transversions
(exchanging a purine for a pyrimidine or vice versa), it can be reasonably
argued that a greater likelihood exists that the sequences with C and T are
more closely related to each other than they are to the sequence with G.
• Calculation of probabilities is complicated by the fact that the sequence of
the common ancestor of the sequences considered is unknown.
• Furthermore, multiple substitutions may have occurred at one or more sites,
and all sites are not necessarily independent or equivalent.
Still, objective criteria can be applied to calculating the probability for every site and
for every possible tree that describes the relationships of the sequences in a multiple
alignment.
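How a per-site probability can be computed despite the unknown ancestor is sketched below by summing over all possible ancestral nucleotides (Felsenstein's pruning approach). The substitution model is an assumption: the Jukes-Cantor model is used because it is the simplest, and since it treats all substitutions alike it does not capture the transition/transversion bias mentioned above. The tree representation, branch lengths and function names are hypothetical.

```python
from math import exp
from itertools import product

BASES = "ACGT"


def jc69_prob(i, j, t):
    """Jukes-Cantor probability of observing base j after time t, starting from base i."""
    e = exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def partial(node, site):
    """Likelihood of the data below `node` for each possible base at `node`.

    A node is either a leaf name (str) or a tuple of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return {b: 1.0 if b == site[node] else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        below = partial(child, site)
        for b in BASES:
            result[b] *= sum(jc69_prob(b, c, t) * below[c] for c in BASES)
    return result


def site_likelihood(tree, site):
    """Probability of one alignment column given the tree, summed over root bases."""
    root = partial(tree, site)
    return sum(0.25 * root[b] for b in BASES)


def total_probability(tree, leaves):
    """Sum of the likelihoods of every possible column; should be 1."""
    return sum(site_likelihood(tree, dict(zip(leaves, pattern)))
               for pattern in product(BASES, repeat=len(leaves)))


# Hypothetical three-sequence tree with arbitrary branch lengths.
tree = (("x", 0.1), ((("y", 0.2), ("z", 0.3)), 0.05))
print(site_likelihood(tree, {"x": "C", "y": "T", "z": "G"}))
print(total_probability(tree, "xyz"))
```

The likelihood of a whole alignment is the product of the per-site values, and a maximum likelihood search compares this quantity across candidate trees and branch lengths.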
Distance matrix methods (NJ,...)
Convert sequence data into a set of discrete pairwise distance values,
arranged into a matrix. Distance methods fit a tree to this matrix.
$D_{i,j}$ = the distance between sequences i and j;
$d_{i,j}$ = the sum of branch lengths on the tree path from i to j.
The phylogeny makes an estimation of the distance for each pair as the
sum of branch lengths in the path from one sequence to another through
the tree.
A measure of how close the tree is to D is given by the least-squares
criterion :
$$\sum_{i,j} \frac{(D_{i,j} - d_{i,j})^2}{D_{i,j}^2}$$
The tree topology is constructed using a cluster analysis
method (such as the NJ method).
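The least-squares criterion can be computed as below. The sketch assumes both the observed distances D and the tree path lengths d are supplied as dictionaries keyed by sequence pair; all numeric values are hypothetical.

```python
def least_squares_fit(observed, tree_paths):
    """Sum over pairs of (D - d)^2 / D^2; 0 means the tree fits the matrix exactly."""
    return sum((observed[pair] - tree_paths[pair]) ** 2 / observed[pair] ** 2
               for pair in observed)


# Hypothetical observed distances between three sequences.
observed = {("A", "B"): 0.4, ("A", "C"): 0.5, ("B", "C"): 0.5}
# Path lengths read from a hypothetical tree: only the A-B path differs from D.
tree_paths = {("A", "B"): 0.3, ("A", "C"): 0.5, ("B", "C"): 0.5}

print(least_squares_fit(observed, observed))    # 0.0
print(least_squares_fit(observed, tree_paths))  # about 0.0625
```

Dividing by the squared observed distance gives large distances, which are estimated less precisely, less weight in the fit.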
advantages :
1. easy to perform ; 2. fast calculation ; 3. fit for sequences having high similarity scores ;
drawbacks :
1. all sites are generally treated equally (differences in substitution rates are not taken into
account) ; 2. not applicable to distantly related sequences ; 3. some of the information is lost,
particularly that pertaining to the identities of the ancestral and derived nucleotides at each position.
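A compact sketch of neighbour joining, the distance method named above, is given here for illustration. It follows the usual formulation (pick the pair minimising (n-2)·d(a,b) - r(a) - r(b), where r is the sum of a node's distances to all others), returns an unrooted tree as nested tuples, and is tested on a hypothetical additive distance matrix. It is not taken from any particular program.

```python
def neighbor_joining(names, dist):
    """Build an unrooted tree from a symmetric nested-dict distance matrix.

    Joined nodes are tuples ((child_a, length_a), (child_b, length_b));
    the result is (node_1, node_2, length_of_the_branch_between_them).
    """
    nodes = list(names)
    d = {a: dict(dist[a]) for a in names}      # working copy of the matrix
    while len(nodes) > 2:
        n = len(nodes)
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        best = None
        for i, a in enumerate(nodes):          # pair with the smallest Q value
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        length_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        length_b = d[a][b] - length_a
        new = ((a, length_a), (b, length_b))
        d[new] = {}
        for c in nodes:                        # distances from the new node
            if c != a and c != b:
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c != a and c != b] + [new]
    a, b = nodes
    return (a, b, d[a][b])


def to_matrix(pairs):
    """Expand {(x, y): distance} into a symmetric nested dictionary."""
    matrix = {}
    for (x, y), value in pairs.items():
        matrix.setdefault(x, {})[y] = value
        matrix.setdefault(y, {})[x] = value
    return matrix


# Hypothetical additive distances from a tree with branches
# A:2, B:3, C:4, D:1 and an internal branch of length 2.
pairs = {("A", "B"): 5, ("A", "C"): 8, ("A", "D"): 5,
         ("B", "C"): 9, ("B", "D"): 6, ("C", "D"): 5}
tree = neighbor_joining(["A", "B", "C", "D"], to_matrix(pairs))
print(tree)
```

Because the example distances are exactly additive, the branch lengths of the generating tree are recovered; with real data the tree only approximates the matrix.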
The choice of the outgroup
• Most phylogenetic methods construct unrooted trees.
• It is best to root such trees on biological grounds.
• The most widely used technique consists of including in the data set
to be analysed a sequence (the outgroup) that is related to the considered
sequences without belonging to the same family.
• The aim is to normalize the branches of the unrooted tree relative to
the length of the branch leading to the outgroup.
Evaluation of different methods
• None of the previous methods of phylogenetic reconstruction
guarantees that it yields the one true tree that describes the
evolutionary history of a set of aligned sequences.
• There is at present no statistical method allowing comparisons of trees
obtained from different phylogenetic methods; nevertheless many
attempts have been made to compare the relative consistency of the
existing methods.
• The consistency depends on many factors, including the topology and
branch lengths of the real tree, the transition/transversion rate and the
variability of the substitution rates.
• In practice, one infers phylogenies from sequences that do not
generally satisfy the assumptions of the methods.
• One expects that if sequences have strong phylogenetic relationships,
different methods will result in the same phylogenetic tree.
Statistical evaluation of the obtained phylogenetic tree
• The accuracy is dependent on the considered multiple sequence
alignments ;
• ML estimates branch lengths, their degree of significance and their
confidence limits ;
• At present only sampling techniques allow the topology of a
phylogenetic tree to be tested :
Bootstrapping
It consists of drawing columns from a sample of aligned sequences,
with replacement, until one gets a data set of the same size as the
original one (usually some columns are sampled several times and
others left out).
Bootstrapping
• Constructs a new multiple alignment at random from the real
alignment, with the same size. Note that the same column can be
sampled more than once, and consequently some columns are not
sampled.
ATAGCCATA
ATACCCATG
ATACCCATA
ATAGCCATA
ATCCCCCAT
TCAAATGCA
TCGAATCCA
TCAAATCCA
TCAAATGCA
TCAACACCC
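Column resampling can be sketched in a few lines. The alignment is assumed to be a list of equal-length strings; the function name `bootstrap_alignment` and the fixed seed are choices made for illustration, and the five sequences are the first block shown above.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Draw columns with replacement until the replicate has the original length."""
    length = len(alignment[0])
    picks = [rng.randrange(length) for _ in range(length)]
    # The same picked columns are used for every row, so columns stay intact.
    return ["".join(row[i] for i in picks) for row in alignment]


alignment = ["ATAGCCATA", "ATACCCATG", "ATACCCATA", "ATAGCCATA", "ATCCCCCAT"]
replicate = bootstrap_alignment(alignment, random.Random(0))
for row in replicate:
    print(row)
```

Repeating the call with fresh random draws (for instance 100 times) gives the set of pseudo-alignments on which trees are rebuilt.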
Methodology
1. Consider the set of sequences to analyse ;
2. Align these sequences "properly" ;
3. Apply tree-building methods ;
4. Evaluate statistically the obtained phylogenetic tree.
In practice :
1- Multiple alignment;
2- Bootstrapping (100 samples);
3- Tree construction on each bootstrap sample;
4- Consensus tree construction and evaluation;
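The evaluation step can be illustrated by counting how often each group of sequences appears among the bootstrap trees. The sketch assumes rooted trees written as nested tuples of sequence names, which is a simplification (programs usually compare unrooted splits); the four trees are hypothetical.

```python
from collections import Counter


def clades(tree):
    """Return (leaves under this node, set of all clades found in the subtree)."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


def clade_support(trees):
    """Fraction of trees in which each clade appears."""
    counts = Counter()
    for tree in trees:
        counts.update(clades(tree)[1])
    return {clade: n / len(trees) for clade, n in counts.items()}


# Hypothetical trees built from four bootstrap samples.
bootstrap_trees = [(("A", "B"), ("C", "D"))] * 3 + [(("A", "C"), ("B", "D"))]
support = clade_support(bootstrap_trees)
majority = {clade for clade, s in support.items() if s > 0.5}
print(support[frozenset("AB")])  # 0.75
```

Groups supported by more than half of the samples are the ones retained in a majority-rule consensus tree.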
Example: The tree of life
Pace (1997) described a tree
of life based on small subunit
rRNA sequences.
Pace, N. R. (1997) Science 276, 734-740
This tree shows the main
three branches described
by Woese and colleagues.
References
• Phylogeny programs :
http://evolution.genetics.washington.edu/phylip/sftware.html
• MEGA: http://www.megasoftware.net/
• PAML: http://abacus.gene.ucl.ac.uk/software/paml.html
Books:
• Fundamental Concepts of Bioinformatics. Dan E. Krane and Michael L. Raymer
• Genomes, 2nd edition. T.A. Brown
• Molecular Evolution: A Phylogenetic Approach. Page, RDM and Holmes, EC. Blackwell Science