Zertz

Transcript Zertz

Phylogenetic reconstruction
Julie Thompson
Laboratory of Integrative Bioinformatics and
Genomics
IGBMC, Strasbourg, France
[email protected]
Phylogenetic reconstruction

Introduction: basic concepts

Principal methods for tree reconstruction


Distance based methods
Character based methods

Evaluation of the reliability of a tree

The limits of phylogeny

Some programs
Julie Thompson -IGBMC
Introduction

Phylogenetics: the study of evolutionary relatedness among organisms




Relationships represented by a hierarchical, tree-like structure
Originally based on physical similarities and differences (traits or characters)
Now, mostly using gene sequences
What for ?

Reconstruction of evolutionary history of a gene
family
AAGGCCT
AGGGCAT
AGGGCAT

-3 mil yrs
AAGACTT
TAGCCCT
TAGCCCA
-2 mil yrs
TGGACTT
TAGACTT
AGCACTT
AGCACAA
-1 mil yrs
AGCGCTT
today
Reconstruction of the evolutionary relationships
between species : living (extant) and dead
(extinct). eg : tree of life
Orangutan Gorilla Chimpanzee Human

Classification of new species. eg : viral strain
Julie Thompson -IGBMC
From the Tree of the Life Website, University of Arizona
Basic concepts : phylogeny
A phylogenetic tree is characterised by :
- its topology (branching patterns)
- the lengths of the branchs (possibly)
Rooted tree
Internal node
Unrooted tree
A
Leaf node
A
C
B
root
C
B
D
D
Node : represents a taxonomic unit (species, population, gene,…), either existing or ancestor
Branch: defines the relationship between the TUs (descent, ancestry)
Root : common ancestor of all taxonomic units on tree
Julie Thompson -IGBMC
Cladogram
Phylogram
A
A
A
B
B
B
D
D
D
E
E
E
F
F
F
G
(branch lengths are not meaningful)
A
G
0.1
G
(branch lengths are proportional
to number of changes)
B
A
B
Unrooted tree
D
D
E
G
E
0.1
F
G
F
Basic concepts : phylogeny
For an unrooted tree, there are several possible rooted trees
4
2
A
B
1
3
C
Position of the root
D
5
A
A
B
C
D
B
B
A
D
C
C
C
C
A
A
D
D
D
B
B
1
2
3
4
In most tree reconstruction programs, the position of the root is chosen arbitrarily :
- « midpoint rooting » (root placed in the centre of the longest branch)
- « outgroup rooting »
The user can define the sequence(s) that can be used as an outgroup to root the tree.
The outgroup sequence should be distantly related to all other sequences.
Julie Thompson -IGBMC
5
Basic concepts : branch order
The order of the branchs belonging to the same node is not important.
Rotating branchs at a node does not change the topology of the tree.
Tree 2
Tree 1
Tree 1 = Tree 2
Julie Thompson -IGBMC
Homology, orthology, paralogy
Homology :
Ancestral gene of insulin
2 genes are homologous if they have a
common ancestor
Orthology :
2 genes are orthologous if they diverged
after a speciation event
Primates
Rodents
Paralogy :
2 genes are paralogous if they diverged
after a duplication event
INS1
Rat Mouse
Speciation
Duplication
Julie Thompson -IGBMC
INS
Human
INS1 INS1
Rat Mouse
INS2
Rat Mouse
INS2 INS2
Rat Mouse
Phylogenetic reconstruction

Introduction: basic concepts

Principal methods for tree reconstruction


Distance based methods
Character based methods

Evaluation of the reliability of a tree

The limits of phylogeny

Some programs
Julie Thompson -IGBMC
Molecular phylogenetics



DNA, RNA, and protein sequences can be considered as phenotypic
traits
Molecular phylogenetics attempts to determine the rates and patterns
of change occurring in the sequences and to reconstruct the
evolutionary history of genes and organisms
How?




Build a multiple alignment
Validate, edit or mask the resulting alignment
Reconstruct the tree
Evaluate statistically the reliability of the tree
Julie Thompson -IGBMC
Multiple sequence alignment

Errors in the initial alignment will lead to inaccurate trees
Alignment of 18s rRNA sequences (Morrison and Ellis, J Mol Evol, 1997)


alignment algorithms Pileup, ClustalW, TreeAlign, MALIGN, SAM. Trees
construction, neighbor-joining, weighted-parsimony, and maximum-likelihood.
different alignments produced trees that were more dissimilar than did the
different tree-building methods
Using genomic data from seven yeast species (Wong et al, Science 2008)


build alignments using 7 different alignment programs and then estimate
phylogeny (using maximum parsimony)
46.2% of the 1502 genes had one or more differing trees depending on the
alignment procedure used
Julie Thompson -IGBMC
Wong, Suchard, Huelsenbeck. Alignment Uncertainty and Genomic Analysis Science, 2008
Alignment masking

Remove or mask columns with gaps and unreliable regions

E.g. Gblocks (Castresana, 2000)
Full length alignment
Julie Thompson -IGBMC
Gblocks only
Multiple sequence alignment
The quality of the multiple alignment is
crucial for the quality of the tree !!
The trees can vary depending on the
region of the alignment selected.
Julie Thompson -IGBMC
AspRS
Anticodon binding domain
Catalytic core I
Insertion domain
Catalytic core II
Figure 7 : Représentation schématique des conservations présentes dans les Asp tRNA synthétases après analyse OrdAli.
Colonnes noires et grises, résidus strictement conservés ou présents dans au moins 80% des séquences.
Colonnes rouges, bleues, jaunes, régions strictement conservées chez les Eucaryotes, Archaea et Bactéries respectivement.
Colonnes vertes et violettes, régions communes aux Eucaryotes et Archaea et aux Archaea et bactéries respectivement.
Julie Thompson -IGBMC
PLASM FALC
Whole alignment
ARABI THAL
CAENO ELEG
SCHI PO MT
DROSO MEGA
SACC CE MT
MYCOP GENI
DROS ME MT
CAEN EL MT
Bacteria +
Mitochondrie
HOMO SAPIE
RATTU NORV
MYCOP PNEU
SCHIZ POMB
SACCH CERE
CANDI ALBI
BORRE BURG
TREPO PALI
MYCOP CAPR
BUCHN AFID
RICKE PROW
RHODO CAPS
HALOB SALI
CHLOR TEPI
ARCHE FULG
MYCOB TUBE
AQUIF AEOL
MYCOB LEPR
THERM MARI
METBA THER
METHA JANN
HELIC PYLO
PORPH GING
CAMPY JEJU
CLOST ACET
PYROC KODA
CHLAM TRAC
BORDE PERT
SYNECHO
SP
AR THA
CHL
NEISSMENI
GONO
NEISS
THERM
DEINO
RADI THER
BACIL SUBT
PSEUD AERU
ENTER FAEC
STREP PYOG
YERSISHEWA
PEST PUTR
ESCHE
COLI
SALMO TYPH
VIBRI CHOL
HAEMO INFL
ACTIN ACTI
Julie Thompson -IGBMC
PYROC HORI
N terminal region
PLASM FALC
SACCH CERE
Bacteria
Archaea
Mitoch.
SCHIZ POMB
ARABI THAL
CANDI ALBI
CAENO ELEG
DROSO MEGA
HALOB SALI
HOMO SAPIE
RATTU NORV
PYROC HORI
METBA THER
PYROC KODA
DROS ME MT
METHA JANN
SCHI PO MT
CAEN EL MT
ARCHE FULG
BORRE BURG
SACC CE MT
MYCOP CAPR
BUCHN AFID
PORPH GING
CLOST ACET
DEINO RADI
RICKE PROW
BACIL SUBT
RHODO CAPS
CHLOR TEPI
SYNECHO SP
CHLAM TRAC
BORDE PERT
NEISS MENI NEISS GONO
HELIC PYLO
CAMPY JEJU
MYCOB TUBE
MYCOB LEPR
TREPO PALI
0.1
Julie Thompson -IGBMC
PSEUD AERU
SHEWA PUTR
ESCHE
ENTER FAECSALMO TYPH
YERSICOLI
PEST
VIBRI CHOL
HAEMO INFL
AQUIF
AEOL PYOG
ACTIN ACTI
STREP
THERM THER
THERM MARI
AR THA CHL
MYCOP GENI
MYCOP PNEU
Tree reconstruction methods

Distance based methods



UPGMA
Neighbor-Joining
Character based methods


Maximum parsimony
Maximum likelihood
Julie Thompson -IGBMC
Tree reconstruction methods

Distance based methods


Calculate the distances between pairs of sequences
=> distance matrix
Group the sequences based on the distances


UPGMA
Neighbor-Joining
relatively simple and fast
 but, the sequences themselves are not taken into account

Julie Thompson -IGBMC
Distance calculation
Observed distance : mean number of substitutions per site
Observed Dist =

No. substitutions
No. sites considered
Sites considered :


For DNA, the 3rd base of each codon can be excluded from analysis.
alignment positions with gaps are generally eliminated
2 possibilities :
Seq1 MAIKKIISRSNSGIHNATVI
Seq2 MPIKK-ISRSNSGIHHSTVI
Seq3 MPIKKIISRSNTGI-HSTVI
Total sites = 20
No. substitutions (Seq1,seq2) = 3
No. substitutions (Seq1,seq3) = 4
No. substitutions (Seq2,seq3) = 1
Julie Thompson -IGBMC
Global gap removal : 18 sites considered
Dist(Seq1,Seq2) = 3/18 = 0,1667
Dist(Seq1,Seq3) = 4/18 = 0,2222
Dist(Seq2,Seq3) = 1/18 = 0,0556
Pairwise gap removal :
Dist(Seq1,Seq2) = 3/19 = 0,1579
Dist(Seq1,Seq3) = 4/19 = 0,2105
Dist(Seq2,Seq3) = 1/18 = 0,0556
Distance correction
For more distantly related sequences, the probability of
several substitutions occuring at the same site increases.
=> The number of observed substitutions under-estimates the true number of
substitutions between distantly related sequences.
Sequence1
Single substitution
Multiple substitutions
Coincident substitutions
Parallel substitutions
Convergent substitutions
Reverse substitutions
C
C
C=>G
C=>A
C=>A
C
Sequence2
C=>A
C=>A=>T
C=>A
C=>A
C=>T=>A
C=>T=>C
Observed no. True no.
substitutions substitutions
1
1
1
0
0
0
1
2
2
2
3
2
=> Numerous methods available that estimate the true distance between sequences
Julie Thompson -IGBMC
Distance correction: DNA substitution models
Jukes-Cantor (JC) model (1969)
• The 4 bases have the same frequency
• All substitutions occur with equal probabilities
d = -3/4 ln (1-4/3 D)
where D is observed distance
Corrected distance
3,5
3,0
Jukes-Cantor
2,5
2,0
1,5
1,0
0,5
0,0
0,00
0,10
0,20
0,30
0,40
0,50
0,60
Observed distance
Kimura 2 parameter (K2P/K80) model (1980)
• The 4 bases have the same frequency
• Transitions (sunstitutions A-G or C-T) are more common than transversions
d = - ½ ln(1-2P-Q) – ¼ ln(1-2Q)
where P is mean no. of transitions
Q is mean no. of transversions
Tamura-Nei (TrN) model (1993)
• 4 bases have different frequencies
• different transition frequencies (α1 between purines; α2 between pyrimidines), equal transversion frequencies (β)
Julie Thompson -IGBMC
0,70
Distance calculation: amino acid substitution models
PAM matrices (Dayhoff, 1978)
• A 1-PAM mutation matrix describes an amount of evolution which will change, on the average, 1% of
the amino acids. In mathematical terms:
• estimated from the observation of accepted mutations between 34 superfamilies of closely related
sequences
• extrapolated to other evolutionary distances (eg. PAM250)
Blosum matrices (1992)
• local, ungapped alignments of distantly related sequences to derive the BLOSUM series of matrices.
Matrices of this series are identified by a number after the matrix (e.g. BLOSUM50), which refers to the
minimum percentage identity of the blocks of multiple aligned amino acids used to construct the matrix.
Gonnet matrices (1992)
• similar to Dayhoff, but with more modern sequence databases
Julie Thompson -IGBMC
UPGMA
UPGMA = Unweighted Pair Group Method with Arithmetic Mean
UPGMA is the simplest clustering algorithm
Assumptions :
mutation rate is the same in all ligneages (molecular clock)
Method




Group the 2 closest sequences
The node is placed at distance d from each sequence
d= (dist seq1,seq2)/2
Calculate distance between this node and all other sequences :
dist (seq1,seq2),seqx = (dist seq1,seqx + dist seq2,seqx) / 2
etc...
Julie Thompson -IGBMC
UPGMA
Distance matrix
B
C
D
E
F
A
2
4
6
6
8
1
B C D E
A
1
dist(A,B),C=(dist
dist(A,B),D=(dist
dist(A,B),E=(dist
dist(A,B),F=(dist
B
4
6 6
6 6 4
8 8 8 8
2
(A,B) C D E
4
6 6
6 6 4
8 8 8 8
2
(A,B) C E
C
4
(D,E) 6 6
F
8 8 8
1
C
D
E
F
Calculation of new distance matrix
Clustering
D
A
1
B
C
2
1
1
1
2
1
2
2
Julie Thompson -IGBMC
=
=
=
=
4
6
6
8
dist(AB,C),(D,E)=(dist (A,B)(D,E)+dist C(D,E))/2 = 6
dist(AB,C),F
=(dist (A,B)F
+dist CF
)/2 = 8
1
1
1
(AB,C) (D,E)
(D,E) 6
F
8
8
BC)/2
BD)/2
BE)/2
BF)/2
dist(D,E),(A,B)=(dist D(A,B)+dist E(A,B))/2 = 6
dist(D,E),C
=(dist DC
+dist EC
)/2 = 6
dist(D,E),F
=(dist DF
+dist EF
)/2 = 8
E
1
AC+dist
AD+dist
AE+dist
AF+dist
A
B
C
D
E
1
1
dist(ABC,DE),F=(dist (AB,C)F+dist(D,E)F)/2=8
2
1
2
(ABC,DE)
F
8
1
2
4
A
B
C
D
E
F
UPGMA
Problem :
If the mutation rate is different between branches, UPGMA can give a completely false
topology
Tree obtained by UPGMA
Eg: mutation rate of B is much faster than A
2
1
2
1
3
0.75
2.5
1.5
2.5
4.75
Initial matrix
B
C
D
E
F
A
5
4
7
6
8
A
C
B
D
E
F
False
topology !!
Clustering by UPGMA
B C D E
7
10 7
9 6 5
11 8 9 8
Julie Thompson -IGBMC
B
D
E
F
A,C B D E
6
7 10
6 9 5
8 11 9 8
A,C B D,E
B
6
D,E 6.5 9.5
F
8 11 8.5
AC,B D,E
D,E 8
F
9.5
9.5
ACB,DE
F
9.5
Neighbor-Joining (NJ)

Reference:
Saitou & Nei Mol. Biol. Evol. 4(4):406-25 1987

Method:


The 2 TUs to be clustered are chosen such that they minimise the
sum of lengths of all branchs
The calculation of branch lengths takes into account the mean
distance between each sequence and all other sequences
=> allows different mutation rates between branchs
Julie Thompson -IGBMC
Neighbor-Joining (NJ)
Estimate sum of branch lengths for all possible clusters (S12,S13,…)
=> Choose cluster that minimises sum of branch lengths


How to calculate the sum of branch lengths ?
8
For a star-shaped tree, the sum of N branch lengths :
1
7
N
S0 =  LiX = (  Dij ) / (N-1)
i =1
i<j
Where Dij is distance between TUi and TUj
LiX is length between node i and node X
Julie Thompson -IGBMC
2
X
3
4
6
5
Neighbor-Joining (NJ)
Assume TUs 1 and 2 have been joined, then sum of
branch lengths :
8
7
1
N
S12 = (L1X+ L2X) + LXY + 
i =3
LiY
X
6
Y
5
2
4
N
S12 = D12 + LXY +(  Dij) / (N-3)
3
3 i< j
Calcul de LXY :
LXY =
8
N
N
1 [ (D +D ) – (N-2) (L +L ) –2 L ]
1X
2X
iY
2(N-2) k =3 1k 2k
i =3
X
In the example :
L1X and L2X have been counted 6 times each.
Lengths L3Y, …, L8Y have been counted twice each.
Length LXY has been counted 12 times.
S12 = 1
2(N-2)
Y
5
3
1
 ( D1k+D2k ) +1/2 D12 + N-2
 Dij
Julie Thompson -IGBMC
6
2
N
k =3
7
1
3 i< j
4
Neighbor-Joining (NJ)
estimate sum of branch lengths for all possible clusters (S12,S13,…)
B
C
D
E
F
A
5
4
7
6
8
B C D E
7
10 7
9 6 5
11 8 9 8
Application :
SAB = 1/8*62 + 5/2 + 1/4*43 = 21,00
SAC = 1/8*54 + 4/2 + 1/4*52 = 21,75
…
Julie Thompson -IGBMC
S12 = 1
2(N-2)
N
1
 ( D1k+D2k ) +1/2 D12 + N-2
 Dij
k =3
3 i< j
Cluster AB
Neighbor-Joining (NJ)
8
1
7
Branch lengths are estimated by Fitch, Margoliash method (67)
L1X = (D12 + D1Z – D2Z) / 2
X
L2X = (D12 + D2Z – D1Z) / 2
6
Y
5
2
3
D1Z represents mean distance between 1 and all other sequences
D2Z represents mean distance between 1 and all other sequences
Application :
B
C
D
E
F
A
5
4
7
6
8
B C D E
7
10 7
9 6 5
11 8 9 8
LAX = (5+6.25-9.25)/2 =
1
LBX = (5+9.25-6.25)/2 = 4
A
B
=> Allows different mutation rates between branchs
Julie Thompson -IGBMC
4
Tree reconstruction methods

Distance based methods


UPGMA
Neighbor-Joining
=> relatively fast and simple
=> but, the sequences themselves are not taken into account

Character based methods

Each position in the sequence is considered as a trait or character.



maximum parsimony
maximum likelihood
Bayesian inference
=> calculation time is very long
Julie Thompson -IGBMC
Maximum Parsimony
Used historically to study morphological characters
 prefer evolutionary scenario that involves the smallest number of events
Application to sequences:
one column in the alignment = one character
Search for most parcimonious trees, i.e. those for which the topology involves
minimum number of substitutions
Steps :
1) Search for all possible topologies
2) either : find smallest no. of substitutions for each topology
or : find informative sites => sites that support one topology
Julie Thompson -IGBMC
Maximum Parsimony
First step : find all possible topologies
Seq1
Seq2
Seq3
Seq4
AAGAGTGCA
AGCCGTGCG
AGATATCCA
AGAGATCCG
1
No. of possible topologies (unrooted trees) : 3
2
AAGAGTGCA
Seq1
AGCCGTGCG
Seq2
Seq4
Seq3
AGAGATCCG
AGATATCCA
Julie Thompson -IGBMC
3
AAGAGTGCA
Seq1
AGATATCCA
Seq2
Seq4
AGCCGTGCG
AGAGATCCG
AAGAGTGCA
Seq1
Seq3
AGATATCCA
Seq3
AGCCGTGCG
Seq2
Seq4
AGAGATCCG
Maximum parsimony
Second step : find smallest no. of substitutions for a given
topology (Fitch method)
1) Root the tree (anywhere)
3) From root to leaves:
AAGAGTGCA
AGATATCCA
Seq1
Seq3
AGA?ATCC?
AG??GTGC?
Seq2
Seq4
AGCCGTGCG
AGAGATCCG
- Choose one nucleotide x from set N of root n
- For child node u, choose one nucleotide :
x if x  U
any nucleotide from set U otherwise
G
2) From leaves to root :
{G}
{A,G}
G
G
{G}
Seq 1 Seq 2
A
Nodes u and v are children of n
U,V and N are sets of nucleotides associated
with these nodes
G
Seq 3 Seq 4
G
Julie Thompson -IGBMC
G
N= UV if U V ≠ø
N= UV otherwise
Seq 1 Seq 2
A
G
Seq 3 Seq 4
G
1 substitution
G
Maximum parsimony
Comparison of « length » of tree
for the different topologies
1
AAGAGTGCA
Seq1
AGATATCCA
Seq3
4
AG??GTGC? 4
AGA?ATCC?
2
Seq2
Seq4
AGCCGTGCG
AGAGATCCG
10 substitutions
2
AAGAGTGCA
AGCCGTGCG
Seq1
Seq2
1
AGA??T?C? 6
AGA??T?C?
5
Seq3
AGATATCCA
Seq4
AGAGATCCG
12 substitutions
Julie Thompson -IGBMC
3
AAGAGTGCA
Seq1
AGA??T?CA 5
Seq3
AGATATCCA
AGCCGTGCG
Seq2
2
AGA??T?CG
4
Seq4
AGAGATCCG
11 substitutions
Search for shortest tree
Branch-and-bound Method (Hendy & Penny, 1982)
Exact algorithm, guaranteeing the optimal solution without exhaustive searching
1) add sequences one by one (following order of alignment)
calculate no. of substitutions for each topology
2) explore different topologies
If no. of substitutions during addition > no. found for best topology  stop
E
A
A
A
B
B
A
C
C
B
D
A
B
D
C
A
B
C
D
D
D
A
B
C
E
D
B
C
B
E
A
B
D
C
E
A
Julie Thompson -IGBMC
A
C
E
B
C
D
Search for shortest tree
Heuristic methods
Can handle more squences and longer sequences, but do not guarantee
optimal solution
1) construct initial tree by progressively adding TUs
2) then, rearrange initial tree to reduce length (branch swapping)
subtree pruning and regrafting
nearest neighbor interchanges
A1
A2
A3
A2
tree bisection and reconnection
A1
A3
subtree
A5
A5
A4
A4
A4
result depends on
initial order of sequences
A3
A7
A2
A1
Julie Thompson -IGBMC
A4
A6
A6
A3
A5
A7
A2
A1
 perform several tests with different
initial ordering : option « jumble »
A5
Maximum parsimony
Variant :
- find all possible topologies
- only include informative sites
=> sites that support one topology
Seq1
Seq2
Seq3
Seq4
A
A
A
A
A
G
G
G
G
C
A
A
1
Seq1
Seq3
Seq4
Seq2
1
Julie Thompson -IGBMC
A
C
T
G
G
G
A
A
T
T
T
T
1
1
Seq1
Seq4
G
G
C
C
C
C
C
C
A
G
A
G
3
Seq2
Seq3
2
Seq1
Seq3
Seq2
3
Seq4
Maximum parsimony
Yang J. Mol. Evol.42:294-307 (1996)
Human
Chimpanzee
Topology 1
(17 sites)
Julie Thompson -IGBMC
Gorilla
Human
Orangutan
Orangutan
Gorilla
Chimpanzee
Topology 2
(9 sites)
Gorilla
Human
Chimpanzee
Orangutan
Topology 3
(13 sites)
Maximum parsimony
Problem: long branch attraction
C
1
3
C
3
1
C
C
A
True tree
T
T 2
Parsimonious tree
G
4 G
4G
If the sequences are evolving at very different rates, the probability of convergent substitutions is
significant in long branchs
T 2
=> The most parsimonious clustering can lead to false topology





All changes occur at equal rates
no correction for multiple substitutions
Can lead to several equally parsimonious trees
Relatively slow, not suitable for a large number of sequences
No information about branch lengths (generally)
Julie Thompson -IGBMC
Maximum likelihood


maximizes the probability that a given tree could have
produced the observed data (that is, the likelihood)
As for parsimony method :




Differences :



Each column is considered to be a character
All possible trees are considered
For trees requiring many mutations, probability is low
=> trees requiring few mutations are preferred
Use of an explicit evolutionary model
Allows variable substitution rates for each branch
Can be used to estimate reliability of tree
Julie Thompson -IGBMC
Maximum likelihood
3 possible unrooted trees
Alignment of 4 sequences
1234
TTGC...
TTGC...
ATAC...
GTAC...
Seq a
Seq b
Seq c
Seq d
a
c a
b
a
b
b
d c
d
d
c
tree 1
tree 2
tree 3
tree 1
T
a
T A
b c
G
d
L3
L4 L5
L6
For position 1, possible combinations of bases at each node :
- 3 internal nodes
- 4 possible bases at each node
L2
L1
No. of possible combinations: 4 x 4 x 4 = 64
L0
Julie Thompson -IGBMC
Maximum likelihood
Estimation of likelihood of tree 1 for position 1 :
sum of probabilities for each of 64 combinations
Example :
T
a
T A
b c
L3
L4 L5
T
L1
G
d
L6
G
L2
T
L0
Estimation de likelihood of tree 1 :
Sum of probabilities obtained for each position
The calculation is performed for all possible tree (3 here).
The tree with the maximum likelihood is selected.
No of
Example of evolutionary model :
L0 = frequency of T ~ 0.25
L2 = probability of transversion of T=>G
L5 = probability of transition of G=>A
L1, L3, L4, L6 = ~1
Probability of this combination for position 1 :
L= L0 x L1 x L2 x L3 x L4 x L5 x L6
Julie Thompson -IGBMC
3
4
5
6
7
8
9
10
50
No of possible
unrooted trees
1
3
15
105
945
10 395
135 135
2 027 025
>3.1074
Bayesian inference

Use Bayes's theorem to combine the prior probability of a phylogeny with
the likelihood to produce a posterior probability distribution on trees
Pr[Data | Tree]x Pr[Tree]
Pr[Tree |Data] 
Pr[Data]
• Prior probability of a tree before observations are made
(generally, all trees are equally probable)
• Likelihood is proportional to probability of the observations,
conditional on the tree (makes assumptions about processes
generating observations)
• Posterior probability of a tree is the probability of the tree
conditional on the observations, obtained by combining prior and
likelihood for each tree
From Huelsenbeck et al, 2001
Julie Thompson -IGBMC
Holder & Lewis. Nature Rev. Genet. 2003
Bayesian inference



Choose tree with the highest posterior probability as
the best estimate of phylogeny
some numerical methods are available that allow the
posterior probability of a tree to be approximated
e.g. Markov chain Monte Carlo (MCMC) in MrBayes
(Huelsenbeck et al, 2001)
Julie Thompson -IGBMC
Phylogenetic reconstruction

Introduction: basic concepts

Principal methods for tree reconstruction


Distance based methods
Character based methods

Evaluation of the reliability of a tree

The limits of phylogeny

Some programs
Julie Thompson -IGBMC
Evaluation of the reliability of a tree

Goal :
Statistically estimate the reliability of a given tree topology

Example : bootstrapping
Build n pseudo-alignments by random sampling of the columns in the
initial alignment
 each column can be used 0,1 or more times
 the pseudo-alignments have the same length as the initial alignment
 the number of pseudo-alignments should allow for significant
statistical testing (n  number of columns)
 for each pseudo-alignment, build a tree
 for each branch in the initial tree, count the number of times this
branch is found in the n trees
Julie Thompson -IGBMC
Evaluation of the reliability of a tree
Initial alignment
Seq
Seq
Seq
Seq
A
B
C
D
A
A
A
A
G
G
G
T
G
G
C
T
C
T
C
T
T
T
C
C
C
C
C
C
C
G
G
G
A
A
A
A
A
A
A
A
A
A
A
A
C
A B C
B 2
C 3 3
D 6 4 4
B
C
D
A
1
2
Seq
Seq
Seq
Seq
Seq
Seq
Seq
Seq
A
B
C
D
A
B
C
D
G
G
G
T
A
A
A
A
G
G
C
T
G
G
C
T
T
T
C
C
T
T
C
C
T
T
C
C
C
C
C
C
T
T
C
C
C
C
C
C
T
T
C
C
C
G
G
G
C
G
G
G
C
G
G
G
A
A
A
A
A
A
A
C
A
A
A
A
A
A
A
C
A
A
A
C
A
A
A
C
A B C
B 1
C 6 5
D 8 7 4
A B C
B 2
C 4 2
D 7 5 3
B
C
D
A
3
A
B
C
D
A
A
A
A
G
G
G
T
T
T
C
C
T
T
C
C
C
C
C
C
C
C
C
C
C
G
G
G
A
A
A
A
A
A
A
C
Random sampling
Julie Thompson -IGBMC
A
A
A
C
A B C
B 1
C 3 2
D 6 5 3
A
B
B
C
C
D
D
A
Seq
Seq
Seq
Seq
3/3
B
C
D
L2P AERPERN
Archaea
L2P SCHPOMB
26
92
L2P SACCER
26
73
L2P AEDALBO
98
100
L2P HOMSAPI
100
100
Eucarya
L2P XENLAEV
L2P NICTAB
100
L2P ARATHA
L2P METJANN
15
L2P PYRHORI
33
L2P METTHER
34
34
Archaea
L2P HALHALO
51
L2P ARCFULG
L2P MYCLEPR
100
L2P MYCBOVI
100
L2P MYCTUBE
L2P MYCCAPR
61
L2P MYCGALL
100
100
L2P MYCGENI
100
58
100
L2P MYCPNEU
L2P BACSTEA
bootstrap value of a node
is the percentage of
times that node is
present in the trees built
from the random samples
100
32
L2P BACSUBT
Phylogenetic tree of the
ribosomal proteins L2p
L2P HELPYLO
100
L2P HELPYL2
24
L2P THEMARI
76
L2P AQUAEOL
100
L2P AQUPYRO
L2P CHLPNEU
9
100
L2P CHLTRAC
L2P RICPROW
30
48
L2P HAEINFL
100
22
L2P YERPSEU
100
L2P ESCCOLI
L2P BORBURG
Julie Thompson -IGBMC
50
L2P TREPALL
Bacteria
Phylogenetic reconstruction

Introduction: basic concepts

Principal methods for tree reconstruction


Distance based methods
Character based methods

Evaluation of the reliability of a tree

The limits of phylogeny

Some programs
Julie Thompson -IGBMC

Phylogenetic tree-building models make certain assumptions:
The sequences are homologous (descended from a shared ancestral sequence)
Each of the sequences has a common phylogenetic history with the other sequences
At each position in the alignment, the characters are homologous with each other
The sequence variability in the sample contains phylogenetic signal adequate to
resolve the problem under study.

Careful selection of sequences and evolutionary model
Julie Thompson -IGBMC
The limits of molecular phylogeny

No algorithm is perfect



It is never certain that the reconstructed tree is the real one!
The same data can result in different trees depending on the algorithm used
The evolutionary history of a gene is not always
transposable to the species



Not all genes evolve at the same rate (different selection pressure)
Horizontal transfers
paralogy
Julie Thompson -IGBMC
Phlogenetic tree of RAD25 and
RAD25-like proteins
Phlogenetic tree of RAD25 and
RAD25-like proteins
- Neighbor-Joining -
- Maximum parsimony -
100
100
73
94
54
93
100
99
65
100
100
10
0
100
100
100
66
87
100
91
100
60
100
96
Julie Thompson -IGBMC
YERPEST
Hypothetical
SALENTE
GUITHE
CAEELE
GEOCYDO
DROMEL
HOMSAP
CIOINT
XPB/RAD25
SCHPOM
SACCER
ARATHAL
DICDISC
TREPAL
MYCLEP
100
MYCTUB
AERPER
SULTOK 1
SULSOL 1
HALOSP
SYNSP
THEVOL
RAD25 like
THEACI
SULTOK
SULSOL
METTHE
Eucarya
PYRABY
Archaea
ARCFUL
Bacteria
Warning: no branch lengths
SALENTE
YERPEST
GUITHE
GEOCYDO
HOMSAP
CIOINT
CAEELE
DROMEL
DICDISC
ARATHAL
SACCER
SCHPOM
TREPAL
MYCTUB
MYCLEP
SYNSP
HALOSP
AERPER
SULSOL 1
SULTOK 1
THEACI
THEVOL
METTHE
SULSOL
SULTOK
ARCFUL
PYRABY
lysophospholipase
• duplications
• losses
• transfers
Bacteria
Archaea
Eukaryotes
Julie Thompson -IGBMC
Some programs…

Software suites :

Phylip (very comprehensive)
http://evolution.genetics.washington.edu/phylip.html

PAUP (Phylogenetic Analysis Using Parsimony)
http://www.lms.si.edu/PAUP/about.html

TREE-PUZZLE
http://www.tree-puzzle.de/

phylowin (graphical interface)
http://pbil.univ-lyon1.fr/software/phylowin.html

Visualisation, manipulation


Njplot, baobab, treeedit, phylodendron…
Treeview
http://taxonomy.zoology.gla.ac.uk/rod/treeview.html
Julie Thompson -IGBMC
Some programs…
Interactive Tree Of Life (iTOL)
an online tool for phylogenetic tree display
and annotation.
http://itol.embl.de/
Letunic I, Bork P. Bioinformatics 2007
Julie Thompson -IGBMC

Zertz

Transcript Zertz

Directory