consensus tree

Bioinformatics
2011
Molecular Evolution
Revised 29/12/06
• Phylogeny is the inference of evolutionary relationships
• All forms of life share a common origin. The goals are:
– to deduce the correct trees for all species of life
– to estimate the time of divergence between organisms, i.e. the
time since they last shared a common ancestor
Terminology
• Phylogenetic trees are used to assess the relationships of
homologous proteins (or nucleotide sequences) in a family
Clade
Bifurcating node
Branch
OTU or external node
Internal node
Phylogram
Species tree versus gene tree
• In a species tree an internal node represents a speciation
event
• In a gene tree an internal node represents the divergence of
an ancestral gene into two new genes with distinct sequences
• A species tree can differ from a gene tree, for example because of:
– horizontal gene transfer
– gene duplications
Species tree versus gene tree (figure after Gray et al.)
Phylogenetic inference
1. Selection of sequences for analysis
2. Multiple sequence alignment
3. Tree building
4. Tree evaluation
Phylogenetic inference
1. selection of sequences for analysis
DNA:
– Higher phylogenetic signal:
• synonymous vs nonsynonymous substitutions
(detect negative and positive selection)
Protein:
– Phylogenetic signal weaker than in DNA
– Better suited to constructing a tree for evolutionarily distant
species or genes
RNA: rRNA is often used for constructing species trees
Phylogenetic inference
2. multiple sequence alignment
• This is a critical step in the analysis, as in many cases the alignment
of amino acids or nucleotides in a column implies that they share a
common ancestor
• If you misalign a group of sequences you will still be able to produce
a tree. However, it is not likely to be biologically meaningful.
Crap in, crap out!
• Inspect the alignment to be sure that all sequences are homologous
• Sometimes distantly related sequences are not well aligned by
ClustalW. Try different gap opening and gap extension penalties to
improve the alignment
• Only use those columns of the multiple alignment for which you
have data for all organisms or sequences. Delete the columns for
which this is not the case.
• Delete columns with gaps
Phylogenetic inference
3. Tree building
Character-based methods:
– based on an explicit model of evolution: maximum likelihood
methods, Bayesian methods
– not based on an explicit model of evolution: maximum parsimony
methods
Non-character based methods:
– based on an explicit model of evolution: pairwise distance
methods
Distance based methods
Distance based methods:
– calculate the distances between molecular sequences
using some distance metric
– A clustering method (UPGMA, neighbour joining) is used
to infer the tree from the pairwise distance matrix
– treat the sequence from a horizontal perspective, by
calculating a single distance between entire sequences
Advantages:
• Fast
• Allow the use of evolutionary models
Disadvantage:
• each pair of sequences is reduced to one number
Character based methods
Character based methods:
– treat the sequences from a vertical perspective
– for each column of the alignment, they search for the
simplest explanation of how the characters evolved.
– For instance, MP involves a search for the tree with the
fewest amino acid (or nucleotide) character
changes that account for the observed differences
between the protein (gene) sequences.
Phylogenetic inference
4. Tree evaluation: bootstrapping
• a sampling technique for estimating the statistical error in situations
where the underlying sampling distribution is unknown
• used to evaluate the reliability of the inferred tree, or better, the
reliability of specific branches
How to proceed:
• From the original alignment, columns are chosen at random by
sampling with replacement
• a new alignment is constructed with the same size as the original
one
• a tree is constructed from this new alignment
This process is repeated hundreds of times
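The resampling step can be sketched in a few lines of Python. This is a minimal illustration, not a full bootstrap analysis: the function names, the toy alignment and the fixed random seed are assumptions made for the example, and the tree-building step that would follow each replicate is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    The replicate has the same number of columns as the original.
    """
    n_cols = len(alignment[0])
    picked = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[c] for c in picked) for seq in alignment]


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Build n_replicates pseudo-alignments (seed fixed for reproducibility)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Toy alignment (hypothetical data): three sequences, eight columns.
alignment = ["ACGTACGT", "ACGTACGA", "ACCTACGA"]
replicates = bootstrap_replicates(alignment, n_replicates=100)
print(len(replicates), len(replicates[0][0]))
```

Each replicate would then be passed to the same tree-building method as the original alignment, and the fraction of replicate trees containing a given branch is its bootstrap value.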
Phylogenetic inference
Show bootstrap values on phylogenetic trees
• majority-rule consensus tree
• map bootstrap values on the original tree
Maximum parsimony
Principle
• Select the tree that minimizes the total tree length, i.e. the
number of nucleotide substitutions or amino acid
replacements required to explain a given set of data.
Method
• a particular topology is considered
• for this topology, the ancestral sequences at each branching
point are reconstructed
• the minimum number of events to explain the sequence
differences over the whole tree is computed: the minimum
number of substitutions is computed for each nucleotide (or
amino acid) site, and the numbers for all sites are added.
• another tree topology is chosen and the procedure is repeated
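One common way to compute the minimum number of changes for a fixed topology is Fitch's algorithm; the slides do not name a specific algorithm, so its use here is an assumption. The sketch below represents a rooted binary tree as nested tuples of OTU names, and the sequences are made-up toy data.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one column."""
    if isinstance(tree, str):
        # Leaf: the only possible state is the observed character.
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # The children disagree: one substitution is required at this node.
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, seqs):
    """Sum the per-site minimum numbers of changes over all alignment columns."""
    n_sites = len(next(iter(seqs.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in seqs.items()}
        total += fitch_site(tree, column)[1]
    return total


# Hypothetical aligned sequences for four OTUs.
seqs = {"A": "ATCA", "B": "ATTA", "C": "GTCC", "D": "GTTC"}
topologies = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]
for topology in topologies:
    print(topology, tree_length(topology, seqs))  # lengths 4, 5 and 6
best = min(topologies, key=lambda t: tree_length(t, seqs))
```

With only three topologies the search is exhaustive; for realistic numbers of OTUs the candidate topologies would come from a heuristic search instead.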
Maximum parsimony
OTUs   rooted tree topologies   unrooted tree topologies
3      3                        1
4      15                       3
5      105                      15
6      945                      105
7      10395                    945
8      135135                   10395
9      2027025                  135135

Equations, for n OTUs:
$N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}, \quad n \ge 2$
$N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}, \quad n \ge 3$

• Exhaustive search impossible
• Heuristics needed
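The growth in the number of bifurcating topologies can be checked numerically. The sketch below evaluates the closed forms N_R = (2n-3)! / (2^(n-2) (n-2)!) and N_U = (2n-5)! / (2^(n-3) (n-3)!); the function names are chosen for this example only.

```python
from math import factorial


def rooted_topologies(n):
    """Number of rooted bifurcating tree topologies for n OTUs (n >= 2)."""
    if n < 2:
        raise ValueError("need at least 2 OTUs")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_topologies(n):
    """Number of unrooted bifurcating tree topologies for n OTUs (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 OTUs")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in range(3, 10):
    print(n, rooted_topologies(n), unrooted_topologies(n))
```

For n = 6 this gives 945 rooted and 105 unrooted topologies. The unrooted count for n OTUs equals the rooted count for n - 1 OTUs, because rooting an unrooted tree amounts to attaching one extra branch.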
Maximum parsimony
• Find different tree topologies that are 'equally parsimonious'
• Represent results as a consensus tree.
– 'strict' consensus tree
– 'majority-rule' consensus tree
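The difference between the two consensus types can be illustrated by counting clades. In this sketch each tree is reduced to the set of clades (groups of OTUs) it contains, which is a simplification assumed for the example: a strict consensus keeps only clades present in every tree, a majority-rule consensus keeps clades present in more than half of them. The three trees are hypothetical.

```python
from collections import Counter


def clade_counts(trees):
    """Count in how many trees each clade occurs."""
    counts = Counter()
    for clades in trees:
        counts.update(set(clades))
    return counts


def strict_consensus(trees):
    """Clades present in every input tree."""
    counts = clade_counts(trees)
    return {clade for clade, k in counts.items() if k == len(trees)}


def majority_rule_consensus(trees):
    """Clades present in more than half of the input trees."""
    counts = clade_counts(trees)
    return {clade for clade, k in counts.items() if k > len(trees) / 2}


# Three hypothetical equally parsimonious trees on the OTUs A to E.
tree1 = {frozenset("AB"), frozenset("ABC"), frozenset("DE")}
tree2 = {frozenset("AB"), frozenset("ABD"), frozenset("CE")}
tree3 = {frozenset("AB"), frozenset("ABC"), frozenset("DE")}
trees = [tree1, tree2, tree3]

print(strict_consensus(trees))         # only the clade {A, B}
print(majority_rule_consensus(trees))  # {A, B}, {A, B, C} and {D, E}
```

The same counting, applied to bootstrap replicate trees instead of equally parsimonious trees, yields the support values shown on a majority-rule bootstrap consensus.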
Maximum parsimony
Only informative sites of the alignment are used in the
construction of the tree: a site is informative when it contains at
least two different kinds of characters, each represented at least
two times
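The informative-site criterion translates directly into a column filter. The following is a small sketch on a made-up alignment; gap handling is deliberately ignored, which is an assumption of this example.

```python
from collections import Counter


def is_parsimony_informative(column):
    """True if at least two character states each occur at least twice."""
    counts = Counter(column)
    return sum(1 for k in counts.values() if k >= 2) >= 2


def informative_sites(alignment):
    """Indices (0-based) of the parsimony-informative columns."""
    return [
        i for i, column in enumerate(zip(*alignment))
        if is_parsimony_informative(column)
    ]


# Hypothetical alignment of four sequences.
alignment = ["ATCAG", "ATTAG", "GTCCG", "GTTCA"]
print(informative_sites(alignment))  # [0, 2, 3]
```

Column 1 is constant and column 4 has a single deviating character, so neither can favour one topology over another.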
Maximum parsimony
Parsimony trees are usually only represented as a tree topology
(cladogram): sometimes the parsimony program cannot
decide on which branches the substitutions have taken
place, so it cannot calculate branch lengths.
Maximum parsimony
Assumptions
• Equal rate of evolution in all branches
• no correction for multiple mutations, i.e. no substitution
model can be applied (see further)
Advantages
• sequence information is not reduced to one number (such
as for example in pairwise distance methods)
Disadvantages of maximum parsimony methods
• can be slow for very large datasets
• sensitive to unequal rates of evolution in different lineages
(see further) => long-branch attraction
Pairwise distance methods
• Distance calculation
• Inferring the tree topology
Pairwise distance methods
Distance calculation
Approach:
• align pairs of sequences and count the number of differences
(Hamming distance).
• For an alignment of length N with n sites at which there are
differences: D = (n/N) × 100.
Problem:
• observed differences ≠ actual genetic distances between the
sequences.
=> dissimilarity underestimates the true evolutionary
distance, because some of the sequence
positions are the result of multiple events
Solution:
• Use an evolutionary model that corrects for multiple
mutations
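A short sketch of both steps: the observed proportion of differing sites, and a correction for multiple substitutions. The Jukes-Cantor formula d = -(3/4) ln(1 - 4p/3) is used as the example model here; picking that model for the illustration is an assumption. The distance is returned as a fraction, so multiply by 100 for the percentage D.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites that differ (Hamming distance / length)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differences / len(seq_a)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion p."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: sequences are saturated under this model")
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical pair of aligned nucleotide sequences.
p = p_distance("ACGTACGTAC", "ACGTACGTTT")
d = jukes_cantor(p)
print(p, round(d, 4))  # 0.2 0.2326
```

The corrected distance is always at least as large as the observed one, and the gap widens as the sequences become more divergent.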
Other evolutionary models
Unequal mutation rate per position (gamma correction of the
Jukes-Cantor model)
Pairwise distance methods
Tree inference: UPGMA
• UPGMA produces ultrametric trees: rooted trees in which all the end
nodes are equidistant from the root of the tree
• This assumes a molecular clock, i.e. that all sequences evolve at
a similar rate
Pairwise distance methods
Tree inference: WPGMA
• when two OTUs are grouped, we treat them as a new single OTU
• when OTUs A, B (which have been grouped before) and C are grouped into a
new node ‘u’, then the distance from node ‘u’ to any other node ‘k’ (e.g. grouping
D and E) is simply computed as follows:
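Under the standard WPGMA definition, which is assumed here, the distance from the new node 'u' to 'k' is the plain average of the distances from the two merged groups to 'k', regardless of how many OTUs each group contains. UPGMA differs only in weighting each group by its size. A minimal sketch with made-up distances:

```python
def wpgma_update(d_ab_k, d_c_k):
    """WPGMA: plain average of the two merged groups' distances to k."""
    return (d_ab_k + d_c_k) / 2


def upgma_update(d_ab_k, d_c_k, size_ab, size_c):
    """UPGMA: average weighted by the number of OTUs in each merged group."""
    return (size_ab * d_ab_k + size_c * d_c_k) / (size_ab + size_c)


# Hypothetical distances: group (A,B) to k is 10, OTU C to k is 4.
print(wpgma_update(10, 4))        # 7.0
print(upgma_update(10, 4, 2, 1))  # 8.0
```

In WPGMA the single OTU C counts as much as the pair (A,B), so the individual OTUs A and B each contribute less to the average than C does.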
Pairwise distance methods
Tree inference: UPGMA
Advantages:
• Fast
• Allows incorporation of evolutionary models
Disadvantages:
• Assumption of a molecular clock
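A compact sketch of the UPGMA clustering loop follows. The data layout (a dictionary of pairwise distances keyed by OTU pairs), the Newick-like string output and the toy distances are choices made for this example, not a reference implementation.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster OTUs by UPGMA.

    names: list of OTU labels.
    dist:  dict mapping (a, b) label pairs to distances.
    Returns (topology as a nested string, height of each internal node).
    """
    sizes = {name: 1 for name in names}
    d = {frozenset(pair): value for pair, value in dist.items()}
    heights = {}
    while len(sizes) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = f"({a},{b})"
        # Ultrametric tree: the new node sits halfway between a and b.
        heights[merged] = d[frozenset((a, b))] / 2
        # Size-weighted average distance to every remaining cluster.
        for k in sizes:
            if k not in (a, b):
                d[frozenset((merged, k))] = (
                    sizes[a] * d[frozenset((a, k))]
                    + sizes[b] * d[frozenset((b, k))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes)), heights


# Hypothetical ultrametric distances between four OTUs.
dist = {
    ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
    ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10,
}
topology, heights = upgma(["A", "B", "C", "D"], dist)
print(topology)  # (D,(C,(A,B)))
print(heights)
```

Because the example distances are ultrametric, the recovered node heights (1, 3 and 5) reproduce the input exactly; with unequal evolutionary rates the molecular clock assumption is violated and UPGMA can return the wrong topology.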
Pairwise distance methods
Tree inference: neighbor joining
• Additive distances can be fitted to an unrooted tree such that
the evolutionary distance between a pair of OTUs equals the
sum of the lengths of the branches connecting them, rather
than being an average as in the case of cluster analysis
• Tree construction by minimum evolution (ME): the tree that
minimizes the sum of the lengths of the branches is regarded
as the best estimate of the phylogeny
• Drawback of the ME method: in principle all different
tree topologies have to be investigated in order to find the
'minimum' tree.
• The neighbour joining (NJ) method, developed by Saitou and
Nei (1987), offers a heuristic approach to solve this problem
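One NJ step can be sketched as follows, using the commonly used formulation in which the pair (i, j) minimizing Q(i, j) = (n - 2) d(i, j) - r_i - r_j is joined, with r_i the sum of distances from i to all other OTUs. That this is the exact formulation intended by the slides is an assumption, and the five-OTU distance matrix is illustrative data.

```python
from itertools import combinations


def nj_pick_pair(names, d):
    """Select the pair of OTUs to join and the branch lengths to the new node.

    names: list of OTU labels (at least three).
    d:     dict mapping frozenset({a, b}) to the distance between a and b.
    """
    n = len(names)
    # Row sums: total distance from each OTU to all others.
    r = {i: sum(d[frozenset((i, k))] for k in names if k != i) for i in names}

    def q(pair):
        i, j = pair
        return (n - 2) * d[frozenset(pair)] - r[i] - r[j]

    i, j = min(combinations(names, 2), key=q)
    d_ij = d[frozenset((i, j))]
    # Branch lengths from i and j to the new internal node.
    length_i = d_ij / 2 + (r[i] - r[j]) / (2 * (n - 2))
    return (i, j), (length_i, d_ij - length_i)


names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
d = {
    frozenset((names[i], names[j])): matrix[i][j]
    for i in range(len(names))
    for j in range(i + 1, len(names))
}
pair, lengths = nj_pick_pair(names, d)
print(pair, lengths)  # ('a', 'b') (2.0, 3.0)
```

Note that d and e have the smallest raw distance (3), yet a and b are joined first: the Q criterion corrects for how far each OTU is from all the others, which is why NJ does not need a molecular clock. A full NJ run would replace the joined pair by the new node, recompute the distances and repeat until the tree is resolved.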
Pairwise distance methods
Tree inference: neighbor joining
Advantages:
• Fast
• Allows incorporation of evolutionary models
• No assumption of a molecular clock