
Chapter 10
Phylogenetic Basics
Molecular evolution and molecular phylogenetics
•Similarities and divergence between biological sequences are often
represented by phylogenetic trees
•Phylogenetics is the study of the evolutionary history of organisms
•Based on fossil data in the Victorian era, but more recently on
molecular data
•Sequences in biological polymers provide a history of changes
•Advantages of molecular phylogenetics:
•Molecular data more numerous than fossils
•No sampling bias involved
•More robust phylogenetic trees can be constructed
Major assumptions
•Sequences used must be homologous
•Phylogenetic divergence is assumed to be bifurcating (=forking)
•Each position in the sequence evolved independently
•Variability is informative enough to construct unambiguous trees
Terminology
[Figure: example tree of taxa A, B, C and D with labels for clade, monophyletic, taxon, node, dichotomy, branch, polytomy, lineage and root node]
•Unrooted tree
•No knowledge of common ancestor
•Relative relationships
•No evolutionary direction
•To root an unrooted tree:
•Use an outgroup (a distant relative; e.g. a bird for a mammal tree)
•Midpoint rooting (root at the midpoint between the two
most divergent groups)
[Figure: unrooted and rooted versions of a tree of taxa A, B, C and D]
Gene phylogeny versus species phylogeny
•The objective of constructing molecular phylogenetic trees is to
reconstruct the evolutionary history and relationships between
species or organisms
•The rate at which a gene evolves may not mirror that of a species
•Genes may arrive by horizontal transfer
•An internal node in a molecular phylogenetic tree represents a gene
duplication, whereas in a species phylogenetic tree it represents a
speciation event
•Getting an accurate species phylogeny from molecular data
requires phylogenetic analysis of several gene or protein families
Forms of tree representation
[Figure: the same tree of taxa A, B, C, D and E drawn as a non-scaled cladogram and as a scaled phylogram]
Newick format
Topology only (cladogram):
(((B,C),A),(D,E))
With branch lengths (phylogram):
(((B:1,C:2),A:2),(D:1.2,E:2.4))
[Figure: the corresponding trees of taxa A, B, C, D and E]
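To make the bracket notation concrete, here is a minimal sketch of a recursive reader for Newick strings like the two on this slide. The function names (`parse_newick`, `leaf_names`) and the dictionary layout are illustrative assumptions, not part of any library. The sketch accepts the trailing semicolon that standard Newick files carry, and it does not handle whitespace, quoted labels or comments.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts (hypothetical layout):
    {"name": str, "length": float or None, "children": [subtrees]}."""
    s = text.strip().rstrip(";")  # the terminating ";" is optional here
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(s) and s[pos] == "(":
            pos += 1  # consume "("
            children.append(parse_node())
            while pos < len(s) and s[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if pos >= len(s) or s[pos] != ")":
                raise ValueError("unbalanced parentheses in Newick string")
            pos += 1  # consume ")"
        # Optional label (leaf names; internal nodes are usually unlabeled)
        start = pos
        while pos < len(s) and s[pos] not in ",():":
            pos += 1
        name = s[start:pos]
        # Optional branch length after ":"
        length = None
        if pos < len(s) and s[pos] == ":":
            pos += 1
            start = pos
            while pos < len(s) and s[pos] not in ",()":
                pos += 1
            length = float(s[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if pos != len(s):
        raise ValueError("unexpected trailing characters in Newick string")
    return root


def leaf_names(node):
    """Return the leaf labels in left-to-right order."""
    if not node["children"]:
        return [node["name"]]
    names = []
    for child in node["children"]:
        names.extend(leaf_names(child))
    return names


tree = parse_newick("(((B:1,C:2),A:2),(D:1.2,E:2.4))")
print(leaf_names(tree))  # ['B', 'C', 'A', 'D', 'E']
```

The same reader handles the topology-only string; branch lengths are then simply `None`.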
Finding a tree may be difficult
Number of possible tree topologies is a function of the number of taxa
Rooted trees:
$N_R = \frac{(2n-3)!}{2^{n-2}(n-2)!}$
Unrooted trees:
$N_U = \frac{(2n-5)!}{2^{n-3}(n-3)!}$
where $n$ is the number of taxa
[Figure: number of rooted and unrooted tree topologies (log scale, 1.E+00 to 1.E+22) plotted against the number of taxa (1 to 19)]
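A short sketch of how fast these counts grow, using the two factorial formulas on this slide with exact integer arithmetic. The function names are illustrative assumptions.

```python
from math import factorial


def rooted_trees(n):
    """Number of rooted bifurcating topologies for n taxa (n >= 2)."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n):
    """Number of unrooted bifurcating topologies for n taxa (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (4, 5, 10):
    print(n, rooted_trees(n), unrooted_trees(n))
# 4 15 3
# 5 105 15
# 10 34459425 2027025
```

With 20 taxa the rooted count already exceeds 10^21, which is why exhaustive search over all topologies is only practical for small data sets.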
Procedure to construct a tree
•Choosing molecular markers
•Performing multiple sequence alignment
•Choosing a model of evolution
•Determining a tree-building method
•Assessing tree reliability
Choice of molecular markers
•DNA retains smaller changes (only 4 nucleotides)
•To study closely related organisms, use DNA
•For human population studies, use non-coding mitochondrial sequences
•For more widely divergent groups, use rRNA or protein sequences
•Comparing bacteria with eukaryotes, use conserved protein sequences
•Proteins are more conserved due to the degeneracy of codons
•Different evolutionary rates between nucleotides in codons
•DNA sequences biased because of codon preferences
•Two random DNA sequences will have 50% identity if gaps are allowed
•Random protein sequences only 10% identity
•Gaps in protein coding sequences are biologically meaningless
•Protein-based phylogeny preferable to nucleotide-based phylogeny
•DNA provides data on synonymous and non-synonymous substitution that
provides information on positive and negative selection
Alignment
•Correct alignment is crucial; otherwise there will be errors in the trees
•Use a modern package such as T-Coffee
•Manual verification and editing are essential
•Secondary structure can serve as a guide in alignment (Praline)
•Non-homologous regions may have to be removed (subjective)
•Remove indels
•Gap regions may belong to signature indels and contain
phylogenetic information
Multiple substitutions
The number of differences between two aligned sequences is an indication
of their evolutionary distance … or is it?
What about A->T->G->C?
G->C->G?
Such multiple substitutions and convergences obscure true evolutionary
distances
Known as homoplasy
Need statistical models to correct for homoplasy
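Before any correction can be applied, the observed differences have to be counted. The sketch below does this for two aligned DNA sequences and splits the differences into transitions (purine to purine or pyrimidine to pyrimidine) and transversions. The function name and the choice to skip columns containing gaps or ambiguous bases are assumptions made for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def observed_differences(seq_a, seq_b):
    """Return (transition frequency, transversion frequency) for two
    aligned DNA sequences, ignoring columns with gaps or non-ACGT symbols."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return transitions / compared, transversions / compared


ti, tv = observed_differences("AACCGGTTAA", "GACTGGTAAA")
print(ti, tv, ti + tv)  # two transitions and one transversion in 10 sites
```

These raw frequencies only count visible changes; a site that changed twice, or changed and reverted, contributes at most one difference, which is what the correction models address.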
Jukes-Cantor Model
Assumes all substitutions occur with same probability
dAB = -(3/4)ln[1-(4/3)AB]
dAB is evolutionary distance
AB observed sequences difference
Two 10 nucleotide sequences that differ at three nucleotides:
AB = 0.3
dAB = -(3/4)ln[1-(4/3)0.3] = 0.38
Mostly for closely related sequences
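The worked example on this slide can be checked with a few lines of code. This is a minimal sketch; the function name and the argument `p` (the observed proportion of differing sites) are naming assumptions.

```python
from math import log


def jukes_cantor(p):
    """Jukes-Cantor distance from the observed proportion of differences p."""
    if not 0 <= p < 0.75:
        # At p >= 0.75 the logarithm is undefined: the sequences look random.
        raise ValueError("p must be in [0, 0.75)")
    return -0.75 * log(1 - 4 * p / 3)


print(round(jukes_cantor(0.3), 2))  # 0.38
```

The corrected distance (0.38) is larger than the observed difference (0.30), and the gap widens as `p` approaches 0.75, which is one reason the model is mostly used for closely related sequences.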
Kimura Model
dAB = -(1/2)ln(1-2 ti-tv)-(1/4)ln(1-2 tv)
dAB evolutionary distance between two aligned sequences A and B
ti observed frequency for transition
tv observed frequency for transversion
If 30% difference is due to 20% transitions and 10% transversion:
dAB = -(1/2)ln(1-2.0.2-0.1)-(1/4)ln(1-2.0.1) = 0.4
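The same check for the Kimura distance, as a minimal sketch. The function name and the arguments `ti` and `tv` (observed transition and transversion frequencies) are naming assumptions.

```python
from math import log


def kimura_2p(ti, tv):
    """Kimura two-parameter distance from observed transition (ti)
    and transversion (tv) frequencies."""
    a = 1 - 2 * ti - tv
    b = 1 - 2 * tv
    if a <= 0 or b <= 0:
        raise ValueError("frequencies too large for the correction to apply")
    return -0.5 * log(a) - 0.25 * log(b)


print(round(kimura_2p(0.2, 0.1), 2))  # 0.4
```

For the same 30% observed difference, this gives a slightly larger distance than Jukes-Cantor (about 0.40 versus 0.38) because transitions and transversions are treated separately.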
For protein sequences, a PAM substitution matrix that
includes evolutionary information can be used
Kimura model for proteins:
$d = -\ln(1 - p - 0.2p^2)$, where $p$ is the observed pairwise distance
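A minimal sketch of the protein formula above, with an illustrative value of 30% observed amino acid difference. The function name and the example value are assumptions, not taken from the slides.

```python
from math import log


def kimura_protein(p):
    """Kimura's protein distance from the observed proportion of
    differing amino acid positions p."""
    x = 1 - p - 0.2 * p ** 2
    if x <= 0:
        raise ValueError("p too large for the correction to apply")
    return -log(x)


print(round(kimura_protein(0.3), 3))
```

As with the nucleotide models, the corrected distance is larger than the observed difference, and the correction grows with `p`.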
Among site variation
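In DNA, the mutation rate differs by codon position
In proteins there are functional constraints
A proportion of positions are invariant, and the others evolve at variable rates
The rates at the variable sites follow a gamma ($\Gamma$) distribution
$\Gamma$-corrected Jukes-Cantor:
$d_{AB} = \frac{3}{4}\alpha\left[\left(1 - \frac{4}{3}p_{AB}\right)^{-1/\alpha} - 1\right]$
$\Gamma$-corrected Kimura:
$d_{AB} = \frac{\alpha}{2}\left[(1 - 2p_{ti} - p_{tv})^{-1/\alpha} + \frac{1}{2}(1 - 2p_{tv})^{-1/\alpha} - \frac{3}{2}\right]$
where $\alpha$ is the shape parameter of the gamma distribution, $p_{AB}$ is the observed sequence difference, and $p_{ti}$ and $p_{tv}$ are the observed transition and transversion frequencies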