CMMB: Week 7 Phylogenetics

CMMB: Week 9
Phylogenetics
Todd Scheetz
October 25, 2001
Introduction
Common Terms
General Processes
Types of phylogenetic analyses
PHYLIP and PAUP
Common Terms
Phylogenetics
assessment of the evolutionary relationship between
species, typically utilizing the sequence of a common
molecule.
Dendrogram
Tree-based diagram of phylogenetic structure.
Clade
A group of organisms whose members share homologous
features derived from a common ancestor.
Taxon (pl. taxa)
A category or group, such as a phylum, order, species, etc.
Introduction
The use of phylogenetics is an attempt to determine how the
sequences might have been derived during evolution.
Can be done with either nucleotide or amino acid sequences.
EXAMPLE tree here…
Branching within the tree indicates the apparent relationship
between two sequences. Very similar sequences should be next to
each other in the tree.
Trees
Example
GAATC
GAGTT
GA(A/G)T(C/T)
Rooted vs. unrooted trees
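One way to read the example on this slide: the third string summarizes the two sequences column by column, writing a single base where they agree and both alternatives where they differ. A minimal sketch, assuming that reading; the function name `ambiguity_consensus` is hypothetical and not part of the lecture.

```python
def ambiguity_consensus(seq_a, seq_b):
    """Summarize two aligned sequences column by column.

    A column where both sequences agree is written as that base;
    a column where they differ is written as (X/Y).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    parts = []
    for a, b in zip(seq_a, seq_b):
        parts.append(a if a == b else f"({a}/{b})")
    return "".join(parts)


print(ambiguity_consensus("GAATC", "GAGTT"))  # GA(A/G)T(C/T)
```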
Inherent Assumptions
1. The sequence is correct and originates from the specified
source
2. The sequences are homologous
3. Each position within the alignment is homologous
4. Each sequence has a common phylogenetic history
5. The sampling of taxa is adequate to resolve the problem of
interest.
6. The sequence variability in the sample contains
phylogenetic signal adequate to resolve the problem of
interest.
General Process
The basic process of phylogenetic analysis is
1. Alignment
2. Determining the substitution model
3. Tree building
4. Tree evaluation
What sequence to use?
Before performing the analysis, there is a more fundamental
issue to be addressed…
What sequences to use?
Guidelines
1. Universally present in all organisms to be studied, with
good conservation of sequence amongst many of the
species.
2. Divergent enough to allow grouping the species into a
taxonomic classification.
Alignment
The first step in performing a phylogenetic analysis is to align
the sequences.
Each column within a multiple sequence alignment is referred
to as a site.
Because the sites themselves are effectively assumed to be
homologous (share a common ancestor), they represent a priori
phylogenetic conclusions.
Two major steps…
• selection of alignment procedure
• extracting the phylogenetic data set from the alignment
Alignment Procedure
Computer dependence
unrealistic to do the alignment by hand
Phylogenetic criteria
does the alignment proceed based upon a tree?
EX. clustalw builds a neighbor-joining guide tree to direct the
sequence alignment
Alignment parameter estimation
should vary dynamically depending on evolutionary distance
Aligned features
secondary structure -- requires manual intervention
Mathematical optimization
some programs optimize according to a statistical model, but
this may have unknown effects on further phylogenetic
analysis
Alignment - Extracting Phylogenetic Info
The difficulty here, as we will see on the next page, is that
of length-variable sequences (or alignments).
alignment ambiguities
indels
1. can remove sites with indels (but this loses phylogenetic signal)
2. assign a penalty of 0 to indels (but this leads to incorrect
interpretation)
3. can treat a gap as an additional character
4. treat a gap as a new character (but only count the first
indel in a series)
Often necessary to use alignment surgery.
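Options 1, 3 and 4 for handling indels can be sketched in a few lines. This is only an illustration under stated assumptions: gaps are written as `-`, the alignment is a list of equal-length strings, and all function names and the toy alignment are hypothetical.

```python
def drop_gap_columns(alignment, gap="-"):
    """Option 1: keep only the sites where no sequence has a gap."""
    kept = [col for col in zip(*alignment) if gap not in col]
    if not kept:
        return ["" for _ in alignment]
    return ["".join(row) for row in zip(*kept)]


def count_gap_positions(seq, gap="-"):
    """Option 3: every gap position counts as one extra-character site."""
    return seq.count(gap)


def count_gap_events(seq, gap="-"):
    """Option 4: a run of consecutive gaps counts once (its first position)."""
    events = 0
    previous = ""
    for ch in seq:
        if ch == gap and previous != gap:
            events += 1
        previous = ch
    return events


alignment = ["GAT--CA", "GATTTCA", "GA-TTCA"]
print(drop_gap_columns(alignment))         # only gap-free columns remain
print(count_gap_positions(alignment[0]))   # two gap positions
print(count_gap_events(alignment[0]))      # but a single indel event
```

Option 1 shortens the alignment from seven sites to four in this toy case, which illustrates how quickly phylogenetic signal can be lost.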
Alignment Procedure
Figure 9.3 from Baxevanis
Substitution model
DNA substitution models
Jukes-Cantor - equal, independent probability of every substitution at all sites
Kimura - different rates for transitions versus transversions
transition (purine-purine, pyrimidine-pyrimidine, A-G, C-T)
transversion (purine-pyrimidine, A-C, A-T, G-C, G-T)
Maximum Likelihood - allows for variations in nucleotide
context, and for different rates for transitions versus
transversions.
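The difference between the first two models can be made concrete with their standard distance corrections. A minimal sketch, assuming the usual textbook formulas: Jukes-Cantor d = -(3/4) ln(1 - 4p/3), and Kimura two-parameter d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)), where p is the fraction of differing sites, P the fraction of transitions and Q the fraction of transversions. The function names and the toy sequences are hypothetical.

```python
import math

PURINES = {"A", "G"}


def classify(a, b):
    """Label a pair of bases as 'same', 'transition' or 'transversion'."""
    if a == b:
        return "same"
    # Purine-purine or pyrimidine-pyrimidine changes are transitions.
    if (a in PURINES) == (b in PURINES):
        return "transition"
    return "transversion"


def proportions(seq_a, seq_b):
    """Return (P, Q): fractions of transition and transversion sites."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a in "ACGT" and b in "ACGT"]
    n = len(pairs)
    ts = sum(classify(a, b) == "transition" for a, b in pairs)
    tv = sum(classify(a, b) == "transversion" for a, b in pairs)
    return ts / n, tv / n


def jukes_cantor(p):
    """Jukes-Cantor distance from the fraction p of differing sites."""
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(P, Q):
    """Kimura two-parameter distance from transition/transversion fractions."""
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "GCGTATGTACTTACGTACGT"  # two transitions and one transversion
P, Q = proportions(seq_a, seq_b)
print(jukes_cantor(P + Q), kimura_2p(P, Q))
```

Both corrected distances come out larger than the raw fraction of differing sites, because each model allows for multiple substitutions at the same site.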
Substitution model
Amino Acid substitution models
PAM - uses a PAM001 matrix to create a transition probability
matrix between two sequences.
Kimura - approximates PAM distance as
D = - ln (1 - p - 0.2p^2)
p = fraction of amino acids that differ
Categories (PHYLIP)
1. categories of amino acids
2. selectable transition/transversion rates
3. selectable genetic codes
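The Kimura approximation on this slide is simple enough to compute directly. A minimal sketch, assuming gaps are written as `-` and that gapped columns are skipped; the function name and the example peptides are hypothetical.

```python
import math


def kimura_protein_distance(seq_a, seq_b):
    """Kimura's approximation D = -ln(1 - p - 0.2 p^2).

    p is the fraction of compared amino acid positions that differ.
    Columns with a gap in either sequence are skipped (an assumption).
    """
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    p = sum(a != b for a, b in pairs) / len(pairs)
    inside = 1 - p - 0.2 * p * p
    if inside <= 0:
        # The approximation is undefined for very divergent sequences.
        raise ValueError("sequences too divergent for this approximation")
    return -math.log(inside)


print(kimura_protein_distance("MKTAYIAKQR", "MKTAYIAKQK"))  # p = 0.1
```

The guard matters because the argument of the logarithm reaches zero when p is roughly 0.85, so the formula cannot be applied to very divergent pairs.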
Tree building
Three fundamental strategies
• maximum parsimony
• distance-based
• maximum likelihood
Maximum parsimony attempts to minimize the number of steps
required to generate the observed variations in the sequences.
Distance-based methods utilize distance metrics to determine
“neighboring” sequences.
Maximum likelihood method searches for the evolutionary
model (including the tree) that maximizes the likelihood of
producing the observed data.
Tree building
MAXIMUM PARSIMONY
useful for sequences that are very similar, and for small numbers
of sequences.
evaluates all possible trees…
only informative sites need to be analyzed.
to be informative, a site must have at least two different
characters, each shared by at least two taxa, so that it supports
one tree over another...
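The informative-site rule can be sketched in a few lines. The alignment is the four-taxon, nine-position example used in this lecture; the test applied is the usual one (at least two different characters, each present in at least two taxa), and the function names are hypothetical.

```python
from collections import Counter

# Four taxa, nine aligned positions (the lecture's parsimony example).
ALIGNMENT = {
    "1": "AAGAGTGCA",
    "2": "AGCCGTGCG",
    "3": "AGATATCCA",
    "4": "AGAGATCCG",
}


def is_informative(column):
    """True if at least two characters each occur in at least two taxa."""
    counts = Counter(column)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def informative_sites(alignment):
    """Return the 1-based positions of the informative sites."""
    columns = zip(*alignment.values())
    return [i for i, col in enumerate(columns, start=1) if is_informative(col)]


print(informative_sites(ALIGNMENT))  # [5, 7, 9]
```

Invariant sites (1, 6, 8) and sites where only one character is shared (2, 3, 4) cost the same number of steps on every tree, so they cannot help choose between trees.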
Tree building
MAXIMUM PARSIMONY
Taxa    Sequence Positions
        1  2  3  4  5  6  7  8  9
1       A  A  G  A  G  T  G  C  A
2       A  G  C  C  G  T  G  C  G
3       A  G  A  T  A  T  C  C  A
4       A  G  A  G  A  T  C  C  G
[Figure: the three possible unrooted trees for four taxa: (1,2 | 3,4), (1,3 | 2,4) and (1,4 | 2,3)]
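A minimal sketch of how the three unrooted trees for this table could be scored. It uses Fitch's set-based counting of changes, which is an assumption on my part: the slides do not say how the steps are counted. All names in the code are hypothetical.

```python
# Same four taxa and nine positions as in the table.
SEQS = {1: "AAGAGTGCA", 2: "AGCCGTGCG", 3: "AGATATCCA", 4: "AGAGATCCG"}

# Each tree is written as the two pairs it groups together.
TREES = [((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]


def fitch_join(left, right):
    """Return (state set, extra changes) for the parent of two nodes."""
    common = left & right
    if common:
        return common, 0
    return left | right, 1


def site_cost(tree, site):
    """Minimum number of changes at one site (0-based) on a four-taxon tree."""
    (a, b), (c, d) = tree
    x, cost_x = fitch_join({SEQS[a][site]}, {SEQS[b][site]})
    y, cost_y = fitch_join({SEQS[c][site]}, {SEQS[d][site]})
    _, cost_root = fitch_join(x, y)
    return cost_x + cost_y + cost_root


def tree_cost(tree):
    """Total number of changes over all nine sites."""
    return sum(site_cost(tree, s) for s in range(9))


for tree in TREES:
    print(tree, tree_cost(tree))
```

Under these assumptions the tree that groups taxa 1 and 2 against 3 and 4 needs the fewest changes (10, versus 11 and 12), and the whole difference comes from positions 5, 7 and 9.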
Tree building
DISTANCE-BASED
Fitch-Margoliash
Neighbor Joining
UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
Tree building
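UPGMA worked example: at each step the closest pair is merged

Step 1: original distances (closest pair: D and E, distance 10)
      A      B      C      D
B     22
C     39     41
D     39     41     18
E     41     43     20     10

Step 2: D and E merged into DE (closest pair: C and DE, distance 19)
      A      B      C
B     22
C     39     41
DE    40     42     19

Step 3: C and DE merged into CDE (closest pair: A and B, distance 22)
      A      B
B     22
CDE   39.67  41.67

D and E branch lengths
= 10/2 = 5
distance from C to D and E
= 19/2 = 9.5 (so the branch from C's node to the D/E node is 9.5 - 5 = 4.5)
A and B branch lengths
= 22/2 = 11
[Figure: partial trees. D and E join with branches of length 5 each; C joins the D/E node with a branch of 9.5 to C and 4.5 to the D/E node; A and B join with branches of length 11 each.]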
Tree building
So now we have two composite groups...
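[Figure: the two composite groups. Group 1: D and E (branches of 5 each) joined to C (branch of 9.5, with 4.5 to the D/E node). Group 2: A and B (branches of 11 each).]
To unify these, calculate the average distance between the groups
= (dAC + dAD + dAE + dBC + dBD + dBE)/6
= (39+39+41+41+41+43)/6 = 40.7
Distance to the “Root” of the tree is
= 40.7/2 = 20.35
Branch from the root to the C/D/E node = 20.35 - 9.5 = 10.85
Branch from the root to the A/B node = 20.35 - 11 = 9.35
[Figure: final rooted tree. The root joins the C/D/E group (branch 10.85) and the A/B group (branch 9.35); within the groups, C has a branch of 9.5, the D/E node a branch of 4.5, D and E branches of 5 each, and A and B branches of 11 each.]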
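The worked example can be reproduced with a short UPGMA routine. This is a minimal sketch, not the lecture's own code: the function name `upgma`, the cluster labels and the dictionary layout are hypothetical. Distances between merged groups are size-weighted averages, which is equivalent to averaging over all pairs of original taxa.

```python
from itertools import combinations

NAMES = ["A", "B", "C", "D", "E"]
RAW = {("A", "B"): 22, ("A", "C"): 39, ("B", "C"): 41, ("A", "D"): 39,
       ("B", "D"): 41, ("C", "D"): 18, ("A", "E"): 41, ("B", "E"): 43,
       ("C", "E"): 20, ("D", "E"): 10}
DIST = {frozenset(pair): value for pair, value in RAW.items()}


def upgma(names, dist):
    """Return a list of (merged cluster label, node height) in merge order."""
    clusters = {name: (name,) for name in names}  # label -> member taxa
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # Merge the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2
        members = clusters[a] + clusters[b]
        new = "".join(sorted(members))
        na, nb = len(clusters[a]), len(clusters[b])
        # Distance from the new cluster to each remaining cluster.
        for c in clusters:
            if c in (a, b):
                continue
            d[frozenset((new, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        del clusters[a], clusters[b]
        clusters[new] = members
        merges.append((new, height))
    return merges


for label, height in upgma(NAMES, DIST):
    print(label, round(height, 2))
```

The node heights for DE, CDE and AB match the slide (5, 9.5 and 11). The root comes out at about 20.33 rather than 20.35, because the slide rounds the average distance to 40.7 before halving it.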
Tree building
MAXIMUM LIKELIHOOD
The likelihood for each individual site within the alignment is
calculated, given a particular tree and the overall observed base
frequencies.
The likelihood of the tree is then the product of the likelihoods
at every site.
The run time is MUCH longer for maximum likelihood
analyses.
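The "product over sites" idea can be shown on the smallest possible tree: two sequences joined by a single branch. This sketch assumes the Jukes-Cantor model with equal base frequencies of 1/4 (a simplification of the observed frequencies mentioned on the slide) and works with log-likelihoods, so the product becomes a sum. The function name and the toy sequences are hypothetical.

```python
import math


def jc_log_likelihood(seq_a, seq_b, t):
    """Log-likelihood of two aligned sequences under Jukes-Cantor.

    t is the branch length between them in expected substitutions
    per site and must be greater than 0.
    """
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay   # probability a base is unchanged
    p_diff = 0.25 - 0.25 * decay   # probability of one specific change
    total = 0.0
    for a, b in zip(seq_a, seq_b):
        # Likelihood of one site: base frequency times change probability.
        site_likelihood = 0.25 * (p_same if a == b else p_diff)
        # Summing logs is the same as multiplying the site likelihoods.
        total += math.log(site_likelihood)
    return total


seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "GCGTATGTACTTACGTACGT"  # 3 of 20 sites differ
for t in (0.05, 0.10, 0.17, 0.25, 0.40):
    print(t, round(jc_log_likelihood(seq_a, seq_b, t), 3))
```

Scanning a few branch lengths shows the likelihood peaking near the Jukes-Cantor distance for these sequences. A real maximum likelihood program repeats this kind of evaluation for every candidate tree and every branch, which is why the run time grows so quickly.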
Tree evaluation
There are two basic strategies for evaluating phylogenetic trees
1. Bootstrap: the original data set is replicated many times; each
replicate is created by sampling the original sites randomly (with
replacement).
2. Jackknife: replicates are created by dropping one or more sites within
each replicate.
A third alternative is to verify that the tree structure you obtain
is consistent among the various construction methods.
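Both resampling schemes operate on alignment columns (sites), not on sequences. A minimal sketch, assuming the alignment is a list of equal-length strings; the function names, the seed and the number of dropped sites are hypothetical choices for illustration.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Sample sites with replacement; the replicate keeps the original length."""
    n_sites = len(alignment[0])
    picks = rng.choices(range(n_sites), k=n_sites)
    return ["".join(seq[i] for i in picks) for seq in alignment]


def jackknife_replicate(alignment, n_drop, rng):
    """Drop n_drop randomly chosen sites; the remaining sites keep their order."""
    n_sites = len(alignment[0])
    dropped = set(rng.sample(range(n_sites), n_drop))
    return [
        "".join(ch for i, ch in enumerate(seq) if i not in dropped)
        for seq in alignment
    ]


alignment = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]
rng = random.Random(0)  # fixed seed so the run is repeatable
print(bootstrap_replicate(alignment, rng))
print(jackknife_replicate(alignment, 3, rng))
```

Each replicate is then fed to the same tree building method, and the fraction of replicates that recover a given grouping is used as its support.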
Types of Trees
rooted
unrooted
PHYLIP
Basic Process
1. seqboot
2. dnadist (needed only before the distance-based programs)
3. tree building program
neighbor or fitch
dnapars
dnaml
4. consense
5. drawtree or drawgram
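The PHYLIP programs read their input from a plain-text file. A minimal sketch of writing an alignment in PHYLIP's sequential format, assuming the classic layout: a header line with the number of taxa and the number of sites, then one line per taxon with the name padded to 10 characters. The function name `to_phylip` and the taxon names are hypothetical; check the documentation of the PHYLIP version you use before relying on this.

```python
def to_phylip(alignment):
    """Format a {name: sequence} alignment as PHYLIP sequential text."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    n_sites = lengths.pop()
    lines = [f"{len(alignment)} {n_sites}"]
    for name, seq in alignment.items():
        if len(name) > 10:
            raise ValueError(f"name longer than 10 characters: {name}")
        # Names occupy exactly 10 columns, padded with blanks.
        lines.append(f"{name:<10}{seq}")
    return "\n".join(lines) + "\n"


alignment = {
    "taxon1": "AAGAGTGCA",
    "taxon2": "AGCCGTGCG",
    "taxon3": "AGATATCCA",
    "taxon4": "AGAGATCCG",
}
print(to_phylip(alignment), end="")
```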
PHYLIP #2
Distance calculation and substitution models
dnadist
Jukes-Cantor
Kimura
Maximum Likelihood
protdist
PAM
Kimura
Categories
PHYLIP #3
Tree building programs
distance-based
neighbor
fitch
kitsch
parsimony
dnapars/protpars
dnapenny/penny
maximum likelihood
dnaml/proml
dnamlk/promlk
PHYLIP #4
Tree evaluation
seqboot
works with both DNA and AA
bootstrap
jackknife
works with all tree building programs
consense
builds the consensus tree from multiple replicates
PHYLIP #5
Drawing/Plotting trees
drawtree
draws unrooted trees
drawgram
draws phenogram-like rooted trees
PAUP
The other most popular phylogenetic analysis package is
PAUP.
I didn’t have time to do anything with PAUP.
That’s left as an exercise to the reader… :-)