C - Bioinformatics Centre

Download Report

Transcript C - Bioinformatics Centre

Phylogenetic Analysis
Introduction to bioinformatics
Stinus Lindgreen
[email protected]
Bioinformatics Centre, University of Copenhagen
Outline of the lecture







What is a phylogeny?
Why and how to interpret them
Programs: PHYLIP, PAUP* and BioEdit
Building a tree 1: Multiple alignment
Building a tree 2: The model
Building a tree 3: Construction
Building a tree 4: Evaluation
Nothing in Biology Makes Sense
Except in the Light of Evolution
Theodosius Dobzhansky (1900-1975)
Phylogeny







Phylogenetic inference predicts a tree based on
characters (of some sort)
Some variation needed
Group together similar species/genes
Connect to most common ancestor
Unrooted tree: Just show connections
Rooted tree: Direction of evolution
Branch lengths can show divergence
Before sequences





Phylogenetic trees show evolutionary relationships
Existed longer than sequencing methods
Previously based on morphological characters
Still partly today – at least for checking
Mainly based on biological sequences


DNA or protein
Base phylogeny on mutations
Morphological tree
Modern tree
A
A
G
C
G
X
X
Some pitfalls





Determining phylogeny is important for
understanding biology
But also a very difficult problem
Beware of incorrect trees
Important to understand models and methods
The programs are helpful tools
The result is only as good as the alignment
Assumptions
Basic concepts of evolutionary theory
 Relation to common ancestor
 Phylogenetics represented by bifurcating tree
 Mutations occur over evolutionary time
Necessary to make phylogenetic inference possible
Tree of Life
Interpretation

Know your model


Both evolutionary and for tree construction
Know the assumptions of the model

Evolution independent? Identical between sites? The same
for all sequences?
Are the sequences correct?
 And are they representative?
 And are they homologous?
 Is the multiple alignment correct?
What you get out is no better than what you put in

Some biological pitfalls
Don’t make hasty conclusions!
 Does your tree contradict common sense?



Differentiate between the homologs
Orthologs


Speciation, common ancestor, similar function
Paralogs


Then it’s probably wrong!
Gene duplication, within 1 organism, differing functions
Xenologs

Horizontal gene transfer – hard to tell, similar function
Software
Today we’ll look at the programs before the methods
Some programs for phylogenetic analysis
 A multiple alignment program:
Clustal, T-Coffee, MAFFT, Muscle…
 A phylogenetic program:
Phylip, PAUP*, MacClade, BioEdit…
 Visualizing the tree:
TreeView, NJplot
PAUP*

Commercial package
Apparently good
Many different methods and analysis methods
But since we don’t own a copy…

Similarly: MacClade only works on Macintosh…



PHYLIP








Free package
Many programs
Both distance and character based
Bootstrapping possible
But:
It can be a little difficult
No graphical user interface
And you will need to run many programs
BioEdit








Has phylogeny methods built in
Can call Phylip routines
No need for you to learn the command line
But no bootstrapping… (as far as I know)
Point and click:
Select the sequences in the alignment
Choose the wanted phylogeny
Voila!
PhyloWin



Another free program
Simple, not many possibilities
But you can make bootstrapping
Getting the software


Install BioEdit, PHYLIP, PhyloWin and NJplot
Links on the wiki
Constructing a tree
To make a phylogenetic tree, four steps are needed:
1.
Perform multiple alignment
2.
Choose your model
3.
Build the tree
4.
Evaluate the quality
A brief note:
Ideally: Parallel alignment and phylogenetic inference

Very difficult – but it has been pursued
1) The multiple alignment
Already discussed
Some notes:
 Recall that MA programs are not exact


Consider the algorithm used




Some manual editing often necessary
Does it consider the phylogeny of the data?
Clustal’s guide tree: Not correct phylogeny
What parameters are used?
Solve ambiguities, remove near-identical sequences

Gappy regions, identical sequences can bias the result
2) The model
The model describes the data
 Evolutionary events
 Overall mutability
 Evolutionary model?


Crucial – both for alignment and tree building
Are you looking at nucleotides or amino acids?


Where do we get most information?
Know the basis for the chosen model
Nucleotide models


Create 4×4 matrix
Either fixed cost


Character state
Or rate matrices

Probabilities

Used for different kinds of tree estimations

Include site specific information

Third codon position more variable
Nucleotide model 1






Fixed cost for transitions and transversion
E.g. transversions are twice as costly as
transitions
A
For a tree: Count the number of A transitions/transversions
C
2
Calculate cost
G
1
T
2
Tends to minimize number of
transversion
Cluster transitions
C
G
T
2
1
2
-
2
1
2
-
2
1
2
-
Nucleotide model 2




Simple substitution rate matrix
Assume same rates AB and BA
Assume all mutations equally likely: Rate α
The Jukes-Cantor model
A
C
G
T
A
-3α
α
α
α
C
α
-3α
α
α
G
α
α
-3α
α
T
α
α
α
-3α
Nucleotide model 3




More advanced rate matrix
Include transitions/tranversions
Rates α1 and α2
The Kimura 2-parameter model
A
A -(α2+α1)
C
G
T
α2
α1
α2
C
α2
-(α2+α1)
α2
α1
G
α1
α2
-(α2+α1)
α2
T
α2
α1
α2
-(α2+α1)
Amino acid models


A 20×20 substitution matrix
The BLOSUM matrices


Or the PAM matrices


Fixed cost matrices
Rate matrices
Described last week
3) Building the tree
We have the sequences, the alignment and the model
 Find the best tree
 What is the best tree?
 Two main strategies:
 Distance based


Look at dissimilarities (=distances)
Character based

Look at the data
Problems with trees



The number of possible trees grows exponentially
For 15 taxa: 2.13·1014 possibilities…
How to search?



Rooting the tree


Branch and Bound
Branch swapping
Not a simple problem
All the following methods produce unrooted trees


Use an outgroup
Midpoint of longest branch
Distance methods



Some sequences more similar than others
Closely related sequences should be close in the tree
Abstract view on the data


Only use the distances between sequences


Loss of information is usually a bad sign
Recall Clustal
All methods start with a distance matrix
Distance methods



Can we get the correct answer?
Yes, if all mutation events were present
But: After one mutation, the site is ”saturated”

A
A
Additional mutations do not give additional info
B
C: Distance 2
C: Distance 1
 And mutations back will fool the method
A
B
A: Distance 2
A
A: Distance 0
UPGMA
Unweighted Pair Group Method with Arithmetic Mean
 Unweighted: The distances are used as they are
 Pair: Find the two closest elements
 Group: Put them together in a new group
 Arithmetic Mean: Gives distances from the new group

Correct tree assuming a molecular clock


Evolutionary divergence time can be found from mutations
Mutation rates are constant
UPGMA illustrated
A B C D E

A
-
8 3 2 5

B
-
-
5 6 6

C
-
-
-
7 5
D -
-
-
-
3
E
-
-
-
-
-
A+D B C
E
A+D
-
7
5
4
B
-
-
5
6
C
-
-
-
5
E
-
-
-
-
Find two closest: A and D
Create a new group [A+D]
Update distances:
A B D B
[A  D]  B 
2
86

7
2
 Repeat for all sequences
 Next time: Connect [A+D] with E
Trying UPGMA

Go to the wiki and do the UPGMA exercise
Neighbour joining



A little like UPGMA
Difference: NJ does not assume a molecular clock
But it assumes an additive tree



Find the closest pair that is most apart from the rest
of the tree
Connect pair and update distances



Distance between two leaves is the sum of the edges
A little advanced: Take the overall distance to the rest of the
tree into account
Corrects for varying mutation
Fast and can give good results
Fitch-Margoliash
FM method
 We have the pairwise distances
 Each branch in the tree has a length
 The length of all paths can be found
 Optimize tree by moving internal nodes around
 The best fit minimizes the overall error
2
(d

p
)
 ij ij
ij

The minimum squared deviation
Minimum Evolution
The ME method
 Find the shortest tree


Count number of changes
Similar to FM but only looks at branches
2
(d

l
)
 ij ij
FM
ij
B
ME
A
A
B
Trying NJ

Go to the wiki and do the NJ exercise
Character methods




Use the data (the actual characters)
All information at hand
More advanced, slower, but also more accurate
Maximum Parsimony (MP)


Occam’s razor: Simplest explanation
Maximum Likelihood (ML)


Advanced statistical method
Most probable tree given the data and the model
Maximum parsimony





How does evolution work?
Assumption: Path of least resistance
True evolution gives rise to fewest changes
The tree we want:
Describe the given sequences by fewest changes



The ancestral nodes must be as similar as possible
Predict a tree
Count the number of changes needed
MP illustrated
A
C
G
{A,C}
G
{G}
{A,C,G}
{C}
C
MP illustrated
A
C
G
G
X
{A,C}
{G}
X
{A,C,G}
Cost: 2 changes
{C}
C
MP illustrated
A
C
G
G
CA
C
G
CG
C
C
C
Maximum Likelihood

Given the data, predict the most probable model



We know the sequences
What is the most likely substitution rates?



Estimate from the alignment (and the phylogeny)
And what is the most likely tree?


Can optimize both tree and substitution model
Estimate from alignment and substitution rates
Computationally heavy and rather slow
Normally good results
Maximum Likelihood





General practice: Optimize model then tree
Calculate probability for each alignment column
Combine to probability for entire alignment
Averages over low and high probability sites
Likelihood of column given tree
A
L=P
A
C
A
A
+P
A
A
C
C
A
+P
A
A
C
G
+…
A
Maximum likelihood




Then repeat this for all possible tree topologies
And all possible assignments to internal nodes
And then choose the combination that gives the
highest probability…
Clearly very difficult
MP and ML exercise

Go to the wiki and do the MP and ML exercises
Summary of methods
Distance
Clustering
Optimality
criterion
Character based
UPGMA
Neighbour Joining

Least Squares
Minimum evolution

Maximum parsimony
Maximum likelihood
(Bayesian statistics)

The differences


Sometimes the differences can seem minimal
They affect the tree – but the same result is possible
UPGMA and NJ
 Minimize the overall length of the tree
Maximum parsimony
 Finds tree with fewest changes
Maximum likelihood
 Maximizes the probability of the tree given the data
4) Evaluating trees
How good is the predicted tree?
Some sequence variation needed
 Is the signal strong enough?
There are so many possible trees
 Are there many trees similar to the prediction?


Which one to choose?
Is the tree robust?

Does it change much when e.g. removing a sequence?
Randomization


Is it possible that tree is just random?
Permute the columns of the alignment




i.e. shuffle the characters in a column
Build a new tree
Is it (partly) identical?
If the tree is just as likely to be random, then don’t
put too much faith in it
Bootstrapping






The story of Baron von Münchausen
He pulled himself out of a swamp by his bootstraps
The idea: Evaluate the quality of the result using the
same data all over again
Make a large number of new datasets
Create phylogenetic tree
Observe the number of times clades are made
Bootstrapping




The datasets should be similar
Thereby: The trees are comparable
Alignments of same size (length and sequences)
Non-parametric: Sample with replacement


Parametric: Simulate new datasets


Choose a random column and add new alignment
Use model that look like your data
Characteristics are preserved (unlike randomization)
Bootstrap example


A:
B:
C:
D:
#:

A:
B:
C:
D:
Non-parametric bootstrapping
We have an alignment:
A
A
A
A
0
G
G
G
U
1
G
G
C
U
2
C
U
C
U
0
U
U
C
C
3
C
C
C
C
0
C
G
G
G
1
A
A
A
A
2
A
A
A
A
0
A
A
A
C
1
Sample columns:
G
G
G
U
G
G
C
U
G
G
C
U
U
U
C
C
U
U
C
C
U
U
C
C
C
G
G
G
A
A
A
A
A
A
A
A
A
A
A
C
A
B
C
D
A
-
-
-
-
B
1
-
-
-
C
5
5
-
-
D
8
7
4
-
A
B
C
D
Bootstrap example

A:
B:
C:
D:
Sample 2:
A
A
A
A
U
U
C
C
U
U
C
C
C
C
C
C
C
C
C
C
C
G
G
G
C
G
G
G
A
A
A
C
A
A
A
C
A
B
C
D
A
-
-
-
-
B
2
-
-
-
C
4
2
-
-
D
7
5
3
-
A
A
A
C
A
B
C
D
Bootstrap example

A:
B:
C:
D:
Sample 3:
A
A
A
A
C
C
C
C
C
C
C
C
C
G
G
G
A
A
A
C
A
A
A
C
G
G
C
U
G
G
C
U
C
U
C
U
A
B
C
D
A
-
-
-
-
B
3
-
-
-
C
3
4
-
-
D
7
4
6
-
C
U
C
U
A
B
C
D
Bootstrap example

Calculate consensus tree


Can be done on many ways
Put the bootstrap number at each branch point


The proportions of times this branch is observed
Of course, more than three samples needed
1.0
0.66
A
B
C
D
Bootstrapping exercise

Do the bootstrapping exercise on the wiki
Summary







What is phylogenetic inference?
What can a phylogenetic tree be used for?
Be aware of the multiple alignment
The different models
Tree building methods: NJ, UPGMA, ML and MP
Evaluating trees: Bootstrapping
Programs: Phylip, PAUP*,PhyloWin and BioEdit
Next time: Gene finding (with Anders Krogh)
Then RNA structure prediction with me again 