Multiple Alignment and Phylogenetic Trees
Download
Report
Transcript Multiple Alignment and Phylogenetic Trees
Multiple Alignment and
Phylogenetic Trees
Csc 487/687 Computing for
Bioinformatics
Alignment problem
Given a set of sequences, produce a
multiple alignment which corresponds as
well as possible to the biological
relationships between the corresponding
bio-molecules
For homologous proteins
Two residues should be aligned (on top of
each other)
if they are homologous (evolved from the
same residue in a common ancestor protein)
if they are structurally equivalent
Automatic approach
Need a way of scoring alignments
fitness function which for an alignment
quantifies its “goodness”
Need an algorithm for finding alignments
with good scores
Not all methods provide a scoring function
for the final alignment!
Analysis of fitness function
One can test whether the alignments
optimal under a given fitness function
correspond well to the biological
relationships between the sequences
For example, if the structure of (some of)
the proteins are known.
Scoring Alignments
In order to find an optimal alignment, we
need to be able to measure how good an
alignment is
Sum of pairs (SP) method: in a column, score
each pair of letters and total the scores. Pairs
of gaps score 0.
Total up scores for each column
SP Method Example
S
S
I
I
S
K
K
E
Using BLOSUM62 matrix,
gap penalty -8
In column 1, we have pairs
-8 - 8 + 4 = -12
-,S
-,S
S,S
k(k-1)/2 pairs per column
Align by use of dynamic programming
Dynamic programming finds best alignment of
k sequences with given scoring scheme
For two sequences there are three different
column types
For three sequences there are seven different
column types
x means an amino acid, - a blank
Sequence1
x - x x - - x
Sequence2
x x - x - x Sequence3
x x x - x - x
Use of dynamic programming
Dynamic programming finds best
alignment of k sequences given
scoring scheme
Algorithm for dynamic programming
Analysis
O(nk) entries to fill
Each entry combines O(2k) other entries
Costs O(k2) to calculate each SP score
Overall cost is O(k2 2k nk), or exponential
in the number of sequences!
NP-complete
General progressive alignment
Algorithm. General progressive alignment.
Progressive alignment of the sequences {s1, s2, . . . , sm}
Var C
current set of alignments
begin
C := ∅
for i := 1 to m do C := C union {{si }} end one alignment of
each seq.
for i := 1 to m − 1 do
choose two alignments Ap,Aq from C; C := C − {Ap,Aq }
Ar := align(Ap,Aq );C := C union {Ar }
end
C now contains the (single) final alignment
end
The Clustal Algorithm
Three steps:
1
2
3
Compare all pairs of sequences to obtain a
similarity matrix
Based on the similarity matrix, make a guide
tree relating all the sequences
Perform progressive alignment where the order
of the alignments is determined by the guide
tree
(A)
1 pairwise comparison
2 clustering/making tree
(B)
3 Align according to tree
Clustal - summary
Does not use a score for the final
alignment
Each pairwise alignment is done using
dynamic programming
Heuristics are used - tailored to globular
proteins
Graphical version: ClustalX
Phylogeny
The basic principle is that the origin of
similarity is common ancestry.
The field of phylogeny has the goals of
working out the relationships among
species, populations, individuals, or genes.
Usually expressed as a tree.
Phylogeny
The basic principle is that the origin of
similarity is common ancestry.
The field of phylogeny has the goals of
working out the relationships among
species, populations, individuals, or genes.
Usually expressed as a tree.
Phylogeny
A statement of phylogeny among objects
assumes homology and depends on
classification.
Phylogeny states a topology of the
relationships based on classification
according to similarity of one or more sets
of characters, or on a model of
evolutionary processes.
Phylogeny
It is rare for species relationships and
ancestry to be directly observable.
Evolutionary trees determined from
genetic data are often based on inferences
from the patterns of similarity, which are
all that is observable among species living
now.