Multiple Alignment - niu bioinformatics

Download Report

Transcript Multiple Alignment - niu bioinformatics

Multiple Sequence Alignment
Sequence Families
•
Most sequences are members of large families, some with the same
function and others with different functions.
– Members of different families are thought to not have a common ancestor.
•
The basic idea of multiple alignment is to line up a group of related proteins
so that the amino acids in each column:
– Have the same position in the 3-dimensional structure
– Are derived from the same amino acid in their common ancestor
– Have the same function in the protein
•
These three goals are not completely compatible: structures change over
evolutionary time, mutations compensate for each other, etc.
– An interesting hypothesis: there are only a few thousand different types of protein
fold in existence, and all proteins are made up of one or more of these folds.
• X-ray crystallography seems to be confirming this idea, but it is far from being widely
accepted.
•
Another point of view: a multiple alignment is an attempt to reproduce the
sequence of evolutionary events that occurred in moving from the common
ancestor to all of the present-day sequences being aligned.
Multiple Alignments
•
Most multiple alignments are global, not local.
–
•
Aligning 2 sequences is easy and fairly precise (given a substitution matrix and set of
gap penalties).
–
•
This means that it is necessary for the proteins being aligned to all have the same basic
domain structure.
Most multiple alignment programs start with pairwise alignments.
But, multiple alignments are still an active area of research, and the best alignments
are still refined by hand.
–
There are many multiple alignment programs available, and there are always new ones
coming out.
•
–
•
Not to mention algorithms that aren’t fully developed into new user-friendly web-based programs.
Lots of heuristics and general ad hoc decisions that improve results without any underlying
theory.
Closely related sequences can be aligned more unambiguously that distant
sequences.
–
–
Thus, most multiple alignment schemes start by joining the most similar sequences
This is a case of the general rule “do the easy ones first”
Scoring a Multiple Alignment
•
•
How do you determine which alignment is best?
“Looks pretty good to me” (i.e. hand refinement) may work when experts in that specific protein
are involved, but in general it is an invitation to introducing prejudices
–
•
“star tree” scoring: score all other sequences relative to a single common (canonical) one.
–
–
•
There are several sets of test case data, with alignments refined by hand using 3-D structures. For
example, BaliBase
This is effectively what happens when one sequence is used as a BLAST or PSI-BLAST query and all other
sequences are aligned to it.
Not good idea: the original query is not the center of evolutionary events
We are trying to re-create evolutionary events, so the ideal method would be a phylogenetic
scoring method, that would take evolutionary descent into account.
–
–
–
Detect and count each evolutionary event that distinguishes between sequences.
Phylogeny is often inferred during the course of a multiple alignment, but it is not independent from the
alignment itself, so it can’t legitimately be used for scoring.
The best phylogenetic methods are quite slow.
More on Scoring
• Most common scoring = “sum of pairs”. Once the
multiple alignment is made, use a substitution matrix to
score all possible pairs of sequences, then add up the
total.
– Sum of pairs ignores the principal that closely related sequences
are easier to align properly than distant ones.
– Sum of pairs also overcounts more distant evolutionary events.
For example, the first divergence from the common ancestor is
counted as if it occurred multiple independent times, not just
once.
– Some alignment methods use weighting schemes to overcome
this objection.
• A more recent method (slower, unfortunately), is
“consistency-based scoring”.
– The basic concept is that the final multiple alignment should
match as many pairwise alignments as possible.
Progressive Alignment
• The first thought: why not just extend Smith-Waterman to multiple
sequences? i.e. simultaneous alignment.
– Huge multi-dimensional matrices and exponential scaling ( O(2N), where
N = number of sequences ) make it impractical except for small
numbers of sequences.
• Progressive alignment: start by aligning the most similar pair, then
add other sequences one by one.
– Alignments are most likely to be correct if the sequences are closely
related.
• The big problem is: the order in which the sequences are added
affects the final alignment significantly.
– Once 2 sequences are aligned, that alignment can’t be changed, so
early mistakes in alignment are propagated.
• This property is called “greediness”
– Modern alignment schemes try to re-test and re-align later additions.
This is called “iterative alignment”, meaning that the multiple alignment
goes through several cycles (iterations) of retesting and re-aligning.
ClustalW
•
ClustalW is the most popular multiple alignment program used today.
–
–
•
•
Despite the fact that no improvements have been made since 1994
Despite the fact that it suffers horribly from the order-of-addition problem: it is a
progressive method, not iterative.
On the web: http://www.ebi.ac.uk/Tools/clustalw2/
Basic algorithm:
–
–
–
Do all possible pairwise alignments and score them.
Using cluster analysis, create a relationship tree (a guide tree) based on the
similarities between pairs.
Combine alignments by introducing gaps
•
Some heuristic attempts to concentrate gaps into the same alignment columns
Cluster Analysis
• How to create the guide tree
– Based on a matrix of distances between
aligned pairs of sequences
– Several ways of converting this to a tree. We
will look at a simple method called UPGMA
– There are better, more accurate ways of
creating a phylogenetic tree, but UPGMA is
fast and so fits the needs of a multiple
alignment program
Aligned Sequences
A:
B:
C:
•
•
•
•
ACCGTAGCTGGACTGGAATCGTTTAC
ATCGTGGCTGGACCGGACTCGTTTAC
ACCGTATCTGGACTG--CTCGTATAC
A-B: 4 differences
A-C: 3 differences
B-C: 5 differences
There are 24 positions with bases (and not gaps) in all
sequences, so use 24 as denominator to calculate genetic
distances
• A-B distance = 4/24 = 0.17
• A-C distance = 3/24 = 0.13
• B-C distance = 5/24 = 0.21
Distance Matrix
•
•
•
To create a UPGMA tree, we start
with a distance matrix.
The main diagonal is the distance
between a sequence and itself,
which is always 0.
Here is an example, based on 50
positions in the aligned sequences
that had no gaps.
– usually, positions where any gaps
occurred are not used.
A
B
C
D
E
A
0
5
10
17
19
B
0.10 0
12
18
20
C
0.20 0.24 0
16
17
D
0.34 0.36 0.32 0
E
0.38 0.40 0.34 0.20 0
10
The upper triangle has the number of
differences between pairs of
sequences. The lower triangle has the
genetic distances (num. diffs. / total
positions). These sequences had 50
positions with no gaps.
•
= Unweighted Pair Group Method
with Arithmetic Mean.
– UPGMA is a simple and intuitive
clustering method
– It produces a rooted tree
(dendrogram)
•
Algorithm:
– Start by finding the closest pair of
sequences: A and B are 0.10
apart.
– Join them. The branch lengths
are the 1/2 the distances
(because the distance is from A to
the join point, then from the join
point down to B).
– Combine their columns by
averaging distances from each
member of the new group to each
of the other sequences
– Repeat until all sequences have
been joined into a single tree.
UPGMA
A
B
C
D
A
0
B
0.10
0
C
0.20
0.24
0
D
0.34
0.36
0.32
0
E
0.38
0.40
0.34
0.20
A+B
C
D
A+B
0
C
0.22
0
D
0.35
0.32
0
E
0.39
0.34
0.20
E
0
E
0
• Next, D and E are the
closest on the revised
distance matrix: 0.20. Each
branch is 1/2 of this, 0.10
units.
• Branch lengths are
measured from the bottom of
the tree, the position of all
leaves.
• For the new distance matrix,
the distance from D+E is the
average of the distance from
D to each leaf and the
distance from E to each leaf.
• The distance from A+B to
D+E is the average of A-D,
A-E, B-D and B-E.
More UPGMA
A
B
C
D
A
0
B
0.10
0
C
0.20
0.24
0
D
0.34
0.36
0.32
0
E
0.38
0.40
0.34
0.20
A+B
C
A+B
0
C
0.22
0
D+E
0.37
0.33
D+E
0
E
0
More UPGMA
• Next: C is closest to A+B
(0.22)
• Note that to get the
distances from A+B+C to
D+E, we need to get the
average of A-D, A-E, B-D, BE, C-D, and C-E.
A+B
C
A+B
0
C
0.22
0
D+E
0.37
0.33
D+E
0
A+B+C D+E
A+B+C 0
D+E
0.357
0
• The last join is
A+B+C with D+E.
• This tree has all of
its leaves on the
same level, which
implies equal
rates of mutation
in all branches.
This assumption
is unrealistic, but it
is a good first
approximation.
End of UPGMA
Good alternative reference:
http://www.southampton.ac.uk/~re1u06/teaching/upgma/
MUSCLE
•
•
•
•
“MUSCLE” is probably an acronym for something, but I am not a fan of
cutesy acronyms.
It gets high marks in recent surveys of multiple alignment methods for being
fast, accurate, and capable of dealing with a wide range of sequences.
MUSCLE as a web-based service at EBI (European Bioinformatics
Institute): http://www.ebi.ac.uk/Tools/muscle/index.html
MUSCLE is an iterative method:
– it starts out doing a progressive alignment of sequences using a guide tree,
starting with the most similar.
– Then, it attempts to test and re-align various sequences, trying to avoid freezing
in early bad decisions.
•
•
This is an instance of a more general statistical technique called “jacknifing”.
Jacknifing involves repeating an analysis many times on a subset of the
data: leaving out random halves of the data, or leaving out single random
observations, etc. It is used (along with bootstrapping) to estimate the
robustness of one’s conclusions.
It also illustrates the general principle of “annealing”: slowly converging on a
solution by allowing changes but gradually raising the score needed to
accept a change.
MUSCLE Algorithm
• After some preliminary work, MUSCLE
estimates a tree using UPGMA and the
Kimura method of estimating the
evolutionary distance between sequences
(which we will deal with later in the
semester).
• The iterative procedure:
– Cut the tree into 2 subtrees at random
locations
– re-align each subtree
– Align the subtrees together
– Score the new alignment with sum-of-pairs
• If the new score is better than the old one,
discard the old tree and repeat the cutting
and realignment procedure with the new
tree.
• If the new score is worse than the old one,
discard it and repeat the procedure with the
old tree.
– Repeat until no further changes appear
(convergence) or a user-set limit is
reached.
T-COFFEE
•
•
T-Coffee is also a cute acronym for something
For our purposes, it illustrates a scoring method more refined than sum-ofpairs.
– It is probably more accurate than other methods, but considerably slower.
•
•
•
Web: http://www.ebi.ac.uk/Tools/t-coffee/
Consistency-based scoring.
Algorithm:
1. Create a library of all pairwise alignments
•
•
•
You can generate them externally if you like. For instance, using alignments of 3dimensional structures.
You can use several alignments for the same pair of sequences, even if they are not
consistent.
The information is stored residue-by-residue: it is a list of entries like “amino acid 23 in
sequence X aligns with amino acid 27 in sequence Y”
2. Weight alignments by the percentage of identical residues.
•
If there is more than one alignment for a given pair of sequences, create a weighted
average of them.
Library Extension
3. The next step, library extension, involves examining all
triplets of sequences.
– Residues that pair up in all three have their weights increased
by the lesser of the two pair scores.
– Residues that match differently in the two pairs being joined to
form a triplet do not have their weights increased.
– Thus, residues that consistently match up end up with very high
weights. This is what is meant by consistency-based scoring.
4. After this, the weighted set of aligned residues is
subjected to a progressive alignment equivalent to
ClustalW.
– Once a gap is created, it remains in all further alignments
– Separate gap penalties aren’t needed: already incorporated into
the library extension process.
Library
Extension
Example
The scores are pairwise:
the percentage
of identical amino acids for
all positions with an amino
acid in both sequences
Example
•
Pairwise scores. Every aligned pair of amino acids gets an initial score equal to the
percentage of identity between the aligned sequences.
–
–
–
–
•
•
Sequences x and y are then aligned through all other sequences: v and z in this case
When the x-v-y triplet is assembled, every aligned pair in x and y gets its weight
increased by the minimum of the two pair scores (77 here).
–
•
This includes GARFIELD THE FAT CAT, but not VERY or the S in “FAST” because they
aren’t aligned.
Similarly, when x, z, and y are aligned together, all matching pairs in x and y are
increased by 100 (the minimum of the x-z and y-z scores).
–
•
Sequences x and y aligned as a pair gives 88% identity.
Sequences x and v aligned as pairs gives 77% identity (because LAST and VERY are
aligned but not identical).
Sequences v and y are 100% identical, because VERY in sequence v is a gap in sequence
y.
Sequences x and z, and y and z, are 100% identical, although mostly gaps.
This only increases the score for THE FAT CAT
Note that LAST FAT in sequence x matches FAST CAT in y, but it matches VERY
FAST in v, and ---- FAT in z. This results in two possible matches for these residues,
and lower scores for both
–
These matches are thus less reliable than other parts of the x-y alignment.
Some Numbers
x
T
H
E
L
A
S
T
F
A
-
T
C
A
T
y
T
H
E
F
A
S
T
C
A
-
T
-
-
-
x-y
score
88
88
88
88
88
88
88
88
88
0
88
0
0
0
x
T
H
E
L
A
S
T
F
A
-
T
C
A
T
V
T
H
E
V
E
R
Y
F
A
S
T
C
A
T
Y
T
H
E
-
-
-
-
F
A
S
T
C
A
T
New
score
165
165
165
88
88
88
88
77
77
0
77
77
77
77
x
T
H
E
L
A
S
T
F
A
-
T
C
A
T
z
T
H
E
-
-
-
-
F
A
-
T
C
A
T
y
T
H
E
-
-
-
-
F
A
S
T
C
A
T
Final
score
265
265
265
0
0
0
0
177
177
0
177
177
177
177
Further...
• There are many more multiple alignment
algorithms.