Multiple sequence alignment theory
Alignment
Most alignment programs create an alignment that
represents what happened during evolution at the DNA
level.
To carry over information from a well-studied sequence to a newly
determined one, we need an alignment that
represents the protein structures of today.
©CMBI 2001
The amino acids
Most information that enters the alignment procedure
comes from the physicochemical properties of the
amino acids.
Example: which is the better alignment (left or right)?
CPISRTWASIFRCW
CPISRT---LFRCW

CPISRTWASIFRCW
CPISRTL---FRCW
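One way to make the comparison concrete is to score each aligned column by physicochemical similarity. The sketch below is only an illustration: the residue groups and the 2/1/0 scores are assumptions chosen for this example, not a published substitution matrix, and gap columns are simply given no score.

```python
# Hypothetical residue groups (an assumption for illustration only).
GROUPS = [
    set("AVLIM"),  # aliphatic / hydrophobic
    set("FWY"),    # aromatic
    set("ST"),     # small, hydroxyl
    set("DE"),     # acidic
    set("KRH"),    # basic
    set("NQ"),     # amide
    set("C"), set("G"), set("P"),
]

def column_score(a, b):
    """2 for identical residues, 1 for residues in the same group, 0 otherwise."""
    if a == "-" or b == "-":
        return 0  # gap columns carry no score in this sketch
    if a == b:
        return 2
    return 1 if any(a in g and b in g for g in GROUPS) else 0

def alignment_score(top, bottom):
    """Sum the column scores of two equal-length aligned strings."""
    return sum(column_score(a, b) for a, b in zip(top, bottom))

first = alignment_score("CPISRTWASIFRCW", "CPISRT---LFRCW")   # pairs I with L
second = alignment_score("CPISRTWASIFRCW", "CPISRTL---FRCW")  # pairs W with L
print(first, second)
```

Under these assumed groups the first alignment scores higher, because isoleucine and leucine are both aliphatic while tryptophan and leucine fall in different groups.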
A difficult alignment problem
AYAYAYAYSY
AGAPAPAPSP
LGLPLPLPLP
So, in an alignment of more than 2 sequences you can
find more information than from just the 2 sequences
you are interested in.
How do we make these multi-sequence alignments?
A difficult alignment problem solved
AYAYAYAYSY
AGAPAPAPSP
LGLPLPLPLP
Alignment order
MIESAYTDSW
QFEKSYVTDY
-MIESAYTDSW
QFEKSYVTDY-
Alignment order
MIESAYTDSW
QFEKSYVTDY
QWERTYASNF
-MIESAYTDSW
QFEKSYVTDY-
QWERTYASNF-
Conclusion
First align the sequences that are most similar to each
other.
That way you ‘build up information’ while generating the
alignments that are most likely to be correct.
Alignment order
In order to know which sequences look most like each
other, you need to do all pairwise alignments first.
This is exactly what CLUSTAL does.
CLUSTAL builds a tree while doing the build-up of the
multiple sequence alignment.
MSA and trees
Take, for example, the three sequences:
1 ASWTFGHK
2 GTWSFANR
3 ATWAFADR
and you see immediately that 2 and 3 are close, while 1 is further
away. So the tree will look roughly like:
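[Figure: tree in which 2 and 3 are joined first and 1 branches off separately; leaves drawn in the order 3, 2, 1]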
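The claim that 2 and 3 are close can be checked by counting mismatches. This is a minimal sketch that assumes the distance between two equal-length sequences is simply the number of differing positions; the helper name `hamming` is illustrative.

```python
from itertools import combinations

seqs = {1: "ASWTFGHK", 2: "GTWSFANR", 3: "ATWAFADR"}

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

# All pairwise distances, keyed by the pair of sequence numbers.
distances = {
    (i, j): hamming(seqs[i], seqs[j])
    for i, j in combinations(sorted(seqs), 2)
}
closest = min(distances, key=distances.get)
print(distances)
print(closest)
```

The pair with the smallest count is the one to align first.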
Aligning sequences; start with distances
     A   B   C   D   E
A    0   6   9  11   9
B    6   0   7   9   7
C    9   7   0   8   6
D   11   9   8   0   4
E    9   7   6   4   0
[Tree so far: D joined with E]
Matrix of pair-wise
distances between five
sequences.
D and E are the closest
pair. Take them, and
collapse the matrix by
one row/column.
Aligning sequences
     A   B   C  DE
A    0   6   9  10
B    6   0   7   8
C    9   7   0   7
DE  10   8   7   0
[Tree so far: D joined with E; A joined with B]
Aligning sequences
    AB   C  DE
AB   0   8   9
C    8   0   7
DE   9   7   0
[Tree so far: D joined with E, then C joined with DE; A joined with B]
Aligning sequences
      AB  CDE
AB     0  8.5
CDE  8.5    0
[Final tree: C joined with the pair D, E; that cluster joined with the pair A, B]
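The collapsing steps on these slides can be sketched as a short loop. The averaging rule used here (the distance from a merged pair to any other cluster is the plain mean of the two old distances) is inferred from the numbers in the matrices; the function and variable names are illustrative, and this is not a statement about how CLUSTAL itself is implemented.

```python
def collapse(names, dist):
    """Repeatedly join the closest pair and average its distances.

    names: list of cluster labels
    dist:  dict mapping frozenset({a, b}) to the distance between a and b
    Returns the joins in order, as (left, right, distance) tuples.
    """
    names = list(names)
    dist = dict(dist)
    joins = []
    while len(names) > 1:
        pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
        a, b = min(pairs, key=lambda p: dist[frozenset(p)])
        joins.append((a, b, dist[frozenset((a, b))]))
        merged = a + b
        rest = [n for n in names if n not in (a, b)]
        for n in rest:
            # Assumed rule: plain mean of the two distances being merged.
            dist[frozenset((merged, n))] = (
                dist[frozenset((a, n))] + dist[frozenset((b, n))]
            ) / 2
        names = rest + [merged]
    return joins

labels = ["A", "B", "C", "D", "E"]
rows = [
    [0, 6, 9, 11, 9],
    [6, 0, 7, 9, 7],
    [9, 7, 0, 8, 6],
    [11, 9, 8, 0, 4],
    [9, 7, 6, 4, 0],
]
start = {
    frozenset((labels[i], labels[j])): rows[i][j]
    for i in range(5) for j in range(i + 1, 5)
}
joins = collapse(labels, start)
for left, right, d in joins:
    print(left, right, d)
```

With the five-sequence matrix as input, the joins come out in the same order as on the slides: D with E, then A with B, then C with DE, and finally AB with CDE at distance 8.5.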
Back to the alignment
1 ASWTFGHK
2 GTWSFANR
3 ATWAFADR
Actually I cheated. 1 is closer to 3 than to 2 because of the A at position
1. How can we express this in the tree? For example:
[Figure: the tree drawn twice, once with leaf order 3, 2, 1 and once with leaf order 2, 3, 1]
I will call this tree-flipping.
Can we generalize tree-flipping?
To generalize tree-flipping, sequences must be placed ‘distance-correct’ in 1 dimension,
and then connected as we did before.
[Figure: sequences 2, 3 and 1 placed along a horizontal line and connected into a tree]
So, now most info sits in the horizontal dimension. Can we use the vertical
dimension usefully?
The problem is actually bigger
1 ASWTFGHK
2 GTWSFANR
3 ATWAFADR
d(i,j) is the distance
between sequences
i and j.
d(1,2)=6; d(1,3)=5; d(2,3)=3.
So a perfect representation would be:
[Figure: sequences 1, 2 and 3 drawn as the corners of a triangle whose sides have these lengths]
But what if a 4th sequence
is added with d(1,4)=4,
d(2,4)=5, d(3,4)=4? Where
would that sequence sit?
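One way to explore the question is to ask how many dimensions are needed to place the points so that every pair-wise distance is reproduced exactly. The sketch below uses a standard construction that is not taken from the slides: point 0 goes to the origin and the others follow from a Cholesky factorisation of the Gram matrix. The function name `embed` and the tolerances are assumptions for illustration.

```python
import math

def embed(dist, tol=1e-9):
    """Place points so that all pairwise Euclidean distances equal dist.

    Returns (coords, n_dims), or None if no exact placement exists.
    """
    n = len(dist)
    m = n - 1
    # Gram matrix of points 1..n-1 relative to point 0.
    gram = [[(dist[0][i] ** 2 + dist[0][j] ** 2 - dist[i][j] ** 2) / 2
             for j in range(1, n)] for i in range(1, n)]
    low = [[0.0] * m for _ in range(m)]
    for j in range(m):
        diag = gram[j][j] - sum(low[j][k] ** 2 for k in range(j))
        if diag < -tol:
            return None  # would need an imaginary coordinate
        if diag <= tol:
            continue  # this point needs no new axis
        low[j][j] = math.sqrt(diag)
        for i in range(j + 1, m):
            partial = sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = (gram[i][j] - partial) / low[j][j]
    coords = [[0.0] * m] + low
    # Verify every distance, so a wrong placement is never returned.
    for i in range(n):
        for j in range(n):
            if abs(math.dist(coords[i], coords[j]) - dist[i][j]) > 1e-6:
                return None
    n_dims = sum(1 for j in range(m) if low[j][j] > 0)
    return coords, n_dims

three = [[0, 6, 5], [6, 0, 3], [5, 3, 0]]
four = [[0, 6, 5, 4], [6, 0, 3, 5], [5, 3, 0, 4], [4, 5, 4, 0]]
print(embed(three)[1])
print(embed(four)[1])
```

Under this construction the three original sequences fit exactly in a plane, while adding the fourth sequence with the distances given above requires a third dimension.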
So, nice tree, but what did we actually do?
1) We determined a distance measure
2) We measured all pair-wise distances
3) We reduced the dimensionality of the space of the problem
4) We used an algorithm to visualize
In a way, we projected the hyperspace in which we can perfectly
describe all pair-wise distances onto a 1-dimensional line.
What does this sentence mean?
Projection
[Figures: four map projections]
Gnomonic projection: correct distances
Fuller projection: unfolded Dymaxion map
Political projection
Mercator projection
Source: Wikipedia
Back to sequences:
ASASDFDFGHKMGHS
ASASDFDFRRRLRHS
ASASDFDFRRRLRIT
ASLPDFLPGHSIGHS
ASLPDFLPGHSIGIT
ASLPDFLPRRRVRIT
[Figure: the six sequences drawn as a tree in three dimensions]
The more dimensions we
retain, the less information we
lose. The tree is now in 3D…
Projection to visualize clusters
We want to reduce the dimensionality with minimal distortion of the
pair-wise distances. One way is Eigenvector determination, or PCA.
PCA to the rescue
Now we have made the data one-dimensional, while
the second, vertical, dimension is noise. If we did this
correctly, we kept as much data as possible.
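For two-dimensional data the idea can be sketched without any library, because the principal axis of a 2x2 covariance matrix has a closed-form angle. The data points below are made up for illustration, and `pca_2d` is a hypothetical helper name; it assumes the points are not all identical.

```python
import math

def pca_2d(points):
    """Return (direction, scores, explained) for the first principal component."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Closed-form angle of the major axis of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    # One-dimensional coordinates: projection onto the principal axis.
    scores = [(x - mx) * direction[0] + (y - my) * direction[1]
              for x, y in points]
    var_first = sum(s * s for s in scores) / n
    explained = var_first / (sxx + syy)  # fraction of variance retained
    return direction, scores, explained

# Made-up 2D data lying close to a diagonal line (illustrative only).
points = [(1, 1.2), (2, 1.9), (3, 3.1), (4, 3.8), (5, 5.1)]
direction, scores, explained = pca_2d(points)
print(direction, explained)
```

The `scores` list is the one-dimensional version of the data, and `explained` states how much of the total variance that single dimension keeps; what remains is treated as noise.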
Back to sequences:
If we have N sequences, we can only
draw their distance matrix exactly in an
(N-1)-dimensional space. By the time it is a
tree, how many dimensions, and how
much information, have we lost?
Perhaps we should cluster in a different
way?
Cluster on critical residues?
QWERTYAKDFGRGH
AWTRTYAKDFGRPM
SWTRTNMKDTHRKC
QWGRTNMKDTHRVW
Gray = conserved
Red = variable
Green = correlated
Conclusions from correlated residues
Other algorithms
Multi-sequence alignment can also be done with an
iterative ‘profile’ alignment.
A) Make an alignment of few, well-aligned sequences
B) Align all sequences using this profile
1. What is a profile?
Normally, we use a PAM-like matrix to determine the
score for each possible match in an alignment.
This assumes that all matches between I <-> E are
the same. But they aren’t.
2. What is a profile?
QWERTYIPASEF
QWEKSFIPGSEY
NWERTMVPVSEM
QFEKTYLPSSEY
NFIKTLMPATEF
QYIRSLIPAGEM
NYIQSLIPSTEL
QFIRSLFPSSEI
[Three columns of the alignment are marked 1, 2 and 3.]
At 1, E and I are both OK.
At 2, I is OK, but E surely not.
At 3, E is OK, but I surely not.
3. What is a profile?
The knowledge about which residue types are good at a
certain position in the multiple sequence alignment can
be expressed in a profile.
A profile holds for each position 20 scores for the 20
residue types, and sometimes also two values for
position specific gap open and gap elongation penalties.
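A very reduced profile can be built by counting residues per column. The sketch below uses the eight-sequence alignment from the previous slide and stores raw frequencies only; real profiles usually hold weighted or log-odds scores for all 20 residue types plus gap penalties, so this is an assumption-laden simplification. The function names are illustrative, and reading columns 3, 7 and 11 as the marked positions 1, 2 and 3 is my interpretation of the slide.

```python
from collections import Counter

alignment = [
    "QWERTYIPASEF",
    "QWEKSFIPGSEY",
    "NWERTMVPVSEM",
    "QFEKTYLPSSEY",
    "NFIKTLMPATEF",
    "QYIRSLIPAGEM",
    "NYIQSLIPSTEL",
    "QFIRSLFPSSEI",
]

def build_profile(seqs):
    """One dict per column: residue -> observed frequency in that column."""
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        profile.append({res: c / len(column) for res, c in counts.items()})
    return profile

def profile_score(profile, seq):
    """Sum of the column frequencies of the residues in seq (0 if unseen)."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, seq))

profile = build_profile(alignment)
print(profile[2])   # column 3: E and I
print(profile[6])   # column 7: I present, E absent
print(profile[10])  # column 11: E only
```

A position-independent matrix would give an I <-> E match the same score everywhere, whereas these per-column frequencies differ between the three columns.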
Conserved, variable, or in-between
QWERTYASDFGRGH
QWERTYASDTHRPM
QWERTNMKDFGRKC
QWERTNMKDTHRVW
Gray = conserved
Black = variable
Green = correlated mutations
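The three classes can be found mechanically by comparing the pattern of residues in each column. This is a minimal sketch under simple assumptions: a column is conserved if it holds one residue type, variable if every sequence differs or if no other column shares its pattern, and correlated if at least one other column splits the sequences into the same groups. The helper names are illustrative.

```python
from collections import defaultdict

alignment = [
    "QWERTYASDFGRGH",
    "QWERTYASDTHRPM",
    "QWERTNMKDFGRKC",
    "QWERTNMKDTHRVW",
]

def pattern(column):
    """Encode which sequences share a residue, e.g. 'YYNN' -> (0, 0, 1, 1)."""
    first_seen = {}
    return tuple(first_seen.setdefault(res, len(first_seen)) for res in column)

def classify(seqs):
    """Return (conserved, correlated groups, variable) as 0-based column indices."""
    conserved, variable = [], []
    by_pattern = defaultdict(list)
    for idx, column in enumerate(zip(*seqs)):
        distinct = len(set(column))
        if distinct == 1:
            conserved.append(idx)
        elif distinct == len(column):
            variable.append(idx)  # every sequence differs: no grouping
        else:
            by_pattern[pattern(column)].append(idx)
    correlated = [cols for cols in by_pattern.values() if len(cols) > 1]
    variable += [c for cols in by_pattern.values() if len(cols) == 1 for c in cols]
    return conserved, correlated, sorted(variable)

conserved, correlated, variable = classify(alignment)
print(conserved)
print(correlated)
print(variable)
```

For this alignment the sketch finds two correlated groups: the columns that separate sequences 1 and 2 from 3 and 4, and the columns that separate 1 and 3 from 2 and 4.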
Correlated mutations determine the tree shape
1 AGASDFDFGHKM
2 AGASDFDFRRRL
3 AGLPDFMNGHSI
4 AGLPDFMNRRRV
Correlation = Information
1, 2 and 5 bind calcium; 3 and 4 don’t.
Which residues bind calcium?
  123456789012345
1 ASDFNTDEKLRTTFI
2 ASDFSTDEKLKTTFI
3 LSFFTTDTRLATIYI
4 LSHFLTNLRLATIYI
5 ASDFTTDEKLALTFI
Red has correct correlation, but wrong residue type.
Brown has correct type, but wrong correlation.
Green can be calcium-binders.
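The reasoning on this slide can be sketched in two filters: first find the columns whose residue correlates with calcium binding, then keep only those with a plausible residue type. The correlation criterion (all binders share one residue that no non-binder has) and the restriction to the acidic residues D and E are assumptions for illustration; the slide's colours are not available in this transcript.

```python
alignment = {
    1: "ASDFNTDEKLRTTFI",
    2: "ASDFSTDEKLKTTFI",
    3: "LSFFTTDTRLATIYI",
    4: "LSHFLTNLRLATIYI",
    5: "ASDFTTDEKLALTFI",
}
binders = {1, 2, 5}
ACIDIC = set("DE")  # assumed calcium-compatible residue types

def correlated_columns(seqs, positives):
    """1-based columns where all positives share a residue absent elsewhere."""
    hits = []
    length = len(next(iter(seqs.values())))
    for col in range(length):
        inside = {s[col] for k, s in seqs.items() if k in positives}
        outside = {s[col] for k, s in seqs.items() if k not in positives}
        if len(inside) == 1 and not (inside & outside):
            hits.append((col + 1, next(iter(inside))))
    return hits

hits = correlated_columns(alignment, binders)
candidates = [col for col, res in hits if res in ACIDIC]
print(hits)
print(candidates)
```

Column 7 illustrates the 'correct type, wrong correlation' case: it holds a D in all binders, but sequence 3, which does not bind calcium, has a D there too, so the first filter rejects it.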