Alignment
Most alignment programs create an alignment that
represents what happened during evolution at the DNA
level.
To carry information over from a well-studied sequence to a
newly determined one, we need an alignment that
represents the protein structures of today.
©CMBI 2001
Sequence Alignment
In phylogeny one wants to line up residues that came from
a common ancestor.
For information transfer one wants to line up residues at
similar positions in the structure.
gap = insertion or deletion
©CMBI 2005
Global versus Local Alignment
Global
Local
Global Alignment
Align two sequences from “head to toe”, i.e.
from 5’ ends to 3’ ends
from N-termini to C-termini
Algorithm published by: Needleman, S.B. and Wunsch,
C.D. (1970) “A general method applicable to the search
for similarities in the amino acid sequence of two
proteins”. J. Mol. Biol. 48:443-453.
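A minimal sketch of the dynamic-programming idea behind global alignment may help here. The function name, the example sequences and the scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, not values taken from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    def sub(x, y):
        return match if x == y else mismatch

    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # align two residues
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )

    # Trace back from the bottom-right corner ("toe") to the top-left ("head").
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

Because every residue of both sequences must appear in the result, the traceback always runs from the last cell back to the first one.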
Global Alignment
[Figure: score matrix for the global alignment of two short DNA sequences, shown together with the resulting gapped alignment]
Local Alignment
Locate region(s) with high degree of similarity in two
sequences
Algorithm published by: Smith, T.F. and Waterman,
M.S. (1981) “Identification of common molecular
subsequences”. J. Mol. Biol. 147:195-197.
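The local variant can be sketched in the same way. Compared with a global alignment, cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The function name, example sequences and scoring values (match +2, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment of a and b; returns (score, aligned_a, aligned_b)."""
    def sub(x, y):
        return match if x == y else mismatch

    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            h[i][j] = max(
                0,                                          # start a new local alignment
                h[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                h[i - 1][j] + gap,
                h[i][j - 1] + gap,
            )
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        if h[i][j] == h[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTAAA", "GGGACGTCCC"))
```

Only the shared region is reported; the dissimilar flanks of both sequences are left out of the alignment.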
Local Alignment
[Figure: score matrix for the local alignment of two short DNA sequences, shown together with the resulting alignment; all cell scores are 0 or higher]
Gap Penalty Functions
Linear
Penalty rises in direct proportion to the length of the gap
Affine
Penalty has a gap-opening and a separate length
component
Probabilistic
Penalties may depend upon the character of the
residues involved
Other functions
Penalty rises fast at first, but levels off for longer
gaps
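Three of these penalty shapes can be written down as small functions. This is only a sketch: the function names and parameter values are assumptions, the affine form shown (opening cost for the first gap position, extension cost for each further one) is one of several conventions in use, and the logarithm is just one possible function that flattens with length. The probabilistic, residue-dependent variant is not shown.

```python
import math


def linear_gap(length, per_residue=1.0):
    """Linear: the penalty grows in direct proportion to the gap length."""
    return per_residue * length


def affine_gap(length, gap_open=10.0, gap_extend=0.5):
    """Affine: a one-off opening cost plus a smaller cost per extra residue."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


def levelling_gap(length, gap_open=10.0, scale=2.0):
    """Rises fast at first, then flattens (here: logarithmic in the length)."""
    if length <= 0:
        return 0.0
    return gap_open + scale * math.log(length)


for n in (1, 2, 5, 20):
    print(n, linear_gap(n), affine_gap(n), round(levelling_gap(n), 2))
```

With an affine penalty, one long gap is cheaper than several short gaps of the same total length, which is usually the behaviour wanted for insertions and deletions.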
Significance of Alignment
How significant is the alignment that we have found?
Or put differently: how different is the alignment
score that we found from the scores obtained by aligning
random sequences to our sequence?
Calculating Significance
Repeat N times (N > 100):
Randomise sequence A by shuffling its residues
Align the randomised sequence A with sequence B, and
calculate the alignment score S
Calculate the mean and standard deviation of the N random scores
Calculate the Z-score:
Z = (S_genuine - mean(S_random)) / s.d.(S_random)
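The shuffling procedure can be sketched as follows. The scoring function here is a simple global alignment score with assumed parameters (match +1, mismatch -1, gap -2); any alignment scorer could be substituted. The names `align_score` and `z_score`, the default N = 200 and the fixed random seed are illustrative choices, not part of the original material.

```python
import random
import statistics


def align_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score only, keeping one matrix row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def z_score(seq_a, seq_b, n=200, seed=0):
    """Z-score of the genuine score against scores of shuffled versions of seq_a."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    genuine = align_score(seq_a, seq_b)
    random_scores = []
    for _ in range(n):
        residues = list(seq_a)
        rng.shuffle(residues)          # same composition, random order
        random_scores.append(align_score("".join(residues), seq_b))
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)
    if sd == 0:
        raise ValueError("all random scores are identical; Z-score undefined")
    return (genuine - mean) / sd


print(z_score("CPISRTWASIFRCW", "CPISRTWASIFRCW"))
```

Shuffling keeps the amino acid composition of sequence A intact, so the random scores reflect what composition alone can achieve.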
Significance of Alignment
[Figure: distribution of alignment scores, with the random matches and the genuine match marked; x-axis: alignment score]
Significance of Alignment
[Figure: distribution of alignment scores, with the random matches and a random match marked; x-axis: alignment score]
The amino acids
Most information that enters the alignment procedure
comes from the physicochemical properties of the
amino acids.
Example: which is the better alignment (left or right)?
Left:
CPISRTWASIFRCW
CPISRT---LFRCW
Right:
CPISRTWASIFRCW
CPISRTL---FRCW
A difficult alignment problem
AYAYAYAYSY
AGAPAPAPSP
LGLPLPLPLP
So an alignment of more than two sequences contains more
information than the two sequences you are interested in
provide on their own. How do we make these multi-sequence alignments?
A difficult alignment problem solved
AYAYAYAYSY
AGAPAPAPSP
LGLPLPLPLP
Alignment order
MIESAYTDSW
QFEKSYVTDY
-MIESAYTDSW
QFEKSYVTDY-
Alignment order
MIESAYTDSW
QFEKSYVTDY
QWERTYASNF
-MIESAYTDSW
QFEKSYVTDY-
QWERTYASNF-
Conclusion
First align the sequences that are most similar to each
other.
That way you ‘build up information’ while generating the
alignments that are most likely to be correct.
Alignment order
In order to know which sequences look most like each
other, you need to do all pairwise alignments first.
This is exactly what CLUSTAL does.
CLUSTAL builds a tree while it builds up the
multiple sequence alignment.
MSA and trees
Take, for example, the three sequences:
1 ASWTFGHK
2 GTWSFANR
3 ATWAFADR
and you see immediately that 2 and 3 are close, while 1 is further
away. So the tree will look roughly like:
    +--- 3
+---+
|   +--- 2
+------- 1
Aligning sequences; start with distances
     A    B    C    D    E
A    0    6    9   11    9
B    6    0    7    9    7
C    9    7    0    8    6
D   11    9    8    0    4
E    9    7    6    4    0
[Tree figure: leaves D and E]
Matrix of pair-wise
distances between five
sequences.
D and E are the closest
pair. Take them, and
collapse the matrix by
one row/column.
Aligning sequences
      A    B    C   DE
A     0    6    9   10
B     6    0    7    8
C     9    7    0    7
DE   10    8    7    0
[Tree figure: leaves D, E, A and B]
Aligning sequences
     AB    C   DE
AB    0    8    9
C     8    0    7
DE    9    7    0
[Tree figure: leaves C, D, E, A and B]
Aligning sequences
       AB  CDE
AB      0  8.5
CDE   8.5    0
[Tree figure: leaves C, D, E, A and B]
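The matrix-collapsing steps on these slides can be reproduced with a short sketch. It assumes that the distance from a merged pair to any other cluster is the plain average of the two old distances, which is the rule that matches the numbers shown (10, 8, 7, then 8, 9, then 8.5). The function name `cluster` and the dictionary layout are illustrative choices.

```python
def cluster(names, dist):
    """Repeatedly merge the closest pair of clusters.

    dist maps frozenset({x, y}) to the distance between x and y.
    Returns the merges in order as (left, right, distance) tuples.
    """
    names = list(names)
    d = dict(dist)
    merges = []
    while len(names) > 1:
        # Find the closest pair among the current clusters.
        a, b = min(
            ((x, y) for i, x in enumerate(names) for y in names[i + 1:]),
            key=lambda pair: d[frozenset(pair)],
        )
        merges.append((a, b, d[frozenset((a, b))]))
        merged = a + b
        rest = [x for x in names if x not in (a, b)]
        # Collapse the matrix: average the two old rows into one new row.
        for x in rest:
            d[frozenset((merged, x))] = (d[frozenset((a, x))] + d[frozenset((b, x))]) / 2
        names = rest + [merged]
    return merges


labels = "ABCDE"
matrix = [
    [0, 6, 9, 11, 9],
    [6, 0, 7, 9, 7],
    [9, 7, 0, 8, 6],
    [11, 9, 8, 0, 4],
    [9, 7, 6, 4, 0],
]
distances = {
    frozenset((labels[i], labels[j])): matrix[i][j]
    for i in range(5)
    for j in range(i + 1, 5)
}
for step in cluster(labels, distances):
    print(step)
```

The merge order is D with E, then A with B, then C with DE, and finally AB with CDE, which is the order in which the tree on the slides grows.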
The problem is actually bigger
1 ASWTFGHK
2 GTWSFANR
3 ATWAFADR
d(i,j) is the distance
between sequences
i and j.
d(1,2)=6; d(1,3)=5; d(2,3)=3.
So a perfect representation would be:
[Figure: sequences 1, 2 and 3 drawn as points separated by the distances d(i,j)]
But what if a 4th sequence
is added with d(1,4)=4,
d(2,4)=5, d(3,4)=4? Where
would that sequence sit?
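The three distances quoted above are reproduced if d(i,j) is taken to be the number of positions at which two sequences differ. That distance measure is an assumption made for this sketch; the slide does not name the measure it uses.

```python
from itertools import combinations

seqs = {1: "ASWTFGHK", 2: "GTWSFANR", 3: "ATWAFADR"}


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


# All pair-wise distances, keyed by the pair of sequence numbers.
distances = {(i, j): hamming(seqs[i], seqs[j]) for i, j in combinations(seqs, 2)}
print(distances)
```

For N sequences this gives N(N-1)/2 distances, so every added sequence brings N new distances that the drawing has to satisfy.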
So, nice tree, but what did we actually do?
1) We determined a distance measure
2) We measured all pair-wise distances
3) We reduced the dimensionality of the space of the problem
4) We used an algorithm to visualize the result
In a way, we projected the hyperspace in which we can perfectly
describe all pair-wise distances onto a 1-dimensional line.
What does this sentence mean?
Back to sequences:
If we have N sequences, we can only
draw their distance matrix in an N-1
dimensional space. By the time it is a
tree, how many dimensions, and how
much information have we lost?
Perhaps we should cluster in a different
way?
Other algorithms
Multi-sequence alignment can also be done with an
iterative ‘profile’ alignment.
A) Make an alignment of a few sequences that align well, and derive a profile from it
B) Align all sequences using this profile
1. What is a profile?
Normally, we use a PAM-like matrix to determine the
score for each possible match in an alignment.
This assumes that all I <-> E matches are equivalent,
wherever they occur in the sequence. But they aren’t.
2. What is a profile?
QWERTYIPASEF
QWEKSFIPGSEY
NWERTMVPVSEM
QFEKTYLPSSEY
NFIKTLMPATEF
QYIRSLIPAGEM
NYIQSLIPSTEL
QFIRSLFPSSEI
Marked columns: 1 = column 3, 2 = column 7, 3 = column 11.
At 1, E and I are both OK.
At 2, I is OK, but E surely not.
At 3, E is OK, but I surely not.
3. What is a profile?
The knowledge about which residue types are good at a
certain position in the multiple sequence alignment can
be expressed in a profile.
A profile holds, for each position, 20 scores for the 20
residue types, and sometimes also two values for
position-specific gap-open and gap-elongation penalties.
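As a rough illustration, the sketch below derives a very simple profile from the eight aligned sequences shown earlier. It stores plain residue frequencies per column, which is a simplifying assumption: a real profile holds a score for all 20 residue types, usually derived from such frequencies combined with a substitution matrix. The name `column_frequencies` is hypothetical.

```python
from collections import Counter

ALIGNMENT = [
    "QWERTYIPASEF",
    "QWEKSFIPGSEY",
    "NWERTMVPVSEM",
    "QFEKTYLPSSEY",
    "NFIKTLMPATEF",
    "QYIRSLIPAGEM",
    "NYIQSLIPSTEL",
    "QFIRSLFPSSEI",
]


def column_frequencies(alignment):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({res: n / len(column) for res, n in counts.items()})
    return profile


profile = column_frequencies(ALIGNMENT)
for position in (3, 7, 11):            # 1-based column numbers
    print(position, profile[position - 1])
```

Column 3 contains only E and I, column 7 contains I but never E, and column 11 contains only E, so the same I <-> E match would be scored differently at each of these positions.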
Conserved, variable, or in-between
QWERTYASDFGRGH
QWERTYASDTHRPM
QWERTNMKDFGRKC
QWERTNMKDTHRVW
Gray = conserved
Black = variable
Green = correlated mutations
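One plausible way to sort the columns of this small alignment into the three classes is sketched below. The rule used is an assumption made for illustration: a column is "conserved" if it holds a single residue type, "correlated" if at least one other column splits the sequences into exactly the same groups, and "variable" otherwise. The function names are hypothetical.

```python
ALIGNMENT = [
    "QWERTYASDFGRGH",
    "QWERTYASDTHRPM",
    "QWERTNMKDFGRKC",
    "QWERTNMKDTHRVW",
]


def column_pattern(column):
    """Relabel residues by first appearance, e.g. 'YYNN' -> (0, 0, 1, 1)."""
    seen = {}
    return tuple(seen.setdefault(res, len(seen)) for res in column)


def classify_columns(alignment):
    """Label every column as 'conserved', 'correlated' or 'variable'."""
    patterns = [column_pattern(col) for col in zip(*alignment)]
    labels = []
    for pattern in patterns:
        groups = len(set(pattern))
        if groups == 1:
            labels.append("conserved")
        elif groups < len(pattern) and patterns.count(pattern) > 1:
            # Another column groups the sequences in exactly the same way.
            labels.append("correlated")
        else:
            labels.append("variable")
    return labels


for number, label in enumerate(classify_columns(ALIGNMENT), start=1):
    print(number, label)
```

Columns 6 to 8 group sequences 1 and 2 against 3 and 4, while columns 10 and 11 group 1 and 3 against 2 and 4; the last two columns differ in every sequence and carry no grouping information.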
Correlated mutations determine the tree shape
1
2
3
4
AGASDFDFGHKM
AGASDFDFRRRL
AGLPDFMNGHSI
AGLPDFMNRRRV