Smith-Waterman algorithm
Download
Report
Transcript Smith-Waterman algorithm
Alignment methods
April 26, 2011
Return Quiz 1 today
Return homework #4 today.
Next homework due Tues, May 3
Learning objectives- Understand the Smith-Waterman
algorithm. Have a general understanding about PAM and
BLOSUM scoring matrices.
Workshop-Compare scoring matrices.
Smith-Waterman Algorithm
Advances in
Applied Mathematics, 2:482-489 (1981)
Smith-Waterman algorithm –can be used for global or
local alignment
-Memory intensive
-Common searching programs such as BLAST use SW algorithm
Smith-Waterman algorithm
Mi,j = MAXIMUM [
Mi-1, j-1 + si,,j (match or mismatch in the diagonal),
Mi, j-1 + w (gap in sequence #1),
Mi-1, j + w (gap in sequence #2),
0]
Where Mi-1, j-1 is the value in the cell diagonally juxtaposed to Mi,j.
(The i-1, j-1 cell is up and to the left of mi,nj).
Where si,j is the value for the match or mismatch in the minj cell.
Where Mi, j-1 is the value in the cell above Mi,j.
Where w is the value for the gap penalty.
Where Mi-1, j is the value in the cell to the left of Mi,j.
Two sequences to align
Sequence 1: ABCNJRQCLCRPM
Sequence 2: AJCJNRCKCRBP
Initialization step: create matrix with M + 1 columns
and N + 1 rows. M = number of letters in sequence 1 and N =
number of letters in sequence 2.
First column (M-1) and first row (N-1) will be filled with 0’s.
A
J
C
J
N
R
C
K
C
R
B
P
0
0
0
0
0
0
0
0
0
0
0
0
0
A
0
1
B
0
C
0
N
0
J
0
R
0
Q
0
C
0
L
0
C
0
R
0
P
0
M
0
Matrix fill step: Each position Mi,j is defined to be the
MAXIMUM score at position i,j
row
Mi,j = MAXIMUM [
column
Mi-1, j-1 + si,,j (match or mismatch in the diagonal)
Mi, j-1 + w (gap in sequence #1)
Mi-1, j + w (gap in sequence #2)]
A
J
C
J
N
R
C
K
C
R
B
P
0
0
0
0
0
0
0
0
0
0
0
0
0
A
0
1
1
1
1
1
1
1
1
1
1
1
1
B
0
1
C
0
1
N
0
1
J
0
1
R
0
1
Q
0
1
C
0
1
L
0
1
C
0
1
R
0
1
P
0
1
M
0
1
A
J
C
J
N
R
C
K
C
R
B
P
0
0
0
0
0
0
0
0
0
0
0
0
0
A
0
1
1
1
1
1
1
1
1
1
1
1
1
B
0
1
1
1
1
1
1
1
1
1
1
2
2
C
0
1
1
2
2
2
2
2
2
2
2
2
2
N
0
1
1
2
2
3
3
3
3
3
3
3
3
J
0
1
2
2
3
3
3
3
3
3
3
3
3
R
0
1
2
2
3
3
4
4
4
4
4
4
4
Q
0
1
2
2
3
3
4
4
4
4
4
4
4
C
0
1
2
3
3
3
4
5
5
5
5
5
5
L
0
1
2
3
3
3
4
5
5
5
5
5
5
C
0
1
2
3
3
3
4
5
5
6
6
6
6
R
0
1
2
3
3
3
4
5
5
6
7
7
7
P
0
1
2
3
3
3
4
5
5
6
7
7
8
M
0
1
2
3
3
3
4
5
5
6
7
7
8
Sequence 1: ABCNJ-RQCLCR-PM
Sequence 2: AJC-JNR-CKCRBPScore
: 8
Smith-Waterman (local alignment)
a. Initializes edges of the matrix with zeros
b. It searches for sequence matches.
c. Assigns a score to each pair of amino acids
-uses similarity scores
-uses positive scores for related residues
-uses negative scores for substitutions and gaps
d. Scores are summed for placement into Mi,j. If any
sum result is below 0, a 0 is placed into Mi,j.
e. Backtracing begins at the maximum value found
anywhere in the matrix.
f. Backtrace continues until the it meets an Mi,j value of 0.
BLOSUM 45 Scoring Matrix
H
E
A
G
A
W
G
H
E
E
0
0
0
0
0
0
0
0
0
0
0
P
0
0
0
0
0
0
0
0
0
0
0
A
0
0
0
5
0
5
0
0
0
0
0
W
0
0
0
0
3
0
20
12
4
0
0
H
0
10
2
0
0
1
12
18
22
14
6
E
0
2
16
8
0
0
4
10
18
28
20
A
0
0
8
21
13
5
0
4
10
20
27
E
0
0
6
13
19
12
4
0
4
16
26
A W G H
A W – H
Score:
5 15 -8 10
Total score: 28
Pecent similarity: 4/5 x
E
E
6
100 = 80%
How does one achieve the
“perfect database search”?
Consider the following:
Scoring Matrices (PAM vs. BLOSUM)
Local alignment algorithm
Database
Search Parameters
Expect Value-change threshold for score
reporting
Filtering-remove repeat sequences
Which Scoring Matrix to use?
PAM-1
BLOSUM-100
Small evolutionary
distance
High identity within
short sequences
PAM-250
BLOSUM-20
Large evolutionary
distance
Low identity within
long sequences
BLOSUM Scoring Matrices
Which BLOSUM Matrix to use?
BLOSUM
80
62
35
Identity (up to)
80%
62% (usually default value)
35%
If you are comparing sequences that are very similar, use
BLOSUM 80. Sequences that are more divergent (dissimilar)
than 20% are given very low scores in this matrix.
Logic behind PAM scoring
matrix
Replacement amino acid
Original amino acid
A
R
N
D
C
Q
E
G
H
I
L
K
M
F
P
S
T
W
Y
V
---30
109
154
33
93
266
579
21
66
95
57
29
20
345
772
590
0
20
365
A
---17
0
10
120
0
10
103
20
17
477
17
7
67
137
20
27
3
20
R
---532 ---0
0 ---50 76
0
94 831
0
156 162 10
226 43 10
36 13 17
37
0
0
322 85
0
0
0
0
7
0
0
27 10 10
432 98 117
169 57 10
3
0
0
36
0 30
13 17 33
N
D
C
---422 ---30 112 ---243 23 10 ---8 35
0
3
75 15 17 40
147 104 50 23
20
7
7
0
0
0 17 20
93 40 49 50
47 86 450 26
37 31 50 14
0
0
0
3
0 10
0 40
27 37 97 30
Q
E
G
H
---253 ---43 39 ---57 207 90 ---90 167
0 17 ---7 43 43
4
7 ---20 32 168 20 40 269 ---129 52 200 28 10 73 696 ---0 13
0
0 10
0 17
0 ---13 23 10
0 260
0 22 23
6 ---661 303 17 77 10 50 43 186
0 17 ---I
L
K
M
F
P
S
T
W
Y
V
Figure 4.2 Numbers of accepted point mutations (multiplied
by 10). A total of 1572 exchanges are shown. Positions with
red dashes are Mjj values. Modified from Dayhoff, 1978.
Relative mutability calculations
Aligned sequences
A
A
Amino acids
Changes
Frequency of occurrences
Relative mutability
A
1
3
0.33
D
D
A
B
B
1
1
1
Figure 4.3 Simplified example to show how relative
mutability is calculated.
D
0
2
0
Development of the Mutation
Probability Matrix.
Mij = λmjAij/(ΣAij)
where
Aij is an element of the accepted point mutation matrix (see Fig. 4.2)
λ is the proportionality constant (to be discussed below)
mj is the relative mutability of the amino acids on the bottom row
Here is the second equation. This equation applies when the original amino acid and the
replacement amino acid are the same. The diagonal elements (all in cells with the location Mjj)
have the value:
Mjj = 1-λmj
Development of the Mutation
Probability Matrix. (2)
Figure 4.4. Mutational Probability Matrix (partial). This only shows 5 of the 20 amino acids in the MPM.
Numbers were multiplied by 10,000 to make it easier to read.
The numbers for each column adds up to 10,000. In the top row there are the replacement amino acids a
nd on the left column are the original amino acids. Mjj values shown are 9867, 9913, 9822, 9859 and 9973.
Table 4.1 Normalized frequencies of amino acids (from Dayhoff’s data)1
Amino acid
Normalized
Amino acid
Frequency
G
0.089
R
A
0.087
N
L
0.085
F
K
0.081
Q
S
0.070
I
V
0.065
H
T
0.058
C
P
0.051
Y
E
0.050
M
D
0.047
W
Normalized
Frequency
0.041
0.040
0.040
0.038
0.037
0.034
0.033
0.030
0.015
0.010
What is percent of amino acids
that differ in the MPM?
100 x ΣfiMii
Where fi = normalized frequency of amino acid i
Where Mii = Mjj
This value totals 99 for each amino acid. There is a 1% difference
for each amino acid
Conversion of the PAM1 Mutational
Probability Matrix to the PAM1 Scoring
Matrix.
Sij=10log10(Mij/fi) equation 4.1
Where Sij is the log-odds score for amino i replacing amino acid j. For a PAM1 scoring
matrix let’s take Gly replacing Ala as an example. From Figure 4.2 we obtain a value of 0.0021
for Mij. The fi of Gly is 0.089 (from Table 4.1). Thus, the Sij for this replacement is:
Sij=10log10(0.0021/0.089) = -16
Conversion of the PAM1 Mutational
Probability Matrix to other PAM scoring
matrices.
Table 4.2 Correspondence between Observed Percent Amino acid differences and PAMs
PAM Mutational Probability Matrices1
Observed percent amino acid differences2
1
1
5
5
11
10
17
15
30
25
56
40
80
50
112
60
159
70
246
80
1Mutation
Probability Matrices generated by the equation
(PAM1 MPM)n where n is the number listed in the first
column.