Document

Transcript Document

Sequence Alignment I
Dot plots
Dynamic Programming
Why align sequences?

conserved sequencesconserved
function



Assess ancestry among homologs
(sequences with common ancestory) to
help in gene finding and annotation
Find consensus motifs among related
sequences (e.g., regulatory and structural
regions)
Estimate the rate of evolution
Definitions
http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/
Definitions
http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/
Problem
http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/
Are the sequences related?
http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/
Types of Alignment




Local
Global
Gap-free
Gapped
Methods of Alignment



Dot matrix
Dynamic programming
K-tuple
Dot plots


To evaluate/visualize similarity
between two sequences
Create a matrix when sequence 1 is a
row vector and sequence 2 is a
column vector.
Dot plot (identity matrix)
http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/
Self-alignment
Self-alignment with sliding
window
Self-alignment with sliding
window
Dotmatrix
Seq1 = ‘ATCAA’
Seq2 = ‘ATCGA’
A
T
C
A
A
A
1
0
0
1
1
T
0
1
0
0
0
C
0
0
1
0
0
G 0
0
0
0
0
A
0
0
1
1
1
Function dotplot1.m

function [dotmatrix]=dotplot1(seq1,seq2)


OUTPUT: [dotmatrix] is an n x m matrix with
1s and 0s, where 0 means a mismatch and 1
means a match between two nucleotides
INPUT: seq1 and seq2 are strings made of
a,t,c, or g. (seq1 is a row vector 1 x m; and
seq2 is a column vector (n x1).
Run dotplot1.m




Place dotplot1.m into your directory.
Open a Matlab session.
Open the file named dotplot1.m under the file
option.
Define seq1 and seq2 as the following; and run the
program:
>> seq1 = ‘attataagg’
>> seq2 = ‘attataggg’
>> [dotmatrix] = dotplot1(seq1,seq2)

Next, copy and paste each line of the dotplot1.m on
the command window without using ; to track what
the program is doing.
Dot plot
Seq1='attataagg‘
Seq2='attataggg'
A match is red (=1);
a mismatch is blue (=0)
Dot plot with sliding windows



Sliding windows consider more than
one position at a time.
Similarity cutoffs (threshold) also allow
to get rid of positions with low similarity
across the window.
Sliding window can reduce the noise in
the dot plot.
Dot plot with sliding windows
Use a sliding window of size w, for example w
=3, such that only positions with w number of
consecutive matches along the diagonal
count.
A
T
C
A
A
A
3
0
0
0
0
T
0
3
0
0
0
C
0
0
3
0
0
G
0
0
0
0
0
A
0
0
0
0
0
A T C A A
A
T
C
G
A
1
0
0
0
1
0
1
0
0
0
0
0
1
0
0
1
0
0
0
1
1
0
0
0
1

Dot plot with sliding windows
3
3
0
Dot plot with sliding windows
3 0
0
3 0
0
0
3 0
0 2
0
Dot plot with sliding windows
A T
C A A
A 3
0
0
0
0
T
0
2
0
0
0
C 0
0
2
0
0
G 0
0
0
0
0
A 0
0
0
0
0
Function dotplot2.m

function [dotmatrix,dot]=dotplot1(seq1,seq2,w,t)

Introduced two variables:


W = size of the sliding window
T = threshold, number of matches along the
diagonal to assume a match.
Function dotplot2.m
Worksheet (due end of lecture)


Examine the code in dotplot2.m to see what
commands have been used to slide the
window and count the number of
consecutive matches. Briefly write down
your assessment.
Type in different sequences for seq1 and
seq2 (no longer than 30) and try two
different w and t values to run dotplot2.m
Dot plot with sliding windows
Seq1='attataagg‘
Seq2='attataggg'
A match is red; a mismatch is blue
Sliding window size = 1
Threshold is 1
Sliding window size = 3
Threshold is 2
Sliding window size = 3
Threshold is 3
Seq1='attataagg’ Seq1='attataagg’
Seq2='attataggg' Seq2='attataggg'
Alignment


Is a pairwise match between the
characters of each sequence.
A true alignment reflects evolutionarily
common ancestry (homology).
Changes that occur in
sequences


A mutation that replaces one character
with another is a substitution.
An insertion that adds one or more
positions and a deletion that deletes
one or more positions are known as
indels (gaps)
Alignment example:






AATCTATA
AAGATA
AATCTATA
AAGATA
AATCTATA
AAGATA
Gap-free alignment: match
and mismatch

n

An alignment receives for each aligned
pair of identical residues (the match
score) and the penalty for aligned pair
of nonidentical residues (mismatch
score).

match score; if seq1=seq2
mismatch score; if seq1seq2
where
n is the length of the longer sequence.
Gap-free alignment example:






AATCTATA
AAGATA
AATCTATA
AAGATA
AATCTATA
AAGATA
Alignment scores would be 4, 1, 3, respectively; if the match
score is 1 and mismatch score is 0.
Gaps

Indels complicate alignments by
increasing the number of possible
alignments between two or more
sequences.
Alignment: match, mismatch,
and gap penalty

n

An alignment receives a score for each
aligned pair of identical residues (the
match score) and the penalty for
aligned pair of nonidentical residues
(mismatch score), and a penalty for
insertion of gaps.

gap penalty; if seq1 = ‘-’ or seq2 = ‘-’
match score; if no gaps and seq1=seq2
mismatch score; if no gaps and seq1seq2
where
n is the length of the longer sequence.
Gap-free alignment example:






AATCTATA
AAG-AT-A
AATCTATA
AA-G-ATA
AATCTATA
AA--GATA
Alignment scores would be 1, 3, 3, respectively; if the match
score is 1; mismatch score is 0; and gap penalty is -1.
Origination and Length
Penalties


Simple gap penalties lead to many
optimal alignments (those having the
same score).
Mutations are rare, invoking fewest
number of unlikely events is
evolutionarily sound.

3-nt indel would be more common than
multiple single indels.
Origination and Length
Penalties


Origination penalty: starting a new
series of gaps in one of the sequences
being aligned.
Length penalty: number of sequential
missing characters.
Gap-free alignment example:






AATCTATA
AAG-AT-A
AATCTATA
AA-G-ATA
AATCTATA
AA--GATA
Alignment scores would be -3, -1, +1, respectively;
if the match score is 1; mismatch score is 0; and origination
penalty is -2 and length penalty is -1.
Scoring matrices

Mismatch penalty can be used to provide
further discrimination:


Two protein sequences, one of which has an
alanine in a given position: A substitution to
valine (another small hydrophobic aa) would
have less impact than a change to lysine (a
large, charged residue).
One can weigh these substitutions differently
based on the likelihood of occurrence over time
or based on other characteristics.
Scoring matrices

A scoring matrix is used to score each
nongap position in the alignment.


Nucleotide sequences
Amino acid sequences
Identity matrix
Scoring matrices for DNA
sequences
A
T
C
G
A
T
C
G
A
5
-4
-4
-4
A
1
-5
-5
-1
T
-4
5
-4
-4
T
-5
1
-1
-5
C
-4
-4
5
-4
C
-5
-1
1
-5
G
-4
5
-4
5
G
-1
-5
-5
1
BLAST matrix
Transition/Transversion matrix
Scoring Matrices
Need to know the frequency of one amino acid
substituting for another versus that event
happening by chance alone based on
frequency of occurrence of each amino acid:
odds ratio = P(ab)/q(a)*q(b)
Scoring matrices for amino
acids


Blosum
PAM (point accepted mutation)
Alignment and score for aa
BLOSUM 62 Scoring Matrix
BLOSUM


Ungapped alignments of related
proteins are grouped using clustering
techniques, substitution rates between
clusters are calculated.
A BLOSUM-62 matrix is appropriate
for comparing sequences of
approximately 62% sequence
similarity.
http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/
PAM matrices
http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/
PAM matrices
http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/
PAM-1
http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/
PAM1(multiplied by 10000)
Ala
Arg
Asn
Asp
Cys
Gln
Glu
Gly
His
Ile
Leu
Lys
Met
Phe
Pro
Ser
Thr
Trp
Tyr
Val
A
R
N
D
C
Q
E
G
H
I
L
K
M
F
P
S
T
W
Y
V
Ala Arg Asn Asp Cys Gln Glu Gly His Ile
Leu Lys Met Phe Pro Ser Thr Trp Tyr Val
A
R
N
D
C
Q
E
G
H
I
L
K
M
F
P
S
T
W
Y
V
9867
2
9 10
3
8 17 21
2
6
4
2
6
2 22 35 32
0
2 18
1 9913
1
0
1 10
0
0 10
3
1 19
4
1
4
6
1
8
0
1
4
1 9822
36
0
4
6
6 21
3
1 13
0
1
2 20
9
1
4
1
6
0 42 9859
0
6 53
6
4
1
0
3
0
0
1
5
3
0
0
1
1
1
0
0 9973
0
0
0
1
1
0
0
0
0
1
5
1
0
3
2
3
9
4
5
0 9876 27
1 23
1
3
6
4
0
6
2
2
0
0
1
10
0
7 56
0 35 9865
4
2
3
1
4
1
0
3
4
2
0
1
2
21
1 12 11
1
3
7 9935
1
0
1
2
1
1
3 21
3
0
0
5
1
8 18
3
1 20
1
0 9912
0
1
1
0
2
3
1
1
1
4
1
2
2
3
1
2
1
2
0
0 9872
9
2 12
7
0
1
7
0
1 33
3
1
3
0
0
6
1
1
4 22 9947
2 45 13
3
1
3
4
2 15
2 37 25
6
0 12
7
2
2
4
1 9926
20
0
3
8 11
0
1
1
1
1
0
0
0
2
0
0
0
5
8
4 9874
1
0
1
2
0
0
4
1
1
1
0
0
0
0
1
2
8
6
0
4 9946
0
2
1
3 28
0
13
5
2
1
1
8
3
2
5
1
2
2
1
1 9926
12
4
0
0
2
28 11 34
7 11
4
6 16
2
2
1
7
4
3 17 9840
38
5
2
2
22
2 13
4
1
3
2
2
1 11
2
8
6
1
5 32 9871
0
2
9
0
2
0
0
0
0
0
0
0
0
0
0
0
1
0
1
0 9976
1
0
1
0
3
0
3
0
1
0
4
1
1
0
0 21
0
1
1
2 9945
1
13
2
1
1
3
2
2
3
3 57 11
1 17
1
3
2 10
0
2 9901
PAM1*PAM1=PAM2
9736
4
18
20
6
16
34
42
4
12
8
4
12
4
44
69
63
0
4
36
2 9827
2
0
2
20
0
0
20
6
2
38
8
2
8
12
2
16
0
2
8
2 9647
71
0
8
12
12
41
6
2
26
0
2
4
39
18
2
8
2
12
0
83 9720
0
12 105
12
8
2
0
6
0
0
2
10
6
0
0
2
2
2
0
0 9946
0
0
0
2
2
0
0
0
0
2
10
2
0
6
4
6
18
8
10
0 9754
53
2
46
2
6
12
8
0
12
4
4
0
0
2
20
0
14 111
0
69 9732
8
4
6
2
8
2
0
6
8
4
0
2
4
42
2
24
22
2
6
14 9871
2
0
2
4
2
2
6
42
6
0
0
10
2
16
36
6
2
40
2
0 9825
0
2
2
0
4
6
2
2
2
8
2
4
4
6
2
4
2
4
0
0 9746
18
4
24
14
0
2
14
0
2
65
6
2
6
0
0
12
2
2
8
44 9894
4
89
26
6
2
6
8
4
30
4
73
49
12
0
24
14
4
4
8
2 9853
40
0
6
16
22
0
2
2
2
2
0
0
0
4
0
0
0
10
16
8 9750
2
0
2
4
0
0
8
2
2
2
0
0
0
0
2
4
16
12
0
8 9892
0
4
2
6
56
0
26
10
4
2
2
16
6
4
10
2
4
4
2
2 9853
24
8
0
0
4
55
22
67
14
22
8
12
32
4
4
2
14
8
6
34 9683
75
10
4
4
44
4
26
8
2
6
4
4
2
22
4
16
12
2
10
63 9744
0
4
18
0
4
0
0
0
0
0
0
0
0
0
0
0
2
0
2
0 9952
2
0
2
0
6
0
6
0
2
0
8
2
2
0
0
42
0
2
2
4 9890
2
26
4
2
2
6
4
4
6
6 113
22
2
34
2
6
4
20
0
4 9803
PAM250 = PAM1250
1332 609 911 935 541
278 1649 399 299 177
431 403 648 656 179
520 344 760 1142 149
198 159 140
98 5179
341 499 453 552 109
549 384 693 1090 150
1199 500 966 1020 413
241 471 480 386 152
316 229 243 210 232
538 421 451 341 214
624 1753 1015 832 236
112 133 102
79
46
200 165 206 126 160
679 492 469 433 272
906 655 822 754 682
785 493 655 575 356
27 162
31
19
20
149 103 196 120 337
654 370 427 381 410
797 942 1179 633 794 590 680 673 417 1162 1131 1163 230 378 921
542 316 205 581 274 219 902 384 175 388 370 334 660 144 233
483 567 452 573 273 204 520 269 199 382 485 458 163 251 276
682 1033 554 541 282 195 479 254 141 411 518 469 102 170 299
99
96 119 154 183
83 100 100 120 202 315 216
56 336 229
963 671 281 741 242 261 459 299 141 413 336 319 127 143 242
879 1204 528 570 325 251 494 304 156 471 509 469 102 179 341
689 930 2670 545 513 379 601 459 293 819 1151 907 179 251 670
644 377 190 1490 176 182 336 189 239 316 272 249 187 317 178
228 222 199 202 1040 627 252 599 454 219 262 376 108 272 861
569 388 349 534 1507 3366 490 2005 1271 479 442 594 554 650 1315
954 808 541 785 539 410 2396 898 255 648 787 820 357 273 472
120
86
72
91 250 353 178 647 144
89 104 138
55
76 229
148 130 167 281 510 596 133 426 3244 129 213 222 438 2006 298
560 474 474 497 341 315 419 333 218 1988 642 564 142 168 388
628 707 895 568 513 368 690 483 342 892 1009 962 387 339 561
495 544 598 426 604 403 604 513 298 659 810 1074 180 289 630
28
19
21
33
24
20
43
24 135
27
66
32 5518 139
17
123 127 101 304 243 250
91 174 1511
97 162 163 307 3075 175
413 406 450 379 1502 972 394 967 498 479 506 687 154 352 1729
The Point-Accepted-Mutation (PAM) model
of evolution and the PAM scoring matrix
Observed %
Difference
1
5
10
20
40
50
60
70
80
Evolutionary Distance
In PAMs
1
5
11
23
56
80
112
159
246
Final Scoring Matrix is the LogOdds Scoring Matrix
Replacement amino acid
S (a,b) = 10 log10(Mab/Pb)
Original amino acid
Frequency of amino acid b
Mutational probability matrix number
PAM unit

PAM-1 is 1 substitution per 100
residues.
Point accepted mutation
(PAM) matrix

Calculated by observing the
substitutions that occur in alignments
between similar sequences with very
high (>85%) identity.
Construct a multiple sequence
alignment







ACGCTAFKI
GCGCTAFKI
ACGCTAFKL
GCGCTGFKI
GCGCTLFKI
ASGCTAFKL
ACACTAFKL
Construct a multiple sequence
alignment
A phylogenetic tree is created
indicating the order in which various
substitutions taken place.
AG
AG
AL
IL
CS
GA
Generating PAM matrix


For each amino acid, the frequency
with which it is substituted by every
other amino acid is calculated.
Substitutions are considered
symmetric: AG counts as GA
For example for FGA count all AG
and GA substitutions.
Construct a multiple sequence
alignment
FGA = AG AG GA = 3
AG
AG
AL
IL
CS
GA
Relative mutability, mi



Number of times the amino acid is
substituted by any other amino acid in the
phylogenetic tree.
This number is then divided by the total
number of mutations that could have
affected the residue.
This denominator is the total number of subs
across entire tree times two, multiplied by
freq. of the amino acid, times a scaling
factor
Construct a multiple sequence
alignment
FGA = AG AG GA = 3
AG
AG
AL
IL
CS
GA
Relative mutability of A: mA




Total of 4 mutations involving A.
Total number mutations in the entire tree is
6 should be multiplied by two = 6 x 2 = 12
Relative frequency of A residues (10 As out
of 7x9=63 residues) is 10/63 = 0.159.
Thus mA = 4/12 x 0.159 x 100 = 0.0209
Mutation probability of A to G:
MGA


mA = 4/12 x 0.159 x 100 = 0.0209
MGA = mA multiplied by #(AG and GA
subs) and then this is divided by #(all subs
involving A)

MGA = (0.0209 x 3)/4 = 0.0156
Scoring matrix: RGA


log(Mij/fi), where Mij mutation probability of ij
and fi equals to relative frequency of j type
residue. For example f(G) = 10/63 = 0.1587
RGA = log(MGA/f(G)) = log(0.0156/0.1587)
= -1.01
PAM matrix calculator

http://www.bioinformatics.nl/tools/pam.
html
Choice of matrix

PAM vs. BLOSUM:





polar residues are less variable in BLOSUM
Since PAM is based on pairs of entire sequences with less than 15%
divergence, most of the variations counted are in surface loop regions,
regions under low evolutionary constraints. Asn is one of the most
mutable residues.
BLOSUM matrices are based on conserved regions of more distantly
related proteins and thus ignore residues in surface loop regions. The
mutability of Asn is close to the average mutability.
Surface loop regions preferentially contain polar residues
Thus, BLOSUM matrices strongly overestimate the conservation of
polar residues relative to PAM matrices.

Matrix Conservation polar/apolar


PAM 0.49
BLOSUM 0.93
Choice of matrix

General: the choice of matrix should be governed
by the use to which the matrix is put.
 Searches for distantly similar sequences, in
which only the BLOCKS are likely to be
conserved should rely on BLOSUM matrices.
 Complete alignments of globular protein
sequences should use PAM matrices.
 Proteins with extensive transmembrane helices
should be aligned using the transmembrane
matrix.
Choice of matrix
http://mcb.berkeley.edu/labs/king/blast/docs/matrix_info.html
Dynamic Programming


Exhaustive search: search all possible
alignments: NOT FEASIBLE
Dynamic programming: a method of
breaking a problem apart into
reasonably sized subproblems and
using these partial results to compute
the final answer.

Needleman and Wunsch
How to start the alignment?

How to break down the problem:


Seq1 = ‘CACGA’
Seq2 = ‘CGA’

Three possible ways to start the problem:



C of seq1 and C of seq2 match
C
ACGA
C
GA
A gap is inserted into 1st position of seq1
CACGA
C
GA
A gap is inserted into 1st position of seq2
C
ACGA
CGA
How to end the alignment?

If we knew the score for the best
alignment, we can track back the steps
to reach to the best score.


Use a table to store each step of the
alignment to refer back later on.
Depends on storing partial sequence
alignments so you don’t do it again and
again.
Dynamic Programming
Partial scores table
Initialize a matrix
where seq1 is on
the rows and seq2
is on the colums
Fill in the first row
and first column
with multiples of
gap penalty, in this
case it equals to -1.
0
A -1
C -2
A -3
G -4
T -5
A -6
G -7
A C T C G
-1 -2 -3 -4 -5
Partial scores table
Start with the first residue of
each sequence:
Should A match A?
The alignment score between
A of seq1 and A of seq2
can come from:
a) diagonal: match (+1) or
mismatch (0)
b) From top means a gap (-1)
inserted in the first
sequence (from left to right)
c) From left means that a gap
(-1) is inserted in the
second sequence (from top
to bottom)
SELECT THE MAXIMUM
AMONG THREE
A C T C G
0 -1 -2 -3 -4 -5
A -1 1
C -2
A -3
G -4
T -5
A -6
G -7
Partial scores table
Continue towards right
Does A match C
a) diagonal: match (+1) or
mismatch (0)
b) From top means a gap (1) inserted in the first
sequence (from left to
right)
c) From left means that a
gap (-1) is inserted in the
second sequence (from
top to bottom)
A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0
C -2
A -3
G -4
T -5
A -6
G -7
Partial scores table
Continue towards right
Does A match T
a) diagonal: match (+1) or
mismatch (0)
b) From top means a gap (1) inserted in the first
sequence (from left to
right)
c) From left means that a
gap (-1) is inserted in the
second sequence (from
top to bottom)
A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1
C -2
A -3
G -4
T -5
A -6
G -7
Partial scores table
Continue towards right
Does A match C
a) diagonal: match (+1) or
mismatch (0)
b) From top means a gap (1) inserted in the first
sequence (from left to
right)
c) From left means that a
gap (-1) is inserted in the
second sequence (from
top to bottom)
A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1 -2
C -2
A -3
G -4
T -5
A -6
G -7
Partial scores table
Continue towards right
Does A match G
a) diagonal: match (+1) or
mismatch (0)
b) From top means a gap (1) inserted in the first
sequence (from left to
right)
c) From left means that a
gap (-1) is inserted in the
second sequence (from
top to bottom)
A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1 -2 -3
C -2
A -3
G -4
T -5
A -6
G -7
Partial scores table
Go to second row and
continue towards right
Does C match A
a) diagonal: match (+1) or
mismatch (0)
b) From top means a gap (1) inserted in the first
sequence (from left to
right)
c) From left means that a
gap (-1) is inserted in the
second sequence (from
top to bottom)
A C T C G
0 -1 -2 -3 -4 -5
A -1 1 0 -1 -2 -3
C -2 0
A -3
G -4
T -5
A -6
G -7
Partial scores table
Continue towards right
Does C match C
a) diagonal: match (+1) or
mismatch (0)
b) From top means a gap (1) inserted in the first
sequence (from left to
right)
c) From left means that a
gap (-1) is inserted in the
second sequence (from
top to bottom)
A
0 -1
A -1 1
C -2 0
A -3
G -4
T -5
A -6
G -7
C T C G
-2 -3 -4 -5
0 -1 -2 -3
2
Partial scores table
Continue towards right
Does C match T
a) diagonal: match (+1) or
mismatch (0)
b) From top means a gap (1) inserted in the first
sequence (from left to
right)
c) From left means that a
gap (-1) is inserted in the
second sequence (from
top to bottom)
A
0 -1
A -1 1
C -2 0
A -3
G -4
T -5
A -6
G -7
C
-2
0
2
T C G
-3 -4 -5
-1 -2 -3
1
Partial scores table
Continue towards right
Does C match C
a) diagonal: match (+1) or
mismatch (0)
b) From top means a gap (1) inserted in the first
sequence (from left to
right)
c) From left means that a
gap (-1) is inserted in the
second sequence (from
top to bottom)
A
0 -1
A -1 1
C -2 0
A -3
G -4
T -5
A -6
G -7
C
-2
0
2
T
-3
-1
1
C G
-4 -5
-2 -3
0
Partial scores table
Continue towards right
Does C match G
a) diagonal: match (+1) or
mismatch (0)
b) From top means a gap (1) inserted in the first
sequence (from left to
right)
c) From left means that a
gap (-1) is inserted in the
second sequence (from
top to bottom)
A
0 -1
A -1 1
C -2 0
A -3
G -4
T -5
A -6
G -7
C
-2
0
2
T
-3
-1
1
C
-4
-2
0
G
-5
-3
-1
Partial scores table
Fill in all cells in the partial
scores table
While you fill in keep tract of
from which direction you
have carried away the
scores from.
0
A -1
C -2
A -3
G -4
T -5
A -6
G -7
A
-1
1
0
-1
-2
-3
-4
-5
C
-2
0
2
1
0
-1
-2
-3
T
-3
-1
1
2
1
1
0
-1
C
-4
-2
0
1
2
1
1
0
G
-5
-3
-1
0
2
2
1
2
Partial scores table
Back trace your steps from
the optimal alignment
score.
Sometimes more than one
optimal alignment is
possible.
0
A -1
C -2
A -3
G -4
T -5
A -6
G -7
A
-1
1
0
-1
-2
-3
-4
-5
C
-2
0
2
1
0
-1
-2
-3
T
-3
-1
1
2
1
1
0
-1
C
-4
-2
0
1
2
1
1
0
G
-5
-3
-1
0
2
2
1
2
Partial scores table
From the end point:
G
G
TCG
TAG
--TCG
AGTAG
AC—-TCG
ACAGTAG
0
A -1
C -2
A -3
G -4
T -5
A -6
G -7
A
-1
1
0
-1
-2
-3
-4
-5
C
-2
0
2
1
0
-1
-2
-3
T
-3
-1
1
2
1
1
0
-1
C
-4
-2
0
1
2
1
1
0
G
-5
-3
-1
0
2
2
1
2
Dynamic Programming
http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/
Dynamic Programming
http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/
Dynamic Programming
http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/
Dynamic Programming
http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/
Dynamic Programming
Mismatch between D and V is -3; gap is -8
Dynamic Programming
http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/
Dynamic Programming
http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/
Trace-back
Global alignments

Needleman and Wunch algorithm is
global:


compares two sequences in their entirety
a gap penalty is assessed regardless of
whether gaps are located internally within
a sequence, or at the end of one of both
sequences.
Semi-global alignment


Terminal gaps are undermined
because they are generally due to
incomplete data and have no biological
significance.
How is it different than the global
algorithm?
Semi-global vs. global


In global a vertical move equals to a
gap in the horizontal axis while a move
to left equals to a gap in the vertical
axis; they are both penalized.
If one allows initial gaps without
penalty in the sequences, should set
the first row and first column of the
partial scores table to 0.
Semi-global alignment
0
A
C
A
C
T
G
0
0
0
0
0
0
0
A
0
1
0
1
0
0
0
C
0
0
2
1
2
1
0
A
0
1
1
3
2
2
1
C
0
0
2
2
4
3
2
T
0
0
1
2
3
5
3
G
0
0
0
1
2
4
6
A
0
1
0
1
1
3
6
T
0
0
1
0
1
2
6
C
0
0
1
1
1
1
6
Horizontal moves on the bottom row and vertical moves on the right
most column are penalty free.
G
0
0
0
1
1
1
6
Semi-global alignment
ACACTGATCG
ACACTG----
Local alignment



Smith-Waterman Algorithm
If you have a long sequence, and want
to find any subsequences that are
similar to any part of yeast genome.
Local alignment finds the best
matching subsequences within two
search sequences.
Modify global alignment
algorithm

Smith-Waterman Algorithm

Place a zero in any position in the table if
all of the other methods result in scores
lower than zero.
Local Alignment
http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/
local alignment
0
G
C
G
A
T
A
T
A
0
0
0
0
0
0
0
0
0
A
0
0
0
0
1
0
1
0
1
A
0
0
0
0
1
0
1
0
1
C
0
0
1
0
0
0
0
0
0
C
0
0
1
0
0
0
0
0
0
T
0
0
0
0
0
1
0
1
0
A
0
0
0
0
1
0
2
1
2
T
0
1
0
0
0
2
0
3
2
A
0
0
0
0
1
1
3
2
4
G
0
1
0
1
0
0
2
2
3
C
0
0
2
0
0
0
1
1
2
T
0
0
1
1
0
1
0
2
1
Multiple Sequence Alignment
GAPS
http://ocw.mit.edu/OcwWeb/Biology/7-91JSpring2004/CourseHome/
Use align1 to align: enter x and y
in the command line (>>)
>> x='ggagaggat'
x=
ggagaggat
>> y='ccagacct'
y=
ccagacct
>> align1
Use align1 to align: enter x and y
in the command line (>>)
>>align1
xalign =
ggagaggat
yalign =
ccagacc-t
ggagaggat
ccagacc-t
Use alignloc1 to align: type in
alignloc1 at the >>
>>alignloc1
xalign =
aga
yalign =
aga
aga
aga
What is different?
No gap penalty at the initiation
%Initial Conditions for matrices F and I:
for i = 2:M+1
F(i,1) = (i-1)*0
I(i,1) = 1
%I:vertical
end
for j = 2:N+1
F(1,j) = (j-1)*0
I(1,j) = 3
%I:horizontal
end
What is different?
Maximization
[F(i,j) I(i,j)] = max([F(i-1,j)+g F(i-1,j-1)+w F(i,j-1)+g 0])
What is different?
Find the max value of the F matrix
[Yj,fj]=max(max(F)) % find column id
[Yi,fi]=max(max(F')) % find row id
i=fi(1)
% set current i
j=fj(1)
% set current j
What is different?
When maximum value is 0 then stop
if I(i,j) == 1
i = i-1
xrev(k) = x(i)
yrev(k) = '-'
elseif I(i,j) == 2
i = i-1; j = j-1
xrev(k) = x(i)
yrev(k) = y(j)
elseif I(i,j) == 3
j = j-1;
xrev(k) = '-'
yrev(k) = y(j)
else
break
end

Document

Transcript Document

Directory