Lecture_3 Sequence Alignment

Transcript Lecture_3 Sequence Alignment

Sequence Alignment
Arthur W. Chou
Tunghai University
Fall 2005
Sequence Alignment
Input:
Output:
two sequences over the same alphabet
an alignment of the two sequences
Example:
 GCGCATGGATTGAGCGA
 TGCGCCATTGATGACCA
A possible alignment:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Why align sequences?
Lots of sequences don’t have known ancestry,
structure, or function. A few of them do.

If they align, they are similar.

If they are similar, they might have the same
ancestry, similar structure or function.

If one of them has known ancestry, structure, or
function, then alignment to the others yields
insight about them.
Alignments
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Three kinds of match:
Exact matches
Mismatches
Indels (gaps)
Choosing Alignments
There are many possible alignments
For example, compare:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
to
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA-Which one is better?
Scoring Alignments



Similar sequences evolved from a common ancestor
Evolution changed the sequences from this ancestral
sequence by mutations:
 Replacement: one letter replaced by another
 Deletion: deletion of a character
 Insertion: insertion of a character
Scoring of sequence similarity should examine how
many and which operations took place
Simple Scoring Rule
Score each position independently:



Match:
Mismatch:
Indel:
+1
-1
-2
Score of an alignment is sum of position scores
Example
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Score: (+113) + (-1  2) + (-2  4) = 3
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA-Score: (+1  5) + (-1  6) + (-2  11) = -23
More General Scores

The choice of +1,-1, and -2 scores is quite
arbitrary

Depending on the context, some changes are
more plausible than others
 Exchange
of an amino-acid by one with similar
properties (size, charge, etc.) vs.
 Exchange of an amino-acid by one with opposite
properties

Probabilistic interpretation: How likely is one
alignment versus another ?
Dot Matrix Method




A dot is placed at each position
where two residues match.
It's a visual aid. The human eye
can rapidly identify similar
regions in sequences.
It's a good way to explore
sequence organization: e.g.
sequence repeats.
It does not provide an alignment.
T
H
E
F
A
T
C
A
T
T
H
E
F
A
S
T
C
A
T
This method produces dot-plots with too much noise THEFA-TCAT
||||| ||||
to be useful
THEFASTCAT
 The noise can be reduced by calculating a score
using a window of residues.
 The score is compared to a threshold or stringency.
Dot Matrix Representation
Tissue-Type plasminogen Activator


Urokinase-Type plasminogen Activator

Produces a graphical
representation of
similarity regions
The horizontal and
vertical dimensions
correspond to the
compared sequences
A region of similarity
stands out as a
diagonal
Dot Matrix or Dot-plot


Each window of the first sequence is aligned (without
gaps) to each window of the 2nd sequence
A colour is set into a rectangular array according to the
score of the aligned windows
T H E
F
A T
C A T
T
H
E
F
A
S
T
HEF
THE
CAT
|||
THE
HEF
C
A
T
-5
Score:
Score: 23
-4
Dot Matrix Display



Diagonal rows (
) of dots
reveal sequence similarity
or repeats.
Anti-diagonal rows (
)
of dots represent inverted
repeats.
Isolated dots represent
random similarity.
H C G E T F G R W F T P E W
K
C
G
P
T
F
G
R
I
A
C
G
E
M
•
•
•
•
•
•
•
•
•
•
•
•
•
•
Dot matrix web server
http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
We can filter it by using a
sliding window looking for
longer strings of matches and
eliminates random matches
Longest Common Subsequence
Sequence A:
Sequence B:
nematode_knowledge
empty_bottle
n e m a t o d e _ k n o w
l e d g e
| |
|
|
|
| |
e m p t y
_ b
o t t l e
 LCS
Alignment with match score 1,
mismatch score 0, and gap penalty 0
What is an algorithm?
A step-by-step description of the procedures
to accomplish a task.
 Properties: 1. Determination of output for each input

2. Generality
3. Termination

Criteria:
1. Correctness (proof, test, etc.)
2. Time efficiency (no. of steps is small)
3. Space efficiency (spaced used is small)
Naïve algorithm: exhaustive search
G C G A A T G G A T T G A G C G T
sequences of length “n”
i
T G A G C C A T T G A T G A C C A
j
ijjijijjii..............
Worst case time complexity is ~ 2 2n
Dynamic programming algorithms
for pairwise sequence alignment
Similar to Longest Common Subsequence
 Introduced for biological sequences by

 S.
B. Needleman & C. D. Wunsch. A general
method applicable to the search for similarities
in the amino acid sequence of two proteins.
J. Mol. Biol. 48:443-453 (1970)
Dynamic Programming




Optimality substructure
Reduction to a “small” number of sub-problems
Memorization of solutions to sub-problems in a
table
Table look-up and tracing
- G C G C – A T G G A T T G A G C G A
T G C G C C A T T G A T – G A C C - A
Optimality Sub-structure
二陽指
TM
G C G A A T G G A T T G A G C G T
sequences of length “n”
i
T G A G C C A T T G A T G A C C A
j
- G C G C – A T G G A T T G A G C G A
T G C G C C A T T G A T – G A C C - A
Recursive LCS
lcs_len( i , j ): length of LCS from i-th position onward
in String A and from j-th position onward in String B
int lcs_len ( i , j ) {
if (A[ i ] == ‘\0’ || B[ j ] == ‘\0’ ) return 0 ;
else
if (A[ i ] == B[ j ] )
return ( 1 + lcs_len ( i+1, j+1 ) ) ;
else
return max ( lcs_len ( i+1, j ) ,
lcs_len ( i, j+1 )
);
}
Reduction to Subproblems
int lcs_len ( String A , String B )
{ return subproblem ( 0, 0 );
}
int subproblem ( int i, int j )
{ if (A[ i ] == ‘\0’ || B[ j ] == ‘\0’) return 0;
else
if ( A[ i ] == B[ j ] )
return (1 + subproblem ( i+1, j+1 ));
else return
max ( subproblem ( i+1, j ) ,
subproblem ( i, j+1 ) );
}
Memorizing the solutions :
Matrix L[ i , j ] = -1 ;
// initializing the memory device
int subproblem ( int i, int j ) {
if ( L[i, j] < 0 ) {
if (A[ i ] == ‘\0’ || B[ j ] == ‘\0’) L[i , j] = 0;
else if ( A[ i ] == B[ j ] )
L[i, j] = 1 + subproblem(i+1, j+1);
else L[i, j] = max( subproblem(i+1, j),
subproblem(i, j+1));
} return L[ i, j ] ;
}
Iterative LCS: Table Look-up
To get the length of LCS of A and B
{
first allocate storage for the matrix L;
for each row i from m downto 0
for each column j from n downto 0
if (A[ i ] == ‘\0’ or B[ j ] == ‘\0’) L[ i, j ] = 0;
else if (A[ i ] == B[ j ]) L[ i, j ] = 1 + L[i+1, j+1];
else L[ i, j ] = max(L[i+1, j], L[i, j+1]);
}
return L[0, 0];
}
Iterative LCS: Table Look-up
int lcs_len ( String A , String B ) // find the length
{
// First allocate storage for the matrix L;
for ( i = m ; i >= 0 ; i-- )
// A has length m+1
for ( j = n ; j >= 0 ; j-- ) { // B has length n+1
if (A[ i ] == ‘\0’ || B[ j ] == ‘\0’) L[ i, j ] = 0;
else if (A[ i ] == B[ j ]) L[ i, j ] = 1 + L[i+1, j+1];
else L[ i, j ] = max(L[i+1, j], L[i, j+1]);
}
return L[0, 0];
}
Dynamic Programming Algorithm


L[i, j] = 1 + L[i+1, j+1] , if A[ i ] == B[ j ] ;
L[i, j] = max ( L[i+1, j], L[i, j+1] ) otherwise
B
j
j+1
i
L[ i, j ]
L[ i, j+1 ]
i+1
L[ i+1, j ]
L[i+1, j+1]
A
Matrix L
e
m
p
t
y
_
b
o
t
t
l
e
n
e
m
a
t
o
d
e
_
k
n
o
w
l
e
d
g
e
7
6
5
5
4
4
3
3
3
3
2
1
0
7
6
5
5
4
4
3
3
3
3
2
1
0
6
6
5
5
4
4
3
3
3
3
2
1
0
5
5
5
5
4
4
3
3
3
3
2
1
0
5
5
5
5
4
4
3
3
3
3
2
1
0
5
4
4
4
4
4
3
3
2
2
2
1
0
5
4
4
4
4
4
3
3
2
2
2
1
0
5
4
4
4
4
4
3
3
2
2
2
1
0
4
4
4
4
4
4
3
3
2
2
2
1
0
3
3
3
3
3
3
3
3
2
2
2
1
0
3
3
3
3
3
3
3
3
2
2
2
1
0
3
3
3
3
3
3
3
3
2
2
2
1
0
2
2
2
2
2
2
2
2
2
2
2
1
0
2
2
2
2
2
2
2
2
2
2
2
1
0
2
1
1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
1
1
1
1
1
1
1
1
0
1
1
1
1
1
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
Obtain the subsequence
Sequence S = empty;
// the LCS
i = 0; j = 0;
while ( i < m && j < n) {
if ( A[ i ] == B[ j ] ) {
add A[i] to end of S;
i++; j++;
} else
if ( L[i+1, j] >= L[i, j+1]) i++;
else j++;
}
n e m a t o d e _ k n o w l e d g e
e
m
p
t
y
_
b
o
t
t
l
e
o-o o-o-o-o-o-o o-o-o-o-o-o-o o-o-o o
| \| | | | | \| | | | | | \| | \|
o-o-o o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
| | \| | | | | | | | | | | | | | | |
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
| | | | | | | | | | | | | | | | | | |
o-o-o-o-o o-o-o-o-o-o-o-o-o-o-o-o-o-o
| | | | \| | | | | | | | | | | | | |
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
| | | | | | | | | | | | | | | | | | |
o-o-o-o-o-o-o-o-o o-o-o-o-o-o-o-o-o-o
| | | | | | | | \| | | | | | | | | |
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
| | | | | | | | | | | | | | | | | | |
o-o-o-o-o-o o-o-o-o o o o-o-o-o-o-o-o
| | | | | \| | | | |
| | | | | | |
o-o-o-o-o o-o-o-o-o-o-o-o-o-o-o-o-o-o
| | | | \| | | | | | |
| | | | | |
o-o-o-o-o o-o-o-o-o-o-o-o-o-o-o-o-o-o
| | | | \| | | | | | | | | | | | | |
o-o-o-o-o-o-o-o-o-o-o-o-o-o o-o-o-o-o
| | | | | | | | | | | | | \| | | | |
o-o o-o-o-o-o-o o-o-o-o-o-o-o o-o-o o
| \| | | | | \| | | | | | \| | \|
o o o o o o o o o o o o o o o o o o o
Dynamic Programming with scores and penalties
x
V
A
D
L
i
N
A
K
j
……..
K
T
A
A
L
T
y
M
….
D
Dynamic Programming with
scores and penalties

from ‘i-th’ pos. in A and ‘j-th’ pos. in B onward
s : score
s ( A[i] , B[j] ) + S[i+1, j+1]
w : penalty function
S[i , j] = max
best score
from i, j onward
max { S[i+x, j] – w( x );
gap x in sequence B }
max { S[i, j+y] – w( y );
gap y in sequence A }
Algorithm for simple gap penalty
If for each gap, the penalty is a fixed constant
“c”, then
• s(A[ i ] , B[ j ]) + S[i+1, j+1];
S[i , j] = max
• S[ i+1, j ] – c ; // one gap
• S[ i, j+1 ] – c ; // one gap
Table Tracing

To do table tracing based on similarity matrix of
amino acids, we re-define S[i , j] to be the optimal
score of choosing the match of A[i] with B[j].
S[ i , j ] = s (A[ i ] , B[ j ]) +
+ max
// s : score
S[i+1, j+1]
// w : gap penalty
max { S[i+1+x, j+1] – w( x );
gap x in sequence B }
max { S[i+1, j+1+y] – w( y );
gap y in sequence A }
Diagram
Matrix S:
i
i+1
j
j+1
s[i, j]
S[i+1,j+1]
Summation operation
1. Start at lower right corner.
2. Move diagonally up one position.
3. Find largest value in either
 row segment starting diagonally below current
position and extending to the right or
 column segment starting diagonally below
current position and extending down.
4. Add this value to the value in the current cell.
5. Repeat steps 3 and 4 for all cells to the left in
current row and all cells above in current column.
6. If we are not in the top left corner, go to step 2.
----V
HGQKV
----VA
HGQKVA
----VADALTK
HGQKVADALTK
----VADALTK
HGQKVADALTK
----VADALTKPVNFKFA
HGQKVADALTK------A
----VADALTKPVNFKFAVAH
HGQKVADALTK------AVAH
Use of dynamic programming to evaluate
homology between pairs of sequences

If we just want to know maximum match
possible between two sequences, then we
don’t need to do trace-back but can just look
at the highest value in the first row or
column (“match score”). This represents
the best possible alignment score.
Gap penalty alternatives :
constant gap penalty for gap > 1
 gap penalty proportional to gap size (affine
gap penalty)

 one
penalty for starting a gap (gap opening
penalty)
 different (lower) penalty for adding to a gap
(gap extension penalty)
 dynamic programming algorithm can be made
more efficient
Gap penalty alternatives (cont.)

gap penalty proportional to gap size and
sequence
 for
nucleic acids, can be used to mimic
thermodynamics of helix formation.
 two kinds of gap opening penalties
 one
for gap closed by AT, different for GC.
 different
gap extension penalty.
End gaps

Some programs treat end gaps as normal
gaps and apply penalties, other programs do
not apply penalties for end gaps.
End gaps (cont.)
 Can
determine which a program does by adding
extra (unmatched) bases to the end of one
sequence and seeing if match score changes.
 Penalties for end gaps appropriate for aligned
sequences where ends "should match“.
 Penalties for end gaps inappropriate when
surrounding sequences are expected to be
different (e.g., conserved exon surrounded by
varying introns).
Global vs. Local Similarity

Should result of alignment include all amino acids or
proteins or just those that match?
 If yes, a global alignment is desired
 If no, a local alignment is desired
 Global alignment is accomplished by including
negative scores for “mismatched” positions,
thus scores get worse as we move away from
region of match (local alignment).
 Instead of starting trace-back with highest value
in first row or column, start with highest value
in entire matrix, stop when score hits zero.
Local Alignment

From ‘i-th’ pos. in A and ‘j-th’ pos. in B onward
s : score
s ( A[i] , B[j] ) + H[i+1, j+1]
w : penalty function
H[i , j] = max
Best score of any prefix
of the subsequence from
i, j onward.
max { H[i+x, j] – w( x );
gap x in sequence B }
max { H[i, j+y] – w( y );
gap y in sequence A }
0

Lecture_3 Sequence Alignment

Transcript Lecture_3 Sequence Alignment

Directory