Transcript Workshop#4

Alignment methods
April 21, 2009
Quiz 1-April 23 (JAM lectures through today)
Writing assignment topic due Tues, April 23
Hand in homework #3
Why has HbS stayed in the population?
Learning objectives- Understand the difference between global
alignment and local alignment. Understand the
Needleman-Wunsch algorithm. Understand the Smith-Waterman
algorithm in global alignment mode.
Workshop-Perform alignment of two nucleotide sequences
Homework #4 due Tues, April 23
Evolutionary Basis of Sequence
Alignment
Why are there regions of identity when comparing protein
sequences?
1) Conserved function-amino acid residues
participate in reaction.
2) Structural (For example, conserved cysteine
residues that form a disulfide linkage)
3) Historical-Residues that are conserved solely
due to a common ancestor gene.
Identity Matrix
    A  C  I  L
A   1  0  0  0
C   0  1  0  0
I   0  0  1  0
L   0  0  0  1
Simplest type of scoring matrix
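The identity matrix above can be built programmatically. The following is a minimal Python sketch; the function name identity_matrix and the dict-of-dicts layout are illustrative assumptions, not part of the lecture.

```python
def identity_matrix(alphabet):
    """Return a scoring matrix as a dict of dicts: 1 if identical, 0 if not."""
    return {a: {b: 1 if a == b else 0 for b in alphabet} for a in alphabet}


# The four residues used on the slide: Ala, Cys, Ile, Leu.
matrix = identity_matrix("ACIL")

print("  " + " ".join("ACIL"))
for residue in "ACIL":
    print(residue, " ".join(str(matrix[residue][other]) for other in "ACIL"))
```

Looking up matrix["I"]["L"] gives 0, which is exactly the limitation that similarity scoring is meant to address.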
Similarity
It is easy to score if an amino acid is identical to another (the
score is 1 if identical and 0 if not). However, it is not easy to
give a score for amino acids that are somewhat similar.
[Figure: chemical structures of isoleucine and leucine]
Should they get a 0 (non-identical), a 1 (identical), or
something in between?
One is mouse trypsin and the other is crayfish trypsin.
They are homologous proteins. The sequences share 41% identity.
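A percent-identity figure such as the one quoted above can be computed from an aligned pair of sequences. The sketch below is a hedged illustration: the function name percent_identity is hypothetical, the toy sequences are not the trypsin sequences, and dividing by the alignment length is an assumption (other denominators are also in use).

```python
def percent_identity(aligned_1, aligned_2, gap_chars="-."):
    """Percent of alignment columns holding the same residue in both rows.

    Assumption: the denominator is the full alignment length, gaps included.
    """
    if len(aligned_1) != len(aligned_2):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(
        1
        for x, y in zip(aligned_1, aligned_2)
        if x == y and x not in gap_chars
    )
    return 100.0 * identical / len(aligned_1)


# Toy example: three of four columns are identical.
print(percent_identity("ACIL", "ACLL"))
```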
Evolutionary Basis of Sequence
Alignment (Cont. 2)
Note: it is possible that two proteins share a high degree of
similarity but have two different functions. For example,
human gamma-crystallin is a lens protein that has no known
enzymatic activity. It shares a high percentage of identity with
E. coli quinone oxidoreductase. These proteins likely had a
common ancestor but their functions diverged.
This is analogous to a railroad car and a diner: both have the same
form but different functions.
Global Alignment Method
For example, the two hypothetical sequences
abcdefghajklm
abbdhijk
could be aligned like this
abcdefghajklm
|| |   | ||
abbd...hijk
As shown, there are 6 matches,
2 mismatches, and one gap of length 3.
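The counts quoted for this example can be checked with a short Python sketch. The function name summarize_alignment is hypothetical, "." is taken as the gap character as on the slide, and the unaligned trailing letters (l, m) are left out, as they are in the slide's count.

```python
def summarize_alignment(top, bottom, gap="."):
    """Count matches, mismatches, and the length of each run of gap columns."""
    matches = mismatches = 0
    gap_lengths = []
    in_gap = False
    for x, y in zip(top, bottom):
        if x == gap or y == gap:
            if in_gap:
                gap_lengths[-1] += 1   # extend the current gap
            else:
                gap_lengths.append(1)  # a new gap starts here
            in_gap = True
        else:
            in_gap = False
            if x == y:
                matches += 1
            else:
                mismatches += 1
    return matches, mismatches, gap_lengths


print(summarize_alignment("abcdefghajk", "abbd...hijk"))
```

As a simplification, adjacent gap columns are merged into one run even if they sit in different sequences.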
Global Alignment Method
Scored
The alignment is scored according to a payoff matrix
$payoff = {match      => $match,
           mismatch   => $mismatch,
           gap_open   => $gap_open,
           gap_extend => $gap_extend};
For the algorithm to operate correctly, the match payoff must be
positive and the other payoff entries must be negative.
Global Alignment Method (cont. 3)
Example
Given the payoff matrix
$payoff = {match      =>  4,
           mismatch   => -3,
           gap_open   => -2,
           gap_extend => -1};
Global Alignment Method (cont. 4)
The sequences
abcdefghajklm
abbdhijk
are aligned and scored like this
             a  b  c  d  e  f  g  h  a  j  k  l  m
             |  |     |           |     |  |
             a  b  b  d  .  .  .  h  i  j  k
match        4  4     4           4     4  4
mismatch          -3                -3
gap_open                -2
gap_extend              -1 -1 -1
for a total score of 24-6-2-3 = 13.
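The total of 13 can be reproduced with a short Python sketch. It assumes the convention visible in the worked example: a gap is charged gap_open once plus gap_extend for every gapped position, and the unaligned trailing letters (l, m) are not scored. The names PAYOFF and score_alignment are illustrative.

```python
# Payoff values from the example above.
PAYOFF = {"match": 4, "mismatch": -3, "gap_open": -2, "gap_extend": -1}


def score_alignment(top, bottom, payoff, gap="."):
    """Score an alignment column by column under the payoff matrix."""
    score = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == gap or y == gap:
            if not in_gap:
                score += payoff["gap_open"]  # charged once per gap
            score += payoff["gap_extend"]    # charged for every gap position
            in_gap = True
        else:
            in_gap = False
            score += payoff["match"] if x == y else payoff["mismatch"]
    return score


print(score_alignment("abcdefghajk", "abbd...hijk", PAYOFF))  # 13
```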
Global Alignment Method (cont. 5)
The algorithm should guarantee that no other
alignment of these two sequences has a
higher score under this payoff matrix.
Let’s align the following with a simple payoff matrix:
ABCNJRQCLCRPM and AJCJNRCKCRBP
Where match = 1
mismatch = 0
gap = 0
gap extension = 0
Alignment A
Sequence 1:  ABCNJ-RQCLCR-PM
Sequence 2:  AJC-JNR-CKCRBP
Score:       101010101011010
Total Score: 8
Alignment B
Sequence 1:  ABC-NJRQCLCR-PM
Sequence 2:  AJCJN-R-CKCRBP
Score:       101010101011010
Total Score: 8
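The per-column scores for Alignments A and B can be checked with a small Python sketch. The function name column_scores is hypothetical. Sequence 2 is one character shorter than sequence 1 on the slide, so the sketch pads it with a trailing gap; that padding is an assumption.

```python
def column_scores(seq_1, seq_2, gap="-"):
    """Return a string of 1s (identical residues) and 0s (anything else)."""
    width = max(len(seq_1), len(seq_2))
    seq_1 = seq_1.ljust(width, gap)  # pad the shorter row with gaps
    seq_2 = seq_2.ljust(width, gap)
    return "".join(
        "1" if x == y and x != gap else "0" for x, y in zip(seq_1, seq_2)
    )


alignments = [
    ("A", "ABCNJ-RQCLCR-PM", "AJC-JNR-CKCRBP"),
    ("B", "ABC-NJRQCLCR-PM", "AJCJN-R-CKCRBP"),
]
for label, s1, s2 in alignments:
    scores = column_scores(s1, s2)
    print(label, scores, "total =", scores.count("1"))
```

Both alignments give the same column scores and the same total, which shows that more than one alignment can reach the best score.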
Three steps in Dynamic
Programming
1. Initialization
2. Matrix fill or scoring
3. Traceback and alignment
Initialization step
Matrix Fill (bottom two rows)
Matrix Fill (bottom three rows)
Matrix Fill (entire matrix)
Sequence 1:  ABCNJ-RQCLCR-PM
Sequence 2:  AJC-JNR-CKCRBP
Score:       101010101011010
Total Score: 8
Sequence 1:  ABC-NJRQCLCR-PM
Sequence 2:  AJCJN-R-CKCRBP
Score:       101010101011010
Total Score: 8
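The three dynamic programming steps can be sketched in Python as follows. This is a minimal global (Needleman-Wunsch style) sketch under stated assumptions: a single linear gap penalty is used (enough for the simple payoff matrix, where gap and gap extension are both 0), the matrix is filled from the top-left corner rather than from the bottom rows as in the slide figures, and the function name needleman_wunsch is illustrative.

```python
def needleman_wunsch(seq_1, seq_2, match=1, mismatch=0, gap=0):
    """Return (score, aligned_1, aligned_2) for one optimal global alignment."""
    cols, rows = len(seq_1), len(seq_2)

    # 1. Initialization: (rows + 1) x (cols + 1) matrix, first row/column set.
    M = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, cols + 1):
        M[0][i] = i * gap
    for j in range(1, rows + 1):
        M[j][0] = j * gap

    # 2. Matrix fill: each cell takes the best of diagonal, above, and left.
    for j in range(1, rows + 1):
        for i in range(1, cols + 1):
            s = match if seq_1[i - 1] == seq_2[j - 1] else mismatch
            M[j][i] = max(M[j - 1][i - 1] + s,  # match or mismatch
                          M[j - 1][i] + gap,    # gap in sequence 1
                          M[j][i - 1] + gap)    # gap in sequence 2

    # 3. Traceback: walk from the last cell back to the first.
    out_1, out_2 = [], []
    i, j = cols, rows
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq_1[i - 1] == seq_2[j - 1] else mismatch
        if i > 0 and j > 0 and M[j][i] == M[j - 1][i - 1] + s:
            out_1.append(seq_1[i - 1]); out_2.append(seq_2[j - 1])
            i -= 1; j -= 1
        elif j > 0 and M[j][i] == M[j - 1][i] + gap:
            out_1.append("-"); out_2.append(seq_2[j - 1])
            j -= 1
        else:
            out_1.append(seq_1[i - 1]); out_2.append("-")
            i -= 1
    return M[rows][cols], "".join(reversed(out_1)), "".join(reversed(out_2))


score, row_1, row_2 = needleman_wunsch("ABCNJRQCLCRPM", "AJCJNRCKCRBP")
print(score)
print(row_1)
print(row_2)
```

Because several alignments tie for the best score, the alignment printed here need not be identical to either of the two shown above; only the total score is expected to agree.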
Smith-Waterman algorithm
M_{i,j} = MAXIMUM [
    M_{i-1,j-1} + s_{i,j}   (match or mismatch in the diagonal),
    M_{i,j-1} + w           (gap in sequence #1),
    M_{i-1,j} + w           (gap in sequence #2),
    0 ]
Where M_{i-1,j-1} is the value in the cell diagonally juxtaposed to M_{i,j}.
(The i-1, j-1 cell is up and to the left of M_{i,j}.)
Where s_{i,j} is the value for the match or mismatch in the M_{i,j} cell.
Where M_{i,j-1} is the value in the cell above M_{i,j}.
Where w is the value for the gap penalty.
Where M_{i-1,j} is the value in the cell to the left of M_{i,j}.
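The recurrence can be sketched in Python as a single-cell update plus a small fill loop that reports the best local score. The scoring values used here (match = 1, mismatch = -1, w = -1) and the function names are assumptions chosen for illustration; the lecture does not fix them for this algorithm.

```python
def smith_waterman_cell(diagonal, above, left, s, w):
    """Apply the recurrence to one cell; the 0 keeps scores from going negative."""
    return max(diagonal + s, above + w, left + w, 0)


def best_local_score(seq_1, seq_2, match=1, mismatch=-1, w=-1):
    """Fill the matrix and return the highest cell value (best local score)."""
    # Sequence 1 runs along the columns, sequence 2 along the rows.
    M = [[0] * (len(seq_1) + 1) for _ in range(len(seq_2) + 1)]
    best = 0
    for j in range(1, len(seq_2) + 1):
        for i in range(1, len(seq_1) + 1):
            s = match if seq_1[i - 1] == seq_2[j - 1] else mismatch
            M[j][i] = smith_waterman_cell(M[j - 1][i - 1], M[j - 1][i],
                                          M[j][i - 1], s, w)
            best = max(best, M[j][i])
    return best


# Toy sequences sharing only the segment "ABC".
print(best_local_score("XXABCXX", "YYABCYY"))  # 3
```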
Initialization step: Create Matrix with M + 1 columns
and N + 1 rows. M = number of letters in sequence 1 and N =
number of letters in sequence 2.
First column (M-1) and first row (N-1) will be filled with 0’s.
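The initialization step can be sketched in Python as below, using the two sequences from the earlier example. The function name init_matrix and the list-of-rows layout are assumptions for illustration.

```python
def init_matrix(seq_1, seq_2):
    """Create a zero-filled matrix with M + 1 columns and N + 1 rows."""
    n_cols = len(seq_1) + 1  # M + 1, sequence 1 along the columns
    n_rows = len(seq_2) + 1  # N + 1, sequence 2 along the rows
    return [[0] * n_cols for _ in range(n_rows)]


matrix = init_matrix("ABCNJRQCLCRPM", "AJCJNRCKCRBP")
print(len(matrix[0]), "columns,", len(matrix), "rows")
```

Every cell starts at 0 here, so the first row and first column already hold the required zeros before the fill step begins.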
Matrix fill step: Each position M_{i,j} is defined to be the
MAXIMUM score at position i,j, identified by its row and column:
M_{i,j} = MAXIMUM [
    M_{i-1,j-1} + s_{i,j}   (match or mismatch in the diagonal)
    M_{i,j-1} + w           (gap in sequence #1)
    M_{i-1,j} + w           (gap in sequence #2) ]
Sequence 1: ABCNJ-RQCLCR-PM
Sequence 2: AJC-JNR-CKCRBP
Score: 8