Transcript Lecture 5

1
Sequence Alignment
Bioinformatics Algorithms
Based on © Pevzner and Jones
Revised 2015
The capacity to blunder slightly is the real marvel of
DNA. Without this special attribute, we would still
be anaerobic bacteria and there would be no music.
- Lewis Thomas
Outline
Homework
Introduce Scoring Matrices
Some mismatches are better than others
Solve Alignment with Affine Gap Penalties
Continuing deletion more common than starting one
Linear Space Alignment
Multiple Alignment
3
Matching Sequences
Odds that all match = p6 = 1/46 = 1/4096 ~ 0.00024
Odds that none match. Odds are q6 = 36/46 ~ 0.178
Let’s break down the case where there is a single match.
The odds that the first match, but none other do, is pq5
But we could match in any of the six spots, so the odds
of a single match are 6 pq5 = 6 x 0.25 x (0.75)5 ~ 0.356
4
Matching Sequences
To match the first two, the odds are p2q4. But to capture
all possible matches with two correct, we need to see
how many ways we can pick 2 out of 6. This is the
combinatorial coefficient C(6, 2), pronounced “Six
choose two” and equal to 6 x 5 / 2 = 15.
This fits the patterns above: we have C(6, 0) = 1, C(6, 1) =
6, and C(6, 6) = 1.
In general, the odds of matching are C(6, k)pkq6-k
5
Repeats
Which is largest k for which 4^k + k <= 1,000,000?
k = 10 is a bit too large, so must settle for 9: you
will probably have a repeat of length 10, but you
will always have a repeat of length 9
6
Repeats
What is the longest repeat that has a 50% chance of
appearing?
7
Repeats
What is the longest repeat that has a 50% chance of
appearing? 16 bins
8
Repeats
What is the longest repeat that has a 50% chance of
appearing? 64 bins
9
Repeats
What length do we expect to see at least 50% of the time?
A string of length N has N – k + 1 substrings of length k.
The odds that there are no repeats of length k is
We want 1 – v > ½. or v < ½
This first happens when k = 19
10
A few words on real numbers
In bad old days there were flakey floating point chips
IEEE 754 is a standard that has been widely adopted in
modern chips. Addresses many serious problems.
We still need to know three terms
Overflow – 10 lbs of sugar in a 5 lb sack
Underflow – Numbers too small to distinguish from 0
Machine Epsilon – Distance between 1.0 and the next
number we can represent
Numerical Analysis goes over this and much more
11
So what about this…
When I try to compute this I get overflow!
12
So what about this…
So don’t do that. You can’t compute 65! But each of the
terms in the bottom equation is easy to compute
13
Repeat Odds
def printRepeatOdds(k, limit, debug):
print k,
bins = 1.0
for x in xrange(k):
bins = bins*4.0
if (bins < limit):
print "\t Will always have a repeat"
return
print "\t", bins, "\t",
odds = 1.0
for x in xrange(limit):
odds = odds * ((bins - x)/bins)
print 1 - odds
14
Repeat Odds
def printRepeatOdds(k, limit, debug):
...
print "\t", bins, "\t",
odds = 1.0
for x in xrange(limit):
odds = odds * ((bins - x)/bins)
print 1 - odds
print "\nK:", "\tBins:", "\tOdds of a repeat"
for k in xrange(1, 22):
printRepeatOdds(k, 1000000 - k + 1, False)
15
Palindrome output
$ python findPalindrome.py ../../../data/EColi.fasta
...
3 103
AAA TTT
3 103
AAATTT
4 102 AAAA TTTT
4 102 AAAATTTT
5 101 TAAAA TTTTA
...
10 849177
CATGGTTATG CATAACCATG
11 849176
GCATGGTTATG CATAACCATGC
12 849175
TGCATGGTTATG CATAACCATGCA
13 849174
CTGCATGGTTATG CATAACCATGCAG
14 849173 TCTGCATGGTTATG CATAACCATGCAGA
15 849172 TTCTGCATGGTTATG CATAACCATGCAGAA
16
How much work?
> Fake Fasta
ATGCATCCCCATATATATATATAT
ATGCATCCCCATATATATATATAT
ATGCATCCCCATATATATATATAT
ATGCATCCCCATATATATATATAT
ATGCATCCCCATATATATATATAT
ATGCATCCCCATATATATATATAT
ATGCATCCCCATATATATATATAT
Do we need to
ATGCATCCCCATATATATATATAT pass over initial
data each pass?
17
Better version
> Fake Fasta
ATGCATCCCCATATATATATATAT
3 103
4 102
AAATTT
AAAATTTT
ATGCATCCCCATATATATATATAT
ATGCATCCCCATATATATATATAT
ATGCATCCCCATATATATATATAT
ATGCATCCCCATATATATATATAT
ATGCATCCCCATATATATATATAT Can skip areas
ATGCATCCCCATATATATATATAT where we didn’t
ATGCATCCCCATATATATATATAT find long
palindrome
18
Palindrome
# findPalindrome.py
Find Palindromes
# Jeff Parker
Jan 2015
# Usage: $ python findPalindrome.py <filename>
...
# Is this a palindrome?
def isPalindrome(pos, text, patLen, reverse):
for x in xrange(2*patLen):
if (reverse[text[pos + x]] !=
text[pos+2*patLen - x - 1]):
return False
return True
19
Palindrome
# Look for palindrome of given length.
def findFirstPalindrome(start, text, patLen, reverse):
# Return first palindrome of this length
for pos in xrange(start, len(text) - 2 * patLen + 1):
if (isPalindrome(pos, text, patLen, reverse)):
return pos
return -1
# Never found one of this length
20
Alternative
def findFirstPalindrom(start, text, patLen):
for pos in xrange(start, len(text) - 2 * patLen + 1):
match = True
for x in xrange(2*patLen):
if (reverse[text[pos + x]]
!= text[pos+2*patLen - x - 1]):
match = False
break
# Exit inner loop & try again
if (match):
return pos
return -1 # Never found one of this length
21
Palindrome
reverse = { };
reverse['A'] =
reverse['T'] =
reverse['G'] =
reverse['C'] =
'T’
'A’
'C’
'G’
3 103
AAATTT
4 102 AAAATTTT
5 101 TAAAATTTTA
Need to step back...
patLen = 1
pos = findFirstPalindrome(1, text, patLen, reverse)
while (pos > -1):
print patLen, pos+1, text[pos:pos+(2*patLen) ]
patLen = patLen + 1
start = max(pos-1, 0)
# But don’t go past 0!
pos = findFirstPalindrome(start, text, patLen,
reverse)
22
Find all ORFs
def findAllOrf(text, limit):
lst = []
# Look for start in each of three reading frames.
for offSet in xrange(3):
pos = offSet
while (pos > -1):
[pos, y, ln] = findORF(text, pos, limit)
if (pos > 0):
# Go around again
item = ["+", offSet+1, pos+1, y, ln]
lst.append(item)
pos = pos + ln # Go past the ORF
return lst
23
Find one ORF
# Look for an open codon followed by a close codon
# Returns [start, end, len]
def findORF(text, pos, minOrf):
while (pos < len(text)):
if ("ATG" == text[pos:pos+3]):
# Start of ORF
y = isORF(text, pos, minOrf)
if (y > -1):
return [pos, y, y-pos]
pos = pos + 3
# Didn't find anything
return [-1, 0, 0]
24
isORF
def isORF(text, pos, minOrf):
y = pos + 3
while (y < len(text)):
if (isStop(text[y:y+3])):
y = y + 3
if ((y - pos) <= minOrf): # Too short!
return -1
return y
# Found full length ORF
y = y + 3
return -1
25
Main Routine
...
print "ORF must be at least", limit, "Base pairs long"
text = cs58FileUtility.readFastaFile(fileName)
lst = findAllOrf(text, limit)
print("Direction ReadingFrame Low High
for pos in range(len(lst)):
print lst[pos]
Length”)
item = ["+", offSet+1, pos+1, y, ln]
lst.append(item)
A. Acid Attributes
Hydrophobic/Hydrophilic
Hydrophobic: repelled by water
Hydrphilic: water soluable
Aromatic / Aliphatic classification of carbon and
hydrogen molecules
Aromatic: contain rings, such as benzene rings
Aliphatic: do not contain such rings
Polar: separation of electric charge leading to a
electric dipole, or separation of positive and
negative charges
Venn Diagram Review
Thanks to Snorg Tees
Amino Acid Properties
Volume vs Bulk
25
20
15
10
5
0
0
20
40
60
80
100
120
140
160
180
Surface Area vs Fractional Avail Area
Surface vs Fractional
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0
50
100
150
200
250
300
Ph of isoelectric Point vs Hydrophobicity
ph vs Hydr
5
4
3
2
1
0
0
-1
-2
-3
-4
-5
2
4
6
8
10
12
Hydrophobicity Scales
Hyd.1 vs Hyd.2
6
4
2
0
-5
-4
-3
-2
-1
0
-2
-4
-6
-8
-10
-12
-14
1
2
3
4
5
Bulk vs Polarity
Bulk vs Polarity
60
50
40
30
20
10
0
0
5
10
15
20
25
34
Patrick
35
Noel
36
Kyle
37
Hanna
38
Dario
39
?
40
Raisa
41
Jonathan
42
Brian
43
Kevin
44
Spreadsheets
45
What do sums and differences mean?
46
Spreadsheets
What are the strong signals?
47
48
Summary
Principal Component Analysis is a technique that takes data of
multiple dimensions, and finds the “Principal Components”
Can make it easier to make sense out of data
Note that in this case, we have no evidence (yet) that the
components that we have identified have any biological
significance
We might have started by measuring the wrong things
We will get a chance to evaluate this distance matrix when we
return to sequence alignment
49
Bayes Theorem
P(H)P(D | H)
(0.01 ´ 0.81)
P(H | D)=
=
~ 0.0897
P(D)
(0.01 ´ 0.81+ 0.99 ´ 0.083)
If the patient has tested positive,
they are Sick (1,000) and test positive (810 individuals)
or they are Well, and have a false positive (8217 individuals).
Thus the odds are 810/(810 + 8217) = 8.97%
Outline
Homework
Introduce Scoring Matrices
Some mismatches are better than others
Solve Alignment with Affine Gap Penalties
Continuing deletion more common than starting one
Linear Space Alignment
Multiple Alignment
51
Global Alignment: Key Ideas
1) Labeling of the upper and left edges
2) To compute the best match ending at location [i,j] we compute the three
values below, pick minimal value, and store it in d[i][j]
The costs for match, non-match and gap may be varied to match problem
insertCost = d[i-1][j] - 1;
deleteCost = d[i][j-1] - 1;
if (str1[i] == str2[j])
diagCost = d[i-1][j-1] + 1;
else
diagCost = d[i-1][j-1] - 1;
T
T
C
G
AT
AT
ATG
AT_
3
2
2
1
AT_
ATC
1
1
ATG
ATC
1
d[i][j] = max(insertCost, deleteCost, diagCost)
Alignment Scoring
Our algorithm permits the following scoring scheme:
+1 : match premium
-μ : mismatch penalty
-σ : indel penalty
The score for a particular alignment is
#matches – μ(#mismatches) – σ (#indels)
Scoring Matrices
To generalize scoring for DNA sequences, consider a
(4+1) x (4+1) scoring matrix δ.
We include a spot for the gap character “-”.
For amino acid sequence alignment, the scoring matrix
would be a (20+1)x(20+1) size.
This will simplify the algorithm as follows:
si-1,j-1 + δ (vi, wj)
si,j = max s i-1,j + δ (vi, -)
s i,j-1 + δ (-, wj)
Making a Scoring Matrix
Scoring matrices for DNA are not as widely used
We introduce two Scoring Matrices for Amino Acids
Created based on biological evidence.
Goal in alignment is to identify the underlying similarities
hidden by mutations.
Some mutations have little effect on the protein’s function, so
some penalties, δ(vi , wj), should be less harsh than others.
Two types of Amino acid substitution matrices in wide use
PAM
BLOSUM
Scoring Matrix: Small Example
A
R
N
K
A
5
-2
-1
-1
R
-
7
-1
3
N
-
-
7
0
K
-
-
-
6
While R (Arginine) and K
(Lysine) are different
amino acids, they have a
positive score.
Why? They are both
positively charged amino
acids, and a substitution
will not greatly change
function of protein.
The Blosum50 Scoring Matrix
Attributes
The Cell: A molecular Approach
Hydrophobic/Hydrophilic
Geoffery Cooper
Hydrophobic: repelled by water
Hydrphilic: water soluable
Aromatic / Aliphatic classification of carbon and
hydrogen molecules
Aromatic: contain rings, such as benzene rings
Aliphatic: do not contain such rings
Polar: separation of electric charge leading to a
electric dipole, or separation of positive and
negative charges
Conservation
Amino acid changes that tend to preserve the
physico-chemical properties of the original residue
Polar to polar
aspartate  glutamate
Nonpolar to nonpolar
alanine  valine
Similarly behaving residues
leucine  isoleucine
Assumptions
See Apostolico and Giancarlo:
Sequence Alignment in Molecular Biology
Today we will give a fixed amount for match, charge
a fixed amount for mismatch, etc
That is, all parts of the string are equally important
Possible to construct alternative models:
HMM to recognize and align sequences
Scoring Matrices
We wish to compare sequences
We have been matching single symbols
We have seen scoring matrices for Amino Acids
We could bin symbols into other groups:
Hydrophobic/Hydrophylic
Pyrimidines/Purines
We could compare pairs of symbols
All are valid ways to score an alignment
Scoring Matrices
Let's make some assumptions about scores
Additive – for ease of computation
Positive is good
Negative is bad
Next slide:
Must have at least one positive score
Expected score must be negative
Scoring Matrices Assumptions
String S made of letters a1 to an
Odds of seeing ai are pi
Score for alignment of ai with aj is sij
We assume some sij is positive
Or best matches would have length 1
We assume expected value of score is negative
Or matches would run on forever
Scoring Matrices Assumptions
String S made of letters a1 to an
Odds of seeing ai are pi
Score for alignment of ai with aj is sij
We assume some sij is positive
We assume expected value of score is negative
How to compare scores from different matrices?
Are matches from 2M twice those from M?
We will find a way to normalize a matrix
Altschul-Dembo-Karlin statistics
Threshold: Identifies smallest segment score that is
unlikely to happen by chance
# matches above q has mean E(q) = Kmne-lq; K is a
constant, m and n are the lengths of the two
compared sequences
Parameter l is positive root of:
Background Frequency
Some residues appear more often than others
Notation: pi is probability of ith residue
We may compute genome wide frequency
We could also look at local frequency
Let qij represent the odds of aligning ith with jth
Should we give a high score or a low score when
aligning with a rare residue?
Assumptions
Should have at least one positive entry, or best
scores will have length 1
Expected value of a score (aka Average score)
should be negative, or we will tend to have very
long matches
Entries in scoring matrix should look like this
Why log?
Probabilities should be multiplied
P(AB) = P(A)P(B)
Odds of match in first place times odds of match
in second place…
But we want to get a score by adding
Log turns products into sums
PAM
PAM - (Dayhoff et al., 1978) Original substitution matrix
Compare closely related species
Use Global Alignment
Point Accepted Mutation
1 PAM = PAM1 = 1% average change of all amino acid
positions
After 100 PAMs of evolution, not every residue will have
changed
Some residues may have mutated several times
Some residues may have returned to their original state
Some residues may not changed at all
PAMX
PAMx = PAM1x
PAM250 = PAM1250
PAM250 is a widely used
One positive entry?
Expected value negative?
PAM250 is a widely used scoring matrix:
BLOSUM
Blocks Substitution Matrix (Henikoff & Henikoff, 1992)
Scores derived from observations of the frequencies of
substitutions in
Blocks of local alignments (no gaps)
In distantly related proteins
Weights contributions – don’t count similar species twice
Matrix name indicates evolutionary distance
BLOSUM62 was created using sequences sharing no
more than 62% identity
Recap PAM vs BLOSUM
The PAM matrix was based on alignment of sequences from
closely related species
Includes conserved and mutable regions
BLOSUM is based on highly conserved regions without gaps
(Blocks) from distantly related species
In PAM, higher numbers are used to align more distant
sequences. Thus PAM250 is PAM1250.
Higher BLOSUM numbers are for closer matches. BLOSUM62
is used for closer sequences than BLOSUM50
What does 62% mean?
The extent to which two nucleotide or amino acid sequences are invariant
AC C TG A G – AG
AC G TG – G C AG
mismatch
indel
70% identical
The Blosum50 Scoring Matrix
One positive entry?
Expected value negative?
How do we use these?
Alignment
programs will
give you a
choice:
Blast Similarity Search
I live on a street that is one long block.
How many digits do I need for the house number?
If I lived on Commonwealth Avenue
How many digits would I need for my house number?
Log(N) bits for street with N houses
I am searching for a DNA string of length 6 in EColi
I have found an exact match.
Is it significant?
Some Scoring Issues in BLAST
Scoring Matrices – how to pick?
How did I set the gap cost?
What does the score mean? Is this score relevant?
As DB grows, expected value of max match grows
Is the Database current?
Is the Database redundant?
Am I matching a low complexity region?
Outline
Homework
Introduce Scoring Matrices
Some mismatches are better than others
Solve Alignment with Affine Gap Penalties
Continuing deletion more common than starting one
Linear Space Alignment
Multiple Alignment
Gap Penalties
Currently, a fixed penalty σ is given to every indel:
-σ for 1 indel,
-2σ for 2 consecutive indels
-3σ for 3 consecutive indels, etc.
This is called a "linear gap penalty"
Linear in the size of the gap
Too severe penalty for a series of 100 consecutive indels
Affine Gap Penalties
In nature, a series of k indels often come as a single event
rather than a series of k single nucleotide events
Our current scoring gives same score for both alignments
This is more likely.
This is less likely.
Accounting for Gaps
Gaps- contiguous sequence of spaces in one of the rows
Score for a gap of length x is:
-(ρ + σx)
where ρ >0 is the penalty for introducing a gap:
gap opening penalty
ρ will be large relative to σ:
gap extension penalty
Don't add as much of a penalty for extending the gap.
Affine Transformation
Affine Transformation
Linear Transformation
Affine Gap Penalties
Gap penalties:
-ρ-σ when there is 1 indel
-ρ-2σ when there are 2 indels in a row
-ρ-3σ when there are 3 indels in a row, etc.
-ρ- x·σ (-gap opening - x gap extensions)
New penalties for runs of horizontal or vertical edges
Affine Gap Penalties and Edit Graph
To reflect affine gap
penalties we have to add
“long” horizontal and
vertical edges to the edit
graph. Each such edge
of length x should have
weight
- - x *
Adding “Affine Penalty” Edges to the Edit Graph
There are many such edges!
Adding them to the graph
increases the running time of
the alignment algorithm by a
factor of n (where n is the
number of vertices)
So the complexity increases
from O(n2) to O(n3)
Can we do better?
Adding “Affine Penalty” Edges to the Edit Graph
There are many such edges!
Adding them to the graph
increases the running time of
the alignment algorithm by a
factor of n (where n is the
number of vertices)
So the complexity increases
from O(n2) to O(n3)
Can we do better?
The 3-leveled Manhattan Grid
Manhattan in 3 Layers
δ
ρ
δ
σ
δ
ρ
σ
δ
δ
Switching between 3 Layers
Levels:
The main level is for diagonal edges
The lower level is for horizontal edges
The upper level is for vertical edges
A jumping penalty is assigned to moving from the
main level to either the upper level or the lower level
(-)
There is a gap extension penalty for each continuation
on a level other than the main level (-)
Affine Gap Penalties and 3 Layer Manhattan Grid
The three recurrences for the scoring algorithm
creates a 3-layered graph.
The top level creates/extends gaps in the sequence w.
The bottom level creates/extends gaps in sequence v.
The middle level extends matches and mismatches.
Affine Gap Penalties and 3 Layer Manhattan Grid
The three recurrences for the scoring algorithm
creates a 3-layered graph.
The top level creates/extends gaps in the sequence w.
The bottom level creates/extends gaps in sequence v.
The middle level extends matches and mismatches.
As stated, no way to get from insertion to deletion
Affine Gap Penalty Recurrences
si,j =
max
s i-1,j - σ
s i-1,j –(ρ+σ)
Continue Gap in w (deletion)
Start Gap in w (deletion): from middle
si,j =
max
s i,j-1 - σ
s i,j-1 –(ρ+σ)
Continue Gap in v (insertion)
si,j =
max
si-1,j-1 + δ (vi, wj)
s i,j
s i,j
Match or Mismatch
Start Gap in v (insertion):from middle
End deletion: from top
End insertion: from bottom
Outline
Homework
Introduce Scoring Matrices
Some mismatches are better than others
Solve Alignment with Affine Gap Penalties
Continuing deletion more common than starting one
Linear Space Alignment
Multiple Alignment
95
95
Divide and Conquer
Divide and Conquer is a general strategy
Take a difficult problem and decompose it into
two parts
96
Search for 6
How long would it take to find 6 in the list below?
5
8
3
2
5
9
0
6
1
5
8
97
Binary Search
How long would it take to find 6 in the list below?
5
8
3
2
5
9
0
6
1
5
8
5
6
8
8
9
What if we knew that the list was sorted?
0
1
2
3
5
5
98
98
Finding a root
[0, 2] - midpoint is 1
f(1) = 1 - 2 = -1, so interval [0, 1] is below axis, [1, 2] must cross
[1,2] - midpoint is 3/2
f(3/2) = 9/4 - 2 = 1/4, so curve crosses in region [1, 3/2]
[1, 1.5] - midpoint is 5/4
f(5/4) = 25/16 - 32/16 < 0, so curve crosses in region [1.25, 1.5]
[1.25, 1.5] ….
At each stage of this process, we halve the interval: needs about 3 iterations
per digit. At each point, we halve the region holding the solution
Computing Alignment Path Requires
Quadratic Memory
Alignment Path
Space complexity for computing
alignment path for sequences of
length n and m is O(nm)
n
We need to keep all backtracking
references in memory to
reconstruct the path
(backtracking)
m
Divide and Conquer Approach
Path(source, sink)
if(source & sink are in consecutive columns)
output the longest path from source to sink
else
middle ← middle vertex between source & sink
Path(source, middle)
Path(middle, sink)
Divide and Conquer Approach to LCS
Path(source, sink)
if(source & sink are in consecutive columns)
output the longest path from source to sink
else
middle ← middle vertex between source & sink
Path(source, middle)
Path(middle, sink)
The only problem left is how to find this “middle vertex”!
Computing Alignment Score with Linear
Memory
Alignment Score
Space complexity of computing
just the score itself is O(n)
nn
We only need the previous
column to calculate the
current column
We can then throw away that
previous column once
we’re done using it
2
Computing Alignment Score: Recycling Columns
Only two columns of scores are saved at any given
time
memory for column 1
is used to calculate
column 3
memory for column
2 is used to calculate
column 4
Computing Alignment Score with Linear
Memory
Alignment Score
Space complexity of computing just the
score itself is O(n)
Only need the previous column to
calculate the current column
nn
This computes the global score
Does not remember how we got here
2
Crossing the Middle Line
m/2
We want to calculate the longest
m path from (0,0) to (n,m) that passes
through (i,m/2) where i ranges from
0 to n and represents the i-th row
Define
length(i)
Prefix(i)
n
Suffix(i)
as the length of the longest path
from (0,0) to (n,m) that passes
through vertex (i, m/2)
Crossing the Middle Line
m/2
m
Prefix(i)
n
Suffix(i)
Define (mid,m/2) as the vertex where the longest path crosses the
middle column.
length(mid) = optimal length = max0i n length(i)
Computing Prefix(i)
• prefix(i) is the length of the longest path from (0,0)
to (i,m/2)
• Compute prefix(i) by dynamic programming in the
left half of the matrix
store prefix(i) column
0
m/2
m
Computing Suffix(i)
• suffix(i) is the length of the longest path from (i,m/2) to (n,m)
• suffix(i) is the length of the longest path from (n,m) to (i,m/2) with
all edges reversed
• Compute suffix(i) by dynamic programming in the right half of the
“reversed” matrix
store suffix(i) column
0
m/2
m
Length(i) = Prefix(i) + Suffix(i)
Add prefix(i) and suffix(i) to compute length(i):
length(i)=prefix(i) + suffix(i)
You now have a middle vertex of the maximum path
(i,m/2) as maximum of length(i)
0
middle point found
i
0
m/2
m
Finding the Middle Point
0
m/4
m/2
3m/4
m
Finding the Middle Point again
0
m/4
m/2
3m/4
m
And Again
0
m/8
m/4
3m/8
m/2
5m/8
3m/4 7m/8 m
Time = Area: First Pass
• On first pass, the algorithm covers the entire area
Area = nm
Time = Area: First Pass
• On first pass, the algorithm covers the entire area
Area = nm
Computing
prefix(i)
Computing
suffix(i)
Time = Area: Second Pass
• On second pass, the algorithm covers only 1/2 of
the area
Area/2
Time = Area: Third Pass
• On third pass, only 1/4th is covered.
Area/4
Geometric Reduction At Each Iteration
1 + ½ + ¼ + ... + (½)k ≤ 2
•
Runtime: O(Area) = O(nm)
5th pass: 1/16
3rd pass: 1/4
first pass: 1
4th pass: 1/8
2nd pass: 1/2
Outline
Homework
Introduce Scoring Matrices
Some mismatches are better than others
Solve Alignment with Affine Gap Penalties
Continuing deletion more common than starting one
Linear Space Alignment
Multiple Alignment
Multiple Alignment
Dynamic Programming in 3-D
Progressive Alignment
Profile Progressive Alignment (ClustalW)
Scoring Multiple Alignments
Entropy
Sum of Pairs Alignment
Partial Order Alignment (POA)
A-Bruijin (ABA) Approach to Multiple Alignment
Generalizing Pairwise Alignment
Multiple alignments can reveal subtle similarities that
pairwise alignments do not reveal
Alignment of 2 sequences represented as a 2-row matrix
Represent alignment of 3 sequences as a 3-row matrix
A T _ G C G _
A _ C G T _ A
A T C A C _ A
Score: more conserved columns, better alignment
Aligning Three Sequences
Same strategy as aligning two
sequences
Use a 3-D “Manhattan Cube”,
with each axis representing
a sequence to align
For global alignments, go
from source to sink
source
sink
2-D vs 3-D Alignment Grid
V
W
2-D edit graph
3-D edit graph
2-D cell versus 3-D Alignment Cell
In 2-D, 3 edges in
each unit square
In 3-D, 7 edges in
each unit cube
Alignment Paths
0
0
0
1
1
2
3
4
A
--
T
G
C
1
2
3
3
4
A
A
T
--
C
0
1
2
3
4
--
A
T
G
C
x coordinate
y coordinate
z coordinate
Resulting path in (x,y,z) space:
(0,0,0)(1,1,0)(1,2,1) (2,3,2) (3,3,3) (4,4,4)
Multiple Alignment: Dynamic Programming
si,j,k = max
si-1,j-1,k-1 + (vi, wj, uk)
si-1,j-1,k +  (vi, wj, _ )
si-1,j,k-1 +  (vi, _, uk)
si,j-1,k-1 +  (_, wj, uk)
si-1,j,k
+  (vi, _ , _)
si,j-1,k
+  (_, wj, _)
si,j,k-1
+  (_, _, uk)
cube diagonal: no
indels
face diagonal:
one indel
edge diagonal:
two indels
(x, y, z) is an entry in the 3-D scoring matrix
Architecture of 3-D Alignment Cell
(i-1,j,k-1)
(i-1,j-1,k-1)
(i-1,j-1,k)
(i-1,j,k)
(i,j,k-1)
(i,j-1,k-1)
(i,j-1,k)
(i,j,k)
Multiple Alignment: Running Time
For 3 sequences of length n, the run time is 7n3; O(n3)
For k sequences, build a k-dimensional Manhattan, with run
time (2k-1)(nk); O(2knk)
Conclusion: dynamic programming approach for alignment
between two sequences is easily extended to k sequences
but it is impractical due to exponential running time
Inferring Pairwise Alignments
from Multiple Alignments
From a multiple alignment, we can infer pairwise
alignments between all sequences, but they are not
necessarily optimal
We simply project a 3-D multiple alignment path on to
a 2-D face of the cube
Multiple Alignment Projections
A 3-D alignment can
be projected onto the
2-D plane to represent
an alignment between
a pair of sequences.
All 3 Pairwise Projections of the Multiple Alignment
MA Induces Pairwise Alignments
Every multiple alignment induces pairwise alignment
x:
y:
z:
AC-GCGG-C
AC-GC-GAG
GCCGC-GAG
Induces:
x: ACGCGG-C;
y: ACGC-GAC;
x: AC-GCGG-C;
z: GCCGC-GAG;
y: AC-GCGAG
z: GCCGCGAG
MA from Pairwise Alignments
Given 3 arbitrary pairwise alignments:
x: ACGCTGG-C;
y: ACGC--GAC;
x: AC-GCTGG-C;
z: GCCGCA-GAG;
y: AC-GC-GAG
z: GCCGCAGAG
Can we construct MA that induces them?
MA from Pairwise Alignments
Given 3 arbitrary pairwise alignments:
x: ACGCTGG-C;
y: ACGC--GAC;
x: AC-GCTGG-C;
z: GCCGCA-GAG;
y: AC-GC-GAG
z: GCCGCAGAG
Can we construct MA that induces them?
Not always
Pairwise alignments may be inconsistent
Combining Optimal Pairwise Alignments into MA
Can combine pairwise
alignments into multiple
alignment
Can not combine
pairwise alignments
into multiple alignment
A<T<G<A
Inferring MA from Pairwise Alignments
From an optimal multiple alignment, we can infer
pairwise alignments between all pairs of sequences,
but they are not necessarily optimal
It is difficult to infer a ``good” multiple alignment from
optimal pairwise alignments between all sequences
Profile Representation of Multiple Alignment
T
C
C
C
A
C
G
T
-
A
A
A
A
A
G
G
G
G
G
G
–
–
–
–
C
C
C
C
C
T
T
T
T
T
1
A
A
A
A
A
T
C
C
T
T
1
.6
1
.4
1
C
C
C
C
–
–
T
G
G
G
G
G
G
G
.4
.2
.4 .8 .4
1
.6 .2
.2
1
.8
A
A
A
A
G
.8
1 .2
.2
.2
C
C
C
C
C
.6
Profile Representation of Multiple Alignment
T
C
C
C
A
C
G
T
-
A
A
A
A
A
G
G
G
G
G
G
–
–
–
–
C
C
C
C
C
T
T
T
T
T
1
A
A
A
A
A
T
C
C
T
T
1
.6
1
A
A
A
A
G
.4
1
C
–
–
T
G
G
G
G
G
G
G
.4
.2
.4 .8 .4
1
.6 .2
.2
1
C
C
C
.8
1 .2
.2
.2
C
C
C
C
C
.6
.8
In the past we were aligning a sequence against a sequence
Can we align a sequence against a profile?
Can we align a profile against a profile?
Aligning alignments
Given two alignments, can we align them?
x
y
z
w
v
GGGCACTGCAT
GGTTACGTC-GGGAACTGCAG
GGACGTACC-GGACCT-----
Combined Alignment
Aligning alignments
Given two alignments, can we align them?
Hint: use alignment of corresponding profiles
x
y
z
w
v
GGGCACTGCAT
GGTTACGTC-GGGAACTGCAG
GGACGTACC-GGACCT-----
Combined Alignment
Example
We have aligned three sequences x, y, z
Wish to align new sequence w
x AC-GT
y GC-AT
z ACCGT
w ACG
Example
We have aligned three sequences x, y, z
Wish to align new sequence w
x AC-GT
y GC-AT
z ACCGT
w ACG
Costs
Exact match +2
Transitions +1 (A-G, C-T)
Transversion (A or G to C or T)
Gap cost -3
Matching Gap +1
Example
Reduce alignment to a profile
A
2/3
C
1
1/3
G
1/3
T
2/3
x
y
z
A
G
A
C
C
C
C
1/3
2/3
1
G
A
G
T
T
T
Costs
Exact match +2
Transitions +1 (A to G, C to T)
Transversion -1 (A, G to C ,T)
Gap cost -3
Matching Gap +1
A
Example
2/3
C
G
Costs
1/3
1
1/3
1/3
2/3
T
Exact match +2
Transitions +1 (A-G, C-T)
Transversion -1 (A or G to C or T)
1
-
Gap cost -3
0
Matching Gap +1
A
-3
C
-6
G
-9
2/3
-3
-6
-9
-12 -15
A
Example
C
G
Costs
-6
-6
?
1/3
1
1/3
1/3
2/3
T
Exact match +2
Transitions +1 (A-G, C-T)
Transversion -1 (A or G to C or T)
Gap cost -3
Matching Gap +1
Up
Left
Diag
2/3
1
-
2/3
0
-3
A
-3
?
C
-6
G
-9
-6
-9
-12 -15
A
Example
C
G
Costs
0 + 2 x 2/3 + 1 x 1/3
1
1/3
1/3
2/3
1
-
Transversion -1 (A or G to C or T)
Gap cost -3
Matching Gap +1
-6
-6
1/3
T
Exact match +2
Transitions +1 (A-G, C-T)
Up
Left
Diag
2/3
2/3
0
-3
A
-3
5/3
C
-6
G
-9
-6
-9
-12 -15
A
Example
C
G
Costs
1
1/3
1/3
2/3
1
-
Transversion -1 (A or G to C or T)
Gap cost -3
Matching Gap +1
-9
-4/3
-3 + -1
1/3
T
Exact match +2
Transitions +1 (A-G, C-T)
Up
Left
Diag
2/3
2/3
0
-3
-6
A
-3
5/3 -4/3
C
-6
G
-9
-9
?
-12 -15
A
Example
C
Costs
G
Exact match +2
Transitions +1 (A-G, C-T)
Transversion -1 (A or G to C or T)
T
-12
-4/3 + 1 x 2/3 -3 x 1/3
-6 + -1 x 1/3 -3 x 2/3
1/3
1
1/3
1/3
2/3
1
-
Gap cost -3
Matching Gap +1
Up
Left
Diag
2/3
2/3
0
-3
-6
-9
A
-3
5/3 -4/3 -5/3
C
-6
G
-9
-12 -15
A
Example
C
Costs
G
Exact match +2
Transitions +1 (A-G, C-T)
Transversion (A or G to C or T)
Gap cost -3
Matching Gap +1
T
x
y
z
w
AC-GT
GC-AT
ACCGT
AC-G-
2/3
1/3
1
1/3
1/3
2/3
1
-
2/3
0
-3
-6
-9
-12 -15
A
-3
5/3 -4/3 -5/3 -14/3 -23/3
C
-6
-4/3 11/3 10/3 1/3 -8/3
G
-9 -13/3 2/3 4/3
5
2
Multiple Alignment: Greedy Approach
Choose most similar pair of strings and combine into a profile
, thereby reducing alignment of k sequences to an
alignment of of k-1 sequences/profiles. Repeat
This is a heuristic greedy method
u1= ACGTACGTACGT… u1= ACg/tTACg/tTACg/cT…
u2 = TTAATTAATTAA…
k
u2 = TTAATTAATTAA…
u3 = ACTACTACTACT… …
…
uk = CCGGCCGGCCGG
uk = CCGGCCGGCCGG…
k-1
Greedy Approach: Example
Consider these 4 sequences
s1
s2
s3
s4
GATTCA
GTCTGA
GATATT
GTCAGC
Greedy Approach: Example (cont’d)
 4
 2
There are  = 6 possible alignments
 
s2
s4
GTCTGA
GTCAGC (score = 2)
s1
s4
GATTCA-G—T-CAGC(score = 0)
s1
s2
GAT-TCA
G-TCTGA (score = 1)
s2
s3
G-TCTGA
GATAT-T (score = -1)
s1
s3
GAT-TCA
GATAT-T (score
s3
s4
GAT-ATT
G-TCAGC (score = -1)
= 1)
Greedy Approach: Example (cont’d)
s2 and s4 are closest; combine:
s2
s4
GTCTGA
GTCAGC
s2,4 GTCt/aGa/cA
(profile)
new set of 3 sequences:
s1
s3
s2,4
GATTCA
GATATT
GTCt/aGa/c
Iterative Alignment
Allows you to modify the initial alignments as you
add sequences
Does not force you to live with your initial choices.
References
Apostolico A1, Giancarlo R., Sequence alignment in molecular
biology, J Comput Biol. 1998 Summer; 5(2):173-96.
Dayhoff, M. O.; Schwartz, R. M.; Orcutt, B. C. (1978). "A
model of evolutionary change in proteins". Atlas of Protein
Sequence and Structure 5 (3): 345–352.
Henikoff, Steven; Henikoff, Jorja (1992). "Amino acid
substitution matrices from protein blocks". Proceedings of the
National Academy of Sciences of the United States of America
89 (22): 10915–9.
Altschul, SF (1991). "Amino acid substitution matrices from an
information theoretic perspective". Journal of molecular
biology 219 (3): 555–65.
Summary
We have been able to use the same Dynamic Programming
Framework to address a number of problems
Global Alignment (0, -1, -2, -3 on both axes)
Pattern Matching (0, -1, -2, -3 on one axis, 0 on the other)
Local Alignment (zeros everywhere – axes and interior)
We have been able to fold in Affine Gap penalty as well.
We have seen ways to modify the matching score to provide
more realistic matching scores: PAM and BLOSUM
Taken a brief look at a hard problem: multiple alignment
We have seen a linear space algorithm for alignment