Transcript Document
Alignment Class II
We continue where we stopped last
week: Dynamic programing
Reminder -Structure of a genome
a gene
transcription
pre-mRNA
splicing
mature mRNA
translation
protein
Pairwise Sequence Alignment
Example
HEAGAWGHEE
PAWHEAE
HEAGAWGHE-E
HEAGAWGHE-E
P-A--W-HEAE
--P-AW-HEAE
Which one is better?
Example
A
E
G
H
W
A
5
-1
0
-2
-3
E
-1
6
-3
0
-3
H
-2
0
-2
10
-3
P
-1
-1
-2
-2
-4
W
-3
-3
-3
-3
15
• Gap penalty: -8
• Gap extension: -3
HEAGAWGHE-E
--P-AW-HEAE
(-8) + (-8) + (-1) + 5 + 15 + (-8)
+ 10 + 6 + (-8) + 6 = 9
Exercise: Calculate for
HEAGAWGHE-E
P-A--W-HEAE
Global Alignment
Notation
xi – ith letter of string x
yj – jth letter of string y
x1..i – Prefix of x from letters 1 through I
F – matrix of optimal scores
F(i,j) represents optimal score lining up x1..i with y1..j
d – gap penalty
s – scoring matrix
Global Alignment
The work is to build up F
Initialize: F(0,0) = 0, F(i,0) = id, F(0,j)=jd
Fill from top left to bottom right using the recursive
relation
F (i 1, j 1) s ( x , y )
i
j
F (i, j ) max min ( F (i k , j ) gap(k )
k
min F (i, j k ) gap(k )
k
Example
_
_
X
_
_
X
X
X
X_
_X
_X
X_
Example
X
X
X_
_X
_X
X_
X
_
XY
__
XY
X_
X_Y
_X_
_XY
X__
XY
_X
XY_
__X
Global Alignment
yj aligned to gap
Move ahead in both
F(i-1,j-1)
F(i,j-1)
s(xi,yj)
F(i-1,j)
xi aligned to gap
d
d
F(i,j)
While building the table, keep track of where
optimal score came from, reverse arrows
Example
H
E
A
G
A
W
G
H
E
E
0
-8
-16
-24
-32
-40
-48
-56
-64
-72
-80
P
-8
-2
-9
-17
-25
-33
-42
-49
-57
-65
-73
A
-16
W
-24
H
-32
E
-40
A
-48
E
-56
Completed Table
Score
Gap –8
Error –2
Fit +6
H
E
A
G
A
W
G
H
E
E
0
-8
-16
-24
-32
-40
-48
-56
-64
-72
-80
P
-8
-2
-9
-17
-25
-33
-42
-49
-57
-65
-73
A
-16
-10
-3
-4
-12
-20
-28
-36
-44
-52
-60
W
-24
-18
-11
-6
-7
-15
-5
-13
-21
-29
-37
H
-32
-14
-18
-13
-8
-9
-13
-7
-3
-11
-19
E
-40
-22
-8
-16
-16
-9
-12
-15
-7
3
-5
A
-48
-30
-16
-3
-11
-11
-12
-12
-15
-5
2
E
-56
-38
-24
-11
-6
-12
-14
-15
-12
-9
1
Traceback
H
E
A
G
A
W
G
H
E
E
0
-8
-16
-24
-32
-40
-48
-56
-64
-72
-80
P
-8
-2
-9
-17
-25
-33
-42
-49
-57
-65
-73
A
-16
-10
-3
-4
-12
-20
-28
-36
-44
-52
-60
W
-24
-18
-11
-6
-7
-15
-5
-13
-21
-29
-37
H
-32
-14
-18
-13
-8
-9
-13
-7
-3
-11
-19
E
-40
-22
-8
-16
-16
-9
-12
-15
-7
3
-5
A
-48
-30
-16
-3
-11
-11
-12
-12
-15
-5
2
E
-56
-38
-24
-11
-6
-12
-14
-15
-12
-9
1
Trace arrows back
from the lower right
to top left
• Diagonal – both
• Up – upper gap
• Left – lower gap
HEAGAWGHE-E
--P-AW-HEAE
Summary
Uses recursion to fill in intermediate results
table
Uses O(nm) space and time
O(n2) algorithm
Feasible for moderate sized sequences, but not
for aligning whole genomes.
Local Alignment
Smith-Waterman (1981)
Another dynamic programming solution
0
F (i 1, j 1) s( x , y )
i
j
F (i, j ) max min F (i k , j ) gap(k )
k
min F (i, j 1) gap(k )
k
Example
H
E
A
G
A
W
G
H
E
E
0
0
0
0
0
0
0
0
0
0
0
P
0
0
0
0
0
0
0
0
0
0
0
A
0
0
0
5
0
5
0
0
0
0
0
W
0
0
0
0
2
0
20
12
4
0
0
H
0
10
2
0
0
0
12
18
22
14
6
E
0
2
16
8
0
0
4
10
18
28
20
A
0
0
8
21
13
5
0
4
10
20
27
E
0
0
6
13
18
12
4
0
4
16
26
Traceback
H
E
A
G
A
W
G
H
E
E
0
0
0
0
0
0
0
0
0
0
0
P
0
0
0
0
0
0
0
0
0
0
0
A
0
0
0
5
0
5
0
0
0
0
0
W
0
0
0
0
2
0
20
12
4
0
0
H
0
10
2
0
0
0
12
18
22
14
6
E
0
2
16
8
0
0
4
10
18
28
20
A
0
0
8
21
13
5
0
4
10
20
27
E
0
0
6
13
18
12
4
0
4
16
26
Start at highest score
and traceback to first 0
AWGHE
AW-HE
Summary
Similar to global alignment algorithm
For this to work, expected match with random
sequence must have negative score.
Behavior is like global alignment otherwise
Similar extensions for repeated and overlap
matching
Care must be given to gap penalties to maintain
O(nm) time complexity
Statistical Significance of
Sequence Alignments
STATISTICAL SIGNIFICANCE = probability that our
score would be found between random (or unrelated)
sequences
Examine alignment for long runs of matches and placement
of gaps
Try alternative alignments and compare scores
Calculate statistical significance of alignment score using
extreme value distribution formula
Scramble one of sequences 1000's of times and realign to
obtain idea of distribution of scores with unrelated
sequences of same size
Alternative alignments
Programs like LALIGN (stands for local alignment)
produce as many different alignments as you like. Each
subsequent alignment does not align the same sequence
positions.
Extreme Value Distributions
The average of n samples taken from any distribution with
finite mean and variance will have a normal distribution for
large n. This is the CLT.
The largest member of a sample of size n has a LEV, Type I
largest extreme value, also called Gumbel, distribution,
regardless of the parent population,
IF the parent has an unbounded tail that decreases at least as
fast as an exponential function, and has finite moments (as
does the normal, for example). The LEV, has pdf given by
f (x | q1 , q2 ) = 1/ q2 exp( -z -exp( -z ) ) )
where z = (x - q1) / q2 and q1, q2 are location and scale*
parameters, respectively, and q1 > 0.
Normal Distribution versus
Extreme Value Distribution
0.4
Normal
Normal distribution:
y = exp(-x2/2) / sqrt(2π)
Extreme
Value
0.0
-4
-3
-2
-1
0
x
1
2
3
4
Extreme value distribution:
y = exp(-x – exp(-x))
Extreme Value Distribution
Probability density function:
f(x) = exp(-x-exp(-x))
Cumulative distribution function:
Prob(S<x) = exp(-exp(-x))
Prob(S≥x) = 1 - Prob(S<x) = 1 - exp(-exp(-x))
Sample mean m, sample variance σ2
λ = 1.2825 / σ
μ = m – 0.45σ
Prob(x) = exp(-x-exp(-λ(x-μ)))
Calculating the statistical
significance
How is this done?
Make many random protein sequences of varying lengths
Locally align two sequences of a given length with the same scoring
matrix and gap penalty as used for our alignment
Repeat for many pairs of sequences of approximately the same length to
see how high a score we can get
Look at the distribution of scores for a given range of lengths
0.4
A.
0.2
Yev
-2
-1
0
1
2
3
4
5
The formula for extreme values
The probability that a score S between two unrelated sequences
is equal to or greater than a value x, P (S > x) is given by:
P = 1 – exp ( K m n e-lx )
where m and n are the sequence lengths, l is a “scaling factor”
for the scoring matrix used, and K is a constant that depends on
the scoring matrix and gap penalty combination that is used.
What we want is a score that gives a very low value of P, say less
than 0.01- 0.05. However, there is a trick here. We usually calculate
another value E, the expect value for the alignment score that
depends on how we found P. This method is used by BLAST. For
blosum62, gap –11/-1, l = 0.3 and K=0.1 (roughly).
Values Describing Scores
Both the Gumbel Extreme Value Distribution
and Karlin-Ashtul Distribution use E values
and P values
E-value (Expect value): the average number of
times such a match would be found
P-value (probability): probability of finding an
alignment under assumptions
Important note: Alignments that are
statistically important may not be
biologically important
Dot Matrices
The Dot Matrix Method.
Provides a ‘Gestalt’ of all possible
alignments between two sequences.
To begin — Let us use a very simple 0, 1
(match, no-match) identity scoring function
without any windowing. The sequences to
be compared are written out along the x and
y axes of a matrix.
Put a dot wherever symbols match;
identities are highlighted.
S
E
Q
S
•
E
Q
U
•
E
N
C
E
A
L
Y
S
•
I
S
•
P
R
I
M
E
•
•
•
•
•
•
R
•
E
N
C
•
E
A
N
•
•
•
•
•
•
•
•
•
•
•
•
A
L
Y
•
•
•
•
•
•
•
•
•
•
•
•
P
R
I
M
E
R
N
•
U
S
I
S
A
•
•
•
•
•
•
•
•
•
•
Since this is a comparison between two of the same sequences, an
intra-sequence comparison, the most obvious feature is the main
identity diagonal. Two short perfect palindromes can also be seen as
crosses directly off the main diagonal; they are “ANA” and “SIS.”
•
Dot Matrices
The biggest asset of dot matrix analysis is it allows
you to visualize the entire comparison at once, not
concentrating on any one ‘optimal’ region, but
rather giving you the ‘Gestalt’ of the whole thing.
Since your own mind and eyes are still better than
computers at discerning complex visual patterns,
especially when more than one pattern is being
considered, you can see all these ‘less than best’
comparisons as well as the main one and then you
can ‘zoom-in’ on those regions of interest using
more detailed procedures.
It is impossible to tell whether the evolutionary event that caused
the discrepancy between the two sequences was an insertion or a
deletion and hence this phenomena is called an ‘indel.’
S
E
S
¥
E
Q
U
¥
E
¥
A
L
Y
S
¥
I
S
¥
P
R
I
M
E
¥
¥
¥
¥
¥
¥
¥
¥
¥
¥
¥
¥
¥
¥
¥
¥
¥
¥
¥
¥
¥
Q
U
¥
¥
E
N
¥
¥
¥
C
¥
¥
¥
¥
¥
¥
¥
¥
¥
¥
¥
¥
¥
¥
¥
¥
U
C
E
N
¥
¥
E
E
N
A
¥
N
C
E
Q
E
¥
U
E
E
S
C
¥
Q
S
E
N
¥
¥
¥
¥
¥
¥
¥
¥
The ‘duplication’ here is seen as a distinct column of
diagonals; whenever you see either a row or column of
diagonals in a dotplot, you are looking at direct repeats.
R
S
E
Q
U
E
A
N
N
C
E
A
¥
¥
N
A
¥
L
Y
S
I
S
P
R
I
¥
¥
A
M
E
R
¥
¥
¥
L
Y
¥
¥
Z
E
S
E
¥
¥
¥
¥
¥
¥
¥
¥
¥
¥
¥
¥
Q
U
¥
¥
E
¥
N
C
¥
¥
¥
E
S
¥
¥
¥
¥
¥
¥
¥
¥
Again, notice the diagonals. However, they have now been displaced off of the center
diagonal of the plot and, in fact, in this example, show the occurrence of a
‘transposition.’ Dot matrix analysis is one of the only sensible ways to locate such
transpositions in sequences. Inverted repeats still show up as perpendicular lines to
the diagonals, they are just now not on the center of the plot. The ‘deletion’ of
‘PRIMER’ is shown by the lack of a corresponding diagonal.
Sequence comparison – dot matrix alignment
Filtered Windowing
Reconsider the same plot. Notice the extraneous dots that neither
indicate runs of identity between the two sequences nor inverted repeats.
These merely contribute ‘noise’ to the plot and are due to the ‘random’
occurrence of the letters in the sequences, the composition of the
sequences themselves.
How can we ‘clean up’ the plots so that this noise does not detract from
our interpretations? Consider the implementation of a filtered
windowing approach; a dot will only be placed if some ‘stringency’ is
met.
What is meant by this is that if within some defined window size, and
when some defined criteria is met, then and only then, will a dot be
placed at the middle of that window. Then the window is shifted one
position and the entire process is repeated. This very successfully rids
the plot of unwanted noise.
Windowing
In this plot a window of size
three and a stringency of two
is used to considerably
improve the signal to noise
ratio
Alignment to Databsases
Fasta
Blast
FASTA
FastA is a family of programs: FastA, TFastA, FastX,
FastY
Query:
DNA
Protein
Database:
DNA
Protein
FastA
Gap opening penalty
-12, -16 by default
for fasta with
proteins and DNA,
respectively
Gap extension
penalty -2, -4 by
default for fasta
with proteins and
DNA, respectively
Max number of
scores and
alignments is 100
Blosum50
default.
Lower PAM
higher blosum
to detect close
sequences
Higher PAM and
lower blosum
to detect distant
sequences
The larger the
word-length the
less sensitive, but
faster the search
will be
FastA Output
Initn, init1, opt, zscore calculated
during run
E score expectation
value, how
many hits are
expected to be
found by
chance with
such a score
while
comparing
this query to
this database.
Database
code
hyperlinked
to the SRS
database at
EBI
Accession
number
Description
Length
E() does not
represent the
% similarity
FASTA Output
FASTA-Stages
1. Find k-tups in the two sequences (k=1,2 for
proteins, 4-6 for DNA sequences)
2. Score and select top 10 scoring “local
diagonals”
a.
b.
c.
For proteins, each k-tup found is scored using
the PAM250 matrix
For DNA, the number of k-tups found
Penalize intervening gaps
FASTA-Stages
3. Rescan top 10 regions, score with PAM250
(proteins) or DNA scoring matrix. Trim off the
ends of the regions to achieve highest scores.
4. Try to join regions with gapped alignments. Join
if similarity score is one standard deviation above
average expected score
5. After finding the best initial region, FASTA
performs a global alignment of a 32 residue wide
region centered on the best initial region, and uses
the score as the optimized score.
Finding k-tups
position 1 2 3 4 5 6 7 8 9 10 11
protein 1 n c s p t a . . . . .
protein 2 . . . . . a c s p r k
position in
offset
amino acid
protein A protein B
pos A - posB
----------------------------------------------------a
6
6
0
c
2
7
-5
k
11
n
1
p
4
9
-5
r
10
s
3
8
-5
t
5
----------------------------------------------------Note the common offset for the 3 amino acids c,s and p
A possible alignment is thus quickly found protein 1 n c s p t a
| | |
protein 2 a c s p r k
FASTA, K-tups with common
offset
BLAST
Basic Local Alignment Search Tool
Altschul et al. 1990,1994,1997
Heuristic method for local alignment
Designed specifically for database searches
Idea: Good alignments contain short lengths
of exact matches
Blast Application
Blast is a family of programs: BlastN, BlastP, BlastX,
tBlastN, tBlastX Query:
DNA
Protein
Database:
DNA
Protein
BlastN - nt versus nt database
BlastP - protein versus protein database
BlastX - translated nt versus protein database
tBlastN - protein versus translated nt database
tBlastX - translated nt versus translated nt database
Blast –
Basic Local Alignment Search Tool
Blast uses a heuristic search algorithm and uses
statistical methods of Karlin and Altshul (1990)
Blast programs were designed for fast database
searching with minimal sacrifice of sensitivity for
distantly related sequences
Mathematical Basis of BLAST
Model matches as a sequence of coin tosses
Let p be the probability of a “head”
For a “fair” coin, p = 0.5
(Erdös-Rényi) If there are n throws, then the
expected length R of the longest run of heads is
R = log1/p (n).
Example: Suppose n = 20 for a “fair” coin
R=log2(20)=4.32
Trick is how to model DNA (or amino acid)
sequence alignments as coin tosses.
Mathematical Basis of BLAST
To model random sequence alignments, replace a
match with a “head” and mismatch with a “tail”.
AATCAT
HTHHHT
ATTCAG
For DNA, the probability of a “head” is 1/4
Same logic applies to amino acids
Mathematical Basis of BLAST
So, for one particular alignment, the Erdös-Rényi
property can be applied
What about for all possible alignments?
Consider that sequences are being shifted back and forth,
dot matrix plot
The expected length of the longest match is
R=log1/p(mn)
where m and n are the lengths of the two sequences.
Steps of BLAST
1. Filter out low-complexity regions
K 1 / L log N L! / ni !
i
where L is length, N is alphabet size, ni is the
number of letter i appearing in sequence.
Example: AAAT
K=1/4 log4(24/(3!*1!*0!*0!))=0.25
Steps of BLAST
2. Query words of length 3 (for proteins) or 11 (for
DNA) are created from query sequence using a
sliding window
MEFPGLGSLGTSEPLPQFVDPALVSS
MEF
EFP
FPG
PGL
GLG
Expected run length in sequences of ~90 for
proteins and 64 for DNA.
Steps of BLAST
3. Using BLOSUM62 (for proteins) or scores of
+5/-4 (DNA, PAM40), score all possible words of
length 3 or 11 respectively against a query word.
4. Select a neighborhood word score threshold (T)
so that only most significant sequences are kept.
Approximately 50 hits per query word.
5. Repeat 3 and 4 for each query word in step 2.
Total number of high scoring words is
approximately 50 * sequence length.
Steps of BLAST
6. Organize the high-scoring words into a search
tree
M
E
E
F
G
P
7. Scan each database sequence for match to highscoring words. Each match is a seed for an
ungapped alignment.
Steps of BLAST
8. (Original BLAST) extend matching words to the
left and right using ungapped alignments.
Extension continues as long as score increases or
stays same. This is a HSP (high scoring pair).
(BLAST2) Matches along the same diagonal
(think dot plot) within a distance A of each other
are joined and then the longer sequence extended
as before. (Requires lower T)
Steps of BLAST
9. Using a cutoff score S, keep only the
extended matches that have a score at least
S.
10. Determine statistical significance of each
remaining match (from last time).
11. Try to extend the HSPs if possible.
12. Show Smith-Waterman local alignments.
Statistical Significance of Blast
Probability (P) – score of the likelihood of its
having arisen by chance. The closer the p-value
approaches zero, the greater the confidence that the
match is real. The closer the value is to unity, the
greater the chance that the match is spurious
Expected Frequency (E) value – number of hits one
can expect to see by chance (noise) when searching
a database of a particular size. E value of 1 – one
match with a similar score by chance. E value of 0
– no matches expected by chance
Low-complexity and Gapped
Blast Algorithm
The SEG program has been implemented as part of the blast
routine in order to mask low-complexity regions
Low-complexity regions are denoted by strings of Xs in the
query sequence
In 1997 a modification introduced generation of gapped
alignments
The gapped algorithm seeks only ONE ungapped alignment
that makes up a significant match and hence speeds the
initial database search
Dynamic programming is used to extend a central pair of
aligned residues in both directions to yield the final gapped
alignment
Gapped blast is 3 times faster than ungapped blast
Smith and Waterman
Compare query to each sequence in database
Perform full Smith and Waterman pairwise
alignment to find the optimal alignment
SW searching is exhaustive and therefore
runs on special hardware (Biocellerator)
The sequence scrambling method
We first align the two sequences and obtain the optimal score S
Next, we scramble one of the sequences many 1000s of times (N), align
it with the other sequence and obtain a distribution of scores (not related
but they have the same composition* as our sequences)
We fit the scores to an extreme value distribution and calculate our l
and K.
Then, we calculate P, as before, for the probability that one of the
scrambled sequence pairs would exceed our optimal score S
Finally, we calculate an E (expect value), which is (usually) P times the
number of sequence pairs we compared. (If P = 10 -6 and N=10,000,
then E = 10-2.
E is the no. we want to be <0.01- 0.05
Method used by FASTA suite of programs
* We can scramble the whole sequence (pick 20 kinds of marbles from a
bag, or a sliding window (pick the first one from a bag with the first 10
or so – then slide ahead one and pick one from the next 10. The window
method is more stringent. Why? (think about low complexity)