FASTA - Algorithm - FAU College of Engineering
Download
Report
Transcript FASTA - Algorithm - FAU College of Engineering
COT 6930
HPC and Bioinformatics
Pairwise Sequence Alignment (PSA)
Xingquan Zhu
Dept. of Computer Science and Engineering
Outline
Dynamic Programming
Dot Matrix Plot
FASTA
BLAST
Difficulty of PSA
Consider two sequences of length n
Number of possible global alignments
For n = 40, there are over 1023 possible alignments
Approaches
“Optimal” solution
Dynamic programming
Heuristic solutions
Dot matrix plot
FASTA
BLAST
Dynamic Programming (DP)
Alignment
Problem can be subdivided
Optimal alignment of n bases
Bioinformatics application
1. Incorporates optimal alignment of 1…n-1 bases
2. Combine with best alignment for nth base
Global alignment
Local alignment
[Needleman-Wunsch 1970]
[Smith-Waterman 1981]
Complexity
O(n2) or O(n × m) algorithm (sequences of length n, m)
Feasible for moderate sized sequences, not entire
genomes
Dynamic Programming (DP)
Dynamic programming usually consists of three
components.
Recursive relation
Tabular computation
Trace back
This efficient recursive method is used to search
through all possible alignments and finding the one
with the optimal score.
How DP is used for alignment?
An alignment is represented as a path through a matrix.
To search through the matrix of all possible paths and find the
optimal path DP is used.
Four possible outcomes in aligning two sequences
[1] identity (stay along a diagonal)
[2] mismatch (stay along a diagonal)
[3] gap in one sequence (move vertically!)
[4] gap in the other sequence (move horizontally!)
Intuition of Dynamic Programming
If we already have the optimal solution to:
XY
AB
then we know the next pair of characters will either
be:
XYZ
ABC
or
XYABC
or
XYZ
AB-
(where “-” indicates a gap).
So we can extend the match by determining which of
these has the highest score.
Intuition of Dynamic Programming
Considering sequences: “HEAGA” and “PA”
DP (Global Alignment)
Recursive formula
Notation
xi
yj
x1..i
F
F(i, j)
d
s
ith letter of string x
jth letter of string y
prefix of x from letters 1 through i
matrix of optimal scores (DP Matrix)
represents optimal score lining up x 1..i with y1..j
gap penalty
scoring matrix
Constant vs. Non-constant Gap
Penalty
Constant gap penalty
F (i 1, j 1) s( xi , y j ),
F (i, j ) max F (i 1, j ) d ,
F (i, j 1) d .
Non-constant gap penalty
F (i 1, j 1) s( xi , y j ),
F (i, j ) max F (k , j ) (i k ), k 0,..., i 1,
F (i, k ) ( j k ), k 0,..., j 1.
DP (Global Alignment)
Algorithm
Initialize: F(0,0) = 0, F(i,0) = i × d, F(0, j) = j × d (gap
penalties)
Fill from top left to bottom right using recursive formula
DP Example
Sequences
Seq1: HEAGAWGHEE
Seq2: PAWHEAE
Gap Penalty: -8 (Constant)
Scoring Matrix (Blosum 50)
DP Example
Algorithm
Initialize
F(0,0) = 0, F(i,0) = i × d, F(0, j) = j
× d (gap penalties)
Fill from top left to bottom right using
recursive formula
Blosum 50
DP Example
Algorithm
Initialize
F(0,0) = 0, F(i,0) = i × d, F(0, j)
= j × d (gap penalties)
Fill from top left to bottom right
using recursive formula
Blosum 50
DP Example (Traceback)
Trace arrows from bottom right to top left
Diagonal – both match
Up – left sequence match a gap
Or insert a gap to top sequence
Left – top sequence match a gap
Or insert a gap to left sequence
Global alignment vs. local alignment
Global alignment: the entire sequence of each protein
or DNA sequence is contained in the alignment.
Local alignment: only regions of greatest similarity
between two sequences are aligned
percent
identity:
~26%
RBP:
26
glycodelin:
23
RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84
+ K++ + + +GTW++MA
+ L
+ A
V T +
+L+ W+
QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81
Global vs. local alignment
Global alignment are often not effective for highly
diverged sequences - do not reflect the biological reality
that two sequences may only share limited regions of
conserved sequence.
Global methods are useful when you want to force two
sequences to align over their entire length
Local alignment is almost always used for database
searches such as BLAST. It is useful to find domains (or
limited regions of homology) within sequences.
DP (local alignment)
Make 0 minimal score (i.e., start new
alignment)
Alignment can start / end anywhere
Start at highest score(s)
End when 0 reached
F 0,0 0
F (i 1, j 1) s ( xi , y j ),
F (i , j ) max F ( k , j ) (i k ), k 0,..., i 1,
F (i , k ) ( j k ), k 0,..., j 1.
0
DP (local alignment)
F (i 1, j 1) s( xi , y j )
F (i 1, j ) 8
F (i, j ) max
F (i, j 1) 8
0
Blosum 50
DP (Local Alignment)
Traceback
Start at highest score and trace arrows back to first
0
Multiple, Repeat, & Overlap Matches
Multiple alignments
Repeat matches
“Optimal” global alignment may be best by small margin
Can report several high-scoring alignments
Allow sections to match repeatedly
Find repeated patterns (domain / motif)
Overlap matches
Matching when the two sequences overlap
Does not penalize overhanging ends
Overlap Alignment
Consider the following problem:
Find the most significant overlap between two sequences S,T ?
Possible overlap relations: a.
b.
Difference from local alignment:
require alignment between the endpoints of the two sequences.
Overlap Alignment
Formally:
given S[1..n] , T[1..m] find i,j such that:
d=max{D(S[1..i],T[j..m]) , D(S[i..n],T[1..j]) , D(S[1..n],T[i..j]) , D(S[i..j],T[1..m]) }
is maximal.
Solution: Same as Global alignment except we don’t
not penalise overhanging ends.
Overlap Alignment
Initialization: F(i,0)=0 , F(0,j)=0
Recurrence: as in global alignment
F (i 1, j 1) s( xi , y j )
F (i, j ) max F (i 1, j ) s( xi ,)
F (i, j 1) s( , y )
j
Score: maximum value at the bottom line and rightmost line
Overlap Alignment (Example)
H E A G A W G H E E
S = PAWHEAE
T = HEAGAWGHEE
0 0 0 0 0 0 0 0 0 0 0
P 0
A 0
W 0
Scoring scheme :
Match:
+4
Mismatch: -1
Indel: -5
H 0
E 0
A 0
E 0
Overlap Alignment (Example)
H E A G A W G H E E
S = PAWHEAE
T = HEAGAWGHEE
0 0 0 0 0 0 0 0 0 0 0
P 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
A 0 -1
W 0 -1
Scoring scheme :
Match:
+4
Mismatch: -1
Indel: -5
H 0 4
E 0 -1
A 0 -1
E 0 -1
Overlap Alignment (Example)
H E A G A W G H E E
S = PAWHEAE
T = HEAGAWGHEE
0
0
0
0
0
0
0
0
0
0
0
P 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
A 0 -1 -2 3 -2 3 -2 -2 -2 -2 -2
W 0 -1 -2 -2 2 -2 7 2 -3 -3 -1
Scoring scheme:
Match:
+4
Mismatch: -1
Indel: -5
H 0 4 -1 -3 -3 1 2 6 6 1 -2
E 0 -1 8 3 -2 -3 0 1 5 10 5
A 0 -1 3 12 7 2 -2 -1 0 5 9
E 0 -1 3 7 11 6 1 -3 -2 4 9
Overlap Alignment (Example)
The best overlap is:
PAWHEAE--------HEAGAWGHEE
Pay attention!
A different scoring scheme could yield a different result, such as:
Scoring scheme :
Match:
+4
Mismatch: -1
Indel: -5 -2
---PAW-HEAE
HEAGAWGHEE-
Outline
Dynamic Programming
Dot Matrix Plot
FASTA
BLAST
Dot Matrix Plot
Method
Put two sequences on vertical and horizontal axes of graph
Put dots wherever there is a match
Diagonal line is region of identity (local alignment)
Filter non-diagonals (minimal length, identity threshold)
Simple Dot Plot Example
Simple Dot Plot Example
D O R O TH Y C R O W F O O TH O D G K IN
D D
D
O O O
O O
O
R
R
R
O
O
O
O O
O
T
T
T
H
H
H
Y
Y
H
H
O
O
D
D
G
G
K
K
I
I
N
N
The Dotplot is a Table for overview of the similarities
between two sequences
M A X I S T A Y A W A Y A T S I X A M
M M
M
A
A
A
A
A
A
A
X
X
X
I
I
I
S
S
S
T
T
T
A
A
A
A
A
A
A
Y
Y
Y
A
A
A
A
A
A
A
W
W
A
A
A
A
A
A
A
Y
Y
Y
A
A
A
A
A
A
A
T
T
T
S
S
S
I
I
I
X
X
X
A
A
A
A
A
A
A
M M
M
The palindromic sequence:
MAXISTAYAWAYATSIXAM
The stretches of similar
residues show up as
diagonals
The palindromic sequence is
very important part of
DNA sequence. Many
enzyme recognize these
sequences
Dot Matrix Plot Example
Chimpanzee DNA aligned with itself
Filters needed to eliminate random matches
filtered 6 of 10
filtered 8 of 10
Dot Matrix Plot
Advantage
Extremely simple
Can use scoring matrix
Can use multiple filters to rank matches
Example
Red > 50 bases, 95% identical (very close)
Blue > 25 bases, 75% identical (close)
Green > 10 bases, 50% identical (distant)
Limitations
O(n2) algorithm
Relies on visual analysis
Difficult to find “best” alignment
Despite multiple filters, difficult to compare alignments
Dot Matrix Plot
Demo
http://www.vivo.colostate.edu/molkit/dnadot/
Outline
Dynamic Programming
Dot Matrix Plot
FASTA
BLAST
FASTA
- Idea -
Problem of Dynamic Programming
DP compute the score in a lot of useless area for optimal
sequence
G
A
A
T
T
C
A
G
T
T
A
G
1
1
1
1
1
1
1
1
1
1
1
G
1
1
1
1
1
1
1
2
2
2
2
A
1
2
2
2
2
2
2
2
2
2
2
T
1
2
2
3
3
3
3
3
3
3
3
C
1
2
2
3
3
4
4
4
4
4
4
G
1
2
2
3
3
4
4
5
5
5
5
A
1
2
3
3
3
4
5
5
5
5
6
FASTA focuses on diagonal area
FASTA
- Heuristic -
Heuristic
Good local alignment should have some
exact match subsequence.
FASTA focus on
this area
FASTA High Level Description
Let q be a query
max 0
For each sequence, s in DB
compare q with s and compute a score, y
if max < y
max y;
bestSequence s ;
Return bestSequence
FASTA - Algorithm
Step 1
Find all hot-spots
// Hot spots is pairs of words of length k that exactly match
k=4-6 for DNA, k=1-2 for Amino acid sequences
Sequence 1
Hot Spots
Sequence 2
FASTA - Algorithm
Step 1 in detail
Use look-up Table
Query
: G AAT T C A G T TA
Sequence: G G A T C G A
Dot—Matrix
G A A T T C A G T T A
Look-up Table
Q
A
C
G
T
Location
2,3,7,11
6
1,8
4,5,9,10
G *
*
G *
*
A
*
*
T
*
*
C
*
* *
*
G *
A
*
*
*
*
*
*
FASTA - Algorithm Step 2
Score the Hot-spot and locate the ten best diagonal run.
// There is some scoring system; ex. PAM250
FASTA - Algorithm
Step 3
Combine sub-alignments into one alignment with GAP
GAP
One of local
alignment
FASTA - Algorithm
Step 4
# Consider weighted direct graph.
# Let node be a sub-alignment found in step 1
# Let u and v be nodes
# Edge (u,v) exists if alignment u is before in the sequence.
# Each edge has gap penalty (negative)
Sub-sequence
# Find the maximum weight path
Edge
One Sequence
FASTA - Algorithm
GAP
Step 4 in detail
Sub-alignment
Gap
One of Sequence
-5
-3
-3
Max Weight Path
FASTA - Algorithm
Step 5
Use the dynamic programming in restricted area around the best-score
alignment to find out the higher-score alignment than the best-score
alignment
Width of this band
is a parameter
FASTA - Algorithm
Summary of Algorithm
1: Find all hot-spots
// Hot spots is pairs of words of length k that exactly match
2: Score the Hot-spot and locate the ten best diagonal run.
3: Combine sub-alignments into one alignment
4: Score Each alignment with gap penalty and pick up the best-score
alignment
5: Use the dynamic programming in restricted area around the bestscore alignment to find out the alignment greater than the best-score
alignment.
FASTA - Complexity
Complexity
# Step 1 and 2
// select the best 10 diagonal run
Let n be a sequence from DB
O(n+m) because Step 1 just uses look up the table
O(n+m) << O(mn) m,n = 100 to 200
FASTA - Complexity # Step 3 and 4
// compute the MAX Weight Path
Let r be the number of sub-alignments. (r = 10)
Let s be the number of edges
Positive Weight
O(rs) < O(m*n)
-5
-3
-3
Max Weight Path
FASTA - Complexity # Step 5
// compute partial D.P.
Depends on the restricted area < O(mn)
Width of this band is a
parameter
Therefore, FASTA is faster than DP
FASTA
Approach
Recap
Derived from logic of the dot matrix method
View sequences as sequences of short words (k-tuple)
DNA - 6 bases, protein - 1 or 2 amino acids
Start from nearby regions of exact matching words
Motivation
Good alignments should contain many exact matches
Hashing can find exact matches in O(n) time
Diagonals can be formed from exact matches quickly
Look only at matches near longest diagonals
Apply more precise alignment to small search space at end
FASTA
Recap
Outline
Dynamic Programming
Dot Matrix Plot
FASTA
BLAST
BLAST
Approach (Basic Local Alignment Search Tool)
View sequences as sequences of short words (k-tuple)
DNA - 11 bases, protein - 3 amino acids
Create hash table of neighborhood (closely-matching) words
Use statistics to set threshold for “closeness”
Start from exact matches to neighborhood words
Motivation
Good alignments should contain many close matches
Statistics can determine which matches are significant
Make much more sense than % identity
Hashing can find matches in O(n) time
Extending matches in both directions finds alignment
Yields high-score /maximal segment pairs (HSP / MSP)
BLAST - Heuristic Another Heuristic algorithm
Heuristic but evaluating the result statistically.
Homologous sequence are likely to contain a short high
scoring word pair, a hit.
BLAST tries to extend it on the both sides to get optimal
sequence.
Sequence
A T T A G …………….
Short high score Word
BLAST - Algorithm
Neighborhood Word
Step 1: Preprocessing Query
Compile the short-hit scoring word list from query.
The length of query word,w, is 3 for brosom scoring
Threshold T is 13
BLAST - Algorithm
Step 2: Create neighborhood words for each query word
Query Word
Neighborhood words
BLAST Word Matching
MEAAVKEEISVEDEAVDKNI
MEA
EAA
AAV
Break query
AVK
VKE
into words:
KEE
EEI
EIS
ISV
...
Break database
sequences
into words:
BLAST - Algorithm
Step 3: Scan DB
Hash Table
Query: LAALLNKCKTPQGQRLVNQWIKQPLMD
Hash Table
Word list
BLAST - Algorithm
Step 3: Scanning DB
For each words list, identify all exact matches with DB sequences
Neighborhood
Query Word
Sequences in DB
Word list
Sequence 1
Sequence 2
Step 1
Step 2
The purpose of Step 1 and 2 is as same as FASTA
Step 4: Extend the alignment
Given two strings S1 and S2, a segment pair is a pair of equallength substrings of S1 and S2
A Maximal Segment Pair (MSP) in S1 and S2, is a segment pair
with the maximum score over all segment pairs in S1 and S2
A High-score Segment Pair (HSP) in S1 and S2, is a segment
pair which does not increase its score while either extending or
shortening its length
Also called a local maximal segment pair (LMSP)
Seq_XYZ:
Query:
HVTGRSAF_FSYYGYGCYCGLGTGKGLPVDATDRCCWA
QSVFDYIYYGCYCGWGLG_GK__PRDA
E-val=10-13
•Use two word matches as anchors to build an alignment
between the query and a database sequence.
•Then score the alignment.
BLAST – Algorithm
Step 5:
Align best segments using dynamic programming
Evaluate the alignment statistically
E-value = the number of HSPs having score S (or higher) expected to
occur only by chance.
E(S) ≈ K × N ×M × e -S
N, M denote the sequence lengths
K and are parameters
Smaller E-value, more significant in statistics
Bigger E-value , by chance
E-value Range
E < 10-100 → very low, homologs or identical genes
E < 10-3 → moderate, may be related genes
E > 1 → high, probably / may be unrelated
0.5 < E < 1 → ??? In the “twilight zone” Try detailed search
BLAST (Ungapped)
BLAST - Running Time Alignment algorithms
• O(n2) – dynamic programming, dot matrix
• O(n) – FASTA, BLAST
Running Time
The length of Query
DB size
: 153
: 5997 sequences
Algorithm
DP
FASTA
BLAST
PC : Pentium 4
By Dr. Takeshi Kawabata
Nara Sentan Gijyutu University
Running Time
16.989 [s]
0.618 [s]
0.118 [s]
Comparison of Algorithm
Dynamic Programming
most sensitive result
DP uses all information of two sequences
Running time is slow
DP compute the useless area for computing the optimal
sequence.
Comparison of Algorithm
FASTA
Less sensitive than DP and BLAST
FASTA uses partial information to speed up the
computation.
FASTA does not evaluate the result statistically.
Running time is faster DP
the same reason as the above.
Comparison of Algorithms
BLAST
Sensitive than FASTA
BLAST evaluate the result statistically
Faster than FASTA
Because BLAST evaluate the entire DB with the same
threshold based on statistics. BLAST eliminate noises and
reduces the running time.
FASTA vs BLAST
BLAST
Compare the query and sequences in DB
with the same threshold.
DB
Query
DB
FASTA
compare the query and a sequence one by one
And compare the each result.
Outline
Dynamic Programming
Dot Matrix Plot
FASTA
BLAST