FASTA - Algorithm - FAU College of Engineering

Download Report

Transcript FASTA - Algorithm - FAU College of Engineering

COT 6930
HPC and Bioinformatics
Pairwise Sequence Alignment (PSA)
Xingquan Zhu
Dept. of Computer Science and Engineering
Outline




Dynamic Programming
Dot Matrix Plot
FASTA
BLAST
Difficulty of PSA


Consider two sequences of length n

Number of possible global alignments

For n = 40, there are over 1023 possible alignments
Approaches


“Optimal” solution
 Dynamic programming
Heuristic solutions
 Dot matrix plot
 FASTA
 BLAST
Dynamic Programming (DP)

Alignment


Problem can be subdivided
Optimal alignment of n bases



Bioinformatics application



1. Incorporates optimal alignment of 1…n-1 bases
2. Combine with best alignment for nth base
Global alignment
Local alignment
[Needleman-Wunsch 1970]
[Smith-Waterman 1981]
Complexity


O(n2) or O(n × m) algorithm (sequences of length n, m)
Feasible for moderate sized sequences, not entire
genomes
Dynamic Programming (DP)


Dynamic programming usually consists of three
components.
 Recursive relation
 Tabular computation
 Trace back
This efficient recursive method is used to search
through all possible alignments and finding the one
with the optimal score.
How DP is used for alignment?

An alignment is represented as a path through a matrix.
To search through the matrix of all possible paths and find the
optimal path DP is used.
Four possible outcomes in aligning two sequences
[1] identity (stay along a diagonal)
[2] mismatch (stay along a diagonal)
[3] gap in one sequence (move vertically!)
[4] gap in the other sequence (move horizontally!)
Intuition of Dynamic Programming
If we already have the optimal solution to:
XY
AB
then we know the next pair of characters will either
be:
XYZ
ABC
or
XYABC
or
XYZ
AB-
(where “-” indicates a gap).
So we can extend the match by determining which of
these has the highest score.
Intuition of Dynamic Programming
Considering sequences: “HEAGA” and “PA”
DP (Global Alignment)

Recursive formula

Notation






xi
yj
x1..i
F
 F(i, j)
d
s
ith letter of string x
jth letter of string y
prefix of x from letters 1 through i
matrix of optimal scores (DP Matrix)
represents optimal score lining up x 1..i with y1..j
gap penalty
scoring matrix
Constant vs. Non-constant Gap
Penalty

Constant gap penalty
F (i  1, j  1)  s( xi , y j ),

F (i, j )  max F (i  1, j )  d ,
F (i, j  1)  d .


Non-constant gap penalty
F (i  1, j  1)  s( xi , y j ),

F (i, j )  max F (k , j )   (i  k ), k  0,..., i  1,
F (i, k )   ( j  k ), k  0,..., j  1.

DP (Global Alignment)

Algorithm


Initialize: F(0,0) = 0, F(i,0) = i × d, F(0, j) = j × d (gap
penalties)
Fill from top left to bottom right using recursive formula
DP Example

Sequences




Seq1: HEAGAWGHEE
Seq2: PAWHEAE
Gap Penalty: -8 (Constant)
Scoring Matrix (Blosum 50)
DP Example
Algorithm
Initialize
F(0,0) = 0, F(i,0) = i × d, F(0, j) = j
× d (gap penalties)
Fill from top left to bottom right using
recursive formula
Blosum 50
DP Example
Algorithm
Initialize
F(0,0) = 0, F(i,0) = i × d, F(0, j)
= j × d (gap penalties)
Fill from top left to bottom right
using recursive formula
Blosum 50
DP Example (Traceback)

Trace arrows from bottom right to top left
 Diagonal – both match
 Up – left sequence match a gap


Or insert a gap to top sequence
Left – top sequence match a gap

Or insert a gap to left sequence
Global alignment vs. local alignment


Global alignment: the entire sequence of each protein
or DNA sequence is contained in the alignment.
Local alignment: only regions of greatest similarity
between two sequences are aligned
percent
identity:
~26%
RBP:
26
glycodelin:
23
RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84
+ K++ + + +GTW++MA
+ L
+ A
V T +
+L+ W+
QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81
Global vs. local alignment



Global alignment are often not effective for highly
diverged sequences - do not reflect the biological reality
that two sequences may only share limited regions of
conserved sequence.
Global methods are useful when you want to force two
sequences to align over their entire length
Local alignment is almost always used for database
searches such as BLAST. It is useful to find domains (or
limited regions of homology) within sequences.
DP (local alignment)


Make 0 minimal score (i.e., start new
alignment)
Alignment can start / end anywhere
 Start at highest score(s)
 End when 0 reached
 F 0,0  0
 F (i  1, j  1)  s ( xi , y j ),

F (i , j )  max  F ( k , j )   (i  k ), k  0,..., i  1,
 F (i , k )   ( j  k ), k  0,..., j  1.

0
DP (local alignment)
 F (i  1, j  1)  s( xi , y j )
 F (i  1, j )  8

F (i, j )  max 
 F (i, j  1)  8
0
Blosum 50
DP (Local Alignment)

Traceback

Start at highest score and trace arrows back to first
0
Multiple, Repeat, & Overlap Matches

Multiple alignments



Repeat matches



“Optimal” global alignment may be best by small margin
Can report several high-scoring alignments
Allow sections to match repeatedly
Find repeated patterns (domain / motif)
Overlap matches


Matching when the two sequences overlap
Does not penalize overhanging ends
Overlap Alignment
Consider the following problem:
Find the most significant overlap between two sequences S,T ?
Possible overlap relations: a.
b.
Difference from local alignment:
require alignment between the endpoints of the two sequences.
Overlap Alignment
Formally:
given S[1..n] , T[1..m] find i,j such that:
d=max{D(S[1..i],T[j..m]) , D(S[i..n],T[1..j]) , D(S[1..n],T[i..j]) , D(S[i..j],T[1..m]) }
is maximal.
Solution: Same as Global alignment except we don’t
not penalise overhanging ends.
Overlap Alignment


Initialization: F(i,0)=0 , F(0,j)=0
Recurrence: as in global alignment
 F (i  1, j  1)  s( xi , y j ) 


F (i, j )  max  F (i  1, j )  s( xi ,)

 F (i, j  1)  s( , y ) 
j



Score: maximum value at the bottom line and rightmost line
Overlap Alignment (Example)
H E A G A W G H E E
S = PAWHEAE
T = HEAGAWGHEE
0 0 0 0 0 0 0 0 0 0 0
P 0
A 0
W 0
Scoring scheme :
Match:
+4
Mismatch: -1
Indel: -5
H 0
E 0
A 0
E 0
Overlap Alignment (Example)
H E A G A W G H E E
S = PAWHEAE
T = HEAGAWGHEE
0 0 0 0 0 0 0 0 0 0 0
P 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
A 0 -1
W 0 -1
Scoring scheme :
Match:
+4
Mismatch: -1
Indel: -5
H 0 4
E 0 -1
A 0 -1
E 0 -1
Overlap Alignment (Example)
H E A G A W G H E E
S = PAWHEAE
T = HEAGAWGHEE
0
0
0
0
0
0
0
0
0
0
0
P 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
A 0 -1 -2 3 -2 3 -2 -2 -2 -2 -2
W 0 -1 -2 -2 2 -2 7 2 -3 -3 -1
Scoring scheme:
Match:
+4
Mismatch: -1
Indel: -5
H 0 4 -1 -3 -3 1 2 6 6 1 -2
E 0 -1 8 3 -2 -3 0 1 5 10 5
A 0 -1 3 12 7 2 -2 -1 0 5 9
E 0 -1 3 7 11 6 1 -3 -2 4 9
Overlap Alignment (Example)
The best overlap is:
PAWHEAE--------HEAGAWGHEE
Pay attention!
A different scoring scheme could yield a different result, such as:
Scoring scheme :
Match:
+4
Mismatch: -1
Indel: -5 -2
---PAW-HEAE
HEAGAWGHEE-
Outline




Dynamic Programming
Dot Matrix Plot
FASTA
BLAST
Dot Matrix Plot

Method




Put two sequences on vertical and horizontal axes of graph
Put dots wherever there is a match
Diagonal line is region of identity (local alignment)
Filter non-diagonals (minimal length, identity threshold)
Simple Dot Plot Example
Simple Dot Plot Example
D O R O TH Y C R O W F O O TH O D G K IN
D D
D
O O O
O O
O
R
R
R
O
O
O
O O
O
T
T
T
H
H
H
Y
Y
H
H
O
O
D
D
G
G
K
K
I
I
N
N
The Dotplot is a Table for overview of the similarities
between two sequences
M A X I S T A Y A W A Y A T S I X A M
M M
M
A
A
A
A
A
A
A
X
X
X
I
I
I
S
S
S
T
T
T
A
A
A
A
A
A
A
Y
Y
Y
A
A
A
A
A
A
A
W
W
A
A
A
A
A
A
A
Y
Y
Y
A
A
A
A
A
A
A
T
T
T
S
S
S
I
I
I
X
X
X
A
A
A
A
A
A
A
M M
M
The palindromic sequence:
MAXISTAYAWAYATSIXAM
The stretches of similar
residues show up as
diagonals
The palindromic sequence is
very important part of
DNA sequence. Many
enzyme recognize these
sequences
Dot Matrix Plot Example


Chimpanzee DNA aligned with itself
Filters needed to eliminate random matches
filtered 6 of 10
filtered 8 of 10
Dot Matrix Plot

Advantage





Extremely simple
Can use scoring matrix
Can use multiple filters to rank matches
Example
 Red > 50 bases, 95% identical (very close)
 Blue > 25 bases, 75% identical (close)
 Green > 10 bases, 50% identical (distant)
Limitations




O(n2) algorithm
Relies on visual analysis
Difficult to find “best” alignment
Despite multiple filters, difficult to compare alignments
Dot Matrix Plot

Demo

http://www.vivo.colostate.edu/molkit/dnadot/
Outline




Dynamic Programming
Dot Matrix Plot
FASTA
BLAST
FASTA


- Idea -
Problem of Dynamic Programming
DP compute the score in a lot of useless area for optimal
sequence
G
A
A
T
T
C
A
G
T
T
A
G
1
1
1
1
1
1
1
1
1
1
1
G
1
1
1
1
1
1
1
2
2
2
2
A
1
2
2
2
2
2
2
2
2
2
2
T
1
2
2
3
3
3
3
3
3
3
3
C
1
2
2
3
3
4
4
4
4
4
4
G
1
2
2
3
3
4
4
5
5
5
5
A
1
2
3
3
3
4
5
5
5
5
6
FASTA focuses on diagonal area
FASTA

- Heuristic -
Heuristic
Good local alignment should have some
exact match subsequence.
FASTA focus on
this area
FASTA High Level Description
Let q be a query
max  0
For each sequence, s in DB
compare q with s and compute a score, y
if max < y
max  y;
bestSequence  s ;
Return bestSequence
FASTA - Algorithm 
Step 1
Find all hot-spots
// Hot spots is pairs of words of length k that exactly match
k=4-6 for DNA, k=1-2 for Amino acid sequences
Sequence 1
Hot Spots
Sequence 2
FASTA - Algorithm 
Step 1 in detail
Use look-up Table
Query
: G AAT T C A G T TA
Sequence: G G A T C G A
Dot—Matrix
G A A T T C A G T T A
Look-up Table
Q
A
C
G
T
Location
2,3,7,11
6
1,8
4,5,9,10
G *
*
G *
*
A
*
*
T
*
*
C
*
* *
*
G *
A
*
*
*
*
*
*
FASTA - Algorithm Step 2
Score the Hot-spot and locate the ten best diagonal run.

// There is some scoring system; ex. PAM250
FASTA - Algorithm 
Step 3
Combine sub-alignments into one alignment with GAP
GAP
One of local
alignment
FASTA - Algorithm 
Step 4
# Consider weighted direct graph.
# Let node be a sub-alignment found in step 1
# Let u and v be nodes
# Edge (u,v) exists if alignment u is before in the sequence.
# Each edge has gap penalty (negative)
Sub-sequence
# Find the maximum weight path
Edge
One Sequence
FASTA - Algorithm 
GAP
Step 4 in detail
Sub-alignment
Gap
One of Sequence
-5
-3
-3
Max Weight Path
FASTA - Algorithm 
Step 5
Use the dynamic programming in restricted area around the best-score
alignment to find out the higher-score alignment than the best-score
alignment
Width of this band
is a parameter
FASTA - Algorithm 
Summary of Algorithm
1: Find all hot-spots
// Hot spots is pairs of words of length k that exactly match
2: Score the Hot-spot and locate the ten best diagonal run.
3: Combine sub-alignments into one alignment
4: Score Each alignment with gap penalty and pick up the best-score
alignment
5: Use the dynamic programming in restricted area around the bestscore alignment to find out the alignment greater than the best-score
alignment.
FASTA - Complexity 
Complexity
# Step 1 and 2
// select the best 10 diagonal run
Let n be a sequence from DB
O(n+m) because Step 1 just uses look up the table
O(n+m) << O(mn) m,n = 100 to 200
FASTA - Complexity # Step 3 and 4
// compute the MAX Weight Path
Let r be the number of sub-alignments. (r = 10)
Let s be the number of edges
Positive Weight
O(rs) < O(m*n)
-5
-3
-3
Max Weight Path
FASTA - Complexity # Step 5
// compute partial D.P.
Depends on the restricted area < O(mn)
Width of this band is a
parameter
Therefore, FASTA is faster than DP
FASTA

Approach




Recap
Derived from logic of the dot matrix method
View sequences as sequences of short words (k-tuple)
 DNA - 6 bases, protein - 1 or 2 amino acids
Start from nearby regions of exact matching words
Motivation





Good alignments should contain many exact matches
Hashing can find exact matches in O(n) time
Diagonals can be formed from exact matches quickly
Look only at matches near longest diagonals
Apply more precise alignment to small search space at end
FASTA
Recap
Outline




Dynamic Programming
Dot Matrix Plot
FASTA
BLAST
BLAST

Approach (Basic Local Alignment Search Tool)





View sequences as sequences of short words (k-tuple)
 DNA - 11 bases, protein - 3 amino acids
Create hash table of neighborhood (closely-matching) words
Use statistics to set threshold for “closeness”
Start from exact matches to neighborhood words
Motivation




Good alignments should contain many close matches
Statistics can determine which matches are significant
 Make much more sense than % identity
Hashing can find matches in O(n) time
Extending matches in both directions finds alignment
 Yields high-score /maximal segment pairs (HSP / MSP)
BLAST - Heuristic Another Heuristic algorithm
 Heuristic but evaluating the result statistically.
Homologous sequence are likely to contain a short high
scoring word pair, a hit.
BLAST tries to extend it on the both sides to get optimal
sequence.

Sequence
A T T A G …………….
Short high score Word
BLAST - Algorithm 
Neighborhood Word
Step 1: Preprocessing Query
Compile the short-hit scoring word list from query.
The length of query word,w, is 3 for brosom scoring
Threshold T is 13
BLAST - Algorithm 
Step 2: Create neighborhood words for each query word
Query Word
Neighborhood words
BLAST Word Matching
MEAAVKEEISVEDEAVDKNI
MEA
EAA
AAV
Break query
AVK
VKE
into words:
KEE
EEI
EIS
ISV
...
Break database
sequences
into words:
BLAST - Algorithm 
Step 3: Scan DB
Hash Table
Query: LAALLNKCKTPQGQRLVNQWIKQPLMD
Hash Table
Word list
BLAST - Algorithm 
Step 3: Scanning DB
For each words list, identify all exact matches with DB sequences
Neighborhood
Query Word
Sequences in DB
Word list
Sequence 1
Sequence 2
Step 1
Step 2
The purpose of Step 1 and 2 is as same as FASTA

Step 4: Extend the alignment



Given two strings S1 and S2, a segment pair is a pair of equallength substrings of S1 and S2
A Maximal Segment Pair (MSP) in S1 and S2, is a segment pair
with the maximum score over all segment pairs in S1 and S2
A High-score Segment Pair (HSP) in S1 and S2, is a segment
pair which does not increase its score while either extending or
shortening its length

Also called a local maximal segment pair (LMSP)
Seq_XYZ:
Query:
HVTGRSAF_FSYYGYGCYCGLGTGKGLPVDATDRCCWA
QSVFDYIYYGCYCGWGLG_GK__PRDA
E-val=10-13
•Use two word matches as anchors to build an alignment
between the query and a database sequence.
•Then score the alignment.
BLAST – Algorithm 
Step 5:

Align best segments using dynamic programming

Evaluate the alignment statistically

E-value = the number of HSPs having score S (or higher) expected to
occur only by chance.




E(S) ≈ K × N ×M × e -S

N, M denote the sequence lengths

K and  are parameters
Smaller E-value, more significant in statistics
Bigger E-value , by chance
E-value Range




E < 10-100 → very low, homologs or identical genes
E < 10-3 → moderate, may be related genes
E > 1 → high, probably / may be unrelated
0.5 < E < 1 → ??? In the “twilight zone” Try detailed search
BLAST (Ungapped)
BLAST - Running Time Alignment algorithms
• O(n2) – dynamic programming, dot matrix
• O(n) – FASTA, BLAST

Running Time
The length of Query
DB size
: 153
: 5997 sequences
Algorithm
DP
FASTA
BLAST
PC : Pentium 4
By Dr. Takeshi Kawabata
Nara Sentan Gijyutu University
Running Time
16.989 [s]
0.618 [s]
0.118 [s]
Comparison of Algorithm

Dynamic Programming

most sensitive result


DP uses all information of two sequences
Running time is slow

DP compute the useless area for computing the optimal
sequence.
Comparison of Algorithm

FASTA

Less sensitive than DP and BLAST
FASTA uses partial information to speed up the
computation.
 FASTA does not evaluate the result statistically.


Running time is faster DP

the same reason as the above.
Comparison of Algorithms

BLAST

Sensitive than FASTA


BLAST evaluate the result statistically
Faster than FASTA

Because BLAST evaluate the entire DB with the same
threshold based on statistics. BLAST eliminate noises and
reduces the running time.
FASTA vs BLAST
BLAST
Compare the query and sequences in DB
with the same threshold.
DB
Query
DB
FASTA
compare the query and a sequence one by one
And compare the each result.
Outline




Dynamic Programming
Dot Matrix Plot
FASTA
BLAST