Heuristic Alignment Algorithms

Download Report

Transcript Heuristic Alignment Algorithms

Heuristic Alignment
Algorithms
Hongchao Li
Jan. 27 2004
Introduction
•
Heuristic Alignment Algorithms
1. BLAST (Basic Local Alignment Search Tool)
2. FASTA
•
Problem
1. Local Alignment: looking for the best alignment
between subsequences of two sequences.
2. Problem Model: finding high scoring local alignments
between a query sequence and a target database.
(see next page)
Introduction
AGCTTCAG...CTGA
...
query sequence
TGCTACAG...
Target database
Data
The problem model of BLAST and FASTA
• Heuristic Thought
True match alignments are very likely to contain
somewhere within them a short stretch of identities or
very high scoring matches.
BLAST
• Basic Idea
We look initially for such short stretches and use
them as ‘seeds’, from which to extend out in search of a
good longer alignment.
By keeping the seed segments short, it is possible to
pre-process the query sequence to make a table of all
possible seeds with their corresponding start points.
BLAST
•
Method
1. Make a list of all ‘neighborhood words’ of a fixed
length w (by default 3 for protein sequences, and 11
for nucleic acids), that would match the query
sequence somewhere with score higher than some
threshold T.
2. Scan through the database
3. Whenever find a word in this set, start a ‘hit
extension’ process to extend the possible match as
an ungapped alignment in both directions, stopping
at the maximum scoring extension .
BLAST
•
Example
–
–
–
Query Sequence: AGTACT
Target Database: TACTGAACTTGC
w =4, identity score =5, mismatch score= -4, Threshold T=11
Solution:
1. Make a list of all neighborhood words
Query Sequence
AGTA
GTAC
TACT
Neighborhood
Words
*GTA
A*TA
AG*A
AGT*
*TAC
G*AC
GT*C
GTA*
*ACT
T*CT
TA*T
TACT
BLAST
2. Scan through the database
TACTGAACTTGC
20
11
11
3. Hit extension
TACTGAACTTGC
20
16
16
Result
FASTA
•
Basic idea
It uses a multi-step approach to finding local high scoring
alignments, starting from exact short word matches, through
maximal scoring ungapped extensions, to finally identify gapped
alignments.
•
Method
1. Use a lookup table to locate all identically matching words of
length ktup between the two sequences. Then look for
diagonals with many mutually supporting word matches.
This operations, for example, can be done by sorting the
matches on the difference of indices (i - j).
FASTA
2. Extend the exact word matches in the best
diagonals to find maximal scoring ungapped regions
(and in the process possibly joining together several
seed matches).
3. Checks to see if any of these ungapped regions can
be joined by a gapped region, allowing for gap costs.
4. The highest scoring candidate matches in a
database search are realigned using the fully
dynamic programming algorithm, but restricted to a
sub-region of the dynamic programming matrix
forming a band around the candidate heuristic match.
FASTA
• Example
Thanks!