Transcript March2
BLAST - Basic Local Alignment Search Tool
Published in 1990 by Altschul, Gish, Miller, Myers, and Lipman
Originally for ungapped local comparison of sequences. It has
since been expanded to involve comparisons of gapped
sequences.
There have been several extensions of the technique and
improvements to the basic tool throughout the 17 years of its life
thus far.
Needleman-Wunsch, SemiGlobal Alignment, and Smith-Waterman
assume we know which two sequences we need to compare.
BLAST is designed to do a database search for possible matching
sequences:
1. There is no known starting point to begin the matches
2. There is not a well established format for the information
stored in the data
3. It is like searching for a file in a cluttered office – see
Professor Leinbach’s or Professor James’ offices for
reference.
The amazing thing is that BLAST has been so successful!
Consider this sequence
gtcaaatgaaaggagtttctacatttatgtcggaaatgctggaaacagcttctatattaa
We want to search for possible matches to gain clues to its identity
1. Place a sliding window 11 or 12 nucleotides long over the sequence
gtcaaatgaaaggagtttctacatttatgtcggaaatgctggaaacagcttctatattaa
2. Extract that subsequence and compress it to 3 bytes
Code a as 002 c as 012 g as 102 and t as 112
Thus, the 11 characters take up 22 bits – 3 bytes with two bits unused
3. Using a Finite State Automaton, eliminate subsequences that occur with
a very high frequency in the database
4. If the subsequence survives, i.e. is determined to be relatively rare, use
hash table to locate sequences in the database that contain that
subsequence.
5. Extend the match in both directions scoring the extension until the
match drops below a predetermined threshold. If it survives for the
length of the original sequence – report the result
6. Slide the window down one character and repeat steps 2 - 6
The above is only an approximate description of the BLAST algorithm.
Here is one of the BLAST results for our sequence:
Score = 44.1 bits (22), Expect = 0.035 Identities = 34/38 (89%), Gaps = 0/38 (0%)
Strand=Plus/Plus
Query 12 GGAGTTTCTACATTTATGTCGGAAATGCTGGAAACAGC 49
||||| || ||||||||| | |||||||||||||||||
Sbjct 5030 GGAGTGTCAACATTTATGGCTGAAATGCTGGAAACAGC 5067
With a report of:
gi|71835970|gb|DQ117988.1| Physcomitrella patens DNA mismatch
repair protein MSH2 gene
The result along with 68 others was reported in about 30 seconds of
searching on a Friday afternoon before a holiday. The search involved
over 3 million subject sequences accounting for over 16 billion
characters!
Let’s try to mimic what may have happened to get this alignment. (This is
all hypothetical for illustrative purposes only.)
0. The user determines a scoring scheme, the sliding window size, the
window cutoff score and a threshold for the extension. Let’s choose
score: match =1
mismatch = -1
window size = 11
window cutoff score = 9
extension threshold = 4 below present maximum
minimum reportable sequence length = 25
1. BLAST has compiled a preliminary list of “words” that are considered
as significant. One of the words is ATGCTGGAAAC.
2. This sequence is found in the MSH2 Gene of the moss. So the match
begins.
Query:
ATGCTGGAAAC
Moss:
ATGCTGGAAAC
3. We now begin the extension to the right and our score of 11
advances to a score of 14 (our present maximum) since we have
three more matches
Query: ATGCTGGAAACAGC
Moss: ATGCTGGAAACAGC
4. We continue the extension and have 3 successive mismatches, 1
match, and then two more mismatches. The score is now four
below the previous maximum of 14
Query: ATGCTGGAAACAGCTGACTA
Moss: ATGCTGGAAACAGCCACCAG
We prune this part of the sequence
5. We now repeat the extension process going to the left and obtain
the reported sequence of 27 characters
Query
12 GGAGTTTCTACATTTATGTCGGAAATGCTGGAAACAGC 49
| | | | | | | | | | | | || || | | | | | | | | | || | | | | | | | |
Moss 5030 GGAGTGTCAACATTTATGGCTGAAATGCTGGAAACAGC 5067
Let’s look at a real BLAST report using the blast2seq
program on the NCBI BLAST page.
For this example we compared pig -globin and Human
Hemoglobin, beta mRNA.
BLAST2Res.html
Using accession numbers: 28302128(Human) and
X86791.1(pig)
Note the parameters that have been set by the user (or default) prior to
beginning the search.
For protein sequences the window is 3 - 4 amino acids long:
1 Amino Acid = 3 Nucleotides
3 Amino Acids = 9 Nucleotides = 18 Bits coded
4 Amino Acids = 12 Nucleotides = 3 Bytes coded
Alternative: One byte for each of Amino Acid Letters then a window of 3
Amino Acids would take 3 Bytes.
We will choose a window length of 3 Bytes for our example.
The BLAST algorithm has three main parts
1. Compile a list of pairwise alignments with the window called
word pairs.
2. Scan this list for word pairs that meet some threshold score, T
3. Extend the word pairs that surpass some cutoff score S at which
point those hits will be reported to the user. Scores are
calculated by some scoring matrix, say BLOWSUM62.
Let’s apply these rules to the following sequences:
Bovine RBP
1
61
121
hmsatakgrv rllnnwdvca dmvgtftdte dpakfkmkyw gvasflqkgn ddhwiidtdy
etfavqyscr llnldgtcad sysfvfardp sgfspevqki vrqrqeelcl arqyrliphn
gycdgksern il
Human RBP
1 mkwvwallll aawaaaerdc rvssfrvken fdkarfsgtw yamakkdpeg lflqdnivae
61 fsvdetgqms atakgrvrll nnwdvcadmv gtftdtedpa kfkmkywgva sflqkg
These are rather short sequences and the matching subsequences are easy for us
to find. However, the computer has a narrow field of vision and needs to
follow the Algotithm.
A Retinol Binding Protein
BLOSUM62
Step 1:
First calculate the score for matching word in the sliding window with itself:
msa = (5 + 4 + 4) = 13
Set a threshold (this is arbitrary), say T = 9 now match all possible word pairs
with the word in the window
Words > T
Words < T
msc = (5 + 4 + 9) = 17
msw = (5 + 4 – 3) = 6
mst = (5 + 4 + 0) = 9
msf = (5 + 4 – 2) = 7
mss = (5 + 4 + 1) = 10
mva = (5 – 3 + 4) = 6
mta = (5 + 5 + 4) = 14
mia = (5 – 3 + 4) = 6
mga = (5 + 0 + 4) = 9
dsa = (-3 + 4 + 4) = 5
esa = (1 + 4 + 4) = 9
hsa = (-2 + 4 + 4) = 6
This is only a partial list. There are 8,000 possible word pairs
We test the subject sequence for any occurrence of the words in the word pair
that are above the threshold and are considered to be significant, i.e. occurring
in the database with low frequency.
Next we (arbitrarily) set a cut off score for our extension, say S = 15. If the
local alignment falls below this score, we stop aligning the sequences.
hmsatakgrvrllnnwdvcadmvgtftdtedpakfkmkywgvasflqkgnddhwiidtdy
fsvdetgqmsatakgrvrllnnwdvcadmvgtftdtedpakfkmkywgvasflqkg
The box encloses our original match. We extend the match to the right and we
come to the end of the second sequence (Human RBP) so we must stop. We
now extend the original match to the left and the first sequence (Bovine RBP)
ends with a mismatch. So, we do not include it in the report.
Here is the report from the actual NCBI BLAST 2 Sequences program:
Score = 104 bits (259), Expect = 1e-21 Identities = 48/48 (100%),
Positives = 48/48 (100%), Gaps = 0/48 (0%)
Query
69
MSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQKG
116
MSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQKG
Sbjct
2
MSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQKG
49
The Effect of the Threshold Parameter
In the original papers on BLAST by Steven Altschul, David Lipman et. al.
in 1990 and 1997 they called the threshold parameter T. In the NCBI
BLAST program, this threshold is called f.
Jonathan Pevsner (Bioinformatics and Functional Genomics, Wiley-Liss,
2003, p 102) reports on the effect of using different threshold values for
performing a gapped BLAST search using RBP4 (NP_006735) as the
Query sequence
f = 11 (default)
Number of sequences in db
Number of hits to the db
Number of extensions
Number of successful exten.
Number of HSPs gapped
f=5
f = 17
1,046,476
1,046,476
1,046,476
129,839,417
2,200,945,350
12,002,487
5,198,652
589,935,555
61,838
8,377
13,145
1,117
145
146
93
The FASTA
Heuristic Search Algorithm
FASTA
• Developed by Pearson & Lipman in 1988
• Part of a family of algorithms known as FASTX Algorithms
• Compares Query to every sequence in the database
• Generally takes more time than BLAST
• Uses a predetermined scoring matrix
Nucleotides: Identity, Cantor-Jukes, Kimura
Proteins: PAM250, Blosum62, Blosum50
• More Sensitive than BLAST especially when sequences are
repetitive
• Can do gapped comparisons
Algorithmic Details
• Break the algorithm into words (similar to the sliding
window in BLAST, although these are smaller)
sometimes called a ktup (for k-tuple)
Generally, 1 – 2 Amino Acids for proteins
4 – 6 Nucleotides for DNA
Larger word sizes are usually less sensitive.
• Gap Penalties (set by user – these are only suggestions)
Opening or origination: -12 for proteins
-16 for DNA
Extension or Length:
-2 for proteins
-4 for DNA
Step1 – Break Query Sequence Into Words
Our Example Sequence:
RKTURKRKTU
RK
KT
TU
UR
RK
KR
RK
KT
TU
These words will be
placed within a table for
comparison with the
sequence from the
database. Initially score
based on the number of
identities
Compare
RKTURKRKTU with ARKURWKTUR
RK
AR
KT
TU
UR RK
KR
RK
KT
TU
Hash Table
Key
Address
RK
RK
1, 5, 7
KU
KT
2, 8
UR
TU
3, 9
RW
UR
4
WK
KR
6
KT
TU
UR
Compare
RKTURKRKTU with ARKURWKTUR
RK
KT
TU
UR RK
KR
RK
KT
TU
AR
Hash Table
Key
Address
RK
1, 5, 7
KT
2, 8
TU
3, 9
RW
UR
4
WK
KR
6
RK
*
*
*
KU
UR
KT
TU
UR
*
*
*
*
*
*
FASTA – Steps 2, 3, & 4
2. In the complete query sequence locate the 10 regions having
the highest score with the database sequence. Rescore those
regions with a PAM or Blosum matrix. If necessary, trim the
ends of the region if the result is a higher score. Call this score
init1
3. Try to join some of the regions from step 2 even with gaps
inserted if it makes a better overall score. Call this score an
initn score.
4. Use the full Smith – Waterman Algorithm on a narrow band,
usually chosen to be 32 residues wide around the best matches
from step 3. This is called the opt score.
Graphical Display of the
Four Step FASTA Process
Step 1
Step 2
Step 3
Step 4
Some Problems with FASTA
1. A sequence with 50% similarity could be missed
AKAKAKAKAKAKAKAKAKAK
No Word Matches
ARARARARARARARARARAR
2. The join in step 4 restricts the band to a width 32. Suppose
that 2 proteins are identical except that one has a gap of length
greater than 32 in the middle. FASTA will only pick up on one
side of the gap.
3. FASTA picks up on only perfect matches and ignores
conservative replacement in proteins, i.e. it may miss
sequences with functional homology, but not identity.
NOTE: using a smaller word size can help with 1 and 3.
The Following Graphic Compares The Three Local
Alignment Algorithms
Speed is Graphed vs. Quality of the Results
A FASTA search engine is located at the BioChem Department
server at the University of Virginia and can be reached through
the URL
http://fasta.bioch.virginia.edu/fasta
This site allows the user to choose the database against which
the query sequence will be run. In fact, the user can choose
multiple databases!
The user can then set the following parameters:
ktup length
scoring matrix
gap origination
gap exension
This sight can also do a FASTA 2 sequence comparison
Here is the result of that search (whale hemoglobin vs hippo
hemoglobin :
We also compare the whale hemoglobin with the same subunit
of the pig hemoglobin