blast

Transcript blast

Database Searching for Similar
Sequences



Search a sequence database for sequences that
are similar to a query sequence
provide a list of database sequences with which
the query sequence can be aligned well
Key issue:
efficiency
Database Searching for Similar
Sequences Methods

Smith-Waterman requires order N2L computations

Popular database searching methods (heuristic methods)



FASTA [Pearson & Lipman, 1988]
BLAST [Altschul et al., 1990]
Tradeoffs of using the heuristic fast method

Accuracy (Sensitivity and Selectivity)
FASTA

Problem with Smith-Waterman algorithm: Too many
calculations “wasted” by comparing regions that have
nothing in common

Initial insight: Regions that are similar between two
sequences are likely to share short stretches that are
identical

Basic method: Look for similar regions only near short
stretches that match exactly --- “Hit and extend”
sequence searching
Look-up table
Diagonal Method Example
A
2 ,6,7
F
4
s=
1
2
3
4
5
6
7
8
9
10
11
H
A
R
F
Y
A
A
Q
I
V
L
2
3
4
5
6
7
8
D
M
A
A
Q
I
A
H 1
1
I
9
L
11
t= V
Q 8
+9
R
3
V
10
Y
5
-2 -3 +2 +2 -6
+2 +1
-2
+3 +2
-1
offset
…
Offset vector
-7
-6
1
-5
-4
-3
-2
-1
1 21 1
0
+1
+2
+3
1 2341 1
+4
+5
+6
+7
+8
+9
1
+10
Limitations of FASTA

FASTA can miss significant similarity since:

For nucleic acids, due to codon “wobble”, DNA sequences may
look like XXy where X’s are conserved and y’s are not
 GGuUCuACgAAg and GGcUCcACaAAA both code for the
same peptide sequence (Gly-Ser-Thr-Lys) but they don’t match with ktuple size of 3 or higher

For proteins, similar sequences do not have to share identical
residues
 Gly-Asp-Gly-Lys-Gly is quite similar to Gly-Glu-Gly-Arg-Gly
but there is no match with k-tuple of size 2

Score ?
Asp-Lys-Val is quite similar to Glu-Arg-Ile yet it is missed even
with k-tuple size of 1
Score ?
Ala-Ala-Ala-Ala-Ala vs Ala-Ala-Ala-Ala-Ala
Score ?
BLAST

What does BLAST stand for?

Basic Local Alignment Search Tool
BLAST

BLAST is similar to FASTA but it searches for words
which score above T rather than that match exactly.
It is also faster because its implementation has been
optimized to work with parallel UNIX architecture
from an early stage.

Reference

S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J.
Lipman. Basic Local Alignment Search Tool. J. Mol. Biol.
215:403-410 (1990)
BLAST basics

BLAST is mainly a 3-step algorithm:
 Compile
 Search
list of high-scoring strings (words)
for hits – each hit gives a seed
 Extend
seeds to obtain segment pairs
BLAST

For protein sequences, the list of high-scoring words
consists of all words with w characters that scores at
least T with some word in the query sequence (w = 3 or
4 for protein search, 11 or 12 for nucleotide sequences).

Search for “hits” using a hash table or a finite state
machine.

Key concept: Searching for words which score above
T rather than that match exactly
BLAST method for proteins
1. Compile a list of words which give a score above T
when paired with the query sequence.

Example using PAM-120 for query sequence ACDE (w=4,
T=17):
A C D E
ACDE = +3 +9 +5 +5 = 22

try all possibilities:
AAAA = +3 -3
AAAC = +3 -3

0 0 = 0
0 -7 = -7
no good
no good
...too slow, try directed change
Generating word list
A C D E
ACDE = +3 +9 +5 +5 = 22

change 1st pos. to all acceptable substitutions
gCDE =
nCDE =
1
0
9
9
5
5
iCDE = -1
kCDE = -2
9
9
5
5
5 = 20 ok (=pCDE,sCDE,tCDE)
5 = 19 ok (=dCDE,eCDE,
nCDE,vCDE)
5 = 18 ok (=qCDE)
5 = 17 ok (=mCDE)

change 2nd pos.: can't - all alternatives negative and the other three
positions only add up to 13

change 3rd pos. in combination with first position
gCnE =

1
9
2
5 = 17 ok
continue - use recursion
BLAST method for proteins
2. Scan the database for hits with the compiled list
of words.

Use finite state machine (actually used)

Calculate a state transition table that tells what state to go
to based on the next character in the sequence
3. Extend hits in both directions to form segment
pairs (without allowing gaps)
BLAST method for proteins

Example of a finite state machine for string
matching: (input alphabet: a,b,c)
Word: ababaca
a
a
a
a
0 a 1 b 2 a 3 b 4 a 5 c 6 a 7
b
b
Database sequence: bcabccaaababacababacabb
exercise

Construct a finite state machine that recognize
the word:
ATG

Assuming the sequence is a nucleotide sequence
BLAST Method for DNA
1. Make list of all words of length w in the query sequence
(often w=11 or 12)
2. Compress database by packing 4 nucleotides into a
single byte (use auxiliary table to tell you where
sequences start and stop within the compressed
database) -- doesn't allow for unspecified bases
(wildcards)
BLAST Method for DNA
3. Compress the words from the query sequence the same
way.
4. Search the compressed database for matches with the
compressed words
Since all frames of the query sequence are considered separately,
any match of length w>=11 must contain a match of length 8
that lies on a byte boundary of one of the words from the
query sequence. Thus can scan a (packed) byte at a time,
improving speed 4-fold over comparing one nucleotide at a
time.
Low-Complexity Regions

Low-complexity regions are segments that
contains certain bases or amino acid more often
than one would expect in “normal” nucleotide or
protein sequences.

Problem: if query sequence has a stretch of
unusual base composition (e.g., A-T rich) or a
repeated sequence element (e.g., Alu sequence)
there will be many hits with "uninteresting"
regions.
Low-Complexity Regions

Solution :

Make a list of the words occurring very frequently
(more frequently than expected by chance).

Remove these words from the query list of words
before searching database. (The words are replaced
by strings of Xs.)
BLAST Statistical significance

A key to the utility of BLAST is the ability to
calculate expected probabilities of occurrence
of maximum segment pairs (MSPs) given w and
T

This allows BLAST to rank matching sequences
in order of “significance” and to cut off listings
at a user-specified probability
Choosing Values for w and T

Trade-off: sensitivity vs. running-time

Choosing a value for w




Small w: many matches to expand
Big w: many words to be generated
w=3/4 is a good compromise
Choosing a value for T

Small T: greater sensitivity, more matches to
expand
BLAST Notes

May fail to find optimal MSPs


May miss seeds if T is too stringent
Empirically, 10 to 50 times faster than Smith-Waterman
Basic BLAST Family

BLASTN


BLASTP


DNA (translated) to protein database
BLASTX


protein to protein database
TBLASTN


DNA to DNA database
protein to DNA database (translated)
TBLASTX

DNA (translated) to DNA database (translated)
BLAST Refinements

gapped alignments

“two-hit” method for extending word pairs

Iterate with position-specific matrix (PSIBLAST)

Pattern-hit initiated BLAST (PHI-BLAST)

blast

Transcript blast

Directory