LecturesPart09

Download Report

Transcript LecturesPart09

Computational Biology, Part 9
Efficient database searching
methods
Robert F. Murphy
Copyright  1996, 1999, 2001.
All rights reserved.
Efficient database searching
methods
Dynamic programming requires order N2L
computations (where N is size of the query
sequence and L is the size of the database)
 Given size of databases, more efficient
methods needed

“Hit and extend” sequence
searching
Problem: Too many calculations “wasted”
by comparing regions that have nothing in
common
 Initial insight: Regions that are similar
between two sequences are likely to share
short stretches that are identical
 Basic method: Look for similar regions only
near short stretches that match exactly

“Hit and extend” sequence
searching
We define a word size that is the minimum
number of exact “letter” matches that must
occur before we do any further comparison
or alignment
 How do we find all of the occurences of
matching words between a sequence and a
database?

 Could
scan sequence a word at a time, but this
is order L (size of database)
Word searching - hashing

Solution: Use a precomputed table that lists
where in the database each possible word
occurs
 Generation
of the table is of order L (size of
database) but use of the table is of order N (size
of query sequence)

The computer science term for this
approach is hashing
Hashing

(Demonstration A9)
Database searching using words
References



W. J. Wilbur and D. J. Lipman. Rapid similarity
searches of nucleic acid and protein data banks.
Proc. Natl. Acad. Sci. U.S.A. 80:726-730 (1983)
D. J. Lipman and W. R. Pearson. Rapid and
sensitive protein similarity searches. Science
227:1435-1441 (1985) [FASTP]
W. R. Pearson and D. J. Lipman. Improved tools
for biological sequence comparison. Proc. Natl.
Acad. Sci. U.S.A. 85:2444-2448 (1988) [FASTA]
FASTA
Heavily used for searching databases until
advent of BLAST (see below)
 Inputs

k
(word or ktuple) size
 similarity matrix

Compares query sequence pairwise with
each sequence in the database
FASTA method
1. Find diagonals (paired pieces from each
sequence without gaps) that have the
highest density of common words
2. Rescore these using a scoring (similarity)
matrix and trim ends that do not contribute
to the highest score
 Result:
partial alignments without gaps
 Reported as the “init1” score
FASTA method
3. Join regions together, including penalties
for gaps
 Result:
unoptimized alignment with gaps
 Reported as the “initn” score
4. Use dynamic programming in a band 32
residues wide around the best “initn” score
 Result:
optimized alignment with gaps
 Reported as the “opt” score
Comments on FASTA

Larger ktuple increases speed since fewer
“hits” are found but it also decreases
sensitivity for finding similar but not
identical sequences since exact matches of
this length are required
Limitations of FASTA

FASTA can miss significant similarity since
 For
proteins, similar sequences do not have to
share identical residues
 Asp-Lys-Val is quite similar to Glu-Arg-Ile yet
it is missed even with ktuple size of 1 since no
amino acid matches
 Gly-Asp-Gly-Lys-Gly
is quite similar to
Gly-Glu-Gly-Arg-Gly but there is match with
ktuple size of 2
Limitations of FASTA

FASTA can miss significant similarity since
 For
nucleic acids, due to codon “wobble”, DNA
sequences may look like XXyXXyXXy where
X’s are conserved and y’s are not
 GGuUCuACgAAg and GGcUCcACaAAA
both code for the same peptide sequence (Gly-SerThr-Lys) but they don’t match with ktuple size of 3
or higher
BLAST (Basic Local Alignment
Search Tool)
Goal: find sequences from database similar
to query sequence
 Previous tools use either

 direct,
theoretically sound but computationally
slow approach to examine all possible
alignments of query with database (dynamic
programming)
 indirect, heuristic but computationally fast
approach to find similar sequences by first
finding identical stretches (FASTP, FASTA)
BLAST (Basic Local Alignment
Search Tool)
BLAST combines best of both by using
theoretically sound method which searches
for similar sequences directly but
computationally fast
 Reference

 S.
F. Altschul, W. Gish, W. Miller, E. W.
Myers and D. J. Lipman. Basic Local
Alignment Search Tool. J. Mol. Biol. 215:403410 (1990)
Global vs. Local Algorithms

We distinguish
 Global
similar algorithms which optimize
overall alignment between two sequences
(dynamic programming)
 Local similar algorithms which see only
relatively conserved pieces of sequence
(FASTA, BLAST)
BLAST basics
Need similarity measure, as in dynamic
programming - use PAM-120 for proteins
 Define maximal segment pair (MSP) to be
the highest scoring pair of identical length
segments chosen from 2 sequences (in
FASTA terms, highest init1 diagonal)

BLAST basics

Define a segment pair to be locally maximal
if its score cannot be improved either by
extending or by shortening both segments
BLAST basics
Approach: find segment pairs by first
finding word pairs that score above a
threshold, i.e., find word pairs of fixed
length w with a score of at least T
 Key concept: Seems similar to FASTA, but
we are searching for words which score
above T rather than that match exactly

BLAST method for proteins
1. Compile a list of words which give a score
above T when paired with the query
sequence.
 Example
using PAM-120 for query sequence
ACDE (w=4, T=17):
A C D E
ACDE = +3 +9 +5 +5 = 22
 try all possibilities:
AAAA = +3 -3 0 0 = 0 no good
AAAC = +3 -3 0 -7 = -7 no good
 ...too slow, try directed change
Generating word list
A C D E
ACDE = +3 +9 +5 +5 = 22
 change 1st pos. to all acceptable substitutions
gCDE = 1 9 5 5 = 20 ok (=pCDE,sCDE,
tCDE)
nCDE = 0 9 5 5 = 19 ok (=dCDE,eCDE,
nCDE,vCDE)
iCDE = -1 9 5 5 = 18 ok (=qCDE)
kCDE = -2 9 5 5 = 17 ok (=mCDE)
 change 2nd pos.: can't - all alternatives negative and
the other three positions only add up to 13
 change 3rd pos. in combination with first position
gCnE = 1 9 2 5 = 17 ok
 continue - use recursion
Generating word list

For "best" values of w and T there are
typically about 50 words in the list for every
residue in the query sequence
BLAST method for proteins
2. Scan the database for hits with the compiled
list of words. Two approaches:
 Use
index of all possible words (for w=4, need
array of size 204=160,000. Can compress this
index using pointers to save space.
 Use finite state machine (actually used)
 Calculate
a state transition table that tells what state
to go to based on the next character in the sequence
3. Extend hits to form segment pairs
BLAST Method for DNA
1. Make list of all contiguous w-mers in the
query sequence (often w=12)
 2. Compress database by packing 4
nucleotides into a single byte (use auxiliary
table to tell you where sequences start and
stop within the compressed database) -doesn't allow for unspecified bases
(wildcards)

BLAST Method for DNA
3. Compress the w-mers from the query
sequence the same way.
 4. Search the compressed database for
matches with the compressed w-mers

 Since
all frames of the query sequence are
considered separately, any match of length
w>=11 must contain a match of length 8 that
lies on a byte boundary of one of the w-mers
from the query sequence. Thus can scan a
(packed) byte at a time, improving speed 4-fold
over comparing one nucleotide at a time.
BLAST Method for DNA

Problem: if query sequence has a stretch of
unusual base composition (e.g., A-T rich)
or a repeated sequence element (e.g., Alu
sequence) there will be many hits with
"uninteresting" regions.
BLAST Method for DNA

Solution:
 During
compression of the database, tabulate
frequencies of all 8-tuples.
 Make a list of those occurring very frequently
(more frequently than expected by chance).
 Remove these words from the query list of wmers before searching database.
 Remove words matching a sublibrary of
repeated sequences (but report the matches to
that sublibrary when done).
BLAST Statistical significance
A key to the utility of BLAST is the ability
to calculate expected probabilities of
occurrence of Maximum Segment Pairs
(MSPs) given w and T
 This allows BLAST to rank matching
sequences in order of “significance” and to
cut off listings at a user-specified
probability

Summary of Database Search
Methods
Description
Authors (Program)
Needleman & Wunsch full alignment
match ktuple - form
Wilbur & Lipman
diag - NW
ktuple - diag - rescore
Lipman & Pearson
(FASTP)
FASTP - join diagsPearson & Lipman
NW
(FASTA)
Altschul et al (BLAST) word match list statistics