Hidden Markov Models - Iowa State University

Database Searches
BLAST
• Basic Local Alignment Search Tool
– Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol.
215 (1990)
– Altschul, Madden, Schaffer, Zhang, Zhang, Miller,
Lipman, Nucleic Acids Res. 25 (1997)
• Main ideas:
– Increase search speed by finding fewer, but
better, hot spots during initial screening phase
– Use longer word sizes
– Integrate scoring matrix into first phase
• Compare with FASTA, which requires exact matches
BLAST Terminology
• Segment pair: equal-length substrings of sequences
S1 and S2
• Locally maximal segment pair: segment pair whose
alignment score cannot be improved by extending or
shortening it
• Maximum segment pair (MSP) = segment pair with
maximum score over all segment pairs in the
sequences S1 and S2
• High-scoring segment pair (HSP): A segment pair
with score higher than some cutoff score, s.
• w is the length parameter; t is the threshold
parameter
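To make the definitions above concrete, here is a minimal sketch that finds a maximum segment pair by scanning every diagonal of the comparison matrix. The +1/-1 scoring and the function names are assumptions made for illustration; BLAST itself uses a substitution matrix and never searches exhaustively like this.

```python
def match_score(a, b):
    # Toy scoring assumed for illustration: +1 match, -1 mismatch.
    return 1 if a == b else -1


def maximum_segment_pair(s1, s2, score=match_score):
    """Return (best_score, start1, start2, length) of the best ungapped segment pair."""
    best = (0, 0, 0, 0)
    # An ungapped segment pair lies on a single diagonal d = j - i.
    for d in range(-(len(s1) - 1), len(s2)):
        i, j = (-d, 0) if d < 0 else (0, d)
        running, start, k = 0, 0, 0
        while i + k < len(s1) and j + k < len(s2):
            if running <= 0:
                # A non-positive prefix can never help: restart the segment here.
                running, start = 0, k
            running += score(s1[i + k], s2[j + k])
            if running > best[0]:
                best = (running, i + start, j + start, k - start + 1)
            k += 1
    return best


def hsp_exists(s1, s2, cutoff):
    # Some HSP exists exactly when the MSP score exceeds the cutoff s.
    return maximum_segment_pair(s1, s2)[0] > cutoff
```

For example, `maximum_segment_pair("TTACGTAA", "GGACGTCC")` reports the shared `ACGT` segment starting at position 2 in both sequences.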
BLAST: Hits
• A hit is a w-length word in the database that aligns
with a word from the query sequence with score > t
• BLAST looks for hits instead of exact matches
– Allows word size to be kept high for speed, without
sacrificing sensitivity
• Typically, w = 3-5 for amino acids, ~11-12 for DNA
• t is the most critical parameter:
– ↑t ⇒ ↓ “background” hits (faster)
– ↓t ⇒ ↑ ability to detect more distant relationships (at the cost
of increased noise)
Hits
• For each word, evaluate score of match
(exact or not) according to BLOSUM62
– E.g., for PQG, score is 7+5+6 = 18
• There are 20^w possible w-length words,
but considering only those with score > t
greatly reduces the number of matches
– E.g., there are 20^3 = 8000 possible matches
to PQG, but only 50 achieve score > t = 13
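The PQG example can be reproduced with a short sketch that enumerates all 20^3 candidate words and keeps those scoring above t. Only the three BLOSUM62 rows needed for P, Q and G are included, and they were transcribed by hand for this illustration, so check them against an authoritative copy of the matrix before reuse. The function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for P, Q and G only (hand-transcribed; verify before reuse).
# Column order follows AMINO_ACIDS.
BLOSUM62_ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}


def word_score(query_word, candidate):
    """Sum of per-position substitution scores for two equal-length words."""
    return sum(
        BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)]
        for q, c in zip(query_word, candidate)
    )


def neighborhood(query_word, t):
    """All words of the same length that score more than t against query_word."""
    words = ("".join(p) for p in product(AMINO_ACIDS, repeat=len(query_word)))
    return [w for w in words if word_score(query_word, w) > t]


hits = neighborhood("PQG", t=13)
print(word_score("PQG", "PQG"))  # 7 + 5 + 6
print(len(hits), "of", 20 ** 3, "words pass the threshold")
```

Real implementations precompute these neighborhoods for every query word and store them in a lookup structure, rather than enumerating them at scan time.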
Extending a hit
• After locating a hit, BLAST attempts to
extend the hit in both directions, until the
score drops more than X below the
maximum score yet attained.
• Extension step typically accounts for >
90% of execution time.
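The extension rule can be sketched as follows: walk outward from the hit, track the best score seen so far, and stop once the running score falls more than X below it. The +1/-1 scoring and all names here are illustrative assumptions, not BLAST's own code.

```python
def match_score(a, b):
    # Toy scoring assumed for illustration: +1 match, -1 mismatch.
    return 1 if a == b else -1


def _extend(pairs, x_drop, score):
    """Walk along residue pairs; return (best gain, number of pairs kept)."""
    best, best_len, running = 0, 0, 0
    for n, (a, b) in enumerate(pairs, start=1):
        running += score(a, b)
        if running > best:
            best, best_len = running, n
        elif best - running > x_drop:
            break  # score fell more than X below the best seen so far
    return best, best_len


def extend_hit(query, subject, q_pos, s_pos, w, x_drop, score=match_score):
    """Ungapped extension of a w-length hit; returns (score, q_start, q_end)."""
    seed = sum(score(a, b) for a, b in zip(query[q_pos:q_pos + w],
                                           subject[s_pos:s_pos + w]))
    right = zip(query[q_pos + w:], subject[s_pos + w:])
    left = zip(reversed(query[:q_pos]), reversed(subject[:s_pos]))
    r_gain, r_len = _extend(right, x_drop, score)
    l_gain, l_len = _extend(left, x_drop, score)
    # The reported segment is trimmed back to where the maximum was reached.
    return seed + l_gain + r_gain, q_pos - l_len, q_pos + w + r_len
```

With `extend_hit("GGGACGTACGTCCC", "TTTACGTACGTAAA", 5, 5, w=3, x_drop=2)`, the 3-letter hit `GTA` grows to the full shared segment `ACGTACGT`.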
Improvement: 2-hit method
• Do extensions only when there are two hits on
the same diagonal within some distance A of
each other (e.g., A = 40)
• Reduces sensitivity (ability to detect
distantly related sequences)
– To compensate, use lower t value (e.g., 11 rather
than 13)
• Since we only extend when there are two
nearby hits, many fewer regions are extended
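A minimal sketch of the two-hit trigger: hits are bucketed by diagonal (subject position minus query position), and an extension is requested only when a second, non-overlapping hit appears on the same diagonal within distance A. The data layout and names are assumptions made for illustration, and the handling of overlapping hits is a simplification.

```python
def two_hit_triggers(hits, w, max_dist):
    """hits: iterable of (query_pos, subject_pos) word hits.

    Returns (diagonal, first_query_pos, second_query_pos) for each pair of
    non-overlapping hits on one diagonal that lie within max_dist (A).
    """
    last_on_diag = {}
    triggers = []
    for q_pos, s_pos in sorted(hits):
        diag = s_pos - q_pos
        prev = last_on_diag.get(diag)
        if prev is not None:
            dist = q_pos - prev
            if dist < w:
                continue  # overlaps the previous hit: ignore it
            if dist <= max_dist:
                triggers.append((diag, prev, q_pos))
        last_on_diag[diag] = q_pos
    return triggers
```

For instance, `two_hit_triggers([(0, 10), (20, 30), (5, 50), (100, 110)], w=3, max_dist=40)` requests a single extension on diagonal 10, because the third hit on that diagonal is too far from the second.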
Gapped BLAST
• Allows local alignments with indels (similar to
FASTA)
• Local alignments from different diagonals are
merged into a single alignment: a first local
alignment, followed by some indels, followed by a
second local alignment, etc.
– equivalent to a path through the dynamic
programming matrix composed of alternating
diagonal sections and paths connecting them
Gapped BLAST
• Original BLAST implicitly handled gaps by finding
several distinct HSPs and calculating a statistical
assessment of the combined result
– Two or more HSPs each below the cutoff value might in
combination rise to statistical significance
• Gapped BLAST extends hits by allowing gaps when
they are promising (score exceeds s_g):
– Advantage: We can afford to miss some HSPs as long as at
least one is found
• Use dynamic programming, starting from the center of
each high-scoring region if s > s_g
– s_g is chosen such that gapped alignment is triggered in about
1/50 of the sequences compared
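The trigger logic can be sketched as below. Note the substitution: instead of BLAST's seeded dynamic programming that starts from the center of the high-scoring region, this example uses a plain Smith-Waterman local alignment with a linear gap penalty as a stand-in for the gapped step. Scoring values and function names are assumptions.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0,
                          prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                          prev[j] + gap,       # a[i-1] aligned to a gap
                          curr[j - 1] + gap)   # b[j-1] aligned to a gap
            best = max(best, curr[j])
        prev = curr
    return best


def maybe_gapped_score(a, b, hsp_score, s_g):
    """Run the costly gapped step only when the ungapped score exceeds s_g."""
    if hsp_score > s_g:
        return local_alignment_score(a, b)
    return None
```

Gating the dynamic programming on the threshold is what keeps the expensive step rare: most sequence pairs never reach it.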
PSI-BLAST
• Position-Specific Iterated BLAST
• Generates a multiple alignment from
statistically significant alignments produced
by BLAST
• Produces a position-specific score matrix
(PSSM)
– Can search the database using the PSSM
– Match sequences to profile
– Generate new profiles
– Repeat (iteration)
– Search gradually extends to increasingly divergent
sequences
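As a rough illustration of the profile step, the sketch below builds a log-odds PSSM from a gap-free multiple alignment and scores a segment against it. It assumes a uniform background frequency of 1/20 and a flat pseudocount; PSI-BLAST itself uses sequence weighting and more careful frequency estimates, so treat this as a simplified stand-in with hypothetical function names.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Log-odds PSSM from equal-length, gap-free aligned sequences."""
    n_seqs = len(alignment)
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = n_seqs + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_against_pssm(pssm, segment):
    """Score a segment of the same length as the profile."""
    return sum(col[aa] for col, aa in zip(pssm, segment))
```

One iteration then amounts to searching with the current profile, adding the significant matches to the alignment, and calling `build_pssm` again.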
Flavors of BLAST
• BLASTP - protein query against protein DB
• BLASTN - DNA/RNA query against GenBank
(DNA)
• BLASTX - 6-frame translated DNA query against
protein DB
• TBLASTN - protein query against 6-frame
translation of GenBank
• TBLASTX - 6-frame translated DNA query against
6-frame translation of GenBank
• PSI-BLAST - protein ‘profile’ query against
protein DB
• PHI-BLAST - protein pattern against protein DB
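The 6-frame translation used by BLASTX, TBLASTN and TBLASTX can be sketched as follows: translate the sequence at three offsets on the forward strand and three on the reverse complement. The code assumes the standard genetic code and uses hypothetical function names; it is not how the BLAST programs implement the step.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the usual TCAG-ordered table.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return the three forward and three reverse-complement translations."""
    dna = dna.upper().replace("U", "T")
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (dna, reverse)
            for offset in range(3)]
```

Each of the six protein strings can then be compared against a protein database (BLASTX), or a protein query can be compared against the six strings derived from each database entry (TBLASTN).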