Miller lecture slides

Download Report

Transcript Miller lecture slides

FASTA Database Sequence Searches
Quick Overview/Review
Software at version 3:
better statistical estimates, error handling, and larger variety
of scoring matrices (PAM120, BLOSUM80, PAM250,
BLOSUM50, BLOSUM62)
fasta3 : input protein sequence with protein seq DB
fastx3/fasty3 : input DNA sequence with protein seq DB
fastx (faster but frame shifts between codons)
fasty (frame shifts within codons)
tfasta3/tfastx3/tfasty3 : protein comparisons against a
DNA sequence DB.
15 Mar 2004
1
FASTA Internals
Four step procedure in performing the search
1. Lookup table method finds exact residue matches to
be used as candidates to build the final alignments
2. Regions are scored using a local alignment scoring
method (PAM, BLOSUM) to find 10 best regions
3. Take best scored sequence and attempt to join the
initial segments together
4. In a similar method to Smith-Waterman, all possible
alignments considered within a band around the
highest scoring segments
15 Mar 2004
2
FASTA Lookup Method
Lookup table is merely a list of all amino acids from
each sequence and the relative position differences
1
a
2
b
3
c
4
d
5
e
6
f
7
g
8
h
9
i
sequence 1
o
p
q
a
b
c
d
s
t
sequence 2
a
b
1
2
4
5
-3
-3
c
d
3
4
6
7
-3
-3
(Mount, p294)
15 Mar 2004
3
FASTA Lookup Method
Several residues at same offset indicate an area of
identical alignment, called a k-tuple.
V V V V V A S C D E F G Y Y Y Y Y
: .
:
.
:
:
:
Q Q Q Q Q A T C E E F G L L L L L
The k-tuple size is an important parameter in
performing FASTA searches.
k-tuple searches of size = 1 are more sensitive but slower.
k-tuple = 2 is 4x faster than size = 1.
15 Mar 2004
4
FASTA k-tuples
ktup = 3
ktup = 2
ktup = 1
15 Mar 2004
5
FASTA Diagonal Method
Find segments of the compared sequences bounded by
regions with k-tuples of size “ktup” (search parameter)
Segments from input and database sequences are of the same
length!
This allows similarity based on identical matches and allows
mismatches but doesn’t allow insertion or deletion of
residues or gaps.
accdff and abcdef are okay
acdf and abcdef are not
15 Mar 2004
6
FASTA Illustrated
Step 1
Lookup Table
Diagonal Method
Step 3
Select Best
Segments for
Joining
15 Mar 2004
Step 2
Score with
Favorite Method
Step 4
Smith-Waterman
Like Method
7
FASTA Results and Significance
Search scores based on extreme value distribution
Yev(x) = exp[ -x – exp[-x] ]
Pev(S > x) = 1 – exp[ - exp[ -L(x-u) ] ]
Values determined by fitting actual score distribution to EVD
L = 1.2825 / s
u = xavg – 0.4500 s
z = (x – m) / s
P(S > z) = 1 – exp[ - exp[ -1.2825z – 0.5772 ] ]
Chance a random sequence has score S that is greater than z and
thereby a more significant match
Remember: “z” in FASTA is really z’, z’ = 50 + 10z
15 Mar 2004
8
FASTA Results and Significance
Expectation (Z) of finding score S in D “databases”
D is number of “databases” searched (in FASTA output)
Book uses term “databases” but in reality it is the number of
whole sequences searched against.
E(S > z) = D * P(S > z)
15 Mar 2004
9
BLAST Overview
Similarly set of search tools as FASTA
BLASTP (prot-prot), BLASTN (nucl-nucl)
BLASTX (transl.nucl-prot), TBLASTN (prot-transl.nucl)
TBLASTX (transl.nucl-transl.nucl)
Version now at 2.2
Better statistical significance calculations
Gapped alignment for increased sensitivity
17 Mar 2004
10
BLAST Algorithm
The set up…
Chop up input query sequence into smaller words
- default size is 3 for protein searches, 11 for nucleic acids
Smaller words are generated in each “frame”
- 123, 234, 345, 456, 567, 678, …
Each word is scored (BLOSUM62) against sequences in
the search database(s).
- Threshold value “T” is used to discard comparisons that do not
score above the value “T”. Matches above T are “hits”.
17 Mar 2004
11
BLAST, 1 hits or 2?
Original BLAST (v1)
Each hit was extended in both directions to determine if the word was
a subset of a larger matching region
- accounted for ~ 90% of the BLAST execution time
The extensions continue until score drops below best score achieved
in short sequence matches.
New BLAST (v2)
Hits that were “HSPs” (high-scoring segment pairs) typically had
multiple hits along the same diagonal within a “short distance” of
each other.
Find another non-overlapping hit in a window of length “A”.
- If there is one, then attempt the same hit extension as in v1.
17 Mar 2004
12
BLAST, 1 hits or 2…
New 2 hit criteria reduces number of candidates to
extend but reduces sensitivity of the search
Hits that might be significant (and “remote”) are not extended.
“T” value must be decreased in order to achieve sensitivity in
original BLAST.
More “weaker” hits are included but not enough to offset gains.
Two-hit method generates 3.2 times as many hits but
only 0.14 times as many hit extensions were done.
17 Mar 2004
13
BLAST, Gapped Alignments
Weakness of BLAST v1
A single significant alignment constructed from two(or
more) less significant extended (ungapped) hits could
be discarded if one (or more) of the extended hits
weren’t found.
For example, if you changed thresholds slightly.
The hit threshold “T” had to be set low to make sure the
less significant smaller hits could be found to generate
the single, more significant alignment.
Reminder: ungapped extensions based on BLOSUM62
scoring and allows residue substitutions
17 Mar 2004
14
BLAST, Gapped Alignments
BLAST v2 Modification: Gapped Alignments
Rather than construct an alignment between HSPs,
attempt a gapped extension on any HSP that exceeds the
threshold “Sg”.
Similar to 2-hit process except “off-diagonal” using
previously discovered HSP candidates.
Look within a given distance “G” for another HSP.
Using gaps and scoring matrix, determine score of the new, larger
gapped sequence.
Keep gapped sequence if score remains above threshold “Xg”.
17 Mar 2004
15
BLAST, Are we done yet?
Final clean up
After all extensions (both gapped and ungapped) are done,
finishing touches are applied to the fragments through a
modified dynamic programming method.
All sequence matches above a specific expect score (user
specified) are reported to the user.
17 Mar 2004
16