Introduction to BLAST
Minkoo Seo
DKE Lab., Yonsei Univ.
8 October 2004
The Five BLAST Programs
Program   Database                             Query
BLASTN    Nucleotide                           Nucleotide
BLASTP    Protein                              Protein
BLASTX    Protein                              Nucleotide translated into protein
TBLASTN   Nucleotide translated into protein   Protein
TBLASTX   Nucleotide translated into protein   Nucleotide translated into protein
Traditional BLAST programs
Alignment
Search space and alignment

The Smith-Waterman
algorithm will find the
maximum scoring alignment
between two sequences.

Unlike Smith-Waterman,
BLAST doesn’t explore the
entire search space.

By restricting the search space, BLAST gains its speed, but at
the cost of some sensitivity.
FASTA


As well known as BLAST.
FASTA performs alignment by
finding runs of k consecutive exact
matches, locating the 10 best matches, and joining them.
The BLAST Algorithm
STEP 1
The sequence is optionally filtered to remove low-complexity
regions.

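This process is called masking (or filtering). Replacing residues outright,
as described below, is hard masking; soft masking instead ignores the region
only while seeding words.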

Low-complexity sequences occur much more frequently than expected
by chance in both proteins and nucleic acids.

Each low-complexity region is replaced with Xs (or Ns for nucleotide
sequences).

Note that filtering is only applied to the query sequence and not to the
database sequence.
The BLAST Algorithm (cont)
STEP 2
A list of words of length 3 in the query protein sequence is made
starting with positions 1,2, and 3; then 2,3, and 4; etc.

For example, the sequence MGQLV has words MGQ, GQL, and QLV.

The word length is 11 for DNA sequences and 3 for protein sequences,
including programs that translate DNA sequences.
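
As a minimal sketch of this step, the word list can be produced with a sliding window. The function name query_words is hypothetical and chosen only for illustration.

```python
def query_words(sequence, word_length=3):
    """Return every overlapping word of the given length, in query order."""
    return [sequence[i:i + word_length]
            for i in range(len(sequence) - word_length + 1)]


print(query_words("MGQLV"))  # ['MGQ', 'GQL', 'QLV']
```

A query shorter than the word length simply yields an empty list.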
The BLAST Algorithm (cont)
STEP 3
Using a scoring matrix (e.g., BLOSUM62), each query word is
scored against every possible combination of three amino acids.

There are a total of 20 x 20 x 20 = 8,000 possible match scores for a word.

For example, suppose that the three-letter word PQG occurs in the query
sequence. The score of a match to itself is found in the BLOSUM62
matrix as:
(P-P match) + (Q-Q match) + (G-G match) = 7 + 5 + 6 = 18
The BLAST Algorithm (cont)
STEP 4
A cutoff score called neighborhood word score threshold (T) is
selected to reduce the number of possible matches to PQG to the
most significant ones.

Note that FASTA considers only exact matches.

For example, if T is 13, only the words that score above 13 are kept.
T limits sensitivity.
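
Steps 3 and 4 can be sketched together as follows. To stay short, the sketch uses only a four-residue excerpt of BLOSUM62 (P, Q, G, E), so it enumerates 4 x 4 x 4 = 64 candidate words rather than 8,000. The names BLOSUM62_EXCERPT, word_score and neighborhood are hypothetical, and keeping words whose score is at least T (rather than strictly above it) is an assumption.

```python
from itertools import product

# Excerpt of BLOSUM62 for four residues only; the real matrix covers all 20.
BLOSUM62_EXCERPT = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6, ("E", "E"): 5,
    ("P", "Q"): -1, ("P", "G"): -2, ("P", "E"): -1,
    ("Q", "G"): -2, ("Q", "E"): 2, ("G", "E"): -2,
}


def pair_score(a, b):
    """Look up a symmetric substitution score."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]


def word_score(word, candidate):
    """Sum the position-by-position substitution scores of two words."""
    return sum(pair_score(a, b) for a, b in zip(word, candidate))


def neighborhood(word, threshold, alphabet="PQGE"):
    """Return {candidate: score} for candidates scoring at least threshold."""
    result = {}
    for letters in product(alphabet, repeat=len(word)):
        candidate = "".join(letters)
        score = word_score(word, candidate)
        if score >= threshold:  # assumption: cutoff is inclusive
            result[candidate] = score
    return result


print(word_score("PQG", "PQG"))  # 18
print(neighborhood("PQG", 13))   # {'PQG': 18, 'PEG': 15}
```

With the full 20-letter matrix the same loop would keep more neighbors; raising T shrinks the list and lowering T grows it.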
The BLAST Algorithm (cont)
STEP 5
The previous procedure is repeated for each three-letter word in
the query sequence.
The BLAST Algorithm (cont)
STEP 6
The remaining high-scoring words are organized into an efficient
search tree for comparing them rapidly to the database sequences.

Approach
1. Build a DFA (deterministic finite automaton) that recognizes all the
high-scoring words.
2. Run the database sequences through the DFA.
3. Record the hits.
This builds an index on the fly.
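
The idea of indexing the words and then streaming database sequences past the index can be sketched with a hash table. This is a simplification and not a DFA: a dictionary lookup stands in for the automaton, and only the exact query words are indexed here (neighborhood words would be added to the same table). The names build_lookup and scan are hypothetical.

```python
from collections import defaultdict


def build_lookup(query, word_length=3):
    """Map each query word to the list of query offsets where it occurs."""
    table = defaultdict(list)
    for i in range(len(query) - word_length + 1):
        table[query[i:i + word_length]].append(i)
    return table


def scan(subject, table, word_length=3):
    """Slide over a database sequence and record (query_pos, subject_pos) hits."""
    hits = []
    for j in range(len(subject) - word_length + 1):
        for i in table.get(subject[j:j + word_length], ()):
            hits.append((i, j))
    return hits


table = build_lookup("MGQLV")
print(scan("AAGQLVGQL", table))  # [(1, 2), (2, 3), (1, 6)]
```

Each hit is a pair of offsets, which is all the later seeding and extension steps need.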
The BLAST Algorithm (cont)
STEP 7
If a match is found, this match is used to seed a possible ungapped
alignment between the query and database sequences.
The BLAST Algorithm (cont)
STEP 8(a)
In the original BLAST, matching words are extended. At this point, a
larger stretch of sequence (HSP or high-scoring segment pair) may
have been found.

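For example, with MATCH=1, MISMATCH=-1, and X=5, extension continues
until the running score drops X below the maximum score seen so far (max).
MSPs (maximal segment pairs) are extended,
then trimmed back to max.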
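
A minimal sketch of ungapped extension with a drop-off, using the example values MATCH=1, MISMATCH=-1 and X=5. It extends in one direction only, and stopping once the score has fallen by X or more (rather than strictly more than X) is an assumption; the function name extend_right is hypothetical.

```python
def extend_right(query, subject, q_start, s_start,
                 match=1, mismatch=-1, x_drop=5):
    """Extend rightwards without gaps; return (best_score, best_length).

    The extension stops when the running score has fallen x_drop or more
    below the best score seen, and the result is trimmed back to that best.
    """
    score = 0
    best_score = 0
    best_length = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        same = query[q_start + i] == subject[s_start + i]
        score += match if same else mismatch
        i += 1
        if score > best_score:
            best_score, best_length = score, i
        elif best_score - score >= x_drop:
            break
    return best_score, best_length


# Eight matches, then mismatches only: trimmed back to the first eight.
print(extend_right("ACGTACGTTTTTTTT", "ACGTACGTAAAAAAA", 0, 0))  # (8, 8)
# A single mismatch in the middle is tolerated.
print(extend_right("AAAACAAAA", "AAAAGAAAA", 0, 0))              # (7, 9)
```

A full implementation would run the same loop leftwards from the seed and add the two scores.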
The BLAST Algorithm (cont)
STEP 8(b)
In gapped BLAST, called BLAST2, T is lowered in Step 4. Then,
short matched regions lying on the same diagonal within
distance A of each other are found and joined into a longer region. Once found,
these joined regions are extended using the method of Step 8(a).
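
The search for pairs of hits on the same diagonal can be sketched as below. A hit is a (query_pos, subject_pos) pair and its diagonal is subject_pos - query_pos. The default window of 40 for A, the rule that an overlapping hit is ignored, and the name two_hit_seeds are all assumptions made for illustration.

```python
def two_hit_seeds(hits, window=40, word_length=3):
    """Return hits that follow an earlier, non-overlapping hit on the same
    diagonal within `window` positions."""
    last_on_diagonal = {}
    seeds = []
    for q_pos, s_pos in sorted(hits, key=lambda h: h[1]):
        diagonal = s_pos - q_pos
        previous = last_on_diagonal.get(diagonal)
        if previous is not None:
            distance = s_pos - previous
            if distance < word_length:
                continue  # overlaps the previous hit: ignore it
            if distance <= window:
                seeds.append((q_pos, s_pos))
        last_on_diagonal[diagonal] = s_pos
    return seeds


hits = [(0, 10), (5, 15), (2, 30), (20, 30)]
print(two_hit_seeds(hits))             # [(5, 15), (20, 30)]
print(two_hit_seeds(hits, window=10))  # [(5, 15)]
```

Only the returned seeds would be passed on to extension, which is why T can be lowered without extending every single hit.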
The BLAST Algorithm (cont)
STEP 9
Determine whether each HSP score found is greater in value than
a cutoff score S.

S is determined empirically by examining the range of scores found by
comparing random sequences, and by choosing a value that is significantly
greater.
The BLAST Algorithm (cont)
STEP 10
Determines the statistical significance of each HSP score.

Sometimes, two or more HSP regions that can be made into a longer
alignment will be found.

For example,
HSP set #1: scores 65 and 40
HSP set #2: scores 52 and 45
- Poisson method: the probability of multiple scores is higher for the set
whose lowest score is higher, so set #2 ranks higher (45 is higher than 40).
- Sum-of-scores method: 65+40=105 is higher than 52+45=97, so set #1 ranks higher.
The BLAST Algorithm (cont)
STEP 11
Smith-Waterman local alignments are shown for the query
sequence with each of the matched sequences in the database.
The Gumbel Extreme Value Dist.

Extreme Value Distribution
- When two sequences have been aligned optimally, the
significance of the local alignment score can be tested against
the scores of random sequences of the same length and
same composition.
- These random alignment scores follow an extreme value
distribution.

Goal
- Evaluate the probability that a score between random or
unrelated sequences will reach the score found between two
real sequences of interest.
The Gumbel Extreme Value Dist. (cont)

Extreme Value Distribution
- Probability distribution (Eq. 17): $Y_{ev} = \exp[-x - e^{-x}]$
- Mean: Euler-Mascheroni constant, 0.57722…
- Variance: $\sigma^2 = \pi^2 / 6 \approx 1.6449$
- Probability that score S will be less than value x (Eq. 19):
$P(S < x) = \exp[-e^{-x}]$
- Probability that S is greater than or equal to value x (Eq. 20):
$P(S \ge x) = 1 - \exp[-e^{-x}]$
- Eq. 17 and Eq. 20 can be modified to accommodate extreme
values (Eq. 22): $P(S \ge x) = 1 - \exp[-e^{-\lambda (x - u)}]$
where $u$ is the mode, highest point, or characteristic value of the distribution
and $\lambda$ is the decay or scale parameter
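
The tail probability of Eq. 22 is easy to evaluate numerically. The sketch below assumes the mode u and the decay parameter λ are already known; the function name gumbel_tail and the example values are hypothetical.

```python
import math


def gumbel_tail(x, u=0.0, lam=1.0):
    """P(S >= x) = 1 - exp(-exp(-lam * (x - u))) for an extreme value
    distribution with mode u and decay parameter lam."""
    inner = math.exp(-lam * (x - u))
    # -expm1(-inner) equals 1 - exp(-inner) but keeps precision for tiny tails.
    return -math.expm1(-inner)


# With u = 0 and lam = 1 this reduces to Eq. 20.
for x in (0.0, 2.0, 5.0, 10.0):
    print(x, gumbel_tail(x))
```

At x = u the probability is 1 - 1/e, and far into the tail it approaches exp(-λ(x - u)), which is the exponential decay that the Karlin-Altschul equation relies on.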
Karlin-Altschul Statistics

Karlin-Altschul statistics (Samuel Karlin and Stephen Altschul, 1990) make
five central assumptions:
- A positive score must be possible.
- The expected score must be negative.
- The letters of the sequences are independent and identically
distributed (IID).
- The sequences are infinitely long.
- Alignments don’t contain gaps.

The first two assumptions are true for any scoring matrix estimated from
real data.

The last three assumptions are problematic because biological sequences
have context dependencies, aren’t infinitely long, and are frequently
aligned with gaps.
Karlin-Altschul Statistics (cont)

For now, though, let’s turn to the Karlin-Altschul equation.
$E = K m n e^{-\lambda S}$

This equation states that the number of alignments expected by chance (E)
during a sequence database search is a function of the size of the
search space (m*n), the normalized score (λS), and a minor constant K.

Hence, the relationship between the expected number of alignments and
the search space is linear.

The relationship between the expected number of alignments and score is
exponential. This means that small changes in score can lead to large
differences in E.
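
A short numerical sketch makes the linear and exponential dependencies concrete. The values chosen for K, λ, m and n are illustrative assumptions, not values from the slides, and the function name expected_alignments is hypothetical.

```python
import math


def expected_alignments(score, m, n, K, lam):
    """Karlin-Altschul expectation: E = K * m * n * exp(-lam * score)."""
    return K * m * n * math.exp(-lam * score)


# Illustrative parameters only (assumed for this example).
K, LAM = 0.134, 0.318
m, n = 250, 50_000_000  # query length and total database length

for s in (60, 70, 80):
    print(s, expected_alignments(s, m, n, K, LAM))

# Doubling the database doubles E; raising the score by ln(2)/lam halves it.
base = expected_alignments(70, m, n, K, LAM)
print(expected_alignments(70, m, 2 * n, K, LAM) / base)
print(expected_alignments(70 + math.log(2) / LAM, m, n, K, LAM) / base)
```

The last two lines print ratios close to 2 and 0.5, which is the linear versus exponential behavior described above.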
Sample BLAST Output
[Figure: sample BLAST output with annotations: number of matches with
positive score; scoring-matrix-independent score; scoring-matrix-dependent
score; $E = K m n e^{-\lambda S}$; number of exactly matching characters]
PSI-BLAST

Position Specific Iterated BLAST
- Scoring matrix searching
1. Identifies additional related sequences that might otherwise
be missed.
2. The difficulty with such an expanded search is that an alignment of
related sequences must already be available.
- Method of PSI-BLAST
1. Search the database with the given query using BLAST.
2. A set of related sequences is found.
3. Perform a multiple sequence alignment (MSA) on the result set.
4. Make a scoring matrix from the MSA result.
5. Search the database using the matrix.
6. Go to step 1 to find sequences similar to the result of step 5.
PSI-BLAST (cont)

Innate limitation in the Profile Searching Approach
[Figure labels: Query, Family]
There is no guarantee that the alignments
finally discovered represent the same set
of related sequences.
PSI-BLAST (cont)

Problems in the PSI-BLAST matrix
- The matrix covers the entire length of the aligned sequences,
whereas other matrices cover only a short stretch of the alignment.
- The same gap penalties are used throughout the procedure,
and there is no position-specific penalty as in other programs.
- Each subsequent alignment is based on using the query
sequence as a master template for producing a multiple
sequence alignment of the same length as the query sequence.

Thus, the MSA is a compilation of the pairwise alignments rather
than a true MSA.
PSI-BLAST (cont)

PHI-BLAST (Pattern Hit Initiated BLAST)
- Much like PSI-BLAST, except that the query sequence is first
searched for a complex pattern provided by the investigator.
- Then, the sequence database search is focused on regions
containing the pattern.

PROBE
- Similar to PSI-BLAST.
- But performs a more complex and rigorous type of data
analysis: a Bayesian statistical approach.

MAXHOM
- Matching sequences found in a database search are aligned by
dynamic programming with a query sequence, and a profile is
made from the alignment. A new round of sequences that
match the updated profile is then picked from Swiss-Prot.
References

Advanced Medical Informatics Seminar by Vanathi Gopalakrishnan, Ph.D.
http://omega.cbmi.upmc.edu/~vanathi/

Computational Molecular Biology Course By Doug Brutlag and Lee Kozar
http://cmgm.stanford.edu/biochem218/

David W. Mount, BIOINFORMATICS, COLD SPRING HARBOR
LABORATORY PRESS, 2001.

Stephen F. Altschul et al., “Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs,” Nucleic Acids Research,
25(17):3389-3402, 1997.

Stephen F. Altschul et al., “Basic Local Alignment Search Tool,” J. Mol.
Biol., 215, 403-410, 1990.

Ian Korf, Mark Yandell and Joseph Bedell, BLAST, O’REILLY, 2003.