BLAST: Basic Local Alignment Search Tool Meng Cao, Arthur Lee

Download Report

Transcript BLAST: Basic Local Alignment Search Tool Meng Cao, Arthur Lee

BLAST: Basic Local
Alignment Search Tool
Meng Cao, Arthur Lee, Peiying Li, Matt Prorok, Owen Astrachan
Genome Revolution Focus, Duke University 2007
Introduction
Flowchart of Steps
BLAST, which stands for basic local alignment search tool, is a
heuristic algorithm that is used to find similar sequences of amino
acids or nucleotides. BLAST approximates the dynamic
programming algorithm more directly than its predecessors.
Dynamic programming is an optimization process for solving a
problem. In this approach, the user finds the best decision for a
subproblem and bases that decision from the best decision from the
previous subproblem. Unfortunately, this method would have
exceptional computational requirements, thus the use of the heuristic
algorithms.
In order to find similar sequences, BLAST first finds the highest
scoring pair of sequences, or the maximal segment pair (MSP), and
in one version of BLAST, matching DNA nucleotides are given a
score of +5, while mismatches are assigned -4 (as outlined in the
flowchart). In order to more directly approximate dynamic
programming, BLAST chooses the boundaries of the MSP to
maximize its score by either extending or shortening the two
segments that are compared. Because molecular biologists are more
likely to be interested in all the conserved regions, not just the most
conserved region, BLAST returns all MSP’s that score above a
cutoff. While BLAST may be faster than the dynamic programming
algorithm, it is a heuristic algorithm, and because it sacrifices
accuracy for speed, BLAST can sometimes make mistakes.
v
Left: Table of some of the
versions of BLAST (see
Reference 4).
Below: Screenshot of
BLAST homepage (see
Reference 3).
Query is compiled to form a list of
length w substrings called w-mers.
Search database for “hits”. A list of matches shows up.
Then search for an exact match between any substring
on the w-mers list and the database sequence.
Each substring is extended locally in both directions
until the score of the substring no longer improves.
For every matched pair of nucleotides, add 5 to the
total score. For every mismatched pair of nucleotides,
called a “gap”, subtract 4 from the total score.
Select a moderate score Sg to which the calculated
score is compared. Sg indicates if there are too many
gaps to make the sequence a likely match to the
substring and query sequences.
If calculated score is less
than Sg, then discard this
database sequence from
list of “hits.”
If calculated score is
equal to or greater than
Sg, keep this sequence in
the list of “hits.”
Find alignment (either
gapped or ungapped)
with the max score.
APT
Shortcomings
Problem: Max Blast Sequence
Though BLAST represents a huge advancement in the ability to
compare DNA, it is not without its shortcomings. The basic premise
behind the algorithm is that it searches for segments of DNA that are
likely to be the most similar, rather than comparing each individual
section with every other one. This innovation increases the speed
with which DNA can be searched, but is not perfect. It is possible
that BLAST will return data that is off. This result has been shown
empirically by using BLAST to analyze gene sequences. Koski and
Golding report that, in E. coli, in 27% of cases, BLAST returned hits
that were not from E. coli’s nearest phylogenetic neighbor, with 7%
of cases returning a hit from a different domain of life. However, as
BLAST is refined and its DNA database becomes larger, the
accuracy should improve. Nonetheless, it is important to emphasize
that the closest BLAST hit is based on the computer algorithm and
thus, merely implies biological similarity.
Problem Statement: Given a string named strand of DNA
and a database represented as an array of strings, return the
element from the database that has the highest BLAST score
compared to strand. In this problem all strings in the database
will have the same length as strand; BLAST assigns +5 when
calculating a match score when nucleotides from each strand
with the same index match and -4 if the nucleotides with the
same index in the two strands do not match. If there is a tie,
return the string that occurs first in the database.
Definition:
Class: MaxBlast
Method: max
Parameters: String strand, String [] datab
Returns: String
Method Signature:
public String max(String strand, String[] datab)
Class:
public class MaxBlast {
public String max(String strand, String[] datab)
// fill in code here
}
}
Constraints:
- Every string in datab is the same length as strand
- All strings contain at most 50 characters, each
character is either 'g', 'a', 't', or 'c'.
- datab contains at most 50 strings.
Examples:
1) strands = {“aaa”, “ggg”, “ccc”, “ttt”}
string = “ggg”
Is the max alignment
score statistically
significant?
Returns “ggg” because a string from the database that's
the same has the maximal score possible.
2) strands = {“agtc”, “agtt”, “agta”, “aggg”}
If no, discard
database
sequence.
If yes, database
sequence will be
displayed as output.
string = “agtg”
Returns “agtc” because each string from datab has a
score of 5+5+5-4 = 11, but the string "agtc" occurs first.
More information on this APT can be found at
http://www.cs.duke.edu/csed/algoprobs/4g07blast2.html
Conclusion
BLAST is currently one of the most popular bioinformatics search
programs. The algorithm’s major emphasis on speed appeals greatly
to many researchers who are aiming to solve complex problems.
BLAST also supplies statistical significance and other analytical
techniques involved in computer science. Biological problems that
BLAST can help answer deal with DNA and protein sequences. A
researcher can use a BLAST search to find and compare gene or
protein sequences between organisms and look for similarities.
Similar sequences can then be used to describe biological
relationships and to give further insight on how systems work.
References
1.
2.
3.
4.
5.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Journal of
Molecular Biology, 215, 403-410.
Korf, I., Yandell, M., & Bedell, J. (2003). BLAST. Cambridge: O’Reilly.
National Center for Biotechnology Information. (n.d.). BLAST: Basic Local Alignment
Search Tool. Retrieved October 24, 2007, from http://www.ncbi.nlm.nih.gov/BLAST/.
Sotiriades, E., & Dollas, A. (2007). A General Reconfigurable Architecture for the
BLAST Algorithm. Journal of VLSI Signal Processing, 48, 189–208. Retrieved October
16, 2007, from http://www.springerlink.com/content/p2175ql615589u22/fulltext.pdf.
University of Texas at El Paso. (n.d.). Basic Local Alignment Search Tool (BLAST).
Retrieved October 15, 2007 from
http://www.math.utep.edu/Faculty/mleung/teaching/m4370s04/slides/blast.pdf.
Contact information can be found at http://www.duke.edu/~mc136