Posterx - Duke Computer Science

Download Report

Transcript Posterx - Duke Computer Science

Jeff Shen, Morgan Kearse, Jeff Shi, Yang Ding, & Owen Astrachan
Genome Revolution Focus 2007, Duke University, Durham, North Carolina 27708
Figure 4. The table below shows an example of a scoring
table/matrix that the BLAST algorithm might use to store
the comparisons between certain segments.
Introduction
BLAST (Basic Local Alignment Search Tool) is a widely used
bioinformatics search tool used to compare different DNA samples for
their similarities. Researchers can use this search tool to compare their
own DNA samples to all the DNA and protein sequences in various
genebanks and libraries.
BLAST takes a heuristic approach to compare the different sequences,
which dramatically increases the speed of searches. The program scans
at approximately 2 x 106 bases/s. The increase in speed has made a
lasting impact in the fields of bioinformatics and computer science. In the
past, searches that would have taken days to finish now can be done in
mere
seconds.
Application
Because of its speed, BLAST has become a very popular bioinformatics
search tool. BLAST has been cited by over twenty thousand scientific
journals whose authors have used BLAST to compare different DNA
sequences or whole genomes for similarities.
For example, researchers in the Cold Spring Harbor Lab used an
enhanced version of BLAST, BLATZ, to find the similarity between the
human genome and the mouse genome.
Using BLATZ, they concluded that 39.154% of the human sequence
aligned to mouse sequence. Also, other organisms such as drosphilia
(fruit fly) have been compared with the human genome with BLAST.
Another area that BLAST is prevalent is in the field of protein studies. Not
only can researchers use BLAST for comparing DNA sequences, but they
can also use the program to find similarities between protein sequences.
Figure 1.
This image
shows how target
sequences are
matched up
during the BLAST
process for
comparison. The
lines between the
sequences
represent
matches in those
segments.
Problem Statement
You are writing code to find which of several DNA strands in a given DNA
library have similarities to a given query strand….
(Full problem statement of APT available at
http://www.cs.duke.edu/csed/algoprobs/4g07blast1.html)
…Return an array of library strands each of which contains the query
strand.
Class
public class Blast
{
public String[] findAll(String query, String[] library){
// fill in code here
}
}
Method
User inputs a target query sequence (a so-called w-mer
of length w) into BLAST that is to be compared
User also can specify a value W for which matches under
W will be ignored, based on the user’s preference for
accuracy and speed
3 Phases of BLAST search
1st Phase: The w-mer is then searched against a
sequence database of billions of base pairs (length W or
higher) that have been previously organized and find
exact matches.
2nd Phase: These matching sequences are then
extended from both sides and any further matches with
the target sequence are tallied up – in this phase,
insertions/deletions are ignored in terms of score.
3rd Phase: High-scoring alignments from these
processes are compared and an optimized measurement
of similarity and other statistics are returned to the user
in this final algorithmic phase. In this process segments
of all possible lengths are compared.
BLAST has become an indispensable bioinformatics tool in the fields of
biology, engineering, and biochemistry.
BLAST in the future…
As an example, there are companies such as Korilog that have made
software (KoriBLAST) that use the BLAST system along with other
programs to create software solutions to make it easier for labs and
researchers in areas of data integration, visualization and management.
Their goal is to provide the means for state-of-the-art graphical
environments for quick and easy research. The software program is
dedicated to making the BLAST program very useful by doing sequence
data mining.
APT
Constraints
• query contains at most 30 characters.
• Each string in library contains at most 50 characters.
• There are at most 50 Strings in library.
Example
query: "ATCG"
library = { "ATC", "TATC, "ATCATC", "GATCATC", "ATCGATG",
"GATATCG" }
Returns: {"ATCGATG", "GATATCG"} The other strands in library do not
contain "ATCG", only the last two strands of library contain "ATCG".
Conclusions
Figure 2. In this image we are given an
example of what an individual search might
look like using BLAST. The target sequence
above is PQG (w, length=3 letters) and a
threshold is set at 13 (the W mentioned in
the method). The neighborhoods words
stands for the words in the database that
the query word or target is compared too.
To the right of the neighborhood word is a
score that represents the match
equivalence. This model stops taking score
after 13 to optimize the match. The
segment that is thought to be the best
match is returned along with the
source/subject of where it came from.
BLAST (Basic local alignment search tool) is a bioinformatics tool that
became a marginal aid to many researchers to help in quickly comparing
their own DNA samples. It is popular because of its speed even though it
does allow certain errors. It allows navigation by letting the user
determine the number/length of sequence to compare. After this they can
also set the threshold to decrease the number of matches. From the use
of this search tool it is important to see what could arise from this tool.
Already there are other search tools such as BEAUTY. BEAUTY
database search tool is very similar to BLAST but more advanced. This
is due to the fact that BEAUTY which stands for BLAST enhancement
alignment utility. The Beauty tool works by incorporating conserved
regions and functional domains proteins sequences into the BLAST
program to make it more specific. So as time goes on we will most likely
see an increase in programs such as BEAUTY.
Literature cited
Figure 3. In the image above we are shown how the
score is created. It is the sum of all the matches and
mismatched amino acids minus the sum of the number of
gaps. It also shows as previously stated that the score
returned is the max/optimized score.
The cartoon above states the early description of DNA and the double
helix in 1953. It is most alarming and interesting to see how in 54
years we have come so far as to understand such complex algorithms
as BLAST that help us to know much more beyond the basic structure
of DNA.
http://www.acm.org/crossroads/wikifiles/13-1-CE/13-1-11-CE.html
http://www.mrc-lmb.cam.ac.uk/genomes/madanm/articles/antim.htm
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Alignment_Scores2.html
http://www.goroadachi.com/etemenanki/blog-05feb.htm
http://www.cs.duke.edu/courses/cps004g/fall06/papers/duke/altschuletal1990.pdf
http://www.korilog.com/products/improving-NCBI-BLAST.php