Net BLAST - Microsoft Research
Bioinformatics at USDA-ARS
Livestock Issues Research Unit
Scot E. Dowd, Joaquin Zaragoza
Mel Oliver and Paxton Payton
Projects
• Future: Interactive neural network based models
to describe and predict gene expression in
Livestock and Pathogens
• Present: Various Projects Various States
Leading to the Future
– Molecular Modeling
– Gene Finding
– Distributed BLAST
– Whole Genome Comparison
– Functional Genomics and pathways
– Pathway or system targeted Microarray design
Functional Genomics
• Functional Genomics/Gene Ontology: controlled
vocabulary
• Define, annotate, categorize, and describe large
genetic datasets (e.g. EST, mRNA)
• We have developed a custom curated database
for functional domain BLAST (regular BLAST and RPS-BLAST
using KOG, COG, Pfam, HMMER, and SMART domains)
• Ultimately this will become a comprehensive .NET
suite of analyses for microarray design, from new
sequence all the way to result visualization.
Ontology
• Annotation: propagation of error in
definitions
• Ca
BLAST: need for speed (II)
• We are working with roughly 5,000-100,000
queries against 1 GB databases
• One query takes a fairly fast PC 3 minutes to
complete
– dual 3.2 GHz Xeon
– 6 GB RAM
– RAID 0 SCSI-320 HD
• Other methods: MPI-BLAST, WU-BLAST,
threaded BLAST, SGE-BLAST, commercial
Turbo BLAST, DNAstar, etc.
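To put these figures in perspective, here is a rough back-of-the-envelope sketch in Python (not the C# of the tool itself). It assumes every query costs the quoted 3 minutes and that queries run strictly one after another on a single machine; the function name is illustrative only.

```python
MINUTES_PER_QUERY = 3  # figure quoted above for one fairly fast PC


def serial_days(num_queries, minutes_per_query=MINUTES_PER_QUERY):
    """Days needed to run every query back to back on one machine."""
    return num_queries * minutes_per_query / (60 * 24)


# Lower and upper ends of the workload range quoted above.
for n in (5_000, 100_000):
    print(f"{n:>7} queries: {serial_days(n):6.1f} days")
```

Under these assumptions the small end of the range already takes more than a week on one PC, which is the motivation for distributing the work.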
BLAST ALGORITHM
Cgtcgctcgctgtaagtac – query, e.g. a 1,000-letter word
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990)
Basic local alignment search tool. Journal of Molecular Biology 215, 403-410.
• Which database sequence is most similar to my query?
• Databases: one of ours is 60 GB worth of letters
• BLAST generates statistics based upon similarity and substitution
probabilities. In simplest form, purine to purine scores better than purine to
pyrimidine
• Slide along a 4 GB database, find a word match, and try to extend it
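The "find a word match and try to extend" idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the actual BLAST implementation: it only extends to the right, allows no gaps, and the `word_size`, `match`, `mismatch`, and `x_drop` values are assumptions chosen for the example.

```python
def find_seeds(query, subject, word_size=8):
    """Yield (query_pos, subject_pos) for every exact word match."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            yield i, j


def extend_right(query, subject, i, j, match=1, mismatch=-3, x_drop=5):
    """Ungapped extension to the right of (i, j).

    Returns (best_score, length_at_best_score). Extension stops once the
    running score falls more than x_drop below the best score seen.
    """
    score = best = best_len = k = 0
    while i + k < len(query) and j + k < len(subject):
        score += match if query[i + k] == subject[j + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break
    return best, best_len


query = "ACGTACGT"
subject = "TTACGTACGTTT"  # hypothetical database sequence
seeds = list(find_seeds(query, subject))
best = [extend_right(query, subject, i, j) for i, j in seeds]
```

Indexing the query words once and then scanning the database is what makes "sliding along" a multi-gigabyte database feasible at all; the real program additionally scores substitutions and computes statistics for each hit.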
• BLASTX as an example: translate the query into 6
reading frames, then search the database with
these 6 sequences using a word size of 3.
• Time to BLAST
– Up to a point, the decrease in time correlates with the
number of slaves available
– Average test machines (2.4 GHz/1 GB
RAM/SATA-150)
– e.g. 90 seq/13 CPU/3 min vs
90 seq/1 CPU/38.5 min (350 MB db, Gigabit LAN)
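The timing example above can be turned into the usual speedup and parallel-efficiency figures. The short Python sketch below uses only the numbers quoted on the slide; since "3 min" is presumably rounded, the results are approximate.

```python
def speedup(serial_minutes, parallel_minutes):
    """How many times faster the distributed run was."""
    return serial_minutes / parallel_minutes


def efficiency(serial_minutes, parallel_minutes, cpus):
    """Speedup divided by CPU count; 1.0 would be perfect scaling."""
    return speedup(serial_minutes, parallel_minutes) / cpus


# 90 sequences: 38.5 min on 1 CPU vs 3 min on 13 CPUs.
s = speedup(38.5, 3.0)
e = efficiency(38.5, 3.0, 13)
print(f"speedup {s:.1f}x, efficiency {e:.0%}")
```

For this small test the scaling is close to linear, consistent with the "up to a point" remark: each query is independent, so little is lost until something shared, such as the network, saturates.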
.NET Distributed BLAST
• Take advantage of unused laboratory compute
resources
• Provide an easy, powerful tool for distributing
BLAST
• Target environment
– Windows LAN
• Current Open Source Distributed BLAST
Applications
– Require a server-class master or a version of UNIX
– Difficult to set up, configure databases, compile and
submit jobs.
– No large job fault tolerance
W.ND BLAST: A Bioinformatician
promoting Windows?
• .NET C#
• First tests: Condor, MPI, a ported remote shell
• Contractor
• Project Manager
• Database formatter
• Worker machines
• Job leasing
• Output processing HT backend apps
• Gotta GUI
• Database formatter
Functionality
• Network bandwidth would eventually become the
limiting factor
• Fault tolerant to worker failure
• Resumes upon reboot if the Contractor fails
• No statistical problems with search results
• Complete BLAST database on each
worker node if resources allow
• Easy to install, a breeze to use
.NET Distributed BLAST
• Queue at each node
– The Contractor allows a maximum of two query sequences in
each node’s queue
– Ensures the application waits a minimal amount of time between
completion of one job and the start of the next
• Thread per node
– Makes use of .NET asynchronous delegates (AD) – scalability
???
– The thread invokes BLAST on the remote node
– Upon completion, the remote node sends a “finished” message to the
Contractor
– The Contractor collects the results and performs a validity check
– Once results are verified, the remote worker starts BLAST on its queued
sequence and the Contractor prepares and allocates a future job
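The queue-per-node and thread-per-node design can be sketched as follows. This is a minimal Python illustration, not the .NET implementation: Python threads stand in for asynchronous delegates, `run_blast_stub` and `is_valid` are hypothetical placeholders for the remote BLAST call and the validity check, and the node names are made up.

```python
import collections
import queue
import threading
import time

QUEUE_DEPTH = 2  # at most two query sequences waiting per node


def run_blast_stub(node, seq):
    """Hypothetical stand-in for invoking BLAST on a remote node."""
    return f"{node}:{seq}:hit"


def is_valid(result):
    """Hypothetical validity check on a returned result."""
    return result.endswith(":hit")


def node_thread(node, node_queue, results, lock):
    """One thread per node: run queued sequences until a None sentinel."""
    while True:
        seq = node_queue.get()
        if seq is None:
            break
        result = run_blast_stub(node, seq)
        if is_valid(result):
            with lock:
                results[seq] = result


def contractor(sequences, nodes):
    """Keep every node's queue topped up, then collect all results."""
    results, lock = {}, threading.Lock()
    queues = {n: queue.Queue(maxsize=QUEUE_DEPTH) for n in nodes}
    threads = [threading.Thread(target=node_thread,
                                args=(n, queues[n], results, lock))
               for n in nodes]
    for t in threads:
        t.start()
    pending = collections.deque(sequences)
    while pending:
        placed = False
        for n in nodes:
            if not pending:
                break
            try:
                queues[n].put_nowait(pending[0])
            except queue.Full:
                continue  # this node already has two jobs waiting
            pending.popleft()
            placed = True
        if not placed:
            time.sleep(0.001)  # every queue is full; wait briefly
    for n in nodes:
        queues[n].put(None)  # tell each node thread to stop
    for t in threads:
        t.join()
    return results


results = contractor([f"seq{i}" for i in range(10)],
                     ["nodeA", "nodeB", "nodeC"])
```

Keeping a short queue on each node means a worker always has its next sequence ready when it finishes, while most of the job list stays with the Contractor and can still be handed to whichever node frees up first.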
.NET Distributed BLAST
• Fault tolerance, revisited
– Task migration handled through application-level checkpointing
– If a worker encounters a fault or crashes,
the Contractor redirects the failed node's sequence to another worker
node.
– Minimal loss of time
• Integrating QoS functionality (currently in the works)
– Decrease priority when the workstation is in use, based upon
a remote system call checking CPU %, memory, etc.
– GUI allows increasing or decreasing priority – rev gauges and
throttles
– Storage requirement limitations: redirect the query to another
database source (working within the 10-connection limitation in XP
Pro)
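The worker-failure handling described above can be sketched at whole-query granularity. This is a minimal Python illustration under stated assumptions, not the tool's actual checkpointing code: `WorkerFault`, `dispatch`, `flaky_run`, and the worker names are all hypothetical, and a faulted worker is simply retired rather than retried.

```python
from collections import deque


class WorkerFault(Exception):
    """Raised by the stub when a worker crashes mid-job."""


def dispatch(sequences, workers, run_job):
    """Hand out sequences in turn; on a fault, requeue the sequence
    and drop the failed worker from the rotation."""
    pending = deque(sequences)
    alive = deque(workers)
    done = {}
    while pending:
        if not alive:
            raise RuntimeError("no workers left")
        seq = pending.popleft()
        worker = alive.popleft()
        try:
            done[seq] = run_job(worker, seq)
            alive.append(worker)  # healthy worker rejoins the rotation
        except WorkerFault:
            pending.appendleft(seq)  # next worker picks this sequence up
    return done


def flaky_run(worker, seq):
    """Stub job runner in which one machine always fails."""
    if worker == "pc2":
        raise WorkerFault(worker)
    return f"{seq} done on {worker}"


done = dispatch(["q1", "q2", "q3"], ["pc1", "pc2", "pc3"], flaky_run)
```

Because each query sequence is an independent unit of work, only the sequence that was running on the failed node has to be repeated, which is why the loss of time stays small.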
Future Directions
• Quality of Service
– Allow Contractor to set priority for application
• Contractor Fault Tolerance
• Large Network Optimization
– Sub Contractors
• Asynch Del. Thread limit- ewww kewl
WEB SERVICE!
• Shadow (Sub) Contractors- network load
balance
The End!
Questions?
Suggestions?
Advice?
Even Criticism?