Transcript PPT
TurboBLAST: A Parallel
Implementation of BLAST Built on
the TurboHub
Bin Gan
CMSC 838 Presentation
Motivation
NCBI BLAST on a single processor has become too
costly , inefficient, and time-consuming.
Sequence database are exploding in size.
Growing at an exponential rate
Exceeds the rate of increase in hardware capabilities ( Moores’
Law)
Thrashing and buffer management
Goals
Faster results for life science laboratories
Do not change the BLAST algorithm
Avoid using costly multiprocessor machines
Cheap alternatives of clusters of machines
CMSC 838T – Presentation
Talk Overview
Overview of talk
Motivation
Techniques
Database partition
Use the sequential BLAST
Merge results
TurboHub infrastructure
Evaluation
3 test runs and analysis
Related work
Powerblast
Paracel’s BLAST Machine
mpiBLAST
Words of Bill Pearson ( auther of FASTA)
Observations
CMSC 838T – Presentation
Techniques
Approach
Main intuition
Implementation
Clients, master, and workers
TurboHub System
Load balance
Fault recovery
Dynamic database partitioning
Binary tree analogy
CMSC 838T – Presentation
Techniques
Approach
Split databases instead of query sequences in binary tree fashion
Algorithms to decide how to split with the goal of balance overhead &
load
Each processor runs complete sequential BLAST using database
subsets
Merge the result into XML format
Adjust BLAST statistics for database sizes
TurboHub provide backend support for scheduling, fault recovery, etc.
Main Intuitions
Divide and Conquer
BLAST compares target sequence with each sequence in the database
individually
Very little communication is needed, and the communication is not
order dependant
Easy to achieve parallelism by splitting the database and assembling
the result
CMSC 838T – Presentation
Techniques
Implementations
3 tier system
Client
End user submitting job to the system
Master
Java application accepts the job
Sets up for processing
Uses TurboHub
Manage task execution
Coordinate the workers
Support dynamic change in set of workers, fault tolerances,
etc.
Workers
CMSC 838T – Presentation
Techniques
Implementations Cont.
Workers
Has a local copy of NCBI blastall
Partition the database so that the resulting portion can fit
into available physical memory
Initial task group of 10-20 sequences against all the
databases to avoid startup cost
Some worker process will merge the results
Parse the output (store as XML format)
Adjust BLAST statistics for database size
Scheduling using Piranha models
not talked in paper, but very important
CMSC 838T – Presentation
Techniques
TurboHub System
Developed by Scientific Computing Associates
Capabilities
Pipelining
Component Replication
Parallel Components in combination with tools from SCA,
MPI, PVM, OpenMP
Application in this topic
Worker is a wrapped-up blastall components
Component scheduling
Fault recovery
CMSC 838T – Presentation
Techniques
Task/Database Splitting
2 options
Large Task
Advantage
Disadvantage
Maximize resource utilization
Minimize task startup overhead
Load imbalance
Limit the performance gain
Small Task
Advantage and disadvantage are reverse of the above
CMSC 838T – Presentation
Techniques
Task/Database Splitting cont.
The paper’s intermediate approach
Create large initial task by experience
Communication and program startup are trivial for at least
10-20 input query sequences with 256M memory
If the task is too large, split the databases
For multiple databases, create roughly half of databases in
each sub database
For single database, split the database by half
Uses virtual shared memory
The actual database files are never sent to a worker until it
actually requires them
CMSC 838T – Presentation
Techniques
Database Splitting
Split using NCBI database formatting program formatdb
Analogy of binary tree
All the combined leaves are the database
The portion of the database to access depends on which node
the worker has decided to be at
Uses all leaves under the chosen node
Advantage:
Flexibility
Deliver exact amount of data as needed
Single copy of database
CMSC 838T – Presentation
Evaluation
Experimental environment for test one
Input data sets: 50 Expressed Sequence Tags (ESTs)
Database used:
Drosophila (1,170 sequences, 123 million nucleotides),
GSS Division of GENBANK (1.27 million sequences, 651
million nucleotides)
E Coli (400 sequences, 4.6 million nucleotides)
A group of 500 Mhz PIII with 512K cache, 256M Memory,100Mb
Ethernet
Performance result for test one
Serial version: 2131.8 second (wall clock time)
Parallel version with 11 workers: 130.0 second. (Speedup =
16)
CMSC 838T – Presentation
Evaluation
Experimental environment for test two
Input data sets: Chromosomes 1, 2, 4 from the Arabidopsis
Genome
Database used:
Swiss-Prot Protein database (12.8 Million peptides)
A group of 500 Mhz PIII with 512K cache, 256M Memory,100Mb
Ethernet
Performance result for test two
Serial version: 5 Days 19 hours and 13 minutes
Parallel version with 11 workers: 12 hours, 54 minutes.
(Speedup = 10.8)
CMSC 838T – Presentation
Evaluation
Experimental environment for test three
Input data sets: 500 mouse ESTs with 200-400 nucleotides
each
Database used:
An NT database from NCBI (1,681,522,266 nucleotides)
IBM linux cluster of 8 dual processor workstation
Each workstation contains 2 996 PIII’s with 2 G memory, 100
Mbit ethernet
Performance result for test three
Serial version: 4945 second
Parallel version with 8 workstations(16 workers): 357.03
second. (Speedup = 13.85)
CMSC 838T – Presentation
Evaluation Analysis
Memory size vs. database size
Thrashing avoidance for superlinear speedups
Single query at a time
Single query at each node
Overhead
Need to combine results
TurboHub overhead
Database transmission overhead
CMSC 838T – Presentation
Related Work
Other parallel BLAST
Blackstone's PowerBLAST (part of PowerCloud)
Automate the splitting of query databases into smaller
chunks
Spread out over the cluster nodes' local disks for querying
Automates the merging of BLAST results
Use disk caching and scheduling techniques to speed up
future queries of the same datasets
Paracel's BLAST Machine
Paracel actually got inside BLAST and parallelized the code
Post impressive speed up numbers and the statistics
Same as an unaltered BLAST query
mpiBLAST
Splits the database across each node in the cluster, so it
can usually reside in the buffer-cache
CMSC 838T – Presentation
Related Work
Words of Bill Pearson (FASTA) in response to why
there are no MPI or PVM parallelized versions of BLAST
Note: Paracel’s types of parallelization
It is too fast and there is not much demand for it
95% of the time, BLAST is almost an in-memory grep
Sequence comparison is embarrassingly parallel, and very
easily threaded
Distributing the sequence databases and collecting results has
more overhead
FASTA is 5 - 10X slower than BLAST
Smith-Waterman is 5-20X slower than FASTA
The communications overhead is low, and distributed systems
work OK for FASTA, and great for Smith-Waterman
CMSC 838T – Presentation
Observations
Observation
Efficient due to the parallelism embedded in the BLAST
algorithm
Different database splitting techniques
Feasible in practice (in computing power, user effort, etc…)
Similar result to previous work
Improvement
Due to the requirement of not changing code on BLAST,
superlinear speedup is only possible if existing thrashing is
avoided.
Larger memory and cache size
Better load balancing technique
Overhead reduction, flexibility vs performance
CMSC 838T – Presentation