Parallel Genomic Sequence-Searching on an Ad-Hoc Grid: Experiences, Lessons Learned, and Implications

Mark K. Gardner (Virginia Tech)
Wu-chun Feng (Virginia Tech)
Jeremy Archuleta (U. Utah)
Heshan Lin (NCSU)
Xiaosong Ma (NCSU & ORNL)

Nominated for Best Paper Award, SC 2006, Tampa, FL

Overview

- StorCloud Demo at SC|05
  - I/O throughput competition of real-world scientific applications
  - When: Sun., Nov. 13 to Thu., Nov. 17, 2005
  - Part of the slides adapted from the StorCloud presentation "mpiBLAST on the GreenGene Distributed Supercomputer" (Wu Feng et al.)
- Story
  - Built an ad-hoc grid (GreenGene) with 3,048 processors for an intensive genomic sequence search (searching NT against NT with mpiBLAST)
- Team
  - Institutions: LANL, NCSU, U. Utah, and Virginia Tech
  - Vendors: Intel, Panta Systems, and Foundry Networks

GreenGene Grid

- How? [Diagram: the GreenGene grid links the SC2005 showroom floor with sites at Intel (Dupont), U. Utah, and Virginia Tech]

Outline

- About BLAST and mpiBLAST
- Motivation
- Planning
  - Estimate resource requirements
  - What kind of grid do we need
- System design
  - Hardware architecture
  - Software architecture
- Results
- Conclusion

What is BLAST?

- Basic Local Alignment Search Tool
- Ubiquitous sequence database search tool used in molecular biology
- Given a query DNA or amino-acid (AA) sequence, BLAST
  - Finds similar sequences in the database
  - Reports the statistical significance of similarities between the query and the database
- Newly sequenced genomes are typically BLAST-searched against a database of known genes
  - Similar sequences may have similar functions in a new organism

BLAST at the Core of Sequence DB Search

- Widely used:
  - Approximately 75%-90% of all compute cycles in the life sciences are devoted to BLAST searches
- But, it is:
  - Computationally demanding, O(n²) (a variant of string matching)
  - Requires the sequence database to be stored in memory to perform efficiently
- Challenge: sequence databases are growing exponentially

mpiBLAST Algorithm: Querying the Database

- Open-source BLAST parallelization (developed at LANL)
- Parallel approach: segment and distribute the database across the cluster (see the sketch below)
- Advantage: delivers super-linear speedup by avoiding repeated I/O
- Limitation: poor performance on searches with large output volumes, because result merging becomes a bottleneck

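To make the segmentation idea concrete, here is a minimal sketch assuming a plain FASTA database and hypothetical fragment file names; it is illustrative only, not mpiBLAST's actual implementation.

    def read_fasta(path):
        """Yield (header, sequence) records from a FASTA file."""
        header, seq = None, []
        with open(path) as fh:
            for line in fh:
                line = line.rstrip()
                if line.startswith(">"):
                    if header is not None:
                        yield header, "".join(seq)
                    header, seq = line, []
                else:
                    seq.append(line)
        if header is not None:
            yield header, "".join(seq)

    def segment_database(db_path, num_fragments):
        """Distribute records round-robin into fragment_0.fa .. fragment_{n-1}.fa."""
        outs = [open("fragment_%d.fa" % i, "w") for i in range(num_fragments)]
        for i, (header, seq) in enumerate(read_fasta(db_path)):
            outs[i % num_fragments].write(header + "\n" + seq + "\n")
        for out in outs:
            out.close()

    # Each worker then searches every query against its own fragment, and the
    # per-fragment results are merged to reproduce the single-database output.

Each fragment is small enough to stay resident in a worker's memory, which is what eliminates the repeated I/O and yields the super-linear speedup noted above.
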
mpiBLAST-PIO: Enhancing Efficiency

- Optimizations transferred from pioBLAST
  - Research prototype developed at NCSU and ORNL [Lin et al., IPDPS 2005]
- Dramatically improves search throughput and scalability
  - Uses parallel I/O techniques to remove the result-merging bottleneck (see the sketch below)
    - Results are buffered and written out concurrently by the workers
  - Enhances output processing to reduce communication volume
- Heavily used in the SC StorCloud demo

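The parallel-output idea can be sketched with mpi4py (an assumption for illustration; mpiBLAST-PIO itself is implemented in C/C++ on top of MPI-IO): each worker buffers its formatted results, the ranks agree on file offsets via an exclusive prefix sum, and all ranks write concurrently instead of funneling output through one master.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Hypothetical per-worker buffer; in reality this holds formatted BLAST output.
    local_results = ("results from worker %d\n" % rank).encode()

    # Exclusive scan: each rank learns the total bytes written by lower ranks,
    # i.e. its starting offset in the shared output file.
    offset = comm.exscan(len(local_results))
    if offset is None:   # rank 0 has no predecessors
        offset = 0

    fh = MPI.File.Open(comm, "blast_output.txt",
                       MPI.MODE_WRONLY | MPI.MODE_CREATE)
    fh.Write_at_all(offset, local_results)   # collective, concurrent write
    fh.Close()
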
Why Sequence-Search the NT Database Against Itself?

- From a Biological Perspective
  - Aids in understanding which genetic codes are unique and which are redundant
  - Enables a number of useful studies, from organism "barcoding" to gene function and evolution
- From a Computer Science Perspective
  - Provides a pertinent demonstration of mpiBLAST-PIO's scalability to larger problems (NT is one of the largest sequence databases)
  - Can potentially generate huge output data
  - Enables the realization of advanced indexing structures that track relationships among sequences in the database
    - Such indexing structures can provide
      - Up to 100x speedup in search times with little loss of sensitivity
      - Up to 20x compression of the database using phylogenetic methods

Resource Estimation

- Why do we care?
  - To evaluate the feasibility of the project
  - To make better scheduling decisions
- What's the complexity of the problem?
  - Intuitively: estimate by sequence length
- [Chart: NT composition]

Sequence-Length-Based Estimation

- Simple linear extrapolation appears "mission impossible"
  - Because of "hard queries": intensive computation, large quantities of intermediate results
  - Weak correlation between sequence length and resource requirements, because BLAST employs heuristics
- Fortunately,
  - G1 sequences are well behaved, and a large portion of the sequences belongs to G1
  - Searches of hard queries can be sped up with more memory
- Sampled NT sequence searches to calibrate the estimate (see the extrapolation sketch below)

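A hedged sketch of what sampling-based estimation could look like, assuming a measured random sample of per-query times and an arbitrary cutoff separating "hard" queries from well-behaved ones (both assumptions, not the authors' actual procedure):

    import statistics

    def estimate_total_time(sample_times, total_queries, hard_cutoff_s=600):
        """Extrapolate total search time from a random sample of per-query times.

        sample_times: measured seconds for each sampled query.
        hard_cutoff_s: assumed threshold separating "hard" from well-behaved queries.
        """
        easy = [t for t in sample_times if t < hard_cutoff_s]
        hard = [t for t in sample_times if t >= hard_cutoff_s]
        hard_fraction = len(hard) / len(sample_times)
        easy_mean = statistics.mean(easy) if easy else 0.0
        hard_mean = statistics.mean(hard) if hard else 0.0
        # Expected total = N * (P(easy)*E[t|easy] + P(hard)*E[t|hard]).
        return total_queries * ((1 - hard_fraction) * easy_mean
                                + hard_fraction * hard_mean)

    # e.g. estimate_total_time(sampled_times, total_queries=num_nt_sequences)

Keeping the hard queries as a separate population prevents a handful of extreme outliers from dominating the extrapolation for the well-behaved majority.
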
Better Predictor?

- Hit-based rather than length-based?
- Two-phase BLAST search
  - First phase: find hits at the word level (see the sketch below)
  - Second phase: extend matched words in both directions to find the maximal segment pair (the longest local matching substring)
  - The computation of the first phase is much less expensive than that of the second phase
- Modified the BLAST algorithm to collect the number of hits in the first phase
- Attractive: utilizes internal knowledge of the BLAST algorithm

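The first phase can be illustrated with a toy hit counter; this is a simplification, not NCBI BLAST's actual implementation, which uses word lookup tables rather than set membership tests.

    def count_word_hits(query, subject, w=11):
        """Count word-level hits: positions in `subject` where some length-w word
        of `query` occurs exactly. (w=11 is the default word size for blastn.)"""
        words = {query[i:i + w] for i in range(len(query) - w + 1)}
        return sum(1 for j in range(len(subject) - w + 1)
                   if subject[j:j + w] in words)

    # The modified BLAST accumulates this count during phase one and reports it
    # per query as a candidate predictor of search cost.
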
Number of Hits Not a Better Predictor

- Linear regression on data collected from 500 sequences (see the comparison sketch below)
  - Y: output size, execution time; X: length, number of hits
- Number of hits is not necessarily better
  - Difference of mean square errors < 5%
  - High correlation (0.9942) between number of hits and sequence length
  - Sequence length is much easier to collect

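The comparison can be reproduced in spirit with a short script; this is a sketch of the stated methodology, and the variable names and use of NumPy are assumptions.

    import numpy as np

    def mse_of_linear_fit(x, y):
        """Mean square error of the least-squares line y = a*x + b."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        slope, intercept = np.polyfit(x, y, 1)
        return float(np.mean((y - (slope * x + intercept)) ** 2))

    def compare_predictors(lengths, hits, exec_times):
        mse_len = mse_of_linear_fit(lengths, exec_times)
        mse_hits = mse_of_linear_fit(hits, exec_times)
        corr = np.corrcoef(lengths, hits)[0, 1]   # correlation between the predictors
        print("MSE(length) = %.3g, MSE(hits) = %.3g" % (mse_len, mse_hits))
        print("relative difference = %.1f%%" % (100 * abs(mse_len - mse_hits) / mse_len))
        print("corr(length, hits) = %.4f" % corr)

    # Run on the 500 sampled queries, once with execution time and once with
    # output size as the response variable.
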
What Kind of Grid Do We Need?

- Existing grid frameworks (such as Globus) are not what we want
  - Not available or well tested on Mac OS X and 64-bit Linux
  - mpiBLAST-PIO has not been ported to Globus
  - High learning curve for installation and configuration
- Home-made grid software written from scratch
  - Fits exactly our needs
  - Easy to deploy, allows full control

Hardware Architecture

- Heterogeneous environment
- Interoperability is a big concern

Cluster      | Organization  | Architecture                   | Memory | #Procs  | File System
System X     | Virginia Tech | Dual 2.3GHz PowerPC 970FX      | 4GB    | 2200    | NFS
TunnelArch   | Univ. of Utah | Dual AMD Opteron 240 CPU       | 4GB    | 126     | PVFS
TunnelArch   | Univ. of Utah | Dual AMD Opteron 244 CPU       | 2GB    | 128     | PVFS
Dupont       | Intel         | Quad core                      | N/A    | 512/256 | NFS
Jarrel       | Intel         | Dual 3.4GHz Intel P4           | 2GB    | 20      | NFS
Blade Center | Intel         | Dual 2.66GHz Intel Xeon        | 2GB    | 28      | NFS
Panta        | Panta Systems | Four AMD Opteron 246HE         | 2GB    | 32      | NFS

Software Architecture

- Hierarchical design (see the sketch below)
  - SuperMaster: assigns queries, fetches results, balances load
  - GroupMaster: fetches queries, performs the search
  - How to choose the group size?
- Challenges: heterogeneity, scalability, fault tolerance
- [Diagram: a SuperMaster coordinating several GroupMasters, each with its own NT replica]

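A minimal sketch of the pull-based hierarchy, assuming a hypothetical directory layout on the SuperMaster; the real system used a handful of Perl/bash scripts with ssh and rsync, so this is illustrative only.

    import os

    QUEUE_DIR = "queries/pending"     # hypothetical layout exported by the SuperMaster
    CLAIM_DIR = "queries/claimed"
    RESULT_DIR = "results"

    def claim_next_batch(group_name):
        """Claim one pending batch by renaming it; rename is atomic locally."""
        for batch in sorted(os.listdir(QUEUE_DIR)):
            src = os.path.join(QUEUE_DIR, batch)
            dst = os.path.join(CLAIM_DIR, "%s.%s" % (batch, group_name))
            try:
                os.rename(src, dst)
                return dst
            except OSError:
                continue              # another GroupMaster claimed it first
        return None

    def group_master_loop(group_name, run_search):
        """Each GroupMaster pulls work; the SuperMaster stays passive."""
        while True:
            batch = claim_next_batch(group_name)
            if batch is None:
                break                 # no work left
            output = run_search(batch)    # e.g. launch mpiBLAST-PIO on the group's cluster
            out_path = os.path.join(RESULT_DIR, os.path.basename(batch) + ".out")
            with open(out_path, "wb") as fh:
                fh.write(output)

Because GroupMasters pull work, adding a cluster or regrouping processors only means starting another GroupMaster; the SuperMaster never has to initiate connections.
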
Heterogeneity and Accessibility

- Only four existing, cross-platform tools are used
  - Perl, ssh, rsync, bash
  - 5 scripts, totaling only 458 lines
  - Fast deployment on Unix-like systems
- Customized mpiBLAST-PIO
  - System X needs special care
    - Porting issues because of Mac OS and PowerPC
    - Implemented pseudo-parallel-write to improve output performance on NFS (see the sketch below)

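A minimal sketch of the pseudo-parallel-write idea, with hypothetical file names (not the actual mpiBLAST-PIO patch): each worker writes its own partial file, avoiding concurrent writes to a single shared NFS file, and the parts are concatenated in rank order afterwards.

    import shutil

    def worker_write(rank, data, prefix="batch42"):      # hypothetical prefix
        """Each worker writes its share of the output to a private part file."""
        part = "%s.part.%04d" % (prefix, rank)
        with open(part, "wb") as fh:
            fh.write(data)
        return part

    def merge_parts(num_workers, prefix="batch42", final="batch42.out"):
        """Concatenate the part files in rank order to form the final output."""
        with open(final, "wb") as out:
            for rank in range(num_workers):
                with open("%s.part.%04d" % (prefix, rank), "rb") as part:
                    shutil.copyfileobj(part, out)
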
Design for Scalability

- Managing thousands of processors efficiently with a loosely coupled, hierarchical design
  - Reduces the load on the SuperMaster
  - Passive SuperMaster: easy to add group masters, regroup processors, and avoid security holes
  - Allows incremental system start
- Prevent "bubbles in the pipeline" (see the sketch below)
  - Hiding WAN latency by queuing queries locally
  - Ensuring data integrity with MD5 checksums
    - A silent error occurs roughly every 500GB [Paxson 1999]
  - Alleviating network bandwidth constraints with compression (compression ratio 1:5 ~ 1:7)

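A minimal sketch of the transfer path, with hypothetical helper names: compress a result batch before shipping it across the WAN and verify an MD5 checksum on the receiving side to catch silent corruption.

    import gzip, hashlib, shutil

    def md5_of(path, chunk=1 << 20):
        h = hashlib.md5()
        with open(path, "rb") as fh:
            for block in iter(lambda: fh.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    def compress_and_checksum(path):
        """Gzip a result file (~1:5 to 1:7 on BLAST output) and record its MD5."""
        gz_path = path + ".gz"
        with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        return gz_path, md5_of(gz_path)

    def verify(gz_path, expected_digest):
        """Receiver side: reject and re-request the batch on a checksum mismatch."""
        if md5_of(gz_path) != expected_digest:
            raise IOError("checksum mismatch for %s" % gz_path)

    # Sender: gz, digest = compress_and_checksum("batch42.out"); ship both via rsync/ssh.
    # Receiver: verify("batch42.out.gz", digest) before accepting the batch.
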
Fault Tolerance

- Serious: mean time to failure < 10 hours on machines with thousands of processors [Reed 2004]
- Re-execution rather than checkpoint-restart (see the sketch below)
  - Primary issue: query-state management
  - Maintain all query states in the file system

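A minimal sketch of the re-execution policy, reusing the hypothetical directory layout from the software-architecture sketch: batch states live in the file system, so they survive crashes, and batches claimed long ago without a completion marker are simply put back in the queue.

    import os, time

    STATE_DIR = "state"             # GroupMasters drop "<batch>.done" markers here
    STALE_AFTER = 2 * 3600          # assumed: treat a batch as lost after 2 hours

    def mark_done(batch):
        open(os.path.join(STATE_DIR, batch + ".done"), "w").close()

    def requeue_stale_batches(pending_dir="queries/pending",
                              claimed_dir="queries/claimed"):
        """Move claimed-but-unfinished batches back to the pending queue."""
        now = time.time()
        for name in os.listdir(claimed_dir):
            batch = name.split(".")[0]            # strip the claiming group's suffix
            claimed_path = os.path.join(claimed_dir, name)
            done = os.path.exists(os.path.join(STATE_DIR, batch + ".done"))
            if not done and now - os.path.getmtime(claimed_path) > STALE_AFTER:
                os.rename(claimed_path, os.path.join(pending_dir, batch))
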
Results

- Finished 1/7 of NT in one day
  - Coalesced sequences into batches targeting 30 minutes of search time (see the sketch below)
- Execution statistics
  - Output size: 600KB ~ 7GB per batch, 284.2KB per sequence
  - Execution time: 6 seconds ~ 1.6 hours, average 9 minutes per batch

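The batching policy can be sketched as a greedy pass over the query list, assuming a per-sequence time predictor such as the length-based linear fit discussed earlier (the predictor form and the helper names are assumptions):

    TARGET_SECONDS = 30 * 60   # target roughly 30 minutes of search time per batch

    def make_batches(sequences, predict_seconds):
        """sequences: iterable of (seq_id, length); predict_seconds: length -> seconds."""
        batches, current, current_time = [], [], 0.0
        for seq_id, length in sequences:
            est = predict_seconds(length)
            if current and current_time + est > TARGET_SECONDS:
                batches.append(current)
                current, current_time = [], 0.0
            current.append(seq_id)
            current_time += est
        if current:
            batches.append(current)
        return batches

    # e.g. make_batches(nt_sequences, lambda L: slope * L + intercept)
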
Conclusion

- Not able to take advantage of existing grid software
- Home-made grid software did work
  - Enables rapid development and deployment
  - Portable to Unix-like platforms
  - Identifies hard queries for bio research
- Future work
  - Extend the framework to support more general applications
  - Better resource estimation