Swarm-PBC09-MEP - Data to Insight Center
Enabling Large Scale Scientific Computations for Expressed Sequence Tag Sequencing over Grid and Cloud Computing Clusters
Sangmi Lee Pallickara, Marlon Pierce,
Qunfeng Dong, Chin Hua Kong
Indiana University, Bloomington IN, USA
*Presented by Marlon Pierce
IU to lead New US NSF Track 2d $10M Award
The EST Pipeline
• The goal is to cluster mRNA sequences
– Overlapping sequences are grouped together into clusters, and then a consensus sequence is derived from each cluster.
– CAP3 is one program to assemble contiguous sequences.
• Data sources: NCBI GenBank, short read gene sequencers in the lab, etc.
– Too large to do with serial codes like CAP3
• We use PaCE (S. Aluru) to do a pre-clustering step for large sequences (parallel problem).
– 1 large data set --> many smaller clusters
– Each individual cluster can be fed into CAP3.
– We replaced the memory problem with the many-task problem.
– This is data-file parallel (see the sketch below).
• Next step: do the CAP3 consensus sequences match any known sequences?
– BLAST is also data-file parallel, good for Clouds
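The data-file-parallel CAP3 step above can be pictured with a minimal Java sketch. It assumes PaCE has already written one FASTA file per cluster into a directory and that the cap3 binary is on the PATH; the class and parameter names are illustrative, not the portal's actual code.

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Minimal sketch: run CAP3 independently on every PaCE cluster file in a directory.
public class Cap3FanOut {
    public static void main(String[] args) throws Exception {
        File[] clusters = new File(args[0]).listFiles();   // one FASTA file per PaCE cluster
        if (clusters == null) return;
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (final File cluster : clusters) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        Process p = new ProcessBuilder("cap3", cluster.getAbsolutePath())
                                .redirectErrorStream(true)
                                .start();
                        // Drain CAP3's console output so the process cannot block on a full pipe;
                        // CAP3 writes its contigs and singlets next to the input file.
                        java.io.InputStream out = p.getInputStream();
                        byte[] buf = new byte[4096];
                        while (out.read(buf) != -1) { /* discard */ }
                        p.waitFor();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }
}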
• Our goal is to provide a Web service-based science portal that can handle the largest mRNA clustering problems.
• Computation is outsourced to Grids (TeraGrid) and Clouds (Amazon)
– Not provided by in-house clusters.
• This is an open service, open architecture approach.
http://swarm.cgb.indiana.edu
Some TeraGrid Resources
Data obtained from NIH NCBI. 4.7 GB of raw data processed using PaCE on Big Red; the resulting clusters are then processed with CAP3.
Swarm: Large-scale job submission infrastructure over distributed clusters
• Web service to submit and monitor 10,000s (or more) serial or parallel jobs.
• Capabilities:
– Scheduling large numbers of jobs over distributed HPC clusters (Grid clusters, Cloud clusters, and MS Windows HPC clusters)
– Monitoring framework for large-scale jobs
– Standard Web service interface for web applications (see the client sketch below)
– Extensible design for domain-specific software logic
– Brokers both Grid and Cloud submissions
• Other applications:
– Calculate properties of all drug-like molecules in PubChem (Gaussian)
– Docking problems in drug discovery (Amber, AutoDock)
(Revised) Architecture of Swarm Service
[Architecture diagram: a Swarm-Analysis layer exposes the Standard Web Service Interface and a Large Task Load Optimizer backed by a local RDBMS; Swarm-Grid, Swarm-Dryad, and Swarm-Hadoop connectors dispatch work to a Grid HPC/Condor cluster, a Windows Server cluster, and a cloud computing cluster, respectively.]
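The connector layering in the diagram can be illustrated with a small Java sketch. The interface and class names below are illustrative assumptions, not Swarm's actual classes, and the "first suitable backend" rule stands in for whatever policy the Large Task Load Optimizer really applies.

import java.util.List;

// A common submission interface with Grid, Hadoop, and Dryad backends,
// chosen by a simple load optimizer (sketch only).
interface ResourceConnector {
    boolean suits(JobSpec job);          // can this backend run the job well?
    String submit(JobSpec job);          // returns a backend-specific job id
}

class JobSpec {
    final String command;
    final boolean parallel;              // e.g. an MPI job
    final long expectedSeconds;          // rough runtime estimate a connector may inspect
    JobSpec(String command, boolean parallel, long expectedSeconds) {
        this.command = command;
        this.parallel = parallel;
        this.expectedSeconds = expectedSeconds;
    }
}

class LargeTaskLoadOptimizer {
    private final List<ResourceConnector> connectors;  // Grid, Hadoop, Dryad connectors
    LargeTaskLoadOptimizer(List<ResourceConnector> connectors) {
        this.connectors = connectors;
    }
    String dispatch(JobSpec job) {
        for (ResourceConnector c : connectors) {
            if (c.suits(job)) {
                return c.submit(job);    // first suitable backend wins in this sketch
            }
        }
        throw new IllegalStateException("no suitable backend for job");
    }
}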
Swarm-Grid
• Swarm considers traditional Grid HPC clusters suitable for:
– High-throughput jobs
– Parallel jobs (e.g., MPI jobs)
– Long-running jobs
• Resource Ranking Manager
– Prioritizes the resources with QBETS and INCA (the QBETS Web service is hosted by UCSB); a ranking sketch follows the diagram.
• Fault Manager
– Fatal faults
– Recoverable faults
[Architecture diagram: the Standard Web Service Interface feeds the Request Manager, Resource Ranking Manager, Data Model Manager, Fault Manager, and per-user Job Board, backed by a local RDBMS; the Job Queue and Job Distributor submit work through the Grid HPC/Condor pool Resource Connector (Condor Grid/Vanilla with Birdbath), using a MyProxy server hosted by the TeraGrid project, to Grid HPC clusters and a Condor cluster.]
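A minimal sketch of the Resource Ranking Manager's idea, assuming the predicted batch-queue wait times have already been fetched from a QBETS-style prediction service; the map input and class names are illustrative, not the real QBETS API.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class RankedResource {
    final String name;
    final double predictedWaitSeconds;
    RankedResource(String name, double predictedWaitSeconds) {
        this.name = name;
        this.predictedWaitSeconds = predictedWaitSeconds;
    }
}

class ResourceRankingManager {
    // predictedWait: resource name -> predicted queue wait in seconds,
    // e.g. refreshed periodically from a QBETS-style Web service (stubbed here).
    List<RankedResource> rank(Map<String, Double> predictedWait) {
        List<RankedResource> ranked = new ArrayList<RankedResource>();
        for (Map.Entry<String, Double> e : predictedWait.entrySet()) {
            ranked.add(new RankedResource(e.getKey(), e.getValue()));
        }
        Collections.sort(ranked, new Comparator<RankedResource>() {
            public int compare(RankedResource a, RankedResource b) {
                return Double.compare(a.predictedWaitSeconds, b.predictedWaitSeconds);
            }
        });
        return ranked;  // shortest predicted wait first
    }
}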
Swarm-Hadoop
• Suitable for short-running serial job collections
• Submits jobs to cloud computing clusters: Amazon's EC2 or Eucalyptus
• Uses the Hadoop MapReduce engine.
• Each job is processed as a single Map function (see the Mapper sketch after the diagram).
• Input/output locations are determined by the Data Model Manager
– Easy to modify for domain-specific requirements.
[Architecture diagram: the Standard Web Service Interface feeds the Request Manager, Data Model Manager, Fault Manager, and per-user Job Board, backed by a local RDBMS; a Job Buffer and Job Producer hand work to the Hadoop Resource Connector, which drives the Hadoop MapReduce programming interface on the cloud computing cluster.]
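A sketch of the "one job per Map call" idea using Hadoop's mapreduce API: each input line is assumed to name one cluster file that is readable on the worker node, and the map function runs CAP3 on it. Staging of inputs and outputs by the Data Model Manager is omitted, and the class name is illustrative.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each map() call is one job: run CAP3 on the cluster file named by the input record.
public class Cap3JobMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String clusterFile = value.toString().trim();
        if (clusterFile.isEmpty()) {
            return;
        }
        Process p = new ProcessBuilder("cap3", clusterFile)
                .redirectErrorStream(true)
                .start();
        // Drain the assembler's output so the process cannot block on a full pipe.
        java.io.InputStream out = p.getInputStream();
        byte[] buf = new byte[4096];
        while (out.read(buf) != -1) { /* discard */ }
        int exitCode = p.waitFor();
        // Emit the input file and the assembler's exit status for the monitoring side.
        context.write(new Text(clusterFile), new Text("exit=" + exitCode));
    }
}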
Performance Evaluation
• Java JDK 1.6 or higher, Apache Axis2
• Server: 3.40 GHz Intel Pentium 4 CPU, 1 GB RAM
• Swarm-Grid:
– Backend TeraGrid machines: Big Red (Indiana University), Ranger (Texas Advanced Computing Center), and NSTG (Oak Ridge National Lab)
• Swarm-Hadoop:
– Computing nodes: Amazon Web Services EC2 cluster with m1.small instances (2.5 GHz dual-core AMD Opteron with 1.7 GB RAM)
• Swarm-Windows HPC:
– Microsoft Windows HPC cluster, 2.39GHz CPUs, 49.15GB RAM, 24 cores, 4 sockets
• Dataset: partial set of human EST fragments (published by NCBI GenBank)
– 4.6 GB total
– Groupings: very small job (less than 1 minute), small job (1-3 minutes), medium job (3-10 minutes), large job (longer than 10 minutes)
Total execution time of CAP3 for various numbers of jobs (~1 minute each) with Swarm-Grid, Swarm-Hadoop, and Swarm-Dryad
Job Execution time in Swarm-Hadoop
Conclusions
• Bioinformatics needs both computing Grids and scientific Clouds
– Problem sizes range over many orders of magnitude
• Swarm is designed to bridge the gap between the two, while supporting 10,000s or more jobs per user per problem.
• Smart scheduling is an issue in data-parallel computing
– Small jobs (~1 min) were processed more efficiently by Swarm-Hadoop and Swarm-Dryad.
– Grid-style HPC clusters add minutes (or even longer) of overhead to each job.
– Grid-style HPC clusters still provide a stable environment for large-scale parallel jobs.
More Information
• Email: leesangm AT cs.indiana.edu, mpierce AT cs.indiana.edu
• Swarm Web Site: http://www.collabogce.org/ogce/index.php/Swarm
• Swarm on SourceForge: http://ogce.svn.sourceforge.net/viewvc/ogce/ogce-services-incubator/swarm/
Computational Challenges in EST Sequencing
• Challenge 1: Executing tens of thousands of jobs.
– More than 100 plant species have at least 10,000 EST sequences; tens of thousands of assembly jobs are processed.
– Standard queuing systems used by Grid-based clusters do NOT allow users to submit 1000s of jobs concurrently to batch queue systems (a buffering sketch follows this slide).
• Challenge 2: Job processing requirements vary
– To complete the EST assembly process, various types of computation jobs must be processed, e.g., large-scale parallel processing, serial processing, and embarrassingly parallel jobs.
– Matching each job to a suitable computing resource optimizes the performance of the computation.
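To make Challenge 1 concrete, here is a hedged sketch of the buffering idea: accept an arbitrarily long job list up front, but release jobs to the batch system only while the user's concurrent submissions stay under a site limit. The class, the limit, and the submit call are illustrative placeholders, not Swarm's actual Job Queue/Job Board code.

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;

// Buffer tens of thousands of accepted jobs and drip-feed them to the batch queue.
class ThrottledSubmitter {
    private final Queue<String> pending = new ConcurrentLinkedQueue<String>();
    private final Semaphore slots;                  // site-imposed per-user concurrency cap

    ThrottledSubmitter(int maxConcurrent) {
        this.slots = new Semaphore(maxConcurrent);
    }

    void enqueue(String jobDescription) {           // called 10,000s of times by the service
        pending.add(jobDescription);
    }

    void pump() throws InterruptedException {       // driven by a background thread
        String job;
        while ((job = pending.poll()) != null) {
            slots.acquire();                        // wait until a submission slot frees up
            submitToBatchQueue(job);
        }
    }

    void onJobFinished() {                          // called by the monitoring framework
        slots.release();
    }

    private void submitToBatchQueue(String job) {
        // placeholder for the real submission (e.g. Condor-G / qsub)
        System.out.println("submitting: " + job);
    }
}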
Tools for EST Sequence Assembly
• Cleaning sequence reads: RepeatMasker (also SEG, XNU, RepeatRunner, PILER)
• Clustering sequence reads: PaCE (also Cap3 clustering, BAG, Clusterer, CluSTr, UI Cluster, and many more)
• Assembling reads: Cap3 (also FAKII, GAP4, PHRAP, TIGR Assembler)
• Similarity search: Blast (also FASTA, Smith-Waterman, Needleman-Wunsch)
Swarm-Grid: Submitting High-throughput Jobs (2)
• User-based (personal account or community account) job management: policies in the Grid clusters are based on the user.
• Job Distributor: matchmaking between available resources and submitted jobs (see the sketch after this slide).
• Job Execution Manager: submits jobs through Condor-G using the Birdbath WS interface.
• Condor resource connector: manages jobs to be submitted to the Grid HPC clusters or a traditional Condor cluster.
[Architecture diagram: same Swarm-Grid components as above; the Standard Web Service Interface feeds the Request Manager, Resource Ranking Manager (with the QBETS Web service hosted by UCSB), Data Model Manager, Fault Manager, and per-user Job Board, backed by a local RDBMS; the Job Queue and Job Distributor submit work through the Grid HPC/Condor pool Resource Connector (Condor Grid/Vanilla with Birdbath), using a MyProxy server hosted by the TeraGrid project, to Grid HPC clusters and a Condor cluster.]
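An illustrative matchmaking step for the Job Distributor described above: pair each queued job with the first ranked resource that satisfies its requirements. The two-field requirement model is a simplification for the sketch, not Swarm's internal job format.

import java.util.List;

class GridResource {
    final String name;
    final boolean supportsMpi;
    final int freeSlots;
    GridResource(String name, boolean supportsMpi, int freeSlots) {
        this.name = name;
        this.supportsMpi = supportsMpi;
        this.freeSlots = freeSlots;
    }
}

class QueuedJob {
    final String id;
    final boolean needsMpi;
    final int requestedSlots;
    QueuedJob(String id, boolean needsMpi, int requestedSlots) {
        this.id = id;
        this.needsMpi = needsMpi;
        this.requestedSlots = requestedSlots;
    }
}

class JobDistributor {
    // Resources are assumed to arrive already ordered by the Resource Ranking Manager.
    GridResource match(QueuedJob job, List<GridResource> rankedResources) {
        for (GridResource r : rankedResources) {
            boolean mpiOk = !job.needsMpi || r.supportsMpi;
            if (mpiOk && r.freeSlots >= job.requestedSlots) {
                return r;   // hand off to the Condor-G/Birdbath connector
            }
        }
        return null;        // stay in the Job Queue until a resource frees up
    }
}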
Job execution time in Swarm-Dryad with Windows HPC (16 nodes)
Job execution time in Swarm-Dryad for various numbers of nodes
EST Sequencing Pipeline
• EST (Expressed Sequence Tag): a fragment of a messenger RNA (mRNA), which is transcribed from a gene residing on a chromosome.
• EST sequencing: reconstructing the full-length mRNA sequence for each expressed gene by assembling EST fragments.
• EST sequencing is a standard practice for gene discovery, especially for the genomes of many organisms that may be too complex for whole-genome sequencing (e.g., wheat).
• EST contigs are important data for accurate gene annotation.
• A pipeline of computational steps is required:
– E.g., repeat masking, PaCE, and CAP3 or another assembler on the clustered data set
Computing Resources for Compute-intensive Biological Research
• Biological research requires a substantial amount of computing resources.
• Much of the current computing relies on limited local computing infrastructure.
• Available computing resources include:
– US national cyberinfrastructure (e.g., TeraGrid): a good fit for closely coupled parallel applications
– Cloud computing clusters (e.g., Amazon EC2, Eucalyptus): good for on-demand jobs that individually last a few seconds or minutes
– Microsoft Windows-based HPC clusters (Dryad): a job submission environment without the conventional overhead of batch queue systems