What is a Metagenome?

Deployment and Preparation of Metagenomic
Analysis on the EELA Grid
Gabriel Aparício, et al.
Global Process Topics
Topic summary
• Introduction
• Cases of study
• Deployment
• Automation System
• Results and Performances Analysis
• Future Plans
Introduction
• What is a Metagenome?
• What is a Metagenome Analysis?
• Why is the Grid a Good Solution?
• Which is the Proposed Structure?
• Which are the Future Plans?
Introduction
What is a Metagenome?
• Term first used by Jo Handelsman and others
at the University of Wisconsin in 1998.
• A metagenome is the collection of genes of all
the organisms in an environmental sample.
• It can be studied as if it were a single genome.
• Analysis can be done without isolating the
organisms and cultivating them in the lab.
Introduction
What is a Metagenome Analysis?
• A Metagenome Analysis is the set of steps
needed to transform a file encoding a
metagenome into another file containing the
information of interest.
• This can include:
– Database filtering.
– BLAST alignments.
– BLAST output filtering.
– Creation of Phylogenetic Trees.
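
As an illustration of the "BLAST output filtering" step, here is a minimal sketch that keeps only the strong hits from BLAST tabular output. It assumes the standard 12-column tabular layout (legacy `-m 8`, BLAST+ `-outfmt 6`), where the e-value is the 11th column and the bit score the 12th. The function name and both thresholds are hypothetical choices for the example, not values from the slides.

```python
from typing import Iterable, Iterator, List

# Zero-based column positions in 12-column BLAST tabular output.
EVALUE_COLUMN = 10
BITSCORE_COLUMN = 11


def filter_blast_hits(
    lines: Iterable[str],
    max_evalue: float = 1e-5,    # assumed threshold, not from the slides
    min_bitscore: float = 50.0,  # assumed threshold, not from the slides
) -> Iterator[List[str]]:
    """Yield the fields of every hit that passes both thresholds."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a complete tabular record
        evalue = float(fields[EVALUE_COLUMN])
        bitscore = float(fields[BITSCORE_COLUMN])
        if evalue <= max_evalue and bitscore >= min_bitscore:
            yield fields
```

Because the function is a generator, it can stream a large BLAST output file line by line without loading it into memory.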
Introduction
Why Grid is a Good Solution?
• A Metagenome can be encoded as several
hundred thousand sequences.
• Sequential processing can take more than a year.
• Public databases are continuously changing.
• Several coarse-grained steps can be done in parallel.
• In a Grid, the global job can be divided into
subjobs.
• A Metagenome Analysis can be processed in a
few days with a Grid Infrastructure.
Cases of Study
Farm Soil Metagenome
• This is a sample from a nutrient rich,
moderately contaminated soil environment.
• This community is very diverse and complex.
• Many yet unknown enzymes are probably
present there.
• Its study is very interesting from the
biotechnological point of view.
Cases of Study
Whale fall Metagenome
• Whale carcasses are known to be a nutrient-rich
environment at the bottom of the ocean.
• A heterogeneous mixture of bacteria flourishes
there.
• It is one of the best examples of marine
bacterial communities.
Cases of Study
Sargasso Sea Metagenome
• These oceanic samples are taken from surface
waters.
• They represent the diversity of bacteria that
live planktonically.
Cases of Study
Gut Metagenomes
• Several metagenomes of the human
intestinal microbiota.
• This consortium of bacteria helps its host to
metabolize many nutrients that would be
indigestible otherwise.
• It is also involved in other functions:
– Maturation and modulation of the immune
response of the host.
– Prevention of infection by bacterial pathogens.
Deployment
Sequential or Parallel jobs? (I)
• There are around 150 CEs (Computing Elements)
in the BIOMED and EELA VOs (Virtual Organisations).
• Only around 30 of these CEs can run MPICH jobs.
• The number of usable CEs decreases as the
number of required nodes increases.
• Full efficiency in MPICH jobs is achieved only
occasionally.
Deployment
Sequential or Parallel jobs? (II)
Deployment
Selecting CEs (I)
• Several jobs are needed
– A single job can take more than a year.
– The Analysis must be split into several
subjobs (often more than 100 subjobs).
• Several CEs are needed
– To decentralize processing, storing and network
bandwidth.
• A Metagenome Analysis job has requirements
– On software, hardware and configuration.
Deployment
Selecting CEs (II)
• Not all available CEs are able to produce
results.
• Not all available CEs have the same
performance.
• CEs must be selected, and jobs distributed
among them, according to their performance.
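
One possible way to distribute jobs according to CE performance is a proportional share, sketched below. The slides do not say which allocation rule was used, so this is only an assumption: each CE receives a number of subjobs proportional to a measured rate (for example sequences per hour), and CEs with a rate of zero are treated as unable to produce results. The function name and the CE names in the usage note are hypothetical.

```python
from typing import Dict


def distribute_subjobs(n_subjobs: int, ce_rates: Dict[str, float]) -> Dict[str, int]:
    """Split n_subjobs among CEs in proportion to their measured rate."""
    # CEs that produced no results (rate 0) are excluded.
    usable = {ce: rate for ce, rate in ce_rates.items() if rate > 0}
    if not usable:
        raise ValueError("no CE with a positive measured rate")

    total_rate = sum(usable.values())
    # Ideal (fractional) share of each CE.
    quotas = {ce: n_subjobs * rate / total_rate for ce, rate in usable.items()}
    # Start from the integer part of each share.
    counts = {ce: int(quota) for ce, quota in quotas.items()}

    # Hand out the remaining subjobs to the largest fractional remainders.
    leftover = n_subjobs - sum(counts.values())
    by_remainder = sorted(usable, key=lambda ce: quotas[ce] - counts[ce], reverse=True)
    for ce in by_remainder[:leftover]:
        counts[ce] += 1
    return counts
```

For example, `distribute_subjobs(100, {"ce-a": 300, "ce-b": 100, "ce-c": 0})` assigns 75 subjobs to `ce-a`, 25 to `ce-b` and none to `ce-c`.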
Deployment
Selecting CEs (III)
Deployment
Selecting SEs and Replicating Files
• All jobs need certain common files.
• These files have to be replicated to increase
performance and to distribute network
bandwidth.
• SEs (Storage Elements) are selected according
to their geographical and administrative
proximity to the selected CEs, their
performance and their configuration.
Deployment
Splitting global job
• The global job has to be broken down into
subjobs.
• Subjobs have a shorter lifetime than the global
job, which helps to:
– Increase interactivity.
– Improve monitoring capabilities.
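
The splitting step can be sketched as follows, assuming the metagenome is stored as a FASTA file (one `>` header line per sequence) and that each subjob receives a fixed number of sequences. Both the file format and the fixed chunk size are assumptions for illustration, since the slides do not describe how the split was implemented.

```python
from typing import Iterable, Iterator, List


def read_fasta_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield one string per FASTA record (header line plus sequence lines)."""
    record: List[str] = []
    for line in lines:
        # A new header closes the record collected so far.
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        if line.strip():
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield "".join(record)


def split_into_subjobs(
    lines: Iterable[str], sequences_per_subjob: int
) -> Iterator[List[str]]:
    """Group FASTA records into chunks, one chunk per subjob."""
    if sequences_per_subjob < 1:
        raise ValueError("sequences_per_subjob must be at least 1")
    chunk: List[str] = []
    for record in read_fasta_records(lines):
        chunk.append(record)
        if len(chunk) == sequences_per_subjob:
            yield chunk
            chunk = []
    if chunk:  # the last subjob may be smaller
        yield chunk
```

A smaller `sequences_per_subjob` gives shorter subjobs, which are easier to monitor and cheaper to resubmit, at the cost of more submissions.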
Automation System
Submitting Jobs
• Subjobs are assigned to a list of CEs.
• These CEs have been tested.
• Assignment is based on the performance
measured in previous experiments.
Automation System
Monitoring Jobs
• The status of each job is checked periodically.
• In case of errors (aborted job, bad results,
etc.), the job is automatically resubmitted.
• If a job has been running for too long, it is
cancelled and resubmitted.
• If a job has finished successfully, its
CE is recorded for later submissions.
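
The monitoring rules above can be sketched as a single decision function. The status strings (`"aborted"`, `"done"`, `"running"`), the `Action` names and the time limit parameter are hypothetical and do not come from the middleware. Real code would map the states reported by the Grid job-status command onto them.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait"
    RESUBMIT = "resubmit"
    CANCEL_AND_RESUBMIT = "cancel and resubmit"
    RECORD_CE = "record CE for later submissions"


def decide(status: str, results_ok: bool, hours_running: float, max_hours: float) -> Action:
    """Map one job's observed state to the next action of the monitor."""
    if status == "aborted":
        return Action.RESUBMIT
    if status == "done":
        # A finished job with bad results counts as an error.
        return Action.RECORD_CE if results_ok else Action.RESUBMIT
    if status == "running" and hours_running > max_hours:
        return Action.CANCEL_AND_RESUBMIT
    # Still queued, or running within the time limit.
    return Action.WAIT
```

Keeping the decision separate from the commands that query and submit jobs makes the policy easy to test without access to a Grid.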
Automation System
Resubmitting Jobs
• Each correctly finished job records its CE
in a list of working CEs.
• Failed jobs are resubmitted to a random CE
from this list.
• If the list does not exist, the job is submitted
to a random CE.
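
A minimal sketch of this resubmission rule: pick a random CE from the list of CEs that have already finished a job correctly, and fall back to any available CE when that list is missing or empty. The function and parameter names are hypothetical.

```python
import random
from typing import Optional, Sequence


def choose_ce(good_ces: Optional[Sequence[str]], all_ces: Sequence[str], rng=random) -> str:
    """Return the CE to which a failed job should be resubmitted."""
    # Prefer CEs that have already produced correct results.
    pool = good_ces if good_ces else all_ces
    if not pool:
        raise ValueError("no CE available for resubmission")
    return rng.choice(list(pool))
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible, which is convenient when testing the automation scripts.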
Automation System
Retrieving Results
• Once results are available, they are
downloaded and the standard outputs are
checked for errors.
• A retrieved job is no longer monitored.
Results and Performances
First conclusions
• Jobs take too long to run sequentially.
– The Sargasso Sea Metagenome takes 512 days.
• The same job on the Grid takes 13 days to
finish completely.
– Speedup is around 40.
• Most jobs finish quickly (90% within 7 days).
– Speedup is around 80.
– There is no need to finish all jobs before
beginning new stages.
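
The first speedup figure can be checked with the usual definition, sequential time divided by parallel time. The sketch below reproduces only the full-run figure from the 512-day and 13-day values. The slides do not state how the second figure (around 80) was derived, so it is not recomputed here.

```python
def speedup(sequential_days: float, grid_days: float) -> float:
    """Speedup as sequential time divided by Grid (parallel) time."""
    if grid_days <= 0:
        raise ValueError("grid_days must be positive")
    return sequential_days / grid_days


# Sargasso Sea Metagenome: 512 days sequentially, 13 days on the Grid.
full_run = speedup(512, 13)
print(f"Full-run speedup: {full_run:.1f}")  # 39.4, i.e. "around 40"
```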
Results and Performances
Correctly finished jobs percentage
Results and Performances
Sequences processed per hour
Future plans
• To create several shell scripts with different
stages depending on the desired results.
• To increase the number of case studies.
• To improve the performance of the automation
system.
• To write a report on the issues found and lessons
learnt on the EGEE and EELA infrastructures.
Contact
Gabriel Aparício i Pla
Ignacio Blanquer Espert
Vicente Hernández García
Universitat Politècnica de València
Camí de Vera, s/n
46022 València, Spain