BioinfoGRID Project - Napoli


HPC and GRID challenges in
Bioinformatics.
Milanesi Luciano
National Research Council
Institute of Biomedical Technologies, Milan, Italy
Luciano Milanesi
2007/12/19, Napoli
Introduction
• The potential of new biological and biomedical
technological platforms in connection with HPC and
GRID technology will be particularly useful to deal with the
increasing amount, complexity, and heterogeneity of
biological and biomedical data.
• Bioinformatics applications for eHealth have become an
ideal research area where computer scientists can apply
and further develop new intelligent computation methods,
in both experimental and theoretical cases.
The European Bioinformatics initiative
based on infrastructure created by the
EGEE and BioinfoGRID and related
projects will be illustrated.
Introduction: Post-genomic
• “Post-genomics” focuses on the new tools and new methodologies emerging from the knowledge of genome sequences.
• Production and use of DNA microarrays and analysis of the transcriptome, proteome and metabolome are the different topics developed in this class.
The human organism:
• ~3 billion nucleotides
• ~30,000 genes coding for
• ~100,000-300,000 transcripts
• ~1-2 million proteins
• ~60 trillion cells of
• ~300 cell types in
• ~14,000 distinguishable morphological structures
ICT and Genomics
• A key development in the computational world has been the arrival of de novo design algorithms that use all the spatial information available within the target to design novel drugs.
• Coupling these algorithms to the rapidly growing body of information from structural genomics, together with new ICT technology (e.g. HPC, GRID, Web Services, etc.), provides a powerful new possibility for exploring design against a broad spectrum of genomics targets, including more challenging techniques such as protein–protein interactions, docking, molecular dynamics, systems biology, gene networks, etc.
System Biology for Health
EGEE Related EU projects
[Figure: logos of EGEE-related EU projects, including EUIndiaGRID, ISSeG, BEinGRID and Diligent (A DIgital Library Infrastructure on Grid ENabled Technology)]
BioinfoGRID
BioinfoGRID Project
• The BIOINFOGRID project proposes to combine bioinformatics services and applications for molecular biology users with the Grid infrastructure created by the EGEE and EGEE-II projects.
• The BIOINFOGRID initiative plans to perform research in genomics, transcriptomics, proteomics and molecular dynamics applications based on GRID technology.
Genomics applications in GRID
• GRID analysis of genomic databases: integration of precomputed data, gene identification, differentiation of pseudogenes, comparative genome analysis, etc.
• Functional protein analysis in GRID: use functional protein domain annotations on large protein families, drawing on GRID resources and related databases.
Bioinformatics Applications
• CSTminer
– Goal: compare the entire human genome against the entire genomes of other animals (mouse, dog, etc.)
– First test: human against mouse
– Challenge dimension:
  - 850 million BLAST comparisons (~2 s of CPU for each comparison)
  - More than 50 CPU years needed
  - More than 65,000 jobs submitted
  - Up to 2 million comparisons per hour
  - 22 different farms used
  - More than 900 different hosts used
  - 2 months of running on the INFN-Grid infrastructure
– Second test: some human genes against many animals
– Challenge dimension:
  - 1.7 million comparisons
  - More than 900 CPU hours needed
  - Less than 1 day on the INFN-Grid infrastructure
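The CSTminer figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that every comparison costs exactly the quoted ~2 s of CPU and that a year has 365 days; the variable names are illustrative only.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_YEAR = 365 * 24 * SECONDS_PER_HOUR


def cpu_seconds(n_comparisons, seconds_each=2.0):
    """Total CPU time, assuming every comparison costs the same."""
    return n_comparisons * seconds_each


# First test: human against mouse.
first_cpu_years = cpu_seconds(850e6) / SECONDS_PER_YEAR
comparisons_per_job = 850e6 / 65000      # average load of one grid job
hours_at_peak_rate = 850e6 / 2e6         # wall-clock hours at 2 million/hour

# Second test: some human genes against many animals.
second_cpu_hours = cpu_seconds(1.7e6) / SECONDS_PER_HOUR

print(f"first test:  {first_cpu_years:.1f} CPU years")
print(f"             {comparisons_per_job:.0f} comparisons per job on average")
print(f"             {hours_at_peak_rate:.0f} hours if the peak rate were sustained")
print(f"second test: {second_cpu_hours:.0f} CPU hours")
```

Under these assumptions the estimates agree with the slide: a little over 50 CPU years for the first test and a little over 900 CPU hours for the second.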
Proteomics Applications in GRID
• Protein surface calculation: the grid will be used to compute the volumetric description of the protein, obtaining a precise representation of the corresponding surface.
Transcriptomics applications
• Computational GRIDs to analyse transcriptomics data
Description
• Develop algorithmic tools for gene expression data analysis in GRID and evaluate computational tools for extracting biologically significant information from gene expression data.
• Algorithms will focus on clustering steady-state and time-series gene expression data, multiple testing and meta-analysis of microarray experiments from different groups, and identification of transcription sites.
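As an illustration of the multiple-testing step, here is a minimal sketch of the Benjamini-Hochberg procedure for controlling the false discovery rate across many genes. The slides do not say which correction the project uses, so the choice of this procedure, the function name and the example p-values are all assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the null is rejected."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k such that p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical per-gene p-values from a differential expression test.
example = [0.039, 0.001, 0.2, 0.008]
print(benjamini_hochberg(example))
```

Because each gene's test is independent until this final step, the per-gene p-values can be computed in separate grid jobs and corrected once the results are gathered.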
Transcriptomics applications
Bioinformatics-specific data analysis allows the GRID user to store and search genetic data, with direct access to the data files stored on the Storage Elements of GRID servers.
Researchers perform their activities regardless of geographical location, interact with colleagues, and share and access data.
Scientific instruments and experiments provide huge amounts of microarray data.
Influenza A Neuraminidase
• Grid-enabled high-throughput in silico screening against Influenza A neuraminidase
• Encouraged by the success of the first EGEE biomedical data challenge against malaria (WISDOM), the second data challenge, battling avian flu, was kicked off in April 2006 to identify new drugs for the potential variants of the Influenza A virus.
• Mobilizing thousands of CPUs on the Grid, the six-week high-throughput screening activity consumed over 100 CPU years of computing power.
• This project demonstrated the impact of a world-wide Grid infrastructure in efficiently deploying large-scale virtual screening to speed up the drug design process.
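A rough consistency check on these figures: delivering 100 CPU years within six weeks implies several hundred CPUs busy on average. The sketch assumes a 365-day year and takes the quoted "over 100 CPU years" as exactly 100, so the result is a lower bound.

```python
cpu_years = 100            # quoted as "over 100", so this is a lower bound
campaign_days = 6 * 7      # six-week data challenge

# CPU-days delivered divided by wall-clock days gives the average
# number of CPUs that were busy at any moment.
average_busy_cpus = cpu_years * 365 / campaign_days

print(f"about {average_busy_cpus:.0f} CPUs busy on average")
```

An average of this size is compatible with thousands of CPUs being mobilized, since not every CPU is busy for the whole campaign.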
Identification of Applications in EELA
E-infrastructure shared between Europe and Latin America
• EELA biomedical applications fall into three categories
– Bioinformatics applications
  - BLAST in Grids
  - Phylogeny
– Computational biochemical processes
  - Wide In Silico Docking On Malaria (WISDOM)
– Biomedical models
  - GEANT4 Application for Tomographic Emission (GATE)
ACGT Project
EuChinaGRID
• Facility for the prediction of the three dimensional structure
of “never born proteins”
Grid added value for international collaboration
on neglected diseases
• Grids offer unprecedented opportunities for sharing information and resources worldwide
Grids are unique tools for:
– Collecting and sharing information (epidemiology, genomics)
– Networking experts
– Mobilizing resources routinely or in an emergency (vaccine and drug discovery)
Molecular applications in GRID
Aim: run docking and molecular dynamics simulations, which usually take a very long time to complete.
Description
• Wide In Silico Docking On Malaria initiative (WISDOM-II): this project performs docking and molecular dynamics simulations on the GRID platform to discover new targets for neglected diseases. Analysis can notably be performed using the data generated by the WISDOM application on the EGEE infrastructure.
Grid impact on drug discovery workflow down
to drug delivery (1/2)
• Grids provide the necessary tools and data to identify new
biological targets
– Bioinformatics services (database replication, workflow…)
– Resources for CPU intensive tasks such as genomics
comparative analysis, inverse docking…
• Grids provide the resources to speed up lead discovery
– Large scale in silico docking to identify potentially promising
compounds
– Molecular dynamics computations to refine virtual screening and
further assess selected compounds
Grid impact on drug discovery workflow down
to drug delivery (2/2)
• Grids provide environments for epidemiology
– Federation of databases to collect data in endemic areas, to study a disease and to evaluate the impact of vaccines and vector control measures
– Resources for data analysis and mathematical modelling
• Grids provide the services needed for clinical trials
– Federation of databases to collect data in the centres participating in the clinical trials
• Grids provide the tools to monitor drug delivery
– Federation of databases to monitor drug delivery
Virtual screening process by docking
[Figure: virtual screening workflow. A starting compound database and a starting target structure model feed DOCKING, which yields predicted binding models; post-analysis then selects compounds for assay.]
Docking: predict how small molecules bind to a receptor of known 3D structure
There are successful examples
– rapid,
– cost effective…
But there are limitations
– CPU and storage needed
Grid-enabled high throughput virtual
screening by docking
[Figure: millions of chemical compounds and a few target structures are fed to the docking software]
• 1 to 30 min per docking
• A few MB per output
• 100 CPU years, 1 TB
• Challenges:
– Speed up the process
– Manage the data
• Large scale deployment on grid infrastructure
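Large-scale deployment of this kind comes down to cutting the compound-by-target matrix into independent jobs. The sketch below shows one plausible way to batch the work; the function names, the batch size and the target labels are hypothetical and do not describe the actual WISDOM production environment.

```python
from itertools import islice


def batches(items, size):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def plan_jobs(targets, compound_ids, batch_size):
    """Yield (target, compound batch) pairs; each pair would be one grid job.

    `compound_ids` must be re-iterable (a list or a range), because it is
    walked once per target.
    """
    for target in targets:
        for chunk in batches(compound_ids, batch_size):
            yield target, chunk


# Toy example: 2 targets x 10 compounds, 4 compounds per job.
jobs = list(plan_jobs(["target_A", "target_B"], range(10), 4))
print(len(jobs), "jobs; first:", jobs[0])
```

The batch size trades scheduling overhead against job length: with 1 to 30 minutes per docking, a batch of a few dozen compounds already yields jobs lasting hours.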
WISDOM-II, second large scale docking
deployment against malaria
Malaria target | Involved in | Biology partners
GST from Plasmodium falciparum | Parasite detoxification | U. of Pretoria, South Africa
DHFR from Plasmodium vivax | Parasite DNA synthesis | U. of Los Andes, Venezuela; U. of Modena, Italy
DHFR from Plasmodium falciparum | Parasite DNA synthesis | U. of Modena, Italy
Tubulin from Plasmodium/plant/mammal | Parasite cell replication | CEA, Acamba project, France
Grid infrastructures and projects contributing
to WISDOM-II
[Figure: infrastructures and projects contributing to WISDOM-II: EMBRACE, BioinfoGrid, SHARE, Auvergrid, EGEE, EUMedGrid, EUChinaGrid, TWGrid, EELA. The legend distinguishes European grid infrastructure, European grid projects, and regional/national grid infrastructure.]
Filtering process
1,000,000 chemical compounds
– Sorting based on scoring in different parameter sets; consensus scoring
10,000 compounds selected
– Based on key interactions
1,000 compounds
– Key interactions, binding modes, descriptors, knowledge of active site
100 compounds
– MD
50 compounds to be tested in experimental lab
Credit: V. Kasam, Fraunhofer Institute
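The first filtering stage combines docking scores obtained with different parameter sets. The slide does not state which consensus rule is applied, so the sketch below uses a simple rank-sum consensus as a stand-in: each compound is ranked within every parameter set and the ranks are added. The compound names and scores are invented, and lower scores are assumed to be better.

```python
def consensus_rank(score_sets):
    """Order compounds by the sum of their ranks across all score sets.

    score_sets: list of dicts mapping compound -> docking score,
    where a lower score is assumed to mean a better predicted binding.
    """
    compounds = set(score_sets[0])
    for scores in score_sets[1:]:
        compounds &= set(scores)          # keep compounds scored everywhere
    total_rank = {c: 0 for c in compounds}
    for scores in score_sets:
        ranked = sorted(compounds, key=lambda c: (scores[c], c))
        for rank, compound in enumerate(ranked, start=1):
            total_rank[compound] += rank
    return sorted(compounds, key=lambda c: (total_rank[c], c))


def select_top(score_sets, keep):
    """Keep the `keep` best compounds under the rank-sum consensus."""
    return consensus_rank(score_sets)[:keep]


set_a = {"c1": -9.0, "c2": -7.5, "c3": -8.2, "c4": -5.0}
set_b = {"c1": -8.1, "c2": -8.4, "c3": -8.0, "c4": -4.0}
print(select_top([set_a, set_b], keep=2))
```

The later stages (key interactions, binding modes, molecular dynamics) depend on expert knowledge of the active site and are not modelled here.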
A grid for neglected diseases
LPC Clermont-Ferrand: biomedical grid
SCAI Fraunhofer: knowledge extraction, chemoinformatics
Univ. Modena: biological targets, molecular dynamics
CEA, Acamba project: biological targets, chemogenomics
HealthGrid: biomedical grid, dissemination
BioinfoGRID: bioinformatics grid
ITB CNR: bioinformatics, molecular modelling
Academia Sinica: grid user interface
Univ. Los Andes: biological targets, malaria biology
Univ. Pretoria: bioinformatics, malaria biology
Use grid technology to foster research and development on malaria and other neglected diseases
Contacts also established with WHO, Microsoft, TATRC, Argonne, SDSC, SERONO, NOVARTIS, Sanofi-Aventis, and hospitals in sub-Saharan Africa
The Cell Cycle
• Cell Cycle:
– repeated sequence of events that leads to the division of a mother cell into daughter cells
– biological process frequently studied in relation to tumour diseases
– considered a valuable target for drug discovery in the context of cancer and neurodegenerative diseases
Systems Biology Approach
• Systems biology studies how biological functions emerge from protein-protein interactions in living systems;
• The complexity of this biological process lies in the high number of genes and protein interaction networks involved;
• Quantifying the behaviour of each cell cycle component plays a crucial role in understanding the complex mechanism of cell cycle regulation.
System Biology: Cell Cycle
Simulation Section
The simulation of a single ODE system describing a cell cycle model
2D plot: image exported in PNG using GnuPlot
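The slides do not give the equations of the cell cycle model, so the sketch below only illustrates the numerical side: a classical fourth-order Runge-Kutta integrator applied to a placeholder one-species model (constant synthesis, first-order degradation). The model, its rate constants and all function names are assumptions chosen because the exact solution is known and can be compared with the numerical one.

```python
import math


def rk4_step(f, t, y, h):
    """Advance the state vector y by one Runge-Kutta 4 step of size h."""
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]


def simulate(f, y0, t0, t1, steps):
    """Integrate dy/dt = f(t, y) from t0 to t1; return [(t, y), ...]."""
    h = (t1 - t0) / steps
    t, y = t0, list(y0)
    trajectory = [(t, y)]
    for i in range(steps):
        y = rk4_step(f, t, y, h)
        t = t0 + (i + 1) * h
        trajectory.append((t, y))
    return trajectory


K_SYN, K_DEG = 1.0, 0.5   # hypothetical rate constants


def toy_model(t, y):
    """Placeholder model: dx/dt = K_SYN - K_DEG * x."""
    return [K_SYN - K_DEG * y[0]]


trajectory = simulate(toy_model, [0.0], 0.0, 10.0, 200)
final_t, final_y = trajectory[-1]
exact = (K_SYN / K_DEG) * (1.0 - math.exp(-K_DEG * final_t))
print(f"numerical {final_y[0]:.6f}, exact {exact:.6f}")
```

A real cell cycle model would replace `toy_model` with a larger system of coupled equations; the `(t, y)` pairs in `trajectory` are the kind of data that would be written out for plotting.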
Tissue Microarray in GRID
Genetic Diseases
High-throughput techniques (e.g. DNA microarray) to screen the whole genome
Low reliability
Validation through Tissue Microarray
Tissue Microarray in GRID
Genes and proteins detection
Tissue Microarray in GRID
[Figure: a user interface (UI) queries the AMGA server; the selected images are then processed on GRID nodes, each with a computing element (CE) and a storage element (SE).]
Edge detection on every TMA on GRID having “age” > 80 AND “gender” = F AND “disease” = colon cancer
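The selection step of this workflow is a metadata query whose result is the list of image files to process. The sketch below mimics that query on in-memory records; it does not use the AMGA client API, and the record fields and logical file names are hypothetical.

```python
# Hypothetical metadata records, one per tissue microarray (TMA) image.
records = [
    {"lfn": "lfn:/grid/example/tma/0001.tif", "age": 84, "gender": "F",
     "disease": "colon cancer"},
    {"lfn": "lfn:/grid/example/tma/0002.tif", "age": 79, "gender": "F",
     "disease": "colon cancer"},
    {"lfn": "lfn:/grid/example/tma/0003.tif", "age": 91, "gender": "M",
     "disease": "colon cancer"},
    {"lfn": "lfn:/grid/example/tma/0004.tif", "age": 88, "gender": "F",
     "disease": "breast cancer"},
]


def select_tma(records, min_age, gender, disease):
    """Return the file names of the records matching all three criteria."""
    return [r["lfn"] for r in records
            if r["age"] > min_age
            and r["gender"] == gender
            and r["disease"] == disease]


selected = select_tma(records, min_age=80, gender="F", disease="colon cancer")
print(selected)
```

Each selected file name would then be handed to an edge-detection job, ideally scheduled close to the storage element that holds the image.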
Deployment of BLAST in Grid
• A large fraction of the biological data produced is publicly available on web or FTP sites
– data can be downloaded as “flat files”.
• A procedure has been set up to
– Check the remote site for an updated version of the DBs
– Automatically download the data
– Register the file in a grid catalogue (LFC)
– Create a DB index for use with BLAST (using the Grid)
– Register the index file(s) in the grid catalogue (LFC)
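The five steps of this procedure can be sketched as a small driver function. The download, catalogue registration and indexing actions are passed in as callables, and the stand-in implementations below are hypothetical: they only build strings, so the sketch runs without grid middleware and does not reflect the real LFC or BLAST command lines.

```python
def needs_update(remote_mtime, local_mtime):
    """True when no local copy exists or the remote file is newer."""
    return local_mtime is None or remote_mtime > local_mtime


def update_database(name, remote_mtime, local_mtime,
                    download, register, build_index):
    """Run the update procedure for one database.

    Returns the catalogue entries created, or [] if nothing changed.
    """
    if not needs_update(remote_mtime, local_mtime):      # step 1: check
        return []
    flat_file = download(name)                           # step 2: download
    registered = [register(flat_file)]                   # step 3: register data
    for index_file in build_index(flat_file):            # step 4: build index
        registered.append(register(index_file))          # step 5: register index
    return registered


# Stand-ins so the sketch runs anywhere; all names are hypothetical.
def fake_download(name):
    return f"{name}.fasta"


def fake_register(path):
    return f"lfn:/grid/example/{path}"


def fake_build_index(flat_file):
    return [f"{flat_file}.idx"]


result = update_database("uniprot", remote_mtime=200, local_mtime=100,
                         download=fake_download, register=fake_register,
                         build_index=fake_build_index)
print(result)
```

Passing the actions as parameters keeps the control flow testable on a laptop while the real implementations talk to the FTP site and the grid.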
Biological Database handling
• The Automatic Updater (AU) constantly monitors FTP sites, looking for new versions of each database
– When a new timestamp is detected on an FTP site, the newest version is automatically downloaded and replaces the older version on the grid
– Before the older version is cleared, an xdelta patch is computed that allows the old version to be regenerated from the new one.
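The idea of a reverse patch, keeping only what is needed to rebuild the old release from the new one, can be illustrated with Python's standard `difflib` module. This is a stand-in for xdelta, not its format or algorithm, and the sample data is invented.

```python
from difflib import SequenceMatcher


def make_reverse_patch(new, old):
    """Describe how to rebuild `old` from `new`.

    The patch is a list of operations: ("copy", start, end) reuses bytes of
    `new`, while ("data", payload) carries bytes that exist only in `old`.
    """
    matcher = SequenceMatcher(None, new, old, autojunk=False)
    operations = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            operations.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            operations.append(("data", old[j1:j2]))
        # "delete": bytes present only in `new`; nothing to store.
    return operations


def apply_patch(new, operations):
    """Regenerate the old version from `new` and the stored operations."""
    pieces = []
    for operation in operations:
        if operation[0] == "copy":
            pieces.append(new[operation[1]:operation[2]])
        else:
            pieces.append(operation[1])
    return b"".join(pieces)


old_release = b">seq1\nACGT\n>seq2\nGGCC\n"
new_release = b">seq1\nACGT\n>seq2\nGGCA\n>seq3\nTTAA\n"

patch = make_reverse_patch(new_release, old_release)
print(apply_patch(new_release, patch) == old_release)
```

Because consecutive database releases share most of their content, such a patch is usually far smaller than a full copy of the old release.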
Biological Database handling
• This data management software dynamically replicates each database according to its usage, balancing the number of replicas, and therefore performance, against the disk space occupied.
• It relies on statistical analysis of database usage by grid jobs, working on data acquired after each job execution: grid queue times, database set-up times and overall job computation time.
• We face complex data challenges, performing both the parsing of the output results and the storage of the data in the database directly from the GRID.
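The slides do not describe the replication policy in detail. As one possible reading, the sketch below gives every database a single replica and then hands out extra replicas greedily to the database with the most jobs per existing replica, as long as the disk budget allows. The heuristic, the function name and the numbers are all assumptions.

```python
def plan_replicas(usage, size_gb, disk_budget_gb, max_replicas=5):
    """Return a dict mapping each database to its planned replica count.

    usage:    database -> number of grid jobs that used it
    size_gb:  database -> size of one replica in GB
    """
    replicas = {db: 1 for db in usage}
    free = disk_budget_gb - sum(size_gb[db] for db in usage)
    if free < 0:
        raise ValueError("budget too small for one replica of each database")
    while True:
        candidates = [db for db in usage
                      if usage[db] > 0
                      and replicas[db] < max_replicas
                      and size_gb[db] <= free]
        if not candidates:
            return replicas
        # The database with the most jobs per replica is the most overloaded.
        busiest = max(candidates, key=lambda db: (usage[db] / replicas[db], db))
        replicas[busiest] += 1
        free -= size_gb[busiest]


usage = {"uniprot": 900, "genbank": 90, "pdb": 10}
size_gb = {"uniprot": 10, "genbank": 50, "pdb": 5}
plan = plan_replicas(usage, size_gb, disk_budget_gb=100)
print(plan)
```

A production policy would also weigh the queue times and set-up times mentioned above; this sketch uses job counts only.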
Results
• To make this software readily accessible, a user interface has been developed.
• It is used to submit jobs to the grid infrastructure, to visualize the results in a clear form and to hide the complexity of the distributed platform.
Results
• The main feature of the portal is that it can completely hide the JDL script layer used for grid job submission.
• While it is still possible to submit a simple job to the grid by writing one's own JDL script, the idea is to hide this process and make the grid more user friendly for the bioinformatics community.
Results
• The interfaces to application jobs are automatically generated by converting XML files that describe both the end-user parameters and the structure of the JDL scripts that have to be generated to submit the jobs.
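A minimal sketch of this XML-to-JDL conversion is shown below. The XML layout, the application name and the script name are invented for illustration, since the portal's real schema is not given in the slides; the generated text uses common JDL attributes (Executable, Arguments, StdOutput, StdError, InputSandbox, OutputSandbox).

```python
import xml.etree.ElementTree as ET

# Hypothetical application description; the real schema is not shown.
APP_XML = """
<application name="blast" executable="run_blast.sh">
  <parameter name="database" default="uniprot"/>
  <parameter name="evalue" default="1e-5"/>
  <input file="query.fasta"/>
  <output file="result.out"/>
</application>
"""


def jdl_list(items):
    """Format a Python list as a JDL list of quoted strings."""
    return "{" + ", ".join('"' + item + '"' for item in items) + "}"


def build_jdl(xml_text, user_values=None):
    """Generate JDL text from the XML description and the user's choices."""
    user_values = user_values or {}
    app = ET.fromstring(xml_text)
    executable = app.get("executable")
    arguments = []
    for parameter in app.findall("parameter"):
        name = parameter.get("name")
        value = user_values.get(name, parameter.get("default"))
        arguments.append("--" + name + " " + value)
    inputs = [executable] + [e.get("file") for e in app.findall("input")]
    outputs = ["std.out", "std.err"] + [e.get("file") for e in app.findall("output")]
    lines = [
        'Executable = "' + executable + '";',
        'Arguments = "' + " ".join(arguments) + '";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        "InputSandbox = " + jdl_list(inputs) + ";",
        "OutputSandbox = " + jdl_list(outputs) + ";",
    ]
    return "\n".join(lines)


jdl_text = build_jdl(APP_XML, {"evalue": "1e-10"})
print(jdl_text)
```

The same `parameter` elements could also drive generation of the web form, so that the form fields and the JDL arguments stay in step.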
Results
• The user can select among different databases against which to perform the analysis; all these databases are updated automatically.
• The figure shows a summary of the submitted application jobs, with information about the analysis software, the global computation status and the user interface used for submission.
Italian Bioinformatics Networks
30 Research Nodes
Milano: LITBIO - Laboratory of Interdisciplinary Technologies in BIOinformatics
Bari: LIBI - Laboratory for International BIoinformatics
Napoli: LAB GTP - LABoratory for the development of Bioinformatics tools and their integration with Genomics, Transcriptomics and Proteomics data
Bari: LBBM - Bioinformatics Laboratory for the Molecular Biodiversity
CNR-BIOINFORMATICS Networks
24 CNR Research Nodes
National Research Council
CNR-Bioinformatics project
Italian PON GRID based Networks
Virtual Physiological Human
Basis is the International Physiome Project, www.physiome.org
• Concept basis
• Computational frameworks and ICT-based tools for multiscale models of the human anatomy, physiology and pathology
• Libraries of data and toolbox for simulation and visualisation
• Patient-specific model from biosignals and images, including molecular images
Loukianos Gatzouli, ICT for Health
Acknowledgments
• BioinfoGRID http://www.bioinfogrid.eu
• EGEE: Enabling Grid for E-science project http://www.eu.egee.org
• EELA: e-Infrastructure between Europe and Latin America project http://www.eueela.org/index.htm
• Euchinagrid: Interconnection & Interoperability of Grids between Europe & China project http://www.euchinagrid.org/
• FIRB-MIUR LITBIO: Laboratory for Interdisciplinary Technologies in Bioinformatics http://www.litbio.org