Bioinformatics GRID and HPC challenges
in Biomedicine and Biosciences.
Milanesi Luciano
National Research Council
Institute of Biomedical Technologies, Milan, Italy
[email protected]
Milanesi Luciano
EGEE’09, 21-25 September 2009, Barcelona, Spain
Outline
Introduction
Bioinformatics data integration and data analysis
Challenges in designing new portals:
the SysBio-Gateway example
Cell Cycle Database
G2S Breast Cancer Database
Nervous System
Kinweb
ProCMD
TMA Rep
Conclusion and Acknowledgments
ICT and Genomics
• A key development in the computational world has been the
arrival of de novo design algorithms that use all the spatial
information available within the target to design novel drugs.
• Coupling these algorithms with the rapidly growing body of
information from structural genomics and with new ICT
technologies (e.g. HPC, GRID, Web Services, bio-inspired
networks) provides a powerful new way to explore drug design
across a broad spectrum of genomic targets.
• This includes more challenging techniques such as
protein–protein interactions, docking, molecular dynamics,
systems biology and gene networks.
Disease Network
Disease-resistant population: geneX = ATGATTATAG
Disease-susceptible population: geneX = ATGTTTATAG
Genotype all individuals for thousands of SNPs.
Resistant people all have an ‘A’ at position 4 in geneX, while
susceptible people have a ‘T’; such a single-base difference is
called a SNP.
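The comparison on this slide can be sketched in a few lines of Python. This is only an illustration, assuming the two sequences are already aligned and of equal length; the function name find_snps is hypothetical and not part of any tool named in the talk.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (position, ref_base, alt_base)
        for position, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


resistant = "ATGATTATAG"    # geneX allele shown for the resistant population
susceptible = "ATGTTTATAG"  # geneX allele shown for the susceptible population

snps = find_snps(resistant, susceptible)
print(snps)  # [(4, 'A', 'T')]
```

In a real genotyping study the same comparison would be repeated over thousands of positions and many individuals, and followed by a statistical association test.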
Systems Biology for Health
Data Integration
Definition of data integration:
the process of combining information that resides in different
resources to give the user a unified view of the data, making it
possible to extract real knowledge.
[Diagram: U.I., Workflow, Jobs, HPC, Grid, SOAP, Results]
Data Integration
• Data integration is essential for obtaining as complete a view
of biological knowledge as possible in emerging fields such as
bioinformatics and systems biology.
• In systems biology, integration spans several levels: genomics,
transcriptomics, proteomics and interaction networks.
• Such data integration is crucial to support the mathematical
modelling and computer simulation of biological pathways.
Data Integration
Data integration can be described at three levels of
complexity:
• the first layer integrates information from heterogeneous
resources by collecting data from different databases under a
unified query schema;
• the second layer identifies correlative associations across
different datasets, generally with ontology support, to provide a
comprehensive and coherent view of the same objects in light of
different data sources;
• the third layer maps the information gained about interacting
objects onto networks and pathways that can serve as basic
models of the underlying cellular systems.
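The three layers can be illustrated with a small Python sketch. The records, field names and function names below are toy assumptions made for illustration only; they do not reproduce the SysBio-Gateway schema.

```python
from collections import defaultdict

# Toy records from two hypothetical sources that use different field names.
SOURCE_A = [{"gene": "CDK1", "descr": "cyclin-dependent kinase 1"},
            {"gene": "CCNB1", "descr": "cyclin B1"}]
SOURCE_B = [{"symbol": "CDK1", "terms": ["GO:0007049"]},
            {"symbol": "CCNB1", "terms": ["GO:0007049"]}]
INTERACTIONS = [("CDK1", "CCNB1")]


def unify(source_a, source_b):
    """Layer 1: merge both sources into one schema keyed by gene symbol."""
    unified = {}
    for rec in source_a:
        entry = unified.setdefault(rec["gene"], {"description": None, "terms": set()})
        entry["description"] = rec["descr"]
    for rec in source_b:
        entry = unified.setdefault(rec["symbol"], {"description": None, "terms": set()})
        entry["terms"].update(rec["terms"])
    return unified


def genes_by_term(unified):
    """Layer 2: associate genes that share an ontology term."""
    index = defaultdict(set)
    for gene, entry in unified.items():
        for term in entry["terms"]:
            index[term].add(gene)
    return dict(index)


def build_network(interactions):
    """Layer 3: map pairwise interactions onto an undirected network."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)
    return dict(network)


unified = unify(SOURCE_A, SOURCE_B)
by_term = genes_by_term(unified)
network = build_network(INTERACTIONS)
```

Each function corresponds to one layer: a unified schema, ontology-based association, and a network that can serve as a starting model.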
SysBio-Gateway Tools framework
The SysBio-Gateway tools framework is implemented for the
following main purposes:
• Simulation of ordinary differential equations (ODEs) to solve
mathematical models.
• In silico estimation of parameter values to develop new cellular
systems biology models.
• Visualization of protein structures and search by the associated
Connolly surfaces.
• Analysis of protein-protein interaction networks (search for
first neighbours, shortest paths and common annotations).
• Modelling of mutant proteins starting from Single Nucleotide
Polymorphism data, based on the Modeller program.
• Image processing to support tissue microarray analysis.
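The three network queries named in the list (first neighbours, shortest path, common annotations) can be sketched as follows. This is a minimal illustration on a toy graph, assuming the network is stored as a dictionary of adjacency sets; the protein names and annotation labels are placeholders, and the portal's actual implementation is not described in the slides.

```python
from collections import deque


def first_neighbors(network, protein):
    """Proteins that interact directly with the given protein."""
    return set(network.get(protein, ()))


def shortest_path(network, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path, or None."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in network.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                if neighbor == goal:
                    # Walk back from the goal to the start, then reverse.
                    path = [goal]
                    while previous[path[-1]] is not None:
                        path.append(previous[path[-1]])
                    return path[::-1]
                queue.append(neighbor)
    return None


def common_annotations(annotations, proteins):
    """Annotation terms shared by every protein in the set."""
    term_sets = [set(annotations.get(p, ())) for p in proteins]
    return set.intersection(*term_sets) if term_sets else set()


# Toy, made-up network and annotations (placeholders only).
toy_network = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}, "E": set()}
toy_annotations = {"A": {"term1", "term2"}, "B": {"term2"}, "C": {"term2", "term3"}}

print(first_neighbors(toy_network, "B"))
print(shortest_path(toy_network, "A", "D"))
print(common_annotations(toy_annotations, ["A", "B", "C"]))
```

Breadth-first search is sufficient here because the toy interactions are unweighted; a weighted network would call for a different algorithm such as Dijkstra's.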
SysBio-Gateway Tools framework
The global optimization algorithm relies on an evolution strategy to
resolve the uncertainty in parameter values:
• The system takes as input an ODE-based model (formatted as
plain text or encoded in the SBML standard) together with the
experimental data, and outputs the best-fitting parameters.
• During the computation, candidate solutions of separate
evolution processes are swapped every k iterations through a
relational database.
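A heavily simplified sketch of this idea is given below. It is not the SysBio-Gateway algorithm. It assumes a one-parameter toy model dx/dt = -k x, a (1+1) evolution strategy on each island, and a plain dictionary standing in for the relational database through which candidates are exchanged. The migration rule (every island adopts the best published candidate) is also an assumption.

```python
import random


def simulate(k, x0=1.0, dt=0.1, steps=50):
    """Forward-Euler integration of the toy ODE dx/dt = -k * x."""
    values, x = [], x0
    for _ in range(steps):
        values.append(x)
        x += dt * (-k * x)
    return values


def error(k, data):
    """Sum of squared differences between the simulated and the observed series."""
    return sum((model - observed) ** 2 for model, observed in zip(simulate(k), data))


def evolve(data, islands=2, iterations=300, swap_every=20, sigma=0.1, seed=0):
    """Run independent (1+1) evolution strategies that exchange candidates periodically."""
    rng = random.Random(seed)
    best = [rng.uniform(0.0, 2.0) for _ in range(islands)]
    shared = {}  # in-memory stand-in for the relational database (assumption)
    for iteration in range(1, iterations + 1):
        for i in range(islands):
            # Mutate the current candidate and keep the child only if it fits better.
            child = best[i] + rng.gauss(0.0, sigma)
            if error(child, data) < error(best[i], data):
                best[i] = child
            shared[i] = best[i]  # publish the island's current candidate
        if iteration % swap_every == 0:
            # Migration step: every island adopts the best published candidate.
            champion = min(shared.values(), key=lambda k: error(k, data))
            best = [champion] * islands
    return min(best, key=lambda k: error(k, data))


TRUE_K = 0.5                      # hypothetical "unknown" rate constant
observed = simulate(TRUE_K)       # synthetic, noise-free stand-in for experimental data
k_estimate = evolve(observed)
print(f"estimated k = {k_estimate:.3f}")
```

With real experimental data the model would have many parameters and the islands would run as separate jobs, which is where a shared database becomes a convenient exchange point.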
SysBio-Gateway Ontology
The databases developed within SysBio-Gateway are enriched
through ontological integration.
• The ontologies used cover all levels of molecular biology, from
genes to proteins to pathways, tissues and diseases.
• For example, we used Gene Ontology (GO) for gene annotation
and the KEGG Pathway Ontology (derived from the hierarchical
organization of KEGG pathways) for biological networks.
• Ontologies not only provide a commonly accepted vocabulary,
which facilitates data sharing and information querying, but
also improve the performance of statistical and analytical studies.
SysBio-Gateway
• SysBio-Gateway is an example of a unified framework that
embraces a set of specialized biomedical databases and
web resources oriented to different biological topics, such as:
• Biological processes (Cell Cycle)
• Pathologies (Breast Cancer)
• Organs (Brain and the Nervous System)
• Protein families (Protein kinases)
• Protein mutations (Protein C mutations)
• Tissues (Tissue microarray)
• SysBio-Gateway relies both on a common data integration
methodology and on a set of tools developed ad hoc to enable
investigation from a systems biology perspective.
SysBio-Gateway
http://www.itb.cnr.it/sysbio-gateway
The Cell Cycle
Cell Cycle:
a repeated sequence of events that leads to the division of a
mother cell into daughter cells.
A biological process frequently studied in relation to
tumour diseases.
It is considered a valuable target for drug discovery in the
context of cancer and neurodegenerative diseases.
Cell Cycle Database
• The “Cell Cycle Database” (CCDB) is a resource that
collects useful information about the genes and proteins
involved in the cell cycle process, together with mathematical
models of the process.
• The integrated information covers the following
eukaryotic organisms:
• the budding yeast Saccharomyces cerevisiae
• Homo sapiens
• Yeast and human were chosen because their cell cycle
machinery is known in depth and because evolutionary
conservation of the basic regulatory events of the cell
cycle has been demonstrated.
CCDB: the web interface
• Cell Cycle Database is accessible through a web interface
made up of HTML pages dynamically generated from PHP
scripts.
• URL: http://www.itb.cnr.it/cellcycle/
CCDB: the integrative system
Resources provided as external links
Resources from which data are taken
CCDB Gene Report: main features
• The gene report collects information by integrating data from
different tables.
• The gene report is linked to different external biological
databases.
• The gene report is linked to the protein report.
CCDB Protein Report: main features
• The protein report collects information by integrating data from
different tables.
• The protein report is linked to the different biological
databases from which we take data.
• The protein report is linked to the gene report.
CCDB Model Report
• Users can find:
• the mathematical description of each model
• the kinetic and differential equations
• the related parameters (the rate constants and the initial
protein concentrations) defined for the simulation analysis.
Mathematical section
Simulation section
Users can perform the simulation of a single ODE system
describing a cell cycle model.
2D plot: image exported as PNG using gnuplot
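As an illustration of what such a simulation involves, the sketch below integrates a single toy ODE with the classical fourth-order Runge-Kutta scheme. The model (constant synthesis and first-order degradation of one protein), its rate constants and all names are assumptions chosen for brevity; they are not taken from any CCDB model, and the slides do not say which solver the portal uses.

```python
import math


def rk4(f, y0, t_end, dt):
    """Integrate dy/dt = f(t, y) for a scalar y with the classical RK4 scheme."""
    steps = round(t_end / dt)
    t, y = 0.0, y0
    trajectory = [(t, y)]
    for _ in range(steps):
        k1 = f(t, y)
        k2 = f(t + dt / 2, y + dt * k1 / 2)
        k3 = f(t + dt / 2, y + dt * k2 / 2)
        k4 = f(t + dt, y + dt * k3)
        y += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += dt
        trajectory.append((t, y))
    return trajectory


K_SYN = 1.0   # hypothetical synthesis rate
K_DEG = 0.5   # hypothetical degradation rate constant
C0 = 0.0      # hypothetical initial protein concentration


def protein_rate(t, c):
    """Toy kinetic law: dC/dt = K_SYN - K_DEG * C."""
    return K_SYN - K_DEG * c


trajectory = rk4(protein_rate, C0, t_end=10.0, dt=0.01)

# Closed-form solution of the same toy model, used as a sanity check.
exact_end = K_SYN / K_DEG + (C0 - K_SYN / K_DEG) * math.exp(-K_DEG * 10.0)
print(trajectory[-1][1], exact_end)
```

The (time, concentration) pairs in trajectory are the kind of two-column data a plotting tool such as gnuplot can turn into a 2D plot.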
G2BCDB web interface
http://www.itb.cnr.it/breastcancer
Genes-to-Systems Breast Cancer Database
• The Genes-to-Systems Breast Cancer (G2SBC) Database
is a bioinformatics resource that integrates information
about genes, transcripts and proteins reported in the
literature to be altered in breast cancer cells.
• The resource includes a section dedicated to mathematical
models related to carcinogenesis, tumor growth and tumor
response to treatments.
• This comprehensive resource is dedicated to molecular and
systems biology of breast cancer, including both the
building-blocks level (genes, transcripts and proteins) and
the systems level (molecular and cellular systems).
G2BCDB: web interface and analysis tools
• Query system at the molecular, systems and cellular levels
• Ontology-based query system
• Query system based on biochemical pathways and the
protein-protein interaction network
• Search tool for common annotations in a gene set
• Analysis and simulation of mathematical models (with a specific
connection between the breast cancer genes and the Cell Cycle
Database and its cell cycle models)
G2BCDB: test case
Example of analysis combining the use of G2SBC Database tools
BRAIN: what level?
At what level can a systems biology strategy be implemented?
Faugeras O. et al., 2007, Journal of Physiology
Nervous System Database
• The Nervous System Database (NSD) deals with the integration of
data on the nervous system:
• Knowledge about biological components, such as genes and
proteins, is collected together with system-level information,
such as protein networks and molecular pathways, that is relevant
for a better exploration of neural systems.
• The annotations stored in NSD are associated with ontological
terms: this provides a semantic layer that improves data
storage, accessibility and sharing, and offers an instrument to
identify relations among biological components.
• The NSD system supplies information about gene functions,
the processes in which genes are involved and their spatial
localization.
NEUROINFORMATICS
Description, matching and retrieval of 3D anatomical data
(Extended Reeb graphs and size
functions)
NSD: the web interface
http://www.itb.cnr.it/gncdb
Kinweb kinase database
• Eukaryotic protein kinases (ePKs) constitute one of the largest
recognized protein families represented in the human genome.
• The common name kinase is applied to enzymes that catalyze the
transfer of the terminal phosphate group from ATP to a receptor
substrate.
• All kinases catalyze essentially the same phosphoryl transfer reaction;
however, they display remarkable diversity in their structures, substrate
specificity, and the pathways in which they participate.
• ePKs are important players in virtually every signaling pathway
involved in normal development and disease.
• The results of the human kinome analysis are collected in the KinWeb
database, which can be browsed and searched over the internet. It makes
available all results from the comparative analysis and the gene structure
annotation, alongside the domain information.
Kinweb: the web interface
http://www.itb.cnr.it/kinweb
Kinweb: Kinase Protein Analysis
ProCMD: a 3D web resource for protein C mutants
• Activated Protein C (ProC) is an anticoagulant plasma serine
protease which also plays an important role in controlling
inflammation and cell proliferation
• Structure prediction and computational analysis of the
mutants have proven to be a valuable aid in understanding
the molecular aspects of clinical thrombophilia
• ProCMD is a relational database which collects data on
clinical, structural and functional properties of the protein C
variants with a graphic interface to search and visualize the
mutated residue in the protein structure
• The main purpose of this tool is to provide easy access
to mutant structural data through the use of 3D interactive
viewers (VRML and RasMol)
ProCMD: the web interface
http://www.itb.cnr.it/procmd
ProCMD: test case
• The ProCMD system allows the user to retrieve entries:
• by position in the sequence of a mutated residue,
• by amino acid substitution,
• by keyword,
• by domain localization.
The results appear in a list of mutations, each linked to a dedicated
‘details page’ where the mutant is fully described.
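The four retrieval criteria can be sketched as a simple in-memory filter. This is an illustration only, not the ProCMD implementation: the Mutant record, the parse and search helpers, the domain labels and the notes are all hypothetical, and every entry other than the mutation code G216D (which appears in the talk) is made up.

```python
import re
from dataclasses import dataclass

# Mutation codes such as "G216D": wild-type residue, position, variant residue.
MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


@dataclass(frozen=True)
class Mutant:
    code: str
    wild_type: str
    position: int
    variant: str
    domain: str   # placeholder label, not a real ProCMD domain name
    note: str     # free text searched by keyword


def parse(code, domain="", note=""):
    """Split a mutation code into its three components."""
    match = MUTATION.match(code)
    if match is None:
        raise ValueError(f"not a single-residue substitution: {code!r}")
    wild_type, position, variant = match.groups()
    return Mutant(code, wild_type, int(position), variant, domain, note)


def search(entries, position=None, substitution=None, keyword=None, domain=None):
    """Filter entries; substitution is a (wild_type, variant) pair such as ("G", "D")."""
    hits = []
    for entry in entries:
        if position is not None and entry.position != position:
            continue
        if substitution is not None and (entry.wild_type, entry.variant) != substitution:
            continue
        if keyword is not None and keyword.lower() not in entry.note.lower():
            continue
        if domain is not None and entry.domain != domain:
            continue
        hits.append(entry)
    return hits


# Toy entries: only the code G216D comes from the talk; all other values are invented.
entries = [
    parse("G216D", domain="placeholder-domain-1", note="example entry"),
    parse("A10V", domain="placeholder-domain-2", note="made-up entry"),
]

print([m.code for m in search(entries, position=216)])
print([m.code for m in search(entries, substitution=("A", "V"))])
```

In a relational database the same filters would translate into WHERE clauses on the corresponding columns.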
ProCMD: test case
Details page for mutant G216D with 3D images gallery
Tissue Microarray -TMA Rep
• The Tissue MicroArray (TMA) technique is becoming increasingly
important in pathology for validating experimental data from
transcriptomics analysis.
• We propose a web-oriented Tissue MicroArray system that supports
researchers in managing bio-samples and that, through the use of
ontologies, enables tissue sharing in order to promote the design
of TMA experiments and the evaluation of their results.
• The system provides ontological descriptions both for
pre-analysis tissue images and for post-process image results,
a crucial feature for promoting information exchange.
• Through this system, users associate an ontology-based description
with each image uploaded into the database and also integrate
results with the ontological descriptions of the genes identified in
each tissue.
TMA Platform Overview
Networks of resources
• New biological and biomedical technological platforms,
combined with HPC and GRID technology, will be particularly
useful for dealing with the increasing amount, complexity,
and heterogeneity of biological and biomedical data.
• Bioinformatics applications for eHealth have become an
ideal research area where computer scientists can apply
and further develop new intelligent computation methods, in
both experimental and theoretical cases.
Networking People
Data analysis tailored to biomedical applications can allow
the user to store and search genetic data, with direct
access to the data files and applications on GRID servers.
Conclusion
• Here we present an integrated solution for exploring part of the
information gained in the life sciences, oriented to systems biology.
• The SysBio-Gateway integrated portal combines:
• a bioinformatics approach, i.e. data integration using a data
warehouse approach
• application of tools for data analysis
• study of structural modifications, for both genomes and proteins
• a systems biology approach, i.e. the study of protein-protein
interaction networks
• molecular mathematical models
• pathological states from a systemic point of view.
Acknowledgments
People of the Institute for Biomedical Technologies-CNR:
Luciano Milanesi,
Roberta Alfieri,
Ettore Mosca,
Federica Viti,
Pasqualina D'Ursi,
Ivan Merelli
Chiara Bishop and John Hatton (Graphical Web Interface)
BioinfoGRID: http://www.bioinfogrid.eu
EGEE: http://www.eu-egee.org
FIRB-MIUR LITBIO: Laboratory for Interdisciplinary
Technologies in Bioinformatics, http://www.litbio.org
FIRB-MIUR ITALBIONET: Italian Bioinformatics Network