Integromics: a grid-enabled platform
for integration of advanced
bioinformatics tools and data
Luca Corradi
[email protected]
BIO-Lab, DIST
University of Genoa
Integromics
• Cancer research goal: tailor treatment to the molecular profile of an
individual patient's tumor
• Microarrays and other 'omic' technologies make it possible to study tens of thousands
of genes simultaneously
• The tools and methodologies in use lack standardization and repeatability
• Need for an "integromic" platform to:
– Develop integrative ('integromic') analyses of the data
– Combine tools available for genomics
Better results, higher quality of work
Focus on...
• How to exploit the backend gLite infrastructure and an HPC
environment to integrate bioinformatics tools and data
• How a Grid Portal can:
– integrate heterogeneous tools and data
– simplify user interaction through customized web interfaces
– increase usability and efficiency
• Case study: example of correlation between genomics data and
clinical data through a combination of processing tools provided
by the platform
The challenges
• Manage large volumes of bioinformatics data
• Deal with complex issues such as different formats, distributed locations, time-consuming tasks, and computational needs
• Integrate heterogeneous tools and platforms
• Speed up the analysis process through automated methodologies
• Improve efficiency and quality of work
• Make the system usable and accessible
Microarray technology
• Computation of gene expression values for thousands of genes at the same
time
• Collection of microscopic DNA spots, representing single genes, arrayed
on a solid surface by covalent attachment to chemically suitable matrices
• Estimation of the absolute value of gene expression
The use case
• Analyse large microarray datasets for breast cancer prognosis
assessment
• Run several R/Bioconductor scripts
• Deploy a re-usable and reliable service
• Avoid errors, increase repeatability
• Create a processing pipeline where new algorithms and data
analysis techniques can be tested
• Create a set of “atomic” components that can be combined into
workflows
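The idea of "atomic" components combined into workflows can be sketched as follows. This is a minimal illustration, not the platform's implementation: `compose` and `record` are hypothetical helper names, and the placeholder steps only log their own name instead of running a real analysis.

```python
from typing import Callable, Dict, List

# An atomic step takes a data dictionary and returns an updated one.
Step = Callable[[Dict], Dict]


def compose(steps: List[Step]) -> Step:
    """Chain atomic steps into one workflow; each step receives the previous step's output."""
    def workflow(data: Dict) -> Dict:
        for step in steps:
            data = step(data)
        return data
    return workflow


def record(name: str) -> Step:
    """Build a placeholder step that only logs its name (stands in for a real analysis)."""
    def step(data: Dict) -> Dict:
        return {**data, "history": data.get("history", []) + [name]}
    return step


# Hypothetical three-step pipeline built from reusable components.
pipeline = compose([record("normalize"), record("expression"), record("clustering")])
result = pipeline({"dataset": "example"})
print(result["history"])  # ['normalize', 'expression', 'clustering']
```

Because every step shares the same signature, a new algorithm can be tested by swapping one entry in the list without touching the others.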
Data Analysis Tools
R/Bioconductor
• Free software environment for statistical computing and graphics
• Bioconductor is a collection of R packages for the bioinformatics
community
• Active user community
dChip
• Free software for analysis and visualization of gene expression
data
Affymetrix Power Tools (APT)
• Cross-platform command line programs that implement algorithms
for analyzing Affymetrix GeneChip arrays
Parallel dChip execution
• Module 1
– n jobs, each opening N/n files and normalizing them
– Each job produces N/n CSV files (matching the input files)
• Module 2
– m jobs, each opening all N CSV files and computing gene expression values for a certain group of genes
– Each job produces one CSV file
• Module 3
– One job opening the m expression files
– It searches for differentially expressed genes and performs clustering of the results
[Diagram: CEL files 1 … N (N/n per job) → Mod1 jobs 1 … n → CSV files 1 … N → Mod2 jobs 1 … m → CSV files 1 … m → Mod3]
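The three-module decomposition can be sketched as a job plan. This is an illustrative sketch only: it does not call dChip, and `split_evenly`, `plan_jobs`, the `expr_k.csv` output names and the gene groups passed in are hypothetical.

```python
import os
from typing import Dict, List


def split_evenly(items: List[str], parts: int) -> List[List[str]]:
    """Split items into `parts` contiguous chunks whose sizes differ by at most one."""
    if parts < 1:
        raise ValueError("parts must be >= 1")
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def plan_jobs(cel_files: List[str], gene_groups: List[List[str]], n: int) -> Dict:
    """Describe the jobs of the three modules without running anything."""
    # Module 1: n jobs, each normalizing about N/n CEL files into matching CSV files.
    mod1 = [
        {"inputs": chunk, "outputs": [os.path.splitext(f)[0] + ".csv" for f in chunk]}
        for chunk in split_evenly(cel_files, n)
    ]
    all_csv = [out for job in mod1 for out in job["outputs"]]
    # Module 2: m jobs, each reading all N CSV files for one group of genes.
    mod2 = [
        {"inputs": all_csv, "genes": group, "output": f"expr_{k + 1}.csv"}
        for k, group in enumerate(gene_groups)
    ]
    # Module 3: a single job that reads the m expression files.
    mod3 = {"inputs": [job["output"] for job in mod2]}
    return {"mod1": mod1, "mod2": mod2, "mod3": mod3}
```

Module 1 parallelizes over input files and Module 2 over gene groups, so n and m can be tuned independently to the number of available worker nodes.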
Parallel APT execution
The service
• Analyze large microarray
datasets for breast cancer
prognosis assessment
• Concatenate phenodata
and expression results
• Mix of custom and R
programs
• Automatic analysis and plot
creation
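The step that concatenates phenodata and expression results can be pictured as a join on a shared sample identifier. The sketch below is an assumption about how such a join might look, not the service's actual code; the column name `sample` and both function names are hypothetical.

```python
import csv
import io
from typing import Dict, List


def read_table(text: str, key: str) -> Dict[str, Dict[str, str]]:
    """Parse CSV text into a dict keyed by the sample identifier column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def merge_by_sample(pheno_csv: str, expr_csv: str, key: str = "sample") -> List[Dict[str, str]]:
    """Join phenodata and expression rows that share the same sample identifier."""
    pheno = read_table(pheno_csv, key)
    expr = read_table(expr_csv, key)
    merged = []
    for sample_id, pheno_row in pheno.items():
        if sample_id in expr:  # keep only samples present in both tables
            merged.append({**pheno_row, **expr[sample_id]})
    return merged
```

Samples that appear in only one of the two tables are dropped, which keeps the downstream statistics restricted to patients with both clinical and expression data.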
The BioMedicalPortal
Based on EnginFrame, an industry-proven, production-grade grid portal
(used by public and private, academic and industry customers worldwide)
BMPortal Architecture
[Architecture diagram; recoverable labels: Grid users, non-Grid users, Client Apps, BM Portal, User Web Interface, EnginFrame, Web Service Interface, APIs, GSAF, Secure Storage, AMGA Grid, AMGA local, gLite WLM, Clusters (LSF, PBS, LL, etc.), Other Grids (NorduGrid, Globus, SRB, AliEn, etc.), other Grid DBs]
• Based on the EnginFrame product from NICE srl
• Data management and secure storage layers are based on GSAF / Secure Storage APIs
BioMedicalPortal services
• User management, authentication and authorization services
• Data management (extension to metadata support on the Grid)
• Job submission (Grid, local, remote cluster) and monitoring
• Support for every programming and scripting language
• Plugin strategy for application integration
• Web services interface
• Workflow management system
• Many software packages and applications already integrated, etc.
gLite plugin & GWT
• Authentication, Authorization using VOMS (client
side applet is coming)
• Job submission and monitoring, result retrieval and
visualization
• Preference settings (RB, CE, …)
• Traditional LFC based data management
• New Google Web Toolkit interfaces for GSAF
integration via Java API using VOMS credentials
Testbed architecture
[Diagram: job flow from the users' browsers through BMPortal (gLite UI, EF Server & Agent) to the EGEE gLite infrastructure or a local or remote cluster (LSF). Recoverable captions:]
• User submits and monitors work via a standard web browser (Win, LX, Mac, UX)
• BMPortal checks input parameters and files (primary and include), and submits a job to gLite
• The RB matches the user requirements with the available resources on the Grid
• The job starts (Application)
• Streaming output allows the user to monitor the progress of the job
• Job is done; results are written to the input file directory
• User can check the job status, exit messages or results
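The matchmaking step, in which the RB matches user requirements with available resources, can be illustrated with a deliberately simplified sketch. In gLite, requirements are expressed in JDL; the code below is only a stand-in that treats them as minimum values, and the attribute names (`free_cpus`, `memory_gb`) and the function `match_resources` are hypothetical.

```python
from typing import Dict, List


def match_resources(requirements: Dict[str, float], resources: List[Dict]) -> List[str]:
    """Return the names of resources that meet every minimum requirement."""
    matched = []
    for resource in resources:
        # A resource qualifies only if each required attribute is present and large enough.
        if all(resource.get(attr, 0) >= minimum for attr, minimum in requirements.items()):
            matched.append(resource["name"])
    return matched
```

A real broker would also rank the matching resources before choosing one; this sketch stops at filtering.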
Analysis /1
• EnginFrame Grid
portal interface (web
access)
• Input data selection
(Affy .CEL files,
phenodata, gene list)
Analysis /2
– Services execution &
monitoring
– Users can come back
after coffee
Analysis /3
• Result
visualization in
portal spooler
area (txt files,
images, etc.)
Impact
• Aimed at biomedical researchers without specific computational skills
• The collaboration between molecular oncologists and software engineers
allowed the system to be optimized without losing flexibility
• Scales to data sizes beyond the limits of currently available desktop
personal computers
• Following the Software as a Service paradigm, users can focus on
experimental design rather than infrastructure.
Atomic services
• Each processing step is
an “atomic” service
• Services can be invoked
one by one
• Services are currently
composed using
EnginFrame portal
features and LSF
scheduler tools
• But…
Current work (1)
• Visual and easy workflow (WF)
monitoring
• Totally integrated with
the EnginFrame job
monitoring and data
access
• Useful for very
long-running workflows
• User-designed “virtual
experiments”
Current work (2)
Integration of new
algorithms
(multi-chip quality control,
cross-platform data
integration, etc.)
Current work (3)
Ability to run different analyses in parallel
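Running independent analyses side by side can be sketched with a local thread pool. This is only a stand-in: on the platform the analyses would be submitted as separate jobs to the Grid or a cluster, and the function name `run_in_parallel` and the toy statistics used as "analyses" are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Sequence


def run_in_parallel(
    analyses: Dict[str, Callable[[Sequence[float]], float]],
    values: Sequence[float],
) -> Dict[str, float]:
    """Run every analysis on the same input concurrently and collect results by name."""
    with ThreadPoolExecutor(max_workers=max(1, len(analyses))) as pool:
        # Submit all analyses first so they can run at the same time.
        futures = {name: pool.submit(func, values) for name, func in analyses.items()}
        # Then wait for each result; an exception in an analysis is re-raised here.
        return {name: future.result() for name, future in futures.items()}
```

Each analysis only reads the shared input, so no locking is needed and the results can be gathered in any order.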
Acknowledgements
• Part of this work is developed within the Italian FIRB project
LITBIO (Laboratory for Interdisciplinary Technologies in
BIOinformatics).
• Thanks are due to Ulrich Pfeffer and his functional genomics
group at IST (National Institute for Cancer Research) of Genoa,
Italy for their support.
Thank you!