Never born proteins
Biological applications in GRID:
the EUChinaGRID experience
F. Polticelli – University Roma Tre
EUChinaGRID WP4-Applications Manager
Budapest, 1.10.2007
FP6−2004−Infrastructures−6-SSA-026634
Outline
EUChinaGRID Overview
The structural genomics challenge
Biological Applications in EUChinaGRID
• The “never born proteins”
• Protein structure prediction using GRID
– Rosetta integration within the Genius portal
– Early/Late stage integration in the Gridsphere portal
• Structure validation using GRID
– AMBER deployment on GRID
Conclusions and perspectives
• function recognition and catalytic site identification tools
• In silico structural genomics
Fabio Polticelli UROM3 EGEE07 Budapest 1-10-2007
EUChinaGRID Overview
Aim
• provide support actions to foster the integration and interoperability of
the Grid infrastructures in Europe (EGEE) and China (CNGrid).
• promote the migration of new applications on the Grid infrastructures by
training new user communities and supporting the adoption of grid tools
for scientific applications.
Applications
• validate the intercontinental infrastructure using scientific applications
• facilitate porting of new applications relevant for scientific and industrial
collaboration between Europe and China.
• three main application fields:
– EGEE Applications (ATLAS and CMS)
– Astroparticle Physics applications (the ARGO experiment)
– Biology applications (“Never Born Proteins”)
The structural genomics challenge
The combination of the 20 natural amino acids in a specific sequence in
a protein chain dictates the three-dimensional structure of the protein.
Protein function is linked to the specific three-dimensional arrangement
of amino acid functional groups.
With the advancement of molecular biology techniques a huge amount
of information on protein sequences has been made available but far
less information is available on structure and function of these proteins.
Prediction of protein structure and function is a key instrument to better
understand the protein folding principles and successfully exploit the
information provided by the “genomic revolution”.
The test case: Never born proteins
With 20 different comonomers, a protein chain of just 60 amino acids
can theoretically exist in 20^60 chemically and structurally unique
combinations.
But the number of natural proteins (10^9 to a maximum of 10^13) is just a
tiny fraction of all possible proteins.
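As a quick check of the orders of magnitude quoted on this slide, the size of the sequence space and the fraction covered by natural proteins can be computed directly. This is only an illustrative sketch; the variable names are ours and the upper estimate of 10^13 natural proteins is taken from the slide.

```python
# Number of distinct 60-residue chains that can be built from 20 amino acids
n_sequences = 20 ** 60

# Upper estimate of the number of natural proteins quoted on the slide
natural_upper = 10 ** 13

# Fraction of the 60-residue sequence space that natural proteins could cover
fraction = natural_upper / n_sequences

print(f"possible sequences: {n_sequences:.3e}")
print(f"fraction explored:  {fraction:.1e}")
```

Even with the most generous estimate, the natural proteins account for a vanishingly small part of the possible sequences (roughly one part in 10^65).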
There exists a huge number of protein sequences that have never been
exploited by biological systems, in other words an enormous number
of “never born proteins” (NBP). These pose the following questions:
– By which criteria have the existing proteins been
selected?
– Do natural proteins have peculiar properties, for example in terms
of thermal stability, solubility in water or amino acid
composition?
– Can NBP be exploited for biomedical and/or biotechnological
purposes?
Never born proteins and GRID
The problem is tackled by a “high throughput” approach made feasible
by the use of the GRID infrastructure
A huge library of random amino acid sequences of fixed length (n = 70) is
generated.
“Ab initio” protein structure prediction software is used to predict their structures.
The structural characteristics of the resulting proteins are analysed:
• frequency of compact and yet unknown folds
• presence of putative catalytic sites
• experimental validation on “interesting” cases
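The library generation step can be sketched as follows. This is a minimal illustration, not the procedure actually used in the project: it assumes that each of the 20 amino acids is drawn with equal probability, and the function names `random_sequence` and `build_library` are hypothetical.

```python
import random

# One-letter codes of the 20 natural amino acids
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_sequence(length, rng):
    """Return one random sequence, drawing each residue uniformly (assumption)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def build_library(n_sequences, length=70, seed=0):
    """Return a reproducible list of random sequences of fixed length."""
    rng = random.Random(seed)  # fixed seed so the library can be regenerated
    return [random_sequence(length, rng) for _ in range(n_sequences)]


library = build_library(5)
for seq in library:
    print(seq)
```

Each sequence in such a library would then be the input of one structure prediction job.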
The tool: Rosetta abinitio
Developed by David Baker, University of Washington.
Based on a “fragment assembly” strategy.
Uses a semi-empirical force field to evaluate the thermodynamics of the
predicted structure.
Particularly successful in the prediction of novel folds in the CASP
(Critical Assessment of Structure Prediction) competitions.
Rosetta abinitio has been deployed on the GRID through the GENIUS
interface, with the option of parametric job submission to run a large
number of jobs (structure predictions) at the same time.
First step: Integration on the GILDA facility
Single job execution on GILDA
• A shell script has been prepared which:
– registers the program executable and the required input files
(fragment libraries and secondary structure prediction file)
in the LFC catalog
– calls the Rosetta executable and proceeds with workflow
execution.
• A JDL file was created to run the application on the GILDA
worker nodes, which use the gLite middleware.
Integration on the GENIUS web portal
To facilitate the use of the Rosetta abinitio application within the grid
environment by the computational biology community, the application
was integrated within the GENIUS portal (https://glite-tutor.ct.infn.it).
After MyProxy server initialization, all remaining steps are carried out from
within the portal: uploading of input files and executable, JDL file preparation,
running the application, monitoring the run status and downloading the output file.
Given the huge number of “never born proteins” to be simulated, a
procedure for the automatic generation of parametric JDL files has been set up
within the GENIUS environment.
More than 2 x 10^4 never born protein structures have been predicted so far.
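The idea of generating a parametric JDL file automatically can be sketched as below. This is not the GENIUS procedure itself, which is not shown in these slides. The script name `run_rosetta.sh` and the output file names are hypothetical, and the attribute names (`JobType`, `Parameters`, `ParameterStart`, `ParameterStep`, the `_PARAM_` placeholder) follow our understanding of the gLite JDL for parametric jobs and should be checked against the middleware documentation.

```python
def make_parametric_jdl(n_jobs, start=0, step=1):
    """Return the text of a parametric JDL file describing n_jobs sub-jobs."""
    # Assumption: an integer Parameters value is the exclusive upper bound
    # of the counter that replaces _PARAM_ in each sub-job.
    upper = start + n_jobs * step
    lines = [
        'JobType = "Parametric";',
        'Executable = "run_rosetta.sh";',  # hypothetical wrapper script
        'Arguments = "_PARAM_";',          # index of the sequence to predict
        f"Parameters = {upper};",
        f"ParameterStart = {start};",
        f"ParameterStep = {step};",
        'StdOutput = "nbp_PARAM_.out";',
        'StdError = "nbp_PARAM_.err";',
        'InputSandbox = {"run_rosetta.sh"};',
        'OutputSandbox = {"nbp_PARAM_.out", "nbp_PARAM_.err"};',
    ]
    return "\n".join(lines) + "\n"


jdl_text = make_parametric_jdl(1000)
print(jdl_text)
```

A single file of this kind describes many structure predictions at once, so one submission replaces a large number of individually written JDL files.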
GENIUS screenshots
Never born proteins structure examples
Early/Late stage
Developed by Irena Roterman’s group, Jagiellonian University.
A program for protein folding simulation, not structure prediction
(a complementary approach to Rosetta), based on two stages:
– early stage: statistics using a database of known sequences;
– late stage: energy minimization in alternating potentials; this stage is the
most computationally expensive.
Early/Late stage has been deployed on the GRID through the
GridSphere Portal Framework and the Gridwise Tech LCG API package, which
provides access to the gLite middleware.
Early/Late stage
A self-contained bundle of the programs and libraries needed to run the
application was created and registered in the LFC catalogue.
A script was created to install the application on site each time a job is
started.
A JDL file was created to run the application on the grid, which uses the
gLite middleware.
Finally, to let users who are not familiar with the grid run the
application, it was decided to integrate it in a web portal based on the
GridSphere Portal Framework.
Never born proteins. What’s next?
Build consensus
Experimental validation
• Nuclear magnetic resonance (NMR) data acquired in the NMR
centre of Peking University
• Experimental data contain all the information about the primary
structure of the protein, about topology and bonds.
• NMR structure calculation and refinement is an iterative process
which, for a single protein, involves many starting structures, normally
200 structures per round, and each protein may need 10-30 (or
more) rounds of calculations.
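The figures above give a rough idea of the computational load per protein. The short calculation below only multiplies the numbers quoted on the slide (200 structures per round, 10 to 30 rounds); the variable names are ours.

```python
structures_per_round = 200        # typical value quoted above
rounds_low, rounds_high = 10, 30  # range of refinement rounds per protein

# Total number of structure calculations needed for a single protein
total_low = structures_per_round * rounds_low
total_high = structures_per_round * rounds_high

print(total_low, total_high)  # 2000 6000
```

A single protein therefore requires a few thousand structure calculations in total, and more if additional rounds are needed.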
AMBER porting on GRID
A simple JDL file and a set of scripts to run the program have been developed:
# Script executed on the worker node
Executable = "amber_serial.sh";
StdOutput = "testJob.out";
StdError = "testJob.err";
# Files shipped with the job and files retrieved on completion
InputSandbox = {"amber_serial.sh","amber_test/amber_grid.tar"};
OutputSandbox = {"testJob.out","testJob.err","out.tar"};
# Restrict execution to a specific computing element
Requirements = other.GlueCEUniqueID == "gridce.roma3.infn.it:2119/jobmanager-lcgpbs-grid";
The program is currently being tested by the Peking University NMR group.
What we have done
Structural bioinformatics is a key instrument to exploit the huge amount of data
available on human and pathogen genes.
In the EUChinaGRID project we set up a system to predict the three-dimensional
structure of a large number of protein sequences, to validate the predictions and to
test them experimentally.
We are currently refining function recognition (ASSIST) and catalytic site
identification (Early/Late Stage) tools.
What we plan to do
In silico structural genomics of bacterial and viral pathogens
• Low-cost activity
• High potential biomedical impact with small investments
• Application to endemic human and animal pathogens of developing countries
• Synergy with pharmaceutical industry
Acknowledgements
- Prof. Luisi for the original idea of “never born
proteins”
- INFN Catania (Rosetta deployment)
- Jagiellonian Univ. (Early/Late Stage deployment)
- INFN Roma Tre (AMBER deployment)
Thank you for your attention!