Never born proteins


Biological applications in GRID:
the EUChinaGRID experience
F. Polticelli – University Roma Tre
EUChinaGRID WP4-Applications Manager
Budapest, 1.10.2007
FP6−2004−Infrastructures−6-SSA-026634
Outline
 EUChinaGRID Overview
 The structural genomics challenge
 Biological Applications in EUChinaGRID
• The “never born proteins”
• Protein structure prediction using GRID
– Rosetta integration within the Genius portal
– Early/Late stage integration in the Gridsphere portal
• Structure validation using GRID
– AMBER deployment on GRID
 Conclusions and perspectives
• function recognition and catalytic site identification tools
• In silico structural genomics

Fabio Polticelli UROM3  EGEE07  Budapest 1-10-2007
EUChinaGRID Overview
 Aim
• provide support actions to foster the integration and interoperability of
the Grid infrastructures in Europe (EGEE) and China (CNGrid).
• promote the migration of new applications on the Grid infrastructures by
training new user communities and supporting the adoption of grid tools
for scientific applications.
 Applications
• validate the intercontinental infrastructure using scientific applications
• facilitate porting of new applications relevant for scientific and industrial
collaboration between Europe and China.
• three main application fields:
– EGEE Applications (ATLAS and CMS)
– Astroparticle Physics applications (the ARGO experiment)
– Biology applications (“Never born proteins”)

The structural genomics challenge
 The combination of the 20 natural amino acids in a specific sequence in
a protein chain dictates the three-dimensional structure of the protein
 Protein function is linked to the specific three-dimensional arrangement
of amino acid functional groups.
 With the advancement of molecular biology techniques a huge amount
of information on protein sequences has been made available but far
less information is available on structure and function of these proteins.
 Prediction of protein structure and function is a key instrument to better
understand the protein folding principles and successfully exploit the
information provided by the “genomic revolution”.

The test case: Never born proteins
 With 20 different comonomers, a protein chain of just 60 amino acids
can theoretically exist in 20^60 chemically and structurally unique
combinations
 But the number of natural proteins (10^9 to a maximum of 10^13) is just a
tiny fraction of all possible proteins
 There exists a huge number of protein sequences that have never been
exploited by biological systems, in other words an enormous number
of “never born proteins” (NBP). These pose the following questions:
– By which criteria have the existing proteins been
selected?
– Do natural proteins have peculiar properties in terms of, for example,
thermal stability, solubility in water or amino acid
composition?
– Can NBP be exploited for biomedical and/or biotechnological
purposes?
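The size of this gap can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted on this slide (20 amino acids, a chain of 60 residues, at most 10^13 natural proteins); the variable names are illustrative.

```python
import math

AMINO_ACIDS = 20        # natural amino acids
CHAIN_LENGTH = 60       # residues in the example chain
MAX_NATURAL = 10 ** 13  # upper estimate of natural proteins quoted on the slide

# Every position can hold any of the 20 amino acids independently.
possible = AMINO_ACIDS ** CHAIN_LENGTH
fraction_explored = MAX_NATURAL / possible

print(f"possible sequences: about 10^{math.log10(possible):.0f}")
print(f"fraction found in nature: about 10^{math.log10(fraction_explored):.0f}")
```

Under these figures there are about 10^78 possible sequences, of which nature has used a fraction of roughly 10^-65.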

Never born proteins and GRID
 The problem is tackled by a “high throughput” approach made feasible
by the use of the GRID infrastructure
 A huge library of random amino acid sequences of fixed length is
generated (n=70)
 “ab initio” protein structure prediction software is used.
 Analysis of the structural characteristics of the resulting proteins
• frequency of compact and yet unknown folds
• presence of putative catalytic sites
• experimental validation on “interesting” cases
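The library generation step can be sketched as follows. This is a minimal illustration, not the project's actual generator: it assumes residues are drawn uniformly and independently (the slides do not say how the sequences were sampled), and the function names and the `NBP` identifier prefix are hypothetical.

```python
import random

# One-letter codes of the 20 natural amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_sequence(length=70, rng=random):
    """Return one random sequence, each residue drawn uniformly (assumption)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))

def random_library(size, length=70, seed=0):
    """Generate `size` distinct random sequences, reproducible for a given seed."""
    rng = random.Random(seed)
    library = set()
    while len(library) < size:
        library.add(random_sequence(length, rng))
    return sorted(library)

def to_fasta(sequences, prefix="NBP"):
    """Format sequences as FASTA records with hypothetical IDs NBP_000001, ..."""
    records = [f">{prefix}_{i:06d}\n{seq}" for i, seq in enumerate(sequences, start=1)]
    return "\n".join(records) + "\n"

if __name__ == "__main__":
    print(to_fasta(random_library(3)))
```

Each FASTA record can then serve as the input of one structure prediction job.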

The tool: Rosetta abinitio
 Developed by David Baker – University
of Washington
 Based on a “fragment assembly”
strategy
 Semi-empirical force field for the
evaluation of the thermodynamics of the
predicted structure
 Particularly successful in the prediction
of novel folds in the CASP competitions
(Critical Assessment of Structure
Prediction)
 Rosetta abinitio has been deployed in
GRID through the use of the GENIUS
interface, with the option of parametric
job submission to run a large number
of jobs (structure predictions) at the
same time.

First step: Integration on the GILDA facility
 Single job execution on GILDA
• A shell script has been prepared which:
– registers the program executable and the required input files
(fragment libraries and secondary structure prediction file)
on the LFC catalog
– calls the Rosetta executable and proceeds with workflow
execution.
• A JDL file was created to run the application on the GILDA
worker nodes, which use the gLite middleware

Integration on the GENIUS web portal
 To facilitate the use of the Rosetta abinitio application within the grid
environment by the computational biology community, the application
was integrated within the GENIUS portal (https://glite-tutor.ct.infn.it).
 After MyProxy server initialization, all steps are carried out from within
the portal: upload of input files and executable, JDL file preparation,
job execution, run status monitoring and download of the output file.
 Given the huge number of “never born proteins” to be simulated, a
procedure for the automatic generation of parametric JDL files has been
set up within the GENIUS environment.
 More than 2 x 10^4 never born protein structures predicted so far
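To illustrate what a parametric JDL file might look like, the sketch below builds one as a string. It is not the GENIUS procedure itself. The attribute names (`JobType`, `Parameters`, `ParameterStart`, `ParameterStep`) and the `_PARAM_` placeholder follow the gLite WMS conventions for parametric jobs as commonly documented, and should be checked against the middleware version in use. The script name and file names are hypothetical.

```python
def parametric_jdl(n_jobs, script="rosetta_run.sh", input_prefix="nbp"):
    """Return the text of a parametric JDL describing `n_jobs` prediction jobs.

    The middleware is expected to replace every _PARAM_ with the job index
    (0, 1, ..., n_jobs - 1), so job 7 would read the hypothetical input
    file nbp7.fasta and write rosetta7.out.
    """
    lines = [
        'JobType = "Parametric";',
        f'Executable = "{script}";',
        'Arguments = "_PARAM_";',
        f"Parameters = {n_jobs};",
        "ParameterStart = 0;",
        "ParameterStep = 1;",
        'StdOutput = "rosetta_PARAM_.out";',
        'StdError = "rosetta_PARAM_.err";',
        f'InputSandbox = {{"{script}", "{input_prefix}_PARAM_.fasta"}};',
        'OutputSandbox = {"rosetta_PARAM_.out", "rosetta_PARAM_.err"};',
    ]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    print(parametric_jdl(1000))
```

A single file of this kind describes the whole batch, so the portal only has to generate and submit one JDL per library chunk.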

GENIUS screenshots

Never born proteins structure examples

Early/Late stage
 Developed by Irena Roterman's group
– Jagiellonian University
 A program for protein folding
simulation, not structure prediction
(complementary approach to Rosetta)
 Early stage: statistics based on
a database of known sequences
 Late stage: energy minimization in
alternating potentials; this stage is the
most computationally expensive
 Early/Late stage has been deployed
in GRID through the use of the
GridSphere Portal Framework and the
Gridwise Tech LCG API package, which
provides access to the gLite middleware.

Early/Late stage
 A self-contained bundle of the programs
and libraries needed to run the
application was created and registered
in the LFC catalogue.
 A script was created to install the
application on site each time a job is
started.
 A JDL file was created to run the
application on the grid, which uses the
gLite middleware.
 Finally, to enable running the
application for users that are not
familiar with the grid, it was decided
to integrate it in a web portal based
on the GridSphere Portal
Framework.

Never born proteins. What’s next?
 Build consensus
 Experimental validation

• Nuclear magnetic resonance
(NMR) data acquired in the NMR
centre of Peking University
• Experimental data contain all the
information about the primary
structure of the protein, about
topology and bonds.
• NMR structure calculation and
refinement is an iterative process
which, for a single protein, involves
many starting structures, normally
200 structures per round, and
each protein may need 10-30 (or
more) rounds of calculations.
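These figures give a rough per-protein workload. The sketch below is a back-of-the-envelope calculation using only the numbers quoted on this slide (200 structures per round, 10 to 30 rounds); the names are illustrative.

```python
STRUCTURES_PER_ROUND = 200      # typical number of starting structures per round
MIN_ROUNDS, MAX_ROUNDS = 10, 30 # rounds of calculation quoted for one protein

def structure_calculations(rounds, per_round=STRUCTURES_PER_ROUND):
    """Total independent structure calculations for one protein."""
    return rounds * per_round

low = structure_calculations(MIN_ROUNDS)
high = structure_calculations(MAX_ROUNDS)
print(f"{low} to {high} structure calculations per protein")
```

That is 2000 to 6000 calculations for a single protein. Within a round each starting structure can be treated as a separate job, which suits grid execution.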
AMBER porting on GRID
 A simple .JDL file and a set of scripts to run the program have been developed
Executable = "amber_serial.sh";
StdOutput = "testJob.out";
StdError = "testJob.err";
InputSandbox = {"amber_serial.sh","amber_test/amber_grid.tar"};
OutputSandbox = {"testJob.out","testJob.err","out.tar"};
Requirements = other.GlueCEUniqueID == "gridce.roma3.infn.it:2119/jobmanager-lcgpbs-grid";
 The program is currently under testing by the Peking University NMR group

What we have done
 Structural bioinformatics is a key instrument to exploit the huge amount of data
available on human and pathogen genes.
 In the EUChinaGRID project we set up a system to predict the three-dimensional
structure of a large number of protein sequences, to validate the predictions and to
test them experimentally
 We are currently refining function recognition (ASSIST) and catalytic site
identification tools (Early/Late Stage)
What we plan to do
 In silico structural genomics of bacterial and viral pathogens
• Low-cost activity
• High potential biomedical impact with small investments
• Application to endemic human and animal pathogens of developing countries
• Synergy with the pharmaceutical industry
Acknowledgements
- Prof. Luisi for the original idea of “never born
proteins”
- INFN Catania (Rosetta deployment)
- Jagiellonian Univ. (Early/Late Stage deployment)
- INFN Roma Tre (AMBER deployment)
Thank you for your attention!