Transcript minervini

Enabling Grids for E-sciencE
“High throughput” protein structure
prediction application in EUChinaGRID
G. Minervini, G. La Rocca, P.L. Luisi and F. Polticelli
Dept. of Biology, Univ. Roma Tre, and INFN Catania, Italy
EGEE User Forum – Manchester, 10 May 2007
www.eu-egee.org
EGEE-II INFSO-RI-031688
EGEE and gLite are registered trademarks
EUChinaGRID Project Overview
The EUChinaGRID project promotes integration and interoperability between the
European (EGEE) and Chinese (CNGrid) Grid infrastructures.
The goals of EUChinaGRID are:
• Promoting the porting of new applications to the Grid infrastructures
• Training new user communities
• Supporting the adoption of Grid tools for scientific applications
• Validating the intercontinental Grid infrastructure
G. Minervini EGEE User Forum * Manchester, 10-5-2007
Biological Applications

The protein folding “problem” and the structural genomics challenge
– The combination of the 20 natural amino acids in a protein-specific sequence
dictates the three-dimensional structure of the entire protein
– Protein function is linked to the specific three-dimensional arrangement of the
amino acid functional groups
– With the advancement of molecular biology techniques a huge amount of
information on protein sequences has been made available but far less
information is available on structure and function of these proteins
– The “ab-initio” prediction of protein structure is a key instrument to better
understand the protein folding principles and successfully exploit the information
provided by the “genomic revolution”
THEORETICAL CONTEXT
[Figure labels: possible-protein space; natural proteins]
Determinist theory: the constituents of life are the result of evolutionary
fine-tuning; what we see is the best possible solution for the biological
needs (de Duve, 1995).
Contingency theory: extant proteins are the result of the simultaneous
interplay of several concomitant causes (Gould, 1994).
There may be an entire universe of “Never Born Proteins”
(NBP), whose properties have never been sampled by Nature
SOME BASIC CALCULATIONS WITH A
70 AMINO ACID PEPTIDE . . . .
• With 20 different comonomers, a protein chain of just 70 amino acids can
theoretically exist in 20^70 chemically and structurally unique combinations

20^70 ≈ 1.2 × 10^91 POSSIBLE AMINO ACID COMBINATIONS

SYNTHESIS OF ONE MOLECULE OF EACH
COMPOUND:
– 1.1 × 10^42 kg OF MATERIAL
– 1.8 × 10^17 TIMES THE WEIGHT OF THE EARTH
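
A quick way to check orders of magnitude like these is to evaluate them directly. The sketch below computes 20^n exactly and estimates the mass of one molecule of every sequence. It assumes an average residue mass of 110 Da and an Earth mass of 5.972 × 10^24 kg; these constants and all function names are introduced here for illustration and are not taken from the slide.

```python
import math

DALTON_KG = 1.66053907e-27   # kilograms per dalton
MEAN_RESIDUE_DA = 110.0      # ASSUMED average residue mass (not from the slide)
EARTH_MASS_KG = 5.972e24     # ASSUMED Earth mass used for the comparison


def sequence_count(length, alphabet=20):
    """Exact number of distinct sequences of the given length."""
    return alphabet ** length


def library_mass_kg(length):
    """Mass of one molecule of every sequence, under the assumed residue mass."""
    molecule_mass_kg = length * MEAN_RESIDUE_DA * DALTON_KG
    return sequence_count(length) * molecule_mass_kg


if __name__ == "__main__":
    for n in (50, 70):
        count = sequence_count(n)
        mass = library_mass_kg(n)
        print(f"n={n}: 20^{n} ~ 10^{math.log10(count):.1f} sequences, "
              f"mass ~ {mass:.1e} kg ({mass / EARTH_MASS_KG:.1e} Earth masses)")
```

Under these assumptions, n = 70 gives about 1.2 × 10^91 sequences. Mass figures of the order of 10^42 kg correspond to a chain of about 50 residues; for n = 70 the same estimate is many orders of magnitude larger.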
The “Never Born Proteins”
Rationale

It seems unlikely that nature has tried all possible combinations; in other words,
there exists a large number of NBPs that have never been exploited by
biological systems.

The NBPs pose a series of interesting questions for biology and for basic
science in general:
– By which criteria were the existing proteins selected?
– Do natural proteins have peculiar properties
(e.g. thermal stability, solubility in water or amino acid composition)?
– Or do they represent just a subset of the possible protein sequences,
generated only by the concurrent action of contingency and physicochemical forces?
The approach

The problem is tackled by a “high throughput” approach made feasible by the
use of the GRID infrastructure.

A library of 10^7–10^9 random amino acid sequences of fixed length (n = 70)
is generated.

“Ab initio” protein structure prediction software is used to predict the structure of each sequence.

Analysis of the structural characteristics of the resulting proteins in terms of:
– Frequency of compact folds and characteristics of the corresponding amino acid
sequences
– Occurrence of novel, as yet unknown folds
– Hydrophobicity/Hydrophilicity characteristics
– Presence of putative catalytic sites
– Experimental validation on “interesting” cases
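
The library-generation step above can be sketched in a few lines. This is a minimal illustration, assuming that each position is drawn uniformly from the 20 natural amino acids (the slides do not state the residue frequencies used); the function names and the `NBP` identifier prefix are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 natural amino acids


def random_library(size, length=70, seed=None):
    """Return `size` random sequences of fixed `length` (uniform sampling is an assumption)."""
    rng = random.Random(seed)
    return ["".join(rng.choices(AMINO_ACIDS, k=length)) for _ in range(size)]


def to_fasta(sequences, prefix="NBP"):
    """Format the sequences as FASTA text with hypothetical identifiers NBP_000001, ..."""
    records = [f">{prefix}_{i:06d}\n{seq}" for i, seq in enumerate(sequences, start=1)]
    return "\n".join(records) + "\n"


if __name__ == "__main__":
    print(to_fasta(random_library(3, seed=42)), end="")
```

A fixed seed makes a given library reproducible, which helps when the same sequences must be regenerated for later analysis.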
Never born proteins

Software
– Rosetta
• Developed by David Baker (University of Washington)
• Based on a “fragment assembly” strategy
• Semi-empirical force field for the evaluation of the thermodynamics
of the predicted structure
• Particularly successful in the prediction of novel folds in the
CASP competitions (Critical Assessment of Structure Prediction)
(http://depts.washington.edu/bakerpg/)
Software “ab-initio”
Rosetta
– Based on the assumption that local interactions bias the
conformation of sequence fragments while global interactions select
the three-dimensional structure with minimal energy, compatible
with the local biases.
– To define the local sequence-structure relationships, the software
uses the Protein Data Bank (www.rcsb.org) to extract the most
likely distribution of conformations adopted by short protein
segments in experimental structures, taking it as an approximation
of the distribution adopted by sequence segments during the folding
process.
Method details

Module I - Input generation
– The query sequence is divided into fragments of 3 and 9 amino acids
– The software extracts from the database of protein structures the distribution of
three-dimensional structures adopted by these fragments, based on their specific sequence
– For each query sequence a fragment database is derived, which contains all the possible
local structures adopted by each fragment of the entire sequence.
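
The first step of Module I, splitting the query into 3- and 9-residue fragments, can be illustrated as follows. This is only a sketch: it assumes one overlapping window at every sequence position (the slide does not say whether the windows overlap), and `fragment_windows` is a hypothetical helper, not part of Rosetta.

```python
def fragment_windows(sequence, size):
    """Return (start_index, fragment) for every overlapping window of `size` residues."""
    return [(i, sequence[i:i + size]) for i in range(len(sequence) - size + 1)]


if __name__ == "__main__":
    query = "ACDEFGHIKLMNPQRSTVWY"  # toy 20-residue query, for illustration only
    for size in (3, 9):
        windows = fragment_windows(query, size)
        print(f"{len(windows)} windows of length {size}; first: {windows[0]}")
```

Under this assumption a 70-residue query yields 68 windows of length 3 and 62 of length 9, each of which is then matched against known structures.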

Module II - Ab initio protein structure prediction
– The sets of fragments are assembled in a large number of different combinations by a Monte
Carlo procedure.
– The resulting structures are subjected to an energy minimization procedure
– The principal non-local interactions considered are hydrophobic interactions, electrostatic
interactions, main chain hydrogen bonds and excluded volume.
– The structures compatible with both local biases and non-local interactions are ranked
according to their total energy resulting from the minimization procedure.
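
The idea of Monte Carlo fragment assembly can be conveyed with a toy model. The sketch below is not Rosetta's algorithm or energy function: the chain is reduced to a string of three symbolic states, the fragment library is random, the score is a made-up penalty, and a standard Metropolis acceptance rule is assumed. All names are hypothetical.

```python
import math
import random

STATES = "HEL"  # toy local conformations (helix, strand, loop); illustrative only


def toy_energy(conformation):
    """Hypothetical score: one penalty unit for each change of state along the chain."""
    return sum(a != b for a, b in zip(conformation, conformation[1:]))


def make_toy_library(n_residues, frag_len=3, per_position=5, seed=0):
    """Random stand-in for a fragment library: candidate fragments for each window start."""
    rng = random.Random(seed)
    return {
        start: ["".join(rng.choices(STATES, k=frag_len)) for _ in range(per_position)]
        for start in range(n_residues - frag_len + 1)
    }


def assemble(n_residues, library, steps=2000, temperature=0.5, seed=1):
    """Metropolis fragment insertion; returns (best_conformation, best_energy, start_energy)."""
    rng = random.Random(seed)
    starts = sorted(library)
    current = rng.choices(STATES, k=n_residues)       # random starting chain
    e_current = e_start = toy_energy(current)
    best, e_best = list(current), e_current
    for _ in range(steps):
        start = rng.choice(starts)
        fragment = rng.choice(library[start])
        trial = list(current)
        trial[start:start + len(fragment)] = fragment  # insert the candidate fragment
        delta = toy_energy(trial) - e_current
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, e_current = trial, e_current + delta
            if e_current < e_best:
                best, e_best = list(current), e_current
    return "".join(best), e_best, e_start


if __name__ == "__main__":
    model, e_best, e_start = assemble(70, make_toy_library(70))
    print(f"start energy {e_start}, best energy {e_best}")
    print(model)
```

Running many such trajectories with different seeds and ranking the results by energy mirrors, in miniature, the ranking step described above.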
Module I
– The procedure for input generation is rather complex but
computationally inexpensive (10 min of CPU time on a 3.2 GHz
Pentium IV)
– Due to the many dependencies of Module I (Blast and psipred), the
input generation is carried out locally with a script that automates the
procedure for a large dataset of sequences
– Approximately 500 input datasets are currently being generated weekly
Module II
• Input
– fragment files generated by Module I
– secondary structure prediction using psipred
• As output, the user obtains a number of structural models of the query
sequence ranked by total energy
• A single run with just the lowest-energy structure as output takes approx.
10-40 min of CPU time, depending on the degree of refinement of the
structure
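
The per-run cost quoted above explains why a Grid is needed for a library of this size. As a back-of-the-envelope sketch, assuming one run per sequence and a 365-day year (`cpu_years` is a hypothetical helper):

```python
MINUTES_PER_YEAR = 60 * 24 * 365


def cpu_years(n_sequences, minutes_per_run):
    """Total CPU time, in years, for one run per sequence."""
    return n_sequences * minutes_per_run / MINUTES_PER_YEAR


if __name__ == "__main__":
    for n_sequences in (10**7, 10**9):
        low = cpu_years(n_sequences, 10)
        high = cpu_years(n_sequences, 40)
        print(f"{n_sequences:.0e} sequences: {low:,.0f} to {high:,.0f} CPU-years")
```

Even the smallest library considered (10^7 sequences) amounts to roughly 190 to 760 CPU-years on a single processor under these assumptions.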
Integration on the GILDA facility
• Single job execution on GILDA
– A single Rosetta abinitio run consists of two different phases. In the first phase
an initial model of the protein structure is generated using the fragment libraries.
The initial model is then used as input for the second phase in which the model
is refined.
– A shell script has been prepared which:
• registers the program executable and the required input files (fragment
libraries and secondary structure prediction file) on the LFC catalog
• calls the Rosetta executable and proceeds with workflow execution.
– A JDL file was created to run the application on the GILDA worker nodes, which
use the gLite middleware
Integration on the GENIUS web portal

To make Rosetta abinitio easy to use in a Grid environment for the
computational biology community, the application was integrated within the
GENIUS portal (https://glite-tutor.ct.infn.it).

After a simple MyProxy server initialization procedure, uploading of the input
files and executable, JDL file preparation, application running, run status
monitoring and download of the output file are all carried out from the GENIUS
web portal.

The Rosetta abinitio output (universal .pdb format) can easily be downloaded to the
local machine by just typing download in the GENIUS web interface. For an instant
online check, a high-resolution graphical output of the predicted structure in
.png and .wrl formats is also provided.

Step 1. After MyProxy initialization the user connects to the GENIUS portal
to set up the parametric JDL, specifying the number of runs (equivalent to
the number of amino acid sequences to be simulated) to be carried out.

Step 2. The user specifies the working directory and the name of the shell script.

Step 3. Input files (fragment libraries) are loaded as a single .tar.gz archive per amino
acid sequence.

Step 4. Output files (initial and refined model coordinates) are specified in parametric
form.

Step 5. The parametric JDL file is generated and displayed for inspection by the
user.
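
To give an idea of what such a parametric JDL might contain, the sketch below renders one as text. The attribute names (`JobType`, `Parameters`, `ParameterStart`, `ParameterStep`, the sandboxes and the `_PARAM_` placeholder) are those of the gLite WMS job description language as far as we recall them; the script name, the file names and the use of the sandbox for the input archive are hypothetical and are not taken from the actual GENIUS-generated file.

```python
def parametric_jdl(n_runs, script="run_rosetta.sh"):
    """Render a parametric JDL for `n_runs` runs (file names are hypothetical)."""
    lines = [
        'JobType = "Parametric";',
        'Executable = "/bin/sh";',
        f'Arguments = "{script} _PARAM_";',
        f"Parameters = {n_runs};",     # number of runs, one per amino acid sequence
        "ParameterStart = 0;",
        "ParameterStep = 1;",
        'StdOutput = "std_PARAM_.out";',
        'StdError = "std_PARAM_.err";',
        f'InputSandbox = {{"{script}", "input_PARAM_.tar.gz"}};',
        'OutputSandbox = {"std_PARAM_.out", "std_PARAM_.err", '
        '"initial_PARAM_.pdb", "refined_PARAM_.pdb"};',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(parametric_jdl(100), end="")
```

The middleware substitutes `_PARAM_` with the run index, so a single description covers every sequence in the batch.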

Step 6. The parametric job is submitted and its status as well as the status of
individual runs of the same job can be checked.

Graphical output: representation of the predicted structure, generated in .png format by
Molscript and Raster3D.
CONCLUSIONS

We are currently accumulating data on NBP structures

Collecting tools for analysis (structure and function analysis)

Studying the portability of other applications (e.g. function recognition software developed
“in house”) to the Grid

Envisioning application of ported tools for structural genomics initiatives on biomedically
relevant targets
– Example: prediction of the structure/function of the entire set of proteins of selected
viral and microbial pathogens for target selection and in silico drug discovery
Thank you!