Modelling proteins and proteomes using Linux clusters
Ram Samudrala
University of Washington
Examples of biological problems
Protein structure prediction/docking simulations
- need to run different trajectories that sometimes talk with each other
Molecular dynamics simulations
- need more cohesive parallelisation
Polarisable force fields
- need true parallelisation
Bioinformatics searches/exploration
- trivially parallelisable
Computational issues
Need efficient methods to start/stop jobs
Need load-balancing queuing system
Need fast communications at times
Need stability (months/years uptimes)
Need low maintenance/management overhead
Need low installation overhead
Needs to be cheap!
Hardware and operating system
256 AMD and Intel CPUs (1-2.5 GHz)
0.5-1 GB RAM, 100-200 GB HD, dual processor MBs
100 Mbps Ethernet connectivity for 64-processor sets
White boxes are good but use up space; 1U racks are ideal
Minimal Linux installation: create a clone “CD” and copy it onto all machines
Our solution
No single solution – user implements their own
Completely decentralised
Analyse problem and determine parallelisable parts
Implementation specific to problem
Use local scratch space for computation
Redundant storage of data for faster access
Limit problem space to specific problems
Problem specific implementation
MCSA/GA: socket-based communication of trajectories; multiple trajectories on different CPUs
Docking: sample different ligands/regions of the protein on different CPUs
MD: Pairwise force-fields are additive
PFF: ?
Bioinformatics: trivial parallelisation; communication by disk
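The "communication by disk" approach for trivially parallel jobs can be sketched as follows. This is a minimal illustration under stated assumptions, not the group's actual implementation: the function names (`run_job`, `collect_results`), the JSON file layout and the toy per-sequence calculation are all hypothetical. Each job writes its result to its own file in a shared directory and a collector reads the files back, so jobs never need to talk to each other directly.

```python
import json
import os
import tempfile


def run_job(job_id, sequence, out_dir):
    """Hypothetical worker: analyse one sequence and write the result to disk."""
    result = {"job": job_id, "length": len(sequence), "n_lys": sequence.count("K")}
    tmp_path = os.path.join(out_dir, f"job_{job_id}.json.tmp")
    final_path = os.path.join(out_dir, f"job_{job_id}.json")
    with open(tmp_path, "w") as handle:
        json.dump(result, handle)
    # Rename only after writing, so a reader never sees a half-written file.
    os.replace(tmp_path, final_path)
    return final_path


def collect_results(out_dir):
    """Read every finished result file from the shared directory."""
    results = []
    for name in sorted(os.listdir(out_dir)):
        if name.endswith(".json"):
            with open(os.path.join(out_dir, name)) as handle:
                results.append(json.load(handle))
    return results


def demo():
    """Run three toy jobs in sequence; on a cluster each would run on its own CPU."""
    sequences = ["LKEGVSKD", "KDHPFGFAVPTK", "EFDVILKAAG"]
    with tempfile.TemporaryDirectory() as out_dir:
        for job_id, seq in enumerate(sequences):
            run_job(job_id, seq, out_dir)
        return collect_results(out_dir)
```

Writing to a temporary name and renaming is one simple way to keep readers from picking up partial output; whether that matters depends on the filesystem shared between the nodes.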
Modelling proteomes
Ram Samudrala
University of Washington
What is a “proteome”?
All proteins of a particular system
(organelle, cell, organism)
What does it mean to “model a proteome”?
For any protein, we wish to:
- figure out what it looks like (structure or form)
- understand what it does (function)
Repeat for all proteins in a system
Understand the relationships between all of them
[Slide labels: ANNOTATION; EXPRESSION + INTERACTION]
Protein folding
…-CUA-AAA-GAA-GGU-GUU-AGC-AAG-GUU-…
DNA
protein sequence
…-L-K-E-G-V-S-K-D-…
one amino acid
unfolded protein
spontaneous self-organisation
(~1 second)
native state
not unique
mobile
inactive
expanded
irregular
unique shape
precisely ordered
stable/functional
globular/compact
helices and sheets
De novo prediction of protein structure
sample conformational space such that
native-like conformations are found
select
hard to design functions
that are not fooled by
non-native conformations
(“decoys”)
astronomically large number of conformations
5 states per residue, 100 residues: 5^100 ≈ 10^70
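The size of the conformational space quoted above can be checked with a few lines of arithmetic. This is only a worked example of the estimate on the slide (5 states per residue, 100 residues); the variable names are illustrative.

```python
import math

STATES_PER_RESIDUE = 5
N_RESIDUES = 100

# Exact integer count of conformations under the 5-state assumption.
conformations = STATES_PER_RESIDUE ** N_RESIDUES

# Base-10 exponent of the same number, computed without the big integer.
log10_conformations = N_RESIDUES * math.log10(STATES_PER_RESIDUE)

print(f"5^100 has {len(str(conformations))} digits")
print(f"log10(5^100) = {log10_conformations:.2f}")
```

The exponent comes out just under 70, which is why the count is rounded to 10^70.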
Semi-exhaustive segment-based folding
EFDVILKAAGANKVAVIKAVRGATGLGLKEAKDLVESAPAALKEGVSKDDAEALKKALEEAGAEVEVK
generate: fragments from database; 14-state φ,ψ model
minimise: Monte Carlo with simulated annealing; conformational space annealing; GA
filter: all-atom pairwise interactions, bad contacts; compactness; secondary structure
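The minimisation step (Monte Carlo with simulated annealing) can be illustrated with a toy sketch. This is not the semfold code: the move set, the geometric cooling schedule, the parameter defaults and the quadratic `toy_energy` function are all assumptions chosen to keep the example short. A real run would perturb φ,ψ angles and score conformations with the filters listed above.

```python
import math
import random


def simulated_annealing(energy, start, steps=2000, t_start=2.0, t_end=0.01, seed=0):
    """Minimise `energy` over a list of floats with Metropolis Monte Carlo."""
    rng = random.Random(seed)
    current = list(start)
    current_e = energy(current)
    best, best_e = list(current), current_e
    for step in range(steps):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        fraction = step / max(steps - 1, 1)
        temperature = t_start * (t_end / t_start) ** fraction
        # Trial move: perturb one randomly chosen variable.
        trial = list(current)
        index = rng.randrange(len(trial))
        trial[index] += rng.uniform(-0.5, 0.5)
        trial_e = energy(trial)
        delta = trial_e - current_e
        # Metropolis criterion: always accept downhill, sometimes accept uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = trial, trial_e
            if current_e < best_e:
                best, best_e = list(current), current_e
    return best, best_e


def toy_energy(values):
    """Toy stand-in for a scoring function; its minimum is at all values = 1.0."""
    return sum((v - 1.0) ** 2 for v in values)
```

Calling `simulated_annealing(toy_energy, [0.0] * 5)` returns the best state seen and its energy. Running several such trajectories with different seeds on different CPUs is the kind of parallelism described earlier for MCSA/GA.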
CASP5 prediction for T138
4.6 Å Cα RMSD for 84 residues
CASP5 prediction for T146
5.6 Å Cα RMSD for 67 residues
CASP5 prediction for T170
4.8 Å Cα RMSD for all 69 residues
CASP5 prediction for T129
5.8 Å Cα RMSD for 68 residues
CASP5 prediction for T172
5.9 Å Cα RMSD for 74 residues
CASP5 prediction for T187
5.1 Å Cα RMSD for 66 residues
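The Cα RMSD values quoted for these predictions measure the root-mean-square distance between equivalent Cα atoms of the model and the experimental structure. The sketch below shows only that final formula; it assumes the two coordinate sets are already optimally superimposed and listed in the same residue order, which a real comparison would have to establish first. The function name `ca_rmsd` is illustrative.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) Cα coordinates.

    Assumes the structures are already superimposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

For example, two residues each displaced by exactly 1 Å give an RMSD of 1.0 Å.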
Comparative modelling of protein structure
scan
align
de novo simulation
…
KDHPFGFAVPTKNPDGTMNLMNWECAIP
KDPPAGIGAPQDN----QNIMLWNAVIP
** * *   *  *     * * *   **
build initial model
minimum perturbation
refine
physical functions
…
construct non-conserved
side chains and main chains
graph theory, semfold
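The identity marks in the alignment above can be reproduced by counting matching columns. The sketch below is illustrative: the function name is hypothetical, and dividing by the number of gap-free columns is just one convention for percent identity, so it need not match how the "% id" figures for the CASP5 targets were computed.

```python
def alignment_identity(seq_a, seq_b, gap="-"):
    """Return (identical columns, gap-free columns) for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    identical = 0
    aligned = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        aligned += 1
        if a == b:
            identical += 1
    return identical, aligned


# The two aligned fragments shown on the slide.
top = "KDHPFGFAVPTKNPDGTMNLMNWECAIP"
bottom = "KDPPAGIGAPQDN----QNIMLWNAVIP"

identical, aligned = alignment_identity(top, bottom)
percent_identity = 100.0 * identical / aligned
```

For this fragment the count is 11 identical residues over 24 gap-free columns.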
CASP5 prediction for T129
1.0 Å Cα RMSD for 133 residues (57% id)
CASP5 prediction for T182
1.0 Å Cα RMSD for 249 residues (41% id)
CASP5 prediction for T150
2.7 Å Cα RMSD for 99 residues (32% id)
CASP5 prediction for T185
6.0 Å Cα RMSD for 428 residues (24% id)
CASP5 prediction for T160
2.5 Å Cα RMSD for 125 residues (22% id)
CASP5 prediction for T133
6.0 Å Cα RMSD for 260 residues (14% id)
Prediction of SARS CoV proteinase inhibitors
Ekachai Jenwitheesuk
Computational aspects of structural genomics
A. sequence space
B. comparative modelling
C. fold recognition
D. ab initio prediction
E. target selection
F. analysis
[Figure label: targets]
(Figure idea by Steve Brenner.)
Computational aspects of functional genomics
structure based methods
microenvironment analysis
Bioverse
structure comparison
zinc binding site?
homology
function?
+
sequence based methods
sequence comparison
motif searches
phylogenetic profiles
domain fusion analyses
+
experimental data
single molecule + genomic/proteomic
assign function to
entire protein space
Bioverse – explore relationships among molecules and systems
http://bioverse.compbio.washington.edu
Jason McDermott
Bioverse – prediction of protein interaction networks
Target proteome
Interacting protein database
[Figure: protein α and protein β, experimentally determined interaction; protein A and protein B, predicted interaction; similarities of 85% and 90%]
Assign confidence based on similarity and strength of interaction
Jason McDermott
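One way to picture "confidence based on similarity and strength of interaction" is sketched below. The slide does not give the formula, so the scoring rule here (weakest similarity multiplied by an interaction strength in the range 0 to 1) is purely a hypothetical placeholder, as are the function and parameter names. It is not the Bioverse method.

```python
def interolog_confidence(similarity_a, similarity_b, interaction_strength=1.0):
    """Hypothetical confidence for an interaction transferred by similarity.

    similarity_a, similarity_b: similarity (0-1) of each target protein to
    its counterpart in the interacting protein database.
    interaction_strength: assumed 0-1 weight for the experimental evidence.
    """
    for value in (similarity_a, similarity_b, interaction_strength):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all inputs must lie between 0 and 1")
    # Assumed rule: the weaker of the two matches limits the confidence.
    return min(similarity_a, similarity_b) * interaction_strength
```

With the 85% and 90% similarities from the figure and full-strength evidence, this placeholder rule would give 0.85.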
Bioverse – E. coli predicted protein interaction network
Jason McDermott
Bioverse – M. tuberculosis predicted protein interaction network
Jason McDermott
Bioverse – C. elegans predicted protein interaction network
Jason McDermott
Bioverse – H. sapiens predicted protein interaction network
Jason McDermott
Bioverse – organisation of the interaction networks
C_i = 2n / (k_i (k_i - 1))
Jason McDermott
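The clustering coefficient above can be computed directly from an adjacency list, reading k_i as the number of neighbours of node i and n as the number of links among those neighbours (the usual definition). The sketch assumes an undirected network stored as a dict of neighbour sets; the toy graph and the function name are illustrative, not Bioverse data.

```python
def clustering_coefficient(graph, node):
    """C_i = 2n / (k_i (k_i - 1)) for one node of an undirected graph.

    graph: dict mapping each node to the set of its neighbours.
    """
    neighbours = list(graph[node])
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention chosen here: undefined cases reported as 0
    links = 0
    for i in range(k):
        for j in range(i + 1, k):
            if neighbours[j] in graph[neighbours[i]]:
                links += 1
    return 2 * links / (k * (k - 1))


# Toy network: A interacts with B, C and D; B and C also interact.
toy_network = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
```

In the toy network, node A has three neighbours with one link among them, giving C_A = 1/3, while node B has two neighbours that are linked, giving C_B = 1.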
Bioverse – mapping pathways on the rice predicted network
Defense-related proteins
Jason McDermott
Bioverse – mapping pathways on the rice predicted network
Tryptophan biosynthesis
Jason McDermott
Bioverse – network-based annotation for C. elegans
Jason McDermott
Bioverse – H. sapiens protein-protein similarity network
Jason McDermott
Bioverse – viewer
Aaron Chang
Future directions
Network connection with multiple Ethernet cards based on traffic analysis
Gigabit ethernet (switches are still expensive)
Better network filesystems
Take home message
Prediction of protein structure and function can
be used to model whole genomes to understand
organismal function and evolution
Acknowledgements
Aaron Chang
Ashley Lam
Ekachai Jenwitheesuk
Gong Cheng
Jason McDermott
Kai Wang
Ling-Hong Hung
Lynne Townsend
Marissa LaMadrid
Mike Inouye
Stewart Moughon
Shing-Chung Ngan
Yi-Ling Cheng
Zach Frazier
National Institutes of Health
National Science Foundation
Searle Scholars Program (Kinship Foundation)
UW Advanced Technology Initiative in Infectious Diseases