Computers and Computing in Biology


From Sequence Analysis to Simulations:
Applications of HPC in Modern Biology
R. Sankararamakrishnan
Department of Biological Sciences & Bioengineering
IIT-Kanpur
IIT-K REACH Symposium 2010
Oct 9th 2010
Computers and Computing in Biology
Mathematical Biology
Biostatistics
Biomathematics
Quantitative Biology
Biophysics
Bioinformatics
Computational Biology
Definitions
What is Bioinformatics? - Research, development, or
application of computational tools and approaches for
expanding the use of biological, medical, behavioral or health
data, including those to acquire, store, organize, archive,
analyze, or visualize such data.
What is Computational Biology? - The development and
application of data-analytical and theoretical methods,
mathematical modeling and computational simulation
techniques to the study of biological, behavioral, and social
systems.
- NIH Definition
http://www.bisti.nih.gov/
Explosive growth of biological data
HPC Applications: Three examples
Evolutionary relationship among
a given set of protein or DNA
sequences
Drug Discovery and Design
Structure-function relationship
of large biomolecular assemblies
I. HPC in Phylogenetics
Phylogeny and Phylogenetic tree
Study of evolutionary relationships
(sequences/species)
Relationships between organisms with
common ancestor
Phylogenetic tree is a graph
representing evolutionary history of
sequences/species
Phylogenetic trees can be represented in
two different ways
Rooted Tree (Orangutan, Gorilla, Chimpanzee, Human)
Has a unique root node
Shows the direction of evolution
Unrooted Tree (Orangutan, Gorilla, Chimpanzee, Human)
No assumption about common ancestry
Molecular phylogeny in a criminal investigation
Maximum Likelihood Method – An Introduction
David Mount (2002)
For each unrooted tree, there will be many
possible rooted trees
Number of possible unrooted and rooted trees

For n species:
$N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$
$N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$

Species   Number of Rooted Trees              Number of Unrooted Trees
2         1                                   1
3         3                                   1
4         15                                  3
5         105                                 15
10        34,459,425                          2,027,025
15        213,458,046,676,875                 7,905,853,580,625
20        8,200,794,532,637,891,559,375       221,643,095,476,699,771,875
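The two counting formulas can be checked with a few lines of Python. This is only an illustrative sketch: the function names are made up for this example, and n = 2 is treated as a special case for unrooted trees because the closed form is undefined there.

```python
from math import factorial

def rooted_trees(n):
    """Number of rooted bifurcating trees for n species: (2n-3)! / (2^(n-2) (n-2)!)."""
    if n < 2:
        raise ValueError("need at least 2 species")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n species: (2n-5)! / (2^(n-3) (n-3)!)."""
    if n < 2:
        raise ValueError("need at least 2 species")
    if n == 2:
        return 1  # special case: a single branch joining the two species
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

for n in (3, 4, 5, 10, 20):
    print(n, rooted_trees(n), unrooted_trees(n))
```

For n >= 3 the ratio N_R / N_U is 2n - 3, the number of branches of an unrooted tree. Each branch is a possible position for the root, which is why every unrooted tree corresponds to many rooted trees.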
Computing phylogenetic trees using ML method
Maximum likelihood phylogeny problem is NP-hard
Very CPU intensive
For trees containing more than 20 to 25 sequences, the
problem can no longer be solved exactly
Efficient heuristic tree search algorithms are required
to reduce the size of the search space
Recently developed algorithms:
IQPNNI, PHYML, GARLI, RAxML
None of these algorithms is guaranteed to find the
ML tree; they yield only the best-known ML tree
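Part of the cost comes from scoring: every candidate tree has to be evaluated against the whole alignment. The sketch below shows that scoring step for a single fixed tree. It assumes the Jukes-Cantor (JC69) substitution model, fixed branch lengths, a toy nested-list tree encoding and made-up sequences. All of these are illustrative choices and are not taken from the programs listed above, which use richer models and also optimise branch lengths.

```python
import math

BASES = "ACGT"

def jc69(t):
    """4x4 JC69 transition probabilities for a branch of length t
    (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def partial(node, site):
    """Conditional likelihoods at a node for one alignment column.
    A leaf is a sequence name (str); an internal node is a list of
    (child, branch_length) pairs. This encoding is hypothetical."""
    if isinstance(node, str):
        return [1.0 if b == site[node] else 0.0 for b in BASES]
    out = [1.0] * 4
    for child, t in node:
        p = jc69(t)
        below = partial(child, site)
        for x in range(4):
            out[x] *= sum(p[x][y] * below[y] for y in range(4))
    return out

def log_likelihood(tree, alignment):
    """Sum of per-site log-likelihoods, with uniform base frequencies at the root."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for k in range(n_sites):
        site = {name: seq[k] for name, seq in alignment.items()}
        total += math.log(sum(0.25 * v for v in partial(tree, site)))
    return total

# Toy data, invented for illustration only.
alignment = {"Human": "ACGTAC", "Chimpanzee": "ACGTAT",
             "Gorilla": "ACGCAT", "Orangutan": "ATGCAT"}
tree = [([("Human", 0.1), ("Chimpanzee", 0.1)], 0.05),
        ([("Gorilla", 0.1), ("Orangutan", 0.1)], 0.05)]
print(log_likelihood(tree, alignment))
```

A heuristic search repeats this evaluation for many rearranged trees and keeps the best-scoring one, so the scoring step is a natural target for parallelization.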
Parallelization strategy
Ott et al. (2008)
RAxML performance in some HPC platforms
212 sequences, 566,470 base pairs
One of the largest datasets analyzed under ML
IBM BlueGene/L; 1024 CPUs
7 distinct tree searches in 14 hours
Ott et al. (2008)
Phylogenetic analysis of plant channel proteins
identified new subfamily
Bansal and Sankararamakrishnan, BMC Struct. Biol. (2007)
Gupta and Sankararamakrishnan, BMC Plant Biol. (2009)
II. HPC in Drug Discovery &
Drug Design
Roles of Computation in Drug Discovery
“Is there really a case where a
drug that is on the market was
designed by a computer?”
“The reality is that the use
of computers and computer
methods permeates all
aspects of drug discovery
today”
Jorgensen (2004)
Computation in Drug Discovery
“Drug discovery is complex: Successful
teams and companies need to be
congratulated, whereas search for one
individual or computer program is
counterproductive. There is not going to
be a voila moment at the computer
terminal. Instead, there is systematic
use of wide-ranging computational tools
to facilitate and enhance the drug
discovery process”
Jorgensen (2004)
Structure-based Drug Design – An Introduction
http://csb.stanford.edu/levitt/demo_lectures/lec7/Lecture7/Discovering_Drugs/pages/Structure_Based_Drug_Design.html
http://www.biocryst.com/our_science
www.bmsc.washington.edu/WimHol/sbdd3.JPG
Wim Hol
Drug targets and Drug discovery: Issues
Lead Generation
Lead optimization
De novo design
Virtual screening
All drugs presently on the market are estimated
to target fewer than 500 biomolecules
Docking & Scoring
Issues: Scoring function, solvent effect and protein
flexibility
Bleicher et al. (2003)
Four proteins: trypsin, HIV PR,
CDK2 and AChE
Test set for each protein: 10,000
randomly selected compounds
6000 docking poses were selected
for the top 1000 compounds
They served as initial conformations
for MD simulations
Combination of
docking and MD
showed a higher
and more stable
enrichment
performance than
docking method
used alone
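"Enrichment" is commonly quantified as an enrichment factor: the fraction of known actives in the top-ranked subset divided by the fraction of actives in the whole library. The sketch below uses that common definition, which may differ from the exact measure used in the study summarised above. The function name, the convention that lower scores are better and the ten-compound example are all assumptions for illustration.

```python
def enrichment_factor(scores, is_active, top_fraction=0.1):
    """EF = (actives in top subset / subset size) / (all actives / library size).
    Assumes a lower score means a better predicted binder."""
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0])
    n_top = max(1, round(len(ranked) * top_fraction))
    hits_top = sum(1 for _, active in ranked[:n_top] if active)
    hits_all = sum(1 for active in is_active if active)
    if hits_all == 0:
        raise ValueError("library contains no known actives")
    return (hits_top / n_top) / (hits_all / len(ranked))

# Hypothetical library of 10 compounds; the two best-scored ones are the actives.
scores = [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1]
actives = [True, True, False, False, False, False, False, False, False, False]
print(enrichment_factor(scores, actives, top_fraction=0.2))
```

A ranking that is no better than random gives a value near 1, so a method with higher and more stable values across targets is ranking true binders more reliably.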
A special-purpose computer, MDGRAPE-3, was
used for MD simulations
It is a cluster of personal computers,
each equipped with 24 MDGRAPE-3 chips and
with a peak speed of approximately 2 Tflops
50 such computers were used
Average computational time for a single
protein-ligand complex is 2.5 h
For 6,000 protein-ligand conformations,
calculations were completed in a week
Steered MD in Drug Discovery
Jorgensen, 2010
Steered Molecular Dynamics to compute the force required to
extract the inhibitors from enzymes
A small spring is connected to the ligand in the complex
This spring is pulled at constant velocity into the surrounding water
Force is determined from the extension of the spring and recorded
as a function of time
Strongly-bound inhibitors → higher peak forces
Weaker inhibitors → flatter profiles
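The pulling protocol can be illustrated with a one-dimensional toy model. This is not a molecular dynamics simulation: it assumes a single coordinate x for the ligand, a Gaussian well standing in for the binding site, overdamped motion without thermal noise, and arbitrary units. The function name and all parameter values are hypothetical.

```python
import math

def pull_profile(well_depth, k=5.0, v=0.01, gamma=1.0, width=1.0,
                 dt=0.01, total_time=600.0):
    """Return (time, spring_force) pairs for a ligand pulled out of a
    Gaussian well U(x) = -D * exp(-x^2 / (2 w^2)) by a moving spring."""
    x = 0.0
    profile = []
    for i in range(int(total_time / dt)):
        t = i * dt
        # Force from the extension of the spring; its far end moves at speed v.
        spring_force = k * (v * t - x)
        # Restoring force of the binding well, -dU/dx.
        well_force = -well_depth * x / width**2 * math.exp(-x**2 / (2 * width**2))
        # Overdamped update: velocity is proportional to the net force.
        x += dt * (spring_force + well_force) / gamma
        profile.append((t, spring_force))
    return profile

strong_peak = max(f for _, f in pull_profile(well_depth=10.0))
weak_peak = max(f for _, f in pull_profile(well_depth=2.0))
print(strong_peak, weak_peak)
```

In this toy model the deeper well produces the larger peak force before the ligand escapes, while the shallow well gives a lower, flatter force trace.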
Protein-protein interactions in
programmed cell death
Bcl-2 family
complex
structures
Total number of
atoms: ~50,000 to
~75,000
Simulation period:
50 ns
Lama and Sankararamakrishnan, Proteins (2008)
Lama and Sankararamakrishnan, Biochemistry (2010)
III. Large Biomolecular
Assemblies
The first biomolecular simulation was performed in 1977
MD simulations of channel proteins in bilayers
AQP1: 75,057 atoms
GlpF: 81,006 atoms
PfAQP: 81,503 atoms
A 30 ns production run was performed for all three systems.
Each simulation takes ~40 days of CPU time (total CPU
time ~120 days).
Alok Jain, Ravi Verma and R. Sankararamakrishnan,
Manuscript in preparation
Simulations reaching the million-atom mark
Complete virus: 1 million atoms
(Freddolino et al., 2006)
Arrays of light-harvesting proteins – 1
million atoms (Chandler et al., 2008)
BAR domain proteins – 2.3 million atoms
(Yin et al., 2009)
The flagellum – 2.4 million atoms (Kitao
et al., 2006)
Complete virus: 1 million atoms
Minimization and equilibration
Cluster of 48 AMD Athlon
2600+ processors
Simulation
256 Altix nodes at NCSA
@UIUC
1.1 ns/day
(Freddolino et al., 2006)
Functions of large molecular
machines
Fungal fatty acid synthase
30S ribosome
MD of protein-conducting channel
bound to ribosome
Bacterial ribosomes
are important targets
for antibiotics
2.7 million atoms
50 ns simulation
Largest system
simulated to date
Gumbart et al. (2009)
Summary diagram: HPC at the centre of phylogenetic analysis,
drug design & discovery, and large biomolecular systems
HPC Platforms for Biology Applications
FPGA-boards: Field programmable gate arrays are ICs
which can be programmed. FPGA boards with commonly
used bioinformatics algorithms are available
Graphics-Processing Unit (GPU): All bioinformatics
applications
Grid Computing: Many applications
Distributed Computing: Protein folding, Drug docking
Cloud Computing:
Acknowledgements
Anjali Bansal
Dilraj Lama
Alok Jain
Tuhin Kumar Pal
Priyanka Srivastava
Vivek Modi
Ravi Kumar Verma
Krishna Deepak
Phani Deep
DST, DBT, CSIR, MHRD