Computers and Computing in Biology
From Sequence Analysis to Simulations:
Applications of HPC in Modern Biology
R. Sankararamakrishnan
Department of Biological Sciences & Bioengineering
IIT-Kanpur
IIT-K REACH Symposium 2010
Oct 9th 2010
Computers and Computing in Biology
Mathematical Biology
Biostatistics
Biomathematics
Quantitative Biology
Biophysics
Bioinformatics
Computational Biology
Definitions
What is Bioinformatics? - Research, development, or
application of computational tools and approaches for
expanding the use of biological, medical, behavioral or health
data, including those to acquire, store, organize, archive,
analyze, or visualize such data.
What is Computational Biology? - The development and
application of data-analytical and theoretical methods,
mathematical modeling and computational simulation
techniques to the study of biological, behavioral, and social
systems.
- NIH Definition
http://www.bisti.nih.gov/
Explosive growth of biological data
HPC Applications: Three examples
Evolutionary relationships among a given set of protein or DNA sequences
Drug discovery and design
Structure-function relationships of large biomolecular assemblies
I. HPC in Phylogenetics
Phylogeny and Phylogenetic tree
Study of evolutionary relationships
(sequences/species)
Relationships between organisms with a common ancestor
A phylogenetic tree is a graph representing the evolutionary history of sequences/species
Phylogenetic trees can be represented in
two different ways
Rooted tree (Orangutan, Gorilla, Chimpanzee, Human): has a unique root node; shows the direction of evolution
Unrooted tree (Orangutan, Gorilla, Chimpanzee, Human): no assumption about common ancestry
Molecular phylogeny in a criminal investigation
Maximum Likelihood Method – An Introduction
David Mount (2002)
For each unrooted tree, there will be many
possible rooted trees
Number of possible unrooted and rooted trees
$$N_R = \frac{(2n-3)!}{2^{\,n-2}\,(n-2)!}, \qquad n \ge 2$$
$$N_U = \frac{(2n-5)!}{2^{\,n-3}\,(n-3)!}, \qquad n \ge 3$$
Species   Number of Rooted Trees              Number of Unrooted Trees
2         1                                   1
3         3                                   1
4         15                                  3
5         105                                 15
10        34,459,425                          2,027,025
15        213,458,046,767,875                 7,905,853,580,625
20        8,200,794,532,637,891,559,375       221,643,095,476,699,771,875
Computing phylogenetic trees using ML method
Maximum likelihood phylogeny problem is NP-hard
Very CPU intensive
For trees containing more than 20 to 25 sequences, the
problem can no longer be solved exactly (by exhaustive search)
Efficient heuristic tree search algorithms are required
to reduce the size of the search space
Recently developed algorithms:
IQPNNI, PHYML, GARLI, RAxML
None of these algorithms is guaranteed to find the
ML tree; they only yield the best known ML tree
Parallelization strategy
Ott et al. (2008)
RAxML performance in some HPC platforms
212 sequences, 566,470 base pairs
One of the largest datasets analyzed under ML
IBM BlueGene/L; 1024 CPUs
7 distinct tree searches in 14 hours
Ott et al. (2008)
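The slide does not spell out the parallelization scheme. One coarse-grained pattern consistent with running several distinct tree searches is to start independent heuristic searches from different random seeds and keep the best-scoring result. The sketch below only illustrates that pattern: `toy_log_likelihood` and `toy_search` are hypothetical stand-ins for a real likelihood function and tree search, and a thread pool is used for portability, whereas production codes distribute such work across processes or cluster nodes.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

def toy_log_likelihood(x: float) -> float:
    """Hypothetical multimodal score standing in for a tree's log-likelihood."""
    return -(x * x) / 10.0 + 3.0 * math.cos(3.0 * x)

def toy_search(seed: int, steps: int = 2000):
    """One independent heuristic search: seeded random hill climbing."""
    rng = random.Random(seed)          # private generator, so runs are reproducible
    x = rng.uniform(-10.0, 10.0)       # random starting point
    best = toy_log_likelihood(x)
    for _ in range(steps):
        candidate = x + rng.gauss(0.0, 0.5)
        score = toy_log_likelihood(candidate)
        if score > best:               # accept only improvements
            x, best = candidate, score
    return best, seed

def run_searches(seeds):
    """Run all searches concurrently and return (best_score, seed) of the winner."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(toy_search, seeds))
    return max(results)

best_score, best_seed = run_searches(range(7))
print(f"best score {best_score:.3f} from search with seed {best_seed}")
```

Because the searches share no state, the same structure scales to as many workers as there are searches; like the heuristics named above, it returns only the best tree found, not a guaranteed optimum.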
Phylogenetic analysis of plant channel proteins
identified new subfamily
Bansal and Sankararamakrishnan, BMC Struct. Biol. (2007)
Gupta and Sankararamakrishnan, BMC Plant Biol. (2009)
II. HPC in Drug Discovery &
Drug Design
Roles of Computation in Drug Discovery
“Is there really a case where a
drug that is on the market was
designed by a computer?”
“The reality is that the use
of computers and computer
methods permeates all
aspects of drug discovery
today”
Jorgensen (2004)
Computation in Drug Discovery
“Drug discovery is complex: Successful
teams and companies need to be
congratulated, whereas search for one
individual or computer program is
counterproductive. There is not going to
be a voila moment at the computer
terminal. Instead, there is systematic
use of wide-ranging computational tools
to facilitate and enhance the drug
discovery process”
Jorgensen (2004)
Structure-based Drug Design – An Introduction
http://csb.stanford.edu/levitt/demo_lectures/lec7/Lecture7/Discovering_Drugs/pages/Structure_Based_Drug_Design.html
http://www.biocryst.com/our_science
www.bmsc.washington.edu/WimHol/sbdd3.JPG
Wim Hol
Drug targets and Drug discovery: Issues
Lead Generation
Lead optimization
De novo design
Virtual screening
All drugs that are presently on the market are estimated
to target fewer than 500 biomolecules
Docking & Scoring
Issues: Scoring function, solvent effect and protein
flexibility
Bleicher et al. (2003)
Four proteins: trypsin, HIV PR,
CDK2 and AChE
Test set for each protein: 10,000
randomly selected compounds
6000 docking poses were selected
for the top 1000 compounds
They served as initial conformations
for MD simulations
Combination of
docking and MD
showed a higher
and more stable
enrichment
performance than
docking method
used alone
A special-purpose computer system, MDGRAPE-3, was
used for the MD simulations
It is a cluster of personal computers, each equipped with
24 MDGRAPE-3 chips and having a peak speed of
approximately 2 Tflops
50 such computers were used
Average computational time for a single
protein-ligand complex: 2.5 h
For the 6,000 protein-ligand conformations,
calculations were completed in a week
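The slides do not define the enrichment measure used in the docking and MD comparison above. A commonly used definition is the enrichment factor: the hit rate in the top-ranked fraction of a screened library divided by the hit rate in the whole library. The sketch below assumes that definition; the function name and the example numbers (50 actives among 10,000 compounds, 25 of them in the top 1,000) are hypothetical and not results from the study.

```python
def enrichment_factor(ranked_is_active, fraction):
    """Enrichment factor for the top `fraction` of a ranked compound list.

    ranked_is_active: sequence of booleans, best-scored compound first.
    fraction: share of the list that is selected, e.g. 0.1 for the top 10%.
    """
    n_total = len(ranked_is_active)
    n_selected = max(1, round(n_total * fraction))
    hits_total = sum(ranked_is_active)
    if hits_total == 0:
        raise ValueError("no active compounds in the library")
    hits_selected = sum(ranked_is_active[:n_selected])
    return (hits_selected / n_selected) / (hits_total / n_total)

# Hypothetical ranking: 25 actives in the top 1,000 and 25 more in the remaining 9,000.
ranking = [True] * 25 + [False] * 975 + [True] * 25 + [False] * 8975
print(enrichment_factor(ranking, 0.1))
```

A value of 1 means the ranking is no better than random selection; larger values mean actives are concentrated near the top of the list.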
Steered MD in Drug Discovery
Jorgensen, 2010
Steered Molecular Dynamics to compute the force required to
extract the inhibitors from enzymes
A small spring is connected to the ligand in the complex
This spring is pulled at constant velocity into the surrounding water
Force is determined from the extension of the spring and recorded
as a function of time
Strongly bound inhibitors: higher peak forces
Weaker inhibitors: flatter profiles
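The pulling protocol described above can be illustrated with a one-dimensional toy model. This is not an MD simulation: the ligand is a single coordinate x in an assumed Gaussian binding well, the dynamics are overdamped and noise-free, and every parameter value is hypothetical. The recorded force is the spring extension times the spring constant, F(t) = k (v t - x(t)).

```python
import math

def smd_force_profile(well_depth, k=10.0, v=1.0, gamma=1.0,
                      width=1.0, dt=1e-3, t_end=10.0):
    """Return [(time, spring_force)] for a ligand pulled out of a Gaussian well.

    well_depth: depth of the binding well (deeper means more strongly bound).
    k, v: spring constant and constant pulling velocity of the spring anchor.
    gamma: friction coefficient of the overdamped dynamics.
    """
    x = 0.0                                   # ligand starts at the well minimum
    profile = []
    for step in range(int(t_end / dt)):
        t = step * dt
        spring = k * (v * t - x)              # force from the spring extension
        well = (-well_depth * x / width**2
                * math.exp(-x * x / (2.0 * width**2)))  # restoring force of the well
        x += dt * (spring + well) / gamma     # explicit Euler step
        profile.append((t, spring))
    return profile

def peak_force(profile):
    """Largest spring force recorded along the pulling trajectory."""
    return max(force for _, force in profile)

strong = peak_force(smd_force_profile(well_depth=20.0))
weak = peak_force(smd_force_profile(well_depth=5.0))
print(f"peak force, deep well: {strong:.2f}; shallow well: {weak:.2f}")
```

In this toy model the deeper well produces the higher peak force, mirroring the qualitative trend on the slide; quantitative comparisons between inhibitors require full atomistic simulations.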
Protein-protein interactions in
programmed cell death
Bcl-2 family complex structures
Total number of atoms: ~50,000 to ~75,000
Simulation period: 50 ns
Lama and Sankararamakrishnan, Proteins (2008)
Lama and Sankararamakrishnan, Biochemistry (2010)
III. Large Biomolecular
Assemblies
The first biomolecular simulation was performed in 1977
MD simulations of channel proteins in bilayers
AQP1: 75,057 atoms
GlpF: 81,006 atoms
PfAQP: 81,503 atoms
A 30 ns production run was performed for all three systems.
Each simulation takes ~40 days of CPU time (total CPU time ~120 days).
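The throughput implied by these figures is simple arithmetic, sketched below. The variable and function names are illustrative, and the calculation takes the reported ~40 days per 30 ns run at face value.

```python
def ns_per_day(simulated_ns: float, days: float) -> float:
    """Average simulation throughput in nanoseconds per day."""
    return simulated_ns / days

# Atom counts of the three channel-protein systems, as listed above.
systems = {"AQP1": 75057, "GlpF": 81006, "PfAQP": 81503}

run_length_ns = 30.0      # production run per system
days_per_run = 40.0       # reported time per simulation

rate = ns_per_day(run_length_ns, days_per_run)
total_days = days_per_run * len(systems)

print(f"throughput per system: {rate:.2f} ns/day")
print(f"total for {len(systems)} systems: {total_days:.0f} days")
```

At this rate each additional nanosecond costs well over a day of computing per system, which is why larger assemblies and longer timescales call for HPC resources.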
Alok Jain, Ravi Verma and R. Sankararamakrishnan,
Manuscript in preparation
Simulations reaching the million-atom mark
Complete virus: 1 million atoms
(Freddolino et al., 2006)
Arrays of light-harvesting proteins – 1
million atoms (Chandler et al., 2008)
BAR domain proteins – 2.3 million atoms
(Yin et al., 2009)
The flagellum – 2.4 million atoms (Kitao
et al., 2006)
Complete virus: 1 million atoms
Minimization and equilibration: cluster of 48 AMD Athlon 2600+ processors
Simulation: 256 Altix nodes at NCSA, UIUC; 1.1 ns/day
(Freddolino et al., 2006)
Functions of large molecular
machines
Fungal fatty acid synthase
30S ribosome
MD of protein-conducting channel
bound to ribosome
Bacterial ribosomes
are important targets
for antibiotics
2.7 million atoms
50 ns simulation
Largest system
simulated to date
Gumbart et al. (2009)
HPC applications in biology: phylogenetic analysis; drug design and discovery; large biomolecular systems
HPC Platforms for Biology Applications
FPGA boards: Field-programmable gate arrays (FPGAs) are ICs
that can be programmed. FPGA boards with commonly
used bioinformatics algorithms are available
Graphics-Processing Unit (GPU): All bioinformatics
applications
Grid Computing: Many applications
Distributed Computing: Protein folding, Drug docking
Cloud Computing:
Acknowledgements
Anjali Bansal
Dilraj Lama
Alok Jain
Tuhin Kumar Pal
Priyanka Srivastava
Vivek Modi
Ravi Kumar Verma
Krishna Deepak
Phani Deep
DST, DBT, CSIR, MHRD