Poster - School of Systems Biology

A MULTIBODY ATOMIC STATISTICAL POTENTIAL FOR PREDICTING ENZYME-INHIBITOR BINDING ENERGY
Majid Masso ([email protected])
Laboratory for Structural Bioinformatics, School of Systems Biology, George Mason University, 10900 University Blvd. MS 5B3, Manassas, Virginia 20110, USA
I. Abstract
Accurate prediction of enzyme-inhibitor binding energy has the capacity to speed drug
design and chemical genomics efforts by helping to narrow the focus of experiments. Here,
a non-redundant set of three hundred high-resolution crystallographic enzyme-inhibitor
structures was compiled for analysis, consisting of complexes with known binding energies
(ΔG) derived from experimentally determined inhibition constants (ki). Additionally, a
separate set of 1417 diverse high-resolution macromolecular crystal structures was
collected for the purpose of creating an all-atom knowledge-based statistical potential, via
application of the Delaunay tessellation computational geometry technique. Next, two
hundred of the enzyme-inhibitor complexes were randomly selected to develop a model for
predicting binding energy, first by tessellating structures of the complexes as well as the
enzymes without their bound inhibitors, then by using the statistical potential to calculate a
topological score for each structure tessellation. We derived as a predictor of binding
energy an empirical linear function of the difference between topological scores for a
complex and its isolated enzyme. A correlation coefficient (r) of 0.79 was obtained for the
experimental and calculated ΔG values, with a standard error of 2.34 kcal/mol. Lastly, the
model was evaluated with the held-out set of one hundred complexes, for which structure
tessellations were performed in order to calculate topological score differences, and
binding energy predictions were generated from the derived linear function. Calculated
binding energies for the test data also compared well with their experimental counterparts,
displaying a correlation coefficient of r = 0.77 with a standard error of 2.50 kcal/mol.
II. Protein Data Bank (http://www.rcsb.org/pdb)
• PDB – repository of solved (x-ray, NMR, ...) structures
[Figure: excerpt of a PDB structure file listing Atom, X, Y, Z coordinate columns]
III. Macromolecular Modeling
• Physics-based energy calculations using quantum mechanics are computationally impractical
• Same for molecular mechanics-based potential energy functions (i.e., force fields): E(total) = E(bond) + E(angle) + E(dihedral) + E(electrostatic) + E(van der Waals)
• Alternative (our approach): knowledge-based potentials of mean force (i.e., generated from known protein structures)
IV. Knowledge-Based Potentials of Mean Force
• Assumptions:
– At equilibrium, native state has global free energy minimum
– Microscopic states (i.e., features) follow Boltzmann distribution
• Examples:
– Well-documented in the literature: distance-dependent pairwise interactions at the atomic or amino acid level
– This study: inclusion of higher-order contributions by developing all-atom four-body statistical potentials
• Motivation (our prior work):
– Four-body protein potential at the amino acid level
V. Motivational Example: Pairwise Amino Acid Potential
• A 20-letter protein alphabet yields 210 residue pairs
• Obtain large, diverse PDB dataset of single protein chains
• For each residue pair (i, j), calculate the relative frequency fij with which they appear within a given distance (e.g., 12 angstroms) of each other in all the protein structures
• Calculate a rate pij expected by chance alone from a background or reference distribution (more later…)
• Apply inverted Boltzmann principle: sij = log(fij / pij) quantifies interaction propensity and is proportional to the energy of interaction (by a factor of ‘–RT’)
VI. All-Atom Four-Body Statistical Potential
• Obtain diverse PDB dataset of 1417 single chain and multimeric proteins, many complexed to ligands (see XV. References)
• Six-letter atomic alphabet: C, N, O, S, M (all metals), X (all other non-metals)
• Apply Delaunay tessellation to the atomic point coordinates of each PDB file – objectively identifies all nearest-neighbor quadruplets of atoms in the structure (8 angstrom cutoff)
VII. All-Atom Four-Body Statistical Potential
• A six-letter atomic alphabet yields 126 distinct quadruplets
• For each quad (i, j, k, l), calculate observed rate of occurrence fijkl among all tetrahedra from the 1417 structure tessellations
• Compute rate pijkl expected by chance from a multinomial reference distribution:
$$p_{ijkl} = \frac{4!}{\prod_{n=1}^{6} t_n!} \prod_{n=1}^{6} a_n^{t_n}, \quad \text{where} \quad \sum_{n=1}^{6} a_n = 1 \quad \text{and} \quad \sum_{n=1}^{6} t_n = 4$$
• a_n = proportion of atoms from all structures that are of type n
• t_n = number of occurrences of atom type n in the quad
• Apply inverted Boltzmann principle: sijkl = log(fijkl / pijkl) quantifies the interaction propensity and is proportional to the energy of atomic quadruplet interaction
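As a worked check of the four-body potential, the minimal Python sketch below recomputes the multinomial chance rate and the log-odds score for the CCCC quad, using the atom counts and total tetrahedron count reported in Section VIII and the CCCC count from Section IX. The function names are hypothetical, and a base-10 logarithm is an assumption (the poster writes only "log"); under that assumption the result agrees with the tabulated CCCC score of about -0.14.

```python
from collections import Counter
from math import factorial, log10, prod

# Atom counts per type, taken from the summary data in Section VIII.
ATOM_COUNTS = {"C": 3612988, "N": 969253, "O": 1088410,
               "S": 28502, "M": 2529, "X": 4299}
TOTAL_TETRAHEDRA = 34504737  # total tetrahedron count, Section VIII


def atom_proportions(counts):
    """Return a_n, the fraction of all atoms that are of type n."""
    total = sum(counts.values())
    return {atom: c / total for atom, c in counts.items()}


def expected_rate(quad, proportions):
    """Multinomial chance rate p = 4! / (prod t_n!) * prod a_n^t_n."""
    t = Counter(quad)  # t_n: occurrences of each atom type in the quad
    orderings = factorial(4) // prod(factorial(k) for k in t.values())
    return orderings * prod(proportions[a] ** k for a, k in t.items())


def quad_score(quad, observed_count, total, proportions):
    """Log-odds score s = log10(f / p); None if the quad was never observed."""
    if observed_count == 0:
        return None  # log(0) is undefined
    f = observed_count / total  # observed rate of occurrence
    return log10(f / expected_rate(quad, proportions))


a = atom_proportions(ATOM_COUNTS)
s_cccc = quad_score("CCCC", 4015872, TOTAL_TETRAHEDRA, a)
print(round(s_cccc, 5))
```

A negative score means the quad occurs less often than chance predicts, and a positive score means it occurs more often.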
VIII. Summary Data for the 1417 Structure Files and their Delaunay Tessellations
Atom Type                   Count        Proportion
C                           3,612,988    0.633193
N                           969,253      0.169866
O                           1,088,410    0.190749
S                           28,502       0.004995
M (all metals)              2,529        0.000443
X (all other non-metals)    4,299        0.000754
Total atom count: 5,705,981
Total tetrahedron count: 34,504,737
• Delaunay tessellation of any macromolecular structure yields an aggregate of tetrahedral simplices
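Once a structure has been tessellated, each tetrahedron has to be mapped to one of the atomic quads. The sketch below is one way this bookkeeping might look in Python. It assumes the tetrahedra are already available as 4-tuples of atom indices from a Delaunay tessellation tool (the poster cites Qhull), and it assumes the 8 angstrom cutoff applies to every edge of a tetrahedron. The function names and the toy coordinates are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import dist

CUTOFF = 8.0  # angstroms; assumed here to apply to every tetrahedron edge


def quad_label(simplex, atom_types):
    """Order-independent label, e.g. atoms N, C, O, C -> 'CCNO'."""
    return "".join(sorted(atom_types[i] for i in simplex))


def within_cutoff(simplex, coords, cutoff=CUTOFF):
    """True if all six edges of the tetrahedron are no longer than cutoff."""
    return all(dist(coords[i], coords[j]) <= cutoff
               for i, j in combinations(simplex, 2))


def count_quads(simplices, coords, atom_types, cutoff=CUTOFF):
    """Tally quad labels over the tetrahedra that pass the edge cutoff."""
    return Counter(quad_label(s, atom_types)
                   for s in simplices if within_cutoff(s, coords, cutoff))


# Toy input: five atoms and two tetrahedra (hypothetical, not a real tessellation).
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 1.5, 0.0),
          (0.0, 0.0, 1.5), (20.0, 0.0, 0.0)]
atom_types = ["N", "C", "O", "C", "S"]
simplices = [(0, 1, 2, 3), (1, 2, 3, 4)]

quad_counts = count_quads(simplices, coords, atom_types)
print(quad_counts)
```

The second toy tetrahedron includes an atom 20 angstroms away, so it fails the cutoff and only the first one is counted.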
IX. All-Atom Four-Body Statistical Potential
Quad    Count      sijkl
CCCC    4015872    -0.14024
CCCM    1592       -0.98922
CCCN    4025206    -0.16987
CCCO    6202159    -0.03247
CCCS    293157     0.224008
CCCX    2796       -0.97505
CCMM    132        0.908235
CCMN    3318       -0.57598
CCMO    5325       -0.42089
CCMS    2293       0.795108
CCMX    15         -0.5677
CCNN    1797552    -0.12464
CCNO    8233136    0.184864
CCNS    124653     -0.05308
CCNX    2007       -1.02473
CCOO    3366568    0.047161
CCOS    198630     0.098905
CCOX    4626       -0.71243
CCSS    15288      0.868158
CCSX    144        -0.63735
CCXX    143        0.482159
CMMM    23         3.480397
CMMN    144        1.216422
CMMO    256        1.415945
CMMS    662        3.41048
CMMX    1          1.41113
CMNN    2474       -0.13203
CMNO    6267       -0.07975
CMNS    2588       1.118068
CMNX    26         -0.05842
CMOO    8481       0.302308
CMOS    1010       0.659069
CMOX    68         0.308765
CMSS    2047       2.848813
CMSX    13         1.172117
CMXX    6          1.958862
CNNN    102035     -0.62305
CNNO    1995038    0.140679
CNNS    15892      -0.37618
CNNX    578        -0.99392
CNOO    2734639    0.227273
CNOS    95438      0.050981
CNOX    2168       -0.77117
CNSS    4264       0.584024
CNSX    37         -0.95711
CNXX    61         0.382553
COOO    524994     -0.06271
COOS    34429      -0.14114
COOX    23801      0.520038
COSS    4380       0.545326
COSX    58         -0.81224
COXX    65         0.359781
CSSS    285        1.417735
CSSX    5          0.006247
CSXX    4          0.730845
CXXX    9          2.381656
MMMM    83         7.794725
MMMN    37         4.258301
MMMO    29         4.102142
MMMS    379        6.8003
MMMX    0          -----
MMNN    83         1.849597
MMNO    102        1.587734
MMNS    363        3.720958
MMNX    0          -----
MMOO    306        2.31553
MMOS    104        3.127729
MMOX    3          2.409325
MMSS    254        5.398477
MMSX    2          3.815151
MMXX    0          -----
MNNN    1030       0.53596
MNNO    1128       0.047955
MNNS    561        1.326526
MNNX    5          0.098041
MNOO    3744       0.518626
MNOS    314        0.723107
MNOX    29         0.510083
MNSS    793        3.008398
MNSX    5          1.328573
MNXX    9          2.706383
MOOO    5430       1.106856
MOOS    156        0.669977
MOOX    168        1.523669
MOSS    210        2.380989
MOSX    4          1.181307
MOXX    55         3.442148
MSSS    62         3.910199
MSSX    2          2.763224
MSXX    0          -----
MXXX    16         5.786451
NNNN    3878       -0.8697
NNNO    46665      -0.44173
NNNS    460        -0.86605
NNNX    34         -1.17582
NNOO    340620     0.195102
NNOS    5637       -0.30523
NNOX    302        -0.75477
NNSS    311        0.319427
NNSX    6          -0.87471
NNXX    5          0.168652
NOOO    171147     0.021937
NOOS    10697      -0.07737
NOOX    3102       0.206513
NOSS    922        0.440012
NOSX    12         -0.92506
NOXX    61         0.903627
NSSS    33         1.052833
NSSX    0          -----
NSXX    0          -----
NXXX    3          2.475964
OOOO    34212      -0.12555
OOOS    4240       -0.0525
OOOX    9553       1.121777
OOSS    300        0.203077
OOSX    36         -0.19726
OOXX    128        1.476181
OSSS    38         1.063748
OSSX    3          0.305472
OSXX    0          -----
OXXX    2          2.249518
SSSS    6          2.446092
SSSX    0          -----
SSXX    0          -----
SXXX    0          -----
XXXX    0          -----
Quads with a count of 0 were not observed in any tessellation, so no score is defined for them (shown as -----).
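The 126 quad labels of the four-body potential table, in the same alphabetical order, can be generated as unordered selections of four atom types with repetition allowed. A minimal Python sketch (the variable names are illustrative):

```python
from itertools import combinations_with_replacement

ALPHABET = "CMNOSX"  # the six atom types in alphabetical order

# Each quad is an unordered multiset of four atom types.
quads = ["".join(q) for q in combinations_with_replacement(ALPHABET, 4)]
print(len(quads), quads[0], quads[-1])  # prints: 126 CCCC XXXX
```

The count follows from the multiset formula C(6 + 4 - 1, 4) = C(9, 4) = 126.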
III. Macromolecular Modeling
• Native structure is conformation having lowest energy
• Each structure file contains atomic 3D coordinate data
X. Topological Score (TS)
• Each simplex can be scored using the all-atom four-body potential based on the quad present at the four vertices
• Topological score (or ‘total potential’) of the structure: the sum of the scores sijkl of all constituent simplices in the tessellation
TS = Σ sijkl
XI. Topological Score Difference (ΔTS)
• ΔTS: the difference between the topological scores of a complex and of its isolated enzyme (i.e., the enzyme without its bound inhibitor)
XII. Application of ΔTS: Predicting Enzyme–Inhibitor Binding Energy
• MOAD – repository of experimental inhibition constants (ki) for protein–ligand complexes whose structures are in PDB
• Collected ki values for 300 complexes reflecting diverse protein structures
• Obtained experimental binding energy from ki via ΔGexp = –RT ln(ki)
• Calculated ΔTS for complexes
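The topological score and its difference can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation. The scores are copied from the four-body potential table, the two lists of quads are hypothetical stand-ins for real tessellations, and the order of subtraction (complex minus enzyme) is an assumption, since the abstract only describes a difference between the two scores.

```python
# A few quad scores copied from the four-body potential table (Section IX).
SCORES = {"CCCC": -0.14024, "CCNO": 0.184864, "CCOO": 0.047161,
          "CNOO": 0.227273, "CCOS": 0.098905}


def topological_score(quads, scores=SCORES):
    """TS: sum of the four-body scores over all simplices of a tessellation."""
    return sum(scores[q] for q in quads)


def topological_score_difference(complex_quads, enzyme_quads, scores=SCORES):
    """Delta-TS, taken here as TS(complex) - TS(enzyme alone).

    The order of subtraction is an assumption made for this sketch.
    """
    return (topological_score(complex_quads, scores)
            - topological_score(enzyme_quads, scores))


# Hypothetical tessellations, each given as the list of quads at its simplices.
enzyme_quads = ["CCCC", "CCNO", "CCOO"]
complex_quads = ["CCCC", "CCNO", "CCOO", "CNOO", "CCOS"]

delta_ts = topological_score_difference(complex_quads, enzyme_quads)
print(round(delta_ts, 6))
```

In practice the complex and the enzyme without its inhibitor are tessellated separately, so the two lists of simplices need not share any entries.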
XIII. Predicting Enzyme–Inhibitor Binding Energy
• Randomly selected 200 complexes to train a model
• Correlation coefficient r = 0.79 between ΔTS and ΔGexp
• Empirical linear transform of ΔTS to reflect energy values: ΔGcalc = (1 / 0.0003) × ΔTS – 6.24
• Linear => same r = 0.79 value between ΔGcalc and ΔGexp
• Also, standard error of SE = 2.34 kcal/mol and fitted regression line of y = 0.98x – 0.41 (y = ΔGcalc and x = ΔGexp)
XIV. Predicting Enzyme–Inhibitor Binding Energy
• For the test set of 100 remaining complexes:
– r = 0.77 between ΔGcalc and ΔGexp
– SE = 2.50 kcal/mol
– Fitted regression line is y = 1.07x + 0.46
• All training/test data available online as a text file (see XV. References)
XV. References and Acknowledgments
• PDB dataset: http://proteins.gmu.edu/automute/tessellatable1417.txt
• Train/test dataset: http://proteins.gmu.edu/automute/MOAD300ki.txt
• PDB (structure DB): http://www.rcsb.org/pdb
• MOAD (ligand binding DB): http://bindingmoad.org/
• Qhull (Delaunay tessellation): http://www.qhull.org/
• UCSF Chimera (ribbon/ball-stick structure visualization): http://www.cgl.ucsf.edu/chimera/
• Matlab (tessellation visualization): http://www.mathworks.com/products/matlab/
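The empirical linear transform from Section XIII, and the remark that a linear transform leaves the correlation coefficient unchanged, can be illustrated with a short Python sketch. The ΔTS values and experimental energies below are hypothetical numbers chosen only for illustration, not data from the poster, and the function names are likewise illustrative.

```python
from math import sqrt


def delta_g_calc(delta_ts):
    """Empirical linear transform from Section XIII (kcal/mol)."""
    return (1 / 0.0003) * delta_ts - 6.24


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical delta-TS values and experimental energies, for illustration only.
delta_ts_values = [-0.0012, -0.0004, -0.0020, -0.0009]
dg_exp = [-10.1, -7.9, -12.5, -9.0]

dg_calc = [delta_g_calc(d) for d in delta_ts_values]
r_raw = pearson_r(delta_ts_values, dg_exp)   # r between delta-TS and dG_exp
r_calc = pearson_r(dg_calc, dg_exp)          # r between dG_calc and dG_exp
print(round(r_raw, 4), round(r_calc, 4))
```

Because the slope of the transform is positive, the two printed correlation coefficients agree, which mirrors the "Linear => same r" bullet in Section XIII.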