Chain Growing Using Potentials Computed by Incremental


Chain Growing Using Statistical
Energy Functions
David A. O'Brien
Balasubramanian Krishnamoorthy
Jack Snoeyink
Alex Tropsha
Andrew Leaver-Fay
Shuquan Zong
UNC Chapel Hill
David A. O’Brien
Overview
• Lattice Chain Growth Algorithm
• Statistical Energy Functions
    • 2-body Miyazawa-Jernigan Potential
    • 4-body Potential
    • Local Shape Potential
• Results
    • Chains
    • Identifying Good Decoys
• Current Work
    • New Scoring Functions
    • Incremental Tetrahedralization
• Future work
Chain Growing - Introduction

Lattice Chain Growing Goals:
• Test measures of proteins
    • Build protein chains that maximize a given measure.
    • If these chains appear native-like, this confirms that the measure is valid.
• Predict protein structures from sequence information alone (ab initio)
    • Develop an algorithm to build 3D folded protein decoys from the sequence that are similar to the native structure.
    • Evaluate these decoys and determine which are native-like. In short, be able to pick the most native-like structure from the large set of decoys we will generate.
Lattice Chain Growth Algo.
• Cubic lattice (311) with 24 possible moves {(3,1,1), (3,1,-1), …, (-3,1,1)}
• Generate a chain configuration by sequential addition of links until the full length of the chain is reached.
• New links cannot be placed in the zone of exclusion of other links and must satisfy angle constraints.
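
The 24 moves of the (311) lattice can be enumerated as a minimal sketch: take every placement of the 3 among the three axes and every sign combination. The function name `lattice_moves_311` is a hypothetical helper, not part of the original work.

```python
from itertools import permutations, product

def lattice_moves_311():
    """Return all move vectors obtained from (3,1,1) by permuting axes and flipping signs."""
    moves = set()
    for base in set(permutations((3, 1, 1))):        # (3,1,1), (1,3,1), (1,1,3)
        for signs in product((1, -1), repeat=3):      # 8 sign patterns per placement
            moves.add(tuple(s * c for s, c in zip(signs, base)))
    return sorted(moves)

MOVES = lattice_moves_311()
print(len(MOVES))  # 24
```

Every move has the same squared length, 3² + 1² + 1² = 11, so all links of a grown chain have equal length.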
Lattice Chain Growth Algo.: Adding a new link
• Generate a set of possible open lattice nodes.
• For each, calculate a temperature-dependent transition probability.
• Choose one of these open lattice nodes with a Monte Carlo step.
• Variations include looking two steps ahead or building from the middle of the chain.
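
The first step, generating the open lattice nodes, might look like the sketch below. It assumes that "open" simply means "not already occupied by the chain"; the zone of exclusion and the angle constraints of the actual algorithm are not modelled, and `open_nodes` is a hypothetical name.

```python
from itertools import permutations, product

# The 24 (311) moves: every axis placement and sign pattern of (3, 1, 1).
MOVES = sorted({tuple(s * c for s, c in zip(signs, base))
                for base in permutations((3, 1, 1))
                for signs in product((1, -1), repeat=3)})

def open_nodes(chain):
    """List lattice nodes reachable from the chain end that are not yet occupied.

    Assumption: occupancy is the only constraint checked here.
    """
    occupied = set(chain)
    x, y, z = chain[-1]
    candidates = [(x + dx, y + dy, z + dz) for dx, dy, dz in MOVES]
    return [node for node in candidates if node not in occupied]

chain = [(0, 0, 0), (3, 1, 1)]
print(len(open_nodes(chain)))  # 23: the move back onto the origin is blocked
```

A real implementation would filter this list further before scoring the remaining candidates.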
Temperature-Dependent Transition Probability

Probability at step i of picking configuration x' from x_1 … x_C:

$$P_i(x') = \frac{\exp\left[-\frac{1}{k_B T} E(x')\right]}{\sum_{j=1}^{C} \exp\left[-\frac{1}{k_B T} E(x_j)\right]}$$

• T = temperature
• k_B = Boltzmann constant
• E = energy (lower is better)
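
As a sketch of how this probability could drive the Monte Carlo step, the code below computes Boltzmann weights exp(-E/k_B T) for the C candidates, normalises them, and draws one candidate. It assumes the usual sign convention in which lower energy gives higher probability; the names `transition_probabilities` and `pick_candidate` and the combined parameter `kT` are illustrative.

```python
import math
import random

def transition_probabilities(energies, kT=1.0):
    """Boltzmann weights exp(-E/kT), normalised over the candidates."""
    e_min = min(energies)                      # shifting by a constant leaves the ratios unchanged
    weights = [math.exp(-(e - e_min) / kT) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]

def pick_candidate(energies, kT=1.0, rng=random):
    """Monte Carlo step: draw one candidate index according to its probability."""
    probs = transition_probabilities(energies, kT)
    return rng.choices(range(len(energies)), weights=probs, k=1)[0]

probs = transition_probabilities([-2.0, 0.0, 1.0], kT=1.0)
chosen = pick_candidate([-2.0, 0.0, 1.0], kT=1.0)
```

Raising `kT` flattens the distribution toward uniform sampling, while lowering it makes the choice nearly greedy.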
Statistical Energy Functions

Statistical energy functions assume that “contact” energies between amino acid residues in native proteins are related to their observed frequency in a representative structural database.
    • If a potential configuration (decoy) has a certain set of nearby residues that is common in nature, give this a good score.
• Score for entire protein is sum of all contact energies.
• We use three statistical energy functions:
    • 2-body Miyazawa-Jernigan
    • 4-body Potential
    • Local Shape Potential
Statistical Energy Functions Overview

Global vs. Local
• Global: Measures well the entire protein (or partial fragment)
• Local: Measures just a small sequence of consecutive residues

2-body Miyazawa-Jernigan
• Easy to calculate
• Can be global or local

4-body Potential
• Expensive to calculate
• Works better as a global measure
• Good for determining native-like folded structures

Local Shape Potential
• Easy to calculate
• Defined as a local measure
• Global measure?
Two-body Statistical Energy Function

For two-body potentials:

$$Q_{ij} = \varepsilon_{ij} = -k_B T \ln\left[F_{ij} / P_{ij}\right]$$

• $F_{ij}$ = observed contact frequency
• $P_{ij}$ = reference state

Actual $\varepsilon_{ij}$ values are taken from the Miyazawa-Jernigan matrix as reevaluated in 1996, so $Q_{ij} = \varepsilon_{ij}$.

Miyazawa S, Jernigan RL. Residue–residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J Mol Biol 1996;256:623–644.
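
A minimal sketch of how a two-body score could be evaluated for a chain is given below. Several things here are assumptions made for illustration only: the minus sign in Q = -kT ln(F/P), the distance cutoff that defines a contact, the skipping of chain neighbours, and the toy table value (which is not a Miyazawa-Jernigan entry).

```python
import math

def contact_energy(observed_freq, reference_freq, kT=1.0):
    """Q = -kT ln(F/P): negative when a contact is seen more often than expected."""
    return -kT * math.log(observed_freq / reference_freq)

def chain_energy(sequence, coords, table, cutoff):
    """Sum pair energies over non-adjacent residues closer than `cutoff`."""
    total = 0.0
    for a in range(len(sequence)):
        for b in range(a + 2, len(sequence)):        # skip chain neighbours
            if math.dist(coords[a], coords[b]) <= cutoff:
                pair = tuple(sorted((sequence[a], sequence[b])))
                total += table[pair]
    return total

# Hypothetical single-entry table, for illustration only.
toy_table = {("L", "L"): -1.0}
energy = chain_energy("LKL", [(0, 0, 0), (3, 1, 1), (0, 2, 2)], toy_table, cutoff=4.0)
```

In this toy chain only residues 0 and 2 are compared, and they lie within the cutoff, so the total is the single L-L entry.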
Four-Body Statistical Energy Function
• Calculates the energy based on sets of 4 nearby residues (quads).
• Quads are calculated from the Delaunay tessellation.
    • The 4 vertices of each tetrahedron define a quad.
    • Each quad is given a statistical score.

[Figure: Delaunay tessellation of a protein. The convex hull is formed by the tetrahedral edges; each tetrahedron corresponds to a cluster of four residues.]
Four-Body Statistical Energy Function - Overview
• The four-body potential is written $Q^{\alpha}_{ijkl}$.
• A training set of 1166 proteins was tessellated.
• The frequency of each quad type is counted.
• Each quad is typed in two ways:
    • by the combination of the four residue types {i,j,k,l}
    • by the number of consecutively appearing residues (α)

[Figure: the five α types, labeled with the percentages 25.5%, 35.6%, 11.4%, 22.1% and 5.4%.]
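
Typing a quad by its consecutively appearing residues can be sketched as follows: sort the four sequence positions and group them into runs of consecutive indices. Four positions can only split into runs of lengths (1,1,1,1), (2,1,1), (2,2), (3,1) or (4), which gives five types. How these patterns map onto the α numbering is not stated in the slides, so the hypothetical helper `run_pattern` returns the run lengths themselves.

```python
def run_pattern(indices):
    """Group four sequence positions into runs of consecutive residues.

    Returns the run lengths in descending order, e.g. (2, 1, 1).
    """
    ordered = sorted(indices)
    runs = [1]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur == prev + 1:
            runs[-1] += 1          # extends the current run
        else:
            runs.append(1)         # starts a new run
    return tuple(sorted(runs, reverse=True))

print(run_pattern((5, 6, 7, 8)))    # (4,)
print(run_pattern((3, 4, 10, 11)))  # (2, 2)
print(run_pattern((1, 5, 9, 20)))   # (1, 1, 1, 1)
```

The pattern (1, 1, 1, 1) contains no sequence-adjacent pair, which is presumably what a "non-bonded" quad refers to.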
Four-Body Statistical Energy Function - Classifying quadruplets
• Denote each quad by {i,j,k,l}
• i, j, k and l can be any of the 20 amino acids (L20)
    • e.g. AALV, TLKM, TTLK, YYYY etc.
    • 8855 possible combinations
• Or the 20 amino acids can be grouped into just 6 types (L6)
    • Groups defined by chemical properties of amino acids
    • 126 possible combinations
        c = {cysteine}
        f = {phenylalanine, tyrosine, tryptophan}
        h = {histidine, arginine, lysine}
        n = {asparagine, aspartic acid, glutamine, glutamic acid}
        s = {serine, threonine, proline, alanine, glycine}
        v = {methionine, isoleucine, leucine, valine}
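
The counts 8855 and 126 are the numbers of unordered quads with repetition allowed, C(n + 3, 4) for n = 20 and n = 6. The sketch below checks this and shows one possible way to reduce a quad to an order-independent key in either alphabet; the one-letter codes and the names `quad_key` and `n_quad_compositions` are illustrative choices.

```python
from math import comb

# The six groups from the slide, written with standard one-letter codes.
L6_GROUPS = {
    "c": "C",
    "f": "FYW",
    "h": "HRK",
    "n": "NDQE",
    "s": "STPAG",
    "v": "MILV",
}
TO_L6 = {aa: group for group, members in L6_GROUPS.items() for aa in members}

def quad_key(residues, alphabet="L20"):
    """Order-independent key for a quad of four one-letter residue codes."""
    if alphabet == "L6":
        residues = [TO_L6[aa] for aa in residues]
    return "".join(sorted(residues))

def n_quad_compositions(n_types):
    """Number of unordered quads with repetition: C(n + 3, 4)."""
    return comb(n_types + 3, 4)

print(n_quad_compositions(20), n_quad_compositions(6))  # 8855 126
print(quad_key("TLKM"), quad_key("TLKM", "L6"))          # KLMT hsvv
```

Sorting the letters makes TLKM and KMTL map to the same key, which is what counting combinations rather than ordered sequences requires.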
Four-Body Statistical Energy Function - Classifying quadruplets

L20 Case:
• 5 α-types x 8855 combinations ==> 44,275 quad types
• Not all quad types are observed in the training set.
• The potential of unobserved types is set to some fraction of the lowest score for a represented quad type.

L6 Case:
• 5 α-types x 126 combinations ==> 630 quad types
• All but a few quad types are observed in the training set.
Four-Body Statistical Energy Function - Formulation

Formulation is an extension of the previous 2-body formula:

$$Q^{\alpha}_{ijkl} = -k_B T \ln\left[ f^{\alpha}_{ijkl} / P^{\alpha}_{ijkl} \right]$$

where

$$f^{\alpha}_{ijkl} = \frac{\text{observed occurrences of type } \alpha\ (ijkl) \text{ neighbors}}{\text{total number for } \alpha \text{ type}}$$

$$P^{\alpha}_{ijkl} = P^{\alpha} \cdot P_{ijkl} = P^{\alpha} \cdot \frac{4!}{\prod_{i=1}^{N} t_i!}\, a_i a_j a_k a_l, \qquad t_i = \text{number of each type } i$$

$$P^{\alpha} = \frac{\#\text{ of type } \alpha \text{ tetrahedra observed in training set}}{\text{total } \#\text{ of tetrahedra in training set}}$$

$$a_i = \frac{\text{observed occurrences of amino acid type } i}{\text{total number of residues in data set}}$$
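
The reference term and the resulting potential can be sketched in a few lines. This assumes the minus-sign convention (quads seen more often than expected get lower energy) and treats the factor 4!/∏ t_i! as the number of distinct orderings of the quad's residues; the function names and the two-letter toy frequencies are hypothetical.

```python
import math
from collections import Counter

def reference_probability(quad, aa_freq, p_alpha):
    """P^alpha_ijkl = P^alpha * (4! / prod t_i!) * a_i a_j a_k a_l."""
    multiplicity = math.factorial(4)
    for count in Counter(quad).values():      # t_i: copies of each residue type in the quad
        multiplicity //= math.factorial(count)
    product = math.prod(aa_freq[aa] for aa in quad)
    return p_alpha * multiplicity * product

def four_body_potential(f_obs, quad, aa_freq, p_alpha, kT=1.0):
    """Q^alpha_ijkl = -kT ln(f^alpha_ijkl / P^alpha_ijkl)."""
    return -kT * math.log(f_obs / reference_probability(quad, aa_freq, p_alpha))

# Toy two-letter alphabet, for illustration only.
toy_freq = {"A": 0.5, "L": 0.5}
print(reference_probability("AALL", toy_freq, p_alpha=1.0))  # 0.375
print(reference_probability("AAAA", toy_freq, p_alpha=1.0))  # 0.0625
```

With these toy values, AALL has 4!/(2!·2!) = 6 orderings and AAAA has one, so the mixed quad is expected six times as often.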
Local Shape Statistical Energy Function

Motivation:
• Fragment libraries model protein structures accurately.
• Use the frequency of common fragments to construct a statistical function that supplements the 2- and 4-body energy functions to grow better decoys.
• Good fragment libraries exist, but for lattice-chain building we need fragments that fit in the 311 lattice.

Main Idea:
• For each possible consecutive sequence of four residues, i, j, k, and l, calculate in which shape these residues most often occur.
• If Shape A is found more often in nature, try to build the chain accordingly.

[Figure: two example lattice shapes, Shape A and Shape B.]
Local Shape Statistical Energy Function
• Create a set of canonical lattice shapes of length 4 (and 5)
    • Calculate ways to embed a chain of length 4 (or 5) in the 311 lattice.
    • 155 canonical shapes for length 4 (2789 for length 5)
    • For L6, there are 6^4 = 1,296 sequences
    • 155 x 1,296 = 200,880 combinations
• Parse a representative set of 971 proteins into segments.
    • For each length-4 segment, calculate the RMSD against each canonical shape.

[Figure: a sample protein segment compared against canonical shapes 1, 2, …, 155.]
Local Shape Statistical Energy Function

Turning RMSD values into frequencies
• If only the canonical shape with the best RMSD is counted, not all 200,880 combinations are found in the training set.
• If two canonical shapes have low RMSD, give each some credit.
• For each $\mathrm{RMSD}^{\sigma}_{i,j,k,l}$ (i,j,k,l = residue types, σ = shape):

$$\mathrm{Freq}^{\sigma}_{i,j,k,l} = \frac{1}{n} \sum \exp\left(-\mathrm{RMSD}^{\sigma}_{i,j,k,l}\right)$$

• Normalize the 155 RMSD values.
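
One way to read "give each some credit" is a soft assignment: each training segment spreads one unit of credit over the canonical shapes, weighted by exp(-RMSD). The sketch below follows that reading. The per-segment normalisation, the dictionary layout and the names `shape_credits` and `accumulate` are assumptions, not the authors' exact procedure.

```python
import math

def shape_credits(rmsd_by_shape):
    """Turn one segment's RMSD to each canonical shape into weights summing to 1."""
    raw = {shape: math.exp(-rmsd) for shape, rmsd in rmsd_by_shape.items()}
    total = sum(raw.values())
    return {shape: w / total for shape, w in raw.items()}

def accumulate(freq, residue_key, rmsd_by_shape):
    """Add one training segment's credits to the running (sequence, shape) table."""
    for shape, credit in shape_credits(rmsd_by_shape).items():
        freq[(residue_key, shape)] = freq.get((residue_key, shape), 0.0) + credit

# Toy example with three shapes instead of 155.
freq = {}
accumulate(freq, "hsvv", {1: 0.4, 2: 0.5, 3: 3.0})
```

Shapes 1 and 2 fit almost equally well and share most of the credit, while shape 3 receives very little.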
Results - Building Decoys
• Decoys produced by chain growing are still not good enough.
• Relatively good correlation between RMSD and 4-body energy.

[Figure: 2mhu decoys built with the Local Shape potential and with the MJ potential; axis label: four-body energy per residue; the native state is marked.]
Identifying Good Decoys

20L or 6L Non-bonded
• Sum only the contribution of α-type 0 tetrahedra.
Discriminating Native & Non-Native

Non-bonded L20 scoring function applied to a set of folded and unfolded decoys.

[Figure: Non-bonded log-likelihoods for the Shakhnovich instances and the native structure (20L1T, SC). Vertical axis: log-likelihood score, 0 to 40. Horizontal axis: instances s6A1 to s6A6 (yellow, pre), GF01 to GF20 (blue, post) and 2CI2 (red, native).]
Adjustments to Scoring Functions

20L or 6L Non-bonded
• Sum only the contribution of α-type 0 tetrahedra.

20L or 6L 5T
• Sum contribution of all tetrahedra.

20L Ratio All
• As above, but define:

$$P^{\alpha}_{test} = \frac{\#\text{ of type } \alpha \text{ tetrahedra in test protein}}{\text{total } \#\text{ of tetrahedra in test protein}}, \qquad r^{\alpha} = \frac{P^{\alpha}_{test}}{P^{\alpha}}, \qquad Q^{\alpha\_RatioAll}_{ijkl} = r^{\alpha} \cdot Q^{\alpha}_{ijkl}$$
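
The Ratio All adjustment can be sketched as below: compute the fraction of each α type in the test structure, divide by the training-set fraction, and use the ratio to weight every quad's potential. The data layout (a list of `(alpha, key)` pairs plus two lookup tables) and all numeric values are hypothetical.

```python
from collections import Counter

def ratio_all_score(quads, q_table, p_train):
    """Sum r^alpha * Q^alpha_ijkl over all tetrahedra of a test structure.

    quads   : list of (alpha, key) pairs, one per tetrahedron
    q_table : {(alpha, key): Q} four-body potential
    p_train : {alpha: P^alpha} type frequencies in the training set
    """
    counts = Counter(alpha for alpha, _ in quads)
    total = len(quads)
    # r^alpha = P^alpha_test / P^alpha
    ratio = {alpha: (n / total) / p_train[alpha] for alpha, n in counts.items()}
    return sum(ratio[alpha] * q_table[(alpha, key)] for alpha, key in quads)

# Toy values, for illustration only.
quads = [(0, "AALV"), (0, "AALV"), (1, "KLTT"), (1, "KLTT")]
q_table = {(0, "AALV"): -1.0, (1, "KLTT"): 0.5}
p_train = {0: 0.25, 1: 0.5}
print(ratio_all_score(quads, q_table, p_train))  # -3.0
```

Here type 0 makes up half of the test tetrahedra but only a quarter of the training set, so its contributions are doubled.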
Incremental Tetrahedralization
• Maintain one tetrahedralization throughout and only add and remove single vertices.
• When evaluating a new candidate, update the total energy by tagging new quadruplets as well as any that have been removed.
• Add the effect of the new quadruplets, and subtract the effect of those removed.

[Figure, three panels: (1) Add candidate and evaluate. (2) Remove candidate and reset state. (3) Add next candidate and reevaluate.]
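
The energy bookkeeping described above can be sketched independently of the geometry: given the sets of quads before and after a candidate vertex is inserted, only the difference needs to be scored. The incremental Delaunay update itself is not shown, and `incremental_energy` and the toy scoring function are hypothetical.

```python
def incremental_energy(total, quads_before, quads_after, score):
    """Update a running energy from the quads that changed.

    quads_before / quads_after : sets of quads in the tessellation
    before and after inserting the candidate vertex.
    score : callable returning the statistical score of one quad.
    """
    added = quads_after - quads_before
    removed = quads_before - quads_after
    return total + sum(score(q) for q in added) - sum(score(q) for q in removed)

# Toy example: quads are frozensets of vertex ids, scored by a stand-in function.
toy_score = lambda quad: float(sum(quad))
before = {frozenset({1, 2, 3, 4}), frozenset({2, 3, 4, 5})}
after = {frozenset({1, 2, 3, 4}), frozenset({2, 3, 4, 6}), frozenset({3, 4, 5, 6})}
old_total = sum(toy_score(q) for q in before)
new_total = incremental_energy(old_total, before, after, toy_score)
```

The result matches a full recomputation over `after`, but only three quads are scored instead of the whole tessellation. Resetting the state for the next candidate is the same call with the two sets swapped.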
References
Gan HH, Tropsha A, Schlick T. Generating folded protein structures with a lattice chain-growth algorithm. J Chem Phys 2000;113:5511–5524.
Gan HH, Tropsha A, Schlick T. Lattice protein folding with two and four-body statistical potentials. Proteins: Structure, Function, and Genetics 2001;43:161–174.
Miyazawa S, Jernigan RL. Residue–residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J Mol Biol 1996;256:623–644.
Tropsha A, Singh RK, Vaisman II. Delaunay tessellation of proteins: Four body nearest neighbor propensities of amino acid residues. J Comput Biol 1996;3(2):213–222.
Kolodny R, Koehl P, Guibas L, Levitt M. Small libraries of protein fragments model native protein structures accurately. J Mol Biol 2002;323:297–307.