Four-body Statistical Potentials

A Four-Body Statistical Potential
For Protein Fold Recognition
Bala Krishnamoorthy and Alex Tropsha
UNC Chapel Hill
Nov 17, 2003
Four-Body Potentials
Outline
Motivation
Hypothesis
Four-body statistical potentials
Application to folding simulations
Application to predictions from CASP5 and Livebench 6
Motivation
Knowledge of protein structure is essential to understanding protein function
The number of proteins with known sequences is growing exponentially
Traditional methods for determining protein structure (X-ray crystallography, NMR, etc.) do not yield quick results
Need to develop statistical methods that help with protein fold recognition
Hypothesis
Specific nearest-neighbor residue contacts in protein structures have non-random propensities for occurrence.
The propensities of occurrence of nearest-neighbor clusters can be used to score compatibility between protein sequence and structure.
SNAPP
Simplicial Neighborhood Analysis of Protein Packing
2-D packing: 3 neighbors in mutual contact
3-D packing: 4-neighbor clusters
An objective definition of the nearest neighborhood of each residue is needed
Use the Voronoi diagram of the protein: it gives a convex polyhedron (Voronoi cell) around each residue (represented as a point) that defines the nearest neighborhood of the residue
Delaunay triangulation: defined as the dual of the Voronoi diagram
Tessellation of protein structure (in 3D)
Residues are represented by their side-chain centers (or by their C-α atoms)
Protein structure is represented as an aggregate of space-filling, non-intersecting and irregular tetrahedra
Nearest-neighbor residues are identified as unique sets of four residues each (tetrahedral quadruplets)
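The tessellation step can be illustrated with a minimal sketch. It is not the tessellation code used for the talk: it is a brute-force check of the empty-circumsphere property that characterizes Delaunay tetrahedra, usable only for a handful of points in general position. The function names and the toy coordinates are assumptions made for illustration; for real proteins a computational-geometry library would be used instead.

```python
from itertools import combinations


def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def circumsphere(a, b, c, d):
    """Return (center, squared radius) of the sphere through four 3-D points,
    or None if the points are (nearly) coplanar."""
    # Equal distance to a and to p gives 2 (p - a) . x = |p|^2 - |a|^2.
    rows = [[2 * (p[i] - a[i]) for i in range(3)] for p in (b, c, d)]
    rhs = [sum(p[i] ** 2 - a[i] ** 2 for i in range(3)) for p in (b, c, d)]
    det = det3(rows)
    if abs(det) < 1e-12:
        return None
    center = []
    for col in range(3):  # Cramer's rule, one coordinate at a time
        m = [row[:] for row in rows]
        for r in range(3):
            m[r][col] = rhs[r]
        center.append(det3(m) / det)
    r2 = sum((center[i] - a[i]) ** 2 for i in range(3))
    return center, r2


def delaunay_tetrahedra(points):
    """Brute-force Delaunay tessellation: keep every quadruplet of points
    whose circumsphere contains no other point. Roughly O(n^5)."""
    tets = []
    for quad in combinations(range(len(points)), 4):
        sphere = circumsphere(*(points[i] for i in quad))
        if sphere is None:
            continue
        center, r2 = sphere
        others = (j for j in range(len(points)) if j not in quad)
        if all(sum((points[j][i] - center[i]) ** 2 for i in range(3)) >= r2 - 1e-9
               for j in others):
            tets.append(quad)
    return tets


# Toy coordinates (not real residue positions): four corners plus one interior point.
points = [(0, 0, 0), (4, 0, 0), (0, 4, 0), (0, 0, 4), (1, 1, 1)]
print(delaunay_tetrahedra(points))
```

Each returned tuple lists the indices of four points that are mutual nearest neighbors in the Delaunay sense, which is the role the tetrahedral quadruplets play in the slides.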
Four-body Statistical Potentials
Denote each quadruplet by {i, j, k, l}
i, j, k and l can be any of the 20 amino acids
Total number of possible quadruplets (order ignored, repeats allowed) is 8855
Examples: AALV, VALI, TLKM, YYYY, …
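The figure of 8855 is consistent with counting quadruplets as unordered selections of four amino acids with repetition allowed, i.e. C(20 + 4 - 1, 4). A quick check, with the one-letter alphabet and the helper name `canonical` chosen here for illustration:

```python
from itertools import combinations_with_replacement
from math import comb

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def canonical(quad):
    """Order-independent key for a quadruplet, e.g. 'VALI' -> 'AILV'."""
    return "".join(sorted(quad))


# Every unordered quadruplet with repetition, in canonical (sorted) form.
quadruplets = ["".join(q) for q in combinations_with_replacement(AMINO_ACIDS, 4)]

print(len(quadruplets))      # 8855
print(comb(20 + 4 - 1, 4))   # 8855
print(canonical("VALI"))     # AILV
```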
Based on the backbone connectivity of {i, j, k, l}, there can be five types of tetrahedra (indexed as 0, 1, 2, 3 and 4, respectively)
The propensities of the {i, j, k, l} quadruplets of each type t could be used to develop four-body statistical potentials
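The slide does not spell out which connectivity pattern receives which index. The sketch below assumes one plausible convention: the four residues are grouped into runs of sequence-consecutive residues, and the run-length patterns 1-1-1-1, 2-1-1, 2-2, 3-1 and 4 are mapped to types 0 to 4. Both the mapping and the function name are assumptions, and the code presumes a single chain with gap-free residue numbering.

```python
# Assumed mapping from run-length pattern to type index (not given on the slide).
TYPE_BY_PATTERN = {
    (1, 1, 1, 1): 0,  # no two residues adjacent in sequence
    (2, 1, 1): 1,     # one adjacent pair
    (2, 2): 2,        # two separate adjacent pairs
    (3, 1): 3,        # three consecutive residues plus one distant residue
    (4,): 4,          # four consecutive residues
}


def tetrahedron_type(residue_numbers):
    """Classify a tetrahedron by the backbone connectivity of its four
    distinct residues, given their sequence positions."""
    a, b, c, d = sorted(residue_numbers)
    runs, run = [], 1
    for prev, cur in zip((a, b, c), (b, c, d)):
        if cur == prev + 1:
            run += 1          # still inside a run of consecutive residues
        else:
            runs.append(run)  # run ended; start a new one
            run = 1
    runs.append(run)
    return TYPE_BY_PATTERN[tuple(sorted(runs, reverse=True))]


print(tetrahedron_type([20, 5, 6, 33]))  # 1
```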
Four-body compositional propensities of Delaunay simplices
$q_{ijkl,t} = \log \dfrac{f_{ijkl,t}}{p_{ijkl,t}}$
$f_{ijkl,t}$: observed frequency of occurrence in the training set of quad {ijkl} in a type t tetrahedron
$p_{ijkl,t}$: expected frequency of occurrence in the training set of residues i, j, k and l in a type t tetrahedron
$p_{ijkl,t} = C \, a_i \, a_j \, a_k \, a_l \, p_t$
$a_i$: individual amino acid (AA) frequency
$p_t$: frequency of type t tetrahedra
$C$: combinatorial factor
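A minimal sketch of this log-likelihood follows. Two details are assumptions, since the slide leaves them open: the combinatorial factor C is taken to be the number of distinct orderings of the quadruplet (4! divided by the factorial of each amino acid's multiplicity), and the logarithm is taken to base 10. The function names and the toy frequencies are illustrative only.

```python
from collections import Counter
from math import factorial, log10


def combinatorial_factor(quad):
    """Assumed form of C: 4! / prod(n_a!), where n_a is the number of
    times amino acid a occurs in the quadruplet."""
    c = factorial(4)
    for n in Counter(quad).values():
        c //= factorial(n)
    return c


def log_likelihood(quad, f_obs, aa_freq, p_type):
    """q = log(f / p), with p = C * a_i * a_j * a_k * a_l * p_t.
    Base-10 log is an assumption; f_obs must be positive."""
    expected = combinatorial_factor(quad) * p_type
    for aa in quad:
        expected *= aa_freq[aa]
    return log10(f_obs / expected)


# Toy numbers, not training-set values: uniform amino acid frequencies.
aa_freq = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}
print(combinatorial_factor("AALV"))  # 12
print(log_likelihood("AALV", f_obs=3.75e-4, aa_freq=aa_freq, p_type=0.5))
```

With these toy inputs the expected frequency is 12 × 0.05⁴ × 0.5 = 3.75e-5, so an observed frequency ten times larger gives a score of about 1.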
Diverse training set of 1166 protein chains with known structure
For a test conformation, the total log-likelihood score is calculated by adding the scores of all tetrahedra in its Delaunay tessellation.
Higher score ↔ better structure
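The summation can be sketched as a table lookup. The representation used here is an assumption: each tetrahedron is given as a pair (four residue letters, type), and the potential is a dictionary keyed by (sorted letters, type). Quadruplets missing from the table contribute a default of 0, which is also an assumption. The numbers are toy values, not entries of the trained potential.

```python
def total_score(simplices, potential, default=0.0):
    """Sum the four-body scores of all tetrahedra in a tessellation.

    simplices: iterable of (residue letters, tetrahedron type) pairs
    potential: dict mapping (sorted letters, type) to a log-likelihood score
    """
    score = 0.0
    for letters, tet_type in simplices:
        key = ("".join(sorted(letters)), tet_type)  # order-independent lookup
        score += potential.get(key, default)
    return score


# Toy potential and tessellation (hypothetical values).
potential = {("AALV", 0): 0.5, ("AILV", 1): 1.25, ("KLMT", 0): -0.75}
simplices = [("VALI", 1), ("TLKM", 0), ("AALV", 0), ("YYYY", 4)]
print(total_score(simplices, potential))  # 1.0
```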
MD Simulation of proteins
Comparison of pre- and post-TS (transition) structure of CI2 vs. native CI2 *
Pre-TS (six structures)
Post-TS (20 structures)
Native
Go potentials (native-structure specific) fail to discriminate between the three!
*Structures courtesy of Dr. E. Shakhnovich, Harvard (Ref: J. Mol. Biol. 296 (2000) p1183-1188)
Comparison of total scores for pre- and post-TS structures of CI2 vs. native CI2
Figure: total score (y-axis, 20 to 120) vs. instance (x-axis, 1 to 27); red: pre-TS (6), yellow: post-TS (20), green: native
N.B. The 5th pre-TS instance actually had a 0.10 probability of folding (the other five pre-TS structures had ~0 probability of folding)
Structure profiles of pre-TS vs. post-TS structure of CI2
Figure: log-likelihood score (y-axis, 0 to 20) vs. residue # (x-axis, 0 to 64) for the pre-TS, post-TS and native structures; labeled residues: L8, V13, A16, I20, I29, V31, V47, L49, V51, I57
Panels: Profile; ProCAM of post-TS structure
SNAPP analysis of pre-TS vs. post-TS structure of CI2
Figure panels: Pre-TS; Post-TS
Structure profiles of pre-TS vs. post-TS structure of SH3
Figure: log-likelihood score (y-axis, 0 to 18) vs. residue # (x-axis, 0 to 56) for the pre-TS, post-TS and native structures; labeled residues: Y8, L16, F18, W35, A37, G46, I48, Y52
Scoring Livebench 6 and CASP5 predictions
Livebench
Automated evaluation of structure prediction servers
Set 6 had 32 “easy” and 66 “hard” targets
CASP5
3D coordinate models submitted for 56 targets
Native structure of 33 targets has been released
- rank 3D predictions using four-body potentials
- compare with the ranking using global structural similarity measures (like MaxSub)
To compare rankings, use predictive index (PI)
Here, E – experimental values, P – predicted values
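The PI formula itself did not survive in this transcript. The sketch below assumes the definition commonly attributed to Pearlman and Charifson: over all pairs (i, j) of models, PI = Σ w_ij C_ij / Σ w_ij, with w_ij = |E(j) − E(i)|, and C_ij equal to +1 when E and P order the pair the same way, −1 when they order it oppositely, and 0 when the predictions are tied. Whether the slide used exactly this form is an assumption, as is the function name.

```python
def predictive_index(experimental, predicted):
    """Weighted pairwise rank agreement between experimental values E and
    predicted values P, ranging from -1 (always wrong) to +1 (always right)."""
    num = 0.0
    den = 0.0
    n = len(experimental)
    for i in range(n):
        for j in range(i + 1, n):
            de = experimental[j] - experimental[i]
            dp = predicted[j] - predicted[i]
            w = abs(de)           # large experimental differences weigh more
            if de * dp > 0:
                c = 1             # pair ranked in the same order
            elif de * dp < 0:
                c = -1            # pair ranked in the opposite order
            else:
                c = 0             # tie in P (or zero weight when E is tied)
            num += w * c
            den += w
    return num / den if den else 0.0


print(predictive_index([1, 2, 3], [10, 20, 30]))  # 1.0
print(predictive_index([1, 2, 3], [30, 20, 10]))  # -1.0
```

In the setting of these slides, E would be a structural similarity measure of each model to the native structure (for example MaxSub) and P the four-body score of the same model.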
Livebench 6
10 models for each target made by PMODELLER
PI for 28 “easy” targets and 38 “hard” targets
(at least one model had a non-zero MaxSub score)
Targets   Method   <PI>   Std(PI)
Easy      4B pot   0.83   0.20
Easy      MJ       0.70   0.39
Easy      PMOD     0.80   0.19
Hard      4B pot   0.83   0.11
Hard      MJ       0.74   0.18
Hard      PMOD     0.84   0.15
CASP5
For 18 targets (out of 33), the native structure ranked better than all predictions
For 26 targets (out of 33), the native structure was ranked within the top 3.5% of all the predictions
CASP5    <PI>   Std(PI)
4B pot   0.61   0.18
MJ       0.39   0.20
CRMSD    0.63   0.22
Conclusions
A four-body statistical scoring function is developed based on the Delaunay tessellation of proteins
Discriminates native from decoy structures in most cases
Distinguishes pre- and post-transition state structures and the native structure from MD folding simulation trajectories
Effective in ranking Livebench 6 and CASP5 predictions (mean PI of 0.83 and 0.61, respectively)