Stanford Presentation, 10/23/2001


1
Three-Body Delaunay Statistical
Potentials of Protein Folding
Andrew Leaver-Fay
University of North Carolina at Chapel Hill
Bala Krishnamoorthy, Alex Tropsha
2
Protein Folding Problem
• Find the 3-D structure of a protein in nature from its
1-D sequence.
– Holy grail of computational biology
• Generic Solution
– Search Algorithm
• Takes Sequence
• Produces Decoys
– Scoring Function
• Ranks Decoys
3
Empirical Scoring Functions
• Philosophy: compare structural properties of
decoys to those of known proteins
• “Two-Body” Potentials
– Distribution of distances between amino acids
– Frequency of amino-acid contacts
• Arbitrary cutoff distance defines contact
• Delaunay-based statistical potentials
– “How do four amino acids pack together?”
– Alex Tropsha’s Lab: SNAPP Four-Body Potential
4
Delaunay Tessellation Of Proteins
• Describe each residue’s position by a single point
– Cα
– Side Chain Centroid
• Delaunay tessellation gives a simplicial complex
– Geometric “nearest neighbor” criterion
– Captures a sense of “shielding” in residue interaction
• Gather statistics on tetrahedra (3-simplices, four vertices each)
– Classify tetrahedra
– Convert observed frequencies to scores
5
Classification of Tetrahedra
• 8,855 ways to classify a tetrahedron
by the four amino acids that define it
[Figure: tessellation fragment with vertices labeled A, I, F, V, L]
• 5 ways to classify a tetrahedron by gaps in primary
sequence
– e.g., residues 1, 5, 6, & 10 in a tetrahedron share the
same gap structure with residues 20, 22, 23, & 43
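The two counts on this slide can be checked with a short sketch. It assumes (this is an interpretation, not stated on the slide) that the amino-acid classes are unordered selections of four residues from the 20 standard types with repeats allowed, and that the gap class is the multiset of run lengths of sequence-adjacent residues. The function names are hypothetical.

```python
from itertools import combinations
from math import comb

N_AMINO_ACIDS = 20  # the 20 standard amino-acid types


def composition_classes(k):
    """Unordered selections of k residues from 20 types, repeats allowed."""
    return comb(N_AMINO_ACIDS + k - 1, k)


def gap_class(positions):
    """Run lengths of sequence-adjacent residues, largest first.

    Assumption: two residues are contiguous when their sequence
    positions differ by exactly 1.
    """
    ordered = sorted(positions)
    runs, current = [], 1
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt - prev == 1:
            current += 1          # extend the current contiguous run
        else:
            runs.append(current)  # a gap closes the run
            current = 1
    runs.append(current)
    return tuple(sorted(runs, reverse=True))


print(composition_classes(4))          # 8855
print(gap_class([1, 5, 6, 10]))        # (2, 1, 1)
print(gap_class([20, 22, 23, 43]))     # (2, 1, 1)
# Enumerating every 4-residue choice from a short chain yields 5 classes.
print(len({gap_class(c) for c in combinations(range(12), 4)}))  # 5
```

Under this reading the slide's example holds: residues 1, 5, 6, 10 and residues 20, 22, 23, 43 both reduce to one adjacent pair plus two isolated residues.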
6
From Statistics To Scores
• Log-likelihood score for a particular tetrahedron
type is $\log_{10}(f_{ijklp} / p_{ijklp})$, the observed frequency over the expected frequency
• $p_{ijklp} = C_{ijkl} \, f(aa_i) \, f(aa_j) \, f(aa_k) \, f(aa_l) \, f(psg_p)$
• The score for a decoy is the sum of the log-likelihood scores for each of its tetrahedra
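A minimal sketch of the scoring arithmetic follows. The slide does not define the factor C; the code assumes it is the number of distinct orderings of the four residues (4! divided by the factorials of the repeat counts), which is a common choice but an assumption here. All function names and the frequency values in the usage lines are hypothetical.

```python
from collections import Counter
from math import factorial, log10


def permutation_factor(residues):
    """Assumed form of C: distinct orderings of the residue multiset."""
    denom = 1
    for count in Counter(residues).values():
        denom *= factorial(count)
    return factorial(len(residues)) // denom


def expected_frequency(residues, gap, aa_freq, gap_freq):
    """p = C * product of residue frequencies * gap-class frequency."""
    p = permutation_factor(residues) * gap_freq[gap]
    for aa in residues:
        p *= aa_freq[aa]
    return p


def log_likelihood(observed, expected):
    """log10 of observed over expected frequency."""
    return log10(observed / expected)


def decoy_score(tetrahedra, table):
    """Sum the per-type scores; table maps (sorted residues, gap) to a score."""
    return sum(table[(tuple(sorted(res)), gap)] for res, gap in tetrahedra)


# Hypothetical frequencies, for illustration only.
aa_freq = {"A": 0.1, "V": 0.1, "L": 0.1, "I": 0.1}
gap_freq = {"all-separate": 0.5}
p = expected_frequency("AVLI", "all-separate", aa_freq, gap_freq)
score = log_likelihood(10 * p, p)  # a type seen 10x more often than expected
table = {(("A", "I", "L", "V"), "all-separate"): score}
print(decoy_score([("AVLI", "all-separate")] * 3, table))
```

A type observed more often than expected gets a positive score, so a decoy built from well-represented packing arrangements accumulates a higher total.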
7
Desired Classification Features
• Amino Acid Types
– Backbone and Side-chain distinction, 2 points/residue
• Primary Sequence Gaps
– Gaps of varying lengths, 0, 1, 2-4, 5+
• Buriedness
– Are these residues exposed to solvent?
• Edge Lengths, Tetrahedron Volume
• Secondary (2°) Structure
• Self-Imposed Sampling Requirement
– Have 10 times as many tetrahedra in the training set as
the number of tetrahedron types.
– Adding classification features to the existing two
requires a larger training set
8
Facet-Based Delaunay Potential
• Sacrifice some higher-order information to gain
insight into other structural features
– Simultaneously show that higher order information is
valuable
• 1,540 ways to classify a facet by the 3 defining
amino acids
• 3 ways to classify a facet by gaps in the primary
sequence
• 5 ways to classify a facet by its buriedness
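The trade-off can be made concrete by counting types. This sketch assumes the classification features multiply independently and that compositions are unordered selections from 20 amino-acid types with repeats allowed; `n_types` is a hypothetical helper.

```python
from math import comb


def n_types(vertices, gap_classes, buriedness_classes=1):
    """Number of simplex types when the features combine independently."""
    compositions = comb(20 + vertices - 1, vertices)
    return compositions * gap_classes * buriedness_classes


tetra_types = n_types(4, 5)      # 8,855 compositions x 5 gap classes
facet_types = n_types(3, 3, 5)   # 1,540 x 3 gap classes x 5 buriedness classes
print(tetra_types)               # 44275
print(facet_types)               # 23100
# Training-set size implied by the 10x sampling rule for facets.
print(10 * facet_types)          # 231000
# Adding the 5 buriedness classes to tetrahedra would need 5x more data.
print(10 * n_types(4, 5, 5))     # 2213750
```

Dropping from four vertices to three shrinks the composition count enough to afford the buriedness feature under the same sampling rule.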
9
Buried by Geometry
• A facet in the Delaunay tessellation may be involved in
two tetrahedra (AVL) or in only one (DSG).
• Def: a facet that appears only once is a “surface facet”
• Vertices on any surface facet are “surface vertices.”
• 5 classes of facets by buriedness
– Surface facets
– Non-surface facets: number of
surface vertices (3, 2, 1, or 0)
[Figure: tessellation fragment with vertices labeled A, F, V, I, L, P, S, D, G]
Figure courtesy Alex Tropsha
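The definitions above translate directly into a counting procedure. The sketch below assumes a tessellation is given as a list of 4-tuples of vertex ids; the class labels and the two-tetrahedron example (which borrows the AVL facet from the slide but invents the fourth vertices) are hypothetical.

```python
from collections import Counter
from itertools import combinations


def facet_buriedness(tetrahedra):
    """Map each facet (frozenset of 3 vertex ids) to a buriedness class."""
    # Count how many tetrahedra each triangular facet belongs to.
    counts = Counter(
        frozenset(facet)
        for tet in tetrahedra
        for facet in combinations(tet, 3)
    )
    # A facet seen exactly once lies on the surface.
    surface_facets = {f for f, n in counts.items() if n == 1}
    surface_vertices = set().union(*surface_facets)
    classes = {}
    for facet in counts:
        if facet in surface_facets:
            classes[facet] = "surface"
        else:
            # Non-surface facets are split by their number of surface vertices.
            classes[facet] = f"interior-{len(facet & surface_vertices)}"
    return classes


# Two tetrahedra sharing the facet A-V-L.
classes = facet_buriedness([("A", "V", "L", "I"), ("A", "V", "L", "F")])
print(classes[frozenset("AVL")])   # interior-3
print(classes[frozenset("AVI")])   # surface
```

In this toy case the shared facet is not a surface facet, yet all three of its vertices touch the surface, so it falls in the least buried of the four non-surface classes.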
10
Training Set
• 1,600 Structures
– High Resolution
– Low Sequence Identity, < 25%
• 226K facets observed
[Figure: Three-Body Potential. Vertical axis: log-likelihood score, from -4 to 3.]
11
Decoy Discrimination
• Well-formed, non-native structures
– Standard sets available from Decoys’R’Us,
http://dd.stanford.edu
– Many potentials have failed the discrimination task on these
sets
• Two Measures of Fitness for a Potential
– Rank of Native Structure
– Z-Score of Native Structure: (NativeScore - μ) / σ, with μ and σ the mean and standard deviation of the scores in the decoy set
• Compare 4 potentials:
– Latest 4-Body Potential
– 3-Body, no buriedness distinction (3bNBD)
– 3-Body
– Combination of 3- and 4-Body Potentials
• Scores from 3-body come from only the fully buried facets
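The two fitness measures can be sketched as follows. The code assumes that a higher score is better, that the Z-score is the native's score minus the mean decoy score divided by the standard deviation of the decoy scores, and that the population standard deviation is used; the slide does not settle the last point. The function names and example scores are hypothetical.

```python
from statistics import mean, pstdev


def native_rank(native_score, decoy_scores):
    """1-based rank of the native; rank 1 means no decoy scores higher."""
    return 1 + sum(1 for s in decoy_scores if s > native_score)


def native_z_score(native_score, decoy_scores):
    """Distance of the native from the decoy mean, in standard deviations."""
    mu = mean(decoy_scores)
    sigma = pstdev(decoy_scores)  # assumption: population std deviation
    return (native_score - mu) / sigma


# Hypothetical scores, for illustration only.
decoys = [1.0, 2.0, 3.0, 4.0, 5.0]
print(native_rank(4.5, decoys))     # 2
print(native_z_score(4.5, decoys))
```

Rank is sensitive to a handful of high-scoring decoys, while the Z-score reflects how far the native sits from the bulk of the set, which is why the slides report both.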
12
Four-State Reduced Decoy Sets
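| PDB-ID | # Decoys | 4-Body Rank | 4-Body Z-Score | 3bNBD Rank | 3bNBD Z-Score | 3-Body Rank | 3-Body Z-Score | 4b + 3b* Rank | 4b + 3b* Z-Score |
|---|---|---|---|---|---|---|---|---|---|
| 1ctf | 630 | 1 | 3.089 | 3 | 2.280 | 7 | 2.530 | 9 | 2.942 |
| 1r69 | 675 | 2 | 2.741 | 3 | 2.617 | 2 | 2.668 | 2 | 3.572 |
| 1sn3 | 660 | 24 | 1.897 | 194 | 0.525 | 26 | 1.760 | 20 | 2.041 |
| 2cro | 674 | 19 | 1.925 | 67 | 1.230 | 103 | 1.113 | 23 | 2.138 |
| 3icb | 653 | 29 | 1.905 | 30 | 1.730 | 6 | 2.319 | 12 | 2.325 |
| 4pti | 687 | 1 | 3.120 | 303 | 0.161 | 100 | 1.010 | 1 | 3.330 |
| 4rxn | 677 | 3 | 2.930 | 284 | 0.227 | 186 | 0.620 | 5 | 2.702 |

* fully buried facets only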
13
Fisa Decoy Sets
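| PDB-ID | # Decoys | 4-Body Rank | 4-Body Z-Score | 3bNBD Rank | 3bNBD Z-Score | 3-Body Rank | 3-Body Z-Score | 4b + 3b* Rank | 4b + 3b* Z-Score |
|---|---|---|---|---|---|---|---|---|---|
| 1fc2 | 500 | 1 | 3.017 | 113 | 0.800 | 10 | 2.507 | 1 | 3.357 |
| 1hdd-C | 500 | 153 | 0.619 | 113 | 0.712 | 1 | 3.085 | 71 | 1.021 |
| 2cro | 500 | 17 | 2.008 | 32 | 1.602 | 16 | 2.207 | 2 | 3.511 |
| 4icb | 500 | 1 | 4.556 | 1 | 3.198 | 1 | 6.972 | 1 | 6.367 |

* fully buried facets only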
14
Lattice SS Fit Decoy Sets
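| PDB-ID | # Decoys | 4-Body Rank | 4-Body Z-Score | 3bNBD Rank | 3bNBD Z-Score | 3-Body Rank | 3-Body Z-Score | 4b + 3b* Rank | 4b + 3b* Z-Score |
|---|---|---|---|---|---|---|---|---|---|
| 1beo | 2000 | 1 | 7.828 | 16 | 2.317 | 7 | 3.142 | 1 | 5.564 |
| 1ctf | 2000 | 1 | 4.654 | 6 | 2.815 | 11 | 2.924 | 4 | 3.947 |
| 1dkt-A* | 2000 | 102 | 1.596 | 1047 | -0.046 | 468 | 0.717 | 85 | 1.790 |
| 1fca | 2000 | 1 | 6.255 | 524 | 0.592 | 122 | 1.659 | 1 | 5.986 |
| 1nkl | 2000 | 1 | 7.667 | 1 | 4.181 | 1 | 4.755 | 1 | 7.769 |
| 1pgb | 2000 | 1 | 5.434 | 100 | 1.596 | 14 | 2.668 | 1 | 6.003 |
| 1trl-A* | 2000 | 962 | 0.071 | 832 | 0.222 | 1476 | -0.683 | 1141 | -0.254 |
| 4icb | 2000 | 1 | 4.732 | 1 | 3.589 | 1 | 3.852 | 1 | 5.685 |

* fully buried facets only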
15
LMDS Decoy Sets
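| PDB-ID | # Decoys | 4-Body Rank | 4-Body Z-Score | 3bNBD Rank | 3bNBD Z-Score | 3-Body Rank | 3-Body Z-Score | 4b + 3b* Rank | 4b + 3b* Z-Score |
|---|---|---|---|---|---|---|---|---|---|
| 1b0n-B* | 497 | 405 | -0.916 | 466 | -1.394 | 253 | -0.059 | 405 | -0.916 |
| 1bba | 500 | 1 | 4.142 | 477 | -1.822 | 179 | 0.304 | 1 | 4.142 |
| 1ctf | 497 | 1 | 2.797 | 2 | 2.271 | 9 | 2.366 | 5 | 2.475 |
| 1dtk | 215 | 8 | 1.903 | 50 | 0.764 | 5 | 2.180 | 5 | 1.952 |
| 1fc2 | 500 | 217 | 0.137 | 127 | 0.615 | 1 | 3.770 | 123 | 0.669 |
| 1igd | 500 | 4 | 2.569 | 13 | 2.006 | 10 | 2.479 | 4 | 2.458 |
| 1shf-A | 437 | 35 | 1.472 | 12 | 1.869 | 4 | 2.473 | 15 | 1.850 |
| 2cro | 500 | 2 | 2.787 | 4 | 2.271 | 1 | 4.125 | 1 | 5.222 |
| 2ovo | 348 | 61 | 0.853 | 14 | 1.721 | 13 | 1.853 | 55 | 0.917 |
| 4pti | 343 | 9 | 1.900 | 152 | 0.147 | 30 | 1.484 | 8 | 2.064 |

* fully buried facets only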
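16
Average Performance Across Sets

| Set | Statistic | 4-Body Rank | 4-Body Z-Score | 3bNBD Rank | 3bNBD Z-Score | 3-Body Rank | 3-Body Z-Score | 4b + 3b* Rank | 4b + 3b* Z-Score |
|---|---|---|---|---|---|---|---|---|---|
| 4state | Mean | 11.3 | 2.516 | 126.3 | 1.253 | 61.428 | 1.718 | 10.286 | 2.722 |
| 4state | Median | 3 | 2.742 | 67 | 1.231 | 26 | 1.761 | 9 | 2.702 |
| Fisa | Mean | 43 | 2.550 | 64.8 | 1.578 | 7 | 3.693 | 18.75 | 3.564 |
| Fisa | Median | 9 | 2.513 | 72.5 | 1.201 | 5.5 | 2.796 | 1.5 | 3.434 |
| Lat | Mean | 133.8 | 4.800 | 315.9 | 1.908 | 262.5 | 2.380 | 154.4 | 4.561 |
| Lat | Median | 1 | 5.084 | 58 | 1.957 | 12.5 | 2.797 | 1 | 5.625 |
| LMDS | Mean | 74.3 | 1.764 | 131.7 | 0.845 | 50.5 | 2.098 | 62.2 | 2.083 |
| LMDS | Median | 8.5 | 1.901 | 32 | 1.242 | 9.5 | 2.273 | 6.5 | 2.008 |
| All | Mean(Mean) | 65.6 | 2.902 | 159.7 | 1.396 | 95.4 | 2.472 | 61.4 | 3.232 |
| All | Mean(Median) | 5.4 | 3.060 | 57.4 | 1.408 | 13.4 | 2.406 | 4.5 | 3.442 |

* fully buried facets only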
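The per-set rows can be reproduced from the per-protein ranks on the individual decoy-set slides. As a small check, the sketch below recomputes the 4-Body rank mean and median for the Four-State Reduced and Fisa sets, using rank values copied from those slides; the variable names are hypothetical.

```python
from statistics import mean, median

# 4-Body ranks of the native, copied from the decoy-set slides.
four_state_ranks = [1, 2, 24, 19, 29, 1, 3]   # 1ctf ... 4rxn
fisa_ranks = [1, 153, 17, 1]                  # 1fc2, 1hdd-C, 2cro, 4icb

print(f"{mean(four_state_ranks):.1f}")    # 11.3
print(f"{median(four_state_ranks):.1f}")  # 3.0
print(f"{mean(fisa_ranks):.1f}")          # 43.0
print(f"{median(fisa_ranks):.1f}")        # 9.0
```

The gap between mean and median in the Fisa set comes from a single poorly ranked native (1hdd-C at rank 153), which is the reason both statistics are reported.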
17
Dimer “Discrimination”
• We could not effectively discriminate the native from
decoys with either the 3- or 4-body potentials for 3
proteins.
– 1b0n-B, 1dkt-A, 1trl-A
• On closer examination, we discovered the native
structures were incomplete, leaving exposed residues
that would be buried in their native multimeric shapes.
18
Average Performance Across Sets
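| Set | Statistic | 4-Body Rank | 4-Body Z-Score | 3bNBD Rank | 3bNBD Z-Score | 3-Body Rank | 3-Body Z-Score | 4b + 3b* Rank | 4b + 3b* Z-Score |
|---|---|---|---|---|---|---|---|---|---|
| All | Mean(Mean) | 23.5 | 3.306 | 98.4 | 1.610 | 30.9 | 2.729 | 13.911 | 3.632 |
| All | Mean(Median) | 5.5 | 3.250 | 49.3 | 1.614 | 12.6 | 2.489 | 4.4 | 3.509 |

* fully buried facets only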
19
Conclusion
• Buriedness distinctions capture valuable
information about protein structure
• 3- + 4-Body potential is the strongest Delaunay
potential to date.