Prediction of Protein
Structure and Function on a
Proteomic Scale
Jeff Skolnick
Director
Center of Excellence in Bioinformatics
General Approach
Prediction of Protein Structure
Overview of CASP5 Results:
Comparative Modeling
(CM) Results
T0153
CM
COORDINATE SUPERPOSITION RMSD = 1.74 Å ( 129 / 134 aa )
NATIVE (discontinuous line) :
1mq7 A
PREDICTED (continuous line) : rank #1
Fold Recognition (FR) results
T0135
FR(A)
GLOBAL COORDINATE SUPERPOSITION
RMSD = 4.80 Å ( 106 / 106 aa )
NATIVE (discontinuous line) :
PREDICTED (continuous line) : rank #1
Yellow line: region originally aligned to the template (1h6kX )
New Fold (NF) results
T0181
(NF)
PREDICTED: rank #2
How representative is the set
of solved PDB structures?
The PDB is a covering set of
protein structures at low
resolution
Results from a new structure
alignment program, SAL
Kihara & Skolnick, J. Mol. Biol,
2003:333:393-802
Structural alignments
to proteins of different secondary structure
Different CATH ids
100 residue proteins
Use of best structural
alignments
Can we build good models
starting from protein templates
with average sequence id of 7%?
TASSER: Threading/ASSEmbly/Refinement
Very large scale structure
prediction benchmark
Comprehensive benchmark set of PDB structures
Length range: 41-200 residues
Sequence identity cut-off: 35%
Total: 1489 proteins
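The selection criteria above (a length window plus a sequence identity cut-off) can be sketched as a greedy filter. This is only an illustration under stated assumptions, not the procedure used to build the benchmark: `toy_identity` is a hypothetical stand-in for a real alignment-based identity, `select_benchmark` is a made-up name, and treating the cut-off as "at most 35% to every chain already kept" is an assumption.

```python
from typing import Callable, Dict, List


def toy_identity(a: str, b: str) -> float:
    """Toy identity: ungapped, position-by-position match over the shorter sequence.

    A real pipeline would compute identity from a sequence alignment instead.
    """
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0


def select_benchmark(
    chains: Dict[str, str],
    identity: Callable[[str, str], float] = toy_identity,
    min_len: int = 41,
    max_len: int = 200,
    max_identity: float = 0.35,
) -> List[str]:
    """Keep chains inside the length window that are not too similar to any kept chain."""
    kept: List[str] = []
    for name, seq in chains.items():
        if not (min_len <= len(seq) <= max_len):
            continue  # outside the 41-200 residue window
        if all(identity(seq, chains[k]) <= max_identity for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    demo = {"a": "A" * 50, "b": "A" * 50, "c": "G" * 60, "d": "A" * 30, "e": "W" * 250}
    print(select_benchmark(demo))
```

The greedy order matters: whichever of two similar chains is seen first is the one retained, so a real benchmark would normally fix the ordering (for example by structure quality) before filtering.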
Summary of Results
Summary of Overall Folding Results
               SAL       TASSER                         MODELLER
               Best(a)   Align(b)  Top-5(c)  Top-1(d)   Align(b)  Top-5(c)  Top-1(d)
<RMSD>(e)      2.510     1.877     2.246     2.352      2.708     3.740     4.318
<COV>(f)       82%       82%       100%      100%       82%       100%      100%
N(RMSD<6.0)    1489      1489      1487      1481       1462      1326      1202
N(RMSD<5.5)    1485      1489      1485      1475       1431      1266      1138
N(RMSD<5.0)    1472      1489      1481      1464       1395      1195      1060
N(RMSD<4.5)    1440      1489      1468      1450       1336      1116      962
N(RMSD<4.0)    1369      1488      1447      1423       1255      984       841
N(RMSD<3.5)    1255      1476      1396      1359       1141      834       697
N(RMSD<3.0)    1064      1422      1259      1206       1008      647       551
N(RMSD<2.5)    776       1250      987       928        750       475       397
N(RMSD<2.0)    498       922       623       582        520       300       244
N(RMSD<1.5)    218       411       253       241        244       124       85
N(RMSD<1.0)    46        83        52        49         37        20        15
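To make the cumulative counts easier to compare, they can be turned into fractions of the 1489 benchmark targets. The sketch below uses the Top-1 counts for TASSER and MODELLER from the summary above; assigning those two count series to the Top-1 columns follows the column order of the summary and is an assumption of this sketch, as are the variable and function names.

```python
TOTAL = 1489  # number of benchmark targets

# RMSD cutoffs (in angstroms) and the number of targets below each cutoff.
CUTOFFS = [6.0, 5.5, 5.0, 4.5, 4.0, 3.5, 3.0, 2.5, 2.0, 1.5, 1.0]
TASSER_TOP1 = [1481, 1475, 1464, 1450, 1423, 1359, 1206, 928, 582, 241, 49]
MODELLER_TOP1 = [1202, 1138, 1060, 962, 841, 697, 551, 397, 244, 85, 15]


def fraction_below(counts, cutoff):
    """Fraction of all targets whose model RMSD is below `cutoff`."""
    return counts[CUTOFFS.index(cutoff)] / TOTAL


if __name__ == "__main__":
    for c in (6.0, 4.0, 2.0):
        t = fraction_below(TASSER_TOP1, c)
        m = fraction_below(MODELLER_TOP1, c)
        print(f"RMSD<{c}: TASSER {t:.1%}, MODELLER {m:.1%}")
```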
Some Examples:
Summary
• At low resolution, the PDB is most likely complete for single-domain proteins
• Acceptable full-length models can be built in the majority of cases
• The initial structures can be refined to move closer to native, even when starting from the best structural alignment

Results from
threading/refinement
“Real Life” situation
TASSER: Threading/ASSEmbly/Refinement
“Easy” cases:
• At least two threading templates identified with a significant consensus region, or
• One template with a highly significant z-score

“Medium” cases:
• At least two threading templates identified without any significant consensus region, or
• One template with a z-score above the threshold for correct fold assignment
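The Easy/Medium definitions above amount to a small decision rule, sketched below. The function name, the "hard" fallback label, and both thresholds (`z_high`, `z_fold`) are assumptions for illustration; the slides do not give the numeric z-score cut-offs, so they are left as parameters.

```python
from typing import List


def classify_target(
    template_z: List[float],  # z-scores of the threading templates identified
    has_consensus: bool,      # True if the templates share a significant consensus region
    z_high: float,            # hypothetical cut-off for a "highly significant" z-score
    z_fold: float,            # hypothetical cut-off for correct fold assignment
) -> str:
    """Return 'easy', 'medium', or 'hard' following the rules stated on the slide."""
    n = len(template_z)
    best = max(template_z, default=float("-inf"))
    if (n >= 2 and has_consensus) or best >= z_high:
        return "easy"
    if n >= 2 or best >= z_fold:
        return "medium"
    return "hard"  # anything that satisfies neither rule


if __name__ == "__main__":
    # Placeholder thresholds, chosen only to exercise the branches.
    print(classify_target([8.0, 9.0], True, z_high=15.0, z_fold=7.0))
    print(classify_target([8.0], False, z_high=15.0, z_fold=7.0))
```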

Composite Threading Results

• We can identify the correct global fold in 92% of the entire representative set of small PDB structures
• Good template alignments can be generated in 59% of the cases
• Good substructures in 67% of the cases
Summary of Results
Examples of Alignment improvement
Medium
Template
Final model
Easy
Template
Final model
Thin lines: Native; thick lines: Template/model
Two factors mainly contribute to the improvement:
• Geometric connectivity
• Better packing of local structure and side groups because of the force field
Comparison to Ensemble of NMR
Structures
(Predicted Structure to Centroid/Farthest NMR Structure to Centroid)
Thick Line is Predicted Structure
Benchmark set of larger proteins (201-300 residues)
745 proteins in total:
• 487 single-domain proteins
• 236 two-domain proteins
• 22 three- or four-domain proteins
Successful Predictions of
Transmembrane Proteins
Application to ORFs <201 residues in E. coli
• Easy: 61% (829/1360)
• Medium: 38% (521/1360)
• Hard: 10
• TASSER good models: 68% (920/1360)
Summary
• Acceptable model in about 2/3 of the cases (969/1489)
• Application to E. coli yields similar results: about 2/3 of proteins should have a good model, and almost all (90%) have a good template
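The "about 2/3" figures can be checked directly from the counts quoted on the slides. The short sketch below only redoes that arithmetic; the helper name `pct` is made up for the example.

```python
def pct(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# E. coli ORFs shorter than 201 residues, by threading difficulty.
ECOLI = {"easy": 829, "medium": 521, "hard": 10}
ECOLI_TOTAL = sum(ECOLI.values())

if __name__ == "__main__":
    print("benchmark acceptable models:", pct(969, 1489), "%")
    print("E. coli ORFs considered:", ECOLI_TOTAL)
    for label, count in ECOLI.items():
        print(f"  {label}: {pct(count, ECOLI_TOTAL)} %")
    print("E. coli good models:", pct(920, ECOLI_TOTAL), "%")
```

The three difficulty classes add up to the 1360 ORFs used as the denominator, and both the benchmark and the E. coli fractions land close to two thirds.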

Development of Active Site
Descriptors
Representation of an Automated
Functional Template [ AFT ]
[Figure: functional residues i, j, and k, each drawn with its Cα atom, its side-chain center of mass (cm SC), and the Cα atoms of the adjacent residues (i-1, i+1, j-1, j+1, k-1, k+1)]
Types of functional sites from SwissProt: METAL, BINDING, ACT_SITE, SITE
Set of distances between:
• Cα atoms and centers of mass of the side chains corresponding to 3 to 5 functional residues,
• Cα atoms corresponding to the adjacent residues.
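A descriptor of this kind can be illustrated as a table of distances between labelled pseudo-atoms. The sketch below is a minimal illustration, not the AFT implementation: the labels, the coordinates, and the choice to store every pairwise distance are assumptions, since the slides do not say exactly which pairs are kept.

```python
import math
from itertools import combinations
from typing import Dict, Tuple

Point = Tuple[float, float, float]


def distance_set(points: Dict[str, Point]) -> Dict[Tuple[str, str], float]:
    """Pairwise distances between labelled pseudo-atoms, keyed by sorted label pair."""
    return {
        (a, b): math.dist(points[a], points[b])
        for a, b in combinations(sorted(points), 2)
    }


if __name__ == "__main__":
    # Hypothetical coordinates for one C-alpha atom, one side-chain center of
    # mass, and one C-alpha atom of a second functional residue.
    site = {"CA_i": (0.0, 0.0, 0.0), "SC_i": (3.0, 4.0, 0.0), "CA_j": (0.0, 0.0, 12.0)}
    for pair, d in distance_set(site).items():
        print(pair, round(d, 2))
```

Because only internal distances are stored, the descriptor does not change when the structure is rotated or translated, which is what allows it to be matched against models without a prior superposition.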
Specificity parameters of AFTs
[Figure: histogram of positive and negative hits; x-axis: DRMSD [Å] (0.0 to 2.5); y-axis: number of hits in the subset of PDB (0 to 30); DRMSDMaxPos and DRMSDMinNeg are marked, together with the high-confidence and low-confidence DRMSD intervals]
Restrictive cutoff: average value of DRMSDMaxPos and DRMSDMinNeg.
Permissive cutoff: the expected number of false positive matches is less than 0.005 in a random structure.
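The quantities named here can be written down compactly. The sketch below shows a distance RMSD between two equally ordered distance lists and the restrictive cutoff as the average of DRMSDMaxPos and DRMSDMinNeg, as stated above. The function names and the example numbers are assumptions, and the permissive cutoff is not implemented because it depends on a random-structure model that the slides do not describe.

```python
import math
from typing import Sequence


def drmsd(template: Sequence[float], query: Sequence[float]) -> float:
    """Distance RMSD between two distance lists given in the same order."""
    if len(template) != len(query) or not template:
        raise ValueError("distance lists must be non-empty and of equal length")
    sq = sum((a - b) ** 2 for a, b in zip(template, query))
    return math.sqrt(sq / len(template))


def restrictive_cutoff(drmsd_max_pos: float, drmsd_min_neg: float) -> float:
    """Average of the largest positive-hit DRMSD and the smallest negative-hit DRMSD."""
    return 0.5 * (drmsd_max_pos + drmsd_min_neg)


if __name__ == "__main__":
    cutoff = restrictive_cutoff(0.5, 1.5)         # hypothetical values
    score = drmsd([5.0, 12.0, 13.0], [5.4, 11.7, 13.2])
    print(round(score, 3), "passes" if score <= cutoff else "fails")
```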
[Figure: fraction of decoys correctly annotated vs. ranking of the best true positive hit; labels: global Cα cRMSD from the native structure, local Cα dRMSD from the native structure; values shown: 73%, 56%, 48%, 35%]
Recognition by an AFT matching the first three components of the true EC number is considered a true positive hit.
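The true-positive criterion is easy to state in code: compare the first three dot-separated fields of the predicted and true EC numbers. The function name and the handling of unassigned fields ("-") are assumptions of this sketch.

```python
def ec_true_positive(predicted: str, true: str, levels: int = 3) -> bool:
    """True if the first `levels` EC components agree and none of them is unassigned."""
    p = predicted.split(".")[:levels]
    t = true.split(".")[:levels]
    if len(p) < levels or len(t) < levels:
        return False  # not enough components to compare
    if "-" in p or "-" in t:
        return False  # unassigned levels are treated as non-matching
    return p == t


if __name__ == "__main__":
    print(ec_true_positive("3.4.21.4", "3.4.21.5"))   # same first three fields
    print(ec_true_positive("3.4.21.4", "3.4.22.4"))   # third field differs
```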
Threading of Entire Genomes
Summary of Fold Assignments
Counts are numbers of ORFs; percentages are given in parentheses.

Organism        Total protein ORFs  ORF coverage (%)  Amino acid coverage (%)  FASTA (%)    PSIBLAST-PDB (%)  PSIBLAST-PDBseq (%)  GTOP (%)     Pedant (%)   Gerstein (%)
M. genitalium   484                 387 (80.0)        48.1                     231 (47.7)   205 (42.4)        259 (53.5)           273 (56.4)   259 (53.5)   214 (44.2)
E. coli         4289                3356 (78.2)       50.2                     1660 (38.7)  1516 (35.3)       1906 (44.4)          2032 (58.5)  1954 (45.6)  1191 (27.8)
B. subtilis     4106                2988 (72.8)       47.2                     1465 (35.7)  1314 (32.0)       1732 (42.4)          1947 (60.2)  1963 (47.7)  1121 (27.3)
A. aeolicus     1522                1297 (85.2)       48.0                     646 (42.4)   592 (38.9)        771 (50.7)           827 (53.1)   800 (52.6)   527 (34.6)
S. cerevisiae   6343                4610 (72.7)       30.0                     1962 (30.9)  1804 (28.4)       2422 (38.2)          2694 (42.5)  2766 (42.9)  1699 (27.3)
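The ORF coverage percentages follow directly from the assigned and total ORF counts listed for each organism. The sketch below recomputes them from those counts; the dictionary layout and function name are assumptions made for the example.

```python
# (total protein ORFs, ORFs with an assigned fold), taken from the summary above
GENOMES = {
    "M. genitalium": (484, 387),
    "E. coli": (4289, 3356),
    "B. subtilis": (4106, 2988),
    "A. aeolicus": (1522, 1297),
    "S. cerevisiae": (6343, 4610),
}


def orf_coverage(total: int, assigned: int) -> float:
    """Percentage of ORFs with an assigned fold, rounded to one decimal place."""
    return round(100.0 * assigned / total, 1)


if __name__ == "__main__":
    for organism, (total, assigned) in GENOMES.items():
        print(f"{organism}: {orf_coverage(total, assigned)} %")
```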
Comments on fold distribution

• Protein folds can be assigned to 72-85% of the genes in each genome.
• 30-50% of the total amino acids in a genome are covered by the assigned folds.
• In general, the distribution of folds is similar in the 5 organisms.
• Folds of the α/β type are abundant.
• Folds with multiple functions are abundant in a genome.
• The kinase fold shows up in the top 5 only in S. cerevisiae.

MULTIPROSPECTOR:
Prediction of Protein-Protein Interactions
L. Lu, H. Lu, J. Skolnick. Proteins, 2002, 49, 350-364.
Overall Idea of Multimer Threading
X: GELPIAPIGRIIKNAGAERVSDDARIALAKVLEEMGEEIASEAVKLAKHAGRKTIKAEDIKLARKMFK
Y: GEVPIAPLGRIIKNAGAERVSDDARIALAKVLEEMGEEIASEAIRLAKHAGRKTIKAEDVKLAKKMFK
[Figure: sequences X and Y first go through monomer threading; the pair then goes through multimer threading against A-B templates from the multimer structure library]
Assign fold on the basis of Z score and Interface Energy
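The decision step, assigning a dimer fold from the z-scores and the interface energy, might look like the sketch below. Everything specific here is an assumption for illustration: the `DimerHit` record, the thresholds `z_min` and `e_max`, the rule that both chains must pass the z-score threshold, and the tie-break on lowest interface energy. The slides do not state the actual criteria used by MULTIPROSPECTOR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DimerHit:
    template: str            # hypothetical identifier of an A-B template
    z_x: float               # z-score of sequence X threaded onto chain A
    z_y: float               # z-score of sequence Y threaded onto chain B
    interface_energy: float  # lower (more negative) means a more favorable interface


def predict_interaction(
    hits: List[DimerHit], z_min: float, e_max: float
) -> Optional[DimerHit]:
    """Return the best passing hit, or None if X and Y are not predicted to interact."""
    passing = [
        h for h in hits
        if h.z_x >= z_min and h.z_y >= z_min and h.interface_energy <= e_max
    ]
    return min(passing, key=lambda h: h.interface_energy, default=None)


if __name__ == "__main__":
    hits = [
        DimerHit("t1", 6.0, 7.0, -20.0),
        DimerHit("t2", 9.0, 9.0, -5.0),    # good z-scores, weak interface
        DimerHit("t3", 2.0, 9.0, -40.0),   # strong interface, X does not fit chain A
    ]
    print(predict_interaction(hits, z_min=5.0, e_max=-10.0))
```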
Preliminary test on Known
Dimers and Monomers
                Total   Predicted to be dimers   Predicted to be monomers
Homodimers      58      54                       4
Heterodimers    20      20                       0
Monomers        96      5                        91
Procedure for genomic-scale prediction of protein-protein interactions by MULTIPROSPECTOR
Comparison of colocalization index for different methods
Distribution of predicted interactions in functional categories
Conclusions
Completeness of the PDB

• The PDB is a covering set of single-domain proteins at low to moderate resolution
• The protein structure prediction problem can be solved with more powerful threading algorithms!!

TASSER



For single-domain proteins:
• In almost all cases, for all ranges of initial RMSD, even when starting from the “best” structural alignment, the final results are better than the initial template: the models move closer to native
• Based on a comprehensive folding benchmark, we expect low-resolution structures for ~2/3 of proteins with low sequence identity to PDB structures
• Weak dependence on secondary structure type
Structure to Function
• Low-resolution structures can be used to identify active sites
• Genome-scale threading: greater than 70% of ORFs can be assigned to known folds
• Extension to protein-protein interactions: accuracy comparable to the agreement between two experimental methods

Acknowledgements
Center of Excellence in Bioinformatics: Yang Zhang, Adrian Arakaki
Purdue University: Daisuke Kihara
Yale University: Long Lu
University of Illinois: Hui Lu
Funding: NIH, NSF & The Oishei Foundation
http://bioinformatics.buffalo.edu/