Homology Modeling

Lu Chih-Hao
1
Why study protein structure?
• Proteins play crucial functional roles in all biological
processes: enzymatic catalysis, signaling messengers …
• Function depends on 3D structure.
• Easy to obtain protein sequences, difficult to determine
structure.
2
Where to find the data?
• Protein Data Bank (PDB)
– http://www.rcsb.org/pdb/
– More than about 100,000 protein structures
• Each entry is a text file containing the X, Y, and Z coordinates of every
heavy atom, from the first residue to the last
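As an illustration of how the coordinates are stored, here is a minimal Python sketch that reads heavy-atom coordinates from the fixed-column ATOM records of a PDB text file. The function names and the helper `atom_line` (used only to build demonstration records) are hypothetical and not part of any library; the column positions follow the standard PDB ATOM record layout.

```python
def atom_line(serial, name, res, chain, seq, x, y, z, element):
    """Hypothetical helper: build one fixed-column ATOM record for the demo."""
    return (f"ATOM  {serial:5d} {name:<4s} {res:>3s} {chain}{seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}          {element:>2s}")


def parse_heavy_atoms(pdb_text):
    """Return (atom_name, res_name, chain, res_seq, (x, y, z)) per heavy atom."""
    atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        name = line[12:16].strip()             # columns 13-16: atom name
        element = line[76:78].strip().upper()  # columns 77-78: element symbol
        # If the element columns are empty, guess from the atom name.
        kind = element or name.lstrip("0123456789")[:1]
        if kind == "H":
            continue                           # skip hydrogens
        coords = (float(line[30:38]),          # columns 31-38: X
                  float(line[38:46]),          # columns 39-46: Y
                  float(line[46:54]))          # columns 47-54: Z
        atoms.append((name, line[17:20].strip(), line[21],
                      int(line[22:26]), coords))
    return atoms


demo = "\n".join([
    atom_line(1, " N  ", "LYS", "A", 1, 1.0, 2.0, 3.0, "N"),
    atom_line(2, " H  ", "LYS", "A", 1, 1.5, 2.5, 3.5, "H"),
    atom_line(3, " CA ", "LYS", "A", 1, -4.25, 5.5, 6.75, "C"),
])
heavy = parse_heavy_atoms(demo)
```

In practice an established parser is usually preferable; the sketch only shows that each ATOM line carries one atom and its three coordinates.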
3
PDB Statistics
4
TIM barrel
5
How to determine the protein
structure?
• By experimentation
– X-Ray
– NMR (nuclear magnetic resonance spectroscopy)
• Sequence-Structure gap
6
Protein Structure Prediction
• The primary sequence already contains all the information
necessary to define the 3D structure.
• The 3D protein structure can be predicted according to
three main categories of methods (Rost & O’Donoghue,
1997): (1) homology modeling; (2) fold recognition
(threading); (3) ab initio techniques.
• Homology modeling is currently the most accurate
method to predict protein 3D structure (Tramontano,
1998).
7
Protein Structure Prediction
[Flowchart] Sequence → sequence homology to a known fold?
– >30%: homology modeling → model
– <30%: threading → match found? Yes: model; No: ab initio
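The decision flow on this slide can be written as a small Python sketch. The function name and arguments are hypothetical; the 30% cutoff is the one shown on the slide, and treating exactly 30% as the low-identity branch is an assumption, since the slide only gives ">30%" and "<30%".

```python
def choose_method(identity_percent, threading_match_found=False):
    """Pick a prediction approach following the slide's flowchart."""
    if identity_percent > 30.0:
        # Clear sequence homology to a known fold.
        return "homology modeling"
    if threading_match_found:
        # Low identity, but fold recognition found a compatible fold.
        return "threading"
    # No usable template or fold was found.
    return "ab initio"


choice_high = choose_method(45.0)
choice_mid = choose_method(22.0, threading_match_found=True)
choice_low = choose_method(22.0)
```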
8
Sequence similarity implies structural similarity?
[Figure: percentage sequence identity/similarity (0 to 100) plotted against
the number of residues aligned (0 to 250). Above the curve lies the safe zone,
where sequence identity implies structural similarity.
(B. Rost, Columbia, New York)]
9
Homology Modeling
• Basis
– Structure is much more conserved than sequence
during evolution
• Limited applicability
– A large number of proteins and ORFs have no
similarity to proteins with known structure
10
What is Homology Modeling?
Target (1alc), structure unknown (?):
KQFTKCELSQNLYDIDGYGRIALPELICTMF
HTSGYDTQAIVENDESTEYGLFQISNALWCK
SSQSPQSRNICDITCDKFLDDDITDDIMCAK
KILDIKGIDYWIAHKALCTEKLEQWLCEKE
Template (8lyz), structure known:
KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAK
FESNFNTQATNRNTDGSTDYGILQINSRWWCND
GRTPGSRNLCNIPCSALLSSDITASVNCAKKIV
SDGNGMNAWVAWRNRCKGTDVQAWIRGCRL
The two proteins are homologous and share similar sequences, so the known
structure is used as the template.
11
Structure prediction by homology
modeling
[Figure: workflow in four steps, Step 1 to Step 4]
12
Homology detection and template
selection
• Homology detection
– Detect the fold of a probe sequence by searching a library
of known folds.
• Three types of sequence-based methods:
– Pair-wise sequence-sequence comparison
• FASTA, BLAST
– Sequence-profile comparison
• PSI-BLAST, IMPALA, HMMER, SAM
– Profile-profile comparison
• prof_sim, COMPASS
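To make the simplest case, pair-wise sequence-sequence comparison, concrete, here is a minimal Python sketch of global alignment by dynamic programming (Needleman-Wunsch). This is not how BLAST or FASTA work internally; they use faster heuristic local searches. The scoring values (match +1, mismatch -1, gap -2) and the function names are illustrative assumptions, where real tools use substitution matrices such as BLOSUM.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


def percent_identity(aligned_a, aligned_b):
    """Identity over aligned (non-gap) column pairs, in percent."""
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


aln_a, aln_b, aln_score = needleman_wunsch("ACGT", "AGT")
```

The same two functions could be applied to a target and a template sequence to estimate their percentage identity before choosing a template.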
13
Sequence-Sequence comparison
Query sequence (Q) against template sequence (T): BLAST, FASTA, SSEARCH
14
Profile-Sequence comparison
Query profile (Q) against template sequence (T): PSI-BLAST
15
PSI-BLAST Overview
16
Sequence-Profile comparison
Query sequence (Q) against template profile (T): RPS-BLAST, IMPALA, HMMER, SAM
17
Profile-Profile comparison
Query profile (Q) against template profile (T): prof_sim, COMPASS
18
The importance of the sequence
alignment
Method_1: 1lmb3 <-> 1pou, shift = 9.34, σ = 39.62
LEDARRLKAIYEKKKNELGLSQESVADKMGMGQSGVGALFNGINALNAYNAALLAKILKVSVEEFSPSIAREIYEMYEA
HHHHHHHHHHHHHHHHHCCCChhhhhhhhccchhhhhhhhccccccchhhhhhhhhhhccchhhcchhhhhhhhhhhhh
HHHHHHHHHHHHHHHHHHCCC---------cchhhhhhhhhcccccc---chhhhhhhcccccccchhhhhhhhhhhhh
LEELEQFAKTFKQRRIKLGFT---------QGDVGLAMGKLYGNDFS---QTTISRFEALNLSFKNMCKLKPLLEKWLN
Method_2: 1lmb3 <-> 1pou, shift = 0.67, σ = 60.78
LEDARRLKAIYEKKKNELGLS----QESVADKMG--MGQSGVGALFN-GINALNAYNAALLAKILKVSVEEFS
HHHHHHHHHHHHHHHHHCCCC----hhhhhhhhc--cCHHHHHHHHC-cccccchhhhhhhhhhhccchhhcc
HHHHHHHHHHHHHHHHHHCCCcchhhhhhhhhcccccCCHHHHHHHCccccccchhhhhhhhhhh---hhhcc
LEELEQFAKTFKQRRIKLGFTQGDVGLAMGKLYGNDFSQTTISRFEALNLSFKNMCKLKPLLEKW---LNDAE
SCR: structure conserved region; SVR: structure variable region
19
Backbone generation
• Rigid-body assembly
– Building model core
20
21
Loops can be constructed by:
Ab initio methods, which use no prior knowledge. An empirical scoring
function is used to evaluate each of a large number of candidate conformations.
(Wedemeyer & Scheraga, J. Comput. Chem. 20, 819-844, 1999)
22
Loops can be constructed by:
Using a database of loops that appear in known structures. The loops can be
categorized by their length or sequence.
[Figure: data → clustered data → library]
23
Scan the database for protein fragments with the correct number of residues
and the correct end-to-end distance
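The database scan described above can be sketched in Python as a simple filter. Everything here is a hypothetical illustration: the fragment library is a plain list of (name, Cα coordinates) pairs, the end-to-end distance is measured between the first and last Cα atoms, and the 1.0 Å tolerance is an assumed value, not one given on the slide.

```python
import math


def end_to_end(ca_coords):
    """Distance between the first and last C-alpha atom of a fragment."""
    return math.dist(ca_coords[0], ca_coords[-1])


def find_loop_candidates(fragments, n_residues, anchor_distance, tolerance=1.0):
    """Return fragment names with the right length and end-to-end distance.

    fragments: iterable of (name, list of (x, y, z) C-alpha coordinates).
    anchor_distance: distance between the two anchor residues in the model.
    Results are sorted by how closely the distance matches.
    """
    hits = []
    for name, ca_coords in fragments:
        if len(ca_coords) != n_residues:
            continue                      # wrong number of residues
        deviation = abs(end_to_end(ca_coords) - anchor_distance)
        if deviation <= tolerance:
            hits.append((deviation, name))
    return [name for _, name in sorted(hits)]


toy_library = [
    ("fragA", [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]),
    ("fragB", [(0.0, 0.0, 0.0), (3.0, 2.0, 0.0), (5.0, 0.0, 0.0)]),
    ("fragC", [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]),
]
candidates = find_loop_candidates(toy_library, n_residues=3, anchor_distance=5.5)
```

A real loop-modeling step would then superimpose each candidate onto the anchor residues and score it; the sketch covers only the initial scan.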
24
25
26
Loop Modeling: A database approach
[Figure: cRMS (Å) plotted against loop length]
The method breaks down for loops longer than 9 residues.
27
Target: 2bj7A
Predicted model with long loop
GDT_TS = 45.96
Without loop
GDT_TS = 60.48
28
29
Errors in Homology Modeling
a) Side chain packing
b) Distortions and shifts
c) No template
[Figure legend: true structure, template, model]
30
Errors in Homology Modeling
d) Misalignments
e) Incorrect template
[Figure legend: true structure, template, model]
31
(Marti-Renom et al., 2000)
PROCHECK, Verify3D, Prosa, Anolea, Bala …
32
PROCHECK
[Figure: PROCHECK plot with the β and α regions labeled]
http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html
33
Verify3D
• Verify3D analyzes the compatibility of an atomic
model (3D) with its own amino acid sequence
(1D).
Luethy et al., 1992
34
ProQ Server
• ProQ is a neural network-based predictor
– Structural features → quality of a protein model.

Quality     LGscore   MaxSub
Correct     > 1.5     > 0.1
Good        > 3       > 0.5
Very good   > 5       > 0.8

Arne Elofsson's group: http://www.sbc.su.se/~bjorn/ProQ/
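The ProQ thresholds above can be applied with a few lines of Python. This is only a sketch for reading predicted scores: the table name and function are hypothetical, and each score is classified on its own because the slide does not say how the two scores are combined.

```python
# Cutoffs from the slide, highest first; a value must exceed the cutoff.
PROQ_THRESHOLDS = {
    "LGscore": [(5.0, "very good"), (3.0, "good"), (1.5, "correct")],
    "MaxSub": [(0.8, "very good"), (0.5, "good"), (0.1, "correct")],
}


def classify_proq(score_name, value):
    """Return the quality label for one ProQ score, or None if below all cutoffs."""
    for cutoff, label in PROQ_THRESHOLDS[score_name]:
        if value > cutoff:
            return label
    return None


lg_label = classify_proq("LGscore", 4.2)
ms_label = classify_proq("MaxSub", 0.85)
```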
35
Modeling accuracy
36
(Marti-Renom et al., 2000)
Utility of Structural Information
37
38
39
(PS)2: protein structure prediction
server
40
Consensus strategy
• The idea of consensus analysis is to gather
predictions from a set of different methods.
• Consensus methods perform significantly
better than the individual methods.
3D-SHOTGUN (Fischer D., 2003)
3D-Jury (Ginalski K et al., 2003)
Pmodeller (Wallner B et al., 2003)
41
Structure prediction by homology
modeling
[Figure: workflow in four steps, Step 1 to Step 4]
42
Overview of the (PS)2 method
Step 1: Template search/selection by the consensus of PSI-BLAST and IMPALA
Step 2: Target-template alignment by the consensus of T-Coffee, PSI-BLAST,
and IMPALA
Step 3: Model building by MODELLER, and structure evaluation and
visualization by CHIME and Raster3D
Figure 1. Overview of the protein structure prediction server, (PS)2
(panels a to d).
43
Alignment method
Input: target and template sequences
Output: target-template aligned sequences
Step 1: Initialize all entries of the alignment matrix to 0. Align the target
and template sequences using PSI-BLAST, IMPALA, and T-Coffee.
Step 2: For each position, sum the aligned scores of these three alignments,
using different scoring weights.
Step 3: Take the positions with the highest score as the aligned points for
building the final target-template alignment (e.g., the highest score is 9
in the 1st cycle in (b)).
Step 4: Identify the unfeasible positions (4 and 2 in (b)).
Step 5: Set the scores of the unfeasible positions and of the aligned points
to 0.
Step 6: Repeat Steps 3 to 5 until all entries are 0.
Step 7: Output the path through the aligned points as the target-template
alignment.
Figure legend, panel (b): 9, aligned in the 1st cycle; 7, aligned in the 2nd
cycle; 5, aligned in the 3rd cycle; 3, aligned in the 4th cycle; 4 and 2,
unfeasible solutions. Separate line styles mark the aligned paths of
PSI-BLAST, T-Coffee, and IMPALA and the final aligned path.
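The steps above can be sketched in Python as follows. This is one reading of the slide, not the server's actual code. Several choices are assumptions: each input alignment is given as a list of (target_index, template_index) pairs, the weights are made up, a sparse dictionary stands in for the matrix (a missing entry means 0), ties are broken by taking the smallest index pair, and a position is treated as "unfeasible" when it shares a row or column with a chosen point or would cross it.

```python
from collections import defaultdict


def consensus_alignment(alignments, weights):
    """Combine several alignments into one path of (target, template) pairs."""
    # Steps 1-2: start from zero and sum the weighted votes per position.
    scores = defaultdict(float)
    for pairs, weight in zip(alignments, weights):
        for cell in pairs:
            scores[cell] += weight

    path = []
    # Step 6: repeat until no non-zero entry is left.
    while scores:
        # Step 3: take a highest-scoring position as an aligned point.
        best = max(scores.values())
        i, j = min(cell for cell, s in scores.items() if s == best)
        path.append((i, j))
        # Steps 4-5: drop the chosen point and every unfeasible position,
        # keeping only cells strictly before or strictly after it.
        scores = {(a, b): s for (a, b), s in scores.items()
                  if (a < i and b < j) or (a > i and b > j)}
    # Step 7: output the aligned points in sequence order.
    return sorted(path)


psi_blast = [(0, 0), (1, 1), (2, 2)]
impala = [(0, 0), (1, 1), (2, 3)]
t_coffee = [(0, 0), (1, 2), (2, 3)]
final_path = consensus_alignment([psi_blast, impala, t_coffee], [1, 1, 1])
```

In the toy input, position (0, 0) is supported by all three methods and is chosen first; (1, 2) and (2, 2) are later discarded because they conflict with higher-scoring points.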
44
(a)
(b)
http://predictioncenter.org/
45
CASP3 servers registered:
1. 3D-PSSM (Sternberg)
2. Karplus
3. frsvr (Fischer)
4. pscan (Elofsson)
5. BASIC (Godzik)
6. GenTHREADER
7. Valentina di Francesco
8. TOPITS (Rost)
9. Bork
46
CASP8 servers registered:
47
Model Evaluation
• Performance evaluation
– The 47 CM (comparative modeling) targets are used to compare
performance with the other groups in CASP6.
• GDT_TS Score
$$\mathrm{GDT\_TS} = \frac{100}{4} \sum_{d \in \{1, 2, 4, 8\}} \frac{\mathrm{GDT}_d}{N} \ (\%)$$
- N is the total number of residues of the target (native structure)
- GDTd is the number of aligned residues whose Cα-atom distance
between the target and predicted model is less than d
- d is 1, 2, 4, or 8 Å.
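A minimal Python sketch of the GDT_TS score as defined above. It assumes the model has already been superimposed on the native structure and that the Cα-Cα distance of every aligned residue is available as a list; the full GDT procedure searches over many superpositions, which is not done here. The function name and arguments are hypothetical.

```python
def gdt_ts(ca_distances, n_target, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """GDT_TS in percent from per-residue C-alpha distances (in Angstrom).

    ca_distances: distances for the aligned residues only.
    n_target: total number of residues in the target (native structure).
    """
    if n_target <= 0:
        raise ValueError("n_target must be positive")
    fractions = []
    for d in cutoffs:
        # GDT_d: number of aligned residues closer than d to the native.
        gdt_d = sum(1 for dist in ca_distances if dist < d)
        fractions.append(gdt_d / n_target)
    return 100.0 * sum(fractions) / len(cutoffs)


example_score = gdt_ts([0.5, 1.5, 3.0, 7.0, 10.0], n_target=5)
```

For the five example distances, 1, 2, 3, and 4 residues fall under the 1, 2, 4, and 8 Å cutoffs, so the score is the average of 20%, 40%, 60%, and 80%. Residues that were not aligned lower the score because N counts every residue of the target.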
48
T0264 (1wde)
6
294
Native structure
10
Aligned rate: 91.00 %
272
PSI-BLAST model
GDT_TS = 64.97
10
Aligned rate: 91.00 %
272
IMPALA model
GDT_TS = 63.32
Aligned rate: 100 %
6
10
6
294
T-Coffee model
GDT_TS = 65.14
294
(PS)2 model
GDT_TS = 67.22
272
Aligned rate: 100 %
GDT_TS = 66.00
Figure 3. Comparison of (PS)2 with PSI-BLAST, IMPALA, and T-Coffee in
prediction accuracy (global / local GDT_TS scores) on target T0264.
49
[Figure: GDT_TS score (%), 0 to 100, for each of the 47 targets: T0196,
T0199_1, T0200, T0204, T0205, T0208, T0211, T0222_1, T0223_1, T0226, T0226_1,
T0229, T0229_1, T0229_2, T0231, T0233, T0233_1, T0233_2, T0234, T0235_1,
T0240, T0246, T0247, T0247_1, T0247_2, T0247_3, T0264, T0264_1, T0264_2,
T0266, T0267, T0268, T0268_1, T0268_2, T0269, T0269_1, T0269_2, T0271, T0274,
T0275, T0276, T0277, T0279, T0279_1, T0279_2, T0280_1, T0282]
Figure 4. Comparison of (PS)2 models with all automated servers in CASP6.
50
Table 1. Comparison with the other groups in CASP6

Group    Average GDT_TS
(PS)2    65.89
RBTA     64.92
ESYP     63.14
3DJR     62.54
MGTH     61.27
3DJS     61.08
PROS     58.11
PMO5     57.93
PRCM     57.62
PCO5     56.37
PCOB     37.57
• Cases
T0269, Template 1prxA
(PS)2 model, GDT_TS: 85.76
T0269, Template 1qq2A
ESYP model, GDT_TS: 78.48
51
http://ps2.life.nctu.edu.tw
52
53
54