Transcript Topic 13

2° structure, TM regions, and solvent accessibility
Chapter 29, Gu and Bourne “Structural Bioinformatics”
Topic 13
The Truth (Information) is Out (In) There
But we’re still having a tough time finding it.
Protein Secondary Structure Prediction
Given a protein sequence (primary structure), predict its secondary structures
GHWIATRGQLIREAYEDYRHFSSECPFIP
CEEEEECCCEEEEECCCHHHHHHCCCCCC
E: β-strand
H: α-helix
C: coil
Reduction of the eight assigned states to three:
H: (H: α-helix, G: 3-10 helix, I: π-helix)
E: (E: β-strand, B: bridge)
C: (T: β-turn, S: bend, C: coil)
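The three-state reduction listed above can be written as a lookup table. A minimal sketch in Python; the function name `reduce_to_three_states` is hypothetical, and treating any unlisted symbol (a blank or a dash, for example) as coil is an assumption.

```python
# Mapping from the eight-letter alphabet to the three classes listed above.
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strand and bridge
    "T": "C", "S": "C", "C": "C",   # turn, bend, coil
}


def reduce_to_three_states(assignment: str) -> str:
    """Collapse a per-residue assignment string to the H/E/C alphabet."""
    # Assumption: symbols not in the table (blank, '-') are counted as coil.
    return "".join(EIGHT_TO_THREE.get(symbol, "C") for symbol in assignment)


print(reduce_to_three_states("CHHHGGGTTEEEEBSSC"))
```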
Assumption: short stretches of residues have a propensity to adopt a certain
conformation ⇒ the conformation of the central residue in a sequence fragment
depends only on its flanking residues (sliding window)
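The sliding-window idea can be sketched as follows: every residue is paired with the fragment centred on it, and a predictor then looks only at that fragment. The helper name `sliding_windows`, the default half width, and the `-` padding character at the chain ends are assumptions made for illustration.

```python
def sliding_windows(sequence: str, half_width: int = 7, pad: str = "-"):
    """Yield (central_residue, window) for every position in the sequence."""
    # Pad both ends so that terminal residues also get a full-length window.
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    for i, residue in enumerate(sequence):
        yield residue, padded[i:i + size]


seq = "GHWIATRGQLIREAYEDYRHFSSECPFIP"
for residue, window in list(sliding_windows(seq, half_width=3))[:3]:
    print(residue, window)
```

With `half_width=3` each window holds 7 residues; the default of 7 gives a 15-residue window.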
Why secondary structure prediction?
-- Because we can (kind of).
-- Because it could be a first step towards prediction of protein tertiary
structure.
“Have solution, need problem.” Nearly every imaginable algorithm has been
applied to secondary structure prediction.
Secondary Structure Prediction Methods
1. First generation: Single amino acid propensities
Chou-Fasman method (1974), GOR I-IV
~56-60% accuracy
2. Second generation: Segments of 3-51 adjacent residues
NNSSP, SSPAL
~65% accuracy
3. Neural network
PHD, Psi-Pred, J-Pred
4. Support vector machine (SVM)
5. Hidden Markov Models (HMM)
Third generation methods: using evolutionary information, ~76% accuracy
Secondary Structure Prediction Accuracy
1. three-state per-residue prediction accuracy
$$ Q_3 = 100 \times \frac{\sum_{i=1}^{3} M_{ii}}{N_{obs}} $$
M_ii: number of residues observed in state i and predicted in state i
N_obs: the total number of residues observed in 3 states
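A small sketch of the Q3 calculation, counting the residues whose observed and predicted states agree and dividing by the number of observed residues. The function name `q3` and the two short strings are made up for illustration; they are not taken from any real protein.

```python
def q3(observed: str, predicted: str) -> float:
    """Three-state per-residue accuracy in percent."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted strings differ in length")
    states = "HEC"
    # N_obs: residues observed in one of the three states.
    n_obs = sum(1 for o in observed if o in states)
    if n_obs == 0:
        raise ValueError("no residues observed in the three states")
    # Sum of M_ii: observed in state i and predicted in state i.
    correct = sum(1 for o, p in zip(observed, predicted) if o in states and o == p)
    return 100.0 * correct / n_obs


observed = "CCHHHHHHCCEEEECC"    # made-up example
predicted = "CCHHHHCCCCEEEEEC"   # made-up example
print(q3(observed, predicted))
```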
2. per-segment prediction accuracy (SOV, Segment of OVerlap)
Per-state segment overlap:
S1: observed SS segment
S2: predicted SS segment
Single Residue Propensity Methods
Calculate the propensity for a given amino acid to adopt a certain ss-type
$$ P_{\alpha i} = \frac{P(\alpha \mid aa_i)}{p(\alpha)} = \frac{p(\alpha, aa_i)}{p(\alpha)\, p(aa_i)} $$
i: amino acid
α: secondary structure state
Example: from a data set with 30 proteins
#Ala=2,000, #residues=20,000, #helix=4,000, #Ala in helix=580
p(α,aa) = 580/20,000, p(α) = 4,000/20,000, p(aa) = 2,000/20,000
P = (580/20,000) / [(4,000/20,000) × (2,000/20,000)] = 580 / (4,000/10) = 1.45
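The same arithmetic, written as a function of the raw counts. This is only a sketch of the ratio defined above; the function and parameter names are hypothetical.

```python
def propensity(n_joint: int, n_state: int, n_aa: int, n_total: int) -> float:
    """Propensity of an amino acid for a state, from residue counts.

    n_joint: residues of this amino acid found in the state
    n_state: all residues found in the state
    n_aa:    all residues of this amino acid
    n_total: all residues in the data set
    """
    p_joint = n_joint / n_total
    p_state = n_state / n_total
    p_aa = n_aa / n_total
    return p_joint / (p_state * p_aa)


# Counts from the Ala / helix example above.
print(round(propensity(n_joint=580, n_state=4000, n_aa=2000, n_total=20000), 2))
```

A value above 1 means the amino acid occurs in that state more often than expected by chance, a value below 1 means less often.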
Amino Acid Propensities to Secondary Structures
Residue  T    S    P    T    A    E    L    M    R    S    T    G
P(H)     69   77   57   69   142  151  121  145  98   77   69   57
Chou-Fasman method
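The P(H) values listed above for the segment TSPTAELMRSTG can be averaged over a sliding window to flag stretches that favour helix. This is only a simplified sketch in the spirit of a single-residue propensity method: the window length of 6 and the cutoff of 100 are assumptions, and the full Chou-Fasman rules are not reproduced in these notes.

```python
RESIDUES = "TSPTAELMRSTG"
P_HELIX = [69, 77, 57, 69, 142, 151, 121, 145, 98, 77, 69, 57]


def window_averages(values, width=6):
    """Mean propensity of every window of `width` consecutive residues."""
    return [sum(values[i:i + width]) / width
            for i in range(len(values) - width + 1)]


def helix_favoring_windows(values, width=6, cutoff=100):
    """0-based start positions of windows whose mean exceeds the cutoff."""
    return [i for i, avg in enumerate(window_averages(values, width))
            if avg > cutoff]


for start in helix_favoring_windows(P_HELIX):
    print(start + 1, RESIDUES[start:start + 6])
```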
Nearest Neighbor Methods
* The idea is simple: predict the SS of the central residue of a given
segment from homologous segments (neighbors).
For example, from a database, find some number of the closest sequences
to a subsequence defined by a window around the central residue, then
use max(N_α, N_β, N_c) to assign the SS.
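[Figure: homologous sequences aligned to a window (window size marked) around the central residue of RSTEVRASRQLAKEKVN; each neighbor is labeled with a secondary structure state (E, C, or H)]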
Key parameters:
1. How to define similarity?
2. What size window of sequence should be examined?
3. How many close sequences should be selected?
The Devil is in the details…
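The three key parameters can be made concrete in a toy sketch. Here similarity is plain residue identity, the window is 7 residues, and the 3 closest neighbors vote; all three choices, the function names, and the tiny database are assumptions for illustration only, not the settings of any published method.

```python
from collections import Counter


def identity(a: str, b: str) -> int:
    """Similarity measure (assumption): number of identical positions."""
    return sum(x == y for x, y in zip(a, b))


def nearest_neighbor_state(query_window, database, k=3):
    """Assign the state held by most of the k closest database windows.

    database: list of (window, state_of_central_residue) pairs.
    """
    ranked = sorted(database,
                    key=lambda entry: identity(query_window, entry[0]),
                    reverse=True)
    votes = Counter(state for _, state in ranked[:k])
    # Ties go to the state encountered first among the neighbors.
    return votes.most_common(1)[0][0]


# Made-up database of 7-residue windows with the state of the central residue.
toy_database = [
    ("RASKQLA", "H"),
    ("KASRELA", "H"),
    ("RGSRQVA", "C"),
    ("TVTVEVN", "E"),
    ("GDGSPNG", "C"),
]
print(nearest_neighbor_state("RASRQLA", toy_database))
```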
Psi-Pred Method
* D. Jones, J. Mol. Biol. 292, 195 (1999).
* Method: Neural network
* Input data: PSSM generated by PSI-BLAST
* Bigger and better sequence database
-- Combining several databases and data filtering
* Training and test set preparation
-- SS prediction only makes sense for proteins with no homologous
structure.
-- No sequence or structural homologues between training and test sets,
checked by CATH and PSI-BLAST (mimicking the realistic situation).
Psi-Pred Method--Neural Network
* Window size = 15
* Two networks
* First network (sequence-to-structure):
-- 315 = (20 + 1) × 15 inputs
-- extra unit to indicate where the window spans either the N or C terminus
-- Data are scaled to the [0, 1] range by using 1/[1+exp(-x)]
-- 75 hidden units
-- 3 outputs (H, E, L)
* Second network (structure-to-structure):
-- Structural correlation between adjacent sequence positions
-- 60 = (3 + 1) × 15 inputs
-- 60 hidden units
-- 3 outputs
* Accuracy ~76%
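The input layout of the first network can be sketched in a few lines: 15 window positions, each with 20 logistically scaled PSSM scores plus one extra unit that flags positions beyond the chain ends. This is not the PSIPRED code; the function names, the list-of-lists PSSM format, and filling off-chain score slots with zeros are assumptions.

```python
import math

WINDOW = 15   # window size from the notes
N_AA = 20     # PSSM columns per position


def logistic(x: float) -> float:
    """Scale a raw PSSM score to the [0, 1] range: 1/[1+exp(-x)]."""
    return 1.0 / (1.0 + math.exp(-x))


def encode_window(pssm, center):
    """Return the (20 + 1) x 15 = 315 inputs for the residue at `center`.

    pssm: one row of 20 scores per residue (assumed format).
    """
    half = WINDOW // 2
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            features.extend(logistic(score) for score in pssm[pos])
            features.append(0.0)           # extra unit: inside the chain
        else:
            features.extend([0.0] * N_AA)  # assumption: empty slots are zeros
            features.append(1.0)           # extra unit: past the N or C terminus
    return features


toy_pssm = [[0] * N_AA for _ in range(30)]   # made-up PSSM, all scores zero
vector = encode_window(toy_pssm, center=2)
print(len(vector))
```

The second network would take the three outputs of the first one plus the same extra unit for each of the 15 positions, which gives the (3 + 1) × 15 = 60 inputs listed above.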
Sample Psi-Pred Output
Conf: Confidence (0=low, 9=high) ---very important!!!!
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
AA: Target sequence
# PSIPRED HFORMAT (PSIPRED V2.3 by David Jones)

Conf: 966899999997542002357777557999999716898188034435788873356776
Pred: CCHHHHHHHHHHHHHHHCCCCCCCHHHHHHHHHHHCCCCCEEECCCCEEEEEEECCCCCC
  AA: MMWEQFKKEKLRGYLEAKNQRKVDFDIVELLDLINSFDDFVTLSSCSGRIAVVDLEKPGD
              10        20        30        40        50        60

Conf: 777179998337888888988751235636899718261220179868899999998557
Pred: CCCCEEEEEECCCCCHHHHHHHHHCCCCCEEEEECCCEEEEECCCHHHHHHHHHHHHHCC
  AA: KASSLFLGKWHEGVEVSEVAEAALRSRKVAWLIQYPPIIHVACRNIGAAKLLMNAANTAG
              70        80        90       100       110       120

Conf: 200242314703799714651435541487355188999999999999999889999999
Pred: CCCCCCEECCCEEEEEECCCEEEEEECCCCCEEECHHHHHHHHHHHHHHHHHHHHHHHHH
  AA: FRRSGVISLSNYVVEIASLERIELPVAEKGLMLVDDAYLSYVVRWANEKLLKGKEKLGRL
             130       140       150       160       170       180
***Compare the prediction for residues 9 and 17***
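To make the comparison concrete, the first block of the output above can be indexed by residue number. A minimal sketch; the helper names and the cutoff of 3 for "low confidence" are assumptions, not part of PSIPRED.

```python
# First block of the sample output above.
CONF = "966899999997542002357777557999999716898188034435788873356776"
PRED = "CCHHHHHHHHHHHHHHHCCCCCCCHHHHHHHHHHHCCCCCEEECCCCEEEEEEECCCCCC"
AA = "MMWEQFKKEKLRGYLEAKNQRKVDFDIVELLDLINSFDDFVTLSSCSGRIAVVDLEKPGD"


def residue_call(number: int):
    """Return (amino acid, predicted state, confidence) for a 1-based residue."""
    i = number - 1
    return AA[i], PRED[i], int(CONF[i])


def low_confidence(cutoff: int = 3):
    """1-based residue numbers whose confidence is at or below the cutoff."""
    return [i + 1 for i, c in enumerate(CONF) if int(c) <= cutoff]


print(residue_call(9))
print(residue_call(17))
print(low_confidence())
```

Residue 9 sits in a high-confidence stretch and residue 17 in a low-confidence one, which is why the Conf line deserves as much attention as the Pred line.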
Sample Psi-Pred Output-II
Again, voting rules methods tend to be best
2SOD ATKAVCVLKGDGPVQGTIHFEAKGDTVVVTGSITGLTEGDHGFHVHQFGDNTQGCTSAGP
BPS  CCCCCCCCCCCCCCCCEEHCCHHECEEEEEEEEEEEECCCCCCCCCCCCCCCCCCCCCCC
D_R  CCHEEEEECCCCCCCCEEEHHHCCCEEEEEEEEECECCCCCCEEEECCCCCCCCCCCCCC
DSC  CCCEEEEEECCCCCEEEEEEEECCCEEEEEEEEEEEECCCCCEEEEECCCCCCCCCCCCC
GGR  CCCEEEEECCCCCCCEEEEEECCCCEEEEEEEEECCCCCCCCEEEEEECCCCCCCCCCCC
GOR  HHHCEEEECCCCCCCEEEEEECCCCEEEEEECEEEEEECCCCEEEEECCCCCCEEECCCC
H_K  CCCCEEEECCCCCCCCCEEECCCCCCEEEEECEEECCCCCCCEEEECCCCCCCCEEECCC
K_S  CCCCEEEEECCCCCCCCCEEECCCCCEEEECCCCCCCCCCCEEEEEEEECCCCCCCCCCC
JOI  CCCCEEEECCCCCCCCEEEEECCCCEEEEEEEEEEECCCCCCEEEEECCCCCCCCCCCCC
2SOD ---EEEEE------EEEEEEEEE--EEEEEEEEE-----EEEEEEEE-------------

2SOD HFNPLSKKHGGPKDEERHVGDLGNVTADKNGVAIVDIVDPLISLSGEYSIIGRTMVVHEK
BPS  CCCCCCCCCCCCCCCCCCCCCCECCCCCCHEECCCCCCCCCECCEECEEEEEEEEEEECC
D_R  CCCCCCCCCCCCCCCHHCECCCCCECCCCCCEEEEEEECCEEEECCCEEEEEEEEEEECC
DSC  CCCCCCCCCCCCCCEEEEECCCCCCCCCCCCEEEEEECCCCCCCCCCEEEEEEEEEEECC
GGR  CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCCEEEEECCCCCCCCCCEEEECEEEEEECC
GOR  CCCCCCCCCCCCCCHHEEECCCCCCCCCCCCEEEEEEECCEEECCCCEEEEEEEEEECCC
H_K  CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCEECCCCCCCCCCCCCCHHHHHHEECCC
K_S  CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCCEEEEEEEEEEEEECCCEEECCEEEEEEE
JOI  CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCEEEEEECCCCECCCCCEEEEEEEEEEECC
2SOD --------------------EEEEEE------EEEEEEE--------------EEEEE--
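One simple way to combine several predictions is a per-position majority vote. The notes do not say how the voting is actually done, so the rule below (plain majority, ties resolved to coil) is an assumption, and the three short strings are made up for illustration.

```python
from collections import Counter


def consensus(predictions):
    """Majority vote over equal-length prediction strings, column by column."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    result = []
    for column in zip(*predictions):
        counts = Counter(column)
        best = max(counts.values())
        winners = [state for state, n in counts.items() if n == best]
        # Assumption: a tie between states is resolved as coil.
        result.append(winners[0] if len(winners) == 1 else "C")
    return "".join(result)


toy_predictions = ["CCEEEEC", "CHEEECC", "CCEEEEC"]   # made-up example
print(consensus(toy_predictions))
```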
Prediction Accuracy (EVA)
[Figure: distribution of per-protein accuracy for PSIPRED, SSpro, PROF, PHDpsi, JPred2, and PHD. x-axis: percentage of correctly predicted residues per protein (30-100); y-axis: percentage of all 150 proteins (0-25)]
EVA: Automatic evaluation of prediction servers
How Far Can We Go?
* Currently ~76%
* Proteins with more than 100 homologues: 80%
* Assignment is ambiguous (5-15%). Recall DSSP vs STRIDE.
-- non-unique protein structures (dynamic), H-bond cutoff, etc.
* Different secondary structures between homologues (~12%).
* Non-locality. Secondary structure is influenced by long-range interactions.
-- Some segments can have multiple structure types (chameleon
sequences).
Solvent accessibility
* Conceptually similar problem to SS prediction: Buried vs. Exposed.
* Weighted Ensemble Solvent Accessibility predictor:
http://pipe.scs.fsu.edu/wesa.html
[Figure: residues labeled E (exposed) or B (buried): E E E E B B B B B B E E]
Why bother?
* To provide structural context for putative mutations that one wants to
characterize biochemically or biophysically.
Transmembrane Segment Prediction
* Again, conceptually similar problem to SS prediction: TM vs. Not.