Class 7: Protein Secondary Structure
Protein Structure
• Amino-acid chains can fold to form 3-dimensional structures
• Proteins are sequences that have a (more or less) stable 3-dimensional configuration
Why is Structure Important?
The structure a protein takes is crucial for its function
• Forms "pockets" that can recognize an enzyme substrate
• Situates the side chains of specific groups so that they co-locate to form areas with desired chemical/electrical properties
• Creates firm structures such as collagen, keratins, and fibroins
Determining Structure
• X-ray and NMR methods allow us to determine the structure of proteins and protein complexes
• These methods are expensive and difficult
  - It can take several work-months to process one protein
• A centralized database (PDB) contains all solved protein structures
  - XYZ coordinates of atoms within a specified precision
  - ~23,000 proteins have solved structures
Structure is Sequence Dependent
• Experiments show that for many proteins, the 3-dimensional structure is a function of the sequence
  - Force the protein to lose its structure by introducing agents that change the environment
  - After the sequences are put back in water, the original conformation/activity is restored
• However, for complex proteins, there are cellular processes that "help" in folding
Levels of structure
Secondary Structure
• α-helix
• β-strands

α-Helix
• Single protein chain
• Turn every 3.6 amino acids
• Shape maintained by intramolecular H-bonding between −C=O and H−N−
Hydrogen Bonds in α-Helices
Amphipathic α-Helix
• Hydrophilic residues on one side
• Hydrophobic residues on the other side
β-Strands
• Alternating 120° angles
• Often form sheets

β-Strands form Sheets
• Parallel
• Anti-parallel
• These sheets are held together by hydrogen bonds across strands
• …which can form a β-barrel
  - Porin, a membrane transporter
Angular Coordinates
• Secondary structures force specific angles (φ, ψ) between residues

Ramachandran Plot
• We can relate the (φ, ψ) angles to types of structures
Define "secondary structure"
3D protein coordinates may be converted to a 1D secondary structure representation using DSSP or STRIDE.

DSSP
EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT
DSSP = Dictionary of Secondary Structure of Proteins
DSSP symbols
• H = α-helix: backbone angles ≈ (−50, −60) and an i → i+4 H-bonding pattern
• E = extended strand: backbone angles ≈ (−120, +120) with β-sheet H-bonds (parallel/anti-parallel are not distinguished)
• S = β-bridge (isolated backbone H-bonds)
• T = β-turn (specific sets of angles and one i → i+3 H-bond)
• G = 3-10 helix or turn (i → i+3 H-bonds)
• I = π-helix (i → i+5 H-bonds) (rare!)
• _ = unclassified, none of the above: a generic loop, or a β-strand with no regular H-bonding (often written as L)
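When scoring 3-state predictions against DSSP annotations, the 8 DSSP symbols are first collapsed to the three prediction classes. A minimal sketch in Python, assuming one common mapping convention (H/G/I → helix, E → strand, everything else → loop; conventions differ between papers):

```python
# Collapse 8-state DSSP symbols to the 3 prediction classes.
# The mapping below is one common convention; papers vary.
DSSP_TO_3STATE = {
    'H': 'H', 'G': 'H', 'I': 'H',   # helix types -> helix
    'E': 'E',                        # extended strand -> strand
    'S': 'L', 'T': 'L', '_': 'L',    # bridges/turns/loops -> loop
}

def collapse(dssp_string):
    """Map a DSSP annotation string to a 3-state (H/E/L) string."""
    return ''.join(DSSP_TO_3STATE.get(c, 'L') for c in dssp_string)

print(collapse("EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT"))
```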
Labeling Secondary Structure
• Using both hydrogen-bond patterns and angles, we can assign secondary structure labels from the XYZ coordinates of the amino acids
• These do not lead to an absolute definition of secondary structure
Prediction of Secondary Structure
Input:
• amino-acid sequence
Output:
• annotation sequence over three classes:
  - alpha
  - beta
  - other (sometimes called coil/turn)
Measure of success:
• percentage of residues that were correctly labeled
Accuracy of 3-state Predictions
True SS:    EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT
Prediction: EEEELLLLHHHHHHLLLLEEEEEHHHHHHHHHHHHHHHHHHLL
Q3-score = % of 3-state symbols that are correct
Measured on a "test set"
Test set = an independent set of cases (proteins) that were not used to train, or in any way derive, the method being tested.
Best methods:
• PHD (Burkhard Rost): 72-74% Q3
• Psi-pred (David T. Jones): 76-78% Q3
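Once the true annotation is collapsed to the same 3-state alphabet (see the mapping sketch above), Q3 is a one-line comparison. A minimal sketch:

```python
def q3_score(true_ss, predicted_ss):
    """Percentage of positions where the 3-state labels agree."""
    assert len(true_ss) == len(predicted_ss)
    correct = sum(t == p for t, p in zip(true_ss, predicted_ss))
    return 100.0 * correct / len(true_ss)
```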
What can you do with a secondary structure prediction?
(1) Find out if a homolog of unknown structure is missing
any of the SS (secondary structure) units, i.e. a helix or
a strand.
(2) Find out whether a helix or strand is
extended/shortened in the homolog.
(3) Model a large insertion or terminal domain
(4) Aid tertiary structure prediction
Statistical Methods
• From the PDB database, calculate the propensity for a given amino acid to adopt a certain SS type:

Pα,i = P(α | aa_i) / p(α) = p(α, aa_i) / ( p(α) · p(aa_i) )

Example:
#Ala = 2,000, #residues = 20,000, #helix = 4,000, #Ala in helix = 500
p(α, aa) = 500/20,000;  p(α) = 4,000/20,000;  p(aa) = 2,000/20,000
Pα = (500/20,000) / ( (4,000/20,000) · (2,000/20,000) ) = 500/400 = 1.25
Used in the Chou-Fasman algorithm (1974)
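A minimal sketch of the propensity computation from counts, using the toy numbers from the slide (function and argument names are illustrative):

```python
def propensity(n_aa_in_state, n_state, n_aa, n_total):
    """P = p(state, aa) / (p(state) * p(aa)), estimated from counts."""
    p_joint = n_aa_in_state / n_total
    p_state = n_state / n_total
    p_aa = n_aa / n_total
    return p_joint / (p_state * p_aa)

# Toy numbers from the slide: Ala in helices.
print(propensity(500, 4_000, 2_000, 20_000))  # 1.25
```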
Chou-Fasman: Initiation
• Identify regions where 4 out of 6 residues have propensity P(H) > 1.00 (the table values are scaled by 100)
• This forms an "alpha-helix nucleus"

Residue:  T   S   P   T   A    E    L    M    R   S   T   G
P(H):     69  77  57  69  142  151  121  145  98  77  69  57
Chou-Fasman: Propagation
• Extend the helix in both directions until a set of four residues has an average P(H) < 1.00

Residue:  T   S   P   T   A    E    L    M    R   S   T   G
P(H):     69  77  57  69  142  151  121  145  98  77  69  57
Chou-Fasman Prediction
• Predict as α-helix segment when:
  - E[Pα] > 1.03
  - E[Pα] > E[Pβ]
  - not including proline
• Predict as β-strand segment when:
  - E[Pβ] > 1.05
  - E[Pβ] > E[Pα]
• Others are labeled as turns/loops
(Various extensions appear in the literature)
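A much-simplified sketch of the initiation and propagation steps for helices, using the P(H) values from the slides (scaled by 100); the full algorithm also handles strands, overlapping regions, and the proline rule:

```python
def find_helices(p_h, nucleus_len=6, nucleus_hits=4, threshold=100):
    """Chou-Fasman-style helix assignment over a list of P(H) values
    (scaled by 100, as in the slides). Returns the set of helix positions."""
    helix = set()
    n = len(p_h)
    for i in range(n - nucleus_len + 1):
        window = p_h[i:i + nucleus_len]
        # Initiation: 4 of 6 residues with P(H) > 1.00 form a nucleus.
        if sum(v > threshold for v in window) < nucleus_hits:
            continue
        lo, hi = i, i + nucleus_len          # [lo, hi) is the current helix
        # Propagation: extend while 4 boundary residues average P(H) > 1.00.
        while hi < n and sum(p_h[hi - 3:hi + 1]) / 4 > threshold:
            hi += 1
        while lo > 0 and sum(p_h[lo - 1:lo + 3]) / 4 > threshold:
            lo -= 1
        helix.update(range(lo, hi))
    return helix

p_h = [69, 77, 57, 69, 142, 151, 121, 145, 98, 77, 69, 57]
print(sorted(find_helices(p_h)))  # positions of the A-E-L-M-R helix region
```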
• Achieved accuracy: around 50%
• Shortcoming of this method: it ignores the context of the sequence, predicting from single amino acids

• We would like to use the sequence context as an input to a classifier
• There are many ways to address this
• The most successful to date are based on neural networks
A Neuron

Artificial Neuron
[Diagram: inputs a_1 … a_k with weights W_1 … W_k feed a single unit that outputs f( b + Σ_i W_i a_i )]
• A neuron is a multiple-input → single-output unit
• W_i = weights assigned to inputs; b = internal "bias"
• f = output function (linear, sigmoid)
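A minimal sketch of a single artificial neuron with a sigmoid output function (all names and values are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    """Output f(b + sum_i W_i * a_i) for a single unit."""
    return sigmoid(bias + sum(w * a for w, a in zip(weights, inputs)))

print(neuron([0.5, 1.0, -0.2], [0.8, -0.4, 0.3], bias=0.1))
```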
Artificial Neural Network
[Diagram: input units a_1 … a_k feed hidden-layer units through weights W_1 … W_k, which in turn feed output units o_1 … o_m]
• Neurons in hidden layers compute "features" from the outputs of previous layers
• Output neurons can be interpreted as a classifier
Example: Fruit Classifier

         Shape    Texture  Weight  Color
Apple    ellipse  hard     heavy   red
Orange   round    soft     light   yellow
Qian-Sejnowski Architecture
[Diagram: a window of residues S_{i−w} … S_i … S_{i+w} is one-hot encoded as the input layer, which feeds a hidden layer and three output units o_α, o_β, o_c]

Input encoding:  i_{j,a} = 1{ S_{i+j} = a }
Hidden units:    h_k = σ( a_k + Σ_{j,a} i_{j,a} · w_{k,j,a} )
Output units:    o_s = σ( b_s + Σ_k u_{s,k} · h_k )
Sigmoid:         σ(x) = 1 / (1 + e^(−x))
Prediction:      s* = argmax_s o_s
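A minimal sketch of this architecture with numpy: one-hot window encoding, one hidden layer, and three outputs with prediction by argmax. The window size, hidden-layer size, and (untrained) random weights are illustrative:

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 amino acids
W = 6                               # half-window; input window is 2w+1 = 13

rng = np.random.default_rng(0)
n_in = (2 * W + 1) * len(ALPHABET)
W1 = rng.normal(scale=0.1, size=(40, n_in))   # hidden weights w_{k,j,a}
b1 = np.zeros(40)
W2 = rng.normal(scale=0.1, size=(3, 40))      # output weights u_{s,k}
b2 = np.zeros(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode_window(seq, i):
    """One-hot encoding i_{j,a} = 1{S_{i+j} = a}; positions outside
    the sequence stay all-zero."""
    x = np.zeros(n_in)
    for j in range(-W, W + 1):
        if 0 <= i + j < len(seq):
            x[(j + W) * len(ALPHABET) + ALPHABET.index(seq[i + j])] = 1.0
    return x

def predict(seq, i):
    h = sigmoid(b1 + W1 @ encode_window(seq, i))
    o = sigmoid(b2 + W2 @ h)
    return "HEL"[int(np.argmax(o))]   # s* = argmax_s o_s

print(predict("MTSPTAELMRSTG", 6))
```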
Neural Network Prediction
• A neural network defines a function from inputs to outputs
• Inputs can be discrete or continuous valued
• In this case, the network defines a function from a window of size 2w+1 around a residue to a secondary structure label for it
• The structure element is determined by max(o_α, o_β, o_c)
Training Neural Networks
• By modifying the network weights, we change the function
• Training is performed by:
  - defining an error score for training pairs <input, output>
  - performing gradient-descent minimization of the error score
• The back-propagation algorithm allows the gradient to be computed efficiently
• We have to be careful not to overfit the training data
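A minimal sketch of gradient-descent training for the single neuron above, using a squared-error score; the learning rate and toy data are illustrative, and full back-propagation applies the same chain rule layer by layer:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy training pairs <input, output> for one neuron with two inputs.
data = [([0.0, 1.0], 1.0), ([1.0, 0.0], 0.0)]
w, b, lr = [0.0, 0.0], 0.0, 1.0

for step in range(1000):
    for a, target in data:
        out = sigmoid(b + sum(wi * ai for wi, ai in zip(w, a)))
        # d(error)/d(net) for error = (out - target)^2 / 2:
        delta = (out - target) * out * (1 - out)
        w = [wi - lr * delta * ai for wi, ai in zip(w, a)]
        b -= lr * delta

print(w, b)  # weights move to favor the second input
```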
Smoothing Outputs
• The Qian-Sejnowski network assigns each residue a secondary structure by taking max(o_α, o_β, o_c)
• Some sequences of secondary structure labels are impossible
• To smooth the output of the network, another layer is applied on top of the three output units for each residue:
  - a neural network, or
  - a Markov model
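As a stand-in for that second layer, a simple majority filter over a sliding window illustrates the smoothing idea (the window size is illustrative; PHD-style methods actually train a second network on the first network's three outputs):

```python
from collections import Counter

def smooth(labels, w=2):
    """Replace each label by the majority vote in a window of 2w+1."""
    out = []
    for i in range(len(labels)):
        window = labels[max(0, i - w): i + w + 1]
        out.append(Counter(window).most_common(1)[0][0])
    return ''.join(out)

print(smooth("HHHEHHHLLLEEEEL"))  # the isolated E inside the helix is removed
```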
Success Rate
• Variants of the neural network architecture and other methods achieved an accuracy of about 65% on unseen proteins
  - depending on the exact choice of training/test sets
Breaking the 70% Threshold
An innovation that made a crucial difference uses evolutionary information to improve prediction.
Key idea:
• Structure is preserved more than sequence
• Surviving mutations are not random
• Suppose we find homologues (with the same structure) of the query sequence
• The types of replacements at position i during evolution provide us with information about the role of residue i in the secondary structure
Nearest Neighbor Approach
• Select a window around the target residue
• Perform local alignment to sequences with known structure
  - Choice of alignment weight matrix to match remote homologies
  - The alignment weight takes into account the secondary structure of the aligned sequence
• Use max(nα, nβ, nc) or max(sα, sβ, sc)
• Key: the scoring measure of evolutionary similarity
PHD Approach
A multi-step procedure:
• Perform a BLAST search to find local alignments
• Remove alignments that are "too close"
• Perform a multiple alignment of the sequences
• Construct a profile (PSSM) of amino-acid frequencies at each residue
• Use this profile as input to the neural network
• A second network performs "smoothing"
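A minimal sketch of the profile (PSSM) construction step: column-wise amino-acid frequencies over a toy multiple alignment (real profiles add sequence weighting, pseudocounts, and log-odds scaling):

```python
from collections import Counter

def profile(alignment):
    """Column-wise amino-acid frequencies for an aligned set of sequences."""
    ncols = len(alignment[0])
    cols = []
    for j in range(ncols):
        residues = [seq[j] for seq in alignment if seq[j] != '-']
        counts = Counter(residues)
        total = sum(counts.values())
        cols.append({aa: c / total for aa, c in counts.items()})
    return cols

toy = ["MTSPTA", "MTAPSA", "MSSP-A"]
for j, col in enumerate(profile(toy)):
    print(j, col)
```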
PHD Architecture
Psi-pred: same idea
(Step 1) Run PSI-BLAST → output a sequence profile.
(Step 2) A 15-residue sliding window (15 × 21 = 315 inputs) is multiplied by the hidden weights of the 1st neural net. Output is 3 weights (one for each state H, E, or L) per position.
(Step 3) 60 inputs are multiplied by the weights of the 2nd neural network and summed. Output is the final 3-state prediction.
Performs slightly better than PHD.
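The dimension bookkeeping, as a sketch: the 15 × 21 split of the 315 stage-1 inputs follows the slide, while the 15 × 4 split of the 60 stage-2 inputs (3 state scores plus one assumed extra feature per position) is an assumption:

```python
# Stage 1: profile window -> per-position state scores.
window, profile_cols = 15, 21
print(window * profile_cols)   # 315 inputs to the 1st network
# Stage 2: window over stage-1 outputs -> final call.
per_position = 4               # 3 state scores + 1 assumed extra feature
print(window * per_position)   # 60 inputs to the 2nd network
```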
Other Classification Methods
• Neural networks were used as the classifier in the methods described
• We can apply the same idea with other classifiers, e.g. SVM
• Advantages:
  - effectively avoids overfitting
  - supplies a prediction confidence
SVM-based Approach
• Suggested by S. Hua and Z. Sun (2001)
• Multiple sequence alignment from the HSSP database (same as PHD)
• Sliding window of w residues → 21·w input dimensions
• Apply an SVM with an RBF kernel
• Multiclass problem:
  - Training: one-against-others classifiers (e.g. H/~H, E/~E, L/~L) and binary ones (e.g. H/E)
  - Combining them: maximum output score, a decision-tree method, or a jury decision method
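A minimal sketch of the one-against-others scheme with scikit-learn; the random features and labels are placeholders for the real profile windows, and the combination rule shown is maximum output score:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 21 * 11))      # toy window features (w = 11)
y = rng.choice(list("HEL"), size=300)    # toy 3-state labels

# One-against-others: one RBF-kernel SVM per state.
models = {s: SVC(kernel="rbf", probability=True).fit(X, y == s) for s in "HEL"}

def predict(x):
    """Pick the state whose one-vs-rest SVM gives the highest score."""
    scores = {s: m.predict_proba([x])[0][list(m.classes_).index(True)]
              for s, m in models.items()}
    return max(scores, key=scores.get)

print(predict(X[0]))
```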
Decision Tree
[Diagram: three two-level decision trees over the binary SVMs]
• Tree 1: test H/~H; if yes, predict H; otherwise test E/C and predict E or C
• Tree 2: test E/~E; if yes, predict E; otherwise test C/H and predict C or H
• Tree 3: test C/~C; if yes, predict C; otherwise test H/E and predict H or E
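A minimal sketch of one such cascade (Tree 1); the two classifier arguments stand in for the trained binary SVMs and are illustrative:

```python
def tree1_predict(x, h_vs_rest, e_vs_c):
    """Tree 1: first decide H/~H, then break the E/C tie."""
    if h_vs_rest(x):                    # binary H / ~H classifier
        return "H"
    return "E" if e_vs_c(x) else "C"    # binary E / C classifier

# Toy stand-in classifiers:
print(tree1_predict(0.7, lambda x: x > 0.5, lambda x: x > 0.9))  # 'H'
```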
Accuracy on the CB513 set

Classifier   Q3    QH    QE    QC    SOV
Max          72.9  74.8  58.6  79.0  75.4
Tree1        68.9  73.5  54.0  73.1  72.1
Tree2        68.2  72.0  61.0  69.0  71.4
Tree3        67.5  69.5  46.6  77.0  70.8
NN           72.0  74.7  57.7  77.4  75.0
Vote         70.7  73.0  74.7  76.6  73.2
Jury         73.5  75.2  60.3  79.5  76.2
State of the Art
• Both PHD and nearest neighbor achieve about 72%-74% accuracy
  - Both predicted well in CASP2 (1996)
• PSI-Pred is slightly better (around 76%)
• Recent trend: combining classification methods
  - Best predictions in CASP3 (1998)
• Failures:
  - Long-range effects: S-S bonds, parallel strands
  - Chemical patterns
  - Wrong predictions at the ends of helices/strands