Class 7: Protein Secondary Structure
Protein Structure
Amino-acid chains can fold to form 3-dimensional structures
Proteins are sequences that have a (more or less) stable 3-dimensional configuration
Why is Structure Important?
The structure a protein takes is crucial for its function
Forms "pockets" that can recognize an enzyme's substrate
Situates side chains of specific groups so that they co-locate, forming areas with desired chemical/electrical properties
Creates firm structures such as collagen, keratins, fibroins
Determining Structure
X-ray and NMR methods allow us to determine the structure of proteins and protein complexes
These methods are expensive and difficult
It can take several months of work to process one protein
A centralized database (PDB) contains all solved protein structures
XYZ coordinates of atoms, within a specified precision
~23,000 proteins have solved structures
Structure is Sequence Dependent
Experiments show that for many proteins, the 3-dimensional structure is a function of the sequence
Force the protein to lose its structure by introducing agents that change the environment
After the sequence is put back in water, the original conformation/activity is restored
However, for complex proteins, there are cellular processes that "help" in folding
Levels of structure
Secondary Structure
α-helix
β-strands
α-Helix
Single protein chain
Turn every 3.6 amino acids
Shape maintained by intramolecular H-bonding between -C=O and H-N-
Hydrogen Bonds in α-Helices
Amphipathic α-helix
Hydrophilic residues on one side
Hydrophobic residues on the other side
β-Strands
Alternating 120° angles
Often form sheets
β-Strands form Sheets
Parallel
Anti-parallel
These sheets are held together by hydrogen bonds across strands
…which can form a β-barrel
Porin – a membrane transporter
Angular Coordinates
Secondary structures force specific angles between residues
Ramachandran Plot
We can relate angles to types of structures
Define "secondary structure"
3D protein coordinates may be converted to a 1D secondary structure representation using DSSP or STRIDE
DSSP
EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT
DSSP = Dictionary of Protein Secondary Structure
DSSP symbols
H = helix: backbone angles (-50, -60) and H-bonding pattern (i -> i+4)
E = extended strand: backbone angles (-120, +120) with beta-sheet H-bonds (parallel/anti-parallel are not distinguished)
S = beta-bridge (isolated backbone H-bonds)
T = beta-turn (specific sets of angles and one i -> i+3 H-bond)
G = 3-10 helix or turn (i, i+3 H-bonds)
I = pi-helix (i, i+5 H-bonds) (rare!)
_ = unclassified, none of the above: a generic loop, or a beta-strand with no regular H-bonding (often written as L)
Labeling Secondary Structure
Using both hydrogen-bond patterns and angles, we can assign secondary structure labels from the XYZ coordinates of amino acids
These do not lead to an absolute definition of secondary structure
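As a concrete illustration of turning DSSP labels into the three prediction classes used later in this class, here is a minimal Python sketch. The mapping convention (H/G/I to helix, E to strand, everything else to coil) is a common choice assumed here, not something stated on the slides:

```python
# Minimal sketch: collapse a DSSP string into 3 prediction classes.
# Assumed convention: helical states (H, G, I) -> 'H',
# extended strand (E) -> 'E', everything else -> 'C' (coil/loop).
# Some schemes also map isolated bridges to 'E'; adjust as needed.
DSSP_TO_3STATE = {'H': 'H', 'G': 'H', 'I': 'H', 'E': 'E'}

def dssp_to_3state(dssp):
    return ''.join(DSSP_TO_3STATE.get(sym, 'C') for sym in dssp)

print(dssp_to_3state("EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT"))
```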
Prediction of Secondary Structure
Input: amino-acid sequence
Output: annotation sequence of three classes:
alpha
beta
other (sometimes called coil/turn)
Measure of success: percentage of residues that were correctly labeled
Accuracy of 3-state predictions
True SS:    EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT
Prediction: EEEELLLLHHHHHHLLLLEEEEEHHHHHHHHHHHHHHHHHHLL
Q3-score = % of 3-state symbols that are correct (see the sketch after this slide)
Measured on a "test set"
Test set = an independent set of cases (proteins) that were not used to train, or in any way derive, the method being tested.
Best methods:
PHD (Burkhard Rost) -- 72-74% Q3
Psi-pred (David T. Jones) -- 76-78% Q3
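To make the Q3 definition concrete, a minimal Python sketch. The true/predicted strings are the example from this slide; the true DSSP string is first collapsed to 3 states (H/G/I to H, E to E, everything else to L) following the convention sketched earlier, and the function itself is an illustration rather than any published scoring implementation:

```python
def q3_score(true_ss, pred_ss):
    """Percentage of residues whose 3-state label is predicted correctly."""
    assert len(true_ss) == len(pred_ss)
    return 100.0 * sum(t == p for t, p in zip(true_ss, pred_ss)) / len(true_ss)

# Example from this slide; collapse the true DSSP string to 3 states first.
true_3 = ''.join('H' if s in 'HGI' else s if s == 'E' else 'L'
                 for s in "EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT")
pred   = "EEEELLLLHHHHHHLLLLEEEEEHHHHHHHHHHHHHHHHHHLL"
print(round(q3_score(true_3, pred), 1))
```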
What can you do with a secondary structure prediction?
(1) Find out if a homolog of unknown structure is missing any of the SS (secondary structure) units, i.e. a helix or a strand.
(2) Find out whether a helix or strand is extended/shortened in the homolog.
(3) Model a large insertion or terminal domain.
(4) Aid tertiary structure prediction.
Statistical Methods
From the PDB database, calculate the propensity for a given amino acid to adopt a certain ss-type:

P_{α,i} = P(α | aa_i) / p(α) = p(α, aa_i) / ( p(α) · p(aa_i) )

Example:
#Ala = 2,000, #residues = 20,000, #helix = 4,000, #Ala in helix = 500
p(α, Ala) = 500/20,000,  p(α) = 4,000/20,000,  p(Ala) = 2,000/20,000
P = (500/20,000) / ( (4,000/20,000) · (2,000/20,000) ) = 500/400 = 1.25
Used in the Chou-Fasman algorithm (1974)
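The same calculation as a small Python sketch; the function name and argument names are just illustrative, and the numbers are the ones from the example above:

```python
def propensity(n_aa_in_ss, n_ss, n_aa, n_total):
    # P = p(ss, aa) / (p(ss) * p(aa)), all probabilities estimated from counts
    p_joint = n_aa_in_ss / n_total
    p_ss    = n_ss / n_total
    p_aa    = n_aa / n_total
    return p_joint / (p_ss * p_aa)

# Example from the slide: Ala in helices
print(propensity(n_aa_in_ss=500, n_ss=4_000, n_aa=2_000, n_total=20_000))  # -> 1.25
```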
Chou-Fasman: Initiation
Identify regions where 4 out of 6 residues have propensity P(H) > 1.00
This forms an "alpha-helix nucleus"

Residue:      T   S   P   T   A   E   L   M   R   S   T   G
P(H) (×100): 69  77  57  69 142 151 121 145  98  77  69  57

Here the window T A E L M R has 4 of its 6 residues with P(H) > 1.00, so it forms a nucleus.
Chou-Fasman: Propagation
Extend the helix in both directions until a set of four residues has an average P(H) < 1.00.

Residue:      T   S   P   T   A   E   L   M   R   S   T   G
P(H) (×100): 69  77  57  69 142 151 121 145  98  77  69  57
Chou-Fasman Prediction
Predict as α-helix segment with:
E[P_α] > 1.03
E[P_α] > E[P_β]
Not including proline
Predict as β-strand segment with:
E[P_β] > 1.05
E[P_β] > E[P_α]
Others are labeled as turns/loops.
(Various extensions appear in the literature)
Achieved accuracy: around 50%
Shortcoming of this method: it ignores the sequence context, predicting from individual amino acids alone
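A compact sketch of the nucleation-and-extension idea from the last three slides. The propensity values are the ones in the example table; the window and threshold logic follows the slides, but details such as the proline exclusion and segment merging are omitted, so this is an illustrative simplification rather than the published algorithm:

```python
# Illustrative simplification of Chou-Fasman helix prediction (not the full algorithm).
# P(H) propensities (x100) for the residues in the slide's example.
PH = {'T': 69, 'S': 77, 'P': 57, 'A': 142, 'E': 151, 'L': 121, 'M': 145, 'R': 98, 'G': 57}

def helix_nuclei(seq, window=6, needed=4):
    """Start indices of windows where at least `needed` of `window` residues have P(H) > 1.00."""
    return [i for i in range(len(seq) - window + 1)
            if sum(PH[aa] > 100 for aa in seq[i:i + window]) >= needed]

def extend(seq, start, end):
    """Extend a nucleus in both directions until a 4-residue edge window
    has average P(H) < 1.00."""
    def avg(i, j):
        return sum(PH[aa] for aa in seq[i:j]) / (j - i)
    while start > 0 and avg(start - 1, start + 3) >= 100:
        start -= 1
    while end < len(seq) and avg(end - 3, end + 1) >= 100:
        end += 1
    return start, end

seq = "TSPTAELMRSTG"
for i in helix_nuclei(seq):
    print("nucleus:", seq[i:i + 6], "-> extended segment:", extend(seq, i, i + 6))
```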
We would like to use the sequence context as an input to a classifier
There are many ways to address this
The most successful to date are based on neural networks
A Neuron
Artificial Neuron
[Figure: inputs a_1 … a_k, each with weight W_i, feeding one output unit that computes f( b + Σ_i W_i a_i )]
• A neuron is a multiple-input -> single-output unit
• W_i = weights assigned to inputs; b = internal "bias"
• f = output function (linear, sigmoid)
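A minimal sketch of the neuron equation above, using a sigmoid output function; all values are illustrative:

```python
import math

def neuron(inputs, weights, bias, f=lambda x: 1.0 / (1.0 + math.exp(-x))):
    """Single artificial neuron: output = f(b + sum_i W_i * a_i)."""
    return f(bias + sum(w * a for w, a in zip(weights, inputs)))

print(neuron([1.0, 0.5, -0.2], weights=[0.8, -0.3, 1.5], bias=0.1))
```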
Artificial Neural Network
[Figure: input units a_1 … a_k feed a hidden layer, which feeds output units o_1 … o_m]
• Neurons in hidden layers compute "features" from outputs of previous layers
• Output neurons can be interpreted as a classifier
Example: Fruit Classifier

          Shape     Texture   Weight   Color
Apple     ellipse   hard      heavy    red
Orange    round     soft      light    yellow
Qian-Sejnowski Architecture
[Figure: a window of residues S_{i-w} … S_i … S_{i+w} is encoded as input units, fed through a hidden layer to three output units o_α, o_β, o_coil]
Input encoding: I_{j,a} = 1{ s_{i+j} = a }
Hidden units:   h_k = l( a_k + Σ_{j,a} I_{j,a} w_{k,j,a} )
Output units:   o_s = l( b_s + Σ_k h_k u_{s,k} )
Sigmoid:        l(x) = 1 / (1 + e^{-x})
Prediction:     s* = argmax_s o_s
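A small NumPy sketch of the window encoding and forward pass described above; the window size, hidden-layer width and random weights are placeholders, not the trained Qian-Sejnowski network:

```python
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"          # 20 amino acids
W = 6                                  # half-window; full window = 2*W + 1 residues

def encode_window(seq, i, w=W):
    """One-hot encode the residues in a window of size 2w+1 around position i.
    Positions falling outside the sequence get an all-zero ("empty") column."""
    x = np.zeros((2 * w + 1, len(AAS)))
    for j in range(-w, w + 1):
        if 0 <= i + j < len(seq):
            x[j + w, AAS.index(seq[i + j])] = 1.0
    return x.ravel()                   # flatten to a (2w+1)*20 input vector

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_in, n_hidden = (2 * W + 1) * len(AAS), 40
W1, b1 = rng.normal(0, 0.1, (n_hidden, n_in)), np.zeros(n_hidden)
W2, b2 = rng.normal(0, 0.1, (3, n_hidden)), np.zeros(3)

x = encode_window("TSPTAELMRSTG", i=5)
h = sigmoid(b1 + W1 @ x)               # hidden layer
o = sigmoid(b2 + W2 @ h)               # outputs o_alpha, o_beta, o_coil
print("predicted class:", "HEC"[int(np.argmax(o))])
```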
Neural Network Prediction
A neural network defines a function from inputs to outputs
Inputs can be discrete or continuous valued
In this case, the network defines a function from a window of size 2w+1 around a residue to a secondary structure label for it
The structure element is determined by max(o_α, o_β, o_coil)
Training Neural Networks
By modifying the network weights, we change the function
Training is performed by:
Defining an error score for training pairs <input, output>
Performing gradient-descent minimization of the error score
The back-propagation algorithm allows the gradient to be computed efficiently
We have to be careful not to overfit the training data
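As a toy illustration of this training loop, gradient descent on a single sigmoid unit with a squared-error score; this is a stand-in for back-propagation through the full network above, and the data are made up:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy training pairs <input, output> (illustrative data, not real protein windows)
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

w, b, lr = np.zeros(2), 0.0, 0.5
for _ in range(1000):
    o = sigmoid(X @ w + b)                 # forward pass
    err = o - y                            # gradient of 0.5*(o - y)^2 w.r.t. o
    grad = err * o * (1 - o)               # chain rule through the sigmoid
    w -= lr * X.T @ grad                   # gradient-descent update of the weights
    b -= lr * grad.sum()                   # ... and of the bias
print(np.round(sigmoid(X @ w + b), 2))
```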
Smoothing Outputs
The Qian-Sejnowski network assigns each residue a secondary structure by taking max(o_α, o_β, o_coil)
Some sequences of secondary structure labels are impossible (e.g., a single isolated helix residue)
To smooth the output of the network, another layer is applied on top of the three output units for each residue:
Neural network
Markov model
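As an illustration of what such smoothing has to accomplish, a crude rule-based sketch that relabels implausibly short helix runs as coil; this is a stand-in for the idea, not the actual second network or Markov model:

```python
import re

def smooth(pred, min_helix=3, coil='L'):
    """Relabel helix runs shorter than min_helix residues as coil."""
    pattern = r'(?<!H)H{1,%d}(?!H)' % (min_helix - 1)
    return re.sub(pattern, lambda m: coil * len(m.group()), pred)

print(smooth("LLHLLEEHHLLHHHHHLL"))   # -> "LLLLLEELLLLHHHHHLL"
```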
Success Rate
Variants of the neural network architecture and other methods achieved an accuracy of about 65% on unseen proteins
Depending on the exact choice of training/test sets
Breaking the 70% Threshold
An innovation that made a crucial difference uses evolutionary information to improve prediction
Key idea:
Structure is preserved more than sequence
Surviving mutations are not random
Suppose we find homologues (same structure) of the query sequence
The types of replacements at position i during evolution provide information about the role of residue i in the secondary structure
Nearest Neighbor Approach
Select a window around the target residue
Perform local alignment to sequences with known structure
Choice of alignment weight matrix to match remote homologies
Alignment weight takes into account the secondary structure of the aligned sequence
Use max(n_a, n_b, n_c) or max(s_a, s_b, s_c) (counts or scores of each class among the aligned neighbors)
Key: Scoring measure of evolutionary similarity.
PHD Approach
Multi-step procedure:
Perform a BLAST search to find local alignments
Remove alignments that are "too close"
Perform a multiple alignment of the sequences
Construct a profile (PSSM) of amino-acid frequencies at each residue
Use this profile as input to the neural network
A second network performs "smoothing"
PHD Architecture
Psi-pred: same idea
(Step 1) Run PSI-BLAST --> output a sequence profile
(Step 2) A 15-residue sliding window = 315 input values, multiplied by the weights of the 1st neural net. Output is 3 values (one for each state H, E or L) per position.
(Step 3) 60 input values, multiplied by the weights of the 2nd neural network and summed. Output is the final 3-state prediction.
Performs slightly better than PHD
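A sketch of how a 15-residue profile window could be flattened into the first network's 315-value input. The 21-values-per-position layout (20 amino-acid scores plus an out-of-chain flag) and the padding scheme are assumptions for illustration, and PSI-BLAST itself is not run here:

```python
import numpy as np

def window_input(pssm, i, w=7):
    """Flatten a (2w+1)-residue window of a PSSM into one input vector.
    pssm: (sequence_length, 21) array of per-position profile values.
    Positions outside the sequence contribute a padding column."""
    n, k = pssm.shape
    pad = np.zeros(k)
    pad[-1] = 1.0                          # mark out-of-chain positions
    cols = [pssm[i + j] if 0 <= i + j < n else pad for j in range(-w, w + 1)]
    return np.concatenate(cols)            # length (2w+1) * 21 = 315 for w = 7

pssm = np.random.default_rng(1).random((30, 21))   # fake profile for a 30-residue chain
print(window_input(pssm, i=2).shape)               # -> (315,)
```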
Other Classification Methods
Neural networks were used as the classifier in the methods described so far
We can apply the same idea with other classifiers
E.g.: SVM
Advantages:
Effectively avoids overfitting
Supplies a prediction confidence
SVM based approach
Suggested by S. Hua and Z. Sun (2001)
Multiple sequence alignment from the HSSP database (same as PHD)
Sliding window of w residues --> 21·w input dimensions
Apply an SVM with an RBF kernel
Multiclass problem:
Training: one-against-others classifiers (e.g. H/~H, E/~E, L/~L) and binary classifiers (e.g. H/E), see the sketch below
Combining them into a 3-state prediction:
Maximum output score
Decision tree method
Jury decision method
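A minimal scikit-learn sketch of the one-against-others setup with an RBF kernel, as referenced above; the features, labels and parameters are random placeholders for illustration, not the Hua & Sun pipeline:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((200, 21 * 11))          # fake window features (e.g. w = 11 residues)
y = rng.choice(list("HEL"), size=200)   # fake 3-state labels

clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)
print(clf.predict(X[:5]))               # H/~H, E/~E, L/~L combined by maximum score
```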
Decision tree
[Tree of binary classifiers: the query is first tested with H/~H, then E/~E or C/~C, and remaining ambiguities are resolved by the pairwise classifiers H/E, E/C and C/H, yielding a final label of H, E or C]
Accuracy on the CB513 set

Classifier   Q3     QH     QE     QC     SOV
Max          72.9   74.8   58.6   79.0   75.4
Tree1        68.9   73.5   54.0   73.1   72.1
Tree2        68.2   72.0   61.0   69.0   71.4
Tree3        67.5   69.5   46.6   77.0   70.8
NN           72.0   74.7   57.7   77.4   75.0
Vote         70.7   73.0   74.7   76.6   73.2
Jury         73.5   75.2   60.3   79.5   76.2
State of the Art
Both PHD and the nearest-neighbor method get about 72%-74% accuracy
Both predicted well in CASP2 (1996)
PSI-Pred is slightly better (around 76%)
Recent trend: combining classification methods
Best predictions in CASP3 (1998)
Failures:
Long-range effects: S-S bonds, parallel strands
Chemical patterns
Wrong predictions at the ends of helices/strands