3-Secondary_Structures(ch12)


Predicting Structural Features
Chapter 12
Structural Features
• Phosphorylation sites
• Transmembrane helices
• Protein flexibility
Accuracy Measures Revisited
• Level:
– Individual residues
– Complete helix or strand
Residue-Level Measures
• Q3
– Percentage of residues predicted correctly
– If one state (e.g., Coil) is very common (e.g., 50%), blind guessing can give a large Q3!
• Matthews correlation coefficient
– C = (TP×TN − FN×FP) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
– Defined for each state
– More balanced than Q3; in range ±1
– Random prediction: C = 0
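As a concrete illustration (not part of the original slides), here is a minimal sketch of computing Q3 and a per-state Matthews coefficient from predicted and observed H/E/C strings; the example strings are hypothetical.

```python
from math import sqrt

def q3(pred, obs):
    """Fraction of residues whose predicted state (H/E/C) matches the observed state."""
    return sum(p == o for p, o in zip(pred, obs)) / len(obs)

def mcc(pred, obs, state):
    """Matthews correlation coefficient for one state, treating it as the positive class."""
    tp = sum(p == state and o == state for p, o in zip(pred, obs))
    tn = sum(p != state and o != state for p, o in zip(pred, obs))
    fp = sum(p == state and o != state for p, o in zip(pred, obs))
    fn = sum(p != state and o == state for p, o in zip(pred, obs))
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fn * fp) / denom if denom else 0.0

# Hypothetical example: a lazy predictor that always says Coil still gets Q3 = 0.5 here,
# but its per-state MCC is 0 (denominator guard), so the more balanced measure exposes it.
obs  = "HHHHHCCCCC"
pred = "CCCCCCCCCC"
print(q3(pred, obs), mcc(pred, obs, "H"))
```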
Structural Element-Level Measures
• SOV
– Based on the overlap of predicted “segments” of helix, strand, etc. with the observed segments of the same type
• The N-score
– Specialized for transmembrane protein predictors
– Should TMHMM2 be changed? Should your model?
Predicting Helices
• Residue propensities:
– A score for a given structure class for each residue, a
• P(H | a) is proportional to P(a | H) / P(a)
• Why? Bayes’ Rule is your friend!
– P(H | a) = P(a | H) P(H) / P(a)
– P(H) doesn’t depend on a, so
– P(H | a) is proportional to P(a | H) / P(a)
Can this be used to see how to
group helix states?
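A minimal sketch of how residue propensities P(a | H) / P(a) can be estimated from labeled training sequences; the toy sequences and labels below are hypothetical.

```python
from collections import Counter

def helix_propensity(seqs, states):
    """Estimate P(a | H) / P(a) for every amino acid a, from residues labeled H vs. all residues.
    By Bayes' rule this ratio is proportional to P(H | a), since P(H) does not depend on a."""
    all_counts = Counter()
    helix_counts = Counter()
    for seq, ss in zip(seqs, states):
        for a, s in zip(seq, ss):
            all_counts[a] += 1
            if s == "H":
                helix_counts[a] += 1
    n_all = sum(all_counts.values())
    n_helix = sum(helix_counts.values())
    return {a: (helix_counts[a] / n_helix) / (all_counts[a] / n_all)
            for a in all_counts if helix_counts[a]}

# Hypothetical toy data: sequences with per-residue H/E/C labels.
seqs   = ["ALAKEEL", "GPGVVA"]
states = ["HHHHHHC", "CCCEEE"]
print(helix_propensity(seqs, states))
```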
Identical short segments
rarely fold differently
• Local sequence is highly important to secondary structure.
• But this sequence occurs in two proteins and takes very different forms:
– KGVVPQLVK
• There is still significant information about structure in local sequence.
I-sites Sequence Database
• About 250 short segments (3-19 residues) that show
strong correlation between sequence and structure
• Example shows:
– phi and psi angles, log-odds matrix
– superimposed backbones
– representative structure
Nearest Neighbor Prediction Methods
• Predict secondary structure based on:
– Local alignments of the query sequence to a database of sequences of known structure
– Alignment score functions are often special-purpose, and may include helix/sheet/coil “propensity” information
– Homologous sequences are often included in the database
• Prediction based on weighted votes of nearest neighbors (usually only the central residue of the alignment is predicted)
• 73.5% accuracy (Q3)
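A sketch of the weighted-vote step, assuming the local alignments have already been scored; the (score, state) input format is an illustrative assumption, not the interface of any particular published method.

```python
from collections import defaultdict

def predict_central_residue(neighbors):
    """Weighted vote over nearest neighbors.

    `neighbors` is a list of (alignment_score, state) pairs, where `state` is the
    observed secondary structure (H/E/C) of the database residue aligned to the
    central residue of the query window.  Returns the state with the largest
    total score.  The input format is an assumption for illustration only.
    """
    votes = defaultdict(float)
    for score, state in neighbors:
        votes[state] += score
    return max(votes, key=votes.get)

# Hypothetical nearest neighbors for one query window:
print(predict_central_residue([(12.5, "H"), (9.1, "H"), (10.3, "C"), (4.0, "E")]))  # -> "H"
```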
A different application:
prediction of misfolding
• Diseases such as Alzheimer’s involve
protein misfolding.
• Usually, the misfolded region ends up
as Beta-strands.
• How could we use secondary structure
information to predict which proteins will
potentially misfold?
HP
Hidden Beta Propensity
• Key idea: Tertiary contacts (TC)
– TC is the number of contacts a residue has with others at least 4 residues away
– Alpha helices tend to be in regions of HIGH TC
– Beta strands tend to be in regions of LOW TC
• Look for query residues whose nearest neighbors are “strange” with respect to TC and alpha/beta state:
– Low TC regions with lots of alphas
– High TC regions with lots of betas
• Performance results?
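A sketch of how tertiary contacts could be counted from C-alpha coordinates; the 8 Å distance cutoff is an assumed convention, only the 4-residue minimum sequence separation comes from the slide.

```python
import math

def tertiary_contacts(coords, cutoff=8.0, min_separation=4):
    """Count, for each residue, contacts with residues at least `min_separation`
    positions away in sequence.  `coords` is a list of (x, y, z) C-alpha positions;
    the 8 A distance cutoff is an assumed convention, not specified in the slide."""
    n = len(coords)
    tc = [0] * n
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                tc[i] += 1
                tc[j] += 1
    return tc
```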
Neural Nets
• Each node computes a simple function of its inputs.
• The weighted sum of the inputs is added to a bias term and “squashed”:
– I = Σi wi xi
– o = σ(I + b), where σ is the squashing (e.g., logistic) function and b is the bias
• The output, o, is then propagated to nodes in the next layer.
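A minimal sketch of a single node, assuming a logistic squashing function; the input and weight values are hypothetical.

```python
import math

def node_output(inputs, weights, bias):
    """One neural-net node: weighted sum of the inputs plus a bias term,
    squashed through the logistic function into the range (0, 1)."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))

print(node_output([0.2, 0.8, 0.5], [1.5, -0.7, 0.3], bias=0.1))
```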
Training Neural Nets
• Back-propagation
• Optimizes the weights and
bias terms
• Minimize the error function
(difference between
predicted and observed)
– RMS
– Relative Entropy
• Iterative process
– Final weights shown for a
secondary structure NN
alpha helix output layer.
– Over-fitting can be reduced
by training for fewer
iterations
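A sketch of the idea for a single logistic node, assuming squared (RMS-style) error; with hidden layers the same error gradient is back-propagated layer by layer. The sample format, learning rate, and iteration count are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_single_node(samples, n_inputs, rate=0.5, iterations=1000):
    """Gradient descent on squared error for a single logistic node.
    `samples` is a list of (input_vector, target) pairs; all values are hypothetical."""
    w = [0.0] * n_inputs
    b = 0.0
    for _ in range(iterations):  # fewer iterations is a crude guard against over-fitting
        for x, target in samples:
            out = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            # derivative of 0.5 * (out - target)^2 through the sigmoid
            delta = (out - target) * out * (1.0 - out)
            w = [wi - rate * delta * xi for wi, xi in zip(w, x)]
            b -= rate * delta
    return w, b
```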
Adaptive Encoding and Weight Sharing
• Orthogonal encoding
• Each residue feeds
three hidden nodes
• The weights for all red
nodes are tied together
• Each group of three
nodes learns the same
“encoding” of the 20
amino acids
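A sketch of orthogonal (one-hot) encoding with a single shared 20-by-3 weight matrix applied at every window position, illustrating weight tying; the random weight values stand in for trained weights.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(residue):
    """Orthogonal encoding: a 20-long vector with a single 1 at the residue's position."""
    return [1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]

# One shared (tied) 20x3 weight matrix: every window position uses the SAME weights,
# so each group of three nodes learns the same 3-number encoding of the 20 amino acids.
# Random initial values stand in for trained weights (hypothetical).
shared_weights = [[random.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(20)]

def encode_window(window):
    """Map each residue in a sequence window to its learned 3-number code."""
    codes = []
    for residue in window:
        x = one_hot(residue)
        codes.append([sum(xi * wij for xi, wij in zip(x, col))
                      for col in zip(*shared_weights)])
    return codes

print(encode_window("MKTAY"))
```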
Engineering Intuition Into NNs
• Alpha helices have a
period of 3.6 residues
per turn
• A NN can be specially
designed to reflect that
• Using this, plus adaptive encoding:
– Q3 = 66%
– Adding homology: Q3 = 73%
HMMs and Transmembrane Proteins (again)
HMMTOP Architecture
• TMHs 17-25
residues
• Tails 1-15 residues
• Blue letters show
structural state
labels
TMHMM Architecture
• Helices are 5-25
residues
• Caps follow helices
• Cytoplasmic:
– Loop: 0-20 residues
– Globular: 1 state
• Extra-cellular:
– Long loop: 0-100
residues
– Globular: 3 states
Predicting Globular Proteins with “Hidden Neural Networks”
• YASPIN
– Neural net predicts seven classes (He, H, Hb, C, Ee, E, Eb) using a 15-residue window of PSSM input
– HMM “filters” this output
– Can you imagine how this is done?
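One plausible way such filtering could be done (an assumption, not necessarily YASPIN's actual implementation): run Viterbi over the per-residue network probabilities, treating them as emission scores, so the HMM grammar rules out impossible state sequences.

```python
import math

def viterbi_filter(nn_probs, states, transitions):
    """Smooth per-residue neural-net class probabilities with an HMM-style grammar.
    `nn_probs` is a list of dicts {state: probability} from the network, and
    `transitions` maps (prev_state, state) to a transition probability.
    This is one plausible filtering scheme; the real YASPIN details may differ."""
    n = len(nn_probs)
    score = [{s: math.log(nn_probs[0].get(s, 1e-9)) for s in states}]
    back = [{}]
    for i in range(1, n):
        score.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: score[i - 1][p]
                            + math.log(transitions.get((p, s), 1e-9)))
            score[i][s] = (score[i - 1][best_prev]
                           + math.log(transitions.get((best_prev, s), 1e-9))
                           + math.log(nn_probs[i].get(s, 1e-9)))
            back[i][s] = best_prev
    # trace back the highest-scoring state path
    path = [max(states, key=lambda s: score[-1][s])]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```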
Coiled-coil HMM: MARCOIL
Design lets you start and end in any
phase of the heptad repeat
Support Vector Machines: SVMs
• Classifiers
– Basic “machine” is a 2-class classifier
– Training Data
• A set of labeled vectors: {<x1, x2, …, xn, C>}
• Class label: C = +1 or C = −1
– Supervised learning (like neural nets)
• Learn from positive and negative examples
– Output
• Function predicting class of unlabeled vectors
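A minimal sketch of this interface, assuming scikit-learn is available; the feature vectors and labels are hypothetical.

```python
# A minimal 2-class SVM, assuming scikit-learn is installed; the data are hypothetical.
from sklearn.svm import SVC

# Training data: feature vectors <x1, ..., xn> with class labels C = +1 or -1.
X = [[0.0, 1.2], [0.3, 0.9], [2.1, 0.1], [1.8, -0.2]]
y = [1, 1, -1, -1]

clf = SVC(kernel="linear")  # the basic "machine" is a 2-class classifier
clf.fit(X, y)               # supervised learning from positive and negative examples

# Output: a function predicting the class of unlabeled vectors.
print(clf.predict([[0.2, 1.0], [2.0, 0.0]]))
```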
SVM Example
• Alpha helix predictor
– 15 residue window
– 21 numbers per residue
• PSI-BLAST PSSM: 20 numbers
• “spacer” flag indicating “off end” of protein
– 315 numbers total per window
– Training samples
• Non-helix samples: {<x1, x2, …, x315, -1>}
• Helix samples: {<x1, x2, …, x315, 1>}
– Training finds function of X that best separates the
non-helix from the helix samples
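A sketch of how one 15 × 21 = 315-number window vector might be assembled; the PSSM layout and spacer convention are assumptions for illustration.

```python
def window_vector(pssm, center, window=15):
    """Build the 315-number input for one residue: for each of the 15 window positions,
    20 PSSM scores plus one 'spacer' flag marking positions that fall off the protein.
    `pssm` is assumed to be a list (one entry per residue) of 20 PSI-BLAST scores."""
    half = window // 2
    vec = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            vec.extend(pssm[pos])   # 20 numbers from the PSSM row
            vec.append(0.0)         # spacer flag: inside the protein
        else:
            vec.extend([0.0] * 20)  # no profile information off the end
            vec.append(1.0)         # spacer flag: off the end of the protein
    return vec                      # 15 * 21 = 315 numbers

# Hypothetical: a helix training sample would then be (window_vector(pssm, i), +1).
```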
SVM vs NN as Classifiers
• Similarities
– Compute a function on their
inputs
– Trained to minimize error
• Differences
– NNs find any hyperplane that separates the two classes
– SVMs find the maximum-margin hyperplane
– NNs can be engineered by
designing their topology
– SVMs can be tailored by
designing the kernel
function
SVM Details
Separating hyperplanes:
Choose w, b to minimize ||w||, subject to y_i (w · x_i + b) ≥ 1 for every training sample i.
Dual form (support vectors):
Maximize Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)
s.t. α_i ≥ 0 and Σ_i α_i y_i = 0,
where w = Σ_i α_i y_i x_i.
Kernel trick: replace the dot products x_i · x_j by a non-linear kernel function K(x_i, x_j).
Dubious Statement
• “In marked contrast to NN, SVMs have
few explicit parameters to fit…”
– The vector of weights in the dual form (one α_i per training sample) is as long as the number of training samples
– But for the maximum-margin hyperplane most of these weights are zero; only the “support vectors” have nonzero weights.
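A quick way to check the sparsity claim, assuming scikit-learn and synthetic data:

```python
# Count support vectors on a hypothetical 2-class dataset.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)  # hypothetical data
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the support vectors carry nonzero dual weights; typically far fewer than 200.
print(len(clf.support_), "support vectors out of", len(X), "training samples")
```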