Complexity Reduction by Coarse-Graining Protein secondary
A Hidden Markov Model for Protein Secondary Structure Prediction
Wei-Mou Zheng
Institute of Theoretical Physics
Academia Sinica
PO Box 2735, Beijing 100080
[email protected]
Outline
• Protein structure
• A brief review of secondary structure
prediction
• Hidden Markov model: simple-minded
• Hidden Markov model: realistic
• Discussion
• References
Protein sequences are written in 20 letters (the 20 naturally occurring amino
acid residues): AVCDE FGHIW KLMNY PQRST
Residue classes: hydrophobic, charged (+/-), polar
Residues form a directed chain
[Figure: cis- and trans- configurations]
Rasmol ribbon diagram of GB1
Helix (pink), sheets (yellow) and coil (grey)
Hydrogen-bond network
3D structure → secondary structure written in three letters: H (helix), E (sheet), C (coil).
H : E : C = 34.9 : 21.8 : 43.3 (percent)
Bayes formula
Generally,
P(x, y) = P(x|y) P(y) = P(y|x) P(x),
so P(x|y) = P(y|x) P(x) / P(y).
Protein sequence A, {ai}, i=1,2,…,n
Secondary structure sequence S, {si}, i=1,2,…,n
Secondary structure prediction:
1D amino acid sequence → 1D secondary structure sequence
A problem more than 30 years old
Inference of S from A: P(S |A )
1. Simple Chou-Fasman approach
Chou-Fasman propensity of each amino acid for each conformational state
+ independence approximation
Parameter Training
Propensities q(a,s)
Counts (20x3) from a database: N(a, s)
sum over a → N(s),
sum over s → N(a),
sum over a and s → N
q(a,s) = [N(a,s) N] / [N(a) N(s)].
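The counting recipe above can be sketched in a few lines of Python. This is only an illustration: the function names `propensities` and `predict`, the toy training pairs, and the rule of picking the state with the largest propensity at each site (one way to read the independence approximation) are assumptions, not taken from the slides.

```python
from collections import Counter

def propensities(pairs):
    """q(a, s) = N(a, s) * N / (N(a) * N(s)) from a list of (residue, state) pairs."""
    n_as = Counter(pairs)                  # N(a, s)
    n_a = Counter(a for a, _ in pairs)     # N(a): sum over s
    n_s = Counter(s for _, s in pairs)     # N(s): sum over a
    n = len(pairs)                         # N: sum over a and s
    return {(a, s): n_as[(a, s)] * n / (n_a[a] * n_s[s]) for (a, s) in n_as}

def predict(seq, q, states="HEC"):
    """Assumed decision rule: score each site on its own and keep the best state."""
    return "".join(max(states, key=lambda s: q.get((a, s), 0.0)) for a in seq)

# Toy data (hypothetical), aligned residue by residue with its state.
pairs = list(zip("AAAVVVGG", "HHCEECCC"))
q = propensities(pairs)
prediction = predict("AVG", q)
```

A value q(a,s) > 1 means residue a is over-represented in state s relative to chance, and q(a,s) < 1 means it is under-represented.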
2. Garnier-Osguthorpe-Robson (GOR) window version
Conditional independence
Weight matrix (20 × 17) × 3 for P(W|s)
3. Improved GOR (20x20x16x3, to include pair correlation)
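A window score under the conditional-independence assumption can be sketched as follows: the probability of a 17-residue window W centred on a site, given the state s at the centre, is taken as a product of per-offset residue probabilities, so the log score is a sum over the window. The names `train` and `window_scores`, the pseudocount, and the handling of sequence ends are assumptions made for this sketch. It is not the published GOR parameterisation.

```python
import numpy as np

AA = "AVCDEFGHIWKLMNYPQRST"   # the 20 residue letters
STATES = "HEC"
HALF = 8                      # window of 17 = 2 * 8 + 1 positions

def train(seqs, structs, pseudo=1.0):
    """Return log P_k(a | s) with shape (3, 17, 20); pseudo is an assumed pseudocount."""
    counts = np.full((len(STATES), 2 * HALF + 1, len(AA)), pseudo)
    for seq, ss in zip(seqs, structs):
        for i, s in enumerate(ss):
            for k in range(-HALF, HALF + 1):
                j = i + k
                if 0 <= j < len(seq):          # skip offsets past the ends
                    counts[STATES.index(s), k + HALF, AA.index(seq[j])] += 1
    return np.log(counts / counts.sum(axis=2, keepdims=True))

def window_scores(seq, logp):
    """scores[i, s] = sum over offsets k of log P_k(a_{i+k} | s)."""
    scores = np.zeros((len(seq), len(STATES)))
    for i in range(len(seq)):
        for k in range(-HALF, HALF + 1):
            j = i + k
            if 0 <= j < len(seq):
                scores[i] += logp[:, k + HALF, AA.index(seq[j])]
    return scores

# Toy usage with a single hypothetical training pair.
logp = train(["AAAAVVVV"], ["HHHHEEEE"])
scores = window_scores("AAAA", logp)
```

The array `logp` has one 17 × 20 table per state, which matches the (20 × 17) × 3 weight matrix size quoted on the slide.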
Hidden Markov Model (HMM): simple-minded
Bayesian formula: P(S|A) = P(S,A)/P(A) ~ P(S,A) = P(A|S) P(S)
Simple version
Markov chain for the hidden sequence: s1 → s2 → s3 → …
Each si emits ai according to P(a|s): s1 → a1, s2 → a2, s3 → a3
Forward and backward functions
Initial conditions and recursion relations
Partition function
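The equations for these three headings did not survive in the transcript. One standard form, chosen to agree term by term with the pair-probability formula on this slide, is written out below. The initial distribution \pi(s) is a symbol introduced here, and the normalisation B_n(s) = 1 is an assumption.

```latex
\begin{aligned}
A_1(s) &= \pi(s)\,P(a_1 \mid s), &
A_{i+1}(s') &= \sum_{s} A_i(s)\, t_{ss'}\, P(a_{i+1} \mid s'), \\
B_n(s) &= 1, &
B_i(s) &= \sum_{s'} t_{ss'}\, P(a_{i+1} \mid s')\, B_{i+1}(s'), \\
Z &= \sum_{s} A_n(s) = \sum_{s} A_i(s)\, B_i(s) = P(A) &
&\text{for any } i .
\end{aligned}
```

With these definitions, summing A_i(s) t_{ss'} P(a_{i+1}|s') B_{i+1}(s') over s and s' gives Z, so the pair probabilities are normalised.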
Linear algorithm: Dynamic programming
Baum-Welch (sum) & Viterbi (max)
Prob(s_i = s, s_{i+1} = s’) = A_i(s) t_{ss’} P(a_{i+1}|s’) B_{i+1}(s’) / Z
Prob(s_{i:j})
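A minimal NumPy sketch of the sum (forward-backward) and max (Viterbi) recursions for the simple model follows. It assumes an initial distribution `pi`, a transition matrix `t[s, s']`, an emission table `emit[s, a]` = P(a|s), and observations given as residue indices. All four names and the toy numbers are hypothetical. No rescaling is applied, so it would underflow on long sequences.

```python
import numpy as np

def forward_backward(pi, t, emit, obs):
    """Forward table A, backward table B and partition function Z = P(A)."""
    n, m = len(obs), len(pi)
    A = np.zeros((n, m))
    B = np.ones((n, m))                       # B at the last site is 1
    A[0] = pi * emit[:, obs[0]]
    for i in range(1, n):
        A[i] = (A[i - 1] @ t) * emit[:, obs[i]]
    for i in range(n - 2, -1, -1):
        B[i] = t @ (emit[:, obs[i + 1]] * B[i + 1])
    return A, B, A[-1].sum()

def pair_posterior(A, B, Z, t, emit, obs, i):
    """Matrix of Prob(s_i = s, s_{i+1} = s') over all s, s'."""
    return A[i][:, None] * t * (emit[:, obs[i + 1]] * B[i + 1])[None, :] / Z

def viterbi(pi, t, emit, obs):
    """Most probable hidden path: the same recursion with max in place of sum."""
    n, m = len(obs), len(pi)
    V = np.zeros((n, m))
    back = np.zeros((n, m), dtype=int)
    V[0] = pi * emit[:, obs[0]]
    for i in range(1, n):
        cand = V[i - 1][:, None] * t          # cand[s, s']
        back[i] = cand.argmax(axis=0)
        V[i] = cand.max(axis=0) * emit[:, obs[i]]
    path = [int(V[-1].argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1]

# Toy two-state, two-letter example (numbers are made up).
pi = np.array([0.6, 0.4])
t = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.5, 0.5], [0.1, 0.9]])
obs = [0, 1, 1]
A, B, Z = forward_backward(pi, t, emit, obs)
best_path = viterbi(pi, t, emit, obs)
```

Both passes cost a number of operations proportional to the sequence length times the square of the number of states, which is the linear-time behaviour the slide refers to.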
Hidden Markov Model: Realistic
1) Strong correlation in conformational states: at least two
consecutive E and three consecutive H
refined conformational states (243 → 75)
2) Emission probabilities → improved window scores
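The minimum-length constraint in point 1 can be illustrated with a small check on a whole secondary structure string. This is a sketch of the constraint only: the helper name `allowed` is hypothetical, runs at the two ends of the string are held to the same minimum lengths, and it does not reproduce the author's construction of the 75 refined states, which the slides do not spell out.

```python
import re

def allowed(ss, min_h=3, min_e=2):
    """True if every run of H has length >= min_h and every run of E >= min_e."""
    for run in re.finditer(r"H+|E+", ss):
        text = run.group()
        need = min_h if text[0] == "H" else min_e
        if len(text) < need:
            return False
    return True

examples = {s: allowed(s) for s in ["CHHHCEEC", "CHHC", "CEC", "CCCC"]}
```

A predictor whose hidden states encode this constraint can only output strings that pass the check, which is presumably why no separate filtering step is needed after prediction.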
Proportion of accurately predicted sites ~ 70% (compared
with < 65% for prediction based on a single sequence)
• No post-prediction filtering
• Integrated (overall) estimation of refined conformation
states
• Measure of prediction confidence
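The "proportion of accurately predicted sites" is a per-residue, three-state accuracy. A minimal sketch, with a hypothetical function name `q3` and made-up strings:

```python
def q3(pred, true):
    """Fraction of sites where the predicted state equals the observed state."""
    if len(pred) != len(true) or not true:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(p == t for p, t in zip(pred, true)) / len(true)

accuracy = q3("HHHCCEEC", "HHHCCEEE")   # 7 of 8 sites agree
```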
Discussion
• HMM using refined conformational states and
window scores is efficient for protein secondary
structure prediction.
• A better scoring system should capture more of the
correlation between conformation and sequence.
• Combining homologous information will improve
the prediction accuracy.
• From secondary structure to 3D structure
(structure codes: discretized 3D conformational
states)
References
Lawrence R. Rabiner
A tutorial on hidden Markov models and selected applications
in speech recognition
Proceedings of the IEEE 77, 257–286 (1989)
Burkhard Rost
Protein Secondary Structure Prediction Continues to Rise
Journal of Structural Biology 134, 204–218 (2001)
The End
[Figure: the 20 amino acids P, I, V, G, A, L, C, S, N, T, D, Q, M, E, Y, F, W, K, H, R, labelled Hydrophobic and Polar]