Biology and computers

Protein structure prediction
May 26, 2011
HW #8 due today
Quiz #3 on Tuesday, May 31
Learning objectives:
Understand the biochemical basis of secondary structure prediction programs.
Become familiar with the databases that hold secondary structure information.
Understand neural networks and how they help to predict secondary structure.
Workshop: Predict secondary structure of p53.
Homework #9: Due June 2
What is secondary structure?
Three major types:
Alpha Helical Regions
Beta Strand Regions
Coils, Turns, Extended (anything else)
Can we predict the final structure?
http://en.wikipedia.org/wiki/Protein_folding
Some Prediction Methods
ab initio methods
  Based on physical properties of aa’s and bonding patterns
Statistics of amino acid distributions in known structures
  Chou-Fasman
Sequence similarity to sequences with known structure
  PSIPRED
Chou-Fasman
First widely used procedure
Output: helix, strand, or turn
Percent accuracy: 60-65%
PSI-BLAST-based secondary structure prediction (PSIPRED)
Three steps:
 1) Generation of a position-specific scoring matrix (PSSM)
 2) Prediction of initial secondary structure
 3) Filtering of the predicted structure
Conformational parameters for α-helical, β-strand, and turn amino acids
(from Chou and Fasman, 1978)

Propensities, ranked from highest to lowest within each column.
(The residue labels for the P(α) column are missing from this transcript.)

Rank   P(α)     AA    P(β)     AA    P(T)
  1    1.51     Val   1.70     Asn   1.56
  2    1.45     Ile   1.60     Gly   1.56
  3    1.42     Tyr   1.47     Pro   1.52
  4    1.21     Phe   1.38     Asp   1.46
  5    1.14     Trp   1.37     Ser   1.43
  6    1.13     Leu   1.30     Cys   1.19
  7    1.11     Cys   1.19     Tyr   1.14
  8    1.08     Thr   1.19     Lys   1.01
  9    1.08     Gln   1.10     Gln   0.98
 10    1.06     Met   1.05     Thr   0.96
 11    1.01     Arg   0.93     Trp   0.96
 12    1.00     Asn   0.89     Arg   0.95
 13    0.98     His   0.87     His   0.95
 14    0.83     Ala   0.83     Glu   0.74
 15    0.77     Gly   0.75     Ala   0.66
 16    0.70     Ser   0.75     Met   0.60
 17    0.69     Lys   0.74     Phe   0.60
 18    0.67     Pro   0.55     Leu   0.59
 19    0.57     Asp   0.54     Val   0.50
 20    0.57     Glu   0.37     Ile   0.47

Turn frequencies by position within a four-residue turn

AA     f(i)    f(i+1)  f(i+2)  f(i+3)
Ala    0.060   0.076   0.035   0.058
Arg    0.070   0.106   0.099   0.085
Asp    0.147   0.110   0.179   0.081
Asn    0.161   0.083   0.191   0.091
Cys    0.149   0.050   0.117   0.128
Glu    0.056   0.060   0.077   0.064
Gln    0.074   0.098   0.037   0.098
Gly    0.102   0.085   0.190   0.152
His    0.140   0.047   0.093   0.054
Ile    0.043   0.034   0.013   0.056
Leu    0.061   0.025   0.036   0.070
Lys    0.055   0.115   0.072   0.095
Met    0.068   0.082   0.014   0.055
Phe    0.059   0.041   0.065   0.065
Pro    0.102   0.301   0.034   0.068
Ser    0.120   0.139   0.125   0.106
Thr    0.086   0.108   0.065   0.079
Trp    0.077   0.013   0.064   0.167
Tyr    0.082   0.065   0.114   0.125
Val    0.062   0.048   0.028   0.053
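The Chou-Fasman parameters can be applied directly to a sequence. The sketch below is a simplified illustration, not the full Chou-Fasman rule set: it averages P(β) over a short stretch of residues and multiplies the four positional turn frequencies f(i), f(i+1), f(i+2), f(i+3) for a tetrapeptide. The function names and the example peptides are assumptions made for illustration; the numbers are the Chou and Fasman (1978) values from this slide.

```python
# P(beta) values from the Chou and Fasman (1978) slide.
P_BETA = {
    "Val": 1.70, "Ile": 1.60, "Tyr": 1.47, "Phe": 1.38, "Trp": 1.37,
    "Leu": 1.30, "Cys": 1.19, "Thr": 1.19, "Gln": 1.10, "Met": 1.05,
    "Arg": 0.93, "Asn": 0.89, "His": 0.87, "Ala": 0.83, "Gly": 0.75,
    "Ser": 0.75, "Lys": 0.74, "Pro": 0.55, "Asp": 0.54, "Glu": 0.37,
}

# Turn frequencies at positions i, i+1, i+2, i+3 of a four-residue turn.
TURN_F = {
    "Ala": (0.060, 0.076, 0.035, 0.058), "Arg": (0.070, 0.106, 0.099, 0.085),
    "Asp": (0.147, 0.110, 0.179, 0.081), "Asn": (0.161, 0.083, 0.191, 0.091),
    "Cys": (0.149, 0.050, 0.117, 0.128), "Glu": (0.056, 0.060, 0.077, 0.064),
    "Gln": (0.074, 0.098, 0.037, 0.098), "Gly": (0.102, 0.085, 0.190, 0.152),
    "His": (0.140, 0.047, 0.093, 0.054), "Ile": (0.043, 0.034, 0.013, 0.056),
    "Leu": (0.061, 0.025, 0.036, 0.070), "Lys": (0.055, 0.115, 0.072, 0.095),
    "Met": (0.068, 0.082, 0.014, 0.055), "Phe": (0.059, 0.041, 0.065, 0.065),
    "Pro": (0.102, 0.301, 0.034, 0.068), "Ser": (0.120, 0.139, 0.125, 0.106),
    "Thr": (0.086, 0.108, 0.065, 0.079), "Trp": (0.077, 0.013, 0.064, 0.167),
    "Tyr": (0.082, 0.065, 0.114, 0.125), "Val": (0.062, 0.048, 0.028, 0.053),
}


def mean_beta_propensity(residues):
    """Average P(beta) over a stretch of three-letter residue codes."""
    return sum(P_BETA[r] for r in residues) / len(residues)


def turn_probability(tetrapeptide):
    """Product f(i) * f(i+1) * f(i+2) * f(i+3) for four consecutive residues."""
    if len(tetrapeptide) != 4:
        raise ValueError("a turn is scored over exactly four residues")
    product = 1.0
    for position, residue in enumerate(tetrapeptide):
        product *= TURN_F[residue][position]
    return product


# Hypothetical example peptides, chosen only to exercise the functions.
print(round(mean_beta_propensity(["Val", "Ile", "Val", "Tyr"]), 4))
print(turn_probability(["Asn", "Pro", "Asp", "Gly"]))
```

A stretch rich in Val, Ile, and Tyr averages well above 1.0 for P(β), while Asn-Pro-Asp-Gly gives a comparatively large turn product because each residue is frequent at its position in known turns.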
PSIPRED
Uses multiple aligned sequences for prediction.
Uses a training set of folds with known structure.
Uses a two-stage neural network to predict structure from position-specific scoring matrices generated by PSI-BLAST (Jones, 1999).
First network converts a window of 15 aa’s into a raw score of h (helix), e (sheet), c (coil), or terminus.
Second network filters the first output. For example, an output of hhhhehhhh might be converted to hhhhhhhhh.
Can obtain a Q3 value of 70-78% (may be the highest achievable).
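The effect of the filtering stage can be illustrated with a much simpler rule-based smoother. This is a stand-in for intuition only: PSIPRED's second stage is itself a neural network, not the fixed rule below, and the function name `smooth_isolated` is hypothetical.

```python
def smooth_isolated(pred):
    """Overwrite a single residue state that is flanked on both sides
    by the same, different state (for example the lone 'e' in 'hhehh')."""
    states = list(pred)
    for i in range(1, len(pred) - 1):
        left, right = pred[i - 1], pred[i + 1]
        # Neighbours are read from the original string, so one fix
        # never cascades into the next position.
        if left == right and pred[i] != left:
            states[i] = left
    return "".join(states)


print(smooth_isolated("hhhhehhhh"))  # hhhhhhhhh
```

A single strand residue inside a helix is physically implausible, which is the kind of inconsistency the filtering network learns to remove.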
Neural networks
• Computer neural networks are based on simulation of adaptive learning in networks of real neurons.
• Neurons connect to each other via synaptic junctions, which are either stimulatory or inhibitory.
• Adaptive learning involves the formation or suppression of the right combinations of stimulatory and inhibitory synapses so that a set of inputs produces an appropriate output.
Neural Networks (cont. 1)
• The computer version of the neural network takes a set of inputs (the amino acids in the sequence), which are transmitted through a network of connections.
• At each layer, inputs are numerically weighted and the combined result is passed to the next layer.
• Ultimately a final output, a decision (helix, sheet, or coil), is produced.
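The weighted, layer-by-layer flow described on this slide can be sketched as a tiny feed-forward pass. Everything numeric here is a hypothetical toy: two inputs, two hidden units, and hand-picked weights chosen only to make the behaviour easy to follow. A real predictor has hundreds of inputs and learned weights.

```python
import math


def sigmoid(x):
    """Squash a weighted sum into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """For each unit: weight the inputs, add a bias, apply the sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def decide(inputs, hidden_w, hidden_b, output_w, output_b):
    """Pass inputs through a hidden layer and an output layer, then
    report the state (H, E, or C) whose output unit scores highest."""
    hidden = layer(inputs, hidden_w, hidden_b)
    scores = layer(hidden, output_w, output_b)
    best = max(range(len(scores)), key=scores.__getitem__)
    return "HEC"[best], scores


# Hypothetical hand-picked weights (not learned, not from PSIPRED).
HIDDEN_W = [[4.0, -4.0], [-4.0, 4.0]]
HIDDEN_B = [0.0, 0.0]
OUTPUT_W = [[3.0, -3.0], [-3.0, 3.0], [0.0, 0.0]]
OUTPUT_B = [0.0, 0.0, 0.0]

label, scores = decide([1.0, 0.0], HIDDEN_W, HIDDEN_B, OUTPUT_W, OUTPUT_B)
print(label, [round(s, 3) for s in scores])
```

Swapping the input to `[0.0, 1.0]` flips the decision from helix to strand, because the same weights now favour the second output unit.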
Neural Networks (cont. 2)
90% of the set of known structures was used to train the network.
The remaining 10% was used to evaluate the performance of the neural network after the training session.
Neural Networks (cont. 3)
• During the training phase, selected sets of proteins of known structure were scanned, and if the decisions were incorrect, the input weightings were adjusted by the software to produce the desired result.
• Training runs were repeated until the success rate was maximized.
• Careful selection of the training set is an important aspect of this technique. The set must contain as wide a range of different fold types as possible, without duplications of structural types that may bias the decisions.
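The "adjust the weights when the decision is wrong, then repeat" loop can be sketched with a single-unit perceptron. This is a deliberately simplified stand-in: the slides do not name the training algorithm, and multi-layer networks are normally trained with backpropagation rather than the perceptron rule shown here. The toy examples, learning rate, and function names are all assumptions.

```python
def predict(weights, bias, features):
    """Fire (1) if the weighted sum of the inputs is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if total > 0 else 0


def train(examples, n_features, rate=0.1, max_epochs=100):
    """Scan the examples repeatedly; nudge weights after each wrong decision."""
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(max_epochs):
        errors = 0
        for features, target in examples:
            error = target - predict(weights, bias, features)
            if error != 0:
                errors += 1
                weights = [w + rate * error * x
                           for w, x in zip(weights, features)]
                bias += rate * error
        if errors == 0:  # every training example is now decided correctly
            break
    return weights, bias


# Hypothetical toy data: the label simply follows the first feature.
EXAMPLES = [([1.0, 0.0], 1), ([1.0, 1.0], 1), ([0.0, 1.0], 0), ([0.0, 0.0], 0)]

weights, bias = train(EXAMPLES, n_features=2)
print([predict(weights, bias, features) for features, _ in EXAMPLES])
```

In practice the success rate would be measured on proteins held out from training, since a network can score perfectly on examples it has already seen.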
Neural Networks (cont. 4)
• An additional component of the PSIPRED procedure involves sequence alignment with similar proteins.
• The rationale is that some amino acid positions in a sequence contribute more to the final structure than others. (This has been demonstrated by systematic mutation experiments in which each consecutive position in a sequence is substituted by a spectrum of amino acids. Some positions are remarkably tolerant of substitution, while others have unique requirements.)
• To predict secondary structure accurately, one should place less weight on the tolerant positions, which clearly contribute little to the structure.
• One must also put more weight on the intolerant positions.
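[Figure: PSIPRED neural network architecture (Jones, 1999). Input: a PSSM in which each row specifies an aa position and provides info on tolerant or intolerant positions. First network input: 15 groups of 21 units (1 unit for each aa plus one specifying the end). Second network: filtering network. The three outputs are helix, strand, or coil.]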
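The "15 groups of 21 units" input layer can be sketched as a function that turns a window of PSSM rows into a flat list of 15 x 21 = 315 numbers. This is a minimal sketch under stated assumptions: the name `encode_window` is hypothetical, the toy profile is made up, and any scaling that PSIPRED applies to the raw PSSM scores is not shown.

```python
WINDOW = 15   # residues per window, as on the slide
UNITS = 21    # 20 amino acid scores + 1 unit flagging the end of the chain


def encode_window(pssm, center):
    """Return the 315 network inputs for the residue at index `center`.

    `pssm` is a list of rows, one per sequence position, each holding
    20 scores (one per amino acid).
    """
    half = WINDOW // 2
    inputs = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            inputs.extend(pssm[pos])    # 20 profile scores
            inputs.append(0.0)          # still inside the sequence
        else:
            inputs.extend([0.0] * 20)   # no profile information here
            inputs.append(1.0)          # window runs past a terminus
    return inputs


# Hypothetical 30-residue profile in which every score is 1.0.
toy_pssm = [[1.0] * 20 for _ in range(30)]
vector = encode_window(toy_pssm, center=0)
print(len(vector))  # 315
```

Centering the window on the first residue leaves seven positions hanging off the N-terminus, which is exactly the case the extra "end" unit exists to signal.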
Example of Output from
PSIPRED
PSIPRED PREDICTION RESULTS
Key
Conf: Confidence (0=low, 9=high)
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
AA: Target sequence
Conf: 923788850068899998538983213555268822788714786424388875156215
Pred: CCEEEEEEEHHHHHHHHHHCCCCCCHHHHHHCCCCCEEEEECCCCCCHHHHHHHCCCCCC
  AA: KDIQLLNVSYDPTRELYEQYNKAFSAHWKQETGDNVVIDQSHGSQGKQATSSVINGIEAD
              10        20        30        40        50        60
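A PSIPRED result like the one above is easier to read once the Pred line is collapsed into segments. The sketch below groups consecutive identical states into runs and reports the mean confidence of each run. The function names are hypothetical; the two strings are the Pred and Conf lines from the example output.

```python
from itertools import groupby

PRED = "CCEEEEEEEHHHHHHHHHHCCCCCCHHHHHHCCCCCEEEEECCCCCCHHHHHHHCCCCCC"
CONF = "923788850068899998538983213555268822788714786424388875156215"


def segments(pred):
    """Collapse a Pred string into (state, start, end) runs.
    Positions are 1-based and inclusive, matching the ruler in the output."""
    runs, start = [], 1
    for state, group in groupby(pred):
        length = len(list(group))
        runs.append((state, start, start + length - 1))
        start += length
    return runs


def mean_confidence(conf, start, end):
    """Average the per-residue confidence digits over one segment."""
    digits = [int(c) for c in conf[start - 1:end]]
    return sum(digits) / len(digits)


for state, start, end in segments(PRED):
    print(state, start, end, round(mean_confidence(CONF, start, end), 1))
```

Segments with a low mean confidence are the ones most worth checking against another prediction method.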
How to calculate Q3?
Q3 is the percentage of residues whose predicted state (H, E, or C) matches the actual state.
Sequence:           MEETHAPYRGVCNNM
Actual structure:   CCCCCHHHHHHEEEE
PSIPRED prediction: CCCCCHHHHHHEEEH
Q3 = 14/15 x 100 = 93%
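The Q3 calculation is a position-by-position comparison of two strings, which takes only a few lines of code. The function name `q3` is an assumption; the two strings are the actual and predicted structures from this slide.

```python
def q3(actual, predicted):
    """Percentage of positions where the predicted state equals the actual one."""
    if len(actual) != len(predicted):
        raise ValueError("structures must have the same length")
    matches = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * matches / len(actual)


print(round(q3("CCCCCHHHHHHEEEE", "CCCCCHHHHHHEEEH")))  # 93
```

Only the last residue differs (E predicted as H), so 14 of the 15 positions match.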
Recognizing motifs in proteins.
PROSITE is a database of protein families and
domains.
Most proteins can be grouped, on the basis of
similarities in their sequences, into a limited
number of families.
Proteins or protein domains belonging to a
particular family generally share functional
attributes and are derived from a common
ancestor.
PROSITE Database
Contains 1612 documentation entries.
Signatures are produced by scanning the
PROSITE database with your query. A
“signature” of a protein allows one to place a
protein within a specific function class based on
structure and/or function.
An example of a documentation entry in
PROSITE is:
http://ca.expasy.org/cgi-bin/nicedoc.pl?PDOC50020
Signatures are produced from
profiles and patterns.
Profile: “a table of position-specific amino acid weights and gap costs. These numbers (also referred to as scores) are used to calculate a similarity score for any alignment between a profile and a sequence, or parts of a profile and a sequence. An alignment with a similarity score higher than or equal to a given cut-off value constitutes a motif occurrence.”
Sequences in one profile and the PSSM associated with the profile

Aligned sequences (one sequence per row, 10 positions):
F K L L S H C L L V
F K A F G Q T M F Q
Y P I V G Q E L L G
F P V V K E A I L K
F K V L A A V I A D
L E F I S E C I I Q
F K L L G N V L V C

PSSM (rows = alignment positions, columns = amino acids):
Pos   A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y
  1 -18 -22 -35 -27  60 -30 -13   3 -26  14   3 -22 -30 -32 -18 -22 -10   0   9  34
  2 -10 -33   0  15 -30 -20 -12 -27  25 -28 -15  -6  24   5   9  -8 -10 -25 -25 -18
  3  -1 -18 -32 -25  12 -28 -25  21 -25  19  10 -24 -26 -25 -22 -16  -6  22 -18  -1
  4  -8 -18 -33 -26  14 -32 -25  25 -27  27  14 -27 -28 -26 -22 -21  -7  25 -19   1
  5   8 -22  -7  -9 -26  28 -16 -29  -6 -27 -17   1 -14  -9 -10  11  -5 -19 -25 -23
  6  -3 -26   6  23 -29 -14  14 -23   4 -20 -10   8 -10  24   0   2  -8 -26 -27 -12
  7   3  22 -17  -9 -15 -23 -22  -8 -15  -9  -9 -15 -22 -16 -18  -1   2   6 -34 -19
  8 -10 -24 -34 -24   4 -33 -22  33 -27  33  25 -24 -24 -17 -23 -24 -10  19 -20   0
  9  -2 -19 -31 -23  12 -27 -23  19 -26  26  12 -24 -26 -23 -22 -19  -7  16 -17   0
 10  -8  -7   0  -1 -29  -5 -10 -23   0 -21 -11  -4 -18   7  -4  -4 -11 -16 -28 -18
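Scoring a sequence against this PSSM means adding up, position by position, the score of whichever amino acid the sequence has there. The sketch below does this for ungapped segments only: the gap costs that a real profile carries are ignored, and the cut-off value, the example query, and the function names are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # column order of the PSSM

# One row per alignment position, copied from the profile on this slide.
PSSM = [
    [-18, -22, -35, -27, 60, -30, -13, 3, -26, 14, 3, -22, -30, -32, -18, -22, -10, 0, 9, 34],
    [-10, -33, 0, 15, -30, -20, -12, -27, 25, -28, -15, -6, 24, 5, 9, -8, -10, -25, -25, -18],
    [-1, -18, -32, -25, 12, -28, -25, 21, -25, 19, 10, -24, -26, -25, -22, -16, -6, 22, -18, -1],
    [-8, -18, -33, -26, 14, -32, -25, 25, -27, 27, 14, -27, -28, -26, -22, -21, -7, 25, -19, 1],
    [8, -22, -7, -9, -26, 28, -16, -29, -6, -27, -17, 1, -14, -9, -10, 11, -5, -19, -25, -23],
    [-3, -26, 6, 23, -29, -14, 14, -23, 4, -20, -10, 8, -10, 24, 0, 2, -8, -26, -27, -12],
    [3, 22, -17, -9, -15, -23, -22, -8, -15, -9, -9, -15, -22, -16, -18, -1, 2, 6, -34, -19],
    [-10, -24, -34, -24, 4, -33, -22, 33, -27, 33, 25, -24, -24, -17, -23, -24, -10, 19, -20, 0],
    [-2, -19, -31, -23, 12, -27, -23, 19, -26, 26, 12, -24, -26, -23, -22, -19, -7, 16, -17, 0],
    [-8, -7, 0, -1, -29, -5, -10, -23, 0, -21, -11, -4, -18, 7, -4, -4, -11, -16, -28, -18],
]


def score(segment, pssm=PSSM):
    """Sum the position-specific scores for a segment as long as the profile."""
    if len(segment) != len(pssm):
        raise ValueError("segment length must equal the number of PSSM rows")
    return sum(row[AMINO_ACIDS.index(aa)] for row, aa in zip(pssm, segment))


def scan(sequence, cutoff, pssm=PSSM):
    """Slide the profile along the sequence; return (start, score) for each
    ungapped window scoring at or above the cut-off (start is 0-based)."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        s = score(sequence[start:start + width], pssm)
        if s >= cutoff:
            hits.append((start, s))
    return hits


print(score("FKLLSHCLLV"))                   # first aligned sequence
print(scan("AAFKLLSHCLLVAA", cutoff=100))    # hypothetical query and cut-off
```

The first aligned sequence scores highly because most of its residues are the preferred ones at their positions, for example F at position 1 (score 60) and L at position 8 (score 33).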
How are the patterns constructed?
ALRDFATHDDVCGK..
SMTAEATHDSVACY..
ECDQAATHEAVTHR..
Sequences necessary for structure or function are aligned manually by experts in the field. Then a pattern is created.
A-T-H-[DE]-X-V-X(4)-{ED}
This pattern is translated as: Ala, Thr, His, [Asp or Glu], any,
Val, any, any, any, any, any but Glu or Asp
Example of a pattern in a
PROSITE record
ID ZINC_FINGER_C3HC4; PATTERN.
PA C-X-H-X-[LIVMFY]-C-X(2)-C-[LIVMYA]
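PROSITE patterns map closely onto regular expressions, so a pattern can be searched with any regex engine once it is translated. The sketch below handles only the syntax shown on these slides: single residues, [..] alternatives, {..} exclusions, X for any residue, and (n) or (n,m) repeats. Anchors and other PROSITE syntax are not handled, and the function name and test sequence are hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pieces = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Split each element into its core and an optional (n) or (n,m) repeat.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core.upper() == "X":
            core = "."                       # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        pieces.append(core)
    return "".join(pieces)


regex = prosite_to_regex("A-T-H-[DE]-X-V-X(4)-{ED}")
print(regex)  # ATH[DE].V.{4}[^ED]
print(prosite_to_regex("C-X-H-X-[LIVMFY]-C-X(2)-C-[LIVMYA]"))

# Hypothetical test sequence containing one occurrence of the first pattern.
hit = re.search(regex, "GGATHDQVKLMNPRR")
print(hit.start(), hit.group())
```

Square brackets carry over unchanged because PROSITE and regular expressions use them the same way; only X, the curly-brace exclusions, and the repeat counts need rewriting.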
Scanning the PROSITE database
“Scan a sequence against PROSITE patterns and profiles” lets the user search a query sequence against the patterns and profiles in the PROSITE database. It uses dynamic programming to determine optimal alignments. If an alignment produces a high score (a hit), then the hit is shown to the user.
http://www.expasy.ch/prosite/
If a “hit” is generated, the program gives an output
that shows the region of the query that contains
the pattern and a reference to the 3-D structure
database if available.
Example of output from Prosite
Scan
RPSBlast
Reverse PSI-BLAST, or rpsblast, is a program that searches one or more query protein sequences against a database of position-specific scoring matrices (PSSMs). The PSSMs are built from conserved protein sequences that have known functions or structures.
3D structure data
The largest 3D structure database is the Protein Data Bank (PDB).
It contains over 20,000 records.
Each record contains 3D coordinates for macromolecules.
80% of the records were obtained from X-ray diffraction studies, 20% from NMR.
Part of a record from the PDB
ATOM      1  N   ARG A  14      22.451  98.825  31.990  1.00 88.84           N
ATOM      2  CA  ARG A  14      21.713 100.102  31.828  1.00 90.39           C
ATOM      3  C   ARG A  14      22.583 101.018  30.979  1.00 89.86           C
ATOM      4  O   ARG A  14      22.105 101.989  30.391  1.00 89.82           O
ATOM      5  CB  ARG A  14      21.424 100.704  33.208  1.00 93.23           C
ATOM      6  CG  ARG A  14      20.465 101.880  33.215  1.00 95.72           C
ATOM      7  CD  ARG A  14      20.008 102.147  34.637  1.00 98.10           C
ATOM      8  NE  ARG A  14      18.999 103.196  34.718  1.00100.30           N
ATOM      9  CZ  ARG A  14      18.344 103.507  35.833  1.00100.29           C
ATOM     10  NH1 ARG A  14      18.580 102.835  36.952  1.00 99.51           N
ATOM     11  NH2 ARG A  14      17.441 104.479  35.827  1.00100.79           N
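ATOM records in the PDB format use fixed-width columns, so fields must be read by character position rather than by splitting on whitespace. The arginine NE atom in the record above shows why: its occupancy and temperature factor run together as 1.00100.30. The sketch below is a minimal illustration with hypothetical helper names. `format_atom` lays out a line in the standard ATOM columns so that the example does not depend on hand-typed spacing, and `parse_atom` slices the fields back out.

```python
def format_atom(serial, name, res_name, chain, res_seq,
                x, y, z, occupancy, b_factor, element):
    """Lay the fields out in the fixed-width ATOM columns.
    `name` is passed already padded to 4 characters, e.g. ' NE '."""
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{b_factor:6.2f}"
            f"          {element:>2}")


def parse_atom(line):
    """Slice one ATOM line by column position (0-based Python slices)."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


# Atom 8 from the record above: arginine NE.
line = format_atom(8, " NE ", "ARG", "A", 14,
                   18.999, 103.196, 34.718, 1.00, 100.30, "N")
print(line)
print(parse_atom(line))
```

Column slicing recovers 1.00 and 100.30 as separate values even though no space separates them in the line.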
Quiz #3 prep
BLAST
• Three steps
• Gapped BLAST
• Heuristic program
• Uses S-W algorithm for final scoring
CLUSTAL W
• Pairwise alignments
• Difference matrix
• Guide tree
• Importance of having highly similar sequences
Secondary structure prediction
• Chou-Fasman
• PSIPRED
• Good for secondary str
Protein analysis
• ProScan
• RPSBlast