Biology and computers
Protein structure prediction
June 27, 2003
Learning objectives: Understand the basis of
secondary structure prediction programs. Become
familiar with the databases that hold secondary
structure information. Understand neural networks
and how they help to predict secondary structure.
Workshop: Analysis of p53 with PSIPRED and
BLIMPS.
What is secondary structure?
Two major types:
Alpha Helical Regions
Beta Sheet Regions
Other classification schemes:
Turns
Transmembrane regions
Internal regions
External regions
Antigenic regions
Some Prediction Methods
ab initio methods
Based on physical properties of aa’s and bonding
patterns
Statistics of amino acid distributions in known
structures
Chou-Fasman
Position of amino acid and distribution
Garnier-Osguthorpe-Robson (GOR)
Neural networks
Chou-Fasman Rules (Mathews, Van Holde, Ahern)
Amino Acid   α-Helix   β-Sheet   Turn    Favors
Ala          1.29      0.90      0.78    α-Helix
Cys          1.11      0.74      0.80    α-Helix
Leu          1.30      1.02      0.59    α-Helix
Met          1.47      0.97      0.39    α-Helix
Glu          1.44      0.75      1.00    α-Helix
Gln          1.27      0.80      0.97    α-Helix
His          1.22      1.08      0.69    α-Helix
Lys          1.23      0.77      0.96    α-Helix
Val          0.91      1.49      0.47    β-Sheet
Ile          0.97      1.45      0.51    β-Sheet
Phe          1.07      1.32      0.58    β-Sheet
Tyr          0.72      1.25      1.05    β-Sheet
Trp          0.99      1.14      0.75    β-Sheet
Thr          0.82      1.21      1.03    β-Sheet
Gly          0.56      0.92      1.64    Turns
Ser          0.82      0.95      1.33    Turns
Asp          1.04      0.72      1.41    Turns
Asn          0.90      0.76      1.23    Turns
Pro          0.52      0.64      1.91    Turns
Arg          0.96      0.99      0.88
Chou-Fasman
First widely used procedure
If the propensity in a window of six residues (for a
helix) is above a certain threshold, helix is
chosen as the secondary structure.
If the propensity in a window of five residues (for a
beta strand) is above a certain threshold, beta
strand is chosen.
The segment is extended until the average
propensity in a four-residue window falls below a
threshold value.
Output: helix, strand, or turn.
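The nucleate-and-extend idea for helices can be sketched as follows. This is a simplified illustration, not the full published procedure: the helix propensities come from the table above, while the two thresholds (`nucleate_at`, `stop_below`) and the function names are assumptions chosen for the example.

```python
# Helix propensities from the Chou-Fasman table (one-letter codes).
HELIX = {"A": 1.29, "C": 1.11, "L": 1.30, "M": 1.47, "E": 1.44,
         "Q": 1.27, "H": 1.22, "K": 1.23, "V": 0.91, "I": 0.97,
         "F": 1.07, "Y": 0.72, "W": 0.99, "T": 0.82, "G": 0.56,
         "S": 0.82, "D": 1.04, "N": 0.90, "P": 0.52, "R": 0.96}


def mean_propensity(seq, start, end):
    """Average helix propensity of seq[start:end]."""
    window = seq[start:end]
    return sum(HELIX[aa] for aa in window) / len(window)


def predict_helices(seq, nucleus=6, extend=4,
                    nucleate_at=1.05, stop_below=1.00):
    """Return (start, end) helix segments, end exclusive.

    nucleate_at and stop_below are assumed thresholds for this sketch.
    """
    segments = []
    i = 0
    while i + nucleus <= len(seq):
        if mean_propensity(seq, i, i + nucleus) < nucleate_at:
            i += 1
            continue
        start, end = i, i + nucleus
        # Extend right while the 4-residue window ending at the new
        # residue keeps its average at or above the stop threshold.
        while end < len(seq) and \
                mean_propensity(seq, end - extend + 1, end + 1) >= stop_below:
            end += 1
        # Extend left with the window starting at the new residue.
        while start > 0 and \
                mean_propensity(seq, start - 1, start - 1 + extend) >= stop_below:
            start -= 1
        # Merge with the previous segment if the two overlap.
        if segments and start <= segments[-1][1]:
            segments[-1] = (segments[-1][0], max(end, segments[-1][1]))
        else:
            segments.append((start, end))
        i = end
    return segments


# Made-up test peptide: a helix-friendly stretch flanked by Gly/Pro/Ser.
print(predict_helices("PGPGSGEEAALKKLAEGPGSPG"))
```

A strand predictor would follow the same shape with a five-residue nucleus and the β-sheet column of the table.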
GOR (Garnier-Osguthorpe-Robson)
Position-dependent propensities for helix, sheet, or turn are
calculated for each amino acid. For each position j in the
sequence, eight residues on either side of aaj are considered.
It uses a PSSM.
A helix propensity table contains information about the propensity for
certain residues at 17 positions when the conformation of
residue j is helical. Each propensity table has 20 x
17 entries.
The predicted state of aaj is calculated as the sum of the
position-dependent propensities of all residues around aaj.
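The summation over the 17-residue window can be sketched like this. The table layout (20 amino acids by 17 offsets per state) follows the description above; the numeric entries are toy placeholders invented for the example, not published GOR parameters, and the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HALF_WINDOW = 8                      # eight residues on either side of j
STATES = ("helix", "sheet", "turn")


def empty_table():
    """A 20 x 17 table: one row per amino acid, one column per offset."""
    return {aa: [0.0] * (2 * HALF_WINDOW + 1) for aa in AMINO_ACIDS}


def predict_state(seq, j, tables):
    """Sum the position-dependent propensities around residue j."""
    scores = {}
    for state in STATES:
        total = 0.0
        for offset in range(-HALF_WINDOW, HALF_WINDOW + 1):
            k = j + offset
            if 0 <= k < len(seq):    # skip positions past the termini
                total += tables[state][seq[k]][offset + HALF_WINDOW]
        scores[state] = total
    best = max(STATES, key=lambda s: scores[s])
    return best, scores


# Toy placeholder values (NOT real GOR parameters): Ala pushes toward
# helix at every offset, Gly pushes toward turn.
tables = {state: empty_table() for state in STATES}
tables["helix"]["A"] = [1.0] * 17
tables["turn"]["G"] = [1.0] * 17

print(predict_state("AAAAAAAAAAAAAAAAA", 8, tables))
```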
Psi-BLAST Predict Secondary
Structure (PSIPRED)
Three stages:
1) Generation of sequence profile
2) Prediction of initial secondary
structure
3) Filtering of predicted structure
PSIPRED
Uses multiple aligned sequences for prediction.
Uses training set of folds with known structure.
Uses a two-stage neural network to predict
structure based on position specific scoring
matrices generated by PSI-BLAST (Jones, 1999)
First network converts a window of 15 amino acids into a raw
score of h (helix), e (sheet), c (coil), or terminus.
Second network filters the first output. For example, an
output of hhhhehhhh might be converted to hhhhhhhhh.
Can obtain a Q3 value (the percentage of residues assigned the correct one of three states) of 70-78% (may be the
highest achievable)
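Q3 is simple to compute once a predicted and an observed state string are available. A minimal sketch, with a hypothetical function name, using the unfiltered example string from above as the prediction:

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have the same length")
    if not observed:
        raise ValueError("state strings must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# One wrong residue out of nine.
print(round(q3("hhhhehhhh", "hhhhhhhhh"), 1))
```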
Neural networks
• Computer neural networks are based on simulation of adaptive
learning in networks of real neurons.
• Neurons connect to each other via synaptic junctions, which are either
stimulatory or inhibitory.
• Adaptive learning involves the formation or suppression of the right
combinations of stimulatory and inhibitory synapses so that a set
of inputs produces an appropriate output.
Neural Networks (cont. 1)
• The computer version of the neural network involves
identification of a set of inputs (amino acids in the
sequence), which transmit through a network of
connections.
• At each layer, inputs are numerically
weighted and the combined result is passed to the next
layer.
• Ultimately a final output (a decision: helix, sheet, or
coil) is produced.
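The weighted, layer-by-layer flow described above can be sketched as a small feed-forward pass. Everything here is illustrative: the one-hot encoding, the hidden layer size, and the random untrained weights are assumptions for the example, and a real predictor such as PSIPRED feeds profile scores rather than raw residues.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 15                       # residues seen at once
UNITS = len(AMINO_ACIDS) + 1      # 20 amino acids + 1 "past the end" unit
HIDDEN = 8                        # assumed hidden layer size


def encode_window(seq, center):
    """One-hot encode the window centred on `center` (15 groups of 21)."""
    half = WINDOW // 2
    vec = []
    for pos in range(center - half, center + half + 1):
        group = [0.0] * UNITS
        if 0 <= pos < len(seq):
            group[AMINO_ACIDS.index(seq[pos])] = 1.0
        else:
            group[-1] = 1.0       # position lies beyond a terminus
        vec.extend(group)
    return vec


def layer(inputs, weights, biases):
    """Weighted sum of the inputs for every unit in the next layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def random_params(seed=0):
    """Untrained weights, only to make the sketch runnable."""
    rng = random.Random(seed)
    def matrix(n_out, n_in):
        return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                for _ in range(n_out)]
    return {"w1": matrix(HIDDEN, WINDOW * UNITS), "b1": [0.0] * HIDDEN,
            "w2": matrix(3, HIDDEN), "b2": [0.0] * 3}


def forward(seq, center, params):
    """Return probabilities for (helix, sheet, coil) at one residue."""
    hidden = sigmoid(layer(encode_window(seq, center),
                           params["w1"], params["b1"]))
    return softmax(layer(hidden, params["w2"], params["b2"]))


probs = forward("KDIQLLNVSYDPTRELYEQY", 10, random_params())
print(dict(zip(("helix", "sheet", "coil"), probs)))
```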
Neural Networks (cont. 2)
90% of the set of known structures was used for training.
The remaining 10% was used to evaluate the performance of the neural
network during the training session.
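A 90/10 split of this kind might look like the sketch below. The protein identifiers are placeholders and the function name is hypothetical; the fixed seed only makes the shuffle reproducible.

```python
import random


def split_dataset(items, validation_fraction=0.10, seed=0):
    """Shuffle the items and hold out a fraction for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder identifiers standing in for proteins of known structure.
proteins = ["protein_%03d" % i for i in range(100)]
train_set, validation_set = split_dataset(proteins)
print(len(train_set), len(validation_set))
```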
Neural Networks (cont. 3)
• During the training phase, selected sets of proteins of known
structure are scanned, and if the decisions are incorrect, the
input weightings are adjusted by the software to produce the
desired result.
• Training runs are repeated until the success rate is maximized.
• Careful selection of the training set is an important aspect of
this technique. The set must contain as wide a range of
different fold types as possible without duplications of
structural types that may bias the decisions.
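The "adjust the weights when the decision is wrong" loop can be illustrated with a multi-class perceptron. This is a deliberately simpler stand-in for the back-propagation training a real network would use, and the three-feature toy examples are invented for the sketch.

```python
STATES = ("H", "E", "C")


def predict(weights, x):
    """Pick the state whose weighted sum of inputs is largest."""
    scores = {s: sum(w * v for w, v in zip(weights[s], x)) for s in STATES}
    return max(STATES, key=lambda s: scores[s])


def train(examples, n_features, max_epochs=100):
    """Repeat training runs until no example is misclassified."""
    weights = {s: [0.0] * n_features for s in STATES}
    for _ in range(max_epochs):
        errors = 0
        for x, target in examples:
            guess = predict(weights, x)
            if guess != target:
                errors += 1
                # Wrong decision: strengthen the correct state's weights
                # and weaken those of the state that was chosen.
                for i, v in enumerate(x):
                    weights[target][i] += v
                    weights[guess][i] -= v
        if errors == 0:
            break
    return weights


# Invented toy data: each state is signalled by one input feature.
examples = [([1.0, 0.0, 0.0], "H"),
            ([0.0, 1.0, 0.0], "E"),
            ([0.0, 0.0, 1.0], "C")]
weights = train(examples, n_features=3)
print([predict(weights, x) for x, _ in examples])
```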
Neural Networks (cont. 4)
• An additional component of the PSIPRED procedure involves
sequence alignment with similar proteins.
• The rationale is that some amino acid positions in a sequence
contribute more to the final structure than others. (This has been
demonstrated by systematic mutation experiments in which each
consecutive position in a sequence is substituted by a spectrum of
amino acids. Some positions are remarkably tolerant of
substitution, while others have unique requirements.)
• To predict secondary structure accurately, one should place little
weight on the tolerant positions, which clearly contribute little to
the structure, and strongly emphasize the intolerant positions.
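One simple way to see tolerant and intolerant positions in an alignment is the per-column Shannon entropy: columns that never vary score zero. The sketch below is an illustration only (PSIPRED itself works from PSI-BLAST scoring matrices), and the three short aligned segments are a toy alignment.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residues observed in one column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def conservation_profile(alignment):
    """Entropy for every column of an ungapped alignment."""
    return [column_entropy(col) for col in zip(*alignment)]


# Toy alignment of three equal-length segments.
alignment = ["ALRDFATHDDVCGK",
             "SMTAEATHDSVACY",
             "ECDQAATHEAVTHR"]
profile = conservation_profile(alignment)
intolerant = [i for i, h in enumerate(profile) if abs(h) < 1e-9]
print(intolerant)   # columns where every sequence has the same residue
```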
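[Figure: PSIPRED network architecture. Each row of the input profile specifies an amino acid position and provides information on tolerant or intolerant positions. The input layer has 15 groups of 21 units (1 unit for each amino acid plus one specifying the end). The three outputs of the filtering network are helix, strand, or coil.]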
Example of Output from
PSIPRED
PSIPRED PREDICTION RESULTS
Key
Conf: Confidence (0=low, 9=high)
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
AA: Target sequence
Conf: 923788850068899998538983213555268822788714786424388875156215
Pred: CCEEEEEEEHHHHHHHHHHCCCCCCHHHHHHCCCCCEEEEECCCCCCHHHHHHHCCCCCC
  AA: KDIQLLNVSYDPTRELYEQYNKAFSAHWKQETGDNVVIDQSHGSQGKQATSSVINGIEAD
              10        20        30        40        50        60
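Output in this layout is easy to post-process. The sketch below reads the three data lines of the example and lists the predicted helix and strand segments with 1-based residue numbers; the function names are hypothetical, and only the data lines are passed in because the key also contains a line beginning with `Conf:`.

```python
from itertools import groupby

REPORT = """\
Conf: 923788850068899998538983213555268822788714786424388875156215
Pred: CCEEEEEEEHHHHHHHHHHCCCCCCHHHHHHCCCCCEEEEECCCCCCHHHHHHHCCCCCC
  AA: KDIQLLNVSYDPTRELYEQYNKAFSAHWKQETGDNVVIDQSHGSQGKQATSSVINGIEAD
"""


def parse_report(text):
    """Collect the Conf, Pred and AA strings from the data lines."""
    fields = {"Conf": "", "Pred": "", "AA": ""}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key in fields:
            fields[key] += value.strip()
    return fields


def segments(pred, state):
    """Return (first, last) residue numbers, 1-based, for runs of `state`."""
    found = []
    pos = 0
    for symbol, run in groupby(pred):
        length = len(list(run))
        if symbol == state:
            found.append((pos + 1, pos + length))
        pos += length
    return found


fields = parse_report(REPORT)
print("helices:", segments(fields["Pred"], "H"))
print("strands:", segments(fields["Pred"], "E"))
```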
3D structure prediction: Threading
Threading is a mechanism to address the alignment of two
sequences that have <30% identity and are typically considered non-homologous. Essentially, one fits (or threads) the unknown
sequence onto the known structure and evaluates the resulting
structure’s fitness using environment- or knowledge-based
potentials.
Recognizing motifs in proteins.
PROSITE is a database of protein families and
domains.
Most proteins can be grouped, on the basis of
similarities in their sequences, into a limited
number of families.
Proteins or protein domains belonging to a
particular family generally share functional
attributes and are derived from a common
ancestor.
PROSITE Database
Contains 1087 different proteins and more than
1400 different patterns/motifs or signatures.
A “signature” allows one to assign a
protein to a specific family based on
structure and/or function.
An example of an entry in PROSITE is:
http://ca.expasy.org/cgi-bin/nicedoc.pl?PDOC50020
How are the profiles constructed
in the first place?
ALRDFATHDDVCGK..
SMTAEATHDSVACY..
ECDQAATHEAVTHR..
Sequences are aligned manually by
an expert in the field. Then a profile is
created.
A-T-H-[DE]-X-V-X(4)-{ED}
This pattern is translated as: Ala, Thr, His, [Asp or Glu], any,
Val, any, any, any, any, any but Glu or Asp
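The translation above maps directly onto a regular expression. Here is a minimal sketch of a converter that handles only the syntax shown here (single residues, `x`/`X`, `[...]`, `{...}` and `(n)` or `(n,m)` repeats); anchors and other PROSITE features are not covered. The function name and the test sequence are made up for illustration.

```python
import re

ELEMENT = re.compile(
    r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+(?:,\d+)?)\))?")


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a Python regex string."""
    parts = []
    for element in pattern.strip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        residue, repeat = match.groups()
        if residue in ("x", "X"):
            core = "."                             # any amino acid
        elif residue.startswith("{"):
            core = "[^" + residue[1:-1] + "]"      # any but these
        else:
            core = residue                         # literal or [choices]
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


regex = prosite_to_regex("A-T-H-[DE]-X-V-X(4)-{ED}")
print(regex)

# Made-up sequence containing one occurrence of the pattern.
hit = re.search(regex, "MKATHDLVAAAAKQ")
print(hit.start(), hit.group())
```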
Example of a PROSITE record
ID ZINC_FINGER_C3HC4; PATTERN.
PA C-x-H-x-[LIVMFY]-C-x(2)-C-[LIVMYA]
PROSITE Database Cont. 1
Families of proteins have a similar function:
Enzyme activity
Post-translational modification
Domains: Ca2+ binding domain
DNA/RNA associated protein: Zn finger
Transport proteins: albumin, transferrin
Structural proteins: fibronectin, collagen
Receptors
Peptide hormones
PROSITE Database Cont. 2
FindProfile is a program that searches the
PROSITE database. It uses dynamic
programming to determine optimal
alignments. If the alignment produces a
high score, the match is reported.
If a “hit” is obtained, the program gives an
output that shows the region of the query
that contains the pattern and a reference to
the 3-D structure database if available.
Example of output from
FindProfile
Other algorithms that search for
protein patterns.
BLIMPS: a program that uses a query sequence to search
the BLOCKS database (written by Bill Alford).
BLOCKS: a database of multiply aligned ungapped
segments corresponding to the most highly conserved
regions of proteins.
The blocks that comprise the BLOCKS database are made
automatically by searching for the most highly conserved
regions in groups of proteins documented in the PROSITE
database.
These blocks are then calibrated against the SWISS-PROT
database to determine how likely such a sequence is to occur by
chance.
Example of entry in BLOCKS database
ID   p99.1.2414; BLOCK
AC   BP02414A; distance from previous block=(29,215)
DE   PROTEIN ZINC-FINGER NUCLEAR FIN
BL   LCC; width=27; seqs=8; 99.5%=1080; strength=1292
RPT1_MOUSE|P15533 ( 101) EKLRLFCRKDMMVICWLCERSQEHRGH  62
Y129_HUMAN|Q14142 (  30) RVAELFCRRCRRCVCALCPVLGAHRGH 100
RFP_HUMAN|P14373  ( 101) EPLKLYCEEDQMPICVVCDRSREHRGH  49
RFP_MOUSE|Q62158  ( 110) EPLKLYCEQDQMPICVVCDRSREHRDH  51
RO52_HUMAN|P19474 (  97) ERLHLFCEKDGKALCWVCAQSRKHRDH  54
RO52_MOUSE|Q62191 ( 101) EKLHLFCEEDGQALCWVCAQSGKHRDH  52
TF1B_HUMAN|Q13263 ( 215) EPLVLFCESCDTLTCRDCQLNAHKDHQ  65
TF1B_MOUSE|Q62318 ( 216) EPLVLFCESCDTLTCRDCQLNAHKDHQ  65
Annotations:
DE line: family description
AC line: minimum and maximum distance from the previous block
strength: median of standardized scores for true positives
Number in parentheses: start position of the sequence segment
Final number on each sequence line: sequence weight (higher number
is more distant)
How does BLIMPS search the
BLOCKS database?
It transforms each block into a position specific
scoring matrix (PSSM).
Each PSSM column corresponds to a block
position and contains values based on frequency
of occurrence at that position.
A comparison is made between the query
sequence and the BLOCK by sliding the PSSM
over the query.
For every alignment each sequence position
receives a score.
This sliding window procedure is repeated for all
BLOCKS in the database.
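The sliding comparison can be sketched for a single block. This is a simplification under stated assumptions: the matrix holds plain column frequencies, whereas the real program also uses sequence weights and calibrated scores, and the query is a made-up sequence that embeds one block segment. The block itself is the zinc-finger example shown earlier.

```python
from collections import Counter

BLOCK = ["EKLRLFCRKDMMVICWLCERSQEHRGH",
         "RVAELFCRRCRRCVCALCPVLGAHRGH",
         "EPLKLYCEEDQMPICVVCDRSREHRGH",
         "EPLKLYCEQDQMPICVVCDRSREHRDH",
         "ERLHLFCEKDGKALCWVCAQSRKHRDH",
         "EKLHLFCEEDGQALCWVCAQSGKHRDH",
         "EPLVLFCESCDTLTCRDCQLNAHKDHQ",
         "EPLVLFCESCDTLTCRDCQLNAHKDHQ"]


def build_pssm(block):
    """One column per block position: residue -> observed frequency."""
    pssm = []
    for col in range(len(block[0])):
        counts = Counter(seq[col] for seq in block)
        pssm.append({aa: n / len(block) for aa, n in counts.items()})
    return pssm


def scan(query, pssm):
    """Slide the PSSM over the query; return (score, offset) per alignment."""
    width = len(pssm)
    results = []
    for offset in range(len(query) - width + 1):
        score = sum(pssm[j].get(query[offset + j], 0.0)
                    for j in range(width))
        results.append((score, offset))
    return results


pssm = build_pssm(BLOCK)
# Made-up query: one block segment with glycine flanks.
query = "GGGGG" + BLOCK[2] + "GGGGG"
best_score, best_offset = max(scan(query, pssm))
print(best_offset, round(best_score, 2))
```

Searching the whole database would repeat `scan` once per block and rank the best scores.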
Example of a pattern search using
BLIMPS
Note that any score less than 1000 may be due to chance. A score above 1000 is
better than 99.5% of the true negatives.
Helical Wheel
If you can predict an alpha helix, it is sometimes useful
to be able to tell whether the helix is amphipathic. This would indicate
whether one face of the helix faces the solvent or perhaps another
protein. Helical wheels have been particularly useful in predicting a
“super-secondary” structure known as coiled coils.
The helical wheel is based on the ideal alpha helix, placing an amino
acid every 100º around the circumference of the helix cylinder.
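The 100º spacing gives each residue an angle on the wheel, and summing hydrophobic and polar residues as vectors shows whether one face dominates. A minimal sketch follows; the hydrophobic residue set and the +1/-1 scoring are assumptions made for illustration, not a published hydrophobicity scale.

```python
import math

HYDROPHOBIC = set("AVILMFWC")   # assumed set for this sketch


def wheel_angles(seq, step=100):
    """Angle in degrees of each residue around the helix axis."""
    return [(aa, (i * step) % 360) for i, aa in enumerate(seq)]


def hydrophobic_moment(seq, step=100):
    """Length of the mean vector, scoring +1 hydrophobic and -1 otherwise.

    Values near 0 mean no preferred face; larger values suggest an
    amphipathic helix.
    """
    x = y = 0.0
    for aa, angle in wheel_angles(seq, step):
        weight = 1.0 if aa in HYDROPHOBIC else -1.0
        x += weight * math.cos(math.radians(angle))
        y += weight * math.sin(math.radians(angle))
    return math.hypot(x, y) / len(seq)


print(wheel_angles("LKELA"))
print(round(hydrophobic_moment("L" * 18), 6))
```

Eighteen residues make exactly five turns, so a uniform sequence of that length has no preferred face and its moment is essentially zero.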
Coiled-coil predictors
The alpha-helical coiled-coil structure has a strong signature
heptad pattern abcdefg, where a and d are typically nonpolar
(leucine rich) and e and g are often charged. This makes
scoring from a sequence scale plot relatively easy.
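A very rough version of heptad scoring tries all seven possible registers and asks how often the a and d positions hold a nonpolar residue. This is only a sketch: real coiled-coil predictors use position-specific statistics, and the nonpolar set and the synthetic test sequence are assumptions for the example.

```python
NONPOLAR = set("AVILMFW")    # assumed set for this sketch
HEPTAD = "abcdefg"


def heptad_frame_scores(seq):
    """For each of the 7 registers, the fraction of a/d positions
    occupied by a nonpolar residue."""
    scores = []
    for frame in range(7):
        core = [aa for i, aa in enumerate(seq)
                if HEPTAD[(i + frame) % 7] in "ad"]
        hits = sum(aa in NONPOLAR for aa in core)
        scores.append(hits / len(core) if core else 0.0)
    return scores


# Synthetic repeat with Leu at positions a and d.
scores = heptad_frame_scores("LEELKKK" * 4)
best_frame = scores.index(max(scores))
print(best_frame, scores)
```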
3D structure data
The largest 3D structure database is the
Protein Data Bank (PDB).
It contains over 15,000 records.
Each record contains 3D coordinates for
macromolecules.
80% of the records were obtained from X-ray
diffraction studies, 16% from NMR, and the rest
from other methods and theoretical calculations.
Part of a record from the PDB
ATOM      1  N   ARG A  14      22.451  98.825  31.990  1.00 88.84           N
ATOM      2  CA  ARG A  14      21.713 100.102  31.828  1.00 90.39           C
ATOM      3  C   ARG A  14      22.583 101.018  30.979  1.00 89.86           C
ATOM      4  O   ARG A  14      22.105 101.989  30.391  1.00 89.82           O
ATOM      5  CB  ARG A  14      21.424 100.704  33.208  1.00 93.23           C
ATOM      6  CG  ARG A  14      20.465 101.880  33.215  1.00 95.72           C
ATOM      7  CD  ARG A  14      20.008 102.147  34.637  1.00 98.10           C
ATOM      8  NE  ARG A  14      18.999 103.196  34.718  1.00100.30           N
ATOM      9  CZ  ARG A  14      18.344 103.507  35.833  1.00100.29           C
ATOM     10  NH1 ARG A  14      18.580 102.835  36.952  1.00 99.51           N
ATOM     11  NH2 ARG A  14      17.441 104.479  35.827  1.00100.79           N
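ATOM records are fixed-width, so they are best read by column position rather than by splitting on whitespace: in the NE record the occupancy and temperature factor run together as `1.00100.30`. Below is a minimal sketch of a column-based reader; the function and dictionary key names are hypothetical, and the sample line is rebuilt with explicit field widths so that the columns are unambiguous.

```python
def parse_atom_line(line):
    """Parse one PDB ATOM record using its fixed column layout."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20],
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


# The NE record from the example, rebuilt field by field.
SAMPLE = ("ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f"
          "          %2s") % (8, " NE", "ARG", "A", 14,
                              18.999, 103.196, 34.718, 1.00, 100.30, "N")

atom = parse_atom_line(SAMPLE)
print(atom["name"], atom["x"], atom["occupancy"], atom["b_factor"])
```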